
Hadoop Security: Essential Best Practices
Hadoop security presents unique challenges due to its distributed architecture and reliance on third-party services, making it particularly vulnerable to attacks if not properly secured.
In this blog, our Hadoop expert explains the five core principles of Hadoop security and shares best practices and recommended tools to reduce your attack surface and protect data from being accessed by bad actors.
Understanding the Challenges of Hadoop Security
Hadoop implementations are subject to all of the same security concerns that apply to modern-day distributed systems. These challenges differ from those faced by monolithic applications that access traditional databases.
Some examples:
- Data in Hadoop is distributed across multiple nodes. Without proper encryption of data at rest, a single compromised node can allow a bad actor to lift parts of files and reveal sensitive data or change values.
- Services in Hadoop typically run on separate nodes and use the network to communicate with one another. Without proper encryption of data in motion, the system is vulnerable to man-in-the-middle attacks. A bad actor could intercept sensitive data or inject malicious code.
What makes Hadoop uniquely complex is the number of third-party services in the ecosystem that are commonly used to meet specific needs. Many of the services used within the Hadoop ecosystem can also be run independently outside of Hadoop; therefore, they have their own standalone security implementations and default configurations.
It is important to have an overarching security strategy for a Hadoop cluster, so the individual services can be configured to leverage a single cohesive approach to common security issues. Otherwise, the overall system is vulnerable to data breaches and service disruptions.
Fortunately, there are some core configurations, best practices, and recommended tooling that can help drive a solid strategy that results in a hardened Hadoop implementation.
Core Principles of Hadoop Security
Early Hadoop design and development focused on solving problems with high availability and scalability of data storage and analytics; therefore, security was not a bedrock component. Although security wasn’t an early concern in the problem domain, it must be a foundational consideration for all production Hadoop implementations.
Here are five crucial areas to focus on in order to ensure coverage of basic information security (i.e. confidentiality, integrity, and availability):
- Authentication: This ensures only known and registered users and services can access the Hadoop cluster.
- Authorization: This requires authenticated users (and services) to be explicitly granted access to applications, data, and services that are managed by the Hadoop cluster.
- Encryption: This protects data stored (at rest) in the system, as well as data being passed around (in motion) the system from unauthorized access.
- Isolation: This prevents a single user, service, or group from consuming resources on the cluster in a way that would compromise performance for other users (and services) sharing the system or jeopardize the availability of the system as a whole.
- Remediation: This is a combination of tooling and processes that monitor the overall system and take action to identify and address any threats to security, availability, or stability.
Hadoop Security Best Practices
Security in Hadoop must be implemented in layers to create a robust strategy that protects confidentiality, integrity, and system availability. Each layer helps protect the whole system, but it should be viewed as a hurdle for bad actors rather than a wall. Therefore, it is key to stand up each hurdle in an effort to repel attacks and minimize the impact of a breach.
Here are some best practices to include when securing a production Hadoop cluster:
- Change Default Passwords and Communication Ports: Many of the services that run within the Hadoop ecosystem have default service principal and password combinations as well as default communication ports for accessing information managed by the service. These are readily accessible via the service documentation and source code repository and must be changed to prevent unauthorized access to the system.
- Use a Private Network: The Hadoop cluster and related services should all be on a private network that is not accessible from the internet (with a VPN implemented to enable remote private access if necessary). If this is not possible, then employ firewalls, reverse proxies, and secure gateway techniques to hide details of the implementation from potential bad actors.
- Use the Kerberos Authentication Protocol: Kerberos has been widely adopted for good reason and brings many benefits, including:
- Reciprocal/mutual authentication that requires both a client and a server to verify each other’s identity before allowing entry. This reduces the attack surface for man-in-the-middle attacks.
- Single sign-on (SSO) that gets a user access to all services without entering their principal and password multiple times. Once the credential is entered, an expiring ticket is used for subsequent requests. This reduces the chance of a hijacked session.
- Encryption: Use the Transparent Data Encryption (TDE) built into the Hadoop Distributed File System (HDFS) to encrypt data that is stored (at rest) in Hadoop, and configure SSL/TLS to encrypt traffic (in motion) that is communicated between clients and services within the system. A short sketch after this list illustrates both a Kerberos keytab login and the creation of an HDFS encryption zone.
- Manage Permissions Through Role-Based Access Controls (RBAC): This allows for more organized, policy-based access to data, applications, and services through categories of use cases, making it easier to audit which users and services have access to various resources or actions in the system. It also allows quick, confident maintenance of those privileges by adding and removing users from categories of access, rather than maintaining separate Access Control Lists per individual.
- Adhere to the Principle of Least Privilege (PoLP): Strive to give users and services the minimum access necessary to perform a job function. This limits each individual user's access, and it is equally important for limiting the attack surface should any account be compromised.
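To make the Kerberos and encryption recommendations above more concrete, here is a minimal sketch of how a Java client might log in to a kerberized cluster with a keytab and how an HDFS administrator could create a TDE encryption zone. The principal, keytab path, NameNode URI, directory, and key name are placeholders, and the exact HdfsAdmin signature can vary between Hadoop releases, so treat this as an illustration rather than a drop-in implementation.

```java
import java.net.URI;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.CreateEncryptionZoneFlag;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client that the cluster requires Kerberos
        // (in practice these values come from core-site.xml).
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");

        // Authenticate with a service keytab instead of an interactive password.
        // Principal and keytab path are placeholders for this sketch.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "etl-svc@EXAMPLE.COM", "/etc/security/keytabs/etl-svc.keytab");

        // Any HDFS access from here on carries the Kerberos credentials.
        URI nameNode = URI.create("hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(nameNode, conf);
        System.out.println("Home directory: " + fs.getHomeDirectory());

        // Admin call that marks a directory as a TDE encryption zone.
        // The key ("pii-key") must already exist in the configured Hadoop KMS.
        HdfsAdmin admin = new HdfsAdmin(nameNode, conf);
        admin.createEncryptionZone(new Path("/data/pii"), "pii-key",
                EnumSet.of(CreateEncryptionZoneFlag.PROVISION_TRASH));
    }
}
```

Note that the encryption key referenced by the zone must be created in the Hadoop KMS beforehand (for example, with the hadoop key create command); files written into the zone are then encrypted transparently.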
Hadoop Security Tools
There are a handful of tools commonly used to implement or manage aspects of Hadoop security. Kerberos and TDE, mentioned above, are fundamental. This section will highlight other tools that enable Hadoop admins to follow the principles and best practices outlined in the previous section.
YARN Capacity Scheduler
This tool, which is built into Hadoop, should be used to assign user and service quotas to create resource isolation. This is a popular technique for reducing the chances that a runaway process can consume too many resources on the cluster, which supports system availability and stability.
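The queue definitions behind those quotas are ordinary Capacity Scheduler properties. In a real cluster they live in capacity-scheduler.xml and are applied with yarn rmadmin -refreshQueues; the snippet below simply sets the same property names on a Hadoop Configuration object to keep the examples in one language, and the queue names and percentages are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;

public class CapacitySchedulerQuotas {
    public static void main(String[] args) {
        // These property names mirror entries in capacity-scheduler.xml;
        // the queue names and percentages below are illustrative only.
        Configuration conf = new Configuration(false);

        // Split the root queue between two tenant queues.
        conf.set("yarn.scheduler.capacity.root.queues", "analytics,ingest");
        conf.set("yarn.scheduler.capacity.root.analytics.capacity", "60");
        conf.set("yarn.scheduler.capacity.root.ingest.capacity", "40");

        // Cap elastic growth so one queue cannot starve the other.
        conf.set("yarn.scheduler.capacity.root.analytics.maximum-capacity", "80");
        conf.set("yarn.scheduler.capacity.root.ingest.maximum-capacity", "60");

        // Restrict who may submit applications to each queue.
        conf.set("yarn.scheduler.capacity.root.analytics.acl_submit_applications", "analysts");
        conf.set("yarn.scheduler.capacity.root.ingest.acl_submit_applications", "ingest-svc");

        // Print the resulting settings for review.
        conf.forEach(e -> System.out.println(e.getKey() + " = " + e.getValue()));
    }
}
```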
Apache Ranger and Apache Sentry
Apache Ranger and its deprecated predecessor, Apache Sentry, are used to control access to the various services and data sources in the Hadoop ecosystem. These access controls allow permissions to be defined at a granular level like table, row, and column access in databases like Hive or Impala, as well as file-level control in components like HDFS.
Apache Ranger also adds features above and beyond what was previously available in Apache Sentry. For example, it provides audit logging that captures all data access events. Going further, it supports sending that information in real-time for monitoring and analysis in tools like Elasticsearch, OpenSearch, and Grafana. This makes Apache Ranger a linchpin in threat detection that drives remediation of attack vectors.
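As an example of administering those policies programmatically, the hedged sketch below creates an HDFS path policy through the Ranger Admin public REST API: it grants the "analysts" group read access to a directory. The host, port, credentials, service name, group, and path are placeholders, and the policy JSON fields should be verified against the Ranger version in use.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class RangerPolicyExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ranger Admin endpoint and credentials (shown inline only for brevity).
        String rangerAdmin = "https://ranger.example.com:6182";
        String auth = Base64.getEncoder().encodeToString("admin:changeit".getBytes());

        // A minimal HDFS path policy: the "analysts" group may read /data/reports.
        String policy = """
            {
              "service": "cl1_hadoop",
              "name": "reports-read-only",
              "resources": { "path": { "values": ["/data/reports"], "isRecursive": true } },
              "policyItems": [ {
                "groups": ["analysts"],
                "accesses": [ { "type": "read",    "isAllowed": true },
                              { "type": "execute", "isAllowed": true } ]
              } ]
            }
            """;

        // POST the policy to Ranger's public policy endpoint.
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(rangerAdmin + "/service/public/v2/api/policy"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(policy))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```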
Apache Knox
Apache Knox, which is becoming more prominent in Hadoop implementations, assists with network-level security of a Hadoop deployment. Knox is a reverse proxy that serves as an application and API gateway. This makes it the single point of entry for managing external traffic in and out of the system. A tool like this is essential to securing a Hadoop cluster that requires connections from the internet.
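As a rough sketch of what client traffic looks like once Knox is the front door, the example below lists an HDFS directory over WebHDFS routed through the gateway. The gateway host, port, topology name, and credentials are placeholders; Knox commonly proxies WebHDFS under /gateway/<topology>/webhdfs/v1 and authenticates callers against a directory service such as LDAP before forwarding the request into the cluster.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class KnoxWebHdfsExample {
    public static void main(String[] args) throws Exception {
        // All external traffic enters through the Knox gateway, never the NameNode directly.
        // Host, port, and topology name ("default") are placeholders for this sketch.
        String listUrl =
            "https://knox.example.com:8443/gateway/default/webhdfs/v1/data?op=LISTSTATUS";

        // Knox authenticates the caller before proxying the request into the cluster.
        String auth = Base64.getEncoder().encodeToString("alice:secret".getBytes());

        HttpRequest request = HttpRequest.newBuilder(URI.create(listUrl))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // JSON FileStatuses listing for /data
    }
}
```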
Final Thoughts
Having the right tools and practices in place is the first step toward securing your Hadoop implementation. In today's threat climate, there is no shortage of malicious parties looking to steal sensitive data, so it's imperative to have a strong Hadoop security strategy in place to disincentivize and rebuff attacks.
Hadoop Support and Services
Perforce OpenLogic can provide 24-7 technical support for your Hadoop implementation or help you migrate to an open source stack equivalent to Cloudera's data platform.
Additional Resources
- Blog - Hadoop Monitoring: Tools, Metrics, and Best Practices
- Blog - Developing Your Big Data Strategy
- Blog - Weighing the Value of Apache Hadoop vs. Cloudera
- Datasheet - Hadoop Service Bundle
- Webinar - Is It Time to Open Source Your Big Data Management?
- Blog - Open Source Big Data Infrastructure: Key Technologies