What Is Apache HBase? HBase Features, Use Cases, and Alternatives
Apache HBase is part of the Hadoop ecosystem, which is widely used in the big data space because it can store and analyze high volumes of unstructured data. Originally prototyped in 2006, HBase has gained traction and has been a top-level Apache project since 2010. Keep reading to learn about HBase architecture, features, use cases, and alternatives.
What Is HBase?
Apache HBase is an open source distributed database built on top of the Hadoop Distributed File System (HDFS). HBase is written in Java and is a NoSQL column-oriented database capable of managing massive amounts of data: potentially billions of rows and millions of columns.
HBase was developed in 2006 by Powerset, a company later acquired by Microsoft, as part of a project to create a natural language search engine for the Internet. Its design comes from Bigtable: A Distributed Storage System for Structured Data, a paper published by Google that describes the API for a distributed storage system that can manage petabytes of structured data using a cluster of commodity servers. HBase was later contributed as open source and became a sub-project of Hadoop in 2008.
HBase Architecture
People typically associate the term “database” with relational databases (RDBMS). With that in mind, it’s best to think of HBase as a “data store”, since it does not have all the bells and whistles and guardrails that are standard in an RDBMS. In HBase, there are no defined column types, column constraints, action triggers, compound indexes, or native SQL query support.
HBase is built on top of Hadoop, which is geared toward batch processing, and HDFS is not suited to random disk access. In fact, HDFS cannot update a file in place; every update requires rewriting the entire file. These weaknesses are precisely the strengths of a traditional RDBMS: relational databases are optimized for fast random I/O. Unfortunately, relational databases struggle to handle large volumes of data due to indexing overhead, as well as the need to maintain performance while providing all the conveniences mentioned previously. An RDBMS is usually scaled vertically and requires specialized hardware and storage devices for optimal performance.
HBase was envisioned, architected, and developed to strike a balance between HDFS and an RDBMS, designed to overcome the drawbacks that existed in real-time data processing in Hadoop. It accomplishes this by focusing on the specific problems of real-time access while leveraging the strengths of some existing components of the Hadoop ecosystem to do the rest.
HBase Data Hierarchy
In HBase:
- a table is made up of one or more rows
- a row consists of one or more columns identified by a unique row key
- a column contains cells, which are timestamped versions of the value in that column
- columns are grouped into column families
HBase requires a predefined table schema that specifies the column families. However, there is flexibility in lower levels of the hierarchy, as new columns can be added to families at any time, allowing the schema to adapt to evolving application requirements.
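A minimal sketch of this hierarchy using the HBase 2.x Java client API (the `events` table and its `info` and `payload` column families are invented for illustration): the table and its families must be declared before data is written, while individual columns can appear in any Put.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Column families must be declared up front in the table schema...
            TableName name = TableName.valueOf("events");
            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("payload"))
                    .build());

            // ...but individual columns (qualifiers) are not: any Put can
            // introduce a brand-new column within an existing family.
            try (Table table = conn.getTable(name)) {
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("source"), Bytes.toBytes("web"));
                table.put(put);
            }
        }
    }
}
```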
HBase I/O Flow
| Reads (HBase client perspective) | Writes (HBase server perspective) |
| --- | --- |
| Look up the location of the hbase:meta table via ZooKeeper (first request only) | Append the mutation to the Write Ahead Log (WAL) |
| Find, and cache, the region and RegionServer that host the requested row key | Apply the mutation to the in-memory MemStore and acknowledge the client |
| Send the Get or Scan directly to that RegionServer, which serves it from the block cache, MemStore, and HFiles | Flush the MemStore to an immutable HFile in HDFS when it fills; compactions later merge HFiles |
The HBase client API uses the hbase:meta system table to identify the region hosting the requested key, so it can read from or write to that RegionServer directly, without involving the HMaster.
Clients can also write to HDFS directly (for example, by bulk loading pre-built HFiles) rather than going through the HBase API. Either way, the data is accessible through HBase.
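As a minimal read sketch, continuing with the `conn` and the hypothetical `events` table from the earlier example: the meta lookup and location caching happen inside the client library, so application code simply issues a Get.

```java
// The client library resolves (and caches) the hosting RegionServer via
// hbase:meta behind the scenes; the HMaster is not on this path.
try (Table table = conn.getTable(TableName.valueOf("events"))) {
    Get get = new Get(Bytes.toBytes("row-1"));
    Result result = table.get(get);
    byte[] source = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("source"));
    System.out.println(Bytes.toString(source)); // prints "web"
}
```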
HBase Responsibilities Summary
| Component | Responsibilities |
| --- | --- |
| HMaster | Assigns regions to RegionServers, coordinates region splits, rebalancing, and failover, and handles schema operations such as creating tables and column families |
| RegionServers | Serve reads and writes for the regions they host, buffer writes in the MemStore, append to the WAL, and flush and compact HFiles |
| Regions | Contiguous, sorted ranges of a table's rows; the unit of distribution, split and moved automatically as data grows |
| HDFS | Provides the distributed, replicated storage underneath HBase for HFiles and WALs |
| ZooKeeper | Coordinates the cluster: tracks live RegionServers, supports HMaster election and failover, and points clients at the hbase:meta table |
ZooKeeper is built into HBase; however, a production cluster should use a dedicated ZooKeeper ensemble that is integrated with the HBase cluster.
Notable HBase Features
- While many NoSQL databases offer only eventual consistency, HBase touts strong consistency as a core design tenet. A single node in an HBase cluster is responsible for atomic row operations on a given subset of the data, so HBase is able to guarantee consistency (see the Increment sketch after this list).
- Traditional databases require manual sharding. HBase, like many NoSQL databases, provides automatic sharding. The tables are distributed across the cluster via regions, which are automatically split and re-distributed as the data grows. Each individual node has access to the data in HDFS to service reads and writes, and this allows HBase to achieve low latency random access to petabytes of data by distributing requests from applications across a cluster of nodes.
- Many databases and data stores require complicated configuration, architectural decisions, and potentially custom coding or external product integrations to achieve a high degree of fault tolerance against node availability issues. HBase leverages the fault tolerance features of HDFS: because data stored in tables is split across multiple hosts in the cluster, HBase can withstand the failure of an individual node. It achieves this by automatically assigning a healthy node to serve the data previously provided by the failed node, then replaying the Write Ahead Log (WAL) to recover in-flight writes.
- Because HBase was developed with Hadoop in mind, it natively supports and leverages other components of that ecosystem. Some examples:
- HBase supports and uses HDFS by default as its distributed file system.
- HBase supports massively parallelized processing via MapReduce, and it can be leveraged as both a source and output for MapReduce jobs.
- Although HBase does not support SQL syntax natively, this can be achieved through the use of Apache Phoenix, a complementary open source project.
- Likewise, Apache Hive allows users to query HBase tables using the Hive Query Language, which is similar to SQL.
- HBase is developed in Java, and it has a Java Client API for convenient access via Java-based applications; however, it also has both Thrift and REST APIs for language agnostic interactions.
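To make the single-row atomicity mentioned above concrete, here is a hedged sketch using the Java client's Increment operation, continuing with the `conn` from the earlier examples; the `pageviews` table and column names are invented for illustration.

```java
// The increment is applied by the single RegionServer hosting this row,
// which is what makes it atomic: concurrent clients cannot clobber each
// other's updates, and no client-side locking is needed.
try (Table table = conn.getTable(TableName.valueOf("pageviews"))) {
    Increment inc = new Increment(Bytes.toBytes("page#home"));
    inc.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("views"), 1L);
    table.increment(inc);
}
```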
HBase Use Cases
HBase is used both for write-heavy applications and for applications that need fast, random access to vast amounts of available data. Some examples include (a sketch of the first follows the list):
- Storing clickstream data for downstream analysis
- Storing application logs for diagnostic and trend analysis
- Storing document fingerprints used to identify potential plagiarism
- Storing genome sequences and the disease history of people in a particular demographic
- Storing head-to-head competition histories in sports for better analytics and outcome predictions
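To make the clickstream case concrete, one common pattern is a composite row key of user ID plus a reversed timestamp, so that scanning a user's key prefix returns the most recent clicks first. This is a hedged, illustrative sketch continuing with the earlier `conn`; the `clickstream` table and all names are invented.

```java
// HBase stores rows in ascending row-key order, so subtracting the
// timestamp from Long.MAX_VALUE puts each user's newest click first.
long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
byte[] rowKey = Bytes.add(Bytes.toBytes("user42#"), Bytes.toBytes(reversedTs));

try (Table table = conn.getTable(TableName.valueOf("clickstream"))) {
    Put click = new Put(rowKey);
    click.addColumn(Bytes.toBytes("c"), Bytes.toBytes("url"), Bytes.toBytes("/pricing"));
    click.addColumn(Bytes.toBytes("c"), Bytes.toBytes("referrer"), Bytes.toBytes("/home"));
    table.put(click);
}
```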
HBase Alternatives
Broadly speaking, there are many alternatives to HBase; any data store or database can be a contender for solving the same problems. For teams evaluating different open source databases, considering your specific data management needs can help narrow the pool of options.
For systems that need to house and process thousands, or maybe even millions, of rows, horizontal scaling is not likely a factor. In these cases, most of the data could be stored on a single server, so some form of RDBMS would be a good choice due to all the conveniences it provides. There are countless options in this space, but the most popular open source relational databases are PostgreSQL, MySQL, and MariaDB.
For systems that need to house and process millions or billions of rows in a performant way, a NoSQL solution like HBase is going to be on the table. Again, there are many NoSQL options with varying strengths and weaknesses that depend on the specifics of the problem. Some open source options (other than HBase) include Cassandra, Redis, and MongoDB. In environments without a Hadoop implementation, these NoSQL options may be more attractive because they are designed to be used as standalone data stores.
Final Thoughts
Realistically, HBase is the most logical choice when there is an existing investment in the Hadoop ecosystem for housing and managing big data. This is because HBase depends heavily on other components of the Hadoop ecosystem, such as HDFS, MapReduce, and ZooKeeper. The volume of data being processed, managed, and stored, and whether or not you are already using Hadoop, will likely be the key factors that determine if it makes sense to deploy HBase in your environment.
Open Source Big Data Management from OpenLogic
With the Hadoop Service Bundle, we can help you manage your Big Data infrastructure no matter where your data is hosted (on-prem, cloud, hybrid) with an open source Hadoop stack that is equivalent to the Cloudera Data Platform.
Additional Resources
- Webinar - Is It Time to Open Source Your Big Data Management
- Blog - Introducing the Hadoop Service Bundle From OpenLogic
- Datasheet - Hadoop Service Bundle
- Blog - Weighing the Value of Apache Hadoop vs. Cloudera
- White Paper - The Decision Maker's Guide to Open Source Databases
- Blog - Apache Spark vs. Hadoop
- Guide - Intro to Open Source Databases
- Blog - RDBMS vs. NoSQL: Differences and How to Migrate