In my previous post some time ago I introduced Apache Hive technology on Hadoop. Coming from SQL and RDBMS this was bound to be my favorite Hadoop technology. Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in HDFS files.
Today, unlike the previous basics post, I will concentrate on Hive partitions and buckets. A simple query in Hive reads the entire dataset even if we have a WHERE clause filter. This becomes a bottleneck for running MapReduce jobs over a large table. We can overcome this issue by implementing partitions in Hive. Hive makes partitions easy to implement: you declare the partitioning scheme when the table is created and Hive manages the directory layout automatically.
Partitions. Just like in an RDBMS, data within a Hive table can be split across multiple partitions. Each partition corresponds to a particular value (or combination of values) of the partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions are read, thereby reducing the I/O and time required by the query. An example of creating a partitioned table in Hive is below; note you can do this with both internal and external tables:
CREATE TABLE CUSTOMER ( userid BIGINT, First_Name STRING, Last_Name STRING, address1 STRING, address2 STRING, city STRING, zip_code STRING, state STRING ) PARTITIONED BY ( country STRING );
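Once the table exists, data can be loaded into a specific partition by naming it explicitly. A minimal static-partition sketch (the source path here is a hypothetical example):

```sql
-- Static partitioning: the target partition is named explicitly,
-- so Hive moves the file straight into the country=US subdirectory.
-- (The staging path is hypothetical.)
LOAD DATA INPATH '/staging/customers_us'
INTO TABLE CUSTOMER PARTITION (country = 'US');
```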
As you can see I am using country as the partition column. The PARTITIONED BY clause lets Hive alter the way it manages the underlying structures of the table’s data directory. If you browse the location of the data directory for a non-partitioned table, it will look like this:
&lt;database_name&gt;.db/&lt;table_name&gt;. All the data files are written directly to this directory. For partitioned tables, subdirectories are created under the table’s data directory for each unique value of a partition column. If the table is partitioned on multiple columns, Hive creates nested subdirectories based on the order of the partition columns in the table definition. With the table above, Hive creates one subdirectory per country value.
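For example, assuming the default warehouse location (/user/hive/warehouse) and a few sample country values, the directory tree for the CUSTOMER table might look like this sketch (actual paths depend on your metastore configuration):

```
/user/hive/warehouse/customer/country=CA/
/user/hive/warehouse/customer/country=MX/
/user/hive/warehouse/customer/country=US/
```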
When a partitioned table is queried with one or more partition columns in the WHERE clause, what Hive effectively does is partition elimination: it scans only those data directories that are needed. If no partition columns are used in the criteria, then all the directories are scanned (a full table scan) and partitioning has no effect.
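For example, against the CUSTOMER table above, a query like this would read only the country=US subdirectory:

```sql
-- Partition elimination: only the country=US directory is scanned.
SELECT userid, First_Name, Last_Name
FROM CUSTOMER
WHERE country = 'US';
```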
With a few quick changes it’s easy to configure Hive to support dynamic partition creation. Just as SQL Server has a SET command to change database options, Hive lets us change settings for a session using the SET command. Changing these settings permanently would require editing the configuration file (hive-site.xml) and restarting the Hive service.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
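With those settings in place, a dynamic-partition load might look like the following sketch. STAGED_CUSTOMER is a hypothetical staging table with the same columns plus country; note that the dynamic partition column must come last in the SELECT list:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hive creates one partition per distinct country value found in the data;
-- the partition column (country) must be the last column selected.
INSERT OVERWRITE TABLE CUSTOMER PARTITION (country)
SELECT userid, First_Name, Last_Name, address1, address2,
       city, zip_code, state, country
FROM STAGED_CUSTOMER;
```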
Important notes on best practices
- Be careful using dynamic partitions. Hive has built-in limits on the number of partitions that can be dynamically created, as well as limits on the total number of files a job can create. Creating many partitions at once creates a lot of files, and a lot of small files use up memory in the Hadoop NameNode.
- If your partitions are relatively small, then the expense of recursing through the directories becomes more costly than simply scanning all the data. Likewise, partitions should be roughly similar in size to prevent a single long-running task from holding things up.
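The limits mentioned above are themselves session settings. The property names below are standard Hive configuration properties, but the default values shown are assumptions and vary by Hive version; check your own configuration before raising them:

```sql
-- Guards on dynamic partitioning (defaults shown are typical, not guaranteed):
SET hive.exec.max.dynamic.partitions = 1000;         -- total partitions per statement
SET hive.exec.max.dynamic.partitions.pernode = 100;  -- partitions per mapper/reducer node
SET hive.exec.max.created.files = 100000;            -- total files created by a job
```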
Buckets. Much like partitioning, bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance. When you create a Hive table you can use the CLUSTERED BY keyword to define buckets.
CREATE TABLE CUSTOMER ( userid BIGINT, First_Name STRING, Last_Name STRING, address1 STRING, address2 STRING, city STRING, zip_code STRING, state STRING ) PARTITIONED BY ( country STRING ) CLUSTERED BY (userid) INTO 10 BUCKETS;
In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. The hash_function depends on the type of the bucketing column. For an int, as in our example, it’s easy: hash_int(i) == i. So we can expect all userids that end in 0 to land in the first bucket, all userids that end in 1 to land in the second bucket, and so on. For other datatypes it unfortunately gets a lot more tricky: the hash of a string or a complex datatype will be some number derived from the value, but not anything humanly recognizable. In general, distributing rows based on the hash will give you an even distribution across the buckets, and that’s your goal here. Note that bucketing is entirely dependent on the data being loaded correctly, so loading the table properly is critical.
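That last point deserves an example. On older Hive versions, bucketing is only honored during inserts if you set hive.enforce.bucketing first (newer versions enforce it automatically). A hedged sketch of a correct load, reusing the hypothetical STAGED_CUSTOMER staging table:

```sql
-- Make Hive use one reducer per bucket so rows are hashed into
-- the declared 10 buckets (required on older Hive versions).
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE CUSTOMER PARTITION (country = 'US')
SELECT userid, First_Name, Last_Name, address1, address2,
       city, zip_code, state
FROM STAGED_CUSTOMER
WHERE country = 'US';

-- Once loaded correctly, bucketing enables efficient sampling:
SELECT * FROM CUSTOMER TABLESAMPLE (BUCKET 1 OUT OF 10 ON userid);
```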
For more see:
- http://www.bidn.com/blogs/cprice1979/ssas/4646/partitions-amp-buckets-in-hive
- https://www.qubole.com/blog/big-data/5-tips-for-efficient-hive-queries/
- http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/
- http://blog.cloudera.com/blog/2014/08/improving-query-performance-using-partitioning-in-apache-hive/
- http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html