Dancing With Elephants and Flying With The Bees–Basic Introduction To Apache Hive


Continuing on my Big Data\Hadoop blogging journey I started from HDFS post and YARN post,   I will move to my favorite Hadoop technology – Apache Hive. At its simplest definition Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in HDFS files.

Hive has three main functions: data summarization, query and analysis.  It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom MapReduce scripts to be plugged into queries. Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called Hive-Metastore.


According to the Apache Hive wiki, “Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs).” Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files which store columns of a table in a columnar database way.)

Hive isn’t a relational database as it only maintains metadata information about Big Data stored on HDFS. however it allows you to treat your Big Data as tables and perform SQL-like operations using HiveQL.


Hive Components.  Components of Hive include HCatalog and WebHCat:

  • HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
  • WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style) interface.

Hive Data Model.

Data in Hive is organized into:

  • Tables – These are analogous to Tables in Relational Databases. Tables can be filtered, projected, joined and unioned. Additionally all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables wherein a table can be created on preexisting files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns similar to Relational Databases.
  • Partitions – Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds had files with data for a particular date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = ‘2008-09-01’ would only have to look at files in <table location>/ds=2008-09-01/ directory in HDFS.
  • Buckets – Data in each partition may in turn be divided into Buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).

Hive Metastore.  The Metastore provides two important but often overlooked features of a data warehouse: data abstraction and data discovery. Without the data abstractions provided in Hive, a user has to provide information about data formats, extractors and loaders along with the query. In Hive, this information is given during table creation and reused every time the table is referenced. This is very similar to the traditional warehousing systems. The second functionality, data discovery, enables users to discover and explore relevant and specific data in the warehouse.

HiveQL Compiler and Optimizer.  Just like in RDBMS HiveQL needs to be compiled and logical query plan has to be generated. Moreover, plan will be further optimized via optimizer. Just as with RDBMS Hive optimizer is moving towards cost based optimization based on statistics (SQL or Oracle anyone) – https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive

Enough theory – lets do Hive. Instead of installing Apache Hadoop and Hive for learning sake I would recommend that you download one of the learning VMs provided by both Cloudera and Hortonworks. In my demo I will use Hotronworks Sandbox, latest copy of that VM you can download from here – http://hortonworks.com/products/hortonworks-sandbox/#overview, if you prefer Cloudera you can get that from – http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html

So I will launch Hortonworks sandbox on my Virtual Box:


Although I could launch hive easily command line for the purpose of this tutorial I will go easy and use very convenient Sandbox Web Console available on  Console gives me number of nice visual tools such as File Browser to easily move files and visibility into hue, Pig and Hive Editors , Oozie scheduling tooling, etc.


I will use dataset on baseball statistics available from Sean Lahman – http://www.seanlahman.com/. Dataset can be downloaded from – http://seanlahman.com/files/database/lahman591-csv.zip  Once we have a file we will just unzip into some temp directory. I am interested here in two files in that directory – Master,csv and Batting.csv. I can use File Explorer Utility to upload these files to hue\hdfs. File Explorer icon on the toolbar is:


Here is the utility:


Once I upload the files I can go to Beeswax – Hive Editor. To do so will click on another icon on toolbar (the one with bee on it)


Beeswax is a pretty nice editor and allows me to type in and edit my HiveQL statements, as well as get explain\execution plans for query tuning purposes (more on that in next posts).


Creating Tables.  So first thing I will do is create a new table in default database provided to me. The Hive concept of a database is essentially just a catalog or namespace of tables.
However, they are very useful for larger clusters with multiple teams and users, as a
way of avoiding table name collisions. It’s also common to use databases to organize
production tables into logical groups. If you don’t specify a database, the default database is used.

The CREATE TABLE statement follows SQL conventions, but Hive’s version offers significant
extensions to support a wide range of flexibility where the data files for tables
are stored, the formats used, etc.  That’s what I am doing now first:

create table temp_batting (col_value STRING);

The table we have created so far are called managed tables or sometimes called internal
tables, because Hive controls the lifecycle of their data  Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.directory (e.g., /user/hive/warehouse), by default.  There is also ability to define an external table, i.e. table where data is stored externally to Hive metastore directory.

We can get all the information we need about a table through a query too. Go back to the query editor and execute:

describe temp_batting;

The result describes the schema of the table and detailed table information.  Similar to RDBMS isn’t it?

Loading data.  Next I will load data into my new table from file I uploaded. This is pretty much self-explanatory:

LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;

Since Hive has no row-level insert, update, and delete operations, the only way to put
data into an table is to use one of the “bulk” load operations.  If the LOCAL keyword is used, the path is assumed to be in the local file system. The data is copied into the final location. If LOCAL is omitted, the path is assumed to be in the distributed file system. In this case, the data is moved from the path to the final location.

Create another table. So I have table loaded with batting data ,, but I needs results table here , after all I want to get information on players doing  most runs yearly.

create table batting (player_id STRING, year INT, runs INT);

Finally query:

insert overwrite table batting  
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,  
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,  
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run  
from temp_batting;
SELECT a.year, a.player_id, a.runs from batting a  
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b  
ON (a.year = b.year AND a.runs = b.runs) ;


Some finer points:

  • We are using INSERT and SELECTs here, as well as JOIN
  • SELECT is the projection operator in SQL, WHERE gives the condition of what to filter, GROUP BY gives a list of columns which specify how to aggregate the records
  • INSERT OVERWRITE will overwrite any existing data in the table, whereas INSERT INTO will append data
  • I use online Hive Syntax Checker here – https://sql.treasuredata.com/, to check my syntax

So this is a very basic post and I will go from there into a lot more on Hive. For more information also see – https://github.com/Prokopp/the-free-hive-book, https://altiscale.zendesk.com/hc/en-us/articles/201802346-Simple-Hive-Query-Example, http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/, http://www.tutorialspoint.com/hive/, tutorial on Amazon – http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-load-data.html

Forecast Cloudy – Why Is My Azure Table Storage Query So Slow Again?

Perhaps this post shouldn’t exist as I already profiled basics of Azure Table Storage in my previous post. However, in last few months I heard quite a bit of complaints from customers attempting to query Azure tables of performance. In sheer majority of the cases the issues could have been avoided with proper table and query design. By design, Azure Table services provides the potential to store enormous amounts of data, while enabling efficient access and persistence. The services simplify storage, saving you from jumping through all the hoops required to work with a relational database—constraints, views, indices, relationships and stored procedures. Yet, lest not forget that Azure Table Storage is no way to substitute for relational database – so don’t try to store relational data there, attempt some sort of joins, manage relationships of any kind, etc. Its designed for “quick and dirty” NoSQL storage and access not for analytics. With that established lets again look at design of Azure Table Storage:

Azure Storage Partitions. Looking at quite known Microsoft white paper on Azure Storage, one of the main reasons why storage can scale so well is its partitioning logic. Whenever you store something on Windows Azure storage, it is located on some partition in the system. Partitions are used for scale out in the system.

Referenced Screen

Those Partition Keys are critical. In Table Storage, you have to decide on the Partition Key yourself. In essence, you are responsible for the throughput you’ll get on your system. If you put every entity in the same partition (by using the same partition key), you’ll be limited to the size of the storage machines for the amount of storage you can use. Plus, you’ll be constraining the maximal throughput as there’s lots of entities in the same partition.Should you set the Partition Key to a unique value for every entity stored? No. You can do this and every entity stored will end up in its own partition, but you’ll find that querying your data becomes more difficult.

Row Key – a primary key within given partition. A Row Key in Table Storage is a very simple thing: it’s your “primary key” within a partition. Partition Key + Row Key form the composite unique identifier for an entity. Within one Partition Key, you can only have unique Row Keys. If you use multiple partitions, the same Row Key can be reused in every partition.

Consequences.  For example, consider a products table that contains information about company products and has Partition Keys that correspond to the product types, such as Bicycle,  Tricycle and Scooter. As demand for Bicycles increases and more models\variations are introduced , the rows in the Bicycle partition might be very busy (becoming a so-called “hot” partition). The service can load balance the Products table by moving the Bicycle partition to a different server to 
better handle the many requests made to the partition. If you anticipate more activity on that partition than a single server can handle, you should consider creating more-granular partitions , example Bicycle_Adult and Bicycle_Children.

Before building your code, it’s a good idea to think about both properties. Don’t just assign them a GUID or a random string as it does matter for performance.The fastest way of querying? Specifying both Partition Key and Row Key. By doing this, table storage will immediately know which partition to query and can simply do an ID lookup on Row Key within that partition.

Very good guide to designing your Azure Table Storage is available here – https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/ , its way deeper than one can ever go in this blog post. Another good resource is Azure Storage Team Blog and check out excellent post here – http://blogs.msdn.com/b/mast/archive/2013/02/19/improving-windows-azure-table-storage-query-performance.aspx

The performance difference can be rather striking. To illustrate I will do something pretty bad here – will create a table with Partition Key of “People” (therefore will always use single partition)  and GUID Row Key with lots of records (100,000+) and query my table based on non-Partition\Row Key columns.  Here is the code to create 100,000 of fake people in Azure Table that way from console application:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Table;
using System.Configuration;

namespace GoodAzureStorageQuery
    class Program
        static void Main(string[] args)
            CloudStorageAccount storageAccount = CloudStorageAccount.Parse(ConfigurationManager.ConnectionStrings["StorageConnectionString"].ToString());

                //Create Table Clienbt
                CloudTableClient tableClient = storageAccount.CreateCloudTableClient();

                // Create the table if it doesn't exist.
                CloudTable table = tableClient.GetTableReference("People");

                //Fill up that table 
            for (int i=0;i<100000;i++)
                Person p = new Person(i.ToString(), (i + 1).ToString());
                // Create the TableOperation that inserts the customer entity.
                TableOperation insertOperation = TableOperation.InsertOrReplace(p);
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Table;

namespace GoodAzureStorageQuery
    class Person:TableEntity
        public string FirstName {get;set;}
        public string LastName { get; set; }

    public Person(string first_name, string last_name)
            Guid g = Guid.NewGuid();
            this.PartitionKey = "People";
            this.RowKey = g.ToString();
            this.FirstName = first_name; 

And sure thing – connecting to Azure from ClumsyLeaf Table explorer I see an table with lots of records in no particular order, as I “designed”:


Same in Azure Storage Explorer:


The issue occurs when I query this table via something like to query for single person in my table:

var query = from entity in context.CreateQuery("SpecificCustomer")
                 where entity.LastName.Equals("1")
                 && entity.FirstName.Equals("2")
                 select entity;

Or something like this:

var query = (from entity in context.CreateQuery("Top10People")
                 select entity).Take(10);


That will take a long time. Again, here I committed multiple sins:

  • Used generic Partition Key – therefore not partitioning my table well at all, but throwing all data on a single partition
  • Used GUID for Row Key
  • Used a query that queries neither on Partition or Row Key

So high level rules to make sure your Azure Tables are performant  are:

    • No relations between tables (or entities sets) – this isn’t an RDBMS
    • Storing datasets that don’t require complex joins, foreign keys, or stored procedures and can be denormalized for fast access
    • Quickly querying data using a clustered index (Partition Key + Row Key)
    • Tables doesn’t have a specified schema
    • Query tables using both Partition and Row Key to achieve best performance with Windows Azure Table Storage.

For more see – http://lockmedown.com/be-sure-with-azure-net-azure-table-storage-part-2/, http://blogs.msdn.com/b/windowsazurestorage/archive/2014/11/17/azure-storage-performance-checklist.aspx, https://msdn.microsoft.com/en-us/magazine/ff796231.aspx

Hope this helps.