Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 2 – USQL and Programmability

datalake

This is a continuation of the previous post that can be found here. As I stated, the primary programming language for Azure Data Lake Analytics is U-SQL. If you, like myself, come from a SQL background, you will find that many U-SQL queries look like SQL; however, after a second look, those of you familiar with .NET and LINQ will notice that there are familiar constructs there as well.

There are two ways to run U-SQL scripts:

    1. You can run U-SQL scripts on your own machine. The data read and written by the script will be on your own machine. You aren’t using Azure resources to do this, so there is no additional cost. This method of running U-SQL scripts is called “U-SQL Local Execution”.
    2. You can run U-SQL scripts in Azure in the context of a Data Lake Analytics account. The data read or written by the script will also be in Azure – typically in an Azure Data Lake Store account. You pay for any compute and storage used by the script. This is called “U-SQL Cloud Execution”   

If you wish to run scripts locally via your copy of Visual Studio 2015, you can install Data Lake Tools here – https://www.microsoft.com/en-us/download/details.aspx?id=49504 . With Visual Studio 2017, the tooling is part of the Azure development workload.

data-lake-tools-for-vs-2017-install-02.png

In the previous post I showed how to run U-SQL scripts in the cloud, but today we will use Visual Studio 2017 with Data Lake Tools to run U-SQL scripts.

Once you have installed the tools, you can connect to your Azure Data Lake Analytics account and run a U-SQL script by right-clicking it and picking Run U-SQL Script from the menu. Running such a script looks like this in VS 2017:

usql1vs

You can see stats and timings right in Visual Studio, just as you would in the Azure portal:

usql2vs

When something is wrong with a U-SQL script, it will not compile. Here are some of the common things that may drive you mad and that you should watch out for:

  1. Invalid case. U-SQL is case sensitive, unlike SQL. So the following script will error out because it uses lowercase from instead of uppercase FROM:
    @searchlog = 
        EXTRACT UserId          int, 
                Start           DateTime, 
                Region          string, 
                Query           string, 
                Duration        int, 
                Urls            string, 
                ClickedUrls     string
        from @"/SearchLog.tsv"
        USING Extractors.Tsv();
    
    OUTPUT @searchlog 
        TO @"/SearchLog_output.tsv"
        USING Outputters.Tsv();
    
  2. Bad path to an input or output file. Check your paths; this one is self-evident, but I spent hours debugging my path\folder issues.
  3. Invalid C# expressions due to typos, etc.
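For reference, here is the same script from item 1 with the keyword correctly capitalized; this version compiles:

```sql
@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int,
            Urls            string,
            ClickedUrls     string
    FROM @"/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO @"/SearchLog_output.tsv"
    USING Outputters.Tsv();
```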

When you develop U-SQL scripts, you can save time and expense by running the scripts locally on your machine before they are ready to be run in the cloud. You can connect to your local folder path via the Visual Studio Data Lake tools and run queries there:

adlavs3

A local data root folder is a local store for the local compute account. Any folder in the local file system on your local machine can be a local data root folder. It’s the same as the default Azure Data Lake Store account of a Data Lake Analytics account. Switching to a different data root folder is just like switching to a different default store account.

When you run a U-SQL script, a working directory folder is needed to cache compilation results, run logs, and perform other functions. In Azure Data Lake Tools for Visual Studio, the working directory is the U-SQL project’s working directory. It’s located under <U-SQL Project Root Path>/bin/debug. The working directory is cleaned every time a new run is triggered.

Now that you have the basics, you can delve into U-SQL constructs following the language tutorials by Microsoft’s Michael Rys here – https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-language-reference , and Adjeta Sighal here – https://blogs.msdn.microsoft.com/azuredatalake/2018/05/26/get-started-with-u-sql-its-easy/

In my next installment I am hoping to go through a typical ADLA analytics job from start to completion.

For more on Azure Data Lake Analytics see – https://channel9.msdn.com/blogs/Cloud-and-Enterprise-Premium/Azure-Data-Lake-Analytics-Deep-Dive and http://eatcodelive.com/2015/11/21/querying-azure-sql-database-from-an-azure-data-lake-analytics-u-sql-script/

Good Luck and Happy Coding!


Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 1 – Introduction

adla

Previously, I wrote a post about Google BigQuery, a GCP service that enables interactive analysis of massively large datasets in conjunction with Google Storage. Similar services are now provided by all public cloud vendors; Microsoft Azure has a service known as Azure Data Lake Analytics that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blob storage.

According to Microsoft, Azure Data Lake Analytics lets you:

  • Analyze data of any kind and of any size.
  • Speed up and sharpen your development and debug cycles.
  • Use the new U-SQL processing language built especially for big data.
  • Rely on Azure’s enterprise-grade SLA.
  • Pay only for the processing resources you actually need and use.
  • Benefit from the YARN-based technology extensively tested at Microsoft.

ADLA is based on top of the YARN technology. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs. The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

yarn_architecture

For more on the YARN architecture see – https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html . Being based on YARN gives Azure Data Lake Analytics extreme scalability. Data Lake Analytics can work with a number of Azure data sources: Azure Blob storage, Azure SQL Database, Azure SQL Data Warehouse, Azure Data Lake Store, and Azure SQL Database in an Azure VM. Azure Data Lake Analytics is specially optimized to work with Azure Data Lake Store—providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of SQL with the expressive power of C#. It takes a bit to learn for a typical SQL person, but it’s pretty powerful.

So enough theory; let me show how you can crunch Big Data workloads without creating a large Hadoop cluster or setting up infrastructure, paying only for the storage and compute you use.

The first thing we will need before starting to work in the Azure cloud is a subscription. If you don’t have one, browse to https://azure.microsoft.com/en-us/free/?v=18.23 and follow the instructions to sign up for a free 30-day trial subscription to Microsoft Azure.

In my example I will use a sample retail dataset with a couple of stock and sales data files, which I will upload to Azure Data Lake Store. The Azure Data Lake Store is an Apache Hadoop file system compatible with the Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Azure Data Lake Store is built for running large-scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers. This improves the read throughput when reading the file in parallel for data analytics.

To upload these files, I will first create an Azure Data Lake Store account called adlsgennadyk:

  • Navigate to Azure Portal
  • Find service called Data Lake Storage Gen 1
  • Click Create button

dls1

In the form, I will name my Azure Data Lake Store, pick the Azure resource group where it will reside, and choose a billing model, which can be either the usual pay-as-you-go or prepayment in advance.

dls2

Once the Data Lake Store is created, I click on the Data Explorer button to launch that tool and upload the files I will be analyzing:

dls3

Next, I will use the tool to upload the demo retail dataset file called stock, which stores retail stocked-product information in tab-delimited format.

dls4

Here is the dataset as can be seen in Excel:

dls5

Now that the data has been uploaded to Azure, let’s create an instance of the Azure Data Lake Analytics service. Again, the action sequence is the same:

  • Navigate to Azure Portal
  • Find service called Data Lake Analytics
  • Click Create button

The resulting form is very similar to what I did with storage above, except I will point my ADLA instance to the storage instance created above.

adla11

Once the Azure Data Lake Analytics instance is created, you are presented with this screen:

adla12

Once I click on the New Job button, I can run a brand new query against my files in Azure Data Lake Storage. I will show you the simplest U-SQL script here:

@stock = 
    EXTRACT Id   int, 
            Item string 
    FROM "/stock.txt"
    USING Extractors.Tsv();

OUTPUT @stock
    TO "/SearchLog_output.tsv"
    USING Outputters.Tsv();

Here is what I just asked Azure Data Lake Analytics to do: we extract all of the data from a file and copy the output to another one.

Some items to know:

  • The script contains a number of U-SQL keywords: EXTRACT, FROM, TO, OUTPUT, USING, etc.
  • U-SQL keywords are case sensitive. Keep this in mind – it’s one of the most common errors people run into.
  • The EXTRACT statement reads from files. The built-in extractor called Extractors.Tsv handles Tab-Separated-Value files.
  • The OUTPUT statement writes to files. The built-in outputter called Outputters.Tsv handles Tab-Separated-Value files.
  • From the U-SQL perspective files are “blobs” – they don’t contain any usable schema information. So U-SQL supports a concept called “schema on read” – this means the developer specifies the schema that is expected in the file. As you can see, the names of the columns and the datatypes are specified in the EXTRACT statement.
  • The default Extractors and Outputters cannot infer the schema from the header row – in fact, by default they assume that there is no header row (this behavior can be overridden).
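Schema-on-read simply means that the schema lives in the query, not in the file. The idea can be illustrated outside U-SQL as well; here is a minimal Python sketch (the column names mirror the earlier EXTRACT statement, but the parsing function and its conversion logic are my own illustration, not part of U-SQL):

```python
# Schema-on-read: the file is just bytes; the reader supplies the schema.
# Column names and types mirror the EXTRACT statement above.
from datetime import datetime

SCHEMA = [
    ("UserId", int),
    ("Start", lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")),
    ("Region", str),
    ("Query", str),
    ("Duration", int),
    ("Urls", str),
    ("ClickedUrls", str),
]

def read_tsv_row(line):
    """Apply the declared schema to one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    return {name: conv(value) for (name, conv), value in zip(SCHEMA, fields)}

row = read_tsv_row("42\t2018-05-26 10:00:00\ten-us\tazure data lake\t320\tu1;u2\tu1")
print(row["UserId"], row["Region"], row["Duration"])  # 42 en-us 320
```

The file itself carries no types at all; a different query could read the very same bytes with a different schema.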

After the job executes, you can see its execution analysis and graph:

adla14

Now let’s do something more meaningful by introducing a WHERE clause to filter the dataset:

@stock = 
    EXTRACT Id   int, 
            Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

@output = 
    SELECT *
    FROM @stock
    WHERE Item == "Tape dispenser (Black)";


OUTPUT @output
    TO "/stock2_output.tsv"
    USING Outputters.Tsv();

The job took about 30 seconds to run, including writing to the output file, which took most of the time here. Looking at the graph by duration, one can see where the time was spent:

adla15

In the next post I plan to delve deeper into Data Lake Analytics, including using C# functions and U-SQL catalogs.

 

For more on Data Lake Analytics see – https://docs.microsoft.com/en-us/azure/data-lake-analytics/ , https://youtu.be/4PVx-7SSs4c , https://cloudacademy.com/blog/azure-data-lake-analytics/ , and https://optimalbi.com/blog/2018/02/20/fishing-in-an-azure-data-lake/

 

Happy swimming in Azure Data Lake, hope this helps.

Data Cargo In The Clouds – Data and Containerization On The Cloud Platforms

container

This topic is where I actually planned to pivot for quite a while, but I either had no bandwidth\time with moving to Seattle or some other excuse like that. This topic is interesting to me as it’s the intersection of technologies where I used to spend quite a bit of time, including:

  • Data, including both traditional RDBMS databases and NoSQL designs
  • Cloud, including Microsoft Azure, AWS and GCP
  • Containers and container orchestration (an area that is new to me)

Those who worked with me displayed various degrees of agreement, but more often annoyance, at me pushing this topic as a growing part of the data engineering discipline, but I truly believe in it. Before we start combining all these areas and do something useful in code, I will spend some time here on the most basic concepts of what I am about to enter.

I will skip the introduction to basic concepts of RDBMS, NoSQL, or BigData\Hadoop technologies, as unless you have been hiding under a rock for the last five years, you should be quite aware of those. That brings us straight to containers:

As the definition states – “A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. Available for both Linux and Windows based apps, containerized software will always run the same, regardless of the environment. Containers isolate software from its surroundings, for example differences between development and staging environments and help reduce conflicts between teams running different software on the same infrastructure.”

Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. This could be from a developer’s laptop to a test environment, from a staging environment into production, and perhaps from a physical machine in a data center to a virtual machine in a private or public cloud.

container.jpg

But why containers if I already have VMs? 

VMs take up a lot of system resources. Each VM runs not just a full copy of an operating system, but a virtual copy of all the hardware that the operating system needs to run. This quickly adds up to a lot of RAM and CPU cycles. In contrast, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program.
What this means in practice is that you can put two to three times as many applications on a single server with containers as you can with VMs. In addition, with containers you can create a portable, consistent operating environment for development, testing, and deployment. 

On Linux, containers run on top of LXC. This is a userspace interface for the Linux kernel containment features. It includes an application programming interface (API) to enable Linux users to create and manage system or application containers.

Docker is an open platform tool that makes it easier to create, deploy, and execute applications by using containers.

  • The heart of Docker is the Docker Engine, the part of Docker that creates and runs Docker containers. A Docker container is a live, running instance of a Docker image. Docker Engine is a client-server application with the following components:
    • A server, which is a continuously running service called a daemon process.
    • A REST API, which programs use to talk with the daemon and instruct it what to do.
    • A command line interface client.

docker_engine

  • The command line interface client uses the Docker REST API to interact with the Docker daemon through CLI commands. Many other Docker applications also use the API and CLI. The daemon process creates and manages Docker images, containers, networks, and volumes.
  • The Docker client is the primary means by which Docker users communicate with Docker. When we use commands such as “docker run”, the client sends these commands to dockerd, which carries them out.

You will build Docker images using Docker and publish them into what are known as Docker registries. When we run the docker pull and docker run commands, the required images are pulled from our configured registry. Using the docker push command, an image can be uploaded to our configured registry. 

Finally, by deploying instances of images from your registry, you deploy containers. We can create, run, stop, or delete a container using the Docker CLI. We can connect a container to more than one network, or even create a new image based on its current state. By default, a container is well isolated from other containers and its host machine. A container is defined by its image and the configuration options that we provide when we create or run it.
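To make the image/container distinction concrete, here is a minimal, hypothetical Dockerfile (the base image, file name, and application are all illustrative, not from any particular project):

```dockerfile
# A hypothetical image: a slim Python base plus a single script.
FROM python:3.9-slim
WORKDIR /app
COPY app.py .
CMD ["python", "app.py"]
```

Building it (docker build -t myapp .) produces an image in your local registry cache; docker run myapp then creates a live container from that image, and docker push would upload the image to a configured registry.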

For more info on containers, LXC and Docker see – https://blog.scottlowe.org/2013/11/25/a-brief-introduction-to-linux-containers-with-lxc/ , https://www.docker.com/what-container#/virtual_machines , and http://searchservervirtualization.techtarget.com/definition/container-based-virtualization-operating-system-level-virtualization

That brings us to container orchestration engines. While the CLI meets the needs of managing one container on one host, it falls short when it comes to managing multiple containers deployed on multiple hosts. To go beyond the management of individual containers, we must turn to orchestration tools. Orchestration tools extend lifecycle management capabilities to complex, multi-container workloads deployed on a cluster of machines.

Some of the well known orchestration engines include:

Kubernetes. Almost a standard nowadays, originally developed at Google. Kubernetes’ architecture is based on a master server with multiple minions. The command line tool, called kubectl, connects to the API endpoint of the master to manage and orchestrate the minions. The service definition, along with the rules and constraints, is described in a JSON file. For service discovery, Kubernetes provides a stable IP address and DNS name that corresponds to a dynamic set of pods. When a container running in a Kubernetes pod connects to this address, the connection is forwarded by a local agent (called the kube-proxy) running on the source machine to one of the corresponding backend containers. Kubernetes supports user-implemented application health checks. These checks are performed by the kubelet running on each minion to ensure that the application is operating correctly.

kub

For more see – https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#kubernetes-is
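As a sketch of the workload definition described above, here is a minimal hypothetical Kubernetes Deployment manifest (modern manifests are usually written in YAML rather than JSON; the names and image here are illustrative only):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                 # desired number of pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web              # pods carry this label
    spec:
      containers:
      - name: web
        image: nginx:1.21     # illustrative container image
        ports:
        - containerPort: 80
```

A Service object selecting app: web would then give this dynamic set of pods the stable IP address and DNS name that the kube-proxy forwards traffic to.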

Apache Mesos. This is an open source cluster manager that simplifies the complexity of running tasks on a shared pool of servers. A typical Mesos cluster consists of one or more servers running the mesos-master and a cluster of servers running the mesos-slave component. Each slave is registered with the master to offer resources. The master interacts with deployed frameworks to delegate tasks to slaves. Unlike other tools, Mesos ensures high availability of the master nodes using Apache ZooKeeper, which replicates the masters to form a quorum. A high availability deployment requires at least three master nodes. All nodes in the system, including masters and slaves, communicate with ZooKeeper to determine which master is the current leading master. The leader performs health checks on all the slaves and proactively deactivates any that fail.

 

Over the next couple of posts I will create containers running SQL Server and other data stores, both RDBMS and NoSQL, deploy these to the cloud, and finally attempt to orchestrate them as well. So let’s move from theory and pictures into the world of practical data engine deployments.

Hope you will find this detour interesting.

Forecast Cloudy – Using Azure Blob Storage with Apache Hive on HDInsight

The beauty of working with Big Data in Azure is that you can manage (create\delete) compute resources with your HDInsight cluster independently of your data, which is stored either in Azure Data Lake or Azure Blob storage. In this post I will concentrate on using Azure Blob storage\WASB as the data store for HDInsight, Azure’s PaaS Hadoop service.

With a typical Hadoop installation you load your data to a staging location, then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back as a result set for consumption by other systems (think Power BI, Tableau, etc.).
Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. It in many ways “is” HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is what enables your data to persist even when no clusters currently exist and enables multiple clusters plus other applications to access a single piece of data all at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.

hdinsight

In Azure you store blobs on containers within Azure storage accounts. You grant access to a storage account, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft’s documentation helps to show the structure:

blob1

Hold on, isn’t the whole selling point of Hadoop proximity of data to compute? Yes, and just like with any other on-premises Hadoop system, data is loaded into memory on the individual nodes at compute time. With the Azure data infrastructure setup and a data center backbone built for performance, your job performance is generally the same or better than if you used disks locally attached to the VMs.

Below is a diagram of the HDInsight data storage architecture:

hdi.wasb.arch

HDInsight provides access to the distributed file system that is locally attached to the compute nodes. This file system can be accessed by using the fully qualified URI, for example:

hdfs://<namenodehost>/<path>

More important is the ability to access data that is stored in Azure Storage. The syntax is:

wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>

As per https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage you need to be aware of the following:

  • Container security for WASB storage. For containers in storage accounts that are connected to the cluster, because the account name and key are associated with the cluster during creation, you have full access to the blobs in those containers. For public containers that are not connected to the cluster, you have read-only permission to the blobs. For private containers in storage accounts that are not connected to the cluster, you can’t access the blobs at all unless you define the storage account when you submit WebHCat jobs.
  • The storage accounts that are defined in the creation process, and their keys, are stored in %HADOOP_HOME%/conf/core-site.xml on the cluster nodes. The default behavior of HDInsight is to use the storage accounts defined in the core-site.xml file. It is not recommended to edit the core-site.xml file directly, because the cluster head node (master) may be reimaged or migrated at any time, and any changes to this file are not persisted.
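For illustration, a storage account key entry in core-site.xml looks roughly like this (the account name and key value are placeholders, not real credentials):

```xml
<!-- WASB driver credential for one storage account -->
<property>
  <name>fs.azure.account.key.mystorageaccount.blob.core.windows.net</name>
  <value>BASE64-ENCODED-ACCOUNT-KEY</value>
</property>
```

Because this file can be rewritten when the head node is reimaged, such entries should be managed through the cluster configuration rather than edited by hand.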

You can create a new storage account or point an existing one to your HDInsight cluster easily via the portal, as I show below:

Capture

You can point your HDInsight cluster to multiple storage accounts as well, as explained here – https://blogs.msdn.microsoft.com/mostlytrue/2014/09/03/hdinsight-working-with-different-storage-accounts/

You can also create storage account and container via Azure PowerShell like in this sample:

$SubscriptionID = "<Your Azure Subscription ID>"
$ResourceGroupName = "<New Azure Resource Group Name>"
$Location = "EAST US 2"

$StorageAccountName = "<New Azure Storage Account Name>"
$containerName = "<New Azure Blob Container Name>"

Add-AzureRmAccount
Select-AzureRmSubscription -SubscriptionId $SubscriptionID

# Create resource group
New-AzureRmResourceGroup -name $ResourceGroupName -Location $Location

# Create default storage account
New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -Location $Location -Type Standard_LRS

# Create default blob containers
$storageAccountKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -StorageAccountName $StorageAccountName)[0].Value
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
New-AzureStorageContainer -Name $containerName -Context $destContext

 

The URI scheme for accessing files in Azure storage from HDInsight is:

wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>

The URI scheme provides unencrypted access (with the wasb: prefix) and SSL encrypted access (with wasbs). Microsoft recommends using wasbs wherever possible, even when accessing data that lives inside the same region in Azure.

The <BlobStorageContainerName> identifies the name of the blob container in Azure storage. The <StorageAccountName> identifies the Azure Storage account name. A fully qualified domain name (FQDN) is required.
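The URI assembly is simple string composition; a small Python helper (my own sketch, not part of any Azure SDK) makes the parts explicit:

```python
def wasb_uri(container, account, path, secure=True):
    """Build a WASB(S) URI from its parts, per the scheme above."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

print(wasb_uri("data", "adlsgennadyk", "/retail/stock.txt"))
# wasbs://data@adlsgennadyk.blob.core.windows.net/retail/stock.txt
```

Note that secure defaults to True, matching Microsoft’s recommendation to prefer wasbs even within a region.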

I ran into a rather crazy little limitation\issue when working with WASB and HDInsight. Hadoop and Hive look for and expect a valid folder hierarchy to import data files, whereas WASB does not support a folder hierarchy, i.e. all blobs are listed flat under a container. The workaround is to use an SSH session to log into the head cluster node and use the mkdir command to manually create such a directory via the driver.

The SSH Procedure with HDInsight can be found here – https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix

Another workaround recommended to me was that the “/” character can be used within the key name to make it appear as if a file is stored within a directory structure. HDInsight sees these as if they are actual directories. For example, a blob’s key may be input/log1.txt. No actual “input” directory exists, but due to the presence of the “/” character in the key name, it has the appearance of a file path.
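This flat-namespace trick is easy to demonstrate; the following Python sketch (the key names are hypothetical) derives the apparent “directories” purely from the “/” characters in blob key names:

```python
# Blobs live in a flat namespace; "/" in the key only simulates folders.
blob_keys = ["input/log1.txt", "input/log2.txt", "output/result.txt", "readme.txt"]

def virtual_dirs(keys):
    """Derive the apparent top-level 'directories' from flat blob keys."""
    return sorted({k.split("/", 1)[0] for k in keys if "/" in k})

print(virtual_dirs(blob_keys))  # ['input', 'output']
```

No directory objects exist anywhere here; the structure is reconstructed from the key names alone, which is exactly what tools layered over blob storage do.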

For more see – https://social.technet.microsoft.com/wiki/contents/articles/6621.azure-hdinsight-faq.aspx , https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage , and https://www.codeproject.com/articles/597940/understandingpluswindowsplusazureplusblobplusstora

Hope this helps.

Meet Redis – Setting Up Redis On Ubuntu Linux

redis_thumb.jpg

I have been asked by a few folks for a quick tutorial on setting up Redis under systemd on Ubuntu Linux version 16.04.

I have blogged quite a bit about Redis in general – https://gennadny.wordpress.com/category/redis/ – however, just a quick line on Redis: Redis is an in-memory key-value store known for its flexibility, performance, and wide language support. That makes Redis one of the most popular key-value data stores in existence today. Below are steps to install and configure it to run under systemd on Ubuntu 16.04 and above.

The only prerequisite is an Ubuntu server with a non-root user that has sudo privileges.

Next steps are:

  • Log into your Ubuntu server with this user account
  • Update and install prerequisites via apt-get
             $ sudo apt-get update
             $ sudo apt-get install build-essential tcl
    
  • Now we can download and extract Redis to the /tmp directory
              $ cd /tmp
              $ curl -O http://download.redis.io/redis-stable.tar.gz
              $ tar xzvf redis-stable.tar.gz
              $ cd redis-stable
    
  • Next we can build Redis
        $ make
    
  • After the binaries are compiled, run the test suite to make sure everything was built correctly. You can do this by typing:
       $ make test
    
  • This will typically take a few minutes to run. Once it is complete, you can install the binaries onto the system by typing:
    $ sudo make install
    

Now we need to configure Redis to run under systemd. Systemd is an init system used in Linux distributions to bootstrap the user space and manage all processes subsequently, instead of the UNIX System V or Berkeley Software Distribution (BSD) init systems. As of 2016, most Linux distributions have adopted systemd as their default init system.

  • To start off, we need to create a configuration directory. We will use the conventional /etc/redis directory, which can be created by typing
    $ sudo mkdir /etc/redis
    
  • Now, copy over the sample Redis configuration file included in the Redis source archive:
         $ sudo cp /tmp/redis-stable/redis.conf /etc/redis
    
  • Next, we can open the file to adjust a few items in the configuration:
    $ sudo nano /etc/redis/redis.conf
    
  • In the file, find the supervised directive. Currently, this is set to no. Since we are running an operating system that uses the systemd init system, we can change this to systemd:
    . . .
    
    # If you run Redis from upstart or systemd, Redis can interact with your
    # supervision tree. Options:
    #   supervised no      - no supervision interaction
    #   supervised upstart - signal upstart by putting Redis into SIGSTOP mode
    #   supervised systemd - signal systemd by writing READY=1 to $NOTIFY_SOCKET
    #   supervised auto    - detect upstart or systemd method based on
    #                        UPSTART_JOB or NOTIFY_SOCKET environment variables
    # Note: these supervision methods only signal "process is ready."
    #       They do not enable continuous liveness pings back to your supervisor.
    supervised systemd
    
    . . .
    
  • Next, find the dir directive. This option specifies the directory that Redis will use to dump persistent data. We need to pick a location that Redis has write permission to and that isn’t viewable by normal users.
    We will use the /var/lib/redis directory for this, which we will create

    . . .
    
    
    # The working directory.
    #
    # The DB will be written inside this directory, with the filename specified
    # above using the 'dbfilename' configuration directive.
    #
    # The Append Only File will also be created inside this directory.
    #
    # Note that you must specify a directory here, not a file name.
    dir /var/lib/redis
    
    . . .
    

    Save and close the file when you are finished

  • Next, we can create a systemd unit file so that the init system can manage the Redis process.
    Create and open the /etc/systemd/system/redis.service file to get started:

    $ sudo nano /etc/systemd/system/redis.service
    
  • The file should look like this; create the sections below:
    [Unit]
    Description=Redis In-Memory Data Store
    After=network.target
    
    [Service]
    User=redis
    Group=redis
    ExecStart=/usr/local/bin/redis-server /etc/redis/redis.conf
    ExecStop=/usr/local/bin/redis-cli shutdown
    Restart=always
    
    [Install]
    WantedBy=multi-user.target
    
  • Save and close file when you are finished

Now, we just have to create the user, group, and directory that we referenced in the previous two files.
Begin by creating the redis user and group, then create the data directory and hand it over to that user:

$ sudo adduser --system --group --no-create-home redis
$ sudo mkdir /var/lib/redis
$ sudo chown redis:redis /var/lib/redis
$ sudo chmod 770 /var/lib/redis

Now we can start Redis:

  $ sudo systemctl start redis

Check that the service had no errors by running:

$ sudo systemctl status redis

And Eureka – here is the response

redis.service - Redis Server
   Loaded: loaded (/etc/systemd/system/redis.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2016-05-11 14:38:08 EDT; 1min 43s ago
  Process: 3115 ExecStop=/usr/local/bin/redis-cli shutdown (code=exited, status=0/SUCCESS)
 Main PID: 3124 (redis-server)
    Tasks: 3 (limit: 512)
   Memory: 864.0K
      CPU: 179ms
   CGroup: /system.slice/redis.service
           └─3124 /usr/local/bin/redis-server 127.0.0.1:6379    

Congrats! You can now start learning Redis. Connect to the Redis CLI by typing:

$ redis-cli

Now you can follow these Redis tutorials

Hope this was helpful

Dancing with Elephants and Flying with The Bees–Using ORC File Format with Apache Hive

hive

 

When you start with Hive on Hadoop, the clear majority of samples and tutorials will have you work with text files. However, some time ago the Hive community clearly recognized the disadvantages of text files as a file format in terms of storage efficiency and performance.

The first move to better columnar storage was the introduction of the Hive RCFile format. RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems. Hive added the RCFile format in version 0.6.0. RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part. Internals of the RCFile format can be found in the JavaDoc here – http://hive.apache.org/javadocs/r1.0.1/api/org/apache/hadoop/hive/ql/io/RCFile.html. What is important to note is the advantages it was introduced for:

  • As row-store, RCFile guarantees that data in the same row are located in the same node
  • As column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.
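
The row-split-then-columnar placement described above can be sketched as a toy model in Python (the split size and the list-of-lists representation are illustrative assumptions, not RCFile’s actual binary key/value encoding):

```python
# Toy model of RCFile's data placement: rows are first partitioned
# horizontally into row splits, then each split is stored column-wise.
def rcfile_layout(rows, split_size=2):
    # horizontal partitioning into row splits
    splits = [rows[i:i + split_size] for i in range(0, len(rows), split_size)]
    # vertical (columnar) partitioning within each split
    return [[list(col) for col in zip(*split)] for split in splits]

rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
print(rcfile_layout(rows))
# → [[[1, 2], ["a", "b"]], [[3, 4], ["c", "d"]]]
```

Because each split keeps all of a row’s columns on the same node while storing them contiguously per column, a scan of one column can skip the others entirely.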

As time passed, the explosion of data and the need for faster HiveQL queries pushed the need for further-optimized columnar storage file formats. Therefore, the ORC file format was introduced. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

This has following advantages over RCFile format:

  • a single file as the output of each task, which reduces the NameNode’s load
  • light-weight indexes stored within the file, allowing the reader to skip row groups that don’t pass predicate filtering and to seek to a given row
  • block-mode compression based on data type

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates count, min, max, and sum.
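
A minimal sketch of the per-column aggregates an ORC writer records in the footer might look like the following (the dictionary field names are illustrative; hasNull is the flag Hive 1.1.0+ records for row groups containing nulls):

```python
# Sketch: compute the column-level aggregates (count, min, max, sum)
# that an ORC file footer stores for a numeric column.
def column_stats(values):
    non_null = [v for v in values if v is not None]
    return {
        "count": len(non_null),           # count of non-null values
        "min": min(non_null),
        "max": max(non_null),
        "sum": sum(non_null),             # additionally stored for numeric types
        "hasNull": len(non_null) != len(values),
    }

print(column_stats([3, 1, None, 2]))
# → {'count': 3, 'min': 1, 'max': 3, 'sum': 6, 'hasNull': True}
```

A reader can consult these aggregates to answer queries like MIN(col) or to skip stripes whose min/max range cannot match a predicate, without touching the row data.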

What does it all mean for me?

What it means for us as implementers is the following:

  • Better read performance due to compression. Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
  • Introduction of column-level statistics for optimization, a feature that has long existed in pretty much all commercial RDBMS packages (Oracle, SQL Server, etc.). The goal of the column statistics is that for each column, the writer records the count and, depending on the type, other useful fields. For most of the primitive types, it records the minimum and maximum values; for numeric types it additionally stores the sum. From Hive 1.1.0 onwards, the column statistics will also record whether there are any null values within the row group by setting the hasNull flag.
  • Lightweight indexing
  • Larger blocks, 256 MB by default
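
The incremental, block-wise compression described in the first bullet can be sketched with zlib (a stand-in codec chosen for illustration; ORC’s real stream layout is more involved):

```python
import zlib

# Toy model: compress a stream in independent fixed-size blocks, so a
# reader can jump over whole compressed blocks without decompressing them.
def compress_blocks(data: bytes, block_size: int = 4):
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

blocks = compress_blocks(b"abcdefgh")
# decode only the second block, skipping the first entirely
print(zlib.decompress(blocks[1]))
# → b'efgh'
```

Because each block is compressed independently, a position can be addressed as (compressed block start, offset into the decompressed block), exactly the scheme the bullet describes.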

Hortonworks has published a good Hive file format comparison (chart not reproduced here).

Using ORC – create table with ORC format:

The simplest way to create an ORC-formatted Hive table is to add STORED AS ORC to the Hive CREATE TABLE statement, like:

CREATE TABLE my_table (
    column1 STRING,
    column2 STRING,
    column3 INT,
    column4 INT
) STORED AS ORC;

ORC File Format can be used together with Hive Partitioning, which I explained in my previous post.  Here is an example of using Hive partitioning with ORC File Format:

CREATE TABLE airanalytics
(flightdate date, dayofweek int, depttime int, crsdepttime int, arrtime int, crsarrtime int, uniquecarrier varchar(10), flightno int, tailnum int, aet int, cet int, airtime int, arrdelay int, depdelay int, origin varchar(5), dest varchar(5), distance int, taxin int, taxout int, cancelled int, cancelcode int, diverted string, carrdelay string, weatherdelay string, securtydelay string, cadelay string, lateaircraft string)
 PARTITIONED BY (flight_year string)
 CLUSTERED BY (uniquecarrier)
 SORTED BY (flightdate)
 INTO 24 BUCKETS
 STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE", "orc.stripe.size"="67108864", "orc.row.index.stride"="25000");

The parameters added at the table level are, as per the docs:

Key                      | Default   | Notes
-------------------------|-----------|-------------------------------------------------------------
orc.compress             | ZLIB      | high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size        | 262,144   | number of bytes in each compression chunk
orc.stripe.size          | 268435456 | number of bytes in each stripe
orc.row.index.stride     | 10,000    | number of rows between index entries (must be >= 1000)
orc.create.index         | true      | whether to create row indexes
orc.bloom.filter.columns | ""        | comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp     | 0.05      | false positive probability for bloom filter (must be >0.0 and <1.0)

If you have an existing Hive table, it can be moved to ORC via:

    • ALTER TABLE … [PARTITION partition_spec] SET FILEFORMAT ORC
    • SET hive.default.fileformat=Orc

For more information see – https://en.wikipedia.org/wiki/RCFile, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC, https://cwiki.apache.org/confluence/display/Hive/FileFormats, https://orc.apache.org/docs/hive-config.html

Hope this helps.

Dancing With Elephants and Flying With The Bees–Apache Hive Scaling Out with Partitions and Buckets

hive

In my previous post some time ago I introduced Apache Hive technology on Hadoop. Coming from SQL and RDBMS this was bound to be my favorite Hadoop technology.  Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in HDFS files.

Today, unlike the previous basics post, I will concentrate on Hive partitions and buckets. A simple query in Hive reads the entire dataset, even if it has a WHERE clause filter. This becomes a bottleneck for running MapReduce jobs over a large table. We can overcome this issue by implementing partitions in Hive. Hive makes it very easy to implement partitions by using the automatic partition scheme when the table is created.

Partitions. Just like in an RDBMS, data within a Hive table is split across multiple partitions. Each partition corresponds to a particular value(s) of the partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are read, thereby reducing the I/O and time required by the query. An example of creating a partitioned table in Hive is below; note you can do this with both internal and external tables:

CREATE TABLE CUSTOMER (
    userid             BIGINT,
    First_Name         STRING,
    Last_Name          STRING,
    address1           STRING,
    address2           STRING,
    city               STRING,
    zip_code           STRING,
    state              STRING
)
PARTITIONED BY (
    COUNTRY            STRING
);

As you can see, I am using COUNTRY as the partition column. The partition statement lets Hive alter the way it manages the underlying structures of the table’s data directory. If you browse the location of the data directory for a non-partitioned table, it will look like /<database name>.db/<table name>, and all the data files are written directly to this directory. In the case of partitioned tables, subdirectories are created under the table’s data directory for each unique value of a partition column. If the table is partitioned on multiple columns, then Hive creates nested subdirectories based on the order of the partition columns in the table definition. Example with the table above:

/gennadyk.db/customer/country=US

/gennadyk.db/customer/country=UK

When a partitioned table is queried with one or both partition columns in criteria or in the WHERE clause, what Hive effectively does is partition elimination by scanning only those data directories that are needed. If no partitioned columns are used, then all the directories are scanned (full table scan) and partitioning will not have any effect.
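
Partition elimination can be modeled as simple directory filtering over the layout shown above (a toy sketch; Hive’s planner does this against HDFS metadata, not Python lists):

```python
# Toy model of partition elimination: only directories matching the
# WHERE-clause value of the partition column are scanned.
partition_dirs = [
    "/gennadyk.db/customer/country=US",
    "/gennadyk.db/customer/country=UK",
]

def prune(dirs, country):
    # keep only the partition directories the query actually needs
    return [d for d in dirs if d.endswith(f"country={country}")]

print(prune(partition_dirs, "US"))
# → ['/gennadyk.db/customer/country=US']
```

A query without a predicate on country would fall through to scanning every directory, which is exactly the full-table-scan case described above.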

With a few quick changes it’s easy to configure Hive to support dynamic partition creation. Just as SQL Server has a SET command to change database options, Hive lets us change settings for a session using the SET command. Changing these settings permanently would require editing the configuration file and restarting the Hive cluster:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
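
With these settings in place, a dynamic-partition load might look like the following sketch (the staging table name and its columns are assumptions for illustration; Hive derives the partition value from the last column of the SELECT):

```sql
-- Hypothetical load: Hive creates a country=<value> partition
-- for each distinct country found in the staging data.
INSERT OVERWRITE TABLE customer PARTITION (country)
SELECT userid, first_name, last_name, address1, address2,
       city, zip_code, state, country
FROM customer_staging;
```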

Important notes on best practices

  • Be careful using dynamic partitions. Hive has some built-in limits on the number of partitions that can be dynamically created, as well as limits on the total number of files that can exist within Hive. Creating many partitions at once will create a lot of files, and creating a lot of files will use up memory in the Hadoop NameNode.
  • If your partitions are relatively small, then the expense of recursing directories becomes more expensive than simply scanning the data. Likewise, partitions should be roughly similar in size to prevent a single long-running thread from holding things up.

Buckets. Much like partitioning, bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance. As you create a Hive table you can use the CLUSTERED BY keyword to define buckets:

CREATE TABLE CUSTOMER (
    userid             BIGINT,
    First_Name         STRING,
    Last_Name          STRING,
    address1           STRING,
    address2           STRING,
    city               STRING,
    zip_code           STRING,
    state              STRING
)
PARTITIONED BY (
    COUNTRY            STRING
)
CLUSTERED BY (userid) INTO 10 BUCKETS;

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. The hash_function depends on the type of the bucketing column. For an int, as in our example, it’s easy: hash_int(i) == i. In our example we can expect all userids that end in 0 to land in the first bucket, all userids that end in 1 to land in the second, and so on. For other datatypes it unfortunately gets a lot more tricky: the hash of a string or a complex datatype will be some number derived from the value, but not anything humanly recognizable. In general, distributing rows based on the hash will give you an even distribution across the buckets, and that’s your goal here. Bucketing is entirely dependent on data being loaded to the table correctly, so loading the table properly is critical when bucketing.
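
For the int case, the bucket assignment rule above reduces to a modulus, which a few lines of Python can demonstrate (a sketch of the rule, not Hive’s internal code):

```python
# For int columns Hive's hash is the identity (hash_int(i) == i), so
# bucket = hash_function(userid) mod num_buckets = userid % 10 here.
def bucket_for(userid: int, num_buckets: int = 10) -> int:
    return userid % num_buckets

print(bucket_for(1230), bucket_for(987651))
# → 0 1
```

Any two userids sharing a last digit map to the same bucket, which is why an even spread of hash values is what makes bucketed joins and sampling effective.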

For more see – http://www.bidn.com/blogs/cprice1979/ssas/4646/partitions-amp-buckets-in-hive, https://www.qubole.com/blog/big-data/5-tips-for-efficient-hive-queries/, http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/, http://blog.cloudera.com/blog/2014/08/improving-query-performance-using-partitioning-in-apache-hive/, http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html