The Dead Philosophers Club – Streaming Data from SQL Server to Azure via Debezium and Apache Kafka


Debezium is a set of open source distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

I first came across the usefulness of Debezium and Kafka as a change streaming platform\mechanism from SQL Server to Azure thanks to an excellent blog post from Azure SQL PM Davide Mauri, whom I follow on Medium. At that time I was looking for a way to stream Change Data Capture changes out of SQL Server to Azure Data Lake for a number of customers, and after reading Davide's post I decided to take a deeper look. Kafka integration (which can also be done with the Azure Event Hubs Kafka endpoint) also allows you to stream changes to Azure Databricks via Spark's Kafka integration for further processing with Spark Streaming:


So this opens a lot of opportunities for near real-time data processing of changes from SQL Server without using transactional replication or Service Broker features.

Most commonly, Debezium is deployed via Apache Kafka Connect. Kafka Connect is a framework and runtime for implementing and operating:

  1. source connectors such as Debezium, which ingest data into Kafka and
  2. sink connectors, which propagate data from Kafka topics into other systems.



A typical Debezium CDC architecture is shown above, with data landing in various sinks like Elasticsearch, Infinispan, or another database.

Back to the Debezium\Kafka setup with SQL Server. How would we set this up based on my last post, where I set up Confluent Cloud Kafka?

Debezium works with SQL Server Change Data Capture. SQL Server needs to be configured for CDC at the database level. Assuming we have a database named demo on our source SQL Server:

USE [demo]
EXEC sys.sp_cdc_enable_db

Now let's do it at the table level. Assuming we have a table called orders in the demo database:

USE [demo]

EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name   = N'ORDERS',
@role_name     = NULL,
@supports_net_changes = 0

Verify that the user has access to the CDC table:

EXEC sys.sp_cdc_help_change_data_capture

If the results are empty, make sure that the user has privileges to access both the capture instance and the CDC tables.
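A quick way to confirm CDC is actually enabled is to query the standard CDC catalog views (the database and table names here are the ones used above):

USE [demo]
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = 'demo';
SELECT capture_instance, start_lsn FROM cdc.change_tables;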

Once that's done you can set up the connector. To do so you will need confluent-hub installed. I went through the following:

  1. Assuming you have already set up Confluent Cloud in Azure, as I blogged about in my last post
  2. Set up Kafka Connect as explained here –
  3. Download and install the confluent-hub client from – . A shocker for me personally was the lack of Windows support; the client appears to be available only for macOS and Linux. I had to use my spare old MacBook.
  4. Use the client to install the Debezium SQL Server connector in your Kafka Connect worker:
  5. Restart the worker.
confluent-hub install --no-prompt debezium/debezium-connector-sqlserver:0.10.0

Now, check that the plugin has been loaded successfully:

$ curl -s localhost:8083/connector-plugins|jq '.[].class'|grep debezium

Debezium will write to a topic with all of the data from SQL Server. Debezium also needs its own topic for tracking the DDL, and we need to pre-create both of these topics:

$ ccloud kafka topic create --partitions 1 dbz_dbhistory.mssql.asgard-01
$ ccloud kafka topic create mssql-01-mssql.dbo.ORDERS
$ ccloud kafka topic list

Now create the connector. It's a bit more verbose because we're using a secure Kafka cluster, and Debezium needs the connection details passed directly to it:

curl -i -X PUT -H  "Content-Type:application/json" \
    http://localhost:8083/connectors/source-debezium-mssql-01/config \
    -d '{
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector", 
    "database.hostname": "mssql",
    "database.port": "1433",
    "database.user": "sa",
    "database.password": "Admin123",
    "database.dbname": "demo",
    "": "mssql",
    "database.history.kafka.bootstrap.servers": "${file:/data/}:9092",
    "database.history.kafka.topic": "dbz_dbhistory.mssql.asgard-01",
    "": "SASL_SSL",
    "database.history.consumer.ssl.endpoint.identification.algorithm": "https",
    "database.history.consumer.sasl.mechanism": "PLAIN",
    "database.history.consumer.sasl.jaas.config": " required username=\"${file:/data/}\" password=\"${file:/data/}\";",
    "": "SASL_SSL",
    "database.history.producer.ssl.endpoint.identification.algorithm": "https",
    "database.history.producer.sasl.mechanism": "PLAIN",
    "database.history.producer.sasl.jaas.config": " required username=\"${file:/data/}\" password=\"${file:/data/}\";",
    "transforms": "unwrap,addTopicPrefix",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",

Now that everything is running from SQL Server -> Debezium -> Confluent Kafka, we can consume the change events from Kafka as illustrated here –
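As a quick sanity check at this point you could confirm the connector is in the RUNNING state and then read a few change events off the topic created earlier. The status endpoint is the standard Kafka Connect REST API; the consume command assumes the ccloud CLI used above and its flags may differ between CLI versions:

curl -s localhost:8083/connectors/source-debezium-mssql-01/status | jq '.connector.state'
ccloud kafka topic consume mssql-01-mssql.dbo.ORDERS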

The rest is up to you: what you want to do with the data and where to land it.

More information is here:

A good tutorial on using the Kafka endpoint in Event Hubs with the Debezium Docker image from Davide Mauri –

And Apache Kafka with Spark –

Hope this was interesting to some of you at least.

The Dead Philosophers Club – Starting Out with Apache Kafka


Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper cluster synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.

Kafka was developed at LinkedIn back in 2010, and it currently handles more than 1.4 trillion messages per day across over 1400 brokers.


As with any typical pub-sub messaging system, you have to understand the following terms to work with the Kafka architecture:

  • Producer – Producers are the publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers, where it gets appended to a partition.
  • Consumer – Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
  • Topic – A stream of messages belonging to a particular category is called a topic. Data is stored in topics. Topics can be divided into partitions. For each topic, Kafka keeps a minimum of one partition. Each such partition contains messages in an immutable ordered sequence. A partition is implemented as a set of segment files of equal sizes.
  • Broker – Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. Assuming, for example, there is one partition in a topic and one topic, each broker will have one partition.
  • Kafka Cluster – For high availability (HA), Kafka can have more than one broker, thereby forming a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.
  • Leader – The leader is the node responsible for all reads and writes for a given partition. Every partition has one server acting as a leader.
  • Follower – A follower is a node which follows the leader's instructions. If the leader fails, one of the followers will automatically become the new leader. A follower acts as a normal consumer: it pulls messages and updates its own data store.

Another great architecture diagram that shows Kafka messaging in action, from the Data Flair tutorial –


Because Kafka is designed for broker scalability and performance, producers (rather than brokers) are responsible for choosing which partition each message is sent to.  The default partition is determined by a hashing function on the message key, or round-robin in the absence of a key. However, this may not always provide the desired behaviour (e.g. message ordering, fair distribution of messages to consumers, etc). Producers can therefore send messages to specific partitions – through the use of a custom partitioner, or by using manual or hashing options available with the default partitioner.

A fundamental explanation of Kafka’s inner workings goes as follows: Every topic is associated with one or more partitions, which are spread over one or more brokers. Every partition gets replicated to those one or more brokers depending on the replication factor that is set. The replication factor is then responsible for determining the reliability, while the number of partitions is responsible for the parallelism for consumers. A partition is associated with only a single consumer instance per consumer group. Since the total consumer instances per group is less than – or the same as – the number of partitions, adding support for extra consumer instances requires that more partitions be added as well, but ensures read scalability.

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
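To make these concepts concrete, here is a minimal sketch in Python using the confluent-kafka client (assumed installed via pip install confluent-kafka); the broker address, topic, key, and group name are placeholders rather than anything from a real cluster:

from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
# messages with the same key hash to the same partition, preserving per-key ordering
producer.produce("orders", key="customer-42", value='{"amount": 10}')

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "": "order-processors",   # consumers sharing this group id split the partitions
    "auto.offset.reset": "earliest",
msg = consumer.poll(5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())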


Kafka works well as a replacement for a traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). This is where it competes with well-known brokers like Azure Service Bus, Google Pub/Sub, Azure Event Hubs, RabbitMQ, etc.

Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. That can be a part of a Kappa Big Data architecture. A lightweight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza, as well as the very popular Apache Spark Streaming. This is probably the most interesting use to me personally, and one I will explore in the next few blog articles I am planning, showcasing Kafka as part of such architectures. With this scenario, I will explore enterprise data management features like Kafka Connect, Apache Kafka's data integration tool, and the Schema Registry, the schema management service from Confluent. Generically it looks like the following:


Now I will show you how to quickly create a test Kafka cluster and topic using the Confluent Cloud platform. Confluent is a company founded by the former LinkedIn team that created Kafka, and even though you can use Apache Kafka as open source, Confluent adds a number of additional features and significant enhancements to the OSS product. I will create Confluent on Microsoft Azure by signing up for a trial here – .

Confluent adds a lot more features on top of Apache Kafka core, and even though I don't need those features for this post, it also allows me to create a Kafka cluster without allocating hardware, installing Linux, unpacking and installing libraries, etc. As a matter of fact, most of the enterprise customers I worked with recently that picked Kafka chose either Confluent or the Azure HDInsight PaaS platform to run it, to cut out the operational complexity of running it on bare metal or on VMs.

Once I am signed up for the cloud platform I can see this screen:


Now I can just click Create Cluster.

I will pick a cluster name and region in Azure. Note I picked Single Zone, as I have no HADR requirement for my test and am attempting to lower my costs. Obviously that will lead to lower SLA guarantees.


After entering billing information, in a couple of seconds I have a real Kafka cluster ready to go.


After you have a working Kafka cluster in Confluent Cloud, you can use the Confluent Cloud CLI to interact with your cluster from your local computer. For example, you can produce to and consume from your topic using the Confluent Cloud CLI. You can get CLI installers for various OS platforms here –

First, you need to log in from the CLI client after installing it.

ccloud login

After that you should see the following prompt:


Next, I can view my cluster:

ccloud kafka cluster list

You will see the following response:


If we have multiple clusters, before we do anything we should set the active one:

ccloud kafka cluster use lkc-r50gp

After this I can create a topic named test:

ccloud kafka topic create test

And now I can query for it:

ccloud kafka topic list

The result is the following:


So within a few minutes I have explained the concepts of Apache Kafka, created a Kafka cluster, and even have a ready-to-consume topic on Confluent in Azure. Next, we will do something more interesting with it, like streaming data from SQL Server, and will also look at setting up Kafka on the Azure HDInsight PaaS as well.
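Before moving on, a quick smoke test from the CLI is possible as well. Treat this as a sketch rather than exact syntax: an API key has to be created and set for the cluster first, and flag names vary between ccloud CLI versions:

ccloud api-key create --resource lkc-r50gp    # create credentials for the cluster
ccloud kafka topic produce test               # type a few messages, Ctrl-C to stop
ccloud kafka topic consume test               # read them back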


More on Apache Kafka –

Hope this helps.


Dancing with the Elephants and Flying with the Bees – Troubleshooting Vertex Errors in Hive Interactive\LLAP on Azure HDInsight


One of my customers running Interactive Hive with HDInsight on Tez ran into a fatal error like:

Code: 2, SQL State: 08S01]  Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 12, vertexId=vertex_1573064032618_0017_26_18, diagnostics=[Task failed, taskId=task_1573064032618_0017_26_18_000000, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 killed, TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) : java.lang.RuntimeException: Map operator initialization failed

As the customer noted, scaling the cluster doesn't stop the error. Looking deeper at the large error stack, one can see that we actually ran into a Java OOM error here:

Caused by: org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError: Hash table loading exceeded memory limits for input: Map 11 numEntries: 19000000 estimatedMemoryUsage: 2342546972 effectiveThreshold: 2331551914 memoryMonitorInfo: { isLlap: true executorsPerNode: 21 maxExecutorsOverSubscribeMemory: 3 memoryOverSubscriptionFactor: 0.20000000298023224 memoryCheckInterval: 100000 noConditionalTaskSize: 20000000 adjustedNoConditionalTaskSize: 32000000 hashTableInflationFactor: 1.5 threshold: 48000000 }

The cause, as usual, is a map join, as it is very memory intensive. A map-side join is a process where the join between two tables is performed in the map phase without involving the reduce phase. Map-side joins allow a table to be loaded into memory, ensuring a very fast join operation performed entirely within a mapper, without having to use both map and reduce phases.


The customer scenario appears to be pretty close to the issues with map joins and OOM on Tez shown here –

What is Tez, and why Hive on Tez vs. MapReduce? Tez is a new application framework built on Hadoop YARN that can execute complex directed acyclic graphs of general data processing tasks. In many ways it can be thought of as a more flexible and powerful successor of the map-reduce framework.
It generalizes map and reduce tasks by exposing interfaces for generic data processing tasks, which consist of a triplet of interfaces: input, output and processor. These tasks are the vertices in the execution graph. Edges (i.e.: data connections between tasks) are first class citizens in Tez and together with the input/output interfaces greatly increase the flexibility of how data is transferred between tasks.

As the de facto standard for SQL-In-Hadoop, Apache Hive is optimal for both batch and interactive queries at petabyte scale. Hive embeds Tez so that it can translate complex SQL statements into highly optimized, purpose-built data processing graphs that strike the right balance between performance, throughput, and scalability. Apache Tez innovations drove many of the Hive performance improvements delivered by the Stinger Initiative.  Tez helps make Hive interactive.

As explained in this blog post, when the Tez execution engine is used, the heap space used actually belongs to the Tez container. See the image below describing the Tez container memory.


As the blog post suggests, the following two memory settings define the container memory for the heap: hive.tez.container.size and From my experience, the OOM exception does not mean the container size is too small. It means the Java heap size ( is too small. So whenever you see OOM, you can try to increase "". If needed, you might also have to increase "hive.tez.container.size". The "java.opts" value should be around 80% of "container.size".

Note, the setting must always be smaller than hive.tez.container.size. So the quick way to fix the issue wouldn't be just scaling up; it would be scaling up RAM, upping the container size, and keeping the java.opts parameter at around 80% of it, otherwise that memory wouldn't be utilized by Tez.
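For example, sizing the two settings together in a Hive session could look like this; the numbers are purely illustrative and depend on the node SKU, the point is the roughly 80% ratio:

SET hive.tez.container.size=10240;        -- container size in MB
SET;      -- roughly 80% of the container size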

Are there any other ways to work around an issue with a complex query with lots of memory-intensive map joins in Hive on Tez?

Another parameter that controls map join memory utilization is hive.auto.convert.join.noconditionaltask.size. What is it? When the estimated size of the small table(s) is smaller than this threshold (10 MB by default), a common join can be converted to a map join automatically. At least in these docs – Cloudera\Hortonworks recommend setting it to 1/3 of hive.tez.container.size and setting tez.runtime.unordered.output.buffer.size-mb to 10% of hive.tez.container.size.
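Following that guideline, and continuing with the illustrative 10240 MB container from above, the settings would look roughly like this (noconditionaltask.size is specified in bytes, the Tez buffer in MB):

SET hive.auto.convert.join.noconditionaltask.size=3579139413;   -- ~1/3 of the container, in bytes
SET tez.runtime.unordered.output.buffer.size-mb=1024;           -- ~10% of the container, in MB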

Another reason that map joins may be picked and may not work very efficiently with memory is a lack of cost-based optimization and bad statistics. If you see the following when running EXPLAIN on the problem query:

“Plan not optimized by CBO due to missing feature [Hint].

Vertex dependency in root stage”

If you see this, update statistics:

analyze table t compute statistics;

analyze table t compute statistics for columns;


Hope this will help someone troubleshooting this kind of error with an HDInsight Interactive Hive cluster.

For more see –

Spark in the Clouds – Running Azure Databricks


Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.

Spark was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data reuse is accomplished through the creation of DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects cached in memory and reused in multiple Spark operations. This dramatically lowers latency, making Spark multiple times faster than MapReduce, especially when doing machine learning and interactive analytics.

The Apache Spark project's core is the main execution engine for Spark SQL (SQL and HiveQL), Spark Streaming, and the machine learning and graph processing engines built on top of Spark Core. You can run them using the provided APIs.


There are three key Spark interfaces that you should know about:

  • RDD – Resilient Distributed Dataset. Apache Spark’s first abstraction was the RDD. It is an interface to a sequence of data objects that consist of one or more types that are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame. These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • DataSet. A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages
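To make the distinction concrete, here is a tiny PySpark sketch; the column names are made up, and spark is the SparkSession that notebooks typically provide. It shows the same filter expressed against an RDD and against a DataFrame:

rdd = spark.sparkContext.parallelize([("philosophy", 1), ("history", 2)])
df = rdd.toDF(["topic", "cnt"])                      # DataFrame: named columns, optimizer-friendly
df.filter(df["cnt"] > 1).show()                      # declarative, goes through the query optimizer
print(rdd.filter(lambda row: row[1] > 1).collect())  # same logic at the "lowest level" RDD API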

Databricks is a company founded by the creators of Apache Spark, that aims to help clients with cloud-based big data processing using Spark. Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.


Setting up Azure Databricks Workspace via Azure Portal is really easy.

If you don't have an Azure subscription, create a free account before you begin. You can then navigate to the Azure Portal and click + Create Resource to open the New Resource blade.


Pick Analytics category and Azure Databricks service:


Under Azure Databricks Service, provide the values to create a Databricks workspace.


In the workspace name, provide a unique name for your workspace, pick your subscription, the location of the Azure datacenter where the workspace will be created, and the resource group, as well as the pricing tier for the service. You can pick between the standard and premium pricing tiers; for details on each see – . For the sake of this tutorial I will pick standard.

Click Create, and in a few minutes your workspace will be created. Once that happens, in the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.


Once you log in to the Azure Databricks workspace you should see a screen like this:


Here you can pick the Clusters icon on the side and create a Databricks cluster. How would this work, creating a Databricks Spark cluster in Azure? Well, when a customer launches a cluster via Databricks, a "Databricks appliance" is deployed as an Azure resource in the customer's subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer's subscription that is populated with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance.


Databricks clusters provide a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Once you click Clusters button image on side toolbar it will open Clusters page.



Hit the Create Cluster button, and in the resulting screen you can name your cluster as well as pick the cluster type. In Databricks you can create two different types of clusters: standard and high concurrency. Standard clusters are the default and can be used with Python, R, Scala, and SQL. High-concurrency clusters are tuned to provide efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users. High-concurrency clusters support only the SQL, Python, and R languages. For my tutorial I will create a Standard cluster.


I will now download the data source file from GitHub. I will put this file in Azure Blob Storage. In order to do that I will create an Azure Storage account:

  • In the Azure portal, select Create a resource. Select the Storage category, and select Storage Accounts
  • Provide a unique name for the storage account.
  • Select Account Kind: Blob Storage
  • Select a Resource Group name. Use the same resource group you created the Databricks workspace in.

Next we add Storage Container to the Storage Account and upload source data file:

  • Open the storage account in the Azure portal.
  • Select Blobs.
  • Select + Container to create a new empty container.
  • Provide a Name for the container.
  • Select the Private (no anonymous access) access level.
  • Once the container is created, select the container name.
  • Select the Upload button.
  • On the Files page, select the Folder icon to browse and select the sample file for upload.
  • Select Upload to upload the file.

Once your cluster is created and the source data is uploaded to Azure storage, you can go to Workspace and create a notebook.


These notebooks can be written in Scala, Python, etc. I will pick Scala:


Once you have created the notebook, we can mount the storage account where our source data file lives to /mnt/mypath. In the following snippet, replace {YOUR CONTAINER NAME}, {YOUR STORAGE ACCOUNT NAME}, and {YOUR STORAGE ACCOUNT ACCESS KEY} with the appropriate values for your Azure Storage account. Paste the snippet in an empty cell in the notebook and then press SHIFT + ENTER to run the code cell.

  source = "wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}/",
  mountPoint = "/mnt/mypath",
  extraConfigs = Map("{YOUR STORAGE ACCOUNT NAME}" -> "{YOUR STORAGE ACCOUNT ACCESS KEY}"))

Once the file is mounted, we can use its data to create a temporary table and move the data there:


DROP TABLE IF EXISTS radio_sample_data;

CREATE TABLE radio_sample_data USING json

OPTIONS ( path "/mnt/mypath/small_radio_json.json" )

Now you can select data from that table:

SELECT * from radio_sample_data

The result should be there in a second or so:


Note that even without knowledge of Scala, working only in SQL or Python, it's pretty easy to get started here.
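For instance, the same data can be read from Python in another notebook cell; this is just a sketch assuming the mount point and file name used above, and display() is the helper Databricks notebooks provide:

df ="/mnt/mypath/small_radio_json.json")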

To learn more about Azure Databricks see – 


Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 2 – USQL and Programmability


This is a continuation of the previous post, which can be found here. As I stated, the primary programming language for Azure Data Lake Analytics is U-SQL. If you, like myself, come from a SQL background, you will find that many U-SQL queries look like SQL; however, after a second look, those of you familiar with .NET and LINQ will notice that there are familiar constructs there as well.

There are two ways to run U-SQL scripts:

    1. You can run U-SQL scripts on your own machine. The data read and written by this script will be on your own machine. You aren't using Azure resources to do this, so there is no additional cost. This method of running U-SQL scripts is called "U-SQL Local Execution".
    2. You can run U-SQL scripts in Azure in the context of a Data Lake Analytics account. The data read or written by the script will also be in Azure – typically in an Azure Data Lake Store account. You pay for any compute and storage used by the script. This is called “U-SQL Cloud Execution”   

If you wish to run scripts locally via your copy of Visual Studio 2015, you can install the Data Lake Tools here – . With Visual Studio 2017, the tooling is part of the Azure Development workload.


In the previous post I showed how to run U-SQL scripts in the cloud, but today we will use Visual Studio 2017 with Data Lake Tools to run U-SQL scripts.

Once you have installed the tools, you can connect to your Azure Data Lake Analytics account and run a U-SQL script by right-clicking and picking Run U-SQL Script from the menu. Running such a script will look like this in VS 2017:


You can see stats and timings right in Visual Studio, just like you would in the Azure Portal.


When something is wrong with a U-SQL script it will not compile. Here are some of the common things that may drive you mad and that you should watch out for:

  1. Invalid case. U-SQL is case sensitive, unlike SQL. So the following script will error out because it uses lower-case from vs. upper-case FROM:
    @searchlog = 
        EXTRACT UserId          int, 
                Start           DateTime, 
                Region          string, 
                Query           string, 
                Duration        int, 
                Urls            string, 
                ClickedUrls     string
        from @"/SearchLog.tsv"
        USING Extractors.Tsv();
    OUTPUT @searchlog 
        TO @"/SearchLog_output.tsv"
        USING Outputters.Tsv();
  2.  Bad path to the input or output file. Check your paths; this one is self-evident, but I spent hours debugging my path\folder issues.
  3. Invalid C# expressions due to typos, etc.

When you develop U-SQL scripts, you can save time and expense by running the scripts locally on your machine before they are ready to be run in the cloud. You can connect to your local folder path via the Visual Studio Data Lake tools and run queries there:


A local data root folder is a local store for the local compute account. Any folder in the local file system on your local machine can be a local data root folder. It’s the same as the default Azure Data Lake Store account of a Data Lake Analytics account. Switching to a different data root folder is just like switching to a different default store account.

When you run a U-SQL script, a working directory folder is needed to cache compilation results, run logs, and perform other functions. In Azure Data Lake Tools for Visual Studio, the working directory is the U-SQL project's working directory. It's located under <U-SQL Project Root Path>/bin/debug. The working directory is cleaned every time a new run is triggered.

Now that you have the basics, you can delve into U-SQL constructs following the language tutorials by Microsoft's Michael Rys here – , and Adjeta Sighal here –

In my next installment I am hoping to go through a typical ADLA analytics job from start to completion.

For more on Azure Data Lake Analytics see –

Good Luck and Happy Coding!

Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 1 – Introduction


Previously, I wrote a post about Google BigQuery, the GCP service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. Similar services are now provided by all public cloud vendors; Microsoft Azure has a service known as Azure Data Lake Analytics that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blob storage.

According to Microsoft, Azure Data Lake Analytics lets you:

  • Analyze data of any kind and of any size.
  • Speed up and sharpen your development and debug cycles.
  • Use the new U-SQL processing language built especially for big data.
  • Rely on Azure’s enterprise-grade SLA.
  • Pay only for the processing resources you actually need and use.
  • Benefit from the YARN-based technology extensively tested at Microsoft.

ADLA is based on top of the YARN technology. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs. The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.


For more on the YARN architecture see – . Being based on YARN helps Azure Data Lake Analytics with extreme scalability. Data Lake Analytics can work with a number of Azure data sources: Azure Blob storage, Azure SQL Database, Azure SQL Data Warehouse, Azure Data Lake Store, and Azure SQL Database in an Azure VM. Azure Data Lake Analytics is specially optimized to work with Azure Data Lake Store, providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of SQL with the expressive power of C#. It takes a bit to learn for a typical SQL person, but it's pretty powerful.

So, enough theory; let me show how you can crunch Big Data workloads without creating a large Hadoop cluster or setting up infrastructure, paying only for the storage and compute you actually use.

The first thing we will need before starting to work in the Azure cloud is a subscription. If you don't have one, browse to and follow the instructions to sign up for a free 30-day trial subscription to Microsoft Azure.

In my example I will use a sample retail dataset with a couple of stock and sales data files, which I will upload to Azure Data Lake Store. The Azure Data Lake Store is an Apache Hadoop file system compatible with the Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Azure Data Lake Store is built for running large-scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers. This improves the read throughput when reading the file in parallel for performing data analytics.

To upload these files, I will first create an Azure Data Lake Store called adlsgennadyk:

  • Navigate to Azure Portal
  • Find service called Data Lake Storage Gen 1
  • Click Create button


In the form, I will name my Azure Data Lake Store, pick the Azure resource group where it will reside, and choose the billing model, which can be either the usual pay-as-you-go or prepayment in advance.


Once the Data Lake Store is created, I will click on the Data Explorer button to launch that tool and upload the files I will be analyzing.


Next, I will use the tool to upload the demo retail dataset file called stock, which stores stocked retail product information in tab-delimited format.


Here is the dataset as can be seen in Excel:


Now that the data has been uploaded to Azure, let's create an instance of the Azure Data Lake Analytics service. Again, the action sequence is the same:

  • Navigate to Azure Portal
  • Find service called Data Lake Analytics
  • Click Create button

The resulting form is very similar to what I did with storage above, except I will point my ADLA instance to the storage instance created above.


Once the Azure Data Lake Analytics instance is created, you are presented with this screen:


Once I click on the New Job button I can run a brand new query against my files in Azure Data Lake Storage. I will show you the simplest U-SQL script here:

@stock = 
    EXTRACT Id   int, 
            Item string 
    FROM "/stock.txt"
    USING Extractors.Tsv();

OUTPUT @stock
    TO "/SearchLog_output.tsv"
    USING Outputters.Tsv();

Here is what I just asked Azure Data Lake Analytics to do: extract all of the data from a file and copy the output to another one.

Some items to know:

  • The script contains a number of U-SQL keywords: EXTRACT, FROM, TO, OUTPUT, USING, etc.
  • U-SQL keywords are case sensitive. Keep this in mind – it’s one of the most common errors people run into.
  • The EXTRACT statement reads from files. The built-in extractor called Extractors.Tsv handles Tab-Separated-Value files.
  • The OUTPUT statement writes to files. The built-in outputter called Outputters.Tsv handles Tab-Separated-Value files.
  • From the U-SQL perspective files are “blobs” – they don’t contain any usable schema information. So U-SQL supports a concept called “schema on read” – this means the developer specified the schema that is expected in the file. As you can see the names of the columns and the datatypes are specified in the EXTRACT statement.
  • The default Extractors and Outputters cannot infer the schema from the header row – in fact, by default they assume that there is no header row (this behavior can be overridden).

After the job executes you can see its execution analysis and graph:


Now let's do something more meaningful by introducing a WHERE clause to filter the dataset:

@stock = 
    EXTRACT Id   int, 
            Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

    @output = 
    SELECT *
    FROM @stock
    WHERE Item == "Tape dispenser (Black)";

OUTPUT @output
    TO "/stock2_output.tsv"
    USING Outputters.Tsv();

The job took about 30 seconds to run, including writing to the output file, which took most of the time here. Looking at the graph by duration, one can see where the time was spent:


In the next post I plan to delve deeper into Data Lake Analytics, including using C# functions and U-SQL catalogs.


For more on Data Lake Analytics see –


Happy swimming in Azure Data Lake, hope this helps.

Data Cargo In The Clouds – Data and Containerization On The Cloud Platforms


This is a topic I have actually planned to pivot to for quite a while, but I either had no bandwidth\time with moving to Seattle or some other excuse like that. This topic is interesting to me as it is an intersection of technologies where I used to spend quite a bit of time, including:

  • Data, including both traditional RDBMS databases and NoSQL designs
  • Cloud, including Microsoft Azure, AWS and GCP
  • Containers and container orchestration (that is area new to me)

Those who worked with me displayed various degrees of agreement, but more often annoyance, with me pushing this topic as a growing part of the data engineering discipline, but I truly believe in it. Before we start combining all these areas and do something useful in code, I will spend some time here on the most basic concepts of what I am about to enter.

I will skip an introduction to the basic concepts of RDBMS, NoSQL or Big Data\Hadoop technologies; unless you have been hiding under a rock for the last five years, you should be quite aware of those. That brings us straight to containers:

As definition states – “A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. Available for both Linux and Windows based apps, containerized software will always run the same, regardless of the environment. Containers isolate software from its surroundings, for example differences between development and staging environments and help reduce conflicts between teams running different software on the same infrastructure.”

Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. This could be from a developer’s laptop to a test environment, from a staging environment into production, and perhaps from a physical machine in a data center to a virtual machine in a private or public cloud.


But why containers if I already have VMs? 

VMs take up a lot of system resources. Each VM runs not just a full copy of an operating system, but a virtual copy of all the hardware that the operating system needs to run. This quickly adds up to a lot of RAM and CPU cycles. In contrast, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program.
What this means in practice is that you can put two to three times as many applications on a single server with containers as you can with a VM. In addition, with containers you can create a portable, consistent operating environment for development, testing, and deployment.

On Linux, containers run on top of LXC. This is a userspace interface for the Linux kernel containment features. It includes an application programming interface (API) to enable Linux users to create and manage system or application containers.

Docker is an open platform tool to make it easier to create, deploy and to execute the applications by using containers.

  • The heart of Docker is the Docker engine. The Docker engine is the part of Docker which creates and runs Docker containers. A Docker container is a live, running instance of a Docker image. Docker Engine is a client-server based application with the following components:
    • A server, which is a continuously running service called a daemon process.
    • A REST API, which programs use to talk to the daemon and instruct it what to do.
    • A command line interface client.


  • The command line interface client uses the Docker REST API to interact with the Docker daemon through CLI commands. Many other Docker applications also use the API and CLI. The daemon process creates and manages Docker images, containers, networks, and volumes.
  • The Docker client is the primary service by which Docker users communicate with Docker. When we use commands such as "docker run", the client sends these commands to dockerd, which executes them.

You will build Docker images using Docker and deploy these into what are known as Docker registries. When we run the docker pull and docker run commands, the required images are pulled from our configured registry. Using the docker push command, an image can be uploaded to our configured registry.

Finally, by deploying instances of images from your registry you deploy containers. We can create, run, stop, or delete a container using the Docker CLI. We can connect a container to more than one network, or even create a new image based on its current state. By default, a container is well isolated from other containers and its host machine. A container is defined by its image and the configuration options that we provide when we create or run it.
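As a minimal sketch of that image-to-container flow (the image name is Microsoft's public SQL Server 2017 Linux image; the password and container name are just examples):

docker pull
docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Str0ngPassw0rd!' \
   -p 1433:1433 --name sql1 -d
docker ps                        # list running containers
docker stop sql1 && docker rm sql1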

For more info on containers, LXC and Docker see –

That brings us to container orchestration engines. While the CLI meets the needs of managing one container on one host, it falls short when it comes to managing multiple containers deployed on multiple hosts. To go beyond the management of individual containers, we must turn to orchestration tools. Orchestration tools extend lifecycle management capabilities to complex, multi-container workloads deployed on a cluster of machines.

Some of the well known orchestration engines include:

Kubernetes. Almost a standard nowadays, originally developed at Google. Kubernetes' architecture is based on a master server with multiple minions. The command line tool, called kubecfg, connects to the API endpoint of the master to manage and orchestrate the minions. The service definition, along with the rules and constraints, is described in a JSON file. For service discovery, Kubernetes provides a stable IP address and DNS name that corresponds to a dynamic set of pods. When a container running in a Kubernetes pod connects to this address, the connection is forwarded by a local agent (called the kube-proxy) running on the source machine to one of the corresponding backend containers. Kubernetes supports user-implemented application health checks. These checks are performed by the kubelet running on each minion to ensure that the application is operating correctly.


For more see –

Apache Mesos. This is an open source cluster manager that simplifies the complexity of running tasks on a shared pool of servers. A typical Mesos cluster consists of one or more servers running the mesos-master and a cluster of servers running the mesos-slave component. Each slave is registered with the master to offer resources. The master interacts with deployed frameworks to delegate tasks to slaves. Unlike other tools, Mesos ensures high availability of the master nodes using Apache ZooKeeper, which replicates the masters to form a quorum. A high availability deployment requires at least three master nodes. All nodes in the system, including masters and slaves, communicate with ZooKeeper to determine which master is the current leading master. The leader performs health checks on all the slaves and proactively deactivates any that fail.


Over the next couple of posts I will create containers running SQL Server and other data stores, both RDBMS and NoSQL, deploy these to the cloud, and finally attempt to orchestrate them as well. So let's move off theory and pictures into the world of practical data engine deployments.

Hope you will find this detour interesting.

Forecast Cloudy – Using Azure Blob Storage with Apache Hive on HDInsight

The beauty of working with Big Data in Azure is that you can manage (create\delete) compute resources with your HDInsight cluster independent of your data, which is stored either in Azure Data Lake or Azure Blob storage. In this case I will concentrate on using Azure Blob storage\WASB as the data store for the HDInsight Azure PaaS Hadoop service.

With a typical Hadoop installation you load your data to a staging location, then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back as a result set for consumption by other systems (think Power BI, Tableau, etc.).
Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. It in many ways “is” HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is what enables your data to persist even when no clusters currently exist and enables multiple clusters plus other applications to access a single piece of data all at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.


In Azure you store blobs on containers within Azure storage accounts. You grant access to a storage account, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft’s documentation helps to show the structure:


Hold on, isn't the whole selling point of Hadoop proximity of data to compute? Yes, and just like with any other on-premises Hadoop system, data is loaded into memory on the individual nodes at compute time. With the Azure data infrastructure setup and a data center network backbone built for performance, your job performance is generally the same or better than if you used disks locally attached to the VMs.

Below is a diagram of the HDInsight data storage architecture:


HDInsight provides access to the distributed file system that is locally attached to the compute nodes. This file system can be accessed by using the fully qualified URI, for example:
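hdfs://<namenodehost>/<path>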


More important is the ability to access data that is stored in Azure Storage. The syntax is:
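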


As per the documentation, you need to be aware of the following:

  • Container security for WASB storage. For containers in storage accounts that are connected to the cluster, because the account name and key are associated with the cluster during creation, you have full access to the blobs in those containers. For public containers that are not connected to the cluster, you have read-only permission to the blobs in the containers. For private containers in storage accounts that are not connected to the cluster, you can't access the blobs in the containers unless you define the storage account when you submit the WebHCat jobs.
  • The storage accounts that are defined in the creation process and their keys are stored in %HADOOP_HOME%/conf/core-site.xml on the cluster nodes. The default behavior of HDInsight is to use the storage accounts defined in the core-site.xml file. It is not recommended to directly edit the core-site.xml file, because the cluster head node (master) may be reimaged or migrated at any time, and any changes to this file are not persisted.

You can create a new storage account or point an existing one to the HDInsight cluster easily via the portal, as I show below:


You can point your HDInsight cluster to multiple storage accounts as well, as explained here –

You can also create storage account and container via Azure PowerShell like in this sample:

$SubscriptionID = "<Your Azure Subscription ID>"
$ResourceGroupName = "<New Azure Resource Group Name>"
$Location = "EAST US 2"

$StorageAccountName = "<New Azure Storage Account Name>"
$containerName = "<New Azure Blob Container Name>"

Select-AzureRmSubscription -SubscriptionId $SubscriptionID

# Create resource group
New-AzureRmResourceGroup -name $ResourceGroupName -Location $Location

# Create default storage account
New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -Location $Location -Type Standard_LRS

# Create default blob containers
$storageAccountKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -StorageAccountName $StorageAccountName)[0].Value
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
New-AzureStorageContainer -Name $containerName -Context $destContext


The URI scheme for accessing files in Azure storage from HDInsight is:
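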


The URI scheme provides unencrypted access (with the wasb: prefix) and SSL encrypted access (with wasbs). Microsoft recommends using wasbs wherever possible, even when accessing data that lives inside the same region in Azure.

The <BlobStorageContainerName> identifies the name of the blob container in Azure storage. The <StorageAccountName> identifies the Azure Storage account name. A fully qualified domain name (FQDN) is required.

I ran into a rather crazy little limitation\issue when working with WASB and HDInsight. Hadoop and Hive look for and expect a valid folder hierarchy to import data files, whereas WASB does not support a folder hierarchy, i.e. all blobs are listed under a container. The workaround is to use an SSH session to log into the cluster head node and use the mkdir command to manually create such a directory via the driver.
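For example, from an SSH session on the head node, something like the following creates the expected "directory" structure (the container and account names here are placeholders):

hdfs dfs -mkdir -p wasbs://
hdfs dfs -ls wasbs://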

The SSH procedure for HDInsight can be found here –

Another workaround recommended to me was that the "/" character can be used within the key name to make it appear as if a file is stored within a directory structure. HDInsight sees these as if they are actual directories. For example, a blob's key may be input/log1.txt. No actual "input" directory exists, but due to the presence of the "/" character in the key name, it has the appearance of a file path.
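Putting it together, a Hive external table can then point at such a WASB "directory"; this is only a sketch with placeholder container, account, and table names:

CREATE EXTERNAL TABLE logs (line STRING)
LOCATION 'wasbs://';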

For more see –

Hope this helps.

Meet Redis – Setting Up Redis On Ubuntu Linux


I have been asked by a few folks for a quick tutorial on setting up Redis under systemd on Ubuntu Linux version 16.04.

I have blogged quite a bit about Redis in general – , but here is just a quick line on Redis itself. Redis is an in-memory key-value store known for its flexibility, performance, and wide language support. That makes Redis one of the most popular key-value data stores in existence today. Below are the steps to install and configure it to run under systemd on Ubuntu 16.04 and above.

Here are the prerequisites: an Ubuntu 16.04 (or later) server and a non-root user account with sudo privileges.

Next steps are:

  • Log into your Ubuntu server with this user account
  • Update and install prerequisites via apt-get
             $ sudo apt-get update
             $ sudo apt-get install build-essential tcl
  • Now we can download and extract Redis to the tmp directory
              $ cd /tmp
              $ curl -O
              $ tar xzvf redis-stable.tar.gz
              $ cd redis-stable
  • Next we can build Redis
        $ make
  • After the binaries are compiled, run the test suite to make sure everything was built correctly. You can do this by typing:
       $ make test
  • This will typically take a few minutes to run. Once it is complete, you can install the binaries onto the system by typing:
    $ sudo make install

Now we need to configure Redis to run under systemd. Systemd is an init system used in Linux distributions to bootstrap the user space and manage all processes subsequently, instead of the UNIX System V or Berkeley Software Distribution (BSD) init systems. As of 2016, most Linux distributions have adopted systemd as their default init system.

  • To start off, we need to create a configuration directory. We will use the conventional /etc/redis directory, which can be created by typing:
    $ sudo mkdir /etc/redis
  • Now, copy over the sample Redis configuration file included in the Redis source archive:
         $ sudo cp /tmp/redis-stable/redis.conf /etc/redis
  • Next, we can open the file to adjust a few items in the configuration:
    $ sudo nano /etc/redis/redis.conf
  • In the file, find the supervised directive. Currently, this is set to no. Since we are running an operating system that uses the systemd init system, we can change this to systemd:
    . . .
    # If you run Redis from upstart or systemd, Redis can interact with your
    # supervision tree. Options:
    #   supervised no      - no supervision interaction
    #   supervised upstart - signal upstart by putting Redis into SIGSTOP mode
    #   supervised systemd - signal systemd by writing READY=1 to $NOTIFY_SOCKET
    #   supervised auto    - detect upstart or systemd method based on
    #                        UPSTART_JOB or NOTIFY_SOCKET environment variables
    # Note: these supervision methods only signal "process is ready."
    #       They do not enable continuous liveness pings back to your supervisor.
    supervised systemd
    . . .
  • Next, find the dir directive. This option specifies the directory that Redis will use to dump persistent data. We need to pick a location that Redis will have write permission to and that isn't viewable by normal users.
    We will use the /var/lib/redis directory for this, which we will create later.

    . . .
    # The working directory.
    # The DB will be written inside this directory, with the filename specified
    # above using the 'dbfilename' configuration directive.
    # The Append Only File will also be created inside this directory.
    # Note that you must specify a directory here, not a file name.
    dir /var/lib/redis
    . . .

    Save and close the file when you are finished

  • Next, we can create a systemd unit file so that the init system can manage the Redis process.
    Create and open the /etc/systemd/system/redis.service file to get started:

    $ sudo nano /etc/systemd/system/redis.service
  • The file should look like this; create the sections below:
    Description=Redis In-Memory Data Store

    ExecStart=/usr/local/bin/redis-server /etc/redis/redis.conf
    ExecStop=/usr/local/bin/redis-cli shutdown

  • Save and close file when you are finished

Now, we just have to create the user, group, and directory that we referenced in the previous two files.
Begin by creating the redis user and group, create the /var/lib/redis directory, and then give the redis user ownership of that directory:

$ sudo adduser --system --group --no-create-home redis
$ sudo mkdir /var/lib/redis
$ sudo chown redis:redis /var/lib/redis

Now we can start Redis:

  $ sudo systemctl start redis

Check that the service had no errors by running:

$ sudo systemctl status redis

And Eureka – here is the response

redis.service - Redis Server
   Loaded: loaded (/etc/systemd/system/redis.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2016-05-11 14:38:08 EDT; 1min 43s ago
  Process: 3115 ExecStop=/usr/local/bin/redis-cli shutdown (code=exited, status=0/SUCCESS)
 Main PID: 3124 (redis-server)
    Tasks: 3 (limit: 512)
   Memory: 864.0K
      CPU: 179ms
   CGroup: /system.slice/redis.service
           └─3124 /usr/local/bin/redis-server    

Congrats! You can now start learning Redis. Connect to the Redis CLI by typing:

$ redis-cli
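A couple of basic commands to confirm the server responds (these are standard Redis commands):> ping
PONG> set greeting "hello"
OK> get greeting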

Now you can follow these Redis tutorials

Hope this was helpful

Dancing with Elephants and Flying with The Bees–Using ORC File Format with Apache Hive



When you start with Hive on Hadoop, the clear majority of samples and tutorials will have you work with text files. However, some time ago the disadvantages of text files as a file format were clearly seen by the Hive community in terms of storage efficiency and performance.

The first move to better columnar storage was the introduction of the Hive RCFile format. RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems. Hive added the RCFile format in version 0.6.0. RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part. Internals for the RCFile format can be found in the JavaDoc here – . What is important to note is why it was introduced, as far as advantages:

  • As row-store, RCFile guarantees that data in the same row are located in the same node
  • As column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.

As time passed, the explosion of data and the need for higher speed in HiveQL queries pushed the need for further optimized columnar storage file formats. Therefore, the ORC file format was introduced. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

This has the following advantages over the RCFile format:

  • a single file as the output of each task, which reduces the NameNode’s load
  • light-weight indexes stored within the file, allowing the reader to skip row groups that don't pass predicate filtering and to seek to a given row
  • block-mode compression based on data type

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates count, min, max, and sum.


What does it all mean for me?

What it means for us as implementers is the following:

  • Better read performance due to compression. Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
  • Introduction of column-level statistics for optimization, a feature that has long existed in pretty much all commercial RDBMS packages (Oracle, SQL Server, etc.). The goal of the column statistics is that for each column, the writer records the count and, depending on the type, other useful fields. For most of the primitive types, it records the minimum and maximum values; for numeric types it additionally stores the sum. From Hive 1.1.0 onwards, the column statistics will also record if there are any null values within the row group by setting the hasNull flag.
  • Light-weight indexing
  • Larger blocks, by default 256 MB

Here is a good Hive file format comparison from Hortonworks:


Using ORC – creating a table with the ORC format:

The simplest way to create an ORC-formatted Hive table is to add STORED AS ORC to the Hive CREATE TABLE statement, like:

CREATE TABLE my_table (
column1 STRING,
column2 STRING,
column3 INT,
column4 INT
)

The ORC file format can be used together with Hive partitioning, which I explained in my previous post. Here is an example of using Hive partitioning with the ORC file format:

CREATE  TABLE airanalytics 
(flightdate date ,dayofweek int,depttime int,crsdepttime int,arrtime int,crsarrtime int,uniquecarrier varchar(10),flightno int,tailnum int,aet int,cet int,airtime int,arrdelay int,depdelay int,origin varchar(5),dest varchar(5),distance int,taxin int,taxout int,cancelled int,cancelcode int,diverted string,carrdelay string,weatherdelay string,securtydelay string,cadelay string,lateaircraft string) 
 PARTITIONED BY (flight_year String)
 clustered BY (uniquecarrier)
 sorted BY (flightdate)
 INTO 24 buckets
 stored AS orc tblproperties ("orc.compress"="NONE","orc.stripe.size"="67108864", "orc.row.index.stride"="25000")

The parameters added at the table level are, as per the docs:

  • orc.compress – high level compression (one of NONE, ZLIB, SNAPPY)
  • orc.compress.size – number of bytes in each compression chunk
  • orc.stripe.size – number of bytes in each stripe
  • orc.row.index.stride – number of rows between index entries (must be >= 1000)
  • orc.create.index – whether to create row indexes
  • orc.bloom.filter.columns – comma separated list of column names for which a bloom filter should be created
  • orc.bloom.filter.fpp – false positive probability for bloom filter (must be >0.0 and <1.0)

If you have an existing Hive table, it can be moved to ORC via:

    • SET hive.default.fileformat=Orc
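Another common approach, if you want to convert the existing data rather than just change the default format for new tables, is a CREATE TABLE AS SELECT into an ORC table; the table names here are placeholders:

CREATE TABLE my_table_orc STORED AS ORC AS SELECT * FROM my_table;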

For more information see –

Hope this helps.