Previously, I wrote a post about Google BigQuery, a GCP service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. Similar services are now provided by all public cloud vendors; Microsoft Azure has a service known as Azure Data Lake Analytics that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blob storage.
According to Microsoft, Azure Data Lake Analytics lets you:
- Analyze data of any kind and of any size.
- Speed up and sharpen your development and debug cycles.
- Use the new U-SQL processing language built especially for big data.
- Rely on Azure’s enterprise-grade SLA.
- Pay only for the processing resources you actually need and use.
- Benefit from the YARN-based technology extensively tested at Microsoft.
ADLA is built on top of YARN. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs. The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
For more on YARN architecture see – https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html . Being based on YARN is what gives Azure Data Lake Analytics its extreme scalability. Data Lake Analytics can work with a number of Azure data sources: Azure Blob storage, Azure SQL Database, Azure SQL Data Warehouse, Azure Data Lake Store, and Azure SQL Database in an Azure VM. Azure Data Lake Analytics is especially optimized to work with Azure Data Lake Store, providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of SQL with the expressive power of C#. It takes a bit of learning for a typical SQL person, but it is pretty powerful.
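To give a feel for that mix of SQL and C#, here is a minimal sketch. The file path, columns, and values below are made up purely for illustration and are not part of the demo dataset used later in this post:

// Illustrative only: path, columns and values are hypothetical.
// Note the C# datatypes in the schema and C# expressions in the query.
@sales =
    EXTRACT Region string,
            Amount decimal,
            SaleDate DateTime
    FROM "/samples/sales.tsv"
    USING Extractors.Tsv();

@result =
    SELECT Region.ToUpper() AS Region,       // C# string method
           Amount * 1.2m AS AmountWithTax    // C# decimal arithmetic
    FROM @sales
    WHERE SaleDate.Year == 2018;             // C# property access and == comparison

OUTPUT @result
TO "/samples/sales_2018.tsv"
USING Outputters.Tsv();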
So enough theory – let me show how you can crunch big data workloads without creating a large Hadoop cluster or setting up infrastructure, paying only for the storage and compute you actually use.
The first thing we will need before starting to work in the Azure cloud is a subscription. If you don't have one, browse to https://azure.microsoft.com/en-us/free/?v=18.23 and follow the instructions to sign up for a free 30-day trial subscription to Microsoft Azure.
In my example I will use a sample retail dataset with a couple of stock and sales data files, which I will upload to Azure Data Lake Store. Azure Data Lake Store is an Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. It is built for running large-scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers, which improves read throughput when the file is read in parallel for data analytics.
To upload these files, I will first create an Azure Data Lake Store account called adlsgennadyk:
- Navigate to Azure Portal
- Find service called Data Lake Storage Gen 1
- Click Create button
In the form, I will name my Azure Data Lake Store account, pick the Azure resource group where it will reside, and choose a billing model, which can be either the usual pay-as-you-go or a commitment paid in advance.
Once the Data Lake Store account has been created, I will click the Data Explorer button to launch that tool and upload the files I will be analyzing.
Next, I will use the tool to upload the demo retail dataset file called stock, which stores stocked retail product information in tab-delimited format.
Here is the dataset as can be seen in Excel:
Now that the data has been uploaded to Azure, let's create an instance of the Azure Data Lake Analytics service. Again, the action sequence is the same:
- Navigate to Azure Portal
- Find service called Data Lake Analytics
- Click Create button
The resulting form is very similar to the one for storage above, except that I will point my ADLA instance to the storage instance I created earlier.
Once the Azure Data Lake Analytics instance is created, you are presented with this screen:
Once I click the New Job button, I can run a brand new query against my files in Azure Data Lake Store. I will show you the simplest possible U-SQL script here:
@stock =
    EXTRACT Id int,
            Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

OUTPUT @stock
TO "/SearchLog_output.tsv"
USING Outputters.Tsv();
Here is what I just asked Azure Data Lake Analytics to do: extract all of the data from a file and copy the output to another one.
Some items to know:
- The script contains a number of U-SQL keywords:
- U-SQL keywords are case sensitive. Keep this in mind – it’s one of the most common errors people run into.
- The EXTRACT statement reads from files. The built-in extractor called Extractors.Tsv handles tab-separated-value files.
- The OUTPUT statement writes to files. The built-in outputter called Outputters.Tsv handles tab-separated-value files.
- From the U-SQL perspective files are “blobs” – they don’t contain any usable schema information. So U-SQL supports a concept called “schema on read” – this means the developer specifies the schema that is expected in the file. As you can see, the names of the columns and the datatypes are specified in the EXTRACT statement.
- The default extractors and outputters cannot infer the schema from a header row – in fact, by default they assume that there is no header row (this behavior can be overridden – see the sketch after this list).
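For example, if a file did have a header row, skipping it might look roughly like this. This is a hedged sketch: the file path is made up, and skipFirstNRows and outputHeader are the built-in extractor/outputter parameters as I understand them, so check the current U-SQL documentation before relying on them:

// Sketch only: "/stock_with_header.txt" is a hypothetical file with one header row.
@stock =
    EXTRACT Id int,
            Item string
    FROM "/stock_with_header.txt"
    USING Extractors.Tsv(skipFirstNRows: 1);   // ignore the header row instead of trying to parse it as data

OUTPUT @stock
TO "/stock_clean.tsv"
USING Outputters.Tsv(outputHeader: true);      // write the column names back out as a header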
After the job executes, you can see its execution analysis and graph:
Now let's do something more meaningful by introducing a WHERE clause to filter the dataset:
@stock =
    EXTRACT Id int,
            Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

@output =
    SELECT *
    FROM @stock
    WHERE Item == "Tape dispenser (Black)";

OUTPUT @output
TO "/stock2_output.tsv"
USING Outputters.Tsv();
The job took about 30 seconds to run, including writing to the output file, which took most of the time here. Looking at the graph by duration, one can see where the time was spent:
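Because U-SQL expressions are C# expressions, the filter does not have to be an exact match. As a small variation of the same job (the output path /stock3_output.tsv is made up for illustration), a substring filter could look like this:

// Variation of the filter above: Item.Contains is an ordinary C# string method,
// so tape dispensers of any colour would match. Output path is illustrative only.
@stock =
    EXTRACT Id int,
            Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

@output =
    SELECT *
    FROM @stock
    WHERE Item.Contains("Tape dispenser");   // C# method call instead of an exact match

OUTPUT @output
TO "/stock3_output.tsv"
USING Outputters.Tsv();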
In the next post I plan to delve deeper into Data Lake Analytics, including using C# functions and U-SQL catalogs.
For more on Data Lake Analytics see – https://docs.microsoft.com/en-us/azure/data-lake-analytics/ , https://youtu.be/4PVx-7SSs4c, https://cloudacademy.com/blog/azure-data-lake-analytics/, and https://optimalbi.com/blog/2018/02/20/fishing-in-an-azure-data-lake/
Happy swimming in Azure Data Lake, hope this helps.