Spark in the Clouds – Running Azure Databricks


Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.

Spark was created to address the limitations to MapReduce, by doing processing in-memory, reducing the number of steps in a job, and by reusing data across multiple parallel operations. With Spark, only one-step is needed where data is read into memory, operations performed, and the results written back—resulting in a much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data re-use is accomplished through the creation of DataFrames, an abstraction over Resilient Distributed Dataset (RDD), which is a collection of objects that is cached in memory, and reused in multiple Spark operations. This dramatically lowers the latency making Spark multiple times faster than MapReduce, especially when doing machine learning, and interactive analytics.

The Apache Spark project is main execution engine for Spark SQL ( SQL and HiveQL) , Spark streaming, machine learning and graph processing engines built on top of Spark Core. You can run them using provided API.


There are three key Spark interfaces that you should know about:

  • RDD – Resilient Distributed Dataset. Apache Spark’s first abstraction was the RDD. It is an interface to a sequence of data objects that consist of one or more types that are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame. These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • DataSet. A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages

Databricks is a company founded by the creators of Apache Spark, that aims to help clients with cloud-based big data processing using Spark. Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.


Setting up Azure Databricks Workspace via Azure Portal is really easy.

If you don’t have an Azure subscription, create a free account before you begin. You can then navigate to Azure Portal and click + Create Resource to open New Resource blade


Pick Analytics category and Azure Databricks service:


Under Azure Databricks Service, provide the values to create a Databricks workspace.


In the workspace name provide unique name for your workspace, pick your subscription , location of Azure datacenter where workspace will be created, resource group service  as well as pricing tier for the service.  You can pick between standard and premum pricing tiers , for details on each see –  For the sake of this tutorial I will pick standard.

Click Create and in few minutes your workspace will be created. Once that happens, in the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.


Once you login to Azure Databricks workspace you should see a screen like this:


Here you can pick Clusters icon on the side and create a Databricks Cluster.  How would this work, creating a Databricks Spark Cluster in Azure? Well, when a customer launches a cluster via Databricks, a “Databricks appliance” is deployed as an Azure resource in the customer’s subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer’s subscription that we populate with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance.


Databricks clusters provide a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Once you click Clusters button image on side toolbar it will open Clusters page.



Hit Create Cluster button and in resulting screen you can name your cluster, as well as pick cluster type.  In Databricks you can create two different types of clusters: standard and high concurrency. Standard clusters are the default and can be used with Python, R, Scala, and SQL. High-concurrency clusters are tuned to provide the efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users. High concurrency clusters support only SQL, Python, and R languages. For my tutorial I will create a Standard cluster.


I will now download data scource file from GitHub  . I will put this file in Azure Blob Storage. In order to do that I will create an Azure Storage Account:

  • In the Azure portal, select Create a resource. Select the Storage category, and select Storage Accounts
  • Provide a unique name for the storage account.
  • Select Account Kind: Blob Storage
  • Select a Resource Group name. Use the same resource group you created the Databricks workspace.

Next we add Storage Container to the Storage Account and upload source data file:

  • Open the storage account in the Azure portal.
  • Select Blobs.
  • Select + Container to create a new empty container.
  • Provide a Name for the container.
  • Select Private (non anonymous access) access level.
  • Once the container is created, select the container name.
  • Select the Upload button.
  • On the Files page, select the Folder icon to browse and select the sample file for upload.
  • Select Upload to upload the file.

Once your cluster is created nd source data is uploaded to Azure storage you can go to Workspace and create a notebook.


These notebooks can be written in Scala,  Python, etc. I can pick Scala:


Once you created notebook we can now mount storage account where our source data file is to /mnt/mypath. In the following snippet, replace {YOUR CONTAINER NAME}, {YOUR STORAGE ACCOUNT NAME}, and {YOUR STORAGE ACCOUNT ACCESS KEY} with the appropriate values for your Azure Storage account. Paste the snippet in an empty cell in the notebook and then press SHIFT + ENTER to run the code cell.

mountPoint = “/mnt/mypath”,

Once file is mounted we can use its data to create a temporary table and move data there:


DROP TABLE IF EXISTS radio_sample_data;

CREATE TABLE radio_sample_data USING json

OPTIONS ( path “/mnt/mypath/small_radio_json.json” )

Now you can select data from that table:

SELECT * from radio_sample_data

Result should be there in a second or so:


Note that even without knowledge of Scala, only working in SQL or Python its pretty easy to get started here.

To learn more about Azure Databricks see – 



Forecast Cloudy – Set Up SQL Server Managed Instance via Azure PowerShell


Managed Instance is a new managed database service that represents fully-managed SQL Server Instance in Azure cloud. It shares the same code with the latest version of SQL Server Database Engine and has the latest features, performance improvements, and security patches.

Let’s step back and look at why Azure SQL Managed Instances are being released? Currently there is a gap of functionality between the retail SQL Server and Azure SQL Database. For example: a client that has an application that uses multiple databases and lots of code using three part names. Azure SQL Database doesn’t support three part names, so if you wanted to migrate this instance to Azure, your only choice would be to deploy and manage it as a SQL Server VM on IaaS. And yes, you can enable backup and patching agents, but at the end of the day, the configuration of the VM, maintenance of the OS and all the responsibility for SQL Server is still in your hands.

This is where the Azure SQL Managed Instance comes in. It provides a fully managed, higher feature-compatible instance that is a lot more similar to the SQL Server instance concept that we all know from the retail product. You don’t have to be responsible of every bit like with IaaS, while at the same time, you can continue running as an instance, with the features that are not compatible with the Azure SQL Database single database model. This means migration is straightforward without having to worry about compatibility or features not supported. It’s also easier because you don’t have to build every single piece of IaaS.

If you have experience with Azure SQL Database you might be thinking, how does this compare to the elastic database pools? The difference is that the pools are still created from databases that are single entities and the only thing they share are the elastic resources. On the other hand, Azure SQL Managed Instance will hold the database inside the instance container so they will be able to use three part names, linked servers, and other features that we are accustomed to using on the on-premises SQL Server.

Best of all, because Azure SQL Managed Instance is built on top of the PaaS framework that powers Azure SQL Database, once it’s migrated, it is fully managed in terms of OS, built-in HA, backups, etc.

As we can see in the diagram below, Azure SQL Managed Instance sits between having a full VM running SQL Server and the fully contained individual database paradigm of Azure SQL Database. And just like Azure SQL Db, with a Managed Instance the client can enjoy the continuous improvement and release of features that come with Microsoft’s cloud-first development model.


Managed Instance is a resource in your network hosted by Azure cloud. You need to create Azure VNET and a dedicated subnet where the instance should be deployed. There are some networking constraints for the VNET/subnet that you need to review before you create a managed instance. There is no public IP address dedicated to the Managed Instance. Only applications in customer network can access Managed Instance. In addition, your network administrators have the full control and can configure access to Managed Instance using standard methods such as Network security Groups and firewalls.

Managed Instance enables you to choose how many CPU cores you want to use and how much storage you need. You can create a Managed Instance with 16 cores and 500GB storage, and then increase or decrease these numbers depending on your needs. Changing CPU or storage in instance can be done via Azure portal using simple slider:


Architecture of Managed Instance.

Managed Instance has split compute and storage components. You have compute nodes with 8, 16, or 24 cores, that work with database files stored on Azure Premium disk storage. Every database file is placed on separate Azure premium disk, that guarantees that database files cannot be lost. Although Managed Instance relies on Azure Premium disks, it also has separate backup mechanism that ensures that even if something would happen with the files, platform would still be able to restore files from backups.


There are also redundant passive nodes that are used to fail-over compute instance if needed.

Managed Instance is fully Platform as a Service database offer. High availability, automated backups, point-in-time restore, automatic plan correction, threat detection, vulnerability assessment, and other intelligent features are built-in into service     without any additional charge.

OS patching and database upgrades are handled automatically and do not require any action.

In addition to built-in monitoring and maintenance features, you can use any 3rd-party tool to monitor and manage your instance, because most of the system views are exposed.

Connecting to Managed Instance

Azure SQL Managed Instance is not a service on public endpoint. Azure SQL Managed Instance is placed on private IP address in your VNET. It is just hosted and managed by Azure cloud.


There are several ways to connect to your Managed Instance.

  1. You can create VPN connection between your on-premises network and Azure VNET where Managed Instance is placed. Managed Instance will be seen as any other SQL Server in your network.
  2. You can connect via some Azure VM in your Azure VNET (using standard RDP or via some app that is in your Azure VM). If your VM is placed in some other AzureVNET you need to establish peering between these two networks.
  3. You can Connect your web application to Managed Instance using Azure Apps VNET Integration or by placing your Web App into App Service environment that is placed in VNET. If your Web App or App service Environment is placed in another VNET, you need to establish peering between these two networks.

For more on Connecting see

I want to create Managed Instance and test it out, how do I do it?

Assuming you  have created  Site-To-Site VPN of sorts at least we will need to place Managed Instance in special subnet inside a VNET connected to that VPN so Managed Instance can be seen from on premises.  Here are official docs for network configuration – and . Custom DNS may need to be set in Azure like this – .

I used  Azure PowerShell  below to login to Azure Subscription connected to my on-premises network via Express Route and set up VNET, Subnets and Managed Instance itself, here is example script I run to create Demo instance.  Obviously disregard my subscription, VNET, subnet names, address ranges, etc, instead substitute proper values for you:

$SubscriptionName = ‘Visual Studio Enterprise’
$resourceGroup = ‘SQLMIDemo’
$location = ‘East US’
$vNetName = ‘SQLMIDemoVNET’

#login to azure
Select-AzureRmSubscription $SubscriptionName

#create resource group
New-AzureRmResourceGroup -ResourceGroupName $resourceGroup -Location $location

#create new virtual network
$virtualNetwork = New-AzureRmVirtualNetwork -ResourceGroupName $resourceGroup -Location $location -Name $vNetName -AddressPrefix

#create default subnet
$subnetConfig = Add-AzureRmVirtualNetworkSubnetConfig -Name default -AddressPrefix -VirtualNetwork $virtualNetwork

#set up mi subnet

$subnetConfigMi = Add-AzureRmVirtualNetworkSubnetConfig -Name mi -AddressPrefix -VirtualNetwork $virtualNetwork

$virtualNetwork | Set-AzureRmVirtualNetwork

#enable access to azure management service
$routeTableMiManagementService = New-AzureRmRouteTable -Name ‘mySQLMIRouteTableMiManagementService’ -ResourceGroupName $resourceGroup -location $location

Set-AzureRmVirtualNetworkSubnetConfig -VirtualNetwork $virtualNetwork -Name ‘mi’ -AddressPrefix -RouteTable $routeTableMiManagementService |

Get-AzureRmRouteTable -ResourceGroupName $resourceGroup -Name ‘mySQLMIRouteTableMiManagementService’ |
Add-AzureRmRouteConfig -Name ‘ToManagedInstanceManagementService’ -AddressPrefix -NextHopType ‘Internet’ |

#install managed instance

New-AzureRmSqlManagedInstance -Name “demosqlmi”
-ResourceGroupName $resourceGroup -Location $location -SubnetId $`
-AdministratorCredential (Get-Credential) `
-StorageSizeInGB 1024 -VCore 8 -Edition “GeneralPurpose” `
-ComputeGeneration Gen4 -LicenseType LicenseIncluded

You can install Azure PowerShell from or use Portal to set up everything as per – and

Once you created networking prerequisites and SQL Azure MI you can restore SQL database into MI as explained here –

How can I be sure that my application database features are supported in SQL Azure MI? 

Majority of mainstream SQL Server features are supported in SQL Azure Managed Instance, including such as jobs,  CLR, linked servers, etc.  For feature matrix can be found here –

For more see –

Steering in the Clouds – Running SQL Server on Azure Kubernetes Managed Service (AKS)


In this post I would love to combine my favorite RDBMS , which is SQL Server with hottest Azure service out there – AKS. AKS is Azure hosted Kubernetes based container orchestration service. AKS reduces the complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure. As a hosted Kubernetes service, Azure handles critical tasks like health monitoring and maintenance for you.

Why run SQL Server in containers orchestrated by Kubernetes?

Over the past year, Kubernetes — also known as K8s — has become a dominant topic of conversation in the infrastructure world. Given its pedigree of literally working at Google-scale, it makes sense that people want to bring that kind of power to their DevOps stories; container orchestration turns many tedious and complex tasks into something as simple as a declarative config file.

The rise of orchestration is predicated on a few things, though. First, organizations have moved toward breaking up monolithic applications into microservices. However, the resulting environments have hundreds (or thousands) of these services that need to be managed. Second, infrastructure has become cheap and disposable — if a machine fails, it’s dramatically cheaper to replace it than triage the problems.

So, to solve the first issue, orchestration relies on the boon of the second; it manages services by simply letting new machines, running the exact same containers, take the place of failed ones, which keeps a service running without any manual interference.

However, up until recently K8s were only used by stateless aps and couldn’t work with stateful applications and databases. Managing state in Kubernetes is difficult because the system’s dynamism is too chaotic for most databases to handle–especially SQL databases that offer strong consistency.

Only since StateFull sets were introduced in K8s we can start thinking about running databases in K8s. By far the most common way to run a database, StatefulSets is a feature fully supported as of the Kubernetes 1.9 release. Using it, each of your pods is guaranteed the same network identity and disk across restarts, even if it’s rescheduled to a different physical machine. In AKS in order to run an RDBMS like SQL Server I will need to create persisted volume.

A Persisted Volume (PV) is a piece of storage in the cluster that has been provisioned by an administrator. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual pod that uses the PV.   This will allow our pods to go up and down and our volume to stay available.


So all you need to setup SQL Server in Azure Kubernetes PaaS service (AKS) is really an Azure account.  FIrst thing lets install Azure CLI. I am using Windows machine, but recently I was able to install and run Azure CLI on mac as well , of course.  Instructions for Windows install can be found here –

If you dont want to install Azure CLI on your machine you can simply use it with Azure Cloud Shell on the Azure Portal – . Recently I used GCP Cloud Shell and this , as well as Azure Cloud Shell are great idea if you dont want to or cannot install CLI\SDK on your machine.

Once you installed CLI, lets login to your Azure subscription.

az login

This will launch your default browser for you to login.  Once you logged in , you can list subscriptions you have via this command

az account list

Below picture shows output of above command , I erased all of the unique information


Once the login process is complete and you are operating in the right subscription,  the resource group to hold all the objects of the Kubernetes cluster can be created:

az group create --name GennadyKSQLK8s--location eastus

You should be able to see new resource group in Azure Portal UI:


Once resource group is created we can build AKS cluster in that resource group.

az aks create --resource-group GennadyKSQLK8s --name mySQLK8sCluster --node-count 2 --generate-ssh-keys

There is a lot going behslightly ind the scenes of this one command so you may have to be a bit more patient, something I am not known for:


In few minutes we are done and cluster is created.  Next lets install kubectl to manage Kubernetes cluster.

az aks install-cli

Setup PATH for kubectl:

set PATH=%PATH%;C:\Users\Gennady\.azure-kubectl

Next, lets grab credentials in order to connect and manage the cluster: –

az aks get-credentials --resource-group=GennadyKSQLK8s  --name=mySQLK8sCluster

The cluster is setup and we can connect. Lets test it by listing cluster nodes:

kubectl get nodes

The output is below:


There are the two nodes of the cluster and it is ready to host a sql instance in a pod.  Now we need to create new persisted volume and instance of SQL Server on Linux.

We will begin by creating SA password for SQL Server in Kubernetes cluster. Kubernetes can manage configuration information like passwords as secrets.

kubectl create secret generic mssql --from-literal=SA_PASSWORD="MyP@ssw0rd1"

Now we need persistent volume and persistent volume claim in Kubernetes. Here are the steps to get that created:

      • Create a manifest to define the storage class and the persistent volume claim. The manifest specifies the storage provisioner, parameters, and reclaim policy. The Kubernetes cluster uses this manifest to create the persistent storage. I will use following YAML to define my storage class and persistent volume claim:
        kind: StorageClass
             name: azure-disk
          storageaccounttype: Standard_LRS
          kind: Managed
        kind: PersistentVolumeClaim
        apiVersion: v1
          name: mssql-data
          - ReadWriteOnce
              storage: 8Gi

        I will save above YAML as disk.yaml on my computer.

      • Next , we create the persistent volume claim in Kubernetes. Lets run this command:
        kubectl apply -f C:\Users\Gennady\Downloads\disk.yaml

        Obviously, your yaml file location may be different than mine above
        Result shown below:

        The persistent volume is automatically created as an Azure storage account, and bound to the persistent volume claim

      • Verify the persistent volume claim.  We can run following command:
        kubectl describe pvc mssql-data

        The output can be seen below:


Now you should be able to see this volume in the portal as part of our resource group:


Now, that the persistent volume is added we need to add the SQL Server deployment. The deployment creates a replica set. The replica set creates the pod.  The deployment described via YAML manifest like below:

apiVersion: apps/v1beta1
kind: Deployment
  name: mssql-deployment
  replicas: 1
        app: mssql
      terminationGracePeriodSeconds: 10
      - name: mssql
        image: microsoft/mssql-server-linux
        - containerPort: 1433
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
              name: mssql
              key: SA_PASSWORD 
        - name: mssqldb
          mountPath: /var/opt/mssql
      - name: mssqldb
          claimName: mssql-data
apiVersion: v1
kind: Service
  name: mssql-deployment
    app: mssql
    - protocol: TCP
      port: 1433
      targetPort: 1433
  type: LoadBalancer

I will save above YAML to my computer as deployment.yaml file. Now I can run this to deploy:

kubectl apply -f C:\Users\Gennady\Downloads\deployment.yaml

You should see following output:


The deployment and service are created. The SQL Server instance is in a container, connected to persistent storage. Now lets connect to the pod to verify pod was created via kubectl get pod


Now lets verify that services are running via kubectl get services.


This command returns services that are running, as well as the internal and external IP addresses for the services.

Finaly , lets connect to our SQL Server instance using sqlcmd:


Hope this helps and happy steering the cloud databases. For more see –


Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 2 – USQL and Programmability


This is a continuation on previous post that can be found here. As I stated primary programming language for Azure Data Lake Analytics will be U-SQL.  If you like myself, come from SQL background you will find that many U-SQL queries lookn like SQL, however after a second look those of you familar with .NET and LINQ will notice that there are familiar constructs there as well.

There are two ways to run U-SQL scripts:

    1. You can run U-SQL scripts on your own machine. The data read and written by this script will be on you own machine. You aren’t using Azure resources to do this so there is no additional cost. This method of running U-SQL scripts is called “U-SQL Local Execution”
    2. You can run U-SQL scripts in Azure in the context of a Data Lake Analytics account. The data read or written by the script will also be in Azure – typically in an Azure Data Lake Store account. You pay for any compute and storage used by the script. This is called “U-SQL Cloud Execution”   

If you wish to run scripts locally via your copy of Visual Studio 2015 you can install Data Lake Tools here – . With Visual Studio 2017 tool is part of Azure Development workload.


In the previous post I shown how to run U-SQL scripts on the cloud , but today we will use Visual Studio 2017 with Data Lake Tools to run U-SQL scripts.

Once you installed the tools you can connect to your Azure Data Lake Analytics account and run U-SQL Script via right clicking and picking Run U-SQL Script from the menu. Running such script will look like this in VS 2017:


You can see stats and timings just like you would in Azure Portal as well right in Visual Studio


When something is wrong with a U-SQL script it will not compile. Here some of the common things that may drive you mad and you should watch out for:

  1. Invalid case. U-SQL is case sensitive, unlike SQL.  So following script will error out as it uses lower case from vs. upper case FROM:
    @searchlog = 
        EXTRACT UserId          int, 
                Start           DateTime, 
                Region          string, 
                Query           string, 
                Duration        int, 
                Urls            string, 
                ClickedUrls     string
        from @"/SearchLog.tsv"
        USING Extractors.Tsv();
    OUTPUT @searchlog 
        TO @"/SearchLog_output.tsv"
        USING Outputters.Tsv();
  2.  Bad path to input or otput file. Check your paths, this is one is self evident, but I spent hours debugging my path\folder issues
  3. Invalid C# expression due to typos, etc

When you develop U-SQL scripts, you can save time and expense by running the scripts locally on your machine before they are ready to be ran in the cloud.  You can connect to your local folder path via Visual Studio Data Lake tools and run queries there:


A local data root folder is a local store for the local compute account. Any folder in the local file system on your local machine can be a local data root folder. It’s the same as the default Azure Data Lake Store account of a Data Lake Analytics account. Switching to a different data root folder is just like switching to a different default store account.

When you run a U-SQL script, a working directory folder is needed to cache compilation results, run logs, and perform other functions. In Azure Data Lake Tools for Visual Studio, the working directory is the U-SQL project’s working directory. t’s located under <U-SQL Project Root Path>/bin/debug. The working directory is cleaned every time a new run is triggered.

Now that you got the basics you can delve into U-SQL coinstructs following the laguage tutorials by Microsoft’s Michael Rys here – , and Adjeta Sighal here –

In my next installment I am hoping to go through typical ADLA analytics job from start to completion.

For more on Azure Data Lake Analytics see –

Good Luck and Happy Coding!

Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 1 – Introduction


Previously, I wrote a post about Google Big Query , GCP service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. Similar services are provided now by all public cloud vendors, Microsoft Azure has a service known as Azure Data Lake Analytics , that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blog storage.

According to Microsoft, Azure Data Lake Analytics lets you:

  • Analyze data of any kind and of any size.
  • Speed up and sharpen your development and debug cycles.
  • Use the new U-SQL processing language built especially for big data.
  • Rely on Azure’s enterprise-grade SLA.
  • Pay only for the processing resources you actually need and use.
  • Benefit from the YARN-based technology extensively tested at Microsoft.

ADLA is based on top of the YARN technology. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs. The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.


For more on YARN architecture see – .  What being based on YARN helps Azure Data Lake Analytics helps is with extreme scalability. Data Lake Analytics can work with a number of Azure data sources: Azure Blob storage, Azure SQL database, Azure SQL Data Warehouse, Azure Store, and Azure SQL Database in Azure VM. Azure Data Lake Analytics is specially optimized to work with Azure Data Lake Store—providing the highest level of performance, throughput, and parallelization for your big data workloads.Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of SQL with the expressive power of C#. It takes a bit to learn for typical SQL person, but its pretty powerful.

So enough theory and let me show how you can cruch Big Data workloads without creating a large Hadoop cluster or setting up infrastructure, paying only for what you used in storage and compute.

The first thing we will need before starting to work in Azure cloud is subscription. If you dont have one, browse to and follow the instructions to sign up for a free 30-day trial subscription to Microsoft Azure.

In my example I will use sample retail dataset with couple of stock and sales data files, which I will upload to Azure Data Lake Store.  The Azure Data Lake store is an Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Azure Data Lake Store is built for running large scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers. This improves the read throughput when reading the file in parallel for performing data analytics.

To upload these files , I will first create Azure Data Lake Store called adlsgennadyk:

  • Navigate to Azure Portal
  • Find service called Data Lake Storage Gen 1
  • Click Create button


In the form , I will name my Azure Data Lake Store, pick Azure Resource Group where it will reside and choose billing model, which can be either usual pay as you go or prepayment in advance.


Once you created your Data Lake Storage, I will click on Data Explorer button to launch that tool and upload files I will be analyzing


Next, I will use the tool to upload demo retail dataset file called stock, that stores retail stocked product information in tab delimted format .


Here is the dataset as can be seen in Excel:


Now, that data has been uploaded to Azure lets create instance of Azure Data Lake Analytics service. Again action sequence is the same:

  • Navigate to Azure Portal
  • Find service called Data Lake Analytics
  • Click Create button

Resulting form is very similar to what I did with storage above, except I will point my ADLA instance to my above created storage instance.


Once  Azure Data Lake Analytics instance is created you are presented with this screen:


Once I click on the New Job button I can run brand new query against my files in Azure Data Lake Storage. I will show you simplest U-SQL script here:

@stock = 
            Id   int, 
          Item string 
    FROM "/stock.txt"
    USING Extractors.Tsv();

OUTPUT @stock
    TO "/SearchLog_output.tsv"
    USING Outputters.Tsv();

Here is what I just asked Azure Data Lake Analytics to do.  We extract all mof the data from a file and copy output to another one.

Some items to know:

  • The script contains a number of U-SQL keywords: EXTRACT, FROM, TO, OUTPUT, USING, etc.
  • U-SQL keywords are case sensitive. Keep this in mind – it’s one of the most common errors people run into.
  • The EXTRACT statement reads from files. The built-in extractor called Extractors.Tsv handles Tab-Separated-Value files.
  • The OUTPUT statement writes to files. The built-in outputter called Outputters.Tsv handles Tab-Separated-Value files.
  • From the U-SQL perspective files are “blobs” – they don’t contain any usable schema information. So U-SQL supports a concept called “schema on read” – this means the developer specified the schema that is expected in the file. As you can see the names of the columns and the datatypes are specified in the EXTRACT statement.
  • The default Extractors and Outputters cannot infer the schema from the header row – in fact by default they assume that there is no header row (this behavior can overriden)

After job executes you can see its execution analysis and graph:


Now lets do something more meaningful by introducing WHERE clause to filter the dataset:

@stock = 
            Id   int, 
          Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

    @output = 
    SELECT *
    FROM @stock
    WHERE Item == "Tape dispenser (Black)";

OUTPUT @output
    TO "/stock2_output.tsv"
    USING Outputters.Tsv();

The job took about 30 sec to run , including writing to utput file which took most of time here. Looking at the graph by duration one can see where time was spent:


In the next post I plan to delve deeper into Data Lake Analytics, including using C# functions and USQL catalogs.


For more on Data Lake Analytics see –, and 


Happy swimming in Azure Data Lake, hope this helps.

Forecast Cloudy – Create Microsoft SQL Azure DB via PowerShell

In my recent job experience working SQL Server migrations to Microsoft SQL Azure DB I had to use a lot of Azure PowerShell to create Microsoft SQL Azure DB on the fly. Its a pretty easy task . Here is a quick sample\tutorial to create SQL Azure DB on the fly.


First thing you need is to connect to Azure and login to your Azure subscription. Log in to your Azure subscription using the Connect-AzureRmAccount command and follow the on-screen directions.


Next, I usually like to define variables on top of my script. These will hold information like database names, credentials, etc; things that are likely to change when you reuse the script.

# The data center and resource name for your resources
$resourcegroupname = "myResourceGroup"
$location = "US Central"
# The logical server name: Use a random value or replace with your own value (do not capitalize)
$servername = "gennadykServer"
# Set an admin login and password for your database
# The login information for the server
$adminlogin = "ServerAdmin"
$password = "ChangeYourAdminPassword1"
# The ip address range that you want to allow to access your server - change as appropriate
$startip = ""
$endip = ""
# The database name
$databasename = "mySampleDatabase"

Once we defined these values we need to create a resource group. You can create an Azure resource group using the New-AzureRmResourceGroup command. A resource group is a logical container into which Azure resources are deployed and managed as a group.

New-AzureRmResourceGroup -Name $resourcegroupname -Location $location

Next, you need to create SQL Azure logical server. A logical server contains a group of databases managed as a group. You can create an Azure SQL Database logical server using the New-AzureRmSqlServer command.

New-AzureRmSqlServer -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -Location $location `
    -SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $adminlogin, $(ConvertTo-SecureString -String $password -AsPlainText -Force))

In order to connect to our logical server we need to setup firewall rule. A server-level firewall rule allows an external application, such as SQL Server Management Studio or the SQLCMD utility to connect to a SQL database through the SQL Database service firewall. We will create an Azure SQL Database server-level firewall rule using the New-AzureRmSqlServerFirewallRule command.

New-AzureRmSqlServerFirewallRule -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -FirewallRuleName "AllowSome" -StartIpAddress $startip -EndIpAddress $endip

Finally, we can create our database. You will need to pick SQL Azure performance tier. You can see more information on SQL Azure DB tiers here.  Below script creates a database with an S0 performance level in the server using the New-AzureRmSqlDatabase command.

New-AzureRmSqlDatabase  -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -DatabaseName $databasename `
    -SampleName "AdventureWorksLT" `
    -RequestedServiceObjectiveName "S0"

For more information see

Happy Azure Scripting everyone.

Forecast Cloudy – Migrating SQL Server Database from AWS EC2 to SQL Azure Database using Microsoft Data Migration Assistant Tool


Amazon Web Services EC2 is a great IaaS Platform and many organizations use it to host SQL Server instances. Being IaaS platform is opens up easy migration scenarios from on-premises deployments and for many people its a great first step into cloud. However, hopefully soooner than later, many companies realize that they are ready to take next step – move their workloads and databases to PaaS cloud service. This is where Microsoft SQL Azure DB shines.

Microsoft Azure SQL Database (formerly SQL Azure, Windows Azure SQL Database) is a cloud-based database service from Microsoft offering data-storage capabilities. The aim is for users to just communicate with a T-SQL endpoint rather than managing database storage, files, and high availability. Azure SQL Database obviously has number of limitations like lack of support of cross database queries, SQL Broker, etc. meaning that not every database can be migrated to Microsoft Azure SQL DB without prerequisite work. However, for folks that are ready to migrate Microsoft Data Migration Assistant can be viable option for such migration from AWS EC2 IaaS.

For my tutorial I have SQL Server on Windows instance in EC2 running well known venerable AdventureWorks2016 database.  Here it is amongst my other servers in EC2 console:


With AWS elastic IP assigned to that Windows machine I can easily connect to default SQL instance via my local SSMS:


Obviously previously I had to setup proper inbound rules for this machine both with AWS EC2 Security Groups and Windows Firewall.

Next step is to create Microsoft Azure DB target.  This is well documented topic as well, officially described here – .

I created S3 tier database calling it SmallSQLTarget.


I also had to again go to security group for the server and open inbound ports for SQL Server traffic again (port 1433).  Smart way to do it is to filter by client IP of my workstation where I run my SSMS and will run Data Migration Assistant tool

Now I can make sure that I can connect to my target Microsoft Azure SQL DB from my client machine via SSMS as well:


Now lets proceed with migration. For that you will need to download and install Microsoft Data Migration Assistant from –  . Once you installed the tool lets create new project. I will start with assesment project to make sure I dont have any critical incompatibilities that preclude my migration to PaaS platform.


Tool will check for compatibility issues in my database as well as feature parity to see what needs to be done to migrate any SQL Server instance features to PaaS if anything at all.


We can then provide credentials to our AWS EC2 based instance and pick database after connecting to the source:


Now that we added source lets start assessment.  DMA can have any version of SQL Server as assesment source up to 2016.  As assesment done you will get a nice report where you can see all of the potential issues before you migrate. It should something  look like this :


Now lets migrate our database as I found no migration blockers here.

Press on + button and create new migration project


Lets connect again to instance on EC2 and select DB to migrate


Next connect to Microsoft Azure SQL DB target by specifying server address and credentials and select target db, in my case SmallSQLAzure.


After that tool will do quick assesment and present you with schema objects you can migrate , highlighting\marking any objects that may have issues


After you pick objects you wish to migrate , tool will generate SQL DML script for these objects.  You can then save that script for analysis and changes , as well as apply that script via SSMS on your own. Or you can go ahead and apply the script via tool as I done for this example by pressing Deploy Schema button.


After schema deployment finish we can proceed with data migration by pressing Migrate Data button.


Next, pick your tables to migrate and start data migration. Migration of data for smaller database like AdventureWorks takes few minutes.  Everything transferred other than 2 temporal tables , but those are very different from regular database table.  SQL Server 2016 introduced support for system-versioned temporal tables as a database feature that brings built-in support for providing information about data stored in the table at any point in time rather than only the data that is correct at the current moment in time.  There are other ways to migrate temporal tables to SQL Azure that I will post in my next blog posts.


Well, we are pretty much done. Next, check the data has migrated to target by doing few queries in SSMS:


For more on Microsoft Data Migration Assistant see –

Hope this helps.