Spark in the Clouds – Running Azure Databricks


Apache Spark is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python, and R, and supports code reuse across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing.

Spark was created to address the limitations of MapReduce by processing data in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, which results in much faster execution. Spark also reuses data through an in-memory cache, which greatly speeds up machine learning algorithms that repeatedly call a function on the same dataset. Data reuse is accomplished by creating DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects that is cached in memory and reused in multiple Spark operations. This dramatically lowers latency, making Spark several times faster than MapReduce, especially for machine learning and interactive analytics.
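As a rough illustration of that reuse, the following Python sketch caches a DataFrame once and then runs several actions against it. The function name `repeated_counts` and its `passes` parameter are made up for this example; `cache()`, `count()`, and `unpersist()` are standard Spark DataFrame methods. It assumes you pass in an existing DataFrame, for example one created in a Databricks notebook where `spark` is predefined.

```python
def repeated_counts(df, passes=3):
    """Run `passes` count actions against one cached DataFrame.

    `df` is assumed to be a Spark DataFrame. This is a hypothetical helper
    for illustration, not part of any Spark API.
    """
    cached = df.cache()  # lazy: only marks the data for in-memory storage
    try:
        # The first action materializes the cache; later actions can read
        # from memory instead of recomputing from the source.
        return [cached.count() for _ in range(passes)]
    finally:
        cached.unpersist()  # release the cached blocks when finished
```

In a notebook this could be called as `repeated_counts(spark.range(1000000))`. Timing each pass is left out to keep the sketch short.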

Spark Core is the main execution engine of the Apache Spark project. Spark SQL (SQL and HiveQL), Spark Streaming, and the machine learning and graph processing engines are built on top of Spark Core, and you can run them through the provided APIs.


There are three key Spark interfaces that you should know about:

  • RDD – Resilient Distributed Dataset. Apache Spark’s first abstraction was the RDD. It is an interface to a sequence of data objects that consist of one or more types that are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame. These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • Dataset. A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages.
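To make the difference between the first two interfaces concrete, here is a small Python sketch that computes the same squares once through the RDD API and once through the DataFrame API. The function names and the column names `n` and `sq` are invented for this example, and `spark` is assumed to be an existing SparkSession such as the one predefined in a Databricks notebook. The Dataset API is not shown because it is not available in Python.

```python
def squares_with_rdd(spark, numbers):
    """RDD API: distribute a Python list and map a plain function over it."""
    rdd = spark.sparkContext.parallelize(numbers)
    return rdd.map(lambda n: n * n).collect()


def squares_with_dataframe(spark, numbers):
    """DataFrame API: same data as a named column, transformed declaratively."""
    df = spark.createDataFrame([(n,) for n in numbers], ["n"])
    rows = df.selectExpr("n * n AS sq").collect()
    return [row["sq"] for row in rows]
```

The RDD version passes an arbitrary Python function to `map`, while the DataFrame version describes the computation as a column expression that Spark can optimize before running it.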

Databricks is a company founded by the creators of Apache Spark that aims to help clients with cloud-based big data processing using Spark. Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.


Setting up an Azure Databricks workspace via the Azure portal is straightforward.

If you don’t have an Azure subscription, create a free account before you begin. You can then navigate to the Azure portal and click + Create Resource to open the New Resource blade.


Pick the Analytics category and the Azure Databricks service:


Under Azure Databricks Service, provide the values to create a Databricks workspace.


In the workspace name field, provide a unique name for your workspace, then pick your subscription, the location of the Azure datacenter where the workspace will be created, the resource group, and the pricing tier for the service. You can pick between the Standard and Premium pricing tiers; for details on each, see –  For the sake of this tutorial, I will pick Standard.

Click Create, and in a few minutes your workspace will be created. Once that happens, go to the Databricks workspace that you created in the Azure portal, and then click Launch Workspace.


Once you log in to the Azure Databricks workspace, you should see a screen like this:


Here you can pick the Clusters icon on the side and create a Databricks cluster. How does creating a Databricks Spark cluster in Azure work? When a customer launches a cluster via Databricks, a “Databricks appliance” is deployed as an Azure resource in the customer’s subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer’s subscription and populated with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance.
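The two choices left to the customer, the VM type and the number of nodes (or an autoscaling range), can be written down as a small cluster specification. The Python sketch below builds such a specification as a dictionary in the shape of the request body used by the Databricks Clusters REST API (`POST /api/2.0/clusters/create`). It only prints the JSON and does not call the API. The helper name and all values (cluster name, runtime version string, VM size, worker counts) are placeholders chosen for illustration, not values from this tutorial.

```python
import json


def cluster_spec(name, node_type_id, spark_version, min_workers, max_workers):
    """Build a cluster definition with an autoscaling worker range.

    Hypothetical helper; field names follow the Clusters API 2.0 request body.
    """
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,  # Databricks runtime version string
        "node_type_id": node_type_id,    # Azure VM size used for the nodes
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


spec = cluster_spec(
    name="tutorial-cluster",            # placeholder name
    node_type_id="Standard_DS3_v2",     # placeholder VM size
    spark_version="<runtime-version>",  # placeholder: use one offered in your workspace
    min_workers=2,
    max_workers=8,
)
print(json.dumps(spec, indent=2))
```

The same fields appear as form inputs on the Create Cluster page, so the dictionary is mainly a compact way to see what is being chosen.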


Databricks clusters provide a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Clicking the Clusters button on the side toolbar opens the Clusters page.



Click the Create Cluster button, and on the resulting screen you can name your cluster and pick the cluster type. In Databricks you can create two types of clusters: standard and high concurrency. Standard clusters are the default and can be used with Python, R, Scala, and SQL. High concurrency clusters are tuned to provide efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users. High concurrency clusters support only the SQL, Python, and R languages. For this tutorial I will create a Standard cluster.


I will now download the data source file from GitHub and put it in Azure Blob Storage. To do that, I will create an Azure Storage account:

  • In the Azure portal, select Create a resource. Select the Storage category, and select Storage Accounts.
  • Provide a unique name for the storage account.
  • Select Account Kind: Blob Storage.
  • Select a Resource Group name. Use the same resource group in which you created the Databricks workspace.

Next, add a storage container to the storage account and upload the source data file:

  • Open the storage account in the Azure portal.
  • Select Blobs.
  • Select + Container to create a new empty container.
  • Provide a Name for the container.
  • Select the Private (no anonymous access) access level.
  • Once the container is created, select the container name.
  • Select the Upload button.
  • On the Files page, select the Folder icon to browse and select the sample file for upload.
  • Select Upload to upload the file.

Once your cluster is created and the source data is uploaded to Azure Storage, you can go to Workspace and create a notebook.


These notebooks can be written in Scala, Python, and other languages. I will pick Scala:


Once you have created the notebook, you can mount the storage account that holds the source data file to /mnt/mypath. In the following snippet, replace {YOUR CONTAINER NAME}, {YOUR STORAGE ACCOUNT NAME}, and {YOUR STORAGE ACCOUNT ACCESS KEY} with the appropriate values for your Azure Storage account. Paste the snippet in an empty cell in the notebook and then press SHIFT + ENTER to run the code cell.

mountPoint = "/mnt/mypath",
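Since only part of the mount call is shown above, here is a hedged Python sketch of what a complete call might look like. It assumes the storage account is reached over `wasbs://` with an account access key, which matches the three placeholders named earlier. The helper `build_mount_args` is hypothetical; `dbutils.fs.mount` with its `source`, `mount_point`, and `extra_configs` parameters is the Python form of the Databricks utility. In a Scala notebook the cell would need to start with `%python`.

```python
def build_mount_args(container, account, access_key, mount_point="/mnt/mypath"):
    """Assemble keyword arguments for dbutils.fs.mount (hypothetical helper)."""
    host = f"{account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}/",
        "mount_point": mount_point,
        # Spark configuration key that carries the storage account access key.
        "extra_configs": {f"fs.azure.account.key.{host}": access_key},
    }


mount_args = build_mount_args(
    container="{YOUR CONTAINER NAME}",
    account="{YOUR STORAGE ACCOUNT NAME}",
    access_key="{YOUR STORAGE ACCOUNT ACCESS KEY}",
)

# `dbutils` only exists inside a Databricks notebook, so the call is guarded
# to keep this sketch runnable elsewhere.
if "dbutils" in globals():
    dbutils.fs.mount(**mount_args)
```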

Once the storage is mounted, we can use the file's data to create a table. In a Scala notebook, start the cell with the %sql magic command to run SQL:


DROP TABLE IF EXISTS radio_sample_data;

CREATE TABLE radio_sample_data USING json

OPTIONS ( path "/mnt/mypath/small_radio_json.json" )

Now you can select data from that table:

SELECT * FROM radio_sample_data

The result should appear within a second or so:


Note that even without knowledge of Scala, working only in SQL or Python, it's pretty easy to get started here.
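For comparison, the same data can also be reached from Python. The sketch below assumes the file is mounted at the path used above and that `spark` is the session predefined in Databricks notebooks. The function name and the view name `radio_sample_view` are made up for this example; `spark.read.json`, `createOrReplaceTempView`, and `spark.sql` are standard PySpark calls.

```python
def load_radio_sample(spark, path="/mnt/mypath/small_radio_json.json"):
    """Read the JSON file into a DataFrame and expose it to SQL as a temp view."""
    df = spark.read.json(path)                       # schema is inferred from the JSON
    df.createOrReplaceTempView("radio_sample_view")  # session-scoped, not persisted
    return spark.sql("SELECT * FROM radio_sample_view")
```

In a notebook cell, `load_radio_sample(spark).show(5)` would print the first few rows.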

To learn more about Azure Databricks see –