Many enterprises today are moving into era of real-time analytics. As you may be aware on non-relational data typical Hadoop batch analysis architectures are enhanced by technologies like Storm, Kafka, Azure Stream Analytics, Amazon Kinesis, etc. and are moving from purely batch analytics to hybrid real-time\batch architectures like Lambda (http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/) or Zeta (http://radar.oreilly.com/2015/04/zeta-architecture-hexagon-is-the-new-circle.html) with both speed and batch layers. With structured data new generation of elastic cloud based data warehouses is introduced like Amazon Redshift and Azure SQL Data Warehouse. I will blog about Redshift later, but now will take a bit to introduce Azure SQL DW.
Azure SQL Data Warehouse is a turn-key cloud (and related to Microsoft APS\APS on premises solution) data warehousing and analytics solution. It based on existing Azure services effectively and conveniently integrating them together under one roof. The key characteristics of Azure SQL Data Warehouse are:
- Can scale (grow or shrink) to any size on demand, in seconds
- Can import data from any source (using Azure Data Factory): hadoop, NoSQL database, SQL database
- ·Can visualize data and run reports using PowerBI or other service
- ·Use Azure Machine Learning to analyze, model and predict data
- · Expose Machine Learning models as API for apps
- · Provide full ANSI SQL support
MPP Architecture. At its core, SQL Data Warehouse runs using Microsoft’s massive parallel processing (MPP) architecture, originally introduced in Microsoft APS\PDW appliance and also successfully used by some other competitor analytics appliance products like Pivotal Greenplum. This architecture takes advantage of built-in data warehousing performance improvements and also allows SQL Data Warehouse to easily scale-out and parallelize computation of complex SQL queries. In addition, SQL Data Warehouse’s architecture is designed to take advantage of it’s presence in Azure. Combining these two aspects, the architecture breaks up into 4 key components:
- Control Node: You connect to the control node when using SQL Data Warehouse with any development, loading, or business intelligence tools. In SQL Data Warehouse, the compute node is a SQL Database, and when connecting it looks and feels like a standard SQL Database. However, under the surface, it coordinates all of the data movement and computation that takes place in the system. When a command is issued to the control node, it breaks it down into a set of queries that will be passed onto the compute nodes of the service.
- Compute Nodes: Like the control node, the compute nodes of SQL Data Warehouse are powered using using SQL Databases. Their job is to serve as the compute power of the service. Behind the scenes, any time data is loaded into SQL Data Warehouse, it is distributed across the nodes of the service. Then, any time the control node receives a command it breaks it into pieces for each compute node, and the compute nodes operate over their corresponding data. After completing their computation, compute nodes pass partial results to the control node which then aggregates results before returning an answer.
- Storage: All storage for SQL Data Warehouse is standard Azure Storage Blobs. This means that when interacting with data, compute nodes are writing and reading directly to/from Blobs. Azure Storage’s ability to expand transparently and nearly limitlessly allows us to automatically scale storage, and to do so separately from compute. Azure Storage also allows us to persist storage while scaling or paused, streamline our back-up and restore process, and have safer, more fault tolerant storage.
- Data Movement Services: The final piece holding everything together in SQL Data Warehouse is our Data Movement Services. The data movement services allows the control node to communicate and pass data to all of the compute nodes. It also enables the compute nodes to pass data between each other, which gives them access to data on other compute nodes, and allows them to get the data that they need to complete joins and aggregations.
This MPP approach allows SQL Data Warehouse to take a divide and conquer approach as described above when solving large data problems. Since the data in SQL Data Warehouse is divided and distributed across the compute nodes of the service, each compute node is able to operate on its portion of the data in parallel. Finally, results are passed to the control node and aggregated before being passed back to the users
Elasticity. Currently the majority of cloud based database and data warehouse services are provisioned with fixed storage and compute resources. Resizing of resources cannot be performed without compromising availability and performance. This means service user typically ends up with over-provisioned underutilized expensive resources to accommodate possible peak demand. In the worst case, under-provisioned resources unable to handle sudden work overloads.
Azure SQL Data Warehouse (DW) is a fully managed, elastic, and petabyte-scale columnar data-warehouse service. Both Amazon Redshift and Azure SQL DW use massive parallel processing (MPP) architecture to deliver breakthrough performance. But unlike Amazon Redshift, Azure SQL DW architecture allows data and compute to scale independently without any downtime. In addition, Azure SQL DW enables one to dynamically grow and shrink resources taking advantage of best-in-class price and performance.
Most importantly, with Azure SQL DW, storage and compute are billed separately. Storage rates are based on standard blob rates. Compute is priced based on DWU (Data Warehouse Unit) blocks – a new metric to measure the compute capacity when using Azure SQL DW. Performance in Azure SQL DW scales linearly, and changing from one compute block to another (for instance 100 DWUs to 1000 DWUs) happens in seconds without disruption.
DWU (Data Warehouse Unit) capacity blocks. Data Warehouse Units are a new concept delivered by SQL Data Warehouse for capacity. Unfortunately outside of this article here – https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-performance-scale/ there isn’t much information on DWIs. Especially nice would be some sort of calculator to help translate traditional DW metrics like IOPs or queries per hour, etc. to DWUs.
At this time Microsoft recommends following approach to properly purchasing your Azure DW capacity in DWUs:
- For a data warehouse in development, begin by selecting small number of DWUs
- Monitor your application performance, observing the number of DWUs selected compared to the performance you observe
- Determine how much faster or slower performance should be for you to reach the optimum performance level for your requirements by assuming linear scale
- Increase or decrease the number of DWU selected. The service will respond quickly and adjust the compute resources to meet the DWU requirements.
- Continue making adjustments until you reach an optimum performance level for your business requirements.
SQL DW is built with the same technology as APS, except that instead of using SQL Server 2014 it uses version 12 of Azure SQL Database. It also includes PolyBase. PolyBase allows APS and SQL DW to query data in a Hadoop cluster, either directly or by pushing some of the work to Hadoop itself so the query is actually run using the Hadoop clusters CPU’s. The Hadoop data is made to look as if it were local to the data warehouse, so that end-users can use their existing skill sets to query it via SQL or any reporting tool that using SQL (like Excel, SSRS, Power BI, etc). PolyBase can integrate with Hadoop in this manner via a Microsoft HDInsight cluster that can either be inside APS or in the cloud, or via a Hortonworks or Cloudera cluster.
So enough theory lets spin up our own SQL Data Warehouse in Azure. Unlike traditional systems this will not take weeks or months, but minutes. From Azure Preview Portal (it’s not on Classic Portal) go to New->Data + Storage ->Azure Data Warehouse
Add DW name, if you already have version 12 SQL Azure server you can use it, otherwise you will have to create new server instance with login and password. I picked lowest available DWU at 100.
As a result you will see animated GIF in upper right corner with status:
Few minutes and you have yourself a real MPP DW instance on the cloud. Finally after adding my client IP as firewall exclusion I can simply use my SSMS to login to this server and DB.
For more info on Azure SQL DW see – https://azure.microsoft.com/en-us/services/sql-data-warehouse/, https://azure.microsoft.com/en-us/documentation/services/sql-data-warehouse/, http://techcrunch.com/2015/04/29/microsoft-introduces-azure-sql-data-warehouse/, http://blogs.technet.com/b/dataplatforminsider/archive/2015/06/24/azure-sql-data-warehouse-opens-for-limited-public-preview.aspx