Forecast Cloudy – Azure SQL Data Warehouse Introduction

 

    Many enterprises today are moving into era of real-time analytics. As you may be aware on non-relational data typical Hadoop batch analysis architectures are enhanced by technologies like Storm, Kafka, Azure Stream Analytics, Amazon Kinesis, etc. and are moving from purely batch analytics to hybrid real-time\batch architectures like Lambda (http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/) or Zeta (http://radar.oreilly.com/2015/04/zeta-architecture-hexagon-is-the-new-circle.html) with both speed and batch layers. With structured data new generation of elastic cloud based data warehouses is introduced like Amazon Redshift and Azure SQL Data Warehouse. I will blog about Redshift later, but now will take a bit to introduce Azure SQL DW.

Azure SQL Data Warehouse is a turn-key cloud (and related to Microsoft APS\APS on premises solution) data warehousing and analytics solution. It based on existing Azure services effectively and conveniently integrating them together under one roof. The key characteristics of Azure SQL Data Warehouse are:

  • Can scale (grow or shrink) to any size on demand, in seconds
  • Can import data from any source (using Azure Data Factory): hadoop, NoSQL database, SQL database
  • ·Can visualize data and run reports using PowerBI or other service
  • ·Use Azure Machine Learning to analyze, model and predict data
  • · Expose Machine Learning models as API for apps
  • · Provide full ANSI SQL support

image

MPP Architecture. At its core, SQL Data Warehouse runs using Microsoft’s massive parallel processing (MPP) architecture, originally introduced in Microsoft APS\PDW appliance and also successfully used by some other competitor analytics appliance products like Pivotal Greenplum. This architecture takes advantage of built-in data warehousing performance improvements and also allows SQL Data Warehouse to easily scale-out and parallelize computation of complex SQL queries. In addition, SQL Data Warehouse’s architecture is designed to take advantage of it’s presence in Azure. Combining these two aspects, the architecture breaks up into 4 key components:

SQL Data Warehouse Architecture

  • Control Node: You connect to the control node when using SQL Data Warehouse with any development, loading, or business intelligence tools. In SQL Data Warehouse, the compute node is a SQL Database, and when connecting it looks and feels like a standard SQL Database. However, under the surface, it coordinates all of the data movement and computation that takes place in the system. When a command is issued to the control node, it breaks it down into a set of queries that will be passed onto the compute nodes of the service.
  • Compute Nodes: Like the control node, the compute nodes of SQL Data Warehouse are powered using using SQL Databases. Their job is to serve as the compute power of the service. Behind the scenes, any time data is loaded into SQL Data Warehouse, it is distributed across the nodes of the service. Then, any time the control node receives a command it breaks it into pieces for each compute node, and the compute nodes operate over their corresponding data. After completing their computation, compute nodes pass partial results to the control node which then aggregates results before returning an answer.
  • Storage: All storage for SQL Data Warehouse is standard Azure Storage Blobs. This means that when interacting with data, compute nodes are writing and reading directly to/from Blobs. Azure Storage’s ability to expand transparently and nearly limitlessly allows us to automatically scale storage, and to do so separately from compute. Azure Storage also allows us to persist storage while scaling or paused, streamline our back-up and restore process, and have safer, more fault tolerant storage.
  • Data Movement Services: The final piece holding everything together in SQL Data Warehouse is our Data Movement Services. The data movement services allows the control node to communicate and pass data to all of the compute nodes. It also enables the compute nodes to pass data between each other, which gives them access to data on other compute nodes, and allows them to get the data that they need to complete joins and aggregations.

This MPP approach allows SQL Data Warehouse to take a divide and conquer approach as described above when solving large data problems. Since the data in SQL Data Warehouse is divided and distributed across the compute nodes of the service, each compute node is able to operate on its portion of the data in parallel. Finally, results are passed to the control node and aggregated before being passed back to the users

image

Elasticity. Currently the majority of cloud based database and data warehouse services are provisioned with fixed storage and compute resources. Resizing of resources cannot be performed without compromising availability and performance. This means service user typically ends up with over-provisioned underutilized expensive resources to accommodate possible peak demand. In the worst case, under-provisioned resources unable to handle sudden work overloads.

Azure SQL Data Warehouse (DW) is a fully managed, elastic, and petabyte-scale columnar data-warehouse service. Both Amazon Redshift and Azure SQL DW use massive parallel processing (MPP) architecture to deliver breakthrough performance. But unlike Amazon Redshift, Azure SQL DW architecture allows data and compute to scale independently without any downtime. In addition, Azure SQL DW enables one to dynamically grow and shrink resources taking advantage of best-in-class price and performance.

Most importantly, with Azure SQL DW, storage and compute are billed separately. Storage rates are based on standard blob rates. Compute is priced based on DWU (Data Warehouse Unit) blocks – a new metric to measure the compute capacity when using Azure SQL DW. Performance in Azure SQL DW scales linearly, and changing from one compute block to another (for instance 100 DWUs to 1000 DWUs) happens in seconds without disruption.

image

DWU (Data Warehouse Unit) capacity blocks. Data Warehouse Units are a new concept delivered by SQL Data Warehouse for capacity. Unfortunately outside of this article here – https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-performance-scale/ there isn’t much information on DWIs. Especially nice would be some sort of calculator to help translate traditional DW metrics like IOPs or queries per hour, etc. to DWUs.

At this time Microsoft recommends following approach to properly purchasing your Azure DW capacity in DWUs:

  • For a data warehouse in development, begin by selecting small number of DWUs
  • Monitor your application performance, observing the number of DWUs selected compared to the performance you observe
  • Determine how much faster or slower performance should be for you to reach the optimum performance level for your requirements by assuming linear scale
  • Increase or decrease the number of DWU selected. The service will respond quickly and adjust the compute resources to meet the DWU requirements.
  • Continue making adjustments until you reach an optimum performance level for your business requirements.

image

Hadoop Integration.

SQL DW is built with the same technology as APS, except that instead of using SQL Server 2014 it uses version 12 of Azure SQL Database.  It also includes PolyBase.  PolyBase allows APS and SQL DW to query data in a Hadoop cluster, either directly or by pushing some of the work to Hadoop itself so the query is actually run using the Hadoop clusters CPU’s.  The Hadoop data is made to look as if it were local to the data warehouse, so that end-users can use their existing skill sets to query it via SQL or any reporting tool that using SQL (like Excel, SSRS, Power BI, etc).  PolyBase can integrate with Hadoop in this manner via a Microsoft HDInsight cluster that can either be inside APS or in the cloud, or via a Hortonworks or Cloudera cluster.

image

So enough theory lets spin up our own SQL Data Warehouse in Azure. Unlike traditional systems this will not take weeks or months, but minutes. From Azure Preview Portal (it’s not on Classic Portal) go to New->Data + Storage ->Azure Data Warehouse

image

Add DW name, if you already have version 12 SQL Azure server you can use it, otherwise you will have to create new server instance with login and password. I picked lowest available DWU at 100.

image

As a result you will see animated GIF in upper right corner with status:

image

Few minutes and you have yourself a real MPP DW instance on the cloud. Finally after adding my client IP as firewall exclusion I can simply use my SSMS to login to this server and DB.

image

For more info on Azure SQL DW see – https://azure.microsoft.com/en-us/services/sql-data-warehouse/, https://azure.microsoft.com/en-us/documentation/services/sql-data-warehouse/, http://techcrunch.com/2015/04/29/microsoft-introduces-azure-sql-data-warehouse/, http://blogs.technet.com/b/dataplatforminsider/archive/2015/06/24/azure-sql-data-warehouse-opens-for-limited-public-preview.aspx

Advertisements

Resolving Java Threading Issues with ThreadLogic Tool

The ThreadLogic utility is a free tool you can download to assist in analyzing thread dumps taken from a JVM. Threadlogic can digest the log file containing the thread
dump output. This utility does a fair amount of the initial analysis for you like finding locks, the holder of locks, and fatal conditions. If you’ve ever read a raw thread dump from a log file then you know it can be daunting – especially if you don’t exactly what you are looking for. Threadlogic helps by recognizing the type of each thread and categorizing them in the help you understand which threads are JVM threads, WLS threads, and then “application” threads. In addition, Threadlogic can process a series of thread dumps and perform “diff” operation between them. This is helpful in determining what threads are doing over a period of time.

ThreadLogic can be downloaded from – https://java.net/projects/threadlogic/downloads  Thread Logic Comes in a form of Jar File.We need to manually run the file using “java -jar threadLogic.jar”

image

Opening the dump tree and selecting the Advisory Map show a Map with information about the health of the system under investigation. Each of the advisory has a health level indicating severity of the issue found, pattern, name, keyword and related advice.
image

ThreadLogic is able to parse Sun, JRockit, and IBM thread dumps and provide advice based on predefined and externally defined patterns.

The health levels (in descending of severity) are FATAL (meant for Deadlocks, STUCK, Finalizer blocked etc), WARNING, WATCH (worth watching), NORMAL and IGNORE.

Based on the highest criticality of threads within a group, that health level gets promoted to the Thread Group’s health level and same is repeated at the thread dump level. There can be multiple advisories tagged to a Thread, Thread Group and Thread Dump. This is a typical advisory map I see:

image

The threads are associated with thread groups based on the functionality or thread names. Additional patterns exists to tag other threads (like iWay Adapter, SAP, Tibco threads) and group them:

image

For more on the tool see – https://blogs.oracle.com/emeapartnerweblogic/entry/my_first_experiences_with_threadlogic, https://blogs.oracle.com/ATeamExalogic/entry/introducing_threadlogic, https://zeroproductionincidents.wordpress.com/2012/07/10/threadlogic-another-thread-dump-analysis-tool/

Vigilate Reservate – Live Query Statistics in SQL Server 2014

One of many new features with SQL Server 2014 that’s not to be overlooked is ability to see live query stats real time as query is being executed using new DMVs. The fact is that in SQL Server 2014 we have a new Dynamic Management View to track what a query is actually doing at a physical operator level.  I am referring to the sys.dm_exec_query_profiles Dynamic Management View (DMV).

This feature will help the database administrators and experienced developers to troubleshoot long running queries in real-time. I’m pretty sure you had to deal with the following questions: When will this query finish? What percentage of total workload is performed by this request? Which steps are the longest? Before SQL Server 2014 it was impossible to answer the questions above.

Important note – in order to use this feature with this DMV we must use some session options like STATISTICS PROFILE, STATISTICS XML or SET STATISTICS XML ON to  force SQL Server to display the execution plan after execution query with SQL Server Management Studio. So in other words Sit doesn’t work for plans stored in the Plan Cache.

SQL Server BOL has a good detail on new sys.dm_exec_query_profiles DMV here – https://msdn.microsoft.com/en-us/library/dn223301(v=sql.120).aspx. What BOL states there is – “Monitors real time query progress while the query is in execution. The counters are per operator per thread. The data collected is serialized into the SHOWPLAN XML when the query finishes.”  What Microsoft really wanted to say by it was that SHOWPLAN XML or any other sort of actual execution plan is a prerequisite. If actual execution plan isn’t required for your query, SQL Server won’t gather all this data at the first hand. And won’t show it in sys.dm_exec_query_profiles during runtime.

So that means I can use SSMS to run a particulary long running query and get stats on it. However you probably will not be running this all of the time, the overhead is pretty big here.  This DMV provides a lot of useful information.  These information are more granular than SET STATISTICS IO because the counters returned are per operator per thread (node_id / physical_operator_name and thread_id columns). Moreover two others columns are interesting like estimate_row_count and row_count. The former is the number of estimated rows an operator should address and the latter is the current number of rows addressed by the same operator (remember these counters are per operator per thread …). We can compare these two columns to estimate a completion percentage per operator (or per operator per thread). Finally we can categorize some information provided by this DMV in two parts: information related either to the query execution engine (row_count, estimated_row_count) or the storage execution engine (logical_read_count, write_page_count, lob_logical_read_count etc.)

so imagine I am running a following query against AdvetureWorks2012 sample DB:

 

USE AdventureWorks2012 GO SET STATISTICS XML ON SELECT * FROM Sales.SalesOrderHeader OH INNER JOIN Sales.SalesOrderDetail OD ON OD.SalesOrderID = OH.SalesOrderID INNER JOIN Sales.Customer C ON C.CustomerID = OH.CustomerID INNER JOIN Production.Product P ON P.ProductID = OD.ProductID INNER JOIN Production.ProductSubcategory PS ON pS.ProductSubcategoryID = P.ProductSubcategoryID GO

 

Query result will be when finished:

image

While that is running in another SSMS query window I execute:

SELECT  session_id ,
        node_id ,
        physical_operator_name ,
        SUM(row_count) row_count ,
        SUM(estimate_row_count) AS estimate_row_count ,
        IIF(COUNT(thread_id) = 0, 1, COUNT(thread_id)) [Threads] ,
        CAST(SUM(row_count) * 100. / SUM(estimate_row_count) AS DECIMAL(30, 2)) [% Complete] ,
        CONVERT(TIME, DATEADD(ms, MAX(elapsed_time_ms), 0)) [Operator time] ,
        DB_NAME(database_id) + '.' + OBJECT_SCHEMA_NAME(QP.object_id,
                                                        qp.database_id) + '.'
        + OBJECT_NAME(QP.object_id, qp.database_id) [Object Name]
FROM    sys.dm_exec_query_profiles QP
GROUP BY session_id ,
        node_id ,
        physical_operator_name ,
        qp.database_id ,
        QP.OBJECT_ID ,
        QP.index_id
ORDER BY session_id ,
        node_id
GO


I get following as a result:

image

 

Just as with execution plan information can be useful looking at operator time and differences in estimated row counts to actual. Looking on the web appears that folks went further and correlated sys.dm_exec_query_profiles, sys.objects, sys.indexes, sys.dm_exec_requets, sys.dm_exec_sessions, and sys.dm_os_waiting_tasks to cover both the real-time execution information and the related waits occurring in the same time. One such interesting script by David Barbarin is here – http://blog.dbi-services.com/sql-server-2014-sysdmexecqueryprofiles-a-new-promising-feature/ 

 

This feature and new DMV generated lots of buzz so more details are available here – http://sqlserver-help.com/2014/06/10/sql-2014-learning-series-11-new-dmv-sys-dm_exec_query_profiles-part-1/, http://sqlserver-help.com/tag/sys-dm_exec_query_profiles/, and http://blog.sqltechie.com/2013/11/sysdmexecqueryprofiles-sql-2014-new-way.html, http://sqlblog.com/blogs/michael_zilberstein/archive/2013/10/21/51415.aspx