Cosmonaut in Training – Starting out with Azure Cosmos DB

soviet-space-program-propaganda-poster-5

I cannot claim that I am a stranger to Azure Cosmos DB or its less famous parent, DocumentDB. Moreover, I already blogged on using .NET with Dynamo DB, the AWS competitor to DocumentDB, here.  However, a long time has passed since then, and Cosmos DB has seen a lot of good investment and has really progressed since those days.  Therefore I decided to revisit this topic in my blog.  So, I am on my way to becoming a Cosmonaut, something I never dreamt of when living in my native country.

Azure Cosmos DB is the industry’s first globally distributed database service that enables you to elastically scale throughput and storage across any number of geographical regions while guaranteeing low latency, high availability and consistency – all backed by industry-leading, comprehensive SLAs.

Azure Cosmos DB allows you to use key-value, graph, and document data in one service, at global scale and without worrying about schema or index management.

cosmos1

As shown above, Azure Cosmos DB natively supports multiple data models. The core type system of Azure Cosmos DB’s database engine is atom-record-sequence (ARS) based. Atoms consist of a small set of primitive types e.g. string, bool, number etc., records are structs and sequences are arrays consisting of atoms, records or sequences. The database engine of Azure Cosmos DB is capable of efficiently translating and projecting the data models onto the ARS based data model. The core data model of Azure Cosmos DB is natively accessible from dynamically typed programming languages and can be exposed as-is using JSON or other similar representations. The design also enables natively supporting popular database APIs for data access and query. Azure Cosmos DB’s database engine currently supports DocumentDB SQL, MongoDB, Azure Table Storage, and Gremlin graph query API. We intend to extend it to support other popular database APIs as well. The key benefit is that developers can continue to build their applications using popular OSS APIs but get all the benefits of a battle-tested and fully managed, globally distributed database system.

Developers can start using Azure Cosmos DB by provisioning a database account using their Azure subscription. A database account manages one or more databases. An Azure Cosmos DB database in-turn manages users, permissions and containers. An Azure Cosmos DB container is a schema-agnostic container of arbitrary user-generated entities and stored procedures, triggers and user-defined-functions (UDFs). Entities under the customer’s database account – databases, users, permissions, containers etc., are referred to as resources as illustrated below:

cosmos2.png

Each resource is uniquely identified by a stable and logical URI and represented as a JSON document. The overall resource model of an application using Azure Cosmos DB is a hierarchical overlay of the resources rooted under the database account, and can be navigated using hyperlinks. With the exception of the item resource – which is used to represent arbitrary user defined content, all other resources have a system-defined schema. The content model of the item resource is based on atom-record-sequence (ARS) described earlier. Both, container and item resources are further projected as reified resource types for a specific type of API interface as depicted in Table 1. For example, while using document-oriented APIs, container and item resources are projected as collection (container) and document (item) resources, respectively; likewise, for graph-oriented API access, the underlying container and item resources are projected as graph (container), node (item) and edge (item) resources respectively; while accessing using a key-value API, table (container) and item/row (item) are projected.

API                  | Container is projected as … | Item is projected as …
DocumentDB SQL       | Collection                  | Document
MongoDB              | Collection                  | Document
Azure Table Storage  | Table                       | Item
Gremlin              | Graph                       | Node and Edge

All the data within an Azure Cosmos DB container (e.g. collection, table, graph etc.) is horizontally partitioned and transparently managed by resource partitions as illustrated below. A resource partition is a consistent and highly available container of data partitioned by a customer-specified partition key; it provides a single system image for the set of resources it manages and is the fundamental unit of scalability and distribution. Azure Cosmos DB is designed to let customers elastically scale throughput based on the application traffic patterns across different geographical regions to support fluctuating workloads varying both by geography and time. The system manages the partitions transparently without compromising the availability, consistency, latency or throughput of an Azure Cosmos DB container.

cosmos3

Customers can elastically scale the throughput of a container by programmatically provisioning throughput at a second or minute granularity on an Azure Cosmos DB container. Internally, the system transparently manages resource partitions to deliver the throughput on a given container. Elastically scaling throughput using horizontal partitioning of resources requires that each resource partition is capable of delivering its portion of the overall throughput for a given budget of system resources. Since an Azure Cosmos DB container is globally distributed, Azure Cosmos DB ensures that the throughput of a container is available for use across all the regions where the container is distributed within a few seconds of a change in its value. Customers provision throughput (measured using a currency unit called a Request Unit, or RU) on an Azure Cosmos DB container at both per-second and per-minute granularity.
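
To give you a feel for what provisioning throughput looks like in code, here is a minimal sketch using the DocumentDB (SQL API) .NET SDK rather than the Table API used later in this post; the database and collection names are placeholders and the call assumes an existing DocumentClient inside an async method:

// client is a Microsoft.Azure.Documents.Client.DocumentClient
await client.CreateDocumentCollectionAsync(
    UriFactory.CreateDatabaseUri("demo"),
    new DocumentCollection { Id = "cars" },
    new RequestOptions { OfferThroughput = 2500 }); // RU/s reserved for this container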

Here I should also mention consistency models. Database consistency in the classic ACID sense states that only valid data will be written to the database: if a transaction violates the database's consistency rules, the entire transaction is rolled back and the database is restored to its original state, and if a transaction executes successfully, it takes the database from one state that is consistent with the rules to another state that is also consistent with the rules. In a globally replicated database like Cosmos DB, however, consistency is also about replication – what guarantees a reader in one region gets while data is being written in other regions.

Currently available commercial distributed databases fall into two categories: (1) databases which do not offer well-defined, provable consistency choices or (2) databases which offer two extreme consistency choices – strong vs. eventual consistency. The former systems burden the application developers with the minutiae of their replication protocols and expect them to make difficult tradeoffs between consistency, availability, latency, and throughput. The latter systems put pressure on application developers to choose between the two extremes. Despite the abundance of research and proposals for numerous consistency models, the commercial distributed database services have not been able to operationalize consistency levels beyond strong and eventual consistency. Azure Cosmos DB allows developers to choose between five well-defined consistency models along the consistency spectrum.
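
To make that concrete, with the SQL API .NET SDK the desired consistency can be chosen per client at construction time; a minimal sketch (the endpoint and key are placeholders, and the chosen level can only be weaker than the account default):

// Namespaces: Microsoft.Azure.Documents (ConsistencyLevel) and Microsoft.Azure.Documents.Client (DocumentClient)
var client = new DocumentClient(
    new Uri("https://myaccount.documents.azure.com:443/"),
    "<account key>",
    new ConnectionPolicy(),
    ConsistencyLevel.Session); // one of the five levels, relaxed from the account default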

cosmos4

Finally, global distribution. Your database resources are distributed along two dimensions: within a given region, all resources are horizontally partitioned using resource partitions (local distribution). Each resource partition is also replicated across geographical regions (global distribution).

cosmos5

Well, enough theory, let's do something.  Let's store and retrieve data in Cosmos DB using the Azure Table API.

First of all, we navigate to the Azure Portal (portal.azure.com) and in the left navigation pane select the Create a resource icon. Select Databases and then select Azure Cosmos DB.

create-nosql-db-databases-json-tutorial-1

On the Create Azure Cosmos DB Account page, we will enter the settings for the new Azure Cosmos DB account, including the subscription, resource group and a unique name for the Cosmos DB account, pick Azure Table as the API, and finally set the location to the closest Azure region.  Please leave the Geo-Redundancy and Multi-region Writes options at their default values (Disable) to avoid extra cost, as we don't need these features for this tutorial.  Click Review+Create to create the account.

Once it's created, you can go to the resource and see code snippets on how to connect to your new Cosmos DB account:

azure-cosmos-db-account-created

As you can see, we can now use a variety of frameworks to connect to our Cosmos DB account; I will use C# and .NET as the ones most familiar to me.

I will open Visual Studio and create a .NET Console Project

VSConsole

Once the project is created we will need to install the necessary NuGet packages.

  1. Right-click your project in Solution Explorer and choose Manage NuGet Packages.
  2. Search online for Microsoft.Azure.Cosmos.Table and select Install to install the Microsoft Azure Cosmos DB Table Library.

In the Azure Portal we need to copy the connection string to connect from our application. Let's navigate to the Connection String pane in the portal and use the copy button on the right side of the window to copy the Primary Connection String.

Add the necessary using directives to the top of your program:

using System;
using System.Linq;
using System.Threading.Tasks;
using System.Net;
using Newtonsoft.Json;
using Microsoft.Azure.Cosmos.Table;

Next, add the connection string constant to your class, pasting the Primary Connection String you copied from the portal:

   private const string ConnectionString = "<your connection string here>";

Add this function, which will be the “brain of our operation” and will be called from the Main() function. Here we create the table client from our connection string and get a reference to our table.

private async Task GetStartedDemo()
{
    // Parse the connection string and grab the storage account
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse(ConnectionString);

    // Create a table client for interacting with the Table API
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();

    // Get a reference to our table ("Cars" is an arbitrary name) and create it if it does not exist
    CloudTable table = tableClient.GetTableReference("Cars");
    await table.CreateIfNotExistsAsync();
}

Now that we can access the Cosmos DB table, let's add an entity to it. Entities map to C# objects by using a custom class derived from TableEntity. To add an entity to a table, we create a class that defines the properties of the entity. I will create an entity named Car.

namespace CosmosTableSamples.Model
{
    using Microsoft.Azure.Cosmos.Table;
    public class CarEntity : TableEntity
    {
        public CarEntity()
        {
        }

        public CarEntity(string Brand,string Model)
        {
            PartitionKey = Brand;
            RowKey = Model;
        }

        public string Color{ get; set; }
        public string Type{ get; set; }
    }
}

Now that you have a database, a table and an entity, it's easy to do a CRUD operation like an insert with a snippet like this:

// Create an instance of a CarEntity. See Model\CarEntity.cs for a description of the entity.
CarEntity myCar = new CarEntity("Ford", "Taurus")
{
    Color = "Silver",
    Type = "Sedan"
};

// Demonstrate how to insert or merge the entity
Console.WriteLine("Insert an Entity.");
myCar = await SamplesUtils.InsertOrMergeEntityAsync(table, myCar);

Then you can select using this quick snippet:

Console.WriteLine("Reading the Car Entity.");
myCar = await SamplesUtils.RetrieveEntityUsingPointQueryAsync(table, "Ford", "Taurus");
Console.WriteLine();
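
The SamplesUtils class used above comes from the Azure Cosmos DB Table samples project; if you prefer not to pull that in, the equivalent raw calls are roughly the following sketch:

// Insert or merge via a TableOperation
TableOperation insertOrMerge = TableOperation.InsertOrMerge(myCar);
TableResult insertResult = await table.ExecuteAsync(insertOrMerge);
myCar = insertResult.Result as CarEntity;

// Point query by PartitionKey (Brand) and RowKey (Model)
TableOperation retrieve = TableOperation.Retrieve<CarEntity>("Ford", "Taurus");
TableResult retrieveResult = await table.ExecuteAsync(retrieve);
CarEntity fetchedCar = retrieveResult.Result as CarEntity;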

Hope the above gets you interested in looking at Cosmos DB for your workloads, as it is a really versatile and unique cloud-scale NoSQL data store.

 

More details can be found here – https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/, https://www.sqlshack.com/8-things-know-azure-cosmos-db-formerly-documentdb/, https://docs.microsoft.com/en-us/azure/cosmos-db/.

 

Happy Cosmos Travels!

 


Forecast Cloudy – Migrating SQL Server Database from AWS RDS to SQL Azure Database Managed Instance

sql-database-windows-azure

With SQL Azure Managed Instance entering General Availability some months ago – https://azure.microsoft.com/en-us/blog/azure-sql-database-managed-instance-general-purpose-tier-general-availability/ – we are seeing a lot of interest in this offering.  I already blogged about it in Forecast Cloudy – Set Up SQL Server Managed Instance via Azure PowerShell, so you can see the basic details in that post.  But if I am running SQL Server in AWS Relational Database Service (RDS), how can I migrate a database to SQL Azure MI?  This is what I would like to show you today.

For my test I have created a SQL Server Express instance in AWS RDS, as you can see below:

aws1

Once it's created you will need to enable backup/restore functionality for this instance from AWS S3 storage.  I will create a custom option group, assign that group to my database instance and add the special SQLSERVER_BACKUP_RESTORE option to be able to do so.

awsrdsoptiongroup

After I am done, one can see SQLSERVER_BACKUP_RESTORE in the Options tab in AWS RDS.

awsrdsoptiongroup2

Now, I will create an S3 bucket to be able to back up my source Adventure Works database to S3. You can use the directions here to create an AWS S3 storage bucket – https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html

After creation it should look something like this; note I also added a folder named Backup under the bucket.

aws_s3

Now I can back up my Adventure Works database to the S3 bucket.  I will use the built-in rds_backup_database stored procedure in the msdb database to back up Adventure Works:

exec msdb.dbo.rds_backup_database 
@source_db_name='database_name', @s3_arn_to_backup_to='arn:aws:s3:::bucket_name/file_name_and_extension', 
@overwrite_S3_backup_file=1;
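
Filled in with the bucket and folder created above and the Adventure Works database used later in this post, the call looks roughly like this (the exact ARN path is an assumption based on my bucket layout):

exec msdb.dbo.rds_backup_database 
    @source_db_name='AdventureWorks2014', 
    @s3_arn_to_backup_to='arn:aws:s3:::gennadykbackup/Backup/AdventureWorks2014.bak', 
    @overwrite_S3_backup_file=1;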

Here is the backup running in my example below:

aws_backup

Once you start a backup or restore process, you can track its progress or cancel it:

  • rds_task_status: to track the progress of the backup or restore task
  • rds_cancel_task: to cancel a running backup or restore task
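
For completeness, cancelling a running task is a one-liner (the task id below is hypothetical):

exec msdb.dbo.rds_cancel_task @task_id = 5;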

I will grab the task id and use rds_task_status to query the status of the backup:

exec msdb..rds_task_status @task_id= 5;

We can keep running this command to check the status of the backup or restore task. The “lifecycle” column of the output will change to reflect the status, which can be any of these:
CREATED
IN_PROGRESS
SUCCESS
CANCEL_REQUESTED
CANCELLED
ERROR

The “% complete” column will show how far the operation has completed. When finished successfully, the output will show a “% complete” value of 100 and a “lifecycle” value of “success”

Now we can see that Adventure Works is fully backed up to S3

s3backupfile

Next I will have to move that backup into Azure Blob Storage. To do so I will use an awesome utility written by Ken Faulkner called AzureCopy. I cannot believe that no one else thought of anything like it; for example AzCopy, the standard tool from Microsoft, cannot move files from an AWS S3 bucket to Azure Blob Storage. You can get this utility here – https://kpfaulkner.wordpress.com/azurecopyrelease/.  Once it is downloaded and installed I will use the command line and navigate to the folder where it's installed. Then I will set up all of the necessary secret keys to connect to both the S3 bucket and the Azure blob:

set AzureAccountKey=MyAzureStorageAccountKey
set AWSAccessKeyID=MyS3AccessId
set AWSSecretAccessKeyID=MyS3SecretKey
set AWSRegion=us-west-2

Obviously I am not listing my actual secret keys here, but hopefully you get the idea. Now that the secret keys are packaged in environment variables, I can go ahead and try listing the files in my S3 bucket:

azurecopy -list https://s3-us-west-2.amazonaws.com/gennadykbackup/Backup/ -azurekey %AzureAccountKey% -s3k %AWSAccessKeyID% -s3sk %AWSSecretAccessKeyID%

I get a response like this:

azcopy-1

Next I list the files in Azure Blob Storage, which of course will be empty as I have no files there yet.

azurecopy -list https://gennadyktarget.blob.core.windows.net/backups  -azurekey %AzureAccountKey% -s3k %AWSAccessKeyID% -s3sk %AWSSecretAccessKeyID%

Response looks like this:

azcpy12

Now let's actually move the backup file:

azurecopy -i https://s3-us-west-2.amazonaws.com/gennadykbackup/Backup/ -o https://gennadyktarget.blob.core.windows.net/backups -azurekey %AzureAccountKey% -s3k %AWSAccessKeyID% -s3sk %AWSSecretAccessKeyID% -blobcopy -destblobtype block

And the result after a bit is here:

qazcpyres

Now I see backup in my Azure Blob Storage:

backupinblob

Now I need to restore it in SQL Azure Managed Instance.  I have already created one; to do so yourself you can follow the tutorials here – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-get-started , https://www.brentozar.com/archive/2018/03/how-to-deploy-your-first-azure-sql-db-managed-instance-preview/ and https://blogs.msdn.microsoft.com/sqlserverstorageengine/2018/11/14/create-azure-sql-managed-instance-using-azure-cli/.

I will log into my SQL Azure MI via SSMS and use my Azure Blob Storage account and SAS key to create a credential in my Managed Instance.

CREATE CREDENTIAL [https://gennadyktarget.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE'
, SECRET = 'My Secret SAS Key'

After that to check my credential, I will run the following script, which uses a container URL to get a backup file list.

RESTORE FILELISTONLY FROM URL = 
   'https://gennadyktarget.blob.core.windows.net/backups/AdventureWorks2014.bak'

Now that I can see the file list I can proceed with the restore:

   RESTORE DATABASE [AdventureWorks2014] FROM URL =
  'https://gennadyktarget.blob.core.windows.net/backups/AdventureWorks2014.bak'

And finally after a bit – result is in SSMS:

finaltarget

So finally we are done.  My Adventure Works database is fully migrated from AWS RDS to Azure SQL Database Managed Instance. Hope this helps you or that you at least find it interesting.

 

Spark in the Clouds – Running Azure Databricks

azuredblogo2

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.

Spark was created to address the limitations of MapReduce by doing processing in-memory, reducing the number of steps in a job, and by reusing data across multiple parallel operations. With Spark, only one step is needed where data is read into memory, operations performed, and the results written back, resulting in much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data re-use is accomplished through the creation of DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects that is cached in memory and reused in multiple Spark operations. This dramatically lowers the latency, making Spark multiple times faster than MapReduce, especially when doing machine learning and interactive analytics.

Spark Core is the main execution engine of the Apache Spark project; the Spark SQL (SQL and HiveQL), Spark Streaming, machine learning and graph processing engines are built on top of Spark Core, and you can run them using the provided APIs.

Apache-Spark-Architecture

There are three key Spark interfaces that you should know about:

  • RDD – Resilient Distributed Dataset. Apache Spark’s first abstraction was the RDD. It is an interface to a sequence of data objects that consist of one or more types that are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame. These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • DataSet. A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages. A tiny example relating all three follows after this list.
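
Here is a minimal Scala illustration of how the three relate to each other in a notebook cell (spark is the predefined SparkSession in Databricks); nothing in it is specific to this tutorial's data and the column name is arbitrary:

val ds  = spark.range(5)      // Dataset[java.lang.Long]
val df  = ds.toDF("value")    // DataFrame, i.e. Dataset[Row]
val rdd = df.rdd              // drop down to the underlying RDD[Row] if you really need to
df.show()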

Databricks is a company founded by the creators of Apache Spark, that aims to help clients with cloud-based big data processing using Spark. Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

azure-databricks-overview

Setting up Azure Databricks Workspace via Azure Portal is really easy.

If you don’t have an Azure subscription, create a free account before you begin. You can then navigate to the Azure Portal and click + Create a resource to open the New resource blade.

new_blade

Pick Analytics category and Azure Databricks service:

new_blade2

Under Azure Databricks Service, provide the values to create a Databricks workspace.

create-databricks-workspace

In the workspace name field provide a unique name for your workspace, pick your subscription, the location of the Azure datacenter where the workspace will be created and the resource group, as well as the pricing tier for the service.  You can pick between the standard and premium pricing tiers; for details on each see – https://azure.microsoft.com/en-us/pricing/details/databricks/.  For the sake of this tutorial I will pick standard.

Click Create and in a few minutes your workspace will be created. Once that happens, in the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.

lworkspace

Once you log in to the Azure Databricks workspace you should see a screen like this:

azuredatabricksworkspace

Here you can pick Clusters icon on the side and create a Databricks Cluster.  How would this work, creating a Databricks Spark Cluster in Azure? Well, when a customer launches a cluster via Databricks, a “Databricks appliance” is deployed as an Azure resource in the customer’s subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer’s subscription that we populate with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance.

azdatabricks

Databricks clusters provide a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Once you click the Clusters button on the side toolbar it will open the Clusters page.

clusters1

 

Hit the Create Cluster button and in the resulting screen you can name your cluster, as well as pick the cluster type.  In Databricks you can create two different types of clusters: standard and high concurrency. Standard clusters are the default and can be used with Python, R, Scala, and SQL. High concurrency clusters are tuned to provide efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users; they support only the SQL, Python, and R languages. For my tutorial I will create a standard cluster.

clusters2

I will now download the data source file from GitHub and put this file in Azure Blob Storage. In order to do that I will create an Azure Storage account:

  • In the Azure portal, select Create a resource. Select the Storage category, and select Storage Accounts
  • Provide a unique name for the storage account.
  • Select Account Kind: Blob Storage
  • Select a Resource Group name. Use the same resource group in which you created the Databricks workspace.

Next we add Storage Container to the Storage Account and upload source data file:

  • Open the storage account in the Azure portal.
  • Select Blobs.
  • Select + Container to create a new empty container.
  • Provide a Name for the container.
  • Select the Private (no anonymous access) access level.
  • Once the container is created, select the container name.
  • Select the Upload button.
  • On the Files page, select the Folder icon to browse and select the sample file for upload.
  • Select Upload to upload the file.

Once your cluster is created and the source data is uploaded to Azure Storage, you can go to the Workspace and create a notebook.

databricks-create-notebook

These notebooks can be written in Scala, Python, etc. I will pick Scala:

databricks-notebook-details

Once the notebook is created we can mount the storage account where our source data file lives to /mnt/mypath. In the following snippet, replace {YOUR CONTAINER NAME}, {YOUR STORAGE ACCOUNT NAME}, and {YOUR STORAGE ACCOUNT ACCESS KEY} with the appropriate values for your Azure Storage account. Paste the snippet into an empty cell in the notebook and then press SHIFT + ENTER to run the code cell.

dbutils.fs.mount(
  source = "wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/",
  mountPoint = "/mnt/mypath",
  extraConfigs = Map("fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net" -> "{YOUR STORAGE ACCOUNT ACCESS KEY}"))
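
To sanity-check the mount before going further, you can list the mounted path in another cell (assuming the same mount point as above):

display(dbutils.fs.ls("/mnt/mypath"))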

Once the file is mounted we can use its data to create a table over it and query the data there:

%sql
DROP TABLE IF EXISTS radio_sample_data;

CREATE TABLE radio_sample_data
USING json
OPTIONS (path "/mnt/mypath/small_radio_json.json")

Now you can select data from that table:

%sql
SELECT * from radio_sample_data

Result should be there in a second or so:

result

Note that even without knowledge of Scala, working only in SQL or Python, it's pretty easy to get started here.

To learn more about Azure Databricks see – https://www.wintellect.com/overview-azure-databricks/, https://docs.microsoft.com/en-us/azure/azure-databricks/, https://databricks.com/product/azure, and https://medium.com/@jcbaey/azure-databricks-hands-on-6ed8bed125c7

 

Forecast Cloudy – Set Up SQL Server Managed Instance via Azure PowerShell

logoAzureSql

Managed Instance is a new managed database service that represents fully-managed SQL Server Instance in Azure cloud. It shares the same code with the latest version of SQL Server Database Engine and has the latest features, performance improvements, and security patches.

Let’s step back and look at why Azure SQL Managed Instance is being released. Currently there is a gap in functionality between retail SQL Server and Azure SQL Database. For example: a client has an application that uses multiple databases and lots of code using three-part names. Azure SQL Database doesn’t support three-part names, so if you wanted to migrate this instance to Azure, your only choice would be to deploy and manage it as a SQL Server VM on IaaS. And yes, you can enable backup and patching agents, but at the end of the day, the configuration of the VM, maintenance of the OS and all the responsibility for SQL Server is still in your hands.

This is where the Azure SQL Managed Instance comes in. It provides a fully managed, higher feature-compatible instance that is a lot more similar to the SQL Server instance concept that we all know from the retail product. You don’t have to be responsible of every bit like with IaaS, while at the same time, you can continue running as an instance, with the features that are not compatible with the Azure SQL Database single database model. This means migration is straightforward without having to worry about compatibility or features not supported. It’s also easier because you don’t have to build every single piece of IaaS.

If you have experience with Azure SQL Database you might be thinking, how does this compare to the elastic database pools? The difference is that the pools are still created from databases that are single entities and the only thing they share are the elastic resources. On the other hand, Azure SQL Managed Instance will hold the database inside the instance container so they will be able to use three part names, linked servers, and other features that we are accustomed to using on the on-premises SQL Server.

Best of all, because Azure SQL Managed Instance is built on top of the PaaS framework that powers Azure SQL Database, once it’s migrated, it is fully managed in terms of OS, built-in HA, backups, etc.

As we can see in the diagram below, Azure SQL Managed Instance sits between having a full VM running SQL Server and the fully contained individual database paradigm of Azure SQL Database. And just like Azure SQL Db, with a Managed Instance the client can enjoy the continuous improvement and release of features that come with Microsoft’s cloud-first development model.

diagram

Managed Instance is a resource in your network hosted by Azure cloud. You need to create Azure VNET and a dedicated subnet where the instance should be deployed. There are some networking constraints for the VNET/subnet that you need to review before you create a managed instance. There is no public IP address dedicated to the Managed Instance. Only applications in customer network can access Managed Instance. In addition, your network administrators have the full control and can configure access to Managed Instance using standard methods such as Network security Groups and firewalls.

Managed Instance enables you to choose how many CPU cores you want to use and how much storage you need. You can create a Managed Instance with 16 cores and 500GB storage, and then increase or decrease these numbers depending on your needs. Changing CPU or storage for an instance can be done via the Azure portal using a simple slider:

azure-sql-db-mi-sku-slider

Architecture of Managed Instance.

Managed Instance has split compute and storage components. You have compute nodes with 8, 16, or 24 cores that work with database files stored on Azure Premium disk storage. Every database file is placed on a separate Azure Premium disk, which guarantees that database files cannot be lost. Although Managed Instance relies on Azure Premium disks, it also has a separate backup mechanism that ensures that even if something were to happen to the files, the platform would still be able to restore them from backups.

mn-arch

There are also redundant passive nodes that are used to fail-over compute instance if needed.

Managed Instance is a fully Platform as a Service database offering. High availability, automated backups, point-in-time restore, automatic plan correction, threat detection, vulnerability assessment, and other intelligent features are built into the service without any additional charge.

OS patching and database upgrades are handled automatically and do not require any action.

In addition to built-in monitoring and maintenance features, you can use any 3rd-party tool to monitor and manage your instance, because most of the system views are exposed.

Connecting to Managed Instance

Azure SQL Managed Instance is not a service on public endpoint. Azure SQL Managed Instance is placed on private IP address in your VNET. It is just hosted and managed by Azure cloud.

mn-connect

There are several ways to connect to your Managed Instance.

  1. You can create a VPN connection between your on-premises network and the Azure VNET where the Managed Instance is placed. The Managed Instance will be seen as any other SQL Server in your network.
  2. You can connect via some Azure VM in your Azure VNET (using standard RDP or via some app that is in your Azure VM). If your VM is placed in some other Azure VNET you need to establish peering between these two networks.
  3. You can connect your web application to Managed Instance using Azure App Service VNET integration or by placing your Web App into an App Service Environment that is placed in the VNET. If your Web App or App Service Environment is placed in another VNET, you need to establish peering between these two networks.

For more on Connecting see https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-connect-app

I want to create Managed Instance and test it out, how do I do it?

Assuming you have created a Site-to-Site VPN (or similar connectivity), we will need to place the Managed Instance in a special subnet inside a VNET connected to that VPN so the Managed Instance can be seen from on-premises.  Here are the official docs for network configuration – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-vnet-configuration and https://blogs.msdn.microsoft.com/sqlserverstorageengine/2018/03/14/how-to-configure-network-for-azure-sql-managed-instance/. Custom DNS may need to be set up in Azure like this – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-custom-dns.

I used Azure PowerShell below to log in to an Azure subscription connected to my on-premises network via ExpressRoute and set up the VNET, subnets and the Managed Instance itself. Here is the example script I ran to create the demo instance.  Obviously disregard my subscription, VNET, subnet names, address ranges, etc., and instead substitute the proper values for your environment:

$SubscriptionName = 'Visual Studio Enterprise'
$resourceGroup = 'SQLMIDemo'
$location = 'East US'
$vNetName = 'SQLMIDemoVNET'

#login to azure
Login-AzureRmAccount
Select-AzureRmSubscription $SubscriptionName

#create resource group
New-AzureRmResourceGroup -ResourceGroupName $resourceGroup -Location $location

#create new virtual network
$virtualNetwork = New-AzureRmVirtualNetwork -ResourceGroupName $resourceGroup -Location $location -Name $vNetName -AddressPrefix 10.0.0.0/16

#create default subnet
$subnetConfig = Add-AzureRmVirtualNetworkSubnetConfig -Name default -AddressPrefix 10.0.0.0/24 -VirtualNetwork $virtualNetwork

#set up mi subnet
$subnetConfigMi = Add-AzureRmVirtualNetworkSubnetConfig -Name mi -AddressPrefix 10.0.1.0/24 -VirtualNetwork $virtualNetwork

$virtualNetwork | Set-AzureRmVirtualNetwork

#enable access to azure management service
$routeTableMiManagementService = New-AzureRmRouteTable -Name 'mySQLMIRouteTableMiManagementService' -ResourceGroupName $resourceGroup -Location $location

Set-AzureRmVirtualNetworkSubnetConfig -VirtualNetwork $virtualNetwork -Name 'mi' -AddressPrefix 10.0.1.0/24 -RouteTable $routeTableMiManagementService |
Set-AzureRmVirtualNetwork

Get-AzureRmRouteTable -ResourceGroupName $resourceGroup -Name 'mySQLMIRouteTableMiManagementService' |
Add-AzureRmRouteConfig -Name 'ToManagedInstanceManagementService' -AddressPrefix 0.0.0.0/0 -NextHopType 'Internet' |
Set-AzureRmRouteTable

#install managed instance
New-AzureRmSqlManagedInstance -Name "demosqlmi" `
    -ResourceGroupName $resourceGroup -Location $location -SubnetId $subnetConfigMi.id `
    -AdministratorCredential (Get-Credential) `
    -StorageSizeInGB 1024 -VCore 8 -Edition "GeneralPurpose" `
    -ComputeGeneration Gen4 -LicenseType LicenseIncluded




You can install Azure PowerShell from https://docs.microsoft.com/en-us/powershell/azure/install-azurerm-ps?view=azurermps-6.8.1 or use Portal to set up everything as per – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-vnet-configuration and https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-get-started

Once you created networking prerequisites and SQL Azure MI you can restore SQL database into MI as explained here – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-get-started-restore

How can I be sure that my application database features are supported in SQL Azure MI? 

The majority of mainstream SQL Server features are supported in SQL Azure Managed Instance, including SQL Agent jobs, CLR, linked servers, etc.  The feature matrix can be found here – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-features

For more see – https://azure.microsoft.com/en-us/blog/new-options-to-modernize-your-application-with-azure-sql-database/ and https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance

Steering in the Clouds – Running SQL Server on Azure Kubernetes Managed Service (AKS)

logo_kubernetes_400

In this post I would love to combine my favorite RDBMS, which is SQL Server, with the hottest Azure service out there – AKS, Azure's hosted Kubernetes-based container orchestration service. AKS reduces the complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure. As a hosted Kubernetes service, Azure handles critical tasks like health monitoring and maintenance for you.

Why run SQL Server in containers orchestrated by Kubernetes?

Over the past year, Kubernetes — also known as K8s — has become a dominant topic of conversation in the infrastructure world. Given its pedigree of literally working at Google-scale, it makes sense that people want to bring that kind of power to their DevOps stories; container orchestration turns many tedious and complex tasks into something as simple as a declarative config file.

The rise of orchestration is predicated on a few things, though. First, organizations have moved toward breaking up monolithic applications into microservices. However, the resulting environments have hundreds (or thousands) of these services that need to be managed. Second, infrastructure has become cheap and disposable — if a machine fails, it’s dramatically cheaper to replace it than triage the problems.

So, to solve the first issue, orchestration relies on the boon of the second; it manages services by simply letting new machines, running the exact same containers, take the place of failed ones, which keeps a service running without any manual interference.

However, up until recently K8s was only used for stateless apps and couldn't work with stateful applications and databases. Managing state in Kubernetes is difficult because the system's dynamism is too chaotic for most databases to handle – especially SQL databases that offer strong consistency.

Only since StatefulSets were introduced in K8s can we start thinking about running databases in K8s. By far the most common way to run a database, StatefulSets is a feature fully supported as of the Kubernetes 1.9 release. Using it, each of your pods is guaranteed the same network identity and disk across restarts, even if it's rescheduled to a different physical machine. In AKS, in order to run an RDBMS like SQL Server, I will need to create a persistent volume.

A Persistent Volume (PV) is a piece of storage in the cluster that has been provisioned by an administrator. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual pod that uses the PV. This will allow our pods to go up and down and our volume to stay available.

wkb_arch

So all you need to set up SQL Server in the Azure Kubernetes PaaS service (AKS) is really an Azure account.  First thing, let's install the Azure CLI. I am using a Windows machine, but recently I was able to install and run the Azure CLI on a Mac as well, of course.  Instructions for the Windows install can be found here – https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-windows?view=azure-cli-latest.

If you don't want to install the Azure CLI on your machine you can simply use Azure Cloud Shell on the Azure Portal – https://docs.microsoft.com/en-us/azure/cloud-shell/overview. Recently I used GCP Cloud Shell, and that, as well as Azure Cloud Shell, is a great option if you don't want to or cannot install the CLI/SDK on your machine.

Once you have installed the CLI, let's log in to your Azure subscription.

az login

This will launch your default browser for you to log in.  Once you are logged in, you can list the subscriptions you have via this command:

az account list

The picture below shows the output of the above command; I erased all of the unique information.

azaccount

Once the login process is complete and you are operating in the right subscription,  the resource group to hold all the objects of the Kubernetes cluster can be created:

az group create --name GennadyKSQLK8s --location eastus

You should be able to see the new resource group in the Azure Portal UI:

rg

Once the resource group is created we can build the AKS cluster in that resource group.

az aks create --resource-group GennadyKSQLK8s --name mySQLK8sCluster --node-count 2 --generate-ssh-keys

There is a lot going on behind the scenes of this one command so you may have to be a bit more patient, something I am not known for:

k8run

In a few minutes we are done and the cluster is created.  Next let's install kubectl to manage the Kubernetes cluster.

az aks install-cli

Set up the PATH for kubectl:

set PATH=%PATH%;C:\Users\Gennady\.azure-kubectl

Next, let's grab credentials in order to connect to and manage the cluster:

az aks get-credentials --resource-group=GennadyKSQLK8s  --name=mySQLK8sCluster

The cluster is set up and we can connect. Let's test it by listing the cluster nodes:

kubectl get nodes

The output is below:

kubectlnodes

There are the two nodes of the cluster and it is ready to host a SQL Server instance in a pod.  Now we need to create a new persistent volume and an instance of SQL Server on Linux.

We will begin by creating the SA password for SQL Server in the Kubernetes cluster. Kubernetes can manage sensitive configuration information like passwords as secrets.

kubectl create secret generic mssql --from-literal=SA_PASSWORD="MyP@ssw0rd1"

Now we need a persistent volume and a persistent volume claim in Kubernetes. Here are the steps to get those created:

      • Create a manifest to define the storage class and the persistent volume claim. The manifest specifies the storage provisioner, parameters, and reclaim policy. The Kubernetes cluster uses this manifest to create the persistent storage. I will use following YAML to define my storage class and persistent volume claim:
        kind: StorageClass
        apiVersion: storage.k8s.io/v1beta1
        metadata:
             name: azure-disk
        provisioner: kubernetes.io/azure-disk
        parameters:
          storageaccounttype: Standard_LRS
          kind: Managed
        ---
        kind: PersistentVolumeClaim
        apiVersion: v1
        metadata:
          name: mssql-data
          annotations:
            volume.beta.kubernetes.io/storage-class: azure-disk
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 8Gi
        

        I will save above YAML as disk.yaml on my computer.

      • Next , we create the persistent volume claim in Kubernetes. Lets run this command:
        kubectl apply -f C:\Users\Gennady\Downloads\disk.yaml

        Obviously, your yaml file location may be different than mine above
        Result shown below:

        kubectlpersistent
        The persistent volume is automatically created as an Azure storage account, and bound to the persistent volume claim

      • Verify the persistent volume claim.  We can run following command:
        kubectl describe pvc mssql-data

        The output can be seen below:

        pvc_describe

Now you should be able to see this volume in the portal as part of our resource group:

pvc1

Now that the persistent volume is added we need to add the SQL Server deployment. The deployment creates a replica set and the replica set creates the pod.  The deployment is described via a YAML manifest like the one below:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: mssql-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: mssql
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: mssql
        image: microsoft/mssql-server-linux
        ports:
        - containerPort: 1433
        env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mssql
              key: SA_PASSWORD 
        volumeMounts:
        - name: mssqldb
          mountPath: /var/opt/mssql
      volumes:
      - name: mssqldb
        persistentVolumeClaim:
          claimName: mssql-data
---
apiVersion: v1
kind: Service
metadata:
  name: mssql-deployment
spec:
  selector:
    app: mssql
  ports:
    - protocol: TCP
      port: 1433
      targetPort: 1433
  type: LoadBalancer

I will save the above YAML to my computer as a deployment.yaml file. Now I can run this to deploy:

kubectl apply -f C:\Users\Gennady\Downloads\deployment.yaml

You should see the following output:

KUBECTLDEPLOY

The deployment and service are created. The SQL Server instance is in a container, connected to persistent storage. Now let's verify the pod was created via kubectl get pod:

getpod

Now let's verify that the services are running via kubectl get services.

getservices

This command returns services that are running, as well as the internal and external IP addresses for the services.

Finally, let's connect to our SQL Server instance using sqlcmd:
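
The pieces of the connection come from earlier steps: the external IP reported by kubectl get services and the SA password stored in the Kubernetes secret; the IP below is just a placeholder for the one from your output:

sqlcmd -S <external-ip>,1433 -U sa -P "MyP@ssw0rd1" -Q "SELECT @@VERSION"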

lastone

Hope this helps, and happy steering of your cloud databases. For more see – https://kubernetes.io/docs/home/?path=browse, https://docs.microsoft.com/en-us/sql/linux/tutorial-sql-server-containers-kubernetes?view=sql-server-2017, https://joeydantoni.com/2018/03/29/getting-started-with-sql-server-and-kubernetes/, and https://cloudblogs.microsoft.com/sqlserver/2018/05/29/sql-server-at-the-red-hat-summit-2018/.

 

Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 2 – USQL and Programmability

datalake

This is a continuation of the previous post, which can be found here. As I stated there, the primary programming language for Azure Data Lake Analytics is U-SQL.  If you, like myself, come from a SQL background you will find that many U-SQL queries look like SQL; however, after a second look those of you familiar with .NET and LINQ will notice that there are familiar constructs there as well.

There are two ways to run U-SQL scripts:

    1. You can run U-SQL scripts on your own machine. The data read and written by this script will be on your own machine. You aren’t using Azure resources to do this so there is no additional cost. This method of running U-SQL scripts is called “U-SQL Local Execution”
    2. You can run U-SQL scripts in Azure in the context of a Data Lake Analytics account. The data read or written by the script will also be in Azure – typically in an Azure Data Lake Store account. You pay for any compute and storage used by the script. This is called “U-SQL Cloud Execution”   

If you wish to run scripts locally via your copy of Visual Studio 2015 you can install the Data Lake Tools here – https://www.microsoft.com/en-us/download/details.aspx?id=49504. With Visual Studio 2017 the tooling is part of the Azure development workload.

data-lake-tools-for-vs-2017-install-02.png

In the previous post I showed how to run U-SQL scripts in the cloud, but today we will use Visual Studio 2017 with the Data Lake Tools to run U-SQL scripts.

Once you have installed the tools you can connect to your Azure Data Lake Analytics account and run a U-SQL script by right-clicking it and picking Run U-SQL Script from the menu. Running such a script looks like this in VS 2017:

usql1vs

You can see stats and timings right in Visual Studio, just like you would in the Azure Portal.

usql2vs

When something is wrong with a U-SQL script it will not compile. Here are some of the common things that may drive you mad and that you should watch out for:

  1. Invalid case. U-SQL is case sensitive, unlike SQL.  So the following script will error out as it uses lowercase from vs. uppercase FROM:
    @searchlog = 
        EXTRACT UserId          int, 
                Start           DateTime, 
                Region          string, 
                Query           string, 
                Duration        int, 
                Urls            string, 
                ClickedUrls     string
        from @"/SearchLog.tsv"
        USING Extractors.Tsv();
    
    OUTPUT @searchlog 
        TO @"/SearchLog_output.tsv"
        USING Outputters.Tsv();
    
  2. Bad path to the input or output file. Check your paths; this one is self-evident, but I spent hours debugging my path\folder issues.
  3. Invalid C# expressions due to typos, etc. (see the small example after this list)
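
To expand on item 3: the expressions inside SELECT and WHERE clauses are C#, not T-SQL, so typos there surface as C# compile errors. A small sketch building on the @searchlog rowset from item 1 (column names come from that script):

@cleaned =
    SELECT UserId,
           Region.ToUpper() AS Region,   // C# string method, not T-SQL UPPER()
           Duration
    FROM @searchlog
    WHERE Duration > 200;                // comparison expressions are C# style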

When you develop U-SQL scripts, you can save time and expense by running the scripts locally on your machine before they are ready to be run in the cloud.  You can connect to your local folder path via the Visual Studio Data Lake tools and run queries there:

adlavs3

A local data root folder is a local store for the local compute account. Any folder in the local file system on your local machine can be a local data root folder. It’s the same as the default Azure Data Lake Store account of a Data Lake Analytics account. Switching to a different data root folder is just like switching to a different default store account.

When you run a U-SQL script, a working directory folder is needed to cache compilation results, run logs, and perform other functions. In Azure Data Lake Tools for Visual Studio, the working directory is the U-SQL project’s working directory. It’s located under <U-SQL Project Root Path>/bin/debug. The working directory is cleaned every time a new run is triggered.

Now that you've got the basics you can delve into U-SQL constructs following the language tutorials by Microsoft's Michael Rys here – https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-language-reference , and Adjeta Sighal here – https://blogs.msdn.microsoft.com/azuredatalake/2018/05/26/get-started-with-u-sql-its-easy/

In my next installment I am hoping to go through a typical ADLA analytics job from start to completion.

For more on Azure Data Lake Analytics see – https://channel9.msdn.com/blogs/Cloud-and-Enterprise-Premium/Azure-Data-Lake-Analytics-Deep-Dive and http://eatcodelive.com/2015/11/21/querying-azure-sql-database-from-an-azure-data-lake-analytics-u-sql-script/

Good Luck and Happy Coding!

Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 1 – Introduction

adla

Previously, I wrote a post about Google BigQuery, a GCP service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. Similar services are now provided by all public cloud vendors; Microsoft Azure has a service known as Azure Data Lake Analytics that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blob Storage.

According to Microsoft, Azure Data Lake Analytics lets you:

  • Analyze data of any kind and of any size.
  • Speed up and sharpen your development and debug cycles.
  • Use the new U-SQL processing language built especially for big data.
  • Rely on Azure’s enterprise-grade SLA.
  • Pay only for the processing resources you actually need and use.
  • Benefit from the YARN-based technology extensively tested at Microsoft.

ADLA is based on top of the YARN technology. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs. The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

yarn_architecture

For more on the YARN architecture see – https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html. Being based on YARN is what gives Azure Data Lake Analytics its extreme scalability. Data Lake Analytics can work with a number of Azure data sources: Azure Blob storage, Azure SQL Database, Azure SQL Data Warehouse, Azure Data Lake Store, and Azure SQL Database in an Azure VM. Azure Data Lake Analytics is specially optimized to work with Azure Data Lake Store, providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of SQL with the expressive power of C#. It takes a bit to learn for a typical SQL person, but it's pretty powerful.

So enough theory; let me show how you can crunch Big Data workloads without creating a large Hadoop cluster or setting up infrastructure, paying only for the storage and compute you actually use.

The first thing we will need before starting to work in the Azure cloud is a subscription. If you don't have one, browse to https://azure.microsoft.com/en-us/free/?v=18.23 and follow the instructions to sign up for a free 30-day trial subscription to Microsoft Azure.

In my example I will use a sample retail dataset with a couple of stock and sales data files, which I will upload to Azure Data Lake Store.  The Azure Data Lake Store is an Apache Hadoop file system compatible with the Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Azure Data Lake Store is built for running large scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers. This improves the read throughput when reading the file in parallel for performing data analytics.

To upload these files, I will first create an Azure Data Lake Store called adlsgennadyk:

  • Navigate to Azure Portal
  • Find service called Data Lake Storage Gen 1
  • Click Create button

dls1

In the form, I will name my Azure Data Lake Store, pick the Azure resource group where it will reside and choose the billing model, which can be either the usual pay-as-you-go or prepayment in advance.

dls2

Once the Data Lake Store is created, I will click on the Data Explorer button to launch that tool and upload the files I will be analyzing.

dls3

Next, I will use the tool to upload the demo retail dataset file called stock, which stores stocked retail product information in tab-delimited format.

dls4

Here is the dataset as can be seen in Excel:

dls5

Now that the data has been uploaded to Azure, let's create an instance of the Azure Data Lake Analytics service. Again the action sequence is the same:

  • Navigate to Azure Portal
  • Find service called Data Lake Analytics
  • Click Create button

The resulting form is very similar to what I did with storage above, except I will point my ADLA instance to the storage instance created above.

adla11

Once the Azure Data Lake Analytics instance is created you are presented with this screen:

adla12

Once I click on the New Job button I can run a brand new query against my files in Azure Data Lake Store. I will show you the simplest U-SQL script here:

@stock = 
    EXTRACT  
            Id   int, 
          Item string 
    FROM "/stock.txt"
    USING Extractors.Tsv();

OUTPUT @stock
    TO "/SearchLog_output.tsv"
    USING Outputters.Tsv();

Here is what I just asked Azure Data Lake Analytics to do: we extract all of the data from a file and copy the output to another one.

Some items to know:

  • The script contains a number of U-SQL keywords: EXTRACT, FROM, TO, OUTPUT, USING, etc.
  • U-SQL keywords are case sensitive. Keep this in mind – it’s one of the most common errors people run into.
  • The EXTRACT statement reads from files. The built-in extractor called Extractors.Tsv handles Tab-Separated-Value files.
  • The OUTPUT statement writes to files. The built-in outputter called Outputters.Tsv handles Tab-Separated-Value files.
  • From the U-SQL perspective files are “blobs” – they don’t contain any usable schema information. So U-SQL supports a concept called “schema on read” – this means the developer specified the schema that is expected in the file. As you can see the names of the columns and the datatypes are specified in the EXTRACT statement.
  • The default Extractors and Outputters cannot infer the schema from the header row – in fact by default they assume that there is no header row (this behavior can be overridden, as shown in the snippet after this list)
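
For example, if the stock file did have a header row, the extractor can be told to skip it; this is a sketch, assuming I am remembering the skipFirstNRows parameter of the built-in extractors correctly:

@stock = 
    EXTRACT Id   int, 
            Item string 
    FROM "/stock.txt"
    USING Extractors.Tsv(skipFirstNRows: 1);   // skip the header row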

After the job executes you can see its execution analysis and graph:

adla14

Now let's do something more meaningful by introducing a WHERE clause to filter the dataset:

@stock = 
    EXTRACT  
            Id   int, 
          Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

    @output = 
    SELECT *
    FROM @stock
    WHERE Item == "Tape dispenser (Black)";


OUTPUT @output
    TO "/stock2_output.tsv"
    USING Outputters.Tsv();

The job took about 30 seconds to run, including writing to the output file, which took most of the time here. Looking at the graph by duration one can see where the time was spent:

adla15
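
Just to hint at what is coming next, aggregations work much as you would expect; this is purely an illustration using the same two-column stock schema, and the output path is made up:

@counts = 
    SELECT Item, 
           COUNT(*) AS Cnt 
    FROM @stock 
    GROUP BY Item;

OUTPUT @counts 
    TO "/stock_counts.tsv"
    USING Outputters.Tsv();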

In the next post I plan to delve deeper into Data Lake Analytics, including using C# functions and USQL catalogs.

 

For more on Data Lake Analytics see – https://docs.microsoft.com/en-us/azure/data-lake-analytics/, https://youtu.be/4PVx-7SSs4c, https://cloudacademy.com/blog/azure-data-lake-analytics/, and https://optimalbi.com/blog/2018/02/20/fishing-in-an-azure-data-lake/

 

Happy swimming in Azure Data Lake, hope this helps.