Forecast Cloudy – NGINX On Microsoft Azure

NGINX

Nginx (pronounced “engine x”) is web server software. It can act as a reverse proxy server for TCP, UDP, HTTP, HTTPS, SMTP, POP3, and IMAP protocols, as well as a load balancer and an HTTP cache. Nginx comes in two flavors: the free and open source basic version, and the more advanced commercial Plus version. Here is a comparison of both in terms of features, from the Nginx docs:

https://www.nginx.com/products/feature-matrix/

As you may know, the NGINX Plus flavor is already available as a VM image in the Azure Marketplace. So to set it up in Azure, all I had to do was the following:

  • Go to the Azure Marketplace and find NGINX Inc
  • Pick Resource Group or Classic deployment (you will want to pick Azure Resource Group – see below)
  • Pick a VM size (I picked the smallest, Standard A1, for my experiment)

nginx2

This is all done via the portal, but what if you really want to use Azure Resource Manager templates to automate this?

Azure originally provided only the classic deployment model. In this model, each resource existed independently; there was no way to group related resources together. Instead, you had to manually track which resources made up your solution or application, and remember to manage them in a coordinated approach. To deploy a solution, you had to either create each resource individually through the classic portal or create a script that deployed all of the resources in the correct order. To delete a solution, you had to delete each resource individually. You could not easily apply and update access control policies for related resources. Finally, you could not apply tags to resources to label them with terms that help you monitor your resources and manage billing.

With the introduction of the new Azure Portal it is possible to view resources (such as websites, virtual machines and databases) as a single logical unit. This logical unit is called a resource group. The following screenshot shows a resource group called ecommerce-westus which contains a website, a SQL database and an Application Insights instance.

nginx4

Resource groups aren’t visible in the old portal, but almost everything within your Azure subscription exists within a set of resource groups that were created by default when the preview portal opened for business. If you access the new portal, press the Browse button on the jump bar (the icons down the left-hand side of the screen) and navigate to resource groups, you should see a whole bunch listed.

The benefit of resource groups is that they allow an Azure administrator to roll-up billing and monitoring information for resources within a resource group and manage access to those resources as a set. This can be extremely useful when you have a single subscription but you need to do cost recovery on the resources used by a customer or internal department.

For more on Azure Resource Manager see the docs – https://azure.microsoft.com/en-us/documentation/articles/resource-group-overview/, https://channel9.msdn.com/Shows/Edge/Edge-Show-121-Azure-Resource-Manager, http://www.codeproject.com/Articles/854592/Using-Azure-Resource-Manager

Azure Resource Manager allows for a new deployment paradigm via ARM Templates – https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-deployment-model

Below is a sample template to deploy the NGINX VM:

"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "location": {
            "type": "String"
        },
        "virtualMachineName": {
            "type": "String"
        },
        "virtualMachineSize": {
            "type": "String"
        },
        "adminUsername": {
            "type": "String"
        },
        "storageAccountName": {
            "type": "String"
        },
        "virtualNetworkName": {
            "type": "String"
        },
        "networkInterfaceName": {
            "type": "String"
        },
        "networkSecurityGroupName": {
            "type": "String"
        },
        "adminPublicKey": {
            "type": "String"
        },
        "availabilitySetName": {
            "type": "String"
        },
        "availabilitySetPlatformFaultDomainCount": {
            "type": "String"
        },
        "availabilitySetPlatformUpdateDomainCount": {
            "type": "String"
        },
        "storageAccountType": {
            "type": "String"
        },
        "diagnosticsStorageAccountName": {
            "type": "String"
        },
        "diagnosticsStorageAccountId": {
            "type": "String"
        },
        "diagnosticsStorageAccountType": {
            "type": "String"
        },
        "addressPrefix": {
            "type": "String"
        },
        "subnetName": {
            "type": "String"
        },
        "subnetPrefix": {
            "type": "String"
        },
        "publicIpAddressName": {
            "type": "String"
        },
        "publicIpAddressType": {
            "type": "String"
        }
    },
    "variables": {
        "vnetId": "[resourceId('gennadykngnix','Microsoft.Network/virtualNetworks', parameters('virtualNetworkName'))]",
        "subnetRef": "[concat(variables('vnetId'), '/subnets/', parameters('subnetName'))]",
        "metricsresourceid": "[concat('/subscriptions/', subscription().subscriptionId, '/resourceGroups/', resourceGroup().name, '/providers/', 'Microsoft.Compute/virtualMachines/', parameters('virtualMachineName'))]",
        "metricsclosing": "[concat('')]",
        "metricscounters": "",
        "metricsstart": "",
        "wadcfgx": "[concat(variables('metricsstart'), variables('metricscounters'), variables('metricsclosing'))]",
        "diagnosticsExtensionName": "Microsoft.Insights.VMDiagnosticsSettings"
    },
    "resources": [
        {
            "type": "Microsoft.Compute/virtualMachines",
            "name": "[parameters('virtualMachineName')]",
            "apiVersion": "2015-06-15",
            "location": "[parameters('location')]",
            "plan": {
                "name": "nginx-plus",
                "publisher": "nginxinc",
                "product": "nginx-plus-v1"
            },
            "properties": {
                "osProfile": {
                    "computerName": "[parameters('virtualMachineName')]",
                    "adminUsername": "[parameters('adminUsername')]",
                    "linuxConfiguration": {
                        "disablePasswordAuthentication": "true",
                        "ssh": {
                            "publicKeys": [
                                {
                                    "path": "[concat('/home/', parameters('adminUsername'), '/.ssh/authorized_keys')]",
                                    "keyData": "[parameters('adminPublicKey')]"
                                }
                            ]
                        }
                    }
                },
                "hardwareProfile": {
                    "vmSize": "[parameters('virtualMachineSize')]"
                },
                "storageProfile": {
                    "imageReference": {
                        "publisher": "nginxinc",
                        "offer": "nginx-plus-v1",
                        "sku": "nginx-plus",
                        "version": "latest"
                    },
                    "osDisk": {
                        "name": "[parameters('virtualMachineName')]",
                        "vhd": {
                            "uri": "[concat(concat(reference(resourceId('gennadykngnix', 'Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2015-06-15').primaryEndpoints['blob'], 'vhds/'), parameters('virtualMachineName'), '20161130115146.vhd')]"
                        },
                        "createOption": "fromImage"
                    },
                    "dataDisks": []
                },
                "networkProfile": {
                    "networkInterfaces": [
                        {
                            "id": "[resourceId('Microsoft.Network/networkInterfaces', parameters('networkInterfaceName'))]"
                        }
                    ]
                },
                "diagnosticsProfile": {
                    "bootDiagnostics": {
                        "enabled": true,
                        "storageUri": "[reference(resourceId('gennadykngnix', 'Microsoft.Storage/storageAccounts', parameters('diagnosticsStorageAccountName')), '2015-06-15').primaryEndpoints['blob']]"
                    }
                },
                "availabilitySet": {
                    "id": "[resourceId('Microsoft.Compute/availabilitySets', parameters('availabilitySetName'))]"
                }
            },
            "dependsOn": [
                "[concat('Microsoft.Network/networkInterfaces/', parameters('networkInterfaceName'))]",
                "[concat('Microsoft.Compute/availabilitySets/', parameters('availabilitySetName'))]",
                "[concat('Microsoft.Storage/storageAccounts/', parameters('storageAccountName'))]",
                "[concat('Microsoft.Storage/storageAccounts/', parameters('diagnosticsStorageAccountName'))]"
            ]
        },
        {
            "type": "Microsoft.Compute/virtualMachines/extensions",
            "name": "[concat(parameters('virtualMachineName'),'/', variables('diagnosticsExtensionName'))]",
            "apiVersion": "2015-06-15",
            "location": "[parameters('location')]",
            "properties": {
                "publisher": "Microsoft.OSTCExtensions",
                "type": "LinuxDiagnostic",
                "typeHandlerVersion": "2.3",
                "autoUpgradeMinorVersion": true,
                "settings": {
                    "StorageAccount": "[parameters('diagnosticsStorageAccountName')]",
                    "xmlCfg": "[base64(variables('wadcfgx'))]"
                },
                "protectedSettings": {
                    "storageAccountName": "[parameters('diagnosticsStorageAccountName')]",
                    "storageAccountKey": "[listKeys(parameters('diagnosticsStorageAccountId'),'2015-06-15').key1]",
                    "storageAccountEndPoint": "https://core.windows.net/"
                }
            },
            "dependsOn": [
                "[concat('Microsoft.Compute/virtualMachines/', parameters('virtualMachineName'))]"
            ]
        },
        {
            "type": "Microsoft.Compute/availabilitySets",
            "name": "[parameters('availabilitySetName')]",
            "apiVersion": "2015-06-15",
            "location": "[parameters('location')]",
            "properties": {
                "platformFaultDomainCount": "[parameters('availabilitySetPlatformFaultDomainCount')]",
                "platformUpdateDomainCount": "[parameters('availabilitySetPlatformUpdateDomainCount')]"
            }
        },
        {
            "type": "Microsoft.Storage/storageAccounts",
            "name": "[parameters('storageAccountName')]",
            "apiVersion": "2015-06-15",
            "location": "[parameters('location')]",
            "properties": {
                "accountType": "[parameters('storageAccountType')]"
            }
        },
        {
            "type": "Microsoft.Storage/storageAccounts",
            "name": "[parameters('diagnosticsStorageAccountName')]",
            "apiVersion": "2015-06-15",
            "location": "[parameters('location')]",
            "properties": {
                "accountType": "[parameters('diagnosticsStorageAccountType')]"
            }
        },
        {
            "type": "Microsoft.Network/virtualNetworks",
            "name": "[parameters('virtualNetworkName')]",
            "apiVersion": "2016-09-01",
            "location": "[parameters('location')]",
            "properties": {
                "addressSpace": {
                    "addressPrefixes": [
                        "[parameters('addressPrefix')]"
                    ]
                },
                "subnets": [
                    {
                        "name": "[parameters('subnetName')]",
                        "properties": {
                            "addressPrefix": "[parameters('subnetPrefix')]"
                        }
                    }
                ]
            }
        },
        {
            "type": "Microsoft.Network/networkInterfaces",
            "name": "[parameters('networkInterfaceName')]",
            "apiVersion": "2016-09-01",
            "location": "[parameters('location')]",
            "properties": {
                "ipConfigurations": [
                    {
                        "name": "ipconfig1",
                        "properties": {
                            "subnet": {
                                "id": "[variables('subnetRef')]"
                            },
                            "privateIPAllocationMethod": "Dynamic",
                            "publicIpAddress": {
                                "id": "[resourceId('gennadykngnix','Microsoft.Network/publicIpAddresses', parameters('publicIpAddressName'))]"
                            }
                        }
                    }
                ],
                "networkSecurityGroup": {
                    "id": "[resourceId('gennadykngnix', 'Microsoft.Network/networkSecurityGroups', parameters('networkSecurityGroupName'))]"
                }
            },
            "dependsOn": [
                "[concat('Microsoft.Network/virtualNetworks/', parameters('virtualNetworkName'))]",
                "[concat('Microsoft.Network/publicIpAddresses/', parameters('publicIpAddressName'))]",
                "[concat('Microsoft.Network/networkSecurityGroups/', parameters('networkSecurityGroupName'))]"
            ]
        },
        {
            "type": "Microsoft.Network/publicIpAddresses",
            "name": "[parameters('publicIpAddressName')]",
            "apiVersion": "2016-09-01",
            "location": "[parameters('location')]",
            "properties": {
                "publicIpAllocationMethod": "[parameters('publicIpAddressType')]"
            }
        },
        {
            "type": "Microsoft.Network/networkSecurityGroups",
            "name": "[parameters('networkSecurityGroupName')]",
            "apiVersion": "2016-09-01",
            "location": "[parameters('location')]",
            "properties": {
                "securityRules": [
                    {
                        "name": "default-allow-ssh",
                        "properties": {
                            "priority": 1000,
                            "sourceAddressPrefix": "*",
                            "protocol": "TCP",
                            "destinationPortRange": "22",
                            "access": "Allow",
                            "direction": "Inbound",
                            "sourcePortRange": "*",
                            "destinationAddressPrefix": "*"
                        }
                    }
                ]
            }
        }
    ],
    "outputs": {
        "adminUsername": {
            "type": "String",
            "value": "[parameters('adminUsername')]"
        }
    }
}

Obviously parameters.json has my values, like storage account and resource group names, VM sizes, network names and finally the SSH key. All of those parameters would be different for you.
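To actually deploy this template with its parameters file, a couple of AzureRM PowerShell cmdlets are all you need. Below is a minimal sketch; the subscription ID and file paths are placeholders you would replace with your own values (I reuse the resource group name gennadykngnix since the template's resourceId() calls reference it):

# Log in and pick the subscription to deploy into
Add-AzureRmAccount
Select-AzureRmSubscription -SubscriptionId "<your subscription id>"

# Create the target resource group if it does not exist yet
New-AzureRmResourceGroup -Name "gennadykngnix" -Location "East US 2"

# Deploy the template above together with its parameters file
New-AzureRmResourceGroupDeployment -ResourceGroupName "gennadykngnix" `
    -TemplateFile ".\azuredeploy.json" `
    -TemplateParameterFile ".\azuredeploy.parameters.json"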

More on NGINX on Azure as a load balancer or reverse proxy here – https://www.nginx.com/products/nginx-plus-microsoft-azure/

Hope this helps.

Forecast Cloudy – Using Azure Blob Storage with Apache Hive on HDInsight

The beauty of working with Big Data in Azure is that you can manage (create\delete) compute resources with your HDInsight cluster independent of your data, stored either in Azure Data Lake or Azure blob storage. In this post I will concentrate on using Azure blob storage\WASB as the data store for HDInsight, the Azure PaaS Hadoop service.

With a typical Hadoop installation you load your data to a staging location, then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back as a result set for consumption by other systems (think PowerBI, Tableau, etc.).

Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. In many ways it “is” HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is what enables your data to persist even when no clusters currently exist, and enables multiple clusters plus other applications to access a single piece of data at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.

hdinsight

In Azure you store blobs in containers within Azure storage accounts. You grant access to a storage account, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft’s documentation helps to show the structure:

blob1

Hold on, isn’t the whole selling point of Hadoop the proximity of data to compute? Yes, and just like with any on-premises Hadoop system, data is loaded into memory on the individual nodes at compute time. With the Azure data center infrastructure and backbone built for performance, your job performance is generally the same or better than if you used disks locally attached to the VMs.

Below is a diagram of the HDInsight data storage architecture:

hdi.wasb.arch

HDInsight provides access to the distributed file system that is locally attached to the compute nodes. This file system can be accessed by using the fully qualified URI, for example:

hdfs://<namenodehost>/<path>

More important is the ability to access data that is stored in Azure Storage. The syntax is:

wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>

As per https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage you need to be aware of following:

  • Container security for WASB storage. For containers in storage accounts that are connected to the cluster, because the account name and key are associated with the cluster during creation, you have full access to the blobs in those containers. For public containers that are not connected to the cluster, you have read-only permission to the blobs in the containers. For private containers in storage accounts that are not connected to the cluster, you can’t access the blobs in the containers unless you define the storage account when you submit the WebHCat jobs.
  • The storage accounts that are defined in the creation process, and their keys, are stored in %HADOOP_HOME%/conf/core-site.xml on the cluster nodes. The default behavior of HDInsight is to use the storage accounts defined in the core-site.xml file. It is not recommended to directly edit the core-site.xml file, because the cluster head node (master) may be reimaged or migrated at any time, and any changes to this file are not persisted.

You can create a new storage account or point an existing one to your HDInsight cluster easily via the portal, as I show below:

Capture

You can point your HDInsight cluster to multiple storage accounts as well, as explained here – https://blogs.msdn.microsoft.com/mostlytrue/2014/09/03/hdinsight-working-with-different-storage-accounts/

You can also create a storage account and container via Azure PowerShell, as in this sample:

$SubscriptionID = "<Your Azure Subscription ID>"
$ResourceGroupName = "<New Azure Resource Group Name>"
$Location = "EAST US 2"

$StorageAccountName = "<New Azure Storage Account Name>"
$containerName = "<New Azure Blob Container Name>"

Add-AzureRmAccount
Select-AzureRmSubscription -SubscriptionId $SubscriptionID

# Create resource group
New-AzureRmResourceGroup -name $ResourceGroupName -Location $Location

# Create default storage account
New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -Location $Location -Type Standard_LRS

# Create default blob containers
$storageAccountKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -StorageAccountName $StorageAccountName)[0].Value
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
New-AzureStorageContainer -Name $containerName -Context $destContext


The URI scheme for accessing files in Azure storage from HDInsight is:

wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>

The URI scheme provides unencrypted access (with the wasb: prefix) and SSL encrypted access (with wasbs). Microsoft recommends using wasbs wherever possible, even when accessing data that lives inside the same region in Azure.

The <BlobStorageContainerName> identifies the name of the blob container in Azure storage. The <StorageAccountName> identifies the Azure Storage account name. A fully qualified domain name (FQDN) is required.

I ran into a rather crazy little limitation\issue when working with WASB and HDInsight. Hadoop and Hive look for and expect a valid folder hierarchy to import data files, whereas WASB does not support a folder hierarchy, i.e. all blobs are listed flat under a container. The workaround is to use an SSH session to log into the cluster head node and use the mkdir command to manually create such a directory via the driver.

The SSH Procedure with HDInsight can be found here – https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix

Another workaround recommended to me was that the “/” character can be used within the key name to make it appear as if a file is stored within a directory structure. HDInsight sees these as if they are actual directories. For example, a blob’s key may be input/log1.txt. No actual “input” directory exists, but due to the presence of the “/” character in the key name, it has the appearance of a file path.
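To illustrate that trick, here is a small PowerShell sketch (the account, container, key and local file path are placeholders) that uploads a file under a blob key containing “/”, so HDInsight sees it as input/log1.txt inside an “input” directory:

# Build a storage context for the cluster's storage account
$ctx = New-AzureStorageContext -StorageAccountName "<storage account name>" `
                               -StorageAccountKey "<storage account key>"

# The "/" in -Blob creates the appearance of an input/ directory;
# no real folder object is created in blob storage
Set-AzureStorageBlobContent -File "C:\logs\log1.txt" `
                            -Container "<container name>" `
                            -Blob "input/log1.txt" `
                            -Context $ctx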

For more see – https://social.technet.microsoft.com/wiki/contents/articles/6621.azure-hdinsight-faq.aspx, https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage, https://www.codeproject.com/articles/597940/understandingpluswindowsplusazureplusblobplusstora

Hope this helps.

Forecast Cloudy–Azure Resource Manager Templates

Azure originally provided only the classic deployment model. In this model, each resource existed independently; there was no way to group related resources together. Instead, you had to manually track which resources made up your solution or application, and remember to manage them in a coordinated approach. To deploy a solution, you had to either create each resource individually through the classic portal or create a script that deployed all of the resources in the correct order. To delete a solution, you had to delete each resource individually. You could not easily apply and update access control policies for related resources. Finally, you could not apply tags to resources to label them with terms that help you monitor your resources and manage billing.

With the introduction of the new Azure Portal it is possible to view resources (such as websites, virtual machines and databases) as a single logical unit. This logical unit is called a resource group. The following screenshot shows a resource group called ecommerce-westus which contains a website, a SQL database and an Application Insights instance.

resource_group


Resource groups aren’t visible in the old portal, but almost everything within your Azure subscription exists within a set of resource groups that were created by default when the preview portal opened for business. If you access the new portal, press the Browse button on the jump bar (the icons down the left-hand side of the screen) and navigate to resource groups, you should see a whole bunch listed.
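The same resource groups are also visible from Azure PowerShell; a quick sketch, assuming the AzureRM module is installed and you are logged in:

# List every resource group in the current subscription
Get-AzureRmResourceGroup | Select-Object ResourceGroupName, Location

# Drill into the resources of a single group, e.g. ecommerce-westus
Find-AzureRmResource -ResourceGroupNameContains "ecommerce-westus" |
    Select-Object Name, ResourceType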

The benefit of resource groups is that they allow an Azure administrator to roll-up billing and monitoring information for resources within a resource group and manage access to those resources as a set. This can be extremely useful when you have a single subscription but you need to do cost recovery on the resources used by a customer or internal department.

For more on Azure Resource Manager see the docs – https://azure.microsoft.com/en-us/documentation/articles/resource-group-overview/, https://channel9.msdn.com/Shows/Edge/Edge-Show-121-Azure-Resource-Manager, http://www.codeproject.com/Articles/854592/Using-Azure-Resource-Manager

As an example I created a template in Visual Studio 2015. This sample, with an ARM template and parameters file, creates two Windows Server 2012 R2 VMs with IIS configured using DSC. It also installs one SQL Server 2014 Standard Edition VM, a VNET with two subnets, an NSG, a load balancer, and NAT and probing rules.

resources_geroup2

To start I used one of the open source templates available on GitHub and just added\modified resources.

So how does one author these templates in VS 2015?

Prerequisites: You will need Visual Studio 2015 installed. In my case I have VS 2015 Update 1 and Azure Tools SDK v2.8.2.

Create an Azure Resource Group project.

When you first create an Azure Resource Group project in Visual Studio you are presented with a list of common templates you can use to start defining your desired infrastructure – for example, if you wanted a set of virtual machines that could be used to run a scalable web server workload.

RG3

Project structure for Resource Group Project.

Using the blank template project results in a project that contains everything you need to begin authoring and eventually deploy your template. As shown below, the project structure contains three folders: Scripts, Templates, and Tools.

rg4

The Scripts folder contains the PowerShell script that is used to deploy the environment you describe in the project. This is a fully functional script that essentially requires no editing. It just works! However, if you have a need to modify the script to support some advanced deployment scenarios you can do so.

The Templates folder contains the JSON files that describe the environment you want to create. The azuredeploy.json file is where you describe the resources you want in your environment, such as the storage account, virtual machine, and other required resources. The azuredeploy.parameters.json file is where you can provide parameter values that you want passed into the template during deployment, such as the type of storage account, size of the virtual machine, credentials to use to sign into the virtual machine, and so on.

The Tools folder contains AzCopy.exe and is used in scenarios where your deployment template needs to copy files to a storage account used for deploying application and configuration files into a resource. For a virtual machine, this typically would be Desired State Configuration (DSC) and DSC resources that you want pushed into the virtual machine after it is provisioned, to do things like configure Active Directory or IIS Web Server roles in the virtual machine. The script in the Scripts folder uses AzCopy in the Tools folder to copy these kinds of artifacts to a storage account so that ARM can then copy them from storage into the virtual machine after the instance is fully provisioned.

More information is available here – https://azure.microsoft.com/en-us/documentation/articles/vs-azure-tools-resource-groups-deployment-projects-create-deploy/, https://azure.microsoft.com/en-us/blog/azure-resource-manager-2-5-for-visual-studio/

In my case the finished VS project structure looks like this:

Capture

Note that AzureDeploy.json is the JSON template that describes all of the above-mentioned infrastructure I want to deploy in Azure, and Deploy-AzureResourceGroup.ps1 is a PowerShell script to connect to Azure and deploy it. There is also a parameters file with passwords, subscription names, etc., and finally a test project to test such a deployment in VS – all part of my test solution in Visual Studio.
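For reference, the generated script can also be run directly from a PowerShell prompt. The parameter names below are the ones the Visual Studio ARM project template generated at the time; treat this as a hedged sketch, since they may differ between SDK versions:

# Deploy the solution to a resource group named 'test' in Central US,
# uploading the DSC artifacts to the staging storage account first
.\Deploy-AzureResourceGroup.ps1 `
    -ResourceGroupName 'test' `
    -ResourceGroupLocation 'centralus' `
    -TemplateFile '.\Templates\azuredeploy.json' `
    -TemplateParametersFile '.\Templates\azuredeploy.parameters.json' `
    -UploadArtifacts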

So, using my test template project, I created all of the above resources in resource group test in the Central US data center:

rg5

Obviously these are just first baby steps, but they should show you what is possible with this technology. For more on Azure ARM templates see – https://blogs.perficient.com/microsoft/2016/03/azure-arm-template-define-web-app-application-settings/, https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-create-first-template, https://blogs.msdn.microsoft.com/kaevans/2015/11/22/creating-arm-templates-with-azure-resource-explorer/, https://azure.microsoft.com/en-us/resources/templates/

Hope this helps.

Introducing Microsoft Azure Service Fabric – a groundbreaking PaaS Microservices Platform in Microsoft Azure and On Premises

sf

Azure Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage scalable and reliable microservices. Service Fabric also addresses the significant challenges in developing and managing cloud applications. Developers and administrators can avoid complex infrastructure problems and focus on implementing mission-critical, demanding workloads that are scalable, reliable, and manageable. Service Fabric represents the next-generation middleware platform for building and managing these enterprise-class, tier-1, cloud-scale applications.

That’s a great definition, but what exactly does it mean? Service Fabric is a base for a new type of enterprise services-based application, on premises and in the cloud, with built-in scalability, fault tolerance, multi-OS deployment and containerization.

There are two main reasons why Service Fabric is able to achieve such massive scalability for applications. The first has to do with application design. Modern, highly scalable applications are generally built around the use of microservices.

The term “microservices” has been thrown around a lot over the last two or three years and has taken on several different meanings. From a Service Fabric perspective, microservices refer to independently deployable services upon which developers can build their applications. Microservices can be almost anything. Some common examples of microservices include protocol gateways, queues, caches, shopping carts, user profiles and inventory processing services.

The other thing that makes it possible for applications to achieve such a large scale when leveraging Service Fabric is the Service Fabric cluster. The individual microservices reside inside containers, and those containers, in turn, are deployed across a Service Fabric cluster. A Service Fabric cluster might contain hundreds of servers and tens of thousands of containers. The Service Fabric can scale to such an extent that Microsoft uses it to run widely used applications such as Cortana, Skype for Business, Microsoft Intune and Power BI.

The foundation of Azure Service Fabric is the system services that provide the underlying support for customers’ applications. These services include the following:

  • Failover Manager ensures availability by shifting resources within the cluster, including when resources are added or removed.
  • Cluster Manager interacts with Failover Manager to ensure application and service constraints are not violated.
  • Naming Service provides name resolution to ensure all services within the application are accessible. Since the workloads are dynamic, client applications are not expected to understand what underlying infrastructure is supporting a particular service; the Naming Service facilitates routing between clients and the underlying service.
  • File Store Service provides local data and assembly persistence across nodes in the service.


SystemServices


The diagram below shows the major subsystems of Service Fabric:

sf

  • Management Subsystem – manages the lifecycle of applications and services
  • Testability Subsystem – helps developers test services through simulated faults before and after deploying applications and services to production
  • Communication Subsystem – resolves service locations
  • Reliability Subsystem – responsible for reliability through replication, resource management and failover
  • Hosting and Activation – manages the lifecycle of an application on a single node
  • Application Model – enables tooling
  • Native and Managed APIs – exposed to developers


On top of the Service Fabric cluster, customers can deploy two different types of applications:

  • Stateless, where application state is stored out-of-band, such as in Azure SQL Database or Azure Storage.
  • Stateful, where state is replicated to local persistence. As a result there is a reduction in latency and complexity compared to traditional three-tier architectures, where developers are typically left to implement state logic themselves.

The Azure Service Fabric platform is responsible for the orchestration of these microservices and the microservices should not have any affinity to a particular node or infrastructure. The following image illustrates how a developer may choose to deploy their application as a series of microservices.  Should a node disappear, it is the responsibility of the Azure Service Fabric platform to ensure the microservice(s) are brought up on a remaining node.

ApplicationServices


Getting started with Service Fabric is relatively easy. To start off, you’re going to need a PC running a supported OS (Windows 8, Windows 8.1, Windows Server 2012 R2 or Windows 10) and a copy of Microsoft Visual Studio 2015. You’ll use this PC to install the required runtime SDK and to set up a local cluster. If you don’t have a suitable PC or if your Visual Studio licenses are in short supply, then you might consider using an Azure virtual machine. The Azure virtual machine gallery contains the option to deploy a virtual machine that runs Visual Studio 2015 Enterprise.

Once you have Visual Studio 2015 up and running, the next thing you’ll need to do is download and install the runtime components, the SDK and the required tools. You can download these from – https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-get-started

The good news on the Service Fabric development front is that VS 2015 already provides you with a number of templates to start development. Templates are classified as Reliable Services, Reliable Actors and Web. If your goal is to build an application that’s based on the use of microservices, then you’ll want to choose either a stateless or a stateful reliable service (yes, stateful services are fully supported).

Once you install the Service Fabric SDK and tools on your machine, it will install a local Service Fabric cluster, as you can see via the icon on your taskbar below. You will create and start the cluster, and once you can manage it you should be able to see a management screen like this:

manage


Let’s create our first very simple application for Service Fabric. A Service Fabric application can contain one or more services, each with a specific role in delivering the application’s functionality. Create an application project, along with your first service project, using the New Project wizard. You can also add more services later if you want.

  • Launch Visual Studio as an administrator
  • Click File > New Project > Cloud > Service Fabric Application.
  • Name the application and click OK

vs

  • On the next page, choose Stateful as the first service type to include in your application. Name it and click OK.
  • Visual Studio will create the appropriate project for a Service Fabric stateful service.

se


Once you press F5 in Visual Studio, this application will be deployed for debugging. When the cluster is ready, you get a notification from the local cluster system tray manager application included with the SDK.

local-cluster-manager-notification


If you have added no custom code at all, in the case of the stateful service template, the messages simply show the counter value incrementing in the RunAsync method of MyStatefulService.cs.

The code, as you can see, is pretty self-explanatory:


protected override async Task RunAsync(CancellationToken cancellationToken)
        {
            // TODO: Replace the following sample code with your own logic 
            //       or remove this RunAsync override if it's not needed in your service.

            var myDictionary = await this.StateManager.GetOrAddAsync<IReliableDictionary<string, long>>("myDictionary");

            while (true)
            {
                cancellationToken.ThrowIfCancellationRequested();

                using (var tx = this.StateManager.CreateTransaction())
                {
                    var result = await myDictionary.TryGetValueAsync(tx, "Counter");

                    ServiceEventSource.Current.ServiceMessage(this, "Current Counter Value: {0}",
                        result.HasValue ? result.Value.ToString() : "Value does not exist.");

                    await myDictionary.AddOrUpdateAsync(tx, "Counter", 0, (key, value) => ++value);

                    // If an exception is thrown before calling CommitAsync, the transaction aborts, all changes are 
                    // discarded, and nothing is saved to the secondary replicas.
                    await tx.CommitAsync();
                }

                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
            }
        }

It’s important to remember that the local cluster is real. Stopping the debugger removes your application instance and unregisters the application type. The cluster continues to run in the background, however. You have several options to manage the cluster:

To shut down the cluster but keep the application data and traces, click Stop Local Cluster in the system tray app. To delete the cluster entirely, click Remove Local Cluster in the system tray app.

In the near future I will blog more about Service Fabric, Reliable Services and finally creating and deploying to a cluster in Azure, so stay tuned.

For more on Service Fabric see below:

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-overview ,

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-technical-overview,

https://techcrunch.com/2015/04/20/microsoft-announces-azure-service-fabric-a-new-framework-for-building-scalable-cloud-services/,

http://darylscorner.com/2016/02/overview-of-azure-service-fabric/

http://www.c-sharpcorner.com/UploadFile/mahesh/creating-a-service-fabr

Forecast Cloudy – Azure SQL Data Warehouse Introduction


Many enterprises today are moving into the era of real-time analytics. As you may be aware, for non-relational data typical Hadoop batch analysis architectures are enhanced by technologies like Storm, Kafka, Azure Stream Analytics, Amazon Kinesis, etc., and are moving from purely batch analytics to hybrid real-time\batch architectures like Lambda (http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/) or Zeta (http://radar.oreilly.com/2015/04/zeta-architecture-hexagon-is-the-new-circle.html) with both speed and batch layers. For structured data, a new generation of elastic cloud-based data warehouses has been introduced, like Amazon Redshift and Azure SQL Data Warehouse. I will blog about Redshift later, but for now will take a bit to introduce Azure SQL DW.

Azure SQL Data Warehouse is a turn-key cloud data warehousing and analytics solution (related to the Microsoft APS on-premises solution). It is based on existing Azure services, effectively and conveniently integrating them together under one roof. The key characteristics of Azure SQL Data Warehouse are:

  • Can scale (grow or shrink) to any size on demand, in seconds
  • Can import data from any source (using Azure Data Factory): Hadoop, NoSQL databases, SQL databases
  • Can visualize data and run reports using PowerBI or another service
  • Can use Azure Machine Learning to analyze, model and predict data
  • Can expose Machine Learning models as APIs for apps
  • Provides full ANSI SQL support

image

MPP Architecture. At its core, SQL Data Warehouse runs using Microsoft’s massively parallel processing (MPP) architecture, originally introduced in the Microsoft APS\PDW appliance and also successfully used by some competing analytics appliance products like Pivotal Greenplum. This architecture takes advantage of built-in data warehousing performance improvements and also allows SQL Data Warehouse to easily scale out and parallelize computation of complex SQL queries. In addition, SQL Data Warehouse’s architecture is designed to take advantage of its presence in Azure. Combining these two aspects, the architecture breaks up into 4 key components:

SQL Data Warehouse Architecture

  • Control Node: You connect to the control node when using SQL Data Warehouse with any development, loading, or business intelligence tools. In SQL Data Warehouse, the control node is a SQL Database, and when connecting it looks and feels like a standard SQL Database. However, under the surface, it coordinates all of the data movement and computation that takes place in the system. When a command is issued to the control node, it breaks it down into a set of queries that will be passed on to the compute nodes of the service.
  • Compute Nodes: Like the control node, the compute nodes of SQL Data Warehouse are powered using SQL Databases. Their job is to serve as the compute power of the service. Behind the scenes, any time data is loaded into SQL Data Warehouse, it is distributed across the nodes of the service. Then, any time the control node receives a command it breaks it into pieces for each compute node, and the compute nodes operate over their corresponding data. After completing their computation, compute nodes pass partial results to the control node, which then aggregates results before returning an answer.
  • Storage: All storage for SQL Data Warehouse is standard Azure Storage Blobs. This means that when interacting with data, compute nodes are writing and reading directly to/from Blobs. Azure Storage’s ability to expand transparently and nearly limitlessly allows us to automatically scale storage, and to do so separately from compute. Azure Storage also allows us to persist storage while scaling or paused, streamline our back-up and restore process, and have safer, more fault tolerant storage.
  • Data Movement Services: The final piece holding everything together in SQL Data Warehouse is the Data Movement Services, which allow the control node to communicate and pass data to all of the compute nodes. They also enable the compute nodes to pass data between each other, which gives them access to data on other compute nodes and allows them to get the data that they need to complete joins and aggregations.

This MPP approach allows SQL Data Warehouse to take a divide-and-conquer approach as described above when solving large data problems. Since the data in SQL Data Warehouse is divided and distributed across the compute nodes of the service, each compute node is able to operate on its portion of the data in parallel. Finally, results are passed to the control node and aggregated before being passed back to the users.

image

Elasticity. Currently the majority of cloud-based database and data warehouse services are provisioned with fixed storage and compute resources. Resources cannot be resized without compromising availability and performance. This means the service user typically ends up with over-provisioned, underutilized, expensive resources to accommodate possible peak demand. In the worst case, under-provisioned resources are unable to handle sudden work overloads.

Azure SQL Data Warehouse (DW) is a fully managed, elastic, petabyte-scale columnar data warehouse service. Both Amazon Redshift and Azure SQL DW use a massively parallel processing (MPP) architecture to deliver breakthrough performance. But unlike Amazon Redshift, the Azure SQL DW architecture allows data and compute to scale independently without any downtime. In addition, Azure SQL DW enables one to dynamically grow and shrink resources, taking advantage of best-in-class price and performance.

Most importantly, with Azure SQL DW, storage and compute are billed separately. Storage rates are based on standard blob rates. Compute is priced based on DWU (Data Warehouse Unit) blocks – a new metric to measure the compute capacity when using Azure SQL DW. Performance in Azure SQL DW scales linearly, and changing from one compute block to another (for instance 100 DWUs to 1000 DWUs) happens in seconds without disruption.

image

DWU (Data Warehouse Unit) capacity blocks. Data Warehouse Units are a new concept delivered by SQL Data Warehouse for capacity. Unfortunately, outside of the article here – https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-performance-scale/ – there isn’t much information on DWUs. Especially nice would be some sort of calculator to help translate traditional DW metrics like IOPS or queries per hour to DWUs.

At this time Microsoft recommends the following approach to properly sizing your Azure DW capacity in DWUs:

  • For a data warehouse in development, begin by selecting a small number of DWUs
  • Monitor your application performance, observing the number of DWUs selected compared to the performance you observe
  • Determine how much faster or slower performance should be for you to reach the optimum performance level for your requirements, assuming linear scale
  • Increase or decrease the number of DWUs selected; the service will respond quickly and adjust the compute resources to meet the DWU requirements (a PowerShell sketch of this follows below)
  • Continue making adjustments until you reach an optimum performance level for your business requirements

image
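The DWU adjustments described above don’t have to happen in the portal; the service objective can also be changed from PowerShell. A hedged sketch using the AzureRM SQL cmdlets (resource group, server and database names are placeholders):

# Scale an existing SQL Data Warehouse to 400 DWU; the change takes
# effect in seconds and does not require reloading any data
Set-AzureRmSqlDatabase -ResourceGroupName "mydwrg" `
                       -ServerName "mydwserver" `
                       -DatabaseName "mydw" `
                       -RequestedServiceObjectiveName "DW400"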

Hadoop Integration.

SQL DW is built with the same technology as APS, except that instead of using SQL Server 2014 it uses version 12 of Azure SQL Database. It also includes PolyBase. PolyBase allows APS and SQL DW to query data in a Hadoop cluster, either directly or by pushing some of the work to Hadoop itself, so the query is actually run using the Hadoop cluster’s CPUs. The Hadoop data is made to look as if it were local to the data warehouse, so that end users can use their existing skill sets to query it via SQL or any reporting tool that uses SQL (like Excel, SSRS, Power BI, etc.). PolyBase can integrate with Hadoop in this manner via a Microsoft HDInsight cluster that can either be inside APS or in the cloud, or via a Hortonworks or Cloudera cluster.

image

So, enough theory, let’s spin up our own SQL Data Warehouse in Azure. Unlike with traditional systems this will not take weeks or months, but minutes. From the Azure Preview Portal (it’s not on the Classic Portal) go to New -> Data + Storage -> SQL Data Warehouse.

image

Add a DW name; if you already have a version 12 SQL Azure server you can use it, otherwise you will have to create a new server instance with a login and password. I picked the lowest available DWU at 100.

image

As a result you will see an animated GIF in the upper right corner with the status:

image

A few minutes and you have yourself a real MPP DW instance in the cloud. Finally, after adding my client IP as a firewall exclusion, I can simply use SSMS to log in to this server and DB.

image

For more info on Azure SQL DW see – https://azure.microsoft.com/en-us/services/sql-data-warehouse/, https://azure.microsoft.com/en-us/documentation/services/sql-data-warehouse/, http://techcrunch.com/2015/04/29/microsoft-introduces-azure-sql-data-warehouse/, http://blogs.technet.com/b/dataplatforminsider/archive/2015/06/24/azure-sql-data-warehouse-opens-for-limited-public-preview.aspx

Meet Redis in the Clouds – Azure PaaS Introduces Premium Public Preview with Cluster and Persistence

redis_logo

Azure Redis Cache is a distributed, managed cache that helps you build highly scalable and responsive applications by providing faster access to your data. I blogged quite a lot previously on Redis and its features here – https://gennadny.wordpress.com/category/redis/ and on Azure Redis PaaS offering here – https://gennadny.wordpress.com/2015/01/19/forecast-cloudy-using-microsoft-azure-redis-cache/.

The new Premium tier includes all Standard-tier features, plus better performance, bigger workloads, disaster recovery, and enhanced security. Additional features include Redis persistence, which allows you to persist data stored in Redis; Redis Cluster, which automatically shards data across multiple Redis nodes, so you can create workloads using increased memory (more than 53 GB) for better performance; and Azure Virtual Network deployment, which provides enhanced security and isolation for your Azure Redis Cache, as well as subnets, access control policies, and other features to help you further restrict access.

To me, a huge disappointment with Redis on Windows (MSOpenTech Redis) and Azure has been the inability to scale out across nodes, so the news of Azure Redis Cluster is particularly welcome.

Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes.

Redis Cluster also provides some degree of availability during partitions; in practical terms, the ability to continue operations when some nodes fail or are not able to communicate. However, the cluster stops operating in the event of larger failures (for example, when the majority of masters are unavailable).

So in practical terms, what do you get with Redis Cluster?

  • The ability to automatically split your dataset among multiple nodes
  • The ability to continue operations when a subset of the nodes are experiencing failures or are unable to communicate with the rest of the cluster.

Redis Cluster does not use consistent hashing, but a different form of sharding where every key is conceptually part of what is called a hash slot. Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:

  • Node A contains hash slots from 0 to 5500.
  • Node B contains hash slots from 5501 to 11000.
  • Node C contains hash slots from 11001 to 16383.

This makes it easy to add and remove nodes in the cluster. For example, if I want to add a new node D, I need to move some hash slots from nodes A, B and C to D. Similarly, if I want to remove node A from the cluster I can just move the hash slots served by A to B and C. When node A is empty I can remove it from the cluster completely. Because moving hash slots from one node to another does not require stopping operations, adding and removing nodes, or changing the percentage of hash slots held by nodes, does not require any downtime.
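To make the key-to-slot mapping concrete, here is a small PowerShell sketch of the function Redis Cluster applies to every key: the CRC16/XMODEM checksum of the key, modulo 16384 (this ignores the {hash tag} refinement Redis also supports):

function Get-RedisHashSlot {
    param([string]$Key)

    # CRC16/XMODEM: polynomial 0x1021, initial value 0 - the checksum
    # Redis Cluster computes over the key bytes
    $crc = 0
    foreach ($b in [System.Text.Encoding]::UTF8.GetBytes($Key)) {
        $crc = $crc -bxor ($b -shl 8)
        for ($i = 0; $i -lt 8; $i++) {
            if ($crc -band 0x8000) {
                $crc = (($crc -shl 1) -bxor 0x1021) -band 0xFFFF
            }
            else {
                $crc = ($crc -shl 1) -band 0xFFFF
            }
        }
    }

    # 16384 slots, numbered 0-16383, divided among the cluster nodes
    return $crc % 16384
}

Get-RedisHashSlot "user:1000"   # returns the slot number this key maps to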

Note of caution.

Redis Cluster is not able to guarantee strong consistency. In practical terms this means that under certain conditions it is possible that Redis Cluster will lose writes that were acknowledged by the system to the client.

The first reason why Redis Cluster can lose writes is because it uses asynchronous replication. This means that during writes the following happens:

  • Your client writes to the master A.
  • The master A replies OK to your client.
  • The master A propagates the write to its slaves A1, A2 and A3.

As you can see, A does not wait for an acknowledgement from A1, A2, A3 before replying to the client, since this would be a prohibitive latency penalty for Redis. So if your client writes something, A acknowledges the write but then crashes before being able to send the write to its slaves, one of the slaves (that did not receive the write) can be promoted to master, losing the write forever.

Still, this is really exciting news for many of us in the Azure NoSQL and distributed in-memory cache world. So I logged into the new Azure Portal and yes, creating a new Redis Cache I saw a Premium option:

redispremium

As you create your Redis Premium cache you can specify the number of cluster nodes\shards, as well as a persistence model, for the first time!

redispremium2
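For what it’s worth, the same premium cache can be provisioned from PowerShell with the AzureRM Redis cmdlets. A hedged sketch (names and sizes are placeholders; -ShardCount is what turns clustering on):

# Create a Premium P1 cache with 3 shards (cluster nodes)
New-AzureRmRedisCache -ResourceGroupName "myredisrg" `
                      -Name "myrediscluster" `
                      -Location "East US 2" `
                      -Sku "Premium" `
                      -Size "P1" `
                      -ShardCount 3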

A few minutes and I have myself a working 3-node cluster:

image

Now I can access this cluster just like I accessed a single Redis instance previously.

My next steps are to dig deeper into Azure Redis Cluster, so stay tuned for updates.

Announcement from Azure Redis PG – https://azure.microsoft.com/en-us/blog/azure-redis-cache-public-preview-of-premium-tier/

Bees and Elephants In The Clouds – Using Hive with Azure HDInsight

hdinsight

For today’s blog entry I will combine both Cloud and Big Data to show you how easy it is to do Big Data in the Cloud.

Microsoft Azure is a cloud service provided by Microsoft, and one of the services offered is HDInsight, an Apache Hadoop-based distribution running in the cloud on Azure. Another service offered by Azure is blob storage. Azure Storage functions as the default file system and stores data in the cloud, not on-premises or on the nodes. This way, when you are done running an HDInsight job, the cluster can be decommissioned (to save you money) while your data remains intact.

8484_110113_2116_TheHDInsigh4

HDInsight provides an excellent scenario for IoT analytics, PoC\development work, bursting of on-premises resources to the cloud, etc.

Untitled picture

The first thing you will need, of course, is a Microsoft Azure subscription. A subscription has many moving parts, and http://azure.microsoft.com has interactive pricing pages, a pricing calculator, and plenty of documentation. If you have an MSDN subscription you will get a limited Azure subscription as well (you just have to enable and link it), or you can set up a trial subscription at https://azure.microsoft.com/en-us/pricing/free-trial/. For this little tutorial I will be using my MSDN subscription.

Assuming you know how to set up your subscription and can log in to the Azure Management Portal, our next step will be to create an Azure storage account. Now, I already have a number of storage accounts, but will create one just for this tutorial:

Hit +NEW in the lower left side of the portal and expand your choices, picking Storage:

image

Pick a unique URL and data center location, and hit Create Storage Account. Easy:

image

Once the account is created I will create a storage container within this account. In your account dashboard, navigate to the Containers tab and hit Create New Container:

image

Enter a name for the container, and if there is any chance of sensitive or PII data being loaded to this container, choose Private access. Private access requires a key. HDInsight can be configured with that key during creation, or keys can be passed in for individual jobs.

image

This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.

Next I will create the Hadoop\HDInsight cluster. Here, as with a lot of things in Azure, I can use either the Quick Create or Custom Create option, the latter giving me more control of course. Let’s create an HDInsight cluster using the Custom Create option.

Back to +New in the lower left corner of the portal:

image

Obviously pick a unique name and the storage account that we created in the previous step; you also will need to provide a strict password and finally hit the Create button. It takes a bit to create a cluster (way faster than setting one up on your own, trust me) and while that is happening you will see this:

image
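If you’d rather script the cluster creation than click through the portal, the classic-portal-era HDInsight cmdlets can do it too. A hedged sketch (names, node count and credentials are placeholders, and parameter names may vary between SDK versions):

# Look up the key of the storage account created earlier
$storageAccount = "<storage account name>"
$storageKey = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary

# Create a 4-node HDInsight cluster bound to the default container
New-AzureHDInsightCluster -Name "<cluster name>" `
    -Location "East US 2" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName "<container name>" `
    -ClusterSizeInNodes 4 `
    -Credential (Get-Credential)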

Once that is done I can go to the Query Console from the Dashboard for my cluster:

image

After I enter the user name (admin) and password I set up during cluster creation, I can see the Query Console:

image

Just for this tutorial I will pick the Twitter Trend Analysis sample. In this tutorial the idea is to learn how to use Hive to get a list of Twitter users that sent the most tweets containing a particular word. The tutorial has a sample data set of tweets housed in Azure Blob Storage (that’s why I created a storage account and pointed my HDInsight cluster there) – wasb://gennadykhdp@gennadykhdpstorage.blob.core.windows.net/HdiSamples/TwitterTrendsSampleData/

First I will create the raw tweets table from the data:

DROP TABLE IF EXISTS HDISample_Tweets_raw;

--create the raw Tweets table on json formatted twitter data
CREATE EXTERNAL TABLE HDISample_Tweets_raw(json_response STRING)
STORED AS TEXTFILE LOCATION 'wasb://gennadykhdp@gennadykhdpstorage.blob.core.windows.net/HdiSamples/TwitterTrendsSampleData/';

The following Hive queries will create a Hive table called HDISample_Tweets where you will store the parsed raw Twitter data. The Create Hive Table query shows the fields that the HDISample_Tweets table will contain. The Load query shows how the HDISample_Tweets_raw table is parsed to be placed into the HDISample_Tweets table.

DROP TABLE IF EXISTS HDISample_Tweets;
CREATE TABLE HDISample_Tweets(
    id BIGINT,
    created_at STRING,
    created_at_date STRING,
    created_at_year STRING,
    created_at_month STRING,
    created_at_day STRING,
    created_at_time STRING,
    in_reply_to_user_id_str STRING,
    text STRING,
    contributors STRING,
    retweeted STRING,
    truncated STRING,
    coordinates STRING,
    source STRING,
    retweet_count INT,
    url STRING,
    hashtags array<string>,
    user_mentions array<string>,
    first_hashtag STRING,
    first_user_mention STRING,
    screen_name STRING,
    name STRING,
    followers_count INT,
    listed_count INT,
    friends_count INT,
    lang STRING,
    user_location STRING,
    time_zone STRING,
    profile_image_url STRING,
    json_response STRING);
FROM HDISample_Tweets_raw
INSERT OVERWRITE TABLE HDISample_Tweets
SELECT
    CAST(get_json_object(json_response, '$.id_str') as BIGINT),
    get_json_object(json_response, '$.created_at'),
    CONCAT(SUBSTR (get_json_object(json_response, '$.created_at'),1,10),' ',
    SUBSTR (get_json_object(json_response, '$.created_at'),27,4)),
    SUBSTR (get_json_object(json_response, '$.created_at'),27,4),
    CASE SUBSTR (get_json_object(json_response, '$.created_at'),5,3)
        WHEN 'Jan' then '01'
        WHEN 'Feb' then '02'
        WHEN 'Mar' then '03'
        WHEN 'Apr' then '04'
        WHEN 'May' then '05'
        WHEN 'Jun' then '06'
        WHEN 'Jul' then '07'
        WHEN 'Aug' then '08'
        WHEN 'Sep' then '09'
        WHEN 'Oct' then '10'
        WHEN 'Nov' then '11'
        WHEN 'Dec' then '12' end,
    SUBSTR (get_json_object(json_response, '$.created_at'),9,2),
    SUBSTR (get_json_object(json_response, '$.created_at'),12,8),
    get_json_object(json_response, '$.in_reply_to_user_id_str'),
    get_json_object(json_response, '$.text'),
    get_json_object(json_response, '$.contributors'),
    get_json_object(json_response, '$.retweeted'),
    get_json_object(json_response, '$.truncated'),
    get_json_object(json_response, '$.coordinates'),
    get_json_object(json_response, '$.source'),
    CAST (get_json_object(json_response, '$.retweet_count') as INT),
    get_json_object(json_response, '$.entities.display_url'),
    ARRAY(  
        TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[0].text'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[1].text'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[2].text'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[3].text'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[4].text')))),
    ARRAY(
        TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[0].screen_name'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[1].screen_name'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[2].screen_name'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[3].screen_name'))),
        TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[4].screen_name')))),
    TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[0].text'))),
    TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[0].screen_name'))),
    get_json_object(json_response, '$.user.screen_name'),
    get_json_object(json_response, '$.user.name'),
    CAST (get_json_object(json_response, '$.user.followers_count') as INT),
    CAST (get_json_object(json_response, '$.user.listed_count') as INT),
    CAST (get_json_object(json_response, '$.user.friends_count') as INT),
    get_json_object(json_response, '$.user.lang'),
    get_json_object(json_response, '$.user.location'),
    get_json_object(json_response, '$.user.time_zone'),
    get_json_object(json_response, '$.user.profile_image_url'),
    json_response
WHERE (LENGTH(json_response) > 500);

These queries show you how to analyze the HDISample_Tweets table to determine the Twitter users that sent out the most tweets containing the word ‘Azure’. The results are saved into a new table called HDISample_topusers.

DROP TABLE IF EXISTS HDISample_topusers;

--create the topusers hive table by selecting from the HDISample_Tweets table
CREATE TABLE IF NOT EXISTS  HDISample_topusers(name STRING, screen_name STRING, tweet_count INT);

INSERT OVERWRITE TABLE  HDISample_topusers
SELECT name, screen_name, count(1) as cc
    FROM HDISample_Tweets
    WHERE text LIKE '%Azure%'
    GROUP BY name, screen_name
    ORDER BY cc DESC LIMIT 10;

Now I will run it all, submit the job and watch the Job History Console:

image
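The console isn’t the only way to run these queries, by the way; the same statements can be submitted from Azure PowerShell with the classic HDInsight cmdlets of that era. A hedged sketch (the cluster name is the one from this tutorial):

# Point the Hive cmdlets at the cluster, then run a query against it
Use-AzureHDInsightCluster "gennadykhdp"
Invoke-AzureHDInsightHiveJob "SELECT * FROM HDISample_topusers;"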

Finally, as the data has been inserted into my HDISample_topusers table, I can query it as well:

select * from HDISample_topusers;

And get the results:

luciasannino	ciccialucia1960	1
Valerie M	valegabri1	1
Tony Zake	TonyZake	1
Sami Ghazali	SamiGhazali	1
JOSÈ ADARMES	ADARMESJEB	1
Eric Hexter	ehexter	1
Daniel Neumann	neumanndaniel	1
Anthony Bartolo	WirelessLife	1


For more details see – https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-hive/, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-tutorial-get-started-windows/, http://www.developer.com/db/using-hive-in-hdinsight-to-analyze-data.html, http://blogs.msdn.com/b/cindygross/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx