Forecast Cloudy – Going Big Data with Azure Data Lake Analytics Part 1 – Introduction

adla

Previously, I wrote a post about Google Big Query , GCP service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. Similar services are provided now by all public cloud vendors, Microsoft Azure has a service known as Azure Data Lake Analytics , that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blog storage.

According to Microsoft, Azure Data Lake Analytics lets you:

  • Analyze data of any kind and of any size.
  • Speed up and sharpen your development and debug cycles.
  • Use the new U-SQL processing language built especially for big data.
  • Rely on Azure’s enterprise-grade SLA.
  • Pay only for the processing resources you actually need and use.
  • Benefit from the YARN-based technology extensively tested at Microsoft.

ADLA is based on top of the YARN technology. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs. The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

yarn_architecture

For more on YARN architecture see – https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html .  What being based on YARN helps Azure Data Lake Analytics helps is with extreme scalability. Data Lake Analytics can work with a number of Azure data sources: Azure Blob storage, Azure SQL database, Azure SQL Data Warehouse, Azure Store, and Azure SQL Database in Azure VM. Azure Data Lake Analytics is specially optimized to work with Azure Data Lake Store—providing the highest level of performance, throughput, and parallelization for your big data workloads.Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of SQL with the expressive power of C#. It takes a bit to learn for typical SQL person, but its pretty powerful.

So enough theory and let me show how you can cruch Big Data workloads without creating a large Hadoop cluster or setting up infrastructure, paying only for what you used in storage and compute.

The first thing we will need before starting to work in Azure cloud is subscription. If you dont have one, browse to https://azure.microsoft.com/en-us/free/?v=18.23 and follow the instructions to sign up for a free 30-day trial subscription to Microsoft Azure.

In my example I will use sample retail dataset with couple of stock and sales data files, which I will upload to Azure Data Lake Store.  The Azure Data Lake store is an Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Azure Data Lake Store is built for running large scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers. This improves the read throughput when reading the file in parallel for performing data analytics.

To upload these files , I will first create Azure Data Lake Store called adlsgennadyk:

  • Navigate to Azure Portal
  • Find service called Data Lake Storage Gen 1
  • Click Create button

dls1

In the form , I will name my Azure Data Lake Store, pick Azure Resource Group where it will reside and choose billing model, which can be either usual pay as you go or prepayment in advance.

dls2

Once you created your Data Lake Storage, I will click on Data Explorer button to launch that tool and upload files I will be analyzing

dls3

Next, I will use the tool to upload demo retail dataset file called stock, that stores retail stocked product information in tab delimted format .

dls4

Here is the dataset as can be seen in Excel:

dls5

Now, that data has been uploaded to Azure lets create instance of Azure Data Lake Analytics service. Again action sequence is the same:

  • Navigate to Azure Portal
  • Find service called Data Lake Analytics
  • Click Create button

Resulting form is very similar to what I did with storage above, except I will point my ADLA instance to my above created storage instance.

adla11

Once  Azure Data Lake Analytics instance is created you are presented with this screen:

adla12

Once I click on the New Job button I can run brand new query against my files in Azure Data Lake Storage. I will show you simplest U-SQL script here:

@stock = 
    EXTRACT  
            Id   int, 
          Item string 
    FROM "/stock.txt"
    USING Extractors.Tsv();

OUTPUT @stock
    TO "/SearchLog_output.tsv"
    USING Outputters.Tsv();

Here is what I just asked Azure Data Lake Analytics to do.  We extract all mof the data from a file and copy output to another one.

Some items to know:

  • The script contains a number of U-SQL keywords: EXTRACT, FROM, TO, OUTPUT, USING, etc.
  • U-SQL keywords are case sensitive. Keep this in mind – it’s one of the most common errors people run into.
  • The EXTRACT statement reads from files. The built-in extractor called Extractors.Tsv handles Tab-Separated-Value files.
  • The OUTPUT statement writes to files. The built-in outputter called Outputters.Tsv handles Tab-Separated-Value files.
  • From the U-SQL perspective files are “blobs” – they don’t contain any usable schema information. So U-SQL supports a concept called “schema on read” – this means the developer specified the schema that is expected in the file. As you can see the names of the columns and the datatypes are specified in the EXTRACT statement.
  • The default Extractors and Outputters cannot infer the schema from the header row – in fact by default they assume that there is no header row (this behavior can overriden)

After job executes you can see its execution analysis and graph:

adla14

Now lets do something more meaningful by introducing WHERE clause to filter the dataset:

@stock = 
    EXTRACT  
            Id   int, 
          Item string
    FROM "/stock.txt"
    USING Extractors.Tsv();

    @output = 
    SELECT *
    FROM @stock
    WHERE Item == "Tape dispenser (Black)";


OUTPUT @output
    TO "/stock2_output.tsv"
    USING Outputters.Tsv();

The job took about 30 sec to run , including writing to utput file which took most of time here. Looking at the graph by duration one can see where time was spent:

adla15

In the next post I plan to delve deeper into Data Lake Analytics, including using C# functions and USQL catalogs.

 

For more on Data Lake Analytics see – https://docs.microsoft.com/en-us/azure/data-lake-analytics/https://youtu.be/4PVx-7SSs4chttps://cloudacademy.com/blog/azure-data-lake-analytics/, and https://optimalbi.com/blog/2018/02/20/fishing-in-an-azure-data-lake/ 

 

Happy swimming in Azure Data Lake, hope this helps.

Advertisements

Forecast Cloudy – Create Microsoft SQL Azure DB via PowerShell

In my recent job experience working SQL Server migrations to Microsoft SQL Azure DB I had to use a lot of Azure PowerShell to create Microsoft SQL Azure DB on the fly. Its a pretty easy task . Here is a quick sample\tutorial to create SQL Azure DB on the fly.

PowerShell-ISE-Azure-SQL-Server

First thing you need is to connect to Azure and login to your Azure subscription. Log in to your Azure subscription using the Connect-AzureRmAccount command and follow the on-screen directions.

Connect-AzureRmAccount

Next, I usually like to define variables on top of my script. These will hold information like database names, credentials, etc; things that are likely to change when you reuse the script.

# The data center and resource name for your resources
$resourcegroupname = "myResourceGroup"
$location = "US Central"
# The logical server name: Use a random value or replace with your own value (do not capitalize)
$servername = "gennadykServer"
# Set an admin login and password for your database
# The login information for the server
$adminlogin = "ServerAdmin"
$password = "ChangeYourAdminPassword1"
# The ip address range that you want to allow to access your server - change as appropriate
$startip = "0.0.0.0"
$endip = "0.0.0.0"
# The database name
$databasename = "mySampleDatabase"

Once we defined these values we need to create a resource group. You can create an Azure resource group using the New-AzureRmResourceGroup command. A resource group is a logical container into which Azure resources are deployed and managed as a group.

New-AzureRmResourceGroup -Name $resourcegroupname -Location $location

Next, you need to create SQL Azure logical server. A logical server contains a group of databases managed as a group. You can create an Azure SQL Database logical server using the New-AzureRmSqlServer command.

New-AzureRmSqlServer -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -Location $location `
    -SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $adminlogin, $(ConvertTo-SecureString -String $password -AsPlainText -Force))

In order to connect to our logical server we need to setup firewall rule. A server-level firewall rule allows an external application, such as SQL Server Management Studio or the SQLCMD utility to connect to a SQL database through the SQL Database service firewall. We will create an Azure SQL Database server-level firewall rule using the New-AzureRmSqlServerFirewallRule command.

New-AzureRmSqlServerFirewallRule -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -FirewallRuleName "AllowSome" -StartIpAddress $startip -EndIpAddress $endip

Finally, we can create our database. You will need to pick SQL Azure performance tier. You can see more information on SQL Azure DB tiers here.  Below script creates a database with an S0 performance level in the server using the New-AzureRmSqlDatabase command.

New-AzureRmSqlDatabase  -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -DatabaseName $databasename `
    -SampleName "AdventureWorksLT" `
    -RequestedServiceObjectiveName "S0"

For more information see https://docs.microsoft.com/en-us/azure/sql-database/scripts/sql-database-create-and-configure-database-powershellhttps://www.sqlshack.com/how-to-work-handle-sql-azure-databases-with-powershell/

Happy Azure Scripting everyone.

Forecast Cloudy – Migrating SQL Server Database from AWS EC2 to SQL Azure Database using Microsoft Data Migration Assistant Tool

sql-database-windows-azure

Amazon Web Services EC2 is a great IaaS Platform and many organizations use it to host SQL Server instances. Being IaaS platform is opens up easy migration scenarios from on-premises deployments and for many people its a great first step into cloud. However, hopefully soooner than later, many companies realize that they are ready to take next step – move their workloads and databases to PaaS cloud service. This is where Microsoft SQL Azure DB shines.

Microsoft Azure SQL Database (formerly SQL Azure, Windows Azure SQL Database) is a cloud-based database service from Microsoft offering data-storage capabilities. The aim is for users to just communicate with a T-SQL endpoint rather than managing database storage, files, and high availability. Azure SQL Database obviously has number of limitations like lack of support of cross database queries, SQL Broker, etc. meaning that not every database can be migrated to Microsoft Azure SQL DB without prerequisite work. However, for folks that are ready to migrate Microsoft Data Migration Assistant can be viable option for such migration from AWS EC2 IaaS.

For my tutorial I have SQL Server on Windows instance in EC2 running well known venerable AdventureWorks2016 database.  Here it is amongst my other servers in EC2 console:

ec2src

With AWS elastic IP assigned to that Windows machine I can easily connect to default SQL instance via my local SSMS:

ec2sqlsrc

Obviously previously I had to setup proper inbound rules for this machine both with AWS EC2 Security Groups and Windows Firewall.

Next step is to create Microsoft Azure DB target.  This is well documented topic as well, officially described here – https://docs.microsoft.com/en-us/azure/sql-database/sql-database-get-started-portal .

I created S3 tier database calling it SmallSQLTarget.

targetsqlazure

I also had to again go to security group for the server and open inbound ports for SQL Server traffic again (port 1433).  Smart way to do it is to filter by client IP of my workstation where I run my SSMS and will run Data Migration Assistant tool

Now I can make sure that I can connect to my target Microsoft Azure SQL DB from my client machine via SSMS as well:

both

Now lets proceed with migration. For that you will need to download and install Microsoft Data Migration Assistant from – https://www.microsoft.com/en-us/download/details.aspx?id=53595  . Once you installed the tool lets create new project. I will start with assesment project to make sure I dont have any critical incompatibilities that preclude my migration to PaaS platform.

dma1

Tool will check for compatibility issues in my database as well as feature parity to see what needs to be done to migrate any SQL Server instance features to PaaS if anything at all.

dma2

We can then provide credentials to our AWS EC2 based instance and pick database after connecting to the source:

dma3

Now that we added source lets start assessment.  DMA can have any version of SQL Server as assesment source up to 2016.  As assesment done you will get a nice report where you can see all of the potential issues before you migrate. It should something  look like this :

report1

Now lets migrate our database as I found no migration blockers here.

Press on + button and create new migration project

dma5

Lets connect again to instance on EC2 and select DB to migrate

dma6

Next connect to Microsoft Azure SQL DB target by specifying server address and credentials and select target db, in my case SmallSQLAzure.

dma7

After that tool will do quick assesment and present you with schema objects you can migrate , highlighting\marking any objects that may have issues

dma8

After you pick objects you wish to migrate , tool will generate SQL DML script for these objects.  You can then save that script for analysis and changes , as well as apply that script via SSMS on your own. Or you can go ahead and apply the script via tool as I done for this example by pressing Deploy Schema button.

dma9

After schema deployment finish we can proceed with data migration by pressing Migrate Data button.

dma10

Next, pick your tables to migrate and start data migration. Migration of data for smaller database like AdventureWorks takes few minutes.  Everything transferred other than 2 temporal tables , but those are very different from regular database table.  SQL Server 2016 introduced support for system-versioned temporal tables as a database feature that brings built-in support for providing information about data stored in the table at any point in time rather than only the data that is correct at the current moment in time.  There are other ways to migrate temporal tables to SQL Azure that I will post in my next blog posts.

dma11

Well, we are pretty much done. Next, check the data has migrated to target by doing few queries in SSMS:

result

For more on Microsoft Data Migration Assistant see – https://docs.microsoft.com/en-us/sql/dma/dma-overviewhttp://www.jamesserra.com/archive/2017/03/microsoft-database-migration-tools/https://blogs.msdn.microsoft.com/datamigration/2016/08/26/data-migration-assistant-known-issues-v1-0/https://blogs.msdn.microsoft.com/datamigration/2016/11/08/data-migration-assistant-how-to-run-from-command-line/

Hope this helps.

 

 

Collecting Trash Revisited – Garbage Collection Essentials in .NET

Microsoft_net_4

In my recent past as Field Engineer working with customers I had to deal with many managed applications that had performance and stability issues due to memory usage. With .NET 4 there were definitely big interesting additions introduced to memory management , but talking to customers I believe they may have drowned out in a see of tech announcements and many developers in the field simply do not remember these. Therefore here is some basic information on Garbage Collection flavors in .NET and some ways to troubleshoot GC issues for existing applications.

Workstation vs. Server GC 

There are actually two different modes of garbage collection, which you can control using gcServer tag in the runtime configuration part of your config file. Workstation Garbage Collection is set by default. The gcServer configuration flag was introduced long ago , with .NET 2.0, and you can use it to control whether Workstation or Server GC is used on multi processor machines. With server GC, a thread for every core is created just for doing GC. There is also a small object heap and a large object heap created for each GC thread. All of the program’s allocations are spread among these heaps (more on large object heaps later). When no GC is happening, these threads are blocked and do nothing. When a GC is triggered, all of the user threads get paused, and all the GC threads wake up at highest priority and do collection in parallel. All of these optimizations lead to server GC usually being much faster than workstation GC.

So Server GC:

  • Multiprocessor (MP) Scalable, Parallel
  • One GC thread per CPU
  • Program paused during marking

Workstation GC

  • Great for UI driven apps as minimizes pausing during full GC collection
  • Runs slower

Concurrent vs. Background

Now besides Server vs Workstation, there are Concurrent and Background operation modes. Both of them allow a second generation to be collected by a dedicated thread without pausing all user threads. Generation 0 and 1 require pausing all user threads, but they are always the fastest. This of course increases the level of responsiveness an application can deliver.

Concurrent Mode.

Default mode for Workstation GC and will provide a dedicated thread performing GC, even on multiprocessor machines. You can turn it off using gcConcurrent tag. This mode offers much shorter user threads pauses as the most time consuming Gen2 collection is done concurrently, on the cost of limited allocation capabilities during concurrent GC. While Gen2 collection takes place, other threads can only allocate up to the limit of current ephemeral segment, as it’s impossible to allocate new memory segment. If your process runs out of place in current segment, all threads will have to be paused anyway and wait for concurrent collection to finish. This is because Gen0 and Gen1 collections cannot be performed while concurrent GC is still in progress. Concurrent GC also has a slightly higher memory requirements.

Background Mode

Starting with .NET 4.5 this is actually default mode that replaces Concurrent It’s also available for both Workstation and Server modes, while Concurrent was only available for Workstation. The big improvement being that Background mode can actually perform Gen0 and Gen1 collections while simultaneously performing Gen2 collection. 

Workstation garbage collection can be:

  • Concurrent
  • Background
  • Non-concurrent

while options for Server garbage collection are as follows:

  • Background
  • Non-concurrent

For more see – https://blogs.msdn.microsoft.com/tess/2009/05/29/background-garbage-collection-in-clr-4-0/, https://blogs.msdn.microsoft.com/dotnet/2012/07/20/the-net-framework-4-5-includes-new-garbage-collector-enhancements-for-client-and-server-apps/

So how do I collect my GC pause information to make decision on what mode is best for me? 

The first thing to do is take a look at the properties of the process using the excellent Process Explorer tool from Sysinternals (imagine Task Manager on steroids). It will give you a summary like the one below, the number of Gen 0/1/2 Collections and % Time in GC are the most interesting values to look at.

time-in-gc

The other way to get very deep insight is grab an ETW trace as explained by Maoni Stephens here – https://blogs.msdn.microsoft.com/maoni/2014/12/22/gc-etw-events-1/, this is a method of looking into GC stats that I use utilizing pretty popular PerfView tool that works on top of Event Tracing for Windows (ETW). Another good tutorial on the same can be found here – http://www.philosophicalgeek.com/2012/07/16/how-to-debug-gc-issues-using-perfview/

Hope this little note is useful.

 

Data Cargo In The Clouds – Running SQL Server 2017 in Docker

sqlserver_and_docker

In my previous post I attempted to explain my new interest in Docker and running data-centric applications in container. Starting here I will show how to install SQL Server for Linux on Azure IaaS in container.

Lets start by creating Linux VM in Azure that will run Docker for us.  I dont want to spend all of the time and space going through steps in portal or PowerShell since I already went through these in this post, Good video titorial  can also be found here .

Assuming you successfully created Linux VM and can login like this:

azlinux1

Next thing we will install Docker on that VM.  We will get latest and greatest version of Docker from Docker repository.

  • First, add the GPG key for the official Docker repository to the system:

    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    
  • Add the Docker repository to APT sources:

    $ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  • Next, update the package database with the Docker packages from the newly added repo:

    $ sudo apt-get update
  • Make sure you are about to install from the Docker repo instead of the default Ubuntu 16.04 repo:
    apt-cache policy docker-ce

    You should see following output:

    docker-ce:
      Installed: (none)
      Candidate: 17.03.1~ce-0~ubuntu-xenial
      Version table:
         17.03.1~ce-0~ubuntu-xenial 500
            500 https://download.docker.com/linux/ubuntu xenial/stable amd64 Packages
         17.03.0~ce-0~ubuntu-xenial 500
            500 https://download.docker.com/linux/ubuntu xenial/stable amd64 Packages
    
    

    Notice that docker-ce is not installed, but the candidate for installation is from the Docker repository for Ubuntu 16.04. The docker-ce version number might be different.

  • Finally install Docker
    $ sudo apt-get install -y docker-ce
  • Make sure its installed and running
    $ sudo systemctl status docker

    Output should be similar to below showing that daemon is started and running

    linuxdoc

Next we need to pull latest SQL Server Docker image:

$ sudo docker pull microsoft/mssql-server-linux

Once the image is pulled and extracted you should see similar output

linuxdoc2

Now lets run container image with Docker:

$ sudo docker run -e 'ACCEPT_EULA=Y' -e 'MSSQL_SA_PASSWORD=<YourStrong!Passw0rd>' -e 'MSSQL_PID=Developer' --cap-add SYS_PTRACE -p 1401:1433 -d microsoft/mssql-server-linux

Now we are running Docker container with SQL Server 2017 in Azure. We should be able to list our containers like:

 $ sudo docker ps -a

And see output like:

linuxdoc3

Connect to SQL Server in the container.

The following steps use the SQL Server command-line tool, sqlcmd, inside the container to connect to SQL Server. First lets connect to bash inside the container using docker exec command. Note I am providing container id here as parameter fetched from previous output of ps command

$ sudo docker exec -it d95734b7f9ba  "bash"

Now lets connect via sqlcmd here

/opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P '<YourStrong!Passw0rd>'

Now lets make simplest query:

SELECT @@version
     GO

You should see output like:

linuxdoc4

Now you can go ahead and start creating databases, tables, moving data, etc.

For more see – https://docs.microsoft.com/en-us/sql/linux/quickstart-install-connect-dockerhttps://docs.microsoft.com/en-us/sql/linux/sql-server-linux-configure-docker. SQL Server images on Docker Hub – https://hub.docker.com/r/microsoft/mssql-server-linux/https://mathaywardhill.com/2017/05/08/how-to-attach-a-sql-server-database-in-a-linux-docker-container/

 

Hope this helps.

Forecast Cloudy – The Magic of AzCopy As Database Migration Tool

Here is a quick note that can be helpful to folks. Recently I had to migrate database to Azure via taking a backup from on-premises SQL Server and restoring this backup on SQL Server on Azure VM.  Now, there is a better , more structured method available for the same using Database Migration Assistant (DMA) from Microsoft – https://www.microsoft.com/en-us/download/details.aspx?id=53595 , however I went old fashioned way purely through backups.

But how do I move 800 GB backup quickly to the cloud? What about network issues and need to restart on network hiccup? All of these issues and more can be solved via AzCopy tool.

AzCopy is a command-line utility designed for copying data to and from Microsoft Azure Blob, File, and Table storage using simple commands with optimal performance. You can copy data from one object to another within your storage account, or between storage accounts.

There are two versions of AzCopy that you can download. AzCopy on Windows is built with .NET Framework, and offers Windows style command-line options. AzCopy on Linux is built with .NET Core Framework which targets Linux platforms offering POSIX style command-line options. I will showcase Windows version here for now.

So what I did first is created a File Share on Azure Blob Storage. Azure File storage is a service that offers file shares in the cloud using the standard Server Message Block (SMB) Protocol. Both SMB 2.1 and SMB 3.0 are supported. You can follow this tutorial to do so – https://www.petri.com/configure-a-file-share-using-azure-files . Next, I needed to upload my backup to this file share via AzCopy:

The basic syntax to use AzCopy from command line as follows

AzCopy /Source: /Dest: [Options]

I can upload a single file, called test.bak , from C:\Temp to the demo container in the storage account (blob service) using the following command:

AzCopy /Source:C:\Temp /Dest:https://pazcopy.blob.core.windows.net/demo /DestKey:QB/asdasHJGHJGHJGHJ+QOoPu1fC/Asdjkfh48975845Mh/KlMc/Ur7Dm3485745348gbsUWGz/v0e== /Pattern:"test.bak"

Some notes:

  • Note how the URL of the destination storage account specifies the blob service.
  • I have specified the storage account access key with the /DestKey option. This key is unique to the storage account (source or destination).
  • The /Pattern option specifies the file being copied

Once file copied ypu should see stats in command line as well:

azcopy

Once file is uploaded from my destination Azure VM all I have to is map to the File Share (example here – https://blogs.msdn.microsoft.com/windowsazurestorage/2014/05/12/introducing-microsoft-azure-file-service/)  and restore backup.

Hope this helps.  More on AzCopy – https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopyhttps://blogs.technet.microsoft.com/canitpro/2015/12/28/step-by-step-using-azcopy-to-transfer-files-to-azure/

 

Data Cargo In The Clouds – Data and Containerization On The Cloud Platforms

container

This topic is where I actually planned to pivot to for quite a while, but either had no bandwith\time with moving to Seattle or some other excuse like that. This topic is an interesting to me as its intersection of technologies where I used to spend quite a bit of time including:

  • Data, including both traditional RDBMS databases and NoSQL designs
  • Cloud, including Microsoft Azure, AWS and GCP
  • Containers and container orchestration (that is area new to me)

Those who worked with me either displayed various degrees of agreement, but more often of annoyance, on me pushing this topic as growing part of data engineering discipline, but I truly believe in it. But before we start combining all these areas and do something useful in code I will spend some time here with most basic concepts of what I am about to enter here.

I will skip introduction to basic concepts of RDBMS, NoSQL or BigData\Hadoop technologies, as if you didn’t hide under the rock for last five years, you should be quite aware of those. That brings us straight to containers:

As definition states – “A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. Available for both Linux and Windows based apps, containerized software will always run the same, regardless of the environment. Containers isolate software from its surroundings, for example differences between development and staging environments and help reduce conflicts between teams running different software on the same infrastructure.”

Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. This could be from a developer’s laptop to a test environment, from a staging environment into production, and perhaps from a physical machine in a data center to a virtual machine in a private or public cloud.

container.jpg

But why containers if I already have VMs? 

VMs take up a lot of system resources. Each VM runs not just a full copy of an operating system, but a virtual copy of all the hardware that the operating system needs to run. This quickly adds up to a lot of RAM and CPU cycles. In contrast, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program.
What this means in practice is you can put two to three times as many as applications on a single server with containers than you can with a VM.  In addition, with containers you can create a portable, consistent operating environment for development, testing, and deployment. 

On Linux, containers run on top of LXC. This is a userspace interface for the Linux kernel containment features. It includes an application programming interface (API) to enable Linux users to create and manage system or application containers.

Docker is an open platform tool to make it easier to create, deploy and to execute the applications by using containers.

  • The heart of Docker is Docker engine. The Docker engine is a part of Docker which create and run the Docker containers. The docker container is a live running instance of a docker image. Docker Engine is a client-server based application with following components :
    • A server which is a continuously running service called a daemon process.
    • A REST API which interfaces the programs to use talk with the daemon and give instruct it what to do.
    • A command line interface client.

             docker_engine 

  • The command line interface client uses the Docker REST API to interact with the        Docker daemon through using CLI commands. Many other Docker applications    also use the API and CLI. The daemon process creates and manage Docker images, containers, networks, and volumes.
  • Docker client is the primary service using which Docker users communicate with the Docker. When we use commands “docker run” the client sends these commands to dockerd, which execute them out.

You will build Docker images using Docker and deploy these into what are known as Docker registries. When we run the docker pull and docker run commands, the required images are pulled from our configured registry directory.Using Docker push command, the image can be uploaded to our configured registry directory. 

Finally deploying instances of images from your registry you will deploy containers. We can create, run, stop, or delete a container using the Docker CLI. We can connect a container to more than one networks, or even create a new image based on its current state.By default, a container is well isolated from other containers and its system machine. A container defined by its image or configuration options that we provide during to create or run it.

For more info on containers , LXC and Docker see – https://blog.scottlowe.org/2013/11/25/a-brief-introduction-to-linux-containers-with-lxc/, https://www.docker.com/what-container#/virtual_machines http://searchservervirtualization.techtarget.com/definition/container-based-virtualization-operating-system-level-virtualization

That brings us to container orchestration engines. While the CLI meets the needs of managing one container on one host, it falls short when it comes to managing multiple containers deployed on multiple hosts. To go beyond the management of individual containers, we must turn to orchestration tools. Orchestration tools extend lifecycle management capabilities to complex, multi-container workloads deployed on a cluster of machines.

Some of the well known orchestration engines include:

Kubernetes.  Almost a standard nowdays , originally developed at Google. Kubernetes’ architecture is based on a master server with multiple minions. The command line tool, called kubecfg, connects to the API endpoint of the master to manage and orchestrate the minions.  The service definition, along with the rules and constraints, is described in a JSON file. For service discovery, Kubernetes provides a stable IP address and DNS name that corresponds to a dynamic set of pods. When a container running in a Kubernetes pod connects to this address, the connection is forwarded by a local agent (called the kube-proxy) running on the source machine to one of the corresponding backend containers.Kubernetes supports user-implemented application health checks. These checks are performed by the kubelet running on each minion to ensure that the application is operating correctly.

kub

For more see – https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#kubernetes-is

Apache Mesos. This is an open source cluster manager that simplifies the complexity of running tasks on a shared pool of servers. A typical Mesos cluster consists of one or more servers running the mesos-master and a cluster of servers running the mesos-slave component. Each slave is registered with the master to offer resources. The master interacts with deployed frameworks to delegate tasks to slaves. Unlike other tools, Mesos ensures high availability of the master nodes using Apache ZooKeeper, which replicates the masters to form a quorum. A high availability deployment requires at least three master nodes. All nodes in the system, including masters and slaves, communicate with ZooKeeper to determine which master is the current leading master. The leader performs health checks on all the slaves and proactively deactivates any that fail.

 

Over the next couple of posts I will create containers running SQL Server and other data stores, both RDBMS and NoSQL, deploy these on the cloud and finally attempt to orchestrate these hopefully as well.  So lets move off theory and pictures into world of parctical data engine deployments.

Hope you will find this detour interesting.