Meet Memcached – Introduction To Another In-Memory Caching System

Memcached_svg

After writing in this blog about Microsoft AppFabric Cache and Redis, I will be following up with Memcached. Like the systems mentioned above, Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) produced by database calls, API calls, or page rendering. Just like them, it is intended to speed up dynamic web applications by alleviating database load.

Memcached was originally developed by Danga Interactive for LiveJournal, but is now used by many other systems, including MocoSpace, YouTube, Reddit, Facebook, Tumblr and Wikipedia. Engine Yard and Jelastic use Memcached as part of their platform-as-a-service technology stacks, and Heroku offers several Memcached services as part of its platform as a service. Google App Engine, AppScale, Microsoft Azure and Amazon Web Services also offer a Memcached service through an API.

So Why Memcached?

  • Free & Open source
  • High performance
  • Simple to set up
  • Ease of development
  • APIs are available for most popular languages

Important to note what Memcached is NOT:

  • a persistent data store
  • a database
  • application-specific
  • a large object cache
  • fault-tolerant or
  • highly available

 

Memcached's primary storage data structure is a hash table.

The hash table is basically a power-of-two sized array of pointers to entries. Collisions are resolved via separate chaining with linked lists of entries: each entry holds a pointer to the key, a pointer to the value and a pointer to the next chained entry. The hash table size is chosen simply as the smallest power of 2 that is at least the doubled number of entries the table needs to contain. This means the effective load factor varies from 0.5 to 1.0 with an average of 0.75. Incidentally, a similar load factor management strategy is implemented in glibc's unordered_map.

image

Memcached supports multithreaded access to the store. It controls access to shared resources via plain POSIX thread mutexes. Operations on hash table buckets are guarded by one of the pthread_mutex objects in a power-of-two sized array. The size of this array cannot be larger than the hash table size. The index of the mutex for a bucket is determined as bucket index % mutex array size, i.e. each mutex is responsible for hash table size / mutex array size buckets.
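
To make the sizing and locking rules above concrete, here is a minimal Java sketch (illustrative only; Memcached itself is written in C, and the lock count of 64 is just an assumed example value):

import java.util.concurrent.locks.ReentrantLock;

public class BucketLocking {

    // Smallest power of two that is at least twice the number of entries,
    // which keeps the effective load factor between 0.5 and 1.0.
    static int tableSizeFor(int entries) {
        int target = entries * 2;
        int size = Integer.highestOneBit(target);
        return (size == target) ? size : size << 1;
    }

    // Power-of-two sized array of locks guarding the buckets (lock striping).
    static final ReentrantLock[] LOCKS = new ReentrantLock[64];
    static {
        for (int i = 0; i < LOCKS.length; i++) LOCKS[i] = new ReentrantLock();
    }

    // Each lock is responsible for (table size / lock count) buckets.
    static ReentrantLock lockForBucket(int bucketIndex) {
        return LOCKS[bucketIndex % LOCKS.length];
    }
}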

Installing Memcached on Ubuntu is really easy.

To install Memcached on Ubuntu, open a terminal and type the following commands:

 

$sudo apt-get update
$sudo apt-get install memcached

Once installed, Memcached should be running on the default port 11211. To check whether Memcached is currently running, run the command below:

$ps aux | grep memcached

Memcached is originally a Linux application, but since it is open source, it has been compiled for Windows. There are two major sources for pre-built Windows binaries, Jellycan and NorthScale, and both versions can be used. The following are the download links for the Memcached Windows binaries:

http://code.jellycan.com/files/memcached-1.2.5-win32-bin.zip

http://code.jellycan.com/files/memcached-1.2.6-win32-bin.zip

http://downloads.northscale.com/memcached-win32-1.4.4-14.zip

http://downloads.northscale.com/memcached-win64-1.4.4-14.zip

http://downloads.northscale.com/memcached-1.4.5-x86.zip

http://downloads.northscale.com/memcached-1.4.5-amd64.zip

A full tutorial on installing Memcached on Windows is available here – https://commaster.net/content/installing-memcached-windows

To connect to a Memcached server, you use the telnet command, providing a host name and port number.

Using basic telnet

$telnet HOST PORT

Here, HOST and PORT are the machine IP address and port number, respectively, on which the Memcached server is running.

Here I will connect to a local Memcached instance running on the default port and execute basic set and get commands:

$telnet 127.0.0.1 11211
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
// now store some data and get it from memcached server
set test 0 900 9
memcached
STORED
get test
VALUE test 0 9
memcached
END

Popular basic CLI commands

Command   Description                      Example
get       Reads a value based on its key   get mykey
set       Sets a key unconditionally       set mykey 0 60 5
                                           (0 => no flags, 60 => TTL in seconds, 5 => size in bytes)
add       Adds a new key                   add newkey 0 60 5
replace   Replaces an existing key         replace mykey 0 60 5
delete    Deletes an existing key          delete mykey

 

See details on these and other commands here – http://blog.elijaa.org/2010/05/21/memcached-telnet-command-summary/
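
The same set/get flow can also be done from application code. Below is a minimal sketch using the spymemcached Java client (one of several available clients; the key, value and TTL mirror the telnet example above):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcachedExample {
    public static void main(String[] args) throws Exception {
        // Connect to a local Memcached instance on the default port
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));

        client.set("test", 900, "memcached");   // key, TTL in seconds, value
        Object value = client.get("test");      // returns "memcached"
        System.out.println(value);

        client.shutdown();
    }
}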

In the near future I will go into more detail on Memcached, including available clients, configuration, Java development with clients and AWS ElastiCache implementation. For now, you can also explore more here – https://memcached.org/, https://smarttechie.org/2013/07/20/memcached-a-distributed-memory-object-caching-system/, http://code.tutsplus.com/tutorials/turbocharge-your-website-with-memcached–net-23939, and https://wincent.com/wiki/Testing_memcached_with_telnet

Hope this helps.

Ocultos Exitus – JDBC Driver Unicode Settings and SQL Server Performance

While troubleshooting JDBC client apps that connect to SQL Server, I have run into this issue a few times, the latest very recently.

As you may already know, SQL Server differentiates the data types that support Unicode from the ones that only support ASCII. For example, the character data types that support Unicode are nchar, nvarchar and longnvarchar, whereas their ASCII counterparts are char, varchar and longvarchar respectively. By default, Microsoft's JDBC drivers (and most other SQL Server JDBC drivers) send strings to SQL Server in Unicode format, irrespective of whether the data type of the corresponding column defined in SQL Server supports Unicode or not.

In cases where the data types of the columns support Unicode, everything is smooth. But where the column data types do not support Unicode, serious performance issues arise, especially during data fetches. SQL Server tries to implicitly convert the non-Unicode data types in the table to Unicode data types before doing the comparison. Moreover, if an index exists on the non-Unicode column, it will be ignored. This ultimately leads to a whole-table scan during data fetches, thereby slowing down search queries drastically.

This can be corrected by changing one of the default parameters in the JDBC driver; see the connection string sketch after the table below. The parameter name and value to be set vary from driver to driver, depending on the vendor.

Vendor                 Parameter
JSQLConnect            asciiStringParameters
jTDS                   sendStringParametersAsUnicode
DataDirect Connect     sendStringParametersAsUnicode
Microsoft JDBC         sendStringParametersAsUnicode
WebLogic Type 4 JDBC   sendStringParametersAsUnicode
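
As an illustration, this is roughly how the setting looks for the Microsoft JDBC driver (a minimal sketch; the server, database, credentials and table below are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NonUnicodeJdbcExample {
    public static void main(String[] args) throws Exception {
        // sendStringParametersAsUnicode=false makes the driver send string parameters
        // as ASCII/varchar, so an index on a varchar column is not defeated by an
        // implicit conversion to nvarchar.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=SalesDB;"
                   + "sendStringParametersAsUnicode=false";

        try (Connection conn = DriverManager.getConnection(url, "appuser", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT order_id FROM orders WHERE customer_code = ?")) {
            ps.setString(1, "ACME001"); // non-national (ASCII) setter, matches the varchar column
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}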

Reading https://msdn.microsoft.com/en-us/library/ms378857(v=sql.110).aspx you will see: "For optimal performance with CHAR, VARCHAR or LONGVARCHAR type of non-Unicode parameters, set the sendStringParametersAsUnicode connection string property to "false" and use the non-national character methods." The issue was also reported via http://www.sqlconsulting.com/jdbc.shtml, http://www.jochenhebbrecht.be/site/2014-05-01/java/fixing-slow-queries-running-sql-server-using-jpa-hibernate-and-jtds and http://www.codeproject.com/Articles/281364/Solving-Performance-issues-in-data-migration-to-SQ

Hope this helps.

Taking Care of the Garbage – Generational is the default GC policy on newer WebSphere AS

Imagine you have a legacy Java application running in IBM WebSphere that you have finally upgraded to a newer version. Yet the customer is reporting a serious performance regression. Why would that be? Well, one reason may be a change in default JVM behavior between WebSphere versions, something that one of my customers discovered the "hard way".

Garbage collection (GC) is an integral part of the Java Virtual Machine (JVM) as it collects unused Java heap memory so that the application can continue allocating new objects. The effectiveness and performance of the GC play an important role in application performance and determinism. The IBM JVM provided with IBM WebSphere Application Server provides four different GC policy algorithms:

  • -Xgcpolicy:optthruput
  • -Xgcpolicy:optavgpause
  • -Xgcpolicy:gencon
  • -Xgcpolicy:balanced

Each of these algorithms provides different performance and deterministic qualities. In addition, the default policy in WebSphere Application Server V8 has changed from -Xgcpolicy:optthruput to -Xgcpolicy:gencon. So let's dive in a bit to what this really means.
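
Note that if you need the pre-V8 behavior back while you investigate a regression, the old policy can be set explicitly, typically by adding -Xgcpolicy:optthruput to the server's Generic JVM arguments (Java and Process Management > Process definition > Java Virtual Machine) and restarting the server.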

The garbage collector

Different applications naturally have different memory usage patterns. A computationally intensive number crunching workload will not use the Java heap in the same way as a highly transactional customer-facing interface. To optimally handle these different sorts of workloads, different garbage collection strategies are required. The IBM JVM supports several garbage collection policies to enable you to choose the strategy that best fits your application.

The parallel mark-sweep-compact collector: optthruput, formerly the default

The simplest possible garbage collection technique is to continue allocating until free memory has been exhausted, then stop the application and process the entire heap. While this results in a very efficient garbage collector, it means that the user program must be able to tolerate the pauses introduced by the collector. Workloads that are only concerned about overall throughput might benefit from this strategy.

The optthruput policy (-Xgcpolicy:optthruput) implements this strategy. This collector uses a parallel mark-sweep algorithm. In a nutshell, this means that the collector first walks through the set of reachable objects, marking them as live data. A second pass then sweeps away the unmarked objects, leaving behind free memory that can be used for new allocations. The majority of this work can be done in parallel, so the collector uses additional threads (up to the number of CPUs by default) to get the job done faster, reducing the time the application remains paused.

figure1

The problem with a mark-sweep algorithm is that it can lead to fragmentation. There might be lots of free memory, but if it is in small slices interspersed with live objects then no individual piece might be large enough to satisfy a particular allocation.

The solution to this is compaction. In theory, the compactor slides all the live objects together to one end of the heap, leaving a single contiguous block of free space. This is an expensive operation because every live object might be moved, and every pointer to a moved object must be updated to the new location. As a result, compaction is generally only done when it appears to be necessary. Compaction can also be done in parallel, but it results in a less efficient packing of the live objects — instead of a single block of free space, several smaller ones might be created.

figure2

The concurrent collector: optavgpause

For applications that are willing to trade some overall throughput for shorter pauses, a different policy is available. The optavgpause policy (-Xgcpolicy:optavgpause) attempts to do as much GC work as possible before stopping the application, leading to shorter pauses. The same mark-sweep-compact collector is used, but much of the mark and sweep phases can be done as the application runs. Based on the program's allocation rate, the system attempts to predict when the next garbage collection will be required. When this threshold approaches, a concurrent GC begins. As application threads allocate objects, they will occasionally be asked to do a small amount of GC work before their allocation is fulfilled. The more allocations a thread does, the more it will be asked to help out. Meanwhile, one or more background GC threads will use idle cycles to get additional work done. Once all the concurrent work is done, or if free memory is exhausted ahead of schedule, the application is halted and the collection is completed. This pause is generally short, unless a compaction is required. Because compaction requires moving and updating live objects, it cannot be done concurrently.

figure3

 

The generational collector: gencon

It has long been observed that the majority of objects created are only used for a short period of time. This is the result of both programming techniques and the type of application. Many common Java idioms create helper objects that are quickly discarded; for example StringBuffer/StringBuilder objects, or Iterator objects. These are allocated to accomplish a specific task, and are rarely needed afterwards. On a larger scale, applications that are transactional in nature also tend to create groups of objects that are used and discarded together. Once a reply to a database query has been returned, then the reply, the intermediate state, and the query itself are no longer needed.

This observation led to the development of generational garbage collectors. The idea is to divide the heap up into different areas, and collect these areas at different rates. New objects are allocated out of one such area, called the nursery (or newspace). Since most objects in this area will become garbage quickly, collecting it offers the best chance to recover memory. Once an object has survived for a while, it is moved into a different area, called tenure (or oldspace). These objects are less likely to become garbage, so the collector examines them much less frequently. For the right sort of workload the result is collections that are faster and more efficient since less memory is examined, and a higher percentage of examined objects are reclaimed. Faster collections mean shorter pauses, and thus better application responsiveness.

IBM's gencon policy (-Xgcpolicy:gencon) offers a generational GC ("gen-") on top of the concurrent one described above ("-con"). The tenure space is collected as described above, while the nursery space uses a copying collector. This algorithm works by further subdividing the nursery area into allocate and survivor spaces. New objects are placed in allocate space until its free space has been exhausted. The application is then halted, and any live objects in allocate are copied into survivor. The two spaces then swap roles; that is, survivor becomes allocate, and the application is resumed. If an object has survived for a number of these copies, it is moved into the tenure area instead.

figure4

 

The region-based collector: balanced

A new garbage collection policy has been added in WebSphere Application Server V8. This policy, called balanced (-Xgcpolicy:balanced), expands on the notion of having different areas of the heap. It divides the heap into a large number of regions, which can be dealt with individually. Frankly I haven’t seen it used by any customer I worked with yet.

 

For more on WebSphere IBM JVM GC see – http://www.perfdaddy.com/2015/10/ibm-jvm-tuning-gencon-gc-policy.html, http://www.ibmsystemsmag.com/ibmi/administrator/websphere/Tuning-Garbage-Collection-With-IBM-Technology-for/, http://javaeesupportpatterns.blogspot.com/2012/03/ibm-jvm-tuning-gencon-gc-policy.html

Hope this helps.

Javacore Dump Analysis using JCA – IBM Thread and Monitor Dump Analyzer for Java

In my previous blog posts I spent some time illustrating tools for thread analysis on the Oracle\Sun HotSpot JVM. However, recently I had to analyze stack\javacore dumps from IBM WebSphere for a hang condition and therefore had to research equivalent tools for analyzing dumps from that JVM.

As we all know, during the run time of a Java process, some Java Virtual Machines (JVMs) may not respond predictably and oftentimes seem to hang up for a long time or until JVM shutdown occurs. It is not easy to determine the root cause of these sorts of problems.

By triggering a javacore when a Java process does not respond, it is possible to collect diagnostic information related to the JVM and a Java application captured at a particular point during execution. For example, the information can be about the operating system, the application environment, threads, native stack, locks, and memory. The exact contents are dependent on the platform on which the application is running.

On non-IBM platforms, and in most cases, a javacore is known as a "javadump." Check out my previous post on how to analyze dumps on the Oracle Sun JVM via the jstack utility. The code that creates a javacore is part of the JVM. One can control it by using environment variables and run-time switches. By default, a javacore occurs when the JVM terminates unexpectedly. A javacore can also be triggered by sending specific signals to the JVM. Although javacore or javadump facilities are present in Sun JVMs, much of the content of the javacore is added by IBM and, therefore, is present only in IBM JVMs.
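
In addition to signals, on IBM JDKs a javacore can be requested programmatically via the IBM-specific com.ibm.jvm.Dump API. A minimal sketch (IBM JVM only; this class does not exist on HotSpot):

public class TriggerJavacore {
    public static void main(String[] args) {
        // Requests a javacore (javadump) from the running IBM JVM; the resulting
        // javacore*.txt file is written to the dump location configured for the JVM
        // (by default the current working directory).
        com.ibm.jvm.Dump.JavaDump();
    }
}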

The tool analyzes each thread and provides diagnostic information, such as current thread information, the signal that caused the javacore, Java heap information (maximum Java heap size, initial Java heap size, garbage collector counter, allocation failure counter, free Java heap size, and allocated Java heap size), number of runnable threads, total number of threads, number of monitors locked, and deadlock information.

You can download the tool from here – https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=2245aa39-fa5c-4475-b891-14c205f7333c.

Once the tool is downloaded, you can run jca.jar with the Java Runtime Environment:

command_line_jca

This will open up the tool

jca_splash

Let's now use File -> Open and open a javacore\dump file:

jca_fileopen

The tool will show a screen with a progress bar while it loads the javacore. Click on the javacore you just loaded in the Thread Dump List:

jca_afterfileopen

As you can see, you get a huge amount of detail on JVM settings. In addition, if the tool detected a deadlock, it will be displayed in the lower section of the Thread Dump List:

jca_summary

and

dead

 

Selecting the Compare Monitors option from the Analysis menu will show the deadlocked threads:

jca_dead

Another useful screen is the Thread Status screen. Here you can see RUNNABLE vs. PARKED vs. BLOCKED threads and their associated stacks. This screen can be very useful in resolving slow-response and hang conditions:

jca_thread_status

A complete explanation of all thread states for Java 7 can be found here – https://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.71.doc/diag/tools/javadump_tags_threads.html. Once you locate the RUNNABLE threads that are executing your application code, find out which method is being executed by following the stack trace. You may get assistance from the development team if needed. Also note the Thread ID.

The thread in the example below is in BLOCKED state (state:B), which typically means it is waiting to acquire a lock on an object monitor. You will need to search the earlier section and determine which thread is holding the lock so you can pinpoint the root cause.

3XMTHREADINFO      "[STUCK] ExecuteThread: '162' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000013ACF0800, j9thread_t:0x000000013AC88B20, java/lang/Thread:0x070000001F945798, state:B, prio=1

3XMTHREADINFO1            (native thread ID:0x1AD0F3, native priority:0x1, native policy:UNKNOWN)

3XMTHREADINFO3           Java callstack:

4XESTACKTRACE                at org/springframework/jms/connection/SingleConnectionFactory.createConnection(SingleConnectionFactory.java:207(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/connection/SingleConnectionFactory.createQueueConnection(SingleConnectionFactory.java:222(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate102.createConnection(JmsTemplate102.java:169(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.execute(JmsTemplate.java:418(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.send(JmsTemplate.java:475(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.send(JmsTemplate.java:467(Compiled Code))

…………………………………………………………………………………………………………

 

Hope this helps. For more see – https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=2245aa39-fa5c-4475-b891-14c205f7333c, https://www.ibm.com/developerworks/community/blogs/e8206aad-10e2-4c49-b00c-fee572815374/entry/Java_Core_Debugging_using_IBM_Thread_and_Monitor_Dump_Analyzer_for_Java?lang=en

Et Obscure Ex Alto – TDS Compatibility for SQL Server Upgrades

As the majority of folks may know, time is counting down on extended support for SQL Server 2005; it officially ends April 12, 2016. Working with folks on upgrading fairly old systems from SQL Server 2000\2005 to newer versions like SQL Server 2012\2014, we ran into the fairly rare issue of TDS backward compatibility, mainly with very old clients running older versions of JDBC\ODBC drivers.

Tabular Data Stream (TDS) is an application layer protocol used to transfer data between a database server and a client. It was initially designed and developed by Sybase Inc. for their Sybase SQL Server relational database engine in 1984, and later by Microsoft in Microsoft SQL Server.

tds

Microsoft published the TDS specification in 2008 – https://msdn.microsoft.com/en-us/library/cc448435.aspx. Still, TDS, and especially TDS compatibility on SQL Server upgrades, is a pretty obscure topic. Microsoft published a list of TDS versions matched to SQL Server and .NET Framework versions here – https://msdn.microsoft.com/en-us/library/dd339982.aspx

TDS version      SQL Server version      .NET Framework version
7.0              SQL Server 7.0          .NET Framework 1.1
7.1              SQL Server 2000         .NET Framework 1.1
7.1 Revision 1   SQL Server 2000 SP1     .NET Framework 1.1
7.2              SQL Server 2005         .NET Framework 2.0
7.3.A            SQL Server 2008         .NET Framework 4.0
7.3.B            SQL Server 2008 R2      N/A
7.4              SQL Server 2012         .NET Framework 4.5
7.4              SQL Server 2014         .NET Framework 4.5
7.4              SQL Server 2016 CTP2    .NET Framework 4.6

Although TDS compatibility isn't well documented by Microsoft, from what I have experienced it appears Microsoft keeps at least two versions of TDS backward compatible.

So how do you find out which TDS versions are connecting to your server? On newer versions like SQL Server 2005 and above, where DMVs are present, the following query can be used:

select protocol_type, protocol_version from sys.dm_exec_connections

Result will be something like:

TSQL   1930035203 

Well, the big number still doesn't say which version of the data access driver was used, does it? Actually it does. The integer value is a representation of the TDS version (for the TSQL protocol type). It needs to be translated as follows, converting the protocol version to hex:

SELECT CONVERT(BINARY(4), 1930035203) 

Result will be as follows:

0x730A0003

Take the first two hex digits (0x73) and map them to the correct SQL Server version based on the table above. So now we know that the connection was made via TDS 7.3x, possibly via the SQL Server 2008 client stack.
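
The same translation can also be done outside of SQL, for example in a small Java snippet (just an illustration of the byte layout described above):

public class TdsVersionDecode {
    public static void main(String[] args) {
        int protocolVersion = 1930035203;            // value from sys.dm_exec_connections

        String hex = String.format("0x%08X", protocolVersion);    // 0x730A0003
        int firstByte = (protocolVersion >>> 24) & 0xFF;           // 0x73 -> TDS 7.3x

        System.out.println(hex + " -> first byte 0x" + Integer.toHexString(firstByte));
    }
}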

In one of its older posts SQLCAT actually combined all of these steps into one query:

SELECT session_id, protocol_type, driver_version =
CASE SUBSTRING(CAST(protocol_version AS BINARY(4)), 1, 1)
    WHEN 0x70 THEN 'SQL Server 7.0'
    WHEN 0x71 THEN 'SQL Server 2000'
    WHEN 0x72 THEN 'SQL Server 2005'
    WHEN 0x73 THEN 'SQL Server 2008'
    ELSE 'Unknown driver'
END
FROM sys.dm_exec_connections

But what if I am on SQL Server 2000, before DMVs were available, and need to check which TDS versions are connecting before an upgrade?

Here things become a lot more difficult and labor intensive. The only way I know of is to check the version in the TDS pre-login packet via a network capture on the SQL Server side.

As per https://msdn.microsoft.com/en-us/library/dd357559.aspx, the pre-login is a message sent by the client to set up the context for login. The server responds to a client PRELOGIN message with a message of packet header type 0x04 and packet data containing a PRELOGIN structure. As part of it there is an optional version value:

PL_OPTION_TOKEN   Value   Description
VERSION           0x00    PL_OPTION_DATA = UL_VERSION, US_SUBBUILD

UL_VERSION is represented in network byte order (big-endian).

The server SHOULD use the VERSION sent by the client to the server. The client SHOULD use the version returned from the server to determine which features are enabled or disabled. The client SHOULD do this only if it is known that this feature is supported by that version of the database.<27>

Theoretically, once such data is captured via a network trace, it can be filtered in Wireshark; from https://www.wireshark.org/docs/dfref/t/tds.html:

tds.prelogin.option.version	Version	Unsigned integer, 4 bytes	2.0.0

According to the spec, version is an option – the client should provide a version per the docs, though possibly some clients don't. I assume that would be pretty rare; at least, all mainstream clients I have seen do.

So dealing with older, obscure SQL Server clients is hard, but it doesn't have to be impossible.

For more see – https://blogs.msdn.microsoft.com/jenss/2009/03/02/tds-protocol-versions-meet-client-stacks/, https://blogs.msdn.microsoft.com/sql_protocols/2008/07/15/connection-failure-because-of-mismatched-tds-version/, https://msdn.microsoft.com/en-us/library/dd339982.aspx

Dancing with Elephants and Flying with The Bees – Using ORC File Format with Apache Hive

hive

 

When you start with Hive on Hadoop, the clear majority of samples and tutorials will have you work with text files. However, some time ago the disadvantages of text files as a file format, in terms of storage efficiency and performance, became clear to the Hive community.

The first move to better columnar storage was the introduction of the Hive RCFile format. RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems. Hive added the RCFile format in version 0.6.0. RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part. Internals of the RCFile format can be found in the JavaDoc here – http://hive.apache.org/javadocs/r1.0.1/api/org/apache/hadoop/hive/ql/io/RCFile.html. What is important to note is why it was introduced, i.e. its advantages:

  • As row-store, RCFile guarantees that data in the same row are located in the same node
  • As column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.

As time passed, the explosion of data and the need for faster HiveQL queries pushed the need for further optimized columnar storage file formats. Therefore, the ORC file format was introduced. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats, and using ORC files improves performance when Hive is reading, writing, and processing data.

It has the following advantages over the RCFile format:

  • a single file as the output of each task, which reduces the NameNode’s load
  • light-weight indexes stored within the file, allowing it to skip row groups that don't pass predicate filtering and to seek to a given row
  • block-mode compression based on data type

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates count, min, max, and sum.

image

What does it all mean for me?

What it means for us as implementers is the following:

  • Better read performance due to compression. Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
  • Introduction of column-level statistics for optimization, a feature that has long existed in pretty much all commercial RDBMS packages (Oracle, SQL Server, etc.). For each column, the writer records the count and, depending on the type, other useful fields. For most of the primitive types it records the minimum and maximum values, and for numeric types it additionally stores the sum. From Hive 1.1.0 onwards, the column statistics will also record whether there are any null values within the row group by setting the hasNull flag.
  • Lightweight indexing
  • Larger blocks, 256 MB by default

Here is a good Hive file format comparison from Hortonworks:

image

Using ORC – create table with ORC format:

The simplest way to create an ORC-formatted Hive table is to add STORED AS ORC to the Hive CREATE TABLE statement, like:

CREATE TABLE my_table  (
column1 STRING,
column2 STRING,
column3 INT,
column4 INT
) STORED AS ORC;

The ORC file format can be used together with Hive partitioning, which I explained in my previous post. Here is an example of using Hive partitioning with the ORC file format:

CREATE TABLE airanalytics (
  flightdate date, dayofweek int, depttime int, crsdepttime int, arrtime int, crsarrtime int,
  uniquecarrier varchar(10), flightno int, tailnum int, aet int, cet int, airtime int,
  arrdelay int, depdelay int, origin varchar(5), dest varchar(5), distance int, taxin int,
  taxout int, cancelled int, cancelcode int, diverted string, carrdelay string,
  weatherdelay string, securtydelay string, cadelay string, lateaircraft string)
PARTITIONED BY (flight_year string)
CLUSTERED BY (uniquecarrier) SORTED BY (flightdate) INTO 24 BUCKETS
STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE", "orc.stripe.size"="67108864", "orc.row.index.stride"="25000")

The parameters added at the table level are, as per the docs:

Key                        Default       Notes
orc.compress               ZLIB          high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size          262,144       number of bytes in each compression chunk
orc.stripe.size            268,435,456   number of bytes in each stripe
orc.row.index.stride       10,000        number of rows between index entries (must be >= 1000)
orc.create.index           true          whether to create row indexes
orc.bloom.filter.columns   ""            comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp       0.05          false positive probability for bloom filter (must be > 0.0 and < 1.0)

If you have an existing Hive table, it can be moved to ORC via:

    • ALTER TABLE … [PARTITION partition_spec] SET FILEFORMAT ORC
    • SET hive.default.fileformat=Orc

For more information see – https://en.wikipedia.org/wiki/RCFile, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC, https://cwiki.apache.org/confluence/display/Hive/FileFormats, https://orc.apache.org/docs/hive-config.html

Hope this helps.

Forecast Cloudy – Azure SQL Data Warehouse Introduction

 

Many enterprises today are moving into the era of real-time analytics. As you may be aware, for non-relational data typical Hadoop batch analysis architectures are being enhanced by technologies like Storm, Kafka, Azure Stream Analytics, Amazon Kinesis, etc., moving from purely batch analytics to hybrid real-time\batch architectures like Lambda (http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/) or Zeta (http://radar.oreilly.com/2015/04/zeta-architecture-hexagon-is-the-new-circle.html) with both speed and batch layers. For structured data, a new generation of elastic cloud-based data warehouses has been introduced, such as Amazon Redshift and Azure SQL Data Warehouse. I will blog about Redshift later, but for now will take a bit of time to introduce Azure SQL DW.

Azure SQL Data Warehouse is a turn-key cloud data warehousing and analytics solution (related to the Microsoft APS\PDW on-premises solution). It is based on existing Azure services, effectively and conveniently integrating them together under one roof. The key characteristics of Azure SQL Data Warehouse are:

  • Can scale (grow or shrink) to any size on demand, in seconds
  • Can import data from any source (using Azure Data Factory): Hadoop, NoSQL database, SQL database
  • Can visualize data and run reports using Power BI or another service
  • Can use Azure Machine Learning to analyze, model and predict data
  • Can expose Machine Learning models as APIs for apps
  • Provides full ANSI SQL support

image

MPP Architecture. At its core, SQL Data Warehouse runs using Microsoft's massively parallel processing (MPP) architecture, originally introduced in the Microsoft APS\PDW appliance and also successfully used by some other competing analytics appliance products like Pivotal Greenplum. This architecture takes advantage of built-in data warehousing performance improvements and also allows SQL Data Warehouse to easily scale out and parallelize computation of complex SQL queries. In addition, SQL Data Warehouse's architecture is designed to take advantage of its presence in Azure. Combining these two aspects, the architecture breaks up into 4 key components:

SQL Data Warehouse Architecture

  • Control Node: You connect to the control node when using SQL Data Warehouse with any development, loading, or business intelligence tools. In SQL Data Warehouse, the control node is a SQL Database, and when connecting it looks and feels like a standard SQL Database. However, under the surface, it coordinates all of the data movement and computation that takes place in the system. When a command is issued to the control node, it breaks it down into a set of queries that will be passed on to the compute nodes of the service.
  • Compute Nodes: Like the control node, the compute nodes of SQL Data Warehouse are powered using SQL Databases. Their job is to serve as the compute power of the service. Behind the scenes, any time data is loaded into SQL Data Warehouse, it is distributed across the nodes of the service. Then, any time the control node receives a command it breaks it into pieces for each compute node, and the compute nodes operate over their corresponding data. After completing their computation, compute nodes pass partial results to the control node which then aggregates results before returning an answer.
  • Storage: All storage for SQL Data Warehouse is standard Azure Storage Blobs. This means that when interacting with data, compute nodes are writing and reading directly to/from Blobs. Azure Storage’s ability to expand transparently and nearly limitlessly allows us to automatically scale storage, and to do so separately from compute. Azure Storage also allows us to persist storage while scaling or paused, streamline our back-up and restore process, and have safer, more fault tolerant storage.
  • Data Movement Services: The final piece holding everything together in SQL Data Warehouse is the data movement services. They allow the control node to communicate and pass data to all of the compute nodes, and they enable the compute nodes to pass data between each other, which gives them access to data on other compute nodes and allows them to get the data they need to complete joins and aggregations.

This MPP approach allows SQL Data Warehouse to take a divide-and-conquer approach, as described above, when solving large data problems. Since the data in SQL Data Warehouse is divided and distributed across the compute nodes of the service, each compute node is able to operate on its portion of the data in parallel. Finally, results are passed to the control node and aggregated before being passed back to the users.

image

Elasticity. Currently the majority of cloud-based database and data warehouse services are provisioned with fixed storage and compute resources. Resizing of resources cannot be performed without compromising availability and performance. This means the service user typically ends up with over-provisioned, underutilized, expensive resources to accommodate possible peak demand, or, in the worst case, with under-provisioned resources unable to handle sudden work overloads.

Azure SQL Data Warehouse (DW) is a fully managed, elastic, and petabyte-scale columnar data-warehouse service. Both Amazon Redshift and Azure SQL DW use massive parallel processing (MPP) architecture to deliver breakthrough performance. But unlike Amazon Redshift, Azure SQL DW architecture allows data and compute to scale independently without any downtime. In addition, Azure SQL DW enables one to dynamically grow and shrink resources taking advantage of best-in-class price and performance.

Most importantly, with Azure SQL DW, storage and compute are billed separately. Storage rates are based on standard blob rates. Compute is priced based on DWU (Data Warehouse Unit) blocks – a new metric to measure the compute capacity when using Azure SQL DW. Performance in Azure SQL DW scales linearly, and changing from one compute block to another (for instance 100 DWUs to 1000 DWUs) happens in seconds without disruption.

image

DWU (Data Warehouse Unit) capacity blocks. Data Warehouse Units are a new concept delivered by SQL Data Warehouse for capacity. Unfortunately, outside of this article – https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-performance-scale/ – there isn't much information on DWUs. Especially nice would be some sort of calculator to help translate traditional DW metrics like IOPS or queries per hour, etc. to DWUs.

At this time Microsoft recommends the following approach to properly sizing your Azure SQL DW capacity in DWUs:

  • For a data warehouse in development, begin by selecting a small number of DWUs
  • Monitor your application performance, observing the number of DWUs selected compared to the performance you observe
  • Determine how much faster or slower performance should be for you to reach the optimum performance level for your requirements by assuming linear scale
  • Increase or decrease the number of DWUs selected. The service will respond quickly and adjust the compute resources to meet the DWU requirements.
  • Continue making adjustments until you reach an optimum performance level for your business requirements.

image

Hadoop Integration.

SQL DW is built with the same technology as APS, except that instead of using SQL Server 2014 it uses version 12 of Azure SQL Database. It also includes PolyBase. PolyBase allows APS and SQL DW to query data in a Hadoop cluster, either directly or by pushing some of the work to Hadoop itself so that the query is actually run using the Hadoop cluster's CPUs. The Hadoop data is made to look as if it were local to the data warehouse, so that end users can use their existing skill sets to query it via SQL or any reporting tool that uses SQL (like Excel, SSRS, Power BI, etc.). PolyBase can integrate with Hadoop in this manner via a Microsoft HDInsight cluster that can either be inside APS or in the cloud, or via a Hortonworks or Cloudera cluster.

image

So, enough theory; let's spin up our own SQL Data Warehouse in Azure. Unlike traditional systems this will not take weeks or months, but minutes. From the Azure Preview Portal (it's not in the Classic Portal) go to New -> Data + Storage -> Azure Data Warehouse.

image

Add a DW name; if you already have a version 12 Azure SQL server you can use it, otherwise you will have to create a new server instance with a login and password. I picked the lowest available DWU setting of 100.

image

As a result you will see an animated GIF in the upper right corner with the status:

image

A few minutes later you have yourself a real MPP DW instance in the cloud. Finally, after adding my client IP as a firewall exception, I can simply use SSMS to log in to this server and DB.

image
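
You can of course also connect programmatically rather than through SSMS. Below is a minimal sketch using the Microsoft JDBC driver (the server name, database and credentials are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlDwConnect {
    public static void main(String[] args) throws Exception {
        // Standard Azure SQL connection string settings: encrypted connection,
        // fully qualified server name and a server-qualified login.
        String url = "jdbc:sqlserver://mydwserver.database.windows.net:1433;"
                   + "database=mydw;encrypt=true;hostNameInCertificate=*.database.windows.net;"
                   + "loginTimeout=30";

        try (Connection conn = DriverManager.getConnection(url, "myadmin@mydwserver", "MyP@ssw0rd");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT @@VERSION")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}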

For more info on Azure SQL DW see – https://azure.microsoft.com/en-us/services/sql-data-warehouse/, https://azure.microsoft.com/en-us/documentation/services/sql-data-warehouse/, http://techcrunch.com/2015/04/29/microsoft-introduces-azure-sql-data-warehouse/, http://blogs.technet.com/b/dataplatforminsider/archive/2015/06/24/azure-sql-data-warehouse-opens-for-limited-public-preview.aspx