Semper Secura–Introduction To SQL Server Always Encrypted

With the introduction of SQL Server 2016 we now have a new way to encrypt columns called Always Encrypted. What makes it different from older encryption features in SQL Server, like column-level encryption and Transparent Data Encryption, is that it doesn’t just encrypt data at rest or in transit on the wire: it is the only feature that ensures the database never sees unencrypted values of sensitive columns, because data is encrypted and decrypted on the client tier by the SQL client/database driver. While the database only sees encrypted values, the application code works exclusively with unencrypted data. When a query is executed, the driver automatically looks up the master key in the Windows Certificate Store (or another OS-dependent location). The master key is then used to decrypt a column-specific key, which in turn is used for encrypting and decrypting fields and parameters.

pic2

Microsoft advertises the following use cases for the Always Encrypted feature:

· Client and Data On-Premises

A customer has a client application and SQL Server both running on-premises, at their business location. The customer wants to hire an external vendor to administer SQL Server. In order to protect sensitive data stored in SQL Server, the customer uses Always Encrypted to ensure the separation of duties between database administrators and application administrators. The customer stores plaintext values of Always Encrypted keys in a trusted key store which the client application can access. SQL Server administrators have no access to the keys and, therefore, are unable to decrypt sensitive data stored in SQL Server.

· Client On-Premises with Data in Azure

A customer has an on-premises client application at their business location. The application operates on sensitive data stored in a database hosted in Azure (for example in SQL Server running in a virtual machine on Microsoft Azure). The customer uses Always Encrypted and stores Always Encrypted keys in a trusted key store hosted on-premises, to ensure Microsoft cloud administrators have no access to sensitive data.

· Client and Data in Azure

A customer has a client application, hosted in Microsoft Azure (e.g. in a worker role or a web role), which operates on sensitive data that is also stored in Microsoft Azure. The customer uses Always Encrypted to reduce the security attack surface area (the data is always encrypted in the database and on the machine hosting the database).

pic1

 

Types of Always Encrypted encryption:

SQL Server offers two encryption modes: deterministic and randomized.

· Deterministic encryption ensures that a given value always has the same encrypted representation. This allows you to use the column for equality comparisons, joins, and grouping.

· For more security, you can use randomized encryption. This prevents guessing by ensuring that a given value’s encrypted representation is never the same twice.

You should use deterministic encryption for columns that will be used as search or grouping parameters, for example a government ID number. Use randomized encryption for data such as confidential investigation comments, which are not grouped with other records or used to join tables.

Limitations:

If a column is encrypted, then all range-like operations such as greater/less than, pattern matching using LIKE, etc. are disallowed. Furthermore, you can’t pass encrypted values to functions, user-defined or otherwise, because the database doesn’t have access to the unencrypted values.

Equality comparisons can only be performed on columns that use deterministic encryption.

Indexes can only be applied to deterministically encrypted columns.

If joining between two columns, both columns need to use the same column encryption key.

Constant expressions that refer to encrypted columns are not allowed. For example, you cannot write WHERE SSN = '111-11-1111', but you can write WHERE SSN = @SSN. This is necessary because the driver works with the SqlParameter class to handle encryption requirements.

Unsupported data types include: xml, rowversion, image, ntext, text, sql_variant, hierarchyid, geography, geometry, and user-defined types.

Current drivers that support this feature are listed below:

pic3

 

Here is a quick tutorial on how to set up Always Encrypted:

· Open your SQL Server Management Studio and connect to your SQL Server 2016 instance or Azure SQL Database. For on-premises SQL Server, the Always Encrypted feature is available in the Enterprise and Developer editions.

pic4

· The first thing we will create is a Column Master Key. You can do so either in script or via a nice wizard in SSMS. Below is my script to do so, but of course your script will be different, based on your certificate, machine, etc.

CREATE DATABASE AEDemo;

GO

USE [AEDemo]
/****** Object:  ColumnMasterKey [test]    Script Date: 9/15/2016 6:44:15 PM ******/
CREATE COLUMN MASTER KEY [test]
WITH
(
	KEY_STORE_PROVIDER_NAME = N'MSSQL_CERTIFICATE_STORE',
	KEY_PATH = N'LocalMachine/My/A9B9AFE1993FA0FAD052F49195FDF10DE5A953DB'
)
GO

· Next is to create a Column Encryption Key. Again, your actual script will vary from mine below, or you can use the nice SSMS wizard to do so

CREATE COLUMN ENCRYPTION KEY [test]
WITH VALUES
(
	COLUMN_MASTER_KEY = [test],
	ALGORITHM = 'RSA_OAEP',
	ENCRYPTED_VALUE = 0x01700000016C006F00630061006C006D0061006300680069006E0065002F006D0079002F0061003900620039006100660065003100390039003300660061003000660061006400300035003200660034003900310039003500660064006600310030006400650035006100390035003300640062003B237BB1F1C8A7ADBB0BEA972AB6FC321B7BAB43C6EEEF20CC4BBAC52BE1582868C7E9BF92B01D7022AF9838D625ED5F1CE2D35975BFF600F96A5A5A07EDFA946DBDE08DE896A632B06BAFC5A861DC8298E36E408BE5EBAAEB94C674C04765D0138F30B6B0E676294B5D3FF5374434273B823550D3A89D6337BEEE1D93FBDAB50025ED7A91BE2F2C3C032A636D0049F5614EC248097BD0B12F79FF3A51DC242987D931A2473EAB98BCB7217CFF61B447E435F5FE19BB5DCAF1B7C8D8FA606FE5354EB9773A78C1F1EEFCC4D3D401CB3C602EBB3C197A2421637304D94BC58F129F20A5685A700C6BEDF8D5080F959B57B3F65721DDEF78BBDA035923D3A93D3380E52465663A72916CE4DCF0D904C27DB8298D6F44AACC95998EA7CC1F895BBD55553523A5558778252A5E4AE2DED30D6DE04DB24ACE18771D0B8C27F06A228F67F950C10E0DA035D12934DA5DB5D65E947789EBA86A5375C71DC386FB854FCE702D642CB5ED695B96E864C3F9CC34E52D87178B2DF75B6AAE3A996B2F6BC9EB99445F6311142E8C3F85A545EE0614AFCAB02A547664B31E546FFAA7667D86D7DF239CB4EAB66B847EBD78BDD01707C7C7DE04EE8F52F5F7D722A71BC8E5015F9EFB2A0DBDEB9B732C163D8C32F9DA00AABCDBE68067715CB1C56A385B0EA908A51EC51F8290D701689E112C646A10462636766066F703D82844CA7237F721EF
)
GO

· Now that the keys are created, let’s create a table that uses these keys to encrypt columns

CREATE TABLE dbo.EncryptedTable
(
  ID INT IDENTITY(1,1) PRIMARY KEY,
  
  LastName NVARCHAR(32) COLLATE Latin1_General_BIN2 
    ENCRYPTED WITH 
    (
       ENCRYPTION_TYPE = DETERMINISTIC, 
       ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256', 
       COLUMN_ENCRYPTION_KEY = test
    ) NOT NULL,
    
  Salary INT 
    ENCRYPTED WITH 
    (
       ENCRYPTION_TYPE = RANDOMIZED, 
       ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256', 
       COLUMN_ENCRYPTION_KEY = test
    ) NOT NULL
);
GO

So essentially we followed this workflow:

pic6

Now we can try to insert some data into the table:

 

DECLARE @LastName NVARCHAR(32) = N'Gennady', @Salary INT = 720000;
INSERT dbo.EncryptedTable(LastName,Salary) SELECT @LastName, @Salary;
go

This will actually error out, but why? Since encryption and decryption occur at the driver level, this is the way you populate these encrypted columns from client-side code. You don’t have to make any changes to the data access layer of your application except for adding the following to your connection string:

Column Encryption Setting=Enabled

Together with rolling out your Column Master Key certificate to the client, this should make it easy for your application to work with Always Encrypted.
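For illustration, here is a minimal sketch of the same insert done from Java, assuming the Microsoft JDBC Driver for SQL Server 6.0 or later (where the equivalent switch is the columnEncryptionSetting connection property) and assuming the column master key certificate has been deployed to the client machine; server name, database and credentials below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AlwaysEncryptedInsert {
    public static void main(String[] args) throws Exception {
        // columnEncryptionSetting=Enabled tells the driver to transparently
        // encrypt parameters that target encrypted columns (and decrypt results).
        String url = "jdbc:sqlserver://myserver:1433;databaseName=AEDemo;"
                   + "user=myUser;password=myPassword;"
                   + "columnEncryptionSetting=Enabled";

        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO dbo.EncryptedTable (LastName, Salary) VALUES (?, ?)")) {
            // Values travel as parameters, so the driver can encrypt them on the
            // client with the column encryption key before they hit the wire.
            ps.setNString(1, "Gennady");   // deterministically encrypted column
            ps.setInt(2, 720000);          // randomized encrypted column
            ps.executeUpdate();
        }
    }
}

The key point is the same as with ADO.NET: values must be passed as parameters, never concatenated into the statement text, so the driver can handle the encryption for you.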

pic8

Great details on Always Encrypted can be found here –

BOL – https://msdn.microsoft.com/en-us/library/mt163865.aspx

Channel 9 – https://channel9.msdn.com/Shows/Data-Exposed/Getting-Started-with-Always-Encrypted-with-SSMS?ocid=relatedentry

As with every security and encryption feature, there is a performance overhead to enabling Always Encrypted. Aaron Bertrand has attempted some benchmarking, shown here – http://sqlperformance.com/2015/08/sql-server-2016/always-encrypted-performance-follow-up

Meet Memcached– Introduction To Another In-Memory Caching System

Memcached_svg

After writing in this blog on Microsoft AppFabric Cache and Redis, I will be following up with Memcached. Similarly to the above-mentioned systems, Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) such as results of database calls, API calls, or page rendering. Just like them, it is intended to speed up dynamic web applications by alleviating database load.

Memcached was originally developed by Danga Interactive for LiveJournal, but is now used by many other systems, including MocoSpace, YouTube, Reddit, Facebook, Tumblr and Wikipedia. Engine Yard and Jelastic use Memcached as part of their platform-as-a-service technology stack, and Heroku offers several Memcached services as part of its platform as a service. Google App Engine, AppScale, Microsoft Azure and Amazon Web Services also offer a Memcached service through an API.

So Why Memcached?

  • Free & Open source
  • High performance
  • Simple to set up
  • Ease of development
  • APIs are available for most popular languages

Important to note what Memcached is NOT:

  • a persistent data store
  • a database
  • application-specific
  • a large object cache
  • fault-tolerant or
  • highly available

 

Memcached’s primary storage algorithm is a hash table.

The hash table is basically a power-of-two sized array of pointers to entries. Collisions in the hash table are resolved via separate chaining with linked chains of entries: each entry consists of a pointer to the key, a pointer to the value and a pointer to the next chained entry. The hash table size is chosen simply as the ceiling power of 2 closest to the doubled number of entries the table needs to contain. This means the effective load factor varies from 0.5 to 1.0, with an average of 0.75. Incidentally, the same load factor management strategy is implemented in the C++ standard library’s unordered_map.

image

Memcached supports multithreaded access to the store. It controls access to the resources via bare POSIX thread mutexes. Operations on hash table buckets are guarded by one of the pthread_mutex objects in a power-of-two sized array. The size of this array cannot be smaller than the hash table size. The index of the mutex for a bucket is determined as bucket index % bucket mutex array size, i.e. each mutex is responsible for hash table size / bucket mutex array size buckets.
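The bucket-striping idea itself is easy to illustrate outside of C. Below is a rough Java sketch of the scheme described above (a power-of-two array of locks, with a bucket mapped onto a lock by modulo); it is only an illustration of the concept, not Memcached’s actual implementation:

import java.util.concurrent.locks.ReentrantLock;

// Illustration only: a power-of-two array of locks guarding hash table buckets,
// where lock index = bucket index % lock array size (done here with a bit mask).
final class StripedBucketLocks {
    private final ReentrantLock[] locks;
    private final int mask;

    StripedBucketLocks(int stripes) {              // stripes must be a power of two
        locks = new ReentrantLock[stripes];
        for (int i = 0; i < stripes; i++) {
            locks[i] = new ReentrantLock();
        }
        mask = stripes - 1;
    }

    // Each lock is responsible for (table size / stripes) buckets.
    ReentrantLock lockForBucket(int bucketIndex) {
        return locks[bucketIndex & mask];          // same as bucketIndex % stripes
    }
}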

Installing Memcached on Ubuntu is really easy.

To install Memcached on Ubuntu, go to the terminal and type the following commands:

 

$sudo apt-get update
$sudo apt-get install memcached

Once installed, Memcached should be running on the default port, 11211. To check whether Memcached is currently running, run the command given below:

$ps aux | grep memcached

Memcached is originally a Linux application, but since it is open source, it has been compiled for Windows. There are two major sources for the pre-built Windows binaries, Jellycan and Northscale, and both versions can be used. The following are the download links for the Memcached Windows binaries:

http://code.jellycan.com/files/memcached-1.2.5-win32-bin.zip

http://code.jellycan.com/files/memcached-1.2.6-win32-bin.zip

http://downloads.northscale.com/memcached-win32-1.4.4-14.zip

http://downloads.northscale.com/memcached-win64-1.4.4-14.zip

http://downloads.northscale.com/memcached-1.4.5-x86.zip

http://downloads.northscale.com/memcached-1.4.5-amd64.zip

Full tutorial to install Memcached on Windows available here – https://commaster.net/content/installing-memcached-windows

To connect to a Memcached server, you use the telnet command, providing the HOST name and PORT number.

Using basic TelNet

$telnet HOST PORT

Here, HOST and PORT are machine IP and port number respectively, on which the Memcached server is running.

Here I will connect to the local Memcached running on the default port and execute basic set and get commands:

$telnet 127.0.0.1 11211
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
// now store some data and get it from memcached server
set test 0 900 9
memcached
STORED
get test
VALUE test 0 9
memcached
END
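The same set/get exchange can, of course, be driven from application code. As a quick taste of the client APIs (covered in more detail in a later post), here is a minimal sketch assuming the spymemcached Java client and a local daemon on the default port:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcachedQuickStart {
    public static void main(String[] args) throws Exception {
        // Connect to the local Memcached instance on the default port.
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));

        // Equivalent of "set test 0 900 9" / "get test" from the telnet session above.
        client.set("test", 900, "memcached").get();   // wait for the store to complete
        System.out.println(client.get("test"));       // prints: memcached

        client.shutdown();
    }
}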

Popular basic CLI commands

Command   Description                  Example
get       Reads a value by key         get mykey
set       Sets a key unconditionally   set mykey 0 60 5   (0 = no flags, 60 = TTL in seconds, 5 = size in bytes)
add       Adds a new key               add newkey 0 60 5
replace   Replaces an existing key     replace mykey 0 60 5
delete    Deletes an existing key      delete mykey

 

See details on these and other commands here – http://blog.elijaa.org/2010/05/21/memcached-telnet-command-summary/

In the near future I will go into more detail on Memcached, including available clients, configuration, Java development with clients and the AWS ElastiCache implementation. For now, you can also explore more here – https://memcached.org/, https://smarttechie.org/2013/07/20/memcached-a-distributed-memory-object-caching-system/, http://code.tutsplus.com/tutorials/turbocharge-your-website-with-memcached–net-23939, and https://wincent.com/wiki/Testing_memcached_with_telnet

Hope this helps.

Ocultos Exitus–JDBC Driver Unicode Settings and SQL Server Performance

While troubleshooting JDBC client applications that connect to SQL Server I have run into this issue a few times, the latest very recently.

As you may already know, SQL Server differentiates data types that support Unicode from the ones that support only ASCII. For example, the character data types that support Unicode are nchar, nvarchar and longnvarchar, whereas their ASCII counterparts are char, varchar and longvarchar respectively. By default, all of Microsoft’s JDBC drivers send strings in Unicode format to SQL Server, irrespective of whether the data type of the corresponding column defined in SQL Server supports Unicode or not.

In cases where the data types of the columns support Unicode, everything is smooth. But where the column data types do not support Unicode, serious performance issues arise, especially during data fetches: SQL Server tries to implicitly convert the non-Unicode data types in the table to Unicode before doing the comparison. Moreover, if an index exists on the non-Unicode column, it will be ignored. This ultimately leads to a whole table scan during data fetch, thereby slowing down search queries drastically.

This can be corrected by resetting one of the default parameters in the JDBC driver. The parameter name and the value to be set vary from driver to driver, depending on the vendor, as shown in the table and the example below.

Vendor                 Parameter
JSQLConnect            asciiStringParameters
JTDS                   sendStringParametersAsUnicode
DataDirectConnect      sendStringParametersAsUnicode
Microsoft JDBC         sendStringParametersAsUnicode
WebLogic Type 4 JDBC   sendStringParametersAsUnicode
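For the Microsoft JDBC driver, for instance, the property can simply be appended to the connection URL (or set via a Properties object). A minimal sketch, with placeholder server, database, table and credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AsciiParameterLookup {
    public static void main(String[] args) throws Exception {
        // sendStringParametersAsUnicode=false makes the driver send string
        // parameters as ASCII/MBCS, so a predicate against a varchar column can
        // use its index instead of forcing an implicit conversion to nvarchar.
        String url = "jdbc:sqlserver://myserver:1433;databaseName=MyDb;"
                   + "user=myUser;password=myPassword;"
                   + "sendStringParametersAsUnicode=false";

        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(
                 "SELECT CustomerId FROM dbo.Customers WHERE AccountCode = ?")) {
            ps.setString(1, "ACME-001");   // sent as varchar, not nvarchar
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}

The flip side: if the column really is nvarchar and holds non-ASCII data, leave the setting at its default, otherwise you trade the conversion problem for data loss.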

Reading https://msdn.microsoft.com/en-us/library/ms378857(v=sql.110).aspx you will see: “For optimal performance with CHAR, VARCHAR or LONGVARCHAR type of non-Unicode parameters, set the sendStringParametersAsUnicode connection string property to “false” and use the non-national character methods.” The issue was also reported via http://www.sqlconsulting.com/jdbc.shtml, http://www.jochenhebbrecht.be/site/2014-05-01/java/fixing-slow-queries-running-sql-server-using-jpa-hibernate-and-jtds and http://www.codeproject.com/Articles/281364/Solving-Performance-issues-in-data-migration-to-SQ

Hope this helps.

Taking Care of the Garbage – Generational is the default GC policy on newer WebSphere AS

Imagine you have a legacy Java application running in IBM WebSphere that you have finally upgraded to a newer version. Yet the customer is reporting a serious performance regression. Why would that be? Well, one reason may be a change in default JVM behavior between WebSphere versions, something one of my customers discovered the “hard way”.

Garbage collection (GC) is an integral part of the Java Virtual Machine (JVM) as it collects unused Java heap memory so that the application can continue allocating new objects. The effectiveness and performance of the GC play an important role in application performance and determinism. The IBM JVM provided with IBM WebSphere Application Server provides four different GC policy algorithms:

  • -Xgcpolicy:optthruput
  • -Xgcpolicy:optavgpause
  • -Xgcpolicy:gencon
  • -Xgcpolicy:balanced

Each of these algorithms provides different performance and deterministic qualities. In addition, the default policy in WebSphere Application Server V8 has changed from -Xgcpolicy:optthruput to -Xgcpolicy:gencon. So let’s dive into what this really means.

The garbage collector

Different applications naturally have different memory usage patterns. A computationally intensive number crunching workload will not use the Java heap in the same way as a highly transactional customer-facing interface. To optimally handle these different sorts of workloads, different garbage collection strategies are required. The IBM JVM supports several garbage collection policies to enable you to choose the strategy that best fits your application.

The parallel mark-sweep-compact collector: optthruput, formerly the default

The simplest possible garbage collection technique is to continue allocating until free memory has been exhausted, then stop the application and process the entire heap. While this results in a very efficient garbage collector, it means that the user program must be able to tolerate the pauses introduced by the collector. Workloads that are only concerned about overall throughput might benefit from this strategy.

The optthruput policy (-Xgcpolicy:optthruput) implements this strategy. This collector uses a parallel mark-sweep algorithm. In a nutshell, this means that the collector first walks through the set of reachable objects, marking them as live data. A second pass then sweeps away the unmarked objects, leaving behind free memory that can be used for new allocations. The majority of this work can be done in parallel, so the collector uses additional threads (up to the number of CPUs by default) to get the job done faster, reducing the time the application remains paused.

figure1

The problem with a mark-sweep algorithm is that it can lead to fragmentation. There might be lots of free memory, but if it is in small slices interspersed with live objects then no individual piece might be large enough to satisfy a particular allocation.

The solution to this is compaction. In theory, the compactor slides all the live objects together to one end of the heap, leaving a single contiguous block of free space. This is an expensive operation because every live object might be moved, and every pointer to a moved object must be updated to the new location. As a result, compaction is generally only done when it appears to be necessary. Compaction can also be done in parallel, but it results in a less efficient packing of the live objects — instead of a single block of free space, several smaller ones might be created.

figure2

The concurrent collector: optavgpause

For applications that are willing to trade some overall throughput for shorter pauses, a different policy is available. The optavgpause policy (-Xgcpolicy:optavgpause) attempts to do as much GC work as possible before stopping the application, leading to shorter pauses. The same mark-sweep-compact collector is used, but much of the mark and sweep phases can be done as the application runs. Based on the program’s allocation rate, the system attempts to predict when the next garbage collection will be required. When this threshold approaches, a concurrent GC begins. As application threads allocate objects, they will occasionally be asked to do a small amount of GC work before their allocation is fulfilled. The more allocations a thread does, the more it will be asked to help out. Meanwhile, one or more background GC threads will use idle cycles to get additional work done. Once all the concurrent work is done, or if free memory is exhausted ahead of schedule, the application is halted and the collection is completed. This pause is generally short, unless a compaction is required. Because compaction requires moving and updating live objects, it cannot be done concurrently.

figure3

 

The generational collector: gencon

It has long been observed that the majority of objects created are only used for a short period of time. This is the result of both programming techniques and the type of application. Many common Java idioms create helper objects that are quickly discarded; for example StringBuffer/StringBuilder objects, or Iterator objects. These are allocated to accomplish a specific task, and are rarely needed afterwards. On a larger scale, applications that are transactional in nature also tend to create groups of objects that are used and discarded together. Once a reply to a database query has been returned, then the reply, the intermediate state, and the query itself are no longer needed.

This observation led to the development of generational garbage collectors. The idea is to divide the heap up into different areas, and collect these areas at different rates. New objects are allocated out of one such area, called the nursery (or newspace). Since most objects in this area will become garbage quickly, collecting it offers the best chance to recover memory. Once an object has survived for a while, it is moved into a different area, called tenure (or oldspace). These objects are less likely to become garbage, so the collector examines them much less frequently. For the right sort of workload the result is collections that are faster and more efficient since less memory is examined, and a higher percentage of examined objects are reclaimed. Faster collections mean shorter pauses, and thus better application responsiveness.
IBM’s gencon policy (-Xgcpolicy:gencon) offers a generational GC (“gen-”) on top of the concurrent one described above (“-con”). The tenure space is collected as described above, while the nursery space uses a copying collector. This algorithm works by further subdividing the nursery area into allocate and survivor spaces. New objects are placed in the allocate space until its free space has been exhausted. The application is then halted, and any live objects in allocate are copied into survivor. The two spaces then swap roles; that is, survivor becomes allocate, and the application is resumed. If an object has survived for a number of these copies, it is moved into the tenure area instead.

figure4

 

The region-based collector: balanced

A new garbage collection policy has been added in WebSphere Application Server V8. This policy, called balanced (-Xgcpolicy:balanced), expands on the notion of having different areas of the heap. It divides the heap into a large number of regions, which can be dealt with individually. Frankly, I haven’t yet seen it used by any customer I have worked with.
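Coming back to the upgrade scenario from the introduction: if the new gencon default turns out to be a regression for a particular workload, the old policy can be pinned explicitly through the server’s generic JVM arguments, with verbose GC logging turned on while you compare the two. A minimal sketch; the log path is just an example:

-Xgcpolicy:optthruput -verbose:gc -Xverbosegclog:/tmp/verbosegc.log

Conversely, if you stay on gencon, the nursery size (-Xmn and friends) is usually the first knob to look at for transactional workloads.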

 

For more on WebSphere IBM JVM GC see – http://www.perfdaddy.com/2015/10/ibm-jvm-tuning-gencon-gc-policy.html, http://www.ibmsystemsmag.com/ibmi/administrator/websphere/Tuning-Garbage-Collection-With-IBM-Technology-for/, http://javaeesupportpatterns.blogspot.com/2012/03/ibm-jvm-tuning-gencon-gc-policy.html

Hope this helps.

Javacore Dump Analysis using JCA – IBM Thread and Monitor Dump Analyzer for Java

In my previous blog posts I spent some time illustrating tools for thread analysis for the Oracle\Sun HotSpot JVM. However, recently I had to analyze stack\javacore dumps from IBM WebSphere for a hang condition and therefore had to research equivalent tools to analyze dumps from that JVM.

As we all know, during the run time of a Java process, some Java Virtual Machines (JVMs) may not respond predictably and oftentimes seem to hang for a long time or until JVM shutdown occurs. It is not easy to determine the root cause of these sorts of problems.

By triggering a javacore when a Java process does not respond, it is possible to collect diagnostic information related to the JVM and a Java application captured at a particular point during execution. For example, the information can be about the operating system, the application environment, threads, native stack, locks, and memory. The exact contents are dependent on the platform on which the application is running.

On non-IBM platforms, and in most cases, a javacore is known as a “javadump.” Check out my previous post on how to analyze dumps on the Oracle Sun JVM via the jstack utility. The code that creates the javacore is part of the JVM. One can control it by using environment variables and run-time switches. By default, a javacore is produced when the JVM terminates unexpectedly. A javacore can also be triggered by sending specific signals to the JVM. Although javacore or javadump is present in Sun JVMs, much of the content of the javacore is added by IBM and, therefore, is present only in IBM JVMs.
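For example, on Linux or AIX the IBM JVM writes a javacore*.txt file (to the working directory, or to the location pointed to by IBM_JAVACOREDIR) in response to SIGQUIT, without terminating the process. A minimal sketch:

# find the application server's java process
ps -ef | grep '[j]ava'
# ask the IBM JVM for a javacore; the process keeps running
kill -3 <pid>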

This technology analyzes each thread’s information and provides diagnostic information, such as current thread information, the signal that caused the javacore, Java heap information (maximum Java heap size, initial Java heap size, garbage collector counter, allocation failure counter, free Java heap size, and allocated Java heap size), number of runnable threads, total number of threads, number of monitors locked, and deadlock information.

You can download the tool from here – https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=2245aa39-fa5c-4475-b891-14c205f7333c.

Once the tool is downloaded, you can run jca.jar with the Java Run-time Environment

command_line_jca

This will open up the tool

jca_splash

Let’s now use File-Open and open a javacore\dump file

jca_fileopen

The tool will show a screen with a progress bar while it loads the javacore. Click on the javacore you just loaded in the Thread Dump List

jca_afterfileopen

As you can see, you get a huge amount of JVM settings detail. In addition, if the tool detected a deadlock, it will be displayed in the lower section of the Thread Dump List

jca_summary

and

dead

 

Selecting the Compare Monitors option from the Analysis menu will show the deadlocked threads

jca_dead

Another useful screen is the Thread Status screen. Here you can see RUNNABLE vs. PARKED vs. BLOCKED threads and the associated stacks. This screen can be very useful in resolving slow response and hang conditions

jca_thread_status

A complete explanation of all thread states for Java 7 can be found here – https://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.71.doc/diag/tools/javadump_tags_threads.html. Once you locate the RUNNABLE threads that are executing your application code, find out which method is being executed by following the stack trace. You may get assistance from the development team if needed. Also note the Thread ID.

The thread in the example below is in BLOCKED state, which typically means it is waiting to acquire a lock on an object monitor. You will need to search in the earlier section and determine which thread is holding the lock so you can pinpoint the root cause.

3XMTHREADINFO      "[STUCK] ExecuteThread: '162' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000013ACF0800, j9thread_t:0x000000013AC88B20, java/lang/Thread:0x070000001F945798, state:B, prio=1

3XMTHREADINFO1            (native thread ID:0x1AD0F3, native priority:0x1, native policy:UNKNOWN)

3XMTHREADINFO3           Java callstack:

4XESTACKTRACE                at org/springframework/jms/connection/SingleConnectionFactory.createConnection(SingleConnectionFactory.java:207(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/connection/SingleConnectionFactory.createQueueConnection(SingleConnectionFactory.java:222(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate102.createConnection(JmsTemplate102.java:169(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.execute(JmsTemplate.java:418(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.send(JmsTemplate.java:475(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.send(JmsTemplate.java:467(Compiled Code))

…………………………………………………………………………………………………………

 

Hope this helps. For more see – https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=2245aa39-fa5c-4475-b891-14c205f7333c, https://www.ibm.com/developerworks/community/blogs/e8206aad-10e2-4c49-b00c-fee572815374/entry/Java_Core_Debugging_using_IBM_Thread_and_Monitor_Dump_Analyzer_for_Java?lang=en

Et Obscure Ex Alto–TDS Compatibility for SQL Server Upgrades

As the majority of folks may know, time is counting down on extended support for SQL Server 2005 – it officially ends April 12, 2016. Working with folks on upgrading fairly old systems from SQL Server 2000\2005 to newer versions like SQL Server 2012\2014, we ran into the fairly rare issue of TDS backward compatibility, mainly with very old clients running older versions of JDBC\ODBC drivers.

Tabular Data Stream (TDS) is an application layer protocol used to transfer data between a database server and a client. It was initially designed and developed by Sybase Inc. for their Sybase SQL Server relational database engine in 1984, and later by Microsoft for Microsoft SQL Server.

tds

Microsoft published the TDS specification in 2008 – https://msdn.microsoft.com/en-us/library/cc448435.aspx. Still, TDS, and especially TDS compatibility on SQL Server upgrades, is a pretty obscure topic. Microsoft published a list of TDS versions matched to SQL Server and .NET Framework versions here – https://msdn.microsoft.com/en-us/library/dd339982.aspx

TDS version      SQL Server version     .NET Framework version
7.0              SQL Server 7.0         .NET Framework 1.1
7.1              SQL Server 2000        .NET Framework 1.1
7.1 Revision 1   SQL Server 2000 SP1    .NET Framework 1.1
7.2              SQL Server 2005        .NET Framework 2.0
7.3.A            SQL Server 2008        .NET Framework 4.0
7.3.B            SQL Server 2008 R2     N/A
7.4              SQL Server 2012        .NET Framework 4.5
7.4              SQL Server 2014        .NET Framework 4.5
7.4              SQL Server 2016 CTP2   .NET Framework 4.6

Although TDS compatibility isn’t well documented by Microsoft, from what I have experienced it appears Microsoft keeps at least two versions of TDS backward compatible.

So how do you find out what versions of TDS are connecting to your server? On newer versions like SQL Server 2005 and above, where DMVs are present, the following query can be used:

select protocol_type, protocol_version from sys.dm_exec_connections

Result will be something like:

TSQL   1930035203 

Well, the big number still doesn’t say which version of the data access driver was used, does it? Actually it does. The integer value is the representation of the TDS version (for TSQL). It needs to be translated as follows, converting the protocol version to hex:

SELECT CONVERT(BINARY(4), 1930035203) 

Result will be as follows:

0x730A0003

Take the first two hex digits (0x73) and map them to the correct SQL Server version based on the table above. So now we know that the connection was made via TDS 7.3x, possibly via the SQL Server 2008 client stack.

In one of its older posts SQLCAT actually combined all of these steps into one query:

SELECT session_id, protocol_type, driver_version = 
CASE SUBSTRING(CAST(protocol_version AS BINARY(4)), 1, 1)
WHEN 0x70 THEN 'SQL Server 7.0'
WHEN 0x71 THEN 'SQL Server 2000'
WHEN 0x72 THEN 'SQL Server 2005'
WHEN 0x73 THEN 'SQL Server 2008'
ELSE 'Unknown driver'
END
FROM sys.dm_exec_connections

But what if I am on SQL Server 2000, well before DMVs were available, and need to check which TDS versions are connecting before an upgrade?

Here things become a lot more difficult and labor intensive. The only way I know of is to check the version in the TDS pre-login packet via a network capture on the SQL Server side.

As per https://msdn.microsoft.com/en-us/library/dd357559.aspx, PRELOGIN is a message sent by the client to set up the context for login. The server responds to a client PRELOGIN message with a message of packet header type 0x04 and packet data containing a PRELOGIN structure. As part of that structure there is an optional VERSION value:

PL_OPTION_TOKEN   Value   Description
VERSION           0x00    PL_OPTION_DATA = UL_VERSION followed by US_SUBBUILD

UL_VERSION is represented in network byte order (big-endian).

The server SHOULD use the VERSION sent by the client to the server. The client SHOULD use the version returned from the server to determine which features are enabled or disabled. The client SHOULD do this only if it is known that this feature is supported by that version of the database.<27>

Theoretically, upon capture of such data via a network trace, it can be filtered in Wireshark – from https://www.wireshark.org/docs/dfref/t/tds.html:

tds.prelogin.option.version	Version	Unsigned integer, 4 bytes	2.0.0

According to the spec the version is an option – the client should provide it per the docs, but possibly some clients don’t. I assume that would be pretty rare; at least all mainstream clients I have seen do send it.
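In practice that means a display filter on that very field (the name comes from the Wireshark reference above) is enough to isolate the PRELOGIN packets that carry a version, for example:

tds.prelogin.option.version

or, to eyeball all TDS traffic on the default SQL Server port first:

tds && tcp.port == 1433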

So dealing with older, obscure SQL Server clients is hard, but it doesn’t have to be impossible.

For more see – https://blogs.msdn.microsoft.com/jenss/2009/03/02/tds-protocol-versions-meet-client-stacks/, https://blogs.msdn.microsoft.com/sql_protocols/2008/07/15/connection-failure-because-of-mismatched-tds-version/, https://msdn.microsoft.com/en-us/library/dd339982.aspx

Dancing with Elephants and Flying with The Bees–Using ORC File Format with Apache Hive

hive

 

When you start with Hive on Hadoop, the clear majority of samples and tutorials will have you work with text files. However, some time ago the disadvantages of text files as a file format were clearly seen by the Hive community in terms of storage efficiency and performance.

The first move to better columnar storage was the introduction of the Hive RCFile format. RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems. Hive added the RCFile format in version 0.6.0. RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part. Internals of the RCFile format can be found in the JavaDoc here – http://hive.apache.org/javadocs/r1.0.1/api/org/apache/hadoop/hive/ql/io/RCFile.html. What is important to note is why it was introduced, as far as advantages:

  • As row-store, RCFile guarantees that data in the same row are located in the same node
  • As column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.

As time passed, the explosion of data and the need for faster HiveQL queries pushed the need for further optimized columnar storage file formats. Therefore, the ORC file format was introduced. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

ORC has the following advantages over the RCFile format:

  • a single file as the output of each task, which reduces the NameNode’s load
  • light-weight indexes stored within the file, allowing the reader to skip row groups that don’t pass predicate filtering and to seek to a given row
  • block-mode compression based on data type

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates count, min, max, and sum.

image

What does it all mean for me?

What it means for us as implementers is the following:

  • Better read performance due to compression. Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
  • Introduction of column-level statistics for optimization, a feature that has long existed in pretty much all commercial RDBMS packages (Oracle, SQL Server, etc.). The goal of the column statistics is that for each column, the writer records the count and, depending on the type, other useful fields. For most of the primitive types, it records the minimum and maximum values; for numeric types it additionally stores the sum. From Hive 1.1.0 onwards, the column statistics will also record whether there are any null values within the row group by setting the hasNull flag.
  • Light-weight indexing
  • Larger blocks, 256 MB by default

Here is a good Hive file format comparison from Hortonworks:

image

Using ORC – create a table with the ORC format:

The simplest way to create an ORC-formatted Hive table is to add STORED AS ORC to the Hive CREATE TABLE statement, like:

CREATE TABLE my_table  (
column1 STRING,
column2 STRING,
column3 INT,
column4 INT
) STORED AS ORC;

The ORC file format can be used together with Hive partitioning, which I explained in my previous post. Here is an example of using Hive partitioning with the ORC file format:

CREATE  TABLE airanalytics 
(flightdate date ,dayofweek int,depttime int,crsdepttime int,arrtime int,crsarrtime int,uniquecarrier varchar(10),flightno int,tailnum int,aet int,cet int,airtime int,arrdelay int,depdelay int,origin varchar(5),dest varchar(5),distance int,taxin int,taxout int,cancelled int,cancelcode int,diverted string,carrdelay string,weatherdelay string,securtydelay string,cadelay string,lateaircraft string) 
 PARTITIONED BY (flight_year String)
 clustered BY (uniquecarrier)
 sorted BY (flightdate)
 INTO 24 buckets
 stored AS orc tblproperties ("orc.compress"="NONE","orc.stripe.size"="67108864", "orc.row.index.stride"="25000")

The parameters added at the table level are, as per the docs:

Key                        Default        Notes
orc.compress               ZLIB           high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size          262,144        number of bytes in each compression chunk
orc.stripe.size            268,435,456    number of bytes in each stripe
orc.row.index.stride       10,000         number of rows between index entries (must be >= 1000)
orc.create.index           true           whether to create row indexes
orc.bloom.filter.columns   ""             comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp       0.05           false positive probability for bloom filter (must be >0.0 and <1.0)

If you have an existing Hive table, it can be moved to ORC via one of the following (see the sketch after this list):

    • ALTER TABLE … [PARTITION partition_spec] SET FILEFORMAT ORC
    • SET hive.default.fileformat=Orc
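Note that ALTER TABLE … SET FILEFORMAT only changes the table (or partition) metadata going forward – files already in place are not rewritten. A common pattern, sketched below with illustrative table names, is to materialize an ORC copy with CREATE TABLE AS SELECT and then switch over to it:

-- create an ORC copy of an existing (e.g. text-format) table
CREATE TABLE my_table_orc
STORED AS ORC
AS SELECT * FROM my_table;

-- once the copy is verified, swap the names
ALTER TABLE my_table RENAME TO my_table_text;
ALTER TABLE my_table_orc RENAME TO my_table;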

For more information see – https://en.wikipedia.org/wiki/RCFile, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC, https://cwiki.apache.org/confluence/display/Hive/FileFormats, https://orc.apache.org/docs/hive-config.html

Hope this helps.