Carpe Noctem – Using SQL Server Diagnostics (Preview) to analyze a SQL Server minidump

Microsoft just released the SQL Server Diagnostics (Preview) extension for SQL Server Management Studio, along with Developer APIs, to empower SQL Server customers to achieve more through a variety of offerings that help them self-resolve SQL Server issues.

diagnostic

So what can it do?

    • Analyze SQL Server memory dumps.

Customers should be able to debug and self-resolve memory dump issues from their SQL Server instances and receive recommended Knowledge Base (KB) article(s) from Microsoft, which may be applicable for the fix.

  • Review recommendations to keep SQL Server instances up to date.

 

Customers will be able to keep their SQL Server instances up-to-date by easily reviewing the recommendations for their SQL Server instances. Customers can filter by product version or by feature area (e.g. Always On, Backup/Restore, Column Store, etc.) and view the latest Cumulative Updates (CU) and the underlying hotfixes addressed in the CU.

  • Developers who want to discover and learn about Microsoft APIs can visit the developer portal and then use the APIs in their custom applications. Developers can log and discuss issues and even submit their applications to the application gallery.

 

So I installed this extension and decided to give it a go with one of the SQL Server minidumps I have.

diagnostics2

It correctly identified an issue and issued suggestions. So, while this may not work on every dump, it is worth trying before you fire up WinDbg.

Give it a try. Hope this helps.

Let Me Count The Ways – Various methods of generating stack dump for JVM in production

As I blogged previously, thread dumps in Java are essential in diagnosing production issues such as high CPU, locking, and threading deadlocks. There are great online thread dump analysis tools, such as http://fastthread.io/, that can analyze dumps and spot problems. But you need to provide proper thread dumps as input to those tools. I have already blogged about many tools for doing so in the past, like jstack, Java VisualVM, and Java Mission Control. Here I will try to summarize all of the ways to capture usable thread dumps from a production Java application:

  • JStack

jstack remains one of the most common ways to capture thread dumps. It's a command-line utility bundled with the JDK and ships in the JDK_HOME\bin folder. Here is the command that you need to issue to capture a thread dump:

jstack -l <pid> > <file-path>

Where:

pid: the process ID of the application whose thread dump should be captured

file-path: the file path to which the thread dump will be written.

Example:

jstack -l 37321 > /opt/tmp/threadDump.txt

As per the example, the thread dump of the process will be written to the /opt/tmp/threadDump.txt file.
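The same jstack invocation can also be driven programmatically, which is handy when you want to script dump collection on a production box. Below is a minimal sketch of my own (not part of the original post), assuming a JDK is installed and jstack is on the PATH; the class and method names are purely illustrative:

import java.io.File;
import java.io.IOException;

public class JstackCapture {

    // Shells out to jstack and redirects its output (and any errors) to the given file.
    public static void capture(String pid, File outFile) throws IOException, InterruptedException {
        new ProcessBuilder("jstack", "-l", pid)
                .redirectErrorStream(true)
                .redirectOutput(outFile)
                .start()
                .waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Usage example: java JstackCapture 37321 /opt/tmp/threadDump.txt
        capture(args[0], new File(args[1]));
    }
}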

    • kill -3

 

At many customer sites only JREs are installed on production machines. Since jstack and the other tools are part of the JDK only, you wouldn't be able to use jstack there. In such circumstances, the 'kill -3' option can be used.

kill -3 <pid>

Where:

pid: the process ID of the application whose thread dump should be captured

Example:

kill -3 37321

When the 'kill -3' option is used, the thread dump is sent to the standard error stream. For example, for apps running under Tomcat it will end up in the <TOMCAT_HOME>/logs/catalina.out file.

    • Java VisualVM

Java VisualVM is a graphical user interface tool that provides detailed information about applications while they are running on a specified Java Virtual Machine (JVM). It's located at JDK_HOME\bin\jvisualvm.exe and has been part of the Sun\Oracle JDK distribution since JDK 6 update 7. Launch jvisualvm. On the left panel you will notice all the Java applications that are running on your machine; select your application from the list (see the red highlight in the diagram below). This tool can also capture thread dumps from Java processes running on a remote host.

vjvm

To generate a thread dump, go to the Threads tab and click the Thread Dump button.

    • Java Mission Control

 

Java Mission Control (JMC) is a tool that collects and analyzes data from Java applications running locally or deployed in production environments. It has been packaged with the JDK since Oracle JDK 7 Update 40 and also provides an option to take thread dumps from the JVM. The JMC tool is located at JDK_HOME\bin\jmc.exe. Once you launch the tool, you will see all the Java processes that are running on your local host. When you use the Flight Recorder feature on one of these processes, the "Thread Dump" field lets you select the interval at which you want to capture thread dumps.

jmc

    • ThreadMXBean

 

Introduced in JDK 1.5, ThreadMXBean is the management interface for the thread system of the JVM. It allows you to create a thread dump in a few lines of code inside your application, as below:

Example

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public void dumpThreadDump() {
    ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
    // dumpAllThreads(true, true) includes locked monitors and locked synchronizers
    for (ThreadInfo ti : threadMxBean.dumpAllThreads(true, true)) {
        System.out.print(ti.toString());
    }
}

threadmxbean
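Since thread dumps are so often taken to chase locking problems, it is worth noting that the same ThreadMXBean API can also report deadlocks directly. The following is a minimal sketch of my own (not part of the original example), using the standard findDeadlockedThreads method:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public void dumpDeadlockedThreads() {
    ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
    // findDeadlockedThreads() returns null when no deadlock is detected
    long[] deadlockedIds = threadMxBean.findDeadlockedThreads();
    if (deadlockedIds != null) {
        for (ThreadInfo ti : threadMxBean.getThreadInfo(deadlockedIds, Integer.MAX_VALUE)) {
            System.out.print(ti.toString());
        }
    }
}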

 

  • JCMD

The jcmd tool was introduced with Oracle's Java 7. It's useful in troubleshooting issues with JVM applications and has various capabilities, such as identifying Java process IDs, acquiring heap dumps, thread dumps, garbage collection statistics, and more.

Using the jcmd command below you can generate a thread dump:

jcmd <pid> Thread.print > <file-path>

Where:

pid: the process ID of the application whose thread dump should be captured

file-path: the file path to which the thread dump will be written.

Example

jcmd 37321 Thread.print > /opt/tmp/threadDump.txt

For more see – https://docs.oracle.com/javase/7/docs/technotes/tools/windows/jcmd.html, https://blogs.oracle.com/jmxetc/entry/threadmxbean_a_singleton_mxbean_for , http://www.javamex.com/tutorials/profiling/profiling_java5_threads_howto.shtml , http://blog.takipi.com/oracle-java-mission-control-the-ultimate-guide/ , https://www.prosysopc.com/blog/using-java-mission-control-for-performance-monitoring/

Taking Care of the Garbage – Generational is the default GC policy on newer WebSphere AS

Imagine you have a legacy Java application running in IBM WebSphere that you have finally upgraded to a newer version. Yet the customer reports a serious performance regression. Why would that be? Well, one reason may be a change in default JVM behavior between WebSphere versions, something that one of my customers discovered the "hard way".

Garbage collection (GC) is an integral part of the Java Virtual Machine (JVM) as it collects unused Java heap memory so that the application can continue allocating new objects. The effectiveness and performance of the GC play an important role in application performance and determinism. The IBM JVM provided with IBM WebSphere Application Server provides four different GC policy algorithms:

  • -Xgcpolicy:optthruput
  • -Xgcpolicy:optavgpause
  • -Xgcpolicy:gencon
  • -Xgcpolicy:balanced

Each of these algorithms provides different performance and deterministic qualities. In addition, the default policy in WebSphere Application Server V8 has changed from -Xgcpolicy:optthruput to -Xgcpolicy:gencon. So let's dive in a bit to what this really means.

The garbage collector

Different applications naturally have different memory usage patterns. A computationally intensive number crunching workload will not use the Java heap in the same way as a highly transactional customer-facing interface. To optimally handle these different sorts of workloads, different garbage collection strategies are required. The IBM JVM supports several garbage collection policies to enable you to choose the strategy that best fits your application.

The parallel mark-sweep-compact collector: optthruput, formerly default

The simplest possible garbage collection technique is to continue allocating until free memory has been exhausted, then stop the application and process the entire heap. While this results in a very efficient garbage collector, it means that the user program must be able to tolerate the pauses introduced by the collector. Workloads that are only concerned about overall throughput might benefit from this strategy.

The optthruput policy (-Xgcpolicy:optthruput) implements this strategy. This collector uses a parallel mark-sweep algorithm. In a nutshell, this means that the collector first walks through the set of reachable objects, marking them as live data. A second pass then sweeps away the unmarked objects, leaving behind free memory that can be used for new allocations. The majority of this work can be done in parallel, so the collector uses additional threads (up to the number of CPUs by default) to get the job done faster, reducing the time the application remains paused.

figure1

The problem with a mark-sweep algorithm is that it can lead to fragmentation. There might be lots of free memory, but if it is in small slices interspersed with live objects, then no individual piece might be large enough to satisfy a particular allocation.

The solution to this is compaction. In theory, the compactor slides all the live objects together to one end of the heap, leaving a single contiguous block of free space. This is an expensive operation because every live object might be moved, and every pointer to a moved object must be updated to the new location. As a result, compaction is generally only done when it appears to be necessary. Compaction can also be done in parallel, but it results in a less efficient packing of the live objects — instead of a single block of free space, several smaller ones might be created.

figure2

The concurrent collector: optavgpause

For applications that are willing to trade some overall throughput for shorter pauses, a different policy is available. The optavgpause policy (-Xgcpolicy:optavgpause) attempts to do as much GC work as possible before stopping the application, leading to shorter pauses. The same mark-sweep-compact collector is used, but much of the mark and sweep phases can be done as the application runs. Based on the program's allocation rate, the system attempts to predict when the next garbage collection will be required. When this threshold approaches, a concurrent GC begins. As application threads allocate objects, they will occasionally be asked to do a small amount of GC work before their allocation is fulfilled. The more allocations a thread does, the more it will be asked to help out. Meanwhile, one or more background GC threads will use idle cycles to get additional work done. Once all the concurrent work is done, or if free memory is exhausted ahead of schedule, the application is halted and the collection is completed. This pause is generally short, unless a compaction is required. Because compaction requires moving and updating live objects, it cannot be done concurrently.

figure3

 

The generational collector: gencon

It has long been observed that the majority of objects created are only used for a short period of time. This is the result of both programming techniques and the type of application. Many common Java idioms create helper objects that are quickly discarded; for example, StringBuffer/StringBuilder objects or Iterator objects. These are allocated to accomplish a specific task and are rarely needed afterwards. On a larger scale, applications that are transactional in nature also tend to create groups of objects that are used and discarded together. Once a reply to a database query has been returned, the reply, the intermediate state, and the query itself are no longer needed.

This observation led to the development of generational garbage collectors. The idea is to divide the heap up into different areas and collect these areas at different rates. New objects are allocated out of one such area, called the nursery (or newspace). Since most objects in this area will become garbage quickly, collecting it offers the best chance to recover memory. Once an object has survived for a while, it is moved into a different area, called tenure (or oldspace). These objects are less likely to become garbage, so the collector examines them much less frequently. For the right sort of workload the result is collections that are faster and more efficient, since less memory is examined and a higher percentage of examined objects are reclaimed. Faster collections mean shorter pauses, and thus better application responsiveness.

IBM's gencon policy (-Xgcpolicy:gencon) offers a generational GC ("gen-") on top of the concurrent one described above ("-con"). The tenure space is collected as described above, while the nursery space uses a copying collector. This algorithm works by further subdividing the nursery area into allocate and survivor spaces. New objects are placed in the allocate space until its free space has been exhausted. The application is then halted, and any live objects in allocate are copied into survivor. The two spaces then swap roles; that is, survivor becomes allocate, and the application is resumed. If an object has survived a number of these copies, it is moved into the tenure area instead.

figure4
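If you want to set the policy explicitly rather than rely on the version default, it is passed as a generic JVM argument. A purely illustrative example (the nursery size here is made up and should be tuned for your workload) might look like this:

-Xgcpolicy:gencon -Xmn1024m -verbose:gc

On the IBM JVM, -Xmn sets the initial and maximum size of the nursery, and -verbose:gc makes it easy to confirm from the logs which collector is actually in use.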

 

The region-based collector: balanced

A new garbage collection policy has been added in WebSphere Application Server V8. This policy, called balanced (-Xgcpolicy:balanced), expands on the notion of having different areas of the heap. It divides the heap into a large number of regions, which can be dealt with individually. Frankly, I haven't seen it used by any customer I have worked with yet.

 

For more on WebSphere IBM JVM GC see – http://www.perfdaddy.com/2015/10/ibm-jvm-tuning-gencon-gc-policy.html, http://www.ibmsystemsmag.com/ibmi/administrator/websphere/Tuning-Garbage-Collection-With-IBM-Technology-for/, http://javaeesupportpatterns.blogspot.com/2012/03/ibm-jvm-tuning-gencon-gc-policy.html

Hope this helps.

Javacore Dump Analysis using JCA – IBM Thread and Monitor Dump Analyzer for Java

In my previous blog posts I spent some time illustrating tools for thread analysis on the Oracle\Sun HotSpot JVM. However, recently I had to analyze stack\javacore dumps from IBM WebSphere for a hang condition and therefore had to research equivalent tools to analyze dumps from that JVM.

As we all know, during the run time of a Java process, some Java Virtual Machines (JVMs) may not respond predictably and oftentimes seem to hang up for a long time or until JVM shutdown occurs. It is not easy to determine the root cause of these sorts of problems.

By triggering a javacore when a Java process does not respond, it is possible to collect diagnostic information related to the JVM and a Java application captured at a particular point during execution. For example, the information can be about the operating system, the application environment, threads, native stack, locks, and memory. The exact contents are dependent on the platform on which the application is running.

On non-IBM platforms, and in most cases, a javacore is known as a "javadump." Check out my previous post on how to analyze dumps on the Oracle\Sun JVM via the jstack utility. The code that creates the javacore is part of the JVM. One can control it using environment variables and run-time switches. By default, a javacore occurs when the JVM terminates unexpectedly. A javacore can also be triggered by sending specific signals to the JVM. Although a javacore or javadump is present in Sun JVMs, much of its content is added by IBM and, therefore, is present only in IBM JVMs.
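As a side note, on the IBM JVM a javacore can also be requested programmatically. Here is a minimal sketch of my own, assuming an IBM JDK where the com.ibm.jvm.Dump API is available (it is not present on HotSpot):

import com.ibm.jvm.Dump;   // IBM JDK only

public class JavacoreTrigger {
    public static void main(String[] args) {
        // Requests a javacore for the current JVM; where the file lands is
        // governed by the JVM's -Xdump/javacore settings.
        Dump.JavaDump();
    }
}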

The JCA tool analyzes each thread's information and provides diagnostic information such as current thread information, the signal that caused the javacore, Java heap information (maximum Java heap size, initial Java heap size, garbage collector counter, allocation failure counter, free Java heap size, and allocated Java heap size), the number of runnable threads, the total number of threads, the number of monitors locked, and deadlock information.

You can download the tool from here – https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=2245aa39-fa5c-4475-b891-14c205f7333c.

Once the tool is downloaded, you can run jca.jar with the Java Runtime Environment:

command_line_jca

This will open up the tool

jca_splash

Let's now use File > Open to open a javacore\dump file:

jca_fileopen

The tool will show a screen with a progress bar while it loads the javacore. Click on the javacore you just loaded in the Thread Dump List:

jca_afterfileopen

As you can see, you get a huge amount of JVM settings detail. In addition, if the tool detects a deadlock, it will be displayed in the lower section of the Thread Dump List:

jca_summary

and

dead

 

Selecting the Compare Monitors option from the Analysis menu will show the deadlocked threads:

jca_dead

Another useful screen is the Thread Status screen. Here you can see RUNNABLE vs. PARKED vs. BLOCKED threads and their associated stacks. This screen can be very useful in resolving slow-response and hang conditions:

jca_thread_status

A complete explanation of all thread states for Java 7 can be found here – https://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.71.doc/diag/tools/javadump_tags_threads.html. Once you locate the RUNNABLE threads that are executing your application code, find out which method is being executed by following the stack trace. You may get assistance from the development team if needed. Also note the thread ID.

The thread in the example below is in BLOCKED state, which typically means it is waiting to acquire a lock on an object monitor. You will need to search the earlier section and determine which thread is holding the lock so you can pinpoint the root cause.

3XMTHREADINFO      "[STUCK] ExecuteThread: '162' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000013ACF0800, j9thread_t:0x000000013AC88B20, java/lang/Thread:0x070000001F945798, state:B, prio=1

3XMTHREADINFO1            (native thread ID:0x1AD0F3, native priority:0x1, native policy:UNKNOWN)

3XMTHREADINFO3           Java callstack:

4XESTACKTRACE                at org/springframework/jms/connection/SingleConnectionFactory.createConnection(SingleConnectionFactory.java:207(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/connection/SingleConnectionFactory.createQueueConnection(SingleConnectionFactory.java:222(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate102.createConnection(JmsTemplate102.java:169(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.execute(JmsTemplate.java:418(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.send(JmsTemplate.java:475(Compiled Code))

4XESTACKTRACE                at org/springframework/jms/core/JmsTemplate.send(JmsTemplate.java:467(Compiled Code))

…………………………………………………………………………………………………………

 

Hope this helps. For more see – https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=2245aa39-fa5c-4475-b891-14c205f7333c, https://www.ibm.com/developerworks/community/blogs/e8206aad-10e2-4c49-b00c-fee572815374/entry/Java_Core_Debugging_using_IBM_Thread_and_Monitor_Dump_Analyzer_for_Java?lang=en

Resolving Java Threading Issues with ThreadLogic Tool

The ThreadLogic utility is a free tool you can download to assist in analyzing thread dumps taken from a JVM. ThreadLogic can digest the log file containing the thread dump output. This utility does a fair amount of the initial analysis for you, like finding locks, the holders of locks, and fatal conditions. If you've ever read a raw thread dump from a log file then you know it can be daunting, especially if you don't know exactly what you are looking for. ThreadLogic helps by recognizing the type of each thread and categorizing them to help you understand which threads are JVM threads, WLS threads, and "application" threads. In addition, ThreadLogic can process a series of thread dumps and perform a "diff" operation between them. This is helpful in determining what threads are doing over a period of time.

ThreadLogic can be downloaded from https://java.net/projects/threadlogic/downloads. ThreadLogic comes in the form of a jar file; we need to run it manually using "java -jar threadLogic.jar".

image

Opening the dump tree and selecting the Advisory Map shows a map with information about the health of the system under investigation. Each advisory has a health level indicating the severity of the issue found, a pattern, a name, a keyword, and related advice.

image

ThreadLogic is able to parse Sun, JRockit, and IBM thread dumps and provide advice based on predefined and externally defined patterns.

The health levels (in descending order of severity) are FATAL (meant for deadlocks, STUCK threads, blocked finalizers, etc.), WARNING, WATCH (worth watching), NORMAL, and IGNORE.

Based on the highest criticality of threads within a group, that health level gets promoted to the thread group's health level, and the same is repeated at the thread dump level. There can be multiple advisories tagged to a thread, thread group, and thread dump. This is a typical advisory map I see:

image

The threads are associated with thread groups based on their functionality or thread names. Additional patterns exist to tag other threads (like iWay Adapter, SAP, or Tibco threads) and group them:

image

For more on the tool see – https://blogs.oracle.com/emeapartnerweblogic/entry/my_first_experiences_with_threadlogic, https://blogs.oracle.com/ATeamExalogic/entry/introducing_threadlogic, https://zeroproductionincidents.wordpress.com/2012/07/10/threadlogic-another-thread-dump-analysis-tool/

Save Our Souls – Troubleshooting MSDTC

Although a lot of folks would rather forget about MSDTC's existence, it is one of those pesky but vital Windows components that has been with us for a long time (since my days of heavy COM development) and will stay with us for a long time yet, although probably with minimal or no Microsoft investment.

MSDTC is an acronym for Microsoft Distributed Transaction Coordinator. As the name says, MSDTC is a Windows service providing transaction infrastructure for distributed systems. In this case, a transaction means a general way of structuring the interactions between autonomous agents in a distributed system. Each transaction is a state transformation with four key properties – the ACID properties: Atomic (all or nothing), Consistent (legal), Isolated (independent of concurrent transactions), and Durable (once it happens, it cannot be abrogated). There are different techniques that implement the ACID properties, but the best known one is two-phase commit. So if you are like my customers, using distributed transactions with SQL Server, whether from .NET, legacy code, or Java, most likely you are using MSDTC on Windows. I don't want to go deeper into DTC, but Denny Cherry has a nice general snippet on it – http://itknowledgeexchange.techtarget.com/sql-server/what-exactly-is-msdtc-any-when-do-i-need-it/ – plus there are the TechNet docs – https://technet.microsoft.com/en-us/library/dd337629(v=ws.10).aspx

For the most part DTC is a “black box” for DBAs and developers. But once in a while you get a situation where you start seeing issues or errors (most common are “Failure to Enlist in Transaction”) that require you to do a bit of troubleshooting.

First things first: the basic idea is to see whether there is connectivity and open communication for DTC between the servers. A tool called DTCPing becomes vital here. You can download the tool from Microsoft here – http://www.microsoft.com/en-us/download/details.aspx?id=2868. Use the following cadence to run the tool:

  • Download and launch the tool on both servers involved in communication via DTC.
  • As you launch the tool you will see the following interface:

image

  • From one of the servers, in the Remote Server Name field type the host name of the other machine. You can get the hostname by opening a command prompt and typing the command "hostname" without the quotes. Press the Ping button.
  • Then, from the other machine, in the Remote Server Name field give the hostname of the first server and press the Ping button.
  • This tests the reverse bind between the machines as well. Also, you can analyze the text files present in the DTCPing folder, which contain the results of the ping.

Typical success in a log (dtcping.log) will look like:

Platform:Windows 2003
++++++++++++ipconfig info++++++++++

++++++++++++lmhosts.sam++++++++++++

++++++++++++hosts      ++++++++++++
127.0.0.1       localhost
199.199.3.2	MachineB
199.199.2.1	firewall

++++++++++++++++++++++++++++++++++++++++++++++
     DTCping 1.8 Report for MachineA
++++++++++++++++++++++++++++++++++++++++++++++
Firewall Port Settings:
	Port:5000-5004
	Port:4343
RPC server is ready
09-11, 14:10:40.187-->RPC server:MachineA received following information:
	Network Name: MachineA
	Source  Port: 5004
	Partner LOG: MachineB528.log
	Partner CID: C2FFD8D2-F9E4-40EF-AFAA-B9A11156A578
++++++++++++Start Reverse Bind Test+++++++++++++
Received Bind call from MachineB
	Network Name: MachineA
	Source  Port: 5004
	Hosting Machine:MachineA
09-11, 14:10:40.671-->Trying to Reverse Bind to MachineB...
	Test Guid:C2FFD8D2-F9E4-40EF-AFAA-B9A11156A578
Name Resolution:
	MachineB-->199.199.3.2-->MachineB
++++++++++++Validating Remote Computer Name++++++++++++
09-11, 14:10:44.937-->Start DTC connection test
Name Resolution:
	MachineB-->199.199.3.2-->MachineB
Reverse Binding success: TOMMYROT-->PACHYDERM
++++++++++++Reverse Bind Test ENDED++++++++++
09-11, 14:10:45.046-->Start RPC test (MachineA-->MachineB)
09-11, 14:10:45.453-->Called POKE from Partner:MachineB
	Network Name: MachineA
	Source  Port: 5004
	Hosting Machine:MachineA
RPC test is successful
	Partner's CID:C2FFD8D2-F9E4-40EF-AFAA-B9A11156A578
++++++++++++RPC test completed+++++++++++++++
++++++++++++Start DTC Binding Test +++++++++++++
Trying Bind to MachineB
09-11, 14:10:47.875-->MachineA Initiating DTC Binding Test....
	Test Guid:89D254FC-B695-4331-B1C1-907B803E491C
Received reverse bind call from MachineB
	Network Name: MachineA
	Source  Port: 5004
	Hosting Machine:MachineA
Binding success: MachineA-->MachineB
++++++++++++DTC Binding Test END+++++++++++++

Whereas a typical failure looks like:

Platform:Windows 2003
++++++IP Configure Information+++++

++++++++++++lmhosts.sam++++++++++++

++++++++++++hosts      ++++++++++++
127.0.0.1       localhost
199.199.3.2	MachineB
199.199.2.1	firewall

++++++++++++++++++++++++++++++++++++++++++++++
     DTCping 1.8 Report for MachineA
++++++++++++++++++++++++++++++++++++++++++++++
Firewall Port Settings:
	Port:5000-5004
	Port:4343
RPC server is ready
++++++++++++Validating Remote Computer Name++++++++++++
09-11, 16:54:58.890-->Start DTC connection test
Name Resolution:
	MachineB-->199.199.3.2-->MachineB
09-11, 16:54:58.890-->Start RPC test (MachineA-->MachineB)
09-11, 16:55:01.687-->RPC server:MachineA received following information:
	Network Name: MachineA
	Source  Port: 5004
	Partner LOG: MachineB520.log
	Partner CID: C2FFD8D2-F9E4-40EF-AFAA-B9A11156A578
++++++++++++Start Reverse Bind Test+++++++++++++
Received Bind call from MachineB
	Network Name: MachineA
	Source  Port: 5004
	Hosting Machine:MachineA
09-11, 16:55:02.000-->Trying to Reverse Bind to MachineB...
	Test Guid:C2FFD8D2-F9E4-40EF-AFAA-B9A11156A578
Name Resolution:
	MachineB-->199.199.3.2-->MachineB
Problem:fail to invoke remote RPC method
Error(0x6BA) at dtcping.cpp @289
-->RPC pinging exception
-->1722(The RPC server is unavailable.)
RPC test failed

There is a very detailed blog from MS Distributed Systems Support on DTCPing log exception analysis with some of very common failures – http://blogs.msdn.com/b/distributedservices/archive/2008/11/12/troubleshooting-msdtc-issues-with-the-dtcping-tool.aspx

So if the DTCPing tool is all about connectivity, ports, and firewalls being open, what are some of the other ways to trace DTC transactions? Well, the fairly obscure and hidden DTC Transaction Tracing can come to the rescue. Transaction tracing becomes a powerful tool when transactions are either taking too long to run or are getting aborted intermittently and nothing inside your application gives you any hint as to why the transaction is getting aborted. By default, only long-lived transactions are traced. You can turn on tracing for all transactions or for aborted transactions, based on the kind of issue you are troubleshooting. You just have to select the relevant checkboxes in the trace settings and tracing will take effect (no restart of the MSDTC service is required to toggle these settings). The next important thing is collecting the trace in the right manner. It is always better to collect a trace that contains just the problem you are troubleshooting, because on a busy server the transaction trace can contain a lot of transactions and it might become very difficult to find the transaction outcome you are interested in.

Steps to enable DTC tracing:

  • Start –> Run –> dcomcnfg
  • Expand My Computer –> Distributed Transaction Coordinator –> Local DTC, right-click, select Properties, and then go to the Tracing tab.
  • By default, Trace All Transactions is not enabled, while Trace Aborted Transactions and Trace Long-Lived Transactions are enabled, as shown below.

  • To collect the traces correctly, reproduce the issue under a new session: first click the Stop Session button and then the New Session button, as shown below, before reproducing the issue.

  • The tracing log will be found under the C:\Windows\System32\MSDTC\Trace folder. Each log can be a maximum of 25 MB; when it exceeds that size, a new log is created.

You can also trace via registry as explained here – https://support.microsoft.com/en-us/kb/926099

At this point, navigate to the C:\Windows\System32\MSDTC\Trace folder; we will need to parse the trace file using the MSDTCVTR batch file in that folder.

The MSDTC trace can contain a lot of other events as well. I am pasting here some of the most common events that we see while parsing trace files from real-world customer scenarios.

ABORT_DUE_TO_TRANSACTION_TIMER_EXPIRED
The Transaction is timing out because of the configured timeout for a transaction in DTC settings.

RECEIVED_ABORT_REQUEST_FROM_BEGINNER
The application which started the transaction issued an abort request to MSDTC to ABORT the transaction. In case of a transaction originating from a COM+ component, a call to ContextUtil.SetAbort will end up sending this event to MSDTC.

TRANSACTION_PROPOGATION_FAILED_CONNECTION_DOWN_FROM_REMOTE_TM
This error is typically due to networking issues. It happens when MSDTC on one machine loses its connection with MSDTC on the other machine. For more information on this event, refer to this article.

TRANSACTION_PROPAGATION_FAILED_TRANSACTION_NOT_FOUND
This event happens when you try to commit or abort a transaction that has been already committed or aborted.

RM_ENLIST_FAILED_TOO_MANY_ENLISTS
DTC already has the maximum number of resources enlisted in the transaction. The maximum number of resources that can be enlisted in a transaction is 32.

Hope this helps. For more on MSDTC Troubleshooting and setup see – http://www.sqlwebpedia.com/content/msdtc-troubleshooting, https://support.microsoft.com/en-us/kb/926099, https://msdn.microsoft.com/en-us/library/ms678891(v=vs.85).aspx, https://msdn.microsoft.com/en-us/library/ms678917(v=vs.85).aspx

Running To Stand Still – Adventures in .NET Parallelism

Last winter I was fortunate enough to be involved in a performance tuning engagement at one of my customer sites. While testing and tuning the customer's ASP.NET web application, we noted that running certain functions\tasks synchronously was, surprisingly, somewhat faster and more stable than making these tasks asynchronous via the Task Parallel Library (TPL).

We saw no CPU contention or extensive context switching. CPU averaged 25-35% and maxed out around 55-60%. Moreover, the async change did not cause noticeably higher CPU; we saw only a 2 to 4% increase on average. The low CPU doesn't surprise me here, since the threads (if I understand correctly) are simply blocked for a while waiting for the network call to complete. So no high CPU, no thread switching, no easy-to-see contention, and yet asynchronous methods that should be running faster are not. Moreover, we observed pretty bad "jitteriness" in application response times, with significant latency spikes.

Looking at ANTS Profiler, the picture is easy to understand (time is spent waiting on async tasks to complete), but it really doesn't answer the question as to why time is spent in the ThreadPool here:

image

Grabbing a couple of hang dumps and looking into them actually started to explain things and uncovered a rather interesting picture. I will scrub the stacks to "protect the innocent" here, but you will get the idea nevertheless.

Looking for managed thread pool contention, you will not notice any:

0:032> !threadpool
CPU utilization: 17%
Worker Thread: Total: 22 Running: 5 Idle: 17 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 0
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 9 Free: 8 MaxFree: 8 CurrentLimit: 9 MaxLimit: 1000 MinLimit: 4

But in addition to clearly showing no thread contention here (i.e., I am not running out of worker threads), something else becomes really apparent. As we can see, there are 17 idle threads with 5 doing some work. The thread pool removes a thread after it's been idle for 10 seconds, so this means the number of threads we are using is bursty. The way Tasks are being used here isn't really the best way to use them. There are two reasons for using Tasks:

  • First, if you have a CPU intensive operation you can put that work on another thread so that you can either return (e.g. in a GUI app to keep UI responsive) or do something else on the current thread.
  • Second, as a way to execute an asynchronous operation such as network or disk I/O and easily queue up the work that needs to be done after the asynchronous operation has completed.

In addition, looking at the stacks in the dump, the majority are in the following stack:

OS Thread Id: 0x40 (109)
Child SP               IP Call Site
000000bbdf8bd9f8 000007fd958b315b [GCFrame: 000000bbdf8bd9f8] 
000000bbdf8bdac8 000007fd958b315b [HelperMethodFrame_1OBJ: 000000bbdf8bdac8] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
000000bbdf8bdbe0 000007fd8a8f0d6c System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken) 
000000bbdf8bdca0 000007fd8a8f078b System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) 
000000bbdf8bdd30 000007fd8b1bb9da System.Threading.Tasks.Task.InternalWait(Int32, System.Threading.CancellationToken) 
000000bbdf8bde30 000007fd8b1c14f6 System.Threading.Tasks.Task`1[[System.__Canon, mscorlib]].GetResultCore(Boolean)
000000bbdf8bde70 000007fd2db17606 CustomerApp.SubmitRequest[[System.__Canon, mscorlib]](System.__Canon, System.Net.Http.HttpClient, System.Uri, System.Net.Http.Formatting.MediaTypeFormatter)

Looking at the dump stacks, here we are synchronously waiting on a task to complete, which occupies a thread. We have offloaded the network call to a Task running on another thread but aren't gaining the benefits of doing so, as we are just waiting for it to come back. Because of the large (compared with running) number of idle threads, the suspicion is that our off-box calls are made using synchronous APIs, which would mean another thread blocking while doing nothing.

The jitter in the latency times is most likely caused by the need to create a new thread for all these tasks we are dispatching that don't have a thread to work on. The worker thread MinLimit is 4, and as our number of threads is greater than that, the ThreadPool will only give us a new thread once every 500 ms. Once enough threads have been created, the next request should run quicker, until you leave the threads idle long enough that they are cleaned up.

The basic operation of the thread pool is actually pretty easy to explain. The thread pool starts with 0 threads and immediately creates new threads to serve work requests until the number of running threads reaches a configurable minimum (by default, the number of CPUs on the machine).

Once the number of running threads is equal to or greater than that minimum, the thread pool creates new threads at a rate of 1 thread per 0.5 second. This means that if your application is running on a dual-core machine and 16 work requests that each span 10 seconds are scheduled together, assuming the thread pool is empty, the first two requests will be served immediately, the third after 0.5 seconds, the fourth after 1 second, and the 16th after 7 seconds. In addition, to avoid unnecessary starvation, a daemon thread runs in the background and periodically monitors the CPU; in the case of low CPU utilization it creates new threads as appropriate.

The thread pool will not create new threads after it reaches a configurable maximum. The default maximum is set to 250 * number of CPUs, which is 10 times more than it was in the 1.0 version of the .NET Framework. The default maximum was increased in order to reduce the chance of a possible deadlock that can occur when callbacks rely on each other in order to complete.

After a thread finishes its work it is not destroyed immediately; rather, it stays in the pool waiting for more work to arrive. Once new work arrives, it is served immediately by one of the waiting threads. Waiting threads are destroyed only after spending 10 seconds (it used to be 40 seconds) in the pool doing nothing.

clip_image002

So what is the lesson here? If you are using TPL, make sure you use TPL and asynchronous code all the way through the call chain. If you are calling Redis, for example, use the asynchronous Redis client, and so on. Otherwise you may sometimes get more consistent and stable results by just staying synchronous in applications that use the ASP.NET thread pool.

Hope this helps.

For more see – http://blogs.msdn.com/b/tmarq/archive/2010/04/14/performing-asynchronous-work-or-tasks-in-asp-net-applications.aspx, http://blog.stephencleary.com/2013/11/taskrun-etiquette-examples-dont-use.html