Manage Your Locks–New Managed Lock Priority Feature In SQL Server 2014

Imagine you have to perform a typical database management operation such as Online Index Rebuild or Partition Switch on a very busy OLTP database. Or you are about to deploy a change to a column on the table, yet you cannot take an outage completely. Your enemy here is something known as exclusive Schema modification lock – Sch-M lock during the DDL operation impacting the database workload running concurrently and using the affected table. In case of Online Index Rebuilds two locks are required, a table S-lock in the beginning of the DDL operation and a Sch-M lock at the end. In order to execute the DDL statement for OIR, all active blocking transactions running on a particular table must be completed. Conversely, when the DDL for OIR is executed first, this will block all new transactions that require locks on the affected table. Although the duration of the lock for OIR is very short, waiting for all open transactions on a given table to complete and blocking the new transactions to start, may significantly affect the throughput, causing workload slow down or timeout, and significantly limiting an access to the underlying table(s). This has an impact for 24X7 Mission Critical workloads that focus on maximizing throughput and availability with short or non-existent maintenance window.

SQL Server 2014 introduces another new feature – Managed Lock Priority that may at least provide some assistance here. Now the new parameters are added to ALTER INDEX\ALTER TABLE commands that may give us lock priority choices as we perform Online Index Rebuild and Partition Switch operations.  The syntax to add managed lock priority to ALTER INDEX\ALTER TABLE statement is added in SQL Server 2014 as:

low_priority_lock_wait::= { WAIT_AT_LOW_PRIORITY ( MAX_DURATION = [ MINUTES ] , ABORT_AFTER_WAIT = { NONE | SELF | BLOCKERS } ) }

Following choices are introduced here:

  • Kill All Blockers immediately or after a specified wait time  as specified by MAX_DURATION parameter and ABORT_AFTER_WAIT parameter.  That would also to give maintenance operations precedence over user transactions and kill blocking user transactions either immediately or after some wait in number of minutes.
  • Exit DDL after wait, therefore opposite of above give precedence to user transaction over maintenance DDL. In this case command we issue would be something like:

ALTER index myindex on mytable REBUILD WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY ( MAX_DURATION = 30 MINUTES, ABORT_AFTER_WAIT = SELF) ))

The idea behind the SQL Server 2014 solution is sacrificing the deterministic and predictability of a DDL command over having no impact on the user workload. 

More information is available in SQL BOL.

Advertisements

Save Our Souls- Troubleshooting Heap Corruption With DebugDiag, AppVerifier and Windows Debugger (WINDBG)–PART 2

In Part 1 of this post I went into basics of setting up a memory dump to troubleshoot heap corruption with App Verifier and DebugDiag tool.  Now I will finish by using Windows Debugger\WinDBG to take a quick look at these dumps to see a culprit that causes heap corruption issue to begin with. To protect the innocent I will rename module names in the application , as the point of this post isn’t to blame anyone, but to illustrate easy technique to troubleshoot heap corruption.

As stated the App Verifier hit breakpoint on invalid memory operation and DebugDiag captured a dump on that breakpoint. Next I will open this dump in Windows Debugger\WinDBG.

WinDbg is a debugger that wraps NTSD and KD with a better UI. It provides command-line options like starting minimized (-m), attach to a process by pid (-p) and auto-open crash files (-z)..

It supports three types of commands:

  • regular commands (e.g.: k). The regular commands are to debug processes.
  • dot commands (e.g.: .sympath). The dot commands are to control the debugger.
  • extension commands (e.g.: !handle) – these are custom commands that you can add to WinDbg; they are implemented as exported functions in extension DLLs.

PDB files are program database files generated by the linker. Private PDB files contain information about private and public symbols, source lines, types, locals and globals. Public PDB files do not contain types, local and source line information.

You need symbols in order to be able to do effective debugging. Symbol files could be in an older COFF format or the PDB format. PDBs are program database files and contain public symbols. These debuggers allow you to mention a list of URIs where they would look for symbols for loaded binaries.

OS symbols are usually installed in the %SYSTEMDIR%Symbols directory. Driver symbols (.DBG or .PDB files) are usually in the same folder as the driver (.sys file). Private symbol files contain information about functions, local and global variables, and line information to correlate assembly code to source code; symbol files that are usually made available to customers are public symbol files – these files contain information about public members only.

You can set symbol directories through File->Symbol File Path, or using .sympath from the WinDbg command window. To add reference to a symbol server on the web, add:

SRV*downstream_store*http://msdl.microsoft.com/download/symbols

to your .sympath, thus:

.sympath+ SRV*c:\tmp*http://msdl.microsoft.com/download/symbols

 

Where c:\tmp is the download_store where necessary symbols will be downloaded and stored. Note that this particular symbol server exposes public symbols only.

The debugger matches information like filename, timestamp and checksum when matching a PDB with a binary (DLL or exe). If you have symbol information, you’d be able to see function names and their arguments in your call stack. If the binaries and PDBs are from your application, you’d additionally have information about private functions, local variables and type information.

The sympath can consist of multiple URIs. Sympath is initialized from the _NT_SYMBOL_PATH system environment variable.

So lets open dump we received in debugger:and run very simple stack command on faulting\active thread:

 

kb

1870e42c 10003b68 0869a580 01151000 1870e6f8 ntdll!DbgBreakPoint

1870e634 100078c9 1000c540 0000000f 01151000 vrfcore!VerifierStopMessageEx+0x4d1 1870e658 7c878689 0000000f 7c8789a0 01151000 vrfcore!VfCoreRedirectedStopMessage+0x81 1870e6d4 7c878d0c 01151000 00000009 65cd2108 ntdll!RtlpDphReportCorruptedBlock+0x1cf 1870e738 7c879894 7f660020 00000000 01151000 ntdll!RtlpDphAddToDelayedFreeQueue+0x120 1870e75c 7c879ab4 01151000 01250000 01000002 ntdll!RtlpDphNormalHeapFree+0x73

1870e7b4 7c87c98b 01150000 01000002 7f660040 ntdll!RtlpDebugPageHeapFree+0x146

1870e81c 7c85567a 01150000 01000002 7f660040 ntdll!RtlDebugFreeHeap+0x2c

1870e8f4 7c83e448 01150000 01000002 7f660040 ntdll!RtlFreeHeapSlowly+0x37

1870e9d8 776bcf60 01150000 01000002 7f660040 ntdll!RtlFreeHeap+0x11a

1870e9ec 776bcf7c 7779777c 7f660040 fffffff6 ole32!CRetailMalloc_Free+0x1c

*** WARNING: Unable to verify checksum for OraOps10.dll

*** ERROR: Symbol file could not be found. Defaulted to export symbols for OraOps10.dll -

1870e9fc 07cd8c1f 7f660040 07cd7805 7f660040 ole32!CoTaskMemFree+0x13

1870ea04 07cd7805 7f660040 00000000 00000100 OraOps10!ssmem_free+0xf

1870ea20 07cd25b4 1274afd8 77def3d8 77def338 OraOps10+0x7805

Noted prohibited operation by Oracle.NET Provider (OraOps.dll) that reports heap corruption and hits Verifier breakpoint. Also note below that OraDRResultSet does PInvoke into unmanaged space from managed:

0:015> !clrstack

OS Thread Id: 0xfd0 (15)

Child SP IP Call Site

1870eaf0 7c81a229 [InlinedCallFrame: 1870eaf0] 1870ead0 0b4911b0 DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, IntPtr, IntPtr, IntPtr ByRef, Oracle.DataAccess.Client.OpoSqlValCtx*, Oracle.DataAccess.Client.OpoMetValCtx*, Oracle.DataAccess.Client.OpoDacValCtx*)

1870eaf0 0b490ea0 [InlinedCallFrame: 1870eaf0] Oracle.DataAccess.Client.OpsDac.Read(IntPtr, IntPtr, IntPtr, IntPtr ByRef, Oracle.DataAccess.Client.OpoSqlValCtx*, Oracle.DataAccess.Client.OpoMetValCtx*, Oracle.DataAccess.Client.OpoDacValCtx*) 1870eb48 0b490ea0 Oracle.DataAccess.Client.OracleDataReader.Read() 1

870eb78 0b499979 v5Oracle.OraDRResultSet.Fetch(Boolean)

Customer was using older version of Oracle .NET Provider and from what I understand may have run into the issue that was fixed already some time ago.

Save Our Souls–Troubleshooting Heap Corruption With DebugDiag, AppVerifier and Windows Debugger (WINDBG)–PART 1

There is probably nothing worse in this world for a developer than classic heap corruption.

The heap is a data structure, maintained by the compiler or OS’s runtime libraries, responsible for handling memory allocation (e.g. new, delete, malloc, …). Heap corruption occurs when the heap’s bookkeeping data–such as which parts of memory are allocated and which are free for new allocations–are corrupted. This usually occurs from incorrect use of memory allocation functions by applications. After heap corruption, undefined behavior occurs; the program may appear to work correctly to start with, but might fail on the next run, or when recompiled, or at any other time.

Memory corruption, in general, is one of the toughest issues to work with. For several reasons:

  • It is not immediate, for starting, to understand that a problem (endless loop, unexpected behavior, crash) is caused by a memory corruption.
  • Historically, user-mode processes with their own virtual address space and the separation of user-mode and kernel mode were meant to provide an isolated environment for code, so that bad code which, for example, could cause a memory corruption, was not able to adversely affect other code. On the other hand, the appearance of “host processes” like svchost.exe for services, dllhost.exe for COM+ applications and w3wp.exe for ASP.NET and Web Services, made again different components run in the same process. There are benefits to it, but the fact that different software shares a common address space means that, when a memory corruption occurs, the whole process is affected. Moreover, it may be difficult to determine which component is at fault.
  • The consequences of a memory corruption typically manifest themselves at a later time, when the corrupted area is read. At that time it is difficult, if not impossible, to backtrack to the source of the corruption.

In my nearly decade working with customers in the field I have only seen about dozen incidents of heap corruption. When I did, these were usually nightmare to troubleshoot and required long hours to reproduce and analyze. However, good news there are a couple of  really useful “patterns” for troubleshooting heap corruption that can just be followed in most cases. Here I will show one of these patterns that I used very recently troubleshooting issue for customer.

To debug heap corruption, you must identify both the code that allocated the memory involved and the code that deleted, released, or overwrote it. If the symptom appears immediately, you can often diagnose the problem by examining code near where the error occurred. Often, however, the symptom is delayed, sometimes for hours. In such cases, you must force a symptom to appear at a time and place where you can derive useful information from it.

A common way to do this is for you to command the operating system to insert a special suffix pattern into a small segment of extra memory and check that pattern when the memory is deleted. Another way is for the operating system to allocate extra memory after each allocation and mark it as Protected, which would cause the system to generate an access violation when it was accessed. This is commonly known as enabling pageheap.

First lets download and install DebugDiag diagnostic tool on affected machine. It can be downloaded here. DebugDiag is very useful tool that you should be well aware of. It is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or fragmentation, and crashes in any user-mode process. If you really want to find out more about different uses of DebugDiag and see how you can script\customize memory dump collection via that tool , see DebugDiag MSDN blog.

Next tool you will need is AppVerifier. Application Verifier assists developers in quickly finding subtle programming errors that can be extremely difficult to identify with normal application testing. Using Application Verifier in Visual Studio makes it easier to create reliable applications by identifying errors caused by heap corruption, incorrect handle and critical section usage. If you are using unmanaged code (VC++) you should use this tool in your testing. You can get more information on the tool on MSDN. You can get tool here.  Again you will download and install that tool on affected machine.

Once you got the tools, lets setup our data collection. First we will setup rules in AppVerifier, with that we will enable pageheap on affected application\exe.

  • Start Application Verifier (Start –> Programs –> Application Verifier –> Application Verifier).
  • Click File –> Add Application and browse affected executable
  • In the Tests Panel, expand Basics checkbox and uncheck all except Heaps, on the picture below w3wp is selected in your case it will be your affected executable

appverifier1

 

 

 

 

 

  • In the Tests Panel again, select Heaps checkbox and click Edit –> Verifier Stop Options

verifier2

This basically shows the stop codes that application verifier generates. The defaults actions are for all stop codes. The most important action here is  the “Breakpoint” in the Error Reporting section which means that Application Verifier will call into the breakpoint exception when it detects that the heap is being corrupted.

Next we need to setup DebugDiag to capture a memory dump on such breakpoint.

  • Start DebugDiag (Start –> Programs –> Debug Diagnostic Tool 1.2 –> Debugdiag 1.2
  • Add a crash rule against a specific process
  • Type in name of the process you setup AppVerifier to monitor  in the “Select Target” window and make sure the “This process instance only” check box is unchecked

debugdiag1

  • · In the “Advanced Configuration (Optional)” window, click Exceptions… and add 80000003 exception with an action type of Full Userdump.
  • Finish the wizard and Activate the rule
  • Start new executable session with executable affected to make sure it loads both pageheap layer and application verifier dlls.

So basically, the above configuration will make application verifier calls into the breakpoint exception when it detects that a heap operation is corrupting the heap. When the breakpoint exception is called, debugdiag will generate a full userdump. Post-mortem  analysis of the userdump will give details about the corruption such as the call stack, Above is somewhat very performance impactful and guaranteed to produce number of dumps.

Customer has followed above and sure thing next few memory dumps were travelling my way. In the next part I will show you how to take these post mortem memory dumps and find a culprit that is corrupting customer’s heap via Windows Debugger (WINDBG).

Trade Durability For Performance Boost–SQL Server 2014 Delayed Durability Feature

Another interesting feature that made it into SQL Server 2014 and into my bag of performance tuning tricks is delayed durability. What is delayed durability and why we may use it?

Traditionally every transaction in RDBMS should follow ACID principles. In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction. SQL Server guarantees transactional  ACID by implementing Write Ahead Transaction Logging, which is explained in SQL BOL here. Essentially, idea is that when a transaction is committed, SQL Server performs a physical write to disk in the transaction log. The commit command does not return until that write is successful.  That way data is guaranteed not to be lost unbeknown to front end system even in case of RAM or power failure.

However that approach has its own drawbacks when it comes to performance. Imagine an application that has huge number od small concurrent transactions and its transaction log write overhead.  I used to call such applications that I frequently had to tune against all odds as “death by thousand cuts”.  Yes, you can ask for faster storage, SSDs, application to be optimized by changing “chattier” transactions to “chunkier” transactions, but with delayed durability we have another option as well.

Essentially in the system where you have a significant bottleneck on transaction log and can sustain moderate chance of possible data loss you now have the option to temporarily suspend the requirement to wait for the log flush and acknowledgement via this new feature. With Delayed Durability transactions are logged to transaction log buffers in memory and control is returned to the application without writing to directly to disk.  The log buffers are flushed and written to disk later. According BOL such flush would happen when following will occur:

  • The in-memory transaction log buffer fills up.
  • sp_flush_log is executed
  • A “normal” fully durable transaction makes any change in database and commits
  • SQL Server Shutdown request

You can set this option at the database level, at the individual transaction level, or – in the case of natively compiled procedures in In-Memory OLTP – at the procedure level. The database level setting overrides lower level settings; if the database is set to disabled, trying to commit a transaction using the delayed option will simply be ignored, with no error message. Some transactions are set to be always durable and option will be ignored – system transactions, cross-database transactions, distributed transactions controlled by MSDTC, operations involving CDC, change tracking or File Table.

At the database level you can use ALTER DATABASE statement to set this option:

ALTER DATABASE dbname SET DELAYED_DURABILITY = DISABLED | ALLOWED | FORCED;

Essentially with three options here:

  • ALLOWED – Individual transaction can use Delayed Durability on transaction level
  • FORCED – Means that all transactions that can use Delayed Durability will do so.
  • DISABLED – Disables this feature, default.

You can also set these options in database properties screen of SSMS:

dd

Once you set Delayed Durability to allowed on database level you can use following to use it on transaction level:

 COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);

With In-Memory OLTP memory optimized procedures (see my previous post) you can use Delayed Durability in procedure like:

BEGIN ATOMIC WITH (DELAYED_DURABILITY = ON, ...)

Well, enough theory. Lets see performance difference inserting 100,000 rows in a loop with Delayed Durability vs. traditional durable way:

Lets first create a test database and set Delayed Durability to allowed on database level:

use master

go

if exists (select 1 from sys.databases where name = 'my_dd_test')

begin

DROP DATABASE my_dd_test

end

go

CREATE DATABASE my_dd_test

go

use my_dd_test

go

ALTER DATABASE my_dd_test SET delayed_durability = ALLOWED

go

Now lets create table to insert data into:

CREATE TABLE dbo.InsertTest (ID int primary key clustered, VALUE varchar(255))

Lets create “traditional” insert procedure into this table:

CREATE PROCEDURE TestInsertTraditional @c1 int, @c2 varchar(255) as

begin tran

insert into dbo.InsertTest values (@c1, @c2)

commit

Lets create Delayed Durability version of procedure:

create procedure TestInsertDelayed @c1 int, @c2 varchar(255) as

begin tran

insert into dbo.InsertTest values (@c1, @c2)

commit with (DELAYED_DURABILITY=ON)

go

Now lets commence with the test. First we will insert 100,000 rows into our table with full durability via traditional insert…commit:

declare @cnt int = 1 , @maxcnt int = 100000, @start datetime

truncate table dbo.InsertTest

select @start = getdate()

while @cnt<@maxcnt

begin

exec TestInsertTraditional @cnt, 'This is a test';

select @cnt = @cnt + 1;

end

select datediff(s, @start, getdate()) truncate table dbo.InsertTest

This took 16 seconds on my laptop:

time1

Now lets do the same with Delayed Durability in action:

set nocount on

declare @cnt int = 1 , @maxcnt int = 100000, @start datetime

set @cnt = 1

select @start = getdate()

while @cnt<@maxcnt

begin exec [dbo].[TestInsertDelayed] @cnt, 'This is a test';

select @cnt = @cnt + 1;

end

select datediff(s, @start, getdate())

truncate table dbo.InsertTest

exec sp_flush_log

This only took three seconds on my laptop:

time2

Obviously great performance boost for certain workloads as I stated above. However, this isnt a feature to be taken lightly and turned on by default every instance of SQL Server you manage – after all it comes with a cost of possible data loss. Under a well-performing disk, the maximum you should expect to lose in a catastrophe – or even a planned and graceful shutdown – is up to one full block (60KB). However, in the case where your I/O subsystem can’t keep up, it is possible that you could lose as much as the entire log buffer (~7MB).  So although this another great feature and very useful in some circumstances please consider above before you turn it on and use it. Its analogous to traction control on high-performance sports car, its frequently turned off while racing to get extra performance boost, yet it will not pull you out of trouble on wet roadway corner while you going 120 mph.