There is probably nothing worse in this world for a developer than classic heap corruption.
The heap is a data structure, maintained by the compiler’s or the OS’s runtime libraries, that handles memory allocation (e.g. new, delete, malloc, …). Heap corruption occurs when the heap’s bookkeeping data (such as which parts of memory are allocated and which are free for new allocations) is corrupted. This usually results from incorrect use of memory allocation functions by applications. After heap corruption, behavior is undefined; the program may appear to work correctly to start with, but might fail on the next run, or when recompiled, or at any other time.
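To make "bookkeeping data" concrete, here is a minimal sketch of a toy heap that lives inside a byte array. It is not how any real heap manager is implemented; the names `ToyHeap`, `BlockHeader`, `kHeaderMagic` and `demo_overrun` are made up for illustration. Each block is preceded by a small header, and a write that runs past the end of one block lands on the header of its neighbour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model only: real heap managers are far more elaborate.
// Every block is laid out as [BlockHeader][payload bytes].
struct BlockHeader {
    std::uint32_t magic;  // marker used to sanity-check the header
    std::uint32_t size;   // payload size in bytes
};

constexpr std::uint32_t kHeaderMagic = 0xC0FFEE01u;          // hypothetical value
constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

class ToyHeap {
public:
    explicit ToyHeap(std::size_t bytes) : arena_(bytes, 0) {}

    // Carves a block off the end of the used region; returns the payload
    // offset, or kNoSpace when the arena is full.
    std::size_t allocate(std::uint32_t size) {
        if (sizeof(BlockHeader) + size > arena_.size() - used_) return kNoSpace;
        const BlockHeader header{kHeaderMagic, size};
        std::memcpy(arena_.data() + used_, &header, sizeof header);
        const std::size_t payload = used_ + sizeof header;
        used_ = payload + size;
        return payload;
    }

    // Deliberately does NOT check the block size, like memcpy into a
    // malloc'd buffer. It only asserts that the write stays inside the
    // arena, so the demo never touches memory it does not own.
    void write(std::size_t payload, const void* src, std::size_t n) {
        assert(payload <= arena_.size() && n <= arena_.size() - payload);
        std::memcpy(arena_.data() + payload, src, n);
    }

    // Walks the headers the way a heap validator would. Returns the index
    // of the first block whose header is damaged, or -1 if all look sane.
    int first_corrupt_block() const {
        std::size_t offset = 0;
        for (int index = 0; offset < used_; ++index) {
            BlockHeader header;
            std::memcpy(&header, arena_.data() + offset, sizeof header);
            if (header.magic != kHeaderMagic) return index;
            offset += sizeof header + header.size;
        }
        return -1;
    }

private:
    std::vector<unsigned char> arena_;
    std::size_t used_ = 0;
};

// Writes 12 bytes into an 8-byte block and reports what a later walk sees.
int demo_overrun() {
    ToyHeap heap(64);
    const std::size_t first = heap.allocate(8);
    heap.allocate(8);                      // neighbour whose header gets hit
    const char data[12] = "overflowing";   // 11 characters plus the NUL
    heap.write(first, data, sizeof data);  // 4 bytes spill into the next header
    return heap.first_corrupt_block();
}
```

In this sketch `demo_overrun()` returns 1: the damage is found in the second block, although the faulty write targeted the first one, and nothing fails at the moment of the write itself. The walk points at the victim, not the culprit.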
Memory corruption, in general, is one of the toughest issues to work with, for several reasons:
- To begin with, it is not obvious that a problem (endless loop, unexpected behavior, crash) is caused by memory corruption.
- Historically, user-mode processes with their own virtual address space, and the separation of user mode and kernel mode, were meant to provide an isolated environment for code, so that bad code which, for example, caused memory corruption could not adversely affect other code. On the other hand, the appearance of “host processes” like svchost.exe for services, dllhost.exe for COM+ applications and w3wp.exe for ASP.NET and Web Services once again put different components into the same process. There are benefits to this, but the fact that different software shares a common address space means that, when memory corruption occurs, the whole process is affected. Moreover, it may be difficult to determine which component is at fault.
- The consequences of a memory corruption typically manifest themselves at a later time, when the corrupted area is read. At that time it is difficult, if not impossible, to backtrack to the source of the corruption.
In my nearly a decade of working with customers in the field I have only seen about a dozen incidents of heap corruption. When I did, they were usually a nightmare to troubleshoot and required long hours to reproduce and analyze. The good news, however, is that there are a couple of really useful “patterns” for troubleshooting heap corruption that can simply be followed in most cases. Here I will show one of these patterns, which I used very recently while troubleshooting an issue for a customer.
To debug heap corruption, you must identify both the code that allocated the memory involved and the code that deleted, released, or overwrote it. If the symptom appears immediately, you can often diagnose the problem by examining code near where the error occurred. Often, however, the symptom is delayed, sometimes for hours. In such cases, you must force a symptom to appear at a time and place where you can derive useful information from it.
A common way to do this is to have the operating system append a special suffix pattern in a small segment of extra memory after each allocation and check that pattern when the memory is deleted. Another way is to have the operating system allocate extra memory after each allocation and mark it as protected (no access), so that the system generates an access violation as soon as that memory is touched. This is commonly known as enabling pageheap.
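The suffix-pattern idea can be sketched in a few lines. This is not the operating system's implementation, only a minimal illustration layered on top of malloc; the names `guarded_alloc`, `guarded_free`, `kGuardBytes` and `kGuardFill`, and the fill value itself, are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical guard parameters chosen for this sketch.
constexpr std::size_t kGuardBytes = 8;
constexpr unsigned char kGuardFill = 0xAB;

// Stored in front of the user buffer so the free routine knows where the
// suffix starts. Over-aligned so the user pointer keeps malloc's alignment.
struct alignas(std::max_align_t) GuardHeader {
    std::size_t size;
};

// Layout of one allocation: [GuardHeader][user bytes][suffix pattern].
void* guarded_alloc(std::size_t size) {
    auto* raw = static_cast<unsigned char*>(
        std::malloc(sizeof(GuardHeader) + size + kGuardBytes));
    if (raw == nullptr) return nullptr;
    const GuardHeader header{size};
    std::memcpy(raw, &header, sizeof header);
    std::memset(raw + sizeof header + size, kGuardFill, kGuardBytes);
    return raw + sizeof header;
}

// Checks the suffix, then releases the block. Returns true when the suffix
// is intact and false when something wrote past the end of the buffer.
bool guarded_free(void* p) {
    if (p == nullptr) return true;
    auto* user = static_cast<unsigned char*>(p);
    unsigned char* raw = user - sizeof(GuardHeader);
    GuardHeader header;
    std::memcpy(&header, raw, sizeof header);
    bool intact = true;
    for (std::size_t i = 0; i < kGuardBytes; ++i) {
        if (user[header.size + i] != kGuardFill) intact = false;
    }
    std::free(raw);
    return intact;
}

// Copies 5 bytes into a 4-byte buffer; the extra byte lands in the suffix.
bool demo_overrun_detected() {
    char* buffer = static_cast<char*>(guarded_alloc(4));
    if (buffer == nullptr) return false;
    std::memcpy(buffer, "abcd", 5);  // the terminating NUL overruns by one
    return !guarded_free(buffer);
}
```

Note the trade-off this sketch shares with the real technique: the overrun is only noticed when the block is freed, which can be long after the faulty write. The protected-page variant catches the write at the moment it happens, at a much higher cost in memory and speed.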
First, let’s download and install the DebugDiag diagnostic tool on the affected machine. It can be downloaded here. DebugDiag is a very useful tool that you should be well aware of. It is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or fragmentation, and crashes in any user-mode process. If you want to find out more about the different uses of DebugDiag and see how you can script or customize memory dump collection with it, see the DebugDiag MSDN blog.
The next tool you will need is AppVerifier. Application Verifier assists developers in quickly finding subtle programming errors that can be extremely difficult to identify with normal application testing. Using Application Verifier in Visual Studio makes it easier to create reliable applications by identifying errors caused by heap corruption and by incorrect handle and critical section usage. If you are writing unmanaged code (VC++), you should use this tool in your testing. You can get more information on the tool on MSDN. You can get the tool here. Again, download and install it on the affected machine.
Once you have the tools, let’s set up our data collection. First we will set up rules in AppVerifier, which enables pageheap for the affected application (exe).
- Start Application Verifier (Start –> Programs –> Application Verifier –> Application Verifier).
- Click File –> Add Application and browse to the affected executable
- In the Tests panel, expand the Basics checkbox and uncheck everything except Heaps. In the picture below w3wp is selected; in your case it will be your affected executable
- In the Tests panel again, select the Heaps checkbox and click Edit –> Verifier Stop Options
This shows the stop codes that Application Verifier generates. The default actions apply to all stop codes. The most important action here is “Breakpoint” in the Error Reporting section, which means that Application Verifier will raise a breakpoint exception when it detects that the heap is being corrupted.
Next we need to set up DebugDiag to capture a memory dump on that breakpoint.
- Start DebugDiag (Start –> Programs –> Debug Diagnostic Tool 1.2 –> DebugDiag 1.2)
- Add a crash rule against a specific process
- Type in the name of the process you set up AppVerifier to monitor in the “Select Target” window and make sure the “This process instance only” check box is unchecked
- In the “Advanced Configuration (Optional)” window, click Exceptions… and add the 80000003 (breakpoint) exception with an action type of Full Userdump.
- Finish the wizard and Activate the rule
- Start a new session of the affected executable to make sure it loads both the pageheap layer and the Application Verifier DLLs.
With the above configuration, Application Verifier raises a breakpoint exception when it detects that a heap operation is corrupting the heap. When the breakpoint exception is raised, DebugDiag generates a full userdump. Post-mortem analysis of the userdump will give details about the corruption, such as the call stack. Be aware that this setup has a significant performance impact and is practically guaranteed to produce a number of dumps.
The customer followed the steps above and, sure enough, the next few memory dumps were soon travelling my way. In the next part I will show you how to take these post-mortem memory dumps and find the culprit that is corrupting the customer’s heap using the Windows Debugger (WinDbg).