Forecast Cloudy – Troubleshooting and Debugging Azure PaaS Roles -Architecture And Built-In Logging

First of all these posts would be pretty much impossible without MSDN blogs posts of Kevin Williamson and awesome work done by my colleague Jeff Van Nortwick running Windows Debugger\WinDBG in Azure. They both moved me to do something to understand this topic deeper and huge thanks to both for information they shared with me.

So for a while all of us treated Azure PaaS as “black box development platform” and I am fairly sure in some ways it was intended to be so when introduced. Yet, that went completely against my idea that in order to develop well on something you have to understand its internals – after all I like to think of myself as sort of DevOps character, residing somewhere between development and infrastructure with good dose of DBA thrown in. Its against my nature to state that “something doesn’t work well and there is nothing we can do about it” , instead my favorite activity is to collect and pour through data and understand what is the cause – finding virtual “needle in the haystack”. Therefore I was really excited when someone pointed me to Kevin Williamson blog.

What is great there is awesome explanation of Azure architecture and not in some big cubes in Visio that some architects draw with no details, but in terms of processes spawning more processes.  So how does Azure actually work:

High level like this:

image

Azure considers each rack a ‘node’ of compute power and puts a switch on top of it. Each node — servers+top rack switch — is considered a ‘fault domain’ , i.e., a possible point of failure. An aggregator and load balancers manage groups of nodes, and all feed back to the Fabric Controller (FC), the operational heart of Azure.

The FC gets it’s marching orders from the “Red Dog Front End” (RDFE). RDFE takes its name from nomenclature left over from Dave Cutler’s original Red Dog project that became Azure. The RDFE acts as kind of router for request and traffic to and from the load balancers and Fabric Controller.

The Fabric Controller does all the heavy lifting for Azure. It provisions, stores, delivers, monitors and commands the virtual machines (VMs) that make up Azure. It is a “distributed stateful application distributed across data center nodes and fault domains.”

In English, this means there are a number of Fabric Controller instances running in various racks. One is elected to act as the primary controller. If it fails, another picks up the slack. If the entire FC fails, all of the operations it started, including the nodes, keep running, albeit without much governance until it comes back online. If you start a service on Azure, the FC can fall over entirely and your service is not shut down.

The Fabric Controller automates pretty much everything, including new hardware installs. New blades are configured for PXE and the FC has a PXE boot server in it. It boots a ‘maintenance image,’ which downloads a host operating system (OS) that includes all the parts necessary to make it an Azure host machine. Sysprep is run, the system is rebooted as a unique machine and the FC sucks it into the fold.

image

This is explained very well by Mark Russinovich in his various very good talks – http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/WAD-B402#fbid=, http://azure.microsoft.com/en-us/documentation/videos/mark-russinovich-windows-on-azure/ , http://blogs.technet.com/b/markrussinovich/archive/2012/08/22/3515679.aspx

However, what I was interested more in is particular Azure Role architecture and here is where Kevin’s information was really helpful.  Kevin’s diagram is large and complex as you can see below, so lets dig into into main parts of it:

image

  • RDFE is still main communication path from you (user) to the fabric. RDFE APIs are used by portal, Visual Studio, SDKs, etc. Therefore all requests from user to fabric go through RDFE (Red Dog Front End).  FFE (Fabric Front End) is the layer which translates requests from RDFE into the fabric commands. All requests from RDFE go through the FFE to reach the fabric controllers
  • I also already mentioned Fabric Controller above. Essentially it’s a watchdog responsible for monitoring and administering all of the VMs\Resources in data center.
  • But here is new item – Host Agent. As the name implies and it only makes sense in virtualized environment this lives on the Host and communicates to Guest agents (there are now I understand 2 of them). Here comes the idea of “heartbeat” that is well known to many from Windows Clustering for example, Host Agent receives “heartbeat” from the Guest and if Guest doesn’t respond for 10 minutes its OS is restarted.
  • Guest agents – WaAppAgent and WindowsAzureGuestAgent. Now we are on guest\role itself. WindowsAzureGuestAgent is responsible for Guest OS configuration, communicating status\heartbeat to the Host, setting up the SID for the user account which the role will run under Interestingly enough WaAppAgent is “overseer” of WindowsAzureGuestAgent  as it installs, configures, updates WindowsAzureGuestAgent.exe itself.
  • WindowsAzureGuestAgent spins up another process – WaHostBootstrapper. That process is responsible for the role configuration and starting up all of the appropriate tasks and processes to configure and run the role, monitoring its own child processes, raising StatusCheck event on role Host process.
  • Startup Tasks are defined by the role model and started by WaHostBootstrapper. Startup tasks can be configured to run in the Background asynchronously and the host bootstrapper will start the startup task and then continue on to other startup tasks. Startup tasks can also be configured to run in Simple (default) mode where the host bootstrapper will wait for the startup task to finish running and return with a success (0) exit code before continuing on to the next startup task. When expanded into startup tasks the DiagnosticsAgent and RemoteAccessAgent are unique in that they define 2 startup tasks each, one regular and one with a /blockStartup parameter. The normal startup task is defined as a Background startup task so that it can run in the background while the role itself is running. The /blockStartup startup task is defined as a Simple startup task so that WaHostBootstrapper will wait for it to exit before continuing. The /blockStartup task simply waits for the regular task to finish initializing and then it will exit and allow the host bootstrapper to continue. The reason this is done is so that diagnostics and RDP access can be configured prior to the role processes starting up (this is done via the /blockStartup task), and that diagnostics and RDP access can continue running after the host bootstrapper has finished with startup tasks (this is done via the normal task).
  • Now what is most important for troubleshooting custom code on the PaaS roles – Worker Processes. WaWorkerHost is the standard host process for normal worker roles. This host process will host all of the role’s DLLs and entry point code such as OnStart and Run. WaWebHost is the standard host process for web roles when they are configured to use the SDK 1.2 compatible Hostable Web Core (HWC).WaIISHost is the host process for role entry point code for web roles using Full IIS. This process will load the first DLL found which implements the RoleEntryPoint class (this DLL is defined in E:\__entrypoint.txt) and execute the code from this class (OnStart, Run, OnStop). Any RoleEnvironment events (ie. StatusCheck, Changed, etc) created in the RoleEntryPoint class will be raised in this process. W3WP is the standard IIS worker process which will be used when the role is configured to use Full IIS. This will run the AppPool configured from IISConfigurator

So that’s definitely a lot of technical theory courtesy of Kevin. What’s more interesting talks about logs, yes logs available to you and me to troubleshoot common Azure roie issues with Wa*.logs and event logs.  So now for a bit of practice – let get to these logs and look at them.

First thing before we examine logs we need to a some sort of Azure PaaS role deployed, either web or worker role. Lets build a test little worker role and deploy it to Azure.

So I created a little basic Azure worker role, named it testservicegennadyk and deployed it to Azure. In order to get to logs I mention above I will need to RDP to my role, therefore when I publish my role I have to enable RDP. I can do that in Publish settings for my Role in Visual Studio:

image

Now I can go to Azure Portal and see\connect to my role in the portal

image

Now I can see following :

Windows Azure Event Logs via Event Viewer –> Applications and Services Logs –> Windows Azure that contains important diagnostic output from the Windows Azure Runtime, including information such as Role starts/stops, startup tasks, OnStart start and stop, OnRun start, crashes, recycles, etc

image

 

Application Event Logs via Event Viewer –> Windows Logs –> Application. Standard troubleshooting source for any application development.

App Engine Runtime Log – in C:\Logs\AppEngineRuntime.log. These logs are written by WindowsAzureGuestAgent.exe and contain information about events happening within the guest agent and the VM.  This includes information such as firewall configuration, role state changes, recycles, reboots, health status changes, role stops/starts, certificate configuration, etc.

image

App Agent Heartbeat Logs – In C:\Logs\WaAppAgent.log. These logs are written by WindowsAzureGuestAgent.exe and contain information about the status of the health probes to the host bootstrapper. These logs are typically useful for determining what is the current state of the role within the VM, as well as determining what the state was at some time in the past. You can definitely see great information on Heartbeat Pings in that log like:

[00000004] [01/29/2015 19:36:01.91] [HEART] WindowsAzureGuestAgent Heartbeat.
[00000004] [01/29/2015 19:36:01.91] [INFO]  Role a95af93b891f4451ac81ac670af28d5f.WorkerRole1_IN_0 is reporting state Ready.
[00000009] [01/29/2015 19:36:04.49] [INFO]  Role a95af93b891f4451ac81ac670af28d5f.WorkerRole1_IN_0 has current state Started, desired state Started, and goal state execution status StartSucceeded.

Host Bootstrapper Logs – in C:\Resources\WaHostBootstrapper.log. This log contains entries for startup tasks (including plugins such as Caching or RDP) and health probes to the host process running your role entrypoint code (ie. WebRole.cs code running in WaIISHost.exe). A new log file is generated each time the host bootstrapper is restarted (ie. each time your role is recycled due to a crash, recycle, VM restart, upgrade, etc) which makes these logs easy to use to determine how often or when your role recycled.

If this was a web role I would be able able to get to IIS logs and HTTP.SYS logs, but with worker role sample things will be limited.  IIS logs would be in C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\LogFiles\Web and HHTP.SYS in D:\WIndows\System32\LogFiles\HTTPERR

Hope this was helpful. In the the next post I plan to actually install WinDBG on my sample role and attempt to debug it.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s