Trouble in Distributed Cache Land – Windows AppFabric Cache Timeouts

Recently I had to help some customers troubleshoot periodic performance degradation and timeouts in Windows AppFabric. Example errors these customers would see were:

ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.

ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure.

In my earlier post I talked about troubleshooting and monitoring an AppFabric Cache cluster. Here I will add an obscure client configuration setting that may be useful in resolving these timeouts.

Client-to-server network contention is a well-known possibility. Here the netstat utility can be very useful. We can start with:

netstat -a -n

Knowing that the default cache port is 22233, you can narrow the output with further filters (replace 127.0.0.1 with the address of your cache host):

netstat -a -n | find "127.0.0.1:22233" | find /C "TIME_WAIT"

That gives you a count of the connections involving port 22233 that are in the TIME_WAIT state. A large number means port or network contention: the client is trying to establish many connections, yet something blocks it from establishing them. To fix client-to-server connection contention you can modify the client configuration, specifically that obscure maxConnectionsToServer parameter, which is 1 by default.

<dataCacheClient requestTimeout="15000" channelOpenTimeout="3000" maxConnectionsToServer="5"…>

If you are looking at a high-throughput scenario, increasing this value beyond 1 is recommended. Be aware, though, that the connection count multiplies: with 5 cache servers in the cluster, an application that uses 3 DataCacheFactory instances and maxConnectionsToServer=3 opens 9 outbound TCP connections from each client machine to each cache server, 45 in total across all cache servers. Based on that you may wish to increase the value, but do so carefully, because a higher value means more TCP connections and therefore more overhead. In general, with a singleton DataCacheFactory (as we recommend), I have seen pretty good results from a modest increase.
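The connection arithmetic above is easy to check before changing any configuration. Here is a minimal Python sketch of it; the function name and parameters are my own illustration, not part of any AppFabric tooling.

```python
def outbound_connections(cache_servers, factories, max_connections_to_server):
    """Return (connections per cache server, total connections) for ONE client machine.

    Each DataCacheFactory is assumed to open up to max_connections_to_server
    connections to every cache server in the cluster.
    """
    per_server = factories * max_connections_to_server
    total = per_server * cache_servers
    return per_server, total


# The example from the text: 5 cache servers, 3 factories, maxConnectionsToServer=3
print(outbound_connections(cache_servers=5, factories=3, max_connections_to_server=3))

# The same cluster with a singleton DataCacheFactory
print(outbound_connections(cache_servers=5, factories=1, max_connections_to_server=3))
```

The first call gives 9 per server and 45 in total, matching the numbers above; the singleton case drops to 3 and 15, which is part of why a single shared factory leaves more headroom for raising the setting.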

Hope this helps.

Distributed Caching In The Wild – Windows AppFabric Cache Part 2

In Part 1 of this post I went through some of the basics of Windows AppFabric Cache and its usage. In this part I wish to cover some of the management and monitoring of your AppFabric Cache cluster, as well as some best practices and gotchas that I learned in the last few years with this product.

You can monitor the health of your AppFabric cluster in multiple ways:

  • Windows Performance Log (Perfmon) counters. There are three counter categories related to the caching features of AppFabric: AppFabric Caching:Cache, AppFabric Caching:Host and AppFabric Caching:Secondary Host. These let you monitor the performance and health of your installation at multiple levels, from the logical cache collection running across multiple cluster hosts, to each individual host, and finally to the secondaries if HA is turned on. The entire counter list can be found at http://msdn.microsoft.com/en-us/library/ff637725.aspx. Some of the most useful counters are below:

       [Image: most useful AppFabric Caching performance counters]

  • Monitoring and management with PowerShell. PowerShell is the main vehicle for managing Windows AppFabric. Two PowerShell modules are installed with AppFabric Cache: DistributedCacheAdministration and DistributedCacheConfiguration. You can load them with the Import-Module command. Altogether there are 41 cmdlets available for AppFabric in these modules. Very detailed information is on MSDN at http://msdn.microsoft.com/en-us/library/ff718177(v=azure.10).aspx. The most commonly used are below:

    [Image: most commonly used AppFabric PowerShell cmdlets]

  • Logging. Windows AppFabric Cache can trace events on both the cache client and the cache host. These events are captured by enabling log sinks in the client's application configuration file or in the cache host's DistributedCache.exe.config file. By default, both the cache client and the cache host have log sinks enabled. If you do not specify any log sink configuration, each cache client automatically creates a console-based log sink with the event trace level set to Error. To override this default, you can explicitly configure log sinks. There are three types of log sink: console, file and ETW. For the cache host, you can use the log element in the dataCacheConfig element to change the default file-based log sink's event trace level and the location where it writes the log file. When writing to a folder outside the default location, make sure that the folder has been created and that the application has been granted write permissions. Otherwise, your application will throw exceptions when you enable the file-based log sink.

Eviction and Throttling. A Windows Server AppFabric cache cluster uses expiration and eviction to control the amount of memory that the Caching Service uses on a cache host. Expiration based on the cache TTL (Time To Live) is normal, but you may also notice eviction, where objects are removed before their TTL expires. From an application perspective, eviction means that items that would otherwise be in the cache are not found. The application must then repopulate those items, which can hurt application performance. There are two reasons for eviction:

  • The available physical memory on the server is critically low.
  • The Caching Service's memory usage exceeds the high watermark for the cache host.

Symptoms of eviction include events 118 and 115 in the Operational log for AppFabric.

When the memory consumption of the cache service on a cache server exceeds the low watermark threshold, AppFabric starts evicting objects that have already expired. When memory consumption exceeds the high watermark threshold, objects are evicted from memory regardless of whether they have expired, until memory consumption goes back down to the low watermark. Subsequently cached objects may be rerouted to other hosts to maintain an optimal distribution of memory. The default low and high watermarks are 80% and 90% of the reserved size respectively; you can use the Get-CacheHostConfig cmdlet to see the reserved memory size (Size parameter) and the watermarks. The eviction performance counters in the AppFabric Caching:Host object are useful for watching for incidents of eviction and throttling.
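To make the two-watermark behavior concrete, here is a small Python model of the rule described above. It is only a sketch of the logic as I described it, not AppFabric's actual implementation: the function name and the mode labels are invented for illustration, and the 80/90 thresholds are simply the figures quoted in this post, so check the values on your own hosts.

```python
def next_eviction_mode(used_pct, low_pct, high_pct, evicting_all=False):
    """Illustrative model of watermark-driven eviction (not AppFabric code).

    used_pct     -- memory in use as a percentage of the host's reserved size
    evicting_all -- True if the host is already evicting unexpired objects
    """
    if used_pct > high_pct:
        # Above the high watermark: evict objects whether expired or not.
        return "evict-all"
    if used_pct > low_pct:
        # Between the watermarks: aggressive eviction, once started, keeps
        # going until usage is back down to the low watermark.
        return "evict-all" if evicting_all else "evict-expired"
    return "none"


LOW, HIGH = 80, 90  # figures quoted in this post; verify against your host config

print(next_eviction_mode(75, LOW, HIGH))                     # below both watermarks
print(next_eviction_mode(85, LOW, HIGH))                     # expired objects only
print(next_eviction_mode(95, LOW, HIGH))                     # high watermark crossed
print(next_eviction_mode(85, LOW, HIGH, evicting_all=True))  # still draining to low
```

The last call is the case that tends to surprise people: once the high watermark has been crossed, unexpired objects keep disappearing even after usage dips back under it.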

Throttling is an event that goes above and beyond eviction in severity. When available physical memory becomes really low and eviction does not free it quickly enough, throttling kicks in. The cache cluster will not write data to any cache that resides on a throttled cache host until the available physical memory increases enough to resolve the throttled state. The most obvious symptom of throttling comes from applications: attempts to write to the cache generate DataCacheException errors (http://msdn.microsoft.com/en-us/library/ff921032.aspx). More on eviction and throttling: http://msdn.microsoft.com/en-us/library/ff921021.aspx, http://msdn.microsoft.com/en-us/library/ff921030.aspx.

Finally some thoughts from experience:

  • Lead Hosts. When the leadHostManagement and leadHost settings are true, the cache host is elevated to a level of increased responsibility in the cluster and designated as a lead host. In addition to a normal cache host's caching operations, the lead host works with the other lead hosts to manage cluster operations. When lead hosts perform the cluster management role, the entire cache cluster shuts down if a majority of lead hosts fail. Alternatively, if you use SQL Server for cluster management with no lead hosts, the cluster shuts down if SQL Server fails. There is some additional overhead involved in lead host communication, so it is good practice to run your cluster with as few lead hosts as necessary to maintain a quorum of leads. For small clusters of one to three nodes, it is acceptable to run all nodes as lead hosts, as the additional overhead generated by such a small group will be relatively low. For larger clusters, however, it is recommended to use a smaller percentage of lead hosts, for example 5 lead hosts for a 10-node cluster. You have to balance the added security and stability of more lead hosts against the overhead those hosts incur from the cluster management role.

  • It is also recommended not to allocate more than 16GB for the AppFabric server (and a corresponding 8GB for the cache host configuration). If the sizes are larger than 16GB/8GB, garbage collection could take long enough to cause a noticeable interruption for clients. A common recommendation is to spec AppFabric servers with 16GB of physical RAM and set the cache host size to 7GB. With this arrangement, you can expect about 14GB to be used by the AppFabric process, leaving 2GB for other server processes on the host.

  • The default client setting of maxConnectionsToServer=1 will work in many situations. When there is a single shared DataCacheFactory and a lot of threads are posting on that connection, there may be a need to increase it. So if you are looking at a high-throughput scenario, increasing this value beyond 1 is recommended. In general you will be looking at a very modest increase, based on the number of hosts in your cluster.

  • You should have an odd number of lead hosts (3, 5, 7 and so on) when using a lead host configuration, since a majority of lead hosts (> 50%) is required for the cluster to stay alive. Use at least 4 servers if you plan to use the High Availability feature: HA requires 3 servers, so having an extra server lets you pull a node out for maintenance without disrupting the cluster.
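The odd-number advice for lead hosts follows directly from the majority rule. A quick Python sketch of the arithmetic, assuming (as stated above) that strictly more than 50% of lead hosts must be running; the function names are mine, purely for illustration:

```python
def lead_host_quorum(lead_hosts):
    """Smallest number of running lead hosts that is a strict majority (> 50%)."""
    return lead_hosts // 2 + 1


def tolerated_lead_failures(lead_hosts):
    """How many lead hosts can fail before the majority is lost."""
    return lead_hosts - lead_host_quorum(lead_hosts)


for n in (3, 4, 5, 6, 7):
    print(f"{n} lead hosts: quorum {lead_host_quorum(n)}, "
          f"can lose {tolerated_lead_failures(n)}")
```

Going from 3 to 4 lead hosts, or from 5 to 6, adds management overhead without letting you survive one more failure, which is why the even counts are not worth it.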

Well, that pretty much does it on the basics. Through the years my colleagues Rick McGuire, Ryan Berry and Xuehong Gan of Microsoft were critical in helping me and customers with AppFabric support, and it's with their help that I could learn this product, so huge thanks to them as well.

    Distributed Caching in the Wild – Windows AppFabric Cache Part 1

    Every developer has at one point or another used caches to increase the scalability of a typical web application. Until fairly recently, the majority of people building web applications would cache the most heavily used reference data in-process with the application. That had its pluses and minuses, like any design. On the plus side, it made reads and writes to the cache fairly quick, as there is no network or inter-process communication between application and cache. There were plenty of minuses as well. What happens if a critical exception forces the IIS worker process to recycle? Well, the cache is gone on that web front end. What about the ability to scale the cache out? With caches running separately on all web front ends that doesn't exist, and the IIS worker process hosting both application and cache will continue to increase in footprint. That in turn hurts GC performance, forcing the GC to walk "deep roots", so the application spends more time in GC.

    About 5 years ago I first started working with customers that were implementing third-party distributed cache clusters, from vendors such as ScaleOut and NCache, as separate tiers of their applications. Then Microsoft introduced Windows AppFabric Cache around 2010. I immediately became very intrigued with this technology, as with all in-memory NoSQL and distributed cache technology, and over the last 4 years I was lucky to have helped a number of customers with their implementations.

    So what is Windows AppFabric Cache?  Windows AppFabric cache is what I call a distributed cache cluster technology and is very similar in its core idea to products such as memcached or Redis. It provides a distributed cache that you can integrate into both Web and desktop applications. AppFabric can improve performance, scalability and availability while, from the developer perspective, behaving like a common memory cache. You can cache any serializable object, including DataSets, DataTables, binary data, XML, custom entities and data transfer objects.

    The AppFabric client API is simple and easy to use, and the server API provides a full-featured Distributed Resource Manager (DRM) that can manage one or more cache servers (with multiple servers comprising a cache cluster). Each server provides its own memory quota, object serialization and transport, region grouping, tag-based search and expiration. The cache servers also support high availability, a feature that creates object replicas on secondary servers. Windows AppFabric exposes a unified cache tier to client applications by fusing together memory across servers. The AppFabric Cache architecture consists of a ring of cache servers running the AppFabric Distributed Caching Windows service, plus client applications that use the AppFabric Cache client library to communicate with the unified cache view. The cache cluster is a collection of one or more instances of the Caching Service working together in the form of a ring to store and distribute data. Data is stored in memory to minimize response times for data requests. Cluster management can be performed either by designated lead hosts or by storing the cluster configuration information in a SQL Server database. Only one instance of the Caching Service can be installed on each cache server.

    The product is a free add-on to Windows Server (http://www.microsoft.com/en-us/download/details.aspx?id=27115) and requires no additional licensing. However, if you will use the HA feature, it has to be deployed on Windows Server Enterprise or Datacenter edition.

    So the cache cluster runs across a number of dedicated nodes, but what about caches? A named cache, also referred to simply as a cache, is a configurable unit of in-memory storage that all applications use to store data in the distributed cache. You can configure one or more named caches for each of your applications. Each cache can be configured independently of the others, which lets you optimize the policies of each cache for your application. Each cache spans all cache hosts in the cluster. When the AppFabric Caching features are first set up, a cache comes pre-configured with the name "default". You can store data in this default cache, or you can create and use named caches. All caches are defined in the cluster configuration. Use the Windows PowerShell administration tool to create or reconfigure caches. Some settings can only be configured when you first create the cache. Other settings can be changed later, but may require the entire cache cluster to be restarted. There is a limit of 128 named caches for the cluster. Restarting your cache cluster causes all data to be flushed from all named caches in the cluster, but the named caches themselves are persisted.

    A region is an additional data container that can be placed inside a cache. Regions are a cache construct: they are not defined in the cluster configuration settings. Regions are optional; if you want to use them, you must explicitly create them at run time in your application code by using the CreateRegion method. I am actually not a big fan of regions, for the following reason:

    To provide their added tag-based search functionality, objects in a region are limited to a single cache host. Thus, applications that use that data cannot realize the scalability benefits of a distributed cache. In contrast, if you do not specify a region, cached objects can be load balanced across all cache hosts in the cache cluster. Regions offer searching capabilities, but by limiting cached objects to a single cache host, the use of regions presents a trade-off between functionality and scalability.

    So how do I program against AppFabric Cache? To start using AppFabric caching in your application, just add references to CacheBaseLibrary.dll, CASBase.dll, CASMain.dll and ClientLibrary.dll in your Visual Studio project. Make sure that a using statement (Imports in Visual Basic) for the Microsoft.ApplicationServer.Caching namespace is at the top of your application code.

    Application code should be designed so that it can function independently of the cache and does not require that cached data always be available. Because data in the cache is not persisted in a durable fashion, the possibility exists that the data in the cache could be unavailable. The DataCacheFactory class provides methods to return DataCache objects that are mapped to a named cache. This class also enables programmatic configuration of the cache client. What we are looking at here is a typical Factory pattern, familiar to any developer. IMPORTANT: constructing a DataCacheFactory is very expensive. If possible, store and reuse the same DataCacheFactory object for the application lifetime to conserve memory and optimize performance:

    // Create instance of cachefactory

    DataCacheFactory factory = new DataCacheFactory();

    Use the DataCacheFactory object to create a DataCache object (also referred to as the cache client).

    // Get a named cache from the factory

    DataCache myCache = factory.GetCache("catalog_products");

    Now it's fairly easy to write to and read from our cache:

    // add a Product object to the cache with key "product100"
    myCache.Add("product100", new Product("car"));

    // add or replace the Product object in the cache using key "product100"
    myCache.Put("product100", new Product("toaster"));

    // get the Product from the cache using the same key, "product100"
    Product myProduct = (Product) myCache.Get("product100");

    With AppFabric Cache you will be using the Cache-Aside pattern, explained at http://msdn.microsoft.com/en-us/library/dn589799.aspx, to emulate read-through functionality.
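For readers who have not met cache-aside before, here is the shape of the pattern as a minimal, language-neutral sketch in Python. Everything in it is a stand-in: InMemoryCache, CacheUnavailable and load_product_from_database are hypothetical names I made up for illustration, not the AppFabric client API. The point is only the control flow: try the cache, fall back to the source of truth on a miss or a cache failure, then populate the cache for the next caller.

```python
class CacheUnavailable(Exception):
    """Hypothetical error standing in for 'the cache could not be reached'."""


class InMemoryCache:
    """Hypothetical stand-in for a named cache (not the AppFabric API)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        return self._items.get(key)  # None on a miss

    def put(self, key, value):
        self._items[key] = value


def load_product_from_database(key):
    """Hypothetical slow lookup against the system of record."""
    return {"key": key, "name": "toaster"}


def get_product(cache, key, loader=load_product_from_database):
    """Cache-aside read: the application, not the cache, loads missing data."""
    try:
        product = cache.get(key)
    except CacheUnavailable:
        product = None  # treat an unreachable cache like a miss
    if product is None:
        product = loader(key)
        try:
            cache.put(key, product)  # populate for the next reader
        except CacheUnavailable:
            pass  # the application must keep working without the cache
    return product


cache = InMemoryCache()
first = get_product(cache, "product100")   # miss: loaded from the database
second = get_product(cache, "product100")  # hit: served from the cache
```

Note how the function still returns data when the cache is down, which is exactly the "function independently of the cache" design rule from earlier.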

    High Availability.

    When high availability is enabled by setting the Secondaries parameter of the New-Cache PowerShell cmdlet to 1, a copy of each cached object or region is maintained on a separate cache host. The cache cluster manages maintenance of these copies and supplies them to your application if the primary copies are not available. No code changes are required to make your cache-enabled applications highly available. HA is generally used for activity data; the performance overhead makes it not worth it for reference data, where you can just use the cache-aside pattern. The cache cluster chooses where the secondary copies of objects and regions are stored. Just as AppFabric distributes cached objects across all cache hosts in the cluster, it also distributes the secondary copies of those objects across all cache hosts in the cluster.

    If a cache host fails (assuming there are still enough cache hosts available to keep the cluster running), nothing changes for the cache-enabled application aside from a brief period of rebalancing. The cache cluster re-routes requests for an object to the cache host that maintained the secondary copy of that object. Within the cluster, the secondary copies of all the affected primary objects are then elevated to become the new primary objects. Then, secondary copies of those new primary objects are distributed to other cache hosts across the cluster. Secondary objects on the cache host that failed are replaced by new secondary objects and distributed across the cluster. This process also applies to regions.

    Pessimistic Concurrency or Refusing to Share Your Spoils. In the optimistic concurrency model, which is the default, updates to cached objects do not take locks. Instead, when the cache client gets an object from the cache, it also obtains and stores the current version of that object. When an update is required, the cache client sends the new value for the object along with the stored version object. The system only updates the object if the version sent matches the current version of the object in the cache. Every update to an object changes its version number, which prevents the update from overwriting someone else's changes. In the pessimistic concurrency model, the client explicitly locks objects to perform operations. Other operations that request locks are rejected (the system does not block requests) until the locks are released. When objects are locked, a lock handle is returned as an output parameter. The lock handle is required to unlock the object. In case the client application ends before freeing a locked object, time-outs are provided to release the locks. IMPORTANT: pessimistic concurrency is very expensive and obviously affects application throughput and concurrency, so I would definitely stay away from this construct unless it's absolutely necessary for some reason. Locks are a necessary evil; let's not introduce new ones.
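The optimistic model is easier to see in a toy example. The Python sketch below models version-checked updates with a plain dictionary; VersionedCache and its methods are hypothetical illustrations of the idea, not the AppFabric client API, and the integer version counter is my own simplification of the version object.

```python
class VersionedCache:
    """Toy model of optimistic concurrency (not the AppFabric API)."""

    def __init__(self):
        self._items = {}  # key -> (value, version)

    def get(self, key):
        """Return (value, version) so the caller can remember the version."""
        return self._items[key]

    def put(self, key, value, expected_version=None):
        """Store value; if expected_version is given, only when it still matches."""
        current = self._items.get(key)
        current_version = current[1] if current else 0
        if expected_version is not None and expected_version != current_version:
            return False  # someone else updated the object first
        self._items[key] = (value, current_version + 1)  # every update bumps the version
        return True


cache = VersionedCache()
cache.put("product100", "car")

# Two clients read the same object and remember the same version.
_, version_a = cache.get("product100")
_, version_b = cache.get("product100")

print(cache.put("product100", "toaster", expected_version=version_a))  # True
print(cache.put("product100", "kettle", expected_version=version_b))   # False
```

The second client's update is rejected rather than blocked, so it can re-read the object and decide what to do, and nobody ever waits on a lock.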

    Security. Windows Server AppFabric Caching features provide several options for managing security. By default, communication between cache clients and the cache cluster uses both encryption and signing. In addition, you must explicitly add a Windows account to the list of allowed accounts before the associated user can access the cache cluster. I would actually recommend turning off transport-level security (encryption and signing) and using your own methods, such as IPSec, VLANs and firewalls, to protect the cache. I found that this enhances performance by removing additional overhead from every call. When security is enabled, the AppFabric Caching Service must run under an appropriate identity. For domain environments, this should be the built-in "NT Authority\Network Service" account. For workgroup environments, this should be a local machine account. There is one exception to the service account setting for a domain environment: when security is disabled by setting the security mode to None, it is possible to run the AppFabric Caching Service as a specific domain account other than Network Service. Finally, only authorized accounts can access the cache cluster; use the Grant-CacheAllowedClientAccount cmdlet to grant access to a Windows user.

    In the next part I will cover management of your cluster, troubleshooting, as well as some interesting best practices and gotchas that I learned so far.