Sticky sessions are evil. Ok, I’ve said it. Actually I have said it before. In fact I am surprised at how often I have had to say it. They seem to get used with surprising frequency without properly thinking through the repercussions. I have been asked to explain the issues several times, so let me go walk through and document them.
Before we delve into the issues with sticky sessions, lets start with what they are and why they get used. The HTTP protocol is inherently stateless. Each request to a web server is executed independently without any knowledge of any previous requests. However it is fairly common for a web application to need to track a users progress between pages. Some examples are keeping track of log in information, or the progress through a test with correct/incorrect totals, the items in a shopping cart. We could fill this whole page with examples.
Just about every web application framework that I am familiar with (PHP, Java, .NET, etc) all have a simple way to handle this. The developer just enables server sessions. A session is created for each user and it stores all the transient user data for the web application. The data is typically stored in server memory and looked up on each web request via a session key.
This works great until you have to deploy the application across multiple load balanced servers. Any enterprise level application will always have at least two servers. Even if a single server can handle the expected traffic loads. Primarily for redundancy and maintenance. The Cisco CSS and the F5 Big-IP are popular for load balancing. But the server session is only on a single server, so we have to ensure that the load balancer always sends all requests from a single user to the same server where their server session is. This is where the sticky sessions comes in.
By instructing the load balancer to use sticky sessions, it keeps track of all user requests and which server it has routed their past requests to and sends future requests to the same server. On must load balancers this is a simple option that can be enabled. Sometimes this will also be referred to as server affinity or persistentence-based load balancing. That is obviously an over simplification, but I think you get the idea.
So this all seems pretty simple and straightforward so far. We are using built in features of the web application framework and standard options on the load balancer. What is the issue? The issue comes in when the load balancer is forced to shift users to a different server. When that happens, all of the user’s session data is lost. The effects of this will vary by application and what the user is doing at the time. Perhaps they get booted back to a login prompt. Or maybe they lose the contents of their shopping cart. Or maybe they suddenly lose their place in the test and are forced back to the beginning. All of these cases make the application appear defective to the end user.
So why might the load balancer be forced to shift the user to a different server even when sticky sessions are enabled? There are actually a number of reason. I’ll list a few of them.
- Load Balancing: The job of the load balancer is to keep the load fairly evenly distributed between all the servers in the pool. Sticky sessioning interferes with that. Depending on user activity, the load can become unbalanced. Eventually an individual server may become overburdened and unresponsive. Then the load balancer detects this, it will begin rerouting some traffic destined for one server to another. It keeps the application up, but the user experiences a glitch.
- Maintenance: From time to time a server has to come down to perform maintenance. This might be patches to the operating system or application server. But it might also be a standard part of deploying application updates. Either way, you must plan to take servers down from time to time. When that happens, the load balancer will shift traffic to remaining servers automatically. But again, the user experiences a glitch.
- Session mis-identification: There are a few scenarios where a load balancer might group multiple users into the same sticky session. The typical way a load balancer associates sessions with a user is through the IP address of the requests. There are alternative ways to set it up, but this is normally the default and is the most unobtrusive. But is it reliable? If requests from multiple users are passing through a proxy server or network address translation (NAT), they may all have the same IP address. Think about all the AOL users passing through a small bank of proxy servers. Or all the users at a company whose computers interface with the Internet via NAT. We use a content delivery network, and so user requests arrive at our origin via an CDN edge node using the IP of the edge node. In any of these scenarios, a group of users are routed to the same server since the load balancer is identifying them all as part of the same sticky session.
- Session identifier changes: Like in the example above, the user’s IP address may be used to identify the user. However there are circumstances where the user’s IP address might change between requests. Perhaps their request passes through a different proxy server, or a different CDN edge node, or DHCP renews their IP with a different one, or maybe it gets assigned a new IP when the user’s device moves from one area to another. All of these cases would be rare, but when it does happen the user just sees a glitch in your application.
- Cloud support: Many cloud providers do not provide for sticky sessions. Amazon S3 only recently announced they were adding this feature to their elastic load balancers. If moving your application to the cloud is something you are considering, then you might not be able to relay on sticky sessions.
I would like to take a minute to walk through some real word examples I have encountered that might shed more light on some of these issues.
We had an application that was providing an API for some client based systems. A development team stood up the API application on two load balanced servers, tested, and deployed it. All worked well. After a while someone noticed that the loads did not seem to be even. In fact upon inspection all traffic was going to a single node and the second node was not receiving any traffic. This caused a lot of head scratching, because everything appeared to be configured correctly.
I came in and did my own investigation and found the problem. The user requests where actually traversing two load balancers. The first handled routing the request to the proper application, the second balanced the requests across the two application nodes. The second load balancer was instructed to use sticky sessions based on IP. Unfortunately all requests it was receiving all came from the primary load balancer. So they all had the same IP and therefore were all assigned to a single sticky session.
In this case I worked with the development team to remove the need for sticky sessioning and disabled this feature on the load balancer. Fortunately we caught this problem before the application traffic rose to a point where more than one server was needed.
In another case, we had an API providing data for a mobile application. The way the application was written, it required the API to maintain a user session on the server. Therefore sticky sessions were enabled. Shortly before the application launch, the API code was moved into production and final quality assurance testing took place. The QA team begin to report intermittent glitches that had not appeared in earlier testing. The state information would periodically be lost. This did not happen frequently, but they were able to document multiple instances.
The development team immediately expected a problem with the sticky sessions. A quick check and it was confirmed that the feature was enabled however. No issues were found with the application code itself. It took further investigation until finally it was found that the IP addresses of the requests would periodically change.
A conference with the mobile application developer was initially unable to explain this behavior. They had to do more digging before eventually turning up the cause. It turned out that the framework used by the mobile application passed all requests up to a centralized system which then made requests out to our API servers when it needed data. Therefore we were seeing the IP addresses of these framework servers, and not of the user’s mobile device as expected. From time to time, the server handling the users requests would change which would break the sticky sessions. This problem caused a significant delay in the product launch while the application was reworked
One of our teams added a captcha challenge to an application. The users would have to pass the captcha challenge before the API would process their requests. Before adding the captcha, the application did not need to use sticky sessions. The team dropped in a simple captcha component. It would generate a challenge/response, store it in a server session, return the challenge to the user. The user would respond with the correct response (or the wrong one) and the users response would be checked against the one stored in the server session. Since server sessions were now being used, enabling sticky sessions was done on the load balancer.
There actually were no major problems with this. The sessions only lasted a minute or two typically so very few users (if any) would be affected when load was shifted from one server to another. And when a user was impacted, their captcha would fail, and they would get a new one. So the user just assumed they mistyped the correct response and would just try again (hopefully successfully), not that the application was malfunctioning. The use of sticky sessions did cause a little bit of uneven load balancing, but this was minimal due to the short user interaction with the service.
In this instance the time and cost to put a different method in place may not be warranted. Any issues from it were rare and did not cause major user frustration.
The use of simple server sessions tends to be taken for granted and developers just assume that enabling the use of sticky sessions on their load balancers is perfectly safe. Typically it will work fine most of the time. But when it doesn’t it can take a significant amount of time to track down the issue. An application may have infrequent reports of glitches that a can not be reproduced.
Sometimes configuring your load balancer to use a non default sessioning method will fix these problems. In fact in my first example above we could have reconfigured the load balancer to use a cookie based method. Possibly the second as well. Some load balancers support injecting their own session cookie into your HTTP traffic which it uses to track user sessions. Some may also support checking on an existing application cookie (e.g. JSESSIONID, PHPSESSID, ASPSESSIONID). This last option is better if supported. These methods only help with some of the problems associated sticky sessions however.
If your application requires 24/7 availability and high user quality I strongly suggest staying away from sticky sessions. There are a number of ways to enable you application to use sessions that do not require the user to always visit the same server. Illustrating those methods is beyond the scope of this post however. If however you still require sticky sessions, look into configuring it to use a non IP based sessioning method.
Enough of my ranting. Hopefully now you will think carefully before you enable sticky sessions on your load balancer. And if you still use it, at least you are more aware of the consequences.