The story so far
For those listening in on this public conversation, here's the background.
I participated in a podcast episode about Continuous Deployment at the Java Posse Roundup 2012.
Marc Esher tweeted about the podcast, asking me to elaborate further on my assertion that "state is a bug".
That led to a longer question on a GitHub gist:
My response got a bit long for a gist, so here it is as a blog post. Marc and I haven't met. This is just how the internet works now.

> Joe, thanks for responding.
>
> I'm most interested in what are perhaps pedestrian issues, but they're issues (for me) nonetheless. You mentioned "outsourcing" session management... where can I read more on that?
>
> Take the simple case of: I'm a user on your system. I'm logged in. I'm doing things. The server I'm on disappears while the screen I'm reading is currently loading.
>
> What happens? Do I see an error? Am I sent to a new machine, with all my session state intact, and the screen I was reading simply reloads?
>
> Or take perhaps a different architecture where a machine isn't brought down until all its users have been successfully moved to other servers. For example, perhaps our configuration is such that we have N instances, and we do a rolling code push to those instances. User A is on Server 5, which has the old code. We need to get that user to Server 2, which has the new code, along with all of his state. Once all users are off of Server 5, the deploy then moves to Server 5 as well, and then Server 5 is brought back into commission and accepts new requests.
>
> So: what techniques, processes, and tools support these practices?
>
> Thanks again.
Here's how I see it.
As for the question of instantaneous failure: the server shouldn't be expected to disappear while it's handling a user's request and delivering a quick response, unless the instance suffers a truly catastrophic power failure at that moment. In that case there's no way I know of to recover. I'm not especially concerned with that case, though, because it's sufficiently rare, it isn't currently practical to protect against, and clients are still generally expected to retry a failed request a few times before giving up. During an electrical storm or "Take your monkey to work" day, servers can end up getting unplugged, and clients will be affected if their requests arrived at just the wrong instant, but I don't think that problem has been solved well yet. The solvable problem is this: when the user retries and gets a healthy server's response, will the user have an appropriate experience, or will they lose all their work and need to redo all their recent actions?
So, for the duration of one request and response in a stateless protocol like HTTP, the client and server are dependent on each other, and on the internet hops between them. Keep that request-response duration short enough, and the risk of instantaneous power loss should be mitigated.
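The client-side half of that bargain is the retry. As a rough sketch (the function and parameter names here are illustrative, not from any particular HTTP library), a client can retry an idempotent request a few times with a growing pause between attempts, so a request that lands on a dying server gets another chance on a healthy one:

```python
import time

def with_retries(request_fn, attempts=3, backoff_seconds=0.5):
    """Retry a failing (idempotent) request a few times before giving up.

    request_fn is any zero-argument callable that raises on failure and
    returns the response on success.
    """
    for attempt in range(attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            # wait a little longer before each successive attempt
            time.sleep(backoff_seconds * (2 ** attempt))
```

Retrying only makes sense for requests that are safe to repeat, which is one more reason to keep each request short and self-contained.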
If an instance gets a software shut-down signal, it ought to be configured to shut down its web server process gracefully before permitting the operating system to shut down. Web server graceful shut down should mean refusing to accept new requests, while finishing the delivery of responses the server has already started. Therefore, "terminate instance" should usually entail automatic draining of connections before total shut down occurs.
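Here's a minimal model of that draining behavior, with the web server reduced to a toy class so the shutdown logic is visible (real servers and load balancers implement the equivalent internally; this is a sketch, not any particular server's API):

```python
import threading

class DrainingServer:
    """Toy model of graceful shutdown: once shutdown begins, refuse new
    requests but finish delivering responses already started."""

    def __init__(self):
        self._draining = False
        self._in_flight = 0
        self._idle = threading.Condition()

    def handle(self, request_fn):
        with self._idle:
            if self._draining:
                # a load balancer would route the retry to a healthy server
                raise ConnectionRefusedError("shutting down; retry elsewhere")
            self._in_flight += 1
        try:
            return request_fn()
        finally:
            with self._idle:
                self._in_flight -= 1
                self._idle.notify_all()

    def shutdown(self):
        """Stop accepting work, then block until every started
        response has been delivered."""
        with self._idle:
            self._draining = True
            while self._in_flight > 0:
                self._idle.wait()
```

The ordering is the whole point: flip the "no new work" flag first, then wait for in-flight responses to finish, and only then let the operating system halt.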
The state that I consider a bug in any highly available system is session state, not request state. Anything you want to store in session could instead be stored in a remote shared state service such as a database, a queue, or a workflow service. Consider a case for a mature shopping site like Amazon.com or eBay.com. If you log in to your account with two laptops, and start viewing items on both laptops, the history of what you've recently viewed is generally visible to both sessions around the same time, even though you are collecting history in two different sessions, possibly on two different servers. This means that the important shopping transactions of "view item" are stored in a database shared by the two web servers your clients are talking to. If one of those servers shuts down and both of your laptops continue sending shopping requests to view more items, then some of the client traffic should get switched by a load balancer to a healthy server for the future requests. The state of the user's shopping sessions is undamaged because it is still in the shared database, visible to all server instances.
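The two-laptops scenario can be sketched in a few lines. The store here is an in-memory stand-in for the shared database (in production it would be an external service, never process memory), and the class and method names are illustrative:

```python
class SharedHistoryStore:
    """Stand-in for a shared remote database: recently-viewed items,
    keyed by account, visible to every server instance."""

    def __init__(self):
        self._viewed = {}  # account id -> ordered list of item ids

    def record_view(self, account, item):
        self._viewed.setdefault(account, []).append(item)

    def recently_viewed(self, account):
        return list(self._viewed.get(account, []))

class WebServer:
    """A stateless server: all session-like state lives in the shared
    store, so any instance can serve any request for any account."""

    def __init__(self, store):
        self.store = store

    def view_item(self, account, item):
        self.store.record_view(account, item)
        return self.store.recently_viewed(account)
```

Because neither server instance holds anything the other lacks, terminating one of them costs the user nothing: the next request lands on the survivor and the full history is still there.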
Once you move "session"-type state out of the server's session context and into a remote shared service, it becomes practical to set up auto-scaling policies to get more stateless virtual servers when you need them, and to terminate stateless virtual servers when your traffic levels drop, in order to save money when renting virtual servers from a cloud provider.
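The essence of such a policy is a small decision function. This is a deliberately naive sketch with made-up thresholds (real auto-scaling policies, such as AWS's, are configured rather than hand-rolled), but it shows why statelessness matters: adding or removing a server is only safe when every server is interchangeable:

```python
def desired_capacity(current, avg_cpu_percent,
                     scale_up_at=70, scale_down_at=30,
                     min_size=2, max_size=10):
    """Toy scaling policy: add a stateless server when average load is
    high, remove one when it is low, clamped to the group's limits."""
    if avg_cpu_percent > scale_up_at:
        current += 1   # safe: any new server can handle any request
    elif avg_cpu_percent < scale_down_at:
        current -= 1   # safe: no session dies with the terminated server
    return max(min_size, min(max_size, current))
```

With session state stuck on individual servers, the scale-down branch would destroy user sessions every time it fired; with state in a shared service, it just saves money.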
I think about this stuff a lot because I work on Asgard, the open source app deployment and cloud management app produced and used by Netflix. I talk to a lot of Netflix engineers about the need to get state out of their applications. I'm also working to get state out of Asgard itself.
Rehearsing for failure
Many Netflix services use AWS auto scaling for availability and cost savings, so they need to keep user state out of their services so that servers can be used interchangeably by clients. Netflix also uses Chaos Monkey to terminate instances within an Auto Scaling Group daily during business hours, just to make sure the developers are still maintaining a system that is resilient to small-scale instance failures. Amazon's data centers, like all data centers, have server failures, so our best defense is to expect failures, to plan for them, and to practice automatic recovery all the time. Netflix doesn't use auto scaling for Cassandra database rings, because Cassandra doesn't have a good way to handle frequent growth in the size of a cluster. However, we do use Chaos Monkey to exercise Cassandra's ability to recover completely from an occasional single-instance termination.
We also run all our services in three AWS Availability Zones (data centers), so if one zone has a major problem, Netflix is generally unaffected, while Reddit and Pinterest are sometimes down for hours.
Asgard's state problem
On a more personal level, I'm working to remove state from Asgard so that I can increase Asgard's server count and release new versions of Asgard for use by Netflix engineers more frequently and conveniently. My intent is to use Amazon Simple Workflow Service to store the state of each long-running automation process that gets started by an Asgard user. For example, a complex rolling push of new code to replacement instances in an auto scaling group, or a reversible push of new code to a new auto scaling group, with automated result checking and rollback on failure. The state of the long-running workflow execution will then be visible to all Asgard instances, while none of those instances needs to stay up just to finish the automation process.
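The idea can be sketched without any AWS specifics. Here the store is an in-memory stand-in for Simple Workflow Service, and the step names are invented for illustration; the point is that progress lives outside any one Asgard instance, so whichever instance picks the work up next can resume where the last one left off:

```python
class WorkflowStore:
    """Stand-in for an external workflow service: records which steps a
    long-running automation task has completed, outside any one server."""

    def __init__(self):
        self._completed = {}  # workflow id -> ordered list of finished steps

    def completed_steps(self, workflow_id):
        return list(self._completed.get(workflow_id, []))

    def mark_done(self, workflow_id, step):
        self._completed.setdefault(workflow_id, []).append(step)

def resume_rolling_push(store, workflow_id, steps):
    """Run whatever steps remain. Safe to call from any server instance,
    including one that just replaced a terminated instance."""
    done = set(store.completed_steps(workflow_id))
    for step in steps:
        if step not in done:
            # ... the real work for this step would happen here ...
            store.mark_done(workflow_id, step)
```

If the instance that started a rolling push dies halfway through, nothing is lost: the workflow record says which steps already ran, and any other instance can finish the rest.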
Related Netflix tech blog posts
- Post-mortem of October 22, 2012 AWS degradation
- Netflix Shares Cloud Load Balancing And Failover Tool: Eureka!
- Chaos Monkey Released Into The Wild
- Lessons Netflix Learned From The AWS Storm
- Scalable Logging and Tracking
- Asgard: Web-based Cloud Management and Deployment
- Netflix Operations: Part I, Going Distributed
- Fault Tolerance in a High Volume, Distributed System
- Auto Scaling in the Amazon Cloud
- Making the Netflix API More Resilient
- Benchmarking Cassandra Scalability on AWS - Over a million writes per second
- The Netflix Simian Army
- Lessons Netflix Learned from the AWS Outage
- NoSQL @ Netflix Talk (Part 1)
- NoSQL at Netflix
- Four Reasons We Choose Amazon’s Cloud as Our Computing Platform
- 5 Lessons We’ve Learned Using AWS