lordy be! look what george reese made me do. he goes off and posts this little thing. and the twitterverse erupts. a shot across the bow, answered by don macaskill.
the tweets run red with blood!
except they aren’t talking about the same thing (heck, don isn’t even talking about his whole site, just a part of it). and they certainly aren’t addressing the core problems with auto-scaling. as i define it, of course.
let’s start with that definition: auto-scale is the automatic, as opposed to human-directed, provisioning and deprovisioning of infrastructure in response to varying load on or performance of an online application. all but the smallest sites are actually composed of numerous, interacting, and interdependent applications. this seems straightforward enough! why, i just measure what resources i’m using, maybe a bit of performance measurement, and bingo, i know when it is time to add or remove resources.
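that naive version really does fit in a few lines. a minimal sketch, assuming a single cpu-utilization metric and thresholds i made up for illustration (the function name, parameters, and numbers are all hypothetical, not taken from any real auto-scaling product):

```python
def desired_servers(current: int, cpu_utilization: float,
                    scale_up_at: float = 0.75,
                    scale_down_at: float = 0.30,
                    min_servers: int = 1) -> int:
    """naive threshold rule: add a server when busy, remove one when idle."""
    if cpu_utilization > scale_up_at:
        return current + 1          # busy: provision one more
    if cpu_utilization < scale_down_at and current > min_servers:
        return current - 1          # idle: deprovision one, but keep the floor
    return current                  # in the comfortable band: do nothing


if __name__ == "__main__":
    print(desired_servers(5, 0.90))
    print(desired_servers(5, 0.10))
```

note what the rule quietly assumes: one more server always buys one more server's worth of capacity.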
i’ll stipulate that in a perfect world with apps that behave linearly you can succeed at that, as don obviously has. writing an app to be auto-scaled and then writing the auto-scaling system to scale that app is the simplest situation. an app not meant to be auto-scaled? harder to scale automatically. an auto-scale system that is generic and can automatically scale arbitrary apps? really extra harder.
the central reason for the trouble is that app behavior is not linear. apps often go through load-induced phase transitions so an app that worked fine with 5 web servers handling 500 hits per second grinds to a near halt with 6 web servers and 600 hits per second. common patterns, encouraged by widely-adopted frameworks like rails, make this situation almost inevitable.
every database client consumes a certain amount of memory on the database server. add too many clients and the database goes from having enough memory and cruising right along to running out of memory and swapping like crazy, and your app performance goes in the toilet.
you’ve planned your capacity to match the mix of queries on the database. oops, a page with a really expensive query on it just got incredibly popular, thanks, digg! you’ve run out of cpu on the database server.
neither of the above happened, but you’ve been chugging along, adding web servers to handle ever more customers. those customers each require a tiny amount of database disk i/o. now you have lots of customers, and a big number times a small number can be a big number. you’re out of disk i/o and your performance goes in the toilet.
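the database-memory case is easy to put numbers on. a back-of-the-envelope sketch, with every number invented for illustration (pool size, per-connection memory, working set, and server ram are all assumptions, not measurements from any real database):

```python
def db_memory_needed_mb(web_servers: int,
                        conns_per_server: int = 100,
                        mb_per_conn: int = 8,
                        working_set_mb: int = 12000) -> int:
    """memory the database wants: its working set plus a slice per client connection."""
    return working_set_mb + web_servers * conns_per_server * mb_per_conn


def db_is_swapping(web_servers: int, db_ram_mb: int = 16384) -> bool:
    """true once the memory demand exceeds the physical ram on the database box."""
    return db_memory_needed_mb(web_servers) > db_ram_mb


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, db_memory_needed_mb(n), db_is_swapping(n))
```

with these made-up numbers, five web servers need 16000 mb of the 16384 available and the database cruises; the sixth pushes demand to 16800 mb and into swap. the load grew smoothly, the behavior did not.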
these are phase transitions: your previously solid app takes a little more load than it has before and liquefies or vaporizes. adding more capacity probably won’t help and often makes the problem worse. hopefully you’ve been carefully collecting metrics, since that auto-scaling system needed them to get you into this mess. you can use those metrics to determine the new bottleneck and then come up with strategies, sometimes just tuning, often architectural, to eliminate it.
perhaps you start partitioning your data across multiple servers. perhaps you deploy in-memory query cache servers. perhaps you go the web 2.0 hipster route and shard things up. eventually, you get your app(s) stable at the new load, and you know the next phase transition looms in the darkness ahead. this is ok. this is the process. even the biggest of the big sites are still going through this, often having to invent entirely new systems and services just to keep the pig airborne. you are not alone.
all of which is to say you can probably auto-scale reliably between known phase transitions, but auto-scaling systems simply cannot get you across an unknown phase transition. this doesn’t make them useless, it makes them useful in specific circumstances. as with every tool in your toolbox, you should know when and how to use it, and when to put it away in favor of your most powerful and expensive tool: your brain.
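one way to encode "between known phase transitions" is to cap the auto-scaler at the largest size you have actually run and hand anything beyond that to a person. a minimal sketch, assuming a hypothetical `known_safe_max` you determined yourself through testing or experience (the names here are mine, not from any real system):

```python
from dataclasses import dataclass


@dataclass
class Decision:
    servers: int       # how many servers to actually run
    needs_human: bool  # true when load wants more than we know is safe


def bounded_scale(desired: int, min_servers: int, known_safe_max: int) -> Decision:
    """clamp the scaler's wish to the range we have already proven works."""
    if desired > known_safe_max:
        # past this point lies an unknown phase transition: stop and page someone
        return Decision(known_safe_max, True)
    return Decision(max(desired, min_servers), False)


if __name__ == "__main__":
    print(bounded_scale(4, min_servers=1, known_safe_max=5))
    print(bounded_scale(6, min_servers=1, known_safe_max=5))
```

the cap does not solve anything. it just makes sure the moment you hit unexplored territory is the moment a brain gets involved.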
