The five most recent posts.
Origins of the Internet
Fundamental Publications
I’ve had a long interest in the origins of the Internet, both the people and the seminal publications. A few years ago I put together a short presentation on some of that history and how it might relate to the evolution of the cloud.
I thought folks might be interested in some of the original publications directly or indirectly referenced in that presentation, so I’ve collected them here. Please contact me if you have copies of PDFs you think are missing from my list (especially Leonard Kleinrock’s work).
- A Mathematical Theory of Communication, Shannon, 1948
- Reliable Digital Communications Systems Using Unreliable Network Repeater Nodes, Baran, 1960
- On Distributed Communications Networks, Baran, 1962
- On Distributed Communications, Baran, 1964
- The Aloha System – Another Alternative for Computer Communication, Abramson, 1970
- The Ethernet Memo, Metcalfe, 1973
—Jan 25, 2012
Some Rules for Engineering and Operations
The best solution to a problem is not to have it.
An insufficiently ugly temporary hack is permanent.
There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.
The first fallacy of automation is making machines perform each step of a manual human process.
These are not features: Security, Availability, Performance.
—Jan 24, 2012
Service Level Disagreements, Part 2
Yesterday, I explained the dangers of the common misunderstanding of service level agreements as insurance policies. While I mentioned a strategy of using multiple vendors rather than relying on the SLA offered by a single vendor, some more specific details will be useful in understanding and internalizing this approach.
Over the past ten years I have participated in or led negotiations for internet and CDN bandwidth at Internap, Amazon, and Microsoft. At first I invested significant time and effort in defining SLAs, methodology, metrics, and penalties, as is common practice. Two things eventually became apparent:
- Defining meaningful SLAs for public internet services, as opposed to private telco links, is not generally possible.
- SLA failure penalties are insufficient compensation for business impact.
From this experience and these realizations I changed my approach significantly. The two facets of the new strategy were, and are:
- Only enter into contracts with as small a traffic commitment as feasible and with no penalties for termination, regardless of cause.
- Engage multiple vendors for all bandwidth services.
Availability, which is always the responsibility of the customer, is now actually under the customer’s control, rather than being delegated to a vendor via an SLA. Should a vendor fail to deliver the desired service level, even for a short period of time, traffic can be shifted to other vendors until quality improves. Should a vendor prove too unreliable to use at all, their services can be terminated and other vendors brought in to replace them.
To make best use of this strategy it is important to have proper software support in place. For example, a single CDN vendor should be used for all content on each page served, with the vendor chosen dynamically for each request; mixing multiple CDN vendors on a single page can actually reduce availability, since the page then depends on every one of them being up. Similar traffic engineering can be done for requests to your own web servers using DNS-based global load balancing, though with coarser granularity. Similar principles will apply to “the cloud” as the interfaces and functionality in the space are commoditized.
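As a rough illustration of the per-page CDN selection described above, here is a minimal Python sketch. The vendor names, URLs, weights, and function names are all hypothetical, invented for this example; it only shows the shape of the idea, not any particular production system. One vendor is picked per request in proportion to its weight, and every asset on the page is served from that vendor.

```python
import random

# Hypothetical vendor table: base URL and relative share of requests.
# Setting a weight to 0 shifts all traffic away from a misbehaving vendor.
CDN_VENDORS = {
    "cdn-a": {"base_url": "https://cdn-a.example.com", "weight": 70},
    "cdn-b": {"base_url": "https://cdn-b.example.com", "weight": 30},
}


def pick_vendor(vendors, rng=random):
    """Choose one vendor for a whole page, in proportion to its weight."""
    live = {name: v for name, v in vendors.items() if v["weight"] > 0}
    if not live:
        raise RuntimeError("no CDN vendor has a positive weight")
    names = list(live)
    weights = [live[name]["weight"] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def asset_urls(paths, vendors=CDN_VENDORS, rng=random):
    """Rewrite every asset path on a page to the single vendor chosen for this request."""
    base = vendors[pick_vendor(vendors, rng)]["base_url"]
    return [base + path for path in paths]
```

Calling `asset_urls(["/site.css", "/app.js"])` once per page render keeps all of that page's assets on one vendor while spreading requests across vendors over time. In practice the weights would presumably be driven by monitoring rather than hard-coded.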
As Heinlein said, TANSTAAFL, and high-availability distributed systems are not exceptions. You are responsible for your availability. Understand clearly the business value to you of a vendor SLA and be prepared to change your strategy, and put in the technical and contract work required, if it will not meet your business needs.
—Jul 16, 2009
Service Level Disagreements
Vijay posted a (better late than never) rebuttal to a post from November last year by Joe Weinman of AT&T. I agree with all the points Vijay makes, and want to focus in on a particular area of Joe’s article:
(4) SLAs with financial penalties — Not only won’t enterprises accept “Well, after all, it’s still in beta” as an excuse for service outages, they demand meaningful SLAs (service level agreements) with clear metrics for evaluating achievement of those SLAs, backed up by monitoring and management systems, and financial penalties such as credits or refunds if service levels aren’t met. A “free” or low-cost service with questionable delivery quality is about as attractive to a CIO as an offer of free neurosurgery from someone who just skimmed a blog on how to do it in three easy steps.
Ah, the mighty service level agreement! The tooth and claw by which the wily customer brings the vendor to heel. Get the SLA right and you, the customer, can sit back and relax, safe in the knowledge that should there be an outage, you are covered. Your business is protected from harm by the warm, experienced embrace of a big, stable telco. Pinch me, I must be dreaming.
Vijay refers to SLAs as “an actuarial game”. The situation is rather worse than that. The trouble is that many intelligent people mistake an SLA for an insurance policy. It most definitely is not.
An insurance policy is purchased for a price, often based on actuarial tables, that reflects the risk of the policy being paid out and the size of the pay out. The value of the policy is that it is a hedge: in the event of a claim, the holder is compensated for (approximately) the full value lost. The insurance industry is predicated on most policy holders paying far more over the life of their policies than they are paid out, and on there not being catastrophic events that cause simultaneous claims by a large number of policy holders.
A service level agreement does not work this way. An SLA is not a hedge against the business impact of an outage: it is a refund policy. The maximum value of an SLA ‘claim’ is your monthly bill. The cost to your business of an SLA failure is likely to be far higher, but you will not be compensated for that loss. A six-hour service outage might cost your small business 10,000 dollars. Receiving a 500 dollar service credit is cold comfort.
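To put numbers on that gap, here is a small Python sketch of the arithmetic. The function name and the `credit_fraction` parameter are made up for illustration, and treating the 500 dollar credit as a full refund of a 500 dollar monthly bill is an assumption; real SLA credit schedules vary by vendor.

```python
def sla_shortfall(outage_cost, monthly_bill, credit_fraction):
    """Return (credit, uncovered_loss) for a single outage.

    The credit is capped at the monthly bill, because an SLA refunds
    fees paid rather than covering the business loss.
    """
    credit = min(monthly_bill * credit_fraction, monthly_bill)
    return credit, outage_cost - credit


# Assumed figures: the 10,000 dollar loss and 500 dollar credit from the example above.
credit, uncovered = sla_shortfall(outage_cost=10_000, monthly_bill=500, credit_fraction=1.0)
print(f"credit: {credit:.0f} dollars, uncovered loss: {uncovered:.0f} dollars")
```

Even with the whole bill refunded, 9,500 of the 10,000 dollars lost stays with the customer.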
SLA failures become more common as you move up the stack from the rigid, extremely well-characterized, layer 1 telco sweet spot. Outages that impact large sections of your customer base simultaneously are inevitable in large-scale, shared software infrastructure. If SLAs were insurance policies, vendors would quickly be out of business.
Given this, the question remains: how do you achieve confidence in the availability of the services on which your business relies? The answer is to use multiple vendors for the same services. This is already common practice in other areas: internet connection multihoming, multiple CDN vendors, multiple ad networks, etc. The cloud does not change this. If you want high availability, you’re going to have to work for it.
—Jul 15, 2009
EC2 Origins
I was trying to avoid writing this post and had succeeded at that goal for almost two years. After some recent exchanges, I see the wisest move is the opposite. So, here goes.
In 2003 I was working at Amazon for the best manager I’ve ever had, Chris Pinkham. Chris had hired me the previous year as a network engineer, quickly promoting me to manager for the (ridiculously awesome) team. Chris was always pushing me to change the infrastructure, especially driving better abstraction and uniformity, both essential for scaling efficiently. He wanted an all-IP network instead of the mess of VLANs Amazon had at the time, so we designed it, built it, and worked with developers so their applications would work with it. He wanted anycast DNS, so we hacked up some routing software and put it out there (great idea at the time, but in hindsight we probably should’ve taken a different approach). Chris asked for something, we figured out how, and did it.
Sorry for the digression, back to what I was saying about 2003: Chris and I wrote a short paper describing a vision for Amazon infrastructure that was completely standardized, completely automated, and relied extensively on web services for things like storage. We drew on the work of a number of other folks internally who had been thinking and writing (and sometimes even coding) in the storage services space, and we combined it with our own thinking and experience in infrastructure. Near the end of it, we mentioned the possibility of selling virtual servers as a service.
We presented the paper to Bezos (he doesn’t do slides), he liked a lot of it, and we went back to work.
A few months later, in early 2004, I was told Jeff was interested in the virtual server as a service idea and had asked for a more detailed write-up of it. This I did, also incorporating a couple of requests Jeff had, like the idea of a “universe” of virtuals, which I translated into network-speak as a distributed firewall to isolate groups of servers. This first cut at it looked almost nothing like the production EC2 service, and, in my view, every change made by the team who built EC2 was for the better. As just one example, that first paper called for a system manifest from which a server would be built. This is similar to how much systems automation works, but is actually terrible for the sort of dynamism desired for EC2.
After presenting the “executive brief” paper to Jeff, the realities of turning this hare-brained scheme into a real service meant involving the smartest folks around (i.e., not me). In the Amazon style of “starting from the customer and working backwards”, we produced a “press release” and a FAQ to further detail the how and why of what would become EC2. At this point attention turned from these paper pushing exercises to specifics of getting it built. Most importantly, who would lead the effort?
Everyone seemed to leap at once to the same conclusion: Pinkham. And so it was that Pinkham returned to South Africa, taking a stellar lead developer with him, and they built the EC2 team, then built EC2. That last part seems awfully compressed, doesn’t it? Well, that’s because I had almost no interaction with the EC2 team. They went off and kicked a lot of ass and the rest is history.
The end.
Want more data? Here’s Jeff in a 2008 interview with Om Malik…
—Jan 25, 2009