I am working with an SME customer at the moment that is big enough to have a high dependency on their website but not big enough to have an operational team available feeding and watering their systems all day, every day. This type of customer is not only common in the self-service cloud, but is the hottest, and probably biggest, target market.
So we’re building an AWS based system that has been architected, from the ground up to be loosely coupled, failure resilient and scalable. We have multiple load balanced and auto-scaled web servers across multiple availability zones. We have mongoDB replica sets across multiple machines and a hot-standby RDS MySQL database. We have Chef, with all its culinary nomenclature of recipes and knife telling all the servers what to do when they wake up. We have leant towards AWS services such as S3 and SQS because of their durability instead of trying to roll our own. We have engineered the system so that even if multiple failures occur that the solution will still serve requests until the next day when someone comes in and fixes things – much like a 747 can have multiple engine failures and still operate adequately without a need to repair or replace any of the engines while in flight.
In a nutshell, we have made all the right technical and architectural decisions to ensure that things will be as automated as possible and if something goes bump in the night, that there is no need to panic.
So we were asked what would happen if something did go horribly wrong at 3am? Something not even related to OS/hardware/network failure. Something not preventable through planned maintenance like disk space or suboptimal indexes. What about something that is the result of a bug in an edge case or bad data coming in over a feed? What do you do when something happens that your automated, decoupled, resilient and generally awesome system falls over for some unknown reason?
You call the person who can fix it.
Calling an expert to look at a problem is something that happens every day (or night) in data centres around the world, whether internal enterprise or public hosting services. Someone is sitting on night shift watching a bunch of blinking lights. When a light flashes orange he sends and email or a text message. When it flashes red he picks up the phone and, according to the script and the directory in front of him, calls the person who is able to fix or diagnose the problem. Apart from the general rudeness you may get when phoning someone up at 3am, the operator making the call can do so with confidence because that person is on call and the script says that they should be notified. If they can’t get hold of someone because they can’t hear their mobile in the nightclub at 3am, the operator is not at a loss as to what to do – the script has a whole host of names of supervisors, operational managers and backup people that are, according to the script, both interruptible and keen to deal with the matter at 3am.
If you are running your system on AWS (or any similar self-service public cloud infrastructure) you don’t have a moist robot who has the scripts or abilities to call people when things go wrong. Sure, you can send a gazillion alert emails, but nobody reads their email at 3am. Even if they do see the email or text message they may think that the other people on the distribution list are going to respond – so they turn over and go back to sleep.
You would think that it there is a business out there that will do this monitoring for you out of Bangalore or somewhere else where people to do monitoring are cheap. We want cheap people to do the monitoring because, by virtue of running on AWS, we are trying to do things as cheap as possible, so cheap is good. Granted, those people doing the monitoring won’t be able to restore and database and rerun transaction logs unsupervised at 3am, but we wouldn’t want them to and neither would we want that from our traditional hosting provider (because we are cheap, remember). So if we contracted in someone for first line support we would (at least) get people who have a script, a list of contact people, telephone and a friendly demeanour.
But what would those monitoring people offer that we can’t automate? Surely if they’re not doing much application diagnosis and repair then the tasks that they perform can be automated? What you get from moist robot monitoring, that you don’t get with automated alerts, is a case managed synchronous workflow. Synchronous because you pick up the phone and if no one answers you know to go to the next step; unlike emails where you don’t know if anyone has given it any attention. Workflow is the predefined series of steps to go through for each event. And case managed gives you the sense of ownership of a problem and the responsibility to do what you can (contacting people and escalating) in order to get it resolved.
But we’re engineers who like to automate things; surely even this can be automated?
Obviously there is an engineering solution to most things in life, and the Automated First Line Support system (AFLS) would look something like this…
You need some things to measure in order to pick up if something has gone wrong. This could be simple things that we are used to from something like Amazon CloudWatch which can monitor infrastructure level problems – cpu load, memory, io etc. You can also monitor generic application metrics such as those monitored by New Relic – page response times, application errors, requests per minute, cache hits, ISS application pool memory use, database query times and, in New Relic’s case, important metrics such as the overall Apdex on the system. You will also need to build custom metrics that are coded into the system. Say, for example, the system imported data; you could measure the number of rows imported per second. You can measure the number of failed login attempts, abandoned baskets, comments added; anything really that is important to the running of the system.
Once you are collecting a whole lot of data you need to be able to do something with it. Id’ be loath to talk about a “rules engine” but you would need some DSL (Domain Specific Language) to figure out how to trigger things. It gets tricky when you consider temporality (time) and other automated tasks. Consider a trigger that looks something like this
When Apdex drops below 0.7 and user load is above mean and an additional web server has already been added and the database response times are still good and we haven’t called anybody in the last ten minutes about some other trigger and this has continued for more than five minutes, then run workflow “ThingsAreFishy”
On Duty Schedule
Before any workflow runs you need to have a handle of who (as in real people) are on duty and can be called. This could be the primary support contact person, a backup, their supervisor, the operational manager and even, if all else fails, the business owner who can drive to someone’s house and get them out of bed. The schedule has to be maintained and accurate. You don’t want a call if you are off duty. A useful feature of the schedule would also be the ability to record who received callouts so that remuneration can be sorted out.
Voice Dialler and IVR System
You would need the AFLS to be able to make calls and talk to the person (possibly very slowly at 3am) and explain what the problem is. This is fairly easy to do and products and services exist that will translate your workflow steps and messages into voice. It would also be useful to have IVR acknowledgement prompts as well “Press 1 if you can deal with the problem… Press 2 if you want me to wake up Bob instead… Press 3 if you will look at it but are too drunk to accept responsibility for what happens”
“I’m Fixing” Mode
The AFLS will need to detect when it should not be raising triggers. If you are doing a 4am Sunday deployment (as the lowest usage period) and are bouncing boxes like ping-pong balls, the last thing you want is an automaton to phone up your bosses boss and tell him that all hell is breaking loose.
Host Platform Integration
Some triggers and progress data will need to come from the provider of the platform. If there is a major event in a particular data centre all hell may be breaking loose on your system as servers fail over to another data centre. Your AFLS will need to receive a “Don’t Panic. Yet.” message from the hosting providers’ system in order to adjust your triggers accordingly. There is no point in getting out of bed if the data centres router went down for three minutes and now, five minutes later, when you are barely awake, everything is fine.
All of this needs to hang together in an easy to use workflow system and GUI that allows steps and rules to be defined for each of your triggers. The main function of the workflow is not diagnosis or recovery, but bringing together the on-duty schedule and the voice dialler to get hold of the right person. It would also be great if workflows could be shared, published and even sold in a “Flow Store” (sorry Apple, I got it first) so that a library of workflows can be built up and tweaked by people either much smarter than you are or more specialised with specific triggers that you are monitoring.
The AFLS should cost a few (hundred maybe) dollars per server per month. None of that enterprise price list stuff will do.
Like everything else that we are consuming on the public cloud, it needs to be easy to use and accessible via a web control panel.
Is anybody building one of these?
Obviously you don’t want to build something like this yourself; otherwise you land up with an Escheresque problem of not being able to monitor your monitor. Automated First Line Support (AFLS) needs to be build and operated by the public cloud providers such as Amazon, Google or Microsoft (Do you see, @jeffbarr, that I put an ‘A’ in the front so that you can claim it for Amazon). Although they may want someone from their channel to do it, you still need access to the internal system APIs to know about datacentre events taking place.
Unfortunately the likes of AWS and Google don’t have full coverage of metrics either and need something like New Relic to get the job done. Either New Relic should expose a broader API or they should get bought by Amazon; I’m for the latter because I’m a fan of both.
Regardless of who builds this, it has to be done. I’ve just picked up this idea from the aether and have only been thinking about the problem for a day or two. No doubt that somebody has given this a lot more attention than me and is getting further than a hand-wavey blog post. As the competitive market heats up, it is imperative that the mega public clouds like AWS, Azure and GAE, that traditionally don’t have monitoring services and aren’t used to dealing directly with end users do some sort of AFLS. If they don’t, the old school hosters that are getting cloudly, like Rackspace are going to have a differentiator that makes them attractive. Maybe not to technical people, but to the SME business manager who has to worry about who gets up at 3am.