Any developer who builds a new system dreams of the day their software goes live, and real users start pounding on their application. This is perhaps the most tangible validation of their work that some developers ever have. On that day, it's pretty common for developers to either be the team responsible for running the system, or at least to be working shoulder-to-shoulder with the people who are running it.
For those lucky souls who "grow into" this role, it may be helpful to have a sense of perspective about what success in Operations means. The specific technical and design details will depend on your implementation, but your job ultimately boils down to some simple objectives:
1. Understand what's supposed to happen
While Development is all about designing, building, and testing, Operations is all about Execution. Go rent Apollo 13, or better yet, pick up a copy of Failure is Not an Option by Gene Kranz. Happiness in Operations means understanding what's going to happen before it happens, and there’s no such thing as a good surprise. You can see this in the checklists that all the NASA engineers used, and when someone tries to tell you that you're too smart for checklists, remember that those guys were, in fact, rocket scientists. Nuff said?
Virtually nothing in Operations is of any use whatsoever unless everyone who’s touching a keyboard knows the exact same stuff, which means checklists need to be written down in a way that ensures everyone’s got the same information. If you don't have anything better to start with, get a whiteboard and write, "8am: Make sure the servers are still on," and then start filling in from there.
2. Know what's actually happening
This is the IT equivalent of watching the gauges on your car's dashboard. Failure to pay attention to the gauges could prove costly when your engine starts shooting connecting rods through the hood.
In operations, you're watching for failures and performance problems, hopefully in time to react before your customers start complaining. You're also watching for unusual activity that could indicate problems with other systems you interface with, or even hacker activity. When you get more sophisticated about what you're watching, you may even be able to provide design guidance on which features your customers use the most, or whether there are parts of your application(s) that users seem to be struggling with. But please make sure you're covering the basics first: server uptime, exceptions, and application performance.
As usual, tools help here. It’s a whole lot more efficient and effective to have a tool checking to make sure servers are responding properly. Fortunately, there are all sorts of tools like this, including some free ones.
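If you want a feel for what the simplest version of such a tool does, here is a minimal sketch in Python using only the standard library. It is not a substitute for a real monitoring product, and everything in it (the function names, the timeout, the URLs in the usage comment) is made up for illustration.

```python
import urllib.error
import urllib.request


def check_server(url, timeout=5.0):
    """Return (ok, detail) for one URL. ok is True only for a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            return (200 <= status < 300, f"HTTP {status}")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return (False, f"HTTP {exc.code}")
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # DNS failure, refused connection, timeout, or a malformed URL.
        return (False, f"unreachable: {exc}")


def check_all(urls, checker=check_server):
    """Check every URL and return {url: (ok, detail)}."""
    return {url: checker(url) for url in urls}


def failing(results):
    """List the URLs whose check did not come back ok."""
    return [url for url, (ok, _detail) in results.items() if not ok]


# Hypothetical usage; replace the URLs with your own health endpoints:
#   results = check_all(["https://app.example.com/health",
#                        "https://api.example.com/health"])
#   for url in failing(results):
#       print("DOWN:", url, results[url][1])
```

Run something like this on a schedule and you have the automated version of "8am: Make sure the servers are still on." The `checker` parameter is there only so the logic can be exercised without touching the network.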
Important: Be sure to understand the difference between IT Operations and Business Operations. These can, in some cases, be co-resident, but remember that one is focused on your systems and the other is focused on the business. The two should communicate liberally back and forth, but it’s important to keep technical status and management distinct from business status and management.
3. Communicate status
Since it’s Operations’ job to know what’s happening, Operations serves as a fount of knowledge for other departments. In a lot of cases, proactive (“push”) communication is more effective than “pull” communication, and whenever you can drive on-the-spot decision-making out of the process, it’s a good thing. Operations should therefore know in advance what sort of events should trigger communication, and to whom they’d be communicating. Some of this could, in fact, be automated.
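One way to take the decision-making out of "who needs to hear about this?" is to write the answer down as data before anything goes wrong. The sketch below is only an illustration: the event types, severity scale, and recipient names are all hypothetical, and a real setup would hand the resulting list to whatever email, chat, or paging system you already use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    event_type: str      # hypothetical event name, e.g. "server_down"
    min_severity: int    # notify when severity is at or above this level
    recipients: tuple    # who hears about it


# Hypothetical routing table, agreed on in advance rather than mid-incident.
ROUTING_RULES = (
    Rule("server_down", 1, ("ops-oncall",)),
    Rule("server_down", 3, ("ops-manager", "support-desk")),
    Rule("slow_response", 2, ("ops-oncall", "dev-team")),
)


def recipients_for(event_type, severity, rules=ROUTING_RULES):
    """Return the recipients for an event, in rule order, without duplicates."""
    matched = []
    for rule in rules:
        if rule.event_type == event_type and severity >= rule.min_severity:
            for recipient in rule.recipients:
                if recipient not in matched:
                    matched.append(recipient)
    return matched
```

With this table, a severity-3 `server_down` event reaches the on-call engineer, the manager, and the support desk, while a severity-1 event reaches only the on-call engineer. Nobody has to decide that at 3am.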
Status is typically focused on what’s happening right now, but a complete understanding of status also includes a sense of whether measures are trending in one direction or another. Data about how your system performs over time, for instance, can tell you a lot about whether a performance metric you’re seeing right now is a blip or part of a trend that’s moving steadily toward a big problem. This sort of long-term information should also help you see performance or resource constraints in time to react to them before they affect customers.
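To make the blip-versus-trend idea concrete, here is a rough sketch that fits a straight line (ordinary least squares) through evenly spaced samples of a metric and estimates how long until that line crosses a limit. The function names and the sample numbers are invented for illustration, and a straight-line fit is just one simple choice; real metrics are often noisier and may need something smarter.

```python
from statistics import mean


def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples (change per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def samples_until_limit(values, limit):
    """Estimate how many more samples until the fitted line reaches limit.

    Returns None when the fitted line is flat or falling, and 0.0 when the
    fitted line is already at or above the limit.
    """
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    intercept = mean(values) - slope * mean(range(len(values)))
    current_fit = intercept + slope * (len(values) - 1)
    if current_fit >= limit:
        return 0.0
    return (limit - current_fit) / slope
```

For example, with hypothetical daily disk-usage readings of 10, 20, 30, 40, and 50 percent, the slope is 10 points per day, and `samples_until_limit(readings, 80)` estimates about three more days before you hit 80 percent. A series that sits at 50 with a single spike to 90 in the middle has a slope of roughly zero: that one is a blip.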
4. Handle catastrophes
Sometimes, bad things happen to good applications. When the sh*t hits the fan, it’s absolutely imperative that the cure isn’t worse than the disease. Go watch Apollo 13 again. Since everything that normally happens in operations should happen according to a checklist or procedure, it should be glaringly obvious to everyone (to the point of discomfort) that you’re now operating off-script.
I’ve heard pilots describe their jobs as “hours of boredom punctuated with seconds of sheer terror.” This is when you want to open the cockpit door and see Sully sitting there. Sully uses checklists, too, by the way.
5. Maintenance and planning
Since operations has done such a good job of ensuring our system is running like a Swiss watch, they’ve got some time left to plan for future improvements. With any luck, this might include stuff like:
- Preparing and managing hardware and/or virtual servers.
- Planning infrastructure changes for upcoming software releases. This is actually a very important form of developer support, because this is where operations and development work together to make sure you can deploy the things you’re building without any undue drama.
- Tuning / tweaking system monitoring and management tools.
- Analysis to assist development – where are your servers stressed, what custom tasks do you deal with today that might be built into the application, etc.
This list is just a start, of course, but it’s a pretty good start.
What tips would you add?