Running in production

What does it take to run a service in production?

Nowadays it is easier than ever to deploy services and have them available on the Internet. Lots of the heavy lifting has been taken care of for you. AWS, GCP and Azure make it easy to build global services with very few engineers.

However, this is only half of the truth. It is indeed easy to build the service. Running it, and especially handling problems and unforeseen situations, is still difficult. There are always surprises, big ones and small ones. It takes practice and a lot of thought beforehand to handle nasty surprises with minimal impact on your users.

Let’s consider some scenarios which might not get attention until it is too late. If you can answer all of these, you are probably more prepared than most organizations ever are.

Have you ever practiced restoring a backup?

This is pretty simple. Have you tried restoring your data from a backup? For example:

  • When was the last time you tried to restore the database from backup?
    • On production?
    • How did it go?
    • Was there any tooling or did you have to make up the steps as you went along?
    • Could you do that with 100% confidence that the steps you took are the correct ones?
    • Could you do it under pressure knowing that your whole application is down and each hour costs tens or hundreds of thousands?
    • If you can do it, can everyone on your team?
    • Do you have an automated way of restoring? Or at least a checklist or runbook for it?

It is easy to say that you have backups. The point of backups is to be able to restore data within your RTO (Recovery Time Objective) and RPO (Recovery Point Objective): recovery should complete within the defined time frame, the RTO, and at most an RPO’s worth of data may be lost. The crucial point is that just taking backups isn’t enough. You must test the restore process, because without testing you cannot be sure that you are actually able to restore the data.
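As an illustration, here is a minimal sketch of a scheduled restore drill, assuming PostgreSQL and its standard client tools (createdb, pg_restore, psql) plus an `orders` table with a `created_at` column; the paths, database name and thresholds are placeholders, not a prescription.

```python
#!/usr/bin/env python3
"""Restore-drill sketch: restore the latest dump into a scratch database and
check the result against RTO/RPO. All names and thresholds are illustrative."""
import subprocess
import time

LATEST_DUMP = "/backups/orders-latest.dump"   # hypothetical path to the newest backup
SCRATCH_DB = "restore_drill"                  # throwaway database, never production
RTO_SECONDS = 1 * 3600                        # agreed Recovery Time Objective
RPO_SECONDS = 24 * 3600                       # agreed Recovery Point Objective

started = time.monotonic()
subprocess.run(["createdb", SCRATCH_DB], check=True)
subprocess.run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, LATEST_DUMP], check=True)

# How old is the newest row we actually got back? That is the real data-loss window.
out = subprocess.run(
    ["psql", "-tAc",
     "SELECT extract(epoch FROM now() - max(created_at)) FROM orders", SCRATCH_DB],
    check=True, capture_output=True, text=True,
).stdout
elapsed = time.monotonic() - started
lag = float(out.strip() or "inf")             # an empty result also counts as a failure

assert elapsed <= RTO_SECONDS, f"restore took {elapsed:.0f}s, exceeding the RTO"
assert lag <= RPO_SECONDS, f"newest restored row is {lag:.0f}s old, exceeding the RPO"
print(f"drill OK: restored in {elapsed:.0f}s, data lag {lag:.0f}s")
```

Run something like this on a schedule against a scratch environment and you learn about broken backups from a report instead of in the middle of an outage.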

Isolating compromised resources

This one is also simple. One of your servers has been compromised somehow. You do not know how, but something fishy is going on.

  • Have you ever practiced cordoning off a compromised server/container/AWS account/GCP project?
    • Have you done it in an actual situation?

Forensics is important. You should have the capability of isolating suspicious activity in order to investigate it safely.
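For example, on AWS one way to cordon off a machine is to swap its security groups for a pre-created quarantine group (no ingress or egress except a forensics bastion) and tag it so nobody terminates it mid-investigation. A minimal sketch with boto3, using placeholder IDs:

```python
"""Quarantine sketch, assuming AWS and boto3. The quarantine security group is
assumed to exist already; all IDs are placeholders."""
import boto3

QUARANTINE_SG = "sg-0123456789abcdef0"   # hypothetical "deny almost everything" group


def quarantine_instance(instance_id: str) -> None:
    ec2 = boto3.client("ec2")
    # Replace all security groups with the quarantine group: traffic stops, but the
    # instance keeps running so memory and disks remain available for forensics.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[QUARANTINE_SG])
    # Tag it so nobody "helpfully" terminates or reboots it during the investigation.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "incident", "Value": "under-investigation"}],
    )


if __name__ == "__main__":
    quarantine_instance("i-0123456789abcdef0")  # placeholder instance id
```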

How do you handle errors in your data?

At some point most systems develop errors in their operative data. Business processes fall short or are left incomplete. Bugs cause data corruption. Users behave in unforeseen ways.

  • Do you have tooling to handle errors in your operative data?
    • Tools the helpdesk or developers can use to check data integrity, corruption or validity?
    • Do you rely on random SQL scripts, copy-pasted into a terminal on a developer’s laptop?
  • Have you created tooling to the point where no human access to the database is needed?
    • Do you have tools to remedy any corrupt/invalid data/state?
  • How do you detect corrupt or invalid data?

In the early stages of development it is normal for developers to check data errors by hand, and access to databases from personal laptops is common. However, in certain areas that is not an option: regulation might prevent it, or the volume of data might be too big. Human error is also something you want to avoid; a DELETE FROM clients without a proper WHERE clause is not something you want to experience, unless you want to find the answer to my first point, “Have you practiced restoring a backup?”

Developing proper tooling is crucial. Don’t overlook it, and automate it as much as possible. Hours spent on tooling will repay themselves many times over in the form of reduced toil.
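As a sketch of what such tooling can look like, here is a minimal read-only integrity checker, assuming PostgreSQL and psycopg2; the table names, invariants and DSN are purely illustrative.

```python
"""Integrity-check sketch: each check is a read-only query that should return zero
rows. The point is that helpdesk and developers run a tool like this instead of
pasting ad-hoc SQL into production. Tables, columns and the DSN are illustrative."""
import os

import psycopg2

# Named invariants: description -> query returning the offending rows.
CHECKS = {
    "orders without a customer":
        "SELECT o.id FROM orders o"
        " LEFT JOIN customers c ON c.id = o.customer_id WHERE c.id IS NULL",
    "negative invoice totals":
        "SELECT id FROM invoices WHERE total_cents < 0",
    "shipments for cancelled orders":
        "SELECT s.id FROM shipments s"
        " JOIN orders o ON o.id = s.order_id WHERE o.status = 'cancelled'",
}


def run_checks() -> int:
    failures = 0
    with psycopg2.connect(os.environ["READONLY_DSN"]) as conn:  # read-only credentials
        with conn.cursor() as cur:
            for name, query in CHECKS.items():
                cur.execute(query)
                sample = cur.fetchmany(10)       # a sample is enough for a report
                if sample:
                    failures += 1
                    print(f"FAIL {name}: e.g. ids {[row[0] for row in sample]}")
                else:
                    print(f"OK   {name}")
    return failures


if __name__ == "__main__":
    raise SystemExit(run_checks())
```

The same checks can run on a schedule or in CI, so corrupt data is found by a report rather than by a customer.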

Audits

Have you been audited against a common framework like PCI DSS, Katakri/Pitukri (specific to Finland), ISO 27001 or something similar? If not, consider going through one, even just a simple security audit done by a company specialized in them. Audits are not there to assign blame; they are great opportunities for learning and for finding issues you might have but are completely oblivious to.

Following security bulletins

Security is a continuous process. It’s not something you can get done in a sprint, and it’s not an epic that can be split into stories so that once those are done you have a secure system. Security is continuous work that requires ongoing effort.

  • Do you have tooling in place which raises CVEs onto the team’s backlog? (A minimal sketch of this kind of check follows this list.)
    • Dependabot? Renovate? You are using something like those, aren’t you?
  • How do you follow security incidents?
  • How do you handle vulnerabilities you’ve accidentally coded into your product?
  • Do you have a clear process for handling bug reports?
    • Have you considered bug bounties?
  • Do you have a clear priority for security-related incidents versus normal feature development?
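As a rough illustration of the first point, here is a minimal sketch that asks the public OSV.dev API whether pinned dependencies have known vulnerabilities. The dependency list is illustrative, and in practice Dependabot, Renovate or a dedicated scanner would feed the backlog instead.

```python
"""Vulnerability-check sketch against the public OSV.dev API (https://osv.dev).
The pinned dependencies below are illustrative; a real version would read them
from a lockfile and open backlog items instead of printing."""
import requests

DEPENDENCIES = [("requests", "2.19.0"), ("urllib3", "1.24.1")]  # illustrative pins


def known_vulnerabilities(name: str, version: str) -> list[str]:
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"package": {"name": name, "ecosystem": "PyPI"}, "version": version},
        timeout=10,
    )
    resp.raise_for_status()
    return [vuln["id"] for vuln in resp.json().get("vulns", [])]


if __name__ == "__main__":
    for name, version in DEPENDENCIES:
        ids = known_vulnerabilities(name, version)
        if ids:
            # In a real pipeline this would create a ticket on the team's backlog.
            print(f"{name}=={version}: {', '.join(ids)}")
        else:
            print(f"{name}=={version}: no known vulnerabilities")
```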

Is your site really up?

Your application shouldn’t be a black box. You should always have situational awareness of how your site is doing, as well as of your system’s internal state.

  • How do you know whether your system is actually working?
    • Do you have outside testing which simulates end users (synthetic monitoring), and do you measure what real users experience (RUM, Real User Monitoring)?

How do you know what actions to take and where to concentrate efforts if you do not know the state your system is in?
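A minimal sketch of such an outside probe, assuming a public health endpoint with placeholder URL and thresholds, could look like this; a real setup would run it from several regions and alert on consecutive failures rather than single blips.

```python
"""Synthetic probe sketch: fetch a health endpoint and check status and latency.
URL and latency budget are placeholders."""
import time

import requests

URL = "https://example.com/healthz"   # hypothetical health endpoint
LATENCY_BUDGET_S = 2.0                # illustrative threshold


def probe() -> bool:
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=5)
    except requests.RequestException as exc:
        print(f"{URL}: probe failed: {exc}")
        return False
    elapsed = time.monotonic() - start
    ok = resp.status_code == 200 and elapsed <= LATENCY_BUDGET_S
    print(f"{URL}: status={resp.status_code} latency={elapsed:.2f}s ok={ok}")
    return ok


if __name__ == "__main__":
    # A scheduler (cron, a Lambda, a monitoring service) would run this every minute.
    raise SystemExit(0 if probe() else 1)
```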

Malicious insider

Most systems are not breached with futuristic algorithms or complex hacking efforts. Usually the threat is an insider leaking credentials, and in some cases an outright malicious actor.

  • If a developer is malicious, how much destruction can they cause in production?
    • Are there safeguards in place to prevent that?
  • What other harm might be caused?

Together with the tooling mentioned earlier, this is a threat that can be greatly reduced by automation. I would say that if a developer, operations, an SRE or anyone else is required to log into the production database with an SQL client, then you have a problem. Automation and tooling should be at a level where logging into the production database is a “break the glass” emergency situation.
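One common way to make that “break the glass” step controlled and auditable is to issue short-lived, logged credentials only when an incident demands it. A minimal sketch, assuming AWS STS and a dedicated, heavily audited emergency role; the role ARN and incident id are placeholders.

```python
"""Break-glass sketch: issue short-lived credentials for an emergency role, with the
session name tied to an incident ticket so the audit trail shows who and why.
Role ARN and ticket id are placeholders."""
import boto3

BREAK_GLASS_ROLE = "arn:aws:iam::123456789012:role/break-glass-prod"  # hypothetical


def emergency_credentials(ticket_id: str) -> dict:
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=BREAK_GLASS_ROLE,
        RoleSessionName=f"break-glass-{ticket_id}",
        DurationSeconds=900,  # 15 minutes, deliberately short
    )
    return resp["Credentials"]


if __name__ == "__main__":
    creds = emergency_credentials("INC-1234")  # placeholder incident id
    print("temporary access expires at", creds["Expiration"])
```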

In an emergency - break the glass

Speaking of emergencies and breaking the glass: if all else fails, do you have some kind of “break the glass” process to follow? What do you do when everything else is down, inaccessible or compromised?

Approvals

In many cases there is a need for formal approvals. Sometimes it’s up to the developers, sometimes a strict hierarchy is called for. This is usually dictated by things outside of the development team’s control.

  • How do you approach approvals, be it approvals inside the application, infrastructure changes or deploying new versions?
  • Are there regulatory requirements on approvals?
  • How do you handle a situation where a production release requires approval from multiple people or roles in the organization?
  • Are your approvals easy enough to use that they actually get used, rather than circumvented because they are cumbersome?

Denial of Service

The more popular your application gets, the more interesting a target you become. Being targeted by a DDoS attack will be a reality sooner or later.

  • Have you been the target of a DDoS attack?
  • Do you have a direct channel to your service provider’s NOC/SIRT team?
  • Are you indemnified against the DDoS costs?

Able to act in crisis

Crises, unexpected events and accidents are not common, and our cognitive functions are not attuned to handling them. The stress reaction can be detrimental to our ability to make rational decisions, and we have biases which warp our thinking. On the positive side, it is possible to reduce these effects by practicing. That’s why firefighters and first responders practice all the time. Site reliability and operations can be practiced too: reacting to outages, attacks and other unexpected events can be rehearsed.

If you are not sure whether you are able to function when a crisis hits, you should practice: arrange game days, have runbooks, test recovery, test co-operation, utilize chaos testing, and test and improve your tooling.

Are we able to handle unexpected events? Do we regularly handle those kinds of situations? And most importantly, do we learn from past incidents so that we do not repeat the same mistakes or stumble on the same things in the future?

Epilogue

There could be many more questions like these, and granted, many of them are way overkill for most systems. But at the same time there are many systems that would desperately need attention to some of the points mentioned above.

We are excellent at figuring out business features and challenges and keeping the business happy. But really running production at a bigger scale requires a lot if you want to be confident and not rely on luck alone. Building robust operations/SRE practices is not something that happens overnight. Amazon’s CTO Werner Vogels has put it well: “You cannot compress experience”.