“The customer is always right” – that’s a saying that serious businesses actually say sometimes, with a serious and straight face. It’s one of those sayings rooted in good intentions and immediately polluted with the messy reality of the real world.
When you’re working in Database or even Network Engineering, the real answer is far more complicated and nuanced, but it’s much closer to “the customer is always wrong”. That is not to say that the Database or Network Engineer is right, either, though.
Usually the customer (in this case usually a Software Engineer) comes to us, or more likely pages us, with something like “the database is down”. Why? Because they got an error message like “The server has gone away” or “Too many connections” or “Host unreachable” or any number of error messages which are simultaneously too descriptive, and completely unhelpful. Such error messages are good at leading people astray.
For many reasons, not the least of which is that it’s impossible, error messages never say “Your query, yes the new one from ReallyExpensiveJob at line 52, added last Thursday, was really not a good idea” – that would be too simple. Instead the web page, or background job, or query quietly and efficiently consumes every resource available to it, or deadlocks some important process, or overwhelms the network, or any number of other secondary or tertiary symptoms. Maybe it even works fine most of the time but in some circumstances (say, when processing certain users’ data, or a specific shard) it goes absolutely wild.
More often than not, what that looks like both to the Software Engineer and the Database Engineer is that the “database is down”… because for all practical purposes, it is. Often though, the database is not really the problem. The database is the victim of some bad behavior: anywhere from a new bad query to an external end user clicking a button or a crawler/bot doing something.
When responding to these sorts of issues, in my experience, the priority is:
- Don’t make a bigger mess. Avoid taking any drastic actions to “fix” the underlying systems: don’t failover, don’t rebuild a host, don’t reboot.
- Provide first aid. Do what you need to: turn things off, block jobs, or even put the application in maintenance mode to give yourself breathing room.
- Find the root cause. Figure out which symptoms are primary and which are secondary.
- Patch temporarily. Get things mostly working again relatively quickly.
- Document the incident. Make sure future-you and your peers know what happened and what still needs to be done.
- Fix permanently. Implement a long-term solution.
Don’t make a bigger mess
When you log in and see that a particular thing is “broken”, it can be very tempting to try to fix it immediately. The graphs all show db73 is broken, so it’s obvious that you should try replacing db73, right? The other folks responding in their own on-call rotations are likely also applying additional pressure for you to do something – anything – to fix it.
Resist the temptation to failover, rebuild, reboot, or restart things without a better understanding of the underlying problem. Sometimes doing so is the right answer, and you’ll feel stupid for not having done it sooner. More often than not, though, there are a lot of symptoms to sort through to figure out what the root cause is.
Why not, though? A failover or reboot in the middle of an incident caused by what turns out to be a bad query can distract you, but it can also seriously exacerbate the actual problem. Changing key parts of your infrastructure in the middle of an incident may cause problems due to locks, cold caches, and killed/broken queries, and it generally introduces more change into a system whose state is already unknown. Additionally, it may mask the problem (making you think you fixed it) because the bad actor itself gets temporarily broken/stopped by the change and takes some time to come back.
Provide first aid
Hopefully you have a few tools at your disposal to make things better without pushing new code or taking drastic measures in the underlying systems. These will usually be temporary measures to “stop the bleeding”: blocking specific kinds of jobs, IP blocks, locking specific users out, or even just putting the entire application or parts of it into maintenance mode.
The idea here is not to actually solve the problem, just to give yourself or other responders more breathing room and stop any escalation of harm. Don’t be afraid to take fairly extreme measures, so long as they are temporary and safe (and ideally well designed and documented). I promise you that users will not mind getting a maintenance page instead of a page that repeatedly times out loading.
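As a rough illustration of what “blocking specific kinds of jobs” can look like, here is a minimal sketch of a kill switch that job runners check before starting work. Everything in it is hypothetical: the `KillSwitch` class, the `run_job` helper, and the idea of giving every block an expiry are my assumptions, not a description of any particular job system. In practice the blocked list would need to live somewhere all workers can see it (a config service or shared store), not in one process’s memory.

```python
import time


class KillSwitch:
    """Hypothetical in-memory registry of temporarily blocked job types."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._blocked = {}  # job type -> time at which the block expires

    def block(self, job_type, ttl_seconds):
        # Every block carries an expiry so that a forgotten block
        # eventually lifts itself instead of becoming permanent.
        self._blocked[job_type] = self._clock() + ttl_seconds

    def unblock(self, job_type):
        self._blocked.pop(job_type, None)

    def is_blocked(self, job_type):
        expiry = self._blocked.get(job_type)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._blocked[job_type]
            return False
        return True


def run_job(switch, job_type, work):
    """Run `work` unless its job type is currently blocked."""
    if switch.is_blocked(job_type):
        return "skipped"
    work()
    return "ran"
```

A responder would then call something like `switch.block("ReallyExpensiveJob", ttl_seconds=3600)` and leave every other job type running. The expiry is a design choice, not a requirement: it trades the risk of a block lapsing mid-incident against the risk of one being forgotten afterwards.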
Find the root cause
Now that you’ve got some breathing room, it’s time to figure out what is going on. When things go terribly wrong though, pretty much every graph for every system will look “wrong”. You will need to systematically work through the available data, logs, and metrics to figure out which symptoms are primary and which are secondary.
Some examples of secondary symptoms that can easily muddy the data and cause you to misunderstand the problem:
- When the system is overloaded, every query may get slower, time out, or error. Those can be conflicting signals that send you on a wild goose chase.
- When queries are slower, generally fewer of them will get processed, and they will “stack up” more, running concurrently. This will introduce entirely new behaviors.
- When the database is overloaded or behaving badly, upstream systems such as web servers will behave differently and look different. The number of workers may increase. They may see more timeouts or even OOMs.
- Conversely: Maybe the increase in workers is actually the problem and they are now overloading the databases? Confusing, isn’t it?
- As you’re digging through logs and metrics, you’re likely to find a few things that have “always” been there and always been bad: bad queries, weird user behaviors, abusive bots, etc. Keep that in mind, and try to figure out whether something you’re seeing is new or whether it changed during the incident window.
- The monitoring systems that provide all of this data sometimes themselves cause extra load or bad behaviors when the system is under load. This can be due to increased logging rates, the size of some frequently queried internal table, locking required to access internal data, or any number of other reasons.
- Automatic retries in components of your systems can cause escalating or increasingly stacked concurrent loads. For example, a system may retry an operation repeatedly after a short timeout. Then what happens when the operation starts to time out repeatedly because of an unrelated problem? This can make it seem like a particular system is the root cause, when it’s really a secondary symptom.
- Users will often make things worse. When a page doesn’t load or produces an error, users will often try it again, and again, and again. They’re impatient and tenacious. Don’t be afraid to lock them out while you work to fix things.
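To put a rough number on the retry problem described above, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is mine, not a claim about any real system: every failed attempt is retried immediately, up to `max_retries` times, and each attempt fails independently with the same probability. Real overloads are worse than this, because failures are correlated and the extra load itself raises the failure rate.

```python
def attempts_per_second(request_rate, failure_probability, max_retries):
    """Expected attempts per second hitting the backend.

    Each request makes a first attempt, then retries only while the
    previous attempt failed, so the expected number of attempts per
    request is 1 + p + p^2 + ... + p^max_retries.
    """
    expected_attempts = sum(
        failure_probability ** k for k in range(max_retries + 1)
    )
    return request_rate * expected_attempts


# Healthy system: nothing fails, so retries add no load.
healthy = attempts_per_second(100, 0.0, 3)

# Everything is timing out: the same 100 requests/s become 400 attempts/s.
melting = attempts_per_second(100, 1.0, 3)
```

The point is that the system doing the retrying can suddenly look like the biggest source of load on the graphs, even though it is only reacting to a failure that started somewhere else.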
It can be really frustrating and challenging to figure out what is really going on when the system is overloaded and everything is broken.
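One way to separate what has “always” been bad from what is actually new is to compare the same kind of data from a quiet baseline window against the incident window. The sketch below assumes you can export a list of query fingerprints (or job names, or URLs) for each window, and that both windows cover the same length of time. The function name, the fingerprints, and the `min_ratio` threshold are all hypothetical.

```python
from collections import Counter


def compare_windows(baseline, incident, min_ratio=2.0):
    """Return items that are new, or much more frequent, in the incident window.

    `baseline` and `incident` are iterables of fingerprints taken from
    windows of equal duration. Each finding is a tuple of
    (fingerprint, kind, baseline_count, incident_count).
    """
    base_counts = Counter(baseline)
    incident_counts = Counter(incident)
    findings = []
    for fingerprint, count in incident_counts.items():
        before = base_counts.get(fingerprint, 0)
        if before == 0:
            findings.append((fingerprint, "new", before, count))
        elif count / before >= min_ratio:
            findings.append((fingerprint, "increased", before, count))
    # Largest incident-window counts first.
    return sorted(findings, key=lambda finding: finding[3], reverse=True)


baseline = ["SELECT users"] * 10 + ["SELECT orders"] * 5
incident = ["SELECT users"] * 11 + ["SELECT orders"] * 20 + ["SELECT report"] * 50

for finding in compare_windows(baseline, incident):
    print(finding)
```

With this sample data, `SELECT report` is flagged as new and `SELECT orders` as increased, while `SELECT users` is left out because it barely changed. Keep in mind that this only narrows the search: a query that shows up as “increased” may still be a secondary symptom of queries stacking up.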
Patch temporarily
Okay, now you’ve identified what you think is the root cause.
Resist the temptation to dig into the real fix for that yet. Often, there’s a temporary solution to get the system back to “mostly” working: Perhaps you can block only a specific type of job rather than all jobs; or perhaps you can early-return an error for a specific URL rather than having the entire site in maintenance mode. Be creative, but keep things simple and temporary.
Making a clever temporary patch will also help you to validate that you’ve identified the root cause correctly before you spend time implementing the permanent fix. It sucks to spend time writing a permanent fix with thorough review and tests only to realize later that the problem was not what you understood, and your fix doesn’t help.
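As an example of the “early-return an error for a specific URL” idea, here is a minimal sketch written as WSGI middleware. It assumes a Python web application that speaks WSGI. The `block_paths` name, the path prefix, and the five-minute `Retry-After` value are hypothetical, and your stack may offer a better place for this (a load balancer rule or a feature flag, for instance).

```python
def block_paths(app, blocked_prefixes):
    """Wrap a WSGI app so requests to blocked path prefixes get a fast 503."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if any(path.startswith(prefix) for prefix in blocked_prefixes):
            body = b"Temporarily unavailable, please try again later.\n"
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                    # Hint to well-behaved clients to wait before retrying.
                    ("Retry-After", "300"),
                ],
            )
            return [body]
        # Everything else goes to the real application untouched.
        return app(environ, start_response)

    return middleware
```

Wrapping the application as `application = block_paths(application, ["/reports/expensive"])` fails only that one endpoint, and fails it quickly, while the rest of the site keeps working. It is also easy to remove once the permanent fix lands.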
Document the incident
Make sure that you document everything that you did as part of this investigation to the extent possible, as soon as possible after the fire is out (or ideally even work with someone else during the incident to document it in real time):
- Temporary patches or configuration changes that need to be reversed.
- A proposal for the permanent fix, if you have one.
- Interesting alternative theories about the problem, in case your final identified root cause turns out to be wrong.
- Unrelated bad or broken things you found during your investigation.
- Missing monitoring or alerting that would’ve identified the problem more efficiently.
- Missing protections in the system that could’ve prevented the problem.
Fix permanently
If you’ve gotten this far, the world is no longer on fire, and you should have a little bit more time to implement a permanent and more proper fix. Ideally, someone else is actually responsible for that, so you can go back to sleep, or the fix can wait until Monday.