There are millions of ways in which things can go wrong. But I believe all of them can be classified by one simple question – who screwed up?
- You Screwed Up – Go fix your shit, and don’t you dare retry till you’ve fixed it.
- I Screwed Up – I’m sorry. I’m looking into it. You may retry in a while.
This simple distinction is extremely useful in a lot of places.
Error Handling in UI
If you’re building a UI, “You Screwed Up” requires you to tell the user to fix something. You can either present a message users would understand or, better, build a UI that prevents the user from screwing up in the first place. But if your database goes down when you are trying to fetch the user’s details, and you tell the user “You Screwed Up, User does not exist” – that is nasty. You have wrongly classified the error, and have therefore very conveniently absolved yourself of your sins.
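Here is a minimal sketch of that classification in Python. Everything in it is made up for illustration: `DatabaseUnavailable`, `fetch_user`, `user_lookup_message`, the `db_up` flag and the dict standing in for a database are all assumptions, not part of any real framework. The point is only that "the user does not exist" and "we could not check" take different paths and produce different messages.

```python
class DatabaseUnavailable(Exception):
    """Hypothetical error: the data store could not be reached (I Screwed Up)."""


def fetch_user(users, user_id, db_up=True):
    """Look up a user in a dict that stands in for a real database."""
    if not db_up:
        raise DatabaseUnavailable("could not reach the user store")
    return users.get(user_id)  # None means the user genuinely does not exist


def user_lookup_message(users, user_id, db_up=True):
    """Return the text a UI might show for a user lookup."""
    try:
        user = fetch_user(users, user_id, db_up)
    except DatabaseUnavailable:
        # I Screwed Up: own the failure and say that retrying later is fine.
        return "Sorry, something went wrong on our side. Please try again in a while."
    if user is None:
        # You Screwed Up: tell the user exactly what to fix.
        return f"No user found with ID {user_id!r}. Please check the ID and try again."
    return f"Hello, {user['name']}!"
```

With this shape, a database outage can never be reported as "User does not exist", because the outage is raised as its own error before the lookup result is inspected.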
Error Handling in APIs
If you’re building a REST API, “You Screwed Up” has a special status code – the 4xx series. “I Screwed Up” gets its own series, the 5xx. Distributed systems understand and respect these status codes; you must not abuse them. And if you are building a library, “You Screwed Up” is an IllegalArgumentException, or a custom exception that the caller can catch to fix the inputs.
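The same idea, sketched in Python under a few assumptions. Python has no IllegalArgumentException, so the library-style function below raises `ValueError`, which is the closest built-in equivalent. `parse_age`, `handle_request` and the `save` callback are hypothetical names, and the handler returns a plain `(status, body)` tuple instead of using a real web framework.

```python
def parse_age(raw):
    """Library-style validation: bad input raises ValueError (You Screwed Up)."""
    try:
        age = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"age must be a whole number, got {raw!r}") from None
    if age < 0:
        raise ValueError(f"age must not be negative, got {age}")
    return age


def handle_request(params, save):
    """Return (status_code, body) for a request; save is a storage callback."""
    try:
        age = parse_age(params.get("age"))
    except ValueError as exc:
        # You Screwed Up: 4xx, with a message the caller can act on.
        return 400, {"error": str(exc)}
    try:
        save(age)
    except Exception:
        # I Screwed Up: 5xx, and no internal details leaked to the caller.
        return 500, {"error": "Internal error, please retry later."}
    return 200, {"age": age}
```

Catching a broad `Exception` around `save` is a deliberate simplification here: once the input has been validated, whatever fails after that is treated as the server's fault.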
Some More Examples
- Monitoring & New Relic – “You Screwed Up” shouldn’t wake you up in the middle of the night. It’s someone else’s fault, and if you’ve told them exactly how they have screwed up, you can get your sleep.
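- Retries – “You Screwed Up” means the client or user has to fix the error before trying again. “I Screwed Up” means the client can keep retrying till it succeeds. HTTP status codes encode exactly this distinction, which is why a REST client can decide whether to retry from the status code alone.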
- Background jobs – SQS treats a 500 error as “I should retry the job”, while 400 errors are not retried. If you get a bad input but return a 500, SQS will keep retrying the job, even though that job has no chance of ever succeeding.
- Separation of Responsibilities – In a distributed team, the classification clearly tells the project manager which team the ticket should be assigned to.
- Empower the user – When you classify errors like this, the user feels in control. They understand what they did wrong, can fix it and move on with their lives. An “Oops, an error occurred” is the worst possible thing to show the user.
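The retry rule from the list above can be sketched as a small Python helper. This is an illustration, not how any particular client or queue implements it: `is_retryable` and `call_with_retries` are hypothetical names, `send` is assumed to be any zero-argument callable that returns an HTTP status code, and the attempt cap and exponential backoff are my own additions so the loop cannot run forever.

```python
import time


def is_retryable(status):
    """5xx is "I Screwed Up", so a retry may succeed; anything else is final."""
    return 500 <= status <= 599


def call_with_retries(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    Returns (last_status, attempts_made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status = send()
        if not is_retryable(status) or attempt == max_attempts:
            return status, attempt
        # Back off before the next try: 0.5s, 1s, 2s, ... with the defaults.
        sleep(base_delay * 2 ** (attempt - 1))
```

A 400 comes back after a single call, because retrying a bad request cannot help. A 503 is retried with growing pauses until it succeeds or the attempts are used up. The `sleep` parameter exists only so the delay can be swapped out in tests.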