git commit -m "Initial Commit"
“Why does Support know about issues before we do?” asked the CTO. “A better question is, why do customers know about issues before we do?”
A series of emails followed the conversation, giving some more insight into what he was thinking. He knew the answer, and he knew that we didn’t. Almost all of us had written what we would have considered “enterprise” software. We had cool tools like New Relic and Kibana that we would use during an incident to help point us (eventually) at the cause of the problem. We were able to (eventually) fix whatever was buggy or broken and get users back on track. What we were failing at, however, was getting ahead of users.
We needed to get ahead of users.
Engineering at InVisionApp is awesome. We’re a design-driven company. For companies that use prototyping tools, like… ours… it’s a wonderful thing. The projects we work on have already been designed and prototyped and tested and everything else before they come to us, so we get to do what we love to do: build all the things. We aren’t building prototypes that get tossed, we build production systems that get used by millions of people immediately. It’s exciting, it’s challenging, it’s rewarding, and did I mention it’s challenging? It’s rewarding. But it’s challenging.
Like any codebase, ours gets bigger and more complex over time, grows new dependencies, and things break. Like every good developer dreams of doing, we write tests so we can catch code that would break things before it gets merged, we use linting to make sure all of our tabs|spaces|parens are aligned the same across the entire organization, etc etc etc.
But we arrived at a place where our tests weren’t catching everything. Systems would go down, bugs would be introduced, something would get used in a way that wasn’t intended and… uh oh. The worst part was that we were told when things were broken by our support team… and they were told by our users.
We needed to get ahead of users.
git commit --amend -m "Figuring out our flaws"
Back to my conversation with the CTO… “Why does Support know about issues before we do?” asked the CTO. “A better question is, why do customers know about issues before we do?”
“I think we just need to do a better job of looking at the logs,” I said, faking confidence in my response.
“That will tell us what happens after users are experiencing issues. What we need to do is implement monitoring and automatic alerts so that we can know about issues before users experience them,” the CTO said, with actual confidence, backed by experience.
This was frustratingly profound. How could I have missed something so simple? Of course we should set up automatic alerts on our monitoring systems. The picture of what needed to be done was becoming clearer.
With lots of confidence, but little experience, I thought to myself “I need to monitor all the things!”
git commit -m "Initial monitoring setup"
Now here’s a caveat: you can’t monitor all the things. You need to identify the value that XYZ service delivers to users, and set up your monitoring to ensure that this value is being delivered. It took us some time to figure out exactly what that meant and how to do it, but like anything in life, it’s more about the journey than the destination.
At first, we made it rain monitoring on everything we could think of. My team is responsible for commenting throughout all of our various systems, so we created graphs in DataDog for things like:
- Comments added from Share links
- Comments added from Console
- Comments added from Inbox
- Comments added via quick replies
- Comments added from anonymous users
- Comments edited
- Duration of calls to create comments
- Duration of calls to delete comments
- CPU usage of the database
- Etc, etc, etc
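Counters like the ones above are cheap to emit from application code: DataDog’s agent listens for StatsD-style datagrams over UDP, so each event is a single fire-and-forget packet. Here’s a minimal sketch of that wire format in plain Python (the metric names and tags are hypothetical, not our actual ones, and the real DogStatsD client libraries handle this for you):

```python
import socket

def dogstatsd_metric(name, value, metric_type="c", tags=None):
    """Build a DogStatsD-format line: name:value|type|#tag1,tag2."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(tags)
    return line

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local DataDog agent (default port 8125)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("utf-8"), (host, port))
    finally:
        sock.close()

# Hypothetical examples: count a comment created from a Share link,
# and record the duration of a create-comment call in milliseconds.
send_metric(dogstatsd_metric("comments.added", 1, "c", ["source:share_link"]))
send_metric(dogstatsd_metric("comments.create.duration", 42, "ms", ["endpoint:create"]))
```

Because it’s UDP, emitting a metric never blocks or fails the request path, which is what makes it safe to sprinkle these calls everywhere.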
We had a lot of data. We overlaid the deployment events on top of these graphs so we could see how a deployment affected the timings and counts of our metrics. It was interesting to see how our code changes affected (positively or negatively) the performance of our API calls or error counts. We finally felt like we were on to something.
git commit -m "Initial alerting setup"
Looking at ALL of the data at once was a bit overwhelming. Over time we discovered (through a bit of trial and error) that we had a few key metrics that could warn us before users were negatively impacted, and some metrics that could tell us when code changes caused users to experience a bug in real time. Specifically, we were looking at:
- CPU usage of the database
- Error rates
We set up automatic alerting for when CPU usage or error rates exceeded a certain threshold. This was, somewhat surprisingly, sufficient for us. There’s no way we would have thought these two metrics would have been all we needed without going through the process and looking at all of our options. We leveraged DataDog’s alerting system to talk to PagerDuty and, in general, it’s been wonderful (there have been a few times when the CPU spiked and it didn’t have any effect on our systems, but this is rare).
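The threshold check itself is the easy part; picking those two metrics was the hard part. In practice this logic lives in a DataDog monitor query, not in our own code, but the decision it encodes boils down to something like the sketch below (the threshold values and the “sustained vs. spike” rule here are invented for illustration):

```python
# Illustrative thresholds -- not our actual production values.
CPU_THRESHOLD = 80.0          # percent, must be sustained across samples
ERROR_RATE_THRESHOLD = 0.02   # page if more than 2% of requests error

def should_page(cpu_samples, error_rate):
    """Page when database CPU stays above threshold for every recent
    sample (so a single one-off spike doesn't wake anyone up), or when
    the error rate crosses its threshold at all."""
    sustained_cpu = len(cpu_samples) > 0 and min(cpu_samples) > CPU_THRESHOLD
    return sustained_cpu or error_rate > ERROR_RATE_THRESHOLD

# A brief spike with healthy error rates doesn't page; sustained load does.
print(should_page([40, 95, 50], 0.001))  # -> False (one spike, low errors)
print(should_page([85, 90, 88], 0.001))  # -> True  (sustained CPU)
```

Requiring the CPU threshold to hold across a window, rather than on a single data point, is what kept those occasional harmless spikes from paging anyone.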
git commit -m "Summary of all the things"
Here’s the catch. You’re never done with monitoring/alerting. We have identified key metrics for now, but they will change and evolve over time. What’s valuable to monitor today might not make sense tomorrow, and we may discover other metrics that better represent the user experience. Like most things in life, it’s a journey, and not a destination.
Monitoring the right things has become the focus of our journey. We don’t always get it right the first time, but a focus on building with monitorability in mind has fundamentally shifted how we think about operating our services in production. We have learned to focus on a few key metrics that help alert us when users are experiencing issues, but more importantly we are now able to detect and resolve issues before users experience them.
What would it look like if you knew about issues before your customers did?