Dev Tools to Make Debugging Easier

Throughout my career, I’ve found that debugging is one of the most valuable skills a developer can have. This can range from debugging small issues like “why is this test failing” to debugging large production issues across multiple systems.

Most of us have had the experience of banging our head against the wall on a problem only to have someone else come by and solve it with seemingly no effort.

But what separates someone who is good at debugging from someone who isn’t? And how do you actually get better at it? The most obvious answer is… do it a lot and you’ll get better at it. Put in your 10,000 hours and come out the other side as an expert.

However, that’s not really the whole story. The tooling you use and how you communicate internally can have a massive impact on your team’s ability to diagnose and solve issues.

A simple example

Here’s a block of code that adds together all the numbers in an array:

function addAll(arrOfNumbers) {
    let sum = 0
    for (const number in arrOfNumbers) {
        sum += number;
    }
    return sum
}

This is one of those blocks of code that’s so simple, it’s obviously correct, right? Let’s test it:

addAll([]) // returns 0
addAll([0]) // returns '00'

Huh… that’s strange. Well, it’s JavaScript, so it’s probably treating the number as a string for some reason? We’ve all seen the Wat video; this is some weird type thing. Let’s fix it:

function addAll(arrOfNumbers) {
    let sum = 0
    for (const number in arrOfNumbers) {
        // This is required because javascript is weird
        sum += Number(number);
    }
    return sum
}

Back to testing:

addAll([]) // returns 0
addAll([0]) // returns 0 
addAll([3]) // returns 0
addAll([100]) // returns 0

Ok, JavaScript, very funny. Always returning 0 now? Maybe it’s not entering the loop? We know that’s not true, because the Number change altered the result for [0], and that line only runs inside the loop. We need more data:

addAll([1, 2]) // returns 1
addAll([1, 3]) // returns 1
addAll([4, 10]) // returns 1
addAll([4, 15]) // returns 1
addAll([4, 4, 4]) // returns 3
addAll([4, 1, 2]) // returns 3

Maybe it’s not a JavaScript issue? The return value seems to be related to the length of the array, and only the length of the array. If we want to be very thorough, we can even try this:

addAll([null, undefined, "wtf"]) // returns 3

So the values in the array are definitely unused. You might already know the issue, but if you try printing out number, you’ll find that it’s the sequence 0, 1, 2, etc. (as strings, which also explains the earlier '00').

Eventually, we’ll find our answer: “It’s iterating over the indices instead of the values.” A quick search will show that we are using for ... in instead of for ... of, and we can quickly resolve our issue.
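
For reference, here is a minimal sketch of the corrected function. It assumes the only change needed is swapping for ... in for for ... of; the Number workaround from the earlier attempt is no longer necessary, so it is dropped here:

```javascript
// Corrected version: for...of iterates over the array's values,
// whereas for...in iterates over its keys (the indices, as strings).
function addAll(arrOfNumbers) {
    let sum = 0;
    for (const number of arrOfNumbers) {
        sum += number;
    }
    return sum;
}

console.log(addAll([]));        // 0
console.log(addAll([1, 2]));    // 3
console.log(addAll([4, 4, 4])); // 12
```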

What is an ideal debugging environment?

I know the addAll example is a little contrived, but it also illustrates an important point. We got to our answer way faster because we could repeatedly make calls to addAll, inspect the results, and try something else.

This is an ideal debugging environment, and its main properties are:

  • I’m able to reproduce the issue locally
  • I can quickly iterate on potential fixes
  • I get immediate feedback when I fix it
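
For a single function, one way to get this kind of loop is to write the reproduction down as a small table of cases and re-run it after every change. The sketch below is illustrative only: findFailures and the cases table are hypothetical names, and addAllBuggy is the original for ... in version of addAll from earlier.

```javascript
// The original, buggy implementation from the example above.
function addAllBuggy(arrOfNumbers) {
    let sum = 0;
    for (const number in arrOfNumbers) {
        sum += number;
    }
    return sum;
}

// Hypothetical helper: run every case against an implementation and
// return the ones whose actual result differs from the expected one.
function findFailures(fn, cases) {
    const failures = [];
    for (const { input, expected } of cases) {
        const actual = fn(input);
        if (actual !== expected) {
            failures.push({ input, expected, actual });
        }
    }
    return failures;
}

const cases = [
    { input: [], expected: 0 },
    { input: [0], expected: 0 },
    { input: [1, 2], expected: 3 },
    { input: [4, 4, 4], expected: 12 },
];

// Re-run this after each attempted fix; an empty array means every case passes.
console.log(findFailures(addAllBuggy, cases));
```

Swapping in a candidate fix for addAllBuggy and re-running gives the immediate feedback described in the last bullet.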

How can we approach this with our architecture?

Reproducibility

The name of the game here is bridging the gap between what’s running in production and what’s running locally. The most obvious requirement is that your entire stack should be something your developers can run locally, or in remote development environments like Nimbus.

In addition, you may need to mock out external services like S3, which you can do with services like LocalStack. Ideally, there isn’t a thing you can do in prod that you can’t also do locally.
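
As a rough sketch of what that can look like for S3, the function below builds client options that point at LocalStack when running locally and at real AWS otherwise. It assumes the AWS SDK for JavaScript v3 and LocalStack listening on its default port 4566. The APP_ENV variable, the fallback region, and the s3ClientOptions name are hypothetical, so adjust them to your own setup.

```javascript
// Hypothetical helper: choose S3 client options based on the environment.
// APP_ENV, the fallback region, and the placeholder credentials are
// assumptions made for illustration.
function s3ClientOptions(env) {
    const region = env.AWS_REGION || "us-east-1";
    if (env.APP_ENV === "local") {
        return {
            region,
            // LocalStack's default endpoint on the local machine.
            endpoint: "http://localhost:4566",
            // Path-style URLs avoid needing per-bucket hostnames on localhost.
            forcePathStyle: true,
            // Placeholder values; LocalStack typically accepts these as-is.
            credentials: { accessKeyId: "test", secretAccessKey: "test" },
        };
    }
    // Elsewhere, let the SDK resolve the endpoint and credentials itself.
    return { region };
}

// With the SDK installed, the result would be passed to the client constructor:
//   const { S3Client } = require("@aws-sdk/client-s3");
//   const s3 = new S3Client(s3ClientOptions(process.env));
console.log(s3ClientOptions({ APP_ENV: "local" }));
```

Keeping the switch in one place means the rest of the code talks to S3 the same way in every environment.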

Docker is a classic tool here. Even if you don’t use Docker for your application, it can be valuable as an easy way to spin up the exact same version of your database that you use in production.

Data Access

Reproducibility is not just about your infrastructure; it’s also about your data. Some bugs only happen on a specific customer’s data.

It could be the customer’s scale or that one small hack you put in just for them. Whatever it is, putting in place a way for developers to access (or request access to) production data will help bridge the gap when they can’t reproduce the issue.

If you can’t give access, incorporating tools like Faker can help you generate realistic-looking data to seed your non-prod deployments with.
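
For debugging, it helps if seeded data is repeatable, so that the same seed produces the same records on every machine. Faker’s own API is not shown here. The sketch below is a dependency-free stand-in that uses a small linear congruential generator, and makeRng, pick, fakeCustomers, and the record fields are all hypothetical.

```javascript
// Small seeded pseudo-random generator (a linear congruential generator).
// Not suitable for anything security-related; it only needs to be repeatable.
// The seed is expected to be a non-negative integer.
function makeRng(seed) {
    let state = seed % 4294967296;
    return function next() {
        state = (state * 1664525 + 1013904223) % 4294967296;
        return state / 4294967296; // value in [0, 1)
    };
}

const FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret", "Dennis"];
const PLANS = ["free", "team", "enterprise"];

// Pick one element of a list using the seeded generator.
function pick(rng, list) {
    return list[Math.floor(rng() * list.length)];
}

// Hypothetical record shape; replace the fields with whatever your schema needs.
function fakeCustomers(seed, count) {
    const rng = makeRng(seed);
    const customers = [];
    for (let id = 1; id <= count; id++) {
        const name = pick(rng, FIRST_NAMES);
        customers.push({
            id,
            name,
            email: `${name.toLowerCase()}${id}@example.com`,
            plan: pick(rng, PLANS),
            seats: 1 + Math.floor(rng() * 500),
        });
    }
    return customers;
}

console.log(fakeCustomers(42, 3));
```

Because the output depends only on the seed, a teammate can regenerate exactly the dataset that triggered a bug.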

Proactive and Reactive Logging

If all else has failed and you can’t reproduce the issue, your next best bet is to catch the issue in the act. Hopefully someone thought about it and instrumented your code with great logging.

In reality, that doesn’t always happen, so you’ll need to be able to quickly ship logging changes to production. The thing to prioritize here is how quickly you can get those logging changes live.

Even companies I worked at that had slow release cycles figured out ways to allow exceptions for logging/debugging changes, because of how difficult debugging in the dark is. If you are using Docker, tools like Depot can help speed up your builds so you can iterate faster.

Log Contexts and Tracing

When thinking about logging, you’ll want to make sure that each log comes with all the information you need to debug the issue. Tools like Sentry have contexts which let you set structured information on errors.

Sentry.setContext("metadata", {
  "thisIsUsefulFor": "debugging"
});

As your infrastructure gets more complicated, you should consider adding traces so you can follow the request around and see everything that it did.
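
Before adopting a full tracing system, a lighter-weight version of the same idea is to attach a request ID to every log line so that one request’s logs can be grouped together. The sketch below assumes Node.js and uses its built-in AsyncLocalStorage. The names withRequestContext, log, and handleRequest are hypothetical, and this is not how Sentry or any particular tracing library implements it.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomUUID } = require("node:crypto");

const requestContext = new AsyncLocalStorage();

// Run fn with a context object that any code it calls (sync or async) can read.
function withRequestContext(context, fn) {
    return requestContext.run({ requestId: randomUUID(), ...context }, fn);
}

// Build a structured log entry that automatically includes the current context.
function log(message, fields = {}) {
    const entry = { message, ...requestContext.getStore(), ...fields };
    console.log(JSON.stringify(entry));
    return entry;
}

// Hypothetical request handler: none of the log calls pass the context explicitly.
async function handleRequest(userId) {
    return withRequestContext({ userId }, async () => {
        log("request started");
        await new Promise((resolve) => setTimeout(resolve, 10));
        return log("request finished", { status: 200 });
    });
}

handleRequest("user_123");
```

Both log lines carry the same requestId and userId, so filtering on either one reconstructs what that request did.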

Reducing noise - How do we know there’s an issue?

Imagine how much harder debugging addAll would be if the only information you got about it was a ticket that says “The customer is saying that it never returns anything.”

Is it erroring? How is it being used? Maybe they are passing in bad data? If so, this might actually be the expected behavior!

The sooner you can get clarity on exactly what the issue is, the faster you can debug it. Tools like Highlight and PostHog provide “Session Replay”, which lets you see a recording of what a user experienced.

Tools like PropelAuth provide “User Impersonation” which lets you log into your product as your user, diagnose the issue, and check that the fix actually worked. If we can’t reproduce the issue when we are literally acting as the user, we definitely wouldn’t be able to reproduce it anywhere else.

Summary

Debugging is a critical skill that all developers should work on. However, tooling can make the difference between carefully debugging an airplane’s software while it’s flying and leisurely iterating on the software until we are confident we’ve fixed the issue.

If you are finding that issues take longer to debug than you’d expect, try looking at your stack and seeing what’s missing. Are you missing a critical component of the stack from your local testing? Do your engineers not have enough test data to appropriately test edge cases?

Whatever the case may be, think about how easy it is to debug a single function like addAll and then use that to determine what’s missing from your workflow.