Cloud Management Is Expensive

The Big Mistake

With modern software development processes and high-performing teams, I don't see too many major mistakes from developers. With the adoption of cloud services, when mistakes do happen there is a unique opportunity for development teams to see the actual cost of their mistakes. Recently, my team was notified that some of our code had been logging millions of exceptions in Azure for weeks. We happened to have a deployment scheduled that day, so we were able to complete a quick PR, get it merged in, and deploy it.

The code in question managed a cache of data from a database. If the database or the target table wasn't available, the code would log the error and try again. In this case, the table was missing from the database. The mistake was that the code had no retry limit. On top of that, the Azure App Config setting for how long to wait between attempts was missing, so the delay defaulted to 0 and the code kept retrying in a tight loop. Our code should have died gracefully or at least stopped trying to hit a resource that wasn't available. That was our first big mistake.
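To make the fix concrete, here is a minimal sketch of a bounded retry with a safe fallback delay. It is not our production code; it is written in Python for brevity, and every name and number in it (`load_with_retry`, `resolve_delay`, the 5-second default, the 5-attempt limit, the 60-second cap) is an assumption chosen for illustration.

```python
import math
import time

# Hypothetical values for illustration only.
DEFAULT_DELAY_SECONDS = 5.0
MAX_DELAY_SECONDS = 60.0
MAX_ATTEMPTS = 5


class CacheLoadError(Exception):
    """Raised when the cache source is still unavailable after all attempts."""


def resolve_delay(raw_value, default=DEFAULT_DELAY_SECONDS):
    """Turn a missing, zero, or invalid config value into a usable delay."""
    try:
        delay = float(raw_value)
    except (TypeError, ValueError):
        return default  # setting is missing or not a number
    if not math.isfinite(delay) or delay <= 0:
        return default  # never allow a zero-delay tight loop
    return delay


def load_with_retry(load, raw_delay, max_attempts=MAX_ATTEMPTS,
                    sleep=time.sleep, log=print):
    """Call load() up to max_attempts times, backing off between failures."""
    delay = resolve_delay(raw_delay)
    for attempt in range(1, max_attempts + 1):
        try:
            return load()
        except Exception as exc:
            log(f"cache load failed (attempt {attempt}/{max_attempts}): {exc}")
            if attempt == max_attempts:
                # Stop hitting the unavailable resource and fail loudly.
                raise CacheLoadError(
                    f"giving up after {max_attempts} attempts") from exc
            sleep(delay)
            delay = min(delay * 2, MAX_DELAY_SECONDS)  # exponential backoff
```

With a missing setting (`raw_delay=None`), this logs at most five exceptions and then gives up, instead of logging without end. Whether the process should exit at that point or keep serving stale data is a design decision the sketch leaves open.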

The Bigger Mistake

This code had been thoroughly tested in multiple regions and our database changes were all in place. The bigger mistake was that there was a separate region called "DEMO". That region is not part of my team's region stack and is not included in our pipelines. DevOps had configured a pipeline on their side to pick up our code and deploy it to DEMO. However, our SQL changes and Azure App Config changes were not part of that automated process. The tables the code depended on were never deployed to DEMO, and the config setting that controlled the retry timing was also missing. Our code logged exception after exception, but no one was watching and no alerts were set up. My team didn't own, manage, or test the DEMO region, and the team that did manage it had no idea what code they were bringing into it or what dependencies that code had.
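One way to surface this kind of gap at deploy time is a startup preflight check that refuses to run when a required setting or table is absent. The sketch below is only an illustration under assumptions: the setting key `Cache:RetryDelaySeconds` and the table name `CachedItems` are hypothetical, and the settings and table list are passed in as plain Python collections rather than read from Azure App Config or SQL.

```python
def missing_dependencies(settings, existing_tables,
                         required_settings, required_tables):
    """Return (missing setting keys, missing table names), both sorted."""
    missing_settings = sorted(
        key for key in required_settings
        if settings.get(key) in (None, "")
    )
    missing_tables = sorted(
        table for table in required_tables
        if table not in existing_tables
    )
    return missing_settings, missing_tables


def preflight(settings, existing_tables, required_settings, required_tables):
    """Fail fast with one clear error instead of retrying forever later."""
    missing_settings, missing_tables = missing_dependencies(
        settings, existing_tables, required_settings, required_tables)
    if missing_settings or missing_tables:
        raise RuntimeError(
            f"missing settings: {missing_settings}; "
            f"missing tables: {missing_tables}")


# Hypothetical names, for illustration only.
REQUIRED_SETTINGS = {"Cache:RetryDelaySeconds"}
REQUIRED_TABLES = {"CachedItems"}
```

Called as `preflight({}, set(), REQUIRED_SETTINGS, REQUIRED_TABLES)`, this raises a single error naming both gaps, which is much easier to notice in a deployment log than weeks of repeated exceptions.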

The DEMO region was created with good intentions. The DEMO region is used by Sales to showcase some of our software products to new and existing clients. When the DEMO region was requested, IT had numerous projects already in progress. So, DevOps was tasked with getting things stood up quickly and without encumbering the development teams. DevOps was successful in the initial setup, but, as an organization, our bigger mistake was that we didn't take the region seriously and manage it appropriately.

The Biggest Mistake

The biggest mistake could be that our code logged so many messages in Azure for weeks, unnoticed by my team, DevOps, and the Cloud Architects, but I think there may be more to it than that. I sat through some meetings to discuss what happened and what steps we can take to prevent this from happening again. It seemed clear to me and other leaders that, as an organization, our use of cloud-based services and resources had grown, but our monitoring and management efforts had not kept pace. Cloud resources are easy to stand up and deliver value with quickly, but they bring their own classes of technical debt and their own optimization work.

The good news for us and many other companies is that these kinds of issues are what I consider solved problems. We need to partner with our provider, internal teams, and consulting resources to get our house in order. Better to do it sooner rather than later. I shudder to think how much money we're wasting on misconfigured resources. Hopefully, I'll have some more lessons learned and good news on this topic in the future.

Cheers!