We've all been there. This is the story of my first time really breaking production.
I won't dwell too long on what I broke, and how, as the specific technical problem isn't what this post is about.
Around 3 years into my career, I'm at the company that gave me my break into writing code for a living. I've got a lot of love for the place for "taking a chance" on me (hello, Imposter Syndrome), but I've also begun to develop a small ego. In fairness, I think it was a partly deserved ego. The first project I'd ever been given at the company was to replace an old-as-heck billing system hearkening back to the days of shared hosting websites with PHP control panels in the backend.
I'd poured blood, sweat, and tears into the project. All the seniors who were meant to guide me were either fired or quit early on in the piece, so the vast majority of the technical decisions and work were left to the most junior person in the org. Sometimes I wonder if they wanted the project to fail (à la "too hard basket"), with me as a convenient sacrificial lamb. Later on, one of my longest-standing and most respected colleagues (though new to me at the time) rolled on to the project and we delivered it: a project the company had attempted, and failed at, twice before. You can imagine what that can do to an impressionable young person.
And so I carried the cognitive dissonance of thinking I was an absolute hotshot while also feeling like I didn't belong.
My colleague and I proved to be a solid pair, and so we continued working together on subsequent successful projects; a boon to a burgeoning ego.
As a small aside, one of the most impactful compliments I ever got was from a colleague as I left this company: "The company would not be where it is without you." I still think about the feeling it gave me to this day.
Eventually the projects take me toward what is now my career path: infrastructure/DevOps/SRE/cloud-wrangler person. I'd spent a good amount of time straddling the border between traditional SWE and some kind of tooling/infrastructure role, but without the access I believed I needed to do my job properly. So I ask to move to the infrastructure team, which the company graciously agrees to.
The operations team at the time had a reputation for arrogance. For the longest time in the office, a physical barrier of plants separated them from the SWE team. I've often thought this fostered a more egregious 'us and them' mentality. SWEs were, in their minds, morons deploying broken code onto their perfect machines. People were often afraid to approach the team lest they be chastised for asking inane questions or belittled in public.
And so the young, junior egotist (me) joins the team and feeds off their energy. In my first week on the team, I'm shipped off to San Francisco to work on a Very Important Project for a month or so, which does very little to help curb my ever-inflating sense of superiority.
The first part of the project is to modernise the system controlling the customer-facing DNS infrastructure. Part of the work is in a language/framework I'm overconfident in by this point. I set about my task, seeing myself as no less than Picasso at an easel.
With the code ready to go, I just needed to update the database schema. Every seasoned operator reading this just started nodding and smiling, and they're right.
We had a Galera cluster in an active-active configuration as the persistence layer for the system. The changes were to be applied on one of the active nodes and then replicated to the other. Confidently, I showed the changes to my manager at the time to get his rubber stamp.
$ mysql dns_db < my_schema_changes.sql
⏳
.. a minute passes ..
.. two minutes pass ..
.. five minutes pass ..
My pager goes off.
Bye-bye, DNS queries. The cluster has completely locked up, no traffic is getting in or out of the nodes, and I panic.
$ mysql dns_db < my_schema_changes.sql
^C^C^C
The cluster is having none of it. The CPU is pegged, and MySQL will not respond.
After I'd spent around 10 minutes trying to work out what had happened, the alerts began to clear. By this point I'm shaking from stress and anxiety; I'd taken down the customer-facing DNS in only my first week in operations.
I remember excusing myself, ostensibly to go to lunch, but in reality because I couldn't bear the thought of being around my peers judging me.
A few minutes later I was walking down Market St feeling numb, not sure what to do with myself. I was pretty sure they were going to fire me, or that I'd at least drop severely in my colleagues' estimation. I found the smallest of comforts in a McDonald's; something familiar enough that I needn't think. If you know Market St, it was (and, I believe, still is) the McDonald's with the downstairs section, on the right as you head toward Embarcadero. I have no recollection of what I ordered, if anything, but I remember sitting downstairs in the dim light facing the wall so nobody could see the tears welling up in my eyes.
I vividly remember feeling isolated. The time zones weren't favourable for calling anyone I knew back home, and I didn't have friends in the city, so I sat in the McDonald's bracing myself to face the music on my return to the office. My cognitively dissonant ego had shattered, and I was a vulnerable junior all over again.
After around 20 minutes of pulling myself together, I was ready to face judgement. I felt like a fraud walking back to the office, convinced every tech nerd in the city could see it written all over my face. The pavement was eggshells, and I gingerly stepped my way to the lobby.
And so I arrived in the office, prepared for what may come, only to find my manager calmly typing on his laptop, feet propped up on his desk like nothing had happened. The rest of my team were similarly strewn about the oversized space, variously lying on a couch, leaning way back in their chairs, or curled up on a beanbag.
My manager had noticed me, but hadn't said anything. I decided it was best to get it over with so I approached him to broach the subject of my colossal mistake.
And he said: "Oh, hey. Good lunch?"
It wasn't what I expected, but I managed to awkwardly stammer out a yes as images of myself crying in a McDonald's flashed behind my eyes.
The best I could do to start the conversation was a "so, what do we do?". He glanced over at me and shrugged, making the mouth-noise equivalent of "I dunno". What I failed to appreciate at the time was how much of an impact his attitude would have on me.
So I unlocked my laptop and got back to work, trying to understand what had happened. I spent most of the afternoon trawling through logs, eventually finding where things had gone haywire. I brought the discovery to him, and he nodded in agreement, saying he'd suspected that was the cause. Following that, we had an office-wide "debrief" to end the week, so I didn't have much opportunity to force what I thought was a necessary discussion about my mistake.
I blended into the crowd at the debrief, hoping for some anonymity. I don't recall the meeting at all, but I do remember what followed, when the attitude of my manager and team finally sank in.
I found my manager hiding at the back of the group as we broke from the meeting, in his usual posture of typing away on his laptop with his feet up. I'd decided the best way to start the conversation was with an apology.
"Hey. Umm.. I'm sorry about the DNS stuff", I said, not able to directly admit my mistake.
I knew this was when the conversation would start. We'd get into the nitty-gritty of all the things that went wrong, and how I should have performed more due diligence.
But instead, his face showed a mix of good humour and sarcasm. What followed shaped how I treat others in my career.
"That? We've all done much worse. This is your badge of honour, you're now officially an ops person." Instead of being cut down, I was offered empathy and understanding. He led me to the area where the rest of my team was, and each teammate took turns trading "war stories" and mistakes, all while smiling and shaking their heads at their own folly.
I've harped long enough on this story.
If you've made it this far, thanks for reading.
I sincerely hope you have an incident of your own, so that you get to share in the bonding experience of "breaking production". But more importantly, I hope you get to see how your team treats you when you're down.
The mark of a good manager - and by extension a good team - is not praising excellence, but humanising failure and uplifting those in need.
Find that team.