July 8th, 2014 - a date that will be forever burned in my brain. It was the date of my first “catastrophic” deploy. And the real kicker was that I didn’t even make any changes.
I had a simple task in a routine server migration: move my batch jobs from the old server to the new one, and make sure the job docs I had written for the Tivoli setup were correct. Seemed easy enough. Sure, I was out on vacation on July 7th, but I had done my prep work and it should all have gone smoothly. I did catch an error in my Tivoli job doc, but that was my fault (I had skipped over it before leaving on July 3rd for the weekend). The change finished up nicely at around 3:45 AM. My batch jobs (now being executed by Tivoli) kick off at around 6 AM, so I went back to bed to catch a few hours of sleep.
I get up at 5:50, remote back into work, and watch as both of my jobs successfully crash and burn. The first job to die ran only once a day and was self-recovering, so no harm, no foul. The second job, however, ran every two minutes for the next 13 hours. I knew it was going to be a problem, but after failing to find anyone who could pause the job in Tivoli, I decided to let it go until I got into work, and headed back to bed for another two hours of sleep. I hadn't set up any monitoring for it yet, so beyond cluttering up a Tivoli task with failed jobs, I wasn't hurting anyone. Once I got into work (at around 10:30 AM), I discovered quite a surprise: not only had someone set up monitoring on the job without notifying me, but they had configured it to open a high-priority ticket with the wrong team each time the job failed. I opened my email, read through a flurry of messages from a team with no ties to this job, and began working to resolve the failing job.
All things said and done, that day has, so far, been the worst day of my career as a software engineer. I made a ton of rookie mistakes and each one came back to bite me. All in all, I was the recipient of several questioning emails from managers and a concerned tech lead, burned through two different people's lunches to fix the problem, and received a whopping 70 high-priority incidents. There were a few key lessons I was able to glean from this whole fiasco, though. Namely:

* I am part of a team and I need to communicate. I need to relay information to my teammates about what I'm working on. This whole scenario would have gone much, much better had I just given someone a heads up. They could have stopped the job while I was on my way in.
* Don't make assumptions. Just because I didn't set up monitoring doesn't mean someone else didn't do it for me. And just because it works in the old environment doesn't mean it will work in the new one.
* Never leave a problem unless someone else can cover for me. I never should have gone back to sleep once I saw I had a problem. No matter how tired I am, it is my job to make sure the problem I find (or that finds me) gets fixed.
I learned those lessons the hard way. I don't plan on learning that way anymore.