Episode 1: Living in SYN
by Jasper Bongertz
I had just returned from lunch and walked back into the office. My colleague looked over at me and pointed to the server status page, noting that "something had just gone critical". I glanced at the screen to see that one of the servers was in fact no longer accessible.
All that was running on this web server was a standard commercial forum application for a minor online RPG. My plan to have a nice quiet coffee upon reaching the office was quickly discarded – if the forum really was down, things were going to get very uncomfortable.
Even as I was unlocking my PC, my phone rang. It could only be Christian, the project leader responsible for the RPG. I took the call anyway. "Hey! Yep, the forum's down. Yes, I've seen it, leave it with me. No, I don't know what's happened, let me look into it. Oh, and tell your GMs not to bug me, otherwise it'll just take longer," I said. The GMs are the GameMasters, staff who help the players out in-game. They're a nice bunch of guys, as long as they're not on your back wanting to know when everything's going to be back up and running before you've even had the chance to find out what's gone wrong. Christian promised me he'd take care of them and I hung up.
Hit and destroyed!
The first checks I ran confirmed the problem. Most pings to the management interface IP address were simply being dropped. The few ICMP echo replies that did make it back had huge latencies of a second or more – latencies over our leased line are usually around 10 milliseconds. I tried to load the login page in my browser, but no dice. Something was definitely wrong – the server wasn't dead exactly, but it wasn't really living either.
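A rough sketch of that first check – the address shown is a documentation placeholder, not the real management IP, and the summary line is a made-up healthy sample:

```shell
# Send a handful of probes and give up on each after two seconds:
#   ping -c 5 -W 2 192.0.2.10
# iputils ping finishes with a summary line like the one below. On a
# healthy leased line the avg field sits near 10 ms, not in the seconds.
summary='rtt min/avg/max/mdev = 9.812/10.241/11.003/0.412 ms'
# Fields are slash-separated; the fifth slash-delimited field is the average.
avg_ms=$(printf '%s\n' "$summary" | awk -F'/' '{print $5}')
echo "$avg_ms"   # -> 10.241
```

Comparing that average against the known baseline for the link is what turns a vague "it feels slow" into a measurable symptom.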
If gamers were unable to access the web site, we had a pretty big problem. The forum was the venue for a lot of interaction between players and, with the game heavily dependent on an active gaming community, it had to be up and running. Unhappy players equals fewer players, equals fewer paying customers, equals not good.
Without any great hope of success, I tried to access the system shell using SSH. If I could barely ping it, a proper TCP connection was hardly going to fare much better. And so it transpired – after a minute or so I gave up. Because the forum server runs as a virtual machine, I was still able to reach its guest operating system through the virtualisation server's console. I was presented with the standard Debian login prompt. After a disquietingly protracted pause between entering the user name and password, I was in.
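Waiting a minute for SSH to give up is the client's default behaviour; it can be told to fail faster. A minimal sketch using OpenSSH's `ConnectTimeout` option – the host alias and address here are invented for illustration:

```
# ~/.ssh/config -- abandon the connection attempt quickly instead of
# hanging for the full default TCP timeout
Host forum-vm
    HostName 192.0.2.10      # placeholder, not the real server address
    ConnectTimeout 10        # give up after 10 seconds
```

The same effect can be had one-off with `ssh -o ConnectTimeout=10 forum-vm`, which is handy during an incident when every minute of waiting counts.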
Having found my way into the virtual server's shell, the first thing I needed to do was to find the source of the load on the system. I unleashed the 'top' command and staggered back in amazement. CPU usage was well over 98% and RAM usage was at 100%. What was using all these resources was a seemingly endless list of Apache threads. Normally only a handful are active – the server was running way over its load limits.
Normally, the system can only be rebooted within previously agreed maintenance windows. But with the forum offline anyway, I had nothing to lose; I decided to start with a complete reboot.
To my relief, the server restarted as normal after a couple of minutes. I watched as the forum's online user list filled back up. Unsurprisingly, the most active discussion thread was about the downtime. The server icon in the Nagios monitoring system was now back to green and it looked like we were over the worst.