Welcome back to the forums.
I had an interesting 48 hours there. Unfortunately, we ended up losing a little bit of data while I was repairing things. Outside of that, everything is back to normal, and it's good to be back and posting at our normal forums. We will be sticking around on this forum software for the next few months until we discover the new promised land. Over the next month or two, I'll keep you posted on the great forum search. If we end up rolling our own solution, I'll let you guys know how that develops as well.
The rest of this post is going to be informational about what broke and the timeline of how it all got fixed. There will be some techie bits and I will ramble. You have been warned!
For the past two years we have been plagued with the MySQL server crashing. I discovered that the MySQL server was deadlocking. I'm not an expert MySQL C++ developer, but my theory is that there is a race condition with the combination of Linux threads, FreeBSD (AMD64), and the MyISAM storage engine. This would explain why attempts to shut down gracefully would fail. In order to keep the forums running, I committed a programming sin and wrote bounce_mysql.
For the non-programmers out there, this is essentially how it worked:
1) It would start up the database server
2) If the database server had not responded for over 30 seconds, bounce_mysql would gruesomely murder the database server
3) It would then immediately start a new server
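For the programmers, the three steps above can be sketched roughly like this. This is not the real bounce_mysql, just a minimal illustration of the same idea. The `Watchdog` class and its `start`, `ping`, and `kill` callbacks are hypothetical names made up for this example; in a real setup they would launch the server, try a quick connection, and force-kill the process.

```python
import time


class Watchdog:
    """Bounce a server that has stopped responding (illustrative sketch only)."""

    def __init__(self, start, ping, kill, timeout=30.0, clock=time.monotonic):
        self.start = start      # hypothetical callback: launches the server
        self.ping = ping        # hypothetical callback: True if the server answered
        self.kill = kill        # hypothetical callback: force-kills the server
        self.timeout = timeout  # seconds of silence tolerated before bouncing
        self.clock = clock      # injectable so the logic can be tested
        self.last_ok = None

    def begin(self):
        # Step 1: start up the database server.
        self.start()
        self.last_ok = self.clock()

    def check(self):
        """Run one health check; return True if the server was bounced."""
        now = self.clock()
        if self.ping():
            self.last_ok = now
            return False
        if now - self.last_ok > self.timeout:
            # Step 2: no response for over `timeout` seconds, so kill it.
            self.kill()
            # Step 3: immediately start a new server.
            self.start()
            self.last_ok = self.clock()
            return True
        return False
```

You would call `begin()` once and then `check()` in a loop every few seconds. Notice that nothing here gives the server a chance to shut down cleanly, which is exactly why it counts as a sin.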
The problem was that the new database server was responsible for cleaning up the mess left over from the gruesome murder in step two. While this process is automated and fairly reliable, MySQL really doesn't like it. You can tell by the database error thread that this process happens several times a day.
Okay, so now that you know the history, let's talk about this week. On Wednesday, the forums crashed in such a way that the automated process could not repair it. Recovering from that kind of crash requires running a command that looks at each post (we store ~20M right now) and rebuilds the list of searchable words (called an index). This process takes several hours but usually works out alright.
I've been on the lookout for the past two years for any patch that would solve our deadlock issue. Since we were going to eat several hours of downtime anyway, I thought I would try a version of MySQL that is patched for performance, reliability, and speed. It was a long shot, but it couldn't hurt, right?
Turns out, it did hurt. The next morning I woke up to the forums having crashed again. The patched version had the same deadlock problems, but instead of just locking up it would start duplicating rows. This became a real problem because now it wasn't just the index that was corrupted, but the table data itself.
I immediately set out to restore MySQL to the old, unpatched version, but things were acting really strange. I later found out that a hard drive inside the server had died sometime during the previous 12-hour process and the RAID array was in the process of copying data to the spare hard drive. My thought process then turned to "screw it, I'm done with MySQL." I decided to start looking for alternatives to switch to immediately. I settled on PostgreSQL and phpBB3 as a stop-gap forum until we could find our permanent home.
I very rarely get to upgrade the operating system on a database server, so I was making use of the downtime to upgrade to FreeBSD 8.1. Around 4am on Friday, I chained a bunch of build commands together in a screen session. I set my alarm for a few hours later and got some shut-eye. When I woke up, the server was dead. I'm not sure what caused it to die during the rebuilding step, but it was going to require calling our datacenter to have a tech act as "remote hands." After some investigation we decided the core of the operating system (the kernel) was bad. At this point I decided to do something radical: instead of repairing the current system, I would format the server and install Linux. The theory went that Linux would run PostgreSQL just as well as FreeBSD and it had the potential to run MySQL reliably (assuming my deadlock theory is true).
The rest is pretty uneventful. Once Linux was installed, we restored a backup of the forums, updated some configuration values, and turned the forums back on.
I'd like to thank all of the developers, network administrators, and community members who took time to write me. I'd especially like to thank Ramius for looking over the code with me, the IRC people that helped me test the forums, and the classy guys at IndieClick for the datacenter operations and MySQL help.
If you have any questions, feel free to ask.