Hey Guys,
Welcome back to the forums.
I had an interesting 48 hours there. Unfortunately, we ended up losing a little bit of data while I was repairing things. Outside of that, everything is back to normal, and it's good to be back and posting at our normal forums. We will be sticking around on this forum software for the next few months until we discover the new promised land. Over the next month or two, I'll keep you posted on the great forum search. If we end up rolling our own solution, I'll let you guys know how that develops as well.
The rest of this post is going to be informational about what broke and the timeline of how it all got fixed. There will be some techie bits and I will ramble.
You have been warned!
For the past two years we have been plagued with the MySQL server crashing. I discovered that the MySQL server was deadlocking. I'm not an expert MySQL C++ developer, but my theory is that there is a race condition with the combination of Linux threads, FreeBSD (AMD64), and the MyISAM storage engine. This would explain why attempts to shut down gracefully would fail. In order to keep the forums running, I committed a programming sin and wrote bounce_mysql.
For the non-programmers out there, this is essentially how it worked:
1) It would start up the database server
2) If the database server had not responded for over 30 seconds, bounce_mysql would gruesomely murder the database server
3) It would then immediately start a new server
The problem was that the new database server was responsible for cleaning up the mess left over from the gruesome murder in step two. While this process is automated and fairly reliable, MySQL really doesn't like it. You can tell by the database error thread that this process happens several times a day.
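For the programmers, the three steps above can be sketched in Python roughly as follows. This is an illustration only, not the actual bounce_mysql script: the function name `bounce_loop` and the `start`, `is_responsive`, and `kill` callbacks are hypothetical, and how the real script launched, probed, or killed the server is not stated in this post.

```python
import time
from typing import Callable


def bounce_loop(
    start: Callable[[], object],
    is_responsive: Callable[[object], bool],
    kill: Callable[[object], None],
    timeout: float = 30.0,        # "has not responded for over 30 seconds"
    poll_interval: float = 1.0,   # assumed; the post does not give a value
    max_bounces: int = 3,         # assumed cap so the sketch terminates
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the start / watch / kill / restart cycle and return the bounce count."""
    bounces = 0
    server = start()                       # step 1: start the database server
    last_ok = clock()
    while bounces < max_bounces:
        if is_responsive(server):
            last_ok = clock()              # healthy: reset the timer
        elif clock() - last_ok > timeout:
            kill(server)                   # step 2: hard-kill the stuck server
            server = start()               # step 3: immediately start a new one
            last_ok = clock()
            bounces += 1
        sleep(poll_interval)
    return bounces
```

In a real script, `start` might wrap `subprocess.Popen`, `kill` might call the process's `kill()` method, and `is_responsive` might attempt a short database connection; all of those are assumptions. Note that the sketch never asks the server to shut down cleanly, which is exactly why the replacement server has to repair tables on startup.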
Okay, so now that you know the history, let's talk about this week. On Wednesday, the forums crashed in such a way that the automated process could not repair the damage. Fixing that kind of crash requires running a command that looks at each post (we store ~20M right now) and rebuilds the list of searchable words (called an index). This process takes several hours but usually works out alright.
I've been on the lookout for the past two years for any patch that would solve our deadlock issue. Since we were going to eat several hours of downtime anyway, I thought I would try a version of MySQL that is patched for performance, reliability, and speed. It was a long shot, but it couldn't hurt, right?
Turns out, it did hurt. The next morning I woke up to the forums having crashed again. The patched version had the same deadlock problems, but instead of just locking up it would start duplicating rows. This became a real problem because now it wasn't just the index that was corrupted, but the table data itself.
I immediately set out to restore MySQL to the old, unpatched version, but things were acting really strange. I later found out that a hard drive inside the server had died sometime during the previous 12-hour process and the RAID array was in the process of copying data to the spare hard drive. My thought process then turned to "screw it, I'm done with MySQL," and I started looking for alternatives to switch to immediately. I decided on using PostgreSQL and phpBB3 as a stop-gap forum until we could find our permanent home.
I very rarely get to upgrade the operating system on a database server, so I was making use of the downtime to upgrade to FreeBSD 8.1. Around 4am on Friday, I chained a bunch of build commands together in a screen session. I set my alarm for a few hours later and got some shut-eye. When I woke up, the server was dead. I'm not sure what caused it to die during the rebuilding step, but it was going to require calling our datacenter to have a tech act as "remote hands." After some investigation we decided the core of the operating system (the kernel) was bad. At this point I decided to do something radical: instead of repairing the current system, I would format the server and install Linux. The theory went that Linux would run PostgreSQL just as well as FreeBSD and it had the potential to run MySQL reliably (assuming my deadlock theory is true).
The rest is pretty uneventful. Once Linux was installed, we restored a backup of the forums, updated some configuration values, and turned the forums back on.
I'd like to thank all of the developers, network administrators, and community members who took time to write me. I'd especially like to thank Ramius for looking over the code with me, the IRC people that helped me test the forums, and the classy guys at IndieClick for the datacenter operations and MySQL help.
If you have any questions, feel free to ask.
Posts
this is happening to me as well
Same.
I think we lost some data, though I couldn't tell you exactly how much.
ok, just wanted to make sure I wasn't going crazy :rotate:
thanks for all your hard work :^:
anything posted after like afternoon wednesday is gone it seems
(11:12:31 PM) alphamonkey: once you pee on them, they are your bitch
I'm not sure about exactly when the forums died, but a couple of the posts I know I made around 2pm-ish PST were rolled back. My last visited says 2:30.
Also, thanks for all the hard work Alpha
hahaha
also:
Warning: Memcache::connect() [function.Memcache-connect]: Can't connect to 172.17.17.136:11211, Operation timed out (60) in [path]/includes/class_datastore.php on line 222
Fatal error: Unable to connect to memcache server in [path]/includes/class_datastore.php on line 224
edit: o/
error buddies!
Warning: Memcache::connect() [function.Memcache-connect]: Can't connect to 172.17.17.136:11211, Operation timed out (60) in [path]/includes/class_datastore.php on line 222
Fatal error: Unable to connect to memcache server in [path]/includes/class_datastore.php on line 224
EDIT: And just got it again.
error.
It looks like we lost everything since the first table crashes on Wednesday.
(Although, I'm not completely certain of that.)
No memory errors since Alpha did whatever he does.
But... we've kept your own reference to aforementioned table crashes over in the database error thread. Looks like about then, though. We've certainly dropped everything after the major downtime shortly after, though.
EDIT: Break between [post=17518949]here[/post] and [post=17518951]here[/post], it appears.
Yeah. He was able to recover from that table crash.
But the really recent ones were probably a lot more problematic (and I counted at least two tables complaining about being crashed when that happened). Things were probably really borked (well, I mean, the forums were completely inoperable instead of just a thread, so there's that.)
Sounds about right. See edit for what looks like the breakpoint.
Wow. So when those 2 jailed people were posting, were they inadvertently causing problems?