Recent downtime

Älphämönkëy Registered User regular
Hey Guys,

Welcome back to the forums.

I had an interesting 48 hours there. Unfortunately, we ended up losing a little bit of data while I was repairing things. Outside of that, everything is back to normal, and it's good to be back and posting on our normal forums. We will be sticking around on this forum software for the next few months until we discover the new promised land. Over the next month or two, I'll keep you posted on the great forum search. If we end up rolling our own solution, I'll let you guys know how that develops as well.

The rest of this post is going to be informational about what broke and the timeline of how it all got fixed. There will be some techie bits and I will ramble. You have been warned!

For the past two years we have been plagued by the MySQL server crashing. I discovered that the MySQL server was deadlocking. I'm not an expert MySQL C++ developer, but my theory is that there is a race condition with the combination of Linux threads, FreeBSD (AMD64), and the MyISAM storage engine. This would explain why attempts to shut down gracefully would fail. In order to keep the forums running, I committed a programming sin and wrote bounce_mysql.

For the non-programmers out there, this is essentially how it worked:
1) It would start up the database server
2) If the database server had not responded for over 30 seconds, bounce_mysql would gruesomely murder the database server
3) It would then immediately start a new server
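
For the programmers, the three steps above can be sketched in a few lines of Python. This is an illustration only, not the real bounce_mysql: the name `bounce_loop`, the `probe`, `kill`, and `start` callables, and the polling interval are all assumptions made for the example. Only the 30 second timeout comes from the description above.

```python
import time
from typing import Callable, Optional


def bounce_loop(
    probe: Callable[[], bool],      # hypothetical: returns True if the server answers
    kill: Callable[[], None],       # hypothetical: hard-kills the server process
    start: Callable[[], None],      # hypothetical: starts a fresh server process
    timeout: float = 30.0,          # seconds of silence tolerated before a bounce
    poll_interval: float = 1.0,     # assumed polling rate, not from the original
    max_polls: Optional[int] = None,  # None means run forever
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Start the server, then kill and restart it whenever it goes quiet.

    Returns the number of bounces performed (only reachable if max_polls is set).
    """
    start()                          # step 1: start the database server
    last_ok = clock()
    bounces = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if probe():
            last_ok = clock()        # server answered, reset the silence timer
        elif clock() - last_ok > timeout:
            kill()                   # step 2: no answer for too long, kill it
            start()                  # step 3: immediately start a new server
            last_ok = clock()
            bounces += 1
        sleep(poll_interval)
    return bounces
```

Passing the clock and sleep functions in as parameters is only there so the loop can be exercised without actually waiting 30 seconds.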

The problem was that the new database server was responsible for cleaning up the mess left over from the gruesome murder in step two. While this process is automated and fairly reliable, MySQL really doesn't like it. You can tell by the database error thread that this process happens several times a day.

Okay, so now that you know the history, let's talk about this week. On Wednesday, the forums crashed in such a way that the automated process could not repair the damage. Fixing that requires running a command that looks at each post (we store ~20M right now) and rebuilds the list of searchable words (called an index). This process takes several hours but usually works out alright.
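
To give a feel for what "rebuilding the list of searchable words" means, here is a toy sketch in Python. It is not how MySQL builds its indexes internally; the function names, the word-splitting rule, and the sample posts are all made up for illustration. It just shows the general idea of walking every post and recording which words appear where.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")  # assumed tokenizer: runs of letters, digits, apostrophes


def build_index(posts):
    """Map each lowercase word to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in WORD.findall(text.lower()):
            index[word].add(post_id)
    return index


def search(index, query):
    """Return the ids of posts that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data
posts = {
    1: "The forums are back",
    2: "Back to the database error thread",
    3: "MySQL deadlock",
}
index = build_index(posts)
matches = search(index, "the back")
```

Every post has to be read once to rebuild the index, which is why the real thing takes hours at ~20M posts.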

I've been on the lookout for the past two years for any patch that would solve our deadlock issue. Since we were going to eat several hours of downtime anyway, I thought I would try a version of MySQL that is patched for performance, reliability, and speed. It was a long shot, but it couldn't hurt, right?

Turns out, it did hurt. The next morning I woke up to the forums having crashed again. The patched version had the same deadlock problems, but instead of just locking up it would start duplicating rows. This became a real problem because now it wasn't just the index that was corrupted, but the table data itself.

I immediately set out to restore MySQL to the old, unpatched version, but things were acting really strange. I later found out that a hard drive inside the server had died sometime during the previous 12-hour process and the RAID array was in the process of copying data to the spare hard drive. My thought process then turned to "screw it, I'm done with MySQL." I started looking for alternatives to switch to immediately, and decided on PostgreSQL and phpBB3 as a stop-gap forum until we could find our permanent home.

I very rarely get to upgrade the operating system on a database server, so I was making use of the downtime to upgrade to FreeBSD 8.1. Around 4am on Friday, I chained a bunch of build commands together in a screen session. I set my alarm for a few hours later and got some shut-eye. When I woke up, the server was dead. I'm not sure what caused it to die during the rebuilding step, but it was going to require calling our datacenter to have a tech act as "remote hands." After some investigation, we decided the core of the operating system (the kernel) was bad. At this point I decided to do something radical: instead of repairing the current system, I would format the server and install Linux. The theory went that Linux would run PostgreSQL just as well as FreeBSD, and it had the potential to run MySQL reliably (assuming my deadlock theory is true).

The rest is pretty uneventful. Once Linux was installed, we restored a backup of the forums, updated some configuration values, and turned the forums back on.

I'd like to thank all of the developers, network administrators, and community members who took time to write me. I'd especially like to thank Ramius for looking over the code with me, the IRC people that helped me test the forums, and the classy guys at IndieClick for the datacenter operations and MySQL help.

If you have any questions, feel free to ask.

Älphämönkëy on

Posts

  • Nerdgasmic __BANNED USERS regular
    edited December 2010
    I've occasionally had a page get stuck loading, requiring me to refresh the page again

    Nerdgasmic on
  • T4CT BAFTA-NOMINATED NAFTA-APPROVED Registered User regular
    edited December 2010
    Nerdgasmic wrote: »
    I've occasionally had a page get stuck loading, requiring me to refresh the page again

    this is happening to me as well

    T4CT on
    my twitter | beats music
    I made a bad game! Download it scrubs.
    Keep it Up!: iOS Android
  • skettios Registered User regular
    edited December 2010
    Was there some kinda rollback? Or is this just a backup? Cause I'm pretty sure there was a new MineCraft thread in SE++, but the old one is there and unlocked...

    skettios on
  • CheesecakeRecipe "Should not be allowed to post in the Steam Thread" - Isorn Squalor Victoria, Squalor Victoria! Registered User regular
    edited December 2010
    Yeah i've had a few timeouts.

    CheesecakeRecipe on
  • skettios Registered User regular
    edited December 2010
    T4CT wrote: »
    Nerdgasmic wrote: »
    I've occasionally had a page get stuck loading, requiring me to refresh the page again

    this is happening to me as well

    Same.

    skettios on
  • Nerdgasmic __BANNED USERS regular
    edited December 2010
    anyway, thank you for your hard work and dedication

    Nerdgasmic on
  • Älphämönkëy Registered User regular
    edited December 2010
    skettios wrote: »
    Was there some kinda rollback? Or is this just a backup? Cause I'm pretty sure there was a new MineCraft thread in SE++, but the old one is there and unlocked...

    I think we lost some data, though I couldn't tell you exactly how much.

    Älphämönkëy on
  • skettios Registered User regular
    edited December 2010
    Älphämönkëy wrote: »
    skettios wrote: »
    Was there some kinda rollback? Or is this just a backup? Cause I'm pretty sure there was a new MineCraft thread in SE++, but the old one is there and unlocked...

    I think we lost some data, though I couldn't tell you exactly how much.

    ok, just wanted to make sure I wasn't going crazy :rotate:

    thanks for all your hard work :^:

    skettios on
  • Romanian My Escutcheon Two of Forks Registered User regular
    edited December 2010
    Getting occasional server reset errors, but that might just be on my end.

    Romanian My Escutcheon on
  • Älphämönkëy Registered User regular
    edited December 2010
    The server errors are definitely real. Let me take a look at fixing them.

    Älphämönkëy on
  • T4CT BAFTA-NOMINATED NAFTA-APPROVED Registered User regular
    edited December 2010
    can't tell if the timeouts are getting more frequent or if i'm just getting unlucky

    anything posted after like afternoon wednesday is gone it seems

    T4CT on
    my twitter | beats music
    I made a bad game! Download it scrubs.
    Keep it Up!: iOS Android
  • EvilBadman DO NOT TRUST THIS MAN Registered User regular
    edited December 2010
    (11:12:24 PM) alphamonkey: its how you prove dominance to the servers
    (11:12:31 PM) alphamonkey: once you pee on them, they are your bitch

    EvilBadman on
    FyreWulff wrote: »
    I should note that Badman is fucking awesome
    XBL- Evil Badman; Steam- EvilBadman; Twitter - EvilBadman
  • Älphämönkëy Registered User regular
    edited December 2010
    This is why I stopped going into IRC. It changes me.

    Älphämönkëy on
  • Spoit *twitch twitch* Registered User regular
    edited December 2010
    skettios wrote: »
    Älphämönkëy wrote: »
    skettios wrote: »
    Was there some kinda rollback? Or is this just a backup? Cause I'm pretty sure there was a new MineCraft thread in SE++, but the old one is there and unlocked...

    I think we lost some data, though I couldn't tell you exactly how much.

    ok, just wanted to make sure I wasn't going crazy :rotate:

    thanks for all your hard work :^:

    I'm not sure about exactly when the forums died, but a couple of the posts I know I made around 2pm-ish PST were rolled back. My last visited says 2:30.

    Also, thanks for all the hard work Alpha <3

    Spoit on
  • yoshamano The fuck is this. The fuck was that. Marshall, Soviet Michigan Registered User regular
    edited December 2010
    Edit: It appears the rollback was to some point before DST ended. My time zone shows up as GMT-4 instead of -5.
    Älphämönkëy wrote: »
    This is why I stopped going into IRC. It changes me.

    hahaha

    also:

    photo.jpg

    yoshamano on
  • skettios Registered User regular
    edited December 2010
    I just got
    Warning: Memcache::connect() [function.Memcache-connect]: Can't connect to 172.17.17.136:11211, Operation timed out (60) in [path]/includes/class_datastore.php on line 222
    
    Fatal error: Unable to connect to memcache server in [path]/includes/class_datastore.php on line 224
    

    skettios on
  • Squall hap cloud Registered User regular
    edited December 2010
    I just got this error

    Warning: Memcache::connect() [function.Memcache-connect]: Can't connect to 172.17.17.136:11211, Operation timed out (60) in [path]/includes/class_datastore.php on line 222

    Fatal error: Unable to connect to memcache server in [path]/includes/class_datastore.php on line 224

    edit: o/

    Squall on
  • admanb unionize your workplace Seattle, WA Registered User regular
    edited December 2010
    Ditto the above.

    admanb on
  • skettios Registered User regular
    edited December 2010
    \o
    error buddies!

    skettios on
  • Squall hap cloud Registered User regular
    edited December 2010
    yeah Im getting them every couple minutes now

    Squall on
  • skettios Registered User regular
    edited December 2010
    Same.

    skettios on
  • unintentional smelly Registered User regular
    edited December 2010
    me too

    unintentional on
  • T4CT BAFTA-NOMINATED NAFTA-APPROVED Registered User regular
    edited December 2010
    yeah keeps dropping on me as well

    T4CT on
    my twitter | beats music
    I made a bad game! Download it scrubs.
    Keep it Up!: iOS Android
  • ArcanisTheImpotent Registered User regular
    edited December 2010
    ditto~

    ArcanisTheImpotent on
  • gtrmp Registered User regular
    edited December 2010
    Seems like I'm getting that error on about one out of every four pages.

    gtrmp on
  • Satanic Jesus Hi, I'm Liam! Registered User regular
    edited December 2010
    Just got this:

    Warning: Memcache::connect() [function.Memcache-connect]: Can't connect to 172.17.17.136:11211, Operation timed out (60) in [path]/includes/class_datastore.php on line 222

    Fatal error: Unable to connect to memcache server in [path]/includes/class_datastore.php on line 224

    EDIT: And just got it again.

    Satanic Jesus on
    my backloggery 3DS: 0533-5338-5186 steam: porcelain_cow goodreads
  • Artreus I'm a wizard And that looks fucked up Registered User regular
    edited December 2010
    Yeah I was about to post about how I have not had that problem at all but I just got it myself.

    Artreus on
    http://atlanticus.tumblr.com/ PSN: Atlanticus 3DS: 1590-4692-3954 Steam: Artreus
  • Zetx Part-time Lurker, Fixer Registered User regular
    edited December 2010
    Thanks for your hard work. I'm also getting the
    Warning: Memcache::connect() [function.Memcache-connect]: Can't connect to 172.17.17.136:11211, Operation timed out (60) in [path]/includes/class_datastore.php on line 222
    
    Fatal error: Unable to connect to memcache server in [path]/includes/class_datastore.php on line 224
    

    error.

    Zetx on
  • Artreus I'm a wizard And that looks fucked up Registered User regular
    edited December 2010
    Yeah I actually just got that error for a good 20 minutes.

    Artreus on
    http://atlanticus.tumblr.com/ PSN: Atlanticus 3DS: 1590-4692-3954 Steam: Artreus
  • End Registered User regular
    edited December 2010
    I'll take memcache errors over mysql errors any day.

    End on
    I wish that someway, somehow, that I could save every one of us
  • ASimPerson Cold... ... and hard. Registered User regular
    edited December 2010
    I haven't seen any errors yet, FWIW.

    ASimPerson on
    SE++ Forum Battle Archive | PDT is not PST | DRUNKSTUCK: A Homestuck recap
  • Xenogears of Bore Registered User regular
    edited December 2010
    Yeah, there's definitely been some posts lost.

    Xenogears of Bore on
    3DS CODE: 3093-7068-3576
  • Romanian My Escutcheon Two of Forks Registered User regular
    edited December 2010
    Errors seem to be letting up; I've been browsing for ten minutes now without running into any error screens.

    Romanian My Escutcheon on
  • End Registered User regular
    edited December 2010
    Xenogears of Bore wrote: »
    Yeah, there's definitely been some posts lost.

    It looks like we lost everything since the first table crashes on Wednesday.

    (Although, I'm not completely certain of that.)

    End on
    I wish that someway, somehow, that I could save every one of us
  • Baidol I will hold him off Escape while you can Registered User regular
    edited December 2010
    There are a couple pages missing from the faster moving threads in SE++ and skettios is right, there was a new Minecraft thread that no longer exists, but not a big deal.

    No memory errors since Alpha did whatever he does.

    Baidol on
    Steam Overwatch: Baidol#1957
  • Älphämönkëy Registered User regular
    edited December 2010
    I found the problem. Somehow our warden server gave two jails the same IP!
    db1# dsh -w web1,web2,web3,web4,web5,web6,web7,web8,web9,db1,db2 jls | grep forums
    web2:      2  172.17.17.133   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web3:      2  172.17.17.134   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web4:      1  172.17.17.135   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web5:      1  172.17.17.132   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web6:      2  172.17.17.137   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web8:      4  172.17.17.136   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web9:      2  172.17.17.133   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
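
A check like this is easy to automate. The Python sketch below parses output shaped like the listing above and reports any IP claimed by more than one host. It is a minimal example, not part of the actual tooling: the function name `find_duplicate_ips` and the assumed column layout (host, jail id, IP, hostname, path) are taken from this one listing only.

```python
from collections import defaultdict


def find_duplicate_ips(dsh_output):
    """Return {ip: [hosts]} for every IP reported by more than one host."""
    hosts_by_ip = defaultdict(list)
    for line in dsh_output.splitlines():
        fields = line.split()
        # Assumed shape per line: "web2:", jail id, IP, hostname, path
        if len(fields) < 5 or not fields[0].endswith(":"):
            continue  # skip the command line, blanks, and anything unexpected
        host = fields[0].rstrip(":")
        ip = fields[2]
        hosts_by_ip[ip].append(host)
    return {ip: hosts for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


listing = """\
web2:      2  172.17.17.133   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
web3:      2  172.17.17.134   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
web4:      1  172.17.17.135   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
web5:      1  172.17.17.132   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
web6:      2  172.17.17.137   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
web8:      4  172.17.17.136   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
web9:      2  172.17.17.133   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
"""
duplicates = find_duplicate_ips(listing)
```

On the listing above, the only IP that comes back is 172.17.17.133, shared by web2 and web9.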
    

    Älphämönkëy on
  • Linden Registered User regular
    edited December 2010
    End wrote: »
    Xenogears of Bore wrote: »
    Yeah, there's definitely been some posts lost.

    It looks like we lost everything since the first table crashes on Wednesday.

    (Although, I'm not completely certain of that.)

    But... we've kept your own reference to aforementioned table crashes over in the database error thread. Looks like about then, though. We've certainly dropped everything after the major downtime shortly after, though.
    EDIT: The break appears to fall between post 17518949 and post 17518951.

    Linden on
    What if this weren't a rhetorical question?
  • End Registered User regular
    edited December 2010
    Linden wrote: »
    End wrote: »
    Xenogears of Bore wrote: »
    Yeah, there's definitely been some posts lost.

    It looks like we lost everything since the first table crashes on Wednesday.

    (Although, I'm not completely certain of that.)

    But... we've kept your own reference to aforementioned table crashes over in the database error thread. Looks like about then, though. We've certainly dropped everything after the major downtime shortly after, though.

    Yeah. He was able to recover from that table crash.

    But the really recent ones were probably a lot more problematic (and I counted at least two tables complaining about being crashed when that happened). Things were probably really borked (well, I mean, the forums were completely inoperable instead of just a thread, so there's that.)

    End on
    I wish that someway, somehow, that I could save every one of us
  • Linden Registered User regular
    edited December 2010
    End wrote: »
    Linden wrote: »
    End wrote: »
    Xenogears of Bore wrote: »
    Yeah, there's definitely been some posts lost.

    It looks like we lost everything since the first table crashes on Wednesday.

    (Although, I'm not completely certain of that.)

    But... we've kept your own reference to aforementioned table crashes over in the database error thread. Looks like about then, though. We've certainly dropped everything after the major downtime shortly after, though.

    Yeah. He was able to recover from that table crash.

    But the really recent ones were probably a lot more problematic (and I counted at least two tables complaining about being crashed when that happened). Things were probably really borked.

    Sounds about right. See edit for what looks like the breakpoint.

    Linden on
    What if this weren't a rhetorical question?
  • Henroid Radio Demon Internet Hell Registered User regular
    edited December 2010
    Älphämönkëy wrote: »
    I found the problem. Somehow our warden server gave two jails the same IP!
    db1# dsh -w web1,web2,web3,web4,web5,web6,web7,web8,web9,db1,db2 jls | grep forums
    web2:      2  172.17.17.133   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web3:      2  172.17.17.134   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web4:      1  172.17.17.135   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web5:      1  172.17.17.132   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web6:      2  172.17.17.137   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web8:      4  172.17.17.136   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    web9:      2  172.17.17.133   forums.penny-arcade.com       /usr/jails/forums.penny-arcade.com
    

    Wow. So when those 2 jailed people were posting, were they inadvertently causing problems?

    Henroid on
    Nobody likes me but that's okay. I'm used to it.