I looked up the original timeline in my e-mail. It looks like our server went down on Friday, May 15th, 2009, and we rebuilt it on Friday, May 22nd, 2009. We were tethered to an unreliable mobile phone connection over wifi in the back of a van; thank goodness for screen...
It looks like the original VPS did come up later on the 22nd, allowing us to pull in the latest data to supplement our newly deployed system (which had been based on an older backup). So my memory was faulty: the downtime wasn't weeks, but about one week. (EDIT: Other sources say we may have been down for two weeks, but got access to the data after one week?) The rest of the recollection is accurate in that there was no news from the host for the first chunk of the outage, and only later on did we start getting updates from them.
We probably should have migrated sooner, but a combination of factors meant we didn't:
1) While our event's pre-registration was live at the time (so in that sense we were losing hundreds or thousands of dollars in sales), our event is of the type where people were likely to wait until the server was back up to pre-register. It was also not a critical point in the pre-registration process (which would be closer to the end of it). In fact, our biggest concern with restoring from backup, and one reason we were hesitant to rebuild, was what to do about registrations that had been paid and processed but that we no longer had any data about. We obviously couldn't refuse an attendee who had paid and registered just because we had lost their data... We could have reviewed our PayPal records to rebuild the important parts of the information (the names of people who paid, if not their assigned registration numbers). We were lucky that we later got access to the original data and were able to merge it back in regardless.
2) After 2 days, I redirected our DNS to a server at the local university which we controlled and posted a downtime notice. A day later, with the server still down, I put our website's design around the message to at least make it look a bit more official, and redirected all 404s to the message. We also redirected our mail to a server we controlled so that we could bring mail services back up on a temporary server.
3) By this point we were actively researching a new host to switch to. Picking a hosting provider is a big decision, and despite the urgency we still had to do our due diligence; by the end of it we were pretty much settled on Linode.
4) After the situation went from "Our server is down" to "Our host is MIA", we started gathering up all the backups we could from various servers and sources. The database and site code/content were pulled primarily from these older backups, and we refreshed the content with newer data from the Wayback Machine. We were still hesitant to restore from backup because of the difficulty of a later merge if we did get access to the original data.
5) Eventually, the situation became intolerable: we were leaving for Anime North, the largest convention in the country, and we needed our server up for promotion purposes. That's when we took the plunge and started rebuilding from our gathered backups. From the back of a car, going down the highway.
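For what it's worth, the temporary downtime notice in point 2 only takes a couple of lines of web server config. This is a sketch assuming Apache (which is what a university box of that era would likely run); the filename is made up:

```apache
# Hypothetical Apache snippet for the temporary downtime server.
# /downtime.html is an assumed filename, not our actual setup.
ErrorDocument 404 /downtime.html

# Or, to send every request to the notice regardless of status:
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/downtime\.html$
RewriteRule ^ /downtime.html [R=302,L]
```

Using a 302 (temporary) rather than a 301 matters here, so that browsers and crawlers don't cache the redirect after the real site comes back.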
It's always a difficult decision to make. How much downtime do you tolerate before you go from waiting for your server to come back online to rebuilding from backups? As a somewhat loosely organized non-profit, we're also not the kind of organization that has policies or procedures for this sort of thing.
Since then, we've at least taken precautions. As I've said, we have nightly incrementals off-site, and on-site Linode backups. At this exact moment, since our event has just ended for the year (literally three days ago), our registration system is not active and downtime would be relatively unimportant; if we did need to take emergency action, our forums are probably all we'd care about. But nevertheless, we'd probably still act sooner than we did last time.
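A nightly off-site incremental can be as simple as a cron job. This is only a sketch of one common approach (rsync with --link-dest, which hard-links unchanged files against the previous night so each snapshot is a full tree that only costs the delta); the host, paths, and schedule are all made up, not our actual setup:

```
# Hypothetical crontab entry: nightly incremental push off-site at 3:30am.
# offsite.example.com and the /backups layout are illustrative only.
30 3 * * * rsync -az --delete --link-dest=/backups/daily.1 /srv/ backup@offsite.example.com:/backups/daily.0/
```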
Of course, we've also moved from a fly-by-night host to a first-class hosting provider (Linode), so I'm relatively confident we'd be unlikely to need the off-site backups at all. If our linode's host machine should die, we can restore from Linode's on-site backups in a matter of minutes. If Linode's datacenter should go down, we can restore from the off-site backup with a little bit of rebuilding: it's an incomplete backup, so we'd need to deploy a fresh linode, layer our backup on top of it, do some checking, and be back up in about 3 hours (I've got two bonded VDSL2 lines and some other connections to my apartment, so I can push 14 megs upstream on my fastest link, and probably 20 megs up total if I combine that with cable, 3G, and free wifi). And if Linode went down entirely (nuclear bomb at Linode headquarters?), we now have a much better picture of the VPS hosting industry, so we could move to a new host and be up and running again in probably 6 to 12 hours. Of course, all these times are *after* we make the decision that the original machine is a write-off and we need to restore from backups...
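To sanity-check that 3-hour figure, here's the back-of-envelope math on the fastest link. The 19 GB backup size is a made-up example (I didn't state our actual size); only the 14 Mbit/s uplink comes from above:

```shell
# Back-of-envelope upload-time estimate for an off-site restore.
# 19 GB is a hypothetical backup size; 14 Mbit/s is the uplink above.
backup_bytes=$(( 19 * 1000 * 1000 * 1000 ))       # 19 GB, assumed
uplink_bytes_per_sec=$(( 14 * 1000 * 1000 / 8 ))  # 14 Mbit/s = 1.75 MB/s
seconds=$(( backup_bytes / uplink_bytes_per_sec ))
echo "~$(( seconds / 3600 )) hours to upload"     # ~3 hours
```

So the transfer alone eats most of the 3-hour window, which is why the deploy-and-check steps have to happen in parallel with the upload rather than after it.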