stex wrote:
However, it would be devastating to my business to lose even a day's worth of data -- so I would love to hear your thoughts on what the best practices are for maintaining a live backup.
You could look into various HA solutions (I think the Linode Library has some articles on them too), but that can be a lot of work, and much of the overhead comes from aiming at quick failover to prevent outages, as opposed to just preventing data loss.
My own setup seems more in line with your requirements: take significant (but not extreme) steps to ensure data integrity, but don't sweat small access outages or the need to manually cut over during a larger outage.
In my case, I set up a second, mirror Linode with an identical configuration (initially cloned from the primary) and maintain the two in parallel. I have both enrolled in Linode's backup service, to ensure I have a quick way to do a bare-metal recovery to a point in time no more than 24 hours old.
Use a file synchronization tool (e.g., rsync, unison, etc.) to reflect pure filesystem changes between the two machines for any of your own files. Run it frequently -- at whatever interval of data loss you can afford. Note that it's easy to say "absolutely no data loss," but in practice the exposure from a window of under a minute is typically quite small.
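As a minimal sketch, the filesystem sync above could be a single crontab entry like the following. The `/srv/data` path and the `standby.private` hostname are placeholders, and the flags assume plain rsync over SSH with key-based auth already set up:

```shell
# Run every minute; mirror /srv/data to the standby over the private network.
# -a preserves permissions/timestamps; --delete removes files on the standby
# that have been deleted on the primary, keeping the two trees identical.
* * * * *  rsync -a --delete /srv/data/ standby.private:/srv/data/ >/dev/null 2>&1
```

If changes can originate on both machines, a bidirectional tool like unison is a better fit than one-way rsync, which will silently clobber standby-side edits.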
Set up an appropriate replication system for your database, depending on the engine. For example, I use a warm standby with WAL shipping for my PostgreSQL database, with a minimum update frequency of 30-60 seconds (so at most I could lose the last 30-60 seconds of transactions). Once I upgrade to PostgreSQL 9.x I'll probably switch to the more real-time hot standby (streaming) replication. MySQL has similar replication options.
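For reference, a pre-9.0 WAL-shipping warm standby boils down to a few configuration lines. This is a sketch, not my exact config -- the archive directory and `standby.private` hostname are placeholders:

```shell
# postgresql.conf on the primary:
archive_mode = on                  # ship completed WAL segments
archive_command = 'rsync -a %p standby.private:/var/lib/pgsql/wal_archive/%f'
archive_timeout = 60               # force a segment switch at least every 60s,
                                   # bounding data loss to roughly that window

# recovery.conf on the standby:
restore_command = 'cp /var/lib/pgsql/wal_archive/%f %p'
```

In practice the standby's restore_command is usually pg_standby (or a similar waiting wrapper) rather than a bare cp, so the standby stays in recovery and keeps replaying new segments as they arrive instead of finishing recovery and coming up.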
If you have any other applications whose state cannot be captured by filesystem or database synchronization, identify a way to replicate their changes and include that as well.
Having the standby Linode in the same data center lets you use the private network for all transfers, so you can crank down the latency without worrying about bandwidth usage. It won't protect against a network outage affecting clients (but you said that wasn't a major concern) or a data-center-wide catastrophe, but you could provision a second machine in a different data center for that purpose and sync to it from your warm standby.
This approach has only a modest cost, with pretty good coverage. It does require manual intervention in a true disaster, and it does not completely eliminate the window for loss. But in my experience, trying to have no window at all increases costs and management time severely, and introduces its own failure modes. You used the term "devastating," which might imply you're willing to jump through all the hoops required to eliminate that last little bit of risk. In practice, though, there's almost always some wiggle room in the risk/cost analysis, and it's far easier to say "absolutely no risk of data loss" than to implement it -- much less test it, and keep testing to be sure you got it right.
-- David