node.js + CouchDB == Crazy Delicious by Mikeal Rogers http://jsconf.eu/2010/speaker/nodejs_couchdb_crazy_delicious.html
I was watching this a couple of days ago, and I've been thinking about how to deal with instance and service failures (think of sending emails as a "service"). It's easy to make sure that only one email is sent if only one server sends emails, but if that machine fails, no emails get sent at all.

The scenario: you compose an email while offline and save it to your local couch instance. Later it gets replicated to one of the couchdb instances in your cloud. And then:

1. You have the date when it was saved on the phone, etc. If you also had a timestamp for when that replication happened, you could have a chain of couchdb instances try to send the email, each one only after enough time has passed since it was replicated to your cloud of couchdb instances. instance_a would try immediately, instance_b tries if it hasn't been taken within X minutes, and so on for instance_c. See [A].

2. When instance_a wants to send the email, it updates the state to "taking" and then waits for instance_b and instance_c to ack the taking by adding fields to the current document. Oops: instance_b and instance_c will race more often than not and you'll get a conflict, so the acks need to go in separate temporary state-tracking documents. You still need [A]; otherwise, if there are no other instances, you'll wait forever for acks that won't happen.

3. You have one instance that sends emails, and you live with the downtime if that instance fails or some other failure prevents email from being sent.

4. You send periodic test emails to make sure they are being sent, and if they are not, you take over the function on instance_$self. See [B].

A) This only works assuming that all of your cloud couchdb instances are replicating to each other correctly at the moment. If they are not, more than one instance can decide the email is still untaken, and now you have N > 1 emails sent out. (And imagine if what's happening is something more important than receiving an email, or receiving more than one email.) To keep this from happening you need a couchdb instance heartbeat (maybe have an app update a document that describes that instance's "registration" in the system with the current timestamp every 60 seconds) and a STONITH system to kill any couchdb instances that stop updating their document.

B) Do you still need [A]? Maybe it's good enough to know that the test email didn't get back to you, but the instance may still be sending emails to other places, so it seems [A] is still needed. Now you also need a service registration system (to make sure this and other services like it are only running on one instance).

So these are some of the ideas I'm coming up with on this issue. I'm looking for more input. What would you do?
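To make the timing in option 1 and the heartbeat check in [A] a bit more concrete, here is a rough Python sketch of just the decision logic, with no CouchDB calls at all. Everything named here is made up for illustration: the `EmailDoc` fields, the `instance_rank` idea (0 for instance_a, 1 for instance_b, and so on), and the two timeout values are assumptions, not anything CouchDB provides.

```python
from __future__ import annotations

from dataclasses import dataclass

# Both values are assumptions for illustration only.
TAKEOVER_DELAY_SECONDS = 300     # the "X" from option 1
HEARTBEAT_TIMEOUT_SECONDS = 180  # i.e. three missed 60-second heartbeats


@dataclass
class EmailDoc:
    """Hypothetical shape of the email document after replication."""
    replicated_at: float    # epoch seconds when the doc reached the cloud
    state: str = "pending"  # "pending", "taking", or "sent"


def may_attempt_send(doc: EmailDoc, instance_rank: int, now: float) -> bool:
    """Option 1: rank 0 may try immediately, rank 1 after X, rank 2 after 2X."""
    if doc.state != "pending":
        return False
    return now - doc.replicated_at >= instance_rank * TAKEOVER_DELAY_SECONDS


def stale_instances(heartbeats: dict[str, float], now: float) -> list[str]:
    """[A]: names of instances whose registration doc has not been updated
    within the heartbeat timeout (candidates for STONITH)."""
    return sorted(
        name
        for name, last_seen in heartbeats.items()
        if now - last_seen > HEARTBEAT_TIMEOUT_SECONDS
    )


doc = EmailDoc(replicated_at=1000.0)
print(may_attempt_send(doc, 0, now=1000.0))  # instance_a, right away
print(may_attempt_send(doc, 1, now=1100.0))  # instance_b, before X has passed
print(may_attempt_send(doc, 1, now=1300.0))  # instance_b, once X has passed
print(stale_instances({"instance_a": 1290.0, "instance_b": 1000.0}, now=1300.0))
```

Note that the timing check on its own does nothing about the duplicate-send problem: if replication between instances is broken, each instance sees the document as still "pending" and passes the check, which is exactly why [A] is needed on top of it.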
