Thanks Andrew for the detailed response. We have a replication factor of 3, so we're safe there. What do you recommend for min.insync.replicas, acks, and the log flush interval settings?
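(For illustration only: one common way these settings are combined on a 3-replica cluster. The values and file paths here are assumptions, not a recommendation from this thread.)

```shell
# Illustrative values only -- tune for your own durability/latency needs.
# Broker side: with a replication factor of 3, min.insync.replicas=2
# lets acked writes survive the loss of one broker.
cat >> /etc/kafka/server.properties <<'EOF'
min.insync.replicas=2
# log.flush.interval.messages / log.flush.interval.ms are often left at
# their defaults, relying on replication rather than forced fsyncs.
EOF

# Producer side: wait for acknowledgement from all in-sync replicas.
cat >> producer.properties <<'EOF'
acks=all
EOF
```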
I was worried about region failures, or someone going in and deleting our instances and their associated volumes (that's a kind of disaster too).

- MirrorMaker is great, but does it keep the topic configuration, partitioning, and offsets? We basically want our consumers to keep working as they did in case we lose a whole cluster.
- Secor, as you said, will let me back up the data but not reload it into a cluster, so I don't think it fits our exact purpose.
- EBS snapshots guarantee some kind of point-in-time recovery (although some data may be lost, as you said). Shutting down the brokers one at a time before each EBS backup sounds like an option; I guess the backup will just take a while to roll across the whole cluster?

I appreciate your feedback and look forward to hearing from you.

Regards,
Stephane

On 22 December 2016 at 5:40:30 pm, Andrew Clarkson (andrew.clark...@rallyhealth.com) wrote:

Hi Stephane,

I say this not to be condescending in any way, but simple replication *might* cover your needs. It will handle most node failures that cause an unclean shutdown, such as disk or power failure, assuming at least one replica of your data survives (see the configs min.insync.replicas, acks, and log.flush.interval.*). Making sure you have the right ack'ing and replication strategy will likely cover a lot of the failure/recovery use cases.

If you need stronger recovery/availability guarantees than simple replication, the de facto mechanism is "mirroring <https://kafka.apache.org/documentation.html#basic_ops_mirror_maker>" using a tool called MirrorMaker. This covers cases where an entire cluster is lost (such as an AWS region going down) or other catastrophic failures, and it is the preferred way to do multi-data-center (multi-region) replication.

Back to EBS snapshots.
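(On the mirroring point above: a minimal MirrorMaker invocation for a 0.10-era cluster might look like the sketch below. The property file names are hypothetical; the consumer config points at the source cluster and the producer config at the target cluster.)

```shell
# Hypothetical MirrorMaker run: mirror every topic from the cluster named
# in source-consumer.properties to the one in target-producer.properties.
bin/kafka-mirror-maker.sh \
  --consumer.config source-consumer.properties \
  --producer.config target-producer.properties \
  --whitelist '.*'
```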
From what I understand, snapshotting the file system won't give you a full picture of what's going on, because brokers flush the logs infrequently and, as you mentioned, leave them in a "corrupted" state. If you need a persistent record in order to rerun expired data (see the configs log.retention.*), you might want to look at a tool like Secor <https://github.com/pinterest/secor>. Secor writes all messages to an S3 bucket from which you could rerun the data if you need to. Sadly, it doesn't come with a producer to rerun the data, so you would have to write your own.

Let me know if that helps!

Thanks much,
Andrew Clarkson

On Wed, Dec 21, 2016 at 9:32 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:

> Hi,
>
> I have Kafka running on EC2 in AWS. I would like to back up my data
> volumes daily in order to recover to a point in time in case of a disaster.
>
> One thing I'm worried about is that if I take an EBS snapshot while Kafka
> is running, it seems a broker that recovers from it will have to deal with
> corrupted logs (it goes through a repair / rebuild-index process). It seems
> that Kafka properly closes the logs on shutdown.
>
> Questions:
> 1) If I take the EBS snapshots while Kafka is running, is it dangerous
> that a new instance launched from this backup has to go through a repair
> process?
> 2) The other option I see is to stop the Kafka broker and then take my EBS
> snapshot. But I can't do that for all brokers simultaneously, as I would
> lose my cluster. So if I do: stop Kafka broker, take snapshot, start
> Kafka, then repeat for the next broker, I would get a clean backup, but
> not a point-in-time backup… is that an issue?
> 3) Are there any other backup strategies I haven't considered?
>
> Thanks!
> Stephane
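(The rolling per-broker snapshot described in question 2 could be scripted roughly as below. The hostnames, the systemd service name, and the broker-to-volume mapping file are all hypothetical; this is a sketch of the procedure, not a tested script.)

```shell
# Rolling clean-shutdown snapshot: one broker at a time, so the cluster
# stays available. Hypothetical hosts, service name, and volume mapping.
for broker in kafka-1 kafka-2 kafka-3; do
  ssh "$broker" 'sudo systemctl stop kafka'            # clean shutdown closes the logs
  vol_id=$(ssh "$broker" 'cat /etc/kafka-volume-id')   # however you map broker -> EBS volume
  aws ec2 create-snapshot --volume-id "$vol_id" \
    --description "kafka backup: $broker $(date -u +%FT%TZ)"
  ssh "$broker" 'sudo systemctl start kafka'
  # Ideally, wait for the broker to rejoin the ISR before moving on.
done
```

Note that, as the question itself points out, this yields a clean backup per broker but not a single point-in-time backup of the cluster, since the snapshots are taken at different moments.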