At LinkedIn, the most common type of failure is a controlled shutdown for code/config pushes, which happens once or twice a month. For that, we have a tool for reducing the unavailability window (https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools). The next most common type of failure is disk/RAID failure, which seems to happen once every month or two. The remaining types of failure include Linux crashes, JVM bugs, and other types of hardware failures. They happen a few times a year.
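One rough way to turn frequencies like these into a yearly per-broker availability figure is to multiply each failure mode's events per year by its recovery time and divide the total by the hours in a year. The sketch below uses midpoints of the frequencies above. The recovery times are placeholder assumptions only, not measured numbers, and the function name `annual_availability` is made up for illustration. Replace them with values from your own testing.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def annual_availability(failure_modes):
    """Estimate availability of one broker over a year.

    failure_modes: iterable of (events_per_year, recovery_hours) pairs.
    Assumes failures do not overlap and each costs its full recovery time.
    """
    downtime_hours = sum(events * hours for events, hours in failure_modes)
    return 1.0 - downtime_hours / HOURS_PER_YEAR


# Events per year are midpoints of the rough frequencies quoted above.
# Recovery hours are ASSUMED placeholders, not measured values.
failure_modes = [
    (18, 0.05),  # controlled shutdown: once or twice a month
    (9, 1.0),    # disk/RAID failure: once every month or two
    (3, 2.0),    # Linux crashes, JVM bugs, other hardware: a few times a year
]

availability = annual_availability(failure_modes)
print(f"Estimated per-broker availability: {availability:.4%}")
```

With replication, a single broker being down does not by itself make a partition unavailable, so this is a per-node figure to feed into capacity planning, not the availability of the cluster as a whole.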
Thanks,
Jun

On Tue, Jun 11, 2013 at 1:22 AM, Pankaj Misra <[email protected]> wrote:

> Hi,
>
> We are using 0.8 version of Kafka and are planning for high availability
> testing with replication. While the entire scheme to enable the cluster to
> be highly available is clear, I wanted to get some idea about Kafka Service
> lifetime in terms of Mean-Time to Failure and Time of Recovery in cases of
> failure. Any historic evidences will also help, as it will be vital for us
> to calculate the actual availability of the system across an year.
>
> While I understand that Kafka provides more of active/active mode of
> seamless high availability, but any failure, will impact the performance to
> some extent and this calculation will help in deriving the actual number of
> nodes that we should consider without compromising on the performance as
> well, while the system is available.
>
> Any ideas/facts would be very helpful.
>
> Thanks & Regards
> Pankaj Misra
