Re: JBOD disk failure - just say no
We recently helped a team deal with some JBOD issues. They can be quite painful, and the experience depends a bit on the C* version in use. We wrote a blog post about it (published today): http://thelastpickle.com/blog/2018/08/22/the-fine-print-when-using-multiple-data-directories.html

Hope this helps.

Jon

On Mon, Aug 20, 2018 at 5:49 PM James Briggs wrote:
> Cassandra JBOD has a bunch of issues, so I don't recommend it for production:
>
> 1) Disks fill up with load (data) unevenly, meaning you can run out of space on one disk while others are half-full.
> 2) One bad disk can take out the whole node.
> 3) Instead of a small failure probability on an LVM/RAID volume, with JBOD you end up with a near-100% chance of failure after 3 years or so.
> 4) Generally you will not have as much warning of a looming failure with JBOD as with LVM/RAID. (Some companies take a week or two to replace a failed disk.)
>
> JBOD is easy to set up, but hard to manage.
>
> Thanks, James.
>
> [ ... ]

--
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
Re: JBOD disk failure - just say no
Cassandra JBOD has a bunch of issues, so I don't recommend it for production:

1) Disks fill up with load (data) unevenly, meaning you can run out of space on one disk while others are half-full.
2) One bad disk can take out the whole node.
3) Instead of a small failure probability on an LVM/RAID volume, with JBOD you end up with a near-100% chance of failure after 3 years or so.
4) Generally you will not have as much warning of a looming failure with JBOD as with LVM/RAID. (Some companies take a week or two to replace a failed disk.)

JBOD is easy to set up, but hard to manage.

Thanks, James.

From: kurt greaves
To: User
Sent: Friday, August 17, 2018 5:42 AM
Subject: Re: JBOD disk failure

As far as I'm aware, yes. I recall hearing someone mention tying system tables to a particular disk, but at the moment that doesn't exist.

[ ... ]
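James's point 1 is at least easy to watch for. A minimal monitoring sketch (the directory paths are placeholders; substitute whatever data_file_directories lists in your cassandra.yaml):

```shell
# Print how full the filesystem behind each JBOD data directory is,
# so uneven fill is caught before one disk runs out.
check_fill() {
  for d in "$@"; do
    # df -P prints one portable summary line per filesystem; column 5 is use%
    df -P "$d" | awk -v dir="$d" 'NR==2 {print dir, $5}'
  done
}

# Intended usage (placeholder paths):
#   check_fill /data/disk1/cassandra /data/disk2/cassandra
# Runnable demo against a directory that exists everywhere:
check_fill /tmp
```

Feed the output to whatever alerting you already have; a large spread between directories is the early-warning sign.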
Re: JBOD disk failure
As far as I'm aware, yes. I recall hearing someone mention tying system tables to a particular disk, but at the moment that doesn't exist.

On Fri., 17 Aug. 2018, 01:04 Eric Evans wrote:
> On Wed, Aug 15, 2018 at 3:23 AM kurt greaves wrote:
> > Yep. It might require a full node replace depending on what data is lost from the system tables. In some cases you might be able to recover from partially lost system info, but it's not a sure thing.
>
> Ugh, does it really just boil down to what part of `system` happens to be on the disk in question? In my mind, that makes the only sane operational procedure for a failed disk to be: "replace the entire node". IOW, I don't think we can realistically claim you can survive a failed JBOD device if it relies on happenstance.
>
> [ ... ]
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
Re: JBOD disk failure
On Wed, Aug 15, 2018 at 3:23 AM kurt greaves wrote:
> Yep. It might require a full node replace depending on what data is lost from the system tables. In some cases you might be able to recover from partially lost system info, but it's not a sure thing.

Ugh, does it really just boil down to what part of `system` happens to be on the disk in question? In my mind, that makes the only sane operational procedure for a failed disk to be: "replace the entire node". IOW, I don't think we can realistically claim you can survive a failed JBOD device if it relies on happenstance.

> On Wed., 15 Aug. 2018, 17:55 Christian Lorenz wrote:
>>
>> Thank you for the answers. We are using the current version 3.11.3, so this one includes CASSANDRA-6696.
>>
>> So if I get this right, losing system tables will need a full node rebuild. Otherwise repair will get the node consistent again.
>
> [ ... ]

--
Eric Evans
john.eric.ev...@gmail.com
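Since the outcome hinges on which data directory happens to hold the `system` keyspace, it's worth checking that before a disk dies. A hedged sketch (directory layout is illustrative; point it at your own data_file_directories):

```shell
# List which configured data directories contain files for the local
# "system" keyspace -- losing one of these disks is the bad case.
find_system_dirs() {
  for d in "$@"; do
    [ -d "$d/system" ] && echo "$d"
  done
  return 0
}

# Illustrative invocation (placeholder paths):
#   find_system_dirs /data/disk1/cassandra/data /data/disk2/cassandra/data
```

Running this across the fleet tells you in advance which disk failures would force a full node replace rather than a disk replace + repair.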
Re: JBOD disk failure
Yep. It might require a full node replace depending on what data is lost from the system tables. In some cases you might be able to recover from partially lost system info, but it's not a sure thing.

On Wed., 15 Aug. 2018, 17:55 Christian Lorenz <christian.lor...@webtrekk.com> wrote:
> Thank you for the answers. We are using the current version 3.11.3, so this one includes CASSANDRA-6696.
>
> So if I get this right, losing system tables will need a full node rebuild. Otherwise repair will get the node consistent again.
>
> Regards,
> Christian
>
> From: kurt greaves
> Reply-To: "user@cassandra.apache.org"
> Date: Wednesday, August 15, 2018, 04:53
> To: User
> Subject: Re: JBOD disk failure
>
> If that disk had important data in the system tables, however, you might have some trouble and need to replace the entire instance anyway.
>
> [ ... ]
Re: JBOD disk failure
Thank you for the answers. We are using the current version 3.11.3, so this one includes CASSANDRA-6696.

So if I get this right, losing system tables will need a full node rebuild. Otherwise repair will get the node consistent again.

Regards,
Christian

From: kurt greaves
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, August 15, 2018, 04:53
To: User
Subject: Re: JBOD disk failure

If that disk had important data in the system tables, however, you might have some trouble and need to replace the entire instance anyway.

On 15 August 2018 at 12:20, Jeff Jirsa wrote:
[ ... ]
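For reference, the "full node rebuild" in this thread usually means bootstrapping a replacement node with the replace-address flag. A sketch of the relevant fragment (flag names have varied slightly across versions, so check the docs for yours; `<dead_node_ip>` is a placeholder):

```shell
# cassandra-env.sh / jvm.options fragment on the *replacement* node
# (illustrative; <dead_node_ip> is the address of the failed node):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"
```

The replacement then streams the dead node's token ranges from the surviving replicas instead of joining as a new member.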
Re: JBOD disk failure
If that disk had important data in the system tables, however, you might have some trouble and need to replace the entire instance anyway.

On 15 August 2018 at 12:20, Jeff Jirsa wrote:
> Depends on version.
>
> For versions without the fix from CASSANDRA-6696, the only safe option on single disk failure is to stop and replace the whole instance. This is important because in older versions of Cassandra, you could have data in one sstable, a tombstone shadowing it on another disk, and it could be very far behind gc_grace_seconds. On disk failure in this scenario, if the disk holding the tombstone is lost, repair will propagate the (deleted/resurrected) data to the other replicas, which probably isn't what you want to happen.
>
> With 6696, you should be safe to replace the disk and run repair - 6696 will keep data for a given token range all on the same disks, so the resurrection problem is solved.
>
> --
> Jeff Jirsa
>
> [ ... ]
Re: JBOD disk failure
Depends on version.

For versions without the fix from CASSANDRA-6696, the only safe option on single disk failure is to stop and replace the whole instance. This is important because in older versions of Cassandra, you could have data in one sstable, a tombstone shadowing it on another disk, and it could be very far behind gc_grace_seconds. On disk failure in this scenario, if the disk holding the tombstone is lost, repair will propagate the (deleted/resurrected) data to the other replicas, which probably isn't what you want to happen.

With 6696, you should be safe to replace the disk and run repair - 6696 will keep data for a given token range all on the same disks, so the resurrection problem is solved.

--
Jeff Jirsa

> On Aug 14, 2018, at 6:10 AM, Christian Lorenz <christian.lor...@webtrekk.com> wrote:
>
> Hi,
>
> given a cluster with RF=3 and CL=LOCAL_ONE and the application is deleting data, what happens if the nodes are set up with JBOD and one disk fails? Do I get consistent results while the broken drive is replaced and a nodetool repair is running on the node with the replaced drive?
>
> Kind regards,
> Christian
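Worth noting alongside this: what the node actually does at the moment a data disk dies is governed by disk_failure_policy in cassandra.yaml. A sketch of the relevant settings (paths are examples only; check the option values and defaults for your version):

```yaml
# cassandra.yaml (illustrative fragment)
data_file_directories:
    - /data/disk1/cassandra
    - /data/disk2/cassandra
    - /data/disk3/cassandra

# Behavior when a data disk fails, e.g.:
#   die         - shut the node down (kill the JVM)
#   stop        - stop gossip and client transports, so the node appears down
#   best_effort - keep serving requests from the remaining disks
disk_failure_policy: stop
```

With best_effort the node stays up but may silently serve the stale/resurrected data described above on pre-6696 versions, which is why "stop and replace" is the conservative choice there.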
Re: JBOD disk failure
You have to explain what you mean by "JBOD". All in one large vdisk? Separate drives?

At the end of the day, if a device fails in a way that the data housed on that device (or array) is no longer available, that HDFS storage is marked down. HDFS now needs to create a 3rd replicant. Various timers control how long HDFS waits to see if the device comes back online. But assume it is immediate, for convenience.

Remember that a write is to a (random) copy of the data, and that datanode then replicates to the next node, and so forth. The in-process-of-being-created 3rd copy will also get those delete "updates". Have you read up on how "deleting" a record works?

<==> Be the reason someone smiles today. Or the reason they need a drink. Whichever works.

Daemeon C.M. Reiydelle
email: daeme...@gmail.com
San Francisco 1.415.501.0198 / London 44 020 8144 9872 / Skype daemeon.c.m.reiydelle

On Tue, Aug 14, 2018 at 6:10 AM Christian Lorenz <christian.lor...@webtrekk.com> wrote:
> Hi,
>
> given a cluster with RF=3 and CL=LOCAL_ONE and the application is deleting data, what happens if the nodes are set up with JBOD and one disk fails? Do I get consistent results while the broken drive is replaced and a nodetool repair is running on the node with the replaced drive?
>
> Kind regards,
> Christian