Re: JBOD disk failure - just say no

2018-08-22 Thread Jonathan Haddad
We recently helped a team deal with some JBOD issues; they can be quite
painful, and the experience depends a bit on the C* version in use. We
wrote a blog post about it (published today):

http://thelastpickle.com/blog/2018/08/22/the-fine-print-when-using-multiple-data-directories.html

Hope this helps.

Jon

On Mon, Aug 20, 2018 at 5:49 PM James Briggs 
wrote:

> Cassandra JBOD has a bunch of issues, so I don't recommend it for
> production:
>
> 1) disks fill up with load (data) unevenly, meaning you can run out of
> space on one disk while others are half-full
> 2) one bad disk can take out the whole node
> 3) instead of a small failure probability on an LVM/RAID volume, with JBOD
> you end up with a near-100% chance of failure after 3 years or so
> 4) generally you will not have enough warning of a looming failure with
> JBOD compared to LVM/RAID (some companies take a week or two to replace a
> failed disk)
>
> JBOD is easy to set up, but hard to manage.
>
> Thanks, James.
>
>
>
> --
> *From:* kurt greaves 
> *To:* User 
> *Sent:* Friday, August 17, 2018 5:42 AM
> *Subject:* Re: JBOD disk failure
>
> As far as I'm aware, yes. I recall hearing someone mention tying system
> tables to a particular disk but at the moment that doesn't exist.
>
> On Fri., 17 Aug. 2018, 01:04 Eric Evans, 
> wrote:
>
> On Wed, Aug 15, 2018 at 3:23 AM kurt greaves  wrote:
> > Yep. It might require a full node replace depending on what data is lost
> from the system tables. In some cases you might be able to recover from
> partially lost system info, but it's not a sure thing.
>
> Ugh, does it really just boil down to what part of `system` happens to
> be on the disk in question?  In my mind, that makes the only sane
> operational procedure for a failed disk to be: "replace the entire
> node".  IOW, I don't think we can realistically claim you can survive
> a failed JBOD device if it relies on happenstance.
>
> > On Wed., 15 Aug. 2018, 17:55 Christian Lorenz, <
> christian.lor...@webtrekk.com > wrote:
> >>
> >> Thank you for the answers. We are using the current version 3.11.3, so
> this one includes CASSANDRA-6696.
> >>
> >> So if I get this right, losing system tables will need a full node
> rebuild. Otherwise repair will get the node consistent again.
> >
> > [ ... ]
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
>
>
>
>
>

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


Re: JBOD disk failure - just say no

2018-08-20 Thread James Briggs
Cassandra JBOD has a bunch of issues, so I don't recommend it for production:
1) disks fill up with load (data) unevenly, meaning you can run out of space
on one disk while others are half-full
2) one bad disk can take out the whole node
3) instead of a small failure probability on an LVM/RAID volume, with JBOD
you end up with a near-100% chance of failure after 3 years or so
4) generally you will not have enough warning of a looming failure with JBOD
compared to LVM/RAID (some companies take a week or two to replace a failed
disk)

JBOD is easy to set up, but hard to manage.

Thanks, James.
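
To put rough numbers on points 1) and 3) above: a minimal Python sketch,
assuming hypothetical data_file_directories paths and a nominal 2% annual
failure rate per disk (the real figures depend entirely on your hardware and
disk count):

    import shutil

    # Hypothetical JBOD data_file_directories (adjust to your cassandra.yaml)
    DATA_DIRS = ["/data/disk1/cassandra", "/data/disk2/cassandra", "/data/disk3/cassandra"]

    def report_fill(dirs):
        """Print percent-used per data directory; a big spread means uneven load."""
        for d in dirs:
            total, used, free = shutil.disk_usage(d)
            print(f"{d}: {100.0 * used / total:.1f}% used, {free / 2**30:.1f} GiB free")

    def p_any_disk_fails(n_disks, years, annual_failure_rate=0.02):
        """Chance that at least one of n independent disks fails within `years`,
        assuming a constant per-disk annual failure rate (AFR)."""
        p_one_survives = (1.0 - annual_failure_rate) ** years
        return 1.0 - p_one_survives ** n_disks

    if __name__ == "__main__":
        report_fill(DATA_DIRS)
        print(f"P(any of 12 disks fails in 3 years): {p_any_disk_fails(12, 3):.0%}")

With 12 disks at a 2% AFR this works out to roughly a 50% chance of at least
one disk failure over 3 years; more disks or a higher AFR pushes it quickly
toward the near-100% figure above.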


From: kurt greaves
To: User
Sent: Friday, August 17, 2018 5:42 AM
Subject: Re: JBOD disk failure
As far as I'm aware, yes. I recall hearing someone mention tying system tables 
to a particular disk but at the moment that doesn't exist.
On Fri., 17 Aug. 2018, 01:04 Eric Evans,  wrote:

On Wed, Aug 15, 2018 at 3:23 AM kurt greaves  wrote:
> Yep. It might require a full node replace depending on what data is lost from 
> the system tables. In some cases you might be able to recover from partially 
> lost system info, but it's not a sure thing.

Ugh, does it really just boil down to what part of `system` happens to
be on the disk in question?  In my mind, that makes the only sane
operational procedure for a failed disk to be: "replace the entire
node".  IOW, I don't think we can realistically claim you can survive
a failed JBOD device if it relies on happenstance.

> On Wed., 15 Aug. 2018, 17:55 Christian Lorenz wrote:
>>
>> Thank you for the answers. We are using the current version 3.11.3, so this
>> one includes CASSANDRA-6696.
>>
>> So if I get this right, losing system tables will need a full node rebuild. 
>> Otherwise repair will get the node consistent again.
>
> [ ... ]

-- 
Eric Evans
john.eric.ev...@gmail.com





   

Re: JBOD disk failure

2018-08-17 Thread kurt greaves
As far as I'm aware, yes. I recall hearing someone mention tying system
tables to a particular disk but at the moment that doesn't exist.

On Fri., 17 Aug. 2018, 01:04 Eric Evans,  wrote:

> On Wed, Aug 15, 2018 at 3:23 AM kurt greaves  wrote:
> > Yep. It might require a full node replace depending on what data is lost
> from the system tables. In some cases you might be able to recover from
> partially lost system info, but it's not a sure thing.
>
> Ugh, does it really just boil down to what part of `system` happens to
> be on the disk in question?  In my mind, that makes the only sane
> operational procedure for a failed disk to be: "replace the entire
> node".  IOW, I don't think we can realistically claim you can survive
> a failed JBOD device if it relies on happenstance.
>
> > On Wed., 15 Aug. 2018, 17:55 Christian Lorenz, <
> christian.lor...@webtrekk.com> wrote:
> >>
> >> Thank you for the answers. We are using the current version 3.11.3, so
> this one includes CASSANDRA-6696.
> >>
> >> So if I get this right, losing system tables will need a full node
> rebuild. Otherwise repair will get the node consistent again.
> >
> > [ ... ]
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
>
>
>


Re: JBOD disk failure

2018-08-16 Thread Eric Evans
On Wed, Aug 15, 2018 at 3:23 AM kurt greaves  wrote:
> Yep. It might require a full node replace depending on what data is lost from 
> the system tables. In some cases you might be able to recover from partially 
> lost system info, but it's not a sure thing.

Ugh, does it really just boil down to what part of `system` happens to
be on the disk in question?  In my mind, that makes the only sane
operational procedure for a failed disk to be: "replace the entire
node".  IOW, I don't think we can realistically claim you can survive
a failed JBOD device if it relies on happenstance.

> On Wed., 15 Aug. 2018, 17:55 Christian Lorenz, 
>  wrote:
>>
>> Thank you for the answers. We are using the current version 3.11.3, so this
>> one includes CASSANDRA-6696.
>>
>> So if I get this right, losing system tables will need a full node rebuild. 
>> Otherwise repair will get the node consistent again.
>
> [ ... ]

-- 
Eric Evans
john.eric.ev...@gmail.com




Re: JBOD disk failure

2018-08-15 Thread kurt greaves
Yep. It might require a full node replace depending on what data is lost
from the system tables. In some cases you might be able to recover from
partially lost system info, but it's not a sure thing.
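
A quick way to see what is at stake on each drive is to check which data
directories actually contain system-keyspace sstables. A minimal sketch (the
paths and keyspace list below are assumptions; point it at your own
data_file_directories):

    import os

    # Hypothetical JBOD data directories and the keyspaces worth worrying about
    DATA_DIRS = ["/data/disk1/cassandra", "/data/disk2/cassandra", "/data/disk3/cassandra"]
    SYSTEM_KEYSPACES = ["system", "system_schema", "system_auth", "system_distributed", "system_traces"]

    for d in DATA_DIRS:
        for ks in SYSTEM_KEYSPACES:
            ks_path = os.path.join(d, ks)
            if not os.path.isdir(ks_path):
                continue
            # count SSTable data components under <data_dir>/<keyspace>/<table>/
            n = sum(1 for _, _, files in os.walk(ks_path)
                    for f in files if f.endswith("-Data.db"))
            if n:
                print(f"{d}: {ks} has {n} sstable(s)")

If the failed drive shows up here, plan on a full node replace; if it only
held regular keyspaces, replacing the disk and repairing is the likelier path
(on 3.11 / post-6696, per the rest of the thread).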

On Wed., 15 Aug. 2018, 17:55 Christian Lorenz, <
christian.lor...@webtrekk.com> wrote:

> Thank you for the answers. We are using the current version 3.11.3, so this
> one includes CASSANDRA-6696.
>
> So if I get this right, losing system tables will need a full node
> rebuild. Otherwise repair will get the node consistent again.
>
>
>
> Regards,
>
> Christian
>
>
>
>
>
> *From: *kurt greaves
> *Reply to: *"user@cassandra.apache.org"
> *Date: *Wednesday, 15 August 2018 at 04:53
> *To: *User
> *Subject: *Re: JBOD disk failure
>
>
>
> If that disk had important data in the system tables however you might
> have some trouble and need to replace the entire instance anyway.
>
>
>
> On 15 August 2018 at 12:20, Jeff Jirsa  wrote:
>
> Depends on version
>
>
>
> For versions without the fix from Cassandra-6696, the only safe option on
> single disk failure is to stop and replace the whole instance - this is
> important because in older versions of Cassandra, you could have data in
> one sstable, a tombstone shadowing it in another disk, and it could be very
> far behind gc_grace_seconds. On disk failure in this scenario, if the disk
> holding the tombstone is lost, repair will propagate the
> (deleted/resurrected) data to the other replicas, which probably isn’t what
> you want to happen.
>
>
>
> With 6696, you should be safe to replace the disk and run repair - 6696
> will keep data for a given token range all on the same disks, so the
> resurrection problem is solved.
>
>
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Aug 14, 2018, at 6:10 AM, Christian Lorenz <
> christian.lor...@webtrekk.com> wrote:
>
> Hi,
>
>
>
> given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting
> data, what happens if the nodes are setup with JBOD and one disk fails? Do
> I get consistent results while the broken drive is replaced and a nodetool
> repair is running on the node with the replaced drive?
>
>
>
> Kind regards,
>
> Christian
>
>
>


Re: JBOD disk failure

2018-08-15 Thread Christian Lorenz
Thank you for the answers. We are using the current version 3.11.3, so this one
includes CASSANDRA-6696.
So if I get this right, losing system tables will need a full node rebuild. 
Otherwise repair will get the node consistent again.

Regards,
Christian


From: kurt greaves
Reply to: "user@cassandra.apache.org"
Date: Wednesday, 15 August 2018 at 04:53
To: User
Subject: Re: JBOD disk failure

If that disk had important data in the system tables however you might have 
some trouble and need to replace the entire instance anyway.

On 15 August 2018 at 12:20, Jeff Jirsa <jji...@gmail.com> wrote:
Depends on version

For versions without the fix from Cassandra-6696, the only safe option on 
single disk failure is to stop and replace the whole instance - this is 
important because in older versions of Cassandra, you could have data in one 
sstable, a tombstone shadowing it in another disk, and it could be very far 
behind gc_grace_seconds. On disk failure in this scenario, if the disk holding 
the tombstone is lost, repair will propagate the (deleted/resurrected) data to 
the other replicas, which probably isn’t what you want to happen.

With 6696, you should be safe to replace the disk and run repair - 6696 will 
keep data for a given token range all on the same disks, so the resurrection 
problem is solved.


--
Jeff Jirsa


On Aug 14, 2018, at 6:10 AM, Christian Lorenz <christian.lor...@webtrekk.com> wrote:
Hi,

given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting data, 
what happens if the nodes are setup with JBOD and one disk fails? Do I get 
consistent results while the broken drive is replaced and a nodetool repair is 
running on the node with the replaced drive?

Kind regards,
Christian



Re: JBOD disk failure

2018-08-14 Thread kurt greaves
If that disk had important data in the system tables however you might have
some trouble and need to replace the entire instance anyway.

On 15 August 2018 at 12:20, Jeff Jirsa  wrote:

> Depends on version
>
> For versions without the fix from Cassandra-6696, the only safe option on
> single disk failure is to stop and replace the whole instance - this is
> important because in older versions of Cassandra, you could have data in
> one sstable, a tombstone shadowing it in another disk, and it could be very
> far behind gc_grace_seconds. On disk failure in this scenario, if the disk
> holding the tombstone is lost, repair will propagate the
> (deleted/resurrected) data to the other replicas, which probably isn’t what
> you want to happen.
>
> With 6696, you should be safe to replace the disk and run repair - 6696
> will keep data for a given token range all on the same disks, so the
> resurrection problem is solved.
>
>
> --
> Jeff Jirsa
>
>
> On Aug 14, 2018, at 6:10 AM, Christian Lorenz <
> christian.lor...@webtrekk.com> wrote:
>
> Hi,
>
>
>
> given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting
> data, what happens if the nodes are setup with JBOD and one disk fails? Do
> I get consistent results while the broken drive is replaced and a nodetool
> repair is running on the node with the replaced drive?
>
>
>
> Kind regards,
>
> Christian
>
>


Re: JBOD disk failure

2018-08-14 Thread Jeff Jirsa
Depends on version

For versions without the fix from Cassandra-6696, the only safe option on 
single disk failure is to stop and replace the whole instance - this is 
important because in older versions of Cassandra, you could have data in one 
sstable, a tombstone shadowing it in another disk, and it could be very far 
behind gc_grace_seconds. On disk failure in this scenario, if the disk holding 
the tombstone is lost, repair will propagate the (deleted/resurrected) data to 
the other replicas, which probably isn’t what you want to happen.

With 6696, you should be safe to replace the disk and run repair - 6696 will 
keep data for a given token range all on the same disks, so the resurrection 
problem is solved. 
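
To make the pre-6696 resurrection scenario concrete, here is a toy Python
model of the read-merge (keys, values and timestamps are made up; this is not
Cassandra code):

    # Disk 1 holds the original cell; disk 2 holds a newer tombstone for the same key.
    # Each cell is (payload, write_timestamp, is_tombstone).
    LIVE      = {"k1": ("value-A", 100, False)}
    TOMBSTONE = {"k1": (None,      200, True)}

    def read(*surviving_disks):
        """Merge sstables across surviving disks; the newest cell per key wins."""
        merged = {}
        for disk in surviving_disks:
            for key, cell in disk.items():
                if key not in merged or cell[1] > merged[key][1]:
                    merged[key] = cell
        # tombstoned keys are invisible to clients
        return {k: c[0] for k, c in merged.items() if not c[2]}

    print(read(LIVE, TOMBSTONE))  # {}                 -> the delete is honoured
    print(read(LIVE))             # {'k1': 'value-A'}  -> tombstone's disk lost: the
                                  # "deleted" value is readable again, and repair would
                                  # stream it back to replicas that already purged it

With the 6696 behaviour, sstables for a given token range stay on the same
disk, so the value and its tombstone are lost (or kept) together and this
mismatch can't occur.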


-- 
Jeff Jirsa


> On Aug 14, 2018, at 6:10 AM, Christian Lorenz  
> wrote:
> 
> Hi,
>  
> given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting data, 
> what happens if the nodes are setup with JBOD and one disk fails? Do I get 
> consistent results while the broken drive is replaced and a nodetool repair 
> is running on the node with the replaced drive?
>  
> Kind regards,
> Christian
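
On the CL=LOCAL_ONE part of the question: while a replica is rebuilding or
repairing, LOCAL_ONE reads may be served by that single out-of-date replica.
One possible mitigation during that window, sketched with the Python driver
(contact point, keyspace and table names are placeholders), is to read at
LOCAL_QUORUM so another replica's tombstones still win the merge, provided
they are within gc_grace_seconds:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])            # placeholder contact point
    session = cluster.connect("my_keyspace")   # placeholder keyspace

    # Read at LOCAL_QUORUM so a single stale replica can't answer on its own
    stmt = SimpleStatement(
        "SELECT * FROM my_table WHERE id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    for row in session.execute(stmt, (42,)):
        print(row)

    cluster.shutdown()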


Re: JBOD disk failure

2018-08-14 Thread daemeon reiydelle
you have to explain what you mean by "JBOD". All in one large vdisk?
Separate drives?

At the end of the day, if a device fails in a way that the data housed on
that device (or array) is no longer available, that HDFS storage is marked
down. HDFS now needs to create a 3rd replica. Various timers control how
long HDFS waits to see if the device comes back online. But assume
immediately for convenience. Remember that a write is to a (random) copy of
the data, and that datanode then replicates to the next node, and so forth.
The in-process-of-being-created 3rd copy will also get those delete
"updates". Have you read up on how "deleting" a record works?

<==>
Be the reason someone smiles today.
Or the reason they need a drink.
Whichever works.

*Daemeon C.M. Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/London 44 020 8144 9872/Skype
daemeon.c.m.reiydelle*



On Tue, Aug 14, 2018 at 6:10 AM Christian Lorenz <
christian.lor...@webtrekk.com> wrote:

> Hi,
>
>
>
> given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting
> data, what happens if the nodes are setup with JBOD and one disk fails? Do
> I get consistent results while the broken drive is replaced and a nodetool
> repair is running on the node with the replaced drive?
>
>
>
> Kind regards,
>
> Christian
>