>
> Alain, thanks for sharing!  I'm confused why you do so many repetitive
> rsyncs.  Just being cautious or is there another reason?  Also, why do you
> have --delete-before when you're copying data to a temp (assumed empty)
> directory?


> Since they are immutable I do a first sync while everything is up and
> running to the new location which runs really long. Meanwhile new ones are
> created and I sync them again online, much less files to copy now. After
> that I shutdown the node and my last rsync now has to copy only a few files
> which is quite fast and so the downtime for that node is within minutes.


Jan's guess is right, except for the "immutable" part: compaction can make
big files go away, replaced by bigger ones that you'll have to sync again.

Here is a detailed explanation of why I did it this way.

More precisely, let's say we have 10 files of 100 GB each on the disk to
remove (let's call it 'old-dir').

I run a first rsync to an empty folder indeed (let's call this 'tmp-dir'),
on the disk that will remain after the operation. Let's say this takes
about 10 hours. It can be run on all the nodes in parallel though.
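
To make that concrete, the first pass is basically the following (an
illustrative sketch using the placeholder paths from above; the exact
flags are in the script):

    # 1st rsync: node up, compactions still enabled, run on all nodes in parallel
    rsync -av /old-dir/ /tmp-dir/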

So I now have 10 files of 100 GB in tmp-dir. But meanwhile one compaction
triggered on old-dir, and it now holds 6 files of 100 GB and 1 of 350 GB.

At this point I disable compaction and stop the ones already running.
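
In nodetool terms that is roughly (illustrative, the script has the exact
calls):

    nodetool disableautocompaction   # stop scheduling new compactions
    nodetool stop COMPACTION         # interrupt the compactions already running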

My second rsync has to remove from tmp-dir the 4 files that compaction
merged away, and that's why I use '--delete-before': tmp-dir needs to
mirror old-dir, so this is fine. This second pass takes 3.5 hours, also
runnable in parallel. (Keep in mind C* won't compact anything for those
3.5 hours; that's why I did not stop compaction before the first rsync.
In my case the dataset was 2 TB.)
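
For reference, that second pass looks roughly like this (same illustrative
flags as before):

    # 2nd rsync: compaction is disabled now; --delete-before first removes from
    # tmp-dir the SSTables that were compacted away, then copies the new bigger ones
    rsync -av --delete-before /old-dir/ /tmp-dir/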

At this point I have 950 GB in tmp-dir, but meanwhile clients continued to
write to the disk, let's say 50 GB more.

The 3rd rsync takes 0.5 hour: no compaction ran, so it only has to add the
diff to tmp-dir. Still runnable in parallel.

Then the script stops the node, so this part should be run sequentially,
and performs 2 more rsyncs. The first one takes the diff between the end of
the 3rd rsync and the moment you stop the node; it should be a few seconds,
minutes maybe, depending on how fast you ran the script after the 3rd rsync
ended. The second rsync in the script is a 'useless' one: I just like to
control things. I run it and expect it to report that there is no diff. It
is a way to stop the script if for some reason data is still being appended
to old-dir.
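
The sequential part is roughly the following (illustrative; a 'nodetool
drain' before stopping is optional, and how you stop the service depends
on your setup):

    nodetool drain                                 # optional: flush and stop accepting writes
    sudo service cassandra stop
    rsync -av --delete-before /old-dir/ /tmp-dir/  # 4th rsync: diff since the 3rd one
    rsync -av --delete-before /old-dir/ /tmp-dir/  # 5th, control rsync: expect no transfer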

Then I just move all the files from tmp-dir to new-dir (the proper data dir
remaining after the operation). This is an instant operation: the files are
not really moved, since they already sit on that disk, and a move within
the same filesystem is just a rename.
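
In shell terms, something like:

    # tmp-dir and new-dir are on the same filesystem, so this only renames
    # directory entries; no data is rewritten
    mv /tmp-dir/* /new-dir/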

I finally unmount and rm -rf old-dir.

So the full op takes 10 h + 3.5 h + 0.5 h + (number of nodes * 0.1 h), and
each node is down for only about 5-10 min.
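
For example, with 6 nodes that is roughly 10 + 3.5 + 0.5 + 0.6 = 14.6 hours
of elapsed time in total, with each node offline only during its last two
rsyncs.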

VS

Straightforward way (stop node, move, start node): 10 h * number of nodes,
as this needs to be sequential. Plus each node is down for 10 hours, so you
have to repair it, as that is longer than the hinted handoff window...

Branton, I did not go through your process, but I guess you will be able
to review it by yourself after reading the above (typically, repair is not
needed with the strategy I describe above, as each node is down for only
5-10 minutes). Also, I am not sure how "rsync -azvuiP /var/data/cassandra/data2/
/var/data/cassandra/data/" will behave; my guess is that it is going to do
a copy, so it might be very long. My script performs an instant move, and
as the next command is 'rm -Rf /var/data/cassandra/data2' I see no reason
to copy rather than move the files.

Your solution would probably work, but with big constraints from an
operational point of view (very long operation + repair needed).

Hope this long email is useful; maybe I should blog about this. Let me
know if the process above makes sense or if some things could be improved.

C*heers,
-----------------
Alain Rodriguez
France

The Last Pickle
http://www.thelastpickle.com

2016-02-19 7:19 GMT+01:00 Branton Davis <branton.da...@spanning.com>:

> Jan, thanks!  That makes perfect sense to run a second time before
> stopping cassandra.  I'll add that in when I do the production cluster.
>
> On Fri, Feb 19, 2016 at 12:16 AM, Jan Kesten <j.kes...@enercast.de> wrote:
>
>> Hi Branton,
>>
>> two cents from me - I didnt look through the script, but for the rsyncs I
>> do pretty much the same when moving them. Since they are immutable I do a
>> first sync while everything is up and running to the new location which
>> runs really long. Meanwhile new ones are created and I sync them again
>> online, much less files to copy now. After that I shutdown the node and my
>> last rsync now has to copy only a few files which is quite fast and so the
>> downtime for that node is within minutes.
>>
>> Jan
>>
>>
>>
>> Von meinem iPhone gesendet
>>
>> Am 18.02.2016 um 22:12 schrieb Branton Davis <branton.da...@spanning.com
>> >:
>>
>> Alain, thanks for sharing!  I'm confused why you do so many repetitive
>> rsyncs.  Just being cautious or is there another reason?  Also, why do you
>> have --delete-before when you're copying data to a temp (assumed empty)
>> directory?
>>
>> On Thu, Feb 18, 2016 at 4:12 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>> wrote:
>>
>>> I did the process a few weeks ago and ended up writing a runbook and a
>>> script. I have anonymised and share it fwiw.
>>>
>>> https://github.com/arodrime/cassandra-tools/tree/master/remove_disk
>>>
>>> It is basic bash. I tried to have the shortest down time possible,
>>> making this a bit more complex, but it allows you to do a lot in parallel
>>> and just do a fast operation sequentially, reducing overall operation time.
>>>
>>> This worked fine for me, yet I might have make some errors while making
>>> it configurable though variables. Be sure to be around if you decide to run
>>> this. Also I automated this more by using knife (Chef), I hate to repeat
>>> ops, this is something you might want to consider.
>>>
>>> Hope this is useful,
>>>
>>> C*heers,
>>> -----------------
>>> Alain Rodriguez
>>> France
>>>
>>> The Last Pickle
>>> http://www.thelastpickle.com
>>>
>>> 2016-02-18 8:28 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>>>
>>>> Hey Branton,
>>>>
>>>> Please do let us know if you face any problems  doing this.
>>>>
>>>> Thanks
>>>> anishek
>>>>
>>>> On Thu, Feb 18, 2016 at 3:33 AM, Branton Davis <
>>>> branton.da...@spanning.com> wrote:
>>>>
>>>>> We're about to do the same thing.  It shouldn't be necessary to shut
>>>>> down the entire cluster, right?
>>>>>
>>>>> On Wed, Feb 17, 2016 at 12:45 PM, Robert Coli <rc...@eventbrite.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 16, 2016 at 11:29 PM, Anishek Agarwal <anis...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> To accomplish this can I just copy the data from disk1 to disk2 with
>>>>>>> in the relevant cassandra home location folders, change the 
>>>>>>> cassanda.yaml
>>>>>>> configuration and restart the node. before starting i will shutdown the
>>>>>>> cluster.
>>>>>>>
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>> =Rob
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
