Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-05-07 Thread Alfredo Deza
On Wed, May 2, 2018 at 12:18 PM, Nicolas Huillard  wrote:
> Le dimanche 08 avril 2018 à 20:40 +, Jens-U. Mozdzen a écrit :
>> sorry for bringing up that old topic again, but we just faced a
>> corresponding situation and have successfully tested two migration
>> scenarios.
>
> Thank you very much for this update, as I needed to do exactly that,
> due to an SSD crash triggering hardware replacement.
> The block.db partitions on the crashed SSD were lost, so the two OSDs
> depending on it had to be re-created. I also replaced two other bad SSDs
> before they failed, and thus needed to replace the DB/WAL devices
> on the live cluster (2 SSDs on 2 hosts, 4 OSDs in total).
>
>> it is possible to move a separate WAL/DB to a new device, though
>> without changing its size. We have done this for multiple OSDs, using
>> only existing (mainstream :) ) tools and have documented the procedure
>> in
>> http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
>> . It will *not* allow you to separate WAL / DB after OSD creation, nor
>> does it allow changing the DB size.
>
> The lost OSDs were still backfilling when I did the above procedure
> (data redundancy was high enough to risk losing one more node). I even
> mis-typed the "ceph osd set noout" command ("ceph osd unset noout"
> instead, effectively a no-op), and replaced 2 OSDs of a single host at
> the same time (thus taking more time than the 10 minutes before kicking
> the OSDs out, triggering even more data movement).
> Everything went cleanly though, thanks to your detailed commands, which
> I ran one at a time, thinking twice before each [Enter].
>
> I dug a bit into the LVM tags:
> * make a backup of all pv/vg/lv config: vgcfgbackup
> * check the backed-up tags: grep tags /etc/lvm/backup/*
>
> I then noticed that:
> * there are lots of "ceph.*=" tags
> * tags are still present on the old DB/WAL LVs (since I didn't remove
> them)
> * tags are absent from the new DB/WAL LVs (ditto, I didn't create
> them), which may be a problem later on...

This is absolutely going to be a problem for you if these OSDs are
handled by ceph-volume: it reads those tags to be able to bring up
the OSD.

> * I changed the ceph.db_device= tag, but there is also a ceph.db_uuid=
> tag which was not changed, and may or may not trigger a problem upon
> reboot (I don't know if this UUID is part of the dd'ed data)

You can certainly get into a situation where ceph-volume needs one of
these tags, can't find it, and then breaks.
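
As a rough illustration (not from the thread itself), the tags can be
inspected and re-created with stock LVM commands; the VG/LV names below are
hypothetical, and the exact set of ceph.* keys should be copied from a
still-intact OSD rather than guessed:

    # list the ceph.* tags ceph-volume stored on each LV
    lvs -o lv_name,vg_name,lv_tags | grep ceph
    # copy each tag onto the new DB LV, correcting any value that still
    # points at the old device (hypothetical VG/LV names)
    lvchange --addtag 'ceph.osd_id=0' ceph-db-new/db-osd0
    lvchange --addtag 'ceph.db_device=/dev/ceph-db-new/db-osd0' ceph-db-new/db-osd0
    # ...and so on for the remaining ceph.* tags reported by lvs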

>
> You effectively helped a lot! Thanks.
>
> --
> Nicolas Huillard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-05-02 Thread Nicolas Huillard
Le dimanche 08 avril 2018 à 20:40 +, Jens-U. Mozdzen a écrit :
> sorry for bringing up that old topic again, but we just faced a  
> corresponding situation and have successfully tested two migration  
> scenarios.

Thank you very much for this update, as I needed to do exactly that,
due to an SSD crash triggering hardware replacement.
The block.db partitions on the crashed SSD were lost, so the two OSDs
depending on it had to be re-created. I also replaced two other bad SSDs
before they failed, and thus needed to replace the DB/WAL devices
on the live cluster (2 SSDs on 2 hosts, 4 OSDs in total).

> it is possible to move a separate WAL/DB to a new device, though
> without changing its size. We have done this for multiple OSDs, using
> only existing (mainstream :) ) tools and have documented the procedure
> in
> http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
> . It will *not* allow you to separate WAL / DB after OSD creation, nor
> does it allow changing the DB size.

The lost OSDs were still backfilling when I did the above procedure
(data redundancy was high enough to risk losing one more node). I even
mis-typed the "ceph osd set noout" command ("ceph osd unset noout"
instead, effectively a no-op), and replaced 2 OSDs of a single host at
the same time (thus taking more time than the 10 minutes before kicking
the OSDs out, triggering even more data movement).
Everything went cleanly though, thanks to your detailed commands, which
I ran one at a time, thinking twice before each [Enter].

I dug a bit into the LVM tags:
* make a backup of all pv/vg/lv config: vgcfgbackup
* check the backed-up tags: grep tags /etc/lvm/backup/*

I then noticed that:
* there are lots of "ceph.*=" tags
* tags are still present on the old DB/WAL LVs (since I didn't remove
them)
* tags are absent from the new DB/WAL LVs (ditto, I didn't create
them), which may be a problem later on...
* I changed the ceph.db_device= tag, but there is also a ceph.db_uuid=
tag which was not changed, and may or may not trigger a problem upon
reboot (I don't know if this UUID is part of the dd'ed data)

You effectively helped a lot! Thanks.

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-04-08 Thread Jens-U. Mozdzen

Hi *,

sorry for bringing up that old topic again, but we just faced a  
corresponding situation and have successfully tested two migration  
scenarios.


Zitat von ceph-users-requ...@lists.ceph.com:

Date: Sat, 24 Feb 2018 06:10:16 +
From: David Turner <drakonst...@gmail.com>
To: Nico Schottelius <nico.schottel...@ungleich.ch>
Cc: Caspar Smit <caspars...@supernas.eu>, ceph-users
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Proper procedure to replace DB/WAL SSD
Message-ID:
<can-gepjzd8rgxchbxnsf7hwu22rk2dnfbdutuy2ygklmmyi...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Caspar, it looks like your idea should work. Worst case scenario seems like
the osd wouldn't start, you'd put the old SSD back in and go back to the
idea of weighting them to 0, backfilling, then recreating the osds. Definitely
worth a try in my opinion, and I'd love to hear your experience after.

Nico, it is not possible to change the WAL or DB size, location, etc after
osd creation.


it is possible to move a separate WAL/DB to a new device, though
without changing its size. We have done this for multiple OSDs, using
only existing (mainstream :) ) tools and have documented the procedure
in
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
. It will *not* allow you to separate WAL / DB after OSD creation, nor
does it allow changing the DB size.


As we faced a failing WAL/DB SSD during one of the moves (fatal read  
errors from the DB block device), we also established a procedure to  
initialize the OSD to "empty" during that operation, so that the OSD  
will get re-filled without changing the OSD map:  
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/resetting-an-existing-bluestore-osd/


HTH
Jens

PS: Live WAL/DB migration is something that can be done easily when
using logical volumes, which is why I'd highly recommend going that
route instead of using partitions. LVM not only helps when the SSDs
reach their EOL, but also with live load balancing (distributing
WAL/DB LVs across multiple SSDs).
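
To make that concrete, a rough sketch of such a live move with stock LVM
tools, assuming the WAL/DB LVs live in a VG that can span both SSDs (device
and VG/LV names are made up):

    pvcreate /dev/sdd1                     # partition on the new SSD
    vgextend ceph-journals /dev/sdd1       # add it to the VG holding the DB/WAL LVs
    pvmove -n db-osd0 /dev/sdc1 /dev/sdd1  # migrate one LV's extents, online
    vgreduce ceph-journals /dev/sdc1       # retire the old SSD once it is empty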



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-03-03 Thread Willem Jan Withagen

On 23/02/2018 14:27, Caspar Smit wrote:

Hi All,

What would be the proper way to preventively replace a DB/WAL SSD (when
it is nearing its DWPD/TBW limit and has not failed yet)?


It hosts DB partitions for 5 OSD's

Maybe something like:

1) ceph osd reweight 0 the 5 OSD's
2) let backfilling complete
3) destroy/remove the 5 OSD's
4) replace SSD
5) create 5 new OSD's with a separate DB partition on the new SSD

When these 5 OSD's are big HDD's (8TB), a LOT of data has to be moved, so
I thought maybe the following would work:


1) ceph osd set noout
2) stop the 5 OSD's (systemctl stop)
3) 'dd' the old SSD to a new SSD of same or bigger size
4) remove the old SSD
5) start the 5 OSD's (systemctl start)
6) let backfilling/recovery complete (only delta data between OSD stop
and now)
7) ceph osd unset noout

Would this be a viable method to replace a DB SSD? Any udev/serial
nr/uuid stuff preventing this from working?


What I would do under FreeBSD/ZFS (and perhaps there is something under 
Linux that works the same):


Promote the disk/zvol for the DB/WAL to a mirror.
  This is instantaneous, and does not modify anything.
Add the new SSD to the mirror, and wait until the new SSD is resilvered.
Then I'd delete the old SSD from the mirror.

You'd be stuck with a one-disk mirror for the DB/WAL, but that
does not cost much. ZFS does not even consider it degraded, as long as
you removed the disk in the correct way.


And no reboot required.

No idea if you can do something similar under LVM or other types of 
mirroring stuff.
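
For readers unfamiliar with ZFS, a minimal sketch of that sequence (pool and
device names are hypothetical):

    zpool attach dbpool gpt/db-ssd-old gpt/db-ssd-new  # single vdev becomes a mirror
    zpool status dbpool                                # wait for the resilver to finish
    zpool detach dbpool gpt/db-ssd-old                 # drop the worn-out SSD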


--WjW



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Dietmar Rieder
Thanks for making this clear.

Dietmar

On 02/27/2018 05:29 PM, Alfredo Deza wrote:
> On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
>  wrote:
>> ... however, it would be nice if ceph-volume would also create the
>> partitions for the WAL and/or DB if needed. Is there a special reason,
>> why this is not implemented?
> 
> Yes, the reason is that this was one of the most painful points in
> ceph-disk (code and maintenance-wise): to be in the business of
> understanding partitions, sizes, requirements, and devices
> is non-trivial.
> 
> One of the reasons ceph-disk did this was because it required quite a
> hefty amount of "special sauce" on partitions so that these would be
> discovered later by mechanisms that included udev.
> 
> If an admin wanted more flexibility, we decided that it had to be up
> to configuration management system (or whatever deployment mechanism)
> to do so. For users that want a simplistic approach (in the case of
> bluestore)
> we have a 1:1 mapping for device->logical volume->OSD
> 
> On the ceph-volume side as well, implementing partitions meant to also
> have a similar support for logical volumes, which have lots of
> variations that can be supported and we were not willing to attempt to
> support them all.
> 
> Even a small subset would inevitably bring up the question of "why is
> setup X not supported by ceph-volume if setup Y is?"
> 
> Configuration management systems are better suited for handling these
> situations, and we would prefer to offload that responsibility to
> those systems.
> 
>>
>> Dietmar
>>
>>
>> On 02/27/2018 04:25 PM, David Turner wrote:
>>> Gotcha.  As a side note, that setting is only used by ceph-disk as
>>> ceph-volume does not create partitions for the WAL or DB.  You need to
>>> create those partitions manually if using anything other than a whole
>>> block device when creating OSDs with ceph-volume.
>>>
>>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit >> > wrote:
>>>
>>> David,
>>>
>>> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
>>> just to inform other people that Ceph's default of 1GB is pretty low.
>>> Now that i read my own sentence it indeed looks as if i was using
>>> 1GB partitions, sorry for the confusion.
>>>
>>> Caspar
>>>
>>> 2018-02-27 14:11 GMT+01:00 David Turner >> >:
>>>
>>> If you're only using a 1GB DB partition, there is a very real
>>> possibility it's already 100% full. The safe estimate for DB
>>> size seams to be 10GB/1TB so for a 4TB osd a 40GB DB should work
>>> for most use cases (except loads and loads of small files).
>>> There are a few threads that mention how to check how much of
>>> your DB partition is in use. Once it's full, it spills over to
>>> the HDD.
>>>
>>>
>>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>>> > wrote:
>>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
>>> >:
>>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>>> >
>>> wrote:
>>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner
>>> >:
>>>
>>> Caspar, it looks like your idea should work.
>>> Worst case scenario seems like the osd wouldn't
>>> start, you'd put the old SSD back in and go back
>>> to the idea to weight them to 0, backfilling,
>>> then recreate the osds. Definitely with a try in
>>> my opinion, and I'd love to hear your experience
>>> after.
>>>
>>>
>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this
>>> ML, you're really putting a lot of effort into
>>> answering many questions asked here and very often
>>> they contain invaluable information.
>>>
>>>
>>> To follow up on this post i went out and built a
>>> very small (proxmox) cluster (3 OSD's per host) to
>>> test my suggestion of cloning the DB/WAL SDD. And it
>>> worked!
>>> Note: this was on Luminous v12.2.2 (all bluestore,
>>> ceph-disk based OSD's)
>>>
>>> Here's what i did on 1 node:
>>>
>>> 1) ceph osd set noout
>>> 2) systemctl stop osd.0; systemctl stop
>>> osd.1; systemctl stop osd.2
>>> 3) ddrescue -f 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Alfredo Deza
On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
 wrote:
> ... however, it would be nice if ceph-volume would also create the
> partitions for the WAL and/or DB if needed. Is there a special reason,
> why this is not implemented?

Yes, the reason is that this was one of the most painful points in
ceph-disk (code and maintenance-wise): to be in the business of
understanding partitions, sizes, requirements, and devices
is non-trivial.

One of the reasons ceph-disk did this was because it required quite a
hefty amount of "special sauce" on partitions so that these would be
discovered later by mechanisms that included udev.

If an admin wanted more flexibility, we decided that it had to be up
to the configuration management system (or whatever deployment mechanism)
to do so. For users that want a simplistic approach (in the case of
bluestore) we have a 1:1 mapping of device->logical volume->OSD.

On the ceph-volume side as well, implementing partitions meant to also
have a similar support for logical volumes, which have lots of
variations that can be supported and we were not willing to attempt to
support them all.

Even a small subset would inevitably bring up the question of "why is
setup X not supported by ceph-volume if setup Y is?"

Configuration management systems are better suited for handling these
situations, and we would prefer to offload that responsibility to
those systems.
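
As an illustration of that division of labour, a deployment tool (or the
admin) would prepare the DB volumes itself and only then call ceph-volume; a
hedged sketch with made-up names and sizes:

    vgcreate ceph-db /dev/nvme0n1        # SSD dedicated to DB/WAL
    lvcreate -L 40G -n db-osd0 ceph-db   # one DB LV per OSD
    ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db/db-osd0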

>
> Dietmar
>
>
> On 02/27/2018 04:25 PM, David Turner wrote:
>> Gotcha.  As a side note, that setting is only used by ceph-disk as
>> ceph-volume does not create partitions for the WAL or DB.  You need to
>> create those partitions manually if using anything other than a whole
>> block device when creating OSDs with ceph-volume.
>>
>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit > > wrote:
>>
>> David,
>>
>> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
>> just to inform other people that Ceph's default of 1GB is pretty low.
>> Now that i read my own sentence it indeed looks as if i was using
>> 1GB partitions, sorry for the confusion.
>>
>> Caspar
>>
>> 2018-02-27 14:11 GMT+01:00 David Turner > >:
>>
>> If you're only using a 1GB DB partition, there is a very real
>> possibility it's already 100% full. The safe estimate for DB
>> size seams to be 10GB/1TB so for a 4TB osd a 40GB DB should work
>> for most use cases (except loads and loads of small files).
>> There are a few threads that mention how to check how much of
>> your DB partition is in use. Once it's full, it spills over to
>> the HDD.
>>
>>
>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>> > wrote:
>>
>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
>> >:
>>
>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>> >
>> wrote:
>>
>> 2018-02-24 7:10 GMT+01:00 David Turner
>> >:
>>
>> Caspar, it looks like your idea should work.
>> Worst case scenario seems like the osd wouldn't
>> start, you'd put the old SSD back in and go back
>> to the idea to weight them to 0, backfilling,
>> then recreate the osds. Definitely with a try in
>> my opinion, and I'd love to hear your experience
>> after.
>>
>>
>> Hi David,
>>
>> First of all, thank you for ALL your answers on this
>> ML, you're really putting a lot of effort into
>> answering many questions asked here and very often
>> they contain invaluable information.
>>
>>
>> To follow up on this post i went out and built a
>> very small (proxmox) cluster (3 OSD's per host) to
>> test my suggestion of cloning the DB/WAL SDD. And it
>> worked!
>> Note: this was on Luminous v12.2.2 (all bluestore,
>> ceph-disk based OSD's)
>>
>> Here's what i did on 1 node:
>>
>> 1) ceph osd set noout
>> 2) systemctl stop osd.0; systemctl stop
>> osd.1; systemctl stop osd.2
>> 3) ddrescue -f -n -vv  
>> /root/clone-db.log
>> 4) removed the old SSD physically from the node
>> 5) checked with "ceph -s" and already saw HEALTH_OK
>> and 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Dietmar Rieder
... however, it would be nice if ceph-volume would also create the
partitions for the WAL and/or DB if needed. Is there a special reason
why this is not implemented?

Dietmar


On 02/27/2018 04:25 PM, David Turner wrote:
> Gotcha.  As a side note, that setting is only used by ceph-disk as
> ceph-volume does not create partitions for the WAL or DB.  You need to
> create those partitions manually if using anything other than a whole
> block device when creating OSDs with ceph-volume.
> 
> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit  > wrote:
> 
> David,
> 
> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
> just to inform other people that Ceph's default of 1GB is pretty low.
> Now that i read my own sentence it indeed looks as if i was using
> 1GB partitions, sorry for the confusion.
> 
> Caspar
> 
> 2018-02-27 14:11 GMT+01:00 David Turner  >:
> 
> If you're only using a 1GB DB partition, there is a very real
> possibility it's already 100% full. The safe estimate for DB
> size seams to be 10GB/1TB so for a 4TB osd a 40GB DB should work
> for most use cases (except loads and loads of small files).
> There are a few threads that mention how to check how much of
> your DB partition is in use. Once it's full, it spills over to
> the HDD.
> 
> 
> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
> > wrote:
> 
> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
> >:
> 
> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
> >
> wrote:
> 
> 2018-02-24 7:10 GMT+01:00 David Turner
> >:
> 
> Caspar, it looks like your idea should work.
> Worst case scenario seems like the osd wouldn't
> start, you'd put the old SSD back in and go back
> to the idea to weight them to 0, backfilling,
> then recreate the osds. Definitely with a try in
> my opinion, and I'd love to hear your experience
> after.
> 
> 
> Hi David,
> 
> First of all, thank you for ALL your answers on this
> ML, you're really putting a lot of effort into
> answering many questions asked here and very often
> they contain invaluable information.
> 
> 
> To follow up on this post i went out and built a
> very small (proxmox) cluster (3 OSD's per host) to
> test my suggestion of cloning the DB/WAL SDD. And it
> worked!
> Note: this was on Luminous v12.2.2 (all bluestore,
> ceph-disk based OSD's)
> 
> Here's what i did on 1 node:
> 
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop
> osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv  
> /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK
> and all OSD's up/in
> 6) ceph osd unset noout
> 
> I assume that once the ddrescue step is finished a
> 'partprobe' or something similar is triggered and
> udev finds the DB partitions on the new SSD and
> starts the OSD's again (kind of what happens during
> hotplug)
> So it is probably better to clone the SSD in another
> (non-ceph) system to not trigger any udev events.
> 
> I also tested a reboot after this and everything
> still worked.
> 
> 
> The old SSD was 120GB and the new is 256GB (cloning
> took around 4 minutes)
> Delta of data was very low because it was a test
> cluster.
> 
> All in all the OSD's in question were 'down' for
> only 5 minutes (so i stayed within the
> ceph_osd_down_out interval of the default 10 minutes
> and didn't actually need to set noout :)
> 
> 
> I kicked off a brief discussion about this with some of
> the BlueStore guys and they're aware of the problem with
> migrating 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread David Turner
Gotcha.  As a side note, that setting is only used by ceph-disk as
ceph-volume does not create partitions for the WAL or DB.  You need to
create those partitions manually if using anything other than a whole block
device when creating OSDs with ceph-volume.
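
A minimal sketch of that manual step (hypothetical device and size, assuming
a fresh GPT disk; double-check the sgdisk flags on your version):

    # create a 40G partition on the SSD for one OSD's block.db
    sgdisk --new=1:0:+40G --change-name=1:'ceph block.db' /dev/nvme0n1
    partprobe /dev/nvme0n1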

On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit  wrote:

> David,
>
> Yes i know, i use 20GB partitions for 2TB disks as journal. It was just to
> inform other people that Ceph's default of 1GB is pretty low.
> Now that i read my own sentence it indeed looks as if i was using 1GB
> partitions, sorry for the confusion.
>
> Caspar
>
> 2018-02-27 14:11 GMT+01:00 David Turner :
>
>> If you're only using a 1GB DB partition, there is a very real possibility
>> it's already 100% full. The safe estimate for DB size seams to be 10GB/1TB
>> so for a 4TB osd a 40GB DB should work for most use cases (except loads and
>> loads of small files). There are a few threads that mention how to check
>> how much of your DB partition is in use. Once it's full, it spills over to
>> the HDD.
>>
>>
>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit  wrote:
>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum :
>>>
 On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
 wrote:

> 2018-02-24 7:10 GMT+01:00 David Turner :
>
>> Caspar, it looks like your idea should work. Worst case scenario
>> seems like the osd wouldn't start, you'd put the old SSD back in and go
>> back to the idea to weight them to 0, backfilling, then recreate the 
>> osds.
>> Definitely with a try in my opinion, and I'd love to hear your experience
>> after.
>>
>>
> Hi David,
>
> First of all, thank you for ALL your answers on this ML, you're really
> putting a lot of effort into answering many questions asked here and very
> often they contain invaluable information.
>
>
> To follow up on this post i went out and built a very small (proxmox)
> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL 
> SDD.
> And it worked!
> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based
> OSD's)
>
> Here's what i did on 1 node:
>
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv   /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
> 6) ceph osd unset noout
>
> I assume that once the ddrescue step is finished a 'partprobe' or
> something similar is triggered and udev finds the DB partitions on the new
> SSD and starts the OSD's again (kind of what happens during hotplug)
> So it is probably better to clone the SSD in another (non-ceph) system
> to not trigger any udev events.
>
> I also tested a reboot after this and everything still worked.
>
>
> The old SSD was 120GB and the new is 256GB (cloning took around 4
> minutes)
> Delta of data was very low because it was a test cluster.
>
> All in all the OSD's in question were 'down' for only 5 minutes (so i
> stayed within the ceph_osd_down_out interval of the default 10 minutes and
> didn't actually need to set noout :)
>

 I kicked off a brief discussion about this with some of the BlueStore
 guys and they're aware of the problem with migrating across SSDs, but so
 far it's just a Trello card:
 https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
 They do confirm you should be okay with dd'ing things across, assuming
 symlinks get set up correctly as David noted.


>>> Great that it is on the radar to address. This method feels hacky.
>>>
>>>
 I've got some other bad news, though: BlueStore has internal metadata
 about the size of the block device it's using, so if you copy it onto a
 larger block device, it will not actually make use of the additional space.
 :(
 -Greg

>>>
>>> Yes, i was well aware of that, no problem. The reason was the smaller
>>> SSD sizes are simply not being made anymore or discontinued by the
>>> manufacturer.
>>> Would be nice though if the DB size could be resized in the future, the
>>> default 1GB DB size seems very small to me.
>>>
>>> Caspar
>>>
>>>


>
> Kind regards,
> Caspar
>
>
>
>> Nico, it is not possible to change the WAL or DB size, location, etc
>> after osd creation. If you want to change the configuration of the osd
>> after creation, you have to remove it from the cluster and recreate it.
>> There is no similar functionality to how you could move, recreate, etc
>> filesystem osd journals. I think this might be on the radar as a feature,
>> but I don't know for certain. I definitely consider 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Caspar Smit
David,

Yes I know, I use 20GB partitions as journals for 2TB disks. It was just to
inform other people that Ceph's default of 1GB is pretty low.
Now that I re-read my own sentence it indeed looks as if I was using 1GB
partitions, sorry for the confusion.

Caspar

2018-02-27 14:11 GMT+01:00 David Turner :

> If you're only using a 1GB DB partition, there is a very real possibility
> it's already 100% full. The safe estimate for DB size seams to be 10GB/1TB
> so for a 4TB osd a 40GB DB should work for most use cases (except loads and
> loads of small files). There are a few threads that mention how to check
> how much of your DB partition is in use. Once it's full, it spills over to
> the HDD.
>
>
> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit  wrote:
>
>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum :
>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
>>> wrote:
>>>
 2018-02-24 7:10 GMT+01:00 David Turner :

> Caspar, it looks like your idea should work. Worst case scenario seems
> like the osd wouldn't start, you'd put the old SSD back in and go back to
> the idea to weight them to 0, backfilling, then recreate the osds.
> Definitely with a try in my opinion, and I'd love to hear your experience
> after.
>
>
 Hi David,

 First of all, thank you for ALL your answers on this ML, you're really
 putting a lot of effort into answering many questions asked here and very
 often they contain invaluable information.


 To follow up on this post i went out and built a very small (proxmox)
 cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
 And it worked!
 Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based
 OSD's)

 Here's what i did on 1 node:

 1) ceph osd set noout
 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
 3) ddrescue -f -n -vv   /root/clone-db.log
 4) removed the old SSD physically from the node
 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
 6) ceph osd unset noout

 I assume that once the ddrescue step is finished a 'partprobe' or
 something similar is triggered and udev finds the DB partitions on the new
 SSD and starts the OSD's again (kind of what happens during hotplug)
 So it is probably better to clone the SSD in another (non-ceph) system
 to not trigger any udev events.

 I also tested a reboot after this and everything still worked.


 The old SSD was 120GB and the new is 256GB (cloning took around 4
 minutes)
 Delta of data was very low because it was a test cluster.

 All in all the OSD's in question were 'down' for only 5 minutes (so i
 stayed within the ceph_osd_down_out interval of the default 10 minutes and
 didn't actually need to set noout :)

>>>
>>> I kicked off a brief discussion about this with some of the BlueStore
>>> guys and they're aware of the problem with migrating across SSDs, but so
>>> far it's just a Trello card: https://trello.com/c/
>>> 9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>> They do confirm you should be okay with dd'ing things across, assuming
>>> symlinks get set up correctly as David noted.
>>>
>>>
>> Great that it is on the radar to address. This method feels hacky.
>>
>>
>>> I've got some other bad news, though: BlueStore has internal metadata
>>> about the size of the block device it's using, so if you copy it onto a
>>> larger block device, it will not actually make use of the additional space.
>>> :(
>>> -Greg
>>>
>>
>> Yes, i was well aware of that, no problem. The reason was the smaller SSD
>> sizes are simply not being made anymore or discontinued by the manufacturer.
>> Would be nice though if the DB size could be resized in the future, the
>> default 1GB DB size seems very small to me.
>>
>> Caspar
>>
>>
>>>
>>>

 Kind regards,
 Caspar



> Nico, it is not possible to change the WAL or DB size, location, etc
> after osd creation. If you want to change the configuration of the osd
> after creation, you have to remove it from the cluster and recreate it.
> There is no similar functionality to how you could move, recreate, etc
> filesystem osd journals. I think this might be on the radar as a feature,
> but I don't know for certain. I definitely consider it to be a regression
> of bluestore.
>
>
>
>
> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> A very interesting question and I would add the follow up question:
>>
>> Is there an easy way to add an external DB/WAL devices to an existing
>> OSD?
>>
>> I suspect that it might be something on the lines of:
>>
>> - stop osd
>> - create a link in 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread David Turner
If you're only using a 1GB DB partition, there is a very real possibility
it's already 100% full. The safe estimate for DB size seems to be 10GB per
1TB of OSD, so for a 4TB osd a 40GB DB should work for most use cases
(except loads and loads of small files). There are a few threads that
mention how to check how much of your DB partition is in use. Once it's
full, it spills over to the HDD.
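
One way to do that check (a sketch, assuming you can reach the OSD's admin
socket; osd.3 is just an example id) is via the BlueFS perf counters:

    ceph daemon osd.3 perf dump | grep -E '"db_(total|used)_bytes"'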

On Tue, Feb 27, 2018, 6:19 AM Caspar Smit  wrote:

> 2018-02-26 23:01 GMT+01:00 Gregory Farnum :
>
>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
>> wrote:
>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner :
>>>
 Caspar, it looks like your idea should work. Worst case scenario seems
 like the osd wouldn't start, you'd put the old SSD back in and go back to
 the idea to weight them to 0, backfilling, then recreate the osds.
 Definitely with a try in my opinion, and I'd love to hear your experience
 after.


>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this ML, you're really
>>> putting a lot of effort into answering many questions asked here and very
>>> often they contain invaluable information.
>>>
>>>
>>> To follow up on this post i went out and built a very small (proxmox)
>>> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
>>> And it worked!
>>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>>>
>>> Here's what i did on 1 node:
>>>
>>> 1) ceph osd set noout
>>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>>> 3) ddrescue -f -n -vv   /root/clone-db.log
>>> 4) removed the old SSD physically from the node
>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
>>> 6) ceph osd unset noout
>>>
>>> I assume that once the ddrescue step is finished a 'partprobe' or
>>> something similar is triggered and udev finds the DB partitions on the new
>>> SSD and starts the OSD's again (kind of what happens during hotplug)
>>> So it is probably better to clone the SSD in another (non-ceph) system
>>> to not trigger any udev events.
>>>
>>> I also tested a reboot after this and everything still worked.
>>>
>>>
>>> The old SSD was 120GB and the new is 256GB (cloning took around 4
>>> minutes)
>>> Delta of data was very low because it was a test cluster.
>>>
>>> All in all the OSD's in question were 'down' for only 5 minutes (so i
>>> stayed within the ceph_osd_down_out interval of the default 10 minutes and
>>> didn't actually need to set noout :)
>>>
>>
>> I kicked off a brief discussion about this with some of the BlueStore
>> guys and they're aware of the problem with migrating across SSDs, but so
>> far it's just a Trello card:
>> https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>> They do confirm you should be okay with dd'ing things across, assuming
>> symlinks get set up correctly as David noted.
>>
>>
> Great that it is on the radar to address. This method feels hacky.
>
>
>> I've got some other bad news, though: BlueStore has internal metadata
>> about the size of the block device it's using, so if you copy it onto a
>> larger block device, it will not actually make use of the additional space.
>> :(
>> -Greg
>>
>
> Yes, i was well aware of that, no problem. The reason was the smaller SSD
> sizes are simply not being made anymore or discontinued by the manufacturer.
> Would be nice though if the DB size could be resized in the future, the
> default 1GB DB size seems very small to me.
>
> Caspar
>
>
>>
>>
>>>
>>> Kind regards,
>>> Caspar
>>>
>>>
>>>
 Nico, it is not possible to change the WAL or DB size, location, etc
 after osd creation. If you want to change the configuration of the osd
 after creation, you have to remove it from the cluster and recreate it.
 There is no similar functionality to how you could move, recreate, etc
 filesystem osd journals. I think this might be on the radar as a feature,
 but I don't know for certain. I definitely consider it to be a regression
 of bluestore.




 On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
 nico.schottel...@ungleich.ch> wrote:

>
> A very interesting question and I would add the follow up question:
>
> Is there an easy way to add an external DB/WAL devices to an existing
> OSD?
>
> I suspect that it might be something on the lines of:
>
> - stop osd
> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
> - (maybe run some kind of osd mkfs ?)
> - start osd
>
> Has anyone done this so far or recommendations on how to do it?
>
> Which also makes me wonder: what is actually the format of WAL and
> BlockDB in bluestore? Is there any documentation available about it?
>
> Best,
>
> Nico
>
>
> Caspar Smit  writes:
>
> > Hi All,
> >

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Caspar Smit
2018-02-26 23:01 GMT+01:00 Gregory Farnum :

> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
> wrote:
>
>> 2018-02-24 7:10 GMT+01:00 David Turner :
>>
>>> Caspar, it looks like your idea should work. Worst case scenario seems
>>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>>> the idea to weight them to 0, backfilling, then recreate the osds.
>>> Definitely with a try in my opinion, and I'd love to hear your experience
>>> after.
>>>
>>>
>> Hi David,
>>
>> First of all, thank you for ALL your answers on this ML, you're really
>> putting a lot of effort into answering many questions asked here and very
>> often they contain invaluable information.
>>
>>
>> To follow up on this post i went out and built a very small (proxmox)
>> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
>> And it worked!
>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>>
>> Here's what i did on 1 node:
>>
>> 1) ceph osd set noout
>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>> 3) ddrescue -f -n -vv   /root/clone-db.log
>> 4) removed the old SSD physically from the node
>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
>> 6) ceph osd unset noout
>>
>> I assume that once the ddrescue step is finished a 'partprobe' or
>> something similar is triggered and udev finds the DB partitions on the new
>> SSD and starts the OSD's again (kind of what happens during hotplug)
>> So it is probably better to clone the SSD in another (non-ceph) system to
>> not trigger any udev events.
>>
>> I also tested a reboot after this and everything still worked.
>>
>>
>> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
>> Delta of data was very low because it was a test cluster.
>>
>> All in all the OSD's in question were 'down' for only 5 minutes (so i
>> stayed within the ceph_osd_down_out interval of the default 10 minutes and
>> didn't actually need to set noout :)
>>
>
> I kicked off a brief discussion about this with some of the BlueStore guys
> and they're aware of the problem with migrating across SSDs, but so far
> it's just a Trello card: https://trello.com/c/9cxTgG50/324-bluestore-add-
> remove-resize-wal-db
> They do confirm you should be okay with dd'ing things across, assuming
> symlinks get set up correctly as David noted.
>
>
Great that it is on the radar to address. This method feels hacky.


> I've got some other bad news, though: BlueStore has internal metadata
> about the size of the block device it's using, so if you copy it onto a
> larger block device, it will not actually make use of the additional space.
> :(
> -Greg
>

Yes, I was well aware of that, no problem. The reason is that the smaller SSD
sizes are simply not made anymore or have been discontinued by the manufacturer.
It would be nice though if the DB could be resized in the future; the
default 1GB DB size seems very small to me.

Caspar


>
>
>>
>> Kind regards,
>> Caspar
>>
>>
>>
>>> Nico, it is not possible to change the WAL or DB size, location, etc
>>> after osd creation. If you want to change the configuration of the osd
>>> after creation, you have to remove it from the cluster and recreate it.
>>> There is no similar functionality to how you could move, recreate, etc
>>> filesystem osd journals. I think this might be on the radar as a feature,
>>> but I don't know for certain. I definitely consider it to be a regression
>>> of bluestore.
>>>
>>>
>>>
>>>
>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>>> nico.schottel...@ungleich.ch> wrote:
>>>

 A very interesting question and I would add the follow up question:

 Is there an easy way to add an external DB/WAL devices to an existing
 OSD?

 I suspect that it might be something on the lines of:

 - stop osd
 - create a link in ...ceph/osd/ceph-XX/block.db to the target device
 - (maybe run some kind of osd mkfs ?)
 - start osd

 Has anyone done this so far or recommendations on how to do it?

 Which also makes me wonder: what is actually the format of WAL and
 BlockDB in bluestore? Is there any documentation available about it?

 Best,

 Nico


 Caspar Smit  writes:

 > Hi All,
 >
 > What would be the proper way to preventively replace a DB/WAL SSD
 (when it
 > is nearing it's DWPD/TBW limit and not failed yet).
 >
 > It hosts DB partitions for 5 OSD's
 >
 > Maybe something like:
 >
 > 1) ceph osd reweight 0 the 5 OSD's
 > 2) let backfilling complete
 > 3) destroy/remove the 5 OSD's
 > 4) replace SSD
 > 5) create 5 new OSD's with seperate DB partition on new SSD
 >
 > When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved
 so i
 > thought maybe the following would work:
 >
 > 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Caspar Smit
2018-02-26 18:02 GMT+01:00 David Turner :

> I'm glad that I was able to help out.  I wanted to point out that the
> reason those steps worked for you as quickly as they did is likely that you
> configured your block.db to use the /dev/disk/by-partuuid/{guid} path instead
> of /dev/sdx#.  Had you configured your osds with /dev/sdx#, then you would
> have needed to either modify them to point to the partuuid path or change
> them to the new device's name (which is a bad choice, as it will likely change
> on reboot).  Changing your path for block.db is as simple as `ln -sf
> /dev/disk/by-partuuid/{uuid} /var/lib/ceph/osd/ceph-#/block.db` and then
> restarting the osd to make sure that it can read from the new symlink
> location.
>
>
Yes, I (proxmox) used /dev/disk/by-partuuid/{guid} style links.


> I'm curious about your OSDs starting automatically after doing those steps
> as well.  I would guess you deployed them with ceph-disk instead of
> ceph-volume, is that right?  ceph-volume no longer uses udev rules and
> shouldn't have picked up these changes here.
>
>
Yes, ceph-disk based, so udev kicked in on the partprobe.

Caspar


> On Mon, Feb 26, 2018 at 6:23 AM Caspar Smit 
> wrote:
>
>> 2018-02-24 7:10 GMT+01:00 David Turner :
>>
>>> Caspar, it looks like your idea should work. Worst case scenario seems
>>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>>> the idea to weight them to 0, backfilling, then recreate the osds.
>>> Definitely with a try in my opinion, and I'd love to hear your experience
>>> after.
>>>
>>>
>> Hi David,
>>
>> First of all, thank you for ALL your answers on this ML, you're really
>> putting a lot of effort into answering many questions asked here and very
>> often they contain invaluable information.
>>
>>
>> To follow up on this post i went out and built a very small (proxmox)
>> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
>> And it worked!
>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>>
>> Here's what i did on 1 node:
>>
>> 1) ceph osd set noout
>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>> 3) ddrescue -f -n -vv   /root/clone-db.log
>> 4) removed the old SSD physically from the node
>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
>> 6) ceph osd unset noout
>>
>> I assume that once the ddrescue step is finished a 'partprobe' or
>> something similar is triggered and udev finds the DB partitions on the new
>> SSD and starts the OSD's again (kind of what happens during hotplug)
>> So it is probably better to clone the SSD in another (non-ceph) system to
>> not trigger any udev events.
>>
>> I also tested a reboot after this and everything still worked.
>>
>>
>> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
>> Delta of data was very low because it was a test cluster.
>>
>> All in all the OSD's in question were 'down' for only 5 minutes (so i
>> stayed within the ceph_osd_down_out interval of the default 10 minutes and
>> didn't actually need to set noout :)
>>
>> Kind regards,
>> Caspar
>>
>>
>>
>>> Nico, it is not possible to change the WAL or DB size, location, etc
>>> after osd creation. If you want to change the configuration of the osd
>>> after creation, you have to remove it from the cluster and recreate it.
>>> There is no similar functionality to how you could move, recreate, etc
>>> filesystem osd journals. I think this might be on the radar as a feature,
>>> but I don't know for certain. I definitely consider it to be a regression
>>> of bluestore.
>>>
>>>
>>>
>>>
>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>>> nico.schottel...@ungleich.ch> wrote:
>>>

 A very interesting question and I would add the follow up question:

 Is there an easy way to add an external DB/WAL devices to an existing
 OSD?

 I suspect that it might be something on the lines of:

 - stop osd
 - create a link in ...ceph/osd/ceph-XX/block.db to the target device
 - (maybe run some kind of osd mkfs ?)
 - start osd

 Has anyone done this so far or recommendations on how to do it?

 Which also makes me wonder: what is actually the format of WAL and
 BlockDB in bluestore? Is there any documentation available about it?

 Best,

 Nico


 Caspar Smit  writes:

 > Hi All,
 >
 > What would be the proper way to preventively replace a DB/WAL SSD
 (when it
 > is nearing it's DWPD/TBW limit and not failed yet).
 >
 > It hosts DB partitions for 5 OSD's
 >
 > Maybe something like:
 >
 > 1) ceph osd reweight 0 the 5 OSD's
 > 2) let backfilling complete
 > 3) destroy/remove the 5 OSD's
 > 4) replace SSD
 > 5) create 5 new OSD's with seperate DB partition on new SSD
 >
 > When these 5 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit  wrote:

> 2018-02-24 7:10 GMT+01:00 David Turner :
>
>> Caspar, it looks like your idea should work. Worst case scenario seems
>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>> the idea to weight them to 0, backfilling, then recreate the osds.
>> Definitely with a try in my opinion, and I'd love to hear your experience
>> after.
>>
>>
> Hi David,
>
> First of all, thank you for ALL your answers on this ML, you're really
> putting a lot of effort into answering many questions asked here and very
> often they contain invaluable information.
>
>
> To follow up on this post i went out and built a very small (proxmox)
> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
> And it worked!
> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>
> Here's what i did on 1 node:
>
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv   /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
> 6) ceph osd unset noout
>
> I assume that once the ddrescue step is finished a 'partprobe' or
> something similar is triggered and udev finds the DB partitions on the new
> SSD and starts the OSD's again (kind of what happens during hotplug)
> So it is probably better to clone the SSD in another (non-ceph) system to
> not trigger any udev events.
>
> I also tested a reboot after this and everything still worked.
>
>
> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
> Delta of data was very low because it was a test cluster.
>
> All in all the OSD's in question were 'down' for only 5 minutes (so i
> stayed within the ceph_osd_down_out interval of the default 10 minutes and
> didn't actually need to set noout :)
>

I kicked off a brief discussion about this with some of the BlueStore guys
and they're aware of the problem with migrating across SSDs, but so far
it's just a Trello card:
https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
They do confirm you should be okay with dd'ing things across, assuming
symlinks get set up correctly as David noted.

I've got some other bad news, though: BlueStore has internal metadata about
the size of the block device it's using, so if you copy it onto a larger
block device, it will not actually make use of the additional space. :(
-Greg
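
Later Ceph releases added a ceph-bluestore-tool subcommand aimed at exactly
this; a hedged sketch, to be verified against the version you actually run:

    systemctl stop ceph-osd@3    # hypothetical osd id
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-3
    systemctl start ceph-osd@3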


>
> Kind regards,
> Caspar
>
>
>
>> Nico, it is not possible to change the WAL or DB size, location, etc
>> after osd creation. If you want to change the configuration of the osd
>> after creation, you have to remove it from the cluster and recreate it.
>> There is no similar functionality to how you could move, recreate, etc
>> filesystem osd journals. I think this might be on the radar as a feature,
>> but I don't know for certain. I definitely consider it to be a regression
>> of bluestore.
>>
>>
>>
>>
>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>> nico.schottel...@ungleich.ch> wrote:
>>
>>>
>>> A very interesting question and I would add the follow up question:
>>>
>>> Is there an easy way to add an external DB/WAL devices to an existing
>>> OSD?
>>>
>>> I suspect that it might be something on the lines of:
>>>
>>> - stop osd
>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>> - (maybe run some kind of osd mkfs ?)
>>> - start osd
>>>
>>> Has anyone done this so far or recommendations on how to do it?
>>>
>>> Which also makes me wonder: what is actually the format of WAL and
>>> BlockDB in bluestore? Is there any documentation available about it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>> Caspar Smit  writes:
>>>
>>> > Hi All,
>>> >
>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>> (when it
>>> > is nearing it's DWPD/TBW limit and not failed yet).
>>> >
>>> > It hosts DB partitions for 5 OSD's
>>> >
>>> > Maybe something like:
>>> >
>>> > 1) ceph osd reweight 0 the 5 OSD's
>>> > 2) let backfilling complete
>>> > 3) destroy/remove the 5 OSD's
>>> > 4) replace SSD
>>> > 5) create 5 new OSD's with seperate DB partition on new SSD
>>> >
>>> > When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved
>>> so i
>>> > thought maybe the following would work:
>>> >
>>> > 1) ceph osd set noout
>>> > 2) stop the 5 OSD's (systemctl stop)
>>> > 3) 'dd' the old SSD to a new SSD of same or bigger size
>>> > 4) remove the old SSD
>>> > 5) start the 5 OSD's (systemctl start)
>>> > 6) let backfilling/recovery complete (only delta data between OSD stop
>>> and
>>> > now)
>>> > 6) ceph osd unset noout
>>> >
>>> > Would this be a viable method to replace a DB SSD? Any udev/serial
>>> nr/uuid
>>> > stuff preventing this to work?
>>> >
>>> > Or is there another 'less hacky' way to replace a DB SSD without
>>> moving too
>>> > much 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-26 Thread David Turner
I'm glad that I was able to help out.  I wanted to point out that the
reason those steps worked for you as quickly as they did is likely that you
configured your block.db to use the /dev/disk/by-partuuid/{guid} path instead
of /dev/sdx#.  Had you configured your osds with /dev/sdx#, then you would
have needed to either modify them to point to the partuuid path or change
them to the new device's name (which is a bad choice, as it will likely change
on reboot).  Changing your path for block.db is as simple as `ln -sf
/dev/disk/by-partuuid/{uuid} /var/lib/ceph/osd/ceph-#/block.db` and then
restarting the osd to make sure that it can read from the new symlink
location.

I'm curious about your OSDs starting automatically after doing those steps
as well.  I would guess you deployed them with ceph-disk instead of
ceph-volume, is that right?  ceph-volume no longer uses udev rules and
shouldn't have picked up these changes here.
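
Spelled out as commands (a sketch only; the osd id and {uuid} are
placeholders, and on most setups the symlink needs to stay owned by the ceph
user):

    systemctl stop ceph-osd@3
    ln -sf /dev/disk/by-partuuid/{uuid} /var/lib/ceph/osd/ceph-3/block.db
    chown -h ceph:ceph /var/lib/ceph/osd/ceph-3/block.db
    systemctl start ceph-osd@3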

On Mon, Feb 26, 2018 at 6:23 AM Caspar Smit  wrote:

> 2018-02-24 7:10 GMT+01:00 David Turner :
>
>> Caspar, it looks like your idea should work. Worst case scenario seems
>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>> the idea to weight them to 0, backfilling, then recreate the osds.
>> Definitely with a try in my opinion, and I'd love to hear your experience
>> after.
>>
>>
> Hi David,
>
> First of all, thank you for ALL your answers on this ML, you're really
> putting a lot of effort into answering many questions asked here and very
> often they contain invaluable information.
>
>
> To follow up on this post i went out and built a very small (proxmox)
> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
> And it worked!
> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>
> Here's what i did on 1 node:
>
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv   /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
> 6) ceph osd unset noout
>
> I assume that once the ddrescue step is finished a 'partprobe' or
> something similar is triggered and udev finds the DB partitions on the new
> SSD and starts the OSD's again (kind of what happens during hotplug)
> So it is probably better to clone the SSD in another (non-ceph) system to
> not trigger any udev events.
>
> I also tested a reboot after this and everything still worked.
>
>
> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
> Delta of data was very low because it was a test cluster.
>
> All in all the OSD's in question were 'down' for only 5 minutes (so i
> stayed within the ceph_osd_down_out interval of the default 10 minutes and
> didn't actually need to set noout :)
>
> Kind regards,
> Caspar
>
>
>
>> Nico, it is not possible to change the WAL or DB size, location, etc
>> after osd creation. If you want to change the configuration of the osd
>> after creation, you have to remove it from the cluster and recreate it.
>> There is no similar functionality to how you could move, recreate, etc
>> filesystem osd journals. I think this might be on the radar as a feature,
>> but I don't know for certain. I definitely consider it to be a regression
>> of bluestore.
>>
>>
>>
>>
>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>> nico.schottel...@ungleich.ch> wrote:
>>
>>>
>>> A very interesting question and I would add the follow up question:
>>>
>>> Is there an easy way to add an external DB/WAL devices to an existing
>>> OSD?
>>>
>>> I suspect that it might be something on the lines of:
>>>
>>> - stop osd
>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>> - (maybe run some kind of osd mkfs ?)
>>> - start osd
>>>
>>> Has anyone done this so far or recommendations on how to do it?
>>>
>>> Which also makes me wonder: what is actually the format of WAL and
>>> BlockDB in bluestore? Is there any documentation available about it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>> Caspar Smit  writes:
>>>
>>> > Hi All,
>>> >
>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>> (when it
>>> > is nearing it's DWPD/TBW limit and not failed yet).
>>> >
>>> > It hosts DB partitions for 5 OSD's
>>> >
>>> > Maybe something like:
>>> >
>>> > 1) ceph osd reweight 0 the 5 OSD's
>>> > 2) let backfilling complete
>>> > 3) destroy/remove the 5 OSD's
>>> > 4) replace SSD
>>> > 5) create 5 new OSD's with seperate DB partition on new SSD
>>> >
>>> > When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved
>>> so i
>>> > thought maybe the following would work:
>>> >
>>> > 1) ceph osd set noout
>>> > 2) stop the 5 OSD's (systemctl stop)
>>> > 3) 'dd' the old SSD to a new SSD of same or bigger size
>>> > 4) remove the old SSD
>>> > 5) start the 5 OSD's (systemctl start)
>>> > 6) let 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-26 Thread Caspar Smit
2018-02-24 7:10 GMT+01:00 David Turner :

> Caspar, it looks like your idea should work. Worst case scenario seems
> like the osd wouldn't start, you'd put the old SSD back in and go back to
> the idea to weight them to 0, backfilling, then recreate the osds.
> Definitely with a try in my opinion, and I'd love to hear your experience
> after.
>
>
Hi David,

First of all, thank you for ALL your answers on this ML, you're really
putting a lot of effort into answering many questions asked here and very
often they contain invaluable information.


To follow up on this post i went out and built a very small (proxmox)
cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
And it worked!
Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)

Here's what i did on 1 node:

1) ceph osd set noout
2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
3) ddrescue -f -n -vv   /root/clone-db.log
4) removed the old SSD physically from the node
5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
6) ceph osd unset noout

I assume that once the ddrescue step is finished a 'partprobe' or something
similar is triggered, udev finds the DB partitions on the new SSD and
starts the OSD's again (kind of what happens during hotplug).
So it is probably better to clone the SSD in another (non-ceph) system to
avoid triggering any udev events.

I also tested a reboot after this and everything still worked.


The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
Delta of data was very low because it was a test cluster.

All in all the OSD's in question were 'down' for only 5 minutes (so I
stayed within the mon_osd_down_out_interval default of 10 minutes and
didn't actually need to set noout :)
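
For readers following along, the same sequence with the device arguments
filled in (the originals were lost in the archive, so /dev/sdc and /dev/sdd
below are made up; check lsblk before running anything like this):

    ceph osd set noout
    systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2
    ddrescue -f -n -vv /dev/sdc /dev/sdd /root/clone-db.log  # old SSD -> new SSD
    # pull the old SSD, let udev re-detect the partitions, then:
    ceph -s              # expect HEALTH_OK with all OSDs up/in
    ceph osd unset noout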

Kind regards,
Caspar



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-23 Thread David Turner
Caspar, it looks like your idea should work. The worst case scenario seems to
be that the osd wouldn't start; you'd put the old SSD back in and fall back
to the approach of weighting them to 0, letting backfill finish, then
recreating the osds. Definitely worth a try in my opinion, and I'd love to
hear about your experience afterwards.

Nico, it is not possible to change the WAL or DB size, location, etc. after
osd creation. If you want to change the configuration of the osd after
creation, you have to remove it from the cluster and recreate it. There is
no equivalent of the way you could move or recreate filestore osd journals.
I think this might be on the radar as a feature, but I don't know for
certain. I definitely consider it a regression in bluestore.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-23 Thread Nico Schottelius

A very interesting question, and I would add the follow-up question:

Is there an easy way to add an external DB/WAL device to an existing
OSD?

I suspect that it might be something along the lines of:

- stop osd
- create a link in ...ceph/osd/ceph-XX/block.db to the target device
- (maybe run some kind of osd mkfs ?)
- start osd
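
For reference, on an existing bluestore OSD those pointers are just symlinks
in the OSD data directory (the osd id and targets below are only an
illustration):

  ls -l /var/lib/ceph/osd/ceph-12/
  # block    -> /dev/disk/by-partuuid/...   (main data device)
  # block.db -> /dev/disk/by-partuuid/...   (only present if a separate DB was set up at creation)

so creating the block.db symlink alone is probably not enough -- the DB/WAL
data itself would still have to end up on the new device somehow.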

Has anyone done this so far, or does anyone have recommendations on how to do it?

Which also makes me wonder: what is actually the format of WAL and
BlockDB in bluestore? Is there any documentation available about it?

Best,

Nico




--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-23 Thread Caspar Smit
Hi All,

What would be the proper way to preventively replace a DB/WAL SSD (when it
is nearing its DWPD/TBW limit and has not failed yet)?

It hosts DB partitions for 5 OSD's

Maybe something like:

1) ceph osd reweight 0 the 5 OSD's
2) let backfilling complete
3) destroy/remove the 5 OSD's
4) replace SSD
5) create 5 new OSD's with a separate DB partition on the new SSD
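
As a sketch (the OSD ids 10-14 and the device names are just placeholders,
not taken from a real setup):

  for id in 10 11 12 13 14; do ceph osd reweight $id 0; done
  # wait until backfilling is finished and the cluster is HEALTH_OK, then:
  for id in 10 11 12 13 14; do
      systemctl stop ceph-osd@$id
      ceph osd purge $id --yes-i-really-mean-it    # Luminous and newer
  done
  # swap the SSD, then recreate each OSD with its DB on the new device, e.g.:
  # ceph-disk prepare --bluestore /dev/sdX --block.db /dev/nvme0n1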

When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved, so I
thought maybe the following would work:

1) ceph osd set noout
2) stop the 5 OSD's (systemctl stop)
3) 'dd' the old SSD to a new SSD of same or bigger size
4) remove the old SSD
5) start the 5 OSD's (systemctl start)
6) let backfilling/recovery complete (only delta data between OSD stop and
now)
7) ceph osd unset noout

Would this be a viable method to replace a DB SSD? Is there any udev/serial
nr/uuid stuff preventing this from working?
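
A quick way to check, sketched with placeholder paths and devices (ceph-disk
OSDs reference the DB partition by partition UUID, and a raw clone copies the
GPT, so the PARTUUIDs should carry over to the new SSD):

  ls -l /var/lib/ceph/osd/ceph-0/block.db      # -> /dev/disk/by-partuuid/<uuid> on ceph-disk OSDs
  lsblk -o NAME,PARTUUID /dev/sdf /dev/sdg     # old vs. cloned SSD: the PARTUUIDs should match
  # note: two disks with identical PARTUUIDs attached at the same time can confuse udev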

Or is there another 'less hacky' way to replace a DB SSD without moving too
much data?

Kind regards,
Caspar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com