Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Reed Dier
These SSDs are definitely up to the task, 3-5 DWPD over 5 years; however, I mostly act out of an abundance of caution and try to minimize unnecessary data movement so as not to exacerbate wear.

I definitely could, I just err on the side of conservative wear.

Reed

Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Richard Hesketh
I would have thought that with the write endurance on modern SSDs,
additional write wear from the occasional rebalance would honestly be
negligible? If you're hitting them hard enough that you're actually
worried about your write endurance, a rebalance or two is peanuts
compared to your normal I/O. If you're not, then there's more than
enough write endurance in an SSD to handle daily rebalances for years.
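As a rough back-of-the-envelope (numbers purely illustrative - assume a 1.92 TB SSD rated at 3 DWPD for 5 years, running about 60% full):

# illustrative only: rated endurance vs. the writes from one full backfill
capacity_tb=1.92   # assumed drive size
dwpd=3             # assumed rated drive-writes-per-day
years=5
echo "rated endurance: $(echo "$capacity_tb * $dwpd * 365 * $years" | bc) TB"   # ~10500 TB
echo "one full backfill: $(echo "$capacity_tb * 0.6" | bc) TB"                  # ~1.15 TB, i.e. ~0.01% of that budget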

Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Reed Dier
This has been my modus operandi when replacing drives.

With only ~50 OSDs for each drive type/pool, rebalancing can be a lengthy 
process, and in the case of SSDs, shuffling data adds unnecessary write wear 
to the disks.

When migrating from filestore to bluestore, I would actually forklift an entire 
failure domain using the script below, together with the noout, norebalance and 
norecover flags.

This would keep crush from pushing data around until I had all of the drives 
replaced, and would then keep it from trying to recover until I was ready.

> # $1 = numeric OSD id, e.g. 12 (not "osd.12")
> # $2 = data device name without the /dev/ prefix, e.g. sdx
> # $3 = NVMe block.db partition name without the /dev/ prefix, e.g. nvmeXnXpX
> 
> # stop the OSD and flush its filestore journal before tearing it down
> sudo systemctl stop ceph-osd@$1.service
> sudo ceph-osd -i $1 --flush-journal
> sudo umount /var/lib/ceph/osd/ceph-$1
> # wipe the old data device so ceph-volume can reuse it
> sudo ceph-volume lvm zap /dev/$2
> # remove the old OSD from the cluster; its id becomes free for reuse
> ceph osd crush remove osd.$1
> ceph auth del osd.$1
> ceph osd rm osd.$1
> # recreate the OSD as bluestore, with its DB on the NVMe partition
> sudo ceph-volume lvm create --bluestore --data /dev/$2 --block.db /dev/$3

For a single drive, this stops it, removes it from crush, and creates a new OSD 
(letting it retake the old/existing osd.id). Then, once I unset the 
norebalance/norecover flags, it backfills from the other copies to the replaced 
drive and doesn't move data around elsewhere.
That script is somewhat specific to the filestore-to-bluestore migration, as the 
flush-journal command is no longer used with bluestore.
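
For reference, the flag handling around that script is roughly this - set the flags 
before touching the first OSD, and unset them once the whole failure domain has been 
recreated:

ceph osd set noout
ceph osd set norebalance
ceph osd set norecover
# ... run the replacement script for each OSD in the failure domain ...
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset noout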

Hope that's helpful.

Reed

Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Richard Hesketh
Waiting for rebalancing is considered the safest way, since it ensures
you retain your normal full number of replicas at all times. If you take
the disk out before rebalancing is complete, you will be causing some
PGs to lose a replica. That is a risk to your data redundancy, but it
might be an acceptable one if you prefer to just get the disk replaced
quickly.
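
For example, you can confirm that recovery has finished before pulling the disk - 
both of these should show every PG active+clean, with no backfilling or degraded 
PGs left:

ceph -s
ceph pg stat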

Personally, if running at 3+ replicas, briefly losing one isn't the end
of the world; you'd still need two more simultaneous disk failures to
actually lose data, though one failure would cause inactive PGs (because
you are running with min_size >= 2, right?). If running pools with only
two replicas at size = 2 I absolutely would not remove a disk without
waiting for rebalancing unless that disk was actively failing so badly
that it was making rebalancing impossible.
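
As a sanity check, you can read a pool's replication settings like this ("mypool" 
is just a placeholder):

ceph osd pool get mypool size
ceph osd pool get mypool min_size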

Rich


Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Josef Zelenka
Hi, our procedure is usually (assuming the cluster was OK before the failure, 
with 2 replicas as the crush rule):


1. Stop the OSD process (to keep it from coming up and down and putting 
load on the cluster)

2. Wait for the "Reweight" to come to 0 (happens after 5 min I think - it can 
be set manually, but I let it happen by itself)

3. Remove the OSD from the cluster (ceph auth del, ceph osd crush remove, 
ceph osd rm - roughly as sketched after the list)

4. Note down the journal partitions if needed

5. umount the drive, replace the disk with the new one

6. Ensure permissions are set to ceph:ceph in /dev

7. mklabel gpt on the new drive

8. Create the new OSD with ceph-disk prepare (this automatically adds it to 
the crush map)
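
Roughly, in commands - osd.12 and /dev/sdd here are purely hypothetical, adjust 
to your setup:

systemctl stop ceph-osd@12.service    # step 1
# step 2: the OSD gets marked out automatically once it has been down long enough
ceph auth del osd.12                  # step 3
ceph osd crush remove osd.12
ceph osd rm osd.12
# steps 4-6: note the journal partition, umount, swap the disk, fix /dev permissions
parted -s /dev/sdd mklabel gpt        # step 7
ceph-disk prepare /dev/sdd            # step 8; add the journal device as a second argument if you use one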



Your procedure sounds reasonable to me; as far as I'm concerned, you 
shouldn't have to wait for rebalancing after you remove the OSD. All 
this might not be 100% by the Ceph book, but it works for us :)


Josef


[ceph-users] Best way to replace OSD

2018-08-06 Thread Iztok Gregori

Hi Everyone,

What is the best way to replace a failing (SMART Health Status: 
HARDWARE IMPENDING FAILURE) OSD hard disk?


Normally I will:

1. set the OSD as out
2. wait for rebalancing
3. stop the OSD on the osd-server (unmount if needed)
4. purge the OSD from CEPH
5. physically replace the disk with the new one
6. with ceph-deploy:
6a   zap the new disk (just in case)
6b   create the new OSD
7. add the new osd to the crush map.
8. wait for rebalancing.
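
In commands, this is roughly the following - the OSD id, host name and device are 
hypothetical, and the ceph-deploy calls use the 2.x syntax:

ceph osd out 123                                  # step 1
# step 2: wait until backfill finishes and the cluster is HEALTH_OK again
systemctl stop ceph-osd@123.service               # step 3, on the osd-server
ceph osd purge 123 --yes-i-really-mean-it         # step 4
# step 5: physically swap the disk, then from the admin node:
ceph-deploy disk zap osd-host /dev/sdd            # step 6a
ceph-deploy osd create --data /dev/sdd osd-host   # step 6b (this also adds the OSD to the crush map, step 7)
# step 8: wait for rebalancing to finish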

My questions are:

- Is my procedure reasonable?
- What if I skip #2 and, instead of waiting for rebalancing, directly purge the OSD?
- Is it better to reweight the OSD before taking it out?

I'm running a Luminous (12.2.2) cluster with 332 OSDs; the failure domain is 
host.


Thanks,
Iztok

--
Iztok Gregori
ICT Systems and Services
Elettra - Sincrotrone Trieste S.C.p.A.
Telephone: +39 040 3758948
http://www.elettra.eu


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com