Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-24 Thread Oliver Freyermuth
On 23.02.2018 at 01:05, Gregory Farnum wrote:
> 
> 
> On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> 
> Dear Cephalopodians,
> 
> in a Luminous 12.2.3 cluster with a pool with:
> - 192 Bluestore OSDs total
> - 6 hosts (32 OSDs per host)
> - 2048 total PGs
> - EC profile k=4, m=2
> - CRUSH failure domain = host
> which results in 2048*6/192 = 64 PGs per OSD on average, I run into 
> issues with PG overdose protection.
> 
> In case I reinstall one OSD host (zapping all disks), and recreate the 
> OSDs one by one with ceph-volume,
> they will usually come back "slowly", i.e. one after the other.
> 
> This means the first OSD will initially be assigned all 2048 PGs (to 
> fulfill the "failure domain host" requirement),
> thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> We also use mon_max_pg_per_osd default, i.e. 200.
> 
> This appears to cause the previously active (but of course 
> undersized+degraded) PGs to enter an "activating+remapped" state,
> and hence they become unavailable.
> Thus, data availability is reduced. All this is caused by adding an OSD!
> 
> Of course, as more and more OSDs are added until all 32 are back online, 
> this situation is relaxed.
> Still, I observe that some PGs get stuck in this "activating" state, and 
> can't seem to figure out from logs or by dumping them
> what's the actual reason. Waiting does not help, PGs stay "activating", 
> data stays inaccessible.
> 
> 
> Can you upload logs from each of the OSDs that are (and should be, but 
> aren't) involved with one of the PGs that this happens to? (via ceph-post-file) And 
> create a ticket about it?
> 
> Once you have a good map, all the PGs should definitely activate themselves.
> -Greg

The ticket is here:
http://tracker.ceph.com/issues/23117
I have uploaded all logs from the freshly installed OSD host (including the OSD 
which was installed first and should have triggered overdose protection, and 
the OSD that should be involved in an example PG but was not), 
and from all hosts involved in one of the stuck "activating" PGs. 

Cheers,
Oliver

> 
> 
> Waiting a bit and manually restarting the ceph-OSD-services on the 
> reinstalled host seems to bring them back.
> Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 
> 10) appears to prevent the issue.
> 
> So my best guess is that this is related to PG overdose protection.
> Any ideas on how to best overcome this / similar observations?
> 
> It would be nice to be able to reinstall an OSD host without temporarily 
> making data unavailable,
> right now the only thing which comes to my mind is to effectively disable 
> PG overdose protection.
> 
> Cheers,
>         Oliver
> 


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-24 Thread Oliver Freyermuth
On 24.02.2018 at 07:14, David Turner wrote:
> There was another part to my suggestion which was to set the initial crush 
> weight to 0 in ceph.conf. After you add all of your osds, you could download 
> the crush map, weight the new osds to what they should be, and upload the 
> crush map to give them all the ability to take PGs at the same time. With 
> this method you never have any osds that can take PGs on the host until all 
> of them can.

I did indeed miss this part of the suggestion. 
Up to now, I have refrained from any manual edits of the CRUSH map, but made 
use of device classes and automatic CRUSH location updates - 
it seems to me the general direction in which Ceph is moving is to make it 
unnecessary to ever touch the CRUSH map, 
and even to obsolete ceph.conf at some point in the near future. 
Since there are already tools adjusting the weights (such as the 
balancer), it would also not be nice to have to intervene manually 
in this regard. 

Still, it seems very likely that manually adapting the weights would avoid the 
issue completely. 

However, until the issue is fixed, I'd prefer my hack 
(osd_max_pg_per_osd_hard_ratio = 32), which effectively turns off the hard 
overdose protection, over manual CRUSH map editing. In a cluster with almost 
200 OSDs, the manual approach would mean editing the CRUSH map each time 
I purge an OSD and add it anew. Right now, the HDDs are fresh, but as soon as 
they start to age and fail, this would become a cumbersome 
(and technically not really necessary) task. 
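
For reference, my understanding of the suggested procedure is roughly the 
following sketch (file names are arbitrary, and it assumes the recreated OSDs 
first come up with weight 0, e.g. via "osd crush initial weight = 0" in 
ceph.conf):

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and set the weights of all 32 new OSDs to their real size
$ crushtool -c crushmap.txt -o crushmap.new.bin
$ ceph osd setcrushmap -i crushmap.new.bin

Since the weights only change when the edited map is injected, all 32 OSDs 
become eligible for PGs in a single step - but it is exactly this manual step 
per reinstalled host that I would like to avoid. 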

I'll re-trigger the issue and upload logs as suggested by Greg soon-ish, maybe 
this issue will even be fixed before we have the first failing disk ;-). 

Cheers,
Oliver

> 
> On Thu, Feb 22, 2018, 7:14 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> 
> On 23.02.2018 at 01:05, Gregory Farnum wrote:
> >
> >
> > On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> >
> >     Dear Cephalopodians,
> >
> >     in a Luminous 12.2.3 cluster with a pool with:
> >     - 192 Bluestore OSDs total
> >     - 6 hosts (32 OSDs per host)
> >     - 2048 total PGs
> >     - EC profile k=4, m=2
> >     - CRUSH failure domain = host
> >     which results in 2048*6/192 = 64 PGs per OSD on average, I run into 
> issues with PG overdose protection.
> >
> >     In case I reinstall one OSD host (zapping all disks), and recreate 
> the OSDs one by one with ceph-volume,
> >     they will usually come back "slowly", i.e. one after the other.
> >
> >     This means the first OSD will initially be assigned all 2048 PGs 
> (to fulfill the "failure domain host" requirement),
> >     thus breaking through the default osd_max_pg_per_osd_hard_ratio of 
> 2.
> >     We also use mon_max_pg_per_osd default, i.e. 200.
> >
> >     This appears to cause the previously active (but of course 
> undersized+degraded) PGs to enter an "activating+remapped" state,
> >     and hence they become unavailable.
> >     Thus, data availability is reduced. All this is caused by adding an 
> OSD!
> >
> >     Of course, as more and more OSDs are added until all 32 are back 
> online, this situation is relaxed.
> >     Still, I observe that some PGs get stuck in this "activating" 
> state, and can't seem to figure out from logs or by dumping them
> >     what's the actual reason. Waiting does not help, PGs stay 
> "activating", data stays inaccessible.
> >
> >
> > Can you upload logs from each of the OSDs that are (and should be, but 
> aren't) involved with one of the PGs that happens to? (ceph-post-file) And 
> create a ticket about it?
> 
> I'll reproduce in the weekend and then capture the logs, at least I did 
> not see anything in there, but I also am not yet too much used to reading 
> them.
> 
> What I can already confirm for sure is that after I set:
> osd_max_pg_per_osd_hard_ratio = 32
> in ceph.conf (global) and deploy new OSD hosts with that, the problem has 
> fully vanished. I have already tested this with two machines.
> 
> Cheers,
> Oliver
> 
> >
> > Once you have a good map, all the PGs should definitely activate 
> themselves.
> > -Greg
> >
> >
> >     Waiting a bit and manually restarting the ceph-OSD-services on the 
> reinstalled host seems to bring them back.
> >     Also, adjusting osd_max_pg_per_osd_hard_ratio to something large 
> (e.g. 10) appears to prevent the issue.
> >
> >     So my best guess is that this is related to PG overdose protection.
> >     Any ideas on how to best overcome this / similar observations?
> >
> >     It would be nice to be able to reinstall an OSD host 

Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-24 Thread Vladimir Prokofev
In my case we don't even use the "default" ruleset in CRUSH - no pool has that
ruleset associated with it. Adding OSDs therefore doesn't lead to any PG
recalculation or data movement; it is triggered only after modifying the CRUSH
map and placing the OSDs in the appropriate failure domain.
This way you can add any number of OSDs at any time without worrying about
triggering data movement. It will only happen on your explicit command, when
you set the new CRUSH map.
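
Roughly sketched, with placeholder names and weights (not taken from any real
cluster), the idea is:

# pools use CRUSH rules rooted outside of "default", so OSDs sitting under
# "default" hold no data; the new OSDs can therefore be created normally:
$ ceph-volume lvm create --data /dev/sdX
# once all OSDs of the host exist, place them into the real hierarchy, either
# per OSD or, to move them all in one step, via an edited CRUSH map
# (getcrushmap / crushtool / setcrushmap):
$ ceph osd crush set osd.68 3.64 host=osd-host-1 root=ec-root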

2018-02-24 9:14 GMT+03:00 David Turner:

> There was another part to my suggestion which was to set the initial crush
> weight to 0 in ceph.conf. After you add all of your osds, you could
> download the crush map, weight the new osds to what they should be, and
> upload the crush map to give them all the ability to take PGs at the same
> time. With this method you never have any osds that can take PGs on the
> host until all of them can.
>
> On Thu, Feb 22, 2018, 7:14 PM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de> wrote:
>
>> On 23.02.2018 at 01:05, Gregory Farnum wrote:
>> >
>> >
>> > On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
>> >
>> > Dear Cephalopodians,
>> >
>> > in a Luminous 12.2.3 cluster with a pool with:
>> > - 192 Bluestore OSDs total
>> > - 6 hosts (32 OSDs per host)
>> > - 2048 total PGs
>> > - EC profile k=4, m=2
>> > - CRUSH failure domain = host
>> > which results in 2048*6/192 = 64 PGs per OSD on average, I run into
>> issues with PG overdose protection.
>> >
>> > In case I reinstall one OSD host (zapping all disks), and recreate
>> the OSDs one by one with ceph-volume,
>> > they will usually come back "slowly", i.e. one after the other.
>> >
>> > This means the first OSD will initially be assigned all 2048 PGs
>> (to fulfill the "failure domain host" requirement),
>> > thus breaking through the default osd_max_pg_per_osd_hard_ratio of
>> 2.
>> > We also use mon_max_pg_per_osd default, i.e. 200.
>> >
>> > This appears to cause the previously active (but of course
>> undersized+degraded) PGs to enter an "activating+remapped" state,
>> > and hence they become unavailable.
>> > Thus, data availability is reduced. All this is caused by adding an
>> OSD!
>> >
>> > Of course, as more and more OSDs are added until all 32 are back
>> online, this situation is relaxed.
>> > Still, I observe that some PGs get stuck in this "activating"
>> state, and can't seem to figure out from logs or by dumping them
>> > what's the actual reason. Waiting does not help, PGs stay
>> "activating", data stays inaccessible.
>> >
>> >
>> > Can you upload logs from each of the OSDs that are (and should be, but
>> aren't) involved with one of the PGs that happens to? (ceph-post-file) And
>> create a ticket about it?
>>
>> I'll reproduce in the weekend and then capture the logs, at least I did
>> not see anything in there, but I also am not yet too much used to reading
>> them.
>>
>> What I can already confirm for sure is that after I set:
>> osd_max_pg_per_osd_hard_ratio = 32
>> in ceph.conf (global) and deploy new OSD hosts with that, the problem has
>> fully vanished. I have already tested this with two machines.
>>
>> Cheers,
>> Oliver
>>
>> >
>> > Once you have a good map, all the PGs should definitely activate
>> themselves.
>> > -Greg
>> >
>> >
>> > Waiting a bit and manually restarting the ceph-OSD-services on the
>> reinstalled host seems to bring them back.
>> > Also, adjusting osd_max_pg_per_osd_hard_ratio to something large
>> (e.g. 10) appears to prevent the issue.
>> >
>> > So my best guess is that this is related to PG overdose protection.
>> > Any ideas on how to best overcome this / similar observations?
>> >
>> > It would be nice to be able to reinstall an OSD host without
>> temporarily making data unavailable,
>> > right now the only thing which comes to my mind is to effectively
>> disable PG overdose protection.
>> >
>> > Cheers,
>> > Oliver
>> >


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-23 Thread David Turner
There was another part to my suggestion which was to set the initial crush
weight to 0 in ceph.conf. After you add all of your osds, you could
download the crush map, weight the new osds to what they should be, and
upload the crush map to give them all the ability to take PGs at the same
time. With this method you never have any osds that can take PGs on the
host until all of them can.

On Thu, Feb 22, 2018, 7:14 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> On 23.02.2018 at 01:05, Gregory Farnum wrote:
> >
> >
> > On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> >
> > Dear Cephalopodians,
> >
> > in a Luminous 12.2.3 cluster with a pool with:
> > - 192 Bluestore OSDs total
> > - 6 hosts (32 OSDs per host)
> > - 2048 total PGs
> > - EC profile k=4, m=2
> > - CRUSH failure domain = host
> > which results in 2048*6/192 = 64 PGs per OSD on average, I run into
> issues with PG overdose protection.
> >
> > In case I reinstall one OSD host (zapping all disks), and recreate
> the OSDs one by one with ceph-volume,
> > they will usually come back "slowly", i.e. one after the other.
> >
> > This means the first OSD will initially be assigned all 2048 PGs (to
> fulfill the "failure domain host" requirement),
> > thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> > We also use mon_max_pg_per_osd default, i.e. 200.
> >
> > This appears to cause the previously active (but of course
> undersized+degraded) PGs to enter an "activating+remapped" state,
> > and hence they become unavailable.
> > Thus, data availability is reduced. All this is caused by adding an
> OSD!
> >
> > Of course, as more and more OSDs are added until all 32 are back
> online, this situation is relaxed.
> > Still, I observe that some PGs get stuck in this "activating" state,
> and can't seem to figure out from logs or by dumping them
> > what's the actual reason. Waiting does not help, PGs stay
> "activating", data stays inaccessible.
> >
> >
> > Can you upload logs from each of the OSDs that are (and should be, but
> aren't) involved with one of the PGs that happens to? (ceph-post-file) And
> create a ticket about it?
>
> I'll reproduce in the weekend and then capture the logs, at least I did
> not see anything in there, but I also am not yet too much used to reading
> them.
>
> What I can already confirm for sure is that after I set:
> osd_max_pg_per_osd_hard_ratio = 32
> in ceph.conf (global) and deploy new OSD hosts with that, the problem has
> fully vanished. I have already tested this with two machines.
>
> Cheers,
> Oliver
>
> >
> > Once you have a good map, all the PGs should definitely activate
> themselves.
> > -Greg
> >
> >
> > Waiting a bit and manually restarting the ceph-OSD-services on the
> reinstalled host seems to bring them back.
> > Also, adjusting osd_max_pg_per_osd_hard_ratio to something large
> (e.g. 10) appears to prevent the issue.
> >
> > So my best guess is that this is related to PG overdose protection.
> > Any ideas on how to best overcome this / similar observations?
> >
> > It would be nice to be able to reinstall an OSD host without
> temporarily making data unavailable,
> > right now the only thing which comes to my mind is to effectively
> disable PG overdose protection.
> >
> > Cheers,
> > Oliver
> >


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-22 Thread Oliver Freyermuth
On 23.02.2018 at 01:05, Gregory Farnum wrote:
> 
> 
> On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> 
> Dear Cephalopodians,
> 
> in a Luminous 12.2.3 cluster with a pool with:
> - 192 Bluestore OSDs total
> - 6 hosts (32 OSDs per host)
> - 2048 total PGs
> - EC profile k=4, m=2
> - CRUSH failure domain = host
> which results in 2048*6/192 = 64 PGs per OSD on average, I run into 
> issues with PG overdose protection.
> 
> In case I reinstall one OSD host (zapping all disks), and recreate the 
> OSDs one by one with ceph-volume,
> they will usually come back "slowly", i.e. one after the other.
> 
> This means the first OSD will initially be assigned all 2048 PGs (to 
> fulfill the "failure domain host" requirement),
> thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> We also use mon_max_pg_per_osd default, i.e. 200.
> 
> This appears to cause the previously active (but of course 
> undersized+degraded) PGs to enter an "activating+remapped" state,
> and hence they become unavailable.
> Thus, data availability is reduced. All this is caused by adding an OSD!
> 
> Of course, as more and more OSDs are added until all 32 are back online, 
> this situation is relaxed.
> Still, I observe that some PGs get stuck in this "activating" state, and 
> can't seem to figure out from logs or by dumping them
> what's the actual reason. Waiting does not help, PGs stay "activating", 
> data stays inaccessible.
> 
> 
> Can you upload logs from each of the OSDs that are (and should be, but 
> aren't) involved with one of the PGs that this happens to? (via ceph-post-file) And 
> create a ticket about it?

I'll reproduce this over the weekend and then capture the logs; at least I did not see 
anything in there so far, but I am also not yet very used to reading them. 

What I can already confirm for sure is that after I set:
osd_max_pg_per_osd_hard_ratio = 32
in ceph.conf (global) and deploy new OSD hosts with that, the problem has fully 
vanished. I have already tested this with two machines. 
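
For completeness, this is the snippet in question - the value 32 is simply 
"large enough": with mon_max_pg_per_osd = 200, the default hard ratio of 2 
corresponds to 400 PGs per OSD, which the transient 2048 PGs on the first 
recreated OSD exceed by far, while 200 * 32 = 6400 comfortably covers them:

[global]
# effectively disables the hard PG overdose limit during host reinstallation
osd_max_pg_per_osd_hard_ratio = 32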

Cheers,
Oliver

> 
> Once you have a good map, all the PGs should definitely activate themselves.
> -Greg
> 
> 
> Waiting a bit and manually restarting the ceph-OSD-services on the 
> reinstalled host seems to bring them back.
> Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 
> 10) appears to prevent the issue.
> 
> So my best guess is that this is related to PG overdose protection.
> Any ideas on how to best overcome this / similar observations?
> 
> It would be nice to be able to reinstall an OSD host without temporarily 
> making data unavailable,
> right now the only thing which comes to my mind is to effectively disable 
> PG overdose protection.
> 
> Cheers,
>         Oliver
> 


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-22 Thread Gregory Farnum
On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Dear Cephalopodians,
>
> in a Luminous 12.2.3 cluster with a pool with:
> - 192 Bluestore OSDs total
> - 6 hosts (32 OSDs per host)
> - 2048 total PGs
> - EC profile k=4, m=2
> - CRUSH failure domain = host
> which results in 2048*6/192 = 64 PGs per OSD on average, I run into issues
> with PG overdose protection.
>
> In case I reinstall one OSD host (zapping all disks), and recreate the
> OSDs one by one with ceph-volume,
> they will usually come back "slowly", i.e. one after the other.
>
> This means the first OSD will initially be assigned all 2048 PGs (to
> fulfill the "failure domain host" requirement),
> thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> We also use mon_max_pg_per_osd default, i.e. 200.
>
> This appears to cause the previously active (but of course
> undersized+degraded) PGs to enter an "activating+remapped" state,
> and hence they become unavailable.
> Thus, data availability is reduced. All this is caused by adding an OSD!
>
> Of course, as more and more OSDs are added until all 32 are back online,
> this situation is relaxed.
> Still, I observe that some PGs get stuck in this "activating" state, and
> can't seem to figure out from logs or by dumping them
> what's the actual reason. Waiting does not help, PGs stay "activating",
> data stays inaccessible.
>

Can you upload logs from each of the OSDs that are (and should be, but
aren't) involved with one of the PGs that this happens to? (via ceph-post-file) And
create a ticket about it?
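
Something along these lines should do - the description text and the log path
are just examples, adjust them to whatever matches your setup:

$ ceph-post-file -d 'PGs stuck activating after OSD host reinstall' /var/log/ceph/ceph-osd.*.log

ceph-post-file should print a post id that you can then reference in the
tracker ticket.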

Once you have a good map, all the PGs should definitely activate themselves.
-Greg


> Waiting a bit and manually restarting the ceph-OSD-services on the
> reinstalled host seems to bring them back.
> Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 10)
> appears to prevent the issue.
>
> So my best guess is that this is related to PG overdose protection.
> Any ideas on how to best overcome this / similar observations?
>
> It would be nice to be able to reinstall an OSD host without temporarily
> making data unavailable,
> right now the only thing which comes to my mind is to effectively disable
> PG overdose protection.
>
> Cheers,
> Oliver
>


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-22 Thread Oliver Freyermuth
On 22.02.2018 at 02:54, David Turner wrote:
> You could set the flag noin to prevent the new osds from being calculated by 
> crush until you are ready for all of them on the host to be marked in. 
> You can also set the initial crush weight to 0 for new osds so that they won't 
> receive any PGs until you're ready for it.

I tried this just now for the next reinstallation and it did not help. Here's 
what I did:

$ ceph osd set noin
# Shutdown to-be-reinstalled host, purge old OSDs, reinstall host, create new 
OSDs
$ ceph osd unset noin
=> nothing happens, new OSDs are obviously "up", but not "in". 

Now I have to put them in somehow. 
What I did was:
$ for i in {68..99}; do ceph osd in osd.${i}; done

And I ended up with the very same problem, since there is of course a delay 
between the first OSD going "in"
and the second OSD going "in". It seems our mons are fast enough to recalculate 
the crush map within this small delay,
then "PG overdose protection" kicks in (via osd_max_pg_per_osd_hard_ratio), 
many PGs enter "activating+undersized+degraded+remapped" or 
"activating+remapped" state and get stuck in this condition,
and I end up with about 100 PGs being inactive and data availability being 
reduced (just by adding a host!). 
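
One thing I have not tried yet (so this is just an idea, not a verified 
workaround): "ceph osd in" accepts several ids at once, so marking all new 
OSDs in with a single command might avoid the intermediate states:

$ ceph osd in osd.{68..99}

I do not know whether the mons really apply that as a single osdmap change, 
though, so I would not rely on it. 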

So it seems to me the only solution to prevent data unavailability in such a 
(probably common?) setup when you want to reinstall a host 
is to effectively disable overdose protection, or at least the 
"osd_max_pg_per_osd_hard_ratio" limit. 

If that really is the case, maybe the documentation should contain a huge 
warning that this has to be done during reinstallation of a full OSD host 
if the total number of OSD hosts matches k+m of an EC pool. 

Alternatively, it would be nice if the "activating" PGs would at least recover 
at some point without manual intervention. 

Cheers,
Oliver

> On Wed, Feb 21, 2018, 5:46 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> 
> Dear Cephalopodians,
> 
> in a Luminous 12.2.3 cluster with a pool with:
> - 192 Bluestore OSDs total
> - 6 hosts (32 OSDs per host)
> - 2048 total PGs
> - EC profile k=4, m=2
> - CRUSH failure domain = host
> which results in 2048*6/192 = 64 PGs per OSD on average, I run into 
> issues with PG overdose protection.
> 
> In case I reinstall one OSD host (zapping all disks), and recreate the 
> OSDs one by one with ceph-volume,
> they will usually come back "slowly", i.e. one after the other.
> 
> This means the first OSD will initially be assigned all 2048 PGs (to 
> fulfill the "failure domain host" requirement),
> thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> We also use mon_max_pg_per_osd default, i.e. 200.
> 
> This appears to cause the previously active (but of course 
> undersized+degraded) PGs to enter an "activating+remapped" state,
> and hence they become unavailable.
> Thus, data availability is reduced. All this is caused by adding an OSD!
> 
> Of course, as more and more OSDs are added until all 32 are back online, 
> this situation is relaxed.
> Still, I observe that some PGs get stuck in this "activating" state, and 
> can't seem to figure out from logs or by dumping them
> what's the actual reason. Waiting does not help, PGs stay "activating", 
> data stays inaccessible.
> 
> Waiting a bit and manually restarting the ceph-OSD-services on the 
> reinstalled host seems to bring them back.
> Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 
> 10) appears to prevent the issue.
> 
> So my best guess is that this is related to PG overdose protection.
> Any ideas on how to best overcome this / similar observations?
> 
> It would be nice to be able to reinstall an OSD host without temporarily 
> making data unavailable,
> right now the only thing which comes to my mind is to effectively disable 
> PG overdose protection.
> 
> Cheers,
>         Oliver
> 


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-21 Thread David Turner
You could set the flag noin to prevent the new osds from being calculated
by crush until you are ready for all of them on the host to be marked in.
You can also set the initial crush weight to 0 for new osds so that they won't
receive any PGs until you're ready for it.
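
Roughly, and just as a sketch (the weight below is an example value, not one
from your cluster):

# option 1: keep the new osds from being marked in until the host is complete
$ ceph osd set noin
# ... recreate all osds on the host ...
$ ceph osd unset noin

# option 2: let them come up with crush weight 0 and weight them later
# in ceph.conf, before creating the osds:
#   [global]
#   osd crush initial weight = 0
# afterwards, per osd (or for all of them at once via an edited crush map):
$ ceph osd crush reweight osd.68 3.64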

On Wed, Feb 21, 2018, 5:46 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Dear Cephalopodians,
>
> in a Luminous 12.2.3 cluster with a pool with:
> - 192 Bluestore OSDs total
> - 6 hosts (32 OSDs per host)
> - 2048 total PGs
> - EC profile k=4, m=2
> - CRUSH failure domain = host
> which results in 2048*6/192 = 64 PGs per OSD on average, I run into issues
> with PG overdose protection.
>
> In case I reinstall one OSD host (zapping all disks), and recreate the
> OSDs one by one with ceph-volume,
> they will usually come back "slowly", i.e. one after the other.
>
> This means the first OSD will initially be assigned all 2048 PGs (to
> fulfill the "failure domain host" requirement),
> thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> We also use mon_max_pg_per_osd default, i.e. 200.
>
> This appears to cause the previously active (but of course
> undersized+degraded) PGs to enter an "activating+remapped" state,
> and hence they become unavailable.
> Thus, data availability is reduced. All this is caused by adding an OSD!
>
> Of course, as more and more OSDs are added until all 32 are back online,
> this situation is relaxed.
> Still, I observe that some PGs get stuck in this "activating" state, and
> can't seem to figure out from logs or by dumping them
> what's the actual reason. Waiting does not help, PGs stay "activating",
> data stays inaccessible.
>
> Waiting a bit and manually restarting the ceph-OSD-services on the
> reinstalled host seems to bring them back.
> Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 10)
> appears to prevent the issue.
>
> So my best guess is that this is related to PG overdose protection.
> Any ideas on how to best overcome this / similar observations?
>
> It would be nice to be able to reinstall an OSD host without temporarily
> making data unavailable,
> right now the only thing which comes to my mind is to effectively disable
> PG overdose protection.
>
> Cheers,
> Oliver
>