[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-30 Thread Eugen Block

I created a tracker issue, maybe that will get some attention:

https://tracker.ceph.com/issues/61861

Zitat von Michel Jouvin :


Hi Eugen,

Thank you very much for these detailed tests, which match what I observed and reported earlier. I'm happy to see that we have the same understanding of how it should work (based on the documentation). Is there any other way than this list to get in contact with the plugin developers? It seems they are not following this (very high-volume) list... Or could somebody pass the email thread on to one of them?


Help would be really appreciated. Cheers,

Michel

Le 19/06/2023 à 14:09, Eugen Block a écrit :
Hi, I have a real hardware cluster for testing available now. I'm  
not sure whether I'm completely misunderstanding how it's supposed  
to work or if it's a bug in the LRC plugin.
This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use only 15 of them so that I can recover if one node fails.
Given that I need one additional locality chunk per DC, I need a profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:


ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4  
crush-failure-domain=host crush-locality=room crush-device-class=hdd
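(For reference, the chunk count works out as: k + m = 12 data/coding chunks plus one local parity chunk per group of l = 4, i.e. 12 / 4 = 3 locality chunks, one per DC, giving 15 chunks in total and matching the 15 hosts.)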


I created a pool with only one PG to make the output more readable.
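For reference, a minimal sketch of how such a test pool can be created from the profile and the chunk placement checked (the pool name lrcpool is only an example, not taken from this thread):

ceph osd pool create lrcpool 1 1 erasure lrc1
ceph pg ls-by-pool lrcpool        # the ACTING column lists one OSD per chunk
ceph osd crush rule dump lrcpool  # the rule generated from the LRC profile (usually named after the pool)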

This profile should allow the cluster to sustain the loss of three chunks; the results are interesting. This is what I tested:


1. I stopped all OSDs on one host and the PG was still active with  
one missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being  
marked as "down". That was unexpected since with m=3 I expected the  
PG to still be active but degraded. Before test #3 I started all  
OSDs to have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing and  
the PG was still active.
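A sketch of the commands that can be used to follow the PG state during such tests (again assuming a pool called lrcpool; <pgid> is whatever ID the previous commands report):

ceph pg ls-by-pool lrcpool   # per-PG state: active, down, degraded, ...
ceph health detail           # lists inactive/down PGs with their acting sets
ceph pg <pgid> query         # detailed peering information for a single PG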


Apparently, this profile is able to sustain the loss of m chunks, but not of an entire DC. I get the impression (and I also discussed this with a colleague) that this LRC implementation is either designed only to cope with the loss of single OSDs, which can then be recovered more quickly from fewer surviving OSDs while saving bandwidth, or this is a bug: according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (it does not reflect my k, m, l requirements; it is just here for reference):


chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD
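For reference, the low-level form behind this kind of layered profile in the documentation [1] looks roughly like the following; this is the documentation's 8-chunk sample (k=4, m=2, l=3), not the k=9/m=3/l=4 profile above:

ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     mapping=__DD__DD \
     layers='[
               [ "_cDD_cDD", "" ],
               [ "cDDD____", "" ],
               [ "____cDDD", "" ],
             ]'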

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 fails as well, step 1 still contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1, and therefore the PG stays down although there are enough surviving chunks. I'm not sure whether my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.


Thanks,
Eugen

[1]  
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration


Zitat von Michel Jouvin :


Hi,

 I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really  
urgent. If you manage to do some tests that help me to understand  
the problem I remain interested. I propose to keep this thread  
for that.


Zitat, I shared my crush map in the email you answered if the  
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding.

What is your current crush steps rule? I know you made changes since your first post and had some thoughts I wanted to share, but wanted to see your rule first so I could try 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Eugen Block

Hi,
adding the dev mailing list, hopefully someone there can chime in. But  
apparently the LRC code hasn't been maintained for a few years  
(https://github.com/ceph/ceph/tree/main/src/erasure-code/lrc). Let's  
see...


Zitat von Michel Jouvin :


Hi Eugen,

Thank you very much for these detailed tests, which match what I observed and reported earlier. I'm happy to see that we have the same understanding of how it should work (based on the documentation). Is there any other way than this list to get in contact with the plugin developers? It seems they are not following this (very high-volume) list... Or could somebody pass the email thread on to one of them?


Help would be really appreciated. Cheers,

Michel

Le 19/06/2023 à 14:09, Eugen Block a écrit :
Hi, I have a real hardware cluster for testing available now. I'm  
not sure whether I'm completely misunderstanding how it's supposed  
to work or if it's a bug in the LRC plugin.
This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use only 15 of them so that I can recover if one node fails.
Given that I need one additional locality chunk per DC, I need a profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:


ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4  
crush-failure-domain=host crush-locality=room crush-device-class=hdd


I created a pool with only one PG to make the output more readable.

This profile should allow the cluster to sustain the loss of three chunks; the results are interesting. This is what I tested:


1. I stopped all OSDs on one host and the PG was still active with  
one missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being  
marked as "down". That was unexpected since with m=3 I expected the  
PG to still be active but degraded. Before test #3 I started all  
OSDs to have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing and  
the PG was still active.


Apparently, this profile is able to sustain the loss of m chunks, but not of an entire DC. I get the impression (and I also discussed this with a colleague) that this LRC implementation is either designed only to cope with the loss of single OSDs, which can then be recovered more quickly from fewer surviving OSDs while saving bandwidth, or this is a bug: according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (it does not reflect my k, m, l requirements; it is just here for reference):


chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 fails as well, step 1 still contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1, and therefore the PG stays down although there are enough surviving chunks. I'm not sure whether my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.


Thanks,
Eugen

[1]  
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration


Zitat von Michel Jouvin :


Hi,

 I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really  
urgent. If you manage to do some tests that help me to understand  
the problem I remain interested. I propose to keep this thread  
for that.


Zitat, I shared my crush map in the email you answered if the  
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding.

What is your current crush steps rule? I know you made 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Michel Jouvin

Hi Eugen,

Thank you very much for these detailed tests, which match what I observed and reported earlier. I'm happy to see that we have the same understanding of how it should work (based on the documentation). Is there any other way than this list to get in contact with the plugin developers? It seems they are not following this (very high-volume) list... Or could somebody pass the email thread on to one of them?


Help would be really appreciated. Cheers,

Michel

Le 19/06/2023 à 14:09, Eugen Block a écrit :
Hi, I have a real hardware cluster for testing available now. I'm not 
sure whether I'm completely misunderstanding how it's supposed to work 
or if it's a bug in the LRC plugin.
This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use only 15 of them so that I can recover if one node fails.
Given that I need one additional locality chunk per DC, I need a profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:


ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4 
crush-failure-domain=host crush-locality=room crush-device-class=hdd


I created a pool with only one PG to make the output more readable.

This profile should allow the cluster to sustain the loss of three chunks; the results are interesting. This is what I tested:


1. I stopped all OSDs on one host and the PG was still active with one 
missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being 
marked as "down". That was unexpected since with m=3 I expected the PG 
to still be active but degraded. Before test #3 I started all OSDs to 
have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing and 
the PG was still active.


Apparently, this profile is able to sustain the loss of m chunks, but not of an entire DC. I get the impression (and I also discussed this with a colleague) that this LRC implementation is either designed only to cope with the loss of single OSDs, which can then be recovered more quickly from fewer surviving OSDs while saving bandwidth, or this is a bug: according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (it does not reflect my k, m, l requirements; it is just here for reference):


chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 fails as well, step 1 still contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1, and therefore the PG stays down although there are enough surviving chunks. I'm not sure whether my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.


Thanks,
Eugen

[1] 
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration


Zitat von Michel Jouvin :


Hi,

 I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really urgent. 
If you manage to do some tests that help me to understand the 
problem I remain interested. I propose to keep this thread for that.


Zitat, I shared my crush map in the email you answered if the 
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding.

What is your current crush steps rule? I know you made changes since your first post and had some thoughts I wanted to share, but wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center min, but that could be my lack of 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Eugen Block
Hi, I have a real hardware cluster for testing available now. I'm not  
sure whether I'm completely misunderstanding how it's supposed to work  
or if it's a bug in the LRC plugin.
This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use only 15 of them so that I can recover if one node fails.
Given that I need one additional locality chunk per DC, I need a profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:


ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4  
crush-failure-domain=host crush-locality=room crush-device-class=hdd


I created a pool with only one PG to make the output more readable.

This profile should allow the cluster to sustain the loss of three chunks; the results are interesting. This is what I tested:


1. I stopped all OSDs on one host and the PG was still active with one  
missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being  
marked as "down". That was unexpected since with m=3 I expected the PG  
to still be active but degraded. Before test #3 I started all OSDs to  
have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing and  
the PG was still active.


Apparently, this profile is able to sustain the loss of m chunks, but not of an entire DC. I get the impression (and I also discussed this with a colleague) that this LRC implementation is either designed only to cope with the loss of single OSDs, which can then be recovered more quickly from fewer surviving OSDs while saving bandwidth, or this is a bug: according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (it does not reflect my k, m, l requirements; it is just here for reference):


chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 fails as well, step 1 still contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1, and therefore the PG stays down although there are enough surviving chunks. I'm not sure whether my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.


Thanks,
Eugen

[1]  
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration


Zitat von Michel Jouvin :


Hi,

 I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really urgent.  
If you manage to do some tests that help me to understand the  
problem I remain interested. I propose to keep this thread for that.


Zitat, I shared my crush map in the email you answered if the  
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding.

What is your current crush steps rule? I know you made changes since your first post and had some thoughts I wanted to share, but wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center min, but that could be my lack of knowledge on some of the step rules.

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jou...@ijclab.in2p3.fr> wrote:


Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may
clutter the discussion if inline).

If somebody on the list has some clue about the LRC plugin, I'm still interested in understanding what I'm doing wrong!

Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi,

I don't think you've shared your osd tree 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-26 Thread Michel Jouvin

Hi,

 I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really urgent. If 
you manage to do some tests that help me to understand the problem I 
remain interested. I propose to keep this thread for that.


Zitat, I shared my crush map in the email you answered if the 
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding.

What is your current crush steps rule? I know you made changes since your first post and had some thoughts I wanted to share, but wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center min, but that could be my lack of knowledge on some of the step rules.

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jou...@ijclab.in2p3.fr> wrote:


Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may
clutter the discussion if inline).

If somebody on the list has some clue about the LRC plugin, I'm still interested in understanding what I'm doing wrong!

Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi,

I don't think you've shared your osd tree yet, could you do that?
Apparently nobody else but us reads this thread or nobody reading this
uses the LRC plugin. ;-)

Thanks,
Eugen

Zitat von Michel Jouvin :


Hi,

I had to restart one of my OSD servers today and the problem showed up
again. This time I managed to capture "ceph health detail" output
showing the problem with the 2 PGs:

[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2
pgs down
pg 56.1 is down, acting
[208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
pg 56.12 is down, acting
[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
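One way to dig into why such a PG stays down (a sketch using one of the PG IDs from the output above; the exact JSON layout varies between releases):

ceph pg 56.1 query > pg-56.1.json
# the "recovery_state" section explains why peering is blocked,
# e.g. which down OSDs the PG would still need to probe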


I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case it is only 2 OSDs down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done...).

Still interested in an explanation of what I'm doing wrong! Best regards,

Michel

Le 03/05/2023 à 10:21, Eugen Block a écrit :

I think I got it wrong with the locality setting, I'm still limited
by the number of hosts I have available in my test cluster, but as
far as I got with failure-domain=osd I believe k=6, m=3, l=3 with
locality=datacenter could fit your requirement, at least with
regards to the recovery bandwidth usage between DCs, but the
resiliency would not match your requirement (one DC failure). That
profile creates 3 groups of 4 chunks (3 data/coding chunks and one
parity chunk) across three DCs, in total 12 chunks. The min_size=7
would not allow an entire DC to go down, I'm afraid, you'd have to
reduce it to 6 to allow reads/writes in a disaster scenario. I'm
still not sure if I got it right this time, but maybe you're better
off without the LRC plugin with the limited number of hosts. Instead
you could use the jerasure plugin with a profile like k=4 m=5
allowing an entire DC to fail without losing data access (we have
one customer using that).
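For reference, min_size can be checked and, accepting the reduced safety margin, lowered on the pool; a sketch (the pool name is an example):

ceph osd pool get lrcpool min_size
ceph osd pool set lrcpool min_size 6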

Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might
be some misunderstandings on my side. But I tried to play around
with one of my test clusters (Nautilus). Because I'm limited in the
number of hosts (6 across 3 virtual DCs) I tried two different
profiles with lower numbers to get a feeling for how that works.

# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
k=4 m=2 l=3 crush-failure-domain=host

For every third OSD one parity chunk is added, so 2 more chunks to
store ==> 8 chunks in total. Since my failure-domain is host and I
only have 6 I get incomplete PGs.

# second attempt
ceph:~ # ceph osd 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-21 Thread Michel Jouvin

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really urgent. If you 
manage to do some tests that help me to understand the problem I remain 
interested. I propose to keep this thread for that.


Zitat, I shared my crush map in the email you answered if the attachment 
was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile
Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a unique use
case to expand my knowledge. I don't use LRC or anything outside basic
erasure coding.

What is your current crush steps rule?  I know you made changes since your
first post and had some thoughts I wanted to share, but wanted to see your
rule first so I could try to visualize the distribution better.  The only
way I can currently visualize it working is with more servers, I'm thinking
6 or 9 per data center min, but that could be my lack of knowledge on some
of the step rules.

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jou...@ijclab.in2p3.fr> wrote:


Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may
clutter the discussion if inline).

If somebody on the list has some clue about the LRC plugin, I'm still interested in understanding what I'm doing wrong!

Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi,

I don't think you've shared your osd tree yet, could you do that?
Apparently nobody else but us reads this thread or nobody reading this
uses the LRC plugin. ;-)

Thanks,
Eugen

Zitat von Michel Jouvin :


Hi,

I had to restart one of my OSD servers today and the problem showed up
again. This time I managed to capture "ceph health detail" output
showing the problem with the 2 PGs:

[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2
pgs down
pg 56.1 is down, acting
[208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
pg 56.12 is down, acting
[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]


I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case it is only 2 OSDs down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done...).

Still interested in an explanation of what I'm doing wrong! Best regards,

Michel

Le 03/05/2023 à 10:21, Eugen Block a écrit :

I think I got it wrong with the locality setting, I'm still limited
by the number of hosts I have available in my test cluster, but as
far as I got with failure-domain=osd I believe k=6, m=3, l=3 with
locality=datacenter could fit your requirement, at least with
regards to the recovery bandwidth usage between DCs, but the
resiliency would not match your requirement (one DC failure). That
profile creates 3 groups of 4 chunks (3 data/coding chunks and one
parity chunk) across three DCs, in total 12 chunks. The min_size=7
would not allow an entire DC to go down, I'm afraid, you'd have to
reduce it to 6 to allow reads/writes in a disaster scenario. I'm
still not sure if I got it right this time, but maybe you're better
off without the LRC plugin with the limited number of hosts. Instead
you could use the jerasure plugin with a profile like k=4 m=5
allowing an entire DC to fail without losing data access (we have
one customer using that).

Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might
be some misunderstandings on my side. But I tried to play around
with one of my test clusters (Nautilus). Because I'm limited in the
number of hosts (6 across 3 virtual DCs) I tried two different
profiles with lower numbers to get a feeling for how that works.

# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
k=4 m=2 l=3 crush-failure-domain=host

For every third OSD one parity chunk is added, so 2 more chunks to
store ==> 8 chunks in total. Since my failure-domain is host and I
only have 6 I get incomplete PGs.
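(As a rule of thumb for these simple LRC profiles: total chunks = k + m + (k+m)/l, so k=4/m=2/l=3 gives 6 + 2 = 8 chunks, which is why this attempt needs 8 hosts with failure domain host, while k=2/m=2/l=2 below gives 4 + 2 = 6.)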

# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
k=2 m=2 l=2 crush-failure-domain=host

This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES*
OMAP_KEYS* LOG STATESINCE VERSION REPORTED
UPACTING SCRUB_STAMP DEEP_SCRUB_STAMP
50.0   10 0   0   

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-18 Thread Eugen Block
Hi, I don’t have a good explanation for this yet, but I’ll soon get  
the opportunity to play around with a decommissioned cluster. I’ll try  
to get a better understanding of the LRC plugin, but it might take  
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I  
don’t have anything to confirm it yet.


Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a unique use
case to expand my knowledge. I don't use LRC or anything outside basic
erasure coding.

What is your current crush steps rule?  I know you made changes since your
first post and had some thoughts I wanted to share, but wanted to see your
rule first so I could try to visualize the distribution better.  The only
way I can currently visualize it working is with more servers, I'm thinking
6 or 9 per data center min, but that could be my lack of knowledge on some
of the step rules.
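A sketch of how the rule in question could be dumped for the list (names are examples; for an EC pool the generated rule usually carries the pool's name):

ceph osd crush rule ls
ceph osd crush rule dump <rule-name>
# or share the full decompiled crush map:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt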

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jou...@ijclab.in2p3.fr> wrote:


Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may
clutter the discussion if inline).

If somebody on the list has some clue about the LRC plugin, I'm still interested in understanding what I'm doing wrong!

Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :
> Hi,
>
> I don't think you've shared your osd tree yet, could you do that?
> Apparently nobody else but us reads this thread or nobody reading this
> uses the LRC plugin. ;-)
>
> Thanks,
> Eugen
>
> Zitat von Michel Jouvin :
>
>> Hi,
>>
>> I had to restart one of my OSD server today and the problem showed up
>> again. This time I managed to capture "ceph health detail" output
>> showing the problem with the 2 PGs:
>>
>> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2
>> pgs down
>> pg 56.1 is down, acting
>> [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
>> pg 56.12 is down, acting
>>
[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
>>
>> I still doesn't understand why, if I am supposed to survive to a
>> datacenter failure, I cannot survive to 3 OSDs down on the same host,
>> hosting shards for the PG. In the second case it is only 2 OSDs down
>> but I'm surprised they don't seem in the same "group" of OSD (I'd
>> expected all the the OSDs of one datacenter to be in the same groupe
>> of 5 if the order given really reflects the allocation done...
>>
>> Still interested by some explanation on what I'm doing wrong! Best
>> regards,
>>
>> Michel
>>
>> Le 03/05/2023 à 10:21, Eugen Block a écrit :
>>> I think I got it wrong with the locality setting, I'm still limited
>>> by the number of hosts I have available in my test cluster, but as
>>> far as I got with failure-domain=osd I believe k=6, m=3, l=3 with
>>> locality=datacenter could fit your requirement, at least with
>>> regards to the recovery bandwidth usage between DCs, but the
>>> resiliency would not match your requirement (one DC failure). That
>>> profile creates 3 groups of 4 chunks (3 data/coding chunks and one
>>> parity chunk) across three DCs, in total 12 chunks. The min_size=7
>>> would not allow an entire DC to go down, I'm afraid, you'd have to
>>> reduce it to 6 to allow reads/writes in a disaster scenario. I'm
>>> still not sure if I got it right this time, but maybe you're better
>>> off without the LRC plugin with the limited number of hosts. Instead
>>> you could use the jerasure plugin with a profile like k=4 m=5
>>> allowing an entire DC to fail without losing data access (we have
>>> one customer using that).
>>>
>>> Zitat von Eugen Block :
>>>
 Hi,

 disclaimer: I haven't used LRC in a real setup yet, so there might
 be some misunderstandings on my side. But I tried to play around
 with one of my test clusters (Nautilus). Because I'm limited in the
 number of hosts (6 across 3 virtual DCs) I tried two different
 profiles with lower numbers to get a feeling for how that works.

 # first attempt
 ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
 k=4 m=2 l=3 crush-failure-domain=host

 For every third OSD one parity chunk is added, so 2 more chunks to
 store ==> 8 chunks in total. Since my failure-domain is host and I
 only have 6 I get incomplete PGs.

 # second attempt
 ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
 k=2 m=2 l=2 crush-failure-domain=host

 This gives me 6 chunks in total to store across 6 hosts which works:

 ceph:~ # ceph pg ls-by-pool lrcpool
 PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES*
 OMAP_KEYS* LOG STATESINCE VERSION REPORTED
 UPACTING SCRUB_STAMP DEEP_SCRUB_STAMP
 50.0   10 0   0   619 0  0 1
 active+clean   72s 18410'1 18415:54 [27,13,0,2,25,7]p27
 [27,13,0,2,25,7]p27 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-17 Thread Curt
Hi,

I've been following this thread with interest as it seems like a unique use
case to expand my knowledge. I don't use LRC or anything outside basic
erasure coding.

What is your current crush steps rule?  I know you made changes since your
first post and had some thoughts I wanted to share, but wanted to see your
rule first so I could try to visualize the distribution better.  The only
way I can currently visualize it working is with more servers, I'm thinking
6 or 9 per data center min, but that could be my lack of knowledge on some
of the step rules.

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jou...@ijclab.in2p3.fr> wrote:

> Hi Eugen,
>
> Yes, sure, no problem to share it. I attach it to this email (as it may
> clutter the discussion if inline).
>
> If somebody on the list has some clue on the LRC plugin, I'm still
> interested by understand what I'm doing wrong!
>
> Cheers,
>
> Michel
>
> Le 04/05/2023 à 15:07, Eugen Block a écrit :
> > Hi,
> >
> > I don't think you've shared your osd tree yet, could you do that?
> > Apparently nobody else but us reads this thread or nobody reading this
> > uses the LRC plugin. ;-)
> >
> > Thanks,
> > Eugen
> >
> > Zitat von Michel Jouvin :
> >
> >> Hi,
> >>
> >> I had to restart one of my OSD server today and the problem showed up
> >> again. This time I managed to capture "ceph health detail" output
> >> showing the problem with the 2 PGs:
> >>
> >> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2
> >> pgs down
> >> pg 56.1 is down, acting
> >> [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
> >> pg 56.12 is down, acting
> >>
> [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
> >>
> >> I still doesn't understand why, if I am supposed to survive to a
> >> datacenter failure, I cannot survive to 3 OSDs down on the same host,
> >> hosting shards for the PG. In the second case it is only 2 OSDs down
> >> but I'm surprised they don't seem in the same "group" of OSD (I'd
> >> expected all the the OSDs of one datacenter to be in the same groupe
> >> of 5 if the order given really reflects the allocation done...
> >>
> >> Still interested by some explanation on what I'm doing wrong! Best
> >> regards,
> >>
> >> Michel
> >>
> >> Le 03/05/2023 à 10:21, Eugen Block a écrit :
> >>> I think I got it wrong with the locality setting, I'm still limited
> >>> by the number of hosts I have available in my test cluster, but as
> >>> far as I got with failure-domain=osd I believe k=6, m=3, l=3 with
> >>> locality=datacenter could fit your requirement, at least with
> >>> regards to the recovery bandwidth usage between DCs, but the
> >>> resiliency would not match your requirement (one DC failure). That
> >>> profile creates 3 groups of 4 chunks (3 data/coding chunks and one
> >>> parity chunk) across three DCs, in total 12 chunks. The min_size=7
> >>> would not allow an entire DC to go down, I'm afraid, you'd have to
> >>> reduce it to 6 to allow reads/writes in a disaster scenario. I'm
> >>> still not sure if I got it right this time, but maybe you're better
> >>> off without the LRC plugin with the limited number of hosts. Instead
> >>> you could use the jerasure plugin with a profile like k=4 m=5
> >>> allowing an entire DC to fail without losing data access (we have
> >>> one customer using that).
> >>>
> >>> Zitat von Eugen Block :
> >>>
>  Hi,
> 
>  disclaimer: I haven't used LRC in a real setup yet, so there might
>  be some misunderstandings on my side. But I tried to play around
>  with one of my test clusters (Nautilus). Because I'm limited in the
>  number of hosts (6 across 3 virtual DCs) I tried two different
>  profiles with lower numbers to get a feeling for how that works.
> 
>  # first attempt
>  ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
>  k=4 m=2 l=3 crush-failure-domain=host
> 
>  For every third OSD one parity chunk is added, so 2 more chunks to
>  store ==> 8 chunks in total. Since my failure-domain is host and I
>  only have 6 I get incomplete PGs.
> 
>  # second attempt
>  ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
>  k=2 m=2 l=2 crush-failure-domain=host
> 
>  This gives me 6 chunks in total to store across 6 hosts which works:
> 
>  ceph:~ # ceph pg ls-by-pool lrcpool
>  PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES*
>  OMAP_KEYS* LOG STATESINCE VERSION REPORTED
>  UPACTING SCRUB_STAMP DEEP_SCRUB_STAMP
>  50.0   10 0   0   619 0  0 1
>  active+clean   72s 18410'1 18415:54 [27,13,0,2,25,7]p27
>  [27,13,0,2,25,7]p27 2023-05-02 14:53:54.322135 2023-05-02
>  14:53:54.322135
>  50.1   00 0   0 0 0  0 0
>  active+clean6m 0'0 18414:26 [27,33,22,6,13,34]p27
>  

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-16 Thread Michel Jouvin

Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may 
clutter the discussion if inline).


If somebody on the list has some clue about the LRC plugin, I'm still interested in understanding what I'm doing wrong!


Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi,

I don't think you've shared your osd tree yet, could you do that? 
Apparently nobody else but us reads this thread or nobody reading this 
uses the LRC plugin. ;-)


Thanks,
Eugen

Zitat von Michel Jouvin :


Hi,

I had to restart one of my OSD servers today and the problem showed up
again. This time I managed to capture "ceph health detail" output 
showing the problem with the 2 PGs:


[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 
pgs down
    pg 56.1 is down, acting 
[208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting 
[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]


I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case it is only 2 OSDs down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done...).


Still interested in an explanation of what I'm doing wrong! Best regards,


Michel

Le 03/05/2023 à 10:21, Eugen Block a écrit :
I think I got it wrong with the locality setting, I'm still limited 
by the number of hosts I have available in my test cluster, but as 
far as I got with failure-domain=osd I believe k=6, m=3, l=3 with 
locality=datacenter could fit your requirement, at least with 
regards to the recovery bandwidth usage between DCs, but the 
resiliency would not match your requirement (one DC failure). That 
profile creates 3 groups of 4 chunks (3 data/coding chunks and one 
parity chunk) across three DCs, in total 12 chunks. The min_size=7 
would not allow an entire DC to go down, I'm afraid, you'd have to 
reduce it to 6 to allow reads/writes in a disaster scenario. I'm 
still not sure if I got it right this time, but maybe you're better 
off without the LRC plugin with the limited number of hosts. Instead 
you could use the jerasure plugin with a profile like k=4 m=5 
allowing an entire DC to fail without losing data access (we have 
one customer using that).


Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might 
be some misunderstandings on my side. But I tried to play around 
with one of my test clusters (Nautilus). Because I'm limited in the 
number of hosts (6 across 3 virtual DCs) I tried two different 
profiles with lower numbers to get a feeling for how that works.


# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc 
k=4 m=2 l=3 crush-failure-domain=host


For every third OSD one parity chunk is added, so 2 more chunks to 
store ==> 8 chunks in total. Since my failure-domain is host and I 
only have 6 I get incomplete PGs.


# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc 
k=2 m=2 l=2 crush-failure-domain=host


This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                     ACTING                 SCRUB_STAMP                 DEEP_SCRUB_STAMP
50.0  1        0         0          0        619    0            0           1    active+clean  72s    18410'1  18415:54  [27,13,0,2,25,7]p27    [27,13,0,2,25,7]p27    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.1  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18414:26  [27,33,22,6,13,34]p27  [27,33,22,6,13,34]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.2  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:25  [1,28,14,4,31,21]p1    [1,28,14,4,31,21]p1    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.3  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:24  [8,16,26,33,7,25]p8    [8,16,26,33,7,25]p8    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135


After stopping all OSDs on one host I was still able to read and 
write into the pool, but after stopping a second host one PG from 
that pool went "down". That I don't fully understand yet, but I 
just started to look into it.
With your setup (12 hosts) I would recommend not utilizing all of them so you have capacity to recover, let's say one "spare" host per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make sense here, resulting in 9 total chunks (one more parity chunk for every other OSD), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation.
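A sketch of what such a profile could look like; the profile name and the crush-locality/crush-failure-domain values are assumptions based on this discussion, not something tested here:

ceph osd erasure-code-profile set lrc_k3m3l2 plugin=lrc k=3 m=3 l=2 \
     crush-locality=datacenter crush-failure-domain=host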


Regards,
Eugen


[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Frank Schilder
Yep, reading but not using LRC. Please keep it on the ceph user list for future 
reference -- thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Thursday, May 4, 2023 3:07 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Help needed to configure erasure coding LRC plugin

Hi,

I don't think you've shared your osd tree yet, could you do that?
Apparently nobody else but us reads this thread or nobody reading this
uses the LRC plugin. ;-)

Thanks,
Eugen

Zitat von Michel Jouvin :

> Hi,
>
> I had to restart one of my OSD server today and the problem showed
> up again. This time I managed to capture "ceph health detail" output
> showing the problem with the 2 PGs:
>
> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
> pg 56.1 is down, acting
> [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
> pg 56.12 is down, acting
> [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
>
> I still doesn't understand why, if I am supposed to survive to a
> datacenter failure, I cannot survive to 3 OSDs down on the same
> host, hosting shards for the PG. In the second case it is only 2
> OSDs down but I'm surprised they don't seem in the same "group" of
> OSD (I'd expected all the the OSDs of one datacenter to be in the
> same groupe of 5 if the order given really reflects the allocation
> done...
>
> Still interested by some explanation on what I'm doing wrong! Best regards,
>
> Michel
>
> Le 03/05/2023 à 10:21, Eugen Block a écrit :
>> I think I got it wrong with the locality setting, I'm still limited
>> by the number of hosts I have available in my test cluster, but as
>> far as I got with failure-domain=osd I believe k=6, m=3, l=3 with
>> locality=datacenter could fit your requirement, at least with
>> regards to the recovery bandwidth usage between DCs, but the
>> resiliency would not match your requirement (one DC failure). That
>> profile creates 3 groups of 4 chunks (3 data/coding chunks and one
>> parity chunk) across three DCs, in total 12 chunks. The min_size=7
>> would not allow an entire DC to go down, I'm afraid, you'd have to
>> reduce it to 6 to allow reads/writes in a disaster scenario. I'm
>> still not sure if I got it right this time, but maybe you're better
>> off without the LRC plugin with the limited number of hosts.
>> Instead you could use the jerasure plugin with a profile like k=4
>> m=5 allowing an entire DC to fail without losing data access (we
>> have one customer using that).
>>
>> Zitat von Eugen Block :
>>
>>> Hi,
>>>
>>> disclaimer: I haven't used LRC in a real setup yet, so there might
>>> be some misunderstandings on my side. But I tried to play around
>>> with one of my test clusters (Nautilus). Because I'm limited in
>>> the number of hosts (6 across 3 virtual DCs) I tried two different
>>> profiles with lower numbers to get a feeling for how that works.
>>>
>>> # first attempt
>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
>>> k=4 m=2 l=3 crush-failure-domain=host
>>>
>>> For every third OSD one parity chunk is added, so 2 more chunks to
>>> store ==> 8 chunks in total. Since my failure-domain is host and I
>>> only have 6 I get incomplete PGs.
>>>
>>> # second attempt
>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
>>> k=2 m=2 l=2 crush-failure-domain=host
>>>
>>> This gives me 6 chunks in total to store across 6 hosts which works:
>>>
>>> ceph:~ # ceph pg ls-by-pool lrcpool
>>> PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES*
>>> OMAP_KEYS* LOG STATESINCE VERSION REPORTED
>>> UPACTING SCRUB_STAMP
>>> DEEP_SCRUB_STAMP
>>> 50.0   10 0   0   619 0  0   1
>>> active+clean   72s 18410'1 18415:54 [27,13,0,2,25,7]p27
>>> [27,13,0,2,25,7]p27 2023-05-02 14:53:54.322135 2023-05-02
>>> 14:53:54.322135
>>> 50.1   00 0   0 0 0  0   0
>>> active+clean6m 0'0 18414:26 [27,33,22,6,13,34]p27
>>> [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02
>>> 14:53:54.322135
>>> 50.2   00 0   0 0 0  0   0
>>> active+clean6m 0'0 18413:25 [1,28,14,4,31,21]p1
>>> [1,28,14,4,31,21]p1 2023-05-02 14:53:54.322135 2023-05-02
>>> 14:53:54.322135
>>> 50.3   00 0   

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Eugen Block

Hi,

I don't think you've shared your osd tree yet, could you do that?  
Apparently nobody else but us reads this thread or nobody reading this  
uses the LRC plugin. ;-)


Thanks,
Eugen

Zitat von Michel Jouvin :


Hi,

I had to restart one of my OSD servers today and the problem showed
up again. This time I managed to capture "ceph health detail" output  
showing the problem with the 2 PGs:


[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
    pg 56.1 is down, acting  
[208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting  
[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]


I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case it is only 2 OSDs down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done...).


Still interested in an explanation of what I'm doing wrong! Best regards,

Michel

Le 03/05/2023 à 10:21, Eugen Block a écrit :
I think I got it wrong with the locality setting, I'm still limited  
by the number of hosts I have available in my test cluster, but as  
far as I got with failure-domain=osd I believe k=6, m=3, l=3 with  
locality=datacenter could fit your requirement, at least with  
regards to the recovery bandwidth usage between DCs, but the  
resiliency would not match your requirement (one DC failure). That  
profile creates 3 groups of 4 chunks (3 data/coding chunks and one  
parity chunk) across three DCs, in total 12 chunks. The min_size=7  
would not allow an entire DC to go down, I'm afraid, you'd have to  
reduce it to 6 to allow reads/writes in a disaster scenario. I'm  
still not sure if I got it right this time, but maybe you're better  
off without the LRC plugin with the limited number of hosts.  
Instead you could use the jerasure plugin with a profile like k=4  
m=5 allowing an entire DC to fail without losing data access (we  
have one customer using that).


Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might  
be some misunderstandings on my side. But I tried to play around  
with one of my test clusters (Nautilus). Because I'm limited in  
the number of hosts (6 across 3 virtual DCs) I tried two different  
profiles with lower numbers to get a feeling for how that works.


# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc  
k=4 m=2 l=3 crush-failure-domain=host


For every third OSD one parity chunk is added, so 2 more chunks to  
store ==> 8 chunks in total. Since my failure-domain is host and I  
only have 6 I get incomplete PGs.


# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc  
k=2 m=2 l=2 crush-failure-domain=host


This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                     ACTING                 SCRUB_STAMP                 DEEP_SCRUB_STAMP
50.0  1        0         0          0        619    0            0           1    active+clean  72s    18410'1  18415:54  [27,13,0,2,25,7]p27    [27,13,0,2,25,7]p27    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.1  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18414:26  [27,33,22,6,13,34]p27  [27,33,22,6,13,34]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.2  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:25  [1,28,14,4,31,21]p1    [1,28,14,4,31,21]p1    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.3  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:24  [8,16,26,33,7,25]p8    [8,16,26,33,7,25]p8    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135


After stopping all OSDs on one host I was still able to read and  
write into the pool, but after stopping a second host one PG from  
that pool went "down". That I don't fully understand yet, but I  
just started to look into it.
With your setup (12 hosts) I would recommend not utilizing all of them so you have capacity to recover, let's say one "spare" host per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make sense here, resulting in 9 total chunks (one more parity chunk for every other OSD), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation.


Regards,
Eugen

Zitat von Michel Jouvin :


Hi,

No... our current setup is 3 datacenters with the same  
configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each.  
Thus the total is 12 OSD servers. As with the LRC plugin, k+m 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Michel Jouvin

Hi,

I had to restart one of my OSD servers today and the problem showed up
again. This time I managed to capture "ceph health detail" output 
showing the problem with the 2 PGs:


[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
    pg 56.1 is down, acting 
[208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting 
[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]


I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case it is only 2 OSDs down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done...).


Still interested in an explanation of what I'm doing wrong! Best regards,

Michel

Le 03/05/2023 à 10:21, Eugen Block a écrit :
I think I got it wrong with the locality setting, I'm still limited by 
the number of hosts I have available in my test cluster, but as far as 
I got with failure-domain=osd I believe k=6, m=3, l=3 with 
locality=datacenter could fit your requirement, at least with regards 
to the recovery bandwidth usage between DCs, but the resiliency would 
not match your requirement (one DC failure). That profile creates 3 
groups of 4 chunks (3 data/coding chunks and one parity chunk) across 
three DCs, in total 12 chunks. The min_size=7 would not allow an 
entire DC to go down, I'm afraid, you'd have to reduce it to 6 to 
allow reads/writes in a disaster scenario. I'm still not sure if I got 
it right this time, but maybe you're better off without the LRC plugin 
with the limited number of hosts. Instead you could use the jerasure 
plugin with a profile like k=4 m=5 allowing an entire DC to fail 
without losing data access (we have one customer using that).


Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might be 
some misunderstandings on my side. But I tried to play around with 
one of my test clusters (Nautilus). Because I'm limited in the number 
of hosts (6 across 3 virtual DCs) I tried two different profiles with 
lower numbers to get a feeling for how that works.


# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 
m=2 l=3 crush-failure-domain=host


For every third OSD one parity chunk is added, so 2 more chunks to 
store ==> 8 chunks in total. Since my failure-domain is host and I 
only have 6 I get incomplete PGs.
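
As a side note, the chunk counts in these examples seem to follow
k + m + (k+m)/l, i.e. one extra local parity chunk per group of l
chunks, which is quick to check in a shell:

k=4; m=2; l=3; echo $(( k + m + (k + m) / l ))   # 8, this first attempt
k=2; m=2; l=2; echo $(( k + m + (k + m) / l ))   # 6, the second attempt below
k=9; m=6; l=5; echo $(( k + m + (k + m) / l ))   # 18, matching max_size=18 of the k=9/m=6/l=5 profile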


# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 
m=2 l=2 crush-failure-domain=host


This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
50.0 1       0        0         0       619   0           0          1   active+clean 72s   18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.1 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.2 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.3 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135


After stopping all OSDs on one host I was still able to read and 
write into the pool, but after stopping a second host one PG from 
that pool went "down". That I don't fully understand yet, but I just 
started to look into it.
With your setup (12 hosts) I would recommend not utilizing all of
them so you have capacity to recover, let's say one "spare" host per
DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make
sense here, resulting in 9 chunks in total (one more parity chunk for
every other OSD), min_size 4. But as I wrote, it probably doesn't
have the resiliency for a DC failure, so that needs some further
investigation.


Regards,
Eugen

Zitat von Michel Jouvin :


Hi,

No... our current setup is 3 datacenters with the same
configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each,
thus 12 OSD servers in total. As, with the LRC plugin, k+m must be a
multiple of l, I found that k=9/m=6/l=5 with
crush-locality=datacenter was achieving my goal of being resilient
to a datacenter failure. Because of this, I considered that
lowering the crush failure domain to osd was not a major issue in my
case (as it

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-03 Thread Eugen Block
I think I got it wrong with the locality setting, I'm still limited by  
the number of hosts I have available in my test cluster, but as far as  
I got with failure-domain=osd I believe k=6, m=3, l=3 with  
locality=datacenter could fit your requirement, at least with regards  
to the recovery bandwidth usage between DCs, but the resiliency would  
not match your requirement (one DC failure). That profile creates 3  
groups of 4 chunks (3 data/coding chunks and one parity chunk) across  
three DCs, in total 12 chunks. The min_size=7 would not allow an  
entire DC to go down, I'm afraid, you'd have to reduce it to 6 to  
allow reads/writes in a disaster scenario. I'm still not sure if I got  
it right this time, but maybe you're better off without the LRC plugin  
with the limited number of hosts. Instead you could use the jerasure  
plugin with a profile like k=4 m=5 allowing an entire DC to fail  
without losing data access (we have one customer using that).


Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might  
be some misunderstandings on my side. But I tried to play around  
with one of my test clusters (Nautilus). Because I'm limited in the  
number of hosts (6 across 3 virtual DCs) I tried two different  
profiles with lower numbers to get a feeling for how that works.


# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4  
m=2 l=3 crush-failure-domain=host


For every third OSD one parity chunk is added, so 2 more chunks to  
store ==> 8 chunks in total. Since my failure-domain is host and I  
only have 6 I get incomplete PGs.


# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2  
m=2 l=2 crush-failure-domain=host


This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
50.0 1       0        0         0       619   0           0          1   active+clean 72s   18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.1 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.2 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.3 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135


After stopping all OSDs on one host I was still able to read and  
write into the pool, but after stopping a second host one PG from  
that pool went "down". That I don't fully understand yet, but I just  
started to look into it.
With your setup (12 hosts) I would recommend not utilizing all of
them so you have capacity to recover, let's say one "spare" host per
DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make
sense here, resulting in 9 chunks in total (one more parity chunk for
every other OSD), min_size 4. But as I wrote, it probably doesn't
have the resiliency for a DC failure, so that needs some further
investigation.


Regards,
Eugen

Zitat von Michel Jouvin :


Hi,

No... our current setup is 3 datacenters with the same
configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each,
thus 12 OSD servers in total. As, with the LRC plugin, k+m must be
a multiple of l, I found that k=9/m=6/l=5 with
crush-locality=datacenter was achieving my goal of being resilient
to a datacenter failure. Because of this, I considered that
lowering the crush failure domain to osd was not a major issue in
my case (as it would not be worse than a datacenter failure if all
the shards are on the same server in a datacenter) and was working
around the lack of hosts for k=9/m=6 (15 OSDs).


Maybe it helps if I give the erasure code profile used:

crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc
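
(That listing is presumably what `ceph osd erasure-code-profile get`
prints for the profile; for reference, the same settings could be
recreated in one go with something like the following, the profile name
being arbitrary:)

ceph osd erasure-code-profile get <profile-name>
ceph osd erasure-code-profile set lrc-k9m6l5 plugin=lrc k=9 m=6 l=5 \
    crush-locality=datacenter crush-failure-domain=osd \
    crush-device-class=hdd crush-root=default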

The previously mentioned strange number for min_size for the pool
created with this profile has vanished after the Quincy upgrade, as
this parameter is no longer in the CRUSH map rule, and the `ceph osd
pool get` command reports the expected number (10):


-


ceph osd pool get fink-z1.rgw.buckets.data min_size

min_size: 10


Cheers,

Michel

Le 29/04/2023 à 20:36, Curt a écrit :

Hello,

What is your current setup, 1 server per data center with 12 OSDs
each? What is your current crush rule and LRC crush rule?



On Fri, Apr 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-02 Thread Eugen Block

Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might be  
some misunderstandings on my side. But I tried to play around with one  
of my test clusters (Nautilus). Because I'm limited in the number of  
hosts (6 across 3 virtual DCs) I tried two different profiles with  
lower numbers to get a feeling for how that works.


# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4  
m=2 l=3 crush-failure-domain=host


For every third OSD one parity chunk is added, so 2 more chunks to  
store ==> 8 chunks in total. Since my failure-domain is host and I  
only have 6 I get incomplete PGs.


# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2  
m=2 l=2 crush-failure-domain=host


This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
50.0 1       0        0         0       619   0           0          1   active+clean 72s   18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.1 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.2 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.3 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135


After stopping all OSDs on one host I was still able to read and write  
into the pool, but after stopping a second host one PG from that pool  
went "down". That I don't fully understand yet, but I just started to  
look into it.
With your setup (12 hosts) I would recommend not utilizing all of
them so you have capacity to recover, let's say one "spare" host per
DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make
sense here, resulting in 9 chunks in total (one more parity chunk for
every other OSD), min_size 4. But as I wrote, it probably doesn't
have the resiliency for a DC failure, so that needs some further
investigation.


Regards,
Eugen

Zitat von Michel Jouvin :


Hi,

No... our current setup is 3 datacenters with the same
configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each,
thus 12 OSD servers in total. As, with the LRC plugin, k+m must be a
multiple of l, I found that k=9/m=6/l=5 with
crush-locality=datacenter was achieving my goal of being resilient
to a datacenter failure. Because of this, I considered that
lowering the crush failure domain to osd was not a major issue in my
case (as it would not be worse than a datacenter failure if all the
shards are on the same server in a datacenter) and was working
around the lack of hosts for k=9/m=6 (15 OSDs).


Maybe it helps if I give the erasure code profile used:

crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc

The previously mentioned strange number for min_size for the pool
created with this profile has vanished after the Quincy upgrade, as
this parameter is no longer in the CRUSH map rule, and the `ceph osd
pool get` command reports the expected number (10):


-


ceph osd pool get fink-z1.rgw.buckets.data min_size

min_size: 10


Cheers,

Michel

Le 29/04/2023 à 20:36, Curt a écrit :

Hello,

What is your current setup, 1 server per data center with 12 OSDs
each? What is your current crush rule and LRC crush rule?



On Fri, Apr 28, 2023, 12:29 Michel Jouvin  
 wrote:


   Hi,

    I think I found a possible cause of my PG down but still don't
    understand why.
    As explained in a previous mail, I set up a 15-chunk/OSD EC pool
    (k=9, m=6) but I have only 12 OSD servers in the cluster. To work
    around the problem I defined the failure domain as 'osd', with the
    reasoning that, as I was using the LRC plugin, I had the guarantee
    that I could lose a site without impact, and thus the possibility
    to lose 1 OSD server. Am I wrong?

   Best regards,

   Michel

   Le 24/04/2023 à 13:24, Michel Jouvin a écrit :
   > Hi,
   >
    > I'm still interested in getting feedback from those using the LRC
    > plugin about the right way to configure it... Last week I upgraded
    > from Pacific to Quincy (17.2.6) with cephadm, which does the
    > upgrade host by host, checking if an OSD is ok to stop before
    > actually upgrading it. I had the surprise to see 1 or 2 PGs down
    > at some points in the upgrade

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-29 Thread Michel Jouvin

Hi,

No... our current setup is 3 datacenters with the same configuration,
i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each, thus 12 OSD servers
in total. As, with the LRC plugin, k+m must be a multiple of l, I found
that k=9/m=6/l=5 with crush-locality=datacenter was achieving my goal
of being resilient to a datacenter failure. Because of this, I
considered that lowering the crush failure domain to osd was not a major
issue in my case (as it would not be worse than a datacenter failure if
all the shards are on the same server in a datacenter) and was working
around the lack of hosts for k=9/m=6 (15 OSDs).


Maybe it helps if I give the erasure code profile used:

crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc

The previously mentioned strange number for min_size for the pool
created with this profile has vanished after the Quincy upgrade, as
this parameter is no longer in the CRUSH map rule, and the `ceph osd
pool get` command reports the expected number (10):


-

> ceph osd pool get fink-z1.rgw.buckets.data min_size
min_size: 10


Cheers,

Michel

Le 29/04/2023 à 20:36, Curt a écrit :

Hello,

What is your current setup, 1 server per data center with 12 OSDs each?
What is your current crush rule and LRC crush rule?



On Fri, Apr 28, 2023, 12:29 Michel Jouvin 
 wrote:


Hi,

I think I found a possible cause of my PG down but still don't
understand why.
As explained in a previous mail, I set up a 15-chunk/OSD EC pool (k=9,
m=6) but I have only 12 OSD servers in the cluster. To work around the
problem I defined the failure domain as 'osd', with the reasoning that,
as I was using the LRC plugin, I had the guarantee that I could lose a
site without impact, and thus the possibility to lose 1 OSD server.
Am I wrong?

Best regards,

Michel

Le 24/04/2023 à 13:24, Michel Jouvin a écrit :
> Hi,
>
> I'm still interested in getting feedback from those using the LRC
> plugin about the right way to configure it... Last week I upgraded
> from Pacific to Quincy (17.2.6) with cephadm, which does the upgrade
> host by host, checking if an OSD is ok to stop before actually
> upgrading it. I had the surprise to see 1 or 2 PGs down at some
> points in the upgrade (it happened not for all OSDs but for every
> site/datacenter). Looking at the details with "ceph health detail",
> I saw that for these PGs there were 3 OSDs down, but I was expecting
> the pool to be resilient to 6 OSDs down (5 for R/W access), so I'm
> wondering if there is something wrong in our pool configuration
> (k=9, m=6, l=5).
>
> Cheers,
>
> Michel
>
> Le 06/04/2023 à 08:51, Michel Jouvin a écrit :
>> Hi,
>>
>> Is somebody using LRC plugin ?
>>
>> I came to the conclusion that LRC  k=9, m=3, l=4 is not the
same as
>> jerasure k=9, m=6 in terms of protection against failures and
that I
>> should use k=9, m=6, l=5 to get a level of resilience >= jerasure
>> k=9, m=6. The example in the documentation (k=4, m=2, l=3)
suggests
>> that this LRC configuration gives something better than
jerasure k=4,
>> m=2 as it is resilient to 3 drive failures (but not 4 if I
understood
>> properly). So how many drives can fail in the k=9, m=6, l=5
> >> configuration first without losing RW access and second without
> >> losing data?
>>
>> Another thing that I don't quite understand is that a pool created
>> with this configuration (and failure domain=osd,
locality=datacenter)
>> has a min_size=3 (max_size=18 as expected). It seems wrong to
me, I'd
>> expected something ~10 (depending on answer to the previous
question)...
>>
>> Thanks in advance if somebody could provide some sort of
>> authoritative answer on these 2 questions. Best regards,
>>
>> Michel
>>
>> Le 04/04/2023 à 15:53, Michel Jouvin a écrit :
>>> Answering to myself, I found the reason for 2147483647: it's
>>> documented as a failure to find enough OSD (missing OSDs). And
it is
>>> normal as I selected different hosts for the 15 OSDs but I
have only
>>> 12 hosts!
>>>
>>> I'm still interested by an "expert" to confirm that LRC  k=9,
m=3,
>>> l=4 configuration is equivalent, in terms of redundancy, to a
>>> jerasure configuration with k=9, m=6.
>>>
>>> Michel
>>>
>>> Le 04/04/2023 à 15:26, Michel Jouvin a écrit :
 Hi,

 As discussed in another thread (Crushmap rule for
multi-datacenter
 erasure coding), I'm trying to create an EC pool spanning 3
 datacenters (datacenters are present in the crushmap), with the
 objective to be resilient to 1 DC down, at least keeping the
 readonly access to the pool and if possible the read-write

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-29 Thread Curt
Hello,

What is your current setup, 1 server per data center with 12 OSDs each?
What is your current crush rule and LRC crush rule?
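
For reference, that information can usually be gathered with commands
like these (rule, profile and pool names are placeholders):

ceph osd tree                          # hosts and OSDs per datacenter
ceph osd crush rule ls
ceph osd crush rule dump <rule-name>
ceph osd erasure-code-profile get <profile-name>
ceph osd pool ls detail                # which rule, profile and min_size each pool uses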


On Fri, Apr 28, 2023, 12:29 Michel Jouvin 
wrote:

> Hi,
>
> I think I found a possible cause of my PG down but still don't
> understand why. As explained in a previous mail, I set up a
> 15-chunk/OSD EC pool (k=9, m=6) but I have only 12 OSD servers in the
> cluster. To work around the problem I defined the failure domain as
> 'osd', with the reasoning that, as I was using the LRC plugin, I had
> the guarantee that I could lose a site without impact, and thus the
> possibility to lose 1 OSD server. Am I wrong?
>
> Best regards,
>
> Michel
>
> Le 24/04/2023 à 13:24, Michel Jouvin a écrit :
> > Hi,
> >
> > I'm still interested in getting feedback from those using the LRC
> > plugin about the right way to configure it... Last week I upgraded
> > from Pacific to Quincy (17.2.6) with cephadm which is doing the
> > upgrade host by host, checking if an OSD is ok to stop before actually
> > upgrading it. I had the surprise to see 1 or 2 PGs down at some points
> > in the upgrade (happened not for all OSDs but for every
> > site/datacenter). Looking at the details with "ceph health detail", I
> > saw that for these PGs there was 3 OSDs down but I was expecting the
> > pool to be resilient to 6 OSDs down (5 for R/W access) so I'm
> > wondering if there is something wrong in our pool configuration (k=9,
> > m=6, l=5).
> >
> > Cheers,
> >
> > Michel
> >
> > Le 06/04/2023 à 08:51, Michel Jouvin a écrit :
> >> Hi,
> >>
> >> Is somebody using LRC plugin ?
> >>
> >> I came to the conclusion that LRC  k=9, m=3, l=4 is not the same as
> >> jerasure k=9, m=6 in terms of protection against failures and that I
> >> should use k=9, m=6, l=5 to get a level of resilience >= jerasure
> >> k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests
> >> that this LRC configuration gives something better than jerasure k=4,
> >> m=2 as it is resilient to 3 drive failures (but not 4 if I understood
> >> properly). So how many drives can fail in the k=9, m=6, l=5
> >> configuration first without losing RW access and second without
> >> losing data?
> >>
> >> Another thing that I don't quite understand is that a pool created
> >> with this configuration (and failure domain=osd, locality=datacenter)
> >> has a min_size=3 (max_size=18 as expected). It seems wrong to me, I'd
> >> expected something ~10 (depending on answer to the previous question)...
> >>
> >> Thanks in advance if somebody could provide some sort of
> >> authoritative answer on these 2 questions. Best regards,
> >>
> >> Michel
> >>
> >> Le 04/04/2023 à 15:53, Michel Jouvin a écrit :
> >>> Answering to myself, I found the reason for 2147483647: it's
> >>> documented as a failure to find enough OSD (missing OSDs). And it is
> >>> normal as I selected different hosts for the 15 OSDs but I have only
> >>> 12 hosts!
> >>>
> >>> I'm still interested by an "expert" to confirm that LRC  k=9, m=3,
> >>> l=4 configuration is equivalent, in terms of redundancy, to a
> >>> jerasure configuration with k=9, m=6.
> >>>
> >>> Michel
> >>>
> >>> Le 04/04/2023 à 15:26, Michel Jouvin a écrit :
>  Hi,
> 
>  As discussed in another thread (Crushmap rule for multi-datacenter
>  erasure coding), I'm trying to create an EC pool spanning 3
>  datacenters (datacenters are present in the crushmap), with the
>  objective to be resilient to 1 DC down, at least keeping the
>  readonly access to the pool and if possible the read-write access,
>  and have a storage efficiency better than 3 replicas (let's say a
>  storage overhead <= 2).
> 
>  In the discussion, somebody mentioned LRC plugin as a possible
>  jerasure alternative to implement this without tweaking the
>  crushmap rule to implement the 2-step OSD allocation. I looked at
>  the documentation
>  (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
>  but I have some questions if someone has experience/expertise with
>  this LRC plugin.
> 
>  I tried to create a rule for using 5 OSDs per datacenter (15 in
>  total), with 3 (9 in total) being data chunks and others being
>  coding chunks. For this, based on my understanding of the examples, I
>  used k=9, m=3, l=4. Is it right? Is this configuration equivalent,
>  in terms of redundancy, to a jerasure configuration with k=9, m=6?
> 
>  The resulting rule, which looks correct to me, is:
> 
>  
> 
>  {
>  "rule_id": 6,
>  "rule_name": "test_lrc_2",
>  "ruleset": 6,
>  "type": 3,
>  "min_size": 3,
>  "max_size": 15,
>  "steps": [
>  {
>  "op": "set_chooseleaf_tries",
>  "num": 5
>  },
>  {
>  "op": "set_choose_tries",
>  "num": 100
>  },
>  {
> 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-28 Thread Michel Jouvin

Hi,

I think I found a possible cause of my PG down but still don't
understand why. As explained in a previous mail, I set up a 15-chunk/OSD
EC pool (k=9, m=6) but I have only 12 OSD servers in the cluster. To
work around the problem I defined the failure domain as 'osd', with the
reasoning that, as I was using the LRC plugin, I had the guarantee that
I could lose a site without impact, and thus the possibility to lose
1 OSD server. Am I wrong?


Best regards,

Michel

Le 24/04/2023 à 13:24, Michel Jouvin a écrit :

Hi,

I'm still interested in getting feedback from those using the LRC
plugin about the right way to configure it... Last week I upgraded
from Pacific to Quincy (17.2.6) with cephadm, which does the upgrade
host by host, checking if an OSD is ok to stop before actually
upgrading it. I had the surprise to see 1 or 2 PGs down at some points
in the upgrade (it happened not for all OSDs but for every
site/datacenter). Looking at the details with "ceph health detail", I
saw that for these PGs there were 3 OSDs down, but I was expecting the
pool to be resilient to 6 OSDs down (5 for R/W access), so I'm
wondering if there is something wrong in our pool configuration (k=9,
m=6, l=5).


Cheers,

Michel

Le 06/04/2023 à 08:51, Michel Jouvin a écrit :

Hi,

Is somebody using LRC plugin ?

I came to the conclusion that LRC  k=9, m=3, l=4 is not the same as 
jerasure k=9, m=6 in terms of protection against failures and that I 
should use k=9, m=6, l=5 to get a level of resilience >= jerasure 
k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests 
that this LRC configuration gives something better than jerasure k=4, 
m=2 as it is resilient to 3 drive failures (but not 4 if I understood 
properly). So how many drives can fail in the k=9, m=6, l=5 
configuration first without losing RW access and second without
losing data?


Another thing that I don't quite understand is that a pool created 
with this configuration (and failure domain=osd, locality=datacenter) 
has a min_size=3 (max_size=18 as expected). It seems wrong to me, I'd 
expected something ~10 (depending on answer to the previous question)...


Thanks in advance if somebody could provide some sort of 
authoritative answer on these 2 questions. Best regards,


Michel

Le 04/04/2023 à 15:53, Michel Jouvin a écrit :
Answering to myself, I found the reason for 2147483647: it's 
documented as a failure to find enough OSD (missing OSDs). And it is 
normal as I selected different hosts for the 15 OSDs but I have only 
12 hosts!


I'm still interested by an "expert" to confirm that LRC  k=9, m=3, 
l=4 configuration is equivalent, in terms of redundancy, to a 
jerasure configuration with k=9, m=6.


Michel

Le 04/04/2023 à 15:26, Michel Jouvin a écrit :

Hi,

As discussed in another thread (Crushmap rule for multi-datacenter 
erasure coding), I'm trying to create an EC pool spanning 3 
datacenters (datacenters are present in the crushmap), with the 
objective to be resilient to 1 DC down, at least keeping the 
readonly access to the pool and if possible the read-write access, 
and have a storage efficiency better than 3 replicas (let's say a
storage overhead <= 2).


In the discussion, somebody mentioned LRC plugin as a possible 
jerasure alternative to implement this without tweaking the 
crushmap rule to implement the 2-step OSD allocation. I looked at 
the documentation 
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) 
but I have some questions if someone has experience/expertise with 
this LRC plugin.


I tried to create a rule for using 5 OSDs per datacenter (15 in 
total), with 3 (9 in total) being data chunks and others being 
coding chunks. For this, based on my understanding of the examples, I
used k=9, m=3, l=4. Is it right? Is this configuration equivalent, 
in terms of redundancy, to a jerasure configuration with k=9, m=6?


The resulting rule, which looks correct to me, is:



{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -4,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "chooseleaf_indep",
    "num": 5,
    "type": "host"
    },
    {
    "op": "emit"
    }
    ]
}



Unfortunately, it doesn't work as expected: a pool created with
this rule ends up with its PGs active+undersized, which is
unexpected to me. Looking at `ceph health detail` output, I see
for each PG something like:


pg 52.14 is stuck undersized for 27m, current state 
active+undersized, last acting 

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-24 Thread Michel Jouvin

Hi,

I'm still interested in getting feedback from those using the LRC
plugin about the right way to configure it... Last week I upgraded from
Pacific to Quincy (17.2.6) with cephadm, which does the upgrade host
by host, checking if an OSD is ok to stop before actually upgrading it.
I had the surprise to see 1 or 2 PGs down at some points in the upgrade
(it happened not for all OSDs but for every site/datacenter). Looking at
the details with "ceph health detail", I saw that for these PGs there
were 3 OSDs down, but I was expecting the pool to be resilient to 6 OSDs
down (5 for R/W access), so I'm wondering if there is something wrong in
our pool configuration (k=9, m=6, l=5).
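
(The check cephadm performs can also be run by hand, which might help to
reproduce this outside of an upgrade; the OSD ids below are just
examples:)

ceph osd ok-to-stop 0          # a single OSD
ceph osd ok-to-stop 0 1 2 3    # e.g. all OSDs of one host at once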


Cheers,

Michel

Le 06/04/2023 à 08:51, Michel Jouvin a écrit :

Hi,

Is somebody using LRC plugin ?

I came to the conclusion that LRC  k=9, m=3, l=4 is not the same as 
jerasure k=9, m=6 in terms of protection against failures and that I 
should use k=9, m=6, l=5 to get a level of resilience >= jerasure k=9, 
m=6. The example in the documentation (k=4, m=2, l=3) suggests that 
this LRC configuration gives something better than jerasure k=4, m=2 
as it is resilient to 3 drive failures (but not 4 if I understood 
properly). So how many drives can fail in the k=9, m=6, l=5 
configuration first without losing RW access and second without
losing data?


Another thing that I don't quite understand is that a pool created 
with this configuration (and failure domain=osd, locality=datacenter) 
has a min_size=3 (max_size=18 as expected). It seems wrong to me, I'd 
expected something ~10 (depending on answer to the previous question)...


Thanks in advance if somebody could provide some sort of authoritative 
answer on these 2 questions. Best regards,


Michel

Le 04/04/2023 à 15:53, Michel Jouvin a écrit :
Answering to myself, I found the reason for 2147483647: it's 
documented as a failure to find enough OSD (missing OSDs). And it is 
normal as I selected different hosts for the 15 OSDs but I have only 
12 hosts!


I'm still interested by an "expert" to confirm that LRC  k=9, m=3, 
l=4 configuration is equivalent, in terms of redundancy, to a 
jerasure configuration with k=9, m=6.


Michel

Le 04/04/2023 à 15:26, Michel Jouvin a écrit :

Hi,

As discussed in another thread (Crushmap rule for multi-datacenter 
erasure coding), I'm trying to create an EC pool spanning 3 
datacenters (datacenters are present in the crushmap), with the 
objective to be resilient to 1 DC down, at least keeping the 
readonly access to the pool and if possible the read-write access, 
and have a storage efficiency better than 3 replicas (let's say a
storage overhead <= 2).


In the discussion, somebody mentioned LRC plugin as a possible 
jerasure alternative to implement this without tweaking the crushmap 
rule to implement the 2-step OSD allocation. I looked at the 
documentation 
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) 
but I have some questions if someone has experience/expertise with 
this LRC plugin.


I tried to create a rule for using 5 OSDs per datacenter (15 in 
total), with 3 (9 in total) being data chunks and others being 
coding chunks. For this, based on my understanding of the examples, I
used k=9, m=3, l=4. Is it right? Is this configuration equivalent, 
in terms of redundancy, to a jerasure configuration with k=9, m=6?


The resulting rule, which looks correct to me, is:



{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -4,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "chooseleaf_indep",
    "num": 5,
    "type": "host"
    },
    {
    "op": "emit"
    }
    ]
}



Unfortunately, it doesn't work as expected: a pool created with this
rule ends up with its PGs active+undersized, which is unexpected
to me. Looking at `ceph health detail` output, I see for each PG
something like:


pg 52.14 is stuck undersized for 27m, current state 
active+undersized, last acting 
[90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]


For each PG, there are 3 '2147483647' entries and I guess that is the
reason for the problem. What are these entries about? Clearly they are
not OSD entries... It looks like a negative number, -1, which in terms
of crushmap IDs is the crushmap root (named "default" in our
configuration). Any trivial mistake I would have made?


Thanks in advance for any help or for sharing any successful 
configuration?


Best regards,

Michel

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-06 Thread Michel Jouvin

Hi,

Is somebody using the LRC plugin?

I came to the conclusion that LRC  k=9, m=3, l=4 is not the same as 
jerasure k=9, m=6 in terms of protection against failures and that I 
should use k=9, m=6, l=5 to get a level of resilience >= jerasure k=9, 
m=6. The example in the documentation (k=4, m=2, l=3) suggests that this 
LRC configuration gives something better than jerasure k=4, m=2 as it is 
resilient to 3 drive failures (but not 4 if I understood properly). So 
how many drives can fail in the k=9, m=6, l=5 configuration first 
without losing RW access and second without losing data?


Another thing that I don't quite understand is that a pool created with
this configuration (and failure domain=osd, locality=datacenter) has a
min_size=3 (max_size=18 as expected). It seems wrong to me, I'd have
expected something around 10 (depending on the answer to the previous
question)...
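
In the meantime the value can at least be checked and, if it really is
wrong, overridden per pool (the pool name is a placeholder, and the
correct value for this profile is precisely what I am unsure about):

ceph osd pool get <pool-name> min_size
ceph osd pool set <pool-name> min_size 10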


Thanks in advance if somebody could provide some sort of authoritative 
answer on these 2 questions. Best regards,


Michel

Le 04/04/2023 à 15:53, Michel Jouvin a écrit :
Answering to myself, I found the reason for 2147483647: it's 
documented as a failure to find enough OSD (missing OSDs). And it is 
normal as I selected different hosts for the 15 OSDs but I have only 
12 hosts!


I'm still interested by an "expert" to confirm that LRC  k=9, m=3, l=4 
configuration is equivalent, in terms of redundancy, to a jerasure 
configuration with k=9, m=6.


Michel

Le 04/04/2023 à 15:26, Michel Jouvin a écrit :

Hi,

As discussed in another thread (Crushmap rule for multi-datacenter 
erasure coding), I'm trying to create an EC pool spanning 3 
datacenters (datacenters are present in the crushmap), with the 
objective to be resilient to 1 DC down, at least keeping the readonly 
access to the pool and if possible the read-write access, and have a 
storage efficiency better than 3 replicas (let's say a storage overhead
<= 2).


In the discussion, somebody mentioned LRC plugin as a possible 
jerasure alternative to implement this without tweaking the crushmap 
rule to implement the 2-step OSD allocation. I looked at the 
documentation 
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) 
but I have some questions if someone has experience/expertise with 
this LRC plugin.


I tried to create a rule for using 5 OSDs per datacenter (15 in 
total), with 3 (9 in total) being data chunks and others being coding 
chunks. For this, based on my understanding of the examples, I used k=9,
m=3, l=4. Is it right? Is this configuration equivalent, in terms of 
redundancy, to a jerasure configuration with k=9, m=6?


The resulting rule, which looks correct to me, is:



{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -4,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "chooseleaf_indep",
    "num": 5,
    "type": "host"
    },
    {
    "op": "emit"
    }
    ]
}



Unfortunately, it doesn't work as expected: a pool created with this
rule ends up with its PGs active+undersized, which is unexpected to
me. Looking at `ceph health detail` output, I see for each PG
something like:


pg 52.14 is stuck undersized for 27m, current state 
active+undersized, last acting 
[90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]


For each PG, there are 3 '2147483647' entries and I guess that is the
reason for the problem. What are these entries about? Clearly they are
not OSD entries... It looks like a negative number, -1, which in terms
of crushmap IDs is the crushmap root (named "default" in our
configuration). Any trivial mistake I would have made?


Thanks in advance for any help or for sharing any successful 
configuration?


Best regards,

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-04 Thread Michel Jouvin
Answering to myself, I found the reason for 2147483647: it's documented
as a failure to find enough OSDs (missing OSDs). And it is normal, as I
selected different hosts for the 15 OSDs but I have only 12 hosts!
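
For what it's worth, this kind of mapping problem can also be reproduced
offline with crushtool before creating the pool (rule id 6 as in the
rule dump quoted below):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-mappings
crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-bad-mappings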


I'm still interested by an "expert" to confirm that LRC  k=9, m=3, l=4 
configuration is equivalent, in terms of redundancy, to a jerasure 
configuration with k=9, m=6.


Michel

Le 04/04/2023 à 15:26, Michel Jouvin a écrit :

Hi,

As discussed in another thread (Crushmap rule for multi-datacenter 
erasure coding), I'm trying to create an EC pool spanning 3 
datacenters (datacenters are present in the crushmap), with the 
objective to be resilient to 1 DC down, at least keeping the readonly 
access to the pool and if possible the read-write access, and have a 
storage efficiency better than 3 replicas (let's say a storage overhead
<= 2).


In the discussion, somebody mentioned LRC plugin as a possible 
jerasure alternative to implement this without tweaking the crushmap 
rule to implement the 2-step OSD allocation. I looked at the 
documentation 
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) 
but I have some questions if someone has experience/expertise with 
this LRC plugin.


I tried to create a rule for using 5 OSDs per datacenter (15 in 
total), with 3 (9 in total) being data chunks and others being coding 
chunks. For this, based on my understanding of the examples, I used k=9,
m=3, l=4. Is it right? Is this configuration equivalent, in terms of 
redundancy, to a jerasure configuration with k=9, m=6?
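
(For completeness, the profile behind this was created with something
along these lines, assuming it carries the same name as the rule below;
the generated rule can then be dumped with `ceph osd crush rule dump`:)

ceph osd erasure-code-profile set test_lrc_2 plugin=lrc k=9 m=3 l=4 \
    crush-locality=datacenter crush-failure-domain=host crush-device-class=hdd
ceph osd crush rule dump test_lrc_2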


The resulting rule, which looks correct to me, is:



{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -4,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "chooseleaf_indep",
    "num": 5,
    "type": "host"
    },
    {
    "op": "emit"
    }
    ]
}



Unfortunately, it doesn't work as expected: a pool created with this
rule ends up with its PGs active+undersized, which is unexpected to
me. Looking at `ceph health detail` output, I see for each PG
something like:


pg 52.14 is stuck undersized for 27m, current state active+undersized, 
last acting 
[90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]


For each PG, there are 3 '2147483647' entries and I guess that is the
reason for the problem. What are these entries about? Clearly they are
not OSD entries... It looks like a negative number, -1, which in terms
of crushmap IDs is the crushmap root (named "default" in our
configuration). Any trivial mistake I would have made?


Thanks in advance for any help or for sharing any successful 
configuration?


Best regards,

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
