Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-17 Thread lin zhou
Thanks so much.
My `ceph osd df tree` output is here:
https://gist.github.com/hnuzhoulin/e83140168eb403f4712273e3bb925a1c

As that output shows, and following up on David's reply:
when I out osd.132, its PGs remap only to OSDs within its own host
cld-osd12-56-sata, as if the out did not change the host's weight.
But if I out osd.10, which is in a replicated pool, its PGs remap within its
media bucket site1-rack1-ssd rather than within its host cld-osd1-56-ssd,
as if the out did change the host's weight.
So does the out command behave differently for firstn and indep rules?
If it does, we need to reserve more free space on each disk in an indep (EC) pool.

When I run ceph osd crush reweight osd.132 0.0, its PGs remap within its media
bucket site1-rack1-ssd, just like a plain out does in the firstn case, so the
host's weight did change.
PG diffs: https://gist.github.com/hnuzhoulin/aab164975b4e3d31bbecbc5c8b2f1fef
From this diff output I can see the difference between the strategies used to
select OSDs in the CRUSH hierarchy, but I still cannot see why the out command
acts differently for firstn and indep.
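
For reference, a minimal way to check what each command actually changes in the
CRUSH map (a sketch, assuming the bucket/OSD names above; the 10.000/9.000
weights follow from the posted crushmap, where each SATA host bucket holds 10
OSDs of weight 1.000):

# `out` only sets the OSD's reweight to 0; its CRUSH item weight and the host
# bucket weight are untouched, so data can only shuffle inside the host
ceph osd out 132
ceph osd df tree | grep -E 'cld-osd12-56-sata|osd\.132'    # host weight still 10.000

# `crush reweight` changes the item weight inside the host bucket, so the host
# bucket weight drops and data is free to move to other hosts
ceph osd crush reweight osd.132 0.0
ceph osd df tree | grep -E 'cld-osd12-56-sata|osd\.132'    # host weight now 9.000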

On Sat, Feb 16, 2019 at 1:22 AM David Turner  wrote:
>
> I'm leaving the response on the CRUSH rule for Gregory, but you have another 
> problem you're running into that is causing more of this data to stay on this 
> node than you intend.  While you `out` the OSD it is still contributing to 
> the Host's weight.  So the host is still set to receive that amount of data 
> and distribute it among the disks inside of it.  This is the default behavior 
> (even if you `destroy` the OSD) to minimize the data movement for losing the 
> disk and again for adding it back into the cluster after you replace the 
> device.  If you are really strapped for space, though, then you might 
> consider fully purging the OSD which will reduce the Host weight to what the 
> other OSDs are.  However if you do have a problem in your CRUSH rule, then 
> doing this won't change anything for you.
>
> On Thu, Feb 14, 2019 at 11:15 PM hnuzhoulin2  wrote:
>>
>> Thanks. I read your reply at
>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg48717.html
>> so using indep causes less data to remap when an OSD fails:
>> using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6, 60% of data remapped
>> using indep : 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5, 25% of data remapped
>>
>> Am I right?
>> If so, what do you recommend when a disk fails and the total free space on the
>> remaining disks in the machine is not enough (the failed disk cannot be
>> replaced immediately)? Or should I just reserve more free space in the EC case?
>>
>> On 02/14/2019 02:49, Gregory Farnum wrote:
>>
>> Your CRUSH rule for EC pools is forcing that behavior with the line
>>
>> step chooseleaf indep 1 type ctnr
>>
>> If you want different behavior, you’ll need a different crush rule.
>>
>> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>>>
>>> Hi, cephers
>>>
>>>
>>> I am building a Ceph EC cluster. When a disk fails, I mark its OSD out. But all of
>>> its PGs remap to OSDs in the same host, while I think they should remap to
>>> other hosts in the same rack.
>>> The test process is:
>>>
>>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2 
>>> site1_sata_erasure_ruleset 4
>>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>>> /etc/init.d/ceph stop osd.2
>>> ceph osd out 2
>>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>>
>>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>>> TOTAL 3073T 197G | TOTAL 3065T 197G
>>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>>
>>>
>>> some config info: (detail configs see: 
>>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>>> jewel 10.2.11  filestore+rocksdb
>>>
>>> ceph osd erasure-code-profile get ISA-4-2
>>> k=4
>>> m=2
>>> plugin=isa
>>> ruleset-failure-domain=ctnr
>>> ruleset-root=site1-sata
>>> technique=reed_sol_van
>>>
>>> part of ceph.conf is:
>>>
>>> [global]
>>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>>> auth cluster required = cephx
>>> auth service required = cephx
>>> auth client required = cephx
>>> pid file = /home/ceph/var/run/$name.pid
>>> log file = /home/ceph/log/$cluster-$name.log
>>> mon osd nearfull ratio = 0.85
>>> mon osd full ratio = 0.95
>>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>>> osd pool default size = 3
>> osd pool default min size = 1

Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-15 Thread Gregory Farnum
Actually I think I misread what this was doing, sorry.

Can you do a “ceph osd tree”? It’s hard to see the structure via the text
dumps.

On Wed, Feb 13, 2019 at 10:49 AM Gregory Farnum  wrote:

> Your CRUSH rule for EC pools is forcing that behavior with the line
>
> step chooseleaf indep 1 type ctnr
>
> If you want different behavior, you’ll need a different crush rule.
>
> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>
>> Hi, cephers
>>
>>
>> I am building a Ceph EC cluster. When a disk fails, I mark its OSD out. But all of
>> its PGs remap to OSDs in the same host, while I think they should remap to
>> other hosts in the same rack.
>> The test process is:
>>
>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
>> site1_sata_erasure_ruleset 4
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>> /etc/init.d/ceph stop osd.2
>> ceph osd out 2
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>
>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>> TOTAL 3073T 197G | TOTAL 3065T 197G
>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>
>>
>> some config info: (detail configs see:
>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>> jewel 10.2.11  filestore+rocksdb
>>
>> ceph osd erasure-code-profile get ISA-4-2
>> k=4
>> m=2
>> plugin=isa
>> ruleset-failure-domain=ctnr
>> ruleset-root=site1-sata
>> technique=reed_sol_van
>>
>> part of ceph.conf is:
>>
>> [global]
>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> pid file = /home/ceph/var/run/$name.pid
>> log file = /home/ceph/log/$cluster-$name.log
>> mon osd nearfull ratio = 0.85
>> mon osd full ratio = 0.95
>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>> osd pool default size = 3
>> osd pool default min size = 1
>> osd objectstore = filestore
>> filestore merge threshold = -10
>>
>> [mon]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>> mon cluster log file = /home/ceph/log/$cluster.log
>> [osd]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>> osd journal size = 1
>> osd mkfs type = xfs
>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>> osd backfill full ratio = 0.92
>> osd failsafe full ratio = 0.95
>> osd failsafe nearfull ratio = 0.85
>> osd max backfills = 1
>> osd crush update on start = false
>> osd op thread timeout = 60
>> filestore split multiple = 8
>> filestore max sync interval = 15
>> filestore min sync interval = 5
>> [osd.0]
>> host = cld-osd1-56
>> addr = X
>> user = ceph
>> devs = /disk/link/osd-0/data
>> osd journal = /disk/link/osd-0/journal
>> …….
>> [osd.503]
>> host = cld-osd42-56
>> addr = 10.108.87.52
>> user = ceph
>> devs = /disk/link/osd-503/data
>> osd journal = /disk/link/osd-503/journal
>>
>>
>> crushmap is below:
>>
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable chooseleaf_vary_r 1
>> tunable straw_calc_version 1
>> tunable allowed_bucket_algs 54
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> 。。。
>> device 502 osd.502
>> device 503 osd.503
>>
>> # types
>> type 0 osd  # osd
>> type 1 ctnr # sata/ssd group by node, -101~1xx/-201~2xx
>> type 2 media# sata/ssd group by rack, -11~1x/-21~2x
>> type 3 mediagroup   # sata/ssd group by site, -5/-6
>> type 4 unit # site, -2
>> type 5 root # root, -1
>>
>> # buckets
>> ctnr cld-osd1-56-sata {
>> id -101  # do not change unnecessarily
>> # weight 10.000
>> alg straw2
>> hash 0   # rjenkins1
>> item osd.0 weight 1.000
>> item osd.1 weight 1.000
>> item osd.2 weight 1.000
>> item osd.3 weight 1.000
>> item osd.4 weight 1.000
>> item osd.5 weight 1.000
>> item osd.6 weight 1.000
>> item osd.7 weight 1.000
>> item osd.8 weight 1.000
>> item osd.9 weight 1.000
>> }
>> ctnr cld-osd1-56-ssd {
>> id -201  # do not change unnecessarily
>> # weight 2.000
>> alg straw2
>> hash 0   # rjenkins1
>> 

Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-15 Thread David Turner
I'm leaving the response on the CRUSH rule for Gregory, but you have
another problem you're running into that is causing more of this data to
stay on this node than you intend.  While you `out` the OSD it is still
contributing to the Host's weight.  So the host is still set to receive
that amount of data and distribute it among the disks inside of it.  This
is the default behavior (even if you `destroy` the OSD) to minimize the
data movement for losing the disk and again for adding it back into the
cluster after you replace the device.  If you are really strapped for
space, though, then you might consider fully purging the OSD which will
reduce the Host weight to what the other OSDs are.  However if you do have
a problem in your CRUSH rule, then doing this won't change anything for you.
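
For completeness: jewel 10.2.11 has no single `ceph osd purge` command (that
arrived in luminous), so "fully purging" here means the usual three-step
sequence. A sketch, using osd.132 from the follow-up mail as the example:

/etc/init.d/ceph stop osd.132      # make sure the daemon is stopped first
ceph osd crush remove osd.132      # removes the CRUSH item, so the host bucket weight drops
ceph auth del osd.132              # remove its cephx key
ceph osd rm 132                    # remove it from the osdmap

The `crush remove` step is the one that lets data leave the host, because it is
what reduces the host bucket's weight.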

On Thu, Feb 14, 2019 at 11:15 PM hnuzhoulin2  wrote:

> Thanks. I read your reply at
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg48717.html
> so using indep causes less data to remap when an OSD fails:
> using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6, 60% of data remapped
> using indep : 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5, 25% of data remapped
>
> Am I right?
> If so, what do you recommend when a disk fails and the total free space on the
> remaining disks in the machine is not enough (the failed disk cannot be
> replaced immediately)? Or should I just reserve more free space in the EC case?
>
> On 02/14/2019 02:49, Gregory Farnum wrote:
>
> Your CRUSH rule for EC pools is forcing that behavior with the line
>
> step chooseleaf indep 1 type ctnr
>
> If you want different behavior, you’ll need a different crush rule.
>
> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>
>> Hi, cephers
>>
>>
>> I am building a Ceph EC cluster. When a disk fails, I mark its OSD out. But all of
>> its PGs remap to OSDs in the same host, while I think they should remap to
>> other hosts in the same rack.
>> The test process is:
>>
>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
>> site1_sata_erasure_ruleset 4
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>> /etc/init.d/ceph stop osd.2
>> ceph osd out 2
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>
>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>> TOTAL 3073T 197G | TOTAL 3065T 197G
>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>
>>
>> some config info: (detail configs see:
>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>> jewel 10.2.11  filestore+rocksdb
>>
>> ceph osd erasure-code-profile get ISA-4-2
>> k=4
>> m=2
>> plugin=isa
>> ruleset-failure-domain=ctnr
>> ruleset-root=site1-sata
>> technique=reed_sol_van
>>
>> part of ceph.conf is:
>>
>> [global]
>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> pid file = /home/ceph/var/run/$name.pid
>> log file = /home/ceph/log/$cluster-$name.log
>> mon osd nearfull ratio = 0.85
>> mon osd full ratio = 0.95
>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>> osd pool default size = 3
>> osd pool default min size = 1
>> osd objectstore = filestore
>> filestore merge threshold = -10
>>
>> [mon]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>> mon cluster log file = /home/ceph/log/$cluster.log
>> [osd]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>> osd journal size = 1
>> osd mkfs type = xfs
>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>> osd backfill full ratio = 0.92
>> osd failsafe full ratio = 0.95
>> osd failsafe nearfull ratio = 0.85
>> osd max backfills = 1
>> osd crush update on start = false
>> osd op thread timeout = 60
>> filestore split multiple = 8
>> filestore max sync interval = 15
>> filestore min sync interval = 5
>> [osd.0]
>> host = cld-osd1-56
>> addr = X
>> user = ceph
>> devs = /disk/link/osd-0/data
>> osd journal = /disk/link/osd-0/journal
>> …….
>> [osd.503]
>> host = cld-osd42-56
>> addr = 10.108.87.52
>> user = ceph
>> devs = /disk/link/osd-503/data
>> osd journal = /disk/link/osd-503/journal
>>
>>
>> crushmap is below:
>>
>> # begin crush map
>> 

Re: [ceph-users] jewel10.2.11 EC pool out a osd,its PGs remap to the osds in the same host

2019-02-14 Thread hnuzhoulin2






Thanks. I read your reply at
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg48717.html
so using indep causes less data to remap when an OSD fails:
using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6, 60% of data remapped
using indep : 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5, 25% of data remapped

Am I right?
If so, what do you recommend when a disk fails and the total free space on the
remaining disks in the machine is not enough (the failed disk cannot be
replaced immediately)? Or should I just reserve more free space in the EC case?
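
A rough way to size that reserve (a back-of-the-envelope sketch, assuming the
10-SATA-OSDs-per-host layout from the posted crushmap and the
"osd backfill full ratio = 0.92" from the posted ceph.conf): if one OSD in a
host is outed and its PGs can only move to the other 9 OSDs in the same host,
each surviving OSD ends up holding about 10/9 of its previous data. Keeping
(10/9) x U <= 0.92 means U <= 0.92 x 9/10, roughly 0.83, i.e. every disk should
stay below about 83% used to absorb a single in-host failure without hitting
the backfill-full limit.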








On 02/14/2019 02:49, Gregory Farnum wrote:


Your CRUSH rule for EC pools is forcing that behavior with the line

step chooseleaf indep 1 type ctnr

If you want different behavior, you’ll need a different crush rule.

On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:








Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-13 Thread Gregory Farnum
Your CRUSH rule for EC pools is forcing that behavior with the line

step chooseleaf indep 1 type ctnr

If you want different behavior, you’ll need a different crush rule.
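
For illustration, a minimal sketch of a rule that would place each of the k+m=6
chunks on a different ctnr (host) bucket under site1-sata. The bucket and type
names follow the posted crushmap, but the rule itself is hypothetical (the full
site1_sata_erasure_ruleset is not shown in this thread), so treat it only as a
starting point:

rule site1_sata_ec_by_host {
        ruleset 10                           # any unused ruleset id
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step take site1-sata
        step chooseleaf indep 0 type ctnr    # 0 = pick as many ctnr buckets as the pool needs
        step emit
}

A candidate map can be dry-run before injecting it, e.g.:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt        # edit, then: crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --rule 10 --num-rep 6 --show-mappings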

On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:

> Hi, cephers
>
>
> I am building a Ceph EC cluster. When a disk fails, I mark its OSD out. But all of
> its PGs remap to OSDs in the same host, while I think they should remap to
> other hosts in the same rack.
> The test process is:
>
> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
> site1_sata_erasure_ruleset 4
> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
> /etc/init.d/ceph stop osd.2
> ceph osd out 2
> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>
> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
> TOTAL 3073T 197G | TOTAL 3065T 197G
> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>
>
> some config info: (detail configs see:
> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
> jewel 10.2.11  filestore+rocksdb
>
> ceph osd erasure-code-profile get ISA-4-2
> k=4
> m=2
> plugin=isa
> ruleset-failure-domain=ctnr
> ruleset-root=site1-sata
> technique=reed_sol_van
>
> part of ceph.conf is:
>
> [global]
> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> pid file = /home/ceph/var/run/$name.pid
> log file = /home/ceph/log/$cluster-$name.log
> mon osd nearfull ratio = 0.85
> mon osd full ratio = 0.95
> admin socket = /home/ceph/var/run/$cluster-$name.asok
> osd pool default size = 3
> osd pool default min size = 1
> osd objectstore = filestore
> filestore merge threshold = -10
>
> [mon]
> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
> mon data = /home/ceph/var/lib/$type/$cluster-$id
> mon cluster log file = /home/ceph/log/$cluster.log
> [osd]
> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
> osd data = /home/ceph/var/lib/$type/$cluster-$id
> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
> osd journal size = 1
> osd mkfs type = xfs
> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
> osd backfill full ratio = 0.92
> osd failsafe full ratio = 0.95
> osd failsafe nearfull ratio = 0.85
> osd max backfills = 1
> osd crush update on start = false
> osd op thread timeout = 60
> filestore split multiple = 8
> filestore max sync interval = 15
> filestore min sync interval = 5
> [osd.0]
> host = cld-osd1-56
> addr = X
> user = ceph
> devs = /disk/link/osd-0/data
> osd journal = /disk/link/osd-0/journal
> …….
> [osd.503]
> host = cld-osd42-56
> addr = 10.108.87.52
> user = ceph
> devs = /disk/link/osd-503/data
> osd journal = /disk/link/osd-503/journal
>
>
> crushmap is below:
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> 。。。
> device 502 osd.502
> device 503 osd.503
>
> # types
> type 0 osd  # osd
> type 1 ctnr # sata/ssd group by node, -101~1xx/-201~2xx
> type 2 media# sata/ssd group by rack, -11~1x/-21~2x
> type 3 mediagroup   # sata/ssd group by site, -5/-6
> type 4 unit # site, -2
> type 5 root # root, -1
>
> # buckets
> ctnr cld-osd1-56-sata {
> id -101  # do not change unnecessarily
> # weight 10.000
> alg straw2
> hash 0   # rjenkins1
> item osd.0 weight 1.000
> item osd.1 weight 1.000
> item osd.2 weight 1.000
> item osd.3 weight 1.000
> item osd.4 weight 1.000
> item osd.5 weight 1.000
> item osd.6 weight 1.000
> item osd.7 weight 1.000
> item osd.8 weight 1.000
> item osd.9 weight 1.000
> }
> ctnr cld-osd1-56-ssd {
> id -201  # do not change unnecessarily
> # weight 2.000
> alg straw2
> hash 0   # rjenkins1
> item osd.10 weight 1.000
> item osd.11 weight 1.000
> }
> …..
> ctnr cld-osd41-56-sata {
> id -141  # do not change unnecessarily
> # weight 10.000
> alg straw2
> hash 0   # rjenkins1
> item osd.480 weight 1.000
> item osd.481 weight 1.000
> item osd.482 weight 1.000
> item osd.483 weight 1.000
> item osd.484 weight 1.000
> 

[ceph-users] jewel10.2.11 EC pool out a osd,its PGs remap to the osds in the same host

2019-02-13 Thread hnuzhoulin2







Hi, cephers

I am building a Ceph EC cluster. When a disk fails, I mark its OSD out. But all of its PGs remap to OSDs in the same host, while I think they should remap to other hosts in the same rack.
The test process is:

ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2 site1_sata_erasure_ruleset 4
ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
/etc/init.d/ceph stop osd.2
ceph osd out 2
ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
diff /tmp/1 /tmp/2 -y --suppress-common-lines

0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
TOTAL 3073T 197G | TOTAL 3065T 197G
MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52

some config info: (detail configs see: https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
jewel 10.2.11  filestore+rocksdb

ceph osd erasure-code-profile get ISA-4-2
k=4
m=2
plugin=isa
ruleset-failure-domain=ctnr
ruleset-root=site1-sata
technique=reed_sol_van

part of ceph.conf is:

[global]
fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
pid file = /home/ceph/var/run/$name.pid
log file = /home/ceph/log/$cluster-$name.log
mon osd nearfull ratio = 0.85
mon osd full ratio = 0.95
admin socket = /home/ceph/var/run/$cluster-$name.asok
osd pool default size = 3
osd pool default min size = 1
osd objectstore = filestore
filestore merge threshold = -10

[mon]
keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
mon data = /home/ceph/var/lib/$type/$cluster-$id
mon cluster log file = /home/ceph/log/$cluster.log
[osd]
keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
osd data = /home/ceph/var/lib/$type/$cluster-$id
osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
osd journal size = 1
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
osd backfill full ratio = 0.92
osd failsafe full ratio = 0.95
osd failsafe nearfull ratio = 0.85
osd max backfills = 1
osd crush update on start = false
osd op thread timeout = 60
filestore split multiple = 8
filestore max sync interval = 15
filestore min sync interval = 5
[osd.0]
host = cld-osd1-56
addr = X
user = ceph
devs = /disk/link/osd-0/data
osd journal = /disk/link/osd-0/journal
…….
[osd.503]
host = cld-osd42-56
addr = 10.108.87.52
user = ceph
devs = /disk/link/osd-503/data
osd journal = /disk/link/osd-503/journal

crushmap is below:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
。。。
device 502 osd.502
device 503 osd.503

# types
type 0 osd  # osd
type 1 ctnr # sata/ssd group by node, -101~1xx/-201~2xx
type 2 media    # sata/ssd group by rack, -11~1x/-21~2x
type 3 mediagroup   # sata/ssd group by site, -5/-6
type 4 unit # site, -2
type 5 root # root, -1

# buckets
ctnr cld-osd1-56-sata {
id -101  # do not change unnecessarily
# weight 10.000
alg straw2
hash 0   # rjenkins1
item osd.0 weight 1.000
item osd.1 weight 1.000
item osd.2 weight 1.000
item osd.3 weight 1.000
item osd.4 weight 1.000
item osd.5 weight 1.000
item osd.6 weight 1.000
item osd.7 weight 1.000
item osd.8 weight 1.000
item osd.9 weight 1.000
}
ctnr cld-osd1-56-ssd {
id -201  # do not change unnecessarily
# weight 2.000
alg straw2
hash 0   # rjenkins1
item osd.10 weight 1.000
item osd.11 weight 1.000
}
…..
ctnr cld-osd41-56-sata {
id -141  # do not change unnecessarily
# weight 10.000
alg straw2
hash 0   # rjenkins1
item osd.480 weight 1.000
item osd.481 weight 1.000
item osd.482 weight 1.000
item osd.483 weight 1.000
item osd.484 weight 1.000
item osd.485 weight 1.000
item osd.486 weight 1.000
item osd.487 weight 1.000
item osd.488 weight 1.000
item osd.489 weight 1.000
}
ctnr cld-osd41-56-ssd {
id -241  # do not change unnecessarily
# weight 2.000
alg straw2
hash 0   # rjenkins1
item osd.490 weight 1.000
item osd.491 weight 1.000
}
ctnr cld-osd42-56-sata {
id -142  # do not change unnecessarily
# weight 10.000
alg straw2
hash 0   # rjenkins1
item cld-osd29-56-sata weight 10.000
item cld-osd30-56-sata weight 10.000
item cld-osd31-56-sata weight 10.000
item cld-osd32-56-sata weight 10.000
item cld-osd33-56-sata weight 10.000
item cld-osd34-56-sata weight 10.000
item cld-osd35-56-sata weight 10.000
}
media site1-rack1-sata {
id -11
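
For anyone reproducing this setup, a short sketch of the related commands (the
profile values are copied from the get output above; <pgid> is a placeholder;
the commands are standard in jewel):

# how a profile like ISA-4-2 is typically defined
ceph osd erasure-code-profile set ISA-4-2 k=4 m=2 plugin=isa \
    ruleset-failure-domain=ctnr ruleset-root=site1-sata

# confirm which profile and crush ruleset the EC pool actually uses,
# and where one of its PGs maps
ceph osd pool get .rgw.buckets.data erasure_code_profile
ceph osd pool get .rgw.buckets.data crush_ruleset
ceph osd crush rule dump site1_sata_erasure_ruleset
ceph pg map <pgid>        # prints the up/acting OSD set for that PG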

[ceph-users] jewel10.2.11 EC pool out a osd,its PGs remap to the osds in the same host

2019-02-12 Thread hnuzhoulin2





