On Mon, Nov 9, 2015 at 9:42 AM, Deneau, Tom <tom.den...@amd.com> wrote:
> I don't have much experience with crush rules but wanted one that does the 
> following:
>
> On a 3-node cluster, I wanted a rule for an erasure-coded pool of k=3,m=2
> where the first 3 chunks (the read chunks) are all on different hosts, and
> the last 2 chunks go to different osds but can reuse the hosts (since we
> don't have enough hosts in this cluster to put all 5 chunks on different
> hosts).
>
> Here was my attempt at a rule:
>
> rule combo-rule-ecrule-3-2 {
>     ruleset 9
>     type erasure
>     min_size 5
>     max_size 5
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default
>     step chooseleaf indep 3 type host
>     step emit
>     step take default
>     step chooseleaf indep -3 type osd
>     step emit
> }
>
> which was fine for the first 3 osds, but had the problem that the last 2 osds
> were often chosen to be the same as the first 2 osds. For example (each host
> has 5 osds, so 0-4, 5-9, and 10-14 are the osd numbers per host):
>
> 18.7c   0   0   0   0   0   0   0   0   active+clean    2015-11-09 09:28:40.744509  0'0 227:9   [11,1,6,11,12]  11
> 18.7d   0   0   0   0   0   0   0   0   active+clean    2015-11-09 09:28:42.734292  0'0 227:9   [4,11,5,4,0]    4
> 18.7e   0   0   0   0   0   0   0   0   active+clean    2015-11-09 09:28:42.569645  0'0 227:9   [5,0,12,5,0]    5
> 18.7f   0   0   0   0   0   0   0   0   active+clean    2015-11-09 09:28:41.897589  0'0 227:9   [2,12,6,2,12]   2
>
> How should such a rule be written?

In *general* there's not a good way to specify what you're after. In
specific cases you can often do something like:

rule combo-rule-ecrule-3-2 {
    ruleset 9
    type erasure
    min_size 5
    max_size 5
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}

That will generate 6 OSD IDs across your three hosts (2 per host); since
the pool only needs 5, the last one will get cut off the end of the list.
(You need sufficiently new clients or they won't like this, but it is
supported now.) You won't have any duplicates.
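
If you want to confirm what the rule actually does before pointing a pool
at it, you can compile it into a copy of the CRUSH map and run crushtool in
test mode, roughly like this (the file names are just placeholders):

# grab the current map, decompile it, add the rule above, recompile
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
#   ... edit crushmap.txt to add the rule ...
crushtool -c crushmap.txt -o crushmap.new
# show the placements ruleset 9 produces when asked for 5 chunks
crushtool --test -i crushmap.new --rule 9 --num-rep 5 --show-mappings
# list any PGs whose mapping comes up short
crushtool --test -i crushmap.new --rule 9 --num-rep 5 --show-bad-mappings
# only once you're happy with the output:
ceph osd setcrushmap -i crushmap.new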

It will not put the full read set (the first 3 chunks) of each PG on 3
different hosts, but since the hosts and osds are chosen pseudo-randomly
anyway it should balance out in the end.

I guess I should note that people have done this with replicated pools
but I'm not sure about EC ones so there might be some weird side
effects. In particular, if you lose an entire node, CRUSH will fail to
map fully and things won't be able to repair. (That will be the case
in general though, if you require copies across 3 hosts and only have
3.)
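
You can get a preview of that failure mode the same way: re-run the
crushtool test but weight one host's osds to zero for the test run only.
Roughly (reusing the test map from above, and assuming the lost host holds
osds 10-14 as in the layout you described):

# pretend one host is gone and look for mappings that can't be filled
crushtool --test -i crushmap.new --rule 9 --num-rep 5 \
    --weight 10 0 --weight 11 0 --weight 12 0 --weight 13 0 --weight 14 0 \
    --show-bad-mappings
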
-Greg
