At Thu, 26 Sep 2013 15:25:37 +0800, Liu Yuan wrote:
>
> v2:
>  - update fec.c based on the comments of Hitoshi and Kazutaka
>  - change the strip size to 1K so as to work with VMs
>  - rename get_object_size as get_store_objsize
>  - update some commit messages
>  - remove unnecessary padding
>  - change copy type from uint32_t to uint8_t
>  - add one more patch to pass the functional/030 test
>
> Introduction:
>
> This is the first round of adding erasure code support to sheepdog. This
> patch set adds basic read/write/remove code for erasure-coded vdis. The
> full-blown version is planned to support all the current features, such
> as snapshot, clone, incremental backup and cluster-wide backup.
>
> As always, we support random read/write to erasure-coded vdis.
>
> Instead of storing a full copy on each replica, erasure coding spreads
> the data across all the replicas to achieve the same fault tolerance
> while reducing the redundancy to a minimum (a redundancy level below 0.5).
>
> Sheepdog will transparently support erasure coding for read/write/remove
> operations on the fly while clients are storing/retrieving data in
> sheepdog. No changes to the client APIs or protocols are needed.
>
> In a simple test on my box, aligned 4k writes are up to 1.5x faster than
> replication, and reads up to 1.15x faster, compared with copies=3
> (4:2 scheme).
>
> For a 6-node cluster with 1000Mb/s NICs, I got the following results:
>
> replication (3 copies): write 36.5 MB/s, read 71.8 MB/s
> erasure code (4 : 2)  : write 46.6 MB/s, read 82.9 MB/s
>
> How It Works:
>
> /*
>  * Stripe: data strips + parity strips, spread over all replicas
>  * DS: data strip
>  * PS: parity strip
>  * R: Replica
>  *
>  * +-------------------- stripe ----------------------+
>  * v                                                   v
>  * +----+----------------------------------------------+
>  * | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
>  * +----+----------------------------------------------+
>  * | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
>  * +----+----+----+----+----+ ... +----+----+-----+----+
>  *   R1   R2   R3   R4   R5   ...   Rn  Rn+1  Rn+2 Rn+3
>  */
>
> We use the replicas to hold data and parity strips. Suppose we have a
> 4:2 scheme, i.e. 4 data strips and 2 parity strips on 6 replicas, with a
> strip size of 1k. Then we generate 2k of parity for each 4k write, and
> we call this 6k unit as a whole a stripe. For writes, we spread the data
> horizontally, not vertically as replication does. So for reads, we have
> to assemble the strips from all the data replicas.
>
> The downsides of erasure coding are:
> 1. for recovery, we have to recover 0.5x more data
> 2. if any replica fails, reads have to wait for its recovery
> 3. it needs at least 6 (4+2) nodes to work
>
> Usage:
>
> Just add one more option to 'dog vdi create':
>
> $ dog vdi create -e test 10G  # create an erasure-coded vdi with
>                               # thin provisioning
>
> For now we only use a fixed scheme (4 data and 2 parity strips) with
> '-e'. But I have '-e number' planned, so that users can specify how many
> parity replicas they want, with different erasure schemes for different
> vdis. E.g., we can have:
>
> -e 2 --> 4 : 2 (0.5 redundancy; can tolerate 2 node failures)
> -e 3 --> 8 : 3 (0.375 redundancy; can tolerate 3 node failures)
> -e 4 --> 8 : 4 (0.5 redundancy; can tolerate 4 node failures)
>
> TODOs:
>
> 1. add recovery code
> 2. support snapshot/clone/backup
> 3. support user-defined redundancy level
> 4. add tests
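[Editor's note: a minimal C sketch of the strip addressing the quoted
layout describes, for readers unfamiliar with striping. All names here
(ec_scheme, ec_locate, STRIP_SIZE) are hypothetical illustrations, not
the patch set's actual API.]

/* Map byte offsets to strips under a d:p scheme with 1K strips, as in
 * the quoted diagram: consecutive strips of one stripe land on the
 * data replicas R1..Rd; parity goes to the remaining p replicas. */
#include <stdint.h>
#include <stdio.h>

#define STRIP_SIZE 1024 /* 1K strips, as in the quoted layout */

struct ec_scheme {
	int d; /* data strips per stripe, e.g. 4 */
	int p; /* parity strips per stripe, e.g. 2 */
};

/* Map a byte offset inside an object to (stripe number, data replica
 * index, offset inside the strip). */
static void ec_locate(const struct ec_scheme *s, uint64_t off,
		      uint64_t *stripe, int *replica, uint32_t *strip_off)
{
	uint64_t strip = off / STRIP_SIZE;

	*stripe = strip / s->d;        /* which stripe the strip belongs to */
	*replica = strip % s->d;       /* which data replica holds it */
	*strip_off = off % STRIP_SIZE; /* offset within the 1K strip */
}

int main(void)
{
	struct ec_scheme s = { .d = 4, .p = 2 }; /* the fixed 4:2 scheme */
	uint64_t stripe;
	int replica;
	uint32_t strip_off;

	/* A 4K write fills one stripe (4 x 1K data) and additionally
	 * generates 2 x 1K parity: the 6K "stripe" from the diagram. */
	for (uint64_t off = 0; off < 4096; off += STRIP_SIZE) {
		ec_locate(&s, off, &stripe, &replica, &strip_off);
		printf("off %llu -> stripe %llu, data replica R%d\n",
		       (unsigned long long)off, (unsigned long long)stripe,
		       replica + 1);
	}
	return 0;
}

This also makes the read path visible: a 4K read touches all four data
replicas, which is why any failed replica stalls reads until recovery.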
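[Editor's note: the redundancy figures in the proposed '-e' table are
just parity/data. A tiny sketch of that arithmetic, with a hypothetical
table that is not part of the patch set:]

#include <stdio.h>

int main(void)
{
	/* The quoted '-e' proposals: d data strips, p parity strips. */
	struct { int opt, d, p; } schemes[] = {
		{ 2, 4, 2 }, /* -e 2: 4:2 */
		{ 3, 8, 3 }, /* -e 3: 8:3 */
		{ 4, 8, 4 }, /* -e 4: 8:4 */
	};

	for (int i = 0; i < 3; i++)
		printf("-e %d -> %d : %d, redundancy %.3f, "
		       "tolerates %d node failures\n",
		       schemes[i].opt, schemes[i].d, schemes[i].p,
		       (double)schemes[i].p / schemes[i].d, schemes[i].p);
	return 0;
}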
Applied, thanks.

I'd like to see the above implementation before the 0.8 release. :)

Kazutaka
--
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog
