[linux-lvm] bcache now to become io-manager?
> "Joe" == Joe Thornber writes:

[ I saw this and couldn't resist asking for more details ... ]

Joe> Also there are big changes to bcache coming, that remove file
Joe> descriptors from the interface completely.  See the branch
Joe> 2019-09-05-add-io-manager for more info (bcache has been renamed
Joe> io_manager).

Can you post the full URL for this, and maybe give us a teaser on what to expect?

I run lvcache on my home system and I'd love to know how to A) improve reporting of how well it works, B) whether I'm using it right, and of course C) whether bcache would be a better replacement.

I'm using Linux v5.0.21 (self-compiled) on Debian 9.11 with an lvcache setup. It's a pair of mirrored 250GB SSDs in front of 4TB of mirrored disk. So anything where I can use my SSDs to cache my data accesses is certainly interesting.

John

___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
On Tue, Oct 29, 2019 at 11:47:19AM +, Heming Zhao wrote:
> You are right. I forgot the radix_tree_remove_prefix().
> The radix_tree_remove_prefix() is called by both bcache_invalidate_fd &
> bcache_abort_fd.
> So the job will not work as expected in bcache_abort_fd, because the node is
> already removed.
> Please correct me if my thinking is wrong.

Slightly wrong: the remove_prefix in invalidate should only run if it.success is true. Well spotted.

- Joe
Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
On 10/29/19 7:01 PM, Joe Thornber wrote:
> On Tue, Oct 29, 2019 at 05:07:30AM +, Heming Zhao wrote:
>> Hello Joe,
>>
>> Please check my comments for your commits 2938b4dcc & 6b0d969b
>>
>> 1. b->ref_count is non-zero, and a write error happens; is the data never released?
>> (there is no place that calls _unlink_block & _free_block)
>
> Correct, the data will not be released until the client calls bcache_abort_fd(), to
> indicate that it's giving up on the write.  That way the client is free to retry
> the io, eg, see this unit test:
>
> static void test_write_bad_issue_stops_flush(void *context)
> {
>         struct fixture *f = context;
>         struct mock_engine *me = f->me;
>         struct bcache *cache = f->cache;
>         struct block *b;
>         int fd = 17;
>
>         T_ASSERT(bcache_get(cache, fd, 0, GF_ZERO, &b));
>         _expect_write_bad_issue(me, fd, 0);
>         bcache_put(b);
>         T_ASSERT(!bcache_flush(cache));
>
>         // we'll let it succeed the second time
>         _expect_write(me, fd, 0);
>         _expect(me, E_WAIT);
>         T_ASSERT(bcache_flush(cache));
> }

You are right.

>> 2. When dev_write_bytes fails, calling dev_unset_last_byte with "fd=-1" is wrong.
>
> Quite possibly; this unset_last_byte stuff is a hack that Dave put in.  I'll
> look some more.
>
>> 3. I still think the error handling below should be added.
>> It is based on stable-2.02, but the core idea is the same: check the
>> return value of io_submit & io_getevents.
>>
>> ```
>> static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
>> ... ...
>>         if (r < 0) {
>>                 _cb_free(e->cbs, cb);
>> +               ((struct block *)context)->error = r; <== assign errno & print warning
>> +               log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
>> +                        (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
>> +                        (long long unsigned)nbytes, r, strerror(-r));
>>                 return false;
>>         }
>>
>> static void _issue_low_level(struct block *b, enum dir d)
>> ... ...
>>         dm_list_move(&cache->io_pending, &b->list);
>>
>>         if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
>> -               /* FIXME: if io_submit() set an errno, return that instead of EIO? */
>> -               _complete_io(b, -EIO);
>> +               _complete_io(b, b->error); <=== this passes the right errno to the caller.
>>                 return;
>>         }
>> }
>
> Yep, this is good.  Added.
>
>> -static void _wait_all(struct bcache *cache)
>> +static bool _wait_all(struct bcache *cache) <=== change to return error
>> {
>> +       bool ret = true;
>>         while (!dm_list_empty(&cache->io_pending))
>> -               _wait_io(cache);
>> +               ret = _wait_io(cache);
>> +       return ret;
>> }
>>
>> -static void _wait_specific(struct block *b)
>> +static bool _wait_specific(struct block *b) <=== change to return error
>> {
>> +       bool ret = true;
>>         while (_test_flags(b, BF_IO_PENDING))
>> -               _wait_io(b->cache);
>> +               ret = _wait_io(b->cache);
>> +       return ret;
>> }
>
> No, the wait functions just wait for io to complete; they're not interested
> in whether it succeeded.  That's what b->error is for.

If io_getevents fails, what do you do? Just ignore it? The data would still be on cache->io_pending, not on cache->errored.

>> bool bcache_flush(struct bcache *cache) <== add more error handling
>> {
>> +       bool write_ret = true, wait_ret = true;
>>
>> ... ...
>>                 _issue_write(b);
>> +               if (b->error) write_ret = false;
>>         }
>>
>> -       _wait_all(cache);
>> +       wait_ret = _wait_all(cache);
>>
>> -       return dm_list_empty(&cache->errored);
>> +       if (write_ret == false || wait_ret == false ||
>> +           !dm_list_empty(&cache->errored))
>> +               return false;
>> +       else
>> +               return true;
>> }
>> ```
>
> I don't understand how this changes the behaviour from just checking the
> size of cache->errored.

This is the stable-2.02 code; the master branch looks like the code below. The core idea is to check the io_submit & io_getevents return values (refer to the code changes above).

```
bool bcache_flush(struct bcache *cache)
{
+       bool write_ret = true, wait_ret = true;
+
        // Only dirty data is on the errored list, since bad read blocks get
        // recycled straight away.  So we put these back on the dirty list, and
        // try and rewrite everything.
        dm_list_splice(&cache->dirty, &cache->errored);

        while (!dm_list_empty(&cache->dirty)) {
                struct block *b = dm_list_item(_list_pop(&cache->dirty), struct block);
                if (b->ref_count || _test_flags(b, BF_IO_PENDING)) {
                        // The superblock may well be still locked.
                        continue;
                }

                _issue_write(b);
+               if (b->error) write_ret = false;
        }

-       _wait_all(cache);
+       wait_ret = _wait_all(cache);

-       return dm_list_empty(&cache->errored);
+       if (write_ret
```
Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
You are right. I forgot the radix_tree_remove_prefix().
The radix_tree_remove_prefix() is called by both bcache_invalidate_fd &
bcache_abort_fd.
So the job will not work as expected in bcache_abort_fd, because the node is
already removed.
Please correct me if my thinking is wrong.

On 10/29/19 7:05 PM, Joe Thornber wrote:
> On Tue, Oct 29, 2019 at 09:46:56AM +, Heming Zhao wrote:
>> Add another comment.
>>
>> The key of commits 2938b4dcc & 6b0d969b is the function _invalidate_fd().
>> But from my analysis, it looks like your code does not fix this issue.
>>
>> _invalidate_fd calls bcache_invalidate_fd & bcache_abort_fd.
>> bcache_invalidate_fd works as before, only returning true or false to indicate
>> whether there is an fd in cache->errored.
>> Then bcache_abort_fd calls _abort_v, and _abort_v calls _unlink_block &
>> _free_block.
>> These two functions only delist the block from the current cache->errored &
>> cache->free lists.
>> The data is still in the radix tree with the BF_DIRTY flag.
>
> In _abort_v():
>
>         // We can't remove the block from the radix tree yet because
>         // we're in the middle of an iteration.
>
> and then after the iteration:
>
>         radix_tree_remove_prefix(cache->rtree, k.bytes, k.bytes + sizeof(k.parts.fd));
>
> I've added a unit test to demonstrate it does indeed work:
>
> static void test_abort_forces_reread(void *context)
> {
>         struct fixture *f = context;
>         struct mock_engine *me = f->me;
>         struct bcache *cache = f->cache;
>         struct block *b;
>         int fd = 17;
>
>         _expect_read(me, fd, 0);
>         _expect(me, E_WAIT);
>         T_ASSERT(bcache_get(cache, fd, 0, GF_DIRTY, &b));
>         bcache_put(b);
>
>         bcache_abort_fd(cache, fd);
>         T_ASSERT(bcache_flush(cache));
>
>         // Check the block is re-read
>         _expect_read(me, fd, 0);
>         _expect(me, E_WAIT);
>         T_ASSERT(bcache_get(cache, fd, 0, 0, &b));
>         bcache_put(b);
> }
>
> - Joe
Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
On Tue, Oct 29, 2019 at 09:46:56AM +, Heming Zhao wrote:
> Add another comment.
>
> The key of commits 2938b4dcc & 6b0d969b is the function _invalidate_fd().
> But from my analysis, it looks like your code does not fix this issue.
>
> _invalidate_fd calls bcache_invalidate_fd & bcache_abort_fd.
> bcache_invalidate_fd works as before, only returning true or false to indicate
> whether there is an fd in cache->errored.
> Then bcache_abort_fd calls _abort_v, and _abort_v calls _unlink_block &
> _free_block.
> These two functions only delist the block from the current cache->errored &
> cache->free lists.
> The data is still in the radix tree with the BF_DIRTY flag.

In _abort_v():

        // We can't remove the block from the radix tree yet because
        // we're in the middle of an iteration.

and then after the iteration:

        radix_tree_remove_prefix(cache->rtree, k.bytes, k.bytes + sizeof(k.parts.fd));

I've added a unit test to demonstrate it does indeed work:

static void test_abort_forces_reread(void *context)
{
        struct fixture *f = context;
        struct mock_engine *me = f->me;
        struct bcache *cache = f->cache;
        struct block *b;
        int fd = 17;

        _expect_read(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_get(cache, fd, 0, GF_DIRTY, &b));
        bcache_put(b);

        bcache_abort_fd(cache, fd);
        T_ASSERT(bcache_flush(cache));

        // Check the block is re-read
        _expect_read(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_get(cache, fd, 0, 0, &b));
        bcache_put(b);
}

- Joe
Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
On Tue, Oct 29, 2019 at 05:07:30AM +, Heming Zhao wrote:
> Hello Joe,
>
> Please check my comments for your commits 2938b4dcc & 6b0d969b
>
> 1. b->ref_count is non-zero, and a write error happens; is the data never released?
> (there is no place that calls _unlink_block & _free_block)

Correct, the data will not be released until the client calls bcache_abort_fd(), to
indicate that it's giving up on the write.  That way the client is free to retry
the io, eg, see this unit test:

static void test_write_bad_issue_stops_flush(void *context)
{
        struct fixture *f = context;
        struct mock_engine *me = f->me;
        struct bcache *cache = f->cache;
        struct block *b;
        int fd = 17;

        T_ASSERT(bcache_get(cache, fd, 0, GF_ZERO, &b));
        _expect_write_bad_issue(me, fd, 0);
        bcache_put(b);
        T_ASSERT(!bcache_flush(cache));

        // we'll let it succeed the second time
        _expect_write(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_flush(cache));
}

> 2. When dev_write_bytes fails, calling dev_unset_last_byte with "fd=-1" is wrong.

Quite possibly; this unset_last_byte stuff is a hack that Dave put in.  I'll
look some more.

> 3. I still think the error handling below should be added.
> It is based on stable-2.02, but the core idea is the same: check the
> return value of io_submit & io_getevents.
>
> ```
> static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
> ... ...
>         if (r < 0) {
>                 _cb_free(e->cbs, cb);
> +               ((struct block *)context)->error = r; <== assign errno & print warning
> +               log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
> +                        (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
> +                        (long long unsigned)nbytes, r, strerror(-r));
>                 return false;
>         }
>
> static void _issue_low_level(struct block *b, enum dir d)
> ... ...
>         dm_list_move(&cache->io_pending, &b->list);
>
>         if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
> -               /* FIXME: if io_submit() set an errno, return that instead of EIO? */
> -               _complete_io(b, -EIO);
> +               _complete_io(b, b->error); <=== this passes the right errno to the caller.
>                 return;
>         }
> }

Yep, this is good.  Added.

> -static void _wait_all(struct bcache *cache)
> +static bool _wait_all(struct bcache *cache) <=== change to return error
> {
> +       bool ret = true;
>         while (!dm_list_empty(&cache->io_pending))
> -               _wait_io(cache);
> +               ret = _wait_io(cache);
> +       return ret;
> }
>
> -static void _wait_specific(struct block *b)
> +static bool _wait_specific(struct block *b) <=== change to return error
> {
> +       bool ret = true;
>         while (_test_flags(b, BF_IO_PENDING))
> -               _wait_io(b->cache);
> +               ret = _wait_io(b->cache);
> +       return ret;
> }

No, the wait functions just wait for io to complete; they're not interested
in whether it succeeded.  That's what b->error is for.

> bool bcache_flush(struct bcache *cache) <== add more error handling
> {
> +       bool write_ret = true, wait_ret = true;
>
> ... ...
>                 _issue_write(b);
> +               if (b->error) write_ret = false;
>         }
>
> -       _wait_all(cache);
> +       wait_ret = _wait_all(cache);
>
> -       return dm_list_empty(&cache->errored);
> +       if (write_ret == false || wait_ret == false ||
> +           !dm_list_empty(&cache->errored))
> +               return false;
> +       else
> +               return true;
> }
> ```

I don't understand how this changes the behaviour from just checking the
size of cache->errored.

- Joe
Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
Add another comment.

The key of commits 2938b4dcc & 6b0d969b is the function _invalidate_fd().
But from my analysis, it looks like your code does not fix this issue.

_invalidate_fd calls bcache_invalidate_fd & bcache_abort_fd.
bcache_invalidate_fd works as before, only returning true or false to indicate
whether there is an fd in cache->errored.
Then bcache_abort_fd calls _abort_v, and _abort_v calls _unlink_block & _free_block.
These two functions only delist the block from the current cache->errored &
cache->free lists.
The data is still in the radix tree with the BF_DIRTY flag.

So when the next lvm command reuses the closed, errored fd and calls
_lookup_or_read_block (with the legacy fd), the stale data will be reused again.
The bug will be triggered.

Thanks,
zhm

On 10/29/19 1:07 PM, heming.zhao wrote:
> Hello Joe,
>
> Please check my comments for your commits 2938b4dcc & 6b0d969b
>
> 1. b->ref_count is non-zero, and a write error happens; is the data never released?
> (there is no place that calls _unlink_block & _free_block)
>
> 2. When dev_write_bytes fails, calling dev_unset_last_byte with "fd=-1" is wrong.
>
> 3. I still think the error handling below should be added.
> It is based on stable-2.02, but the core idea is the same: check the
> return value of io_submit & io_getevents.
> ```
> static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
> ... ...
>         if (r < 0) {
>                 _cb_free(e->cbs, cb);
> +               ((struct block *)context)->error = r; <== assign errno & print warning
> +               log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
> +                        (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
> +                        (long long unsigned)nbytes, r, strerror(-r));
>                 return false;
>         }
>
> static void _issue_low_level(struct block *b, enum dir d)
> ... ...
>         dm_list_move(&cache->io_pending, &b->list);
>
>         if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
> -               /* FIXME: if io_submit() set an errno, return that instead of EIO? */
> -               _complete_io(b, -EIO);
> +               _complete_io(b, b->error); <=== this passes the right errno to the caller.
>                 return;
>         }
> }
>
> -static void _wait_all(struct bcache *cache)
> +static bool _wait_all(struct bcache *cache) <=== change to return error
> {
> +       bool ret = true;
>         while (!dm_list_empty(&cache->io_pending))
> -               _wait_io(cache);
> +               ret = _wait_io(cache);
> +       return ret;
> }
>
> -static void _wait_specific(struct block *b)
> +static bool _wait_specific(struct block *b) <=== change to return error
> {
> +       bool ret = true;
>         while (_test_flags(b, BF_IO_PENDING))
> -               _wait_io(b->cache);
> +               ret = _wait_io(b->cache);
> +       return ret;
> }
>
> bool bcache_flush(struct bcache *cache) <== add more error handling
> {
> +       bool write_ret = true, wait_ret = true;
>
> ... ...
>                 _issue_write(b);
> +               if (b->error) write_ret = false;
>         }
>
> -       _wait_all(cache);
> +       wait_ret = _wait_all(cache);
>
> -       return dm_list_empty(&cache->errored);
> +       if (write_ret == false || wait_ret == false ||
> +           !dm_list_empty(&cache->errored))
> +               return false;
> +       else
> +               return true;
> }
> ```
>
> Thanks,
> zhm
>
> On 10/28/19 11:43 PM, Joe Thornber wrote:
>> On Thu, Oct 24, 2019 at 03:06:18AM +, Heming Zhao wrote:
>>> The first failure is in bcache_flush; then bcache_invalidate_fd does nothing
>>> because the data is in cache->errored, which belongs to neither the dirty
>>> nor the clean list. Then the data is mistakenly moved from cache->errored
>>> into cache->dirty by "bcache_get => _lookup_or_read_block"
>>> (because the data holds the BF_DIRTY flag).
>>
>> I just pushed a couple of patches that will hopefully fix this issue for you:
>>
>> commit 6b0d969b2a85ba69046afa26af4d7bcbccd5 (HEAD -> master,
>> origin/master, origin/HEAD, 2019-10-11-bcache-purge)
>> Author: Joe Thornber
>> Date:   Mon Oct 28 15:01:47 2019 +
>>
>>     [label] Use bcache_abort_fd() to ensure blocks are no longer in the cache.
>>
>>     The return value from bcache_invalidate_fd() was not being checked.
>>     So I've introduced a little function, _invalidate_fd() that always
>>     calls bcache_abort_fd() if the write fails.
>>
>> commit 2938b4dcca0a1df661758abfab7f402ea7aab018
>> Author: Joe Thornber
>> Date:   Mon Oct 28 14:29:47 2019 +
>>
>>     [bcache] add bcache_abort()
>>
>>     This gives us a way to cope with write failures.
>>
>> Also there are big changes to bcache coming, that remove file descriptors from
>> the interface completely.  See the branch 2019-09-05-add-io-manager for more info
>> (bcache has been renamed io_manager).
>>
>> - Joe
[linux-lvm] Best way to run LVM over multiple SW RAIDs?
Hello,

I have a server with a very high load, using four NVMe SSDs and therefore no HW RAID. Instead I used SW RAID with the mdadm tool. Using one RAID5 volume does not work well, since the md driver can only utilize one CPU core, which spikes at 100% and harms performance. Therefore I created 8 partitions on each disk, and 8 RAID5s across the four disks.

Now I want to bring them together with LVM. If I do not use a striped volume I get high performance (of the expected magnitude according to the disk specs). But when I use a striped volume, performance drops by an order of magnitude.

The reason I am looking for a striped setup is to ensure that data is spread well over the drives, to guarantee good worst-case performance. With linear allocation rather than striping, if load is directed at files on the first PV (a SW RAID), the system is again exposed to the 1-core limitation.

I tried "--stripes 8 --stripesize 512", and would appreciate any ideas of other things to try. I guess the performance hit could be in the file system as well; I tried XFS and EXT4 with default settings.

Kind Regards,
Daniel
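For reference, the layout described above can be assembled roughly like this (device names, array names, and sizes are placeholders, not taken from the poster's setup; note that --stripes must match the number of PVs in the VG, and the stripe size interacts with the md chunk size, so both are worth sweeping when benchmarking):

```
# One RAID5 per partition "slice" across the four NVMe disks (sketch)
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1
# ... repeat for /dev/md1 .. /dev/md7 using partitions p2 .. p8

# Bring the eight arrays together under LVM
pvcreate /dev/md[0-7]
vgcreate vg_fast /dev/md[0-7]

# One LV stripe per PV; lvcreate's --stripesize defaults to KiB units,
# so "512" means 512 KiB.  Try values from 64K up to the md chunk size.
lvcreate --stripes 8 --stripesize 512K -n lv_data -l 100%FREE vg_fast
```

Whether 512K is a good stripe size depends on the md chunk size and the workload's IO size; a striped LV whose stripe size is misaligned with the underlying RAID5 chunk can turn full-stripe writes into read-modify-write cycles, which would explain a large performance drop.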
Re: [linux-lvm] exposing snapshot block device
Wow! Impressive. This will make history!

If this is possible, then we are able to implement a solution which can do:
- progressive block-level incremental forever (always incremental at the block level: this already exists)
- instant recovery to a point in time (using the methods you just described)

For example, let's say that a client wants to restore a file system, or a logical volume, to how it looked yesterday, even though there are no snapshots, nor any data. Then the client (with some coding) can start from an empty volume, re-attach a COW device, and convert that using lvconvert --merge, so that the copying can be done in the background by the backup server.

If you forget about "how we will re-create the COW device" and just focus on the LVM idea of re-attaching a COW device: do you think that I have understood it correctly?

On Thu 24 Oct 2019 at 18:01, Zdenek Kabelac wrote:
> On 23. 10. 19 at 13:24, Tomas Dalebjörk wrote:
> > I have tested FusionIO together with old thick snapshots.
> > I created the thick snapshot on a separate old traditional SATA drive, just to
> > check if that could be used as a snapshot target for high-performance disks,
> > like a FusionIO card.
> > For those who don't know about FusionIO: they can deal with 150-250,000 IOPS.
> >
> > And to be honest, I couldn't bottleneck the SATA disk I used as a thick
> > snapshot target.
> > The reason why is simple:
> > - thick snapshots use sequential write techniques
> >
> > If I had been using thin snapshots, the writes would most likely have been
> > more randomized on disk, which would have required more spindles to cope
> > with this.
> >
> > Anyhow;
> > I am still eager to hear how to use an external device to import snapshots.
> > And when I say "import", I am not talking about copy-back; more to use to read
> > data from.
>
> The format of the 'on-disk' snapshot metadata for old snapshots is trivial - being some
> header + pairs of data offsets TO-FROM - I think googling will reveal a couple of
> python tools playing with it.
>
> You can add a pre-created COW image to an LV with lvconvert --snapshot,
> and to avoid 'zeroing' the metadata use option -Zn.
> (BTW in the same way you can detach a snapshot from an LV with --splitsnapshot, so
> you can look at how the metadata looks...)
>
> Although it's pretty unusual why anyone would first create the COW image with
> all the special layout and then merge it to the LV - instead of directly
> merging... There is only the 'little' advantage of minimizing the 'offline' time
> of such a device (and it's the reason why --split exists).
>
> Regards
>
> Zdenek