Re: [linux-lvm] when bringing dm-cache online, consumes all memory and reboots

2020-03-23 Thread Joe Thornber
On Sun, Mar 22, 2020 at 10:57:35AM -0700, Scott Mcdermott wrote:
> have a 931.5 GibiByte SSD pair in raid1 (mdraid) as cache LV for a
> data LV on 1.8 TebiByte raid1 (mdraid) pair of larger spinning disk.
> these disks are hosted by a small 4GB big.little ARM system
> running 4.4.192-rk3399 (armbian 5.98 bionic).  parameters were set
> with: lvconvert --type cache --cachemode writeback --cachepolicy smq
> --cachesettings migration_threshold=1000

If you crash then the cache assumes all blocks are dirty and performs
a full writeback.  You have set the migration_threshold extremely high
so I think this writeback process is just submitting far too much io at once.

Bring it down to around 2048 and try again.
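For reference, something like the following should do it on the live cached LV
(a hedged sketch -- vg/lv is a placeholder for your cached LV, and the exact
reporting field names vary between versions):

    # lower the migration threshold on the cached LV (placeholder names)
    lvchange --cachesettings migration_threshold=2048 vg/lv

    # confirm what the kernel is actually using
    lvs -o+kernel_cache_settings,cache_settings vg/lv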

- Joe




Re: [linux-lvm] bcache now to become io-manager?

2019-10-30 Thread Joe Thornber
On Tue, Oct 29, 2019 at 08:25:23PM -0400, John Stoffel wrote:
> >>>>> "Joe" == Joe Thornber  writes:
> 
> [ I saw this and couldn't resist asking for more details ...]
> 
> Joe> Also there are big changes to bcache coming, that remove file
> Joe> descriptors from the interface completely.  See the branch
> Joe> 2019-09-05-add-io-manager for more info (bcache has been renamed
> Joe> io_manager).
> 
> Can you post the full URL for this, and maybe give us a teaser on what
> to expect?  I run lvcache on my home system and I'd love to know A) how to
> improve reporting of how well it works, B) whether I'm using it
> right, and of course C) whether bcache would be a better replacement.
> 
> I'm using Linux v5.0.21 (self compiled) on Debian 9.11 with lvcache
> setup.  It's a pair of mirrored 250gb SSDs in front of 4tb of mirrored
> disk.
> 
> So anything where I can use my SSDs to cache my data accesses is
> certainly interesting.

Sorry, I think we're getting confused with similarly named things.  There
is a component in LVM source called 'block cache' that we use to scan
devices in parallel (under the hood it uses aio).  That component is being
renamed.  It's nothing to do with the disk caching technology called 'bcache'.

However, Dave Teigland has been adding support to LVM for a new caching
target that emphasises caching writes.  So keep an eye out for that.
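(For the curious: the write-focused target being referred to is presumably
dm-writecache; the lvconvert syntax below is only a hedged sketch of how it
later surfaced in the 2.03 series, with placeholder names.)

    # attach a fast LV as a write cache in front of a slow LV (placeholder names)
    lvconvert --type writecache --cachevol fast vg/slow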


- Joe




Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error

2019-10-29 Thread Joe Thornber
On Tue, Oct 29, 2019 at 11:47:19AM +, Heming Zhao wrote:
> You are right. I forgot the radix_tree_remove_prefix().
> The radix_tree_remove_prefix() is called by both bcache_invalidate_fd &
> bcache_abort_fd.
> So the job will not work as expected in bcache_abort_fd, because the node is 
> already removed.
> Please correct me if my thinking is wrong.

Slightly wrong: the remove_prefix in invalidate should only run if
it.success is true.
Well spotted.

- Joe




Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error

2019-10-29 Thread Joe Thornber
On Tue, Oct 29, 2019 at 09:46:56AM +, Heming Zhao wrote:
> Add another comment.
> 
> The key to commits 2938b4dcc & 6b0d969b is the function _invalidate_fd().
> But from my analysis, it looks like your code does not fix this issue.
> 
> _invalidate_fd calls bcache_invalidate_fd & bcache_abort_fd.
> bcache_invalidate_fd works as before, only returning true or false to indicate
> whether there is an fd in cache->errored.
> Then bcache_abort_fd calls _abort_v, and _abort_v calls _unlink_block &
> _free_block.
> These two functions only delist the block from the current cache->errored &
> cache->free lists.
> The data is still in the radix tree with the BF_DIRTY flag.

In _abort_v():

        // We can't remove the block from the radix tree yet because
        // we're in the middle of an iteration.

and then after the iteration:

        radix_tree_remove_prefix(cache->rtree, k.bytes, k.bytes + sizeof(k.parts.fd));

I've added a unit test to demonstrate it does indeed work:

static void test_abort_forces_reread(void *context)
{
        struct fixture *f = context;
        struct mock_engine *me = f->me;
        struct bcache *cache = f->cache;
        struct block *b;
        int fd = 17;

        _expect_read(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_get(cache, fd, 0, GF_DIRTY, &b));
        bcache_put(b);

        bcache_abort_fd(cache, fd);
        T_ASSERT(bcache_flush(cache));

        // Check the block is re-read
        _expect_read(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_get(cache, fd, 0, 0, &b));
        bcache_put(b);
}

- Joe


Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error

2019-10-29 Thread Joe Thornber
On Tue, Oct 29, 2019 at 05:07:30AM +, Heming Zhao wrote:
> Hello Joe,
> 
> Please check my comments for your commit 2938b4dcc & 6b0d969b
> 
> 1. If b->ref_count is non-zero and a write error happens, is the data never released?
> (there is no place to call _unlink_block & _free_block)

Correct, the data will not be released until the client calls bcache_abort_fd(),
to indicate that it's giving up on the write.  That way the client is free to
retry the io; e.g. see this unit test:

static void test_write_bad_issue_stops_flush(void *context)
{
        struct fixture *f = context;
        struct mock_engine *me = f->me;
        struct bcache *cache = f->cache;
        struct block *b;
        int fd = 17;

        T_ASSERT(bcache_get(cache, fd, 0, GF_ZERO, &b));
        _expect_write_bad_issue(me, fd, 0);
        bcache_put(b);
        T_ASSERT(!bcache_flush(cache));

        // we'll let it succeed the second time
        _expect_write(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_flush(cache));
}


> 2. When dev_write_bytes fails, calling dev_unset_last_byte with "fd=-1" is
> wrong.

Quite possibly, this unset_last_byte stuff is a hack that Dave put in.  I'll 
look some more.


> 3. I still think the error handling below should be added.
> It is based on stable-2.02, but the core idea is the same: we should check the
> return values of io_submit & io_getevents.
> 
> ```
> static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
> ... ...
>  if (r < 0) {
>  _cb_free(e->cbs, cb);
> +   ((struct block *)context)->error = r; <== assign errno & print warning
> +   log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
> +   (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
> +   (long long unsigned)nbytes, r, strerror(-r));
>  return false;
>  }
> 
> static void _issue_low_level(struct block *b, enum dir d)
> ... ...
>  dm_list_move(&cache->io_pending, &b->list);
>   
>  if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
> -   /* FIXME: if io_submit() set an errno, return that instead of EIO? */
> -   _complete_io(b, -EIO);
> +   _complete_io(b, b->error); <=== this passes the right errno to the caller.
>  return;
>  }
>   }

Yep, this is good. Added.


> -static void _wait_all(struct bcache *cache)
> +static bool _wait_all(struct bcache *cache) <=== change to return error
>   {
> +   bool ret = true;
>  while (!dm_list_empty(&cache->io_pending))
> -   _wait_io(cache);
> +   ret = _wait_io(cache);
> +   return ret;
>   }
>   
> -static void _wait_specific(struct block *b)
> +static bool _wait_specific(struct block *b) <=== change to return error
>   {
> +   bool ret = true;
>  while (_test_flags(b, BF_IO_PENDING))
> -   _wait_io(b->cache);
> +   ret = _wait_io(b->cache);
> +   return ret;
>   }

No, the wait functions just wait for io to complete; they're not interested
in whether it succeeded.  That's what b->error is for.
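Roughly the idea, as a simplified sketch (the real _complete_io in
lib/device/bcache.c differs in detail): completion records the engine's result
on the block, and failed blocks end up on cache->errored, which is what
bcache_flush() inspects afterwards.

    static void _complete_io(struct block *b, int error)
    {
            b->error = error;

            if (b->error)
                    /* remember the failure; bcache_flush() checks this list */
                    dm_list_move(&b->cache->errored, &b->list);
            else
                    /* io went through fine */
                    dm_list_move(&b->cache->clean, &b->list);
    }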


> 
>   bool bcache_flush(struct bcache *cache) <=== add more error handling
>   {
> +   bool write_ret = true, wait_ret = true;
>   
>  ... ...
>  _issue_write(b);
> +   if (b->error) write_ret = false;
>  }
>   
> -   _wait_all(cache);
> +   wait_ret = _wait_all(cache);
>   
> -   return dm_list_empty(&cache->errored);
> +   if (write_ret == false || wait_ret == false ||
> +   !dm_list_empty(&cache->errored))
> +   return false;
> +   else
> +   return true;
>   }

I don't understand how this changes the behaviour from just checking the
size of cache->errored.

- Joe


Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error

2019-10-28 Thread Joe Thornber
On Thu, Oct 24, 2019 at 03:06:18AM +, Heming Zhao wrote:
> The first failure is in bcache_flush; then bcache_invalidate_fd does nothing
> because the data is in cache->errored, which belongs to neither the dirty nor
> the clean list. Then the data is mistakenly moved from cache->errored into
> cache->dirty by "bcache_get => _lookup_or_read_block" (because the data holds
> the BF_DIRTY flag).

I just pushed a couple of patches that will hopefully fix this issue for you:

commit 6b0d969b2a85ba69046afa26af4d7bcbccd5 (HEAD -> master, origin/master, 
origin/HEAD, 2019-10-11-bcache-purge)
Author: Joe Thornber 
Date:   Mon Oct 28 15:01:47 2019 +

[label] Use bcache_abort_fd() to ensure blocks are no longer in the cache.

The return value from bcache_invalidate_fd() was not being checked.

So I've introduced a little function, _invalidate_fd() that always
calls bcache_abort_fd() if the write fails.

commit 2938b4dcca0a1df661758abfab7f402ea7aab018
Author: Joe Thornber 
Date:   Mon Oct 28 14:29:47 2019 +

[bcache] add bcache_abort()

This gives us a way to cope with write failures.
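To make the first of those concrete, the wrapper is essentially this (a minimal
sketch -- the real function lives in the label-scanning code and may differ in
detail):

    static bool _invalidate_fd(struct bcache *cache, int fd)
    {
            if (bcache_invalidate_fd(cache, fd))
                    return true;

            /* writeback failed: throw the dirty blocks away so the fd can be
             * closed and reused without stale data following it around */
            bcache_abort_fd(cache, fd);
            return false;
    }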




Also there are big changes to bcache coming, that remove file descriptors from
the interface completely.  See the branch 2019-09-05-add-io-manager for more 
info
(bcache has been renamed io_manager).

- Joe




Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error

2019-10-23 Thread Joe Thornber
On Tue, Oct 22, 2019 at 09:47:32AM +, Heming Zhao wrote:
> Hello List & David,
> 
> This patch addresses the earlier thread:
> [linux-lvm] pvresize will cause a meta-data corruption with error message 
> "Error writing device at 4096 length 512"
> 
> I have sent it to our customer, and the code ran as expected. I think this code
> is enough to fix this issue.
> 
> Thanks
> zhm
> 
> --(patch for branch stable-2.02) --
>  From d0d77d0bdad6136c792c966d73dd47b809cb Mon Sep 17 00:00:00 2001
> From: Zhao Heming 
> Date: Tue, 22 Oct 2019 17:22:17 +0800
> Subject: [PATCH] bcache may mistakenly write data to another disk when writes
>   error
> 
> When a bcache write fails, the errored fd and its data are saved in
> cache->errored, then this fd is closed. Later lvm will reuse this
> closed fd for newly opened devices, but the fd's related data is still in
> cache->errored and flagged BF_DIRTY. This means the data may mistakenly
> be written to another disk.

I think the real issue here is that the flush fails, and the error path for that
calls invalidate on the dev, which also fails, but that return value is not checked.
The fd is subsequently closed, and reopened with data still in the cache.

So I think the correct fix is to have a variant of invalidate that doesn't
bother retrying the IO, and just throws away the dirty data.  bcache_abort()?
This should be called when the flush() fails.

- Joe




Re: [linux-lvm] Fast thin volume preallocation?

2019-06-03 Thread Joe Thornber
On Fri, May 31, 2019 at 03:13:41PM +0200, Gionatan Danti wrote:

> - does standard lvmthin support something similar? If not, how do you see a
> zero coalesce/compression/trim/whatever feature?

There isn't such a feature as yet.  With the next iteration of thin I'd like to
get away from the fixed allocation block sizes that we're using, which should
greatly reduce the number of mappings that we need to create to provision a
thick volume, and so speed it up a lot.

Presumably you want a thick volume, but inside a thin pool so that you can use
snapshots?
If so, have you considered the 'external snapshot' feature?
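(As a hedged aside, per lvmthin(7) an external-origin snapshot looks roughly
like the line below; names are placeholders and the origin LV has to be
inactive or read-only:)

    lvcreate -s --thinpool vg/pool -n thin_snap vg/thick_origin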

> - can I obtain something similar by simply touching (maybe with a 512B only
> write) once each thin chunk?

Yes.
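A hedged sketch of that approach (placeholder names; read the real chunk size
from 'lvs -o+chunk_size', and note this zeroes the first 512B of every chunk,
so it is only for a freshly created volume):

    DEV=/dev/vg/thinvol              # placeholder
    CHUNK=$((64 * 1024))             # placeholder: use the pool's real chunk size
    SIZE=$(blockdev --getsize64 "$DEV")
    for ((off = 0; off < SIZE; off += CHUNK)); do
        # one small direct write per chunk is enough to trigger provisioning
        dd if=/dev/zero of="$DEV" bs=512 count=1 seek=$((off / 512)) \
           oflag=direct status=none
    done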

- Joe



Re: [linux-lvm] Some questions about the upcoming LVM v2_02_178

2018-06-13 Thread Joe Thornber
On Tue, Jun 12, 2018 at 08:14:17PM -0600, Gang He wrote:
> Hello Joe,
> 
> 
> >>> On 2018/6/12 at 22:22, in message <20180612142219.ixpzqxqws3qiwqbm@reti>, 
> >>> Joe
> Thornber  wrote:
> > On Tue, Jun 12, 2018 at 03:01:27AM -0600, Gang He wrote:
> >> Hello List,
> >> 
> >> I saw there was a tag "v2_02_178-rc1" for LVM2, then I have some questions 
> > about  the upcoming LVM v2_02_178.
> >> 1) Will there be the version v2_02_178 for LVM2? since I saw some text 
> >> about 
> > Version 3.0.0 in the git change logs.
> > 
> > Yes there will be.  We've had no bug reports for the -rc, so the final
> > release will be the same as the -rc.
> Between LVM2 v2_02_178-rc1 and LVM2 v2_02_177,
> there will not be any components/features removed, just some bug
> fixes, right?


178 dropped support for two very old metadata formats (pool and lvm1), and
switched to async io for all io ops apart from logging.



Re: [linux-lvm] Some questions about the upcoming LVM v2_02_178

2018-06-12 Thread Joe Thornber
On Tue, Jun 12, 2018 at 03:01:27AM -0600, Gang He wrote:
> Hello List,
> 
> I saw there was a tag "v2_02_178-rc1" for LVM2, then I have some questions 
> about  the upcoming LVM v2_02_178.
> 1) Will there be the version v2_02_178 for LVM2? since I saw some text about 
> Version 3.0.0 in the git change logs.

Yes there will be.  We've had no bug reports for the -rc, so the final
release will be the same as the -rc.

> 2) For the next LVM2 version, which components will be affected? since I saw 
> that clvmd related code has been removed.

We've decided to bump the version number to 3.0.0 for the release
*after* 2.02.178.  This change in version number indicates the *start*
of some large changes to lvm.

Obviously the release notes for v3.0.0 will go into this more.  But,
initially the most visible changes will be removal of a couple of
features:

clvmd
-----

The locking required to provide this feature was quite pervasive and
was restricting the adding of new features (for instance, I'd like to
be able to allocate from any LV not just PVs).  With Dave Teigland's
lvmlockd I think the vast majority of use cases are covered.  Those
that are wedded to clvmd can continue to use LVM2.02.*

Also, testing cluster software is terribly expensive; we just don't
have the resources to provide two solutions.

lvmapi
------

This library has been deprecated for a while in favour of the dbus api.


- Joe



Re: [linux-lvm] [PATCH] lib/device/bcache: don't use PAGE_SIZE

2018-05-17 Thread Joe Thornber
Thanks.  Merged.

On Wed, May 16, 2018 at 09:19:03PM +0100, Alex Bennée wrote:
> PAGE_SIZE is not a compile time constant. Use sysconf instead like
> elsewhere in the code.



Re: [linux-lvm] lvm filter regex format

2017-12-19 Thread Joe Thornber
On Mon, Dec 18, 2017 at 06:53:06PM +, matthew patton wrote:
> >    https://github.com/jthornber/lvm2-ejt/blob/master/libdm/regex/parse_rx.h
>  
> not to be ungrateful but why on earth would one NOT use the glibc standard 
> regex library? Nobody cares about pointless optimization. Surprises like 
> "well, we only implemented most of the spec" are what drives people nuts!

It was written about 17 years ago, and the optimisation was not pointless at 
that time.

- Joe



Re: [linux-lvm] lvm filter regex format

2017-12-18 Thread Joe Thornber
On Mon, Dec 18, 2017 at 05:16:14PM +, Thanos Makatos wrote:
> I'm trying to be very specific in the global_filter of lvm.conf and
> ignore devices under /dev/mapper of the format
> '^/dev/mapper/[a-z0-9]{14}$', however the repetition count '{14}' does
> not seem to be honored?
> 
> Currently I have to repeat '[a-z0-9]' fourteen times, which works but
> it's a bit ugly.
> 
> Does the filter use some standardized regex format?

It's a custom engine that I wrote which matches all the regexes in the
filters at the same time (so it is pretty fast).  Looking at the header here:

   https://github.com/jthornber/lvm2-ejt/blob/master/libdm/regex/parse_rx.h

It seems to support just catenation, |, *, +, ?, [], ^ and $
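So, without {n} repetition, spelling the class out (as you are already doing)
is about the best you can do; a hedged lvm.conf sketch of what that looks like,
with the device-name pattern taken from your description:

    devices {
        global_filter = [
            # reject 14-character [a-z0-9] names under /dev/mapper
            "r|^/dev/mapper/[a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]$|",
            # accept everything else
            "a|.*|"
        ]
    }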

Out of interest why are you using the length of the device name as a 
discriminator?

- Joe



Re: [linux-lvm] Snapshot behavior on classic LVM vs ThinLVM

2017-05-12 Thread Joe Thornber
On Fri, May 12, 2017 at 03:02:58PM +0200, Gionatan Danti wrote:
> On 02/05/2017 13:00, Gionatan Danti wrote:
> >>Anyway, I think (and maybe I am wrong...) that the better solution is to
> >>fail *all* writes to a full pool, even the ones directed to allocated
> >>space. This will effectively "freeze" the pool and avoid any
> >>long-standing inconsistencies.

I think dm-thin behaviour is fine given the semantics of write
and flush IOs.

A block device can complete a write even if it hasn't hit the physical
media; a flush request needs to come in at a later time, which means
'flush all IOs that you've previously completed'.  So any software using
a block device (fs, database, etc.) tends to generate batches of writes,
followed by a flush to commit the changes.  For example, if there is a
power failure between the batch of write io completing and the flush
completing, you do not know how many of the writes will be visible when
the machine comes back.

When a pool is full it will allow writes to provisioned areas of a thin to
succeed.  But if any writes failed due to an inability to provision, then a
REQ_FLUSH io to that thin device will *not* succeed.
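To put that in application terms, here is a hedged, self-contained sketch of
the 'batch of writes, then flush' pattern against a thin device (the device
path is a placeholder, and running this overwrites data on it):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* placeholder path: a thin LV sitting in a (possibly full) pool */
            int fd = open("/dev/vg/thinvol", O_WRONLY | O_DIRECT);
            if (fd < 0) { perror("open"); return 1; }

            void *buf;
            if (posix_memalign(&buf, 4096, 4096)) return 1;
            memset(buf, 0xaa, 4096);

            /* batch of writes: completion alone does not guarantee durability,
             * and writes to already-provisioned blocks can still succeed */
            for (int i = 0; i < 16; i++)
                    if (write(fd, buf, 4096) != 4096)
                            perror("write");

            /* the commit point: on a full pool that failed to provision,
             * this flush is where the failure must be reported */
            if (fdatasync(fd) < 0)
                    perror("fdatasync");

            free(buf);
            close(fd);
            return 0;
    }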

- Joe



Re: [linux-lvm] Repair thin pool

2016-02-10 Thread Joe Thornber
On Tue, Feb 09, 2016 at 02:03:39AM +0800, M.H. Tsai wrote:
> 2016-02-08 16:56 GMT+08:00 Joe Thornber <thorn...@redhat.com>:
> > On Fri, Feb 05, 2016 at 07:44:46PM +0800, M.H. Tsai wrote:
> >> I also wrote some extensions to thin-provisioning-tools (not yet
> >> published; the code still needs some refinement...), maybe it could
> >> help.
> >
> > I'd definitely like to see what you changed please.
> >
> > - Joe
> 
> I wrote some tools to do "semi-auto" repair, called thin_ll_dump and
> thin_ll_restore (low-level dump & restore), that can find orphan nodes
> and reconstruct the metadata using them. They can cope with cases
> where the top-level data mapping tree or some higher-level nodes are
> broken, complementing the repair feature of thin_repair.
> 
> Although users are required to have knowledge of dm-thin metadata
> before using these tools (you need to specify which orphan node to use), I
> think these tools are useful for system administrators. Most thin-pool
> corruption cases I have experienced (caused by power loss, broken disks, RAID
> corruption, etc.) cannot be handled by the current thin-provisioning-tools
> -- thin_repair is fully automatic, but it just skips broken nodes.
> However, those missing mappings could often be found in orphan nodes.
> 
> Also, I wrote another tool called thin_scan, to show the entire metadata
> layout and scan for broken nodes. (It is an enhanced version of
> thin_show_block in the branch low_level_examine_metadata -- I didn't notice
> that before... maybe the name thin_show_block sounds clearer?)
> 
> What do you think about these features? Are they worth merging
> upstream?

Yep, I definitely want these for upstream.  Send me what you've got,
whatever state it's in; I'll happily spend a couple of weeks tidying
this.
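(The automatic path being contrasted above is thin_check / thin_repair; a
hedged sketch with placeholder device names, run against inactive pool
metadata:)

    # sanity-check the (inactive) pool metadata
    thin_check /dev/vg/pool_tmeta_copy

    # rebuild what can be recovered into a spare metadata device
    thin_repair -i /dev/vg/pool_tmeta_copy -o /dev/vg/pool_tmeta_new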

- Joe
