Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-09 Thread Zygo Blaxell
On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:
> 
> 
> On 03/08/2017 09:12 PM, Zygo Blaxell wrote:
> >This is a story about 4 distinct (and very old) btrfs bugs.
> >
> 
> Really great write up.
> 
> [ ... ]
> 
> >
> >diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >index 25ac2cf..4d41a31 100644
> >--- a/fs/btrfs/inode.c
> >+++ b/fs/btrfs/inode.c
> >@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct 
> >btrfs_path *path,
> > max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> > ret = btrfs_decompress(compress_type, tmp, page,
> >extent_offset, inline_size, max_size);
> >+WARN_ON(max_size + pg_offset > PAGE_SIZE);
> 
> Can you please drop this WARN_ON and make the math reflect any possible
> pg_offset?  I do agree it shouldn't be happening, but its easy to correct
> for and the WARN is likely to get lost.

I'm not sure how to do that.  It looks like I'd have to pass pg_offset
through btrfs_decompress to the decompress functions?

ret = btrfs_decompress(compress_type, tmp, page,
extent_offset, inline_size, max_size, pg_offset);

and in the compression functions get pg_offset from the argument list
instead of hardcoding zero.

But how does pg_offset become non-zero for an inline extent?  A micro-hole
before the first byte?  If the offset was >= 4096, the data wouldn't
be in the first block so there would never be an inline extent in the
first place.

> >+if (max_size + pg_offset < PAGE_SIZE) {
> >+char *map = kmap(page);
> >+memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - 
> >pg_offset);
> >+kunmap(page);
> >+}
> 
> Both lzo and zlib have a memset to cover the gap between what they actually
> decompress and the max_size that we pass here.  That's important because
> ram_bytes may not be 100% accurate.
> 
> Can you also please toss in a comment about how the decompression code is
> responsible for the memset up to max_bytes?
> 
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: Digital signature


Re: [PATCH v2] btrfs: fix a bogus warning when converting only data or metadata

2017-03-09 Thread Adam Borowski
On Thu, Mar 09, 2017 at 02:43:27PM +0100, David Sterba wrote:
> The patch does not apply to current master, there's
> 
>   3861 if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
>   3862 fs_info->num_tolerated_disk_barrier_failures = min(
>   3863 
> btrfs_calc_num_tolerated_disk_barrier_failures(fs_info),
>   3864 btrfs_get_num_tolerated_disk_barrier_failures(
>   3865 bctl->sys.target));
>   3866 }
>   3867
> 
> followed by the line below, I'll merge that manually but slightly wonder
> which branch you have used as a base. No big deal though, I'll add the
> patch to the queue. Thanks.
>
> >   ret = insert_balance_item(fs_info, bctl);

I had Qu's chunk counting patchset in my tree (and some irrelevant stuff,
like tty patches).  But this means there'll be a minor conflict between the
two once you get around to merging Qu's stuff.

It's obvious how they interact; the bogus warning happens with and without
chunk counting.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] convert: rework rollback

2017-03-09 Thread David Sterba
On Fri, Feb 24, 2017 at 01:52:23PM +0800, Qu Wenruo wrote:
> The patchset can be fetched from my github, based on v4.9.1:
> https://github.com/adam900710/btrfs-progs/tree/rollback_rework_v2

I have manually merged the patchset and made some minor change along the
way. The convert tests now pass, I'll let the full set finish.

> Rework rollback to provide a easier to understand new mechnisim, which can
> handle both old convert behavior and new convert behavior.

Yeah, overall the new code looks better, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: add file item tracepoint

2017-03-09 Thread Liu Bo
On Thu, Mar 09, 2017 at 01:52:22PM +0100, David Sterba wrote:
> On Tue, Mar 07, 2017 at 08:49:42AM -0800, Liu Bo wrote:
> > On Tue, Mar 07, 2017 at 04:48:24PM +0100, David Sterba wrote:
> > > On Fri, Mar 03, 2017 at 06:41:27PM -0800, Liu Bo wrote:
> > > > +   TP_printk_btrfs(
> > > > +   "root %llu(%s) ino %llu sz 0x%llx disk_isz 0x%llx "
> > > > +   "file extent range [0x%llx 0x%llx] "
> > > > +   "(num_bytes 0x%llx ram_bytes 0x%llx disk_bytenr 0x%llx "
> > > > +   "disk_num_bytes 0x%llx extent_offset 0x%llx type (%s) "
> > > > +   "compression %u",
> > > 
> > > > +   TP_printk_btrfs(
> > > > +   "root %llu(%s) ino %llu sz 0x%llx disk_isz 0x%llx "
> > > > +   "file extent range [0x%llx 0x%llx] "
> > > > +   "extent_type (%s) compression %u",
> > > 
> > > Please update the message formats according to
> > > https://btrfs.wiki.kernel.org/index.php/Development_notes#Tracepoints
> > 
> > I feel like for size and offset, if using '%llu' rather than '%llx', it
> > is not easy to tell whether the size/offset is aligned to 4K or not.
> 
> I think we'll have to do some conversion either way, and I don't see
> %llx used anywhere in our thacepoints.
> 
> If the alignment is useful for certain values, we can extend the format
> to signify that, for example: "start=4097!".

OK, I'll update this patch to comply with the wiki so you can take it,
and then make a new patch to convert %llu to %llx to see if any
objections.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

2017-03-09 Thread Liu Bo
On Wed, Mar 08, 2017 at 10:25:51AM +0800, Qu Wenruo wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
> 
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: 
> fs/btrfs//extent-tree.c, line: 5858
> 
> Currently, due to another bug blocking ordered extents, the bug is only
> reproducible under certain block group layout and using error injection.
> 
> a) Create one data block group with one 4K extent in it.
>To avoid the bug that hangs btrfs due to ordered extent which never
>finishes
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
> 
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
> 
> (The ascii chart shows a more generic case of this bug other than the
> bug mentioned above)
> 
> |<-- delalloc range --->|
> | OE 1 | OE 2 | ... | OE n |
> |<--- cleanup range --->|
> |<---  --->|
>  \/
>  btrfs_finish_ordered_io() range
> 
> So error handler, which calls extent_clear_unlock_delalloc() with
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> will both cover OE n, and free its metadata, causing metadata under flow.
> 
> [Fix]
> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> call error handler after increasing the iteration offset, so that
> cleanup range won't cover any created ordered extent.
> 
> |<-- delalloc range --->|
> | OE 1 | OE 2 | ... | OE n |
> |<---  --->|<-- cleanup range ->|
>  \/
>  btrfs_finish_ordered_io() range

Reviewed-by: Liu Bo 

Thanks,

-liubo
> 
> Signed-off-by: Qu Wenruo 
> ---
> v6:
>   New, split from v5 patch, as this is a separate bug.
> v7:
>   Fix a wrong operator pointed out by Liu Bo.
>   Test case will follow soon.
> ---
>  fs/btrfs/inode.c | 51 +++
>  1 file changed, 39 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c40060cc481f..fe588bf30d02 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>   BTRFS_DATA_RELOC_TREE_OBJECTID) {
>   ret = btrfs_reloc_clone_csums(inode, start,
> cur_alloc_size);
> + /*
> +  * Only drop cache here, and process as normal.
> +  *
> +  * We must not allow extent_clear_unlock_delalloc()
> +  * at out_unlock label to free meta of this ordered
> +  * extent, as its meta should be freed by
> +  * btrfs_finish_ordered_io().
> +  *
> +  * So we must continue until @start is increased to
> +  * skip current ordered extent.
> +  */
>   if (ret)
> - goto out_drop_extent_cache;
> + btrfs_drop_extent_cache(BTRFS_I(inode), start,
> + start + ram_size - 1, 0);
>   }
>  
>   btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>  
> - if (disk_num_bytes < cur_alloc_size)
> - break;
> -
>   /* we're not doing compressed IO, don't unlock the first
>* page (which the caller expects to stay locked), don't
>* clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode 
> *inode,
>delalloc_end, locked_page,
>EXTENT_LOCKED | EXTENT_DELALLOC,
>op);
> - disk_num_bytes -= cur_alloc_size;
> + if (disk_num_bytes < cur_alloc_size)
> + disk_num_bytes = 0;
> + else
> + disk_num_bytes -= cur_alloc_size;
>   num_bytes -= cur_alloc_size;
>   alloc_hint = ins.objectid + ins.offset;
>   start += cur_alloc_size;
> +
> + /*
> +  * btrfs_reloc_clone_csums() error, since start is increased
> +  * extent_clear_unlock_delalloc() at out_unlock label won't
> +  * free metadata of current ordered extent, we're OK to exit.
> +  */
> + if (ret)
> + 

Re: [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang

2017-03-09 Thread Liu Bo
On Wed, Mar 08, 2017 at 10:25:52AM +0800, Qu Wenruo wrote:
> [BUG]
> If run_delalloc_range() returns error and there is already some ordered
> extents created, btrfs will be hanged with the following backtrace:
> 
> Call Trace:
>  __schedule+0x2d4/0xae0
>  schedule+0x3d/0x90
>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>  ? wake_atomic_t_function+0x60/0x60
>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>  process_one_work+0x2af/0x720
>  ? process_one_work+0x22b/0x720
>  worker_thread+0x4b/0x4f0
>  kthread+0x10f/0x150
>  ? process_one_work+0x720/0x720
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x2e/0x40
> 
> [CAUSE]
> 
> |<-- delalloc range --->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|   |<-- cleanup range ->|
>  ||
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
> 
> The problem is caused by error handler of run_delalloc_range(), which
> doesn't handle any created ordered extents, leaving them waiting on
> btrfs_finish_ordered_io() to finish.
> 
> However after run_delalloc_range() returns error, __extent_writepage()
> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
> except the first page, and btrfs_finish_ordered_io() won't be triggered
> for created ordered extents either.
> 
> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
> will also hang.
> 
> [FIX]
> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
> ordered extents and finish them manually.
> 
> The function is based on existing
> btrfs_endio_direct_write_update_ordered() function, and modify it to
> act just like btrfs_writepage_endio_hook() but handles specified range
> other than one page.
> 
> After fix, delalloc error will be handled like:
> 
> |<-- delalloc range --->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|<  --->|<-- old error handler ->|
>  ||  ||
>  ||  \_=> Cleaned up by cleanup_ordered_extents()
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()

Reviewed-by: Liu Bo 

Thanks,

-liubo
> 
> Signed-off-by: Qu Wenruo 
> Signed-off-by: Filipe Manana 
> ---
> v2:
>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>   outstanding extents, which is already done by
>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
> v3:
>   Skip first page to avoid underflow ordered->bytes_left.
>   Fix range passed in cow_file_range() which doesn't cover the whole
>   extent.
>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>   metadata release.
> v4:
>   Don't use extra bit to skip metadata freeing for ordered extent,
>   but only handle btrfs_reloc_clone_csums() error just before processing
>   next extent.
>   This makes error handle much easier for run_delalloc_nocow().
> v5:
>   Variant gramma and comment fixes suggested by Filipe Manana
>   Enhanced commit message to focus on the generic error handler bug,
>   pointed out by Filipe Manana
>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>   out by Filipe Manana.
>   Enhanced commit message with ascii art to explain the bug more easily.
>   Fix a bug which can leads to corrupted extent allocation, exposed by
>   Filipe Manana.
> v6:
>   Split the metadata underflow fix to another patch.
>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>   from Filipe Manana.
> v7:
>   Add back const prefix
> ---
>  fs/btrfs/inode.c | 63 
> ++--
>  1 file changed, 48 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index fe588bf30d02..d3bc68bbe0e7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode 
> *inode, u64 start, u64 len,
>  u64 ram_bytes, int compress_type,
>  int type);
>  
> +static void __endio_write_update_ordered(struct inode *inode,
> +  const u64 offset, const u64 bytes,
> +  const bool uptodate);
> +
> +/*
> + * Cleanup all submitted ordered extents in specified range to handle errors
> + * from the fill_dellaloc() callback.
> + *
> + * NOTE: caller must ensure that when an error happens, it can not call
> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved 
> metadata
> + * to be released, which we want to happen only when finishing the ordered
> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> + * fill_delalloc() 

Re: assertion failed: last_size == new_size, file: fs/btrfs/inode.c

2017-03-09 Thread Liu Bo
On Thu, Mar 09, 2017 at 03:24:21PM +0100, David Sterba wrote:
> On Thu, Mar 02, 2017 at 06:04:33PM -0800, Liu Bo wrote:
> > On Thu, Mar 02, 2017 at 07:58:01AM -0800, Liu Bo wrote:
> > > On Wed, Mar 01, 2017 at 03:03:19PM -0500, Dave Jones wrote:
> > > > On Tue, Feb 28, 2017 at 05:12:01PM -0800, Liu Bo wrote:
> > > >  > On Mon, Feb 27, 2017 at 11:23:42AM -0500, Dave Jones wrote:
> > > >  > > On Mon, Feb 27, 2017 at 07:53:48AM -0800, Liu Bo wrote:
> > > >  > >  > On Sun, Feb 26, 2017 at 07:18:42PM -0500, Dave Jones wrote:
> > > >  > >  > > Hitting this fairly frequently.. I'm not sure if this is the 
> > > > same bug I've
> > > >  > >  > > been hitting occasionally since 4.9. The assertion looks new 
> > > > to me at least.
> > > >  > >  > >
> > > >  > >  > 
> > > >  > >  > It was recently introduced by my commit and used to catch data 
> > > > loss at truncate.
> > > >  > >  > 
> > > >  > >  > Were you running the test with a mkfs.btrfs -O NO_HOLES?
> > > >  > >  > (We just queued a fix for the NO_HOLES case in btrfs-next.)
> > > >  > > 
> > > >  > > No, a fs created with default mkfs.btrfs options.
> > > >  > 
> > > >  > I have this patch[1] to fix a bug which results in file hole extent, 
> > > > and this
> > > >  > bug could lead us to hit the assertion.
> > > >  > 
> > > >  > Would you try to run the test w/ it, please?
> > > >  > 
> > > >  > [1]: https://patchwork.kernel.org/patch/9597281/
> > > > 
> > > > Made no difference. Still see the same trace & assertion.
> > > 
> > > Some updates here, I've got it reproduced, somehow a corner case ends up
> > > with a inline file extent following by some pre-alloc extents, along the
> > > way, isize also got updated unexpectedly.  Will try to narrow it down.
> > >
> > 
> > I realized that btrfs now could tolerate files that mix inline extents with
> > regular extents,
> 
> Where did that change? I thought that inline extent can represent only
> the entire file, so any mixing of extents does not make sense to me.
> Do you refer to some intermediate state where the file is being
> converted from inline to non-inline, thus we could see both types of
> extents at some point?
>

We could get that by doing the following step,

xfs_io -f -c "pwrite 0 8" foo
xfs_io -f -c "falloc 4 8188" foo

offset 4 is rounded down to offset 0 and before allocating blocks we wait for
any dirty stuff starting at offset 0, since the isize is not yet updated, the
inline extent is created as is again.  Now we have an inline extent from 0 to 8
and a prealloc extent from 4096 to 8192, AND its isize is 8192.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Retry after commit on getting EDQUOT

2017-03-09 Thread David Sterba
On Sun, Feb 19, 2017 at 07:30:59AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> We are facing the same problem with EDQUOT which was experienced with
> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
> here is a fix. Let me know if it is too big a hammer.

On first look it does seem too heavy. Do you have numbers what's the
slowdown? If it's bearable we can stick with it and implement the
ticketed approach.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/17] fs, btrfs refcount conversions

2017-03-09 Thread David Sterba
On Fri, Mar 03, 2017 at 10:55:09AM +0200, Elena Reshetova wrote:
> Now when new refcount_t type and API are finally merged
> (see include/linux/refcount.h), the following
> patches convert various refcounters in the btrfs filesystem from atomic_t
> to refcount_t. By doing this we prevent intentional or accidental
> underflows or overflows that can led to use-after-free vulnerabilities.
> 
> The below patches are fully independent and can be cherry-picked separately.
> Since we convert all kernel subsystems in the same fashion, resulting
> in about 300 patches, we have to group them for sending at least in some
> fashion to be manageable. Please excuse the long cc list.

Thanks, the patchset looks good to me, I'm going to add it to the 4.12 queue.

> These patches have been tested with xfstests by running btrfs-related tests.
> btrfs debug was enabled, warns on refcount errors, too. No output related to
> refcount errors produced. However, the following errors were during the run:
>  * tests btrfs/078, btrfs/114, btrfs/115, no errors anywhere in dmesg, but
>  process hangs. They all seem to be around qgroup, sometimes error visible
>  such as qgroup scan failed -4 before it blocks, but not always.
>  * test btrfs/104 dmesg has additional error output:
>  BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0,
>  to free: 4096
>  I tried looking at the code on what causes the failure, but could not figure
>  it out. It doesn't seem to be related to any refcount changes at least IMO.
> 
> The above test failures are hard for me to understand and interpreted, but
> they don't seem to relate to refcount conversions.

I don't think they're related to the refcount updates so we'll address
them.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-09 Thread Chris Mason



On 03/08/2017 09:12 PM, Zygo Blaxell wrote:

This is a story about 4 distinct (and very old) btrfs bugs.



Really great write up.

[ ... ]



diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..4d41a31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+   WARN_ON(max_size + pg_offset > PAGE_SIZE);


Can you please drop this WARN_ON and make the math reflect any possible 
pg_offset?  I do agree it shouldn't be happening, but its easy to 
correct for and the WARN is likely to get lost.



+   if (max_size + pg_offset < PAGE_SIZE) {
+   char *map = kmap(page);
+   memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - 
pg_offset);
+   kunmap(page);
+   }


Both lzo and zlib have a memset to cover the gap between what they 
actually decompress and the max_size that we pass here.  That's 
important because ram_bytes may not be 100% accurate.


Can you also please toss in a comment about how the decompression code 
is responsible for the memset up to max_bytes?


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/17] fs, btrfs refcount conversions

2017-03-09 Thread David Sterba
On Tue, Mar 07, 2017 at 03:49:52PM +0800, Qu Wenruo wrote:
> >>> If the patches pass all tests on your side, could you please take them in 
> >>> and
> >> propagate further?
> >>> I will continue with other kernel subsystems.
> >>
> >> The patchset itself looks like a common cleanup, while I did encounter
> >> several cases (almost all scrub tests) causing kernel warning due to
> >> underflow.
> >
> > Oh, could you please send me the warning outputs? I can hopefully analyze 
> > and fix them.
> 
> Attached. Which is the generated by running btrfs/070 test case.
> And I canceled the case almost instantly, so output is not much, but 
> still contains enough info.
> 
> Both refcount_inc() and refcount_sub_and_test() are causing warning.
> 
> So now I'm not sure which is the cause, btrfs or bad use of refcount?

We we do atomic_inc to get the first reference after initialization in
scrub_pages, instead of atomic_set (or an equivalent):

2266 spage = kzalloc(sizeof(*spage), GFP_KERNEL);
2267 if (!spage) {
...
2274 }
...
2276 scrub_page_get(spage);

so the references are 0 and refcount_inc will catch that, the fix is simple.

The refcount_sub_and_test reports seem to catch a bug in refcounting, I'm
analyzing it right now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: assertion failed: last_size == new_size, file: fs/btrfs/inode.c

2017-03-09 Thread David Sterba
On Thu, Mar 02, 2017 at 06:04:33PM -0800, Liu Bo wrote:
> On Thu, Mar 02, 2017 at 07:58:01AM -0800, Liu Bo wrote:
> > On Wed, Mar 01, 2017 at 03:03:19PM -0500, Dave Jones wrote:
> > > On Tue, Feb 28, 2017 at 05:12:01PM -0800, Liu Bo wrote:
> > >  > On Mon, Feb 27, 2017 at 11:23:42AM -0500, Dave Jones wrote:
> > >  > > On Mon, Feb 27, 2017 at 07:53:48AM -0800, Liu Bo wrote:
> > >  > >  > On Sun, Feb 26, 2017 at 07:18:42PM -0500, Dave Jones wrote:
> > >  > >  > > Hitting this fairly frequently.. I'm not sure if this is the 
> > > same bug I've
> > >  > >  > > been hitting occasionally since 4.9. The assertion looks new to 
> > > me at least.
> > >  > >  > >
> > >  > >  > 
> > >  > >  > It was recently introduced by my commit and used to catch data 
> > > loss at truncate.
> > >  > >  > 
> > >  > >  > Were you running the test with a mkfs.btrfs -O NO_HOLES?
> > >  > >  > (We just queued a fix for the NO_HOLES case in btrfs-next.)
> > >  > > 
> > >  > > No, a fs created with default mkfs.btrfs options.
> > >  > 
> > >  > I have this patch[1] to fix a bug which results in file hole extent, 
> > > and this
> > >  > bug could lead us to hit the assertion.
> > >  > 
> > >  > Would you try to run the test w/ it, please?
> > >  > 
> > >  > [1]: https://patchwork.kernel.org/patch/9597281/
> > > 
> > > Made no difference. Still see the same trace & assertion.
> > 
> > Some updates here, I've got it reproduced, somehow a corner case ends up
> > with a inline file extent following by some pre-alloc extents, along the
> > way, isize also got updated unexpectedly.  Will try to narrow it down.
> >
> 
> I realized that btrfs now could tolerate files that mix inline extents with
> regular extents,

Where did that change? I thought that inline extent can represent only
the entire file, so any mixing of extents does not make sense to me.
Do you refer to some intermediate state where the file is being
converted from inline to non-inline, thus we could see both types of
extents at some point?

> so we don't need this ASSERT() anymore, will send a patch to
> remove it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PULL] Btrfs updates for 4.11-rc2

2017-03-09 Thread David Sterba
Hi,

there's a regression fix for the assertion failure reported by Dave Jones, the
rest of patches are minor updates. Please pull, thanks.

The following changes since commit e9f467d028cd7d8bee2a4d6c4fb806caf8cd580b:

  Merge branch 'for-chris-4.11-part2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11 
(2017-02-28 14:35:09 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git 
for-chris-4.11-rc2

for you to fetch changes up to 94b5a2924fc0ec2bb2347ca31156be8fabe907c4:

  Btrfs: consistent usage of types in balance_args (2017-03-09 14:23:00 +0100)


Dmitry V. Levin (2):
  btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h
  MAINTAINERS: add btrfs file entries for include directories

Goldwyn Rodrigues (1):
  btrfs: No need to check !(flags & MS_RDONLY) twice

Hans van Kranenburg (1):
  Btrfs: consistent usage of types in balance_args

Liu Bo (2):
  Btrfs: update comments in cache_save_setup
  Btrfs: fix regression in lock_delalloc_pages

 MAINTAINERS|  2 ++
 fs/btrfs/extent-tree.c |  3 ++-
 fs/btrfs/extent_io.c   |  3 ++-
 fs/btrfs/super.c   |  3 +--
 include/uapi/linux/btrfs.h | 37 +
 5 files changed, 12 insertions(+), 36 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-09 Thread Austin S. Hemmelgarn

On 2017-03-09 04:49, Peter Grandi wrote:

Consider the common case of a 3-member volume with a 'raid1'
target profile: if the sysadm thinks that a drive should be
replaced, the goal is to take it out *without* converting every
chunk to 'single', because with 2-out-of-3 devices half of the
chunks will still be fully mirrored.



Also, removing the device to be replaced should really not be
the same thing as balancing the chunks, if there is space, to be
'raid1' across remaining drives, because that's a completely
different operation.



There is a command specifically for replacing devices.  It
operates very differently from the add+delete or delete+add
sequences. [ ... ]


Perhaps it was not clear that I was talking about removing a
device, as distinct from replacing it, and that I used "removed"
instead of "deleted" deliberately, to avoid the confusion with
the 'delete' command.

Ah, sorry I misunderstood what you were saying.


In the everyday practice of system administration it often
happens that a device should be removed first, and replaced
later, for example when it is suspected to be faulty, or is
intermittently faulty. The replacement can be done with
'replace' or 'add+delete' or 'delete+add', but that's a
different matter.

Perhaps I should have not have used the generic verb "remove",
but written "make unavailable".

This brings about again the topic of some "confusion" in the
design of the Btrfs multidevice handling logic, where at least
initially one could only expand the storage space of a
multidevice by 'add' of a new device or shrink the storage space
by 'delete' of an existing one, but I think it was not conceived
at Btrfs design time of storage space being nominally constant
but for a device (and the chunks on it) having a state of
"available" ("present", "online", "enabled") or "unavailable"
("absent", "offline", "disabled"), either because of events or
because of system administrator action.

The 'missing' pseudo-device designator was added later, and
'replace' also later to avoid having to first expand then shrink
(or viceversa) the storage space and the related copying.

My impression is that it would be less "confused" if the Btrfs
device handling logic were changed to allow for the the state of
"member of the multidevice set but not actually available" and
the related consequent state for chunks that ought to be on it;
that probably would be essential to fixing the confusing current
aspects of recovery in a multidevice set. That would be very
useful even if it may require a change in the on-disk format to
distinguish the distinct states of membership and availability
for devices and mark chunks as available or not (chunks of course
being only possible on member devices).

That is, it would also be nice to have the opposite state of "not
member of the multidevice set but actually available to it", that
is a spare device, and related logic.
OK, so expanding on this a bit, there are currently three functional 
device states in BTRFS right now (note that the terms I use here aren't 
official, they're just what I use to describe them):
1. Active/Online.  This is the normal state for a device, you can both 
read from it and write to it.
2. Inactive/Replacing/Deleting.  This is the state a device is in when 
it's either being deleted or replaced.  Inactive devices don't count 
towards total volume size, and can't be written to, but can be read from 
if they weren't missing prior to becoming inactive.
3. Missing/Offilne.  This is pretty self explanatory.  A device in this 
state can't be read from or written to, but it does count towards volume 
size.


Currently, the only transitions available to a sysadmin through BTRFS 
itself are temporary transitions from Active to Inactive (replace and 
delete).


In an ideal situation, there would be two other states:
4. Local hot-spare/Nearline.  Won't be read from and doesn't count 
towards total volume size, but may be written to (depending on how the 
FS is configured), and will be automatically used to replace a failed 
device in the filesystem it's associated with.
5. Global hot-spare.  Similar to local hot-spare, but can be used for 
any filesystem on the system, and won't be touched until it's needed.


The following manually initiated transitions would be possible for 
regular operation:

1. Active -> Inactive (persistently)
2. Inactive -> Active
3. Active -> Local hot-spare
4. Inactive -> Local hot-spare
5. Local hot-spare -> Active
6. Local hot-spare -> Inactive
7. Global hot-spare -> Active
8. Global hot-spare -> Inactive
9. Local hot-spare -> Global hot-spare
10. Global hot-spare -> Local hot-spare

And the following automatic transitions would be possible:
1. Local hot-spare -> Active
2. Global hot-spare -> Active
3.  -> Missing
4. Missing -> 

And there would be the option of manually triggering the automatic 
transitions for debugging purposes.


Note: simply setting '/sys/block/$DEV/device/delete' is not a
good option, because that 

Re: [PATCH v2] btrfs: fix a bogus warning when converting only data or metadata

2017-03-09 Thread David Sterba
On Tue, Mar 07, 2017 at 11:34:44PM +0100, Adam Borowski wrote:
> On Tue, Mar 07, 2017 at 12:34:07PM -0800, Liu Bo wrote:
> > This looks good, but this also brings another side effect, @bctl would
> > also be kept in balance_item which will be used to resume balance in
> > case of crash, so it may see a different bctl->meta.target.
> >
> > So would you please use local varibles for meta.target and data.target?
> 
> Okay.
> 
> I'm not sure why storing a bogus value that came from userspace and was
> uninitialized there (0 in normal use) would be better, but here we go:
> v2 doesn't overwrite what we got anymore.
> 
> Unrelated: I wonder if the profiles in the warning message shouldn't be
> printk'ed as words (akin to ebce0e01), but we don't have a function to do
> that, have we?

Printing human readable strings would be better, you can copy the
missing functions from progs as Lui pointed out.

> @@ -3750,6 +3750,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
> struct btrfs_ioctl_balance_args *bargs)
>  {
>   struct btrfs_fs_info *fs_info = bctl->fs_info;
> + __u64 meta_target, data_target;

Changed to u64, the __u64 is for kernel/userspace interfaces.

>   u64 allowed;
>   int mixed = 0;
>   int ret;
> @@ -3846,11 +3847,16 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
>   }
>   } while (read_seqretry(_info->profiles_lock, seq));
>  
> - if (btrfs_get_num_tolerated_disk_barrier_failures(bctl->meta.target) <
> - 
> btrfs_get_num_tolerated_disk_barrier_failures(bctl->data.target)) {
> + /* if we're not converting, the target field is uninitialized */
> + meta_target = (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) ?
> + bctl->meta.target : fs_info->avail_metadata_alloc_bits;
> + data_target = (bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT) ?
> + bctl->data.target : fs_info->avail_data_alloc_bits;
> + if (btrfs_get_num_tolerated_disk_barrier_failures(meta_target) <
> + btrfs_get_num_tolerated_disk_barrier_failures(data_target)) {
>   btrfs_warn(fs_info,
>  "metadata profile 0x%llx has lower redundancy than 
> data profile 0x%llx",
> -bctl->meta.target, bctl->data.target);
> +meta_target, data_target);
>   }

The patch does not apply to current master, there's

  3861 if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
  3862 fs_info->num_tolerated_disk_barrier_failures = min(
  3863 
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info),
  3864 btrfs_get_num_tolerated_disk_barrier_failures(
  3865 bctl->sys.target));
  3866 }
  3867

followed by the line below, I'll merge that manually but slightly wonder
which branch you have used as a base. No big deal though, I'll add the
patch to the queue. Thanks.

>   ret = insert_balance_item(fs_info, bctl);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: add file item tracepoint

2017-03-09 Thread David Sterba
On Tue, Mar 07, 2017 at 08:49:42AM -0800, Liu Bo wrote:
> On Tue, Mar 07, 2017 at 04:48:24PM +0100, David Sterba wrote:
> > On Fri, Mar 03, 2017 at 06:41:27PM -0800, Liu Bo wrote:
> > > + TP_printk_btrfs(
> > > + "root %llu(%s) ino %llu sz 0x%llx disk_isz 0x%llx "
> > > + "file extent range [0x%llx 0x%llx] "
> > > + "(num_bytes 0x%llx ram_bytes 0x%llx disk_bytenr 0x%llx "
> > > + "disk_num_bytes 0x%llx extent_offset 0x%llx type (%s) "
> > > + "compression %u",
> > 
> > > + TP_printk_btrfs(
> > > + "root %llu(%s) ino %llu sz 0x%llx disk_isz 0x%llx "
> > > + "file extent range [0x%llx 0x%llx] "
> > > + "extent_type (%s) compression %u",
> > 
> > Please update the message formats according to
> > https://btrfs.wiki.kernel.org/index.php/Development_notes#Tracepoints
> 
> I feel like for size and offset, if using '%llu' rather than '%llx', it
> is not easy to tell whether the size/offset is aligned to 4K or not.

I think we'll have to do some conversion either way, and I don't see
%llx used anywhere in our thacepoints.

If the alignment is useful for certain values, we can extend the format
to signify that, for example: "start=4097!".
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Btrfs: remove unused delalloc_end parameter

2017-03-09 Thread David Sterba
On Tue, Mar 07, 2017 at 06:35:02PM +, Filipe Manana wrote:
> On Tue, Mar 7, 2017 at 4:05 PM, David Sterba  wrote:
> > On Mon, Mar 06, 2017 at 11:32:47PM +, fdman...@kernel.org wrote:
> >> From: Filipe Manana 
> >>
> >> The delalloc_end parameter for extent_clear_unlock_delalloc() is never
> >> used, and only making the code for all callers longer and more complex.
> >> Just remove it.
> >
> > It's been added on purpose
> >
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dda3245eca18c73413a854a834c41fc770feb0dd
> >
> > "btrfs: expand cow_file_range() to support in-band dedup and
> > subpage-blocksize"
> >
> > so we can merge both mentioned patchsets for testing.
> 
> I definitely missed that. To be honest I was finding it strange that
> none of the usual cleanup fanatics noticed it before.

I was disappointed for a minute that I couldn't remove it, you bet.

> Anyway I'm sending a v2 of the 2nd patch so that it works correctly on
> top of Qu's patchset that I reviewed recently. I've added Qu's patches
> to my 4.11 integration branch just so that I could work on top of
> them, feel free to pick stuff from there if convenient or ignore
> otherwise.

Let's keep it in your branch then.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: consistent usage of types in balance_args

2017-03-09 Thread David Sterba
On Wed, Mar 08, 2017 at 06:58:43PM +0100, Hans van Kranenburg wrote:
> The btrfs_balance_args are only used for the balance ioctl, so use __u
> instead of __le here for consistency. The __le usage was introduced in
> bc3094673f22d and dee32d0ac3719 and was probably a result of
> copy/pasting when the code was written.
> 
> The usage of __le did not break anything, but it's unnecessary. Also,
> this change makes the code less confusing for the careful reader.
> 
> Signed-off-by: Hans van Kranenburg 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API

2017-03-09 Thread David Sterba
On Tue, Mar 07, 2017 at 04:09:56PM +0100, Michal Hocko wrote:
> On Mon 06-03-17 13:22:14, Andrew Morton wrote:
> > On Mon,  6 Mar 2017 14:14:05 +0100 Michal Hocko  wrote:
> [...]
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -210,8 +210,16 @@ struct vm_area_struct;
> > >   *
> > >   * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
> > >   *   that do not require the starting of any physical IO.
> > > + *   Please try to avoid using this flag directly and instead use
> > > + *   memalloc_noio_{save,restore} to mark the whole scope which cannot
> > > + *   perform any IO with a short explanation why. All allocation requests
> > > + *   will inherit GFP_NOIO implicitly.
> > >   *
> > >   * GFP_NOFS will use direct reclaim but will not use any filesystem 
> > > interfaces.
> > > + *   Please try to avoid using this flag directly and instead use
> > > + *   memalloc_nofs_{save,restore} to mark the whole scope which 
> > > cannot/shouldn't
> > > + *   recurse into the FS layer with a short explanation why. All 
> > > allocation
> > > + *   requests will inherit GFP_NOFS implicitly.
> > 
> > I wonder if these are worth a checkpatch rule.
> 
> I am not really sure, to be honest. This may easilly end up people
> replacing
> 
> do_alloc(GFP_NOFS)
> 
> with
> 
> memalloc_nofs_save()
> do_alloc(GFP_KERNEL)
> memalloc_nofs_restore()
> 
> which doesn't make any sense of course. From my experience, people tend
> to do stupid things just to silent checkpatch warnings very often.
> Moreover I believe we need to do the transition to the new api first
> before we can push back on the explicit GFP_NOFS usage. Maybe then we
> can think about the a checkpatch warning.

I agree will all your objections against adding that to checkpatch, at
this point it's less harmful to use GFP_NOFS.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-09 Thread Peter Grandi
>> Consider the common case of a 3-member volume with a 'raid1'
>> target profile: if the sysadm thinks that a drive should be
>> replaced, the goal is to take it out *without* converting every
>> chunk to 'single', because with 2-out-of-3 devices half of the
>> chunks will still be fully mirrored.

>> Also, removing the device to be replaced should really not be
>> the same thing as balancing the chunks, if there is space, to be
>> 'raid1' across remaining drives, because that's a completely
>> different operation.

> There is a command specifically for replacing devices.  It
> operates very differently from the add+delete or delete+add
> sequences. [ ... ]

Perhaps it was not clear that I was talking about removing a
device, as distinct from replacing it, and that I used "removed"
instead of "deleted" deliberately, to avoid the confusion with
the 'delete' command.

In the everyday practice of system administration it often
happens that a device should be removed first, and replaced
later, for example when it is suspected to be faulty, or is
intermittently faulty. The replacement can be done with
'replace' or 'add+delete' or 'delete+add', but that's a
different matter.

Perhaps I should have not have used the generic verb "remove",
but written "make unavailable".

This brings about again the topic of some "confusion" in the
design of the Btrfs multidevice handling logic, where at least
initially one could only expand the storage space of a
multidevice by 'add' of a new device or shrink the storage space
by 'delete' of an existing one, but I think it was not conceived
at Btrfs design time of storage space being nominally constant
but for a device (and the chunks on it) having a state of
"available" ("present", "online", "enabled") or "unavailable"
("absent", "offline", "disabled"), either because of events or
because of system administrator action.

The 'missing' pseudo-device designator was added later, and
'replace' also later to avoid having to first expand then shrink
(or viceversa) the storage space and the related copying.

My impression is that it would be less "confused" if the Btrfs
device handling logic were changed to allow for the the state of
"member of the multidevice set but not actually available" and
the related consequent state for chunks that ought to be on it;
that probably would be essential to fixing the confusing current
aspects of recovery in a multidevice set. That would be very
useful even if it may require a change in the on-disk format to
distinguish the distinct states of membership and availability
for devices and mark chunks as available or not (chunks of course
being only possible on member devices).

That is, it would also be nice to have the opposite state of "not
member of the multidevice set but actually available to it", that
is a spare device, and related logic.

Note: simply setting '/sys/block/$DEV/device/delete' is not a
good option, because that makes the device unavailable not just
to Btrfs, but also to the whole systems. In the ordinary practice
of system administration it may well be useful to make a device
unavailable to Btrfs but still available to the system, for
example for testing, and anyhow they are logically distinct
states. That also means a member device might well be available
to the system, but marked as "not available" to Btrfs.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html