Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

2017-05-10 Thread Eric Biggers
On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> 
> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> > Theodore Ts'o  writes:
> > 
> > > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> > >> 1.) Privacy implications.  Say the filesystem is being shared between 
> > >> multiple
> > >> users, and one user unpacks foo.tar.gz into their home directory, 
> > >> which
> > >> they've set to mode 700 to hide from other users.  Because of this 
> > >> new
> > >> ioctl, all users will be able to see every (inode number, size in 
> > >> blocks)
> > >> pair that was added to the filesystem, as well as the exact layout 
> > >> of the
> > >> physical block allocations which might hint at how the files were 
> > >> created.
> > >> If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> > >> regard, its presence on the filesystem will be revealed to all 
> > >> users.  And
> > >> if any filesystems happen to prefer allocating blocks near the 
> > >> containing
> > >> directory, the directory the files are in would likely be revealed 
> > >> too.
> 
> Frankly, why are container users even allowed to make unrestricted ioctl
> calls?  I thought we had a bunch of security infrastructure to constrain
> what userspace can do to a system, so why don't ioctls fall under these
> same protections?  If your containers are really that adversarial, you
> ought to be blacklisting as much as you can.
> 

Personally I don't find the presence of sandboxing features to be a very good
excuse for introducing random insecure ioctls.  Not everyone has everything
perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
about the filesystem ioctls, too, since they can be executed on any regular
file, without having to open some device node in /dev.

(And this actually does happen; the SELinux policy in Android, for example,
still allows apps to call any ioctl on their data files, despite all the effort
that has gone into whitelisting other types of ioctls.  Which should be fixed,
of course, but it shows that this kind of mistake is very easy to make.)

> > > Unix/Linux has historically not been terribly concerned about trying
> > > to protect this kind of privacy between users.  So for example, in
> > > order to do this, you would have to call GETFSMAP continously to track
> > > this sort of thing.  Someone who wanted to do this could probably get
> > > this information (and much, much more) by continuously running "ps" to
> > > see what processes are running.
> > >
> > > (I will note. wryly, that in the bad old days, when dozens of users
> > > were sharing a one MIPS Vax/780, it was considered a *good* thing
> > > that social pressure could be applied when it was found that someone
> > > was running a CPU or memory hogger on a time sharing system.  The
> > > privacy right of someone running "xtrek" to be able to hide this from
> > > other users on the system was never considered important at all.  :-)
> 
> Not to mention someone running GETFSMAP in a loop will be pretty obvious
> both from the high kernel cpu usage and the huge number of metadata
> operations.

Well, only if that someone running GETFSMAP actually wants to watch things in
real-time (it's not necessary for all scenarios that have been mentioned), *and*
there is monitoring in place which actually detects it and can do something
about it.

Yes, PIDs have traditionally been global, but today we have PID namespaces, and
many other isolation features such as mount namespaces.  Nothing is perfect, of
course, and containers are a lot worse than VMs, but it seems weird to use that
as an excuse to knowingly make things worse...

> 
> > > Fortunately, the days of timesharing seem to well behind us.  For
> > > those people who think that containers are as secure as VM's (hah,
> > > hah, hah), it might be that best way to handle this is to have a mount
> > > option that requires root access to this functionality.  For those
> > > people who really care about this, they can disable access.
> 
> Or use separate filesystems for each container so that exploitable bugs
> that shut down the filesystem can't be used to kill the other
> containers.  You could use a torrent of metadata-heavy operations
> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> the other containers.
> 
> > What would be the reason for not putting this behind
> > capable(CAP_SYS_ADMIN)?
> > 
> > What possible legitimate function could this functionality serve to
> > users who don't own your filesystem?
> 
> As I've said before, it's to enable dedupe tools to decide, given a set
> of files with shareable blocks, roughly how many other times each of
> those shareable blocks are shared so that they can make better decisions
> about which file keeps its shareable blocks, and which file 

Re: Question on compression unit

2017-05-10 Thread Qu Wenruo



At 05/11/2017 04:11 AM, Xiaochu Liu wrote:

Hi there,

I'm trying to tune compression options for btrfs. Specifically, I want
to know the performance impact on the system under different
compression unit (block) sizes.


Compression unit size is fixed in btrfs.
It's sectorsize, determined at mkfs time, and only 4K (page size) is 
supported for x86 yet.




I'm aware of '--nodesize' parameter which sets the block size of
metadata tree. Does that also set the block size in an extent? (from
my understanding, file data are mostly stored in extent unless small
enough to be inline-d in metadata leaf node?)


nodesize only affects metadata, nothing to do with data size.



Also from btrfs's wikipedia page:

In compressed extents, individual blocks are not compressed
separately; rather, the compression stream spans the entire extent.

Is that still true?


Yes.

For example if there is one continuous range represents 0~1M data of one 
file, and all this data is dirty (not written to disk).


Then compress will happen when trying to writing them to disk.
And since the maximum uncompressed size for compressed extent is 128K 
(fixed), the 0~1M will be split into 8 extents (if compression ratio is 
acceptable).


And then each 128K extent will be compressed then compressed data will 
be written to disk. (compressed extent still meet sectorsize alignment).


So the wiki page is still right and we must read out the whole 
(compressed) extent to get its content.


And since both uncompressed data and compressed extent must meet 
sectorsize alignment, data smaller than or equal to sectorsize won't go 
through compression since it will just waste CPU time and no space saving.


Thanks,
Qu



Thanks,
Xiaochu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qgroup: direct writes returns -EDQUOT too soon

2017-05-10 Thread Qu Wenruo



At 05/11/2017 08:19 AM, Goldwyn Rodrigues wrote:


Here is a sample script to recreate the issue:
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /mnt
btrfs quota enable /mnt
btrfs sub create /mnt/tmp
btrfs qgroup limit 200M /mnt/tmp
btrfs quota rescan -w /mnt
cd /mnt/tmp
for i in {1..5}; do
sync
dd if=/dev/zero of=/mnt/tmp/file-$i oflag=direct
sync
done

btrfs qgroup show -pcref /mnt/tmp


Output:

Create subvolume '/mnt/tmp'
quota rescan started
dd: writing to '/mnt/tmp/file-1': Disk quota exceeded
11991+0 records in
11990+0 records out
6138880 bytes (6.1 MB, 5.9 MiB) copied, 2.40459 s, 2.6 MB/s
dd: writing to '/mnt/tmp/file-2': Disk quota exceeded
11807+0 records in
11806+0 records out
6044672 bytes (6.0 MB, 5.8 MiB) copied, 2.11256 s, 2.9 MB/s
dd: writing to '/mnt/tmp/file-3': Disk quota exceeded
11628+0 records in
11627+0 records out
5953024 bytes (6.0 MB, 5.7 MiB) copied, 2.53767 s, 2.3 MB/s
dd: writing to '/mnt/tmp/file-4': Disk quota exceeded
11080+0 records in
11079+0 records out
5672448 bytes (5.7 MB, 5.4 MiB) copied, 2.3697 s, 2.4 MB/s
dd: writing to '/mnt/tmp/file-5': Disk quota exceeded
11358+0 records in
11357+0 records out
5814784 bytes (5.8 MB, 5.5 MiB) copied, 2.10354 s, 2.8 MB/s

qgroupid rfer excl max_rfer max_excl parent  child
     --  -
0/25728.84MiB 28.84MiB200.00MiB none --- ---

The files created are only 5-6MB when the subvolume size is 200m. Each
of the attempts, including the first attempt, returns EDQUOT at around
5-6 MB.


IIRC that's the problem of metadata reservation.

We're always over-reserving metadata, not only for qgroup.
We reserve one full tree block even if we're only trying to insert one item.

So I'm afraid the problem is that the default block size for dd is too 
small (512) and direct makes it always try to reserve a new tree block 
for it.


It's already over 10K file extents, which will take up more than 160M if 
using 16K nodesize.

At least this can explain the problem.

Try to change dd blocksize to 1M or more should avoid the problem.

However the true solution is to find a method to calculate metadata 
reservation more correctly, which I didn't see any good method yet.


Thanks,
Qu







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


qgroup: direct writes returns -EDQUOT too soon

2017-05-10 Thread Goldwyn Rodrigues

Here is a sample script to recreate the issue:
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /mnt
btrfs quota enable /mnt
btrfs sub create /mnt/tmp
btrfs qgroup limit 200M /mnt/tmp
btrfs quota rescan -w /mnt
cd /mnt/tmp
for i in {1..5}; do
sync
dd if=/dev/zero of=/mnt/tmp/file-$i oflag=direct
sync
done

btrfs qgroup show -pcref /mnt/tmp


Output:

Create subvolume '/mnt/tmp'
quota rescan started
dd: writing to '/mnt/tmp/file-1': Disk quota exceeded
11991+0 records in
11990+0 records out
6138880 bytes (6.1 MB, 5.9 MiB) copied, 2.40459 s, 2.6 MB/s
dd: writing to '/mnt/tmp/file-2': Disk quota exceeded
11807+0 records in
11806+0 records out
6044672 bytes (6.0 MB, 5.8 MiB) copied, 2.11256 s, 2.9 MB/s
dd: writing to '/mnt/tmp/file-3': Disk quota exceeded
11628+0 records in
11627+0 records out
5953024 bytes (6.0 MB, 5.7 MiB) copied, 2.53767 s, 2.3 MB/s
dd: writing to '/mnt/tmp/file-4': Disk quota exceeded
11080+0 records in
11079+0 records out
5672448 bytes (5.7 MB, 5.4 MiB) copied, 2.3697 s, 2.4 MB/s
dd: writing to '/mnt/tmp/file-5': Disk quota exceeded
11358+0 records in
11357+0 records out
5814784 bytes (5.8 MB, 5.5 MiB) copied, 2.10354 s, 2.8 MB/s

qgroupid rfer excl max_rfer max_excl parent  child
     --  -
0/25728.84MiB 28.84MiB200.00MiB none --- ---

The files created are only 5-6MB when the subvolume size is 200m. Each
of the attempts, including the first attempt, returns EDQUOT at around
5-6 MB.


-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


errno=-28 No space left, with kernel backtrace (blocking bug)

2017-05-10 Thread alpha_one_x86
Hi, this bug is very blocking for me:

https://bugzilla.kernel.org/show_bug.cgi?id=195257

The server is backup server, I btrfs receive (with and without -p), and
of course btrfs subvolume delete
The volume is 70TB, then I use space_cache=v2

Cheers,


-- 
alpha_one_x86/BRULE Herman 
Main developer of Supercopier/Ultracopier/CatchChallenger, Esourcing and server 
management
IT, OS, technologies, research & development, security and business department

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

2017-05-10 Thread Darrick J. Wong
[cc btrfs, since afaict that's where most of the dedupe tool authors hang out]

On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> Theodore Ts'o  writes:
> 
> > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> >> 1.) Privacy implications.  Say the filesystem is being shared between 
> >> multiple
> >> users, and one user unpacks foo.tar.gz into their home directory, which
> >> they've set to mode 700 to hide from other users.  Because of this new
> >> ioctl, all users will be able to see every (inode number, size in 
> >> blocks)
> >> pair that was added to the filesystem, as well as the exact layout of 
> >> the
> >> physical block allocations which might hint at how the files were 
> >> created.
> >> If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> >> regard, its presence on the filesystem will be revealed to all users.  
> >> And
> >> if any filesystems happen to prefer allocating blocks near the 
> >> containing
> >> directory, the directory the files are in would likely be revealed too.

Frankly, why are container users even allowed to make unrestricted ioctl
calls?  I thought we had a bunch of security infrastructure to constrain
what userspace can do to a system, so why don't ioctls fall under these
same protections?  If your containers are really that adversarial, you
ought to be blacklisting as much as you can.

> > Unix/Linux has historically not been terribly concerned about trying
> > to protect this kind of privacy between users.  So for example, in
> > order to do this, you would have to call GETFSMAP continously to track
> > this sort of thing.  Someone who wanted to do this could probably get
> > this information (and much, much more) by continuously running "ps" to
> > see what processes are running.
> >
> > (I will note. wryly, that in the bad old days, when dozens of users
> > were sharing a one MIPS Vax/780, it was considered a *good* thing
> > that social pressure could be applied when it was found that someone
> > was running a CPU or memory hogger on a time sharing system.  The
> > privacy right of someone running "xtrek" to be able to hide this from
> > other users on the system was never considered important at all.  :-)

Not to mention someone running GETFSMAP in a loop will be pretty obvious
both from the high kernel cpu usage and the huge number of metadata
operations.

> > Fortunately, the days of timesharing seem to well behind us.  For
> > those people who think that containers are as secure as VM's (hah,
> > hah, hah), it might be that best way to handle this is to have a mount
> > option that requires root access to this functionality.  For those
> > people who really care about this, they can disable access.

Or use separate filesystems for each container so that exploitable bugs
that shut down the filesystem can't be used to kill the other
containers.  You could use a torrent of metadata-heavy operations
(fallocate a huge file, punch every block, truncate file, repeat) to DoS
the other containers.

> What would be the reason for not putting this behind
> capable(CAP_SYS_ADMIN)?
> 
> What possible legitimate function could this functionality serve to
> users who don't own your filesystem?

As I've said before, it's to enable dedupe tools to decide, given a set
of files with shareable blocks, roughly how many other times each of
those shareable blocks are shared so that they can make better decisions
about which file keeps its shareable blocks, and which file gets
remapped.  Dedupe is not a privileged operation, nor are any of the
tools.

> I have seen several people speak up how this is a concern I don't see
> anyone saying here is a legitimate use for a non-system administrator.

/I/ said that a few emails ago.

--D

> This doesn't seem like something where abuses of time-sharing systems
> can be observed.
> 
> Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Question on compression unit

2017-05-10 Thread Xiaochu Liu
Hi there,

I'm trying to tune compression options for btrfs. Specifically, I want
to know the performance impact on the system under different
compression unit (block) sizes.

I'm aware of '--nodesize' parameter which sets the block size of
metadata tree. Does that also set the block size in an extent? (from
my understanding, file data are mostly stored in extent unless small
enough to be inline-d in metadata leaf node?)

Also from btrfs's wikipedia page:

In compressed extents, individual blocks are not compressed
separately; rather, the compression stream spans the entire extent.

Is that still true?

Thanks,
Xiaochu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 5/6] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2017-05-10 Thread Goldwyn Rodrigues


On 05/09/2017 09:36 PM, Qu Wenruo wrote:
> Introduce a new parameter, struct extent_changeset for
> btrfs_qgroup_reserved_data() and its callers.
> 
> Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
> which range it reserved in current reserve, so it can free it at error
> path.
> 
> The reason we need to export it to callers is, at buffered write error
> path, without knowing what exactly which range we reserved in current
> allocation, we can free space which is not reserved by us.
> 
> This will lead to qgroup reserved space underflow.
> 
> Reviewed-by: Chandan Rajendra 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/ctree.h   |  6 --
>  fs/btrfs/extent-tree.c | 16 +++-
>  fs/btrfs/extent_io.h   | 34 ++
>  fs/btrfs/file.c| 12 +---
>  fs/btrfs/inode-map.c   |  4 +++-
>  fs/btrfs/inode.c   | 18 ++
>  fs/btrfs/ioctl.c   |  5 -
>  fs/btrfs/qgroup.c  | 41 +
>  fs/btrfs/qgroup.h  |  3 ++-
>  fs/btrfs/relocation.c  |  4 +++-
>  10 files changed, 113 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1e82516fe2d8..52a0147cd612 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2704,8 +2704,9 @@ enum btrfs_flush_state {
>   COMMIT_TRANS=   6,
>  };
>  
> -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> +int btrfs_check_data_free_space(struct inode *inode,
> + struct extent_changeset **reserved, u64 start, u64 len);
>  void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
>  void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
>   u64 len);
> @@ -2723,7 +2724,8 @@ void btrfs_subvolume_release_metadata(struct 
> btrfs_fs_info *fs_info,
> struct btrfs_block_rsv *rsv);
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 
> num_bytes);
>  void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 
> num_bytes);
> -int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
> +int btrfs_delalloc_reserve_space(struct inode *inode,
> + struct extent_changeset **reserved, u64 start, u64 len);
>  void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
>  void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
>  struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4f62696131a6..782e0f5feb69 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,6 +3364,7 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>   struct btrfs_fs_info *fs_info = block_group->fs_info;
>   struct btrfs_root *root = fs_info->tree_root;
>   struct inode *inode = NULL;
> + struct extent_changeset *data_reserved = NULL;
>   u64 alloc_hint = 0;
>   int dcs = BTRFS_DC_ERROR;
>   u64 num_pages = 0;
> @@ -3483,7 +3484,7 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>   num_pages *= 16;
>   num_pages *= PAGE_SIZE;
>  
> - ret = btrfs_check_data_free_space(inode, 0, num_pages);
> + ret = btrfs_check_data_free_space(inode, _reserved, 0, num_pages);
>   if (ret)
>   goto out_put;
>  
> @@ -3514,6 +3515,7 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>   block_group->disk_cache_state = dcs;
>   spin_unlock(_group->lock);
>  
> + extent_changeset_free(data_reserved);
>   return ret;
>  }
>  
> @@ -4282,7 +4284,8 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
> *inode, u64 bytes)
>   * Will replace old btrfs_check_data_free_space(), but for patch split,
>   * add a new function first and then replace it.
>   */
> -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
> +int btrfs_check_data_free_space(struct inode *inode,
> + struct extent_changeset **reserved, u64 start, u64 len)
>  {
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   int ret;
> @@ -4297,9 +4300,11 @@ int btrfs_check_data_free_space(struct inode *inode, 
> u64 start, u64 len)
>   return ret;
>  
>   /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
> - ret = btrfs_qgroup_reserve_data(inode, start, len);
> + ret = btrfs_qgroup_reserve_data(inode, reserved, start, len);
>   if (ret < 0)
>   btrfs_free_reserved_data_space_noquota(inode, start, len);
> + else
> + ret = 0;
>   return ret;
>  }
>  
> @@ -6140,11 +6145,12 @@ void 

Re: [PATCH 0/6 RFC] utilize bio_clone_fast to clean up

2017-05-10 Thread David Sterba
On Tue, May 09, 2017 at 03:49:13PM -0700, Liu Bo wrote:
> > This patch is too big, can you split it to smaller chunks? I was not
> > able to review it, it seems to touch several things at once, it's hard
> > to keep the context.
> 
> Oh I see, the diff does look scary but the changes are in fact not
> intrusive, I'll try to do something.

On second read, I think it's not that bad as it looked. Some of the code
is just moved/indented differently and I see the core of the change now.

I'm concerned about the used types for length/size, but this can be done
as a followup patch. If you have split the patch already, please send
it anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 23/27] gfs2: clean up some filemap_* calls

2017-05-10 Thread Bob Peterson
- Original Message -
| In some places, it's trying to reset the mapping error after calling
| filemap_fdatawait. That's no longer required. Also, turn several
| filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait.
| That will at least return writeback errors that occur during the write
| phase.
| 
| Signed-off-by: Jeff Layton 
| ---

Hi Jeff,

This version looks better. ACK.

Regards,

Bob Peterson
Red Hat File Systems
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix incorrect error return ret being passed to mapping_set_error

2017-05-10 Thread David Sterba
On Tue, May 09, 2017 at 11:58:55AM -0700, Liu Bo wrote:
> On Tue, May 09, 2017 at 06:14:01PM +0100, Colin King wrote:
> > From: Colin Ian King 
> > 
> > The setting of return code ret should be based on the error code
> > passed into function end_extent_writepage and not on ret. Thanks
> > to Liu Bo for spotting this mistake in the original fix I submitted.
> > 
> > Detected by CoverityScan, CID#1414312 ("Logically dead code")
> 
> Looks good.
> 
> Reviewed-by: Liu Bo 

Thanks, patch replaced.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Jeff Layton
On Wed, 2017-05-10 at 07:18 -0700, Matthew Wilcox wrote:
> On Tue, May 09, 2017 at 11:49:16AM -0400, Jeff Layton wrote:
> > +++ b/lib/errseq.c
> > @@ -0,0 +1,199 @@
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +/*
> > + * An errseq_t is a way of recording errors in one place, and allowing any
> > + * number of "subscribers" to tell whether it has changed since an 
> > arbitrary
> > + * time of their choosing.
> 
> You use the word "time" in several places in the documentation, but I think
> it's clearer to say "sampling point" or "sample", since you're not using 
> jiffies
> or nanoseconds.  For example, I'd phrase this paragraph this way:
> 
>  * An errseq_t is a way of recording errors in one place, and allowing any
>  * number of "subscribers" to tell whether it has changed since they last
>  * sampled it.
> 
> > + * The general idea is for consumers to sample an errseq_t value at a
> > + * particular point in time. Later, that value can be used to tell whether 
> > any
> > + * new errors have occurred since that time.
> 
>  * The general idea is for consumers to sample an errseq_t value.  That
>  * value can be used to tell whether any new errors have occurred since
>  * the last time it was sampled.
> 
> > +/* The "ones" bit for the counter */
> 
> Maybe "The lowest bit of the counter"?
> 
> > +/**
> > + * errseq_check - has an error occurred since a particular point in time?
> 
> "has an error occurred since the last time it was sampled"
> 
> > +/**
> > + * errseq_check_and_advance - check an errseq_t and advance it to the 
> > current value
> > + * @eseq: pointer to value being checked reported
> 
> "value being checked reported"?
> 

Thanks. I'm cleaning up the comments like you suggest.

> > +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > +{
> > +   int err = 0;
> > +   errseq_t old, new;
> > +
> > +   /*
> > +* Most callers will want to use the inline wrapper to check this,
> > +* so that the common case of no error is handled without needing
> > +* to lock.
> > +*/
> > +   old = READ_ONCE(*eseq);
> > +   if (old != *since) {
> > +   /*
> > +* Set the flag and try to swap it into place if it has
> > +* changed.
> > +*
> > +* We don't care about the outcome of the swap here. If the
> > +* swap doesn't occur, then it has either been updated by a
> > +* writer who is bumping the seq count anyway, or another
> > +* reader who is just setting the "seen" flag. Either outcome
> > +* is OK, and we can advance "since" and return an error based
> > +* on what we have.
> > +*/
> > +   new = old | ERRSEQ_SEEN;
> > +   if (new != old)
> > +   cmpxchg(eseq, old, new);
> > +   *since = new;
> > +   err = -(new & MAX_ERRNO);
> > +   }
> 
> I probably need to read through the patchset some more to understand this.
> Naively, surely "since" should be updated to the current value of 'eseq'
> if we failed the cmpxchg()?

I don't think so. If we want to do that, then we'll need to redrive the
cmpxchg to set the SEEN flag if it's now clear. Storing the value in
"since" is effectively sampling it, so you do have to mark it seen.

The good news is that I think that "new" is just as valid a value to
store here as *eseq would be. It ends up representing an errseq_t value
that never actually got stored in eseq, but that's OK with the way this
works.

-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Matthew Wilcox
On Tue, May 09, 2017 at 11:49:16AM -0400, Jeff Layton wrote:
> +++ b/lib/errseq.c
> @@ -0,0 +1,199 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * An errseq_t is a way of recording errors in one place, and allowing any
> + * number of "subscribers" to tell whether it has changed since an arbitrary
> + * time of their choosing.

You use the word "time" in several places in the documentation, but I think
it's clearer to say "sampling point" or "sample", since you're not using jiffies
or nanoseconds.  For example, I'd phrase this paragraph this way:

 * An errseq_t is a way of recording errors in one place, and allowing any
 * number of "subscribers" to tell whether it has changed since they last
 * sampled it.

> + * The general idea is for consumers to sample an errseq_t value at a
> + * particular point in time. Later, that value can be used to tell whether 
> any
> + * new errors have occurred since that time.

 * The general idea is for consumers to sample an errseq_t value.  That
 * value can be used to tell whether any new errors have occurred since
 * the last time it was sampled.

> +/* The "ones" bit for the counter */

Maybe "The lowest bit of the counter"?

> +/**
> + * errseq_check - has an error occurred since a particular point in time?

"has an error occurred since the last time it was sampled"

> +/**
> + * errseq_check_and_advance - check an errseq_t and advance it to the 
> current value
> + * @eseq: pointer to value being checked reported

"value being checked reported"?

> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> + int err = 0;
> + errseq_t old, new;
> +
> + /*
> +  * Most callers will want to use the inline wrapper to check this,
> +  * so that the common case of no error is handled without needing
> +  * to lock.
> +  */
> + old = READ_ONCE(*eseq);
> + if (old != *since) {
> + /*
> +  * Set the flag and try to swap it into place if it has
> +  * changed.
> +  *
> +  * We don't care about the outcome of the swap here. If the
> +  * swap doesn't occur, then it has either been updated by a
> +  * writer who is bumping the seq count anyway, or another
> +  * reader who is just setting the "seen" flag. Either outcome
> +  * is OK, and we can advance "since" and return an error based
> +  * on what we have.
> +  */
> + new = old | ERRSEQ_SEEN;
> + if (new != old)
> + cmpxchg(eseq, old, new);
> + *since = new;
> + err = -(new & MAX_ERRNO);
> + }

I probably need to read through the patchset some more to understand this.
Naively, surely "since" should be updated to the current value of 'eseq'
if we failed the cmpxchg()?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting

2017-05-10 Thread Jan Kara
On Wed 10-05-17 08:19:50, Jeff Layton wrote:
> On Wed, 2017-05-10 at 13:48 +0200, Jan Kara wrote:
> > On Tue 09-05-17 11:49:17, Jeff Layton wrote:
> > > diff --git a/fs/file_table.c b/fs/file_table.c
> > > index 954d510b765a..d6138b6411ff 100644
> > > --- a/fs/file_table.c
> > > +++ b/fs/file_table.c
> > > @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, 
> > > fmode_t mode,
> > >   file->f_path = *path;
> > >   file->f_inode = path->dentry->d_inode;
> > >   file->f_mapping = path->dentry->d_inode->i_mapping;
> > > + file->f_wb_err = filemap_sample_wb_error(file->f_mapping);
> > 
> > Why do you sample here when you also sample in do_dentry_open()? I didn't
> > find any alloc_file() callers that would possibly care about writeback
> > errors... 
> > 
> > Honza
> 
> I basically used the setting of f_mapping as a guideline as to where to
> sample it for initialization. My thinking was that if f_mapping ever
> ended up different then you'd probably also want f_wb_err to be
> resampled anyway.

OK, makes sense.

> I can drop this hunk if you think we don't need it.

I don't really care. I was just wondering whether I'm missing something...

Honza

-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 06/27] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-05-10 Thread Matthew Wilcox
On Tue, May 09, 2017 at 11:49:09AM -0400, Jeff Layton wrote:
> ext2 currently does a test+clear of the AS_EIO flag, which is
> is problematic for some coming changes.
> 
> What we really need to do instead is call filemap_check_errors
> in __generic_file_fsync after syncing out the buffers. That
> will be sufficient for this case, and help other callers detect
> these errors properly as well.
> 
> With that, we don't need to twiddle it in ext2.
> 
> Suggested-by: Jan Kara 
> Signed-off-by: Jeff Layton 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Jan Kara 

Reviewed-by: Matthew Wilcox 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting

2017-05-10 Thread Jeff Layton
On Wed, 2017-05-10 at 13:48 +0200, Jan Kara wrote:
> On Tue 09-05-17 11:49:17, Jeff Layton wrote:
> > Most filesystems currently use mapping_set_error and
> > filemap_check_errors for setting and reporting/clearing writeback errors
> > at the mapping level. filemap_check_errors is indirectly called from
> > most of the filemap_fdatawait_* functions and from
> > filemap_write_and_wait*. These functions are called from all sorts of
> > contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> > also in truncate calls, getattr, etc.
> > 
> > The non-fsync callers are problematic. We should be reporting writeback
> > errors during fsync, but many places spread over the tree clear out
> > errors before they can be properly reported, or report errors at
> > nonsensical times.
> > 
> > If I get -EIO on a stat() call, there is no reason for me to assume that
> > it is because some previous writeback failed. The fact that it also
> > clears out the error such that a subsequent fsync returns 0 is a bug,
> > and a nasty one since that's potentially silent data corruption.
> > 
> > This patch adds a small bit of new infrastructure for setting and
> > reporting errors during address_space writeback. While the above was my
> > original impetus for adding this, I think it's also the case that
> > current fsync semantics are just problematic for userland. Most
> > applications that call fsync do so to ensure that the data they wrote
> > has hit the backing store.
> > 
> > In the case where there are multiple writers to the file at the same
> > time, this is really hard to determine. The first one to call fsync will
> > see any stored error, and the rest get back 0. The processes with open
> > fds may not be associated with one another in any way. They could even
> > be in different containers, so ensuring coordination between all fsync
> > callers is not really an option.
> > 
> > One way to remedy this would be to track what file descriptor was used
> > to dirty the file, but that's rather cumbersome and would likely be
> > slow. However, there is a simpler way to improve the semantics here
> > without incurring too much overhead.
> > 
> > This set adds an errseq_t to struct address_space, and a corresponding
> > one is added to struct file. Writeback errors are recorded in the
> > mapping's errseq_t, and the one in struct file is used as the "since"
> > value.
> > 
> > This changes the semantics of the Linux fsync implementation such that
> > applications can now use it to determine whether there were any
> > writeback errors since fsync(fd) was last called (or since the file was
> > opened in the case of fsync having never been called).
> > 
> > Note that those writeback errors may have occurred when writing data
> > that was dirtied via an entirely different fd, but that's the case now
> > with the current mapping_set_error/filemap_check_error infrastructure.
> > This will at least prevent you from getting a false report of success.
> > 
> > The new behavior is still consistent with the POSIX spec, and is more
> > reliable for application developers. This patch just adds some basic
> > infrastructure for doing this. Later patches will change the existing
> > code to use this new infrastructure.
> > 
> > Signed-off-by: Jeff Layton 
> 
> Just one nit below. Otherwise the patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 
> 
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index 954d510b765a..d6138b6411ff 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, 
> > fmode_t mode,
> > file->f_path = *path;
> > file->f_inode = path->dentry->d_inode;
> > file->f_mapping = path->dentry->d_inode->i_mapping;
> > +   file->f_wb_err = filemap_sample_wb_error(file->f_mapping);
> 
> Why do you sample here when you also sample in do_dentry_open()? I didn't
> find any alloc_file() callers that would possibly care about writeback
> errors... 
> 
>   Honza

I basically used the setting of f_mapping as a guideline as to where to
sample it for initialization. My thinking was that if f_mapping ever
ended up different then you'd probably also want f_wb_err to be
resampled anyway.

I can drop this hunk if you think we don't need it.

-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:17, Jeff Layton wrote:
> Most filesystems currently use mapping_set_error and
> filemap_check_errors for setting and reporting/clearing writeback errors
> at the mapping level. filemap_check_errors is indirectly called from
> most of the filemap_fdatawait_* functions and from
> filemap_write_and_wait*. These functions are called from all sorts of
> contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> also in truncate calls, getattr, etc.
> 
> The non-fsync callers are problematic. We should be reporting writeback
> errors during fsync, but many places spread over the tree clear out
> errors before they can be properly reported, or report errors at
> nonsensical times.
> 
> If I get -EIO on a stat() call, there is no reason for me to assume that
> it is because some previous writeback failed. The fact that it also
> clears out the error such that a subsequent fsync returns 0 is a bug,
> and a nasty one since that's potentially silent data corruption.
> 
> This patch adds a small bit of new infrastructure for setting and
> reporting errors during address_space writeback. While the above was my
> original impetus for adding this, I think it's also the case that
> current fsync semantics are just problematic for userland. Most
> applications that call fsync do so to ensure that the data they wrote
> has hit the backing store.
> 
> In the case where there are multiple writers to the file at the same
> time, this is really hard to determine. The first one to call fsync will
> see any stored error, and the rest get back 0. The processes with open
> fds may not be associated with one another in any way. They could even
> be in different containers, so ensuring coordination between all fsync
> callers is not really an option.
> 
> One way to remedy this would be to track what file descriptor was used
> to dirty the file, but that's rather cumbersome and would likely be
> slow. However, there is a simpler way to improve the semantics here
> without incurring too much overhead.
> 
> This set adds an errseq_t to struct address_space, and a corresponding
> one is added to struct file. Writeback errors are recorded in the
> mapping's errseq_t, and the one in struct file is used as the "since"
> value.
> 
> This changes the semantics of the Linux fsync implementation such that
> applications can now use it to determine whether there were any
> writeback errors since fsync(fd) was last called (or since the file was
> opened in the case of fsync having never been called).
> 
> Note that those writeback errors may have occurred when writing data
> that was dirtied via an entirely different fd, but that's the case now
> with the current mapping_set_error/filemap_check_error infrastructure.
> This will at least prevent you from getting a false report of success.
> 
> The new behavior is still consistent with the POSIX spec, and is more
> reliable for application developers. This patch just adds some basic
> infrastructure for doing this. Later patches will change the existing
> code to use this new infrastructure.
> 
> Signed-off-by: Jeff Layton 

Just one nit below. Otherwise the patch looks good to me. You can add:

Reviewed-by: Jan Kara 

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 954d510b765a..d6138b6411ff 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t 
> mode,
>   file->f_path = *path;
>   file->f_inode = path->dentry->d_inode;
>   file->f_mapping = path->dentry->d_inode->i_mapping;
> + file->f_wb_err = filemap_sample_wb_error(file->f_mapping);

Why do you sample here when you also sample in do_dentry_open()? I didn't
find any alloc_file() callers that would possibly care about writeback
errors... 

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:16, Jeff Layton wrote:
> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
> 
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
> 
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
> 
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
> 
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
> 
> Signed-off-by: Jeff Layton 

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara 

Just two nits below:
...
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> + int err = 0;
> + errseq_t old, new;
> +
> + /*
> +  * Most callers will want to use the inline wrapper to check this,
> +  * so that the common case of no error is handled without needing
> +  * to lock.
> +  */

I'm not sure which locking you are speaking about here. Is the comment
stale?

> + old = READ_ONCE(*eseq);
> + if (old != *since) {
> + /*
> +  * Set the flag and try to swap it into place if it has
> +  * changed.
> +  *
> +  * We don't care about the outcome of the swap here. If the
> +  * swap doesn't occur, then it has either been updated by a
> +  * writer who is bumping the seq count anyway, or another

"bumping the seq count anyway" part is not quite true. Writer may see
ERRSEQ_SEEN not set and so just update the error code and leave seq count
as is. But since you compare full errseq_t for equality, this works out as
designed...

> +  * reader who is just setting the "seen" flag. Either outcome
> +  * is OK, and we can advance "since" and return an error based
> +  * on what we have.
> +  */
> + new = old | ERRSEQ_SEEN;
> + if (new != old)
> + cmpxchg(eseq, old, new);
> + *since = new;
> + err = -(new & MAX_ERRNO);
> + }
> + return err;
> +}
> +EXPORT_SYMBOL(errseq_check_and_advance);

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread Jeff Layton
On Wed, 2017-05-10 at 08:03 +1000, NeilBrown wrote:
> On Tue, May 09 2017, Jeff Layton wrote:
> 
> > An errseq_t is a way of recording errors in one place, and allowing any
> > number of "subscribers" to tell whether an error has been set again
> > since a previous time.
> > 
> > It's implemented as an unsigned 32-bit value that is managed with atomic
> > operations. The low order bits are designated to hold an error code
> > (max size of MAX_ERRNO). The upper bits are used as a counter.
> > 
> > The API works with consumers sampling an errseq_t value at a particular
> > point in time. Later, that value can be used to tell whether new errors
> > have been set since that time.
> > 
> > Note that there is a 1 in 512k risk of collisions here if new errors
> > are being recorded frequently, since we have so few bits to use as a
> > counter. To mitigate this, one bit is used as a flag to tell whether the
> > value has been sampled since a new value was recorded. That allows
> > us to avoid bumping the counter if no one has sampled it since it
> > was last bumped.
> > 
> > Later patches will build on this infrastructure to change how writeback
> > errors are tracked in the kernel.
> > 
> > Signed-off-by: Jeff Layton 
> 
> I like that this is a separate lib/*.c - nicely structured too.
> 
> Reviewed-by: NeilBrown 
> 
> 

Thanks, yeah...it occurred to me that this scheme is not really specific
to writeback errors. While I can't think of another use-case for
errseq_t's right offhand, I think this makes for cleaner layering and
should make it easy to use in other ways should they arise.

-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Stefan Priebe - Profihost AG
Hello,

thanks. But is there any way to recover from this error? Like removing
the item or so? Data loss isn't a problem. Just reconstructing the whole
FS will take quite a long time.

Stefan

Am 10.05.2017 um 11:54 schrieb Hugo Mills:
> On Wed, May 10, 2017 at 11:20:58AM +0200, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> here's the output:
>> # for block in 163316514816 163322413056 163325722624; do echo $block;
>> btrfs-debug-tree -b $block /dev/mapper/crypt_md0|sed -re 's/(\t| )name:
>> .*/\1name: HIDDEN/'; done
>>
>> 163316514816
>> btrfs-progs v4.8.5
>> leaf 163316514816 items 188 free space 1387 generation 86739 owner 3892
>> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
>> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f
> [...]
>> item 37 key (23760 DIR_INDEX 36) itemoff 14278 itemsize 58
>> location key (28124232 INODE_ITEM 0) type FILE
>> transid 86739 data_len 0 name_len 28
>> name: HIDDEN
>> item 38 key (23760 DIR_INDEX 37) itemoff 14220 itemsize 58
>> location key (28124233 INODE_ITEM 0) type FILE
>> transid 86739 data_len 0 name_len 28
>> name: HIDDEN
>> item 39 key (23760 DIR_INDEX 38) itemoff 14165 itemsize 55
>> location key (28124234 INODE_ITEM 0) type FILE
>> transid 86739 data_len 0 name_len 25
>> name: HIDDEN
>> item 40 key (23760 DIR_INDEX 22) itemoff 14115 itemsize 50
>> location key (26923383 INODE_ITEM 0) type FILE
>> transid 74009 data_len 0 name_len 20
>> name: HIDDEN
>> item 41 key (23760 DIR_INDEX 23) itemoff 14067 itemsize 48
>> location key (26923384 INODE_ITEM 0) type FILE
>> transid 74009 data_len 0 name_len 18
>> name: HIDDEN
> [...]
>> 163322413056
>> btrfs-progs v4.8.5
>> leaf 163322413056 items 113 free space 934 generation 86739 owner 3892
>> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
>> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f
> [...]
>> item 73 key (24016 DIR_INDEX 19) itemoff 9651 itemsize 62
>> location key (28124251 INODE_ITEM 0) type FILE
>> transid 86739 data_len 0 name_len 32
>> name: HIDDEN
>> item 74 key (24016 DIR_INDEX 20) itemoff 9592 itemsize 59
>> location key (28124252 INODE_ITEM 0) type FILE
>> transid 86739 data_len 0 name_len 29
>> name: HIDDEN
>> item 75 key (24016 DIR_INDEX 4) itemoff 9538 itemsize 54
>> location key (26923401 INODE_ITEM 0) type FILE
>> transid 74009 data_len 0 name_len 24
>> name: HIDDEN
>> item 76 key (24016 DIR_INDEX 5) itemoff 9486 itemsize 52
>> location key (26923402 INODE_ITEM 0) type FILE
>> transid 74009 data_len 0 name_len 22
>> name: HIDDEN
> [...]
>> 163325722624
>> btrfs-progs v4.8.5
>> leaf 163325722624 items 78 free space 6563 generation 86739 owner 3892
>> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
>> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f
> [...]
>> item 62 key (24189 DIR_INDEX 16) itemoff 9409 itemsize 64
>> location key (28124267 INODE_ITEM 0) type FILE
>> transid 86739 data_len 0 name_len 34
>> name: HIDDEN
>> item 63 key (24189 DIR_INDEX 17) itemoff 9349 itemsize 60
>> location key (28124268 INODE_ITEM 0) type FILE
>> transid 86739 data_len 0 name_len 30
>> name: HIDDEN
>> item 64 key (24189 DIR_INDEX 4) itemoff 9296 itemsize 53
>> location key (26923421 INODE_ITEM 0) type FILE
>> transid 74010 data_len 0 name_len 23
>> name: HIDDEN
>> item 65 key (24189 DIR_INDEX 5) itemoff 9236 itemsize 60
>> location key (26923422 INODE_ITEM 0) type FILE
>> transid 74010 data_len 0 name_len 30
>> name: HIDDEN
> [...]
> 
>In each case, the tree node keys have simply been sorted
> incorrectly -- the ordering is otherwise correct, but jumps backwards
> at some point in the sequence. At least in the first instance, some of
> the keys appear to have been duplicated: there are two (23760
> DIR_INDEX 22) keys in the list. (I didn't check in detail with the
> other two whether there are duplicates, but I wouldn't be surprised).
> 
>I note that the jump in the keys in the first two cases is 16, and
> the jump in the third case is 13. That _might_ suggest some kind of
> bit-level error, but it's probably not in the RAM that was used to
> store the blocks, as the error appears in a different place in each
> block.
> 
>I'm tentatively going to point the finger at your hardware for
> this. It's probably going to need something really long and/or
> stressful to pick it up, though (one 

Re: [PATCH v4 12/27] cifs: set mapping error when page writeback fails in writepage or launder_pages

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:15, Jeff Layton wrote:
> Signed-off-by: Jeff Layton 
> Reviewed-by: Christoph Hellwig 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/cifs/file.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 21d404535739..0bee7f8d91ad 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct 
> writeback_control *wbc)
>   set_page_writeback(page);
>  retry_write:
>   rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
> - if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL)
> - goto retry_write;
> - else if (rc == -EAGAIN)
> + if (rc == -EAGAIN) {
> + if (wbc->sync_mode == WB_SYNC_ALL)
> + goto retry_write;
>   redirty_page_for_writepage(wbc, page);
> - else if (rc != 0)
> + } else if (rc != 0) {
>   SetPageError(page);
> - else
> + mapping_set_error(page->mapping, rc);
> + } else {
>   SetPageUptodate(page);
> + }
>   end_page_writeback(page);
>   put_page(page);
>   free_xid(xid);
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 11/27] fuse: set mapping error in writepage_locked when it fails

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:14, Jeff Layton wrote:
> This ensures that we see errors on fsync when writeback fails.
> 
> Signed-off-by: Jeff Layton 
> Reviewed-by: Christoph Hellwig 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/fuse/file.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index ec238fb5a584..07d0efcb050c 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
>  err_free:
>   fuse_request_free(req);
>  err:
> + mapping_set_error(page->mapping, error);
>   end_page_writeback(page);
>   return error;
>  }
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 05/27] btrfs: btrfs_wait_tree_block_writeback can be void return

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:08, Jeff Layton wrote:
> Nothing checks its return value.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza
> ---
>  fs/btrfs/disk-io.c | 6 +++---
>  fs/btrfs/disk-io.h | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index eb1ee7b6f532..8c479bd5534a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1222,10 +1222,10 @@ int btrfs_write_tree_block(struct extent_buffer *buf)
>   buf->start + buf->len - 1);
>  }
>  
> -int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
> +void btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
>  {
> - return filemap_fdatawait_range(buf->pages[0]->mapping,
> -buf->start, buf->start + buf->len - 1);
> + filemap_fdatawait_range(buf->pages[0]->mapping,
> + buf->start, buf->start + buf->len - 1);
>  }
>  
>  struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 
> bytenr,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 2e0ec29bfd69..9cc87835abb5 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -127,7 +127,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, 
> struct inode *inode,
>   extent_submit_bio_hook_t *submit_bio_done);
>  unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info);
>  int btrfs_write_tree_block(struct extent_buffer *buf);
> -int btrfs_wait_tree_block_writeback(struct extent_buffer *buf);
> +void btrfs_wait_tree_block_writeback(struct extent_buffer *buf);
>  int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
>struct btrfs_fs_info *fs_info);
>  int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 01/27] fs: remove unneeded forward definition of mm_struct from fs.h

2017-05-10 Thread Jan Kara
On Tue 09-05-17 11:49:04, Jeff Layton wrote:
> Signed-off-by: Jeff Layton 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/fs.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7251f7bb45e8..38adefd8e2a0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1252,8 +1252,6 @@ extern void f_delown(struct file *filp);
>  extern pid_t f_getown(struct file *filp);
>  extern int send_sigurg(struct fown_struct *fown);
>  
> -struct mm_struct;
> -
>  /*
>   *   Umount options
>   */
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] fstests: regression test for nocsum buffered read's repair

2017-05-10 Thread Eryu Guan
On Tue, May 09, 2017 at 11:56:11AM -0600, Liu Bo wrote:
> This is to test whether buffered read retry-repair code is able to work in
> raid1 case as expected.
> 
> Please note that without checksum, btrfs doesn't know if the data used to
> repair is correct, so repair is more of resync which makes sure that both
> of the copy has the same content.
> 
> Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and 
> drop
> checks") introduced the regression.
> 
> The upstream fix is
>   Btrfs: bring back repair during read

btrfs/14[1-3] all could refer to the upstream patch along with its
commit id.

> 
> Signed-off-by: Liu Bo 
> ---
>  tests/btrfs/143 | 155 
> 
>  tests/btrfs/143.out |  39 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 195 insertions(+)
>  create mode 100755 tests/btrfs/143
>  create mode 100644 tests/btrfs/143.out
> 
> diff --git a/tests/btrfs/143 b/tests/btrfs/143
> new file mode 100755
> index 000..5263e78
> --- /dev/null
> +++ b/tests/btrfs/143
> @@ -0,0 +1,155 @@
> +#! /bin/bash
> +# FS QA Test 143
> +#
> +# Regression test for btrfs buffered read's repair during read without 
> checksum.
> +#
> +# This is to test whether buffered read retry-repair code is able to work in
> +# raid1 case as expected.
> +#
> +# Please note that without checksum, btrfs doesn't know if the data used to
> +# repair is correct, so repair is more of resync which makes sure that both
> +# of the copy has the same content.
> +#
> +# Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and 
> drop
> +# checks") introduced the regression.
> +#
> +# The upstream fix is
> +#Btrfs: bring back repair during read
> +#
> +#---
> +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_dev_pool 2

btrfs/142 and btrfs/143 need a "_require_fail_make_request"

> +
> +_require_btrfs_command inspect-internal dump-tree
> +_require_command "$FILEFRAG_PROG" filefrag
> +
> +_check_repair()
> +{
> + filter=${1:-cat}
> + dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | 
> $filter | grep -q -e "read error corrected"
> + if [ $? -eq 0 ]; then
> + echo 1
> + else
> + echo 0
> + fi
> +}

This function can be removed.

Thanks,
Eryu

> +
> +get_physical()
> +{
> +# $1 is logical address
> +# print chunk tree and find devid 2 which is $SCRATCH_DEV
> +$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | \
> + grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print 
> $6 }'
> +}
> +
> +SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
> +
> +start_fail()
> +{
> + echo 100 > $DEBUGFS_MNT/fail_make_request/probability
> + echo 4 > $DEBUGFS_MNT/fail_make_request/times
> + echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
> + echo 1 > $SYSFS_BDEV/make-it-fail
> +}
> +
> +stop_fail()
> +{
> + echo 0 > $DEBUGFS_MNT/fail_make_request/probability
> + echo 0 > $DEBUGFS_MNT/fail_make_request/times
> + echo 0 > $SYSFS_BDEV/make-it-fail
> +}
> +
> +_scratch_dev_pool_get 2
> +# step 1, create a raid1 btrfs which contains one 128k file.
> +echo "step 1..mkfs.btrfs" >>$seqres.full
> +
> +mkfs_opts="-d raid1 -b 1G"
> +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
> +
> +# -o nospace_cache makes sure data is written to the start position of the 
> data
> +# chunk
> +_scratch_mount -o nospace_cache,nodatasum
> +
> +$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" 
> | _filter_xfs_io
> +
> +# 

Re: [PATCH 3/6] fstests: regression test for btrfs dio read repair

2017-05-10 Thread Eryu Guan
On Tue, May 09, 2017 at 11:56:08AM -0600, Liu Bo wrote:
> This case tests whether dio read can repair the bad copy if we have
> a good copy.
> 
> Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
> introduced the regression.
> 
> The upstream fix is
>   Btrfs: fix invalid dereference in btrfs_retry_endio

I noticed this is in upstream now, you can refer to it along with hash
tag too.

> 
> Signed-off-by: Liu Bo 
> ---
>  tests/btrfs/140 | 115 
> 
>  tests/btrfs/140.out |  39 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 155 insertions(+)
>  create mode 100755 tests/btrfs/140
>  create mode 100644 tests/btrfs/140.out
> 
> diff --git a/tests/btrfs/140 b/tests/btrfs/140
> new file mode 100755
> index 000..09a9939
> --- /dev/null
> +++ b/tests/btrfs/140
> @@ -0,0 +1,115 @@
> +#! /bin/bash
> +# FS QA Test 140
> +#
> +# Regression test for btrfs DIO read's repair during read.
> +#
> +# Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
> +# introduced the regression.
> +# The upstream fix is
> +#Btrfs: fix invalid dereference in btrfs_retry_endio
> +#
> +#---
> +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_dev_pool 2
> +
> +_require_btrfs_command inspect-internal dump-tree
> +_require_command "$FILEFRAG_PROG" filefrag
> +_require_odirect
> +
> +get_physical()
> +{
> + # $1 is logical address
> + # print chunk tree and find devid 2 which is $SCRATCH_DEV
> + $BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | \
> + grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print 
> $6 }'
> +}
> +
> +_scratch_dev_pool_get 2
> +# step 1, create a raid1 btrfs which contains one 128k file.
> +echo "step 1..mkfs.btrfs" >>$seqres.full
> +
> +mkfs_opts="-d raid1 -b 1G"
> +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
> +
> +# -o nospace_cache makes sure data is written to the start position of the 
> data
> +# chunk
> +_scratch_mount -o nospace_cache
> +
> +$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" 
> | _filter_xfs_io
> +
> +# step 2, corrupt the first 64k of one copy (on SCRATCH_DEV which is the 
> first
> +# one in $SCRATCH_DEV_POOL
> +echo "step 2..corrupt file extent" >>$seqres.full
> +
> +${FILEFRAG_PROG} -v $SCRATCH_MNT/foobar >> $seqres.full
> +logical_in_btrfs=`${FILEFRAG_PROG} -v $SCRATCH_MNT/foobar | _filter_filefrag 
> | cut -d '#' -f 1`
> +physical_on_scratch=`get_physical ${logical_in_btrfs}`
> +
> +_scratch_unmount
> +$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 64K $physical_on_scratch 64K" 
> $SCRATCH_DEV | _filter_xfs_io
> +
> +_scratch_mount
> +
> +# step 3, 128k dio read (this read can repair bad copy)
> +echo "step 3..repair the bad copy" >>$seqres.full
> +
> +# since raid1 consists of two copies, and the following read may read the 
> good
> +# copy directly, so lets loop 10 times here and discard output that dio reads
> +# give
> +for i in `seq 1 10`; do
> + $XFS_IO_PROG -d -c "pread -b 128K 0 128K" "$SCRATCH_MNT/foobar" > 
> /dev/null
> + _get_current_dmesg | grep -q -e "csum failed" && break
> +done

Half of the time I got test failure because pread from SCRATCH_DEV read
0xbb instead of 0xaa on v4.11 kernel (bug should be fixed there), tested
on two different hosts and could hit failure on both hosts.

Similar failure happened to all the 4 tests randomly. I thought it was
because "csum failed" was never hit, so I tried a "while true; do" loop,
and that did fix the btrfs/140 failure for me, 

Re: [PATCH 2/6] fstests: add _get_current_dmesg

2017-05-10 Thread Eryu Guan
On Tue, May 09, 2017 at 11:56:07AM -0600, Liu Bo wrote:
> _get_current_dmesg can be used to grep customized pattern.
> 
> Signed-off-by: Liu Bo 

I can't apply this patch on top of current master, perhaps it needs a
rebase :)

> ---
>  common/rc | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/common/rc b/common/rc
> index 78a2101..111ed69 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3215,6 +3215,12 @@ _get_device_size()
>   grep `_short_dev $1` /proc/partitions | awk '{print $3}'
>  }
>  
> +# get current dmesg log
> +_get_current_dmesg()
> +{
> + dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac
> +}
> +
>  # check dmesg log for WARNING/Oops/etc.
>  _check_dmesg()
>  {
> @@ -3230,8 +3236,7 @@ _check_dmesg()
>  
>   # search the dmesg log of last run of $seqnum for possible failures
>   # use sed \cregexpc address type, since $seqnum contains "/"

This second line of comment should go to _get_current_dmesg.

Thanks,
Eryu

> - dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \
> - tac | $filter >$seqres.dmesg
> + _get_current_dmesg | $filter >$seqres.dmesg
>   grep -q -e "kernel BUG at" \
>-e "WARNING:" \
>-e "BUG:" \
> -- 
> 2.5.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Hugo Mills
On Wed, May 10, 2017 at 11:20:58AM +0200, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> here's the output:
> # for block in 163316514816 163322413056 163325722624; do echo $block;
> btrfs-debug-tree -b $block /dev/mapper/crypt_md0|sed -re 's/(\t| )name:
> .*/\1name: HIDDEN/'; done
> 
> 163316514816
> btrfs-progs v4.8.5
> leaf 163316514816 items 188 free space 1387 generation 86739 owner 3892
> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f
[...]
> item 37 key (23760 DIR_INDEX 36) itemoff 14278 itemsize 58
> location key (28124232 INODE_ITEM 0) type FILE
> transid 86739 data_len 0 name_len 28
> name: HIDDEN
> item 38 key (23760 DIR_INDEX 37) itemoff 14220 itemsize 58
> location key (28124233 INODE_ITEM 0) type FILE
> transid 86739 data_len 0 name_len 28
> name: HIDDEN
> item 39 key (23760 DIR_INDEX 38) itemoff 14165 itemsize 55
> location key (28124234 INODE_ITEM 0) type FILE
> transid 86739 data_len 0 name_len 25
> name: HIDDEN
> item 40 key (23760 DIR_INDEX 22) itemoff 14115 itemsize 50
> location key (26923383 INODE_ITEM 0) type FILE
> transid 74009 data_len 0 name_len 20
> name: HIDDEN
> item 41 key (23760 DIR_INDEX 23) itemoff 14067 itemsize 48
> location key (26923384 INODE_ITEM 0) type FILE
> transid 74009 data_len 0 name_len 18
> name: HIDDEN
[...]
> 163322413056
> btrfs-progs v4.8.5
> leaf 163322413056 items 113 free space 934 generation 86739 owner 3892
> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f
[...]
> item 73 key (24016 DIR_INDEX 19) itemoff 9651 itemsize 62
> location key (28124251 INODE_ITEM 0) type FILE
> transid 86739 data_len 0 name_len 32
> name: HIDDEN
> item 74 key (24016 DIR_INDEX 20) itemoff 9592 itemsize 59
> location key (28124252 INODE_ITEM 0) type FILE
> transid 86739 data_len 0 name_len 29
> name: HIDDEN
> item 75 key (24016 DIR_INDEX 4) itemoff 9538 itemsize 54
> location key (26923401 INODE_ITEM 0) type FILE
> transid 74009 data_len 0 name_len 24
> name: HIDDEN
> item 76 key (24016 DIR_INDEX 5) itemoff 9486 itemsize 52
> location key (26923402 INODE_ITEM 0) type FILE
> transid 74009 data_len 0 name_len 22
> name: HIDDEN
[...]
> 163325722624
> btrfs-progs v4.8.5
> leaf 163325722624 items 78 free space 6563 generation 86739 owner 3892
> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f
[...]
> item 62 key (24189 DIR_INDEX 16) itemoff 9409 itemsize 64
> location key (28124267 INODE_ITEM 0) type FILE
> transid 86739 data_len 0 name_len 34
> name: HIDDEN
> item 63 key (24189 DIR_INDEX 17) itemoff 9349 itemsize 60
> location key (28124268 INODE_ITEM 0) type FILE
> transid 86739 data_len 0 name_len 30
> name: HIDDEN
> item 64 key (24189 DIR_INDEX 4) itemoff 9296 itemsize 53
> location key (26923421 INODE_ITEM 0) type FILE
> transid 74010 data_len 0 name_len 23
> name: HIDDEN
> item 65 key (24189 DIR_INDEX 5) itemoff 9236 itemsize 60
> location key (26923422 INODE_ITEM 0) type FILE
> transid 74010 data_len 0 name_len 30
> name: HIDDEN
[...]

   In each case, the tree node keys have simply been sorted
incorrectly -- the ordering is otherwise correct, but jumps backwards
at some point in the sequence. At least in the first instance, some of
the keys appear to have been duplicated: there are two (23760
DIR_INDEX 22) keys in the list. (I didn't check in detail with the
other two whether there are duplicates, but I wouldn't be surprised).

   I note that the jump in the keys in the first two cases is 16, and
the jump in the third case is 13. That _might_ suggest some kind of
bit-level error, but it's probably not in the RAM that was used to
store the blocks, as the error appears in a different place in each
block.

   I'm tentatively going to point the finger at your hardware for
this. It's probably going to need something really long and/or
stressful to pick it up, though (one of the CPU stress tests, for
example, and also a good long run with a RAM tester -- 24 hours or
longer).

   Hugo.

> Stefan
> Am 10.05.2017 um 11:08 schrieb Hugo Mills:
> > On Wed, May 10, 2017 at 10:40:31AM +0200, Stefan Priebe - Profihost AG 
> > wrote:
> >> Am 10.05.2017 um 09:40 schrieb Hugo Mills:
> >>> On Wed, May 10, 2017 at 09:36:30AM +0200, 

Re: runtime btrfsck

2017-05-10 Thread Stefan Priebe - Profihost AG
Hello,

here's the output:
# for block in 163316514816 163322413056 163325722624; do echo $block;
btrfs-debug-tree -b $block /dev/mapper/crypt_md0|sed -re 's/(\t| )name:
.*/\1name: HIDDEN/'; done

163316514816
btrfs-progs v4.8.5
leaf 163316514816 items 188 free space 1387 generation 86739 owner 3892
fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f
item 0 key (23760 DIR_ITEM 2479948887) itemoff 16229 itemsize 54
location key (26923382 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 24
name: HIDDEN
item 1 key (23760 DIR_ITEM 2652742785) itemoff 16170 itemsize 59
location key (28124230 INODE_ITEM 0) type FILE
transid 86739 data_len 0 name_len 29
name: HIDDEN
item 2 key (23760 DIR_ITEM 2688971413) itemoff 16119 itemsize 51
location key (26923386 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 21
name: HIDDEN
item 3 key (23760 DIR_ITEM 2764880658) itemoff 16064 itemsize 55
location key (26923399 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 25
name: HIDDEN
item 4 key (23760 DIR_ITEM 2805527189) itemoff 16006 itemsize 58
location key (28124233 INODE_ITEM 0) type FILE
transid 86739 data_len 0 name_len 28
name: HIDDEN
item 5 key (23760 DIR_ITEM 2876464375) itemoff 15957 itemsize 49
location key (26923393 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 19
name: HIDDEN
item 6 key (23760 DIR_ITEM 2951059296) itemoff 15907 itemsize 50
location key (28124218 INODE_ITEM 0) type FILE
transid 86739 data_len 0 name_len 20
name: HIDDEN
item 7 key (23760 DIR_ITEM 3058144963) itemoff 15859 itemsize 48
location key (26923384 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 18
name: HIDDEN
item 8 key (23760 DIR_ITEM 3095440808) itemoff 15804 itemsize 55
location key (26923394 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 25
name: HIDDEN
item 9 key (23760 DIR_ITEM 3124573416) itemoff 15748 itemsize 56
location key (26923387 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 26
name: HIDDEN
item 10 key (23760 DIR_ITEM 3194204932) itemoff 15690 itemsize 58
location key (26923397 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 28
name: HIDDEN
item 11 key (23760 DIR_ITEM 3281114395) itemoff 15637 itemsize 53
location key (26923388 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 23
name: HIDDEN
item 12 key (23760 DIR_ITEM 3353597736) itemoff 15588 itemsize 49
location key (24944 INODE_ITEM 0) type FILE
transid 10694 data_len 0 name_len 19
name: HIDDEN
item 13 key (23760 DIR_ITEM 3389003195) itemoff 15539 itemsize 49
location key (28124226 INODE_ITEM 0) type FILE
transid 86739 data_len 0 name_len 19
name: HIDDEN
item 14 key (23760 DIR_ITEM 3461310858) itemoff 15473 itemsize 66
location key (26923392 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 36
name: HIDDEN
item 15 key (23760 DIR_ITEM 3660173809) itemoff 15422 itemsize 51
location key (28124225 INODE_ITEM 0) type FILE
transid 86739 data_len 0 name_len 21
name: HIDDEN
item 16 key (23760 DIR_ITEM 3678308711) itemoff 15371 itemsize 51
location key (28124220 INODE_ITEM 0) type FILE
transid 86739 data_len 0 name_len 21
name: HIDDEN
item 17 key (23760 DIR_ITEM 3708519009) itemoff 15316 itemsize 55
location key (28124224 INODE_ITEM 0) type FILE
transid 86739 data_len 0 name_len 25
name: HIDDEN
item 18 key (23760 DIR_ITEM 3716314603) itemoff 15258 itemsize 58
location key (26923396 INODE_ITEM 0) type FILE
transid 74009 data_len 0 name_len 28
name: HIDDEN
item 19 key (23760 DIR_ITEM 3958443109) itemoff 15224 itemsize 34
location key (24016 INODE_ITEM 0) type DIR
transid 10693 data_len 0 name_len 4
name: HIDDEN
item 20 key (23760 DIR_INDEX 2) itemoff 15190 itemsize 34
location key (24016 INODE_ITEM 0) type DIR
transid 10693 data_len 0 name_len 4
name: HIDDEN
item 21 key (23760 DIR_INDEX 

Re: runtime btrfsck

2017-05-10 Thread Hugo Mills
On Wed, May 10, 2017 at 10:40:31AM +0200, Stefan Priebe - Profihost AG wrote:
> Am 10.05.2017 um 09:40 schrieb Hugo Mills:
> > On Wed, May 10, 2017 at 09:36:30AM +0200, Stefan Priebe - Profihost AG 
> > wrote:
> >> Hello Roman,
> >>
> >> the FS is mountable. It just goes readonly when trying to write some data.
> >>
> >> The kernel msgs are:
> >> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> >> block=163316514816,root=1, slot=39
> >> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> >> block=163322413056,root=1, slot=74
> >> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> >> block=163325722624,root=1, slot=63
> >> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> >> block=163316514816,root=1, slot=39
> >> BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure
> >> BTRFS info (device dm-2): forced readonly
> >> BTRFS info (device dm-2): delayed_refs has NO entry
> > 
> >Can you show the output of btrfs-debug-tree -b , where
> >  is each of the three "block=" values above?
> 
> Can do that. But the lists are very long - should i upload them to
> pastebin? Is it ok to remove the name atribute - which provides filenames?

   On the mailing list would be preferable, as then the conversation
is completely preserved in archives. They shouldn't be so long that
vger rejects them. And yes, if you want to filter out the filenames,
do so (but _just_ the filename -- don't remove the whole line).

   Hugo.

> Stefan
> 
> 
> >Hugo.
> > 
> >> Greets,
> >> Stefan
> >> Am 10.05.2017 um 09:18 schrieb Roman Mamedov:
> >>> On Wed, 10 May 2017 09:02:46 +0200
> >>> Stefan Priebe - Profihost AG  wrote:
> >>>
>  how to fix bad key ordering?
> >>>
> >>> You should clarify does the FS in question mount (read-write? read-only?)
> >>> and what are the kernel messages if it does not.
> >>>
> > 

-- 
Hugo Mills | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: runtime btrfsck

2017-05-10 Thread Roman Mamedov
On Wed, 10 May 2017 09:48:07 +0200
Martin Steigerwald  wrote:

> Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me. 
>  

Indeed it is. It may or may not be possible to build a perfect Fsck, but IMO
for the time being, what's most sorely missing, is some sort of a knowingly
destructive repair mode, as in "I don't care about partial user data loss,
just whack the FS metadata into full logical consistency at any means
necessary".

Also feels like it doesn't currently deal with the majority of actual
in-real-world corruptions, notably the "parent transid failure" (even by a few
dozens increments) which it can only helpfully "Ignore" during repair.

So even with a minor corruption (something wonky in just ONE block of a
multi-terabyte FS) the answer is way too often "nuke the entire thing and
restore from backups".

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Stefan Priebe - Profihost AG
Hi,
Am 10.05.2017 um 09:48 schrieb Martin Steigerwald:
> Stefan Priebe - Profihost AG - 10.05.17, 09:02:
>> I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
>> something about the expected runtime or how to fix bad key ordering?
> 
> I had a similar issue which remained unresolved.
> But I clearly saw that btrfs check was running in a loop, see thread:
> [4.9] btrfs check --repair looping over file extent discount errors
> 
> So it would be interesting to see the exact output of btrfs check, maybe 
> there 
> is something like repeated numbers that also indicate a loop.

Output is just:
enabling repair mode
Checking filesystem on /dev/mapper/crypt_md0
UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
bad key ordering 39 40
checking extents [.]

even after 2,5 weeks running.

Stefan

> I was about to say that BTRFS is production ready before this issue happened. 
> I still think it for a lot of setup mostly is, as at least the "I get stuck 
> on 
> the CPU while searching for free space" issue seems to be gone since about 
> anything between 4.5/4.6 kernels. I also think so regarding absence of data 
> loss. I was able to copy over all of the data I needed of the broken 
> filesystem.
> 
> Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me. 
>  
> So unless someone has a clever idea here and shares it with you, it may be 
> needed to backup anything you can from this filesystem and then start over 
> from 
> scratch. As to my past experience something like xfs_repair surpasses btrfs 
> check in the ability to actually fix broken filesystem by a great extent.
> 
> Ciao,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Stefan Priebe - Profihost AG
Am 10.05.2017 um 09:40 schrieb Hugo Mills:
> On Wed, May 10, 2017 at 09:36:30AM +0200, Stefan Priebe - Profihost AG wrote:
>> Hello Roman,
>>
>> the FS is mountable. It just goes readonly when trying to write some data.
>>
>> The kernel msgs are:
>> BTRFS critical (device dm-2): corrupt leaf, bad key order:
>> block=163316514816,root=1, slot=39
>> BTRFS critical (device dm-2): corrupt leaf, bad key order:
>> block=163322413056,root=1, slot=74
>> BTRFS critical (device dm-2): corrupt leaf, bad key order:
>> block=163325722624,root=1, slot=63
>> BTRFS critical (device dm-2): corrupt leaf, bad key order:
>> block=163316514816,root=1, slot=39
>> BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure
>> BTRFS info (device dm-2): forced readonly
>> BTRFS info (device dm-2): delayed_refs has NO entry
> 
>Can you show the output of btrfs-debug-tree -b , where
>  is each of the three "block=" values above?

Can do that. But the lists are very long - should i upload them to
pastebin? Is it ok to remove the name atribute - which provides filenames?

Stefan


>Hugo.
> 
>> Greets,
>> Stefan
>> Am 10.05.2017 um 09:18 schrieb Roman Mamedov:
>>> On Wed, 10 May 2017 09:02:46 +0200
>>> Stefan Priebe - Profihost AG  wrote:
>>>
 how to fix bad key ordering?
>>>
>>> You should clarify does the FS in question mount (read-write? read-only?)
>>> and what are the kernel messages if it does not.
>>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Martin Steigerwald
Stefan Priebe - Profihost AG - 10.05.17, 09:02:
> I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
> something about the expected runtime or how to fix bad key ordering?

I had a similar issue which remained unresolved.

But I clearly saw that btrfs check was running in a loop, see thread:

[4.9] btrfs check --repair looping over file extent discount errors

So it would be interesting to see the exact output of btrfs check, maybe there 
is something like repeated numbers that also indicate a loop.

I was about to say that BTRFS is production ready before this issue happened. 
I still think it for a lot of setup mostly is, as at least the "I get stuck on 
the CPU while searching for free space" issue seems to be gone since about 
anything between 4.5/4.6 kernels. I also think so regarding absence of data 
loss. I was able to copy over all of the data I needed of the broken 
filesystem.

Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me.  
So unless someone has a clever idea here and shares it with you, it may be 
needed to backup anything you can from this filesystem and then start over from 
scratch. As to my past experience something like xfs_repair surpasses btrfs 
check in the ability to actually fix broken filesystem by a great extent.

Ciao,
-- 
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Hugo Mills
On Wed, May 10, 2017 at 09:36:30AM +0200, Stefan Priebe - Profihost AG wrote:
> Hello Roman,
> 
> the FS is mountable. It just goes readonly when trying to write some data.
> 
> The kernel msgs are:
> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> block=163316514816,root=1, slot=39
> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> block=163322413056,root=1, slot=74
> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> block=163325722624,root=1, slot=63
> BTRFS critical (device dm-2): corrupt leaf, bad key order:
> block=163316514816,root=1, slot=39
> BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure
> BTRFS info (device dm-2): forced readonly
> BTRFS info (device dm-2): delayed_refs has NO entry

   Can you show the output of btrfs-debug-tree -b , where
 is each of the three "block=" values above?

   Hugo.

> Greets,
> Stefan
> Am 10.05.2017 um 09:18 schrieb Roman Mamedov:
> > On Wed, 10 May 2017 09:02:46 +0200
> > Stefan Priebe - Profihost AG  wrote:
> > 
> >> how to fix bad key ordering?
> > 
> > You should clarify does the FS in question mount (read-write? read-only?)
> > and what are the kernel messages if it does not.
> > 

-- 
Hugo Mills | Welcome to Hollywood, a land just off the coast of
hugo@... carfax.org.uk | Planet Earth
http://carfax.org.uk/  |
PGP: E2AB1DE4  |The Cat's Meow


signature.asc
Description: Digital signature


Re: runtime btrfsck

2017-05-10 Thread Stefan Priebe - Profihost AG
Hello Roman,

the FS is mountable. It just goes readonly when trying to write some data.

The kernel msgs are:
BTRFS critical (device dm-2): corrupt leaf, bad key order:
block=163316514816,root=1, slot=39
BTRFS critical (device dm-2): corrupt leaf, bad key order:
block=163322413056,root=1, slot=74
BTRFS critical (device dm-2): corrupt leaf, bad key order:
block=163325722624,root=1, slot=63
BTRFS critical (device dm-2): corrupt leaf, bad key order:
block=163316514816,root=1, slot=39
BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure
BTRFS info (device dm-2): forced readonly
BTRFS info (device dm-2): delayed_refs has NO entry

Greets,
Stefan
Am 10.05.2017 um 09:18 schrieb Roman Mamedov:
> On Wed, 10 May 2017 09:02:46 +0200
> Stefan Priebe - Profihost AG  wrote:
> 
>> how to fix bad key ordering?
> 
> You should clarify does the FS in question mount (read-write? read-only?)
> and what are the kernel messages if it does not.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Marat Khalili

Hello,

(Warning: I'm a user, not a developer here.)

In my experience (on kernel 4.4) it processed larger and slower devices 
within a day, BUT according to some recent topics the runaway 
fragmentation (meaning in this case large number of small extents 
regardless of their relative location) can significantly slow down BTRFS 
operations to the point of making them infeasible. Possible reasons for 
fragmentations are snapshotting volumes too often and/or running VM 
images from BTRFS without taking some precautions.


On top of this, mount device name makes one suspicious there's another 
layer between BTRFS and hardware. Are you sure it's not the bottleneck 
in this case?


--

With Best Regards,
Marat Khalili

On 10/05/17 10:02, Stefan Priebe - Profihost AG wrote:

I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
something about the expected runtime or how to fix bad key ordering?

Greets,
Stefan

Am 06.05.2017 um 07:56 schrieb Stefan Priebe - Profihost AG:

It's still running. Is this the normal behaviour? Is there any other way
to fix the bad key ordering?

Greets,
Stefan

Am 02.05.2017 um 08:29 schrieb Stefan Priebe - Profihost AG:

Hello list,

i wanted to check an fs cause it has bad key ordering.

But btrfscheck is now running since 7 days. Current output:
# btrfsck -p --repair /dev/mapper/crypt_md0
enabling repair mode
Checking filesystem on /dev/mapper/crypt_md0
UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
bad key ordering 39 40
checking extents [O]

FS is a 12TB BTRFS Raid 0 over 3 mdadm Raid 5 devices. How long should
btrfsck run and is there any way to speed it up? btrfs tools is 4.8.5

Thanks!

Greets,
Stefan


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Roman Mamedov
On Wed, 10 May 2017 09:02:46 +0200
Stefan Priebe - Profihost AG  wrote:

> how to fix bad key ordering?

You should clarify does the FS in question mount (read-write? read-only?)
and what are the kernel messages if it does not.

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: runtime btrfsck

2017-05-10 Thread Stefan Priebe - Profihost AG
I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
something about the expected runtime or how to fix bad key ordering?

Greets,
Stefan

Am 06.05.2017 um 07:56 schrieb Stefan Priebe - Profihost AG:
> It's still running. Is this the normal behaviour? Is there any other way
> to fix the bad key ordering?
> 
> Greets,
> Stefan
> 
> Am 02.05.2017 um 08:29 schrieb Stefan Priebe - Profihost AG:
>> Hello list,
>>
>> i wanted to check an fs cause it has bad key ordering.
>>
>> But btrfscheck is now running since 7 days. Current output:
>> # btrfsck -p --repair /dev/mapper/crypt_md0
>> enabling repair mode
>> Checking filesystem on /dev/mapper/crypt_md0
>> UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
>> bad key ordering 39 40
>> checking extents [O]
>>
>> FS is a 12TB BTRFS Raid 0 over 3 mdadm Raid 5 devices. How long should
>> btrfsck run and is there any way to speed it up? btrfs tools is 4.8.5
>>
>> Thanks!
>>
>> Greets,
>> Stefan
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: record_orphan_data_extent: Fix memory leak

2017-05-10 Thread Praveen K Pandey
The nodes of the list at btrfs_root->orphan_data_extents
are never freed causing memory to be leaked. This commit
fixes the bug by freeing the nodes on all the
btrfs_root->orphan_data_extents list.

Signed-off-by: Praveen K Pandey 
---
 cmds-check.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 0014bc2..cac94a8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -9286,6 +9286,9 @@ static int check_extent_refs(struct btrfs_root *root,
 {
struct extent_record *rec;
struct cache_extent *cache;
+   struct btrfs_fs_info *fs_info;
+   struct btrfs_root *root_orphan_extent;
+   struct rb_node *rb_node;
int ret = 0;
int had_dups = 0;
 
@@ -9439,6 +9442,26 @@ static int check_extent_refs(struct btrfs_root *root,
   rec->start + rec->max_size - 1);
free(rec);
}
+   /*
+* Release memory of oprhan_data_extent
+* which allocated  while traversing in orphan_data_extents
+*/
+
+   fs_info = root->fs_info;
+   rb_node = rb_first(_info->fs_root_tree);
+   while (rb_node) {
+   root_orphan_extent = container_of(rb_node,
+   struct btrfs_root, rb_node);
+   
free_orphan_data_extents(_orphan_extent->orphan_data_extents);
+   rb_node  = rb_next(rb_node);
+   }
+   free_orphan_data_extents(_info->extent_root->orphan_data_extents);
+   free_orphan_data_extents(_info->tree_root->orphan_data_extents);
+   free_orphan_data_extents(_info->chunk_root->orphan_data_extents);
+   free_orphan_data_extents(_info->dev_root->orphan_data_extents);
+   free_orphan_data_extents(_info->csum_root->orphan_data_extents);
+   if (fs_info->quota_enabled)
+   
free_orphan_data_extents(_info->quota_root->orphan_data_extents);
 repair_abort:
if (repair) {
if (ret && ret != -EAGAIN) {
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html