Re: [PATCH] btrfs: Handle delalloc error correctly to avoid deadlock

2017-02-20 Thread Qu Wenruo

Please ignore this patch.

This patch doesn't handle outstanding_extents well.

Thanks,
Qu


[PATCH] btrfs: Handle delalloc error correctly to avoid deadlock

2017-02-20 Thread Qu Wenruo
If btrfs/125 is run with the nospace_cache or space_cache=v2 mount option,
btrfs will block with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

The direct cause is that the error handler in run_delalloc_nocow() doesn't
handle errors from btrfs_reloc_clone_csums() well.

The error handler of run_delalloc_nocow() will clear the dirty flag and
finish the IO for the pages in that extent. However we have already
inserted one ordered extent, and that ordered extent relies on endio hooks
to wait for all its pages to finish, while only the first page will finish.

This makes that ordered extent never finish, blocking the whole file
system.

Although the root cause is still in RAID5/6, it won't hurt to fix the
error routine first.

This patch slightly modifies an existing function,
btrfs_endio_direct_write_update_ordered(), to handle free space inodes,
just like btrfs_writepage_end_io_hook() does.

It is then used as the base for a new inline function,
btrfs_cleanup_ordered_extents(), which handles the error in
run_delalloc_nocow() and cow_file_range().

The compression path calls writepage_end_io_hook() itself to handle its
errors, and any ordered extent it submits will have its bio submitted, so
there is no need to worry about the compression part.

Suggested-by: Filipe Manana 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 58 ++--
 1 file changed, 44 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e861a063721..e2e2267b9a73 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -113,6 +113,24 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
   u64 block_start, u64 block_len,
   u64 orig_block_len, u64 ram_bytes,
   int type);
+static void btrfs_endio_write_update_ordered(struct inode *inode,
+const u64 offset,
+const u64 bytes,
+const int uptodate);
+/*
+ * Set the error bit and clean up all ordered extents in the specified range of @inode.
+ *
+ * This is for the error case where ordered extent(s) were submitted but the
+ * corresponding bio was not submitted.
+ * This can make waiters on such an ordered extent never finish, as there is
+ * no endio hook called to finish it.
+ */
+static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
+const u64 start,
+const u64 len)
+{
+   return btrfs_endio_write_update_ordered(inode, start, len, 0);
+}
 
 static int btrfs_dirty_inode(struct inode *inode);
 
@@ -1096,6 +1114,7 @@ static noinline int cow_file_range(struct inode *inode,
 EXTENT_DELALLOC | EXTENT_DEFRAG,
 PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
 PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
+   btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
goto out;
 }
 
@@ -1538,7 +1557,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
if (!ret)
ret = err;
 
-   if (ret && cur_offset < end)
+   if (ret && cur_offset < end) {
extent_clear_unlock_delalloc(inode, cur_offset, end, end,
 locked_page, EXTENT_LOCKED |
 EXTENT_DELALLOC | EXTENT_DEFRAG |
@@ -1546,6 +1565,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 PAGE_CLEAR_DIRTY |
 PAGE_SET_WRITEBACK |
 PAGE_END_WRITEBACK);
+   btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
+   }
btrfs_free_path(path);
return ret;
 }
@@ -8185,17 +8206,27 @@ static void btrfs_endio_direct_read(struct bio *bio)
bio_put(bio);
 }
 
-static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
-   const u64 offset,
-   const u64 bytes,
-   const int uptodate)
+static void btrfs_endio_write_update_ordered(struct inode *inode,
+
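The hang itself is the classic waiter-with-no-completer pattern: btrfs_start_ordered_extent() waits for an endio that never runs because the bio was never submitted. A tiny self-contained pthread analogy of what the blocked worker in the backtrace is doing (illustrative only, not btrfs code; all names are made up):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * The waiter blocks until "complete" is set, but the error path returns
 * before the work that would set it is ever started. This mirrors waiting
 * on an ordered extent whose bio was never submitted, so no endio handler
 * ever runs to complete it. This program intentionally hangs.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool complete = false;
static bool submit_failed = true;   /* simulate the error path */

static void *waiter(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!complete)
                pthread_cond_wait(&cond, &lock);   /* blocks forever */
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void submit_work(void)
{
        if (submit_failed)
                return;     /* bug: nobody will ever set "complete" */
        pthread_mutex_lock(&lock);
        complete = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, waiter, NULL);
        submit_work();
        pthread_join(t, NULL);      /* hangs, like the umount above */
        puts("never reached");
        return 0;
}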

Re: mount troubles after crash

2017-02-20 Thread Qu Wenruo



At 02/20/2017 10:13 PM, Patrick Schmid wrote:

Hi together,

during a balance run, the server crashed. After the crash I got the
attached error messages.

Kernel: 4.9.11


This is a known bug when qgroup is enabled.
It was fixed in the v4.10-rcs.

Please mount using a v4.10 kernel and it will mount without problems.

Thanks,
Qu


Btrfs-Tools: v4.9.1

btrfs check --repair failed with:

couldn't open RDWR because of unsupported option features (3).
Open ctree failed


Is there a way to fix this 140TB filesystem?

Regards Patrick





Re: [PATCH] qgroup: Retry after commit on getting EDQUOT

2017-02-20 Thread Qu Wenruo



At 02/20/2017 08:51 PM, Goldwyn Rodrigues wrote:



On 02/19/2017 11:35 PM, Qu Wenruo wrote:



At 02/20/2017 12:35 PM, Goldwyn Rodrigues wrote:

Hi Qu,

On 02/19/2017 09:07 PM, Qu Wenruo wrote:



At 02/19/2017 09:30 PM, Goldwyn Rodrigues wrote:

From: Goldwyn Rodrigues 

We are facing the same problem with EDQUOT that was experienced with
ENOSPC. I'm not sure if we require a full ticketing system such as the one
for ENOSPC, but here is a fix. Let me know if it is too big a hammer.

Quotas are reserved at the start of an operation, incrementing
qg->reserved. However, they are written to disk in a commit_transaction,
which could take as long as commit_interval. In the meantime there
could be deletions which are not accounted for, because deletions are
accounted for only when committed (free_refroot). So, when we get
an EDQUOT, flush the data to disk and try again.


IIRC Jeff submitted a qgroup patch to allow unlink to be done even when we
have already hit EDQUOT.
https://patchwork.kernel.org/patch/9537193/

I think that's a better solution, which is quite like what we do to
handle ENOSPC.



These are two separate issues. This issue is where qg->reserved gets
over-inflated because of repeated deletions and recreations within a
commit_interval.


My fault, those are indeed two different bugs.

And I succeeded in reproducing the bug.


Jeff's patch deals with allowing removal even if the quota is exceeded
so that eventually there can be space freed.

I would suggest you apply Jeff's patch and try to run the script I have
presented.



Thanks,
Qu



I combined the conditions of rfer and excl to reduce code lines, though
the condition looks uglier.

Here is a sample script which shows this issue.

DEVICE=/dev/vdb
MOUNTPOINT=/mnt
TESTVOL=$MOUNTPOINT/tmp
QUOTA=5
PROG=btrfs
DD_BS="4k"
DD_COUNT="256"
RUN_TIMES=5000

mkfs.btrfs -f $DEVICE
mount -o commit=240 $DEVICE $MOUNTPOINT
$PROG subvolume create $TESTVOL
$PROG quota enable $TESTVOL
$PROG qgroup limit ${QUOTA}G $TESTVOL

typeset -i DD_RUN_GOOD
typeset -i QUOTA

function _check_cmd() {
if [[ ${?} -gt 0 ]]; then
echo -n "$(date) E: Running previous command"
echo ${*}
echo "Without sync"
$PROG qgroup show -pcreFf ${TESTVOL}
echo "With sync"
$PROG qgroup show -pcreFf --sync ${TESTVOL}
exit 1
fi
}

while true; do
DD_RUN_GOOD=$RUN_TIMES

while (( ${DD_RUN_GOOD} != 0 )); do
dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
_check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
DD_RUN_GOOD=$((${DD_RUN_GOOD}-1))
done

$PROG qgroup show -pcref $TESTVOL
echo "--- Cleanup -- "
rm $TESTVOL/quotatest*

done


It would be better if we could reduce the reproduction time and submit it
as an fstests test case.


Yes, unfortunately, it is more probabilistic than deterministic. But
yes, reducing the size and increasing the commit_interval can improve
the time.





Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/qgroup.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4700cac..9ace407 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2093,6 +2093,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes)
 struct btrfs_fs_info *fs_info = root->fs_info;
 u64 ref_root = root->root_key.objectid;
 int ret = 0;
+int retried = 0;
 struct ulist_node *unode;
 struct ulist_iterator uiter;

@@ -2101,7 +2102,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes)

 if (num_bytes == 0)
 return 0;
-
+retry:
 spin_lock(&fs_info->qgroup_lock);
 quota_root = fs_info->quota_root;
 if (!quota_root)
@@ -2127,16 +2128,21 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes)

 qg = u64_to_ptr(unode->aux);

-if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
-qg->reserved + (s64)qg->rfer + num_bytes >
-qg->max_rfer) {
-ret = -EDQUOT;
-goto out;
-}
-
-if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) &&
-qg->reserved + (s64)qg->excl + num_bytes >
-qg->max_excl) {
+if (((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
+  qg->reserved + (s64)qg->rfer + num_bytes > qg->max_rfer) ||
+((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) &&
+qg->reserved + (s64)qg->excl + num_bytes > qg->max_excl)) {
+if (!retried && qg->reserved > 0) {


Does the qg->reserved > 0 check have any extra meaning here?



qg->reserved greater than zero means the qgroup tree is dirty and needs
to be committed to disk. It may or may not have quota reductions, which
can only be confirmed after free_refroot is called in commit. While it
may be a small probability, it will save I/O processing when the
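For readers skimming the thread, the core idea of the patch is: on EDQUOT, commit the transaction once so that pending frees land (free_refroot), then retry the reservation. A minimal userspace sketch of that control flow, with stubbed helpers and hypothetical names (the real logic lives in fs/btrfs/qgroup.c):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Stub state standing in for a qgroup; numbers are made up. */
static long reserved = 900;     /* bytes reserved but not yet committed */
static long limit = 1000;

static int try_reserve(long bytes)
{
        if (reserved + bytes > limit)
                return -EDQUOT;
        reserved += bytes;
        return 0;
}

/* A commit lets pending frees land (free_refroot in the real code);
 * here we simply pretend earlier deletions released the reservation. */
static void commit_transaction(void)
{
        reserved = 0;
}

static int qgroup_reserve_sketch(long bytes)
{
        bool retried = false;
retry:
        if (try_reserve(bytes) == -EDQUOT) {
                if (!retried && reserved > 0) {
                        commit_transaction();
                        retried = true;
                        goto retry;
                }
                return -EDQUOT;
        }
        return 0;
}

int main(void)
{
        /* First attempt hits EDQUOT, the commit clears the stale
         * reservation, and the retry succeeds. */
        printf("reserve -> %d\n", qgroup_reserve_sketch(200));
        return 0;
}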

Re: [PATCH] btrfs: Handle btrfs_reloc_clone_csums error correctly to avoid deadlock

2017-02-20 Thread Qu Wenruo



At 02/20/2017 09:43 PM, Filipe Manana wrote:

On Mon, Feb 20, 2017 at 12:31 AM, Qu Wenruo  wrote:



At 02/17/2017 11:25 PM, Filipe Manana wrote:


On Fri, Feb 17, 2017 at 12:37 AM, Qu Wenruo 
wrote:


In case of an error the endio handler is called only for the first
page, and not for the whole range for which one or more ordered extents
were created. So the ordered extents will be around forever.



That's the same thing I'm trying to fix.

What I'm asking is the reason for cleaning up previously created ordered
extents, which didn't encounter any error.


Any error in the loop requires cleaning up previously created ordered
extents, because if that function returns an error no bio is submitted
and therefore the ordered extents won't be removed, except for the
case where only one ordered extent is created and the delalloc range
is a single page.








Or is this designed behavior for things like fsync?
Success for all, or failure if any fails?



This is unrelated to fsync. It's about leaving behind ordered extents
that any task can find and hang on forever.



However in the case of btrfs_reloc_clone_csum(), if it fails, there is only
one ordered extent that really needs cleaning.


The problem is more generic and you can forget about
btrfs_reloc_clone_csum(). It's about the delalloc callback returning
an error and one or more ordered extents were created by past
iterations of the loop before the error happened.


Still, the only possible problem we can hit after submitting an ordered
extent is btrfs_reloc_clone_csum().






Previously created ordered extents can finish without problem, so we only
need to clean up the last created one; there is no need to clean up all the
ordered extents that were created.

For a more specific example of this case:

We are relocating one data block group, whose size is 1G, containing 8 128M
file extents.

run_delalloc_nocow() is called on DATA_RELOC root inode 257, with range
[0, 1G).

In the first loop iteration, it handles [0, 128M) without problem.
The 1st ordered extent is submitted and can finish even before the 2nd
ordered extent is submitted.


The ordered extent is submitted but not a bio for it. It's when the
bio completes (writeback finishes) that the endio callback is invoked,
which wakes up every task waiting on the ordered extent, sets IO
errors bit if needed, removes the ordered extent, etc.



Then when we are submitting the 2nd ordered extent, ranging over
[128M, 256M), btrfs_reloc_clone_csum() fails. We must clean up at least the
2nd ordered extent or it will hang forever.

If I understand you right, you mean we must clean up all ordered extents in
the range [0, 1G).


Yes.



My point is, since the 1st ordered extent can be submitted and finished
before the 2nd ordered extent's submission without problem, I didn't see
the point of cleaning up the 1st ordered extent and the non-existent
3rd~8th ordered extents.


As said before in this reply, if the delalloc callback
(cow_file_range, run_delalloc_nocow) returns an error, no bios are
submitted for any ordered extents it might have created before an
error happened. Check __extent_writepage() - it calls
writepage_delalloc() which calls our delalloc callback, and if that
returns an error, it won't call __extent_writepage_io(), which is what
submits bios.
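A condensed model of that call chain (stubbed userspace sketch for illustration; the real flow is in fs/btrfs/extent_io.c):

#include <stdio.h>

struct page_sketch {
        int fail_delalloc;  /* simulate btrfs_reloc_clone_csums() failing */
};

/* Stands in for the delalloc callback (cow_file_range() /
 * run_delalloc_nocow()); it may create ordered extents and then fail. */
static int writepage_delalloc_sketch(struct page_sketch *p)
{
        puts("ordered extent(s) created");
        return p->fail_delalloc ? -5 /* -EIO */ : 0;
}

/* Stands in for __extent_writepage_io(): the only place bios are
 * submitted, whose endio handlers complete the ordered extents. */
static int extent_writepage_io_sketch(struct page_sketch *p)
{
        (void)p;
        puts("bios submitted; endio will complete the ordered extents");
        return 0;
}

static int extent_writepage_sketch(struct page_sketch *p)
{
        int ret = writepage_delalloc_sketch(p);

        if (ret)        /* error: the bios below are never submitted, so
                         * any ordered extents created above never finish */
                return ret;
        return extent_writepage_io_sketch(p);
}

int main(void)
{
        struct page_sketch p = { .fail_delalloc = 1 };

        printf("writepage -> %d\n", extent_writepage_sketch(&p));
        return 0;
}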



Or did I miss or misunderstand something else?


You missed a lot of things I suppose, namely that bios aren't
submitted if the delalloc function returns an error.


That's the point. I misunderstood the bio submission timing, so in that
case you're right about cleaning up the whole range.


I'll update the patch to address it.

Thanks,
Qu



Honestly I don't know how to explain things better to you, and was
hoping this was easier to understand using the direct IO write error
path as an example.

thanks



Thanks,
Qu






thanks



Thanks,
Qu





In that case, since we haven't unlocked the pages/extent, there will be no
race nor new ordered extent, and we are ensured to have only one ordered
extent needing cleanup.



Also, for any created ordered extent, you want to set the bit
BTRFS_ORDERED_IOERR on it before removing it, since there might be
concurrent tasks waiting for it that do error handling (like an fsync,
since when running delalloc the inode isn't locked).





Fsync is not affected IIRC, as btrfs_reloc_clone_csum() is only called
for the DATA_RELOC tree inode, which we don't call fsync on.



Look at the direct IO write path error handling and
btrfs_endio_direct_write_update_ordered() for a very similar scenario.
And the same problem happens in the cow case
(inode.c:cow_file_range()).





BTW, I have already done it in cow_file_range(); check the beginning
lines of the inode.c changes in this patch.




Indeed. I missed it.

Thanks.



Thanks,
Qu



Thanks.


+   }
btrfs_free_path(path);
return ret;
 }
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 041c3326d109..dba1cf3464a7 100644
--- a/fs/btrfs/ordered-data.c
+++ 

[PATCH 2/4] btrfs: document existence of extent_io ops callbacks

2017-02-20 Thread David Sterba
Some of the callbacks defined in btree_extent_io_ops and
btrfs_extent_io_ops always exist, so we don't need to check their
existence before each call. This patch just reorders the definitions and
documents which are mandatory/optional.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c   |  7 +--
 fs/btrfs/extent_io.h | 23 ---
 fs/btrfs/inode.c |  7 +--
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2b06f557c176..0715b6f3f686 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4653,9 +4653,12 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info)
 }
 
 static const struct extent_io_ops btree_extent_io_ops = {
-   .readpage_end_io_hook = btree_readpage_end_io_hook,
-   .readpage_io_failed_hook = btree_io_failed_hook,
+   /* mandatory callbacks */
.submit_bio_hook = btree_submit_bio_hook,
+   .readpage_end_io_hook = btree_readpage_end_io_hook,
/* note we're sharing with inode.c for the merge bio hook */
.merge_bio_hook = btrfs_merge_bio_hook,
+
+   /* optional callbacks */
+   .readpage_io_failed_hook = btree_io_failed_hook,
 };
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index fbc92315b503..5c5e2e6cfb9e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -91,18 +91,27 @@ typedef int (extent_submit_bio_hook_t)(struct inode *inode, struct bio *bio,
   int mirror_num, unsigned long bio_flags,
   u64 bio_offset);
 struct extent_io_ops {
-   int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
-u64 start, u64 end, int *page_started,
-unsigned long *nr_written);
-   int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
+   /*
+* The following callbacks must be always defined, the function
+* pointer will be called unconditionally.
+*/
extent_submit_bio_hook_t *submit_bio_hook;
+   int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
+   struct page *page, u64 start, u64 end,
+   int mirror);
int (*merge_bio_hook)(struct page *page, unsigned long offset,
  size_t size, struct bio *bio,
  unsigned long bio_flags);
+
+   /*
+* Optional hooks, called if the pointer is not NULL
+*/
+   int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
+u64 start, u64 end, int *page_started,
+unsigned long *nr_written);
int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
-   int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
-   struct page *page, u64 start, u64 end,
-   int mirror);
+
+   int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
  struct extent_state *state, int uptodate);
void (*set_bit_hook)(struct inode *inode, struct extent_state *state,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eafadf0851d1..72faf9b5616a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10541,10 +10541,13 @@ static const struct file_operations btrfs_dir_file_operations = {
 };
 
 static const struct extent_io_ops btrfs_extent_io_ops = {
-   .fill_delalloc = run_delalloc_range,
+   /* mandatory callbacks */
.submit_bio_hook = btrfs_submit_bio_hook,
-   .merge_bio_hook = btrfs_merge_bio_hook,
.readpage_end_io_hook = btrfs_readpage_end_io_hook,
+   .merge_bio_hook = btrfs_merge_bio_hook,
+
+   /* optional callbacks */
+   .fill_delalloc = run_delalloc_range,
.writepage_end_io_hook = btrfs_writepage_end_io_hook,
.writepage_start_hook = btrfs_writepage_start_hook,
.set_bit_hook = btrfs_set_bit_hook,
-- 
2.10.1
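The mandatory/optional split maps to a simple call-site convention, sketched below with hypothetical names (not the kernel types):

#include <stdio.h>

struct io_ops_sketch {
        /* Mandatory: call sites may invoke this unconditionally. */
        int (*submit_hook)(int arg);
        /* Optional: call sites must NULL-check before calling. */
        int (*failed_hook)(int arg);
};

static int my_submit(int arg)
{
        printf("submit %d\n", arg);
        return 0;
}

static const struct io_ops_sketch ops = {
        .submit_hook = my_submit,
        .failed_hook = NULL,            /* this instance opts out */
};

int main(void)
{
        ops.submit_hook(1);             /* no check needed */
        if (ops.failed_hook)            /* optional: check first */
                ops.failed_hook(2);
        return 0;
}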



[PATCH 4/4] btrfs: add dummy callback for readpage_io_failed and drop checks

2017-02-20 Thread David Sterba
Make the extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed-IO
callback is not performance critical, the branch vs extra call trade-off
does not hurt.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c   | 2 +-
 fs/btrfs/extent_io.c | 2 +-
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/inode.c | 7 +++
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0715b6f3f686..fbf4921f4d60 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4658,7 +4658,7 @@ static const struct extent_io_ops btree_extent_io_ops = {
.readpage_end_io_hook = btree_readpage_end_io_hook,
/* note we're sharing with inode.c for the merge bio hook */
.merge_bio_hook = btrfs_merge_bio_hook,
+   .readpage_io_failed_hook = btree_io_failed_hook,
 
/* optional callbacks */
-   .readpage_io_failed_hook = btree_io_failed_hook,
 };
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f5cff93ab152..eaee7bb2ff7c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2578,7 +2578,7 @@ static void end_bio_extent_readpage(struct bio *bio)
if (likely(uptodate))
goto readpage_ok;
 
-   if (tree->ops && tree->ops->readpage_io_failed_hook) {
+   if (tree->ops) {
ret = tree->ops->readpage_io_failed_hook(page, mirror);
if (!ret && !bio->bi_error)
uptodate = 1;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5c5e2e6cfb9e..63c8cc970b1c 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -102,6 +102,7 @@ struct extent_io_ops {
int (*merge_bio_hook)(struct page *page, unsigned long offset,
  size_t size, struct bio *bio,
  unsigned long bio_flags);
+   int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
 
/*
 * Optional hooks, called if the pointer is not NULL
@@ -109,7 +110,6 @@ struct extent_io_ops {
int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
 u64 start, u64 end, int *page_started,
 unsigned long *nr_written);
-   int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
 
int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 72faf9b5616a..a74191fa3934 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10503,6 +10503,12 @@ static int btrfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 }
 
+__attribute__((const))
+static int dummy_readpage_io_failed_hook(struct page *page, int failed_mirror)
+{
+   return 0;
+}
+
 static const struct inode_operations btrfs_dir_inode_operations = {
.getattr= btrfs_getattr,
.lookup = btrfs_lookup,
@@ -10545,6 +10551,7 @@ static const struct extent_io_ops btrfs_extent_io_ops = {
.submit_bio_hook = btrfs_submit_bio_hook,
.readpage_end_io_hook = btrfs_readpage_end_io_hook,
.merge_bio_hook = btrfs_merge_bio_hook,
+   .readpage_io_failed_hook = dummy_readpage_io_failed_hook,
 
/* optional callbacks */
.fill_delalloc = run_delalloc_range,
-- 
2.10.1
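The dummy-callback idea in miniature: instead of NULL-checking an optional hook at every call site, install a no-op default so the pointer can be called unconditionally. An illustrative userspace sketch with hypothetical names:

#include <stdio.h>

struct ops_sketch {
        int (*failed_hook)(int page);
};

/* No-op default; lets call sites drop the NULL check entirely. */
static int dummy_failed_hook(int page)
{
        (void)page;
        return 0;
}

static const struct ops_sketch ops = {
        .failed_hook = dummy_failed_hook,
};

int main(void)
{
        /* Called unconditionally: costs one indirect call instead of a
         * branch, which is fine because the failure path is cold. */
        printf("hook -> %d\n", ops.failed_hook(42));
        return 0;
}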



[PATCH 0/4] Cleanups in extent_io callbacks

2017-02-20 Thread David Sterba
Some of the checks for extent_io callbacks can be safely dropped as they're
always defined, plus some dummy callback additions so more checks can be
dropped. There's more potential for the same cleanup in other callbacks, but
this would need more evaluation of whether dummy callbacks vs existence
checks are really worth it.

David Sterba (4):
  btrfs: let writepage_end_io_hook return void
  btrfs: document existence of extent_io ops callbacks
  btrfs: drop checks for mandatory extent_io_ops callbacks
  btrfs: add dummy callback for readpage_io_failed and drop checks

 fs/btrfs/disk-io.c   |  7 +--
 fs/btrfs/extent_io.c | 18 +++---
 fs/btrfs/extent_io.h | 25 +
 fs/btrfs/inode.c | 20 ++--
 4 files changed, 43 insertions(+), 27 deletions(-)


[PATCH 3/4] btrfs: drop checks for mandatory extent_io_ops callbacks

2017-02-20 Thread David Sterba
We know that readpage_end_io_hook, submit_bio_hook and merge_bio_hook are
always defined, so we can drop the checks before we call them.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8de29aa4d1a2..f5cff93ab152 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2565,8 +2565,7 @@ static void end_bio_extent_readpage(struct bio *bio)
len = bvec->bv_len;
 
mirror = io_bio->mirror_num;
-   if (likely(uptodate && tree->ops &&
-  tree->ops->readpage_end_io_hook)) {
+   if (likely(uptodate && tree->ops)) {
ret = tree->ops->readpage_end_io_hook(io_bio, offset,
  page, start, end,
  mirror);
@@ -2728,7 +2727,7 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
bio->bi_private = NULL;
bio_get(bio);
 
-   if (tree->ops && tree->ops->submit_bio_hook)
+   if (tree->ops)
ret = tree->ops->submit_bio_hook(page->mapping->host, bio,
   mirror_num, bio_flags, start);
else
@@ -2743,7 +2742,7 @@ static int merge_bio(struct extent_io_tree *tree, struct page *page,
 unsigned long bio_flags)
 {
int ret = 0;
-   if (tree->ops && tree->ops->merge_bio_hook)
+   if (tree->ops)
ret = tree->ops->merge_bio_hook(page, offset, size, bio,
bio_flags);
return ret;
-- 
2.10.1



[PATCH 1/4] btrfs: let writepage_end_io_hook return void

2017-02-20 Thread David Sterba
There's no error path in any of the instances, always return 0.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 9 +++--
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/inode.c | 6 ++
 3 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d15b5ddb6732..8de29aa4d1a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2435,12 +2435,9 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 
	tree = &BTRFS_I(page->mapping->host)->io_tree;
 
-   if (tree->ops && tree->ops->writepage_end_io_hook) {
-   ret = tree->ops->writepage_end_io_hook(page, start,
-  end, NULL, uptodate);
-   if (ret)
-   uptodate = 0;
-   }
+   if (tree->ops && tree->ops->writepage_end_io_hook)
+   tree->ops->writepage_end_io_hook(page, start, end, NULL,
+   uptodate);
 
if (!uptodate) {
ClearPageUptodate(page);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 270d03be290e..fbc92315b503 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -103,7 +103,7 @@ struct extent_io_ops {
int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
struct page *page, u64 start, u64 end,
int mirror);
-   int (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
+   void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
  struct extent_state *state, int uptodate);
void (*set_bit_hook)(struct inode *inode, struct extent_state *state,
 unsigned *bits);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dae2734a725b..eafadf0851d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2977,7 +2977,7 @@ static void finish_ordered_fn(struct btrfs_work *work)
btrfs_finish_ordered_io(ordered_extent);
 }
 
-static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+static void btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
struct extent_state *state, int uptodate)
 {
struct inode *inode = page->mapping->host;
@@ -2991,7 +2991,7 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
ClearPagePrivate2(page);
	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
	end - start + 1, uptodate))
-   return 0;
+   return;
 
if (btrfs_is_free_space_inode(inode)) {
wq = fs_info->endio_freespace_worker;
@@ -3004,8 +3004,6 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
	btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
	NULL);
	btrfs_queue_work(wq, &ordered_extent->work);
-
-   return 0;
 }
 
 static int __readpage_endio_check(struct inode *inode,
-- 
2.10.1



[PATCH v3] btrfs-progs: RAID5: Inject data stripe corruption and verify scrub fixes it and retains correct parity.

2017-02-20 Thread Lakshmipathi.G
Test script to create a file with a specific data-stripe layout and compute
stripe locations. It injects corruption into a data stripe and verifies that
the scrubbing process fixes the corrupted block. It also validates the
corresponding parity stripe.

Signed-off-by: Lakshmipathi.G 
---
 .../test.sh| 402 +
 1 file changed, 402 insertions(+)
 create mode 100755 tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh

diff --git a/tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh b/tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh
new file mode 100755
index 000..0129a75
--- /dev/null
+++ b/tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh
@@ -0,0 +1,402 @@
+#!/bin/bash
+#
+# Raid5: Inject data stripe corruption and fix them using scrub.
+# 
+# Script will perform the following:
+# 1) Create Raid5 using 3 loopback devices.
+# 2) Ensure file layout is created in a predictable manner. 
+#Each data stripe (64KB) should uniquely start with 'DN',
+#where N represents the data stripe number (ex: D0, D1, etc.).
+# 3) Once the file is created with the specific layout, check whether the
+#file has a single extent. At the moment, the script won't handle
+#multi-extent files.
+# 4) If the file has a single extent, compute the data and parity stripe
+#details (device name, position and actual on-disk data) with the help
+#of 'dump-tree'.
+# 5) The $STRIPEINFO_COMPLETE file will have all necessary data at this stage.
+# 6) Inject corruption into the given data stripe by zeroing out its first
+#4 bytes.
+# 7) After injecting corruption, running online scrub is expected to fix
+#the corrupted data stripe with the help of the parity block and the
+#corresponding data stripe.
+# 8) If scrub succeeds, verify the data stripe has its original
+#un-corrupted value.
+# 9) If scrub succeeds, verify the parity stripe is valid, otherwise it's
+#a parity bug.
+# 10) If no issues are found, clean up files and devices. Repeat the process
+#for different file sizes and data stripes.
+#
+#  Note: This script corrupts only data-stripe blocks. 
+#  Known Limitations (will be addressed later):
+# - Script expects an even number of data stripes in the file.
+# - Script expects the file to have a single extent.
+
+source $TOP/tests/common
+
+check_prereq btrfs
+check_prereq mkfs.btrfs
+check_prereq btrfs-debugfs
+
+setup_root_helper
+prepare_test_dev 1024M
+
+ndevs=3
+declare -a devs
+declare -a parity_offsets
+stripe_entry=""
+device_name=""
+stripe_offset=""
+stripe_content=""
+
+#Complete stripe layout
+STRIPEINFO_COMPLETE=$(mktemp --tmpdir btrfs-progs-raid5-stripe-complete.infoXX)
+#dump-tree output file
+DUMPTREE_OUTPUT=$(mktemp --tmpdir btrfs-progs-raid5-tree-dump.infoXX)
+#tmp files
+STRIPEINFO_PARTIAL=$(mktemp --tmpdir btrfs-progs-raid5-stripe-partial.infoXX)
+STRIPE_TMP=$(mktemp --tmpdir btrfs-progs-raid5-stripetmp.infoXX)
+MULTI_EXTENT_CHECK=$(mktemp --tmpdir btrfs-progs-raid5-extent-check.infoXX)
+EXTENT_WITH_SIZE=$(mktemp --tmpdir btrfs-progs-raid5-extent-size.infoXX)
+PARITY_LOC1=$(mktemp --tmpdir btrfs-progs-raid5-parity-loc1.infoXX)
+PARITY_LOCATION=$(mktemp --tmpdir btrfs-progs-raid5-parity-locations.infoXX)
+
+
+prepare_devices()
+{
+   for i in `seq $ndevs`; do
+   touch img$i
+   chmod a+rw img$i
+   truncate -s0 img$i
+   truncate -s512M img$i
+   devs[$i]=`run_check_stdout $SUDO_HELPER losetup --find --show img$i`
+   done
+   truncate -s0 $STRIPE_TMP
+   truncate -s0 $STRIPEINFO_PARTIAL
+   truncate -s0 $STRIPEINFO_COMPLETE
+}
+
+cleanup_devices()
+{
+   for dev in ${devs[@]}; do
+   run_check $SUDO_HELPER losetup -d $dev
+   done
+   for i in `seq $ndevs`; do
+   truncate -s0 img$i
+   done
+   run_check $SUDO_HELPER losetup --all
+}
+
+test_do_mkfs()
+{
+   run_check $SUDO_HELPER $TOP/mkfs.btrfs -f   \
+   $@
+}
+
+test_mkfs_multi()
+{
+   test_do_mkfs $@ ${devs[@]}
+}
+
+#$1 Filename
+#$2 Expected no.of data stripes for the file.
+create_layout(){
+   fname=$1
+   size=$(( $2 * 65536 ))
+   n=0
+   bs_value=1
+   stripe=0
+   while (( $n < $size ))
+   do
+   if [ $(( $n % 65536 )) -eq 0 ]; then
+   val='D'$stripe
+   echo -n $val
+   stripe=$(( $stripe+1 ))
+   # ensure proper value   
+   bs_value=`echo "${#val}"` 
+   else
+   echo -n 'x'
+   bs_value=1
+   fi
+n=$(( $n+$bs_value ))
+   done | dd of=/tmp/$fname bs=$bs_value conv=notrunc &> /dev/null
+   cp /tmp/$fname $TEST_MNT
+   rm -f /tmp/$fname
+}
+
+# $1,$2,$3 - available device offsets
+find_parity_device(){
+   
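The stripe bookkeeping above boils down to modular arithmetic. A minimal C sketch of one common rotating-parity scheme, using the script's 3 devices and 64KiB stripes (illustrative only; the rotation shown is an assumption, and the script correctly derives the real device/offset pairs from dump-tree output rather than assuming one):

#include <stdio.h>

#define NDEVS      3        /* 3 loop devices, as in the script */
#define STRIPE_LEN 65536    /* 64KiB btrfs stripe length */

/*
 * For full stripe number fs_nr, return which device holds parity under
 * a simple rotating-parity scheme. This illustrates the arithmetic only;
 * actual btrfs placement must be read from dump-tree, as the script does.
 */
static int parity_dev(long fs_nr)
{
        return fs_nr % NDEVS;
}

int main(void)
{
        long file_off = 300000;                  /* byte offset in file */
        long dstripe  = file_off / STRIPE_LEN;   /* data stripe number */
        long fs_nr    = dstripe / (NDEVS - 1);   /* full stripe number */

        printf("data stripe %ld, full stripe %ld, parity on dev %d\n",
               dstripe, fs_nr, parity_dev(fs_nr));
        return 0;
}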

[PATCH 2/3] btrfs: handle allocation error in update_dev_stat_item

2017-02-20 Thread David Sterba
Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1fac98728814..64d6665f6eda 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6954,7 +6954,8 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans,
key.offset = device->devid;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
	ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
if (ret < 0) {
btrfs_warn_in_rcu(fs_info,
-- 
2.10.1



[PATCH 1/3] btrfs: remove BUG_ON from __tree_mod_log_insert

2017-02-20 Thread David Sterba
All callers dereference the 'tm' parameter before it gets to this
function, so the NULL check does not make much sense here.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 1192bc7d2ee7..2c3c943bfcdc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -453,8 +453,6 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
struct rb_node *parent = NULL;
struct tree_mod_elem *cur;
 
-   BUG_ON(!tm);
-
tm->seq = btrfs_inc_tree_mod_seq(fs_info);
 
	tm_root = &fs_info->tree_mod_log;
-- 
2.10.1



[PATCH 3/3] btrfs: do proper error handling in btrfs_insert_xattr_item

2017-02-20 Thread David Sterba
The space check in btrfs_insert_xattr_item is duplicated in its caller
(do_setxattr), so we won't hit the BUG_ON. Continuing without any check
could be disastrous, so turn it into proper error handling.

Signed-off-by: David Sterba 
---
 fs/btrfs/dir-item.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 724504a2d7ac..640801082533 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -80,7 +80,8 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
struct extent_buffer *leaf;
u32 data_size;
 
-   BUG_ON(name_len + data_len > BTRFS_MAX_XATTR_SIZE(root->fs_info));
+   if (name_len + data_len > BTRFS_MAX_XATTR_SIZE(root->fs_info))
+   return -ENOSPC;
 
key.objectid = objectid;
key.type = BTRFS_XATTR_ITEM_KEY;
-- 
2.10.1



[PATCH 0/3] Remove some BUG_ONs

2017-02-20 Thread David Sterba
Minor cleanup of BUG_ONs, which can be either removed or replaced by
proper error handling where the call chain is ready for that. For 4.11.

David Sterba (3):
  btrfs: remove BUG_ON from __tree_mod_log_insert
  btrfs: handle allocation error in update_dev_stat_item
  btrfs: do proper error handling in btrfs_insert_xattr_item

 fs/btrfs/ctree.c| 2 --
 fs/btrfs/dir-item.c | 3 ++-
 fs/btrfs/volumes.c  | 3 ++-
 3 files changed, 4 insertions(+), 4 deletions(-)


Re: Why is btrfs_run_delayed_refs() run multiple times

2017-02-20 Thread Filipe Manana
On Mon, Feb 20, 2017 at 5:08 PM, Goldwyn Rodrigues  wrote:
> Hi,
>
> Why do we call btrfs_run_delayed_refs multiple times in a function? Some
> of the examples are:
>
> btrfs_commit_transaction()
> commit_cowonly_roots()
>
> Is it because one call can generate more delayed refs and hence we need
> to run them again? Under what scenarios is this possible?

All extents, for both data and metadata, are reference counted and
have back references in the extent tree. Adding/updating/removing
things to/in/from a tree implies COWing metadata extents (tree nodes
and leaves), which implies generating more delayed references.
Eventually this loop stops, as metadata extents are COWed only once
per transaction or more times if writeback was triggered.
So it's possible in all scenarios.
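In other words, the commit path runs delayed refs to a fixed point. A toy model of that run-until-stable loop (stub numbers and hypothetical names, purely illustrative):

#include <stdio.h>

static int pending;     /* delayed refs waiting to be processed */

/* Processing refs COWs tree blocks, which can queue more refs; here we
 * pretend each pass generates half as many new refs as it processed. */
static int run_delayed_refs_sketch(void)
{
        int ran = pending;

        pending = ran > 0 ? ran / 2 : 0;
        return ran;
}

int main(void)
{
        pending = 8;
        /* Commit keeps running until no new refs are generated. */
        while (run_delayed_refs_sketch() > 0)
                printf("pass done, %d refs still pending\n", pending);
        return 0;
}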

>
> --
> Goldwyn



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread James Bottomley
On Mon, 2017-02-20 at 17:56 +0100, Peter Zijlstra wrote:
> On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> > On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > > Now when new refcount_t type and API are finally merged
> > > > (see include/linux/refcount.h), the following
> > > > patches convert various refcounters in the block subsystem from 
> > > > atomic_t to refcount_t. By doing this we prevent intentional or
> > > > accidental underflows or overflows that can lead to use-after
> > > > -free vulnerabilities.
> > 
> > This description isn't right ... nothing is prevented; we get 
> > warnings on saturation and use after free with this.
> 
> The thing that is prevented is overflow and then a use-after-free by
> making it a leak.
> 
> Modular stuff: you put and free at (n+1) mod n; by saturating at n-1
> we'll never get there.
> 
> So you lose use-after-free, you gain a resource leak. The general 
> idea being that use-after-free is a nice trampoline for exploits, 
> leaks are 'only' a DoS.

OK, I see the intention: it's protection from outside influence.  It
still doesn't prevent *us* from screwing up in the kernel and inducing
a use after free by doing too many puts (or too few gets) ... that's
what the message suggests to me (me coding wrongly is accidental
underflows or overflows as I read it).

James



Why is btrfs_run_delayed_refs() run multiple times

2017-02-20 Thread Goldwyn Rodrigues
Hi,

Why do we call btrfs_run_delayed_refs multiple times in a function? Some
of the examples are:

btrfs_commit_transaction()
commit_cowonly_roots()

Is it because one call can generate more delayed refs and hence we need
to run them again? Under what scenarios is this possible?

-- 
Goldwyn


Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread Peter Zijlstra
On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > Now when new refcount_t type and API are finally merged
> > > (see include/linux/refcount.h), the following
> > > patches convert various refcounters in the block subsystem from 
> > > atomic_t to refcount_t. By doing this we prevent intentional or 
> > > accidental underflows or overflows that can lead to use-after-free
> > > vulnerabilities.
> 
> This description isn't right ... nothing is prevented; we get warnings
> on saturation and use after free with this.

The thing that is prevented is overflow and then a use-after-free by
making it a leak.

Modular stuff: you put and free at (n+1) mod n; by saturating at n-1
we'll never get there.

So you lose use-after-free, you gain a resource leak. The general idea
being that use-after-free is a nice trampoline for exploits, leaks are
'only' a DoS.
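The saturation semantics fit in a few lines; a userspace toy of the idea (not the kernel's refcount.h implementation, just the principle):

#include <limits.h>
#include <stdio.h>

/* Toy saturating refcount: once saturated it never reaches 0 again,
 * so the object leaks instead of being freed and reused. */
static unsigned int refcount = 1;

static void ref_get(void)
{
        if (refcount == UINT_MAX)       /* saturated: warn, don't wrap */
                fprintf(stderr, "refcount saturated\n");
        else
                refcount++;
}

static int ref_put(void)
{
        if (refcount == UINT_MAX)       /* saturated objects are leaked */
                return 0;
        return --refcount == 0;         /* 1 -> 0: free the object */
}

int main(void)
{
        refcount = UINT_MAX - 1;
        ref_get();                      /* saturates instead of wrapping */
        printf("put says free? %d\n", ref_put());  /* 0: leak, no UAF */
        return 0;
}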



Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread Jens Axboe
On 02/20/2017 08:41 AM, James Bottomley wrote:
> On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
>> On 02/20/2017 04:16 AM, Elena Reshetova wrote:
>>> Now when new refcount_t type and API are finally merged
>>> (see include/linux/refcount.h), the following
>>> patches convert various refcounters in the block subsystem from 
>>> atomic_t to refcount_t. By doing this we prevent intentional or 
>>> accidental underflows or overflows that can lead to use-after-free
>>> vulnerabilities.
> 
> This description isn't right ... nothing is prevented; we get warnings
> on saturation and use after free with this.
> 
>>> The below patches are fully independent and can be cherry-picked 
>>> separately. Since we convert all kernel subsystems in the same 
>>> fashion, resulting in about 300 patches, we have to group them for 
>>> sending at least in some fashion to be manageable. Please excuse
>>> the long cc list.
>>>
>>> Elena Reshetova (5):
>>>   block: convert bio.__bi_cnt from atomic_t to refcount_t
>>>   block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
>>>   block: convert blkcg_gq.refcnt from atomic_t to refcount_t
>>>   block: convert io_context.active_ref from atomic_t to refcount_t
>>>   block: convert bsg_device.ref_count from atomic_t to refcount_t
>>
>> I went to look at the implementation, and the size of a refcount_t.
>> But the code is not there?! You say it's finally merged, where is
>> it merged? Don't send code like this without the necessary
>> infrastructure being in mainline.
> 
> It's one of those no discussion except notice by tipbot things because
> Ingo liked it.
> 
> The size is atomic_t, but the primitives check for overflow and check
> inc from zero and things, so in a true refcounting situation we get
> automatic warnings of problems inside the primitives.
> 
> That said, if we were going to convert the block layer to this
> semantic, surely the benefit of the conversion would be getting rid of
> the now unnecessary BUG_ON and WARN_ON in the code, which do exactly
> the same thing the refcount primitives now do?

Well, I have no idea what it does, which is why I went to look at the
code. So any talk of converting the block layer is premature.  But it's
not there. I'll defer judgment until we have something in mainline,
until then I've archived this thread.

And I agree, keeping warn/bug for cases that should be handled
with this framework would be counter productive and pointless.

-- 
Jens Axboe



Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread James Bottomley
On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > Now when new refcount_t type and API are finally merged
> > (see include/linux/refcount.h), the following
> > patches convert various refcounters in the block subsystem from 
> > atomic_t to refcount_t. By doing this we prevent intentional or 
> > accidental underflows or overflows that can lead to use-after-free
> > vulnerabilities.

This description isn't right ... nothing is prevented; we get warnings
on saturation and use after free with this.

> > The below patches are fully independent and can be cherry-picked 
> > separately. Since we convert all kernel subsystems in the same 
> > fashion, resulting in about 300 patches, we have to group them for 
> > sending at least in some fashion to be manageable. Please excuse
> > the long cc list.
> > 
> > Elena Reshetova (5):
> >   block: convert bio.__bi_cnt from atomic_t to refcount_t
> >   block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
> >   block: convert blkcg_gq.refcnt from atomic_t to refcount_t
> >   block: convert io_context.active_ref from atomic_t to refcount_t
> >   block: convert bsg_device.ref_count from atomic_t to refcount_t
> 
> I went to look at the implementation, and the size of a refcount_t.
> But the code is not there?! You say it's finally merged, where is
> it merged? Don't send code like this without the necessary
> infrastructure being in mainline.

It's one of those no discussion except notice by tipbot things because
Ingo liked it.

The size is atomic_t, but the primitives check for overflow and check
inc from zero and things, so in a true refcounting situation we get
automatic warnings of problems inside the primitives.

That said, if we were going to convert the block layer to this
semantic, surely the benefit of the conversion would be getting rid of
the now unnecessary BUG_ON and WARN_ON in the code, which do exactly
the same thing the refcount primitives now do?

James



Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread Jens Axboe
On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> Now when new refcount_t type and API are finally merged
> (see include/linux/refcount.h), the following
> patches convert various refcounters in the block subsystem from atomic_t
> to refcount_t. By doing this we prevent intentional or accidental
> underflows or overflows that can lead to use-after-free vulnerabilities.
> 
> The below patches are fully independent and can be cherry-picked separately.
> Since we convert all kernel subsystems in the same fashion, resulting
> in about 300 patches, we have to group them for sending at least in some
> fashion to be manageable. Please excuse the long cc list.
> 
> Elena Reshetova (5):
>   block: convert bio.__bi_cnt from atomic_t to refcount_t
>   block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
>   block: convert blkcg_gq.refcnt from atomic_t to refcount_t
>   block: convert io_context.active_ref from atomic_t to refcount_t
>   block: convert bsg_device.ref_count from atomic_t to refcount_t

I went to look at the implementation, and the size of a refcount_t.
But the code is not there?! You say it's finally merged, where is
it merged? Don't send code like this without the necessary
infrastructure being in mainline.

-- 
Jens Axboe



Re: [PATCH] generic/311: Disable dmesg check

2017-02-20 Thread Anand Jain


Hi Chandan,

On 07/17/15 12:56, Chandan Rajendra wrote:

When running generic/311 on Btrfs' subpagesize-blocksize patchset (on ppc64
with 4k sectorsize and 16k node/leaf size) I noticed the following call trace,

BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 found 158
BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 found 158
BTRFS: Transaction aborted (error -5)

WARNING: at /root/repos/linux/fs/btrfs/super.c:260
Modules linked in:
CPU: 3 PID: 30769 Comm: umount Tainted: GWL 4.0.0-rc5-11671-g8b82e73e #63
task: c00079aaddb0 ti: c00079a48000 task.ti: c00079a48000
NIP: c0499aa0 LR: c0499a9c CTR: c0779630
REGS: c00079a4b480 TRAP: 0700   Tainted: GW   L   (4.0.0-rc5-11671-g8b82e73e)
MSR: 800100029032   CR: 28008828  XER: 2000
CFAR: c0a23914 SOFTE: 1
GPR00: c0499a9c c00079a4b700 c103bdf8 0025
GPR04: 0001 0502 c107e918 0cda
GPR08: 0007 0007 0001 c10f5044
GPR12: 28008822 cfdc0d80 2000 10152e00
GPR16: 010002979380 10140724  
GPR20:    
GPR24: c000151f61a8  c00055e5e800 c0aac270
GPR28: 04a4 fffb c00055e5e800 c000679204d0
NIP [c0499aa0] .__btrfs_abort_transaction+0x180/0x190
LR [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190
Call Trace:
[c00079a4b700] [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190 (unreliable)
[c00079a4b7a0] [c0541678] .__btrfs_run_delayed_items+0xe8/0x220
[c00079a4b850] [c04d5b3c] .btrfs_commit_transaction+0x37c/0xca0
[c00079a4b960] [c049824c] .btrfs_sync_fs+0x6c/0x1a0
[c00079a4ba00] [c0255270] .sync_filesystem+0xd0/0x100
[c00079a4ba80] [c0218070] .generic_shutdown_super+0x40/0x170
[c00079a4bb10] [c0218598] .kill_anon_super+0x18/0x30
[c00079a4bb90] [c0498418] .btrfs_kill_super+0x18/0xc0
[c00079a4bc10] [c0218ac8] .deactivate_locked_super+0x98/0xe0
[c00079a4bc90] [c023e744] .cleanup_mnt+0x54/0xa0
[c00079a4bd10] [c00b7d14] .task_work_run+0x114/0x150
[c00079a4bdb0] [c0015f84] .do_notify_resume+0x74/0x80
[c00079a4be30] [c0009838] .ret_from_except_lite+0x64/0x68
Instruction dump:
ebc1fff0 ebe1fff8 4bfffb28 6000 3ce2ffcd 38e7e818 4bbc 3c62ffd2
7fa4eb78 3863b808 48589e1d 6000 <0fe0> 4bfffedc 6000 6000
BTRFS: error (device dm-0) in __btrfs_run_delayed_items:1188: errno=-5 IO failure


The call trace is seen when executing _run_test() for the 8th time.
The above trace is actually a false-positive failure as indicated below,
 fsync-tester
   fsync(fd)
   Write delayed inode item to fs tree
 (assume transid to be 160)
 (assume tree block to start at logical address 29720576)
 md5sum $testfile
   This causes a delayed inode to be added
 Load flakey table
   i.e. drop writes that are initiated from now onwards
 Unmount filesystem
   btrfs_sync_fs is invoked
 Write 29720576 metadata block to disk
 free_extent_buffer(29720576)
   release_extent_buffer(29720576)
   Start writing delayed inode
 Traverse the fs tree
   (assume the parent tree block of 29720576 is still in memory)
   When reading 29720576 from disk, parent's blkptr will have generation
   set to 160. But the on-disk tree block will have an older
   generation (say, 158). Transid verification fails and hence the
   transaction gets aborted

The test only cares about the FS instance before the unmount
operation (i.e. the synced FS). Hence to get the test to pass, ignore the
false-positive trace that could be generated.
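The 'parent transid verify failed' message at the heart of this trace is a comparison of two generation numbers; in sketch form (illustrative, hypothetical function name; the values are taken from the report above):

#include <stdio.h>

/* Sketch of the transid verification: the parent block records the
 * generation it expects the child to have; a stale on-disk child
 * (whose write was dropped by flakey) still carries the older one. */
static int verify_parent_transid_sketch(unsigned long long parent_wants,
                                        unsigned long long child_has)
{
        if (parent_wants == child_has)
                return 0;
        fprintf(stderr,
                "parent transid verify failed: wanted %llu found %llu\n",
                parent_wants, child_has);
        return -5; /* -EIO */
}

int main(void)
{
        /* From the report: the parent wanted 160, the disk still had
         * 158 because the flakey table dropped the metadata write. */
        return verify_parent_transid_sketch(160, 158) ? 1 : 0;
}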


 Looks like this patch didn't make it; is there any kernel patch
 which fixed this bug? Or any hints on how to reproduce it?

Thanks, Anand


mount troubles after crash

2017-02-20 Thread Patrick Schmid

Hi together,

during a balance run, the server crashed. After the crash I got the
attached error messages.


Kernel: 4.9.11
Btrfs-Tools: v4.9.1

btrfs check --repair failed with:

couldn't open RDWR because of unsupported option features (3).
Open ctree failed


is there a way to fix this 140TB filesystem?

Regards Patrick
--
Patrick Schmid   support: +41 44 633 2668
IT Services Group, HPT H 8voice:   +41 44 633 3997
Departement Physik, ETH Zurich
CH-8093 Zurich, Switzerland
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.622208] BUG: unable to handle kernel paging request at fe10
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.630033] IP: [] qgroup_fix_relocated_data_extents+0x2b/0x290 [btrfs]
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.639224] PGD 1c0a067  
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.641864] PUD 1c0c067  
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.644708] PMD 0  
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.645306]  
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.646983] Oops:  [#1] SMP 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.650495] Modules linked in: nfsv3(E) 
ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) xt_multiport(E) iptable_filter(E) 
ip_tables(E) x_tables(E) autofs4(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) 
ib_core(E) configfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) 
scsi_transport_iscsi(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) 
grace(E) sunrpc(E) fscache(E) btrfs(E) xor(E) raid6_pq(E) sb_edac(E) 
edac_core(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) 
crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) 
aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 
bonding(E) ipmi_ssif(E) mousedev(E) lpc_ich(E) wmi(E) mei_me(E) mei(E) 
ioatdma(E) shpchp(E) tpm_tis(E) tpm_tis_core(E) ipmi_si(E) ipmi_poweroff(E) 
tcp_nv(E) ipmi_devintf(E) ipmi_msghandler(E) 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.730264]  sch_fq(E) hid_generic(E) 
igb(E) isci(E) ixgbe(E) i2c_algo_bit(E) libsas(E) dca(E) ahci(E) usbhid(E) 
ptp(E) scsi_transport_sas(E) arcmsr(OE) libahci(E) hid(E) mdio(E) pps_core(E) 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.747985] CPU: 6 PID: 86374 Comm: mount 
Tainted: G   OE   4.9.11-stable-blkmq #41 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.757389] Hardware name: Intel 
Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.03.0003.041920141333 
04/19/2014 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.768937] task: 88010323 
task.stack: c90026238000 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.775601] RIP: 
0010:[]  [] 
qgroup_fix_relocated_data_extents+0x2b/0x290 [btrfs] 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.787571] RSP: 0018:c9002623ba00  
EFLAGS: 00010246 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.793542] RAX:  RBX: 
880b6b60c000 RCX:  
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.801559] RDX: 88010323 RSI: 
881f18e2d000 RDI: 881f48535f40 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.809572] RBP: c9002623ba78 R08: 
60e0008071d0 R09: 881f48535f40 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.817588] R10: 9cd50001 R11: 
880d9cd508c0 R12: 881ff59b5800 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.825604] R13: c9002623baa0 R14: 
 R15:  
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.833619] FS:  7fb6570ee880() 
GS:881ffe40() knlGS: 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.842731] CS:  0010 DS:  ES:  
CR0: 80050033 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.849189] CR2: fe10 CR3: 
000c93c9d000 CR4: 000406e0 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.857211] Stack: 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.859498]  881f48535f40 
0801 c9002623ba68 a08ea66b 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.867884]  881f48535f40 
 88010323  
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.876269]  880dc489a1c0 
881ff59b5800 880dc489a1c0 881ff59b5800 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.884655] Call Trace: 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.887444]  [] ? 
start_transaction+0x9b/0x4b0 [btrfs] 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.894986]  [] 
btrfs_recover_relocation+0x378/0x420 [btrfs] 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.903136]  [] 
open_ctree+0x22e7/0x27f0 [btrfs] 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.910089]  [] 
btrfs_mount+0xca4/0xec0 [btrfs] 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.916942]  [] ? 
find_next_zero_bit+0x1e/0x20 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.923703]  [] ? 
pcpu_next_unpop+0x3e/0x50 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.930164]  [] ? 
find_next_bit+0x19/0x20 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.936432]  [] 
mount_fs+0x39/0x160 
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.942114]  [] ? 
__alloc_percpu+0x15/0x20 
Feb 20 14:42:01 phd-bkp-gw 

Re: Large no.of extents for a single small file with a fresh RAID5 setup

2017-02-20 Thread Lakshmipathi.G
On Mon, Feb 20, 2017 at 03:53:15PM +0800, Qu Wenruo wrote:
> 
> 
> At 02/18/2017 10:07 AM, Lakshmipathi.G wrote:
> >Hi,
> >
> >With RAID5, using 3 loop-back devices, each of 4GB. On a fresh setup,
> >created a 2mb file.
> >While checking the no. of extents, it shows:
> >
> ># ./btrfs-debugfs -f /mnt/file2m.txt
> >(257 0): ram 110592 disk 145227776 disk_size 114688
> >(257 110592): ram 4096 disk 145461248 disk_size 4096
> >(257 114688): ram 114688 disk 145342464 disk_size 118784
> >(257 229376): ram 712704 disk 145620992 disk_size 716800
> >(257 942080): ram 4096 disk 145096704 disk_size 4096
> >(257 946176): ram 737280 disk 146407424 disk_size 741376
> >(257 1683456): ram 413696 disk 147193856 disk_size 413696
> >file: /mnt/file2m.txt extents 7 disk size 2113536 logical size 2097152 ratio 0.99
> >
> >Is this expected behaviour? Is there any mount option to minimize the
> >no. of extents for a smaller file?
> 
> What mount option did you use?
> Are you using compression?

Used the default mount options (mount /dev/loop0 /mnt) and compression not 
enabled.

> How did you create the 2m file?

Yes, this is the problem. The script invokes 'dd' with 'notrunc' for each
byte, so for a 1kb file 'dd' is invoked 1024 times with notrunc.
This seems to be creating the issue.

Tried using an open-fd method (exec FD<>/path/f, write, finally close FD),
but even this runs into possible syncs and creates multiple extents.

So finally found a simple trick: create the file outside btrfs, then copy
it into the btrfs mount point :) This workaround solves the script's
file-creation issue.
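
For reference, a minimal sketch of the write pattern that avoids the
problem (hypothetical file name, assuming a btrfs filesystem mounted at
/mnt): one buffered write of the whole buffer followed by a single
fsync(), instead of thousands of small synced appends:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t sz = 2 * 1024 * 1024;      /* 2M, like file2m.txt */
            char *buf = malloc(sz);
            int fd;

            if (!buf)
                    return 1;
            memset(buf, 'a', sz);

            /* one buffered write, one sync: typically one or few extents */
            fd = open("/mnt/file2m.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                    return 1;
            if (write(fd, buf, sz) != (ssize_t)sz)
                    return 1;
            fsync(fd);
            close(fd);
            free(buf);
            return 0;
    }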

> 
> This seems quite strange: if you go through a normal buffered write without
> compression, 2m can easily fit into one file extent.
> 
> And according to the extents at offsets 0, 112K and 224K, their data extents
> are larger than their file extents, which means they went through CoW.
> 
> Either you did fsync/sync between writes or you went through Direct IO.
> 
> Anyway, the way you created the 2M file is quite important here.
> 
> Thanks,
> Qu
> >

thanks for the help.

Cheers.
Lakshmipathi.G


Re: [PATCH] Btrfs: remove unnecessary constraint when splitting a leaf

2017-02-20 Thread Filipe Manana
On Sat, Feb 18, 2017 at 12:34 AM, Liu Bo  wrote:
> On Fri, Feb 17, 2017 at 07:16:04PM +, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> If we want to append an item to a leaf we were trying to move off from the
>> leaf into its neighbors an amount of space corresponding to the item's
>> size. That amount of space is too much and can be reduced to the item's
>> size minus the amount of free space in the leaf, like we do for all the
>> other cases (item inserted at the beginning or somewhere in the middle of
>> the leaf).
>>
>> Signed-off-by: Filipe Manana 
>> ---
>>  fs/btrfs/ctree.c | 12 +---
>>  1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index a426dc8..86be619 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -4130,11 +4130,11 @@ static noinline int push_for_double_split(struct 
>> btrfs_trans_handle *trans,
>>   int progress = 0;
>>   int slot;
>>   u32 nritems;
>> - int space_needed = data_size;
>> + int space_needed;
>>
>>   slot = path->slots[0];
>> - if (slot < btrfs_header_nritems(path->nodes[0]))
>> - space_needed -= btrfs_leaf_free_space(fs_info, path->nodes[0]);
>> + space_needed = data_size -
>> + btrfs_leaf_free_space(fs_info, path->nodes[0]);
>>
>>   /*
>>* try to push all the items after our slot into the
>> @@ -4205,11 +4205,9 @@ static noinline int split_leaf(struct 
>> btrfs_trans_handle *trans,
>>
>>   /* first try to make some room by pushing left and right */
>>   if (data_size && path->nodes[1]) {
>> - int space_needed = data_size;
>> -
>> - if (slot < btrfs_header_nritems(l))
>> - space_needed -= btrfs_leaf_free_space(fs_info, l);
>> + int space_needed;
>
> One doubt: if (slot == nritems(l)) and push_leaf_right() is called with
> @empty == 0, it is possible to use the right sibling as the node for the
> new item. So if we subtract l's free space from space_needed, then it's
> possible that
>
> (the original space_needed) > (free space of right node) > (the subtracted
> space_needed)
>
> in that case we're going to use the right node as path->nodes[0] while the
> right node doesn't have enough space.

Right. And I forgot why I added this constraint in the first place and
the optimization at push_leaf_right() that uses the right leaf and its
first slot.
The idea here (this patch) was to cause fewer splits when we fall back
to trying to move items into the left sibling. That is, if the new item
is for a slot > 0 and the right sibling is full (or does not have enough
space to move items into it), we try to push into the left sibling with
the goal of freeing new_item_size - leaf_free_space bytes from our leaf
instead of new_item_size bytes. Clearly, while doing that, I made the
first optimization, of trying to use the right leaf, less likely to
succeed. I've reworked this with a new subject and changelog at:
https://patchwork.kernel.org/patch/9582901/

Thanks for reminding about that.

>
> Thanks,
>
> -liubo
>>
>> + space_needed = data_size - btrfs_leaf_free_space(fs_info, l);
>>   wret = push_leaf_right(trans, root, path, space_needed,
>>  space_needed, 0, 0);
>>   if (wret < 0)
>> --
>> 2.7.0.rc3
>>


[PATCH] Btrfs: try harder to migrate items to left sibling before splitting a leaf

2017-02-20 Thread fdmanana
From: Filipe Manana 

Before attempting to split a leaf we try to migrate items from the leaf to
its right and left siblings. We start by trying to move items into the
right sibling and, if the new item is meant to be inserted at the end of
our leaf, we try to free from our leaf an amount of bytes equal to the
number of bytes used by the new item, by setting the variable space_needed
to the byte size of that new item. However, if we fail to move enough items
to the right sibling due to lack of space in that sibling, we then try
to move items into the left sibling, and in that case we try to free
an amount equal to the size of the new item from our leaf, when we only
need to free an amount corresponding to the size of the new item minus
the current free space of our leaf. So make sure that before we try to
move items to the left sibling we set the variable space_needed to the
new item's size minus the leaf's current free space.
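
To make the arithmetic concrete, a small sketch (the numbers are invented
for illustration, they are not from the patch):

    /* Say the new item needs 800 bytes and our leaf has 500 bytes free.
     * Freeing data_size (800) bytes from the leaf is overkill; the leaf
     * only needs to gain 300 more bytes of free space: */
    int data_size = 800;                        /* new item size */
    int leaf_free = 500;                        /* btrfs_leaf_free_space() */
    int space_needed = data_size - leaf_free;   /* 300 bytes to migrate */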

Signed-off-by: Filipe Manana 
---
 fs/btrfs/ctree.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a426dc8..1d66761 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -4160,6 +4160,9 @@ static noinline int push_for_double_split(struct 
btrfs_trans_handle *trans,
 
/* try to push all the items before our slot into the next leaf */
slot = path->slots[0];
+   space_needed = data_size;
+   if (slot > 0)
+   space_needed -= btrfs_leaf_free_space(fs_info, path->nodes[0]);
ret = push_leaf_left(trans, root, path, 1, space_needed, 0, slot);
if (ret < 0)
return ret;
@@ -4215,6 +4218,10 @@ static noinline int split_leaf(struct btrfs_trans_handle 
*trans,
if (wret < 0)
return wret;
if (wret) {
+   space_needed = data_size;
+   if (slot > 0)
+   space_needed -= btrfs_leaf_free_space(fs_info,
+ l);
wret = push_leaf_left(trans, root, path, space_needed,
  space_needed, 0, (u32)-1);
if (wret < 0)
-- 
2.7.0.rc3



Re: [PATCH] btrfs: Handle btrfs_reloc_clone_csums error correctly to avoid deadlock

2017-02-20 Thread Filipe Manana
On Mon, Feb 20, 2017 at 12:31 AM, Qu Wenruo  wrote:
>
>
> At 02/17/2017 11:25 PM, Filipe Manana wrote:
>>
>> On Fri, Feb 17, 2017 at 12:37 AM, Qu Wenruo 
>> wrote:
>>>
>>>
>>>
>>> At 02/16/2017 06:03 PM, Filipe Manana wrote:


 On Thu, Feb 16, 2017 at 12:39 AM, Qu Wenruo 
 wrote:
>
>
>
>
> At 02/15/2017 11:59 PM, Filipe Manana wrote:
>>
>>
>>
>> On Wed, Feb 15, 2017 at 8:49 AM, Qu Wenruo 
>> wrote:
>>>
>>>
>>>
>>> If run btrfs/125 with nospace_cache or space_cache=v2 mount option,
>>> btrfs will block with the following backtrace:
>>>
>>> Call Trace:
>>>  __schedule+0x2d4/0xae0
>>>  schedule+0x3d/0x90
>>>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>>>  ? wake_atomic_t_function+0x60/0x60
>>>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>>>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>>>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>>>  process_one_work+0x2af/0x720
>>>  ? process_one_work+0x22b/0x720
>>>  worker_thread+0x4b/0x4f0
>>>  kthread+0x10f/0x150
>>>  ? process_one_work+0x720/0x720
>>>  ? kthread_create_on_node+0x40/0x40
>>>  ret_from_fork+0x2e/0x40
>>>
>>> The direct cause is the error handler in run_delalloc_nocow() doesn't
>>> handle error from btrfs_reloc_clone_csums() well.
>>>
>>> The related part call path will be:
>>> __extent_writepage
>>> |- writepage_delalloc()
>>> |  |- run_delalloc_range()
>>> | |- run_delalloc_nocow()
>>> ||- btrfs_add_ordered_extent()
>>> ||  Now one ordered extent for file range, e.g [0, 1M) is
>>> inserted
>>> ||
>>> ||- btrfs_reloc_clone_csums()
>>> ||  Fails with -EIO, as RAID5/6 doesn't repair some csum tree
>>> ||  blocks
>>> ||
>>> ||- extent_clear_unlock_delalloc()
>>> |   Error routine, unlock and clear page DIRTY, end page
>>> writeback
>>> |   So the remaining 255 pages will not go through writeback
>>> |
>>> |- __extent_writepage_io()
>>>|- writepage_end_io_hook()
>>>   |- btrfs_dec_test_ordered_pending()
>>>  Reduce ordered_extent->bytes_left by 4K.
>>>  Still have (1M - 4K) to finish.
>>>
>>> While the remaining 255 pages will not go through IO nor trigger
>>> writepage_end_io_hook(), the ordered extent for [0, 1M) will
>>> never finish, and blocking current transaction forever.
>>>
>>> Although the root cause is still in RAID5/6, it won't hurt to fix the
>>> error routine first.
>>>
>>> This patch will cleanup the ordered extent in error routine, so at
>>> least
>>> we won't cause deadlock.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  fs/btrfs/extent_io.c|  1 -
>>>  fs/btrfs/inode.c| 10 --
>>>  fs/btrfs/ordered-data.c | 25 +
>>>  fs/btrfs/ordered-data.h | 10 ++
>>>  4 files changed, 43 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 4ac383a3a649..a14d1b0840c5 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -3258,7 +3258,6 @@ static noinline_for_stack int
>>> writepage_delalloc(struct inode *inode,
>>>delalloc_end,
>>>&page_started,
>>>nr_written);
>>> -   /* File system has been set read-only */
>>> if (ret) {
>>> SetPageError(page);
>>> /* fill_delalloc should be return < 0 for
>>> error
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 1e861a063721..3c3ade58afd7 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -1052,8 +1052,11 @@ static noinline int cow_file_range(struct
>>> inode
>>> *inode,
>>> BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>> ret = btrfs_reloc_clone_csums(inode, start,
>>>
>>> cur_alloc_size);
>>> -   if (ret)
>>> +   if (ret) {
>>> +   btrfs_clean_ordered_extent(inode,
>>> start,
>>> +  ram_size);
>>> goto out_drop_extent_cache;
>>> +   }
>>> }
>>>
>>> btrfs_dec_block_group_reservations(fs_info,
>>> ins.objectid);
>>> @@ -1538,7 +1541,7 @@ static noinline int run_delalloc_nocow(struct
>>> inode
>>> 

Re: [PATCH] qgroup: Retry after commit on getting EDQUOT

2017-02-20 Thread Goldwyn Rodrigues


On 02/19/2017 11:35 PM, Qu Wenruo wrote:
> 
> 
> At 02/20/2017 12:35 PM, Goldwyn Rodrigues wrote:
>> Hi Qu,
>>
>> On 02/19/2017 09:07 PM, Qu Wenruo wrote:
>>>
>>>
>>> At 02/19/2017 09:30 PM, Goldwyn Rodrigues wrote:
 From: Goldwyn Rodrigues 

 We are facing the same problem with EDQUOT which was experienced with
 ENOSPC. Not sure if we require a full ticketing system such as ENOSPC,
 but
 here is a fix. Let me know if it is too big a hammer.

 Quotas are reserved during the start of an operation, incrementing
 qg->reserved. However, it is written to disk in a commit_transaction
 which could take as long as commit_interval. In the meantime there
 could be deletions which are not accounted for because deletions are
 accounted for only while committed (free_refroot). So, when we get
 an EDQUOT, flush the data to disk and try again.
>>>
>>> IIRC Jeff submitted a qgroup patch to allow unlink to be done even we
>>> already hit EDQUOT.
>>> https://patchwork.kernel.org/patch/9537193/
>>>
>>> I think that's a better solution, which is quite like what we do to
>>> handle ENOSPC.
>>>
>>
>> These are two separate issues. This issue is where qg->reserved gets
>> over-inflated because of repeated deletions and recreations within a
>> commit_interval.
> 
> My fault, that's indeed two different bugs.
> 
> And I succeeded to reproduce the bug.
>>
>> Jeff's patch deals with allowing removal even if the quota is exceeded
>> so that eventually there can be space freed.
>>
>> I would suggest you apply Jeff's patch and try to run the script I have
>> presented.
>>
>>
>>> Thanks,
>>> Qu
>>>

 I combined the conditions of rfer and excl to reduce code lines, though
 the condition looks uglier.

 Here is a sample script which shows this issue.

 DEVICE=/dev/vdb
 MOUNTPOINT=/mnt
 TESTVOL=$MOUNTPOINT/tmp
 QUOTA=5
 PROG=btrfs
 DD_BS="4k"
 DD_COUNT="256"
 RUN_TIMES=5000

 mkfs.btrfs -f $DEVICE
 mount -o commit=240 $DEVICE $MOUNTPOINT
 $PROG subvolume create $TESTVOL
 $PROG quota enable $TESTVOL
 $PROG qgroup limit ${QUOTA}G $TESTVOL

 typeset -i DD_RUN_GOOD
 typeset -i QUOTA

 function _check_cmd() {
 if [[ ${?} > 0 ]]; then
 echo -n "$(date) E: Running previous command"
 echo ${*}
 echo "Without sync"
 $PROG qgroup show -pcreFf ${TESTVOL}
 echo "With sync"
 $PROG qgroup show -pcreFf --sync ${TESTVOL}
 exit 1
 fi
 }

 while true; do
 DD_RUN_GOOD=$RUN_TIMES

 while (( ${DD_RUN_GOOD} != 0 )); do
 dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
 _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
 DD_RUN_GOOD=$((${DD_RUN_GOOD}-1))
 done

 $PROG qgroup show -pcref $TESTVOL
 echo "--- Cleanup -- "
 rm $TESTVOL/quotatest*

 done
> 
> It would be better if we could reduce the reproduce time and submit it as
> an fstests test case.

Yes, unfortunately, it is more probabilistic than deterministic. But
yes, reducing the size and increasing the commit_interval can improve
the time.

> 

 Signed-off-by: Goldwyn Rodrigues 
 ---
  fs/btrfs/qgroup.c | 28 +---
  1 file changed, 17 insertions(+), 11 deletions(-)

 diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
 index 4700cac..9ace407 100644
 --- a/fs/btrfs/qgroup.c
 +++ b/fs/btrfs/qgroup.c
 @@ -2093,6 +2093,7 @@ static int qgroup_reserve(struct btrfs_root
 *root, u64 num_bytes)
  struct btrfs_fs_info *fs_info = root->fs_info;
  u64 ref_root = root->root_key.objectid;
  int ret = 0;
 +int retried = 0;
  struct ulist_node *unode;
  struct ulist_iterator uiter;

 @@ -2101,7 +2102,7 @@ static int qgroup_reserve(struct btrfs_root
 *root, u64 num_bytes)

  if (num_bytes == 0)
  return 0;
 -
 +retry:
 spin_lock(&fs_info->qgroup_lock);
  quota_root = fs_info->quota_root;
  if (!quota_root)
 @@ -2127,16 +2128,21 @@ static int qgroup_reserve(struct btrfs_root
 *root, u64 num_bytes)

  qg = u64_to_ptr(unode->aux);

 -if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
 -qg->reserved + (s64)qg->rfer + num_bytes >
 -qg->max_rfer) {
 -ret = -EDQUOT;
 -goto out;
 -}
 -
 -if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) &&
 -qg->reserved + (s64)qg->excl + num_bytes >
 -qg->max_excl) {
 +if (((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
 +  

[PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread Elena Reshetova
Now that the new refcount_t type and API are finally merged
(see include/linux/refcount.h), the following
patches convert various refcounters in the block subsystem from atomic_t
to refcount_t. By doing this we prevent intentional or accidental
underflows or overflows that can lead to use-after-free vulnerabilities.

The below patches are fully independent and can be cherry-picked separately.
Since we convert all kernel subsystems in the same fashion, resulting
in about 300 patches, we have to group them for sending at least in some
fashion to be manageable. Please excuse the long cc list.
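
For readers unfamiliar with the new API, here is a minimal sketch of the
conversion pattern on a hypothetical kernel object (illustration only, it
is not taken from any of the patches below):

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct foo {
            refcount_t refs;                /* was: atomic_t refs; */
    };

    static void foo_init(struct foo *f)
    {
            refcount_set(&f->refs, 1);      /* was: atomic_set(&f->refs, 1); */
    }

    static void foo_get(struct foo *f)
    {
            refcount_inc(&f->refs);         /* saturates instead of wrapping */
    }

    static void foo_put(struct foo *f)
    {
            if (refcount_dec_and_test(&f->refs))    /* was: atomic_dec_and_test() */
                    kfree(f);
    }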

Elena Reshetova (5):
  block: convert bio.__bi_cnt from atomic_t to refcount_t
  block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
  block: convert blkcg_gq.refcnt from atomic_t to refcount_t
  block: convert io_context.active_ref from atomic_t to refcount_t
  block: convert bsg_device.ref_count from atomic_t to refcount_t

 block/bio.c|  6 +++---
 block/blk-cgroup.c |  2 +-
 block/blk-ioc.c|  4 ++--
 block/blk-tag.c|  8 
 block/bsg.c|  9 +
 block/cfq-iosched.c|  4 ++--
 fs/btrfs/volumes.c |  2 +-
 include/linux/bio.h|  4 ++--
 include/linux/blk-cgroup.h | 11 ++-
 include/linux/blk_types.h  |  3 ++-
 include/linux/blkdev.h |  3 ++-
 include/linux/iocontext.h  |  7 ---
 12 files changed, 34 insertions(+), 29 deletions(-)

-- 
2.7.4



[PATCH 1/5] block: convert bio.__bi_cnt from atomic_t to refcount_t

2017-02-20 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 block/bio.c   | 6 +++---
 fs/btrfs/volumes.c| 2 +-
 include/linux/bio.h   | 4 ++--
 include/linux/blk_types.h | 3 ++-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e0..3dffc17 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -275,7 +275,7 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 {
memset(bio, 0, sizeof(*bio));
atomic_set(&bio->__bi_remaining, 1);
-   atomic_set(&bio->__bi_cnt, 1);
+   refcount_set(&bio->__bi_cnt, 1);
 
bio->bi_io_vec = table;
bio->bi_max_vecs = max_vecs;
@@ -543,12 +543,12 @@ void bio_put(struct bio *bio)
if (!bio_flagged(bio, BIO_REFFED))
bio_free(bio);
else {
-   BIO_BUG_ON(!atomic_read(&bio->__bi_cnt));
+   BIO_BUG_ON(!refcount_read(&bio->__bi_cnt));
 
/*
 * last put frees it
 */
-   if (atomic_dec_and_test(&bio->__bi_cnt))
+   if (refcount_dec_and_test(&bio->__bi_cnt))
bio_free(bio);
}
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 13e55d1..ff1fe9d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -441,7 +441,7 @@ static noinline void run_scheduled_bios(struct btrfs_device 
*device)
waitqueue_active(&fs_info->async_submit_wait))
wake_up(&fs_info->async_submit_wait);
 
-   BUG_ON(atomic_read(&cur->__bi_cnt) == 0);
+   BUG_ON(refcount_read(&cur->__bi_cnt) == 0);
 
/*
 * if we're doing the sync list, record that our
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..44ac678 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -234,7 +234,7 @@ static inline void bio_get(struct bio *bio)
 {
bio->bi_flags |= (1 << BIO_REFFED);
smp_mb__before_atomic();
-   atomic_inc(&bio->__bi_cnt);
+   refcount_inc(&bio->__bi_cnt);
 }
 
 static inline void bio_cnt_set(struct bio *bio, unsigned int count)
@@ -243,7 +243,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned 
int count)
bio->bi_flags |= (1 << BIO_REFFED);
smp_mb__before_atomic();
}
-   atomic_set(&bio->__bi_cnt, count);
+   refcount_set(&bio->__bi_cnt, count);
 }
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb..c41407f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -7,6 +7,7 @@
 
#include <linux/types.h>
#include <linux/bvec.h>
+#include <linux/refcount.h>
 
 struct bio_set;
 struct bio;
@@ -73,7 +74,7 @@ struct bio {
 
unsigned short  bi_max_vecs;/* max bvl_vecs we can hold */
 
-   atomic_t    __bi_cnt;   /* pin count */
+   refcount_t  __bi_cnt;   /* pin count */
 
struct bio_vec  *bi_io_vec; /* the actual vec list */
 
-- 
2.7.4



[PATCH 2/5] block: convert blk_queue_tag.refcnt from atomic_t to refcount_t

2017-02-20 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 block/blk-tag.c| 8 
 include/linux/blkdev.h | 3 ++-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/blk-tag.c b/block/blk-tag.c
index 07cc329..d83555e 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -35,7 +35,7 @@ EXPORT_SYMBOL(blk_queue_find_tag);
  */
 void blk_free_tags(struct blk_queue_tag *bqt)
 {
-   if (atomic_dec_and_test(&bqt->refcnt)) {
+   if (refcount_dec_and_test(&bqt->refcnt)) {
BUG_ON(find_first_bit(bqt->tag_map, bqt->max_depth) <
bqt->max_depth);
 
@@ -130,7 +130,7 @@ static struct blk_queue_tag *__blk_queue_init_tags(struct 
request_queue *q,
if (init_tag_map(q, tags, depth))
goto fail;
 
-   atomic_set(&tags->refcnt, 1);
+   refcount_set(&tags->refcnt, 1);
tags->alloc_policy = alloc_policy;
tags->next_tag = 0;
return tags;
@@ -180,7 +180,7 @@ int blk_queue_init_tags(struct request_queue *q, int depth,
queue_flag_set(QUEUE_FLAG_QUEUED, q);
return 0;
} else
-   atomic_inc(&tags->refcnt);
+   refcount_inc(&tags->refcnt);
 
/*
 * assign it, all done
@@ -225,7 +225,7 @@ int blk_queue_resize_tags(struct request_queue *q, int 
new_depth)
 * Currently cannot replace a shared tag map with a new
 * one, so error out if this is the case
 */
-   if (atomic_read(&bqt->refcnt) != 1)
+   if (refcount_read(&bqt->refcnt) != 1)
return -EBUSY;
 
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aecca0e..95ef11c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct module;
 struct scsi_ioctl_command;
@@ -288,7 +289,7 @@ struct blk_queue_tag {
unsigned long *tag_map; /* bit map of free/busy tags */
int max_depth;  /* what we will send to device */
int real_max_depth; /* what the array can hold */
-   atomic_t refcnt;    /* map can be shared */
+   refcount_t refcnt;  /* map can be shared */
int alloc_policy;   /* tag allocation policy */
int next_tag;   /* next tag */
 };
-- 
2.7.4



[PATCH 4/5] block: convert io_context.active_ref from atomic_t to refcount_t

2017-02-20 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 block/blk-ioc.c   | 4 ++--
 block/cfq-iosched.c   | 4 ++--
 include/linux/iocontext.h | 7 ---
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b12f9c8..130dc23 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -173,7 +173,7 @@ void put_io_context_active(struct io_context *ioc)
unsigned long flags;
struct io_cq *icq;
 
-   if (!atomic_dec_and_test(&ioc->active_ref)) {
+   if (!refcount_dec_and_test(&ioc->active_ref)) {
put_io_context(ioc);
return;
}
@@ -256,7 +256,7 @@ int create_task_io_context(struct task_struct *task, gfp_t 
gfp_flags, int node)
/* initialize */
atomic_long_set(&ioc->refcount, 1);
atomic_set(&ioc->nr_tasks, 1);
-   atomic_set(&ioc->active_ref, 1);
+   refcount_set(&ioc->active_ref, 1);
spin_lock_init(&ioc->lock);
INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH);
INIT_HLIST_HEAD(&ioc->icq_list);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9212627..2871bb9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2959,7 +2959,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 * task has exited, don't wait
 */
cic = cfqd->active_cic;
-   if (!cic || !atomic_read(&cic->icq.ioc->active_ref))
+   if (!cic || !refcount_read(&cic->icq.ioc->active_ref))
return;
 
/*
@@ -3959,7 +3959,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct 
cfq_queue *cfqq,
 
if (cfqq->next_rq && req_noidle(cfqq->next_rq))
enable_idle = 0;
-   else if (!atomic_read(&cic->icq.ioc->active_ref) ||
+   else if (!refcount_read(&cic->icq.ioc->active_ref) ||
 !cfqd->cfq_slice_idle ||
 (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index df38db2..a1e28c3 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -3,6 +3,7 @@
 
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
+#include <linux/refcount.h>
#include <linux/workqueue.h>
 
 enum {
@@ -96,7 +97,7 @@ struct io_cq {
  */
 struct io_context {
atomic_long_t refcount;
-   atomic_t active_ref;
+   refcount_t active_ref;
atomic_t nr_tasks;
 
/* all the fields below are protected by this lock */
@@ -128,9 +129,9 @@ struct io_context {
 static inline void get_io_context_active(struct io_context *ioc)
 {
WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0);
-   WARN_ON_ONCE(atomic_read(&ioc->active_ref) <= 0);
+   WARN_ON_ONCE(refcount_read(&ioc->active_ref) == 0);
atomic_long_inc(&ioc->refcount);
-   atomic_inc(&ioc->active_ref);
+   refcount_inc(&ioc->active_ref);
 }
 
 static inline void ioc_task_link(struct io_context *ioc)
-- 
2.7.4
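
A side note on the hunk above: refcount_t wraps an unsigned counter, so
the old "<= 0" underflow checks can never be true with refcount_read()
and become "== 0". A tiny sketch of that point (illustration only):

    refcount_t r;

    refcount_set(&r, 1);
    WARN_ON_ONCE(refcount_read(&r) == 0);   /* refcount_read() returns unsigned int */
    refcount_inc(&r);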



[PATCH 3/5] block: convert blkcg_gq.refcnt from atomic_t to refcount_t

2017-02-20 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 block/blk-cgroup.c |  2 +-
 include/linux/blk-cgroup.h | 11 ++-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 295e98c2..75de844 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -106,7 +106,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, 
struct request_queue *q,
blkg->q = q;
INIT_LIST_HEAD(>q_node);
blkg->blkcg = blkcg;
-   atomic_set(&blkg->refcnt, 1);
+   refcount_set(&blkg->refcnt, 1);
 
/* root blkg uses @q->root_rl, init rl only for !root blkgs */
if (blkcg != &blkcg_root) {
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7..0d3efa9 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -19,6 +19,7 @@
#include <linux/radix-tree.h>
#include <linux/blkdev.h>
#include <linux/atomic.h>
+#include <linux/refcount.h>
 
 /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
 #define BLKG_STAT_CPU_BATCH(INT_MAX / 2)
@@ -122,7 +123,7 @@ struct blkcg_gq {
struct request_list rl;
 
/* reference count */
-   atomic_t    refcnt;
+   refcount_t  refcnt;
 
/* is this blkg online? protected by both blkcg and q locks */
boolonline;
@@ -354,8 +355,8 @@ static inline int blkg_path(struct blkcg_gq *blkg, char 
*buf, int buflen)
  */
 static inline void blkg_get(struct blkcg_gq *blkg)
 {
-   WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-   atomic_inc(&blkg->refcnt);
+   WARN_ON_ONCE(refcount_read(&blkg->refcnt) == 0);
+   refcount_inc(&blkg->refcnt);
 }
 
 void __blkg_release_rcu(struct rcu_head *rcu);
@@ -366,8 +367,8 @@ void __blkg_release_rcu(struct rcu_head *rcu);
  */
 static inline void blkg_put(struct blkcg_gq *blkg)
 {
-   WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-   if (atomic_dec_and_test(&blkg->refcnt))
+   WARN_ON_ONCE(refcount_read(&blkg->refcnt) == 0);
+   if (refcount_dec_and_test(&blkg->refcnt))
call_rcu(&blkg->rcu_head, __blkg_release_rcu);
 }
 
-- 
2.7.4



[PATCH 5/5] block: convert bsg_device.ref_count from atomic_t to refcount_t

2017-02-20 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 block/bsg.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index 74835db..6d0ceb9 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -21,6 +21,7 @@
#include <linux/idr.h>
#include <linux/bsg.h>
#include <linux/slab.h>
+#include <linux/refcount.h>

#include <scsi/scsi.h>
#include <scsi/scsi_ioctl.h>
@@ -38,7 +39,7 @@ struct bsg_device {
struct list_head busy_list;
struct list_head done_list;
struct hlist_node dev_list;
-   atomic_t ref_count;
+   refcount_t ref_count;
int queued_cmds;
int done_cmds;
wait_queue_head_t wq_done;
@@ -711,7 +712,7 @@ static int bsg_put_device(struct bsg_device *bd)
 
mutex_lock(&bsg_mutex);

-   do_free = atomic_dec_and_test(&bd->ref_count);
+   do_free = refcount_dec_and_test(&bd->ref_count);
if (!do_free) {
mutex_unlock(&bsg_mutex);
goto out;
@@ -763,7 +764,7 @@ static struct bsg_device *bsg_add_device(struct inode 
*inode,
 
bsg_set_block(bd, file);
 
-   atomic_set(&bd->ref_count, 1);
+   refcount_set(&bd->ref_count, 1);
mutex_lock(&bsg_mutex);
hlist_add_head(&bd->dev_list, bsg_dev_idx_hash(iminor(inode)));
 
@@ -783,7 +784,7 @@ static struct bsg_device *__bsg_get_device(int minor, 
struct request_queue *q)
 
hlist_for_each_entry(bd, bsg_dev_idx_hash(minor), dev_list) {
if (bd->queue == q) {
-   atomic_inc(&bd->ref_count);
+   refcount_inc(&bd->ref_count);
goto found;
}
}
-- 
2.7.4



Re: [PATCH 3/5] btrfs: export compression buffer limits in a header

2017-02-20 Thread Qu Wenruo



At 02/18/2017 02:05 AM, David Sterba wrote:

Move the buffer limit definitions out of compress_file_range.



Nice one.
No longer need to dig through the code to know the 128K limit now.

Reviewed-by: Qu Wenruo 

Thanks,
Qu

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.h | 15 +++
 fs/btrfs/inode.c   | 10 --
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index e453f42b3bbf..c9d7e552cfa8 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -19,6 +19,21 @@
 #ifndef __BTRFS_COMPRESSION_
 #define __BTRFS_COMPRESSION_

+/*
+ * We want to make sure that amount of RAM required to uncompress an extent is
+ * reasonable, so we limit the total size in ram of a compressed extent to
+ * 128k.  This is a crucial number because it also controls how easily we can
+ * spread reads across cpus for decompression.
+ *
+ * We also want to make sure the amount of IO required to do a random read is
+ * reasonably small, so we limit the size of a compressed extent to 128k.
+ */
+
+/* Maximum length of compressed data stored on disk */
+#define BTRFS_MAX_COMPRESSED   (SZ_128K)
+/* Maximum size of data before compression */
+#define BTRFS_MAX_UNCOMPRESSED (SZ_128K)
+
 void btrfs_init_compress(void);
 void btrfs_exit_compress(void);

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 801d1b3fd9d7..bc547608ff1a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -470,16 +470,6 @@ static noinline void compress_file_range(struct inode 
*inode,
   (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
goto cleanup_and_bail_uncompressed;

-   /* we want to make sure that amount of ram required to uncompress
-* an extent is reasonable, so we limit the total size in ram
-* of a compressed extent to 128k.  This is a crucial number
-* because it also controls how easily we can spread reads across
-* cpus for decompression.
-*
-* We also want to make sure the amount of IO required to do
-* a random read is reasonably small, so we limit the size of
-* a compressed extent to 128k.
-*/
total_compressed = min(total_compressed, max_uncompressed);
num_bytes = ALIGN(end - start + 1, blocksize);
num_bytes = max(blocksize,  num_bytes);
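
With the limits exported, a caller-side sketch of how they could be used
(illustration only, hypothetical variable names, not code from the patch):

    /* clamp one compression chunk to the exported limit */
    u64 chunk = min_t(u64, end - start + 1, BTRFS_MAX_UNCOMPRESSED);

    /* and sanity-check a compressed length read back from disk */
    if (compressed_len > BTRFS_MAX_COMPRESSED)
            return -EUCLEAN;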



