[PATCH] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-05 Thread Qu Wenruo
[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:25088..25215   128   0x1
 # umount /mnt/btrfs
 # mount /dev/vdb5 /mnt/btrfs
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..31]: 25088..2511932   0x0
   1: [32..63]:25120..2515132   0x0
   2: [64..95]:25152..2518332   0x0
   3: [96..127]:   25184..2521532   0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:25088..25215   128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   | |- btrfs_get_extent(start=0 len=64k)
   ||  Found on-disk (ino, EXTENT_DATA, 0)
   ||- add_extent_mapping()
   ||- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   | |- btrfs_get_extent(start=16k len=48k)
   ||  Found on-disk (ino, EXTENT_DATA, 16k)
   ||- add_extent_mapping()
   ||  |- try_merge_map()
   || Merge with previous em start=0 len=16k
   || resulting em start=0 len=32k
   ||- Return (em->start=0, len=32K)<< Merged result
   |- Stripe off the unrelated range (0~16K) of return em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
  ^^^ Causing split fiemap extent.

And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.

So by this method, we can merge all fiemap extents.
And now btrfs can pass generic/414 without problem.

Signed-off-by: Qu Wenruo 
---
The fix can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.
---
 fs/btrfs/extent_io.c | 111 ---
 1 file changed, 105 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e81922a21c..bae16320fbf4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4353,6 +4353,105 @@ static struct extent_map *get_extent_skip_holes(struct 
inode *inode,
return NULL;
 }
 
+/*
+ * To cache previous fiemap extent
+ *
+ * Will be used for merging fiemap extent
+ */
+struct fiemap_cache {
+   bool cached;
+   u64 offset;
+   u64 phys;
+   u64 len;
+   u32 flags;
+};
+
+/*
+ * Helper to submit fiemap extent.
+ *
+ * Will try to merge current fiemap extent specified by @offset, @phys,
+ * @len and @flags with cached one.
+ * And only when we fails to merge, cached one will be submitted as
+ * fiemap extent.
+ *
+ * Return 0 if merged or submitted.
+ * Return <0 for error.
+ */
+static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
+   struct fiemap_cache *cache,
+   u64 offset, u64 phys, u64 len, u32 flags)
+{
+   int ret;
+
+   if (!cache->cached) {
+assign:
+   cache->cached = true;
+   cache->offset = offset;
+   cache->phys = phys;
+   cache->len = len;
+   cache->flags = flags;
+   return 0;
+   }
+
+   /*
+* Sanity check, extent_fiemap() should have ensured that new
+* fiemap extent won't overlap with cahced one.
+* NOTE: Physical address can overlap, due to compression
+*/
+   WARN_ON(cache->offset + cache->len > offset);
+
+   /*
+* Only merge fiemap extents if
+* 1) Their logical addresses are continuous
+*
+* 2) Their physical addresses are continuous
+*So truly compressed (physical size smaller than logical size)
+*extents won't get merged with each other
+*
+* 3) Share same flags except FIEMAP_EXTENT_LAST
+*So regular extent won't get merged with prealloc extent
+*
+* 4) Merged result is no larger than BTRFS_MAX_EXTENT_SIZE
+*/
+   i

[PATCH v4 5/7] btrfs: use q which is already obtained from bdev_get_queue

2017-04-05 Thread Anand Jain
We have already assigned q from bdev_get_queue() so use it.
And rearrange the code for better view.

Signed-off-by: Anand Jain 
Reviewed-by: David Sterba 
---
v2-4: no change

 fs/btrfs/volumes.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1563ae03079b..4de5b2d549bd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1008,14 +1008,13 @@ static int __btrfs_open_devices(struct btrfs_fs_devices 
*fs_devices,
q = bdev_get_queue(bdev);
if (blk_queue_discard(q))
device->can_discard = 1;
+   if (!blk_queue_nonrot(q))
+   fs_devices->rotating = 1;
 
device->bdev = bdev;
device->in_fs_metadata = 0;
device->mode = flags;
 
-   if (!blk_queue_nonrot(bdev_get_queue(bdev)))
-   fs_devices->rotating = 1;
-
fs_devices->open_devices++;
if (device->writeable &&
device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -2417,7 +2416,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
fs_info->free_chunk_space += device->total_bytes;
spin_unlock(&fs_info->free_chunk_lock);
 
-   if (!blk_queue_nonrot(bdev_get_queue(bdev)))
+   if (!blk_queue_nonrot(q))
fs_info->fs_devices->rotating = 1;
 
tmp = btrfs_super_total_bytes(fs_info->super_copy);
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 7/7] btrfs: check if the device is flush capable

2017-04-05 Thread Anand Jain
The blkdev_issue_flush() will check if the write cache is enabled
before submitting the flush. This will add a code to fail fast if
its not.

Signed-off-by: Anand Jain 
---
v2: This patch will now _not_ replace the below patch and this patch
  should be applied on top of it.
[PATCH] btrfs: delete unused member nobarriers
  That's because, Q write cache flag is not static, so we can't save
  its state into nobarries.
  - commit log is updated.
v3: no change
v4: no change

 fs/btrfs/disk-io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b6d047250ce2..bb5816c7f5c4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3517,6 +3517,11 @@ static void btrfs_dev_issue_flush(struct work_struct 
*work)
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
+   struct request_queue *q = bdev_get_queue(device->bdev);
+
+   if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+   return 0;
+
if (wait) {
int ret;
 
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 6/7] btrfs: delete unused member nobarriers

2017-04-05 Thread Anand Jain
The last consumer of nobarriers is removed by the commit [1] and sync
won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write
through it just return success without actually transpiring such a
request to the block device/lun.

[1]
commit b25de9d6da49b1a8760a89672283128aa8c78345
block: remove BIO_EOPNOTSUPP

And, as the device/lun write cache state may change dynamically saving
such as state won't help either. So deleting the member nobarriers.

Signed-off-by: Anand Jain 
---
v2: add commit log. ref of last consumer.
v3: nochange
v4: nochange

 fs/btrfs/disk-io.c | 3 ---
 fs/btrfs/volumes.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c476b118440..b6d047250ce2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3517,9 +3517,6 @@ static void btrfs_dev_issue_flush(struct work_struct 
*work)
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
-   if (device->nobarriers)
-   return 0;
-
if (wait) {
int ret;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0df50bc65578..1168b78c5f1d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -123,7 +123,6 @@ struct btrfs_device {
struct list_head resized_list;
 
/* for sending down flush barriers */
-   int nobarriers;
struct completion flush_wait;
struct work_struct flush_work;
int last_flush_error;
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 2/7] btrfs: cleanup barrier_all_devices() unify dev error count

2017-04-05 Thread Anand Jain
Now when counting number of error devices we don't need to count
them separately once during send and wait, as because device error
counted during send is more of static check.

Also kindly note that as of now there is no code which would set
dev->bdev = NULL unless device is missing. However I still kept
bdev == NULL counted towards error device in view of future
enhancements. And as the device_list_mutex is held when
barrier_all_devices() is called, I don't expect a new bdev to null
in between send and wait.

Now in this process I also rename error_wait to dropouts.

Signed-off-by: Anand Jain 
---
v2: As the write_dev_flush with wait=0 is always successful,
from the previous patch the ret is now removed.
v3: no change
v4: no change

 fs/btrfs/disk-io.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08cbbee228ee..420753d37e1a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3546,19 +3546,15 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int errors_send = 0;
-   int errors_wait = 0;
-   int ret;
+   int dropouts = 0;
 
/* send down all the barriers */
head = &info->fs_devices->devices;
list_for_each_entry_rcu(dev, head, dev_list) {
if (dev->missing)
continue;
-   if (!dev->bdev) {
-   errors_send++;
+   if (!dev->bdev)
continue;
-   }
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
@@ -3570,18 +3566,16 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_wait++;
+   dropouts++;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   ret = write_dev_flush(dev, 1);
-   if (ret)
-   errors_wait++;
+   if (write_dev_flush(dev, 1))
+   dropouts++;
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
+   if (dropouts > info->num_tolerated_disk_barrier_failures)
return -EIO;
return 0;
 }
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 4/7] btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback

2017-04-05 Thread Anand Jain
REQ_PREFLUSH bio to flush dev cache uses btrfs_end_empty_barrier()
completion callback only, as of now, and there it accounts for dev
stat flush errors BTRFS_DEV_STAT_FLUSH_ERRS, so remove it from the
btrfs_end_bio().

Signed-off-by: Anand Jain 
---
v2-3: no change
v4: no change, but now the patch number is 4/7 in the set.

 fs/btrfs/volumes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..1563ae03079b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6010,9 +6010,6 @@ static void btrfs_end_bio(struct bio *bio)
else
btrfs_dev_stat_inc(dev,
BTRFS_DEV_STAT_READ_ERRS);
-   if (bio->bi_opf & REQ_PREFLUSH)
-   btrfs_dev_stat_inc(dev,
-   BTRFS_DEV_STAT_FLUSH_ERRS);
btrfs_dev_stat_print_on_error(dev);
}
}
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 1/7] btrfs: use blkdev_issue_flush to flush the device cache

2017-04-05 Thread Anand Jain
As of now we do alloc an empty bio and then use the flag REQ_PREFLUSH
to flush the device cache, instead we can use blkdev_issue_flush()
for this puspose.

Also now no need to check the return when write_dev_flush() is called
with wait = 0

Signed-off-by: Anand Jain 
---
v2: Title of this patch is changed from
   btrfs: communicate back ENOMEM when it occurs
  And its entirely a new patch, which now use blkdev_issue_flush()
v3: no change
v4: no change

 fs/btrfs/disk-io.c | 64 +++---
 fs/btrfs/volumes.h |  3 ++-
 2 files changed, 19 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74daf35d0..08cbbee228ee 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3498,70 +3498,42 @@ static int write_dev_supers(struct btrfs_device *device,
return errors < i ? 0 : -1;
 }
 
-/*
- * endio for the write_dev_flush, this will wake anyone waiting
- * for the barrier when it is done
- */
-static void btrfs_end_empty_barrier(struct bio *bio)
+static void btrfs_dev_issue_flush(struct work_struct *work)
 {
-   if (bio->bi_private)
-   complete(bio->bi_private);
-   bio_put(bio);
+   int ret;
+   struct btrfs_device *device;
+
+   device = container_of(work, struct btrfs_device, flush_work);
+
+   /* we are in the commit thread */
+   ret = blkdev_issue_flush(device->bdev, GFP_NOFS, NULL);
+   device->last_flush_error = ret;
+   complete(&device->flush_wait);
 }
 
 /*
  * trigger flushes for one the devices.  If you pass wait == 0, the flushes are
  * sent down.  With wait == 1, it waits for the previous flush.
- *
- * any device where the flush fails with eopnotsupp are flagged as not-barrier
- * capable
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
-   struct bio *bio;
-   int ret = 0;
-
if (device->nobarriers)
return 0;
 
if (wait) {
-   bio = device->flush_bio;
-   if (!bio)
-   return 0;
+   int ret;
 
wait_for_completion(&device->flush_wait);
-
-   if (bio->bi_error) {
-   ret = bio->bi_error;
+   ret = device->last_flush_error;
+   if (ret)
btrfs_dev_stat_inc_and_print(device,
-   BTRFS_DEV_STAT_FLUSH_ERRS);
-   }
-
-   /* drop the reference from the wait == 0 run */
-   bio_put(bio);
-   device->flush_bio = NULL;
-
+   BTRFS_DEV_STAT_FLUSH_ERRS);
return ret;
}
 
-   /*
-* one reference for us, and we leave it for the
-* caller
-*/
-   device->flush_bio = NULL;
-   bio = btrfs_io_bio_alloc(GFP_NOFS, 0);
-   if (!bio)
-   return -ENOMEM;
-
-   bio->bi_end_io = btrfs_end_empty_barrier;
-   bio->bi_bdev = device->bdev;
-   bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
init_completion(&device->flush_wait);
-   bio->bi_private = &device->flush_wait;
-   device->flush_bio = bio;
-
-   bio_get(bio);
-   btrfsic_submit_bio(bio);
+   INIT_WORK(&device->flush_work, btrfs_dev_issue_flush);
+   schedule_work(&device->flush_work);
 
return 0;
 }
@@ -3590,9 +3562,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   ret = write_dev_flush(dev, 0);
-   if (ret)
-   errors_send++;
+   write_dev_flush(dev, 0);
}
 
/* wait for all the barriers */
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..0df50bc65578 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -124,8 +124,9 @@ struct btrfs_device {
 
/* for sending down flush barriers */
int nobarriers;
-   struct bio *flush_bio;
struct completion flush_wait;
+   struct work_struct flush_work;
+   int last_flush_error;
 
/* per-device scrub information */
struct scrub_ctx *scrub_device;
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 0/7] Holistic view of device error at commit flush and related cleanup

2017-04-05 Thread Anand Jain
These patches adds cleanup of device barrier, q and flush codes.
This was needed for the following reasons..

  - First of all, Qu approach on the chunk based degradable check
wanted to know what devices failed rather than total count of failed
devices. So the patch 3/7 adds a function check_barrier_error() in
which Qu can add such a logic. Irrespective whether those Qu patches
passes through all the reviews, and if it does not, still this patch
is good to have as it does not change the commit performance since
check_barrier_error() is only called when there is an error in any of
the device. However I believe a volume manager when deciding an action
of the device error, it should rather have a broader view and state of
other devices instead of action based on just individual device, this
just a step towards that. (Earlier the patch [1] was also on the similar
idea, its yet to be separated from the feature patch, as time permits
and priority rises. [1] https://patchwork.kernel.org/patch/9058361/)

 [PATCH v4 3/7] btrfs: cleanup barrier_all_devices() to check dev stat flush 
error

  - However barrier_all_devices() wasn't capable of supporting 3/7
(above). So clean up of write_dev_flush() is needed, and if it uses
blkdev_issue_flush() then the way we count the error device especially
the err_send++ part can be avoided. Patch 1/7 provides this cleanup.
  
 [PATCH v4 1/7] btrfs: use blkdev_issue_flush to flush the device cache

  - There are some static error checking in barrier_all_devices(), such
as !dev->bdev && errors_send++, these errors would remain same before
or after the flush. As of now we do not have any code which will bring
a null bdev back, also we are in device_mutex. So consolidated all
those errors as dropouts at the place after the flush. This clean up is
in 2/7.

 [PATCH v4 2/7] btrfs: cleanup barrier_all_devices() unify dev error count

Now with 1/7 and 2/7 in place, 3/7 can be applied. These 3 are
dependent patches.

  - Patch 4/7 is not related to above 3 patches, however it adds up a
cleanup around the REQ_PREFLUSH flag. It is found that we do not have
any bio with the flag  REQ_PREFLUSH set. So an end io checking for that
flag is not required. Concerned what if there is a bio in future which
will set REQ_PREFLUSH ? I am not too sure about that. However the
write_dev_flush code with the patch 3/7 (above) is still unaffected
with or without patch 4/7, thats because 3/7  does not depend on
BTRFS_DEV_STAT_FLUSH_ERRS, instead it uses a new per device
last_flush_err member to track the error. But it would have been better
or the right way if it had checked BTRFS_DEV_STAT_FLUSH_ERRS instead.
But as it seeks much bigger change, it can be optimized to use
BTRFS_DEV_STAT_FLUSH_ERRS at a later point
of time.
  
 [PATCH v4 4/7] btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion 
callback

And rest of the patches are completely a cleanup patches which are
found good to have when around these parts of the code. They are -

[PATCH v4 5/7] btrfs: use q which is already obtained from bdev_get_queue
[PATCH v4 6/7] btrfs: delete unused member nobarriers
[PATCH v4 7/7] btrfs: check if the device is flush capable

Anand Jain (7):
  btrfs: use blkdev_issue_flush to flush the device cache
  btrfs: cleanup barrier_all_devices() unify dev error count
  btrfs: cleanup barrier_all_devices() to check dev stat flush error
  btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback
  btrfs: use q which is already obtained from bdev_get_queue
  btrfs: delete unused member nobarriers
  btrfs: check if the device is flush capable

 fs/btrfs/disk-io.c | 112 +
 fs/btrfs/volumes.c |  10 ++---
 fs/btrfs/volumes.h |   4 +-
 3 files changed, 58 insertions(+), 68 deletions(-)

-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 3/7] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-04-05 Thread Anand Jain
The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.

By doing this it helps to further develop patches which would tune
the error-actions as needed.

Signed-off-by: Anand Jain 
---
v2: Address Qu review comments viz..
 Add meaningful names, like cp_list (for checkpoint_list head).
 (And actually it does not need a new struct type just to hold
  the head pointer, list node is already named as device_checkpoint).
 Check return value of add_device_checkpoint()
 Check if the device is already added at add_device_checkpoint()
 Rename fini_devices_checkpoint() to rel_devices_checkpoint()
v3: (resent with the correct version (that is 3 not 2) of the patch).
   Dropped for idea of using the BTRFS_DEV_STAT_FLUSH_ERRS, though
   its the right way, but it needs a better infracture to handle that.
   Now the flush error return is saved and checked instead of the
   checkpoint of the dev_stat method earlier.
v4: no change

 fs/btrfs/disk-io.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 420753d37e1a..3c476b118440 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3538,6 +3538,23 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }
 
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
+{
+   int dropouts = 0;
+   struct btrfs_device *dev;
+
+   list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) {
+   if (!dev->bdev || dev->last_flush_error)
+   dropouts++;
+   }
+
+   if (dropouts >
+   fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3575,8 +3592,19 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (write_dev_flush(dev, 1))
dropouts++;
}
-   if (dropouts > info->num_tolerated_disk_barrier_failures)
-   return -EIO;
+
+   /*
+* A slight optimization, we check for dropouts here which avoids
+* a dev list loop when disks are healthy.
+*/
+   if (dropouts) {
+   /*
+* As we need holistic view of the failed disks, so
+* error checking is pushed to a separate loop.
+*/
+   return check_barrier_error(info->fs_devices);
+   }
+
return 0;
 }
 
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-05 Thread Qu Wenruo
Btrfs allows inline file extent if and only if
1) It's at offset 0
2) It's smaller than min(max_inline, page_size)
   Although we don't specify if the size is before compression or after
   compression.
   At least according to current behavior, we are only limiting the size
   after compression.
3) It's the only file extent
   So that if we append existing inline extent, it should be converted
   to regular file extents.

However several users in btrfs mail list have reported invalid inline
file extent, which only meets the first two condition, but with regular
file extents following.

The bug is here for a long long time, so long that we have modified
kernel and btrfs-progs to accept such case, but it's still not designed
behavior, and must be detected and fixed.

Signed-off-by: Qu Wenruo 
---
v2:
  All suggested by Eryu Guan
  Use loop_counts instead of runtime to avoid naming confusion.
  Fix whitespace issues
  Use fsync instead of sync
  Use $XFS_IO_PROG instead of calling xfs_io directly
  Always output corruption possibility into seqres.full

v3:
  All suggested by Filipe Manana
  Fix gramma errors
  Use simpler reproducer
  Add test case to compress group
---
 tests/btrfs/140 | 111 
 tests/btrfs/140.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 114 insertions(+)
 create mode 100755 tests/btrfs/140
 create mode 100644 tests/btrfs/140.out

diff --git a/tests/btrfs/140 b/tests/btrfs/140
new file mode 100755
index 000..183d9cd
--- /dev/null
+++ b/tests/btrfs/140
@@ -0,0 +1,111 @@
+#! /bin/bash
+# FS QA Test 140
+#
+# Check if btrfs will create inline-then-regular file layout.
+#
+# Btrfs only allows inline file extent if file is small enough, and any
+# incoming write enlarges the file beyond max_inline should replace inline
+# extents with regular extents.
+# This is a long standing bug, so fsck won't detect it and kernel will allow
+# normal read on it, but should be avoided.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command inspect-internal dump-tree
+
+inline_size=2048
+page_size=$(get_page_size)
+loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case
+
+do_test()
+{
+   _scratch_mkfs > /dev/null 2>&1
+
+   # specify max_inline and compress explicitly to create
+   # inlined compressed extent
+   _scratch_mount "-o max_inline=$inline_size,compress"
+
+   # Page sized data will be compressed to quite small size,
+   # which is about 50 bytes for 4K, 100 bytes for 64K,
+   # far smaller than $inline_size, causing a inlined compressed
+   # file extent
+   $XFS_IO_PROG -f -c "pwrite 0 $page_size" -c "fsync" $SCRATCH_MNT/file \
+   > /dev/null 2>&1
+
+   # Append another page to file in snapshot, since its offset is not
+   # 0, it will cause regular extent, and should convert inline extent
+   # to regular too.
+   $XFS_IO_PROG -c "pwrite $page_size $page_size" $SCRATCH_MNT/file \
+   > /dev/null 2>&1
+   _scratch_unmount
+
+   # Fsck won't report inline-then-regular as error, as it's so long
+   # standing that we have no choice but to accept.
+   # So here we use dump-tree to catch inline extent.
+   $BTRFS_UTIL_PROG inspect-internal dump-tree -t 5 $SCRATCH_DEV \
+   > $tmp.dump_tree 2>&1
+   if grep -q inline $tmp.dump_tree; then
+   cat $tmp.dump_tree >> $seqres.full
+   echo "Invalid inline file extent detected"
+   return 1
+   fi
+   return 0
+}
+
+for i in $(seq 1 $loop_counts); do
+   do_test
+   if [ $? -ne 0 ]; then
+

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread NeilBrown
On Wed, Apr 05 2017, Jan Kara wrote:

> On Wed 05-04-17 11:43:32, NeilBrown wrote:
>> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>> 
>> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> >> > > > Because if above is acceptable we could make reported i_version to 
>> >> > > > be a sum
>> >> > > > of "superblock crash counter" and "inode i_version". We increment
>> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
>> >> > > > shutdown.
>> >> > > > That way after a crash we are guaranteed each inode will report new
>> >> > > > i_version (the sum would probably have to look like "superblock 
>> >> > > > crash
>> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
>> >> > > > possible
>> >> > > > i_version numbers we gave away but did not write to disk but 
>> >> > > > still...).
>> >> > > > Thoughts?
>> >> > 
>> >> > How hard is this for filesystems to support?  Do they need an on-disk
>> >> > format change to keep track of the crash counter?  Maybe not, maybe the
>> >> > high bits of the i_version counters are all they need.
>> >> > 
>> >> 
>> >> Yeah, I imagine we'd need a on-disk change for this unless there's
>> >> something already present that we could use in place of a crash counter.
>> >
>> > We could consider using the current time instead.  So, put the current
>> > time (or time of last boot, or this inode's ctime, or something) in the
>> > high bits of the change attribute, and keep the low bits as a counter.
>> 
>> This is a very different proposal.
>> I don't think Jan was suggesting that the i_version be split into two
>> bit fields, one the change-counter and one the crash-counter.
>> Rather, the crash-counter was multiplied by a large-number and added to
>> the change-counter with the expectation that while not ever
>> change-counter landed on disk, at least 1 in every large-number would.
>> So after each crash we effectively add large-number to the
>> change-counter, and can be sure that number hasn't been used already.
>
> Yes, that was my thinking.
>
>> To store the crash-counter in each inode (which does appeal) you would
>> need to be able to remove it before adding the new crash counter, and
>> that requires bit-fields.  Maybe there are enough bits.
>
> Furthermore you'd have a potential problem that you need to change
> i_version on disk just because you are reading after a crash and such
> changes tend to be problematic (think of read-only mounts and stuff like
> that).
>  
>> If you want to ensure read-only files can remain cached over a crash,
>> then you would have to mark a file in some way on stable storage
>> *before* allowing any change.
>> e.g. you could use the lsb.  Odd i_versions might have been changed
>> recently and crash-count*large-number needs to be added.
>> Even i_versions have not been changed recently and nothing need be
>> added.
>> 
>> If you want to change a file with an even i_version, you subtract
>>   crash-count*large-number
>> to the i_version, then set lsb.  This is written to stable storage before
>> the change.
>> 
>> If a file has not been changed for a while, you can add
>>   crash-count*large-number
>> and clear lsb.
>> 
>> The lsb of the i_version would be for internal use only.  It would not
>> be visible outside the filesystem.
>> 
>> It feels a bit clunky, but I think it would work and is the best
>> combination of Jan's idea and your requirement.
>> The biggest cost would be switching to 'odd' before an changes, and the
>> unknown is when does it make sense to switch to 'even'.
>
> Well, there is also a problem that you would need to somehow remember with
> which 'crash count' the i_version has been previously reported as that is
> not stored on disk with my scheme. So I don't think we can easily use your
> scheme.

I don't think there is a problem here maybe I didn't explain
properly or something.

I'm assuming there is a crash-count that is stored once per filesystem.
This might be a disk-format change, or maybe the "Last checked" time
could be used with ext4 (that is a bit horrible though).

Every on-disk i_version has a flag to choose between:
  - use this number as it is, but update it on-disk before any change
  - add multiple of current crash-count to this number before use.
  If you crash during an update, the i_version is thus automatically
  increased.

To change from the first option to the second option you subtract the
multiple of the current crash-count (which might make the stored
i_version negative), and flip the bit.
To change from the second option to the first, you add the multiple
of the current crash-count, and flip the bit.
In each case, the externally visible i_version does not change.
Nothing needs to be stored except the per-inode i_version and the per-fs
crash

Re: [PATCH] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-05 Thread Qu Wenruo



At 04/06/2017 12:25 AM, Filipe Manana wrote:

On Wed, Apr 5, 2017 at 5:00 PM, Filipe Manana  wrote:

On Wed, Apr 5, 2017 at 3:14 AM, Qu Wenruo  wrote:

Btrfs allows inline file extent if and only if
1) It's at offset 0
2) It's smaller than min(max_inline, page_size)
   Although we don't specify if the size is before compression or after
   compression.
   At least according to current behavior, we are only limiting the size
   after compression.
3) It's the only file extent
   So that if we append existing inline extent, it should be converted
   to regular file extents.

However several users in btrfs mail list have reported invalid inline
file extent, which only meets the first two condition, but with regular
file extents follows.


follows -> following



The bug is here for a long long time, so long that we have modified
kernel and btrfs-progs to accept such case, but it's still not designed
behavior, and must be detected and fixed.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/140 | 120 
 tests/btrfs/140.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 123 insertions(+)
 create mode 100755 tests/btrfs/140
 create mode 100644 tests/btrfs/140.out

diff --git a/tests/btrfs/140 b/tests/btrfs/140
new file mode 100755
index 000..55f81f2
--- /dev/null
+++ b/tests/btrfs/140
@@ -0,0 +1,120 @@
+#! /bin/bash
+# FS QA Test 140
+#
+# Check if btrfs will create inline-then-regular file layout.
+#
+# Btrfs only allows inline file extent if file is small enough, and any
+# incoming write enlarges the file beyong max_inline should replace inline


beyong -> beyond


+# extents with regular extents.
+# This is a long standing bug, so fsck won't detect it and kernel will allow
+# normal read on it, but should be avoided.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command inspect-internal dump-tree
+
+inline_size=2048
+page_size=$(get_page_size)
+src_subv="$SCRATCH_MNT/src"
+dst_subv="$SCRATCH_MNT/dst"
+loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case
+
+do_test()
+{
+   local ret


unused variable


+   _scratch_mkfs > /dev/null 2>&1
+
+   # specify max_inline and compress explicitly to create
+   # inlined compressed extent
+   _scratch_mount "-o max_inline=$inline_size,compress"
+
+   # Subvolume id=257
+   _run_btrfs_util_prog subvolume create $src_subv
+
+   # Page sized data will be compressed to quite small size,
+   # which is about 50 bytes for 4K, 100 bytes for 64K,
+   # far smaller than $inline_size, causing a inlined compressed
+   # file extent
+   $XFS_IO_PROG -f -c "pwrite 0 $page_size" -c "fsync" $src_subv/file \
+   > /dev/null
+
+   # Subvolume id=258
+   _run_btrfs_util_prog subvolume snapshot $src_subv $dst_subv
+
+   # Append another page to file in snapshot, since its offset is not
+   # 0, it will cause regular extent, and should convert inline extent
+   # to regular too.
+   $XFS_IO_PROG -c "pwrite $page_size $page_size" $dst_subv/file > 
/dev/null
+   _scratch_unmount
+
+   # Fsck won't report inline-then-regular as error, as it's so long
+   # standing that we have no choice but to accept.
+   # So here we use dump-tree to catch inline extent of $dst_subv, and
+   # we need piping stdout, call btrfs-progs directly


I don't understand the last sentence of the comment. Need piping
stdout? What pipe, to where? And call btrfs-progs directly where, what
does it mean?


+   $BTRFS_UTIL_PROG inspect-internal dump-tree -t 258 $SCRATCH_DEV \
+   > $tmp.dump_tree
+   if grep

Re: [PATCH] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-05 Thread Qu Wenruo



At 04/06/2017 12:00 AM, Filipe Manana wrote:

On Wed, Apr 5, 2017 at 3:14 AM, Qu Wenruo  wrote:

Btrfs allows inline file extent if and only if
1) It's at offset 0
2) It's smaller than min(max_inline, page_size)
   Although we don't specify if the size is before compression or after
   compression.
   At least according to current behavior, we are only limiting the size
   after compression.
3) It's the only file extent
   So that if we append existing inline extent, it should be converted
   to regular file extents.

However several users in btrfs mail list have reported invalid inline
file extent, which only meets the first two condition, but with regular
file extents follows.


follows -> following



The bug is here for a long long time, so long that we have modified
kernel and btrfs-progs to accept such case, but it's still not designed
behavior, and must be detected and fixed.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/140 | 120 
 tests/btrfs/140.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 123 insertions(+)
 create mode 100755 tests/btrfs/140
 create mode 100644 tests/btrfs/140.out

diff --git a/tests/btrfs/140 b/tests/btrfs/140
new file mode 100755
index 000..55f81f2
--- /dev/null
+++ b/tests/btrfs/140
@@ -0,0 +1,120 @@
+#! /bin/bash
+# FS QA Test 140
+#
+# Check if btrfs will create inline-then-regular file layout.
+#
+# Btrfs only allows inline file extent if file is small enough, and any
+# incoming write enlarges the file beyong max_inline should replace inline


beyong -> beyond


+# extents with regular extents.
+# This is a long standing bug, so fsck won't detect it and kernel will allow
+# normal read on it, but should be avoided.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command inspect-internal dump-tree
+
+inline_size=2048
+page_size=$(get_page_size)
+src_subv="$SCRATCH_MNT/src"
+dst_subv="$SCRATCH_MNT/dst"
+loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case
+
+do_test()
+{
+   local ret


unused variable


+   _scratch_mkfs > /dev/null 2>&1
+
+   # specify max_inline and compress explicitly to create
+   # inlined compressed extent
+   _scratch_mount "-o max_inline=$inline_size,compress"
+
+   # Subvolume id=257
+   _run_btrfs_util_prog subvolume create $src_subv
+
+   # Page sized data will be compressed to quite small size,
+   # which is about 50 bytes for 4K, 100 bytes for 64K,
+   # far smaller than $inline_size, causing a inlined compressed
+   # file extent
+   $XFS_IO_PROG -f -c "pwrite 0 $page_size" -c "fsync" $src_subv/file \
+   > /dev/null
+
+   # Subvolume id=258
+   _run_btrfs_util_prog subvolume snapshot $src_subv $dst_subv
+
+   # Append another page to file in snapshot, since its offset is not
+   # 0, it will cause regular extent, and should convert inline extent
+   # to regular too.
+   $XFS_IO_PROG -c "pwrite $page_size $page_size" $dst_subv/file > 
/dev/null
+   _scratch_unmount
+
+   # Fsck won't report inline-then-regular as error, as it's so long
+   # standing that we have no choice but to accept.
+   # So here we use dump-tree to catch inline extent of $dst_subv, and
+   # we need piping stdout, call btrfs-progs directly


I don't understand the last sentence of the comment. Need piping
stdout? What pipe, to where? And call btrfs-progs directly where, what
does it mean?


Redirect to a file without populating stdout.

Directly means calling $BTRFS_UTIL_PROG other than the 
_run_btrfs_util_progs wrapper.


Thanks,
Qu



+   $BTRFS_UTIL_PROG in

[PATCH] Btrfs: fix invalid dereference in btrfs_retry_endio

2017-04-05 Thread Liu Bo
When doing directIO repair, we have this oops

[ 1458.532816] general protection fault:  [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: 88082a42d100 task.stack: c90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374]  bio_endio+0xed/0x100
[ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224]  process_one_work+0x34d/0xb70
[ 1458.546570]  ? process_one_work+0x29e/0xb70
[ 1458.546938]  worker_thread+0x1cf/0x960
[ 1458.547263]  ? process_one_work+0xb70/0xb70
[ 1458.547624]  kthread+0x17d/0x180
[ 1458.547909]  ? kthread_create_on_node+0x70/0x70
[ 1458.548300]  ret_from_fork+0x31/0x40

It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.

This fixes the problem by using the preservd inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.

Also cleanup unused @start(which is too trivial for a separate patch).

Cc: David Sterba 
Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ec5a05..8e71ed7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7910,7 +7910,6 @@ struct btrfs_retry_complete {
 static void btrfs_retry_endio_nocsum(struct bio *bio)
 {
struct btrfs_retry_complete *done = bio->bi_private;
-   struct inode *inode;
struct bio_vec *bvec;
int i;
 
@@ -7918,12 +7917,12 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
goto end;
 
ASSERT(bio->bi_vcnt == 1);
-   inode = bio->bi_io_vec->bv_page->mapping->host;
-   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(inode));
+   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode));
 
done->uptodate = 1;
bio_for_each_segment_all(bvec, bio, i)
-   clean_io_failure(BTRFS_I(done->inode), done->start, bvec->bv_page, 0);
+   clean_io_failure(BTRFS_I(done->inode), done->start,
+bvec->bv_page, 0);
 end:
complete(&done->done);
bio_put(bio);
@@ -7986,9 +7985,7 @@ static void btrfs_retry_endio(struct bio *bio)
 {
struct btrfs_retry_complete *done = bio->bi_private;
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-   struct inode *inode;
struct bio_vec *bvec;
-   u64 start;
int uptodate;
int ret;
int i;
@@ -7998,11 +7995,8 @@ static void btrfs_retry_endio(struct bio *bio)
 
uptodate = 1;
 
-   start = done->start;
-
ASSERT(bio->bi_vcnt == 1);
-   inode = bio->bi_io_vec->bv_page->mapping->host;
-   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(inode));
+   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode));
 
bio_for_each_segment_all(bvec, bio, i) {
ret = __readpage_endio_check(done->inode, io_bio, i,
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> 1) Keep i_version as is, make clients also check for i_ctime.

That would be a protocol revision, which we'd definitely rather avoid.

But can't we accomplish the same by using something like

ctime * (some constant) + i_version

?

>Pro: No on-disk format changes.
>Cons: After a crash, i_version can go backwards (but when file changes
>i_version, i_ctime pair should be still different) or not, data can be
>old or not.

This is probably good enough for NFS purposes: typically on an NFS
filesystem, results of a read in the face of a concurrent write open are
undefined.  And writers sync before close.

So after a crash with a dirty inode, we're in a situation where an NFS
client still needs to resend some writes, sync, and close.  I'm OK with
things being inconsistent during this window.

I do expect things to return to normal once that client's has resent its
writes--hence the worry about actually resuing old values after boot
(such as if i_version regresses on boot and then increments back to the
same value after further writes).  Factoring in ctime fixes that.

> 2) Fsync when reporting i_version.
>Pro: No on-disk format changes, strong consistency of i_version and
> data.
>Cons: Difficult to implement for filesystems due to locking constrains.
>  High performance overhead or i_version reporting.

Sounds painful.

> 3) Some variant of crash counter.
>Pro: i_version cannot go backwards.
>Cons: Requires on-disk format changes. After a crash data can be old
>  (however i_version increased).

Also, some unnecessary invalidations.  Which maybe can be mostly avoided
by some variation of Neil's scheme.

It looks to me like option (1) is doable now and introduces no
regressions compared to the current situation.  (2) and (3) are more
copmlicated and involve some tradeoffs.

Also, we can implement (1) and switch to (2) or (3) later.  We'd want to
do it without reported i_versions decreasing on kernel upgrade, but
there are multiple ways of handling that.  (Worst case, just restrict
the change to newly created filesystems.)

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 11:43:32AM +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
> 
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to 
> >> > > > be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
> >> > > > shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
> >> > > > possible
> >> > > > i_version numbers we gave away but did not write to disk but 
> >> > > > still...).
> >> > > > Thoughts?
> >> > 
> >> > How hard is this for filesystems to support?  Do they need an on-disk
> >> > format change to keep track of the crash counter?  Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> > 
> >> 
> >> Yeah, I imagine we'd need a on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead.  So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
> 
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not ever
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

I was sort of ignoring the distinction between concatenate(A,B) and
A*m+B, but, sure, multiplying's probably better.

> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields.  Maybe there are enough bits.

i_version and the NFSv4 change attribute are 64 bits which gives us a
fair amount of flexibility.

> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb.  Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
> 
> If you want to change a file with an even i_version, you subtract
>   crash-count*large-number
> to the i_version, then set lsb.  This is written to stable storage before
> the change.
> 
> If a file has not been changed for a while, you can add
>   crash-count*large-number
> and clear lsb.
> 
> The lsb of the i_version would be for internal use only.  It would not
> be visible outside the filesystem.
> 
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before an changes, and the
> unknown is when does it make sense to switch to 'even'.

I'm not sure how to model the costs.  Something like that might work.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-05 Thread Filipe Manana
On Wed, Apr 5, 2017 at 5:00 PM, Filipe Manana  wrote:
> On Wed, Apr 5, 2017 at 3:14 AM, Qu Wenruo  wrote:
>> Btrfs allows inline file extent if and only if
>> 1) It's at offset 0
>> 2) It's smaller than min(max_inline, page_size)
>>Although we don't specify if the size is before compression or after
>>compression.
>>At least according to current behavior, we are only limiting the size
>>after compression.
>> 3) It's the only file extent
>>So that if we append existing inline extent, it should be converted
>>to regular file extents.
>>
>> However several users in btrfs mail list have reported invalid inline
>> file extent, which only meets the first two condition, but with regular
>> file extents follows.
>
> follows -> following
>
>>
>> The bug is here for a long long time, so long that we have modified
>> kernel and btrfs-progs to accept such case, but it's still not designed
>> behavior, and must be detected and fixed.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  tests/btrfs/140 | 120 
>> 
>>  tests/btrfs/140.out |   2 +
>>  tests/btrfs/group   |   1 +
>>  3 files changed, 123 insertions(+)
>>  create mode 100755 tests/btrfs/140
>>  create mode 100644 tests/btrfs/140.out
>>
>> diff --git a/tests/btrfs/140 b/tests/btrfs/140
>> new file mode 100755
>> index 000..55f81f2
>> --- /dev/null
>> +++ b/tests/btrfs/140
>> @@ -0,0 +1,120 @@
>> +#! /bin/bash
>> +# FS QA Test 140
>> +#
>> +# Check if btrfs will create inline-then-regular file layout.
>> +#
>> +# Btrfs only allows inline file extent if file is small enough, and any
>> +# incoming write enlarges the file beyong max_inline should replace inline
>
> beyong -> beyond
>
>> +# extents with regular extents.
>> +# This is a long standing bug, so fsck won't detect it and kernel will allow
>> +# normal read on it, but should be avoided.
>> +#
>> +#---
>> +# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software Foundation,
>> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> +#---
>> +#
>> +
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1   # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +   cd /
>> +   rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# remove previous $seqres.full before test
>> +rm -f $seqres.full
>> +
>> +# real QA test starts here
>> +
>> +# Modify as appropriate.
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_scratch
>> +_require_btrfs_command inspect-internal dump-tree
>> +
>> +inline_size=2048
>> +page_size=$(get_page_size)
>> +src_subv="$SCRATCH_MNT/src"
>> +dst_subv="$SCRATCH_MNT/dst"
>> +loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case
>> +
>> +do_test()
>> +{
>> +   local ret
>
> unused variable
>
>> +   _scratch_mkfs > /dev/null 2>&1
>> +
>> +   # specify max_inline and compress explicitly to create
>> +   # inlined compressed extent
>> +   _scratch_mount "-o max_inline=$inline_size,compress"
>> +
>> +   # Subvolume id=257
>> +   _run_btrfs_util_prog subvolume create $src_subv
>> +
>> +   # Page sized data will be compressed to quite small size,
>> +   # which is about 50 bytes for 4K, 100 bytes for 64K,
>> +   # far smaller than $inline_size, causing a inlined compressed
>> +   # file extent
>> +   $XFS_IO_PROG -f -c "pwrite 0 $page_size" -c "fsync" $src_subv/file \
>> +   > /dev/null
>> +
>> +   # Subvolume id=258
>> +   _run_btrfs_util_prog subvolume snapshot $src_subv $dst_subv
>> +
>> +   # Append another page to file in snapshot, since its offset is not
>> +   # 0, it will cause regular extent, and should convert inline extent
>> +   # to regular too.
>> +   $XFS_IO_PROG -c "pwrite $page_size $page_size" $dst_subv/file > 
>> /dev/null
>> +   _scratch_unmount
>> +
>> +   # Fsck won't report inline-then-regular as error, as it's so long
>> +   # standing that we have no choice but to accept.
>> +   # So here we use dump-tree to catch inline extent of 

Re: [PATCH] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-05 Thread Filipe Manana
On Wed, Apr 5, 2017 at 3:14 AM, Qu Wenruo  wrote:
> Btrfs allows inline file extent if and only if
> 1) It's at offset 0
> 2) It's smaller than min(max_inline, page_size)
>Although we don't specify if the size is before compression or after
>compression.
>At least according to current behavior, we are only limiting the size
>after compression.
> 3) It's the only file extent
>So that if we append existing inline extent, it should be converted
>to regular file extents.
>
> However several users in btrfs mail list have reported invalid inline
> file extent, which only meets the first two condition, but with regular
> file extents follows.

follows -> following

>
> The bug is here for a long long time, so long that we have modified
> kernel and btrfs-progs to accept such case, but it's still not designed
> behavior, and must be detected and fixed.
>
> Signed-off-by: Qu Wenruo 
> ---
>  tests/btrfs/140 | 120 
> 
>  tests/btrfs/140.out |   2 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 123 insertions(+)
>  create mode 100755 tests/btrfs/140
>  create mode 100644 tests/btrfs/140.out
>
> diff --git a/tests/btrfs/140 b/tests/btrfs/140
> new file mode 100755
> index 000..55f81f2
> --- /dev/null
> +++ b/tests/btrfs/140
> @@ -0,0 +1,120 @@
> +#! /bin/bash
> +# FS QA Test 140
> +#
> +# Check if btrfs will create inline-then-regular file layout.
> +#
> +# Btrfs only allows inline file extent if file is small enough, and any
> +# incoming write enlarges the file beyong max_inline should replace inline

beyong -> beyond

> +# extents with regular extents.
> +# This is a long standing bug, so fsck won't detect it and kernel will allow
> +# normal read on it, but should be avoided.
> +#
> +#---
> +# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_btrfs_command inspect-internal dump-tree
> +
> +inline_size=2048
> +page_size=$(get_page_size)
> +src_subv="$SCRATCH_MNT/src"
> +dst_subv="$SCRATCH_MNT/dst"
> +loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case
> +
> +do_test()
> +{
> +   local ret

unused variable

> +   _scratch_mkfs > /dev/null 2>&1
> +
> +   # specify max_inline and compress explicitly to create
> +   # inlined compressed extent
> +   _scratch_mount "-o max_inline=$inline_size,compress"
> +
> +   # Subvolume id=257
> +   _run_btrfs_util_prog subvolume create $src_subv
> +
> +   # Page sized data will be compressed to quite small size,
> +   # which is about 50 bytes for 4K, 100 bytes for 64K,
> +   # far smaller than $inline_size, causing a inlined compressed
> +   # file extent
> +   $XFS_IO_PROG -f -c "pwrite 0 $page_size" -c "fsync" $src_subv/file \
> +   > /dev/null
> +
> +   # Subvolume id=258
> +   _run_btrfs_util_prog subvolume snapshot $src_subv $dst_subv
> +
> +   # Append another page to file in snapshot, since its offset is not
> +   # 0, it will cause regular extent, and should convert inline extent
> +   # to regular too.
> +   $XFS_IO_PROG -c "pwrite $page_size $page_size" $dst_subv/file > 
> /dev/null
> +   _scratch_unmount
> +
> +   # Fsck won't report inline-then-regular as error, as it's so long
> +   # standing that we have no choice but to accept.
> +   # So here we use dump-tree to catch inline extent of $dst_subv, and
> +   # we need piping stdout, call btrfs-progs directly

I don't understand the last sentence of the comment. Need piping
stdout? What pipe, to where? And call btrfs-progs directly

Re: [PATCH v2] btrfs: check if the device is flush capable

2017-04-05 Thread David Sterba
On Wed, Apr 05, 2017 at 11:48:34AM +0800, Anand Jain wrote:
> The blkdev_issue_flush() will check if the write cache is enabled
> before submitting the flush. This will add a code to fail fast if
> its not.
> 
> Signed-off-by: Anand Jain 
> ---
> v1:
>   This patch will replace
>   [PATCH] btrfs: delete unused member nobarriers
> 
> v2:
>   - This patch will now _not_ replace the below patch and this patch
>   should be applied on top of it.
> [PATCH] btrfs: delete unused member nobarriers
>   That's because, Q wc flag is not static, so we can't save its
>   state into nobarries.

Now I'm confused what's the last version of the device flush patches,
please, this one is marked as V2 and has the same subject, yet does
something completely different. Please resend the relevant patches in a
series, we might need another round of review afterward.

>   - commit log is updated.
> 
>  fs/btrfs/disk-io.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b6d047250ce2..bb5816c7f5c4 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3517,6 +3517,11 @@ static void btrfs_dev_issue_flush(struct work_struct 
> *work)
>   */
>  static int write_dev_flush(struct btrfs_device *device, int wait)
>  {
> + struct request_queue *q = bdev_get_queue(device->bdev);
> +
> + if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags))
> + return 0;
> +
>   if (wait) {
>   int ret;
>  
> -- 
> 2.10.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: check if the device is flush capable

2017-04-05 Thread David Sterba
On Tue, Apr 04, 2017 at 06:41:44PM +0800, Anand Jain wrote:
> blkdev_issue_flush() or the empty buffer with the flag REQ_PREFLUSH
> will never return BIO_EOPNOTSUPP as of now, however it should rather
> or it may in future. So for now the BTRFS to have least affected by
> this change at the blk layer, we can check if the device is flush
> capable.
> 
> In this process, I rename the unused nobarrier member as flushable,
> which the below patch made it redundant.
> 
> Per commit b25de9d6da49b1a8760a89672283128aa8c78345
>block: remove BIO_EOPNOTSUPP
> 
> Signed-off-by: Anand Jain 

So up to now, dev->nobarriers was always 0 and thus write_dev_flush
always sent down the empty bio with PREFLUSH. Depending on the actual
device flush capabilities, it either kept the flush bit or removed it in
the generic_make_request_checks, and then proceeded to
generic_make_request.

In short, we always sent the flush bio. What your patch changes is that
this will happen only for devices that have the flush capability,
determined by the queue bit.

This change does not seem to contradict the commit
387125fc722a8ed432066b85a552917343bdafca that brought the
dev->nobarriers, so regarding metadata ordering and superblock flushes,
all seem ok. But this could use some doublechecking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: delete unused member nobarriers

2017-04-05 Thread David Sterba
On Tue, Apr 04, 2017 at 06:59:19PM +0800, Anand Jain wrote:
> 
> 
> On 04/03/2017 08:06 PM, David Sterba wrote:
> > Please update the changelog to say why it's ok to remove it, eg. the
> > commit that removed the last user.
> >
> > commit b25de9d6da49b1a8760a89672283128aa8c78345
> > Author: Christoph Hellwig 
> > Date:   Fri Apr 24 21:41:01 2015 +0200
> >
> > block: remove BIO_EOPNOTSUPP
> 
>   Ah. I should have search the commit log. sorry about that.
> 
> 
> IMO, there is a bug in generic_make_request_checks() in which
> it should rather return EOPNOTSUPP, instead of EIO if QUEUE_FLAG_WC
> is not supported.

I think it works as intended, ie. it's not a bug. The code has been
added to avoid reporting an error code in the mentioned case. See
generic_make_request that would exit, or need to handle EOPNOTSUPP each
time. If you still think this is a bug, please report it to the
blocklayer maintainers.

> 
> 1853 static noinline_for_stack bool
> 1854 generic_make_request_checks(struct bio *bio)
> 1855 {
> ::
> 
> 1892 /*
> 1893  * Filter flush bio's early so that make_request based
> 1894  * drivers without flush support don't have to worry
> 1895  * about them.
> 1896  */
> 1897 if ((bio->bi_opf & (REQ_PREFLUSH | REQ_FUA)) &&
> 1898 !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> 1899 bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> 1900 if (!nr_sectors) {
> 1901 err = 0;
> 1902 goto end_io; <- this should goto  not_supported
> 1903 }
> 1904 }
> 
> ::
> 
> 1946 not_supported:
> 1947 err = -EOPNOTSUPP;
> 
> 
> 
>   Pls ignore this patch.
> 
>   I have submitted
> [PATCH] btrfs: check if the device is flush capable
> 
>   which will remain unaffected by the above bug/not-a-bug
>   in the blk core code.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use q which is already obtained from bdev_get_queue

2017-04-05 Thread David Sterba
On Tue, Apr 04, 2017 at 06:40:19PM +0800, Anand Jain wrote:
> We have already assigned q from bdev_get_queue() so use it.
> And rearrange the code for better view.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: switch to div64_u64 if with a u64 divisor

2017-04-05 Thread David Sterba
On Mon, Apr 03, 2017 at 01:45:24PM -0700, Liu Bo wrote:
> This is fixing code pieces where we use div_u64 when passing a u64 divisor.
> 
> Cc: David Sterba 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface support

2017-04-05 Thread David Sterba
On Tue, Apr 04, 2017 at 09:36:39PM +0530, Chandan Jay Sharma wrote:
> This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface support
> for btrfs. Extended file attributes are 32 bit values (FS_XFLAGS_SYNC,
> FS_XFLAG_IMMUTABLE, etc) which have one-to-one mapping to the flag values
> that can be stored in inode->i_flags (i.e. S_SYNC, S_IMMUTABLE, etc).
> The flags can be set/unset to enable/disable file attributes.
> These attributes are listed/modified by lsattr/chattr.

Ok.

> Signed-off-by: Chandan Jay Sharma 
> ---
>  fs/btrfs/ioctl.c | 148 
> +++
>  1 file changed, 148 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index dabfc7a..5d8486b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -132,6 +132,25 @@ static unsigned int btrfs_flags_to_ioctl(unsigned int 
> flags)
>   return iflags;
>  }
>  
> +/* Transfer ioctl flags to btrfs internal flags */

Please mention which ioctl as we now have 2 that do a similar thing.

> +static unsigned int btrfs_ioctl_to_flags(unsigned int iflags)
> +{
> + unsigned int flags = 0;
> +
> + if (iflags & FS_SYNC_FL)
> + flags |= BTRFS_INODE_SYNC;
> + if (iflags & FS_IMMUTABLE_FL)
> + flags |= BTRFS_INODE_IMMUTABLE;
> + if (iflags & FS_APPEND_FL)
> + flags |= BTRFS_INODE_APPEND;
> + if (iflags & FS_NODUMP_FL)
> + flags |= BTRFS_INODE_NODUMP;
> + if (iflags & FS_NOATIME_FL)
> + flags |= BTRFS_INODE_NOATIME;
> +
> + return flags;
> +}
> +
>  /*
>   * Update inode->i_flags based on the btrfs internal flags.
>   */
> @@ -157,6 +176,75 @@ void btrfs_update_iflags(struct inode *inode)
>  }
>  
>  /*
> + * Propagate flags from i_flags to BTRFS_I(inode)->flags
> + */
> +void btrfs_get_inode_flags(struct btrfs_inode *ip)
> +{
> + unsigned int vfs_fl;
> + unsigned long old_fl, new_fl;
> +
> + do {
> + vfs_fl = ip->vfs_inode.i_flags;
> + old_fl = ip->flags;
> + new_fl = old_fl & ~(BTRFS_INODE_SYNC|BTRFS_INODE_APPEND|
> + BTRFS_INODE_IMMUTABLE|BTRFS_INODE_NOATIME|
> + BTRFS_INODE_DIRSYNC);

The list of supported flags should be declared in a header, something
like

#define BTRFS_INODE_FLAGS_MASK (BTRFS_INODE_SYNC |
BTRFS_INODE_APPEND |
BTRFS_INODE_IMMUTABLE |
BTRFS_INODE_NOATIME |
BTRFS_INODE_DIRSYNC)

> + if (vfs_fl & S_SYNC)
> + new_fl |= BTRFS_INODE_SYNC;
> + if (vfs_fl & S_APPEND)
> + new_fl |= BTRFS_INODE_APPEND;
> + if (vfs_fl & S_IMMUTABLE)
> + new_fl |= BTRFS_INODE_IMMUTABLE;
> + if (vfs_fl & S_NOATIME)
> + new_fl |= BTRFS_INODE_NOATIME;
> + if (vfs_fl & S_DIRSYNC)
> + new_fl |= BTRFS_INODE_DIRSYNC;
> + } while (cmpxchg(&ip->flags, old_fl, new_fl) != old_fl);
> +}
> +
> +/*
> + * Translate btrfs internal flags BTRFS_I(inode)->flags to xflags.

Maybe also reference the related ioctl.

> + */
> +static inline unsigned int btrfs_flags_to_xflags(unsigned int flags)
> +{
> + unsigned int xflags = 0;
> +
> + if (flags & BTRFS_INODE_SYNC)
> + xflags |= FS_XFLAG_SYNC;
> + if (flags & BTRFS_INODE_IMMUTABLE)
> + xflags |= FS_XFLAG_IMMUTABLE;
> + if (flags & BTRFS_INODE_APPEND)
> + xflags |= FS_XFLAG_APPEND;
> + if (flags & BTRFS_INODE_NODUMP)
> + xflags |= FS_XFLAG_NODUMP;
> + if (flags & BTRFS_INODE_NOATIME)
> + xflags |= FS_XFLAG_NOATIME;
> +
> + return xflags;
> +}
> +
> +/*
> + * Transfer xflags flags to ioctl flags.
> + */
> +static inline unsigned int btrfs_xflags_to_ioctl(unsigned int xflags)
> +{
> + unsigned int flags = 0;
> +
> + if (xflags & FS_XFLAG_SYNC)
> + flags |= FS_SYNC_FL;
> + if (xflags & FS_XFLAG_IMMUTABLE)
> + flags |= FS_IMMUTABLE_FL;
> + if (xflags & FS_XFLAG_APPEND)
> + flags |= FS_APPEND_FL;
> + if (xflags & FS_XFLAG_NODUMP)
> + flags |= FS_NODUMP_FL;
> + if (xflags & FS_XFLAG_NOATIME)
> + flags |= FS_NOATIME_FL;
> +
> + return flags;
> +}
> +
> +/*
>   * Inherit flags from the parent inode.
>   *
>   * Currently only the compression flags and the cow flags are inherited.
> @@ -5504,6 +5592,62 @@ static int btrfs_ioctl_set_features(struct file *file, 
> void __user *arg)
>   return ret;
>  }
>  
> +static int btrfs_ioctl_fsgetxattr(struct file *file, void __user *arg)
> +{
> + struct fsxattr fa;
> + struct btrfs_inode *ip = BTRFS_I(file_inode(file));
> +
> + memset(&fa, 0, sizeof(struct fsxattr));
> + btrfs_get_inode_flags(ip);
> + fa.fsx_xflags = btrfs_flags_to_xflags(ip->flags);
> +
> + 

Re: [PATCH] Btrfs: update scrub_parity to use u64 stripe_len

2017-04-05 Thread David Sterba
On Mon, Apr 03, 2017 at 01:45:33PM -0700, Liu Bo wrote:
> Commit 3d8da6781760 ("Btrfs: fix divide error upon chunk's stripe_len")
> changed stripe_len in struct map_lookup to u64, but didn't update
> stripe_len in struct scrub_parity.
> 
> This updates the type and switches to div64_u64_rem to match u64 divisor.
> 
> Cc: David Sterba 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]

2017-04-05 Thread Amir Goldstein
On Wed, Apr 5, 2017 at 3:30 PM, David Sterba  wrote:
> On Wed, Apr 05, 2017 at 11:53:41AM +0100, David Howells wrote:
>> I've added a test to xfstests that exercises the new statx syscall.  However,
>> it fails on btrfs:
>>
>>  Test statx on a directory
>> +[!] stx_nlink differs, 1 != 2
>> +Failed
>> +stat_test failed
>>
>> because a new directory it creates has an nlink of 1, not 2.  Is this a case
>> of my making an incorrect assumption or is it an fs bug?
>
> Afaik nlink == 1 means that there's no accounting of subdirectories, and
> it's a valid value. The 'find' utility can use nlink to optimize
> directory traversal but otherwise I'm not aware of other usage.
>
> All directories in btrfs have nlink == 1.

FYI,

Overlayfs uses nlink = 1 for merge dirs to silence 'find' et al.
Ext4 uses nlink = 1 for directories with more than 32K subdirs
(EXT4_FEATURE_RO_COMPAT_DIR_NLINK).

But in both those fs newly created directories will have nlink = 2.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]

2017-04-05 Thread David Sterba
On Wed, Apr 05, 2017 at 11:53:41AM +0100, David Howells wrote:
> I've added a test to xfstests that exercises the new statx syscall.  However,
> it fails on btrfs:
> 
>  Test statx on a directory
> +[!] stx_nlink differs, 1 != 2
> +Failed
> +stat_test failed
> 
> because a new directory it creates has an nlink of 1, not 2.  Is this a case
> of my making an incorrect assumption or is it an fs bug?

Afaik nlink == 1 means that there's no accounting of subdirectories, and
it's a valid value. The 'find' utility can use nlink to optimize
directory traversal but otherwise I'm not aware of other usage.

All directories in btrfs have nlink == 1.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix extent map leak during fallocate error path

2017-04-05 Thread David Sterba
On Tue, Apr 04, 2017 at 03:20:36AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
> extent map structure. The failure can happen either due to an -ENOMEM
> condition or, when quotas are enabled, due to -EDQUOT for example.
> 
> Signed-off-by: Filipe Manana 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]

2017-04-05 Thread David Howells
I've added a test to xfstests that exercises the new statx syscall.  However,
it fails on btrfs:

 Test statx on a directory
+[!] stx_nlink differs, 1 != 2
+Failed
+stat_test failed

because a new directory it creates has an nlink of 1, not 2.  Is this a case
of my making an incorrect assumption or is it an fs bug?

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] generic/071: check that the fs supports fallocate with the KEEP_SIZE flag

2017-04-05 Thread fdmanana
From: Filipe Manana 

So that the test is skipped for filesystems that don't support it instead
of failing (like NFS 4.2 for example).

Signed-off-by: Filipe Manana 
---
 tests/generic/071 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/generic/071 b/tests/generic/071
index 65ed0c7..c1f90c6 100755
--- a/tests/generic/071
+++ b/tests/generic/071
@@ -46,7 +46,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fs generic
 _supported_os Linux
 _require_scratch
-_require_xfs_io_command "falloc"
+_require_xfs_io_command "falloc -k"
 
 rm -f $seqres.full
 
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/3] generic: test for number of bytes used by files after buffered writes

2017-04-05 Thread fdmanana
From: Filipe Manana 

Test that a filesystem's implementation of the stat(2) system call
reports correct values for the number of blocks allocated for a file
when there are delayed allocations.

This test is motivated by a bug in btrfs which is fixed by the following
path for the linux kernel:

 "Btrfs: fix reported number of inode blocks"

Signed-off-by: Filipe Manana 
---

V2: Added missing _require_odirect and make the test check that the tested
filesystem supports fallocate with the KEEP_SIZE flag.

 tests/generic/422 | 127 ++
 tests/generic/422.out |  41 
 tests/generic/group   |   1 +
 3 files changed, 169 insertions(+)
 create mode 100755 tests/generic/422
 create mode 100644 tests/generic/422.out

diff --git a/tests/generic/422 b/tests/generic/422
new file mode 100755
index 000..cae76a9
--- /dev/null
+++ b/tests/generic/422
@@ -0,0 +1,127 @@
+#! /bin/bash
+# FS QA Test No. generic/422
+#
+# Test that a filesystem's implementation of the stat(2) system call reports
+# correct values for the number of blocks allocated for a file when there are
+# delayed allocations.
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_test
+_require_scratch
+_require_xfs_io_command "falloc -k"
+_require_odirect
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo1 | _filter_xfs_io
+$XFS_IO_PROG -f \
+ -c "pwrite -S 0xaa 0 64K" \
+ -c "truncate 128K" \
+ $SCRATCH_MNT/foo2 | _filter_xfs_io
+$XFS_IO_PROG -f \
+ -c "falloc -k 0 128K" \
+ -c "pwrite -S 0xaa 0 64K" \
+ $SCRATCH_MNT/foo3 | _filter_xfs_io
+touch $SCRATCH_MNT/foo4
+
+# Make sure everything done so far is durably persisted.
+sync
+
+# Now overwrite the extent of the first file.
+$XFS_IO_PROG -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/foo1 | _filter_xfs_io
+
+# Write to a hole of the second file.
+$XFS_IO_PROG -c "pwrite -S 0xff 64K 64K" $SCRATCH_MNT/foo2 | _filter_xfs_io
+# Write again to the same location, just to test that the fs will not account
+# the same write twice.
+$XFS_IO_PROG -c "pwrite -S 0x20 64K 64K" $SCRATCH_MNT/foo2 | _filter_xfs_io
+
+# Write beyond eof of the third file into the pre-allocated extent.
+$XFS_IO_PROG -c "pwrite -S 0xff 64K 64K" $SCRATCH_MNT/foo3 | _filter_xfs_io
+
+# Do a buffered write immediately followed by a direct IO write, without a
+# fsync in between, just to test that page invalidation does not lead to an
+# incorrect number of file blocks reported.
+$XFS_IO_PROG -c "pwrite -S 0xab 0 64K" $SCRATCH_MNT/foo4 | _filter_xfs_io
+$XFS_IO_PROG -d -c "pwrite -S 0xef 0 64K" $SCRATCH_MNT/foo4 | _filter_xfs_io
+
+echo
+echo "Before writeback"
+echo
+
+echo "Space used by file foo1:"
+du -h $SCRATCH_MNT/foo1 | _filter_scratch
+
+echo "Space used by file foo2:"
+du -h $SCRATCH_MNT/foo2 | _filter_scratch
+
+echo "Space used by file foo3:"
+du -h $SCRATCH_MNT/foo3 | _filter_scratch
+
+echo "Space used by file foo4:"
+du -h $SCRATCH_MNT/foo4 | _filter_scratch
+
+sync
+
+# We expect the same file sizes reported by 'du' after writeback finishes.
+echo
+echo "After writeback"
+echo
+
+echo "Space used by file foo1:"
+du -h $SCRATCH_MNT/foo1 | _filter_scratch
+
+echo "Space used by file foo2:"
+du -h $SCRATCH_MNT/foo2 | _filter_scratch
+
+echo "Space used by file foo3:"
+du -h $SCRATCH_MNT/foo3 | _filter_scratch
+
+echo "Space used by file foo4:"
+du -h $SCRATCH_MNT/foo4 | _filter_scratch
+
+status=0
+exit
diff --git a/tests/generic/422.out b/tests/generic/422.out
new file mode 100644
index 000..3696088
--- /dev/null
+++ b/tests/generic/422.out
@@ -0,0 +1,41 @@
+QA output created by 422
+wrote 65536/65536 bytes at offset 0
+XXX Bytes, X ops

[PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags

2017-04-05 Thread fdmanana
From: Filipe Manana 

For example NFS 4.2 supports fallocate but it does not support its
KEEP_SIZE flag, so we want to skip tests that use fallocate with that
flag on filesystems that don't support it.

Suggested-by: Eryu Guan 
Signed-off-by: Filipe Manana 
---
 common/rc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/common/rc b/common/rc
index e1ab2c6..3d0f089 100644
--- a/common/rc
+++ b/common/rc
@@ -2021,8 +2021,8 @@ _require_xfs_io_command()
"chproj")
testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1`
;;
-   "falloc" )
-   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
+   "falloc*" )
+   testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1`
;;
"fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" | "funshare")
testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic: test for number of bytes used by files after buffered writes

2017-04-05 Thread Eryu Guan
On Wed, Apr 05, 2017 at 11:13:56AM +0100, Filipe Manana wrote:
> On Wed, Apr 5, 2017 at 10:46 AM, Eryu Guan  wrote:
> > On Tue, Apr 04, 2017 at 03:23:35AM +0100, fdman...@kernel.org wrote:
> >> From: Filipe Manana 
> >>
> >> Test that a filesystem's implementation of the stat(2) system call
> >> reports correct values for the number of blocks allocated for a file
> >> when there are delayed allocations.
> >>
> >> This test is motivated by a bug in btrfs which is fixed by the following
> >> path for the linux kernel:
> >>
> >>  "Btrfs: fix reported number of inode blocks"
> >>
> >> Signed-off-by: Filipe Manana 
...
> >
> >> +_require_xfs_io_command "falloc"
> >
> > This has some problems with the "-k" flag. NFSv4.2 supports fallocate(2)
> > but not KEEP_SIZE flag, so test fails with NFSv4.2 mount.
> >
> >  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > +fallocate: Operation not supported
> >  wrote 65536/65536 bytes at offset 0
> >
> > We met the same issue before with generic/071.
> > http://www.spinics.net/lists/fstests/msg03527.html
> >
> > So I have two options now, one is the method proposed by Eric in above
> > thread, run falloc command with $param.
> >
> > common/rc::_require_xfs_io_command
> > -   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> > +   testio=`$XFS_IO_PROG -F -f -c "falloc $param 0 1m" 
> > $testfile 2>&1`
> >
> > tests/generic/422:
> > -_require_xfs_io_command "falloc"
> > +_require_xfs_io_command "falloc" "-k"
> >
> >
> > The other is requiring "falloc -k" in the test:
> >
> > common/rc::_require_xfs_io_command
> > -   "falloc" )
> > -   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> > +   falloc* )
> > +   testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 
> > 2>&1`
> >
> > tests/generic/422:
> > -_require_xfs_io_command "falloc"
> > +_require_xfs_io_command "falloc -k"
> >
> > I slightly prefer the second way, as it doesn't change the default
> > behavior and makes falloc a special-case.
> > (_require_xfs_io_command "" "" behaves the same as other
> > commands).
> 
> Ok, do you prefer the changes to common/rc in a separate patch or can
> they be folded in v2 for this test?

A separate patch would be good, as it fixes a separate & long standing
issue. Thanks!

Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic: test for number of bytes used by files after buffered writes

2017-04-05 Thread Filipe Manana
On Wed, Apr 5, 2017 at 10:46 AM, Eryu Guan  wrote:
> On Tue, Apr 04, 2017 at 03:23:35AM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Test that a filesystem's implementation of the stat(2) system call
>> reports correct values for the number of blocks allocated for a file
>> when there are delayed allocations.
>>
>> This test is motivated by a bug in btrfs which is fixed by the following
>> path for the linux kernel:
>>
>>  "Btrfs: fix reported number of inode blocks"
>>
>> Signed-off-by: Filipe Manana 
>
> Overall this looks good to me, test fails with btrfs and passes with
> other filesystems (tested xfs, extN). Some comments below.
>
>> ---
>>  tests/generic/422 | 126 
>> ++
>>  tests/generic/422.out |  41 
>>  tests/generic/group   |   1 +
>>  3 files changed, 168 insertions(+)
>>  create mode 100755 tests/generic/422
>>  create mode 100644 tests/generic/422.out
>>
>> diff --git a/tests/generic/422 b/tests/generic/422
>> new file mode 100755
>> index 000..20cd54a
>> --- /dev/null
>> +++ b/tests/generic/422
>> @@ -0,0 +1,126 @@
>> +#! /bin/bash
>> +# FS QA Test No. generic/422
>> +#
>> +# Test that a filesystem's implementation of the stat(2) system call reports
>> +# correct values for the number of blocks allocated for a file when there 
>> are
>> +# delayed allocations.
>> +#
>> +#---
>> +#
>> +# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
>> +# Author: Filipe Manana 
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software Foundation,
>> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> +#---
>> +#
>> +
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +tmp=/tmp/$$
>> +status=1 # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> + cd /
>> + rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# real QA test starts here
>> +_supported_fs generic
>> +_supported_os Linux
>> +_require_test
>> +_require_scratch
>
> Need "_require_odirect" too, it does direct I/O.
>
>> +_require_xfs_io_command "falloc"
>
> This has some problems with the "-k" flag. NFSv4.2 supports fallocate(2)
> but not KEEP_SIZE flag, so test fails with NFSv4.2 mount.
>
>  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +fallocate: Operation not supported
>  wrote 65536/65536 bytes at offset 0
>
> We met the same issue before with generic/071.
> http://www.spinics.net/lists/fstests/msg03527.html
>
> So I have two options now, one is the method proposed by Eric in above
> thread, run falloc command with $param.
>
> common/rc::_require_xfs_io_command
> -   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> +   testio=`$XFS_IO_PROG -F -f -c "falloc $param 0 1m" $testfile 
> 2>&1`
>
> tests/generic/422:
> -_require_xfs_io_command "falloc"
> +_require_xfs_io_command "falloc" "-k"
>
>
> The other is requiring "falloc -k" in the test:
>
> common/rc::_require_xfs_io_command
> -   "falloc" )
> -   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> +   falloc* )
> +   testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1`
>
> tests/generic/422:
> -_require_xfs_io_command "falloc"
> +_require_xfs_io_command "falloc -k"
>
> I slightly prefer the second way, as it doesn't change the default
> behavior and makes falloc a special-case.
> (_require_xfs_io_command "" "" behaves the same as other
> commands).

Ok, do you prefer the changes to common/rc in a separate patch or can
they be folded in v2 for this test?

thanks

>
> Thanks,
> Eryu
>
>> +
>> +rm -f $seqres.full
>> +
>> +_scratch_mkfs >>$seqres.full 2>&1
>> +_scratch_mount
>> +
>> +$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo1 | _filter_xfs_io
>> +$XFS_IO_PROG -f \
>> + -c "pwrite -S 0xaa 0 64K" \
>> + -c "truncate 128K" \
>> + $SCRATCH_MNT/foo2 | _filter_xfs_io
>> +$XFS_IO_PROG -f \
>> + -c "falloc -k 0 128K" \
>> + -c "pwrite -S 0xaa 0 64K" \
>> + $SCRATCH_MNT/foo3 | _filter_xfs_io
>> +touch $SCRATCH_MNT/foo4
>> +
>> +# Make sure everything done so far is durably persisted.
>> +syn

Re: [PATCH] generic: test for number of bytes used by files after buffered writes

2017-04-05 Thread Eryu Guan
On Tue, Apr 04, 2017 at 03:23:35AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that a filesystem's implementation of the stat(2) system call
> reports correct values for the number of blocks allocated for a file
> when there are delayed allocations.
> 
> This test is motivated by a bug in btrfs which is fixed by the following
> path for the linux kernel:
> 
>  "Btrfs: fix reported number of inode blocks"
> 
> Signed-off-by: Filipe Manana 

Overall this looks good to me, test fails with btrfs and passes with
other filesystems (tested xfs, extN). Some comments below.

> ---
>  tests/generic/422 | 126 
> ++
>  tests/generic/422.out |  41 
>  tests/generic/group   |   1 +
>  3 files changed, 168 insertions(+)
>  create mode 100755 tests/generic/422
>  create mode 100644 tests/generic/422.out
> 
> diff --git a/tests/generic/422 b/tests/generic/422
> new file mode 100755
> index 000..20cd54a
> --- /dev/null
> +++ b/tests/generic/422
> @@ -0,0 +1,126 @@
> +#! /bin/bash
> +# FS QA Test No. generic/422
> +#
> +# Test that a filesystem's implementation of the stat(2) system call reports
> +# correct values for the number of blocks allocated for a file when there are
> +# delayed allocations.
> +#
> +#---
> +#
> +# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
> +# Author: Filipe Manana 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_test
> +_require_scratch

Need "_require_odirect" too, it does direct I/O.

> +_require_xfs_io_command "falloc"

This has some problems with the "-k" flag. NFSv4.2 supports fallocate(2)
but not KEEP_SIZE flag, so test fails with NFSv4.2 mount.

 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+fallocate: Operation not supported
 wrote 65536/65536 bytes at offset 0

We met the same issue before with generic/071.
http://www.spinics.net/lists/fstests/msg03527.html

So I have two options now, one is the method proposed by Eric in above
thread, run falloc command with $param.

common/rc::_require_xfs_io_command
-   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
+   testio=`$XFS_IO_PROG -F -f -c "falloc $param 0 1m" $testfile 
2>&1`

tests/generic/422:
-_require_xfs_io_command "falloc"
+_require_xfs_io_command "falloc" "-k"


The other is requiring "falloc -k" in the test:

common/rc::_require_xfs_io_command
-   "falloc" )
-   testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
+   falloc* )
+   testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1`

tests/generic/422:
-_require_xfs_io_command "falloc"
+_require_xfs_io_command "falloc -k"

I slightly prefer the second way, as it doesn't change the default
behavior and makes falloc a special-case.
(_require_xfs_io_command "" "" behaves the same as other
commands).

Thanks,
Eryu

> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +
> +$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo1 | _filter_xfs_io
> +$XFS_IO_PROG -f \
> + -c "pwrite -S 0xaa 0 64K" \
> + -c "truncate 128K" \
> + $SCRATCH_MNT/foo2 | _filter_xfs_io
> +$XFS_IO_PROG -f \
> + -c "falloc -k 0 128K" \
> + -c "pwrite -S 0xaa 0 64K" \
> + $SCRATCH_MNT/foo3 | _filter_xfs_io
> +touch $SCRATCH_MNT/foo4
> +
> +# Make sure everything done so far is durably persisted.
> +sync
> +
> +# Now overwrite the extent of the first file.
> +$XFS_IO_PROG -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/foo1 | _filter_xfs_io
> +
> +# Write to a hole of the second file.
> +$XFS_IO_PROG -c "pwrite -S 0xff 64K 64K" $SCRATCH_MNT/foo2 | _filter_xfs_io
> +# Write again to the same location, just to test that the fs will not account
> +# the same

[PATCH 2/2] btrfs-progs: print-tree: add validation to print_chunk

2017-04-05 Thread Lu Fengqi
In print_chunk, validate the value of uuid_offset when read the dev_uuid of
stripe.

Signed-off-by: Lu Fengqi 
---
 cmds-inspect-dump-super.c |  1 +
 print-tree.c  | 20 +++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index 48b5219c..a89fe451 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -68,6 +68,7 @@ static void print_sys_chunk_array(struct btrfs_super_block 
*sb)
return;
}
write_extent_buffer(buf, sb, 0, BTRFS_SUPER_INFO_SIZE);
+   buf->len = BTRFS_SUPER_INFO_SIZE;
array_size = btrfs_super_sys_array_size(sb);
 
array_ptr = sb->sys_chunk_array;
diff --git a/print-tree.c b/print-tree.c
index 5af80e87..8352e03d 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -199,8 +199,14 @@ void print_chunk(struct extent_buffer *eb, struct 
btrfs_chunk *chunk)
 {
int num_stripes = btrfs_chunk_num_stripes(eb, chunk);
int i;
+   u32 chunk_item_size = btrfs_chunk_item_size(num_stripes);
char chunk_flags_str[32] = {0};
 
+   if ((unsigned long)chunk + chunk_item_size > eb->len) {
+   printf("\t\tchunk item invalid\n");
+   return;
+   }
+
bg_flags_to_str(btrfs_chunk_type(eb, chunk), chunk_flags_str);
printf("\t\tlength %llu owner %llu stripe_len %llu type %s\n",
   (unsigned long long)btrfs_chunk_length(eb, chunk),
@@ -216,9 +222,21 @@ void print_chunk(struct extent_buffer *eb, struct 
btrfs_chunk *chunk)
for (i = 0 ; i < num_stripes ; i++) {
unsigned char dev_uuid[BTRFS_UUID_SIZE];
char str_dev_uuid[BTRFS_UUID_UNPARSED_SIZE];
+   u64 uuid_offset;
+   u64 stripe_offset;
+
+   uuid_offset = (unsigned long)btrfs_stripe_dev_uuid_nr(chunk, i);
+   stripe_offset = (unsigned long)btrfs_stripe_nr(chunk, i);
+
+   if (uuid_offset < stripe_offset ||
+   (uuid_offset + BTRFS_UUID_SIZE) >
+   (stripe_offset + sizeof(struct btrfs_stripe))) {
+   printf("\t\t\tstripe %d invalid\n", i);
+   break;
+   }
 
read_extent_buffer(eb, dev_uuid,
-   (unsigned long)btrfs_stripe_dev_uuid_nr(chunk, i),
+   uuid_offset,
BTRFS_UUID_SIZE);
uuid_unparse(dev_uuid, str_dev_uuid);
printf("\t\t\tstripe %d devid %llu offset %llu\n", i,
-- 
2.12.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs-progs: dump-super: check array_size in print_sys_chunk_array

2017-04-05 Thread Lu Fengqi
Without validation of array_size, the dump-super may lead to a bad
memory access.

Signed-off-by: Lu Fengqi 
---
 cmds-inspect-dump-super.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index ee2c8e3a..48b5219c 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -62,16 +62,23 @@ static void print_sys_chunk_array(struct btrfs_super_block 
*sb)
struct btrfs_key key;
int item;
 
-   buf = malloc(sizeof(*buf) + sizeof(*sb));
+   buf = malloc(sizeof(*buf) + BTRFS_SUPER_INFO_SIZE);
if (!buf) {
error("not enough memory");
-   goto out;
+   return;
}
-   write_extent_buffer(buf, sb, 0, sizeof(*sb));
+   write_extent_buffer(buf, sb, 0, BTRFS_SUPER_INFO_SIZE);
array_size = btrfs_super_sys_array_size(sb);
 
array_ptr = sb->sys_chunk_array;
sb_array_offset = offsetof(struct btrfs_super_block, sys_chunk_array);
+
+   if (array_size > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) {
+   error("sys_array_size %u shouldn't exceed %u bytes",
+   array_size, BTRFS_SYSTEM_CHUNK_ARRAY_SIZE);
+   goto out;
+   }
+
cur_offset = 0;
item = 0;
 
@@ -124,8 +131,8 @@ static void print_sys_chunk_array(struct btrfs_super_block 
*sb)
item++;
}
 
-   free(buf);
 out:
+   free(buf);
return;
 
 out_short_read:
-- 
2.12.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread Jan Kara
On Wed 05-04-17 11:43:32, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
> 
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to 
> >> > > > be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
> >> > > > shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
> >> > > > possible
> >> > > > i_version numbers we gave away but did not write to disk but 
> >> > > > still...).
> >> > > > Thoughts?
> >> > 
> >> > How hard is this for filesystems to support?  Do they need an on-disk
> >> > format change to keep track of the crash counter?  Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> > 
> >> 
> >> Yeah, I imagine we'd need a on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead.  So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
> 
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not ever
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

Yes, that was my thinking.

> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields.  Maybe there are enough bits.

Furthermore you'd have a potential problem that you need to change
i_version on disk just because you are reading after a crash and such
changes tend to be problematic (think of read-only mounts and stuff like
that).
 
> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb.  Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
> 
> If you want to change a file with an even i_version, you subtract
>   crash-count*large-number
> to the i_version, then set lsb.  This is written to stable storage before
> the change.
> 
> If a file has not been changed for a while, you can add
>   crash-count*large-number
> and clear lsb.
> 
> The lsb of the i_version would be for internal use only.  It would not
> be visible outside the filesystem.
> 
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before an changes, and the
> unknown is when does it make sense to switch to 'even'.

Well, there is also a problem that you would need to somehow remember with
which 'crash count' the i_version has been previously reported as that is
not stored on disk with my scheme. So I don't think we can easily use your
scheme.

So the options we have are:

1) Keep i_version as is, make clients also check for i_ctime.
   Pro: No on-disk format changes.
   Cons: After a crash, i_version can go backwards (but when file changes
   i_version, i_ctime pair should be still different) or not, data can be
   old or not.

2) Fsync when reporting i_version.
   Pro: No on-disk format changes, strong consistency of i_version and
data.
   Cons: Difficult to implement for filesystems due to locking constrains.
 High performance overhead or i_version reporting.

3) Some variant of crash counter.
   Pro: i_version cannot go backwards.
   Cons: Requires on-disk format changes. After a crash data can be old
 (however i_version increased).

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Do different btrfs volumes compete for CPU?

2017-04-05 Thread Marat Khalili

On 04/04/17 20:36, Peter Grandi wrote:

SATA works for external use, eSATA works well, but what really
matters is the chipset of the adapter card.
eSATA might be sound electrically, but mechanically it is awful. Try to 
run it for months in a crowded server room, and inevitably you'll get 
disconnections and data corruption. Tried different cables, brackets -- 
same result. If you ever used eSATA connector, you'd feel it.



In my experience JMicron is not so good, Marvell a bit better,
best is to use a recent motherboard chipset with a SATA-eSATA
internal cable and bracket.
That's exactly what I used to use: internal controller of Z77 chipset + 
bracket(s).



But that does not change the fact that it is a library and work
is initiated by user requests which are not per-subvolume, but
in effect per-volume.

That's the answer I was looking for.


It is a way to do so and not a very good way. There is no
obviously good way to define "real usage" in the presence of
hard-links and reflinking, and qgroups use just one way to
define it. A similar problem happens with processes in the
presence of shared pages, multiple mapped shared libraries etc.
No need to over-generalize. There's an obvious good way to define "real 
usage" of a subvolume and its snapshots as long as it don't share any 
data with other subvolumes, as is often the case. If it does share, two 
figures -- exclusive and referenced, like in qgroups -- are sufficient 
for most tasks.



The problem is that
both hard-links and ref-linking create really significant
ambiguities as to used space. Plus the same problem would happen
with directories instead of subvolumes and hard-links instead of
reflinked snapshots.
You're right, although with hard-links there's at least remote chance to 
estimate storage use with usermode scripts.



ASMedia USB3 chipsets are fairly reliable at the least the card
ones on the system side. The ones on the disk side I don't know
much about.
This is getting increasingly off-topic, but our mainstay are CFI 5-disk 
DAS boxes (8253JDGG to be exact) filled with WD Red-s in RAID5 
configuration. They are no longer produced and getting harder and harder 
to source, but showed themselves as very reliable. According to lsusb 
they contain JMicron JMS567 SATA 6Gb/s bridge.


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html