[RFC PATCH V2 3/8] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size.

2014-06-11 Thread Chandan Rajendra
Currently, the code reserves and releases extents in multiples of
PAGE_CACHE_SIZE. Reserve and release them in multiples of the block size
(root->sectorsize) instead.
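
The block-size-aligned reservation can be illustrated with a small user-space
sketch. This is only a model of the arithmetic in the diff below, not kernel
code: align_up() is a stand-in for the kernel's ALIGN() macro, and
reserve_bytes_for_write() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of a (a must be a power of two).
 * User-space stand-in for the kernel's ALIGN() macro. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Bytes to reserve for a write of write_bytes starting at file offset pos,
 * rounded out to whole blocks of size sectorsize (hypothetical helper). */
static uint64_t reserve_bytes_for_write(uint64_t pos, uint64_t write_bytes,
					uint64_t sectorsize)
{
	/* Offset of pos within its block. */
	uint64_t sector_offset = pos & (sectorsize - 1);

	return align_up(write_bytes + sector_offset, sectorsize);
}
```

For example, a 100-byte write at offset 5000 reserves 2048 bytes with a 2k
block size, whereas the page-based computation reserves a full 4096-byte page
on a machine with 4k pages.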

Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/file.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 006af2f..541e227 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1339,18 +1339,21 @@ fail:
 static noinline int
 lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
size_t num_pages, loff_t pos,
+   size_t write_bytes,
u64 *lockstart, u64 *lockend,
struct extent_state **cached_state)
 {
+   struct btrfs_root *root = BTRFS_I(inode)->root;
u64 start_pos;
u64 last_pos;
int i;
int ret = 0;
 
-   start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
-   last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+   start_pos = pos & ~((u64)root->sectorsize - 1);
+   last_pos = start_pos
+   + ALIGN(pos + write_bytes - start_pos, root->sectorsize) - 1;
 
-   if (start_pos < inode->i_size) {
+   if (start_pos < inode->i_size) {
struct btrfs_ordered_extent *ordered;
lock_extent_bits(&BTRFS_I(inode)->io_tree,
 start_pos, last_pos, 0, cached_state);
@@ -1468,6 +1471,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
while (iov_iter_count(i) > 0) {
size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+   size_t sector_offset;
size_t write_bytes = min(iov_iter_count(i),
 nrptrs * (size_t)PAGE_CACHE_SIZE -
 offset);
@@ -1488,7 +1492,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
break;
}
 
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+   sector_offset = pos & (root->sectorsize - 1);
+   reserve_bytes = ALIGN(write_bytes + sector_offset, root->sectorsize);
+
ret = btrfs_check_data_free_space(inode, reserve_bytes);
if (ret == -ENOSPC &&
(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
@@ -1503,7 +1509,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
num_pages = (write_bytes + offset +
 PAGE_CACHE_SIZE - 1) >>
PAGE_CACHE_SHIFT;
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+
+   reserve_bytes = ALIGN(write_bytes + sector_offset,
+   root->sectorsize);
ret = 0;
} else {
ret = -ENOSPC;
@@ -1536,8 +1544,8 @@ again:
break;
 
ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
- pos, &lockstart, &lockend,
- &cached_state);
+   pos, write_bytes, &lockstart, &lockend,
+   &cached_state);
if (ret < 0) {
if (ret == -EAGAIN)
goto again;
@@ -1574,9 +1582,9 @@ again:
 * we still have an outstanding extent for the chunk we actually
 * managed to copy.
 */
-   if (num_pages > dirty_pages) {
-   release_bytes = (num_pages - dirty_pages) <<
-   PAGE_CACHE_SHIFT;
+   if (write_bytes > copied) {
+   release_bytes = (write_bytes - copied)
+   & ~((u64)root->sectorsize - 1);
if (copied > 0) {
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->outstanding_extents++;
@@ -1590,7 +1598,7 @@ again:
 release_bytes);
}
 
-   release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
+   release_bytes = ALIGN(copied + sector_offset, root->sectorsize);
 
if (copied > 0)
ret = btrfs_dirty_pages(root, inode, pages,
@@ -1609,7 +1617,7 @@ again:
if (only_release_metadata && copied > 0) {
u64 lockstart = round_down(pos, root->sectorsize);
u64 lockend = lockstart +
-   (dirty_pages << PAGE_CACHE_SHIFT) - 1;
+   ALIGN(copied, root->sectorsize) - 1;
 

[RFC PATCH V2 0/8] Btrfs: Subpagesize-blocksize: Get rid of whole page I/O

2014-06-11 Thread Chandan Rajendra
This patchset continues with the work posted earlier at
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34036.html.

Changes from V1:
1. Remove usage of bio_vec->bv_{len,offset} in end_bio_extent_readpage()
   and end_bio_extent_writepage().

Xfstests' generic tests were run on an x86_64 machine with the patches
applied.

On multiple runs of the tests with 4k blocksize, the 'umount' process would
sometimes block indefinitely, causing the 'hung task detector' to print the
function call trace. There are also occasional warning messages from
btree_invalidatepage() indicating that the PG_private flag of a page is
still set.

With 2k blocksize, only a few of Xfstests' generic tests pass.

The following is a list of known TODO items which will be implemented in
future revisions of this patchset:
1. Remove usage of bvec->{bv_offset, bv_len} from btrfs_csum_one_bio().
2. Get __extent_writepage() to write dirty blocks that don't start at
   page_offset(page). In such a scenario and with the current
   patchset, btrfs_csum_one_bio() hits a BUG_ON() when searching for a
   non-existent ordered extent that would begin at the file offset mapped by
   the first byte of the corresponding page.
3. Remove PAGE_CACHE_SIZE delalloc reservation in
   btrfs_writepage_fixup_worker().
4. Create separate slab caches for 'extent buffer head' and 'extent buffer'.
5. Add 'leak list' tracking for 'extent buffer' instances.
6. Rename EXTENT_BUFFER_TREE_REF and EXTENT_BUFFER_IN_TREE to
   EXTENT_BUFFER_HEAD_TREE_REF and EXTENT_BUFFER_HEAD_IN_TREE respectively.
7. Get Xfstests' generic tests to successfully run on both 4k and 2k
   blocksizes.

Chandan Rajendra (6):
  Btrfs: subpagesize-blocksize: Get rid of whole page reads.
  Btrfs: subpagesize-blocksize: Get rid of whole page writes.
  Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release
    extents aligned to block size.
  Btrfs: subpagesize-blocksize: Read tree blocks whose size is
    <PAGE_CACHE_SIZE.
  Btrfs: subpagesize-blocksize: Write only dirty extent buffers
    belonging to a page
  Btrfs: subpagesize-blocksize: Compute and look up csums based on
    sectorsized blocks.

Chandra Seetharaman (2):
  Btrfs: subpagesize-blocksize: Define extent_buffer_head.
  Btrfs: subpagesize-blocksize: Allow mounting filesystems where
    sectorsize != PAGE_SIZE

 fs/btrfs/backref.c   |2 +-
 fs/btrfs/ctree.c |2 +-
 fs/btrfs/ctree.h |6 +-
 fs/btrfs/disk-io.c   |  117 +++--
 fs/btrfs/disk-io.h   |3 +
 fs/btrfs/extent-tree.c   |6 +-
 fs/btrfs/extent_io.c | 1131 +-
 fs/btrfs/extent_io.h |   48 +-
 fs/btrfs/file-item.c |   85 ++--
 fs/btrfs/file.c  |   32 +-
 fs/btrfs/inode.c |   45 +-
 fs/btrfs/volumes.c   |2 +-
 fs/btrfs/volumes.h   |3 +
 include/trace/events/btrfs.h |2 +-
 14 files changed, 1004 insertions(+), 480 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH V2 2/8] Btrfs: subpagesize-blocksize: Get rid of whole page writes.

2014-06-11 Thread Chandan Rajendra
This commit brings back the functions that set/clear EXTENT_WRITEBACK bits.
These are required to reliably clear the PG_writeback page flag.
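
The reason per-block writeback tracking is needed can be shown with a
user-space sketch. It is only a model under assumed names (struct model_page
and both helpers are hypothetical): a bitmask stands in for the
EXTENT_WRITEBACK bits of the blocks in one page, and a bool stands in for
PG_writeback, which may be cleared only once no block in the page is still
under writeback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one page holding up to 32 blocks. */
struct model_page {
	uint32_t block_writeback;	/* one bit per block under writeback */
	bool pg_writeback;		/* stands in for PG_writeback */
};

/* Mark one block as under writeback; the page is then under writeback too. */
static void model_start_block_writeback(struct model_page *p, unsigned int block)
{
	assert(block < 32);
	p->block_writeback |= UINT32_C(1) << block;
	p->pg_writeback = true;
}

/* Finish writeback of one block; end page writeback only when no block
 * of the page remains under writeback. */
static void model_end_block_writeback(struct model_page *p, unsigned int block)
{
	assert(block < 32);
	p->block_writeback &= ~(UINT32_C(1) << block);
	if (p->block_writeback == 0)
		p->pg_writeback = false;
}
```

With two blocks of a page under writeback, completing the first leaves
pg_writeback set; only completing the second clears it.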

Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/extent_io.c | 147 +++
 fs/btrfs/extent_io.h |   2 +-
 fs/btrfs/inode.c |  45 
 3 files changed, 136 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fa28545..20d8bdc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1293,6 +1293,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
cached_state, mask);
 }
 
+static int set_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end,
+   struct extent_state **cached_state, gfp_t mask)
+{
+   return set_extent_bit(tree, start, end, EXTENT_WRITEBACK, NULL,
+   cached_state, mask);
+}
+
+static int clear_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end,
+   struct extent_state **cached_state, gfp_t mask)
+{
+   return clear_extent_bit(tree, start, end, EXTENT_WRITEBACK, 1, 0,
+   cached_state, mask);
+}
+
 /*
  * either insert or lock state struct between start and end use mask to tell
  * us if waiting is desired.
@@ -1399,6 +1413,7 @@ static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
page_cache_release(page);
index++;
}
+   set_extent_writeback(tree, start, end, NULL, GFP_NOFS);
return 0;
 }
 
@@ -1966,6 +1981,16 @@ static void check_page_locked(struct extent_io_tree *tree, struct page *page)
}
 }
 
+static void check_page_writeback(struct extent_io_tree *tree, struct page *page)
+{
+   u64 start = page_offset(page);
+   u64 end = start + PAGE_CACHE_SIZE - 1;
+
+   if (!test_range_bit(tree, start, end, EXTENT_WRITEBACK, 0, NULL))
+   end_page_writeback(page);
+}
+
+/*
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
  * io_failure_record is used to record state as we go through all the
@@ -2359,27 +2384,69 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 }
 
 /* lots and lots of room for performance fixes in the end_bio funcs */
-
-int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
+void end_extents_write(struct inode *inode, int err, u64 start, u64 end)
 {
+   struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
int uptodate = (err == 0);
-   struct extent_io_tree *tree;
+   pgoff_t index, end_index;
+   u64 page_start, page_end;
+   struct page *page;
int ret;
 
-   tree = &BTRFS_I(page->mapping->host)->io_tree;
+   index = start >> PAGE_CACHE_SHIFT;
+   end_index = end >> PAGE_CACHE_SHIFT;
 
-   if (tree->ops && tree->ops->writepage_end_io_hook) {
-   ret = tree->ops->writepage_end_io_hook(page, start,
-  end, NULL, uptodate);
-   if (ret)
-   uptodate = 0;
+   page_start = start;
+
+   while (index <= end_index) {
+   page = find_get_page(inode->i_mapping, index);
+   BUG_ON(!page);
+
+   page_end = min_t(u64, end, page_offset(page) + PAGE_CACHE_SIZE - 1);
+
+   if (tree->ops && tree->ops->writepage_end_io_hook) {
+   ret = tree->ops->writepage_end_io_hook(page,
+   page_start, page_end,
+   NULL, uptodate);
+   if (ret)
+   uptodate = 0;
+   }
+
+   page_start = page_end + 1;
+
+   ++index;
+
+   if (!uptodate) {
+   ClearPageUptodate(page);
+   SetPageError(page);
+   }
+
+   page_cache_release(page);
}
+}
+
+static void clear_extent_and_page_writeback(struct address_space *mapping,
+   struct extent_io_tree *tree,
+   struct btrfs_io_bio *io_bio)
+{
+   struct page *page;
+   pgoff_t index;
+   u64 offset, len;
 
-   if (!uptodate) {
-   ClearPageUptodate(page);
-   SetPageError(page);
+   offset  = io_bio->start_offset;
+   len = io_bio->len;
+
+   clear_extent_writeback(tree, offset, offset + len - 1, NULL,
+   GFP_ATOMIC);
+
+   index = offset >> PAGE_CACHE_SHIFT;
+   while (offset < io_bio->start_offset + len) {
+   page = find_get_page(mapping, index);
+   check_page_writeback(tree, page);
+   page_cache_release(page);
+   

[RFC PATCH V2 6/8] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page

2014-06-11 Thread Chandan Rajendra
For the subpagesize-blocksize scenario, this patch adds the ability to write a
single extent buffer to the disk.

Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/disk-io.c   |  20 ++--
 fs/btrfs/extent_io.c | 277 ++-
 2 files changed, 243 insertions(+), 54 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b2c4e9d..28a45f6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -466,17 +466,23 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
 {
-   u64 start = page_offset(page);
-   u64 found_start;
struct extent_buffer *eb;
+   u64 found_start;
 
eb = (struct extent_buffer *)page->private;
-   if (page != eb->pages[0])
+   if (page != eb_head(eb)->pages[0])
return 0;
-   found_start = btrfs_header_bytenr(eb);
-   if (WARN_ON(found_start != start || !PageUptodate(page)))
-   return 0;
-   csum_tree_block(root, eb, 0);
+   do {
+   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
+   continue;
+   if (WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)))
+   continue;
+   found_start = btrfs_header_bytenr(eb);
+   if (WARN_ON(found_start != eb->start))
+   return 0;
+   csum_tree_block(root, eb, 0);
+   } while ((eb = eb->eb_next) != NULL);
+
return 0;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c095575..507ef5b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3420,32 +3420,53 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
TASK_UNINTERRUPTIBLE);
 }
 
-static int lock_extent_buffer_for_io(struct extent_buffer *eb,
-struct btrfs_fs_info *fs_info,
-struct extent_page_data *epd)
+static void lock_extent_buffer_pages(struct extent_buffer_head *ebh,
+   struct extent_page_data *epd)
 {
+   struct extent_buffer *eb = &ebh->eb;
unsigned long i, num_pages;
-   int flush = 0;
+
+   num_pages = num_extent_pages(eb->start, eb->len);
+   for (i = 0; i < num_pages; i++) {
+   struct page *p = extent_buffer_page(eb, i);
+
+   if (!trylock_page(p)) {
+   flush_write_bio(epd);
+   lock_page(p);
+   }
+   }
+
+   return;
+}
+
+static int lock_extent_buffer_for_io(struct extent_buffer *eb,
+   struct btrfs_fs_info *fs_info,
+   struct extent_page_data *epd)
+{
+   int dirty;
int ret = 0;
 
if (!btrfs_try_tree_write_lock(eb)) {
-   flush = 1;
flush_write_bio(epd);
btrfs_tree_lock(eb);
}
 
-   if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
+   if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
+   dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
btrfs_tree_unlock(eb);
-   if (!epd->sync_io)
-   return 0;
-   if (!flush) {
-   flush_write_bio(epd);
-   flush = 1;
+   if (!epd->sync_io) {
+   if (!dirty)
+   return 1;
+   else
+   return 2;
}
+
+   flush_write_bio(epd);
+
while (1) {
wait_on_extent_buffer_writeback(eb);
btrfs_tree_lock(eb);
-   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
+   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
break;
btrfs_tree_unlock(eb);
}
@@ -3456,27 +3477,25 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
 * under IO since we can end up having no IO bits set for a short period
 * of time.
 */
-   spin_lock(&eb->refs_lock);
-   if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
-   set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-   spin_unlock(&eb->refs_lock);
+   spin_lock(&eb_head(eb)->refs_lock);
+   if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
+   set_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
+   spin_unlock(&eb_head(eb)->refs_lock);
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 -eb->len,
 fs_info->dirty_metadata_batch);
-   ret = 1;
+   ret = 0;

[RFC PATCH V2 7/8] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE

2014-06-11 Thread Chandan Rajendra
From: Chandra Seetharaman sekha...@us.ibm.com

This patch allows mounting filesystems with a blocksize smaller than
PAGE_SIZE.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/disk-io.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 28a45f6..3bb7072 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2599,12 +2599,6 @@ int open_ctree(struct super_block *sb,
goto fail_sb_buffer;
}
 
-   if (sectorsize != PAGE_SIZE) {
-   printk(KERN_WARNING "BTRFS: Incompatible sector size(%lu) "
-  "found on %s\n", (unsigned long)sectorsize, sb->s_id);
-   goto fail_sb_buffer;
-   }
-
mutex_lock(&fs_info->chunk_mutex);
ret = btrfs_read_sys_array(tree_root);
mutex_unlock(&fs_info->chunk_mutex);
-- 
1.8.3.1



[RFC PATCH V2 5/8] Btrfs: subpagesize-blocksize: Read tree blocks whose size is <PAGE_CACHE_SIZE.

2014-06-11 Thread Chandan Rajendra
In the case of subpagesize-blocksize, this patch makes it possible to read
only a single metadata block from the disk instead of all the metadata blocks
that map into a page.

Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/disk-io.c   |  45 -
 fs/btrfs/disk-io.h   |   3 ++
 fs/btrfs/extent_io.c | 135 +++
 3 files changed, 137 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bda2157..b2c4e9d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -413,7 +413,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
int mirror_num = 0;
int failed_mirror = 0;
 
-   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
while (1) {
ret = read_extent_buffer_pages(io_tree, eb, start,
@@ -432,7 +432,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 * there is no reason to read the other copies, they won't be
 * any less wrong.
 */
-   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
+   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags))
break;
 
num_copies = btrfs_num_copies(root->fs_info,
@@ -564,12 +564,13 @@ static noinline int check_leaf(struct btrfs_root *root,
return 0;
 }
 
-static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
- u64 phy_offset, struct page *page,
- u64 start, u64 end, int mirror)
+int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
+   struct page *page,
+   u64 start, u64 end, int mirror)
 {
u64 found_start;
int found_level;
+   struct extent_buffer_head *ebh;
struct extent_buffer *eb;
struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
int ret = 0;
@@ -579,18 +580,26 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
goto out;
 
eb = (struct extent_buffer *)page->private;
+   do {
+   if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
+   break;
+   } while ((eb = eb->eb_next) != NULL);
+
+   BUG_ON(!eb);
+
+   ebh = eb_head(eb);
 
/* the pending IO might have been the only thing that kept this buffer
 * in memory.  Make sure we have a ref for all this other checks
 */
extent_buffer_get(eb);
 
-   reads_done = atomic_dec_and_test(&eb->io_pages);
+   reads_done = atomic_dec_and_test(&ebh->io_bvecs);
if (!reads_done)
goto err;
 
eb->read_mirror = mirror;
-   if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+   if (test_bit(EXTENT_BUFFER_IOERR, &eb->ebflags)) {
ret = -EIO;
goto err;
}
@@ -632,7 +641,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 * return -EIO.
 */
if (found_level == 0 && check_leaf(root, eb)) {
-   set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+   set_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
ret = -EIO;
}
 
@@ -640,7 +649,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
set_extent_buffer_uptodate(eb);
 err:
if (reads_done &&
-   test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+   test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
btree_readahead_hook(root, eb, eb->start, ret);
 
if (ret) {
@@ -649,7 +658,7 @@ err:
 * again, we have to make sure it has something
 * to decrement
 */
-   atomic_inc(&eb->io_pages);
+   atomic_inc(&eb_head(eb)->io_bvecs);
clear_extent_buffer_uptodate(eb);
}
free_extent_buffer(eb);
@@ -657,20 +666,6 @@ out:
return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-   struct extent_buffer *eb;
-   struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-
-   eb = (struct extent_buffer *)page->private;
-   set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-   eb->read_mirror = failed_mirror;
-   atomic_dec(&eb->io_pages);
-   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(root, eb, eb->start, -EIO);
-   return -EIO; /* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio, int err)
 {
struct end_io_wq *end_io_wq = bio->bi_private;
@@ -4109,8 +4104,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 }
 
 static struct extent_io_ops btree_extent_io_ops = {
-   .readpage_end_io_hook = 

[RFC PATCH V2 8/8] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks.

2014-06-11 Thread Chandan Rajendra
Checksums are applicable to sectorsize units. The current code uses
bio->bv_len units to compute and look up checksums. This works on machines
where sectorsize == PAGE_CACHE_SIZE. This patch makes the checksum
computation and lookup code work with sectorsize units.
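
As a rough user-space illustration of per-sector checksumming (not the kernel
implementation), the sketch below splits a buffer into sectorsize blocks and
computes one checksum per block. toy_csum() is a placeholder and is NOT
crc32c; csum_per_sector() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder checksum so the loop is runnable; NOT crc32c. */
static uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + data[i];
	return sum;
}

/* Compute one checksum per sectorsize block of buf (hypothetical helper).
 * Returns the number of checksums stored in sums, at most max_sums. */
static size_t csum_per_sector(const uint8_t *buf, size_t len,
			      size_t sectorsize,
			      uint32_t *sums, size_t max_sums)
{
	size_t nr_sectors, i;

	assert(sectorsize != 0);
	/* Same rounding as the nr_sectors computation in the diff. */
	nr_sectors = (len + sectorsize - 1) / sectorsize;

	for (i = 0; i < nr_sectors && i < max_sums; i++) {
		size_t off = i * sectorsize;
		size_t n = len - off < sectorsize ? len - off : sectorsize;

		sums[i] = toy_csum(buf + off, n);
	}
	return i;
}
```

A 4096-byte buffer yields two checksums with a 2k sectorsize and one checksum
with a 4k sectorsize, which is why bv_len only works as the checksum unit
when sectorsize == PAGE_CACHE_SIZE.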

Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/file-item.c | 85 
 1 file changed, 52 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 9d84658..16deb87 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u64 item_start_offset = 0;
u64 item_last_offset = 0;
u64 disk_bytenr;
+   u64 page_bytes_left;
u32 diff;
int nblocks;
int bio_index = 0;
@@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
disk_bytenr = (u64)bio->bi_sector << 9;
if (dio)
offset = logical_offset;
+
+   page_bytes_left = bvec->bv_len;
while (bio_index < bio->bi_vcnt) {
if (!dio)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (BTRFS_I(inode)->root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) {
set_extent_bits(io_tree, offset,
-   offset + bvec->bv_len - 1,
+   offset + root->sectorsize - 1,
EXTENT_NODATASUM, GFP_NOFS);
} else {

btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 found:
csum += count * csum_size;
nblocks -= count;
+
while (count--) {
-   disk_bytenr += bvec->bv_len;
-   offset += bvec->bv_len;
-   bio_index++;
-   bvec++;
+   disk_bytenr += root->sectorsize;
+   offset += root->sectorsize;
+   page_bytes_left -= root->sectorsize;
+   if (!page_bytes_left) {
+   bio_index++;
+   bvec++;
+   page_bytes_left = bvec->bv_len;
+   }
+
}
}
btrfs_free_path(path);
@@ -442,6 +451,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
struct bio_vec *bvec = bio->bi_io_vec;
int bio_index = 0;
int index;
+   int nr_sectors;
+   int i;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;
@@ -468,41 +479,49 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
if (!contig)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
-   if (offset >= ordered->file_offset + ordered->len ||
-   offset < ordered->file_offset) {
-   unsigned long bytes_left;
-   sums->len = this_sum_bytes;
-   this_sum_bytes = 0;
-   btrfs_add_ordered_sum(inode, ordered, sums);
-   btrfs_put_ordered_extent(ordered);
+   data = kmap_atomic(bvec->bv_page);
 
-   bytes_left = bio->bi_size - total_bytes;
+   nr_sectors = (bvec->bv_len + root->sectorsize - 1)
+   >> root->fs_info->sb->s_blocksize_bits;
+
+   for (i = 0; i < nr_sectors; i++) {
+   if (offset >= ordered->file_offset + ordered->len ||
+   offset < ordered->file_offset) {
+   unsigned long bytes_left;
+   sums->len = this_sum_bytes;
+   this_sum_bytes = 0;
+   btrfs_add_ordered_sum(inode, ordered, sums);
+   btrfs_put_ordered_extent(ordered);
+
+   bytes_left = bio->bi_size - total_bytes;
+
+   sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
+   GFP_NOFS);
+   BUG_ON(!sums); /* -ENOMEM */
+   sums->len = bytes_left;
+   ordered = btrfs_lookup_ordered_extent(inode, offset);
+   BUG_ON(!ordered); /* Logic error */
+   sums->bytenr = ((u64)bio->bi_sector << 9) +
+   total_bytes;
+   

[RFC PATCH V2 4/8] Btrfs: subpagesize-blocksize: Define extent_buffer_head.

2014-06-11 Thread Chandan Rajendra
From: Chandra Seetharaman sekha...@us.ibm.com

In order to handle multiple extent buffers per page, first we need to create a
way to handle all the extent buffers that are attached to a page.

This patch creates a new data structure, 'struct extent_buffer_head', and moves
fields that are common to all extent buffers in a page from 'struct
extent_buffer' to 'struct extent_buffer_head'.

Also, this patch moves the EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and
EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags to
extent_buffer_head->bflags.
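
The relationship between the two structures can be sketched in user space as
follows. This is a hypothetical model, not the layout defined by the patch:
the model_* names and the 'head' and 'first' members are assumptions; only
start, len, ebflags, bflags, refs and eb_next mirror names that appear in the
diffs of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_eb_head;

/* Per-block state (models 'struct extent_buffer'). */
struct model_eb {
	uint64_t start;			/* logical start of this block */
	uint32_t len;			/* block length in bytes */
	unsigned long ebflags;		/* per-buffer flags */
	struct model_eb_head *head;	/* assumed back pointer */
	struct model_eb *eb_next;	/* next buffer in the same page */
};

/* State shared by all buffers in a page (models 'struct extent_buffer_head'). */
struct model_eb_head {
	unsigned long bflags;		/* flags common to the whole page */
	int refs;			/* shared reference count */
	struct model_eb *first;		/* assumed: first buffer in the page */
};

/* Models eb_head(): map a buffer to its shared head. */
static struct model_eb_head *model_eb_head(struct model_eb *eb)
{
	return eb->head;
}

/* Walk the per-page list and return the buffer covering offset, if any. */
static struct model_eb *model_find_eb(struct model_eb_head *h, uint64_t offset)
{
	struct model_eb *eb;

	assert(h != NULL);
	for (eb = h->first; eb != NULL; eb = eb->eb_next)
		if (eb->start <= offset && offset < eb->start + eb->len)
			return eb;
	return NULL;
}
```

With two 2k buffers in one 4k page, offset 3000 resolves to the second
buffer, while both buffers share one head and therefore one reference count.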

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/backref.c   |   2 +-
 fs/btrfs/ctree.c |   2 +-
 fs/btrfs/ctree.h |   6 +-
 fs/btrfs/disk-io.c   |  46 --
 fs/btrfs/extent-tree.c   |   6 +-
 fs/btrfs/extent_io.c | 372 +--
 fs/btrfs/extent_io.h |  46 --
 fs/btrfs/volumes.c   |   2 +-
 include/trace/events/btrfs.h |   2 +-
 9 files changed, 326 insertions(+), 158 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index a88da72..603ae44 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1272,7 +1272,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
eb = path->nodes[0];
/* make sure we can use eb after releasing the path */
if (eb != eb_in) {
-   atomic_inc(&eb->refs);
+   atomic_inc(&eb_head(eb)->refs);
btrfs_tree_read_lock(eb);
btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
}
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index cbd3a7d..0d4ad91 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -169,7 +169,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
 * the inc_not_zero dance and if it doesn't work then
 * synchronize_rcu and try again.
 */
-   if (atomic_inc_not_zero(&eb->refs)) {
+   if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
rcu_read_unlock();
break;
}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dac6653..901ada2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2138,14 +2138,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb, \
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)   \
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_CACHE_SIZE -1)); \
u##bits res = le##bits##_to_cpu(p->member); \
return res; \
 }  \
 static inline void btrfs_set_##name(struct extent_buffer *eb,  \
u##bits val)\
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_CACHE_SIZE -1)); \
p->member = cpu_to_le##bits(val);   \
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index cc1b423..bda2157 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1018,13 +1018,21 @@ static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
struct extent_buffer *eb;
+   int i, dirty = 0;
 
BUG_ON(!PagePrivate(page));
eb = (struct extent_buffer *)page->private;
BUG_ON(!eb);
-   BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-   BUG_ON(!atomic_read(&eb->refs));
-   btrfs_assert_tree_locked(eb);
+
+   do {
+   dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
+   if (dirty)
+   break;
+   } while ((eb = eb->eb_next) != NULL);
+
+   BUG_ON(!dirty);
+   BUG_ON(!atomic_read(&(eb_head(eb)->refs)));
+   btrfs_assert_tree_locked(&ebh->eb);
 #endif
return __set_page_dirty_nobuffers(page);
 }
@@ -1068,7 +1076,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
if (!buf)
return 0;
 
-   set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
+   set_bit(EXTENT_BUFFER_READAHEAD, &buf->ebflags);
 
ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
   btree_get_extent, mirror_num);
@@ -1077,7 

[RFC PATCH V2 1/8] Btrfs: subpagesize-blocksize: Get rid of whole page reads.

2014-06-11 Thread Chandan Rajendra
Based on original patch from Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

bio_vec->{bv_offset, bv_len} cannot be relied upon by the end bio functions
to track the file offset range operated on by the bio. Hence this patch adds
two new members to 'struct btrfs_io_bio' to track the file offset range.
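
A user-space sketch of how such a tracked range can be used is shown here.
The member names start_offset and len are taken from the diff in this patch;
the struct model_io_bio wrapper and the two helper functions are hypothetical
and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models the two members added to 'struct btrfs_io_bio'. */
struct model_io_bio {
	uint64_t start_offset;	/* file offset of the first byte covered */
	uint64_t len;		/* number of bytes covered */
};

/* Last file offset (inclusive) covered by the bio. */
static uint64_t model_io_bio_end(const struct model_io_bio *b)
{
	assert(b->len > 0);
	return b->start_offset + b->len - 1;
}

/* Number of whole blocks covered, given log2 of the block size. */
static uint64_t model_io_bio_nr_sectors(const struct model_io_bio *b,
					unsigned int blocksize_bits)
{
	return b->len >> blocksize_bits;
}
```

For a bio covering 4096 bytes starting at file offset 8192, the range ends at
offset 12287 and spans two blocks when the block size is 2k (blocksize_bits
of 11).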

This patch also brings back check_page_locked() to reliably unlock pages in
readpage's end bio function.

Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
---
 fs/btrfs/extent_io.c | 200 ++-
 fs/btrfs/volumes.h   |   3 +
 2 files changed, 90 insertions(+), 113 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fbe501d..fa28545 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1943,15 +1943,29 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
  * helper function to set a given page up to date if all the
  * extents in the tree for that page are up to date
  */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
+static void check_page_uptodate(struct extent_io_tree *tree, struct page *page,
+   struct extent_state *cached)
 {
u64 start = page_offset(page);
u64 end = start + PAGE_CACHE_SIZE - 1;
-   if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
+   if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, cached))
SetPageUptodate(page);
 }
 
 /*
+ * helper function to unlock a page if all the extents in the tree
+ * for that page are unlocked
+ */
+static void check_page_locked(struct extent_io_tree *tree, struct page *page)
+{
+   u64 start = page_offset(page);
+   u64 end = start + PAGE_CACHE_SIZE - 1;
+
+   if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL)) {
+   unlock_page(page);
+   }
+}
+
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
  * io_failure_record is used to record state as we go through all the
@@ -2173,6 +2187,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
struct bio *bio;
struct btrfs_io_bio *btrfs_failed_bio;
struct btrfs_io_bio *btrfs_bio;
+   int nr_sectors;
int num_copies;
int ret;
int read_mode;
@@ -2267,7 +2282,8 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 *  a) deliver good data to the caller
 *  b) correct the bad sectors on disk
 */
-   if (failed_bio->bi_vcnt > 1) {
+   nr_sectors = btrfs_io_bio(failed_bio)->len >> inode->i_sb->s_blocksize_bits;
+   if (nr_sectors > 1) {
/*
 * to fulfill b), we need to know the exact failing sectors, as
 * we don't want to rewrite any more than the failed ones. thus,
@@ -2314,6 +2330,8 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
bio->bi_sector = failrec->logical >> 9;
bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
bio->bi_size = 0;
+   btrfs_io_bio(bio)->start_offset = start;
+   btrfs_io_bio(bio)->len = end - start + 1;
 
btrfs_failed_bio = btrfs_io_bio(failed_bio);
if (btrfs_failed_bio->csum) {
@@ -2414,18 +2432,6 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
bio_put(bio);
 }
 
-static void
-endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
- int uptodate)
-{
-   struct extent_state *cached = NULL;
-   u64 end = start + len - 1;
-
-   if (uptodate && tree->track_uptodate)
-   set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
-   unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
-}
-
 /*
  * after a readpage IO is done, we need to:
  * clear the uptodate bits on error
@@ -2440,76 +2446,50 @@ endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
 static void end_bio_extent_readpage(struct bio *bio, int err)
 {
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
-   struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
-   struct bio_vec *bvec = bio->bi_io_vec;
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+   struct bio_vec *bvec = bio->bi_io_vec;
+   struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
+   struct address_space *mapping;
+   struct extent_state *cached = NULL;
struct extent_io_tree *tree;
-   u64 offset = 0;
+   struct btrfs_root *root;
+   struct inode *inode;
+   struct page *page;
u64 start;
-   u64 end;
+   u64 offset = 0;
u64 len;
-   u64 extent_start = 0;
-   u64 extent_len = 0;
+   int nr_sectors;
int mirror;
int ret;
 
-   if (err)
-   uptodate = 0;
+   mapping = 

[PATCH] Btrfs-progs: fix race condition between btrfs and udev

2014-06-11 Thread Wang Shilong
Originally this problem was reproduced by the following scripts:

 # dd if=/dev/zero of=data bs=1M count=50
 # losetup /dev/loop1 data
 # i=1
 # while [ 1 ]
   do
     mkfs.btrfs -fK /dev/loop1 > /dev/null || exit 1
     ((i++))
     echo loop $i
   done

Further, an easy way to trigger this problem is to run the following
C code repeatedly:

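 #include <errno.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>

 int main(int argc, char **argv)
 {
	/* pass a btrfs block device */
	int fd = open(argv[1], O_RDWR | O_EXCL);
	if (fd < 0) {
		/* report why the exclusive open failed (EBUSY when we lose the race) */
		fprintf(stderr, "fail to open: %s\n", strerror(errno));
		exit(1);
	}
	close(fd);
	return 0;
 }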

The problem is that opening the device read-write triggers a udev event,
which calls btrfs_scan_one_device(). btrfs_scan_one_device() opens the
block device with O_EXCL; if another program tries to open the device
with O_EXCL in the meantime, that open fails with EBUSY.

This seldom happens in the real world, but it is easy to hit when loop
devices are used for testing.

A workaround is to wait for the kernel scan to finish and then try the
open again.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
 utils.c   | 36 +++-
 utils.h   |  1 +
 volumes.c |  3 ++-
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/utils.c b/utils.c
index d29de94..20108c3 100644
--- a/utils.c
+++ b/utils.c
@@ -2058,6 +2058,40 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 
data_profile,
return 0;
 }
 
+/*
+ * there is a race condition between btrfs and udev, we may fail
+ * because udev call btrfs_scan_device() which will open the block
+ * device with O_EXCL. A walkaround solution is to wait kernel
+ * scanning finished and then try again.
+ */
+int btrfs_open_block_device(char *file, int flag)
+{
+   int ret, fd;
+   struct btrfs_ioctl_vol_args args;
+   int tried = 0;
+
+again:
+   fd = open(file, flag);
+   if (fd < 0) {
+   if (!tried && errno == EBUSY && (flag & O_EXCL)) {
+   tried = 1;
+   fd = open("/dev/btrfs-control", O_RDWR);
+   if (fd < 0) {
+   fprintf(stderr,
+   "unable to open /dev/btrfs-control: %s\n",
+strerror(errno));
+   return fd;
+   }
+   strncpy(args.name, file, BTRFS_PATH_NAME_MAX);
+   ret = ioctl(fd, BTRFS_IOC_SCAN_DEV, &args);
+   close(fd);
+   if (!ret)
+   goto again;
+   }
+   }
+   return fd;
+}
+
 /* Check if disk is suitable for btrfs
  * returns:
  *  1: something is wrong, estr provides the error
@@ -2096,7 +2130,7 @@ int test_dev_for_mkfs(char *file, int force_overwrite, 
char *estr)
return 1;
}
/* check if the device is busy */
-   fd = open(file, O_RDWR|O_EXCL);
+   fd = btrfs_open_block_device(file, O_RDWR | O_EXCL);
if (fd < 0) {
snprintf(estr, sz, "unable to open %s: %s\n", file,
strerror(errno));
diff --git a/utils.h b/utils.h
index 0b03830..f6ca252 100644
--- a/utils.h
+++ b/utils.h
@@ -95,6 +95,7 @@ int open_path_or_dev_mnt(const char *path, DIR **dirstream);
 u64 btrfs_device_size(int fd, struct stat *st);
 /* Helper to always get proper size of the destination string */
 #define strncpy_null(dest, src) __strncpy__null(dest, src, sizeof(dest))
+int btrfs_open_block_device(char *file, int flag);
 int test_dev_for_mkfs(char *file, int force_overwrite, char *estr);
 int scan_for_btrfs(int where, int update_kernel);
 int get_label_mounted(const char *mount_path, char *labelp);
diff --git a/volumes.c b/volumes.c
index a61928c..f40740a 100644
--- a/volumes.c
+++ b/volumes.c
@@ -30,6 +30,7 @@
 #include "print-tree.h"
 #include "volumes.h"
 #include "math.h"
+#include "utils.h"
 
 struct stripe {
struct btrfs_device *dev;
@@ -207,7 +208,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, 
int flags)
continue;
}
 
-   fd = open(device->name, flags);
+   fd = btrfs_open_block_device(device->name, flags);
if (fd < 0) {
ret = -errno;
goto fail;
-- 
1.9.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
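
The workaround in the patch above (retry the exclusive open once the
competing opener is done) can be sketched in a more generic form. This is
only an illustration, not the code from the patch: the helper name
open_retry_ebusy() and its max_tries/delay_ms parameters are assumptions
made for the example, and it simply sleeps between attempts instead of
issuing BTRFS_IOC_SCAN_DEV to wait for the kernel scan.

```c
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of btrfs-progs): open a path and, if the
 * open fails with EBUSY, sleep briefly and retry, up to max_tries attempts
 * in total. Any other error is returned immediately.
 * Returns the fd on success, or -1 with errno set by the last open().
 */
int open_retry_ebusy(const char *path, int flags, int max_tries,
		     long delay_ms)
{
	struct timespec delay = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};
	int fd = -1;
	int i;

	if (max_tries < 1)
		max_tries = 1;

	for (i = 0; i < max_tries; i++) {
		fd = open(path, flags);
		if (fd >= 0 || errno != EBUSY)
			break;
		/* Someone else (e.g. a udev-triggered scan) holds the device. */
		nanosleep(&delay, NULL);
		errno = EBUSY;	/* nanosleep() may have changed errno */
	}
	return fd;
}
```

A caller would use it in place of a bare open(), for example
open_retry_ebusy(file, O_RDWR | O_EXCL, 5, 100). Bounding the number of
attempts keeps a genuinely busy device (one that is mounted, say) from
blocking the tool forever.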


Re: [PATCH] Btrfs-progs: fix race condition between btrfs and udev

2014-06-11 Thread Tomasz Torcz
On Wed, Jun 11, 2014 at 08:11:38PM +0800, Wang Shilong wrote:
> So the problem is RW opening would trigger udev event which will
> call btrfs_scan_one_device(). In btrfs_scan_one_device(), it
> would open the block device with EXCL flag..meanwhile if another
> program try to open that device with O_EXCL, it would fail with
> EBUSY.
> 
> This happen seldomly in the real world, but if we use loop device
> for test, we may hit this annoying problem.

  Hi,

udev just changed the locking semantics, see description in
http://cgit.freedesktop.org/systemd/systemd/commit/NEWS?id=4196a3ead3cfb823670d225eefcb3e60e34c7d95
 

-- 
Tomasz   .. oo o.   oo o. .o   .o o. o. oo o.   ..
Torcz.. .o .o   .o .o oo   oo .o .. .. oo   oo
o.o.o.   .o .. o.   o. o. o.   o. o. oo .. ..   o.



[PATCH 0/2] xfstests: add cifs.ko server-side copy helper

2014-06-11 Thread David Disseldorp
In preparation for adding cifs.ko support to xfstests, this patch series
extends the cloner binary to support SMB2 server-side copies via
CIFS_IOC_COPYCHUNK_FILE, in addition to the existing Btrfs COW clone
functionality.

This could be split into a separate binary if deemed necessary, but
given the code overlap, I thought it suitable to share the same source.

Feedback appreciated.

--

David Disseldorp (2):
  src/cloner: check filesystem type
  src/cloner: add CIFS_IOC_COPYCHUNK_FILE support

 configure.ac |   1 +
 src/cloner.c | 123 
+--
 2 files changed, 114 insertions(+), 10 deletions(-)


[PATCH 2/2] src/cloner: add CIFS_IOC_COPYCHUNK_FILE support

2014-06-11 Thread David Disseldorp
cifs.ko supports server-side copy offloads via CIFS_IOC_COPYCHUNK_FILE.
In handling the ioctl, the request is split into a series of
SMB2 FSCTL_SRV_COPYCHUNK wire requests, which may be handled by the SMB
server as a local read/write, or COW clone as is the case for Samba with
vfs_btrfs.

Signed-off-by: David Disseldorp dd...@suse.de
---
 configure.ac |  1 +
 src/cloner.c | 78 +++-
 2 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/configure.ac b/configure.ac
index 53459d8..d038f95 100644
--- a/configure.ac
+++ b/configure.ac
@@ -31,6 +31,7 @@ AC_HEADER_STDC
sys/fs/xfs_itable.h \
xfs/platform_defs.h \
btrfs/ioctl.h   \
+   cifs/ioctl.h\
 ])
 
 AC_CHECK_HEADERS([xfs/xfs_log_format.h],,,[#include <xfs/libxfs.h>])
diff --git a/src/cloner.c b/src/cloner.c
index 6fb40fa..18c44b9 100644
--- a/src/cloner.c
+++ b/src/cloner.c
@@ -1,6 +1,5 @@
 /*
- *  Tiny program to perform file (range) clones using raw Btrfs ioctls.
- *  It should only be needed until btrfs-progs has an xfs_io equivalent.
+ *  Tiny program to perform file (range) clones using raw Btrfs and CIFS 
ioctls.
  *
  *  Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
  *
@@ -49,9 +48,21 @@ struct btrfs_ioctl_clone_range_args {
   struct btrfs_ioctl_clone_range_args)
 #endif
 
+#ifdef HAVE_CIFS_IOCTL_H
+#include <cifs/ioctl.h>
+#else
+
+#define CIFS_IOCTL_MAGIC 0xCF
+#define CIFS_IOC_COPYCHUNK_FILE _IOW(CIFS_IOCTL_MAGIC, 3, int)
+
+#endif
+
 #ifndef BTRFS_SUPER_MAGIC
 #define BTRFS_SUPER_MAGIC 0x9123683E
 #endif
+#ifndef CIFS_MAGIC_NUMBER
+#define CIFS_MAGIC_NUMBER 0xFE534D42
+#endif
 
 static void
 usage(char *name, const char *msg)
@@ -59,17 +70,19 @@ usage(char *name, const char *msg)
printf("Fatal: %s\n"
   "Usage:\n"
   "%s [options] src_file dest_file\n"
-  "\tA full file clone (reflink) is performed by default, "
-  "unless any of the following are specified:\n"
+  "\tA full file clone is performed by default, "
+  "unless any of the following are specified (Btrfs only):\n"
   "\t-s offset:  source file offset (default = 0)\n"
   "\t-d offset:  destination file offset (default = 0)\n"
-  "\t-l length:  length of clone (default = 0)\n",
+  "\t-l length:  length of clone (default = 0)\n\n"
+  "\tBoth Btrfs and CIFS are supported. On Btrfs, a COW clone "
+  "is attempted. On CIFS, a server-side copy is requested.\n",
   msg, name);
_exit(1);
 }
 
 static int
-clone_file(int src_fd, int dst_fd)
+clone_file_btrfs(int src_fd, int dst_fd)
 {
int ret = ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd);
if (ret != 0)
@@ -78,8 +91,33 @@ clone_file(int src_fd, int dst_fd)
 }
 
 static int
-clone_file_range(int src_fd, int dst_fd, uint64_t src_off, uint64_t dst_off,
-uint64_t len)
+clone_file_cifs(int src_fd, int dst_fd)
+{
+   int ret = ioctl(dst_fd, CIFS_IOC_COPYCHUNK_FILE, src_fd);
+   if (ret != 0)
+   ret = errno;
+   return ret;
+}
+
+static int
+clone_file(unsigned int fs_type, int src_fd, int dst_fd)
+{
+   switch (fs_type) {
+   case BTRFS_SUPER_MAGIC:
+   return clone_file_btrfs(src_fd, dst_fd);
+   break;
+   case CIFS_MAGIC_NUMBER:
+   return clone_file_cifs(src_fd, dst_fd);
+   break;
+   default:
+   return ENOTSUP;
+   break;
+   }
+}
+
+static int
+clone_file_range_btrfs(int src_fd, int dst_fd, uint64_t src_off,
+  uint64_t dst_off, uint64_t len)
 {
struct btrfs_ioctl_clone_range_args cr_args;
int ret;
@@ -96,6 +134,22 @@ clone_file_range(int src_fd, int dst_fd, uint64_t src_off, 
uint64_t dst_off,
 }
 
 static int
+clone_file_range(unsigned int fs_type, int src_fd, int dst_fd, uint64_t 
src_off,
+uint64_t dst_off, uint64_t len)
+{
+   switch (fs_type) {
+   case BTRFS_SUPER_MAGIC:
+   return clone_file_range_btrfs(src_fd, dst_fd, src_off, dst_off,
+ len);
+   break;
+   case CIFS_MAGIC_NUMBER: /* only supports full file server-side copies */
+   default:
+   return ENOTSUP;
+   break;
+   }
+}
+
+static int
 cloner_check_fs_support(int src_fd, int dest_fd, unsigned int *fs_type)
 {
int ret;
@@ -107,7 +161,8 @@ cloner_check_fs_support(int src_fd, int dest_fd, unsigned 
int *fs_type)
return errno;
}
 
-   if (sfs.f_type != BTRFS_SUPER_MAGIC) {
+   if ((sfs.f_type != BTRFS_SUPER_MAGIC)
+ && (sfs.f_type != CIFS_MAGIC_NUMBER)) {
printf("unsupported source FS 0x%x\n",
   (unsigned 

[PATCH 1/2] src/cloner: check filesystem type

2014-06-11 Thread David Disseldorp
Limit clone requests to Btrfs only for the moment.

Signed-off-by: David Disseldorp dd...@suse.de
---
 src/cloner.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/src/cloner.c b/src/cloner.c
index ccc2354..6fb40fa 100644
--- a/src/cloner.c
+++ b/src/cloner.c
@@ -22,6 +22,7 @@
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/ioctl.h>
+#include <sys/vfs.h>
 #include <stdint.h>
 #include <stdbool.h>
 #include <fcntl.h>
@@ -30,6 +31,7 @@
 #include <stdio.h>
 #include <string.h>
 #include <errno.h>
+#include <linux/magic.h>
 #ifdef HAVE_BTRFS_IOCTL_H
 #include <btrfs/ioctl.h>
 #else
@@ -47,6 +49,10 @@ struct btrfs_ioctl_clone_range_args {
   struct btrfs_ioctl_clone_range_args)
 #endif
 
+#ifndef BTRFS_SUPER_MAGIC
+#define BTRFS_SUPER_MAGIC 0x9123683E
+#endif
+
 static void
 usage(char *name, const char *msg)
 {
@@ -89,6 +95,41 @@ clone_file_range(int src_fd, int dst_fd, uint64_t src_off, 
uint64_t dst_off,
return ret;
 }
 
+static int
+cloner_check_fs_support(int src_fd, int dest_fd, unsigned int *fs_type)
+{
+   int ret;
+   struct statfs sfs;
+
+   ret = fstatfs(src_fd, &sfs);
+   if (ret != 0) {
+   printf("failed to stat source FS\n");
+   return errno;
+   }
+
+   if (sfs.f_type != BTRFS_SUPER_MAGIC) {
+   printf("unsupported source FS 0x%x\n",
+  (unsigned int)sfs.f_type);
+   return ENOTSUP;
+   }
+
+   *fs_type = (unsigned int)sfs.f_type;
+
+   ret = fstatfs(dest_fd, &sfs);
+   if (ret != 0) {
+   printf("failed to stat destination FS\n");
+   return errno;
+   }
+
+   if (sfs.f_type != *fs_type) {
+   printf("dest FS type 0x%x does not match source 0x%x\n",
+  (unsigned int)sfs.f_type, *fs_type);
+   return ENOTSUP;
+   }
+
+   return 0;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -102,6 +143,7 @@ main(int argc, char **argv)
int dst_fd;
int ret;
int opt;
+   unsigned int fs_type = 0;
 
while ((opt = getopt(argc, argv, "s:d:l:")) != -1) {
char *sval_end;
@@ -162,6 +204,11 @@ main(int argc, char **argv)
goto err_src_close;
}
 
+   ret = cloner_check_fs_support(src_fd, dst_fd, &fs_type);
+   if (ret != 0) {
+   goto err_dst_close;
+   }
+
if (full_file) {
ret = clone_file(src_fd, dst_fd);
} else {
-- 
1.8.4.5



[GIT PULL] Btrfs for 3.16

2014-06-11 Thread Chris Mason
Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

The biggest change here is Josef's rework of the btrfs quota accounting,
which improves the in-memory tracking of delayed extent operations.

I had been working on Btrfs stack usage for a while, mostly because it
had become impossible to do long stress runs with slab, lockdep and
pagealloc debugging turned on without blowing the stack.  Even though
you upgraded us to a nice king sized stack, I kept most of the patches.

We also have some very hard to find corruption fixes, an awesome sysfs
use after free, and the usual assortment of optimizations, cleanups and
other fixes.

Filipe Manana (24) commits (+630/-249):
Btrfs: send, account for orphan directories when building path strings (+9/-24)
Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled (+83/-25)
Btrfs: read inode size after acquiring the mutex when punching a hole (+2/-1)
Btrfs: update commit root on snapshot creation after orphan cleanup (+29/-0)
Btrfs: fix hang on error (such as ENOSPC) when writing extent pages (+11/-5)
Btrfs: don't release invalid page in btrfs_page_exists_in_range() (+1/-0)
Btrfs: fix leaf corruption caused by ENOSPC while hole punching (+19/-1)
Btrfs: set dead flag on the right root when destroying snapshot (+6/-6)
Btrfs: send, use the right limits for xattr names and values (+23/-7)
Btrfs: send, don't error in the presence of subvols/snapshots (+4/-0)
Btrfs: check if items are ordered when a leaf is marked dirty (+6/-0)
Btrfs: send, avoid unnecessary inode item lookup in the btree (+7/-6)
Btrfs: avoid visiting all extent items when cloning a range (+22/-4)
Btrfs: don't access non-existent key when csum tree is empty (+1/-1)
Btrfs: send, remove dead code from __get_cur_name_and_parent (+0/-6)
Btrfs: ioctl, don't re-lock extent range when not necessary (+7/-2)
Btrfs: ensure readers see new data after a clone operation (+31/-5)
Btrfs: send, fix more issues related to directory renames (+96/-94)
Btrfs: make sure we retry if page is a retriable exception (+3/-1)
Btrfs: make sure we retry if we couldn't get the page (+3/-1)
Btrfs: implement inode_operations callback tmpfile (+98/-20)
Btrfs: make fsync work after cloning into a file (+155/-38)
Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item (+11/-1)
Btrfs: fix transaction leak during fsync call (+3/-1)

David Sterba (9) commits (+145/-39):
btrfs: assert that send is not in progres before root deletion (+1/-13)
btrfs: balance filter: add limit of processed chunks (+27/-2)
btrfs: remove newline from inode cache kthread name (+1/-1)
btrfs: protect snapshots from deleting during send (+53/-2)
btrfs: remove stale newlines from log messages (+14/-14)
btrfs: make DEV_INFO ioctl available to anyone (+0/-3)
btrfs: make FS_INFO ioctl available to anyone (+0/-3)
btrfs: retrieve more info from FS_INFO ioctl (+9/-1)
btrfs: export more from FS_INFO to sysfs (+40/-0)

Liu Bo (6) commits (+54/-21):
Btrfs: fix scrub_print_warning to handle skinny metadata extents (+24/-15)
Btrfs: mark mapping with error flag to report errors to userspace (+2/-0)
Btrfs: fix NULL pointer crash of deleting a seed device (+8/-4)
Btrfs: fix leaf corruption after __btrfs_drop_extents (+18/-0)
Btrfs: do not increment on bio_index one by one (+1/-1)
Btrfs: use right type to get real comparison (+1/-1)

Chris Mason (6) commits (+500/-264):
Btrfs: break up __btrfs_write_out_cache to cut down stack usage (+191/-117)
Btrfs: split up __extent_writepage to lower stack usage (+194/-138)
Btrfs: cut down stack usage in btree_write_cache_pages (+5/-4)
Btrfs: fix double free in find_lock_delalloc_range (+1/-0)
Btrfs: convert smp_mb__{before,after}_clear_bit (+2/-2)
Btrfs: async delayed refs (+107/-3)

Gui Hecheng (4) commits (+20/-3):
btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel space (+16/-0)
btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56 (+1/-1)
btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX (+1/-1)
btrfs: fix wrong max system array size check in kernel space (+2/-1)

Josef Bacik (4) commits (+1748/-518):
Btrfs: add sanity tests for new qgroup accounting code (+700/-37)
Btrfs: don't check nodes for extent items (+3/-2)
Btrfs: free tmp ulist for qgroup rescan (+1/-0)
Btrfs: rework qgroup accounting (+1044/-479)

Miao Xie (4) commits (+261/-124):
Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root (+109/-94)
Btrfs: output warning instead of error when loading free space cache failed (+2/-2)
Btrfs: use helpers for last_trans_log_full_commit instead of opencode (+36/-27)
Btrfs: reclaim the reserved metadata space at background (+114/-1)

Wang Shilong (3) commits (+14/-3):
Btrfs: make 

Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Goffredo Baroncelli
Hi all,

I would like to share my experience of slow systemd startup on BTRFS.

My boot time was very high (about 50 seconds); most of it was due to
NetworkManager, which took about 30-40 seconds to start (this data came from
systemd-analyze plot).

I made several attempts to address this issue. I also noticed that the problem
sometimes disappeared, but I was never able to understand why.

However, this link

https://bugzilla.redhat.com/show_bug.cgi?id=1006386

suggested to me that the problem could be due to a bad interaction between
systemd and btrfs. NetworkManager was innocent.
It seems that systemd-journal creates very highly fragmented files when it
stores its log, and BTRFS is known to behave slowly when a file is highly
fragmented.
This caused a slow startup of systemd-journal, which in turn blocked the
services that depend on the logging system.

In fact, after I defragmented the files under /var/log/journal [*], my boot
time decreased by about 20 seconds (from 50s to 30s).

Unfortunately I don't have any data to show. Next time I will try to collect
more information. But I am quite sure that when the logs are highly
fragmented, systemd-journal becomes very slow on BTRFS.

I don't know if the problem is more on the systemd side or the btrfs side.
What I know is that both projects will likely be important in the near future,
and they must work well together.

I know that I can use chattr +C to avoid COW for some files, but I don't want
to lose the checksum protection as well.

If someone can suggest how to FRAGMENT the log file, I can try to collect more
scientific data.


BR
G.Baroncelli

[*] 
# btrfs fi defrag /var/log/journal/*/*



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] mount: add btrfs to mount.8

2014-06-11 Thread Eric Sandeen
On 6/7/14, 8:41 AM, Christoph Hellwig wrote:
> On Fri, Jun 06, 2014 at 10:52:48AM -0500, Eric Sandeen wrote:
>> On 6/6/14, 5:03 AM, Karel Zak wrote:
>>> On Fri, Jun 06, 2014 at 11:44:28AM +0200, Karel Zak wrote:
>>>>  I personally have no problem to maintain information about arbitrary
>>>>  FS in mount.8, the problem are updates. Unfortunately, kernel FS developers
>>>>  don't care about the man page at all and it's very often not up to date.
>>>
>>>  Hmm.. another possible way would be to create a script for util-linux
>>>  that will analyze kernel Documentation/filesystems/fsname.txt and
>>>  report changes that is necessary to make to mount.8. It should be
>>>  relative simple with git. I'll try it..
>>
>> I like that idea.  Maybe fsname.txt will need a defined format, though,
>> right?  Maybe asciidoc?
>>
>> I've still been meaning (in theory) to produce a mount manpage just for xfs.
>> I'm still willing to do that if the above doesn't pan out.  I just need
>> to get to it.  I'd be happy to do it for extN as well.
> 
> Autogenerating man pages from an adhoc format sounds like the wrong
> approach.  I'd much rather have proper man paged for every filesystem.
> With those we could also drop all that information from the kernel
> Documentation directory, where users won't looks for them anyway.
> 
> Eric, if you take care of xfs an extN I'll get started on man pages
> for the various minor filesystems that don't really have active
> maintainers.

Ok, so I've sent xfs & extN and I am about to send btrfs.

But I still have the nagging feeling that it would be better to have these
mount-option manpages distributed with the kernel, which is ultimately what
they must match.

So although I've sent them all, I'm still feeling unsure about it.  :)

-Eric


[PATCH] btrfs-progs: add mount options to btrfs-mount.5

2014-06-11 Thread Eric Sandeen
This is a straight cut and paste from the util-linux
mount manpage into btrfs-mount.5

It's pretty much impossible for util-linux to keep up
with every filesystem out there, and Karel has more than
once expressed a wish that mount options move into fs-specific
manpages.

So, here we go.

The way btrfs asciidoc is generated, there's not a trivial
way to have both btrfs(5) and btrfs(8) so I named it btrfs-mount(5)
for now.  A bit ick and I'm open to suggestions.

Signed-off-by: Eric Sandeen sand...@redhat.com
---

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 03a5cd5..95ecbf6 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -31,13 +31,21 @@ MAN8_TXT += btrfs-replace.txt
 MAN8_TXT += btrfs-restore.txt
 MAN8_TXT += btrfs-property.txt
 
-MAN_TXT = $(MAN8_TXT)
+# Mount manpage
+MAN5_TXT += btrfs-mount.txt
+
+MAN_TXT = $(MAN8_TXT) $(MAN5_TXT)
 MAN_XML = $(patsubst %.txt,%.xml,$(MAN_TXT))
+
+DOC_MAN5 = $(patsubst %.txt,%.5,$(MAN5_TXT))
+GZ_MAN5 = $(patsubst %.txt,%.5.gz,$(MAN5_TXT))
+
 DOC_MAN8 = $(patsubst %.txt,%.8,$(MAN8_TXT))
 GZ_MAN8 = $(patsubst %.txt,%.8.gz,$(MAN8_TXT))
 
 mandir ?= $(prefix)/share/man
 man8dir = $(mandir)/man8
+man5dir = $(mandir)/man5
 
 ASCIIDOC = asciidoc
 ASCIIDOC_EXTRA =
@@ -67,25 +75,35 @@ endif
 endif
 
 all: man
-man: man8
+man: man5 man8
+man5: $(GZ_MAN5)
 man8: $(GZ_MAN8)
 
 install: install-man
 
 install-man: man
$(INSTALL) -d -m 755 $(DESTDIR)$(man8dir)
+   $(INSTALL) -m 644 $(GZ_MAN5) $(DESTDIR)$(man5dir)
$(INSTALL) -m 644 $(GZ_MAN8) $(DESTDIR)$(man8dir)
$(LNS) btrfs-check.8.gz $(DESTDIR)$(man8dir)
 
 clean:
-   $(QUIET_RM)$(RM) *.xml *.xml+ *.8 *.8.gz
+   $(QUIET_RM)$(RM) *.xml *.xml+ *.5 *.5.gz *.8 *.8.gz
+
+%.5.gz : %.5
+   $(QUIET_GZIP)$(GZIP) -n -c $< > $@
 
 %.8.gz : %.8
$(QUIET_GZIP)$(GZIP) -n -c $< > $@
 
+%.5 : %.xml 
+   $(QUIET_XMLTO)$(RM) $@ && \
+   $(XMLTO) -m $(MANPAGE_XSL) $(XMLTO_EXTRA) man $<
+
 %.8 : %.xml 
$(QUIET_XMLTO)$(RM) $@ && \
$(XMLTO) -m $(MANPAGE_XSL) $(XMLTO_EXTRA) man $<
+
 %.xml : %.txt asciidoc.conf
$(QUIET_ASCIIDOC)$(RM) $@.tmp[12] $@ && \
sed -e "s/\(<[^>]\+>\)/'\1'/g" < $< > $@.tmp1 && \
diff --git a/Documentation/btrfs-mount.txt b/Documentation/btrfs-mount.txt
new file mode 100644
index 000..4433a78
--- /dev/null
+++ b/Documentation/btrfs-mount.txt
@@ -0,0 +1,186 @@
+btrfs-mount(5)
+==============
+
+NAME
+----
+btrfs-mount - mount options for the btrfs filesystem
+
+DESCRIPTION
+-----------
+This document describes mount options specific to the btrfs filesystem.
+Other generic mount options are available, and are described in the
+`mount`(8) manpage.
+
+MOUNT OPTIONS
+-------------
+*alloc_start='bytes'*::
+   Debugging option to force all block allocations above a certain
+   byte threshold on each block device.  The value is specified in
+   bytes, optionally with a K, M, or G suffix, case insensitive.
+   Default is 1MB.
+
+*autodefrag*::
+   Disable/enable auto defragmentation.
+   Auto defragmentation detects small random writes into files and queues
+   them up for the defrag process.  Works best for small files;
+   Not well suited for large database workloads.
+
+*check_int*|*check_int_data*|*check_int_print_mask='value'*::
+   These debugging options control the behavior of the integrity checking
+   module (the BTRFS_FS_CHECK_INTEGRITY config option required). +
+   +
+   `check_int` enables the integrity checker module, which examines all
+   block write requests to ensure on-disk consistency, at a large
+   memory and CPU cost. +
+   +
+   `check_int_data` includes extent data in the integrity checks, and
+   implies the check_int option. +
+   +
+   `check_int_print_mask` takes a bitmask of BTRFSIC_PRINT_MASK_* values
+   as defined in 'fs/btrfs/check-integrity.c', to control the integrity
+   checker module behavior. +
+   +
+   See comments at the top of 'fs/btrfs/check-integrity.c'
+   for more info.
+
+*commit='seconds'*::
+   Set the interval of periodic commit, 30 seconds by default. Higher
+   values defer data being synced to permanent storage with obvious
+   consequences when the system crashes. The upper bound is not forced,
+   but a warning is printed if it's more than 300 seconds (5 minutes).
+
+*compress*|*compress='type'*|*compress-force*|*compress-force='type'*::
+   Control BTRFS file data compression.  Type may be specified as "zlib",
+   "lzo" or "no" (for no compression, used for remounting).  If no type
+   is specified, zlib is used.  If compress-force is specified,
+   all files will be compressed, whether or not they compress well.
+   If compression is enabled, nodatacow and nodatasum are disabled.
+
+*degraded*::
+   Allow mounts to continue with missing devices.  A read-write mount may
+   fail with too many devices missing, for example if a stripe member
+ 

[PATCH V2] btrfs-progs: add mount options to btrfs-mount.5

2014-06-11 Thread Eric Sandeen
This is a straight cut and paste from the util-linux
mount manpage into btrfs-mount.5

It's pretty much impossible for util-linux to keep up
with every filesystem out there, and Karel has more than
once expressed a wish that mount options move into fs-specific
manpages.

So, here we go.

The way btrfs asciidoc is generated, there's not a trivial
way to have both btrfs(5) and btrfs(8) so I named it btrfs-mount(5)
for now.  A bit ick and I'm open to suggestions.

Signed-off-by: Eric Sandeen sand...@redhat.com
---

V2: whoops, have to $(INSTALL) -d -m 755 $(DESTDIR)$(man5dir) too...

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 03a5cd5..be95fda 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -31,13 +31,21 @@ MAN8_TXT += btrfs-replace.txt
 MAN8_TXT += btrfs-restore.txt
 MAN8_TXT += btrfs-property.txt
 
-MAN_TXT = $(MAN8_TXT)
+# Mount manpage
+MAN5_TXT += btrfs-mount.txt
+
+MAN_TXT = $(MAN8_TXT) $(MAN5_TXT)
 MAN_XML = $(patsubst %.txt,%.xml,$(MAN_TXT))
+
+DOC_MAN5 = $(patsubst %.txt,%.5,$(MAN5_TXT))
+GZ_MAN5 = $(patsubst %.txt,%.5.gz,$(MAN5_TXT))
+
 DOC_MAN8 = $(patsubst %.txt,%.8,$(MAN8_TXT))
 GZ_MAN8 = $(patsubst %.txt,%.8.gz,$(MAN8_TXT))
 
 mandir ?= $(prefix)/share/man
 man8dir = $(mandir)/man8
+man5dir = $(mandir)/man5
 
 ASCIIDOC = asciidoc
 ASCIIDOC_EXTRA =
@@ -67,25 +75,36 @@ endif
 endif
 
 all: man
-man: man8
+man: man5 man8
+man5: $(GZ_MAN5)
 man8: $(GZ_MAN8)
 
 install: install-man
 
 install-man: man
+   $(INSTALL) -d -m 755 $(DESTDIR)$(man5dir)
$(INSTALL) -d -m 755 $(DESTDIR)$(man8dir)
+   $(INSTALL) -m 644 $(GZ_MAN5) $(DESTDIR)$(man5dir)
$(INSTALL) -m 644 $(GZ_MAN8) $(DESTDIR)$(man8dir)
$(LNS) btrfs-check.8.gz $(DESTDIR)$(man8dir)
 
 clean:
-   $(QUIET_RM)$(RM) *.xml *.xml+ *.8 *.8.gz
+   $(QUIET_RM)$(RM) *.xml *.xml+ *.5 *.5.gz *.8 *.8.gz
+
+%.5.gz : %.5
+   $(QUIET_GZIP)$(GZIP) -n -c $< > $@
 
 %.8.gz : %.8
$(QUIET_GZIP)$(GZIP) -n -c $< > $@
 
+%.5 : %.xml 
+   $(QUIET_XMLTO)$(RM) $@ && \
+   $(XMLTO) -m $(MANPAGE_XSL) $(XMLTO_EXTRA) man $<
+
 %.8 : %.xml 
$(QUIET_XMLTO)$(RM) $@ && \
$(XMLTO) -m $(MANPAGE_XSL) $(XMLTO_EXTRA) man $<
+
 %.xml : %.txt asciidoc.conf
$(QUIET_ASCIIDOC)$(RM) $@.tmp[12] $@ && \
sed -e "s/\(<[^>]\+>\)/'\1'/g" < $< > $@.tmp1 && \
diff --git a/Documentation/btrfs-mount.txt b/Documentation/btrfs-mount.txt
new file mode 100644
index 000..4433a78
--- /dev/null
+++ b/Documentation/btrfs-mount.txt
@@ -0,0 +1,186 @@
+btrfs-mount(5)
+==============
+
+NAME
+----
+btrfs-mount - mount options for the btrfs filesystem
+
+DESCRIPTION
+-----------
+This document describes mount options specific to the btrfs filesystem.
+Other generic mount options are available, and are described in the
+`mount`(8) manpage.
+
+MOUNT OPTIONS
+-------------
+*alloc_start='bytes'*::
+   Debugging option to force all block allocations above a certain
+   byte threshold on each block device.  The value is specified in
+   bytes, optionally with a K, M, or G suffix, case insensitive.
+   Default is 1MB.
+
+*autodefrag*::
+   Disable/enable auto defragmentation.
+   Auto defragmentation detects small random writes into files and queues
+   them up for the defrag process.  Works best for small files;
+   Not well suited for large database workloads.
+
+*check_int*|*check_int_data*|*check_int_print_mask='value'*::
+   These debugging options control the behavior of the integrity checking
+   module (the BTRFS_FS_CHECK_INTEGRITY config option required). +
+   +
+   `check_int` enables the integrity checker module, which examines all
+   block write requests to ensure on-disk consistency, at a large
+   memory and CPU cost. +
+   +
+   `check_int_data` includes extent data in the integrity checks, and
+   implies the check_int option. +
+   +
+   `check_int_print_mask` takes a bitmask of BTRFSIC_PRINT_MASK_* values
+   as defined in 'fs/btrfs/check-integrity.c', to control the integrity
+   checker module behavior. +
+   +
+   See comments at the top of 'fs/btrfs/check-integrity.c'
+   for more info.
+
+*commit='seconds'*::
+   Set the interval of periodic commit, 30 seconds by default. Higher
+   values defer data being synced to permanent storage with obvious
+   consequences when the system crashes. The upper bound is not forced,
+   but a warning is printed if it's more than 300 seconds (5 minutes).
+
+*compress*|*compress='type'*|*compress-force*|*compress-force='type'*::
+   Control BTRFS file data compression.  Type may be specified as "zlib",
+   "lzo" or "no" (for no compression, used for remounting).  If no type
+   is specified, zlib is used.  If compress-force is specified,
+   all files will be compressed, whether or not they compress well.
+   If compression is enabled, nodatacow and nodatasum are disabled.
+
+*degraded*::
+   Allow mounts to continue with 

[PATCH] Btrfs: fix qgroups sanity test crash or hang

2014-06-11 Thread Filipe David Borba Manana
The qgroups sanity test often crashed or hung when run.
This is because the extent buffer the test uses for the root node doesn't
have a header level explicitly set, so its level is a random value.
When that value is not zero, the btrfs_search_slot() calls the test ends up
doing misbehave, resulting in crashes or hangs such as the following:

[ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
(...)
[ 6454.127760] BTRFS: selftest: Running qgroup tests
[ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
[ 6454.127966] BTRFS: selftest: Qgroup basic add
[ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
[ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core 
pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
[ 6480.152005] irq event stamp: 188448
[ 6480.152005] hardirqs last  enabled at (188447): [8168ef5c] 
restore_args+0x0/0x30
[ 6480.152005] hardirqs last disabled at (188448): [81698e6a] 
apic_timer_interrupt+0x6a/0x80
[ 6480.152005] softirqs last  enabled at (188446): [810516cf] 
__do_softirq+0x1cf/0x450
[ 6480.152005] softirqs last disabled at (188441): [81051c25] 
irq_exit+0xb5/0xc0
[ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 
3.15.0-rc8-fdm-btrfs-next-33+ #4
[ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6480.152005] task: 8802146125a0 ti: 8800d0d0 task.ti: 
8800d0d0
[ 6480.152005] RIP: 0010:[81349a63]  [81349a63] 
__write_lock_failed+0x13/0x20
[ 6480.152005] RSP: 0018:8800d0d038e8  EFLAGS: 0287
[ 6480.152005] RAX:  RBX: 8168ef5c RCX: 05deb8525852
[ 6480.152005] RDX:  RSI: 1d45 RDI: 8802105000b8
[ 6480.152005] RBP: 8800d0d038e8 R08: fe12710f63db R09: a03196fb
[ 6480.152005] R10: 8802146125a0 R11: 880214612e28 R12: 8800d0d03858
[ 6480.152005] R13:  R14: 8800d0d0 R15: 8802146125a0
[ 6480.152005] FS:  7f14ff804700() GS:880215e0() 
knlGS:
[ 6480.152005] CS:  0010 DS:  ES:  CR0: 8005003b
[ 6480.152005] CR2: 7fff4df0dac8 CR3: d1796000 CR4: 06f0
[ 6480.152005] Stack:
[ 6480.152005]  8800d0d03908 810ae967 0001 
8802105000b8
[ 6480.152005]  8800d0d03938 8168e57e a0319c16 
0007
[ 6480.152005]  88021050 880210500100 8800d0d039b8 
a0319c16
[ 6480.152005] Call Trace:
[ 6480.152005]  [810ae967] do_raw_write_lock+0x47/0xa0
[ 6480.152005]  [8168e57e] _raw_write_lock+0x5e/0x80
[ 6480.152005]  [a0319c16] ? btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [a0319c16] btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [a02b2acb] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[ 6480.152005]  [a02b81a6] btrfs_search_slot+0x916/0xa20 [btrfs]
[ 6480.152005]  [811a727f] ? create_object+0x23f/0x300
[ 6480.152005]  [a02b9958] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
[ 6480.152005]  [a036041a] 
insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
[ 6480.152005]  [a03605c3] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
[ 6480.152005]  [8108cad6] ? local_clock+0x16/0x30
[ 6480.152005]  [a035ef8e] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
[ 6480.152005]  [a03a69d2] ? 
ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
[ 6480.152005]  [a03a6a86] init_btrfs_fs+0xb4/0x153 [btrfs]
[ 6480.152005]  [81000352] do_one_initcall+0x102/0x150
[ 6480.152005]  [8103d223] ? set_memory_nx+0x43/0x50
[ 6480.152005]  [81682668] ? set_section_ro_nx+0x6d/0x74
[ 6480.152005]  [810d91cc] load_module+0x1cdc/0x2630
(...)

Therefore initialize the extent buffer as an empty leaf (level 0).

Issue easy to reproduce when btrfs is built as a module via:

$ for ((i = 1; i <= 100; i++)); do rmmod btrfs; modprobe btrfs; done

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
 fs/btrfs/tests/qgroup-tests.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
index fa691b7..0e69c8e 100644
--- a/fs/btrfs/tests/qgroup-tests.c
+++ b/fs/btrfs/tests/qgroup-tests.c
@@ -410,6 +410,8 @@ int btrfs_test_qgroups(void)
 	 * *cough*backref walking code*cough*
 	 */
 	root->node = alloc_test_extent_buffer(root->fs_info, 4096, 4096);
+	btrfs_set_header_level(root->node, 0);
+	btrfs_set_header_nritems(root->node, 0);
 	if (!root->node) {
 		test_msg("Couldn't allocate dummy buffer\n");
 		ret = -ENOMEM;
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More 

Re: [PATCH] Btrfs: fix qgroups sanity test crash or hang

2014-06-11 Thread Chris Mason
On 06/11/2014 08:12 PM, Filipe David Borba Manana wrote:
> Often when running the qgroups sanity test, a crash or a hang happened.
> This is because the extent buffer the test uses for the root node doesn't
> have a header level explicitly set, making it have a random level value.
> This is a problem when it's not zero for the btrfs_search_slot() calls
> the test ends up doing, resulting in crashes or hangs such as the following:
> 

> Therefore initialize the extent buffer as an empty leaf (level 0).
> 
> Issue easy to reproduce when btrfs is built as a module via:
> 
> $ for ((i = 1; i <= 100; i++)); do rmmod btrfs; modprobe btrfs; done

Nice, thanks Filipe, I hadn't been able to trigger this yet.

-chris


Re: Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Chris Murphy

On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli kreij...@libero.it wrote:
 
> If someone is able to suggest me how FRAGMENT the log file, I can try to 
> collect more scientific data.

So long as you're not using compression, filefrag will show you fragments of 
systemd-journald journals. I can vouch for the behavior you experience without 
xattr +C or autodefrag, but further it also causes much slowness when reading 
journal contents. Like if I want to search all boots for a particular error 
message to see how far back it started, this takes quite a bit longer than on 
other file systems. So far I'm not experiencing this problem with autodefrag or 
any other negative side effects, but my understanding is this code is still in 
flux.
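
For anyone who wants a number rather than filefrag's listing, here is a minimal sketch that asks the kernel for a file's extent count through the FIEMAP ioctl, which is the interface filefrag normally uses. The function name count_file_extents() is made up for this example, and the same caveat about compression presumably applies to the count it returns.

```c
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the number of extents mapped for `path`,
 * or -1 if the file cannot be opened or the filesystem lacks FIEMAP.
 */
long count_file_extents(const char *path)
{
    struct fiemap fm;
    int fd, rc;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;    /* flush dirty data before mapping */
    fm.fm_extent_count = 0;            /* 0 = only count, return no extents */

    rc = ioctl(fd, FS_IOC_FIEMAP, &fm);
    close(fd);
    if (rc < 0)
        return -1;

    return (long)fm.fm_mapped_extents;
}
```

Calling it on a journal file before and after a defragment run would give a rough before/after comparison.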

Since the journals have their own checksumming I'm not overly concerned about 
setting xattr +C.

Chris Murphy


[PATCH v2] Btrfs: fix qgroups sanity test crash or hang

2014-06-11 Thread Filipe David Borba Manana
Often when running the qgroups sanity test, a crash or a hang happened.
This is because the extent buffer the test uses for the root node doesn't
have a header level explicitly set, making it have a random level value.
This is a problem when it's not zero for the btrfs_search_slot() calls
the test ends up doing, resulting in crashes or hangs such as the following:

[ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
(...)
[ 6454.127760] BTRFS: selftest: Running qgroup tests
[ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
[ 6454.127966] BTRFS: selftest: Qgroup basic add
[ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
[ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core 
pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
[ 6480.152005] irq event stamp: 188448
[ 6480.152005] hardirqs last  enabled at (188447): [8168ef5c] 
restore_args+0x0/0x30
[ 6480.152005] hardirqs last disabled at (188448): [81698e6a] 
apic_timer_interrupt+0x6a/0x80
[ 6480.152005] softirqs last  enabled at (188446): [810516cf] 
__do_softirq+0x1cf/0x450
[ 6480.152005] softirqs last disabled at (188441): [81051c25] 
irq_exit+0xb5/0xc0
[ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 
3.15.0-rc8-fdm-btrfs-next-33+ #4
[ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6480.152005] task: 8802146125a0 ti: 8800d0d0 task.ti: 
8800d0d0
[ 6480.152005] RIP: 0010:[81349a63]  [81349a63] 
__write_lock_failed+0x13/0x20
[ 6480.152005] RSP: 0018:8800d0d038e8  EFLAGS: 0287
[ 6480.152005] RAX:  RBX: 8168ef5c RCX: 05deb8525852
[ 6480.152005] RDX:  RSI: 1d45 RDI: 8802105000b8
[ 6480.152005] RBP: 8800d0d038e8 R08: fe12710f63db R09: a03196fb
[ 6480.152005] R10: 8802146125a0 R11: 880214612e28 R12: 8800d0d03858
[ 6480.152005] R13:  R14: 8800d0d0 R15: 8802146125a0
[ 6480.152005] FS:  7f14ff804700() GS:880215e0() 
knlGS:
[ 6480.152005] CS:  0010 DS:  ES:  CR0: 8005003b
[ 6480.152005] CR2: 7fff4df0dac8 CR3: d1796000 CR4: 06f0
[ 6480.152005] Stack:
[ 6480.152005]  8800d0d03908 810ae967 0001 
8802105000b8
[ 6480.152005]  8800d0d03938 8168e57e a0319c16 
0007
[ 6480.152005]  88021050 880210500100 8800d0d039b8 
a0319c16
[ 6480.152005] Call Trace:
[ 6480.152005]  [810ae967] do_raw_write_lock+0x47/0xa0
[ 6480.152005]  [8168e57e] _raw_write_lock+0x5e/0x80
[ 6480.152005]  [a0319c16] ? btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [a0319c16] btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [a02b2acb] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[ 6480.152005]  [a02b81a6] btrfs_search_slot+0x916/0xa20 [btrfs]
[ 6480.152005]  [811a727f] ? create_object+0x23f/0x300
[ 6480.152005]  [a02b9958] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
[ 6480.152005]  [a036041a] 
insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
[ 6480.152005]  [a03605c3] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
[ 6480.152005]  [8108cad6] ? local_clock+0x16/0x30
[ 6480.152005]  [a035ef8e] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
[ 6480.152005]  [a03a69d2] ? 
ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
[ 6480.152005]  [a03a6a86] init_btrfs_fs+0xb4/0x153 [btrfs]
[ 6480.152005]  [81000352] do_one_initcall+0x102/0x150
[ 6480.152005]  [8103d223] ? set_memory_nx+0x43/0x50
[ 6480.152005]  [81682668] ? set_section_ro_nx+0x6d/0x74
[ 6480.152005]  [810d91cc] load_module+0x1cdc/0x2630
(...)

Therefore initialize the extent buffer as an empty leaf (level 0).

Issue easy to reproduce when btrfs is built as a module via:

$ for ((i = 1; i <= 100; i++)); do rmmod btrfs; modprobe btrfs; done

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---

V2: Fixed silly mistake. Set root->node's header level and nritems after
checking that root->node is not NULL.
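
The ordering this version settles on (allocate, check for failure, then initialize the header fields) can be illustrated with a small user-space sketch. struct toy_buffer and toy_setup_root_node() are stand-ins invented for this example; they are not the kernel's extent buffer API.

```c
#include <errno.h>
#include <stdlib.h>

/* Toy stand-in for an extent buffer header; not the kernel structure. */
struct toy_buffer {
    unsigned char level;
    unsigned int nritems;
};

/* malloc() leaves the fields indeterminate, much like the unset header level. */
int toy_setup_root_node(struct toy_buffer **out)
{
    struct toy_buffer *node = malloc(sizeof(*node));

    if (!node)
        return -ENOMEM;  /* check first: never touch a failed allocation */

    node->level = 0;     /* mark the buffer as an empty leaf */
    node->nritems = 0;
    *out = node;
    return 0;
}
```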

 fs/btrfs/tests/qgroup-tests.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
index fa691b7..ec3dcb2 100644
--- a/fs/btrfs/tests/qgroup-tests.c
+++ b/fs/btrfs/tests/qgroup-tests.c
@@ -415,6 +415,8 @@ int btrfs_test_qgroups(void)
 		ret = -ENOMEM;
 		goto out;
 	}
+	btrfs_set_header_level(root->node, 0);
+	btrfs_set_header_nritems(root->node, 0);
 	root->alloc_bytenr += 8192;
 
 	tmp_root = btrfs_alloc_dummy_root();
-- 
1.9.1


Re: Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Russell Coker
On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested me that the problem could be due to a bad interaction between
> systemd and btrfs. NetworkManager was innocent.  It seems that
> systemd-journal create a very hight fragmented files when it stores its
> log. And BTRFS it is know to behave slowly when a file is highly
> fragmented. This had caused a slow startup of systemd-journal, which in
> turn had blocked the services which depend by the loggin system.

On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put 
SystemMaxUse=50M.  That doesn't solve the fragmentation problem but reduces 
it enough that it doesn't bother me.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/



Re: Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Dave Chinner
On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I would like to share a my experience about a slowness of systemd when used 
> on BTRFS.
> 
> My boot time was very high (about ~50 seconds); most of time it was due to 
> NetworkManager which took about 30-40 seconds to start (this data came from 
> systemd-analyze plot).
> 
> I make several attempts to address this issue. Also I noticed that sometime 
> this problem disappeared; but I was never able to understand why.
> 
> However this link
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested me that the problem could be due to a bad interaction between 
> systemd and btrfs. NetworkManager was innocent. 

systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it.  All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
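
To make the allocation pattern concrete, here is a minimal user-space sketch. It is not systemd's actual code: append_tiny_fallocate() mimics the "extend by just enough, then write" pattern described above, while append_chunked_fallocate() preallocates in larger steps so the allocator sees fewer and bigger requests. Both function names and the 8 MiB chunk size are assumptions chosen purely for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical chunk size for the batched variant; not taken from systemd. */
#define PREALLOC_CHUNK (8 * 1024 * 1024)

/* Pattern described above: grow the file by exactly what the next write needs. */
int append_tiny_fallocate(int fd, off_t *end, const void *buf, size_t len)
{
    int err = posix_fallocate(fd, *end, (off_t)len);

    if (err)
        return -err;  /* posix_fallocate() returns a positive errno value */
    if (pwrite(fd, buf, len, *end) != (ssize_t)len)
        return -1;
    *end += (off_t)len;
    return 0;
}

/* Alternative: preallocate a large chunk only when the reserved space runs out. */
int append_chunked_fallocate(int fd, off_t *end, off_t *reserved,
                             const void *buf, size_t len)
{
    while (*end + (off_t)len > *reserved) {
        int err = posix_fallocate(fd, *reserved, PREALLOC_CHUNK);

        if (err)
            return -err;
        *reserved += PREALLOC_CHUNK;
    }
    if (pwrite(fd, buf, len, *end) != (ssize_t)len)
        return -1;
    *end += (off_t)len;
    return 0;
}
```

The caller keeps track of `end` (bytes written so far) and, for the chunked variant, `reserved` (bytes preallocated so far); both start at 0 for a new file.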

Yup, it fragments journal files on XFS, too.

http://oss.sgi.com/archives/xfs/2014-03/msg00322.html

IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Dave Chinner
On Thu, Jun 12, 2014 at 11:21:04AM +1000, Dave Chinner wrote:
> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> > Hi all,
> > 
> > I would like to share a my experience about a slowness of systemd when used 
> > on BTRFS.
> > 
> > My boot time was very high (about ~50 seconds); most of time it was due to 
> > NetworkManager which took about 30-40 seconds to start (this data came from 
> > systemd-analyze plot).
> > 
> > I make several attempts to address this issue. Also I noticed that sometime 
> > this problem disappeared; but I was never able to understand why.
> > 
> > However this link
> > 
> >   https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> > 
> > suggested me that the problem could be due to a bad interaction between 
> > systemd and btrfs. NetworkManager was innocent. 
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...

BTW, the systemd list is subscriber only, so they aren't going to
see anything that we comment on from a cross-post to the btrfs list.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH 2/3] btrfs-progs: fix max mirror number error for chunk-recover

2014-06-11 Thread Gui Hecheng
When running chunk-recover on a healthy btrfs (data profile raid0, with
plenty of data), the program may abort on the number of mirrors of an
extent.

According to the kernel code, the maximum mirror number of an extent
is 3, not 2:
	ctree.h:         BTRFS_MAX_MIRRORS   3
	chunk-recover.c: BTRFS_NUM_MIRRORS   2
Change BTRFS_NUM_MIRRORS to 3 and everything goes well.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 chunk-recover.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chunk-recover.c b/chunk-recover.c
index 9b46b0b..d5a688e 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -42,7 +42,7 @@
 #include "btrfsck.h"
 #include "commands.h"
 
-#define BTRFS_NUM_MIRRORS  2
+#define BTRFS_NUM_MIRRORS  3
 
 struct recover_control {
int verbose;
-- 
1.8.1.4



[PATCH 1/3] btrfs-progs: fix missing parity stripe for raid6 in chunk-recover

2014-06-11 Thread Gui Hecheng
When dealing with the P & Q stripes for data profile raid6, chunk-recover
forgets to insert them into the chunk record. Insert them back.
Also wrap the insert procedure in a new function, fill_chunk_up.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 chunk-recover.c | 30 --
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/chunk-recover.c b/chunk-recover.c
index dfa7ff6..9b46b0b 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1785,6 +1785,23 @@ static inline int count_devext_records(struct list_head 
*record_list)
 	return num_of_records;
 }
 
+static int fill_chunk_up(struct chunk_record *chunk, struct list_head *devexts,
+			 struct recover_control *rc)
+{
+	int ret = 0;
+	int i;
+
+	for (i = 0; i < chunk->num_stripes; i++) {
+		if (!chunk->stripes[i].devid) {
+			ret = insert_stripe(devexts, rc, chunk, i);
+			if (ret)
+				break;
+		}
+	}
+
+	return ret;
+}
+
 #define EQUAL_STRIPE (1 << 0)
 
 static int rebuild_raid_data_chunk_stripes(struct recover_control *rc,
@@ -1919,9 +1936,9 @@ next_csum:
 		num_unordered = count_devext_records(&unordered);
 		if (chunk->type_flags & BTRFS_BLOCK_GROUP_RAID6
 					&& num_unordered == 2) {
-			list_splice_init(&unordered, &chunk->dextents);
 			btrfs_release_path(&path);
-			return 0;
+			ret = fill_chunk_up(chunk, &unordered, rc);
+			return ret;
 		}
 
 		goto next_stripe;
@@ -1966,14 +1983,7 @@ out:
 						& BTRFS_BLOCK_GROUP_RAID5)
 		 || (num_unordered == 3 && chunk->type_flags
 						& BTRFS_BLOCK_GROUP_RAID6)) {
-			for (i = 0; i < chunk->num_stripes; i++) {
-				if (!chunk->stripes[i].devid) {
-					ret = insert_stripe(&unordered, rc,
-							chunk, i);
-					if (ret)
-						break;
-				}
-			}
+			ret = fill_chunk_up(chunk, &unordered, rc);
 		}
 	}
 fail_out:
 fail_out:
-- 
1.8.1.4



[PATCH 3/3] btrfs-progs: cleanup unused assignment for chunk-recover

2014-06-11 Thread Gui Hecheng
The 'num_unordered' will be recounted after 'goto out',
just remove it.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 chunk-recover.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/chunk-recover.c b/chunk-recover.c
index d5a688e..8bc5bc3 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1905,7 +1905,6 @@ next_csum:
 			fprintf(stderr, "Fetch csum failed\n");
 			goto fail_out;
 		} else if (ret == 1) {
-			num_unordered = count_devext_records(&unordered);
 			if (!(*flags & EQUAL_STRIPE))
 				*flags |= EQUAL_STRIPE;
 			goto out;
-- 
1.8.1.4



Re: Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Chris Murphy
On Jun 11, 2014, at 7:21 PM, Dave Chinner da...@fromorbit.com wrote:

> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>> 
>> I would like to share a my experience about a slowness of systemd when used 
>> on BTRFS.
>> 
>> My boot time was very high (about ~50 seconds); most of time it was due to 
>> NetworkManager which took about 30-40 seconds to start (this data came from 
>> systemd-analyze plot).
>> 
>> I make several attempts to address this issue. Also I noticed that sometime 
>> this problem disappeared; but I was never able to understand why.
>> 
>> However this link
>> 
>>  https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested me that the problem could be due to a bad interaction between 
>> systemd and btrfs. NetworkManager was innocent. 
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> da...@fromorbit.com

On Jun 11, 2014, at 7:37 PM, Dave Chinner da...@fromorbit.com wrote:
> 
> BTW, the systemd list is subscriber only, so thay aren't going to
> see anything that we comment on from a cross-post to the btrfs list.


Unless a subscriber finds something really interesting, quotes it, and cross 
posts it.

Chris Murphy


Re: Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Duncan
Russell Coker posted on Thu, 12 Jun 2014 11:18:37 +1000 as excerpted:

> On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested me that the problem could be due to a bad interaction between
>> systemd and btrfs. NetworkManager was innocent.  It seems that
>> systemd-journal create a very hight fragmented files when it stores its
>> log. And BTRFS it is know to behave slowly when a file is highly
>> fragmented. This had caused a slow startup of systemd-journal, which in
>> turn had blocked the services which depend by the loggin system.
> 
> On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
> SystemMaxUse=50M.  That doesn't solve the fragmentation problem but
> reduces it enough that it doesn't bother me.

FWIW, as a relatively new switcher to systemd, that is, after switching 
to btrfs only a year or so ago...  Two comments:

1) Having seen a few reports of journald's journal fragmentation on this 
list, I was worried about those journals here as well.

My solution to both this problem and to an unrelated frustration with 
journald[1] was to:

a) confine journald to only a volatile (memory-only) log, first on a 
temporary basis while I was only experimenting with and setting up systemd 
(using the kernel command-line's init= to point at systemd while /sbin/
init still pointed at sysv's init for openrc), then later permanently, 
once I got enough of my systemd and journald config setup to actually 
switch to it.

b) configure my former syslogger (syslog-ng, in my case) to continue in 
that role under systemd, with journald relaying to it for non-volatile 
logging.

Here's the /etc/journald.conf changes I ended up with to accomplish (a), 
see the journald.conf(5) manpage for the documentation, as well as the 
below explanation:

Storage=volatile
RuntimeMaxUse=448M
RuntimeKeepFree=48M
RuntimeMaxFileSize=64M

Storage=volatile is the important one.  As the manpage notes, that means 
journald stores files under /run/log/journal only, where /run is normally 
setup by systemd as a tmpfs mount, so these files are tmpfs and thus 
memory-only.

The other three must be read in the context of a 512 MiB /run on tmpfs
[2].  From that and the information in the journald.conf manpage, it 
should be possible to see that my setup is (runtime* settings apply to 
the volatile files under /run):

An individual journal filesize (MaxFileSize) of 64 MiB, with seven such 
files in rotation (the default if MaxFileSize is unset is eight), 
totaling 448 MiB (MaxUse, the default being 10% of the filesystem, too 
small here since the journals are basically the only thing taking 
space).  On a 512 MiB filesystem, that will leave 64 MiB for other uses 
(pretty much all 0-byte lock and pidfiles, IIRC I was running something 
like a 2 MiB /run before systemd without issue).

It's worth noting that UNLIKE MaxUse, which will trigger journal file 
rotation when hit, hitting the KeepFree forces journald to stop 
journaling entirely -- *NOT* just to stop writing them here, but to stop 
forwarding to syslog (syslog-ng here) as well.  I FOUND THIS OUT THE HARD 
WAY!  Thus, in order to keep journald still functional, make sure 
journald runs into the MaxUse limit before it runs into KeepFree.  The 
KeepFree default is 15% of the filesystem, just under 77 MiB on a 512 MiB 
filesystem which is why I found this out the hard way with settings that 
would otherwise keep only 64 MiB free.  The 48 MiB setting I chose leaves 
16 MiB of room for other files before journald shuts down journaling, 
which should be plenty, since under normal circumstances the other files 
should all be 0-byte lock and pidfiles.  Just in case, however, there's 
still 48 MiB of room for other files after journald shuts down, before 
the filesystem itself fills up.
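
The arithmetic above can be checked with a few lines of C. This is only a worked example of the numbers in this message, under the MaxUse/KeepFree behaviour as described here; struct journal_limits and both helper names are invented for the illustration and have nothing to do with journald's own code.

```c
/* All sizes in MiB, taken from the settings discussed above. */
struct journal_limits {
    double fs_size;    /* size of the tmpfs mounted at /run */
    double max_use;    /* RuntimeMaxUse */
    double keep_free;  /* RuntimeKeepFree */
};

/* Free space left on the filesystem once the journals reach MaxUse. */
double free_at_max_use(const struct journal_limits *l)
{
    return l->fs_size - l->max_use;
}

/*
 * Room left for non-journal files before KeepFree is violated, assuming the
 * journals already occupy MaxUse. A negative result means KeepFree is hit
 * before MaxUse.
 */
double headroom_before_keep_free(const struct journal_limits *l)
{
    return free_at_max_use(l) - l->keep_free;
}
```

With a 512 MiB /run and RuntimeMaxUse=448M, the default KeepFree of 15% (about 76.8 MiB) gives a negative headroom, so KeepFree is reached first; RuntimeKeepFree=48M gives 64 - 48 = 16 MiB of headroom.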

Configuring the syslogger to work with journald is left as an exercise 
for the reader, as they say, since for all I know the OP is using 
something other than the syslog-ng I'm familiar with anyway.  But if 
hints for syslog-ng are needed too, let me know. =:^)


2) Someone else mentioned btrfs' autodefrag mount-option.  Given #1 above 
I've obviously not had a lot of experience with journald logs and 
autodefrag, but based on all I know about btrfs fragmentation behavior as 
well as journald journal file behavior from this list, as long as 
journald's non-volatile files are kept significantly under 1 GiB and 
preferably under half a GiB each, it shouldn't be a problem, with a 
/possible/ exception if you get something run-away-journaling multiple 
messages a second for a reasonably long period, such that the I/O can't 
keep up with both the journaling and autodefrag.

If you do choose to keep a persistent journal with autodefrag, then, I'd 
recommend journald.conf settings that keep individual journal files to 
perhaps 128 MiB each.  (System* settings apply to the non-volatile files 
under /var, in /var/log/journal/.)


[PATCH] btrfs: free ulist in qgroup_shared_accounting() error path

2014-06-11 Thread Eric Sandeen
If tmp = ulist_alloc(GFP_NOFS) fails, we return without
freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS)
and cause a memory leak.

Signed-off-by: Eric Sandeen sand...@redhat.com
---
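
For illustration only, the leak and its fix can be reproduced in user space with a toy model. struct toy_list, the toy_* helpers and the toy_live_lists counter are all hypothetical names for this sketch; they only mimic the shape of the ulist_alloc()/ulist_free() calls in the patch.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_list { int unused; };

/* Allocator hook so a failure of the second allocation can be simulated. */
typedef struct toy_list *(*toy_alloc_fn)(void);

int toy_live_lists;  /* number of lists currently allocated */

struct toy_list *toy_alloc(void)
{
    struct toy_list *l = calloc(1, sizeof(*l));

    if (l)
        toy_live_lists++;
    return l;
}

struct toy_list *toy_alloc_fail(void)
{
    return NULL;  /* simulate an allocation failure */
}

void toy_free(struct toy_list *l)
{
    if (l) {
        free(l);
        toy_live_lists--;
    }
}

int toy_shared_accounting(toy_alloc_fn alloc_second)
{
    struct toy_list *qgroups, *tmp;

    qgroups = toy_alloc();
    if (!qgroups)
        return -ENOMEM;

    tmp = alloc_second();
    if (!tmp) {
        toy_free(qgroups);  /* without this, qgroups leaks */
        return -ENOMEM;
    }

    /* ... the accounting work would go here ... */
    toy_free(tmp);
    toy_free(qgroups);
    return 0;
}
```

Passing toy_alloc_fail as the second allocator exercises the error path; toy_live_lists drops back to 0 only because of the added free.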

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index cf5aead..98cb6b2 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1798,8 +1798,10 @@ static int qgroup_shared_accounting(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 
 	tmp = ulist_alloc(GFP_NOFS);
-	if (!tmp)
+	if (!tmp) {
+		ulist_free(qgroups);
 		return -ENOMEM;
+	}
 
 	btrfs_get_tree_mod_seq(fs_info, &elem);
 	ret = btrfs_find_all_roots(trans, fs_info, oper->bytenr, elem.seq,



[PATCH] btrfs: fix use of uninit ret in end_extent_writepage()

2014-06-11 Thread Eric Sandeen
If this condition in end_extent_writepage() is false:

	if (tree->ops && tree->ops->writepage_end_io_hook)

we will then test an uninitialized ret at:

	ret = ret < 0 ? ret : -EIO;

The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.

Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.

Signed-off-by: Eric Sandeen sand...@redhat.com
---

p.s. - feel free to double check that this is sufficient ;)
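
One way to double check the reasoning is a toy user-space model of the error selection. This is a sketch, not the kernel function: end_io_hook_t, model_writepage_error() and example_failing_hook() are names invented here, and the model returns the error that would be recorded rather than setting it on a mapping.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for tree->ops->writepage_end_io_hook. */
typedef int (*end_io_hook_t)(int uptodate);

/* Example hook that always fails, used to exercise the hook path. */
int example_failing_hook(int uptodate)
{
    (void)uptodate;
    return -ENOSPC;
}

/*
 * Toy model of the error selection: returns the error that would be
 * recorded for the page, or 0 if the write is considered successful.
 */
int model_writepage_error(end_io_hook_t hook, int err)
{
    int uptodate = (err == 0);
    int ret = 0;  /* the fix: defined even when no hook is installed */

    if (hook) {
        ret = hook(uptodate);
        if (ret)
            uptodate = 0;
    }

    if (!uptodate)
        return ret < 0 ? ret : -EIO;
    return 0;
}
```

With no hook and a non-zero err, ret stays 0 and the model picks -EIO; with a failing hook, the hook's own error is kept.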

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f25a909..20b73c4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2354,7 +2354,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 {
 	int uptodate = (err == 0);
 	struct extent_io_tree *tree;
-	int ret;
+	int ret = 0;
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 



[PATCH] btrfs: fix error handling in create_pending_snapshot

2014-06-11 Thread Eric Sandeen
fcebe456 cut and pasted some code to a later point
in create_pending_snapshot(), but didn't switch
to the appropriate error handling for this stage
of the function.

Signed-off-by: Eric Sandeen sand...@redhat.com
---

I think this is right.  Josef, please double check me :)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9630f10..511839c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1284,11 +1284,13 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	pending->error = btrfs_qgroup_inherit(trans, fs_info,
-					      root->root_key.objectid,
-					      objectid, pending->inherit);
-	if (pending->error)
-		goto no_free_objectid;
+	ret = btrfs_qgroup_inherit(trans, fs_info,
+				   root->root_key.objectid,
+				   objectid, pending->inherit);
+	if (ret) {
+		btrfs_abort_transaction(trans, root, ret);
+		goto fail;
+	}
 
 	/* see comments in should_cow_block() */
 	set_bit(BTRFS_ROOT_FORCE_COW, &root->state);
