Re: potential integer truncation issues on 32 bit archs

2015-10-18 Thread Dave Chinner
On Sun, Oct 18, 2015 at 03:51:19PM +0200, PaX Team wrote:
> Hi everyone,
> 
> while investigating some integer related error reports we got, i ran across
> some code in both the VFS/MM and btrfs that i think raises a more generic
> problem.
> 
> in particular, when converting a file offset to a page cache index, the
> canonical type of the latter is usually pgoff_t, typedef'ed to unsigned
> long (but i saw unsigned long used directly as well). this can be
> problematic if the VFS or any file system wants to support files over 16TB
> (say on i386) since after a shift by PAGE_CACHE_SHIFT some MSBs will be
> lost on 32 bit archs (now this may not be a supported use case but at
> least btrfs doesn't seem to exclude it). another trigger seems to be
> vfs_fsync, which passes LLONG_MAX and which can end up converted to a page
> cache index (truncated on 32 bit archs).

Files larger than 16TB are not supported on 32 bit arches. Most
filesystems limit file size to 8TB (MAX_LFS_FILESIZE), though some
(e.g. XFS) limit it to (16TB - 1 byte).  Filesystems currently
enforce the 8/16TB file size limit on 32 bit systems through
sb->s_maxbytes.

I very much doubt we'll ever change this because of all the
infrastructure changes needed, e.g. radix trees would need to be
converted to 64 bit indexes on 32 bit platforms. It's far easier for
people to use a 64 bit CPU than it is for us to extend 32 bit
system support.
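A quick userspace sketch of the truncation (illustrative only, not kernel code; `DEMO_PAGE_SHIFT` and the helper names are made up): with 4K pages, any byte offset of 16TB or more loses its high bits when the page cache index is a 32-bit unsigned long.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate the 32-bit pgoff_t: a byte offset shifted down by the page
 * shift and stored in a 32-bit value. */
#define DEMO_PAGE_SHIFT 12	/* 4K pages */

/* What a 32-bit kernel effectively computes. */
static inline uint32_t offset_to_index_32(uint64_t off)
{
	return (uint32_t)(off >> DEMO_PAGE_SHIFT);
}

/* What a full 64-bit pgoff_t would compute. */
static inline uint64_t offset_to_index_64(uint64_t off)
{
	return off >> DEMO_PAGE_SHIFT;
}
```

At 16TB (2^44 bytes) the 64-bit index is 2^32, which wraps to 0 in 32 bits, so two very different file offsets alias to the same page cache index.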

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

2015-10-18 Thread Christoph Hellwig
Just commenting on the man page here, as the comment is about semantics.
All the infrastructure in the patch looks reasonable to me, but this
is something we need to get right.

> +.B COPY_FR_REFLINK
> +Create a lightweight "reflink", where data is not copied until
> +one of the files is modified.
> +.PP
> +The default behavior
> +.RI ( flags
> +== 0) is to perform a full data copy of the requested range.
> +.SH RETURN VALUE
> +Upon successful completion,
> +.BR copy_file_range ()
> +will return the number of bytes copied between files.
> +This could be less than the length originally requested.

As mentioned in the previous discussion, I fundamentally disagree with
the way you word the flags here.

flags = 0 gives you the data from the source at the dest, period.  How it's
implemented is up to the file system, as a user cannot observe how the data
is actually stored underneath.

Additionally I think the 'clone' option with its stronger guarantees
should be a separate system call.  So for now just have no supported
flags and leave it up to the file system and storage device how to
implement it.

For the future, a COPY_FALLOC flag that guarantees you do not get ENOSPC
on the copied range would be very useful, but given the complexity I
think it's not something we should add now.
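The partial-copy return value discussed above implies callers must loop. A minimal sketch of that pattern (`copy_range_full` and `chunky_copier` are hypothetical helpers; the callback stands in for the actual copy_file_range() syscall so the loop is testable in isolation):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* A copier may move fewer bytes than asked for; it reports how many it
 * actually moved, 0 for EOF, or a negative value for an error. */
typedef ssize_t (*copier_fn)(size_t want, void *ctx);

/* Loop until the whole requested range has been copied, mirroring how
 * userspace must handle a short return from copy_file_range(). */
static ssize_t copy_range_full(size_t len, copier_fn copier, void *ctx)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = copier(len - done, ctx);

		if (n < 0)
			return n;	/* propagate the error */
		if (n == 0)
			break;		/* e.g. EOF on the source */
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Fake copier that never moves more than 8 bytes per call, emulating a
 * kernel that returns short copies. */
static ssize_t chunky_copier(size_t want, void *ctx)
{
	(void)ctx;
	return (ssize_t)(want < 8 ? want : 8);
}
```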


[PATCH] btrfs: switch __btrfs_fs_incompat return type from int to bool

2015-10-18 Thread Alexandru Moise
Make __btrfs_fs_incompat() conform to its cast-to-bool (!!) usage by
explicitly returning bool rather than int.
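For illustration, the !! idiom and the implicit bool conversion produce the same normalized result; the functions below are stand-ins for demonstration, not the btrfs helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old style: !! explicitly collapses any nonzero bitmask to 1. */
static int flag_test_int(uint64_t features, uint64_t flag)
{
	return !!(features & flag);
}

/* New style: the conversion to bool normalizes the value implicitly. */
static bool flag_test_bool(uint64_t features, uint64_t flag)
{
	return features & flag;
}
```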

Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
---
 fs/btrfs/ctree.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..f387e2d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4103,7 +4103,7 @@ static inline void __btrfs_set_fs_incompat(struct btrfs_fs_info *fs_info,
 #define btrfs_fs_incompat(fs_info, opt) \
__btrfs_fs_incompat((fs_info), BTRFS_FEATURE_INCOMPAT_##opt)
 
-static inline int __btrfs_fs_incompat(struct btrfs_fs_info *fs_info, u64 flag)
+static inline bool __btrfs_fs_incompat(struct btrfs_fs_info *fs_info, u64 flag)
 {
struct btrfs_super_block *disk_super;
disk_super = fs_info->super_copy;
-- 
2.6.1



Re: Recover btrfs volume which can only be mounded in read-only mode

2015-10-18 Thread Dmitry Katsubo
On 16/10/2015 10:18, Duncan wrote:
> Dmitry Katsubo posted on Thu, 15 Oct 2015 16:10:13 +0200 as excerpted:
> 
>> On 15 October 2015 at 02:48, Duncan <1i5t5.dun...@cox.net> wrote:
>>
>>> [snipped] 
>>
>> Thanks for this information. As far as I can see, btrfs-tools v4.1.2 is
>> now in the experimental Debian repo (but you anyway suggest at least
>> 4.2.2, which was released in master git just 10 days ago). Kernel image
>> 3.18 is still not there, perhaps because Debian jessie was frozen before
>> it was released (2014-12-07).
> 
> For userspace, as long as it's supporting the features you need at 
> runtime (where it generally simply has to know how to make the call to 
> the kernel, to do the actual work), and you're not running into anything 
> really hairy that you're trying to offline-recover, which is where the 
> latest userspace code becomes critical...
> 
> Running a userspace series behind, or even more (as long as it's not 
> /too/ far), isn't all /that/ critical a problem.
> 
> It generally becomes a problem in one of three ways: 1) You have a bad 
> filesystem and want the best chance at fixing it, in which case you 
> really want the latest code, including the absolute latest fixups for the 
> most recently discovered possible problems. 2) You want/need a new 
> feature that's simply not supported in your old userspace.  3) The 
> userspace gets so old that the output from its diagnostics commands no 
> longer easily compares with that of current tools, giving people on-list 
> difficulties when trying to compare the output in your posts to the 
> output they get.
> 
> As a very general rule, at least try to keep the userspace version 
> comparable to the kernel version you are running.  Since the userspace 
> version numbering syncs to kernelspace version numbering, and userspace 
> of a particular version is normally released shortly after the similarly 
> numbered kernel series is released, with a couple minor updates before 
> the next kernel-series-synced release, keeping userspace to at least the 
> kernel space version, means you're at least running the userspace release 
> that was made with that kernel series release in mind.
> 
> Then, as long as you don't get too far behind on kernel version, you 
> should remain at least /somewhat/ current on userspace as well, since 
> you'll be upgrading to near the same userspace (at least), when you 
> upgrade the kernel.
> 
> Using that loose guideline, since you're aiming for the 3.18 stable 
> kernel, you should be running at least a 3.18 btrfs-progs as well.
> 
> In that context, btrfs-progs 4.1.2 should be fine, as long as you're not 
> trying to fix any problems that a newer version fixed.  And, my 
> recommendation of the latest 4.2.2 was in the "fixing problems" context, 
> in which case, yes, getting your hands on 4.2.2, even if it means 
> building from sources to do so, could be critical, depending of course on 
> the problem you're trying to fix.  But otherwise, 4.1.2, or even back to 
> the last 3.18.whatever release since that's the kernel version you're 
> targeting, should be fine.
> 
> Just be sure that whenever you do upgrade to later, you avoid the known-
> bad-mkfs.btrfs in 4.2.0 and/or 4.2.1 -- be sure if you're doing the btrfs-
> progs-4.2 series, that you get 4.2.2 or later.
> 
> As for finding a current 3.18 series kernel released for Debian, I'm not 
> a Debian user so my knowledge of the ecosystem around it is limited, 
> but I've been very much under the impression that there are various 
> optional repos available that you can choose to include and update from 
> as well, and I'm quite sure based on previous discussions with others 
> that there's a well recognized and fairly commonly enabled repo that 
> includes Debian kernel updates thru current release, or close to it.
> 
> Of course you could also simply run a mainstream Linus kernel and build 
> it yourself, and it's not too horribly hard to do either, as there's all 
> sorts of places with instructions for doing so out there, and back when I 
> switched from MS to freedomware Linux in late 2001, I learned the skill, 
> at least at the reasonably basic level of mostly taking a working config 
> from my distro's kernel and using it as a basis for my mainstream kernel 
> config as well, within about two months of switching.
> 
> Tho of course just because you can doesn't mean you want to, and for 
> many, finding their distro's experimental/current kernel repos and simply 
> installing the packages from it, will be far simpler.
> 
> But regardless of the method used, finding or building and keeping 
> current with your own copy of at least the latest couple of LTS 
> releases, shouldn't be /horribly/ difficult.  While I've not used them as 
> actual package resources in years, I do still know a couple rpm-based 
> package resources from my time back on Mandrake (and do still check them 
> in contexts like this for others, or to quickly see what files a package 
> 

Re: [PATCH] Btrfs-progs: fix btrfs-convert rollback to check ROOT_BACKREF

2015-10-18 Thread Qu Wenruo



On 2015-10-18 13:44, Liu Bo wrote:

Btrfs has changed to delete subvolumes/snapshots asynchronously, which means
that even after the umount, if we've already deleted 'ext2_saved', rollback
can still be completed, because the ROOT_ITEM may not have been cleaned up
yet.

So this adds a check for ROOT_BACKREF before checking ROOT_ITEM, since the
ROOT_BACKREF is removed from the btree immediately once
ioctl(BTRFS_IOC_SNAP_DESTROY) returns.

Signed-off-by: Liu Bo 

Reviewed-by: Qu Wenruo 

Looks good to me.

Although the error message for the ret > 0 case could be improved a little,
e.g.: "unable to find convert image subvolume, maybe it's already deleted?\n".


BTW, would you please submit a test case for fstests? It won't be a hard 
one though.


Thanks,
Qu


---
  btrfs-convert.c | 16 
  1 file changed, 16 insertions(+)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index 802930c..f8a6c16 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -2591,6 +2591,22 @@ static int do_rollback(const char *devname)
btrfs_init_path(&path);

key.objectid = CONV_IMAGE_SUBVOL_OBJECTID;
+   key.type = BTRFS_ROOT_BACKREF_KEY;
+   key.offset = BTRFS_FS_TREE_OBJECTID;
+   ret = btrfs_search_slot(NULL, root->fs_info->tree_root, &key, &path,
+   0, 0);
+   btrfs_release_path(&path);
+   if (ret > 0) {
+   fprintf(stderr, "unable to open subvol %llu\n",
+   (unsigned long long)key.objectid);
+   goto fail;
+   } else if (ret < 0) {
+   fprintf(stderr, "unable to open subvol %llu ret %d\n",
+   (unsigned long long)key.objectid, ret);
+   goto fail;
+   }
+
+   key.objectid = CONV_IMAGE_SUBVOL_OBJECTID;
key.type = BTRFS_ROOT_ITEM_KEY;
key.offset = (u64)-1;
image_root = btrfs_read_fs_root(root->fs_info, &key);




[PATCH V6 07/13] Btrfs: Use (eb->start, seq) as search key for tree modification log

2015-10-18 Thread Chandan Rajendra
In the subpagesize-blocksize scenario a page can map multiple extent buffers,
and hence using (page index, seq) as the search key is incorrect. For
example, searching through the tree modification log can return an entry
associated with the first extent buffer mapped by the page (if such an entry
exists), when we are actually searching for entries associated with extent
buffers that are mapped at position 2 or more in the page.
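The collision is easy to see with made-up numbers (a sketch assuming 64K pages and 4K extent buffers; `eb_page_index` is illustrative, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* With 64K pages, shifting an extent buffer's start address down by the
 * page shift collapses all buffers within one page onto the same index,
 * so (index, seq) cannot distinguish them while (eb->start, seq) can. */
#define DEMO_PAGE_SHIFT 16	/* 64K pages */

static inline uint64_t eb_page_index(uint64_t eb_start)
{
	return eb_start >> DEMO_PAGE_SHIFT;
}
```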

Reviewed-by: Liu Bo 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 5f745ea..719ed3c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -311,7 +311,7 @@ struct tree_mod_root {
 
 struct tree_mod_elem {
struct rb_node node;
-   u64 index;  /* shifted logical */
+   u64 logical;
u64 seq;
enum mod_log_op op;
 
@@ -435,11 +435,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
 
 /*
  * key order of the log:
- *   index -> sequence
+ *   node/leaf start address -> sequence
  *
- * the index is the shifted logical of the *new* root node for root replace
- * operations, or the shifted logical of the affected block for all other
- * operations.
+ * The 'start address' is the logical address of the *new* root node
+ * for root replace operations, or the logical address of the affected
+ * block for all other operations.
  *
  * Note: must be called with write lock (tree_mod_log_write_lock).
  */
@@ -460,9 +460,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
while (*new) {
cur = container_of(*new, struct tree_mod_elem, node);
parent = *new;
-   if (cur->index < tm->index)
+   if (cur->logical < tm->logical)
new = &((*new)->rb_left);
-   else if (cur->index > tm->index)
+   else if (cur->logical > tm->logical)
new = &((*new)->rb_right);
else if (cur->seq < tm->seq)
new = &((*new)->rb_left);
@@ -523,7 +523,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot,
if (!tm)
return NULL;
 
-   tm->index = eb->start >> PAGE_CACHE_SHIFT;
+   tm->logical = eb->start;
if (op != MOD_LOG_KEY_ADD) {
btrfs_node_key(eb, &tm->key, slot);
tm->blockptr = btrfs_node_blockptr(eb, slot);
@@ -588,7 +588,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
goto free_tms;
}
 
-   tm->index = eb->start >> PAGE_CACHE_SHIFT;
+   tm->logical = eb->start;
tm->slot = src_slot;
tm->move.dst_slot = dst_slot;
tm->move.nr_items = nr_items;
@@ -699,7 +699,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
goto free_tms;
}
 
-   tm->index = new_root->start >> PAGE_CACHE_SHIFT;
+   tm->logical = new_root->start;
tm->old_root.logical = old_root->start;
tm->old_root.level = btrfs_header_level(old_root);
tm->generation = btrfs_header_generation(old_root);
@@ -739,16 +739,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq,
struct rb_node *node;
struct tree_mod_elem *cur = NULL;
struct tree_mod_elem *found = NULL;
-   u64 index = start >> PAGE_CACHE_SHIFT;
 
tree_mod_log_read_lock(fs_info);
tm_root = &fs_info->tree_mod_log;
node = tm_root->rb_node;
while (node) {
cur = container_of(node, struct tree_mod_elem, node);
-   if (cur->index < index) {
+   if (cur->logical < start) {
node = node->rb_left;
-   } else if (cur->index > index) {
+   } else if (cur->logical > start) {
node = node->rb_right;
} else if (cur->seq < min_seq) {
node = node->rb_left;
@@ -1230,9 +1229,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
return NULL;
 
/*
-* the very last operation that's logged for a root is the replacement
-* operation (if it is replaced at all). this has the index of the *new*
-* root, making it the very first operation that's logged for this root.
+* the very last operation that's logged for a root is the
+* replacement operation (if it is replaced at all). this has
+* the logical address of the *new* root, making it the very
+* first operation that's logged for this root.
 */
while (1) {
tm = tree_mod_log_search_oldest(fs_info, root_logical,
@@ -1336,7 +1336,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
if (!next)
break;
 

[PATCH V6 05/13] Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units

2015-10-18 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, if i_size occurs in a block which is
not the last block in the page, then the space to be reserved should be
calculated appropriately.
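The reservation math the patch introduces can be sketched in userspace (a hedged illustration; `mkwrite_reservation` and `round_up_u64` are made-up names mirroring the round_up() call in the patch, with a 64K page and 4K sectors assumed in the examples):

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of
 * two), like the kernel's round_up(). */
static inline uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* When i_size lands inside the faulting page, reserve only up to the
 * sector that contains i_size, capped at a full page otherwise. */
static uint64_t mkwrite_reservation(uint64_t size, uint64_t page_start,
				    uint64_t sectorsize, uint64_t page_size)
{
	uint64_t r = round_up_u64(size - page_start, sectorsize);

	return r < page_size ? r : page_size;
}
```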

Reviewed-by: Liu Bo 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 36 +++-
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 266f216..294c503 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8661,11 +8661,24 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
loff_t size;
int ret;
int reserved = 0;
+   u64 reserved_space;
u64 page_start;
u64 page_end;
+   u64 end;
+
+   reserved_space = PAGE_CACHE_SIZE;
 
sb_start_pagefault(inode->i_sb);
-   ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+
+   /*
+* Reserving delalloc space after obtaining the page lock can lead to
+* deadlock. For example, if a dirty page is locked by this function
+* and the call to btrfs_delalloc_reserve_space() ends up triggering
+* dirty page write out, then the btrfs_writepage() function could
+* end up waiting indefinitely to get a lock on the page currently
+* being processed by btrfs_page_mkwrite() function.
+*/
+   ret  = btrfs_delalloc_reserve_space(inode, reserved_space);
if (!ret) {
ret = file_update_time(vma->vm_file);
reserved = 1;
@@ -8686,6 +8699,7 @@ again:
size = i_size_read(inode);
page_start = page_offset(page);
page_end = page_start + PAGE_CACHE_SIZE - 1;
+   end = page_end;
 
if ((page->mapping != inode->i_mapping) ||
(page_start >= size)) {
@@ -8701,7 +8715,7 @@ again:
 * we can't set the delalloc bits if there are pending ordered
 * extents.  Drop our locks and wait for them to finish
 */
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
if (ordered) {
unlock_extent_cached(io_tree, page_start, page_end,
 &cached_state, GFP_NOFS);
@@ -8711,6 +8725,18 @@ again:
goto again;
}
 
+   if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT)) {
+   reserved_space = round_up(size - page_start, root->sectorsize);
+   if (reserved_space < PAGE_CACHE_SIZE) {
+   end = page_start + reserved_space - 1;
+   spin_lock(&BTRFS_I(inode)->lock);
+   BTRFS_I(inode)->outstanding_extents++;
+   spin_unlock(&BTRFS_I(inode)->lock);
+   btrfs_delalloc_release_space(inode,
+   PAGE_CACHE_SIZE - reserved_space);
+   }
+   }
+
/*
 * XXX - page_mkwrite gets called every time the page is dirtied, even
 * if it was already dirty, so for space accounting reasons we need to
@@ -8718,12 +8744,12 @@ again:
 * is probably a better way to do this, but for now keep consistent with
 * prepare_pages in the normal write path.
 */
-   clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
+   clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
  EXTENT_DIRTY | EXTENT_DELALLOC |
  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
  0, 0, &cached_state, GFP_NOFS);
 
-   ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+   ret = btrfs_set_extent_delalloc(inode, page_start, end,
&cached_state);
if (ret) {
unlock_extent_cached(io_tree, page_start, page_end,
@@ -8762,7 +8788,7 @@ out_unlock:
}
unlock_page(page);
 out:
-   btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+   btrfs_delalloc_release_space(inode, reserved_space);
 out_noreserve:
sb_end_pagefault(inode->i_sb);
return ret;
-- 
2.1.0



[PATCH V6 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated

2015-10-18 Thread Chandan Rajendra
The following issue was observed when running the generic/095 test on the
subpagesize-blocksize patchset.

Assume that we are trying to write a dirty page that is mapping file offset
range [159744, 163839].

writepage_delalloc()
  find_lock_delalloc_range(*start = 159744, *end = 0)
find_delalloc_range()
  Returns range [X, Y] where (X > 163839)
lock_delalloc_pages()
  One of the pages in range [X, Y] has dirty flag cleared;
  Loop once more restricting the delalloc range to span only
  PAGE_CACHE_SIZE bytes;
find_delalloc_range()
  Returns range [356352, 360447];
lock_delalloc_pages()
  The page [356352, 360447] has dirty flag cleared;
Returns with *start = 159744 and *end = 0;
  *start = *end + 1;
  find_lock_delalloc_range(*start = 1, *end = 0)
Finds and returns delalloc range [1, 12288];
  cow_file_range()
Clears delalloc range [1, 12288]
Create ordered extent for range [1, 12288]

The ordered extent thus created above breaks the rule that extents have to be
aligned to the filesystem's block size.

In cases where lock_delalloc_pages() fails (either due to PG_dirty flag being
cleared or the page no longer being a member of the inode's page cache), this
patch sets and returns the delalloc range that was found by
find_delalloc_range().

Reviewed-by: Josef Bacik 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0ee486a..3912d1f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1731,6 +1731,8 @@ again:
goto again;
} else {
found = 0;
+   *start = delalloc_start;
+   *end = delalloc_end;
goto out_failed;
}
}
-- 
2.1.0



[PATCH V6 04/13] Btrfs: fallocate: Work with sectorsized blocks

2015-10-18 Thread Chandan Rajendra
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized
blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().
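The page-granularity test the patch replaces and its block-granularity replacement differ only in the shift used; a small sketch (illustrative macros standing in for BTRFS_BYTES_TO_BLKS(), not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* 4K filesystem blocks: two byte offsets are in the same block when
 * their block numbers match. */
#define DEMO_BLOCKSIZE_BITS 12
#define BYTES_TO_BLKS(b) ((b) >> DEMO_BLOCKSIZE_BITS)

/* Does the range [offset, offset + len - 1] stay within one block? */
static inline int same_block(uint64_t offset, uint64_t len)
{
	return BYTES_TO_BLKS(offset) == BYTES_TO_BLKS(offset + len - 1);
}
```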

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/file.c  | 44 ++--
 fs/btrfs/inode.c | 52 +++-
 3 files changed, 50 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 74a1439..e599f31 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3896,7 +3896,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct inode *dir, u64 objectid,
const char *name, int name_len);
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
int front);
 int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e16cb40..cb82bb6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2287,10 +2287,10 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
int ret = 0;
int err = 0;
int rsv_count;
-   bool same_page;
+   bool same_block;
bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
u64 ino_size;
-   bool truncated_page = false;
+   bool truncated_block = false;
bool updated_inode = false;
 
ret = btrfs_wait_ordered_range(inode, offset, len);
@@ -2298,7 +2298,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
return ret;
 
mutex_lock(&inode->i_mutex);
-   ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE);
+   ino_size = round_up(inode->i_size, root->sectorsize);
ret = find_first_non_hole(inode, , );
if (ret < 0)
goto out_only_mutex;
@@ -2311,31 +2311,30 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
lockend = round_down(offset + len,
 BTRFS_I(inode)->root->sectorsize) - 1;
-   same_page = ((offset >> PAGE_CACHE_SHIFT) ==
-   ((offset + len - 1) >> PAGE_CACHE_SHIFT));
-
+   same_block = (BTRFS_BYTES_TO_BLKS(root->fs_info, offset))
+   == (BTRFS_BYTES_TO_BLKS(root->fs_info, offset + len - 1));
/*
-* We needn't truncate any page which is beyond the end of the file
+* We needn't truncate any block which is beyond the end of the file
 * because we are sure there is no data there.
 */
/*
-* Only do this if we are in the same page and we aren't doing the
-* entire page.
+* Only do this if we are in the same block and we aren't doing the
+* entire block.
 */
-   if (same_page && len < PAGE_CACHE_SIZE) {
+   if (same_block && len < root->sectorsize) {
if (offset < ino_size) {
-   truncated_page = true;
-   ret = btrfs_truncate_page(inode, offset, len, 0);
+   truncated_block = true;
+   ret = btrfs_truncate_block(inode, offset, len, 0);
} else {
ret = 0;
}
goto out_only_mutex;
}
 
-   /* zero back part of the first page */
+   /* zero back part of the first block */
if (offset < ino_size) {
-   truncated_page = true;
-   ret = btrfs_truncate_page(inode, offset, 0, 0);
+   truncated_block = true;
+   ret = btrfs_truncate_block(inode, offset, 0, 0);
if (ret) {
mutex_unlock(&inode->i_mutex);
return ret;
@@ -2370,9 +2369,10 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (!ret) {
/* zero the front end of the last page */
if (tail_start + tail_len < ino_size) {
-   truncated_page = true;
-   ret = btrfs_truncate_page(inode,
-   tail_start + tail_len, 0, 1);
+   truncated_block = true;
+   ret = btrfs_truncate_block(inode,
+   tail_start + tail_len,
+   0, 1);
if (ret)
goto out_only_mutex;
}
@@ -2539,7 +2539,7 @@ out:
unlock_extent_cached(&BTRFS_I(inode)->io_tree, 

[PATCH V6 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache

2015-10-18 Thread Chandan Rajendra
When reading the page from disk, we can race with direct I/O, which can take
the page lock (before prepare_uptodate_page() gets it) and go ahead and
invalidate the page. Hence, if the page is not found in the inode's address
space, retry the operation of getting a page.

Reported-by: Jakub Palider 
Reviewed-by: Josef Bacik 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index bde222b..ded7a93 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1316,6 +1316,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
int faili;
 
for (i = 0; i < num_pages; i++) {
+again:
pages[i] = find_or_create_page(inode->i_mapping, index + i,
   mask | __GFP_WRITE);
if (!pages[i]) {
@@ -1330,6 +1331,21 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
if (i == num_pages - 1)
err = prepare_uptodate_page(pages[i],
pos + write_bytes, false);
+
+   /*
+* When reading the page from the disk, we can race
+* with direct i/o which can get the page lock (before
+* prepare_uptodate_page() gets it) and can go ahead
+* and invalidate the page. Hence if the page is found
+* to be not belonging to the inode's address space,
+* retry the operation of getting a page.
+*/
+   if (unlikely(pages[i]->mapping != inode->i_mapping)) {
+   unlock_page(pages[i]);
+   page_cache_release(pages[i]);
+   goto again;
+   }
+
if (err) {
page_cache_release(pages[i]);
faili = i - 1;
-- 
2.1.0



[PATCH V6 06/13] Btrfs: Search for all ordered extents that could span across a page

2015-10-18 Thread Chandan Rajendra
In the subpagesize-blocksize scenario it is not sufficient to search using
only the first byte of the page to make sure that there are no ordered
extents present across the page. Fix this.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c |  3 ++-
 fs/btrfs/inode.c | 25 ++---
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 11aa8f7..0ee486a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3224,7 +3224,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 
while (1) {
lock_extent(tree, start, end);
-   ordered = btrfs_lookup_ordered_extent(inode, start);
+   ordered = btrfs_lookup_ordered_range(inode, start,
+   PAGE_CACHE_SIZE);
if (!ordered)
break;
unlock_extent(tree, start, end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 294c503..8a79e53 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1975,7 +1975,8 @@ again:
if (PagePrivate2(page))
goto out;
 
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start,
+   PAGE_CACHE_SIZE);
if (ordered) {
unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
 page_end, &cached_state, GFP_NOFS);
@@ -8554,6 +8555,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
struct extent_state *cached_state = NULL;
u64 page_start = page_offset(page);
u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
+   u64 start;
+   u64 end;
int inode_evicting = inode->i_state & I_FREEING;
 
/*
@@ -8573,14 +8576,18 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
if (!inode_evicting)
lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+again:
+   start = page_start;
+   ordered = btrfs_lookup_ordered_range(inode, start,
+   page_end - start + 1);
if (ordered) {
+   end = min(page_end, ordered->file_offset + ordered->len - 1);
/*
 * IO on this page will never be started, so we need
 * to account for any ordered extents now
 */
if (!inode_evicting)
-   clear_extent_bit(tree, page_start, page_end,
+   clear_extent_bit(tree, start, end,
 EXTENT_DIRTY | EXTENT_DELALLOC |
 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 EXTENT_DEFRAG, 1, 0, &cached_state,
@@ -8597,22 +8604,26 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
spin_lock_irq(&ordered->lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-   new_len = page_start - ordered->file_offset;
+   new_len = start - ordered->file_offset;
if (new_len < ordered->truncated_len)
ordered->truncated_len = new_len;
spin_unlock_irq(&ordered->lock);
 
if (btrfs_dec_test_ordered_pending(inode, &ordered,
-  page_start,
-  PAGE_CACHE_SIZE, 1))
+  start,
+  end - start + 1, 1))
btrfs_finish_ordered_io(ordered);
}
btrfs_put_ordered_extent(ordered);
if (!inode_evicting) {
cached_state = NULL;
-   lock_extent_bits(tree, page_start, page_end, 0,
+   lock_extent_bits(tree, start, end, 0,
 &cached_state);
}
+
+   start = end + 1;
+   if (start < page_end)
+   goto again;
}
 
if (!inode_evicting) {
-- 
2.1.0



[PATCH V6 08/13] Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length

2015-10-18 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, map_length can be less than the length
of a bio vector. Such a condition may cause btrfs_submit_direct_hook() to
submit a zero length bio. Fix this by comparing map_length against the block
size rather than against bv_len.
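A sketch of the arithmetic behind the fix (illustrative names and a fixed 4K block size, not the kernel code): a bio vector of bv_len bytes covers bv_len / blocksize blocks, and the submission check must compare map_length against one block rather than the whole vector, or a vector straddling the mapping boundary yields a zero-length bio.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLOCKSIZE 4096u	/* 4K fs blocks, smaller than the page */

/* How many fs blocks one bio vector spans (mirrors the per-vector
 * nr_sectors count the patch introduces). */
static inline unsigned bvec_nr_sectors(unsigned bv_len)
{
	return bv_len / DEMO_BLOCKSIZE;
}

/* Block-granular split check: submit the current bio and start a new
 * one when even a single extra block would exceed map_length. */
static inline int must_split(uint64_t map_length, uint64_t submit_len)
{
	return map_length < submit_len + DEMO_BLOCKSIZE;
}
```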

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8a79e53..1826603 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8150,9 +8150,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
u64 file_offset = dip->logical_offset;
u64 submit_len = 0;
u64 map_length;
-   int nr_pages = 0;
-   int ret;
+   u32 blocksize = root->sectorsize;
int async_submit = 0;
+   int nr_sectors;
+   int ret;
+   int i;
 
map_length = orig_bio->bi_iter.bi_size;
ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -8182,9 +8184,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
atomic_inc(&dip->pending_bios);
 
while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
-   if (map_length < submit_len + bvec->bv_len ||
-   bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-bvec->bv_offset) < bvec->bv_len) {
+   nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info, bvec->bv_len);
+   i = 0;
+next_block:
+   if (unlikely(map_length < submit_len + blocksize ||
+   bio_add_page(bio, bvec->bv_page, blocksize,
+   bvec->bv_offset + (i * blocksize)) < blocksize)) {
/*
 * inc the count before we submit the bio so
 * we know the end IO handler won't happen before
@@ -8205,7 +8210,6 @@ static int btrfs_submit_direct_hook(int rw, struct 
btrfs_dio_private *dip,
file_offset += submit_len;
 
submit_len = 0;
-   nr_pages = 0;
 
bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
  start_sector, GFP_NOFS);
@@ -8223,9 +8227,14 @@ static int btrfs_submit_direct_hook(int rw, struct 
btrfs_dio_private *dip,
bio_put(bio);
goto out_err;
}
+
+   goto next_block;
} else {
-   submit_len += bvec->bv_len;
-   nr_pages++;
+   submit_len += blocksize;
+   if (--nr_sectors) {
+   i++;
+   goto next_block;
+   }
bvec++;
}
}
-- 
2.1.0



[PATCH V6 11/13] Btrfs: Clean pte corresponding to page straddling i_size

2015-10-18 Thread Chandan Rajendra
When extending a file either by "truncate up" or by writing beyond i_size, the
page which contained i_size needs to be marked "read only" so that future writes
to the page via the mmap interface cause btrfs_page_mkwrite() to be invoked. If
not, a write performed after extending the file via the mmap interface will find
the page writeable and continue writing to it without invoking
btrfs_page_mkwrite(), i.e. we end up writing to a file without reserving disk
space.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file.c  | 12 ++--
 fs/btrfs/inode.c |  2 +-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cb82bb6..bde222b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1759,6 +1759,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t err;
loff_t pos;
size_t count;
+   loff_t oldsize;
+   int clean_page = 0;
 
	mutex_lock(&inode->i_mutex);
err = generic_write_checks(iocb, from);
@@ -1797,14 +1799,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
pos = iocb->ki_pos;
count = iov_iter_count(from);
start_pos = round_down(pos, root->sectorsize);
-   if (start_pos > i_size_read(inode)) {
+   oldsize = i_size_read(inode);
+   if (start_pos > oldsize) {
/* Expand hole size to cover write data, preventing empty gap */
end_pos = round_up(pos + count, root->sectorsize);
-   err = btrfs_cont_expand(inode, i_size_read(inode), end_pos);
+   err = btrfs_cont_expand(inode, oldsize, end_pos);
if (err) {
			mutex_unlock(&inode->i_mutex);
goto out;
}
+   if (start_pos > round_up(oldsize, root->sectorsize))
+   clean_page = 1;
}
 
if (sync)
@@ -1816,6 +1821,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
num_written = __btrfs_buffered_write(file, from, pos);
if (num_written > 0)
iocb->ki_pos = pos + num_written;
+   if (clean_page)
+   pagecache_isize_extended(inode, oldsize,
+   i_size_read(inode));
}
 
	mutex_unlock(&inode->i_mutex);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6d145ed..b9e8494 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4853,7 +4853,6 @@ static int btrfs_setsize(struct inode *inode, struct 
iattr *attr)
}
 
if (newsize > oldsize) {
-   truncate_pagecache(inode, newsize);
/*
 * Don't do an expanding truncate while snapshoting is ongoing.
 * This is to ensure the snapshot captures a fully consistent
@@ -4876,6 +4875,7 @@ static int btrfs_setsize(struct inode *inode, struct 
iattr *attr)
 
i_size_write(inode, newsize);
btrfs_ordered_update_i_size(inode, i_size_read(inode), NULL);
+   pagecache_isize_extended(inode, oldsize, newsize);
ret = btrfs_update_inode(trans, root, inode);
btrfs_end_write_no_snapshoting(root);
btrfs_end_transaction(trans, root);
-- 
2.1.0



[PATCH V6 10/13] Btrfs: Fix block size returned to user space

2015-10-18 Thread Chandan Rajendra
btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
generic_fillattr() already does the right thing (by obtaining block size
from inode->i_blkbits), just remove the statement from btrfs_getattr.

Reviewed-by: Josef Bacik 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c0c83f1..6d145ed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9199,7 +9199,6 @@ static int btrfs_getattr(struct vfsmount *mnt,
 
generic_fillattr(inode, stat);
stat->dev = BTRFS_I(inode)->root->anon_dev;
-   stat->blksize = PAGE_CACHE_SIZE;
 
	spin_lock(&BTRFS_I(inode)->lock);
delalloc_bytes = BTRFS_I(inode)->delalloc_bytes;
-- 
2.1.0



[PATCH V6 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size

2015-10-18 Thread Chandan Rajendra
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this by doing reservations/releases in block-size units.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.h |  3 +++
 fs/btrfs/file.c  | 46 +-
 2 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..74a1439 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2236,6 +2236,9 @@ struct btrfs_map_token {
unsigned long offset;
 };
 
+#define BTRFS_BYTES_TO_BLKS(fs_info, bytes) \
+   ((bytes) >> (fs_info)->sb->s_blocksize_bits)
+
 static inline void btrfs_init_map_token (struct btrfs_map_token *token)
 {
token->kaddr = NULL;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..e16cb40 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -499,7 +499,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode 
*inode,
loff_t isize = i_size_read(inode);
 
start_pos = pos & ~((u64)root->sectorsize - 1);
-   num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
+   num_bytes = round_up(write_bytes + pos - start_pos, root->sectorsize);
 
end_of_last_block = start_pos + num_bytes - 1;
err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
@@ -1362,16 +1362,19 @@ fail:
 static noinline int
 lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
size_t num_pages, loff_t pos,
+   size_t write_bytes,
u64 *lockstart, u64 *lockend,
struct extent_state **cached_state)
 {
+   struct btrfs_root *root = BTRFS_I(inode)->root;
u64 start_pos;
u64 last_pos;
int i;
int ret = 0;
 
-   start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
-   last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+   start_pos = round_down(pos, root->sectorsize);
+   last_pos = start_pos
+   + round_up(pos + write_bytes - start_pos, root->sectorsize) - 1;
 
if (start_pos < inode->i_size) {
struct btrfs_ordered_extent *ordered;
@@ -1489,6 +1492,7 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
 
while (iov_iter_count(i) > 0) {
size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+   size_t sector_offset;
size_t write_bytes = min(iov_iter_count(i),
 nrptrs * (size_t)PAGE_CACHE_SIZE -
 offset);
@@ -1497,6 +1501,8 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
size_t reserve_bytes;
size_t dirty_pages;
size_t copied;
+   size_t dirty_sectors;
+   size_t num_sectors;
 
WARN_ON(num_pages > nrptrs);
 
@@ -1509,8 +1515,12 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
break;
}
 
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+   sector_offset = pos & (root->sectorsize - 1);
+   reserve_bytes = round_up(write_bytes + sector_offset,
+   root->sectorsize);
+
ret = btrfs_check_data_free_space(inode, reserve_bytes, 
write_bytes);
+
if (ret == -ENOSPC &&
(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
  BTRFS_INODE_PREALLOC))) {
@@ -1523,7 +1533,10 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
 */
num_pages = DIV_ROUND_UP(write_bytes + offset,
 PAGE_CACHE_SIZE);
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+   reserve_bytes = round_up(write_bytes
+   + sector_offset,
+   root->sectorsize);
+
ret = 0;
} else {
ret = -ENOSPC;
@@ -1558,8 +1571,8 @@ again:
break;
 
ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
-						  pos, &lockstart, &lockend,
-						  &cached_state);
+						pos, write_bytes, &lockstart,
+						&lockend, &cached_state);
if (ret < 0) {
if (ret == -EAGAIN)
goto again;
@@ -1595,9 +1608,16 @@ again:
 * we still have an outstanding extent for the 

[PATCH V6 09/13] Btrfs: Limit inline extents to root->sectorsize

2015-10-18 Thread Chandan Rajendra
cow_file_range_inline() limits the size of an inline extent to
PAGE_CACHE_SIZE. This breaks in subpagesize-blocksize scenarios. Fix this by
comparing against root->sectorsize.

Reviewed-by: Josef Bacik 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1826603..c0c83f1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -257,7 +257,7 @@ static noinline int cow_file_range_inline(struct btrfs_root 
*root,
data_len = compressed_size;
 
if (start > 0 ||
-   actual_end > PAGE_CACHE_SIZE ||
+   actual_end > root->sectorsize ||
data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
(!compressed_size &&
(actual_end & (root->sectorsize - 1)) == 0) ||
-- 
2.1.0



[PATCH V6 03/13] Btrfs: Direct I/O read: Work on sectorsized blocks

2015-10-18 Thread Chandan Rajendra
The direct I/O read's endio and corresponding repair functions work on
page-sized blocks. This commit adds the ability for direct I/O reads to work on
subpage-sized blocks.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 98 +++-
 1 file changed, 75 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b7e439b..3aded4c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7664,9 +7664,9 @@ static int btrfs_check_dio_repairable(struct inode *inode,
 }
 
 static int dio_read_error(struct inode *inode, struct bio *failed_bio,
- struct page *page, u64 start, u64 end,
- int failed_mirror, bio_end_io_t *repair_endio,
- void *repair_arg)
+   struct page *page, unsigned int pgoff,
+   u64 start, u64 end, int failed_mirror,
+   bio_end_io_t *repair_endio, void *repair_arg)
 {
struct io_failure_record *failrec;
struct bio *bio;
@@ -7687,7 +7687,9 @@ static int dio_read_error(struct inode *inode, struct bio 
*failed_bio,
return -EIO;
}
 
-   if (failed_bio->bi_vcnt > 1)
+   if ((failed_bio->bi_vcnt > 1)
+   || (failed_bio->bi_io_vec->bv_len
+   > BTRFS_I(inode)->root->sectorsize))
read_mode = READ_SYNC | REQ_FAILFAST_DEV;
else
read_mode = READ_SYNC;
@@ -7695,7 +7697,7 @@ static int dio_read_error(struct inode *inode, struct bio 
*failed_bio,
isector = start - btrfs_io_bio(failed_bio)->logical;
isector >>= inode->i_sb->s_blocksize_bits;
bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
- 0, isector, repair_endio, repair_arg);
+   pgoff, isector, repair_endio, repair_arg);
if (!bio) {
free_io_failure(inode, failrec);
return -EIO;
@@ -7725,12 +7727,17 @@ struct btrfs_retry_complete {
 static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
 {
struct btrfs_retry_complete *done = bio->bi_private;
+   struct inode *inode;
struct bio_vec *bvec;
int i;
 
if (err)
goto end;
 
+   ASSERT(bio->bi_vcnt == 1);
+   inode = bio->bi_io_vec->bv_page->mapping->host;
+   ASSERT(bio->bi_io_vec->bv_len == BTRFS_I(inode)->root->sectorsize);
+
done->uptodate = 1;
bio_for_each_segment_all(bvec, bio, i)
clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
@@ -7742,25 +7749,35 @@ end:
 static int __btrfs_correct_data_nocsum(struct inode *inode,
   struct btrfs_io_bio *io_bio)
 {
+   struct btrfs_fs_info *fs_info;
struct bio_vec *bvec;
struct btrfs_retry_complete done;
u64 start;
+   unsigned int pgoff;
+   u32 sectorsize;
+   int nr_sectors;
int i;
int ret;
 
+   fs_info = BTRFS_I(inode)->root->fs_info;
+   sectorsize = BTRFS_I(inode)->root->sectorsize;
+
start = io_bio->logical;
done.inode = inode;
 
	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
-try_again:
+   nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
+   pgoff = bvec->bv_offset;
+
+next_block_or_try_again:
done.uptodate = 0;
done.start = start;
		init_completion(&done.done);
 
-		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
-				     start + bvec->bv_len - 1,
-				     io_bio->mirror_num,
-				     btrfs_retry_endio_nocsum, &done);
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
+				     pgoff, start, start + sectorsize - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio_nocsum, &done);
if (ret)
return ret;
 
@@ -7768,10 +7785,15 @@ try_again:
 
if (!done.uptodate) {
/* We might have another mirror, so try again */
-   goto try_again;
+   goto next_block_or_try_again;
}
 
-   start += bvec->bv_len;
+   start += sectorsize;
+
+   if (nr_sectors--) {
+   pgoff += sectorsize;
+   goto next_block_or_try_again;
+   }
}
 
return 0;
@@ -7781,7 +7803,9 @@ static void btrfs_retry_endio(struct bio *bio, int err)
 {
struct btrfs_retry_complete *done = bio->bi_private;
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+   struct inode *inode;
struct bio_vec *bvec;
+   u64 start;
int uptodate;
int ret;

[PATCH V6 00/13] Btrfs: Pre subpagesize-blocksize cleanups

2015-10-18 Thread Chandan Rajendra
The patches posted along with this cover letter are cleanups made
during the development of the subpagesize-blocksize patchset. I believe
that they can be integrated with the mainline kernel. Hence I have
posted them separately from the subpagesize-blocksize patchset.

I have tested the patchset by running xfstests on ppc64 and
x86_64. On ppc64, some of the Btrfs specific tests and generic/255
fail because they assume 4K as the filesystem's block size. I have
fixed some of the test cases. I will fix the rest and mail them to the
fstests mailing list in the near future.

Changes from V5:
1. Introduced BTRFS_BYTES_TO_BLKS() helper to compute the number of
   filesystem blocks spanning across a range of bytes. A call to this
   macro replaces code such as "nr_blks = bytes >> inode->i_blkbits".

Changes from V4:
1. Removed the RFC tag.

Changes from V3:
Two new issues have been fixed by the patches:
1. Btrfs: prepare_pages: Retry adding a page to the page cache.
2. Btrfs: Return valid delalloc range when the page does not have
   PG_Dirty flag set or has been invalidated.
IMHO, the above issues are also applicable to the "page size == block
size" scenario but for reasons unknown to me they aren't seen even
when the tests are run for a long time.

Changes from V2:
1. For detecting logical errors, Use ASSERT() calls instead of calls to
   BUG_ON().
2. In the patch "Btrfs: Compute and look up csums based on sectorsized
   blocks", fix usage of kmap_atomic/kunmap_atomic such that between the
   kmap_atomic() and kunmap_atomic() calls we do not invoke any function
   that might cause the current task to sleep.
   
Changes from V1:
1. Call round_[down,up]() functions instead of doing hard-coded alignment.

Chandan Rajendra (13):
  Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to
block size
  Btrfs: Compute and look up csums based on sectorsized blocks
  Btrfs: Direct I/O read: Work on sectorsized blocks
  Btrfs: fallocate: Work with sectorsized blocks
  Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
  Btrfs: Search for all ordered extents that could span across a page
  Btrfs: Use (eb->start, seq) as search key for tree modification log
  Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
  Btrfs: Limit inline extents to root->sectorsize
  Btrfs: Fix block size returned to user space
  Btrfs: Clean pte corresponding to page straddling i_size
  Btrfs: prepare_pages: Retry adding a page to the page cache
  Btrfs: Return valid delalloc range when the page does not have
PG_Dirty flag set or has been invalidated

 fs/btrfs/ctree.c |  34 
 fs/btrfs/ctree.h |   5 +-
 fs/btrfs/extent_io.c |   5 +-
 fs/btrfs/file-item.c |  92 +---
 fs/btrfs/file.c  | 118 +
 fs/btrfs/inode.c | 241 ---
 6 files changed, 335 insertions(+), 160 deletions(-)

-- 
2.1.0



[PATCH V6 02/13] Btrfs: Compute and look up csums based on sectorsized blocks

2015-10-18 Thread Chandan Rajendra
Checksums are applicable to sectorsize units. The current code uses
bio->bv_len units to compute and look up checksums. This works on machines
where sectorsize == PAGE_SIZE. This patch makes the checksum computation and
lookup code work with sectorsize units.

Reviewed-by: Liu Bo 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file-item.c | 92 +---
 1 file changed, 59 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 58ece65..e2a1cad 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u64 item_start_offset = 0;
u64 item_last_offset = 0;
u64 disk_bytenr;
+   u64 page_bytes_left;
u32 diff;
int nblocks;
int bio_index = 0;
@@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
if (dio)
offset = logical_offset;
+
+   page_bytes_left = bvec->bv_len;
while (bio_index < bio->bi_vcnt) {
if (!dio)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (BTRFS_I(inode)->root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) {
set_extent_bits(io_tree, offset,
-   offset + bvec->bv_len - 1,
+   offset + root->sectorsize - 1,
EXTENT_NODATASUM, GFP_NOFS);
} else {

btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root 
*root,
 found:
csum += count * csum_size;
nblocks -= count;
-   bio_index += count;
+
while (count--) {
-   disk_bytenr += bvec->bv_len;
-   offset += bvec->bv_len;
-   bvec++;
+   disk_bytenr += root->sectorsize;
+   offset += root->sectorsize;
+   page_bytes_left -= root->sectorsize;
+   if (!page_bytes_left) {
+   bio_index++;
+   bvec++;
+   page_bytes_left = bvec->bv_len;
+   }
+
}
}
btrfs_free_path(path);
@@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
inode *inode,
struct bio_vec *bvec = bio->bi_io_vec;
int bio_index = 0;
int index;
+   int nr_sectors;
+   int i;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;
@@ -459,41 +470,56 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
inode *inode,
if (!contig)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
-   if (offset >= ordered->file_offset + ordered->len ||
-   offset < ordered->file_offset) {
-   unsigned long bytes_left;
-   sums->len = this_sum_bytes;
-   this_sum_bytes = 0;
-   btrfs_add_ordered_sum(inode, ordered, sums);
-   btrfs_put_ordered_extent(ordered);
+   data = kmap_atomic(bvec->bv_page);
 
-   bytes_left = bio->bi_iter.bi_size - total_bytes;
+   nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info,
+   bvec->bv_len + root->sectorsize
+   - 1);
+
+   for (i = 0; i < nr_sectors; i++) {
+   if (offset >= ordered->file_offset + ordered->len ||
+   offset < ordered->file_offset) {
+   unsigned long bytes_left;
+
+   kunmap_atomic(data);
+   sums->len = this_sum_bytes;
+   this_sum_bytes = 0;
+   btrfs_add_ordered_sum(inode, ordered, sums);
+   btrfs_put_ordered_extent(ordered);
+
+   bytes_left = bio->bi_iter.bi_size - total_bytes;
+
+   sums = kzalloc(btrfs_ordered_sum_size(root, 
bytes_left),
+   GFP_NOFS);
+   BUG_ON(!sums); /* -ENOMEM */
+   sums->len = bytes_left;
+   ordered = 

Re: btrfs autodefrag?

2015-10-18 Thread Xavier Gnata



On 18/10/2015 07:46, Duncan wrote:

Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:


Hi,

On a desktop equipped with an ssd with one 100GB virtual image used
frequently, what do you recommend?
1) nothing special, it is all fine as long as you have a recent kernel
(which I do)
2) Disabling copy-on-write for just the VM image directory.
3) autodefrag as a mount option.
4) something else.

I don't think this usecase is well documented therefore I asked this
question.


You are correct.  The VM images on ssd use-case /isn't/ particularly well
documented, I'd guess because people have differing opinions, and,
indeed, actual observed behavior, and thus recommendations even in the
ideal case, may well be different depending on the specs and firmware of
the ssd.  The documentation tends to be aimed at the spinning rust case.

There's one detail of the use-case (besides ssd specs), however, that you
didn't mention, that could have a big impact on the recommendation.  What
sort of btrfs snapshotting are you planning to do, and if you're doing
snapshots, does your use-case really need them to include the VM image
file?

Snapshots are a big issue for anything that you might set nocow, because
snapshot functionality assumes and requires cow, and thus conflicts, to
some extent, with nocow.  A snapshot locks in place the existing extents,
so they can no longer be modified.  On a normal btrfs cow-based file,
that's not an issue, since any modifications would be cowed elsewhere
anyway -- that's how btrfs normally works.  On a nocow file, however,
there's a problem, because once the snapshot locks in place the existing
version, the first change to a specific block (normally 4 KiB) *MUST* be
cowed, despite the nocow attribute, because to rewrite in-place would
alter the snapshot.  The nocow attribute remains in place, however, and
further writes to the same block will again be nocow... to the new block
location established by that first post-snapshot write... until the next
snapshot comes along and locks that too in-place, of course.  This sort
of cow-only-once behavior is sometimes called cow1.

If you only do very occasional snapshots, probably manually, this cow1
behavior isn't /so/ bad, tho the file will still fragment over time as
more and more bits of it are written and rewritten after the few
snapshots that are taken.  However, for people doing frequent, generally
schedule-automated snapshots, the nocow attribute is effectively
nullified as all those snapshots force cow1s over and over again.

So ssd or spinning rust, there's serious conflicts between nocow and
snapshotting that really must be taken into consideration if you're
planning to both snapshot and nocow.

For use-cases that don't require snapshotting of the nocow files, the
simplest workaround is to put any nocow files on dedicated subvolumes.
Since snapshots stop at subvolume boundaries, having nocow files on
dedicated subvolume(s) stops snapshots of the parent from including them,
thus avoiding the cow1 situation entirely.

If the use-case requires snapshotting of nocow files, the workaround that
has been reported (mostly on spinning rust, where fragmentation is a far
worse problem due to non-zero seek-times) to work is first to reduce
snapshotting to a minimum -- if it was going to be hourly, consider daily
or every 12 hours, if you can get away with it, if it was going to be
daily, consider every other day or weekly.  Less snapshotting means less
cow1s and thus directly affects how quickly fragmentation becomes a
problem.  Again, dedicated subvolumes can help here, allowing you to
snapshot the nocow files on a different schedule than you do the up-
hierarchy parent subvolume.  Second, schedule periodic manual defrags of
the nocow files, so the fragmentation that does occur is at least kept
manageable.  If the snapshotting is daily, consider weekly or monthly
defrags.  If it's weekly, consider monthly or quarterly defrags.  Again,
various people who do need to snapshot their nocow files have reported
that this really does help, keeping fragmentation to at least some sanely
managed level.

That's the snapshot vs. nocow problem in general.  With luck, however,
you can avoid snapshotting the files in question entirely, thus factoring
this issue out of the equation entirely.

Now to the ssd issue.

On ssds in general, there are two very major differences we need to
consider vs. spinning rust.  One, fragmentation isn't as much of a
problem as it is on spinning rust.  It's still worth keeping to a
minimum, because as the number of fragments increases, so does both btrfs
and device overhead, but it's not the nearly everything-overriding
consideration that it is on spinning rust.

Two, ssds have a limited write-cycle factor to consider, where with
spinning rust the write-cycle limit is effectively infinite... at least
compared to the much lower limit of ssds.

The weighing of these two overriding ssd factors one against the other,
along 

Re: "free_raid_bio" crash on RAID6

2015-10-18 Thread Philip Seeger

Hi Tobias

On 07/20/2015 06:20 PM, Tobias Holst wrote:

My btrfs-RAID6 seems to be broken again :(

When reading from it I get several of these:
[  176.349943] BTRFS info (device dm-4): csum failed ino 1287707
extent 21274957705216 csum 2830458701 wanted 426660650 mirror 2

then followed by a "free_raid_bio"-crash:

[  176.349961] [ cut here ]
[  176.349981] WARNING: CPU: 6 PID: 110 at
/home/kernel/COD/linux/fs/btrfs/raid56.c:831
__free_raid_bio+0xfc/0x130 [btrfs]()
...


It's been 3 months now, have you ever figured this out? Do you know if 
the bug has been identified and fixed or have you filed a bugzilla report?



One drive is broken, so at the moment it is mounted with "-O
defaults,ro,degraded,recovery,compress=lzo,space_cache,subvol=raid".


Did you try removing the bad drive and did the system keep crashing anyway?



Philip


Re: btrfs autodefrag?

2015-10-18 Thread Hugo Mills
On Sun, Oct 18, 2015 at 10:24:39AM -0400, Rich Freeman wrote:
> On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata  wrote:
> > 2) Disabling copy-on-write for just the VM image directory.
> 
> Unless this has changed, doing this will also disable checksumming.  I
> don't see any reason why it has to, but it does.  So, I avoid using
> this at all costs.

   It has to be disabled because if you enable it, there's a race
condition: since you're overwriting existing data (rather than CoWing
it), you can't update the checksums atomically. So, in the interests
of consistency, checksums are disabled.

   Hugo.

-- 
Hugo Mills | Nothing wrong with being written in Perl... Some of
hugo@... carfax.org.uk | my best friends are written in Perl.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  dark




Re: btrfs autodefrag?

2015-10-18 Thread Rich Freeman
On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata  wrote:
> 2) Disabling copy-on-write for just the VM image directory.

Unless this has changed, doing this will also disable checksumming.  I
don't see any reason why it has to, but it does.  So, I avoid using
this at all costs.

--
Rich


potential integer truncation issues on 32 bit archs

2015-10-18 Thread PaX Team
Hi everyone,

while investigating some integer related error reports we got, i ran across
some code in both the VFS/MM and btrfs that i think raise a more generic 
problem.

in particular, when converting a file offset to a page cache index, the 
canonical
type of the latter is usually pgoff_t, typedef'ed to unsigned long (but i saw
unsigned long used directly as well). this can be problematic if the VFS or any
file system wants to support files over 16TB (say on i386) since after a shift
by PAGE_CACHE_SHIFT some MSBs will be lost on 32 bit archs (now this may not be
a supported use case but at least btrfs doesn't seem to exclude it). another
trigger seems to be vfs_fsync that passes LLONG_MAX and which can end up 
converted
to a page cache index (truncated on 32 bit archs).

my basic question is whether this is considered an actual problem and whether
there are already measures at some higher abstraction levels to prevent such
integer truncations and the use of incorrect page cache indices.

cheers,
 PaX Team

PS: two random examples:
mm/filemap.c:filemap_fdatawait_range 
fs/btrfs/extent_io.c:extent_range_clear_dirty_for_io
