Re: Btrfs and fanotify filesystem watches

2018-11-27 Thread Jan Kara
On Fri 23-11-18 19:53:11, Amir Goldstein wrote:
> On Fri, Nov 23, 2018 at 3:34 PM Amir Goldstein  wrote:
> > > So open_by_handle() should work fine even if we get mount_fd of /subvol1
> > > and handle for inode on /subvol2. mount_fd in open_by_handle() is really
> > > only used to get the superblock and that is the same.
> >
> > I don't think it will work fine.
> > do_handle_to_path() will compose a path from the /subvol1 mnt and a
> > dentry from /subvol2. This may resolve to a full path that does not
> > really exist, so the application cannot match it to anything it can use
> > name_to_handle_at() to identify.
> >
> > The whole thing just sounds too messy. At the very least we need more time
> > to communicate with btrfs developers and figure this out, so I am going to
> > go with -EXDEV for any attempt to set *any* mark on a group with
> > FAN_REPORT_FID if fsid of fanotify_mark() path argument is different
> > from fsid of path->dentry->d_sb->s_root.
> >
> > We can relax that later if we figure out a better way.
> >
> > BTW, I am also going to go with -ENODEV for zero fsid (e.g. tmpfs).
> > tmpfs can be easily fixed to have a non-zero fsid if a filesystem watch
> > on tmpfs is important.
> >
> 
> Well, this is interesting... I have implemented the -EXDEV logic and it
> works as expected. I can set a filesystem global watch on the main
> btrfs mount, but am not allowed to set a global watch on a subvolume.
> 
> The interesting part is that the global watch on the main btrfs volume
> is more useful than I thought it would be. The file handles reported by
> the main volume global watch are resolved to correct paths in the main
> volume. I guess this is because a btrfs subvolume looks like a directory
> tree in the global namespace to the VFS. See below.
> 
> So I will continue based on this working POC:
> https://github.com/amir73il/linux/commits/fanotify_fid
> 
> Note that in the POC, fsid is cached in the mark connector as you
> suggested. It is still buggy, but my prototype always decodes file
> handles from the first path argument given to the program, so it just
> goes to show that by getting the fsid of the main btrfs volume, the
> application will be able to properly decode file handles and resolve
> correct paths.
> 
> The bottom line is that btrfs will have the full functionality of super block
> monitoring with no ability to watch (or ignore) a single subvolume
> (unless by using a mount mark).

Sounds good. I'll check the new version of your series.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Btrfs and fanotify filesystem watches

2018-11-23 Thread Jan Kara
Changed subject to better match what we are discussing and added the btrfs list to CC.

On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> On Thu, Nov 22, 2018 at 3:26 PM Jan Kara  wrote:
> >
> > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is associated
> > > > > with the single super block struct, so all dentries in all subvolumes
> > > > > will return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > >
> > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > >
> > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > > > /* Mask in the root object ID too, to disambiguate subvols */
> > > > buf->f_fsid.val[0] ^=
> > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > > > buf->f_fsid.val[1] ^=
> > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > >
> > > > So subvolume root is xored into the FSID. Thus dentries from different
> > > > subvolumes are going to return different fsids...
> > > >
> > >
> > > Right... how could I have missed that :-/
> > >
> > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > Instead, I will use:
> > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->s_root, ...)
> >
> > So what about my proposal to store fsid in the notification mark when it is
> > created and then use it when that mark results in event being generated?
> > When mark is created, we have full path available, so getting proper fsid
> > is trivial. Furthermore if the behavior is documented like that, it is
> > fairly easy for userspace to track fsids it should care about and translate
> > them to proper file descriptors for open_by_handle().
> >
> > This would get rid of statfs() on every event creation (which I don't like
> > much) and also avoids these problems "how to get fsid for inode". What do
> > you think?
> >
> 
> That's interesting. I like the simplicity.
> What happens when there are 2 btrfs subvols /subvol1 /subvol2
> and the application obviously doesn't know about this and does:
> fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> statfs("/subvol1",...);
> fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> statfs("/subvol2",...);
> 
> Now the second fanotify_mark() call just updates the existing super block
> mark mask, but does not change the fsid on the mark, so all events will have
> fsid of subvol1 that was stored when first creating the mark.

Yes.

> fanotify-watch application (for example) hashes the watches (paths) under
> /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> "watch" (i.e. path).

I agree this can be confusing... but with btrfs, fanotify-watch will be
confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
on /subvol1 (with fsid A) is also going to return events on inodes from
/subvol2 (with fsid B). So your current code will return an event with fsid
B, which fanotify-watch has no way to match back, and it can get confused.

So currently the application can get events with an fsid it has never seen;
with the code as I suggest, it can get a "wrong" fsid. That is confusing but
still somewhat better?

The core of the problem is that btrfs does not have "the superblock
identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
that we could use.

> And when trying to open_by_handle fid with fhandle from /subvol2
> using mount_fd of /subvol1, I suppose we can either get ESTALE
> or a disconnected dentry, because object from /subvol2 cannot
> have a path inside /subvol1.

So open_by_handle() should work fine even if we get mount_fd of /subvol1
and handle for inode on /subvol2. mount_fd in open_by_handle() is really
only used to get the superblock and that is the same.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC][PATCH v4 09/09] btrfs: use common file type conversion

2018-11-22 Thread Jan Kara
On Wed 21-11-18 19:07:06, Phillip Potter wrote:
> Deduplicate the btrfs file type conversion implementation - file systems
> that use the same file types as defined by POSIX do not need to define
> their own versions and can use the common helper functions declared in
> fs_types.h and implemented in fs_types.c.
> 
> Acked-by: David Sterba 
> Signed-off-by: Amir Goldstein 
> Signed-off-by: Phillip Potter 

The patch looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/btrfs/btrfs_inode.h  |  2 --
>  fs/btrfs/delayed-inode.c|  2 +-
>  fs/btrfs/inode.c| 32 +++-
>  include/uapi/linux/btrfs_tree.h |  2 ++
>  4 files changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 97d91e55b70a..bb01c804485f 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -196,8 +196,6 @@ struct btrfs_inode {
>   struct inode vfs_inode;
>  };
>  
> -extern unsigned char btrfs_filetype_table[];
> -
>  static inline struct btrfs_inode *BTRFS_I(const struct inode *inode)
>  {
>   return container_of(inode, struct btrfs_inode, vfs_inode);
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index c669f250d4a0..e61947f5eb76 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1692,7 +1692,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
>   name = (char *)(di + 1);
>   name_len = btrfs_stack_dir_name_len(di);
>  
> - d_type = btrfs_filetype_table[di->type];
> + d_type = fs_ftype_to_dtype(di->type);
>   btrfs_disk_key_to_cpu(&location, &di->location);
>  
>   over = !dir_emit(ctx, name, name_len,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9ea4c6f0352f..8b7b1b29e2ad 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -72,17 +72,6 @@ struct kmem_cache *btrfs_trans_handle_cachep;
>  struct kmem_cache *btrfs_path_cachep;
>  struct kmem_cache *btrfs_free_space_cachep;
>  
> -#define S_SHIFT 12
> -static const unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = {
> - [S_IFREG >> S_SHIFT]= BTRFS_FT_REG_FILE,
> - [S_IFDIR >> S_SHIFT]= BTRFS_FT_DIR,
> - [S_IFCHR >> S_SHIFT]= BTRFS_FT_CHRDEV,
> - [S_IFBLK >> S_SHIFT]= BTRFS_FT_BLKDEV,
> - [S_IFIFO >> S_SHIFT]= BTRFS_FT_FIFO,
> - [S_IFSOCK >> S_SHIFT]   = BTRFS_FT_SOCK,
> - [S_IFLNK >> S_SHIFT]= BTRFS_FT_SYMLINK,
> -};
> -
>  static int btrfs_setsize(struct inode *inode, struct iattr *attr);
>  static int btrfs_truncate(struct inode *inode, bool skip_writeback);
>  static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
> @@ -5793,10 +5782,6 @@ static struct dentry *btrfs_lookup(struct inode *dir, struct dentry *dentry,
>   return d_splice_alias(inode, dentry);
>  }
>  
> -unsigned char btrfs_filetype_table[] = {
> - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
> -};
> -
>  /*
>   * All this infrastructure exists because dir_emit can fault, and we are 
> holding
>   * the tree lock when doing readdir.  For now just allocate a buffer and copy
> @@ -5935,7 +5920,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
>   name_ptr = (char *)(entry + 1);
>   read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1),
>  name_len);
> - put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)],
> + put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)),
>   &entry->type);
>   btrfs_dir_item_key_to_cpu(leaf, di, &location);
>   put_unaligned(location.objectid, &entry->ino);
> @@ -6340,7 +6325,20 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>  
>  static inline u8 btrfs_inode_type(struct inode *inode)
>  {
> - return btrfs_type_by_mode[(inode->i_mode & S_IFMT) >> S_SHIFT];
> + /*
> +  * compile-time asserts that generic FT_x types still match
> +  * BTRFS_FT_x types
> +  */
> + BUILD_BUG_ON(BTRFS_FT_UNKNOWN != FT_UNKNOWN);
> + BUILD_BUG_ON(BTRFS_FT_REG_FILE != FT_REG_FILE);
> + BUILD_BUG_ON(BTRFS_FT_DIR != FT_DIR);
> + BUILD_BUG_ON(BTRFS_FT_CHRDEV != FT_CHRDEV);
> + BUILD_BUG_ON(BTRFS_FT_BLKDEV != FT_BLKDEV);
> + BUILD_BUG_ON(BTRFS_FT_FIFO != FT_FIFO);
> + BUILD_BUG_ON(BTRFS_FT_SOCK != FT_SOCK);
> + BUILD_BUG_ON(BTRF

Re: [RFC v2 3/4] ext4: add verifier check for symlink with append/immutable flags

2018-05-11 Thread Jan Kara
On Thu 10-05-18 16:13:58, Luis R. Rodriguez wrote:
> The Linux VFS does not allow a way to set append/immutable
> attributes on symlinks; this is just not possible. If this is
> detected, inform the user, as the filesystem must be corrupted.
> 
> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/ext4/inode.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 37a2f7a2b66a..6acf0dd6b6e6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4947,6 +4947,13 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>   inode->i_op = &ext4_dir_inode_operations;
>   inode->i_fop = &ext4_dir_operations;
>   } else if (S_ISLNK(inode->i_mode)) {
> + /* VFS does not allow setting these so must be corruption */
> + if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
> + EXT4_ERROR_INODE(inode,
> +   "immutable or append flags not allowed on symlinks");
> + ret = -EFSCORRUPTED;
> + goto bad_inode;
> + }
>   if (ext4_encrypted_inode(inode)) {
>   inode->i_op = &ext4_encrypted_symlink_inode_operations;
>   ext4_set_aops(inode);
> -- 
> 2.17.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion

2018-01-09 Thread Jan Kara
On Tue 09-01-18 09:10:42, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> The rationale for taking the i_lock when incrementing this value is
> lost in antiquity. The readers of the field don't take it (at least
> not universally), so my assumption is that it was only done here to
> serialize incrementors.
> 
> If that is indeed the case, then we can drop the i_lock from this
> codepath and treat it as an atomic64_t for the purposes of
> incrementing it. This allows us to use inode_inc_iversion without
> any danger of lock inversion.
> 
> Note that the read side is not fetched atomically with this change.
> The assumption here is that that is not a critical issue since the
> i_version is not fully synchronized with anything else anyway.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

This changes the memory barrier behavior but IMO it is good enough for an
intermediate version. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/iversion.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index d09cc3a08740..5ad9eaa3a9b0 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new)
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> - spin_lock(&inode->i_lock);
> - inode->i_version++;
> - spin_unlock(&inode->i_lock);
> + atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> +
> + atomic64_inc(ivp);
>   return true;
>  }
>  
> +
>  /**
>   * inode_inc_iversion - forcibly increment i_version
>   * @inode: inode that needs to be updated
> -- 
> 2.14.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-04 Thread Jan Kara
On Thu 04-01-18 12:32:07, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> > On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > > I think we could probably block ->write_metadata if necessary via a
> > > completion/wakeup style notification when a specific LSN is reached
> > > by the log tail, but realistically if there's any amount of data
> > > needing to be written it'll throttle data writes because the IO
> > > pipeline is being kept full by background metadata writes
> > 
> > So the problem I'm concerned about is a corner case. Consider a situation
> > when you have no dirty data, only dirty metadata but enough of them to
> > trigger background writeback. How should metadata writeback behave for XFS
> > in this case? Who should be responsible for making sure wb_writeback()
> > does not just loop invoking ->write_metadata() as fast as the CPU allows
> > until xfsaild makes enough progress?
> >
> > Thinking about this today, I think this looping prevention belongs to
> > wb_writeback().
> 
> Well, background data writeback can block in two ways. One is during
> IO submission when the request queue is full, the other is when all
> dirty inodes have had some work done on them and have all been moved
> to b_more_io - wb_writeback waits for the __I_SYNC bit to be cleared
> on the last(?) inode on that list, hence backing off before
> submitting more IO.
> 
> IOWs, there's a "during writeback" blocking mechanism as well as a
> "between cycles" blocking mechanism.
> 
> > Sadly we don't have much info to decide how long to sleep before trying
> > more writeback, so we'd have to just sleep for a while if we found no
> > writeback happened in the last writeback round before going through the
> > whole writeback loop again.
> 
> Right - I don't think we can provide a generic "between cycles"
> blocking mechanism for XFS, but I'm pretty sure we can emulate a
> "during writeback" blocking mechanism to avoid busy looping inside
> the XFS code.
> 
> e.g. if we get a writeback call that asks for 5% to be written,
> and we already have a metadata writeback target of 5% in place,
> that means we should block for a while. That would emulate request
> queue blocking and prevent busy looping in this case

If you can do this in XFS then fine, it saves some mess in the generic
code.

> > And
> > ->write_metadata() for XFS would need to always return 0 (as in "no progress
> > made") to make sure this busyloop avoidance logic in wb_writeback()
> > triggers. ext4 and btrfs would return number of bytes written from
> > ->write_metadata (or just 1 would be enough to indicate some progress in
> > metadata writeback was made and busyloop avoidance is not needed).
> 
> Well, if we block for a little while, we can indicate that progress
> has been made and this whole mess would go away, right?

Right. So let's just ignore the problem for the sake of Josef's patch set.
Once the patches land and when XFS starts using the infrastructure, we will
make sure this is handled properly.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-03 Thread Jan Kara
On Wed 03-01-18 10:49:33, Josef Bacik wrote:
> On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> > On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > > On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> > > > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > > > > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > > > > seem to me to be the right abstraction to use for this - in most
> > > > > > > > filesystems it's a bunch of objects with a complex dependency tree
> > > > > > > > and unknown write ordering, not an inode full of data that can be
> > > > > > > > sequentially written.
> > > > > > > > 
> > > > > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > > > > sync based operations. That way different filesystems can ignore the
> > > > > > > > parts they don't need simply by not implementing those operations,
> > > > > > > > and the writeback code doesn't need to try to cater for all
> > > > > > > > operations through the one op. The writeback code should be cleaner,
> > > > > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > > > > guidelines for each operation separately so there's less mismatch
> > > > > > > > between what writeback is asking and how filesystems track dirty
> > > > > > > > metadata...
> > > > > > > 
> > > > > > > I agree that writeback for memory cleaning and writeback for data
> > > > > > > integrity are two very different things, especially for metadata. In
> > > > > > > fact, for data integrity writeback we already have the ->sync_fs
> > > > > > > operation, so there the functionality gets duplicated. What we could
> > > > > > > do is that in writeback_sb_inodes() we'd call ->write_metadata only
> > > > > > > when work->for_kupdate or work->for_background is set. That way
> > > > > > > ->write_metadata would be called only for memory cleaning purposes.
> > > > > > 
> > > > > > That makes sense, but I still think we need a better indication of
> > > > > > how much writeback we need to do than just "writeback this chunk of
> > > > > > pages". That "writeback a chunk" interface is necessary to share
> > > > > > writeback bandwidth across numerous data inodes so that we don't
> > > > > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > > > > metadata writeback on a superblock - we don't need to share that
> > > > > > bandwidth around hundreds or thousands of inodes. What we actually
> > > > > > need to know is how much writeback we need to do as a total of all
> > > > > > the dirty metadata on the superblock.
> > > > > > 
> > > > > > Sure, that's not ideal for btrfs and maybe ext4, but we can write a
> > > > > > simple generic helper that converts "flush X percent of dirty
> > > > > > metadata" to a page/byte chunk as the current code does. Doing it
> > > > > > this way allows filesystems to completely internalise the accounting
> > > > > > that needs to be done, rather than trying to hack around a
> > > > > > writeback accounting interface with large impedance mismatches to
> > > > > > how the filesystem accounts for dirty metadata and/or tracks
> > > > > &

Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-03 Thread Jan Kara
On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > > seem to me to be the right abstraction to use for this - in most
> > > > > > filesystems it's a bunch of objects with a complex dependency tree
> > > > > > and unknown write ordering, not an inode full of data that can be
> > > > > > sequentially written.
> > > > > > 
> > > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() 
> > > > > > for
> > > > > > sync based operations. That way different filesystems can ignore the
> > > > > > parts they don't need simply by not implementing those operations,
> > > > > > and the writeback code doesn't need to try to cater for all
> > > > > > operations through the one op. The writeback code should be cleaner,
> > > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > > guidelines for each operation separately so there's less mismatch
> > > > > > between what writeback is asking and how filesystems track dirty
> > > > > > metadata...
> > > > > 
> > > > > I agree that writeback for memory cleaning and writeback for data 
> > > > > integrity
> > > > > are two very different things especially for metadata. In fact for 
> > > > > data
> > > > > integrity writeback we already have ->sync_fs operation so there the
> > > > > functionality gets duplicated. What we could do is that in
> > > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > > work->for_kupdate or work->for_background is set. That way 
> > > > > ->write_metadata
> > > > > would be called only for memory cleaning purposes.
> > > > 
> > > > That makes sense, but I still think we need a better indication of
> > > > how much writeback we need to do than just "writeback this chunk of
> > > > pages". That "writeback a chunk" interface is necessary to share
> > > > writeback bandwidth across numerous data inodes so that we don't
> > > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > > metadata writeback on a superblock - we don't need to share that
> > > > bandwidth around hundreds or thousands of inodes. What we actually
> > > > need to know is how much writeback we need to do as a total of all
> > > > the dirty metadata on the superblock.
> > > > 
> > > > Sure, that's not ideal for btrfs and maybe ext4, but we can write a
> > > > simple generic helper that converts "flush X percent of dirty
> > > > metadata" to a page/byte chunk as the current code does. Doing it
> > > > this way allows filesystems to completely internalise the accounting
> > > > that needs to be done, rather than trying to hack around a
> > > > writeback accounting interface with large impedance mismatches to
> > > > how the filesystem accounts for dirty metadata and/or tracks
> > > > writeback progress.
> > > 
> > > Let me think out loud on how we could tie this into how memory cleaning
> > > writeback currently works - the one with for_background == 1, which is
> > > generally used to get the amount of dirty pages in the system under
> > > control. We have a queue of inodes to write, we iterate over this queue
> > > and ask each inode to write some amount (e.g. 64 M - exact amount
> > > depends on measured
> 
> It's a maximum of 1024 pages per inode.

That's actually a minimum, not a maximum, if I read the code in
writeback_chunk_size() right.

> > > writeback bandwidth etc.). Some amount from that inode gets written and we
> > > continue with the next inode in the queue (put this one at the end of the
> > > queue if it still has dirty pages). We do this until:
> > > 
> > > a) the number of dirty pages in the system is below background dirty limit
> > >an

Re: [PATCH v4 01/19] fs: new API for handling inode->i_version

2018-01-02 Thread Jan Kara
On Fri 22-12-17 18:54:57, Jeff Layton wrote:
> On Sat, 2017-12-23 at 10:14 +1100, NeilBrown wrote:
> > > +#include 
> > > +
> > > +/*
> > > + * The change attribute (i_version) is mandated by NFSv4 and is mostly 
> > > for
> > > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version 
> > > must
> > > + * appear different to observers if there was a change to the inode's 
> > > data or
> > > + * metadata since it was last queried.
> > > + *
> > > + * It should be considered an opaque value by observers. If it remains 
> > > the same
> > 
> >  
> > 
> > You keep using that word ... I don't think it means what you think it
> > means.
> > Change that sentence to:
> > 
> > Observers see i_version as a 64-bit number which never decreases.
> > 
> > and the rest still makes perfect sense.
> > 
> 
> Thanks! Fixed in my tree. I'll not resend the set just for that though.

With this fixed the patch looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-02 Thread Jan Kara
On Fri 22-12-17 07:05:56, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
> 
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
> 
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
> 
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
> 
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
> 
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/fs.h   |   2 +-
>  include/linux/iversion.h | 208 ++-
>  2 files changed, 154 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 76382c24e9d0..6804d075933e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -639,7 +639,7 @@ struct inode {
>   struct hlist_head   i_dentry;
>   struct rcu_head i_rcu;
>   };
> - u64 i_version;
> + atomic64_t  i_version;
>   atomic_ti_count;
>   atomic_ti_dio_count;
>   atomic_ti_writecount;
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e08c634779df..cef242e54489 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -5,6 +5,8 @@
>  #include 
>  
>  /*
> + * The inode->i_version field:
> + * ---
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
>   * appear different to observers if there was a change to the inode's data or
> @@ -27,86 +29,171 @@
>   * i_version on namespace changes in directories (mkdir, rmdir, unlink, 
> etc.).
>   * We consider these sorts of filesystems to have a kernel-managed i_version.
>   *
> + * This implementation uses the low bit in the i_version field as a flag to
> + * track when the value has been queried. If it has not been queried since it
> + * was last incremented, we can skip the increment in most cases.
> + *
> + * In the event that we're updating the ctime, we will usually go ahead and
> + * bump the i_version anyway. Since that has to go to stable storage in some
> + * fashion, we might as well increment it as well.
> + *
> + * With this implementation, the value should always appear to observers to
> + * increase over time if the file has changed. It's recommended to use
> + * inode_cmp_iversion() helper to compare values.
> + *
>   * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> - * a server-provided value (for the most part). For that reason, those
> + * a server-provided value for the most part. For that reason, those
>   * filesystems do not set SB_I_VERSION. These filesystems are considered to
>   * have a self-managed i_version.
> + *
> + * Persistently storing the i_version
> + * --
> + * Queries of the i_version field are not gated on them hitting the backing
> + * store. It's always possible that the host could crash after allowing
> + * a query of the value but before it has made it to disk.
> + *
> + * To mitigate this problem, filesystems should always use
> + * inode_set_iversion_queried when loading an existing ino

Re: [PATCH v4 16/19] fs: only set S_VERSION when updating times if necessary

2018-01-02 Thread Jan Kara
On Fri 22-12-17 07:05:53, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> We only really need to update i_version if someone has queried for it
> since we last incremented it. By doing that, we can avoid having to
> update the inode if the times haven't changed.
> 
> If the times have changed, then we go ahead and forcibly increment the
> counter, under the assumption that we'll be going to the storage
> anyway, and the increment itself is relatively cheap.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  fs/inode.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 19e72f500f71..2fa920188759 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1635,17 +1635,21 @@ static int relatime_need_update(const struct path 
> *path, struct inode *inode,
>  int generic_update_time(struct inode *inode, struct timespec *time, int 
> flags)
>  {
>   int iflags = I_DIRTY_TIME;
> + bool dirty = false;
>  
>   if (flags & S_ATIME)
>   inode->i_atime = *time;
>   if (flags & S_VERSION)
> - inode_inc_iversion(inode);
> + dirty |= inode_maybe_inc_iversion(inode, dirty);
>   if (flags & S_CTIME)
>   inode->i_ctime = *time;
>   if (flags & S_MTIME)
>   inode->i_mtime = *time;
> + if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
> + !(inode->i_sb->s_flags & SB_LAZYTIME))
> + dirty = true;

When you pass 'dirty' to inode_maybe_inc_iversion(), it is always false.
Maybe this condition should be at the beginning of the function? Once you
fix that the patch looks good so you can add:

Reviewed-by: Jan Kara <j...@suse.cz>


Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
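
Jan's observation — that `dirty` is computed only after it has already been
passed to inode_maybe_inc_iversion(), so the function always sees false — can
be illustrated with a small userspace model. The flag values and the helper
are made up for illustration; only the ordering of the two tests matters:

```c
#include <assert.h>
#include <stdbool.h>

#define M_ATIME    0x1
#define M_VERSION  0x2
#define M_CTIME    0x4
#define M_MTIME    0x8
#define M_LAZYTIME 0x100	/* stand-in for SB_LAZYTIME */

/* Stand-in for inode_maybe_inc_iversion(): bumps when forced or when
 * the counter was queried since the last increment. */
static bool model_maybe_inc(bool force, bool queried)
{
	return force || queried;
}

/* Returns whether the inode must be written synchronously. The
 * lazytime test runs *before* the version branch, as Jan suggests, so
 * the computed 'dirty' actually reaches model_maybe_inc() as 'force'. */
static bool update_time_dirty(int flags, int sb_flags, bool queried)
{
	bool dirty = false;

	if ((flags & (M_ATIME | M_CTIME | M_MTIME)) &&
	    !(sb_flags & M_LAZYTIME))
		dirty = true;

	if (flags & M_VERSION)
		dirty |= model_maybe_inc(dirty, queried);

	return dirty;
}
```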
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-21 Thread Jan Kara
On Thu 21-12-17 06:25:55, Jeff Layton wrote:
> Got it, I think. How about this (sorry for the unrelated deltas here):
> 
> [PATCH] SQUASH: add memory barriers around i_version accesses

Yep, this looks good to me.

Honza
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  include/linux/iversion.h | 60 
> +++-
>  1 file changed, 39 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..1b3b5ed7c5b8 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
>   atomic64_set(&inode->i_version, val);
>  }
>  
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a 
> self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + return atomic64_read(&inode->i_version);
> +}
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -152,7 +169,18 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>   u64 cur, old, new;
>  
> - cur = (u64)atomic64_read(&inode->i_version);
> + /*
> +  * The i_version field is not strictly ordered with any other inode
> +  * information, but the legacy inode_inc_iversion code used a spinlock
> +  * to serialize increments.
> +  *
> +  * Here, we add full memory barriers to ensure that any de-facto
> +  * ordering with other info is preserved.
> +  *
> +  * This barrier pairs with the barrier in inode_query_iversion()
> +  */
> + smp_mb();
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is clear then we needn't do anything */
>   if (!force && !(cur & I_VERSION_QUERIED))
> @@ -183,23 +211,6 @@ inode_inc_iversion(struct inode *inode)
>   inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a 
> self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> - return atomic64_read(&inode->i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,15 +259,22 @@ inode_query_iversion(struct inode *inode)
>  {
>   u64 cur, old, new;
>  
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is already set, then no need to swap */
> - if (cur & I_VERSION_QUERIED)
> + if (cur & I_VERSION_QUERIED) {
> + /*
> +  * This barrier (and the implicit barrier in the
> +  * cmpxchg below) pairs with the barrier in
> +  * inode_maybe_inc_iversion().
> +      */
> + smp_mb();
>   break;
> + }
>  
>   new = cur | I_VERSION_QUERIED;
>   old = atomic64_cmpxchg(&inode->i_version, cur, new);
> - if (old == cur)
> + if (likely(old == cur))
>   break;
>   cur = old;
>   }
> -- 
> 2.14.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-20 Thread Jan Kara
UERIED))
> @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>   new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
>  
>   old = atomic64_cmpxchg(&inode->i_version, cur, new);
> - if (likely(old == cur))
> + if (likely(old == cur)) {
> + smp_mb__after_atomic();

I don't think you need this. Cmpxchg is guaranteed to be full memory
barrier - from Documentation/atomic_t.txt:
  - RMW operations that have a return value are fully ordered;

>   break;
> + }
>   cur = old;
>   }
>   return true;

...

> @@ -248,7 +259,7 @@ inode_query_iversion(struct inode *inode)
>  {
>   u64 cur, old, new;
>  
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is already set, then no need to swap */
>   if (cur & I_VERSION_QUERIED)

And here I'd expect smp_mb() after inode_peek_iversion_raw() (actually be
needed only if you are not going to do cmpxchg as that implies barrier as
well). "Safe" use of i_version would be:

Update:

modify inode
inode_maybe_inc_iversion(inode)

Read:

my_version = inode_query_iversion(inode)
get inode data

And you need to make sure 'get inode data' does not get speculatively
evaluated before you actually sample i_version so that you are guaranteed
that if data changes, you will observe larger i_version in the future.

Also please add a comment smp_mb() in inode_maybe_inc_iversion() like:

/* This barrier pairs with the barrier in inode_query_iversion() */

and a similar comment to inode_query_iversion(). Because memory barriers
make sense only in pairs (see SMP BARRIER PAIRING in
Documentation/memory-barriers.txt).
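
The update/read recipe above can be sketched with C11 fences standing in for
smp_mb(). The single-threaded check below only exercises the plumbing — the
point of the fences is the cross-thread ordering Jan describes (data written
before the bump on the update side, version sampled before the data on the
read side). Names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t model_version;
static _Atomic int model_data;

/* Update side: modify inode data, then bump the version. The fence
 * keeps the data store from reordering past the increment. */
static void model_update(int val)
{
	atomic_store_explicit(&model_data, val, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* pairs with read fence */
	atomic_fetch_add_explicit(&model_version, 1, memory_order_relaxed);
}

/* Read side: sample the version, then read the data. The fence keeps
 * the data load from being speculated before the version sample, so a
 * later data change is guaranteed to show a larger version. */
static uint64_t model_read(int *data)
{
	uint64_t v = atomic_load_explicit(&model_version, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* pairs with update fence */
	*data = atomic_load_explicit(&model_data, memory_order_relaxed);
	return v;
}
```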

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-20 Thread Jan Kara
On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > This is just one of those things that's going to be slightly shitty.  
> > > > It's the
> > > > same for memory reclaim, all of those places use pages so we just take
> > > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close 
> > > > enough.
> > > 
> > > Ok, so that isn't exactly easy to deal with, because all our
> > > metadata writeback is based on log sequence number targets (i.e. how
> > > far to push the tail of the log towards the current head). We've
> > > actually got no idea how pages/bytes actually map to a LSN target
> > > because while we might account a full buffer as dirty for memory
> > > reclaim purposes (up to 64k in size), we might have only logged 128
> > > bytes of it.
> > > 
> > > i.e. if we are asked to push 2MB of metadata and we treat that as
> > > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > > logged several tens of megabytes of dirty metadata in that LSN
> > > range and have to flush it all. OTOH, if the buffers are fully
> > > logged, then that same target might only flush 1.5MB of metadata
> > > once all the log overhead is taken into account.
> > > 
> > > So there's a fairly large disconnect between the "flush N bytes of
> > > metadata" API and the "push to a target LSN" that XFS uses for
> > > flushing metadata in aged order. I'm betting that extN and other
> > > filesystems might have similar mismatches with their journal
> > > flushing...
> > 
> > Well, for ext4 it isn't as bad since we do full block logging only. So if
> > we are asked to flush N pages, we can easily translate that to number of fs
> > blocks and flush that many from the oldest transaction.
> > 
> > Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> > when pushing items from AIL (which is what I suppose XFS would do in
> > response to metadata writeback request) and just stop pushing when it has
> > cleaned as much as it was asked to?
> 
> If only it were that simple :/
> 
> To start with, flushing the dirty objects (such as inodes) to their
> > backing buffers does not mean that the object is clean once the
> writeback completes. XFS has decoupled in-memory objects with
> logical object logging rather than logging physical buffers, and
> so can be modified and dirtied while the inode buffer
> is being written back. Hence if we just count things like "buffer
> size written" it's not actually a correct account of the amount of
> dirty metadata we've cleaned. If we don't get that right, it'll
> result in accounting errors and incorrect behaviour.
> 
> The bigger problem, however, is that we have no channel to return
> flush information from the AIL pushing to whatever caller asked for
> the push. Pushing metadata is completely decoupled from every other
> subsystem. i.e. the caller asked the xfsaild to push to a specific
> LSN (e.g. to free up a certain amount of log space for new
> transactions), and *nothing* has any idea of how much metadata we'll
> need to write to push the tail of the log to that LSN.
> 
> It's also completely asynchronous - there's no mechanism for waiting
> on a push to a specific LSN. Anything that needs a specific amount
> of log space to be available waits in ordered ticket queues on the
> log tail moving forwards. The only interfaces that have access to
> the log tail ticket waiting is the transaction reservation
> subsystem, which cannot be used during metadata writeback because
> that's a guaranteed deadlock vector
> 
> Saying "just account for bytes written" assumes directly connected,
> synchronous dispatch metadata writeback infrastructure which we
> simply don't have in XFS. "just clean this many bytes" doesn't
> really fit at all because we have no way of referencing that to the
> distance we need to push the tail of the log. An interface that
> tells us "clean this percentage of dirty metadata" is much more
> useful because we can map that easily to a log sequence number
> based push target

OK, understood.

> > > IOWs, treating metadata like it's one great big data inode doesn't
> > > seem to me to be 

Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-19 Thread Jan Kara
On Mon 11-12-17 16:55:31, Josef Bacik wrote:
> @@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block 
> *sb,
>* background threshold and other termination conditions.
>*/
>   if (wrote) {
> - if (time_is_before_jiffies(start_time + HZ / 10UL))
> - break;
> - if (work->nr_pages <= 0)
> + if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> + work->nr_pages <= 0) {
> + done = true;
>   break;
> + }
>   }
>   }
> + if (!done && sb->s_op->write_metadata) {
> + spin_unlock(&wb->list_lock);
> + wrote += writeback_sb_metadata(sb, wb, work);
> + spin_lock(&wb->list_lock);
> + }
>   return wrote;
>  }

One thing I've noticed when looking at this patch again: This duplicates the
metadata writeback done in __writeback_inodes_wb(). So you probably need a
new helper function like writeback_sb() that will call writeback_sb_inodes()
and handle metadata writeback and call that from wb_writeback() instead of
writeback_sb_inodes() directly.
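
A sketch of the helper being suggested (kernel-style pseudocode against this
patch's API — the name writeback_sb() and its locking are illustrative, not a
tested implementation):

```c
/* Hypothetical writeback_sb(): one place that writes a superblock's
 * inodes and then its metadata, so wb_writeback() and
 * __writeback_inodes_wb() stop duplicating the metadata path. */
static long writeback_sb(struct super_block *sb, struct bdi_writeback *wb,
			 struct wb_writeback_work *work)
{
	long wrote = writeback_sb_inodes(sb, wb, work);

	if (work->nr_pages > 0 && sb->s_op->write_metadata) {
		spin_unlock(&wb->list_lock);
		wrote += writeback_sb_metadata(sb, wb, work);
		spin_lock(&wb->list_lock);
	}
	return wrote;
}
```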

Honza

> @@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback 
> *wb,
>  {
>   unsigned long start_time = jiffies;
>   long wrote = 0;
> + bool done = false;
>  
>   while (!list_empty(>b_io)) {
>   struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct 
> bdi_writeback *wb,
>  
>   /* refer to the same tests at the end of writeback_sb_inodes */
>   if (wrote) {
> - if (time_is_before_jiffies(start_time + HZ / 10UL))
> - break;
> - if (work->nr_pages <= 0)
> + if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> + work->nr_pages <= 0) {
> + done = true;
>   break;
> + }
>   }
>   }
> +
> + if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
> + LIST_HEAD(list);
> +
> + spin_unlock(&wb->list_lock);
> + spin_lock(&wb->bdi->sb_list_lock);
> + list_splice_init(&wb->bdi->dirty_sb_list, &list);
> + while (!list_empty(&list)) {
> + struct super_block *sb;
> +
> + sb = list_first_entry(&list, struct super_block,
> +   s_bdi_dirty_list);
> + list_move_tail(&sb->s_bdi_dirty_list,
> +    &wb->bdi->dirty_sb_list);
> + if (!sb->s_op->write_metadata)
> + continue;
> + if (!trylock_super(sb))
> + continue;
> + spin_unlock(&wb->bdi->sb_list_lock);
> + wrote += writeback_sb_metadata(sb, wb, work);
> + spin_lock(&wb->bdi->sb_list_lock);
> + up_read(&sb->s_umount);
> + }
> + spin_unlock(&wb->bdi->sb_list_lock);
> + spin_lock(&wb->list_lock);
> + }
>   /* Leave any unwritten inodes on b_io */
>   return wrote;
>  }
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-19 Thread Jan Kara
On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > From: Josef Bacik <jba...@fb.com>
> > > > 
> > > > Now that we have metadata counters in the VM, we need to provide a way 
> > > > to kick
> > > > writeback on dirty metadata.  Introduce 
> > > > super_operations->write_metadata.  This
> > > > allows file systems to deal with writing back any dirty metadata we 
> > > > need based
> > > > on the writeback needs of the system.  Since there is no inode to key 
> > > > off of we
> > > > need a list in the bdi for dirty super blocks to be added.  From there 
> > > > we can
> > > > find any dirty sb's on the bdi we are currently doing writeback on and 
> > > > call into
> > > > their ->write_metadata callback.
> > > > 
> > > > Signed-off-by: Josef Bacik <jba...@fb.com>
> > > > Reviewed-by: Jan Kara <j...@suse.cz>
> > > > Reviewed-by: Tejun Heo <t...@kernel.org>
> > > > ---
> > > >  fs/fs-writeback.c| 72 
> > > > 
> > > >  fs/super.c   |  6 
> > > >  include/linux/backing-dev-defs.h |  2 ++
> > > >  include/linux/fs.h   |  4 +++
> > > >  mm/backing-dev.c |  2 ++
> > > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 987448ed7698..fba703dff678 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct 
> > > > bdi_writeback *wb,
> > > > return pages;
> > > >  }
> > > >  
> > > > +static long writeback_sb_metadata(struct super_block *sb,
> > > > + struct bdi_writeback *wb,
> > > > + struct wb_writeback_work *work)
> > > > +{
> > > > +   struct writeback_control wbc = {
> > > > +   .sync_mode  = work->sync_mode,
> > > > +   .tagged_writepages  = work->tagged_writepages,
> > > > +   .for_kupdate= work->for_kupdate,
> > > > +   .for_background = work->for_background,
> > > > +   .for_sync   = work->for_sync,
> > > > +   .range_cyclic   = work->range_cyclic,
> > > > +   .range_start= 0,
> > > > +   .range_end  = LLONG_MAX,
> > > > +   };
> > > > +   long write_chunk;
> > > > +
> > > > +   write_chunk = writeback_chunk_size(wb, work);
> > > > +   wbc.nr_to_write = write_chunk;
> > > > +   sb->s_op->write_metadata(sb, &wbc);
> > > > +   work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > > +
> > > > +   return write_chunk - wbc.nr_to_write;
> > > 
> > > Ok, writeback_chunk_size() returns a page count. We've already gone
> > > through the "metadata is not page sized" dance on the dirty
> > > accounting side, so how are we supposed to use pages to account for
> > > metadata writeback?
> > > 
> > 
> > This is just one of those things that's going to be slightly shitty.  It's 
> > the
> > same for memory reclaim, all of those places use pages so we just take
> > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> 
> Ok, so that isn't exactly easy to deal with, because all our
> metadata writeback is based on log sequence number targets (i.e. how
> far to push the tail of the log towards the current head). We've
> actually got no idea how pages/bytes actually map to a LSN target
> because while we might account a full buffer as dirty for memory
> reclaim purposes (up to 64k in size), we might have only logged 128
> bytes of it.
> 
> i.e. if we are asked to push 2MB of metadata and we treat that as
> 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> logged several tens of megabytes of dirty metadata in that LSN
> range and have to flush i

Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-19 Thread Jan Kara
On Mon 18-12-17 12:22:20, Jeff Layton wrote:
> On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > >  static inline bool
> > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > >  {
> > > - atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > + u64 cur, old, new;
> > >  
> > > - atomic64_inc(ivp);
> > > + cur = (u64)atomic64_read(&inode->i_version);
> > > + for (;;) {
> > > + /* If flag is clear then we needn't do anything */
> > > + if (!force && !(cur & I_VERSION_QUERIED))
> > > + return false;
> > 
> > The fast path here misses any memory barrier. Thus it seems this query
> > could be in theory reordered before any store that happened to modify the
> > inode? Or maybe we could race and miss the fact that in fact this i_version
> > has already been queried? But maybe there's some higher level locking that
> > makes sure this is all a non-issue... But in that case it would deserve
> > some comment I guess.
> > 
> 
> There's no higher-level locking. Getting locking out of this codepath is
> a good thing IMO. The larger question here is whether we really care
> about ordering this with anything else.
> 
> The i_version, as implemented today, is not ordered with actual changes
> to the inode. We only take the i_lock today when modifying it, not when
> querying it. It's possible today that you could see the results of a
> change and then do a fetch of the i_version that doesn't show an
> increment vs. a previous change.

Yeah, so I don't suggest that you should fix unrelated issues but original
i_lock protection did actually provide memory barriers (although
semi-permeable, but in practice they are very often enough) and your patch
removing those could have changed a theoretical issue to a practical
problem. So at least preserving that original acquire-release semantics
of i_version handling would be IMHO good.

> It'd be nice if this were atomic with the actual changes that it
> represents, but I think that would be prohibitively expensive. That may
> be something we need to address. I'm not sure we really want to do it as
> part of this patchset though.
> 
> > > +
> > > + /* Since lowest bit is flag, add 2 to avoid it */
> > > + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > > +
> > > + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > + if (likely(old == cur))
> > > + break;
> > > + cur = old;
> > > + }
> > >   return true;
> > >  }
> > >  
> > 
> > ...
> > 
> > >  static inline u64
> > >  inode_query_iversion(struct inode *inode)
> > >  {
> > > - return inode_peek_iversion(inode);
> > > + u64 cur, old, new;
> > > +
> > > + cur = atomic64_read(&inode->i_version);
> > > + for (;;) {
> > > + /* If flag is already set, then no need to swap */
> > > + if (cur & I_VERSION_QUERIED)
> > > + break;
> > > +
> > > + new = cur | I_VERSION_QUERIED;
> > > + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > + if (old == cur)
> > > + break;
> > > + cur = old;
> > > + }
> > 
> > Why not just use atomic64_or() here?
> > 
> 
> If the cmpxchg fails, then either:
> 
> 1) it was incremented
> 2) someone flagged it QUERIED
> 
> If an increment happened then we don't need to flag it as QUERIED if
> we're returning an older value. If we use atomic64_or, then we can't
> tell if an increment happened so we'd end up potentially flagging it
> more than necessary.
> 
> In principle, either outcome is technically OK and we don't have to loop
> if the cmpxchg doesn't work. That said, if we think there might be a
> later i_version available, then I think we probably want to try to query
> it again so we can return as late a one as possible.

OK, makes sense. I'm just a bit wary of cmpxchg loops as they tend to
behave pretty badly in contended cases but I guess i_version won't be
hammered *that* hard.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 05/10] writeback: add counters for metadata usage

2017-12-18 Thread Jan Kara
On Mon 11-12-17 16:55:30, Josef Bacik wrote:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 356a814e7c8e..48de090f5a07 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -179,9 +179,19 @@ enum node_stat_item {
>   NR_VMSCAN_IMMEDIATE,/* Prioritise for reclaim when writeback ends */
>   NR_DIRTIED, /* page dirtyings since bootup */
>   NR_WRITTEN, /* page writings since bootup */
> + NR_METADATA_DIRTY_BYTES,/* Metadata dirty bytes */
> + NR_METADATA_WRITEBACK_BYTES,/* Metadata writeback bytes */
> + NR_METADATA_BYTES,  /* total metadata bytes in use. */
>   NR_VM_NODE_STAT_ITEMS
>  };

Please add here something like: "Warning: These counters will overflow on
32-bit machines if we ever have more than 2G of metadata on such machine!
But kernel won't be able to address that easily either so it should not be
a real issue."

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4bb13e72ac97..0b32e6381590 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -273,6 +273,13 @@ void __mod_node_page_state(struct pglist_data *pgdat, 
> enum node_stat_item item,
>  
>   t = __this_cpu_read(pcp->stat_threshold);
>  
> + /*
> +  * If this item is counted in bytes and not pages adjust the threshold
> +  * accordingly.
> +  */
> + if (is_bytes_node_stat(item))
> + t <<= PAGE_SHIFT;
> +
>   if (unlikely(x > t || x < -t)) {
>   node_page_state_add(x, pgdat, item);
>   x = 0;

This is wrong. The per-cpu counters are stored in s8 so you cannot just
bump the threshold. I would just ignore the PCP counters for metadata (I
don't think they are that critical for performance for metadata tracking)
and add to the comment I've suggested above: "Also note that updates to
these counters won't be batched using per-cpu counters since the updates
are generally larger than the counter threshold."
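
The s8 limitation can be made concrete with a tiny model; the threshold value
and PAGE_SHIFT here are illustrative, not the kernel's actual stat_threshold:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KiB pages; illustrative */

/* Model of Jan's objection: __mod_node_page_state() batches deltas in a
 * per-cpu s8 slot, so any value it accumulates must stay within int8_t
 * range. Scaling the threshold by PAGE_SHIFT blows far past that. */
static int fits_in_s8(long v)
{
	return v >= INT8_MIN && v <= INT8_MAX;
}
```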

Honza

-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 03/10] lib: add a __fprop_add_percpu_max

2017-12-18 Thread Jan Kara
On Mon 11-12-17 16:55:28, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> This helper allows us to add an arbitrary amount to the fprop
> structures.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/flex_proportions.h | 11 +--
>  lib/flex_proportions.c   |  9 +
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/flex_proportions.h 
> b/include/linux/flex_proportions.h
> index 0d348e011a6e..9f88684bf0a0 100644
> --- a/include/linux/flex_proportions.h
> +++ b/include/linux/flex_proportions.h
> @@ -83,8 +83,8 @@ struct fprop_local_percpu {
>  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp);
>  void fprop_local_destroy_percpu(struct fprop_local_percpu *pl);
>  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu 
> *pl);
> -void __fprop_inc_percpu_max(struct fprop_global *p, struct 
> fprop_local_percpu *pl,
> - int max_frac);
> +void __fprop_add_percpu_max(struct fprop_global *p, struct 
> fprop_local_percpu *pl,
> + unsigned long nr, int max_frac);
>  void fprop_fraction_percpu(struct fprop_global *p,
>   struct fprop_local_percpu *pl, unsigned long *numerator,
>   unsigned long *denominator);
> @@ -99,4 +99,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct 
> fprop_local_percpu *pl)
>   local_irq_restore(flags);
>  }
>  
> +static inline
> +void __fprop_inc_percpu_max(struct fprop_global *p,
> + struct fprop_local_percpu *pl, int max_frac)
> +{
> + __fprop_add_percpu_max(p, pl, 1, max_frac);
> +}
> +
>  #endif
> diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> index 2cc1f94e03a1..31003989d34a 100644
> --- a/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -255,8 +255,9 @@ void fprop_fraction_percpu(struct fprop_global *p,
>   * Like __fprop_inc_percpu() except that event is counted only if the given
>   * type has fraction smaller than @max_frac/FPROP_FRAC_BASE
>   */
> -void __fprop_inc_percpu_max(struct fprop_global *p,
> - struct fprop_local_percpu *pl, int max_frac)
> +void __fprop_add_percpu_max(struct fprop_global *p,
> + struct fprop_local_percpu *pl, unsigned long nr,
> + int max_frac)
>  {
>   if (unlikely(max_frac < FPROP_FRAC_BASE)) {
>   unsigned long numerator, denominator;
> @@ -267,6 +268,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
>   return;
>   } else
>   fprop_reflect_period_percpu(p, pl);
> - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> - percpu_counter_add(&p->events, 1);
> + percpu_counter_add_batch(&pl->events, nr, PROP_BATCH);
> + percpu_counter_add(&p->events, nr);
>  }
> -- 
> 2.7.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Jan Kara
On Mon 18-12-17 10:11:56, Jeff Layton wrote:
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> - atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> + u64 cur, old, new;
>  
> - atomic64_inc(ivp);
> + cur = (u64)atomic64_read(&inode->i_version);
> + for (;;) {
> + /* If flag is clear then we needn't do anything */
> + if (!force && !(cur & I_VERSION_QUERIED))
> + return false;

The fast path here misses any memory barrier. Thus it seems this query
could be in theory reordered before any store that happened to modify the
inode? Or maybe we could race and miss the fact that in fact this i_version
has already been queried? But maybe there's some higher level locking that
makes sure this is all a non-issue... But in that case it would deserve
some comment I guess.

> +
> + /* Since lowest bit is flag, add 2 to avoid it */
> + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> +
> + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> + if (likely(old == cur))
> + break;
> + cur = old;
> + }
>   return true;
>  }
>  

...

>  static inline u64
>  inode_query_iversion(struct inode *inode)
>  {
> - return inode_peek_iversion(inode);
> + u64 cur, old, new;
> +
> + cur = atomic64_read(&inode->i_version);
> + for (;;) {
> + /* If flag is already set, then no need to swap */
> + if (cur & I_VERSION_QUERIED)
> + break;
> +
> + new = cur | I_VERSION_QUERIED;
> + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> + if (old == cur)
> + break;
> + cur = old;
> + }

Why not just use atomic64_or() here?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v3 16/19] fs: only set S_VERSION when updating times if necessary

2017-12-18 Thread Jan Kara
On Mon 18-12-17 10:11:53, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> We only really need to update i_version if someone has queried for it
> since we last incremented it. By doing that, we can avoid having to
> update the inode if the times haven't changed.
> 
> If the times have changed, then we go ahead and forcibly increment the
> counter, under the assumption that we'll be going to the storage
> anyway, and the increment itself is relatively cheap.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  fs/inode.c | 20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 03102d6ef044..83f6cfc3cde7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  
>  /*
> @@ -1634,17 +1635,24 @@ static int relatime_need_update(const struct path 
> *path, struct inode *inode,
>  int generic_update_time(struct inode *inode, struct timespec *time, int 
> flags)
>  {
>   int iflags = I_DIRTY_TIME;
> + bool dirty = false;
>  
> - if (flags & S_ATIME)
> + if (flags & S_ATIME) {
>   inode->i_atime = *time;
> + dirty |= !(inode->i_sb->s_flags & SB_LAZYTIME);
> + }
>   if (flags & S_VERSION)
> - inode_inc_iversion(inode);
> - if (flags & S_CTIME)
> + dirty |= inode_maybe_inc_iversion(inode, dirty);
> + if (flags & S_CTIME) {
>   inode->i_ctime = *time;
> - if (flags & S_MTIME)
> + dirty = true;
> + }
> + if (flags & S_MTIME) {
>   inode->i_mtime = *time;
> + dirty = true;
> + }

The SB_LAZYTIME handling is wrong here. That option is not only about atime
handling but rather about all inode time stamps. So you rather need
something like:

if (flags & (S_ATIME | S_CTIME | S_MTIME) &&
!(inode->i_sb->s_flags & SB_LAZYTIME))
dirty = true;

Honza

>  
> - if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
> + if (dirty)
>   iflags |= I_DIRTY_SYNC;
>   __mark_inode_dirty(inode, iflags);
>   return 0;
> @@ -1863,7 +1871,7 @@ int file_update_time(struct file *file)
>   if (!timespec_equal(&inode->i_ctime, &now))
>   sync_it |= S_CTIME;
>  
> - if (IS_I_VERSION(inode))
> + if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
>   sync_it |= S_VERSION;
>  
>   if (!sync_it)
> -- 
> 2.14.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-15 Thread Jan Kara
On Fri 15-12-17 09:17:42, Yan, Zheng wrote:
> On Fri, Dec 15, 2017 at 12:53 AM, Jan Kara <j...@suse.cz> wrote:
> >> >
> >> > In this particular case I'm not sure why ceph passes 'filp' into the
> >> > readpage() / readpages() handlers when it already gets that pointer
> >> > as part of the arguments...
> >>
> >> It is actually a flag that tells ceph_readpages() whether its caller is
> >> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
> >> clients read/write a file at the same time, the page cache should be
> >> disabled.
> >
> > I'm not sure I understand the reasoning properly but from what you say
> > above it rather seems the 'hint' should be stored in the inode (or possibly
> > struct file)?
> >
> 
> The capability of using the page cache is held by the process that got
> it. ceph_read_iter() first gets the capability, calls
> generic_file_read_iter(), then releases the capability. The capability
> cannot easily be stored in the inode or file because it can be revoked
> by the server at any time if the caller does not hold it.

OK, understood. But using storage in task_struct (such as journal_info) is
problematic as it has hard-to-fix recursion issues, as the bug you're trying
to fix shows (it is difficult to track down all the paths that can recurse
into another filesystem which will clobber the stored info). So either you
have to come up with some scheme to safely use current->journal_info (by
somehow tracking owner as Andreas suggests) and convert all users to it or
you have to come up with some scheme propagating the information through
the inode / file->private_data and use it in Ceph.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Jan Kara
On Thu 14-12-17 22:30:26, Yan, Zheng wrote:
> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <j...@suse.cz> wrote:
> > On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
> >> We recently got an Oops report:
> >>
> >> BUG: unable to handle kernel NULL pointer dereference at (null)
> >> IP: jbd2__journal_start+0x38/0x1a2
> >> [...]
> >> Call Trace:
> >>   ext4_page_mkwrite+0x307/0x52b
> >>   _ext4_get_block+0xd8/0xd8
> >>   do_page_mkwrite+0x6e/0xd8
> >>   handle_mm_fault+0x686/0xf9b
> >>   mntput_no_expire+0x1f/0x21e
> >>   __do_page_fault+0x21d/0x465
> >>   dput+0x4a/0x2f7
> >>   page_fault+0x22/0x30
> >>   copy_user_generic_string+0x2c/0x40
> >>   copy_page_to_iter+0x8c/0x2b8
> >>   generic_file_read_iter+0x26e/0x845
> >>   timerqueue_del+0x31/0x90
> >>   ceph_read_iter+0x697/0xa33 [ceph]
> >>   hrtimer_cancel+0x23/0x41
> >>   futex_wait+0x1c8/0x24d
> >>   get_futex_key+0x32c/0x39a
> >>   __vfs_read+0xe0/0x130
> >>   vfs_read.part.1+0x6c/0x123
> >>   handle_mm_fault+0x831/0xf9b
> >>   __fget+0x7e/0xbf
> >>   SyS_read+0x4d/0xb5
> >>
> >> ceph_read_iter() uses current->journal_info to pass context info to
> >> ceph_readpages(), because ceph_readpages() needs to know whether its
> >> caller has already gotten the capability of using the page cache (to
> >> distinguish read from readahead/fadvise). ceph_read_iter() sets
> >> current->journal_info, then calls generic_file_read_iter().
> >>
> >> In the above Oops, the page fault happened while copying data to
> >> userspace. The page fault handler called ext4_page_mkwrite(), and the
> >> ext4 code read current->journal_info and assumed it was a journal handle.
> >>
> >> I checked other filesystems; btrfs probably suffers a similar problem
> >> in its readpage. (A page fault can happen when write() copies data from
> >> userspace memory and the memory is mapped to a file in btrfs;
> >> verify_parent_transid() can be called during readpage.)
> >>
> >> Cc: sta...@vger.kernel.org
> >> Signed-off-by: "Yan, Zheng" <z...@redhat.com>
> >
> > I agree with the analysis but the patch is too ugly to live. Ceph just
> > should not be abusing current->journal_info for passing information between
> > two random functions, or when it does hackery like this, it should at least
> > make sure the pieces hold together. Polluting generic code to accommodate
> > this hack in Ceph is not acceptable. Also bear in mind there are likely
> > other code paths (e.g. memory reclaim) which could recurse into another
> > filesystem, confusing it with a non-NULL current->journal_info in the same
> > way.
> 
> But ...
> 
> Some filesystems set journal_info in their write_begin(), then clear it
> in write_end(). If the buffer for the write is mapped to another
> filesystem, current->journal_info can leak into the latter filesystem's
> readpage(). The latter filesystem may read current->journal_info and
> treat it as its own journal handle. Besides, most filesystems' vm fault
> handler is filemap_fault(), which may also trigger memory reclaim.

Did you really observe this? Because the write path uses
iov_iter_copy_from_user_atomic(), which does not allow page faults to
happen. All page faulting happens in iov_iter_fault_in_readable() before
->write_begin() is called. And recursion problems like the ones you mention
above are exactly the reason why things are done in this more complicated
way.

> >
> > In this particular case I'm not sure why ceph passes 'filp' into the
> > readpage() / readpages() handlers when it already gets that pointer as
> > part of the arguments...
> 
> It is actually a flag that tells ceph_readpages() whether its caller is
> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
> clients read/write a file at the same time, the page cache should be
> disabled.

I'm not sure I understand the reasoning properly but from what you say
above it rather seems the 'hint' should be stored in the inode (or possibly
struct file)?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Jan Kara
On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
> We recently got an Oops report:
> 
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: jbd2__journal_start+0x38/0x1a2
> [...]
> Call Trace:
>   ext4_page_mkwrite+0x307/0x52b
>   _ext4_get_block+0xd8/0xd8
>   do_page_mkwrite+0x6e/0xd8
>   handle_mm_fault+0x686/0xf9b
>   mntput_no_expire+0x1f/0x21e
>   __do_page_fault+0x21d/0x465
>   dput+0x4a/0x2f7
>   page_fault+0x22/0x30
>   copy_user_generic_string+0x2c/0x40
>   copy_page_to_iter+0x8c/0x2b8
>   generic_file_read_iter+0x26e/0x845
>   timerqueue_del+0x31/0x90
>   ceph_read_iter+0x697/0xa33 [ceph]
>   hrtimer_cancel+0x23/0x41
>   futex_wait+0x1c8/0x24d
>   get_futex_key+0x32c/0x39a
>   __vfs_read+0xe0/0x130
>   vfs_read.part.1+0x6c/0x123
>   handle_mm_fault+0x831/0xf9b
>   __fget+0x7e/0xbf
>   SyS_read+0x4d/0xb5
> 
> ceph_read_iter() uses current->journal_info to pass context info to
> ceph_readpages(), because ceph_readpages() needs to know whether its
> caller has already gotten the capability of using the page cache (to
> distinguish read from readahead/fadvise). ceph_read_iter() sets
> current->journal_info, then calls generic_file_read_iter().
> 
> In the above Oops, the page fault happened while copying data to
> userspace. The page fault handler called ext4_page_mkwrite(), and the
> ext4 code read current->journal_info and assumed it was a journal handle.
> 
> I checked other filesystems; btrfs probably suffers a similar problem
> in its readpage. (A page fault can happen when write() copies data from
> userspace memory and the memory is mapped to a file in btrfs;
> verify_parent_transid() can be called during readpage.)
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: "Yan, Zheng" <z...@redhat.com>

I agree with the analysis but the patch is too ugly to live. Ceph just
should not be abusing current->journal_info for passing information between
two random functions, or when it does hackery like this, it should at least
make sure the pieces hold together. Polluting generic code to accommodate
this hack in Ceph is not acceptable. Also bear in mind there are likely
other code paths (e.g. memory reclaim) which could recurse into another
filesystem, confusing it with a non-NULL current->journal_info in the same
way.

In this particular case I'm not sure why ceph passes 'filp' into the
readpage() / readpages() handlers when it already gets that pointer as part
of the arguments...

Honza

> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed16c20..db2a50233c49 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, 
> unsigned long address,
>   unsigned int flags)
>  {
>   int ret;
> + void *old_journal_info;
>  
>   __set_current_state(TASK_RUNNING);
>  
> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, 
> unsigned long address,
>   if (flags & FAULT_FLAG_USER)
>   mem_cgroup_oom_enable();
>  
> + /*
> +  * Fault can happen when filesystem A's read_iter()/write_iter()
> +  * copies data to/from userspace. Filesystem A may have set
> +  * current->journal_info. If the userspace memory is MAP_SHARED
> +  * mapped to a file in filesystem B, we later may call filesystem
> +  * B's vm operation. Filesystem B may also want to read/set
> +  * current->journal_info.
> +  */
> + old_journal_info = current->journal_info;
> + current->journal_info = NULL;
> +
>   if (unlikely(is_vm_hugetlb_page(vma)))
>   ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>   else
>   ret = __handle_mm_fault(vma, address, flags);
>  
> + current->journal_info = old_journal_info;
> +
>   if (flags & FAULT_FLAG_USER) {
>   mem_cgroup_oom_disable();
>   /*
> -- 
> 2.13.6
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at

2017-12-13 Thread Jan Kara
On Wed 13-12-17 06:38:25, Adam Borowski wrote:
> This link is replicated in most filesystems' config stanzas.  Referring
> to an archived version of that site is pointless as it mostly deals with
> patches; user documentation is available elsewhere.
> 
> Signed-off-by: Adam Borowski <kilob...@angband.pl>

Looks good to me. You can add:

Acked-by: Jan Kara <j...@suse.cz>

Honza

> ---
> Sending this as one piece; if you guys would instead prefer this chopped
> into tiny per-filesystem bits, please say so.
> 
> 
>  Documentation/filesystems/ext2.txt |  2 --
>  Documentation/filesystems/ext4.txt |  7 +++
>  fs/9p/Kconfig  |  3 ---
>  fs/Kconfig |  6 +-
>  fs/btrfs/Kconfig   |  3 ---
>  fs/ceph/Kconfig|  3 ---
>  fs/cifs/Kconfig| 15 +++
>  fs/ext2/Kconfig|  6 +-
>  fs/ext4/Kconfig|  3 ---
>  fs/f2fs/Kconfig|  6 +-
>  fs/hfsplus/Kconfig |  3 ---
>  fs/jffs2/Kconfig   |  6 +-
>  fs/jfs/Kconfig |  3 ---
>  fs/reiserfs/Kconfig|  6 +-
>  fs/xfs/Kconfig |  3 ---
>  15 files changed, 15 insertions(+), 60 deletions(-)
> 
> diff --git a/Documentation/filesystems/ext2.txt 
> b/Documentation/filesystems/ext2.txt
> index 55755395d3dc..81c0becab225 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -49,12 +49,10 @@ sb=n  Use alternate 
> superblock at this location.
>  
>  user_xattr   Enable "user." POSIX Extended Attributes
>   (requires CONFIG_EXT2_FS_XATTR).
> - See also http://acl.bestbits.at
>  nouser_xattr Don't support "user." extended attributes.
>  
>  acl  Enable POSIX Access Control Lists support
>   (requires CONFIG_EXT2_FS_POSIX_ACL).
> - See also http://acl.bestbits.at
>  noaclDon't support POSIX ACLs.
>  
>  nobh Do not attach buffer_heads to file pagecache.
> diff --git a/Documentation/filesystems/ext4.txt 
> b/Documentation/filesystems/ext4.txt
> index 75236c0c2ac2..8cd63e16f171 100644
> --- a/Documentation/filesystems/ext4.txt
> +++ b/Documentation/filesystems/ext4.txt
> @@ -202,15 +202,14 @@ inode_readahead_blks=n  This tuning parameter controls 
> the maximum
>   the buffer cache.  The default value is 32 blocks.
>  
>  nouser_xattr Disables Extended User Attributes.  See the
> - attr(5) manual page and http://acl.bestbits.at/
> - for more information about extended attributes.
> + attr(5) manual page for more information about
> + extended attributes.
>  
>  noaclThis option disables POSIX Access Control List
>   support. If ACL support is enabled in the kernel
>   configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
>   enabled by default on mount. See the acl(5) manual
> - page and http://acl.bestbits.at/ for more information
> - about acl.
> + page for more information about acl.
>  
>  bsddf(*) Make 'df' act like BSD.
>  minixdf  Make 'df' act like Minix.
> diff --git a/fs/9p/Kconfig b/fs/9p/Kconfig
> index 6489e1fc1afd..11045d8e356a 100644
> --- a/fs/9p/Kconfig
> +++ b/fs/9p/Kconfig
> @@ -25,9 +25,6 @@ config 9P_FS_POSIX_ACL
> POSIX Access Control Lists (ACLs) support permissions for users and
> groups beyond the owner/group/world scheme.
>  
> -   To learn more about Access Control Lists, visit the POSIX ACLs for
> -   Linux website <http://acl.bestbits.at/>.
> -
> If you don't know what Access Control Lists are, say N
>  
>  endif
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 7aee6d699fd6..0ed56752f208 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -167,17 +167,13 @@ config TMPFS_POSIX_ACL
> files for sound to work properly.  In short, if you're not sure,
> say Y.
>  
> -   To learn more about Access Control Lists, visit the POSIX ACLs for
> -   Linux website <http://acl.bestbits.at/>.
> -
>  config TMPFS_XATTR
>   bool "Tmpfs extended attributes"
>  

Re: [PATCH v2 06/11] writeback: add counters for metadata usage

2017-12-04 Thread Jan Kara
gt; PAGE_SHIFT;
> + /*
> +  * We don't do writeout through the shrinkers so subtract any
> +  * dirty/writeback metadata bytes from the reclaimable count.
> +  */
> + if (nr_metadata_reclaimable) {
> + unsigned long unreclaimable =
> + node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) +
> + node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES);
> + unreclaimable >>= PAGE_SHIFT;
> + nr_metadata_reclaimable -= unreclaimable;
> + }
> + return nr_metadata_reclaimable + nr_pagecache_reclaimable - delta;
>  }

Ditto as with __vm_enough_memory(). In particular I'm unsure whether the
watermarks like min_unmapped_pages or min_slab_pages would still work as
designed.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-12-01 Thread Jan Kara
On Wed 29-11-17 13:38:26, Chris Mason wrote:
> On 11/29/2017 12:05 PM, Tejun Heo wrote:
> >On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
> >>Hello,
> >>
> >>On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
> >>>What has happened with this patch set?
> >>
> >>No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
> >>can you please route them through the btrfs tree?
> >
> >lol looking at the patchset again, I'm not sure that's obviously the
> >right tree.  It can either be cgroup, block or btrfs.  If no one
> >objects, I'll just route them through cgroup.
> 
> We'll have to coordinate a bit during the next merge window but I don't have
> a problem with these going in through cgroup.  Dave does this sound good to
> you?

Also I was wondering about another thing: how does this play with Josef's
series for metadata writeback (Metadata specific accounting and dirty
writeout)? Would the per-inode selection of cgroup writeback still be
needed when Josef's series is applied, since metadata writeback then won't
be associated with any particular mapping anymore?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v2 05/11] writeback: convert the flexible prop stuff to bytes

2017-11-29 Thread Jan Kara
On Wed 22-11-17 16:16:00, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> The flexible proportions were all page based, but now that we are doing
> metadata writeout that can be smaller or larger than page size we need
> to account for this in bytes instead of number of pages.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/page-writeback.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e4563645749a..2a1994194cc1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -574,11 +574,11 @@ static unsigned long wp_next_time(unsigned long 
> cur_time)
>   return cur_time;
>  }
>  
> -static void wb_domain_writeout_inc(struct wb_domain *dom,
> +static void wb_domain_writeout_add(struct wb_domain *dom,
>  struct fprop_local_percpu *completions,
> -unsigned int max_prop_frac)
> +long bytes, unsigned int max_prop_frac)
>  {
> - __fprop_inc_percpu_max(&dom->completions, completions,
> + __fprop_add_percpu_max(&dom->completions, completions, bytes,
>  max_prop_frac);
>   /* First event after period switching was turned off? */
>   if (unlikely(!dom->period_time)) {
> @@ -602,12 +602,12 @@ static inline void __wb_writeout_add(struct 
> bdi_writeback *wb, long bytes)
>   struct wb_domain *cgdom;
>  
>   __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
> - wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
> + wb_domain_writeout_add(&global_wb_domain, &wb->completions, bytes,
>  wb->bdi->max_prop_frac);
>  
>   cgdom = mem_cgroup_wb_domain(wb);
>   if (cgdom)
> - wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
> +     wb_domain_writeout_add(cgdom, wb_memcg_completions(wb), bytes,
>  wb->bdi->max_prop_frac);
>  }
>  
> -- 
> 2.7.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v2 03/11] lib: make the fprop batch size a multiple of PAGE_SIZE

2017-11-29 Thread Jan Kara
On Wed 22-11-17 16:15:58, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> We are converting the writeback counters to use bytes instead of pages,
> so we need to make the batch size for the percpu modifications align
> properly with the new units.  Since we used pages before, just multiply
> by PAGE_SIZE to get the equivalent bytes for the batch size.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>

Looks good to me, just please make this part of patch 5/11. Otherwise
bisection may get broken by too large errors in per-cpu counters of IO
completions... Thanks!

Honza

> ---
>  lib/flex_proportions.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> index 2cc1f94e03a1..b0343ae71f5e 100644
> --- a/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -166,7 +166,7 @@ void fprop_fraction_single(struct fprop_global *p,
>  /*
>   *  PERCPU 
>   */
> -#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
> +#define PROP_BATCH (8*PAGE_SIZE*(1+ilog2(nr_cpu_ids)))
>  
>  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp)
>  {
> -- 
> 2.7.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v2 04/11] lib: add a __fprop_add_percpu_max

2017-11-29 Thread Jan Kara
On Wed 22-11-17 16:15:59, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> This helper allows us to add an arbitrary amount to the fprop
> structures.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/flex_proportions.h | 11 +--
>  lib/flex_proportions.c   |  9 +
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/flex_proportions.h 
> b/include/linux/flex_proportions.h
> index 0d348e011a6e..9f88684bf0a0 100644
> --- a/include/linux/flex_proportions.h
> +++ b/include/linux/flex_proportions.h
> @@ -83,8 +83,8 @@ struct fprop_local_percpu {
>  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp);
>  void fprop_local_destroy_percpu(struct fprop_local_percpu *pl);
>  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu 
> *pl);
> -void __fprop_inc_percpu_max(struct fprop_global *p, struct 
> fprop_local_percpu *pl,
> - int max_frac);
> +void __fprop_add_percpu_max(struct fprop_global *p, struct 
> fprop_local_percpu *pl,
> + unsigned long nr, int max_frac);
>  void fprop_fraction_percpu(struct fprop_global *p,
>   struct fprop_local_percpu *pl, unsigned long *numerator,
>   unsigned long *denominator);
> @@ -99,4 +99,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct 
> fprop_local_percpu *pl)
>   local_irq_restore(flags);
>  }
>  
> +static inline
> +void __fprop_inc_percpu_max(struct fprop_global *p,
> + struct fprop_local_percpu *pl, int max_frac)
> +{
> + __fprop_add_percpu_max(p, pl, 1, max_frac);
> +}
> +
>  #endif
> diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> index b0343ae71f5e..fd95791a2c93 100644
> --- a/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -255,8 +255,9 @@ void fprop_fraction_percpu(struct fprop_global *p,
>   * Like __fprop_inc_percpu() except that event is counted only if the given
>   * type has fraction smaller than @max_frac/FPROP_FRAC_BASE
>   */
> -void __fprop_inc_percpu_max(struct fprop_global *p,
> - struct fprop_local_percpu *pl, int max_frac)
> +void __fprop_add_percpu_max(struct fprop_global *p,
> + struct fprop_local_percpu *pl, unsigned long nr,
> + int max_frac)
>  {
>   if (unlikely(max_frac < FPROP_FRAC_BASE)) {
>   unsigned long numerator, denominator;
> @@ -267,6 +268,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
>   return;
>   } else
>   fprop_reflect_period_percpu(p, pl);
> - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> - percpu_counter_add(&p->events, 1);
> + percpu_counter_add_batch(&pl->events, nr, PROP_BATCH);
> + percpu_counter_add(&p->events, nr);
>  }
> -- 
> 2.7.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v2 02/11] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes

2017-11-29 Thread Jan Kara
On Wed 22-11-17 16:15:57, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> These are counters that constantly go up in order to do bandwidth 
> calculations.
> It isn't important what units they are in, as long as they are consistent 
> between
> the two of them, so convert them to count bytes written/dirtied, and allow the
> metadata accounting stuff to change the counters as well.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>
> Acked-by: Tejun Heo <t...@kernel.org>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/fuse/file.c   |  4 ++--
>  include/linux/backing-dev-defs.h |  4 ++--
>  include/linux/backing-dev.h  |  2 +-
>  mm/backing-dev.c |  9 +
>  mm/page-writeback.c  | 20 ++--
>  5 files changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index cb7dff5c45d7..67e7c4fac28d 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1471,7 +1471,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
> struct fuse_req *req)
>   for (i = 0; i < req->num_pages; i++) {
>   dec_wb_stat(&bdi->wb, WB_WRITEBACK);
>   dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
> - wb_writeout_inc(&bdi->wb);
> + wb_writeout_add(&bdi->wb, PAGE_SIZE);
>   }
>   wake_up(&fi->page_waitq);
>  }
> @@ -1776,7 +1776,7 @@ static bool fuse_writepage_in_flight(struct fuse_req 
> *new_req,
>  
>   dec_wb_stat(&bdi->wb, WB_WRITEBACK);
>   dec_node_page_state(page, NR_WRITEBACK_TEMP);
> - wb_writeout_inc(&bdi->wb);
> + wb_writeout_add(&bdi->wb, PAGE_SIZE);
>   fuse_writepage_free(fc, new_req);
>   fuse_request_free(new_req);
>   goto out;
> diff --git a/include/linux/backing-dev-defs.h 
> b/include/linux/backing-dev-defs.h
> index 866c433e7d32..ded45ac2cec7 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -36,8 +36,8 @@ typedef int (congested_fn)(void *, int);
>  enum wb_stat_item {
>   WB_RECLAIMABLE,
>   WB_WRITEBACK,
> - WB_DIRTIED,
> - WB_WRITTEN,
> + WB_DIRTIED_BYTES,
> + WB_WRITTEN_BYTES,
>   NR_WB_STAT_ITEMS
>  };
>  
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 14e266d12620..39b8dc486ea7 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -89,7 +89,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, 
> enum wb_stat_item item)
>   return percpu_counter_sum_positive(&wb->stat[item]);
>  }
>  
> -extern void wb_writeout_inc(struct bdi_writeback *wb);
> +extern void wb_writeout_add(struct bdi_writeback *wb, long bytes);
>  
>  /*
>   * maximal error of a stat counter.
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e19606bb41a0..62a332a91b38 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -68,14 +68,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void 
> *v)
>   wb_thresh = wb_calc_thresh(wb, dirty_thresh);
>  
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
> +#define BtoK(x) ((x) >> 10)
>   seq_printf(m,
>  "BdiWriteback:   %10lu kB\n"
>  "BdiReclaimable: %10lu kB\n"
>  "BdiDirtyThresh: %10lu kB\n"
>  "DirtyThresh:%10lu kB\n"
>  "BackgroundThresh:   %10lu kB\n"
> -"BdiDirtied: %10lu kB\n"
> -"BdiWritten: %10lu kB\n"
> +"BdiDirtiedBytes:%10lu kB\n"
> +"BdiWrittenBytes:%10lu kB\n"
>  "BdiWriteBandwidth:  %10lu kBps\n"
>  "b_dirty:%10lu\n"
>  "b_io:   %10lu\n"
> @@ -88,8 +89,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>  K(wb_thresh),
>  K(dirty_thresh),
>  K(background_thresh),
> -(unsigned long) K(wb_stat(wb, WB_DIRTIED)),
> -(unsigned long) K(wb_stat(wb, WB_WRITTEN)),
> +(unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
> +(unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
>  (unsigned long) K(wb->write_bandwidth),
>  nr_dirty,
>  nr_io,
> diff --g

Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-29 Thread Jan Kara
Hi Tejun,

What has happened with this patch set?

Honza

On Tue 10-10-17 08:54:36, Tejun Heo wrote:
> Changes from the last version are
> 
> * blkcg_root_css exported to fix build breakage on modular btrfs.
> 
> * Use ext4_should_journal_data() test instead of
>   EXT4_MOUNT_JOURNAL_DATA.
> 
> * Separated out create_bh_bio() and used it to implement
>   submit_bh_blkcg_css() as suggested by Jan.
> 
> btrfs has different ways to issue metadata IOs and may end up issuing
> metadata or otherwise shared IOs from a non-root cgroup, which can
> lead to priority inversion and ineffective IO isolation.
> 
> This patchset makes sure that btrfs issues all metadata and shared IOs
> from the root cgroup by exempting btree_inodes from cgroup writeback
> and explicitly associating shared IOs with the root cgroup.
> 
> This patchset contains the following five patches
> 
>  [PATCH 1/5] blkcg: export blkcg_root_css
>  [PATCH 2/5] cgroup, writeback: replace SB_I_CGROUPWB with per-inode
>  [PATCH 3/5] buffer_head: separate out create_bh_bio() from
>  [PATCH 4/5] cgroup, buffer_head: implement submit_bh_blkcg_css()
>  [PATCH 5/5] btrfs: ensure that metadata and flush are issued from the
> 
> and is also available in the following git branch
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
> review-cgroup-btrfs-metadata-v2
> 
> diffstat follows.  Thanks.
> 
>  block/blk-cgroup.c  |1 +
>  fs/block_dev.c  |3 +--
>  fs/btrfs/check-integrity.c  |2 +-
>  fs/btrfs/disk-io.c  |4 
>  fs/btrfs/ioctl.c|6 +-
>  fs/btrfs/super.c|1 -
>  fs/buffer.c |   42 ++
>  fs/ext2/inode.c |3 ++-
>  fs/ext2/super.c |1 -
>  fs/ext4/inode.c |5 -
>  fs/ext4/super.c |2 --
>  include/linux/backing-dev.h |2 +-
>  include/linux/buffer_head.h |3 +++
>  include/linux/fs.h  |3 ++-
>  14 files changed, 58 insertions(+), 20 deletions(-)
> 
> --
> tejun
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 06/10] writeback: add counters for metadata usage

2017-11-22 Thread Jan Kara
 NR_METADATA_WRITEBACK_BYTES);
> + unreclaimable >>= PAGE_SHIFT;
> + nr_metadata_reclaimable -= unreclaimable;
> + }
> + return nr_metadata_reclaimable + nr_pagecache_reclaimable - delta;
>  }

So I've checked both places that use this function and I think they are fine
with the change. However it would still be good to get someone more
knowledgeable of reclaim paths to have a look at this patch.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 03/10] lib: add a batch size to fprop_global

2017-11-22 Thread Jan Kara
On Wed 22-11-17 09:47:16, Jan Kara wrote:
> On Tue 14-11-17 16:56:49, Josef Bacik wrote:
> > From: Josef Bacik <jba...@fb.com>
> > 
> > The flexible proportion stuff has been used to track how many pages we
> > are writing out over a period of time, so it counts everything in single
> > increments.  If we want to use another base value, we need to be able
> > to adjust the batch size to fit the units we'll be using for the
> > proportions.
> > 
> > Signed-off-by: Josef Bacik <jba...@fb.com>
> 
> Frankly, I had to look into the code to understand what the patch is about.
> Can we rephrase the changelog like:
> 
> Currently flexible proportion code is using fixed per-cpu counter batch size
> since all the counters use only increment / decrement to track number of
> pages which completed writeback. When we start tracking amount of done
> writeback in different units, we need to update per-cpu counter batch size
> accordingly. Make counter batch size configurable on a per-proportion
> domain basis to allow for this.

Actually, now that I'm looking at other patches: Since fprop code is only
used for bdi writeback tracking, I guess there's no good reason to make
this configurable on a per-proportion basis. Just drop this patch and
bump PROP_BATCH in the following patch and we are done. Am I missing
something?

Honza

> > ---
> >  include/linux/flex_proportions.h |  4 +++-
> >  lib/flex_proportions.c   | 11 +--
> >  2 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/flex_proportions.h 
> > b/include/linux/flex_proportions.h
> > index 0d348e011a6e..853f4305d1b2 100644
> > --- a/include/linux/flex_proportions.h
> > +++ b/include/linux/flex_proportions.h
> > @@ -20,7 +20,7 @@
> >   */
> >  #define FPROP_FRAC_SHIFT 10
> >  #define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT)
> > -
> > +#define FPROP_BATCH_SIZE (8*(1+ilog2(nr_cpu_ids)))
> >  /*
> >   *  Global proportion definitions 
> >   */
> > @@ -31,6 +31,8 @@ struct fprop_global {
> > unsigned int period;
> > /* Synchronization with period transitions */
> > seqcount_t sequence;
> > +   /* batch size */
> > +   s32 batch_size;
> >  };
> >  
> >  int fprop_global_init(struct fprop_global *p, gfp_t gfp);
> > diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> > index 2cc1f94e03a1..5552523b663a 100644
> > --- a/lib/flex_proportions.c
> > +++ b/lib/flex_proportions.c
> > @@ -44,6 +44,7 @@ int fprop_global_init(struct fprop_global *p, gfp_t gfp)
> > if (err)
> > return err;
> > seqcount_init(&p->sequence);
> > +   p->batch_size = FPROP_BATCH_SIZE;
> > return 0;
> >  }
> >  
> > @@ -166,8 +167,6 @@ void fprop_fraction_single(struct fprop_global *p,
> >  /*
> >   *  PERCPU 
> >   */
> > -#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
> > -
> >  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp)
> >  {
> > int err;
> > @@ -204,11 +203,11 @@ static void fprop_reflect_period_percpu(struct 
> > fprop_global *p,
> > if (period - pl->period < BITS_PER_LONG) {
> > s64 val = percpu_counter_read(&pl->events);
> >  
> > -   if (val < (nr_cpu_ids * PROP_BATCH))
> > +   if (val < (nr_cpu_ids * p->batch_size))
> > val = percpu_counter_sum(&pl->events);
> >  
> > percpu_counter_add_batch(&pl->events,
> > -   -val + (val >> (period-pl->period)), PROP_BATCH);
> > +   -val + (val >> (period-pl->period)), p->batch_size);
> > } else
> > percpu_counter_set(&pl->events, 0);
> > pl->period = period;
> > @@ -219,7 +218,7 @@ static void fprop_reflect_period_percpu(struct 
> > fprop_global *p,
> >  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu 
> > *pl)
> >  {
> >     fprop_reflect_period_percpu(p, pl);
> > -   percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> > +   percpu_counter_add_batch(&pl->events, 1, p->batch_size);
> > percpu_counter_add(&p->events, 1);
> >  }
> >  
> > @@ -267,6 +266,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
> > return;
> > } else
> > fprop_reflect_period_percpu(p, pl);
> > -   percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> > +   percpu_counter_add_batch(&pl->events, 1, p->batch_size);
> > percpu_counter_add(&p->events, 1);
> >  }
> > -- 
> > 2.7.5
> > 
> -- 
> Jan Kara <j...@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/10] lib: add a batch size to fprop_global

2017-11-22 Thread Jan Kara
On Tue 14-11-17 16:56:49, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> The flexible proportion stuff has been used to track how many pages we
> are writing out over a period of time, so counts everything in single
> increments.  If we wanted to use another base value we need to be able
> to adjust the batch size to fit the units we'll be using for the
> proportions.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>

Frankly, I had to look into the code to understand what the patch is about.
Can we rephrase the changelog like:

Currently flexible proportion code is using fixed per-cpu counter batch size
since all the counters use only increment / decrement to track number of
pages which completed writeback. When we start tracking amount of done
writeback in different units, we need to update per-cpu counter batch size
accordingly. Make counter batch size configurable on a per-proportion
domain basis to allow for this.

Honza
> ---
>  include/linux/flex_proportions.h |  4 +++-
>  lib/flex_proportions.c   | 11 +++++------
>  2 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/flex_proportions.h 
> b/include/linux/flex_proportions.h
> index 0d348e011a6e..853f4305d1b2 100644
> --- a/include/linux/flex_proportions.h
> +++ b/include/linux/flex_proportions.h
> @@ -20,7 +20,7 @@
>   */
>  #define FPROP_FRAC_SHIFT 10
>  #define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT)
> -
> +#define FPROP_BATCH_SIZE (8*(1+ilog2(nr_cpu_ids)))
>  /*
>   *  Global proportion definitions 
>   */
> @@ -31,6 +31,8 @@ struct fprop_global {
>   unsigned int period;
>   /* Synchronization with period transitions */
>   seqcount_t sequence;
> + /* batch size */
> + s32 batch_size;
>  };
>  
>  int fprop_global_init(struct fprop_global *p, gfp_t gfp);
> diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> index 2cc1f94e03a1..5552523b663a 100644
> --- a/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -44,6 +44,7 @@ int fprop_global_init(struct fprop_global *p, gfp_t gfp)
>   if (err)
>   return err;
>   seqcount_init(&p->sequence);
> + p->batch_size = FPROP_BATCH_SIZE;
>   return 0;
>  }
>  
> @@ -166,8 +167,6 @@ void fprop_fraction_single(struct fprop_global *p,
>  /*
>   *  PERCPU 
>   */
> -#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
> -
>  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp)
>  {
>   int err;
> @@ -204,11 +203,11 @@ static void fprop_reflect_period_percpu(struct 
> fprop_global *p,
>   if (period - pl->period < BITS_PER_LONG) {
>   s64 val = percpu_counter_read(&pl->events);
>  
> - if (val < (nr_cpu_ids * PROP_BATCH))
> + if (val < (nr_cpu_ids * p->batch_size))
>   val = percpu_counter_sum(&pl->events);
>  
>   percpu_counter_add_batch(&pl->events,
> - -val + (val >> (period-pl->period)), PROP_BATCH);
> + -val + (val >> (period-pl->period)), p->batch_size);
>   } else
>   percpu_counter_set(&pl->events, 0);
>   pl->period = period;
> @@ -219,7 +218,7 @@ static void fprop_reflect_period_percpu(struct 
> fprop_global *p,
>  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu 
> *pl)
>  {
>   fprop_reflect_period_percpu(p, pl);
> - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> + percpu_counter_add_batch(&pl->events, 1, p->batch_size);
>   percpu_counter_add(&p->events, 1);
>  }
>  
> @@ -267,6 +266,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
>   return;
>   } else
>   fprop_reflect_period_percpu(p, pl);
> - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> + percpu_counter_add_batch(&pl->events, 1, p->batch_size);
>   percpu_counter_add(&p->events, 1);
>  }
> -- 
> 2.7.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 4/5] cgroup, buffer_head: implement submit_bh_blkcg_css()

2017-10-11 Thread Jan Kara
On Tue 10-10-17 08:54:40, Tejun Heo wrote:
> Implement submit_bh_blkcg_css() which will be used to override cgroup
> membership on specific buffer_heads.
> 
> v2: Reimplemented using create_bh_bio() as suggested by Jan.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Jens Axboe <ax...@kernel.dk>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/buffer.c | 12 ++++++++++++
>  include/linux/buffer_head.h |  3 +++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4b2169..ed0e473 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3147,6 +3147,18 @@ static int submit_bh_wbc(int op, int op_flags, struct 
> buffer_head *bh,
>   return 0;
>  }
>  
> +int submit_bh_blkcg_css(int op, int op_flags, struct buffer_head *bh,
> + struct cgroup_subsys_state *blkcg_css)
> +{
> + struct bio *bio;
> +
> + bio = create_bh_bio(op, op_flags, bh, 0);
> + bio_associate_blkcg(bio, blkcg_css);
> + submit_bio(bio);
> + return 0;
> +}
> +EXPORT_SYMBOL(submit_bh_blkcg_css);
> +
>  int submit_bh(int op, int op_flags, struct buffer_head *bh)
>  {
>   struct bio *bio;
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c8dae55..dca5b3b 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_BLOCK
>  
> @@ -197,6 +198,8 @@ void ll_rw_block(int, int, int, struct buffer_head * 
> bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int op_flags);
>  void write_dirty_buffer(struct buffer_head *bh, int op_flags);
> +int submit_bh_blkcg_css(int op, int op_flags, struct buffer_head *bh,
> + struct cgroup_subsys_state *blkcg_css);
>  int submit_bh(int, int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>   sector_t bblock, unsigned blocksize);
> -- 
> 2.9.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 3/5] buffer_head: separate out create_bh_bio() from submit_bh_wbc()

2017-10-11 Thread Jan Kara
On Tue 10-10-17 08:54:39, Tejun Heo wrote:
> submit_bh_wbc() creates a bio matching the specific @bh and submits it
> at the end.  This patch separates out the bio creation part to its own
> function, create_bh_bio(), and reimplement submit_bh[_wbc]() using the
> function.
> 
> As bio can now be manipulated before submitted, we can move out @wbc
> handling into submit_bh_wbc() and similarly this will make adding more
> submit_bh variants straight-forward.
> 
> This patch is pure refactoring and doesn't cause any functional
> changes.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Suggested-by: Jan Kara <j...@suse.cz>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza


> ---
>  fs/buffer.c | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 170df85..b4b2169 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3086,8 +3086,8 @@ void guard_bio_eod(int op, struct bio *bio)
>   }
>  }
>  
> -static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> -  enum rw_hint write_hint, struct writeback_control *wbc)
> +struct bio *create_bh_bio(int op, int op_flags, struct buffer_head *bh,
> +  enum rw_hint write_hint)
>  {
>   struct bio *bio;
>  
> @@ -3109,11 +3109,6 @@ static int submit_bh_wbc(int op, int op_flags, struct 
> buffer_head *bh,
>*/
>   bio = bio_alloc(GFP_NOIO, 1);
>  
> - if (wbc) {
> - wbc_init_bio(wbc, bio);
> - wbc_account_io(wbc, bh->b_page, bh->b_size);
> - }
> -
>   bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
>   bio_set_dev(bio, bh->b_bdev);
>   bio->bi_write_hint = write_hint;
> @@ -3133,13 +3128,32 @@ static int submit_bh_wbc(int op, int op_flags, struct 
> buffer_head *bh,
>   op_flags |= REQ_PRIO;
>   bio_set_op_attrs(bio, op, op_flags);
>  
> + return bio;
> +}
> +
> +static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> +  enum rw_hint write_hint, struct writeback_control *wbc)
> +{
> + struct bio *bio;
> +
> + bio = create_bh_bio(op, op_flags, bh, write_hint);
> +
> + if (wbc) {
> + wbc_init_bio(wbc, bio);
> + wbc_account_io(wbc, bh->b_page, bh->b_size);
> + }
> +
>   submit_bio(bio);
>   return 0;
>  }
>  
>  int submit_bh(int op, int op_flags, struct buffer_head *bh)
>  {
> - return submit_bh_wbc(op, op_flags, bh, 0, NULL);
> + struct bio *bio;
> +
> + bio = create_bh_bio(op, op_flags, bh, 0);
> + submit_bio(bio);
> + return 0;
>  }
>  EXPORT_SYMBOL(submit_bh);
>  
> -- 
> 2.9.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/5] blkcg: export blkcg_root_css

2017-10-11 Thread Jan Kara
On Tue 10-10-17 08:54:37, Tejun Heo wrote:
> Export blkcg_root_css so that filesystem modules can use it.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  block/blk-cgroup.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d3f56ba..597a457 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -45,6 +45,7 @@ struct blkcg blkcg_root;
>  EXPORT_SYMBOL_GPL(blkcg_root);
>  
>  struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
> +EXPORT_SYMBOL_GPL(blkcg_root_css);
>  
>  static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
>  
> -- 
> 2.9.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 2/3] cgroup, writeback: implement submit_bh_blkcg_css()

2017-10-10 Thread Jan Kara
On Mon 09-10-17 14:29:10, Tejun Heo wrote:
> Add wbc->blkcg_css so that the blkcg_css association can be specified
> independently and implement submit_bh_blkcg_css() using it.  This will
> be used to override cgroup membership on specific buffer_heads.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Jens Axboe <ax...@kernel.dk>

Hum, I dislike growing wbc just to pass blkcg_css through one function in
fs/buffer.c (in otherwise empty wbc). Not that space would be a big concern
for wbc but it just gets messier... Cannot we just refactor the code a
bit like:

1) Create function

struct bio *create_bio_bh(int op, int op_flags, struct buffer_head *bh,
  enum rw_hint write_hint);

which would create bio from bh.

2) Make submit_bh(), submit_bh_wbc(), submit_bh_blkcg_css() use this and
the latter two would further tweak the bio as needed before submitting.

Thoughts?

Honza


> ---
>  fs/buffer.c | 12 ++++++++++++
>  fs/fs-writeback.c   |  1 +
>  include/linux/buffer_head.h | 11 +++++++++++
>  include/linux/writeback.h   |  6 ++++--
>  4 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 170df85..fac4f9a 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3143,6 +3143,18 @@ int submit_bh(int op, int op_flags, struct buffer_head 
> *bh)
>  }
>  EXPORT_SYMBOL(submit_bh);
>  
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +int submit_bh_blkcg_css(int op, int op_flags,  struct buffer_head *bh,
> + struct cgroup_subsys_state *blkcg_css)
> +{
> + struct writeback_control wbc = { .blkcg_css = blkcg_css };
> +
> + /* @wbc is used just to override the bio's blkcg_css */
> + return submit_bh_wbc(op, op_flags, bh, 0, &wbc);
> +}
> +EXPORT_SYMBOL(submit_bh_blkcg_css);
> +#endif
> +
>  /**
>   * ll_rw_block: low-level access to block devices (DEPRECATED)
>   * @op: whether to %READ or %WRITE
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 245c430..cd882e6 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -538,6 +538,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control 
> *wbc,
>   }
>  
>   wbc->wb = inode_to_wb(inode);
> + wbc->blkcg_css = wbc->wb->blkcg_css;
>   wbc->inode = inode;
>  
>   wbc->wb_id = wbc->wb->memcg_css->id;
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c8dae55..abb4dd4 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_BLOCK
>  
> @@ -205,6 +206,16 @@ int bh_submit_read(struct buffer_head *bh);
>  loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
>loff_t length, int whence);
>  
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +int submit_bh_blkcg(int op, int op_flags, struct buffer_head *bh,
> + struct cgroup_subsys_state *blkcg_css);
> +#else
> +static inline int submit_bh_blkcg(int op, int op_flags, struct buffer_head 
> *bh,
> +   struct cgroup_subsys_state *blkcg_css)
> +{
> + return submit_bh(op, op_flags, bh);
> +}
> +#endif
>  extern int buffer_heads_over_limit;
>  
>  /*
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index d581579..81e5946 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -91,6 +91,8 @@ struct writeback_control {
>   unsigned for_sync:1;/* sync(2) WB_SYNC_ALL writeback */
>  #ifdef CONFIG_CGROUP_WRITEBACK
>   struct bdi_writeback *wb;   /* wb this writeback is issued under */
> + struct cgroup_subsys_state *blkcg_css; /* usually wb->blkcg_css but
> +   may be overridden */
>   struct inode *inode;/* inode being written out */
>  
>   /* foreign inode detection, see wbc_detach_inode() */
> @@ -277,8 +279,8 @@ static inline void wbc_init_bio(struct writeback_control 
> *wbc, struct bio *bio)
>* behind a slow cgroup.  Ultimately, we want pageout() to kick off
>* regular writeback instead of writing things out itself.
>*/
> - if (wbc->wb)
> - bio_associate_blkcg(bio, wbc->wb->blkcg_css);
> + if (wbc->blkcg_css)
> + bio_associate_blkcg(bio, wbc->blkcg_css);
>  }
>  
>  #else/* CONFIG_CGROUP_WRITEBACK */
> -- 
> 2.9.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 1/3] cgroup, writeback: replace SB_I_CGROUPWB with per-inode S_CGROUPWB

2017-10-10 Thread Jan Kara
On Mon 09-10-17 14:29:09, Tejun Heo wrote:
> Currently, a filesystem can indicate cgroup writeback support per
> superblock; however, depending on the filesystem, especially if inodes
> are used to carry metadata, it can be useful to indicate cgroup
> writeback support per inode.
> 
> This patch replaces the superblock flag SB_I_CGROUPWB with per-inode
> S_CGROUPWB, so that cgroup writeback can be enabled selectively.
> 
> * block_dev sets the new flag in bdget() when initializing new inode.
> 
> * ext2/4 set the new flag in ext?_set_inode_flags() function.
> 
> * btrfs sets the new flag in btrfs_update_iflags() function.  Note
>   that this automatically excludes btree_inode which doesn't use
>   btrfs_update_iflags() during initialization.  This is an intended
>   behavior change.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Jens Axboe <ax...@kernel.dk>
> Cc: Chris Mason <c...@fb.com>
> Cc: Josef Bacik <jba...@fb.com>
> Cc: linux-btrfs@vger.kernel.org
> Cc: "Theodore Ts'o" <ty...@mit.edu>
> Cc: Andreas Dilger <adilger.ker...@dilger.ca>
> Cc: linux-e...@vger.kernel.org

...

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 31db875..344f12b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4591,8 +4591,11 @@ void ext4_set_inode_flags(struct inode *inode)
>   !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
>   !ext4_encrypted_inode(inode))
>   new_fl |= S_DAX;
> + if (test_opt(inode->i_sb, DATA_FLAGS) != EXT4_MOUNT_JOURNAL_DATA)
> + new_fl |= S_CGROUPWB;

Use ext4_should_journal_data(inode) here? Ext4 can be journalling data also
because of per-inode flag.

Otherwise the patch looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


[PATCH 02/16] btrfs: Use pagevec_lookup_range_tag()

2017-10-09 Thread Jan Kara
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

CC: linux-btrfs@vger.kernel.org
CC: David Sterba <dste...@suse.com>
Reviewed-by: David Sterba <dste...@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/btrfs/extent_io.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 970190cd347e..a4eb6c988f27 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3818,8 +3818,8 @@ int btree_write_cache_pages(struct address_space *mapping,
if (wbc->sync_mode == WB_SYNC_ALL)
tag_pages_for_writeback(mapping, index, end);
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3829,11 +3829,6 @@ int btree_write_cache_pages(struct address_space 
*mapping,
if (!PagePrivate(page))
continue;
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   break;
-   }
-
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
@@ -3965,8 +3960,8 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3991,12 +3986,6 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
continue;
}
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   unlock_page(page);
-   continue;
-   }
-
if (wbc->sync_mode != WB_SYNC_NONE) {
if (PageWriteback(page))
flush_fn(data);
-- 
2.12.3



[PATCH 02/15] btrfs: Use pagevec_lookup_range_tag()

2017-09-27 Thread Jan Kara
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

CC: linux-btrfs@vger.kernel.org
CC: David Sterba <dste...@suse.com>
Reviewed-by: David Sterba <dste...@suse.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/btrfs/extent_io.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0f077c5db58e..9b7936ea3a88 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3819,8 +3819,8 @@ int btree_write_cache_pages(struct address_space *mapping,
if (wbc->sync_mode == WB_SYNC_ALL)
tag_pages_for_writeback(mapping, index, end);
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3830,11 +3830,6 @@ int btree_write_cache_pages(struct address_space 
*mapping,
if (!PagePrivate(page))
continue;
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   break;
-   }
-
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
@@ -3966,8 +3961,8 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3992,12 +3987,6 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
continue;
}
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   unlock_page(page);
-   continue;
-   }
-
if (wbc->sync_mode != WB_SYNC_NONE) {
if (PageWriteback(page))
flush_fn(data);
-- 
2.12.3



[PATCH 02/15] btrfs: Use pagevec_lookup_range_tag()

2017-09-14 Thread Jan Kara
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

CC: linux-btrfs@vger.kernel.org
CC: David Sterba <dste...@suse.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/btrfs/extent_io.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0f077c5db58e..9b7936ea3a88 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3819,8 +3819,8 @@ int btree_write_cache_pages(struct address_space *mapping,
if (wbc->sync_mode == WB_SYNC_ALL)
tag_pages_for_writeback(mapping, index, end);
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3830,11 +3830,6 @@ int btree_write_cache_pages(struct address_space 
*mapping,
if (!PagePrivate(page))
continue;
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   break;
-   }
-
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
@@ -3966,8 +3961,8 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3992,12 +3987,6 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
continue;
}
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   unlock_page(page);
-   continue;
-   }
-
if (wbc->sync_mode != WB_SYNC_NONE) {
if (PageWriteback(page))
flush_fn(data);
-- 
2.12.3



[PATCH 0/11 v1] Fix inheritance of SGID in presence of default ACLs

2017-06-22 Thread Jan Kara
Hello,

this patch set fixes a problem introduced by commit 073931017b49 "posix_acl:
Clear SGID bit when setting file permissions". The problem is that when new
directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is
expected to have SGID bit set (and owning group equal to the owning group of
'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits,
setting these ACLs will result in the SGID bit on 'DIR1' being cleared if the
user is not a member of the owning group.

The problem is fixed by moving posix_acl_update_mode() so that it does not
get called when default ACLs are inherited.

I have created new generic/441 test for this and verified that generic/314,
generic/375, and generic/441 pass for ext2, ext4, btrfs, xfs, ocfs2, reiserfs.

All patches in this series are completely independent so fs maintainers please
pick them up as soon as they get reviewed. I'm leaving for three weeks of
vacation at the end of the week so I won't be able to push this further for
some time.

Honza


[PATCH 03/11] btrfs: Don't clear SGID when inheriting ACLs

2017-06-22 Thread Jan Kara
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID bit on
'DIR1' being cleared if the user is not a member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b49d9458aa351605b43a7e34598caef
CC: sta...@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: David Sterba <dste...@suse.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/btrfs/acl.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index 247b8dfaf6e5..8d8370ddb6b2 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -78,12 +78,6 @@ static int __btrfs_set_acl(struct btrfs_trans_handle *trans,
switch (type) {
case ACL_TYPE_ACCESS:
name = XATTR_NAME_POSIX_ACL_ACCESS;
-   if (acl) {
> -   ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
-   if (ret)
-   return ret;
-   }
-   ret = 0;
break;
case ACL_TYPE_DEFAULT:
if (!S_ISDIR(inode->i_mode))
@@ -119,6 +113,13 @@ static int __btrfs_set_acl(struct btrfs_trans_handle 
*trans,
 
 int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
+   int ret;
+
+   if (type == ACL_TYPE_ACCESS && acl) {
> +   ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
+   if (ret)
+   return ret;
+   }
return __btrfs_set_acl(NULL, inode, acl, type);
 }
 
-- 
2.12.3



Re: [PATCH v7 01/22] fs: remove call_fsync helper function

2017-06-20 Thread Jan Kara
On Fri 16-06-17 15:34:06, Jeff Layton wrote:
> Requested-by: Christoph Hellwig <h...@infradead.org>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
> ---
>  fs/sync.c  | 2 +-
>  include/linux/fs.h | 6 ------
>  ipc/shm.c  | 2 +-
>  3 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/sync.c b/fs/sync.c
> index 11ba023434b1..2a54c1f22035 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -192,7 +192,7 @@ int vfs_fsync_range(struct file *file, loff_t start, 
> loff_t end, int datasync)
>   spin_unlock(&inode->i_lock);
>   mark_inode_dirty_sync(inode);
>   }
> - return call_fsync(file, start, end, datasync);
> + return file->f_op->fsync(file, start, end, datasync);
>  }
>  EXPORT_SYMBOL(vfs_fsync_range);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4929a8f28cc3..1a135274b4f8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1740,12 +1740,6 @@ static inline int call_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   return file->f_op->mmap(file, vma);
>  }
>  
> -static inline int call_fsync(struct file *file, loff_t start, loff_t end,
> -  int datasync)
> -{
> - return file->f_op->fsync(file, start, end, datasync);
> -}
> -
>  ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
> unsigned long nr_segs, unsigned long fast_segs,
> struct iovec *fast_pointer,
> diff --git a/ipc/shm.c b/ipc/shm.c
> index ec5688e98f25..28a444861a8f 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -453,7 +453,7 @@ static int shm_fsync(struct file *file, loff_t start, 
> loff_t end, int datasync)
>  
>   if (!sfd->file->f_op->fsync)
>   return -EINVAL;
> - return call_fsync(sfd->file, start, end, datasync);
> + return sfd->file->f_op->fsync(sfd->file, start, end, datasync);
>  }
>  
>  static long shm_fallocate(struct file *file, int mode, loff_t offset,
> -- 
> 2.13.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v7 05/22] jbd2: don't clear and reset errors after waiting on writeback

2017-06-20 Thread Jan Kara
On Fri 16-06-17 15:34:10, Jeff Layton wrote:
> Resetting this flag is almost certainly racy, and will be problematic
> with some coming changes.
> 
> Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
> Have jbd2 call it instead of filemap_fdatawait and don't attempt to
> re-set the error flag if it fails.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/jbd2/commit.c   | 15 +++
>  include/linux/fs.h |  2 +-
>  mm/filemap.c   | 16 ++--
>  3 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b6b194ec1b4f..502110540598 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -263,18 +263,9 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>   continue;
>   jinode->i_flags |= JI_COMMIT_RUNNING;
>   spin_unlock(&journal->j_list_lock);
> - err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
> - if (err) {
> - /*
> -  * Because AS_EIO is cleared by
> -  * filemap_fdatawait_range(), set it again so
> -  * that user process can get -EIO from fsync().
> -  */
> - mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
> -
> - if (!ret)
> - ret = err;
> - }
> + err = filemap_fdatawait_keep_errors(jinode->i_vfs_inode->i_mapping);
> + if (!ret)
> + ret = err;
>   spin_lock(&journal->j_list_lock);
>   jinode->i_flags &= ~JI_COMMIT_RUNNING;
>   smp_mb();
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1a135274b4f8..1b1233a1db1e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2509,7 +2509,7 @@ extern int write_inode_now(struct inode *, int);
>  extern int filemap_fdatawrite(struct address_space *);
>  extern int filemap_flush(struct address_space *);
>  extern int filemap_fdatawait(struct address_space *);
> -extern void filemap_fdatawait_keep_errors(struct address_space *);
> +extern int filemap_fdatawait_keep_errors(struct address_space *);
>  extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
>  loff_t lend);
>  extern int filemap_write_and_wait(struct address_space *mapping);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b9e870600572..37f286df7c95 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -311,6 +311,16 @@ int filemap_check_errors(struct address_space *mapping)
>  }
>  EXPORT_SYMBOL(filemap_check_errors);
>  
> +static int filemap_check_and_keep_errors(struct address_space *mapping)
> +{
> + /* Check for outstanding write errors */
> + if (test_bit(AS_EIO, &mapping->flags))
> + return -EIO;
> + if (test_bit(AS_ENOSPC, &mapping->flags))
> + return -ENOSPC;
> + return 0;
> +}
> +
>  /**
>   * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
>   * @mapping: address space structure to write
> @@ -455,15 +465,17 @@ EXPORT_SYMBOL(filemap_fdatawait_range);
>   * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
>   * fsfreeze(8)
>   */
> -void filemap_fdatawait_keep_errors(struct address_space *mapping)
> +int filemap_fdatawait_keep_errors(struct address_space *mapping)
>  {
>   loff_t i_size = i_size_read(mapping->host);
>  
>   if (i_size == 0)
> -         return;
> + return 0;
>  
>   __filemap_fdatawait_range(mapping, 0, i_size - 1);
> + return filemap_check_and_keep_errors(mapping);
>  }
> +EXPORT_SYMBOL(filemap_fdatawait_keep_errors);
>  
>  /**
>   * filemap_fdatawait - wait for all under-writeback pages to complete
> -- 
> 2.13.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 01/35] fscache: Remove unused ->now_uncached callback

2017-06-19 Thread Jan Kara
On Thu 01-06-17 13:34:34, Jan Kara wrote:
> On Thu 01-06-17 11:26:08, David Howells wrote:
> > Jan Kara <j...@suse.cz> wrote:
> > 
> > > The callback doesn't ever get called. Remove it.
> > 
> > Hmmm...  I should perhaps be calling this.  I'm not sure why I never did.
> > 
> > At the moment, it doesn't strictly matter as ops on pages marked with
> > PG_fscache get ignored if the cache has suffered an I/O error or has been
> > withdrawn - but it will incur a performance penalty (the PG_fscache flag is
> > checked in the netfs before calling into fscache).
> > 
> > The downside of calling this is that when a cache is removed, fscache would 
> > go
> > through all the cookies for that cache and iterate over all the pages
> > associated with those cookies - which could cause a performance dip in the
> > system.
> 
> So I know nothing about fscache. If you decide these functions should stay
> in as you are going to use them soon, then I can just convert them to the
> new API as everything else. What just caught my eye and why I had a more
> detailed look is that I didn't understand that 'PAGEVEC_SIZE -
> pagevec_count()' as a pagevec_lookup() argument since pagevec_count()
> should always return 0 at that point?

David, what is your final decision regarding this? Do you want to keep
these unused functions (and I will just update my patch to convert them to
the new calling convention) or will you apply the patch to remove them?

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [Cluster-devel] [PATCH 00/35 v1] pagevec API cleanups

2017-06-01 Thread Jan Kara
On Thu 01-06-17 04:36:04, Christoph Hellwig wrote:
> On Thu, Jun 01, 2017 at 11:32:10AM +0200, Jan Kara wrote:
> > * Implement ranged variants for pagevec_lookup and find_get_ functions. Lot
> >   of callers actually want a ranged lookup and we unnecessarily opencode 
> > this
> >   in lot of them.
> 
> How does this compare to Kents page cache iterators:
> 
> http://www.spinics.net/lists/linux-mm/msg104737.html

Interesting. I didn't know about that work. I guess the tradeoff is pretty
obvious - my patches are more conservative (changing less) and as a result
the API is not as neat as Kent's one. That being said I was also thinking
about something similar to what Kent did but what I didn't like about such
iterator is that you still need to specially handle the cases where you
break out of the loop (you need to do that with pagevecs too but there it
is kind of obvious from the API).

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 01/35] fscache: Remove unused ->now_uncached callback

2017-06-01 Thread Jan Kara
On Thu 01-06-17 11:26:08, David Howells wrote:
> Jan Kara <j...@suse.cz> wrote:
> 
> > The callback doesn't ever get called. Remove it.
> 
> Hmmm...  I should perhaps be calling this.  I'm not sure why I never did.
> 
> At the moment, it doesn't strictly matter as ops on pages marked with
> PG_fscache get ignored if the cache has suffered an I/O error or has been
> withdrawn - but it will incur a performance penalty (the PG_fscache flag is
> checked in the netfs before calling into fscache).
> 
> The downside of calling this is that when a cache is removed, fscache would go
> through all the cookies for that cache and iterate over all the pages
> associated with those cookies - which could cause a performance dip in the
> system.

So I know nothing about fscache. If you decide these functions should stay
in as you are going to use them soon, then I can just convert them to the
new API as everything else. What just caught my eye and why I had a more
detailed look is that I didn't understand that 'PAGEVEC_SIZE -
pagevec_count()' as a pagevec_lookup() argument since pagevec_count()
should always return 0 at that point?

    Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/35] hugetlbfs: Use pagevec_lookup_range() in remove_inode_hugepages()

2017-06-01 Thread Jan Kara
We want only pages from given range in remove_inode_hugepages(). Use
pagevec_lookup_range() instead of pagevec_lookup().

CC: Nadia Yvette Chambers <n...@holomorphy.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/hugetlbfs/inode.c | 18 ++
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 372fc8aac38e..99885f9b9d56 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -403,7 +403,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
struct pagevec pvec;
pgoff_t next, index;
int i, freed = 0;
-   long lookup_nr = PAGEVEC_SIZE;
bool truncate_op = (lend == LLONG_MAX);
 
	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
@@ -412,30 +411,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
next = start;
while (next < end) {
/*
-* Don't grab more pages than the number left in the range.
-*/
-   if (end - next < lookup_nr)
-   lookup_nr = end - next;
-
-   /*
 * When no more pages are found, we are done.
 */
-   if (!pagevec_lookup(&pvec, mapping, &next, lookup_nr))
+   if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1,
+ PAGEVEC_SIZE))
break;
 
	for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
u32 hash;
 
-   /*
-* The page (index) could be beyond end.  This is
-* only possible in the punch hole case as end is
-* max page offset in the truncate case.
-*/
index = page->index;
-   if (index >= end)
-   break;
-
hash = hugetlb_fault_mutex_hash(h, current->mm,
	&pseudo_vma,
mapping, index, 0);
-- 
2.12.3



[PATCH 05/35] mm: Fix THP handling in invalidate_mapping_pages()

2017-06-01 Thread Jan Kara
The condition checking for THP straddling end of invalidated range is
wrong - it checks 'index' against 'end' but 'index' has been already
advanced to point to the end of THP and thus the condition can never be
true. As a result, a THP straddling 'end' has been fully invalidated.
Given the nature of invalidate_mapping_pages(), this could only be a
performance issue. In fact, we are lucky the condition is wrong because
if it was ever true, we'd leave a locked page behind.

Fix the condition checking for THP straddling 'end' and also properly
unlock the page. Also update the comment before the condition to explain
why we decide not to invalidate the page as it was not clear to me and I
had to ask Kirill.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/truncate.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 6479ed2afc53..2330223841fb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -530,9 +530,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
} else if (PageTransHuge(page)) {
index += HPAGE_PMD_NR - 1;
i += HPAGE_PMD_NR - 1;
-   /* 'end' is in the middle of THP */
-   if (index ==  round_down(end, HPAGE_PMD_NR))
+   /*
+* 'end' is in the middle of THP. Don't
+* invalidate the page as the part outside of
+* 'end' could be still useful.
+*/
+   if (index > end) {
+   unlock_page(page);
continue;
+   }
}
 
ret = invalidate_inode_page(page);
-- 
2.12.3



[PATCH 01/35] fscache: Remove unused ->now_uncached callback

2017-06-01 Thread Jan Kara
The callback doesn't ever get called. Remove it.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 Documentation/filesystems/caching/netfs-api.txt |  2 --
 fs/9p/cache.c   | 29 -
 fs/afs/cache.c  | 43 -
 fs/ceph/cache.c | 31 --
 fs/cifs/cache.c | 31 --
 fs/nfs/fscache-index.c  | 40 ---
 include/linux/fscache.h |  9 --
 7 files changed, 185 deletions(-)

diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index aed6b94160b1..0eb31de3a2c1 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -151,8 +151,6 @@ To define an object, a structure of the following type should be filled out:
void (*mark_pages_cached)(void *cookie_netfs_data,
  struct address_space *mapping,
  struct pagevec *cached_pvec);
-
-   void (*now_uncached)(void *cookie_netfs_data);
};
 
 This has the following fields:
diff --git a/fs/9p/cache.c b/fs/9p/cache.c
index 103ca5e1267b..64c58eb26159 100644
--- a/fs/9p/cache.c
+++ b/fs/9p/cache.c
@@ -151,34 +151,6 @@ fscache_checkaux v9fs_cache_inode_check_aux(void *cookie_netfs_data,
return FSCACHE_CHECKAUX_OKAY;
 }
 
-static void v9fs_cache_inode_now_uncached(void *cookie_netfs_data)
-{
-   struct v9fs_inode *v9inode = cookie_netfs_data;
-   struct pagevec pvec;
-   pgoff_t first;
-   int loop, nr_pages;
-
-   pagevec_init(&pvec, 0);
-   first = 0;
-
-   for (;;) {
-   nr_pages = pagevec_lookup(&pvec, v9inode->vfs_inode.i_mapping,
- first,
- PAGEVEC_SIZE - pagevec_count(&pvec));
-   if (!nr_pages)
-   break;
-
-   for (loop = 0; loop < nr_pages; loop++)
-   ClearPageFsCache(pvec.pages[loop]);
-
-   first = pvec.pages[nr_pages - 1]->index + 1;
-
-   pvec.nr = nr_pages;
-   pagevec_release(&pvec);
-   cond_resched();
-   }
-}
-
 const struct fscache_cookie_def v9fs_cache_inode_index_def = {
.name   = "9p.inode",
.type   = FSCACHE_COOKIE_TYPE_DATAFILE,
@@ -186,7 +158,6 @@ const struct fscache_cookie_def v9fs_cache_inode_index_def = {
.get_attr   = v9fs_cache_inode_get_attr,
.get_aux= v9fs_cache_inode_get_aux,
.check_aux  = v9fs_cache_inode_check_aux,
-   .now_uncached   = v9fs_cache_inode_now_uncached,
 };
 
 void v9fs_cache_inode_get_cookie(struct inode *inode)
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index 577763c3d88b..1fe855191261 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
 static uint16_t afs_vnode_cache_get_aux(const void *cookie_netfs_data,
 static enum fscache_checkaux afs_vnode_cache_check_aux(void *cookie_netfs_data,
   const void *buffer,
   uint16_t buflen);
-static void afs_vnode_cache_now_uncached(void *cookie_netfs_data);
 
 struct fscache_netfs afs_cache_netfs = {
.name   = "afs",
@@ -75,7 +74,6 @@ struct fscache_cookie_def afs_vnode_cache_index_def = {
.get_attr   = afs_vnode_cache_get_attr,
.get_aux= afs_vnode_cache_get_aux,
.check_aux  = afs_vnode_cache_check_aux,
-   .now_uncached   = afs_vnode_cache_now_uncached,
 };
 
 /*
@@ -359,44 +357,3 @@ static enum fscache_checkaux afs_vnode_cache_check_aux(void *cookie_netfs_data,
_leave(" = SUCCESS");
return FSCACHE_CHECKAUX_OKAY;
 }
-
-/*
- * indication the cookie is no longer uncached
- * - this function is called when the backing store currently caching a cookie
- *   is removed
- * - the netfs should use this to clean up any markers indicating cached pages
- * - this is mandatory for any object that may have data
- */
-static void afs_vnode_cache_now_uncached(void *cookie_netfs_data)
-{
-   struct afs_vnode *vnode = cookie_netfs_data;
-   struct pagevec pvec;
-   pgoff_t first;
-   int loop, nr_pages;
-
-   _enter("{%x,%x,%Lx}",
-  vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version);
-
-   pagevec_init(&pvec, 0);
-   first = 0;
-
-   for (;;) {
-   /* grab a bunch of pages to clean */
-   nr_pages = pagevec_lookup(&pvec, vnode->vfs_inode.i_mapping,
- first,
-

[PATCH 03/35] ext4: Fix off-by-one in loop termination in ext4_find_unwritten_pgoff()

2017-06-01 Thread Jan Kara
There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index bbea2dccd584..2b00bf84c05b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -474,7 +474,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
endoff = (loff_t)end_blk << blkbits;
 
index = startoff >> PAGE_SHIFT;
-   end = endoff >> PAGE_SHIFT;
+   end = (endoff - 1) >> PAGE_SHIFT;
 
	pagevec_init(&pvec, 0);
do {
-- 
2.12.3



[PATCH 10/35] ext4: Use pagevec_lookup_range() in writeback code

2017-06-01 Thread Jan Kara
Both occurences of pagevec_lookup() actually want only pages from a
given range. Use pagevec_lookup_range() for the lookup.

CC: "Theodore Ts'o" <ty...@mit.edu>
CC: linux-e...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/inode.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 784f41328dc8..59d82530d269 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1670,13 +1670,13 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 
	pagevec_init(&pvec, 0);
	while (index <= end) {
-   nr_pages = pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE);
+   nr_pages = pagevec_lookup_range(&pvec, mapping, &index, end,
+   PAGEVEC_SIZE);
if (nr_pages == 0)
break;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
-   if (page->index > end)
-   break;
+
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));
if (invalidate) {
@@ -2283,15 +2283,13 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
 
	pagevec_init(&pvec, 0);
	while (start <= end) {
-   nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &start,
- PAGEVEC_SIZE);
+   nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping,
+   &start, end, PAGEVEC_SIZE);
if (nr_pages == 0)
break;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
 
-   if (page->index > end)
-   break;
bh = head = page_buffers(page);
do {
if (lblk < mpd->map.m_lblk)
-- 
2.12.3



[PATCH 06/35] mm: Make pagevec_lookup() update index

2017-06-01 Thread Jan Kara
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue. Most callers want this
and also pagevec_lookup_tag() already does this.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/buffer.c |  6 ++
 fs/ext4/file.c  |  4 +---
 fs/ext4/inode.c |  8 ++--
 fs/fscache/page.c   |  5 ++---
 fs/hugetlbfs/inode.c| 17 -
 fs/nilfs2/page.c|  3 +--
 fs/ramfs/file-nommu.c   |  2 +-
 fs/xfs/xfs_file.c   |  3 +--
 include/linux/pagemap.h |  2 +-
 include/linux/pagevec.h |  2 +-
 mm/filemap.c|  9 +++--
 mm/swap.c   |  5 +++--
 12 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..fe0ee01c5a44 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1638,13 +1638,12 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
 
end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
	pagevec_init(&pvec, 0);
-   while (index <= end && pagevec_lookup(&pvec, bd_mapping, index,
+   while (index <= end && pagevec_lookup(&pvec, bd_mapping, &index,
		min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
-   index = page->index;
-   if (index > end)
+   if (page->index > end)
break;
if (!page_has_buffers(page))
continue;
@@ -1675,7 +1674,6 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
		}
		pagevec_release(&pvec);
cond_resched();
-   index++;
}
 }
 EXPORT_SYMBOL(clean_bdev_aliases);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2b00bf84c05b..ddca17c7875a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -482,7 +482,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
unsigned long nr_pages;
 
num = min_t(pgoff_t, end - index, PAGEVEC_SIZE);
-   nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index,
+   nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &index,
  (pgoff_t)num);
if (nr_pages == 0)
break;
@@ -547,8 +547,6 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
/* The no. of pages is less than our desired, we are done. */
if (nr_pages < num)
break;
-
-   index = pvec.pages[i - 1]->index + 1;
	pagevec_release(&pvec);
} while (index <= end);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1bd0bfa547f6..784f41328dc8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1670,7 +1670,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 
	pagevec_init(&pvec, 0);
	while (index <= end) {
-   nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
+   nr_pages = pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE);
if (nr_pages == 0)
break;
for (i = 0; i < nr_pages; i++) {
@@ -1687,7 +1687,6 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
}
unlock_page(page);
}
-   index = pvec.pages[nr_pages - 1]->index + 1;
	pagevec_release(&pvec);
}
 }
@@ -2284,7 +2283,7 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
 
	pagevec_init(&pvec, 0);
	while (start <= end) {
-   nr_pages = pagevec_lookup(&pvec, inode->i_mapping, start,
+   nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &start,
  PAGEVEC_SIZE);
if (nr_pages == 0)
break;
@@ -2293,8 +2292,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
 
if (page->index > end)
break;
-   /* Up to 'end' pages must be contiguous */
-   BUG_ON(page->index != start);
bh = head = page_buffers(page);
do {
if (lblk < mpd->map.m_lblk)
@@ -2339,7 +2336,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
	pagevec_release(&pvec);
return err;
}
-   start++;
}
	pagevec_release(&pvec);
}
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index c8c4f79c7c

[PATCH 09/35] ext4: Use pagevec_lookup_range() in ext4_find_unwritten_pgoff()

2017-06-01 Thread Jan Kara
Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are
interested only in pages in the given range. Simplify the logic as a
result of not getting pages out of range and index getting automatically
advanced.

CC: linux-e...@vger.kernel.org
CC: "Theodore Ts'o" <ty...@mit.edu>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/file.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ddca17c7875a..6821070a388b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -478,12 +478,11 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 
	pagevec_init(&pvec, 0);
do {
-   int i, num;
+   int i;
unsigned long nr_pages;
 
-   num = min_t(pgoff_t, end - index, PAGEVEC_SIZE);
-   nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &index,
- (pgoff_t)num);
+   nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping,
+   &index, end, PAGEVEC_SIZE);
if (nr_pages == 0)
break;
 
@@ -502,9 +501,6 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
goto out;
}
 
-   if (page->index > end)
-   goto out;
-
lock_page(page);
 
if (unlikely(page->mapping != inode->i_mapping)) {
@@ -544,12 +540,10 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
unlock_page(page);
}
 
-   /* The no. of pages is less than our desired, we are done. */
-   if (nr_pages < num)
-   break;
	pagevec_release(&pvec);
} while (index <= end);
 
+   /* There are no pages upto endoff - that would be a hole in there. */
if (whence == SEEK_HOLE && lastoff < endoff) {
found = 1;
*offset = lastoff;
-- 
2.12.3



[PATCH 08/35] fs: Fix performance regression in clean_bdev_aliases()

2017-06-01 Thread Jan Kara
Commit e64855c6cfaa "fs: Add helper to clean bdev aliases under a bh and
use it" added a wrapper for clean_bdev_aliases() that invalidates bdev
aliases underlying a single buffer head. However this has caused a
performance regression for bonnie++ benchmark on ext4 filesystem when
delayed allocation is turned off (ext3 mode) - average of 3 runs:

Hmean SeqOut Char  164787.55 (  0.00%) 107189.06 (-34.95%)
Hmean SeqOut Block 219883.89 (  0.00%) 168870.32 (-23.20%)

The reason for this regression is that clean_bdev_aliases() is slower
when called for a single block because the pagevec_lookup() it uses will
end up iterating through the radix tree until it finds a page (which may
take a while), while we are only interested in whether there's a page at
a particular index.

Fix the problem by using pagevec_lookup_range() instead which avoids the
needless iteration.

Fixes: e64855c6cfaa0a80c1b71c5f647cb792dc436668
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/buffer.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index fe0ee01c5a44..d63b22e50f38 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1632,19 +1632,18 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
struct pagevec pvec;
pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
pgoff_t end;
-   int i;
+   int i, count;
struct buffer_head *bh;
struct buffer_head *head;
 
end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
	pagevec_init(&pvec, 0);
-   while (index <= end && pagevec_lookup(&pvec, bd_mapping, &index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
-   for (i = 0; i < pagevec_count(&pvec); i++) {
+   while (pagevec_lookup_range(&pvec, bd_mapping, &index, end,
+   PAGEVEC_SIZE)) {
+   count = pagevec_count(&pvec);
+   for (i = 0; i < count; i++) {
struct page *page = pvec.pages[i];
 
-   if (page->index > end)
-   break;
if (!page_has_buffers(page))
continue;
/*
@@ -1674,6 +1673,9 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
	}
	pagevec_release(&pvec);
cond_resched();
+   /* End of range already reached? */
+   if (index > end || !index)
+   break;
}
 }
 EXPORT_SYMBOL(clean_bdev_aliases);
-- 
2.12.3



[PATCH 12/35] xfs: Use pagevec_lookup_range() in xfs_find_get_desired_pgoff()

2017-06-01 Thread Jan Kara
We want only pages from given range in xfs_find_get_desired_pgoff(). Use
pagevec_lookup_range() instead of pagevec_lookup() and remove
unnecessary code. Note that the check for getting less pages than
desired can be removed because index gets updated by
pagevec_lookup_range().

CC: Darrick J. Wong <darrick.w...@oracle.com>
CC: linux-...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/xfs/xfs_file.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 487342078fc7..f9343dac7ff9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1045,13 +1045,11 @@ xfs_find_get_desired_pgoff(
endoff = XFS_FSB_TO_B(mp, map->br_startoff + map->br_blockcount);
end = (endoff - 1) >> PAGE_SHIFT;
do {
-   int want;
	unsigned	nr_pages;
	unsigned int	i;
 
-   want = min_t(pgoff_t, end - index, PAGEVEC_SIZE - 1) + 1;
-   nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &index,
- want);
+   nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping,
+   &index, end, PAGEVEC_SIZE);
if (nr_pages == 0)
break;
 
@@ -1075,9 +1073,6 @@ xfs_find_get_desired_pgoff(
*offset = lastoff;
goto out;
}
-   /* Searching done if the page index is out of range. */
-   if (page->index > end)
-   goto out;
 
lock_page(page);
/*
@@ -1117,13 +1112,6 @@ xfs_find_get_desired_pgoff(
unlock_page(page);
}
 
-   /*
-* The number of returned pages less than our desired, search
-* done.
-*/
-   if (nr_pages < want)
-   break;
-
	pagevec_release(&pvec);
} while (index <= end);
 
-- 
2.12.3



[PATCH 02/35] ext4: Fix SEEK_HOLE

2017-06-01 Thread Jan Kara
Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0. sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence  Result
HOLE0

Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
   -c "seek -h 0" file

wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0. sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0. sec (2 GiB/sec and 50. ops/sec)
Whence  Result
HOLE139264

Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.

The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).

Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.

CC: sta...@vger.kernel.org
Fixes: c8c0df241cc2719b1262e627f999638411934f60
CC: Zheng Liu <wenqing...@taobao.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/file.c | 50 ++
 1 file changed, 14 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 831fd6beebf0..bbea2dccd584 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -484,47 +484,27 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
num = min_t(pgoff_t, end - index, PAGEVEC_SIZE);
	nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index,
  (pgoff_t)num);
-   if (nr_pages == 0) {
-   if (whence == SEEK_DATA)
-   break;
-
-   BUG_ON(whence != SEEK_HOLE);
-   /*
-* If this is the first time to go into the loop and
-* offset is not beyond the end offset, it will be a
-* hole at this offset
-*/
-   if (lastoff == startoff || lastoff < endoff)
-   found = 1;
-   break;
-   }
-
-   /*
-* If this is the first time to go into the loop and
-* offset is smaller than the first page offset, it will be a
-* hole at this offset.
-*/
-   if (lastoff == startoff && whence == SEEK_HOLE &&
-   lastoff < page_offset(pvec.pages[0])) {
-   found = 1;
+   if (nr_pages == 0)
break;
-   }
 
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
struct buffer_head *bh, *head;
 
/*
-* If the current offset is not beyond the end of given
-* range, it will be a hole.
+* If current offset is smaller than the page offset,
+* there is a hole at this offset.
 */
-   if (lastoff < endoff && whence == SEEK_HOLE &&
-   page->index > end) {
+   if (whence == SEEK_HOLE && lastoff < endoff &&
+   lastoff < page_offset(pvec.pages[i])) {
found = 1;
*offset = lastoff;
goto out;
}
 
+   if (page->index > end)
+   goto out;
+
lock_page(page);
 
if (unlikely(page->mapping != inode->i_mapping)) {
@@ -564,20 +544,18 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
unlock_page(page);
}
 
-   /*
-* The no. of pages is less than our desired, that would be a
-* hole in there.
-*/
-   if (nr_pages < num && whence == SEEK_HOLE) {
- 

[PATCH 13/35] mm: Remove nr_pages argument from pagevec_lookup{,_range}()

2017-06-01 Thread Jan Kara
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/buffer.c | 3 +--
 fs/ext4/file.c  | 2 +-
 fs/ext4/inode.c | 5 ++---
 fs/fscache/page.c   | 2 +-
 fs/hugetlbfs/inode.c| 3 +--
 fs/nilfs2/page.c| 2 +-
 fs/xfs/xfs_file.c   | 2 +-
 include/linux/pagevec.h | 7 +++
 mm/swap.c   | 5 ++---
 9 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d63b22e50f38..89605cb42a55 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1638,8 +1638,7 @@ void clean_bdev_aliases(struct block_device *bdev, 
sector_t block, sector_t len)
 
end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
pagevec_init(&pvec, 0);
-   while (pagevec_lookup_range(&pvec, bd_mapping, &index, end,
-   PAGEVEC_SIZE)) {
+   while (pagevec_lookup_range(&pvec, bd_mapping, &index, end)) {
count = pagevec_count(&pvec);
for (i = 0; i < count; i++) {
struct page *page = pvec.pages[i];
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6821070a388b..2d9a198026e5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -482,7 +482,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
unsigned long nr_pages;
 
nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping,
-   &index, end, PAGEVEC_SIZE);
+   &index, end);
if (nr_pages == 0)
break;
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 59d82530d269..ace4bb9073d8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1670,8 +1670,7 @@ static void mpage_release_unused_pages(struct 
mpage_da_data *mpd,
 
pagevec_init(&pvec, 0);
while (index <= end) {
-   nr_pages = pagevec_lookup_range(&pvec, mapping, &index, end,
-   PAGEVEC_SIZE);
+   nr_pages = pagevec_lookup_range(&pvec, mapping, &index, end);
if (nr_pages == 0)
break;
for (i = 0; i < nr_pages; i++) {
@@ -2284,7 +2283,7 @@ static int mpage_map_and_submit_buffers(struct 
mpage_da_data *mpd)
pagevec_init(&pvec, 0);
while (start <= end) {
nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping,
-   &start, end, PAGEVEC_SIZE);
+   &start, end);
if (nr_pages == 0)
break;
for (i = 0; i < nr_pages; i++) {
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 83018861dcd2..0ad3fd3ad0b4 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -1178,7 +1178,7 @@ void __fscache_uncache_all_inode_pages(struct 
fscache_cookie *cookie,
pagevec_init(&pvec, 0);
next = 0;
do {
-   if (!pagevec_lookup(&pvec, mapping, &next, PAGEVEC_SIZE))
+   if (!pagevec_lookup(&pvec, mapping, &next))
break;
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 99885f9b9d56..461e01500e60 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -413,8 +413,7 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
/*
 * When no more pages are found, we are done.
 */
-   if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1,
- PAGEVEC_SIZE))
+   if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1))
break;
 
for (i = 0; i < pagevec_count(&pvec); ++i) {
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 382a36c72d72..8616c46d33da 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -312,7 +312,7 @@ void nilfs_copy_back_pages(struct address_space *dmap,
 
pagevec_init(&pvec, 0);
 repeat:
-   n = pagevec_lookup(&pvec, smap, &index, PAGEVEC_SIZE);
+   n = pagevec_lookup(&pvec, smap, &index);
if (!n)
return;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f9343dac7ff9..a7abc981e4a9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1049,7 +1049,7 @@ xfs_find_get_desired_pgoff(
unsigned int	i;
 
nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping,
-   &index, end, PAGEVEC_SIZE);
+   &index, end);
if (nr_pages == 0)
break;
 
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 7df056910437..4dcd5506f1ed 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -29,13 +29,12 @@ unsigned pagev

[PATCH 14/35] mm: Implement find_get_pages_range_tag()

2017-06-01 Thread Jan Kara
Implement a variant of find_get_pages_tag() that stops iterating at
given index. Lots of users of this function (through pagevec_lookup_tag())
actually want a range lookup and all of them are currently open-coding
this.

Also create corresponding pagevec_lookup_range_tag() function.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/pagemap.h | 12 ++--
 include/linux/pagevec.h | 11 +--
 mm/filemap.c| 33 -
 mm/swap.c   |  9 +
 4 files changed, 48 insertions(+), 17 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2bb5e636a8c8..a2d3534a514f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -348,8 +348,16 @@ static inline unsigned find_get_pages(struct address_space 
*mapping,
 }
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
   unsigned int nr_pages, struct page **pages);
-unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
-   int tag, unsigned int nr_pages, struct page **pages);
+unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
+   pgoff_t end, int tag, unsigned int nr_pages,
+   struct page **pages);
+static inline unsigned find_get_pages_tag(struct address_space *mapping,
+   pgoff_t *index, int tag, unsigned int nr_pages,
+   struct page **pages)
+{
+   return find_get_pages_range_tag(mapping, index, (pgoff_t)-1, tag,
+   nr_pages, pages);
+}
 unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
int tag, unsigned int nr_entries,
struct page **entries, pgoff_t *indices);
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 4dcd5506f1ed..371edacc10d5 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -37,9 +37,16 @@ static inline unsigned pagevec_lookup(struct pagevec *pvec,
return pagevec_lookup_range(pvec, mapping, start, (pgoff_t)-1);
 }
 
-unsigned pagevec_lookup_tag(struct pagevec *pvec,
+unsigned pagevec_lookup_range_tag(struct pagevec *pvec,
+   struct address_space *mapping, pgoff_t *index, pgoff_t end,
+   int tag, unsigned nr_pages);
+static inline unsigned pagevec_lookup_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, int tag,
-   unsigned nr_pages);
+   unsigned nr_pages)
+{
+   return pagevec_lookup_range_tag(pvec, mapping, index, (pgoff_t)-1, tag,
+   nr_pages);
+}
 
 static inline void pagevec_init(struct pagevec *pvec, int cold)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 2693f87a7968..56af68f6a375 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1612,9 +1612,10 @@ unsigned find_get_pages_contig(struct address_space 
*mapping, pgoff_t index,
 EXPORT_SYMBOL(find_get_pages_contig);
 
 /**
- * find_get_pages_tag - find and return pages that match @tag
+ * find_get_pages_range_tag - find and return pages in given range matching @tag
  * @mapping:   the address_space to search
  * @index: the starting page index
+ * @end:   The final page index (inclusive)
  * @tag:   the tag index
  * @nr_pages:  the maximum number of pages
  * @pages: where the resulting pages are placed
@@ -1622,8 +1623,9 @@ EXPORT_SYMBOL(find_get_pages_contig);
  * Like find_get_pages, except we only return pages which are tagged with
  * @tag.   We update @index to index the next page for the traversal.
  */
-unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
-   int tag, unsigned int nr_pages, struct page **pages)
+unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
+   pgoff_t end, int tag, unsigned int nr_pages,
+   struct page **pages)
 {
struct radix_tree_iter iter;
void **slot;
@@ -1636,6 +1638,9 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
	radix_tree_for_each_tagged(slot, &mapping->page_tree,
				   &iter, *index, tag) {
struct page *head, *page;
+
+   if (iter.index > end)
+   break;
 repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1677,18 +1682,28 @@ unsigned find_get_pages_tag(struct address_space 
*mapping, pgoff_t *index,
}
 
pages[ret] = page;
-   if (++ret == nr_pages)
-   break;
+   if (++ret == nr_pages) {
+   *index = pages[ret - 1]->index + 1;
+   goto out;
+   }
}
 
+   /*
+* We come here when we got at @en

[PATCH 07/35] mm: Implement find_get_pages_range()

2017-06-01 Thread Jan Kara
Implement a variant of find_get_pages() that stops iterating at given
index. This may be a substantial performance gain if the mapping is
sparse. See the following commit for details. Furthermore, lots of users
of this function (through pagevec_lookup()) actually want a range lookup
and all of them are currently open-coding this.

Also create corresponding pagevec_lookup_range() function.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/pagemap.h | 12 ++--
 include/linux/pagevec.h | 13 +++--
 mm/filemap.c| 42 ++
 mm/swap.c   | 22 ++
 4 files changed, 65 insertions(+), 24 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 86de6f9c8607..2bb5e636a8c8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -336,8 +336,16 @@ struct page *find_lock_entry(struct address_space 
*mapping, pgoff_t offset);
 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
  unsigned int nr_entries, struct page **entries,
  pgoff_t *indices);
-unsigned find_get_pages(struct address_space *mapping, pgoff_t *start,
-   unsigned int nr_pages, struct page **pages);
+unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
+   pgoff_t end, unsigned int nr_pages,
+   struct page **pages);
+static inline unsigned find_get_pages(struct address_space *mapping,
+   pgoff_t *start, unsigned int nr_pages,
+   struct page **pages)
+{
+   return find_get_pages_range(mapping, start, (pgoff_t)-1, nr_pages,
+   pages);
+}
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
   unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index c395a5bb58b2..7df056910437 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -27,8 +27,17 @@ unsigned pagevec_lookup_entries(struct pagevec *pvec,
pgoff_t start, unsigned nr_entries,
pgoff_t *indices);
 void pagevec_remove_exceptionals(struct pagevec *pvec);
-unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
-   pgoff_t *start, unsigned nr_pages);
+unsigned pagevec_lookup_range(struct pagevec *pvec,
+ struct address_space *mapping,
+ pgoff_t *start, pgoff_t end, unsigned nr_pages);
+static inline unsigned pagevec_lookup(struct pagevec *pvec,
+ struct address_space *mapping,
+ pgoff_t *start, unsigned nr_pages)
+{
+   return pagevec_lookup_range(pvec, mapping, start, (pgoff_t)-1,
+   nr_pages);
+}
+
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, int tag,
unsigned nr_pages);
diff --git a/mm/filemap.c b/mm/filemap.c
index 10d926a423e2..2693f87a7968 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1438,24 +1438,29 @@ unsigned find_get_entries(struct address_space *mapping,
 }
 
 /**
- * find_get_pages - gang pagecache lookup
+ * find_get_pages_range - gang pagecache lookup
  * @mapping:   The address_space to search
  * @start: The starting page index
+ * @end:   The final page index (inclusive)
  * @nr_pages:  The maximum number of pages
  * @pages: Where the resulting pages are placed
  *
- * find_get_pages() will search for and return a group of up to
- * @nr_pages pages in the mapping.  The pages are placed at @pages.
- * find_get_pages() takes a reference against the returned pages.
+ * find_get_pages_range() will search for and return a group of up to @nr_pages
+ * pages in the mapping starting at index @start and up to index @end
+ * (inclusive).  The pages are placed at @pages.  find_get_pages_range() takes
+ * a reference against the returned pages.
  *
  * The search returns a group of mapping-contiguous pages with ascending
  * indexes.  There may be holes in the indices due to not-present pages.
  * We also update @start to index the next page for the traversal.
  *
- * find_get_pages() returns the number of pages which were found.
+ * find_get_pages_range() returns the number of pages which were found. If this
+ * number is smaller than @nr_pages, the end of specified range has been
+ * reached.
  */
-unsigned find_get_pages(struct address_space *mapping, pgoff_t *start,
-   unsigned int nr_pages, struct page **pages)
+unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
+ pgoff_t end, unsigned int nr

[PATCH 16/35] ceph: Use pagevec_lookup_range_tag()

2017-06-01 Thread Jan Kara
We want only pages from given range in ceph_writepages_start(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

CC: Ilya Dryomov <idryo...@gmail.com>
CC: "Yan, Zheng" <z...@redhat.com>
CC: ceph-de...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ceph/addr.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 1e71e6ca5ddf..0b7e56ae3b8c 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -841,21 +841,16 @@ static int ceph_writepages_start(struct address_space 
*mapping,
struct page **pages = NULL, **data_pages;
mempool_t *pool = NULL; /* Becomes non-null if mempool used */
struct page *page;
-   int want;
u64 offset = 0, len = 0;
 
max_pages = max_pages_ever;
 
 get_more_pages:
first = -1;
-   want = min(end - index,
-  min((pgoff_t)PAGEVEC_SIZE,
-  max_pages - (pgoff_t)locked_pages) - 1)
-   + 1;
-   pvec_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-   PAGECACHE_TAG_DIRTY,
-   want);
-   dout("pagevec_lookup_tag got %d\n", pvec_pages);
+   pvec_pages = pagevec_lookup_range_tag(&pvec, mapping, &index,
+   end, PAGECACHE_TAG_DIRTY,
+   PAGEVEC_SIZE);
+   dout("pagevec_lookup_range_tag got %d\n", pvec_pages);
if (!pvec_pages && !locked_pages)
break;
for (i = 0; i < pvec_pages && locked_pages < max_pages; i++) {
@@ -873,12 +868,6 @@ static int ceph_writepages_start(struct address_space 
*mapping,
unlock_page(page);
break;
}
-   if (!wbc->range_cyclic && page->index > end) {
-   dout("end of range %p\n", page);
-   done = 1;
-   unlock_page(page);
-   break;
-   }
if (strip_unit_end && (page->index > strip_unit_end)) {
dout("end of strip unit %p\n", page);
unlock_page(page);
-- 
2.12.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 23/35] mm: Use pagevec_lookup_range_tag() in __filemap_fdatawait_range()

2017-06-01 Thread Jan Kara
Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is
interested only in pages from given range. Remove unnecessary code
resulting from this.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/filemap.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 56af68f6a375..8039b6bb9c27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -390,18 +390,13 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
 
pagevec_init(&pvec, 0);
while ((index <= end) &&
-   (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-   PAGECACHE_TAG_WRITEBACK,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
+   (nr_pages = pagevec_lookup_range_tag(&pvec, mapping,
+   &index, end, PAGECACHE_TAG_WRITEBACK, PAGEVEC_SIZE))) {
unsigned i;
 
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
 
-   /* until radix tree lookup accepts end_index */
-   if (page->index > end)
-   continue;
-
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
-- 
2.12.3



[PATCH 15/35] btrfs: Use pagevec_lookup_range_tag()

2017-06-01 Thread Jan Kara
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

CC: linux-btrfs@vger.kernel.org
CC: David Sterba <dste...@suse.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/btrfs/extent_io.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d8da3edf2ac3..6287eaba30ac 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3837,8 +3837,8 @@ int btree_write_cache_pages(struct address_space *mapping,
if (wbc->sync_mode == WB_SYNC_ALL)
tag_pages_for_writeback(mapping, index, end);
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -3848,11 +3848,6 @@ int btree_write_cache_pages(struct address_space 
*mapping,
if (!PagePrivate(page))
continue;
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   break;
-   }
-
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
@@ -3984,8 +3979,8 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE))) {
unsigned i;
 
scanned = 1;
@@ -4010,12 +4005,6 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
continue;
}
 
-   if (!wbc->range_cyclic && page->index > end) {
-   done = 1;
-   unlock_page(page);
-   continue;
-   }
-
if (wbc->sync_mode != WB_SYNC_NONE) {
if (PageWriteback(page))
flush_fn(data);
-- 
2.12.3



[PATCH 00/35 v1] pagevec API cleanups

2017-06-01 Thread Jan Kara
Hello,

This series cleans up pagevec API. The original motivation for the series is
the patch "fs: Fix performance regression in clean_bdev_aliases()" however it
has somewhat grown beyond that... The series is pretty large but most of the
patches are trivial in nature. What the series does is:

* Make all pagevec_lookup_ and find_get_ functions update index to where the
  search terminated. Currently tagged page lookup did update the index, other
  variants did not...

* Implement ranged variants for pagevec_lookup and find_get_ functions. Lots
  of callers actually want a ranged lookup and we unnecessarily open-code this
  in lots of them.

* Remove nr_pages argument from pagevec_ API since after implementing ranged
  lookups everyone just wants to pass PAGEVEC_SIZE there.

The conversion of the APIs for entries variants is not such a clear win as
for the other cases as callers tend to play more complex games with indices
etc. (hello THP). I still think the conversion is worth it for consistency
but I'm open to ideas (including just discarding that part) there.

The series also contains several fixes in the beginning to the bugs that I've
found during these cleanups. I've included them to have a clean base (4.12-rc3)
but those should get merged independently (e.g. ext4 fixes are already sitting
in ext4 tree). Also it is possible to split the series in smaller parts (like
convert one API at a time) however I wanted to post the full series so that
people can get the full picture.

The series can be also obtained from my git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git 
find_get_pages_range

Opinions and review welcome!

Honza


[PATCH 22/35] nilfs2: Use pagevec_lookup_range_tag()

2017-06-01 Thread Jan Kara
We want only pages from given range in
nilfs_lookup_dirty_data_buffers(). Use pagevec_lookup_range_tag()
instead of pagevec_lookup_tag() and remove unnecessary code.

CC: Ryusuke Konishi <konishi.ryus...@lab.ntt.co.jp>
CC: linux-ni...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/nilfs2/segment.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index febed1217b3f..fd9eeca5f784 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -711,18 +711,14 @@ static size_t nilfs_lookup_dirty_data_buffers(struct 
inode *inode,
pagevec_init(&pvec, 0);
  repeat:
if (unlikely(index > last) ||
-   !pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY,
-   min_t(pgoff_t, last - index,
- PAGEVEC_SIZE - 1) + 1))
+   !pagevec_lookup_range_tag(&pvec, mapping, &index, last,
+   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE))
return ndirties;
 
for (i = 0; i < pagevec_count(&pvec); i++) {
struct buffer_head *bh, *head;
struct page *page = pvec.pages[i];
 
-   if (unlikely(page->index > last))
-   break;
-
lock_page(page);
if (!page_has_buffers(page))
create_empty_buffers(page, i_blocksize(inode), 0);
-- 
2.12.3



[PATCH 18/35] f2fs: Use pagevec_lookup_range_tag()

2017-06-01 Thread Jan Kara
We want only pages from given range in f2fs_write_cache_pages(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

CC: Jaegeuk Kim <jaeg...@kernel.org>
CC: linux-f2fs-de...@lists.sourceforge.net
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/f2fs/data.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 7c0f6bdf817d..3e6244a82ac5 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1603,8 +1603,8 @@ static int f2fs_write_cache_pages(struct address_space 
*mapping,
while (!done && (index <= end)) {
int i;
 
-   nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
- min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1);
+   nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE);
if (nr_pages == 0)
break;
 
@@ -1612,11 +1612,6 @@ static int f2fs_write_cache_pages(struct address_space 
*mapping,
struct page *page = pvec.pages[i];
bool submitted = false;
 
-   if (page->index > end) {
-   done = 1;
-   break;
-   }
-
done_index = page->index;
 
lock_page(page);
-- 
2.12.3



[PATCH 25/35] mm: Remove nr_pages argument from pagevec_lookup_{,range}_tag()

2017-06-01 Thread Jan Kara
All users of pagevec_lookup_tag() and pagevec_lookup_range_tag() now pass
PAGEVEC_SIZE as the desired number of pages. Just drop the argument.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/btrfs/extent_io.c| 6 +++---
 fs/ceph/addr.c  | 3 +--
 fs/ext4/inode.c | 2 +-
 fs/f2fs/checkpoint.c| 2 +-
 fs/f2fs/data.c  | 2 +-
 fs/f2fs/node.c  | 8 
 fs/gfs2/aops.c  | 2 +-
 fs/nilfs2/btree.c   | 4 ++--
 fs/nilfs2/page.c| 7 +++
 fs/nilfs2/segment.c | 6 +++---
 include/linux/pagevec.h | 8 +++-
 mm/filemap.c| 2 +-
 mm/page-writeback.c | 2 +-
 mm/swap.c   | 4 ++--
 14 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6287eaba30ac..53d742a5a99b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3838,7 +3838,7 @@ int btree_write_cache_pages(struct address_space *mapping,
tag_pages_for_writeback(mapping, index, end);
while (!done && !nr_to_write_done && (index <= end) &&
   (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
-   tag, PAGEVEC_SIZE))) {
+   tag))) {
unsigned i;
 
scanned = 1;
@@ -3979,8 +3979,8 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
-  (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
-   tag, PAGEVEC_SIZE))) {
+   (nr_pages = pagevec_lookup_range_tag(&pvec, mapping,
+   &index, end, tag))) {
unsigned i;
 
scanned = 1;
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 0b7e56ae3b8c..a0d5c46fc9bf 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -848,8 +848,7 @@ static int ceph_writepages_start(struct address_space 
*mapping,
 get_more_pages:
first = -1;
pvec_pages = pagevec_lookup_range_tag(&pvec, mapping, &index,
-   end, PAGECACHE_TAG_DIRTY,
-   PAGEVEC_SIZE);
+   end, PAGECACHE_TAG_DIRTY);
dout("pagevec_lookup_range_tag got %d\n", pvec_pages);
if (!pvec_pages && !locked_pages)
break;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 050fba2d12c2..d1896f14d72f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2556,7 +2556,7 @@ static int mpage_prepare_extent_to_map(struct 
mpage_da_data *mpd)
mpd->next_page = index;
while (index <= end) {
nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
-   tag, PAGEVEC_SIZE);
+   tag);
if (nr_pages == 0)
goto out;
 
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 6da86eac758a..ad5bc5340ba2 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -310,7 +310,7 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum 
page_type type,
blk_start_plug();
 
while (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
+   PAGECACHE_TAG_DIRTY)) {
int i;
 
for (i = 0; i < nr_pages; i++) {
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 3e6244a82ac5..afca42392fbd 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1604,7 +1604,7 @@ static int f2fs_write_cache_pages(struct address_space 
*mapping,
int i;
 
nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
-   tag, PAGEVEC_SIZE);
+   tag);
if (nr_pages == 0)
break;
 
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index dd53bcd9fc46..00cae42c778a 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1268,7 +1268,7 @@ static struct page *last_fsync_dnode(struct f2fs_sb_info 
*sbi, nid_t ino)
index = 0;
 
while (nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index,
-   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
+   PAGECACHE_TAG_DIRTY)) {
int i;
 
for (i = 0; i < nr_pages; i++) {
@@ -1418,7 +1418,7 @@ int fsync_node_pages(struct f2fs_sb_info *sbi, struct 
inode *inode,
index = 0;
 
while (nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index,
-   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
+   PAGECACHE_TAG_DIRTY)) {
int i;
 
for (i = 0; i <

[PATCH 33/35] mm: Remove nr_entries argument from pagevec_lookup_entries{,_range}()

2017-06-01 Thread Jan Kara
All users pass PAGEVEC_SIZE as the number of entries now. Remove the
argument.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/pagevec.h | 7 +++
 mm/shmem.c  | 4 ++--
 mm/swap.c   | 6 ++
 mm/truncate.c   | 8 
 4 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 93308689d6a7..f765fc5eca31 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -25,14 +25,13 @@ void __pagevec_lru_add(struct pagevec *pvec);
 unsigned pagevec_lookup_entries_range(struct pagevec *pvec,
struct address_space *mapping,
pgoff_t *start, pgoff_t end,
-   unsigned nr_entries, pgoff_t *indices);
+   pgoff_t *indices);
 static inline unsigned pagevec_lookup_entries(struct pagevec *pvec,
struct address_space *mapping,
-   pgoff_t *start, unsigned nr_entries,
-   pgoff_t *indices)
+   pgoff_t *start, pgoff_t *indices)
 {
return pagevec_lookup_entries_range(pvec, mapping, start, (pgoff_t)-1,
-   nr_entries, indices);
+   indices);
 }
 void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup_range(struct pagevec *pvec,
diff --git a/mm/shmem.c b/mm/shmem.c
index e5ea044aae24..dd8144230ecf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -769,7 +769,7 @@ static void shmem_undo_range(struct inode *inode, loff_t 
lstart, loff_t lend,
index = start;
while (index < end) {
if (!pagevec_lookup_entries_range(&pvec, mapping, &index,
-   end - 1, PAGEVEC_SIZE, indices))
+   end - 1, indices))
break;
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -857,7 +857,7 @@ static void shmem_undo_range(struct inode *inode, loff_t 
lstart, loff_t lend,
cond_resched();
 
if (!pagevec_lookup_entries_range(&pvec, mapping, &index,
-   end - 1, PAGEVEC_SIZE, indices)) {
+   end - 1, indices)) {
/* If all gone or hole-punch or unfalloc, we're done */
if (lookup_start == start || end != -1)
break;
diff --git a/mm/swap.c b/mm/swap.c
index 88c7eb4e97db..1640bbb34e59 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -894,7 +894,6 @@ EXPORT_SYMBOL(__pagevec_lru_add);
  * @mapping:   The address_space to search
  * @start: The starting entry index
  * @end:   The final entry index (inclusive)
- * @nr_entries:The maximum number of entries
  * @indices:   The cache indices corresponding to the entries in @pvec
  *
  * pagevec_lookup_entries() will search for and return a group of up
@@ -911,10 +910,9 @@ EXPORT_SYMBOL(__pagevec_lru_add);
  */
 unsigned pagevec_lookup_entries_range(struct pagevec *pvec,
struct address_space *mapping,
-   pgoff_t *start, pgoff_t end, unsigned nr_pages,
-   pgoff_t *indices)
+   pgoff_t *start, pgoff_t end, pgoff_t *indices)
 {
-   pvec->nr = find_get_entries_range(mapping, start, end, nr_pages,
+   pvec->nr = find_get_entries_range(mapping, start, end, PAGEVEC_SIZE,
  pvec->pages, indices);
return pagevec_count(pvec);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 31d5c5f3da30..d35531d83cb3 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -290,7 +290,7 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
pagevec_init(&pvec, 0);
index = start;
while (index < end && pagevec_lookup_entries_range(&pvec, mapping,
-   &index, end - 1, PAGEVEC_SIZE, indices)) {
+   &index, end - 1, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
@@ -354,7 +354,7 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
 
cond_resched();
if (!pagevec_lookup_entries_range(&pvec, mapping, &index,
-   end - 1, PAGEVEC_SIZE, indices)) {
+   end - 1, indices)) {
/* If all gone from start onwards, we're done */
if (lookup_start == start)
break;
@@ -476,7 +476,7 @@ unsigned long invalidate_mapping_pages(struct address_space 
*mapping,
 
pagevec_init(&pvec, 0);
while (index <= end && pagevec_lookup_entries_range(&pvec, mapping,
- 

[PATCH 26/35] afs: Use find_get_pages_range_tag()

2017-06-01 Thread Jan Kara
Use find_get_pages_range_tag() in afs_writepages_region() as we are
interested only in pages from given range. Remove unnecessary code after
this conversion.

CC: David Howells <dhowe...@redhat.com>
CC: linux-...@lists.infradead.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/afs/write.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 2d2fccd5044b..630f2a42fae7 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -497,20 +497,13 @@ static int afs_writepages_region(struct address_space 
*mapping,
_enter(",,%lx,%lx,", index, end);
 
do {
-   n = find_get_pages_tag(mapping, &index, PAGECACHE_TAG_DIRTY,
-  1, &page);
+   n = find_get_pages_range_tag(mapping, &index, end,
+   PAGECACHE_TAG_DIRTY, 1, &page);
if (!n)
break;
 
_debug("wback %lx", page->index);
 
-   if (page->index > end) {
-   *_next = index;
-   put_page(page);
-   _leave(" = 0 [%lx]", *_next);
-   return 0;
-   }
-
/* at this point we hold neither mapping->tree_lock nor lock on
 * the page itself: the page may be truncated or invalidated
 * (changing page->mapping to NULL), or even swizzled back from
-- 
2.12.3



[PATCH 21/35] gfs2: Use pagevec_lookup_range_tag()

2017-06-01 Thread Jan Kara
We want only pages from given range in gfs2_write_cache_jdata(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

CC: Bob Peterson <rpete...@redhat.com>
CC: cluster-de...@redhat.com
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/gfs2/aops.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ed7a2e252ad8..158ceb900ab5 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -268,22 +268,6 @@ static int gfs2_write_jdata_pagevec(struct address_space *mapping,
for(i = 0; i < nr_pages; i++) {
struct page *page = pvec->pages[i];
 
-   /*
-* At this point, the page may be truncated or
-* invalidated (changing page->mapping to NULL), or
-* even swizzled back from swapper_space to tmpfs file
-* mapping. However, page->index will not change
-* because we have a reference on the page.
-*/
-   if (page->index > end) {
-   /*
-* can't be range_cyclic (1st pass) because
-* end == -1 in that case.
-*/
-   ret = 1;
-   break;
-   }
-
*done_index = page->index;
 
lock_page(page);
@@ -401,8 +385,8 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && (index <= end)) {
-   nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
- min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+   nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE);
if (nr_pages == 0)
break;
 
-- 
2.12.3



[PATCH 28/35] shmem: Use pagevec_lookup_entries()

2017-06-01 Thread Jan Kara
Currently shmem uses find_get_entries(), which just open-codes what
pagevec_lookup_entries() does. Use pagevec_lookup_entries() instead,
except for one case that plays tricks with the number of pages looked
up and thus won't fit how we will make pagevec_lookup_entries() work.

CC: Hugh Dickins <hu...@google.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/shmem.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index a614a9cfb58c..8a6fddec27a1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -768,10 +768,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
pagevec_init(&pvec, 0);
index = start;
while (index < end) {
-   pvec.nr = find_get_entries(mapping, index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE),
-   pvec.pages, indices);
-   if (!pvec.nr)
+   if (!pagevec_lookup_entries(&pvec, mapping, index,
+   min(end - index, (pgoff_t)PAGEVEC_SIZE),
+   indices))
break;
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -859,10 +858,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
while (index < end) {
cond_resched();
 
-   pvec.nr = find_get_entries(mapping, index,
+   if (!pagevec_lookup_entries(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
-   pvec.pages, indices);
-   if (!pvec.nr) {
+   indices)) {
/* If all gone or hole-punch or unfalloc, we're done */
if (index == start || end != -1)
break;
-- 
2.12.3



[PATCH 34/35] mm: Make find_get_entries_tag() update index

2017-06-01 Thread Jan Kara
Make find_get_entries_tag() update 'start' to index the next page for
iteration.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/dax.c                | 3 +--
 include/linux/pagemap.h | 2 +-
 mm/filemap.c            | 8 ++++++--
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c204445a69b0..4b295c544fd4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -841,7 +841,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
pagevec_init(&pvec, 0);
while (!done) {
-   pvec.nr = find_get_entries_tag(mapping, start_index,
+   pvec.nr = find_get_entries_tag(mapping, &start_index,
PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
pvec.pages, indices);
 
@@ -859,7 +859,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
if (ret < 0)
goto out;
}
-   start_index = indices[pvec.nr - 1] + 1;
}
 out:
put_dax(dax_dev);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index df128a56f44b..1dc7e54ec32a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -365,7 +365,7 @@ static inline unsigned find_get_pages_tag(struct address_space *mapping,
return find_get_pages_range_tag(mapping, index, (pgoff_t)-1, tag,
nr_pages, pages);
 }
-unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t *start,
int tag, unsigned int nr_entries,
struct page **entries, pgoff_t *indices);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index e55100459710..3eb05c91c07a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1731,7 +1731,7 @@ EXPORT_SYMBOL(find_get_pages_range_tag);
  * Like find_get_entries, except we only return entries which are tagged with
  * @tag.
  */
-unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t *start,
int tag, unsigned int nr_entries,
struct page **entries, pgoff_t *indices)
 {
@@ -1744,7 +1744,7 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
 
rcu_read_lock();
radix_tree_for_each_tagged(slot, &mapping->page_tree,
-  &iter, start, tag) {
+  &iter, *start, tag) {
struct page *head, *page;
 repeat:
page = radix_tree_deref_slot(slot);
@@ -1786,6 +1786,10 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t *start,
break;
}
rcu_read_unlock();
+
+   if (ret)
+   *start = indices[ret - 1] + 1;
+
return ret;
 }
 EXPORT_SYMBOL(find_get_entries_tag);
-- 
2.12.3



[PATCH 20/35] f2fs: Use find_get_pages_tag() for looking up single page

2017-06-01 Thread Jan Kara
__get_first_dirty_index() wants to lookup only the first dirty page
after given index. There's no point in using pagevec_lookup_tag() for
that. Just use find_get_pages_tag() directly.

CC: Jaegeuk Kim <jaeg...@kernel.org>
CC: linux-f2fs-de...@lists.sourceforge.net
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/f2fs/file.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 61af721329fa..52df1ef66883 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -286,18 +286,19 @@ int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 static pgoff_t __get_first_dirty_index(struct address_space *mapping,
pgoff_t pgofs, int whence)
 {
-   struct pagevec pvec;
+   struct page *page;
int nr_pages;
 
if (whence != SEEK_DATA)
return 0;
 
/* find first dirty page index */
-   pagevec_init(&pvec, 0);
-   nr_pages = pagevec_lookup_tag(&pvec, mapping, &pgofs,
-   PAGECACHE_TAG_DIRTY, 1);
-   pgofs = nr_pages ? pvec.pages[0]->index : ULONG_MAX;
-   pagevec_release(&pvec);
+   nr_pages = find_get_pages_tag(mapping, &pgofs, PAGECACHE_TAG_DIRTY,
+ 1, &page);
+   if (!nr_pages)
+   return ULONG_MAX;
+   pgofs = page->index;
+   put_page(page);
return pgofs;
 }
 
-- 
2.12.3



[PATCH 19/35] f2fs: Simplify page iteration loops

2017-06-01 Thread Jan Kara
In several places we want to iterate over all tagged pages in a mapping.
However, the code was apparently copied from places that iterate only
over a limited range: it checks for index <= end and optimizes the case
where we are coming close to the range end, all of which is pointless
when end == ULONG_MAX. So just remove this dead code.

CC: Jaegeuk Kim <jaeg...@kernel.org>
CC: linux-f2fs-de...@lists.sourceforge.net
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/f2fs/checkpoint.c | 13 ---
 fs/f2fs/node.c   | 65 +++-
 2 files changed, 28 insertions(+), 50 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index ea9c317b5916..6da86eac758a 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -296,9 +296,10 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
long nr_to_write)
 {
struct address_space *mapping = META_MAPPING(sbi);
-   pgoff_t index = 0, end = ULONG_MAX, prev = ULONG_MAX;
+   pgoff_t index = 0, prev = ULONG_MAX;
struct pagevec pvec;
long nwritten = 0;
+   int nr_pages;
struct writeback_control wbc = {
.for_reclaim = 0,
};
@@ -308,13 +309,9 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
 
blk_start_plug(&plug);
 
-   while (index <= end) {
-   int i, nr_pages;
-   nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-   PAGECACHE_TAG_DIRTY,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
-   if (unlikely(nr_pages == 0))
-   break;
+   while (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
+   int i;
 
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 4547c5c5cd98..dd53bcd9fc46 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1259,21 +1259,17 @@ void move_node_page(struct page *node_page, int gc_type)
 
 static struct page *last_fsync_dnode(struct f2fs_sb_info *sbi, nid_t ino)
 {
-   pgoff_t index, end;
+   pgoff_t index;
struct pagevec pvec;
struct page *last_page = NULL;
+   int nr_pages;
 
pagevec_init(&pvec, 0);
index = 0;
-   end = ULONG_MAX;
-
-   while (index <= end) {
-   int i, nr_pages;
-   nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index,
-   PAGECACHE_TAG_DIRTY,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
-   if (nr_pages == 0)
-   break;
+
+   while (nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index,
+   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
+   int i;
 
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -1403,13 +1399,14 @@ static int f2fs_write_node_page(struct page *page,
 int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
struct writeback_control *wbc, bool atomic)
 {
-   pgoff_t index, end;
+   pgoff_t index;
pgoff_t last_idx = ULONG_MAX;
struct pagevec pvec;
int ret = 0;
struct page *last_page = NULL;
bool marked = false;
nid_t ino = inode->i_ino;
+   int nr_pages;
 
if (atomic) {
last_page = last_fsync_dnode(sbi, ino);
@@ -1419,15 +1416,10 @@ int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 retry:
pagevec_init(&pvec, 0);
index = 0;
-   end = ULONG_MAX;
-
-   while (index <= end) {
-   int i, nr_pages;
-   nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index,
-   PAGECACHE_TAG_DIRTY,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
-   if (nr_pages == 0)
-   break;
+
+   while (nr_pages = pagevec_lookup_tag(&pvec, NODE_MAPPING(sbi), &index,
+   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
+   int i;
 
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -1525,25 +1517,21 @@ int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 
 int sync_node_pages(struct f2fs_sb_info *sbi, struct writeback_control *wbc)
 {
-   pgoff_t index, end;
+   pgoff_t index;
struct pagevec pvec;
int step = 0;
int nwritten = 0;
int ret = 0;
+   int nr_pages;
 
pagevec_init(&pvec, 0);
 
 next_step:
index = 0;
-   end = ULONG_MAX;
-
-   while (index <= end) {
-   int i, nr_pages;
-   nr_pages 

[PATCH 24/35] mm: Use pagevec_lookup_range_tag() in write_cache_pages()

2017-06-01 Thread Jan Kara
Use pagevec_lookup_range_tag() in write_cache_pages() as it is
interested only in pages from given range. Remove unnecessary code
resulting from this.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/page-writeback.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 143c1c25d680..c77c387465ec 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2194,30 +2194,14 @@ int write_cache_pages(struct address_space *mapping,
while (!done && (index <= end)) {
int i;
 
-   nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
- min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+   nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE);
if (nr_pages == 0)
break;
 
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
 
-   /*
-* At this point, the page may be truncated or
-* invalidated (changing page->mapping to NULL), or
-* even swizzled back from swapper_space to tmpfs file
-* mapping. However, page->index will not change
-* because we have a reference on the page.
-*/
-   if (page->index > end) {
-   /*
-* can't be range_cyclic (1st pass) because
-* end == -1 in that case.
-*/
-   done = 1;
-   break;
-   }
-
done_index = page->index;
 
lock_page(page);
-- 
2.12.3



[PATCH 32/35] mm: Convert truncate code to pagevec_lookup_entries_range()

2017-06-01 Thread Jan Kara
All radix tree scanning code in truncate paths is interested only in
pages from given range. Convert them to pagevec_lookup_entries_range().

Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/truncate.c | 52 +++++++++-------------------------------------------
 1 file changed, 9 insertions(+), 43 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 9efc82f18b74..31d5c5f3da30 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -289,16 +289,11 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
pagevec_init(&pvec, 0);
index = start;
-   while (index < end && pagevec_lookup_entries(&pvec, mapping, &index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE),
-   indices)) {
+   while (index < end && pagevec_lookup_entries_range(&pvec, mapping,
+   &index, end - 1, PAGEVEC_SIZE, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
-   /* We rely upon deletion not changing page->index */
-   if (indices[i] >= end)
-   break;
-
if (radix_tree_exceptional_entry(page)) {
truncate_exceptional_entry(mapping, indices[i],
   page);
@@ -352,20 +347,14 @@ void truncate_inode_pages_range(struct address_space *mapping,
put_page(page);
}
}
-   /*
-* If the truncation happened within a single page no pages
-* will be released, just zeroed, so we can bail out now.
-*/
-   if (start >= end)
-   goto out;
 
index = start;
-   for ( ; ; ) {
+   while (index < end) {
pgoff_t lookup_start = index;
 
cond_resched();
-   if (!pagevec_lookup_entries(&pvec, mapping, &index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
+   if (!pagevec_lookup_entries_range(&pvec, mapping, &index,
+   end - 1, PAGEVEC_SIZE, indices)) {
/* If all gone from start onwards, we're done */
if (lookup_start == start)
break;
@@ -373,22 +362,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
index = start;
continue;
}
-   if (lookup_start == start && indices[0] >= end) {
-   /* All gone out of hole to be punched, we're done */
-   pagevec_remove_exceptionals(&pvec);
-   pagevec_release(&pvec);
-   break;
-   }
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
-   /* We rely upon deletion not changing page->index */
-   if (indices[i] >= end) {
-   /* Restart punch to make sure all gone */
-   index = start;
-   break;
-   }
-
if (radix_tree_exceptional_entry(page)) {
truncate_exceptional_entry(mapping, indices[i],
   page);
@@ -499,16 +475,11 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
int i;
 
pagevec_init(&pvec, 0);
-   while (index <= end && pagevec_lookup_entries(&pvec, mapping, &index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
-   indices)) {
+   while (index <= end && pagevec_lookup_entries_range(&pvec, mapping,
+   &index, end, PAGEVEC_SIZE, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
-   /* We rely upon deletion not changing page->index */
-   if (indices[i] > end)
-   break;
-
if (radix_tree_exceptional_entry(page)) {
invalidate_exceptional_entry(mapping,
 indices[i], page);
@@ -629,16 +600,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 
pagevec_init(&pvec, 0);
index = start;
-   while (index <= end && pagevec_lookup_entries(&pvec, mapping, &index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
-   indices)) {
+   while (index <= end && pagevec_lookup_entries_range(&pvec, mapping,
+   &index, end, PAGEVEC_SIZE, indices)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
   

[PATCH 29/35] mm: Make pagevec_lookup_entries() update index

2017-06-01 Thread Jan Kara
Make pagevec_lookup_entries() (and underlying find_get_entries()) update
index to the next page where iteration should continue. This is mostly
for consistency with pagevec_lookup() and future
pagevec_lookup_entries_range().

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/pagemap.h |  2 +-
 include/linux/pagevec.h |  2 +-
 mm/filemap.c| 11 ++---
 mm/shmem.c  | 57 +++
 mm/swap.c   |  4 +--
 mm/truncate.c   | 65 +++--
 6 files changed, 72 insertions(+), 69 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a2d3534a514f..283d191c18be 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -333,7 +333,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
 
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
-unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
+unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
  unsigned int nr_entries, struct page **entries,
  pgoff_t *indices);
 unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index f3f2b9690764..3798c142338d 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -24,7 +24,7 @@ void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
 unsigned pagevec_lookup_entries(struct pagevec *pvec,
struct address_space *mapping,
-   pgoff_t start, unsigned nr_entries,
+   pgoff_t *start, unsigned nr_entries,
pgoff_t *indices);
 void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup_range(struct pagevec *pvec,
diff --git a/mm/filemap.c b/mm/filemap.c
index 910f2e39fef2..de12b7355821 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1373,11 +1373,11 @@ EXPORT_SYMBOL(pagecache_get_page);
  * Any shadow entries of evicted pages, or swap entries from
  * shmem/tmpfs, are included in the returned array.
  *
- * find_get_entries() returns the number of pages and shadow entries
- * which were found.
+ * find_get_entries() returns the number of pages and shadow entries which were
+ * found. It also updates @start to index the next page for the traversal.
  */
 unsigned find_get_entries(struct address_space *mapping,
- pgoff_t start, unsigned int nr_entries,
+ pgoff_t *start, unsigned int nr_entries,
  struct page **entries, pgoff_t *indices)
 {
void **slot;
@@ -1388,7 +1388,7 @@ unsigned find_get_entries(struct address_space *mapping,
return 0;
 
rcu_read_lock();
-   radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+   radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, *start) {
struct page *head, *page;
 repeat:
page = radix_tree_deref_slot(slot);
@@ -1429,6 +1429,9 @@ unsigned find_get_entries(struct address_space *mapping,
break;
}
rcu_read_unlock();
+
+   if (ret)
+   *start = indices[ret - 1] + 1;
return ret;
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 8a6fddec27a1..f9c4afbdd70c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -768,26 +768,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
pagevec_init(&pvec, 0);
index = start;
while (index < end) {
-   if (!pagevec_lookup_entries(&pvec, mapping, index,
+   if (!pagevec_lookup_entries(&pvec, mapping, &index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
indices))
break;
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
-   index = indices[i];
-   if (index >= end)
+   if (indices[i] >= end)
break;
 
if (radix_tree_exceptional_entry(page)) {
if (unfalloc)
continue;
nr_swaps_freed += !shmem_free_swap(mapping,
-   index, page);
+   indices[i], page);
continue;
}
 
-   VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page);
+   VM_BUG_ON_PAGE(page_to_pgoff(p

[PATCH 35/35] mm: Implement find_get_entries_range_tag()

2017-06-01 Thread Jan Kara
Implement find_get_entries_range_tag() (actually, convert
find_get_entries_tag() to it, as the only user of
find_get_entries_tag() needs a ranged lookup) and use it in DAX, which
is the only user of this interface. This is mostly for consistency with
other page/entry iteration interfaces.

Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/dax.c| 12 +++-
 include/linux/pagemap.h |  3 ++-
 mm/filemap.c| 36 ++--
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4b295c544fd4..acf17b55f76b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -819,7 +819,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
pgoff_t indices[PAGEVEC_SIZE];
struct dax_device *dax_dev;
struct pagevec pvec;
-   bool done = false;
int i, ret = 0;
 
if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
@@ -840,20 +839,15 @@ int dax_writeback_mapping_range(struct address_space *mapping,
tag_pages_for_writeback(mapping, start_index, end_index);
 
pagevec_init(&pvec, 0);
-   while (!done) {
-   pvec.nr = find_get_entries_tag(mapping, &start_index,
-   PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+   while (start_index <= end_index) {
+   pvec.nr = find_get_entries_range_tag(mapping, &start_index,
+   end_index, PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
pvec.pages, indices);
 
if (pvec.nr == 0)
break;
 
for (i = 0; i < pvec.nr; i++) {
-   if (indices[i] > end_index) {
-   done = true;
-   break;
-   }
-
ret = dax_writeback_one(bdev, dax_dev, mapping,
indices[i], pvec.pages[i]);
if (ret < 0)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1dc7e54ec32a..38227e670a83 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -365,7 +365,8 @@ static inline unsigned find_get_pages_tag(struct address_space *mapping,
return find_get_pages_range_tag(mapping, index, (pgoff_t)-1, tag,
nr_pages, pages);
 }
-unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t *start,
+unsigned find_get_entries_range_tag(struct address_space *mapping,
+   pgoff_t *start, pgoff_t end,
int tag, unsigned int nr_entries,
struct page **entries, pgoff_t *indices);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 3eb05c91c07a..06f82ed9096e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1720,9 +1720,10 @@ unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
 EXPORT_SYMBOL(find_get_pages_range_tag);
 
 /**
- * find_get_entries_tag - find and return entries that match @tag
+ * find_get_entries_range_tag - find and return entries that match @tag
  * @mapping:   the address_space to search
  * @start: the starting page cache index
+ * @end:   the final page cache index (inclusive)
  * @tag:   the tag index
  * @nr_entries:the maximum number of entries
  * @entries:   where the resulting entries are placed
@@ -1731,9 +1732,10 @@ EXPORT_SYMBOL(find_get_pages_range_tag);
  * Like find_get_entries, except we only return entries which are tagged with
  * @tag.
  */
-unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t *start,
-   int tag, unsigned int nr_entries,
-   struct page **entries, pgoff_t *indices)
+unsigned find_get_entries_range_tag(struct address_space *mapping,
+   pgoff_t *start, pgoff_t end, int tag,
+   unsigned int nr_entries, struct page **entries,
+   pgoff_t *indices)
 {
void **slot;
unsigned int ret = 0;
@@ -1746,6 +1748,9 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t *start,
radix_tree_for_each_tagged(slot, &mapping->page_tree,
   &iter, *start, tag) {
struct page *head, *page;
+
+   if (iter.index > end)
+   break;
 repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1782,17 +1787,28 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t *start,
 export:
indices[ret] = iter.index;
entries[ret] = page;
-   if (++ret == nr_entries)
-   break;
+   if (++ret == nr_entries) {
+   *start = indices[ret - 1] + 1;
+   goto out;
+   }
}
-   rcu_read_unlock();
 
-   if (ret)
-   *s

[PATCH 27/35] shmem: Use pagevec_lookup() in shmem_unlock_mapping()

2017-06-01 Thread Jan Kara
The comment about find_get_pages() returning 0 as if finished when it
hits a row of swap entries seems to be stale. Use pagevec_lookup() in
shmem_unlock_mapping() to simplify the code.

CC: Hugh Dickins <hu...@google.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/shmem.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba4e98e..a614a9cfb58c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -729,24 +729,14 @@ unsigned long shmem_swap_usage(struct vm_area_struct *vma)
 void shmem_unlock_mapping(struct address_space *mapping)
 {
struct pagevec pvec;
-   pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index = 0;
 
pagevec_init(&pvec, 0);
/*
 * Minor point, but we might as well stop if someone else SHM_LOCKs it.
 */
-   while (!mapping_unevictable(mapping)) {
-   /*
-* Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
-* has finished, if it hits a row of PAGEVEC_SIZE swap entries.
-*/
-   pvec.nr = find_get_entries(mapping, index,
-  PAGEVEC_SIZE, pvec.pages, indices);
-   if (!pvec.nr)
-   break;
-   index = indices[pvec.nr - 1] + 1;
-   pagevec_remove_exceptionals(&pvec);
+   while (!mapping_unevictable(mapping) &&
+   pagevec_lookup(&pvec, mapping, &index)) {
check_move_unevictable_pages(pvec.pages, pvec.nr);
pagevec_release(&pvec);
cond_resched();
-- 
2.12.3



[PATCH 30/35] mm: Implement find_get_entries_range()

2017-06-01 Thread Jan Kara
Implement a variant of find_get_entries() that stops iterating at a
given index. Some callers want this, so let's provide the interface.
Also, it makes the interface consistent with find_get_pages().

Signed-off-by: Jan Kara <j...@suse.cz>
---
 include/linux/pagemap.h | 13 ++---
 include/linux/pagevec.h | 12 ++--
 mm/filemap.c| 32 
 mm/swap.c   | 11 ++-
 4 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 283d191c18be..df128a56f44b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -333,9 +333,16 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
 
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
-unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
- unsigned int nr_entries, struct page **entries,
- pgoff_t *indices);
+unsigned find_get_entries_range(struct address_space *mapping, pgoff_t *start,
+   pgoff_t end, unsigned int nr_entries,
+   struct page **entries, pgoff_t *indices);
+static inline unsigned find_get_entries(struct address_space *mapping,
+   pgoff_t *start, unsigned int nr_entries,
+   struct page **entries, pgoff_t *indices)
+{
+   return find_get_entries_range(mapping, start, (pgoff_t)-1, nr_entries,
+ entries, indices);
+}
 unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
pgoff_t end, unsigned int nr_pages,
struct page **pages);
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 3798c142338d..93308689d6a7 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,10 +22,18 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
-unsigned pagevec_lookup_entries(struct pagevec *pvec,
+unsigned pagevec_lookup_entries_range(struct pagevec *pvec,
+   struct address_space *mapping,
+   pgoff_t *start, pgoff_t end,
+   unsigned nr_entries, pgoff_t *indices);
+static inline unsigned pagevec_lookup_entries(struct pagevec *pvec,
struct address_space *mapping,
pgoff_t *start, unsigned nr_entries,
-   pgoff_t *indices);
+   pgoff_t *indices)
+{
+   return pagevec_lookup_entries_range(pvec, mapping, start, (pgoff_t)-1,
+   nr_entries, indices);
+}
 void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup_range(struct pagevec *pvec,
  struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index de12b7355821..e55100459710 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1354,9 +1354,10 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 EXPORT_SYMBOL(pagecache_get_page);
 
 /**
- * find_get_entries - gang pagecache lookup
+ * find_get_entries_range - gang pagecache lookup
  * @mapping:   The address_space to search
  * @start: The starting page cache index
+ * @end:   The final page cache index (inclusive)
  * @nr_entries:The maximum number of entries
  * @entries:   Where the resulting entries are placed
  * @indices:   The cache indices corresponding to the entries in @entries
@@ -1376,9 +1377,9 @@ EXPORT_SYMBOL(pagecache_get_page);
  * find_get_entries() returns the number of pages and shadow entries which were
  * found. It also updates @start to index the next page for the traversal.
  */
-unsigned find_get_entries(struct address_space *mapping,
- pgoff_t *start, unsigned int nr_entries,
- struct page **entries, pgoff_t *indices)
+unsigned find_get_entries_range(struct address_space *mapping,
+   pgoff_t *start, pgoff_t end, unsigned int nr_entries,
+   struct page **entries, pgoff_t *indices)
 {
void **slot;
unsigned int ret = 0;
@@ -1390,6 +1391,9 @@ unsigned find_get_entries(struct address_space *mapping,
rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, *start) {
struct page *head, *page;
+
+   if (iter.index > end)
+   break;
 repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1425,13 +1429,25 @@ unsigned find_get_entries(struct address_space *mapping,
 export:
indices[ret] = iter.index;
entries

[PATCH 17/35] ext4: Use pagevec_lookup_range_tag()

2017-06-01 Thread Jan Kara
We want only pages from given range in ext4_writepages(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

CC: "Theodore Ts'o" <ty...@mit.edu>
CC: linux-e...@vger.kernel.org
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/inode.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ace4bb9073d8..050fba2d12c2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2555,8 +2555,8 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
mpd->map.m_len = 0;
mpd->next_page = index;
while (index <= end) {
-   nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
- min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+   nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+   tag, PAGEVEC_SIZE);
if (nr_pages == 0)
goto out;
 
@@ -2564,16 +2564,6 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
struct page *page = pvec.pages[i];
 
/*
-* At this point, the page may be truncated or
-* invalidated (changing page->mapping to NULL), or
-* even swizzled back from swapper_space to tmpfs file
-* mapping. However, page->index will not change
-* because we have a reference on the page.
-*/
-   if (page->index > end)
-   goto out;
-
-   /*
 * Accumulated enough dirty pages? This doesn't apply
 * to WB_SYNC_ALL mode. For integrity sync we have to
 * keep going because someone may be concurrently
-- 
2.12.3



[PATCH 31/35] shmem: Convert to pagevec_lookup_entries_range()

2017-06-01 Thread Jan Kara
Convert radix tree scanners to use pagevec_lookup_entries_range() and
find_get_entries_range() since they all want only entries from given
range.

CC: Hugh Dickins <hu...@google.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 mm/shmem.c | 23 +++
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index f9c4afbdd70c..e5ea044aae24 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -768,16 +768,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
pagevec_init(&pvec, 0);
index = start;
while (index < end) {
-   if (!pagevec_lookup_entries(&pvec, mapping, &index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE),
-   indices))
+   if (!pagevec_lookup_entries_range(&pvec, mapping, &index,
+   end - 1, PAGEVEC_SIZE, indices))
break;
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
-   if (indices[i] >= end)
-   break;
-
if (radix_tree_exceptional_entry(page)) {
if (unfalloc)
continue;
@@ -860,9 +856,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
cond_resched();
 
-   if (!pagevec_lookup_entries(&pvec, mapping, &index,
-   min(end - index, (pgoff_t)PAGEVEC_SIZE),
-   indices)) {
+   if (!pagevec_lookup_entries_range(&pvec, mapping, &index,
+   end - 1, PAGEVEC_SIZE, indices)) {
/* If all gone or hole-punch or unfalloc, we're done */
if (lookup_start == start || end != -1)
break;
@@ -873,9 +868,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
 
-   if (indices[i] >= end)
-   break;
-
if (radix_tree_exceptional_entry(page)) {
if (unfalloc)
continue;
@@ -2494,9 +2486,9 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 
pagevec_init(&pvec, 0);
pvec.nr = 1;/* start small: we may be there already */
-   while (!done) {
+   while (!done && index < end) {
last = index;
-   pvec.nr = find_get_entries(mapping, &index,
+   pvec.nr = find_get_entries_range(mapping, &index, end - 1,
pvec.nr, pvec.pages, indices);
if (!pvec.nr) {
if (whence == SEEK_DATA)
@@ -2516,8 +2508,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
if (!PageUptodate(page))
page = NULL;
}
-   if (last >= end ||
-   (page && whence == SEEK_DATA) ||
+   if ((page && whence == SEEK_DATA) ||
(!page && whence == SEEK_HOLE)) {
done = true;
break;
-- 
2.12.3



[PATCH 04/35] dax: Fix inefficiency in dax_writeback_mapping_range()

2017-06-01 Thread Jan Kara
dax_writeback_mapping_range() fails to update the iteration index when
searching the radix tree for entries needing cache flushing. Thus each
pagevec worth of entries is searched starting from the beginning of the
range, which is inefficient and prone to livelocks. Update the index
properly.

CC: sta...@vger.kernel.org
Fixes: 9973c98ecfda3a1dfcab981665b5f1e39bcde64a
Signed-off-by: Jan Kara <j...@suse.cz>
---
 fs/dax.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/dax.c b/fs/dax.c
index c22eaf162f95..c204445a69b0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -859,6 +859,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
if (ret < 0)
goto out;
}
+   start_index = indices[pvec.nr - 1] + 1;
}
 out:
put_dax(dax_dev);
-- 
2.12.3



Re: [PATCH 09/10] xfs: nowait aio support

2017-05-31 Thread Jan Kara
On Tue 30-05-17 11:13:29, Goldwyn Rodrigues wrote:
> > Btw, can you write a small blurb up for the man page to document these
> > semantics in man-page-like language?
> > 
> 
> Yes, but which man page would it belong to?
> Should it be a subsection of errors in io_getevents/io_submit? We don't
> want to add ERRORS to io_getevents() because it would be the return
> value of the io_getevents call, and not the ones in the iocb structure.
> Should it be a new man page, say for iocb(7/8)?

I think you should extend the manpage for io_submit(2). There you can add
a definition of struct iocb in the 'DESCRIPTION' section explaining at
least the most common fields. You can also explain which flags can be
passed and what they are intended to do.

You can also expand the EAGAIN error description to specifically mention
that, in the case of NOWAIT aio, EAGAIN can be returned if IO submission
would block.

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 09/10] xfs: nowait aio support

2017-05-29 Thread Jan Kara
On Sun 28-05-17 21:38:26, Goldwyn Rodrigues wrote:
> On 05/28/2017 04:31 AM, Christoph Hellwig wrote:
> > Despite my previous reviewed-by tag this will need another fix:
> > 
> > xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent
> > list in memory already.  E.g. something like this:
> > 
> > if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) {
> > error = -EAGAIN;
> > goto out_unlock;
> > }
> > 
> > right after locking the ilock.
> 
> I am not sure if it is right to penalize an application writing to a
> file which has been freshly opened (and is the first one to open it). It
> basically means the extent maps need to be read from disk. Do you see a
> reason it would have a non-deterministic wait if it is the only user? I
> understand the block layer can block if it has too many requests, though.

Well, submitting such a write will have to wait for a read of metadata from
disk. That is certainly considered blocking, so Christoph is right that we
must return EAGAIN in such a case.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 06/10] fs: Introduce IOMAP_NOWAIT

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:46, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
> This is used by XFS in the XFS patch.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/iomap.c| 2 ++
>  include/linux/iomap.h | 1 +
>  2 files changed, 3 insertions(+)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 4b10892967a5..5d85ec6e7b20 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -879,6 +879,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   } else {
>   dio->flags |= IOMAP_DIO_WRITE;
>   flags |= IOMAP_WRITE;
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + flags |= IOMAP_NOWAIT;
>   }
>  
>   ret = filemap_write_and_wait_range(mapping, start, end);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f753e788da31..69f4e9470084 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -52,6 +52,7 @@ struct iomap {
>  #define IOMAP_REPORT (1 << 2) /* report extent status, e.g. FIEMAP */
>  #define IOMAP_FAULT  (1 << 3) /* mapping for page fault */
>  #define IOMAP_DIRECT (1 << 4) /* direct I/O */
> +#define IOMAP_NOWAIT (1 << 5) /* Don't wait for writeback */
>  
>  struct iomap_ops {
>   /*
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 05/10] fs: return if direct write will trigger writeback

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:45, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> Find out if the write will trigger a wait due to writeback. If yes,
> return -EAGAIN.
> 
> Return -EINVAL for buffered AIO: there are multiple causes of
> delay such as page locks, dirty throttling logic, page loading
> from disk etc. which cannot be taken care of.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/filemap.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 097213275461..bc146efa6815 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2675,6 +2675,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  
>   pos = iocb->ki_pos;
>  
> + if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> + return -EINVAL;
> +
>   if (limit != RLIM_INFINITY) {
>   if (iocb->ki_pos >= limit) {
>   send_sig(SIGXFSZ, current, 0);
> @@ -2743,9 +2746,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>   write_len = iov_iter_count(from);
>   end = (pos + write_len - 1) >> PAGE_SHIFT;
>  
> - written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
> - if (written)
> - goto out;
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + /* If there are pages to writeback, return */
> + if (filemap_range_has_page(inode->i_mapping, pos,
> +pos + iov_iter_count(from)))
> + return -EAGAIN;
> + } else {
> + written = filemap_write_and_wait_range(mapping, pos,
> + pos + write_len - 1);
> + if (written)
> + goto out;
> + }
>  
>   /*
>* After a write we want buffered reads to be sure to go to disk to get
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:44, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> RWF_NOWAIT informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.
> 
> The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This
> is called by most filesystems, either through fsops.write_iter() or through
> the function defined by write_iter(). If not, we perform the check defined
> by .write_iter() which is called for direct IO specifically.
> 
> Filesystems xfs, btrfs and ext4 would be supported in the following patches.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good now. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza


> ---
>  fs/9p/vfs_file.c|  3 +++
>  fs/aio.c| 13 +
>  fs/ceph/file.c  |  3 +++
>  fs/cifs/file.c  |  3 +++
>  fs/fuse/file.c  |  3 +++
>  fs/nfs/direct.c |  3 +++
>  fs/ocfs2/file.c |  3 +++
>  include/linux/fs.h  |  5 -
>  include/uapi/linux/fs.h |  1 +
>  mm/filemap.c|  3 +++
>  10 files changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 3de3b4a89d89..403681db7723 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   loff_t origin;
>   int err = 0;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   retval = generic_write_checks(iocb, from);
>   if (retval <= 0)
>   return retval;
> diff --git a/fs/aio.c b/fs/aio.c
> index 020fa0045e3c..9616dc733103 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1592,6 +1592,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>   goto out_put_req;
>   }
>  
> + if (req->common.ki_flags & IOCB_NOWAIT) {
> + if (!(req->common.ki_flags & IOCB_DIRECT)) {
> + ret = -EOPNOTSUPP;
> + goto out_put_req;
> + }
> +
> + if ((iocb->aio_lio_opcode != IOCB_CMD_PWRITE) &&
> + (iocb->aio_lio_opcode != IOCB_CMD_PWRITEV)) {
> + ret = -EINVAL;
> + goto out_put_req;
> + }
> + }
> +
>   ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 3fdde0b283c9..a53fd2675b1b 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1300,6 +1300,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   int err, want, got;
>   loff_t pos;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   if (ceph_snap(inode) != CEPH_NOSNAP)
>   return -EROFS;
>  
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 0fd081bd2a2f..ff84fa9ddb6c 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2725,6 +2725,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
>* write request.
>*/
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   rc = generic_write_checks(iocb, from);
>   if (rc <= 0)
>   return rc;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 3ee4fdc3da9e..812c7bd0c290 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
>   ssize_t res;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   if (is_bad_inode(inode))
>   return -EIO;
>  
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 6fb9fad2d1e6..c8e7dd76126c 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -979,6 +979,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
>   dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
>   file, iov_iter_count(iter), (long long) iocb->ki_pos);
>  
> + if

Re: [PATCH 03/10] fs: Use RWF_* flags for AIO operations

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:43, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will
> carry the RWF_* flags. We cannot use aio_flags because they are not
> checked for validity which may break existing applications.
> 
> Note, the only place RWF_HIPRI comes in effect is dio_await_one().
> All the rest of the locations, aio code return -EIOCBQUEUED before the
> checks for RWF_HIPRI.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/aio.c | 8 +++-
>  include/uapi/linux/aio_abi.h | 2 +-
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index f52d925ee259..020fa0045e3c 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>   ssize_t ret;
>  
>   /* enforce forwards compatibility on users */
> - if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
> + if (unlikely(iocb->aio_reserved2)) {
>   pr_debug("EINVAL: reserve field set\n");
>   return -EINVAL;
>   }
> @@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>   req->common.ki_flags |= IOCB_EVENTFD;
>   }
>  
> + ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
> + if (unlikely(ret)) {
> + pr_debug("EINVAL: aio_rw_flags\n");
> + goto out_put_req;
> + }
> +
>   ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index bb2554f7fbd1..a2d4a8ac94ca 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -79,7 +79,7 @@ struct io_event {
>  struct iocb {
>   /* these are internal to the kernel/libc. */
>   __u64   aio_data;   /* data to be returned in event's data */
> - __u32   PADDED(aio_key, aio_reserved1);
> + __u32   PADDED(aio_key, aio_rw_flags);
>   /* the kernel sets aio_key to the req # */
>  
>   /* common fields */
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 02/10] fs: Introduce filemap_range_has_page()

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:42, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> filemap_range_has_page() return true if the file's mapping has
> a page within the range mentioned. This function will be used
> to check if a write() call will cause a writeback of previous
> writes.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  include/linux/fs.h |  2 ++
>  mm/filemap.c   | 33 +
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f53867140f43..dc0ab585cd56 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2517,6 +2517,8 @@ extern int filemap_fdatawait(struct address_space *);
>  extern void filemap_fdatawait_keep_errors(struct address_space *);
>  extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
>  loff_t lend);
> +extern int filemap_range_has_page(struct address_space *, loff_t lstart,
> +   loff_t lend);
>  extern int filemap_write_and_wait(struct address_space *mapping);
>  extern int filemap_write_and_wait_range(struct address_space *mapping,
>   loff_t lstart, loff_t lend);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6f1be573a5e6..87aba7698584 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
>  }
>  EXPORT_SYMBOL(filemap_flush);
>  
> +/**
> + * filemap_range_has_page - check if a page exists in range.
> + * @mapping:   address space structure to wait for
> + * @start_byte:offset in bytes where the range starts
> + * @end_byte:  offset in bytes where the range ends (inclusive)
> + *
> + * Find at least one page in the range supplied, usually used to check if
> + * direct writing in this range will trigger a writeback.
> + */
> +int filemap_range_has_page(struct address_space *mapping,
> +loff_t start_byte, loff_t end_byte)
> +{
> + pgoff_t index = start_byte >> PAGE_SHIFT;
> + pgoff_t end = end_byte >> PAGE_SHIFT;
> + struct pagevec pvec;
> + int ret;
> +
> + if (end_byte < start_byte)
> + return 0;
> +
> + if (mapping->nrpages == 0)
> + return 0;
> +
> + pagevec_init(&pvec, 0);
> + ret = pagevec_lookup(&pvec, mapping, index, 1);
> + if (!ret)
> + return 0;
> + ret = (pvec.pages[0]->index <= end);
> + pagevec_release(&pvec);
> + return ret;
> +}
> +EXPORT_SYMBOL(filemap_range_has_page);
> +
>  static int __filemap_fdatawait_range(struct address_space *mapping,
>loff_t start_byte, loff_t end_byte)
>  {
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags

2017-05-25 Thread Jan Kara
On Wed 24-05-17 11:41:41, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/read_write.c| 12 +++-
>  include/linux/fs.h | 14 ++
>  2 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 47c1d4484df9..53c816c61122 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
>   struct kiocb kiocb;
>   ssize_t ret;
>  
> - if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
> - return -EOPNOTSUPP;
> -
>   init_sync_kiocb(&kiocb, filp);
> - if (flags & RWF_HIPRI)
> - kiocb.ki_flags |= IOCB_HIPRI;
> - if (flags & RWF_DSYNC)
> - kiocb.ki_flags |= IOCB_DSYNC;
> - if (flags & RWF_SYNC)
> - kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + ret = kiocb_set_rw_flags(&kiocb, flags);
> + if (ret)
> + return ret;
>   kiocb.ki_pos = *ppos;
>  
>   if (type == READ)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 803e5a9b2654..f53867140f43 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3056,6 +3056,20 @@ static inline int iocb_flags(struct file *file)
>   return res;
>  }
>  
> +static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
> +{
> + if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
> + return -EOPNOTSUPP;
> +
> + if (flags & RWF_HIPRI)
> + ki->ki_flags |= IOCB_HIPRI;
> + if (flags & RWF_DSYNC)
> + ki->ki_flags |= IOCB_DSYNC;
> +     if (flags & RWF_SYNC)
> + ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + return 0;
> +}
> +
>  static inline ino_t parent_ino(struct dentry *dentry)
>  {
>   ino_t res;
> -- 
> 2.12.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-05-15 Thread Jan Kara
On Thu 11-05-17 14:17:04, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> RWF_NOWAIT informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.
> 
> The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This
> is called by most filesystems, either through fsops.write_iter() or through
> the function defined by write_iter(). If not, we perform the check defined
> by .write_iter() which is called for direct IO specifically.
> 
> Filesystems xfs, btrfs and ext4 would be supported in the following patches.
...
> diff --git a/fs/aio.c b/fs/aio.c
> index 020fa0045e3c..34027b67e2f4 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>   goto out_put_req;
>   }
>  
> + if ((req->common.ki_flags & IOCB_NOWAIT) &&
> + !(req->common.ki_flags & IOCB_DIRECT)) {
> + ret = -EOPNOTSUPP;
> + goto out_put_req;
> + }
> +
>   ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");

I think you also need to check here that the IO is a write, so that NOWAIT
reads don't silently pass.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 21/27] mm: clean up error handling in write_one_page

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:24, Jeff Layton wrote:
> Don't try to check PageError since that's potentially racy and not
> necessarily going to be set after writepage errors out.
> 
> Instead, sample the mapping error early on, and use that value to tell
> us whether we got a writeback error since then.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  mm/page-writeback.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index de0dbf12e2c1..1643456881b4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2373,11 +2373,12 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  int write_one_page(struct page *page)
>  {
>   struct address_space *mapping = page->mapping;
> - int ret = 0;
> + int ret = 0, ret2;
>   struct writeback_control wbc = {
>   .sync_mode = WB_SYNC_ALL,
>   .nr_to_write = 1,
>   };
> + errseq_t since = filemap_sample_wb_error(mapping);
>  
>   BUG_ON(!PageLocked(page));
>  
> @@ -2386,16 +2387,14 @@ int write_one_page(struct page *page)
>   if (clear_page_dirty_for_io(page)) {
>   get_page(page);
>   ret = mapping->a_ops->writepage(page, &wbc);
> - if (ret == 0) {
> + if (ret == 0)
>   wait_on_page_writeback(page);
> - if (PageError(page))
> - ret = -EIO;
> - }
>   put_page(page);
>   } else {
>   unlock_page(page);
>   }
> - return ret;
> + ret2 = filemap_check_wb_error(mapping, since);
> + return ret ? : ret2;
>  }
>  EXPORT_SYMBOL(write_one_page);
>  
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 22/27] jbd2: don't reset error in journal_finish_inode_data_buffers

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:25, Jeff Layton wrote:
> Now that we don't clear writeback errors after fetching them, there is
> no need to reset them. This is also potentially racy.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  fs/jbd2/commit.c | 13 ++---
>  1 file changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b6b194ec1b4f..4c6262652028 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -264,17 +264,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>   jinode->i_flags |= JI_COMMIT_RUNNING;
>   spin_unlock(&journal->j_list_lock);
>   err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
> - if (err) {
> - /*
> -  * Because AS_EIO is cleared by
> -  * filemap_fdatawait_range(), set it again so
> -  * that user process can get -EIO from fsync().
> -  */
> - mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
> -
> - if (!ret)
> - ret = err;
> - }
> + if (err && !ret)
> + ret = err;
>   spin_lock(&journal->j_list_lock);
>   jinode->i_flags &= ~JI_COMMIT_RUNNING;
>   smp_mb();
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 19/27] buffer: set errors in mapping at the time that the error occurs

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:22, Jeff Layton wrote:
> I noticed on xfs that I could still sometimes get back an error on fsync
> on a fd that was opened after the error condition had been cleared.
> 
> The problem is that the buffer code sets the write_io_error flag and
> then later checks that flag to set the error in the mapping. That flag
> persists for quite a while, however. If the file is later opened with
> O_TRUNC, the buffers will then be invalidated and the mapping's error
> set such that a subsequent fsync will return error. I think this is
> incorrect, as there was no writeback between the open and fsync.
> 
> Add a new mark_buffer_write_io_error operation that sets the flag and
> the error in the mapping at the same time. Replace all calls to
> set_buffer_write_io_error with mark_buffer_write_io_error, and remove
> the places that check this flag in order to set the error in the
> mapping.
> 
> This sets the error in the mapping earlier, at the time that it's first
> detected.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Small nits below.

> @@ -354,7 +354,7 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
>   } else {
>   buffer_io_error(bh, ", lost async page write");
>   mapping_set_error(page->mapping, -EIO);
> - set_buffer_write_io_error(bh);
> + mark_buffer_write_io_error(bh);

No need to call mapping_set_error() here when it gets called in
mark_buffer_write_io_error() again?

> @@ -1182,6 +1180,17 @@ void mark_buffer_dirty(struct buffer_head *bh)
>  }
>  EXPORT_SYMBOL(mark_buffer_dirty);
>  
> +void mark_buffer_write_io_error(struct buffer_head *bh)
> +{
> + set_buffer_write_io_error(bh);
> + /* FIXME: do we need to set this in both places? */
> + if (bh->b_page && bh->b_page->mapping)
> + mapping_set_error(bh->b_page->mapping, -EIO);
> + if (bh->b_assoc_map)
> + mapping_set_error(bh->b_assoc_map, -EIO);
> +}
> +EXPORT_SYMBOL(mark_buffer_write_io_error);

So buffers that are shared by several inodes cannot have bh->b_assoc_map
set. So for filesystems that have metadata like this, setting the error in
bh->b_assoc_map doesn't really help and they have to check the block
device's mapping anyway. OTOH if a filesystem doesn't have such metadata
relevant for fsync, this could help it. So maybe it is worth it.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH v4 15/27] fs: retrofit old error reporting API onto new infrastructure

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> Now that we have a better way to store and report errors that occur
> during writeback, we need to convert the existing codebase to use it. We
> could just adapt all of the filesystem code and related infrastructure
> to the new API, but that's a lot of churn.
> 
> When it comes to setting errors in the mapping, filemap_set_wb_error is
> a drop-in replacement for mapping_set_error. Turn that function into a
> simple wrapper around the new one.
> 
> Because we want to ensure that writeback errors are always reported at
> fsync time, inject filemap_report_wb_error calls much closer to the
> syscall boundary, in call_fsync.
> 
> For fsync calls (and things like the nfsd equivalent), we either return
> the error that the fsync operation returns, or the one returned by
> filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> the latest value. This allows us to provide new fsync semantics that
> will return errors that may have occurred previously and been viewed
> via other file descriptors.
> 
> The final piece of the puzzle is what to do about filemap_check_errors
> calls that are being called directly or via filemap_* functions. Here,
> we must take a little "creative license".
> 
> Since we now handle advancing the file->f_wb_err value at the generic
> filesystem layer, we no longer need those callers to clear errors out
> of the mapping or advance an errseq_t.
> 
> A lot of the existing codebase relies on getting an error back
> from those functions when there is a writeback problem, so we do still
> want to have them report writeback errors somehow.
> 
> When reporting writeback errors, we will always report errors that have
> occurred since a particular point in time. With the old writeback error
> reporting, the time we used was "since it was last tested/cleared" which
> is entirely arbitrary and potentially racy. Now, we can at least report
> the latest error that has occurred since an arbitrary point in time
> (represented as a sampled errseq_t value).
> 
> In the case where we don't have a struct file to work with, this patch
> just has the wrappers sample the current mapping->wb_err value, and use
> that as an arbitrary point from which to check for errors.

I think this is really dangerous and we shouldn't do this. You are quite
likely to lose IO errors in such calls because you will ignore all errors
that happened during previous background writeback or even for IO that
managed to complete before we called filemap_fdatawait(). Maybe we need to
keep the original set-clear-bit IO error reporting for these cases, until
we can convert them to fdatawait_range_since()?

> That's probably not "correct" in all cases, particularly in the case of
> something like filemap_fdatawait, but I'm not sure it's any worse than
> what we already have, and this gives us a basis from which to work.
> 
> A lot of those callers will likely want to change to a model where they
> sample the errseq_t much earlier (perhaps when starting a transaction),
> store it in an appropriate place and then use that value later when
> checking to see if an error occurred.
> 
> That will almost certainly take some involvement from other subsystem
> maintainers. I'm quite open to adding new API functions to help enable
> this if that would be helpful, but I don't really want to do that until
> I better understand what's needed.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

...

> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 5f7317875a67..7ce13281925f 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
>   .nr_to_write = LONG_MAX,
>   .for_reclaim = 0,
>   };
> + errseq_t since = READ_ONCE(file->f_wb_err);
>  
>   if (unlikely(f2fs_readonly(inode->i_sb)))
>   return 0;
> @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
>   }
>  
>   ret = wait_on_node_pages_writeback(sbi, ino);
> + if (ret == 0)
> + ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
>   if (ret)
>   goto out;

So this conversion looks wrong and actually points to a larger issue with
the scheme. The problem is there are two mappings that come into play here:
file_inode(file)->i_mapping, which is the data mapping, and
NODE_MAPPING(sbi), which is the metadata mapping (and this is not a problem
specific to f2fs; for example, ext2 also uses this scheme, where the block
device's mapping is the metadata mapping). And we need to merge error
information from these two mappings.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread Jan Kara
On Thu 11-05-17 14:59:43, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > 1) Keep i_version as is, make clients also check for i_ctime.
> > 
> > That would be a protocol revision, which we'd definitely rather avoid.
> > 
> > But can't we accomplish the same by using something like
> > 
> > ctime * (some constant) + i_version
> > 
> > ?
> > 
> > >Pro: No on-disk format changes.
> > >Cons: After a crash, i_version can go backwards (but when file changes
> > >i_version, i_ctime pair should be still different) or not, data can be
> > >old or not.
> > 
> > This is probably good enough for NFS purposes: typically on an NFS
> > filesystem, results of a read in the face of a concurrent write open are
> > undefined.  And writers sync before close.
> > 
> > So after a crash with a dirty inode, we're in a situation where an NFS
> > client still needs to resend some writes, sync, and close.  I'm OK with
> > things being inconsistent during this window.
> > 
> > I do expect things to return to normal once that client has resent its
> > writes--hence the worry about actually reusing old values after boot
> > (such as if i_version regresses on boot and then increments back to the
> > same value after further writes).  Factoring in ctime fixes that.
> 
> So for now I'm thinking of just doing something like the following.
> 
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different.  (The NFSv4 client will want to use the server's change
> attribute instead, I think.  And other filesystems might want to try
> something more ambitious like Neil's proposal.)
> 
> --b.
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_saved = false;
>  }
>  
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;

Won't this overflow on 32-bit archs? tv_sec seems to be defined as long?
Probably you need explicit (u64) cast... Otherwise I'm fine with this.

Honza

> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;
> +}
> +
>  /*
>   * Fill in the pre_op attr for the wcc data
>   */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_mtime = inode->i_mtime;
>   fhp->fh_pre_ctime = inode->i_ctime;
>   fhp->fh_pre_size  = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
>   fhp->fh_pre_saved = true;
>   }
>  }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>   printk("nfsd: inode locked twice during operation.\n");
>  
>   err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
>   if (err) {
>   fhp->fh_post_saved = false;
>   /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
>   *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
>   *p++ = 0;
>   } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
>   } else {
>   *p++ = cpu_to_be32(stat->ctime.tv_sec);
>   *p++ = cpu_to_be32(stat->ctime.tv_nsec);
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

