[PATCH v3] btrfs: handle dynamically reappearing missing device

2017-12-16 Thread Anand Jain
If the device is not present at the time of (-o degrade) mount,
the mount context will create a dummy missing struct btrfs_device.
Later this device may reappear after the FS is mounted and
then device is included in the device list but it missed the
open_device part. So this patch handles that case by going
through the open_device steps which this device missed and finally
adds to the device alloc list.

So now with this patch, to bring back the missing device user can run,

   btrfs dev scan 

Without this kernel patch, even though 'btrfs fi show' and 'btrfs
dev ready' would tell you that missing device has reappeared
successfully but actually in kernel FS layer it didn't.

Signed-off-by: Anand Jain 
---
This patch needs:
 [PATCH 0/4]  factor __btrfs_open_devices()
v3:
The check for missing in the device_list_add() is now a another
patch as its not related.
 btrfs: fix inconsistency during missing device rejoin

v2:
Add more comments.
Add more change log.
Add to check if device missing is set, to handle the case
dev open fail and user will rerun the dev scan.

 fs/btrfs/volumes.c | 57 --
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 93d65c72b731..5c3190c65f81 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -812,8 +812,61 @@ static noinline int device_list_add(const char *path,
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
-   fs_devices->missing_devices--;
-   clear_bit(BTRFS_DEV_STATE_MISSING, >dev_state);
+   int ret;
+   struct btrfs_fs_info *fs_info = fs_devices->fs_info;
+   fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+   if (btrfs_super_flags(disk_super) &
+   BTRFS_SUPER_FLAG_SEEDING)
+   fmode &= ~FMODE_WRITE;
+
+   /*
+* Missing can be set only when FS is mounted.
+* So here its always fs_devices->opened > 0 and most
+* of the struct device members are already updated by
+* the mount process even if this device was missing, so
+* now follow the normal open device procedure for this
+* device. The scrub will take care of filling the
+* missing stripes for raid56 and balance for raid1 and
+* raid10.
+*/
+   ASSERT(fs_devices->opened);
+   mutex_lock(_devices->device_list_mutex);
+   mutex_lock(_info->chunk_mutex);
+   /*
+* As of now do not fail the dev scan thread for the
+* reason that btrfs_open_one_device() fails and keep
+* the legacy dev scan requisites as it is.
+* And reset missing only if open is successful, as
+* user can rerun dev scan after fixing the device
+* for which the device open (below) failed.
+*/
+   ret = btrfs_open_one_device(fs_devices, device, fmode,
+   fs_info->bdev_holder);
+   if (!ret) {
+   fs_devices->missing_devices--;
+   clear_bit(BTRFS_DEV_STATE_MISSING,
+   >dev_state);
+   btrfs_clear_opt(fs_info->mount_opt, DEGRADED);
+   btrfs_warn(fs_info,
+   "BTRFS: device %s devid %llu joined\n",
+   path, devid);
+   }
+
+   if (test_bit(BTRFS_DEV_STATE_WRITEABLE,
+>dev_state) &&
+   !test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+ >dev_state)) {
+   fs_devices->total_rw_bytes +=
+   device->total_bytes;
+   atomic64_add(device->total_bytes -
+   device->bytes_used,
+   _info->free_chunk_space);
+   }
+   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   >dev_state);
+   mutex_unlock(_info->chunk_mutex);
+   mutex_unlock(_devices->device_list_mutex);

Re: [PATCH v2 01/19] fs: new API for handling inode->i_version

2017-12-16 Thread Jeff Layton
On Sun, 2017-12-17 at 09:37 +1100, Dave Chinner wrote:
> On Sat, Dec 16, 2017 at 08:46:38AM -0500, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > Add a documentation blob that explains what the i_version field is, how
> > it is expected to work, and how it is currently implemented by various
> > filesystems.
> > 
> > We already have inode_inc_iversion. Add several other functions for
> > manipulating and accessing the i_version counter. For now, the
> > implementation is trivial and basically works the way that all of the
> > open-coded i_version accesses work today.
> > 
> > Future patches will convert existing users of i_version to use the new
> > API, and then convert the backend implementation to do things more
> > efficiently.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/fs.h | 200 
> > ++---
> >  1 file changed, 192 insertions(+), 8 deletions(-)
> 
> Just a random sunday morning coffee musing
> 
> I was just wondering if it would be better to split this stuff out
> into it's own header file now? include/linux/fs.h is aleady a
> massive header file (~3500 lines) and changes cause tree-wide
> rebuilds, so maybe it would be better to split relatively isolated
> functionality like this out while it's being reworked and you're
> already touching every file that uses it?
> 
> Cheers,
> 
> Dave.

That's a good idea. Let me do that and I'll re-post.

Thanks,
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/19] fs: new API for handling inode->i_version

2017-12-16 Thread Dave Chinner
On Sat, Dec 16, 2017 at 08:46:38AM -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> Add a documentation blob that explains what the i_version field is, how
> it is expected to work, and how it is currently implemented by various
> filesystems.
> 
> We already have inode_inc_iversion. Add several other functions for
> manipulating and accessing the i_version counter. For now, the
> implementation is trivial and basically works the way that all of the
> open-coded i_version accesses work today.
> 
> Future patches will convert existing users of i_version to use the new
> API, and then convert the backend implementation to do things more
> efficiently.
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/fs.h | 200 
> ++---
>  1 file changed, 192 insertions(+), 8 deletions(-)

Just a random sunday morning coffee musing

I was just wondering if it would be better to split this stuff out
into it's own header file now? include/linux/fs.h is aleady a
massive header file (~3500 lines) and changes cause tree-wide
rebuilds, so maybe it would be better to split relatively isolated
functionality like this out while it's being reworked and you're
already touching every file that uses it?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Unexpected raid1 behaviour

2017-12-16 Thread Dark Penguin
Could someone please point me towards some read about how btrfs handles
multiple devices? Namely, kicking faulty devices and re-adding them.

I've been using btrfs on single devices for a while, but now I want to
start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
tried to see how does it handle various situations. The experience left
me very surprised; I've tried a number of things, all of which produced
unexpected results.

I create a btrfs raid1 filesystem on two hard drives and mount it.

- When I pull one of the drives out (simulating a simple cable failure,
which happens pretty often to me), the filesystem sometimes goes
read-only. ???
- But only after a while, and not always. ???
- When I fix the cable problem (plug the device back), it's immediately
"re-added" back. But I see no replication of the data I've written onto
a degraded filesystem... Nothing shows any problems, so "my filesystem
must be ok". ???
- If I unmount the filesystem and then mount it back, I see all my
recent changes lost (everything I wrote during the "degraded" period).
- If I continue working with a degraded raid1 filesystem (even without
damaging it further by re-adding the faulty device), after a while it
won't mount at all, even with "-o degraded".

I can't wrap my head about all this. Either the kicked device should not
be re-added, or it should be re-added "properly", or it should at least
show some errors and not pretend nothing happened, right?..

I must be missing something. Is there an explanation somewhere about
what's really going on during those situations? Also, do I understand
correctly that upon detecting a faulty device (a write error), nothing
is done about it except logging an error into the 'btrfs device stats'
report? No device kicking, no notification?.. And what about degraded
filesystems - is it absolutely forbidden to work with them without
converting them to a "single" filesystem first?..

On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 .


-- 
darkpenguin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 05/19] afs: convert to new i_version API

2017-12-16 Thread Jeff Layton
On Sat, 2017-12-16 at 11:18 -0500, Jeffrey Altman wrote:
> Hi Jeff,
> 
> A few thoughts on AFS usage below which might impact a future revision
> of the API.  I hope they are useful.
> 
> On 12/16/2017 8:49 AM, Jeff Layton wrote:
> > On Sat, 2017-12-16 at 08:46 -0500, Jeff Layton wrote:
> > > From: Jeff Layton 
> > > 
> > > For AFS, it's generally treated as an opaque value, so we use the
> > > *_raw variants of the API here.
> > > 
> > > Note that AFS has quite a different definition for this counter. AFS
> > > only increments it on changes to the data, not for the metadata. We'll
> > > need to reconcile that somehow if we ever want to present this to
> > > userspace via statx.
> > > 
> 
> From the patch series notes:
> 
> "The inode->i_version field is supposed to be a value that changes
> whenever there is any data or metadata change to the inode. Some
> filesystems use it internally to detect directory changes during
> readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
> will also use it to optimize away some remeasurement if it's available.
> NFS and AFS just use it to store an opaque change attribute from the
> server.
> 
> "Only btrfs, ext4, and xfs increment it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes. That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]"
> 
> 
> The AFS/AuriStorFS data version is an unsigned 64-bit value that is
> incremented by the file server as part of a data changing operation. For
> files, a StoreData and for directories entry manipulations such as
> create, rename, delete.  This data version is used to tag the version of
> any subset of the data stream for caching and replication purposes.
> 
> As Jeff notes, the AFS data version is not incremented for metadata
> changes.  Metadata cannot be trusted by clients without acquiring a
> callback promise from a fileserver.  The callback promise will either be
> satisfied by the issuing fileserver sending a CallBack notification that
> the metadata is no longer valid OR the callback promise will expire.
> 
> Something else that is important to note that it is assumed that local
> data changes that occur under a valid callback promise is newer than the
> data on the fileserver.  It might be useful if the new i_version API
> supported major and minor version numbers.  AFS implementations would
> store the fileserver provided data version number as the major version
> and would increment the minor version when local changes have been made
> which have yet to be stored back to the fileserver.  This functionality
> would be especially useful if disconnected operations were implemented
> for the AFS implementation.
> 
> It might also be useful to separate metadata version and data version
> although some filesystems would set the same value to both.  For AFS,
> the metadata major version would the timestamp at which the callback was
> issued.
> 
> Jeffrey Altman

Thanks. That seems like rather specialized use case.

If we did want to go that route, we'd probably need to give filesystems
a way to overload how i_version is handled and queried (maybe some new
inode ops?).

Given that nothing ever looks at the the i_version in kAFS now, I don't
have a lot of incentive to do anything along those lines in this set. I
think this patchset will probably make it simpler to do something like
that in the future, if you were motivated to do so though.
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 05/19] afs: convert to new i_version API

2017-12-16 Thread Jeffrey Altman
Hi Jeff,

A few thoughts on AFS usage below which might impact a future revision
of the API.  I hope they are useful.

On 12/16/2017 8:49 AM, Jeff Layton wrote:
> On Sat, 2017-12-16 at 08:46 -0500, Jeff Layton wrote:
>> From: Jeff Layton 
>>
>> For AFS, it's generally treated as an opaque value, so we use the
>> *_raw variants of the API here.
>>
>> Note that AFS has quite a different definition for this counter. AFS
>> only increments it on changes to the data, not for the metadata. We'll
>> need to reconcile that somehow if we ever want to present this to
>> userspace via statx.
>>

From the patch series notes:

"The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
will also use it to optimize away some remeasurement if it's available.
NFS and AFS just use it to store an opaque change attribute from the
server.

"Only btrfs, ext4, and xfs increment it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change. [1]"


The AFS/AuriStorFS data version is an unsigned 64-bit value that is
incremented by the file server as part of a data changing operation. For
files, a StoreData and for directories entry manipulations such as
create, rename, delete.  This data version is used to tag the version of
any subset of the data stream for caching and replication purposes.

As Jeff notes, the AFS data version is not incremented for metadata
changes.  Metadata cannot be trusted by clients without acquiring a
callback promise from a fileserver.  The callback promise will either be
satisfied by the issuing fileserver sending a CallBack notification that
the metadata is no longer valid OR the callback promise will expire.

Something else that is important to note that it is assumed that local
data changes that occur under a valid callback promise is newer than the
data on the fileserver.  It might be useful if the new i_version API
supported major and minor version numbers.  AFS implementations would
store the fileserver provided data version number as the major version
and would increment the minor version when local changes have been made
which have yet to be stored back to the fileserver.  This functionality
would be especially useful if disconnected operations were implemented
for the AFS implementation.

It might also be useful to separate metadata version and data version
although some filesystems would set the same value to both.  For AFS,
the metadata major version would the timestamp at which the callback was
issued.

Jeffrey Altman
<>

smime.p7s
Description: S/MIME Cryptographic Signature


How to repair errors only found with check --mode=lowmem

2017-12-16 Thread Tom Hale
The following shows that errors are found with check --mode=lowmem, but 
are not picked up without lowmem.


How would I go about fixing errors only reported by lowmem?



[manjaro manjaro]# btrfs check --mode=lowmem --progress 
/dev/mapper/vg_svelte-home

Checking filesystem on /dev/mapper/vg_svelte-home
UUID: 93722fa7-7e8f-418a-a7ca-080aca8db94b
ERROR: extent[691815358464, 11042816] referencer count mismatch (root: 
257, owner: 1869679, offset: 613974016) wanted: 1, have: 2
ERROR: extent[720156536832, 99430400] referencer count mismatch (root: 
257, owner: 758215, offset: 1610616832) wanted: 8, have: 379
ERROR: extent[720669147136, 268435456] referencer count mismatch (root: 
257, owner: 758215, offset: 4096) wanted: 86, have: 1021
ERROR: extent[720669147136, 268435456] referencer count mismatch (root: 
257, owner: 1767807, offset: 4096) wanted: 87, have: 1021
ERROR: extent[726724722688, 64069632] referencer count mismatch (root: 
257, owner: 1480823, offset: 99090432) wanted: 1, have: 5
ERROR: extent[737910194176, 134217728] referencer count mismatch (root: 
257, owner: 1480726, offset: 268435456) wanted: 1, have: 8
ERROR: extent[738077896704, 134217728] referencer count mismatch (root: 
257, owner: 1869696, offset: 402653184) wanted: 5, have: 8
ERROR: extent[744334426112, 268435456] referencer count mismatch (root: 
257, owner: 1767802, offset: 0) wanted: 111, have: 294
ERROR: extent[824948670464, 1671168] referencer count mismatch (root: 
257, owner: 2000876, offset: 247861248) wanted: 16, have: 26

ERROR: data extent[681550843904 8192] backref lost
ERROR: errors found in extent allocation tree or chunk allocation
cache and super generation don't match, space cache will be invalidated
ERROR: errors found in fs roots
found 172094545920 bytes used, error(s) found
total csum bytes: 165679768
total tree bytes: 3066789888
total fs tree bytes: 2751315968
total extent tree bytes: 112295936
btree space waste bytes: 568660274
file data blocks allocated: 8158426562560
 referenced 597269540864
[manjaro manjaro]# btrfs check --repair  --progress 
/dev/mapper/vg_svelte-home

enabling repair mode
Checking filesystem on /dev/mapper/vg_svelte-home
UUID: 93722fa7-7e8f-418a-a7ca-080aca8db94b
checking extents [.]
Fixed 0 roots.
cache and super generation don't match, space cache will be invalidated
checking fs roots [o]
checking csums
checking root refs
found 172091723777 bytes used, no error found
total csum bytes: 165679768
total tree bytes: 2003206144
total fs tree bytes: 1687732224
total extent tree bytes: 112295936
btree space waste bytes: 346177995
file data blocks allocated: 5952940838912
 referenced 474676633600
[manjaro manjaro]#



--
Tom Hale
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 01/19] fs: new API for handling inode->i_version

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 200 ++---
 1 file changed, 192 insertions(+), 8 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..5001e77342fd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2036,19 +2036,203 @@ static inline void inode_dec_link_count(struct inode 
*inode)
mark_inode_dirty(inode);
 }
 
+/*
+ * The change attribute (i_version) is mandated by NFSv4 and is mostly for
+ * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
+ * appear different to observers if there was a change to the inode's data or
+ * metadata since it was last queried.
+ *
+ * It should be considered an opaque value by observers. If it remains the same
+ * since it was last checked, then nothing has changed in the inode. If it's
+ * different then something has changed. Observers cannot infer anything about
+ * the nature or magnitude of the changes from the value, only that the inode
+ * has changed in some fashion.
+ *
+ * Not all filesystems properly implement the i_version counter. Subsystems 
that
+ * want to use i_version field on an inode should first check whether the
+ * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
+ *
+ * Those that set SB_I_VERSION will automatically have their i_version counter
+ * incremented on writes to normal files. If the SB_I_VERSION is not set, then
+ * the VFS will not touch it on writes, and the filesystem can use it how it
+ * wishes. Note that the filesystem is always responsible for updating the
+ * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
+ * We consider these sorts of filesystems to have a kernel-managed i_version.
+ *
+ * Note that some filesystems (e.g. NFS and AFS) just use the field to store
+ * a server-provided value (for the most part). For that reason, those
+ * filesystems do not set SB_I_VERSION. These filesystems are considered to
+ * have a self-managed i_version.
+ */
+
+/**
+ * inode_set_iversion_raw - set i_version to the specified raw value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for use by
+ * filesystems that self-manage the i_version.
+ *
+ * For example, the NFS client stores its NFSv4 change attribute in this way,
+ * and the AFS client stores the data_version from the server here.
+ */
+static inline void
+inode_set_iversion_raw(struct inode *inode, const u64 new)
+{
+   inode->i_version = new;
+}
+
+/**
+ * inode_set_iversion - set i_version to a particular value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for filesystems with
+ * a kernel-managed i_version.
+ *
+ * For now, this just does the same thing as the _raw variant.
+ */
+static inline void
+inode_set_iversion(struct inode *inode, const u64 new)
+{
+   inode_set_iversion_raw(inode, new);
+}
+
+/**
+ * inode_set_iversion_queried - set i_version to a particular value and set
+ *  flag to indicate that it has been viewed
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * When loading in an i_version value from a backing store, we typically don't
+ * know whether it was previously viewed before being stored or not. Thus, we
+ * must assume that it was, to ensure that any changes will result in the
+ * value changing.
+ *
+ * This function will set the inode's i_version, and possibly flag the value
+ * as if it has already been viewed at least once.
+ *
+ * For now, this just does what inode_set_iversion does.
+ */
+static inline void
+inode_set_iversion_queried(struct inode *inode, const u64 new)
+{
+   inode_set_iversion(inode, new);
+}
+
+/**
+ * inode_maybe_inc_iversion - increments i_version
+ * @inode: inode with the i_version that should be updated
+ * @force: increment the counter even if it's not necessary
+ *
+ * Every time the inode is modified, the i_version field must be seen to have
+ * changed by any observer.
+ *
+ * In this implementation, we always increment it after taking the i_lock to
+ * ensure that we don't race with other incrementors.
+ *
+ * Returns true if counter was bumped, and false if it wasn't.
+ */
+static inline bool

[PATCH v2 03/19] fat: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/fat/dir.c |  2 +-
 fs/fat/inode.c   |  8 
 fs/fat/namei_msdos.c |  6 +++---
 fs/fat/namei_vfat.c  | 20 ++--
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index b833ffeee1e1..221d85529c5f 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -1055,7 +1055,7 @@ int fat_remove_entries(struct inode *dir, struct 
fat_slot_info *sinfo)
brelse(bh);
if (err)
return err;
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
if (nr_slots) {
/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 20a0a89eaca5..a444336c1eed 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -507,7 +507,7 @@ int fat_fill_inode(struct inode *inode, struct 
msdos_dir_entry *de)
MSDOS_I(inode)->i_pos = 0;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = get_seconds();
 
if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +590,7 @@ struct inode *fat_build_inode(struct super_block *sb,
goto out;
}
inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
err = fat_fill_inode(inode, de);
if (err) {
iput(inode);
@@ -1377,7 +1377,7 @@ static int fat_read_root(struct inode *inode)
MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = 0;
inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
inode->i_op = sbi->dir_ops;
@@ -1828,7 +1828,7 @@ int fat_fill_super(struct super_block *sb, void *data, 
int silent, int isvfat,
if (!root_inode)
goto out_fail;
root_inode->i_ino = MSDOS_ROOT_INO;
-   root_inode->i_version = 1;
+   inode_set_iversion(root_inode, 1);
error = fat_read_root(root_inode);
if (error < 0) {
iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index d24d2758a363..7b71c1b242ce 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -480,7 +480,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
} else
mark_inode_dirty(old_inode);
 
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = 
current_time(old_dir);
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
@@ -508,7 +508,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
goto out;
new_i_pos = sinfo.i_pos;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -540,7 +540,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 02c03a3a..fe1bb65149c8 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
 {
int ret = 1;
spin_lock(>d_lock);
-   if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+   if (inode_cmp_iversion(d_inode(dentry->d_parent), 
vfat_d_version(dentry)))
ret = 0;
spin_unlock(>d_lock);
return ret;
@@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct 
dentry *dentry,
 out:
mutex_unlock(_SB(sb)->s_lock);
if (!inode)
-   vfat_d_version_set(dentry, dir->i_version);
+   vfat_d_version_set(dentry, inode_query_iversion(dir));
return d_splice_alias(inode, dentry);
 error:
mutex_unlock(_SB(sb)->s_lock);
@@ -781,7 +781,7 @@ static int vfat_create(struct inode *dir, struct dentry 
*dentry, umode_t mode,
err = vfat_add_entry(dir, >d_name, 0, 0, , );
if (err)
goto out;
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
brelse(sinfo.bh);
@@ -789,7 +789,7 @@ static int vfat_create(struct inode *dir, struct dentry 
*dentry, umode_t mode,
err = 

[PATCH v2 02/19] fs: don't take the i_lock in inode_inc_iversion

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as a atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.

Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

I'm not dead-set on this patch, but I was a little leery of adding
the i_lock into all of the places that do inode->i_version++ today,
even as an interim step. We can probably drop this patch if it's too
ugly to live...

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5001e77342fd..c234fac4bb77 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2136,9 +2136,9 @@ inode_set_iversion_queried(struct inode *inode, const u64 
new)
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-   spin_lock(>i_lock);
-   inode->i_version++;
-   spin_unlock(>i_lock);
+   atomic64_t *ivp = (atomic64_t *)>i_version;
+
+   atomic64_inc(ivp);
return true;
 }
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 05/19] afs: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

For AFS, it's generally treated as an opaque value, so we use the
*_raw variants of the API here.

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data, not for the metadata. We'll
need to reconcile that somehow if we ever want to present this to
userspace via statx.

Signed-off-by: Jeff Layton 
---
 fs/afs/fsclient.c | 2 +-
 fs/afs/inode.c| 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index b90ef39ae914..67ed9bb5fe31 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -124,7 +124,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_client;
vnode->vfs_inode.i_mtime= vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime= vnode->vfs_inode.i_ctime;
-   vnode->vfs_inode.i_version  = data_version;
+   inode_set_iversion_raw(>vfs_inode, data_version);
}
 
expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3415eb7484f6..af9577210a46 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -89,7 +89,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, 
struct key *key)
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
inode->i_generation = vnode->fid.unique;
-   inode->i_version= vnode->status.data_version;
+   inode_set_iversion_raw(inode, vnode->status.data_version);
inode->i_mapping->a_ops = _fs_aops;
 
read_sequnlock_excl(>cb_lock);
@@ -218,7 +218,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const 
char *dev_name,
inode->i_ctime.tv_nsec  = 0;
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
-   inode->i_version= 0;
+   inode_set_iversion_raw(inode, 0);
inode->i_generation = 0;
 
set_bit(AFS_VNODE_PSEUDODIR, >flags);
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 04/19] affs: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/affs/amigaffs.c | 4 ++--
 fs/affs/dir.c  | 4 ++--
 fs/affs/super.c| 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 0f0e6925e97d..085a93bea1e2 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -60,7 +60,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
affs_brelse(dir_bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
mark_inode_dirty(dir);
 
return 0;
@@ -114,7 +114,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head 
*rem_bh)
affs_brelse(bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
mark_inode_dirty(dir);
 
return retval;
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index a105e77df2c1..1a8c7b067c44 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -80,7 +80,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
 * we can jump directly to where we left off.
 */
ino = (u32)(long)file->private_data;
-   if (ino && file->f_version == inode->i_version) {
+   if (ino && inode_cmp_iversion(inode, file->f_version) == 0) {
pr_debug("readdir() left off=%d\n", ino);
goto inside;
}
@@ -130,7 +130,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
} while (ino);
}
 done:
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
file->private_data = (void *)(long)ino;
affs_brelse(fh_bh);
 
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 1117e36134cc..70596d4084de 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -102,7 +102,7 @@ static struct inode *affs_alloc_inode(struct super_block 
*sb)
if (!i)
return NULL;
 
-   i->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
i->i_lc = NULL;
i->i_ext_bh = NULL;
i->i_pa_cnt = 0;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 07/19] exofs: switch to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/exofs/dir.c   | 8 
 fs/exofs/super.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 98233a97b7b8..36e314083b80 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -60,7 +60,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
if (!PageUptodate(page))
SetPageUptodate(page);
@@ -241,7 +241,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
-   int need_revalidate = (file->f_version != inode->i_version);
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
return 0;
@@ -264,8 +264,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (struct exofs_dir_entry *)(kaddr + offset);
limit = kaddr + exofs_last_byte(inode, n) -
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 819624cfc8da..6746999137df 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -159,7 +159,7 @@ static struct inode *exofs_alloc_inode(struct super_block 
*sb)
if (!oi)
return NULL;
 
-   oi->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
return >vfs_inode;
 }
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 06/19] btrfs: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/btrfs/delayed-inode.c | 6 --
 fs/btrfs/inode.c | 6 --
 fs/btrfs/tree-log.c  | 3 ++-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79ded8b..1f0cd7c904c6 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1700,7 +1700,8 @@ static void fill_stack_inode_item(struct 
btrfs_trans_handle *trans,
btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
btrfs_set_stack_inode_generation(inode_item,
 BTRFS_I(inode)->generation);
-   btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
+   btrfs_set_stack_inode_sequence(inode_item,
+  inode_peek_iversion(inode));
btrfs_set_stack_inode_transid(inode_item, trans->transid);
btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
@@ -1754,7 +1755,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
 BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
 
-   inode->i_version = btrfs_stack_inode_sequence(inode_item);
+   inode_set_iversion_queried(inode,
+  btrfs_stack_inode_sequence(inode_item));
inode->i_rdev = 0;
*rdev = btrfs_stack_inode_rdev(inode_item);
BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3cb5be9..1dfa299a31a9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3777,7 +3777,8 @@ static int btrfs_read_locked_inode(struct inode *inode)
BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
 
-   inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+   inode_set_iversion_queried(inode,
+  btrfs_inode_sequence(leaf, inode_item));
inode->i_generation = BTRFS_I(inode)->generation;
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
@@ -3945,7 +3946,8 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
 );
btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation,
 );
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, );
+   btrfs_set_token_inode_sequence(leaf, item, inode_peek_iversion(inode),
+  );
btrfs_set_token_inode_transid(leaf, item, trans->transid, );
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, );
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, );
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7bf9b31561db..3c93e456d9f7 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3609,7 +3609,8 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode),
 );
 
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, );
+   btrfs_set_token_inode_sequence(leaf, item,
+  inode_peek_iversion(inode), );
btrfs_set_token_inode_transid(leaf, item, trans->transid, );
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, );
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, );
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 09/19] ext4: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Acked-by: Theodore Ts'o 
---
 fs/ext4/dir.c|  8 
 fs/ext4/inline.c |  6 +++---
 fs/ext4/inode.c  | 12 
 fs/ext4/ioctl.c  |  2 +-
 fs/ext4/namei.c  |  4 ++--
 fs/ext4/super.c  |  2 +-
 fs/ext4/xattr.c  |  4 ++--
 7 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d5babc9f222b..18eb37395a45 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -208,7 +208,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
@@ -227,7 +227,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < inode->i_size
@@ -568,10 +568,10 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 * cached entries.
 */
if ((!info->curr_node) ||
-   (file->f_version != inode->i_version)) {
+   inode_cmp_iversion(inode, file->f_version)) {
info->curr_node = NULL;
free_rb_tree_fname(>root);
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
ret = ext4_htree_fill_tree(file, info->curr_hash,
   info->curr_minor_hash,
   >next_hash);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 1367553c43bb..a4a621eb80c1 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1042,7 +1042,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 */
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
return 1;
 }
 
@@ -1494,7 +1494,7 @@ int ext4_read_inline_dir(struct file *file,
 * dirent right now.  Scan from the start of the inline
 * dir to make sure.
 */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < extra_size && i < offset;) {
/*
 * "." is with offset 0 and
@@ -1526,7 +1526,7 @@ int ext4_read_inline_dir(struct file *file,
}
offset = i;
ctx->pos = offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < extra_size) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..1f8ec9cd042c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4873,12 +4873,14 @@ struct inode *ext4_iget(struct super_block *sb, 
unsigned long ino)
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+   u64 ivers = le32_to_cpu(raw_inode->i_disk_version);
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
-   inode->i_version |=
+   ivers |=
(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
}
+   inode_set_iversion_queried(inode, ivers);
}
 
ret = 0;
@@ -5164,11 +5166,13 @@ static int ext4_do_update_inode(handle_t *handle,
}
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+   u64 ivers = inode_peek_iversion(inode);
+
+   raw_inode->i_disk_version = cpu_to_le32(ivers);
if (ei->i_extra_isize) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
raw_inode->i_version_hi =
-   cpu_to_le32(inode->i_version >> 32);
+   cpu_to_le32(ivers 

[PATCH v2 10/19] nfs: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.

Signed-off-by: Jeff Layton 
---
 fs/nfs/delegation.c|  2 +-
 fs/nfs/fscache-index.c |  4 ++--
 fs/nfs/inode.c | 16 
 fs/nfs/nfs4proc.c  |  9 +
 fs/nfs/nfstrace.h  |  4 ++--
 fs/nfs/write.c |  7 ++-
 6 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index ade44ca0c66c..89b22b2a111c 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -347,7 +347,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct 
rpc_cred *cred, struct
nfs4_stateid_copy(>stateid, >delegation);
delegation->type = res->delegation_type;
delegation->pagemod_limit = res->pagemod_limit;
-   delegation->change_attr = inode->i_version;
+   delegation->change_attr = inode_peek_iversion_raw(inode);
delegation->cred = get_rpccred(cred);
delegation->inode = inode;
delegation->flags = 1<vfs_inode.i_ctime;
 
if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_peek_iversion_raw(>vfs_inode);
 
if (bufmax > sizeof(auxdata))
bufmax = sizeof(auxdata);
@@ -243,7 +243,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void 
*cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;
 
if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_peek_iversion_raw(>vfs_inode);
 
if (memcmp(data, , datalen) != 0)
return FSCACHE_CHECKAUX_OBSOLETE;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b992d2382ffa..c55b86c87bce 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -483,7 +483,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
memset(>i_atime, 0, sizeof(inode->i_atime));
memset(>i_mtime, 0, sizeof(inode->i_mtime));
memset(>i_ctime, 0, sizeof(inode->i_ctime));
-   inode->i_version = 0;
+   inode_set_iversion_raw(inode, 0);
inode->i_size = 0;
clear_nlink(inode);
inode->i_uid = make_kuid(_user_ns, -2);
@@ -508,7 +508,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
else if (nfs_server_capable(inode, NFS_CAP_CTIME))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR);
if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   inode->i_version = fattr->change_attr;
+   inode_set_iversion_raw(inode, fattr->change_attr);
else
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
| NFS_INO_REVAL_PAGECACHE);
@@ -1289,8 +1289,8 @@ static unsigned long nfs_wcc_update_inode(struct inode 
*inode, struct nfs_fattr
 
if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   && inode->i_version == fattr->pre_change_attr) {
-   inode->i_version = fattr->change_attr;
+   && !inode_cmp_iversion(inode, fattr->pre_change_attr)) {
+   inode_set_iversion_raw(inode, fattr->change_attr);
if (S_ISDIR(inode->i_mode))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
ret |= NFS_INO_INVALID_ATTR;
@@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode 
*inode, struct nfs_fattr *fat
 
if (!nfs_file_has_buffered_writers(nfsi)) {
/* Verify a few of the more important attributes */
-   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode->i_version != fattr->change_attr)
+   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode_cmp_iversion(inode, fattr->change_attr))
invalid |= NFS_INO_INVALID_ATTR | 
NFS_INO_REVAL_PAGECACHE;
 
if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && 
!timespec_equal(>i_mtime, >mtime))
@@ -1642,7 +1642,7 @@ int nfs_post_op_update_inode_force_wcc_locked(struct 
inode *inode, struct nfs_fa
}
if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
   

[PATCH v2 11/19] nfsd: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Mostly just making sure we use the "get" wrappers so we know when
it is being fetched for later use.

Signed-off-by: Jeff Layton 
---
 fs/nfsd/nfsfh.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 43f31cf49bae..7dedda5e3ed4 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -259,7 +259,7 @@ static inline u64 nfsd4_change_attribute(struct inode 
*inode)
chattr =  inode->i_ctime.tv_sec;
chattr <<= 30;
chattr += inode->i_ctime.tv_nsec;
-   chattr += inode->i_version;
+   chattr += inode_query_iversion(inode);
return chattr;
 }
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 05/19] afs: convert to new i_version API

2017-12-16 Thread Jeff Layton
On Sat, 2017-12-16 at 08:46 -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> For AFS, it's generally treated as an opaque value, so we use the
> *_raw variants of the API here.
> 
> Note that AFS has quite a different definition for this counter. AFS
> only increments it on changes to the data, not for the metadata. We'll
> need to reconcile that somehow if we ever want to present this to
> userspace via statx.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/afs/fsclient.c | 2 +-
>  fs/afs/inode.c| 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
> index b90ef39ae914..67ed9bb5fe31 100644
> --- a/fs/afs/fsclient.c
> +++ b/fs/afs/fsclient.c
> @@ -124,7 +124,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
>   vnode->vfs_inode.i_ctime.tv_sec = status->mtime_client;
>   vnode->vfs_inode.i_mtime= vnode->vfs_inode.i_ctime;
>   vnode->vfs_inode.i_atime= vnode->vfs_inode.i_ctime;
> - vnode->vfs_inode.i_version  = data_version;
> + inode_set_iversion_raw(>vfs_inode, data_version);
>   }
>  
>   expected_version = status->data_version;
> diff --git a/fs/afs/inode.c b/fs/afs/inode.c
> index 3415eb7484f6..af9577210a46 100644
> --- a/fs/afs/inode.c
> +++ b/fs/afs/inode.c
> @@ -89,7 +89,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, 
> struct key *key)
>   inode->i_atime  = inode->i_mtime = inode->i_ctime;
>   inode->i_blocks = 0;
>   inode->i_generation = vnode->fid.unique;
> - inode->i_version= vnode->status.data_version;
> + inode_set_iversion_raw(inode, vnode->status.data_version);
>   inode->i_mapping->a_ops = _fs_aops;
>  
>   read_sequnlock_excl(>cb_lock);
> @@ -218,7 +218,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const 
> char *dev_name,
>   inode->i_ctime.tv_nsec  = 0;
>   inode->i_atime  = inode->i_mtime = inode->i_ctime;
>   inode->i_blocks = 0;
> - inode->i_version= 0;
> + inode_set_iversion_raw(inode, 0);
>   inode->i_generation = 0;
>  
>   set_bit(AFS_VNODE_PSEUDODIR, >flags);

TBH, we could just remove the i_version handling here. Nothing ever
looks at the i_version field in AFS inodes.
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 13/19] ufs: use new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/ufs/dir.c   | 8 
 fs/ufs/inode.c | 2 +-
 fs/ufs/super.c | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 2edc1755b7c5..547c4d4c4db8 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -47,7 +47,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (pos+len > dir->i_size) {
i_size_write(dir, pos+len);
@@ -428,7 +428,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
unsigned flags = UFS_SB(sb)->s_flags;
 
UFSD("BEGIN\n");
@@ -455,8 +455,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
offset = ufs_validate_entry(sb, kaddr, offset, 
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (struct ufs_dir_entry *)(kaddr+offset);
limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1);
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index afb601c0dda0..58cb6dfb116d 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -693,7 +693,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned 
long ino)
if (err)
goto bad_inode;
 
-   inode->i_version++;
+   inode_inc_iversion(inode);
ufsi->i_lastfrag =
(inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift;
ufsi->i_dir_start_lookup = 0;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 4d497e9c6883..7dee0b07a571 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1440,7 +1440,7 @@ static struct inode *ufs_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
 
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
seqlock_init(>meta_lock);
mutex_init(>truncate_mutex);
return >vfs_inode;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 14/19] xfs: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
 fs/xfs/xfs_icache.c   | 4 ++--
 fs/xfs/xfs_inode.c| 2 +-
 fs/xfs/xfs_inode_item.c   | 2 +-
 fs/xfs/xfs_trans_inode.c  | 2 +-
 5 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6b7989038d75..207db10d78d5 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -264,7 +264,8 @@ xfs_inode_from_disk(
to->di_flags= be16_to_cpu(from->di_flags);
 
if (to->di_version == 3) {
-   inode->i_version = be64_to_cpu(from->di_changecount);
+   inode_set_iversion_queried(inode,
+  be64_to_cpu(from->di_changecount));
to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
to->di_flags2 = be64_to_cpu(from->di_flags2);
@@ -314,7 +315,7 @@ xfs_inode_to_disk(
to->di_flags = cpu_to_be16(from->di_flags);
 
if (from->di_version == 3) {
-   to->di_changecount = cpu_to_be64(inode->i_version);
+   to->di_changecount = cpu_to_be64(inode_peek_iversion(inode));
to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
to->di_flags2 = cpu_to_be64(from->di_flags2);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 43005fbe8b1e..11ac85a75dc2 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -293,14 +293,14 @@ xfs_reinit_inode(
int error;
uint32_tnlink = inode->i_nlink;
uint32_tgeneration = inode->i_generation;
-   uint64_tversion = inode->i_version;
+   uint64_tversion = inode_peek_iversion(inode);
umode_t mode = inode->i_mode;
 
error = inode_init_always(mp->m_super, inode);
 
set_nlink(inode, nlink);
inode->i_generation = generation;
-   inode->i_version = version;
+   inode_set_iversion_queried(inode, version);
inode->i_mode = mode;
return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 801274126648..be6d87980dd5 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -833,7 +833,7 @@ xfs_ialloc(
ip->i_d.di_flags = 0;
 
if (ip->i_d.di_version == 3) {
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
ip->i_d.di_flags2 = 0;
ip->i_d.di_cowextsize = 0;
ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ee5c3bf19ad..2ca02e7b3966 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -354,7 +354,7 @@ xfs_inode_to_log_dinode(
to->di_next_unlinked = NULLAGINO;
 
if (from->di_version == 3) {
-   to->di_changecount = inode->i_version;
+   to->di_changecount = inode_peek_iversion(inode);
to->di_crtime.t_sec = from->di_crtime.t_sec;
to->di_crtime.t_nsec = from->di_crtime.t_nsec;
to->di_flags2 = from->di_flags2;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index daa7615497f9..b41a92d18140 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -117,7 +117,7 @@ xfs_trans_log_inode(
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   VFS_I(ip)->i_version++;
+   inode_inc_iversion(VFS_I(ip));
flags |= XFS_ILOG_CORE;
}
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 15/19] IMA: switch IMA over to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 security/integrity/ima/ima_api.c  | 2 +-
 security/integrity/ima/ima_main.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index c7e8db0ea4c0..588d4c05eb1e 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -215,7 +215,7 @@ int ima_collect_measurement(struct integrity_iint_cache 
*iint,
 * which do not support i_version, support is limited to an initial
 * measurement/appraisal/audit.
 */
-   i_version = file_inode(file)->i_version;
+   i_version = inode_query_iversion(inode);
hash.hdr.algo = algo;
 
/* Initialize hash digest to 0's in case of failure */
diff --git a/security/integrity/ima/ima_main.c 
b/security/integrity/ima/ima_main.c
index 50b8254d..2ba3a12ff33c 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -128,7 +128,7 @@ static void ima_check_last_writer(struct 
integrity_iint_cache *iint,
inode_lock(inode);
if (atomic_read(>i_writecount) == 1) {
if (!IS_I_VERSION(inode) ||
-   (iint->version != inode->i_version) ||
+   inode_cmp_iversion(inode, iint->version) ||
(iint->flags & IMA_NEW_FILE)) {
iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
iint->measured_pcrs = 0;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 16/19] fs: only set S_VERSION when updating times if necessary

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.

If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 03102d6ef044..82d625203e34 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1634,17 +1634,24 @@ static int relatime_need_update(const struct path 
*path, struct inode *inode,
 int generic_update_time(struct inode *inode, struct timespec *time, int flags)
 {
int iflags = I_DIRTY_TIME;
+   bool dirty = false;
 
-   if (flags & S_ATIME)
+   if (flags & S_ATIME) {
inode->i_atime = *time;
+   dirty |= !(inode->i_sb->s_flags & SB_LAZYTIME);
+   }
if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
+   dirty |= inode_maybe_inc_iversion(inode, dirty);
+   if (flags & S_CTIME) {
inode->i_ctime = *time;
-   if (flags & S_MTIME)
+   dirty = true;
+   }
+   if (flags & S_MTIME) {
inode->i_mtime = *time;
+   dirty = true;
+   }
 
-   if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
+   if (dirty)
iflags |= I_DIRTY_SYNC;
__mark_inode_dirty(inode, iflags);
return 0;
@@ -1863,7 +1870,7 @@ int file_update_time(struct file *file)
if (!timespec_equal(>i_ctime, ))
sync_it |= S_CTIME;
 
-   if (IS_I_VERSION(inode))
+   if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
sync_it |= S_VERSION;
 
if (!sync_it)
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 12/19] ocfs2: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/ocfs2/dir.c  | 14 +++---
 fs/ocfs2/inode.c|  2 +-
 fs/ocfs2/namei.c|  2 +-
 fs/ocfs2/quota_global.c |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe6312ceff..fe2c430a7809 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -1174,7 +1174,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct 
inode *dir,
le16_add_cpu(>rec_len,
le16_to_cpu(de->rec_len));
de->inode = 0;
-   dir->i_version++;
+   inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, bh);
goto bail;
}
@@ -1729,7 +1729,7 @@ int __ocfs2_add_entry(handle_t *handle,
if (ocfs2_dir_indexed(dir))
ocfs2_recalc_free_list(dir, handle, lookup);
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, insert_bh);
retval = 0;
goto bail;
@@ -1775,7 +1775,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < i_size_read(inode) && i < offset; ) {
de = (struct ocfs2_dir_entry *)
(data->id_data + i);
@@ -1791,7 +1791,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
i += le16_to_cpu(de->rec_len);
}
ctx->pos = offset = i;
-   *f_version = inode->i_version;
+   *f_version = inode_query_iversion(inode);
}
 
de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
@@ -1869,7 +1869,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ocfs2_dir_entry *) (bh->b_data + 
i);
/* It's too expensive to do a full
@@ -1886,7 +1886,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   *f_version = inode->i_version;
+   *f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < i_size_read(inode)
@@ -1940,7 +1940,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 
*f_version,
  */
 int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
 {
-   u64 version = inode->i_version;
+   u64 version = inode_query_iversion(inode);
ocfs2_dir_foreach_blk(inode, , ctx, true);
return 0;
 }
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 1a1e0078ab38..71ce64665a18 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -302,7 +302,7 @@ void ocfs2_populate_inode(struct inode *inode, struct 
ocfs2_dinode *fe,
OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
 
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
inode->i_generation = le32_to_cpu(fe->i_generation);
inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
inode->i_mode = le16_to_cpu(fe->i_mode);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..c045826b716a 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -1520,7 +1520,7 @@ static int ocfs2_rename(struct inode *old_dir,
mlog_errno(status);
goto bail;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
if (S_ISDIR(new_inode->i_mode))
ocfs2_set_links_count(newfe, 0);
diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index b39d14cbfa34..e7595a63da43 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -289,7 +289,7 @@ ssize_t ocfs2_quota_write(struct super_block *sb, 

[PATCH v2 19/19] fs: handle inode->i_version more efficiently

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit.  If it's clear then we don't need to do anything if the update
isn't being forced.

If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 171 ++---
 1 file changed, 122 insertions(+), 49 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c234fac4bb77..9e3e8cf9a1e1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -639,7 +639,7 @@ struct inode {
struct hlist_head   i_dentry;
struct rcu_head i_rcu;
};
-   u64 i_version;
+   atomic64_t  i_version;
atomic_ti_count;
atomic_ti_dio_count;
atomic_ti_writecount;
@@ -2037,6 +2037,8 @@ static inline void inode_dec_link_count(struct inode 
*inode)
 }
 
 /*
+ * The inode->i_version field:
+ * ---
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
  * appear different to observers if there was a change to the inode's data or
@@ -2059,86 +2061,139 @@ static inline void inode_dec_link_count(struct inode 
*inode)
  * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
  * We consider these sorts of filesystems to have a kernel-managed i_version.
  *
+ * This implementation uses the low bit in the i_version field as a flag to
+ * track when the value has been queried. If it has not been queried since it
+ * was last incremented, we can skip the increment in most cases.
+ *
+ * In the event that we're updating the ctime, we will usually go ahead and
+ * bump the i_version anyway. Since that has to go to stable storage in some
+ * fashion, we might as well increment it as well.
+ *
  * Note that some filesystems (e.g. NFS and AFS) just use the field to store
- * a server-provided value (for the most part). For that reason, those
+ * a server-provided value for the most part. For that reason, those
  * filesystems do not set SB_I_VERSION. These filesystems are considered to
  * have a self-managed i_version.
+ *
+ * Persistently storing the i_version
+ * --
+ * Queries of the i_version field are not gated on them hitting the backing
+ * store. It's always possible that the host could crash after allowing
+ * a query of the value but before it has made it to disk.
+ *
+ * To mitigate this problem, filesystems should always use
+ * inode_set_iversion_queried when loading an existing inode from disk. This
+ * ensures that the next attempted inode increment will result in the value
+ * changing.
+ *
+ * Storing the value to disk therefore does not count as a query, so those
+ * filesystems should use inode_peek_iversion to grab the value to be stored.
+ * There is no need to flag the value as having been queried in that case.
  */
 
+/*
+ * We borrow the lowest bit in the i_version to use as a flag to tell whether
+ * it has been queried since we last incremented it. If it has, then we must
+ * increment it on the next change. After that, we can clear the flag and
+ * avoid incrementing it again until it has again been queried.
+ */
+#define I_VERSION_QUERIED_SHIFT(1)
+#define I_VERSION_QUERIED  (1ULL << (I_VERSION_QUERIED_SHIFT - 1))
+#define I_VERSION_INCREMENT(1ULL << I_VERSION_QUERIED_SHIFT)
+
 /**
  * inode_set_iversion_raw - set i_version to the specified raw value
  * @inode: inode to set
- * @new: new i_version 

[PATCH v2 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

If XFS_ILOG_CORE is already set then go ahead and increment it.

Signed-off-by: Jeff Layton 
---
 fs/xfs/xfs_trans_inode.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index b41a92d18140..18635c5b5dc6 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -110,15 +110,17 @@ xfs_trans_log_inode(
 
/*
 * First time we log the inode in a transaction, bump the inode change
-* counter if it is configured for this to occur. We don't use
-* inode_inc_version() because there is no need for extra locking around
-* i_version as we already hold the inode locked exclusively for
-* metadata modification.
+* counter if it is configured for this to occur. While we have the
+* inode locked exclusively for metadata modification, we can usually
+* avoid setting XFS_ILOG_CORE if no one has queried the value since
+* the last time it was incremented. If we have XFS_ILOG_CORE already
+* set however, then go ahead and bump the i_version counter
+* unconditionally.
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   inode_inc_iversion(VFS_I(ip));
-   flags |= XFS_ILOG_CORE;
+   if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+   flags |= XFS_ILOG_CORE;
}
 
tp->t_flags |= XFS_TRANS_DIRTY;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1dfa299a31a9..5d1956ad6d39 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6106,19 +6106,20 @@ static int btrfs_update_time(struct inode *inode, 
struct timespec *now,
 int flags)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
+   bool dirty = flags & ~S_VERSION;
 
if (btrfs_root_readonly(root))
return -EROFS;
 
if (flags & S_VERSION)
-   inode_inc_iversion(inode);
+   dirty |= inode_maybe_inc_iversion(inode, dirty);
if (flags & S_CTIME)
inode->i_ctime = *now;
if (flags & S_MTIME)
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
-   return btrfs_dirty_inode(inode);
+   return dirty ? btrfs_dirty_inode(inode) : 0;
 }
 
 /*
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 08/19] ext2: convert to new i_version API

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/ext2/dir.c   | 8 
 fs/ext2/super.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 987647986f47..c99f14fec3f3 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -92,7 +92,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
 
if (pos+len > dir->i_size) {
@@ -293,7 +293,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
return 0;
@@ -319,8 +319,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
offset = ext2_validate_entry(kaddr, offset, 
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7646818ab266..dd7c3c81d918 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -184,7 +184,7 @@ static struct inode *ext2_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
ei->i_block_alloc_info = NULL;
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
 #ifdef CONFIG_QUOTA
memset(>i_dquot, 0, sizeof(ei->i_dquot));
 #endif
@@ -1569,7 +1569,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, 
int type,
return err;
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 00/19] fs: rework and optimize i_version handling in filesystems

2017-12-16 Thread Jeff Layton
From: Jeff Layton 

v2:
- xfs should use inode_peek_iversion instead of inode_peek_iversion_raw
- rework file_update_time patch
- don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled
- better comments and documentation

tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it if no
one is looking at it. We can also clean up the code to prepare to
eventually expose this value via statx().

Note that this set relies on a few patches that are in other trees. The
full stack that I've been testing with is here:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
will also use it to optimize away some remeasurement if it's available.
NFS and AFS just use it to store an opaque change attribute from the
server.

Only btrfs, ext4, and xfs increment it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change. [1]

It turns out though that none of these users of i_version require that
it change on every change to the file. The only real requirement is that
it be different if something changed since the last time we queried for
it.

If we keep track of when something queries the value, we can avoid
bumping the counter and an on-disk update when nothing else has changed
if no one has queried it since it was last incremented.

This patchset changes the code to only bump the i_version counter when
it's strictly necessary, or when we're updating the inode metadata
anyway (e.g. when times change).

It takes the approach of converting the existing accessors of i_version
to use a new API, while leaving the underlying implementation mostly the
same.  The last patch then converts the existing implementation to keep
track of whether the value has been queried since it was last
incremented. It then uses that to avoid incrementing the counter when
it can.

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (the vastly common case).

I can see measurable performance gains on xfs and ext4 with iversion
enabled, when streaming small (4k) I/Os. 

btrfs shows some slight gain in testing, but not quite the magnitude
that xfs and ext4 show. I'm not sure why yet and would appreciate some
input from btrfs folks.

My goal is to get this into -next fairly soon. If it shows no problems
then we can look at merging it for 4.16, or 4.17 if all of the
prequisite patches are not yet merged.

Jeff Layton (19):
  fs: new API for handling inode->i_version
  fs: don't take the i_lock in inode_inc_iversion
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: only set S_VERSION when updating times if necessary
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was
changed
  fs: handle inode->i_version more efficiently

 fs/affs/amigaffs.c|   4 +-
 fs/affs/dir.c |   4 +-
 fs/affs/super.c   |   2 +-
 fs/afs/fsclient.c |   2 +-
 fs/afs/inode.c|   4 +-
 fs/btrfs/delayed-inode.c  |   6 +-
 fs/btrfs/inode.c  |  11 +-
 fs/btrfs/tree-log.c   |   3 +-
 fs/exofs/dir.c|   8 +-
 fs/exofs/super.c  |   2 +-
 fs/ext2/dir.c |   8 +-
 fs/ext2/super.c   |   4 +-
 fs/ext4/dir.c |   8 +-
 fs/ext4/inline.c  |   6 +-
 fs/ext4/inode.c   |  12 +-
 fs/ext4/ioctl.c   |   2 +-
 fs/ext4/namei.c   |   4 +-
 fs/ext4/super.c   |   2 +-
 fs/ext4/xattr.c   |   4 +-
 fs/fat/dir.c  |   2 +-
 fs/fat/inode.c|   8 +-
 fs/fat/namei_msdos.c  |   6 +-
 fs/fat/namei_vfat.c   |  20 +--
 fs/inode.c|  19 ++-
 fs/nfs/delegation.c  

[PATCH 2/7] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-16 Thread Tejun Heo
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that either the generation number
   and state changed if it lost the race or the completion will yield
   to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping
  spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
  of free_request.  Fixed.

Signed-off-by: Tejun Heo 
Cc: "jianchao.wang" 
Cc: Peter Zijlstra 
---
 block/blk-core.c   |   2 +
 block/blk-mq.c | 220 +
 block/blk-mq.h |  46 +++
 block/blk-timeout.c|   2 +-
 block/blk.h|   6 --
 include/linux/blk-mq.h |   1 +
 include/linux/blkdev.h |  23 ++
 7 files changed, 220 insertions(+), 80 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b888175..6034623 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
+   seqcount_init(>gstate_seq);
+   u64_stats_init(>aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf4fbb..abd5d01 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -483,6 +483,7 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
+   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
clear_bit(REQ_ATOM_STARTED, >atomic_flags);
clear_bit(REQ_ATOM_POLL_SLEPT, >atomic_flags);
if (rq->tag != -1)
@@ -530,6 +531,8 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
if (rq->rq_flags & RQF_STATS) {
@@ -557,6 +560,30 @@ static void 

[PATCHSET v3] blk-mq: reimplement timeout handling

2017-12-16 Thread Tejun Heo
Hello,

Changes from [v2]

- Possible extended looping around seqcount and u64_stat_sync fixed.

- Misplaced MQ_RQ_IDLE state setting fixed.

- RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout
  multiple times.

- s/queue_rq_src/srcu/ patch added.

- Other misc changes.

Changes from [v1]

- BLK_EH_RESET_TIMER handling fixed.

- s/request->gstate_seqc/request->gstate_seq/

- READ_ONCE() added to blk_mq_rq_udpate_state().

- Removed left over blk_clear_rq_complete() invocation from
  blk_mq_rq_timed_out().

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunatley, it contains quite a few holes.

It's pretty easy to make blk_mq_check_expired() terminate a later
instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patchset replaces the broken synchronization mechanism with a RCU
and generation number based one.  Please read the patch description of
the second path for more details.

This patchset contains the following six patches.

0001-blk-mq-protect-completion-path-with-RCU.patch
0002-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch
0003-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch
0004-blk-mq-make-blk_abort_request-trigger-timeout-path.patch
0005-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch
0006-blk-mq-remove-REQ_ATOM_STARTED.patch
0007-blk-mq-rename-blk_mq_hw_ctx-queue_rq_srcu-to-srcu.patch

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout-v3

diffstat follows.  Thanks.

 block/blk-core.c   |2 
 block/blk-mq-debugfs.c |4 
 block/blk-mq.c |  285 +
 block/blk-mq.h |   49 
 block/blk-timeout.c|   11 +
 block/blk.h|7 -
 include/linux/blk-mq.h |3 
 include/linux/blkdev.h |   25 
 8 files changed, 257 insertions(+), 129 deletions(-)

--
tejun

[v2] http://lkml.kernel.org/r/20171010155441.753966-1...@kernel.org
[v1] http://lkml.kernel.org/r/20171209192525.982030-1...@kernel.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/7] blk-mq: protect completion path with RCU

2017-12-16 Thread Tejun Heo
Currently, blk-mq protects only the issue path with RCU.  This patch
puts the completion path under the same RCU protection.  This will be
used to synchronize issue/completion against timeout by later patches,
which will also add the comments.

Signed-off-by: Tejun Heo 
---
 block/blk-mq.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1109747..acf4fbb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,11 +568,23 @@ static void __blk_mq_complete_request(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+   int srcu_idx;
 
if (unlikely(blk_should_fake_timeout(q)))
return;
-   if (!blk_mark_rq_complete(rq))
-   __blk_mq_complete_request(rq);
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+   rcu_read_lock();
+   if (!blk_mark_rq_complete(rq))
+   __blk_mq_complete_request(rq);
+   rcu_read_unlock();
+   } else {
+   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+   if (!blk_mark_rq_complete(rq))
+   __blk_mq_complete_request(rq);
+   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+   }
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/7] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

2017-12-16 Thread Tejun Heo
After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.

Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times.  This removes atomic bitops from hot paths too.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.

Signed-off-by: Tejun Heo 
Cc: "jianchao.wang" 
---
 block/blk-mq.c | 18 --
 block/blk-timeout.c|  1 +
 include/linux/blkdev.h |  2 ++
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 88baa82..47e722b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -607,14 +607,12 @@ void blk_mq_complete_request(struct request *rq)
 */
if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
rcu_read_lock();
-   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-   !blk_mark_rq_complete(rq))
+   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
rcu_read_unlock();
} else {
srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-   !blk_mark_rq_complete(rq))
+   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
}
@@ -665,8 +663,6 @@ void blk_mq_start_request(struct request *rq)
preempt_enable();
 
set_bit(REQ_ATOM_STARTED, >atomic_flags);
-   if (test_bit(REQ_ATOM_COMPLETE, >atomic_flags))
-   clear_bit(REQ_ATOM_COMPLETE, >atomic_flags);
 
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
@@ -817,6 +813,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
if (!test_bit(REQ_ATOM_STARTED, >atomic_flags))
return;
 
+   req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
+
if (ops->timeout)
ret = ops->timeout(req, reserved);
 
@@ -832,7 +830,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 */
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
-   blk_clear_rq_complete(req);
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -851,7 +848,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 
might_sleep();
 
-   if (!test_bit(REQ_ATOM_STARTED, >atomic_flags))
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) ||
+   !test_bit(REQ_ATOM_STARTED, >atomic_flags))
return;
 
/* read coherent snapshots of @rq->state_gen and @rq->deadline */
@@ -886,8 +884,8 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx 
*hctx,
 * now guaranteed to see @rq->aborted_gstate and yield.  If
 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
 */
-   if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
-   !blk_mark_rq_complete(rq))
+   if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
+   READ_ONCE(rq->gstate) == rq->aborted_gstate)
blk_mq_rq_timed_out(rq, reserved);
 }
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index d580af3..25c4ffa 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -209,6 +209,7 @@ void blk_add_timer(struct request *req)
req->timeout = q->rq_timeout;
 
req->deadline = jiffies + req->timeout;
+   req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
/*
 * Only the non-mq case needs to add the request to a protected list.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2d6fd11..13186a7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -123,6 +123,8 @@ typedef __u32 __bitwise req_flags_t;
 /* Look at ->special_vec for the actual data payload instead of the
bio chain. */
 #define RQF_SPECIAL_PAYLOAD((__force req_flags_t)(1 << 18))
+/* timeout is expired */
+#define RQF_MQ_TIMEOUT_EXPIRED ((__force req_flags_t)(1 << 19))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/7] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE

2017-12-16 Thread Tejun Heo
blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
REQ_ATOM_COMPLETE to determine the request state.  Both uses are
speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
equivalent results.  Replace the tests.  This will allow removing
REQ_ATOM_COMPLETE usages from blk-mq.

Signed-off-by: Tejun Heo 
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index abd5d01..643a38d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -95,8 +95,7 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
struct mq_inflight *mi = priv;
 
-   if (test_bit(REQ_ATOM_STARTED, >atomic_flags) &&
-   !test_bit(REQ_ATOM_COMPLETE, >atomic_flags)) {
+   if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
/*
 * index[0] counts the specific partition that was asked
 * for. index[1] counts the ones that are active on the
@@ -2974,7 +2973,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue 
*q,
 
hrtimer_init_sleeper(, current);
do {
-   if (test_bit(REQ_ATOM_COMPLETE, >atomic_flags))
+   if (test_bit(REQ_ATOM_STARTED, >atomic_flags) &&
+   blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
break;
set_current_state(TASK_UNINTERRUPTIBLE);
hrtimer_start_expires(, mode);
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu

2017-12-16 Thread Tejun Heo
The RCU protection has been expanded to cover both queueing and
completion paths making ->queue_rq_srcu a misnomer.  Rename it to
->srcu as suggested by Bart.

Signed-off-by: Tejun Heo 
Cc: Bart Van Assche 
---
 block/blk-mq.c | 22 +++---
 include/linux/blk-mq.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 724d340..872d4ee 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -219,7 +219,7 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 
queue_for_each_hw_ctx(q, hctx, i) {
if (hctx->flags & BLK_MQ_F_BLOCKING)
-   synchronize_srcu(hctx->queue_rq_srcu);
+   synchronize_srcu(hctx->srcu);
else
rcu = true;
}
@@ -611,10 +611,10 @@ void blk_mq_complete_request(struct request *rq)
__blk_mq_complete_request(rq);
rcu_read_unlock();
} else {
-   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+   srcu_idx = srcu_read_lock(hctx->srcu);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
-   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+   srcu_read_unlock(hctx->srcu, srcu_idx);
}
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -916,7 +916,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!(hctx->flags & BLK_MQ_F_BLOCKING))
has_rcu = true;
else
-   synchronize_srcu(hctx->queue_rq_srcu);
+   synchronize_srcu(hctx->srcu);
 
hctx->nr_expired = 0;
}
@@ -1279,9 +1279,9 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
} else {
might_sleep();
 
-   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+   srcu_idx = srcu_read_lock(hctx->srcu);
blk_mq_sched_dispatch_requests(hctx);
-   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+   srcu_read_unlock(hctx->srcu, srcu_idx);
}
 }
 
@@ -1718,9 +1718,9 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
might_sleep();
 
-   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+   srcu_idx = srcu_read_lock(hctx->srcu);
__blk_mq_try_issue_directly(hctx, rq, cookie, true);
-   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+   srcu_read_unlock(hctx->srcu, srcu_idx);
}
 }
 
@@ -2074,7 +2074,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
set->ops->exit_hctx(hctx, hctx_idx);
 
if (hctx->flags & BLK_MQ_F_BLOCKING)
-   cleanup_srcu_struct(hctx->queue_rq_srcu);
+   cleanup_srcu_struct(hctx->srcu);
 
blk_mq_remove_cpuhp(hctx);
blk_free_flush_queue(hctx->fq);
@@ -2147,7 +2147,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
goto free_fq;
 
if (hctx->flags & BLK_MQ_F_BLOCKING)
-   init_srcu_struct(hctx->queue_rq_srcu);
+   init_srcu_struct(hctx->srcu);
 
blk_mq_debugfs_register_hctx(q, hctx);
 
@@ -2436,7 +2436,7 @@ static int blk_mq_hw_ctx_size(struct blk_mq_tag_set 
*tag_set)
 {
int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
 
-   BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, queue_rq_srcu),
+   BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
   __alignof__(struct blk_mq_hw_ctx)) !=
 sizeof(struct blk_mq_hw_ctx));
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 460798d..8efcf49 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -66,7 +66,7 @@ struct blk_mq_hw_ctx {
 #endif
 
/* Must be the last member - see also blk_mq_hw_ctx_size(). */
-   struct srcu_struct  queue_rq_srcu[0];
+   struct srcu_struct  srcu[0];
 };
 
 struct blk_mq_tag_set {
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7] blk-mq: make blk_abort_request() trigger timeout path

2017-12-16 Thread Tejun Heo
With issue/complete and timeout paths now using the generation number
and state based synchronization, blk_abort_request() is the only one
which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate
path.  This patch makes blk_abort_request() piggyback on the timeout
path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Note that this makes blk_abort_request() asynchronous - it initiates
abortion but the actual termination will happen after a short while,
even when the caller owns the request.  AFAICS, SCSI and ATA should be
fine with that and I think mtip32xx and dasd should be safe but not
completely sure.  It'd be great if people who know the drivers take a
look.

Signed-off-by: Tejun Heo 
Cc: Asai Thambi SP 
Cc: Stefan Haberland 
Cc: Jan Hoeppner 
---
 block/blk-mq.c  | 2 +-
 block/blk-mq.h  | 2 --
 block/blk-timeout.c | 8 
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 643a38d..88baa82 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -800,7 +800,7 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cf01f6f..6b2d616 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -94,8 +94,6 @@ extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
-extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
-
 void blk_mq_release(struct request_queue *q);
 
 /**
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 6427be7..d580af3 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -156,12 +156,12 @@ void blk_timeout_work(struct work_struct *work)
  */
 void blk_abort_request(struct request *req)
 {
-   if (blk_mark_rq_complete(req))
-   return;
-
if (req->q->mq_ops) {
-   blk_mq_rq_timed_out(req, false);
+   req->deadline = jiffies;
+   mod_timer(>q->timeout, 0);
} else {
+   if (blk_mark_rq_complete(req))
+   return;
blk_delete_timer(req);
blk_rq_timed_out(req);
}
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/7] blk-mq: remove REQ_ATOM_STARTED

2017-12-16 Thread Tejun Heo
After the recent updates to use generation number and state based
synchronization, we can easily replace REQ_ATOM_STARTED usages by
adding an extra state to distinguish completed but not yet freed
state.

Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
left and is removed.

Signed-off-by: Tejun Heo 
---
 block/blk-mq-debugfs.c |  4 +---
 block/blk-mq.c | 37 -
 block/blk-mq.h |  1 +
 block/blk.h|  1 -
 4 files changed, 10 insertions(+), 33 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b56a4f3..8adc837 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -271,7 +271,6 @@ static const char *const cmd_flag_name[] = {
 #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name
 static const char *const rqf_name[] = {
RQF_NAME(SORTED),
-   RQF_NAME(STARTED),
RQF_NAME(QUEUED),
RQF_NAME(SOFTBARRIER),
RQF_NAME(FLUSH_SEQ),
@@ -295,7 +294,6 @@ static const char *const rqf_name[] = {
 #define RQAF_NAME(name) [REQ_ATOM_##name] = #name
 static const char *const rqaf_name[] = {
RQAF_NAME(COMPLETE),
-   RQAF_NAME(STARTED),
RQAF_NAME(POLL_SLEPT),
 };
 #undef RQAF_NAME
@@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void 
*data, bool reserved)
const struct show_busy_params *params = data;
 
if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-   test_bit(REQ_ATOM_STARTED, >atomic_flags))
+   blk_mq_rq_state(rq) != MQ_RQ_IDLE)
__blk_mq_debugfs_rq_show(params->m,
 list_entry_rq(>queuelist));
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 47e722b..724d340 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -483,7 +483,6 @@ void blk_mq_free_request(struct request *rq)
blk_put_rl(blk_rq_rl(rq));
 
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
-   clear_bit(REQ_ATOM_STARTED, >atomic_flags);
clear_bit(REQ_ATOM_POLL_SLEPT, >atomic_flags);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
@@ -531,6 +530,7 @@ static void __blk_mq_complete_request(struct request *rq)
int cpu;
 
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -621,7 +621,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-   return test_bit(REQ_ATOM_STARTED, >atomic_flags);
+   return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
@@ -640,7 +640,6 @@ void blk_mq_start_request(struct request *rq)
}
 
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-   WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, >atomic_flags));
 
/*
 * Mark @rq in-flight which also advances the generation number,
@@ -662,8 +661,6 @@ void blk_mq_start_request(struct request *rq)
write_seqcount_end(>gstate_seq);
preempt_enable();
 
-   set_bit(REQ_ATOM_STARTED, >atomic_flags);
-
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
 * Make sure space for the drain appears.  We know we can do
@@ -676,13 +673,9 @@ void blk_mq_start_request(struct request *rq)
 EXPORT_SYMBOL(blk_mq_start_request);
 
 /*
- * When we reach here because queue is busy, REQ_ATOM_COMPLETE
- * flag isn't set yet, so there may be race with timeout handler,
- * but given rq->deadline is just set in .queue_rq() under
- * this situation, the race won't be possible in reality because
- * rq->timeout should be set as big enough to cover the window
- * between blk_mq_start_request() called from .queue_rq() and
- * clearing REQ_ATOM_STARTED here.
+ * When we reach here because queue is busy, it's safe to change the state
+ * to IDLE without checking @rq->aborted_gstate because we should still be
+ * holding the RCU read lock and thus protected against timeout.
  */
 static void __blk_mq_requeue_request(struct request *rq)
 {
@@ -694,7 +687,7 @@ static void __blk_mq_requeue_request(struct request *rq)
wbt_requeue(q->rq_wb, >issue_stat);
blk_mq_sched_requeue_request(rq);
 
-   if (test_and_clear_bit(REQ_ATOM_STARTED, >atomic_flags)) {
+   if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
@@ -801,18 +794,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-   /*
-* We know that complete is set at this point. If STARTED 

Re: [PATCH v5 77/78] irqdomain: Convert to XArray

2017-12-16 Thread Marc Zyngier
On Fri, 15 Dec 2017 22:04:49 +,
Matthew Wilcox wrote:
> 
> From: Matthew Wilcox 
> 
> In a non-critical path, irqdomain wants to know how many entries are
> stored in the xarray, so add xa_count().  This is a pretty straightforward
> conversion; mostly just removing now-redundant locking.  The only thing
> of note is just how much simpler irq_domain_fix_revmap() becomes.
> 
> Signed-off-by: Matthew Wilcox 
> ---
>  include/linux/irqdomain.h | 10 --
>  include/linux/xarray.h|  1 +
>  kernel/irq/irqdomain.c| 39 ++-
>  lib/xarray.c  | 25 +
>  4 files changed, 40 insertions(+), 35 deletions(-)

I certainly welcome the simplification this provides.

As for xa_count(), I'm trying to convince myself that the code under
CONFIG_IRQ_DOMAIN_DEBUG is now completely superseded by
CONFIG_GENERIC_IRQ_DEBUGFS, and that there would be more value in
getting rid of it rather than keeping it on life support.

Until then, the 1:1 replacement is good enough.

Acked-by: Marc Zyngier 

M.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.14.3] btrfs out of space error

2017-12-16 Thread Ian Kumlien
On Sat, Dec 16, 2017 at 12:07 AM, Hans van Kranenburg
 wrote:
> On 12/15/2017 09:53 AM, Qu Wenruo wrote:
>> On 2017年12月15日 16:36, Ian Kumlien wrote:
>>> On Fri, Dec 15, 2017 at 6:34 AM, Roman Mamedov  wrote:
 On Fri, 15 Dec 2017 01:39:03 +0100
 Ian Kumlien  wrote:

> Hi,
>
> Running a 4.14.3 kernel, this just happened, but there should have
> been another 20 gigs or so available.
>
> The filesystem seems fine after a reboot though

 What are your mount options, and can you show the output of "btrfs fi
 df" and "btrfs fi us" for the filesystem? And what does
 "cat /sys/block/sdb/queue/rotational" return.
>>>
>>> It's a btrfs raid1 mirror of two ssd:s
>>>
>>> mount options was:
>>> defaults,acl,noatime,space_cache,autodefrag,compress=lzo
>>>
>>> btrfs fi df /
>>> Data, RAID1: total=459.25GiB, used=372.42GiB
>>> Data, single: total=8.00MiB, used=0.00B
>>> System, RAID1: total=8.00MiB, used=80.00KiB
>>> System, single: total=4.00MiB, used=0.00B
>>> Metadata, RAID1: total=6.00GiB, used=3.69GiB
>>
>> Both meta and data has a lot of space let.
>
> The real question is how fragmented that free space is.

Somewhat fragmented i'd say

> Here's a good starting point to read up about the -o ssd behavior pre 4.14:
>
> https://www.spinics.net/lists/linux-btrfs/msg70622.html

Will do

[--8<--]

>>> Unallocated:
>>>/dev/sdb2   24.00KiB
>>>/dev/sdc2   20.02MiB
>>
>> Well, at least no new chunk can be allocated.
>
> Exactly.

After running:
btrfs balance start -dusage=50 /

a few times (first one failed but freed 6 gigs) it's now:
Unallocated:
   /dev/sdb2   13.25GiB
   /dev/sdc2   13.26GiB

>> In v4.15, Josef introduced a new inode reservation system, which I think
>> it would enhance metadata related reservation.
>>
>> If you're hitting the problem quite frequently, please consider to try
>> v4.15-rc* to see if it solves the problem.
>>
>> Despite of that, there should be no damage to your fs.
>> (except some unwritten data in buffer)
>>
>> Thanks,
>> Qu
>>
>>>
>>> And as expected:
>>> cat /sys/block/sdb/queue/rotational
>>> 0
>>>
 I wonder if it's the same old "ssd allocation scheme" problem, and no
 balancing done in a long time or at all.
>
> Looks like it. So, yay, you're on 4.14 already. Now just do a full
> balance of your entire filesystem, only once (data only, metadata not
> needed) and then you can forget about this again.

Any special filters etc needed?

>>> I had something similar happen on a laptop a while ago - took a while
>>> before i could get it back in order
>>> (in that case i think it was actually a oops --- it kept saying "no
>>> space left" and switched to read only even
>>> if you removed a lot of data, invalidated the space cache and so on)
>
> --
> Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.14.3] btrfs out of space error

2017-12-16 Thread Ian Kumlien
On Fri, Dec 15, 2017 at 9:53 AM, Qu Wenruo  wrote:
> On 2017年12月15日 16:36, Ian Kumlien wrote:
>> On Fri, Dec 15, 2017 at 6:34 AM, Roman Mamedov  wrote:
>>> On Fri, 15 Dec 2017 01:39:03 +0100
>>> Ian Kumlien  wrote:
>>>
 Hi,

 Running a 4.14.3 kernel, this just happened, but there should have
 been another 20 gigs or so available.

 The filesystem seems fine after a reboot though
>>>
>>> What are your mount options, and can you show the output of "btrfs fi
>>> df" and "btrfs fi us" for the filesystem? And what does
>>> "cat /sys/block/sdb/queue/rotational" return.
>>
>> It's a btrfs raid1 mirror of two ssd:s

[--8<--]

(sorry if you got a reply before - dang html mails)

> Well, at least no new chunk can be allocated.
>
>
> In v4.15, Josef introduced a new inode reservation system, which I think
> it would enhance metadata related reservation.

Humm... maybe when it has been trough some rc:s ;)

> If you're hitting the problem quite frequently, please consider to try
> v4.15-rc* to see if it solves the problem.
>
> Despite of that, there should be no damage to your fs.
> (except some unwritten data in buffer)

Yeah figured as much - did a full defrag and recompress (with zstd
this time) just to play with it

> Thanks,
> Qu
>
>>
>> And as expected:
>> cat /sys/block/sdb/queue/rotational
>> 0
>>
>>> I wonder if it's the same old "ssd allocation scheme" problem, and no
>>> balancing done in a long time or at all.
>>
>> I had something similar happen on a laptop a while ago - took a while
>> before i could get it back in order
>> (in that case i think it was actually a oops --- it kept saying "no
>> space left" and switched to read only even
>> if you removed a lot of data, invalidated the space cache and so on)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.14.3] btrfs out of space error

2017-12-16 Thread Ian Kumlien
On Fri, Dec 15, 2017 at 5:01 AM, Chris Murphy  wrote:
> On Thu, Dec 14, 2017 at 5:39 PM, Ian Kumlien  wrote:
>> Hi,
>>
>> Running a 4.14.3 kernel, this just happened, but there should have
>> been another 20 gigs or so available.
>>
>> The filesystem seems fine after a reboot though
>>
>> [1070034.614893] [ cut here ]
>> [1070034.614899] WARNING: CPU: 4 PID: 18634 at fs/btrfs/inode.c:4647
>> btrfs_truncate_inode_items+0xea5/0xfb0
>> [1070034.614899] Modules linked in: mlx4_en chaoskey amdgpu mfd_core 
>> mlx4_core
>> [1070034.614904] CPU: 4 PID: 18634 Comm: TaskSchedulerFo Not tainted 4.14.3 
>> #165
>> [1070034.614905] Hardware name: To be filled by O.E.M. To be filled by
>> O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
>> [1070034.614906] task: 8bb0f2332c00 task.stack: 9630064c4000
>> [1070034.614907] RIP: 0010:btrfs_truncate_inode_items+0xea5/0xfb0
>> [1070034.614907] RSP: 0018:9630064c7cc0 EFLAGS: 00010282
>> [1070034.614909] RAX: 0026 RBX: 8bb11348a070 RCX:
>> 
>> [1070034.614909] RDX: 8bb13ed147b0 RSI: 8bb13ed0c9d8 RDI:
>> 8bb13ed0c9d8
>> [1070034.614910] RBP: 8baf61905c70 R08: 0001 R09:
>> 05f1
>> [1070034.614910] R10: 1000 R11:  R12:
>> 8bb1138e0800
>> [1070034.614911] R13: 01e6 R14: 8baf60610ee0 R15:
>> 00eb
>> [1070034.614912] FS:  7fad68b7f700() GS:8bb13ed0()
>> knlGS:
>> [1070034.614912] CS:  0010 DS:  ES:  CR0: 80050033
>> [1070034.614913] CR2: 1c7e020d CR3: 000180a19000 CR4:
>> 000406e0
>> [1070034.614913] Call Trace:
>> [1070034.614918]  btrfs_truncate+0x11d/0x2c0
>> [1070034.614919]  btrfs_setattr+0x27b/0x3c0
>> [1070034.614921]  notify_change+0x29f/0x430
>> [1070034.614923]  do_truncate+0x5e/0x90
>> [1070034.614925]  do_sys_ftruncate.constprop.3+0x139/0x1a0
>> [1070034.614927]  entry_SYSCALL_64_fastpath+0x13/0x94
>> [1070034.614928] RIP: 0033:0x7fadd17c945e
>> [1070034.614929] RSP: 002b:7fad68b7e870 EFLAGS: 0246 ORIG_RAX:
>> 004d
>> [1070034.614930] RAX: ffda RBX: 23f3339a9710 RCX:
>> 7fadd17c945e
>> [1070034.614930] RDX: 00bb RSI: 00eb RDI:
>> 0304
>> [1070034.614931] RBP: 5604376778bd R08: 0023ff6cb6a33301 R09:
>> 7ffd8a3c0080
>> [1070034.614931] R10: 0023 R11: 0246 R12:
>> 23f3339a96c0
>> [1070034.614932] R13: 0001 R14: 7fad68b7e938 R15:
>> 56043dfdbaf8
>> [1070034.614933] Code: 48 8b 44 24 28 48 8b 40 60 f0 0f ba a8 08 ce 00
>> 00 02 72 19 83 7c 24 58 fb 74 12 8b 74 24 58 48 c7 c7 e8 3b 22 93 e8
>> 06 b5 c7 ff <0f> ff 8b 4c 24 58 ba 27 12 00 00 48 8b 7c 24 28 48 c7 c6
>> 30 0b
>> [1070034.614950] ---[ end trace c8eff7895ddacab0 ]---
>> [1070034.614951] BTRFS: error (device sdb2) in
>> btrfs_truncate_inode_items:4647: errno=-28 No space left
>> [1070034.614954] BTRFS info (device sdb2): forced readonly
>> [1070034.616386] BTRFS error (device sdb2): pending csums is 106496
>
> This last line is concerning, I interpret that as csum tree flush did
> not finish before the fs was forced read only. I'm not sure what the
> future consequences of that will be.
>
> You should probably run a scrub and also an offline readonly btrfs
> check and see what that turns up.

I did a full defrag and recompression of all files and it was all fine =)

> --
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html