Re: [PATCH 4/7] ovl: add infrastructure for intercepting file ops

2016-11-24 Thread Amir Goldstein
On Thu, Nov 24, 2016 at 4:08 PM, Amir Goldstein  wrote:
> On Thu, Nov 24, 2016 at 3:51 PM, Miklos Szeredi  wrote:
>> On Thu, Nov 24, 2016 at 2:12 PM, Amir Goldstein  wrote:
>>> On Thu, Nov 24, 2016 at 2:03 PM, Miklos Szeredi  wrote:
 On Thu, Nov 24, 2016 at 12:52 PM, Amir Goldstein  
 wrote:
> On Thu, Nov 24, 2016 at 12:55 PM, Miklos Szeredi  
> wrote:

>> +   /*
>> +* These should be intercepted, but they are very 
>> unlikely to be
>> +* a problem in practice.  Leave them alone for now.
>
> It could also be handled in vfs helpers.
> Since these ops all start with establishing that src and dest are on
> the same sb,
> then the cost of copy up of src is the cost of clone_file_range from
> lower to upper,
> so it is probably worth to copy up src and leave those fops alone.
>
>> +*/
>> +   ofop->fops.copy_file_range = orig->copy_file_range;
>> +   ofop->fops.clone_file_range = orig->clone_file_range;
>> +   ofop->fops.dedupe_file_range = orig->dedupe_file_range;

 Not sure I understand.  Why should we copy up src?  Copy up is the
 problem not the solution.

>>>
>>> Maybe the idea is ill conceived, but the reasoning is:
>>> To avoid the corner case of cloning from a stale lower src,
>>> call d_real() in vfs helpers to always copy up src before cloning from it
>>> and pass the correct file onwards.
>>
>> Which correct file?  src is still the wrong one after calling d_real.
>> We need to clone-open src, just like we do in ovl_read_iter to get the
>> correct file.  But then what's the use of copying it up beforehand?
>>
>> We could move the whole logic into the vfs, but I don't really see the point.
>>

Here is a relevant use case (creating several clones),
although not directly related to ro/rw inconsistency, which
justified putting the logic in vfs.

X is a file in lower
lower is different fs then upper
upper supports clone/dedup/copy_range

for i in `seq 1 100`; do cp --reflink=auto X X${i}; done

With current code the src and destination files are on the same
mount (test in  ioctl_file_clone), but not on the same sb (test in
vfs_clone_file_range), so cp will fall back to 100 expensive data copies.

*If* instead we d_real() and clone-open src in start of vfs_clone_file_range
*after* verifying the dest file ops support clone, then we will get only one
expensive copy up and 100 cheap clones, so its a big win.

And for the case of src and dst inodes already on the same sb, we can
skip d_real() to avoid possible unneeded copy up, although a clone up
is going to be cheap anyway.

The so called worst case is that this was a one time clone (to X1),
but the cost in this case is not huge - 1 data copy up of X and 1 clone
X->X1 instead of just 1 data copy X->X1, so the difference is negligible.

Now it's true that this is heuristic, but arguably a good one.

Amir.


Re: [PATCH 4/7] ovl: add infrastructure for intercepting file ops

2016-11-24 Thread Amir Goldstein
On Thu, Nov 24, 2016 at 3:51 PM, Miklos Szeredi  wrote:
> On Thu, Nov 24, 2016 at 2:12 PM, Amir Goldstein  wrote:
>> On Thu, Nov 24, 2016 at 2:03 PM, Miklos Szeredi  wrote:
>>> On Thu, Nov 24, 2016 at 12:52 PM, Amir Goldstein  wrote:
 On Thu, Nov 24, 2016 at 12:55 PM, Miklos Szeredi  
 wrote:
>>>
> +   /*
> +* These should be intercepted, but they are very 
> unlikely to be
> +* a problem in practice.  Leave them alone for now.

 It could also be handled in vfs helpers.
 Since these ops all start with establishing that src and dest are on
 the same sb,
 then the cost of copy up of src is the cost of clone_file_range from
 lower to upper,
 so it is probably worth to copy up src and leave those fops alone.

> +*/
> +   ofop->fops.copy_file_range = orig->copy_file_range;
> +   ofop->fops.clone_file_range = orig->clone_file_range;
> +   ofop->fops.dedupe_file_range = orig->dedupe_file_range;
>>>
>>> Not sure I understand.  Why should we copy up src?  Copy up is the
>>> problem not the solution.
>>>
>>
>> Maybe the idea is ill conceived, but the reasoning is:
>> To avoid the corner case of cloning from a stale lower src,
>> call d_real() in vfs helpers to always copy up src before cloning from it
>> and pass the correct file onwards.
>
> Which correct file?  src is still the wrong one after calling d_real.
> We need to clone-open src, just like we do in ovl_read_iter to get the
> correct file.  But then what's the use of copying it up beforehand?
>
> We could move the whole logic into the vfs, but I don't really see the point.
>
> I left these ops alone because there is some confusion in there about
> getting the f_op from the source or the destination file.

Yes, I saw that. Shouldn't be a problem to always use src->f_ops-> IMO.

Could you please push this work to a topic branch to make it easier for me to
pull and test?

Thanks,
Amir.

>  And while
> it doesn't matter normally (all regular files have the same f_op,
> regardless of open flags)  it does matter for overlayfs intercept,
> because overriding fops in the the dest file would mean additional
> complexity and  resources).  That could easily be fixed in the vfs:
> calling src->f_ops->foo works equally well, but I simply didn't want
> to bother with this.  We can return to it later.
>
> Thanks,
> Miklos


Re: [PATCH 4/7] ovl: add infrastructure for intercepting file ops

2016-11-24 Thread Miklos Szeredi
On Thu, Nov 24, 2016 at 2:12 PM, Amir Goldstein  wrote:
> On Thu, Nov 24, 2016 at 2:03 PM, Miklos Szeredi  wrote:
>> On Thu, Nov 24, 2016 at 12:52 PM, Amir Goldstein  wrote:
>>> On Thu, Nov 24, 2016 at 12:55 PM, Miklos Szeredi  
>>> wrote:
>>
 +   /*
 +* These should be intercepted, but they are very unlikely 
 to be
 +* a problem in practice.  Leave them alone for now.
>>>
>>> It could also be handled in vfs helpers.
>>> Since these ops all start with establishing that src and dest are on
>>> the same sb,
>>> then the cost of copy up of src is the cost of clone_file_range from
>>> lower to upper,
>>> so it is probably worth to copy up src and leave those fops alone.
>>>
 +*/
 +   ofop->fops.copy_file_range = orig->copy_file_range;
 +   ofop->fops.clone_file_range = orig->clone_file_range;
 +   ofop->fops.dedupe_file_range = orig->dedupe_file_range;
>>
>> Not sure I understand.  Why should we copy up src?  Copy up is the
>> problem not the solution.
>>
>
> Maybe the idea is ill conceived, but the reasoning is:
> To avoid the corner case of cloning from a stale lower src,
> call d_real() in vfs helpers to always copy up src before cloning from it
> and pass the correct file onwards.

Which correct file?  src is still the wrong one after calling d_real.
We need to clone-open src, just like we do in ovl_read_iter to get the
correct file.  But then what's the use of copying it up beforehand?

We could move the whole logic into the vfs, but I don't really see the point.

I left these ops alone because there is some confusion in there about
getting the f_op from the source or the destination file.  And while
it doesn't matter normally (all regular files have the same f_op,
regardless of open flags)  it does matter for overlayfs intercept,
because overriding fops in the the dest file would mean additional
complexity and  resources).  That could easily be fixed in the vfs:
calling src->f_ops->foo works equally well, but I simply didn't want
to bother with this.  We can return to it later.

Thanks,
Miklos


Re: [PATCH 4/7] ovl: add infrastructure for intercepting file ops

2016-11-24 Thread Amir Goldstein
On Thu, Nov 24, 2016 at 2:03 PM, Miklos Szeredi  wrote:
> On Thu, Nov 24, 2016 at 12:52 PM, Amir Goldstein  wrote:
>> On Thu, Nov 24, 2016 at 12:55 PM, Miklos Szeredi  wrote:
>
>>> +   /*
>>> +* These should be intercepted, but they are very unlikely 
>>> to be
>>> +* a problem in practice.  Leave them alone for now.
>>
>> It could also be handled in vfs helpers.
>> Since these ops all start with establishing that src and dest are on
>> the same sb,
>> then the cost of copy up of src is the cost of clone_file_range from
>> lower to upper,
>> so it is probably worth to copy up src and leave those fops alone.
>>
>>> +*/
>>> +   ofop->fops.copy_file_range = orig->copy_file_range;
>>> +   ofop->fops.clone_file_range = orig->clone_file_range;
>>> +   ofop->fops.dedupe_file_range = orig->dedupe_file_range;
>
> Not sure I understand.  Why should we copy up src?  Copy up is the
> problem not the solution.
>

Maybe the idea is ill conceived, but the reasoning is:
To avoid the corner case of cloning from a stale lower src,
call d_real() in vfs helpers to always copy up src before cloning from it
and pass the correct file onwards.

It would have been crazy if we suggested the same for read_iter(),
so why it may make sense for clone?
because once you got that far into the vfs_clone_range() helper,
with src from lower and dst from upper, it means they are on the same sb
that supports clone, so it is likely that copy up is going to use clone and
be relatively cheap.

Pretty twisted. I agree... and not sure if this logic holds for copy_range as
well. Anyway, that's what I meant.

Amir.


Re: [PATCH 4/7] ovl: add infrastructure for intercepting file ops

2016-11-24 Thread Miklos Szeredi
On Thu, Nov 24, 2016 at 12:52 PM, Amir Goldstein  wrote:
> On Thu, Nov 24, 2016 at 12:55 PM, Miklos Szeredi  wrote:

>> +   /*
>> +* These should be intercepted, but they are very unlikely 
>> to be
>> +* a problem in practice.  Leave them alone for now.
>
> It could also be handled in vfs helpers.
> Since these ops all start with establishing that src and dest are on
> the same sb,
> then the cost of copy up of src is the cost of clone_file_range from
> lower to upper,
> so it is probably worth to copy up src and leave those fops alone.
>
>> +*/
>> +   ofop->fops.copy_file_range = orig->copy_file_range;
>> +   ofop->fops.clone_file_range = orig->clone_file_range;
>> +   ofop->fops.dedupe_file_range = orig->dedupe_file_range;

Not sure I understand.  Why should we copy up src?  Copy up is the
problem not the solution.

Thanks,
Miklos


Re: [PATCH 4/7] ovl: add infrastructure for intercepting file ops

2016-11-24 Thread Amir Goldstein
On Thu, Nov 24, 2016 at 12:55 PM, Miklos Szeredi  wrote:
> Overlayfs needs to intercept a few file operations in order to properly
> handle the following corner case:
>
>  - file X is in overlayfs;
>
>  - X resides on a lower (read-only) layer
> the lower file is L;
>
>  - X is opened with O_RDONLY ->
> L is opened -> 'rofd';
>
>  - X is opened with O_WRONLY ->
> this results in L being copied up to the upper layer file U
> U is opened -> 'rwfd';
>
>  - write to 'rwfd' modifying U;
>
>  - read from 'rofd' gets data from unmodified L;
>
> While this is a rare occurrence, it has been known to cause problems.  To
> prevent such inconsistencies, allow intercepting read operations and fix
> them up in the unlikely case that it's needed.
>
> This patch adds an ovl_fops structure that is going to contain the pieces
> necessary for intercepting file operations:
>
>  a) a new file_operations structure, based on the original but changing
> those operations that need to be intercepted;
>
>  b) a pointer to the origin f_op.  Intercepted operations will normally
> just call the original, unless the file was copied up;
>
>  c) a hash table entry to hook this up for reuse in subsequent opens.
>
> The hash table is small (32 buckets) since there will only be a few
> different file operations used (basically the number of different
> filesystems used as lower layer of overlay).  The key to the has table is
> the original fops pointer.
>
> Insertion is done under a global mutex.  There will only be one insertion
> event for each unique underlying file_operations structure, so this is
> going to be rare.
>
> The hash table is accessed using rcu, so this will add minimal overhead to
> opens.  The override fops are removed only at module exit time, so we don't
> even have to be worry about read-side rcu locking.
>
> There are a few assumption this scheme makes about file operation
> structures supplied by filesystems:
>
>  1) the lifetime of the structures is equal to the lifetime of the
> filesystem module; ensure this by holding a ref to the
> sb->s_type->owner (normal filesystems don't fill in f_ops->owner).
> This means that once a filesystem has been used as lower layer of
> overlayfs its module cannot be removed until the overlay module itself
> is removed.  I don't believe this will be a problem.
>
>  2) Filesystems should not make use of f_op in any way except possibly
> replacing it in ->open.  Overlayfs itself breaks this assumption, but
> recursion is handled by ->d_real so this is not an issue.  Add a
> ->magic field to the override fops structure to verify that the
> filesystem didn't mess with file->f_op.
>
> Signed-off-by: Miklos Szeredi 
> ---
>  fs/overlayfs/inode.c | 144 
> ++-
>  fs/overlayfs/overlayfs.h |   2 +
>  fs/overlayfs/super.c |   1 +
>  3 files changed, 146 insertions(+), 1 deletion(-)
>
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index 1981a5514f51..3e26615c4697 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "overlayfs.h"
>
>  static int ovl_copy_up_truncate(struct dentry *dentry)
> @@ -334,16 +335,157 @@ static const struct inode_operations 
> ovl_symlink_inode_operations = {
> .update_time= ovl_update_time,
>  };
>
> +static DEFINE_READ_MOSTLY_HASHTABLE(ovl_fops_htable, 5);
> +static DEFINE_MUTEX(ovl_fops_mutex);
> +
> +#define OVL_FOPS_MAGIC 0x73706f462a6c764f
> +
> +struct ovl_fops {
> +   struct module *owner;
> +   struct file_operations fops;
> +   u64 magic;
> +   const struct file_operations *orig_fops;
> +   struct hlist_node entry;
> +};
> +
> +void ovl_cleanup_fops_htable(void)
> +{
> +   int bkt;
> +   struct hlist_node *tmp;
> +   struct ovl_fops *ofop;
> +
> +   hash_for_each_safe(ovl_fops_htable, bkt, tmp, ofop, entry) {
> +   module_put(ofop->owner);
> +   fops_put(ofop->orig_fops);
> +   kfree(ofop);
> +   }
> +}
> +
> +#define OVL_CALL_REAL_FOP(file, call) \
> +   ({ struct ovl_fops *__ofop = \
> +   container_of(file->f_op, struct ovl_fops, fops); \
> +  WARN_ON(__ofop->magic != OVL_FOPS_MAGIC) ? -EIO : \
> +  __ofop->orig_fops->call;  \
> +   })
> +
> +static struct ovl_fops *ovl_fops_find(const struct file_operations *orig)
> +{
> +   struct ovl_fops *ofop;
> +
> +   hash_for_each_possible_rcu(ovl_fops_htable, ofop, entry, (long) orig) 
> {
> +   if (ofop->orig_fops == orig)
> +   return ofop;
> +   }
> +   return NULL;
> +}
> +
> +static struct ovl_fops *ovl_fops_get(struct file *file)
> +{
> +   const struct file_operations *orig = file->f_op;
> +   struct ovl_fops

[PATCH 4/7] ovl: add infrastructure for intercepting file ops

2016-11-24 Thread Miklos Szeredi
Overlayfs needs to intercept a few file operations in order to properly
handle the following corner case:

 - file X is in overlayfs;

 - X resides on a lower (read-only) layer
the lower file is L;

 - X is opened with O_RDONLY ->
L is opened -> 'rofd';

 - X is opened with O_WRONLY ->
this results in L being copied up to the upper layer file U
U is opened -> 'rwfd';

 - write to 'rwfd' modifying U;

 - read from 'rofd' gets data from unmodified L;

While this is a rare occurrence, it has been known to cause problems.  To
prevent such inconsistencies, allow intercepting read operations and fix
them up in the unlikely case that it's needed.

This patch adds an ovl_fops structure that is going to contain the pieces
necessary for intercepting file operations:

 a) a new file_operations structure, based on the original but changing
those operations that need to be intercepted;

 b) a pointer to the origin f_op.  Intercepted operations will normally
just call the original, unless the file was copied up;

 c) a hash table entry to hook this up for reuse in subsequent opens.

The hash table is small (32 buckets) since there will only be a few
different file operations used (basically the number of different
filesystems used as lower layer of overlay).  The key to the has table is
the original fops pointer.

Insertion is done under a global mutex.  There will only be one insertion
event for each unique underlying file_operations structure, so this is
going to be rare.

The hash table is accessed using rcu, so this will add minimal overhead to
opens.  The override fops are removed only at module exit time, so we don't
even have to be worry about read-side rcu locking.

There are a few assumption this scheme makes about file operation
structures supplied by filesystems:

 1) the lifetime of the structures is equal to the lifetime of the
filesystem module; ensure this by holding a ref to the
sb->s_type->owner (normal filesystems don't fill in f_ops->owner).
This means that once a filesystem has been used as lower layer of
overlayfs its module cannot be removed until the overlay module itself
is removed.  I don't believe this will be a problem.

 2) Filesystems should not make use of f_op in any way except possibly
replacing it in ->open.  Overlayfs itself breaks this assumption, but
recursion is handled by ->d_real so this is not an issue.  Add a
->magic field to the override fops structure to verify that the
filesystem didn't mess with file->f_op.

Signed-off-by: Miklos Szeredi 
---
 fs/overlayfs/inode.c | 144 ++-
 fs/overlayfs/overlayfs.h |   2 +
 fs/overlayfs/super.c |   1 +
 3 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 1981a5514f51..3e26615c4697 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "overlayfs.h"
 
 static int ovl_copy_up_truncate(struct dentry *dentry)
@@ -334,16 +335,157 @@ static const struct inode_operations 
ovl_symlink_inode_operations = {
.update_time= ovl_update_time,
 };
 
+static DEFINE_READ_MOSTLY_HASHTABLE(ovl_fops_htable, 5);
+static DEFINE_MUTEX(ovl_fops_mutex);
+
+#define OVL_FOPS_MAGIC 0x73706f462a6c764f
+
+struct ovl_fops {
+   struct module *owner;
+   struct file_operations fops;
+   u64 magic;
+   const struct file_operations *orig_fops;
+   struct hlist_node entry;
+};
+
+void ovl_cleanup_fops_htable(void)
+{
+   int bkt;
+   struct hlist_node *tmp;
+   struct ovl_fops *ofop;
+
+   hash_for_each_safe(ovl_fops_htable, bkt, tmp, ofop, entry) {
+   module_put(ofop->owner);
+   fops_put(ofop->orig_fops);
+   kfree(ofop);
+   }
+}
+
+#define OVL_CALL_REAL_FOP(file, call) \
+   ({ struct ovl_fops *__ofop = \
+   container_of(file->f_op, struct ovl_fops, fops); \
+  WARN_ON(__ofop->magic != OVL_FOPS_MAGIC) ? -EIO : \
+  __ofop->orig_fops->call;  \
+   })
+
+static struct ovl_fops *ovl_fops_find(const struct file_operations *orig)
+{
+   struct ovl_fops *ofop;
+
+   hash_for_each_possible_rcu(ovl_fops_htable, ofop, entry, (long) orig) {
+   if (ofop->orig_fops == orig)
+   return ofop;
+   }
+   return NULL;
+}
+
+static struct ovl_fops *ovl_fops_get(struct file *file)
+{
+   const struct file_operations *orig = file->f_op;
+   struct ovl_fops *ofop = ovl_fops_find(orig);
+
+   if (likely(ofop))
+   return ofop;
+
+   mutex_lock(&ovl_fops_mutex);
+   ofop = ovl_fops_find(orig);
+   if (ofop)
+   goto out_unlock;
+
+   ofop = kzalloc(sizeof(struct ovl_fops), GFP_KERNEL);
+   if (ofop) {
+