[PATCH] fuse: relax inode_lock on fsync(2)
Sometimes handling FUSE_FSYNC in userspace may take a while. There is no need
to block incoming writes while userspace processes FUSE_FSYNC.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2401c5d..9d52a8a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -483,6 +483,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
 		goto out;
 
+	inode_unlock(inode);
+
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;
 	inarg.fsync_flags = datasync ? 1 : 0;
@@ -499,6 +501,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 		fc->no_fsync = 1;
 		err = 0;
 	}
+	return err;
 
 out:
 	inode_unlock(inode);
 	return err;
[PATCH v2] btrfs: limit async_work allocation and worker func duration
Problem statement: an unprivileged user who has read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh
DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt | grep "SV_$i$" | cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128     10144    10144  128  32  1 : tunables 0 0 0 : slabdata     317     317  0
kmalloc-128   9992352  9992352  128  32  1 : tunables 0 0 0 : slabdata  312261  312261  0
kmalloc-128  24226752 24226752  128  32  1 : tunables 0 0 0 : slabdata  757086  757086  0
kmalloc-128  42754240 42754240  128  32  1 : tunables 0 0 0 : slabdata 1336070 1336070  0

The huge numbers above come from an insane number of async_work-s allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So the machinery
works as expected while the list is almost empty. As soon as it gets bigger,
the worker func starts to process more than one item at a time, it takes
longer, and the chances of having more async_works queued than needed get
higher.

The problem above is worsened by another flaw of the delayed-inode
implementation: if async_work was queued in a throttling branch (number of
items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't quit
until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So it is possible
that the func occupies a CPU indefinitely (up to 30 sec in my experiments):
while the func is trying to drain the list, user activity may add more and
more items to the list.

The patch fixes both problems in a straightforward way: refuse to queue too
many works in btrfs_wq_run_delayed_node and bail out of the worker func if at
least BTRFS_DELAYED_WRITEBACK items have been processed.

Changed in v2: removed support of the thresh == NO_THRESHOLD case.

Signed-off-by: Maxim Patlasov
---
 fs/btrfs/async-thread.c  | 14 ++++++++++++++
 fs/btrfs/async-thread.h  |  1 +
 fs/btrfs/delayed-inode.c |  6 ++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index e0f071f..63d1977 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,20 @@ btrfs_work_owner(struct btrfs_work *work)
 	return work->wq->fs_info;
 }
 
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+	/*
+	 * We could compare wq->normal->pending with num_online_cpus()
+	 * to support "thresh == NO_THRESHOLD" case, but it requires
+	 * moving up atomic_inc/dec in thresh_queue/exec_hook. Let's
+	 * postpone it until someone needs the support of that case.
+	 */
+	if (wq->normal->thresh == NO_THRESHOLD)
+		return false;
+
+	return atomic_read(&wq->normal->pending) > wq->normal->thresh * 2;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index 8e52484..1f95973 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -84,4 +84,5 @@ void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max);
 void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(struct __btrfs_workqueue *wq);
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq);
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 3eeb9cd..de946dd 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1356,7 +1356,8 @@ release_path:
 		total_done++;
 
 		btrfs_release_prepared_delayed_node(delayed_node);
-		if (async_work->nr == 0 || total_done < async_work->nr)
+		if ((async_work->nr == 0 && total_done < BTRFS_DELAYED_WRITEBACK) ||
+		    total_done < async_work->nr)
 			goto again;
 
 free_path:
@@ -1372,7 +1373,8 @@ static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
 {
 	struct btrfs_async_delayed_work *async_
Re: [PATCH] btrfs: limit async_work allocation and worker func duration
On 12/12/2016 06:54 AM, David Sterba wrote:
> On Fri, Dec 02, 2016 at 05:51:36PM -0800, Maxim Patlasov wrote:
>> Problem statement: unprivileged user who has read-write access to more
>> than one btrfs subvolume may easily consume all kernel memory (eventually
>> triggering oom-killer).
>>
>> [reproducer script and slabinfo output snipped; see the original posting]
>>
>> The problem above is worsened by another flaw of the delayed-inode
>> implementation: if async_work was queued in a throttling branch (number
>> of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func
>> won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So
>> it is possible that the func occupies a CPU indefinitely (up to 30 sec
>> in my experiments): while the func is trying to drain the list, user
>> activity may add more and more items to the list.
>
> Nice analysis!
>
>> The patch fixes both problems in a straightforward way: refuse to queue
>> too many works in btrfs_wq_run_delayed_node and bail out of the worker
>> func if at least BTRFS_DELAYED_WRITEBACK items are processed.
>>
>> [...]
>>
>> +bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
>> +{
>> +	int thresh = wq->normal->thresh != NO_THRESHOLD ?
>> +		wq->normal->thresh : num_possible_cpus();
>
> Why not num_online_cpus? I vaguely remember we should be checking online
> cpus, but don't have the mails for reference. We use it elsewhere for
> spreading the work over cpus, but it's still not bullet proof regarding
> cpu onlining/offlining.

Thank you for the review, David!

I borrowed num_possible_cpus from the definition of WQ_UNBOUND_MAX_ACTIVE in
workqueue.h, but if btrfs uses num_online_cpus elsewhere, it must be OK as
well.

Another problem that I realized only now is that nobody increments/decrements
wq->normal->pending if thresh == NO_THRESHOLD, so the code looks pretty
misleading: it looks as though assigning thresh to num_possible_cpus (or
num_online_cpus) matters, but the next line compares it with "pending", which
is always zero.

As far as we don't have any NO_THRESHOLD users of
btrfs_workqueue_normal_congested for now, I tend to think it's better to add
a descriptive comment and simply return "false" from
btrfs_workqueue_normal_congested rather than trying to address some future
needs now. Please see v2 of the patch.

Thanks,
Maxim

> Otherwise looks good to me, as far as I can imagine the possible behaviour
> of the various async parameters just from reading the code.
[PATCH] btrfs: limit async_work allocation and worker func duration
Problem statement: an unprivileged user who has read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh
DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt | grep "SV_$i$" | cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128     10144    10144  128  32  1 : tunables 0 0 0 : slabdata     317     317  0
kmalloc-128   9992352  9992352  128  32  1 : tunables 0 0 0 : slabdata  312261  312261  0
kmalloc-128  24226752 24226752  128  32  1 : tunables 0 0 0 : slabdata  757086  757086  0
kmalloc-128  42754240 42754240  128  32  1 : tunables 0 0 0 : slabdata 1336070 1336070  0

The huge numbers above come from an insane number of async_work-s allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So the machinery
works as expected while the list is almost empty. As soon as it gets bigger,
the worker func starts to process more than one item at a time, it takes
longer, and the chances of having more async_works queued than needed get
higher.

The problem above is worsened by another flaw of the delayed-inode
implementation: if async_work was queued in a throttling branch (number of
items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't quit
until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So it is possible
that the func occupies a CPU indefinitely (up to 30 sec in my experiments):
while the func is trying to drain the list, user activity may add more and
more items to the list.

The patch fixes both problems in a straightforward way: refuse to queue too
many works in btrfs_wq_run_delayed_node and bail out of the worker func if at
least BTRFS_DELAYED_WRITEBACK items have been processed.

Signed-off-by: Maxim Patlasov
---
 fs/btrfs/async-thread.c  | 8 ++++++++
 fs/btrfs/async-thread.h  | 1 +
 fs/btrfs/delayed-inode.c | 6 ++++--
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index e0f071f..29f6252 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,14 @@ btrfs_work_owner(struct btrfs_work *work)
 	return work->wq->fs_info;
 }
 
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+	int thresh = wq->normal->thresh != NO_THRESHOLD ?
+		wq->normal->thresh : num_possible_cpus();
+
+	return atomic_read(&wq->normal->pending) > thresh * 2;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index 8e52484..1f95973 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -84,4 +84,5 @@ void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max);
 void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(struct __btrfs_workqueue *wq);
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq);
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 3eeb9cd..de946dd 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1356,7 +1356,8 @@ release_path:
 		total_done++;
 
 		btrfs_release_prepared_delayed_node(delayed_node);
-		if (async_work->nr == 0 || total_done < async_work->nr)
+		if ((async_work->nr == 0 && total_done < BTRFS_DELAYED_WRITEBACK) ||
+		    total_done < async_work->nr)
 			goto again;
 
 free_path:
@@ -1372,7 +1373,8 @@ static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
 {
 	struct btrfs_async_delayed_work *async_work;
 
-	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND)
+	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND ||
+	    btrfs_workqueue_normal_congested(fs_info->delayed_workers))
 		return 0;
 
 	async_work = kmalloc(sizeof(*async_work), GFP_NOFS);
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
On 11/22/2016 02:45 PM, Nikolaus Rath wrote:
> On Nov 16 2016, Maxim Patlasov wrote:
>> On 11/16/2016 12:19 PM, Nikolaus Rath wrote:
>>> On Nov 16 2016, Maxim Patlasov wrote:
>>>> [earlier discussion of max_background and "natural" limits snipped]
>>>>
>>>> Buffered AIO READs actually block in io_submit until fully processed.
>>>> So it's just another example of the "natural" limitation I described
>>>> above.
>>>
>>> Not sure I understand. What is it that's blocking? It can't be the
>>> userspace process, because then it wouldn't be asynchronous I/O...
>>
>> Surprise! Alas, the Linux kernel does NOT process buffered AIO reads in
>> an async manner. You can verify it yourself by strace-ing a simple
>> program looping over io_submit + io_getevents: for direct IO (as
>> expected) io_submit returns immediately while io_getevents waits for the
>> actual IO; in contrast, for buffered IO (surprisingly) io_submit waits
>> for the actual IO while io_getevents returns immediately. Presumably,
>> people are supposed to use mmap-ed reads/writes rather than buffered AIO.
>
> What about buffered, asynchronous writes when writeback caching is
> disabled? It sounds as if io_submit does not block (so userspace could
> create an unlimited number), nor can the kernel coalesce them (since
> writeback caching is disabled).

I've never looked closely at it. Do you have a particular use case or
concern?

> Thanks!
> -Nikolaus
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
On 11/16/2016 11:19 AM, Nikolaus Rath wrote:
> Hi Maxim,
>
> On Nov 15 2016, Maxim Patlasov wrote:
>> On 11/15/2016 08:18 AM, Nikolaus Rath wrote:
>>> Could someone explain to me the meaning of the max_background and
>>> congestion_threshold settings of the fuse module?
>>>
>>> At first I assumed that max_background specifies the maximum number of
>>> pending requests (i.e., requests that have been sent to userspace but
>>> for which no reply was received yet). But looking at fs/fuse/dev.c, it
>>> looks as if not every request is included in this number.
>>
>> fuse uses max_background for cases where the total number of
>> simultaneous requests of a given type is not limited by some other
>> natural means. AFAIU, these cases are: 1) async processing of direct
>> IO; 2) read-ahead.
>>
>> As an example of a "natural" limitation: when a userspace process
>> blocks on a sync direct IO read/write, the number of requests fuse
>> consumes is limited by the number of such processes (actually their
>> threads). In contrast, if userspace requests a 1GB direct IO
>> read/write, it would be unreasonable to issue 1GB/128K == 8192 fuse
>> requests simultaneously. That's where max_background steps in.
>
> Ah, that makes sense. Are these two cases meant as examples, or is that
> an exhaustive list? Because I would have thought that other cases should
> be writing of cached data (when writeback caching is enabled), and
> asynchronous I/O from userspace...?

I think that's an exhaustive list, but I may have missed something.

As for writing of cached data, that definitely doesn't go through background
requests. Here we rely on the flusher: fuse will allocate as many requests
as the flusher wants to write back.

Buffered AIO READs actually block in io_submit until fully processed. So
it's just another example of the "natural" limitation I described above.
Buffered AIO WRITEs go through the writeback mechanics anyway, so here again
we rely on the flusher behaving reasonably. And finally, direct AIO does go
through fuse background requests, as I wrote: "1) async processing of
direct IO".

> Also, I am not sure what you mean with async processing of direct I/O.
> Shouldn't direct I/O always go directly to the file-system? If so, how
> can it be processed asynchronously?

That's a nice optimization we implemented a few years ago: given an incoming
sync direct IO request of 1MB size, kernel fuse splits it into eight 128K
requests and starts processing them in an async manner, waiting for the
completion of all of them before completing the incoming 1MB request. This
boosts performance tremendously if the userspace fuse daemon is able to
efficiently process many requests "in parallel". This optimization is
implemented using background fuse requests. Otherwise, given a 1GB incoming
request, we would obediently allocate 8K fuse requests in one shot -- too
dangerous and not good for latency.

Thanks,
Maxim
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
On 11/16/2016 12:19 PM, Nikolaus Rath wrote:
> On Nov 16 2016, Maxim Patlasov wrote:
>> [earlier discussion of max_background and "natural" limits snipped]
>>
>> Buffered AIO READs actually block in io_submit until fully processed.
>> So it's just another example of the "natural" limitation I described
>> above.
>
> Not sure I understand. What is it that's blocking? It can't be the
> userspace process, because then it wouldn't be asynchronous I/O...

Surprise! Alas, the Linux kernel does NOT process buffered AIO reads in an
async manner. You can verify it yourself by strace-ing a simple program
looping over io_submit + io_getevents: for direct IO (as expected) io_submit
returns immediately while io_getevents waits for the actual IO; in contrast,
for buffered IO (surprisingly) io_submit waits for the actual IO while
io_getevents returns immediately. Presumably, people are supposed to use
mmap-ed reads/writes rather than buffered AIO.

>>> Also, I am not sure what you mean with async processing of direct I/O.
>>> Shouldn't direct I/O always go directly to the file-system? If so, how
>>> can it be processed asynchronously?
>>
>> That's a nice optimization we implemented a few years ago: given an
>> incoming sync direct IO request of 1MB size, kernel fuse splits it into
>> eight 128K requests and starts processing them in an async manner,
>> waiting for the completion of all of them before completing the
>> incoming 1MB request.
>
> I see. But why isn't that also done for regular (non-direct) IO?

Regular READs are helped by async read-ahead. Regular writes go through the
writeback mechanics: the flusher calls fuse_writepages() and the latter
submits as many async write requests as needed. Everything looks fine. (But,
as I wrote, those async requests are not under fuse max_background control.)

Thanks,
Maxim

> Thanks,
> -Nikolaus
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
Hi,

On 11/15/2016 08:18 AM, Nikolaus Rath wrote:
> Hello,
>
> Could someone explain to me the meaning of the max_background and
> congestion_threshold settings of the fuse module?
>
> At first I assumed that max_background specifies the maximum number of
> pending requests (i.e., requests that have been sent to userspace but for
> which no reply was received yet). But looking at fs/fuse/dev.c, it looks
> as if not every request is included in this number.

fuse uses max_background for cases where the total number of simultaneous
requests of a given type is not limited by some other natural means. AFAIU,
these cases are: 1) async processing of direct IO; 2) read-ahead.

As an example of a "natural" limitation: when a userspace process blocks on a
sync direct IO read/write, the number of requests fuse consumes is limited by
the number of such processes (actually their threads). In contrast, if
userspace requests a 1GB direct IO read/write, it would be unreasonable to
issue 1GB/128K == 8192 fuse requests simultaneously. That's where
max_background steps in.

> I also figured out that if the number of background requests (whatever
> they are) exceeds the congestion threshold, fuse calls
> set_bdi_congested() for the backing device. But what does this do?

AFAIU, this is a hint for the reclaimer to avoid a busy loop:

	/*
	 * If kswapd scans pages marked for immediate
	 * reclaim and under writeback (nr_immediate), it implies
	 * that pages are cycling through the LRU faster than
	 * they are written so also forcibly stall.
	 */
	if (nr_immediate && current_may_throttle())
		congestion_wait(BLK_RW_ASYNC, HZ/10);

> And does this become a no-op if there is no backing device?

current->backing_dev_info exists (and helps to control writeback) even if
there is no "real" backing device.

Thanks,
Maxim
[PATCH v2] fuse: fuse_flush must check mapping->flags for errors
fuse_flush() calls write_inode_now() which triggers writeback, but the actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set the error bit in mapping->flags. So we have to
check mapping->flags after fuse_sync_writes().

Changed in v2:
 - fixed silly typo: the check must be *after* fuse_sync_writes()

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad1da83..6cac3dc 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -417,6 +417,15 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	fuse_sync_writes(inode);
 	inode_unlock(inode);
 
+	if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+		err = -ENOSPC;
+	if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+		err = -EIO;
+	if (err)
+		return err;
+
 	req = fuse_get_req_nofail_nopages(fc, file);
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;
[PATCH] fuse: fuse_flush must check mapping->flags for errors
fuse_flush() calls write_inode_now() which triggers writeback, but the actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set the error bit in mapping->flags. So we have to
check mapping->flags after fuse_sync_writes().

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad1da83..b43401e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -413,6 +413,15 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (err)
 		return err;
 
+	if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+		err = -ENOSPC;
+	if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+		err = -EIO;
+	if (err)
+		return err;
+
 	inode_lock(inode);
 	fuse_sync_writes(inode);
 	inode_unlock(inode);
[PATCH] fuse: fsync() did not return IO errors
From: Alexey Kuznetsov

Due to the implementation of fuse writeback, filemap_write_and_wait_range()
does not catch errors. We have to do this directly after fuse_sync_writes().

Signed-off-by: Alexey Kuznetsov
Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9154f86..ad1da83 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -462,6 +462,21 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 		goto out;
 
 	fuse_sync_writes(inode);
 
+	/*
+	 * Due to implementation of fuse writeback
+	 * filemap_write_and_wait_range() does not catch errors.
+	 * We have to do this directly after fuse_sync_writes()
+	 */
+	if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+		err = -ENOSPC;
+	if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+		err = -EIO;
+	if (err)
+		goto out;
+
 	err = sync_inode_metadata(inode, 1);
 	if (err)
 		goto out;
[PATCH] tmpfs: shmem_fallocate must return ERESTARTSYS
shmem_fallocate() is restartable, so it can return ERESTARTSYS if
signal_pending(). Although the fallocate(2) manpage permits EINTR, the more
places that use ERESTARTSYS the better.

Signed-off-by: Maxim Patlasov
---
 mm/shmem.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 440e2a7..60e9c8a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2229,11 +2229,13 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		struct page *page;
 
 		/*
-		 * Good, the fallocate(2) manpage permits EINTR: we may have
-		 * been interrupted because we are using up too much memory.
+		 * Although fallocate(2) manpage permits EINTR, the more
+		 * places use ERESTARTSYS the better. If we have been
+		 * interrupted because we are using up too much memory,
+		 * oom-killer used fatal signal and we will die anyway.
 		 */
 		if (signal_pending(current))
-			error = -EINTR;
+			error = -ERESTARTSYS;
 		else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
 			error = -ENOMEM;
 		else
Re: [PATCH] fs/pnode.c: treat zero mnt_group_id-s as unequal
On 02/16/2016 11:54 AM, Al Viro wrote: On Tue, Feb 16, 2016 at 11:45:33AM -0800, Maxim Patlasov wrote: propagate_one(m) calculates "type" argument for copy_tree() like this: if (m->mnt_group_id == last_dest->mnt_group_id) { type = CL_MAKE_SHARED; } else { type = CL_SLAVE; if (IS_MNT_SHARED(m)) type |= CL_MAKE_SHARED; } The "type" argument then governs clone_mnt() behavior with respect to flags and mnt_master of new mount. When we iterate through a slave group, it is possible that both current "m" and "last_dest" are not shared (although, both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison above erroneously makes new mount shared and sets its mnt_master to last_source->mnt_master. The patch fixes the problem by handling zero mnt_group_id-s as though they are unequal. The similar problem exists in the implementation of "else" clause above when we have to ascend upward in the master/slave tree by calling: last_source = last_source->mnt_master; last_dest = last_source->mnt_parent; proper number of times. The last step is governed by "n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if both are zero. The patch fixes this case in the same way as the former one. Mind putting together a reproducer? There are two files attached: reproducer1.c and reproducer2.c. The former demonstrates the problem before applying the patch. The latter demonstrates why the first hunk of the patch is not enough. 
[root@f22ml ~]# reproducer1 main pid = 1496 monitor pid = 1497 child pid = 1498 grand-child pid = 1499 [root@f22ml ~]# grep "child" /proc/1496/mountinfo 243 144 0:37 /child /tmp/child rw shared:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1498/mountinfo 244 208 0:37 /child /tmp/child rw shared:127 master:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1499/mountinfo 245 240 0:37 /child /tmp/child rw master:127 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1497/mountinfo 246 176 0:37 /child /tmp/child rw shared:128 master:127 - tmpfs tmpfs rw,seclabel while expected info for 1497 would be: 246 176 0:37 /child /tmp/child rw master:93 - tmpfs tmpfs rw,seclabel Now, assuming that only the first hunk of the patch is applied: > -if (m->mnt_group_id == last_dest->mnt_group_id) { > +if (m->mnt_group_id && m->mnt_group_id == last_dest->mnt_group_id) { [root@f22ml ~]# reproducer2 main pid = 1506 monitor pid = 1507 child pid = 1508 grand-child pid = 1509 [root@f22ml ~]# grep "child" /proc/1506/mountinfo 243 144 0:37 /child /tmp/child rw shared:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1508/mountinfo 244 208 0:37 /child /tmp/child rw shared:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1509/mountinfo 245 240 0:37 /child /tmp/child rw master:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1507/mountinfo 246 176 0:37 /child /tmp/child rw master:0 - tmpfs tmpfs rw,seclabel while expected info for 1507 would be: 246 176 0:37 /child /tmp/child rw master:93 - tmpfs tmpfs rw,seclabel Thanks, Maxim #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include int main() { const char *child = "/tmp/child"; int ret; printf("main pid = %d\n", getpid()); /* make our own private playground ... 
	 */
	ret = unshare(CLONE_NEWNS);
	if (ret) {
		perror("unshare");
		exit(1);
	}

	ret = mount("none", "/", NULL, MS_REC|MS_PRIVATE, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	ret = mount("none", "/", NULL, MS_REC|MS_SHARED, NULL);
	if (ret) {
		perror("mount2");
		exit(1);
	}

	/* fork monitor ... */
	ret = fork();
	if (ret < 0) {
		perror("fork");
		exit(1);
	} else if (!ret) {
		printf("monitor pid = %d\n", getpid());

		ret = unshare(CLONE_NEWNS);
		if (ret) {
			perror("unshare in monitor");
			exit(1);
		}

		ret = mount("none", "/", NULL, MS_REC|MS_SHARED, NULL);
		if (ret) {
			perror("mount in monitor");
			exit(1);
		}

		ret = mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL);
		if (ret) {
			perror("mount2 in monitor");
			exit(1);
		}

		sleep(-1);
	}

	/* wait monitor to setup */
	sleep(1);

	/* fork child ... */
	ret = fork();
	if (ret < 0) {
		perror("fork");
		exit(1);
	} else if (!ret) {
		printf("child pid = %d\n", getpid());

		ret = unshare(CLONE_NEWNS);
		if (ret) {
			perror("unshare in child");
			exit(1);
		}

		ret = mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL);
		if (ret) {
			perror("mount in child");
[PATCH] fs/pnode.c: treat zero mnt_group_id-s as unequal
propagate_one(m) calculates the "type" argument for copy_tree() like this:

	if (m->mnt_group_id == last_dest->mnt_group_id) {
		type = CL_MAKE_SHARED;
	} else {
		type = CL_SLAVE;
		if (IS_MNT_SHARED(m))
			type |= CL_MAKE_SHARED;
	}

The "type" argument then governs clone_mnt() behavior with respect to flags and mnt_master of the new mount. When we iterate through a slave group, it is possible that both the current "m" and "last_dest" are not shared (although both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison above erroneously makes the new mount shared and sets its mnt_master to last_source->mnt_master. The patch fixes the problem by handling zero mnt_group_id-s as though they are unequal.

A similar problem exists in the implementation of the "else" clause above when we have to ascend upward in the master/slave tree by calling:

	last_source = last_source->mnt_master;
	last_dest = last_source->mnt_parent;

the proper number of times. The last step is governed by the "n->mnt_group_id != last_dest->mnt_group_id" condition, which may lie if both are zero. The patch fixes this case in the same way as the former one.
Signed-off-by: Maxim Patlasov
---
 fs/pnode.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 6367e1e..abf156a 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -212,7 +212,7 @@ static int propagate_one(struct mount *m)
 	/* skip if mountpoint isn't covered by it */
 	if (!is_subdir(mp->m_dentry, m->mnt.mnt_root))
 		return 0;
-	if (m->mnt_group_id == last_dest->mnt_group_id) {
+	if (m->mnt_group_id && m->mnt_group_id == last_dest->mnt_group_id) {
 		type = CL_MAKE_SHARED;
 	} else {
 		struct mount *n, *p;
@@ -223,7 +223,9 @@ static int propagate_one(struct mount *m)
 			last_source = last_source->mnt_master;
 			last_dest = last_source->mnt_parent;
 		}
-		if (n->mnt_group_id != last_dest->mnt_group_id) {
+		if (n->mnt_group_id != last_dest->mnt_group_id ||
+		    (!n->mnt_group_id && !last_dest->mnt_group_id)) {
 			last_source = last_source->mnt_master;
 			last_dest = last_source->mnt_parent;
 		}
Re: [PATCH] kvm: do not SetPageDirty from kvm_set_pfn_dirty for file mappings
On 02/12/2016 05:48 AM, Dmitry Monakhov wrote:
> Maxim Patlasov writes:
>> The patch solves the following problem: filesystem-specific routines involved in the ordinary writeback process BUG_ON page_buffers() because a page goes to writeback without buffer-heads attached.
>>
>> The way kvm_set_pfn_dirty calls SetPageDirty only works for anon mappings. For file mappings it is obviously incorrect - there page_mkwrite must be called. It's not easy to add a page_mkwrite call to kvm_set_pfn_dirty because there is no universal way to find the vma by pfn. But actually SetPageDirty may simply be skipped in those cases. Below is a justification.
>
> Confirm. I've hit that BUG_ON:
>
> [ 4442.219121] [ cut here ]
> [ 4442.219188] kernel BUG at fs/ext4/inode.c:2285!
> <...>
>
>> When the guest modifies the content of a page with a file mapping, kvm makes the page dirty by the following call-path:
>>
>>	vmx_handle_exit -> handle_ept_violation -> __get_user_pages ->
>>	page_mkwrite -> SetPageDirty
>>
>> Since then, the page is dirty from both the guest and the host point of view. Then the host makes writeback and marks the page as write-protected. So any further write from the guest triggers the call-path above again.
>
> Please elaborate the exact call-path which marks the host page.

wb_workfn -> wb_do_writeback -> wb_writeback -> __writeback_inodes_wb ->
writeback_sb_inodes -> __writeback_single_inode -> do_writepages ->
ext4_writepages -> mpage_prepare_extent_to_map -> mpage_process_page_bufs ->
mpage_submit_page -> clear_page_dirty_for_io -> page_mkclean -> rmap_walk ->
rmap_walk_file -> page_mkclean_one -> pte_wrprotect ->
pte_clear_flags(pte, _PAGE_RW)

Thanks,
Maxim

>> So, for file mappings, it's not possible to have new data written to a page inside the guest w/o a corresponding SetPageDirty on the host. This makes the explicit SetPageDirty from kvm_set_pfn_dirty redundant.
Signed-off-by: Maxim Patlasov
---
 virt/kvm/kvm_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a11cfd2..5a7d3fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1582,7 +1582,8 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 	if (!kvm_is_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (!PageReserved(page))
+		if (!PageReserved(page) &&
+		    (!page->mapping || PageAnon(page)))
 			SetPageDirty(page);
 	}
 }
___
Linux-nvdimm mailing list
linux-nvd...@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH] kvm: do not SetPageDirty from kvm_set_pfn_dirty for file mappings
The patch solves the following problem: filesystem-specific routines involved in the ordinary writeback process BUG_ON page_buffers() because a page goes to writeback without buffer-heads attached.

The way kvm_set_pfn_dirty calls SetPageDirty only works for anon mappings. For file mappings it is obviously incorrect - there page_mkwrite must be called. It's not easy to add a page_mkwrite call to kvm_set_pfn_dirty because there is no universal way to find the vma by pfn. But actually SetPageDirty may simply be skipped in those cases. Below is a justification.

When the guest modifies the content of a page with a file mapping, kvm makes the page dirty by the following call-path:

	vmx_handle_exit -> handle_ept_violation -> __get_user_pages ->
	page_mkwrite -> SetPageDirty

Since then, the page is dirty from both the guest and the host point of view. Then the host makes writeback and marks the page as write-protected. So any further write from the guest triggers the call-path above again.

So, for file mappings, it's not possible to have new data written to a page inside the guest w/o a corresponding SetPageDirty on the host. This makes the explicit SetPageDirty from kvm_set_pfn_dirty redundant.

Signed-off-by: Maxim Patlasov
---
 virt/kvm/kvm_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a11cfd2..5a7d3fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1582,7 +1582,8 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 	if (!kvm_is_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (!PageReserved(page))
+		if (!PageReserved(page) &&
+		    (!page->mapping || PageAnon(page)))
 			SetPageDirty(page);
 	}
 }
Re: [fuse-devel] [PATCH] fuse: break infinite loop in fuse_fill_write_pages()
On 10/02/2015 06:58 PM, Konstantin Khlebnikov wrote:
> On Sat, Oct 3, 2015 at 1:04 AM, Andrew Morton wrote:
>> On Fri, 2 Oct 2015 12:27:45 -0700 Maxim Patlasov wrote:
>>> On 10/02/2015 04:21 AM, Konstantin Khlebnikov wrote:
>>>> Bump. Add more people in CC.
>>>> On Mon, Sep 21, 2015 at 1:02 PM, Roman Gushchin wrote:
>>>>> I got a report about an unkillable task eating CPU. The further investigation shows that the problem is in the fuse_fill_write_pages() function. If the iov's first segment has zero length, we get an infinite loop, because we never reach the iov_iter_advance() call.
>>> iov_iter_copy_from_user_atomic() eventually calls iterate_iovec(). The latter silently consumes a zero-length iov. So I don't think "iov's first segment has zero length" can cause an infinite loop.
>> I'm suspecting it got stuck because the local variable `bytes' is zero, so the code does `goto again' repeatedly. Or maybe not. A more complete description of the bug would help.
> I suspect it is the same scenario as in 124d3b7041f: a zero-length segment is followed by a segment with an invalid address. iov_iter_fault_in_readable() checks only the first segment (zero-length); iov_iter_copy_from_user_atomic() skips it, fails at the second and returns zero -> goto again without skipping the zero-length segment. The patch calls iov_iter_advance() before goto again: we'll skip the zero-length segment at the second iteration and iov_iter_fault_in_readable() will detect the invalid address.

Makes sense to me. The patch looks fine.

Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [fuse-devel] [PATCH] fuse: break infinite loop in fuse_fill_write_pages()
On 10/02/2015 04:21 AM, Konstantin Khlebnikov wrote:
> Bump. Add more people in CC.
> On Mon, Sep 21, 2015 at 1:02 PM, Roman Gushchin wrote:
>> I got a report about an unkillable task eating CPU. The further investigation shows that the problem is in the fuse_fill_write_pages() function. If the iov's first segment has zero length, we get an infinite loop, because we never reach the iov_iter_advance() call.

iov_iter_copy_from_user_atomic() eventually calls iterate_iovec(). The latter silently consumes a zero-length iov. So I don't think "iov's first segment has zero length" can cause an infinite loop.

Thanks,
Maxim

>> Fix this by calling iov_iter_advance() before repeating an attempt to copy data from userspace. A similar problem is described in 124d3b7041f ("fix writev regression: pan hanging unkillable and un-straceable").
>>
>> Signed-off-by: Roman Gushchin
>> Cc: Miklos Szeredi
>> ---
>>  fs/fuse/file.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index f523f2f..195476a 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -1049,6 +1049,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
>>  		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
>>  		flush_dcache_page(page);
>>
>> +		iov_iter_advance(ii, tmp);
>>  		if (!tmp) {
>>  			unlock_page(page);
>>  			page_cache_release(page);
>> @@ -1061,7 +1062,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
>>  		req->page_descs[req->num_pages].length = tmp;
>>  		req->num_pages++;
>>
>> -		iov_iter_advance(ii, tmp);
>>  		count += tmp;
>>  		pos += tmp;
>>  		offset += tmp;
>> --
>> 2.4.3

--
fuse-devel mailing list
fuse-de...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fuse-devel
Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
On 03/18/2015 07:57 PM, Ming Lei wrote:
> On Thu, Mar 19, 2015 at 2:28 AM, Maxim Patlasov wrote:
>> On 01/13/2015 07:44 AM, Ming Lei wrote:
>>> Part of the patch is based on Dave's previous post. This patch submits I/O to the fs via kernel aio, and we obtain the following benefits:
>>> - double caching in both the loop file system and the backend file is avoided
>>> - context switches decrease a lot, and finally CPU utilization is decreased
>>> - cached memory decreases a lot
>>> One main side effect is that throughput is decreased when accessing the raw loop block (not through a filesystem) with kernel aio.
>>> This patch has passed the xfstests test (./check -g auto); both test and scratch devices are loop block, the file system is ext4. Below are two fio tests' results:
>>> 1. fio test inside ext4 file system over loop block
>>> 1) How to run
>>> - linux kernel base: 3.19.0-rc3-next-20150108 (loop-mq merged)
>>> - loop over SSD image 1 in ext4
>>> - linux psync, 16 jobs, size 200M, ext4 over loop block
>>> - test result: IOPS from fio output
>>> 2) Throughput result:
>>>   test cases      | randread | read  | randwrite | write
>>>   ----------------+----------+-------+-----------+------
>>>   base            | 16799    | 59508 | 31059     | 58829
>>>   base+kernel aio | 15480    | 64453 | 30187     | 57222
>> Ming, it's important to understand the overhead of the aio_kernel_() implementation. So could you please add test results for a raw SSD device to the table above next time (in v3 of your patches).
> What aio_kernel_() does is just call ->read_iter()/->write_iter(), so it should not introduce extra overhead. From a performance view, the effect is only from switching to O_DIRECT. With O_DIRECT, double caching can be avoided; meanwhile both page cache usage and CPU utilization can be decreased.

The way you reused the loop_queue_rq() --> queue_work() functionality (added earlier, by commit b5dd2f604) may affect performance of O_DIRECT operations. It can be easily demonstrated on a ram-drive, but measurements on real storage h/w would be more convincing.
Btw, when you wrote "linux psync, 16 jobs, size 200M, ext4 over loop block" -- does it mean that there were 16 threads in userspace submitting I/O concurrently? If yes, a throughput comparison for a single-job test would also be useful to look at.

Thanks,
Maxim
Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
On 01/13/2015 07:44 AM, Ming Lei wrote:
> Part of the patch is based on Dave's previous post. This patch submits I/O to the fs via kernel aio, and we obtain the following benefits:
> - double caching in both the loop file system and the backend file is avoided
> - context switches decrease a lot, and finally CPU utilization is decreased
> - cached memory decreases a lot
> One main side effect is that throughput is decreased when accessing the raw loop block (not through a filesystem) with kernel aio.
> This patch has passed the xfstests test (./check -g auto); both test and scratch devices are loop block, the file system is ext4. Below are two fio tests' results:
> 1. fio test inside ext4 file system over loop block
> 1) How to run
> - linux kernel base: 3.19.0-rc3-next-20150108 (loop-mq merged)
> - loop over SSD image 1 in ext4
> - linux psync, 16 jobs, size 200M, ext4 over loop block
> - test result: IOPS from fio output
> 2) Throughput result:
>   test cases      | randread | read  | randwrite | write
>   ----------------+----------+-------+-----------+------
>   base            | 16799    | 59508 | 31059     | 58829
>   base+kernel aio | 15480    | 64453 | 30187     | 57222

Ming, it's important to understand the overhead of the aio_kernel_() implementation. So could you please add test results for a raw SSD device to the table above next time (in v3 of your patches).

Jens, if you have some fast storage at hand, could you please measure IOPS for Ming's patches vs. a raw block device -- to ensure that the patches do not impose too low a limit on performance.

Thanks,
Maxim
Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec
On 12/31/2014 04:52 PM, Ming Lei wrote:
> On Thu, Jan 1, 2015 at 6:35 AM, Sedat Dilek wrote:
>> On Wed, Dec 31, 2014 at 10:52 PM, Dave Kleikamp wrote:
>>> On 12/31/2014 02:38 PM, Sedat Dilek wrote:
>>>> What has happened to that aio_loop patchset? Is it in Linux-next? (/me started to play with "block: loop: convert to blk-mq (v3)", so I recalled this other improvement.)
>>> It met with some harsh resistance, so I backed off on it. Then Al Viro got busy re-writing the iov_iter infrastructure and I put my patchset on the shelf to look at later. Then Ming Lei submitted a more up-to-date patchset: https://lkml.org/lkml/2014/8/6/175
>>> It looks like Ming is currently only pushing the first half of that patchset. I don't know what his plans are for the last three patches:
>>>   aio: add aio_kernel_() interface
>>>   fd/direct-io: introduce should_dirty for kernel aio
>>>   block: loop: support to submit I/O via kernel aio based
>> I tested with block-mq-v3 (for next-20141231) [1] and this looks promising [2]. Maybe Ming can say what the plan is with the missing parts.
> I have compared kernel aio based loop-mq (the other 3 aio patches against loop-mq v2, [1]) with loop-mq v3; the data doesn't look better than loop-mq v3. The kernel aio based approach requires direct I/O, and direct write, at least, shouldn't be as good as page cache write, IMO. So I think we need to investigate the kernel aio based approach further wrt. loop improvement.

A great advantage of kernel aio for the loop device is the avoidance of double caching: data from the page cache of the inner filesystem (mounted on the loop device) won't be cached again in the page cache of the outer filesystem (the one that keeps the image file of the loop device). So I don't think it's correct to compare the performance of aio based loop-mq with loop-mq v3. The aio based approach is OK as long as it doesn't introduce significant overhead compared with submitting bio-s straightforwardly from the loop device (or any other in-kernel user of kernel aio) to the host block device.
Thanks,
Maxim
Re: [PATCH 0/5] fuse: handle release synchronously (v4)
Hi Miklos,

On 10/09/2014 12:14 PM, Miklos Szeredi wrote:
> On Wed, Oct 1, 2014 at 1:28 PM, Maxim Patlasov wrote:
>> Given those patches must die, do you have any ideas how to resolve that "spurious EBUSY" problem?
> Check the "sync_release" branch of fuse:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git sync_release
>
> And same branch name for libfuse:
>
>   git://git.code.sf.net/p/fuse/fuse sync_release
>
> What it does is send RELEASE from ->flush() after checking the refcount of file (being careful about RCU accesses). Lightly tested, more testing, as well as review, is welcome.

Thank you very much for the efforts, highly appreciated! I've had a close look at your patches and found a few issues. Most of them can be easily fixed, but one puzzles me: the way you detect the last flush is not race free. Something as simple as:

	int main(int argc, char *argv[])
	{
		int fd = open(argv[1], O_RDWR);
		fork();
	}

may easily dive into fuse_try_sync_release() concurrently, and both calls observe file->f_count == 2. Then both return, falling back to sending the release asynchronously. This makes sync/async behaviour unpredictable even for well-behaved applications which don't do any esoteric things like racing i/o with close, or exiting while a descriptor is in-flight in a unix domain socket.

I cannot see any way to recognise the last flush without help from the VFS layer, can you?

Thanks,
Maxim
Re: [PATCH 0/5] fuse: handle release synchronously (v4)
On 10/01/2014 12:44 AM, Linus Torvalds wrote:
> On Tue, Sep 30, 2014 at 12:19 PM, Miklos Szeredi wrote:
>> What about flock(2), FL_SETLEASE, etc semantics (which are the sane ones, compared to the POSIX locks shit which mandates release of lock on each close(2) instead of "when all [duplicate] descriptors have been closed")? You have to do that from ->release(), there's no question about that.
> We do locks_remove_file() independently of ->release, but yes, it's basically done just before the last release. But it has the *exact* same semantics as release, including very much having nothing what-so-ever to do with "last close()". If the file descriptor is opened for other reasons (ie mmap, /proc accesses, whatever), then that delays locks_remove_file() the same way it delays release.
>
> None of that has *anything* to do with "synchronous". Thinking it does is wrong. And none of this has *anything* to do with the issue that Maxim pointed to in the mailing list web page, which was about write caches, and how you cannot (and MUST NOT) delay them until release time.

I apologise for mentioning that mailing list web page in my title message. This was really misleading; I should have thought about it in advance. Of course, write caches must be flushed in the scope of ->flush(), not ->release().

Let me set forth the use-case that led me to those patches. We implemented a FUSE-based distributed storage solution intended for keeping images of VMs (virtual machines) and their configuration files. The way VMs use images makes exclusive-open()er semantics very attractive: while a VM is using its image on a node, concurrent access from other nodes to that image is neither desirable nor necessary. So, we acquire an exclusive lease on FUSE_OPEN and release it on FUSE_RELEASE. This is quite natural and has obviously nothing to do with FUSE_FLUSH.
Following such semantics, there are two choices for handling open() if the file is currently exclusively locked by a remote node: (a) return EBUSY; (b) block until the remote node releases the file. We decided on (a), because (b) is very inconvenient in practice: most applications handle a failed open(2) properly, but very few are clever enough to spawn a separate thread doing the open() and kill it if the open() has not succeeded within a reasonable time.

The patches I sent do essentially one thing: they make FUSE ->release() wait for an ACK from userspace before returning. Without these patches, any attempt to test or use our storage in valid use-cases led to spurious EBUSY. For example, while migrating a VM from one node to another, we first close the image file on the source node, then try to open it on the destination node, but fail because FUSE_RELEASE has not been processed by userspace on the source node yet.

Given those patches must die, do you have any ideas how to resolve that "spurious EBUSY" problem?

Thanks,
Maxim
[PATCH 5/5] fuse: enable close_wait synchronous release
The patch enables the feature by passing 'true' to fuse_file_put in fuse_release_common. Previously, this was safe only in special cases when we are sure that a multi-threaded userspace won't deadlock if we synchronously send FUSE_RELEASE in the context of a read-ahead or write-back callback. Now, it's always safe because callbacks don't send requests to userspace anymore. The feature can be made privileged by means of the DISABLE_SYNC_RELEASE mount option implemented by the previous patch.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c4599c8..858280f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -334,7 +334,8 @@ void fuse_release_common(struct file *file, int opcode)
 	 * synchronous RELEASE is allowed (and desirable) in this case
 	 * because the server can be trusted not to screw up.
 	 */
-	fuse_file_put(ff, ff->fc->destroy_req != NULL);
+	fuse_file_put(ff, ff->fc->destroy_req != NULL ||
+		      must_release_synchronously(ff));
 }
 
 static int fuse_open(struct inode *inode, struct file *file)
[PATCH 4/5] fuse: add mount option to disable synchronous release
Synchronous release ensures that kernel fuse reports exactly the last fput() to userspace. However, nothing guarantees that the last fput() will happen in the context of close(2). So userspace applications must be prepared for the situation when close(2) has returned, but processing of FUSE_RELEASE has not yet been completed by the fuse daemon. To make testing such use cases easier, the patch introduces the DISABLE_SYNC_RELEASE mount option, which effectively masks out the FOPEN_SYNC_RELEASE flag. Also, the mount option can be used to protect from DoS scenarios with sync release: fusermount can be instrumented to use the option by default for unprivileged mounts (allowing the system administrator to configure it like "user_allow_other").

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c   | 3 ++-
 fs/fuse/fuse_i.h | 3 +++
 fs/fuse/inode.c  | 8 ++++++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8713e62..c4599c8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -292,7 +292,8 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode)
 
 static bool must_release_synchronously(struct fuse_file *ff)
 {
-	return ff->open_flags & FOPEN_SYNC_RELEASE;
+	return ff->open_flags & FOPEN_SYNC_RELEASE &&
+	       !(ff->fc->flags & FUSE_DISABLE_SYNC_RELEASE);
 }
 
 void fuse_release_common(struct file *file, int opcode)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e8e47a6..c5e2fca 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,9 @@
     doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER (1 << 1)
 
+/** Disable synchronous release */
+#define FUSE_DISABLE_SYNC_RELEASE (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 03246cd..86d47d0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -463,6 +463,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_DISABLE_SYNC_RELEASE,
 	OPT_ERR
 };
 
@@ -475,6 +476,7 @@ static
 const match_table_t tokens = {
 	{OPT_ALLOW_OTHER, "allow_other"},
 	{OPT_MAX_READ, "max_read=%u"},
 	{OPT_BLKSIZE, "blksize=%u"},
+	{OPT_DISABLE_SYNC_RELEASE, "disable_sync_release"},
 	{OPT_ERR, NULL}
 };
 
@@ -560,6 +562,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 			d->blksize = value;
 			break;
 
+		case OPT_DISABLE_SYNC_RELEASE:
+			d->flags |= FUSE_DISABLE_SYNC_RELEASE;
+			break;
+
 		default:
 			return 0;
 		}
@@ -583,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_puts(m, ",default_permissions");
 	if (fc->flags & FUSE_ALLOW_OTHER)
 		seq_puts(m, ",allow_other");
+	if (fc->flags & FUSE_DISABLE_SYNC_RELEASE)
+		seq_puts(m, ",disable_sync_release");
 	if (fc->max_read != ~0)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
[PATCH 3/5] fuse: wait for end of IO on release
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace.

Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put cannot be the last holders of the fuse file anymore. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 71
 1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7723b3f..8713e62 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -149,6 +149,17 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
 	}
 }
 
+/*
+ * Asynchronous callbacks may use it instead of fuse_file_put() because
+ * we guarantee that they are never last holders of ff. Hitting BUG() below
+ * will make clear any violation of the guarantee.
+ */
+static void __fuse_file_put(struct fuse_file *ff)
+{
+	if (atomic_dec_and_test(&ff->count))
+		BUG();
+}
+
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
		 bool isdir)
 {
@@ -279,6 +290,11 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode)
 	req->in.args[0].value = inarg;
 }
 
+static bool must_release_synchronously(struct fuse_file *ff)
+{
+	return ff->open_flags & FOPEN_SYNC_RELEASE;
+}
+
 void fuse_release_common(struct file *file, int opcode)
 {
 	struct fuse_file *ff;
@@ -302,6 +318,13 @@ void fuse_release_common(struct file *file, int opcode)
 	req->misc.release.path = file->f_path;
 
 	/*
+	 * No more in-flight asynchronous READ or WRITE requests if
+	 * fuse file release is synchronous
+	 */
+	if (must_release_synchronously(ff))
+		BUG_ON(atomic_read(&ff->count) != 1);
+
+	/*
 	 * Normally this will send the RELEASE request, however if
 	 * some asynchronous READ or WRITE requests are outstanding,
 	 * the sending will be delayed.
@@ -321,11 +344,34 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_file *ff = file->private_data;
 
 	/* see fuse_vma_close() for !writeback_cache case */
 	if (fc->writeback_cache)
 		write_inode_now(inode, 1);
 
+	if (must_release_synchronously(ff)) {
+		struct fuse_inode *fi = get_fuse_inode(inode);
+
+		/*
+		 * Must remove file from write list. Otherwise it is possible
+		 * this file will get more writeback from another files
+		 * rerouted via write_files.
+		 */
+		spin_lock(&ff->fc->lock);
+		list_del_init(&ff->write_entry);
+		spin_unlock(&ff->fc->lock);
+
+		wait_event(fi->page_waitq, atomic_read(&ff->count) == 1);
+
+		/*
+		 * spin_unlock_wait(&ff->fc->lock) would be natural here to
+		 * wait for threads just released ff to leave their critical
+		 * sections. But taking spinlock is the first thing
+		 * fuse_release_common does, so that this is unnecessary.
+		 */
+	}
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -823,8 +869,15 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 		unlock_page(page);
 		page_cache_release(page);
 	}
-	if (req->ff)
-		fuse_file_put(req->ff, false);
+	if (req->ff) {
+		if (must_release_synchronously(req->ff)) {
+			struct fuse_inode *fi = get_fuse_inode(req->inode);
+
+			__fuse_file_put(req->ff);
+			wake_up(&fi->page_waitq);
+		} else
+			fuse_file_put(req->ff, false);
+	}
 }
 
 struct fuse_fill_data {
@@ -851,6 +904,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data)
 	if (fc->async_read) {
 		req->ff = fuse_file_get(ff);
 		req->end = fuse_readpages_end;
+		req->inode = data->inode;
 		fuse_request_send_background(fc, req);
 	} else {
 		fuse_request_send(fc, req);
@@ -1502,7 +1556,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 	for (i = 0; i < req->num_pages; i++)
 		__free_page(req->pages[i]);
 
-	if (req->ff
[PATCH 1/5] fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags
The feature will be governed by the fuse file open flag FOPEN_SYNC_RELEASE. Userspace can enable it on a per-file basis in the same way as FOPEN_KEEP_CACHE or FOPEN_DIRECT_IO.

Signed-off-by: Maxim Patlasov
---
 include/uapi/linux/fuse.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 25084a0..607c45c 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -205,10 +205,13 @@ struct fuse_file_lock {
  * FOPEN_DIRECT_IO: bypass page cache for this open file
  * FOPEN_KEEP_CACHE: don't invalidate the data cache on open
  * FOPEN_NONSEEKABLE: the file is not seekable
+ * FOPEN_SYNC_RELEASE: synchronously release file on last fput,
+ * which, in turn, not always bound to fclose(2)!
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
 #define FOPEN_NONSEEKABLE	(1 << 2)
+#define FOPEN_SYNC_RELEASE	(1 << 3)
 
 /**
  * INIT request/reply flags
[PATCH 2/5] fuse: cosmetic rework of fuse_send_readpages
The patch changes the arguments of fuse_send_readpages to give it access to the inode (this will be used in the next patch of the patch-set). The change is cosmetic, no logic changed. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 912061a..7723b3f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -827,8 +827,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) fuse_file_put(req->ff, false); } -static void fuse_send_readpages(struct fuse_req *req, struct file *file) +struct fuse_fill_data { + struct fuse_req *req; + struct file *file; + struct inode *inode; + unsigned nr_pages; +}; + +static void fuse_send_readpages(struct fuse_fill_data *data) { + struct fuse_req *req = data->req; + struct file *file = data->file; struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; loff_t pos = page_offset(req->pages[0]); @@ -850,13 +859,6 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file) } } -struct fuse_fill_data { - struct fuse_req *req; - struct file *file; - struct inode *inode; - unsigned nr_pages; -}; - static int fuse_readpages_fill(void *_data, struct page *page) { struct fuse_fill_data *data = _data; @@ -872,7 +874,7 @@ static int fuse_readpages_fill(void *_data, struct page *page) req->pages[req->num_pages - 1]->index + 1 != page->index)) { int nr_alloc = min_t(unsigned, data->nr_pages, FUSE_MAX_PAGES_PER_REQ); - fuse_send_readpages(req, data->file); + fuse_send_readpages(data); if (fc->async_read) req = fuse_get_req_for_background(fc, nr_alloc); else @@ -925,7 +927,7 @@ static int fuse_readpages(struct file *file, struct address_space *mapping, err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); if (!err) { if (data.req->num_pages) - fuse_send_readpages(data.req, file); + fuse_send_readpages(&data); else fuse_put_request(fc, data.req); }
[PATCH 0/5] fuse: handle release synchronously (v4)
Hi, There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693 A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423 In short, the problem is that fuse_release (which is usually called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap during which the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file. The patch-set resolves the problem by making fuse_release synchronous: wait for the ACK from userspace for FUSE_RELEASE if the feature is ON. It must be emphasized that even if the feature is enabled (i.e. fuse_release is synchronous), nothing guarantees that fuse_release() will be called in the context of close(2). In fact, it may happen later, on the last fput(). However, there are still a lot of use cases when close(2) is synchronous, so the feature must be regarded as an optimization maximizing the chances of synchronous behaviour of close(2). To keep single-threaded userspace implementations happy, the patch-set ensures that by the time fuse_release_common calls fuse_file_put, no more in-flight I/O exists. Asynchronous fuse callbacks (like fuse_readpages_end) cannot trigger FUSE_RELEASE anymore. Hence, we'll never block in contexts other than close(). Changed in v2: - improved comments, commented spin_unlock_wait out according to Brian's suggestions. - rebased on the v3.15-rc8 tag of Linus' tree.
Changed in v3: - changed patch-set title from "close file synchronously" to "handle release synchronously" - renamed CLOSE_WAIT to SYNC_RELEASE to convey the essence of the feature ("synchronous release" instead of "wait on close") - enabled the feature on a per-file basis (instead of a global fuse_conn parameter) - added "disable_sync_release" mount option - rebased on the v3.17-rc1 tag of Linus' tree. Changed in v4: - removed redundant locking around __fuse_file_put() - removed a patch making the FUSE_RELEASE request uninterruptible. As Miklos pointed out, if the request is interrupted, then we should possibly allow that, effectively backgrounding the request. However, such backgrounding is not easy to implement without breaking the locking expectations of the userspace filesystem. Given that the problem has existed for fuseblk mounts for a long time and there is no reasonable solution now, I put it aside for the future. Thanks, Maxim --- Maxim Patlasov (5): fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags fuse: cosmetic rework of fuse_send_readpages fuse: wait for end of IO on release fuse: add mount option to disable synchronous release fuse: enable close_wait synchronous release fs/fuse/file.c | 97 ++--- fs/fuse/fuse_i.h | 3 + fs/fuse/inode.c | 8 include/uapi/linux/fuse.h | 3 + 4 files changed, 95 insertions(+), 16 deletions(-)
Re: [PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
On 09/16/2014 12:19 PM, Miklos Szeredi wrote: On Thu, Sep 11, 2014 at 6:14 PM, Maxim Patlasov wrote: I really need your help to proceed with this patch. Could you please explain what those places are where we should allow interruption? BTW, as for "just an optimization", I've recently noticed that __fput() calls locks_remove_file(). So the guarantees provided by the patch-set are on the same level as flock(2) behaviour. SIGKILL trumps that. At least that's what I think, and that's what NFS currently does as well, AFAICS. Also fuse really should distinguish fatal and non-fatal interruptions and handle them accordingly... And elaborate on this concern, please. Requests have two states where they stay for any significant amount of time: PENDING (queued to userspace) and SENT (in userspace). Currently we do the following for interrupted requests:
PENDING:
- non-fatal signal: do nothing
- fatal signal: dequeue and return -EINTR, unless force is set
SENT:
- send INTERRUPT request to userspace
This is fine, but fatal interrupts should be able to abort SENT and forced requests as well without having to wait for the userspace reply. This is what I was referring to. Thank you for the detailed clarification; that's much clearer now. If I understood it right, fatal signals must abort *any* request in *any* state. The only difference between forced and non-forced requests is that forced ones must eventually be delivered to userspace in all cases (even if they were in PENDING state when we were interrupted and we returned -EINTR). The thing that bothers me is the net result of these changes. Yes, the end user will be able to interrupt his app by SIGKILL if it is waiting in request_wait_answer(). But there are many other places where kernel fuse waits for something dependent on userspace. Do you think we have to make those places interruptible as well? This would not be difficult, were it not for i_mutex and s_vfs_rename_mutex being held by some operations.
For correctness, we can't release these while a reply is not received, since the locking expectations of the userspace filesystem would not be met. This can be solved by adding shadow locks to fuse that we hold onto even after the request is interrupted. Shadow locking does not seem to be enough. For example, we have to postpone FUSE_RELEASE until all interrupted synchronous I/O is ACKed by userspace. And similarly we shouldn't surprise userspace with FUSE_DESTROY if any requests are still in flight. Maybe there are other hidden dependencies that don't come to mind now. Thanks, Maxim
Re: [PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
On 08/22/2014 06:08 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:09 PM, Maxim Patlasov wrote: If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible. Otherwise request could be interrupted, but file association in user space remains. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |4 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cd55488..b92143a 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path); Some thought needs to go into this: if RELEASE is interrupted, then we should possibly allow that, effectively backgrounding the request. The synchronous nature is just an optimization and we really don't know where we are being interrupted, possibly in a place which very much *should* allow interruption. I really need your help to proceed with this patch. Could you please explain what those places are where we should allow interruption. BTW, as for "just an optimization", I've recently noticed that __fput() calls locks_remove_file(). So guarantees provided by the patch-set are on the same level as flock(2) behaviour. Also fuse really should distinguish fatal and non-fatal interruptions and handle them accordingly... And elaborate on this concern, please. Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] fuse: fix regression in fuse_get_user_pages()
On 09/10/2014 01:51 PM, Miklos Szeredi wrote: On Wed, Sep 03, 2014 at 02:10:23PM +0400, Maxim Patlasov wrote: Hi, The patchset fixes a regression introduced by the following commits: c7f3888ad7f0 ("switch iov_iter_get_pages() to passing maximal number of pages") c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()") Hmm, instead of reverting to passing maxbytes *instead* of maxpages, I think the right fix is to *add* the maxbytes argument. Just maxbytes alone doesn't have enough information in it. E.g. 4096 contiguous bytes could occupy 1 or 2 pages, depending on the starting offset. Yes, you are right. I missed that c7f3888ad7f0 fixed a subtle bug in get_pages_iovec(). So how about the following (untested) patch? Your patch works fine in my tests. Thanks, Maxim
[PATCH 2/2] fuse: fuse_get_user_pages(): do not pack more data than requested
The patch fixes a bug introduced by commit c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()"): The third argument of fuse_get_user_pages(), "nbytesp", refers to the number of bytes a caller asked to pack into the fuse request. This value may be less than the capacity of the fuse request or iov_iter. So fuse_get_user_pages() must ensure that *nbytesp won't grow. Before that commit, it was ensured by: > ret = get_user_pages_fast(user_addr, npages, !write, > &req->pages[req->num_pages]); > ... > npages = ret; > frag_size = min_t(size_t, frag_size, > (npages << PAGE_SHIFT) - offset); Now that the helper iov_iter_get_pages() performs all the hard work of extracting pages from the iov_iter, this can be done by passing a properly calculated "maxsize" to the helper. Reported-by: Werner Baumann Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 40ac262..1d2bb70 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1303,10 +1303,15 @@ static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii, while (nbytes < *nbytesp && req->num_pages < req->max_pages) { unsigned npages; size_t start; + ssize_t ret; unsigned n = req->max_pages - req->num_pages; - ssize_t ret = iov_iter_get_pages(ii, - &req->pages[req->num_pages], - n * PAGE_SIZE, &start); + size_t frag_size = fuse_get_frag_size(ii, *nbytesp - nbytes); + + frag_size = min_t(size_t, frag_size, n << PAGE_SHIFT); + + ret = iov_iter_get_pages(ii, + &req->pages[req->num_pages], + frag_size, &start); if (ret < 0) return ret;
[PATCH 1/2] vfs: switch iov_iter_get_pages() to passing maximal size
The patch reverts the commit c7f3888ad7f0 ("switch iov_iter_get_pages() to passing maximal number of pages") because FUSE does need the maxsize argument that existed before that commit. Signed-off-by: Maxim Patlasov --- fs/direct-io.c |2 +- fs/fuse/file.c |4 ++-- include/linux/uio.h |2 +- mm/iov_iter.c | 17 + 4 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index c311640..17e39b0 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -158,7 +158,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio) { ssize_t ret; - ret = iov_iter_get_pages(sdio->iter, dio->pages, DIO_PAGES, + ret = iov_iter_get_pages(sdio->iter, dio->pages, DIO_PAGES * PAGE_SIZE, &sdio->from); if (ret < 0 && sdio->blocks_available && (dio->rw & WRITE)) { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 912061a..40ac262 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1303,10 +1303,10 @@ static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii, while (nbytes < *nbytesp && req->num_pages < req->max_pages) { unsigned npages; size_t start; + unsigned n = req->max_pages - req->num_pages; ssize_t ret = iov_iter_get_pages(ii, &req->pages[req->num_pages], - req->max_pages - req->num_pages, - &start); + n * PAGE_SIZE, &start); if (ret < 0) return ret; diff --git a/include/linux/uio.h b/include/linux/uio.h index 48d64e6..09a7cff 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -84,7 +84,7 @@ unsigned long iov_iter_alignment(const struct iov_iter *i); void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov, unsigned long nr_segs, size_t count); ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages, - unsigned maxpages, size_t *start); + size_t maxsize, size_t *start); ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages, size_t maxsize, size_t *start); int iov_iter_npages(const struct iov_iter *i, int maxpages); diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index ab88dc0..7b5dbd1 100644 --- a/mm/iov_iter.c +++ b/mm/iov_iter.c @@ -310,7 +310,7 @@ void iov_iter_init(struct iov_iter *i, int direction, EXPORT_SYMBOL(iov_iter_init); static ssize_t get_pages_iovec(struct iov_iter *i, - struct page **pages, unsigned maxpages, + struct page **pages, size_t maxsize, size_t *start) { size_t offset = i->iov_offset; @@ -323,10 +323,10 @@ static ssize_t get_pages_iovec(struct iov_iter *i, len = iov->iov_len - offset; if (len > i->count) len = i->count; + if (len > maxsize) + len = maxsize; addr = (unsigned long)iov->iov_base + offset; len += *start = addr & (PAGE_SIZE - 1); - if (len > maxpages * PAGE_SIZE) - len = maxpages * PAGE_SIZE; addr &= ~(PAGE_SIZE - 1); n = (len + PAGE_SIZE - 1) / PAGE_SIZE; res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pages); @@ -588,14 +588,15 @@ static unsigned long alignment_bvec(const struct iov_iter *i) } static ssize_t get_pages_bvec(struct iov_iter *i, - struct page **pages, unsigned maxpages, + struct page **pages, size_t maxsize, size_t *start) { const struct bio_vec *bvec = i->bvec; size_t len = bvec->bv_len - i->iov_offset; if (len > i->count) len = i->count; - /* can't be more than PAGE_SIZE */ + if (len > maxsize) + len = maxsize; *start = bvec->bv_offset + i->iov_offset; get_page(*pages = bvec->bv_page); @@ -711,13 +712,13 @@ unsigned long iov_iter_alignment(const struct iov_iter *i) EXPORT_SYMBOL(iov_iter_alignment); ssize_t iov_iter_get_pages(struct iov_iter *i, - struct page **pages, unsigned maxpages, + struct page **pages, size_t maxsize, size_t *start) { if (i->type & ITER_BVEC) - return get_pages_bvec(i, pages, maxpages, start); + return get_pages_bvec(i, pages, maxsize, start); else - return get_pages_iovec(i, pages, maxpages, start); + return get_pages_iovec(i, pages, maxsize, start); } EXPORT_SYMBOL(iov_iter_get_pages); -- To unsu
[PATCH 0/2] fuse: fix regression in fuse_get_user_pages()
Hi, The patchset fixes a regression introduced by the following commits: c7f3888ad7f0 ("switch iov_iter_get_pages() to passing maximal number of pages") c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()") The regression manifests itself like this (thanks to Werner Baumann for reporting): > davfs2 uses the fuse kernel module directly (not using the fuse > userspace library). A user of davfs2 reported this problem > (http://savannah.nongnu.org/support/?108640): > > dd if=/dev/zero of=/mnt/owncloud/test.txt bs=20416 count=1 > works fine, but > dd if=/dev/zero of=/mnt/owncloud/test.txt bs=20417 count=1 > fails. Thanks, Maxim --- Maxim Patlasov (2): vfs: switch iov_iter_get_pages() to passing maximal size fuse: fuse_get_user_pages(): do not pack more data than requested fs/direct-io.c | 2 +- fs/fuse/file.c | 13 + include/linux/uio.h | 2 +- mm/iov_iter.c | 17 + 4 files changed, 20 insertions(+), 14 deletions(-)
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 08/28/2014 06:06 AM, Ming Lei wrote: On 8/28/14, Maxim Patlasov wrote: On 08/21/2014 09:44 AM, Ming Lei wrote: On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe wrote: Reworked a bit more: http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6 One big problem of the commit is that it is basically a serialized workqueue because of the single &hctx->run_work, and a per-req work_struct has to be used for a concurrent implementation. So it looks like the approach isn't flexible enough compared with doing that in the driver; any idea about how to fix that? I'm interested in the overall price of handling requests in a separate thread. I used the following fio script:
[global]
direct=1
bsrange=512-512
timeout=10
numjobs=1
ioengine=sync
filename=/dev/loop0 # or /dev/nullb0
[f1]
rw=randwrite
to compare the performance of: 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops If you enable BLK_MQ_F_WQ_CONTEXT, it isn't strange to see this result since blk-mq implements a serialized workqueue. BLK_MQ_F_WQ_CONTEXT is not in 3.17.0-rc1, so I couldn't enable it. 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops In my recent investigation and discussion with Jens, using a workqueue may introduce some regression for cases like loop over null_blk, tmpfs. And 270K vs. 380K is a bit similar to my result, and it was observed that context switches increased by more than 50% with the introduction of the workqueue. The figures are similar, but the comparison is not. Both 270K and 380K refer to configurations where no extra context switch is involved. I will post V3 which will use the previous kthread, with blk-mq & kernel aio, which should make full use of blk-mq and kernel aio, and won't introduce a regression for cases like the above. That would be great! Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) The kernel aio submit is very similar to user space's implementation, except for the block plug&unplug usage in the user space aio submit path. If it is blocked in aio_kernel_submit(), you should observe a similar thing with io_submit() too. Yes, I agree. My point was that there is room for optimization, as my experiments demonstrate. The question is whether it's worthwhile to complicate kernel aio (and fs-specific code too) for the sake of that optimization. In fact, in a simple case like a block fs on top of a loopback device on top of a file on another block fs, what kernel aio does for the loopback driver is a subtle way of converting incoming bio-s to outgoing bio-s. In case you know where the image file is placed (e.g. by fiemap), such a conversion may be done with zero overhead, and anything that makes the overhead noticeable is suspicious. And it is easy to imagine other use cases where that extra context switch is avoidable. Thanks, Maxim
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 8/28/14, Zach Brown wrote: On Wed, Aug 27, 2014 at 09:19:36PM +0400, Maxim Patlasov wrote: On 08/27/2014 08:29 PM, Benjamin LaHaise wrote: On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote: ... 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) might first try that non-blocking submit as a fast path and, only if it fails, fall back to queueing. What filesystem is the backing file for loop0 on? O_DIRECT access as Ming's patches use should be non-blocking, and if not, that's something to fix. I used loop0 directly on top of the null_blk driver (because my goal was to measure the overhead of processing requests in a separate thread). The relative overhead while doing nothing else. While zooming way down into micro benchmarks is fun and all, testing on an fs on brd might be more representative and so more compelling. The measurements on an fs on brd are even more outrageous (the same fio script I posted a few messages above):
1) Baseline, no loopback device involved:
fio on /dev/ram0: 467K iops
fio on ext4 over /dev/ram0: 378K iops
2) Loopback device from 3.17.0-rc1 with Ming's patches (v1) applied:
fio on /dev/loop0 over /dev/ram0: 10K iops
fio on ext4 over /dev/loop0 over /dev/ram0: 9K iops
3) The same as above, but avoiding the extra context switch (call loop_queue_work() directly from loop_queue_rq()):
fio on /dev/loop0 over /dev/ram0: 267K iops
fio on ext4 over /dev/loop0 over /dev/ram0: 223K iops
The problem is not about huge relative overhead while doing nothing else. It's rather about introducing extra latency (~100 microseconds on the commodity h/w I used) which might be noticeable on modern SSDs (and h/w RAIDs with caching). Thanks, Maxim
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 08/27/2014 08:29 PM, Benjamin LaHaise wrote: On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote: ... 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) might first try that non-blocking submit as a fast path and, only if it fails, fall back to queueing. What filesystem is the backing file for loop0 on? O_DIRECT access as Ming's patches use should be non-blocking, and if not, that's something to fix. I used loop0 directly on top of the null_blk driver (because my goal was to measure the overhead of processing requests in a separate thread). In the case of real-life filesystems, e.g. ext4, aio_kernel_submit() may easily block on something like bh_submit_read(), when the fs reads file metadata to calculate the offset on the block device from the position in the file. Thanks, Maxim
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 08/21/2014 09:44 AM, Ming Lei wrote: On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe wrote: Reworked a bit more: http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6 One big problem of the commit is that it is basically a serialized workqueue because of the single &hctx->run_work, and a per-req work_struct has to be used for a concurrent implementation. So it looks like the approach isn't flexible enough compared with doing that in the driver; any idea about how to fix that? I'm interested in the overall price of handling requests in a separate thread. I used the following fio script:
[global]
direct=1
bsrange=512-512
timeout=10
numjobs=1
ioengine=sync
filename=/dev/loop0 # or /dev/nullb0
[f1]
rw=randwrite
to compare the performance of: 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) might first try that non-blocking submit as a fast path and, only if it fails, fall back to queueing. Thanks, Maxim
[PATCH 3/6] fuse: wait for end of IO on release (v2)
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put cannot be the last holders of the fuse file anymore. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. Changed in v2 (thanks to Miklos): - removed redundant locking around __fuse_file_put() Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 71 1 file changed, 66 insertions(+), 5 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 7723b3f..8713e62 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -149,6 +149,17 @@ } } +/* + * Asynchronous callbacks may use it instead of fuse_file_put() because + * we guarantee that they are never last holders of ff. Hitting BUG() below + * will make clear any violation of the guarantee.
+ */ +static void __fuse_file_put(struct fuse_file *ff) +{ + if (atomic_dec_and_test(&ff->count)) + BUG(); +} + int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, bool isdir) { @@ -279,6 +290,11 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) req->in.args[0].value = inarg; } +static bool must_release_synchronously(struct fuse_file *ff) +{ + return ff->open_flags & FOPEN_SYNC_RELEASE; +} + void fuse_release_common(struct file *file, int opcode) { struct fuse_file *ff; @@ -302,6 +318,13 @@ void fuse_release_common(struct file *file, int opcode) req->misc.release.path = file->f_path; /* +* No more in-flight asynchronous READ or WRITE requests if +* fuse file release is synchronous +*/ + if (must_release_synchronously(ff)) + BUG_ON(atomic_read(&ff->count) != 1); + + /* * Normally this will send the RELEASE request, however if * some asynchronous READ or WRITE requests are outstanding, * the sending will be delayed. @@ -321,11 +344,34 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_file *ff = file->private_data; /* see fuse_vma_close() for !writeback_cache case */ if (fc->writeback_cache) write_inode_now(inode, 1); + if (must_release_synchronously(ff)) { + struct fuse_inode *fi = get_fuse_inode(inode); + + /* +* Must remove file from write list. Otherwise it is possible +* this file will get more writeback from another files +* rerouted via write_files. +*/ + spin_lock(&ff->fc->lock); + list_del_init(&ff->write_entry); + spin_unlock(&ff->fc->lock); + + wait_event(fi->page_waitq, atomic_read(&ff->count) == 1); + + /* +* spin_unlock_wait(&ff->fc->lock) would be natural here to +* wait for threads just released ff to leave their critical +* sections. But taking spinlock is the first thing +* fuse_release_common does, so that this is unnecessary. 
+*/ + } + fuse_release_common(file, FUSE_RELEASE); /* return value is ignored by VFS */ @@ -823,8 +869,15 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) unlock_page(page); page_cache_release(page); } - if (req->ff) - fuse_file_put(req->ff, false); + if (req->ff) { + if (must_release_synchronously(req->ff)) { + struct fuse_inode *fi = get_fuse_inode(req->inode); + + __fuse_file_put(req->ff); + wake_up(&fi->page_waitq); + } else + fuse_file_put(req->ff, false); + } } struct fuse_fill_data { @@ -851,6 +904,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data) if (fc->async_read) { req->ff = fuse_file_get(ff); req->end = fuse_readpages_end; + req->inode = data->inode; fuse_request_send_background(fc, req); } else { fuse_request_send(fc, req); @@ -1502,7 +1556,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) for (i = 0; i < req->
Re: [PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
On 08/22/2014 06:08 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:09 PM, Maxim Patlasov wrote: If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible. Otherwise request could be interrupted, but file association in user space remains. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |4 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cd55488..b92143a 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path); Some thought needs to go into this: if RELEASE is interrupted, then we should possibly allow that, effectively backgrounding the request. The synchronous nature is just an optimization and we really don't know where we are being interrupted, possibly in a place which very much *should* allow interruption. A fuse daemon who explicitly enables the feature (synchronous release) would definitely want non-interruptible behaviour of last fput. Otherwise, it would face the same problem that the feature tries to resolve: an application was killed and exited, but there is no way to determine why actual processing of RELEASE will be completed. As for fuseblk mounts, I'm not so sure. I believed the lack of force=1 was a bug and my patch fixes it. If you think it's safer to preserve old behaviour, I could set "force" conditionally. May be you could explain in more details why you think we should allow interruption somewhere. Any examples or use cases? Btw, fuse_flush also uses force=1. Do you concerns deal with it as well? 
Also fuse really should distinguish fatal and non-fatal interruptions and handle them accordingly... Do you think it's worth elaborating on this in the scope of the "synchronous release" feature? Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/6] fuse: enable close_wait synchronous release
On 08/22/2014 06:04 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:09 PM, Maxim Patlasov wrote: The patch enables the feature by passing 'true' to fuse_file_put in fuse_release_common. Previously, this was safe only in special cases when we were sure that a multi-threaded userspace wouldn't deadlock if we synchronously sent FUSE_RELEASE in the context of a read-ahead or write-back callback. Now, it's always safe because callbacks don't send requests to userspace anymore. But we do want to make this privileged, as there are unlikely but possible DoS scenarios with a sync release. The latest patch of the set implements the DISABLE_SYNC_RELEASE mount option. We can instrument fusermount to use the option by default for unprivileged mounts (allowing the system administrator to configure it like "user_allow_other"). Do you have a better way to implement DoS protection in mind? Thanks, Maxim
Re: [PATCH 3/6] fuse: wait for end of IO on release
On 08/22/2014 06:00 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:08 PM, Maxim Patlasov wrote: There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put can no longer be the last holders of the fuse file. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. 1) spinlock around __fuse_file_put() is unnecessary, wake_up/wait_event will provide the necessary synchronization. Yes, I agree. 2) can't we always wait for I/O and just make the actual RELEASE message sync or async based on the flag? 3) and can't we merge the fuseblk case into this as well? I think this is doable, but the same argument that Anand made for doing sync release selectively: In a real world scenario you may want to perform synchronous release selectively. As such, performance over lots of small files is generally slow because of context switches, and a synchronous release adds an extra switch. is applicable here: if an application opened a file read-only and read a block initiating read-ahead, it's not obvious why the app doing close(2) should always wait for the end of that read-ahead. For some filesystems it's desirable, for others it is not. Let the fuse daemon decide which behaviour is preferable. Thanks, Maxim
[PATCH 6/6] fuse: add mount option to disable synchronous release
Synchronous release ensures that the kernel reports exactly the last fput() to userspace. However, nothing guarantees that the last fput() will happen in the context of close(2). So userspace applications must be prepared for the situation when close(2) has returned, but processing of FUSE_RELEASE has not yet been completed by the fuse daemon. To make testing such use cases easier, the patch introduces the DISABLE_SYNC_RELEASE mount option which effectively masks out the FOPEN_SYNC_RELEASE flag. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |3 ++- fs/fuse/fuse_i.h |3 +++ fs/fuse/inode.c |8 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b92143a..ad3d357 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -296,7 +296,8 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) static bool must_release_synchronously(struct fuse_file *ff) { - return ff->open_flags & FOPEN_SYNC_RELEASE; + return ff->open_flags & FOPEN_SYNC_RELEASE && + !(ff->fc->flags & FUSE_DISABLE_SYNC_RELEASE); } void fuse_release_common(struct file *file, int opcode) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index e8e47a6..c5e2fca 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -44,6 +44,9 @@ doing the mount will be allowed to access the filesystem */ #define FUSE_ALLOW_OTHER (1 << 1) +/** Disable synchronous release */ +#define FUSE_DISABLE_SYNC_RELEASE (1 << 2) + /** Number of page pointers embedded in fuse_req */ #define FUSE_REQ_INLINE_PAGES 1 diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 03246cd..86d47d0 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -463,6 +463,7 @@ enum { OPT_ALLOW_OTHER, OPT_MAX_READ, OPT_BLKSIZE, + OPT_DISABLE_SYNC_RELEASE, OPT_ERR }; @@ -475,6 +476,7 @@ static const match_table_t tokens = { {OPT_ALLOW_OTHER, "allow_other"}, {OPT_MAX_READ, "max_read=%u"}, {OPT_BLKSIZE, "blksize=%u"}, + {OPT_DISABLE_SYNC_RELEASE, "disable_sync_release"}, {OPT_ERR, NULL} }; @@ -560,6 +562,10 @@ static int
parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) d->blksize = value; break; + case OPT_DISABLE_SYNC_RELEASE: + d->flags |= FUSE_DISABLE_SYNC_RELEASE; + break; + default: return 0; } @@ -583,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) seq_puts(m, ",default_permissions"); if (fc->flags & FUSE_ALLOW_OTHER) seq_puts(m, ",allow_other"); + if (fc->flags & FUSE_DISABLE_SYNC_RELEASE) + seq_puts(m, ",disable_sync_release"); if (fc->max_read != ~0) seq_printf(m, ",max_read=%u", fc->max_read); if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
[PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible. Otherwise the request could be interrupted, but the file association in user space would remain. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |4 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cd55488..b92143a 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path);
[PATCH 2/6] fuse: cosmetic rework of fuse_send_readpages
The patch changes the arguments of fuse_send_readpages to give it access to the inode (it will be used in the next patch of the set). The change is cosmetic; no logic is changed. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 912061a..7723b3f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -827,8 +827,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) fuse_file_put(req->ff, false); } -static void fuse_send_readpages(struct fuse_req *req, struct file *file) +struct fuse_fill_data { + struct fuse_req *req; + struct file *file; + struct inode *inode; + unsigned nr_pages; +}; + +static void fuse_send_readpages(struct fuse_fill_data *data) { + struct fuse_req *req = data->req; + struct file *file = data->file; struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; loff_t pos = page_offset(req->pages[0]); @@ -850,13 +859,6 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file) } } -struct fuse_fill_data { - struct fuse_req *req; - struct file *file; - struct inode *inode; - unsigned nr_pages; -}; - static int fuse_readpages_fill(void *_data, struct page *page) { struct fuse_fill_data *data = _data; @@ -872,7 +874,7 @@ static int fuse_readpages_fill(void *_data, struct page *page) req->pages[req->num_pages - 1]->index + 1 != page->index)) { int nr_alloc = min_t(unsigned, data->nr_pages, FUSE_MAX_PAGES_PER_REQ); - fuse_send_readpages(req, data->file); + fuse_send_readpages(data); if (fc->async_read) req = fuse_get_req_for_background(fc, nr_alloc); else @@ -925,7 +927,7 @@ static int fuse_readpages(struct file *file, struct address_space *mapping, err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); if (!err) { if (data.req->num_pages) - fuse_send_readpages(data.req, file); + fuse_send_readpages(&data); else fuse_put_request(fc, data.req); }
[PATCH 4/6] fuse: enable close_wait synchronous release
The patch enables the feature by passing 'true' to fuse_file_put in fuse_release_common. Previously, this was safe only in special cases when we were sure that a multi-threaded userspace wouldn't deadlock if we synchronously sent FUSE_RELEASE in the context of a read-ahead or write-back callback. Now, it's always safe because callbacks don't send requests to userspace anymore. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 73bce1b..cd55488 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -333,7 +333,8 @@ void fuse_release_common(struct file *file, int opcode) * synchronous RELEASE is allowed (and desirable) in this case * because the server can be trusted not to screw up. */ - fuse_file_put(ff, ff->fc->destroy_req != NULL); + fuse_file_put(ff, ff->fc->destroy_req != NULL || + must_release_synchronously(ff)); } static int fuse_open(struct inode *inode, struct file *file)
[PATCH 3/6] fuse: wait for end of IO on release
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put can no longer be the last holders of the fuse file. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 76 1 file changed, 71 insertions(+), 5 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 7723b3f..73bce1b 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -149,6 +149,17 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) } } +/* + * Asynchronous callbacks may use it instead of fuse_file_put() because + * we guarantee that they are never last holders of ff. Hitting BUG() below + * will make clear any violation of the guarantee.
+ */ +static void __fuse_file_put(struct fuse_file *ff) +{ + if (atomic_dec_and_test(&ff->count)) + BUG(); +} + int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, bool isdir) { @@ -279,6 +290,11 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) req->in.args[0].value = inarg; } +static bool must_release_synchronously(struct fuse_file *ff) +{ + return ff->open_flags & FOPEN_SYNC_RELEASE; +} + void fuse_release_common(struct file *file, int opcode) { struct fuse_file *ff; @@ -302,6 +318,13 @@ void fuse_release_common(struct file *file, int opcode) req->misc.release.path = file->f_path; /* +* No more in-flight asynchronous READ or WRITE requests if +* fuse file release is synchronous +*/ + if (must_release_synchronously(ff)) + BUG_ON(atomic_read(&ff->count) != 1); + + /* * Normally this will send the RELEASE request, however if * some asynchronous READ or WRITE requests are outstanding, * the sending will be delayed. @@ -321,11 +344,34 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_file *ff = file->private_data; /* see fuse_vma_close() for !writeback_cache case */ if (fc->writeback_cache) write_inode_now(inode, 1); + if (must_release_synchronously(ff)) { + struct fuse_inode *fi = get_fuse_inode(inode); + + /* +* Must remove file from write list. Otherwise it is possible +* this file will get more writeback from another files +* rerouted via write_files. +*/ + spin_lock(&ff->fc->lock); + list_del_init(&ff->write_entry); + spin_unlock(&ff->fc->lock); + + wait_event(fi->page_waitq, atomic_read(&ff->count) == 1); + + /* +* spin_unlock_wait(&ff->fc->lock) would be natural here to +* wait for threads just released ff to leave their critical +* sections. But taking spinlock is the first thing +* fuse_release_common does, so that this is unnecessary. 
+*/ + } + fuse_release_common(file, FUSE_RELEASE); /* return value is ignored by VFS */ @@ -823,8 +869,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) unlock_page(page); page_cache_release(page); } - if (req->ff) - fuse_file_put(req->ff, false); + if (req->ff) { + if (must_release_synchronously(req->ff)) { + struct fuse_inode *fi = get_fuse_inode(req->inode); + + spin_lock(&fc->lock); + __fuse_file_put(req->ff); + wake_up(&fi->page_waitq); + spin_unlock(&fc->lock); + } else + fuse_file_put(req->ff, false); + } } struct fuse_fill_data { @@ -851,6 +906,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data) if (fc->async_read) { req->ff = fuse_file_get(ff); req->end = fuse_readpages_end; + req->inode = data->inode; fuse_request_send_background(fc, req); } else { fuse_request_send(fc, req); @@ -1502,7 +1558,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) for (i = 0;
[PATCH 1/6] fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags
The feature will be governed by the fuse file open flag FOPEN_SYNC_RELEASE. Userspace can enable it on a per-file basis in the same way as FOPEN_KEEP_CACHE or FOPEN_DIRECT_IO. Signed-off-by: Maxim Patlasov --- include/uapi/linux/fuse.h |3 +++ 1 file changed, 3 insertions(+) diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 25084a0..607c45c 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -205,10 +205,13 @@ struct fuse_file_lock { * FOPEN_DIRECT_IO: bypass page cache for this open file * FOPEN_KEEP_CACHE: don't invalidate the data cache on open * FOPEN_NONSEEKABLE: the file is not seekable + * FOPEN_SYNC_RELEASE: synchronously release file on last fput, + * which, in turn, is not always bound to close(2)! */ #define FOPEN_DIRECT_IO (1 << 0) #define FOPEN_KEEP_CACHE (1 << 1) #define FOPEN_NONSEEKABLE (1 << 2) +#define FOPEN_SYNC_RELEASE (1 << 3) /** * INIT request/reply flags
[PATCH 0/6] fuse: handle release synchronously (v3)
Hi, There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693 A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423 In short, the problem is that fuse_release (usually called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap when the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file. The patch-set resolves the problem by making fuse_release synchronous: wait for the ACK from userspace for FUSE_RELEASE if the feature is ON. It must be emphasized that even if the feature is enabled (i.e. fuse_release is synchronous), nothing guarantees that fuse_release() will be called in the context of close(2). In fact, it may happen later, on the last fput(). However, there are still a lot of use cases when close(2) is synchronous, so the feature must be regarded as an optimization maximizing the chances of synchronous behaviour of close(2). To keep single-threaded userspace implementations happy the patch-set ensures that by the time fuse_release_common calls fuse_file_put, no more in-flight I/O exists. Asynchronous fuse callbacks (like fuse_readpages_end) cannot trigger FUSE_RELEASE anymore. Hence, we'll never block in contexts other than close(). Changed in v2: - improved comments, commented spin_unlock_wait out according to Brian's suggestions. - rebased on v3.15-rc8 tag of Linus' tree.
Changed in v3: - changed patch-set title from "close file synchronously" to "handle release synchronously" - renamed CLOSE_WAIT to SYNC_RELEASE to convey the essence of the feature ("synchronous release" instead of "wait on close") - enabled the feature on a per-file basis (instead of a global fuse_conn parameter) - added "disable_sync_release" mount option - rebased on v3.17-rc1 tag of Linus' tree. Thanks, Maxim --- Maxim Patlasov (6): fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags fuse: cosmetic rework of fuse_send_readpages fuse: wait for end of IO on release fuse: enable close_wait synchronous release fuse: fix synchronous case of fuse_file_put() fuse: add mount option to disable synchronous release fs/fuse/file.c| 106 ++--- fs/fuse/fuse_i.h |3 + fs/fuse/inode.c |8 +++ include/uapi/linux/fuse.h |3 + 4 files changed, 104 insertions(+), 16 deletions(-)
Re: [PATCH] fuse: do not evict dirty inodes
Hi Miklos, On 08/13/2014 02:32 PM, Miklos Szeredi wrote: On Tue, Jun 3, 2014 at 1:49 PM, Maxim Patlasov wrote: Commit 1e18bda8 added the .write_inode method to the fuse super_operations. This allowed fuse to use the kernel infrastructure for writing out dirty metadata (mtime and ctime for now). However, given that .drop_inode was not redefined from the legacy generic_delete_inode(), on umount(2) generic_shutdown_super() led to the eviction of all inodes disregarding their state. The patch removes the .drop_inode definition from the fuse super_operations. This works because now iput_final() calls generic_drop_inode() and returns without evicting the inode. This, in turn, allows generic_shutdown_super() to write dirty inodes by calling sync_filesystem(). Signed-off-by: Maxim Patlasov --- fs/fuse/inode.c |1 - 1 file changed, 1 deletion(-) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 754dcf2..ee017be 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -791,7 +791,6 @@ static const struct super_operations fuse_super_operations = { .destroy_inode = fuse_destroy_inode, .evict_inode= fuse_evict_inode, .write_inode= fuse_write_inode, - .drop_inode = generic_delete_inode, .remount_fs = fuse_remount_fs, .put_super = fuse_put_super, .umount_begin = fuse_umount_begin, (Sorry about the late answer) Big problem with this is that I don't want to make umount(2) and sync(2) wait on a userspace filesystem. Generally this would make umount() hang if a fuse daemon was stuck for any reason. I think we must honour the interests of both privileged and unprivileged mounts. In the case of a trusted environment where only the sysadmin decides which fuse daemons are eligible to run, blocking umount() is completely fine. And more than that, after settling down the sync-close feature, I intend to take on synchronous umount (it's a shame that users have no better tool to find out when umount has completed than watching the process table for fuse daemon termination).
So we could put such a behaviour under control of a tunable parameter. But is this really necessary? We are talking about just regular files: mtime is only updated by write(2) and friends. ctime is updated by write(2) as well as some other ops. For write, we can sync the times on FLUSH (close), for other ops we could flush the ctime synchronously. E.g. unlink would trigger UNLINK and SETATTR. Long term, a much better solution would be to add a timestamp to fuse_in_header, which would remove the duplicate requests, and then we could also extend the kernel caching of timestamps from just regular files to everything, which would make the protocol conceptually simpler. I like the idea of extending fuse_in_header. But if we extend it, why not go further and add timestamps for mtime and atime as well? Pushing mtime along with data modifications is more reliable than postponing the flush to the write_inode call. Also, this would simplify the code a lot -- no more fuse_write_inode and fuse_flush_times needed. And who knows, maybe someone will request proper atime handling at some point in the future. If you're OK with: @@ -675,10 +675,15 @@ struct fuse_in_header { uint32_t opcode; uint64_t unique; uint64_t nodeid; + uint64_t atime; + uint64_t mtime; + uint64_t ctime; + uint32_t atimensec; + uint32_t mtimensec; + uint32_t ctimensec; uint32_t uid; uint32_t gid; uint32_t pid; - uint32_t padding; }; I could work on the patch. Thanks, Maxim
Re: [PATCH 0/5] fuse: close file synchronously (v2)
On 08/13/2014 04:44 PM, Miklos Szeredi wrote: On Fri, Jun 6, 2014 at 3:27 PM, Maxim Patlasov wrote: Hi, There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693 A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423 In short, the problem is that fuse_release (called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap when the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file. Tying RELEASE to close(2) is not going to work. Look at all the places that call fput() (or fdput() in recent kernels), those are all potential triggers for RELEASE, some realistic, some not quite, but all are certainly places that a synchronous release could block *instead* of close. Which just means that close will still be asynchronous with release some of the time. So it's not clear to me what is to be gained from this patchset. The patch-set doesn't tie RELEASE to close(2), it ensures that we report to user space exactly the last fput(). That's correct because this is exactly the moment when any file system with sharing mode connected to open/close must drop the sharing mode. This is the case even for some local filesystems, for example, ntfs-3g. Could you please look closely at your commit 5a18ec176c934ca1bc9dc61580a5e0e90a9b5733. It actually implemented two different things: 1) synchronous release and 2) delayed path_put.
The latter was well explained by the comment: > /* > * If this is a fuseblk mount, then it's possible that > * releasing the path will result in releasing the > * super block and sending the DESTROY request. If > * the server is single threaded, this would hang. > * For this reason do the path_put() in a separate > * thread. > */ So it's clear why the delay is needed and why it's bound to the fuseblk condition. But synchronous close was made under the same condition, which is obviously wrong. I understand why you made that decision in 2011: otherwise, we could block in the wrong context (the last decrement of ff->count might happen in the scope of read-ahead or mmap-ed writeback). But now, with the approach implemented in this patch-set, this is impossible -- we wait for completion of all async operations before triggering synchronous release. Thus the patch-set unties functionality that already existed (synchronous release) from the wrong condition (fuseblk mount), putting it under well-defined control (FUSE_CLOSE_WAIT). Thanks, Maxim
Re: [PATCH] fuse: avoid scheduling while atomic
On 07/04/2014 08:27 PM, Miklos Szeredi wrote: Maxim, thanks for the patch. Here's a reworked one. While it looks a bit more complicated, its chances of being (and remaining) correct are, I think, better, since the region surrounded by kmap/kunmap_atomic is trivially non-sleeping. This patch also removes more lines than it adds, as an added bonus. Please review and test if possible. The patch looks fine and works well in my tests. Maxim Thanks, Miklos --- fs/fuse/dev.c | 51 +++ 1 file changed, 23 insertions(+), 28 deletions(-) --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -643,9 +643,8 @@ struct fuse_copy_state { unsigned long seglen; unsigned long addr; struct page *pg; - void *mapaddr; - void *buf; unsigned len; + unsigned offset; unsigned move_pages:1; }; @@ -666,23 +665,17 @@ static void fuse_copy_finish(struct fuse if (cs->currbuf) { struct pipe_buffer *buf = cs->currbuf; - if (!cs->write) { - kunmap_atomic(cs->mapaddr); - } else { - kunmap_atomic(cs->mapaddr); + if (cs->write) buf->len = PAGE_SIZE - cs->len; - } cs->currbuf = NULL; - cs->mapaddr = NULL; - } else if (cs->mapaddr) { - kunmap_atomic(cs->mapaddr); + } else if (cs->pg) { if (cs->write) { flush_dcache_page(cs->pg); set_page_dirty_lock(cs->pg); } put_page(cs->pg); - cs->mapaddr = NULL; } + cs->pg = NULL; } /* @@ -691,7 +684,7 @@ static void fuse_copy_finish(struct fuse */ static int fuse_copy_fill(struct fuse_copy_state *cs) { - unsigned long offset; + struct page *page; int err; unlock_request(cs->fc, cs->req); @@ -706,14 +699,12 @@ static int fuse_copy_fill(struct fuse_co BUG_ON(!cs->nr_segs); cs->currbuf = buf; - cs->mapaddr = kmap_atomic(buf->page); + cs->pg = buf->page; + cs->offset = buf->offset; cs->len = buf->len; - cs->buf = cs->mapaddr + buf->offset; cs->pipebufs++; cs->nr_segs--; } else { - struct page *page; - if (cs->nr_segs == cs->pipe->buffers) return -EIO; @@ -726,8 +717,8 @@ static int fuse_copy_fill(struct fuse_co buf->len = 0; cs->currbuf = buf; - cs->mapaddr = kmap_atomic(page); - 
cs->buf = cs->mapaddr; + cs->pg = page; + cs->offset = 0; cs->len = PAGE_SIZE; cs->pipebufs++; cs->nr_segs++; @@ -740,14 +731,13 @@ static int fuse_copy_fill(struct fuse_co cs->iov++; cs->nr_segs--; } - err = get_user_pages_fast(cs->addr, 1, cs->write, &cs->pg); + err = get_user_pages_fast(cs->addr, 1, cs->write, &page); if (err < 0) return err; BUG_ON(err != 1); - offset = cs->addr % PAGE_SIZE; - cs->mapaddr = kmap_atomic(cs->pg); - cs->buf = cs->mapaddr + offset; - cs->len = min(PAGE_SIZE - offset, cs->seglen); + cs->pg = page; + cs->offset = cs->addr % PAGE_SIZE; + cs->len = min(PAGE_SIZE - cs->offset, cs->seglen); cs->seglen -= cs->len; cs->addr += cs->len; } @@ -760,15 +750,20 @@ static int fuse_copy_do(struct fuse_copy { unsigned ncpy = min(*size, cs->len); if (val) { + void *pgaddr = kmap_atomic(cs->pg); + void *buf = pgaddr + cs->offset; + if (cs->write) - memcpy(cs->buf, *val, ncpy); + memcpy(buf, *val, ncpy); else - memcpy(*val, cs->buf, ncpy); + memcpy(*val, buf, ncpy); + + kunmap_atomic(pgaddr); *val += ncpy; } *size -= ncpy; cs->len -= ncpy; - cs->buf += ncpy; + cs->offset += ncpy; return ncpy; } @@ -874,8 +869,8 @@ static int fuse_try_move_page(struct fus out_fallback_unlock: unlock_page(newpage); out_fallback: - cs->mapaddr = kmap_atomic(buf->page); - cs->buf = cs->mapaddr + buf->offset; + cs->pg = buf->page; + cs->offset = buf->offset; err = lock_request(cs->fc, cs->req); if (err) -- To unsubscribe f
Re: [PATCH] mm/page-writeback.c: fix divide by zero in bdi_dirty_limits
Hi Andrew, On 07/12/2014 02:27 AM, Andrew Morton wrote: On Fri, 11 Jul 2014 12:18:27 +0400 Maxim Patlasov wrote: Under memory pressure, it is possible for dirty_thresh, calculated by global_dirty_limits() in balance_dirty_pages(), to equal zero. Under what circumstances? Really small values of vm_dirty_bytes? No, I used default settings: vm_dirty_bytes = 0; dirty_background_bytes = 0; vm_dirty_ratio = 20; dirty_background_ratio = 10; and a simple program like main() { while(1) { p = malloc(4096); mlock(p, 4096); } }. Of course, this triggers oom eventually, but immediately before oom, the system is under hard memory pressure. Then, if strictlimit is true, bdi_dirty_limits() tries to resolve the proportion: bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh by dividing by zero. ... --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1306,9 +1306,9 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); if (bdi_bg_thresh) - *bdi_bg_thresh = div_u64((u64)*bdi_thresh * -background_thresh, -dirty_thresh); + *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh * + background_thresh, + dirty_thresh) : 0; This introduces a peculiar discontinuity: if dirty_thresh==3, treat it as 3 if dirty_thresh==2, treat it as 2 if dirty_thresh==1, treat it as 1 if dirty_thresh==0, treat it as infinity No, the patch doesn't treat dirty_thresh==0 as infinity. In fact, in that case we have the equation: x : 0 = 0 : 0, and the patch resolves it as x=0. Here is the reasoning: 1. A bdi counter is always a fraction of the global one. Hence bdi_thresh is never greater than dirty_thresh. Since dirty_thresh is equal to zero, bdi_thresh is equal to zero too. 2. bdi_bg_thresh must not be greater than bdi_thresh because we want to start background writeback before throttling the process. Since bdi_thresh is equal to zero, bdi_bg_thresh must be zero too.
Would it not make more sense to change global_dirty_limits() to convert 0 to 1? With an appropriate comment, obviously. Or maybe the fix lies elsewhere. Please do tell us how this zero comes about. First let me explain where available_memory equal to one came from. global_dirty_limits() calculates it by calling global_dirtyable_memory(). The latter takes into consideration three global counters and a global reserve. In my case the corresponding values were: NR_INACTIVE_FILE = 0 NR_ACTIVE_FILE = 0 NR_FREE_PAGES = 7006 dirty_balance_reserve = 7959. Consequently, "x" in global_dirtyable_memory() was equal to zero, and the function returned one. Now global_dirty_limits() assigns available_memory to one and calculates "dirty" as a fraction of available_memory: dirty = (vm_dirty_ratio * available_memory) / 100; Since vm_dirty_ratio is less than 100 (it is 20 by default), dirty is calculated as zero. As for your question about converting 0 to 1, I think that bdi_thresh = dirty_thresh = 0 makes natural sense: we are under strong memory pressure, please always start background writeback and throttle the process (even if the actual number of dirty pages is low). So the other parts of the balance_dirty_pages machinery must handle zero thresholds properly. Thanks, Maxim
[PATCH] mm/page-writeback.c: fix divide by zero in bdi_dirty_limits
Under memory pressure, it is possible for dirty_thresh, calculated by global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if strictlimit is true, bdi_dirty_limits() tries to resolve the proportion: bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh by dividing by zero. Signed-off-by: Maxim Patlasov --- mm/page-writeback.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 518e2c3..e0c9430 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1306,9 +1306,9 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); if (bdi_bg_thresh) - *bdi_bg_thresh = div_u64((u64)*bdi_thresh * -background_thresh, -dirty_thresh); + *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh * + background_thresh, + dirty_thresh) : 0; /* * In order to avoid the stacked BDI deadlock we need -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] fuse: release temporary page if fuse_writepage_locked() failed
The temporary page must be freed if fuse_write_file_get() returns NULL. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 3a47aa2..a18baa4 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1679,7 +1679,7 @@ static int fuse_writepage_locked(struct page *page) error = -EIO; req->ff = fuse_write_file_get(fc, fi); if (!req->ff) - goto err_free; + goto err_nofile; fuse_write_fill(req, req->ff, page_offset(page), 0); @@ -1707,6 +1707,8 @@ static int fuse_writepage_locked(struct page *page) return 0; +err_nofile: + __free_page(tmp_page); err_free: fuse_request_free(req); err:
[PATCH] fuse: remove WARN_ON writeback on dead inode
FUSE has an architectural peculiarity: it cannot write to an inode, unless it has an associated file open for write. Since the beginning (Apr 30 2008, commit 3be5a52b), FUSE BUG_ON (later WARN_ON) the case when it has to process writeback, but all associated files are closed. The latest relevant commit is 72523425: > Don't bug if there's no writable files found for page writeback. If ever > this is triggered, a WARN_ON helps debugging it much better then a BUG_ON. > > Signed-off-by: Miklos Szeredi But that situation can happen in quite a legal way: for example, let's mmap a file, then issue O_DIRECT read to mmapped region and immediately close file and munmap. O_DIRECT will pin some pages, execute IO to them and then mark pages dirty. Here we are. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 14 +++--- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 6e16dad..3a47aa2 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1624,8 +1624,8 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req) fuse_writepage_free(fc, req); } -static struct fuse_file *__fuse_write_file_get(struct fuse_conn *fc, - struct fuse_inode *fi) +static struct fuse_file *fuse_write_file_get(struct fuse_conn *fc, +struct fuse_inode *fi) { struct fuse_file *ff = NULL; @@ -1640,14 +1640,6 @@ static struct fuse_file *__fuse_write_file_get(struct fuse_conn *fc, return ff; } -static struct fuse_file *fuse_write_file_get(struct fuse_conn *fc, -struct fuse_inode *fi) -{ - struct fuse_file *ff = __fuse_write_file_get(fc, fi); - WARN_ON(!ff); - return ff; -} - int fuse_write_inode(struct inode *inode, struct writeback_control *wbc) { struct fuse_conn *fc = get_fuse_conn(inode); @@ -1655,7 +1647,7 @@ int fuse_write_inode(struct inode *inode, struct writeback_control *wbc) struct fuse_file *ff; int err; - ff = __fuse_write_file_get(fc, fi); + ff = fuse_write_file_get(fc, fi); err = fuse_flush_times(inode, ff); if (ff) 
fuse_file_put(ff, 0);
[PATCH] fuse: avoid scheduling while atomic
As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry() triggers complains about scheduling while atomic: > Jun 23 11:53:24 localhost kernel: BUG: scheduling while atomic: fuse.hf/13976/0x1001 This happens because fuse_notify_inval_entry() attempts to allocate memory with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic(). The patch fixes the problem for fuse_notify_inval_entry() and other notifiers by unmapping fuse_copy_state before allocations and remapping it again afterwards. Reported-by: Richard Sharpe Signed-off-by: Maxim Patlasov --- fs/fuse/dev.c | 24 1 file changed, 24 insertions(+) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 098f97b..502aa1d 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -755,6 +755,22 @@ static int fuse_copy_fill(struct fuse_copy_state *cs) return lock_request(cs->fc, cs->req); } +static void fuse_copy_unmap(struct fuse_copy_state *cs) +{ + cs->buf = (void *)(cs->buf - cs->mapaddr); + kunmap_atomic(cs->mapaddr); +} + +static void fuse_copy_remap(struct fuse_copy_state *cs) +{ + if (cs->currbuf) + cs->mapaddr = kmap_atomic(cs->currbuf->page); + else + cs->mapaddr = kmap_atomic(cs->pg); + + cs->buf = (unsigned long)cs->buf + cs->mapaddr; +} + /* Do as much copy to/from userspace buffer as we can */ static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size) { @@ -1431,7 +1447,9 @@ static int fuse_notify_inval_entry(struct fuse_conn *fc, unsigned int size, char *buf; struct qstr name; + fuse_copy_unmap(cs); buf = kzalloc(FUSE_NAME_MAX + 1, GFP_KERNEL); + fuse_copy_remap(cs); if (!buf) goto err; @@ -1482,7 +1500,9 @@ static int fuse_notify_delete(struct fuse_conn *fc, unsigned int size, char *buf; struct qstr name; + fuse_copy_unmap(cs); buf = kzalloc(FUSE_NAME_MAX + 1, GFP_KERNEL); + fuse_copy_remap(cs); if (!buf) goto err; @@ -1554,6 +1574,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size, nodeid = outarg.nodeid; + fuse_copy_unmap(cs); 
down_read(&fc->killsb); err = -ENOENT; @@ -1586,7 +1607,9 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size, goto out_iput; this_num = min_t(unsigned, num, PAGE_CACHE_SIZE - offset); + fuse_copy_remap(cs); err = fuse_copy_page(cs, &page, offset, this_num, 0); + fuse_copy_unmap(cs); if (!err && offset == 0 && (this_num == PAGE_CACHE_SIZE || file_size == end)) SetPageUptodate(page); @@ -1607,6 +1630,7 @@ out_iput: iput(inode); out_up_killsb: up_read(&fc->killsb); + fuse_copy_remap(cs); out_finish: fuse_copy_finish(cs); return err; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [fuse-devel] [PATCH 0/5] fuse: close file synchronously (v2)
On 06/09/2014 03:11 PM, John Muir wrote: On 2014.06.09, at 12:46 , Maxim Patlasov wrote: On 06/09/2014 01:26 PM, John Muir wrote: On 2014.06.09, at 9:50 , Maxim Patlasov wrote: On 06/06/2014 05:51 PM, John Muir wrote: On 2014.06.06, at 15:27 , Maxim Patlasov wrote: The patch-set resolves the problem by making fuse_release synchronous: wait for ACK from userspace for FUSE_RELEASE if the feature is ON. Why not make this feature per-file with a new flag bit in struct fuse_file_info rather than as a file-system global? I don't expect a great demand for such a granularity. File-system global "close_wait" conveys a general user expectation about filesystem behaviour in distributed environment: if you stopped using a file on given node, whether it means that the file is immediately accessible from another node. By user do you mean the end-user, or the implementor of the file-system? It seems to me that the end-user doesn't care, and just wants the file-system to work as expected. I don't think we're really talking about the end-user. No, this is exactly about end-user expectations. Imagine a complicated heavy-loaded shared storage where handling FUSE_RELEASE in userspace may take a few minutes. In close_wait=0 case, an end-user who has just called close(2) has no idea when it's safe to access the file from another node or even when it's OK to umount filesystem. I think we're saying the same thing here from different perspectives. The end-user wants the file-system to operate with the semantics you describe, but I don't think it makes sense to give the end-user control over those semantics. The file-system itself should be implemented that way, or not, or per-file If it's a read-only file, then does this not add the overhead of having the kernel wait for the user-space file-system process to respond before closing it. In my experience, there is actually significant cost to the kernel to user-space messaging in FUSE when manipulating thousands of files. 
The implementor of a file-system, on the other hand, might want the semantics for close_wait on some files, but not on others. Won't there be a performance impact? Some distributed file-systems might want this on specific files only. Implementing it as a flag on the struct fuse_file_info gives the flexibility to the file-system implementor. fuse_file_info is an userspace structure, in-kernel fuse knows nothing about it. In close_wait=1 case, nothing prevents a file-system implementation from ACK-ing FUSE_RELEASE request immediately (for specific files) and schedule actual handling for future processing. Of course you know I meant that you'd add another flag to both fuse_file_info, and in the kernel equivalent for those flags which is struct fuse_open_out -> open_flags. This is where other such per file options are specified such as whether or not to keep the in-kernal cache for a file, whether or not to allow direct-io, and whether or not to allow seek. Anyway, I guess you're the one doing all the work on this and if you have a particular implementation that doesn't require such fine-grained control, and no one else does then it's up to you. I'm just trying to show an alternative implementation that gives the file-system implementor more control while keeping the ability to meet user expectations. Thank you, John. That's really depends on whether someone else wants fine-grained control or not. I'm generally OK to re-work the patch-set if more requesters emerge. Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [fuse-devel] [PATCH 0/5] fuse: close file synchronously (v2)
On 06/09/2014 01:26 PM, John Muir wrote: On 2014.06.09, at 9:50 , Maxim Patlasov wrote: On 06/06/2014 05:51 PM, John Muir wrote: On 2014.06.06, at 15:27 , Maxim Patlasov wrote: The patch-set resolves the problem by making fuse_release synchronous: wait for ACK from userspace for FUSE_RELEASE if the feature is ON. Why not make this feature per-file with a new flag bit in struct fuse_file_info rather than as a file-system global? I don't expect a great demand for such a granularity. File-system global "close_wait" conveys a general user expectation about filesystem behaviour in distributed environment: if you stopped using a file on given node, whether it means that the file is immediately accessible from another node. By user do you mean the end-user, or the implementor of the file-system? It seems to me that the end-user doesn't care, and just wants the file-system to work as expected. I don't think we're really talking about the end-user. No, this is exactly about end-user expectations. Imagine a complicated heavy-loaded shared storage where handling FUSE_RELEASE in userspace may take a few minutes. In close_wait=0 case, an end-user who has just called close(2) has no idea when it's safe to access the file from another node or even when it's OK to umount filesystem. The implementor of a file-system, on the other hand, might want the semantics for close_wait on some files, but not on others. Won't there be a performance impact? Some distributed file-systems might want this on specific files only. Implementing it as a flag on the struct fuse_file_info gives the flexibility to the file-system implementor. fuse_file_info is an userspace structure, in-kernel fuse knows nothing about it. In close_wait=1 case, nothing prevents a file-system implementation from ACK-ing FUSE_RELEASE request immediately (for specific files) and schedule actual handling for future processing. 
Thanks, Maxim
Re: [fuse-devel] [PATCH 0/5] fuse: close file synchronously (v2)
On 06/06/2014 05:51 PM, John Muir wrote: On 2014.06.06, at 15:27 , Maxim Patlasov wrote: The patch-set resolves the problem by making fuse_release synchronous: wait for ACK from userspace for FUSE_RELEASE if the feature is ON. Why not make this feature per-file with a new flag bit in struct fuse_file_info rather than as a file-system global? I don't expect a great demand for such granularity. File-system global "close_wait" conveys a general user expectation about filesystem behaviour in a distributed environment: if you have stopped using a file on a given node, is the file immediately accessible from another node? Maxim
[PATCH 5/5] fuse: fix synchronous case of fuse_file_put()
If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible: otherwise the request could be interrupted, but the file association in user space would remain. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 783cb52..9f38568 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path);
[PATCH 4/5] fuse: enable close_wait feature
The patch enables the feature by passing 'true' to fuse_file_put() in fuse_release_common(). Previously, this was safe only in special cases when we were sure that a multi-threaded userspace would not deadlock if we synchronously sent FUSE_RELEASE in the context of a read-ahead or write-back callback. Now it is always safe because callbacks no longer send requests to userspace. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index d50af99..783cb52 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -328,7 +328,8 @@ void fuse_release_common(struct file *file, int opcode) * synchronous RELEASE is allowed (and desirable) in this case * because the server can be trusted not to screw up. */ - fuse_file_put(ff, ff->fc->destroy_req != NULL); + fuse_file_put(ff, ff->fc->destroy_req != NULL || + ff->fc->close_wait); } static int fuse_open(struct inode *inode, struct file *file)
[PATCH 3/5] fuse: wait for end of IO on release
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. So far as fuse_release() waits for end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put cannot be the last holders of fuse file anymore. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 71 1 file changed, 66 insertions(+), 5 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b81a945..d50af99 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -149,6 +149,17 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) } } +/* + * Asynchronous callbacks may use it instead of fuse_file_put() because + * we guarantee that they are never last holders of ff. Hitting BUG() below + * will make clear any violation of the guarantee. + */ +static void __fuse_file_put(struct fuse_file *ff) +{ + if (atomic_dec_and_test(&ff->count)) + BUG(); +} + int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, bool isdir) { @@ -302,6 +313,13 @@ void fuse_release_common(struct file *file, int opcode) req->misc.release.path = file->f_path; /* +* No more in-flight asynchronous READ or WRITE requests if +* fuse file release is synchronous +*/ + if (ff->fc->close_wait) + BUG_ON(atomic_read(&ff->count) != 1); + + /* * Normally this will send the RELEASE request, however if * some asynchronous READ or WRITE requests are outstanding, * the sending will be delayed. 
@@ -321,11 +339,34 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_file *ff = file->private_data; /* see fuse_vma_close() for !writeback_cache case */ if (fc->writeback_cache) write_inode_now(inode, 1); + if (ff->fc->close_wait) { + struct fuse_inode *fi = get_fuse_inode(inode); + + /* +* Must remove file from write list. Otherwise it is possible +* this file will get more writeback from another files +* rerouted via write_files. +*/ + spin_lock(&ff->fc->lock); + list_del_init(&ff->write_entry); + spin_unlock(&ff->fc->lock); + + wait_event(fi->page_waitq, atomic_read(&ff->count) == 1); + + /* +* spin_unlock_wait(&ff->fc->lock) would be natural here to +* wait for threads just released ff to leave their critical +* sections. But taking spinlock is the first thing +* fuse_release_common does, so that this is unnecessary. +*/ + } + fuse_release_common(file, FUSE_RELEASE); /* return value is ignored by VFS */ @@ -823,8 +864,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) unlock_page(page); page_cache_release(page); } - if (req->ff) - fuse_file_put(req->ff, false); + if (req->ff) { + if (fc->close_wait) { + struct fuse_inode *fi = get_fuse_inode(req->inode); + + spin_lock(&fc->lock); + __fuse_file_put(req->ff); + wake_up(&fi->page_waitq); + spin_unlock(&fc->lock); + } else + fuse_file_put(req->ff, false); + } } struct fuse_fill_data { @@ -851,6 +901,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data) if (fc->async_read) { req->ff = fuse_file_get(ff); req->end = fuse_readpages_end; + req->inode = data->inode; fuse_request_send_background(fc, req); } else { fuse_request_send(fc, req); @@ -1537,7 +1588,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) for (i = 0; i < req->num_pages; i++) __free_page(req->pages[i]); - if (req->ff) + if (req->ff && 
!fc->close_wait) fuse_file_put(req->ff, false); } @@ -1554,6 +1605,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP); b
[PATCH 2/5] fuse: cosmetic rework of fuse_send_readpages
The patch change arguments of fuse_send_readpages to give it access to inode (will be used in the next patch of patch-set). The change is cosmetic, no logic changed. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 96d513e..b81a945 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -827,8 +827,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) fuse_file_put(req->ff, false); } -static void fuse_send_readpages(struct fuse_req *req, struct file *file) +struct fuse_fill_data { + struct fuse_req *req; + struct file *file; + struct inode *inode; + unsigned nr_pages; +}; + +static void fuse_send_readpages(struct fuse_fill_data *data) { + struct fuse_req *req = data->req; + struct file *file = data->file; struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; loff_t pos = page_offset(req->pages[0]); @@ -850,13 +859,6 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file) } } -struct fuse_fill_data { - struct fuse_req *req; - struct file *file; - struct inode *inode; - unsigned nr_pages; -}; - static int fuse_readpages_fill(void *_data, struct page *page) { struct fuse_fill_data *data = _data; @@ -872,7 +874,7 @@ static int fuse_readpages_fill(void *_data, struct page *page) req->pages[req->num_pages - 1]->index + 1 != page->index)) { int nr_alloc = min_t(unsigned, data->nr_pages, FUSE_MAX_PAGES_PER_REQ); - fuse_send_readpages(req, data->file); + fuse_send_readpages(data); if (fc->async_read) req = fuse_get_req_for_background(fc, nr_alloc); else @@ -925,7 +927,7 @@ static int fuse_readpages(struct file *file, struct address_space *mapping, err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); if (!err) { if (data.req->num_pages) - fuse_send_readpages(data.req, file); + fuse_send_readpages(&data); else fuse_put_request(fc, data.req); } -- To unsubscribe from this list: send the line 
[PATCH 0/5] fuse: close file synchronously (v2)
Hi,

There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693

A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423

In short, the problem is that fuse_release (which is called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap when the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file.

The patch-set resolves the problem by making fuse_release synchronous: wait for an ACK from userspace for FUSE_RELEASE if the feature is ON. To keep single-threaded userspace implementations happy, the patch-set ensures that by the time fuse_release_common calls fuse_file_put, no more in-flight I/O exists. Asynchronous fuse callbacks (like fuse_readpages_end) cannot trigger FUSE_RELEASE anymore. Hence, we'll never block in contexts other than close().

Changed in v2:
- improved comments, commented spin_unlock_wait out according to Brian's suggestions.
- rebased on the v3.15-rc8 tag of Linus' tree.
Thanks, Maxim --- Maxim Patlasov (5): fuse: add close_wait flag to fuse_conn fuse: cosmetic rework of fuse_send_readpages fuse: wait for end of IO on release fuse: enable close_wait feature fuse: fix synchronous case of fuse_file_put() fs/fuse/file.c| 100 ++--- fs/fuse/fuse_i.h |3 + fs/fuse/inode.c |4 +- include/uapi/linux/fuse.h |3 + 4 files changed, 93 insertions(+), 17 deletions(-)
[PATCH 1/5] fuse: add close_wait flag to fuse_conn
The feature will be governed by fc->close_wait. Userspace can enable it in the same way as auto_inval_data or any other kernel fuse capability. Signed-off-by: Maxim Patlasov --- fs/fuse/fuse_i.h |3 +++ fs/fuse/inode.c |4 +++- include/uapi/linux/fuse.h |3 +++ 3 files changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 7aa5c75..434ff08 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -557,6 +557,9 @@ struct fuse_conn { /** Does the filesystem support asynchronous direct-IO submission? */ unsigned async_dio:1; + /** Wait for response from daemon on close */ + unsigned close_wait:1; + /** The number of requests waiting for completion */ atomic_t num_waiting; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 754dcf2..580d1b3 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -898,6 +898,8 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req) else fc->sb->s_time_gran = 10; + if (arg->flags & FUSE_CLOSE_WAIT) + fc->close_wait = 1; } else { ra_pages = fc->max_read / PAGE_CACHE_SIZE; fc->no_lock = 1; @@ -926,7 +928,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req) FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ | FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA | FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO | - FUSE_WRITEBACK_CACHE; + FUSE_WRITEBACK_CACHE | FUSE_CLOSE_WAIT; req->in.h.opcode = FUSE_INIT; req->in.numargs = 1; req->in.args[0].size = sizeof(*arg); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 40b5ca8..1e1b6fa 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -101,6 +101,7 @@ * - add FATTR_CTIME * - add ctime and ctimensec to fuse_setattr_in * - add FUSE_RENAME2 request + * - add FUSE_CLOSE_WAIT */ #ifndef _LINUX_FUSE_H @@ -229,6 +230,7 @@ struct fuse_file_lock { * FUSE_READDIRPLUS_AUTO: adaptive readdirplus * FUSE_ASYNC_DIO: asynchronous direct I/O submission * 
FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes + * FUSE_CLOSE_WAIT: wait for response from daemon on close */ #define FUSE_ASYNC_READ(1 << 0) #define FUSE_POSIX_LOCKS (1 << 1) @@ -247,6 +249,7 @@ struct fuse_file_lock { #define FUSE_READDIRPLUS_AUTO (1 << 14) #define FUSE_ASYNC_DIO (1 << 15) #define FUSE_WRITEBACK_CACHE (1 << 16) +#define FUSE_CLOSE_WAIT(1 << 17) /** * CUSE INIT request/reply flags -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] fuse: do not evict dirty inodes
Commit 1e18bda8 added .write_inode method to the fuse super_operations. This allowed fuse to use the kernel infrastructure for writing out dirty metadata (mtime and ctime for now). However, given that .drop_inode was not redefined from the legacy generic_delete_inode(), on umount(2) generic_shutdown_super() led to the eviction of all inodes disregarding their state. The patch removes .drop_inode definition from the fuse super_operations. This works because now iput_final() calls generic_drop_inode() and returns w/o evicting inode. This, in turn, allows generic_shutdown_super() to write dirty inodes by calling sync_filesystem(). Signed-off-by: Maxim Patlasov --- fs/fuse/inode.c |1 - 1 file changed, 1 deletion(-) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 754dcf2..ee017be 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -791,7 +791,6 @@ static const struct super_operations fuse_super_operations = { .destroy_inode = fuse_destroy_inode, .evict_inode= fuse_evict_inode, .write_inode= fuse_write_inode, - .drop_inode = generic_delete_inode, .remount_fs = fuse_remount_fs, .put_super = fuse_put_super, .umount_begin = fuse_umount_begin, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,writeback: fix divide by zero in pos_ratio_polynom
On 04/30/2014 12:12 PM, Michal Hocko wrote: On Wed 30-04-14 12:04:04, Maxim Patlasov wrote: Hi Rik! On 04/29/2014 11:19 PM, Rik van Riel wrote: It is possible for "limit - setpoint + 1" to equal zero, leading to a divide by zero error. Blindly adding 1 to "limit - setpoint" is not working, so we need to actually test the divisor before calling div64. The patch looks correct, but I'm afraid it can hide an actual bug in a caller of pos_ratio_polynom(). The latter is not intended for setpoint > limit. All callers take pains to ensure that setpoint <= limit. Look, for example, at global_dirty_limits(): The bug might trigger even if setpoint < limit because the result is trucated to s32 and I guess this is what is going on here? Is (limit - setpoint + 1) > 4G possible? Yes, you are right. Probably the problem came from s32 overflow. if (background >= dirty) background = dirty / 2; If you ever encountered "limit - setpoint + 1" equal zero, it may be worthy to investigate how you came to setpoint greater than limit. 
Thanks, Maxim Signed-off-by: Rik van Riel Cc: sta...@vger.kernel.org --- mm/page-writeback.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ef41349..2682516 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -597,11 +597,16 @@ static inline long long pos_ratio_polynom(unsigned long setpoint, unsigned long dirty, unsigned long limit) { + unsigned int divisor; long long pos_ratio; long x; + divisor = limit - setpoint; + if (!divisor) + divisor = 1; + x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT, - limit - setpoint + 1); + divisor); pos_ratio = x; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,writeback: fix divide by zero in pos_ratio_polynom
Hi Rik! On 04/29/2014 11:19 PM, Rik van Riel wrote: It is possible for "limit - setpoint + 1" to equal zero, leading to a divide by zero error. Blindly adding 1 to "limit - setpoint" is not working, so we need to actually test the divisor before calling div64. The patch looks correct, but I'm afraid it can hide an actual bug in a caller of pos_ratio_polynom(). The latter is not intended for setpoint > limit. All callers take pains to ensure that setpoint <= limit. Look, for example, at global_dirty_limits(): > if (background >= dirty) >background = dirty / 2; If you ever encountered "limit - setpoint + 1" equal zero, it may be worthy to investigate how you came to setpoint greater than limit. Thanks, Maxim Signed-off-by: Rik van Riel Cc: sta...@vger.kernel.org --- mm/page-writeback.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ef41349..2682516 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -597,11 +597,16 @@ static inline long long pos_ratio_polynom(unsigned long setpoint, unsigned long dirty, unsigned long limit) { + unsigned int divisor; long long pos_ratio; long x; + divisor = limit - setpoint; + if (!divisor) + divisor = 1; + x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT, - limit - setpoint + 1); + divisor); pos_ratio = x; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 7/7] fuse: flush ctime in FUSE_FORGET requests
Some inode operations (e.g., rename) operate directly on inodes and dentries without opened files involved. This means that even though fuse set inode->i_ctime and FUSE_I_CTIME_DIRTY properly, a corresponding flush operation will never happen (i.e. no fsync or close to call fuse_flush_cmtime()). The patch solves the problem by passing local ctime to the userspace server inside forget requests. Signed-off-by: Maxim Patlasov --- fs/fuse/dev.c | 30 +++--- fs/fuse/fuse_i.h |2 +- fs/fuse/inode.c |6 ++ include/uapi/linux/fuse.h | 18 ++ 4 files changed, 48 insertions(+), 8 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index aac71ce..97c7749 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -578,17 +578,25 @@ void fuse_request_send_background_locked(struct fuse_conn *fc, void fuse_force_forget(struct file *file, u64 nodeid) { struct inode *inode = file_inode(file); + struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_req *req; - struct fuse_forget_in inarg; + struct fuse_forget_in_ext inarg; + int inarg_size = fc->minor >= 24 ? 
+ sizeof(inarg) : sizeof(struct fuse_forget_in); memset(&inarg, 0, sizeof(inarg)); inarg.nlookup = 1; + if (test_bit(FUSE_I_CTIME_DIRTY, &fi->state)) { + inarg.forget_flags = FUSE_FORGET_CTIME_SET; + inarg.ctime = inode->i_ctime.tv_sec; + inarg.ctimensec = inode->i_ctime.tv_nsec; + } req = fuse_get_req_nofail_nopages(fc, file); req->in.h.opcode = FUSE_FORGET; req->in.h.nodeid = nodeid; req->in.numargs = 1; - req->in.args[0].size = sizeof(inarg); + req->in.args[0].size = inarg_size; req->in.args[0].value = &inarg; req->isreply = 0; __fuse_request_send(fc, req); @@ -1102,14 +1110,19 @@ __releases(fc->lock) { int err; struct fuse_forget_link *forget = dequeue_forget(fc, 1, NULL); - struct fuse_forget_in arg = { + struct fuse_forget_in_ext arg = { .nlookup = forget->forget_one.nlookup, + .ctime = forget->forget_one.ctime, + .ctimensec = forget->forget_one.ctimensec, + .forget_flags = forget->forget_one.forget_flags, }; + int arg_size = fc->minor >= 24 ? + sizeof(arg) : sizeof(struct fuse_forget_in); struct fuse_in_header ih = { .opcode = FUSE_FORGET, .nodeid = forget->forget_one.nodeid, .unique = fuse_get_unique(fc), - .len = sizeof(ih) + sizeof(arg), + .len = sizeof(ih) + arg_size, }; spin_unlock(&fc->lock); @@ -1142,18 +1155,21 @@ __releases(fc->lock) .unique = fuse_get_unique(fc), .len = sizeof(ih) + sizeof(arg), }; + int forget_one_size = fc->minor >= 24 ? 
+ sizeof(struct fuse_forget_one_ext) : + sizeof(struct fuse_forget_one); if (nbytes < ih.len) { spin_unlock(&fc->lock); return -EINVAL; } - max_forgets = (nbytes - ih.len) / sizeof(struct fuse_forget_one); + max_forgets = (nbytes - ih.len) / forget_one_size; head = dequeue_forget(fc, max_forgets, &count); spin_unlock(&fc->lock); arg.count = count; - ih.len += count * sizeof(struct fuse_forget_one); + ih.len += count * forget_one_size; err = fuse_copy_one(cs, &ih, sizeof(ih)); if (!err) err = fuse_copy_one(cs, &arg, sizeof(arg)); @@ -1163,7 +1179,7 @@ __releases(fc->lock) if (!err) { err = fuse_copy_one(cs, &forget->forget_one, - sizeof(forget->forget_one)); + forget_one_size); } head = forget->next; kfree(forget); diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 781a01d..d5172fa 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -59,7 +59,7 @@ extern unsigned max_user_congthresh; /* One forget request */ struct fuse_forget_link { - struct fuse_forget_one forget_one; + struct fuse_forget_one_ext forget_one; struct fuse_forget_link *next; }; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 8b9a1d1..932e04a 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -128,6 +128,12 @@ static void fuse_evict_inode(struct inode *inode) if (inode->i_sb->s_flags & MS_ACTIVE) { struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_inode *fi = get_fuse_inode(inode); +
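The compatibility rule the patch applies throughout — send the extended forget structure only to servers that negotiated FUSE protocol minor >= 24 — can be sketched in user space. The struct layouts below follow the fields visible in the diff; the authoritative definitions belong in include/uapi/linux/fuse.h and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout follows the patch; the real definitions live in
 * include/uapi/linux/fuse.h. */
struct fuse_forget_in {
	uint64_t nlookup;
};

struct fuse_forget_in_ext {
	uint64_t nlookup;
	uint64_t ctime;
	uint32_t ctimensec;
	uint32_t forget_flags;
};

/* Mirror of the "fc->minor >= 24 ? ... : ..." size selection in
 * fuse_force_forget(): servers on an older protocol minor must keep
 * receiving the old, shorter argument. */
static size_t forget_in_size(unsigned int minor)
{
	return minor >= 24 ? sizeof(struct fuse_forget_in_ext)
			   : sizeof(struct fuse_forget_in);
}
```

The same pattern gates the batched FUSE_BATCH_FORGET path, where `forget_one_size` scales how many forgets fit into one read from /dev/fuse.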
[PATCH 6/7] fuse: clear FUSE_I_CTIME_DIRTY flag on setattr
The patch addresses two use-cases when the flag may be safely cleared: 1. fuse_do_setattr() is called with ATTR_CTIME flag set in attr->ia_valid. In this case attr->ia_ctime bears actual value. In-kernel fuse must send it to the userspace server and then assign the value to inode->i_ctime. 2. fuse_do_setattr() is called with ATTR_SIZE flag set in attr->ia_valid, whereas ATTR_CTIME is not set (truncate(2)). In this case in-kernel fuse must sent "now" to the userspace server and then assign the value to inode->i_ctime. In both cases fuse can clear FUSE_I_CTIME_DIRTY flag because actual ctime value was flushed to the server. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 38 ++ 1 file changed, 30 insertions(+), 8 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 2187960..3495aaf 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1518,8 +1518,9 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) return true; } -static void iattr_to_fattr(struct inode *inode, struct iattr *iattr, - struct fuse_setattr_in *arg, bool trust_local_mtime) +static void iattr_to_fattr(struct fuse_conn *fc, struct inode *inode, + struct iattr *iattr, struct fuse_setattr_in *arg, + bool trust_local_mtime, struct timespec *ctime) { unsigned ivalid = iattr->ia_valid; struct timespec now = current_fs_time(inode->i_sb); @@ -1550,6 +1551,19 @@ static void iattr_to_fattr(struct inode *inode, struct iattr *iattr, arg->mtime = now.tv_sec; arg->mtimensec = now.tv_nsec; } + + if ((ivalid & ATTR_CTIME) && trust_local_mtime) + *ctime = iattr->ia_ctime; + else if ((ivalid & ATTR_SIZE) && trust_local_mtime) + *ctime = now; + else + ctime = NULL; + + if (ctime && fc->minor >= 24) { + arg->valid |= FATTR_CTIME; + arg->ctime = ctime->tv_sec; + arg->ctimensec = ctime->tv_nsec; + } } /* @@ -1688,6 +1702,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, loff_t oldsize; int err; bool trust_local_mtime = is_wb && S_ISREG(inode->i_mode); + struct timespec ctime; if 
(!(fc->flags & FUSE_DEFAULT_PERMISSIONS)) attr->ia_valid |= ATTR_FORCE; @@ -1716,7 +1731,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - iattr_to_fattr(inode, attr, &inarg, trust_local_mtime); + iattr_to_fattr(fc, inode, attr, &inarg, trust_local_mtime, &ctime); if (file) { struct fuse_file *ff = file->private_data; inarg.valid |= FATTR_FH; @@ -1744,11 +1759,18 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, } spin_lock(&fc->lock); - /* the kernel maintains i_mtime locally */ - if (trust_local_mtime && (attr->ia_valid & (ATTR_MTIME | ATTR_SIZE))) { - inode->i_mtime.tv_sec = inarg.mtime; - inode->i_mtime.tv_nsec = inarg.mtimensec; - clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + /* the kernel maintains i_mtime and i_ctime locally */ + if (trust_local_mtime) { + if (attr->ia_valid & (ATTR_MTIME | ATTR_SIZE)) { + inode->i_mtime.tv_sec = inarg.mtime; + inode->i_mtime.tv_nsec = inarg.mtimensec; + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } + + if (attr->ia_valid & (ATTR_CTIME | ATTR_SIZE)) { + inode->i_ctime = ctime; + clear_bit(FUSE_I_CTIME_DIRTY, &fi->state); + } } fuse_change_attributes_common(inode, &outarg.attr, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
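The two use-cases above condense into a small decision function. This is a hedged sketch: the ATTR_* values are stand-ins for the kernel constants in include/linux/fs.h, the -1 sentinel replaces the NULL pointer the patch uses, and the real code additionally gates both branches on trust_local_mtime.

```c
#include <assert.h>

/* Stand-ins for the kernel's iattr validity bits (include/linux/fs.h). */
#define ATTR_SIZE  (1 << 3)
#define ATTR_CTIME (1 << 6)

/*
 * Which ctime (seconds) should be flushed to the server?
 * Case 1: explicit ATTR_CTIME -> use attr->ia_ctime.
 * Case 2: truncate(2) sets ATTR_SIZE without ATTR_CTIME -> use "now".
 * Otherwise there is nothing to flush (-1 here; the patch passes NULL).
 */
static long ctime_to_flush(unsigned ivalid, long ia_ctime_sec, long now_sec)
{
	if (ivalid & ATTR_CTIME)
		return ia_ctime_sec;
	if (ivalid & ATTR_SIZE)
		return now_sec;
	return -1;
}
```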
[PATCH 5/7] fuse: trust kernel i_ctime only
Let the kernel maintain i_ctime locally: - clear S_NOCMTIME (implemented earlier) - update i_ctime in i_op->update_time() - flush ctime on fsync and last close (previous patch) - update i_ctime explicitly on truncate, fallocate, open(O_TRUNC), setxattr, link, rename, unlink. Fuse inode flag FUSE_I_CTIME_DIRTY serves as indication that local i_ctime should be flushed to the server eventually. The patch sets the flag and updates i_ctime in course of operations listed above. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 24 +++- fs/fuse/file.c |8 ++-- fs/fuse/inode.c |6 -- 3 files changed, 33 insertions(+), 5 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 0596726..2187960 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -679,6 +679,15 @@ static int fuse_symlink(struct inode *dir, struct dentry *entry, return create_new_entry(fc, req, dir, entry, S_IFLNK); } +static inline void fuse_update_ctime(struct fuse_conn *fc, struct inode *inode) +{ + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { + struct fuse_inode *fi = get_fuse_inode(inode); + inode->i_ctime = current_fs_time(inode->i_sb); + set_bit(FUSE_I_CTIME_DIRTY, &fi->state); + } +} + static int fuse_unlink(struct inode *dir, struct dentry *entry) { int err; @@ -713,6 +722,7 @@ static int fuse_unlink(struct inode *dir, struct dentry *entry) fuse_invalidate_attr(inode); fuse_invalidate_attr(dir); fuse_invalidate_entry_cache(entry); + fuse_update_ctime(fc, inode); } else if (err == -EINTR) fuse_invalidate_entry(entry); return err; @@ -771,6 +781,7 @@ static int fuse_rename(struct inode *olddir, struct dentry *oldent, if (!err) { /* ctime changes */ fuse_invalidate_attr(oldent->d_inode); + fuse_update_ctime(fc, oldent->d_inode); fuse_invalidate_attr(olddir); if (olddir != newdir) @@ -780,6 +791,7 @@ static int fuse_rename(struct inode *olddir, struct dentry *oldent, if (newent->d_inode) { fuse_invalidate_attr(newent->d_inode); fuse_invalidate_entry_cache(newent); + fuse_update_ctime(fc, 
newent->d_inode); } } else if (err == -EINTR) { /* If request was interrupted, DEITY only knows if the @@ -829,6 +841,7 @@ static int fuse_link(struct dentry *entry, struct inode *newdir, inc_nlink(inode); spin_unlock(&fc->lock); fuse_invalidate_attr(inode); + fuse_update_ctime(fc, inode); } else if (err == -EINTR) { fuse_invalidate_attr(inode); } @@ -846,6 +859,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, attr->size = i_size_read(inode); attr->mtime = inode->i_mtime.tv_sec; attr->mtimensec = inode->i_mtime.tv_nsec; + attr->ctime = inode->i_ctime.tv_sec; + attr->ctimensec = inode->i_ctime.tv_nsec; } stat->dev = inode->i_sb->s_dev; @@ -1830,8 +1845,10 @@ static int fuse_setxattr(struct dentry *entry, const char *name, fc->no_setxattr = 1; err = -EOPNOTSUPP; } - if (!err) + if (!err) { fuse_invalidate_attr(inode); + fuse_update_ctime(fc, inode); + } return err; } @@ -1974,6 +1991,11 @@ static int fuse_update_time(struct inode *inode, struct timespec *now, set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state); BUG_ON(!S_ISREG(inode->i_mode)); } + if (flags & S_CTIME) { + inode->i_ctime = *now; + set_bit(FUSE_I_CTIME_DIRTY, &get_fuse_inode(inode)->state); + BUG_ON(!S_ISREG(inode->i_mode)); + } return 0; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 9ed8590b..f644aeb 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -224,8 +224,10 @@ void fuse_finish_open(struct inode *inode, struct file *file) spin_unlock(&fc->lock); fuse_invalidate_attr(inode); if (fc->writeback_cache) { - inode->i_mtime = current_fs_time(inode->i_sb); + inode->i_ctime = inode->i_mtime = + current_fs_time(inode->i_sb); set_bit(FUSE_I_MTIME_DIRTY, &fi->state); + set_bit(FUSE_I_CTIME_DIRTY, &fi->state); } } if ((file->f_mode & FMODE_WRITE) &&
[PATCH 4/7] fuse: introduce FUSE_I_CTIME_DIRTY flag
The flag plays the role similar to FUSE_I_MTIME_DIRTY: local ctime (inode->i_ctime) is newer than on the server; so, local ctime must be flushed to the server afterwards. The patch only introduces the flag, extends API (fuse_setattr_in) properly, and implements the flush procedure (fuse_flush_cmtime()). Further patches of the patch-set will implement the logic of setting and clearing the flag. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 22 ++ fs/fuse/file.c| 11 --- fs/fuse/fuse_i.h |4 +++- include/uapi/linux/fuse.h | 11 --- 4 files changed, 33 insertions(+), 15 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index c7cb41c..0596726 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1600,9 +1600,9 @@ static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, } /* - * Flush inode->i_mtime to the server + * Flush inode->i_mtime and inode->i_ctime to the server */ -int fuse_flush_mtime(struct file *file, bool nofail) +int fuse_flush_cmtime(struct file *file, bool nofail) { struct inode *inode = file->f_mapping->host; struct fuse_inode *fi = get_fuse_inode(inode); @@ -1610,8 +1610,16 @@ int fuse_flush_mtime(struct file *file, bool nofail) struct fuse_req *req = NULL; struct fuse_setattr_in inarg; struct fuse_attr_out outarg; + unsigned cmtime_flags = 0; int err; + if (test_bit(FUSE_I_MTIME_DIRTY, &fi->state)) + cmtime_flags |= FATTR_MTIME; + if (test_bit(FUSE_I_CTIME_DIRTY, &fi->state) && fc->minor >= 24) + cmtime_flags |= FATTR_CTIME; + if (!cmtime_flags) + return 0; + if (nofail) { req = fuse_get_req_nofail_nopages(fc, file); } else { @@ -1623,17 +1631,23 @@ int fuse_flush_mtime(struct file *file, bool nofail) memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - inarg.valid |= FATTR_MTIME; + inarg.valid = cmtime_flags; inarg.mtime = inode->i_mtime.tv_sec; inarg.mtimensec = inode->i_mtime.tv_nsec; + if (fc->minor >= 24) { + inarg.ctime = inode->i_ctime.tv_sec; + inarg.ctimensec = inode->i_ctime.tv_nsec; + } 
fuse_setattr_fill(fc, req, inode, &inarg, &outarg); fuse_request_send(fc, req); err = req->out.h.error; fuse_put_request(fc, req); - if (!err) + if (!err) { clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + clear_bit(FUSE_I_CTIME_DIRTY, &fi->state); + } return err; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index c2c6df7..9ed8590b 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -328,8 +328,7 @@ static int fuse_release(struct inode *inode, struct file *file) if (fc->writeback_cache) filemap_write_and_wait(file->f_mapping); - if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) - fuse_flush_mtime(file, true); + fuse_flush_cmtime(file, true); fuse_release_common(file, FUSE_RELEASE); @@ -512,11 +511,9 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end, fuse_sync_writes(inode); - if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) { - int err = fuse_flush_mtime(file, false); - if (err) - goto out; - } + err = fuse_flush_cmtime(file, false); + if (err) + goto out; req = fuse_get_req_nopages(fc); if (IS_ERR(req)) { diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index a257ed8e..781a01d 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -121,6 +121,8 @@ enum { FUSE_I_SIZE_UNSTABLE, /** i_mtime has been updated locally; a flush to userspace needed */ FUSE_I_MTIME_DIRTY, + /** i_ctime has been updated locally; a flush to userspace needed */ + FUSE_I_CTIME_DIRTY, }; struct fuse_conn; @@ -891,7 +893,7 @@ int fuse_dev_release(struct inode *inode, struct file *file); bool fuse_write_update_size(struct inode *inode, loff_t pos); -int fuse_flush_mtime(struct file *file, bool nofail); +int fuse_flush_cmtime(struct file *file, bool nofail); int fuse_do_setattr(struct inode *inode, struct iattr *attr, struct file *file); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index cf4750e..8af06cc 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -96,6 +96,10 @@ * * 7.23 * - add 
FUSE_WRITEBACK_CACHE + * + * 7.24 + * - ctime support for FUSE_WRITEBACK_CACHE: + *changes in fuse_setattr_in and fuse_forget_in */ #ifndef _LINUX_FUSE_H @@ -131,7 +135,7 @@ #define FUSE_KERNEL_VERSION 7 /**
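The early-exit logic of fuse_flush_cmtime() — derive the FATTR_* mask from the two dirty bits, but never advertise FATTR_CTIME to a pre-7.24 server — can be sketched as below. The flag values match the FATTR_* bits in include/uapi/linux/fuse.h, but verify against the header before relying on them.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions taken from include/uapi/linux/fuse.h. */
#define FATTR_MTIME (1 << 5)
#define FATTR_CTIME (1 << 10)

/* Returns the setattr valid-mask to flush; 0 means "skip the request",
 * which is the fuse_flush_cmtime() early return. */
static unsigned cmtime_flags(bool mtime_dirty, bool ctime_dirty,
			     unsigned protocol_minor)
{
	unsigned flags = 0;

	if (mtime_dirty)
		flags |= FATTR_MTIME;
	if (ctime_dirty && protocol_minor >= 24)  /* old servers lack FATTR_CTIME */
		flags |= FATTR_CTIME;
	return flags;
}
```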
[PATCH 3/7] fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode
In case of fc->atomic_o_trunc is set, fuse does nothing in fuse_do_setattr() while handling open(O_TRUNC). Hence, i_mtime must be updated explicitly in fuse_finish_open(). The patch also adds extra locking encompassing open(O_TRUNC) operation to avoid races between updating i_mtime and setting FUSE_I_MTIME_DIRTY flag. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 13f8bde..c2c6df7 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -223,6 +223,10 @@ void fuse_finish_open(struct inode *inode, struct file *file) i_size_write(inode, 0); spin_unlock(&fc->lock); fuse_invalidate_attr(inode); + if (fc->writeback_cache) { + inode->i_mtime = current_fs_time(inode->i_sb); + set_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } } if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache) fuse_link_write_file(file); @@ -232,18 +236,26 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir) { struct fuse_conn *fc = get_fuse_conn(inode); int err; + bool lock_inode = (file->f_flags & O_TRUNC) && + fc->atomic_o_trunc && + fc->writeback_cache; err = generic_file_open(inode, file); if (err) return err; + if (lock_inode) + mutex_lock(&inode->i_mutex); + err = fuse_do_open(fc, get_node_id(inode), file, isdir); - if (err) - return err; - fuse_finish_open(inode, file); + if (!err) + fuse_finish_open(inode, file); - return 0; + if (lock_inode) + mutex_unlock(&inode->i_mutex); + + return err; } static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/7] fuse: update mtime on truncate(2)
Handling truncate(2), VFS doesn't set ATTR_MTIME bit in iattr structure; only ATTR_SIZE bit is set. In-kernel fuse must handle the case by setting mtime fields of struct fuse_setattr_in to "now" and set FATTR_MTIME bit even though ATTR_MTIME was not set. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 5b4e035..c7cb41c 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1503,10 +1503,11 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) return true; } -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, - bool trust_local_mtime) +static void iattr_to_fattr(struct inode *inode, struct iattr *iattr, + struct fuse_setattr_in *arg, bool trust_local_mtime) { unsigned ivalid = iattr->ia_valid; + struct timespec now = current_fs_time(inode->i_sb); if (ivalid & ATTR_MODE) arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; @@ -1529,6 +1530,10 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, arg->mtimensec = iattr->ia_mtime.tv_nsec; if (!(ivalid & ATTR_MTIME_SET) && !trust_local_mtime) arg->valid |= FATTR_MTIME_NOW; + } else if ((ivalid & ATTR_SIZE) && trust_local_mtime) { + arg->valid |= FATTR_MTIME; + arg->mtime = now.tv_sec; + arg->mtimensec = now.tv_nsec; } } @@ -1682,7 +1687,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - iattr_to_fattr(attr, &inarg, trust_local_mtime); + iattr_to_fattr(inode, attr, &inarg, trust_local_mtime); if (file) { struct fuse_file *ff = file->private_data; inarg.valid |= FATTR_FH; @@ -1711,8 +1716,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, spin_lock(&fc->lock); /* the kernel maintains i_mtime locally */ - if (trust_local_mtime && (attr->ia_valid & ATTR_MTIME)) { - inode->i_mtime = attr->ia_mtime; + if (trust_local_mtime && (attr->ia_valid & (ATTR_MTIME | 
ATTR_SIZE))) { + inode->i_mtime.tv_sec = inarg.mtime; + inode->i_mtime.tv_nsec = inarg.mtimensec; clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
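The rule this patch implements — truncate(2) arrives as ATTR_SIZE without ATTR_MTIME, yet a writeback-cache client must still send FATTR_MTIME with "now" — reduces to the sketch below. The ATTR_* values are stand-ins for the kernel constants, and the real iattr_to_fattr() additionally handles ATTR_MTIME_SET / FATTR_MTIME_NOW.

```c
#include <assert.h>

/* Stand-ins for the kernel's iattr validity bits (include/linux/fs.h). */
#define ATTR_SIZE  (1 << 3)
#define ATTR_MTIME (1 << 5)

/*
 * Which mtime (seconds) should be sent with FATTR_MTIME?
 * Explicit ATTR_MTIME wins; a bare ATTR_SIZE (truncate) uses "now"
 * when the kernel maintains mtime locally; -1 means "don't set it".
 */
static long mtime_to_send(unsigned ivalid, int trust_local_mtime,
			  long ia_mtime_sec, long now_sec)
{
	if (ivalid & ATTR_MTIME)
		return ia_mtime_sec;
	if ((ivalid & ATTR_SIZE) && trust_local_mtime)
		return now_sec;		/* truncate(2): VFS didn't set ATTR_MTIME */
	return -1;
}
```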
[PATCH 1/7] fuse: do not use uninitialized i_mode
When an inode is in the I_NEW state, inode->i_mode is not initialized yet. Do not use it before fuse_init_inode() is called.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8d61169..299e553 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -303,7 +303,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	if ((inode->i_state & I_NEW)) {
 		inode->i_flags |= S_NOATIME;
-		if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
+		if (!fc->writeback_cache || !S_ISREG(attr->mode))
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
[PATCH 0/7] fuse: fix handling mtime and cmtime in writeback_cache mode
Hi,

The patch-set fixes a few issues introduced by the writeback_cache feature:
- fuse_iget() used inode->i_mode before its initialization
- in the course of truncate(2), inode->i_mtime was not updated at all
- the same for open(O_TRUNC) when fc->atomic_o_trunc is set
- we cannot trust ctime from the server anymore because we don't flush all data changes to the server immediately
- to maintain ctime locally, we have to update it in the proper places (fuse_update_time, fallocate, setxattr, etc.)
- locally updated ctime must be flushed to the server eventually.

Maxim

---

Maxim Patlasov (7):
      fuse: do not use uninitialized i_mode
      fuse: update mtime on truncate(2)
      fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode
      fuse: introduce FUSE_I_CTIME_DIRTY flag
      fuse: trust kernel i_ctime only
      fuse: clear FUSE_I_CTIME_DIRTY flag on setattr
      fuse: flush ctime in FUSE_FORGET requests

 fs/fuse/dev.c             | 30 ---
 fs/fuse/dir.c             | 88 +++--
 fs/fuse/file.c            | 37 +--
 fs/fuse/fuse_i.h          |  6 ++-
 fs/fuse/inode.c           | 14 ++-
 include/uapi/linux/fuse.h | 29 +--
 6 files changed, 165 insertions(+), 39 deletions(-)
Re: [PATCH 07/11] fuse: restructure fuse_readpage()
On 01/06/2014 08:43 PM, Miklos Szeredi wrote: On Fri, Dec 20, 2013 at 06:54:40PM +0400, Maxim Patlasov wrote: Hi Miklos, Sorry for delay, see please inline comments below. On 11/12/2013 09:17 PM, Miklos Szeredi wrote: On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote: Move the code filling and sending read request to a separate function. Future patches will use it for .write_begin -- partial modification of a page requires reading the page from the storage very similarly to what fuse_readpage does. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 55 +-- 1 file changed, 37 insertions(+), 18 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b4d4189..77eb849 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, } } -static int fuse_readpage(struct file *file, struct page *page) +static int __fuse_readpage(struct file *file, struct page *page, size_t count, + int *err, struct fuse_req **req_pp, u64 *attr_ver_p) Signature of this helper looks really ugly. A quick look tells me that neither caller actually needs 'req'. fuse_readpage() passes 'req' to fuse_short_read(). And the latter uses req->pages[] to nullify a part of request. I don't get it. __fuse_readpage() itself call's fuse_short_read(), not callers of __fuse_readpage(). Or do they? fuse_readpage() is a caller of __fuse_readpage() and it looks (after applying the patch) like this: > static int fuse_readpage(struct file *file, struct page *page) > { > ... >num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver); >if (!err) { >/* > * Short read means EOF. If file size is larger, truncate it > */ >if (num_read < count) >fuse_short_read(req, inode, attr_ver); > >SetPageUptodate(page); >} Thanks, Maxim And fuse_get_attr_version() can be moved to the one caller that needs it. Yes, it's doable. 
But this would make attr_version mechanism less efficient (under some loads): suppose the file on server was truncated externally, then fuse_readpage() acquires fc->attr_version, then some innocent write bumps fc->attr_version while we're waiting for fuse writeback, then fuse_read_update_size() would noop. In the other words, it's beneficial to keep the time interval between acquiring fc->attr_version and subsequent comparison as short as possible. Okay, lets try to keep this the way it is. I don't like it very much, but I fear changing user visible behavior. Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
On 01/06/2014 08:22 PM, Miklos Szeredi wrote: On Thu, Dec 26, 2013 at 07:41:41PM +0400, Maxim Patlasov wrote: + + if (!err) + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); Doing the test and the clear separately opens a huge race window when i_mtime modifications are bound to get lost. No. Because the whole operation is protected by i_mutex (see fuse_fsync_common()). fuse_release_common() doesn't have i_mutex. It's probably safe to acquire it, but is that really needed? No, that's not needed, I think. Because by the time of calling fuse_release(), file->f_count is already zero and no userspace activity is possible on the file. Are you OK about -v3 version of the patch (sent 12/26/2013)? Thanks, Maxim Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 05/11] fuse: Trust kernel i_mtime only -v3
Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Changed in v2 (thanks to Brian): - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY - simplified fuse_set_mtime_local() - abandoned optimizations: clearing the flag on some operations (like direct write) is doable, but may lead to unexpected changes of user-visible mtime. Changed in v3 (thanks to Miklos): - let client fs know about local mtime change for ATTR_SIZE and !ATTR_MTIME_SET cases. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c| 113 +- fs/fuse/file.c | 30 -- fs/fuse/fuse_i.h |6 ++- fs/fuse/inode.c | 13 +- 4 files changed, 136 insertions(+), 26 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index f022968..d1a3167 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc = get_fuse_conn(inode); /* see the comment in fuse_change_attributes() */ - if (fc->writeback_cache && S_ISREG(inode->i_mode)) + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { attr->size = i_size_read(inode); + attr->mtime = inode->i_mtime.tv_sec; + attr->mtimensec = inode->i_mtime.tv_nsec; + } stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1510,7 +1513,8 @@ static bool update_mtime(unsigned ivalid) return true; } -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg) +static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, + bool trust_local_mtime, struct timespec *mtime) { unsigned ivalid = iattr->ia_valid; @@ -1529,11 +1533,17 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg) if (!(ivalid & ATTR_ATIME_SET)) arg->valid |= FATTR_ATIME_NOW; } - if ((ivalid & ATTR_MTIME) && 
update_mtime(ivalid)) { + + if (!trust_local_mtime || (ivalid & ATTR_MTIME_SET) || + !(ivalid & ATTR_SIZE)) + *mtime = iattr->ia_mtime; + + if ((ivalid & ATTR_MTIME) && (update_mtime(ivalid) || + trust_local_mtime)) { arg->valid |= FATTR_MTIME; - arg->mtime = iattr->ia_mtime.tv_sec; - arg->mtimensec = iattr->ia_mtime.tv_nsec; - if (!(ivalid & ATTR_MTIME_SET)) + arg->mtime = mtime->tv_sec; + arg->mtimensec = mtime->tv_nsec; + if (!(ivalid & ATTR_MTIME_SET) && !trust_local_mtime) arg->valid |= FATTR_MTIME_NOW; } } @@ -1582,6 +1592,63 @@ void fuse_release_nowrite(struct inode *inode) spin_unlock(&fc->lock); } +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, + struct inode *inode, + struct fuse_setattr_in *inarg_p, + struct fuse_attr_out *outarg_p) +{ + req->in.h.opcode = FUSE_SETATTR; + req->in.h.nodeid = get_node_id(inode); + req->in.numargs = 1; + req->in.args[0].size = sizeof(*inarg_p); + req->in.args[0].value = inarg_p; + req->out.numargs = 1; + if (fc->minor < 9) + req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE; + else + req->out.args[0].size = sizeof(*outarg_p); + req->out.args[0].value = outarg_p; +} + +/* + * Flush inode->i_mtime to the server + */ +int fuse_flush_mtime(struct file *file, bool nofail) +{ + struct inode *inode = file->f_mapping->host; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + struct fuse_setattr_in inarg; + struct fuse_attr_out outarg; + int err; + + if (nofail) { + req = fuse_get_req_nofail_nopages(fc, file); + } else { + req = fuse_get_req_nopages(fc); + if (IS_ERR(req)) + return PTR_ERR(req); + } + + memset(&inarg, 0, sizeof(inarg)); + memset(&outarg, 0, sizeof(outarg)); + + inarg.valid |= FATTR_MTIME; + inarg.mtime = inode->i_mtime.tv_sec; + inarg.mtimensec = inode->i_mtime.tv_nsec; + + fuse_setattr_fill(fc, req, inode, &inarg, &outarg); + fuse_request_send(fc,
Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
Hi Miklos, Sorry for delay, see please inline comments below. On 11/12/2013 08:52 PM, Miklos Szeredi wrote: On Thu, Oct 10, 2013 at 05:10:56PM +0400, Maxim Patlasov wrote: Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Changed in v2 (thanks to Brian): - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY - simplified fuse_set_mtime_local() - abandoned optimizations: clearing the flag on some operations (like direct write) is doable, but may lead to unexpected changes of user-visible mtime. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c| 109 -- fs/fuse/file.c | 30 +-- fs/fuse/fuse_i.h |6 ++- fs/fuse/inode.c | 13 +- 4 files changed, 138 insertions(+), 20 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index f022968..eda248b 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc = get_fuse_conn(inode); /* see the comment in fuse_change_attributes() */ - if (fc->writeback_cache && S_ISREG(inode->i_mode)) + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { attr->size = i_size_read(inode); + attr->mtime = inode->i_mtime.tv_sec; + attr->mtimensec = inode->i_mtime.tv_nsec; + } stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode) spin_unlock(&fc->lock); } +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, + struct inode *inode, + struct fuse_setattr_in *inarg_p, + struct fuse_attr_out *outarg_p) +{ + req->in.h.opcode = FUSE_SETATTR; + req->in.h.nodeid = get_node_id(inode); + req->in.numargs = 1; + req->in.args[0].size = sizeof(*inarg_p); + req->in.args[0].value = inarg_p; + req->out.numargs = 1; + if (fc->minor < 
9) + req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE; + else + req->out.args[0].size = sizeof(*outarg_p); + req->out.args[0].value = outarg_p; +} + +/* + * Flush inode->i_mtime to the server + */ +int fuse_flush_mtime(struct file *file, bool nofail) +{ + struct inode *inode = file->f_mapping->host; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + struct fuse_setattr_in inarg; + struct fuse_attr_out outarg; + int err; + + if (nofail) { + req = fuse_get_req_nofail_nopages(fc, file); + } else { + req = fuse_get_req_nopages(fc); + if (IS_ERR(req)) + return PTR_ERR(req); + } + + memset(&inarg, 0, sizeof(inarg)); + memset(&outarg, 0, sizeof(outarg)); + + inarg.valid |= FATTR_MTIME; + inarg.mtime = inode->i_mtime.tv_sec; + inarg.mtimensec = inode->i_mtime.tv_nsec; + + fuse_setattr_fill(fc, req, inode, &inarg, &outarg); + fuse_request_send(fc, req); + err = req->out.h.error; + fuse_put_request(fc, req); + + if (!err) + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); Doing the test and the clear separately opens a huge race window when i_mtime modifications are bound to get lost. No. Because the whole operation is protected by i_mutex (see fuse_fsync_common()). + + return err; +} + +/* + * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. + */ +static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode) +{ + unsigned ivalid = iattr->ia_valid; + struct fuse_inode *fi = get_fuse_inode(inode); + + if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) { + /* client fs has just set mtime to iattr->ia_mtime */ + inode->i_mtime = iattr->ia_mtime; + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); This is protected by i_mutex, so it should be safe. Yes. + } else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) { + /* client fs doesn't know that we're updating i_mtime */ If so, why not tell the client fs to update mtime? Looks reasonable. I'll resend updated patch soon. 
+ inode->i_mtime = current_fs_time(inode->i_sb); +
Re: [PATCH 07/11] fuse: restructure fuse_readpage()
Hi Miklos, sorry for the delay; please see my inline comments below. On 11/12/2013 09:17 PM, Miklos Szeredi wrote: On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote: Move the code filling and sending read request to a separate function. Future patches will use it for .write_begin -- partial modification of a page requires reading the page from the storage very similarly to what fuse_readpage does. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 55 +-- 1 file changed, 37 insertions(+), 18 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b4d4189..77eb849 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, } } -static int fuse_readpage(struct file *file, struct page *page) +static int __fuse_readpage(struct file *file, struct page *page, size_t count, + int *err, struct fuse_req **req_pp, u64 *attr_ver_p) Signature of this helper looks really ugly. A quick look tells me that neither caller actually needs 'req'. fuse_readpage() passes 'req' to fuse_short_read(). And the latter uses req->pages[] to nullify a part of the request. And fuse_get_attr_version() can be moved to the one caller that needs it. Yes, it's doable. But this would make the attr_version mechanism less efficient (under some loads): suppose the file on the server was truncated externally, then fuse_readpage() acquires fc->attr_version, then some innocent write bumps fc->attr_version while we're waiting for fuse writeback, and then fuse_read_update_size() would be a noop. In other words, it's beneficial to keep the time interval between acquiring fc->attr_version and the subsequent comparison as short as possible. And negative err can be returned. Yes, but this will require some precautions for positive req->out.h.error. Like "err = (req->out.h.error <= 0) ? req->out.h.error : -EIO;". But this must be OK - filtering out positive req->out.h.error is a good idea, imho.
And then all those ugly pointer args are gone and the whole thing is much simpler. If you agree with my comments above, only 1 of 3 ugly pointers can be avoided. Another way would be to revert the code back to the initial implementation where fuse_readpage() and fuse_prepare_write() sent read requests independently. Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] mm: add strictlimit knob -v2
"strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE which sets bdi max_ratio to 1% by default: http://article.gmane.org/gmane.linux.kernel.mm/105809 However the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Being enabled, the feature enforces bdi max_ratio limit even if global (10%) dirty limit is not reached. Of course, the effect is not visible until /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value. Changed in v2: - updated patch description and documentation Signed-off-by: Maxim Patlasov --- Documentation/ABI/testing/sysfs-class-bdi |8 +++ mm/backing-dev.c | 35 + 2 files changed, 43 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi index d773d56..3187a18 100644 --- a/Documentation/ABI/testing/sysfs-class-bdi +++ b/Documentation/ABI/testing/sysfs-class-bdi @@ -53,3 +53,11 @@ stable_pages_required (read-only) If set, the backing device requires that all pages comprising a write request must not be changed until writeout is complete. + +strictlimit (read-write) + + Forces per-BDI checks for the share of given device in the write-back + cache even before the global background dirty limit is reached. This + is useful in situations where the global limit is much higher than + affordable for given relatively slow (or untrusted) device. Turning + strictlimit on has no visible effect if max_ratio is equal to 100%. 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7..4ee1d64 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, } static DEVICE_ATTR_RO(stable_pages_required); +static ssize_t strictlimit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int val; + ssize_t ret; + + ret = kstrtouint(buf, 10, &val); + if (ret < 0) + return ret; + + switch (val) { + case 0: + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; + break; + case 1: + bdi->capabilities |= BDI_CAP_STRICTLIMIT; + break; + default: + return -EINVAL; + } + + return count; +} +static ssize_t strictlimit_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); +} +static DEVICE_ATTR_RW(strictlimit); + static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, &dev_attr_stable_pages_required.attr, + &dev_attr_strictlimit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: add strictlimit knob
Hi Andrew, On 11/05/2013 02:01 AM, Andrew Morton wrote: On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov wrote: "strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE which sets bdi max_ratio to 1% by default: http://article.gmane.org/gmane.linux.kernel.mm/105809 However the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Being enabled, the feature enforces bdi max_ratio limit even if global (10%) dirty limit is not reached. Of course, the effect is not visible until max_ratio is decreased to some reasonable value. I suggest replacing "max_ratio" here with the much more informative "/sys/class/bdi/X:Y/max_ratio". Also, Documentation/ABI/testing/sysfs-class-bdi will need an update please. OK, I'll update it, fix the patch description and re-send the patch. mm/backing-dev.c | 35 +++ 1 file changed, 35 insertions(+) I'm not really sure what to make of the patch. I assume you tested it and observed some effect. Could you please describe the test setup and the effects in some detail? I plugged a 16GB USB flash drive into a node with 8GB RAM running 3.12.0-rc7 and started writing a huge file with "dd" (from /dev/zero to the USB-flash mount-point). While writing I was observing the "Dirty" counter as reported by /proc/meminfo. As expected, it stabilized at a level of about 1.2GB (15% of total RAM). Immediately after dd completed, the "umount" command took about 5 minutes, which corresponds to the 5MB/s write throughput of the flash drive. Then I repeated the experiment after setting tunables: echo 1 > /sys/class/bdi/8\:16/max_ratio echo 1 > /sys/class/bdi/8\:16/strictlimit This time the "Dirty" counter became about 100 times smaller - about 12MB - and "umount" took about a second.
Thanks, Maxim
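[Editor's sketch] The ~1.2GB plateau in the experiment above matches the throttle threshold Jan Kara describes elsewhere in this thread: writers are throttled once dirty data exceeds (dirty_background_ratio + dirty_ratio)/2 of total RAM. A quick illustrative model in Python, assuming the then-default ratios of 10 and 20 (these defaults are an assumption, not stated in the thread):

```python
def throttle_threshold(total_ram, background_ratio=10, dirty_ratio=20):
    """Dirty-data level at which writers start being throttled:
    (dirty_background_ratio + dirty_ratio)/2 percent of total RAM."""
    return total_ram * (background_ratio + dirty_ratio) // 200

GiB = 1 << 30
threshold = throttle_threshold(8 * GiB)
# threshold / GiB is ~1.2, matching the ~1.2GB "Dirty" plateau reported
# for the 8GB box above (15% of RAM).
```

With max_ratio=1 and strictlimit on, the per-BDI limit (~1% of the dirty pool) kicks in far below this global threshold, which is why "Dirty" dropped to ~12MB.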
[PATCH] mm: add strictlimit knob
"strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE which sets bdi max_ratio to 1% by default: http://www.http.com//article.gmane.org/gmane.linux.kernel.mm/105809 However the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Being enabled, the feature enforces bdi max_ratio limit even if global (10%) dirty limit is not reached. Of course, the effect is not visible until max_ratio is decreased to some reasonable value. Signed-off-by: Maxim Patlasov --- mm/backing-dev.c | 35 +++ 1 file changed, 35 insertions(+) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7..4ee1d64 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, } static DEVICE_ATTR_RO(stable_pages_required); +static ssize_t strictlimit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int val; + ssize_t ret; + + ret = kstrtouint(buf, 10, &val); + if (ret < 0) + return ret; + + switch (val) { + case 0: + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; + break; + case 1: + bdi->capabilities |= BDI_CAP_STRICTLIMIT; + break; + default: + return -EINVAL; + } + + return count; +} +static ssize_t strictlimit_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); +} +static DEVICE_ATTR_RW(strictlimit); + static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, &dev_attr_stable_pages_required.attr, + &dev_attr_strictlimit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); -- To unsubscribe from this list: send the line "unsubscribe 
linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Disabling in-memory write cache for x86-64 in Linux II
On Thu 31-10-13 14:26:12, Karl Kiniger wrote: > On Tue 131029, Jan Kara wrote: > > On Fri 25-10-13 11:15:55, Karl Kiniger wrote: > > > On Fri 131025, Linus Torvalds wrote: > > > > Is it currently possible to somehow set above values per block device? > > Yes, to some extent. You can set /sys/block//bdi/max_ratio to > > the maximum proportion the device's dirty data can take from the total > > amount. The caveat currently is that this setting only takes effect after > > we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in > > total because that is an amount of dirty data when we start to throttle > > processes. So if the device you'd like to limit is the only one which is > > currently written to, the limiting doesn't have a big effect. > > Thanks for the info - thats was I am looking for. > > You are right that the limiting doesn't have a big effect right now: > > on my 4x speed DVD+RW on /dev/sr0, x86_64, 4GB, > Fedora19: > > max_ratio set to 100 - about 500MB buffered, sync time 2:10 min. > max_ratio set to 1- about 330MB buffered, sync time 1:23 min. > > ... way too much buffering. The "strictlimit" feature should fit your and Artem's needs quite well. The feature enforces per-BDI dirty limits even if the global dirty limit is not reached yet. I'll send a patch adding a knob to turn it on/off. Thanks, Maxim
[PATCH] fuse: Flush files on wb close -v2
From: Pavel Emelyanov Any write request requires a file handle to report to the userspace. Thus when we close a file (and free the fuse_file with this info) we have to flush all the outstanding dirty pages. filemap_write_and_wait() is enough because every page under fuse writeback is accounted in ff->count. This delays actual close until all fuse wb is completed. In case of "write cache" turned off, the flush is ensured by fuse_vma_close(). Changed in v2: - updated patch to be applied smoothly on top of "Trust kernel i_mtime only -v2". Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |6 ++ 1 file changed, 6 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index eabe202..b4d4189 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { + struct fuse_conn *fc = get_fuse_conn(inode); + + /* see fuse_vma_close() for !writeback_cache case */ + if (fc->writeback_cache) + filemap_write_and_wait(file->f_mapping); + if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) fuse_flush_mtime(file, true);
[PATCH 09/11] fuse: fuse_flush() should wait on writeback
The aim of the .flush fop is to hint the filesystem that flushing its state or caches or any other important data to reliable storage would be desirable now. fuse_flush() passes this hint by sending a FUSE_FLUSH request to userspace. However, dirty pages and pages under writeback may not be visible to userspace yet unless we ensure it explicitly. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |9 + 1 file changed, 9 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 642bd51..f23fc65 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -19,6 +19,7 @@ #include static const struct file_operations fuse_direct_io_file_operations; +static void fuse_sync_writes(struct inode *inode); static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file, int opcode, struct fuse_open_out *outargp) @@ -399,6 +400,14 @@ static int fuse_flush(struct file *file, fl_owner_t id) if (fc->no_flush) return 0; + err = filemap_write_and_wait(file->f_mapping); + if (err) + return err; + + mutex_lock(&inode->i_mutex); + fuse_sync_writes(inode); + mutex_unlock(&inode->i_mutex); + req = fuse_get_req_nofail_nopages(fc, file); memset(&inarg, 0, sizeof(inarg)); inarg.fh = ff->fh;
[PATCH 11/11] fuse: Turn writeback cache on
From: Pavel Emelyanov Introduce a flag that the kernel and userspace exchange with each other at init stage, and turn writeback on if userspace wants it and the mount option 'allow_wbcache' is present (controlled by fusermount). Also add each writable file into the per-inode write list and call generic_file_aio_write to make use of the Linux page cache engine. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c|5 + fs/fuse/fuse_i.h |4 fs/fuse/inode.c | 13 + include/uapi/linux/fuse.h |2 ++ 4 files changed, 24 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 653ae9c..aa38e77 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -208,6 +208,8 @@ void fuse_finish_open(struct inode *inode, struct file *file) spin_unlock(&fc->lock); fuse_invalidate_attr(inode); } + if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache) + fuse_link_write_file(file); } int fuse_open_common(struct inode *inode, struct file *file, bool isdir) @@ -1228,6 +1230,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov, struct iov_iter i; loff_t endbyte = 0; + if (get_fuse_conn(inode)->writeback_cache) + return generic_file_aio_write(iocb, iov, nr_segs, pos); + WARN_ON(iocb->ki_pos != pos); ocount = 0; diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 129af78..76c4f77 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -44,6 +44,10 @@ doing the mount will be allowed to access the filesystem */ #define FUSE_ALLOW_OTHER (1 << 1) +/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem +module will enable support of writeback cache */ +#define FUSE_ALLOW_WBCACHE (1 << 2) + /** Number of page pointers embedded in fuse_req */ #define FUSE_REQ_INLINE_PAGES 1 diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 253701f..6ddf135 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -460,6 +460,7 @@ enum { OPT_ALLOW_OTHER, OPT_MAX_READ, OPT_BLKSIZE, + OPT_ALLOW_WBCACHE, OPT_ERR }; @@ -472,6 +473,7 @@ static const match_table_t tokens = { {OPT_ALLOW_OTHER, 
"allow_other"}, {OPT_MAX_READ, "max_read=%u"}, {OPT_BLKSIZE, "blksize=%u"}, + {OPT_ALLOW_WBCACHE, "allow_wbcache"}, {OPT_ERR, NULL} }; @@ -545,6 +547,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) d->blksize = value; break; + case OPT_ALLOW_WBCACHE: + d->flags |= FUSE_ALLOW_WBCACHE; + break; + default: return 0; } @@ -572,6 +578,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) seq_printf(m, ",max_read=%u", fc->max_read); if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE) seq_printf(m, ",blksize=%lu", sb->s_blocksize); + if (fc->flags & FUSE_ALLOW_WBCACHE) + seq_puts(m, ",allow_wbcache"); return 0; } @@ -889,6 +897,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req) } if (arg->flags & FUSE_ASYNC_DIO) fc->async_dio = 1; + if (arg->flags & FUSE_WRITEBACK_CACHE && + fc->flags & FUSE_ALLOW_WBCACHE) + fc->writeback_cache = 1; } else { ra_pages = fc->max_read / PAGE_CACHE_SIZE; fc->no_lock = 1; @@ -917,6 +928,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req) FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ | FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA | FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO; + if (fc->flags & FUSE_ALLOW_WBCACHE) + arg->flags |= FUSE_WRITEBACK_CACHE; req->in.h.opcode = FUSE_INIT; req->in.numargs = 1; req->in.args[0].size = sizeof(*arg); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 60bb2f9..5c98dd1 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -219,6 +219,7 @@ struct fuse_file_lock { * FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one) * FUSE_READDIRPLUS_AUTO: adaptive readdirplus * FUSE_ASYNC_DIO: asynchronous direct I/O submission + * FUSE_WRITEBACK_CACHE: use
[PATCH 10/11] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
From: Pavel Emelyanov The problem is: 1. write cached data to a file 2. read directly from the same file (via another fd) The 2nd operation may read stale data, i.e. the one that was in a file before the 1st op. Problem is in how fuse manages writeback. When direct op occurs the core kernel code calls filemap_write_and_wait to flush all the cached ops in flight. But fuse acks the writeback right after the ->writepages callback exits w/o waiting for the real write to happen. Thus the subsequent direct op proceeds while the real writeback is still in flight. This is a problem for backends that reorder operation. Fix this by making the fuse direct IO callback explicitly wait on the in-flight writeback to finish. Changed in v2: - do not wait on writeback if fuse_direct_io() call came from CUSE (because it doesn't use fuse inodes) Signed-off-by: Maxim Patlasov --- fs/fuse/cuse.c |5 +++-- fs/fuse/file.c | 49 +++-- fs/fuse/fuse_i.h | 13 - 3 files changed, 62 insertions(+), 5 deletions(-) diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c index adbfd66..9bf4bb2 100644 --- a/fs/fuse/cuse.c +++ b/fs/fuse/cuse.c @@ -95,7 +95,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count, struct iovec iov = { .iov_base = buf, .iov_len = count }; struct fuse_io_priv io = { .async = 0, .file = file }; - return fuse_direct_io(&io, &iov, 1, count, &pos, 0); + return fuse_direct_io(&io, &iov, 1, count, &pos, FUSE_DIO_CUSE); } static ssize_t cuse_write(struct file *file, const char __user *buf, @@ -109,7 +109,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf, * No locking or generic_write_checks(), the server is * responsible for locking and sanity checks. 
*/ - return fuse_direct_io(&io, &iov, 1, count, &pos, 1); + return fuse_direct_io(&io, &iov, 1, count, &pos, + FUSE_DIO_WRITE | FUSE_DIO_CUSE); } static int cuse_open(struct inode *inode, struct file *file) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index f23fc65..653ae9c 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -341,6 +341,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id) return (u64) v0 + ((u64) v1 << 32); } +static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from, + pgoff_t idx_to) +{ + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_req *req; + bool found = false; + + spin_lock(&fc->lock); + list_for_each_entry(req, &fi->writepages, writepages_entry) { + pgoff_t curr_index; + + BUG_ON(req->inode != inode); + curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT; + if (!(idx_from >= curr_index + req->num_pages || + idx_to < curr_index)) { + found = true; + break; + } + } + spin_unlock(&fc->lock); + + return found; +} + /* * Check if page is under writeback * @@ -385,6 +410,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index) return 0; } +static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start, + size_t bytes) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + pgoff_t idx_from, idx_to; + + idx_from = start >> PAGE_CACHE_SHIFT; + idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT; + + wait_event(fi->page_waitq, + !fuse_range_is_writeback(inode, idx_from, idx_to)); +} + static int fuse_flush(struct file *file, fl_owner_t id) { struct inode *inode = file_inode(file); @@ -1363,8 +1401,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p) ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov, unsigned long nr_segs, size_t count, loff_t *ppos, - int write) + int flags) { + int write = flags & FUSE_DIO_WRITE; + int cuse = flags & FUSE_DIO_CUSE; struct file *file = io->file; struct 
fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; @@ -1393,6 +1433,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov, break; } + if (!cuse) + fuse_wait_on_writeback(file->f_mapping->host, pos, + nbytes); + if (write) nres = fuse_send_write(req, io, pos, nbytes, owner); else @@ -1471,7 +1515,8
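[Editor's sketch] The list walk in fuse_range_is_writeback() above boils down to a closed-interval intersection test on page indices: a request covering pages [curr_index, curr_index + num_pages - 1] overlaps the waited-for range [idx_from, idx_to] unless it lies entirely before or entirely after it. A minimal Python model of just that predicate (illustrative only; the kernel code walks fuse_req structures under fc->lock):

```python
def ranges_overlap(idx_from, idx_to, curr_index, num_pages):
    """Mirror of the condition in fuse_range_is_writeback():
    request pages [curr_index, curr_index + num_pages - 1] vs
    waited range [idx_from, idx_to], all inclusive page indices."""
    return not (idx_from >= curr_index + num_pages or idx_to < curr_index)
```

fuse_wait_on_writeback() then simply converts the byte range of the direct I/O into such page-index bounds and sleeps on fi->page_waitq until no queued write request overlaps it.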
[PATCH 08/11] fuse: Implement write_begin/write_end callbacks -v2
From: Pavel Emelyanov The .write_begin and .write_end callbacks are required to use the generic routines (generic_file_aio_write --> ... --> generic_perform_write) for buffered writes. Changed in v2: - added fuse_wait_on_page_writeback() to fuse ->write_begin; this is crucial to avoid the loss of user data due to an incorrect crop of a secondary request to a stale inarg->size of the primary. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 100 1 file changed, 100 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 77eb849..642bd51 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1950,6 +1950,104 @@ out: return err; } +/* + * Determine the number of bytes of data the page contains + */ +static inline unsigned fuse_page_length(struct page *page) +{ + loff_t i_size = i_size_read(page_file_mapping(page)->host); + + if (i_size > 0) { + pgoff_t page_index = page_file_index(page); + pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT; + if (page_index < end_index) + return PAGE_CACHE_SIZE; + if (page_index == end_index) + return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1; + } + return 0; +} + +static int fuse_prepare_write(struct fuse_conn *fc, struct file *file, + struct page *page, loff_t pos, unsigned len) +{ + struct fuse_req *req = NULL; + unsigned num_read; + unsigned page_len; + int err; + + if (PageUptodate(page) || (len == PAGE_CACHE_SIZE)) { + fuse_wait_on_page_writeback(page->mapping->host, page->index); + return 0; + } + + page_len = fuse_page_length(page); + if (!page_len) { + fuse_wait_on_page_writeback(page->mapping->host, page->index); + zero_user(page, 0, PAGE_CACHE_SIZE); + return 0; + } + + num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL); + if (req) + fuse_put_request(fc, req); + if (err) { + unlock_page(page); + page_cache_release(page); + } else if (num_read != PAGE_CACHE_SIZE) { + zero_user_segment(page, num_read, PAGE_CACHE_SIZE); + } + return err; +} + +/* + * It would be worthwhile to ensure that space is reserved on disk for the write, + 
but how to implement it without killing performance needs more thought. + */ +static int fuse_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + pgoff_t index = pos >> PAGE_CACHE_SHIFT; + struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode); + + BUG_ON(!fc->writeback_cache); + + *pagep = grab_cache_page_write_begin(mapping, index, flags); + if (!*pagep) + return -ENOMEM; + + return fuse_prepare_write(fc, file, *pagep, pos, len); +} + +static int fuse_commit_write(struct file *file, struct page *page, + unsigned from, unsigned to) +{ + struct inode *inode = page->mapping->host; + loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + + if (!PageUptodate(page)) + SetPageUptodate(page); + + fuse_write_update_size(inode, pos); + set_page_dirty(page); + return 0; +} + +static int fuse_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + unsigned from = pos & (PAGE_CACHE_SIZE - 1); + + fuse_commit_write(file, page, from, from+copied); + + unlock_page(page); + page_cache_release(page); + + return copied; +} + static int fuse_launder_page(struct page *page) { int err = 0; @@ -2974,6 +3072,8 @@ static const struct address_space_operations fuse_file_aops = { .set_page_dirty = __set_page_dirty_nobuffers, .bmap = fuse_bmap, .direct_IO = fuse_direct_IO, + .write_begin= fuse_write_begin, + .write_end = fuse_write_end, }; void fuse_init_file_inode(struct inode *inode)
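[Editor's sketch] fuse_page_length() in the patch above is plain arithmetic on i_size. A standalone model of the same computation (Python, assuming 4 KiB pages, i.e. PAGE_CACHE_SHIFT == 12; the kernel version reads i_size and the page index from the page itself):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86

def fuse_page_length(i_size, page_index):
    """Bytes of file data that fall into the page at page_index,
    mirroring fuse_page_length() from the patch: a full page before
    the last page, a partial tail on the last page, 0 beyond EOF."""
    if i_size > 0:
        end_index = (i_size - 1) // PAGE_SIZE       # (i_size-1) >> PAGE_CACHE_SHIFT
        if page_index < end_index:
            return PAGE_SIZE
        if page_index == end_index:
            return ((i_size - 1) % PAGE_SIZE) + 1   # ((i_size-1) & ~PAGE_CACHE_MASK) + 1
    return 0
```

This is what lets fuse_prepare_write() skip the read entirely (and just zero the page) when the page lies wholly beyond EOF.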
[PATCH 06/11] fuse: Flush files on wb close -v2
Changed in v2: - updated patch to be applied smoothly on top of "Trust kernel i_mtime only -v2". Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |6 ++ 1 file changed, 6 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index eabe202..b4d4189 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { + struct fuse_conn *fc = get_fuse_conn(inode); + + /* see fuse_vma_close() for !writeback_cache case */ + if (fc->writeback_cache) + filemap_write_and_wait(file->f_mapping); + if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) fuse_flush_mtime(file, true);
[PATCH 05/11] fuse: Trust kernel i_mtime only -v2
Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Changed in v2 (thanks to Brian): - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY - simplified fuse_set_mtime_local() - abandoned optimizations: clearing the flag on some operations (like direct write) is doable, but may lead to unexpected changes of user-visible mtime. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c| 109 -- fs/fuse/file.c | 30 +-- fs/fuse/fuse_i.h |6 ++- fs/fuse/inode.c | 13 +- 4 files changed, 138 insertions(+), 20 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index f022968..eda248b 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc = get_fuse_conn(inode); /* see the comment in fuse_change_attributes() */ - if (fc->writeback_cache && S_ISREG(inode->i_mode)) + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { attr->size = i_size_read(inode); + attr->mtime = inode->i_mtime.tv_sec; + attr->mtimensec = inode->i_mtime.tv_nsec; + } stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode) spin_unlock(&fc->lock); } +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, + struct inode *inode, + struct fuse_setattr_in *inarg_p, + struct fuse_attr_out *outarg_p) +{ + req->in.h.opcode = FUSE_SETATTR; + req->in.h.nodeid = get_node_id(inode); + req->in.numargs = 1; + req->in.args[0].size = sizeof(*inarg_p); + req->in.args[0].value = inarg_p; + req->out.numargs = 1; + if (fc->minor < 9) + req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE; + else + req->out.args[0].size = sizeof(*outarg_p); + req->out.args[0].value = outarg_p; +} + +/* + * Flush 
inode->i_mtime to the server + */ +int fuse_flush_mtime(struct file *file, bool nofail) +{ + struct inode *inode = file->f_mapping->host; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + struct fuse_setattr_in inarg; + struct fuse_attr_out outarg; + int err; + + if (nofail) { + req = fuse_get_req_nofail_nopages(fc, file); + } else { + req = fuse_get_req_nopages(fc); + if (IS_ERR(req)) + return PTR_ERR(req); + } + + memset(&inarg, 0, sizeof(inarg)); + memset(&outarg, 0, sizeof(outarg)); + + inarg.valid |= FATTR_MTIME; + inarg.mtime = inode->i_mtime.tv_sec; + inarg.mtimensec = inode->i_mtime.tv_nsec; + + fuse_setattr_fill(fc, req, inode, &inarg, &outarg); + fuse_request_send(fc, req); + err = req->out.h.error; + fuse_put_request(fc, req); + + if (!err) + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + + return err; +} + +/* + * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. + */ +static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode) +{ + unsigned ivalid = iattr->ia_valid; + struct fuse_inode *fi = get_fuse_inode(inode); + + if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) { + /* client fs has just set mtime to iattr->ia_mtime */ + inode->i_mtime = iattr->ia_mtime; + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) { + /* client fs doesn't know that we're updating i_mtime */ + inode->i_mtime = current_fs_time(inode->i_sb); + set_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } +} + /* * Set attributes, and at the same time refresh them. * @@ -1641,17 +1720,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, inarg.valid |= FATTR_LOCKOWNER; inarg.lock_owner = fuse_lock_owner_id(fc, current->files); } - req->in.h.opcode = FUSE_SETATTR; - req->in.h.nodeid = get_node_id(inode); - req->in.numargs = 1; - req->in.args[0].size = sizeof(inarg); - req->in.args[0].value = &inarg;
[PATCH 07/11] fuse: restructure fuse_readpage()
Move the code filling and sending read request to a separate function. Future patches will use it for .write_begin -- partial modification of a page requires reading the page from the storage very similarly to what fuse_readpage does. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 55 +-- 1 file changed, 37 insertions(+), 18 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b4d4189..77eb849 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, } } -static int fuse_readpage(struct file *file, struct page *page) +static int __fuse_readpage(struct file *file, struct page *page, size_t count, + int *err, struct fuse_req **req_pp, u64 *attr_ver_p) { struct fuse_io_priv io = { .async = 0, .file = file }; struct inode *inode = page->mapping->host; struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_req *req; size_t num_read; - loff_t pos = page_offset(page); - size_t count = PAGE_CACHE_SIZE; - u64 attr_ver; - int err; - - err = -EIO; - if (is_bad_inode(inode)) - goto out; /* * Page writeback can extend beyond the lifetime of the @@ -724,20 +717,45 @@ static int fuse_readpage(struct file *file, struct page *page) fuse_wait_on_page_writeback(inode, page->index); req = fuse_get_req(fc, 1); - err = PTR_ERR(req); + *err = PTR_ERR(req); if (IS_ERR(req)) - goto out; + return 0; - attr_ver = fuse_get_attr_version(fc); + if (attr_ver_p) + *attr_ver_p = fuse_get_attr_version(fc); req->out.page_zeroing = 1; req->out.argpages = 1; req->num_pages = 1; req->pages[0] = page; req->page_descs[0].length = count; - num_read = fuse_send_read(req, &io, pos, count, NULL); - err = req->out.h.error; + num_read = fuse_send_read(req, &io, page_offset(page), count, NULL); + *err = req->out.h.error; + + if (*err) + fuse_put_request(fc, req); + else + *req_pp = req; + + return num_read; +} + +static int fuse_readpage(struct file *file, struct page *page) +{ + struct inode *inode = 
page->mapping->host; + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + size_t num_read; + size_t count = PAGE_CACHE_SIZE; + u64 attr_ver = 0; + int err; + + err = -EIO; + if (is_bad_inode(inode)) + goto out; + + num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver); if (!err) { /* * Short read means EOF. If file size is larger, truncate it @@ -747,10 +765,11 @@ static int fuse_readpage(struct file *file, struct page *page) SetPageUptodate(page); } - - fuse_put_request(fc, req); - fuse_invalidate_attr(inode); /* atime changed */ - out: + if (req) { + fuse_put_request(fc, req); + fuse_invalidate_attr(inode); /* atime changed */ + } +out: unlock_page(page); return err; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 03/11] fuse: Connection bit for enabling writeback
From: Pavel Emelyanov Off (0) by default. Will be used in the next patches and will be turned on at the very end. Signed-off-by: Maxim Patlasov Signed-off-by: Pavel Emelyanov --- fs/fuse/fuse_i.h |3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index a08b355..a490ba3 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -479,6 +479,9 @@ struct fuse_conn { /** Set if bdi is valid */ unsigned bdi_initialized:1; + /** write-back cache policy (default is write-through) */ + unsigned writeback_cache:1; + /* * The following bitfields are only for optimization purposes * and hence races in setting them will not cause malfunction --
[PATCH 04/11] fuse: Trust kernel i_size only - v4
From: Pavel Emelyanov Make fuse think that when writeback is on the inode's i_size is always up-to-date and not update it with the value received from the userspace. This is done because the page cache code may update i_size without letting the FS know. This assumption implies fixing the previously introduced short-read helper -- when a short read occurs the 'hole' is filled with zeroes. fuse_file_fallocate() is also fixed because now we should keep i_size up to date, so it must be updated if FUSE_FALLOCATE request succeeded. Changed in v2: - improved comment in fuse_short_read() - fixed fuse_file_fallocate() for KEEP_SIZE mode Changed in v3: - fixed fuse_fillattr() not to use local i_size if writeback-cache is off - added a comment explaining why we cannot trust attr.size from server Changed in v4: - do not change fuse_file_fallocate() because what we need is already implemented in commits a200a2d and 14c1441 Signed-off-by: Maxim V. Patlasov --- fs/fuse/dir.c | 13 +++-- fs/fuse/file.c | 26 -- fs/fuse/inode.c | 11 +-- 3 files changed, 44 insertions(+), 6 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 62b43b5..f022968 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -854,6 +854,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct kstat *stat) { unsigned int blkbits; + struct fuse_conn *fc = get_fuse_conn(inode); + + /* see the comment in fuse_change_attributes() */ + if (fc->writeback_cache && S_ISREG(inode->i_mode)) + attr->size = i_size_read(inode); stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1594,6 +1599,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, struct fuse_setattr_in inarg; struct fuse_attr_out outarg; bool is_truncate = false; + bool is_wb = fc->writeback_cache; loff_t oldsize; int err; @@ -1665,7 +1671,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, fuse_change_attributes_common(inode, &outarg.attr, attr_timeout(&outarg)); oldsize = inode->i_size; - 
i_size_write(inode, outarg.attr.size); + /* see the comment in fuse_change_attributes() */ + if (!is_wb || is_truncate || !S_ISREG(inode->i_mode)) + i_size_write(inode, outarg.attr.size); if (is_truncate) { /* NOTE: this may release/reacquire fc->lock */ @@ -1677,7 +1685,8 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, * Only call invalidate_inode_pages2() after removing * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock. */ - if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { + if ((is_truncate || !is_wb) && + S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { truncate_pagecache(inode, outarg.attr.size); invalidate_inode_pages2(inode->i_mapping); } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 886f3a2..333a764 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -658,9 +658,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver) { size_t num_read = req->out.args[0].size; + struct fuse_conn *fc = get_fuse_conn(inode); + + if (fc->writeback_cache) { + /* +* A hole in a file. Some data after the hole are in page cache, +* but have not reached the client fs yet. So, the hole is not +* present there. 
+*/ + int i; + int start_idx = num_read >> PAGE_CACHE_SHIFT; + size_t off = num_read & (PAGE_CACHE_SIZE - 1); + + for (i = start_idx; i < req->num_pages; i++) { + struct page *page = req->pages[i]; + void *mapaddr = kmap_atomic(page); - loff_t pos = page_offset(req->pages[0]) + num_read; - fuse_read_update_size(inode, pos, attr_ver); + memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off); + + kunmap_atomic(mapaddr); + off = 0; + } + } else { + loff_t pos = page_offset(req->pages[0]) + num_read; + fuse_read_update_size(inode, pos, attr_ver); + } } static int fuse_readpage(struct file *file, struct page *page) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index a8ce6da..63818ab 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -197,6 +197,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, { struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_inode *fi = get_fuse_inode(inode); + bool is_wb = fc->writeback_cache; loff_t oldsize; struct timespec old_mtime;
[PATCH 02/11] fuse: Prepare to handle short reads
From: Pavel Emelyanov A helper that gets called when a read returns fewer bytes than were requested. See patch #4 (trust kernel i_size only) for details. Signed-off-by: Maxim Patlasov Signed-off-by: Pavel Emelyanov --- fs/fuse/file.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 5e49c8f..886f3a2 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -654,6 +654,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size, spin_unlock(&fc->lock); } +static void fuse_short_read(struct fuse_req *req, struct inode *inode, + u64 attr_ver) +{ + size_t num_read = req->out.args[0].size; + + loff_t pos = page_offset(req->pages[0]) + num_read; + fuse_read_update_size(inode, pos, attr_ver); +} + static int fuse_readpage(struct file *file, struct page *page) { struct fuse_io_priv io = { .async = 0, .file = file }; @@ -691,18 +700,18 @@ static int fuse_readpage(struct file *file, struct page *page) req->page_descs[0].length = count; num_read = fuse_send_read(req, &io, pos, count, NULL); err = req->out.h.error; - fuse_put_request(fc, req); if (!err) { /* * Short read means EOF.
If file size is larger, truncate it */ - if (!req->out.h.error && num_read < count) { - loff_t pos; + if (!req->out.h.error && num_read < count) + fuse_short_read(req, inode, req->misc.read.attr_ver); - pos = page_offset(req->pages[0]) + num_read; - fuse_read_update_size(inode, pos, - req->misc.read.attr_ver); - } fuse_invalidate_attr(inode); /* atime changed */ } --
[PATCH 01/11] fuse: Linking file to inode helper
From: Pavel Emelyanov When writeback is ON every writeable file should be in per-inode write list, not only mmap-ed ones. Thus introduce a helper for this linkage. Signed-off-by: Maxim Patlasov Signed-off-by: Pavel Emelyanov --- fs/fuse/file.c | 33 +++-- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 3f219fd..5e49c8f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -171,6 +171,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, } EXPORT_SYMBOL_GPL(fuse_do_open); +static void fuse_link_write_file(struct file *file) +{ + struct inode *inode = file_inode(file); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_file *ff = file->private_data; + /* +* file may be written through mmap, so chain it onto the +* inodes's write_file list +*/ + spin_lock(&fc->lock); + if (list_empty(&ff->write_entry)) + list_add(&ff->write_entry, &fi->write_files); + spin_unlock(&fc->lock); +} + void fuse_finish_open(struct inode *inode, struct file *file) { struct fuse_file *ff = file->private_data; @@ -1929,20 +1945,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = { static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma) { - if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) { - struct inode *inode = file_inode(file); - struct fuse_conn *fc = get_fuse_conn(inode); - struct fuse_inode *fi = get_fuse_inode(inode); - struct fuse_file *ff = file->private_data; - /* -* file may be written through mmap, so chain it onto the -* inodes's write_file list -*/ - spin_lock(&fc->lock); - if (list_empty(&ff->write_entry)) - list_add(&ff->write_entry, &fi->write_files); - spin_unlock(&fc->lock); - } + if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) + fuse_link_write_file(file); + file_accessed(file); vma->vm_ops = &fuse_file_vm_ops; return 0; --
[PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy
nch of the fuse tree (beba4ae4c2fda4e03a813aec640586087fa80a8b) - added waiting for fuse writeback to ->write_begin fuse address space operation. Thanks, Maxim --- Maxim Patlasov (9): fuse: Linking file to inode helper fuse: Connection bit for enabling writeback fuse: Trust kernel i_size only - v4 fuse: Trust kernel i_mtime only -v2 fuse: Flush files on wb close -v2 fuse: restructure fuse_readpage() fuse: fuse_flush() should wait on writeback fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 fuse: Turn writeback cache on Pavel Emelyanov (2): fuse: Prepare to handle short reads fuse: Implement write_begin/write_end callbacks -v2 fs/fuse/cuse.c|5 - fs/fuse/dir.c | 120 +++- fs/fuse/file.c| 328 +++-- fs/fuse/fuse_i.h | 26 +++- fs/fuse/inode.c | 37 - include/uapi/linux/fuse.h |2 6 files changed, 451 insertions(+), 67 deletions(-)