[PATCH] fuse: relax inode_lock on fsync(2)
Sometimes handling FUSE_FSYNC in userspace may take a while. There is no need
to block incoming writes while userspace processes FUSE_FSYNC.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2401c5d..9d52a8a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -483,6 +483,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
 		goto out;
 
+	inode_unlock(inode);
+
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;
 	inarg.fsync_flags = datasync ? 1 : 0;
@@ -499,6 +501,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 		fc->no_fsync = 1;
 		err = 0;
 	}
+	return err;
 
 out:
 	inode_unlock(inode);
 	return err;
[PATCH v2] btrfs: limit async_work allocation and worker func duration
Problem statement: an unprivileged user who has read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh
DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt | grep "SV_$i$" | cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128     10144    10144  128  32  1 : tunables 0 0 0 : slabdata     317     317  0
kmalloc-128   9992352  9992352  128  32  1 : tunables 0 0 0 : slabdata  312261  312261  0
kmalloc-128  24226752 24226752  128  32  1 : tunables 0 0 0 : slabdata  757086  757086  0
kmalloc-128  42754240 42754240  128  32  1 : tunables 0 0 0 : slabdata 1336070 1336070  0

The huge numbers above come from an insane number of async_work-s allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So the machinery
works as expected while the list is almost empty. As soon as it gets bigger,
the worker func starts to process more than one item at a time, it takes
longer, and the chances of having more async_works queued than needed get
higher.

The problem above is worsened by another flaw of the delayed-inode
implementation: if async_work was queued in a throttling branch (number of
items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't quit
until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So it is possible
that the func occupies a CPU indefinitely (up to 30 sec in my experiments):
while the func is trying to drain the list, user activity may add more and
more items to the list.

The patch fixes both problems in a straightforward way: refuse to queue too
many works in btrfs_wq_run_delayed_node and bail out of the worker func if at
least BTRFS_DELAYED_WRITEBACK items have been processed.

Changed in v2: removed support of the thresh == NO_THRESHOLD case.

Signed-off-by: Maxim Patlasov
---
 fs/btrfs/async-thread.c  | 14 ++++++++++++++
 fs/btrfs/async-thread.h  |  1 +
 fs/btrfs/delayed-inode.c |  6 ++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index e0f071f..63d1977 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,20 @@ btrfs_work_owner(struct btrfs_work *work)
 	return work->wq->fs_info;
 }
 
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+	/*
+	 * We could compare wq->normal->pending with num_online_cpus()
+	 * to support "thresh == NO_THRESHOLD" case, but it requires
+	 * moving up atomic_inc/dec in thresh_queue/exec_hook. Let's
+	 * postpone it until someone needs the support of that case.
+	 */
+	if (wq->normal->thresh == NO_THRESHOLD)
+		return false;
+
+	return atomic_read(&wq->normal->pending) > wq->normal->thresh * 2;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index 8e52484..1f95973 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -84,4 +84,5 @@ void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max);
 void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(struct __btrfs_workqueue *wq);
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq);
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 3eeb9cd..de946dd 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1356,7 +1356,8 @@ release_path:
 		total_done++;
 
 		btrfs_release_prepared_delayed_node(delayed_node);
-		if (async_work->nr == 0 || total_done < async_work->nr)
+		if ((async_work->nr == 0 && total_done < BTRFS_DELAYED_WRITEBACK) ||
+		    total_done < async_work->nr)
 			goto again;
 
 free_path:
@@ -1372,7 +1373,8 @@ static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
 {
 	struct btrfs_async_delayed_work *async_
Re: [PATCH] btrfs: limit async_work allocation and worker func duration
On 12/12/2016 06:54 AM, David Sterba wrote:
> On Fri, Dec 02, 2016 at 05:51:36PM -0800, Maxim Patlasov wrote:
>> Problem statement: unprivileged user who has read-write access to more
>> than one btrfs subvolume may easily consume all kernel memory (eventually
>> triggering oom-killer).
>>
>> [reproducer script and slabinfo output snipped; see the original posting]
>>
>> The problem above is worsened by another flaw of the delayed-inode
>> implementation: if async_work was queued in a throttling branch (number
>> of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func
>> won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So
>> it is possible that the func occupies a CPU indefinitely (up to 30 sec
>> in my experiments): while the func is trying to drain the list, user
>> activity may add more and more items to the list.
>
> Nice analysis!
>
>> The patch fixes both problems in a straightforward way: refuse to queue
>> too many works in btrfs_wq_run_delayed_node and bail out of the worker
>> func if at least BTRFS_DELAYED_WRITEBACK items are processed.
>>
>> [...]
>>
>> +bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
>> +{
>> +	int thresh = wq->normal->thresh != NO_THRESHOLD ?
>> +		wq->normal->thresh : num_possible_cpus();
>
> Why not num_online_cpus? I vaguely remember we should be checking online
> cpus, but don't have the mails for reference. We use it elsewhere for
> spreading the work over cpus, but it's still not bullet proof regarding
> cpu onlining/offlining.

Thank you for the review, David!

I borrowed num_possible_cpus from the definition of WQ_UNBOUND_MAX_ACTIVE in
workqueue.h, but if btrfs uses num_online_cpus elsewhere, it must be OK as
well.

Another problem that I realized only now is that nobody increments/decrements
wq->normal->pending if thresh == NO_THRESHOLD, so the code looks pretty
misleading: it looks as though assigning thresh to num_possible_cpus (or
num_online_cpus) matters, but the next line compares it with "pending", which
is always zero.

As far as we don't have any NO_THRESHOLD users of
btrfs_workqueue_normal_congested for now, I tend to think it's better to add
a descriptive comment and simply return "false" from
btrfs_workqueue_normal_congested rather than trying to address some future
needs now. Please see v2 of the patch.

Thanks,
Maxim

> Otherwise looks good to me, as far as I can imagine the possible behaviour
> of the various async parameters just from reading the code.
[PATCH] btrfs: limit async_work allocation and worker func duration
Problem statement: an unprivileged user who has read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh
DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt | grep "SV_$i$" | cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128     10144    10144  128  32  1 : tunables 0 0 0 : slabdata     317     317  0
kmalloc-128   9992352  9992352  128  32  1 : tunables 0 0 0 : slabdata  312261  312261  0
kmalloc-128  24226752 24226752  128  32  1 : tunables 0 0 0 : slabdata  757086  757086  0
kmalloc-128  42754240 42754240  128  32  1 : tunables 0 0 0 : slabdata 1336070 1336070  0

The huge numbers above come from an insane number of async_work-s allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So the machinery
works as expected while the list is almost empty. As soon as it gets bigger,
the worker func starts to process more than one item at a time, it takes
longer, and the chances of having more async_works queued than needed get
higher.

The problem above is worsened by another flaw of the delayed-inode
implementation: if async_work was queued in a throttling branch (number of
items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't quit
until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So it is possible
that the func occupies a CPU indefinitely (up to 30 sec in my experiments):
while the func is trying to drain the list, user activity may add more and
more items to the list.

The patch fixes both problems in a straightforward way: refuse to queue too
many works in btrfs_wq_run_delayed_node and bail out of the worker func if at
least BTRFS_DELAYED_WRITEBACK items have been processed.

Signed-off-by: Maxim Patlasov
---
 fs/btrfs/async-thread.c  | 8 ++++++++
 fs/btrfs/async-thread.h  | 1 +
 fs/btrfs/delayed-inode.c | 6 ++++--
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index e0f071f..29f6252 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,14 @@ btrfs_work_owner(struct btrfs_work *work)
 	return work->wq->fs_info;
 }
 
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+	int thresh = wq->normal->thresh != NO_THRESHOLD ?
+		wq->normal->thresh : num_possible_cpus();
+
+	return atomic_read(&wq->normal->pending) > thresh * 2;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index 8e52484..1f95973 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -84,4 +84,5 @@ void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max);
 void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(struct __btrfs_workqueue *wq);
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq);
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 3eeb9cd..de946dd 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1356,7 +1356,8 @@ release_path:
 		total_done++;
 
 		btrfs_release_prepared_delayed_node(delayed_node);
-		if (async_work->nr == 0 || total_done < async_work->nr)
+		if ((async_work->nr == 0 && total_done < BTRFS_DELAYED_WRITEBACK) ||
+		    total_done < async_work->nr)
 			goto again;
 
 free_path:
@@ -1372,7 +1373,8 @@ static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
 {
 	struct btrfs_async_delayed_work *async_work;
 
-	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND)
+	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND ||
+	    btrfs_workqueue_normal_congested(fs_info->delayed_workers))
 		return 0;
 
 	async_work = kmalloc(sizeof(*async_work), GFP_NOFS);
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
On 11/22/2016 02:45 PM, Nikolaus Rath wrote:
> On Nov 16 2016, Maxim Patlasov wrote:
>> On 11/16/2016 12:19 PM, Nikolaus Rath wrote:
>>> On Nov 16 2016, Maxim Patlasov wrote:
>>>> [earlier discussion of max_background and "natural" limits snipped]
>>>>
>>>> Buffered AIO READs actually block in io_submit until fully processed.
>>>> So it's just another example of the "natural" limitation I described
>>>> above.
>>>
>>> Not sure I understand. What is it that's blocking? It can't be the
>>> userspace process, because then it wouldn't be asynchronous I/O...
>>
>> Surprise! Alas, the Linux kernel does NOT process buffered AIO reads in
>> an async manner. You can verify it yourself by strace-ing a simple
>> program looping over io_submit + io_getevents: for direct IO (as
>> expected) io_submit returns immediately while io_getevents waits for the
>> actual IO; in contrast, for buffered IO (surprisingly) io_submit waits
>> for the actual IO while io_getevents returns immediately. Presumably,
>> people are supposed to use mmap-ed reads/writes rather than buffered AIO.
>
> What about buffered, asynchronous writes when writeback caching is
> disabled? It sounds as if io_submit does not block (so userspace could
> create an unlimited number), nor can the kernel coalesce them (since
> writeback caching is disabled).

I've never looked closely at it. Do you have a particular use case or
concern?

> Thanks!
> -Nikolaus
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
On 11/16/2016 11:19 AM, Nikolaus Rath wrote:
> Hi Maxim,
>
> On Nov 15 2016, Maxim Patlasov wrote:
>> On 11/15/2016 08:18 AM, Nikolaus Rath wrote:
>>> Could someone explain to me the meaning of the max_background and
>>> congestion_threshold settings of the fuse module?
>>>
>>> At first I assumed that max_background specifies the maximum number of
>>> pending requests (i.e., requests that have been sent to userspace but
>>> for which no reply was received yet). But looking at fs/fuse/dev.c, it
>>> looks as if not every request is included in this number.
>>
>> fuse uses max_background for cases where the total number of
>> simultaneous requests of a given type is not limited by some other
>> natural means. AFAIU, these cases are: 1) async processing of direct
>> IO; 2) read-ahead.
>>
>> As an example of a "natural" limitation: when a userspace process
>> blocks on a sync direct IO read/write, the number of requests fuse
>> consumes is limited by the number of such processes (actually their
>> threads). In contrast, if userspace requests a 1GB direct IO
>> read/write, it would be unreasonable to issue 1GB/128K == 8192 fuse
>> requests simultaneously. That's where max_background steps in.
>
> Ah, that makes sense. Are these two cases meant as examples, or is that
> an exhaustive list? Because I would have thought that other cases should
> be writing of cached data (when writeback caching is enabled), and
> asynchronous I/O from userspace...?

I think that's an exhaustive list, but I may have missed something.

As for writing of cached data, that definitely doesn't go through background
requests. Here we rely on the flusher: fuse will allocate as many requests
as the flusher wants to write back.

Buffered AIO READs actually block in io_submit until fully processed. So
it's just another example of the "natural" limitation I described above.
Buffered AIO WRITEs go through the writeback mechanics anyway, so here again
we rely on the flusher behaving reasonably. And finally, direct AIO does go
through fuse background requests, as I wrote: "1) async processing of
direct IO".

> Also, I am not sure what you mean with async processing of direct I/O.
> Shouldn't direct I/O always go directly to the file-system? If so, how
> can it be processed asynchronously?

That's a nice optimization we implemented a few years ago: given an incoming
sync direct IO request of 1MB size, kernel fuse splits it into eight 128K
requests and starts processing them in an async manner, waiting for the
completion of all of them before completing the incoming 1MB request. This
boosts performance tremendously if the userspace fuse daemon is able to
efficiently process many requests "in parallel". This optimization is
implemented using background fuse requests. Otherwise, given a 1GB incoming
request, we would obediently allocate 8K fuse requests in one shot -- too
dangerous and not good for latency.

Thanks,
Maxim
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
On 11/16/2016 12:19 PM, Nikolaus Rath wrote:
> On Nov 16 2016, Maxim Patlasov wrote:
>> [earlier discussion of max_background and "natural" limits snipped]
>>
>> Buffered AIO READs actually block in io_submit until fully processed.
>> So it's just another example of the "natural" limitation I described
>> above.
>
> Not sure I understand. What is it that's blocking? It can't be the
> userspace process, because then it wouldn't be asynchronous I/O...

Surprise! Alas, the Linux kernel does NOT process buffered AIO reads in an
async manner. You can verify it yourself by strace-ing a simple program
looping over io_submit + io_getevents: for direct IO (as expected) io_submit
returns immediately while io_getevents waits for the actual IO; in contrast,
for buffered IO (surprisingly) io_submit waits for the actual IO while
io_getevents returns immediately. Presumably, people are supposed to use
mmap-ed reads/writes rather than buffered AIO.

>>> Also, I am not sure what you mean with async processing of direct I/O.
>>> Shouldn't direct I/O always go directly to the file-system? If so, how
>>> can it be processed asynchronously?
>>
>> That's a nice optimization we implemented a few years ago: given an
>> incoming sync direct IO request of 1MB size, kernel fuse splits it into
>> eight 128K requests and starts processing them in an async manner,
>> waiting for the completion of all of them before completing the
>> incoming 1MB request.
>
> I see. But why isn't that also done for regular (non-direct) IO?

Regular READs are helped by async read-ahead. Regular writes go through the
writeback mechanics: the flusher calls fuse_writepages() and the latter
submits as many async write requests as needed. Everything looks fine. (But,
as I wrote, those async requests are not under fuse max_background control.)

Thanks,
Maxim

> Thanks,
> -Nikolaus
Re: [fuse-devel] fuse: max_background and congestion_threshold settings
Hi,

On 11/15/2016 08:18 AM, Nikolaus Rath wrote:
> Hello,
>
> Could someone explain to me the meaning of the max_background and
> congestion_threshold settings of the fuse module?
>
> At first I assumed that max_background specifies the maximum number of
> pending requests (i.e., requests that have been sent to userspace but for
> which no reply was received yet). But looking at fs/fuse/dev.c, it looks
> as if not every request is included in this number.

fuse uses max_background for cases where the total number of simultaneous
requests of a given type is not limited by some other natural means. AFAIU,
these cases are: 1) async processing of direct IO; 2) read-ahead.

As an example of a "natural" limitation: when a userspace process blocks on a
sync direct IO read/write, the number of requests fuse consumes is limited by
the number of such processes (actually their threads). In contrast, if
userspace requests a 1GB direct IO read/write, it would be unreasonable to
issue 1GB/128K == 8192 fuse requests simultaneously. That's where
max_background steps in.

> I also figured out that if the number of background requests (whatever
> they are) exceeds the congestion threshold, fuse calls
> set_bdi_congested() for the backing device. But what does this do?

AFAIU, this is a hint for the reclaimer to avoid a busy loop:

	/*
	 * If kswapd scans pages marked for immediate
	 * reclaim and under writeback (nr_immediate), it implies
	 * that pages are cycling through the LRU faster than
	 * they are written so also forcibly stall.
	 */
	if (nr_immediate && current_may_throttle())
		congestion_wait(BLK_RW_ASYNC, HZ/10);

> And does this become a no-op if there is no backing device?

current->backing_dev_info exists (and helps to control writeback) even if
there is no "real" backing device.

Thanks,
Maxim
[PATCH v2] fuse: fuse_flush must check mapping->flags for errors
fuse_flush() calls write_inode_now() which triggers writeback, but the actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set the error bit in mapping->flags. So we have to
check mapping->flags after fuse_sync_writes().

Changed in v2:
 - fixed silly typo: the check must be *after* fuse_sync_writes()

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad1da83..6cac3dc 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -417,6 +417,15 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	fuse_sync_writes(inode);
 	inode_unlock(inode);
 
+	if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+		err = -ENOSPC;
+	if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+		err = -EIO;
+	if (err)
+		return err;
+
 	req = fuse_get_req_nofail_nopages(fc, file);
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;
[PATCH] fuse: fuse_flush must check mapping->flags for errors
fuse_flush() calls write_inode_now() which triggers writeback, but the actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set the error bit in mapping->flags. So we have to
check mapping->flags after fuse_sync_writes().

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad1da83..b43401e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -413,6 +413,15 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (err)
 		return err;
 
+	if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+		err = -ENOSPC;
+	if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+		err = -EIO;
+	if (err)
+		return err;
+
 	inode_lock(inode);
 	fuse_sync_writes(inode);
 	inode_unlock(inode);
[PATCH] fuse: fsync() did not return IO errors
From: Alexey Kuznetsov

Due to the implementation of fuse writeback, filemap_write_and_wait_range()
does not catch errors. We have to do this directly after fuse_sync_writes().

Signed-off-by: Alexey Kuznetsov
Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9154f86..ad1da83 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -462,6 +462,21 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 		goto out;
 
 	fuse_sync_writes(inode);
 
+	/*
+	 * Due to implementation of fuse writeback
+	 * filemap_write_and_wait_range() does not catch errors.
+	 * We have to do this directly after fuse_sync_writes()
+	 */
+	if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+		err = -ENOSPC;
+	if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+	    test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+		err = -EIO;
+	if (err)
+		goto out;
+
 	err = sync_inode_metadata(inode, 1);
 	if (err)
 		goto out;
[PATCH] tmpfs: shmem_fallocate must return ERESTARTSYS
shmem_fallocate() is restartable, so it can return ERESTARTSYS if
signal_pending(). Although the fallocate(2) manpage permits EINTR, the more
places that use ERESTARTSYS the better.

Signed-off-by: Maxim Patlasov
---
 mm/shmem.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 440e2a7..60e9c8a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2229,11 +2229,13 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		struct page *page;
 
 		/*
-		 * Good, the fallocate(2) manpage permits EINTR: we may have
-		 * been interrupted because we are using up too much memory.
+		 * Although fallocate(2) manpage permits EINTR, the more
+		 * places use ERESTARTSYS the better. If we have been
+		 * interrupted because we are using up too much memory,
+		 * oom-killer used fatal signal and we will die anyway.
 		 */
 		if (signal_pending(current))
-			error = -EINTR;
+			error = -ERESTARTSYS;
 		else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
 			error = -ENOMEM;
 		else
Re: [PATCH] fs/pnode.c: treat zero mnt_group_id-s as unequal
On 02/16/2016 11:54 AM, Al Viro wrote: On Tue, Feb 16, 2016 at 11:45:33AM -0800, Maxim Patlasov wrote: propagate_one(m) calculates "type" argument for copy_tree() like this: if (m->mnt_group_id == last_dest->mnt_group_id) { type = CL_MAKE_SHARED; } else { type = CL_SLAVE; if (IS_MNT_SHARED(m)) type |= CL_MAKE_SHARED; } The "type" argument then governs clone_mnt() behavior with respect to flags and mnt_master of new mount. When we iterate through a slave group, it is possible that both current "m" and "last_dest" are not shared (although, both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison above erroneously makes new mount shared and sets its mnt_master to last_source->mnt_master. The patch fixes the problem by handling zero mnt_group_id-s as though they are unequal. The similar problem exists in the implementation of "else" clause above when we have to ascend upward in the master/slave tree by calling: last_source = last_source->mnt_master; last_dest = last_source->mnt_parent; proper number of times. The last step is governed by "n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if both are zero. The patch fixes this case in the same way as the former one. Mind putting together a reproducer? There are two files attached: reproducer1.c and reproducer2.c. The former demonstrates the problem before applying the patch. The latter demonstrates why the first hunk of the patch is not enough. 
[root@f22ml ~]# reproducer1 main pid = 1496 monitor pid = 1497 child pid = 1498 grand-child pid = 1499 [root@f22ml ~]# grep "child" /proc/1496/mountinfo 243 144 0:37 /child /tmp/child rw shared:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1498/mountinfo 244 208 0:37 /child /tmp/child rw shared:127 master:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1499/mountinfo 245 240 0:37 /child /tmp/child rw master:127 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1497/mountinfo 246 176 0:37 /child /tmp/child rw shared:128 master:127 - tmpfs tmpfs rw,seclabel while expected info for 1497 would be: 246 176 0:37 /child /tmp/child rw master:93 - tmpfs tmpfs rw,seclabel Now, assuming that only the first hunk of the patch is applied: > -if (m->mnt_group_id == last_dest->mnt_group_id) { > +if (m->mnt_group_id && m->mnt_group_id == last_dest->mnt_group_id) { [root@f22ml ~]# reproducer2 main pid = 1506 monitor pid = 1507 child pid = 1508 grand-child pid = 1509 [root@f22ml ~]# grep "child" /proc/1506/mountinfo 243 144 0:37 /child /tmp/child rw shared:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1508/mountinfo 244 208 0:37 /child /tmp/child rw shared:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1509/mountinfo 245 240 0:37 /child /tmp/child rw master:93 - tmpfs tmpfs rw,seclabel [root@f22ml ~]# grep "child" /proc/1507/mountinfo 246 176 0:37 /child /tmp/child rw master:0 - tmpfs tmpfs rw,seclabel while expected info for 1507 would be: 246 176 0:37 /child /tmp/child rw master:93 - tmpfs tmpfs rw,seclabel Thanks, Maxim #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include int main() { const char *child = "/tmp/child"; int ret; printf("main pid = %d\n", getpid()); /* make our own private playground ... 
	 */
	ret = unshare(CLONE_NEWNS);
	if (ret) {
		perror("unshare");
		exit(1);
	}

	ret = mount("none", "/", NULL, MS_REC|MS_PRIVATE, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	ret = mount("none", "/", NULL, MS_REC|MS_SHARED, NULL);
	if (ret) {
		perror("mount2");
		exit(1);
	}

	/* fork monitor ... */
	ret = fork();
	if (ret < 0) {
		perror("fork");
		exit(1);
	} else if (!ret) {
		printf("monitor pid = %d\n", getpid());

		ret = unshare(CLONE_NEWNS);
		if (ret) {
			perror("unshare in monitor");
			exit(1);
		}

		ret = mount("none", "/", NULL, MS_REC|MS_SHARED, NULL);
		if (ret) {
			perror("mount in monitor");
			exit(1);
		}

		ret = mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL);
		if (ret) {
			perror("mount2 in monitor");
			exit(1);
		}

		sleep(-1);
	}

	/* wait monitor to setup */
	sleep(1);

	/* fork child ... */
	ret = fork();
	if (ret < 0) {
		perror("fork");
		exit(1);
	} else if (!ret) {
		printf("child pid = %d\n", getpid());

		ret = unshare(CLONE_NEWNS);
		if (ret) {
			perror("unshare in child");
			exit(1);
		}

		ret = mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL);
		if (ret) {
			perror("mount in child");
[PATCH] fs/pnode.c: treat zero mnt_group_id-s as unequal
propagate_one(m) calculates the "type" argument for copy_tree() like this:

	if (m->mnt_group_id == last_dest->mnt_group_id) {
		type = CL_MAKE_SHARED;
	} else {
		type = CL_SLAVE;
		if (IS_MNT_SHARED(m))
			type |= CL_MAKE_SHARED;
	}

The "type" argument then governs clone_mnt() behavior with respect to flags and mnt_master of the new mount. When we iterate through a slave group, it is possible that both the current "m" and "last_dest" are not shared (although both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison above erroneously makes the new mount shared and sets its mnt_master to last_source->mnt_master. The patch fixes the problem by handling zero mnt_group_id-s as though they are unequal.

A similar problem exists in the implementation of the "else" clause above when we have to ascend upward in the master/slave tree by calling:

	last_source = last_source->mnt_master;
	last_dest = last_source->mnt_parent;

the proper number of times. The last step is governed by the "n->mnt_group_id != last_dest->mnt_group_id" condition, which may lie if both are zero. The patch fixes this case in the same way as the former one.
Signed-off-by: Maxim Patlasov
---
 fs/pnode.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 6367e1e..abf156a 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -212,7 +212,7 @@ static int propagate_one(struct mount *m)
 	/* skip if mountpoint isn't covered by it */
 	if (!is_subdir(mp->m_dentry, m->mnt.mnt_root))
 		return 0;
-	if (m->mnt_group_id == last_dest->mnt_group_id) {
+	if (m->mnt_group_id && m->mnt_group_id == last_dest->mnt_group_id) {
 		type = CL_MAKE_SHARED;
 	} else {
 		struct mount *n, *p;
@@ -223,7 +223,9 @@ static int propagate_one(struct mount *m)
 			last_source = last_source->mnt_master;
 			last_dest = last_source->mnt_parent;
 		}
-		if (n->mnt_group_id != last_dest->mnt_group_id) {
+		if (n->mnt_group_id != last_dest->mnt_group_id ||
+		    (!n->mnt_group_id && !last_dest->mnt_group_id)) {
 			last_source = last_source->mnt_master;
 			last_dest = last_source->mnt_parent;
 		}
Re: [PATCH] kvm: do not SetPageDirty from kvm_set_pfn_dirty for file mappings
On 02/12/2016 05:48 AM, Dmitry Monakhov wrote:
> Maxim Patlasov writes:
>> The patch solves the following problem: filesystem-specific routines involved in the ordinary writeback process BUG_ON page_buffers() because a page goes to writeback without buffer-heads attached.
>>
>> The way kvm_set_pfn_dirty calls SetPageDirty only works for anon mappings. For file mappings it is obviously incorrect - there page_mkwrite must be called. It's not easy to add a page_mkwrite call to kvm_set_pfn_dirty because there is no universal way to find the vma by pfn. But actually SetPageDirty may simply be skipped in those cases. Below is a justification.
>
> Confirm. I've hit that BUG_ON:
>
> [ 4442.219121] [ cut here ]
> [ 4442.219188] kernel BUG at fs/ext4/inode.c:2285!
> <...>
>
>> When the guest modifies the content of a page with a file mapping, kvm makes the page dirty by the following call-path:
>>
>>	vmx_handle_exit -> handle_ept_violation -> __get_user_pages ->
>>	page_mkwrite -> SetPageDirty
>>
>> Since then, the page is dirty from both the guest and the host point of view. Then the host makes writeback and marks the page as write-protected. So any further write from the guest triggers the call-path above again.
>
> Please elaborate the exact call-path which marks the host page.

wb_workfn -> wb_do_writeback -> wb_writeback -> __writeback_inodes_wb ->
writeback_sb_inodes -> __writeback_single_inode -> do_writepages ->
ext4_writepages -> mpage_prepare_extent_to_map -> mpage_process_page_bufs ->
mpage_submit_page -> clear_page_dirty_for_io -> page_mkclean -> rmap_walk ->
rmap_walk_file -> page_mkclean_one -> pte_wrprotect ->
pte_clear_flags(pte, _PAGE_RW)

Thanks,
Maxim

>> So, for file mappings, it's not possible to have new data written to a page inside the guest w/o a corresponding SetPageDirty on the host. This makes the explicit SetPageDirty from kvm_set_pfn_dirty redundant.
Signed-off-by: Maxim Patlasov
---
 virt/kvm/kvm_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a11cfd2..5a7d3fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1582,7 +1582,8 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 	if (!kvm_is_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (!PageReserved(page))
+		if (!PageReserved(page) &&
+		    (!page->mapping || PageAnon(page)))
 			SetPageDirty(page);
 	}
 }
___
Linux-nvdimm mailing list
linux-nvd...@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH] kvm: do not SetPageDirty from kvm_set_pfn_dirty for file mappings
The patch solves the following problem: filesystem-specific routines involved in the ordinary writeback process BUG_ON page_buffers() because a page goes to writeback without buffer-heads attached.

The way kvm_set_pfn_dirty calls SetPageDirty only works for anon mappings. For file mappings it is obviously incorrect - there page_mkwrite must be called. It's not easy to add a page_mkwrite call to kvm_set_pfn_dirty because there is no universal way to find the vma by pfn. But actually SetPageDirty may simply be skipped in those cases. Below is a justification.

When the guest modifies the content of a page with a file mapping, kvm makes the page dirty by the following call-path:

	vmx_handle_exit -> handle_ept_violation -> __get_user_pages ->
	page_mkwrite -> SetPageDirty

Since then, the page is dirty from both the guest and the host point of view. Then the host makes writeback and marks the page as write-protected. So any further write from the guest triggers the call-path above again.

So, for file mappings, it's not possible to have new data written to a page inside the guest w/o a corresponding SetPageDirty on the host. This makes the explicit SetPageDirty from kvm_set_pfn_dirty redundant.

Signed-off-by: Maxim Patlasov
---
 virt/kvm/kvm_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a11cfd2..5a7d3fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1582,7 +1582,8 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 	if (!kvm_is_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (!PageReserved(page))
+		if (!PageReserved(page) &&
+		    (!page->mapping || PageAnon(page)))
 			SetPageDirty(page);
 	}
 }
Re: [fuse-devel] [PATCH] fuse: break infinite loop in fuse_fill_write_pages()
On 10/02/2015 06:58 PM, Konstantin Khlebnikov wrote:
> On Sat, Oct 3, 2015 at 1:04 AM, Andrew Morton wrote:
>> On Fri, 2 Oct 2015 12:27:45 -0700 Maxim Patlasov wrote:
>>> On 10/02/2015 04:21 AM, Konstantin Khlebnikov wrote:
>>>> Bump. Add more people in CC.
>>>> On Mon, Sep 21, 2015 at 1:02 PM, Roman Gushchin wrote:
>>>>> I got a report about an unkillable task eating CPU. The further investigation shows that the problem is in the fuse_fill_write_pages() function. If the iov's first segment has zero length, we get an infinite loop, because we never reach the iov_iter_advance() call.
>>> iov_iter_copy_from_user_atomic() eventually calls iterate_iovec(). The latter silently consumes a zero-length iov. So I don't think "iov's first segment has zero length" can cause an infinite loop.
>> I'm suspecting it got stuck because the local variable `bytes' is zero, so the code does `goto again' repeatedly. Or maybe not. A more complete description of the bug would help.
> I suspect it is the same scenario as in 124d3b7041f: a zero-length segment is followed by a segment with an invalid address. iov_iter_fault_in_readable() checks only the first segment (zero-length); iov_iter_copy_from_user_atomic() skips it, fails at the second and returns zero -> goto again without skipping the zero-length segment. The patch calls iov_iter_advance() before goto again: we'll skip the zero-length segment at the second iteration and iov_iter_fault_in_readable() will detect the invalid address.

Makes sense to me. The patch looks fine.

Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [fuse-devel] [PATCH] fuse: break infinite loop in fuse_fill_write_pages()
On 10/02/2015 04:21 AM, Konstantin Khlebnikov wrote:
> Bump. Add more people in CC.
> On Mon, Sep 21, 2015 at 1:02 PM, Roman Gushchin wrote:
>> I got a report about an unkillable task eating CPU. The further investigation shows that the problem is in the fuse_fill_write_pages() function. If the iov's first segment has zero length, we get an infinite loop, because we never reach the iov_iter_advance() call.

iov_iter_copy_from_user_atomic() eventually calls iterate_iovec(). The latter silently consumes a zero-length iov. So I don't think "iov's first segment has zero length" can cause an infinite loop.

Thanks,
Maxim

>> Fix this by calling iov_iter_advance() before repeating an attempt to copy data from userspace. A similar problem is described in 124d3b7041f ("fix writev regression: pan hanging unkillable and un-straceable").
>>
>> Signed-off-by: Roman Gushchin
>> Cc: Miklos Szeredi
>> ---
>>  fs/fuse/file.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index f523f2f..195476a 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -1049,6 +1049,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
>>  		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
>>  		flush_dcache_page(page);
>>
>> +		iov_iter_advance(ii, tmp);
>>  		if (!tmp) {
>>  			unlock_page(page);
>>  			page_cache_release(page);
>> @@ -1061,7 +1062,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
>>  		req->page_descs[req->num_pages].length = tmp;
>>  		req->num_pages++;
>>
>> -		iov_iter_advance(ii, tmp);
>>  		count += tmp;
>>  		pos += tmp;
>>  		offset += tmp;
>> --
>> 2.4.3

--
fuse-devel mailing list
fuse-de...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fuse-devel
Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
On 03/18/2015 07:57 PM, Ming Lei wrote:
> On Thu, Mar 19, 2015 at 2:28 AM, Maxim Patlasov wrote:
>> On 01/13/2015 07:44 AM, Ming Lei wrote:
>>> Part of the patch is based on Dave's previous post. This patch submits I/O to the fs via kernel aio, and we obtain the following benefits:
>>> - double caching in both the loop file system and the backend file is avoided
>>> - context switches decrease a lot, and finally CPU utilization is decreased
>>> - cached memory decreases a lot
>>> One main side effect is that throughput is decreased when accessing the raw loop block (not through a filesystem) with kernel aio.
>>> This patch has passed the xfstests test (./check -g auto); both test and scratch devices are loop block, the file system is ext4. Below are two fio tests' results:
>>> 1. fio test inside ext4 file system over loop block
>>> 1) How to run
>>> - linux kernel base: 3.19.0-rc3-next-20150108 (loop-mq merged)
>>> - loop over SSD image 1 in ext4
>>> - linux psync, 16 jobs, size 200M, ext4 over loop block
>>> - test result: IOPS from fio output
>>> 2) Throughput result:
>>>   test cases      | randread | read  | randwrite | write
>>>   ----------------+----------+-------+-----------+------
>>>   base            | 16799    | 59508 | 31059     | 58829
>>>   base+kernel aio | 15480    | 64453 | 30187     | 57222
>> Ming, it's important to understand the overhead of the aio_kernel_() implementation. So could you please add test results for a raw SSD device to the table above next time (in v3 of your patches).
> What aio_kernel_() does is just call ->read_iter()/->write_iter(), so it should not introduce extra overhead. From a performance view, the effect is only from switching to O_DIRECT. With O_DIRECT, double caching can be avoided; meanwhile both page cache usage and CPU utilization can be decreased.

The way you reused the loop_queue_rq() --> queue_work() functionality (added earlier, by commit b5dd2f604) may affect performance of O_DIRECT operations. It can be easily demonstrated on a ram-drive, but measurements on real storage h/w would be more convincing.
Btw, when you wrote "linux psync, 16 jobs, size 200M, ext4 over loop block" -- does it mean that there were 16 threads in userspace submitting I/O concurrently? If yes, a throughput comparison for a single-job test would also be useful to look at.

Thanks,
Maxim
Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
On 01/13/2015 07:44 AM, Ming Lei wrote:
> Part of the patch is based on Dave's previous post. This patch submits I/O to the fs via kernel aio, and we obtain the following benefits:
> - double caching in both the loop file system and the backend file is avoided
> - context switches decrease a lot, and finally CPU utilization is decreased
> - cached memory decreases a lot
> One main side effect is that throughput is decreased when accessing the raw loop block (not through a filesystem) with kernel aio.
> This patch has passed the xfstests test (./check -g auto); both test and scratch devices are loop block, the file system is ext4. Below are two fio tests' results:
> 1. fio test inside ext4 file system over loop block
> 1) How to run
> - linux kernel base: 3.19.0-rc3-next-20150108 (loop-mq merged)
> - loop over SSD image 1 in ext4
> - linux psync, 16 jobs, size 200M, ext4 over loop block
> - test result: IOPS from fio output
> 2) Throughput result:
>   test cases      | randread | read  | randwrite | write
>   ----------------+----------+-------+-----------+------
>   base            | 16799    | 59508 | 31059     | 58829
>   base+kernel aio | 15480    | 64453 | 30187     | 57222

Ming, it's important to understand the overhead of the aio_kernel_() implementation. So could you please add test results for a raw SSD device to the table above next time (in v3 of your patches).

Jens, if you have some fast storage at hand, could you please measure IOPS for Ming's patches vs. a raw block device -- to ensure that the patches do not impose too low a limit on performance.

Thanks,
Maxim
Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec
On 12/31/2014 04:52 PM, Ming Lei wrote:
> On Thu, Jan 1, 2015 at 6:35 AM, Sedat Dilek wrote:
>> On Wed, Dec 31, 2014 at 10:52 PM, Dave Kleikamp wrote:
>>> On 12/31/2014 02:38 PM, Sedat Dilek wrote:
>>>> What has happened to that aio_loop patchset? Is it in Linux-next? (/me started to play with "block: loop: convert to blk-mq (v3)", so I recalled this other improvement.)
>>> It met with some harsh resistance, so I backed off on it. Then Al Viro got busy re-writing the iov_iter infrastructure and I put my patchset on the shelf to look at later. Then Ming Lei submitted a more up-to-date patchset: https://lkml.org/lkml/2014/8/6/175
>>> It looks like Ming is currently only pushing the first half of that patchset. I don't know what his plans are for the last three patches:
>>>   aio: add aio_kernel_() interface
>>>   fd/direct-io: introduce should_dirty for kernel aio
>>>   block: loop: support to submit I/O via kernel aio based
>> I tested with block-mq-v3 (for next-20141231) [1] and this looks promising [2]. Maybe Ming can say what the plan is with the missing parts.
> I have compared kernel aio based loop-mq (the other 3 aio patches against loop-mq v2, [1]) with loop-mq v3; the data doesn't look better than loop-mq v3. The kernel aio based approach requires direct I/O, and direct write, at least, shouldn't be as good as page cache write, IMO. So I think we need to investigate the kernel aio based approach further wrt. loop improvement.

A great advantage of kernel aio for the loop device is the avoidance of double caching: data from the page cache of the inner filesystem (mounted on the loop device) won't be cached again in the page cache of the outer filesystem (the one that keeps the image file of the loop device). So I don't think it's correct to compare the performance of aio based loop-mq with loop-mq v3. The aio based approach is OK as long as it doesn't introduce significant overhead compared with submitting bio-s straightforwardly from the loop device (or any other in-kernel user of kernel aio) to the host block device.
Thanks,
Maxim
Re: [PATCH 0/5] fuse: handle release synchronously (v4)
Hi Miklos,

On 10/09/2014 12:14 PM, Miklos Szeredi wrote:
> On Wed, Oct 1, 2014 at 1:28 PM, Maxim Patlasov wrote:
>> Given those patches must die, do you have any ideas how to resolve that "spurious EBUSY" problem?
> Check the "sync_release" branch of fuse:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git sync_release
>
> And same branch name for libfuse:
>
>   git://git.code.sf.net/p/fuse/fuse sync_release
>
> What it does is send RELEASE from ->flush() after checking the refcount of file (being careful about RCU accesses). Lightly tested, more testing, as well as review, is welcome.

Thank you very much for the efforts, highly appreciated! I've had a close look at your patches and found a few issues. Most of them can be easily fixed, but one puzzles me: the way you detect the last flush is not race free. Something as simple as:

	int main(int argc, char *argv[])
	{
		int fd = open(argv[1], O_RDWR);
		fork();
	}

may easily dive into fuse_try_sync_release() concurrently, and both calls observe file->f_count == 2. Then both return, falling back to sending the release asynchronously. This makes sync/async behaviour unpredictable even for well-behaved applications which don't do any esoteric things like racing i/o with close, or exiting while a descriptor is in-flight in a unix domain socket.

I cannot see any way to recognise the last flush without help from the VFS layer, can you?

Thanks,
Maxim
Re: [PATCH 0/5] fuse: handle release synchronously (v4)
On 10/01/2014 12:44 AM, Linus Torvalds wrote:
> On Tue, Sep 30, 2014 at 12:19 PM, Miklos Szeredi wrote:
>> What about flock(2), FL_SETLEASE, etc semantics (which are the sane ones, compared to the POSIX locks shit which mandates release of lock on each close(2) instead of "when all [duplicate] descriptors have been closed")? You have to do that from ->release(), there's no question about that.
> We do locks_remove_file() independently of ->release, but yes, it's basically done just before the last release. But it has the *exact* same semantics as release, including very much having nothing what-so-ever to do with "last close()". If the file descriptor is opened for other reasons (ie mmap, /proc accesses, whatever), then that delays locks_remove_file() the same way it delays release.
>
> None of that has *anything* to do with "synchronous". Thinking it does is wrong. And none of this has *anything* to do with the issue that Maxim pointed to in the mailing list web page, which was about write caches, and how you cannot (and MUST NOT) delay them until release time.

I apologise for mentioning that mailing list web page in my title message. This was really misleading; I should have thought about it in advance. Of course, write caches must be flushed in the scope of ->flush(), not ->release().

Let me set forth the use-case that led me to those patches. We implemented a FUSE-based distributed storage solution intended for keeping images of VMs (virtual machines) and their configuration files. The way VMs use images makes exclusive-open()er semantics very attractive: while a VM is using its image on a node, concurrent access from other nodes to that image is neither desirable nor necessary. So, we acquire an exclusive lease on FUSE_OPEN and release it on FUSE_RELEASE. This is quite natural and has obviously nothing to do with FUSE_FLUSH.
Following such semantics, there are two choices for handling open() if the file is currently exclusively locked by a remote node: (a) return EBUSY; (b) block until the remote node releases the file. We decided on (a), because (b) is very inconvenient in practice: most applications handle a failed open(2) properly, but very few are clever enough to spawn a separate thread doing the open() and kill it if the open() has not succeeded within a reasonable time.

The patches I sent do essentially one thing: they make FUSE ->release() wait for an ACK from userspace before returning. Without these patches, any attempt to test or use our storage in valid use-cases led to spurious EBUSY. For example, while migrating a VM from one node to another, we first close the image file on the source node, then try to open it on the destination node, but fail because FUSE_RELEASE has not been processed by userspace on the source node yet.

Given those patches must die, do you have any ideas how to resolve that "spurious EBUSY" problem?

Thanks,
Maxim
[PATCH 5/5] fuse: enable close_wait synchronous release
The patch enables the feature by passing 'true' to fuse_file_put in fuse_release_common. Previously, this was safe only in special cases when we are sure that a multi-threaded userspace won't deadlock if we synchronously send FUSE_RELEASE in the context of a read-ahead or write-back callback. Now, it's always safe because callbacks don't send requests to userspace anymore. The feature can be made privileged by means of the DISABLE_SYNC_RELEASE mount option implemented by the previous patch.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c4599c8..858280f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -334,7 +334,8 @@ void fuse_release_common(struct file *file, int opcode)
 	 * synchronous RELEASE is allowed (and desirable) in this case
 	 * because the server can be trusted not to screw up.
 	 */
-	fuse_file_put(ff, ff->fc->destroy_req != NULL);
+	fuse_file_put(ff, ff->fc->destroy_req != NULL ||
+		      must_release_synchronously(ff));
 }
 
 static int fuse_open(struct inode *inode, struct file *file)
[PATCH 4/5] fuse: add mount option to disable synchronous release
Synchronous release ensures that kernel fuse reports exactly the last fput() to userspace. However, nothing guarantees that the last fput() will happen in the context of close(2). So userspace applications must be prepared for the situation when close(2) has returned, but processing of FUSE_RELEASE has not yet been completed by the fuse daemon. To make testing such use cases easier, the patch introduces the DISABLE_SYNC_RELEASE mount option, which effectively masks out the FOPEN_SYNC_RELEASE flag. Also, the mount option can be used to protect from DoS scenarios with sync release: fusermount can be instrumented to use the option by default for unprivileged mounts (allowing the system administrator to configure it like "user_allow_other").

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c   | 3 ++-
 fs/fuse/fuse_i.h | 3 +++
 fs/fuse/inode.c  | 8 ++++++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8713e62..c4599c8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -292,7 +292,8 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode)
 
 static bool must_release_synchronously(struct fuse_file *ff)
 {
-	return ff->open_flags & FOPEN_SYNC_RELEASE;
+	return ff->open_flags & FOPEN_SYNC_RELEASE &&
+	       !(ff->fc->flags & FUSE_DISABLE_SYNC_RELEASE);
 }
 
 void fuse_release_common(struct file *file, int opcode)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e8e47a6..c5e2fca 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,9 @@
     doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER (1 << 1)
 
+/** Disable synchronous release */
+#define FUSE_DISABLE_SYNC_RELEASE (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 03246cd..86d47d0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -463,6 +463,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_DISABLE_SYNC_RELEASE,
 	OPT_ERR
 };
 
@@ -475,6 +476,7 @@ static
 const match_table_t tokens = {
 	{OPT_ALLOW_OTHER, "allow_other"},
 	{OPT_MAX_READ, "max_read=%u"},
 	{OPT_BLKSIZE, "blksize=%u"},
+	{OPT_DISABLE_SYNC_RELEASE, "disable_sync_release"},
 	{OPT_ERR, NULL}
 };
 
@@ -560,6 +562,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 			d->blksize = value;
 			break;
 
+		case OPT_DISABLE_SYNC_RELEASE:
+			d->flags |= FUSE_DISABLE_SYNC_RELEASE;
+			break;
+
 		default:
 			return 0;
 		}
@@ -583,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_puts(m, ",default_permissions");
 	if (fc->flags & FUSE_ALLOW_OTHER)
 		seq_puts(m, ",allow_other");
+	if (fc->flags & FUSE_DISABLE_SYNC_RELEASE)
+		seq_puts(m, ",disable_sync_release");
 	if (fc->max_read != ~0)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
[PATCH 3/5] fuse: wait for end of IO on release
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace.

Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put cannot be the last holders of the fuse file anymore. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/file.c | 71
 1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7723b3f..8713e62 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -149,6 +149,17 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
 	}
 }
 
+/*
+ * Asynchronous callbacks may use it instead of fuse_file_put() because
+ * we guarantee that they are never last holders of ff. Hitting BUG() below
+ * will make clear any violation of the guarantee.
+ */
+static void __fuse_file_put(struct fuse_file *ff)
+{
+	if (atomic_dec_and_test(&ff->count))
+		BUG();
+}
+
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
		 bool isdir)
 {
@@ -279,6 +290,11 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode)
 	req->in.args[0].value = inarg;
 }
 
+static bool must_release_synchronously(struct fuse_file *ff)
+{
+	return ff->open_flags & FOPEN_SYNC_RELEASE;
+}
+
 void fuse_release_common(struct file *file, int opcode)
 {
 	struct fuse_file *ff;
@@ -302,6 +318,13 @@ void fuse_release_common(struct file *file, int opcode)
 	req->misc.release.path = file->f_path;
 
 	/*
+	 * No more in-flight asynchronous READ or WRITE requests if
+	 * fuse file release is synchronous
+	 */
+	if (must_release_synchronously(ff))
+		BUG_ON(atomic_read(&ff->count) != 1);
+
+	/*
 	 * Normally this will send the RELEASE request, however if
 	 * some asynchronous READ or WRITE requests are outstanding,
 	 * the sending will be delayed.
@@ -321,11 +344,34 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_file *ff = file->private_data;
 
 	/* see fuse_vma_close() for !writeback_cache case */
 	if (fc->writeback_cache)
 		write_inode_now(inode, 1);
 
+	if (must_release_synchronously(ff)) {
+		struct fuse_inode *fi = get_fuse_inode(inode);
+
+		/*
+		 * Must remove file from write list. Otherwise it is possible
+		 * this file will get more writeback from another files
+		 * rerouted via write_files.
+		 */
+		spin_lock(&ff->fc->lock);
+		list_del_init(&ff->write_entry);
+		spin_unlock(&ff->fc->lock);
+
+		wait_event(fi->page_waitq, atomic_read(&ff->count) == 1);
+
+		/*
+		 * spin_unlock_wait(&ff->fc->lock) would be natural here to
+		 * wait for threads just released ff to leave their critical
+		 * sections. But taking spinlock is the first thing
+		 * fuse_release_common does, so that this is unnecessary.
+		 */
+	}
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -823,8 +869,15 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 		unlock_page(page);
 		page_cache_release(page);
 	}
-	if (req->ff)
-		fuse_file_put(req->ff, false);
+	if (req->ff) {
+		if (must_release_synchronously(req->ff)) {
+			struct fuse_inode *fi = get_fuse_inode(req->inode);
+
+			__fuse_file_put(req->ff);
+			wake_up(&fi->page_waitq);
+		} else
+			fuse_file_put(req->ff, false);
+	}
 }
 
 struct fuse_fill_data {
@@ -851,6 +904,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data)
 	if (fc->async_read) {
 		req->ff = fuse_file_get(ff);
 		req->end = fuse_readpages_end;
+		req->inode = data->inode;
 		fuse_request_send_background(fc, req);
 	} else {
 		fuse_request_send(fc, req);
@@ -1502,7 +1556,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 	for (i = 0; i < req->num_pages; i++)
 		__free_page(req->pages[i]);
 
-	if (req->ff
[PATCH 1/5] fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags
The feature will be governed by the fuse file open flag FOPEN_SYNC_RELEASE. Userspace can enable it on a per-file basis in the same way as FOPEN_KEEP_CACHE or FOPEN_DIRECT_IO.

Signed-off-by: Maxim Patlasov
---
 include/uapi/linux/fuse.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 25084a0..607c45c 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -205,10 +205,13 @@ struct fuse_file_lock {
  * FOPEN_DIRECT_IO: bypass page cache for this open file
  * FOPEN_KEEP_CACHE: don't invalidate the data cache on open
  * FOPEN_NONSEEKABLE: the file is not seekable
+ * FOPEN_SYNC_RELEASE: synchronously release file on last fput,
+ * which, in turn, not always bound to fclose(2)!
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
 #define FOPEN_NONSEEKABLE	(1 << 2)
+#define FOPEN_SYNC_RELEASE	(1 << 3)
 
 /**
  * INIT request/reply flags
[PATCH 2/5] fuse: cosmetic rework of fuse_send_readpages
The patch changes the arguments of fuse_send_readpages to give it access to the inode (this will be used in the next patch of the patch-set). The change is cosmetic, no logic changed. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 912061a..7723b3f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -827,8 +827,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) fuse_file_put(req->ff, false); } -static void fuse_send_readpages(struct fuse_req *req, struct file *file) +struct fuse_fill_data { + struct fuse_req *req; + struct file *file; + struct inode *inode; + unsigned nr_pages; +}; + +static void fuse_send_readpages(struct fuse_fill_data *data) { + struct fuse_req *req = data->req; + struct file *file = data->file; struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; loff_t pos = page_offset(req->pages[0]); @@ -850,13 +859,6 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file) } } -struct fuse_fill_data { - struct fuse_req *req; - struct file *file; - struct inode *inode; - unsigned nr_pages; -}; - static int fuse_readpages_fill(void *_data, struct page *page) { struct fuse_fill_data *data = _data; @@ -872,7 +874,7 @@ static int fuse_readpages_fill(void *_data, struct page *page) req->pages[req->num_pages - 1]->index + 1 != page->index)) { int nr_alloc = min_t(unsigned, data->nr_pages, FUSE_MAX_PAGES_PER_REQ); - fuse_send_readpages(req, data->file); + fuse_send_readpages(data); if (fc->async_read) req = fuse_get_req_for_background(fc, nr_alloc); else @@ -925,7 +927,7 @@ static int fuse_readpages(struct file *file, struct address_space *mapping, err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); if (!err) { if (data.req->num_pages) - fuse_send_readpages(data.req, file); + fuse_send_readpages(&data); else fuse_put_request(fc, data.req); }
[PATCH 0/5] fuse: handle release synchronously (v4)
Hi, There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693 A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423 In short, the problem is that fuse_release (which is usually called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap during which the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file. The patch-set resolves the problem by making fuse_release synchronous: wait for the ACK from userspace for FUSE_RELEASE if the feature is ON. It must be emphasized that even if the feature is enabled (i.e. fuse_release is synchronous), nothing guarantees that fuse_release() will be called in the context of close(2). In fact, it may happen later, on the last fput(). However, there are still a lot of use cases when close(2) is synchronous, so the feature must be regarded as an optimization maximizing the chances of synchronous behaviour of close(2). To keep single-threaded userspace implementations happy, the patch-set ensures that by the time fuse_release_common calls fuse_file_put, no more in-flight I/O exists. Asynchronous fuse callbacks (like fuse_readpages_end) cannot trigger FUSE_RELEASE anymore. Hence, we'll never block in contexts other than close(). Changed in v2: - improved comments, commented spin_unlock_wait out according to Brian's suggestions. - rebased on the v3.15-rc8 tag of Linus' tree.
Changed in v3: - changed patch-set title from "close file synchronously" to "handle release synchronously" - renamed CLOSE_WAIT to SYNC_RELEASE to convey the essence of the feature ("synchronous release" instead of "wait on close") - enabled the feature on a per-file basis (instead of a global fuse_conn parameter) - added "disable_sync_release" mount option - rebased on the v3.17-rc1 tag of Linus' tree. Changed in v4: - removed redundant locking around __fuse_file_put() - removed a patch making the FUSE_RELEASE request uninterruptible. As Miklos pointed out, if the request is interrupted, then we should possibly allow that, effectively backgrounding the request. However, such backgrounding is not easy to implement without breaking the locking expectations of the userspace filesystem. Given that the problem has existed for fuseblk mounts for a long time and there is no reasonable solution now, I put it aside for the future. Thanks, Maxim --- Maxim Patlasov (5): fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags fuse: cosmetic rework of fuse_send_readpages fuse: wait for end of IO on release fuse: add mount option to disable synchronous release fuse: enable close_wait synchronous release fs/fuse/file.c | 97 ++--- fs/fuse/fuse_i.h | 3 + fs/fuse/inode.c | 8 include/uapi/linux/fuse.h | 3 + 4 files changed, 95 insertions(+), 16 deletions(-)
Re: [PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
On 09/16/2014 12:19 PM, Miklos Szeredi wrote: On Thu, Sep 11, 2014 at 6:14 PM, Maxim Patlasov wrote: I really need your help to proceed with this patch. Could you please explain what those places are where we should allow interruption? BTW, as for "just an optimization", I've recently noticed that __fput() calls locks_remove_file(). So the guarantees provided by the patch-set are on the same level as flock(2) behaviour. SIGKILL trumps that. At least that's what I think, and that's what NFS currently does as well, AFAICS. Also fuse really should distinguish fatal and non-fatal interruptions and handle them accordingly... And elaborate on this concern, please. Requests have two states where they stay for any significant amount of time: PENDING (queued to userspace) and SENT (in userspace). Currently we do the following for interrupted requests:
PENDING:
- non-fatal signal: do nothing
- fatal signal: dequeue and return -EINTR, unless force is set
SENT:
- send INTERRUPT request to userspace
This is fine, but fatal interrupts should be able to abort SENT and forced requests as well without having to wait for the userspace reply. This is what I was referring to. Thank you for the detailed clarification; that's much clearer now. If I understood it right, fatal signals must abort *any* request in *any* state. The only difference between forced and non-forced requests is that forced ones must eventually be delivered to userspace in all cases (even if they were in PENDING state when we were interrupted and we returned -EINTR). The thing that bothers me is the net result of these changes. Yes, the end user will be able to interrupt his app by SIGKILL if it is waiting in request_wait_answer(). But there are many other places where kernel fuse waits for something dependent on userspace. Do you think we have to make those places interruptible as well? This would not be difficult, were it not for i_mutex and s_vfs_rename_mutex being held by some operations.
For correctness, we can't release these while a reply is not received, since the locking expectations of the userspace filesystem would not be met. This can be solved by adding shadow locks to fuse that we hold onto even after the request is interrupted. Shadow locking does not seem to be enough. For example, we have to postpone FUSE_RELEASE until all interrupted synchronous I/O is ACKed by userspace. And similarly we shouldn't surprise userspace with FUSE_DESTROY if any requests are still in flight. Maybe there are other hidden dependencies that don't come to mind now. Thanks, Maxim
Re: [PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
On 08/22/2014 06:08 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:09 PM, Maxim Patlasov wrote: If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible. Otherwise request could be interrupted, but file association in user space remains. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |4 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cd55488..b92143a 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path); Some thought needs to go into this: if RELEASE is interrupted, then we should possibly allow that, effectively backgrounding the request. The synchronous nature is just an optimization and we really don't know where we are being interrupted, possibly in a place which very much *should* allow interruption. I really need your help to proceed with this patch. Could you please explain what those places are where we should allow interruption. BTW, as for "just an optimization", I've recently noticed that __fput() calls locks_remove_file(). So guarantees provided by the patch-set are on the same level as flock(2) behaviour. Also fuse really should distinguish fatal and non-fatal interruptions and handle them accordingly... And elaborate on this concern, please. Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] fuse: fix regression in fuse_get_user_pages()
On 09/10/2014 01:51 PM, Miklos Szeredi wrote: On Wed, Sep 03, 2014 at 02:10:23PM +0400, Maxim Patlasov wrote: Hi, The patchset fixes a regression introduced by the following commits: c7f3888ad7f0 ("switch iov_iter_get_pages() to passing maximal number of pages") c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()") Hmm, instead of reverting to passing maxbytes *instead* of maxpages, I think the right fix is to *add* the maxbytes argument. Just maxbytes alone doesn't have enough information in it. E.g. 4096 contiguous bytes could occupy 1 or 2 pages, depending on the starting offset. Yes, you are right. I missed that c7f3888ad7f0 fixed a subtle bug in get_pages_iovec(). So how about the following (untested) patch? Your patch works fine in my tests. Thanks, Maxim
[PATCH 2/2] fuse: fuse_get_user_pages(): do not pack more data than requested
The patch fixes a bug introduced by commit c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()"): The third argument of fuse_get_user_pages(), "nbytesp", refers to the number of bytes a caller asked to pack into the fuse request. This value may be less than the capacity of the fuse request or iov_iter. So fuse_get_user_pages() must ensure that *nbytesp won't grow. Before that commit, it was ensured by: > ret = get_user_pages_fast(user_addr, npages, !write, > &req->pages[req->num_pages]); > ... > npages = ret; > frag_size = min_t(size_t, frag_size, > (npages << PAGE_SHIFT) - offset); Now that the helper iov_iter_get_pages() performs all the hard work of extracting pages from the iov_iter, this can be done by passing a properly calculated "maxsize" to the helper. Reported-by: Werner Baumann Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 40ac262..1d2bb70 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1303,10 +1303,15 @@ static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii, while (nbytes < *nbytesp && req->num_pages < req->max_pages) { unsigned npages; size_t start; + ssize_t ret; unsigned n = req->max_pages - req->num_pages; - ssize_t ret = iov_iter_get_pages(ii, - &req->pages[req->num_pages], - n * PAGE_SIZE, &start); + size_t frag_size = fuse_get_frag_size(ii, *nbytesp - nbytes); + + frag_size = min_t(size_t, frag_size, n << PAGE_SHIFT); + + ret = iov_iter_get_pages(ii, + &req->pages[req->num_pages], + frag_size, &start); if (ret < 0) return ret;
[PATCH 1/2] vfs: switch iov_iter_get_pages() to passing maximal size
The patch reverts the commit c7f3888ad7f0 ("switch iov_iter_get_pages() to passing maximal number of pages") because FUSE does need the maxsize argument that existed before that commit. Signed-off-by: Maxim Patlasov --- fs/direct-io.c |2 +- fs/fuse/file.c |4 ++-- include/linux/uio.h |2 +- mm/iov_iter.c | 17 + 4 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index c311640..17e39b0 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -158,7 +158,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio) { ssize_t ret; - ret = iov_iter_get_pages(sdio->iter, dio->pages, DIO_PAGES, + ret = iov_iter_get_pages(sdio->iter, dio->pages, DIO_PAGES * PAGE_SIZE, &sdio->from); if (ret < 0 && sdio->blocks_available && (dio->rw & WRITE)) { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 912061a..40ac262 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1303,10 +1303,10 @@ static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii, while (nbytes < *nbytesp && req->num_pages < req->max_pages) { unsigned npages; size_t start; + unsigned n = req->max_pages - req->num_pages; ssize_t ret = iov_iter_get_pages(ii, &req->pages[req->num_pages], - req->max_pages - req->num_pages, - &start); + n * PAGE_SIZE, &start); if (ret < 0) return ret; diff --git a/include/linux/uio.h b/include/linux/uio.h index 48d64e6..09a7cff 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -84,7 +84,7 @@ unsigned long iov_iter_alignment(const struct iov_iter *i); void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov, unsigned long nr_segs, size_t count); ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages, - unsigned maxpages, size_t *start); + size_t maxsize, size_t *start); ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages, size_t maxsize, size_t *start); int iov_iter_npages(const struct iov_iter *i, int maxpages); diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index ab88dc0..7b5dbd1 100644 --- a/mm/iov_iter.c +++ b/mm/iov_iter.c @@ -310,7 +310,7 @@ void iov_iter_init(struct iov_iter *i, int direction, EXPORT_SYMBOL(iov_iter_init); static ssize_t get_pages_iovec(struct iov_iter *i, - struct page **pages, unsigned maxpages, + struct page **pages, size_t maxsize, size_t *start) { size_t offset = i->iov_offset; @@ -323,10 +323,10 @@ static ssize_t get_pages_iovec(struct iov_iter *i, len = iov->iov_len - offset; if (len > i->count) len = i->count; + if (len > maxsize) + len = maxsize; addr = (unsigned long)iov->iov_base + offset; len += *start = addr & (PAGE_SIZE - 1); - if (len > maxpages * PAGE_SIZE) - len = maxpages * PAGE_SIZE; addr &= ~(PAGE_SIZE - 1); n = (len + PAGE_SIZE - 1) / PAGE_SIZE; res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pages); @@ -588,14 +588,15 @@ static unsigned long alignment_bvec(const struct iov_iter *i) } static ssize_t get_pages_bvec(struct iov_iter *i, - struct page **pages, unsigned maxpages, + struct page **pages, size_t maxsize, size_t *start) { const struct bio_vec *bvec = i->bvec; size_t len = bvec->bv_len - i->iov_offset; if (len > i->count) len = i->count; - /* can't be more than PAGE_SIZE */ + if (len > maxsize) + len = maxsize; *start = bvec->bv_offset + i->iov_offset; get_page(*pages = bvec->bv_page); @@ -711,13 +712,13 @@ unsigned long iov_iter_alignment(const struct iov_iter *i) EXPORT_SYMBOL(iov_iter_alignment); ssize_t iov_iter_get_pages(struct iov_iter *i, - struct page **pages, unsigned maxpages, + struct page **pages, size_t maxsize, size_t *start) { if (i->type & ITER_BVEC) - return get_pages_bvec(i, pages, maxpages, start); + return get_pages_bvec(i, pages, maxsize, start); else - return get_pages_iovec(i, pages, maxpages, start); + return get_pages_iovec(i, pages, maxsize, start); } EXPORT_SYMBOL(iov_iter_get_pages); -- To unsu
[PATCH 0/2] fuse: fix regression in fuse_get_user_pages()
Hi, The patchset fixes a regression introduced by the following commits: c7f3888ad7f0 ("switch iov_iter_get_pages() to passing maximal number of pages") c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()") The regression manifests itself like this (thanks to Werner Baumann for reporting): > davfs2 uses the fuse kernel module directly (not using the fuse > userspace library). A user of davfs2 reported this problem > (http://savannah.nongnu.org/support/?108640): > > dd if=/dev/zero of=/mnt/owncloud/test.txt bs=20416 count=1 > works fine, but > dd if=/dev/zero of=/mnt/owncloud/test.txt bs=20417 count=1 > fails. Thanks, Maxim --- Maxim Patlasov (2): vfs: switch iov_iter_get_pages() to passing maximal size fuse: fuse_get_user_pages(): do not pack more data than requested fs/direct-io.c | 2 +- fs/fuse/file.c | 13 + include/linux/uio.h | 2 +- mm/iov_iter.c | 17 + 4 files changed, 20 insertions(+), 14 deletions(-)
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 08/28/2014 06:06 AM, Ming Lei wrote: On 8/28/14, Maxim Patlasov wrote: On 08/21/2014 09:44 AM, Ming Lei wrote: On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe wrote: Reworked a bit more: http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6 One big problem of the commit is that it is basically a serialized workqueue because of the single &hctx->run_work, and a per-req work_struct has to be used for a concurrent implementation. So it looks like the approach isn't flexible enough compared with doing that in the driver; any idea about how to fix that? I'm interested in the overall price of handling requests in a separate thread. I used the following fio script:
[global]
direct=1
bsrange=512-512
timeout=10
numjobs=1
ioengine=sync
filename=/dev/loop0 # or /dev/nullb0
[f1]
rw=randwrite
to compare the performance of: 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops If you enable BLK_MQ_F_WQ_CONTEXT, it isn't strange to see this result since blk-mq implements a serialized workqueue. BLK_MQ_F_WQ_CONTEXT is not in 3.17.0-rc1, so I couldn't enable it. 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops In my recent investigation and discussion with Jens, using a workqueue may introduce some regression for cases like loop over null_blk, tmpfs. And 270K vs. 380K is a bit similar to my result, and it was observed that context switches increased by more than 50% with the introduction of the workqueue. The figures are similar, but the comparison is not. Both 270K and 380K refer to configurations where no extra context switch is involved. I will post V3 which will use the previous kthread, with blk-mq & kernel aio, which should make full use of blk-mq and kernel aio, and won't introduce a regression for cases like the above. That would be great! Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) The kernel aio submit is very similar to user space's implementation, except for the block plug&unplug usage in the user space aio submit path. If it is blocked in aio_kernel_submit(), you should observe a similar thing with io_submit() too. Yes, I agree. My point was that there is room for optimization, as my experiments demonstrate. The question is whether it's worthwhile to complicate kernel aio (and fs-specific code too) for the sake of that optimization. In fact, in a simple case like a block fs on top of a loopback device on top of a file on another block fs, what kernel aio does for the loopback driver is a subtle way of converting incoming bio-s to outgoing bio-s. In case you know where the image file is placed (e.g. by fiemap), such a conversion may be done with zero overhead, and anything that makes the overhead noticeable is suspicious. And it is easy to imagine other use cases where that extra context switch is avoidable. Thanks, Maxim
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 8/28/14, Zach Brown wrote: On Wed, Aug 27, 2014 at 09:19:36PM +0400, Maxim Patlasov wrote: On 08/27/2014 08:29 PM, Benjamin LaHaise wrote: On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote: ... 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) might first try that non-blocking submit as a fast path and, only if it fails, fall back to queueing. What filesystem is the backing file for loop0 on? O_DIRECT access as Ming's patches use should be non-blocking, and if not, that's something to fix. I used loop0 directly on top of the null_blk driver (because my goal was to measure the overhead of processing requests in a separate thread). The relative overhead while doing nothing else. While zooming way down into micro benchmarks is fun and all, testing on an fs on brd might be more representative and so more compelling. The measurements on an fs on brd are even more outrageous (the same fio script I posted a few messages above):
1) Baseline, no loopback device involved:
fio on /dev/ram0: 467K iops
fio on ext4 over /dev/ram0: 378K iops
2) Loopback device from 3.17.0-rc1 with Ming's patches (v1) applied:
fio on /dev/loop0 over /dev/ram0: 10K iops
fio on ext4 over /dev/loop0 over /dev/ram0: 9K iops
3) The same as above, but avoiding the extra context switch (call loop_queue_work() directly from loop_queue_rq()):
fio on /dev/loop0 over /dev/ram0: 267K iops
fio on ext4 over /dev/loop0 over /dev/ram0: 223K iops
The problem is not about huge relative overhead while doing nothing else. It's rather about introducing extra latency (~100 microseconds on the commodity h/w I used) which might be noticeable on modern SSDs (and h/w RAIDs with caching). Thanks, Maxim
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 08/27/2014 08:29 PM, Benjamin LaHaise wrote: On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote: ... 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) might first try that non-blocking submit as a fast path and, only if it fails, fall back to queueing. What filesystem is the backing file for loop0 on? O_DIRECT access as Ming's patches use should be non-blocking, and if not, that's something to fix. I used loop0 directly on top of the null_blk driver (because my goal was to measure the overhead of processing requests in a separate thread). In the case of real-life filesystems, e.g. ext4, aio_kernel_submit() may easily block on something like bh_submit_read(), when the fs reads file metadata to calculate the offset on the block device from the position in the file. Thanks, Maxim
Re: [PATCH v1 5/9] block: loop: convert to blk-mq
On 08/21/2014 09:44 AM, Ming Lei wrote: On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe wrote: Reworked a bit more: http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6 One big problem of the commit is that it is basically a serialized workqueue because of the single &hctx->run_work, and a per-req work_struct has to be used for a concurrent implementation. So it looks like the approach isn't flexible enough compared with doing that in the driver; any idea about how to fix that? I'm interested in the overall price of handling requests in a separate thread. I used the following fio script:
[global]
direct=1
bsrange=512-512
timeout=10
numjobs=1
ioengine=sync
filename=/dev/loop0 # or /dev/nullb0
[f1]
rw=randwrite
to compare the performance of: 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops 2) the same as above, but call loop_queue_work() directly from loop_queue_rq() -- 270K iops 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops Taking into account such a big difference (11K vs. 270K), would it be worthwhile to implement a pure non-blocking version of aio_kernel_submit() that returns an error if blocking is needed? Then the loop driver (or any other in-kernel user) might first try that non-blocking submit as a fast path and, only if it fails, fall back to queueing. Thanks, Maxim
[PATCH 3/6] fuse: wait for end of IO on release (v2)
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put cannot be the last holders of the fuse file anymore. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. Changed in v2 (thanks to Miklos): - removed redundant locking around __fuse_file_put() Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 71 1 file changed, 66 insertions(+), 5 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 7723b3f..8713e62 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -149,6 +149,17 @@ } } +/* + * Asynchronous callbacks may use it instead of fuse_file_put() because + * we guarantee that they are never last holders of ff. Hitting BUG() below + * will make clear any violation of the guarantee.
+ */ +static void __fuse_file_put(struct fuse_file *ff) +{ + if (atomic_dec_and_test(&ff->count)) + BUG(); +} + int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, bool isdir) { @@ -279,6 +290,11 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) req->in.args[0].value = inarg; } +static bool must_release_synchronously(struct fuse_file *ff) +{ + return ff->open_flags & FOPEN_SYNC_RELEASE; +} + void fuse_release_common(struct file *file, int opcode) { struct fuse_file *ff; @@ -302,6 +318,13 @@ void fuse_release_common(struct file *file, int opcode) req->misc.release.path = file->f_path; /* +* No more in-flight asynchronous READ or WRITE requests if +* fuse file release is synchronous +*/ + if (must_release_synchronously(ff)) + BUG_ON(atomic_read(&ff->count) != 1); + + /* * Normally this will send the RELEASE request, however if * some asynchronous READ or WRITE requests are outstanding, * the sending will be delayed. @@ -321,11 +344,34 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_file *ff = file->private_data; /* see fuse_vma_close() for !writeback_cache case */ if (fc->writeback_cache) write_inode_now(inode, 1); + if (must_release_synchronously(ff)) { + struct fuse_inode *fi = get_fuse_inode(inode); + + /* +* Must remove file from write list. Otherwise it is possible +* this file will get more writeback from another files +* rerouted via write_files. +*/ + spin_lock(&ff->fc->lock); + list_del_init(&ff->write_entry); + spin_unlock(&ff->fc->lock); + + wait_event(fi->page_waitq, atomic_read(&ff->count) == 1); + + /* +* spin_unlock_wait(&ff->fc->lock) would be natural here to +* wait for threads just released ff to leave their critical +* sections. But taking spinlock is the first thing +* fuse_release_common does, so that this is unnecessary. 
+*/ + } + fuse_release_common(file, FUSE_RELEASE); /* return value is ignored by VFS */ @@ -823,8 +869,15 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) unlock_page(page); page_cache_release(page); } - if (req->ff) - fuse_file_put(req->ff, false); + if (req->ff) { + if (must_release_synchronously(req->ff)) { + struct fuse_inode *fi = get_fuse_inode(req->inode); + + __fuse_file_put(req->ff); + wake_up(&fi->page_waitq); + } else + fuse_file_put(req->ff, false); + } } struct fuse_fill_data { @@ -851,6 +904,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data) if (fc->async_read) { req->ff = fuse_file_get(ff); req->end = fuse_readpages_end; + req->inode = data->inode; fuse_request_send_background(fc, req); } else { fuse_request_send(fc, req); @@ -1502,7 +1556,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) for (i = 0; i < req->
Re: [PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
On 08/22/2014 06:08 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:09 PM, Maxim Patlasov wrote: If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible. Otherwise request could be interrupted, but file association in user space remains. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |4 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cd55488..b92143a 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path); Some thought needs to go into this: if RELEASE is interrupted, then we should possibly allow that, effectively backgrounding the request. The synchronous nature is just an optimization and we really don't know where we are being interrupted, possibly in a place which very much *should* allow interruption. A fuse daemon who explicitly enables the feature (synchronous release) would definitely want non-interruptible behaviour of last fput. Otherwise, it would face the same problem that the feature tries to resolve: an application was killed and exited, but there is no way to determine why actual processing of RELEASE will be completed. As for fuseblk mounts, I'm not so sure. I believed the lack of force=1 was a bug and my patch fixes it. If you think it's safer to preserve old behaviour, I could set "force" conditionally. May be you could explain in more details why you think we should allow interruption somewhere. Any examples or use cases? Btw, fuse_flush also uses force=1. Do you concerns deal with it as well? 
Also fuse really should distinguish fatal and non-fatal interruptions and handle them accordingly... Do you think it's worth elaborating on this in the scope of the "synchronous release" feature? Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/6] fuse: enable close_wait synchronous release
On 08/22/2014 06:04 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:09 PM, Maxim Patlasov wrote: The patch enables the feature by passing 'true' to fuse_file_put in fuse_release_common. Previously, this was safe only in special cases when we were sure that a multi-threaded userspace wouldn't deadlock if we synchronously sent FUSE_RELEASE in the context of a read-ahead or write-back callback. Now, it's always safe because callbacks don't send requests to userspace anymore. But we do want to make this privileged, as there are unlikely but possible DoS scenarios with a sync release. The latest patch of the set implements the DISABLE_SYNC_RELEASE mount option. We can instrument fusermount to use the option by default for unprivileged mounts (allowing the system administrator to configure it like "user_allow_other"). Do you have a better way to implement DoS protection in mind? Thanks, Maxim
Re: [PATCH 3/6] fuse: wait for end of IO on release
On 08/22/2014 06:00 PM, Miklos Szeredi wrote: On Thu, Aug 21, 2014 at 6:08 PM, Maxim Patlasov wrote: There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put can no longer be the last holders of the fuse file. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. 1) spinlock around __fuse_file_put() is unnecessary, wake_up/wait_event will provide the necessary synchronization. Yes, I agree. 2) can't we always wait for I/O and just make the actual RELEASE message sync or async based on the flag? 3) and can't we merge the fuseblk case into this as well? I think this is doable, but the same argument that Anand made for doing sync release selectively: In a real world scenario you may want to perform synchronous release selectively. As such, performance over lots of small files is generally slow because of context switches, and a synchronous release adds an extra switch. is applicable here: if an application opened a file read-only and read a block initiating read-ahead, it's not obvious why the app doing close(2) should always wait for the end of that read-ahead. For some filesystems it's desirable, for others it is not. Let the fuse daemon decide which behaviour is preferable. Thanks, Maxim
[PATCH 6/6] fuse: add mount option to disable synchronous release
Synchronous release ensures that the kernel reports exactly the last fput() to userspace. However, nothing guarantees that the last fput() will happen in the context of close(2). So userspace applications must be prepared for the situation when close(2) has returned, but processing of FUSE_RELEASE has not yet been completed by the fuse daemon. To make testing such use cases easier, the patch introduces the DISABLE_SYNC_RELEASE mount option which effectively masks out the FOPEN_SYNC_RELEASE flag. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |3 ++- fs/fuse/fuse_i.h |3 +++ fs/fuse/inode.c |8 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b92143a..ad3d357 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -296,7 +296,8 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) static bool must_release_synchronously(struct fuse_file *ff) { - return ff->open_flags & FOPEN_SYNC_RELEASE; + return ff->open_flags & FOPEN_SYNC_RELEASE && + !(ff->fc->flags & FUSE_DISABLE_SYNC_RELEASE); } void fuse_release_common(struct file *file, int opcode) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index e8e47a6..c5e2fca 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -44,6 +44,9 @@ doing the mount will be allowed to access the filesystem */ #define FUSE_ALLOW_OTHER (1 << 1) +/** Disable synchronous release */ +#define FUSE_DISABLE_SYNC_RELEASE (1 << 2) + /** Number of page pointers embedded in fuse_req */ #define FUSE_REQ_INLINE_PAGES 1 diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 03246cd..86d47d0 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -463,6 +463,7 @@ enum { OPT_ALLOW_OTHER, OPT_MAX_READ, OPT_BLKSIZE, + OPT_DISABLE_SYNC_RELEASE, OPT_ERR }; @@ -475,6 +476,7 @@ static const match_table_t tokens = { {OPT_ALLOW_OTHER, "allow_other"}, {OPT_MAX_READ, "max_read=%u"}, {OPT_BLKSIZE, "blksize=%u"}, + {OPT_DISABLE_SYNC_RELEASE, "disable_sync_release"}, {OPT_ERR, NULL} }; @@ -560,6 +562,10 @@ static int
parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) d->blksize = value; break; + case OPT_DISABLE_SYNC_RELEASE: + d->flags |= FUSE_DISABLE_SYNC_RELEASE; + break; + default: return 0; } @@ -583,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) seq_puts(m, ",default_permissions"); if (fc->flags & FUSE_ALLOW_OTHER) seq_puts(m, ",allow_other"); + if (fc->flags & FUSE_DISABLE_SYNC_RELEASE) + seq_puts(m, ",disable_sync_release"); if (fc->max_read != ~0) seq_printf(m, ",max_read=%u", fc->max_read); if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
[PATCH 5/6] fuse: fix synchronous case of fuse_file_put()
If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible. Otherwise the request could be interrupted, but the file association in user space would remain. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |4 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cd55488..b92143a 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path);
[PATCH 2/6] fuse: cosmetic rework of fuse_send_readpages
The patch changes the arguments of fuse_send_readpages to give it access to the inode (it will be used in the next patch of the set). The change is cosmetic; no logic is changed. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 912061a..7723b3f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -827,8 +827,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) fuse_file_put(req->ff, false); } -static void fuse_send_readpages(struct fuse_req *req, struct file *file) +struct fuse_fill_data { + struct fuse_req *req; + struct file *file; + struct inode *inode; + unsigned nr_pages; +}; + +static void fuse_send_readpages(struct fuse_fill_data *data) { + struct fuse_req *req = data->req; + struct file *file = data->file; struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; loff_t pos = page_offset(req->pages[0]); @@ -850,13 +859,6 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file) } } -struct fuse_fill_data { - struct fuse_req *req; - struct file *file; - struct inode *inode; - unsigned nr_pages; -}; - static int fuse_readpages_fill(void *_data, struct page *page) { struct fuse_fill_data *data = _data; @@ -872,7 +874,7 @@ static int fuse_readpages_fill(void *_data, struct page *page) req->pages[req->num_pages - 1]->index + 1 != page->index)) { int nr_alloc = min_t(unsigned, data->nr_pages, FUSE_MAX_PAGES_PER_REQ); - fuse_send_readpages(req, data->file); + fuse_send_readpages(data); if (fc->async_read) req = fuse_get_req_for_background(fc, nr_alloc); else @@ -925,7 +927,7 @@ static int fuse_readpages(struct file *file, struct address_space *mapping, err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); if (!err) { if (data.req->num_pages) - fuse_send_readpages(data.req, file); + fuse_send_readpages(&data); else fuse_put_request(fc, data.req); }
[PATCH 4/6] fuse: enable close_wait synchronous release
The patch enables the feature by passing 'true' to fuse_file_put in fuse_release_common. Previously, this was safe only in special cases when we were sure that a multi-threaded userspace wouldn't deadlock if we synchronously sent FUSE_RELEASE in the context of a read-ahead or write-back callback. Now, it's always safe because callbacks don't send requests to userspace anymore. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 73bce1b..cd55488 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -333,7 +333,8 @@ void fuse_release_common(struct file *file, int opcode) * synchronous RELEASE is allowed (and desirable) in this case * because the server can be trusted not to screw up. */ - fuse_file_put(ff, ff->fc->destroy_req != NULL); + fuse_file_put(ff, ff->fc->destroy_req != NULL || + must_release_synchronously(ff)); } static int fuse_open(struct inode *inode, struct file *file)
[PATCH 3/6] fuse: wait for end of IO on release
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. Since fuse_release() waits for the end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put can no longer be the last holders of the fuse file. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 76 1 file changed, 71 insertions(+), 5 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 7723b3f..73bce1b 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -149,6 +149,17 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) } } +/* + * Asynchronous callbacks may use it instead of fuse_file_put() because + * we guarantee that they are never last holders of ff. Hitting BUG() below + * will make clear any violation of the guarantee.
+ */ +static void __fuse_file_put(struct fuse_file *ff) +{ + if (atomic_dec_and_test(&ff->count)) + BUG(); +} + int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, bool isdir) { @@ -279,6 +290,11 @@ static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) req->in.args[0].value = inarg; } +static bool must_release_synchronously(struct fuse_file *ff) +{ + return ff->open_flags & FOPEN_SYNC_RELEASE; +} + void fuse_release_common(struct file *file, int opcode) { struct fuse_file *ff; @@ -302,6 +318,13 @@ void fuse_release_common(struct file *file, int opcode) req->misc.release.path = file->f_path; /* +* No more in-flight asynchronous READ or WRITE requests if +* fuse file release is synchronous +*/ + if (must_release_synchronously(ff)) + BUG_ON(atomic_read(&ff->count) != 1); + + /* * Normally this will send the RELEASE request, however if * some asynchronous READ or WRITE requests are outstanding, * the sending will be delayed. @@ -321,11 +344,34 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_file *ff = file->private_data; /* see fuse_vma_close() for !writeback_cache case */ if (fc->writeback_cache) write_inode_now(inode, 1); + if (must_release_synchronously(ff)) { + struct fuse_inode *fi = get_fuse_inode(inode); + + /* +* Must remove file from write list. Otherwise it is possible +* this file will get more writeback from another files +* rerouted via write_files. +*/ + spin_lock(&ff->fc->lock); + list_del_init(&ff->write_entry); + spin_unlock(&ff->fc->lock); + + wait_event(fi->page_waitq, atomic_read(&ff->count) == 1); + + /* +* spin_unlock_wait(&ff->fc->lock) would be natural here to +* wait for threads just released ff to leave their critical +* sections. But taking spinlock is the first thing +* fuse_release_common does, so that this is unnecessary. 
+*/ + } + fuse_release_common(file, FUSE_RELEASE); /* return value is ignored by VFS */ @@ -823,8 +869,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) unlock_page(page); page_cache_release(page); } - if (req->ff) - fuse_file_put(req->ff, false); + if (req->ff) { + if (must_release_synchronously(req->ff)) { + struct fuse_inode *fi = get_fuse_inode(req->inode); + + spin_lock(&fc->lock); + __fuse_file_put(req->ff); + wake_up(&fi->page_waitq); + spin_unlock(&fc->lock); + } else + fuse_file_put(req->ff, false); + } } struct fuse_fill_data { @@ -851,6 +906,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data) if (fc->async_read) { req->ff = fuse_file_get(ff); req->end = fuse_readpages_end; + req->inode = data->inode; fuse_request_send_background(fc, req); } else { fuse_request_send(fc, req); @@ -1502,7 +1558,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) for (i = 0;
[PATCH 1/6] fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags
The feature will be governed by the fuse file open flag FOPEN_SYNC_RELEASE. Userspace can enable it on a per-file basis in the same way as FOPEN_KEEP_CACHE or FOPEN_DIRECT_IO. Signed-off-by: Maxim Patlasov --- include/uapi/linux/fuse.h |3 +++ 1 file changed, 3 insertions(+) diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 25084a0..607c45c 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -205,10 +205,13 @@ struct fuse_file_lock { * FOPEN_DIRECT_IO: bypass page cache for this open file * FOPEN_KEEP_CACHE: don't invalidate the data cache on open * FOPEN_NONSEEKABLE: the file is not seekable + * FOPEN_SYNC_RELEASE: synchronously release file on last fput, + * which, in turn, is not always bound to close(2)! */ #define FOPEN_DIRECT_IO (1 << 0) #define FOPEN_KEEP_CACHE (1 << 1) #define FOPEN_NONSEEKABLE (1 << 2) +#define FOPEN_SYNC_RELEASE (1 << 3) /** * INIT request/reply flags
[PATCH 0/6] fuse: handle release synchronously (v3)
Hi, There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693 A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423 In short, the problem is that fuse_release (usually called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap when the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file. The patch-set resolves the problem by making fuse_release synchronous: wait for the ACK from userspace for FUSE_RELEASE if the feature is ON. It must be emphasized that even if the feature is enabled (i.e. fuse_release is synchronous), nothing guarantees that fuse_release() will be called in the context of close(2). In fact, it may happen later, on the last fput(). However, there are still a lot of use cases when close(2) is synchronous, so the feature must be regarded as an optimization maximizing the chances of synchronous behaviour of close(2). To keep single-threaded userspace implementations happy the patch-set ensures that by the time fuse_release_common calls fuse_file_put, no more in-flight I/O exists. Asynchronous fuse callbacks (like fuse_readpages_end) cannot trigger FUSE_RELEASE anymore. Hence, we'll never block in contexts other than close(). Changed in v2: - improved comments, commented spin_unlock_wait out according to Brian's suggestions. - rebased on v3.15-rc8 tag of Linus' tree.
Changed in v3: - changed patch-set title from "close file synchronously" to "handle release synchronously" - renamed CLOSE_WAIT to SYNC_RELEASE to convey the essence of the feature ("synchronous release" instead of "wait on close") - enabled the feature on a per-file basis (instead of a global fuse_conn parameter) - added "disable_sync_release" mount option - rebased on v3.17-rc1 tag of Linus' tree. Thanks, Maxim --- Maxim Patlasov (6): fuse: add FOPEN_SYNC_RELEASE flag to ff->open_flags fuse: cosmetic rework of fuse_send_readpages fuse: wait for end of IO on release fuse: enable close_wait synchronous release fuse: fix synchronous case of fuse_file_put() fuse: add mount option to disable synchronous release fs/fuse/file.c| 106 ++--- fs/fuse/fuse_i.h |3 + fs/fuse/inode.c |8 +++ include/uapi/linux/fuse.h |3 + 4 files changed, 104 insertions(+), 16 deletions(-)
Re: [PATCH] fuse: do not evict dirty inodes
Hi Miklos, On 08/13/2014 02:32 PM, Miklos Szeredi wrote: On Tue, Jun 3, 2014 at 1:49 PM, Maxim Patlasov wrote: Commit 1e18bda8 added the .write_inode method to the fuse super_operations. This allowed fuse to use the kernel infrastructure for writing out dirty metadata (mtime and ctime for now). However, given that .drop_inode was not redefined from the legacy generic_delete_inode(), on umount(2) generic_shutdown_super() led to the eviction of all inodes disregarding their state. The patch removes the .drop_inode definition from the fuse super_operations. This works because now iput_final() calls generic_drop_inode() and returns without evicting the inode. This, in turn, allows generic_shutdown_super() to write dirty inodes by calling sync_filesystem(). Signed-off-by: Maxim Patlasov --- fs/fuse/inode.c |1 - 1 file changed, 1 deletion(-) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 754dcf2..ee017be 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -791,7 +791,6 @@ static const struct super_operations fuse_super_operations = { .destroy_inode = fuse_destroy_inode, .evict_inode= fuse_evict_inode, .write_inode= fuse_write_inode, - .drop_inode = generic_delete_inode, .remount_fs = fuse_remount_fs, .put_super = fuse_put_super, .umount_begin = fuse_umount_begin, (Sorry about the late answer) Big problem with this is that I don't want to make umount(2) and sync(2) wait on a userspace filesystem. Generally this would make umount() hang if a fuse daemon was stuck for any reason. I think we must honour the interests of both privileged and unprivileged mounts. In the case of a trusted environment where only the sysadmin decides which fuse daemons are eligible to run, blocking umount() is completely fine. And more than that, after settling down the sync-close feature, I intend to take on synchronous umount (it's a shame that users have no better tool to find out when umount has completed than watching the process table for fuse daemon termination).
So we could put such a behaviour under control of a tunable parameter. But is this really necessary? We are talking about just regular files: mtime is only updated by write(2) and friends. ctime is updated by write(2) as well as some other ops. For write, we can sync the times on FLUSH (close), for other ops we could flush the ctime synchronously. E.g. unlink would trigger UNLINK and SETATTR. Long term, a much better solution would be to add a timestamp to fuse_in_header, which would remove the duplicate requests, and then we could also extend the kernel caching of timestamps from just regular files to everything, which would make the protocol conceptually simpler. I like the idea of extending fuse_in_header. But if we extend it, why not go further and add timestamps for mtime and atime as well? Pushing mtime along with data modifications is more reliable than postponing the flush to the write_inode call. Also, this would simplify the code a lot -- no more fuse_write_inode and fuse_flush_times needed. And who knows, maybe someone will request proper atime handling at some point in the future. If you're OK with: @@ -675,10 +675,15 @@ struct fuse_in_header { uint32_t opcode; uint64_t unique; uint64_t nodeid; + uint64_t atime; + uint64_t mtime; + uint64_t ctime; + uint32_t atimensec; + uint32_t mtimensec; + uint32_t ctimensec; uint32_t uid; uint32_t gid; uint32_t pid; - uint32_t padding; }; I could work on the patch. Thanks, Maxim
Re: [PATCH 0/5] fuse: close file synchronously (v2)
On 08/13/2014 04:44 PM, Miklos Szeredi wrote: On Fri, Jun 6, 2014 at 3:27 PM, Maxim Patlasov wrote: Hi, There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693 A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423 In short, the problem is that fuse_release (called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap when the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file. Tying RELEASE to close(2) is not going to work. Look at all the places that call fput() (or fdput() in recent kernels), those are all potential triggers for RELEASE, some realistic, some not quite, but all are certainly places that a synchronous release could block *instead* of close. Which just means that close will still be asynchronous with release some of the time. So it's not clear to me what is to be gained from this patchset. The patch-set doesn't tie RELEASE to close(2), it ensures that we report to user space exactly the last fput(). That's correct because this is exactly the moment when any file system with sharing mode connected to open/close must drop the sharing mode. This is the case even for some local filesystems, for example, ntfs-3g. Could you please look closely at your commit 5a18ec176c934ca1bc9dc61580a5e0e90a9b5733. It actually implemented two different things: 1) synchronous release and 2) delayed path_put.
The latter was well explained by the comment: > /* > * If this is a fuseblk mount, then it's possible that > * releasing the path will result in releasing the > * super block and sending the DESTROY request. If > * the server is single threaded, this would hang. > * For this reason do the path_put() in a separate > * thread. > */ So it's clear why the delay is needed and why it's bound to the fuseblk condition. But synchronous close was made under the same condition, which is obviously wrong. I understand why you made that decision in 2011: otherwise, we could block in the wrong context (the last decrement of ff->count might happen in the scope of read-ahead or mmap-ed writeback). But now, with the approach implemented in this patch-set, this is impossible -- we wait for completion of all async operations before triggering synchronous release. Thus the patch-set unties functionality that already existed (synchronous release) from the wrong condition (fuseblk mount), putting it under well-defined control (FUSE_CLOSE_WAIT). Thanks, Maxim
Re: [PATCH] fuse: avoid scheduling while atomic
On 07/04/2014 08:27 PM, Miklos Szeredi wrote: Maxim, thanks for the patch. Here's a reworked one. While it looks a bit more complicated, its chances of being (and remaining) correct are, I think, better, since the region surrounded by kmap/kunmap_atomic is trivially non-sleeping. This patch also removes more lines than it adds, as an added bonus. Please review and test if possible. The patch looks fine and works well in my tests. Maxim Thanks, Miklos --- fs/fuse/dev.c | 51 +++ 1 file changed, 23 insertions(+), 28 deletions(-) --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -643,9 +643,8 @@ struct fuse_copy_state { unsigned long seglen; unsigned long addr; struct page *pg; - void *mapaddr; - void *buf; unsigned len; + unsigned offset; unsigned move_pages:1; }; @@ -666,23 +665,17 @@ static void fuse_copy_finish(struct fuse if (cs->currbuf) { struct pipe_buffer *buf = cs->currbuf; - if (!cs->write) { - kunmap_atomic(cs->mapaddr); - } else { - kunmap_atomic(cs->mapaddr); + if (cs->write) buf->len = PAGE_SIZE - cs->len; - } cs->currbuf = NULL; - cs->mapaddr = NULL; - } else if (cs->mapaddr) { - kunmap_atomic(cs->mapaddr); + } else if (cs->pg) { if (cs->write) { flush_dcache_page(cs->pg); set_page_dirty_lock(cs->pg); } put_page(cs->pg); - cs->mapaddr = NULL; } + cs->pg = NULL; } /* @@ -691,7 +684,7 @@ static void fuse_copy_finish(struct fuse */ static int fuse_copy_fill(struct fuse_copy_state *cs) { - unsigned long offset; + struct page *page; int err; unlock_request(cs->fc, cs->req); @@ -706,14 +699,12 @@ static int fuse_copy_fill(struct fuse_co BUG_ON(!cs->nr_segs); cs->currbuf = buf; - cs->mapaddr = kmap_atomic(buf->page); + cs->pg = buf->page; + cs->offset = buf->offset; cs->len = buf->len; - cs->buf = cs->mapaddr + buf->offset; cs->pipebufs++; cs->nr_segs--; } else { - struct page *page; - if (cs->nr_segs == cs->pipe->buffers) return -EIO; @@ -726,8 +717,8 @@ static int fuse_copy_fill(struct fuse_co buf->len = 0; cs->currbuf = buf; - cs->mapaddr = kmap_atomic(page); - 
cs->buf = cs->mapaddr; + cs->pg = page; + cs->offset = 0; cs->len = PAGE_SIZE; cs->pipebufs++; cs->nr_segs++; @@ -740,14 +731,13 @@ static int fuse_copy_fill(struct fuse_co cs->iov++; cs->nr_segs--; } - err = get_user_pages_fast(cs->addr, 1, cs->write, &cs->pg); + err = get_user_pages_fast(cs->addr, 1, cs->write, &page); if (err < 0) return err; BUG_ON(err != 1); - offset = cs->addr % PAGE_SIZE; - cs->mapaddr = kmap_atomic(cs->pg); - cs->buf = cs->mapaddr + offset; - cs->len = min(PAGE_SIZE - offset, cs->seglen); + cs->pg = page; + cs->offset = cs->addr % PAGE_SIZE; + cs->len = min(PAGE_SIZE - cs->offset, cs->seglen); cs->seglen -= cs->len; cs->addr += cs->len; } @@ -760,15 +750,20 @@ static int fuse_copy_do(struct fuse_copy { unsigned ncpy = min(*size, cs->len); if (val) { + void *pgaddr = kmap_atomic(cs->pg); + void *buf = pgaddr + cs->offset; + if (cs->write) - memcpy(cs->buf, *val, ncpy); + memcpy(buf, *val, ncpy); else - memcpy(*val, cs->buf, ncpy); + memcpy(*val, buf, ncpy); + + kunmap_atomic(pgaddr); *val += ncpy; } *size -= ncpy; cs->len -= ncpy; - cs->buf += ncpy; + cs->offset += ncpy; return ncpy; } @@ -874,8 +869,8 @@ static int fuse_try_move_page(struct fus out_fallback_unlock: unlock_page(newpage); out_fallback: - cs->mapaddr = kmap_atomic(buf->page); - cs->buf = cs->mapaddr + buf->offset; + cs->pg = buf->page; + cs->offset = buf->offset; err = lock_request(cs->fc, cs->req); if (err) -- To unsubscribe f
Re: [PATCH] mm/page-writeback.c: fix divide by zero in bdi_dirty_limits
Hi Andrew, On 07/12/2014 02:27 AM, Andrew Morton wrote: On Fri, 11 Jul 2014 12:18:27 +0400 Maxim Patlasov wrote: Under memory pressure, it is possible for dirty_thresh, calculated by global_dirty_limits() in balance_dirty_pages(), to equal zero. Under what circumstances? Really small values of vm_dirty_bytes? No, I used default settings: vm_dirty_bytes = 0; dirty_background_bytes = 0; vm_dirty_ratio = 20; dirty_background_ratio = 10; and a simple program like main() { while(1) { p = malloc(4096); mlock(p, 4096); } }. Of course, this triggers oom eventually, but immediately before oom, the system is under hard memory pressure. Then, if strictlimit is true, bdi_dirty_limits() tries to resolve the proportion: bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh by dividing by zero. ... --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1306,9 +1306,9 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); if (bdi_bg_thresh) - *bdi_bg_thresh = div_u64((u64)*bdi_thresh * -background_thresh, -dirty_thresh); + *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh * + background_thresh, + dirty_thresh) : 0; This introduces a peculiar discontinuity: if dirty_thresh==3, treat it as 3 if dirty_thresh==2, treat it as 2 if dirty_thresh==1, treat it as 1 if dirty_thresh==0, treat it as infinity No, the patch doesn't treat dirty_thresh==0 as infinity. In fact, in that case we have the equation: x : 0 = 0 : 0, and the patch resolves it as x=0. Here is the reasoning: 1. A bdi counter is always a fraction of the global one. Hence bdi_thresh is never greater than dirty_thresh. Since dirty_thresh is equal to zero, bdi_thresh is equal to zero too. 2. bdi_bg_thresh must not be greater than bdi_thresh because we want to start background writeback before throttling the process. Since bdi_thresh is equal to zero, bdi_bg_thresh must be zero too.
Would it not make more sense to change global_dirty_limits() to convert 0 to 1? With an appropriate comment, obviously. Or maybe the fix lies elsewhere. Please do tell us how this zero comes about. First let me explain where available_memory equal to one came from. global_dirty_limits() calculates it by calling global_dirtyable_memory(). The latter takes into consideration three global counters and a global reserve. In my case the corresponding values were: NR_INACTIVE_FILE = 0 NR_ACTIVE_FILE = 0 NR_FREE_PAGES = 7006 dirty_balance_reserve = 7959. Consequently, "x" in global_dirtyable_memory() was equal to zero, and the function returned one. Now global_dirty_limits() assigns available_memory to one and calculates "dirty" as a fraction of available_memory: dirty = (vm_dirty_ratio * available_memory) / 100; Since vm_dirty_ratio is less than 100 (it is 20 by default), dirty is calculated as zero. As for your question about converting 0 to 1, I think that bdi_thresh = dirty_thresh = 0 makes natural sense: we are under strong memory pressure, please always start background writeback and throttle the process (even if the actual number of dirty pages is low). So the other parts of the balance_dirty_pages machinery must handle zero thresholds properly. Thanks, Maxim
[PATCH] mm/page-writeback.c: fix divide by zero in bdi_dirty_limits
Under memory pressure, it is possible for dirty_thresh, calculated by global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if strictlimit is true, bdi_dirty_limits() tries to resolve the proportion: bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh by dividing by zero. Signed-off-by: Maxim Patlasov --- mm/page-writeback.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 518e2c3..e0c9430 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1306,9 +1306,9 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); if (bdi_bg_thresh) - *bdi_bg_thresh = div_u64((u64)*bdi_thresh * -background_thresh, -dirty_thresh); + *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh * + background_thresh, + dirty_thresh) : 0; /* * In order to avoid the stacked BDI deadlock we need -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] fuse: release temporary page if fuse_writepage_locked() failed
The temporary page must be freed if fuse_write_file_get() returns NULL. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 3a47aa2..a18baa4 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1679,7 +1679,7 @@ static int fuse_writepage_locked(struct page *page) error = -EIO; req->ff = fuse_write_file_get(fc, fi); if (!req->ff) - goto err_free; + goto err_nofile; fuse_write_fill(req, req->ff, page_offset(page), 0); @@ -1707,6 +1707,8 @@ static int fuse_writepage_locked(struct page *page) return 0; +err_nofile: + __free_page(tmp_page); err_free: fuse_request_free(req); err:
[PATCH] fuse: remove WARN_ON writeback on dead inode
FUSE has an architectural peculiarity: it cannot write to an inode, unless it has an associated file open for write. Since the beginning (Apr 30 2008, commit 3be5a52b), FUSE BUG_ON (later WARN_ON) the case when it has to process writeback, but all associated files are closed. The latest relevant commit is 72523425: > Don't bug if there's no writable files found for page writeback. If ever > this is triggered, a WARN_ON helps debugging it much better then a BUG_ON. > > Signed-off-by: Miklos Szeredi But that situation can happen in quite a legal way: for example, let's mmap a file, then issue O_DIRECT read to mmapped region and immediately close file and munmap. O_DIRECT will pin some pages, execute IO to them and then mark pages dirty. Here we are. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 14 +++--- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 6e16dad..3a47aa2 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1624,8 +1624,8 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req) fuse_writepage_free(fc, req); } -static struct fuse_file *__fuse_write_file_get(struct fuse_conn *fc, - struct fuse_inode *fi) +static struct fuse_file *fuse_write_file_get(struct fuse_conn *fc, +struct fuse_inode *fi) { struct fuse_file *ff = NULL; @@ -1640,14 +1640,6 @@ static struct fuse_file *__fuse_write_file_get(struct fuse_conn *fc, return ff; } -static struct fuse_file *fuse_write_file_get(struct fuse_conn *fc, -struct fuse_inode *fi) -{ - struct fuse_file *ff = __fuse_write_file_get(fc, fi); - WARN_ON(!ff); - return ff; -} - int fuse_write_inode(struct inode *inode, struct writeback_control *wbc) { struct fuse_conn *fc = get_fuse_conn(inode); @@ -1655,7 +1647,7 @@ int fuse_write_inode(struct inode *inode, struct writeback_control *wbc) struct fuse_file *ff; int err; - ff = __fuse_write_file_get(fc, fi); + ff = fuse_write_file_get(fc, fi); err = fuse_flush_times(inode, ff); if (ff) 
fuse_file_put(ff, 0);
[PATCH] fuse: avoid scheduling while atomic
As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry() triggers complains about scheduling while atomic: > Jun 23 11:53:24 localhost kernel: BUG: scheduling while atomic: fuse.hf/13976/0x1001 This happens because fuse_notify_inval_entry() attempts to allocate memory with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic(). The patch fixes the problem for fuse_notify_inval_entry() and other notifiers by unmapping fuse_copy_state before allocations and remapping it again afterwards. Reported-by: Richard Sharpe Signed-off-by: Maxim Patlasov --- fs/fuse/dev.c | 24 1 file changed, 24 insertions(+) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 098f97b..502aa1d 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -755,6 +755,22 @@ static int fuse_copy_fill(struct fuse_copy_state *cs) return lock_request(cs->fc, cs->req); } +static void fuse_copy_unmap(struct fuse_copy_state *cs) +{ + cs->buf = (void *)(cs->buf - cs->mapaddr); + kunmap_atomic(cs->mapaddr); +} + +static void fuse_copy_remap(struct fuse_copy_state *cs) +{ + if (cs->currbuf) + cs->mapaddr = kmap_atomic(cs->currbuf->page); + else + cs->mapaddr = kmap_atomic(cs->pg); + + cs->buf = (unsigned long)cs->buf + cs->mapaddr; +} + /* Do as much copy to/from userspace buffer as we can */ static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size) { @@ -1431,7 +1447,9 @@ static int fuse_notify_inval_entry(struct fuse_conn *fc, unsigned int size, char *buf; struct qstr name; + fuse_copy_unmap(cs); buf = kzalloc(FUSE_NAME_MAX + 1, GFP_KERNEL); + fuse_copy_remap(cs); if (!buf) goto err; @@ -1482,7 +1500,9 @@ static int fuse_notify_delete(struct fuse_conn *fc, unsigned int size, char *buf; struct qstr name; + fuse_copy_unmap(cs); buf = kzalloc(FUSE_NAME_MAX + 1, GFP_KERNEL); + fuse_copy_remap(cs); if (!buf) goto err; @@ -1554,6 +1574,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size, nodeid = outarg.nodeid; + fuse_copy_unmap(cs); 
down_read(&fc->killsb); err = -ENOENT; @@ -1586,7 +1607,9 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size, goto out_iput; this_num = min_t(unsigned, num, PAGE_CACHE_SIZE - offset); + fuse_copy_remap(cs); err = fuse_copy_page(cs, &page, offset, this_num, 0); + fuse_copy_unmap(cs); if (!err && offset == 0 && (this_num == PAGE_CACHE_SIZE || file_size == end)) SetPageUptodate(page); @@ -1607,6 +1630,7 @@ out_iput: iput(inode); out_up_killsb: up_read(&fc->killsb); + fuse_copy_remap(cs); out_finish: fuse_copy_finish(cs); return err; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [fuse-devel] [PATCH 0/5] fuse: close file synchronously (v2)
On 06/09/2014 03:11 PM, John Muir wrote: On 2014.06.09, at 12:46 , Maxim Patlasov wrote: On 06/09/2014 01:26 PM, John Muir wrote: On 2014.06.09, at 9:50 , Maxim Patlasov wrote: On 06/06/2014 05:51 PM, John Muir wrote: On 2014.06.06, at 15:27 , Maxim Patlasov wrote: The patch-set resolves the problem by making fuse_release synchronous: wait for ACK from userspace for FUSE_RELEASE if the feature is ON. Why not make this feature per-file with a new flag bit in struct fuse_file_info rather than as a file-system global? I don't expect a great demand for such a granularity. File-system global "close_wait" conveys a general user expectation about filesystem behaviour in distributed environment: if you stopped using a file on given node, whether it means that the file is immediately accessible from another node. By user do you mean the end-user, or the implementor of the file-system? It seems to me that the end-user doesn't care, and just wants the file-system to work as expected. I don't think we're really talking about the end-user. No, this is exactly about end-user expectations. Imagine a complicated heavy-loaded shared storage where handling FUSE_RELEASE in userspace may take a few minutes. In close_wait=0 case, an end-user who has just called close(2) has no idea when it's safe to access the file from another node or even when it's OK to umount filesystem. I think we're saying the same thing here from different perspectives. The end-user wants the file-system to operate with the semantics you describe, but I don't think it makes sense to give the end-user control over those semantics. The file-system itself should be implemented that way, or not, or per-file If it's a read-only file, then does this not add the overhead of having the kernel wait for the user-space file-system process to respond before closing it. In my experience, there is actually significant cost to the kernel to user-space messaging in FUSE when manipulating thousands of files. 
The implementor of a file-system, on the other hand, might want the semantics for close_wait on some files, but not on others. Won't there be a performance impact? Some distributed file-systems might want this on specific files only. Implementing it as a flag on the struct fuse_file_info gives the flexibility to the file-system implementor. fuse_file_info is an userspace structure, in-kernel fuse knows nothing about it. In close_wait=1 case, nothing prevents a file-system implementation from ACK-ing FUSE_RELEASE request immediately (for specific files) and schedule actual handling for future processing. Of course you know I meant that you'd add another flag to both fuse_file_info, and in the kernel equivalent for those flags which is struct fuse_open_out -> open_flags. This is where other such per file options are specified such as whether or not to keep the in-kernal cache for a file, whether or not to allow direct-io, and whether or not to allow seek. Anyway, I guess you're the one doing all the work on this and if you have a particular implementation that doesn't require such fine-grained control, and no one else does then it's up to you. I'm just trying to show an alternative implementation that gives the file-system implementor more control while keeping the ability to meet user expectations. Thank you, John. That's really depends on whether someone else wants fine-grained control or not. I'm generally OK to re-work the patch-set if more requesters emerge. Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [fuse-devel] [PATCH 0/5] fuse: close file synchronously (v2)
On 06/09/2014 01:26 PM, John Muir wrote: On 2014.06.09, at 9:50 , Maxim Patlasov wrote: On 06/06/2014 05:51 PM, John Muir wrote: On 2014.06.06, at 15:27 , Maxim Patlasov wrote: The patch-set resolves the problem by making fuse_release synchronous: wait for ACK from userspace for FUSE_RELEASE if the feature is ON. Why not make this feature per-file with a new flag bit in struct fuse_file_info rather than as a file-system global? I don't expect a great demand for such a granularity. File-system global "close_wait" conveys a general user expectation about filesystem behaviour in distributed environment: if you stopped using a file on given node, whether it means that the file is immediately accessible from another node. By user do you mean the end-user, or the implementor of the file-system? It seems to me that the end-user doesn't care, and just wants the file-system to work as expected. I don't think we're really talking about the end-user. No, this is exactly about end-user expectations. Imagine a complicated heavy-loaded shared storage where handling FUSE_RELEASE in userspace may take a few minutes. In close_wait=0 case, an end-user who has just called close(2) has no idea when it's safe to access the file from another node or even when it's OK to umount filesystem. The implementor of a file-system, on the other hand, might want the semantics for close_wait on some files, but not on others. Won't there be a performance impact? Some distributed file-systems might want this on specific files only. Implementing it as a flag on the struct fuse_file_info gives the flexibility to the file-system implementor. fuse_file_info is an userspace structure, in-kernel fuse knows nothing about it. In close_wait=1 case, nothing prevents a file-system implementation from ACK-ing FUSE_RELEASE request immediately (for specific files) and schedule actual handling for future processing. 
Thanks, Maxim
Re: [fuse-devel] [PATCH 0/5] fuse: close file synchronously (v2)
On 06/06/2014 05:51 PM, John Muir wrote: On 2014.06.06, at 15:27 , Maxim Patlasov wrote: The patch-set resolves the problem by making fuse_release synchronous: wait for ACK from userspace for FUSE_RELEASE if the feature is ON. Why not make this feature per-file with a new flag bit in struct fuse_file_info rather than as a file-system global? I don't expect a great demand for such granularity. File-system global "close_wait" conveys a general user expectation about filesystem behaviour in a distributed environment: if you have stopped using a file on a given node, is the file immediately accessible from another node? Maxim
[PATCH 5/5] fuse: fix synchronous case of fuse_file_put()
If fuse_file_put() is called with sync==true, the user may be blocked for a while, until userspace ACKs our FUSE_RELEASE request. This blocking must be uninterruptible: otherwise the request could be interrupted, but the file association in user space would remain. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 783cb52..9f38568 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -136,6 +136,10 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) path_put(&req->misc.release.path); fuse_put_request(ff->fc, req); } else if (sync) { + /* Must force. Otherwise request could be interrupted, +* but file association in user space remains. +*/ + req->force = 1; req->background = 0; fuse_request_send(ff->fc, req); path_put(&req->misc.release.path);
[PATCH 4/5] fuse: enable close_wait feature
The patch enables the feature by passing 'true' to fuse_file_put() in fuse_release_common(). Previously, this was safe only in special cases when we were sure that a multi-threaded userspace would not deadlock if we synchronously sent FUSE_RELEASE in the context of a read-ahead or write-back callback. Now it is always safe because callbacks no longer send requests to userspace. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index d50af99..783cb52 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -328,7 +328,8 @@ void fuse_release_common(struct file *file, int opcode) * synchronous RELEASE is allowed (and desirable) in this case * because the server can be trusted not to screw up. */ - fuse_file_put(ff, ff->fc->destroy_req != NULL); + fuse_file_put(ff, ff->fc->destroy_req != NULL || + ff->fc->close_wait); } static int fuse_open(struct inode *inode, struct file *file)
[PATCH 3/5] fuse: wait for end of IO on release
There are two types of I/O activity that can be "in progress" at the time of fuse_release() execution: asynchronous read-ahead and write-back. The patch ensures that they are completed before fuse_release_common sends FUSE_RELEASE to userspace. So far as fuse_release() waits for end of async I/O, its callbacks (fuse_readpages_end and fuse_writepage_finish) calling fuse_file_put cannot be the last holders of fuse file anymore. To emphasize the fact, the patch replaces fuse_file_put with __fuse_file_put there. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 71 1 file changed, 66 insertions(+), 5 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b81a945..d50af99 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -149,6 +149,17 @@ static void fuse_file_put(struct fuse_file *ff, bool sync) } } +/* + * Asynchronous callbacks may use it instead of fuse_file_put() because + * we guarantee that they are never last holders of ff. Hitting BUG() below + * will make clear any violation of the guarantee. + */ +static void __fuse_file_put(struct fuse_file *ff) +{ + if (atomic_dec_and_test(&ff->count)) + BUG(); +} + int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, bool isdir) { @@ -302,6 +313,13 @@ void fuse_release_common(struct file *file, int opcode) req->misc.release.path = file->f_path; /* +* No more in-flight asynchronous READ or WRITE requests if +* fuse file release is synchronous +*/ + if (ff->fc->close_wait) + BUG_ON(atomic_read(&ff->count) != 1); + + /* * Normally this will send the RELEASE request, however if * some asynchronous READ or WRITE requests are outstanding, * the sending will be delayed. 
@@ -321,11 +339,34 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_file *ff = file->private_data; /* see fuse_vma_close() for !writeback_cache case */ if (fc->writeback_cache) write_inode_now(inode, 1); + if (ff->fc->close_wait) { + struct fuse_inode *fi = get_fuse_inode(inode); + + /* +* Must remove file from write list. Otherwise it is possible +* this file will get more writeback from another files +* rerouted via write_files. +*/ + spin_lock(&ff->fc->lock); + list_del_init(&ff->write_entry); + spin_unlock(&ff->fc->lock); + + wait_event(fi->page_waitq, atomic_read(&ff->count) == 1); + + /* +* spin_unlock_wait(&ff->fc->lock) would be natural here to +* wait for threads just released ff to leave their critical +* sections. But taking spinlock is the first thing +* fuse_release_common does, so that this is unnecessary. +*/ + } + fuse_release_common(file, FUSE_RELEASE); /* return value is ignored by VFS */ @@ -823,8 +864,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) unlock_page(page); page_cache_release(page); } - if (req->ff) - fuse_file_put(req->ff, false); + if (req->ff) { + if (fc->close_wait) { + struct fuse_inode *fi = get_fuse_inode(req->inode); + + spin_lock(&fc->lock); + __fuse_file_put(req->ff); + wake_up(&fi->page_waitq); + spin_unlock(&fc->lock); + } else + fuse_file_put(req->ff, false); + } } struct fuse_fill_data { @@ -851,6 +901,7 @@ static void fuse_send_readpages(struct fuse_fill_data *data) if (fc->async_read) { req->ff = fuse_file_get(ff); req->end = fuse_readpages_end; + req->inode = data->inode; fuse_request_send_background(fc, req); } else { fuse_request_send(fc, req); @@ -1537,7 +1588,7 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) for (i = 0; i < req->num_pages; i++) __free_page(req->pages[i]); - if (req->ff) + if (req->ff && 
!fc->close_wait) fuse_file_put(req->ff, false); } @@ -1554,6 +1605,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP); b
[PATCH 2/5] fuse: cosmetic rework of fuse_send_readpages
The patch change arguments of fuse_send_readpages to give it access to inode (will be used in the next patch of patch-set). The change is cosmetic, no logic changed. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 96d513e..b81a945 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -827,8 +827,17 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) fuse_file_put(req->ff, false); } -static void fuse_send_readpages(struct fuse_req *req, struct file *file) +struct fuse_fill_data { + struct fuse_req *req; + struct file *file; + struct inode *inode; + unsigned nr_pages; +}; + +static void fuse_send_readpages(struct fuse_fill_data *data) { + struct fuse_req *req = data->req; + struct file *file = data->file; struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; loff_t pos = page_offset(req->pages[0]); @@ -850,13 +859,6 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file) } } -struct fuse_fill_data { - struct fuse_req *req; - struct file *file; - struct inode *inode; - unsigned nr_pages; -}; - static int fuse_readpages_fill(void *_data, struct page *page) { struct fuse_fill_data *data = _data; @@ -872,7 +874,7 @@ static int fuse_readpages_fill(void *_data, struct page *page) req->pages[req->num_pages - 1]->index + 1 != page->index)) { int nr_alloc = min_t(unsigned, data->nr_pages, FUSE_MAX_PAGES_PER_REQ); - fuse_send_readpages(req, data->file); + fuse_send_readpages(data); if (fc->async_read) req = fuse_get_req_for_background(fc, nr_alloc); else @@ -925,7 +927,7 @@ static int fuse_readpages(struct file *file, struct address_space *mapping, err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); if (!err) { if (data.req->num_pages) - fuse_send_readpages(data.req, file); + fuse_send_readpages(&data); else fuse_put_request(fc, data.req); } -- To unsubscribe from this list: send the line 
[PATCH 0/5] fuse: close file synchronously (v2)
Hi,

There is a long-standing demand for synchronous behaviour of fuse_release: http://sourceforge.net/mailarchive/message.php?msg_id=19343889 http://sourceforge.net/mailarchive/message.php?msg_id=29814693

A year ago Avati and I explained why such a feature would be useful: http://sourceforge.net/mailarchive/message.php?msg_id=29889055 http://sourceforge.net/mailarchive/message.php?msg_id=29867423

In short, the problem is that fuse_release (which is called on the last user close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an ACK from userspace. Consequently, there is a gap when the user regards the file as released while userspace fuse is still working on it. An attempt to access the file from another node leads to complicated synchronization problems because the first node still "holds" the file.

The patch-set resolves the problem by making fuse_release synchronous: wait for an ACK from userspace for FUSE_RELEASE if the feature is ON. To keep single-threaded userspace implementations happy, the patch-set ensures that by the time fuse_release_common calls fuse_file_put, no more in-flight I/O exists. Asynchronous fuse callbacks (like fuse_readpages_end) cannot trigger FUSE_RELEASE anymore. Hence, we'll never block in contexts other than close().

Changed in v2:
- improved comments, commented spin_unlock_wait out according to Brian's suggestions.
- rebased on the v3.15-rc8 tag of Linus' tree.
Thanks, Maxim --- Maxim Patlasov (5): fuse: add close_wait flag to fuse_conn fuse: cosmetic rework of fuse_send_readpages fuse: wait for end of IO on release fuse: enable close_wait feature fuse: fix synchronous case of fuse_file_put() fs/fuse/file.c| 100 ++--- fs/fuse/fuse_i.h |3 + fs/fuse/inode.c |4 +- include/uapi/linux/fuse.h |3 + 4 files changed, 93 insertions(+), 17 deletions(-)
[PATCH 1/5] fuse: add close_wait flag to fuse_conn
The feature will be governed by fc->close_wait. Userspace can enable it in the same way as auto_inval_data or any other kernel fuse capability. Signed-off-by: Maxim Patlasov --- fs/fuse/fuse_i.h |3 +++ fs/fuse/inode.c |4 +++- include/uapi/linux/fuse.h |3 +++ 3 files changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 7aa5c75..434ff08 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -557,6 +557,9 @@ struct fuse_conn { /** Does the filesystem support asynchronous direct-IO submission? */ unsigned async_dio:1; + /** Wait for response from daemon on close */ + unsigned close_wait:1; + /** The number of requests waiting for completion */ atomic_t num_waiting; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 754dcf2..580d1b3 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -898,6 +898,8 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req) else fc->sb->s_time_gran = 10; + if (arg->flags & FUSE_CLOSE_WAIT) + fc->close_wait = 1; } else { ra_pages = fc->max_read / PAGE_CACHE_SIZE; fc->no_lock = 1; @@ -926,7 +928,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req) FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ | FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA | FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO | - FUSE_WRITEBACK_CACHE; + FUSE_WRITEBACK_CACHE | FUSE_CLOSE_WAIT; req->in.h.opcode = FUSE_INIT; req->in.numargs = 1; req->in.args[0].size = sizeof(*arg); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 40b5ca8..1e1b6fa 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -101,6 +101,7 @@ * - add FATTR_CTIME * - add ctime and ctimensec to fuse_setattr_in * - add FUSE_RENAME2 request + * - add FUSE_CLOSE_WAIT */ #ifndef _LINUX_FUSE_H @@ -229,6 +230,7 @@ struct fuse_file_lock { * FUSE_READDIRPLUS_AUTO: adaptive readdirplus * FUSE_ASYNC_DIO: asynchronous direct I/O submission * 
FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes + * FUSE_CLOSE_WAIT: wait for response from daemon on close */ #define FUSE_ASYNC_READ(1 << 0) #define FUSE_POSIX_LOCKS (1 << 1) @@ -247,6 +249,7 @@ struct fuse_file_lock { #define FUSE_READDIRPLUS_AUTO (1 << 14) #define FUSE_ASYNC_DIO (1 << 15) #define FUSE_WRITEBACK_CACHE (1 << 16) +#define FUSE_CLOSE_WAIT(1 << 17) /** * CUSE INIT request/reply flags -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] fuse: do not evict dirty inodes
Commit 1e18bda8 added .write_inode method to the fuse super_operations. This allowed fuse to use the kernel infrastructure for writing out dirty metadata (mtime and ctime for now). However, given that .drop_inode was not redefined from the legacy generic_delete_inode(), on umount(2) generic_shutdown_super() led to the eviction of all inodes disregarding their state. The patch removes .drop_inode definition from the fuse super_operations. This works because now iput_final() calls generic_drop_inode() and returns w/o evicting inode. This, in turn, allows generic_shutdown_super() to write dirty inodes by calling sync_filesystem(). Signed-off-by: Maxim Patlasov --- fs/fuse/inode.c |1 - 1 file changed, 1 deletion(-) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 754dcf2..ee017be 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -791,7 +791,6 @@ static const struct super_operations fuse_super_operations = { .destroy_inode = fuse_destroy_inode, .evict_inode= fuse_evict_inode, .write_inode= fuse_write_inode, - .drop_inode = generic_delete_inode, .remount_fs = fuse_remount_fs, .put_super = fuse_put_super, .umount_begin = fuse_umount_begin, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,writeback: fix divide by zero in pos_ratio_polynom
On 04/30/2014 12:12 PM, Michal Hocko wrote: On Wed 30-04-14 12:04:04, Maxim Patlasov wrote: Hi Rik! On 04/29/2014 11:19 PM, Rik van Riel wrote: It is possible for "limit - setpoint + 1" to equal zero, leading to a divide by zero error. Blindly adding 1 to "limit - setpoint" is not working, so we need to actually test the divisor before calling div64. The patch looks correct, but I'm afraid it can hide an actual bug in a caller of pos_ratio_polynom(). The latter is not intended for setpoint > limit. All callers take pains to ensure that setpoint <= limit. Look, for example, at global_dirty_limits(): The bug might trigger even if setpoint < limit because the result is trucated to s32 and I guess this is what is going on here? Is (limit - setpoint + 1) > 4G possible? Yes, you are right. Probably the problem came from s32 overflow. if (background >= dirty) background = dirty / 2; If you ever encountered "limit - setpoint + 1" equal zero, it may be worthy to investigate how you came to setpoint greater than limit. 
Thanks, Maxim Signed-off-by: Rik van Riel Cc: sta...@vger.kernel.org --- mm/page-writeback.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ef41349..2682516 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -597,11 +597,16 @@ static inline long long pos_ratio_polynom(unsigned long setpoint, unsigned long dirty, unsigned long limit) { + unsigned int divisor; long long pos_ratio; long x; + divisor = limit - setpoint; + if (!divisor) + divisor = 1; + x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT, - limit - setpoint + 1); + divisor); pos_ratio = x; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,writeback: fix divide by zero in pos_ratio_polynom
Hi Rik! On 04/29/2014 11:19 PM, Rik van Riel wrote: It is possible for "limit - setpoint + 1" to equal zero, leading to a divide by zero error. Blindly adding 1 to "limit - setpoint" is not working, so we need to actually test the divisor before calling div64. The patch looks correct, but I'm afraid it can hide an actual bug in a caller of pos_ratio_polynom(). The latter is not intended for setpoint > limit. All callers take pains to ensure that setpoint <= limit. Look, for example, at global_dirty_limits(): > if (background >= dirty) >background = dirty / 2; If you ever encountered "limit - setpoint + 1" equal zero, it may be worthy to investigate how you came to setpoint greater than limit. Thanks, Maxim Signed-off-by: Rik van Riel Cc: sta...@vger.kernel.org --- mm/page-writeback.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ef41349..2682516 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -597,11 +597,16 @@ static inline long long pos_ratio_polynom(unsigned long setpoint, unsigned long dirty, unsigned long limit) { + unsigned int divisor; long long pos_ratio; long x; + divisor = limit - setpoint; + if (!divisor) + divisor = 1; + x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT, - limit - setpoint + 1); + divisor); pos_ratio = x; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 7/7] fuse: flush ctime in FUSE_FORGET requests
Some inode operations (e.g., rename) operate directly on inodes and dentries without opened files involved. This means that even though fuse set inode->i_ctime and FUSE_I_CTIME_DIRTY properly, a corresponding flush operation will never happen (i.e. no fsync or close to call fuse_flush_cmtime()). The patch solves the problem by passing local ctime to the userspace server inside forget requests. Signed-off-by: Maxim Patlasov --- fs/fuse/dev.c | 30 +++--- fs/fuse/fuse_i.h |2 +- fs/fuse/inode.c |6 ++ include/uapi/linux/fuse.h | 18 ++ 4 files changed, 48 insertions(+), 8 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index aac71ce..97c7749 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -578,17 +578,25 @@ void fuse_request_send_background_locked(struct fuse_conn *fc, void fuse_force_forget(struct file *file, u64 nodeid) { struct inode *inode = file_inode(file); + struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_req *req; - struct fuse_forget_in inarg; + struct fuse_forget_in_ext inarg; + int inarg_size = fc->minor >= 24 ? 
+ sizeof(inarg) : sizeof(struct fuse_forget_in); memset(&inarg, 0, sizeof(inarg)); inarg.nlookup = 1; + if (test_bit(FUSE_I_CTIME_DIRTY, &fi->state)) { + inarg.forget_flags = FUSE_FORGET_CTIME_SET; + inarg.ctime = inode->i_ctime.tv_sec; + inarg.ctimensec = inode->i_ctime.tv_nsec; + } req = fuse_get_req_nofail_nopages(fc, file); req->in.h.opcode = FUSE_FORGET; req->in.h.nodeid = nodeid; req->in.numargs = 1; - req->in.args[0].size = sizeof(inarg); + req->in.args[0].size = inarg_size; req->in.args[0].value = &inarg; req->isreply = 0; __fuse_request_send(fc, req); @@ -1102,14 +1110,19 @@ __releases(fc->lock) { int err; struct fuse_forget_link *forget = dequeue_forget(fc, 1, NULL); - struct fuse_forget_in arg = { + struct fuse_forget_in_ext arg = { .nlookup = forget->forget_one.nlookup, + .ctime = forget->forget_one.ctime, + .ctimensec = forget->forget_one.ctimensec, + .forget_flags = forget->forget_one.forget_flags, }; + int arg_size = fc->minor >= 24 ? + sizeof(arg) : sizeof(struct fuse_forget_in); struct fuse_in_header ih = { .opcode = FUSE_FORGET, .nodeid = forget->forget_one.nodeid, .unique = fuse_get_unique(fc), - .len = sizeof(ih) + sizeof(arg), + .len = sizeof(ih) + arg_size, }; spin_unlock(&fc->lock); @@ -1142,18 +1155,21 @@ __releases(fc->lock) .unique = fuse_get_unique(fc), .len = sizeof(ih) + sizeof(arg), }; + int forget_one_size = fc->minor >= 24 ? 
+ sizeof(struct fuse_forget_one_ext) : + sizeof(struct fuse_forget_one); if (nbytes < ih.len) { spin_unlock(&fc->lock); return -EINVAL; } - max_forgets = (nbytes - ih.len) / sizeof(struct fuse_forget_one); + max_forgets = (nbytes - ih.len) / forget_one_size; head = dequeue_forget(fc, max_forgets, &count); spin_unlock(&fc->lock); arg.count = count; - ih.len += count * sizeof(struct fuse_forget_one); + ih.len += count * forget_one_size; err = fuse_copy_one(cs, &ih, sizeof(ih)); if (!err) err = fuse_copy_one(cs, &arg, sizeof(arg)); @@ -1163,7 +1179,7 @@ __releases(fc->lock) if (!err) { err = fuse_copy_one(cs, &forget->forget_one, - sizeof(forget->forget_one)); + forget_one_size); } head = forget->next; kfree(forget); diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 781a01d..d5172fa 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -59,7 +59,7 @@ extern unsigned max_user_congthresh; /* One forget request */ struct fuse_forget_link { - struct fuse_forget_one forget_one; + struct fuse_forget_one_ext forget_one; struct fuse_forget_link *next; }; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 8b9a1d1..932e04a 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -128,6 +128,12 @@ static void fuse_evict_inode(struct inode *inode) if (inode->i_sb->s_flags & MS_ACTIVE) { struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_inode *fi = get_fuse_inode(inode); +
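The compatibility rule the patch applies throughout — send the extended forget structure only to servers that negotiated FUSE protocol minor >= 24 — can be sketched in user space. The struct layouts below follow the fields visible in the diff; the authoritative definitions belong in include/uapi/linux/fuse.h and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout follows the patch; the real definitions live in
 * include/uapi/linux/fuse.h. */
struct fuse_forget_in {
	uint64_t nlookup;
};

struct fuse_forget_in_ext {
	uint64_t nlookup;
	uint64_t ctime;
	uint32_t ctimensec;
	uint32_t forget_flags;
};

/* Mirror of the "fc->minor >= 24 ? ... : ..." size selection in
 * fuse_force_forget(): servers on an older protocol minor must keep
 * receiving the old, shorter argument. */
static size_t forget_in_size(unsigned int minor)
{
	return minor >= 24 ? sizeof(struct fuse_forget_in_ext)
			   : sizeof(struct fuse_forget_in);
}
```

The same pattern gates the batched FUSE_BATCH_FORGET path, where `forget_one_size` scales how many forgets fit into one read from /dev/fuse.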
[PATCH 6/7] fuse: clear FUSE_I_CTIME_DIRTY flag on setattr
The patch addresses two use-cases when the flag may be safely cleared: 1. fuse_do_setattr() is called with ATTR_CTIME flag set in attr->ia_valid. In this case attr->ia_ctime bears actual value. In-kernel fuse must send it to the userspace server and then assign the value to inode->i_ctime. 2. fuse_do_setattr() is called with ATTR_SIZE flag set in attr->ia_valid, whereas ATTR_CTIME is not set (truncate(2)). In this case in-kernel fuse must sent "now" to the userspace server and then assign the value to inode->i_ctime. In both cases fuse can clear FUSE_I_CTIME_DIRTY flag because actual ctime value was flushed to the server. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 38 ++ 1 file changed, 30 insertions(+), 8 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 2187960..3495aaf 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1518,8 +1518,9 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) return true; } -static void iattr_to_fattr(struct inode *inode, struct iattr *iattr, - struct fuse_setattr_in *arg, bool trust_local_mtime) +static void iattr_to_fattr(struct fuse_conn *fc, struct inode *inode, + struct iattr *iattr, struct fuse_setattr_in *arg, + bool trust_local_mtime, struct timespec *ctime) { unsigned ivalid = iattr->ia_valid; struct timespec now = current_fs_time(inode->i_sb); @@ -1550,6 +1551,19 @@ static void iattr_to_fattr(struct inode *inode, struct iattr *iattr, arg->mtime = now.tv_sec; arg->mtimensec = now.tv_nsec; } + + if ((ivalid & ATTR_CTIME) && trust_local_mtime) + *ctime = iattr->ia_ctime; + else if ((ivalid & ATTR_SIZE) && trust_local_mtime) + *ctime = now; + else + ctime = NULL; + + if (ctime && fc->minor >= 24) { + arg->valid |= FATTR_CTIME; + arg->ctime = ctime->tv_sec; + arg->ctimensec = ctime->tv_nsec; + } } /* @@ -1688,6 +1702,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, loff_t oldsize; int err; bool trust_local_mtime = is_wb && S_ISREG(inode->i_mode); + struct timespec ctime; if 
(!(fc->flags & FUSE_DEFAULT_PERMISSIONS)) attr->ia_valid |= ATTR_FORCE; @@ -1716,7 +1731,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - iattr_to_fattr(inode, attr, &inarg, trust_local_mtime); + iattr_to_fattr(fc, inode, attr, &inarg, trust_local_mtime, &ctime); if (file) { struct fuse_file *ff = file->private_data; inarg.valid |= FATTR_FH; @@ -1744,11 +1759,18 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, } spin_lock(&fc->lock); - /* the kernel maintains i_mtime locally */ - if (trust_local_mtime && (attr->ia_valid & (ATTR_MTIME | ATTR_SIZE))) { - inode->i_mtime.tv_sec = inarg.mtime; - inode->i_mtime.tv_nsec = inarg.mtimensec; - clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + /* the kernel maintains i_mtime and i_ctime locally */ + if (trust_local_mtime) { + if (attr->ia_valid & (ATTR_MTIME | ATTR_SIZE)) { + inode->i_mtime.tv_sec = inarg.mtime; + inode->i_mtime.tv_nsec = inarg.mtimensec; + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } + + if (attr->ia_valid & (ATTR_CTIME | ATTR_SIZE)) { + inode->i_ctime = ctime; + clear_bit(FUSE_I_CTIME_DIRTY, &fi->state); + } } fuse_change_attributes_common(inode, &outarg.attr, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
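The two use-cases above condense into a small decision function. This is a hedged sketch: the ATTR_* values are stand-ins for the kernel constants in include/linux/fs.h, the -1 sentinel replaces the NULL pointer the patch uses, and the real code additionally gates both branches on trust_local_mtime.

```c
#include <assert.h>

/* Stand-ins for the kernel's iattr validity bits (include/linux/fs.h). */
#define ATTR_SIZE  (1 << 3)
#define ATTR_CTIME (1 << 6)

/*
 * Which ctime (seconds) should be flushed to the server?
 * Case 1: explicit ATTR_CTIME -> use attr->ia_ctime.
 * Case 2: truncate(2) sets ATTR_SIZE without ATTR_CTIME -> use "now".
 * Otherwise there is nothing to flush (-1 here; the patch passes NULL).
 */
static long ctime_to_flush(unsigned ivalid, long ia_ctime_sec, long now_sec)
{
	if (ivalid & ATTR_CTIME)
		return ia_ctime_sec;
	if (ivalid & ATTR_SIZE)
		return now_sec;
	return -1;
}
```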
[PATCH 5/7] fuse: trust kernel i_ctime only
Let the kernel maintain i_ctime locally: - clear S_NOCMTIME (implemented earlier) - update i_ctime in i_op->update_time() - flush ctime on fsync and last close (previous patch) - update i_ctime explicitly on truncate, fallocate, open(O_TRUNC), setxattr, link, rename, unlink. Fuse inode flag FUSE_I_CTIME_DIRTY serves as indication that local i_ctime should be flushed to the server eventually. The patch sets the flag and updates i_ctime in course of operations listed above. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 24 +++- fs/fuse/file.c |8 ++-- fs/fuse/inode.c |6 -- 3 files changed, 33 insertions(+), 5 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 0596726..2187960 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -679,6 +679,15 @@ static int fuse_symlink(struct inode *dir, struct dentry *entry, return create_new_entry(fc, req, dir, entry, S_IFLNK); } +static inline void fuse_update_ctime(struct fuse_conn *fc, struct inode *inode) +{ + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { + struct fuse_inode *fi = get_fuse_inode(inode); + inode->i_ctime = current_fs_time(inode->i_sb); + set_bit(FUSE_I_CTIME_DIRTY, &fi->state); + } +} + static int fuse_unlink(struct inode *dir, struct dentry *entry) { int err; @@ -713,6 +722,7 @@ static int fuse_unlink(struct inode *dir, struct dentry *entry) fuse_invalidate_attr(inode); fuse_invalidate_attr(dir); fuse_invalidate_entry_cache(entry); + fuse_update_ctime(fc, inode); } else if (err == -EINTR) fuse_invalidate_entry(entry); return err; @@ -771,6 +781,7 @@ static int fuse_rename(struct inode *olddir, struct dentry *oldent, if (!err) { /* ctime changes */ fuse_invalidate_attr(oldent->d_inode); + fuse_update_ctime(fc, oldent->d_inode); fuse_invalidate_attr(olddir); if (olddir != newdir) @@ -780,6 +791,7 @@ static int fuse_rename(struct inode *olddir, struct dentry *oldent, if (newent->d_inode) { fuse_invalidate_attr(newent->d_inode); fuse_invalidate_entry_cache(newent); + fuse_update_ctime(fc, 
newent->d_inode); } } else if (err == -EINTR) { /* If request was interrupted, DEITY only knows if the @@ -829,6 +841,7 @@ static int fuse_link(struct dentry *entry, struct inode *newdir, inc_nlink(inode); spin_unlock(&fc->lock); fuse_invalidate_attr(inode); + fuse_update_ctime(fc, inode); } else if (err == -EINTR) { fuse_invalidate_attr(inode); } @@ -846,6 +859,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, attr->size = i_size_read(inode); attr->mtime = inode->i_mtime.tv_sec; attr->mtimensec = inode->i_mtime.tv_nsec; + attr->ctime = inode->i_ctime.tv_sec; + attr->ctimensec = inode->i_ctime.tv_nsec; } stat->dev = inode->i_sb->s_dev; @@ -1830,8 +1845,10 @@ static int fuse_setxattr(struct dentry *entry, const char *name, fc->no_setxattr = 1; err = -EOPNOTSUPP; } - if (!err) + if (!err) { fuse_invalidate_attr(inode); + fuse_update_ctime(fc, inode); + } return err; } @@ -1974,6 +1991,11 @@ static int fuse_update_time(struct inode *inode, struct timespec *now, set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state); BUG_ON(!S_ISREG(inode->i_mode)); } + if (flags & S_CTIME) { + inode->i_ctime = *now; + set_bit(FUSE_I_CTIME_DIRTY, &get_fuse_inode(inode)->state); + BUG_ON(!S_ISREG(inode->i_mode)); + } return 0; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 9ed8590b..f644aeb 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -224,8 +224,10 @@ void fuse_finish_open(struct inode *inode, struct file *file) spin_unlock(&fc->lock); fuse_invalidate_attr(inode); if (fc->writeback_cache) { - inode->i_mtime = current_fs_time(inode->i_sb); + inode->i_ctime = inode->i_mtime = + current_fs_time(inode->i_sb); set_bit(FUSE_I_MTIME_DIRTY, &fi->state); + set_bit(FUSE_I_CTIME_DIRTY, &fi->state); } } if ((file->f_mode & FMODE_WRITE) &&
[PATCH 4/7] fuse: introduce FUSE_I_CTIME_DIRTY flag
The flag plays the role similar to FUSE_I_MTIME_DIRTY: local ctime (inode->i_ctime) is newer than on the server; so, local ctime must be flushed to the server afterwards. The patch only introduces the flag, extends API (fuse_setattr_in) properly, and implements the flush procedure (fuse_flush_cmtime()). Further patches of the patch-set will implement the logic of setting and clearing the flag. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 22 ++ fs/fuse/file.c| 11 --- fs/fuse/fuse_i.h |4 +++- include/uapi/linux/fuse.h | 11 --- 4 files changed, 33 insertions(+), 15 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index c7cb41c..0596726 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1600,9 +1600,9 @@ static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, } /* - * Flush inode->i_mtime to the server + * Flush inode->i_mtime and inode->i_ctime to the server */ -int fuse_flush_mtime(struct file *file, bool nofail) +int fuse_flush_cmtime(struct file *file, bool nofail) { struct inode *inode = file->f_mapping->host; struct fuse_inode *fi = get_fuse_inode(inode); @@ -1610,8 +1610,16 @@ int fuse_flush_mtime(struct file *file, bool nofail) struct fuse_req *req = NULL; struct fuse_setattr_in inarg; struct fuse_attr_out outarg; + unsigned cmtime_flags = 0; int err; + if (test_bit(FUSE_I_MTIME_DIRTY, &fi->state)) + cmtime_flags |= FATTR_MTIME; + if (test_bit(FUSE_I_CTIME_DIRTY, &fi->state) && fc->minor >= 24) + cmtime_flags |= FATTR_CTIME; + if (!cmtime_flags) + return 0; + if (nofail) { req = fuse_get_req_nofail_nopages(fc, file); } else { @@ -1623,17 +1631,23 @@ int fuse_flush_mtime(struct file *file, bool nofail) memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - inarg.valid |= FATTR_MTIME; + inarg.valid = cmtime_flags; inarg.mtime = inode->i_mtime.tv_sec; inarg.mtimensec = inode->i_mtime.tv_nsec; + if (fc->minor >= 24) { + inarg.ctime = inode->i_ctime.tv_sec; + inarg.ctimensec = inode->i_ctime.tv_nsec; + } 
fuse_setattr_fill(fc, req, inode, &inarg, &outarg); fuse_request_send(fc, req); err = req->out.h.error; fuse_put_request(fc, req); - if (!err) + if (!err) { clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + clear_bit(FUSE_I_CTIME_DIRTY, &fi->state); + } return err; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index c2c6df7..9ed8590b 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -328,8 +328,7 @@ static int fuse_release(struct inode *inode, struct file *file) if (fc->writeback_cache) filemap_write_and_wait(file->f_mapping); - if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) - fuse_flush_mtime(file, true); + fuse_flush_cmtime(file, true); fuse_release_common(file, FUSE_RELEASE); @@ -512,11 +511,9 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end, fuse_sync_writes(inode); - if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) { - int err = fuse_flush_mtime(file, false); - if (err) - goto out; - } + err = fuse_flush_cmtime(file, false); + if (err) + goto out; req = fuse_get_req_nopages(fc); if (IS_ERR(req)) { diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index a257ed8e..781a01d 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -121,6 +121,8 @@ enum { FUSE_I_SIZE_UNSTABLE, /** i_mtime has been updated locally; a flush to userspace needed */ FUSE_I_MTIME_DIRTY, + /** i_ctime has been updated locally; a flush to userspace needed */ + FUSE_I_CTIME_DIRTY, }; struct fuse_conn; @@ -891,7 +893,7 @@ int fuse_dev_release(struct inode *inode, struct file *file); bool fuse_write_update_size(struct inode *inode, loff_t pos); -int fuse_flush_mtime(struct file *file, bool nofail); +int fuse_flush_cmtime(struct file *file, bool nofail); int fuse_do_setattr(struct inode *inode, struct iattr *attr, struct file *file); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index cf4750e..8af06cc 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -96,6 +96,10 @@ * * 7.23 * - add 
FUSE_WRITEBACK_CACHE + * + * 7.24 + * - ctime support for FUSE_WRITEBACK_CACHE: + *changes in fuse_setattr_in and fuse_forget_in */ #ifndef _LINUX_FUSE_H @@ -131,7 +135,7 @@ #define FUSE_KERNEL_VERSION 7 /**
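The early-exit logic of fuse_flush_cmtime() — derive the FATTR_* mask from the two dirty bits, but never advertise FATTR_CTIME to a pre-7.24 server — can be sketched as below. The flag values match the FATTR_* bits in include/uapi/linux/fuse.h, but verify against the header before relying on them.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions taken from include/uapi/linux/fuse.h. */
#define FATTR_MTIME (1 << 5)
#define FATTR_CTIME (1 << 10)

/* Returns the setattr valid-mask to flush; 0 means "skip the request",
 * which is the fuse_flush_cmtime() early return. */
static unsigned cmtime_flags(bool mtime_dirty, bool ctime_dirty,
			     unsigned protocol_minor)
{
	unsigned flags = 0;

	if (mtime_dirty)
		flags |= FATTR_MTIME;
	if (ctime_dirty && protocol_minor >= 24)  /* old servers lack FATTR_CTIME */
		flags |= FATTR_CTIME;
	return flags;
}
```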
[PATCH 3/7] fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode
In case of fc->atomic_o_trunc is set, fuse does nothing in fuse_do_setattr() while handling open(O_TRUNC). Hence, i_mtime must be updated explicitly in fuse_finish_open(). The patch also adds extra locking encompassing open(O_TRUNC) operation to avoid races between updating i_mtime and setting FUSE_I_MTIME_DIRTY flag. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 13f8bde..c2c6df7 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -223,6 +223,10 @@ void fuse_finish_open(struct inode *inode, struct file *file) i_size_write(inode, 0); spin_unlock(&fc->lock); fuse_invalidate_attr(inode); + if (fc->writeback_cache) { + inode->i_mtime = current_fs_time(inode->i_sb); + set_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } } if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache) fuse_link_write_file(file); @@ -232,18 +236,26 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir) { struct fuse_conn *fc = get_fuse_conn(inode); int err; + bool lock_inode = (file->f_flags & O_TRUNC) && + fc->atomic_o_trunc && + fc->writeback_cache; err = generic_file_open(inode, file); if (err) return err; + if (lock_inode) + mutex_lock(&inode->i_mutex); + err = fuse_do_open(fc, get_node_id(inode), file, isdir); - if (err) - return err; - fuse_finish_open(inode, file); + if (!err) + fuse_finish_open(inode, file); - return 0; + if (lock_inode) + mutex_unlock(&inode->i_mutex); + + return err; } static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/7] fuse: update mtime on truncate(2)
Handling truncate(2), VFS doesn't set ATTR_MTIME bit in iattr structure; only ATTR_SIZE bit is set. In-kernel fuse must handle the case by setting mtime fields of struct fuse_setattr_in to "now" and set FATTR_MTIME bit even though ATTR_MTIME was not set. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 5b4e035..c7cb41c 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1503,10 +1503,11 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) return true; } -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, - bool trust_local_mtime) +static void iattr_to_fattr(struct inode *inode, struct iattr *iattr, + struct fuse_setattr_in *arg, bool trust_local_mtime) { unsigned ivalid = iattr->ia_valid; + struct timespec now = current_fs_time(inode->i_sb); if (ivalid & ATTR_MODE) arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; @@ -1529,6 +1530,10 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, arg->mtimensec = iattr->ia_mtime.tv_nsec; if (!(ivalid & ATTR_MTIME_SET) && !trust_local_mtime) arg->valid |= FATTR_MTIME_NOW; + } else if ((ivalid & ATTR_SIZE) && trust_local_mtime) { + arg->valid |= FATTR_MTIME; + arg->mtime = now.tv_sec; + arg->mtimensec = now.tv_nsec; } } @@ -1682,7 +1687,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - iattr_to_fattr(attr, &inarg, trust_local_mtime); + iattr_to_fattr(inode, attr, &inarg, trust_local_mtime); if (file) { struct fuse_file *ff = file->private_data; inarg.valid |= FATTR_FH; @@ -1711,8 +1716,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, spin_lock(&fc->lock); /* the kernel maintains i_mtime locally */ - if (trust_local_mtime && (attr->ia_valid & ATTR_MTIME)) { - inode->i_mtime = attr->ia_mtime; + if (trust_local_mtime && (attr->ia_valid & (ATTR_MTIME | 
ATTR_SIZE))) { + inode->i_mtime.tv_sec = inarg.mtime; + inode->i_mtime.tv_nsec = inarg.mtimensec; clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
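The rule this patch implements — truncate(2) arrives as ATTR_SIZE without ATTR_MTIME, yet a writeback-cache client must still send FATTR_MTIME with "now" — reduces to the sketch below. The ATTR_* values are stand-ins for the kernel constants, and the real iattr_to_fattr() additionally handles ATTR_MTIME_SET / FATTR_MTIME_NOW.

```c
#include <assert.h>

/* Stand-ins for the kernel's iattr validity bits (include/linux/fs.h). */
#define ATTR_SIZE  (1 << 3)
#define ATTR_MTIME (1 << 5)

/*
 * Which mtime (seconds) should be sent with FATTR_MTIME?
 * Explicit ATTR_MTIME wins; a bare ATTR_SIZE (truncate) uses "now"
 * when the kernel maintains mtime locally; -1 means "don't set it".
 */
static long mtime_to_send(unsigned ivalid, int trust_local_mtime,
			  long ia_mtime_sec, long now_sec)
{
	if (ivalid & ATTR_MTIME)
		return ia_mtime_sec;
	if ((ivalid & ATTR_SIZE) && trust_local_mtime)
		return now_sec;		/* truncate(2): VFS didn't set ATTR_MTIME */
	return -1;
}
```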
[PATCH 1/7] fuse: do not use uninitialized i_mode
When an inode is in the I_NEW state, inode->i_mode is not initialized yet. Do not use it before fuse_init_inode() is called.

Signed-off-by: Maxim Patlasov
---
 fs/fuse/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8d61169..299e553 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -303,7 +303,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	if ((inode->i_state & I_NEW)) {
 		inode->i_flags |= S_NOATIME;
-		if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
+		if (!fc->writeback_cache || !S_ISREG(attr->mode))
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
[PATCH 0/7] fuse: fix handling mtime and cmtime in writeback_cache mode
Hi,

The patch-set fixes a few issues introduced by the writeback_cache feature:
- fuse_iget() used inode->i_mode before its initialization
- in the course of truncate(2), inode->i_mtime was not updated at all
- the same for open(O_TRUNC) when fc->atomic_o_trunc is set
- we cannot trust ctime from the server anymore because we don't flush all data changes to the server immediately
- to maintain ctime locally, we have to update it in the proper places (fuse_update_time, fallocate, setxattr, etc.)
- locally updated ctime must be flushed to the server eventually.

Maxim

---

Maxim Patlasov (7):
      fuse: do not use uninitialized i_mode
      fuse: update mtime on truncate(2)
      fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode
      fuse: introduce FUSE_I_CTIME_DIRTY flag
      fuse: trust kernel i_ctime only
      fuse: clear FUSE_I_CTIME_DIRTY flag on setattr
      fuse: flush ctime in FUSE_FORGET requests

 fs/fuse/dev.c             | 30 ---
 fs/fuse/dir.c             | 88 +++--
 fs/fuse/file.c            | 37 +--
 fs/fuse/fuse_i.h          |  6 ++-
 fs/fuse/inode.c           | 14 ++-
 include/uapi/linux/fuse.h | 29 +--
 6 files changed, 165 insertions(+), 39 deletions(-)
Re: [PATCH 07/11] fuse: restructure fuse_readpage()
On 01/06/2014 08:43 PM, Miklos Szeredi wrote: On Fri, Dec 20, 2013 at 06:54:40PM +0400, Maxim Patlasov wrote: Hi Miklos, Sorry for delay, see please inline comments below. On 11/12/2013 09:17 PM, Miklos Szeredi wrote: On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote: Move the code filling and sending read request to a separate function. Future patches will use it for .write_begin -- partial modification of a page requires reading the page from the storage very similarly to what fuse_readpage does. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 55 +-- 1 file changed, 37 insertions(+), 18 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b4d4189..77eb849 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, } } -static int fuse_readpage(struct file *file, struct page *page) +static int __fuse_readpage(struct file *file, struct page *page, size_t count, + int *err, struct fuse_req **req_pp, u64 *attr_ver_p) Signature of this helper looks really ugly. A quick look tells me that neither caller actually needs 'req'. fuse_readpage() passes 'req' to fuse_short_read(). And the latter uses req->pages[] to nullify a part of request. I don't get it. __fuse_readpage() itself call's fuse_short_read(), not callers of __fuse_readpage(). Or do they? fuse_readpage() is a caller of __fuse_readpage() and it looks (after applying the patch) like this: > static int fuse_readpage(struct file *file, struct page *page) > { > ... >num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver); >if (!err) { >/* > * Short read means EOF. If file size is larger, truncate it > */ >if (num_read < count) >fuse_short_read(req, inode, attr_ver); > >SetPageUptodate(page); >} Thanks, Maxim And fuse_get_attr_version() can be moved to the one caller that needs it. Yes, it's doable. 
But this would make attr_version mechanism less efficient (under some loads): suppose the file on server was truncated externally, then fuse_readpage() acquires fc->attr_version, then some innocent write bumps fc->attr_version while we're waiting for fuse writeback, then fuse_read_update_size() would noop. In the other words, it's beneficial to keep the time interval between acquiring fc->attr_version and subsequent comparison as short as possible. Okay, lets try to keep this the way it is. I don't like it very much, but I fear changing user visible behavior. Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
On 01/06/2014 08:22 PM, Miklos Szeredi wrote: On Thu, Dec 26, 2013 at 07:41:41PM +0400, Maxim Patlasov wrote: + + if (!err) + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); Doing the test and the clear separately opens a huge race window when i_mtime modifications are bound to get lost. No. Because the whole operation is protected by i_mutex (see fuse_fsync_common()). fuse_release_common() doesn't have i_mutex. It's probably safe to acquire it, but is that really needed? No, that's not needed, I think. Because by the time of calling fuse_release(), file->f_count is already zero and no userspace activity is possible on the file. Are you OK about -v3 version of the patch (sent 12/26/2013)? Thanks, Maxim Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 05/11] fuse: Trust kernel i_mtime only -v3
Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Changed in v2 (thanks to Brian): - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY - simplified fuse_set_mtime_local() - abandoned optimizations: clearing the flag on some operations (like direct write) is doable, but may lead to unexpected changes of user-visible mtime. Changed in v3 (thanks to Miklos): - let client fs know about local mtime change for ATTR_SIZE and !ATTR_MTIME_SET cases. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c| 113 +- fs/fuse/file.c | 30 -- fs/fuse/fuse_i.h |6 ++- fs/fuse/inode.c | 13 +- 4 files changed, 136 insertions(+), 26 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index f022968..d1a3167 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc = get_fuse_conn(inode); /* see the comment in fuse_change_attributes() */ - if (fc->writeback_cache && S_ISREG(inode->i_mode)) + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { attr->size = i_size_read(inode); + attr->mtime = inode->i_mtime.tv_sec; + attr->mtimensec = inode->i_mtime.tv_nsec; + } stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1510,7 +1513,8 @@ static bool update_mtime(unsigned ivalid) return true; } -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg) +static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, + bool trust_local_mtime, struct timespec *mtime) { unsigned ivalid = iattr->ia_valid; @@ -1529,11 +1533,17 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg) if (!(ivalid & ATTR_ATIME_SET)) arg->valid |= FATTR_ATIME_NOW; } - if ((ivalid & ATTR_MTIME) && 
update_mtime(ivalid)) { + + if (!trust_local_mtime || (ivalid & ATTR_MTIME_SET) || + !(ivalid & ATTR_SIZE)) + *mtime = iattr->ia_mtime; + + if ((ivalid & ATTR_MTIME) && (update_mtime(ivalid) || + trust_local_mtime)) { arg->valid |= FATTR_MTIME; - arg->mtime = iattr->ia_mtime.tv_sec; - arg->mtimensec = iattr->ia_mtime.tv_nsec; - if (!(ivalid & ATTR_MTIME_SET)) + arg->mtime = mtime->tv_sec; + arg->mtimensec = mtime->tv_nsec; + if (!(ivalid & ATTR_MTIME_SET) && !trust_local_mtime) arg->valid |= FATTR_MTIME_NOW; } } @@ -1582,6 +1592,63 @@ void fuse_release_nowrite(struct inode *inode) spin_unlock(&fc->lock); } +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, + struct inode *inode, + struct fuse_setattr_in *inarg_p, + struct fuse_attr_out *outarg_p) +{ + req->in.h.opcode = FUSE_SETATTR; + req->in.h.nodeid = get_node_id(inode); + req->in.numargs = 1; + req->in.args[0].size = sizeof(*inarg_p); + req->in.args[0].value = inarg_p; + req->out.numargs = 1; + if (fc->minor < 9) + req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE; + else + req->out.args[0].size = sizeof(*outarg_p); + req->out.args[0].value = outarg_p; +} + +/* + * Flush inode->i_mtime to the server + */ +int fuse_flush_mtime(struct file *file, bool nofail) +{ + struct inode *inode = file->f_mapping->host; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + struct fuse_setattr_in inarg; + struct fuse_attr_out outarg; + int err; + + if (nofail) { + req = fuse_get_req_nofail_nopages(fc, file); + } else { + req = fuse_get_req_nopages(fc); + if (IS_ERR(req)) + return PTR_ERR(req); + } + + memset(&inarg, 0, sizeof(inarg)); + memset(&outarg, 0, sizeof(outarg)); + + inarg.valid |= FATTR_MTIME; + inarg.mtime = inode->i_mtime.tv_sec; + inarg.mtimensec = inode->i_mtime.tv_nsec; + + fuse_setattr_fill(fc, req, inode, &inarg, &outarg); + fuse_request_send(fc,
Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
Hi Miklos, Sorry for delay, see please inline comments below. On 11/12/2013 08:52 PM, Miklos Szeredi wrote: On Thu, Oct 10, 2013 at 05:10:56PM +0400, Maxim Patlasov wrote: Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Changed in v2 (thanks to Brian): - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY - simplified fuse_set_mtime_local() - abandoned optimizations: clearing the flag on some operations (like direct write) is doable, but may lead to unexpected changes of user-visible mtime. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c| 109 -- fs/fuse/file.c | 30 +-- fs/fuse/fuse_i.h |6 ++- fs/fuse/inode.c | 13 +- 4 files changed, 138 insertions(+), 20 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index f022968..eda248b 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc = get_fuse_conn(inode); /* see the comment in fuse_change_attributes() */ - if (fc->writeback_cache && S_ISREG(inode->i_mode)) + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { attr->size = i_size_read(inode); + attr->mtime = inode->i_mtime.tv_sec; + attr->mtimensec = inode->i_mtime.tv_nsec; + } stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode) spin_unlock(&fc->lock); } +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, + struct inode *inode, + struct fuse_setattr_in *inarg_p, + struct fuse_attr_out *outarg_p) +{ + req->in.h.opcode = FUSE_SETATTR; + req->in.h.nodeid = get_node_id(inode); + req->in.numargs = 1; + req->in.args[0].size = sizeof(*inarg_p); + req->in.args[0].value = inarg_p; + req->out.numargs = 1; + if (fc->minor < 
9) + req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE; + else + req->out.args[0].size = sizeof(*outarg_p); + req->out.args[0].value = outarg_p; +} + +/* + * Flush inode->i_mtime to the server + */ +int fuse_flush_mtime(struct file *file, bool nofail) +{ + struct inode *inode = file->f_mapping->host; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + struct fuse_setattr_in inarg; + struct fuse_attr_out outarg; + int err; + + if (nofail) { + req = fuse_get_req_nofail_nopages(fc, file); + } else { + req = fuse_get_req_nopages(fc); + if (IS_ERR(req)) + return PTR_ERR(req); + } + + memset(&inarg, 0, sizeof(inarg)); + memset(&outarg, 0, sizeof(outarg)); + + inarg.valid |= FATTR_MTIME; + inarg.mtime = inode->i_mtime.tv_sec; + inarg.mtimensec = inode->i_mtime.tv_nsec; + + fuse_setattr_fill(fc, req, inode, &inarg, &outarg); + fuse_request_send(fc, req); + err = req->out.h.error; + fuse_put_request(fc, req); + + if (!err) + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); Doing the test and the clear separately opens a huge race window when i_mtime modifications are bound to get lost. No. Because the whole operation is protected by i_mutex (see fuse_fsync_common()). + + return err; +} + +/* + * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. + */ +static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode) +{ + unsigned ivalid = iattr->ia_valid; + struct fuse_inode *fi = get_fuse_inode(inode); + + if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) { + /* client fs has just set mtime to iattr->ia_mtime */ + inode->i_mtime = iattr->ia_mtime; + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); This is protected by i_mutex, so it should be safe. Yes. + } else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) { + /* client fs doesn't know that we're updating i_mtime */ If so, why not tell the client fs to update mtime? Looks reasonable. I'll resend updated patch soon. 
+ inode->i_mtime = current_fs_time(inode->i_sb); +
Re: [PATCH 07/11] fuse: restructure fuse_readpage()
Hi Miklos, sorry for the delay; please see my inline comments below. On 11/12/2013 09:17 PM, Miklos Szeredi wrote: On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote: Move the code filling and sending read request to a separate function. Future patches will use it for .write_begin -- partial modification of a page requires reading the page from the storage very similarly to what fuse_readpage does. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 55 +-- 1 file changed, 37 insertions(+), 18 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b4d4189..77eb849 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, } } -static int fuse_readpage(struct file *file, struct page *page) +static int __fuse_readpage(struct file *file, struct page *page, size_t count, + int *err, struct fuse_req **req_pp, u64 *attr_ver_p) Signature of this helper looks really ugly. A quick look tells me that neither caller actually needs 'req'. fuse_readpage() passes 'req' to fuse_short_read(). And the latter uses req->pages[] to nullify a part of the request. And fuse_get_attr_version() can be moved to the one caller that needs it. Yes, it's doable. But this would make the attr_version mechanism less efficient (under some loads): suppose the file on the server was truncated externally, then fuse_readpage() acquires fc->attr_version, then some innocent write bumps fc->attr_version while we're waiting for fuse writeback, and then fuse_read_update_size() would be a noop. In other words, it's beneficial to keep the time interval between acquiring fc->attr_version and the subsequent comparison as short as possible. And negative err can be returned. Yes, but this will require some precautions for positive req->out.h.error. Like "err = (req->out.h.error <= 0) ? req->out.h.error : -EIO;". But this must be OK - filtering out positive req->out.h.error is a good idea, imho.
And then all those ugly pointer args are gone and the whole thing is much simpler. If you agree with my comments above, only 1 of 3 ugly pointers can be avoided. Another way would be to revert the code back to the initial implementation where fuse_readpage() and fuse_prepare_write() sent read requests independently. Thanks, Maxim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] mm: add strictlimit knob -v2
"strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE which sets bdi max_ratio to 1% by default: http://article.gmane.org/gmane.linux.kernel.mm/105809 However the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Being enabled, the feature enforces bdi max_ratio limit even if global (10%) dirty limit is not reached. Of course, the effect is not visible until /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value. Changed in v2: - updated patch description and documentation Signed-off-by: Maxim Patlasov --- Documentation/ABI/testing/sysfs-class-bdi |8 +++ mm/backing-dev.c | 35 + 2 files changed, 43 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi index d773d56..3187a18 100644 --- a/Documentation/ABI/testing/sysfs-class-bdi +++ b/Documentation/ABI/testing/sysfs-class-bdi @@ -53,3 +53,11 @@ stable_pages_required (read-only) If set, the backing device requires that all pages comprising a write request must not be changed until writeout is complete. + +strictlimit (read-write) + + Forces per-BDI checks for the share of given device in the write-back + cache even before the global background dirty limit is reached. This + is useful in situations where the global limit is much higher than + affordable for given relatively slow (or untrusted) device. Turning + strictlimit on has no visible effect if max_ratio is equal to 100%. 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7..4ee1d64 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, } static DEVICE_ATTR_RO(stable_pages_required); +static ssize_t strictlimit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int val; + ssize_t ret; + + ret = kstrtouint(buf, 10, &val); + if (ret < 0) + return ret; + + switch (val) { + case 0: + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; + break; + case 1: + bdi->capabilities |= BDI_CAP_STRICTLIMIT; + break; + default: + return -EINVAL; + } + + return count; +} +static ssize_t strictlimit_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); +} +static DEVICE_ATTR_RW(strictlimit); + static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, &dev_attr_stable_pages_required.attr, + &dev_attr_strictlimit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: add strictlimit knob
Hi Andrew, On 11/05/2013 02:01 AM, Andrew Morton wrote: On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov wrote: "strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE which sets bdi max_ratio to 1% by default: http://article.gmane.org/gmane.linux.kernel.mm/105809 However the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Being enabled, the feature enforces bdi max_ratio limit even if global (10%) dirty limit is not reached. Of course, the effect is not visible until max_ratio is decreased to some reasonable value. I suggest replacing "max_ratio" here with the much more informative "/sys/class/bdi/X:Y/max_ratio". Also, Documentation/ABI/testing/sysfs-class-bdi will need an update please. OK, I'll update it, fix the patch description and re-send the patch. mm/backing-dev.c | 35 +++ 1 file changed, 35 insertions(+) I'm not really sure what to make of the patch. I assume you tested it and observed some effect. Could you please describe the test setup and the effects in some detail? I plugged a 16GB USB flash drive into a node with 8GB RAM running 3.12.0-rc7 and started writing a huge file with "dd" (from /dev/zero to the USB-flash mount-point). While writing I was observing the "Dirty" counter as reported by /proc/meminfo. As expected, it stabilized at a level of about 1.2GB (15% of total RAM). Immediately after dd completed, the "umount" command took about 5 minutes, which corresponds to the 5MB/s write throughput of the flash drive. Then I repeated the experiment after setting tunables: echo 1 > /sys/class/bdi/8\:16/max_ratio echo 1 > /sys/class/bdi/8\:16/strictlimit This time the "Dirty" counter became about 100 times smaller - about 12MB - and "umount" took about a second.
Thanks, Maxim
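[Editor's sketch] The ~1.2GB plateau in the experiment above matches the throttle threshold Jan Kara describes elsewhere in this thread: writers are throttled once dirty data exceeds (dirty_background_ratio + dirty_ratio)/2 of total RAM. A quick illustrative model in Python, assuming the then-default ratios of 10 and 20 (these defaults are an assumption, not stated in the thread):

```python
def throttle_threshold(total_ram, background_ratio=10, dirty_ratio=20):
    """Dirty-data level at which writers start being throttled:
    (dirty_background_ratio + dirty_ratio)/2 percent of total RAM."""
    return total_ram * (background_ratio + dirty_ratio) // 200

GiB = 1 << 30
threshold = throttle_threshold(8 * GiB)
# threshold / GiB is ~1.2, matching the ~1.2GB "Dirty" plateau reported
# for the 8GB box above (15% of RAM).
```

With max_ratio=1 and strictlimit on, the per-BDI limit (~1% of the dirty pool) kicks in far below this global threshold, which is why "Dirty" dropped to ~12MB.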
[PATCH] mm: add strictlimit knob
"strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE which sets bdi max_ratio to 1% by default: http://www.http.com//article.gmane.org/gmane.linux.kernel.mm/105809 However the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Being enabled, the feature enforces bdi max_ratio limit even if global (10%) dirty limit is not reached. Of course, the effect is not visible until max_ratio is decreased to some reasonable value. Signed-off-by: Maxim Patlasov --- mm/backing-dev.c | 35 +++ 1 file changed, 35 insertions(+) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7..4ee1d64 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, } static DEVICE_ATTR_RO(stable_pages_required); +static ssize_t strictlimit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int val; + ssize_t ret; + + ret = kstrtouint(buf, 10, &val); + if (ret < 0) + return ret; + + switch (val) { + case 0: + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; + break; + case 1: + bdi->capabilities |= BDI_CAP_STRICTLIMIT; + break; + default: + return -EINVAL; + } + + return count; +} +static ssize_t strictlimit_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); +} +static DEVICE_ATTR_RW(strictlimit); + static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, &dev_attr_stable_pages_required.attr, + &dev_attr_strictlimit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); -- To unsubscribe from this list: send the line "unsubscribe 
linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Disabling in-memory write cache for x86-64 in Linux II
On Thu 31-10-13 14:26:12, Karl Kiniger wrote: > On Tue 131029, Jan Kara wrote: > > On Fri 25-10-13 11:15:55, Karl Kiniger wrote: > > > On Fri 131025, Linus Torvalds wrote: > > > > Is it currently possible to somehow set above values per block device? > > Yes, to some extent. You can set /sys/block//bdi/max_ratio to > > the maximum proportion the device's dirty data can take from the total > > amount. The caveat currently is that this setting only takes effect after > > we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in > > total because that is an amount of dirty data when we start to throttle > > processes. So if the device you'd like to limit is the only one which is > > currently written to, the limiting doesn't have a big effect. > > Thanks for the info - thats was I am looking for. > > You are right that the limiting doesn't have a big effect right now: > > on my 4x speed DVD+RW on /dev/sr0, x86_64, 4GB, > Fedora19: > > max_ratio set to 100 - about 500MB buffered, sync time 2:10 min. > max_ratio set to 1- about 330MB buffered, sync time 1:23 min. > > ... way too much buffering. The "strictlimit" feature should fit your and Artem's needs quite well. The feature enforces per-BDI dirty limits even if the global dirty limit is not reached yet. I'll send a patch adding a knob to turn it on/off. Thanks, Maxim
[PATCH] fuse: Flush files on wb close -v2
From: Pavel Emelyanov Any write request requires a file handle to report to the userspace. Thus when we close a file (and free the fuse_file with this info) we have to flush all the outstanding dirty pages. filemap_write_and_wait() is enough because every page under fuse writeback is accounted in ff->count. This delays actual close until all fuse wb is completed. In case of "write cache" turned off, the flush is ensured by fuse_vma_close(). Changed in v2: - updated patch to be applied smoothly on top of "Trust kernel i_mtime only -v2". Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |6 ++ 1 file changed, 6 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index eabe202..b4d4189 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { + struct fuse_conn *fc = get_fuse_conn(inode); + + /* see fuse_vma_close() for !writeback_cache case */ + if (fc->writeback_cache) + filemap_write_and_wait(file->f_mapping); + if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) fuse_flush_mtime(file, true);
[PATCH 09/11] fuse: fuse_flush() should wait on writeback
The aim of the .flush fop is to hint the filesystem that flushing its state or caches or any other important data to reliable storage would be desirable now. fuse_flush() passes this hint by sending a FUSE_FLUSH request to userspace. However, dirty pages and pages under writeback may not be visible to userspace yet unless we ensure it explicitly. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |9 + 1 file changed, 9 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 642bd51..f23fc65 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -19,6 +19,7 @@ #include static const struct file_operations fuse_direct_io_file_operations; +static void fuse_sync_writes(struct inode *inode); static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file, int opcode, struct fuse_open_out *outargp) @@ -399,6 +400,14 @@ static int fuse_flush(struct file *file, fl_owner_t id) if (fc->no_flush) return 0; + err = filemap_write_and_wait(file->f_mapping); + if (err) + return err; + + mutex_lock(&inode->i_mutex); + fuse_sync_writes(inode); + mutex_unlock(&inode->i_mutex); + req = fuse_get_req_nofail_nopages(fc, file); memset(&inarg, 0, sizeof(inarg)); inarg.fh = ff->fh;
[PATCH 11/11] fuse: Turn writeback cache on
From: Pavel Emelyanov Introduce a flag that the kernel and userspace exchange with each other at init stage, and turn writeback on if userspace wants it and the mount option 'allow_wbcache' is present (controlled by fusermount). Also add each writable file into the per-inode write list and call generic_file_aio_write to make use of the Linux page cache engine. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c|5 + fs/fuse/fuse_i.h |4 fs/fuse/inode.c | 13 + include/uapi/linux/fuse.h |2 ++ 4 files changed, 24 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 653ae9c..aa38e77 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -208,6 +208,8 @@ void fuse_finish_open(struct inode *inode, struct file *file) spin_unlock(&fc->lock); fuse_invalidate_attr(inode); } + if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache) + fuse_link_write_file(file); } int fuse_open_common(struct inode *inode, struct file *file, bool isdir) @@ -1228,6 +1230,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov, struct iov_iter i; loff_t endbyte = 0; + if (get_fuse_conn(inode)->writeback_cache) + return generic_file_aio_write(iocb, iov, nr_segs, pos); + WARN_ON(iocb->ki_pos != pos); ocount = 0; diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 129af78..76c4f77 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -44,6 +44,10 @@ doing the mount will be allowed to access the filesystem */ #define FUSE_ALLOW_OTHER (1 << 1) +/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem +module will enable support of writeback cache */ +#define FUSE_ALLOW_WBCACHE (1 << 2) + /** Number of page pointers embedded in fuse_req */ #define FUSE_REQ_INLINE_PAGES 1 diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 253701f..6ddf135 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -460,6 +460,7 @@ enum { OPT_ALLOW_OTHER, OPT_MAX_READ, OPT_BLKSIZE, + OPT_ALLOW_WBCACHE, OPT_ERR }; @@ -472,6 +473,7 @@ static const match_table_t tokens = { {OPT_ALLOW_OTHER, 
"allow_other"}, {OPT_MAX_READ, "max_read=%u"}, {OPT_BLKSIZE, "blksize=%u"}, + {OPT_ALLOW_WBCACHE, "allow_wbcache"}, {OPT_ERR, NULL} }; @@ -545,6 +547,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) d->blksize = value; break; + case OPT_ALLOW_WBCACHE: + d->flags |= FUSE_ALLOW_WBCACHE; + break; + default: return 0; } @@ -572,6 +578,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) seq_printf(m, ",max_read=%u", fc->max_read); if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE) seq_printf(m, ",blksize=%lu", sb->s_blocksize); + if (fc->flags & FUSE_ALLOW_WBCACHE) + seq_puts(m, ",allow_wbcache"); return 0; } @@ -889,6 +897,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req) } if (arg->flags & FUSE_ASYNC_DIO) fc->async_dio = 1; + if (arg->flags & FUSE_WRITEBACK_CACHE && + fc->flags & FUSE_ALLOW_WBCACHE) + fc->writeback_cache = 1; } else { ra_pages = fc->max_read / PAGE_CACHE_SIZE; fc->no_lock = 1; @@ -917,6 +928,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req) FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ | FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA | FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO; + if (fc->flags & FUSE_ALLOW_WBCACHE) + arg->flags |= FUSE_WRITEBACK_CACHE; req->in.h.opcode = FUSE_INIT; req->in.numargs = 1; req->in.args[0].size = sizeof(*arg); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 60bb2f9..5c98dd1 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -219,6 +219,7 @@ struct fuse_file_lock { * FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one) * FUSE_READDIRPLUS_AUTO: adaptive readdirplus * FUSE_ASYNC_DIO: asynchronous direct I/O submission + * FUSE_WRITEBACK_CACHE: use
[PATCH 10/11] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
From: Pavel Emelyanov The problem is: 1. write cached data to a file 2. read directly from the same file (via another fd) The 2nd operation may read stale data, i.e. the one that was in a file before the 1st op. Problem is in how fuse manages writeback. When direct op occurs the core kernel code calls filemap_write_and_wait to flush all the cached ops in flight. But fuse acks the writeback right after the ->writepages callback exits w/o waiting for the real write to happen. Thus the subsequent direct op proceeds while the real writeback is still in flight. This is a problem for backends that reorder operation. Fix this by making the fuse direct IO callback explicitly wait on the in-flight writeback to finish. Changed in v2: - do not wait on writeback if fuse_direct_io() call came from CUSE (because it doesn't use fuse inodes) Signed-off-by: Maxim Patlasov --- fs/fuse/cuse.c |5 +++-- fs/fuse/file.c | 49 +++-- fs/fuse/fuse_i.h | 13 - 3 files changed, 62 insertions(+), 5 deletions(-) diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c index adbfd66..9bf4bb2 100644 --- a/fs/fuse/cuse.c +++ b/fs/fuse/cuse.c @@ -95,7 +95,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count, struct iovec iov = { .iov_base = buf, .iov_len = count }; struct fuse_io_priv io = { .async = 0, .file = file }; - return fuse_direct_io(&io, &iov, 1, count, &pos, 0); + return fuse_direct_io(&io, &iov, 1, count, &pos, FUSE_DIO_CUSE); } static ssize_t cuse_write(struct file *file, const char __user *buf, @@ -109,7 +109,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf, * No locking or generic_write_checks(), the server is * responsible for locking and sanity checks. 
*/ - return fuse_direct_io(&io, &iov, 1, count, &pos, 1); + return fuse_direct_io(&io, &iov, 1, count, &pos, + FUSE_DIO_WRITE | FUSE_DIO_CUSE); } static int cuse_open(struct inode *inode, struct file *file) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index f23fc65..653ae9c 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -341,6 +341,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id) return (u64) v0 + ((u64) v1 << 32); } +static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from, + pgoff_t idx_to) +{ + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_req *req; + bool found = false; + + spin_lock(&fc->lock); + list_for_each_entry(req, &fi->writepages, writepages_entry) { + pgoff_t curr_index; + + BUG_ON(req->inode != inode); + curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT; + if (!(idx_from >= curr_index + req->num_pages || + idx_to < curr_index)) { + found = true; + break; + } + } + spin_unlock(&fc->lock); + + return found; +} + /* * Check if page is under writeback * @@ -385,6 +410,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index) return 0; } +static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start, + size_t bytes) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + pgoff_t idx_from, idx_to; + + idx_from = start >> PAGE_CACHE_SHIFT; + idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT; + + wait_event(fi->page_waitq, + !fuse_range_is_writeback(inode, idx_from, idx_to)); +} + static int fuse_flush(struct file *file, fl_owner_t id) { struct inode *inode = file_inode(file); @@ -1363,8 +1401,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p) ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov, unsigned long nr_segs, size_t count, loff_t *ppos, - int write) + int flags) { + int write = flags & FUSE_DIO_WRITE; + int cuse = flags & FUSE_DIO_CUSE; struct file *file = io->file; struct 
fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; @@ -1393,6 +1433,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov, break; } + if (!cuse) + fuse_wait_on_writeback(file->f_mapping->host, pos, + nbytes); + if (write) nres = fuse_send_write(req, io, pos, nbytes, owner); else @@ -1471,7 +1515,8
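[Editor's sketch] The list walk in fuse_range_is_writeback() above boils down to a closed-interval intersection test on page indices: a request covering pages [curr_index, curr_index + num_pages - 1] overlaps the waited-for range [idx_from, idx_to] unless it lies entirely before or entirely after it. A minimal Python model of just that predicate (illustrative only; the kernel code walks fuse_req structures under fc->lock):

```python
def ranges_overlap(idx_from, idx_to, curr_index, num_pages):
    """Mirror of the condition in fuse_range_is_writeback():
    request pages [curr_index, curr_index + num_pages - 1] vs
    waited range [idx_from, idx_to], all inclusive page indices."""
    return not (idx_from >= curr_index + num_pages or idx_to < curr_index)
```

fuse_wait_on_writeback() then simply converts the byte range of the direct I/O into such page-index bounds and sleeps on fi->page_waitq until no queued write request overlaps it.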
[PATCH 08/11] fuse: Implement write_begin/write_end callbacks -v2
From: Pavel Emelyanov The .write_begin and .write_end callbacks are required to use the generic routines (generic_file_aio_write --> ... --> generic_perform_write) for buffered writes. Changed in v2: - added fuse_wait_on_page_writeback() to fuse ->write_begin; this is crucial to avoid the loss of user data due to an incorrect crop of a secondary request to a stale inarg->size of the primary. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 100 1 file changed, 100 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 77eb849..642bd51 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1950,6 +1950,104 @@ out: return err; } +/* + * Determine the number of bytes of data the page contains + */ +static inline unsigned fuse_page_length(struct page *page) +{ + loff_t i_size = i_size_read(page_file_mapping(page)->host); + + if (i_size > 0) { + pgoff_t page_index = page_file_index(page); + pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT; + if (page_index < end_index) + return PAGE_CACHE_SIZE; + if (page_index == end_index) + return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1; + } + return 0; +} + +static int fuse_prepare_write(struct fuse_conn *fc, struct file *file, + struct page *page, loff_t pos, unsigned len) +{ + struct fuse_req *req = NULL; + unsigned num_read; + unsigned page_len; + int err; + + if (PageUptodate(page) || (len == PAGE_CACHE_SIZE)) { + fuse_wait_on_page_writeback(page->mapping->host, page->index); + return 0; + } + + page_len = fuse_page_length(page); + if (!page_len) { + fuse_wait_on_page_writeback(page->mapping->host, page->index); + zero_user(page, 0, PAGE_CACHE_SIZE); + return 0; + } + + num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL); + if (req) + fuse_put_request(fc, req); + if (err) { + unlock_page(page); + page_cache_release(page); + } else if (num_read != PAGE_CACHE_SIZE) { + zero_user_segment(page, num_read, PAGE_CACHE_SIZE); + } + return err; +} + +/* + * It would be worthwhile to ensure that space is reserved on disk for the write, + 
but how to implement it without killing performance needs more thought. + */ +static int fuse_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + pgoff_t index = pos >> PAGE_CACHE_SHIFT; + struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode); + + BUG_ON(!fc->writeback_cache); + + *pagep = grab_cache_page_write_begin(mapping, index, flags); + if (!*pagep) + return -ENOMEM; + + return fuse_prepare_write(fc, file, *pagep, pos, len); +} + +static int fuse_commit_write(struct file *file, struct page *page, + unsigned from, unsigned to) +{ + struct inode *inode = page->mapping->host; + loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + + if (!PageUptodate(page)) + SetPageUptodate(page); + + fuse_write_update_size(inode, pos); + set_page_dirty(page); + return 0; +} + +static int fuse_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + unsigned from = pos & (PAGE_CACHE_SIZE - 1); + + fuse_commit_write(file, page, from, from+copied); + + unlock_page(page); + page_cache_release(page); + + return copied; +} + static int fuse_launder_page(struct page *page) { int err = 0; @@ -2974,6 +3072,8 @@ static const struct address_space_operations fuse_file_aops = { .set_page_dirty = __set_page_dirty_nobuffers, .bmap = fuse_bmap, .direct_IO = fuse_direct_IO, + .write_begin= fuse_write_begin, + .write_end = fuse_write_end, }; void fuse_init_file_inode(struct inode *inode)
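[Editor's sketch] fuse_page_length() in the patch above is plain arithmetic on i_size. A standalone model of the same computation (Python, assuming 4 KiB pages, i.e. PAGE_CACHE_SHIFT == 12; the kernel version reads i_size and the page index from the page itself):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86

def fuse_page_length(i_size, page_index):
    """Bytes of file data that fall into the page at page_index,
    mirroring fuse_page_length() from the patch: a full page before
    the last page, a partial tail on the last page, 0 beyond EOF."""
    if i_size > 0:
        end_index = (i_size - 1) // PAGE_SIZE       # (i_size-1) >> PAGE_CACHE_SHIFT
        if page_index < end_index:
            return PAGE_SIZE
        if page_index == end_index:
            return ((i_size - 1) % PAGE_SIZE) + 1   # ((i_size-1) & ~PAGE_CACHE_MASK) + 1
    return 0
```

This is what lets fuse_prepare_write() skip the read entirely (and just zero the page) when the page lies wholly beyond EOF.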
[PATCH 06/11] fuse: Flush files on wb close -v2
Changed in v2: - updated patch to be applied smoothly on top of "Trust kernel i_mtime only -v2". Signed-off-by: Maxim Patlasov --- fs/fuse/file.c |6 ++ 1 file changed, 6 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index eabe202..b4d4189 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file) static int fuse_release(struct inode *inode, struct file *file) { + struct fuse_conn *fc = get_fuse_conn(inode); + + /* see fuse_vma_close() for !writeback_cache case */ + if (fc->writeback_cache) + filemap_write_and_wait(file->f_mapping); + if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) fuse_flush_mtime(file, true);
[PATCH 05/11] fuse: Trust kernel i_mtime only -v2
Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Changed in v2 (thanks to Brian): - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY - simplified fuse_set_mtime_local() - abandoned optimizations: clearing the flag on some operations (like direct write) is doable, but may lead to unexpected changes of user-visible mtime. Signed-off-by: Maxim Patlasov --- fs/fuse/dir.c| 109 -- fs/fuse/file.c | 30 +-- fs/fuse/fuse_i.h |6 ++- fs/fuse/inode.c | 13 +- 4 files changed, 138 insertions(+), 20 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index f022968..eda248b 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc = get_fuse_conn(inode); /* see the comment in fuse_change_attributes() */ - if (fc->writeback_cache && S_ISREG(inode->i_mode)) + if (fc->writeback_cache && S_ISREG(inode->i_mode)) { attr->size = i_size_read(inode); + attr->mtime = inode->i_mtime.tv_sec; + attr->mtimensec = inode->i_mtime.tv_nsec; + } stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode) spin_unlock(&fc->lock); } +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req, + struct inode *inode, + struct fuse_setattr_in *inarg_p, + struct fuse_attr_out *outarg_p) +{ + req->in.h.opcode = FUSE_SETATTR; + req->in.h.nodeid = get_node_id(inode); + req->in.numargs = 1; + req->in.args[0].size = sizeof(*inarg_p); + req->in.args[0].value = inarg_p; + req->out.numargs = 1; + if (fc->minor < 9) + req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE; + else + req->out.args[0].size = sizeof(*outarg_p); + req->out.args[0].value = outarg_p; +} + +/* + * Flush 
inode->i_mtime to the server + */ +int fuse_flush_mtime(struct file *file, bool nofail) +{ + struct inode *inode = file->f_mapping->host; + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + struct fuse_setattr_in inarg; + struct fuse_attr_out outarg; + int err; + + if (nofail) { + req = fuse_get_req_nofail_nopages(fc, file); + } else { + req = fuse_get_req_nopages(fc); + if (IS_ERR(req)) + return PTR_ERR(req); + } + + memset(&inarg, 0, sizeof(inarg)); + memset(&outarg, 0, sizeof(outarg)); + + inarg.valid |= FATTR_MTIME; + inarg.mtime = inode->i_mtime.tv_sec; + inarg.mtimensec = inode->i_mtime.tv_nsec; + + fuse_setattr_fill(fc, req, inode, &inarg, &outarg); + fuse_request_send(fc, req); + err = req->out.h.error; + fuse_put_request(fc, req); + + if (!err) + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + + return err; +} + +/* + * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. + */ +static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode) +{ + unsigned ivalid = iattr->ia_valid; + struct fuse_inode *fi = get_fuse_inode(inode); + + if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) { + /* client fs has just set mtime to iattr->ia_mtime */ + inode->i_mtime = iattr->ia_mtime; + clear_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) { + /* client fs doesn't know that we're updating i_mtime */ + inode->i_mtime = current_fs_time(inode->i_sb); + set_bit(FUSE_I_MTIME_DIRTY, &fi->state); + } +} + /* * Set attributes, and at the same time refresh them. * @@ -1641,17 +1720,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, inarg.valid |= FATTR_LOCKOWNER; inarg.lock_owner = fuse_lock_owner_id(fc, current->files); } - req->in.h.opcode = FUSE_SETATTR; - req->in.h.nodeid = get_node_id(inode); - req->in.numargs = 1; - req->in.args[0].size = sizeof(inarg); - req->in.args[0].value = &inarg;
[PATCH 07/11] fuse: restructure fuse_readpage()
Move the code filling and sending read request to a separate function. Future patches will use it for .write_begin -- partial modification of a page requires reading the page from the storage very similarly to what fuse_readpage does. Signed-off-by: Maxim Patlasov --- fs/fuse/file.c | 55 +-- 1 file changed, 37 insertions(+), 18 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index b4d4189..77eb849 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, } } -static int fuse_readpage(struct file *file, struct page *page) +static int __fuse_readpage(struct file *file, struct page *page, size_t count, + int *err, struct fuse_req **req_pp, u64 *attr_ver_p) { struct fuse_io_priv io = { .async = 0, .file = file }; struct inode *inode = page->mapping->host; struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_req *req; size_t num_read; - loff_t pos = page_offset(page); - size_t count = PAGE_CACHE_SIZE; - u64 attr_ver; - int err; - - err = -EIO; - if (is_bad_inode(inode)) - goto out; /* * Page writeback can extend beyond the lifetime of the @@ -724,20 +717,45 @@ static int fuse_readpage(struct file *file, struct page *page) fuse_wait_on_page_writeback(inode, page->index); req = fuse_get_req(fc, 1); - err = PTR_ERR(req); + *err = PTR_ERR(req); if (IS_ERR(req)) - goto out; + return 0; - attr_ver = fuse_get_attr_version(fc); + if (attr_ver_p) + *attr_ver_p = fuse_get_attr_version(fc); req->out.page_zeroing = 1; req->out.argpages = 1; req->num_pages = 1; req->pages[0] = page; req->page_descs[0].length = count; - num_read = fuse_send_read(req, &io, pos, count, NULL); - err = req->out.h.error; + num_read = fuse_send_read(req, &io, page_offset(page), count, NULL); + *err = req->out.h.error; + + if (*err) + fuse_put_request(fc, req); + else + *req_pp = req; + + return num_read; +} + +static int fuse_readpage(struct file *file, struct page *page) +{ + struct inode *inode = 
page->mapping->host; + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_req *req = NULL; + size_t num_read; + size_t count = PAGE_CACHE_SIZE; + u64 attr_ver = 0; + int err; + + err = -EIO; + if (is_bad_inode(inode)) + goto out; + + num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver); if (!err) { /* * Short read means EOF. If file size is larger, truncate it @@ -747,10 +765,11 @@ static int fuse_readpage(struct file *file, struct page *page) SetPageUptodate(page); } - - fuse_put_request(fc, req); - fuse_invalidate_attr(inode); /* atime changed */ - out: + if (req) { + fuse_put_request(fc, req); + fuse_invalidate_attr(inode); /* atime changed */ + } +out: unlock_page(page); return err; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 03/11] fuse: Connection bit for enabling writeback
From: Pavel Emelyanov Off (0) by default. Will be used in the next patches and will be turned on at the very end. Signed-off-by: Maxim Patlasov Signed-off-by: Pavel Emelyanov --- fs/fuse/fuse_i.h |3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index a08b355..a490ba3 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -479,6 +479,9 @@ struct fuse_conn { /** Set if bdi is valid */ unsigned bdi_initialized:1; + /** write-back cache policy (default is write-through) */ + unsigned writeback_cache:1; + /* * The following bitfields are only for optimization purposes * and hence races in setting them will not cause malfunction --
[PATCH 04/11] fuse: Trust kernel i_size only - v4
From: Pavel Emelyanov Make fuse think that when writeback is on the inode's i_size is always up-to-date and not update it with the value received from the userspace. This is done because the page cache code may update i_size without letting the FS know. This assumption implies fixing the previously introduced short-read helper -- when a short read occurs the 'hole' is filled with zeroes. fuse_file_fallocate() is also fixed because now we should keep i_size up to date, so it must be updated if FUSE_FALLOCATE request succeeded. Changed in v2: - improved comment in fuse_short_read() - fixed fuse_file_fallocate() for KEEP_SIZE mode Changed in v3: - fixed fuse_fillattr() not to use local i_size if writeback-cache is off - added a comment explaining why we cannot trust attr.size from server Changed in v4: - do not change fuse_file_fallocate() because what we need is already implemented in commits a200a2d and 14c1441 Signed-off-by: Maxim V. Patlasov --- fs/fuse/dir.c | 13 +++-- fs/fuse/file.c | 26 -- fs/fuse/inode.c | 11 +-- 3 files changed, 44 insertions(+), 6 deletions(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 62b43b5..f022968 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -854,6 +854,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct kstat *stat) { unsigned int blkbits; + struct fuse_conn *fc = get_fuse_conn(inode); + + /* see the comment in fuse_change_attributes() */ + if (fc->writeback_cache && S_ISREG(inode->i_mode)) + attr->size = i_size_read(inode); stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; @@ -1594,6 +1599,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, struct fuse_setattr_in inarg; struct fuse_attr_out outarg; bool is_truncate = false; + bool is_wb = fc->writeback_cache; loff_t oldsize; int err; @@ -1665,7 +1671,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, fuse_change_attributes_common(inode, &outarg.attr, attr_timeout(&outarg)); oldsize = inode->i_size; - 
i_size_write(inode, outarg.attr.size); + /* see the comment in fuse_change_attributes() */ + if (!is_wb || is_truncate || !S_ISREG(inode->i_mode)) + i_size_write(inode, outarg.attr.size); if (is_truncate) { /* NOTE: this may release/reacquire fc->lock */ @@ -1677,7 +1685,8 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr, * Only call invalidate_inode_pages2() after removing * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock. */ - if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { + if ((is_truncate || !is_wb) && + S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { truncate_pagecache(inode, outarg.attr.size); invalidate_inode_pages2(inode->i_mapping); } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 886f3a2..333a764 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -658,9 +658,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver) { size_t num_read = req->out.args[0].size; + struct fuse_conn *fc = get_fuse_conn(inode); + + if (fc->writeback_cache) { + /* +* A hole in a file. Some data after the hole are in page cache, +* but have not reached the client fs yet. So, the hole is not +* present there. 
+*/ + int i; + int start_idx = num_read >> PAGE_CACHE_SHIFT; + size_t off = num_read & (PAGE_CACHE_SIZE - 1); + + for (i = start_idx; i < req->num_pages; i++) { + struct page *page = req->pages[i]; + void *mapaddr = kmap_atomic(page); - loff_t pos = page_offset(req->pages[0]) + num_read; - fuse_read_update_size(inode, pos, attr_ver); + memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off); + + kunmap_atomic(mapaddr); + off = 0; + } + } else { + loff_t pos = page_offset(req->pages[0]) + num_read; + fuse_read_update_size(inode, pos, attr_ver); + } } static int fuse_readpage(struct file *file, struct page *page) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index a8ce6da..63818ab 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -197,6 +197,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, { struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_inode *fi = get_fuse_inode(inode); + bool is_wb = fc->writeback_cache; loff_t oldsize; struct timespec old_mtime;
[PATCH 02/11] fuse: Prepare to handle short reads
From: Pavel Emelyanov A helper that gets called when a read returns fewer bytes than were requested. See patch #4 (trust kernel i_size only) for details. Signed-off-by: Maxim Patlasov Signed-off-by: Pavel Emelyanov --- fs/fuse/file.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 5e49c8f..886f3a2 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -654,6 +654,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size, spin_unlock(&fc->lock); } +static void fuse_short_read(struct fuse_req *req, struct inode *inode, + u64 attr_ver) +{ + size_t num_read = req->out.args[0].size; + + loff_t pos = page_offset(req->pages[0]) + num_read; + fuse_read_update_size(inode, pos, attr_ver); +} + static int fuse_readpage(struct file *file, struct page *page) { struct fuse_io_priv io = { .async = 0, .file = file }; @@ -691,18 +700,18 @@ static int fuse_readpage(struct file *file, struct page *page) req->page_descs[0].length = count; num_read = fuse_send_read(req, &io, pos, count, NULL); err = req->out.h.error; - fuse_put_request(fc, req); if (!err) { /* * Short read means EOF.
If file size is larger, truncate it */ - if (!req->out.h.error && num_read < count) { - loff_t pos; + if (!req->out.h.error && num_read < count) + fuse_short_read(req, inode, req->misc.read.attr_ver); - pos = page_offset(req->pages[0]) + num_read; - fuse_read_update_size(inode, pos, - req->misc.read.attr_ver); - } fuse_invalidate_attr(inode); /* atime changed */ } --
[PATCH 01/11] fuse: Linking file to inode helper
From: Pavel Emelyanov When writeback is ON every writeable file should be in per-inode write list, not only mmap-ed ones. Thus introduce a helper for this linkage. Signed-off-by: Maxim Patlasov Signed-off-by: Pavel Emelyanov --- fs/fuse/file.c | 33 +++-- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 3f219fd..5e49c8f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -171,6 +171,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, } EXPORT_SYMBOL_GPL(fuse_do_open); +static void fuse_link_write_file(struct file *file) +{ + struct inode *inode = file_inode(file); + struct fuse_conn *fc = get_fuse_conn(inode); + struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_file *ff = file->private_data; + /* +* file may be written through mmap, so chain it onto the +* inodes's write_file list +*/ + spin_lock(&fc->lock); + if (list_empty(&ff->write_entry)) + list_add(&ff->write_entry, &fi->write_files); + spin_unlock(&fc->lock); +} + void fuse_finish_open(struct inode *inode, struct file *file) { struct fuse_file *ff = file->private_data; @@ -1929,20 +1945,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = { static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma) { - if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) { - struct inode *inode = file_inode(file); - struct fuse_conn *fc = get_fuse_conn(inode); - struct fuse_inode *fi = get_fuse_inode(inode); - struct fuse_file *ff = file->private_data; - /* -* file may be written through mmap, so chain it onto the -* inodes's write_file list -*/ - spin_lock(&fc->lock); - if (list_empty(&ff->write_entry)) - list_add(&ff->write_entry, &fi->write_files); - spin_unlock(&fc->lock); - } + if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) + fuse_link_write_file(file); + file_accessed(file); vma->vm_ops = &fuse_file_vm_ops; return 0; --
[PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy
nch of the fuse tree (beba4ae4c2fda4e03a813aec640586087fa80a8b) - added waiting for fuse writeback to ->write_begin fuse address space operation. Thanks, Maxim --- Maxim Patlasov (9): fuse: Linking file to inode helper fuse: Connection bit for enabling writeback fuse: Trust kernel i_size only - v4 fuse: Trust kernel i_mtime only -v2 fuse: Flush files on wb close -v2 fuse: restructure fuse_readpage() fuse: fuse_flush() should wait on writeback fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 fuse: Turn writeback cache on Pavel Emelyanov (2): fuse: Prepare to handle short reads fuse: Implement write_begin/write_end callbacks -v2 fs/fuse/cuse.c|5 - fs/fuse/dir.c | 120 +++- fs/fuse/file.c| 328 +++-- fs/fuse/fuse_i.h | 26 +++- fs/fuse/inode.c | 37 - include/uapi/linux/fuse.h |2 6 files changed, 451 insertions(+), 67 deletions(-)