Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2013-04-26 Thread Maxim V. Patlasov

Miklos, MM folks,

04/26/2013 06:02 PM, Miklos Szeredi writes:

On Fri, Apr 26, 2013 at 12:32:24PM +0400, Maxim V. Patlasov wrote:


The idea is that fuse filesystems should not go over the bdi limit even if
the global limit hasn't been reached.

This might work, but kicking the flusher every time someone writes to a
fuse mount and dives into balance_dirty_pages looks fishy.

Yeah.  Fixed patch attached.


The patch didn't work for me. I'll investigate what's wrong and get back 
to you later.





Let's combine
our suggestions: mark fuse inodes with AS_FUSE_WRITEBACK flag and
convert what you strongly dislike above to:

if (test_bit(AS_FUSE_WRITEBACK, &mapping->flags))
nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

I don't think this is right.  The fuse daemon could itself be writing to another
fuse filesystem, in which case blocking because of NR_WRITEBACK_TEMP being high
isn't a smart strategy.


Please don't say 'blocking'. Per-bdi checks will decide whether to block 
or not. In the case you set forth, relying on per-bdi checks would be 
completely fine for the upper fuse: it may and should block for a while 
if the lower fuse doesn't catch up.




Furthermore it isn't enough.  Because the root problem, I think, is that we
allow fuse filesystems to grow a large number of dirty pages before throttling.
This was never intended and it may actually have worked properly at a point in
time but was broken by some change to the dirty throttling algorithm.


Could someone from the mm list step in and comment on this point? Which 
approach is better to follow: account NR_WRITEBACK_TEMP in 
balance_dirty_pages accurately (as we discussed at LSF/MM) or re-work 
balance_dirty_pages in the direction suggested by Miklos (fuse should never 
go over the bdi limit even if the global limit hasn't been reached)?


I'm for accounting NR_WRITEBACK_TEMP because balance_dirty_pages is 
already overcomplicated (imho) and adding new clauses for FUSE makes me 
sick.


Thanks,
Maxim



Thanks,
Miklos


diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 137185c..195ee45 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -291,6 +291,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
inode->i_flags |= S_NOATIME|S_NOCMTIME;
inode->i_generation = generation;
inode->i_data.backing_dev_info = &fc->bdi;
+   set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0e38e13..97f6a0c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE  = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+   AS_STRICTLIMIT  = __GFP_BITS_SHIFT + 5, /* strict dirty limit */
  };
  
  static inline void mapping_set_error(struct address_space *mapping, int error)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..b6db421 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1226,6 +1226,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+   int strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
unsigned long start_time = jiffies;
  
  	for (;;) {

@@ -1250,7 +1251,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 */
freerun = dirty_freerun_ceiling(dirty_thresh,
background_thresh);
-   if (nr_dirty <= freerun) {
+   if (nr_dirty <= freerun && !strictlimit) {
current->dirty_paused_when = now;
current->nr_dirtied = 0;
current->nr_dirtied_pause =
@@ -1258,7 +1259,7 @@ static void balance_dirty_pages(struct address_space *mapping,
break;
}
  
-		if (unlikely(!writeback_in_progress(bdi)))

+   if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
bdi_start_background_writeback(bdi);
  
  		/*

@@ -1296,8 +1297,12 @@ static void balance_dirty_pages(struct address_space *mapping,
bdi_stat(bdi, BDI_WRITEBACK);
}
  
+		if (unlikely(!writeback_in_progress(bdi)) &&

+   bdi_dirty > bdi_thresh / 2)
+   bdi_start_background_writeback(bdi);
+
dirty_exceeded = (bdi_dirty > bdi_thresh) &&
- (nr_dirty > dirty_thresh);
+ ((nr_dirty > dirty_thresh) || strictlimit);

Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2013-04-26 Thread Maxim V. Patlasov

Hi Miklos,

04/26/2013 12:43 AM, Miklos Szeredi writes:

On Thu, Apr 25, 2013 at 08:16:45PM +0400, Maxim V. Patlasov wrote:

As Mel Gorman pointed out, the fuse daemon diving into
balance_dirty_pages should not kick the flusher based on
NR_WRITEBACK_TEMP. Essentially, all we need in balance_dirty_pages
is:

 if (I'm not fuse daemon)
 nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

I strongly dislike the above.


The above was well-discussed on the mm track of LSF/MM. Everybody seemed to 
agree with the solution above. I'm cc-ing some people who were involved in 
the discussion, the mm mailing list and Andrew as well. For those who haven't 
followed from the beginning, here is an excerpt:



04/25/2013 07:49 PM, Miklos Szeredi writes:

On Thu, Apr 25, 2013 at 4:29 PM, Maxim V. Patlasov
mpatla...@parallels.com  wrote:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..c47bcd4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
   */
  nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+   global_page_state(NR_WRITEBACK_TEMP);
  global_dirty_limits(&background_thresh, &dirty_thresh);

Please drop this patch. As we discussed at LSF/MM, the fix above is correct,
but it's not enough: we also need to ensure that NR_WRITEBACK_TEMP is
disregarded when balance_dirty_pages() is called from the fuse daemon. I'll
send a separate patch-set soon.

Please elaborate.  From a technical perspective "fuse daemon" is very
hard to define, so anything that relies on whether something came from
the fuse daemon or not is conceptually broken.

As Mel Gorman pointed out, the fuse daemon diving into balance_dirty_pages
should not kick the flusher based on NR_WRITEBACK_TEMP. Essentially, all
we need in balance_dirty_pages is:

  if (I'm not fuse daemon)
  nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

The way to identify the fuse daemon was not thoroughly scrutinized
during LSF/MM. At first, I thought it would be enough to set a
per-process flag when handling fuse device open. But now I understand that
a fuse daemon may be quite a complicated multi-threaded, multi-process
construction. I'm going to add a new FUSE_NOTIFY to let the fuse daemon
decide when it works on behalf of draining writeouts. Having in mind
that fuse-lib is multi-threaded, I'm also going to inherit the flag on
copy_process(). Does this make sense to you?

Also, another patch will put this ad-hoc FUSE_NOTIFY under fusermount
control. This will prevent unprivileged fuse mounts from setting the
flag for malicious purposes.


And returning to Miklos' last mail...



What about something like the following untested patch?

The idea is that fuse filesystems should not go over the bdi limit even if the
global limit hasn't been reached.


This might work, but kicking the flusher every time someone writes to a 
fuse mount and dives into balance_dirty_pages looks fishy. However, setting 
an ad-hoc inode flag for files on fuse makes much more sense than my 
approach of identifying fuse daemons (a feeble hope that userspace 
daemons would notify in-kernel fuse saying "I'm the fuse daemon, please 
disregard NR_WRITEBACK_TEMP for me"). Let's combine our suggestions: 
mark fuse inodes with the AS_FUSE_WRITEBACK flag and convert what you 
strongly dislike above to:


if (test_bit(AS_FUSE_WRITEBACK, &mapping->flags))
nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

Thanks,
Maxim



Thanks,
Miklos

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 137185c..195ee45 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -291,6 +291,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
inode->i_flags |= S_NOATIME|S_NOCMTIME;
inode->i_generation = generation;
inode->i_data.backing_dev_info = &fc->bdi;
+   set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0e38e13..97f6a0c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE  = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+   AS_STRICTLIMIT  = __GFP_BITS_SHIFT + 5, /* strict dirty limit */
  };
  
  static inline void mapping_set_error(struct address_space *mapping, int error)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..91a9e6e 100644
--- a/mm/page-writeback.c


Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2013-04-25 Thread Maxim V. Patlasov

Hi,

04/25/2013 07:49 PM, Miklos Szeredi writes:

On Thu, Apr 25, 2013 at 4:29 PM, Maxim V. Patlasov
mpatla...@parallels.com wrote:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..c47bcd4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
  */
 nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+   global_page_state(NR_WRITEBACK_TEMP);
 global_dirty_limits(&background_thresh, &dirty_thresh);


Please drop this patch. As we discussed in LSF/MM, the fix above is correct,
but it's not enough: we also need to ensure disregard of NR_WRITEBACK_TEMP
when balance_dirty_pages() is called from fuse daemon. I'll send a separate
patch-set soon.

Please elaborate.  From a technical perspective "fuse daemon" is very
hard to define, so anything that relies on whether something came from
the fuse daemon or not is conceptually broken.


As Mel Gorman pointed out, the fuse daemon diving into balance_dirty_pages 
should not kick the flusher based on NR_WRITEBACK_TEMP. Essentially, all 
we need in balance_dirty_pages is:


if (I'm not fuse daemon)
nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

The way to identify the fuse daemon was not thoroughly scrutinized 
during LSF/MM. At first, I thought it would be enough to set a 
per-process flag when handling fuse device open. But now I understand that 
a fuse daemon may be quite a complicated multi-threaded, multi-process 
construction. I'm going to add a new FUSE_NOTIFY to let the fuse daemon 
decide when it works on behalf of draining writeouts. Having in mind 
that fuse-lib is multi-threaded, I'm also going to inherit the flag on 
copy_process(). Does this make sense to you?


Also, another patch will put this ad-hoc FUSE_NOTIFY under fusermount 
control. This will prevent unprivileged fuse mounts from setting the 
flag for malicious purposes.


Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2013-04-25 Thread Maxim V. Patlasov

Hi Miklos,

04/01/2013 02:42 PM, Maxim V. Patlasov writes:

Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is high enough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.

Note, that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in the function this means, that it either
writes to the mountpoint it serves (but it can deadlock even without
the writeback) or it is writing to some _other_ dirty bdi and in the
latter case someone else will free the memory for it.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
Signed-off-by: Pavel Emelyanov xe...@openvz.org
---
  mm/page-writeback.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..c47bcd4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 */
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+   global_page_state(NR_WRITEBACK_TEMP);
  
  		global_dirty_limits(&background_thresh, &dirty_thresh);


Please drop this patch. As we discussed at LSF/MM, the fix above is 
correct, but it's not enough: we also need to ensure that 
NR_WRITEBACK_TEMP is disregarded when balance_dirty_pages() is called 
from the fuse daemon. I'll send a separate patch-set soon.


Thanks,
Maxim






Re: [PATCH 2/6] fuse: add support of async IO

2013-04-23 Thread Maxim V. Patlasov

Hi Miklos,

04/22/2013 08:34 PM, Miklos Szeredi writes:

On Fri, Dec 14, 2012 at 07:20:41PM +0400, Maxim V. Patlasov wrote:

The patch implements a framework to process an IO request asynchronously. The
idea is to associate several fuse requests with a single kiocb by means of
fuse_io_priv structure. The structure plays the same role for FUSE as 'struct
dio' for direct-io.c.

The framework is supposed to be used like this:
  - someone (who wants to process an IO asynchronously) allocates fuse_io_priv
and initializes it setting 'async' field to non-zero value.
  - as soon as fuse request is filled, it can be submitted (in non-blocking way)
by fuse_async_req_send()
  - when all submitted requests are ACKed by userspace, io->reqs drops to zero
triggering aio_complete()

In case of IO initiated by libaio, aio_complete() will finish processing the
same way as in case of dio_complete() calling aio_complete(). But the
framework may be also used for internal FUSE use when initial IO request
was synchronous (from user perspective), but it's beneficial to process it
asynchronously. Then the caller should wait on kiocb explicitly and
aio_complete() will wake the caller up.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
  fs/fuse/file.c   |   92 ++
  fs/fuse/fuse_i.h |   17 ++
  2 files changed, 109 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6685cb0..8dd931f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -503,6 +503,98 @@ static void fuse_release_user_pages(struct fuse_req *req, int write)
}
  }
  
+/**

+ * In case of short read, the caller sets 'pos' to the position of
+ * actual end of fuse request in IO request. Otherwise, if bytes_requested
+ * == bytes_transferred or rw == WRITE, the caller sets 'pos' to -1.
+ *
+ * An example:
+ * User requested DIO read of 64K. It was split into two 32K fuse requests,
+ * both submitted asynchronously. The first of them was ACKed by userspace as
+ * fully completed (req->out.args[0].size == 32K) resulting in pos == -1. The
+ * second request was ACKed as short, e.g. only 1K was read, resulting in
+ * pos == 33K.
+ *
+ * Thus, when all fuse requests are completed, the minimal non-negative 'pos'
+ * will be equal to the length of the longest contiguous fragment of
+ * transferred data starting from the beginning of IO request.
+ */
+static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos)
+{
+   int left;
+
+   spin_lock(&io->lock);
+   if (err)
+   io->err = io->err ? : err;
+   else if (pos >= 0 && (io->bytes < 0 || pos < io->bytes))
+   io->bytes = pos;
+
+   left = --io->reqs;
+   spin_unlock(&io->lock);
+
+   if (!left) {
+   long res;
+
+   if (io->err)
+   res = io->err;
+   else if (io->bytes >= 0 && io->write)
+   res = -EIO;
+   else {
+   res = io->bytes < 0 ? io->size : io->bytes;
+
+   if (!is_sync_kiocb(io->iocb)) {
+   struct path *path = &io->iocb->ki_filp->f_path;
+   struct inode *inode = path->dentry->d_inode;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+
+   spin_lock(&fc->lock);
+   fi->attr_version = ++fc->attr_version;
+   spin_unlock(&fc->lock);

Hmm, what is this?  Incrementing the attr version without setting any attributes
doesn't make sense.


It makes sense at least for writes. __fuse_direct_write() always called 
fuse_write_update_size() and the latter always incremented attr_version, 
even if *ppos <= inode->i_size. I believe it was implemented this way 
intentionally: if a write succeeded, the file has changed on the server, 
so attrs requested from the server earlier should be regarded as stale.


With async IO support added to fuse, a case emerges where 
fuse_write_update_size() won't be called: an incoming direct IO write is 
asynchronous (e.g. it came from libaio) and is not an extending write, so 
it's allowable to process it by submitting fuse requests to the background 
and returning -EIOCBQUEUED without waiting for completions (see the 4th 
patch of this patch-set). But in this case the file on the server will be 
changed anyway. That's why I bump attr_version in fuse_aio_complete() -- 
to be consistent with the model we had before this patch-set.


The fact that I did the trick both for writes and reads was probably an 
oversight. I'd suggest fixing it like this:



-   if (!is_sync_kiocb(io->iocb)) {
+   if (!is_sync_kiocb(io->iocb) && io->write) {


Thanks,
Maxim

Re: [PATCH 2/6] fuse: add support of async IO

2013-04-23 Thread Maxim V. Patlasov

Hi Miklos,

04/22/2013 08:34 PM, Miklos Szeredi пишет:

On Fri, Dec 14, 2012 at 07:20:41PM +0400, Maxim V. Patlasov wrote:

The patch implements a framework to process an IO request asynchronously. The
idea is to associate several fuse requests with a single kiocb by means of
fuse_io_priv structure. The structure plays the same role for FUSE as 'struct
dio' for direct-io.c.

The framework is supposed to be used like this:
  - someone (who wants to process an IO asynchronously) allocates fuse_io_priv
and initializes it setting 'async' field to non-zero value.
  - as soon as fuse request is filled, it can be submitted (in non-blocking way)
by fuse_async_req_send()
  - when all submitted requests are ACKed by userspace, io-reqs drops to zero
triggering aio_complete()

In case of IO initiated by libaio, aio_complete() will finish processing the
same way as in case of dio_complete() calling aio_complete(). But the
framework may be also used for internal FUSE use when initial IO request
was synchronous (from user perspective), but it's beneficial to process it
asynchronously. Then the caller should wait on kiocb explicitly and
aio_complete() will wake the caller up.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
  fs/fuse/file.c   |   92 ++
  fs/fuse/fuse_i.h |   17 ++
  2 files changed, 109 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6685cb0..8dd931f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -503,6 +503,98 @@ static void fuse_release_user_pages(struct fuse_req *req, int write)
}
  }
  
+/**

+ * In case of short read, the caller sets 'pos' to the position of
+ * actual end of fuse request in IO request. Otherwise, if bytes_requested
+ * == bytes_transferred or rw == WRITE, the caller sets 'pos' to -1.
+ *
+ * An example:
+ * User requested DIO read of 64K. It was split into two 32K fuse requests,
+ * both submitted asynchronously. The first of them was ACKed by userspace as
+ * fully completed (req->out.args[0].size == 32K) resulting in pos == -1. The
+ * second request was ACKed as short, e.g. only 1K was read, resulting in
+ * pos == 33K.
+ *
+ * Thus, when all fuse requests are completed, the minimal non-negative 'pos'
+ * will be equal to the length of the longest contiguous fragment of
+ * transferred data starting from the beginning of IO request.
+ */
+static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos)
+{
+   int left;
+
+   spin_lock(&io->lock);
+   if (err)
+   io->err = io->err ? : err;
+   else if (pos >= 0 && (io->bytes < 0 || pos < io->bytes))
+   io->bytes = pos;
+
+   left = --io->reqs;
+   spin_unlock(&io->lock);
+
+   if (!left) {
+   long res;
+
+   if (io->err)
+   res = io->err;
+   else if (io->bytes >= 0 && io->write)
+   res = -EIO;
+   else {
+   res = io->bytes < 0 ? io->size : io->bytes;
+
+   if (!is_sync_kiocb(io->iocb)) {
+   struct path *path = &io->iocb->ki_filp->f_path;
+   struct inode *inode = path->dentry->d_inode;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+
+   spin_lock(&fc->lock);
+   fi->attr_version = ++fc->attr_version;
+   spin_unlock(&fc->lock);

Hmm, what is this?  Incrementing the attr version without setting any attributes
doesn't make sense.


It makes sense at least for writes. __fuse_direct_write() always called 
fuse_write_update_size() and the latter always incremented attr_version, 
even if *ppos <= inode->i_size. I believed it was implemented this 
way intentionally: if the write succeeded, the file has changed on the server, 
hence attrs requested from the server earlier should be regarded as stale.


With async IO support added to fuse, a case emerges where 
fuse_write_update_size() won't be called: an incoming direct IO write is 
asynchronous (e.g. it came from libaio) and it's not an extending write, so 
it's allowable to process it by submitting fuse requests to background 
and returning -EIOCBQUEUED without waiting for completions (see the 4th patch 
of this patch-set). But in this case the file on the server will be changed 
anyway. That's why I bump attr_version in fuse_aio_complete() -- to be 
consistent with the model we had before this patch-set.


The fact that I did the trick both for writes and reads was probably an 
oversight. I'd suggest fixing it like this:



-   if (!is_sync_kiocb(io->iocb)) {
+   if (!is_sync_kiocb(io->iocb) && io->write) {


Thanks,
Maxim
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [fuse-devel] [PATCH v2 0/6] fuse: process direct IO asynchronously

2013-04-11 Thread Maxim V. Patlasov

Hi,

04/11/2013 08:07 PM, Miklos Szeredi пишет:

Hi Maxim,

On Thu, Apr 11, 2013 at 1:22 PM, Maxim V. Patlasov wrote:

Hi Miklos,

Any feedback would be highly appreciated.

What is the order of all these patchsets with regards to each other?


They are logically independent, so I formed them to be applied independently 
of each other. There might be some minor collisions between them (if you try to 
apply one patch-set on top of another). So, as soon as you get one 
of them into fuse-next, I'll update the others to apply smoothly. Alternatively, 
we can settle on some order now, and I'll do it in advance.


Thanks,
Maxim


Re: [fuse-devel] [PATCH v2 0/6] fuse: process direct IO asynchronously

2013-04-11 Thread Maxim V. Patlasov

Hi Miklos,

Any feedback would be highly appreciated.

Thanks,
Maxim

12/14/2012 07:20 PM, Maxim V. Patlasov пишет:

Hi,

The existing fuse implementation always processes direct IO synchronously: it
submits the next request to userspace fuse only when the previous one is
completed. This is suboptimal because: 1) libaio DIO works in a blocking way;
2) userspace fuse can't achieve parallelism by processing several requests
simultaneously (e.g. in case of distributed network storage); 3) userspace
fuse can't merge requests before passing them to actual storage.

The idea of the patch-set is to submit fuse requests in a non-blocking way
(where it's possible) and either return -EIOCBQUEUED or wait for their
completion synchronously. The patch-set is to be applied on top of for-next of
Miklos' git repo.

To estimate performance improvement I used slightly modified fusexmp over
tmpfs (clearing O_DIRECT bit from fi->flags in xmp_open). For synchronous
operations I used 'dd' like this:

dd of=/dev/null if=/fuse/mnt/file bs=2M count=256 iflag=direct
dd if=/dev/zero of=/fuse/mnt/file bs=2M count=256 oflag=direct conv=notrunc

For AIO I used 'aio-stress' like this:

aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 1 /fuse/mnt/file
aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 0 /fuse/mnt/file

The throughput on some commodity (rather feeble) server was (in MB/sec):

              original / patched

dd reads:       ~322 / ~382
dd writes:      ~277 / ~288

aio reads:      ~380 / ~459
aio writes:     ~319 / ~353

Changed in v2 - cleanups suggested by Brian:
  - Updated fuse_io_priv with an async field and file pointer to preserve
the current style of interface (i.e., use this instead of iocb).
  - Trigger the type of request submission based on the async field.
  - Pulled up the fuse_write_update_size() call out of __fuse_direct_write()
to make the separate paths more consistent.

Thanks,
Maxim

---

Maxim V. Patlasov (6):
   fuse: move fuse_release_user_pages() up
   fuse: add support of async IO
   fuse: make fuse_direct_io() aware about AIO
   fuse: enable asynchronous processing direct IO
   fuse: truncate file if async dio failed
   fuse: optimize short direct reads


  fs/fuse/cuse.c   |6 +
  fs/fuse/file.c   |  290 +++---
  fs/fuse/fuse_i.h |   19 +++-
  3 files changed, 276 insertions(+), 39 deletions(-)





Re: [fuse-devel] [PATCH 0/5] fuse: close file synchronously

2013-04-11 Thread Maxim V. Patlasov

Hi Miklos,

Any feedback would be highly appreciated.

Thanks,
Maxim

12/20/2012 04:30 PM, Maxim Patlasov пишет:

Hi,

There is a long-standing demand for synchronous behaviour of fuse_release:

http://sourceforge.net/mailarchive/message.php?msg_id=19343889
http://sourceforge.net/mailarchive/message.php?msg_id=29814693

A few months ago Avati and I explained why such a feature would be useful:

http://sourceforge.net/mailarchive/message.php?msg_id=29889055
http://sourceforge.net/mailarchive/message.php?msg_id=29867423

In short, the problem is that fuse_release (that's called on the last user
close(2)) sends FUSE_RELEASE to userspace and returns without waiting for an
ACK from userspace. Consequently, there is a gap when the user regards the
file as released while userspace fuse is still working on it. An attempt to
access the file from another node leads to complicated synchronization
problems because the first node still "holds" the file.

The patch-set resolves the problem by making fuse_release synchronous:
wait for ACK from userspace for FUSE_RELEASE if the feature is ON.

To keep single-threaded userspace implementations happy the patch-set
ensures that by the time fuse_release_common calls fuse_file_put, no
more in-flight I/O exists. Asynchronous fuse callbacks (like
fuse_readpages_end) cannot trigger FUSE_RELEASE anymore. Hence, we'll
never block in contexts other than close().

Thanks,
Maxim

---

Maxim Patlasov (5):
   fuse: add close_wait flag to fuse_conn
   fuse: cosmetic rework of fuse_send_readpages
   fuse: wait for end of IO on release
   fuse: enable close_wait feature
   fuse: fix synchronous case of fuse_file_put()


  fs/fuse/file.c|   82 ++---
  fs/fuse/fuse_i.h  |3 ++
  fs/fuse/inode.c   |5 ++-
  include/uapi/linux/fuse.h |7 +++-
  4 files changed, 82 insertions(+), 15 deletions(-)





Re: [fuse-devel] [PATCH 0/4] fuse: fix accounting background requests (v2)

2013-04-11 Thread Maxim V. Patlasov

Hi Miklos,

Any feedback would be highly appreciated.

Thanks,
Maxim

03/21/2013 06:01 PM, Maxim V. Patlasov пишет:

Hi,

The feature was added a long time ago (commit 08a53cdc...) with the comment:


A task may have at most one synchronous request allocated.  So these requests
need not be otherwise limited.

However the number of background requests (release, forget, asynchronous
reads, interrupted requests) can grow indefinitely.  This can be used by a
malicious user to cause FUSE to allocate arbitrary amounts of unswappable
kernel memory, denying service.

For this reason add a limit for the number of background requests, and block
allocations of new requests until the number goes below the limit.

However, the implementation suffers from the following problems:

1. Latency of synchronous requests. As soon as fc->num_background hits the
limit, all allocations are blocked: both for synchronous and background
requests. This is unnecessary - as the comment cited above states, synchronous
requests need not be limited (by fuse). Moreover, sometimes it's very
inconvenient. For example, a dozen tasks aggressively writing to an mmap()-ed
area may block 'ls' for a long while (>1min in my experiments).

2. Thundering herd problem. When fc->num_background falls below the limit,
request_end() calls wake_up_all(&fc->blocked_waitq). This wakes up all waiters
even though the first waiter to get a new request may immediately put it to
background, increasing fc->num_background again. (Experimenting with
mmap()-ed writes I observed a 2x slowdown as compared with fuse after applying
this patch-set.)

The patch-set re-works fuse_get_req (and its callers) to throttle only requests
intended for background processing. Having this done, it becomes possible to
use exclusive wakeups in a chained manner: request_end() wakes up a waiter,
the waiter allocates a new request and submits it for background processing,
the processing ends in request_end() where another wakeup happens, and so on.

Changed in v2:
  - rebased on for-next branch of the fuse tree
  - fixed a race where request processing could begin before the init reply came

Thanks,
Maxim

---

Maxim V. Patlasov (4):
   fuse: make request allocations for background processing explicit
   fuse: add flag fc->uninitialized
   fuse: skip blocking on allocations of synchronous requests
   fuse: implement exclusive wakeup for blocked_waitq


  fs/fuse/cuse.c   |3 ++
  fs/fuse/dev.c|   69 +++---
  fs/fuse/file.c   |6 +++--
  fs/fuse/fuse_i.h |8 ++
  fs/fuse/inode.c  |4 +++
  5 files changed, 73 insertions(+), 17 deletions(-)





Re: [fuse-devel] [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy

2013-04-11 Thread Maxim V. Patlasov

Hi Miklos,

Any feedback would be highly appreciated.

Thanks,
Maxim

04/01/2013 02:40 PM, Maxim V. Patlasov пишет:

Hi,

This is the fourth iteration of Pavel Emelyanov's patch-set implementing
write-back policy for FUSE page cache. Initial patch-set description was
the following:

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file the cp pushes every
128k to the userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution of this is switching the FUSE page cache into a write-back
policy. With this, file data are pushed to the userspace in big chunks
(depending on the dirty memory limits, but this is much more than 128k),
which lets the FUSE daemons handle the size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?) When the writeback
is turned ON:

* still copy writeback pages to a temporary buffer when sending a writeback
  request and finish the page writeback immediately

* make the kernel maintain the inode's i_size to avoid frequent i_size
  synchronization with the user space

* take NR_WRITEBACK_TEMP into account when making the balance_dirty_pages
  decision. This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not
yet all finished, but the mentioned copying of a huge file becomes noticeably
faster even on machines with little RAM and doesn't make the system stuck (the
dirty pages balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation
which is heavily oriented on storing big blobs of data with extremely rare
meta-data updates (virtual machines' and containers' disk images). With the
existing cache policy a typical usage scenario -- copying a big VM disk into a
cloud -- takes way too much time to proceed, much longer than if it was simply
scp-ed over the same network. The write-back policy (as I mentioned) noticeably
improves this scenario. Kirill (in Cc) can share more details about the
performance and the storage concepts if required.

Changed in v2:
  - numerous bugfixes:
- fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must
  wait on page writeback because page writeback can extend beyond the lifetime
  of the page-cache page
- fuse_send_writepages can end_page_writeback on the original page only after
  adding the request to the fi->writepages list; otherwise another writeback
  may happen inside the gap between end_page_writeback and adding to the list
- fuse_direct_io must wait on page writeback; otherwise data corruption is
  possible due to reordering requests
- fuse_flush must flush dirty memory and wait for all writeback on the given
  inode before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not
  reliable
- fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and the i_size
  update; otherwise a race with a writer extending i_size is possible
- fix handling errors in fuse_writepages and fuse_send_writepages
  - handle i_mtime intelligently if writeback cache is on (see patch #7 (update
    i_mtime on buffered writes) for details)
  - put enabling writeback cache under fusermount control (see mount option
    'allow_wbcache' introduced by patch #13 (turn writeback cache on))
  - rebased on v3.7-rc5

Changed in v3:
  - rebased on for-next branch of the fuse tree
    (fb05f41f5f96f7423c53da4d87913fb44fd0565d)

Changed in v4:
  - rebased on for-next branch of the fuse tree
    (634734b63ac39e137a1c623ba74f3e062b6577db)
  - fixed fuse_fillattr() for non-writeback_cache case
  - added comments explaining why we cannot trust size from server
  - rewrote patch handling i_mtime; it's titled Trust-kernel-i_mtime-only now
  - simplified patch titled Flush-files-on-wb-close
  - eliminated code duplications from fuse_readpage() and fuse_prepare_write()
  - added comment about "disk full" errors to fuse_write_begin()
Thanks,
Maxim

---

Maxim V. Patlasov (14):
   fuse: Linking file to inode helper
   fuse: Getting file for writeback helper
   fuse: Prepare to handle short reads
   fuse: Prepare to handle multiple pages in writeback
   fuse: Connection bit for enabling writeback
   fuse: Trust kernel i_size only - v3
   fuse: Trust kernel i_mtime only
   fuse: Flush files on wb close
   fuse: Implement writepages and write_begin/write_end callbacks - v3
   fuse: fuse_writepage_locked() should wait on writeback
   fuse: fuse_flush() should wait on writeback
   fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
   fuse: Turn writeback cache on


[PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2013-04-01 Thread Maxim V. Patlasov
Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is high enough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.

Note that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in this function, it is either writing to the
mountpoint it serves (but it can deadlock even without the writeback)
or it is writing to some _other_ dirty bdi, and in the latter case
someone else will free the memory for it.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 mm/page-writeback.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..c47bcd4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 */
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+   global_page_state(NR_WRITEBACK_TEMP);
 
	global_dirty_limits(&background_thresh, &dirty_thresh);
 



[PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2

2013-04-01 Thread Maxim V. Patlasov
The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the data that was in the file
before the 1st op. The problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits, w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/cuse.c   |5 +++--
 fs/fuse/file.c   |   49 +++--
 fs/fuse/fuse_i.h |   13 -
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..fb63185 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -93,7 +93,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
loff_t pos = 0;
struct iovec iov = { .iov_base = buf, .iov_len = count };
 
-   return fuse_direct_io(file, &iov, 1, count, &pos, 0);
+   return fuse_direct_io(file, &iov, 1, count, &pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -106,7 +106,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 * No locking or generic_write_checks(), the server is
 * responsible for locking and sanity checks.
 */
-   return fuse_direct_io(file, &iov, 1, count, &pos, 1);
+   return fuse_direct_io(file, &iov, 1, count, &pos,
+ FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7c24f6b..14880bb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -338,6 +338,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+   pgoff_t idx_to)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_req *req;
+   bool found = false;
+
+   spin_lock(&fc->lock);
+   list_for_each_entry(req, &fi->writepages, writepages_entry) {
+   pgoff_t curr_index;
+
+   BUG_ON(req->inode != inode);
+   curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+   if (!(idx_from >= curr_index + req->num_pages ||
+ idx_to < curr_index)) {
+   found = true;
+   break;
+   }
+   }
+   spin_unlock(&fc->lock);
+
+   return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -382,6 +407,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+  size_t bytes)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   pgoff_t idx_from, idx_to;
+
+   idx_from = start >> PAGE_CACHE_SHIFT;
+   idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+   wait_event(fi->page_waitq,
+  !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
struct inode *inode = file->f_path.dentry->d_inode;
@@ -1248,8 +1286,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 
 ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
   unsigned long nr_segs, size_t count, loff_t *ppos,
-  int write)
+  int flags)
 {
+   int write = flags & FUSE_DIO_WRITE;
+   int cuse = flags & FUSE_DIO_CUSE;
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fc;
size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1274,6 +1314,10 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
break;
}
 
+   if (!cuse)
+   fuse_wait_on_writeback(file->f_mapping->host, pos,
+  nbytes);
+
if (write)
nres = fuse_send_write(req, file, pos, nbytes, owner);
else
@@ -1342,7 +1386,8 @@ static ssize_t __fuse_direct_write(struct file *file, const struct iovec *iov,
 
	res = generic_write_checks(file, ppos, &count, 0);
if (!res) {
-   res = 

[PATCH 13/14] fuse: Turn writeback cache on

2013-04-01 Thread Maxim V. Patlasov
Introduce a bit that the kernel and userspace exchange with each other at
the init stage, and turn writeback on if the userspace wants it and the
mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file to the per-inode write list and call
generic_file_aio_write() to make use of the Linux page cache engine.

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c|5 +
 fs/fuse/fuse_i.h  |4 
 fs/fuse/inode.c   |   13 +
 include/uapi/linux/fuse.h |2 ++
 4 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 14880bb..5d2c77f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -205,6 +205,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
	spin_unlock(&fc->lock);
fuse_invalidate_attr(inode);
}
+   if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+   fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1099,6 +1101,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct iov_iter i;
loff_t endbyte = 0;
 
+   if (get_fuse_conn(inode)->writeback_cache)
+   return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
WARN_ON(iocb->ki_pos != pos);
 
ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f54d669..f023814 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
 doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER (1 << 1)
 
+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
module will enable support of writeback cache */
+#define FUSE_ALLOW_WBCACHE   (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 921930f..2271177 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -457,6 +457,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+   OPT_ALLOW_WBCACHE,
OPT_ERR
 };
 
@@ -469,6 +470,7 @@ static const match_table_t tokens = {
{OPT_ALLOW_OTHER,   "allow_other"},
{OPT_MAX_READ,  "max_read=%u"},
{OPT_BLKSIZE,   "blksize=%u"},
+   {OPT_ALLOW_WBCACHE, "allow_wbcache"},
{OPT_ERR,   NULL}
 };
 
@@ -542,6 +544,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
d->blksize = value;
break;
 
+   case OPT_ALLOW_WBCACHE:
+   d->flags |= FUSE_ALLOW_WBCACHE;
+   break;
+
default:
return 0;
}
@@ -569,6 +575,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+   if (fc->flags & FUSE_ALLOW_WBCACHE)
+   seq_puts(m, ",allow_wbcache");
return 0;
 }
 
@@ -882,6 +890,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->do_readdirplus = 1;
if (arg->flags & FUSE_READDIRPLUS_AUTO)
fc->readdirplus_auto = 1;
+   if (arg->flags & FUSE_WRITEBACK_CACHE &&
+   fc->flags & FUSE_ALLOW_WBCACHE)
+   fc->writeback_cache = 1;
} else {
ra_pages = fc->max_read / PAGE_CACHE_SIZE;
fc->no_lock = 1;
@@ -910,6 +921,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA |
FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO;
+   if (fc->flags & FUSE_ALLOW_WBCACHE)
+   arg->flags |= FUSE_WRITEBACK_CACHE;
req->in.h.opcode = FUSE_INIT;
req->in.numargs = 1;
req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4c43b44..6acda83 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -220,6 +220,7 @@ struct fuse_file_lock {
  * FUSE_AUTO_INVAL_DATA: automatically invalidate cached pages
  * FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one)
  * FUSE_READDIRPLUS_AUTO: adaptive readdirplus
+ * FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes
  */
 #define FUSE_ASYNC_READ(1 << 0)
 #define FUSE_POSIX_LOCKS   (1 << 1)
@@ -236,6 +237,7 @@ struct 

[PATCH 11/14] fuse: fuse_flush() should wait on writeback

2013-04-01 Thread Maxim V. Patlasov
The aim of the .flush fop is to hint the file system that flushing its state
or caches or any other important data to reliable storage would be desirable
now. fuse_flush() passes this hint on by sending the FUSE_FLUSH request to
userspace. However, dirty pages and pages under writeback may not be visible
to userspace yet unless we ensure it explicitly.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2409654..7c24f6b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,7 @@
 #include 
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
@@ -396,6 +397,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fc->no_flush)
return 0;
 
+   err = filemap_write_and_wait(file->f_mapping);
+   if (err)
+   return err;
+
+   mutex_lock(&inode->i_mutex);
+   fuse_sync_writes(inode);
+   mutex_unlock(&inode->i_mutex);
+
req = fuse_get_req_nofail_nopages(fc, file);
	memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;



[PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback

2013-04-01 Thread Maxim V. Patlasov
fuse_writepage_locked() should never submit new i/o for a given page->index
if there is another one 'in progress' already. In most cases it's safe to
wait on page writeback. But if it was called due to memory shortage
(WB_SYNC_NONE), we should redirty the page rather than block the caller.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |   18 +++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6ceffdf..2409654 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1472,7 +1472,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
return ff;
 }
 
-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+struct writeback_control *wbc)
 {
struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
@@ -1481,6 +1482,14 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_req *req;
struct page *tmp_page;
 
+   if (fuse_page_is_writeback(inode, page->index)) {
+   if (wbc->sync_mode != WB_SYNC_ALL) {
+   redirty_page_for_writepage(wbc, page);
+   return 0;
+   }
+   fuse_wait_on_page_writeback(inode, page->index);
+   }
+
set_page_writeback(page);
 
req = fuse_request_alloc_nofs(1);
@@ -1527,7 +1536,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
int err;
 
-   err = fuse_writepage_locked(page);
+   err = fuse_writepage_locked(page, wbc);
unlock_page(page);
 
return err;
@@ -1797,7 +1806,10 @@ static int fuse_launder_page(struct page *page)
int err = 0;
if (clear_page_dirty_for_io(page)) {
struct inode *inode = page->mapping->host;
-   err = fuse_writepage_locked(page);
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_ALL,
+   };
+   err = fuse_writepage_locked(page, &wbc);
if (!err)
fuse_wait_on_page_writeback(inode, page->index);
}



[PATCH 08/14] fuse: Flush files on wb close

2013-04-01 Thread Maxim V. Patlasov
Any write request requires a file handle to report to userspace. Thus,
when we close a file (and free the fuse_file with this info), we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.

In case the "write cache" is turned off, the flush is ensured by fuse_vma_close().

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6821e95..5509c0b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -288,6 +288,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   /* see fuse_vma_close() for !writeback_cache case */
+   if (fc->writeback_cache)
+   filemap_write_and_wait(file->f_mapping);
+
if (test_bit(FUSE_I_MTIME_UPDATED,
 &get_fuse_inode(inode)->state))
fuse_flush_mtime(file, true);



[PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3

2013-04-01 Thread Maxim V. Patlasov
The .writepages callback is required to make each writeback request carry
more than one page.

Changed in v2:
 - fixed fuse_prepare_write() to avoid reads beyond EOF
 - fixed fuse_prepare_write() to zero uninitialized part of page

Changed in v3:
 - moved common part of fuse_readpage() and fuse_prepare_write() to
   __fuse_readpage().

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim V. Patlasov 
---
 fs/fuse/file.c |  322 +---
 1 files changed, 303 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5509c0b..6ceffdf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -588,20 +588,13 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
}
 }
 
-static int fuse_readpage(struct file *file, struct page *page)
+static int __fuse_readpage(struct file *file, struct page *page, size_t count,
+  int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
 {
struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
size_t num_read;
-   loff_t pos = page_offset(page);
-   size_t count = PAGE_CACHE_SIZE;
-   u64 attr_ver;
-   int err;
-
-   err = -EIO;
-   if (is_bad_inode(inode))
-   goto out;
 
/*
 * Page writeback can extend beyond the lifetime of the
@@ -611,20 +604,45 @@ static int fuse_readpage(struct file *file, struct page *page)
fuse_wait_on_page_writeback(inode, page->index);
 
req = fuse_get_req(fc, 1);
-   err = PTR_ERR(req);
+   *err = PTR_ERR(req);
if (IS_ERR(req))
-   goto out;
+   return 0;
 
-   attr_ver = fuse_get_attr_version(fc);
+   if (attr_ver_p)
+   *attr_ver_p = fuse_get_attr_version(fc);
 
req->out.page_zeroing = 1;
req->out.argpages = 1;
req->num_pages = 1;
req->pages[0] = page;
req->page_descs[0].length = count;
-   num_read = fuse_send_read(req, file, pos, count, NULL);
-   err = req->out.h.error;
 
+   num_read = fuse_send_read(req, file, page_offset(page), count, NULL);
+   *err = req->out.h.error;
+
+   if (*err)
+   fuse_put_request(fc, req);
+   else
+   *req_pp = req;
+
+   return num_read;
+}
+
+static int fuse_readpage(struct file *file, struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_req *req = NULL;
+   size_t num_read;
+   size_t count = PAGE_CACHE_SIZE;
+   u64 attr_ver;
+   int err;
+
+   err = -EIO;
+   if (is_bad_inode(inode))
+   goto out;
+
+   num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
if (!err) {
/*
 * Short read means EOF.  If file size is larger, truncate it
@@ -634,10 +652,11 @@ static int fuse_readpage(struct file *file, struct page *page)
 
SetPageUptodate(page);
}
-
-   fuse_put_request(fc, req);
-   fuse_invalidate_attr(inode); /* atime changed */
- out:
+   if (req) {
+   fuse_put_request(fc, req);
+   fuse_invalidate_attr(inode); /* atime changed */
+   }
+out:
unlock_page(page);
return err;
 }
@@ -702,7 +721,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
 
 struct fuse_fill_data {
struct fuse_req *req;
-   struct file *file;
+   union {
+   struct file *file;
+   struct fuse_file *ff;
+   };
struct inode *inode;
unsigned nr_pages;
 };
@@ -1511,6 +1533,265 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+   int i, all_ok = 1;
+   struct fuse_req *req = data->req;
+   struct inode *inode = data->inode;
+   struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   loff_t off = -1;
+
+   if (!data->ff)
+   data->ff = fuse_write_file(fc, fi);
+
+   if (!data->ff) {
+   for (i = 0; i < req->num_pages; i++)
+   end_page_writeback(req->pages[i]);
+   return -EIO;
+   }
+
+   req->inode = inode;
+   req->misc.write.in.offset = page_offset(req->pages[0]);
+
+   spin_lock(&fc->lock);
+   list_add(&req->writepages_entry, &fi->writepages);
+   spin_unlock(&fc->lock);
+
+   for (i = 0; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   struct page *tmp_page;
+
+   tmp_page = alloc_page(GFP_NOFS 

[PATCH 06/14] fuse: Trust kernel i_size only - v3

2013-04-01 Thread Maxim V. Patlasov
Make fuse assume that, when writeback is on, the inode's i_size is always
up-to-date, and not update it with the value received from userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Changed in v3:
 - fixed fuse_fillattr() not to use local i_size if writeback-cache is off
 - added a comment explaining why we cannot trust attr.size from server

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim V. Patlasov 
---
 fs/fuse/dir.c   |   13 +++--
 fs/fuse/file.c  |   43 +--
 fs/fuse/inode.c |   11 +--
 3 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8506522..8672ee4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -845,6 +845,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
  struct kstat *stat)
 {
unsigned int blkbits;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   /* see the comment in fuse_change_attributes() */
+   if (fc->writeback_cache && S_ISREG(inode->i_mode))
+   attr->size = i_size_read(inode);
 
stat->dev = inode->i_sb->s_dev;
stat->ino = attr->ino;
@@ -1571,6 +1576,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
+   bool is_wb = fc->writeback_cache;
loff_t oldsize;
int err;
 
@@ -1643,7 +1649,9 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
	fuse_change_attributes_common(inode, &outarg.attr,
  attr_timeout(&outarg));
oldsize = inode->i_size;
-   i_size_write(inode, outarg.attr.size);
+   /* see the comment in fuse_change_attributes() */
+   if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+   i_size_write(inode, outarg.attr.size);
 
if (is_truncate) {
/* NOTE: this may release/reacquire fc->lock */
@@ -1655,7 +1663,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 * Only call invalidate_inode_pages2() after removing
 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 */
-   if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+   if ((is_truncate || !is_wb) &&
+   S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
truncate_pagecache(inode, oldsize, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ee44b24..af58bbf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static const struct file_operations fuse_direct_io_file_operations;
 
@@ -543,9 +544,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
u64 attr_ver)
 {
size_t num_read = req->out.args[0].size;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   if (fc->writeback_cache) {
+   /*
+* A hole in a file. Some data after the hole are in page cache,
+* but have not reached the client fs yet. So, the hole is not
+* present there.
+*/
+   int i;
+   int start_idx = num_read >> PAGE_CACHE_SHIFT;
+   size_t off = num_read & (PAGE_CACHE_SIZE - 1);
 
-   loff_t pos = page_offset(req->pages[0]) + num_read;
-   fuse_read_update_size(inode, pos, attr_ver);
+   for (i = start_idx; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   void *mapaddr = kmap_atomic(page);
+
+   memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+   kunmap_atomic(mapaddr);
+   off = 0;
+   }
+   } else {
+   loff_t pos = page_offset(req->pages[0]) + num_read;
+   fuse_read_update_size(inode, pos, attr_ver);
+   }
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
@@ -2286,6 +2309,8 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
.mode = mode
};
int err;
+   bool change_i_size = fc->writeback_cache &&
+   !(mode & F

[PATCH 07/14] fuse: Trust kernel i_mtime only

2013-04-01 Thread Maxim V. Patlasov
Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

The fuse inode flag FUSE_I_MTIME_UPDATED serves as an indication that local
i_mtime should be flushed to the server eventually. Some operations (direct
write, truncate, fallocate and setattr) lead to updating mtime on the server.
So, we can clear FUSE_I_MTIME_UPDATED when such an operation completes. This
is safe because these operations (as well as i_op->update_time and fsync) are
protected by i_mutex.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/dir.c|  116 --
 fs/fuse/file.c   |   33 +--
 fs/fuse/fuse_i.h |6 ++-
 fs/fuse/inode.c  |   13 +-
 4 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8672ee4..8c04677 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -848,8 +848,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
struct fuse_conn *fc = get_fuse_conn(inode);
 
/* see the comment in fuse_change_attributes() */
-   if (fc->writeback_cache && S_ISREG(inode->i_mode))
+   if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
attr->size = i_size_read(inode);
+   attr->mtime = inode->i_mtime.tv_sec;
+   attr->mtimensec = inode->i_mtime.tv_nsec;
+   }
 
stat->dev = inode->i_sb->s_dev;
stat->ino = attr->ino;
@@ -1559,6 +1562,89 @@ void fuse_release_nowrite(struct inode *inode)
	spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+ struct inode *inode,
+ struct fuse_setattr_in *inarg_p,
+ struct fuse_attr_out *outarg_p)
+{
+   req->in.h.opcode = FUSE_SETATTR;
+   req->in.h.nodeid = get_node_id(inode);
+   req->in.numargs = 1;
+   req->in.args[0].size = sizeof(*inarg_p);
+   req->in.args[0].value = inarg_p;
+   req->out.numargs = 1;
+   if (fc->minor < 9)
+   req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+   else
+   req->out.args[0].size = sizeof(*outarg_p);
+   req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+   struct inode *inode = file->f_mapping->host;
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_req *req = NULL;
+   struct fuse_setattr_in inarg;
+   struct fuse_attr_out outarg;
+   int err;
+
+   if (nofail) {
+   req = fuse_get_req_nofail_nopages(fc, file);
+   } else {
+   req = fuse_get_req_nopages(fc);
+   if (IS_ERR(req))
+   return PTR_ERR(req);
+   }
+
+   memset(&inarg, 0, sizeof(inarg));
+   memset(&outarg, 0, sizeof(outarg));
+
+   inarg.valid |= FATTR_MTIME;
+   inarg.mtime = inode->i_mtime.tv_sec;
+   inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+   fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+   fuse_request_send(fc, req);
+   err = req->out.h.error;
+   fuse_put_request(fc, req);
+
+   if (!err)
+   clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+
+   return err;
+}
+
+static inline void set_mtime_helper(struct inode *inode, struct timespec mtime)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+
+   inode->i_mtime = mtime;
+   clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. But
+ * we can also clear FUSE_I_MTIME_UPDATED if FUSE_SETATTR has just changed
+ * mtime on server.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+   unsigned ivalid = iattr->ia_valid;
+
+   if ((ivalid & ATTR_MTIME) && update_mtime(ivalid)) {
+   if (ivalid & ATTR_MTIME_SET)
+   set_mtime_helper(inode, iattr->ia_mtime);
+   else
+   set_mtime_helper(inode, current_fs_time(inode->i_sb));
+   } else if (ivalid & ATTR_SIZE)
+   set_mtime_helper(inode, current_fs_time(inode->i_sb));
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1619,17 +1705,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
inarg.valid |= FATTR_LOCKOWNER;
inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
}
-   req->in.h.opcode = FUSE_SETATTR;
-   req->in.h.nodeid = get_node_id(inode);
-   req->in.numargs = 1;
-   req->in.args[0].size = sizeof(inarg);
-   req->in.args[0].value = &inarg;
-   req->out.numargs = 1;
-   if (fc->minor < 9)
-   req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-   else
- 

[PATCH 04/14] fuse: Prepare to handle multiple pages in writeback

2013-04-01 Thread Maxim V. Patlasov
The .writepages callback will issue writeback requests with more than one
page aboard. Make the existing completion/check code aware of this.

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |   22 +++---
 1 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 648de34..ee44b24 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
 
BUG_ON(req->inode != inode);
curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-   if (curr_index == index) {
+   if (curr_index <= index &&
+   index < curr_index + req->num_pages) {
found = true;
break;
}
@@ -1295,7 +1296,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-   __free_page(req->pages[0]);
+   int i;
+
+   for (i = 0; i < req->num_pages; i++)
+   __free_page(req->pages[i]);
fuse_file_put(req->ff, false);
 }
 
@@ -1304,10 +1308,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   int i;
 
	list_del(&req->writepages_entry);
-   dec_bdi_stat(bdi, BDI_WRITEBACK);
-   dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+   for (i = 0; i < req->num_pages; i++) {
+   dec_bdi_stat(bdi, BDI_WRITEBACK);
+   dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+   }
bdi_writeout_inc(bdi);
	wake_up(&fi->page_waitq);
 }
@@ -1320,14 +1327,15 @@ __acquires(fc->lock)
struct fuse_inode *fi = get_fuse_inode(req->inode);
loff_t size = i_size_read(req->inode);
	struct fuse_write_in *inarg = &req->misc.write.in;
+   __u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
if (!fc->connected)
goto out_free;
 
-   if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-   inarg->size = PAGE_CACHE_SIZE;
+   if (inarg->offset + data_size <= size) {
+   inarg->size = data_size;
} else if (inarg->offset < size) {
-   inarg->size = size & (PAGE_CACHE_SIZE - 1);
+   inarg->size = size - inarg->offset;
} else {
/* Got truncated off completely */
goto out_free;



[PATCH 05/14] fuse: Connection bit for enabling writeback

2013-04-01 Thread Maxim V. Patlasov
Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/fuse_i.h |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6aeba86..09c3139 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -455,6 +455,9 @@ struct fuse_conn {
/** Set if bdi is valid */
unsigned bdi_initialized:1;
 
+   /** write-back cache policy (default is write-through) */
+   unsigned writeback_cache:1;
+
/*
 * The following bitfields are only for optimization purposes
 * and hence races in setting them will not cause malfunction



[PATCH 03/14] fuse: Prepare to handle short reads

2013-04-01 Thread Maxim V. Patlasov
A helper that gets called when a read reports fewer bytes than were requested.
See patch #6 (trust kernel i_size only) for details.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/file.c |   21 +
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6fc65b4..648de34 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -538,6 +538,15 @@ static void fuse_read_update_size(struct inode *inode, 
loff_t size,
spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+   u64 attr_ver)
+{
+   size_t num_read = req->out.args[0].size;
+
+   loff_t pos = page_offset(req->pages[0]) + num_read;
+   fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
struct inode *inode = page->mapping->host;
@@ -574,18 +583,18 @@ static int fuse_readpage(struct file *file, struct page 
*page)
req->page_descs[0].length = count;
num_read = fuse_send_read(req, file, pos, count, NULL);
err = req->out.h.error;
-   fuse_put_request(fc, req);
 
if (!err) {
/*
 * Short read means EOF.  If file size is larger, truncate it
 */
if (num_read < count)
-   fuse_read_update_size(inode, pos + num_read, attr_ver);
+   fuse_short_read(req, inode, attr_ver);
 
SetPageUptodate(page);
}
 
+   fuse_put_request(fc, req);
fuse_invalidate_attr(inode); /* atime changed */
  out:
unlock_page(page);
@@ -608,13 +617,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, 
struct fuse_req *req)
/*
 * Short read means EOF. If file size is larger, truncate it
 */
-   if (!req->out.h.error && num_read < count) {
-   loff_t pos;
+   if (!req->out.h.error && num_read < count)
+   fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-   pos = page_offset(req->pages[0]) + num_read;
-   fuse_read_update_size(inode, pos,
- req->misc.read.attr_ver);
-   }
fuse_invalidate_attr(inode); /* atime changed */
}
 



[PATCH 02/14] fuse: Getting file for writeback helper

2013-04-01 Thread Maxim V. Patlasov
There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/file.c |   24 
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 356610c..6fc65b4 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1375,6 +1375,20 @@ static void fuse_writepage_end(struct fuse_conn *fc, 
struct fuse_req *req)
fuse_writepage_free(fc, req);
 }
 
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
+struct fuse_inode *fi)
+{
+   struct fuse_file *ff;
+
+   spin_lock(&fc->lock);
+   BUG_ON(list_empty(&fi->write_files));
+   ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+   fuse_file_get(ff);
+   spin_unlock(&fc->lock);
+
+   return ff;
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
struct address_space *mapping = page->mapping;
@@ -1382,7 +1396,6 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_req *req;
-   struct fuse_file *ff;
struct page *tmp_page;
 
set_page_writeback(page);
@@ -1395,13 +1408,8 @@ static int fuse_writepage_locked(struct page *page)
if (!tmp_page)
goto err_free;
 
-   spin_lock(&fc->lock);
-   BUG_ON(list_empty(&fi->write_files));
-   ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
-   req->ff = fuse_file_get(ff);
-   spin_unlock(&fc->lock);
-
-   fuse_write_fill(req, ff, page_offset(page), 0);
+   req->ff = fuse_write_file(fc, fi);
+   fuse_write_fill(req, req->ff, page_offset(page), 0);
 
copy_highpage(tmp_page, page);
req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;



[PATCH 01/14] fuse: Linking file to inode helper

2013-04-01 Thread Maxim V. Patlasov
When writeback is ON, every writable file should be on the per-inode write
list, not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/file.c |   33 +++--
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c807176..356610c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -167,6 +167,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct 
file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+   struct inode *inode = file->f_dentry->d_inode;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_file *ff = file->private_data;
+   /*
+* file may be written through mmap, so chain it onto the
+* inodes's write_file list
+*/
+   spin_lock(&fc->lock);
+   if (list_empty(&ff->write_entry))
+   list_add(&ff->write_entry, &fi->write_files);
+   spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
struct fuse_file *ff = file->private_data;
@@ -1484,20 +1500,9 @@ static const struct vm_operations_struct 
fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-   struct inode *inode = file->f_dentry->d_inode;
-   struct fuse_conn *fc = get_fuse_conn(inode);
-   struct fuse_inode *fi = get_fuse_inode(inode);
-   struct fuse_file *ff = file->private_data;
-   /*
-* file may be written through mmap, so chain it onto the
-* inodes's write_file list
-*/
-   spin_lock(&fc->lock);
-   if (list_empty(&ff->write_entry))
-   list_add(&ff->write_entry, &fi->write_files);
-   spin_unlock(&fc->lock);
-   }
+   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+   fuse_link_write_file(file);
+
file_accessed(file);
vma->vm_ops = &fuse_file_vm_ops;
return 0;



[PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy

2013-04-01 Thread Maxim V. Patlasov
Hi,

This is the fourth iteration of Pavel Emelyanov's patch-set implementing
write-back policy for FUSE page cache. Initial patch-set description was
the following:

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy, which results in performance problems on certain
workloads. E.g., when copying a big file onto a FUSE mount, cp pushes every
128k to userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution to this is switching the FUSE page cache to a write-back
policy. With this, file data are pushed to userspace in big chunks (depending
on the dirty memory limits, but much more than 128k), which lets the FUSE
daemons handle size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?). When the
writeback is turned ON:

* still copy writeback pages to a temporary buffer when sending a writeback
  request and finish the page writeback immediately

* make the kernel maintain the inode's i_size to avoid frequent i_size
  synchronization with user space

* take NR_WRITEBACK_TEMP into account when making the balance_dirty_pages
  decision. This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not
all finished yet, but the mentioned copying of a huge file becomes noticeably
faster even on machines with little RAM, and doesn't get the system stuck (the
dirty-pages balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage
implementation, which is heavily oriented toward storing big blobs of data
with extremely rare metadata updates (virtual machines' and containers' disk
images). With the existing cache policy a typical usage scenario -- copying a
big VM disk into a cloud -- takes way too much time, much longer than if it
were simply scp-ed over the same network. The write-back policy (as I
mentioned) noticeably improves this scenario. Kirill (in Cc) can share more
details about the performance and the storage concepts if required.

Changed in v2:
 - numerous bugfixes:
   - fuse_write_begin, fuse_writepages_fill and fuse_writepage_locked must wait
     on page writeback because page writeback can extend beyond the lifetime of
     the page-cache page
   - fuse_send_writepages can end_page_writeback on the original page only
     after adding the request to the fi->writepages list; otherwise another
     writeback may happen inside the gap between end_page_writeback and adding
     to the list
   - fuse_direct_io must wait on page writeback; otherwise data corruption is
     possible due to reordering requests
   - fuse_flush must flush dirty memory and wait for all writeback on the given
     inode before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not
     reliable
   - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and the i_size
     update; otherwise a race with a writer extending i_size is possible
   - fix handling errors in fuse_writepages and fuse_send_writepages
 - handle i_mtime intelligently if writeback cache is on (see patch #7 (update
   i_mtime on buffered writes) for details)
 - put enabling writeback cache under fusermount control (see mount option
   'allow_wbcache' introduced by patch #13 (turn writeback cache on))
 - rebased on v3.7-rc5

Changed in v3:
 - rebased on the for-next branch of the fuse tree
   (fb05f41f5f96f7423c53da4d87913fb44fd0565d)

Changed in v4:
 - rebased on the for-next branch of the fuse tree
   (634734b63ac39e137a1c623ba74f3e062b6577db)
 - fixed fuse_fillattr() for the non-writeback_cache case
 - added comments explaining why we cannot trust size from the server
 - rewrote the patch handling i_mtime; it's titled Trust-kernel-i_mtime-only now
 - simplified the patch titled Flush-files-on-wb-close
 - eliminated code duplication from fuse_readpage() and fuse_prepare_write()
 - added a comment about "disk full" errors to fuse_write_begin()

Thanks,
Maxim

---

Maxim V. Patlasov (14):
  fuse: Linking file to inode helper
  fuse: Getting file for writeback helper
  fuse: Prepare to handle short reads
  fuse: Prepare to handle multiple pages in writeback
  fuse: Connection bit for enabling writeback
  fuse: Trust kernel i_size only - v3
  fuse: Trust kernel i_mtime only
  fuse: Flush files on wb close
  fuse: Implement writepages and write_begin/write_end callbacks - v3
  fuse: fuse_writepage_locked() should wait on writeback
  fuse: fuse_flush() should wait on writeback
  fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
  fuse: Turn writeback cache on
  mm: Account for WRITEBACK_TEMP in balance_dirty_pages


 fs/fuse/cuse.c|5 
 fs/fuse/dir.c |  127 +-
 f


[PATCH 04/14] fuse: Prepare to handle multiple pages in writeback

2013-04-01 Thread Maxim V. Patlasov
The .writepages callback will issue writeback requests with more than one
page aboard. Make the existing end/check code aware of this.

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c |   22 +++---
 1 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 648de34..ee44b24 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,8 @@ static bool fuse_page_is_writeback(struct inode *inode, 
pgoff_t index)
 
BUG_ON(req->inode != inode);
curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-   if (curr_index == index) {
+   if (curr_index <= index &&
+   index < curr_index + req->num_pages) {
found = true;
break;
}
@@ -1295,7 +1296,10 @@ static ssize_t fuse_direct_write(struct file *file, 
const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-   __free_page(req->pages[0]);
+   int i;
+
+   for (i = 0; i < req->num_pages; i++)
+   __free_page(req->pages[i]);
fuse_file_put(req->ff, false);
 }
 
@@ -1304,10 +1308,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   int i;
 
list_del(&req->writepages_entry);
-   dec_bdi_stat(bdi, BDI_WRITEBACK);
-   dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+   for (i = 0; i < req->num_pages; i++) {
+   dec_bdi_stat(bdi, BDI_WRITEBACK);
+   dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+   }
bdi_writeout_inc(bdi);
wake_up(&fi->page_waitq);
 }
@@ -1320,14 +1327,15 @@ __acquires(fc->lock)
struct fuse_inode *fi = get_fuse_inode(req->inode);
loff_t size = i_size_read(req->inode);
struct fuse_write_in *inarg = &req->misc.write.in;
+   __u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
if (!fc->connected)
goto out_free;
 
-   if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-   inarg->size = PAGE_CACHE_SIZE;
+   if (inarg->offset + data_size <= size) {
+   inarg->size = data_size;
} else if (inarg->offset < size) {
-   inarg->size = size & (PAGE_CACHE_SIZE - 1);
+   inarg->size = size - inarg->offset;
} else {
/* Got truncated off completely */
goto out_free;


[PATCH 06/14] fuse: Trust kernel i_size only - v3

2013-04-01 Thread Maxim V. Patlasov
Make fuse treat the inode's i_size as always up-to-date when writeback is on,
and not update it with the value received from userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Changed in v3:
 - fixed fuse_fillattr() not to use local i_size if writeback-cache is off
 - added a comment explaining why we cannot trust attr.size from server

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim V. Patlasov mpatla...@parallels.com
---
 fs/fuse/dir.c   |   13 +++--
 fs/fuse/file.c  |   43 +--
 fs/fuse/inode.c |   11 +--
 3 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8506522..8672ee4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -845,6 +845,11 @@ static void fuse_fillattr(struct inode *inode, struct 
fuse_attr *attr,
  struct kstat *stat)
 {
unsigned int blkbits;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   /* see the comment in fuse_change_attributes() */
+   if (fc->writeback_cache && S_ISREG(inode->i_mode))
+   attr->size = i_size_read(inode);
 
stat->dev = inode->i_sb->s_dev;
stat->ino = attr->ino;
@@ -1571,6 +1576,7 @@ static int fuse_do_setattr(struct dentry *entry, struct 
iattr *attr,
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
+   bool is_wb = fc->writeback_cache;
loff_t oldsize;
int err;
 
@@ -1643,7 +1649,9 @@ static int fuse_do_setattr(struct dentry *entry, struct 
iattr *attr,
fuse_change_attributes_common(inode, outarg.attr,
  attr_timeout(outarg));
oldsize = inode->i_size;
-   i_size_write(inode, outarg.attr.size);
+   /* see the comment in fuse_change_attributes() */
+   if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+   i_size_write(inode, outarg.attr.size);
 
if (is_truncate) {
/* NOTE: this may release/reacquire fc->lock */
@@ -1655,7 +1663,8 @@ static int fuse_do_setattr(struct dentry *entry, struct 
iattr *attr,
 * Only call invalidate_inode_pages2() after removing
 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 */
-   if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+   if ((is_truncate || !is_wb) &&
+   S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
truncate_pagecache(inode, oldsize, outarg.attr.size);
invalidate_inode_pages2(inode-i_mapping);
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ee44b24..af58bbf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/swap.h>
+#include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
 
@@ -543,9 +544,31 @@ static void fuse_short_read(struct fuse_req *req, struct 
inode *inode,
u64 attr_ver)
 {
size_t num_read = req->out.args[0].size;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   if (fc->writeback_cache) {
+   /*
+* A hole in a file. Some data after the hole are in page cache,
+* but have not reached the client fs yet. So, the hole is not
+* present there.
+*/
+   int i;
+   int start_idx = num_read >> PAGE_CACHE_SHIFT;
+   size_t off = num_read & (PAGE_CACHE_SIZE - 1);
 
-   loff_t pos = page_offset(req->pages[0]) + num_read;
-   fuse_read_update_size(inode, pos, attr_ver);
+   for (i = start_idx; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   void *mapaddr = kmap_atomic(page);
+
+   memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+   kunmap_atomic(mapaddr);
+   off = 0;
+   }
+   } else {
+   loff_t pos = page_offset(req->pages[0]) + num_read;
+   fuse_read_update_size(inode, pos, attr_ver);
+   }
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
@@ -2286,6 +2309,8 @@ static long fuse_file_fallocate(struct file *file, int 
mode, loff_t offset,
.mode = mode
};
int err;
+   bool change_i_size = fc->writeback_cache &&
+   !(mode & FALLOC_FL_KEEP_SIZE);
 
if (fc->no_fallocate

[PATCH 07/14] fuse: Trust kernel i_mtime only

2013-04-01 Thread Maxim V. Patlasov
Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

The fuse inode flag FUSE_I_MTIME_UPDATED serves as an indication that the
local i_mtime should eventually be flushed to the server. Some operations
(direct write, truncate, fallocate and setattr) lead to updating mtime on the
server, so we can clear FUSE_I_MTIME_UPDATED when such an operation completes.
This is safe because these operations (as well as i_op->update_time and
fsync) are protected by i_mutex.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/dir.c|  116 --
 fs/fuse/file.c   |   33 +--
 fs/fuse/fuse_i.h |6 ++-
 fs/fuse/inode.c  |   13 +-
 4 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8672ee4..8c04677 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -848,8 +848,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
struct fuse_conn *fc = get_fuse_conn(inode);
 
/* see the comment in fuse_change_attributes() */
-   if (fc->writeback_cache && S_ISREG(inode->i_mode))
+   if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
attr->size = i_size_read(inode);
+   attr->mtime = inode->i_mtime.tv_sec;
+   attr->mtimensec = inode->i_mtime.tv_nsec;
+   }
 
stat->dev = inode->i_sb->s_dev;
stat->ino = attr->ino;
@@ -1559,6 +1562,89 @@ void fuse_release_nowrite(struct inode *inode)
spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+ struct inode *inode,
+ struct fuse_setattr_in *inarg_p,
+ struct fuse_attr_out *outarg_p)
+{
+   req->in.h.opcode = FUSE_SETATTR;
+   req->in.h.nodeid = get_node_id(inode);
+   req->in.numargs = 1;
+   req->in.args[0].size = sizeof(*inarg_p);
+   req->in.args[0].value = inarg_p;
+   req->out.numargs = 1;
+   if (fc->minor < 9)
+   req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+   else
+   req->out.args[0].size = sizeof(*outarg_p);
+   req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode-i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+   struct inode *inode = file->f_mapping->host;
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_req *req = NULL;
+   struct fuse_setattr_in inarg;
+   struct fuse_attr_out outarg;
+   int err;
+
+   if (nofail) {
+   req = fuse_get_req_nofail_nopages(fc, file);
+   } else {
+   req = fuse_get_req_nopages(fc);
+   if (IS_ERR(req))
+   return PTR_ERR(req);
+   }
+
+   memset(&inarg, 0, sizeof(inarg));
+   memset(&outarg, 0, sizeof(outarg));
+
+   inarg.valid |= FATTR_MTIME;
+   inarg.mtime = inode->i_mtime.tv_sec;
+   inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+   fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+   fuse_request_send(fc, req);
+   err = req->out.h.error;
+   fuse_put_request(fc, req);
+
+   if (!err)
+   clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+
+   return err;
+}
+
+static inline void set_mtime_helper(struct inode *inode, struct timespec mtime)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+
+   inode->i_mtime = mtime;
+   clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. But
+ * we can also clear FUSE_I_MTIME_UPDATED if FUSE_SETATTR has just changed
+ * mtime on server.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+   unsigned ivalid = iattr->ia_valid;
+
+   if ((ivalid & ATTR_MTIME) && update_mtime(ivalid)) {
+   if (ivalid & ATTR_MTIME_SET)
+   set_mtime_helper(inode, iattr->ia_mtime);
+   else
+   set_mtime_helper(inode, current_fs_time(inode->i_sb));
+   } else if (ivalid & ATTR_SIZE)
+   set_mtime_helper(inode, current_fs_time(inode->i_sb));
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1619,17 +1705,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
inarg.valid |= FATTR_LOCKOWNER;
inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
}
-   req->in.h.opcode = FUSE_SETATTR;
-   req->in.h.nodeid = get_node_id(inode);
-   req->in.numargs = 1;
-   req->in.args[0].size = sizeof(inarg);
-   req->in.args[0].value = &inarg;
-   req->out.numargs = 1;
-   if (fc->minor < 9)
-   req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-   else
-  

[PATCH 08/14] fuse: Flush files on wb close

2013-04-01 Thread Maxim V. Patlasov
Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.

In case of write cache turned off, the flush is ensured by fuse_vma_close().

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6821e95..5509c0b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -288,6 +288,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   /* see fuse_vma_close() for !writeback_cache case */
+   if (fc->writeback_cache)
+   filemap_write_and_wait(file->f_mapping);
+
if (test_bit(FUSE_I_MTIME_UPDATED,
 &get_fuse_inode(inode)->state))
fuse_flush_mtime(file, true);

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3

2013-04-01 Thread Maxim V. Patlasov
The .writepages one is required to make each writeback request carry more than
one page on it.

Changed in v2:
 - fixed fuse_prepare_write() to avoid reads beyond EOF
 - fixed fuse_prepare_write() to zero uninitialized part of page

Changed in v3:
 - moved common part of fuse_readpage() and fuse_prepare_write() to
   __fuse_readpage().

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim V. Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c |  322 +---
 1 files changed, 303 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5509c0b..6ceffdf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -588,20 +588,13 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
}
 }
 
-static int fuse_readpage(struct file *file, struct page *page)
+static int __fuse_readpage(struct file *file, struct page *page, size_t count,
+  int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
 {
struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
size_t num_read;
-   loff_t pos = page_offset(page);
-   size_t count = PAGE_CACHE_SIZE;
-   u64 attr_ver;
-   int err;
-
-   err = -EIO;
-   if (is_bad_inode(inode))
-   goto out;
 
/*
 * Page writeback can extend beyond the lifetime of the
@@ -611,20 +604,45 @@ static int fuse_readpage(struct file *file, struct page *page)
fuse_wait_on_page_writeback(inode, page->index);
 
req = fuse_get_req(fc, 1);
-   err = PTR_ERR(req);
+   *err = PTR_ERR(req);
if (IS_ERR(req))
-   goto out;
+   return 0;
 
-   attr_ver = fuse_get_attr_version(fc);
+   if (attr_ver_p)
+   *attr_ver_p = fuse_get_attr_version(fc);
 
req->out.page_zeroing = 1;
req->out.argpages = 1;
req->num_pages = 1;
req->pages[0] = page;
req->page_descs[0].length = count;
-   num_read = fuse_send_read(req, file, pos, count, NULL);
-   err = req->out.h.error;
 
+   num_read = fuse_send_read(req, file, page_offset(page), count, NULL);
+   *err = req->out.h.error;
+
+   if (*err)
+   fuse_put_request(fc, req);
+   else
+   *req_pp = req;
+
+   return num_read;
+}
+
+static int fuse_readpage(struct file *file, struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_req *req = NULL;
+   size_t num_read;
+   size_t count = PAGE_CACHE_SIZE;
+   u64 attr_ver;
+   int err;
+
+   err = -EIO;
+   if (is_bad_inode(inode))
+   goto out;
+
+   num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
if (!err) {
/*
 * Short read means EOF.  If file size is larger, truncate it
@@ -634,10 +652,11 @@ static int fuse_readpage(struct file *file, struct page *page)
 
SetPageUptodate(page);
}
-
-   fuse_put_request(fc, req);
-   fuse_invalidate_attr(inode); /* atime changed */
- out:
+   if (req) {
+   fuse_put_request(fc, req);
+   fuse_invalidate_attr(inode); /* atime changed */
+   }
+out:
unlock_page(page);
return err;
 }
@@ -702,7 +721,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
 
 struct fuse_fill_data {
struct fuse_req *req;
-   struct file *file;
+   union {
+   struct file *file;
+   struct fuse_file *ff;
+   };
struct inode *inode;
unsigned nr_pages;
 };
@@ -1511,6 +1533,265 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+   int i, all_ok = 1;
+   struct fuse_req *req = data->req;
+   struct inode *inode = data->inode;
+   struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   loff_t off = -1;
+
+   if (!data->ff)
+   data->ff = fuse_write_file(fc, fi);
+
+   if (!data->ff) {
+   for (i = 0; i < req->num_pages; i++)
+   end_page_writeback(req->pages[i]);
+   return -EIO;
+   }
+
+   req->inode = inode;
+   req->misc.write.in.offset = page_offset(req->pages[0]);
+
+   spin_lock(&fc->lock);
+   list_add(&req->writepages_entry, &fi->writepages);
+   spin_unlock(&fc->lock);
+
+   for (i = 0; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   struct page *tmp_page;
+
+   tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+   if (tmp_page

[PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback

2013-04-01 Thread Maxim V. Patlasov
fuse_writepage_locked() should never submit new i/o for a given page->index
if another one is already 'in progress'. In most cases it's safe to
wait on page writeback. But if it was called due to memory shortage
(WB_SYNC_NONE), we should redirty the page rather than block the caller.
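The rule above reduces to a three-way decision. A sketch of the predicate in isolation (enum names are illustrative, mirroring the kernel's WB_SYNC_NONE/WB_SYNC_ALL sync modes):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative enums mirroring WB_SYNC_NONE / WB_SYNC_ALL. */
enum sync_mode { SYNC_NONE, SYNC_ALL };
enum wb_action { WB_PROCEED, WB_REDIRTY, WB_WAIT_THEN_PROCEED };

/* Mirrors the check at the top of fuse_writepage_locked(). */
static enum wb_action writepage_action(bool page_is_writeback, enum sync_mode m)
{
	if (!page_is_writeback)
		return WB_PROCEED;           /* no older request in flight */
	if (m != SYNC_ALL)
		return WB_REDIRTY;           /* reclaim path: must not block */
	return WB_WAIT_THEN_PROCEED;         /* integrity path: wait it out */
}
```

Blocking is only acceptable on the data-integrity path; the reclaim path must make progress without sleeping on fuse's own writeback.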

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c |   18 +++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6ceffdf..2409654 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1472,7 +1472,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
return ff;
 }
 
-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+struct writeback_control *wbc)
 {
struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
@@ -1481,6 +1482,14 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_req *req;
struct page *tmp_page;
 
+   if (fuse_page_is_writeback(inode, page->index)) {
+   if (wbc->sync_mode != WB_SYNC_ALL) {
+   redirty_page_for_writepage(wbc, page);
+   return 0;
+   }
+   fuse_wait_on_page_writeback(inode, page->index);
+   }
+
set_page_writeback(page);
 
req = fuse_request_alloc_nofs(1);
@@ -1527,7 +1536,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
int err;
 
-   err = fuse_writepage_locked(page);
+   err = fuse_writepage_locked(page, wbc);
unlock_page(page);
 
return err;
@@ -1797,7 +1806,10 @@ static int fuse_launder_page(struct page *page)
int err = 0;
if (clear_page_dirty_for_io(page)) {
struct inode *inode = page->mapping->host;
-   err = fuse_writepage_locked(page);
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_ALL,
+   };
+   err = fuse_writepage_locked(page, &wbc);
if (!err)
fuse_wait_on_page_writeback(inode, page->index);
}



[PATCH 11/14] fuse: fuse_flush() should wait on writeback

2013-04-01 Thread Maxim V. Patlasov
The aim of the .flush fop is to hint the file system that flushing its state or
caches or any other important data to reliable storage would be desirable now.
fuse_flush() passes this hint on by sending a FUSE_FLUSH request to userspace.
However, dirty pages and pages under writeback may not be visible to userspace
yet unless we ensure it explicitly.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2409654..7c24f6b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,7 @@
 #include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
@@ -396,6 +397,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fc->no_flush)
return 0;
 
+   err = filemap_write_and_wait(file->f_mapping);
+   if (err)
+   return err;
+
+   mutex_lock(&inode->i_mutex);
+   fuse_sync_writes(inode);
+   mutex_unlock(&inode->i_mutex);
+
req = fuse_get_req_nofail_nopages(fc, file);
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;



[PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2

2013-04-01 Thread Maxim V. Patlasov
The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the one that was in a file
before the 1st op. Problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits, without waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)
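The waiting added by this patch relies on an interval-overlap test against each queued write request, which covers pages [curr_index, curr_index + num_pages). The predicate in isolation, as a userspace sketch:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long pgoff_t;

/* True iff the page range [idx_from, idx_to] (inclusive) intersects the
 * request's range [curr_index, curr_index + num_pages) — the same predicate
 * used in fuse_range_is_writeback(). */
static bool ranges_overlap(pgoff_t idx_from, pgoff_t idx_to,
			   pgoff_t curr_index, unsigned num_pages)
{
	return !(idx_from >= curr_index + num_pages || idx_to < curr_index);
}
```

Two half-open/closed intervals are disjoint exactly when one starts past the end of the other, so the negation of the two disjointness conditions is the overlap test.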

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/cuse.c   |5 +++--
 fs/fuse/file.c   |   49 +++--
 fs/fuse/fuse_i.h |   13 -
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..fb63185 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -93,7 +93,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
loff_t pos = 0;
struct iovec iov = { .iov_base = buf, .iov_len = count };
 
-   return fuse_direct_io(file, iov, 1, count, pos, 0);
+   return fuse_direct_io(file, iov, 1, count, pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -106,7 +106,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 * No locking or generic_write_checks(), the server is
 * responsible for locking and sanity checks.
 */
-   return fuse_direct_io(file, iov, 1, count, pos, 1);
+   return fuse_direct_io(file, iov, 1, count, pos,
+ FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7c24f6b..14880bb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -338,6 +338,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+   pgoff_t idx_to)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_req *req;
+   bool found = false;
+
+   spin_lock(&fc->lock);
+   list_for_each_entry(req, &fi->writepages, writepages_entry) {
+   pgoff_t curr_index;
+
+   BUG_ON(req->inode != inode);
+   curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+   if (!(idx_from >= curr_index + req->num_pages ||
+ idx_to < curr_index)) {
+   found = true;
+   break;
+   }
+   }
+   spin_unlock(&fc->lock);
+
+   return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -382,6 +407,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+  size_t bytes)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   pgoff_t idx_from, idx_to;
+
+   idx_from = start >> PAGE_CACHE_SHIFT;
+   idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+   wait_event(fi->page_waitq,
+  !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
struct inode *inode = file->f_path.dentry->d_inode;
@@ -1248,8 +1286,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 
 ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
   unsigned long nr_segs, size_t count, loff_t *ppos,
-  int write)
+  int flags)
 {
+   int write = flags & FUSE_DIO_WRITE;
+   int cuse = flags & FUSE_DIO_CUSE;
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fc;
size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1274,6 +1314,10 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
break;
}
 
+   if (!cuse)
+   fuse_wait_on_writeback(file->f_mapping->host, pos,
+  nbytes);
+
if (write)
nres = fuse_send_write(req, file, pos, nbytes, owner);
else
@@ -1342,7 +1386,8 @@ static ssize_t __fuse_direct_write(struct file *file, const struct iovec *iov,
 
res = generic_write_checks(file, ppos, count, 0);

[PATCH 13/14] fuse: Turn writeback cache on

2013-04-01 Thread Maxim V. Patlasov
Introduce a bit that the kernel and userspace exchange with each other
during the init stage, and turn writeback on if userspace wants it and
the mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.
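The negotiation described above is an ordinary capability handshake over bitmasks: the kernel advertises FUSE_WRITEBACK_CACHE in its INIT request only when the mount allows it, and enables the cache only when the daemon echoes the bit back. A userspace sketch (the numeric value of FUSE_WRITEBACK_CACHE here is a placeholder; the real value is whatever the uapi fuse.h assigns):

```c
#include <assert.h>
#include <stdbool.h>

#define FUSE_ALLOW_WBCACHE   (1u << 2)   /* mount-option bit, as in the patch */
#define FUSE_WRITEBACK_CACHE (1u << 16)  /* INIT flag; placeholder value */

/* What the kernel advertises in its FUSE_INIT request. */
static unsigned init_flags(unsigned mount_flags)
{
	unsigned flags = 0;
	if (mount_flags & FUSE_ALLOW_WBCACHE)
		flags |= FUSE_WRITEBACK_CACHE;
	return flags;
}

/* Writeback caching is enabled only when the daemon's INIT reply echoes
 * the bit back AND the mount option allowed it in the first place. */
static bool writeback_enabled(unsigned mount_flags, unsigned reply_flags)
{
	return (reply_flags & FUSE_WRITEBACK_CACHE) &&
	       (mount_flags & FUSE_ALLOW_WBCACHE);
}
```

Requiring agreement from both sides keeps old daemons (which never set the bit) on the old write path.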

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c|5 +
 fs/fuse/fuse_i.h  |4 
 fs/fuse/inode.c   |   13 +
 include/uapi/linux/fuse.h |2 ++
 4 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 14880bb..5d2c77f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -205,6 +205,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
spin_unlock(&fc->lock);
fuse_invalidate_attr(inode);
}
+   if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+   fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1099,6 +1101,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct iov_iter i;
loff_t endbyte = 0;
 
+   if (get_fuse_conn(inode)->writeback_cache)
+   return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
WARN_ON(iocb->ki_pos != pos);
 
ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f54d669..f023814 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
 doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER (1 << 1)
 
+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
+module will enable support of writeback cache */
+#define FUSE_ALLOW_WBCACHE   (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 921930f..2271177 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -457,6 +457,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+   OPT_ALLOW_WBCACHE,
OPT_ERR
 };
 
@@ -469,6 +470,7 @@ static const match_table_t tokens = {
{OPT_ALLOW_OTHER,   "allow_other"},
{OPT_MAX_READ,  "max_read=%u"},
{OPT_BLKSIZE,   "blksize=%u"},
+   {OPT_ALLOW_WBCACHE, "allow_wbcache"},
{OPT_ERR,   NULL}
 };
 
@@ -542,6 +544,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
d->blksize = value;
break;
 
+   case OPT_ALLOW_WBCACHE:
+   d->flags |= FUSE_ALLOW_WBCACHE;
+   break;
+
default:
return 0;
}
@@ -569,6 +575,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+   if (fc->flags & FUSE_ALLOW_WBCACHE)
+   seq_puts(m, ",allow_wbcache");
return 0;
 }
 
@@ -882,6 +890,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->do_readdirplus = 1;
if (arg->flags & FUSE_READDIRPLUS_AUTO)
fc->readdirplus_auto = 1;
+   if (arg->flags & FUSE_WRITEBACK_CACHE &&
+   fc->flags & FUSE_ALLOW_WBCACHE)
+   fc->writeback_cache = 1;
} else {
ra_pages = fc->max_read / PAGE_CACHE_SIZE;
fc->no_lock = 1;
@@ -910,6 +921,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA |
FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO;
+   if (fc->flags & FUSE_ALLOW_WBCACHE)
+   arg->flags |= FUSE_WRITEBACK_CACHE;
req->in.h.opcode = FUSE_INIT;
req->in.numargs = 1;
req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4c43b44..6acda83 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -220,6 +220,7 @@ struct fuse_file_lock {
  * FUSE_AUTO_INVAL_DATA: automatically invalidate cached pages
  * FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one)
  * FUSE_READDIRPLUS_AUTO: adaptive readdirplus
+ * FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes
  */
 #define FUSE_ASYNC_READ	(1 << 0)
 #define FUSE_POSIX_LOCKS   (1 << 1)
@@ -236,6 +237,7 @@ struct fuse_file_lock {

[PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2013-04-01 Thread Maxim V. Patlasov
Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is high enough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.

Note that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in this function, it means that it is either
writing to the mountpoint it serves (but it can deadlock even without
the writeback) or it is writing to some _other_ dirty bdi, and in the
latter case someone else will free the memory for it.
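The patch simply widens the sum that gets compared against the global dirty threshold. A sketch of the check with illustrative counters (the struct and values are made up for demonstration; the kernel reads per-zone counters via global_page_state()):

```c
#include <assert.h>
#include <stdbool.h>

/* Global counters as balance_dirty_pages() reads them (illustrative). */
struct vm_counters {
	unsigned long nr_file_dirty;
	unsigned long nr_unstable_nfs;
	unsigned long nr_writeback;
	unsigned long nr_writeback_temp;   /* fuse's temporary writeback pages */
};

static bool over_dirty_limit(const struct vm_counters *c, unsigned long thresh)
{
	unsigned long nr_reclaimable = c->nr_file_dirty + c->nr_unstable_nfs;
	/* The patch adds nr_writeback_temp to this sum. */
	unsigned long nr_dirty = nr_reclaimable + c->nr_writeback +
				 c->nr_writeback_temp;
	return nr_dirty > thresh;
}
```

Before the patch, pages copied into fuse's temporary writeback buffers were invisible to this sum, so a fuse mount could accumulate them without ever triggering throttling.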

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
Signed-off-by: Pavel Emelyanov xe...@openvz.org
---
 mm/page-writeback.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..c47bcd4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 */
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+   global_page_state(NR_WRITEBACK_TEMP);
 
global_dirty_limits(&background_thresh, &dirty_thresh);
 




Re: [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v2

2013-03-27 Thread Maxim V. Patlasov

Hi Miklos,

01/30/2013 03:08 AM, Miklos Szeredi wrote:

On Fri, Jan 25, 2013 at 7:25 PM, Maxim V. Patlasov mpatla...@parallels.com wrote:

The .writepages one is required to make each writeback request carry more than
one page on it.

Changed in v2:
  - fixed fuse_prepare_write() to avoid reads beyond EOF
  - fixed fuse_prepare_write() to zero uninitialized part of page

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim V. Patlasov mpatla...@parallels.com
---
  fs/fuse/file.c |  282 
  1 files changed, 281 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 496e74c..3b4dc98 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -722,7 +722,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)

  struct fuse_fill_data {
 struct fuse_req *req;
-   struct file *file;
+   union {
+   struct file *file;
+   struct fuse_file *ff;
+   };
 struct inode *inode;
 unsigned nr_pages;
  };
@@ -1530,6 +1533,280 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 return err;
  }

+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+   int i, all_ok = 1;
+   struct fuse_req *req = data->req;
+   struct inode *inode = data->inode;
+   struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   loff_t off = -1;
+
+   if (!data->ff)
+   data->ff = fuse_write_file(fc, fi);
+
+   if (!data->ff) {
+   for (i = 0; i < req->num_pages; i++)
+   end_page_writeback(req->pages[i]);
+   return -EIO;
+   }
+
+   req->inode = inode;
+   req->misc.write.in.offset = page_offset(req->pages[0]);
+
+   spin_lock(&fc->lock);
+   list_add(&req->writepages_entry, &fi->writepages);
+   spin_unlock(&fc->lock);
+
+   for (i = 0; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   struct page *tmp_page;
+
+   tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+   if (tmp_page) {
+   copy_highpage(tmp_page, page);
+   inc_bdi_stat(bdi, BDI_WRITEBACK);
+   inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+   } else
+   all_ok = 0;
+   req->pages[i] = tmp_page;
+   if (i == 0)
+   off = page_offset(page);
+
+   end_page_writeback(page);
+   }
+
+   if (!all_ok) {
+   for (i = 0; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   if (page) {
+   dec_bdi_stat(bdi, BDI_WRITEBACK);
+   dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+   __free_page(page);
+   req->pages[i] = NULL;
+   }
+   }
+
+   spin_lock(&fc->lock);
+   list_del(&req->writepages_entry);
+   wake_up(&fi->page_waitq);
+   spin_unlock(&fc->lock);
+   return -ENOMEM;
+   }
+
+   req->ff = fuse_file_get(data->ff);
+   fuse_write_fill(req, data->ff, off, 0);
+
+   req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+   req->in.argpages = 1;
+   fuse_page_descs_length_init(req, 0, req->num_pages);
+   req->end = fuse_writepage_end;
+
+   spin_lock(&fc->lock);
+   list_add_tail(&req->list, &fi->queued_writes);
+   fuse_flush_writepages(data->inode);
+   spin_unlock(&fc->lock);
+
+   return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+   struct writeback_control *wbc, void *_data)
+{
+   struct fuse_fill_data *data = _data;
+   struct fuse_req *req = data->req;
+   struct inode *inode = data->inode;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   if (fuse_page_is_writeback(inode, page->index)) {
+   if (wbc->sync_mode != WB_SYNC_ALL) {
+   redirty_page_for_writepage(wbc, page);
+   unlock_page(page);
+   return 0;
+   }
+   fuse_wait_on_page_writeback(inode, page->index);
+   }
+
+   if (req->num_pages &&
+   (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+(req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+   int err;
+
+   err = fuse_send_writepages(data);
+   if (err) {
+   unlock_page(page);
+   return err;
+   }
+
+   data->req = req =
+   fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ

Re: [PATCH 08/14] fuse: Flush files on wb close

2013-03-26 Thread Maxim V. Patlasov

Hi Miklos,

01/30/2013 02:58 AM, Miklos Szeredi wrote:

On Fri, Jan 25, 2013 at 7:24 PM, Maxim V. Patlasov wrote:

Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding writeback cache. Note, that simply calling the
filemap_write_and_wait() is not enough since fuse finishes page writeback
immediately and thus the -wait part of the mentioned call will be no-op.
Do real wait on per-inode writepages list.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
  fs/fuse/file.c |   26 +-
  1 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4f8fa45..496e74c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -138,6 +138,12 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
 }
  }

+static void __fuse_file_put(struct fuse_file *ff)
+{
+   if (atomic_dec_and_test(&ff->count))
+   BUG();
+}
+
  int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  bool isdir)
  {
@@ -286,8 +292,23 @@ static int fuse_open(struct inode *inode, struct file 
*file)
 return fuse_open_common(inode, file, false);
  }

+static void fuse_flush_writeback(struct inode *inode, struct file *file)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+
+   filemap_write_and_wait(file->f_mapping);
+   wait_event(fi->page_waitq, list_empty_careful(&fi->writepages));
+   spin_unlock_wait(&fc->lock);
+}
+
  static int fuse_release(struct inode *inode, struct file *file)
  {
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   if (fc->writeback_cache)
+   fuse_flush_writeback(inode, file);
+
 fuse_release_common(file, FUSE_RELEASE);

 /* return value is ignored by VFS */
@@ -1343,7 +1364,8 @@ static void fuse_writepage_free(struct fuse_conn *fc, 
struct fuse_req *req)

 for (i = 0; i < req->num_pages; i++)
 __free_page(req->pages[i]);
-   fuse_file_put(req->ff, false);
+   if (!fc->writeback_cache)
+   fuse_file_put(req->ff, false);
  }

  static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
@@ -1360,6 +1382,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
 }
 bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
+   if (fc->writeback_cache)
+   __fuse_file_put(req->ff);
  }

I don't see how this belongs in this patch.  And I suspect this can be
done unconditionally (for the non-writeback-cache case as well), but
please move it into a separate patch.


OK, I'll do.

Thanks,
Maxim



Thanks,
Miklos



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/14] fuse: Update i_mtime on buffered writes

2013-03-26 Thread Maxim V. Patlasov

Hi Miklos,

Sorry for the long delay; please see my inline comment below.

On 01/30/2013 02:19 AM, Miklos Szeredi wrote:

On Fri, Jan 25, 2013 at 7:24 PM, Maxim V. Patlasov
 wrote:

If writeback cache is on, buffered write doesn't result in immediate mtime
update in userspace because the userspace will see modified data later, when
writeback happens. Consequently, mtime provided by userspace may be older than
actual time of buffered write.

The problem can be solved by generating mtime locally (will come in next
patches) and flushing it to userspace periodically. Here we introduce a flag to
keep the state of fuse_inode: the flag is ON if and only if locally generated
mtime (stored in inode->i_mtime) was not pushed to the userspace yet.

The patch also implements all bits related to flushing and clearing the flag.

Signed-off-by: Maxim Patlasov 
---
  fs/fuse/dir.c|   42 +
  fs/fuse/file.c   |   31 ++---
  fs/fuse/fuse_i.h |   13 -
  fs/fuse/inode.c  |   79 +-
  4 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ff8b603..969c60d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -177,6 +177,13 @@ static int fuse_dentry_revalidate(struct dentry *entry, 
unsigned int flags)
 if (flags & LOOKUP_RCU)
 return -ECHILD;

+   if (test_bit(FUSE_I_MTIME_UPDATED,
+&get_fuse_inode(inode)->state)) {
+   err = fuse_flush_mtime(inode, 0);

->d_revalidate may be called with or without i_mutex, there's
absolutely no way to know.  So this won't work.

I know it was me who suggested this approach, but I have second
thoughts...  I really don't like the way this mixes userspace and
kernel updates to mtime.  I think it should be either one or the
other.

I don't think you need too many changes to this patch.  Just clear
S_NOCMTIME, implement i_op->update_time(), which sets the
FUSE_I_MTIME_UPDATED flag and flush mtime just like you do now.
Except now it doesn't need to take i_mutex since all mtime updates are
now done by the kernel.

Does that make sense?


Yes, but it's not as simple as you described above. mtime updates must be 
strictly serialized; I used i_mutex for this purpose. If we abandon i_mutex, 
we'll have to introduce another lock for synchronization. Otherwise we won't 
know when it's safe to clear the FUSE_I_MTIME_UPDATED flag. Another approach 
is to introduce one more state, FUSE_I_MTIME_UPDATE_IN_PROGRESS, but then 
we'd need something like a waitq to wait for mtime update completion.


I'd prefer a much simpler solution: clear S_NOCMTIME and implement 
i_op->update_time() as you suggested, but flush mtime only on last 
close. Maybe we could extend the FUSE_RELEASE request (struct 
fuse_release_in) to accommodate mtime. Are you OK with that?


Thanks,
Maxim



Thanks,
Miklos





Re: [PATCH 06/14] fuse: Trust kernel i_size only - v2

2013-03-25 Thread Maxim V. Patlasov

Hi Miklos,

Sorry for the long delay; please see my inline comments below.

On 01/29/2013 02:18 PM, Miklos Szeredi wrote:

On Fri, Jan 25, 2013 at 7:22 PM, Maxim V. Patlasov
 wrote:

Make fuse think that when writeback is on the inode's i_size is always
up-to-date and not update it with the value received from the userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Changed in v2:
  - improved comment in fuse_short_read()
  - fixed fuse_file_fallocate() for KEEP_SIZE mode

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim V. Patlasov 
---
  fs/fuse/dir.c   |9 ++---
  fs/fuse/file.c  |   43 +--
  fs/fuse/inode.c |6 --
  3 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ed8f8c5..ff8b603 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -827,7 +827,7 @@ static void fuse_fillattr(struct inode *inode, struct 
fuse_attr *attr,
 stat->mtime.tv_nsec = attr->mtimensec;
 stat->ctime.tv_sec = attr->ctime;
 stat->ctime.tv_nsec = attr->ctimensec;
-   stat->size = attr->size;
+   stat->size = i_size_read(inode);

The old code is correct and you break it.


fuse_fillattr() is called strictly after fuse_change_attributes(). The 
latter usually sets local i_size with server value: i_size_write(inode, 
attr->size). This makes -/+ lines above equivalent. The only exception 
is stale attributes when current fi->attr_version is greater than 
attr_version acquired before sending FUSE_GETATTR request. My patch 
breaks old behaviour in this special case, I'll fix it, thanks for the 
catch.



We always use the values
returned by GETATTR, instead of the cached ones.  The cached ones are
a best guess by the kernel and they may or may not have been correct
at any point in time.  The attributes returned by userspace are the
authentic ones.


Yes, that's correct when "write cache" is off.


For the "write cache" case what we want, I think, is a mode where the
kernel always trusts the cached attributes.  The attribute cache is
initialized from values returned in LOOKUP and the kernel never needs
to call GETATTR since the attributes are always up-to-date.

Is that correct?


No, for "write cache" case the kernel always trusts cached i_size for 
regular files. For other attributes (and for !S_ISGREG() files) the 
kernel relies on userspace.





 stat->blocks = attr->blocks;

 if (attr->blksize != 0)
@@ -1541,6 +1541,7 @@ static int fuse_do_setattr(struct dentry *entry, struct 
iattr *attr,
 struct fuse_setattr_in inarg;
 struct fuse_attr_out outarg;
 bool is_truncate = false;
+   bool is_wb = fc->writeback_cache;
 loff_t oldsize;
 int err;

@@ -1613,7 +1614,8 @@ static int fuse_do_setattr(struct dentry *entry, struct 
iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 	  attr_timeout(&outarg));
 oldsize = inode->i_size;
-   i_size_write(inode, outarg.attr.size);
+   if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+   i_size_write(inode, outarg.attr.size);

Okay, I managed to understand what is going on here:  if userspace is
behaving badly and is changing the size even if that was not
requested, then we silently reject that.  But that's neither clearly
understandable (without a comment) nor sensible, I think.

If the filesystem is behaving badly, just let it.  Or is there some
other reason why we'd want this check?


The change above has nothing to do with misbehaving userspace. The check 
literally means: do not trust attr.size from server if "write cache" is 
on and the file is regular and there was no explicit user request to 
change size (i.e. ATTR_SIZE bit was not set in attr->ia_valid). The 
check is necessary if the user changes some attribute (not related to 
i_size) and at the time of processing FUSE_SETATTR the server doesn't have 
up-to-date info about the size of the file. In the "write cache" case this 
situation is typical for cached writes extending the file.





 if (is_truncate) {
 /* NOTE: this may release/reacquire fc->lock */
@@ -1625,7 +1627,8 @@ static int fuse_do_setattr(struct dentry *entry, struct 
iattr *attr,
  * Only call invalidate_inode_pages2() after removing
  * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
  */
-   if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+   if ((is_truncate || !is_wb) &&
+   S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 	truncate_pagecache
[PATCH 4/4] fuse: implement exclusive wakeup for blocked_waitq

2013-03-21 Thread Maxim V. Patlasov
The patch solves the thundering herd problem. Since the previous patches ensured
that only allocations for background processing may block, it's safe to wake up
just one waiter. Whoever it is, it will wake up another one in request_end() afterwards.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/dev.c |   20 
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 1f7ce89..ea99e2a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -152,7 +152,8 @@ struct fuse_req *fuse_get_req_internal(struct fuse_conn 
*fc, unsigned npages,
int intr;
 
block_sigs();
-   intr = wait_event_interruptible(fc->blocked_waitq, !*flag_p);
+   intr = wait_event_interruptible_exclusive(fc->blocked_waitq,
+ !*flag_p);
restore_sigs();
err = -EINTR;
if (intr)
@@ -265,6 +266,13 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct 
fuse_conn *fc,
 void fuse_put_request(struct fuse_conn *fc, struct fuse_req *req)
 {
 	if (atomic_dec_and_test(&req->count)) {
+   if (unlikely(req->background)) {
+   spin_lock(&fc->lock);
+   if (!fc->blocked)
+   wake_up(&fc->blocked_waitq);
+   spin_unlock(&fc->lock);
+   }
+
 	if (req->waiting)
 	atomic_dec(&fc->num_waiting);
 
@@ -362,10 +370,14 @@ __releases(fc->lock)
 	list_del(&req->intr_entry);
req->state = FUSE_REQ_FINISHED;
if (req->background) {
-   if (fc->num_background == fc->max_background) {
+   req->background = 0;
+
+   if (fc->num_background == fc->max_background)
fc->blocked = 0;
-   wake_up_all(&fc->blocked_waitq);
-   }
+
+   if (!fc->blocked)
+   wake_up(&fc->blocked_waitq);
+
if (fc->num_background == fc->congestion_threshold &&
fc->connected && fc->bdi_initialized) {
 	clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);



[PATCH 3/4] fuse: skip blocking on allocations of synchronous requests

2013-03-21 Thread Maxim V. Patlasov
Miklos wrote:

> A task may have at most one synchronous request allocated. So these
> requests need not be otherwise limited.

The patch re-works fuse_get_req() to follow this idea.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/dev.c |   26 ++
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6137650..1f7ce89 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -137,17 +137,27 @@ struct fuse_req *fuse_get_req_internal(struct fuse_conn 
*fc, unsigned npages,
   bool for_background)
 {
struct fuse_req *req;
-   sigset_t oldset;
-   int intr;
int err;
+   int *flag_p = NULL;
 
 	atomic_inc(&fc->num_waiting);
-   block_sigs();
-   intr = wait_event_interruptible(fc->blocked_waitq, !fc->blocked);
-   restore_sigs();
-   err = -EINTR;
-   if (intr)
-   goto out;
+
+   if (for_background)
+   flag_p = &fc->blocked;
+   else if (fc->uninitialized)
+   flag_p = &fc->uninitialized;
+
+   if (flag_p) {
+   sigset_t oldset;
+   int intr;
+
+   block_sigs();
+   intr = wait_event_interruptible(fc->blocked_waitq, !*flag_p);
+   restore_sigs();
+   err = -EINTR;
+   if (intr)
+   goto out;
+   }
 
err = -ENOTCONN;
if (!fc->connected)



[PATCH 1/4] fuse: make request allocations for background processing explicit

2013-03-21 Thread Maxim V. Patlasov
There are two types of processing requests in FUSE: synchronous (via
fuse_request_send()) and asynchronous (via adding to fc->bg_queue).

Fortunately, the type of processing is always known in advance, at the time
of request allocation. This preparatory patch utilizes this fact by making
fuse_get_req() aware of the type. The next patches will use it.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/cuse.c   |2 +-
 fs/fuse/dev.c|   24 +---
 fs/fuse/file.c   |6 --
 fs/fuse/fuse_i.h |4 
 fs/fuse/inode.c  |1 +
 5 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..b7c7f30 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -422,7 +422,7 @@ static int cuse_send_init(struct cuse_conn *cc)
 
BUILD_BUG_ON(CUSE_INIT_INFO_MAX > PAGE_SIZE);
 
-   req = fuse_get_req(fc, 1);
+   req = fuse_get_req_for_background(fc, 1);
if (IS_ERR(req)) {
rc = PTR_ERR(req);
goto err;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e9bdec0..512626f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -86,7 +86,10 @@ EXPORT_SYMBOL_GPL(fuse_request_alloc);
 
 struct fuse_req *fuse_request_alloc_nofs(unsigned npages)
 {
-   return __fuse_request_alloc(npages, GFP_NOFS);
+   struct fuse_req *req = __fuse_request_alloc(npages, GFP_NOFS);
+   if (req)
+   req->background = 1; /* writeback always goes to bg_queue */
+   return req;
 }
 
 void fuse_request_free(struct fuse_req *req)
@@ -130,7 +133,8 @@ static void fuse_req_init_context(struct fuse_req *req)
req->in.h.pid = current->pid;
 }
 
-struct fuse_req *fuse_get_req(struct fuse_conn *fc, unsigned npages)
+struct fuse_req *fuse_get_req_internal(struct fuse_conn *fc, unsigned npages,
+  bool for_background)
 {
struct fuse_req *req;
sigset_t oldset;
@@ -156,14 +160,27 @@ struct fuse_req *fuse_get_req(struct fuse_conn *fc, 
unsigned npages)
 
fuse_req_init_context(req);
req->waiting = 1;
+   req->background = for_background;
return req;
 
  out:
 	atomic_dec(&fc->num_waiting);
return ERR_PTR(err);
 }
+
+struct fuse_req *fuse_get_req(struct fuse_conn *fc, unsigned npages)
+{
+   return fuse_get_req_internal(fc, npages, 0);
+}
 EXPORT_SYMBOL_GPL(fuse_get_req);
 
+struct fuse_req *fuse_get_req_for_background(struct fuse_conn *fc,
+unsigned npages)
+{
+   return fuse_get_req_internal(fc, npages, 1);
+}
+EXPORT_SYMBOL_GPL(fuse_get_req_for_background);
+
 /*
  * Return request in fuse_file->reserved_req.  However that may
  * currently be in use.  If that is the case, wait for it to become
@@ -442,6 +459,7 @@ __acquires(fc->lock)
 
 static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
 {
+   BUG_ON(req->background);
 	spin_lock(&fc->lock);
if (!fc->connected)
req->out.h.error = -ENOTCONN;
@@ -469,7 +487,7 @@ EXPORT_SYMBOL_GPL(fuse_request_send);
 static void fuse_request_send_nowait_locked(struct fuse_conn *fc,
struct fuse_req *req)
 {
-   req->background = 1;
+   BUG_ON(!req->background);
fc->num_background++;
if (fc->num_background == fc->max_background)
fc->blocked = 1;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c807176..097d48f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -131,6 +131,7 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
fuse_put_request(ff->fc, req);
} else {
req->end = fuse_release_end;
+   req->background = 1;
fuse_request_send_background(ff->fc, req);
}
kfree(ff);
@@ -661,7 +662,8 @@ static int fuse_readpages_fill(void *_data, struct page 
*page)
int nr_alloc = min_t(unsigned, data->nr_pages,
 FUSE_MAX_PAGES_PER_REQ);
fuse_send_readpages(req, data->file);
-   data->req = req = fuse_get_req(fc, nr_alloc);
+   data->req = req = fuse_get_req_internal(fc, nr_alloc,
+   fc->async_read);
if (IS_ERR(req)) {
unlock_page(page);
return PTR_ERR(req);
@@ -696,7 +698,7 @@ static int fuse_readpages(struct file *file, struct 
address_space *mapping,
 
data.file = file;
data.inode = inode;
-   data.req = fuse_get_req(fc, nr_alloc);
+   data.req = fuse_get_req_internal(fc, nr_alloc, fc->async_read);
data.nr_pages = nr_pages;
err = PTR_ERR(data.req);
if (IS_ERR(data.req))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6aeba86..457f62e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -708,6 +708,10 @@ void 

[PATCH 2/4] fuse: add flag fc->uninitialized

2013-03-21 Thread Maxim V. Patlasov
The existing flag fc->blocked is used to suspend request allocation both when
many background requests have been submitted and during the period before the
init reply arrives from userspace. The next patch will skip blocking allocations
of synchronous requests (disregarding fc->blocked). This is mostly OK, but
we still need to suspend allocations if the init reply has not arrived yet. The
patch introduces the flag fc->uninitialized, which will serve this purpose.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/cuse.c   |1 +
 fs/fuse/dev.c|1 +
 fs/fuse/fuse_i.h |4 
 fs/fuse/inode.c  |3 +++
 4 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index b7c7f30..8fe0998 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -505,6 +505,7 @@ static int cuse_channel_open(struct inode *inode, struct 
file *file)
 
cc->fc.connected = 1;
cc->fc.blocked = 0;
+   cc->fc.uninitialized = 0;
rc = cuse_send_init(cc);
if (rc) {
 	fuse_conn_put(&cc->fc);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 512626f..6137650 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2089,6 +2089,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
if (fc->connected) {
fc->connected = 0;
fc->blocked = 0;
+   fc->uninitialized = 0;
end_io_requests(fc);
end_queued_requests(fc);
end_polls(fc);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 457f62e..e893126 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -417,6 +417,10 @@ struct fuse_conn {
/** Batching of FORGET requests (positive indicates FORGET batch) */
int forget_batch;
 
+   /** Flag indicating that INIT reply is not received yet. Allocating
+* any fuse request will be suspended until the flag is cleared */
+   int uninitialized;
+
/** Flag indicating if connection is blocked.  This will be
the case before the INIT reply is received, and if there
are too many outstading backgrounds requests */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index de0bee0..0d14f03 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -362,6 +362,7 @@ void fuse_conn_kill(struct fuse_conn *fc)
 	spin_lock(&fc->lock);
 	fc->connected = 0;
 	fc->blocked = 0;
+   fc->uninitialized = 0;
 	spin_unlock(&fc->lock);
 	/* Flush all readers on this fs */
 	kill_fasync(&fc->fasync, SIGIO, POLL_IN);
@@ -582,6 +583,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->polled_files = RB_ROOT;
fc->reqctr = 0;
fc->blocked = 1;
+   fc->uninitialized = 1;
fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 }
@@ -881,6 +883,7 @@ static void process_init_reply(struct fuse_conn *fc, struct 
fuse_req *req)
fc->conn_init = 1;
}
fc->blocked = 0;
+   fc->uninitialized = 0;
 	wake_up_all(&fc->blocked_waitq);
 }
 



[PATCH 0/4] fuse: fix accounting background requests (v2)

2013-03-21 Thread Maxim V. Patlasov
Hi,

The feature was added a long time ago (commit 08a53cdc...) with the comment:

> A task may have at most one synchronous request allocated.  So these requests
> need not be otherwise limited.
>
> However the number of background requests (release, forget, asynchronous
> reads, interrupted requests) can grow indefinitely.  This can be used by a
> malicious user to cause FUSE to allocate arbitrary amounts of unswappable
> kernel memory, denying service.
>
> For this reason add a limit for the number of background requests, and block
> allocations of new requests until the number goes below the limit.

However, the implementation suffers from the following problems:

1. Latency of synchronous requests. As soon as fc->num_background hits the
limit, all allocations are blocked: both for synchronous and background
requests. This is unnecessary - as the comment cited above states, synchronous
requests need not be limited (by fuse). Moreover, sometimes it's very
inconvenient. For example, a dozen tasks aggressively writing to an mmap()-ed
area may block 'ls' for a long while (>1min in my experiments).

2. Thundering herd problem. When fc->num_background falls below the limit,
request_end() calls wake_up_all(&fc->blocked_waitq). This wakes up all waiters
even though the first waiter to get a new request may immediately put it in
the background, pushing fc->num_background over the limit again.
(Experimenting with mmap()-ed writes, I observed a 2x slowdown of stock fuse
as compared with fuse after applying this patch-set.)

The patch-set re-works fuse_get_req (and its callers) to throttle only requests
intended for background processing. Having done this, it becomes possible to
use exclusive wakeups in a chained manner: request_end() wakes up a waiter,
the waiter allocates a new request and submits it for background processing,
the processing ends in request_end() where another wakeup happens, and so on.

Changed in v2:
 - rebased on for-next branch of the fuse tree
 - fixed race when processing request begins before init-reply came

Thanks,
Maxim

---

Maxim V. Patlasov (4):
  fuse: make request allocations for background processing explicit
  fuse: add flag fc->uninitialized
  fuse: skip blocking on allocations of synchronous requests
  fuse: implement exclusive wakeup for blocked_waitq


 fs/fuse/cuse.c   |3 ++
 fs/fuse/dev.c|   69 +++---
 fs/fuse/file.c   |6 +++--
 fs/fuse/fuse_i.h |8 ++
 fs/fuse/inode.c  |4 +++
 5 files changed, 73 insertions(+), 17 deletions(-)

-- 
Signature


Re: [PATCH 0/3] fuse: fix accounting background requests

2013-03-21 Thread Maxim V. Patlasov

02/06/2013 09:12 PM, Miklos Szeredi пишет:

On Wed, Dec 26, 2012 at 1:44 PM, Maxim Patlasov  wrote:

Hi,

The feature was added long time ago (commit 08a53cdc...) with the comment:


A task may have at most one synchronous request allocated.  So these requests
need not be otherwise limited.

However the number of background requests (release, forget, asynchronous
reads, interrupted requests) can grow indefinitely.  This can be used by a
malicious user to cause FUSE to allocate arbitrary amounts of unswappable
kernel memory, denying service.

For this reason add a limit for the number of background requests, and block
allocations of new requests until the number goes below the limit.

However, the implementation suffers from the following problems:

1. Latency of synchronous requests. As soon as fc->num_background hits the
limit, all allocations are blocked: both for synchronous and background
requests. This is unnecessary - as the comment cited above states, synchronous
requests need not be limited (by fuse). Moreover, sometimes it's very
inconvenient. For example, a dozen of tasks aggressively writing to mmap()-ed
area may block 'ls' for a long while (>1min in my experiments).

2. Thundering herd problem. When fc->num_background falls below the limit,
request_end() calls wake_up_all(>blocked_waitq). This wakes up all waiters
while it's not impossible that the first waiter getting new request will
immediately put it to background increasing fc->num_background again.
(experimenting with mmap()-ed writes I observed 2x slowdown as compared with
fuse after applying this patch-set)

The patch-set re-works fuse_get_req (and its callers) to throttle only requests
intended for background processing. Having this done, it becomes possible to
use exclusive wakeups in chained manner: request_end() wakes up a waiter,
the waiter allocates new request and submits it for background processing,
the processing ends in request_end() where another wakeup happens, and so on.

Thanks.  These patches look okay.

But they don't apply to for-next.  Can you please update them?


Sorry for the long delay. I'll send updated patches soon.

Thanks,
Maxim



Thanks,
Miklos


Thanks,
Maxim

---

Maxim Patlasov (3):
   fuse: make request allocations for background processing explicit
   fuse: skip blocking on allocations of synchronous requests
   fuse: implement exclusive wakeup for blocked_waitq


  fs/fuse/cuse.c   |2 +-
  fs/fuse/dev.c|   60 +-
  fs/fuse/file.c   |5 +++--
  fs/fuse/fuse_i.h |3 +++
  fs/fuse/inode.c  |1 +
  5 files changed, 54 insertions(+), 17 deletions(-)

--
Signature






[PATCH 1/4] fuse: make request allocations for background processing explicit

2013-03-21 Thread Maxim V. Patlasov
There are two types of processing requests in FUSE: synchronous (via
fuse_request_send()) and asynchronous (via adding to fc->bg_queue).

Fortunately, the type of processing is always known in advance, at the time
of request allocation. This preparatory patch utilizes this fact by making
fuse_get_req() aware of the type. The next patches will use it.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/cuse.c   |2 +-
 fs/fuse/dev.c|   24 +---
 fs/fuse/file.c   |6 --
 fs/fuse/fuse_i.h |4 
 fs/fuse/inode.c  |1 +
 5 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..b7c7f30 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -422,7 +422,7 @@ static int cuse_send_init(struct cuse_conn *cc)
 
BUILD_BUG_ON(CUSE_INIT_INFO_MAX > PAGE_SIZE);
 
-   req = fuse_get_req(fc, 1);
+   req = fuse_get_req_for_background(fc, 1);
if (IS_ERR(req)) {
rc = PTR_ERR(req);
goto err;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e9bdec0..512626f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -86,7 +86,10 @@ EXPORT_SYMBOL_GPL(fuse_request_alloc);
 
 struct fuse_req *fuse_request_alloc_nofs(unsigned npages)
 {
-   return __fuse_request_alloc(npages, GFP_NOFS);
+   struct fuse_req *req = __fuse_request_alloc(npages, GFP_NOFS);
+   if (req)
+   req->background = 1; /* writeback always goes to bg_queue */
+   return req;
 }
 
 void fuse_request_free(struct fuse_req *req)
@@ -130,7 +133,8 @@ static void fuse_req_init_context(struct fuse_req *req)
req->in.h.pid = current->pid;
 }
 
-struct fuse_req *fuse_get_req(struct fuse_conn *fc, unsigned npages)
+struct fuse_req *fuse_get_req_internal(struct fuse_conn *fc, unsigned npages,
+  bool for_background)
 {
struct fuse_req *req;
sigset_t oldset;
@@ -156,14 +160,27 @@ struct fuse_req *fuse_get_req(struct fuse_conn *fc, unsigned npages)
 
fuse_req_init_context(req);
req->waiting = 1;
+   req->background = for_background;
return req;
 
  out:
atomic_dec(&fc->num_waiting);
return ERR_PTR(err);
 }
+
+struct fuse_req *fuse_get_req(struct fuse_conn *fc, unsigned npages)
+{
+   return fuse_get_req_internal(fc, npages, 0);
+}
 EXPORT_SYMBOL_GPL(fuse_get_req);
 
+struct fuse_req *fuse_get_req_for_background(struct fuse_conn *fc,
+unsigned npages)
+{
+   return fuse_get_req_internal(fc, npages, 1);
+}
+EXPORT_SYMBOL_GPL(fuse_get_req_for_background);
+
 /*
  * Return request in fuse_file-reserved_req.  However that may
  * currently be in use.  If that is the case, wait for it to become
@@ -442,6 +459,7 @@ __acquires(fc->lock)
 
 static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
 {
+   BUG_ON(req->background);
spin_lock(&fc->lock);
if (!fc->connected)
req->out.h.error = -ENOTCONN;
@@ -469,7 +487,7 @@ EXPORT_SYMBOL_GPL(fuse_request_send);
 static void fuse_request_send_nowait_locked(struct fuse_conn *fc,
struct fuse_req *req)
 {
-   req->background = 1;
+   BUG_ON(!req->background);
fc->num_background++;
if (fc->num_background == fc->max_background)
fc->blocked = 1;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c807176..097d48f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -131,6 +131,7 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
fuse_put_request(ff->fc, req);
} else {
req->end = fuse_release_end;
+   req->background = 1;
fuse_request_send_background(ff->fc, req);
}
kfree(ff);
@@ -661,7 +662,8 @@ static int fuse_readpages_fill(void *_data, struct page *page)
int nr_alloc = min_t(unsigned, data->nr_pages,
 FUSE_MAX_PAGES_PER_REQ);
fuse_send_readpages(req, data->file);
-   data->req = req = fuse_get_req(fc, nr_alloc);
+   data->req = req = fuse_get_req_internal(fc, nr_alloc,
+   fc->async_read);
if (IS_ERR(req)) {
unlock_page(page);
return PTR_ERR(req);
@@ -696,7 +698,7 @@ static int fuse_readpages(struct file *file, struct address_space *mapping,
 
data.file = file;
data.inode = inode;
-   data.req = fuse_get_req(fc, nr_alloc);
+   data.req = fuse_get_req_internal(fc, nr_alloc, fc->async_read);
data.nr_pages = nr_pages;
err = PTR_ERR(data.req);
if (IS_ERR(data.req))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6aeba86..457f62e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -708,6 +708,10 @@ void 

[PATCH 2/4] fuse: add flag fc->uninitialized

2013-03-21 Thread Maxim V. Patlasov
The existing flag fc->blocked is used to suspend request allocation both when
many background requests have been submitted and during the period before the
init reply arrives from userspace. The next patch will skip blocking
allocations of synchronous requests (disregarding fc->blocked). This is mostly
OK, but we still need to suspend allocations if the init reply has not arrived
yet. This patch introduces the flag fc->uninitialized, which will serve this
purpose.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/cuse.c   |1 +
 fs/fuse/dev.c|1 +
 fs/fuse/fuse_i.h |4 
 fs/fuse/inode.c  |3 +++
 4 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index b7c7f30..8fe0998 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -505,6 +505,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 
cc->fc.connected = 1;
cc->fc.blocked = 0;
+   cc->fc.uninitialized = 0;
rc = cuse_send_init(cc);
if (rc) {
fuse_conn_put(&cc->fc);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 512626f..6137650 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2089,6 +2089,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
if (fc->connected) {
fc->connected = 0;
fc->blocked = 0;
+   fc->uninitialized = 0;
end_io_requests(fc);
end_queued_requests(fc);
end_polls(fc);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 457f62e..e893126 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -417,6 +417,10 @@ struct fuse_conn {
/** Batching of FORGET requests (positive indicates FORGET batch) */
int forget_batch;
 
+   /** Flag indicating that INIT reply is not received yet. Allocating
+* any fuse request will be suspended until the flag is cleared */
+   int uninitialized;
+
/** Flag indicating if connection is blocked.  This will be
the case before the INIT reply is received, and if there
are too many outstanding background requests */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index de0bee0..0d14f03 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -362,6 +362,7 @@ void fuse_conn_kill(struct fuse_conn *fc)
spin_lock(&fc->lock);
fc->connected = 0;
fc->blocked = 0;
+   fc->uninitialized = 0;
spin_unlock(&fc->lock);
/* Flush all readers on this fs */
kill_fasync(&fc->fasync, SIGIO, POLL_IN);
@@ -582,6 +583,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->polled_files = RB_ROOT;
fc->reqctr = 0;
fc->blocked = 1;
+   fc->uninitialized = 1;
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 }
@@ -881,6 +883,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->conn_init = 1;
}
fc->blocked = 0;
+   fc->uninitialized = 0;
wake_up_all(&fc->blocked_waitq);
 }
 



[PATCH 3/4] fuse: skip blocking on allocations of synchronous requests

2013-03-21 Thread Maxim V. Patlasov
Miklos wrote:

 A task may have at most one synchronous request allocated. So these
 requests need not be otherwise limited.

The patch re-works fuse_get_req() to follow this idea.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/dev.c |   26 ++
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6137650..1f7ce89 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -137,17 +137,27 @@ struct fuse_req *fuse_get_req_internal(struct fuse_conn *fc, unsigned npages,
   bool for_background)
 {
struct fuse_req *req;
-   sigset_t oldset;
-   int intr;
int err;
+   int *flag_p = NULL;
 
atomic_inc(&fc->num_waiting);
-   block_sigs(&oldset);
-   intr = wait_event_interruptible(fc->blocked_waitq, !fc->blocked);
-   restore_sigs(&oldset);
-   err = -EINTR;
-   if (intr)
-   goto out;
+
+   if (for_background)
+   flag_p = &fc->blocked;
+   else if (fc->uninitialized)
+   flag_p = &fc->uninitialized;
+
+   if (flag_p) {
+   sigset_t oldset;
+   int intr;
+
+   block_sigs(&oldset);
+   intr = wait_event_interruptible(fc->blocked_waitq, !*flag_p);
+   restore_sigs(&oldset);
+   err = -EINTR;
+   if (intr)
+   goto out;
+   }
 
err = -ENOTCONN;
if (!fc->connected)



[PATCH 4/4] fuse: implement exclusive wakeup for blocked_waitq

2013-03-21 Thread Maxim V. Patlasov
This patch solves the thundering herd problem. Since the previous patches
ensured that only allocations for background processing may block, it is safe
to wake up a single waiter. Whoever it is, it will wake up another waiter in
request_end() afterwards.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/dev.c |   20 
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 1f7ce89..ea99e2a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -152,7 +152,8 @@ struct fuse_req *fuse_get_req_internal(struct fuse_conn *fc, unsigned npages,
int intr;
 
block_sigs(&oldset);
-   intr = wait_event_interruptible(fc->blocked_waitq, !*flag_p);
+   intr = wait_event_interruptible_exclusive(fc->blocked_waitq,
+ !*flag_p);
restore_sigs(&oldset);
err = -EINTR;
if (intr)
@@ -265,6 +266,13 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 void fuse_put_request(struct fuse_conn *fc, struct fuse_req *req)
 {
if (atomic_dec_and_test(&req->count)) {
+   if (unlikely(req->background)) {
+   spin_lock(&fc->lock);
+   if (!fc->blocked)
+   wake_up(&fc->blocked_waitq);
+   spin_unlock(&fc->lock);
+   }
+
if (req->waiting)
atomic_dec(&fc->num_waiting);
 
@@ -362,10 +370,14 @@ __releases(fc->lock)
list_del(&req->intr_entry);
req->state = FUSE_REQ_FINISHED;
if (req->background) {
-   if (fc->num_background == fc->max_background) {
+   req->background = 0;
+
+   if (fc->num_background == fc->max_background)
fc->blocked = 0;
-   wake_up_all(&fc->blocked_waitq);
-   }
+
+   if (!fc->blocked)
+   wake_up(&fc->blocked_waitq);
+
if (fc->num_background == fc->congestion_threshold &&
fc->connected && fc->bdi_initialized) {
clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);



[PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2013-01-25 Thread Maxim V. Patlasov
Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is high enough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.

Note that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in this function it means that it is either
writing to the mountpoint it serves (but it can deadlock even without
the writeback) or it is writing to some _other_ dirty bdi, and in the
latter case someone else will free the memory for it.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 mm/page-writeback.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..c47bcd4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 */
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+   global_page_state(NR_WRITEBACK_TEMP);
 
global_dirty_limits(_thresh, _thresh);
 



[PATCH 13/14] fuse: Turn writeback cache on

2013-01-25 Thread Maxim V. Patlasov
Introduce a flag that the kernel and userspace exchange with each other during
the init stage, and turn writeback on if userspace requests it and the mount
option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file to the per-inode write list and call
generic_file_aio_write() to make use of the Linux page cache engine.

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c|   15 +++
 fs/fuse/fuse_i.h  |4 
 fs/fuse/inode.c   |   13 +
 include/uapi/linux/fuse.h |2 ++
 4 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e6e064c..147e618 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -211,6 +211,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
spin_unlock(&fc->lock);
fuse_invalidate_attr(inode);
}
+   if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+   fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1096,6 +1098,19 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct iov_iter i;
loff_t endbyte = 0;
 
+   if (get_fuse_conn(inode)->writeback_cache) {
+   if (!(file->f_flags & O_DIRECT)) {
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+
+   spin_lock(&fc->lock);
+   inode->i_mtime = current_fs_time(inode->i_sb);
+   set_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+   spin_unlock(&fc->lock);
+   }
+   return generic_file_aio_write(iocb, iov, nr_segs, pos);
+   }
+
WARN_ON(iocb->ki_pos != pos);
 
ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 86caa8c..a207744 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
 doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER (1 << 1)
 
+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
module will enable support of writeback cache */
+#define FUSE_ALLOW_WBCACHE   (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3687daf..541fc4f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -521,6 +521,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+   OPT_ALLOW_WBCACHE,
OPT_ERR
 };
 
@@ -533,6 +534,7 @@ static const match_table_t tokens = {
{OPT_ALLOW_OTHER,   "allow_other"},
{OPT_MAX_READ,  "max_read=%u"},
{OPT_BLKSIZE,   "blksize=%u"},
+   {OPT_ALLOW_WBCACHE, "allow_wbcache"},
{OPT_ERR,   NULL}
 };
 
@@ -606,6 +608,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
d->blksize = value;
break;
 
+   case OPT_ALLOW_WBCACHE:
+   d->flags |= FUSE_ALLOW_WBCACHE;
+   break;
+
default:
return 0;
}
@@ -633,6 +639,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+   if (fc->flags & FUSE_ALLOW_WBCACHE)
+   seq_puts(m, ",allow_wbcache");
return 0;
 }
 
@@ -944,6 +952,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->auto_inval_data = 1;
if (arg->flags & FUSE_DO_READDIRPLUS)
fc->do_readdirplus = 1;
+   if (arg->flags & FUSE_WRITEBACK_CACHE &&
+   fc->flags & FUSE_ALLOW_WBCACHE)
+   fc->writeback_cache = 1;
} else {
ra_pages = fc->max_read / PAGE_CACHE_SIZE;
fc->no_lock = 1;
@@ -972,6 +983,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA |
FUSE_DO_READDIRPLUS;
+   if (fc->flags & FUSE_ALLOW_WBCACHE)
+   arg->flags |= FUSE_WRITEBACK_CACHE;
req->in.h.opcode = FUSE_INIT;
req->in.numargs = 1;
req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5dc1fea..20a9553 100644
--- a/include/uapi/linux/fuse.h
+++ 

[PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2

2013-01-25 Thread Maxim V. Patlasov
The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the one that was in a file
before the 1st op. Problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits, without waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/cuse.c   |5 +++--
 fs/fuse/file.c   |   49 +++--
 fs/fuse/fuse_i.h |   13 -
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..fb63185 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -93,7 +93,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
loff_t pos = 0;
struct iovec iov = { .iov_base = buf, .iov_len = count };
 
-   return fuse_direct_io(file, &iov, 1, count, &pos, 0);
+   return fuse_direct_io(file, &iov, 1, count, &pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -106,7 +106,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 * No locking or generic_write_checks(), the server is
 * responsible for locking and sanity checks.
 */
-   return fuse_direct_io(file, &iov, 1, count, &pos, 1);
+   return fuse_direct_io(file, &iov, 1, count, &pos,
+ FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3767824..e6e064c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -349,6 +349,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+   pgoff_t idx_to)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_req *req;
+   bool found = false;
+
+   spin_lock(&fc->lock);
+   list_for_each_entry(req, &fi->writepages, writepages_entry) {
+   pgoff_t curr_index;
+
+   BUG_ON(req->inode != inode);
+   curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+   if (!(idx_from >= curr_index + req->num_pages ||
+ idx_to < curr_index)) {
+   found = true;
+   break;
+   }
+   }
+   spin_unlock(&fc->lock);
+
+   return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -393,6 +418,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+  size_t bytes)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   pgoff_t idx_from, idx_to;
+
+   idx_from = start >> PAGE_CACHE_SHIFT;
+   idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+   wait_event(fi->page_waitq,
+  !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
struct inode *inode = file->f_path.dentry->d_inode;
@@ -1245,8 +1283,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 
 ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
   unsigned long nr_segs, size_t count, loff_t *ppos,
-  int write)
+  int flags)
 {
+   int write = flags & FUSE_DIO_WRITE;
+   int cuse = flags & FUSE_DIO_CUSE;
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fc;
size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1271,6 +1311,10 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
break;
}
 
+   if (!cuse)
+   fuse_wait_on_writeback(file->f_mapping->host, pos,
+  nbytes);
+
if (write)
nres = fuse_send_write(req, file, pos, nbytes, owner);
else
@@ -1339,7 +1383,8 @@ static ssize_t __fuse_direct_write(struct file *file, const struct iovec *iov,
 
res = generic_write_checks(file, ppos, &count, 0);
if (!res) {
-   res = 

[PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback

2013-01-25 Thread Maxim V. Patlasov
fuse_writepage_locked() should never submit new i/o for a given page->index
if another request for that page is already in progress. In most cases it is
safe to wait on the page writeback. But if we were called due to memory
shortage (WB_SYNC_NONE), we should redirty the page rather than block the
caller.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |   18 +++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3b4dc98..52c7d81 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1472,7 +1472,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
return ff;
 }
 
-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+struct writeback_control *wbc)
 {
struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
@@ -1481,6 +1482,14 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_req *req;
struct page *tmp_page;
 
+   if (fuse_page_is_writeback(inode, page->index)) {
+   if (wbc->sync_mode != WB_SYNC_ALL) {
+   redirty_page_for_writepage(wbc, page);
+   return 0;
+   }
+   fuse_wait_on_page_writeback(inode, page->index);
+   }
+
set_page_writeback(page);
 
req = fuse_request_alloc_nofs(1);
@@ -1527,7 +1536,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
int err;
 
-   err = fuse_writepage_locked(page);
+   err = fuse_writepage_locked(page, wbc);
unlock_page(page);
 
return err;
@@ -1812,7 +1821,10 @@ static int fuse_launder_page(struct page *page)
int err = 0;
if (clear_page_dirty_for_io(page)) {
struct inode *inode = page->mapping->host;
-   err = fuse_writepage_locked(page);
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_ALL,
+   };
+   err = fuse_writepage_locked(page, &wbc);
if (!err)
fuse_wait_on_page_writeback(inode, page->index);
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 11/14] fuse: fuse_flush() should wait on writeback

2013-01-25 Thread Maxim V. Patlasov
The aim of the .flush fop is to hint the filesystem that flushing its state,
caches, or any other important data to reliable storage would be desirable
now. fuse_flush() passes this hint on by sending a FUSE_FLUSH request to
userspace. However, dirty pages and pages under writeback may not be visible
to userspace yet unless we ensure it explicitly.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 52c7d81..3767824 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,7 @@
 #include 
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
@@ -414,6 +415,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fc->no_flush)
return 0;
 
+   err = filemap_write_and_wait(file->f_mapping);
+   if (err)
+   return err;
+
+   mutex_lock(&inode->i_mutex);
+   fuse_sync_writes(inode);
+   mutex_unlock(&inode->i_mutex);
+
req = fuse_get_req_nofail_nopages(fc, file);
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;



[PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v2

2013-01-25 Thread Maxim V. Patlasov
The .writepages callback is required so that each writeback request can carry
more than one page.

Changed in v2:
 - fixed fuse_prepare_write() to avoid reads beyond EOF
 - fixed fuse_prepare_write() to zero uninitialized part of page

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim V. Patlasov 
---
 fs/fuse/file.c |  282 
 1 files changed, 281 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 496e74c..3b4dc98 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -722,7 +722,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
 
 struct fuse_fill_data {
struct fuse_req *req;
-   struct file *file;
+   union {
+   struct file *file;
+   struct fuse_file *ff;
+   };
struct inode *inode;
unsigned nr_pages;
 };
@@ -1530,6 +1533,280 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+   int i, all_ok = 1;
+   struct fuse_req *req = data->req;
+   struct inode *inode = data->inode;
+   struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   loff_t off = -1;
+
+   if (!data->ff)
+   data->ff = fuse_write_file(fc, fi);
+
+   if (!data->ff) {
+   for (i = 0; i < req->num_pages; i++)
+   end_page_writeback(req->pages[i]);
+   return -EIO;
+   }
+
+   req->inode = inode;
+   req->misc.write.in.offset = page_offset(req->pages[0]);
+
+   spin_lock(&fc->lock);
+   list_add(&req->writepages_entry, &fi->writepages);
+   spin_unlock(&fc->lock);
+
+   for (i = 0; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   struct page *tmp_page;
+
+   tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+   if (tmp_page) {
+   copy_highpage(tmp_page, page);
+   inc_bdi_stat(bdi, BDI_WRITEBACK);
+   inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+   } else
+   all_ok = 0;
+   req->pages[i] = tmp_page;
+   if (i == 0)
+   off = page_offset(page);
+
+   end_page_writeback(page);
+   }
+
+   if (!all_ok) {
+   for (i = 0; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   if (page) {
+   dec_bdi_stat(bdi, BDI_WRITEBACK);
+   dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+   __free_page(page);
+   req->pages[i] = NULL;
+   }
+   }
+
+   spin_lock(&fc->lock);
+   list_del(&req->writepages_entry);
+   wake_up(&fi->page_waitq);
+   spin_unlock(&fc->lock);
+   return -ENOMEM;
+   }
+
+   req->ff = fuse_file_get(data->ff);
+   fuse_write_fill(req, data->ff, off, 0);
+
+   req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+   req->in.argpages = 1;
+   fuse_page_descs_length_init(req, 0, req->num_pages);
+   req->end = fuse_writepage_end;
+
+   spin_lock(&fc->lock);
+   list_add_tail(&req->list, &fi->queued_writes);
+   fuse_flush_writepages(data->inode);
+   spin_unlock(&fc->lock);
+
+   return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+   struct writeback_control *wbc, void *_data)
+{
+   struct fuse_fill_data *data = _data;
+   struct fuse_req *req = data->req;
+   struct inode *inode = data->inode;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   if (fuse_page_is_writeback(inode, page->index)) {
+   if (wbc->sync_mode != WB_SYNC_ALL) {
+   redirty_page_for_writepage(wbc, page);
+   unlock_page(page);
+   return 0;
+   }
+   fuse_wait_on_page_writeback(inode, page->index);
+   }
+
+   if (req->num_pages &&
+   (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+(req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+   int err;
+
+   err = fuse_send_writepages(data);
+   if (err) {
+   unlock_page(page);
+   return err;
+   }
+
+   data->req = req =
+   fuse_request_alloc_nofs(FUSE_MAX_PAG

[PATCH 08/14] fuse: Flush files on wb close

2013-01-25 Thread Maxim V. Patlasov
Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding writeback cache. Note that simply calling
filemap_write_and_wait() is not enough, since fuse finishes page writeback
immediately and thus the wait part of that call is a no-op. Do a real wait
on the per-inode writepages list.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/file.c |   26 +-
 1 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4f8fa45..496e74c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -138,6 +138,12 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
}
 }
 
+static void __fuse_file_put(struct fuse_file *ff)
+{
+   if (atomic_dec_and_test(&ff->count))
+   BUG();
+}
+
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 bool isdir)
 {
@@ -286,8 +292,23 @@ static int fuse_open(struct inode *inode, struct file *file)
return fuse_open_common(inode, file, false);
 }
 
+static void fuse_flush_writeback(struct inode *inode, struct file *file)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+
+   filemap_write_and_wait(file->f_mapping);
+   wait_event(fi->page_waitq, list_empty_careful(&fi->writepages));
+   spin_unlock_wait(&fc->lock);
+}
+
 static int fuse_release(struct inode *inode, struct file *file)
 {
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   if (fc->writeback_cache)
+   fuse_flush_writeback(inode, file);
+
fuse_release_common(file, FUSE_RELEASE);
 
/* return value is ignored by VFS */
@@ -1343,7 +1364,8 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 
for (i = 0; i < req->num_pages; i++)
__free_page(req->pages[i]);
-   fuse_file_put(req->ff, false);
+   if (!fc->writeback_cache)
+   fuse_file_put(req->ff, false);
 }
 
 static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
@@ -1360,6 +1382,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
}
bdi_writeout_inc(bdi);
wake_up(>page_waitq);
+   if (fc->writeback_cache)
+   __fuse_file_put(req->ff);
 }
 
 /* Called under fc->lock, may release and reacquire it */



[PATCH 07/14] fuse: Update i_mtime on buffered writes

2013-01-25 Thread Maxim V. Patlasov
If the writeback cache is on, a buffered write doesn't result in an immediate
mtime update in userspace, because userspace will see the modified data later,
when writeback happens. Consequently, the mtime provided by userspace may be
older than the actual time of the buffered write.

The problem can be solved by generating mtime locally (will come in next
patches) and flushing it to userspace periodically. Here we introduce a flag to
keep the state of fuse_inode: the flag is ON if and only if locally generated
mtime (stored in inode->i_mtime) was not pushed to the userspace yet.

The patch also implements all bits related to flushing and clearing the flag.

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/dir.c|   42 +
 fs/fuse/file.c   |   31 ++---
 fs/fuse/fuse_i.h |   13 -
 fs/fuse/inode.c  |   79 +-
 4 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ff8b603..969c60d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -177,6 +177,13 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (flags & LOOKUP_RCU)
return -ECHILD;
 
+   if (test_bit(FUSE_I_MTIME_UPDATED,
+    &get_fuse_inode(inode)->state)) {
+   err = fuse_flush_mtime(inode, 0);
+   if (err)
+   return 0;
+   }
+
fc = get_fuse_conn(inode);
req = fuse_get_req_nopages(fc);
if (IS_ERR(req))
@@ -839,7 +846,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 }
 
 static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
-  struct file *file)
+  struct file *file, int locked)
 {
int err;
struct fuse_getattr_in inarg;
@@ -848,6 +855,12 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
struct fuse_req *req;
u64 attr_version;
 
+   if (test_bit(FUSE_I_MTIME_UPDATED, &get_fuse_inode(inode)->state)) {
+   err = fuse_flush_mtime(inode, locked);
+   if (err)
+   return err;
+   }
+
req = fuse_get_req_nopages(fc);
if (IS_ERR(req))
return PTR_ERR(req);
@@ -893,7 +906,7 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
 }
 
 int fuse_update_attributes(struct inode *inode, struct kstat *stat,
-  struct file *file, bool *refreshed)
+  struct file *file, bool *refreshed, int locked)
 {
struct fuse_inode *fi = get_fuse_inode(inode);
int err;
@@ -901,7 +914,7 @@ int fuse_update_attributes(struct inode *inode, struct kstat *stat,
 
if (fi->i_time < get_jiffies_64()) {
r = true;
-   err = fuse_do_getattr(inode, stat, file);
+   err = fuse_do_getattr(inode, stat, file, locked);
} else {
r = false;
err = 0;
@@ -1055,7 +1068,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask)
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
 
-   return fuse_do_getattr(inode, NULL, NULL);
+   return fuse_do_getattr(inode, NULL, NULL, 0);
 }
 
 /*
@@ -1524,6 +1537,12 @@ void fuse_release_nowrite(struct inode *inode)
spin_unlock(&fc->lock);
 }
 
+static inline bool fuse_operation_updates_mtime_on_server(unsigned ivalid)
+{
+   return (ivalid & ATTR_SIZE) ||
+   ((ivalid & ATTR_MTIME) && update_mtime(ivalid));
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1564,6 +1583,15 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
if (attr->ia_valid & ATTR_SIZE)
is_truncate = true;
 
+   if (!fuse_operation_updates_mtime_on_server(attr->ia_valid)) {
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   if (test_bit(FUSE_I_MTIME_UPDATED, &fi->state)) {
+   err = fuse_flush_mtime(inode, 1);
+   if (err)
+   return err;
+   }
+   }
+
req = fuse_get_req_nopages(fc);
if (IS_ERR(req))
return PTR_ERR(req);
@@ -1611,6 +1639,10 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
}
 
spin_lock(&fc->lock);
+   if (fuse_operation_updates_mtime_on_server(attr->ia_valid)) {
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+   }
fuse_change_attributes_common(inode, &outarg.attr,
  attr_timeout(&outarg));
oldsize = inode->i_size;
@@ -1659,7 +1691,7 @@ static int fuse_getattr(struct vfsmount *mnt, struct dentry *entry,
if (!fuse_allow_task(fc, current))
  

[PATCH 06/14] fuse: Trust kernel i_size only - v2

2013-01-25 Thread Maxim V. Patlasov
Make fuse think that when writeback is on the inode's i_size is always
up-to-date and not update it with the value received from the userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim V. Patlasov 
---
 fs/fuse/dir.c   |9 ++---
 fs/fuse/file.c  |   43 +--
 fs/fuse/inode.c |6 --
 3 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ed8f8c5..ff8b603 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -827,7 +827,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->mtime.tv_nsec = attr->mtimensec;
stat->ctime.tv_sec = attr->ctime;
stat->ctime.tv_nsec = attr->ctimensec;
-   stat->size = attr->size;
+   stat->size = i_size_read(inode);
stat->blocks = attr->blocks;
 
if (attr->blksize != 0)
@@ -1541,6 +1541,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
+   bool is_wb = fc->writeback_cache;
loff_t oldsize;
int err;
 
@@ -1613,7 +1614,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
fuse_change_attributes_common(inode, &outarg.attr,
  attr_timeout(&outarg));
oldsize = inode->i_size;
-   i_size_write(inode, outarg.attr.size);
+   if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+   i_size_write(inode, outarg.attr.size);
 
if (is_truncate) {
/* NOTE: this may release/reacquire fc->lock */
@@ -1625,7 +1627,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 * Only call invalidate_inode_pages2() after removing
 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 */
-   if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+   if ((is_truncate || !is_wb) &&
+   S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
truncate_pagecache(inode, oldsize, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b28be33..6b64e11 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static const struct file_operations fuse_direct_io_file_operations;
 
@@ -543,9 +544,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
u64 attr_ver)
 {
size_t num_read = req->out.args[0].size;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   if (fc->writeback_cache) {
+   /*
+* A hole in a file. Some data after the hole are in page cache,
+* but have not reached the client fs yet. So, the hole is not
+* present there.
+*/
+   int i;
+   int start_idx = num_read >> PAGE_CACHE_SHIFT;
+   size_t off = num_read & (PAGE_CACHE_SIZE - 1);
 
-   loff_t pos = page_offset(req->pages[0]) + num_read;
-   fuse_read_update_size(inode, pos, attr_ver);
+   for (i = start_idx; i < req->num_pages; i++) {
+   struct page *page = req->pages[i];
+   void *mapaddr = kmap_atomic(page);
+
+   memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+   kunmap_atomic(mapaddr);
+   off = 0;
+   }
+   } else {
+   loff_t pos = page_offset(req->pages[0]) + num_read;
+   fuse_read_update_size(inode, pos, attr_ver);
+   }
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
@@ -2285,6 +2308,8 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
.mode = mode
};
int err;
+   bool change_i_size = fc->writeback_cache &&
+   !(mode & FALLOC_FL_KEEP_SIZE);
 
if (fc->no_fallocate)
return -EOPNOTSUPP;
@@ -2293,6 +2318,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (IS_ERR(req))
return PTR_ERR(req);
 
+   if (change_i_size) {
+   struct inode *inode

[PATCH 04/14] fuse: Prepare to handle multiple pages in writeback

2013-01-25 Thread Maxim V. Patlasov
The .writepages callback will issue writeback requests with more than one
page aboard. Make the existing end/check code aware of this.

Original patch by: Pavel Emelyanov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |   22 +++---
 1 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1d76283..b28be33 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
 
BUG_ON(req->inode != inode);
curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-   if (curr_index == index) {
+   if (curr_index <= index &&
+   index < curr_index + req->num_pages) {
found = true;
break;
}
@@ -1295,7 +1296,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-   __free_page(req->pages[0]);
+   int i;
+
+   for (i = 0; i < req->num_pages; i++)
+   __free_page(req->pages[i]);
fuse_file_put(req->ff, false);
 }
 
@@ -1304,10 +1308,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   int i;
 
list_del(&req->writepages_entry);
-   dec_bdi_stat(bdi, BDI_WRITEBACK);
-   dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+   for (i = 0; i < req->num_pages; i++) {
+   dec_bdi_stat(bdi, BDI_WRITEBACK);
+   dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+   }
bdi_writeout_inc(bdi);
wake_up(&fi->page_waitq);
 }
@@ -1320,14 +1327,15 @@ __acquires(fc->lock)
struct fuse_inode *fi = get_fuse_inode(req->inode);
loff_t size = i_size_read(req->inode);
struct fuse_write_in *inarg = &req->misc.write.in;
+   __u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
if (!fc->connected)
goto out_free;
 
-   if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-   inarg->size = PAGE_CACHE_SIZE;
+   if (inarg->offset + data_size <= size) {
+   inarg->size = data_size;
} else if (inarg->offset < size) {
-   inarg->size = size & (PAGE_CACHE_SIZE - 1);
+   inarg->size = size - inarg->offset;
} else {
/* Got truncated off completely */
goto out_free;



[PATCH 05/14] fuse: Connection bit for enabling writeback

2013-01-25 Thread Maxim V. Patlasov
Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/fuse_i.h |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 13befcd..65d76cd 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -446,6 +446,9 @@ struct fuse_conn {
/** Set if bdi is valid */
unsigned bdi_initialized:1;
 
+   /** write-back cache policy (default is write-through) */
+   unsigned writeback_cache:1;
+
/*
 * The following bitfields are only for optimization purposes
 * and hence races in setting them will not cause malfunction



[PATCH 02/14] fuse: Getting file for writeback helper

2013-01-25 Thread Maxim V. Patlasov
There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/file.c |   24 
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9bdc0fa..ad89b21 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1375,6 +1375,20 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
fuse_writepage_free(fc, req);
 }
 
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
+struct fuse_inode *fi)
+{
+   struct fuse_file *ff;
+
+   spin_lock(&fc->lock);
+   BUG_ON(list_empty(&fi->write_files));
+   ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+   fuse_file_get(ff);
+   spin_unlock(&fc->lock);
+
+   return ff;
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
struct address_space *mapping = page->mapping;
@@ -1382,7 +1396,6 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_req *req;
-   struct fuse_file *ff;
struct page *tmp_page;
 
set_page_writeback(page);
@@ -1395,13 +1408,8 @@ static int fuse_writepage_locked(struct page *page)
if (!tmp_page)
goto err_free;
 
-   spin_lock(&fc->lock);
-   BUG_ON(list_empty(&fi->write_files));
-   ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
-   req->ff = fuse_file_get(ff);
-   spin_unlock(&fc->lock);
-
-   fuse_write_fill(req, ff, page_offset(page), 0);
+   req->ff = fuse_write_file(fc, fi);
+   fuse_write_fill(req, req->ff, page_offset(page), 0);
 
copy_highpage(tmp_page, page);
req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;



[PATCH 03/14] fuse: Prepare to handle short reads

2013-01-25 Thread Maxim V. Patlasov
A helper which gets called when a read reports fewer bytes than were requested.
See patch #6 (trust kernel i_size only) for details.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/file.c |   21 +
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad89b21..1d76283 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -538,6 +538,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+   u64 attr_ver)
+{
+   size_t num_read = req->out.args[0].size;
+
+   loff_t pos = page_offset(req->pages[0]) + num_read;
+   fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
struct inode *inode = page->mapping->host;
@@ -574,18 +583,18 @@ static int fuse_readpage(struct file *file, struct page *page)
req->page_descs[0].length = count;
num_read = fuse_send_read(req, file, pos, count, NULL);
err = req->out.h.error;
-   fuse_put_request(fc, req);
 
if (!err) {
/*
 * Short read means EOF.  If file size is larger, truncate it
 */
if (num_read < count)
-   fuse_read_update_size(inode, pos + num_read, attr_ver);
+   fuse_short_read(req, inode, attr_ver);
 
SetPageUptodate(page);
}
 
+   fuse_put_request(fc, req);
fuse_invalidate_attr(inode); /* atime changed */
  out:
unlock_page(page);
@@ -608,13 +617,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
/*
 * Short read means EOF. If file size is larger, truncate it
 */
-   if (!req->out.h.error && num_read < count) {
-   loff_t pos;
+   if (!req->out.h.error && num_read < count)
+   fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-   pos = page_offset(req->pages[0]) + num_read;
-   fuse_read_update_size(inode, pos,
- req->misc.read.attr_ver);
-   }
fuse_invalidate_attr(inode); /* atime changed */
}
 



[PATCH 01/14] fuse: Linking file to inode helper

2013-01-25 Thread Maxim V. Patlasov
When writeback is ON, every writeable file should be on the per-inode write
list, not only the mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov 
Signed-off-by: Pavel Emelyanov 
---
 fs/fuse/file.c |   33 +++--
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 28bc9c6..9bdc0fa 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -167,6 +167,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+   struct inode *inode = file->f_dentry->d_inode;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_file *ff = file->private_data;
+   /*
+* file may be written through mmap, so chain it onto the
+* inodes's write_file list
+*/
+   spin_lock(&fc->lock);
+   if (list_empty(&ff->write_entry))
+   list_add(&ff->write_entry, &fi->write_files);
+   spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
struct fuse_file *ff = file->private_data;
@@ -1484,20 +1500,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-   struct inode *inode = file->f_dentry->d_inode;
-   struct fuse_conn *fc = get_fuse_conn(inode);
-   struct fuse_inode *fi = get_fuse_inode(inode);
-   struct fuse_file *ff = file->private_data;
-   /*
-* file may be written through mmap, so chain it onto the
-* inodes's write_file list
-*/
-   spin_lock(&fc->lock);
-   if (list_empty(&ff->write_entry))
-   list_add(&ff->write_entry, &fi->write_files);
-   spin_unlock(&fc->lock);
-   }
+   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+   fuse_link_write_file(file);
+
file_accessed(file);
vma->vm_ops = &fuse_file_vm_ops;
return 0;



[PATCH v3 00/14] fuse: An attempt to implement a write-back cache policy

2013-01-25 Thread Maxim V. Patlasov
Hi,

This is the second iteration of Pavel Emelyanov's patch-set implementing
write-back policy for FUSE page cache. Initial patch-set description was
the following:

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file the cp pushes every
128k to the userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution of this is switching the FUSE page cache into a write-back
policy. With this, file data are pushed to the userspace in big chunks
(depending on the dirty memory limits, but much more than 128k), which lets
the FUSE daemons handle the size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?). When the
writeback is turned ON:

* still copy writeback pages to a temporary buffer when sending a writeback
  request and finish the page writeback immediately

* make the kernel maintain the inode's i_size to avoid frequent i_size
  synchronization with the user space

* take NR_WRITEBACK_TEMP into account when making the balance_dirty_pages
  decision. This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not
yet all finished, but the mentioned copying of a huge file becomes noticeably
faster even on machines with little RAM and doesn't make the system get stuck
(the dirty pages balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage
implementation, which is heavily oriented on storing big blobs of data with
extremely rare meta-data updates (virtual machines' and containers' disk
images). With the existing cache policy a typical usage scenario -- copying a
big VM disk into a cloud -- takes way too much time to proceed, much longer
than if it was simply scp-ed over the same network. The write-back policy (as
I mentioned) noticeably improves this scenario. Kirill (in Cc) can share more
details about the performance and the storage concepts if required.

Changed in v2:
 - numerous bugfixes:
   - fuse_write_begin, fuse_writepages_fill and fuse_writepage_locked must
     wait on page writeback, because page writeback can extend beyond the
     lifetime of the page-cache page
   - fuse_send_writepages can end_page_writeback on the original page only
     after adding the request to the fi->writepages list; otherwise another
     writeback may happen inside the gap between end_page_writeback and
     adding to the list
   - fuse_direct_io must wait on page writeback; otherwise data corruption is
     possible due to reordering of requests
   - fuse_flush must flush dirty memory and wait for all writeback on the
     given inode before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH
     is not reliable
   - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and the
     i_size update; otherwise a race with a writer extending i_size is
     possible
   - fix handling errors in fuse_writepages and fuse_send_writepages
 - handle i_mtime intelligently if writeback cache is on (see patch #7 (update
   i_mtime on buffered writes) for details)
 - put enabling writeback cache under fusermount control (see mount option
   'allow_wbcache' introduced by patch #13 (turn writeback cache on))
 - rebased on v3.7-rc5

Changed in v3:
 - rebased on for-next branch of the fuse tree

Thanks,
Maxim

---

Maxim V. Patlasov (14):
  fuse: Linking file to inode helper
  fuse: Getting file for writeback helper
  fuse: Prepare to handle short reads
  fuse: Prepare to handle multiple pages in writeback
  fuse: Connection bit for enabling writeback
  fuse: Trust kernel i_size only - v2
  fuse: Update i_mtime on buffered writes
  fuse: Flush files on wb close
  fuse: Implement writepages and write_begin/write_end callbacks - v2
  fuse: fuse_writepage_locked() should wait on writeback
  fuse: fuse_flush() should wait on writeback
  fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
  fuse: Turn writeback cache on
  mm: Account for WRITEBACK_TEMP in balance_dirty_pages


 fs/fuse/cuse.c|5 
 fs/fuse/dir.c |   51 +++-
 fs/fuse/file.c|  567 +
 fs/fuse/fuse_i.h  |   33 ++-
 fs/fuse/inode.c   |   98 
 include/uapi/linux/fuse.h |2 
 mm/page-writeback.c   |3 
 7 files changed, 696 insertions(+), 63 deletions(-)

-- 
Signature
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy

2013-01-25 Thread Maxim V. Patlasov

Hi Miklos,

01/25/2013 02:21 PM, Miklos Szeredi writes:

On Tue, Jan 15, 2013 at 4:20 PM, Maxim V. Patlasov
<mpatla...@parallels.com> wrote:

Heard nothing from you for two months. Any feedback would still be
appreciated.

Sorry about the long silence.

I haven't done a detailed review yet.  It would be good if you could
resend the patchset against the for-next branch of the fuse tree.


OK.


I see that you have some other patchsets pending.   Are they independent?


They are logically independent, but some of them may require cosmetic 
changes to be applied on top of the others.


Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




[PATCH v3 00/14] fuse: An attempt to implement a write-back cache policy

2013-01-25 Thread Maxim V. Patlasov
Hi,

This is the second iteration of Pavel Emelyanov's patch-set implementing
write-back policy for FUSE page cache. Initial patch-set description was
the following:

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy, which results in performance problems on certain
workloads. E.g., when copying a big file into a FUSE file, cp pushes every
128k to userspace synchronously. This becomes a problem when the userspace
back-end stores the data over the network.

A good solution is to switch the FUSE page cache to a write-back policy.
With this, file data are pushed to userspace in big chunks (depending on the
dirty-memory limits, but much more than 128k), which lets the FUSE daemons
handle the size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?). When the writeback
is turned ON:

* still copy writeback pages to a temporary buffer when sending a writeback
  request and finish the page writeback immediately

* make the kernel maintain the inode's i_size to avoid frequent i_size
  synchronization with the user space

* take NR_WRITEBACK_TEMP into account when making the balance_dirty_pages
  decision. This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not
yet all finished, but the mentioned copying of a huge file becomes noticeably
faster even on machines with little RAM and doesn't get the system stuck (the
dirty-pages balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation,
which is heavily oriented toward storing big blobs of data with extremely rare
meta-data updates (virtual machines' and containers' disk images). With the
existing cache policy a typical usage scenario -- copying a big VM disk into a
cloud -- takes way too long, much longer than if it were simply scp-ed over the
same network. The write-back policy (as I mentioned) noticeably improves this
scenario. Kirill (in Cc) can share more details about the performance and the
storage concepts if required.

Changed in v2:
 - numerous bugfixes:
   - fuse_write_begin, fuse_writepages_fill and fuse_writepage_locked must
     wait on page writeback, because page writeback can extend beyond the
     lifetime of the page-cache page
   - fuse_send_writepages can end_page_writeback on the original page only
     after adding the request to the fi->writepages list; otherwise another
     writeback may happen inside the gap between end_page_writeback and
     adding to the list
   - fuse_direct_io must wait on page writeback; otherwise data corruption is
     possible due to reordering of requests
   - fuse_flush must flush dirty memory and wait for all writeback on the
     given inode before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH
     is not reliable
   - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and the
     i_size update; otherwise a race with a writer extending i_size is
     possible
   - fix error handling in fuse_writepages and fuse_send_writepages
 - handle i_mtime intelligently if writeback cache is on (see patch #7 (update
   i_mtime on buffered writes) for details)
 - put enabling the writeback cache under fusermount control (see the mount
   option 'allow_wbcache' introduced by patch #13 (turn writeback cache on))
 - rebased on v3.7-rc5

Changed in v3:
 - rebased on for-next branch of the fuse tree

Thanks,
Maxim

---

Maxim V. Patlasov (14):
  fuse: Linking file to inode helper
  fuse: Getting file for writeback helper
  fuse: Prepare to handle short reads
  fuse: Prepare to handle multiple pages in writeback
  fuse: Connection bit for enabling writeback
  fuse: Trust kernel i_size only - v2
  fuse: Update i_mtime on buffered writes
  fuse: Flush files on wb close
  fuse: Implement writepages and write_begin/write_end callbacks - v2
  fuse: fuse_writepage_locked() should wait on writeback
  fuse: fuse_flush() should wait on writeback
  fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
  fuse: Turn writeback cache on
  mm: Account for WRITEBACK_TEMP in balance_dirty_pages


 fs/fuse/cuse.c|5 
 fs/fuse/dir.c |   51 +++-
 fs/fuse/file.c|  567 +
 fs/fuse/fuse_i.h  |   33 ++-
 fs/fuse/inode.c   |   98 
 include/uapi/linux/fuse.h |2 
 mm/page-writeback.c   |3 
 7 files changed, 696 insertions(+), 63 deletions(-)



[PATCH 01/14] fuse: Linking file to inode helper

2013-01-25 Thread Maxim V. Patlasov
When writeback is ON, every writable file should be on the per-inode write
list, not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov <mpatla...@parallels.com>
Signed-off-by: Pavel Emelyanov <xe...@openvz.org>
---
 fs/fuse/file.c |   33 +++--
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 28bc9c6..9bdc0fa 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -167,6 +167,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_file *ff = file->private_data;
+	/*
+	 * file may be written through mmap, so chain it onto the
+	 * inodes's write_file list
+	 */
+	spin_lock(&fc->lock);
+	if (list_empty(&ff->write_entry))
+		list_add(&ff->write_entry, &fi->write_files);
+	spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1484,20 +1500,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-		struct inode *inode = file->f_dentry->d_inode;
-		struct fuse_conn *fc = get_fuse_conn(inode);
-		struct fuse_inode *fi = get_fuse_inode(inode);
-		struct fuse_file *ff = file->private_data;
-		/*
-		 * file may be written through mmap, so chain it onto the
-		 * inodes's write_file list
-		 */
-		spin_lock(&fc->lock);
-		if (list_empty(&ff->write_entry))
-			list_add(&ff->write_entry, &fi->write_files);
-		spin_unlock(&fc->lock);
-	}
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		fuse_link_write_file(file);
+
 	file_accessed(file);
 	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;



[PATCH 02/14] fuse: Getting file for writeback helper

2013-01-25 Thread Maxim V. Patlasov
There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.

Signed-off-by: Maxim Patlasov <mpatla...@parallels.com>
Signed-off-by: Pavel Emelyanov <xe...@openvz.org>
---
 fs/fuse/file.c |   24 
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9bdc0fa..ad89b21 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1375,6 +1375,20 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
 	fuse_writepage_free(fc, req);
 }
 
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
+					 struct fuse_inode *fi)
+{
+	struct fuse_file *ff;
+
+	spin_lock(&fc->lock);
+	BUG_ON(list_empty(&fi->write_files));
+	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+	fuse_file_get(ff);
+	spin_unlock(&fc->lock);
+
+	return ff;
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -1382,7 +1396,6 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_req *req;
-	struct fuse_file *ff;
 	struct page *tmp_page;
 
 	set_page_writeback(page);
@@ -1395,13 +1408,8 @@ static int fuse_writepage_locked(struct page *page)
 	if (!tmp_page)
 		goto err_free;
 
-	spin_lock(&fc->lock);
-	BUG_ON(list_empty(&fi->write_files));
-	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
-	req->ff = fuse_file_get(ff);
-	spin_unlock(&fc->lock);
-
-	fuse_write_fill(req, ff, page_offset(page), 0);
+	req->ff = fuse_write_file(fc, fi);
+	fuse_write_fill(req, req->ff, page_offset(page), 0);
 
 	copy_highpage(tmp_page, page);
 	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;



[PATCH 03/14] fuse: Prepare to handle short reads

2013-01-25 Thread Maxim V. Patlasov
A helper which gets called when a read reports fewer bytes than were requested.
See patch #6 (trust kernel i_size only) for details.

Signed-off-by: Maxim Patlasov <mpatla...@parallels.com>
Signed-off-by: Pavel Emelyanov <xe...@openvz.org>
---
 fs/fuse/file.c |   21 +
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad89b21..1d76283 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -538,6 +538,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+			    u64 attr_ver)
+{
+	size_t num_read = req->out.args[0].size;
+
+	loff_t pos = page_offset(req->pages[0]) + num_read;
+	fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
 	struct inode *inode = page->mapping->host;
@@ -574,18 +583,18 @@ static int fuse_readpage(struct file *file, struct page *page)
 	req->page_descs[0].length = count;
 	num_read = fuse_send_read(req, file, pos, count, NULL);
 	err = req->out.h.error;
-	fuse_put_request(fc, req);
 
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
 		 */
 		if (num_read < count)
-			fuse_read_update_size(inode, pos + num_read, attr_ver);
+			fuse_short_read(req, inode, attr_ver);
 
 		SetPageUptodate(page);
 	}
 
+	fuse_put_request(fc, req);
 	fuse_invalidate_attr(inode); /* atime changed */
  out:
 	unlock_page(page);
@@ -608,13 +617,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 	/*
 	 * Short read means EOF. If file size is larger, truncate it
 	 */
-	if (!req->out.h.error && num_read < count) {
-		loff_t pos;
+	if (!req->out.h.error && num_read < count)
+		fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-		pos = page_offset(req->pages[0]) + num_read;
-		fuse_read_update_size(inode, pos,
-				      req->misc.read.attr_ver);
-	}
 	fuse_invalidate_attr(inode); /* atime changed */
 }
 



[PATCH 04/14] fuse: Prepare to handle multiple pages in writeback

2013-01-25 Thread Maxim V. Patlasov
The .writepages callback will issue writeback requests with more than one
page aboard. Make the existing end/check code aware of this.

Original patch by: Pavel Emelyanov <xe...@openvz.org>
Signed-off-by: Maxim Patlasov <mpatla...@parallels.com>
---
 fs/fuse/file.c |   22 +++---
 1 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1d76283..b28be33 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
 
 		BUG_ON(req->inode != inode);
 		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-		if (curr_index == index) {
+		if (curr_index <= index &&
+		    index < curr_index + req->num_pages) {
 			found = true;
 			break;
 		}
@@ -1295,7 +1296,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-	__free_page(req->pages[0]);
+	int i;
+
+	for (i = 0; i < req->num_pages; i++)
+		__free_page(req->pages[i]);
 	fuse_file_put(req->ff, false);
 }
 
@@ -1304,10 +1308,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	struct inode *inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	int i;
 
 	list_del(&req->writepages_entry);
-	dec_bdi_stat(bdi, BDI_WRITEBACK);
-	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+	for (i = 0; i < req->num_pages; i++) {
+		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+	}
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
 }
@@ -1320,14 +1327,15 @@ __acquires(fc->lock)
 	struct fuse_inode *fi = get_fuse_inode(req->inode);
 	loff_t size = i_size_read(req->inode);
 	struct fuse_write_in *inarg = &req->misc.write.in;
+	__u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
 	if (!fc->connected)
 		goto out_free;
 
-	if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-		inarg->size = PAGE_CACHE_SIZE;
+	if (inarg->offset + data_size <= size) {
+		inarg->size = data_size;
 	} else if (inarg->offset < size) {
-		inarg->size = size & (PAGE_CACHE_SIZE - 1);
+		inarg->size = size - inarg->offset;
 	} else {
 		/* Got truncated off completely */
 		goto out_free;



[PATCH 05/14] fuse: Connection bit for enabling writeback

2013-01-25 Thread Maxim V. Patlasov
Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov <mpatla...@parallels.com>
Signed-off-by: Pavel Emelyanov <xe...@openvz.org>
---
 fs/fuse/fuse_i.h |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 13befcd..65d76cd 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -446,6 +446,9 @@ struct fuse_conn {
/** Set if bdi is valid */
unsigned bdi_initialized:1;
 
+   /** write-back cache policy (default is write-through) */
+   unsigned writeback_cache:1;
+
/*
 * The following bitfields are only for optimization purposes
 * and hence races in setting them will not cause malfunction



[PATCH 06/14] fuse: Trust kernel i_size only - v2

2013-01-25 Thread Maxim V. Patlasov
Make fuse think that, when writeback is on, the inode's i_size is always
up-to-date, and not update it with the value received from userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper:
when a short read occurs, the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed, because now we should keep i_size up to
date, so it must be updated if the FUSE_FALLOCATE request succeeded.

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Original patch by: Pavel Emelyanov <xe...@openvz.org>
Signed-off-by: Maxim V. Patlasov <mpatla...@parallels.com>
---
 fs/fuse/dir.c   |9 ++---
 fs/fuse/file.c  |   43 +--
 fs/fuse/inode.c |6 --
 3 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ed8f8c5..ff8b603 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -827,7 +827,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->mtime.tv_nsec = attr->mtimensec;
 	stat->ctime.tv_sec = attr->ctime;
 	stat->ctime.tv_nsec = attr->ctimensec;
-	stat->size = attr->size;
+	stat->size = i_size_read(inode);
 	stat->blocks = attr->blocks;
 
 	if (attr->blksize != 0)
@@ -1541,6 +1541,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
 
@@ -1613,7 +1614,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
-	i_size_write(inode, outarg.attr.size);
+	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+		i_size_write(inode, outarg.attr.size);
 
 	if (is_truncate) {
 		/* NOTE: this may release/reacquire fc->lock */
@@ -1625,7 +1627,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	 * Only call invalidate_inode_pages2() after removing
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
-	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+	if ((is_truncate || !is_wb) &&
+	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b28be33..6b64e11 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/swap.h>
+#include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
 
@@ -543,9 +544,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 			    u64 attr_ver)
 {
 	size_t num_read = req->out.args[0].size;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache) {
+		/*
+		 * A hole in a file. Some data after the hole are in page cache,
+		 * but have not reached the client fs yet. So, the hole is not
+		 * present there.
+		 */
+		int i;
+		int start_idx = num_read >> PAGE_CACHE_SHIFT;
+		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
 
-	loff_t pos = page_offset(req->pages[0]) + num_read;
-	fuse_read_update_size(inode, pos, attr_ver);
+		for (i = start_idx; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			void *mapaddr = kmap_atomic(page);
+
+			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+			kunmap_atomic(mapaddr);
+			off = 0;
+		}
+	} else {
+		loff_t pos = page_offset(req->pages[0]) + num_read;
+		fuse_read_update_size(inode, pos, attr_ver);
+	}
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
@@ -2285,6 +2308,8 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.mode = mode
 	};
 	int err;
+	bool change_i_size = fc->writeback_cache &&
+		!(mode & FALLOC_FL_KEEP_SIZE);
 
 	if (fc->no_fallocate)
 		return -EOPNOTSUPP;
@@ -2293,6 +2318,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
+	if (change_i_size) {
+		struct inode *inode = file->f_mapping->host;
+		mutex_lock(&inode->i_mutex

[PATCH 07/14] fuse: Update i_mtime on buffered writes

2013-01-25 Thread Maxim V. Patlasov
If the writeback cache is on, a buffered write doesn't result in an immediate
mtime update in userspace, because userspace will see the modified data later,
when writeback happens. Consequently, the mtime provided by userspace may be
older than the actual time of the buffered write.

The problem can be solved by generating mtime locally (will come in the next
patches) and flushing it to userspace periodically. Here we introduce a flag to
keep the state of fuse_inode: the flag is ON if and only if a locally generated
mtime (stored in inode->i_mtime) was not pushed to userspace yet.

The patch also implements all bits related to flushing and clearing the flag.

Signed-off-by: Maxim Patlasov <mpatla...@parallels.com>
---
 fs/fuse/dir.c|   42 +
 fs/fuse/file.c   |   31 ++---
 fs/fuse/fuse_i.h |   13 -
 fs/fuse/inode.c  |   79 +-
 4 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ff8b603..969c60d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -177,6 +177,13 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 		if (flags & LOOKUP_RCU)
 			return -ECHILD;
 
+		if (test_bit(FUSE_I_MTIME_UPDATED,
+			     &get_fuse_inode(inode)->state)) {
+			err = fuse_flush_mtime(inode, 0);
+			if (err)
+				return 0;
+		}
+
 		fc = get_fuse_conn(inode);
 		req = fuse_get_req_nopages(fc);
 		if (IS_ERR(req))
@@ -839,7 +846,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 }
 
 static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
-			   struct file *file)
+			   struct file *file, int locked)
 {
 	int err;
 	struct fuse_getattr_in inarg;
@@ -848,6 +855,12 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
 	struct fuse_req *req;
 	u64 attr_version;
 
+	if (test_bit(FUSE_I_MTIME_UPDATED, &get_fuse_inode(inode)->state)) {
+		err = fuse_flush_mtime(inode, locked);
+		if (err)
+			return err;
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
@@ -893,7 +906,7 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
 }
 
 int fuse_update_attributes(struct inode *inode, struct kstat *stat,
-			   struct file *file, bool *refreshed)
+			   struct file *file, bool *refreshed, int locked)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	int err;
@@ -901,7 +914,7 @@ int fuse_update_attributes(struct inode *inode, struct kstat *stat,
 
 	if (fi->i_time < get_jiffies_64()) {
 		r = true;
-		err = fuse_do_getattr(inode, stat, file);
+		err = fuse_do_getattr(inode, stat, file, locked);
 	} else {
 		r = false;
 		err = 0;
@@ -1055,7 +1068,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask)
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
-	return fuse_do_getattr(inode, NULL, NULL);
+	return fuse_do_getattr(inode, NULL, NULL, 0);
 }
 
 /*
@@ -1524,6 +1537,12 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static inline bool fuse_operation_updates_mtime_on_server(unsigned ivalid)
+{
+	return (ivalid & ATTR_SIZE) ||
+		((ivalid & ATTR_MTIME) && update_mtime(ivalid));
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1564,6 +1583,15 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	if (attr->ia_valid & ATTR_SIZE)
 		is_truncate = true;
 
+	if (!fuse_operation_updates_mtime_on_server(attr->ia_valid)) {
+		struct fuse_inode *fi = get_fuse_inode(inode);
+		if (test_bit(FUSE_I_MTIME_UPDATED, &fi->state)) {
+			err = fuse_flush_mtime(inode, 1);
+			if (err)
+				return err;
+		}
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
@@ -1611,6 +1639,10 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	if (fuse_operation_updates_mtime_on_server(attr->ia_valid)) {
+		struct fuse_inode *fi = get_fuse_inode(inode);
+		clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+	}
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1659,7 +1691,7 @@ static int fuse_getattr(struct vfsmount *mnt, struct dentry *entry,
 	if 

[PATCH 08/14] fuse: Flush files on wb close

2013-01-25 Thread Maxim V. Patlasov
Any write request requires a file handle to report to userspace. Thus when we
close a file (and free the fuse_file with this info) we have to flush all the
outstanding writeback cache. Note that simply calling filemap_write_and_wait()
is not enough, since fuse finishes page writeback immediately, and thus the
wait part of the mentioned call will be a no-op. Do a real wait on the
per-inode writepages list.

Signed-off-by: Maxim Patlasov <mpatla...@parallels.com>
Signed-off-by: Pavel Emelyanov <xe...@openvz.org>
---
 fs/fuse/file.c |   26 +-
 1 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4f8fa45..496e74c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -138,6 +138,12 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
 	}
 }
 
+static void __fuse_file_put(struct fuse_file *ff)
+{
+	if (atomic_dec_and_test(&ff->count))
+		BUG();
+}
+
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir)
 {
@@ -286,8 +292,23 @@ static int fuse_open(struct inode *inode, struct file *file)
 	return fuse_open_common(inode, file, false);
 }
 
+static void fuse_flush_writeback(struct inode *inode, struct file *file)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	filemap_write_and_wait(file->f_mapping);
+	wait_event(fi->page_waitq, list_empty_careful(&fi->writepages));
+	spin_unlock_wait(&fc->lock);
+}
+
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache)
+		fuse_flush_writeback(inode, file);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -1343,7 +1364,8 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 
 	for (i = 0; i < req->num_pages; i++)
 		__free_page(req->pages[i]);
-	fuse_file_put(req->ff, false);
+	if (!fc->writeback_cache)
+		fuse_file_put(req->ff, false);
 }
 
 static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
@@ -1360,6 +1382,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	}
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
+	if (fc->writeback_cache)
+		__fuse_file_put(req->ff);
 }
 
 /* Called under fc->lock, may release and reacquire it */



[PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v2

2013-01-25 Thread Maxim V. Patlasov
The .writepages callback is required to make each writeback request carry
more than one page.

Changed in v2:
 - fixed fuse_prepare_write() to avoid reads beyond EOF
 - fixed fuse_prepare_write() to zero the uninitialized part of the page

Original patch by: Pavel Emelyanov <xe...@openvz.org>
Signed-off-by: Maxim V. Patlasov <mpatla...@parallels.com>
---
 fs/fuse/file.c |  282 
 1 files changed, 281 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 496e74c..3b4dc98 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -722,7 +722,10 @@ static void fuse_send_readpages(struct fuse_req *req, 
struct file *file)
 
 struct fuse_fill_data {
struct fuse_req *req;
-   struct file *file;
+   union {
+   struct file *file;
+   struct fuse_file *ff;
+   };
struct inode *inode;
unsigned nr_pages;
 };
@@ -1530,6 +1533,280 @@ static int fuse_writepage(struct page *page, struct 
writeback_control *wbc)
return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+	int i, all_ok = 1;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t off = -1;
+
+	if (!data->ff)
+		data->ff = fuse_write_file(fc, fi);
+
+	if (!data->ff) {
+		for (i = 0; i < req->num_pages; i++)
+			end_page_writeback(req->pages[i]);
+		return -EIO;
+	}
+
+	req->inode = inode;
+	req->misc.write.in.offset = page_offset(req->pages[0]);
+
+	spin_lock(&fc->lock);
+	list_add(&req->writepages_entry, &fi->writepages);
+	spin_unlock(&fc->lock);
+
+	for (i = 0; i < req->num_pages; i++) {
+		struct page *page = req->pages[i];
+		struct page *tmp_page;
+
+		tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (tmp_page) {
+			copy_highpage(tmp_page, page);
+			inc_bdi_stat(bdi, BDI_WRITEBACK);
+			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+		} else
+			all_ok = 0;
+		req->pages[i] = tmp_page;
+		if (i == 0)
+			off = page_offset(page);
+
+		end_page_writeback(page);
+	}
+
+	if (!all_ok) {
+		for (i = 0; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			if (page) {
+				dec_bdi_stat(bdi, BDI_WRITEBACK);
+				dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+				__free_page(page);
+				req->pages[i] = NULL;
+			}
+		}
+
+		spin_lock(&fc->lock);
+		list_del(&req->writepages_entry);
+		wake_up(&fi->page_waitq);
+		spin_unlock(&fc->lock);
+		return -ENOMEM;
+	}
+
+	req->ff = fuse_file_get(data->ff);
+	fuse_write_fill(req, data->ff, off, 0);
+
+	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+	req->in.argpages = 1;
+	fuse_page_descs_length_init(req, 0, req->num_pages);
+	req->end = fuse_writepage_end;
+
+	spin_lock(&fc->lock);
+	list_add_tail(&req->list, &fi->queued_writes);
+	fuse_flush_writepages(data->inode);
+	spin_unlock(&fc->lock);
+
+	return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+		struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
+	if (req->num_pages &&
+	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+		int err;
+
+		err = fuse_send_writepages(data);
+		if (err) {
+			unlock_page(page);
+			return err;
+		}
+
+		data->req = req =
+			fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+		if (!req) {
+			unlock_page(page);
+			return -ENOMEM;
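
The batching test in fuse_writepages_fill() above can be modeled in
userspace: the request under construction is flushed before adding a page
when it is full, would exceed max_write, or the new page is not contiguous
with the last one collected. This is a sketch, not kernel code; the function
name and parameters are illustrative.

```c
#include <stdbool.h>

#define PAGE_CACHE_SIZE 4096UL
#define FUSE_MAX_PAGES_PER_REQ 32

/* Model of the flush-before-add decision in fuse_writepages_fill(). */
static bool must_flush_before_add(unsigned num_pages, unsigned long last_index,
				  unsigned long index, unsigned long max_write)
{
	if (num_pages == 0)
		return false;	/* an empty request accepts any page */
	return num_pages == FUSE_MAX_PAGES_PER_REQ ||
	       (num_pages + 1) * PAGE_CACHE_SIZE > max_write ||
	       last_index + 1 != index;
}
```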

[PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback

2013-01-25 Thread Maxim V. Patlasov
fuse_writepage_locked() should never submit new i/o for a given page->index
if there is another one 'in progress' already. In most cases it's safe to
wait on page writeback. But if it was called due to memory shortage
(WB_SYNC_NONE), we should redirty the page rather than block the caller.
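
The decision this patch adds can be summarized by a small userspace model
(illustrative names and enums, not the kernel code):

```c
#include <stdbool.h>

enum sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };
enum action { PROCEED, REDIRTY, WAIT_THEN_PROCEED };

/* Model of the check added to fuse_writepage_locked(): if an older request
 * for the same page->index is still in flight, data-integrity writeback
 * (WB_SYNC_ALL) waits for it, while memory-cleaning writeback
 * (WB_SYNC_NONE) redirties the page instead of blocking. */
static enum action writepage_action(bool page_already_under_writeback,
				    enum sync_mode mode)
{
	if (!page_already_under_writeback)
		return PROCEED;
	return mode == WB_SYNC_ALL ? WAIT_THEN_PROCEED : REDIRTY;
}
```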

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c |   18 +++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3b4dc98..52c7d81 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1472,7 +1472,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
return ff;
 }
 
-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+struct writeback_control *wbc)
 {
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
@@ -1481,6 +1482,14 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_req *req;
struct page *tmp_page;
 
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
set_page_writeback(page);
 
req = fuse_request_alloc_nofs(1);
@@ -1527,7 +1536,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
int err;
 
-	err = fuse_writepage_locked(page);
+	err = fuse_writepage_locked(page, wbc);
unlock_page(page);
 
return err;
@@ -1812,7 +1821,10 @@ static int fuse_launder_page(struct page *page)
int err = 0;
if (clear_page_dirty_for_io(page)) {
		struct inode *inode = page->mapping->host;
-		err = fuse_writepage_locked(page);
+		struct writeback_control wbc = {
+			.sync_mode = WB_SYNC_ALL,
+		};
+		err = fuse_writepage_locked(page, &wbc);
		if (!err)
			fuse_wait_on_page_writeback(inode, page->index);
}



[PATCH 11/14] fuse: fuse_flush() should wait on writeback

2013-01-25 Thread Maxim V. Patlasov
The aim of the .flush fop is to hint the filesystem that flushing its state
or caches or any other important data to reliable storage would be desirable
now. fuse_flush() passes this hint by sending a FUSE_FLUSH request to
userspace. However, dirty pages and pages under writeback may not be visible
to userspace yet unless we ensure it explicitly.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/file.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 52c7d81..3767824 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,7 @@
 #include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
@@ -414,6 +415,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
	if (fc->no_flush)
		return 0;
 
+	err = filemap_write_and_wait(file->f_mapping);
+	if (err)
+		return err;
+
+	mutex_lock(&inode->i_mutex);
+	fuse_sync_writes(inode);
+	mutex_unlock(&inode->i_mutex);
+
	req = fuse_get_req_nofail_nopages(fc, file);
	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;



[PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2

2013-01-25 Thread Maxim V. Patlasov
The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. data that was in the file
before the 1st op. The problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.
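
The check the fix relies on can be modeled in userspace: convert the
direct-IO byte range to page indices the way fuse_wait_on_writeback() does,
and test overlap with an in-flight request's page range the way
fuse_range_is_writeback() does. A sketch under illustrative names, not the
kernel code:

```c
#include <stdbool.h>

#define PAGE_CACHE_SHIFT 12

/* Does the direct-IO byte range [dio_start, dio_start + dio_bytes) touch
 * any page of an in-flight write request covering page indices
 * [req_first_index, req_first_index + req_num_pages)? */
static bool ranges_overlap(unsigned long long dio_start, unsigned long dio_bytes,
			   unsigned long req_first_index, unsigned req_num_pages)
{
	unsigned long idx_from = dio_start >> PAGE_CACHE_SHIFT;
	unsigned long idx_to = (dio_start + dio_bytes - 1) >> PAGE_CACHE_SHIFT;

	/* no overlap iff the dio range lies entirely after or entirely
	 * before the request's page range */
	return !(idx_from >= req_first_index + req_num_pages ||
		 idx_to < req_first_index);
}
```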

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)

Original patch by: Pavel Emelyanov xe...@openvz.org
Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 fs/fuse/cuse.c   |5 +++--
 fs/fuse/file.c   |   49 +++--
 fs/fuse/fuse_i.h |   13 -
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..fb63185 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -93,7 +93,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
	loff_t pos = 0;
	struct iovec iov = { .iov_base = buf, .iov_len = count };
 
-	return fuse_direct_io(file, &iov, 1, count, &pos, 0);
+	return fuse_direct_io(file, &iov, 1, count, &pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -106,7 +106,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
	 * No locking or generic_write_checks(), the server is
	 * responsible for locking and sanity checks.
	 */
-	return fuse_direct_io(file, &iov, 1, count, &pos, 1);
+	return fuse_direct_io(file, &iov, 1, count, &pos,
+			      FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3767824..e6e064c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -349,6 +349,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
	return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+				    pgoff_t idx_to)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_req *req;
+	bool found = false;
+
+	spin_lock(&fc->lock);
+	list_for_each_entry(req, &fi->writepages, writepages_entry) {
+		pgoff_t curr_index;
+
+		BUG_ON(req->inode != inode);
+		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+		if (!(idx_from >= curr_index + req->num_pages ||
+		      idx_to < curr_index)) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&fc->lock);
+
+	return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -393,6 +418,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+				   size_t bytes)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	pgoff_t idx_from, idx_to;
+
+	idx_from = start >> PAGE_CACHE_SHIFT;
+	idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+	wait_event(fi->page_waitq,
+		   !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
	struct inode *inode = file->f_path.dentry->d_inode;
@@ -1245,8 +1283,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 
 ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write)
+		       int flags)
 {
+	int write = flags & FUSE_DIO_WRITE;
+	int cuse = flags & FUSE_DIO_CUSE;
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1271,6 +1311,10 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
break;
}
 
+		if (!cuse)
+			fuse_wait_on_writeback(file->f_mapping->host, pos,
+					       nbytes);
+
if (write)
nres = fuse_send_write(req, file, pos, nbytes, owner);
else
@@ -1339,7 +1383,8 @@ static ssize_t __fuse_direct_write(struct file *file, const struct iovec *iov,
 
	res = generic_write_checks(file, ppos, &count, 0);
