Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-06-01 Thread Andreas Rohner
On 2015-06-01 02:44, Ryusuke Konishi wrote:
> On Sun, 31 May 2015 20:13:44 +0200, Andreas Rohner wrote:
>> On 2015-05-31 18:45, Ryusuke Konishi wrote:
>>> On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
>>>> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>>>>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
> [...]
>>>>>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
>>>>> of blocks_per_segment.  It is not clear if the ratio is good
>>>>> (versatile).
>>>>
>>>> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
>>>>
>>>> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
>>>> check the segment again in the future. I guess anything between 25% and
>>>> 75% would also work.
>>>
>>> Sounds reasonable.
>>>
>>> By the way, I am thinking we should move cleanerd into the kernel as
>>> soon as we can.  It's not only inefficient due to the large amount of
>>> data exchanged between kernel and user land, but it also hinders
>>> changes like the ones we are attempting.  We have to preserve
>>> compatibility unnecessarily because of an early design mistake
>>> (i.e. the separation of gc into user land).
>>
>> I am a bit confused. Is it OK if I implement this functionality in
>> nilfs_cleanerd for this patch set, or would it be better to implement it
>> with a workqueue in the kernel, like you've suggested before?
>>
>> If you intend to move nilfs_cleanerd into the kernel anyway, then the
>> latter would make more sense to me. Which implementation do you prefer
>> for this patch set?
> 
> If nilfs_cleanerd remains in userland, then the userland
> implementation looks better.  But, yes, if we move the cleaner into
> the kernel, then the kernel implementation looks better because we may
> be able to avoid an unnecessary API change.  It's a dilemma.
> 
> Do you have any good ideas for reducing or hiding the overhead of the
> calibration (i.e. the traversal rewrite of the sufile) with regard to
> the kernel implementation?
> I'm inclined to leave that in the kernel for now.

I haven't looked into that yet, so I don't have a good idea right now. I
will do some experiments. The good thing is that the calibration does
not have to happen all at once and we do not have to do it all in one
iteration. The only question is how best to split up the work and keep
track of the progress.
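
One possible shape, as a very rough sketch (every name here that does not
appear in this patch set, e.g. CALIBRATE_STEP, ns_calibrate_pos and
nilfs_sufile_fix_starving_segs_range(), is hypothetical):

/* Rough sketch only: calibrate a bounded window of SUFILE entries per
 * invocation and remember where to resume next time. */
#define CALIBRATE_STEP	1024	/* segments examined per pass */

static void nilfs_sufile_calibrate_step(struct the_nilfs *nilfs)
{
	__u64 nsegs = nilfs->ns_nsegments;
	__u64 start = nilfs->ns_calibrate_pos;	/* hypothetical cursor */
	__u64 end = min_t(__u64, start + CALIBRATE_STEP, nsegs);

	nilfs_sufile_fix_starving_segs_range(nilfs->ns_sufile, start, end);

	/* wrap around once a full pass over the SUFILE is complete */
	nilfs->ns_calibrate_pos = (end >= nsegs) ? 0 : end;
}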

If it turns out to be too complicated to do it in the kernel, I will go
for the userspace solution.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
>>
>> Regards,
>> Andreas Rohner
>>
>>> Regards,
>>> Ryusuke Konishi


Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out

2015-06-01 Thread Andreas Rohner
On 2015-06-01 06:13, Ryusuke Konishi wrote:
> On Sun, 10 May 2015 13:04:18 +0200, Andreas Rohner wrote:
>> On 2015-05-09 14:17, Ryusuke Konishi wrote:
>>> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
> [...]
>>>
>>> Uum. This still looks like it has the potential to leak dirty blocks
>>> between the DAT and SUFILE collection, since this retry is limited by
>>> a fixed retry count.
>>>
>>> How about adding a function that temporarily turns off the live block
>>> tracking and using it after this propagation loop until the log write
>>> finishes?
>>>
>>> It would reduce the accuracy of the live block count, but is it enough?
>>> What do you think?  We have to eliminate the possibility of the leak
>>> because it can cause file system corruption.  Every checkpoint must be
>>> self-contained.
>>
>> How exactly could it lead to file system corruption? Maybe I am missing
>> something important here, but it seems to me that no corruption is
>> possible.
>>
>> The nilfs_sufile_flush_cache_node() function only reads in already
>> existing blocks. No new blocks are created. If I mark those blocks
>> dirty, the btree is not changed at all. If I do not call
>> nilfs_bmap_propagate(), then the btree stays unchanged and there are no
>> dangling pointers. The resulting checkpoint should be self-contained.
> 
> Good point.  As for the btree, it looks like no inconsistency issue
> arises, since nilfs_sufile_flush_cache_node() never inserts new blocks,
> as you pointed out.  We also have to take care of consistency between
> the sufile header and the sufile data blocks, and of the block count in
> the inode, but fortunately these look to be ok, too.
> 
> However, I still think it's not good to carry over dirty blocks to the
> next segment construction, to avoid extra checkpoint creation and to
> simplify things.
> 
> From this viewpoint, I also prefer that nilfs_sufile_flush_cache() and
> nilfs_sufile_flush_cache_node() are changed a bit so that they will
> skip adjusting su_nlive_blks and su_nlive_lastmod if the sufile block
> that includes the segment usage is not marked dirty and only_mark == 0,
> as well as turning off live block counting temporarily after the
> sufile/DAT propagation loop.

Ok I'll start working on this.
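
Roughly, I understand the guard inside nilfs_sufile_flush_cache_node()
would look something like this (just a sketch; the surrounding loop and
variable names are assumed from the patch context):

	if (!only_mark && !buffer_dirty(su_bh))
		continue;	/* leave clean SUFILE blocks untouched */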

Regards,
Andreas Rohner

>>
>> The only problem would be that I could lose some nlive_blks updates.
>>
> 
> Regards,
> Ryusuke Konishi


Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-05-31 Thread Andreas Rohner
On 2015-05-31 18:45, Ryusuke Konishi wrote:
> On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
>> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>>>> It doesn't really matter if the number of reclaimable blocks for a
>>>> segment is inaccurate, as long as the overall performance is better than
>>>> the simple timestamp algorithm and starvation is prevented.
>>>>
>>>> The following steps will lead to starvation of a segment:
>>>>
>>>> 1. The segment is written
>>>> 2. A snapshot is created
>>>> 3. The files in the segment are deleted and the number of live
>>>>blocks for the segment is decremented to a very low value
>>>> 4. The GC tries to free the segment, but there are no reclaimable
>>>>blocks, because they are all protected by the snapshot. To prevent an
>>>>infinite loop the GC has to adjust the number of live blocks to the
>>>>correct value.
>>>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>>>segment are now reclaimable.
>>>> 6. The GC will never attempt to clean the segment again, because it
>>>>looks as if it had a high number of live blocks.
>>>>
>>>> To prevent this, the already existing padding field of the SUFILE entry
>>>> is used to track the number of snapshot blocks in the segment. This
>>>> number is only set by the GC, since it collects the necessary
>>>> information anyway. So there is no need to track which block belongs to
>>>> which segment. In step 4 of the list above the GC will set the new field
>>>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>>>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>>>> reduced.
>>>>
>>>> Signed-off-by: Andreas Rohner 
>>>
>>> I still don't know whether this workaround is the way we should go
>>> or not.  This patch has several drawbacks:
>>>
>>>  1. It introduces overhead to every "chcp cp" operation
>>> due to the traversal rewrite of the sufile.
>>> If the ratio of snapshot-protected blocks is high, then
>>> this overhead will be big.
>>>
>>>  2. The traversal rewrite of the sufile will cause many sufile blocks to be
>>> written out.   If most blocks are protected by a snapshot,
>>> more than 4 MB of sufile blocks will be written per 1 TB of capacity.
>>>
>>> Even though this rewrite may not happen for consecutive "chcp cp"
>>> operations, it still has the potential to create sufile block writes
>>> if the application using nilfs manipulates snapshots frequently.
>>
>> I could also implement this functionality in nilfs_cleanerd in
>> userspace. Every time a "chcp cp" happens some kind of permanent flag
>> like "snapshot_was_recently_deleted" is set at an appropriate location.
>> The flag could be returned with GET_SUSTAT ioctl(). Then nilfs_cleanerd
>> would, at certain intervals and if the flag is set, check all segments
>> with GET_SUINFO ioctl() and set the ones that have potentially invalid
>> values with SET_SUINFO ioctl(). After that it would clear the
>> "snapshot_was_recently_deleted" flag. What do you think about this idea?
> 
> Sorry for my late reply.

No problem. I was also very busy last week.

> I think moving the functionality to cleanerd and notifying some sort
> of information to userland through ioctl for that, is a good idea
> except that I feel the ioctl should be GET_CPSTAT instead of
> GET_SUINFO because it's checkpoint/snapshot related information.

Ok good idea.

> I think the parameter that should be added is a set of statistics
> including the number of snapshots deleted since the file system was
> last mounted (1).  The counter (1) can serve as the
> "snapshot_was_recently_deleted" flag if it monotonically increases.
> Although we could use the timestamp of when a snapshot was last
> deleted, it is less preferable than the counter (1) because the system
> clock may be rewound and it also has precision issues.

I agree, a counter is better than a simple flag.
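
As a sketch, the cleanerd main loop could then do something like the
following (the field name cs_nsnapshot_deletions and the _V2 helper are
placeholders for whatever we end up defining; fix_starving_segments() is
the SUFILE scan I described in my previous mail):

	/* hypothetical: counter of snapshots deleted since last mount */
	if (nilfs_get_cpstat_v2(nilfs, &cpstat) == 0 &&
	    cpstat.cs_nsnapshot_deletions != last_seen_deletions) {
		fix_starving_segments(nilfs, max_segblks);
		last_seen_deletions = cpstat.cs_nsnapshot_deletions;
	}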

> Note that we must add GET_CPSTAT_V2 (or GET_SUSTAT_V2) and the
> corresponding structure (i.e. nilfs_cpstat_v2, or so) since ioctl
> codes depend on the size of argument data and it will be changed in
> both ioctls; unfortunately, neither GET_CPSTAT nor GET_SUSTAT ioctl is

Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-05-22 Thread Andreas Rohner
On 2015-05-20 16:43, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>> It doesn't really matter if the number of reclaimable blocks for a
>> segment is inaccurate, as long as the overall performance is better than
>> the simple timestamp algorithm and starvation is prevented.
>>
>> The following steps will lead to starvation of a segment:
>>
>> 1. The segment is written
>> 2. A snapshot is created
>> 3. The files in the segment are deleted and the number of live
>>blocks for the segment is decremented to a very low value
>> 4. The GC tries to free the segment, but there are no reclaimable
>>blocks, because they are all protected by the snapshot. To prevent an
>>infinite loop the GC has to adjust the number of live blocks to the
>>correct value.
>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>segment are now reclaimable.
>> 6. The GC will never attempt to clean the segment again, because it
>>looks as if it had a high number of live blocks.
>>
>> To prevent this, the already existing padding field of the SUFILE entry
>> is used to track the number of snapshot blocks in the segment. This
>> number is only set by the GC, since it collects the necessary
>> information anyway. So there is no need to track which block belongs to
>> which segment. In step 4 of the list above the GC will set the new field
>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>> reduced.
>>
>> Signed-off-by: Andreas Rohner 
> 
> I still don't know whether this workaround is the way we should go
> or not.  This patch has several drawbacks:
> 
>  1. It introduces overhead to every "chcp cp" operation
> due to the traversal rewrite of the sufile.
> If the ratio of snapshot-protected blocks is high, then
> this overhead will be big.
> 
>  2. The traversal rewrite of the sufile will cause many sufile blocks to be
> written out.   If most blocks are protected by a snapshot,
> more than 4 MB of sufile blocks will be written per 1 TB of capacity.
> 
> Even though this rewrite may not happen for consecutive "chcp cp"
> operations, it still has the potential to create sufile block writes
> if the application using nilfs manipulates snapshots frequently.

I could also implement this functionality in nilfs_cleanerd in
userspace. Every time a "chcp cp" happens some kind of permanent flag
like "snapshot_was_recently_deleted" is set at an appropriate location.
The flag could be returned with GET_SUSTAT ioctl(). Then nilfs_cleanerd
would, at certain intervals and if the flag is set, check all segments
with GET_SUINFO ioctl() and set the ones that have potentially invalid
values with SET_SUINFO ioctl(). After that it would clear the
"snapshot_was_recently_deleted" flag. What do you think about this idea?

If the policy is "timestamp" the GC would of course skip this scan,
because it is unnecessary.
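
Roughly sketched, the scan in nilfs_cleanerd could look like this (the
sui_nlive_blks/sui_nsnapshot_blks fields and the nlive_blks update flag
are the userspace counterparts of this patch set and are assumed here;
the libnilfs helper names are written from memory and not checked
against the headers):

static int fix_starving_segments(struct nilfs *nilfs, __u32 max_segblks)
{
	struct nilfs_sustat sustat;
	struct nilfs_suinfo si;
	struct nilfs_suinfo_update sup;
	__u64 segnum;

	if (nilfs_get_sustat(nilfs, &sustat) < 0)
		return -1;

	for (segnum = 0; segnum < sustat.ss_nsegs; segnum++) {
		if (nilfs_get_suinfo(nilfs, segnum, &si, 1) < 0)
			return -1;
		if (si.sui_nsnapshot_blks <= max_segblks ||
		    si.sui_nlive_blks <= max_segblks)
			continue;	/* value looks plausible, leave it */

		/* cap the live block counter of a potentially starving
		 * segment so that the GC will reconsider it later */
		memset(&sup, 0, sizeof(sup));
		sup.sup_segnum = segnum;
		nilfs_suinfo_update_set_nlive_blks(&sup);
		sup.sup_sui.sui_nlive_blks = max_segblks;
		if (nilfs_set_suinfo(nilfs, &sup, 1) < 0)
			return -1;
	}
	return 0;
}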

>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
> of blocks_per_segment.  It is not clear if the ratio is good
> (versatile).

The interval and percentage could be set in /etc/nilfs_cleanerd.conf.

I chose 50% kind of arbitrarily. My intent was to encourage the GC to
check the segment again in the future. I guess anything between 25% and
75% would also work.
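
For reference, the per-entry check in the current patch amounts to
roughly the following (a sketch based on the description above; the
actual code lives in nilfs_sufile_fix_starving_segs() and may differ in
detail):

	__u32 max_segblks = nilfs->ns_blocks_per_segment / 2;	/* 50% today */

	if (le32_to_cpu(su->su_nsnapshot_blks) > max_segblks &&
	    le32_to_cpu(su->su_nlive_blks) > max_segblks) {
		su->su_nlive_blks = cpu_to_le32(max_segblks);
		mark_buffer_dirty(su_bh);
		nilfs_mdt_mark_dirty(sufile);
	}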

> I will add comments inline below.

>> ---
>>  fs/nilfs2/ioctl.c  | 50 +++-
>>  fs/nilfs2/sufile.c | 85 
>> ++
>>  fs/nilfs2/sufile.h |  3 ++
>>  3 files changed, 137 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index 40bf74a..431725f 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -200,6 +200,49 @@ static int nilfs_ioctl_getversion(struct inode *inode, 
>> void __user *argp)
>>  }
>>  
>>  /**
>> + * nilfs_ioctl_fix_starving_segs - fix potentially starving segments
>> + * @nilfs: nilfs object
>> + * @inode: inode object
>> + *
>> + * Description: Scans for segments that are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This requires a scan of the whole SUFILE,
>> + * which can take a long time on certain devices and under certain 
>> conditions.
>> + * To avoid blocking other file system operations for too long the SUFILE is
>> + * scanned in steps of NILFS_SUFILE_STARVIN

Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out

2015-05-09 Thread Andreas Rohner
On 2015-05-09 14:17, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
>> This patch ensures, that all dirty blocks are written out if the segment
>> construction mode is SC_LSEG_SR. The scanning of the DAT file can cause
>> blocks in the SUFILE to be dirtied and newly dirtied blocks in the
>> SUFILE can in turn dirty more blocks in the DAT file. Since one of
>> these stages has to happen before the other during segment
>> construction, we end up with unwritten dirty blocks, that are lost
>> in case of a file system unmount.
>>
>> This patch introduces a new set of file scanning operations that
>> only propagate the changes to the bmap and do not add anything to the
>> segment buffer. The DAT file and SUFILE are scanned with these
>> operations. The function nilfs_sufile_flush_cache() is called in between
>> these scans with the parameter only_mark set. That way it can be called
>> repeatedly without actually writing anything to the SUFILE. If there are
>> no new blocks dirtied in the flush, the normal segment construction
>> stages can safely continue.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/segment.c | 73 
>> -
>>  fs/nilfs2/segment.h |  3 ++-
>>  2 files changed, 74 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index 14e76c3..ab8df33 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -579,6 +579,12 @@ static int nilfs_collect_dat_data(struct nilfs_sc_info 
>> *sci,
>>  return err;
>>  }
>>  
>> +static int nilfs_collect_prop_data(struct nilfs_sc_info *sci,
>> +  struct buffer_head *bh, struct inode *inode)
>> +{
>> +return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
>> +}
>> +
>>  static int nilfs_collect_dat_bmap(struct nilfs_sc_info *sci,
>>struct buffer_head *bh, struct inode *inode)
>>  {
>> @@ -613,6 +619,14 @@ static struct nilfs_sc_operations nilfs_sc_dat_ops = {
>>  .write_node_binfo = nilfs_write_dat_node_binfo,
>>  };
>>  
>> +static struct nilfs_sc_operations nilfs_sc_prop_ops = {
>> +.collect_data = nilfs_collect_prop_data,
>> +.collect_node = nilfs_collect_file_node,
>> +.collect_bmap = NULL,
>> +.write_data_binfo = NULL,
>> +.write_node_binfo = NULL,
>> +};
>> +
>>  static struct nilfs_sc_operations nilfs_sc_dsync_ops = {
>>  .collect_data = nilfs_collect_file_data,
>>  .collect_node = NULL,
>> @@ -998,7 +1012,8 @@ static int nilfs_segctor_scan_file(struct nilfs_sc_info 
>> *sci,
>>  err = nilfs_segctor_apply_buffers(
>>  sci, inode, &data_buffers,
>>  sc_ops->collect_data);
>> -BUG_ON(!err); /* always receive -E2BIG or true error */
>> +/* always receive -E2BIG or true error (NOT ANYMORE?)*/
>> +/* BUG_ON(!err); */
>>  goto break_or_fail;
>>  }
>>  }
> 
> If n > rest, this function will exit without scanning node buffers
> for nilfs_segctor_propagate_sufile().  This looks like a problem, right?
> 
> I think adding separate functions is better.  For instance,
> 
> static int nilfs_propagate_buffer(struct nilfs_sc_info *sci,
> struct buffer_head *bh,
> struct inode *inode)
> {
>   return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
> }
> 
> static int nilfs_segctor_propagate_file(struct nilfs_sc_info *sci,
>   struct inode *inode)
> {
>   LIST_HEAD(buffers);
>   size_t n;
>   int ret;
> 
>   n = nilfs_lookup_dirty_data_buffers(inode, &buffers, SIZE_MAX, 0,
>   LLONG_MAX);
>   if (n > 0) {
>   ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
> nilfs_propagate_buffer);
>   if (unlikely(ret))
>   goto fail;
>   }
> 
>   nilfs_lookup_dirty_node_buffers(inode, &buffers);
>   ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
> nilfs_propagate_buffer);
> fail:
>   return ret;
> }
> 
> With this, you can also avoid defining nilfs_sc_prop_ops or touching
> the BUG_ON() in nilfs_segctor_scan_file().

[PATCH v2 3/9] nilfs2: introduce new feature flag for tracking live blocks

2015-05-03 Thread Andreas Rohner
This patch introduces a new file system feature flag
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS. If it is enabled, the file system
will keep track of the number of live blocks per segment. This
information can be used by the GC to select segments for cleaning more
efficiently.
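
To illustrate the intended use of the flag (a sketch only, not code from
this patch; nilfs_sufile_mod_nlive_blks() is introduced by a later patch
in this series and the surrounding context is assumed):

	/* only account live blocks if both feature bits are enabled */
	if (nilfs_feature_track_live_blks(nilfs))
		nilfs_sufile_mod_nlive_blks(nilfs->ns_sufile, mc, segnum, -1);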

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/the_nilfs.h | 8 
 include/linux/nilfs2_fs.h | 4 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index 12cd91d..d755b6b 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -401,4 +401,12 @@ static inline int nilfs_flush_device(struct the_nilfs 
*nilfs)
return err;
 }
 
+static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
+{
+   const __u64 required_bits = NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
+   NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT;
+
+   return ((nilfs->ns_feature_compat & required_bits) == required_bits);
+}
+
 #endif /* _THE_NILFS_H */
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index 4800daa..5f05bbf 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -221,11 +221,13 @@ struct nilfs_super_block {
  * doesn't know about, it should refuse to mount the filesystem.
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT  BIT(0)
+#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS   BIT(1)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNTBIT(0)
 
 #define NILFS_FEATURE_COMPAT_SUPP  \
-   (NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
+   (NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT |\
+NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP   NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP0ULL
 
-- 
2.3.7



Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-03-14 Thread Andreas Rohner
On 2015-03-14 04:51, Ryusuke Konishi wrote:
> On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
>> It doesn't really matter if the number of reclaimable blocks for a
>> segment is inaccurate, as long as the overall performance is better than
>> the simple timestamp algorithm and starvation is prevented.
>>
>> The following steps will lead to starvation of a segment:
>>
>> 1. The segment is written
>> 2. A snapshot is created
>> 3. The files in the segment are deleted and the number of live
>>blocks for the segment is decremented to a very low value
>> 4. The GC tries to free the segment, but there are no reclaimable
>>blocks, because they are all protected by the snapshot. To prevent an
>>infinite loop the GC has to adjust the number of live blocks to the
>>correct value.
>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>segment are now reclaimable.
>> 6. The GC will never attempt to clean the segment again, because it
>>incorrectly shows up as having a high number of live blocks.
>>
>> To prevent this, the already existing padding field of the SUFILE entry
>> is used to track the number of snapshot blocks in the segment. This
>> number is only set by the GC, since it collects the necessary
>> information anyway. So there is no need to track which block belongs to
>> which segment. In step 4 of the list above the GC will set the new field
>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>> reduced.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/cpfile.c|   5 ++
>>  fs/nilfs2/segbuf.c|   1 +
>>  fs/nilfs2/segbuf.h|   1 +
>>  fs/nilfs2/segment.c   |   7 ++-
>>  fs/nilfs2/sufile.c| 114 
>> ++
>>  fs/nilfs2/sufile.h|   4 +-
>>  fs/nilfs2/the_nilfs.h |   7 +++
>>  include/linux/nilfs2_fs.h |  12 +++--
>>  8 files changed, 136 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
>> index 0d58075..6b61fd7 100644
>> --- a/fs/nilfs2/cpfile.c
>> +++ b/fs/nilfs2/cpfile.c
>> @@ -28,6 +28,7 @@
>>  #include 
>>  #include "mdt.h"
>>  #include "cpfile.h"
>> +#include "sufile.h"
>>  
>>  
>>  static inline unsigned long
>> @@ -703,6 +704,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
>> *cpfile, __u64 cno)
>>  struct nilfs_cpfile_header *header;
>>  struct nilfs_checkpoint *cp;
>>  struct nilfs_snapshot_list *list;
>> +struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
>>  __u64 next, prev;
>>  void *kaddr;
>>  int ret;
>> @@ -784,6 +786,9 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
>> *cpfile, __u64 cno)
>>  mark_buffer_dirty(header_bh);
>>  nilfs_mdt_mark_dirty(cpfile);
>>  
>> +if (nilfs_feature_track_snapshots(nilfs))
>> +nilfs_sufile_fix_starving_segs(nilfs->ns_sufile);
>> +
>>  brelse(prev_bh);
>>  
>>   out_next:
>> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
>> index bbd807b..a98c576 100644
>> --- a/fs/nilfs2/segbuf.c
>> +++ b/fs/nilfs2/segbuf.c
>> @@ -59,6 +59,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct 
>> super_block *sb)
>>  segbuf->sb_super_root = NULL;
>>  segbuf->sb_nlive_blks_added = 0;
>>  segbuf->sb_nlive_blks_diff = 0;
>> +segbuf->sb_nsnapshot_blks = 0;
>>  
>>  init_completion(&segbuf->sb_bio_event);
>>  atomic_set(&segbuf->sb_err, 0);
>> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
>> index 4e994f7..7a462c4 100644
>> --- a/fs/nilfs2/segbuf.h
>> +++ b/fs/nilfs2/segbuf.h
>> @@ -85,6 +85,7 @@ struct nilfs_segment_buffer {
>>  unsignedsb_rest_blocks;
>>  __u32   sb_nlive_blks_added;
>>  __s64   sb_nlive_blks_diff;
>> +__u32   sb_nsnapshot_blks;
>>  
>>  /* Buffers */
>>  struct list_headsb_segsum_buffers;
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index 16c7c36..b976198 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -1381,6 +1381,7 @@ static void nilfs_segctor_update_segusage(struct 
>> nilfs_sc_info *sci,
>> 

Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy

2015-03-14 Thread Andreas Rohner
Hi Ryusuke,

Thank you very much for your detailed review and feedback. I agree with
all of your points and I will start working on a rewrite immediately.

On 2015-03-12 13:54, Ryusuke Konishi wrote:
> Hi Andreas,
> 
> On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:
>> Hi Ryusuke,
>>
>> Thanks for your thorough review.
>>
>> On 2015-03-10 06:21, Ryusuke Konishi wrote:
>>> Hi Andreas,
>>>
>>> I looked through whole kernel patches and a part of util patches.
>>> Overall comments are as follows:
>>>
>>> [Algorithm]
>>> As for the algorithm, it looks about OK except for the starvation
>>> countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
>>> it's good that it doesn't change the kernel/userland interface; we may
>>> be able to replace it with a better approach in the future or in a
>>> revised version of this patchset.
>>>
>>> (1) Drawback of the starvation countermeasure
>>> Patch 9/9 looks like it will make the execution time of the chcp
>>> operation worse, since it will scan through the sufile to modify live
>>> block counters.  How much does it prolong the execution time?
>>
>> I'll do some tests, but I haven't noticed any significant performance
>> drop. The GC basically does the same thing, every time it selects
>> segments to reclaim.
> 
> GC is performed in the background by an independent process.  What I
> care about is that the NILFS_IOCTL_CHANGE_CPMODE ioctl is called from a
> command line interface or an application.  They differ in this respect.
> 
> Was a worst-case scenario considered in the test?
> 
> For example:
> 1. Fill a TB class drive with data file(s), and make a snapshot on it.
> 2. Run one pass GC to update snapshot block counts.
> 3. And do "chcp cp"
> 
> If we don't observe noticeable delay on this class of drive, then I
> think we can put the problem off.

Yesterday I did a worst case test as you suggested. I used an old 1 TB
hard drive I had lying around. This was my setup:

1. Write a 850GB file
2. Create a snapshot
3. Delete the file
4. Let GC run through all segments
5. Verify with lssu that the GC has updated all SUFILE entries
6. Drop the page cache
7. chcp cp

The following results are with the page cache dropped immediately before
each call:

1. chcp ss
real    0m1.337s
user    0m0.017s
sys     0m0.030s

2. chcp cp
real    0m6.377s
user    0m0.023s
sys     0m0.053s

The following results are without the drop of the page cache:

1. chcp ss
real    0m0.137s
user    0m0.010s
sys     0m0.000s

2. chcp cp
real    0m0.016s
user    0m0.010s
sys     0m0.007s

There are 119233 segments in my test. Each SUFILE entry uses 32 bytes.
So the worst case for 1 TB with 8 MB segments would be 3.57 MB of random
reads and one 3.57 MB continuous write. It takes as long as 6.377 s only
because my hard drive is so slow; you wouldn't notice any difference on a
modern SSD. Furthermore, the SUFILE is also scanned by the segment
allocation algorithm and the GC, so it is very likely already in the page
cache.

>>> In a use case of nilfs, many snapshots are created and they are
>>> automatically changed back to plain checkpoints because old
>>> snapshots are thinned out over time.  The patch 9/9 may impact on
>>> such usage.
>>>
>>> (2) Compatibility
>>> What will happen in the following case:
>>> 1. Create a file system, use it with the new module, and
>>>create snapshots.
>>> 2. Mount it with an old module, and release snapshot with "chcp cp"
>>> 3. Mount it with the new module, and cleanerd runs gc with
>>>cost benefit or greedy policy.
>>
>> Some segments could be subject to starvation. But it would probably only
>> affect a small number of segments and it could be fixed by "chcp ss
>> <cno>; chcp cp <cno>".
> 
> Ok, let's treat this as a restriction for now.
> If you come up with any good idea, please propose.
> 
>>> (3) Durability against unexpected power failures (just a note)
>>> The current patchset looks like it does not cause a starvation issue
>>> even when an unexpected power failure occurs during or after executing
>>> "chcp cp", because nilfs_ioctl_change_cpmode() makes its changes in a
>>> transactional way with nilfs_transaction_begin/commit.
>>> We should always think about this kind of situation to keep consistency.
>>>
>>> [Coding Style]
>>> (4) This patchset has several coding style issues. Please fix them and
>>> re-check with the latest checkpatch script (script/checkpatch.pl).
>>

Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy

2015-03-12 Thread Andreas Rohner
o protect it with a lock. Since
almost any operation has to modify the counters in the SUFILE, this
would serialize the whole file system.

> The cache should be well designed. It's important to balance the
> performance and locality/transparency of the feature.  For
> instance, it can be implemented with a radix tree of objects in
> which each object has a vector of 2^k cache entries.

I'll look into that.

> I think the cache should be written back to the sufile buffers
> only within segment construction context. At least, it should be
> written back in the context in which a transaction lock is held.
> 
> In addition, introducing a new bmap lock dependency,
> nilfs_sufile_lock_key, is undesirable. You should avoid it
> by delaying the writeback of cache entries to sufile.

The cache could end up using a lot of memory. In the worst case one
entry per block.

> (8) Changes to the sufile must be finished before the dirty buffer
> collection of the sufile.
> All mark_buffer_dirty() calls to the sufile must be finished
> before or in the NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().
> 
> (You can write fixed figures to the sufile after its collection phase
>  by marking the buffers dirty in advance, before the
>  collection phase.)
>
> In the current patchset, the sufile mod cache can be flushed in
> nilfs_segctor_update_payload_blocknr(), which comes after the
> dirty buffer collection phase.

This is a hard problem. I have to count the blocks added in the
NILFS_ST_DAT stage. I don't know which SUFILE blocks I have to mark in
advance. I'll have to think about this.

> (9) cpfile is also excluded from the dead block counting, like sufile.
> cpfile is always changed and written back along with sufile and dat.
> So, cpfile must be excluded from the dead block counting.
> Otherwise, a sufile change can trigger cpfile changes, which in turn
> trigger sufile changes.

I don't quite understand your example. How exactly can a sufile change
trigger a cpfile change and how can this turn into an infinite loop?

Thanks,
Andreas Rohner

> This also helps to simplify nilfs_dat_commit_end(), to which the patchset
> added two arguments for the dead block counting. I mean, the "dead"
> argument and the "count_blocks" argument can be unified by changing the
> meaning of the "dead" argument.
> 
> 
> I will add detail comments for patches tonight or another day.
> 
> Regards,
> Ryusuke Konishi
> 
> On Wed, 25 Feb 2015 09:18:04 +0900 (JST), Ryusuke Konishi wrote:
>> Hi Andreas,
>>
>> Thank you for posting this proposal!
>>
>> I would like to have time to review this series through, but please
>> wait for several days. (This week I'm quite busy until weekend)
>>
>> Thanks,
>> Ryusuke Konishi
>>
>> On Tue, 24 Feb 2015 20:01:35 +0100, Andreas Rohner wrote:
>>> Hi everyone!
>>>
>>> One of the biggest performance problems of NILFS is its
>>> inefficient Timestamp GC policy. This patch set introduces two new GC
>>> policies, namely Cost-Benefit and Greedy.
>>>
>>> The Cost-Benefit policy is nothing new. It has been around for a long
>>> time with log-structured file systems [1]. But it relies on accurate
>>> information, about the number of live blocks in a segment. NILFS
>>> currently does not provide the necessary information. So this patch set
>>> extends the entries in the SUFILE to include a counter for the number of
>>> live blocks. This counter is decremented whenever a file is deleted or
>>> overwritten.
>>>
>>> Except for some tricky parts, the counting of live blocks is quite
>>> trivial. The problem is snapshots. At any time, a checkpoint can be
>>> turned into a snapshot or vice versa. So blocks that are reclaimable at
>>> one point in time, are protected by a snapshot a moment later.
>>>
>>> This patch set does not try to track snapshots at all. Instead it uses a
>>> heuristic approach to prevent the worst case scenario. The performance
>>> is still significantly better than timestamp for my benchmarks.
>>>
>>> The worst case scenario is the following:
>>>
>>> 1. Segment 1 is written
>>> 2. Snapshot is created
>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>by the Snapshot. The GC has to set the number of live blocks
>>>to maximum to avoid reclaiming this Segment again in the near future.
>>> 4. Snapshot is deleted
>>> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>>>will never try to reclaim it again.

Re: [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev()

2015-03-10 Thread Andreas Rohner
On 2015-03-10 16:52, Ryusuke Konishi wrote:
> On Tue, 24 Feb 2015 20:01:36 +0100, Andreas Rohner wrote:
>> This patch refactors nilfs_sufile_updatev() to take an array of
>> arbitrary data structures instead of an array of segment numbers as
>> input parameter. With this change it is reusable for cases where
>> it is necessary to pass extra data to the update function. The only
>> requirement for the data structures passed as input is, that they
>> contain the segment number within the structure. By passing the
>> offset to the segment number as another input parameter,
>> nilfs_sufile_updatev() can be oblivious to the actual type of the
>> input structures in the array.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/sufile.c | 79 
>> --
>>  fs/nilfs2/sufile.h | 39 ++-
>>  2 files changed, 68 insertions(+), 50 deletions(-)
>>
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 2a869c3..1e8cac6 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -138,14 +138,18 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode 
>> *sufile)
>>  /**
>>   * nilfs_sufile_updatev - modify multiple segment usages at a time
>>   * @sufile: inode of segment usage file
>> - * @segnumv: array of segment numbers
>> - * @nsegs: size of @segnumv array
>> + * @datav: array of segment numbers
>> + * @datasz: size of elements in @datav
>> + * @segoff: offset to segnum within the elements of @datav
>> + * @ndata: size of @datav array
>>   * @create: creation flag
>>   * @ndone: place to store number of modified segments on @segnumv
>>   * @dofunc: primitive operation for the update
>>   *
>>   * Description: nilfs_sufile_updatev() repeatedly calls @dofunc
>> - * against the given array of segments.  The @dofunc is called with
>> + * against the given array of data elements. Every data element has
>> + * to contain a valid segment number and @segoff should be the offset
>> + * to that within the data structure. The @dofunc is called with
>>   * buffers of a header block and the sufile block in which the target
>>   * segment usage entry is contained.  If @ndone is given, the number
>>   * of successfully modified segments from the head is stored in the
>> @@ -163,50 +167,55 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode 
>> *sufile)
>>   *
>>   * %-EINVAL - Invalid segment usage number
>>   */
>> -int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
>> - int create, size_t *ndone,
>> - void (*dofunc)(struct inode *, __u64,
>> +int nilfs_sufile_updatev(struct inode *sufile, void *datav, size_t datasz,
>> + size_t segoff, size_t ndata, int create,
>> + size_t *ndone,
>> + void (*dofunc)(struct inode *, void *,
>>  struct buffer_head *,
>>  struct buffer_head *))
> 
> Using a byte offset into the data like segoff is nasty.
> 
> Please consider defining a template structure and its variation:
> 
> struct nilfs_sufile_update_data {
>__u64 segnum;
>/* Optional data comes after segnum */
> };
> 
> /**
>  * struct nilfs_sufile_update_count - data type of nilfs_sufile_do_xxx
>  * @segnum: segment number
>  * @nadd: additional value to a counter
>  * Description: This structure derives from nilfs_sufile_update_data
>  * struct.
>  */
> struct nilfs_sufile_update_count {
>__u64 segnum;
>__u64 nadd;
> };
> 
> int nilfs_sufile_updatev(struct inode *sufile,
>struct nilfs_sufile_update_data *datav,
>size_t datasz,
>size_t ndata, int create, size_t *ndone,
>  void (*dofunc)(struct inode *,
>   struct nilfs_sufile_update_data *,
>   struct buffer_head *,
>   struct buffer_head *))
> {
>   ...
> }

I agree this is a much better solution. I'll change it.

Regards,
Andreas Rohner

> If you need define segnum in the middle of structure, you can use
> container_of():
> 
> Example:
> 
> struct nilfs_sufile_update_xxx {
>__u32 item_a;
>__u32 item_b;
>struct nilfs_sufile_update_data u_data;
> };
> 
> static inline struct nilfs_sufile_update_xxx *
> NILFS_SU_UPDATE_XXX(struct nilfs_sufi

[PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc

2015-02-24 Thread Andreas Rohner
This patch adds support for additional bit-flags to the
nilfs_vdesc structure used by the GC to communicate block
information from userspace. The field vd_flags cannot be used for
this purpose, because it does not support bit-flags, and changing
that would break backwards compatibility. Therefore the padding
field is renamed to vd_blk_flags to contain more flags.

Unfortunately older versions of the userspace tools do not
initialize the padding field to zero. So it is necessary to signal
to the kernel if the new vd_blk_flags field contains usable flags
or just random data. Since the vd_period field is only used in
userspace, and is guaranteed to contain a value that is > 0
(NILFS_CNO_MIN == 1), it can be used to give the kernel a hint. So
if the userspace tools set vd_period.p_start to 0, the
vd_blk_flags field will be interpreted.

To make the flags available for later stages of the GC process,
they are mapped to corresponding buffer_head flags.
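
For illustration, a new userspace tool would signal support roughly like
this before handing the vdesc array to the kernel (a sketch; the flag
constant and the local variable are placeholders, not names from this
patch):

	/* vd_period is only meaningful in userspace, so a new tool can
	 * zero p_start to tell the kernel that vd_blk_flags is valid */
	vdesc->vd_period.p_start = 0;
	vdesc->vd_blk_flags = 0;
	if (block_is_protected_by_snapshot)
		vdesc->vd_blk_flags |= NILFS_VDESC_FLAG_SNAPSHOT;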

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/ioctl.c | 23 ---
 fs/nilfs2/page.h  |  6 -
 include/linux/nilfs2_fs.h | 58 +--
 3 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index f6ee54e..63b1c77 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -578,7 +578,7 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
struct buffer_head *bh;
int ret;
 
-   if (vdesc->vd_flags == 0)
+   if (nilfs_vdesc_data(vdesc))
ret = nilfs_gccache_submit_read_data(
inode, vdesc->vd_offset, vdesc->vd_blocknr,
vdesc->vd_vblocknr, &bh);
@@ -592,7 +592,8 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
   "%s: invalid virtual block address (%s): "
   "ino=%llu, cno=%llu, offset=%llu, "
   "blocknr=%llu, vblocknr=%llu\n",
-  __func__, vdesc->vd_flags ? "node" : "data",
+  __func__,
+  nilfs_vdesc_node(vdesc) ? "node" : "data",
   (unsigned long long)vdesc->vd_ino,
   (unsigned long long)vdesc->vd_cno,
   (unsigned long long)vdesc->vd_offset,
@@ -603,7 +604,8 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
if (unlikely(!list_empty(&bh->b_assoc_buffers))) {
printk(KERN_CRIT "%s: conflicting %s buffer: ino=%llu, "
   "cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu\n",
-  __func__, vdesc->vd_flags ? "node" : "data",
+  __func__,
+  nilfs_vdesc_node(vdesc) ? "node" : "data",
   (unsigned long long)vdesc->vd_ino,
   (unsigned long long)vdesc->vd_cno,
   (unsigned long long)vdesc->vd_offset,
@@ -612,6 +614,12 @@ static int nilfs_ioctl_move_inode_block(struct inode 
*inode,
brelse(bh);
return -EEXIST;
}
+
+   if (nilfs_vdesc_snapshot(vdesc))
+   set_buffer_nilfs_snapshot(bh);
+   if (nilfs_vdesc_protection_period(vdesc))
+   set_buffer_nilfs_protection_period(bh);
+
list_add_tail(&bh->b_assoc_buffers, buffers);
return 0;
 }
@@ -662,6 +670,15 @@ static int nilfs_ioctl_move_blocks(struct super_block *sb,
}
 
do {
+   /*
+* old user space tools do not initialize vd_blk_flags;
+* if vd_period.p_start > 0 then vd_blk_flags was
+* not initialized properly and may contain invalid
+* flags
+*/
+   if (vdesc->vd_period.p_start > 0)
+   vdesc->vd_blk_flags = 0;
+
ret = nilfs_ioctl_move_inode_block(inode, vdesc,
   &buffers);
if (unlikely(ret < 0)) {
diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
index a43b828..b9117e6 100644
--- a/fs/nilfs2/page.h
+++ b/fs/nilfs2/page.h
@@ -36,13 +36,17 @@ enum {
BH_NILFS_Volatile,
BH_NILFS_Checked,
BH_NILFS_Redirected,
+   BH_NILFS_Snapshot,
+   BH_NILFS_Protection_Period,
 };
 
 BUFFER_FNS(NILFS_Node, nilfs_node) /* nilfs node buffers */
 BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
 BUFFER_FNS(NILFS_Checked, nilfs_checked)   /* buffer is verified */
 BUFFER_FNS(NILFS_Redire

[PATCH 5/9] nilfs2: add simple tracking of block deletions and updates

2015-02-24 Thread Andreas Rohner
This patch adds simple tracking of block deletions and updates for
all files except the DAT and SUFILE metadata files. It uses the
fact that, for every block, NILFS2 keeps an entry in the DAT file
that stores the checkpoints where the block was created and deleted
or overwritten. So whenever a block is deleted or overwritten,
nilfs_dat_commit_end() is called to update the DAT entry. At this
point this patch simply decrements the su_nlive_blks field of the
corresponding segment. The value of su_nlive_blks is set at segment
creation time.

The blocks of the DAT file cannot be counted this way, because it
does not contain any entries about itself, so the function
nilfs_dat_commit_end() is not called when its blocks are deleted or
overwritten.

The SUFILE cannot be counted this way either, because it would lead
to a deadlock. When nilfs_dat_commit_end() is called, the bmap->b_sem
is held by code further up the call chain. To decrement the SUFILE
entry the same semaphore has to be acquired. So if the DAT entry
belongs to the SUFILE, both semaphores are the same and a deadlock
will occur. But this works for any other file. So by excluding the
SUFILE from being counted, via the extra parameter count_blocks, a
deadlock can be avoided.

With the above changes the code does not pass the lock dependency
checks of the kernel, because all the locks have the same class and
the order in which the locks are taken is different. Usually it is:

1. down_write(&NILFS_MDT(sufile)->mi_sem);
2. down_write(&bmap->b_sem);

Now it can also be reversed, which leads to failed checks:

1. down_write(&bmap->b_sem); /* lock of a file other than SUFILE */
2. down_write(&NILFS_MDT(sufile)->mi_sem);

But this is safe as long as the first lock down_write(&bmap->b_sem)
doesn't belong to the SUFILE.

It is also possible, that two bmap->b_sem locks have to be taken at
the same time:

1. down_write(&bmap->b_sem); /* lock of a file other than SUFILE */
2. down_write(&bmap->b_sem); /* lock of SUFILE */

Since bmap->b_sem of normal files and the bmap->b_sem of the
SUFILE have the same lock class, the above behavior would also lead
to a warning.

Because of this, it is necessary to introduce two new lock classes
for the SUFILE. So the bmap->b_sem of the SUFILE gets its own lock
class and the NILFS_MDT(sufile)->mi_sem as well.

A new feature compatibility flag
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS was added, so that the new
features introduced by this patch can be enabled or disabled at any
time.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/bmap.c  |  8 +++-
 fs/nilfs2/bmap.h  |  5 +++--
 fs/nilfs2/btree.c |  4 +++-
 fs/nilfs2/dat.c   | 25 -
 fs/nilfs2/dat.h   |  7 +--
 fs/nilfs2/direct.c|  4 +++-
 fs/nilfs2/mdt.c   |  5 -
 fs/nilfs2/segbuf.c|  1 +
 fs/nilfs2/segbuf.h|  1 +
 fs/nilfs2/segment.c   | 25 +
 fs/nilfs2/the_nilfs.c |  4 
 fs/nilfs2/the_nilfs.h | 16 
 include/linux/nilfs2_fs.h |  4 +++-
 13 files changed, 91 insertions(+), 18 deletions(-)

diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
index aadbd0b..ecd62ba 100644
--- a/fs/nilfs2/bmap.c
+++ b/fs/nilfs2/bmap.c
@@ -467,6 +467,7 @@ __u64 nilfs_bmap_find_target_in_group(const struct 
nilfs_bmap *bmap)
 
 static struct lock_class_key nilfs_bmap_dat_lock_key;
 static struct lock_class_key nilfs_bmap_mdt_lock_key;
+static struct lock_class_key nilfs_bmap_sufile_lock_key;
 
 /**
  * nilfs_bmap_read - read a bmap from an inode
@@ -498,12 +499,17 @@ int nilfs_bmap_read(struct nilfs_bmap *bmap, struct 
nilfs_inode *raw_inode)
lockdep_set_class(&bmap->b_sem, &nilfs_bmap_dat_lock_key);
break;
case NILFS_CPFILE_INO:
-   case NILFS_SUFILE_INO:
bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
bmap->b_last_allocated_key = 0;
bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
break;
+   case NILFS_SUFILE_INO:
+   bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
+   bmap->b_last_allocated_key = 0;
+   bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
+   lockdep_set_class(&bmap->b_sem, &nilfs_bmap_sufile_lock_key);
+   break;
case NILFS_IFILE_INO:
lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
/* Fall through */
diff --git a/fs/nilfs2/bmap.h b/fs/nilfs2/bmap.h
index b89e680..718c814 100644
--- a/fs/nilfs2/bmap.h
+++ b/fs/nilfs2/bmap.h
@@ -222,8 +222,9 @@ static inline void nilfs_bmap_commit_end_ptr(struct 
nilfs_bmap *bmap,
 struct inode *dat)
 {
if (dat)
-   nilfs_dat_commit_end(dat, &

[PATCH 4/9] nilfs2: add function to modify su_nlive_blks

2015-02-24 Thread Andreas Rohner
This patch adds a function to modify the su_nlive_blks field of the
nilfs_segment_usage structure in the SUFILE. By using positive or
negative integers, it is possible to add or subtract any value from
the su_nlive_blks field.

The use of a modification cache is optional and by passing a NULL
pointer the value will be added or subtracted directly. Otherwise it is
necessary to call nilfs_sufile_flush_nlive_blks() at some point to make
the modifications persistent.

The modification cache is useful, because it allows for small values,
like simple increments and decrements, to be added up before writing
them to the SUFILE.
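
The intended call pattern looks roughly like this (a sketch; the cache
initializer comes from patch 2/9, so its name and size argument are only
assumed here):

	struct nilfs_sufile_mod_cache mc;

	nilfs_sufile_mc_init(&mc, 32);	/* initializer name assumed */

	/* batch several small changes ... */
	nilfs_sufile_mod_nlive_blks(sufile, &mc, segnum, -1);
	nilfs_sufile_mod_nlive_blks(sufile, &mc, other_segnum, -2);

	/* ... and write the accumulated deltas back in one go */
	nilfs_sufile_flush_nlive_blks(sufile, &mc);

	/* or pass NULL to apply a single change directly */
	nilfs_sufile_mod_nlive_blks(sufile, NULL, segnum, -1);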

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/sufile.c | 138 +
 fs/nilfs2/sufile.h |   5 ++
 2 files changed, 143 insertions(+)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index ae08050..574a77e 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -1380,6 +1380,144 @@ static inline int nilfs_sufile_mc_update(struct inode 
*sufile,
 }
 
 /**
+ * nilfs_sufile_do_flush_nlive_blks - apply modification to su_nlive_blks
+ * @sufile: inode of segment usage file
+ * @mod: modification structure
+ * @header_bh: sufile header block
+ * @su_bh: block containing segment usage of m_segnum in @mod
+ *
+ * Description: nilfs_sufile_do_flush_nlive_blks() is a callback function
+ * used with nilfs_sufile_updatev(), that adds m_value in @mod to
+ * the su_nlive_blks field of the segment usage entry belonging to m_segnum.
+ */
+static void nilfs_sufile_do_flush_nlive_blks(struct inode *sufile,
+struct nilfs_sufile_mod *mod,
+struct buffer_head *header_bh,
+struct buffer_head *su_bh)
+{
+   struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
+   struct nilfs_segment_usage *su;
+   void *kaddr;
+   __u32 nblocks, nlive_blocks;
+   __u64 segnum = mod->m_segnum;
+   __s64 value = mod->m_value;
+
+   if (!value)
+   return;
+
+   kaddr = kmap_atomic(su_bh->b_page);
+
+   su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
+   WARN_ON(nilfs_segment_usage_error(su));
+
+   nblocks = le32_to_cpu(su->su_nblocks);
+   nlive_blocks = le32_to_cpu(su->su_nlive_blks);
+
+   value += nlive_blocks;
+   if (value < 0)
+   value = 0;
+   else if (value > nblocks)
+   value = nblocks;
+
+   /* do nothing if the value didn't change */
+   if (value != nlive_blocks) {
+   su->su_nlive_blks = cpu_to_le32(value);
+   su->su_nlive_lastmod = cpu_to_le64(nilfs->ns_ctime);
+   }
+
+   kunmap_atomic(kaddr);
+
+   if (value != nlive_blocks) {
+   mark_buffer_dirty(su_bh);
+   nilfs_mdt_mark_dirty(sufile);
+   }
+}
+
+/**
+ * nilfs_sufile_flush_nlive_blks - flush mod cache to su_nlive_blks
+ * @sufile: inode of segment usage file
+ * @mc: modification cache
+ *
+ * Description: nilfs_sufile_flush_nlive_blks() flushes the cached
+ * modifications in @mc, by applying them to the su_nlive_blks field of
+ * the corresponding segment usage entries. @mc can be NULL or empty. If
+ * the sufile extension needed to support su_nlive_blks is not supported the
+ * function will abort without error.
+ *
+ * Return Value: On success, zero is returned.  On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - Given segment usage is in hole block
+ *
+ * %-EINVAL - Invalid segment usage number
+ */
+int nilfs_sufile_flush_nlive_blks(struct inode *sufile,
+ struct nilfs_sufile_mod_cache *mc)
+{
+   int ret;
+
+   if (!mc || !mc->mc_size || !nilfs_sufile_ext_supported(sufile))
+   return 0;
+
+   ret = nilfs_sufile_mc_flush(sufile, mc,
+   nilfs_sufile_do_flush_nlive_blks);
+
+   nilfs_sufile_mc_clear(mc);
+
+   return ret;
+}
+
+/**
+ * nilfs_sufile_mod_nlive_blks - modify su_nlive_blks using mod cache
+ * @sufile: inode of segment usage file
+ * @mc: modification cache
+ * @segnum: segment number
+ * @value: signed value (can be positive and negative)
+ *
+ * Description: nilfs_sufile_mod_nlive_blks() adds @value to the su_nlive_blks
+ * field of the segment usage entry for @segnum. If @mc is not NULL it first
+ * accumulates all modifications in the cache and flushes it if it is full.
+ * Otherwise the change is applied directly.
+ *
+ * Return Value: On success, zero is returned.  On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - Given segment usage is in hole block
+ *
+ * %-EINVAL - Invalid segment usage number

[PATCH 0/9] nilfs2: implementation of cost-benefit GC policy

2015-02-24 Thread Andreas Rohner
Hi everyone!

One of the biggest performance problems of NILFS is its
inefficient Timestamp GC policy. This patch set introduces two new GC
policies, namely Cost-Benefit and Greedy.

The Cost-Benefit policy is nothing new. It has been around for a long
time with log-structured file systems [1]. But it relies on accurate
information, about the number of live blocks in a segment. NILFS
currently does not provide the necessary information. So this patch set
extends the entries in the SUFILE to include a counter for the number of
live blocks. This counter is decremented whenever a file is deleted or
overwritten.

Except for some tricky parts, the counting of live blocks is quite
trivial. The problem is snapshots. At any time, a checkpoint can be
turned into a snapshot or vice versa. So blocks that are reclaimable at
one point in time are protected by a snapshot a moment later.

This patch set does not try to track snapshots at all. Instead it uses a
heuristic approach to prevent the worst case scenario. The performance
is still significantly better than timestamp for my benchmarks.

The worst case scenario is the following:

1. Segment 1 is written
2. Snapshot is created
3. GC tries to reclaim Segment 1, but all blocks are protected
   by the Snapshot. The GC has to set the number of live blocks
   to maximum to avoid reclaiming this Segment again in the near future.
4. Snapshot is deleted
5. Segment 1 is reclaimable, but its counter is so high, that the GC
   will never try to reclaim it again.

To prevent this kind of starvation I use another field in the SUFILE
entry to store the number of blocks that are protected by a snapshot.
This value is just a heuristic and it is usually set to 0. Only if the
GC reclaims a segment, it is written to the SUFILE entry. The GC has to
check for snapshots anyway, so we get this information for free. By
storing this information in the SUFILE we can avoid starvation in the
following way:

1. Segment 1 is written
2. Snapshot is created
3. GC tries to reclaim Segment 1, but all blocks are protected
   by the Snapshot. The GC has to set the number of live blocks
   to maximum to avoid reclaiming this Segment again in the near future.
4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
   entry
5. Snapshot is deleted
6. On Snapshot deletion we walk through every entry in the SUFILE and
   reduce the number of live blocks to half, if the number of snapshot
   blocks is bigger than half of the maximum.
7. Segment 1 is reclaimable and the number of live blocks entry is at
   half the maximum. The GC will try to reclaim this segment as soon as
   there are no other better choices.
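
To put step 6 in concrete numbers: assuming the common configuration of
8 MiB segments with a 4 KiB block size, a segment holds 2048 blocks, so
any SUFILE entry whose snapshot block count exceeds 1024 gets its live
block counter reduced to 1024. Such segments stay eligible for cleaning
once their snapshots are gone, while still ranking behind segments that
are known to have few live blocks.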

BENCHMARKS:
---

My benchmark is quite simple. It consists of a process that replays
real NFS traces at a faster speed. It thereby creates relatively
realistic patterns of file creation and deletion. At the same time
multiple snapshots are created and deleted in parallel. I use a 100GB
partition of a Samsung SSD:

WITH SNAPSHOTS EVERY 5 MINUTES:

                 Execution time   Wear (Data written to disk)
Timestamp:       100%             100%
Cost-Benefit:     80%              43%

NO SNAPSHOTS:
-------------
                 Execution time   Wear (Data written to disk)
Timestamp:       100%             100%
Cost-Benefit:     70%              45%

I plan on adding more benchmark results soon.

Best regards,
Andreas Rohner

[1] Mendel Rosenblum and John K. Ousterhout. The design and
implementation of a log-structured file system. ACM Trans. Comput.
Syst., 10(1):26–52, February 1992.

Andreas Rohner (9):
  nilfs2: refactor nilfs_sufile_updatev()
  nilfs2: add simple cache for modifications to SUFILE
  nilfs2: extend SUFILE on-disk format to enable counting of live blocks
  nilfs2: add function to modify su_nlive_blks
  nilfs2: add simple tracking of block deletions and updates
  nilfs2: use modification cache to improve performance
  nilfs2: add additional flags for nilfs_vdesc
  nilfs2: improve accuracy and correct for invalid GC values
  nilfs2: prevent starvation of segments protected by snapshots

 fs/nilfs2/bmap.c  |  84 +++-
 fs/nilfs2/bmap.h  |  14 +-
 fs/nilfs2/btree.c |   4 +-
 fs/nilfs2/cpfile.c|   5 +
 fs/nilfs2/dat.c   |  95 -
 fs/nilfs2/dat.h   |   8 +-
 fs/nilfs2/direct.c|   4 +-
 fs/nilfs2/inode.c |  24 ++-
 fs/nilfs2/ioctl.c |  27 ++-
 fs/nilfs2/mdt.c   |   5 +-
 fs/nilfs2/page.h  |   6 +-
 fs/nilfs2/segbuf.c|   6 +
 fs/nilfs2/segbuf.h|   3 +
 fs/nilfs2/segment.c   | 155 +-
 fs/nilfs2/segment.h   |   3 +
 fs/nilfs2/sufile.c| 533 +++---
 fs/nilfs2/sufile.h|  97 +++--
 fs/nilfs2/the_nilfs.c |   4 +
 fs/nilfs2

[PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks

2015-02-24 Thread Andreas Rohner
This patch extends the nilfs_segment_usage structure with two extra
fields. This changes the on-disk format of the SUFILE, but the nilfs2
metadata files are flexible enough, so that there are no compatibility
issues. The extension is fully backwards compatible. Nevertheless a
feature compatibility flag was added to indicate the on-disk format
change.

The new field su_nlive_blks is used to track the number of live blocks
in the corresponding segment. Its value should always be smaller than
su_nblocks, which contains the total number of blocks in the segment.

The field su_nlive_lastmod is necessary because of the protection period
used by the GC. It is a timestamp, which contains the last time
su_nlive_blks was modified. For example if a file is deleted, its
blocks are subtracted from su_nlive_blks and are therefore considered to
be reclaimable by the kernel. But the GC additionally protects them with
the protection period. So while su_nilve_blks contains the number of
potentially reclaimable blocks, the actual number depends on the
protection period. To enable GC policies to effectively choose or prefer
segments with unprotected blocks, the timestamp in su_nlive_lastmod is
necessary.
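
For reference, a minimal sketch of the extended on-disk entry as introduced
by this patch (layout as defined in include/linux/nilfs2_fs.h; the padding
field is reused by a later patch in this series):

struct nilfs_segment_usage {
        __le64 su_lastmod;        /* time of last modification */
        __le32 su_nblocks;        /* total blocks in the segment */
        __le32 su_flags;          /* segment usage flags */
        __le32 su_nlive_blks;     /* new: live blocks, always <= su_nblocks */
        __le32 su_pad;            /* new: padding, reused later in the series */
        __le64 su_nlive_lastmod;  /* new: last modification of su_nlive_blks */
};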

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/ioctl.c |  4 ++--
 fs/nilfs2/sufile.c| 38 +--
 fs/nilfs2/sufile.h|  5 
 include/linux/nilfs2_fs.h | 58 ---
 4 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 9a20e51..f6ee54e 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1250,7 +1250,7 @@ static int nilfs_ioctl_set_suinfo(struct inode *inode, 
struct file *filp,
goto out;
 
ret = -EINVAL;
-   if (argv.v_size < sizeof(struct nilfs_suinfo_update))
+   if (argv.v_size < NILFS_MIN_SUINFO_UPDATE_SIZE)
goto out;
 
if (argv.v_nmembs > nilfs->ns_nsegments)
@@ -1316,7 +1316,7 @@ long nilfs_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
return nilfs_ioctl_get_cpstat(inode, filp, cmd, argp);
case NILFS_IOCTL_GET_SUINFO:
return nilfs_ioctl_get_info(inode, filp, cmd, argp,
-   sizeof(struct nilfs_suinfo),
+   NILFS_MIN_SEGMENT_USAGE_SIZE,
nilfs_ioctl_do_get_suinfo);
case NILFS_IOCTL_SET_SUINFO:
return nilfs_ioctl_set_suinfo(inode, filp, cmd, argp);
diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index a369c30..ae08050 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -466,6 +466,11 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 
*data,
su->su_lastmod = cpu_to_le64(0);
su->su_nblocks = cpu_to_le32(0);
su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
+   if (nilfs_sufile_ext_supported(sufile)) {
+   su->su_nlive_blks = cpu_to_le32(0);
+   su->su_pad = cpu_to_le32(0);
+   su->su_nlive_lastmod = cpu_to_le64(0);
+   }
kunmap_atomic(kaddr);
 
nilfs_sufile_mod_counter(header_bh, clean ? (u64)-1 : 0, dirty ? 0 : 1);
@@ -496,7 +501,7 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 *data,
WARN_ON(!nilfs_segment_usage_dirty(su));
 
sudirty = nilfs_segment_usage_dirty(su);
-   nilfs_segment_usage_set_clean(su);
+   nilfs_segment_usage_set_clean(su, NILFS_MDT(sufile)->mi_entry_size);
kunmap_atomic(kaddr);
mark_buffer_dirty(su_bh);
 
@@ -551,6 +556,9 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, 
__u64 segnum,
if (modtime)
su->su_lastmod = cpu_to_le64(modtime);
su->su_nblocks = cpu_to_le32(nblocks);
+   if (nilfs_sufile_ext_supported(sufile) &&
+   nblocks < le32_to_cpu(su->su_nlive_blks))
+   su->su_nlive_blks = su->su_nblocks;
kunmap_atomic(kaddr);
 
mark_buffer_dirty(bh);
@@ -713,7 +721,7 @@ static int nilfs_sufile_truncate_range(struct inode *sufile,
nc = 0;
for (su = su2, j = 0; j < n; j++, su = (void *)su + susz) {
if (nilfs_segment_usage_error(su)) {
-   nilfs_segment_usage_set_clean(su);
+   nilfs_segment_usage_set_clean(su, susz);
nc++;
}
}
@@ -836,6 +844,8 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 
segnum, void *buf,
struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
void *kaddr;
unsigned long nsegs, segusages_per_block;
+   __u64 lm = 0;
+   __u32 nlb = 0;
ssize_t n;
int ret, i, j;
 
@@ -873,6 +883,17 @@ ssize_t nilfs_sufile_get_suinfo(

[PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev()

2015-02-24 Thread Andreas Rohner
This patch refactors nilfs_sufile_updatev() to take an array of
arbitrary data structures instead of an array of segment numbers as
input parameter. With this change it is reusable in cases where
it is necessary to pass extra data to the update function. The only
requirement for the data structures passed as input is that they
contain the segment number within the structure. By passing the
offset to the segment number as another input parameter,
nilfs_sufile_updatev() can be oblivious to the actual type of the
input structures in the array.
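
As an illustration, a hypothetical caller that keeps its segment numbers
inside a larger structure could use the new interface like this (struct
example_mod and example_dofunc are made up for this sketch; only the
offsetof()-based segnum lookup matters):

        struct example_mod {
                __u64 segnum;   /* must be present somewhere in the struct */
                __s64 value;    /* arbitrary extra data for the dofunc */
        };

        /* apply dofunc to every element; segnum is found via the offset */
        ret = nilfs_sufile_updatev(sufile, mods, sizeof(struct example_mod),
                                   offsetof(struct example_mod, segnum),
                                   nmods, 0 /* create */, NULL /* ndone */,
                                   example_dofunc);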

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/sufile.c | 79 --
 fs/nilfs2/sufile.h | 39 ++-
 2 files changed, 68 insertions(+), 50 deletions(-)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 2a869c3..1e8cac6 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -138,14 +138,18 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode 
*sufile)
 /**
  * nilfs_sufile_updatev - modify multiple segment usages at a time
  * @sufile: inode of segment usage file
- * @segnumv: array of segment numbers
- * @nsegs: size of @segnumv array
+ * @datav: array of segment numbers
+ * @datasz: size of elements in @datav
+ * @segoff: offset to segnum within the elements of @datav
+ * @ndata: size of @datav array
  * @create: creation flag
  * @ndone: place to store number of modified segments on @segnumv
  * @dofunc: primitive operation for the update
  *
  * Description: nilfs_sufile_updatev() repeatedly calls @dofunc
- * against the given array of segments.  The @dofunc is called with
+ * against the given array of data elements. Every data element has
+ * to contain a valid segment number and @segoff should be the offset
+ * to that within the data structure. The @dofunc is called with
  * buffers of a header block and the sufile block in which the target
  * segment usage entry is contained.  If @ndone is given, the number
  * of successfully modified segments from the head is stored in the
@@ -163,50 +167,55 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode 
*sufile)
  *
  * %-EINVAL - Invalid segment usage number
  */
-int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
-int create, size_t *ndone,
-void (*dofunc)(struct inode *, __u64,
+int nilfs_sufile_updatev(struct inode *sufile, void *datav, size_t datasz,
+size_t segoff, size_t ndata, int create,
+size_t *ndone,
+void (*dofunc)(struct inode *, void *,
struct buffer_head *,
struct buffer_head *))
 {
struct buffer_head *header_bh, *bh;
unsigned long blkoff, prev_blkoff;
__u64 *seg;
-   size_t nerr = 0, n = 0;
+   void *data, *dataend = datav + ndata * datasz;
+   size_t n = 0;
int ret = 0;
 
-   if (unlikely(nsegs == 0))
+   if (unlikely(ndata == 0))
goto out;
 
-   down_write(&NILFS_MDT(sufile)->mi_sem);
-   for (seg = segnumv; seg < segnumv + nsegs; seg++) {
+
+   for (data = datav; data < dataend; data += datasz) {
+   seg = data + segoff;
if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) {
printk(KERN_WARNING
   "%s: invalid segment number: %llu\n", __func__,
   (unsigned long long)*seg);
-   nerr++;
+   ret = -EINVAL;
+   goto out;
}
}
-   if (nerr > 0) {
-   ret = -EINVAL;
-   goto out_sem;
-   }
 
+   down_write(&NILFS_MDT(sufile)->mi_sem);
ret = nilfs_sufile_get_header_block(sufile, &header_bh);
if (ret < 0)
goto out_sem;
 
-   seg = segnumv;
+   data = datav;
+   seg = data + segoff;
blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
ret = nilfs_mdt_get_block(sufile, blkoff, create, NULL, &bh);
if (ret < 0)
goto out_header;
 
for (;;) {
-   dofunc(sufile, *seg, header_bh, bh);
+   dofunc(sufile, data, header_bh, bh);
 
-   if (++seg >= segnumv + nsegs)
+   ++n;
+   data += datasz;
+   if (data >= dataend)
break;
+   seg = data + segoff;
prev_blkoff = blkoff;
blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
if (blkoff == prev_blkoff)
@@ -220,28 +229,30 @@ int nilfs_sufile_updatev(struct inode *sufile, __u64 
*segnumv, size_t nsegs,
}
brelse(bh);
 
- out_header:
-   n = seg - segnumv;
+out_header:
brelse(header_bh);
- out_sem:
+out_sem:
up_write(&am

[PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values

2015-02-24 Thread Andreas Rohner
This patch improves the accuracy of the su_nlive_blks segment
usage field by also counting the blocks of the DAT-File. A block in
the DAT-File is considered reclaimable as soon as it is overwritten.
There is no need to consider protection periods, snapshots or
checkpoints. So whenever a block is overwritten during segment
construction, the segment usage information of the segment at the
previous location of the block is decremented. To get the previous
location of the block the b_blocknr field of the buffer_head
structure is used.
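
Schematically, the accounting step described above looks roughly like this
(nilfs_get_segnum_of_block() is the existing helper from the_nilfs.h; the
decrement helper name is assumed, the real one is added earlier in this
series):

        /* bh->b_blocknr still holds the previous on-disk location */
        if (bh->b_blocknr != 0) {
                __u64 old_segnum = nilfs_get_segnum_of_block(nilfs,
                                                             bh->b_blocknr);

                /* the old copy of the block is now dead */
                nilfs_sufile_dec_nlive_blks(sufile, old_segnum); /* assumed name */
        }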

SUFILE blocks are counted in a similar way, but if the GC reads a
block into a GC inode that is already in the cache, then there are
two versions of the block. If this happens, both versions will be
counted, which can lead to small, seemingly random, incorrect values.
But it is better to accept these small inaccuracies than to not
count the SUFILE at all. These inaccuracies do not occur for the
DAT-File, because it does not need a GC inode.

Additionally the blocks that belong to a GC inode are rechecked if
they are reclaimable. If so the corresponding counter is
decremented. The blocks were already checked in userspace, but
without the proper locking. It is furthermore possible that blocks
become reclaimable during the cleaning process, for example by
deleting checkpoints. To improve the performance of these extra
checks, flags passed in from userspace are used to determine
reclaimability: if a block belongs to a snapshot it cannot be
reclaimable, and if it is merely within the protection period it must
still be counted as reclaimable.
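
Schematically, the recheck for a block handed in by the userspace GC amounts
to the following (a simplified sketch; the flag names come from the matching
nilfs-utils patch and the condition variables are placeholders):

        if (block_has_snapshot_flag) {
                /* protected by a snapshot: always counted as live */
                live = 1;
        } else if (block_has_protection_period_flag) {
                /* only protected by the GC protection period: reclaimable */
                live = 0;
        } else {
                /* no usable hint: re-check against the DAT under proper locking */
                live = nilfs_dat_is_live(dat, vblocknr, &err);
        }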

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/dat.c |  70 
 fs/nilfs2/dat.h |   1 +
 fs/nilfs2/inode.c   |   2 ++
 fs/nilfs2/segbuf.c  |   4 +++
 fs/nilfs2/segbuf.h  |   1 +
 fs/nilfs2/segment.c | 101 ++--
 6 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index d2c8f7e..63d079c 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -35,6 +35,17 @@
 #define NILFS_CNO_MAX  (~(__u64)0)
 
 /**
+ * nilfs_dat_entry_is_alive - check if @entry is alive
+ * @entry: DAT-Entry
+ *
+ * Description: Simple check if @entry is alive in the current checkpoint.
+ */
+static inline int nilfs_dat_entry_is_live(struct nilfs_dat_entry *entry)
+{
+   return entry->de_end == cpu_to_le64(NILFS_CNO_MAX);
+}
+
+/**
  * struct nilfs_dat_info - on-memory private data of DAT file
  * @mi: on-memory private data of metadata file
  * @palloc_cache: persistent object allocator cache of DAT file
@@ -391,6 +402,65 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, 
sector_t blocknr)
 }
 
 /**
+ * nilfs_dat_is_live - checks if the virtual block number is alive
+ * @dat: DAT file inode
+ * @vblocknr: virtual block number
+ * @errp: pointer to return code if error occurred
+ *
+ * Description: nilfs_dat_is_live() looks up the DAT-Entry for
+ * @vblocknr and determines if the corresponding block is alive in the current
+ * checkpoint or not. This check ignores snapshots and protection periods.
+ *
+ * Return Value: 1 if vblocknr is alive and 0 otherwise. On error, 0 is
+ * returned and @errp is set to one of the following negative error codes.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - A block number associated with @vblocknr does not exist.
+ */
+int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr, int *errp)
+{
+   struct buffer_head *entry_bh, *bh;
+   struct nilfs_dat_entry *entry;
+   sector_t blocknr;
+   void *kaddr;
+   int ret = 0, err;
+
+   err = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
+   if (err < 0)
+   goto out;
+
+   if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
+   bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
+   if (bh) {
+   WARN_ON(!buffer_uptodate(bh));
+   put_bh(entry_bh);
+   entry_bh = bh;
+   }
+   }
+
+   kaddr = kmap_atomic(entry_bh->b_page);
+   entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
+   blocknr = le64_to_cpu(entry->de_blocknr);
+   if (blocknr == 0) {
+   err = -ENOENT;
+   goto out_unmap;
+   }
+
+   ret = nilfs_dat_entry_is_live(entry);
+
+out_unmap:
+   kunmap_atomic(kaddr);
+   put_bh(entry_bh);
+out:
+   if (errp)
+   *errp = err;
+   return ret;
+}
+
+/**
  * nilfs_dat_translate - translate a virtual block number to a block number
  * @dat: DAT file inode
  * @vblocknr: virtual block number
diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
index d196f09..3cbddd6 100644
--- a/fs/nilfs2/dat.h
+++ b/fs/nilfs2/dat.h
@@ -32,6 +32,7 @@ struct nilfs_palloc_req;
 struct nilfs_sufile_mod_cache;
 
 int nilfs_dat_translate(struct inode *, __u64, sector_t *)

[PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-02-24 Thread Andreas Rohner
It doesn't really matter if the number of reclaimable blocks for a
segment is inaccurate, as long as the overall performance is better than
the simple timestamp algorithm and starvation is prevented.

The following steps will lead to starvation of a segment:

1. The segment is written
2. A snapshot is created
3. The files in the segment are deleted and the number of live
   blocks for the segment is decremented to a very low value
4. The GC tries to free the segment, but there are no reclaimable
   blocks, because they are all protected by the snapshot. To prevent an
   infinite loop the GC has to adjust the number of live blocks to the
   correct value.
5. The snapshot is converted to a checkpoint and the blocks in the
   segment are now reclaimable.
6. The GC will never attempt to clean the segment again, because it
   incorrectly shows up as having a high number of live blocks.

To prevent this, the already existing padding field of the SUFILE entry
is used to track the number of snapshot blocks in the segment. This
number is only set by the GC, since it collects the necessary
information anyway, so there is no need to track which block belongs to
which segment. In step 4 of the list above the GC sets the new field
su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked, and
entries with a large su_nsnapshot_blks value get their su_nlive_blks
field reduced.
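
A sketch of that repair walk (the 50% threshold corresponds to the
max_segblks ratio discussed for the cleaner; the SUFILE block-by-block
mapping is omitted here):

        __u32 max_segblks = nilfs->ns_blocks_per_segment / 2;

        /* for every segment usage entry su in the SUFILE: */
        if (le32_to_cpu(su->su_nsnapshot_blks) > max_segblks &&
            le32_to_cpu(su->su_nlive_blks) > max_segblks) {
                /* cap the live block count so the GC reconsiders the segment */
                su->su_nlive_blks = cpu_to_le32(max_segblks);
        }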

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/cpfile.c|   5 ++
 fs/nilfs2/segbuf.c|   1 +
 fs/nilfs2/segbuf.h|   1 +
 fs/nilfs2/segment.c   |   7 ++-
 fs/nilfs2/sufile.c| 114 ++
 fs/nilfs2/sufile.h|   4 +-
 fs/nilfs2/the_nilfs.h |   7 +++
 include/linux/nilfs2_fs.h |  12 +++--
 8 files changed, 136 insertions(+), 15 deletions(-)

diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
index 0d58075..6b61fd7 100644
--- a/fs/nilfs2/cpfile.c
+++ b/fs/nilfs2/cpfile.c
@@ -28,6 +28,7 @@
 #include 
 #include "mdt.h"
 #include "cpfile.h"
+#include "sufile.h"
 
 
 static inline unsigned long
@@ -703,6 +704,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
*cpfile, __u64 cno)
struct nilfs_cpfile_header *header;
struct nilfs_checkpoint *cp;
struct nilfs_snapshot_list *list;
+   struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
__u64 next, prev;
void *kaddr;
int ret;
@@ -784,6 +786,9 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
*cpfile, __u64 cno)
mark_buffer_dirty(header_bh);
nilfs_mdt_mark_dirty(cpfile);
 
+   if (nilfs_feature_track_snapshots(nilfs))
+   nilfs_sufile_fix_starving_segs(nilfs->ns_sufile);
+
brelse(prev_bh);
 
  out_next:
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index bbd807b..a98c576 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -59,6 +59,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct 
super_block *sb)
segbuf->sb_super_root = NULL;
segbuf->sb_nlive_blks_added = 0;
segbuf->sb_nlive_blks_diff = 0;
+   segbuf->sb_nsnapshot_blks = 0;
 
init_completion(&segbuf->sb_bio_event);
atomic_set(&segbuf->sb_err, 0);
diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
index 4e994f7..7a462c4 100644
--- a/fs/nilfs2/segbuf.h
+++ b/fs/nilfs2/segbuf.h
@@ -85,6 +85,7 @@ struct nilfs_segment_buffer {
unsignedsb_rest_blocks;
__u32   sb_nlive_blks_added;
__s64   sb_nlive_blks_diff;
+   __u32   sb_nsnapshot_blks;
 
/* Buffers */
struct list_headsb_segsum_buffers;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 16c7c36..b976198 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1381,6 +1381,7 @@ static void nilfs_segctor_update_segusage(struct 
nilfs_sc_info *sci,
(segbuf->sb_pseg_start - segbuf->sb_fseg_start);
ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
 live_blocks,
+segbuf->sb_nsnapshot_blks,
 sci->sc_seg_ctime);
WARN_ON(ret); /* always succeed because the segusage is dirty */
 
@@ -1405,7 +1406,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
segbuf = NILFS_FIRST_SEGBUF(logs);
ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
 segbuf->sb_pseg_start -
-segbuf->sb_fseg_start, 0);
+segbuf->sb_fseg_start, 0, 0);
WARN_ON(ret); /* always succeed because the segusage is dirty */
 
if (nilfs_feature_track

[PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE

2015-02-24 Thread Andreas Rohner
This patch adds a simple, small cache that can be used to accumulate
modifications to SUFILE entries. This is for example useful for
keeping track of reclaimable blocks, because most of the
modifications consist of small increments or decrements. By adding
these up and temporarily storing them in a small cache, the
performance can be improved. Additionally lock contention is
reduced.
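
The intended usage pattern is accumulate-then-flush, roughly as follows (the
mc_* helpers are the ones added by this patch; in the later patches of the
series callers reach them through the higher-level SUFILE update functions):

        struct nilfs_sufile_mod_cache mc;

        if (!nilfs_sufile_mc_init(&mc, 32 /* capacity, example value */)) {
                /* repeated changes to the same segment collapse into one entry */
                nilfs_sufile_mc_add(&mc, segnum, -1);
                nilfs_sufile_mc_add(&mc, segnum, -1);   /* cached as -2 now */

                /* -ENOSPC from nilfs_sufile_mc_add() means the cache is full
                 * and has to be flushed with nilfs_sufile_mc_flush() before
                 * further modifications; the memory is released afterwards
                 * with nilfs_sufile_mc_destroy() */
        }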

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/sufile.c | 178 +
 fs/nilfs2/sufile.h |  44 +
 2 files changed, 222 insertions(+)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 1e8cac6..a369c30 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -1168,6 +1168,184 @@ out_sem:
 }
 
 /**
+ * nilfs_sufile_mc_init - inits segusg modification cache
+ * @mc: modification cache
+ * @capacity: maximum capacity of the mod cache
+ *
+ * Description: Allocates memory for an array of nilfs_sufile_mod structures
+ * according to @capacity. This memory must be freed with
+ * nilfs_sufile_mc_destroy().
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-EINVAL - Invalid capacity.
+ */
+int nilfs_sufile_mc_init(struct nilfs_sufile_mod_cache *mc, size_t capacity)
+{
+   mc->mc_capacity = capacity;
+   if (!capacity)
+   return -EINVAL;
+
+   mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
+ GFP_KERNEL);
+   if (!mc->mc_mods)
+   return -ENOMEM;
+
+   mc->mc_size = 0;
+
+   return 0;
+}
+
+/**
+ * nilfs_sufile_mc_add - add signed value to segusg modification cache
+ * @mc: modification cache
+ * @segnum: segment number
+ * @value: signed value (can be positive and negative)
+ *
+ * Description: nilfs_sufile_mc_add() tries to add a pair of @segnum and
+ * @value to the modification cache. If the cache already contains a
+ * segment number equal to @segnum, then @value is simply added to the
+ * existing value. This way thousands of small modifications can be
+ * accumulated into one value. If @segnum cannot be found and the
+ * capacity allows it, a new element is added to the cache. If the
+ * capacity is reached an error value is returned.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOSPC - The mod cache has reached its capacity and must be flushed.
+ */
+static inline int nilfs_sufile_mc_add(struct nilfs_sufile_mod_cache *mc,
+ __u64 segnum, __s64 value)
+{
+   struct nilfs_sufile_mod *mods = mc->mc_mods;
+   int i;
+
+   for (i = 0; i < mc->mc_size; ++i, ++mods) {
+   if (mods->m_segnum == segnum) {
+   mods->m_value += value;
+   return 0;
+   }
+   }
+
+   if (mc->mc_size < mc->mc_capacity) {
+   mods->m_segnum = segnum;
+   mods->m_value = value;
+   mc->mc_size++;
+   return 0;
+   }
+
+   return -ENOSPC;
+}
+
+/**
+ * nilfs_sufile_mc_clear - set mc_size to 0
+ * @mc: modification cache
+ *
+ * Description: nilfs_sufile_mc_clear() sets mc_size to 0, which enables
+ * nilfs_sufile_mc_add() to overwrite the elements in @mc.
+ */
+static inline void nilfs_sufile_mc_clear(struct nilfs_sufile_mod_cache *mc)
+{
+   mc->mc_size = 0;
+}
+
+/**
+ * nilfs_sufile_mc_reset - clear cache and add one element
+ * @mc: modification cache
+ * @segnum: segment number
+ * @value: signed value (can be positive and negative)
+ *
+ * Description: Clears the modification cache in @mc and adds a new pair of
+ * @segnum and @value to it at the same time.
+ */
+static inline void nilfs_sufile_mc_reset(struct nilfs_sufile_mod_cache *mc,
+__u64 segnum, __s64 value)
+{
+   struct nilfs_sufile_mod *mods = mc->mc_mods;
+
+   mods->m_segnum = segnum;
+   mods->m_value = value;
+   mc->mc_size = 1;
+}
+
+/**
+ * nilfs_sufile_mc_flush - flush modification cache
+ * @sufile: inode of segment usage file
+ * @mc: modification cache
+ * @dofunc: primitive operation for the update
+ *
+ * Description: nilfs_sufile_mc_flush() flushes the cached modifications
+ * and applies them to the segment usages on disk. It persists the cached
+ * changes, by calling @dofunc for every element in the cache. @dofunc also
+ * determines the interpretation of the cached values and how they should
+ * be applied to the corresponding segment usage entries.
+ *
+ * Return Value: On success, zero is returned.  On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - Given segment usage is i

[PATCH 6/9] nilfs2: use modification cache to improve performance

2015-02-24 Thread Andreas Rohner
This patch adds a small cache to accumulate the small decrements of
the number of live blocks in a segment usage entry. If for example a
large file is deleted, the segment usage entry has to be updated for
every single block. But for every decrement, an MDT write lock has to
be acquired, which blocks the entire SUFILE and effectively turns
this lock into a global lock for the whole file system.

The cache tries to ameliorate this situation by adding up the
decrements and increments for a given number of segments and
applying the changes all at once. Because the changes are
accumulated in memory and not immediately written to the SUFILE, the
aforementioned lock only needs to be acquired if the cache is full
or at the end of the respective operation.

To effectively get the pointer to the modification cache from the
high level operations down to the update of the individual blocks in
nilfs_dat_commit_end(), a new pointer b_private was added to struct
nilfs_bmap.
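
A rough sketch of a call site, modeled on the truncate path (capacity and
error handling simplified; the flush into the SUFILE is performed by the
helpers from the earlier patch and is only indicated in the final comment):

        struct nilfs_sufile_mod_cache mc;
        int err;

        err = nilfs_sufile_mc_init(&mc, 64);    /* capacity: example value */
        if (err)
                return err;     /* or fall back to plain nilfs_bmap_truncate() */

        /* the cache travels down to nilfs_dat_commit_end() via bmap->b_private */
        err = nilfs_bmap_truncate_with_mc(ii->i_bmap, &mc, blkoff);

        /* accumulated live block decrements are then applied to the SUFILE in
         * one pass (nilfs_sufile_mc_flush()) and the cache is released with
         * nilfs_sufile_mc_destroy() */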

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/bmap.c| 76 +
 fs/nilfs2/bmap.h| 11 +++-
 fs/nilfs2/btree.c   |  2 +-
 fs/nilfs2/direct.c  |  2 +-
 fs/nilfs2/inode.c   | 22 +---
 fs/nilfs2/segment.c | 26 +++---
 fs/nilfs2/segment.h |  3 +++
 7 files changed, 132 insertions(+), 10 deletions(-)

diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
index ecd62ba..927acb7 100644
--- a/fs/nilfs2/bmap.c
+++ b/fs/nilfs2/bmap.c
@@ -288,6 +288,43 @@ int nilfs_bmap_truncate(struct nilfs_bmap *bmap, unsigned 
long key)
 }
 
 /**
+ * nilfs_bmap_truncate_with_mc - truncate a bmap to a specified key
+ * @bmap: bmap
+ * @mc: modification cache
+ * @key: key
+ *
+ * Description: nilfs_bmap_truncate_with_mc() removes key-record pairs whose
+ * keys are greater than or equal to @key from @bmap. It has the same
+ * functionality as nilfs_bmap_truncate(), but allows the passing
+ * of a modification cache to update segment usage information.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_bmap_truncate_with_mc(struct nilfs_bmap *bmap,
+   struct nilfs_sufile_mod_cache *mc,
+   unsigned long key)
+{
+   int ret;
+
+   down_write(&bmap->b_sem);
+
+   bmap->b_private = mc;
+
+   ret = nilfs_bmap_do_truncate(bmap, key);
+
+   bmap->b_private = NULL;
+
+   up_write(&bmap->b_sem);
+
+   return nilfs_bmap_convert_error(bmap, __func__, ret);
+}
+
+/**
  * nilfs_bmap_clear - free resources a bmap holds
  * @bmap: bmap
  *
@@ -328,6 +365,43 @@ int nilfs_bmap_propagate(struct nilfs_bmap *bmap, struct 
buffer_head *bh)
 }
 
 /**
+ * nilfs_bmap_propagate_with_mc - propagate dirty state
+ * @bmap: bmap
+ * @mc: modification cache
+ * @bh: buffer head
+ *
+ * Description: nilfs_bmap_propagate_with_mc() marks the buffers that directly
+ * or indirectly refer to the block specified by @bh dirty. It has
+ * the same functionality as nilfs_bmap_propagate(), but allows the passing
+ * of a modification cache to update segment usage information.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_bmap_propagate_with_mc(struct nilfs_bmap *bmap,
+struct nilfs_sufile_mod_cache *mc,
+struct buffer_head *bh)
+{
+   int ret;
+
+   down_write(&bmap->b_sem);
+
+   bmap->b_private = mc;
+
+   ret = bmap->b_ops->bop_propagate(bmap, bh);
+
+   bmap->b_private = NULL;
+
+   up_write(&bmap->b_sem);
+
+   return nilfs_bmap_convert_error(bmap, __func__, ret);
+}
+
+/**
  * nilfs_bmap_lookup_dirty_buffers -
  * @bmap: bmap
  * @listp: pointer to buffer head list
@@ -490,6 +564,7 @@ int nilfs_bmap_read(struct nilfs_bmap *bmap, struct 
nilfs_inode *raw_inode)
 
init_rwsem(&bmap->b_sem);
bmap->b_state = 0;
+   bmap->b_private = NULL;
bmap->b_inode = &NILFS_BMAP_I(bmap)->vfs_inode;
switch (bmap->b_inode->i_ino) {
case NILFS_DAT_INO:
@@ -551,6 +626,7 @@ void nilfs_bmap_init_gc(struct nilfs_bmap *bmap)
bmap->b_last_allocated_key = 0;
bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
bmap->b_state = 0;
+   bmap->b_private = NULL;
nilfs_btree_init_gc(bmap);
 }
 
diff --git a/fs/nilfs2/bmap.h b/fs/nilfs2/bmap.h
index 718c814..a8b935a 100644
--- a/fs/nilfs2/bmap.h
+++ b/fs/nilfs2/bmap.h
@@ -36,6 +36,7 @@
 
 
 struct nilfs_bmap;
+struct nilfs_sufile_mod_cache;
 
 /**
  * union nilfs_bmap_ptr_req - request for bmap ptr
@@ -106,6 +107,7 @@ static inline int nil

[PATCH 6/6] nilfs-utils: add su_nsnapshot_blks field to indicate starvation

2015-02-24 Thread Andreas Rohner
This patch adds support for the field su_nsnapshot_blks and includes the
necessary flags to update it from the GC.

The GC already has the necessary information about which block belongs
to a snapshot and which doesn't. So these blocks are counted up and
passed to the caller.

The number of snapshot blocks will then be updated with
NILFS_IOCTL_SET_SUINFO ioctl.
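
In terms of the existing suinfo update interface, the GC side roughly fills
in one entry per selected segment like this (sui_nsnapshot_blks and its
update flag are the ones added by this patch; the surrounding loop and error
handling are omitted):

        struct nilfs_suinfo_update sup = { .sup_segnum = segnum };

        nilfs_suinfo_update_set_nlive_blks(&sup);
        nilfs_suinfo_update_set_nsnapshot_blks(&sup);
        sup.sup_sui.sui_nlive_blks = nlive_blks;
        sup.sup_sui.sui_nsnapshot_blks = nsnapshot_blks;

        /* handed to the kernel via nilfs_set_suinfo(), i.e. the
         * NILFS_IOCTL_SET_SUINFO ioctl */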

Signed-off-by: Andreas Rohner 
---
 include/nilfs.h  |  9 +
 include/nilfs2_fs.h  | 12 
 lib/feature.c|  2 ++
 lib/gc.c | 19 ++-
 lib/nilfs.c  |  2 ++
 man/mkfs.nilfs2.8|  6 ++
 sbin/mkfs/mkfs.c |  3 ++-
 sbin/nilfs-tune/nilfs-tune.c |  6 --
 8 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index 8511163..e84656b 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -131,6 +131,7 @@ struct nilfs {
 #define NILFS_OPT_MMAP 0x01
 #define NILFS_OPT_SET_SUINFO   0x02
 #define NILFS_OPT_TRACK_LIVE_BLKS  0x04
+#define NILFS_OPT_TRACK_SNAPSHOTS  0x08
 
 
 struct nilfs *nilfs_open(const char *, const char *, int);
@@ -161,6 +162,7 @@ nilfs_opt_test_##name(const struct nilfs *nilfs)
\
 
 NILFS_OPT_FLAG(SET_SUINFO, set_suinfo);
 NILFS_OPT_FLAG(TRACK_LIVE_BLKS, track_live_blks);
+NILFS_OPT_FLAG(TRACK_SNAPSHOTS, track_snapshots);
 
 nilfs_cno_t nilfs_get_oldest_cno(struct nilfs *);
 
@@ -356,4 +358,11 @@ static inline int nilfs_feature_track_live_blks(const 
struct nilfs *nilfs)
(fc & NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
 }
 
+static inline int nilfs_feature_track_snapshots(const struct nilfs *nilfs)
+{
+   __u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
+   return (fc & NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS) &&
+   nilfs_feature_track_live_blks(nilfs);
+}
+
 #endif /* NILFS_H */
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index 427ca53..f1f315c 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -221,11 +221,13 @@ struct nilfs_super_block {
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION  (1ULL << 0)
 #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS   (1ULL << 1)
+#define NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS   (1ULL << 2)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT(1ULL << 0)
 
 #define NILFS_FEATURE_COMPAT_SUPP  (NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
-   | NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
+   | NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS \
+   | NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP   NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP0ULL
 
@@ -630,7 +632,7 @@ struct nilfs_segment_usage {
__le32 su_nblocks;
__le32 su_flags;
__le32 su_nlive_blks;
-   __le32 su_pad;
+   __le32 su_nsnapshot_blks;
__le64 su_nlive_lastmod;
 };
 
@@ -682,7 +684,7 @@ nilfs_segment_usage_set_clean(struct nilfs_segment_usage 
*su, size_t susz)
su->su_flags = cpu_to_le32(0);
if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
su->su_nlive_blks = cpu_to_le32(0);
-   su->su_pad = cpu_to_le32(0);
+   su->su_nsnapshot_blks = cpu_to_le32(0);
su->su_nlive_lastmod = cpu_to_le64(0);
}
 }
@@ -723,7 +725,7 @@ struct nilfs_suinfo {
__u32 sui_nblocks;
__u32 sui_flags;
__u32 sui_nlive_blks;
-   __u32 sui_pad;
+   __u32 sui_nsnapshot_blks;
__u64 sui_nlive_lastmod;
 };
 
@@ -764,6 +766,7 @@ enum {
NILFS_SUINFO_UPDATE_FLAGS,
NILFS_SUINFO_UPDATE_NLIVE_BLKS,
NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
+   NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -788,6 +791,7 @@ NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
 NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
+NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
 NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
 
 enum {
diff --git a/lib/feature.c b/lib/feature.c
index ebe8c3f..376fa53 100644
--- a/lib/feature.c
+++ b/lib/feature.c
@@ -59,6 +59,8 @@ static const struct nilfs_feature features[] = {
  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION, "sufile_ext" },
{ NILFS_FEATURE_TYPE_COMPAT,
  NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS, "track_live_blks" },
+   { NILFS_FEATURE_TYPE_COMPAT,
+ NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS, "track_snapshots" },
/* Read-only compat features */
{ NILFS_FEATURE_TYPE_COMPAT_RO,
  NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT, "block_count" },
diff --git a/lib/gc.c b/lib/gc.c
index a2461b9..f1b8b85 100644

[PATCH 5/6] nilfs-utils: add support for greedy/cost-benefit policies

2015-02-24 Thread Andreas Rohner
This patch implements the cost-benefit and greedy GC policies. These
are well known policies for log-structured file systems [1].

* Greedy:
  Select the segments with the most reclaimable space.
* Cost-Benefit [1]:
  Perform a cost-benefit analysis, whereby the reclaimable space
  gained is weighed against the cost of collecting the segment.

Since especially cost-benefit needs more information than is
available in nilfs_suinfo, a few extra parameters are added to the
policy callback function prototype. The field p_comparison is added to
indicate how the importance values should be interpreted. For example
for the timestamp policy smaller values mean older timestamps, which
is better. For greedy and cost-benefit on the other hand, higher
values are better. nilfs_cleanerd_select_segments() was updated
accordingly.

The threshold in nilfs_cleanerd_select_segments() can no
longer be set to sustat->ss_nongc_ctime by default, because the
greedy/cost-benefit policies do not return a timestamp, so their
importance values cannot be compared to one by default. Instead
segments that are younger than sustat->ss_nongc_ctime are always
excluded.
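
As a rough worked example of the two importance values (numbers made up):
for a segment with 2048 blocks of which 512 are live, last modified 3600
seconds before ss_nongc_ctime, greedy yields 2048 - 512 = 1536, whereas
cost-benefit yields (3600 * 1000 * 1536) / (2 * 512) = 5400000. A segment
with the same amount of reclaimable space but a much younger last
modification scores the same under greedy, but considerably lower under
cost-benefit.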

[1] Mendel Rosenblum and John K. Ousterhout. The design and implementa-
tion of a log-structured file system. ACM Trans. Comput. Syst.,
10(1):26–52, February 1992.

Signed-off-by: Andreas Rohner 
---
 sbin/cleanerd/cldconfig.c | 79 +--
 sbin/cleanerd/cldconfig.h | 22 +
 sbin/cleanerd/cleanerd.c  | 43 --
 3 files changed, 126 insertions(+), 18 deletions(-)

diff --git a/sbin/cleanerd/cldconfig.c b/sbin/cleanerd/cldconfig.c
index c8b197b..68090e9 100644
--- a/sbin/cleanerd/cldconfig.c
+++ b/sbin/cleanerd/cldconfig.c
@@ -380,7 +380,9 @@ nilfs_cldconfig_handle_clean_check_interval(struct 
nilfs_cldconfig *config,
 }
 
 static unsigned long long
-nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si)
+nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si,
+  const struct nilfs_sustat *sustat,
+  __u64 prottime)
 {
return si->sui_lastmod;
 }
@@ -392,13 +394,84 @@ nilfs_cldconfig_handle_selection_policy_timestamp(struct 
nilfs_cldconfig *config
config->cf_selection_policy.p_importance =
NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE;
config->cf_selection_policy.p_threshold =
-   NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD;
+   NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+   config->cf_selection_policy.p_comparison =
+   NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER;
+   return 0;
+}
+
+static unsigned long long
+nilfs_cldconfig_selection_policy_greedy(const struct nilfs_suinfo *si,
+   const struct nilfs_sustat *sustat,
+   __u64 prottime)
+{
+   if (si->sui_nblocks < si->sui_nlive_blks ||
+   si->sui_nlive_lastmod >= prottime)
+   return 0;
+
+   return si->sui_nblocks - si->sui_nlive_blks;
+}
+
+static int
+nilfs_cldconfig_handle_selection_policy_greedy(struct nilfs_cldconfig *config,
+  char **tokens, size_t ntoks)
+{
+   config->cf_selection_policy.p_importance =
+   nilfs_cldconfig_selection_policy_greedy;
+   config->cf_selection_policy.p_threshold =
+   NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+   config->cf_selection_policy.p_comparison =
+   NILFS_CLDCONFIG_SELECTION_POLICY_BIGGER_IS_BETTER;
+   return 0;
+}
+
+static unsigned long long
+nilfs_cldconfig_selection_policy_cost_benefit(const struct nilfs_suinfo *si,
+ const struct nilfs_sustat *sustat,
+ __u64 prottime)
+{
+   __u32 free_blocks, cleaning_cost;
+   unsigned long long age;
+
+   if (si->sui_nblocks < si->sui_nlive_blks ||
+   sustat->ss_nongc_ctime < si->sui_lastmod ||
+   si->sui_nlive_lastmod >= prottime)
+   return 0;
+
+   free_blocks = si->sui_nblocks - si->sui_nlive_blks;
+   /* read the whole segment + write the live blocks */
+   cleaning_cost = 2 * si->sui_nlive_blks;
+   /*
+* multiply by 1000 to convert age to milliseconds
+* (higher precision for division)
+*/
+   age = (sustat->ss_nongc_ctime - si->sui_lastmod) * 1000;
+
+   if (cleaning_cost == 0)
+   cleaning_cost = 1;
+
+   return (age * free_blocks) / cleaning_cost;
+}
+
+static int
+nilfs_cldconfig_handle_selection_policy_cost_benefit(
+   struct nilfs_cldconfig *config,
+

[PATCH 4/6] nilfs-utils: implement the tracking of live blocks for set_suinfo

2015-02-24 Thread Andreas Rohner
If the tracking of live blocks is enabled, the information passed to
the kernel with the set_suinfo ioctl must also be modified. To this
end the nilfs_count_nlive_blks() function is introduced. It simply
loops through the vdescv and bdescv vectors and counts the live
blocks belonging to a certain segment. Here the new vdesc flags
introduced earlier come in handy. If the NILFS_VDESC_SNAPSHOT flag is
set, the block is always counted as alive. However, if it is not set
and NILFS_VDESC_PROTECTION_PERIOD is set instead, the block is counted
as reclaimable.

Additionally the nilfs_xreclaim_segment() function is refactored, so
that the set_suinfo part is extracted into its own function
nilfs_try_set_suinfo(). This is useful, because the code gets more
complicated with the new additions.

If the kernel either doesn't support the set_suinfo ioctl or doesn't
support the set_nlive_blks flag, it returns ENOTTY or EINVAL
respectively and the corresponding options are disabled and not used
again.
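
Schematically, the fallback described above amounts to the following (a
sketch of the error handling only; the opt helpers are the ones generated by
the NILFS_OPT_FLAG() macros):

        ret = nilfs_set_suinfo(nilfs, sup, nsup);   /* sup: array built above */
        if (ret < 0) {
                if (errno == ENOTTY)
                        /* kernel has no NILFS_IOCTL_SET_SUINFO at all */
                        nilfs_opt_clear_set_suinfo(nilfs);
                else if (errno == EINVAL)
                        /* kernel does not understand the SUFILE extension */
                        nilfs_opt_clear_track_live_blks(nilfs);
        }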

Signed-off-by: Andreas Rohner 
---
 include/nilfs.h |   6 ++
 lib/gc.c| 168 +++-
 2 files changed, 136 insertions(+), 38 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index 22a9190..8511163 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -343,6 +343,12 @@ static inline __u32 nilfs_get_blocks_per_segment(const 
struct nilfs *nilfs)
return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
 }
 
+static inline __u64
+nilfs_get_segnum_of_block(const struct nilfs *nilfs, sector_t blocknr)
+{
+   return blocknr / nilfs_get_blocks_per_segment(nilfs);
+}
+
 static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
 {
__u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
diff --git a/lib/gc.c b/lib/gc.c
index b56744c..a2461b9 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -620,6 +620,121 @@ static int nilfs_toss_bdescs(struct nilfs_vector *bdescv)
 }
 
 /**
+ * nilfs_count_nlive_blks - returns the number of live blocks in segnum
+ * @nilfs: nilfs object
+ * @segnum: segment number
+ * @bdescv: vector object storing (descriptors of) disk block numbers
+ * @vdescv: vector object storing (descriptors of) virtual block numbers
+ */
+static size_t nilfs_count_nlive_blks(const struct nilfs *nilfs,
+__u64 segnum,
+struct nilfs_vector *vdescv,
+struct nilfs_vector *bdescv)
+{
+   struct nilfs_vdesc *vdesc;
+   struct nilfs_bdesc *bdesc;
+   int i;
+   size_t res = 0;
+
+   for (i = 0; i < nilfs_vector_get_size(bdescv); i++) {
+   bdesc = nilfs_vector_get_element(bdescv, i);
+   assert(bdesc != NULL);
+
+   if (nilfs_get_segnum_of_block(nilfs, bdesc->bd_blocknr) ==
+   segnum && nilfs_bdesc_is_live(bdesc))
+   ++res;
+   }
+
+   for (i = 0; i < nilfs_vector_get_size(vdescv); i++) {
+   vdesc = nilfs_vector_get_element(vdescv, i);
+   assert(vdesc != NULL);
+
+   if (nilfs_get_segnum_of_block(nilfs, vdesc->vd_blocknr) ==
+   segnum && (nilfs_vdesc_snapshot(vdesc) ||
+   !nilfs_vdesc_protection_period(vdesc)))
+   ++res;
+   }
+
+   return res;
+}
+
+/**
+ * nilfs_try_set_suinfo - wrapper for nilfs_set_suinfo
+ * @nilfs: nilfs object
+ * @segnums: array of segment numbers storing selected segments
+ * @nsegs: size of the @segnums array
+ * @vdescv: vector object storing (descriptors of) virtual block numbers
+ * @bdescv: vector object storing (descriptors of) disk block numbers
+ *
+ * Description: nilfs_try_set_suinfo() prepares the input data structure
+ * for nilfs_set_suinfo(). If the kernel doesn't support the
+ * NILFS_IOCTL_SET_SUINFO ioctl, errno is set to ENOTTY and the set_suinfo
+ * option is cleared to prevent future calls to nilfs_try_set_suinfo().
+ * Similarly if the SUFILE extension is not supported by the kernel,
+ * errno is set to EINVAL and the track_live_blks option is disabled.
+ *
+ * Return Value: On success, zero is returned.  On error, a negative value
+ * is returned. If errno is set to ENOTTY or EINVAL, the kernel doesn't support
+ * the current configuration for nilfs_set_suinfo().
+ */
+static int nilfs_try_set_suinfo(struct nilfs *nilfs, __u64 *segnums,
+   size_t nsegs, struct nilfs_vector *vdescv,
+   struct nilfs_vector *bdescv)
+{
+   struct nilfs_vector *supv;
+   struct nilfs_suinfo_update *sup;
+   struct timeval tv;
+   int ret = -1;
+   size_t i, nblocks;
+
+   supv = nilfs_vector_create(sizeof(struct nilfs_suinfo_update));
+   if (!supv)
+   goto out;
+
+   ret = gettimeofday(&tv, NULL);
+   if (ret < 0)
+   goto out;
+
+   for (i = 0; i <

[PATCH 2/6] nilfs-utils: add additional flags for nilfs_vdesc

2015-02-24 Thread Andreas Rohner
This patch adds support for additional bit-flags to the nilfs_vdesc
structure used by the GC to communicate block information to the
kernel.

The field vd_flags cannot be used for this purpose, because it does
not support bit-flags, and changing that would break backwards
compatibility. Therefore the padding field is renamed to vd_blk_flags
to contain more flags.

Unfortunately older versions of nilfs-utils do not initialize the
padding field to zero. So it is necessary to signal to the kernel if
the new vd_blk_flags field contains usable flags or just random data.
Since the vd_period field is only used in userspace, and is guaranteed
to contain a value that is > 0 (NILFS_CNO_MIN == 1), it can be used to
give the kernel a hint. So if vd_period.p_start is set to 0, the
vd_blk_flags field will be interpreted by the kernel.

The following new flags are added:

NILFS_VDESC_SNAPSHOT:
The block corresponding to the vdesc structure is protected by a
snapshot. This information is used in the kernel as well as in
nilfs-utils to calculate the number of live blocks in a given
segment. A block with this flag is counted as live regardless of
other indicators.

NILFS_VDESC_PROTECTION_PERIOD:
The block corresponding to the vdesc structure is protected by the
protection period of the userspace GC. The block is actually
reclaimable, but for the moment protected. So it has to be
treated as if it were alive and moved to a new free segment,
but it must not be counted as live by the kernel. This flag
indicates to the kernel, that this block should be counted as
reclaimable.

The nilfs_vdesc_is_live() function is modified to store the
corresponding flags in the vdesc structure. However the algorithm it
uses is not modified, so it should return exactly the same results.

After nilfs_vdesc_is_live() is called, the vd_period field is no
longer needed and is set to 0 to indicate to the kernel that the
vd_blk_flags field should be interpreted. This ensures full backward
compatibility:

Old nilfs2 and new nilfs-utils:
vd_blk_flags is ignored

New nilfs2 and old nilfs-utils:
vd_period.p_start > 0 so vd_blk_flags is ignored

New nilfs2 and new nilfs-utils:
vd_period.p_start == 0 so vd_blk_flags is interpreted
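
On the kernel side the hint boils down to a check along these lines (a
sketch, not the exact code of the corresponding kernel patch):

        /* old nilfs-utils never writes 0 here (NILFS_CNO_MIN == 1), so
         * p_start == 0 marks vd_blk_flags as valid */
        if (vdesc->vd_period.p_start == 0) {
                snapshot = vdesc->vd_blk_flags & (1UL << NILFS_VDESC_SNAPSHOT);
                pp       = vdesc->vd_blk_flags &
                           (1UL << NILFS_VDESC_PROTECTION_PERIOD);
        } else {
                /* vd_blk_flags may contain random data: ignore it */
                snapshot = 0;
                pp = 0;
        }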

Signed-off-by: Andreas Rohner 
---
 include/nilfs2_fs.h | 58 +++--
 lib/gc.c| 36 -
 2 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index 9137824..d01a924 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -884,7 +884,7 @@ struct nilfs_vinfo {
  * @vd_blocknr: disk block number
  * @vd_offset: logical block offset inside a file
  * @vd_flags: flags (data or node block)
- * @vd_pad: padding
+ * @vd_blk_flags: additional flags
  */
 struct nilfs_vdesc {
__u64 vd_ino;
@@ -894,9 +894,63 @@ struct nilfs_vdesc {
__u64 vd_blocknr;
__u64 vd_offset;
__u32 vd_flags;
-   __u32 vd_pad;
+   /*
+* vd_blk_flags needed because vd_flags doesn't support
+* bit-flags because of backwards compatibility
+*/
+   __u32 vd_blk_flags;
 };
 
+/* vdesc flags */
+enum {
+   NILFS_VDESC_DATA,
+   NILFS_VDESC_NODE,
+
+   /* ... */
+};
+enum {
+   NILFS_VDESC_SNAPSHOT,
+   NILFS_VDESC_PROTECTION_PERIOD,
+
+   /* ... */
+
+   __NR_NILFS_VDESC_FIELDS,
+};
+
+#define NILFS_VDESC_FNS(flag, name)\
+static inline void \
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)  \
+{  \
+   vdesc->vd_flags = NILFS_VDESC_##flag;   \
+}  \
+static inline int  \
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)\
+{  \
+   return vdesc->vd_flags == NILFS_VDESC_##flag;   \
+}
+
+#define NILFS_VDESC_FNS2(flag, name)   \
+static inline void \
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)  \
+{  \
+   vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag); \
+}  \
+static inline void \
+nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)\
+{  \
+   vdesc->vd_blk_flags &

[PATCH 1/6] nilfs-utils: extend SUFILE on-disk format to enable track live blocks

2015-02-24 Thread Andreas Rohner
This patch extends the nilfs_segment_usage structure with two extra
fields. This changes the on-disk format of the SUFILE, but the nilfs2
metadata files are flexible enough, so that there are no compatibility
issues. The extension is fully backwards compatible. Nevertheless a
feature compatibility flag was added to indicate the on-disk format
change.

The new field su_nlive_blks is used to track the number of live blocks
in the corresponding segment. Its value should always be smaller than
su_nblocks, which contains the total number of blocks in the segment.

The field su_nlive_lastmod is necessary because of the protection period
used by the GC. It is a timestamp, which contains the last time
su_nlive_blks was modified. For example if a file is deleted, its
blocks are subtracted from su_nlive_blks and are therefore
considered to be reclaimable by the kernel. But the GC additionally
protects them with the protection period. So while su_nlive_blks
contains the number of potentially reclaimable blocks, the actual number
depends on the protection period. To enable GC policies to
effectively choose or prefer segments with unprotected blocks, the
timestamp in su_nlive_lastmod is necessary.

Since the changes to the disk layout are fully backwards compatible and
the feature flag cannot be set after file system creation time,
NILFS_FEATURE_COMPAT_SUFILE_EXTENSION is set by default. It can however
be disabled by mkfs.nilfs2 -O ^sufile_ext

Signed-off-by: Andreas Rohner 
---
 bin/lssu.c  | 14 +++
 include/nilfs2_fs.h | 46 +--
 lib/feature.c   |  2 ++
 man/mkfs.nilfs2.8   |  8 +++
 sbin/mkfs/mkfs.c| 69 +++--
 5 files changed, 109 insertions(+), 30 deletions(-)

diff --git a/bin/lssu.c b/bin/lssu.c
index 09ed973..e50e628 100644
--- a/bin/lssu.c
+++ b/bin/lssu.c
@@ -104,8 +104,8 @@ static const struct lssu_format lssu_format[] = {
},
{
"   SEGNUMDATE TIME STAT NBLOCKS" \
-   "   NLIVEBLOCKS",
-   "%17llu  %s %c%c%c%c  %10u %10u (%3u%%)\n"
+   "   NLIVEBLOCKS   NPREDLIVEBLOCKS",
+   "%17llu  %s %c%c%c%c  %10u %10u (%3u%%) %10u (%3u%%)\n"
}
 };
 
@@ -164,9 +164,9 @@ static ssize_t lssu_print_suinfo(struct nilfs *nilfs, __u64 
segnum,
time_t t;
char timebuf[LSSU_BUFSIZE];
ssize_t i, n = 0, ret;
-   int ratio;
+   int ratio, predratio;
int protected;
-   size_t nliveblks;
+   size_t nliveblks, npredliveblks;
 
for (i = 0; i < nsi; i++, segnum++) {
if (!all && nilfs_suinfo_clean(&suinfos[i]))
@@ -192,7 +192,10 @@ static ssize_t lssu_print_suinfo(struct nilfs *nilfs, 
__u64 segnum,
break;
case LSSU_MODE_LATEST_USAGE:
nliveblks = 0;
+   npredliveblks = suinfos[i].sui_nlive_blks;
ratio = 0;
+   predratio = (npredliveblks * 100 + 99) /
+   blocks_per_segment;
protected = suinfos[i].sui_lastmod >= prottime;
 
if (!nilfs_suinfo_dirty(&suinfos[i]) ||
@@ -223,7 +226,8 @@ skip_scan:
   nilfs_suinfo_dirty(&suinfos[i]) ? 'd' : '-',
   nilfs_suinfo_error(&suinfos[i]) ? 'e' : '-',
   protected ? 'p' : '-',
-  suinfos[i].sui_nblocks, nliveblks, ratio);
+  suinfos[i].sui_nblocks, nliveblks, ratio,
+  npredliveblks, predratio);
break;
}
n++;
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index a16ad4c..9137824 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -219,9 +219,11 @@ struct nilfs_super_block {
  * If there is a bit set in the incompatible feature set that the kernel
  * doesn't know about, it should refuse to mount the filesystem.
  */
-#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT0x0001ULL
+#define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION  (1ULL << 0)
 
-#define NILFS_FEATURE_COMPAT_SUPP  0ULL
+#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT(1ULL << 0)
+
+#define NILFS_FEATURE_COMPAT_SUPP  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
 #define NILFS_FEATURE_COMPAT_RO_SUPP   NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP0ULL
 
@@ -607,18 +609,35 @@ struct nilfs_cpfile_header {
  sizeof(struct nilfs_checkpoint) - 1) /\
sizeof(struct nilfs_checkpoint))
 
+#undef offsetof
+#define offsetof(TYPE, MEMBER) ((size_t) &((

[PATCH 3/6] nilfs-utils: add support for tracking live blocks

2015-02-24 Thread Andreas Rohner
This patch adds a new feature flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
which allows the user to enable and disable the tracking of live
blocks. The flag can be set at file system creation time with mkfs or
at any later time with nilfs-tune.

Additionally a new option NILFS_OPT_TRACK_LIVE_BLKS is added to be
used by the GC. It is set to the same value as
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS at startup. It is mainly used to
easily and efficiently check for the feature at runtime and to disable
it if the kernel doesn't support it.

It is fully backwards compatible, because
NILFS_FEATURE_COMPAT_SUFILE_EXTENSION also is backwards compatible and
it basically only tells the kernel to update a counter for every
segment in the SUFILE. If the kernel doesn't support it, the counter
won't be updated and the GC policies depending on that information
will work less efficiently, but they will still work.

Signed-off-by: Andreas Rohner 
---
 include/nilfs.h  | 30 +++---
 include/nilfs2_fs.h  |  4 +++-
 lib/feature.c|  2 ++
 lib/nilfs.c  | 32 
 man/mkfs.nilfs2.8|  6 ++
 sbin/mkfs/mkfs.c |  3 ++-
 sbin/nilfs-tune/nilfs-tune.c |  4 ++--
 7 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index f695f48..22a9190 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -130,6 +130,7 @@ struct nilfs {
 
 #define NILFS_OPT_MMAP 0x01
 #define NILFS_OPT_SET_SUINFO   0x02
+#define NILFS_OPT_TRACK_LIVE_BLKS  0x04
 
 
 struct nilfs *nilfs_open(const char *, const char *, int);
@@ -141,9 +142,25 @@ void nilfs_opt_clear_mmap(struct nilfs *);
 int nilfs_opt_set_mmap(struct nilfs *);
 int nilfs_opt_test_mmap(struct nilfs *);
 
-void nilfs_opt_clear_set_suinfo(struct nilfs *);
-int nilfs_opt_set_set_suinfo(struct nilfs *);
-int nilfs_opt_test_set_suinfo(struct nilfs *);
+#define NILFS_OPT_FLAG(flag, name) \
+static inline void \
+nilfs_opt_set_##name(struct nilfs *nilfs)  \
+{  \
+   nilfs->n_opts |= NILFS_OPT_##flag;  \
+}  \
+static inline void \
+nilfs_opt_clear_##name(struct nilfs *nilfs)\
+{  \
+   nilfs->n_opts &= ~NILFS_OPT_##flag; \
+}  \
+static inline int  \
+nilfs_opt_test_##name(const struct nilfs *nilfs)   \
+{  \
+   return !!(nilfs->n_opts & NILFS_OPT_##flag);\
+}
+
+NILFS_OPT_FLAG(SET_SUINFO, set_suinfo);
+NILFS_OPT_FLAG(TRACK_LIVE_BLKS, track_live_blks);
 
 nilfs_cno_t nilfs_get_oldest_cno(struct nilfs *);
 
@@ -326,4 +343,11 @@ static inline __u32 nilfs_get_blocks_per_segment(const 
struct nilfs *nilfs)
return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
 }
 
+static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
+{
+   __u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
+   return (fc & NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS) &&
+   (fc & NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
+}
+
 #endif /* NILFS_H */
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index d01a924..427ca53 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -220,10 +220,12 @@ struct nilfs_super_block {
  * doesn't know about, it should refuse to mount the filesystem.
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION  (1ULL << 0)
+#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS   (1ULL << 1)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT(1ULL << 0)
 
-#define NILFS_FEATURE_COMPAT_SUPP  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
+#define NILFS_FEATURE_COMPAT_SUPP  (NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
+   | NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP   NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP0ULL
 
diff --git a/lib/feature.c b/lib/feature.c
index d954cda..ebe8c3f 100644
--- a/lib/feature.c
+++ b/lib/feature.c
@@ -57,6 +57,8 @@ static const struct nilfs_feature features[] = {
/* Compat features */
{ NILFS_FEATURE_TYPE_COMPAT,
  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION, "sufile_ext" },
+   { NILFS_FEATURE_TYPE_COMPAT,
+ NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS, "track_live_blks" },
/* Read-only compat features

[PATCH v3] nilfs2: avoid duplicate segment construction for fsync()

2014-12-01 Thread Andreas Rohner
This patch removes filemap_write_and_wait_range() from
nilfs_sync_file(), because it triggers a data segment construction by
calling nilfs_writepages() with WB_SYNC_ALL. A data segment construction
does not remove the inode from the i_dirty list and it does not clear
the NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns
true, which leads to an unnecessary duplicate segment construction in
nilfs_sync_file().

A call to filemap_write_and_wait_range() is not needed, because NILFS2
does not rely on the generic writeback mechanisms. Instead it implements
its own mechanism to collect all dirty pages and write them into
segments. It is more efficient to initiate the segment construction
directly in nilfs_sync_file() without the detour over
filemap_write_and_wait_range().

Additionally the lock of i_mutex is not needed, because all code blocks
that are protected by i_mutex are also protected by a NILFS transaction:

Functioni_mutex nilfs_transaction
--
nilfs_ioctl_setflags:   yes yes
nilfs_fiemap:   yes no
nilfs_write_begin:  yes yes
nilfs_write_end:yes yes
nilfs_lookup:   yes no
nilfs_create:   yes yes
nilfs_link: yes yes
nilfs_mknod:yes yes
nilfs_symlink:  yes yes
nilfs_mkdir:yes yes
nilfs_unlink:   yes yes
nilfs_rmdir:yes yes
nilfs_rename:   yes yes
nilfs_setattr:  yes yes

For nilfs_lookup() i_mutex is held for the parent directory, to protect
it from modification. The segment construction does not modify directory
inodes, so no lock is needed.

nilfs_fiemap() reads the block layout on the disk, by using
nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index e9e3325..3a03e0a 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -39,21 +39,15 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
 */
struct the_nilfs *nilfs;
struct inode *inode = file->f_mapping->host;
-   int err;
-
-   err = filemap_write_and_wait_range(inode->i_mapping, start, end);
-   if (err)
-   return err;
-   mutex_lock(&inode->i_mutex);
+   int err = 0;
 
if (nilfs_inode_dirty(inode)) {
if (datasync)
err = nilfs_construct_dsync_segment(inode->i_sb, inode,
-   0, LLONG_MAX);
+   start, end);
else
err = nilfs_construct_segment(inode->i_sb);
}
-   mutex_unlock(&inode->i_mutex);
 
nilfs = inode->i_sb->s_fs_info;
if (!err)
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] nilfs2: avoid duplicate segment construction for fsync()

2014-12-01 Thread Andreas Rohner
On 2014-12-01 18:13, Ryusuke Konishi wrote:
> Andreas,
> On Sun,  9 Nov 2014 17:00:12 +0100, Andreas Rohner wrote:
>> This patch removes filemap_write_and_wait_range() from
>> nilfs_sync_file(), because it triggers a data segment construction by
>> calling nilfs_writepages() with WB_SYNC_ALL. A data segment construction
>> does not remove the inode from the i_dirty list and it does not clear
>> the NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns
>> true, which leads to an unnecessary duplicate segment construction in
>> nilfs_sync_file().
>>
>> A call to filemap_write_and_wait_range() is not needed, because NILFS2
>> does not rely on the generic writeback mechanisms. Instead it implements
>> its own mechanism to collect all dirty pages and write them into
>> segments. It is more efficient to initiate the segment construction
>> directly in nilfs_sync_file() without the detour over
>> filemap_write_and_wait_range().
>>
>> Additionally the lock of i_mutex is not needed, because all code blocks
>> that are protected by i_mutex are also protected by a NILFS transaction:
>>
>> Functioni_mutex nilfs_transaction
>> --
>> nilfs_ioctl_setflags:   yes yes
>> nilfs_fiemap:   yes no
>> nilfs_write_begin:  yes yes
>> nilfs_write_end:yes yes
>> nilfs_lookup:   yes no
>> nilfs_create:   yes yes
>> nilfs_link: yes yes
>> nilfs_mknod:yes yes
>> nilfs_symlink:  yes yes
>> nilfs_mkdir:yes yes
>> nilfs_unlink:   yes yes
>> nilfs_rmdir:yes yes
>> nilfs_rename:   yes yes
>> nilfs_setattr:  yes yes
>>
>> For nilfs_lookup() i_mutex is held for the parent directory, to protect
>> it from modification. The segment construction does not modify directory
>> inodes, so no lock is needed.
>>
>> nilfs_fiemap() reads the block layout on the disk, by using
>> nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/file.c | 21 -
>>  1 file changed, 8 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
>> index e9e3325..1ad6bdf 100644
>> --- a/fs/nilfs2/file.c
>> +++ b/fs/nilfs2/file.c
>> @@ -41,19 +41,14 @@ int nilfs_sync_file(struct file *file, loff_t start, 
>> loff_t end, int datasync)
>>  struct inode *inode = file->f_mapping->host;
>>  int err;
>>  
>> -err = filemap_write_and_wait_range(inode->i_mapping, start, end);
>> -if (err)
>> -return err;
>> -mutex_lock(&inode->i_mutex);
>> -
>> -if (nilfs_inode_dirty(inode)) {
>> -if (datasync)
>> -err = nilfs_construct_dsync_segment(inode->i_sb, inode,
>> -0, LLONG_MAX);
>> -else
>> -err = nilfs_construct_segment(inode->i_sb);
>> -}
>> -mutex_unlock(&inode->i_mutex);
> 
>> +if (!nilfs_inode_dirty(inode))
>> +return 0;
> 
> I just noticed that this transformation is not equivalent to the
> original one.  With this patch, nilfs_flush_device() is not called if
> nilfs_inode_dirty() is not true, which looks to be causing another
> data integrity issue.
> 
> Could you reconsider if the above check is correct or not ?

Yes, you are right. I thought that no flush would be necessary in that
case, but it clearly is. Sorry for that mistake. I will send in a fixed
version of the patch.

Regards,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] nilfs2: remove unnecessary call to nilfs_construct_dsync_segment()

2014-11-11 Thread Andreas Rohner
On 2014-11-11 17:58, Ryusuke Konishi wrote:
> On Wed, 05 Nov 2014 18:08:57 +0100, Andreas Rohner wrote:
>> On 2014-11-05 01:07, Ryusuke Konishi wrote:
>>> On Tue, 04 Nov 2014 16:50:21 +0100, Andreas Rohner wrote:
>>>
>>> I found filemap_write_and_wait_range() returns error status of
>>> already done page I/Os via filemap_check_errors().  We need to
>>> look into what it does.
>>
>> I have looked into this a bit. AS_EIO and AS_ENOSPC are asynchronous
>> error flags, set by the function mapping_set_error(). However I don't
>> think this is relevant for NILFS2, because it implements its own
>> writepages() function:
>>
>> nilfs_sync_file()
>>filemap_write_and_wait_range()
>>   __filemap_fdatawrite_range()
>>  do_writepages()
>> writepages()
>>nilfs_writepages()
>>
>> mapping_set_error() would only be called if NILFS2 would use
>> generic_writepages() like this:
>>
>> nilfs_sync_file()
>>filemap_write_and_wait_range()
>>   __filemap_fdatawrite_range()
>>  do_writepages()
>> generic_writepages()
>>
>> But it doesn't, so we can ignore filemap_check_errors(). Furthermore
>> NILFS2 doesn't use the generic writeback mechanism of the kernel at all.
>> It creates its own bio in nilfs_segbuf_submit_bh(), submits the bio with
>> nilfs_segbuf_submit_bio() and waits for it with nilfs_segbuf_wait() and
>> records IO-errors in segbuf->sb_err, so there is no need to check AS_EIO
>> and AS_ENOSPC.
>>
>> I think filemap_write_and_wait_range() is mostly useful for in place
>> updates. A copy on write filesystem like NILFS2 doesn't need it. BTRFS
>> doesn't use it either in its fsync function...
> 
> OK.  I confirmed the current NILFS2 doesn't need to check AS_EIO and
> AS_ENOSPC because NILFS2 doesn't evict erroneous pages from the page
> cache; NILFS2 tries to keep such pages until they will be successfully
> written back to disk.  Therefore it can detect error of already
> finished page-IOs without calling filemap_fdatawait_range() or
> filemap_check_errors().
> 
> On the other hand, regular filesystems need these functions in
> fsync() because they will clear the dirty flags of pages and buffers
> even if the writeback failed due to an IO error or a disk full error.
> Without AS_EIO and AS_ENOSPC check, fsync() of these filesystems
> can miss the errors.

So it is necessary to catch errors that happened before the call to
fsync(), because if the pages are not redirtied, those errors could be
missed. I didn't know that.

> However this design policy of NILFS2 has a defect that the system
> easily falls into a memory shortage with too many dirty pages when
> I/O errors will block.
>
> If we introduced similar logic in NILFS2, it would need the
> following changes:
> 
>  - nilfs_abort_logs() and nilfs_end_page_io() are changed so that they
>call mapping_set_error() instead of redirtying data pages on error.
> 
>  - nilfs_sync_file() will be also changed so that it first calls
>filemap_fdatawait_range() and catches errors previously happened.

I don't know if I fully understand the problem. So I may be completely
wrong here.

But if there is only one bad sector somewhere that causes an I/O
error, you could potentially lose the data of a whole segment, and
there could be data loss all over the file system. There could also be
inconsistent metadata, because the metadata files would also lose data.

NILFS has a good chance of recovering from an I/O error, because it will
mark the segment with nilfs_sufile_set_error() and write the pages to a
different segment. Isn't that worth the extra memory?
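
Just to make the trade-off explicit, here is a rough sketch (not actual
NILFS2 code; the function and the keep_dirty switch are made up purely
for illustration) of the two error policies we are discussing:

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Called when writeback of a page has finished with status err. */
static void end_write_sketch(struct page *page, int err, bool keep_dirty)
{
	if (err) {
		if (keep_dirty)
			set_page_dirty(page);  /* redirty-and-retry policy: keep the page in memory */
		else
			mapping_set_error(page->mapping, err);  /* report the error at the next fsync()/sync() */
	}
	end_page_writeback(page);
}

With keep_dirty the failed pages pile up in memory until a later write
succeeds; without it the failure is only remembered in the mapping
flags and the data is gone once the page cache drops the pages.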

Best regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] nilfs2: avoid duplicate segment construction for fsync()

2014-11-09 Thread Andreas Rohner
This patch removes filemap_write_and_wait_range() from
nilfs_sync_file(), because it triggers a data segment construction by
calling nilfs_writepages() with WB_SYNC_ALL. A data segment construction
does not remove the inode from the i_dirty list and it does not clear
the NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns
true, which leads to an unnecessary duplicate segment construction in
nilfs_sync_file().

A call to filemap_write_and_wait_range() is not needed, because NILFS2
does not rely on the generic writeback mechanisms. Instead it implements
its own mechanism to collect all dirty pages and write them into
segments. It is more efficient to initiate the segment construction
directly in nilfs_sync_file() without the detour over
filemap_write_and_wait_range().

Additionally the lock of i_mutex is not needed, because all code blocks
that are protected by i_mutex are also protected by a NILFS transaction:

Function                i_mutex  nilfs_transaction
---------------------------------------------------
nilfs_ioctl_setflags:   yes      yes
nilfs_fiemap:           yes      no
nilfs_write_begin:      yes      yes
nilfs_write_end:        yes      yes
nilfs_lookup:           yes      no
nilfs_create:           yes      yes
nilfs_link:             yes      yes
nilfs_mknod:            yes      yes
nilfs_symlink:          yes      yes
nilfs_mkdir:            yes      yes
nilfs_unlink:           yes      yes
nilfs_rmdir:            yes      yes
nilfs_rename:           yes      yes
nilfs_setattr:          yes      yes

For nilfs_lookup() i_mutex is held for the parent directory, to protect
it from modification. The segment construction does not modify directory
inodes, so no lock is needed.

nilfs_fiemap() reads the block layout on the disk, by using
nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index e9e3325..1ad6bdf 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -41,19 +41,14 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
struct inode *inode = file->f_mapping->host;
int err;
 
-   err = filemap_write_and_wait_range(inode->i_mapping, start, end);
-   if (err)
-   return err;
-   mutex_lock(&inode->i_mutex);
-
-   if (nilfs_inode_dirty(inode)) {
-   if (datasync)
-   err = nilfs_construct_dsync_segment(inode->i_sb, inode,
-   0, LLONG_MAX);
-   else
-   err = nilfs_construct_segment(inode->i_sb);
-   }
-   mutex_unlock(&inode->i_mutex);
+   if (!nilfs_inode_dirty(inode))
+   return 0;
+
+   if (datasync)
+   err = nilfs_construct_dsync_segment(inode->i_sb, inode,
+   start, end);
+   else
+   err = nilfs_construct_segment(inode->i_sb);
 
nilfs = inode->i_sb->s_fs_info;
if (!err)
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] nilfs2: remove unnecessary call to nilfs_construct_dsync_segment()

2014-11-05 Thread Andreas Rohner
On 2014-11-05 01:07, Ryusuke Konishi wrote:
> On Tue, 04 Nov 2014 16:50:21 +0100, Andreas Rohner wrote:
>> On 2014-11-04 15:34, Ryusuke Konishi wrote:
>>> Since each call to nilfs_construct_segment() or
>>> nilfs_construct_dsync_segment() implies an IO completion wait, it
>>> seems that this doubles the latency of fsync().
>>>
>>> Do you really need to call filemap_write_and_wait_range() in
>>> nilfs_sync_file() ?
>>
>> I don't think we need it, but I found the following paragraph in
>> Documentation/filesystems/porting:
>>
>> [mandatory]
>>  If you have your own ->fsync() you must make sure to call
>> filemap_write_and_wait_range() so that all dirty pages are synced out
>> properly. You must also keep in mind that ->fsync() is not called with
>> i_mutex held anymore, so if you require i_mutex locking you must make
>> sure to take it and release it yourself.
>>
>> So I was unsure, if it is safe to remove it. But maybe I interpreted
>> that wrongly, since nilfs_construct_dsync_segment() and
>> nilfs_construct_segment() write out all dirty pages anyway, there is no
>> need for filemap_write_and_wait_range().
> 
> I found filemap_write_and_wait_range() returns error status of
> already done page I/Os via filemap_check_errors().  We need to
> look into what it does.

I have looked into this a bit. AS_EIO and AS_ENOSPC are asynchronous
error flags, set by the function mapping_set_error(). However I don't
think this is relevant for NILFS2, because it implements its own
writepages() function:

nilfs_sync_file()
   filemap_write_and_wait_range()
  __filemap_fdatawrite_range()
 do_writepages()
writepages()
   nilfs_writepages()

mapping_set_error() would only be called if NILFS2 used
generic_writepages() like this:

nilfs_sync_file()
   filemap_write_and_wait_range()
  __filemap_fdatawrite_range()
 do_writepages()
generic_writepages()

But it doesn't, so we can ignore filemap_check_errors(). Furthermore,
NILFS2 doesn't use the generic writeback mechanism of the kernel at all.
It creates its own bio in nilfs_segbuf_submit_bh(), submits it with
nilfs_segbuf_submit_bio(), waits for it with nilfs_segbuf_wait(), and
records I/O errors in segbuf->sb_err, so there is no need to check
AS_EIO and AS_ENOSPC.
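
For reference, my understanding of the AS_EIO/AS_ENOSPC mechanism is
roughly the following (paraphrased as a sketch, not copied from
mm/filemap.c):

#include <linux/errno.h>
#include <linux/pagemap.h>

/* Writers record an asynchronous writeback error in the mapping ... */
static inline void mapping_set_error_sketch(struct address_space *mapping, int error)
{
	if (unlikely(error)) {
		if (error == -ENOSPC)
			set_bit(AS_ENOSPC, &mapping->flags);
		else
			set_bit(AS_EIO, &mapping->flags);
	}
}

/* ... and filemap_check_errors() later tests and clears those bits. */
static int filemap_check_errors_sketch(struct address_space *mapping)
{
	int ret = 0;

	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}

Since nilfs_writepages() never calls mapping_set_error(), those bits
simply stay clear for NILFS2.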

I think filemap_write_and_wait_range() is mostly useful for in-place
updates. A copy-on-write filesystem like NILFS2 doesn't need it. BTRFS
doesn't use it either in its fsync function...

>> Also do we need i_mutex? As far as I can tell all relevant code blocks
>> are wrapped in nilfs_transaction_begin/commit/abort().
> 
> Yes, we may also remove the i_mutex.  We have to confirm what i_mutex
> protects for nilfs.

There are some callback functions which are called with i_mutex already
held, but I can't find documentation about that right now. I'm sure I
saw it somewhere. Anyway I am going to look into this as well.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] nilfs2: remove unnecessary call to nilfs_construct_dsync_segment()

2014-11-04 Thread Andreas Rohner
Hi Ryusuke,

On 2014-11-04 15:34, Ryusuke Konishi wrote:
> Hi Andreas,
> On Sat,  1 Nov 2014 18:01:07 +0100, Andreas Rohner wrote:
>> If some of the pages between start and end are dirty, then
>> filemap_write_and_wait_range() calls nilfs_writepages() with WB_SYNC_ALL
>> set in the writeback_control structure. This initiates the construction
>> of a dsync segment via nilfs_construct_dsync_segment(). The problem is
>> that the construction of a dsync segment doesn't remove the inode from
>> the i_dirty list and doesn't clear the NILFS_I_DIRTY flag. So
>> nilfs_inode_dirty() still returns true after
>> nilfs_construct_dsync_segment() succeeded. This leads to an
>> unnecessary second call to nilfs_construct_dsync_segment() in
>> nilfs_sync_file() if datasync is true.
>>
>> This patch simply removes the second invocation of
>> nilfs_construct_dsync_segment().
>>
>> Signed-off-by: Andreas Rohner 
> 
> Thank you for posting this patch.
> 
> This optimization looks to become possible by the commit 02c24a821
> "fs: push i_mutex and filemap_write_and_wait down into ->fsync()
> handlers".  I haven't noticed that the change makes it possible to
> simplify nilfs_sync_file() like this.
> 
> One simple question from me is why you removed the call to
> nilfs_construct_dsync_segment() instead of
> filemap_write_and_wait_range().
> 
> If the datasync flag is false, nilfs_sync_file() first calls
> nilfs_construct_dsync_segment() via
> 
>filemap_write_and_wait_range()
>  __filemap_fdatawrite_range(,, WB_SYNC_ALL)
>do_writepages()
>   nilfs_writepages()
>  nilfs_construct_dsync_segment()
> 
> and then calls nilfs_construct_segment().

Exactly.

> Since each call to nilfs_construct_segment() or
> nilfs_construct_dsync_segment() implies an IO completion wait, it
> seems that this doubles the latency of fsync().
> 
> Do you really need to call filemap_write_and_wait_range() in
> nilfs_sync_file() ?

I don't think we need it, but I found the following paragraph in
Documentation/filesystems/porting:

[mandatory]
If you have your own ->fsync() you must make sure to call
filemap_write_and_wait_range() so that all dirty pages are synced out
properly. You must also keep in mind that ->fsync() is not called with
i_mutex held anymore, so if you require i_mutex locking you must make
sure to take it and release it yourself.

So I was unsure if it is safe to remove it. But maybe I interpreted
that wrongly: since nilfs_construct_dsync_segment() and
nilfs_construct_segment() write out all dirty pages anyway, there is no
need for filemap_write_and_wait_range().

Also do we need i_mutex? As far as I can tell all relevant code blocks
are wrapped in nilfs_transaction_begin/commit/abort().

Best regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
> 
>> ---
>>  fs/nilfs2/file.c | 10 +++---
>>  1 file changed, 3 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
>> index e9e3325..b12e0ab 100644
>> --- a/fs/nilfs2/file.c
>> +++ b/fs/nilfs2/file.c
>> @@ -46,13 +46,9 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>>  return err;
>>  mutex_lock(&inode->i_mutex);
>>  
>> -if (nilfs_inode_dirty(inode)) {
>> -if (datasync)
>> -err = nilfs_construct_dsync_segment(inode->i_sb, inode,
>> -0, LLONG_MAX);
>> -else
>> -err = nilfs_construct_segment(inode->i_sb);
>> -}
>> +if (!datasync && nilfs_inode_dirty(inode))
>> +err = nilfs_construct_segment(inode->i_sb);
>> +
>>  mutex_unlock(&inode->i_mutex);
>>  
>>  nilfs = inode->i_sb->s_fs_info;
>> -- 
>> 2.1.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] nilfs2: remove unnecessary call to nilfs_construct_dsync_segment()

2014-11-01 Thread Andreas Rohner
If some of the pages between start and end are dirty, then
filemap_write_and_wait_range() calls nilfs_writepages() with WB_SYNC_ALL
set in the writeback_control structure. This initiates the construction
of a dsync segment via nilfs_construct_dsync_segment(). The problem is
that the construction of a dsync segment doesn't remove the inode from
the i_dirty list and doesn't clear the NILFS_I_DIRTY flag. So
nilfs_inode_dirty() still returns true after
nilfs_construct_dsync_segment() succeeded. This leads to an
unnecessary second call to nilfs_construct_dsync_segment() in
nilfs_sync_file() if datasync is true.

This patch simply removes the second invocation of
nilfs_construct_dsync_segment().

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index e9e3325..b12e0ab 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -46,13 +46,9 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
return err;
mutex_lock(&inode->i_mutex);
 
-   if (nilfs_inode_dirty(inode)) {
-   if (datasync)
-   err = nilfs_construct_dsync_segment(inode->i_sb, inode,
-   0, LLONG_MAX);
-   else
-   err = nilfs_construct_segment(inode->i_sb);
-   }
+   if (!datasync && nilfs_inode_dirty(inode))
+   err = nilfs_construct_segment(inode->i_sb);
+
mutex_unlock(&inode->i_mutex);
 
nilfs = inode->i_sb->s_fs_info;
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] improve inode allocation

2014-10-21 Thread Andreas Rohner
Hi,

I extended the inode test to delete a certain number of inodes.

https://github.com/zeitgeist87/inodetest

This should make the benchmark a little bit more realistic, but the
results are essentially the same as before. For the following tests
about 10% of the inodes were continuously deleted:

1.) One process, 20 million inodes, 2 million deleted:
a) Normal Nilfs
$ time inodetest 1000 1000 20 100
real    3m49.793s
user    0m6.323s
sys     2m47.947s

$ time find ./ > /dev/null
real    5m21.020s
user    0m25.440s
sys     1m6.633s
b) Improved Nilfs
$ time inodetest 1000 1000 20 100
real    2m35.011s
user    0m6.847s
sys     1m33.093s

$ time find ./ > /dev/null
real    5m18.922s
user    0m25.323s
sys     1m6.877s
2.) Three processes in parallel, 60 million inodes, 6 million deleted
a) Normal Nilfs
$ time inodetest 1000 1000 20 100 &
$ time inodetest 1000 1000 20 100 &
$ time inodetest 1000 1000 20 100 &
real    19m18.135s
user    0m7.973s
sys     16m16.833s

$ time find ./ > /dev/null
real    29m38.577s
user    1m32.763s
sys     4m44.140s
b) Improved Nilfs
$ time inodetest 1000 1000 20 100 &
$ time inodetest 1000 1000 20 100 &
$ time inodetest 1000 1000 20 100 &
real    6m30.458s
user    0m6.697s
sys     3m10.213s

$ time find ./ > /dev/null
real    28m50.304s
user    1m30.133s
sys     4m40.770s


So the performance improved 32% for a single process and 66% for
multiple processes.

All benchmarks were run on an AMD Phenom II X6 1090T processor with 6
cores and 8 GB of RAM.

Best regards
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/1] improve inode allocation

2014-10-19 Thread Andreas Rohner
Hi,

The following patch is a very simplified version of the one I sent in
a week ago. My benchmarks showed that the aligned allocation of
directories creates too much overhead.

I used a very simple C program that creates millions of inodes as a 
benchmark. I uploaded the source code to github:

https://github.com/zeitgeist87/inodetest
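
(The actual benchmark is the program in the repository above; the
following is only a minimal, self-contained illustration of the idea,
with made-up directory and file names, not the real inodetest source.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	char path[64];
	int d, f, fd;

	/* create 1000 directories with 1000 files each = 1 million inodes */
	for (d = 0; d < 1000; d++) {
		snprintf(path, sizeof(path), "dir%d", d);
		mkdir(path, 0755);
		for (f = 0; f < 1000; f++) {
			snprintf(path, sizeof(path), "dir%d/file%d", d, f);
			fd = open(path, O_CREAT | O_WRONLY, 0644);
			if (fd >= 0)
				close(fd);
		}
	}
	return 0;
}

Every created file allocates one inode, so the search in
nilfs_ifile_create_inode() runs once per file.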

Here are the results:

1.) One process, 20 million inodes:
a) Normal Nilfs
$ time inodetest 1000 1000 20
real    3m50.431s
user    0m5.647s
sys     2m50.760s

$ time find ./ > /dev/null
real    5m49.021s
user    0m27.917s
sys     1m14.197s
b) Improved Nilfs
$ time inodetest 1000 1000 20
real    2m31.857s
user    0m5.950s
sys     1m29.707s

$ time find ./ > /dev/null
real    5m49.060s
user    0m27.787s
sys     1m13.673s
2.) Three processes in parallel, total of 60 million inodes
a) Normal Nilfs
$ time inodetest 1000 1000 20 &
$ time inodetest 1000 1000 20 &
$ time inodetest 1000 1000 20 &
real    20m21.914s
user    0m5.603s
sys     17m43.987s

$ time find ./ > /dev/null
real    28m10.340s
user    1m38.477s
sys     5m9.133s
b) Improved Nilfs
$ time inodetest 1000 1000 20 &
$ time inodetest 1000 1000 20 &
$ time inodetest 1000 1000 20 &
real    6m21.609s
user    0m5.970s
sys     3m8.100s

$ time find ./ > /dev/null
real    30m35.320s
user    1m40.577s
sys     5m14.580s

There is a significant improvement in runtime for both the single and
the multiple process case. It is also notable that the improved version
scales much better for parallel processes.

"find ./ > /dev/null" is virtually identical for the benchmark 1.a and 
1.b, but 2.b is consitently slower by 2 minutes, which I cannot 
currently explain.

I repeated the benchmarks several times and there were only tiny 
variations in the results. 

Best regards
Andreas Rohner  

Andreas Rohner (1):
  nilfs2: improve inode allocation

 fs/nilfs2/ifile.c   | 31 +--
 fs/nilfs2/ifile.h   |  1 +
 fs/nilfs2/segment.c |  5 -
 3 files changed, 34 insertions(+), 3 deletions(-)

-- 
2.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] nilfs2: improve inode allocation

2014-10-19 Thread Andreas Rohner
The current inode allocation algorithm of NILFS2 does not use any
information about the previous allocation. It simply searches for a free
entry in the ifile, always starting from position 0. This patch
introduces an improved allocation scheme.

Inodes are allocated sequentially within the ifile. The current
algorithm always starts the search for a free slot at position 0,
because it has to find possible freed up slots of previously deleted
inodes. This minimizes wasted space, but has a certain cost attached to
it.

This patch introduces the field next_inode in the nilfs_ifile_info
structure, which stores the location of the most likely next free slot.
Whenever an inode is created or deleted next_inode is updated
accordingly. If an inode is deleted next_inode points to the newly
available slot. If an inode is created next_inode points to the slot
after that. Instead of starting every search for a free slot at 0, it is
started at next_inode. This way the search space is narrowed
considerably and a lot of overhead can be avoided.

For performance reasons the updates to next_inode are not
protected by locks. So race conditions, non-atomic updates and lost
updates are possible. This can lead to some empty slots that are
overlooked and therefore to some wasted space. But this is only
temporary, because next_inode is periodically reset to 0, to force a
full search starting from position 0.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/ifile.c   | 31 +--
 fs/nilfs2/ifile.h   |  1 +
 fs/nilfs2/segment.c |  5 -
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/fs/nilfs2/ifile.c b/fs/nilfs2/ifile.c
index 6548c78..0f27e66 100644
--- a/fs/nilfs2/ifile.c
+++ b/fs/nilfs2/ifile.c
@@ -33,10 +33,12 @@
  * struct nilfs_ifile_info - on-memory private data of ifile
  * @mi: on-memory private data of metadata file
  * @palloc_cache: persistent object allocator cache of ifile
+ * @next_inode: ino of the next likely free entry
  */
 struct nilfs_ifile_info {
struct nilfs_mdt_info mi;
struct nilfs_palloc_cache palloc_cache;
+   __u64 next_inode;
 };
 
 static inline struct nilfs_ifile_info *NILFS_IFILE_I(struct inode *ifile)
@@ -45,6 +47,26 @@ static inline struct nilfs_ifile_info *NILFS_IFILE_I(struct inode *ifile)
 }
 
 /**
+ * nilfs_ifile_next_inode_reset - set next_inode to 0
+ * @ifile: ifile inode
+ *
+ * Description: The value of next_inode will increase with every new
+ * allocation of an inode, because it is used as the starting point of the
+ * search for a free entry in the ifile. It should be reset periodically to 0
+ * (e.g.: every segctor timeout), so that previously deleted entries can be
+ * found.
+ */
+void nilfs_ifile_next_inode_reset(struct inode *ifile)
+{
+   /*
+* possible race condition/non atomic update
+* next_inode is just a hint for the next allocation so
+* the possible invalid values are not really harmful
+*/
+   NILFS_IFILE_I(ifile)->next_inode = 0;
+}
+
+/**
  * nilfs_ifile_create_inode - create a new disk inode
  * @ifile: ifile inode
  * @out_ino: pointer to a variable to store inode number
@@ -68,9 +90,8 @@ int nilfs_ifile_create_inode(struct inode *ifile, ino_t *out_ino,
struct nilfs_palloc_req req;
int ret;
 
-   req.pr_entry_nr = 0;  /* 0 says find free inode from beginning of
-a group. dull code!! */
req.pr_entry_bh = NULL;
+   req.pr_entry_nr = NILFS_IFILE_I(ifile)->next_inode;
 
ret = nilfs_palloc_prepare_alloc_entry(ifile, &req);
if (!ret) {
@@ -86,6 +107,9 @@ int nilfs_ifile_create_inode(struct inode *ifile, ino_t *out_ino,
nilfs_palloc_commit_alloc_entry(ifile, &req);
mark_buffer_dirty(req.pr_entry_bh);
nilfs_mdt_mark_dirty(ifile);
+
+   /* see comment in nilfs_ifile_next_inode_reset() */
+   NILFS_IFILE_I(ifile)->next_inode = req.pr_entry_nr + 1;
*out_ino = (ino_t)req.pr_entry_nr;
*out_bh = req.pr_entry_bh;
return 0;
@@ -137,6 +161,9 @@ int nilfs_ifile_delete_inode(struct inode *ifile, ino_t ino)
 
nilfs_palloc_commit_free_entry(ifile, &req);
 
+   /* see comment in nilfs_ifile_next_inode_reset() */
+   if (NILFS_IFILE_I(ifile)->next_inode > req.pr_entry_nr)
+   NILFS_IFILE_I(ifile)->next_inode = req.pr_entry_nr;
return 0;
 }
 
diff --git a/fs/nilfs2/ifile.h b/fs/nilfs2/ifile.h
index 679674d..36edbcc 100644
--- a/fs/nilfs2/ifile.h
+++ b/fs/nilfs2/ifile.h
@@ -45,6 +45,7 @@ static inline void nilfs_ifile_unmap_inode(struct inode *ifile, ino_t ino,
kunmap(ibh->b_page);
 }
 
+void nilfs_ifile_next_inode_reset(struct inode *);
 int nilfs_ifile_create_inode(struct inode *, ino_t *, struct buffer_head **);
 int nilfs_ifile_delete_inode(struct inode *, ino_t);
 int nilfs_ifile_get_inode_block(struct inode *, ino_t, struct buffer_head **);
diff --git a/fs/nilfs2/segm

Re: [PATCH 0/2] nilfs2: improve inode allocation algorithm

2014-10-13 Thread Andreas Rohner
On 2014-10-13 16:52, Ryusuke Konishi wrote:
> Hi,
> On Sun, 12 Oct 2014 12:38:21 +0200, Andreas Rohner wrote:
>> Hi,
>>
>> The algorithm simply makes sure, that after a directory inode there are
>> a certain number of free slots available and the search for file inodes
>> is started at their parent directory.
>>
>> I haven't had the time yet to do a full-scale performance test of it, but
>> my simple preliminary tests have shown, that the allocation of inodes 
>> takes a little bit longer and the lookup is a little bit faster. My 
>> simple test just creates 1500 directories and after that creates 10 
>> files in each directory.
>>
>> So more testing is definitely necessary, but I wanted to get some
>> feedback about the design first. Is my code a step in the right 
>> direction?
>>
>> Best regards,
>> Andreas Rohner
>>
>> Andreas Rohner (2):
>>   nilfs2: support the allocation of whole blocks of meta data entries
>>   nilfs2: improve inode allocation algorithm
>>
>>  fs/nilfs2/alloc.c   | 161 
>> 
>>  fs/nilfs2/alloc.h   |  18 +-
>>  fs/nilfs2/ifile.c   |  63 ++--
>>  fs/nilfs2/ifile.h   |   6 +-
>>  fs/nilfs2/inode.c   |   6 +-
>>  fs/nilfs2/segment.c |   5 +-
>>  6 files changed, 235 insertions(+), 24 deletions(-)
> 
> I don't know whether this patchset is going in the right direction.
> .. we should first measure how the original naive allocator is bad in
> comparison with an elaborately designed allocator like this.  But, I
> will add some comments anyway:

I think the alignment creates a lot of overhead, because every directory
uses up a whole block in the ifile. I could also create a simpler patch
that only stores the last allocated inode number in struct
nilfs_ifile_info and starts the search from there for the next
allocation. Then I can test the three versions against each other in a
large scale test.

>  1) You must not use sizeof(struct nilfs_inode) to get inode size.
> The size of on-disk inodes is variable and you have to use
> NILFS_MDT(ifile)->mi_entry_size to ensure compatibility.
> To get ipb (= number of inodes per block), you should use
> NILFS_MDT(ifile)->mi_entries_per_block.
> Please remove nilfs_ifile_inodes_per_block().  It's redundant.

Agreed.

>  2) __nilfs_palloc_prepare_alloc_entry()
> The argument block_size is so confusing. Usually we use it
> for the size of disk block.
> Please use a proper word "alignment_size" or so.

Yes that's true "alignment_size" sounds better.

>  3) nilfs_palloc_find_available_slot_align32()
> This function seems to be violating endian compatibility.
> The order of two 32-bit words in a 64-bit word in little endian
> architectures differs from that of big endian architectures.
> 
> Having three different implementations looks too overkill to me at
> this time.  It should be removed unless it will make a significant
> difference.

32 is the most common case (4096 block size and 128 inode size), so I
thought it makes sense to optimize for it. But it is not necessary and
it shouldn't make a big difference.

>  4) nilfs_cpu_to_leul()
> Adding this macro is not preferable.  It depends on endian.
> Did you look for a generic macro which does the same thing ?

There are only macros for specific bit lengths, as far as I know. But
unsigned long varies between 32-bit and 64-bit systems. You could also
implement it like this:

#if BITS_PER_LONG == 64
#define nilfs_cpu_to_leul   cpu_to_le64
#elif BITS_PER_LONG == 32
#define nilfs_cpu_to_leul   cpu_to_le32
#else
#error BITS_PER_LONG not defined
#endif

Best regards,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] nilfs2: support the allocation of whole blocks of meta data entries

2014-10-12 Thread Andreas Rohner
This patch generalizes the function nilfs_palloc_prepare_alloc_entry()
by adding two parameters, namely block_size and threshold.

The newly allocated entry must be at the start of a block of empty
entries of size block_size, and it must be contained in a group with at
least threshold free entries. The old behavior of the function can be
achieved by supplying 1 and 0, respectively.

This generalization allows more sophisticated allocation algorithms to
be built on top of it. For example, with block_size an algorithm can
space out certain entries to leave room between them for subsequent
allocations, which can achieve better localization of entries that
belong together and are likely to be accessed at the same time in the
future. threshold, on the other hand, can be used to exclude groups
where the search is not likely to find a contiguous empty block of size
block_size.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/alloc.c | 161 ++
 fs/nilfs2/alloc.h |  18 +-
 2 files changed, 166 insertions(+), 13 deletions(-)

diff --git a/fs/nilfs2/alloc.c b/fs/nilfs2/alloc.c
index 741fd02..fb71d90 100644
--- a/fs/nilfs2/alloc.c
+++ b/fs/nilfs2/alloc.c
@@ -331,18 +331,18 @@ void *nilfs_palloc_block_get_entry(const struct inode *inode, __u64 nr,
 }
 
 /**
- * nilfs_palloc_find_available_slot - find available slot in a group
+ * nilfs_palloc_find_available_slot_unaligned - find available slot in a group
  * @inode: inode of metadata file using this allocator
  * @group: group number
  * @target: offset number of an entry in the group (start point)
  * @bitmap: bitmap of the group
  * @bsize: size in bits
  */
-static int nilfs_palloc_find_available_slot(struct inode *inode,
-   unsigned long group,
-   unsigned long target,
-   unsigned char *bitmap,
-   int bsize)
+static int nilfs_palloc_find_available_slot_unaligned(struct inode *inode,
+ unsigned long group,
+ unsigned long target,
+ unsigned char *bitmap,
+ int bsize)
 {
int curr, pos, end, i;
 
@@ -381,6 +381,127 @@ static int nilfs_palloc_find_available_slot(struct inode *inode,
 }
 
 /**
+ * nilfs_palloc_find_available_slot_align32 - find available slot in a group
+ * @inode: inode of metadata file using this allocator
+ * @group: group number
+ * @target: offset number of an entry in the group (start point)
+ * @bitmap: bitmap of the group
+ * @bsize: size in bits
+ *
+ * Description: Finds an available aligned slot in a group. It will be
+ * aligned to 32 slots followed by 31 empty slots.
+ *
+ * Return Value: On success, the available slot is returned.
+ * On error, %-ENOSPC is returned.
+ */
+static int nilfs_palloc_find_available_slot_align32(struct inode *inode,
+   unsigned long group,
+   unsigned long target,
+   unsigned char *bitmap,
+   int bsize)
+{
+   u32 *end = (u32 *)bitmap + bsize / 32;
+   u32 *p = (u32 *)bitmap + target / 32;
+   int i, pos = target & ~31;
+
+   for (i = 0; i < bsize; i += 32, pos += 32, ++p) {
+   /* wrap around */
+   if (p == end) {
+   p = (u32 *)bitmap;
+   pos = 0;
+   }
+
+   if (!*p && !nilfs_set_bit_atomic(nilfs_mdt_bgl_lock(inode,
+group), pos, bitmap))
+   return pos;
+   }
+
+   return -ENOSPC;
+}
+
+/**
+ * nilfs_palloc_find_available_slot_align - find available slot in a group
+ * @inode: inode of metadata file using this allocator
+ * @group: group number
+ * @target: offset number of an entry in the group (start point)
+ * @bitmap: bitmap of the group
+ * @bsize: size in bits
+ * @block_size: size of the empty block to allocate the new entry (in bits)
+ *
+ * Description: Finds an available aligned slot in a group. It will be
+ * aligned to @block_size slots followed by @block_size - 1 empty slots.
+ * @block_size must be smaller or equal to BITS_PER_LONG.
+ *
+ * Return Value: On success, the available slot is returned.
+ * On error, %-ENOSPC is returned.
+ */
+static int nilfs_palloc_find_available_slot_align(struct inode *inode,
+ unsigned long group,
+ unsigned long target,
+ unsigned char *bitmap,
+

[PATCH 2/2] nilfs2: improve inode allocation algorithm

2014-10-12 Thread Andreas Rohner
The current inode allocation algorithm of NILFS2 does not use any
information about the previous allocation or the parent directory of the
inode. It simply searches for a free entry in the ifile, always starting
from position 0. This patch introduces an improved allocation scheme.
There are no changes to the on-disk-format necessary.

First of all the algorithm distinguishes between files and directories.

File inodes start their search at the location of their parent
directory, so that they will be allocated near their parent. If the
file inode ends up in the same block as its parent, the common task of
listing the directory contents (e.g. "ls") will be faster. Although
there is no guarantee that any subsequent blocks will end up near the
block with the directory on disk, there is still a possibility for a
performance improvement, because it can be expected that subsequent
blocks are read ahead. Also the cleaner will write out blocks in the
order of their file offset, so there is an increased likelihood that
those blocks can be read in faster.

Directory inodes are allocated aligned to block boundaries and with
a certain number of empty slots following them. The empty slots
increase the likelihood, that files can be allocated in the same block
as their parent. Furthermore the alignment improves performance, because
unaligned locations do not have to be searched. The location of the last
successful allocation of a directory is stored in memory and used as a
starting point for the next allocation. This value is periodically reset
to 0 to allow the algorithm to find previously deleted slots.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/ifile.c   | 63 -
 fs/nilfs2/ifile.h   |  6 +++--
 fs/nilfs2/inode.c   |  6 +++--
 fs/nilfs2/segment.c |  5 -
 4 files changed, 69 insertions(+), 11 deletions(-)

diff --git a/fs/nilfs2/ifile.c b/fs/nilfs2/ifile.c
index 6548c78..9fc83ce 100644
--- a/fs/nilfs2/ifile.c
+++ b/fs/nilfs2/ifile.c
@@ -33,20 +33,49 @@
  * struct nilfs_ifile_info - on-memory private data of ifile
  * @mi: on-memory private data of metadata file
  * @palloc_cache: persistent object allocator cache of ifile
+ * @last_dir: ino of the last directory allocated
  */
 struct nilfs_ifile_info {
struct nilfs_mdt_info mi;
struct nilfs_palloc_cache palloc_cache;
+   __u64 last_dir;
 };
 
 static inline struct nilfs_ifile_info *NILFS_IFILE_I(struct inode *ifile)
 {
return (struct nilfs_ifile_info *)NILFS_MDT(ifile);
 }
+/**
+ * nilfs_ifile_last_dir_reset - set last_dir to 0
+ * @ifile: ifile inode
+ *
+ * Description: The value of last_dir will increase with every new
+ * allocation of a directory, because it is used as the starting point of the
+ * search for a free entry in the ifile. It should be reset periodically to 0
+ * (e.g.: every segctor timeout), so that previously deleted entries can be
+ * found.
+ */
+void nilfs_ifile_last_dir_reset(struct inode *ifile)
+{
+   NILFS_IFILE_I(ifile)->last_dir = 0;
+}
+
+/**
+ * nilfs_ifile_inodes_per_block - get number of inodes in a block
+ * @ifile: ifile inode
+ *
+ * Return Value: The number of inodes that fit into a file system block.
+ */
+static inline int nilfs_ifile_inodes_per_block(struct inode *ifile)
+{
+   return (1 << ifile->i_blkbits) / sizeof(struct nilfs_inode);
+}
 
 /**
  * nilfs_ifile_create_inode - create a new disk inode
  * @ifile: ifile inode
+ * @parent: inode number of the parent directory
+ * @mode: inode type of the newly created inode
  * @out_ino: pointer to a variable to store inode number
  * @out_bh: buffer_head contains newly allocated disk inode
  *
@@ -62,17 +91,29 @@ static inline struct nilfs_ifile_info *NILFS_IFILE_I(struct inode *ifile)
  *
  * %-ENOSPC - No inode left.
  */
-int nilfs_ifile_create_inode(struct inode *ifile, ino_t *out_ino,
+int nilfs_ifile_create_inode(struct inode *ifile, ino_t parent,
+umode_t mode, ino_t *out_ino,
 struct buffer_head **out_bh)
 {
struct nilfs_palloc_req req;
-   int ret;
+   int ret, ipb = nilfs_ifile_inodes_per_block(ifile);
 
-   req.pr_entry_nr = 0;  /* 0 says find free inode from beginning of
-a group. dull code!! */
req.pr_entry_bh = NULL;
 
-   ret = nilfs_palloc_prepare_alloc_entry(ifile, &req);
+   if (S_ISDIR(mode)) {
+   req.pr_entry_nr = NILFS_IFILE_I(ifile)->last_dir;
+   ret = __nilfs_palloc_prepare_alloc_entry(ifile, &req, ipb,
+   nilfs_palloc_entries_per_group(ifile) >> 2);
+   if (unlikely(ret == -ENOSPC)) {
+   /* fallback to normal allocation */
+   req.pr_entry_nr = 0;
+   ret = nilfs_palloc_prepare_alloc_entry(ifile, &req);
+   }
+   } else {
+   req.pr

[PATCH 0/2] nilfs2: improve inode allocation algorithm

2014-10-12 Thread Andreas Rohner
Hi,

The algorithm simply makes sure that after a directory inode there are
a certain number of free slots available, and the search for file
inodes is started at their parent directory.

I haven't had the time yet to do a full-scale performance test of it,
but my simple preliminary tests have shown that the allocation of
inodes takes a little bit longer and the lookup is a little bit faster.
My simple test just creates 1500 directories and after that creates 10
files in each directory.

So more testing is definitely necessary, but I wanted to get some
feedback about the design first. Is my code a step in the right
direction?

Best regards,
Andreas Rohner

Andreas Rohner (2):
  nilfs2: support the allocation of whole blocks of meta data entries
  nilfs2: improve inode allocation algorithm

 fs/nilfs2/alloc.c   | 161 
 fs/nilfs2/alloc.h   |  18 +-
 fs/nilfs2/ifile.c   |  63 ++--
 fs/nilfs2/ifile.h   |   6 +-
 fs/nilfs2/inode.c   |   6 +-
 fs/nilfs2/segment.c |   5 +-
 6 files changed, 235 insertions(+), 24 deletions(-)

-- 
2.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: improve inode allocation

2014-09-24 Thread Andreas Rohner
On 2014-09-24 17:01, Ryusuke Konishi wrote:
> On Wed, 24 Sep 2014 10:01:05 +0200, Andreas Rohner wrote:
>> On 2014-09-23 18:35, Ryusuke Konishi wrote:
>>> On Tue, 23 Sep 2014 16:21:33 +0200, Andreas Rohner wrote:
>>>> On 2014-09-23 14:47, Ryusuke Konishi wrote:
>>>>> By the way, if you are interested in improving this sort of bad
>>>>> implementation, please consider improving the inode allocator that we can
>>>>> see at nilfs_ifile_create_inode().
>>>>>
>>>>> It always searches free inode from ino=0.  It doesn't use the
>>>>> knowledge of the last allocated inode number (inumber) nor any
>>>>> locality of close-knit inodes such as a file and the directory that
>>>>> contains it.
>>>>>
>>>>> A simple strategy is to start finding a free inode from (inumber of
>>>>> the parent directory) + 1, but this may not work efficiently if the
>>>>> namespace has multiple active directories, and requires that inumbers
>>>>> of directories are suitably dispersed.  On the other hand, it
>>>>> increases the number of disk read and also increases the number of
>>>>> inode blocks to be written out if inodes are allocated too discretely.
>>>>>
>>>>> The optimal strategy may differ from that of other file systems
>>>>> because inode blocks are not allocated to static places in nilfs.  For
>>>>> example, it may be better if we gather inodes of frequently accessed
>>>>> directories into the first valid inode block (on ifile) for nilfs.
>>>>
>>>> Sure I'll have a look at it, but this seems to be a hard problem.
>>>>
>>>> Since one inode has 128 bytes a typical block of 4096 contains 32
>>>> inodes. We could just allocate every directory inode into an empty block
>>>> with 31 free slots. Then any subsequent file inode allocation would
>>>> first search the 31 slots of the parent directory and if they are full,
>>>> fallback to a search starting with ino 0.
>>>
>>> We can utilize several characteristics of metadata files for this
>>> problem:
>>>
>>> - It supports read ahead feature.  when ifile reads an inode block, we
>>>   can expect that several subsequent blocks will be loaded to page
>>>   cache in the background.
>>>
>>> - B-tree of NILFS is efficient to hold sparse blocks.  This means that
>>>   putting close-knit 32 * n inodes far from offset=0 is not so bad.
>>>
>>> - ifile now can have private variables in nilfs_ifile_info (on-memory)
>>>   struct.  They are available to store context information of
>>>   allocator without compatibility issue.
>>>
>>> - We can also use nilfs_inode_info struct of directories to store
>>>   directory-based context of allocator without losing compatibility.
>>>
>>> - Only caller of nilfs_ifile_create_inode() is nilfs_new_inode(), and
>>>   this function knows the inode of the parent directory.
>>
>> Then the only problem is how to efficiently allocate the directories. We
>> could do something similar to the Orlov allocator used by the ext2/3/4
>> file systems:
>>
>> 1. We spread first level directories. Every one gets a full bitmap
>>block (or half a bitmap block)
>> 2. For the other directories we will try to choose the bitmap block of
>>the parent unless the number of free inodes is below a certain
>>threshold. Within this bitmap block the directories should also
>>spread out.
> 
> In my understanding, the basic strategy of the Orlov allocator is to
> physically spread out subtrees over cylinder groups.  This strategy is
> effective for ext2/ext3/ext4 to mitigate overheads which come from
> disk seeks.  The strategy increases the locality of data and metadata
> and that of a parent directory and its child nodes, but the same
> thing isn't always true for nilfs because real block allocation of
> ifile and other files including directories is virtualized and doesn't
> reflect underlying physics (e.g. relation between LBA and seek
> time) as is.
> 
> I think the strategy 1 above doesn't make sense unlike ext2/3/4.

I know that it is a sparse file and the blocks can end up anywhere on
disk, independent of the offset in the ifile. I just thought it may be a
good idea to give top level directories more room to grow. But you are
probably right and it makes no sense for nilfs...

>> File inodes will just start a linear search at the parent's inode if
>&

Re: improve inode allocation (was Re: [PATCH v2] nilfs2: improve the performance of fdatasync())

2014-09-24 Thread Andreas Rohner
On 2014-09-23 18:35, Ryusuke Konishi wrote:
> On Tue, 23 Sep 2014 16:21:33 +0200, Andreas Rohner wrote:
>> On 2014-09-23 14:47, Ryusuke Konishi wrote:
>>> By the way, if you are interested in improving this sort of bad
>>> implementation, please consider improving the inode allocator that we can
>>> see at nilfs_ifile_create_inode().
>>>
>>> It always searches free inode from ino=0.  It doesn't use the
>>> knowledge of the last allocated inode number (inumber) nor any
>>> locality of close-knit inodes such as a file and the directory that
>>> contains it.
>>>
>>> A simple strategy is to start finding a free inode from (inumber of
>>> the parent directory) + 1, but this may not work efficiently if the
>>> namespace has multiple active directories, and requires that inumbers
>>> of directories are suitably dispersed.  On the other hand, it
>>> increases the number of disk read and also increases the number of
>>> inode blocks to be written out if inodes are allocated too discretely.
>>>
>>> The optimal strategy may differ from that of other file systems
>>> because inode blocks are not allocated to static places in nilfs.  For
>>> example, it may be better if we gather inodes of frequently accessed
>>> directories into the first valid inode block (on ifile) for nilfs.
>>
>> Sure I'll have a look at it, but this seems to be a hard problem.
>>
>> Since one inode has 128 bytes a typical block of 4096 contains 32
>> inodes. We could just allocate every directory inode into an empty block
>> with 31 free slots. Then any subsequent file inode allocation would
>> first search the 31 slots of the parent directory and if they are full,
>> fallback to a search starting with ino 0.
> 
> We can utilize several characteristics of metadata files for this
> problem:
> 
> - It supports read ahead feature.  when ifile reads an inode block, we
>   can expect that several subsequent blocks will be loaded to page
>   cache in the background.
> 
> - B-tree of NILFS is efficient to hold sparse blocks.  This means that
>   putting close-knit 32 * n inodes far from offset=0 is not so bad.
> 
> - ifile now can have private variables in nilfs_ifile_info (on-memory)
>   struct.  They are available to store context information of
>   allocator without compatibility issue.
> 
> - We can also use nilfs_inode_info struct of directories to store
>   directory-based context of allocator without losing compatibility.
> 
> - Only caller of nilfs_ifile_create_inode() is nilfs_new_inode(), and
>   this function knows the inode of the parent directory.

Then the only problem is how to efficiently allocate the directories. We
could do something similar to the Orlov allocator used by the ext2/3/4
file systems:

1. We spread first level directories. Every one gets a full bitmap
   block (or half a bitmap block)
2. For the other directories we will try to choose the bitmap block of
   the parent unless the number of free inodes is below a certain
   threshold. Within this bitmap block the directories should also
   spread out.

File inodes will just start a linear search at the parent's inode if
there is enough space left in the bitmap.
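
A rough pseudo-C sketch of that decision (all helper names here are
made up just to illustrate the idea, this is not a proposed
implementation):

/* choose where the search for a free ifile slot should start */
static ino_t choose_start_ino(struct inode *dir, umode_t mode)
{
	if (S_ISDIR(mode)) {
		/* 1.: spread first-level directories over their own bitmap blocks */
		if (is_first_level_dir(dir))
			return first_free_bitmap_block();
		/* 2.: stay in the parent's bitmap block while it has enough room */
		if (free_inodes_near(dir->i_ino) >= threshold)
			return dir->i_ino;
		return first_free_bitmap_block();
	}
	/* files: linear search starting at the parent's inode number */
	return dir->i_ino;
}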

>> This way if a directory has less than 32 files, all its inodes can be
>> read in with one single block. If a directory has more than 32 files its
>> inodes will spill over into the slots of other directories.
>>
>> But I am not sure if this strategy would pay off.
> 
> Yes, for small namespaces, the current implementation may be enough.
> We should first decide how we evaluate the effect of the algorithm.
> It may be the scalability of namespace.

It will be very difficult to measure the time accurately. I would
suggest simply counting the number of reads and writes on the device.
This can be easily done:

mkfs.nilfs2 /dev/sdb

cat /proc/diskstats > rw_before.txt

do_tests

extract_kernel_sources

...

find /mnt

cat /proc/diskstats > rw_after.txt

The algorithm with fewer writes and reads wins.

I am still not convinced that all of this will pay off, but I will try a
few things and see if it works.

br,
Andreas Rohner

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] nilfs2: improve the performance of fdatasync()

2014-09-23 Thread Andreas Rohner
On 2014-09-23 14:47, Ryusuke Konishi wrote:
> On Tue, 23 Sep 2014 14:17:05 +0200, Andreas Rohner wrote:
>> On 2014-09-23 12:50, Ryusuke Konishi wrote:
>>> On Tue, 23 Sep 2014 10:46:58 +0200, Andreas Rohner wrote:
>>>> Support for fdatasync() has been implemented in NILFS2 for a long time,
>>>> but whenever the corresponding inode is dirty the implementation falls
>>>> back to a full-fledged sync(). Since every write operation has to update
>>>> the modification time of the file, the inode will almost always be dirty
>>>> and fdatasync() will fall back to sync() most of the time. But this
>>>> fallback is only necessary for a change of the file size and not for
>>>> a change of the various timestamps.
>>>>
>>>> This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
>>>> those two situations.
>>>>
>>>>  * If it is set the file size was changed and a full sync is necessary.
>>>>  * If it is not set then only the timestamps were updated and
>>>>fdatasync() can go ahead.
>>>>
>>>> There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
>>>> the exact same semantics. Unfortunately it cannot be used directly,
>>>> because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
>>>> flags when inodes are written out. So the VFS writeback thread can
>>>> clear I_DIRTY_DATASYNC at any time without notifying NILFS2. So
>>>> I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
>>>> nilfs_update_inode().
>>>>
>>>> Signed-off-by: Andreas Rohner 
>>>
>>> I now sent this to Andrew.
>>>
>>> The datasync segments that this patch creates more frequently, will
>>> cause rollforward recovery after a crash or a power failure.
>>>
>>> So, please test also that the recovery works properly for fdatasync()
>>> and reset.  The situation can be simulated, for example, by using
>>> "reboot -nfh":
>>>
>>>  # dd if=/dev/zero of=/nilfs/test bs=4k count=1 seek=
>>>  # dd if=/dev/urandom of=/nilfs/test bs=8k count=1 seek=50 
>>> conv=fdatasync,notrunc,nocreat
>>>  # reboot -nfh
>>>
>>> We can use dumpseg command to confirm that the datasync segment is
>>> actually made or how recovery has done after mount.
>>
>> I tested it using your script, but I duplicated the second line twice
>> with different values for seek and added an md5sum at the end. So in
>> total 6 blocks were written with fdatasync().
>>
>> The checksum before the reboot was: 66500bd6c7a1f89ed860cd7203f5c6e8
>>
>> The last lines of the output of dumpseg after reboot:
>>
>>   partial segment: blocknr = 26, nblocks = 3
>> creation time = 2014-09-23 12:02:56
>> nfinfo = 1
>> finfo
>>   ino = 12, cno = 3, nblocks = 2, ndatblk = 2
>> vblocknr = 16385, blkoff = 100, blocknr = 27
>> vblocknr = 16386, blkoff = 101, blocknr = 28
>>   partial segment: blocknr = 29, nblocks = 3
>> creation time = 2014-09-23 12:02:56
>> nfinfo = 1
>> finfo
>>   ino = 12, cno = 3, nblocks = 2, ndatblk = 2
>> vblocknr = 16387, blkoff = 120, blocknr = 30
>> vblocknr = 16389, blkoff = 121, blocknr = 31
>>   partial segment: blocknr = 32, nblocks = 3
>> creation time = 2014-09-23 12:02:56
>> nfinfo = 1
>> finfo
>>   ino = 12, cno = 3, nblocks = 2, ndatblk = 2
>> vblocknr = 16390, blkoff = 140, blocknr = 33
>> vblocknr = 16391, blkoff = 141, blocknr = 34
>>
>> The output of dmesg for the rollforward:
>>
>> [  110.701337] NILFS warning: mounting unchecked fs
>> [  110.833196] NILFS (device sdb): salvaged 6 blocks
>> [  110.837311] segctord starting. Construction interval = 5 seconds, CP
>> frequency < 30 seconds
>> [  110.878959] NILFS: recovery complete.
>> [  110.882674] segctord starting. Construction interval = 5 seconds, CP
>> frequency < 30 seconds
>>
>> The checksum after rollforward: 66500bd6c7a1f89ed860cd7203f5c6e8
>>
>> Works like a charm :)
> 
> Thank you, it looks perfect so far.

You're welcome :)

> 
> By the way, if you are interested in improving this sort of bad
> implementation, please consider improving the inode allocator that we can
> see at nilfs_ifile_create_inode().
> 
> It always searches free inode from ino=0.  It doesn't use the
> knowledge of the last allocated inode

Re: [PATCH v2] nilfs2: improve the performance of fdatasync()

2014-09-23 Thread Andreas Rohner
On 2014-09-23 12:50, Ryusuke Konishi wrote:
> On Tue, 23 Sep 2014 10:46:58 +0200, Andreas Rohner wrote:
>> Support for fdatasync() has been implemented in NILFS2 for a long time,
>> but whenever the corresponding inode is dirty the implementation falls
>> back to a full-fledged sync(). Since every write operation has to update
>> the modification time of the file, the inode will almost always be dirty
>> and fdatasync() will fall back to sync() most of the time. But this
>> fallback is only necessary for a change of the file size and not for
>> a change of the various timestamps.
>>
>> This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
>> those two situations.
>>
>>  * If it is set the file size was changed and a full sync is necessary.
>>  * If it is not set then only the timestamps were updated and
>>fdatasync() can go ahead.
>>
>> There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
>> the exact same semantics. Unfortunately it cannot be used directly,
>> because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
>> flags when inodes are written out. So the VFS writeback thread can
>> clear I_DIRTY_DATASYNC at any time without notifying NILFS2. So
>> I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
>> nilfs_update_inode().
>>
>> Signed-off-by: Andreas Rohner 
> 
> I now sent this to Andrew.
> 
> The datasync segments that this patch creates more frequently, will
> cause rollforward recovery after a crash or a power failure.
> 
> So, please test also that the recovery works properly for fdatasync()
> and reset.  The situation can be simulated, for example, by using
> "reboot -nfh":
> 
>  # dd if=/dev/zero of=/nilfs/test bs=4k count=1 seek=
>  # dd if=/dev/urandom of=/nilfs/test bs=8k count=1 seek=50 
> conv=fdatasync,notrunc,nocreat
>  # reboot -nfh
> 
> We can use dumpseg command to confirm that the datasync segment is
> actually made or how recovery has done after mount.

I tested it using your script, but I duplicated the second line twice
with different values for seek and added an md5sum at the end. So in
total 6 blocks were written with fdatasync().

The checksum before the reboot was: 66500bd6c7a1f89ed860cd7203f5c6e8

The last lines of the output of dumpseg after reboot:

  partial segment: blocknr = 26, nblocks = 3
creation time = 2014-09-23 12:02:56
nfinfo = 1
finfo
  ino = 12, cno = 3, nblocks = 2, ndatblk = 2
vblocknr = 16385, blkoff = 100, blocknr = 27
vblocknr = 16386, blkoff = 101, blocknr = 28
  partial segment: blocknr = 29, nblocks = 3
creation time = 2014-09-23 12:02:56
nfinfo = 1
finfo
  ino = 12, cno = 3, nblocks = 2, ndatblk = 2
vblocknr = 16387, blkoff = 120, blocknr = 30
vblocknr = 16389, blkoff = 121, blocknr = 31
  partial segment: blocknr = 32, nblocks = 3
creation time = 2014-09-23 12:02:56
nfinfo = 1
finfo
  ino = 12, cno = 3, nblocks = 2, ndatblk = 2
vblocknr = 16390, blkoff = 140, blocknr = 33
vblocknr = 16391, blkoff = 141, blocknr = 34

The output of dmesg for the rollforward:

[  110.701337] NILFS warning: mounting unchecked fs
[  110.833196] NILFS (device sdb): salvaged 6 blocks
[  110.837311] segctord starting. Construction interval = 5 seconds, CP
frequency < 30 seconds
[  110.878959] NILFS: recovery complete.
[  110.882674] segctord starting. Construction interval = 5 seconds, CP
frequency < 30 seconds

The checksum after rollforward: 66500bd6c7a1f89ed860cd7203f5c6e8

Works like a charm :)

br,
Andreas Rohner

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] nilfs2: improve the performance of fdatasync()

2014-09-23 Thread Andreas Rohner
On 2014-09-23 12:50, Ryusuke Konishi wrote:
> On Tue, 23 Sep 2014 10:46:58 +0200, Andreas Rohner wrote:
>> Support for fdatasync() has been implemented in NILFS2 for a long time,
>> but whenever the corresponding inode is dirty the implementation falls
>> back to a full-fledged sync(). Since every write operation has to update
>> the modification time of the file, the inode will almost always be dirty
>> and fdatasync() will fall back to sync() most of the time. But this
>> fallback is only necessary for a change of the file size and not for
>> a change of the various timestamps.
>>
>> This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
>> those two situations.
>>
>>  * If it is set the file size was changed and a full sync is necessary.
>>  * If it is not set then only the timestamps were updated and
>>fdatasync() can go ahead.
>>
>> There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
>> the exact same semantics. Unfortunately it cannot be used directly,
>> because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
>> flags when inodes are written out. So the VFS writeback thread can
>> clear I_DIRTY_DATASYNC at any time without notifying NILFS2. So
>> I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
>> nilfs_update_inode().
>>
>> Signed-off-by: Andreas Rohner 
> 
> I now sent this to Andrew.
> 
> The datasync segments that this patch creates more frequently, will
> cause rollforward recovery after a crash or a power failure.
> 
> So, please test also that the recovery works properly for fdatasync()
> and reset.  The situation can be simulated, for example, by using
> "reboot -nfh":
> 
>  # dd if=/dev/zero of=/nilfs/test bs=4k count=1 seek=
>  # dd if=/dev/urandom of=/nilfs/test bs=8k count=1 seek=50 
> conv=fdatasync,notrunc,nocreat
>  # reboot -nfh

Very nice script to test this!

> We can use dumpseg command to confirm that the datasync segment is
> actually made or how recovery has done after mount.

I already tested this before I sent in my patch. I used a virtual
machine and just killed the process after the fdatasync(). After the
rollforward NILFS reports the correct number of blocks salvaged and the
md5sum of the file is correct.

I will test it again with dumpseg the way you suggested.

br,
Andreas Rohner


[PATCH v2] nilfs2: improve the performance of fdatasync()

2014-09-23 Thread Andreas Rohner
Support for fdatasync() has been implemented in NILFS2 for a long time,
but whenever the corresponding inode is dirty the implementation falls
back to a full-fledged sync(). Since every write operation has to update
the modification time of the file, the inode will almost always be dirty
and fdatasync() will fall back to sync() most of the time. But this
fallback is only necessary for a change of the file size and not for
a change of the various timestamps.

This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
those two situations.

 * If it is set the file size was changed and a full sync is necessary.
 * If it is not set then only the timestamps were updated and
   fdatasync() can go ahead.

There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
the exact same semantics. Unfortunately it cannot be used directly,
because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
flags when inodes are written out. So the VFS writeback thread can
clear I_DIRTY_DATASYNC at any time without notifying NILFS2. So
I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
nilfs_update_inode().

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/inode.c   | 13 +++--
 fs/nilfs2/nilfs.h   | 14 +++---
 fs/nilfs2/segment.c |  4 ++--
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 6252b17..67f2112 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -125,7 +125,7 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff,
nilfs_transaction_abort(inode->i_sb);
goto out;
}
-   nilfs_mark_inode_dirty(inode);
+   nilfs_mark_inode_dirty_sync(inode);
nilfs_transaction_commit(inode->i_sb); /* never fails */
/* Error handling should be detailed */
set_buffer_new(bh_result);
@@ -667,7 +667,7 @@ void nilfs_write_inode_common(struct inode *inode,
   for substitutions of appended fields */
 }
 
-void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh)
+void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh, int 
flags)
 {
ino_t ino = inode->i_ino;
struct nilfs_inode_info *ii = NILFS_I(inode);
@@ -678,7 +678,8 @@ void nilfs_update_inode(struct inode *inode, struct 
buffer_head *ibh)
 
if (test_and_clear_bit(NILFS_I_NEW, &ii->i_state))
memset(raw_inode, 0, NILFS_MDT(ifile)->mi_entry_size);
-   set_bit(NILFS_I_INODE_DIRTY, &ii->i_state);
+   if (flags & I_DIRTY_DATASYNC)
+   set_bit(NILFS_I_INODE_SYNC, &ii->i_state);
 
nilfs_write_inode_common(inode, raw_inode, 0);
/* XXX: call with has_bmap = 0 is a workaround to avoid
@@ -934,7 +935,7 @@ int nilfs_set_file_dirty(struct inode *inode, unsigned 
nr_dirty)
return 0;
 }
 
-int nilfs_mark_inode_dirty(struct inode *inode)
+int __nilfs_mark_inode_dirty(struct inode *inode, int flags)
 {
struct buffer_head *ibh;
int err;
@@ -945,7 +946,7 @@ int nilfs_mark_inode_dirty(struct inode *inode)
  "failed to reget inode block.\n");
return err;
}
-   nilfs_update_inode(inode, ibh);
+   nilfs_update_inode(inode, ibh, flags);
mark_buffer_dirty(ibh);
nilfs_mdt_mark_dirty(NILFS_I(inode)->i_root->ifile);
brelse(ibh);
@@ -978,7 +979,7 @@ void nilfs_dirty_inode(struct inode *inode, int flags)
return;
}
nilfs_transaction_begin(inode->i_sb, &ti, 0);
-   nilfs_mark_inode_dirty(inode);
+   __nilfs_mark_inode_dirty(inode, flags);
nilfs_transaction_commit(inode->i_sb); /* never fails */
 }
 
diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h
index 0696161..91093cd 100644
--- a/fs/nilfs2/nilfs.h
+++ b/fs/nilfs2/nilfs.h
@@ -104,7 +104,7 @@ enum {
   constructor */
NILFS_I_COLLECTED,  /* All dirty blocks are collected */
NILFS_I_UPDATED,/* The file has been written back */
-   NILFS_I_INODE_DIRTY,/* write_inode is requested */
+   NILFS_I_INODE_SYNC, /* dsync is not allowed for inode */
NILFS_I_BMAP,   /* has bmap and btnode_cache */
NILFS_I_GCINODE,/* inode for GC, on memory only */
 };
@@ -273,7 +273,7 @@ struct inode *nilfs_iget(struct super_block *sb, struct 
nilfs_root *root,
 unsigned long ino);
 extern struct inode *nilfs_iget_for_gc(struct super_block *sb,
   unsigned long ino, __u64 cno);
-extern void nilfs_update_inode(struct inode *, struct buffer_head *);
+extern void nilfs_update_inode(struct inode *, struct buffer_head *, int);
 extern void nilfs_truncate(struct inode *);
 extern void n

Re: [PATCH] nilfs2: improve the performance of fdatasync()

2014-09-22 Thread Andreas Rohner
On 2014-09-23 07:09, Ryusuke Konishi wrote:
> Hi Andreas,
> On Mon, 22 Sep 2014 18:20:27 +0200, Andreas Rohner wrote:
>> Support for fdatasync() has been implemented in NILFS2 for a long time,
>> but whenever the corresponding inode is dirty the implementation falls
>> back to a full-fledged sync(). Since every write operation has to update
>> the modification time of the file, the inode will almost always be dirty
>> and fdatasync() will fall back to sync() most of the time. But this
>> fallback is only necessary for a change of the file size and not for
>> a change of the various timestamps.
>>
>> This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
>> those two situations.
>>
>>  * If it is set the file size was changed and a full sync is necessary.
>>  * If it is not set then only the timestamps were updated and
>>fdatasync() can go ahead.
>>
>> There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
>> the exact same semantics. Unfortunately it cannot be used directly,
>> because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
>> flags when inodes are written out. So the VFS writeback thread can
>> clear I_DIRTY_DATASYNC at any time without notifying NILFS2. So
>> I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
>> nilfs_update_inode().
>>
>> Signed-off-by: Andreas Rohner 
> 
> I looked into the patch. 
> 
> Very nice. This is what we should have done several years ago.
> 
> When this patch is applied, NILFS_I_INODE_DIRTY flag will be no longer
> required.  Can you remove it at the same time?

Ah yes, of course. I just assumed that NILFS_I_INODE_DIRTY was needed for
something else and never actually checked it. In that case, don't you
think that NILFS_I_INODE_DIRTY would be a better name for the flag than
NILFS_I_INODE_SYNC?

The SYNC suffix can be a bit confusing, especially because I used it in the
helper functions, where it means exactly the opposite:

static inline int nilfs_mark_inode_dirty(struct inode *inode)
static inline int nilfs_mark_inode_dirty_sync(struct inode *inode)

I did that to match the corresponding names of the VFS functions:

static inline void mark_inode_dirty(struct inode *inode)
static inline void mark_inode_dirty_sync(struct inode *inode)

So there is a bit of a conflict in names. What do you think?
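
For context, what I have in mind for the two wrappers is roughly the
following (only a sketch; the flag values passed down are an assumption,
chosen to mirror the corresponding VFS helpers):

static inline int nilfs_mark_inode_dirty(struct inode *inode)
{
        /* full dirtying: a later fdatasync() must fall back to a full sync */
        return __nilfs_mark_inode_dirty(inode, I_DIRTY);
}

static inline int nilfs_mark_inode_dirty_sync(struct inode *inode)
{
        /* timestamp-only dirtying: fdatasync() may skip writing the inode */
        return __nilfs_mark_inode_dirty(inode, I_DIRTY_SYNC);
}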

br,
Andreas Rohner

> Thanks,
> Ryusuke Konishi
> 
>> ---
>>  fs/nilfs2/inode.c   | 12 +++-
>>  fs/nilfs2/nilfs.h   | 13 +++--
>>  fs/nilfs2/segment.c |  3 ++-
>>  3 files changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
>> index 6252b17..2f67153 100644
>> --- a/fs/nilfs2/inode.c
>> +++ b/fs/nilfs2/inode.c
>> @@ -125,7 +125,7 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff,
>>  nilfs_transaction_abort(inode->i_sb);
>>  goto out;
>>  }
>> -nilfs_mark_inode_dirty(inode);
>> +nilfs_mark_inode_dirty_sync(inode);
>>  nilfs_transaction_commit(inode->i_sb); /* never fails */
>>  /* Error handling should be detailed */
>>  set_buffer_new(bh_result);
>> @@ -667,7 +667,7 @@ void nilfs_write_inode_common(struct inode *inode,
>> for substitutions of appended fields */
>>  }
>>  
>> -void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh)
>> +void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh, int 
>> flags)
>>  {
>>  ino_t ino = inode->i_ino;
>>  struct nilfs_inode_info *ii = NILFS_I(inode);
>> @@ -679,6 +679,8 @@ void nilfs_update_inode(struct inode *inode, struct 
>> buffer_head *ibh)
>>  if (test_and_clear_bit(NILFS_I_NEW, &ii->i_state))
>>  memset(raw_inode, 0, NILFS_MDT(ifile)->mi_entry_size);
>>  set_bit(NILFS_I_INODE_DIRTY, &ii->i_state);
>> +if (flags & I_DIRTY_DATASYNC)
>> +set_bit(NILFS_I_INODE_SYNC, &ii->i_state);
>>  
>>  nilfs_write_inode_common(inode, raw_inode, 0);
>>  /* XXX: call with has_bmap = 0 is a workaround to avoid
>> @@ -934,7 +936,7 @@ int nilfs_set_file_dirty(struct inode *inode, unsigned 
>> nr_dirty)
>>  return 0;
>>  }
>>  
>> -int nilfs_mark_inode_dirty(struct inode *inode)
>> +int __nilfs_mark_inode_dirty(struct inode *inode, int flags)
>>  {
>>  struct buffer_head *ibh;
>>  int err;
>> @@ -945,7 +947,7 @@ int nilfs_mark_inode_dirty(struct inode *inode)
>>  

[PATCH] nilfs2: improve the performance of fdatasync()

2014-09-22 Thread Andreas Rohner
Support for fdatasync() has been implemented in NILFS2 for a long time,
but whenever the corresponding inode is dirty the implementation falls
back to a full-fledged sync(). Since every write operation has to update
the modification time of the file, the inode will almost always be dirty
and fdatasync() will fall back to sync() most of the time. But this
fallback is only necessary for a change of the file size and not for
a change of the various timestamps.

This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
those two situations.

 * If it is set the file size was changed and a full sync is necessary.
 * If it is not set then only the timestamps were updated and
   fdatasync() can go ahead.

There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
the exact same semantics. Unfortunately it cannot be used directly,
because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
flags when inodes are written out. So the VFS writeback thread can
clear I_DIRTY_DATASYNC at any time without notifying NILFS2. So
I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
nilfs_update_inode().

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/inode.c   | 12 +++-
 fs/nilfs2/nilfs.h   | 13 +++--
 fs/nilfs2/segment.c |  3 ++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 6252b17..2f67153 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -125,7 +125,7 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff,
nilfs_transaction_abort(inode->i_sb);
goto out;
}
-   nilfs_mark_inode_dirty(inode);
+   nilfs_mark_inode_dirty_sync(inode);
nilfs_transaction_commit(inode->i_sb); /* never fails */
/* Error handling should be detailed */
set_buffer_new(bh_result);
@@ -667,7 +667,7 @@ void nilfs_write_inode_common(struct inode *inode,
   for substitutions of appended fields */
 }
 
-void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh)
+void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh, int 
flags)
 {
ino_t ino = inode->i_ino;
struct nilfs_inode_info *ii = NILFS_I(inode);
@@ -679,6 +679,8 @@ void nilfs_update_inode(struct inode *inode, struct 
buffer_head *ibh)
if (test_and_clear_bit(NILFS_I_NEW, &ii->i_state))
memset(raw_inode, 0, NILFS_MDT(ifile)->mi_entry_size);
set_bit(NILFS_I_INODE_DIRTY, &ii->i_state);
+   if (flags & I_DIRTY_DATASYNC)
+   set_bit(NILFS_I_INODE_SYNC, &ii->i_state);
 
nilfs_write_inode_common(inode, raw_inode, 0);
/* XXX: call with has_bmap = 0 is a workaround to avoid
@@ -934,7 +936,7 @@ int nilfs_set_file_dirty(struct inode *inode, unsigned 
nr_dirty)
return 0;
 }
 
-int nilfs_mark_inode_dirty(struct inode *inode)
+int __nilfs_mark_inode_dirty(struct inode *inode, int flags)
 {
struct buffer_head *ibh;
int err;
@@ -945,7 +947,7 @@ int nilfs_mark_inode_dirty(struct inode *inode)
  "failed to reget inode block.\n");
return err;
}
-   nilfs_update_inode(inode, ibh);
+   nilfs_update_inode(inode, ibh, flags);
mark_buffer_dirty(ibh);
nilfs_mdt_mark_dirty(NILFS_I(inode)->i_root->ifile);
brelse(ibh);
@@ -978,7 +980,7 @@ void nilfs_dirty_inode(struct inode *inode, int flags)
return;
}
nilfs_transaction_begin(inode->i_sb, &ti, 0);
-   nilfs_mark_inode_dirty(inode);
+   __nilfs_mark_inode_dirty(inode, flags);
nilfs_transaction_commit(inode->i_sb); /* never fails */
 }
 
diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h
index 0696161..30573d7 100644
--- a/fs/nilfs2/nilfs.h
+++ b/fs/nilfs2/nilfs.h
@@ -107,6 +107,7 @@ enum {
NILFS_I_INODE_DIRTY,/* write_inode is requested */
NILFS_I_BMAP,   /* has bmap and btnode_cache */
NILFS_I_GCINODE,/* inode for GC, on memory only */
+   NILFS_I_INODE_SYNC, /* dsync is not allowed for inode */
 };
 
 /*
@@ -273,7 +274,7 @@ struct inode *nilfs_iget(struct super_block *sb, struct 
nilfs_root *root,
 unsigned long ino);
 extern struct inode *nilfs_iget_for_gc(struct super_block *sb,
   unsigned long ino, __u64 cno);
-extern void nilfs_update_inode(struct inode *, struct buffer_head *);
+extern void nilfs_update_inode(struct inode *, struct buffer_head *, int);
 extern void nilfs_truncate(struct inode *);
 extern void nilfs_evict_inode(struct inode *);
 extern int nilfs_setattr(struct dentry *, struct iattr *);
@@ -282,10 +283,18 @@ int nilfs_permission(struct inode *inode, int mask);
 int nilfs_load_inode_block(struct inod

[PATCH v2] nilfs2: fix data loss with mmap()

2014-09-18 Thread Andreas Rohner
This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system. It is easily
reproducible with the following script:

[BEGIN SCRIPT]
mkfs.nilfs2 -f /dev/sdb
mount /dev/sdb /mnt

dd if=/dev/zero bs=1M count=30 of=/mnt/testfile

umount /mnt
mount /dev/sdb /mnt
CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"

/root/mmaptest/mmaptest /mnt/testfile 30 10 5

sync
CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
umount /mnt
mount /dev/sdb /mnt
CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
umount /mnt

echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
echo "AFTER MMAP:\t$CHECKSUM_AFTER"
echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
[END SCRIPT]

The mmaptest tool looks something like this (very simplified, with
error checking removed):

[BEGIN mmaptest]
data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, file_offset);

for (i = 0; i < write_count; ++i) {
memcpy(data + i * 4096, buf, sizeof(buf));
msync(data, file_size - file_offset, MS_SYNC);
}
[END mmaptest]
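
A more complete, self-contained version of such a tool might look like
this (only a sketch with minimal error handling; the argument order
file/size-in-MB/write-count/offset-in-MB is an assumption based on the
invocation in the script above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
        size_t file_size, file_offset, write_count, i;
        unsigned char buf[4096];
        unsigned char *data;
        int fd;

        if (argc != 5) {
                fprintf(stderr, "usage: %s FILE SIZE_MB COUNT OFFSET_MB\n", argv[0]);
                return 1;
        }
        file_size = (size_t)atol(argv[2]) * 1024 * 1024;
        write_count = (size_t)atol(argv[3]);
        file_offset = (size_t)atol(argv[4]) * 1024 * 1024;

        /* one page of pseudo-random data to copy into the mapping */
        srand(getpid());
        for (i = 0; i < sizeof(buf); ++i)
                buf[i] = rand() & 0xff;

        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, file_offset);
        if (data == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* dirty one page per iteration and force it out with msync() */
        for (i = 0; i < write_count; ++i) {
                memcpy(data + i * 4096, buf, sizeof(buf));
                if (msync(data, file_size - file_offset, MS_SYNC))
                        perror("msync");
        }

        munmap(data, file_size - file_offset);
        close(fd);
        return 0;
}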

The output of the script looks something like this:

BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile

So it is clear that the changes done using mmap() do not survive a
remount. This can be reproduced 100% of the time. The problem was
introduced with the following commit:

136e877 nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF
boundary

If the page was read with mpage_readpage() or mpage_readpages(), for
example, then it has no buffers attached to it. In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.

This patch fixes the problem by also calling nilfs_set_file_dirty() if
the page has no buffers attached to it.

Signed-off-by: Andreas Rohner 
Tested-by: Andreas Rohner 
---
 fs/nilfs2/inode.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 6252b17..6b0a241 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -219,10 +219,10 @@ static int nilfs_writepage(struct page *page, struct 
writeback_control *wbc)
 
 static int nilfs_set_page_dirty(struct page *page)
 {
+   struct inode *inode = page->mapping->host;
int ret = __set_page_dirty_nobuffers(page);
 
if (page_has_buffers(page)) {
-   struct inode *inode = page->mapping->host;
unsigned nr_dirty = 0;
struct buffer_head *bh, *head;
 
@@ -245,6 +245,10 @@ static int nilfs_set_page_dirty(struct page *page)
 
if (nr_dirty)
nilfs_set_file_dirty(inode, nr_dirty);
+   } else if (ret) {
+   unsigned nr_dirty = 1 << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+
+   nilfs_set_file_dirty(inode, nr_dirty);
}
return ret;
 }
-- 
2.1.0



Re: [PATCH] nilfs2: fix data loss with mmap()

2014-09-18 Thread Andreas Rohner
On 2014-09-18 09:24, Ryusuke Konishi wrote:
> On Wed, 17 Sep 2014 21:34:39 +0900 (JST), Ryusuke Konishi wrote:
>> On Wed, 17 Sep 2014 10:16:46 +0200, Andreas Rohner wrote:
>>> On 2014-09-16 15:57, Ryusuke Konishi wrote:
>>>> On Tue, 16 Sep 2014 10:38:29 +0200, Andreas Rohner wrote:
>>>>>> I'd appreciate your help on testing the patch for some old kernels.
>>>>>> (And, please declare a "Tested-by" tag in the reply mail, if the test
>>>>>> is ok).
>>>>>
>>>>> Sure I have everything set up. Which kernels do I have to test? Was
>>>>> commit 136e877 backported? I presume at least stable and some of the
>>>>> longterm kernels on https://www.kernel.org/...
>>>>
>>>> The commit 136e877 was merged to v3.10 and backported to stable trees
>>>> of earlier kernels.  But, most of earlier stable trees are no longer
>>>> maintained.  Well maintained trees are the following longterm kernels:
>>>>
>>>> - 3.4.y  (backported commit 136e877)
>>>> - 3.10.y
>>>> - 3.14.y
>>>>
>>>> I think these three kernels are worth testing.
>>>
>>> I tested it on all stable kernels including 3.4.x, 3.10.x, 3.14.x. The
>>> bug is present in all of them and the patch fixes it. The patch also
>>> applies cleanly on all kernels. I sent it again yesterday, and added the
>>> Tested-by: tag.
> 
> One thing I have a question.
> 
> Is the original issue that commit 136e877 fixed still OK?  If you
> haven't tested it, I would appreciate it if you could examine the test
> for the prior issue.

Yes, the original issue is still fixed. I was able to reproduce it by
reverting nilfs_set_page_dirty() to the state prior to commit 136e877.
Then I used the new version of nilfs_set_page_dirty() including my patch
and could no longer reproduce the issue. So it seems to still be fixed.

Furthermore the original issue was caused by the use of
__set_page_dirty_buffers(), which marked unmapped buffers as dirty. My
patch does not change the fix for that.

br,
Andreas Rohner


Re: [PATCH] nilfs2: fix data loss with mmap()

2014-09-17 Thread Andreas Rohner
On 2014-09-16 15:57, Ryusuke Konishi wrote:
> On Tue, 16 Sep 2014 10:38:29 +0200, Andreas Rohner wrote:
>>> I'd appreciate your help on testing the patch for some old kernels.
>>> (And, please declare a "Tested-by" tag in the reply mail, if the test
>>> is ok).
>>
>> Sure I have everything set up. Which kernels do I have to test? Was
>> commit 136e877 backported? I presume at least stable and some of the
>> longterm kernels on https://www.kernel.org/...
> 
> The commit 136e877 was merged to v3.10 and backported to stable trees
> of earlier kernels.  But, most of earlier stable trees are no longer
> maintained.  Well maintained trees are the following longterm kernels:
> 
> - 3.4.y  (backported commit 136e877)
> - 3.10.y
> - 3.14.y
> 
> I think these three kernels are worth testing.

I tested it on all stable kernels including 3.4.x, 3.10.x, 3.14.x. The
bug is present in all of them and the patch fixes it. The patch also
applies cleanly on all kernels. I sent it again yesterday, and added the
Tested-by: tag.

br,
Andreas Rohner


[PATCH] nilfs2: fix data loss with mmap()

2014-09-16 Thread Andreas Rohner
This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system. It is easily
reproducible with the following script:

[BEGIN SCRIPT]
mkfs.nilfs2 -f /dev/sdb
mount /dev/sdb /mnt

# create 30MB testfile
dd if=/dev/zero bs=1M count=30 of=/mnt/testfile

umount /mnt
mount /dev/sdb /mnt
CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"

# simple tool that opens /mnt/testfile and
# writes a few blocks using mmap at a 5MB offset
/root/mmaptest/mmaptest /mnt/testfile 30 10 5

sync
CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
umount /mnt
mount /dev/sdb /mnt
CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
umount /mnt

echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
echo "AFTER MMAP:\t$CHECKSUM_AFTER"
echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
[END SCRIPT]

The mmaptest tool looks something like this (very simplified, with
error checking removed):

[BEGIN mmaptest]
data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, file_offset);

for (i = 0; i < write_count; ++i) {
memcpy(data + i * 4096, buf, sizeof(buf));
msync(data, file_size - file_offset, MS_SYNC);
}
[END mmaptest]

The output of the script looks something like this:

BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile

So it is clear that the changes done using mmap() do not survive a
remount. This can be reproduced 100% of the time. The problem was
introduced with the following commit:

136e877 nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF
boundary

If the page was read with mpage_readpage() or mpage_readpages(), for
example, then it has no buffers attached to it. In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.

This patch fixes the problem by also calling nilfs_set_file_dirty() if
the page has no buffers attached to it.

Signed-off-by: Andreas Rohner 
Tested-by: Andreas Rohner 
---
 fs/nilfs2/inode.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 6252b17..9e3c525 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -219,10 +219,10 @@ static int nilfs_writepage(struct page *page, struct 
writeback_control *wbc)
 
 static int nilfs_set_page_dirty(struct page *page)
 {
+   struct inode *inode = page->mapping->host;
int ret = __set_page_dirty_nobuffers(page);
 
if (page_has_buffers(page)) {
-   struct inode *inode = page->mapping->host;
unsigned nr_dirty = 0;
struct buffer_head *bh, *head;
 
@@ -245,6 +245,10 @@ static int nilfs_set_page_dirty(struct page *page)
 
if (nr_dirty)
nilfs_set_file_dirty(inode, nr_dirty);
+   } else if (ret) {
+   unsigned nr_dirty = 1 << (PAGE_SHIFT - inode->i_blkbits);
+
+   nilfs_set_file_dirty(inode, nr_dirty);
}
return ret;
 }
-- 
2.1.0



Re: [PATCH] nilfs2: fix data loss with mmap()

2014-09-16 Thread Andreas Rohner
On 2014-09-16 06:42, Ryusuke Konishi wrote:
> On Tue, 16 Sep 2014 00:24:05 +0200, Andreas Rohner wrote:
>> On 2014-09-16 00:01, Ryusuke Konishi wrote:
>>> Hi Andreas,
>>> On Mon, 15 Sep 2014 21:47:30 +0200, Andreas Rohner wrote:
>>>> This bug leads to reproducible silent data loss, despite the use of
>>>> msync(), sync() and a clean unmount of the file system. It is easily
>>>> reproducible with the following script:
>> 
>>> Thank you for reporting this issue.
>>
>> I just stumbled upon the weird behaviour of mmap() while testing the
>> nilfs_sync_fs() patch.
>>
>>> I'd like to look into this patch, it looks to point out an important
>>> regression, but it may take some time since I am quite busy this week..
>>
>> Of course. I understand.
> 
> The patch looks correct.  It is my mistake that the commit 136e877
> leaked consideration for the case where the page doesn't have buffer
> heads.  This fix should be backported to stable kernels. (I'll add a
> "Cc: stable" tag when sending this to Andrew.)
> 
> Did you confirm that the patch works as expected ?

Yes at least with the current master kernel:

BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
AFTER MMAP:     3d9183f1c471b9baff15c9cc8d12c303  /mnt/testfile
AFTER REMOUNT:  3d9183f1c471b9baff15c9cc8d12c303  /mnt/testfile

For the record here is the little tool I used for testing mmap:

https://github.com/zeitgeist87/mmaptest

It writes pages of random bytes at a given offset using mmap(). The
frequent umounts and mounts in the test script are necessary to purge
the page cache; there is probably a better way of doing that (one
alternative is sketched below).
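
For example, dropping the clean page cache directly should also force
subsequent reads to go back to disk (assuming root):

 # sync
 # echo 3 > /proc/sys/vm/drop_caches

The umount/mount cycle has the nice side effect of also exercising a
clean unmount of the file system, though.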

> I'd appreciate your help on testing the patch for some old kernels.
> (And, please declare a "Tested-by" tag in the reply mail, if the test
> is ok).

Sure I have everything set up. Which kernels do I have to test? Was
commit 136e877 backported? I presume at least stable and some of the
longterm kernels on https://www.kernel.org/...

By the way, thanks for your continued effort and the time you invest in
reviewing my patches.

br,
Andreas Rohner



Re: [PATCH] nilfs2: fix data loss with mmap()

2014-09-15 Thread Andreas Rohner
On 2014-09-16 00:01, Ryusuke Konishi wrote:
> Hi Andreas,
> On Mon, 15 Sep 2014 21:47:30 +0200, Andreas Rohner wrote:
>> This bug leads to reproducible silent data loss, despite the use of
>> msync(), sync() and a clean unmount of the file system. It is easily
>> reproducible with the following script:

> Thank you for reporting this issue.

I just stumbled upon the weird behaviour of mmap() while testing the
nilfs_sync_fs() patch.

> I'd like to look into this patch, it looks to point out an important
> regression, but it may take some time since I am quite busy this week..

Of course. I understand.

br,
Andreas Rohner



[PATCH] nilfs2: fix data loss with mmap()

2014-09-15 Thread Andreas Rohner
This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system. It is easily
reproducible with the following script:

[BEGIN SCRIPT]
mkfs.nilfs2 -f /dev/sdb
mount /dev/sdb /mnt

# create 30MB testfile
dd if=/dev/zero bs=1M count=30 of=/mnt/testfile

umount /mnt
mount /dev/sdb /mnt
CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"

# simple tool that opens /mnt/testfile and
# writes a few blocks using mmap at a 5MB offset
/root/mmaptest/mmaptest /mnt/testfile 30 10 5

sync
CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
umount /mnt
mount /dev/sdb /mnt
CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
umount /mnt

echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
echo "AFTER MMAP:\t$CHECKSUM_AFTER"
echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
[END SCRIPT]

The mmaptest tool looks something like this (very simplified, with
error checking removed):

[BEGIN mmaptest]
data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, file_offset);

for (i = 0; i < write_count; ++i) {
memcpy(data + i * 4096, buf, sizeof(buf));
msync(data, file_size - file_offset, MS_SYNC);
}
[END mmaptest]

The output of the script looks something like this:

BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile

So it is clear that the changes done using mmap() do not survive a
remount. This can be reproduced 100% of the time. The problem was
introduced with the following commit:

136e877 nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF
boundary

If the page was read with mpage_readpage() or mpage_readpages(), for
example, then it has no buffers attached to it. In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.

This patch fixes the problem by also calling nilfs_set_file_dirty() if
the page has no buffers attached to it.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/inode.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 6252b17..9e3c525 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -219,10 +219,10 @@ static int nilfs_writepage(struct page *page, struct 
writeback_control *wbc)
 
 static int nilfs_set_page_dirty(struct page *page)
 {
+   struct inode *inode = page->mapping->host;
int ret = __set_page_dirty_nobuffers(page);
 
if (page_has_buffers(page)) {
-   struct inode *inode = page->mapping->host;
unsigned nr_dirty = 0;
struct buffer_head *bh, *head;
 
@@ -245,6 +245,10 @@ static int nilfs_set_page_dirty(struct page *page)
 
if (nr_dirty)
nilfs_set_file_dirty(inode, nr_dirty);
+   } else if (ret) {
+   unsigned nr_dirty = 1 << (PAGE_SHIFT - inode->i_blkbits);
+
+   nilfs_set_file_dirty(inode, nr_dirty);
}
return ret;
 }
-- 
2.1.0



[PATCH v6 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-13 Thread Andreas Rohner
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends on
the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.

In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
is introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed. For convenience the function
nilfs_flush_device() is added, which contains the above logic.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c  |  8 +++-
 fs/nilfs2/ioctl.c |  8 +++-
 fs/nilfs2/segment.c   |  3 +++
 fs/nilfs2/super.c |  6 ++
 fs/nilfs2/the_nilfs.h | 22 ++
 5 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2497815..e9e3325 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -56,11 +56,9 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
mutex_unlock(&inode->i_mutex);
 
nilfs = inode->i_sb->s_fs_info;
-   if (!err && nilfs_test_opt(nilfs, BARRIER)) {
-   err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
-   if (err != -EIO)
-   err = 0;
-   }
+   if (!err)
+   err = nilfs_flush_device(nilfs);
+
return err;
 }
 
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..9a20e51 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1022,11 +1022,9 @@ static int nilfs_ioctl_sync(struct inode *inode, struct 
file *filp,
return ret;
 
nilfs = inode->i_sb->s_fs_info;
-   if (nilfs_test_opt(nilfs, BARRIER)) {
-   ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
-   if (ret == -EIO)
-   return ret;
-   }
+   ret = nilfs_flush_device(nilfs);
+   if (ret < 0)
+   return ret;
 
if (argp != NULL) {
down_read(&nilfs->ns_segctor_sem);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index a1a1916..0b7d2ca 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1833,6 +1833,7 @@ static void nilfs_segctor_complete_write(struct 
nilfs_sc_info *sci)
nilfs_set_next_segment(nilfs, segbuf);
 
if (update_sr) {
+   nilfs->ns_flushed_device = 0;
nilfs_set_last_segment(nilfs, segbuf->sb_pseg_start,
   segbuf->sb_sum.seg_seq, nilfs->ns_cno++);
 
@@ -2216,6 +2217,8 @@ int nilfs_construct_dsync_segment(struct super_block *sb, 
struct inode *inode,
sci->sc_dsync_end = end;
 
err = nilfs_segctor_do_construct(sci, SC_LSEG_DSYNC);
+   if (!err)
+   nilfs->ns_flushed_device = 0;
 
nilfs_transaction_unlock(sb);
return err;
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..2e5b3ec 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -310,6 +310,9 @@ int nilfs_commit_super(struct super_block *sb, int flag)
nilfs->ns_sbsize));
}
clear_nilfs_sb_dirty(nilfs);
+   nilfs->ns_flushed_device = 1;
+   /* make sure store to ns_flushed_device cannot be reordered */
+   smp_wmb();
return nilfs_sync_super(sb, flag);
 }
 
@@ -514,6 +517,9 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);
 
+   if (!err)
+   err = nilfs_flush_device(nilfs);
+
return err;
 }
 
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index d01ead1..23778d3 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -46,6 +46,7 @@ enum {
 /**
  * struct the_nilfs - struct to supervise multiple nilfs mount points
  * @ns_flags: flags
+ * @ns_flushed_device: flag indicating if all volatile data was flushed
  * @ns_bdev: block device
  * @ns_sem: semaphore for shared states
  * @ns_snapshot_mount_mutex: mutex to protect snapshot mounts
@@ -103,6 +104,7 @@ enum {
  */
 struct the_nilfs {
unsigned long   ns_flags;
+   int ns_flushed_device;
 
struct block_device*ns_bdev;
struct rw_semaphore ns_sem;
@@ -371,4 +373,24 @@ static inline int nilfs_segment_is_active(struct the_nilfs 
*nilfs, __u64 n)
return n == nilfs->ns_segnum || n == nilfs->ns_nextnum;
 }
 
+static inline int nilfs_flush_device(struct the_nilfs *nilfs)
+{
+   int err;
+
+   if (!nilfs_test_opt(nilfs, BARRIER) || nilfs->ns_flushed_device)
+   return 0;
+
+   nilfs->ns_flush
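
The diff is cut off here; based on the description above and the inline
logic of the earlier v4 patch further down, the complete helper
presumably looks roughly like this (a reconstruction, not the verbatim
hunk):

static inline int nilfs_flush_device(struct the_nilfs *nilfs)
{
        int err;

        if (!nilfs_test_opt(nilfs, BARRIER) || nilfs->ns_flushed_device)
                return 0;

        nilfs->ns_flushed_device = 1;
        /*
         * the store to ns_flushed_device must not be reordered after
         * blkdev_issue_flush
         */
        smp_wmb();

        err = blkdev_issue_flush(nilfs->ns_bdev, GFP_KERNEL, NULL);
        if (err != -EIO)
                err = 0;
        return err;
}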

[PATCH v6 0/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-13 Thread Andreas Rohner
Hi,

I have looked a bit more into the semantics of the various flags
concerning block device caching behaviour. According to
"Documentation/block/writeback_cache_control.txt" a call to
blkdev_issue_flush() is equivalent to an empty bio with the
REQ_FLUSH flag set. So there is no need to call blkdev_issue_flush()
after a call to nilfs_commit_super(). But if there is no need to write
the super block an additional call to blkdev_issue_flush() is necessary.

To avoid unnecessary overhead, I introduced the nilfs->ns_flushed_device
flag, which is set to 0 whenever new logs are written and set to 1
whenever the block device is flushed. If the super block was written
during segment construction or in nilfs_sync_fs(), then
blkdev_issue_flush() is not called.

br,
Andreas Rohner

v5->v6 (review by Ryusuke Konishi)
 * Remove special handling of EIO error state from nilfs_ioctl_sync()

v4->v5 (review by Ryusuke Konishi)
 * Move device flushing logic into separate function
 * Fix invalid comment
 * Move clearing of the flag to nilfs_segctor_complete_write() and
   nilfs_construct_dsync_segment()

v3->v4 (review by Ryusuke Konishi)
 * Replace atomic_t with int for ns_flushed_device
 * Use smp_wmb() to guarantee correct ordering

v2->v3 (review of Ryusuke Konishi)
 * Use separate atomic flag for ns_flushed_device instead of a bit flag 
   in ns_flags
 * Use smp_mb__after_atomic() after setting ns_flushed_device

v1->v2
 * Add new flag THE_NILFS_FLUSHED

Andreas Rohner (1):
  nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

 fs/nilfs2/file.c  |  8 +++-
 fs/nilfs2/ioctl.c |  8 +++-
 fs/nilfs2/segment.c   |  3 +++
 fs/nilfs2/super.c |  6 ++
 fs/nilfs2/the_nilfs.h | 22 ++
 5 files changed, 37 insertions(+), 10 deletions(-)

-- 
2.1.0



[PATCH v5 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-13 Thread Andreas Rohner
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends on
the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.

In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
is introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed. For convenience the function
nilfs_flush_device() is added, which contains the above logic.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c  |  8 +++-
 fs/nilfs2/ioctl.c |  8 +++-
 fs/nilfs2/segment.c   |  3 +++
 fs/nilfs2/super.c |  6 ++
 fs/nilfs2/the_nilfs.h | 22 ++
 5 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2497815..e9e3325 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -56,11 +56,9 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
mutex_unlock(&inode->i_mutex);
 
nilfs = inode->i_sb->s_fs_info;
-   if (!err && nilfs_test_opt(nilfs, BARRIER)) {
-   err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
-   if (err != -EIO)
-   err = 0;
-   }
+   if (!err)
+   err = nilfs_flush_device(nilfs);
+
return err;
 }
 
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..5a530f3 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1022,11 +1022,9 @@ static int nilfs_ioctl_sync(struct inode *inode, struct 
file *filp,
return ret;
 
nilfs = inode->i_sb->s_fs_info;
-   if (nilfs_test_opt(nilfs, BARRIER)) {
-   ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
-   if (ret == -EIO)
-   return ret;
-   }
+   ret = nilfs_flush_device(nilfs);
+   if (ret == -EIO)
+   return ret;
 
if (argp != NULL) {
down_read(&nilfs->ns_segctor_sem);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index a1a1916..0b7d2ca 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1833,6 +1833,7 @@ static void nilfs_segctor_complete_write(struct 
nilfs_sc_info *sci)
nilfs_set_next_segment(nilfs, segbuf);
 
if (update_sr) {
+   nilfs->ns_flushed_device = 0;
nilfs_set_last_segment(nilfs, segbuf->sb_pseg_start,
   segbuf->sb_sum.seg_seq, nilfs->ns_cno++);
 
@@ -2216,6 +2217,8 @@ int nilfs_construct_dsync_segment(struct super_block *sb, 
struct inode *inode,
sci->sc_dsync_end = end;
 
err = nilfs_segctor_do_construct(sci, SC_LSEG_DSYNC);
+   if (!err)
+   nilfs->ns_flushed_device = 0;
 
nilfs_transaction_unlock(sb);
return err;
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..2e5b3ec 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -310,6 +310,9 @@ int nilfs_commit_super(struct super_block *sb, int flag)
nilfs->ns_sbsize));
}
clear_nilfs_sb_dirty(nilfs);
+   nilfs->ns_flushed_device = 1;
+   /* make sure store to ns_flushed_device cannot be reordered */
+   smp_wmb();
return nilfs_sync_super(sb, flag);
 }
 
@@ -514,6 +517,9 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);
 
+   if (!err)
+   err = nilfs_flush_device(nilfs);
+
return err;
 }
 
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index d01ead1..23778d3 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -46,6 +46,7 @@ enum {
 /**
  * struct the_nilfs - struct to supervise multiple nilfs mount points
  * @ns_flags: flags
+ * @ns_flushed_device: flag indicating if all volatile data was flushed
  * @ns_bdev: block device
  * @ns_sem: semaphore for shared states
  * @ns_snapshot_mount_mutex: mutex to protect snapshot mounts
@@ -103,6 +104,7 @@ enum {
  */
 struct the_nilfs {
unsigned long   ns_flags;
+   int ns_flushed_device;
 
struct block_device*ns_bdev;
struct rw_semaphore ns_sem;
@@ -371,4 +373,24 @@ static inline int nilfs_segment_is_active(struct the_nilfs 
*nilfs, __u64 n)
return n == nilfs->ns_segnum || n == nilfs->ns_nextnum;
 }
 
+static inline int nilfs_flush_device(struct the_nilfs *nilfs)
+{
+   int err;
+
+   if (!nilfs_test_opt(nilfs, BARRIER) || nilfs->ns_flushed_device)
+   return 0;
+
+   nilfs->ns_flushed_

[PATCH v5 0/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-13 Thread Andreas Rohner
Hi,

I have looked a bit more into the semantics of the various flags
concerning block device caching behaviour. According to
"Documentation/block/writeback_cache_control.txt" a call to
blkdev_issue_flush() is equivalent to an empty bio with the
REQ_FLUSH flag set. So there is no need to call blkdev_issue_flush()
after a call to nilfs_commit_super(). But if there is no need to write
the super block, an additional call to blkdev_issue_flush() is necessary.

To avoid unnecessary overhead, I introduced the nilfs->ns_flushed_device
flag, which is set to 0 whenever new logs are written and set to 1
whenever the block device is flushed. If the super block was written
during segment construction or in nilfs_sync_fs(), then
blkdev_issue_flush() is not called.

br,
Andreas Rohner

v4->v5 (review by Ryusuke Konishi)
 * Move device flushing logic into separate function
 * Fix invalid comment
 * Move clearing of the flag to nilfs_segctor_complete_write() and
   nilfs_construct_dsync_segment()

v3->v4 (review by Ryusuke Konishi)
 * Replace atomic_t with int for ns_flushed_device
 * Use smp_wmb() to guarantee correct ordering

v2->v3 (review of Ryusuke Konishi)
 * Use separate atomic flag for ns_flushed_device instead of a bit flag 
   in ns_flags
 * Use smp_mb__after_atomic() after setting ns_flushed_device

v1->v2
 * Add new flag THE_NILFS_FLUSHED

Andreas Rohner (1):
  nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

 fs/nilfs2/file.c  |  8 +++-
 fs/nilfs2/ioctl.c |  8 +++-
 fs/nilfs2/segment.c   |  3 +++
 fs/nilfs2/super.c |  6 ++
 fs/nilfs2/the_nilfs.h | 22 ++
 5 files changed, 37 insertions(+), 10 deletions(-)

-- 
2.1.0



Re: [PATCH v4 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-10 Thread Andreas Rohner
On 2014-09-09 23:17, Andreas Rohner wrote:
> Under normal circumstances nilfs_sync_fs() writes out the super block,
> which causes a flush of the underlying block device. But this depends on
> the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
> last segment crosses a segment boundary. So if only a small amount of
> data is written before the call to nilfs_sync_fs(), no flush of the
> block device occurs.
> 
> In the above case an additional call to blkdev_issue_flush() is needed.
> To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
> is introduced, which is cleared whenever new logs are written and set
> whenever the block device is flushed.
> 
> Signed-off-by: Andreas Rohner 
> ---
>  fs/nilfs2/file.c  | 10 +-
>  fs/nilfs2/ioctl.c | 10 +-
>  fs/nilfs2/segment.c   |  4 
>  fs/nilfs2/super.c | 17 +
>  fs/nilfs2/the_nilfs.h |  2 ++
>  5 files changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
> index 2497815..16375c2 100644
> --- a/fs/nilfs2/file.c
> +++ b/fs/nilfs2/file.c
> @@ -56,7 +56,15 @@ int nilfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
>   mutex_unlock(&inode->i_mutex);
>  
>   nilfs = inode->i_sb->s_fs_info;
> - if (!err && nilfs_test_opt(nilfs, BARRIER)) {
> + if (!err && nilfs_test_opt(nilfs, BARRIER) &&
> + !nilfs->ns_flushed_device) {
> + nilfs->ns_flushed_device = 1;
> + /*
> +  * the store to ns_flushed_device must not be reordered after
> +  * blkdev_issue_flush
> +  */
> + smp_wmb();
> +
>   err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
>   if (err != -EIO)
>   err = 0;
> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
> index 422fb54..9444d5d 100644
> --- a/fs/nilfs2/ioctl.c
> +++ b/fs/nilfs2/ioctl.c
> @@ -1022,7 +1022,15 @@ static int nilfs_ioctl_sync(struct inode *inode, 
> struct file *filp,
>   return ret;
>  
>   nilfs = inode->i_sb->s_fs_info;
> - if (nilfs_test_opt(nilfs, BARRIER)) {
> + if (nilfs_test_opt(nilfs, BARRIER) &&
> + !nilfs->ns_flushed_device) {
> + nilfs->ns_flushed_device = 1;
> + /*
> +  * the store to ns_flushed_device must not be reordered after
> +  * blkdev_issue_flush
> +  */
> + smp_wmb();
> +
>   ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
>   if (ret == -EIO)
>   return ret;
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index a1a1916..379da1b 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1997,6 +1997,10 @@ static int nilfs_segctor_do_construct(struct 
> nilfs_sc_info *sci, int mode)
>   err = nilfs_segctor_wait(sci);
>   if (err)
>   goto failed_to_write;
> +
> + if (test_bit(NILFS_SC_SUPER_ROOT, &sci->sc_flags) ||
> + mode == SC_LSEG_DSYNC)
> + nilfs->ns_flushed_device = 0;
>   }
>   } while (sci->sc_stage.scnt != NILFS_ST_DONE);
>  
> diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
> index 228f5bd..33aafbd 100644
> --- a/fs/nilfs2/super.c
> +++ b/fs/nilfs2/super.c
> @@ -310,6 +310,9 @@ int nilfs_commit_super(struct super_block *sb, int flag)
>   nilfs->ns_sbsize));
>   }
>   clear_nilfs_sb_dirty(nilfs);
> + nilfs->ns_flushed_device = 1;
> + /* make sure store to ns_flushed_device cannot be reordered */
> + smp_wmb();
>   return nilfs_sync_super(sb, flag);
>  }
>  
> @@ -514,6 +517,20 @@ static int nilfs_sync_fs(struct super_block *sb, int 
> wait)
>   }
>   up_write(&nilfs->ns_sem);
>  
> + if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
> + !nilfs->ns_flushed_device) {
> + nilfs->ns_flushed_device = 1;
> + /*
> +  * the store to ns_flushed_device must not be reordered after
> +  * blkdev_issue_flush
> +  */
> + smp_wmb();

I am not at all sure if this memory barrier is enough. Memory barriers
only guarantee the order in which memory operations hit the CPU cache.
They do not guarantee that all CPUs see the previous memory operations.
They cannot be used to provide uncondit

[PATCH v4 0/1] nilfs2: add missing blkdev_issue_flush() to

2014-09-09 Thread Andreas Rohner
Hi,

I have looked a bit more into the semantics of the various flags
concerning block device caching behaviour. According to
"Documentation/block/writeback_cache_control.txt" a call to
blkdev_issue_flush() is equivalent to an empty bio with the
REQ_FLUSH flag set. So there is no need to call blkdev_issue_flush()
after a call to nilfs_commit_super(). But if there is no need to write
the super block, an additional call to blkdev_issue_flush() is necessary.

To avoid unnecessary overhead, I introduced the nilfs->ns_flushed_device
flag, which is set to 0 whenever new logs are written and set to 1
whenever the block device is flushed. If the super block was written
during segment construction or in nilfs_sync_fs(), then
blkdev_issue_flush() is not called.

br,
Andreas Rohner

v3->v4 (review by Ryusuke Konishi)
 * replace atomic_t with int for ns_flushed_device
 * use smp_wmb() to guarantee correct ordering

v2->v3 (review of Ryusuke Konishi)
 * Use separate atomic flag for ns_flushed_device instead of a bit flag 
   in ns_flags
 * Use smp_mb__after_atomic() after setting ns_flushed_device

v1->v2
 * Add new flag THE_NILFS_FLUSHED

Andreas Rohner (1):
  nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

 fs/nilfs2/file.c  | 10 +-
 fs/nilfs2/ioctl.c | 10 +-
 fs/nilfs2/segment.c   |  4 
 fs/nilfs2/super.c | 17 +
 fs/nilfs2/the_nilfs.h |  2 ++
 5 files changed, 41 insertions(+), 2 deletions(-)

-- 
2.1.0



[PATCH v4 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-09 Thread Andreas Rohner
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends on
the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.

In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
is introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c  | 10 +-
 fs/nilfs2/ioctl.c | 10 +-
 fs/nilfs2/segment.c   |  4 
 fs/nilfs2/super.c | 17 +
 fs/nilfs2/the_nilfs.h |  2 ++
 5 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2497815..16375c2 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -56,7 +56,15 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
mutex_unlock(&inode->i_mutex);
 
nilfs = inode->i_sb->s_fs_info;
-   if (!err && nilfs_test_opt(nilfs, BARRIER)) {
+   if (!err && nilfs_test_opt(nilfs, BARRIER) &&
+   !nilfs->ns_flushed_device) {
+   nilfs->ns_flushed_device = 1;
+   /*
+* the store to ns_flushed_device must not be reordered after
+* blkdev_issue_flush
+*/
+   smp_wmb();
+
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (err != -EIO)
err = 0;
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..9444d5d 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1022,7 +1022,15 @@ static int nilfs_ioctl_sync(struct inode *inode, struct 
file *filp,
return ret;
 
nilfs = inode->i_sb->s_fs_info;
-   if (nilfs_test_opt(nilfs, BARRIER)) {
+   if (nilfs_test_opt(nilfs, BARRIER) &&
+   !nilfs->ns_flushed_device) {
+   nilfs->ns_flushed_device = 1;
+   /*
+* the store to ns_flushed_device must not be reordered after
+* blkdev_issue_flush
+*/
+   smp_wmb();
+
ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (ret == -EIO)
return ret;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index a1a1916..379da1b 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1997,6 +1997,10 @@ static int nilfs_segctor_do_construct(struct 
nilfs_sc_info *sci, int mode)
err = nilfs_segctor_wait(sci);
if (err)
goto failed_to_write;
+
+   if (test_bit(NILFS_SC_SUPER_ROOT, &sci->sc_flags) ||
+   mode == SC_LSEG_DSYNC)
+   nilfs->ns_flushed_device = 0;
}
} while (sci->sc_stage.scnt != NILFS_ST_DONE);
 
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..33aafbd 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -310,6 +310,9 @@ int nilfs_commit_super(struct super_block *sb, int flag)
nilfs->ns_sbsize));
}
clear_nilfs_sb_dirty(nilfs);
+   nilfs->ns_flushed_device = 1;
+   /* make sure store to ns_flushed_device cannot be reordered */
+   smp_wmb();
return nilfs_sync_super(sb, flag);
 }
 
@@ -514,6 +517,20 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);
 
+   if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+   !nilfs->ns_flushed_device) {
+   nilfs->ns_flushed_device = 1;
+   /*
+* the store to ns_flushed_device must not be reordered after
+* blkdev_issue_flush
+*/
+   smp_wmb();
+
+   err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+   if (err != -EIO)
+   err = 0;
+   }
+
return err;
 }
 
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index d01ead1..dabb02c 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -45,6 +45,7 @@ enum {
 
 /**
  * struct the_nilfs - struct to supervise multiple nilfs mount points
+ * @ns_flushed_device: flag indicating if all volatile data was flushed
  * @ns_flags: flags
  * @ns_bdev: block device
  * @ns_sem: semaphore for shared states
@@ -103,6 +104,7 @@ enum {
  */
 struct the_nilfs {
unsigned long   ns_flags;
+   int ns_flushed_device;

Re: [PATCH 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-09 Thread Andreas Rohner
On 2014-09-09 21:18, Ryusuke Konishi wrote:
> On Tue,  9 Sep 2014 18:35:40 +0200, Andreas Rohner wrote:
>> Under normal circumstances nilfs_sync_fs() writes out the super block,
>> which causes a flush of the underlying block device. But this depends on
>> the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
>> last segment crosses a segment boundary. So if only a small amount of
>> data is written before the call to nilfs_sync_fs(), no flush of the
>> block device occurs.
>>
>> In the above case an additional call to blkdev_issue_flush() is needed.
>> To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
>> is introduced, which is cleared whenever new logs are written and set
>> whenever the block device is flushed.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/file.c  |  6 +-
>>  fs/nilfs2/ioctl.c |  6 +-
>>  fs/nilfs2/segment.c   |  4 
>>  fs/nilfs2/super.c | 12 
>>  fs/nilfs2/the_nilfs.c |  1 +
>>  fs/nilfs2/the_nilfs.h |  2 ++
>>  6 files changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
>> index 2497815..8a3e702 100644
>> --- a/fs/nilfs2/file.c
>> +++ b/fs/nilfs2/file.c
>> @@ -56,7 +56,11 @@ int nilfs_sync_file(struct file *file, loff_t start, 
>> loff_t end, int datasync)
>>  mutex_unlock(&inode->i_mutex);
>>  
>>  nilfs = inode->i_sb->s_fs_info;
>> -if (!err && nilfs_test_opt(nilfs, BARRIER)) {
>> +if (!err && nilfs_test_opt(nilfs, BARRIER) &&
>> +!atomic_read(&nilfs->ns_flushed_device)) {
>> +atomic_set(&nilfs->ns_flushed_device, 1);
>> +smp_mb__after_atomic();
>> +
>>  err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
>>  if (err != -EIO)
>>  err = 0;
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index 422fb54..47fe7cf 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -1022,7 +1022,11 @@ static int nilfs_ioctl_sync(struct inode *inode, 
>> struct file *filp,
>>  return ret;
>>  
>>  nilfs = inode->i_sb->s_fs_info;
>> -if (nilfs_test_opt(nilfs, BARRIER)) {
>> +if (nilfs_test_opt(nilfs, BARRIER) &&
>> +!atomic_read(&nilfs->ns_flushed_device)) {
>> +atomic_set(&nilfs->ns_flushed_device, 1);
>> +smp_mb__after_atomic();
>> +
>>  ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
>>  if (ret == -EIO)
>>  return ret;
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index a1a1916..3119b64 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -1997,6 +1997,10 @@ static int nilfs_segctor_do_construct(struct 
>> nilfs_sc_info *sci, int mode)
>>  err = nilfs_segctor_wait(sci);
>>  if (err)
>>  goto failed_to_write;
>> +
>> +if (test_bit(NILFS_SC_SUPER_ROOT, &sci->sc_flags) ||
>> +mode == SC_LSEG_DSYNC)
>> +atomic_set(&nilfs->ns_flushed_device, 0);
>>  }
>>  } while (sci->sc_stage.scnt != NILFS_ST_DONE);
>>  
>> diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
>> index 228f5bd..74a9930 100644
>> --- a/fs/nilfs2/super.c
>> +++ b/fs/nilfs2/super.c
>> @@ -310,6 +310,8 @@ int nilfs_commit_super(struct super_block *sb, int flag)
>>  nilfs->ns_sbsize));
>>  }
>>  clear_nilfs_sb_dirty(nilfs);
>> +atomic_set(&nilfs->ns_flushed_device, 1);
>> +smp_mb__after_atomic();
>>  return nilfs_sync_super(sb, flag);
>>  }
>>  
>> @@ -514,6 +516,16 @@ static int nilfs_sync_fs(struct super_block *sb, int 
>> wait)
>>  }
>>  up_write(&nilfs->ns_sem);
>>  
>> +if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
>> +!atomic_read(&nilfs->ns_flushed_device)) {
>> +atomic_set(&nilfs->ns_flushed_device, 1);
>> +smp_mb__after_atomic();
>> +
>> +err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
>> +if (err != -EIO)
>> +err = 0;
>> +}
>> +
>

[PATCH 0/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-09 Thread Andreas Rohner
Hi,

I have looked a bit more into the semantics of the various flags
concerning block device caching behaviour. According to
"Documentation/block/writeback_cache_control.txt" a call to
blkdev_issue_flush() is equivalent to an empty bio with the
REQ_FLUSH flag set. So there is no need to call blkdev_issue_flush()
after a call to nilfs_commit_super(). But if there is no need to write
the super block an additional call to blkdev_issue_flush() is necessary.

To avoid unnecessary overhead I introduced the nilfs->ns_flushed_device flag,
which is set to 0 whenever new logs are written and set to 1 whenever
the block device is flushed. If the super block was written during
segment construction or in nilfs_sync_fs(), then blkdev_issue_flush() is not
called.

On most modern architectures loads and stores of single-word integers
are atomic. I still used atomic_t for ns_flushed_device for
documentation purposes. I only use atomic_read() and atomic_set(), both
of which are inline functions that compile down to simple loads and
stores on modern architectures, so a plain int would not be any faster.
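
In condensed form, the protocol looks roughly like the sketch below. This is
a simplification with placeholder function names (nilfs_flush_device() and
nilfs_logs_written() do not exist as such); the real hunks follow in the
patch itself:

/* Sketch only -- placeholder names; the actual call sites are in the patch. */
static int nilfs_flush_device(struct the_nilfs *nilfs, struct super_block *sb)
{
        if (atomic_read(&nilfs->ns_flushed_device))
                return 0;               /* already flushed since the last log */

        atomic_set(&nilfs->ns_flushed_device, 1);
        smp_mb__after_atomic();         /* publish the flag before flushing */
        return blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
}

static void nilfs_logs_written(struct the_nilfs *nilfs)
{
        /* new logs on disk -> the next sync must flush the device again */
        atomic_set(&nilfs->ns_flushed_device, 0);
}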

br,
Andreas Rohner

v2->v3 (based on review of Ryusuke Konishi)
 * Use separate atomic flag for ns_flushed_device instead of a bit flag 
   in ns_flags
 * Use smp_mb__after_atomic() after setting ns_flushed_device

v1->v2
 * Add new flag THE_NILFS_FLUSHED

Andreas Rohner (1):
  nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

 fs/nilfs2/file.c  |  6 +++++-
 fs/nilfs2/ioctl.c |  6 +++++-
 fs/nilfs2/segment.c   |  4 ++++
 fs/nilfs2/super.c | 12 ++++++++++++
 fs/nilfs2/the_nilfs.c |  1 +
 fs/nilfs2/the_nilfs.h |  2 ++
 6 files changed, 29 insertions(+), 2 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-09 Thread Andreas Rohner
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends on
the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.

In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
is introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c  |  6 +++++-
 fs/nilfs2/ioctl.c |  6 +++++-
 fs/nilfs2/segment.c   |  4 ++++
 fs/nilfs2/super.c | 12 ++++++++++++
 fs/nilfs2/the_nilfs.c |  1 +
 fs/nilfs2/the_nilfs.h |  2 ++
 6 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2497815..8a3e702 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -56,7 +56,11 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
mutex_unlock(&inode->i_mutex);
 
nilfs = inode->i_sb->s_fs_info;
-   if (!err && nilfs_test_opt(nilfs, BARRIER)) {
+   if (!err && nilfs_test_opt(nilfs, BARRIER) &&
+   !atomic_read(&nilfs->ns_flushed_device)) {
+   atomic_set(&nilfs->ns_flushed_device, 1);
+   smp_mb__after_atomic();
+
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (err != -EIO)
err = 0;
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..47fe7cf 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1022,7 +1022,11 @@ static int nilfs_ioctl_sync(struct inode *inode, struct 
file *filp,
return ret;
 
nilfs = inode->i_sb->s_fs_info;
-   if (nilfs_test_opt(nilfs, BARRIER)) {
+   if (nilfs_test_opt(nilfs, BARRIER) &&
+   !atomic_read(&nilfs->ns_flushed_device)) {
+   atomic_set(&nilfs->ns_flushed_device, 1);
+   smp_mb__after_atomic();
+
ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (ret == -EIO)
return ret;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index a1a1916..3119b64 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1997,6 +1997,10 @@ static int nilfs_segctor_do_construct(struct 
nilfs_sc_info *sci, int mode)
err = nilfs_segctor_wait(sci);
if (err)
goto failed_to_write;
+
+   if (test_bit(NILFS_SC_SUPER_ROOT, &sci->sc_flags) ||
+   mode == SC_LSEG_DSYNC)
+   atomic_set(&nilfs->ns_flushed_device, 0);
}
} while (sci->sc_stage.scnt != NILFS_ST_DONE);
 
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..74a9930 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -310,6 +310,8 @@ int nilfs_commit_super(struct super_block *sb, int flag)
nilfs->ns_sbsize));
}
clear_nilfs_sb_dirty(nilfs);
+   atomic_set(&nilfs->ns_flushed_device, 1);
+   smp_mb__after_atomic();
return nilfs_sync_super(sb, flag);
 }
 
@@ -514,6 +516,16 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);
 
+   if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+   !atomic_read(&nilfs->ns_flushed_device)) {
+   atomic_set(&nilfs->ns_flushed_device, 1);
+   smp_mb__after_atomic();
+
+   err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+   if (err != -EIO)
+   err = 0;
+   }
+
return err;
 }
 
diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index 9da25fe..d37c50b 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -74,6 +74,7 @@ struct the_nilfs *alloc_nilfs(struct block_device *bdev)
return NULL;
 
nilfs->ns_bdev = bdev;
+   atomic_set(&nilfs->ns_flushed_device, 0);
atomic_set(&nilfs->ns_ndirtyblks, 0);
init_rwsem(&nilfs->ns_sem);
mutex_init(&nilfs->ns_snapshot_mount_mutex);
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index d01ead1..ec53958 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -45,6 +45,7 @@ enum {
 
 /**
  * struct the_nilfs - struct to supervise multiple nilfs mount points
+ * @ns_flushed_device: flag indicating if all volatile data was flushed
  * @ns_flags: flags
  * @ns_bdev: block device
  * @ns_sem: se

Re: [PATCH v2 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-08 Thread Andreas Rohner

Hi Ryusuke,

Sorry for the late response; I was busy over the weekend.

On 2014-09-07 07:12, Ryusuke Konishi wrote:
> Hi Andreas,
> On Wed, 03 Sep 2014 14:32:22 +0200, Andreas Rohner wrote:
>> On 2014-09-03 02:35, Ryusuke Konishi wrote:
>>> On Mon, 01 Sep 2014 21:18:30 +0200, Andreas Rohner wrote:
>>> On the other hand, we need explicit barrier operation like
>>> smp_mb__after_atomic() if a certain operation is performed after
>>> set_bit() and the changed bit should be visible to other processors
>>> before the operation.
>>
>> Great suggestion. I didn't know about those functions.
> 
> I recommend you read Documentation/memory-barries.txt.  It's an
> excellent document summarizing information on what we should know
> about memory synchronization on smp.  Documentation/atomic_ops.txt
> also contains some information on barriers related to atomic
> operations.
> 
>> Do we also need a call to smp_mb__before_atomic() before
>> clear_nilfs_flushed(nilfs) in segment.c?
> 
> I think the timing restrictions of this flag are not so severe.  The
> only restrictions that the flag must ensure are:
>
>  1) Bios for the logs are completed before this flag is cleared.
>  2) Clearing the flag is propagated to the processor executing
> nilfs_sync_fs() and nilfs_sync_file() before the log writer returns.
> 
> The restriction (1) is guaranteed since nilfs_wait_on_logs() is called
> before nilfs_segctor_complete_write().  This sequence appears at
> nilfs_segctor_wait() function.
> 
> The restriction (2) looks to be satisfied by (at least)
> nilfs_segctor_notify() that nilfs_segctor_construct() calls or
> nilfs_transaction_unlock() that nilfs_construct_dsync_segment() calls.

I agree with both points.

>> I would be happy to provide another version of the patch with
>> set_nilfs_flushed(nilfs) and smp_mb__after_atomic() if you prefer that
>> version over the test_and_set_bit approach...
> 
> Two additional comments:
> 
> - Splitting test_and_set_bit() into test_bit() and set_bit() can
>   introduce a race condition.  Two processors can call test_bit() at
>   the same time and both can call set_bit() and blkdev_issue_flush().
>   But, this race is not critical.  It only allows duplicate
>   blkdev_issue_flush() calls in the rare case, and I think it's
>   ignorable.

I agree.
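
Spelled out (illustration only, these fragments are not code from either
version of the patch), the two variants compare like this:

/* Variant A: atomic test-and-set -- at most one CPU issues the flush. */
if (!test_and_set_bit(THE_NILFS_FLUSHED, &nilfs->ns_flags))
        err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);

/* Variant B: split test and set -- two CPUs may both observe the bit as
 * clear and both issue a flush; the duplicate flush is wasteful but not
 * incorrect, which is why the race is ignorable. */
if (!test_bit(THE_NILFS_FLUSHED, &nilfs->ns_flags)) {
        set_bit(THE_NILFS_FLUSHED, &nilfs->ns_flags);
        err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
}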

> - clear_nilfs_flushed() seems to be called more than necessary.
>   Incomplete logs that the mount time recovery of nilfs doesn't
>   salvage do not need to be flushed.  In this sense, it may be enough
>   only for logs containing a super root and those for datasync
>   nilfs_construct_dsync_segment() creates.

Yes, you are right, I will change that as well.

On the other hand, it seems to me that almost any file operation causes
a super root to be written, even with fdatasync(). If the i_mtime of the
inode has to be changed, then NILFS_I_INODE_DIRTY is set and the
fdatasync() turns into a normal sync(), which always writes a super
root. Every write() to a file updates i_mtime. I could only make
fdatasync() work as intended with mmap(), but maybe I am missing
something. So we may not save a lot of flag updates by clearing the flag
only when the log contains a super root...
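
For illustration, the sequence I have in mind is nothing more exotic than
this (ordinary POSIX calls; the comment about nilfs behaviour reflects the
observation above, not anything nilfs-specific in the code):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[4096] = "";
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        write(fd, buf, sizeof(buf));    /* updates i_mtime -> inode dirty */
        fdatasync(fd);  /* on nilfs this appears to do a full sync + super root */
        close(fd);
        return 0;
}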

> By the way, we are using atomic bit operations too much.  Even though
> {set,clear}_bit() don't imply a memory barrier, they still imply a
> lock prefix to protect the flag from other bit operations on ns_flags.
> For load and store of integer variables which are properly aligned to
> a cache line, modern processors naturally satisfy atomicity without
> additional lock operations.  I think we can replace the flag with just
> an integer variable like "int ns_flushed_device".  How do you think ?

I think that is a good idea. I will implement that right away.

br,
Andreas Rohner

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-03 Thread Andreas Rohner
On 2014-09-03 02:35, Ryusuke Konishi wrote:
> On Mon, 01 Sep 2014 21:18:30 +0200, Andreas Rohner wrote:
>> On 2014-09-01 20:43, Andreas Rohner wrote:
>>> Hi Ryusuke,
>>> On 2014-09-01 19:59, Ryusuke Konishi wrote:
>>>> On Sun, 31 Aug 2014 17:47:13 +0200, Andreas Rohner wrote:
>>>>> Under normal circumstances nilfs_sync_fs() writes out the super block,
>>>>> which causes a flush of the underlying block device. But this depends on
>>>>> the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
>>>>> last segment crosses a segment boundary. So if only a small amount of
>>>>> data is written before the call to nilfs_sync_fs(), no flush of the
>>>>> block device occurs.
>>>>>
>>>>> In the above case an additional call to blkdev_issue_flush() is needed.
>>>>> To prevent unnecessary overhead, the new flag THE_NILFS_FLUSHED is
>>>>> introduced, which is cleared whenever new logs are written and set
>>>>> whenever the block device is flushed.
>>>>>
>>>>> Signed-off-by: Andreas Rohner 
>>>>
>>>> The patch looks good to me except that I feel the use of atomic
>>>> test-and-set bitwise operations something unfavorable (though it's
>>>> logically correct).  I will try to send this to upstream as is unless
>>>> a comment comes to mind.
>>>
>>> I originally thought, that it is necessary to do it atomically to avoid
>>> a race condition, but I am not so sure about that any more. I think the
>>> only case we have to avoid is, to call set_nilfs_flushed() after
>>> blkdev_issue_flush(), because this could race with the
>>> clear_nilfs_flushed() from the segment construction. So this should also
>>> work:
>>>
>>>  +  if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
>>>  +  !nilfs_flushed(nilfs)) {
>>>  +  set_nilfs_flushed(nilfs);
>>>  +  err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
>>>  +  if (err != -EIO)
>>>  +  err = 0;
>>>  +  }
>>>  +
>>
>> On the other hand, it says in the comments to set_bit(), that it can be
>> reordered on architectures other than x86. test_and_set_bit() implies a
>> memory barrier on all architectures. But I don't think the processor
>> would reorder set_nilfs_flushed() after the external function call to
>> blkdev_issue_flush(), would it?
> 
> I believe compiler doesn't reorder set_bit() operation after an
> external function call unless it knows the content of the function and
> the function can be optimized.  But, yes, set_bit() doesn't imply
> memory barrier unlike test_and_set_bit().  As for
> blkdev_issue_flush(), it would imply memory barrier by some lock
> functions or other primitive used inside it.  (I haven't actually
> confirmed that the premise is true)

Yes blkdev_issue_flush() probably implies a memory barrier.

> On the other hand, we need explicit barrier operation like
> smp_mb__after_atomic() if a certain operation is performed after
> set_bit() and the changed bit should be visible to other processors
> before the operation.

Great suggestion. I didn't know about those functions. Do we also need a
call to smp_mb__before_atomic() before clear_nilfs_flushed(nilfs) in
segment.c?

I would be happy to provide another version of the patch with
set_nilfs_flushed(nilfs) and smp_mb__after_atomic() if you prefer that
version over the test_and_set_bit approach...

br,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-01 Thread Andreas Rohner
On 2014-09-01 20:43, Andreas Rohner wrote:
> Hi Ryusuke,
> On 2014-09-01 19:59, Ryusuke Konishi wrote:
>> On Sun, 31 Aug 2014 17:47:13 +0200, Andreas Rohner wrote:
>>> Under normal circumstances nilfs_sync_fs() writes out the super block,
>>> which causes a flush of the underlying block device. But this depends on
>>> the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
>>> last segment crosses a segment boundary. So if only a small amount of
>>> data is written before the call to nilfs_sync_fs(), no flush of the
>>> block device occurs.
>>>
>>> In the above case an additional call to blkdev_issue_flush() is needed.
>>> To prevent unnecessary overhead, the new flag THE_NILFS_FLUSHED is
>>> introduced, which is cleared whenever new logs are written and set
>>> whenever the block device is flushed.
>>>
>>> Signed-off-by: Andreas Rohner 
>>
>> The patch looks good to me except that I feel the use of atomic
>> test-and-set bitwise operations something unfavorable (though it's
>> logically correct).  I will try to send this to upstream as is unless
>> a comment comes to mind.
> 
> I originally thought, that it is necessary to do it atomically to avoid
> a race condition, but I am not so sure about that any more. I think the
> only case we have to avoid is, to call set_nilfs_flushed() after
> blkdev_issue_flush(), because this could race with the
> clear_nilfs_flushed() from the segment construction. So this should also
> work:
> 
>  +if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
>  +!nilfs_flushed(nilfs)) {
>  +  set_nilfs_flushed(nilfs);
>  +err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
>  +if (err != -EIO)
>  +err = 0;
>  +}
>  +

On the other hand, it says in the comments to set_bit(), that it can be
reordered on architectures other than x86. test_and_set_bit() implies a
memory barrier on all architectures. But I don't think the processor
would reorder set_nilfs_flushed() after the external function call to
blkdev_issue_flush(), would it?

/**
 * set_bit - Atomically set a bit in memory
 * @nr: the bit to set
 * @addr: the address to start counting from
 *
 * This function is atomic and may not be reordered.  See __set_bit()
 * if you do not require the atomic guarantees.
 *
 * Note: there are no guarantees that this function will not be reordered
 * on non x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.
 */

br,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-09-01 Thread Andreas Rohner
Hi Ryusuke,
On 2014-09-01 19:59, Ryusuke Konishi wrote:
> On Sun, 31 Aug 2014 17:47:13 +0200, Andreas Rohner wrote:
>> Under normal circumstances nilfs_sync_fs() writes out the super block,
>> which causes a flush of the underlying block device. But this depends on
>> the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
>> last segment crosses a segment boundary. So if only a small amount of
>> data is written before the call to nilfs_sync_fs(), no flush of the
>> block device occurs.
>>
>> In the above case an additional call to blkdev_issue_flush() is needed.
>> To prevent unnecessary overhead, the new flag THE_NILFS_FLUSHED is
>> introduced, which is cleared whenever new logs are written and set
>> whenever the block device is flushed.
>>
>> Signed-off-by: Andreas Rohner 
> 
> The patch looks good to me except that I feel the use of atomic
> test-and-set bitwise operations something unfavorable (though it's
> logically correct).  I will try to send this to upstream as is unless
> a comment comes to mind.

I originally thought that it is necessary to do it atomically to avoid
a race condition, but I am not so sure about that any more. I think the
only case we have to avoid is calling set_nilfs_flushed() after
blkdev_issue_flush(), because that could race with the
clear_nilfs_flushed() from segment construction. So this should also
work:

 +  if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
 +  !nilfs_flushed(nilfs)) {
 +  set_nilfs_flushed(nilfs);
 +  err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
 +  if (err != -EIO)
 +  err = 0;
 +  }
 +

What do you think?

br,
Andreas Rohner

> Thanks,
> Ryusuke Konishi

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-08-31 Thread Andreas Rohner
Hi,

I have looked a bit more into the semantics of the various flags
concerning block device caching behaviour. According to
"Documentation/block/writeback_cache_control.txt" a call to
blkdev_issue_flush() is equivalent to an empty bio with the
REQ_FLUSH flag set. So there is no need to call blkdev_issue_flush()
after a call to nilfs_commit_super(). But if there is no need to write
the super block an additional call to blkdev_issue_flush() is necessary.

To avoid unnecessary overhead I introduced the THE_NILFS_FLUSHED flag, which is
cleared whenever new logs are written and set whenever the block device
is flushed. If the super block was written during segment construction
or in nilfs_sync_fs(), then blkdev_issue_flush() is not called.

I am pretty sure that there are no race conditions, but someone should
double-check that before merging.

br,
Andreas Rohner

v1->v2
 * Add new flag THE_NILFS_FLUSHED

Andreas Rohner (1):
  nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

 fs/nilfs2/file.c  | 3 ++-
 fs/nilfs2/ioctl.c | 3 ++-
 fs/nilfs2/segment.c   | 2 ++
 fs/nilfs2/super.c | 8 ++++++++
 fs/nilfs2/the_nilfs.h | 6 ++++++
 5 files changed, 20 insertions(+), 2 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-08-31 Thread Andreas Rohner
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends on
the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.

In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag THE_NILFS_FLUSHED is
introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/file.c  | 3 ++-
 fs/nilfs2/ioctl.c | 3 ++-
 fs/nilfs2/segment.c   | 2 ++
 fs/nilfs2/super.c | 8 ++++++++
 fs/nilfs2/the_nilfs.h | 6 ++++++
 5 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2497815..7857460 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -56,7 +56,8 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
mutex_unlock(&inode->i_mutex);
 
nilfs = inode->i_sb->s_fs_info;
-   if (!err && nilfs_test_opt(nilfs, BARRIER)) {
+   if (!err && nilfs_test_opt(nilfs, BARRIER) &&
+   !test_and_set_nilfs_flushed(nilfs)) {
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (err != -EIO)
err = 0;
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..dc5d101 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1022,7 +1022,8 @@ static int nilfs_ioctl_sync(struct inode *inode, struct 
file *filp,
return ret;
 
nilfs = inode->i_sb->s_fs_info;
-   if (nilfs_test_opt(nilfs, BARRIER)) {
+   if (nilfs_test_opt(nilfs, BARRIER) &&
+   !test_and_set_nilfs_flushed(nilfs)) {
ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (ret == -EIO)
return ret;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index a1a1916..54a6be1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1842,6 +1842,8 @@ static void nilfs_segctor_complete_write(struct 
nilfs_sc_info *sci)
nilfs_segctor_clear_metadata_dirty(sci);
} else
clear_bit(NILFS_SC_SUPER_ROOT, &sci->sc_flags);
+
+   clear_nilfs_flushed(nilfs);
 }
 
 static int nilfs_segctor_wait(struct nilfs_sc_info *sci)
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..332fdf0 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -310,6 +310,7 @@ int nilfs_commit_super(struct super_block *sb, int flag)
nilfs->ns_sbsize));
}
clear_nilfs_sb_dirty(nilfs);
+   set_nilfs_flushed(nilfs);
return nilfs_sync_super(sb, flag);
 }
 
@@ -514,6 +515,13 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);
 
+   if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+   !test_and_set_nilfs_flushed(nilfs)) {
+   err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+   if (err != -EIO)
+   err = 0;
+   }
+
return err;
 }
 
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index d01ead1..d12a8ce 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -41,6 +41,7 @@ enum {
THE_NILFS_DISCONTINUED, /* 'next' pointer chain has broken */
THE_NILFS_GC_RUNNING,   /* gc process is running */
THE_NILFS_SB_DIRTY, /* super block is dirty */
+   THE_NILFS_FLUSHED,  /* volatile data was flushed to disk */
 };
 
 /**
@@ -202,6 +203,10 @@ struct the_nilfs {
 };
 
 #define THE_NILFS_FNS(bit, name)   \
+static inline int test_and_set_nilfs_##name(struct the_nilfs *nilfs)   \
+{  \
+   return test_and_set_bit(THE_NILFS_##bit, &(nilfs)->ns_flags);   \
+}  \
 static inline void set_nilfs_##name(struct the_nilfs *nilfs)   \
 {  \
set_bit(THE_NILFS_##bit, &(nilfs)->ns_flags);   \
@@ -219,6 +224,7 @@ THE_NILFS_FNS(INIT, init)
 THE_NILFS_FNS(DISCONTINUED, discontinued)
 THE_NILFS_FNS(GC_RUNNING, gc_running)
 THE_NILFS_FNS(SB_DIRTY, sb_dirty)
+THE_NILFS_FNS(FLUSHED, flushed)
 
 /*
  * Mount option operations
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-08-27 Thread Andreas Rohner
Hi Ryusuke,

On 2014-08-27 02:29, Ryusuke Konishi wrote:
>> Looking over the code I noticed, that nilfs_sync_file() gets called by 
>> the fsync() syscall and it basically constructs a new partial segment 
>> and calls blkdev_issue_flush().
>>
>> nilfs_ioctl_sync(), which is used by the cleaner, also essentially works 
>> the same way. It writes all the dirty files to disk and calls 
>> blkdev_issue_flush().
>>
>> nilfs_sync_fs() also writes out all the dirty files, but there is no 
>> blkdev_issue_flush() at the end. At first I thought, that 
>> nilfs_commit_super() may flush the block device anyway and therefore no 
>> additional flush is necessary, but nilfs_sb_dirty(nilfs) is only set to 
>> true if a new segment is started. So in the following scenario data 
>> could be lost despite a call to sync():
>>
>> 1. Write out less data than a full segment
>> 2. Call sync()
>> 3. nilfs_sb_dirty() is false and nilfs_commit_super() is NOT called
>> 4. Cut power to the device
>> 5. Data loss
>>
>> As I stated above, I am not sure if this is really necessary. Maybe I 
>> have overlooked something obvious.
> 
> Your indication is right, we have a data integration issue for the
> "nilfs_sb_dirty() is false" case in nilfs_sync_fs().
> 
> But, I rather would mitigate the cache flush overhead keeping data
> integerity instead of simply adding the third blkdev_issue_flush()
> call.
> 
> We don't have to call blkdev_issue_flush() if the last log was
> written synchronously with a cache flush operation.
> (this logic looks to be feasible by adding a flag.)
>
> Also, the cache flush is not needed after writing super block; a disk
> cache flush is needed BEFORE writing a super block to ensure that the
> super block is pointing to a valid log, but a succeeding flush
> operation is not needed because the pointer information is recoverable
> with mount time recovery.

Yes that's true.

> nilfs_sync_super() uses both FLUSH/FUA options for writing the primary
> super block and the FUA option may be superfluous in that sense.
> (we need to understand the precise semantics)

I have looked into that. According to
"Documentation/block/writeback_cache_control.txt", the FLUSH option makes
sure that all previous write requests are in non-volatile storage, and
the FUA option makes sure that the current request only completes
successfully once it has reached non-volatile storage. Since it doesn't
matter if the write to the super block is lost, because it can be
recovered, the FUA option is not strictly necessary. But the name of the
function nilfs_sync_super() suggests that it guarantees the super block
is on non-volatile storage, so I don't know if we should remove the FUA
flag.

This also means that if nilfs_sync_super() was called, no additional
blkdev_issue_flush() is necessary. So if the super block was written
during segment construction, there is also no need for an additional
blkdev_issue_flush().
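
To summarize the two guarantees (the flag names come from
writeback_cache_control.txt; this is just an annotated sketch, not the
actual nilfs_sync_super() code, and annotate_sb_write() is a made-up name):

#include <linux/bio.h>

/* Annotated sketch of the two cache-control flags discussed above. */
static void annotate_sb_write(struct bio *bio)
{
        /* REQ_FLUSH: all writes completed before this request are on
         * non-volatile storage when it starts -- this is what keeps the
         * super block from pointing at logs still in the disk cache. */
        bio->bi_rw |= REQ_FLUSH;

        /* REQ_FUA: this request itself is on non-volatile storage when it
         * completes -- dispensable for the super block, since a lost super
         * block write is repaired by mount-time recovery. */
        bio->bi_rw |= REQ_FUA;
}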

> Can you improve the patch considering these view points ?

Yes I will work on it.

br,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-08-25 Thread Andreas Rohner
This patch adds a call to blkdev_issue_flush() to nilfs_sync_fs(), which
is the nilfs implementation of the sync() syscall. If the BARRIER
mount option is set, both the nilfs implementation of fsync() and nilfs'
custom ioctl version of sync() used by the cleaner, use
blkdev_issue_flush() to guarantee that the data is written to the
underlying device. To get the same behaviour and guarantees for the
sync() syscall, blkdev_issue_flush() should also be called in
nilfs_sync_fs().

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..1f21e81 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -514,6 +514,12 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);
 
+   if (wait && !err && nilfs_test_opt(nilfs, BARRIER)) {
+   err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+   if (err != -EIO)
+   err = 0;
+   }
+
return err;
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/1] add missing blkdev_issue_flush() to nilfs_sync_fs()

2014-08-25 Thread Andreas Rohner
Hi,

I do not know if this patch is really necessary. I am not an expert in
how the BARRIER flag should be interpreted, so this patch is more of a
question than a fix for a known bug.

Looking over the code I noticed that nilfs_sync_file() is called by the
fsync() syscall; it basically constructs a new partial segment and calls
blkdev_issue_flush().

nilfs_ioctl_sync(), which is used by the cleaner, also essentially works 
the same way. It writes all the dirty files to disk and calls 
blkdev_issue_flush().

nilfs_sync_fs() also writes out all the dirty files, but there is no
blkdev_issue_flush() at the end. At first I thought that
nilfs_commit_super() might flush the block device anyway, so that no
additional flush would be necessary, but nilfs_sb_dirty(nilfs) is only
true if a new segment was started. So in the following scenario data
could be lost despite a call to sync():

1. Write out less data than a full segment
2. Call sync()
3. nilfs_sb_dirty() is false and nilfs_commit_super() is NOT called
4. Cut power to the device
5. Data loss

As I stated above, I am not sure if this is really necessary. Maybe I 
have overlooked something obvious.

br,
Andreas Rohner


Andreas Rohner (1):
  nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()

 fs/nilfs2/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6] nilfs2: add counting of live blocks for blocks that are overwritten

2014-03-18 Thread Andreas Rohner
On 2014-03-18 12:50, Vyacheslav Dubeyko wrote:
> On Sun, 2014-03-16 at 11:47 +0100, Andreas Rohner wrote:
> 
>> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
>> index 7adb15d..e7b19c40 100644
>> --- a/fs/nilfs2/dat.c
>> +++ b/fs/nilfs2/dat.c
>> @@ -445,6 +445,64 @@ int nilfs_dat_clean_snapshot_flag(struct inode *dat, 
>> __u64 vblocknr)
>>  }
>>  
>>  /**
>> + * nilfs_dat_is_live - checks if the virtual block number is alive
> 
> What about nilfs_dat_block_is_alive?

Yes sounds good.

>> + * @dat: DAT file inode
>> + * @vblocknr: virtual block number
>> + *
>> + * Description: nilfs_dat_is_live() looks up the DAT entry for @vblocknr and
>> + * determines if the corresponding block is alive or not. This check ignores
>> + * snapshots and protection periods.
>> + *
>> + * Return Value: 1 if vblocknr is alive and 0 otherwise. On error, one
>> + * of the following negative error codes is returned.
> 
> It is really bad idea to mess error codes and info return, from my point
> of view. Usually, it results in very buggy code in the place of call.
> Actually, you use binary nature of returned value.
> 
> I think that it needs to rework ideology of this function. Maybe, it
> needs to return bool and to return error value as argument.

Yes that is true.
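
Something like the following wrapper is probably what you have in mind
(sketch only; the name follows the nilfs_dat_block_is_alive suggestion above,
and errp is a hypothetical out-parameter):

/* Sketch of the suggested interface: boolean result, error reported
 * through an out-parameter instead of being mixed into the return value. */
static bool nilfs_dat_block_is_alive(struct inode *dat, __u64 vblocknr,
                                     int *errp)
{
        int ret = nilfs_dat_is_live(dat, vblocknr);

        if (ret < 0) {
                *errp = ret;
                return false;
        }
        *errp = 0;
        return ret != 0;
}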

>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + *
>> + * %-ENOENT - A block number associated with @vblocknr does not exist.
>> + */
>> +int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr)
>> +{
>> +struct buffer_head *entry_bh, *bh;
>> +struct nilfs_dat_entry *entry;
>> +sector_t blocknr;
>> +void *kaddr;
>> +int ret;
>> +
>> +ret = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
>> +if (ret < 0)
>> +return ret;
>> +
>> +if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
>> +bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
>> +if (bh) {
>> +WARN_ON(!buffer_uptodate(bh));
>> +brelse(entry_bh);
>> +entry_bh = bh;
>> +}
>> +}
>> +
>> +kaddr = kmap_atomic(entry_bh->b_page);
>> +entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
>> +blocknr = le64_to_cpu(entry->de_blocknr);
>> +if (blocknr == 0) {
> 
> I suppose that zero is specially named constant?

I copied that code from nilfs_dat_translate(). So it is not my fault
that there isn't a properly named constant ;)

>> +ret = -ENOENT;
>> +goto out;
>> +}
>> +
>> +
>> +if (entry->de_end == cpu_to_le64(NILFS_CNO_MAX))
>> +ret = 1;
>> +else
>> +ret = 0;
>> +out:
>> +kunmap_atomic(kaddr);
>> +brelse(entry_bh);
>> +return ret;
>> +}
>> +
>> +/**
>>   * nilfs_dat_translate - translate a virtual block number to a block number
>>   * @dat: DAT file inode
>>   * @vblocknr: virtual block number
>> diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
>> index a528024..51d44c0 100644
>> --- a/fs/nilfs2/dat.h
>> +++ b/fs/nilfs2/dat.h
>> @@ -31,6 +31,7 @@
>>  struct nilfs_palloc_req;
>>  
>>  int nilfs_dat_translate(struct inode *, __u64, sector_t *);
>> +int nilfs_dat_is_live(struct inode *, __u64);
>>  
>>  int nilfs_dat_prepare_alloc(struct inode *, struct nilfs_palloc_req *);
>>  void nilfs_dat_commit_alloc(struct inode *, struct nilfs_palloc_req *);
>> diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
>> index b9c5726..c32b896 100644
>> --- a/fs/nilfs2/inode.c
>> +++ b/fs/nilfs2/inode.c
>> @@ -86,6 +86,8 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff,
>>  int err = 0, ret;
>>  unsigned maxblocks = bh_result->b_size >> inode->i_blkbits;
>>  
>> +bh_result->b_blocknr = 0;
>> +
>>  down_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
>>  ret = nilfs_bmap_lookup_contig(ii->i_bmap, blkoff, &blknum, maxblocks);
>>  up_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index 0b62bf4..3603394 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -612,6 +612,12 @@ static int nilfs_ioctl_move_inode_block(struct inode 
>> *inode,
>>  brelse(bh);
>>  return -EEXIST;
>>  }
>&g

Re: [PATCH 4/6] nilfs2: add ioctl() to clean snapshot flags from dat entries

2014-03-18 Thread Andreas Rohner
On 2014-03-18 08:10, Vyacheslav Dubeyko wrote:
> On Mon, 2014-03-17 at 14:49 +0100, Andreas Rohner wrote:
> 
>>>
>>>>   */
>>>>  struct nilfs_vdesc {
>>>>__u64 vd_ino;
>>>> @@ -873,9 +873,55 @@ struct nilfs_vdesc {
>>>>__u64 vd_blocknr;
>>>>__u64 vd_offset;
>>>>__u32 vd_flags;
>>>> -  __u32 vd_pad;
>>>> +  /* vd_flags2 needed because of backwards compatibility */
>>>
>>> Completely, misunderstand comment. Usually, it keeps old fields for
>>> backward compatibility. But this flag is new.
>>
>> I will rewrite the comment. I need vd_flags2 because I can't use
>> vd_flags because of backwards compatibility.
>>
>>>> +  __u32 vd_flags2;
> 
> What about vd_blk_state instead of vd_flags2?

Yes sounds good to me.

>>>>  };
>>>>  
>>>> +/* vdesc flags */
>>>
>>> To be honest, I misunderstand why such number of flags and why namely
>>> such flags? Comments are really necessary.
>>>
>>>> +enum {
>>>> +  NILFS_VDESC_DATA,
>>>> +  NILFS_VDESC_NODE,
>>>> +  /* ... */
>>>
>>> What does it mean?
>>
>> NILFS_VDESC_DATA = 0 and NILFS_VDESC_NODE = 1. This represents the type
>> of block. These two already existed, in the previous version, but they
>> were not explicit. See "[Patch 4/4] nilfs-utils: add extra flags to
>> nilfs_vdesc and update sui_nblocks":
>>
>> @@ -148,17 +149,19 @@ static int nilfs_acc_blocks_file(struct nilfs_file
>> *file,
>> -vdesc->vd_flags = 0;/* data */
>> +nilfs_vdesc_set_data(vdesc);
>>  } else {
>>  vdesc->vd_vblocknr =
>>  le64_to_cpu(*(__le64 *)blk.b_binfo);
>> -vdesc->vd_flags = 1;/* node */
>> +nilfs_vdesc_set_node(vdesc);
>>  }
>>
>>>> +};
>>>> +enum {
>>>> +  NILFS_VDESC_SNAPSHOT,
>>>> +  __NR_NILFS_VDESC_FIELDS,
>>>> +  /* ... */
>>>
>>> What does it mean?
> 
> I asked here about strange comment. What does it mean?

Sorry for the misunderstanding. I copied the comment from other flags like:

enum {
NILFS_SEGMENT_USAGE_ACTIVE,
NILFS_SEGMENT_USAGE_DIRTY,
NILFS_SEGMENT_USAGE_ERROR,

/* ... */
};

I guess it means "additional flags come here".

But you are right, it is confusing; it should rather be like this:

enum {
    NILFS_VDESC_SNAPSHOT,
NILFS_VDESC_PROTECTION_PERIOD,

/* ... */

__NR_NILFS_VDESC_FIELDS,
};

> Moreover, I slightly confused by NILFS_VDESC_SNAPSHOT. Is it bit-based
> flag? I mean NILFS_VDESC_SNAPSHOT = (1 << 0). Or am I incorrect?

Yes NILFS_VDESC_SNAPSHOT and NILFS_VDESC_PROTECTION_PERIOD are
bit-based. NILFS_VDESC_DATA and NILFS_VDESC_NODE are not bit-based
because of backwards compatibility.

Please also note, that [PATCH 5/6] adds another flag, namely
NILFS_VDESC_PROTECTION_PERIOD.
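
To make the split concrete, this is how I picture the fields being used
(illustration only, not part of the posted patch; in the patch the helper
functions nilfs_vdesc_set_data()/nilfs_vdesc_set_node() are used instead of
direct assignments):

/* Illustration only: value-based vs. bit-based fields of nilfs_vdesc. */
static void vdesc_flags_example(struct nilfs_vdesc *vdesc)
{
        /* vd_flags keeps its historical value semantics: 0 = data, 1 = node */
        vdesc->vd_flags = NILFS_VDESC_NODE;

        /* vd_flags2 is bit-based, so several reasons can be recorded at once */
        vdesc->vd_flags2 |= 1U << NILFS_VDESC_SNAPSHOT;
        vdesc->vd_flags2 |= 1U << NILFS_VDESC_PROTECTION_PERIOD;

        if (vdesc->vd_flags2 & (1U << NILFS_VDESC_SNAPSHOT))
                return; /* block is reclaimable but protected by a snapshot */
}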

Best regards,
Andreas Rohner

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] nilfs2: add ioctl() to clean snapshot flags from dat entries

2014-03-17 Thread Andreas Rohner
On 2014-03-17 14:19, Vyacheslav Dubeyko wrote:
> On Sun, 2014-03-16 at 11:47 +0100, Andreas Rohner wrote:
>> This patch introduces new flags for nilfs_vdesc to indicate the reason a
>> block is alive. So if the block would be reclaimable, but must be
>> treated as if it were alive, because it is part of a snapshot, then the
>> snapshot flag is set.
>>
> 
> I suppose that I don't quite follow your idea. As far as I can judge,
> every block in DAT file has: (1) de_start: start checkpoint number; (2)
> de_end: end checkpoint number. So, while one of checkpoint number is
> snapshot number then we know that this block lives in snapshot. Am I
> correct? Why do we need in special flags?

Yes, but a snapshot can also lie between de_start and de_end. To check
that, you would have to get a list of all snapshots and see whether one
of them falls within the range de_start to de_end. The userspace tools
already do this. The flags in nilfs_vdesc are there so that I don't have
to check it again in the kernel.
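
In pseudo-C the check done at snapshot deletion time amounts to this
(sketch only; ss_prev/ss_next are the neighbours of the snapshot being
removed in the sorted snapshot list, and the boundary convention is
simplified):

/* Sketch: does the block stay protected after removing snapshot "ss"? */
static int block_still_protected(__u64 de_start, __u64 de_end,
                                 __u64 ss_prev, __u64 ss_next)
{
        if (ss_prev >= de_start && ss_prev < de_end)
                return 1;       /* previous snapshot still covers the block */
        if (ss_next >= de_start && ss_next < de_end)
                return 1;       /* next snapshot still covers the block */
        return 0;               /* no other snapshot in range -> reclaimable */
}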

>> Additionally a new ioctl() is added, which enables the userspace GC to
>> perform a cleanup operation after setting the number of blocks with
>> NILFS_IOCTL_SET_SUINFO. It sets DAT entries with de_ss values of
>> NILFS_CNO_MAX to 0. NILFS_CNO_MAX indicates, that the corresponding
>> block belongs to some snapshot, but was already decremented by a
>> previous deletion operation. If the segment usage info is changed with
>> NILFS_IOCTL_SET_SUINFO and the number of blocks is updated, then these
>> blocks would never be decremented and there are scenarios where the
>> corresponding segments would starve (never be cleaned). To prevent that
>> they must be reset to 0.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/dat.c   |  63 
>>  fs/nilfs2/dat.h   |   1 +
>>  fs/nilfs2/ioctl.c | 103 
>> +-
>>  include/linux/nilfs2_fs.h |  52 ++-
>>  4 files changed, 216 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
>> index 89a4a5f..7adb15d 100644
>> --- a/fs/nilfs2/dat.c
>> +++ b/fs/nilfs2/dat.c
>> @@ -382,6 +382,69 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, 
>> sector_t blocknr)
>>  }
>>  
>>  /**
>> + * nilfs_dat_clean_snapshot_flag - check flags used by snapshots
>> + * @dat: DAT file inode
>> + * @vblocknr: virtual block number
>> + *
>> + * Description: nilfs_dat_clean_snapshot_flag() changes the flags from
>> + * NILFS_CNO_MAX to 0 if necessary, so that segment usage is accurately
>> + * counted. NILFS_CNO_MAX indicates, that the corresponding block belongs
>> + * to some snapshot, but was already decremented. If the segment usage info
>> + * is changed with NILFS_IOCTL_SET_SUINFO and the number of blocks is 
>> updated,
>> + * then these blocks would never be decremented and there are scenarios 
>> where
>> + * the corresponding segments would starve (never be cleaned).
>> + *
>> + * Return Value: On success, 0 is returned. On error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_dat_clean_snapshot_flag(struct inode *dat, __u64 vblocknr)
> 
> Sounds likewise we clear flag. It can be confusing name.

Yes it is hard to get a good name for that function. It has nothing to
do with the nilfs_vdesc flags.

>> +{
>> +struct buffer_head *entry_bh;
>> +struct nilfs_dat_entry *entry;
>> +void *kaddr;
>> +int ret;
>> +
>> +ret = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
>> +if (ret < 0)
>> +return ret;
>> +
>> +/*
>> + * The given disk block number (blocknr) is not yet written to
>> + * the device at this point.
>> + *
>> + * To prevent nilfs_dat_translate() from returning the
>> + * uncommitted block number, this makes a copy of the entry
>> + * buffer and redirects nilfs_dat_translate() to the copy.
>> + */
>> +if (!buffer_nilfs_redirected(entry_bh)) {
>> +ret = nilfs_mdt_freeze_buffer(dat, entry_bh);
>> +if (ret) {
>> +brelse(entry_bh);
>> +return ret;
>> +}
>> +}
>> +
>> +kaddr = kmap_atomic(entry_bh->b_page);
>> +entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
>> +if (entry-&g

Re: [PATCH 3/6] nilfs2: scan dat entries at snapshot creation/deletion time

2014-03-17 Thread Andreas Rohner
On 2014-03-17 08:04, Vyacheslav Dubeyko wrote:
> On Sun, 2014-03-16 at 11:47 +0100, Andreas Rohner wrote:
>> To accurately count the number of live blocks in a segment, it is
>> important to take snapshots into account, because snapshots can protect
>> reclaimable blocks from being cleaned.
>>
>> This patch uses the previously reserved de_rsv field of the
>> nilfs_dat_entry struct to store one of the snapshots the corresponding
>> block belongs to. One block can belong to many snapshots, but because
>> the snapshots are stored in a sorted linked list, it is easy to check if
>> a block belongs to any other snapshot given the previous and the next
>> snapshot. For example if the current snapshot (in de_ss) is being
>> removed and neither the previous nor the next snapshot is in the range
>> of de_start to de_end, then it is guaranteed that the block doesn't
>> belong to any other snapshot and is reclaimable. On the other hand if
>> lets say the previous snapshot is in the range of de_start to de_end, we
>> simply set de_ss to the previous snapshot and the block is not
>> reclaimable.
>>
>> To implement this every DAT entry is scanned at snapshot
>> creation/deletion time and updated if needed. 
> 
> It is well known problem of NILFS2 that deletion is very slow operation
> for big files because of necessity to update DAT file (de_end: end
> checkpoint number). So, how your addition does affect this disadvantage?

In addition to setting "de_end: end checkpoint number", the live block
counter in the SUFILE needs to be decremented. This makes the deletion a
little bit more expensive, but it's not really noticeable, because the
SUFILE entries are mostly in the cache. I have timed the deletion of
100 GB and there is no discernible difference in performance.

But my additions make snapshot creation and deletion more expensive.

>> To avoid too many update
>> operations only potentially reclaimable blocks are ever updated. For
>> example if there are some deleted files and the checkpoint to which
>> these files belong is turned into a snapshot, then su_nblocks is
>> incremented for these blocks, which reverses the decrement that happened
>> when the files were deleted. If after some time this snapshot is
>> deleted, su_nblocks is decremented again to reverse the increment at
>> creation time.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/cpfile.c|  7 
>>  fs/nilfs2/dat.c   | 86 
>> +++
>>  fs/nilfs2/dat.h   | 26 ++
>>  include/linux/nilfs2_fs.h |  4 +--
>>  4 files changed, 121 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
>> index 0d58075..29952f5 100644
>> --- a/fs/nilfs2/cpfile.c
>> +++ b/fs/nilfs2/cpfile.c
>> @@ -28,6 +28,7 @@
>>  #include 
>>  #include "mdt.h"
>>  #include "cpfile.h"
>> +#include "sufile.h"
>>  
>>
>>  static inline unsigned long
>> @@ -584,6 +585,7 @@ static int nilfs_cpfile_set_snapshot(struct inode 
>> *cpfile, __u64 cno)
>>  struct nilfs_cpfile_header *header;
>>  struct nilfs_checkpoint *cp;
>>  struct nilfs_snapshot_list *list;
>> +struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
>>  __u64 curr, prev;
>>  unsigned long curr_blkoff, prev_blkoff;
>>  void *kaddr;
>> @@ -681,6 +683,8 @@ static int nilfs_cpfile_set_snapshot(struct inode 
>> *cpfile, __u64 cno)
>>  mark_buffer_dirty(header_bh);
>>  nilfs_mdt_mark_dirty(cpfile);
>>  
>> +nilfs_dat_scan_inc_ss(nilfs->ns_dat, cno);
>> +
>>  brelse(prev_bh);
>>  
>>   out_curr:
>> @@ -703,6 +707,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
>> *cpfile, __u64 cno)
>>  struct nilfs_cpfile_header *header;
>>  struct nilfs_checkpoint *cp;
>>  struct nilfs_snapshot_list *list;
>> +struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
>>  __u64 next, prev;
>>  void *kaddr;
>>  int ret;
>> @@ -784,6 +789,8 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
>> *cpfile, __u64 cno)
>>  mark_buffer_dirty(header_bh);
>>  nilfs_mdt_mark_dirty(cpfile);
>>  
>> +nilfs_dat_scan_dec_ss(nilfs->ns_dat, cno, prev, next);
>> +
>>  brelse(prev_bh);
>>  
>>   out_next:
>> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
>> index 0d5fada..89a4a5f 100644
>> --- a/fs/nilfs2/dat.c
>> +++ b/fs/nilfs2/dat.c
>> @@ -

Re: [PATCH 1/6] nilfs2: add helper function to go through all entries of meta data file

2014-03-17 Thread Andreas Rohner
On 2014-03-17 07:51, Vyacheslav Dubeyko wrote:
> On Sun, 2014-03-16 at 11:47 +0100, Andreas Rohner wrote:
>> This patch introduces the nilfs_palloc_scan_entries() function,
>> which takes an inode of one of nilfs' meta data files and iterates
>> through all of its entries. For each entry the callback function
>> pointer that is given as a parameter is called. The data parameter
>> is passed to the callback function, so that it may receive
>> parameters and return results.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/alloc.c | 121 
>> ++
>>  fs/nilfs2/alloc.h |   6 +++
>>  2 files changed, 127 insertions(+)
>>
>> diff --git a/fs/nilfs2/alloc.c b/fs/nilfs2/alloc.c
>> index 741fd02..0edd85a 100644
>> --- a/fs/nilfs2/alloc.c
>> +++ b/fs/nilfs2/alloc.c
>> @@ -545,6 +545,127 @@ int nilfs_palloc_prepare_alloc_entry(struct inode 
>> *inode,
>>  }
>>  
>>  /**
>> + * nilfs_palloc_scan_entries - scan through every entry and execute dofunc
>> + * @inode: inode of metadata file using this allocator
>> + * @dofunc: function executed for every entry
>> + * @data: data pointer passed to dofunc
>> + *
>> + * Description: nilfs_palloc_scan_entries() walks through every allocated 
>> entry
>> + * of a metadata file and executes dofunc on it. It passes a data pointer to
>> + * dofunc, which can be used as an input parameter or for returning of 
>> results.
>> + *
>> + * Return Value: On success, 0 is returned. On error, a
>> + * negative error code is returned.
>> + */
>> +int nilfs_palloc_scan_entries(struct inode *inode,
>> +  void (*dofunc)(struct inode *,
>> + struct nilfs_palloc_req *,
>> + void *),
>> +  void *data)
>> +{
>> +struct buffer_head *desc_bh, *bitmap_bh;
>> +struct nilfs_palloc_group_desc *desc;
>> +struct nilfs_palloc_req req;
>> +unsigned char *bitmap;
>> +void *desc_kaddr, *bitmap_kaddr;
>> +unsigned long group, maxgroup, ngroups;
>> +unsigned long n, m, entries_per_group, groups_per_desc_block;
>> +unsigned long i, j, pos;
>> +unsigned long blkoff, prev_blkoff;
>> +int ret;
>> +
> 
> I think that it really makes sense to split this function's code between
> several small functions. It improves code style and readability of
> function. Moreover, it makes function more easy understandable.

Ok I could move one of the inner for-loops into a separate function.

>> +ngroups = nilfs_palloc_groups_count(inode);
>> +maxgroup = ngroups - 1;
>> +entries_per_group = nilfs_palloc_entries_per_group(inode);
>> +groups_per_desc_block = nilfs_palloc_groups_per_desc_block(inode);
>> +
>> +for (group = 0; group < ngroups;) {
>> +ret = nilfs_palloc_get_desc_block(inode, group, 0, &desc_bh);
>> +if (ret == -ENOENT)
> 
> I suggest to add comment here.

Ok.

-ENOENT basically means that the descriptor block is not allocated
yet, which is not an error. ngroups is a very large constant that does
not reflect the actual number of groups, but rather the maximum possible
number of groups. So the only way to tell that we have passed the last
group is the -ENOENT error.
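
Concretely, the comment could sit right at the check, e.g. (sketch of the
annotated loop head; the loop body stays exactly as in the patch above):

        for (group = 0; group < ngroups;) {
                ret = nilfs_palloc_get_desc_block(inode, group, 0, &desc_bh);
                /*
                 * -ENOENT is not an error here: the descriptor block has
                 * never been allocated.  ngroups is only the theoretical
                 * maximum number of groups, so an unallocated descriptor
                 * block marks the end of the scan.
                 */
                if (ret == -ENOENT)
                        return 0;
                else if (ret < 0)
                        return ret;
                /* ... rest of the loop as in the patch above ... */
        }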

>> +return 0;
>> +else if (ret < 0)
>> +return ret;
>> +req.pr_desc_bh = desc_bh;
>> +desc_kaddr = kmap(desc_bh->b_page);
>> +desc = nilfs_palloc_block_get_group_desc(inode, group,
>> + desc_bh, desc_kaddr);
>> +n = nilfs_palloc_rest_groups_in_desc_block(inode, group,
>> +   maxgroup);
>> +
>> +for (i = 0; i < n; i++, desc++, group++) {
>> +m = entries_per_group -
>> +nilfs_palloc_group_desc_nfrees(inode,
>> +group, desc);
> 
> Looks weird. It makes sense to split on several functions or to use
> variable.
> 
>> +if (!m)
>> +continue;
>> +
>> +ret = nilfs_palloc_get_bitmap_block(
>> +inode, group, 0, &bitmap_bh);
> 
> Ditto. Looks weird.
> 
>> +if (ret == -ENOENT) {
>> +ret = 0;

Re: [PATCH 2/6] nilfs2: add new timestamp to seg usage and function to change su_nblocks

2014-03-16 Thread Andreas Rohner
On 2014-03-16 14:31, Ryusuke Konishi wrote:
> On Sun, 16 Mar 2014 17:06:10 +0300, Vyacheslav Dubeyko wrote:
>>
>> On Mar 16, 2014, at 3:24 PM, Andreas Rohner wrote:
>>
>>>>>
>>>>> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
>>>>> index ff3fea3..ca269ad 100644
>>>>> --- a/include/linux/nilfs2_fs.h
>>>>> +++ b/include/linux/nilfs2_fs.h
>>>>> @@ -614,11 +614,13 @@ struct nilfs_cpfile_header {
>>>>> * @su_lastmod: last modified timestamp
>>>>> * @su_nblocks: number of blocks in segment
>>>>> * @su_flags: flags
>>>>> + * @su_lastdec: last decrement of su_nblocks timestamp
>>>>> */
>>>>> struct nilfs_segment_usage {
>>>>>   __le64 su_lastmod;
>>>>>   __le32 su_nblocks;
>>>>>   __le32 su_flags;
>>>>> + __le64 su_lastdec;
>>>>
>>>> So, this change makes on-disk layout incompatible with previous one.
>>>> Am I correct? At first it needs to be fully confident that we really need 
>>>> in
>>>> changing in this place. Secondly, it needs to add incompatible flag for
>>>> s_feature_incompat field of superblock and maybe mount option.
>>>
>>> No it IS compatible. NILFS uses the entry sizes stored in the super
>>> block. Notice, that the code does not depend on sizeof(struct
>>> nilfs_suinfo) or sizeof(struct nilfs_segment_usage). So an old kernel
>>> can read a file system with su_lastdec and a new kernel can read an old
>>> file system without su_lastdec.
>>
>> But, anyway, I think that you add some new feature by this and previous
>> patches. I suppose that it makes sense to add specially dedicated flag or
>> flags in s_feature_xxx field of superblock. If feature is compatible with
>> previous state of driver then flag can be added for s_feature_compat
>> field.
>>
>> Thanks,
>> Vyacheslav Dubeyko.
> 
> This is important thing.  Please evaluate backward compatibility and
> forward compatibility of modifications, and properly add one of
> incompat, compat_ro, or compat flags as Vyacheslav mentioned.  It will
> be a focal point of early stage review.

Ok, I have to look into these flags.

I reuse su_nblocks to represent the number of live blocks, which gets
incremented and decremented as files are deleted and snapshots
created/removed. That is definitely incompatible. Is it better to set an
incompat flag, or should I define a new field like su_nliveblocks? With a
new field it could be compatible with older drivers, but it would add
another 8 bytes to the structure.

But if I understood you correctly I need to add a new feature flag in
any case.
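
For the sake of discussion, a compat_ro-style flag would presumably look
something like this (the flag name and value are made up; only the
s_feature_compat_ro superblock field itself is taken from the existing
on-disk format):

/* Hypothetical flag name/value -- illustrates the mechanism only. */
#define NILFS_FEATURE_COMPAT_RO_LIVE_BLOCK_COUNT        0x00000001ULL

static int nilfs_has_live_block_count(struct the_nilfs *nilfs)
{
        __u64 features = le64_to_cpu(nilfs->ns_sbp[0]->s_feature_compat_ro);

        return (features & NILFS_FEATURE_COMPAT_RO_LIVE_BLOCK_COUNT) != 0;
}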

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/6] nilfs2: add new timestamp to seg usage and function to change su_nblocks

2014-03-16 Thread Andreas Rohner
On 2014-03-16 14:34, Vyacheslav Dubeyko wrote:
> 
>> On 16 March 2014, at 16:24, Andreas Rohner  wrote:
>>
>>> On 2014-03-16 14:00, Vyacheslav Dubeyko wrote:
>>>
>>>> On Mar 16, 2014, at 1:47 PM, Andreas Rohner wrote:
>>>>
>>>> This patch adds an additional timestamp to the segment usage
>>>> information that indicates the last time the usage information was
>>>> changed. So su_lastmod indicates the last time the segment itself was
>>>> modified and su_lastdec indicates the last time the usage information
>>>> itself was changed.
>>>
>>> What will we have if user changes time?
>>> What sequence will we have after such "malicious" action?
>>> Did you test such situation?
>>
>> The timestamp is just a hint for the userspace GC. If the hint is wrong
>> the result would be that the GC is less efficient for a while. After a
>> while it would go back to normal. You have the same problem with the
>> already existing su_lastmod timestamp.
>>
> 
> But I worry about such thing. Previously, we had complaints of users about
> different issues with timestamp policy of GC. And I had hope that namely
> new GC policies can resolve such GC disadvantage. So, what have we again?
> The same issue of GC?

Yes but I have to compare it to the protection period, which is a
timestamp. Maybe I could use the current checkpoint number instead...

Regards,
Andreas Rohner

> Thanks,
> Vyacheslav Dubeyko.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] nilfs-utils: add cost-benefit and greedy policies

2014-03-16 Thread Andreas Rohner
On 2014-03-16 13:55, Ryusuke Konishi wrote:
> On Sun, 16 Mar 2014 11:49:16 +0100, Andreas Rohner wrote:
>> This patch implements the cost-benefit and greedy GC policies. These are
>> well known policies for log-structured file systems [1].
>>
>> * Greedy:
>>   Select the segments with the most free space.
>> * Cost-Benefit:
>>   Perform a cost-benefit analysis, whereby the free space gained is
>>   weighed against the cost of collecting the segment.
>>
>> Since especially cost-benefit needed more information than was available
>> in nilfs_suinfo, a few extra parameters were added to the policy
>> callback function prototype. The policy threshold was removed, since it
>> served no real purpose. The flag p_comparison was added to indicate how
>> the importance values should be interpreted. For example for the
>> timestamp policy smaller values mean older timestamps, which is better.
>> For greedy and cost-benefit on the other hand higher values are better.
>> nilfs_cleanerd_select_segments() was updated accordingly.
>>
>> [1] Mendel Rosenblum and John K. Ousterhout. The design and implementa-
>> tion of a log-structured file system. ACM Trans. Comput. Syst.,
>> 10(1):26–52, February 1992.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  include/nilfs2_fs.h   |   9 -
>>  sbin/cleanerd/cldconfig.c | 100 
>> +++---
>>  sbin/cleanerd/cldconfig.h |  18 +
>>  sbin/cleanerd/cleanerd.c  |  56 --
>>  4 files changed, 149 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
>> index a16ad4c..967c2af 100644
>> --- a/include/nilfs2_fs.h
>> +++ b/include/nilfs2_fs.h
>> @@ -483,7 +483,7 @@ struct nilfs_dat_entry {
>>  __le64 de_blocknr;
>>  __le64 de_start;
>>  __le64 de_end;
>> -__le64 de_rsv;
>> +__le64 de_ss;
>>  };
>>  
>>  /**
>> @@ -612,11 +612,13 @@ struct nilfs_cpfile_header {
>>   * @su_lastmod: last modified timestamp
>>   * @su_nblocks: number of blocks in segment
>>   * @su_flags: flags
>> + * @su_lastdec: last decrement of su_nblocks timestamp
>>   */
>>  struct nilfs_segment_usage {
>>  __le64 su_lastmod;
>>  __le32 su_nblocks;
>>  __le32 su_flags;
>> +__le64 su_lastdec;
>>  };
>>  
>>  /* segment usage flag */
>> @@ -659,6 +661,7 @@ nilfs_segment_usage_set_clean(struct nilfs_segment_usage 
>> *su)
>>  su->su_lastmod = cpu_to_le64(0);
>>  su->su_nblocks = cpu_to_le32(0);
>>  su->su_flags = cpu_to_le32(0);
>> +su->su_lastdec = cpu_to_le64(0);
>>  }
>>  
>>  static inline int
>> @@ -690,11 +693,13 @@ struct nilfs_sufile_header {
>>   * @sui_lastmod: timestamp of last modification
>>   * @sui_nblocks: number of written blocks in segment
>>   * @sui_flags: segment usage flags
>> + * @sui_lastdec: last decrement of sui_nblocks timestamp
>>   */
>>  struct nilfs_suinfo {
>>  __u64 sui_lastmod;
>>  __u32 sui_nblocks;
>>  __u32 sui_flags;
>> +__u64 sui_lastdec;
>>  };
>>  
>>  #define NILFS_SUINFO_FNS(flag, name)
>> \
>> @@ -732,6 +737,7 @@ enum {
>>  NILFS_SUINFO_UPDATE_LASTMOD,
>>  NILFS_SUINFO_UPDATE_NBLOCKS,
>>  NILFS_SUINFO_UPDATE_FLAGS,
>> +NILFS_SUINFO_UPDATE_LASTDEC,
>>  __NR_NILFS_SUINFO_UPDATE_FIELDS,
>>  };
>>  
>> @@ -755,6 +761,7 @@ nilfs_suinfo_update_##name(const struct 
>> nilfs_suinfo_update *sup)\
>>  NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
>>  NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
>>  NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
>> +NILFS_SUINFO_UPDATE_FNS(LASTDEC, lastdec)
>>  
>>  enum {
>>  NILFS_CHECKPOINT,
>> diff --git a/sbin/cleanerd/cldconfig.c b/sbin/cleanerd/cldconfig.c
>> index c8b197b..ade974a 100644
>> --- a/sbin/cleanerd/cldconfig.c
>> +++ b/sbin/cleanerd/cldconfig.c
>> @@ -380,7 +380,10 @@ nilfs_cldconfig_handle_clean_check_interval(struct 
>> nilfs_cldconfig *config,
>>  }
>>  
>>  static unsigned long long
>> -nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si)
>> +nilfs_cldconfig_selection_policy_timestamp(struct nilfs *nilfs,
>> +   const struct nilfs_sustat *sustat,
>> +   const struct nilfs_suinfo *si,
>> +   __u64 

Re: [PATCH 2/6] nilfs2: add new timestamp to seg usage and function to change su_nblocks

2014-03-16 Thread Andreas Rohner
On 2014-03-16 14:00, Vyacheslav Dubeyko wrote:
> 
> On Mar 16, 2014, at 1:47 PM, Andreas Rohner wrote:
> 
>> This patch adds an additional timestamp to the segment usage
>> information that indicates the last time the usage information was
>> changed. So su_lastmod indicates the last time the segment itself was
>> modified and su_lastdec indicates the last time the usage information
>> itself was changed.
>>
> 
> What will we have if the user changes the time?
> What sequence will we have after such a "malicious" action?
> Did you test such a situation?

The timestamp is just a hint for the userspace GC. If the hint is wrong
the result would be that the GC is less efficient for a while. After a
while it would go back to normal. You have the same problem with the
already existing su_lastmod timestamp.

>> This is important information for the GC, because it needs to avoid
>> selecting segments for cleaning that were created (su_lastmod) outside of
>> the protection period, but whose blocks became reclaimable (su_nblocks is
>> decremented) within the protection period. Without that information the
>> GC policy has to assume that there are reclaimable blocks, only to find
>> out that they are protected by the protection period.
>>
>> This patch also introduces nilfs_sufile_add_segment_usage(), which can
>> be used to increment or decrement the value of su_nblocks of a specific
>> segment.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>> fs/nilfs2/sufile.c| 86 
>> +--
>> fs/nilfs2/sufile.h| 18 ++
>> include/linux/nilfs2_fs.h |  7 
>> 3 files changed, 109 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 2a869c3..0886938 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -453,6 +453,8 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 
>> segnum,
>>  su->su_lastmod = cpu_to_le64(0);
>>  su->su_nblocks = cpu_to_le32(0);
>>  su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
>> +if (nilfs_sufile_lastdec_supported(sufile))
>> +su->su_lastdec = cpu_to_le64(0);
>>  kunmap_atomic(kaddr);
>>
>>  nilfs_sufile_mod_counter(header_bh, clean ? (u64)-1 : 0, dirty ? 0 : 1);
>> @@ -482,7 +484,7 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 
>> segnum,
>>  WARN_ON(!nilfs_segment_usage_dirty(su));
>>
>>  sudirty = nilfs_segment_usage_dirty(su);
>> -nilfs_segment_usage_set_clean(su);
>> +nilfs_sufile_segment_usage_set_clean(sufile, su);
>>  kunmap_atomic(kaddr);
>>  mark_buffer_dirty(su_bh);
>>
>> @@ -549,6 +551,75 @@ int nilfs_sufile_set_segment_usage(struct inode 
>> *sufile, __u64 segnum,
>> }
>>
>> /**
>> + * nilfs_sufile_add_segment_usage - decrement usage of a segment
> 
> I feel cultural dissonance about this name. Add or decrement? :)
> Decrement and add are different operations for me.

Yes, that description is wrong. Thanks for pointing that out. The long
description below is correct, though. By adding a signed value, one can
both increment and decrement.

>> + * @sufile: inode of segment usage file
>> + * @segnum: segment number
>> + * @value: value to add to su_nblocks
>> + * @dectime: current time
>> + *
>> + * Description: nilfs_sufile_add_segment_usage() adds a signed value to the
>> + * su_nblocks field of the segment usage information of @segnum. It ensures
>> + * that the result is bigger than 0 and smaller or equal to the maximum 
>> number
>> + * of blocks per segment
>> + *
>> + * Return Value: On success, 0 is returned. On error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient memory available.
>> + *
>> + * %-EIO - I/O error
>> + *
>> + * %-ENOENT - the specified block does not exist (hole block)
>> + */
>> +int nilfs_sufile_add_segment_usage(struct inode *sufile, __u64 segnum,
>> +   __s64 value, time_t dectime)
>> +{
>> +struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>> +struct buffer_head *bh;
>> +struct nilfs_segment_usage *su;
>> +void *kaddr;
>> +int ret;
>> +
>> +if (value == 0)
>> +return 0;
>> +
>> +down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
>> +ret = nilfs_sufile_get_segment_usage_block(sufile, segnum, 0, &bh);
>> +if (ret < 0)
> 
> Maybe it nee

Re: [PATCH 0/6] nilfs2: implement tracking of live blocks

2014-03-16 Thread Andreas Rohner
On 2014-03-16 13:34, Vyacheslav Dubeyko wrote:
> 
> On Mar 16, 2014, at 2:01 PM, Andreas Rohner wrote:
> 
>> On 2014-03-16 11:47, Andreas Rohner wrote:
>>> Hi,
>>>
>>> This patch set implements the tracking of live blocks in segments. This 
>>> information is crucial in implementing better GC policies, because 
>>> now the policies can make informed decisions about which segments have 
>>> the biggest number of reclaimable blocks.
>>
>> IMPORTANT:
>> I forgot to mention that the patches are based on linux-next/master,
>> because they rely on previous patches that aren't in master yet.
>>
> 
> As far as I can see, some guys mention about it via [PATCH -next 0/6], for 
> example. :)

Thanks! I didn't know that. I will keep it in mind for next time.

br,
Andreas Rohner

> With the best regards,
> Vyacheslav Dubeyko.
> 
> 



Re: [PATCH 0/6] nilfs2: implement tracking of live blocks

2014-03-16 Thread Andreas Rohner
On 2014-03-16 11:47, Andreas Rohner wrote:
> Hi,
> 
> This patch set implements the tracking of live blocks in segments. This 
> information is crucial in implementing better GC policies, because 
> now the policies can make informed decisions about which segments have 
> the biggest number of reclaimable blocks.

IMPORTANT:
I forgot to mention that the patches are based on linux-next/master,
because they rely on previous patches that aren't in master yet.

br,
Andreas Rohner



[PATCH 4/4] nilfs-utils: add extra flags to nilfs_vdesc and update sui_nblocks

2014-03-16 Thread Andreas Rohner
This patch adds extra flags to nilfs_vdesc that indicate the reason why
a particular block is considered alive. If it is because of a snapshot,
the snapshot flag is set; if it is because of the protection period, the
protection period flag is set.

This information is useful to determine the number of live blocks in a
segment. If a block is part of a snapshot, it is counted as alive; if it
is alive only because of the protection period, it is counted as
reclaimable. These flags are used both in userspace and by the kernel.

Additionally this patch adds code that calculates the correct number of
live blocks per segment if nilfs_set_suinfo() is used.
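
For illustration only (not part of this patch), a counting sketch in the
style of lib/gc.c. The helpers nilfs_vdesc_snapshot(),
nilfs_vdesc_protection_period() and nilfs_get_segnum_of_block() are the
ones introduced in this series; the counters structure and its indexing
by segment number are made up for this example:

/*
 * Sketch: classify the blocks described by a vdesc array into "alive"
 * (pinned by a snapshot) and "reclaimable" (only protected by the
 * protection period), per segment.
 */
struct live_counts {
	unsigned long alive;
	unsigned long reclaimable;
};

static void count_live_blocks(const struct nilfs *nilfs,
			      const struct nilfs_vdesc *vdescs,
			      size_t nvdescs,
			      struct live_counts *counts /* per segment */)
{
	size_t i;
	__u64 segnum;

	for (i = 0; i < nvdescs; ++i) {
		segnum = nilfs_get_segnum_of_block(nilfs,
						   vdescs[i].vd_blocknr);
		if (nilfs_vdesc_snapshot(&vdescs[i]))
			counts[segnum].alive++;
		else if (nilfs_vdesc_protection_period(&vdescs[i]))
			counts[segnum].reclaimable++;
	}
}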

Signed-off-by: Andreas Rohner 
---
 include/nilfs.h |  6 +
 include/nilfs2_fs.h | 52 +--
 lib/gc.c| 71 +++--
 3 files changed, 119 insertions(+), 10 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index bd134be..3585f6b 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -329,4 +329,10 @@ static inline __u32 nilfs_get_blocks_per_segment(const 
struct nilfs *nilfs)
return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
 }
 
+static inline __u64
+nilfs_get_segnum_of_block(const struct nilfs *nilfs, sector_t blocknr)
+{
+   return blocknr / nilfs_get_blocks_per_segment(nilfs);
+}
+
 #endif /* NILFS_H */
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index cb02739..7a060b3 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -859,7 +859,7 @@ struct nilfs_vinfo {
  * @vd_blocknr: disk block number
  * @vd_offset: logical block offset inside a file
  * @vd_flags: flags (data or node block)
- * @vd_pad: padding
+ * @vd_flags2: additional flags
  */
 struct nilfs_vdesc {
__u64 vd_ino;
@@ -869,9 +869,57 @@ struct nilfs_vdesc {
__u64 vd_blocknr;
__u64 vd_offset;
__u32 vd_flags;
-   __u32 vd_pad;
+   /* vd_flags2 needed because of backwards compatibility */
+   __u32 vd_flags2;
 };
 
+/* vdesc flags */
+enum {
+   NILFS_VDESC_DATA,
+   NILFS_VDESC_NODE,
+   /* ... */
+};
+enum {
+   NILFS_VDESC_SNAPSHOT,
+   NILFS_VDESC_PROTECTION_PERIOD,
+   __NR_NILFS_VDESC_FIELDS,
+   /* ... */
+};
+
+#define NILFS_VDESC_FNS(flag, name)\
+static inline void \
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)  \
+{  \
+   vdesc->vd_flags = NILFS_VDESC_##flag;   \
+}  \
+static inline int  \
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)\
+{  \
+   return vdesc->vd_flags == NILFS_VDESC_##flag;   \
+}
+
+#define NILFS_VDESC_FNS2(flag, name)   \
+static inline void \
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)  \
+{  \
+   vdesc->vd_flags2 |= (1UL << NILFS_VDESC_##flag);\
+}  \
+static inline void \
+nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)\
+{  \
+   vdesc->vd_flags2 &= ~(1UL << NILFS_VDESC_##flag);   \
+}  \
+static inline int  \
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)\
+{  \
+   return !!(vdesc->vd_flags2 & (1UL << NILFS_VDESC_##flag));  \
+}
+
+NILFS_VDESC_FNS(DATA, data)
+NILFS_VDESC_FNS(NODE, node)
+NILFS_VDESC_FNS2(SNAPSHOT, snapshot)
+NILFS_VDESC_FNS2(PROTECTION_PERIOD, protection_period)
+
 /**
  * struct nilfs_bdesc - descriptor of disk block number
  * @bd_ino: inode number
diff --git a/lib/gc.c b/lib/gc.c
index 2338174..2df15f7 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -128,6 +128,7 @@ static int nilfs_acc_blocks_file(struct nilfs_file *file,
return -1;
bdesc->bd_ino = ino;
bdesc->bd_oblocknr = blk.b_blocknr;
+   bdesc->bd_pad = 0;
if (nilfs_block_is_data(&blk)) {
bdesc->bd_offset =
le64_to_cpu(*(__le64 *)blk.b_binfo);
@@ -148,17 +149

[PATCH 1/4] nilfs-utils: remove reliance on sui_nblocks to read segment

2014-03-16 Thread Andreas Rohner
Since sui_nblocks is reused to represent the number of live blocks in a
segment, it can no longer be used to mark the end of the segment.
Instead the sequence numbers of the partial segments are checked: within
one segment they should all be the same. The usual CRC checks should be
enough to reliably determine the end of a segment.

Signed-off-by: Andreas Rohner 
---
 include/nilfs.h | 1 +
 lib/gc.c| 5 ++---
 lib/nilfs.c | 4 +++-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index fab8ff2..05cfe3b 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -185,6 +185,7 @@ struct nilfs_psegment {
size_t p_maxblocks;
size_t p_blksize;
__u32 p_seed;
+   __u64 p_seq;
 };
 
 /**
diff --git a/lib/gc.c b/lib/gc.c
index 453acf2..a165a5c 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -273,9 +273,8 @@ static ssize_t nilfs_acc_blocks(struct nilfs *nilfs,
return -1;
continue;
}
-   ret = nilfs_acc_blocks_segment(
-   nilfs, segnums[i], segment, si.sui_nblocks,
-   vdescv, bdescv);
+   ret = nilfs_acc_blocks_segment(nilfs, segnums[i], segment,
+   nilfs_get_blocks_per_segment(nilfs), vdescv, bdescv);
if (nilfs_put_segment(nilfs, segment) < 0 || ret < 0)
return -1;
i++;
diff --git a/lib/nilfs.c b/lib/nilfs.c
index 65bf7d5..e8f5c96 100644
--- a/lib/nilfs.c
+++ b/lib/nilfs.c
@@ -900,7 +900,8 @@ static int nilfs_psegment_is_valid(const struct 
nilfs_psegment *pseg)
 {
int offset;
 
-   if (le32_to_cpu(pseg->p_segsum->ss_magic) != NILFS_SEGSUM_MAGIC)
+   if (le32_to_cpu(pseg->p_segsum->ss_magic) != NILFS_SEGSUM_MAGIC ||
+   le64_to_cpu(pseg->p_segsum->ss_seq) != pseg->p_seq)
return 0;
 
offset = sizeof(pseg->p_segsum->ss_datasum) +
@@ -928,6 +929,7 @@ void nilfs_psegment_init(struct nilfs_psegment *pseg, __u64 
segnum,
pseg->p_seed = le32_to_cpu(nilfs->n_sb->s_crc_seed);
 
pseg->p_segsum = seg + blkoff * pseg->p_blksize;
+   pseg->p_seq = le64_to_cpu(pseg->p_segsum->ss_seq);
pseg->p_blocknr = pseg->p_segblocknr;
 }
 
-- 
1.9.0



[PATCH 2/4] nilfs-utils: add cost-benefit and greedy policies

2014-03-16 Thread Andreas Rohner
This patch implements the cost-benefit and greedy GC policies. These are
well known policies for log-structured file systems [1].

* Greedy:
  Select the segments with the most free space.
* Cost-Benefit:
  Perform a cost-benefit analysis, whereby the free space gained is
  weighed against the cost of collecting the segment.

Since cost-benefit in particular needed more information than was
available in nilfs_suinfo, a few extra parameters were added to the
policy callback function prototype. The policy threshold was removed,
since it served no real purpose. The field p_comparison was added to
indicate how the importance values should be interpreted: for the
timestamp policy smaller values mean older timestamps, which is better,
whereas for greedy and cost-benefit higher values are better.
nilfs_cleanerd_select_segments() was updated accordingly.

[1] Mendel Rosenblum and John K. Ousterhout. The design and implementa-
tion of a log-structured file system. ACM Trans. Comput. Syst.,
10(1):26–52, February 1992.
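
For illustration, a minimal sketch of the textbook cost-benefit value
from [1]; it is not necessarily the exact formula used by this patch,
and the helper nilfs_get_blocks_per_segment() is the one already
available in include/nilfs.h:

/*
 * With u = sui_nblocks / blocks_per_segment, the benefit of cleaning a
 * segment is (1 - u) * age and the cost is roughly (1 + u), so the
 * importance is (1 - u) * age / (1 + u), scaled to whole blocks here.
 */
static unsigned long long
example_cost_benefit(struct nilfs *nilfs, const struct nilfs_suinfo *si,
		     __u64 now)
{
	__u32 max_blocks = nilfs_get_blocks_per_segment(nilfs);
	__u64 free_blocks, age;

	if (si->sui_nblocks >= max_blocks || si->sui_lastmod >= now)
		return 0;	/* full segment or bogus timestamp */

	free_blocks = max_blocks - si->sui_nblocks;
	age = now - si->sui_lastmod;

	return (free_blocks * age) / (max_blocks + si->sui_nblocks);
}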

Signed-off-by: Andreas Rohner 
---
 include/nilfs2_fs.h   |   9 -
 sbin/cleanerd/cldconfig.c | 100 +++---
 sbin/cleanerd/cldconfig.h |  18 +
 sbin/cleanerd/cleanerd.c  |  56 --
 4 files changed, 149 insertions(+), 34 deletions(-)

diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index a16ad4c..967c2af 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -483,7 +483,7 @@ struct nilfs_dat_entry {
__le64 de_blocknr;
__le64 de_start;
__le64 de_end;
-   __le64 de_rsv;
+   __le64 de_ss;
 };
 
 /**
@@ -612,11 +612,13 @@ struct nilfs_cpfile_header {
  * @su_lastmod: last modified timestamp
  * @su_nblocks: number of blocks in segment
  * @su_flags: flags
+ * @su_lastdec: last decrement of su_nblocks timestamp
  */
 struct nilfs_segment_usage {
__le64 su_lastmod;
__le32 su_nblocks;
__le32 su_flags;
+   __le64 su_lastdec;
 };
 
 /* segment usage flag */
@@ -659,6 +661,7 @@ nilfs_segment_usage_set_clean(struct nilfs_segment_usage 
*su)
su->su_lastmod = cpu_to_le64(0);
su->su_nblocks = cpu_to_le32(0);
su->su_flags = cpu_to_le32(0);
+   su->su_lastdec = cpu_to_le64(0);
 }
 
 static inline int
@@ -690,11 +693,13 @@ struct nilfs_sufile_header {
  * @sui_lastmod: timestamp of last modification
  * @sui_nblocks: number of written blocks in segment
  * @sui_flags: segment usage flags
+ * @sui_lastdec: last decrement of sui_nblocks timestamp
  */
 struct nilfs_suinfo {
__u64 sui_lastmod;
__u32 sui_nblocks;
__u32 sui_flags;
+   __u64 sui_lastdec;
 };
 
 #define NILFS_SUINFO_FNS(flag, name)   \
@@ -732,6 +737,7 @@ enum {
NILFS_SUINFO_UPDATE_LASTMOD,
NILFS_SUINFO_UPDATE_NBLOCKS,
NILFS_SUINFO_UPDATE_FLAGS,
+   NILFS_SUINFO_UPDATE_LASTDEC,
__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -755,6 +761,7 @@ nilfs_suinfo_update_##name(const struct nilfs_suinfo_update 
*sup)   \
 NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
+NILFS_SUINFO_UPDATE_FNS(LASTDEC, lastdec)
 
 enum {
NILFS_CHECKPOINT,
diff --git a/sbin/cleanerd/cldconfig.c b/sbin/cleanerd/cldconfig.c
index c8b197b..ade974a 100644
--- a/sbin/cleanerd/cldconfig.c
+++ b/sbin/cleanerd/cldconfig.c
@@ -380,7 +380,10 @@ nilfs_cldconfig_handle_clean_check_interval(struct 
nilfs_cldconfig *config,
 }
 
 static unsigned long long
-nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si)
+nilfs_cldconfig_selection_policy_timestamp(struct nilfs *nilfs,
+  const struct nilfs_sustat *sustat,
+  const struct nilfs_suinfo *si,
+  __u64 prottime)
 {
return si->sui_lastmod;
 }
@@ -391,14 +394,101 @@ nilfs_cldconfig_handle_selection_policy_timestamp(struct 
nilfs_cldconfig *config
 {
config->cf_selection_policy.p_importance =
NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE;
-   config->cf_selection_policy.p_threshold =
-   NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD;
+   config->cf_selection_policy.p_comparison =
+   NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER;
+   return 0;
+}
+
+static unsigned long long
+nilfs_cldconfig_selection_policy_greedy(struct nilfs *nilfs,
+   const struct nilfs_sustat *sustat,
+   const struct nilfs_suinfo *si,
+   __u64 prottime)
+{
+   __u32 value, max_blocks = nilfs_get_blocks_per_segment(nilfs);
+
+   if (max_blocks < si->sui_nblocks)
+   return 0;
+
+   value = max_blocks - si->sui_nblocks;
+
+   /*
+

[PATCH 3/4] nilfs-utils: add support for nilfs_clean_snapshot_flags()

2014-03-16 Thread Andreas Rohner
This ioctl enables the userspace GC to perform a cleanup operation after
setting the number of blocks with NILFS_IOCTL_SET_SUINFO. It resets DAT
entries with a de_ss value of NILFS_CNO_MAX to 0. NILFS_CNO_MAX
indicates that the corresponding block belongs to some snapshot, but was
already decremented by a previous deletion operation. If the segment
usage info is changed with NILFS_IOCTL_SET_SUINFO and the number of
blocks is updated, these blocks would never be decremented again and
there are scenarios where the corresponding segments would starve (never
be cleaned). To prevent that, the value of de_ss must be set to 0, so
that it can be decremented again should the snapshot be deleted in the
future.

Signed-off-by: Andreas Rohner 
---
 include/nilfs.h |  2 ++
 include/nilfs2_fs.h |  2 ++
 lib/gc.c|  6 +-
 lib/nilfs.c | 23 +++
 4 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index 05cfe3b..bd134be 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -313,6 +313,8 @@ ssize_t nilfs_get_bdescs(const struct nilfs *, struct 
nilfs_bdesc *, size_t);
 int nilfs_clean_segments(struct nilfs *, struct nilfs_vdesc *, size_t,
 struct nilfs_period *, size_t, __u64 *, size_t,
 struct nilfs_bdesc *, size_t, __u64 *, size_t);
+int nilfs_clean_snapshot_flags(struct nilfs *nilfs,
+  struct nilfs_vdesc *vdescs, size_t nvdescs);
 int nilfs_sync(const struct nilfs *, nilfs_cno_t *);
 int nilfs_resize(struct nilfs *nilfs, off_t size);
 int nilfs_set_alloc_range(struct nilfs *nilfs, off_t start, off_t end);
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index 967c2af..cb02739 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -918,5 +918,7 @@ struct nilfs_bdesc {
_IOW(NILFS_IOCTL_IDENT, 0x8C, __u64[2])
 #define NILFS_IOCTL_SET_SUINFO  \
_IOW(NILFS_IOCTL_IDENT, 0x8D, struct nilfs_argv)
+#define NILFS_IOCTL_CLEAN_SNAPSHOT_FLAGS  \
+   _IOW(NILFS_IOCTL_IDENT, 0x8F, struct nilfs_argv)
 
 #endif /* _LINUX_NILFS_FS_H */
diff --git a/lib/gc.c b/lib/gc.c
index a165a5c..2338174 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -762,8 +762,12 @@ int nilfs_xreclaim_segment(struct nilfs *nilfs,
 
ret = nilfs_set_suinfo(nilfs, nilfs_vector_get_data(supv), n);
 
-   if (ret == 0)
+   if (ret == 0) {
+   ret = nilfs_clean_snapshot_flags(nilfs,
+   nilfs_vector_get_data(vdescv),
+   nilfs_vector_get_size(vdescv));
goto out_lock;
+   }
 
if (ret < 0 && errno != ENOTTY) {
nilfs_gc_logger(LOG_ERR, "cannot set suinfo: %s",
diff --git a/lib/nilfs.c b/lib/nilfs.c
index e8f5c96..b909a23 100644
--- a/lib/nilfs.c
+++ b/lib/nilfs.c
@@ -743,6 +743,29 @@ int nilfs_clean_segments(struct nilfs *nilfs,
 }
 
 /**
+ * nilfs_clean_snapshot_flags - cleanup snapshot flags after set_suinfo
+ * @nilfs: nilfs object
+ * @vdescs: array of nilfs_vdesc structs to specify live blocks
+ * @nvdescs: size of @vdescs array (number of items)
+ */
+int nilfs_clean_snapshot_flags(struct nilfs *nilfs,
+  struct nilfs_vdesc *vdescs, size_t nvdescs)
+{
+   struct nilfs_argv argv;
+
+   if (nilfs->n_iocfd < 0) {
+   errno = EBADF;
+   return -1;
+   }
+
+   memset(&argv, 0, sizeof(struct nilfs_argv));
+   argv.v_base = (unsigned long)vdescs;
+   argv.v_nmembs = nvdescs;
+   argv.v_size = sizeof(struct nilfs_vdesc);
+   return ioctl(nilfs->n_iocfd, NILFS_IOCTL_CLEAN_SNAPSHOT_FLAGS, &argv);
+}
+
+/**
  * nilfs_sync - sync a NILFS file system
  * @nilfs: nilfs object
  * @cnop: buffer to store the latest checkpoint number in
-- 
1.9.0



[PATCH 4/6] nilfs2: add ioctl() to clean snapshot flags from dat entries

2014-03-16 Thread Andreas Rohner
This patch introduces new flags for nilfs_vdesc to indicate the reason
why a block is alive. So if a block would be reclaimable, but must be
treated as if it were alive because it is part of a snapshot, then the
snapshot flag is set.

Additionally a new ioctl() is added, which enables the userspace GC to
perform a cleanup operation after setting the number of blocks with
NILFS_IOCTL_SET_SUINFO. It resets DAT entries with a de_ss value of
NILFS_CNO_MAX to 0. NILFS_CNO_MAX indicates that the corresponding
block belongs to some snapshot, but was already decremented by a
previous deletion operation. If the segment usage info is changed with
NILFS_IOCTL_SET_SUINFO and the number of blocks is updated, these
blocks would never be decremented again and there are scenarios where
the corresponding segments would starve (never be cleaned). To prevent
that, they must be reset to 0.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/dat.c   |  63 
 fs/nilfs2/dat.h   |   1 +
 fs/nilfs2/ioctl.c | 103 +-
 include/linux/nilfs2_fs.h |  52 ++-
 4 files changed, 216 insertions(+), 3 deletions(-)

diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index 89a4a5f..7adb15d 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -382,6 +382,69 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, 
sector_t blocknr)
 }
 
 /**
+ * nilfs_dat_clean_snapshot_flag - check flags used by snapshots
+ * @dat: DAT file inode
+ * @vblocknr: virtual block number
+ *
+ * Description: nilfs_dat_clean_snapshot_flag() changes the flags from
+ * NILFS_CNO_MAX to 0 if necessary, so that segment usage is accurately
+ * counted. NILFS_CNO_MAX indicates, that the corresponding block belongs
+ * to some snapshot, but was already decremented. If the segment usage info
+ * is changed with NILFS_IOCTL_SET_SUINFO and the number of blocks is updated,
+ * then these blocks would never be decremented and there are scenarios where
+ * the corresponding segments would starve (never be cleaned).
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_dat_clean_snapshot_flag(struct inode *dat, __u64 vblocknr)
+{
+   struct buffer_head *entry_bh;
+   struct nilfs_dat_entry *entry;
+   void *kaddr;
+   int ret;
+
+   ret = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
+   if (ret < 0)
+   return ret;
+
+   /*
+* The given disk block number (blocknr) is not yet written to
+* the device at this point.
+*
+* To prevent nilfs_dat_translate() from returning the
+* uncommitted block number, this makes a copy of the entry
+* buffer and redirects nilfs_dat_translate() to the copy.
+*/
+   if (!buffer_nilfs_redirected(entry_bh)) {
+   ret = nilfs_mdt_freeze_buffer(dat, entry_bh);
+   if (ret) {
+   brelse(entry_bh);
+   return ret;
+   }
+   }
+
+   kaddr = kmap_atomic(entry_bh->b_page);
+   entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
+   if (entry->de_ss == cpu_to_le64(NILFS_CNO_MAX)) {
+   entry->de_ss = cpu_to_le64(0);
+   kunmap_atomic(kaddr);
+   mark_buffer_dirty(entry_bh);
+   nilfs_mdt_mark_dirty(dat);
+   } else {
+   kunmap_atomic(kaddr);
+   }
+
+   brelse(entry_bh);
+
+   return 0;
+}
+
+/**
  * nilfs_dat_translate - translate a virtual block number to a block number
  * @dat: DAT file inode
  * @vblocknr: virtual block number
diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
index 92a187e..a528024 100644
--- a/fs/nilfs2/dat.h
+++ b/fs/nilfs2/dat.h
@@ -51,6 +51,7 @@ void nilfs_dat_abort_update(struct inode *, struct 
nilfs_palloc_req *,
 int nilfs_dat_mark_dirty(struct inode *, __u64);
 int nilfs_dat_freev(struct inode *, __u64 *, size_t);
 int nilfs_dat_move(struct inode *, __u64, sector_t);
+int nilfs_dat_clean_snapshot_flag(struct inode *, __u64);
 ssize_t nilfs_dat_get_vinfo(struct inode *, void *, unsigned, size_t);
 
 int nilfs_dat_read(struct super_block *sb, size_t entry_size,
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..0b62bf4 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -578,7 +578,7 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
struct buffer_head *bh;
int ret;
 
-   if (vdesc->vd_flags == 0)
+   if (nilfs_vdesc_data(vdesc))
ret = nilfs_gccache_submit_read_data(
inode, vdesc->vd_offset, vdesc->vd_blocknr,
vdesc->vd_vblocknr, &bh);
@@ -662,6 +662,14 @@ static int nilfs_ioctl_move_blocks(struct super_block *sb,
  

[PATCH 2/6] nilfs2: add new timestamp to seg usage and function to change su_nblocks

2014-03-16 Thread Andreas Rohner
This patch adds an additional timestamp to the segment usage
information that indicates the last time the usage information was
changed. So su_lastmod indicates the last time the segment itself was
modified and su_lastdec indicates the last time the usage information
itself was changed.

This is important information for the GC, because it needs to avoid
selecting segments for cleaning that were created (su_lastmod) outside of
the protection period, but whose blocks became reclaimable (su_nblocks is
decremented) within the protection period. Without that information the
GC policy has to assume that there are reclaimable blocks, only to find
out that they are protected by the protection period.

This patch also introduces nilfs_sufile_add_segment_usage(), which can
be used to increment or decrement the value of su_nblocks of a specific
segment.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/sufile.c| 86 +--
 fs/nilfs2/sufile.h| 18 ++
 include/linux/nilfs2_fs.h |  7 
 3 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 2a869c3..0886938 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -453,6 +453,8 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 
segnum,
su->su_lastmod = cpu_to_le64(0);
su->su_nblocks = cpu_to_le32(0);
su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
+   if (nilfs_sufile_lastdec_supported(sufile))
+   su->su_lastdec = cpu_to_le64(0);
kunmap_atomic(kaddr);
 
nilfs_sufile_mod_counter(header_bh, clean ? (u64)-1 : 0, dirty ? 0 : 1);
@@ -482,7 +484,7 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 
segnum,
WARN_ON(!nilfs_segment_usage_dirty(su));
 
sudirty = nilfs_segment_usage_dirty(su);
-   nilfs_segment_usage_set_clean(su);
+   nilfs_sufile_segment_usage_set_clean(sufile, su);
kunmap_atomic(kaddr);
mark_buffer_dirty(su_bh);
 
@@ -549,6 +551,75 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, 
__u64 segnum,
 }
 
 /**
+ * nilfs_sufile_add_segment_usage - decrement usage of a segment
+ * @sufile: inode of segment usage file
+ * @segnum: segment number
+ * @value: value to add to su_nblocks
+ * @dectime: current time
+ *
+ * Description: nilfs_sufile_add_segment_usage() adds a signed value to the
+ * su_nblocks field of the segment usage information of @segnum. It ensures
+ * that the result is bigger than 0 and smaller or equal to the maximum number
+ * of blocks per segment
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient memory available.
+ *
+ * %-EIO - I/O error
+ *
+ * %-ENOENT - the specified block does not exist (hole block)
+ */
+int nilfs_sufile_add_segment_usage(struct inode *sufile, __u64 segnum,
+  __s64 value, time_t dectime)
+{
+   struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
+   struct buffer_head *bh;
+   struct nilfs_segment_usage *su;
+   void *kaddr;
+   int ret;
+
+   if (value == 0)
+   return 0;
+
+   down_write(&NILFS_MDT(sufile)->mi_sem);
+
+   ret = nilfs_sufile_get_segment_usage_block(sufile, segnum, 0, &bh);
+   if (ret < 0)
+   goto out_sem;
+
+   kaddr = kmap_atomic(bh->b_page);
+   su = nilfs_sufile_block_get_segment_usage(sufile, segnum, bh, kaddr);
+   WARN_ON(nilfs_segment_usage_error(su));
+
+   value += le32_to_cpu(su->su_nblocks);
+   if (value < 0)
+   value = 0;
+   if (value > nilfs->ns_blocks_per_segment)
+   value = nilfs->ns_blocks_per_segment;
+
+   if (value == le32_to_cpu(su->su_nblocks)) {
+   kunmap_atomic(kaddr);
+   goto out_brelse;
+   }
+
+   su->su_nblocks = cpu_to_le32(value);
+   if (dectime && nilfs_sufile_lastdec_supported(sufile))
+   su->su_lastdec = cpu_to_le64(dectime);
+   kunmap_atomic(kaddr);
+
+   mark_buffer_dirty(bh);
+   nilfs_mdt_mark_dirty(sufile);
+
+out_brelse:
+   brelse(bh);
+out_sem:
+   up_write(&NILFS_MDT(sufile)->mi_sem);
+   return ret;
+}
+
+/**
  * nilfs_sufile_get_stat - get segment usage statistics
  * @sufile: inode of segment usage file
  * @stat: pointer to a structure of segment usage statistics
@@ -698,7 +769,8 @@ static int nilfs_sufile_truncate_range(struct inode *sufile,
nc = 0;
for (su = su2, j = 0; j < n; j++, su = (void *)su + susz) {
if (nilfs_segment_usage_error(su)) {
-   nilfs_segment_usage_set_clean(su);
+   nilfs_sufile_segment_usage_set_clean(sufile,
+   su);

[PATCH 3/6] nilfs2: scan dat entries at snapshot creation/deletion time

2014-03-16 Thread Andreas Rohner
To accurately count the number of live blocks in a segment, it is
important to take snapshots into account, because snapshots can protect
reclaimable blocks from being cleaned.

This patch uses the previously reserved de_rsv field of the
nilfs_dat_entry struct to store one of the snapshots the corresponding
block belongs to. One block can belong to many snapshots, but because
the snapshots are stored in a sorted linked list, it is easy to check if
a block belongs to any other snapshot given the previous and the next
snapshot. For example, if the current snapshot (in de_ss) is being
removed and neither the previous nor the next snapshot is in the range
of de_start to de_end, then it is guaranteed that the block doesn't
belong to any other snapshot and is reclaimable. If, on the other hand,
say the previous snapshot is in the range of de_start to de_end, we
simply set de_ss to the previous snapshot and the block is not
reclaimable.

To implement this every DAT entry is scanned at snapshot
creation/deletion time and updated if needed. To avoid too many update
operations only potentially reclaimable blocks are ever updated. For
example if there are some deleted files and the checkpoint to which
these files belong is turned into a snapshot, then su_nblocks is
incremented for these blocks, which reverses the decrement that happened
when the files were deleted. If after some time this snapshot is
deleted, su_nblocks is decremented again to reverse the increment at
creation time.
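
For illustration only, a minimal sketch of the decision described above
(not taken from the patch); whether de_start and de_end are treated as
inclusive or exclusive bounds here is an assumption:

/*
 * When the snapshot stored in de_ss is deleted, decide whether the
 * block stays pinned by a neighbouring snapshot.  "prev" and "next"
 * are the neighbours in the sorted snapshot list, or 0 if there is
 * none on that side.
 */
static int entry_still_snapshotted(struct nilfs_dat_entry *entry,
				   __u64 prev, __u64 next)
{
	__u64 start = le64_to_cpu(entry->de_start);
	__u64 end = le64_to_cpu(entry->de_end);

	if (prev && prev >= start && prev < end) {
		entry->de_ss = cpu_to_le64(prev);	/* pinned by prev */
		return 1;
	}
	if (next && next >= start && next < end) {
		entry->de_ss = cpu_to_le64(next);	/* pinned by next */
		return 1;
	}
	entry->de_ss = cpu_to_le64(0);		/* reclaimable again */
	return 0;
}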

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/cpfile.c|  7 
 fs/nilfs2/dat.c   | 86 +++
 fs/nilfs2/dat.h   | 26 ++
 include/linux/nilfs2_fs.h |  4 +--
 4 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
index 0d58075..29952f5 100644
--- a/fs/nilfs2/cpfile.c
+++ b/fs/nilfs2/cpfile.c
@@ -28,6 +28,7 @@
 #include 
 #include "mdt.h"
 #include "cpfile.h"
+#include "sufile.h"
 
 
 static inline unsigned long
@@ -584,6 +585,7 @@ static int nilfs_cpfile_set_snapshot(struct inode *cpfile, 
__u64 cno)
struct nilfs_cpfile_header *header;
struct nilfs_checkpoint *cp;
struct nilfs_snapshot_list *list;
+   struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
__u64 curr, prev;
unsigned long curr_blkoff, prev_blkoff;
void *kaddr;
@@ -681,6 +683,8 @@ static int nilfs_cpfile_set_snapshot(struct inode *cpfile, 
__u64 cno)
mark_buffer_dirty(header_bh);
nilfs_mdt_mark_dirty(cpfile);
 
+   nilfs_dat_scan_inc_ss(nilfs->ns_dat, cno);
+
brelse(prev_bh);
 
  out_curr:
@@ -703,6 +707,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
*cpfile, __u64 cno)
struct nilfs_cpfile_header *header;
struct nilfs_checkpoint *cp;
struct nilfs_snapshot_list *list;
+   struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
__u64 next, prev;
void *kaddr;
int ret;
@@ -784,6 +789,8 @@ static int nilfs_cpfile_clear_snapshot(struct inode 
*cpfile, __u64 cno)
mark_buffer_dirty(header_bh);
nilfs_mdt_mark_dirty(cpfile);
 
+   nilfs_dat_scan_dec_ss(nilfs->ns_dat, cno, prev, next);
+
brelse(prev_bh);
 
  out_next:
diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index 0d5fada..89a4a5f 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -28,6 +28,7 @@
 #include "mdt.h"
 #include "alloc.h"
 #include "dat.h"
+#include "sufile.h"
 
 
 #define NILFS_CNO_MIN  ((__u64)1)
@@ -97,6 +98,7 @@ void nilfs_dat_commit_alloc(struct inode *dat, struct 
nilfs_palloc_req *req)
entry->de_start = cpu_to_le64(NILFS_CNO_MIN);
entry->de_end = cpu_to_le64(NILFS_CNO_MAX);
entry->de_blocknr = cpu_to_le64(0);
+   entry->de_ss = cpu_to_le64(0);
kunmap_atomic(kaddr);
 
nilfs_palloc_commit_alloc_entry(dat, req);
@@ -121,6 +123,7 @@ static void nilfs_dat_commit_free(struct inode *dat,
entry->de_start = cpu_to_le64(NILFS_CNO_MIN);
entry->de_end = cpu_to_le64(NILFS_CNO_MIN);
entry->de_blocknr = cpu_to_le64(0);
+   entry->de_ss = cpu_to_le64(0);
kunmap_atomic(kaddr);
 
nilfs_dat_commit_entry(dat, req);
@@ -201,6 +204,7 @@ void nilfs_dat_commit_end(struct inode *dat, struct 
nilfs_palloc_req *req,
WARN_ON(start > end);
}
entry->de_end = cpu_to_le64(end);
+   entry->de_ss = cpu_to_le64(NILFS_CNO_MAX);
blocknr = le64_to_cpu(entry->de_blocknr);
kunmap_atomic(kaddr);
 
@@ -365,6 +369,8 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, 
sector_t blocknr)
}
WARN_ON(blocknr == 0);
entry->de_blocknr = cpu_to_le64(blocknr);
+   if (entry->de_ss == cpu_to_le64(NILFS_CNO_MAX))
+   entry->de_ss = cpu_to_le64(0);
k

[PATCH 5/6] nilfs2: add counting of live blocks for blocks that are overwritten

2014-03-16 Thread Andreas Rohner
After a block is written to disk, the buffer_head is never mapped to
that location on disk. By simply using map_bh() after writing the block,
the origin of overwritten blocks can be determined and the corresponding
segment can be calculated with nilfs_get_segnum_of_block(). Since the
block is now at a new location, the old one is reclaimable. Therefore
the number of live blocks in the segment usage information of the
segment holding the previous location of the block needs to be
decremented. This approach also works for the DAT file and other
metadata files.

nilfs_node blocks have to be treated differently. GC blocks also have to
be treated separately, because they contain the virtual block number in
the b_blocknr field and their old location is about to be cleaned
anyway, so it is not necessary to decrement the live block counters for
GC blocks.

This patch does not count deleted blocks, when a whole file is deleted.
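
For illustration only (not the patch code), a sketch of the per-buffer
step described above; the helper names follow the earlier patches in
this series and the exact call site is an assumption:

/*
 * When a buffer that is still mapped to its old on-disk location is
 * written again, decrement the live-block count of the segment that
 * holds the old copy.  Node and GC blocks are skipped, as described
 * above.
 */
static void dec_old_segment_usage(struct the_nilfs *nilfs,
				  struct buffer_head *bh)
{
	__u64 segnum;

	if (!buffer_mapped(bh) || bh->b_blocknr == 0 ||
	    buffer_nilfs_node(bh) || nilfs_doing_gc())
		return;

	segnum = nilfs_get_segnum_of_block(nilfs, bh->b_blocknr);
	nilfs_sufile_add_segment_usage(nilfs->ns_sufile, segnum, -1,
				       get_seconds());
}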

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/dat.c   | 58 +++
 fs/nilfs2/dat.h   |  1 +
 fs/nilfs2/inode.c |  2 ++
 fs/nilfs2/ioctl.c |  6 +
 fs/nilfs2/page.h  |  6 -
 fs/nilfs2/segbuf.c| 25 +
 fs/nilfs2/segbuf.h|  4 +++
 fs/nilfs2/segment.c   | 69 +++
 include/linux/nilfs2_fs.h |  2 ++
 9 files changed, 167 insertions(+), 6 deletions(-)

diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index 7adb15d..e7b19c40 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -445,6 +445,64 @@ int nilfs_dat_clean_snapshot_flag(struct inode *dat, __u64 
vblocknr)
 }
 
 /**
+ * nilfs_dat_is_live - checks if the virtual block number is alive
+ * @dat: DAT file inode
+ * @vblocknr: virtual block number
+ *
+ * Description: nilfs_dat_is_live() looks up the DAT entry for @vblocknr and
+ * determines if the corresponding block is alive or not. This check ignores
+ * snapshots and protection periods.
+ *
+ * Return Value: 1 if vblocknr is alive and 0 otherwise. On error, one
+ * of the following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - A block number associated with @vblocknr does not exist.
+ */
+int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr)
+{
+   struct buffer_head *entry_bh, *bh;
+   struct nilfs_dat_entry *entry;
+   sector_t blocknr;
+   void *kaddr;
+   int ret;
+
+   ret = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
+   if (ret < 0)
+   return ret;
+
+   if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
+   bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
+   if (bh) {
+   WARN_ON(!buffer_uptodate(bh));
+   brelse(entry_bh);
+   entry_bh = bh;
+   }
+   }
+
+   kaddr = kmap_atomic(entry_bh->b_page);
+   entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
+   blocknr = le64_to_cpu(entry->de_blocknr);
+   if (blocknr == 0) {
+   ret = -ENOENT;
+   goto out;
+   }
+
+
+   if (entry->de_end == cpu_to_le64(NILFS_CNO_MAX))
+   ret = 1;
+   else
+   ret = 0;
+out:
+   kunmap_atomic(kaddr);
+   brelse(entry_bh);
+   return ret;
+}
+
+/**
  * nilfs_dat_translate - translate a virtual block number to a block number
  * @dat: DAT file inode
  * @vblocknr: virtual block number
diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
index a528024..51d44c0 100644
--- a/fs/nilfs2/dat.h
+++ b/fs/nilfs2/dat.h
@@ -31,6 +31,7 @@
 struct nilfs_palloc_req;
 
 int nilfs_dat_translate(struct inode *, __u64, sector_t *);
+int nilfs_dat_is_live(struct inode *, __u64);
 
 int nilfs_dat_prepare_alloc(struct inode *, struct nilfs_palloc_req *);
 void nilfs_dat_commit_alloc(struct inode *, struct nilfs_palloc_req *);
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index b9c5726..c32b896 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -86,6 +86,8 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff,
int err = 0, ret;
unsigned maxblocks = bh_result->b_size >> inode->i_blkbits;
 
+   bh_result->b_blocknr = 0;
+
down_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
ret = nilfs_bmap_lookup_contig(ii->i_bmap, blkoff, &blknum, maxblocks);
up_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 0b62bf4..3603394 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -612,6 +612,12 @@ static int nilfs_ioctl_move_inode_block(struct inode 
*inode,
brelse(bh);
return -EEXIST;
}
+
+   if (nilfs_vdesc_snapshot(vdesc))
+   set_buffer_nilfs_snapshot(bh);
+   if (nilfs_vdesc_protection_pe

[PATCH 0/6] nilfs2: implement tracking of live blocks

2014-03-16 Thread Andreas Rohner
Hi,

This patch set implements the tracking of live blocks in segments. This 
information is crucial in implementing better GC policies, because 
now the policies can make informed decisions about which segments have 
the biggest number of reclaimable blocks.

The difficulty in tracking live blocks is the fact that any block can
belong to any number of snapshots, and snapshots can be deleted and
created at any time. A block belongs to a snapshot if the checkpoint
number lies between de_start and de_end of the block. So if a new
snapshot is created, all the reclaimable blocks belonging to it are no
longer reclaimable and therefore the live block counter of the
corresponding segment must be incremented. Conversely, if a snapshot is
removed, all the reclaimable blocks belonging to it should really be
counted as reclaimable again and the counter must be decremented. But if
one block belongs to two or more snapshots, the counter must only be
incremented once for the first and decremented once for the last
snapshot.

To achieve this I used the de_rsv field of nilfs_dat_entry to store one
of the snapshot numbers. Every time a snapshot is created or removed,
the whole DAT-File is scanned and de_rsv is updated if the snapshot
number is between de_start and de_end. But one block can belong to an
arbitrary number of snapshots. Here I use the fact that the snapshot
list is organized as a sorted linked list. So by knowing the previous
and the next snapshot number it is possible to reliably determine
whether a block is reclaimable or belongs to another snapshot.

It is of course unacceptable to update the whole DAT-File to create one
snapshot. So only reclaimable blocks are updated. But this leads to
certain situations where the counters won't be accurate. The userspace
GC should be capable of compensating for and correcting the inaccurate
values.

Another problem is the protection period in the userspace GC. The kernel
doesn't know anything about the userspace protection period, and it is
therefore not reflected in the number of live blocks in a segment. For
example, if the GC policy chooses a segment that seems to have a lot of
reclaimable blocks, it could turn out that all of those blocks are
still protected by the protection period.

To overcome this problem I added an additional field, su_lastdec, to
the segment usage information. Whenever the number of live blocks in a
segment is adjusted su_lastdec is set to the current timestamp. If the 
number of live blocks was adjusted within the protection period, then 
the userspace GC policy can recognize it and choose a different segment.
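
A userspace-side sketch of that check (hypothetical helper, not part of
these patches) could look like this, where prottime marks the start of
the protection period as already computed by nilfs_cleanerd:

/*
 * Skip segments whose live-block count was last adjusted inside the
 * protection period: their "reclaimable" blocks are most likely still
 * protected.
 */
static int segment_worth_selecting(const struct nilfs_suinfo *si,
				   __u64 prottime)
{
	if (si->sui_lastdec && si->sui_lastdec >= prottime)
		return 0;
	return 1;
}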

Compatibility Issues:
1. su_nblocks is reused to represent the number of live blocks, so
   old nilfs-utils would break the file system.
2. the vd_pad field of nilfs_vdesc was not initialized to 0, so old
   nilfs-utils could send arbitrary flags to the kernel.

Benchmark Results:

The benchmark replays NFS traces to simulate a real file system load.
The file system is filled up to 20% capacity and then the NFS traces are
replayed. In parallel, every 5 minutes random checkpoints are turned
into snapshots. After 15 minutes each snapshot is turned back into a
checkpoint.

Greedy-Policy-Runtime:       6221.712s
Cost-Benefit-Policy-Runtime: 6874.840s
Timestamp-Policy-Runtime:   13179.626s

Best regards,
Andreas Rohner

---
Andreas Rohner (6):
  nilfs2: add helper function to go through all entries of meta data
file
  nilfs2: add new timestamp to seg usage and function to change
su_nblocks
  nilfs2: scan dat entries at snapshot creation/deletion time
  nilfs2: add ioctl() to clean snapshot flags from dat entries
  nilfs2: add counting of live blocks for blocks that are overwritten
  nilfs2: add counting of live blocks for deleted files

 fs/nilfs2/alloc.c | 121 +
 fs/nilfs2/alloc.h |   6 ++
 fs/nilfs2/bmap.c  |   8 +-
 fs/nilfs2/bmap.h  |   2 +-
 fs/nilfs2/btree.c |   3 +-
 fs/nilfs2/cpfile.c|   7 ++
 fs/nilfs2/dat.c   | 225 +-
 fs/nilfs2/dat.h   |  32 ++-
 fs/nilfs2/direct.c|   3 +-
 fs/nilfs2/inode.c |   2 +
 fs/nilfs2/ioctl.c | 109 +-
 fs/nilfs2/mdt.c   |   5 +-
 fs/nilfs2/page.h  |   6 +-
 fs/nilfs2/segbuf.c|  25 ++
 fs/nilfs2/segbuf.h|   4 +
 fs/nilfs2/segment.c   |  69 --
 fs/nilfs2/sufile.c|  86 +-
 fs/nilfs2/sufile.h|  18 
 include/linux/nilfs2_fs.h |  65 +-
 19 files changed, 772 insertions(+), 24 deletions(-)

-- 
1.9.0



[PATCH 6/6] nilfs2: add counting of live blocks for deleted files

2014-03-16 Thread Andreas Rohner
If a file is deleted, then the entries of its blocks in the DAT-File
need to be updated. So every time a file is deleted,
nilfs_dat_commit_end() is called for every block to set de_end to the
current checkpoint number. It is therefore the perfect hook to insert
logic that counts live blocks. If the file is deleted, then the blocks
are reclaimable and the number of live blocks for the corresponding
segments must be decremented. This patch adds code to
nilfs_dat_commit_end() that decrements the number of live blocks under
certain conditions.

One condition is that the block must not belong to the SUFILE, because
that would lead to a deadlock. When nilfs_dat_commit_end() is called,
the bmap's b_sem is already held, but nilfs_sufile_add_segment_usage()
has to take that same lock for the SUFILE to decrement the number of
live blocks. Secondly, the blocks must only be counted if
nilfs_dat_commit_end() is called from a file deletion operation,
because overwritten blocks are already counted somewhere else.

With the above changes the code does not pass the lock dependency
checks, because all the locks have the same class and the order in which
the locks are taken is different. Usually it is:

1. down_write(&NILFS_MDT(sufile)->mi_sem);
2. down_write(&bmap->b_sem);

Now it can also be reversed, which leads to failed checks:

1. down_write(&bmap->b_sem); /* lock of a file other than SUFILE */
2. down_write(&NILFS_MDT(sufile)->mi_sem);

But this is safe as long as the first lock down_write(&bmap->b_sem)
doesn't belong to the SUFILE. So the warnings can be resolved by adding
an extra lock class for the SUFILE, and the code is safe because the
SUFILE is excluded from being counted.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/bmap.c   |  8 +++-
 fs/nilfs2/bmap.h   |  2 +-
 fs/nilfs2/btree.c  |  3 ++-
 fs/nilfs2/dat.c| 18 ++
 fs/nilfs2/dat.h|  4 ++--
 fs/nilfs2/direct.c |  3 ++-
 fs/nilfs2/mdt.c|  5 -
 7 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
index aadbd0b..ecd62ba 100644
--- a/fs/nilfs2/bmap.c
+++ b/fs/nilfs2/bmap.c
@@ -467,6 +467,7 @@ __u64 nilfs_bmap_find_target_in_group(const struct 
nilfs_bmap *bmap)
 
 static struct lock_class_key nilfs_bmap_dat_lock_key;
 static struct lock_class_key nilfs_bmap_mdt_lock_key;
+static struct lock_class_key nilfs_bmap_sufile_lock_key;
 
 /**
  * nilfs_bmap_read - read a bmap from an inode
@@ -498,12 +499,17 @@ int nilfs_bmap_read(struct nilfs_bmap *bmap, struct 
nilfs_inode *raw_inode)
lockdep_set_class(&bmap->b_sem, &nilfs_bmap_dat_lock_key);
break;
case NILFS_CPFILE_INO:
-   case NILFS_SUFILE_INO:
bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
bmap->b_last_allocated_key = 0;
bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
break;
+   case NILFS_SUFILE_INO:
+   bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
+   bmap->b_last_allocated_key = 0;
+   bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
+   lockdep_set_class(&bmap->b_sem, &nilfs_bmap_sufile_lock_key);
+   break;
case NILFS_IFILE_INO:
lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
/* Fall through */
diff --git a/fs/nilfs2/bmap.h b/fs/nilfs2/bmap.h
index b89e680..f09009c 100644
--- a/fs/nilfs2/bmap.h
+++ b/fs/nilfs2/bmap.h
@@ -223,7 +223,7 @@ static inline void nilfs_bmap_commit_end_ptr(struct 
nilfs_bmap *bmap,
 {
if (dat)
nilfs_dat_commit_end(dat, &req->bpr_req,
-bmap->b_ptr_type == NILFS_BMAP_PTR_VS);
+bmap->b_ptr_type == NILFS_BMAP_PTR_VS, 1);
 }
 
 static inline void nilfs_bmap_abort_end_ptr(struct nilfs_bmap *bmap,
diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
index b2e3ff3..7365cb4 100644
--- a/fs/nilfs2/btree.c
+++ b/fs/nilfs2/btree.c
@@ -1851,7 +1851,8 @@ static void nilfs_btree_commit_update_v(struct nilfs_bmap 
*btree,
 
nilfs_dat_commit_update(dat, &path[level].bp_oldreq.bpr_req,
&path[level].bp_newreq.bpr_req,
-   btree->b_ptr_type == NILFS_BMAP_PTR_VS);
+   btree->b_ptr_type == NILFS_BMAP_PTR_VS,
+   buffer_nilfs_node(path[level].bp_bh));
 
if (buffer_nilfs_node(path[level].bp_bh)) {
nilfs_btnode_commit_change_key(
diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index e7b19c40..f465cbf 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -188,12 +188,13 @@ int nilfs_dat_prepare_end(struct inode *dat, struct 
nilfs_palloc_req *req)
 }
 
 void nilfs_dat_commit_end(struct i

[PATCH 1/6] nilfs2: add helper function to go through all entries of meta data file

2014-03-16 Thread Andreas Rohner
This patch introduces the nilfs_palloc_scan_entries() function,
which takes an inode of one of nilfs' meta data files and iterates
through all of its entries. For each entry the callback function
pointer that is given as a parameter is called. The data parameter
is passed to the callback function, so that it may receive
parameters and return results.
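
As a usage illustration (hypothetical callback and wrapper, not part of
this patch), a dofunc that merely counts the allocated entries could
look like this:

/* Example callback: count every allocated entry; the data pointer is
 * used to return the result to the caller. */
static void nilfs_palloc_count_entry(struct inode *inode,
				     struct nilfs_palloc_req *req,
				     void *data)
{
	(*(unsigned long *)data)++;
}

static unsigned long nilfs_palloc_count_entries(struct inode *inode)
{
	unsigned long nentries = 0;

	if (nilfs_palloc_scan_entries(inode, nilfs_palloc_count_entry,
				      &nentries) < 0)
		return 0;
	return nentries;
}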

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/alloc.c | 121 ++
 fs/nilfs2/alloc.h |   6 +++
 2 files changed, 127 insertions(+)

diff --git a/fs/nilfs2/alloc.c b/fs/nilfs2/alloc.c
index 741fd02..0edd85a 100644
--- a/fs/nilfs2/alloc.c
+++ b/fs/nilfs2/alloc.c
@@ -545,6 +545,127 @@ int nilfs_palloc_prepare_alloc_entry(struct inode *inode,
 }
 
 /**
+ * nilfs_palloc_scan_entries - scan through every entry and execute dofunc
+ * @inode: inode of metadata file using this allocator
+ * @dofunc: function executed for every entry
+ * @data: data pointer passed to dofunc
+ *
+ * Description: nilfs_palloc_scan_entries() walks through every allocated entry
+ * of a metadata file and executes dofunc on it. It passes a data pointer to
+ * dofunc, which can be used as an input parameter or for returning of results.
+ *
+ * Return Value: On success, 0 is returned. On error, a
+ * negative error code is returned.
+ */
+int nilfs_palloc_scan_entries(struct inode *inode,
+ void (*dofunc)(struct inode *,
+struct nilfs_palloc_req *,
+void *),
+ void *data)
+{
+   struct buffer_head *desc_bh, *bitmap_bh;
+   struct nilfs_palloc_group_desc *desc;
+   struct nilfs_palloc_req req;
+   unsigned char *bitmap;
+   void *desc_kaddr, *bitmap_kaddr;
+   unsigned long group, maxgroup, ngroups;
+   unsigned long n, m, entries_per_group, groups_per_desc_block;
+   unsigned long i, j, pos;
+   unsigned long blkoff, prev_blkoff;
+   int ret;
+
+   ngroups = nilfs_palloc_groups_count(inode);
+   maxgroup = ngroups - 1;
+   entries_per_group = nilfs_palloc_entries_per_group(inode);
+   groups_per_desc_block = nilfs_palloc_groups_per_desc_block(inode);
+
+   for (group = 0; group < ngroups;) {
+   ret = nilfs_palloc_get_desc_block(inode, group, 0, &desc_bh);
+   if (ret == -ENOENT)
+   return 0;
+   else if (ret < 0)
+   return ret;
+   req.pr_desc_bh = desc_bh;
+   desc_kaddr = kmap(desc_bh->b_page);
+   desc = nilfs_palloc_block_get_group_desc(inode, group,
+desc_bh, desc_kaddr);
+   n = nilfs_palloc_rest_groups_in_desc_block(inode, group,
+  maxgroup);
+
+   for (i = 0; i < n; i++, desc++, group++) {
+   m = entries_per_group -
+   nilfs_palloc_group_desc_nfrees(inode,
+   group, desc);
+   if (!m)
+   continue;
+
+   ret = nilfs_palloc_get_bitmap_block(
+   inode, group, 0, &bitmap_bh);
+   if (ret == -ENOENT) {
+   ret = 0;
+   goto out_desc;
+   } else if (ret < 0)
+   goto out_desc;
+
+   req.pr_bitmap_bh = bitmap_bh;
+   bitmap_kaddr = kmap(bitmap_bh->b_page);
+   bitmap = bitmap_kaddr + bh_offset(bitmap_bh);
+   /* entry blkoff is always bigger than 0 */
+   blkoff = 0;
+   pos = 0;
+
+   for (j = 0; j < m; ++j, ++pos) {
+   pos = nilfs_find_next_bit(bitmap,
+   entries_per_group, pos);
+
+   if (pos >= entries_per_group)
+   break;
+
+   /* found an entry */
+   req.pr_entry_nr =
+   entries_per_group * group + pos;
+
+   prev_blkoff = blkoff;
+   blkoff = nilfs_palloc_entry_blkoff(inode,
+   req.pr_entry_nr);
+
+   if (blkoff != prev_blkoff) {
+   if (prev_blkoff)
+   brelse(req.pr_entry_bh);
+
+   ret = nilfs_palloc_get_entry_block(
+

[PATCH v4 0/2] nilfs2: add support for FITRIM ioctl

2014-02-23 Thread Andreas Rohner
Hi,

This patch adds support for the FITRIM ioctl, which allows user space
tools like fstrim to issue TRIM/DISCARD requests to the underlying
device. It takes an fstrim_range structure as a parameter, and for every
clean segment in the specified range the function blkdev_issue_discard
is called. The range is truncated to file system block boundaries.
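
For illustration, a minimal userspace sketch of how the ioctl is meant
to be used; the mount point is hypothetical, and fstrim(8) does
essentially the same:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int main(void)
{
	struct fstrim_range range = {
		.start = 0,
		.len = ULLONG_MAX,	/* whole file system */
		.minlen = 0,
	};
	int fd = open("/mnt/nilfs2", O_RDONLY);

	if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		return 1;
	}
	/* the kernel updates range.len to the number of bytes discarded */
	printf("%llu bytes trimmed\n", (unsigned long long)range.len);
	return 0;
}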

I tested it with a 32-bit and a 64-bit kernel. On the 32-bit system
CONFIG_LBDAF was disabled, so that sector_t was 32 bits wide.

Best regards,
Andreas Rohner

---
v3->v4 (based on review by Ryusuke Konishi)
 * Fix integer overflow
 * Add comment
v2->v3 (based on review by Ryusuke Konishi)
 * Fix integer overflow
 * Round range to block boundary instead of sector boundary
 * Move range check to nilfs_sufile_trim_fs()
v1->v2 (based on review by Ryusuke Konishi)
 * Remove upper limit of minlen
 * Add check for minlen
 * Round range to sector boundary instead of segment boundary
 * Fix minor bug
 * Use kmap_atomic instead of kmap
 * Move input checks to ioctl.c
 * Use nilfs_sufile_segment_usages_in_block()
--

Andreas Rohner (2):
  nilfs2: add nilfs_sufile_trim_fs to trim clean segs
  nilfs2: add FITRIM ioctl support for nilfs2

 fs/nilfs2/ioctl.c  |  45 
 fs/nilfs2/sufile.c | 152 +
 fs/nilfs2/sufile.h |   1 +
 3 files changed, 198 insertions(+)

-- 
1.9.0



[PATCH v4 1/2] nilfs2: add nilfs_sufile_trim_fs to trim clean segs

2014-02-23 Thread Andreas Rohner
This patch adds the nilfs_sufile_trim_fs function, which takes a
fstrim_range structure and calls blkdev_issue_discard for every
clean segment in the specified range. The range is truncated to
file system block boundaries.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/sufile.c | 152 +
 fs/nilfs2/sufile.h |   1 +
 2 files changed, 153 insertions(+)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 3127e9f..9febeaf 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -870,6 +870,158 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, 
__u64 segnum, void *buf,
 }
 
 /**
+ * nilfs_sufile_trim_fs() - trim ioctl handle function
+ * @sufile: inode of segment usage file
+ * @range: fstrim_range structure
+ *
+ * start:  First Byte to trim
+ * len:number of Bytes to trim from start
+ * minlen: minimum extent length in Bytes
+ *
+ * Description: nilfs_sufile_trim_fs goes through all segments containing bytes
+ * from start to start+len. start is rounded up to the next block boundary
+ * and start+len is rounded down. For each clean segment blkdev_issue_discard
+ * function is invoked.
+ *
+ * Return Value: On success, 0 is returned or negative error code, otherwise.
+ */
+int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range)
+{
+   struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
+   struct buffer_head *su_bh;
+   struct nilfs_segment_usage *su;
+   void *kaddr;
+   size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
+   sector_t seg_start, seg_end, start_block, end_block;
+   sector_t start = 0, nblocks = 0;
+   u64 segnum, segnum_end, minlen, len, max_blocks, ndiscarded = 0;
+   int ret = 0;
+   unsigned int sects_per_block;
+
+   sects_per_block = (1 << nilfs->ns_blocksize_bits) /
+   bdev_logical_block_size(nilfs->ns_bdev);
+   len = range->len >> nilfs->ns_blocksize_bits;
+   minlen = range->minlen >> nilfs->ns_blocksize_bits;
+   max_blocks = ((u64)nilfs->ns_nsegments * nilfs->ns_blocks_per_segment);
+
+   if (!len || range->start >= max_blocks << nilfs->ns_blocksize_bits)
+   return -EINVAL;
+
+   start_block = (range->start + nilfs->ns_blocksize - 1) >>
+   nilfs->ns_blocksize_bits;
+
+   /*
+* range->len can be very large (actually, it is set to
+* ULLONG_MAX by default) - truncate upper end of the range
+* carefully so as not to overflow.
+*/
+   if (max_blocks - start_block < len)
+   end_block = max_blocks - 1;
+   else
+   end_block = start_block + len - 1;
+
+   segnum = nilfs_get_segnum_of_block(nilfs, start_block);
+   segnum_end = nilfs_get_segnum_of_block(nilfs, end_block);
+
+   down_read(&NILFS_MDT(sufile)->mi_sem);
+
+   while (segnum <= segnum_end) {
+   n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
+   segnum_end);
+
+   ret = nilfs_sufile_get_segment_usage_block(sufile, segnum, 0,
+  &su_bh);
+   if (ret < 0) {
+   if (ret != -ENOENT)
+   goto out_sem;
+   /* hole */
+   segnum += n;
+   continue;
+   }
+
+   kaddr = kmap_atomic(su_bh->b_page);
+   su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
+   su_bh, kaddr);
+   for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
+   if (!nilfs_segment_usage_clean(su))
+   continue;
+
+   nilfs_get_segment_range(nilfs, segnum, &seg_start,
+   &seg_end);
+
+   if (!nblocks) {
+   /* start new extent */
+   start = seg_start;
+   nblocks = seg_end - seg_start + 1;
+   continue;
+   }
+
+   if (start + nblocks == seg_start) {
+   /* add to previous extent */
+   nblocks += seg_end - seg_start + 1;
+   continue;
+   }
+
+   /* discard previous extent */
+   if (start < start_block) {
+   nblocks -= start_block - start;
+   start = start_block;
+   }
+
+   if (nblocks >= minlen) {
+   kunmap_atomic(kaddr);
+
+ 

[PATCH v4 2/2] nilfs2: add FITRIM ioctl support for nilfs2

2014-02-23 Thread Andreas Rohner
This patch adds support for the FITRIM ioctl, which enables user space
tools to issue TRIM/DISCARD requests to the underlying device. Every
clean segment within the specified range will be discarded.
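
One detail that is easy to miss in the short description above: the
handler raises the user supplied minlen to the device's discard
granularity, and nilfs_sufile_trim_fs then converts it to blocks and
skips any extent shorter than that. A small stand-alone sketch of this
arithmetic (granularity, minlen and block size are assumptions picked
for illustration):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t discard_granularity = 1024 * 1024;	/* assumed device value */
	uint64_t user_minlen = 4096;			/* from struct fstrim_range */
	uint32_t blocksize_bits = 12;			/* assume 4 KiB blocks */

	/* nilfs_ioctl_trim_fs(): never trim extents smaller than the
	 * device's discard granularity */
	uint64_t minlen_bytes = user_minlen > discard_granularity ?
				user_minlen : discard_granularity;

	/* nilfs_sufile_trim_fs(): convert to file system blocks */
	uint64_t minlen_blocks = minlen_bytes >> blocksize_bits;

	printf("extents shorter than %llu blocks are skipped\n",
	       (unsigned long long)minlen_blocks);
	return 0;
}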

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/ioctl.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 2b34021..f3967e3 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1072,6 +1072,48 @@ out:
 }
 
 /**
+ * nilfs_ioctl_trim_fs() - trim ioctl handler function
+ * @inode: inode object
+ * @argp: pointer on argument from userspace
+ *
+ * Description: nilfs_ioctl_trim_fs is the FITRIM ioctl handler. It checks
+ * the arguments from userspace and calls nilfs_sufile_trim_fs, which
+ * performs the actual trim operation.
+ *
+ * Return Value: 0 on success, or a negative error code on failure.
+ */
+static int nilfs_ioctl_trim_fs(struct inode *inode, void __user *argp)
+{
+   struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
+   struct request_queue *q = bdev_get_queue(nilfs->ns_bdev);
+   struct fstrim_range range;
+   int ret;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   if (!blk_queue_discard(q))
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(&range, argp, sizeof(range)))
+   return -EFAULT;
+
+   range.minlen = max_t(u64, range.minlen, q->limits.discard_granularity);
+
+   down_read(&nilfs->ns_segctor_sem);
+   ret = nilfs_sufile_trim_fs(nilfs->ns_sufile, &range);
+   up_read(&nilfs->ns_segctor_sem);
+
+   if (ret < 0)
+   return ret;
+
+   if (copy_to_user(argp, &range, sizeof(range)))
+   return -EFAULT;
+
+   return 0;
+}
+
+/**
  * nilfs_ioctl_set_alloc_range - limit range of segments to be allocated
  * @inode: inode object
  * @argp: pointer on argument from userspace
@@ -1205,6 +1247,8 @@ long nilfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
return nilfs_ioctl_resize(inode, filp, argp);
case NILFS_IOCTL_SET_ALLOC_RANGE:
return nilfs_ioctl_set_alloc_range(inode, argp);
+   case FITRIM:
+   return nilfs_ioctl_trim_fs(inode, argp);
default:
return -ENOTTY;
}
@@ -1235,6 +1279,7 @@ long nilfs_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
case NILFS_IOCTL_SYNC:
case NILFS_IOCTL_RESIZE:
case NILFS_IOCTL_SET_ALLOC_RANGE:
+   case FITRIM:
break;
default:
return -ENOIOCTLCMD;
-- 
1.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/2] nilfs2: add FITRIM ioctl support for nilfs2

2014-02-21 Thread Andreas Rohner
This patch adds support for the FITRIM ioctl, which enables user space
tools to issue TRIM/DISCARD requests to the underlying device. Every
clean segment within the specified range will be discarded.

Signed-off-by: Andreas Rohner 
---
 fs/nilfs2/ioctl.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 2b34021..8051cb3 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1072,6 +1072,49 @@ out:
 }
 
 /**
+ * nilfs_ioctl_trim_fs() - trim ioctl handler function
+ * @inode: inode object
+ * @argp: pointer on argument from userspace
+ *
+ * Description: nilfs_ioctl_trim_fs is the FITRIM ioctl handler. It checks
+ * the arguments from userspace and calls nilfs_sufile_trim_fs, which
+ * performs the actual trim operation.
+ *
+ * Return Value: 0 on success, or a negative error code on failure.
+ */
+static int nilfs_ioctl_trim_fs(struct inode *inode, void __user *argp)
+{
+   struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
+   struct request_queue *q = bdev_get_queue(nilfs->ns_bdev);
+   struct fstrim_range range;
+   int ret;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   if (!blk_queue_discard(q))
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(&range, argp, sizeof(range)))
+   return -EFAULT;
+
+   range.minlen = max_t(unsigned int, (unsigned int)range.minlen,
+   q->limits.discard_granularity);
+
+   down_read(&nilfs->ns_segctor_sem);
+   ret = nilfs_sufile_trim_fs(nilfs->ns_sufile, &range);
+   up_read(&nilfs->ns_segctor_sem);
+
+   if (ret < 0)
+   return ret;
+
+   if (copy_to_user(argp, &range, sizeof(range)))
+   return -EFAULT;
+
+   return 0;
+}
+
+/**
  * nilfs_ioctl_set_alloc_range - limit range of segments to be allocated
  * @inode: inode object
  * @argp: pointer on argument from userspace
@@ -1205,6 +1248,8 @@ long nilfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
return nilfs_ioctl_resize(inode, filp, argp);
case NILFS_IOCTL_SET_ALLOC_RANGE:
return nilfs_ioctl_set_alloc_range(inode, argp);
+   case FITRIM:
+   return nilfs_ioctl_trim_fs(inode, argp);
default:
return -ENOTTY;
}
@@ -1235,6 +1280,7 @@ long nilfs_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
case NILFS_IOCTL_SYNC:
case NILFS_IOCTL_RESIZE:
case NILFS_IOCTL_SET_ALLOC_RANGE:
+   case FITRIM:
break;
default:
return -ENOIOCTLCMD;
-- 
1.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

