Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy

2015-03-14 Thread Ryusuke Konishi
On Sat, 14 Mar 2015 13:24:25 +0100, Andreas Rohner wrote:
> Hi Ryusuke,
> 
> Thank you very much for your detailed review and feedback. I agree with
> all of your points and I will start working on a rewrite immediately.
> 
> On 2015-03-12 13:54, Ryusuke Konishi wrote:
>> Hi Andreas,
>> 
>> On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:
>>> Hi Ryusuke,
>>>
>>> Thanks for your thorough review.
>>>
>>> On 2015-03-10 06:21, Ryusuke Konishi wrote:
>>>> Hi Andreas,
>>>>
>>>> I looked through the whole set of kernel patches and part of the util patches.
>>>> Overall comments are as follows:
>>>>
>>>> [Algorithm]
>>>> As for the algorithm, it looks about OK except for the starvation
>>>> countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
>>>> it's good that it doesn't change the kernel/userland interface; we may be
>>>> able to replace it with better ways in the future or in a revised
>>>> version of this patchset.
>>>>
>>>> (1) Drawback of the starvation countermeasure
>>>> The patch 9/9 looks likely to make the execution time of the chcp operation
>>>> worse since it will scan through the sufile to modify live block
>>>> counters.  How much does it prolong the execution time?
>>>
>>> I'll do some tests, but I haven't noticed any significant performance
>>> drop. The GC basically does the same thing every time it selects
>>> segments to reclaim.
>> 
>> GC is performed in the background by an independent process.  What I
>> care about is that the NILFS_IOCTL_CHANGE_CPMODE ioctl is called from
>> a command line interface or an application.  They differ in this respect.
>> 
>> Was a worst-case scenario considered in the test?
>> 
>> For example:
>> 1. Fill a TB-class drive with data file(s), and make a snapshot on it.
>> 2. Run one pass of GC to update the snapshot block counts.
>> 3. And do "chcp cp"
>> 
>> If we don't observe a noticeable delay on this class of drive, then I
>> think we can put the problem off.
> 
> Yesterday I did a worst-case test as you suggested. I used an old 1 TB
> hard drive I had lying around. This was my setup:
> 
> 1. Write an 850 GB file
> 2. Create a snapshot
> 3. Delete the file
> 4. Let GC run through all segments
> 5. Verify with lssu that the GC has updated all SUFILE entries
> 6. Drop the page cache
> 7. chcp cp
> 
> The following results are with the page cache dropped immediately before
> each call:
> 
> 1. chcp ss
> real  0m1.337s
> user  0m0.017s
> sys   0m0.030s
> 
> 2. chcp cp
> real  0m6.377s
> user  0m0.023s
> sys   0m0.053s
> 
> The following results are without the drop of the page cache:
> 
> 1. chcp ss
> real  0m0.137s
> user  0m0.010s
> sys   0m0.000s
> 
> 2. chcp cp
> real  0m0.016s
> user  0m0.010s
> sys   0m0.007s
> 
> There are 119233 segments in my test. Each SUFILE entry uses 32 bytes.
> So the worst case for 1 TB with 8 MB segments would be 3.57 MB of random
> reads and one continuous 3.57 MB write. You only get 6.377s because my
> hard drive is so slow. You wouldn't notice any difference on a modern
> SSD. Furthermore, the SUFILE is also scanned by the segment allocation
> algorithm and the GC, so it is very likely already in the page cache.

6.377s is too long because nilfs_sufile_fix_starving_segs() takes the
sufile mi_sem and, worse, lengthens the hold time of the following locks:

 - cpfile mi_sem (held at nilfs_cpfile_clear_snapshot()).
 - transaction lock (held at nilfs_ioctl_change_cpmode()).
 - ns_snapshot_mount_mutex (held at nilfs_ioctl_change_cpmode()).

leading to a freeze of all write operations, lssu, lscp, cleanerd,
snapshot mounts, and so on.

It would be preferable to move the function outside of these locks and
to release and reacquire the transaction lock and the sufile mi_sem at
regular intervals in some way.

Regards,
Ryusuke Konishi


Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-03-14 Thread Ryusuke Konishi

One more comment.

On Sat, 14 Mar 2015 12:51:09 +0900 (JST), Ryusuke Konishi wrote:
> On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
>> @@ -1050,6 +1069,85 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  }
>>  
>>  /**
>> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
>> + * @sufile: inode of segment usage file
>> + *
>> + * Description: Scans for segments, which are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This way the segment is more likely to be
>> + * chosen by the GC. A segment is marked as potentially starving, if more
>> + * than half of the blocks it contains are protected by snapshots.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile)
>> +{
>> +        struct buffer_head *su_bh;
>> +        struct nilfs_segment_usage *su;
>> +        size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
>> +        struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>> +        void *kaddr;
>> +        unsigned long nsegs, segusages_per_block;
>> +        __u32 max_segblks = nilfs->ns_blocks_per_segment / 2;
>> +        __u64 segnum = 0;
>> +        int ret = 0, blkdirty, dirty = 0;
>> +
>> +        down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
>> +        segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
>> +        nsegs = nilfs_sufile_get_nsegments(sufile);
>> +
>> +        while (segnum < nsegs) {
>> +                n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +                                                         nsegs - 1);
>> +
>> +                ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +                                                           0, &su_bh);
>> +                if (ret < 0) {
>> +                        if (ret != -ENOENT)
>> +                                goto out;
>> +                        /* hole */
>> +                        segnum += n;
>> +                        continue;
>> +                }
>> +
>> +                kaddr = kmap_atomic(su_bh->b_page);
>> +                su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +                                                          su_bh, kaddr);
>> +                blkdirty = 0;
>> +                for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
>> +                        if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
>> +                                continue;
>> +
>> +                        if (su->su_nlive_blks <= max_segblks)
>> +                                continue;
>> +
>> +                        su->su_nlive_blks = max_segblks;
>> +                        blkdirty = 1;
>> +                }
>> +
>> +                kunmap_atomic(kaddr);
>> +                if (blkdirty) {
>> +                        mark_buffer_dirty(su_bh);
>> +                        dirty = 1;
>> +                }
>> +                put_bh(su_bh);

Insert cond_resched() here to mitigate latency issues (mainly for
environments in which voluntary preemption is turned off).
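
For clarity, the tail of the loop with the suggested call would then read
roughly as follows (a sketch based on the hunk quoted above, not a tested
change):

        while (segnum < nsegs) {
                /* ... per-block processing as above ... */
                kunmap_atomic(kaddr);
                if (blkdirty) {
                        mark_buffer_dirty(su_bh);
                        dirty = 1;
                }
                put_bh(su_bh);

                /* yield between SUFILE blocks to limit scheduling latency */
                cond_resched();
        }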

Regards,
Ryusuke Konishi

>> +        }
>> +
>> + out:
>> +        if (dirty)
>> +                nilfs_mdt_mark_dirty(sufile);
>> +
>> +        up_write(&NILFS_MDT(sufile)->mi_sem);
>> +        return ret;
>> +}
>> +
>> +/**
>>   * nilfs_sufile_trim_fs() - trim ioctl handle function
>>   * @sufile: inode of segment usage file
>>   * @range: fstrim_range structure


Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-03-14 Thread Ryusuke Konishi
On Sat, 14 Mar 2015 13:36:35 +0100, Andreas Rohner wrote:
> On 2015-03-14 04:51, Ryusuke Konishi wrote:
>> On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
>>> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
>>> index 6ffdc09..a3c7593 100644
>>> --- a/include/linux/nilfs2_fs.h
>>> +++ b/include/linux/nilfs2_fs.h
>>> @@ -222,11 +222,13 @@ struct nilfs_super_block {
>>>   */
>>>  #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION  (1ULL << 0)
>>>  #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS   (1ULL << 1)
>>> +#define NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS   (1ULL << 2)
>>>  
>>>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT(1ULL << 0)
>>>  
>>>  #define NILFS_FEATURE_COMPAT_SUPP  (NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
>>> -   | NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
>>> +   | NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS \
>>> +   | NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS)
>>> #define NILFS_FEATURE_COMPAT_RO_SUPP   NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>>>  #define NILFS_FEATURE_INCOMPAT_SUPP0ULL
>>>  
>> 
>> You don't have to add three compat flags just for this one patchset.
>> Please unify them.
>> 
>> #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS (1ULL << 0)
>> 
>> looks to be enough.
> 
> I could merge the TRACK_LIVE_BLKS and TRACK_SNAPSHOTS flags, but I would
> suggest at least keeping the SUFILE_EXTENSION flag (maybe with a
> different name). The SUFILE_EXTENSION flag has to be set at mkfs time
> and cannot be set or removed later, because the on-disk format cannot be
> changed afterwards. I actually set SUFILE_EXTENSION by default in mkfs,
> because it is not harmful and it gives the user the option to switch the
> other flags on later.

I see, that sounds reasonable.
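
For reference, the unified flag set might then look roughly like this
(purely illustrative; the final names and bit assignments are of course
up to the next revision of the patchset):

#define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION  (1ULL << 0)
#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS   (1ULL << 1)  /* also covers snapshot tracking */

#define NILFS_FEATURE_COMPAT_SUPP      (NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
                                        | NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)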

Regards,
Ryusuke Konishi


Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-03-14 Thread Andreas Rohner
On 2015-03-14 04:51, Ryusuke Konishi wrote:
> On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
>> It doesn't really matter if the number of reclaimable blocks for a
>> segment is inaccurate, as long as the overall performance is better than
>> with the simple timestamp algorithm and starvation is prevented.
>>
>> The following steps will lead to starvation of a segment:
>>
>> 1. The segment is written
>> 2. A snapshot is created
>> 3. The files in the segment are deleted and the number of live
>>    blocks for the segment is decremented to a very low value
>> 4. The GC tries to free the segment, but there are no reclaimable
>>    blocks, because they are all protected by the snapshot. To prevent an
>>    infinite loop the GC has to adjust the number of live blocks to the
>>    correct value.
>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>    segment are now reclaimable.
>> 6. The GC will never attempt to clean the segment again, because it
>>    incorrectly shows up as having a high number of live blocks.
>>
>> To prevent this, the already existing padding field of the SUFILE entry
>> is used to track the number of snapshot blocks in the segment. This
>> number is only set by the GC, since it collects the necessary
>> information anyway. So there is no need to track which block belongs to
>> which segment. In step 4 of the list above, the GC sets the new field
>> su_nsnapshot_blks. In step 5, all entries in the SUFILE are checked, and
>> entries with a large su_nsnapshot_blks field get their su_nlive_blks
>> field reduced.
>>
>> Signed-off-by: Andreas Rohner 
>> ---
>>  fs/nilfs2/cpfile.c|   5 ++
>>  fs/nilfs2/segbuf.c|   1 +
>>  fs/nilfs2/segbuf.h|   1 +
>>  fs/nilfs2/segment.c   |   7 ++-
>>  fs/nilfs2/sufile.c| 114 ++
>>  fs/nilfs2/sufile.h|   4 +-
>>  fs/nilfs2/the_nilfs.h |   7 +++
>>  include/linux/nilfs2_fs.h |  12 +++--
>>  8 files changed, 136 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
>> index 0d58075..6b61fd7 100644
>> --- a/fs/nilfs2/cpfile.c
>> +++ b/fs/nilfs2/cpfile.c
>> @@ -28,6 +28,7 @@
>>  #include 
>>  #include "mdt.h"
>>  #include "cpfile.h"
>> +#include "sufile.h"
>>  
>>  
>>  static inline unsigned long
>> @@ -703,6 +704,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
>>  struct nilfs_cpfile_header *header;
>>  struct nilfs_checkpoint *cp;
>>  struct nilfs_snapshot_list *list;
>> +struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
>>  __u64 next, prev;
>>  void *kaddr;
>>  int ret;
>> @@ -784,6 +786,9 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
>>  mark_buffer_dirty(header_bh);
>>  nilfs_mdt_mark_dirty(cpfile);
>>  
>> +if (nilfs_feature_track_snapshots(nilfs))
>> +nilfs_sufile_fix_starving_segs(nilfs->ns_sufile);
>> +
>>  brelse(prev_bh);
>>  
>>   out_next:
>> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
>> index bbd807b..a98c576 100644
>> --- a/fs/nilfs2/segbuf.c
>> +++ b/fs/nilfs2/segbuf.c
>> @@ -59,6 +59,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>>  segbuf->sb_super_root = NULL;
>>  segbuf->sb_nlive_blks_added = 0;
>>  segbuf->sb_nlive_blks_diff = 0;
>> +segbuf->sb_nsnapshot_blks = 0;
>>  
>>  init_completion(&segbuf->sb_bio_event);
>>  atomic_set(&segbuf->sb_err, 0);
>> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
>> index 4e994f7..7a462c4 100644
>> --- a/fs/nilfs2/segbuf.h
>> +++ b/fs/nilfs2/segbuf.h
>> @@ -85,6 +85,7 @@ struct nilfs_segment_buffer {
>>  unsigned                sb_rest_blocks;
>>  __u32   sb_nlive_blks_added;
>>  __s64   sb_nlive_blks_diff;
>> +__u32   sb_nsnapshot_blks;
>>  
>>  /* Buffers */
>>  struct list_head        sb_segsum_buffers;
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index 16c7c36..b976198 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -1381,6 +1381,7 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>>  (segbuf->sb_pseg_start - segbuf->sb_fseg_start);
>>  ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>>   live_blocks,
>> + segbuf->sb_nsnapshot_blks,
>>   sci->sc_seg_ctime);
>>  WARN_ON(ret); /* always succeed because the segusage is dirty */
>>  
>> @@ -1405,7 +1406,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
>>  segbuf = NILFS_FIRST_SEGBUF(logs);
>>  ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>>   segbuf->sb_pseg_start -
>> -  

Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy

2015-03-14 Thread Andreas Rohner
Hi Ryusuke,

Thank you very much for your detailed review and feedback. I agree with
all of your points and I will start working on a rewrite immediately.

On 2015-03-12 13:54, Ryusuke Konishi wrote:
> Hi Andreas,
> 
> On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:
>> Hi Ryusuke,
>>
>> Thanks for your thorough review.
>>
>> On 2015-03-10 06:21, Ryusuke Konishi wrote:
>>> Hi Andreas,
>>>
>>> I looked through the whole set of kernel patches and part of the util patches.
>>> Overall comments are as follows:
>>>
>>> [Algorithm]
>>> As for the algorithm, it looks about OK except for the starvation
>>> countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
>>> it's good that it doesn't change the kernel/userland interface; we may be
>>> able to replace it with better ways in the future or in a revised
>>> version of this patchset.
>>>
>>> (1) Drawback of the starvation countermeasure
>>> The patch 9/9 looks likely to make the execution time of the chcp operation
>>> worse since it will scan through the sufile to modify live block
>>> counters.  How much does it prolong the execution time?
>>
>> I'll do some tests, but I haven't noticed any significant performance
>> drop. The GC basically does the same thing every time it selects
>> segments to reclaim.
> 
> GC is performed in the background by an independent process.  What I
> care about is that the NILFS_IOCTL_CHANGE_CPMODE ioctl is called from
> a command line interface or an application.  They differ in this respect.
> 
> Was a worst-case scenario considered in the test?
> 
> For example:
> 1. Fill a TB-class drive with data file(s), and make a snapshot on it.
> 2. Run one pass of GC to update the snapshot block counts.
> 3. And do "chcp cp"
> 
> If we don't observe a noticeable delay on this class of drive, then I
> think we can put the problem off.

Yesterday I did a worst-case test as you suggested. I used an old 1 TB
hard drive I had lying around. This was my setup:

1. Write an 850 GB file
2. Create a snapshot
3. Delete the file
4. Let GC run through all segments
5. Verify with lssu that the GC has updated all SUFILE entries
6. Drop the page cache
7. chcp cp

The following results are with the page cache dropped immediately before
each call:

1. chcp ss
real  0m1.337s
user  0m0.017s
sys   0m0.030s

2. chcp cp
real  0m6.377s
user  0m0.023s
sys   0m0.053s

The following results are without the drop of the page cache:

1. chcp ss
real  0m0.137s
user  0m0.010s
sys   0m0.000s

2. chcp cp
real  0m0.016s
user  0m0.010s
sys   0m0.007s

There are 119233 segments in my test. Each SUFILE entry uses 32 bytes.
So the worst case for 1 TB with 8 MB segments would be 3.57 MB of random
reads and one continuous 3.57 MB write. You only get 6.377s because my
hard drive is so slow. You wouldn't notice any difference on a modern
SSD. Furthermore, the SUFILE is also scanned by the segment allocation
algorithm and the GC, so it is very likely already in the page cache.

>>> In a use case of nilfs, many snapshots are created and they are
>>> automatically changed back to plain checkpoints because old
>>> snapshots are thinned out over time.  The patch 9/9 may impact
>>> such usage.
>>>
>>> (2) Compatibility
>>> What will happen in the following case:
>>> 1. Create a file system, use it with the new module, and
>>>    create snapshots.
>>> 2. Mount it with an old module, and release a snapshot with "chcp cp"
>>> 3. Mount it with the new module, and cleanerd runs GC with the
>>>    cost-benefit or greedy policy.
>>
>> Some segments could be subject to starvation. But it would probably only
>> affect a small number of segments and it could be fixed by "chcp ss
>> ; chcp cp ".
> 
> Ok, let's treat this as a restriction for now.
> If you come up with any good ideas, please propose them.
> 
>>> (3) Durability against unexpected power failures (just a note)
>>> The current patchset looks not to cause a starvation issue even when
>>> an unexpected power failure occurs during or after executing "chcp
>>> cp", because nilfs_ioctl_change_cpmode() makes its changes in a
>>> transactional way with nilfs_transaction_begin/commit.
>>> We should always consider this kind of situation to keep consistency.
>>>
>>> [Coding Style]
>>> (4) This patchset has several coding style issues. Please fix them and
>>> re-check with the latest checkpatch script (scripts/checkpatch.pl).
>>
>> I'll fix that. Sorry.
>>
>>> patch 2:
>>> WARNING: Prefer kmalloc_array over kmalloc with multiply
>>> #85: FILE: fs/nilfs2/sufile.c:1192:
>>> +mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
>>>
>>> patch 5,6:
>>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>>> #60: 
>>> the same semaphore has to be aquired. So if the DAT-Entry belongs to
>>>
>>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>>> #46: 
>>> be aquired, which blocks the entire SUFILE and effectively turns
>>>
>>> WARNING: 'aquired' may be misspelled - perhaps 'acquir