Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out

2015-05-31 Thread Ryusuke Konishi
On Sun, 10 May 2015 13:04:18 +0200, Andreas Rohner wrote:
> On 2015-05-09 14:17, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
[...]
>> 
>> Uum.  This still looks to have the potential for a leak of dirty
>> block collection between DAT and SUFILE, since this retry is limited
>> by a fixed retry count.
>> 
>> How about adding a function that temporarily turns off live block
>> tracking, and using it after this propagation loop until the log
>> write finishes?
>> 
>> It would reduce the accuracy of the live block count, but is that
>> enough?  What do you think?  We have to eliminate the possibility of
>> the leak because it can cause file system corruption.  Every
>> checkpoint must be self-contained.
> 
> How exactly could it lead to file system corruption? Maybe I am
> missing something important here, but it seems to me that no
> corruption is possible.
> 
> The nilfs_sufile_flush_cache_node() function only reads in already
> existing blocks. No new blocks are created. If I mark those blocks
> dirty, the btree is not changed at all. If I do not call
> nilfs_bmap_propagate(), then the btree stays unchanged and there are no
> dangling pointers. The resulting checkpoint should be self-contained.

Good point.  As for the btree, it looks like no inconsistency issue
arises, since nilfs_sufile_flush_cache_node() never inserts new blocks,
as you pointed out.  We also have to care about consistency between the
sufile header and the sufile data blocks, and about the block count in
the inode, but fortunately these look to be OK, too.

However, I still think it's not good to carry dirty blocks over to the
next segment construction; we should avoid extra checkpoint creation
and keep things simple.

From this viewpoint, I also prefer that nilfs_sufile_flush_cache() and
nilfs_sufile_flush_cache_node() be changed a bit so that they skip
adjusting su_nlive_blks and su_nlive_lastmod if the sufile block that
includes the segment usage is not marked dirty and only_mark == 0, in
addition to turning off live block counting temporarily after the
sufile/DAT propagation loop.
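
Just to illustrate the kind of check I have in mind, here is a rough,
untested sketch; the helper name is invented, and only_mark,
su_nlive_blks and su_nlive_lastmod follow your patch set rather than
mainline:

#include <linux/buffer_head.h>
#include <linux/types.h>

/*
 * Hypothetical helper: decide whether a cached live block adjustment
 * may be applied to a segment usage entry.  su_bh is the buffer head
 * of the sufile block that contains the entry.
 */
static bool nilfs_sufile_may_adjust_nlive(struct buffer_head *su_bh,
					  int only_mark)
{
	/*
	 * If we are not in only-mark mode and the sufile block is still
	 * clean, applying the adjustment would dirty a new block after
	 * the sufile/DAT propagation loop; skip it and accept slightly
	 * stale su_nlive_blks / su_nlive_lastmod values instead.
	 */
	return only_mark || buffer_dirty(su_bh);
}

nilfs_sufile_flush_cache_node() would then simply skip the entry when
this returns false.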

> 
> The only problem would be that I could lose some nlive_blks updates.
> 

Regards,
Ryusuke Konishi


Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-05-31 Thread Ryusuke Konishi
On Sun, 31 May 2015 20:13:44 +0200, Andreas Rohner wrote:
> On 2015-05-31 18:45, Ryusuke Konishi wrote:
>> On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
>>> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>>>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
[...]
>>>>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
>>>>     of blocks_per_segment.  It is not clear if the ratio is good
>>>>     (versatile).
>>>
>>> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
>>>
>>> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
>>> check the segment again in the future. I guess anything between 25% and
>>> 75% would also work.
>> 
>> Sounds reasonable.
>> 
>> By the way, I am thinking we should move cleanerd into the kernel as
>> soon as we can.  It is not only inefficient due to the large amount
>> of data exchanged between kernel and user-land, but it also hinders
>> changes like the ones we are trying to make.  We have to care about
>> compatibility unnecessarily because of the early design mistake (i.e.
>> the separation of the GC into user-land).
> 
> I am a bit confused. Is it OK if I implement this functionality in
> nilfs_cleanerd for this patch set, or would it be better to implement it
> with a workqueue in the kernel, like you've suggested before?
> 
> If you intend to move nilfs_cleanerd into the kernel anyway, then the
> latter would make more sense to me. Which implementation do you prefer
> for this patch set?

If nilfs_cleanerd remains in userland, then the userland
implementation looks better.  But, yes, if we move the cleaner into the
kernel, then the kernel implementation looks better because we may be
able to avoid an unnecessary API change.  It's a dilemma.

Do you have any good ideas for reducing or hiding the overhead of the
calibration (i.e. the traversal rewrite of the sufile) with regard to
the kernel implementation?
I'm inclined to leave that in the kernel for now.

Regards,
Ryusuke Konishi

> 
> Regards,
> Andreas Rohner
> 
>> Regards,
>> Ryusuke Konishi


Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-05-31 Thread Andreas Rohner
On 2015-05-31 18:45, Ryusuke Konishi wrote:
> On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
>> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>>>> It doesn't really matter if the number of reclaimable blocks for a
>>>> segment is inaccurate, as long as the overall performance is better than
>>>> the simple timestamp algorithm and starvation is prevented.
>>>>
>>>> The following steps will lead to starvation of a segment:
>>>>
>>>> 1. The segment is written
>>>> 2. A snapshot is created
>>>> 3. The files in the segment are deleted and the number of live
>>>>    blocks for the segment is decremented to a very low value
>>>> 4. The GC tries to free the segment, but there are no reclaimable
>>>>    blocks, because they are all protected by the snapshot. To prevent an
>>>>    infinite loop the GC has to adjust the number of live blocks to the
>>>>    correct value.
>>>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>>>    segment are now reclaimable.
>>>> 6. The GC will never attempt to clean the segment again, because it
>>>>    looks as if it had a high number of live blocks.
>>>>
>>>> To prevent this, the already existing padding field of the SUFILE entry
>>>> is used to track the number of snapshot blocks in the segment. This
>>>> number is only set by the GC, since it collects the necessary
>>>> information anyway. So there is no need to track which block belongs to
>>>> which segment. In step 4 of the list above the GC will set the new field
>>>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>>>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>>>> reduced.
>>>>
>>>> Signed-off-by: Andreas Rohner 
>>>
>>> I still don't know whether this workaround is the approach we should
>>> take or not.  This patch has several drawbacks:
>>>
>>>  1. It introduces overhead to every "chcp cp" operation
>>>     due to the traversal rewrite of the sufile.
>>>     If the ratio of snapshot-protected blocks is high, then
>>>     this overhead will be big.
>>>
>>>  2. The traversal rewrite of the sufile will cause many sufile blocks
>>>     to be written out.  If most blocks are protected by a snapshot,
>>>     more than 4MB of sufile blocks will be written per 1TB of capacity.
>>>
>>> Even though this rewrite may not happen for consecutive "chcp cp"
>>> operations, it still has the potential to create sufile block writes
>>> if the application using nilfs manipulates snapshots frequently.
>>
>> I could also implement this functionality in nilfs_cleanerd in
>> userspace.  Every time a "chcp cp" happens, some kind of permanent
>> flag like "snapshot_was_recently_deleted" would be set at an
>> appropriate location.  The flag could be returned with the GET_SUSTAT
>> ioctl().  Then nilfs_cleanerd would, at certain intervals and if the
>> flag is set, check all segments with the GET_SUINFO ioctl() and
>> correct the ones that have potentially invalid values with the
>> SET_SUINFO ioctl().  After that it would clear the
>> "snapshot_was_recently_deleted" flag.  What do you think about this
>> idea?
> 
> Sorry for my late reply.

No problem. I was also very busy last week.

> I think moving the functionality to cleanerd, and passing some sort
> of information to userland through an ioctl for that, is a good idea,
> except that I feel the ioctl should be GET_CPSTAT instead of
> GET_SUINFO because it is checkpoint/snapshot related information.

Ok good idea.

> I think the parameter that should be added is a set of statistics,
> including the number of snapshots deleted since the file system was
> last mounted (1).  The counter (1) can serve as the
> "snapshot_was_recently_deleted" flag if it monotonically increases.
> Although we could use the timestamp of when a snapshot was last
> deleted, it is less preferable than the counter (1) because the system
> clock may be rewound and it also has precision issues.

I agree, a counter is better than a simple flag.

> Note that we must add GET_CPSTAT_V2 (or GET_SUSTAT_V2) and a
> corresponding structure (i.e. nilfs_cpstat_v2, or so), since ioctl
> codes depend on the size of the argument data, which would change in
> both ioctls; unfortunately, neither the GET_CPSTAT nor the GET_SUSTAT
> ioctl is expandable.  Ioctls like EVIOCGKEYCODE_V2 can serve as a
> reference for this issue.
> 
>>
>> If the policy is "timestamp" the GC would of course skip this scan,
>> because it is unnecessary.
>>
>>>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
>>> of blocks_per_segment.  It is not clear if the ratio is good
>>> (versatile).
>>
>> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
>>
>> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
>> check the segment again in the future. I guess anything between 25% and
>> 75% would also work.
> 
> Sounds reasonable.
> 
> By the way, I am thinking we should move cleanerd into the kernel as
> soon as we can.  It is not only inefficient due to the large amount
> of data exchanged between kernel and user-land, but it also hinders
> changes like the ones we are trying to make.  We have to care about
> compatibility unnecessarily because of the early design mistake (i.e.
> the separation of the GC into user-land).

Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots

2015-05-31 Thread Ryusuke Konishi
On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>>> It doesn't really matter if the number of reclaimable blocks for a
>>> segment is inaccurate, as long as the overall performance is better than
>>> the simple timestamp algorithm and starvation is prevented.
>>>
>>> The following steps will lead to starvation of a segment:
>>>
>>> 1. The segment is written
>>> 2. A snapshot is created
>>> 3. The files in the segment are deleted and the number of live
>>>blocks for the segment is decremented to a very low value
>>> 4. The GC tries to free the segment, but there are no reclaimable
>>>blocks, because they are all protected by the snapshot. To prevent an
>>>infinite loop the GC has to adjust the number of live blocks to the
>>>correct value.
>>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>>segment are now reclaimable.
>>> 6. The GC will never attempt to clean the segment again, because it
>>>looks as if it had a high number of live blocks.
>>>
>>> To prevent this, the already existing padding field of the SUFILE entry
>>> is used to track the number of snapshot blocks in the segment. This
>>> number is only set by the GC, since it collects the necessary
>>> information anyway. So there is no need to track which block belongs to
>>> which segment. In step 4 of the list above the GC will set the new field
>>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>>> reduced.
>>>
>>> Signed-off-by: Andreas Rohner 
>> 
>> I still don't know whether this workaround is the approach we should
>> take or not.  This patch has several drawbacks:
>> 
>>  1. It introduces overhead to every "chcp cp" operation
>>     due to the traversal rewrite of the sufile.
>>     If the ratio of snapshot-protected blocks is high, then
>>     this overhead will be big.
>> 
>>  2. The traversal rewrite of the sufile will cause many sufile blocks
>>     to be written out.  If most blocks are protected by a snapshot,
>>     more than 4MB of sufile blocks will be written per 1TB of capacity.
>> 
>> Even though this rewrite may not happen for consecutive "chcp cp"
>> operations, it still has the potential to create sufile block writes
>> if the application using nilfs manipulates snapshots frequently.
> 
> I could also implement this functionality in nilfs_cleanerd in
> userspace.  Every time a "chcp cp" happens, some kind of permanent
> flag like "snapshot_was_recently_deleted" would be set at an
> appropriate location.  The flag could be returned with the GET_SUSTAT
> ioctl().  Then nilfs_cleanerd would, at certain intervals and if the
> flag is set, check all segments with the GET_SUINFO ioctl() and
> correct the ones that have potentially invalid values with the
> SET_SUINFO ioctl().  After that it would clear the
> "snapshot_was_recently_deleted" flag.  What do you think about this
> idea?

Sorry for my late reply.

I think moving the functionality to cleanerd, and passing some sort
of information to userland through an ioctl for that, is a good idea,
except that I feel the ioctl should be GET_CPSTAT instead of
GET_SUINFO because it is checkpoint/snapshot related information.
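
For reference, the cleanerd-side scan might look roughly like the
untested sketch below; nilfs_get_sustat(), nilfs_get_suinfo() and
nilfs_set_suinfo() are existing libnilfs calls, whereas the
sui_nsnapshot_blks / sui_nlive_blks fields and the
NILFS_SUINFO_UPDATE_NLIVE_BLKS flag are assumed from your patch set:

#include <string.h>
#include <nilfs.h>	/* nilfs-utils */

/*
 * Reduce the live block count of segments whose blocks were mostly
 * protected by a (possibly deleted) snapshot, so that the GC will
 * reconsider them later.
 */
static void nilfs_cleanerd_calibrate_sui(struct nilfs *nilfs,
					 __u32 blocks_per_segment)
{
	struct nilfs_sustat sustat;
	struct nilfs_suinfo si;
	struct nilfs_suinfo_update sup;
	__u32 max_segblks = blocks_per_segment / 2;	/* 50% threshold */
	__u64 segnum;

	if (nilfs_get_sustat(nilfs, &sustat) < 0)
		return;

	for (segnum = 0; segnum < sustat.ss_nsegs; segnum++) {
		if (nilfs_get_suinfo(nilfs, segnum, &si, 1) != 1)
			continue;
		if (si.sui_nsnapshot_blks <= max_segblks)  /* assumed field */
			continue;

		memset(&sup, 0, sizeof(sup));
		sup.sup_segnum = segnum;
		sup.sup_flags = NILFS_SUINFO_UPDATE_NLIVE_BLKS;	/* assumed */
		sup.sup_sui.sui_nlive_blks = max_segblks;	/* assumed field */
		nilfs_set_suinfo(nilfs, &sup, 1);
	}
}

(The 50% value mirrors the max_segblks threshold discussed further
below.)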

I think the parameter that should be added is a set of statistics,
including the number of snapshots deleted since the file system was
last mounted (1).  The counter (1) can serve as the
"snapshot_was_recently_deleted" flag if it monotonically increases.
Although we could use the timestamp of when a snapshot was last
deleted, it is less preferable than the counter (1) because the system
clock may be rewound and it also has precision issues.

Note that we must add GET_CPSTAT_V2 (or GET_SUSTAT_V2) and a
corresponding structure (i.e. nilfs_cpstat_v2, or so), since ioctl
codes depend on the size of the argument data, which would change in
both ioctls; unfortunately, neither the GET_CPSTAT nor the GET_SUSTAT
ioctl is expandable.  Ioctls like EVIOCGKEYCODE_V2 can serve as a
reference for this issue.
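
Just as an illustration of why a new command is needed, a hypothetical
V2 definition might look like this; the 0xC0 command number and the
new field are placeholders, not a proposal for the actual layout:

#include <linux/ioctl.h>
#include <linux/types.h>

/*
 * Hypothetical expanded checkpoint statistics: the first three fields
 * mirror struct nilfs_cpstat, the last one is the new counter of
 * snapshots deleted since the last mount.
 */
struct nilfs_cpstat_v2 {
	__u64 cs_cno;			/* current checkpoint number */
	__u64 cs_ncps;			/* number of checkpoints */
	__u64 cs_nsss;			/* number of snapshots */
	__u64 cs_ndeleted_snapshots;	/* new, monotonically increasing */
};

/*
 * _IOR() folds sizeof(struct nilfs_cpstat_v2) into the command value,
 * so growing the structure changes the ioctl number; hence an explicit
 * _V2 command instead of extending the existing NILFS_IOCTL_GET_CPSTAT
 * ('n' is NILFS_IOCTL_IDENT, 0xC0 is just a placeholder).
 */
#define NILFS_IOCTL_GET_CPSTAT_V2 \
	_IOR('n', 0xC0, struct nilfs_cpstat_v2)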

> 
> If the policy is "timestamp" the GC would of course skip this scan,
> because it is unnecessary.
> 
>>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
>> of blocks_per_segment.  It is not clear if the ratio is good
>> (versatile).
> 
> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
> 
> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
> check the segment again in the future. I guess anything between 25% and
> 75% would also work.

Sounds reasonable.

By the way, I am thinking we should move cleanerd into the kernel as
soon as we can.  It is not only inefficient due to the large amount of
data exchanged between kernel and user-land, but it also hinders
changes like the ones we are trying to make.  We have to care about
compatibility unnecessarily because of the early design mistake (i.e.
the separation of the GC into user-land).