Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
Hi Ryusuke,

Thank you very much for your detailed review and feedback. I agree with all of your points and I will start working on a rewrite immediately.

On 2015-03-12 13:54, Ryusuke Konishi wrote:

Hi Andreas,

On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:

Hi Ryusuke,

Thanks for your thorough review.

On 2015-03-10 06:21, Ryusuke Konishi wrote:

Hi Andreas,

I looked through the whole kernel patches and a part of the util patches. Overall comments are as follows:

[Algorithm]

As for the algorithm, it looks about OK except for the starvation countermeasure. The starvation countermeasure looks ad hoc/hacky, but it's good that it doesn't change the kernel/userland interface; we may be able to replace it with better ways in the future or in a revised version of this patchset.

(1) Drawback of the starvation countermeasure

Patch 9/9 looks to make the execution time of the chcp operation worse, since it will scan through the SUFILE to modify live block counters. How much does it prolong the execution time?

I'll do some tests, but I haven't noticed any significant performance drop. The GC basically does the same thing every time it selects segments to reclaim.

GC is performed in the background by an independent process. What I care about is that the NILFS_IOCTL_CHANGE_CPMODE ioctl is called from a command line interface or an application. They differ in this meaning.

Was a worst case scenario considered in the test? For example:

1. Fill a TB class drive with data file(s), and make a snapshot on it.
2. Run one pass of GC to update snapshot block counts.
3. And do chcp cp.

If we don't observe a noticeable delay on this class of drive, then I think we can put the problem off.

Yesterday I did a worst case test as you suggested. I used an old 1 TB hard drive I had lying around. This was my setup:

1. Write an 850 GB file
2. Create a snapshot
3. Delete the file
4. Let GC run through all segments
5. Verify with lssu that the GC has updated all SUFILE entries
6. Drop the page cache
7. chcp cp

The following results are with the page cache dropped immediately before each call:

1. chcp ss

	real	0m1.337s
	user	0m0.017s
	sys	0m0.030s

2. chcp cp

	real	0m6.377s
	user	0m0.023s
	sys	0m0.053s

The following results are without the drop of the page cache:

1. chcp ss

	real	0m0.137s
	user	0m0.010s
	sys	0m0.000s

2. chcp cp

	real	0m0.016s
	user	0m0.010s
	sys	0m0.007s

There are 119233 segments in my test. Each SUFILE entry uses 32 bytes. So the worst case for 1 TB with 8 MB segments would be about 3.57 MB of random reads and one 3.57 MB continuous write. You only get 6.377s because my hard drive is so slow. You wouldn't notice any difference on a modern SSD. Furthermore, the SUFILE is also scanned by the segment allocation algorithm and the GC, so it is very likely already in the page cache.

In a use case of nilfs, many snapshots are created and they are automatically changed back to plain checkpoints because old snapshots are thinned out over time. Patch 9/9 may have an impact on such usage.

(2) Compatibility

What will happen in the following case:

1. Create a file system, use it with the new module, and create snapshots.
2. Mount it with an old module, and release a snapshot with chcp cp.
3. Mount it with the new module, and cleanerd runs GC with the cost-benefit or greedy policy.

Some segments could be subject to starvation.

But it would probably only affect a small number of segments, and it could be fixed by chcp ss CP; chcp cp CP.

Ok, let's treat this as a restriction for now. If you come up with any good idea, please propose it.

(3) Durability against unexpected power failures (just a note)

The current patchset looks not to cause the starvation issue even when an unexpected power failure occurs during or after executing chcp cp, because nilfs_ioctl_change_cpmode() makes its changes in a transactional way with nilfs_transaction_begin/commit. We should always think about this kind of situation to keep consistency.

[Coding Style]

(4) This patchset has several coding style issues. Please fix them and re-check with the latest checkpatch script (scripts/checkpatch.pl).

I'll fix that. Sorry.

patch 2:

	WARNING: Prefer kmalloc_array over kmalloc with multiply
	#85: FILE: fs/nilfs2/sufile.c:1192:
	+	mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),

patch 5, 6:

	WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
	#60: the same semaphore has to be aquired. So if the DAT-Entry belongs to

	WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
	#46: be aquired, which blocks the entire SUFILE and effectively turns

	WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
	#53: afore mentioned lock only needs to be aquired, if the cache is full

(5) sub_sizeof macro: The same definition exists as offsetofend() in vfio.h, and a patch to move it to stddef.h is now proposed.
Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
On Sat, 14 Mar 2015 13:24:25 +0100, Andreas Rohner wrote:

Hi Ryusuke,

Thank you very much for your detailed review and feedback. I agree with all of your points and I will start working on a rewrite immediately.

[...]

There are 119233 segments in my test. Each SUFILE entry uses 32 bytes. So the worst case for 1 TB with 8 MB segments would be about 3.57 MB of random reads and one 3.57 MB continuous write. You only get 6.377s because my hard drive is so slow. You wouldn't notice any difference on a modern SSD. Furthermore, the SUFILE is also scanned by the segment allocation algorithm and the GC, so it is very likely already in the page cache.

6.377s is too long, because nilfs_sufile_fix_starving_segs() locks the sufile mi_sem and even lengthens the lock period of the following locks:

- cpfile mi_sem (held at nilfs_cpfile_clear_snapshot())
- transaction lock (held at nilfs_ioctl_change_cpmode())
- ns_snapshot_mount_mutex (held at nilfs_ioctl_change_cpmode())

This leads to a freeze of all write operations, lssu, lscp, cleanerd, snapshot mounts, and so on. It is preferable for the function to be moved outside of these locks and to release/reacquire the transaction lock and the sufile mi_sem regularly in some way.

Regards,
Ryusuke Konishi
Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
Hi Andreas,

On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:

[...]

(5) sub_sizeof macro: The same definition exists as offsetofend() in vfio.h, and a patch to move it to stddef.h is now proposed. Please use the same name, and redefine it only if it's not defined:

	#ifndef offsetofend
	#define offsetofend(TYPE, MEMBER) \
		(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
	#endif

Ok, I'll change that.

[Implementation]

(6) b_blocknr

Please do not use bh->b_blocknr to store the disk block number. This field is used to keep the virtual block number except for DAT files. It is only replaced with an actual block number during the call to submit_bh(). Keep this policy.

As far as I can tell, this is only true for blocks of GC inodes and node blocks. All other buffer_heads are always mapped to on-disk blocks by nilfs_get_block(). I only added the mapping in nilfs_segbuf_submit_bh() to correctly set the value in b_blocknr to the new location.

nilfs_get_block() is only used for regular files, directories, and so on. Blocks on metadata files are mapped through nilfs_mdt_submit_block(). Anyway, yes, they store the actual disk block number in b_blocknr in the current implementation. But that is just a corner the current implementation cuts, which comes from the fact that we have to set actual disk block numbers when reading blocks with vfs/mm functions.

Anyway, I don't like that you touch nilfs_get_block() and nilfs_segbuf_submit_bh() in a part of the big patch. At least, it