Re: [f2fs-dev] [PATCH v5 1/2] f2fs: fix to avoid broken of dnode block list
On 2018/7/29 10:49, Jaegeuk Kim wrote: > On 07/29, Chao Yu wrote: >> On 2018/7/29 9:33, Jaegeuk Kim wrote: >>> On 07/28, Chao Yu wrote: f2fs recovery flow is relying on dnode block link list, it means fsynced file recovery depends on previous dnode's persistence in the list, so during fsync() we should wait on all regular inode's dnode writebacked before issuing flush. By this way, we can avoid dnode block list being broken by out-of-order IO submission due to IO scheduler or driver. Sheng Yong helps to do the test with this patch: Target:/data (f2fs, -) 64MB / 32768KB / 4KB / 8 1 / PERSIST / Index Base: SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS) 1 867.82 204.15 41440.0341370.54680.8 1025.94 1031.08 2 871.87 205.87 41370.3 40275.2 791.14 1065.84 1101.7 3 866.52 205.69 41795.6740596.16694.69 1037.16 1031.48 Avg868.737 205.237 41535.3 40747.3 722.21 1042.98 1054.75 After: SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS) 1 798.81 202.5 41143 40613.87602.71 838.08 913.83 2 805.79 206.47 40297.2 41291.46604.44 840.75 924.27 3 814.83 206.17 41209.5740453.62602.85 834.66 927.91 Avg806.477 205.047 40883.25667 40786.31667 603.333 837.83 922.003 Patched/Original: 0.928332713 0.999074239 0.984300676 1.000957528 0.835398753 0.803303994 0.874141189 >>> >>> I expect Sheng will provide more test results tho, at least it seems SEQ-RD >>> in Base shows better than two After results, even if it doesn't matter with >>> the issue. Please confirm it first in order for anybody to say there is no >>> regression. >> >> Agreed. > > Hmm, this patch breaks fault injection test where gives a panic in put_super > having fsync_node_num. Let me do the test and fix it. Thanks, > >> >> Thanks, >> >>> It looks like atomic write will suffer performance regression. I suspect that the criminal is that we forcing to wait all dnode being in storage cache before we issue PREFLUSH+FUA. BTW, will commit ("f2fs: don't need to wait for node writes for atomic write") cause the problem: we will lose data of last transaction after SPO, even if atomic write return no error: - atomic_open(); - write() P1, P2, P3; - atomic_commit(); - writeback data: P1, P2, P3; - writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with fsync_mark is writebacked, In SPOR, we won't find N3 since node chain is broken, turns out that losing last transaction. - preflush + fua; - power-cut If we don't wait dnode writeback for atomic_write: SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS) 1 779.91 206.03 41621.5 40333.16716.9 1038.21 1034.85 2 848.51 204.35 40082.4439486.17791.83 1119.96 1083.77 3 772.12 206.27 41335.2541599.65723.29 1055.07 971.92 Avg800.18 205.55 41013.06333 40472.99333 744.007 1071.08 1030.18 Patched/Original: 0.92108464 1.001526693 0.987425886 0.993268102 1.030180511 1.026942031 0.976702294 SQLite's performance recovers. Jaegeuk: "Practically, I don't see db corruption becase of this. We can excuse to lose the last transaction." Finally, we decide to keep original implementation of atomic write interface sematics that we don't wait all dnode writeback before preflush+fua submission. Tested-by: Sheng Yong Signed-off-by: Chao Yu --- v5: - add missing Tested-by. - fix f2fs_reset_fsync_node_info() to reset sbi->fsync_seg_id correctly. fs/f2fs/checkpoint.c | 2 + fs/f2fs/data.c | 2 + fs/f2fs/f2fs.h | 21 ++- fs/f2fs/file.c | 5 +- fs/f2fs/node.c | 144 +++ fs/f2fs/super.c | 4 ++ 6 files changed, 150 insertions(+), 28 deletions(-)
Re: [f2fs-dev] [PATCH v5 1/2] f2fs: fix to avoid broken of dnode block list
On 07/29, Chao Yu wrote: > On 2018/7/29 9:33, Jaegeuk Kim wrote: > > On 07/28, Chao Yu wrote: > >> f2fs recovery flow is relying on dnode block link list, it means fsynced > >> file recovery depends on previous dnode's persistence in the list, so > >> during fsync() we should wait on all regular inode's dnode writebacked > >> before issuing flush. > >> > >> By this way, we can avoid dnode block list being broken by out-of-order > >> IO submission due to IO scheduler or driver. > >> > >> Sheng Yong helps to do the test with this patch: > >> > >> Target:/data (f2fs, -) > >> 64MB / 32768KB / 4KB / 8 > >> > >> 1 / PERSIST / Index > >> > >> Base: > >>SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) > >> Insert(TPS) Update(TPS) Delete(TPS) > >> 1 867.82 204.15 41440.0341370.54680.8 > >> 1025.94 1031.08 > >> 2 871.87 205.87 41370.3 40275.2 791.14 > >> 1065.84 1101.7 > >> 3 866.52 205.69 41795.6740596.16694.69 > >> 1037.16 1031.48 > >> Avg868.737 205.237 41535.3 40747.3 > >> 722.21 1042.98 1054.75 > >> > >> After: > >>SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) > >> Insert(TPS) Update(TPS) Delete(TPS) > >> 1 798.81 202.5 41143 40613.87602.71 > >> 838.08 913.83 > >> 2 805.79 206.47 40297.2 41291.46604.44 > >> 840.75 924.27 > >> 3 814.83 206.17 41209.5740453.62602.85 > >> 834.66 927.91 > >> Avg806.477 205.047 40883.25667 40786.31667 > >> 603.333 837.83 922.003 > >> > >> Patched/Original: > >>0.928332713 0.999074239 0.984300676 1.000957528 > >> 0.835398753 0.803303994 0.874141189 > > > > I expect Sheng will provide more test results tho, at least it seems SEQ-RD > > in Base shows better than two After results, even if it doesn't matter with > > the issue. Please confirm it first in order for anybody to say there is no > > regression. > > Agreed. Hmm, this patch breaks fault injection test where gives a panic in put_super having fsync_node_num. > > Thanks, > > > > >> > >> It looks like atomic write will suffer performance regression. > >> > >> I suspect that the criminal is that we forcing to wait all dnode being in > >> storage cache before we issue PREFLUSH+FUA. > >> > >> BTW, will commit ("f2fs: don't need to wait for node writes for atomic > >> write") > >> cause the problem: we will lose data of last transaction after SPO, even if > >> atomic write return no error: > >> > >> - atomic_open(); > >> - write() P1, P2, P3; > >> - atomic_commit(); > >> - writeback data: P1, P2, P3; > >> - writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with > >> fsync_mark is > >> writebacked, In SPOR, we won't find N3 since node chain is broken, turns > >> out that losing > >> last transaction. > >> - preflush + fua; > >> - power-cut > >> > >> If we don't wait dnode writeback for atomic_write: > >> > >>SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) > >> Insert(TPS) Update(TPS) Delete(TPS) > >> 1 779.91 206.03 41621.5 40333.16716.9 > >> 1038.21 1034.85 > >> 2 848.51 204.35 40082.4439486.17791.83 > >> 1119.96 1083.77 > >> 3 772.12 206.27 41335.2541599.65723.29 > >> 1055.07 971.92 > >> Avg800.18 205.55 41013.06333 40472.99333 > >> 744.007 1071.08 1030.18 > >> > >> Patched/Original: > >>0.92108464 1.001526693 0.987425886 0.993268102 > >> 1.030180511 1.026942031 0.976702294 > >> > >> SQLite's performance recovers. > >> > >> Jaegeuk: > >> "Practically, I don't see db corruption becase of this. We can excuse to > >> lose > >> the last transaction." > >> > >> Finally, we decide to keep original implementation of atomic write > >> interface > >> sematics that we don't wait all dnode writeback before preflush+fua > >> submission. > >> > >> Tested-by: Sheng Yong > >> Signed-off-by: Chao Yu > >> --- > >> v5: > >> - add missing Tested-by. > >> - fix f2fs_reset_fsync_node_info() to reset sbi->fsync_seg_id correctly. > >> fs/f2fs/checkpoint.c | 2 + > >> fs/f2fs/data.c | 2 + > >> fs/f2fs/f2fs.h | 21 ++- > >> fs/f2fs/file.c | 5 +- > >> fs/f2fs/node.c | 144 +++ > >> fs/f2fs/super.c | 4 ++ > >> 6 files changed, 150 insertions(+), 28 deletions(-) > >> > >> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c > >> index
Re: [f2fs-dev] [PATCH v5 1/2] f2fs: fix to avoid broken of dnode block list
On 2018/7/29 9:33, Jaegeuk Kim wrote: > On 07/28, Chao Yu wrote: >> f2fs recovery flow is relying on dnode block link list, it means fsynced >> file recovery depends on previous dnode's persistence in the list, so >> during fsync() we should wait on all regular inode's dnode writebacked >> before issuing flush. >> >> By this way, we can avoid dnode block list being broken by out-of-order >> IO submission due to IO scheduler or driver. >> >> Sheng Yong helps to do the test with this patch: >> >> Target:/data (f2fs, -) >> 64MB / 32768KB / 4KB / 8 >> >> 1 / PERSIST / Index >> >> Base: >> SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) >> Insert(TPS) Update(TPS) Delete(TPS) >> 1867.82 204.15 41440.0341370.54680.8 >> 1025.94 1031.08 >> 2871.87 205.87 41370.3 40275.2 791.14 >> 1065.84 1101.7 >> 3866.52 205.69 41795.6740596.16694.69 >> 1037.16 1031.48 >> Avg 868.737 205.237 41535.3 40747.3 722.21 >> 1042.98 1054.75 >> >> After: >> SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) >> Insert(TPS) Update(TPS) Delete(TPS) >> 1798.81 202.5 41143 40613.87602.71 >> 838.08 913.83 >> 2805.79 206.47 40297.2 41291.46604.44 >> 840.75 924.27 >> 3814.83 206.17 41209.5740453.62602.85 >> 834.66 927.91 >> Avg 806.477 205.047 40883.25667 40786.31667 >> 603.333 837.83 922.003 >> >> Patched/Original: >> 0.928332713 0.999074239 0.984300676 1.000957528 >> 0.835398753 0.803303994 0.874141189 > > I expect Sheng will provide more test results tho, at least it seems SEQ-RD > in Base shows better than two After results, even if it doesn't matter with > the issue. Please confirm it first in order for anybody to say there is no > regression. Agreed. Thanks, > >> >> It looks like atomic write will suffer performance regression. >> >> I suspect that the criminal is that we forcing to wait all dnode being in >> storage cache before we issue PREFLUSH+FUA. >> >> BTW, will commit ("f2fs: don't need to wait for node writes for atomic >> write") >> cause the problem: we will lose data of last transaction after SPO, even if >> atomic write return no error: >> >> - atomic_open(); >> - write() P1, P2, P3; >> - atomic_commit(); >> - writeback data: P1, P2, P3; >> - writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with >> fsync_mark is >> writebacked, In SPOR, we won't find N3 since node chain is broken, turns out >> that losing >> last transaction. >> - preflush + fua; >> - power-cut >> >> If we don't wait dnode writeback for atomic_write: >> >> SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) >> Insert(TPS) Update(TPS) Delete(TPS) >> 1779.91 206.03 41621.5 40333.16716.9 >> 1038.21 1034.85 >> 2848.51 204.35 40082.4439486.17791.83 >> 1119.96 1083.77 >> 3772.12 206.27 41335.2541599.65723.29 >> 1055.07 971.92 >> Avg 800.18 205.55 41013.06333 40472.99333 >> 744.007 1071.08 1030.18 >> >> Patched/Original: >> 0.92108464 1.001526693 0.987425886 0.993268102 >> 1.030180511 1.026942031 0.976702294 >> >> SQLite's performance recovers. >> >> Jaegeuk: >> "Practically, I don't see db corruption becase of this. We can excuse to lose >> the last transaction." >> >> Finally, we decide to keep original implementation of atomic write interface >> sematics that we don't wait all dnode writeback before preflush+fua >> submission. >> >> Tested-by: Sheng Yong >> Signed-off-by: Chao Yu >> --- >> v5: >> - add missing Tested-by. >> - fix f2fs_reset_fsync_node_info() to reset sbi->fsync_seg_id correctly. >> fs/f2fs/checkpoint.c | 2 + >> fs/f2fs/data.c | 2 + >> fs/f2fs/f2fs.h | 21 ++- >> fs/f2fs/file.c | 5 +- >> fs/f2fs/node.c | 144 +++ >> fs/f2fs/super.c | 4 ++ >> 6 files changed, 150 insertions(+), 28 deletions(-) >> >> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c >> index 581710760ba6..2136430f9f0d 100644 >> --- a/fs/f2fs/checkpoint.c >> +++ b/fs/f2fs/checkpoint.c >> @@ -1410,6 +1410,8 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, >> struct cp_control *cpc) >> >> f2fs_release_ino_entry(sbi, false); >> >> +f2fs_reset_fsync_node_info(sbi); >> + >> clear_sbi_flag(sbi, SBI_IS_DIRTY); >> clear_sbi_flag(sbi, SBI_NEED_CP); >>
Re: [f2fs-dev] [PATCH v5 1/2] f2fs: fix to avoid broken of dnode block list
On 07/28, Chao Yu wrote: > f2fs recovery flow is relying on dnode block link list, it means fsynced > file recovery depends on previous dnode's persistence in the list, so > during fsync() we should wait on all regular inode's dnode writebacked > before issuing flush. > > By this way, we can avoid dnode block list being broken by out-of-order > IO submission due to IO scheduler or driver. > > Sheng Yong helps to do the test with this patch: > > Target:/data (f2fs, -) > 64MB / 32768KB / 4KB / 8 > > 1 / PERSIST / Index > > Base: > SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) > Insert(TPS) Update(TPS) Delete(TPS) > 1 867.82 204.15 41440.0341370.54680.8 > 1025.94 1031.08 > 2 871.87 205.87 41370.3 40275.2 791.14 > 1065.84 1101.7 > 3 866.52 205.69 41795.6740596.16694.69 > 1037.16 1031.48 > Avg 868.737 205.237 41535.3 40747.3 722.21 > 1042.98 1054.75 > > After: > SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) > Insert(TPS) Update(TPS) Delete(TPS) > 1 798.81 202.5 41143 40613.87602.71 > 838.08 913.83 > 2 805.79 206.47 40297.2 41291.46604.44 > 840.75 924.27 > 3 814.83 206.17 41209.5740453.62602.85 > 834.66 927.91 > Avg 806.477 205.047 40883.25667 40786.31667 > 603.333 837.83 922.003 > > Patched/Original: > 0.928332713 0.999074239 0.984300676 1.000957528 > 0.835398753 0.803303994 0.874141189 I expect Sheng will provide more test results tho, at least it seems SEQ-RD in Base shows better than two After results, even if it doesn't matter with the issue. Please confirm it first in order for anybody to say there is no regression. > > It looks like atomic write will suffer performance regression. > > I suspect that the criminal is that we forcing to wait all dnode being in > storage cache before we issue PREFLUSH+FUA. > > BTW, will commit ("f2fs: don't need to wait for node writes for atomic write") > cause the problem: we will lose data of last transaction after SPO, even if > atomic write return no error: > > - atomic_open(); > - write() P1, P2, P3; > - atomic_commit(); > - writeback data: P1, P2, P3; > - writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with > fsync_mark is > writebacked, In SPOR, we won't find N3 since node chain is broken, turns out > that losing > last transaction. > - preflush + fua; > - power-cut > > If we don't wait dnode writeback for atomic_write: > > SEQ-RD(MB/s)SEQ-WR(MB/s)RND-RD(IOPS)RND-WR(IOPS) > Insert(TPS) Update(TPS) Delete(TPS) > 1 779.91 206.03 41621.5 40333.16716.9 > 1038.21 1034.85 > 2 848.51 204.35 40082.4439486.17791.83 > 1119.96 1083.77 > 3 772.12 206.27 41335.2541599.65723.29 > 1055.07 971.92 > Avg 800.18 205.55 41013.06333 40472.99333 > 744.007 1071.08 1030.18 > > Patched/Original: > 0.92108464 1.001526693 0.987425886 0.993268102 > 1.030180511 1.026942031 0.976702294 > > SQLite's performance recovers. > > Jaegeuk: > "Practically, I don't see db corruption becase of this. We can excuse to lose > the last transaction." > > Finally, we decide to keep original implementation of atomic write interface > sematics that we don't wait all dnode writeback before preflush+fua > submission. > > Tested-by: Sheng Yong > Signed-off-by: Chao Yu > --- > v5: > - add missing Tested-by. > - fix f2fs_reset_fsync_node_info() to reset sbi->fsync_seg_id correctly. > fs/f2fs/checkpoint.c | 2 + > fs/f2fs/data.c | 2 + > fs/f2fs/f2fs.h | 21 ++- > fs/f2fs/file.c | 5 +- > fs/f2fs/node.c | 144 +++ > fs/f2fs/super.c | 4 ++ > 6 files changed, 150 insertions(+), 28 deletions(-) > > diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c > index 581710760ba6..2136430f9f0d 100644 > --- a/fs/f2fs/checkpoint.c > +++ b/fs/f2fs/checkpoint.c > @@ -1410,6 +1410,8 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, > struct cp_control *cpc) > > f2fs_release_ino_entry(sbi, false); > > + f2fs_reset_fsync_node_info(sbi); > + > clear_sbi_flag(sbi, SBI_IS_DIRTY); > clear_sbi_flag(sbi, SBI_NEED_CP); > __set_cp_next_pack(sbi); > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c > index 6b8ca5011bfd..572c91e43337 100644 > --- a/fs/f2fs/data.c > +++