Re: [f2fs-dev] Reply: Reply: [PATCH] f2fs: compress: avoid unnecessary check in f2fs_prepare_compress_overwrite

2021-05-06 Thread Gao Xiang via Linux-f2fs-devel
On Thu, May 06, 2021 at 08:15:40PM +0800, changfeng...@vivo.com wrote:
> This patch will not bring significant performance improvements, I
> test this on mobile phone, use androbench, the sequential write test
> case was open file with O_TRUNC, set write size to 4KB,  performance
> improved about 2%-3%. If write size set to 32MB, performance improved
> about 0.5%.

OK, so considering this, my suggestion is that it'd be better to add
your test configuration and raw results to the commit message to
show why we need this constraint here.

Also, adding some inline comments about this sounds better.
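
For example (just a rough sketch mirroring the hunk in your patch; the
comment wording below is only illustrative, to be placed above the new check):

    /*
     * f2fs only compresses a cluster once it is fully written, and
     * blocks fallocated beyond i_size are never compressed, so a
     * write entirely beyond i_size cannot hit a compressed cluster
     * and the overwrite preparation can be skipped.  Revisit this
     * if the runtime compression policy or on-disk layout changes.
     */
    if (index >= end)
            goto repeat;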

Thanks,
Gao Xiang

> 
> 
> -----Original Message-----
> From: Chao Yu
> Sent: May 6, 2021 18:38
> To: Gao Xiang
> Cc: Gao Xiang; changfeng...@vivo.com;
> jaeg...@kernel.org; linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] Reply: [PATCH] f2fs: compress: avoid unnecessary check in
> f2fs_prepare_compress_overwrite
> 
> Hi Xiang,
> 
> On 2021/5/6 17:58, Gao Xiang wrote:
> > Hi Chao,
> > 
> > On Thu, May 06, 2021 at 05:15:04PM +0800, Chao Yu wrote:
> >> On 2021/4/26 17:00, Gao Xiang wrote:
> >>> On Mon, Apr 26, 2021 at 04:42:20PM +0800, changfeng...@vivo.com wrote:
>  Thank you for the reminder, I hadn't thought about fallocate before.
>  I have done some tests and the results are as expected.
>  Here is my test method, create a compressed file, and use fallocate 
>  with keep size, when write data to expand area, 
>  f2fs_prepare_compress_overwrite always return 0, the behavior is same as 
>  my patch , apply my patch can avoid those check.
>  Is there anything else I haven't thought of?
> >>>
> >>> Nope, I didn't look into the implementation. Just a wild guess.
> >>>
> >>> (I just wondered if the cluster size is somewhat large (e.g. 64k),
> >>>but with a partial fallocate (e.g. 16k), and does it behave ok?
> >>>or some other corner cases/conditions are needed.)
> >>
> >> Xiang, sorry for late reply.
> >>
> >> Now, f2fs triggers compression only if one cluster is fully written, 
> >> e.g. cluster size is 16kb, isize is 8kb, then the first cluster is 
> >> non-compressed one, so we don't need to prepare for compressed 
> >> cluster overwrite during write_begin(). Also, blocks fallocated 
> >> beyond isize should never be compressed, so we don't need to worry 
> >> about that.
> >>
> > 
> > Yeah, that could make it unnoticeable, but my main concern is actually 
> > not what the current runtime compression logic is, but what the 
> > on-disk compression format actually is, or it could cause a 
> > compatibility issue if some later kernel makes full use of this and 
> > old kernels are used
> 
> That's related, if there is layout v2 or we updated runtime compression 
> policy later, it needs to reconsider newly introduced logic of this patch, I 
> guess we need to add comments here to indicate why we can skip the 
> preparation function.
> 
> > instead (also considering some corrupted compression indexes, which are 
> > not generated by the normal runtime compression logic.)
> 
> Yes, that's good concern, but that was not done by 
> f2fs_prepare_compress_overwrite(), another sanity check logic needs to be 
> designed and implemented in separated patch.
> 
> > 
> > My own suggestion about this is still to verify the compress indexes first 
> > rather than relying much on the runtime logic constraint. (Except if 
> > this patch can show significant performance numbers to prove it 
> > can improve performance a lot.) Just my own premature thoughts.
> 
> Fengnan, could you please give some numbers to show how that check can impact 
> performance?
> 
> Thanks,
> 
> > 
> > Thanks,
> > Gao Xiang
> > 
> >> Thanks,
> >>
> >>>
> >>> If that is fine, I have no problem about this, yet i_size here is 
> >>> generally somewhat risky since after post-EOF behavior changes (e.g. 
> >>> supporting FALLOC_FL_ZERO_RANGE with keep size later), it may cause 
> >>> some potential regression.
> >>>
> 
>  -----Original Message-----
>  From: Gao Xiang
>  Sent: April 26, 2021 11:19
>  To: Fengnan Chang
>  Cc: c...@kernel.org; jaeg...@kernel.org; 
>  linux-f2fs-devel@lists.sourceforge.net
>  Subject: Re: [f2fs-dev] [PATCH] f2fs: compress: avoid unnecessary check 
>  in f2fs_prepare_compress_overwrite
> 
>  On Mon, Apr 26, 2021 at 10:11:53AM +0800, Fengnan Chang wrote:
> > when write compressed file with O_TRUNC, there will be a lot of 
> > unnecessary check valid blocks in f2fs_prepare_compress_overwrite, 
> > especially when written in page size, remove it.
> >
> > Signed-off-by: Fengnan Chang 
> 
>  Even though I didn't look into the whole thing, my reaction here is 
>  roughly how to handle fallocate with keep size? Does it work as expected?
> 
> > ---
> >fs/f2fs/data.c | 4 
> >1 file changed, 4 insertions(+)
> >
> > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 
> > cf935474ffba..9c3b0849f35e 100644

Re: [f2fs-dev] Reply: [PATCH] f2fs: compress: avoid unnecessary check in f2fs_prepare_compress_overwrite

2021-05-06 Thread Gao Xiang via Linux-f2fs-devel
On Thu, May 06, 2021 at 06:37:45PM +0800, Chao Yu wrote:
> Hi Xiang,
> 
> On 2021/5/6 17:58, Gao Xiang wrote:
> > Hi Chao,
> > 
> > On Thu, May 06, 2021 at 05:15:04PM +0800, Chao Yu wrote:
> > > On 2021/4/26 17:00, Gao Xiang wrote:
> > > > On Mon, Apr 26, 2021 at 04:42:20PM +0800, changfeng...@vivo.com wrote:
> > > > > Thank you for the reminder, I hadn't thought about fallocate before.
> > > > > I have done some tests and the results are as expected.
> > > > > Here is my test method, create a compressed file, and use fallocate 
> > > > > with keep size, when write data to expand area, 
> > > > > f2fs_prepare_compress_overwrite
> > > > > always return 0, the behavior is same as my patch , apply my patch 
> > > > > can avoid those check.
> > > > > Is there anything else I haven't thought of?
> > > > 
> > > > Nope, I didn't look into the implementation. Just a wild guess.
> > > > 
> > > > (I just wondered if the cluster size is somewhat large (e.g. 64k),
> > > >but with a partial fallocate (e.g. 16k), and does it behave ok?
> > > >or some other corner cases/conditions are needed.)
> > > 
> > > Xiang, sorry for late reply.
> > > 
> > > Now, f2fs triggers compression only if one cluster is fully written,
> > > e.g. cluster size is 16kb, isize is 8kb, then the first cluster is
> > > non-compressed one, so we don't need to prepare for compressed
> > > cluster overwrite during write_begin(). Also, blocks fallocated
> > > beyond isize should never be compressed, so we don't need to worry
> > > about that.
> > > 
> > 
> > Yeah, that could make it unnoticeable, but my main concern is actually
> > not what the current runtime compression logic is, but what the on-disk
> > compression format actually is, or it could cause a compatibility
> > issue if some later kernel makes full use of this and old kernels are used
> 
> That's related, if there is layout v2 or we updated runtime compression
> policy later, it needs to reconsider newly introduced logic of this patch,
> I guess we need to add comments here to indicate why we can skip the
> preparation function.

Anyway, my thought is mainly to distinguish the current runtime
compression logic from the proposed on-disk format by design. If it's
easy in practice to support reading partially written clusters and
post-EOF cases with a few lines, then a later compression policy could
be introduced as a compat feature (or at least a ro_compat feature),
which is much better than an incompat feature for older kernels.

But if it's somewhat hard to add simply, that makes no difference, so
a v2 layout may need to be introduced instead.

> 
> > instead (also considering some corrupted compression indexes, which
> > are not generated by the normal runtime compression logic.)
> 
> Yes, that's good concern, but that was not done by
> f2fs_prepare_compress_overwrite(), another sanity check logic needs
> to be designed and implemented in separated patch.
> 
> > 
> > My own suggestion about this is still to verify the compress indexes
> > first rather than relying much on the runtime logic constraint. (Except
> > if this patch can show significant performance numbers to
> > prove it improves performance a lot.) Just my own premature
> > thoughts.
> 
> Fengnan, could you please give some numbers to show how that check can
> impact performance?

IMO, it'd be better to show some real numbers before adding more constraints
like this; if the gain is measurable, that is another story indeed.
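
For example, a minimal userspace sketch of such a measurement (sequential
4KB writes to a file reopened with O_TRUNC, roughly like the androbench case
mentioned earlier in this thread); the path, sizes and loop counts here are
arbitrary, and the file should live on an f2fs mount with compression enabled
for the comparison to mean anything:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    struct timespec t0, t1;
    double sec;
    int pass, i, fd;

    memset(buf, 'x', sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (pass = 0; pass < 100; pass++) {
        /* reopen with O_TRUNC on every pass, as the benchmark does */
        fd = open("/data/f2fs-test/file",
                  O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        for (i = 0; i < 8192; i++)      /* 8192 * 4KB = 32MB */
            if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                return 1;
        fsync(fd);
        close(fd);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.3f seconds for 100 passes\n", sec);
    return 0;
}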

Thanks,
Gao Xiang

> 
> Thanks,
> 
> > 
> > Thanks,
> > Gao Xiang
> > 
> > > Thanks,
> > > 
> > > > 
> > > > If that is fine, I have no problem about this, yet i_size here
> > > > is generally somewhat risky since after post-EOF behavior
> > > > changes (e.g. supporting FALLOC_FL_ZERO_RANGE with keep size
> > > > later), it may cause some potential regression.
> > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Gao Xiang
> > > > > Sent: April 26, 2021 11:19
> > > > > To: Fengnan Chang
> > > > > Cc: c...@kernel.org; jaeg...@kernel.org;
> > > > > linux-f2fs-devel@lists.sourceforge.net
> > > > > Subject: Re: [f2fs-dev] [PATCH] f2fs: compress: avoid unnecessary check in
> > > > > f2fs_prepare_compress_overwrite
> > > > > 
> > > > > On Mon, Apr 26, 2021 at 10:11:53AM +0800, Fengnan Chang wrote:
> > > > > > when write compressed file with O_TRUNC, there will be a lot of
> > > > > > unnecessary check valid blocks in f2fs_prepare_compress_overwrite,
> > > > > > especially when written in page size, remove it.
> > > > > > 
> > > > > > Signed-off-by: Fengnan Chang 
> > > > > 
> > > > > Even though I didn't look into the whole thing, my reaction here is 
> > > > > roughly
> > > > > how to handle fallocate with keep size? Does it work as expected?
> > > > > 
> > > > > > ---
> > > > > >fs/f2fs/data.c | 4 
> > > > > >1 file changed, 4 insertions(+)
> > > > > > 
> > > > > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index
> > > > > > cf935474ffba..9c3b0849f35e 100644
> > > > > > --- a/fs/f2fs/data.c
> > > > > > +++ 

Re: [f2fs-dev] Reply: [PATCH] f2fs: compress: avoid unnecessary check in f2fs_prepare_compress_overwrite

2021-05-06 Thread Gao Xiang via Linux-f2fs-devel
Hi Chao,

On Thu, May 06, 2021 at 05:15:04PM +0800, Chao Yu wrote:
> On 2021/4/26 17:00, Gao Xiang wrote:
> > On Mon, Apr 26, 2021 at 04:42:20PM +0800, changfeng...@vivo.com wrote:
> > > Thank you for the reminder, I hadn't thought about fallocate before.
> > > I have done some tests and the results are as expected.
> > > Here is my test method, create a compressed file, and use fallocate with 
> > > keep size, when write data to expand area, f2fs_prepare_compress_overwrite
> > > always return 0, the behavior is same as my patch , apply my patch can 
> > > avoid those check.
> > > Is there anything else I haven't thought of?
> > 
> > Nope, I didn't look into the implementation. Just a wild guess.
> > 
> > (I just wondered if the cluster size is somewhat large (e.g. 64k),
> >   but with a partial fallocate (e.g. 16k), and does it behave ok?
> >   or some other corner cases/conditions are needed.)
> 
> Xiang, sorry for late reply.
> 
> Now, f2fs triggers compression only if one cluster is fully written,
> e.g. cluster size is 16kb, isize is 8kb, then the first cluster is
> non-compressed one, so we don't need to prepare for compressed
> cluster overwrite during write_begin(). Also, blocks fallocated
> beyond isize should never be compressed, so we don't need to worry
> about that.
> 

Yeah, that could make it unnoticeable, but my main concern is actually
not what the current runtime compression logic is, but what the on-disk
compression format actually is, or it could cause a compatibility
issue if some later kernel makes full use of this and old kernels are
used instead (also considering some corrupted compression indexes, which
are not generated by the normal runtime compression logic.)

My own suggestion about this is still to verify the compress indexes
first rather than relying much on the runtime logic constraint. (Except
if this patch can show significant performance numbers to
prove it improves performance a lot.) Just my own premature
thoughts.

Thanks,
Gao Xiang

> Thanks,
> 
> > 
> > If that is fine, I have no problem about this, yet i_size here
> > is generally somewhat risky since after post-EOF behavior
> > changes (e.g. supporting FALLOC_FL_ZERO_RANGE with keep size
> > later), it may cause some potential regression.
> > 
> > > 
> > > -----Original Message-----
> > > From: Gao Xiang
> > > Sent: April 26, 2021 11:19
> > > To: Fengnan Chang
> > > Cc: c...@kernel.org; jaeg...@kernel.org;
> > > linux-f2fs-devel@lists.sourceforge.net
> > > Subject: Re: [f2fs-dev] [PATCH] f2fs: compress: avoid unnecessary check in
> > > f2fs_prepare_compress_overwrite
> > > 
> > > On Mon, Apr 26, 2021 at 10:11:53AM +0800, Fengnan Chang wrote:
> > > > when write compressed file with O_TRUNC, there will be a lot of
> > > > unnecessary check valid blocks in f2fs_prepare_compress_overwrite,
> > > > especially when written in page size, remove it.
> > > > 
> > > > Signed-off-by: Fengnan Chang 
> > > 
> > > Even though I didn't look into the whole thing, my reaction here is 
> > > roughly
> > > how to handle fallocate with keep size? Does it work as expected?
> > > 
> > > > ---
> > > >   fs/f2fs/data.c | 4 
> > > >   1 file changed, 4 insertions(+)
> > > > 
> > > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index
> > > > cf935474ffba..9c3b0849f35e 100644
> > > > --- a/fs/f2fs/data.c
> > > > +++ b/fs/f2fs/data.c
> > > > @@ -3270,6 +3270,7 @@ static int f2fs_write_begin(struct file *file,
> > > > struct address_space *mapping,
> > > > struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> > > > struct page *page = NULL;
> > > > pgoff_t index = ((unsigned long long) pos) >> PAGE_SHIFT;
> > > > +   pgoff_t end = (i_size_read(inode) + PAGE_SIZE - 1) >> 
> > > > PAGE_SHIFT;
> > > > bool need_balance = false, drop_atomic = false;
> > > > block_t blkaddr = NULL_ADDR;
> > > > int err = 0;
> > > > @@ -3306,6 +3307,9 @@ static int f2fs_write_begin(struct file *file,
> > > > struct address_space *mapping,
> > > > 
> > > > *fsdata = NULL;
> > > > 
> > > > +   if (index >= end)
> > > > +   goto repeat;
> > > > +
> > > > ret = f2fs_prepare_compress_overwrite(inode, pagep,
> > > > index, fsdata);
> > > > if (ret < 0) {
> > > > --
> > > > 2.29.0
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 


Re: [f2fs-dev] IO hang due to f2fs checkpoint and writeback stuck

2020-07-09 Thread Gao Xiang via Linux-f2fs-devel
On Fri, Jul 10, 2020 at 10:54:13AM +0800, Chao Yu wrote:
> Hi Sahitya,
> 
> It looks block plug has already been removed by Jaegeuk with
> below commit:
> 
> commit 1f5f11a3c41e2b23288b2769435a00f74e02496b
> Author: Jaegeuk Kim 
> Date:   Fri May 8 12:25:45 2020 -0700
> 
> f2fs: remove blk_plugging in block_operations
> 
> blk_plugging doesn't seem to give any benefit.
> 
> How about backporting this patch?

Yeah, I also noticed that

commit bd900d4580107c899d43b262fbbd995f11097a43
Author: Jens Axboe 
Date:   Mon Apr 18 22:06:57 2011 +0200

block: kill blk_flush_plug_list() export

With all drivers and file systems converted, we only have
in-core use of this function. So remove the export.

Reporteed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 

blk_flush_plug_list() is not an exported symbol now, only for in-core use,
and the same goes for blk_flush_plug().

Thanks,
Gao Xiang





Re: [f2fs-dev] [PATCH] f2fs: compress: allow lz4 to compress data partially

2020-05-08 Thread Gao Xiang via Linux-f2fs-devel
Hi Chao,

On Fri, May 08, 2020 at 05:47:09PM +0800, Chao Yu wrote:
> For lz4 worst compress case, caller should allocate buffer with size
> of LZ4_compressBound(inputsize) for target compressed data storing.
> 
> However lz4 supports partial data compression, so we can get rid of
> output buffer size limitation now, then we can avoid 2 * 4KB size
> intermediate buffer allocation when log_cluster_size is 2, and avoid
> unnecessary compressing work of compressor if we can not save at
> least 4KB space.
> 
> Suggested-by: Daeho Jeong 
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/compress.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
> index 5e4947250262..23825f559bcf 100644
> --- a/fs/f2fs/compress.c
> +++ b/fs/f2fs/compress.c
> @@ -228,7 +228,12 @@ static int lz4_init_compress_ctx(struct compress_ctx *cc)
>   if (!cc->private)
>   return -ENOMEM;
>  
> - cc->clen = LZ4_compressBound(PAGE_SIZE << cc->log_cluster_size);
> + /*
> +  * we do not change cc->clen to LZ4_compressBound(inputsize) to
> +  * adapt worst compress case, because lz4 algorithm supports partial
> +  * compression.

Actually, in this case the lz4 compressed block is not valid (at least it
does not end in a valid lz4 EOF), and AFAIK the current public lz4 API cannot
keep on compressing this block. So IMO "partial compression" for an invalid
lz4 block may be confusing.

I think some words like "because the lz4 implementation can handle the output
buffer budget properly" or similar could be better.

The same goes for the patch title and the commit message.
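
Just to illustrate the behaviour being relied on here, a self-contained
userspace sketch against liblz4 (not the kernel wrapper; link with -llz4;
buffer sizes are arbitrary, and this is only how I understand the API):
a destination budget smaller than LZ4_compressBound() is legal, and a
return value of 0 simply means the output could not be produced within
that budget, which the caller can then treat as "keep the data uncompressed":

#include <lz4.h>
#include <stdio.h>

int main(void)
{
    static char src[16384];
    static char tight[64];                          /* deliberately small budget */
    static char roomy[LZ4_COMPRESSBOUND(16384)];    /* worst-case bound */
    int i, n;

    for (i = 0; i < (int)sizeof(src); i++)          /* mildly compressible input */
        src[i] = (char)(i * 131);

    n = LZ4_compress_default(src, tight, sizeof(src), sizeof(tight));
    printf("tight budget: %d (0 means it did not fit)\n", n);

    n = LZ4_compress_default(src, roomy, sizeof(src), sizeof(roomy));
    printf("full bound  : %d compressed bytes\n", n);
    return 0;
}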

Thanks,
Gao Xiang


> +  */
> + cc->clen = cc->rlen - PAGE_SIZE - COMPRESS_HEADER_SIZE;
>   return 0;
>  }
>  
> @@ -244,11 +249,9 @@ static int lz4_compress_pages(struct compress_ctx *cc)
>  
>   len = LZ4_compress_default(cc->rbuf, cc->cbuf->cdata, cc->rlen,
>   cc->clen, cc->private);
> - if (!len) {
> - printk_ratelimited("%sF2FS-fs (%s): lz4 compress failed\n",
> - KERN_ERR, F2FS_I_SB(cc->inode)->sb->s_id);
> - return -EIO;
> - }
> + if (!len)
> + return -EAGAIN;
> +
>   cc->clen = len;
>   return 0;
>  }
> -- 
> 2.18.0.rc1




Re: [f2fs-dev] [PATCH] f2fs: get parent inode when recovering pino

2020-05-05 Thread Gao Xiang via Linux-f2fs-devel
On Tue, May 05, 2020 at 06:24:28PM -0700, Eric Biggers wrote:
> On Wed, May 06, 2020 at 08:14:07AM +0800, Gao Xiang wrote:
> > >
> > > Actually, I think this is wrong because the fsync can be done via a file
> > > descriptor that was opened to a now-deleted link to the file.
> > 
> > I'm still confused about this...
> > 
> > I don't know what's wrong with this version from my limited knowledge?
> >  inode itself is locked when fsyncing, so
> > 
> >if the fsync inode->i_nlink == 1, this inode has only one hard link
> >(not deleted yet) and should belong to a single directory; and
> > 
> >the only one parent directory would not go away (not deleted as well)
> >since there are some dirents in it (not empty).
> > 
> > Could kindly explain more so I would learn more about this scenario?
> > Thanks a lot!
> 
> i_nlink == 1 just means that there is one non-deleted link.  There can be 
> links
> that have since been deleted, and file descriptors can still be open to them.

Thanks for your inspiration. You are right, thanks.

Correct my words... I didn't check the f2fs code just now; it seems f2fs doesn't
take inode_lock as some other filesystems' fsync implementations do (e.g.
__generic_file_fsync or ubifs_fsync).

And i_sem protects nlink / try_to_fix_pino similarly to some extent. It seems
there is no race when using d_find_alias here. Thanks again.

Thanks,
Gao Xiang

> 
> > 
> > >
> > > We need to find the dentry whose parent directory is still exists, i.e. 
> > > the
> > > parent directory that is counting towards 'inode->i_nlink == 1'.
> > 
> > directory counting towards 'inode->i_nlink == 1', what's happening?
> 
> The non-deleted link is the one counted in i_nlink.
> 
> > 
> > >
> > > I think d_find_alias() is what we're looking for.
> > 
> > It may be simply dentry->d_parent (stable/positive as you said before, and 
> > it's
> > not empty). why need to d_find_alias()?
> 
> Because we need to get the dentry that hasn't been deleted yet, which isn't
> necessarily the one associated with the file descriptor being fsync()'ed.
> 
> > And what is the original problem? I could not get some clue from the 
> > original
> > patch description (I only saw some extra igrab/iput because of some unknown
> > reasons), it there some backtrace related to the problem?
> 
> The problem is that i_pino gets set incorrectly.  I just noticed this while
> reviewing the code.  It's not hard to reproduce, e.g.:
> 
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <unistd.h>
> 
> int main()
> {
> int fd;
> 
> mkdir("dir1", 0700);
> mkdir("dir2", 0700);
> mknod("dir1/file", S_IFREG|0600, 0);
> link("dir1/file", "dir2/file");
> fd = open("dir2/file", O_WRONLY);
> unlink("dir2/file");
> write(fd, "X", 1);
> fsync(fd);
> }
> 
> Then:
> 
> sync
> echo N | dump.f2fs -i $(stat -c %i dir1/file) /dev/vdb | grep 'i_pino'
> echo "dir1 (correct): $(stat -c %i dir1)"
> echo "dir2 (wrong): $(stat -c %i dir2)"
> 
> i_pino will point to dir2 rather than dir1 as expected.
> 
> - Eric
> 
> 


Re: [f2fs-dev] [PATCH v11 19/25] erofs: Convert compressed files from readpages to readahead

2020-04-21 Thread Gao Xiang via Linux-f2fs-devel
Hi Andrew,

On Mon, Apr 20, 2020 at 10:42:10PM -0700, Andrew Morton wrote:
> On Tue, 14 Apr 2020 08:02:27 -0700 Matthew Wilcox  wrote:
> 
> > 
> > Use the new readahead operation in erofs.
> > 
> 
> Well this is exciting.
> 
> fs/erofs/data.c: In function erofs_raw_access_readahead:
> fs/erofs/data.c:149:18: warning: last_block may be used uninitialized in this 
> function [-Wmaybe-uninitialized]
>   *last_block + 1 != current_block) {
> 
> It seems to be a preexisting bug, which your patch prompted gcc-7.2.0
> to notice.
> 
> erofs_read_raw_page() goes in and uses *last_block, but neither of its
> callers has initialized it.  Could the erofs maintainers please take a
> look?

simply because last_block doesn't need to be initialized at first:
bio == NULL in the beginning anyway. I believe this is a gcc false
positive, since only some gcc versions raised it before (many gccs
don't, including my current gcc (Debian 8.3.0-6) 8.3.0).

in detail,

146 /* note that for readpage case, bio also equals to NULL */
147 if (bio &&
148 /* not continuous */
149 *last_block + 1 != current_block) {
150 submit_bio_retry:
151 submit_bio(bio);
152 bio = NULL;
153 }

bio will be NULL and will bypass the next condition at first.
after that,

155 if (!bio) {

...

221 bio = bio_alloc(GFP_NOIO, nblocks);

...

}

...

230 err = bio_add_page(bio, page, PAGE_SIZE, 0);
231 /* out of the extent or bio is full */
232 if (err < PAGE_SIZE)
233 goto submit_bio_retry;
234
235 *last_block = current_block;

so bio != NULL, and last_block will be assigned then as well.
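
If it helps, here is a tiny self-contained stand-in for that access pattern
(not the erofs code itself, just an illustration): *last_block is only read
when bio is already non-NULL, and bio only becomes non-NULL after *last_block
has been written, so there is no real uninitialized use on this path:

#include <stdio.h>

/* stand-in for the bio/last_block pattern in erofs_read_raw_page() */
static void read_one_block(int current_block, int *bio, int *last_block)
{
    /* "not continuous": only evaluated when a bio already exists */
    if (*bio && *last_block + 1 != current_block)
        *bio = 0;                   /* "submit_bio(bio); bio = NULL;" */

    if (!*bio)
        *bio = 1;                   /* "bio = bio_alloc(...);" */

    *last_block = current_block;    /* always set before the next read */
}

int main(void)
{
    int bio = 0;                    /* bio == NULL for the readpage case */
    int last_block;                 /* intentionally not initialized */
    int blocks[] = { 10, 11, 20 };

    for (int i = 0; i < 3; i++)
        read_one_block(blocks[i], &bio, &last_block);

    printf("last_block = %d\n", last_block);
    return 0;
}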

Thanks,
Gao Xiang






Re: [f2fs-dev] [PATCH] ext4: fix race conditions in ->d_compare() and ->d_hash()

2020-01-23 Thread Gao Xiang via Linux-f2fs-devel
On Thu, Jan 23, 2020 at 09:42:56PM -0800, Eric Biggers wrote:
> On Fri, Jan 24, 2020 at 01:34:23PM +0800, Gao Xiang wrote:
> > On Thu, Jan 23, 2020 at 09:16:01PM -0800, Eric Biggers wrote:
> > 
> > []
> > 
> > > So we need READ_ONCE() to ensure that a consistent value is used.
> > 
> > By the way, my understanding is that all pointers can be accessed
> > atomically, as guaranteed by the compiler. In my opinion, we generally
> > use READ_ONCE() on pointers for other reasons (such as avoiding
> > accessing a variable twice due to compiler optimization, which
> > could potentially break some logic, or needing some data
> > dependency barrier...)
> > 
> > Thanks,
> > Gao Xiang
> 
> But that *is* why we need READ_ONCE() here.  Without it, there's no guarantee
> that the compiler doesn't load the variable twice.  Please read:
> https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

After scanning the patch, it seems the parent variable (dentry->d_parent)
is only referenced once, as below:

-   struct inode *inode = dentry->d_parent->d_inode;
+   const struct dentry *parent = READ_ONCE(dentry->d_parent);
+   const struct inode *inode = READ_ONCE(parent->d_inode);

So I think it is enough to write

const struct inode *inode = READ_ONCE(dentry->d_parent->d_inode);

to access the parent inode only once, avoiding the parent inode being
read more than once (and all pointer dereferences should be atomic as
generated by compilers), for example in

if (!inode || !IS_CASEFOLDED(inode) || ...

or etc.

Thanks for the web reference, I will look into it. I think there
is no worry about dentry->d_parent here because there is only this one
dereference of dentry->d_parent.

You could ignore my words anyway, it's just my little thought.
The other part of the patch is OK.

Thanks,
Gao Xiang

> 
> - Eric




Re: [f2fs-dev] [PATCH] ext4: fix race conditions in ->d_compare() and ->d_hash()

2020-01-23 Thread Gao Xiang via Linux-f2fs-devel
On Thu, Jan 23, 2020 at 09:16:01PM -0800, Eric Biggers wrote:

[]

> So we need READ_ONCE() to ensure that a consistent value is used.

By the way, my understanding is that all pointers can be accessed
atomically, as guaranteed by the compiler. In my opinion, we generally
use READ_ONCE() on pointers for other reasons (such as avoiding
accessing a variable twice due to compiler optimization, which
could potentially break some logic, or needing some data
dependency barrier...)

Thanks,
Gao Xiang






Re: [f2fs-dev] [PATCH] ext4: fix race conditions in ->d_compare() and ->d_hash()

2020-01-23 Thread Gao Xiang via Linux-f2fs-devel
On Thu, Jan 23, 2020 at 09:16:01PM -0800, Eric Biggers wrote:
> On Fri, Jan 24, 2020 at 01:04:25PM +0800, Gao Xiang wrote:
> > > diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> > > index 8964778aabefb..0129d14629881 100644
> > > --- a/fs/ext4/dir.c
> > > +++ b/fs/ext4/dir.c
> > > @@ -671,9 +671,11 @@ static int ext4_d_compare(const struct dentry 
> > > *dentry, unsigned int len,
> > > const char *str, const struct qstr *name)
> > >  {
> > >   struct qstr qstr = {.name = str, .len = len };
> > > - struct inode *inode = dentry->d_parent->d_inode;
> > > + const struct dentry *parent = READ_ONCE(dentry->d_parent);
> > 
> > I'm not sure if we really need READ_ONCE d_parent here (p.s. d_parent
> > won't be NULL anyway), and d_seq will guard all its validity. If I'm
> > wrong, correct me kindly...
> > 
> > Otherwise, it looks good to me...
> > Reviewed-by: Gao Xiang 
> > 
> 
> While d_parent can't be set to NULL, it can still be changed concurrently.
> So we need READ_ONCE() to ensure that a consistent value is used.

If I understand correctly, the unlazy RCU-walk to ref-walk transition is
guarded by the seqlock, and for ref-walk we have d_lock (and even the parent
lock) in the related paths. So I prematurely think there is no race with
renaming or unlinking eventually.

I'm curious about this; experts could kindly correct me on it.

Thanks,
Gao Xiang

> 
> - Eric




Re: [f2fs-dev] [PATCH] ext4: fix race conditions in ->d_compare() and ->d_hash()

2020-01-23 Thread Gao Xiang via Linux-f2fs-devel
Hi Eric,

On Thu, Jan 23, 2020 at 08:12:34PM -0800, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
> ->d_parent and ->d_inode can be concurrently modified, and in
> particular, ->d_inode may be changed to NULL.  For ext4_d_hash() this
> resulted in a reproducible NULL dereference if a lookup is done in a
> directory being deleted, e.g. with:
> 
>   int main()
>   {
>   if (fork()) {
>   for (;;) {
>   mkdir("subdir", 0700);
>   rmdir("subdir");
>   }
>   } else {
>   for (;;)
>   access("subdir/file", 0);
>   }
>   }
> 
> ... or by running the 't_encrypted_d_revalidate' program from xfstests.
> Both repros work in any directory on a filesystem with the encoding
> feature, even if the directory doesn't actually have the casefold flag.
> 
> I couldn't reproduce a crash in ext4_d_compare(), but it appears that a
> similar crash is possible there.
> 
> Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
> falling back to the case sensitive behavior if the inode is NULL.
> 
> Reported-by: Al Viro 
> Fixes: b886ee3e778e ("ext4: Support case-insensitive file name lookups")
> Cc:  # v5.2+
> Signed-off-by: Eric Biggers 
> ---
>  fs/ext4/dir.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index 8964778aabefb..0129d14629881 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -671,9 +671,11 @@ static int ext4_d_compare(const struct dentry *dentry, 
> unsigned int len,
> const char *str, const struct qstr *name)
>  {
>   struct qstr qstr = {.name = str, .len = len };
> - struct inode *inode = dentry->d_parent->d_inode;
> + const struct dentry *parent = READ_ONCE(dentry->d_parent);

I'm not sure if we really need READ_ONCE d_parent here (p.s. d_parent
won't be NULL anyway), and d_seq will guard all its validity. If I'm
wrong, correct me kindly...

Otherwise, it looks good to me...
Reviewed-by: Gao Xiang 

Thanks,
Gao Xiang


> + const struct inode *inode = READ_ONCE(parent->d_inode);
>  
> - if (!IS_CASEFOLDED(inode) || !EXT4_SB(inode->i_sb)->s_encoding) {
> + if (!inode || !IS_CASEFOLDED(inode) ||
> + !EXT4_SB(inode->i_sb)->s_encoding) {
>   if (len != name->len)
>   return -1;
>   return memcmp(str, name->name, len);
> @@ -686,10 +688,11 @@ static int ext4_d_hash(const struct dentry *dentry, 
> struct qstr *str)
>  {
>   const struct ext4_sb_info *sbi = EXT4_SB(dentry->d_sb);
>   const struct unicode_map *um = sbi->s_encoding;
> + const struct inode *inode = READ_ONCE(dentry->d_inode);
>   unsigned char *norm;
>   int len, ret = 0;
>  
> - if (!IS_CASEFOLDED(dentry->d_inode) || !um)
> + if (!inode || !IS_CASEFOLDED(inode) || !um)
>   return 0;
>  
>   norm = kmalloc(PATH_MAX, GFP_ATOMIC);
> -- 
> 2.25.0
> 




Re: [f2fs-dev] [PATCH v5] fs: introduce is_dot_or_dotdot helper for cleanup

2019-12-11 Thread Gao Xiang via Linux-f2fs-devel
Hi Matthew,

On Wed, Dec 11, 2019 at 05:40:14AM -0800, Matthew Wilcox wrote:
> On Wed, Dec 11, 2019 at 03:17:11PM +0800, Gao Xiang wrote:
> > > static inline bool is_dot_or_dotdot(const unsigned char *name, size_t len)
> > > {
> > > if (len >= 1 && unlikely(name[0] == '.')) {
> > 
> > 
> > And I suggest dropping "unlikely" here since files starting with the prefix
> > '.' (plus the special ".", "..") are not as uncommon as you expected...
> 
> They absolutely are uncommon.  Even if you just consider
> /home/willy/kernel/linux/.git/config, only one of those six path elements
> starts with a '.'.

Okay, I think it depends on userdata and access patterns.
I admit I have no statistics on all those callers.

Just considering that this introduces an inline helper for cleanup: except for
lookup_one_len_common() (since it's on an error path), the other callers were
all without unlikely() before.

Ignore my words if that seems unreasonable or if unlikely() is an improvement
in this patch, and sorry for the noise.
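
For reference, since the quoted snippet above is cut off, one complete form
of the helper being discussed could look like this (just a sketch of the v5
shape; whether to keep the unlikely() hint is exactly the question at hand):

static inline bool is_dot_or_dotdot(const unsigned char *name, size_t len)
{
    if (len >= 1 && unlikely(name[0] == '.')) {
        if (len < 2 || (len == 2 && name[1] == '.'))
            return true;
    }
    return false;
}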

Thanks,
Gao Xiang





Re: [f2fs-dev] Potential data corruption?

2019-12-08 Thread Gao Xiang via Linux-f2fs-devel
Hi,

On Sun, Dec 08, 2019 at 09:15:55PM +0800, Hongwei Qin wrote:
> Hi,
> 
> On Sun, Dec 8, 2019 at 12:01 PM Chao Yu  wrote:
> >
> > Hello,
> >
> > > On 2019-12-7 18:10, Hongwei Qin wrote:
> > > Hi F2FS experts,
> > > The following confuses me:
> > >
> > > A typical fsync() goes like this:
> > > 1) Issue data block IOs
> > > 2) Wait for completion
> > > 3) Issue chained node block IOs
> > > 4) Wait for completion
> > > 5) Issue flush command
> > >
> > > In order to preserve data consistency under sudden power failure, it 
> > > requires that the storage device persists data blocks prior to node 
> > > blocks.
> > > Otherwise, under sudden power failure, it's possible that the persisted 
> > > node block points to NULL data blocks.
> >
> > Firstly it doesn't break POSIX semantics, right? since fsync() didn't return
> > successfully before sudden power-cut, so we can not guarantee that data is 
> > fully
> > persisted in such condition.
> >
> > However, what you want looks like atomic write semantics, which mostly 
> > database
> > want to guarantee during db file update.
> >
> > F2FS has support atomic_write via ioctl, which is used by SQLite 
> > officially, I
> > guess you can check its implementation detail.
> >
> > Thanks,
> >
> 
> Thanks for your kind reply.
> It's true that if we meet power failure before fsync() completes,
> POSIX doen't require FS to recover the file. However, consider the
> following situation:
> 
> 1) Data block IOs (Not persisted)
> 2) Node block IOs (All Persisted)
> 3) Power failure
> 
> Since the node blocks are all persisted before power failure, the node
> chain isn't broken. Note that this file's new data is not properly
> persisted before crash. So the recovery process should be able to
> recognize this situation and avoid recover this file. However, since
> the node chain is not broken, perhaps the recovery process will regard
> this file as recoverable?

As far as my own limited understanding goes, I'm afraid it seems true for the
extreme case. Without a proper FLUSH command, newer nodes could be recovered
with no newer data persisted.

So if fsync() is not successful, the old data should be read back, but in this
case unexpected data (not A or A', possibly random data C) will be considered
valid since its node is OK.

It seems it should FLUSH data before the related node chain is written, or
introduce some data checksum.

If I am wrong, kindly correct me...

Thanks,
Gao Xiang





Re: [f2fs-dev] [PATCH 2/2] f2fs: support data compression

2019-10-30 Thread Gao Xiang via Linux-f2fs-devel
Hi Eric,

(add some mm folks...)

On Wed, Oct 30, 2019 at 09:50:56AM -0700, Eric Biggers wrote:



> > >>>
> > >>> It isn't really appropriate to create fake pagecache pages like this.  
> > >>> Did you
> > >>> consider changing f2fs to use fscrypt_decrypt_block_inplace() instead?
> > >>
> > >> We need to store i_crypto_info and iv index somewhere, in order to pass 
> > >> them to
> > >> fscrypt_decrypt_block_inplace(), where did you suggest to store them?
> > >>
> > > 
> > > The same place where the pages are stored.
> > 
> > Still we need allocate space for those fields, any strong reason to do so?
> > 
> 
> page->mapping set implies that the page is a pagecache page.  Faking it could
> cause problems with code elsewhere.

Not very related to this patch: faking page->mapping was used in zsmalloc
before non-LRU migration (see material [1]) and is used in erofs now
(page->mapping indicates a non-LRU, short-lifetime temporary page type, and
page->private is used for per-page information). As far as I know, a non-LRU
page without PAGE_MAPPING_MOVABLE set is safe for most mm code.

On the other hand, I think a NULL page->mapping wastes such a field in the
precious page structure... And we cannot tell the page type directly from a
NULL value alone -- it could be a truncated file page, a just-allocated page,
or some kind of internal temporary page...

So I have a proposal to use page->mapping to indicate a specific page type for
such non-LRU pages (by some common convention, e.g. pointing to some real
structure, rather than just zeroing it out and wasting 8 bytes; it's also
natural to indicate a page type by the `mapping' name)... Since my English is
not very good, I have delayed it until now...

[1] https://elixir.bootlin.com/linux/v3.18.140/source/mm/zsmalloc.c#L379

https://lore.kernel.org/linux-mm/1459321935-3655-7-git-send-email-minc...@kernel.org
and some not very related topic: https://lwn.net/Articles/752564/

Thanks,
Gao Xiang





Re: [f2fs-dev] [PATCH] f2fs: bio_alloc should never fail

2019-10-30 Thread Gao Xiang via Linux-f2fs-devel
On Wed, Oct 30, 2019 at 09:33:13AM -0700, Jaegeuk Kim wrote:
> On 10/30, Theodore Y. Ts'o wrote:
> > On Wed, Oct 30, 2019 at 11:50:37PM +0800, Gao Xiang wrote:
> > > 
> > > So I'm curious about the original issue in commit 740432f83560
> > > ("f2fs: handle failed bio allocation"). Since f2fs manages multiple write
> > > bios with its internal fio but it seems the commit is not helpful to
> > > resolve potential mempool deadlock (I'm confused since no calltrace,
> > > maybe I'm wrong)...
> > 
> > Two possibilities come to mind.  (a) It may be that on older kernels
> > (when f2fs is backported to older Board Support Package kernels from
> > the SOC vendors) didn't have the bio_alloc() guarantee, so it was
> > necessary on older kernels, but not on upstream, or (b) it wasn't
> > *actually* possible for bio_alloc() to fail and someone added the
> > error handling in 740432f83560 out of paranoia.
> 
> Yup, I was checking old device kernels but just stopped digging it out.
> Instead, I hesitate to apply this patch since I can't get why we need to
> get rid of this code for clean-up purpose. This may be able to bring
> some hassles when backporting to android/device kernels.

Yes, I get your concern. As I have said in other patches many times, since
you're the maintainer of f2fs, it's all up to you (I'm not being paranoid).
However, I think there are 2 valid reasons:

 1) As a newbie to Linux filesystems, when I study or work on f2fs
and I see this misleading code, I think I will produce similar
code in the future (not everyone refers to the comments above bio_alloc),
so such usage will spread (since one could take some sample code
from existing code);

 2) Since it's upstream, I personally think appropriate cleanup is fine (anyway
it kills a net 20+ lines of dead code), and I think this patch isn't so harmful
for backporting.

Thanks,
Gao Xiang

> 
> > 
> > (Hence my suggestion that in the ext4 version of the patch, we add a
> > code comment justifying why there was no error checking, to make it
> > clear that this was a deliberate choice.  :-)
> > 
> > - Ted




Re: [f2fs-dev] [PATCH] f2fs: bio_alloc should never fail

2019-10-30 Thread Gao Xiang via Linux-f2fs-devel
Hi Ted,

On Wed, Oct 30, 2019 at 11:14:45AM -0400, Theodore Y. Ts'o wrote:
> On Wed, Oct 30, 2019 at 06:43:45PM +0800, Gao Xiang wrote:
> > > You're right, in low memory scenario, allocation with bioset will be 
> > > faster, as
> > > you mentioned offline, maybe we can add/use a priviate bioset like btrfs 
> > > did
> > > rather than using global one, however, we'd better check how deadlock 
> > > happen
> > > with a bioset mempool first ...
> > 
> > Okay, hope to get hints from Jaegeuk and redo this patch then...
> 
> It's not at all clear to me that using a private bioset is a good idea
> for f2fs.  That just means you're allocating a separate chunk of
> memory just for f2fs, as opposed to using the global pool.  That's an
> additional chunk of non-swapable kernel memory that's not going to be
> available, in *addition* to the global mempool.  
> 
> Also, who else would you be contending for space with the global
> mempool?  It's not like an mobile handset is going to have other users
> of the global bio mempool.
> 
> On a low-end mobile handset, memory is at a premium, so wasting memory
> to no good effect isn't going to be a great idea.

Thanks for your reply. I agree with your idea.

Actually I think after this version of the patch is applied, everything is the
same as before (whether there is some deadlock or not).

So I'm curious about the original issue in commit 740432f83560
("f2fs: handle failed bio allocation"): f2fs manages multiple write
bios with its internal fio, but it seems the commit is not helpful for
resolving a potential mempool deadlock (I'm confused since there is no
calltrace, maybe I'm wrong)...

I think that should be made clear first before deciding what to do next...
(I tend not to add another private bioset since it's unshareable, as you
 said as well...)

Thanks,
Gao Xiang

> 
> Regards,
> 
>   - Ted
> 
>
 




Re: [f2fs-dev] [PATCH V4 5/8] f2fs: Use read_callbacks for decrypting file data

2019-08-20 Thread Gao Xiang via Linux-f2fs-devel
Hi Ted,

On Tue, Aug 20, 2019 at 12:25:10PM -0400, Theodore Y. Ts'o wrote:
> On Tue, Aug 20, 2019 at 01:12:36PM +0800, Gao Xiang wrote:
> > Add a word, I have some little concern about post read procession order
> > a bit as I mentioned before, because I'd like to move common EROFS
> > decompression code out in the future as well for other fses to use
> > after we think it's mature enough.
> > 
> > It seems the current code mainly addresses eliminating duplicated code,
> > therefore I have no idea about that...
> 
> Actually, we should chat.  I was actually thinking about "borrowing"
> code from erofs to provide ext4-specific compression.  I was really
> impressed with the efficiency goals in the erofs design[1] when I
> reviewed the Usenix ATC paper, and as the saying goes, the best
> artists know how to steal from the best.  :-)
> 
> [1] https://www.usenix.org/conference/atc19/presentation/gao

I also guessed it was you who reviewed our work, from some of the written
words :) (even though it's anonymous...), and I personally think there is
some useful stuff in our EROFS effort.

> 
> My original specific thinking was to do code reuse by copy and paste,
> simply because it was simpler, and I have limited time to work on it.
> But if you are interested in making the erofs pipeline reusable by
> other file systems, and have the time to do the necessary code
> refactoring, I'd love to work with you on that.

Yes, I'm interested in making the erofs pipeline usable by generic fses.
Now I'm still investigating sequential read on very high speed NVMe
(like a SAMSUNG 970pro, one-thread sequential read >3GB/s); it seems there
is still some room for optimization.

And then I will do that work for generic fses as well... (but the first
thing I want to do is get erofs out of staging, as Greg said [1])

Metadata should be designed for each fs, like ext4, though it may not be as
flexible and compact as EROFS, so there could be some extra metadata
overhead compared with EROFS.

[1] https://lore.kernel.org/lkml/20190618064523.ga6...@kroah.com/

> 
> It should be noted that the f2fs developers have been working on their
> own compression scheme that was going to be f2fs-specific, unlike the
> file system generic approach used with fsverity and fscrypt.
> 
> My expectation is that we will need to modify the read pipeling code
> to support compression.  That's true whether we are looking at the
> existing file system-specific code used by ext4 and f2fs or in some
> combined work such as what Chandan has proposed.

I think either form is fine with me. :) But which tree we end up working on
seems a minor matter (maybe Chandan's work will be merged by then).

The first thing I need to do is to tidy up the code and make it more
general, and then it will be very easy for fses to integrate :)

Thanks,
Gao Xiang


> 
> Cheers,
> 
>   - Ted




Re: [f2fs-dev] [RFC PATCH 02/10] fs-verity: add data verification hooks for ->readpages()

2018-08-26 Thread Gao Xiang via Linux-f2fs-devel
Hi Eric,

On 2018/8/27 1:04, Eric Biggers wrote:
> Hi Chuck,
> 
> On Sun, Aug 26, 2018 at 11:55:57AM -0400, Chuck Lever wrote:
>>> +
>>> +/**
>>> + * fsverity_verify_page - verify a data page
>>> + *
>>> + * Verify a page that has just been read from a file against that file's 
>>> Merkle
>>> + * tree.  The page is assumed to be a pagecache page.
>>> + *
>>> + * Return: true if the page is valid, else false.
>>> + */
>>> +bool fsverity_verify_page(struct page *data_page)
>>> +{
>>> +   struct inode *inode = data_page->mapping->host;
>>> +   const struct fsverity_info *vi = get_fsverity_info(inode);
>>> +   struct ahash_request *req;
>>> +   bool valid;
>>> +
>>> +   req = ahash_request_alloc(vi->hash_alg->tfm, GFP_KERNEL);

Some minor suggestions occurred to me after I saw this part of code
again before sleeping...

1) How about introducing an iterator callback to avoid too many
ahash_request_alloc and ahash_request_free calls... (with many pages it
could be somewhat slower than fsverity_verify_bio...)

2) How about adding a gfp_t input argument since I don't know whether
GFP_KERNEL is suitable for all use cases...

It seems there could be more fsverity_verify_page users as well as
fsverity_verify_bio ;)
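
For 2), I mean something like the following (a sketch only, based on the
helper quoted above, with the allocation flag simply threaded through; the
name is just illustrative, not a proposal for the final API):

bool fsverity_verify_page_gfp(struct page *data_page, gfp_t gfp)
{
    struct inode *inode = data_page->mapping->host;
    const struct fsverity_info *vi = get_fsverity_info(inode);
    struct ahash_request *req;
    bool valid;

    req = ahash_request_alloc(vi->hash_alg->tfm, gfp);
    if (unlikely(!req))
        return false;

    valid = verify_page(inode, vi, req, data_page);

    ahash_request_free(req);

    return valid;
}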

Sorry for the interruption...

Thanks,
Gao Xiang

>>> +   if (unlikely(!req))
>>> +   return false;
>>> +
>>> +   valid = verify_page(inode, vi, req, data_page);
>>> +
>>> +   ahash_request_free(req);
>>> +
>>> +   return valid;
>>> +}
>>> +EXPORT_SYMBOL_GPL(fsverity_verify_page);
>>> +
>>> +/**
>>> + * fsverity_verify_bio - verify a 'read' bio that has just completed
>>> + *
>>> + * Verify a set of pages that have just been read from a file against that
>>> + * file's Merkle tree.  The pages are assumed to be pagecache pages.  
>>> Pages that
>>> + * fail verification are set to the Error state.  Verification is skipped 
>>> for
>>> + * pages already in the Error state, e.g. due to fscrypt decryption 
>>> failure.
>>> + */
>>> +void fsverity_verify_bio(struct bio *bio)
>>
>> Hi Eric-
>>
>> This kind of API won't work for remote filesystems, which do not use
>> "struct bio" to do their I/O. Could a remote filesystem solely use
>> fsverity_verify_page instead?
>>
> 
> Yes, filesystems don't have to use fsverity_verify_bio().  They can call
> fsverity_verify_page() on each page instead.  I will clarify this in the next
> revision of the patchset.
> 
> - Eric
> 



Re: [f2fs-dev] [RFC PATCH 02/10] fs-verity: add data verification hooks for ->readpages()

2018-08-26 Thread Gao Xiang via Linux-f2fs-devel
Hi Ted,

Sorry for the late reply...

On 2018/8/26 1:06, Theodore Y. Ts'o wrote:
> On Sat, Aug 25, 2018 at 03:43:43PM +0800, Gao Xiang wrote:
>>> I don't know of any plan to use fs-verity on Android's system partition or 
>>> to
>>> replace dm-verity on the system partition.  The use cases so far have been
>>> verifying files on /data, like APK files.
>>>
>>> So I don't think you need to support fs-verity in EROFS.
>>
>> Thanks for your information about fs-verity, that is quite useful for us
>> Actually, I was worrying about that these months...  :)
> 
> I'll be even clearer --- I can't *imagine* any situation where it
> would make sense to use fs-verity on the Android system partition.
> Remember, for OTA to work the system image has to be bit-for-bit
> identical to the official golden image for that release.  So the
> system image has to be completely locked down from any modification
> (to data or metadata), and that means dm-verity and *NOT* fs-verity.

I think so, mainly because of the security reason you mentioned above.

In addition, I think it is mandatory that the Android system partition
should also _never_ suffer from filesystem corruption by design (except
for storage device corruption or malware); therefore I think the
bit-for-bit, read-only, identical-and-verified requirement is quite strong
for Android, which makes the Android system steady and as solid as
a rock.

But I need to think this topic through to confirm my personal thoughts. :)

> 
> The initial use of fs-verity (as you can see if you look at AOSP) will
> be to protect a small number of privileged APK's that are stored on
> the data partition.  Previously, they were verified when they were
> downloaded, and never again.
> 
> Part of the goal which we are trying to achieve here is that even if
> the kernel gets compromised by a 0-day, a successful reboot should
> restore the system to a known state.  That is, the secure bootloader
> checks the signature of the kernel, and then in turn, dm-verity will
> verify the root Merkle hash protecting the system partition, and
> fs-verity will protect the privileged APK's.  If malware modifies any
> these components in an attempt to be persistent, the modifications
> would be detected, and the worst it could do is to cause subsequent
> reboots to fail until the phone's software could be reflashed.
> 

Yeah, I have seen the fs-verity presentation and materials from
Android bootcamp and other official channels before.


Thanks for your kind and detailed explanation. :)


Best regards,
Gao Xiang

> Cheers,
> 
>   - Ted
> 



[f2fs-dev] [PATCH RFC v5] f2fs: flush cp pack except cp pack 2 page at first

2018-02-09 Thread Gao Xiang via Linux-f2fs-devel
Previously, we attempted to flush the whole cp pack in a single bio;
however, when suddenly powering off at this point, we could get into
an extreme scenario where the cp pack 1 page and cp pack 2 page are updated
and current, but the payload or current summaries are still partially
outdated. (see reliable write in the UFS specification)

This patch submits the whole cp pack except the cp pack 2 page at first,
and then writes the cp pack 2 page with an extra independent
bio with a pre-io barrier.

Signed-off-by: Gao Xiang 
Reviewed-by: Chao Yu 
---
Change log from v4:
  - remove redundant "filemap_fdatawait_range"
  - filemap_fdatawait_range(NODE_MAPPING(sbi), 0, LLONG_MAX);
  - filemap_fdatawait_range(META_MAPPING(sbi), 0, LLONG_MAX);
  - move f2fs_flush_device_cache to a more suitable position
  - wait_on_all_pages_writeback after commit_checkpoint
  - since we remove lots of redundant code, I think it's acceptable
  - and it will ensure one checkpoint safety.
Change log from v3:
  - further review comments are applied from Jaegeuk and Chao
  - Tested on this patch (without multiple-device): mount, boot Android with 
f2fs userdata and make fragment
  - If any problem with this patch or I miss something, please kindly share 
your comments, thanks :)
Change log from v2:
  - Apply the review comments from Chao
Change log from v1:
  - Apply the review comments from Chao
  - time data from "finish block_ops" to " finish checkpoint" (tested on ARM64 
with TOSHIBA 128GB UFS):
 Before patch: 0.002273  0.001973  0.002789  0.005159  0.002050
 After patch: 0.002502  0.001624  0.002487  0.003049  0.002696

 fs/f2fs/checkpoint.c | 69 ++--
 1 file changed, 46 insertions(+), 23 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 512dca8..4e352cf 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -1162,6 +1162,39 @@ static void update_ckpt_flags(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
spin_unlock_irqrestore(&sbi->cp_lock, flags);
 }
 
+static void commit_checkpoint(struct f2fs_sb_info *sbi,
+   void *src, block_t blk_addr)
+{
+   struct writeback_control wbc = {
+   .for_reclaim = 0,
+   };
+
+   /*
+* pagevec_lookup_tag and lock_page again will take
+* some extra time. Therefore, update_meta_pages and
+* sync_meta_pages are combined in this function.
+*/
+   struct page *page = grab_meta_page(sbi, blk_addr);
+   int err;
+
+   memcpy(page_address(page), src, PAGE_SIZE);
+   set_page_dirty(page);
+
+   f2fs_wait_on_page_writeback(page, META, true);
+   f2fs_bug_on(sbi, PageWriteback(page));
+   if (unlikely(!clear_page_dirty_for_io(page)))
+   f2fs_bug_on(sbi, 1);
+
+   /* writeout cp pack 2 page */
+   err = __f2fs_write_meta_page(page, &wbc, FS_CP_META_IO);
+   f2fs_bug_on(sbi, err);
+
+   f2fs_put_page(page, 0);
+
+   /* submit checkpoint (with barrier if NOBARRIER is not set) */
+   f2fs_submit_merged_write(sbi, META_FLUSH);
+}
+
 static int do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
 {
struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
@@ -1264,16 +1297,6 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
}
}
 
-   /* need to wait for end_io results */
-   wait_on_all_pages_writeback(sbi);
-   if (unlikely(f2fs_cp_error(sbi)))
-   return -EIO;
-
-   /* flush all device cache */
-   err = f2fs_flush_device_cache(sbi);
-   if (err)
-   return err;
-
/* write out checkpoint buffer at block 0 */
update_meta_page(sbi, ckpt, start_blk++);
 
@@ -1301,26 +1324,26 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
start_blk += NR_CURSEG_NODE_TYPE;
}
 
-   /* writeout checkpoint block */
-   update_meta_page(sbi, ckpt, start_blk);
+   /* update user_block_counts */
+   sbi->last_valid_block_count = sbi->total_valid_block_count;
+   percpu_counter_set(&sbi->alloc_valid_block_count, 0);
 
-   /* wait for previous submitted node/meta pages writeback */
+   /* Here, we have one bio having CP pack except cp pack 2 page */
+   sync_meta_pages(sbi, META, LONG_MAX, FS_CP_META_IO);
+
+   /* wait for previous submitted meta pages writeback */
wait_on_all_pages_writeback(sbi);
 
if (unlikely(f2fs_cp_error(sbi)))
return -EIO;
 
-   filemap_fdatawait_range(NODE_MAPPING(sbi), 0, LLONG_MAX);
-   filemap_fdatawait_range(META_MAPPING(sbi), 0, LLONG_MAX);
-
-   /* update user_block_counts */
-   sbi->last_valid_block_count = sbi->total_valid_block_count;
-   percpu_counter_set(&sbi->alloc_valid_block_count, 0);
-
-   /* Here, we only have one bio having CP pack */
-   sync_meta_pages(sbi, META_FLUSH, 

Re: [f2fs-dev] [PATCH] f2fs: fix to handle looped node chain during recovery

2018-02-03 Thread Gao Xiang via Linux-f2fs-devel


Sorry, I have now read through the related code; please ignore these replies.


On 2018/2/3 18:45, Gao Xiang wrote:



On 2018/2/3 18:35, Gao Xiang wrote:

Hi Chao and YunLei,


On 2018/2/3 17:44, Chao Yu wrote:

There is no checksum in node block now, so bit-transition from hardware
can make node_footer.next_blkaddr being corrupted w/o any detection,
result in node chain becoming looped one.

For this condition, during recovery, in order to avoid running into 
dead

loop, let's detect it and just skip out.

Signed-off-by: Yunlei He 
Signed-off-by: Chao Yu 
---
  fs/f2fs/recovery.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
index b6d1ec620a8c..60dd0cee4820 100644
--- a/fs/f2fs/recovery.c
+++ b/fs/f2fs/recovery.c
@@ -243,6 +243,9 @@ static int find_fsync_dnodes(struct f2fs_sb_info 
*sbi, struct list_head *head,

  struct curseg_info *curseg;
  struct page *page = NULL;
  block_t blkaddr;
+    unsigned int loop_cnt = 0;
+    unsigned int free_blocks = sbi->user_block_count -
+    valid_user_blocks(sbi);
There exists another way to detect the loop faster, using only 
two variables.
The algorithm is simply described as: "B goes forward one step only 
when A goes forward 2 (or a constant x, more than 1) steps".



For example:
1)
   1   2  3  4   5   6     7
   |     \ /
   |        \--/
  A, B
2)
   1  2  3  4   5   6     7
    |   |    \ /
   B   A    \--/
 3)
   1  2  3  4   5   6     7
   |    | \ /
  B   A      \--/
  4)
   1  2  3  4   5   6     7
   |   |\ /
  B      A \--/
5)




Sorry, it seems the encoded diagram is in a mess, I try again.
1)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
   |   \ /
   |    \-<-/
  A, B
2)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
   |    |  \ /
   |    |   \-<-/
   B    A
3)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
    |    | \ /
    |    |  \-<-/
    B    A
4)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
    | |\ /
    | | \-<-/
    B A
5)
If B catches up with A, there is a cycle.


Thanks,

B will reach or go beyond A if and only if there is a cycle.
It's a faster algorithm. :D

Thanks,


  int err = 0;
    /* get node pages in the current segment */
@@ -295,6 +298,17 @@ static int find_fsync_dnodes(struct 
f2fs_sb_info *sbi, struct list_head *head,

  if (IS_INODE(page) && is_dent_dnode(page))
  entry->last_dentry = blkaddr;
  next:
+    /* sanity check in order to detect looped node chain */
+    if (++loop_cnt >= free_blocks ||
+    blkaddr == next_blkaddr_of_node(page)) {
+    f2fs_msg(sbi->sb, KERN_NOTICE,
+    "%s: detect looped node chain, "
+    "blkaddr:%u, next:%u",
+    __func__, blkaddr, next_blkaddr_of_node(page));
+    err = -EINVAL;
+    break;
+    }
+
  /* check next segment */
  blkaddr = next_blkaddr_of_node(page);
  f2fs_put_page(page, 1);









Re: [f2fs-dev] [PATCH] f2fs: fix to handle looped node chain during recovery

2018-02-03 Thread Gao Xiang via Linux-f2fs-devel



On 2018/2/3 18:35, Gao Xiang wrote:

Hi Chao and YunLei,


On 2018/2/3 17:44, Chao Yu wrote:

There is no checksum in the node block now, so a bit flip from hardware
can corrupt node_footer.next_blkaddr without any detection, resulting in
the node chain becoming a looped one.

For this condition, during recovery, let's detect it and just bail out in
order to avoid running into a dead loop.

Signed-off-by: Yunlei He 
Signed-off-by: Chao Yu 
---
  fs/f2fs/recovery.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
index b6d1ec620a8c..60dd0cee4820 100644
--- a/fs/f2fs/recovery.c
+++ b/fs/f2fs/recovery.c
@@ -243,6 +243,9 @@ static int find_fsync_dnodes(struct f2fs_sb_info 
*sbi, struct list_head *head,

  struct curseg_info *curseg;
  struct page *page = NULL;
  block_t blkaddr;
+    unsigned int loop_cnt = 0;
+    unsigned int free_blocks = sbi->user_block_count -
+    valid_user_blocks(sbi);
There is another, faster way to detect the loop using only two variables.
The algorithm is simply described as: "B goes forward one step only when A
goes forward 2 (or a constant x, more than 1) steps".



For example:
1)
   1   2  3  4   5   6     7
   |     \ /
   |        \--/
  A, B
2)
   1  2  3  4   5   6     7
    |   |    \ /
   B   A    \--/
 3)
   1  2  3  4   5   6     7
   |    | \ /
  B   A      \--/
  4)
   1  2  3  4   5   6     7
   |   |\ /
  B      A \--/
5)




Sorry, it seems the encoded diagram is a mess; let me try again.
1)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
   |   \ /
   |    \-<-/
  A, B
2)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
   |    |  \ /
   |    |   \-<-/
   B    A
3)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
    |    | \ /
    |    |  \-<-/
    B    A
4)
   1 -> 2 -> 3 -> 4 -> 5 ->  6 -> 7
    | |\ /
    | | \-<-/
    B A
5)
If B catches up with A, there is a cycle.


Thanks,

B will reach or go beyond A if and only if there is a cycle.
It's a faster algorithm. :D

Thanks,


  int err = 0;
    /* get node pages in the current segment */
@@ -295,6 +298,17 @@ static int find_fsync_dnodes(struct f2fs_sb_info 
*sbi, struct list_head *head,

  if (IS_INODE(page) && is_dent_dnode(page))
  entry->last_dentry = blkaddr;
  next:
+    /* sanity check in order to detect looped node chain */
+    if (++loop_cnt >= free_blocks ||
+    blkaddr == next_blkaddr_of_node(page)) {
+    f2fs_msg(sbi->sb, KERN_NOTICE,
+    "%s: detect looped node chain, "
+    "blkaddr:%u, next:%u",
+    __func__, blkaddr, next_blkaddr_of_node(page));
+    err = -EINVAL;
+    break;
+    }
+
  /* check next segment */
  blkaddr = next_blkaddr_of_node(page);
  f2fs_put_page(page, 1);







Re: [f2fs-dev] [PATCH] f2fs: fix to handle looped node chain during recovery

2018-02-03 Thread Gao Xiang via Linux-f2fs-devel

Hi Chao and YunLei,


On 2018/2/3 17:44, Chao Yu wrote:

There is no checksum in the node block now, so a bit flip from hardware
can corrupt node_footer.next_blkaddr without any detection, resulting in
the node chain becoming a looped one.

For this condition, during recovery, let's detect it and just bail out in
order to avoid running into a dead loop.

Signed-off-by: Yunlei He 
Signed-off-by: Chao Yu 
---
  fs/f2fs/recovery.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
index b6d1ec620a8c..60dd0cee4820 100644
--- a/fs/f2fs/recovery.c
+++ b/fs/f2fs/recovery.c
@@ -243,6 +243,9 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, 
struct list_head *head,
struct curseg_info *curseg;
struct page *page = NULL;
block_t blkaddr;
+   unsigned int loop_cnt = 0;
+   unsigned int free_blocks = sbi->user_block_count -
+   valid_user_blocks(sbi);
There is another, faster way to detect the loop using only two variables.
The algorithm is simply described as: "B goes forward one step only when A
goes forward 2 steps".

For example:
1)
   1   2  3  4   5   6     7
   |     \ /
   |        \--/
  A, B
2)
   1  2  3  4   5   6     7
    |   |    \ /
   B   A    \--/
 3)
   1  2  3  4   5   6     7
   |    | \ /
  B   A      \--/
  4)
   1  2  3  4   5   6     7
   |   |\ /
  B      A \--/
5)

B will reach or go beyond A if and only if there is a cycle.
It's a faster algorithm. :D

Thanks,
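
As an aside, a minimal user-space sketch of that two-cursor scheme is
below; the node chain is modelled here as a plain array of "next" indices,
so this only illustrates the algorithm itself and is not f2fs code:

#include <stdbool.h>
#include <stdio.h>

#define CHAIN_END (-1)

/* B advances one step only after A has advanced two steps. */
static bool chain_is_looped(const int *next, int start)
{
	int slow = start, fast = start;	/* slow is B, fast is A */

	while (fast != CHAIN_END) {
		/* A goes forward two steps */
		fast = next[fast];
		if (fast == CHAIN_END)
			break;
		fast = next[fast];

		/* B goes forward one step */
		slow = next[slow];

		/* B meets A if and only if the chain has a cycle */
		if (slow == fast)
			return true;
	}
	return false;
}

int main(void)
{
	int looped[]   = { 1, 2, 3, 4, 2 };	/* node 4 points back to node 2 */
	int straight[] = { 1, 2, CHAIN_END };	/* chain terminates normally */

	printf("looped: %d, straight: %d\n",
	       chain_is_looped(looped, 0), chain_is_looped(straight, 0));
	return 0;
}

Compared with the loop_cnt/free_blocks counter in the patch, this needs no
knowledge of the free block count, at the cost of reading each node block
roughly three times in the worst case.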


int err = 0;
  
  	/* get node pages in the current segment */

@@ -295,6 +298,17 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, 
struct list_head *head,
if (IS_INODE(page) && is_dent_dnode(page))
entry->last_dentry = blkaddr;
  next:
+   /* sanity check in order to detect looped node chain */
+   if (++loop_cnt >= free_blocks ||
+   blkaddr == next_blkaddr_of_node(page)) {
+   f2fs_msg(sbi->sb, KERN_NOTICE,
+   "%s: detect looped node chain, "
+   "blkaddr:%u, next:%u",
+   __func__, blkaddr, next_blkaddr_of_node(page));
+   err = -EINVAL;
+   break;
+   }
+
/* check next segment */
blkaddr = next_blkaddr_of_node(page);
f2fs_put_page(page, 1);





Re: [f2fs-dev] [PATCH RFC] f2fs: flush cp pack except cp page2 at first

2018-01-24 Thread Gao Xiang via Linux-f2fs-devel

Hi Chao,


On 2018/1/24 23:57, Chao Yu wrote:

On 2018/1/24 14:53, Gaoxiang (OS) wrote:

Previously, we attempted to flush the whole cp pack in a single bio;
however, on a sudden power-off at that point we could hit an extreme
scenario in which cp page 1 and cp page 2 are updated and latest, but the
payload or current summaries are still outdated.
(see reliable write in the UFS spec)

This patch writes the whole cp pack except cp page 2 with FLUSH first,
and then writes cp page 2 with an extra independent bio with FLUSH.

Signed-off-by: Gao Xiang 
---
  fs/f2fs/checkpoint.c | 48 +---
  fs/f2fs/f2fs.h   |  3 ++-
  fs/f2fs/segment.c| 11 +--
  3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 14d2fed..e7f5e85 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -300,6 +300,35 @@ static int f2fs_write_meta_pages(struct address_space 
*mapping,
return 0;
  }
  
+static int sync_meta_page_locked(struct f2fs_sb_info *sbi,

+   struct page *page,
+   enum page_type type, enum iostat_type io_type)
+{
+   struct writeback_control wbc = {
+   .for_reclaim = 0,
+   };
+   int err;
+
+   BUG_ON(page->mapping != META_MAPPING(sbi));
+   BUG_ON(!PageDirty(page));
+
+   f2fs_wait_on_page_writeback(page, META, true);
+
+   BUG_ON(PageWriteback(page));
+   if (unlikely(!clear_page_dirty_for_io(page)))
+   BUG();
+
+   err = __f2fs_write_meta_page(page, &wbc, io_type);
+   if (err) {
+   f2fs_put_page(page, 1);
+   return err;
+   }
+   f2fs_put_page(page, 0);
+
+   f2fs_submit_merged_write(sbi, type);
+   return err;
+}
+
  long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
long nr_to_write, enum iostat_type io_type)
  {
@@ -1172,6 +1201,7 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, struct 
cp_control *cpc)
struct curseg_info *seg_i = CURSEG_I(sbi, CURSEG_HOT_NODE);
u64 kbytes_written;
int err;
+   struct page *cp_page2;
  
  	/* Flush all the NAT/SIT pages */

while (get_pages(sbi, F2FS_DIRTY_META)) {
@@ -1250,7 +1280,7 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, struct 
cp_control *cpc)
blk = start_blk + sbi->blocks_per_seg - nm_i->nat_bits_blocks;
for (i = 0; i < nm_i->nat_bits_blocks; i++)
update_meta_page(sbi, nm_i->nat_bits +
-   (i << F2FS_BLKSIZE_BITS), blk + i);
+   (i << F2FS_BLKSIZE_BITS), blk + i, 
NULL);
  
  		/* Flush all the NAT BITS pages */

while (get_pages(sbi, F2FS_DIRTY_META)) {
@@ -1271,11 +1301,11 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
return err;
  
  	/* write out checkpoint buffer at block 0 */

-   update_meta_page(sbi, ckpt, start_blk++);
+   update_meta_page(sbi, ckpt, start_blk++, NULL);
  
  	for (i = 1; i < 1 + cp_payload_blks; i++)

update_meta_page(sbi, (char *)ckpt + i * F2FS_BLKSIZE,
-   start_blk++);
+   start_blk++, NULL);
  
  	if (orphan_num) {

write_orphan_inodes(sbi, start_blk);
@@ -1297,9 +1327,6 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, struct 
cp_control *cpc)
start_blk += NR_CURSEG_NODE_TYPE;
}
  
-	/* writeout checkpoint block */

-   update_meta_page(sbi, ckpt, start_blk);
-
/* wait for previous submitted node/meta pages writeback */
wait_on_all_pages_writeback(sbi);
  
@@ -1313,12 +1340,19 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)

sbi->last_valid_block_count = sbi->total_valid_block_count;
percpu_counter_set(&sbi->alloc_valid_block_count, 0);
  
-	/* Here, we only have one bio having CP pack */

+   /* Here, we only have one bio having CP pack except cp page 2 */
sync_meta_pages(sbi, META_FLUSH, LONG_MAX, FS_CP_META_IO);

We don't need to use META_FLUSH here.


Hmm... I think we need to write to the device medium rather than the
device cache, or am I missing something?
Could you give me some hints about that? PREFLUSH or something else? I
cannot see any code related to that yet...



  
  	/* wait for previous submitted meta pages writeback */

wait_on_all_pages_writeback(sbi);
  
+	/* write and flush checkpoint cp page 2 */

+   update_meta_page(sbi, ckpt, start_blk, &cp_page2);
+   sync_meta_page_locked(sbi, cp_page2, META_FLUSH, FS_CP_META_IO);

How about

sync_checkpoint()
{
page = grab_meta_page()
memcpy()
set_page_dirty()

...
__f2fs_write_meta_page()
f2fs_put_page()
f2fs_submit_merged_write()
}
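
A rough sketch of how that suggested helper might be fleshed out is below;
the helper names (grab_meta_page, __f2fs_write_meta_page, f2fs_put_page,
f2fs_submit_merged_write) follow the code quoted above, but the exact
signatures and wiring here are assumptions, not the final implementation:

static int sync_checkpoint(struct f2fs_sb_info *sbi,
			   struct f2fs_checkpoint *ckpt, block_t blk)
{
	struct writeback_control wbc = { .for_reclaim = 0 };
	struct page *page = grab_meta_page(sbi, blk);
	int err;

	/* copy the cp pack 2 payload into the meta page and dirty it */
	memcpy(page_address(page), ckpt, F2FS_BLKSIZE);
	set_page_dirty(page);

	f2fs_wait_on_page_writeback(page, META, true);
	if (!clear_page_dirty_for_io(page)) {
		f2fs_put_page(page, 1);
		return -EIO;
	}

	err = __f2fs_write_meta_page(page, &wbc, FS_CP_META_IO);
	if (err) {
		f2fs_put_page(page, 1);
		return err;
	}
	f2fs_put_page(page, 0);

	/* submit cp pack 2 on its own so it is flushed after the rest */
	f2fs_submit_merged_write(sbi, META_FLUSH);
	return 0;
}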


Re: [f2fs-dev] [PATCH RFC] f2fs: add PRE2 to mark segments free to one checkpoint but obsolete to the other

2018-01-20 Thread Gao Xiang via Linux-f2fs-devel

Hi Weichao,


On 2018/1/21 0:02, guoweichao wrote:

Hi Xiang,

it's not related to SPOR. Just consider the case given by Chao Yu.



(It seems this email was not sent successfully; I am resending it for
reference only.)

Oh I see, I have considered the scenario Chao described before.

1. write data into segment x;
2. write checkpoint A;
3. remove data in segment x;
4. write checkpoint B;
5. issue discard or write data into segment x;
6. sudden power-cut

Since f2fs is designed around a double backup, steps 5) and 6), I think,
actually belong to checkpoint C. Once we are in checkpoint C, checkpoint A
becomes unsafe because the latest checkpoint is B, and in that case I think
we cannot prevent data writeback or anything else from polluting checkpoint A.

However, node segments (another kind of metadata) would be special, and I
have no idea whether introducing PRE2 would make all cases safe.

In addition, if some data segments change between checkpoints A and C,
switching back to checkpoint A could expose weird data a file never had
(new data, or data from other files).


Thanks,
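
To make the state change under discussion concrete, here is a toy sketch
(not f2fs code, just an illustration of the proposed PRE -> PRE2 -> FREE
idea): a prefree segment only becomes reusable after it has survived two
checkpoints, so it stays obsolete to both CP packs:

#include <stdio.h>

/* Toy model of segment dirty-list states under the proposed PRE2 scheme. */
enum seg_state { SEG_DIRTY, SEG_PRE, SEG_PRE2, SEG_FREE };

/* Transition applied to each prefree segment when a checkpoint completes. */
static enum seg_state on_checkpoint(enum seg_state s)
{
	switch (s) {
	case SEG_PRE:
		return SEG_PRE2; /* free to the new CP, still referenced by the old one */
	case SEG_PRE2:
		return SEG_FREE; /* obsolete to both CP packs, safe to reuse or discard */
	default:
		return s;
	}
}

int main(void)
{
	enum seg_state s = SEG_PRE;

	s = on_checkpoint(s);	/* first checkpoint:  PRE  -> PRE2 */
	s = on_checkpoint(s);	/* second checkpoint: PRE2 -> FREE */
	printf("reusable after two checkpoints: %d\n", s == SEG_FREE);
	return 0;
}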


Thanks,
*From:*Gao Xiang
*To:*guoweichao,
*Cc:*linux-f2fs-devel@lists.sourceforge.net,heyunlei,
*Date:*2018-01-20 11:49:22
*Subject:*Re: [PATCH RFC] f2fs: add PRE2 to mark segments free to one 
checkpoint but obsolete to the other


Hi Weichao,


On 2018/1/19 23:47, guoweichao wrote:
> A critical case is using the free segments as data segments when they
> were previously node segments in the old checkpoint. With fault injection
> of the newer CP pack, fsck can find errors when checking the sanity of nid.

Sorry to interrupt; I'm just curious about this scenario and the details.

As far as I know, if all the blocks in a segment become invalid, the
segment will become PREFREE, and then, once a checkpoint follows, we can
reuse this segment or discard the whole segment safely after that
checkpoint is done (I think that makes sure this segment is certainly FREE
and not reused in this checkpoint).

If the segment in the old checkpoint is a node segment, and all node
blocks in the segment are invalid by the new checkpoint, it seems there is
no danger in reusing the FREE node segment as a data segment in the next
checkpoint?

Or is it something related to SPOR? In my mind, f2fs-tools ignores the POR
node chain...

Thanks,
> On 2018/1/20 7:29, Weichao Guo wrote:
>> Currently, we set prefree segments as free ones after writing
>> a checkpoint, then believe the segments could be used safely.
>> However, if a sudden power-off comes and the newer checkpoint is
>> corrupted due to hardware issues at the same time, we will try
>> to use the old checkpoint and may face an inconsistent file
>> system status.
>>
>> How about adding a PRE2 status for prefree segments, and making
>> sure the segments can be used safely with both checkpoints?
>> Or any better solutions? Or this is not a problem?
>>
>> Look forward to your comments!
>>
>> Signed-off-by: Weichao Guo 
>> ---
>>   fs/f2fs/gc.c  | 11 +--
>>   fs/f2fs/segment.c | 21 ++---
>>   fs/f2fs/segment.h |  6 ++
>>   3 files changed, 33 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
>> index 33e7969..153e3ea 100644
>> --- a/fs/f2fs/gc.c
>> +++ b/fs/f2fs/gc.c
>> @@ -1030,7 +1030,12 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>>    * threshold, we can make them free by checkpoint. 
Then, we

>>    * secure free segments which doesn't need fggc any more.
>>    */
>> -    if (prefree_segments(sbi)) {
>> +    if (prefree_segments(sbi) || prefree2_segments(sbi)) {
>> +    ret = write_checkpoint(sbi, &cpc);
>> +    if (ret)
>> +    goto stop;
>> +    }
>> +    if (has_not_enough_free_secs(sbi, 0, 0) && 
prefree2_segments(sbi)) {

>>   ret = write_checkpoint(sbi, &cpc);
>>   if (ret)
>>   goto stop;
>> @@ -1063,8 +1068,10 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>>   goto gc_more;
>>   }
>>
>> -    if (gc_type == FG_GC)
>> +    if (gc_type == FG_GC) {
>> +    ret = write_checkpoint(sbi, &cpc);
>>   ret = write_checkpoint(sbi, &cpc);
>> +    }
>>   }
>>   stop:
>>   SIT_I(sbi)->last_victim[ALLOC_NEXT] = 0;
>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>> index 2e8e054d..9dec445 100644
>> --- a/fs/f2fs/segment.c
>> +++ b/fs/f2fs/segment.c
>> @@ -1606,7 +1606,7 @@ static void 
set_prefree_as_free_segments(struct f2fs_sb_info *sbi)

>>   unsigned int segno;
>>
>>   mutex_lock(&dirty_i->seglist_lock);
>> -    for_each_set_bit(segno, dirty_i->dirty_segmap[PRE], 
MAIN_SEGS(sbi))
>> +    for_each_set_bit(segno, dirty_i->dirty_segmap[PRE2], 
MAIN_SEGS(sbi))

>>   __set_test_and_free(sbi, segno);
>>   

Re: [f2fs-dev] [PATCH RFC] f2fs: add PRE2 to mark segments free to one checkpoint but obsolete to the other

2018-01-20 Thread Gao Xiang via Linux-f2fs-devel

Hi Chao and Weichao,


On 2018/1/21 10:34, Chao Yu wrote:

Hi Weichao,

On 2018/1/20 23:50, guoweichao wrote:

Hi Chao,

Yes, it is exactly what I mean.
It seems that F2FS has no responsibility to cover hardware problems.
However, file systems usually choose redundancy for super block fault tolerance.
So I think we actually have considered some external errors when designing a 
file system.
Our dual checkpoint mechanism is mainly designed to keep at least one stable
CP pack while creating a new one in case of SPO. And it has fault tolerant 
effects.
As the CP pack is also very critical to F2FS, why not make the checkpoint
more robust

I think you're trying to change the basic design from an A/B upgrade system
to an A/B/C one, which always keeps two checkpoints valid. There is no simple
modification to implement that; we would need to cover not only the prefree
case but also the SSR case.
*nod*, triple-like backups would not solve all hardware issues in an even
worse case, and in order to keep all backups usable we would need to find
extra area to hold all the modifications among the 3 checkpoints.


In my opinion, introducing a "snapshot" feature could be something more
useful for phones (yet it is a big feature :( )

and if fsck finds something unrecoverable, we could roll back to a stable
and reliable snapshot (verified by re-checking metadata, maybe from hours
or days ago), and remind users about the rollback in fsck or somewhere.

Sorry to bother all,

Thanks,


IMO, the biggest problem there is available space: in checkpoint C we can
only use blocks that are invalid in both checkpoints A and B, so in some
cases there will be almost no space we can use during allocation, resulting
in frequent checkpoints.

IMO, what we can do is try to keep the last valid checkpoint as intact as
possible. One way is to add a mirror or parity for the checkpoint, which can
help recovery once the checkpoint is corrupted. At least, I hope that with it
in a debug version we can help the hardware staff fix their issue instead of
wasting much time troubleshooting a filesystem issue.

Thanks,


with a simple modification to the current design and little overhead except for FG_GC.
Of course, keeping it unchanged is OK; I just want to discuss this proposal. :)

Thanks,
*From:*Chao Yu
*To:*guoweichao,jaeg...@kernel.org,
*Cc:*linux-f2fs-devel@lists.sourceforge.net,linux-fsde...@vger.kernel.org,heyunlei,
*Date:*2018-01-20 15:43:23
*Subject:*Re: [PATCH RFC] f2fs: add PRE2 to mark segments free to one 
checkpoint but obsolete to the other

Hi Weichao,

On 2018/1/20 7:29, Weichao Guo wrote:

Currently, we set prefree segments as free ones after writing
a checkpoint, then believe the segments could be used safely.
However, if a sudden power-off comes and the newer checkpoint is
corrupted due to hardware issues at the same time, we will try
to use the old checkpoint and may face an inconsistent file
system status.

IIUC, you mean:

1. write nodes into segment x;
2. write checkpoint A;
3. remove nodes in segment x;
4. write checkpoint B;
5. issue discard or write datas into segment x;
6. sudden power-cut

But after reboot we find checkpoint B is corrupted due to hardware and then
start to use checkpoint A; the nodes in segment x recorded as valid data in
checkpoint A have been overwritten in step 5), so we will encounter
inconsistent metadata, right?

Thanks,


How about adding a PRE2 status for prefree segments, and making
sure the segments can be used safely with both checkpoints?
Or any better solutions? Or this is not a problem?

Look forward to your comments!

Signed-off-by: Weichao Guo 
---
  fs/f2fs/gc.c  | 11 +--
  fs/f2fs/segment.c | 21 ++---
  fs/f2fs/segment.h |  6 ++
  3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 33e7969..153e3ea 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -1030,7 +1030,12 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
    * threshold, we can make them free by checkpoint. Then, we
    * secure free segments which doesn't need fggc any more.
    */
- if (prefree_segments(sbi)) {
+ if (prefree_segments(sbi) || prefree2_segments(sbi)) {
+ ret = write_checkpoint(sbi, &cpc);
+ if (ret)
+ goto stop;
+ }
+ if (has_not_enough_free_secs(sbi, 0, 0) && prefree2_segments(sbi)) {
   ret = write_checkpoint(sbi, &cpc);
   if (ret)
   goto stop;
@@ -1063,8 +1068,10 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
   goto gc_more;
   }
  
- if (gc_type == FG_GC)

+ if (gc_type == FG_GC) {
+ ret = write_checkpoint(sbi, &cpc);
   ret = write_checkpoint(sbi, &cpc);
+ }
   }
  stop:
   SIT_I(sbi)->last_victim[ALLOC_NEXT] = 0;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 2e8e054d..9dec445 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1606,7 +1606,7 @@ static void set_prefree_as_free_segments(struct 
f2fs_sb_info *sbi)
   unsigned int segno;
  
   mutex_lock(&dirty_i->seglist_lock);

- for_each_set_bit(segno, 

Re: [f2fs-dev] [PATCH RFC] f2fs: add PRE2 to mark segments free to one checkpoint but obsolete to the other

2018-01-19 Thread Gao Xiang via Linux-f2fs-devel

Hi Weichao,


On 2018/1/19 23:47, guoweichao wrote:

A critical case is using the free segments as data segments when they
were previously node segments in the old checkpoint. With fault injection
of the newer CP pack, fsck can find errors when checking the sanity of nid.
Sorry to interrupt; I'm just curious about this scenario and the details.


As far as I know, if all the blocks in a segment become invalid, the
segment will become PREFREE, and then, once a checkpoint follows, we can
reuse this segment or discard the whole segment safely after that
checkpoint is done (I think that makes sure this segment is certainly FREE
and not reused in this checkpoint).


If the segment in the old checkpoint is a node segment, and all node
blocks in the segment are invalid by the new checkpoint, it seems there is
no danger in reusing the FREE node segment as a data segment in the next
checkpoint?


Or is it something related to SPOR? In my mind, f2fs-tools ignores the POR
node chain...


Thanks,

On 2018/1/20 7:29, Weichao Guo wrote:

Currently, we set prefree segments as free ones after writing
a checkpoint, then believe the segments could be used safely.
However, if a sudden power-off comes and the newer checkpoint is
corrupted due to hardware issues at the same time, we will try
to use the old checkpoint and may face an inconsistent file
system status.

How about adding a PRE2 status for prefree segments, and making
sure the segments can be used safely with both checkpoints?
Or any better solutions? Or this is not a problem?

Look forward to your comments!

Signed-off-by: Weichao Guo 
---
  fs/f2fs/gc.c  | 11 +--
  fs/f2fs/segment.c | 21 ++---
  fs/f2fs/segment.h |  6 ++
  3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 33e7969..153e3ea 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -1030,7 +1030,12 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
 * threshold, we can make them free by checkpoint. Then, we
 * secure free segments which doesn't need fggc any more.
 */
-   if (prefree_segments(sbi)) {
+   if (prefree_segments(sbi) || prefree2_segments(sbi)) {
+   ret = write_checkpoint(sbi, &cpc);
+   if (ret)
+   goto stop;
+   }
+   if (has_not_enough_free_secs(sbi, 0, 0) && 
prefree2_segments(sbi)) {
ret = write_checkpoint(sbi, &cpc);
if (ret)
goto stop;
@@ -1063,8 +1068,10 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
goto gc_more;
}
  
-		if (gc_type == FG_GC)

+   if (gc_type == FG_GC) {
+   ret = write_checkpoint(sbi, &cpc);
ret = write_checkpoint(sbi, &cpc);
+   }
}
  stop:
SIT_I(sbi)->last_victim[ALLOC_NEXT] = 0;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 2e8e054d..9dec445 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1606,7 +1606,7 @@ static void set_prefree_as_free_segments(struct 
f2fs_sb_info *sbi)
unsigned int segno;
  
  	mutex_lock(&dirty_i->seglist_lock);

-   for_each_set_bit(segno, dirty_i->dirty_segmap[PRE], MAIN_SEGS(sbi))
+   for_each_set_bit(segno, dirty_i->dirty_segmap[PRE2], MAIN_SEGS(sbi))
__set_test_and_free(sbi, segno);
mutex_unlock(&dirty_i->seglist_lock);
  }
@@ -1617,13 +1617,17 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
struct list_head *head = &cpc->entry_list;
struct discard_entry *entry, *this;
struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
-   unsigned long *prefree_map = dirty_i->dirty_segmap[PRE];
+   unsigned long *prefree_map;
unsigned int start = 0, end = -1;
unsigned int secno, start_segno;
bool force = (cpc->reason & CP_DISCARD);
+   int phase = 0;
+   enum dirty_type dirty_type = PRE2;
  
  	mutex_lock(&dirty_i->seglist_lock);
  
+next_step:

+   prefree_map = dirty_i->dirty_segmap[dirty_type];
while (1) {
int i;
start = find_next_bit(prefree_map, MAIN_SEGS(sbi), end + 1);
@@ -1635,7 +1639,7 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
for (i = start; i < end; i++)
clear_bit(i, prefree_map);
  
-		dirty_i->nr_dirty[PRE] -= end - start;

+   dirty_i->nr_dirty[dirty_type] -= end - start;
  
  		if (!test_opt(sbi, DISCARD))

continue;
@@ -1663,6 +1667,16 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, 
struct cp_control *cpc)
else
end = start - 1;
}
+   if (phase == 0) {
+   /* status change: PRE -> PRE2 */
+   for_each_set_bit(segno, 

Re: [f2fs-dev] [PATCH v2] mkfs.f2fs: expand scalability of nat bitmap

2018-01-16 Thread Gao Xiang via Linux-f2fs-devel

Hi Chao Yu,


On 2018/1/17 11:15, Chao Yu wrote:

Hi Jaegeuk,

On 2018/1/17 8:47, Jaegeuk Kim wrote:

Hi Chao,

On 01/15, Chao Yu wrote:

Previously, our total node count (nat_bitmap) and total nat segment count
do not monotonously increase along with image size, and the max nat_bitmap
size is limited by "CHECKSUM_OFFSET - sizeof(struct f2fs_checkpoint) + 1";
this scales poorly when the user wants to create more inodes/nodes in a
larger image.

So this patch tries to relieve the limitation by limiting, by default, the
total nat entry number to 20% of the total block number.

Before:
image_size(GB)  nat_bitmap  sit_bitmap  nat_segment  sit_segment
16              3836        64          36           2
32              3836        64          72           2
64              3772        128         116          4
128             3708        192         114          6
256             3580        320         110          10

As you can see, the nat_segment count decreases as the image size grows
starting from 64GB; that means the nat segment count does not monotonously
increase with image size, so would it be better to activate this only when
the image size is larger than 32GB?

IMO, configuring a basic nid ratio with a fixed value like ext4 ("free
inode" : "free block" is about 1 : 4) would be better:
a. It will be easy for the user to predict the nid count or nat segment
count for a fixed-size image;
b. If the user wants to reserve a larger nid count, we can support a -N
option in mkfs.f2fs to specify the total nid count as the user wishes.
I agree, because it is weird if the nat segment count does not increase
monotonously, especially for server users. How about modifying it like this?
32GB~xxxGB (if max_sit_bitmap_size + max_nat_bitmap_size <=
MAX_BITMAP_SIZE_IN_CKPT) --- use the original nat calculation version;

>xxx GB --- use CP_LARGE_NAT_BITMAP_FLAG and the introduced ratio.
If a user-defined -N is specified, use that nid count (or ratio?) instead
of the above calculation.


Thanks,


How do you think?

Thanks,


512             3260        640         100          20
1024            2684        1216        82           38
2048            1468        2432        44           76
4096            3900        4800        120          150

After:
image_size(GB)  nat_bitmap  sit_bitmap  nat_segment  sit_segment
16              256         64          8            2
32              512         64          16           2
64              960         128         30           4
128             1856        192         58           6
256             3712        320         116          10

Can we activate this, if size is larger than 256GB or something around that?

Thanks,


512             7424        640         232          20
1024            14787       1216        462          38
2048            29504       2432        922          76
4096            59008       4800        1844         150

Signed-off-by: Chao Yu 
---
v2:
- add CP_LARGE_NAT_BITMAP_FLAG flag to indicate new layout of nat/sit bitmap.
  fsck/f2fs.h| 19 +--
  fsck/resize.c  | 35 +--
  include/f2fs_fs.h  |  8 ++--
  lib/libf2fs.c  |  1 +
  mkfs/f2fs_format.c | 45 +++--
  5 files changed, 60 insertions(+), 48 deletions(-)

diff --git a/fsck/f2fs.h b/fsck/f2fs.h
index f5970d9dafc0..8a5ce365282d 100644
--- a/fsck/f2fs.h
+++ b/fsck/f2fs.h
@@ -239,6 +239,12 @@ static inline unsigned int ofs_of_node(struct f2fs_node 
*node_blk)
return flag >> OFFSET_BIT_SHIFT;
  }
  
+static inline bool is_set_ckpt_flags(struct f2fs_checkpoint *cp, unsigned int f)

+{
+   unsigned int ckpt_flags = le32_to_cpu(cp->ckpt_flags);
+   return ckpt_flags & f ? 1 : 0;
+}
+
  static inline unsigned long __bitmap_size(struct f2fs_sb_info *sbi, int flag)
  {
struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
@@ -256,6 +262,13 @@ static inline void *__bitmap_ptr(struct f2fs_sb_info *sbi, 
int flag)
  {
struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
int offset;
+
+   if (is_set_ckpt_flags(ckpt, CP_LARGE_NAT_BITMAP_FLAG)) {
+   offset = (flag == SIT_BITMAP) ?
+   le32_to_cpu(ckpt->nat_ver_bitmap_bytesize) : 0;
+   return &ckpt->sit_nat_version_bitmap + offset;
+   }
+
if (le32_to_cpu(F2FS_RAW_SUPER(sbi)->cp_payload) > 0) {
if (flag == NAT_BITMAP)
return &ckpt->sit_nat_version_bitmap;
@@ -268,12 +281,6 @@ static inline void *__bitmap_ptr(struct f2fs_sb_info *sbi, 
int flag)
}
  }
  
-static inline bool is_set_ckpt_flags(struct f2fs_checkpoint *cp, unsigned int f)

-{
-   unsigned int ckpt_flags = le32_to_cpu(cp->ckpt_flags);
-   return ckpt_flags 

Re: [f2fs-dev] [PATCH] mkfs.f2fs: expand scalability of nat bitmap

2018-01-13 Thread Gao Xiang via Linux-f2fs-devel

Hi Chao,


On 2018/01/12 18:24, Chao Yu wrote:


Previously, our total node count (nat_bitmap) and total nat segment count
do not monotonously increase along with image size, and the max nat_bitmap
size is limited by "CHECKSUM_OFFSET - sizeof(struct f2fs_checkpoint) + 1";
this scales poorly when the user wants to create more inodes/nodes in a
larger image.

So this patch tries to relieve the limitation by limiting, by default, the
total nat entry number to 20% of the total block number.

Before:
image_size(GB)  nat_bitmap  sit_bitmap  nat_segment  sit_segment
16              3836        64          36           2
32              3836        64          72           2
64              3772        128         116          4
128             3708        192         114          6
256             3580        320         110          10
512             3260        640         100          20
1024            2684        1216        82           38
2048            1468        2432        44           76
4096            3900        4800        120          150

After:
image_size(GB)  nat_bitmap  sit_bitmap  nat_segment  sit_segment
16              256         64          8            2
32              512         64          16           2
64              960         128         30           4
128             1856        192         58           6
256             3712        320         116          10
512             7424        640         232          20
1024            14787       1216        462          38
2048            29504       2432        922          76
4096            59008       4800        1844         150
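
As a quick sanity check of the 20% ratio, a back-of-the-envelope
computation reproduces the 64GB row above. The figures assume 4KB blocks,
512 blocks per segment and 455 NAT entries per block, and the rounding
here may differ slightly from the real mkfs.f2fs code:

#include <stdio.h>

int main(void)
{
	unsigned long long total_blocks = (64ULL << 30) / 4096;	/* 64GB image */
	unsigned long long nat_entries  = total_blocks * 20 / 100;	/* 20% ratio */
	unsigned long long nat_blocks   = (nat_entries + 454) / 455;	/* round up */
	unsigned long long nat_segments = (nat_blocks + 511) / 512;	/* round up */

	/* two copies of the NAT area, as in set_sb(segment_count_nat, ... * 2) */
	printf("nat segments for a 64GB image: %llu\n", nat_segments * 2);
	return 0;
}

This prints 30, which matches the nat_segment value in the 64GB row of the
"After" table.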

Signed-off-by: Chao Yu 
---
  fsck/resize.c  | 31 ++-
  include/f2fs_fs.h  |  6 --
  mkfs/f2fs_format.c | 35 ++-
  3 files changed, 32 insertions(+), 40 deletions(-)

diff --git a/fsck/resize.c b/fsck/resize.c
index 143ad5d3c0a1..7613c7df4893 100644
--- a/fsck/resize.c
+++ b/fsck/resize.c
@@ -16,7 +16,7 @@ static int get_new_sb(struct f2fs_super_block *sb)
u_int32_t sit_segments, diff, total_meta_segments;
u_int32_t total_valid_blks_available;
u_int32_t sit_bitmap_size, max_sit_bitmap_size;
-   u_int32_t max_nat_bitmap_size, max_nat_segments;
+   u_int32_t max_nat_bitmap_size;
u_int32_t segment_size_bytes = 1 << (get_sb(log_blocksize) +
get_sb(log_blocks_per_seg));
u_int32_t blks_per_seg = 1 << get_sb(log_blocks_per_seg);
@@ -47,7 +47,8 @@ static int get_new_sb(struct f2fs_super_block *sb)
get_sb(segment_count_sit))) * blks_per_seg;
blocks_for_nat = SIZE_ALIGN(total_valid_blks_available,
NAT_ENTRY_PER_BLOCK);
-   set_sb(segment_count_nat, SEG_ALIGN(blocks_for_nat));
+   set_sb(segment_count_nat, SEG_ALIGN(blocks_for_nat) *
+   DEFAULT_NAT_ENTRY_RATIO / 100);
  
  	sit_bitmap_size = ((get_sb(segment_count_sit) / 2) <<

get_sb(log_blocks_per_seg)) / 8;
@@ -56,25 +57,21 @@ static int get_new_sb(struct f2fs_super_block *sb)
else
max_sit_bitmap_size = sit_bitmap_size;
  
-	/*

-* It should be reserved minimum 1 segment for nat.
-* When sit is too large, we should expand cp area. It requires more 
pages for cp.
-*/
-   if (max_sit_bitmap_size > MAX_SIT_BITMAP_SIZE_IN_CKPT) {
-   max_nat_bitmap_size = CHECKSUM_OFFSET - sizeof(struct 
f2fs_checkpoint) + 1;
-   set_sb(cp_payload, F2FS_BLK_ALIGN(max_sit_bitmap_size));
+   max_nat_bitmap_size = (get_sb(segment_count_nat) <<
+   get_sb(log_blocks_per_seg)) / 8;
segment_count_nat would not exceed (CHECKSUM_OFFSET - sizeof(struct 
f2fs_checkpoint) + 1) * 8 >> get_sb(log_blocks_per_seg) (align-down), I 
think.

In my mind, the nat version bitmap is only located in the cp page...
So segment_count_nat has a fixed upper bound...
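
In other words, as long as the nat version bitmap has to fit in the
checkpoint block itself, the cap follows directly from that formula. A tiny
sketch with placeholder numbers (the real CHECKSUM_OFFSET, checkpoint
struct size and the sit bitmap's share are different; this only shows the
order of magnitude):

#include <stdio.h>

int main(void)
{
	/* placeholder values, not the real on-disk constants */
	unsigned int checksum_offset    = 4092;	/* stands in for CHECKSUM_OFFSET */
	unsigned int ckpt_struct_size   = 192;	/* stands in for sizeof(struct f2fs_checkpoint) */
	unsigned int log_blocks_per_seg = 9;	/* 512 blocks per segment */

	/* bytes left for bitmaps inside the checkpoint block (sit share ignored) */
	unsigned int bitmap_bytes = checksum_offset - ckpt_struct_size + 1;
	unsigned int max_nat_segments = (bitmap_bytes * 8) >> log_blocks_per_seg;

	printf("bitmap bytes: %u, max nat segments w/o cp_payload: %u\n",
	       bitmap_bytes, max_nat_segments);
	return 0;
}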


+
+   set_sb(segment_count_nat, get_sb(segment_count_nat) * 2);
+
+   /* use cp_payload if free space of f2fs_checkpoint is not enough */
+   if (max_sit_bitmap_size + max_nat_bitmap_size >
+   MAX_BITMAP_SIZE_IN_CKPT) {
+   u_int32_t diff =  max_sit_bitmap_size + max_nat_bitmap_size -
+   MAX_BITMAP_SIZE_IN_CKPT;

I looked at the f2fs source code: in __bitmap_ptr(), if the payload is
used, the whole payload is used for the sit bitmap.

Therefore, I mean, should it be "u_int32_t diff = max_sit_bitmap_size"?

Or should the definition in f2fs.h be modified?


+   set_sb(cp_payload, F2FS_BLK_ALIGN(diff));
 

[f2fs-dev] [PATCH RFC] f2fs: refactor get_new_segment

2017-12-16 Thread Gao Xiang via Linux-f2fs-devel
get_new_segment is hard to follow and understand.
This patch refactors it in a straightforward way, and
I think it is equivalent to the original.

This patch also fixes two issues in the original get_new_segment:
1) left_start could wrap around (unsigned underflow) when hint == 0 at first
(see the small demonstration after issue 2):
...
} else {
go_left = 1;
*   left_start = hint - 1;
}
...
*   while (test_bit(left_start, free_i->free_secmap)) {
2) It redundantly calls find_next_zero_bit again when go_left == true with ALLOC_LEFT:
...
find_other_zone:
*   secno = find_next_zero_bit(free_i->free_secmap, MAIN_SECS(sbi), hint);
if (secno >= MAIN_SECS(sbi)) {
if (dir == ALLOC_RIGHT) {
...
} else {
go_left = 1;
...
}
}
if (go_left == 0)
goto skip_left;
...
if (i < NR_CURSEG_TYPE) {
/* zone is in user, try another */
*   if (go_left)
*   hint = zoneno * sbi->secs_per_zone - 1;
...
init = false;
goto find_other_zone;
}
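
For issue 1), the problem is plain unsigned wraparound; a tiny user-space
demonstration (illustration only, not f2fs code):

#include <stdio.h>

int main(void)
{
	unsigned int hint = 0;
	unsigned int left_start = hint - 1;	/* wraps to UINT_MAX */

	/* the subsequent test_bit(left_start, ...) would index far out of range */
	printf("left_start = %u\n", left_start);
	return 0;
}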

Signed-off-by: Gao Xiang 
---
 fs/f2fs/segment.c | 137 --
 1 file changed, 81 insertions(+), 56 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index c117e09..eea9d3f 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -2047,82 +2047,107 @@ static void get_new_segment(struct f2fs_sb_info *sbi,
unsigned int *newseg, bool new_sec, int dir)
 {
struct free_segmap_info *free_i = FREE_I(sbi);
-   unsigned int segno, secno, zoneno;
+   unsigned int segno = *newseg, zoneno;
+   unsigned int secno = GET_SEC_FROM_SEG(sbi, segno);
unsigned int total_zones = MAIN_SECS(sbi) / sbi->secs_per_zone;
-   unsigned int hint = GET_SEC_FROM_SEG(sbi, *newseg);
-   unsigned int old_zoneno = GET_ZONE_FROM_SEG(sbi, *newseg);
-   unsigned int left_start = hint;
-   bool init = true;
-   int go_left = 0;
+   unsigned int old_zoneno = GET_ZONE_FROM_SEC(sbi, secno);
+   bool may_again = true, actually_go_left = false;
int i;
 
spin_lock(&free_i->segmap_lock);
 
-   if (!new_sec && ((*newseg + 1) % sbi->segs_per_sec)) {
+   /* first, attempt to find a segment in the current section */
+   if (!new_sec && ((segno + 1) % sbi->segs_per_sec)) {
+   unsigned end_segno = GET_SEG_FROM_SEC(sbi, secno + 1);
+
segno = find_next_zero_bit(free_i->free_segmap,
-   GET_SEG_FROM_SEC(sbi, hint + 1), *newseg + 1);
-   if (segno < GET_SEG_FROM_SEC(sbi, hint + 1))
-   goto got_it;
+   end_segno, segno + 1);
+
+   if (segno < end_segno)
+   goto out;
}
-find_other_zone:
-   secno = find_next_zero_bit(free_i->free_secmap, MAIN_SECS(sbi), hint);
-   if (secno >= MAIN_SECS(sbi)) {
-   if (dir == ALLOC_RIGHT) {
-   secno = find_next_zero_bit(free_i->free_secmap,
-   MAIN_SECS(sbi), 0);
-   f2fs_bug_on(sbi, secno >= MAIN_SECS(sbi));
-   } else {
-   go_left = 1;
-   left_start = hint - 1;
+
+try_another_section:
+   if (likely(!actually_go_left)) {
+   /*
+* since ALLOC_LEFT takes much effort,
+* prefer to ALLOC_RIGHT first
+*/
+   unsigned int new_secno = find_next_zero_bit(
+   free_i->free_secmap, MAIN_SECS(sbi), secno);
+
+   if (new_secno < MAIN_SECS(sbi)) {
+   secno = new_secno;
+   goto check_another_zone;
}
}
-   if (go_left == 0)
-   goto skip_left;
 
-   while (test_bit(left_start, free_i->free_secmap)) {
-   if (left_start > 0) {
-   left_start--;
-   continue;
-   }
-   left_start = find_next_zero_bit(free_i->free_secmap,
-   MAIN_SECS(sbi), 0);
-   f2fs_bug_on(sbi, left_start >= MAIN_SECS(sbi));
-   break;
+   if (dir == ALLOC_LEFT) {
+   /* ALLOC_LEFT, no free sections on the right side */
+   actually_go_left = true;
+
+   while(secno)
+   if (!test_bit(--secno, free_i->free_secmap))
+   goto check_another_zone;
}
-   secno = left_start;
-skip_left:
-   segno = GET_SEG_FROM_SEC(sbi, secno);
-   zoneno = GET_ZONE_FROM_SEC(sbi, secno);
 
+   /*
+* when ALLOC_RIGHT and secno >=