Re: [f2fs-dev] general stability of f2fs?
Hi Marc,

I'm very interested in trying f2fs on SMR drives too. I also think that several characteristics of SMR drives are very similar to flash drives. So far, f2fs has performed well on embedded systems like smart phones. For server environments, however, I haven't been able to test f2fs very intensively. The major uncovered code areas would be:

- over 4TB storage space case
- inline_dentry mount option; I'm still working on extent_cache for v4.3 too
- various sizes of section and zone
- tmpfile and rename2 interfaces

In your logs, I suspect some fsck.f2fs bugs in the large-storage case. In order to confirm that, could you use the latest f2fs-tools from:

http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

And, if possible, could you share some experiences from when you didn't fill the partition up to 100%? If there is no problem there, we can focus nicely on ENOSPC only.

Thanks,

On Sat, Aug 08, 2015 at 10:50:03PM +0200, Marc Lehmann wrote:

Hi!

I did some more experiments, and wonder about the general stability of f2fs. I have not managed to keep an f2fs filesystem working for longer than a few days.

For example, a few days ago I created an 8TB volume and copied 2TB of data to it, which worked until I hit the (very low...) 32k limit on the number of subdirectories. I moved some directories into a single subdirectory, and continued. Everything seemed fine.

Today I ran fsck.f2fs on the fs, which found 4 inodes with wrong link counts (generally higher than fsck counted). It asked me whether to fix this, which I did.
I then did another fsck run, and was greeted with tens of thousands of errors: http://ue.tst.eu/f692bac9abbe4e910787adee18ec52be.txt

Mounting made the box unusable for multiple minutes, probably due to the amount of backtraces: http://ue.tst.eu/6243cc344a943d95a20907ecbc37061f.txt

The data is toast (which is fine, I am still experimenting only), but this, the weird write behaviour, and the fact that you don't get signalled on ENOSPC make me wonder what the general status of f2fs is. It *seems* to have been in actual use for a number of years now, and I would expect small hiccups and problems, so backups would be advised, but this level of brokenness (I only tested Linux 3.18.14 and 4.1.4) is not something I expected from a fs that has been in development for so long.

So I wonder what the general stability expectation for f2fs is - is it just meant to be an experimental fs not to be used for any real data, or am I just unlucky and hit so many disastrous bugs by chance? (It's really too bad; it's the only fs in Linux that has stable write performance on SMR drives at this time.)

--
The choice of a Deliantra, the free code+content MORPG
Marc Lehmann <schm...@schmorp.de>
http://www.deliantra.net

--
___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
Re: [f2fs-dev] general stability of f2fs?
On Mon, Aug 10, 2015 at 10:53:32PM +0200, Marc Lehmann wrote:

On Mon, Aug 10, 2015 at 01:31:06PM -0700, Jaegeuk Kim <jaeg...@kernel.org> wrote:

I'm very interested in trying f2fs on SMR drives too. I also think that several characteristics of SMR drives are very similar to flash drives.

Indeed, but of course there isn't an exact match for any characteristic. Also, in the end, drive-managed SMR drives will suck somewhat with any filesystem (note that nilfs performs very badly, even though it should be better than anything else until the drive is completely full).

IMO, it's similar to flash drives too. Indeed, I believe host-managed SMR/flash drives are likely to show much better performance than drive-managed ones. However, I think there are many HW constraints inside the storage that make it hard to move toward that easily.

Now, looking at the characteristics of f2fs, it could be a good match for any rotational media, too, since it writes linearly and can defragment. At least for desktop or similar loads (where files usually aren't randomly written, but mostly replaced and rarely appended).

Possible, but not much different from other filesystems. :)

The only crucial ability it would need is to be able to free large chunks for rewriting, which should be in f2fs as well. So at this time, what I apparently need is mkfs.f2fs -s128 instead of -s7.

I wrote a patch to fix the documentation. Sorry about that.

Unfortunately, I probably can't run these tests immediately, and they do take some days to run, but hopefully I can repeat my experiments next week.

- over 4TB storage space case

fsck limits could well have been the issue for my first big filesystem, but not the second (which was only 128G in size, to be able to utilize it within a reasonable time).

- inline_dentry mount option; I'm still working on extent_cache for v4.3 too

I only enabled mount options other than noatime for the 128G filesystem, so it might well have caused the trouble with it.
Okay, so I think it'd be good to start with:

- noatime,inline_xattr,inline_data,flush_merge,extent_cache

And you can control defragmentation through /sys/fs/f2fs/[DEV]/gc_[min|max|no]_sleep_time.

Another thing that will seriously hamper adoption of these drives is the 32000 limit on hardlinks - I am hard pressed to find any large file tree here that doesn't have places with more subdirectories than that somewhere, but I guess on a 32GB phone flash storage, this was less of a concern.

At a glance, it'll be no problem to increase it to 64k. Let me check again.

In any case, if f2fs turns out to be workable, it will become the fs of choice for my archival uses, and maybe even more, and I'll then have to somehow cope with that limit.

In your logs, I suspect some fsck.f2fs bugs in the large-storage case. In order to confirm that, could you use the latest f2fs-tools from: http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

Will do so. Is there a repository for out-of-tree module builds of f2fs? It seems kernels 3.17.x to 4.1 (at least) have a kernel bug making reads from these SMR drives unstable (https://bugzilla.kernel.org/show_bug.cgi?id=93581), so I will have to test with a relatively old kernel or play too many tricks. What kernel version do you prefer?

I've been maintaining f2fs for v3.10 mainly. http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.10

Thanks,

And I suspect from glancing over patches (and mount options) that there have been quite some improvements in f2fs since the 3.16 days.

And, if possible, could you share some experiences from when you didn't fill the partition up to 100%? If there is no problem there, we can focus nicely on ENOSPC only.

My experience was that f2fs wrote at nearly the maximum I/O speed of the drives. In fact, I couldn't saturate the bandwidth except when writing small files, because the 8-drive source raid using xfs was not able to read files quickly enough.
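For what it's worth, those sysfs knobs could be driven from a small script along these lines. This is only a sketch: the knob names are the ones mentioned above, while the device name sdb1 and the 30000 ms value are made-up examples, and the exact set of knobs varies by kernel version.

```shell
#!/bin/sh
# Sketch: set the f2fs background-GC sleep times (in milliseconds) via sysfs.
# Knob names are from this thread; the device name and value are examples.
set_gc_sleep() { # usage: set_gc_sleep <sysfs-dir> <milliseconds>
    dir=$1; ms=$2
    for knob in gc_min_sleep_time gc_max_sleep_time gc_no_gc_sleep_time; do
        # skip knobs this kernel doesn't expose
        if [ -w "$dir/$knob" ]; then
            echo "$ms" > "$dir/$knob"
        fi
    done
}

# e.g. let background GC sleep longer, reducing idle-time churn:
set_gc_sleep /sys/fs/f2fs/sdb1 30000
```

Larger values make background GC lazier; smaller values make it defragment more aggressively while the device is idle.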
After writing an initial tree of 2TB, directory reading and mass stat seemed to be considerably slower and take more time directly afterwards. I don't know if that is something that balancing can fix (or improve), but I am not overly concerned about it, as the difference to e.g. xfs is not that big (roughly a factor of two), and these operations are too slow for me on any device, so I usually put a dm-cache in front of such storage devices.

I don't think that I have more useful data to report - if I used 14MB sections, performance would predictably suck, so the real test is still outstanding.

Stay tuned, and thanks for your reply!

--
Marc Lehmann <schm...@schmorp.de>
http://www.deliantra.net
Re: [f2fs-dev] [PATCH 3/4] f2fs: handle error of f2fs_iget correctly
Hi Chao,

On Fri, Aug 07, 2015 at 06:41:02PM +0800, Chao Yu wrote:

In recover_orphan_inode, if f2fs_iget fails, we change it to report the error number to its caller instead of calling bug_on.

Let's keep this in order to catch any bugs. Or, is there another issue on this?

Thanks,

Signed-off-by: Chao Yu <chao2...@samsung.com>
---
 fs/f2fs/checkpoint.c | 25 ++++++++++++++++++---------
 fs/f2fs/f2fs.h       |  2 +-
 fs/f2fs/super.c      |  4 +++-
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index c311176..e2def90 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -468,22 +468,28 @@ void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
 	__remove_ino_entry(sbi, ino, ORPHAN_INO);
 }

-static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
+static int recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
 {
-	struct inode *inode = f2fs_iget(sbi->sb, ino);
-	f2fs_bug_on(sbi, IS_ERR(inode));
+	struct inode *inode;
+
+	inode = f2fs_iget(sbi->sb, ino);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
 	clear_nlink(inode);

 	/* truncate all the data during iput */
 	iput(inode);
+	return 0;
 }

-void recover_orphan_inodes(struct f2fs_sb_info *sbi)
+int recover_orphan_inodes(struct f2fs_sb_info *sbi)
 {
 	block_t start_blk, orphan_blocks, i, j;
+	int err;

 	if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG))
-		return;
+		return 0;

 	set_sbi_flag(sbi, SBI_POR_DOING);
@@ -499,14 +505,19 @@ void recover_orphan_inodes(struct f2fs_sb_info *sbi)
 		orphan_blk = (struct f2fs_orphan_block *)page_address(page);
 		for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) {
 			nid_t ino = le32_to_cpu(orphan_blk->ino[j]);
-			recover_orphan_inode(sbi, ino);
+			err = recover_orphan_inode(sbi, ino);
+			if (err) {
+				f2fs_put_page(page, 1);
+				clear_sbi_flag(sbi, SBI_POR_DOING);
+				return err;
+			}
 		}
 		f2fs_put_page(page, 1);
 	}
 	/* clear Orphan Flag */
 	clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG);
 	clear_sbi_flag(sbi, SBI_POR_DOING);
-	return;
+	return 0;
 }

 static
void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 311960c..4a6f69b 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1750,7 +1750,7 @@ int acquire_orphan_inode(struct f2fs_sb_info *);
 void release_orphan_inode(struct f2fs_sb_info *);
 void add_orphan_inode(struct f2fs_sb_info *, nid_t);
 void remove_orphan_inode(struct f2fs_sb_info *, nid_t);
-void recover_orphan_inodes(struct f2fs_sb_info *);
+int recover_orphan_inodes(struct f2fs_sb_info *);
 int get_valid_checkpoint(struct f2fs_sb_info *);
 void update_dirty_page(struct inode *, struct page *);
 void add_dirty_dir_inode(struct inode *);
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 12eb69d..e5efb53 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1246,7 +1246,9 @@ try_onemore:
 	f2fs_join_shrinker(sbi);

 	/* if there are nt orphan nodes free them */
-	recover_orphan_inodes(sbi);
+	err = recover_orphan_inodes(sbi);
+	if (err)
+		goto free_node_inode;

 	/* read root inode and dentry */
 	root = f2fs_iget(sb, F2FS_ROOT_INO(sbi));
--
2.4.2
[f2fs-dev] [PATCH 1/2] f2fs: increase the number of max hard links
This patch increases the maximum number of hard links for one file.

Signed-off-by: Jaegeuk Kim <jaeg...@kernel.org>
---
 fs/f2fs/f2fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 3884794..f18d31e 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -321,7 +321,7 @@ enum {
 	 */
 };

-#define F2FS_LINK_MAX	32000	/* maximum link count per file */
+#define F2FS_LINK_MAX	65536	/* maximum link count per file */

 #define MAX_DIR_RA_PAGES	4	/* maximum ra pages of dir */
--
2.1.1
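For anyone wanting to check what limit a given filesystem actually enforces, a crude probe is to keep hard-linking a file until ln fails. This is a sketch only: run it on the filesystem under test, and raise N toward the suspected limit (creating 32000 links only takes a second or two on most disks).

```shell
#!/bin/sh
# Crude probe: create hard links until the filesystem refuses (EMLINK)
# or we reach N. On pre-patch f2fs the refusal would come at 32000 links.
dir=$(mktemp -d)
touch "$dir/f"
N=100   # raise toward 32000/65536 to probe a real limit
i=0
while [ "$i" -lt "$N" ] && ln "$dir/f" "$dir/l$i" 2>/dev/null; do
    i=$((i + 1))
done
echo "created $i extra links"
rm -rf "$dir"
```

Note that for directories the same nlink counter also bounds the number of subdirectories, which is how Marc hit the limit with an ordinary directory tree.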
[f2fs-dev] [PATCH 1/3] mkfs.f2fs: fix wrong documentation
The -s option should be the number of segments per section.

Reported-by: Marc Lehmann <schm...@schmorp.de>
Signed-off-by: Jaegeuk Kim <jaeg...@kernel.org>
---
 man/mkfs.f2fs.8 | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/man/mkfs.f2fs.8 b/man/mkfs.f2fs.8
index f386ac6..48654aa 100644
--- a/man/mkfs.f2fs.8
+++ b/man/mkfs.f2fs.8
@@ -21,7 +21,7 @@ mkfs.f2fs \- create an F2FS file system
 ]
 [
 .B \-s
-.I log-based-#-of-segments-per-section
+.I #-of-segments-per-section
 ]
 [
 .B \-z
@@ -60,10 +60,10 @@ Specify the volume label to the partition mounted as F2FS.
 Specify the percentage over the volume size for overprovision area. This area
 is hidden to users, and utilized by F2FS cleaner. The default percentage is 5%.
 .TP
-.BI \-s log-based-#-of-segments-per-section
-Specify the log-based number of segments per section. A section consists of
+.BI \-s #-of-segments-per-section
+Specify the number of segments per section. A section consists of
 multiple consecutive segments, and is the unit of garbage collection.
-The default number is 0, which means one segment is assigned to a section.
+The default number is 1, which means one segment is assigned to a section.
 .TP
 .BI \-z #-of-sections-per-zone
 Specify the number of sections per zone. A zone consists of multiple sections.
--
2.1.1
[f2fs-dev] [PATCH 3/3] mkfs.f2fs: don't need to limit MIN_VOLUME_SIZE
The minimum volume size is determined while preparing the superblock.

Signed-off-by: Jaegeuk Kim <jaeg...@kernel.org>
---
 include/f2fs_fs.h | 1 -
 lib/libf2fs.c     | 7 -------
 2 files changed, 8 deletions(-)

diff --git a/include/f2fs_fs.h b/include/f2fs_fs.h
index 59cc0d1..38a774c 100644
--- a/include/f2fs_fs.h
+++ b/include/f2fs_fs.h
@@ -209,7 +209,6 @@ static inline uint64_t bswap_64(uint64_t val)
 #define CHECKSUM_OFFSET		4092

 /* for mkfs */
-#define F2FS_MIN_VOLUME_SIZE	104857600
 #define F2FS_NUMBER_OF_CHECKPOINT_PACK	2
 #define DEFAULT_SECTOR_SIZE	512
 #define DEFAULT_SECTORS_PER_BLOCK	8
diff --git a/lib/libf2fs.c b/lib/libf2fs.c
index e8f4d47..83d1296 100644
--- a/lib/libf2fs.c
+++ b/lib/libf2fs.c
@@ -499,13 +499,6 @@ int f2fs_get_device_info(struct f2fs_configuration *c)
 	MSG(0, "Info: sector size = %u\n", c->sector_size);
 	MSG(0, "Info: total sectors = %"PRIu64" (%"PRIu64" MB)\n",
 				c->total_sectors, c->total_sectors >> 11);
-	if (c->total_sectors < (F2FS_MIN_VOLUME_SIZE / c->sector_size)) {
-		MSG(0, "Error: Min volume size supported is %d\n",
-				F2FS_MIN_VOLUME_SIZE);
-		return -1;
-	}
-
 	return 0;
 }
--
2.1.1
Re: [f2fs-dev] f2fs for SMR drives
On Mon, Aug 10, 2015 at 06:20:40PM +0800, Chao Yu <chao2...@samsung.com> wrote:

'-s7' means that we configure seg_per_sec into 7, so our section size will

Ah, I see, I am a victim of a documentation bug then: according to the mkfs.f2fs (1.4.0) documentation, -s7 means 256MB (2 * 2**7), so that explains it. Good news, I will retest ASAP!

which may cause low performance, is that right?

Yes, if the documentation is wrong, that would explain the bad performance of defragmented sections.

I have no SMR device, so I have to use a hard disk for testing; I can't reproduce this issue with cp on such a device. But for rsync, one thing I note is that: I used rsync to copy a 32g local file to an f2fs partition with 100% utilized space and no available blocks for further allocation. It took a very long time for 'the copy'; finally it reported that there is no space.

Strange. For me, on 3.18.14, I could cp and rsync to a 100% utilized disk at full (read) speed, but it didn't do any I/O (and the files never arrived). That was the same partition that later had the link count mismatches.

b) In f2fs, we use inode/data block space mixedly, so when the data block count is zero, we can't create any file in f2fs. This makes rsync fail in step 2, and leads it to run into the discard_receive_data function, which still receives the whole src file. This makes the rsync process keep writing while generating no IO in the f2fs filesystem.

I am sorry, that cannot be true - if file creation failed, then rsync would simply be unable to write anything; it wouldn't have a valid fd to write to. I also strace'd it, and it successfully open()ed, write()ed AND close()ed the file. It can only be explained by f2fs neither creating nor writing the file, without giving an error. In any case, instead of discarding data, the filesystem should of course return ENOSPC, as anything else causes data loss.
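The gap between the two readings of -s is easy to see with a quick shell calculation (segment size of 2 MiB, as stated in this thread; the "old docs" line is the log2 interpretation from the buggy man page):

```shell
#!/bin/sh
# Section size: corrected reading (-s = segments per section) vs. the
# old man-page reading (-s = log2 of segments per section). Segments are 2 MiB.
s=7
echo "corrected: -s$s gives $((2 * s)) MiB sections"
echo "old docs:  -s$s gives $((2 * (1 << s))) MiB sections"
# so 256 MiB sections actually need -s128:
echo "corrected: -s128 gives $((2 * 128)) MiB sections"
```

This prints 14 MiB for the corrected reading of -s7 and 256 MiB for the old one, matching the mismatch Marc observed.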
Can you please help to check whether, in your environment, the reason rsync doesn't return ENOSPC is the same as above?

I can already rule it out on API grounds: if file creation fails (e.g. with ENOSPC), then rsync couldn't have an fd to write data to; something else must be going on. The only way for this behaviour to happen is if file creation succeeds (and writing and closing, too - silent data loss).

If it is not, can you share more details about test steps, io info, and f2fs status info in debugfs (/sys/kernel/debug/f2fs/status)?

I mounted the partition with -onoatime and no other flags, and used cp -Rp to copy a large tree until the disk utilization was 100% for maybe 20 seconds according to /sys/kernel/debug/f2fs/status. A bit puzzled, I ^C'd cp, and tried rsync -avP --append, which took a bit to scan the directory information, then proceeded to write. I also don't think rsync --append goes via the temporary-file route, but in any case, I also used rsync -avP, which does. After writing a few dozen gigabytes (as measured by read data throughput), I stopped both.

I don't know what you mean by io info. Since fsck.f2fs completely destroyed the filesystem, I cannot provide any more f2fs debug info about it.

IMO, watching how quickly the stat values below increase in real time may be helpful to investigate the degradation issue. Can you share them with us?

I lost this filesystem to corruption as well. I will certainly retry this test though, and will record these values.

Anyways, thanks a lot for your input so far!

--
Marc Lehmann <schm...@schmorp.de>
http://www.deliantra.net
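As a sanity check that an error actually propagates through the write() path, Linux's /dev/full is handy: every write to it fails with ENOSPC, so any tool that swallows the error here is broken regardless of f2fs. This sketch only exercises the syscall plumbing, not the filesystem itself:

```shell
#!/bin/sh
# /dev/full acts like a permanently full target: write() always gets ENOSPC.
# A correct copy tool must notice and exit non-zero rather than "succeed".
if dd if=/dev/zero of=/dev/full bs=4k count=1 2>/dev/null; then
    echo "BUG: write reported success on a full target"
else
    echo "write failed with ENOSPC, as it should"
fi
```

Both cp and rsync report the error correctly against /dev/full, which supports the reading that the silent loss Marc saw originated inside f2fs, not in the tools.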
Re: [f2fs-dev] f2fs for SMR drives
Hi Marc,

-----Original Message-----
From: Marc Lehmann [mailto:schm...@schmorp.de]
Sent: Saturday, August 08, 2015 9:51 PM
To: linux-f2fs-devel@lists.sourceforge.net
Subject: [f2fs-dev] f2fs for SMR drives

Hi!

Sorry if this is the wrong address to ask about user problems. I am currently investigating various filesystems for use on drive-managed SMR drives (e.g. the Seagate 8TB disks). These drives have characteristics not unlike flash (they want to be written in large batches), but are, of course, still quite different.

I initially tried btrfs, ext4, and xfs, which, unsurprisingly, failed rather miserably after a few hundred GB, dropping to ~30MB/s (or 20 in the case of btrfs). I also tried nilfs, which should be an almost perfect match for this technology, but it performed even worse (I have no clue why; maybe nilfs skips sectors when writing, which would explain it).

As a last resort, I tried f2fs, which initially performed absolutely great (average write speed ~130MB/s over multiple terabytes). However, I am running into a number of problems, and wonder if f2fs can somehow be configured to work right.

First of all, I did most of my tests on linux-3.18.14, and recently switched to 4.1.4. The filesystems were formatted with -s7, the idea

'-s7' means that we configure seg_per_sec to 7, so our section size will be 7 * 2M (segment size) = 14M. So no matter how we configure '-z' (sections per zone), our allocation unit will not align to 256MB, and both the allocation and release units in f2fs may cross zone boundaries on the SMR drive, which may cause low performance - is that right?

being that writes always occur in 256MB blocks as much as possible, and most importantly, are freed in 256MB blocks, to keep fragmentation low. Mount options included noatime or noatime,inline_xattr,inline_data,inline_dentry,flush_merge,extent_cache (I suspect 4.1.4 doesn't implement flush_merge yet?).
My first problem concerns ENOSPC - I was happily able to write to a 100% utilized filesystem, with cp and rsync continuing to write, not receiving any error, but no write activity occurring (and the files never ending up on the filesystem). Is this a known bug?

I have no SMR device, so I have to use a hard disk for testing; I can't reproduce this issue with cp on such a device. But for rsync, one thing I note is that: I used rsync to copy a 32g local file to an f2fs partition with 100% utilized space and no available blocks for further allocation. It took a very long time for 'the copy'; finally it reported that there is no space. I did the same test with an ext4 filesystem; it took a very short time to report ENOSPC.

As I investigated, the main flow of the copy used by rsync is:
1. open src file
2. create tmp file in dst partition
3. copy data from src file to tmp file
4. rename tmp file to dst

a) In ext4, we reserve space separately for data blocks and inodes, so when the data block resource is exhausted (which makes df show utilization as 100%), we can't write new data to the partition, but we can still create files, since creation only consumes inode space in ext4, not block space. So this makes rsync fail in step 3 and return an error immediately.

b) In f2fs, we use inode/data block space mixedly, so when the data block count is zero, we can't create any file in f2fs. This makes rsync fail in step 2, and leads it to run into the discard_receive_data function, which still receives the whole src file. This makes the rsync process keep writing while generating no IO in the f2fs filesystem.

Finally, I made one block of space available in f2fs by removing one file; this let f2fs pass step 2 and return an error immediately in step 3, like ext4.

Can you please help to check whether, in your environment, the reason rsync doesn't return ENOSPC is the same as above? If it is not, can you share more details about test steps, io info, and f2fs status info in debugfs (/sys/kernel/debug/f2fs/status)?
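When re-running such a test, the counters Chao asks about can be scraped from the debugfs status file with something like the following. This is only a sketch: the path is the one from this thread, the label patterns are guesses that may need adjusting per kernel version, and the function takes the file as a parameter so it can be exercised on a saved copy.

```shell
#!/bin/sh
# Pull the CP/GC/IPU/SSR/LFS counters out of the f2fs debugfs status file.
sample_f2fs_stats() { # usage: sample_f2fs_stats <status-file>
    grep -E 'CP calls|GC calls|IPU|SSR|LFS' "$1"
}

STATUS=${STATUS:-/sys/kernel/debug/f2fs/status}
if [ -r "$STATUS" ]; then
    sample_f2fs_stats "$STATUS"
fi
```

Running this from a watch/sleep loop while the workload runs gives the "rate of increase" view of the stats that is useful for spotting IPU vs. SSR vs. LFS behaviour.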
My second, much bigger problem concerns defragmentation. For testing, I created a 128GB partition and kept writing an assortment of 200kb to multiple-megabyte files to it. To stress test it, I kept deleting random files to create holes. After a while (around 84% utilisation), write performance went down to less than 1MB/s, and has been at this level ever since for this filesystem.

IMO, watching how quickly the stat values below increase in real time may be helpful to investigate the degradation issue. Can you share them with us?

CP calls:
GC calls: (BG:)
  - data segments :
  - node segments :
Try to move blocks (BG:)
  - data blocks :
  - node blocks :
IPU: blocks
SSR: blocks in segments
LFS: blocks in segments

Thanks,

I kept the filesystem idle for a night hoping for defragmentation, but nothing happened. Suspecting in-place updates to be the culprit, I tried various configurations in the hope of disabling them (such as setting ipu_policy to 4 or 8, and/or setting min_ipu_util to 0 or 100), but that also doesn't seem to