Re: [PATCH] btrfs-progs: Documentation: Add filter section for btrfs-balance.
Qu Wenruo posted on Tue, 03 Jun 2014 14:20:08 +0800 as excerpted:

The man page for 'btrfs-balance' mentions filters but does not explain them, which makes it hard for end users to use the '-d', '-m' or '-s' options. This patch uses the explanations from https://btrfs.wiki.kernel.org/index.php/Balance_Filters to enrich the man page.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com

Thanks. The wiki-only nature of the balance-filters documentation has been a bit of a complication. This should fix that. =:^)

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
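For readers landing here before the updated man page ships, the filter syntax the patch documents is used like this (examples adapted from the Balance_Filters wiki page referenced above; /mnt is a placeholder mount point, and the commands need root and a mounted btrfs filesystem):

```shell
# Rebalance only data chunks that are at most 50% full,
# returning mostly-empty chunks to the allocator:
btrfs balance start -dusage=50 /mnt

# Restrict a metadata balance to chunks that have a stripe
# on device 2 of the filesystem:
btrfs balance start -mdevid=2 /mnt

# Convert data and metadata to the raid1 profile; "soft" skips
# chunks that already use the target profile:
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
```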
Re: [PATCH 1/1] btrfs-progs: fix compiler warning
Original Message
Subject: [PATCH 1/1] btrfs-progs: fix compiler warning
From: Christian Hesse m...@eworm.de
To: linux-btrfs@vger.kernel.org
Date: 2014-06-03 19:29

gcc 4.9.0 gives a warning: array subscript is above array bounds. Checking for greater or equal instead of just equal fixes this.

Signed-off-by: Christian Hesse m...@eworm.de
---
 cmds-restore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 96b97e1..534a49e 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -169,7 +169,7 @@ again:
 			break;
 	}

-	if (level == BTRFS_MAX_LEVEL)
+	if (level >= BTRFS_MAX_LEVEL)
		return 1;
	slot = path->slots[level] + 1;

Also I failed to reproduce the bug. Using gcc-4.9.0-3 from the Arch Linux core repo. It seems to be related to the default gcc flags from the distribution?

Thanks,
Qu
Re: [PATCH 1/1] btrfs-progs: fix compiler warning
Qu Wenruo quwen...@cn.fujitsu.com on Wed, 2014/06/04 14:48:

Original Message
Subject: [PATCH 1/1] btrfs-progs: fix compiler warning
From: Christian Hesse m...@eworm.de
To: linux-btrfs@vger.kernel.org
Date: 2014-06-03 19:29

gcc 4.9.0 gives a warning: array subscript is above array bounds. Checking for greater or equal instead of just equal fixes this.

Signed-off-by: Christian Hesse m...@eworm.de
---
 cmds-restore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 96b97e1..534a49e 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -169,7 +169,7 @@ again:
 			break;
 	}

-	if (level == BTRFS_MAX_LEVEL)
+	if (level >= BTRFS_MAX_LEVEL)
		return 1;
	slot = path->slots[level] + 1;

Also I failed to reproduce the bug. Using gcc-4.9.0-3 from the Arch Linux core repo.

Exactly the same here. ;)

It seems to be related to the default gcc flags from the distribution?

Probably. I did compile with optimization, so adding -O2 may do the trick:

make CFLAGS="${CFLAGS} -O2" all

--
Schoene Gruesse
Chris
O ascii ribbon campaign stop html mail - www.asciiribbon.org
Re: [PATCH v4] Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
On Sun, Jun 01, 2014 at 01:50:28AM +0100, Filipe David Borba Manana wrote:

If the NO_HOLES feature is enabled, holes don't have file extent items in the btree that represent them anymore. This made the clone operation ignore the gaps that exist between consecutive file extent items and therefore not create the holes at the destination. When not using the NO_HOLES feature, the holes were created at the destination. A test case for xfstests follows.

Reviewed-by: Liu Bo bo.li@oracle.com

-liubo

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
V2: Deal with holes at the boundaries of the cloning range that either overlap the boundary completely or partially. Test case for xfstests updated too to test these 2 cases.
V3: Deal with the case where the cloning range overlaps (partially or completely) a hole at the end of the source file, and might increase the size of the target file. Updated the test for xfstests to cover these cases too.
V4: Moved some duplicated code into a helper function.

 fs/btrfs/ioctl.c | 108 ++-
 1 file changed, 83 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 04ece8f..95194a9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2983,6 +2983,37 @@ out:
 	return ret;
 }

+static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
+				     struct inode *inode,
+				     u64 endoff,
+				     const u64 destoff,
+				     const u64 olen)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int ret;
+
+	inode_inc_iversion(inode);
+	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	/*
+	 * We round up to the block size at eof when determining which
+	 * extents to clone above, but shouldn't round up the file size.
+	 */
+	if (endoff > destoff + olen)
+		endoff = destoff + olen;
+	if (endoff > inode->i_size)
+		btrfs_i_size_write(inode, endoff);
+
+	ret = btrfs_update_inode(trans, root, inode);
+	if (ret) {
+		btrfs_abort_transaction(trans, root, ret);
+		btrfs_end_transaction(trans, root);
+		goto out;
+	}
+	ret = btrfs_end_transaction(trans, root);
+out:
+	return ret;
+}
+
 /**
  * btrfs_clone() - clone a range from inode file to another
  *
@@ -2995,7 +3026,8 @@ out:
  * @destoff: Offset within @inode to start clone
  */
 static int btrfs_clone(struct inode *src, struct inode *inode,
-		       u64 off, u64 olen, u64 olen_aligned, u64 destoff)
+		       const u64 off, const u64 olen, const u64 olen_aligned,
+		       const u64 destoff)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_path *path = NULL;
@@ -3007,8 +3039,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 	int slot;
 	int ret;
 	int no_quota;
-	u64 len = olen_aligned;
+	const u64 len = olen_aligned;
 	u64 last_disko = 0;
+	u64 last_dest_end = destoff;

 	ret = -ENOMEM;
 	buf = vmalloc(btrfs_level_size(root, 0));
@@ -3076,7 +3109,7 @@ process_slot:
 			u64 disko = 0, diskl = 0;
 			u64 datao = 0, datal = 0;
 			u8 comp;
-			u64 endoff;
+			u64 drop_start;

 			extent = btrfs_item_ptr(leaf, slot,
 					struct btrfs_file_extent_item);
@@ -3125,6 +3158,18 @@ process_slot:
 				new_key.offset = destoff;

 			/*
+			 * Deal with a hole that doesn't have an extent item
+			 * that represents it (NO_HOLES feature enabled).
+			 * This hole is either in the middle of the cloning
+			 * range or at the beginning (fully overlaps it or
+			 * partially overlaps it).
+			 */
+			if (new_key.offset != last_dest_end)
+				drop_start = last_dest_end;
+			else
+				drop_start = new_key.offset;
+
+			/*
 			 * 1 - adjusting old extent (we may have to split it)
 			 * 1 - add new extent
 			 * 1 - inode update
@@ -3153,7 +3198,7 @@ process_slot:
 			}

 			ret = btrfs_drop_extents(trans, root, inode,
-						 new_key.offset,
+						 drop_start,
 						 new_key.offset + datal,
Race condition between btrfs and udev
Originally this problem was reproduced by the following script:

# dd if=/dev/zero of=data bs=1M count=50
# losetup /dev/loop1 data
# i=1
# while [ 1 ]
  do
      mkfs.btrfs -fK /dev/loop1 > /dev/null || exit 1
      i=$((i + 1))
      echo loop $i
  done

Further, an easy way to trigger this problem is by running the following C code repeatedly:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDWR | O_EXCL);

	if (fd < 0) {
		perror("fail to open");
		exit(1);
	}
	close(fd);
	return 0;
}

Here @argv[1] needs to be a btrfs block device.

So the problem is that an RW open triggers a udev event, which calls btrfs_scan_one_device(). In btrfs_scan_one_device(), the block device is opened with the EXCL flag... meanwhile, if another program tries to open that device with O_EXCL, it fails with EBUSY.

I don't know whether this is a serious problem. Right now there are two places in btrfs-progs that try to open a device with O_EXCL:

1. in utils.c: test_dev_for_mkfs()
2. in disk-io.c: __open_ctree_fd()

Any ideas on this? Maybe we can remove the O_EXCL flag from btrfs-progs?

Thanks,
Wang
Re: All free space eaten during defragmenting (3.14)
Peter Chant posted on Tue, 03 Jun 2014 23:21:55 +0100 as excerpted:

On 06/03/2014 05:46 AM, Duncan wrote: Of course if you were using something like find and executing defrag on each found entry, then yes it would recurse, as find would recurse across filesystems and keep going (unless you told it not to using find's -xdev option).

I did not know the recursive option existed. However, I'd previously cursed the tools not having a recursive option or being recursive by default. If there is now a recursive option it would be really perverse to use find to implement a recursive defrag.

Defrag's -r/recursive option is reasonably new, but checking the btrfs-progs git tree (since I run the git version) says that was commit c2c5353b, which git describe says was v0.19-725, so it should be in btrfs-progs v3.12. So it's not /that/ new. Anyone still running something earlier than that really should update. =:^)

But the wiki recommended using find from back before the builtin recursive option, and I can well imagine people with already working scripts not wanting to fix what isn't (for them) broken. =:^) So I imagine there will be find-and-defrag users for some time, tho they should even now be on their way to becoming a rather small percentage, at least for folks following the keep-current recommendations.

Meanwhile, this question is bugging me so let me just ask it. The OP was from a different email address (szotsaki@gmail), and once I noticed that I've been assuming that you and the OP are different people, tho in my first reply to you I assumed you were the OP. So just to clear things up, different people and I can't assume that what he wrote about his case applies to you, correct? =:^)

Meanwhile, you mention the autodefrag mount option. Assuming you have it on all the time, there shouldn't be that much to defrag, *EXCEPT* if the -c/compress option is used as well.
With -c, if you aren't also using the compress mount option by default, then you are effectively telling defrag to compress everything as it goes, so it will defrag-and-compress all files. Which wouldn't be a problem with snapshot-aware-defrag, as it'd compress for all snapshots at the same time too. But with snapshot-aware-defrag currently disabled, that would effectively force ALL files to be rewritten in order to compress them, thereby breaking the COW link with the other snapshots and duplicating ALL data.

I've got compress=lzo, options from fstab: device=/dev/sdb,device=/dev/sdc,autodefrag,defaults,inode_cache,noatime,compress=lzo

I'm running kernel 3.13.6. Not sure if snapshot-aware-defrag is enabled or disabled in this version.

A git search says (linus' mainline tree) commit 8101c8db, merge commit 878a876b, with git describe labeling the merge commit as v3.14-rc1-13, so it would be in v3.14-rc2. However, the commit in question was CCed to stable@, so it should have made it into a 3.13.x stable release as well. Whether it's in 3.13.6 specifically, I couldn't say without checking the stable tree or changelog, which should be easier for you to do since you're actually running it. (Hint, I simply searched on defrag, here; it ended up being the third hit back from 3.14.0, I believe, so it shouldn't be horribly buried, at least.)

Unfortunately I really don't understand how COW works here. I understand the basic idea but have no idea how it is implemented in btrfs or any other fs.

FWIW, I think only the kernel/filesystem or at least developer types /really/ understand COW, but I /think/ I have a reasonable sysadmin's-level understanding of the practical effects in terms of btrfs, simply from watching the list.

Meanwhile, not that it has any bearing on this thread, but about your mount options, FWIW you may wish to remove that inode_cache option.
I don't claim to have a full understanding, but from what I've picked up from various dev remarks, it's not necessary at all on 64-bit systems (well, unless you have really small files filling an exabyte size filesystem!) since the inode-space is large enough that finding free inode numbers isn't an issue, and while it can be of help in specific situations on 32-bit systems, there are two problems with it that make it not suitable for the general case: (1) on large filesystems (I'm not sure how large but I'd guess it's TiB scale) there's danger of inode-number-collision due to 32-bit-overflow, and (2) it must be regenerated at every mount, which at least on TiB-scale spinning rust can trigger several minutes of intense drive activity while it does so.

(The btrfs wiki now says it's not recommended, but has a somewhat different explanation. While I'm not a coder and thus in no position to say for sure based on the code, I believe the wiki's explanation isn't quite correct, but either way, it's still not recommended.)

The use-cases where inode_cache might be worthwhile are thus all 32-bit, and include things like busy email
Re: What to do about snapshot-aware defrag
Martin m_bt...@ml1.co.uk writes:

The *ONLY* application that I know of that uses atime is Mutt and then *only* for mbox files!...

However, users, such as myself :), can be interested in when a certain file has been last accessed. With snapshots I can even get an idea of all the times the file has been accessed.

*And go KISS and move on faster* better?

Well, it is unclear to me whether it truly is better if btrfs, after that point, no longer truly even supports atime, given that using it results in blowing up snapshot sizes. They might at that point even consider just using LVM2 snapshots (shudder) ;).

--
_ / __// /__ __ http://www.modeemi.fi/~flux/\ \ / /_ / // // /\ \/ /\ / /_/ /_/ \___/ /_/\_\@modeemi.fi \/
Partition tables / Output of parted
Hello

I have created multiple filesystems with btrfs, in all cases directly on the devices themselves without creating partitions beforehand. Now, if I open the disks containing the multi-device filesystem in parted, it outputs the partition table as loop and shows one partition with btrfs which covers the whole disk. Opening the disk with the single-device filesystem in parted shows a partition table of type unknown and no partition on the disk itself.

I am unsure how to interpret this output. Two possible explanations come to mind: a) Btrfs does create partitions, but only if a filesystem spans multiple devices, or b) the output of parted is faulty and no actual partition is created in both cases.

Could someone elaborate on this question or just point out where I can find documentation on this issue?

Yours sincerely
Stefan
[PATCH 10/12] trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
Signed-off-by: Antonio Ospite a...@ao2.it
Cc: Chris Mason c...@fb.com
Cc: Josef Bacik jba...@fb.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2f6d7b1..b0a206f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3051,11 +3051,11 @@ process_slot:
 			 *	| ------- extent ------- |
 			 */

-			/* substract range b */
+			/* subtract range b */
 			if (key.offset + datal > off + len)
 				datal = off + len - key.offset;

-			/* substract range a */
+			/* subtract range a */
 			if (off > key.offset) {
 				datao += off - key.offset;
 				datal -= off - key.offset;
--
2.0.0
Re: What to do about snapshot-aware defrag
On 04/06/14 10:19, Erkki Seppala wrote:

Martin m_bt...@ml1.co.uk writes: The *ONLY* application that I know of that uses atime is Mutt and then *only* for mbox files!...

However, users, such as myself :), can be interested in when a certain file has been last accessed. With snapshots I can even get an idea of all the times the file has been accessed.

*And go KISS and move on faster* better?

Well, it is unclear to me whether it truly is better if btrfs, after that point, no longer truly even supports atime, given that using it results in blowing up snapshot sizes. They might at that point even consider just using LVM2 snapshots (shudder) ;).

Not quite... My emphasis is:

1: Go KISS for the defrag and accept that any atime use will render the defrag ineffective. Give a note that the noatime mount option should be used.

2: Consider using noatime as a /default/, seeing as there are no known 'must-use' use cases. Those users still wanting atime can add that as a mount option, with the note that atime use reduces the snapshot defrag effectiveness.

(The for/against atime is a good subject for another thread!)

Go fast KISS!

Regards,
Martin
Re: Partition tables / Output of parted
On Wed, 4 Jun 2014 13:19:16 Stefan Malte Schumacher wrote:

I have created multiple filesystems with btrfs, in all cases directly on the devices themselves without creating partitions beforehand.

I do that sometimes, it works well. I've done the same thing with Ext2/3 in the past as well, it's no big deal.

Now, if I open the disks containing the multi-device filesystem in parted it outputs the partition table as loop and shows one partition with btrfs which covers the whole disk.

http://lists.alioth.debian.org/pipermail/parted-devel/2009-May/002840.html

A Google search on "Partition Table: loop" turned up the above explanation as the third result.

I am unsure how to interpret this output. Two possible explanations come to mind: a) Btrfs does create partitions, but only if a filesystem spans multiple devices, or b) the output of parted is faulty and no actual partition is created in both cases.

BTRFS doesn't create partitions.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Re: [PATCH 0/8] Add support for LZ4 compression
On 06/03/2014 11:53 AM, David Sterba wrote:

On Sat, May 31, 2014 at 11:48:28PM +, Philip Worrall wrote:

LZ4 is a lossless data compression algorithm that is focused on compression and decompression speed. LZ4 gives a slightly worse compression ratio compared with LZO (and much worse than Zlib) but compression speeds are *generally* similar to LZO. Decompression tends to be much faster under LZ4 compared with LZO, hence it makes more sense to use LZ4 compression when your workload involves a higher proportion of reads. The following patch set adds LZ4 compression support to BTRFS using the existing kernel implementation. It is based on the changeset for LZO support in 2011. Once a filesystem has been mounted with LZ4 compression enabled, older versions of BTRFS will be unable to read it. This implementation is however backwards compatible with filesystems that currently use LZO or Zlib compression. Existing data will remain unchanged but any new files that you create will be compressed with LZ4.

tl;dr: simply copying what btrfs+LZO does will not buy us anything in terms of speedup or space savings.

I have a slightly different reason for holding off on these. Disk format changes are forever, and we need a really strong use case for pulling them in. With that said, thanks for spending all of the time on this. Pulling in Dave's idea to stream larger compression blocks through lzo (or any new alg) might be enough to push performance much higher, and better showcase the differences between new algorithms. The whole reason I chose zlib originally was because its streaming interface was a better fit for how FS IO worked.

-chris
Re: Partition tables / Output of parted
On 4 June 2014 14:30, Russell Coker russ...@coker.com.au wrote:

On Wed, 4 Jun 2014 13:19:16 Stefan Malte Schumacher wrote: I have created multiple filesystems with btrfs, in all cases directly on the devices themselves without creating partitions beforehand.

I do that sometimes, it works well. I've done the same thing with Ext2/3 in the past as well, it's no big deal.

Now, if I open the disks containing the multi-device filesystem in parted it outputs the partition table as loop and shows one partition with btrfs which covers the whole disk.

http://lists.alioth.debian.org/pipermail/parted-devel/2009-May/002840.html

A Google search on "Partition Table: loop" turned up the above explanation as the third result.

I am unsure how to interpret this output. Two possible explanations come to mind: a) Btrfs does create partitions, but only if a filesystem spans multiple devices, or b) the output of parted is faulty and no actual partition is created in both cases.

BTRFS doesn't create partitions.

c) Parted (libparted) is merely displaying a pretend loop partition table as a way to represent the situation of a file system covering the whole disk, in its view of the world where all disks have a partition table.

Mike
Re: [RFC 00/32] making inode time stamps y2038 ready
On Tuesday 03 June 2014, Dave Chinner wrote:

On Tue, Jun 03, 2014 at 04:22:19PM +0200, Arnd Bergmann wrote:

On Monday 02 June 2014 14:57:26 H. Peter Anvin wrote:

On 06/02/2014 12:55 PM, Arnd Bergmann wrote:

The possible uses I can see for non-ktime_t types in the kernel are: * inodes need 96 bit timestamps to represent the full range of values that can be stored in a file system, you made a convincing argument for that. Almost everything else can fit into 64 bit on a 32-bit kernel, in theory also on a 64-bit kernel if we want that.

Just to be pedantic, inodes don't need 96 bit timestamps - some filesystems can *support up to* 96 bit timestamps. If the kernel only supports 64 bit timestamps and that's all the kernel can represent, then the upper bits of the 96 bit on-disk inode timestamps simply remain zero.

I meant the reverse: since we have file systems that can store 96-bit timestamps when using 64-bit kernels, we need to extend 32-bit kernels to have the same internal representation so we can actually read those file systems correctly.

If you move the filesystem between kernels with different time ranges, then the filesystem needs to be able to tell the kernel what its supported range is. This is where having the VFS limit the range of supported timestamps is important: the limit is the min(kernel range, filesystem range). This allows the filesystems to be independent of the kernel time representation, and the kernel to be independent of the physical filesystem time encoding.

I agree it makes sense to let the kernel know about the limits of the file system it accesses, but for the reverse, we're probably better off just making the kernel representation large enough (i.e. 96 bits) so it can work with any known file system. We need another check at the user space boundary to turn that into a value that the user can understand, but that's another problem.
Arnd
Re: [RFC 00/32] making inode time stamps y2038 ready
On Monday 02 June 2014, Joseph S. Myers wrote:

On Mon, 2 Jun 2014, Arnd Bergmann wrote:

Ok. Sorry about missing linux-api, I confused it with linux-arch, which may not be as relevant here, except for the one question whether we actually want to have the new ABI on all 32-bit architectures or only as an opt-in for those that expect to stay around for another 24 years.

For glibc I think it will make the most sense to add the support for 64-bit time_t across all architectures that currently have 32-bit time_t (with the new interfaces having fallback support to implementation in terms of the 32-bit kernel interfaces, if the 64-bit syscalls are unavailable either at runtime or in the kernel headers against which glibc is compiled - this fallback code will of course need to check for overflow when passing a time value to the kernel, hopefully with error handling consistent with whatever the kernel ends up doing when a filesystem can't support a timestamp). If some architectures don't provide the new interfaces in the kernel then that will mean the fallback code in glibc can't be removed until glibc support for those architectures is removed (as opposed to removing it when glibc no longer supports kernels predating the kernel support).

Ok, that's a good reason to just provide the new interfaces on all architectures right away. Thanks for the insight!

Arnd
Re: [PATCH 1/1] btrfs-progs: fix compiler warning
On Wed, Jun 04, 2014 at 09:19:26AM +0200, Christian Hesse wrote:

It seems to be related to the default gcc flags from the distribution?

Probably. I did compile with optimization, so adding -O2 may do the trick:

make CFLAGS="${CFLAGS} -O2" all

The warning appears with -O2, so the question is whether gcc is not able to reason about the values (i.e. a false positive) or if there's a bug that I don't see.
Re: [RFC 00/32] making inode time stamps y2038 ready
On Wed, 4 Jun 2014, Arnd Bergmann wrote:

On Tuesday 03 June 2014, Dave Chinner wrote:

Just to be pedantic, inodes don't need 96 bit timestamps - some filesystems can *support up to* 96 bit timestamps. If the kernel only supports 64 bit timestamps and that's all the kernel can represent, then the upper bits of the 96 bit on-disk inode timestamps simply remain zero.

I meant the reverse: since we have file systems that can store 96-bit timestamps when using 64-bit kernels, we need to extend 32-bit kernels to have the same internal representation so we can actually read those file systems correctly.

If you move the filesystem between kernels with different time ranges, then the filesystem needs to be able to tell the kernel what its supported range is. This is where having the VFS limit the range of supported timestamps is important: the limit is the min(kernel range, filesystem range). This allows the filesystems to be independent of the kernel time representation, and the kernel to be independent of the physical filesystem time encoding.

I agree it makes sense to let the kernel know about the limits of the file system it accesses, but for the reverse, we're probably better off just making the kernel representation large enough (i.e. 96 bits) so it can work with any known file system.

Depends... 96 bit handling may get prohibitive on 32-bit archs. The important point here is for the kernel to be able to represent the time _range_ used by any known filesystem, not necessarily the time _precision_. For example, a 64 bit representation can be made of 40 bits for seconds spanning 34865 years, and 24 bits for fractional seconds providing precision down to 60 nanosecs. That ought to be plenty good on 32 bit systems while still being cheap to handle.

Nicolas
Re: [RFC 00/32] making inode time stamps y2038 ready
On Wednesday 04 June 2014 13:30:32 Nicolas Pitre wrote:

On Wed, 4 Jun 2014, Arnd Bergmann wrote:

On Tuesday 03 June 2014, Dave Chinner wrote:

Just to be pedantic, inodes don't need 96 bit timestamps - some filesystems can *support up to* 96 bit timestamps. If the kernel only supports 64 bit timestamps and that's all the kernel can represent, then the upper bits of the 96 bit on-disk inode timestamps simply remain zero.

I meant the reverse: since we have file systems that can store 96-bit timestamps when using 64-bit kernels, we need to extend 32-bit kernels to have the same internal representation so we can actually read those file systems correctly.

If you move the filesystem between kernels with different time ranges, then the filesystem needs to be able to tell the kernel what its supported range is. This is where having the VFS limit the range of supported timestamps is important: the limit is the min(kernel range, filesystem range). This allows the filesystems to be independent of the kernel time representation, and the kernel to be independent of the physical filesystem time encoding.

I agree it makes sense to let the kernel know about the limits of the file system it accesses, but for the reverse, we're probably better off just making the kernel representation large enough (i.e. 96 bits) so it can work with any known file system.

Depends... 96 bit handling may get prohibitive on 32-bit archs. The important point here is for the kernel to be able to represent the time _range_ used by any known filesystem, not necessarily the time _precision_. For example, a 64 bit representation can be made of 40 bits for seconds spanning 34865 years, and 24 bits for fractional seconds providing precision down to 60 nanosecs. That ought to be plenty good on 32 bit systems while still being cheap to handle.

I have checked earlier that we don't do any computation on inode time stamps in common code, we just pass them around, so there is very little runtime overhead.
There is a small bit of space overhead (12 bytes) per inode, but that structure is already on the order of 500 bytes.

For other timekeeping stuff in the kernel, I agree that using some 64-bit representation (nanoseconds, 32/32 unsigned seconds/nanoseconds, ...) has advantages; that's exactly the point I was making earlier against simply extending the internal time_t/timespec to 64-bit seconds for everything.

Arnd
Re: What to do about snapshot-aware defrag
On Jun 4, 2014, at 7:15 AM, Martin m_bt...@ml1.co.uk wrote:

Consider using noatime as a /default/ being as there are no known 'must-use' use cases.

The quote I'm finding on the interwebs is that POSIX "requires that operating systems maintain file system metadata that records when each file was last accessed." I'm not sure if upstream kernel projects aim for LSB (and thus POSIX) compliance by default and let distros opt out, or the opposite.

Those users still wanting atime can add that as a mount option with the note that atime use reduces the snapshot defrag effectiveness.

I can imagine some optimizations for Btrfs that are easier than on other file systems, like a way to point metadata chunks to specific devices, for example metadata to persistent memory while the data goes to conventional hard drives.

Chris Murphy
[PATCH] btrfs-progs: canonicalize pathnames for device commands
mount(8) will canonicalize pathnames before passing them to the kernel. Links to e.g. /dev/sda will be resolved to /dev/sda. Links to /dev/dm-# will be resolved using the name of the device mapper table to /dev/mapper/<name>. Btrfs will use whatever name the user passes to it, regardless of whether it is canonical or not. That means that if a 'btrfs device ready' is issued on any device node pointing to the original device, it will adopt the new name instead of the name that was used during mount. Mounting using /dev/sdb2 will result in: df: /dev/sdb2 209715200 39328 207577088 1% /mnt # ls -la /dev/whatever-i-like lrwxrwxrwx 1 root root 4 Jun 4 13:36 /dev/whatever-i-like -> sdb2 # btrfs dev ready /dev/whatever-i-like # df /mnt /dev/whatever-i-like 209715200 39328 207577088 1% /mnt Likewise, mounting with /dev/mapper/whatever and using /dev/dm-0 with a btrfs device command results in df showing /dev/dm-0. This can happen with multipath devices with friendly names enabled and doing something like 'partprobe' which (at least with our version) ends up issuing a 'change' uevent on the sysfs node. That *always* uses the dm-# name, and we get confused users. This patch does the same canonicalization of the paths that mount does so that we don't end up having inconsistent names reported by -show_devices later. 
Signed-off-by: Jeff Mahoney je...@suse.com --- cmds-device.c | 60 - cmds-replace.c | 13 ++-- utils.c | 57 ++ utils.h | 2 + 4 files changed, 117 insertions(+), 15 deletions(-) --- a/cmds-device.c +++ b/cmds-device.c @@ -95,6 +95,7 @@ static int cmd_add_dev(int argc, char ** int devfd, res; u64 dev_block_count = 0; int mixed = 0; + char *path; res = test_dev_for_mkfs(argv[i], force, estr); if (res) { @@ -118,15 +119,24 @@ static int cmd_add_dev(int argc, char ** goto error_out; } - strncpy_null(ioctl_args.name, argv[i]); + path = canonicalize_path(argv[i]); + if (!path) { + fprintf(stderr, + "ERROR: Could not canonicalize pathname '%s': %s\n", + argv[i], strerror(errno)); + ret++; + goto error_out; + } + + strncpy_null(ioctl_args.name, path); res = ioctl(fdmnt, BTRFS_IOC_ADD_DEV, &ioctl_args); e = errno; - if(res<0){ + if (res < 0) { fprintf(stderr, "ERROR: error adding the device '%s' - %s\n", - argv[i], strerror(e)); + path, strerror(e)); ret++; } - + free(path); } error_out: @@ -242,6 +252,7 @@ static int cmd_scan_dev(int argc, char * for( i = devstart ; i < argc ; i++ ){ struct btrfs_ioctl_vol_args args; + char *path; if (!is_block_device(argv[i])) { fprintf(stderr, @@ -249,9 +260,17 @@ static int cmd_scan_dev(int argc, char * ret = 1; goto close_out; } - printf("Scanning for Btrfs filesystems in '%s'\n", argv[i]); + path = canonicalize_path(argv[i]); + if (!path) { + fprintf(stderr, + "ERROR: Could not canonicalize path '%s': %s\n", + argv[i], strerror(errno)); + ret = 1; + goto close_out; + } + printf("Scanning for Btrfs filesystems in '%s'\n", path); - strncpy_null(args.name, argv[i]); + strncpy_null(args.name, path); /* * FIXME: which are the error code returned by this ioctl ? 
* it seems that is impossible to understand if there no is @@ -262,9 +281,11 @@ static int cmd_scan_dev(int argc, char * if( ret < 0 ){ fprintf(stderr, "ERROR: unable to scan the device '%s' - %s\n", - argv[i], strerror(e)); + path, strerror(e)); + free(path); goto close_out; } + free(path); } close_out: @@ -284,6 +305,7 @@ static int cmd_ready_dev(int argc, char struct btrfs_ioctl_vol_args args; int fd; int ret; + char *path; if (check_argc_min(argc, 2)) usage(cmd_ready_dev_usage); @@ -293,22 +315,34 @@ static int cmd_ready_dev(int argc, char perror("failed to open
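The core of the fix — resolving whatever symlink the user passed before handing the name to the kernel — can be sketched in a few lines. This is a simplification: the real canonicalize_path() in the patch additionally maps /dev/dm-# nodes to their /dev/mapper name via the device-mapper table, which is not reproduced here.

```python
import os, tempfile

def canonicalize(path):
    # resolve all symlinks, like mount(8) does before the ioctl
    return os.path.realpath(path)

# Demo with a throwaway symlink standing in for /dev/whatever-i-like:
d = tempfile.mkdtemp()
target = os.path.join(d, "sdb2")
open(target, "w").close()
link = os.path.join(d, "whatever-i-like")
os.symlink(target, link)
print(canonicalize(link) == os.path.realpath(target))  # True
```

Passing the resolved name instead of the raw argument is what keeps df from later showing /dev/whatever-i-like or /dev/dm-0.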
Very slow filesystem
Hello, Why does btrfs become EXTREMELY slow after some time (months) of usage? This has now happened a second time; the first time I thought it was a hard drive fault, but now the drive seems ok. The filesystem is mounted with compress-force=lzo and is used for MySQL databases; files are mostly big, 2G-8G. Copying from this file system is unbelievably slow. It goes from 500 KB/s to maybe 5MB/s, maybe faster for some files. hdparm -t or dd show 130MB/s+. There are no errors on the drive. No errors in logs. Can I somehow get the position of a file on disk, so I can try a raw read with dd or something to make sure it's not a drive fault? As I said I tried dd and speeds are normal, but maybe there is a problem with only some sectors. Below are btrfs version and info: # uname -a Linux voyager 3.14.2 #1 SMP Tue May 6 09:25:40 CEST 2014 x86_64 Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz GenuineIntel GNU/Linux (currently; when the filesystem was created it was some 3.x, I don't remember). # btrfs --version Btrfs v0.20-rc1-358-g194aa4a (now I'm upgraded to Btrfs v3.14.2) # btrfs fi show Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba Total devices 1 FS bytes used 2.36TiB devid1 size 2.73TiB used 2.38TiB path /dev/sde Label: none uuid: 09898e7a-b0b4-4a26-a956-a833514c17f6 Total devices 1 FS bytes used 1.05GiB devid1 size 3.64TiB used 5.04GiB path /dev/sdb Btrfs v3.14.2 # btrfs fi df /mnt/old Data, single: total=2.36TiB, used=2.35TiB System, DUP: total=8.00MiB, used=264.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=8.50GiB, used=7.13GiB Metadata, single: total=8.00MiB, used=0.00
Re: Very slow filesystem
On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote: Hello, Why does btrfs become EXTREMELY slow after some time (months) of usage? # btrfs fi show Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba Total devices 1 FS bytes used 2.36TiB devid1 size 2.73TiB used 2.38TiB path /dev/sde # btrfs fi df /mnt/old Data, single: total=2.36TiB, used=2.35TiB Is that the fs that is slow? It's almost full. Most filesystems would exhibit really bad performance when close to full due to fragmentation issues (thresholds vary, but 80-90% full usually means you need to start adding space). You should free up some space (e.g. add a new disk so it becomes multi-device, or delete some files) and rebalance/defrag. -- Fajar
Re: Very slow filesystem
On Thu, 5 Jun 2014 05:27:33 +0700 Fajar A. Nugraha l...@fajar.net wrote: On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote: Hello, Why does btrfs become EXTREMELY slow after some time (months) of usage? # btrfs fi show Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba Total devices 1 FS bytes used 2.36TiB devid1 size 2.73TiB used 2.38TiB path /dev/sde # btrfs fi df /mnt/old Data, single: total=2.36TiB, used=2.35TiB Is that the fs that is slow? It's almost full. Really, is it? The device size is 2.73 TiB, while only 2.35 TiB is used. About 400 GiB should be free. That's not almost full. The btrfs fi df readings may be a little confusing, but usually it's those who ask questions on this list who are confused by them, not those who (try to) answer. :) -- With respect, Roman
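The arithmetic behind Roman's objection, using the figures from the 'btrfs fi show' and 'btrfs fi df' output quoted above:

```python
# Figures as reported for devid 1 (all in TiB):
device_size = 2.73   # 'btrfs fi show' size
allocated   = 2.38   # 'btrfs fi show' used = space allocated to chunks
data_total  = 2.36   # space allocated to Data chunks
data_used   = 2.35   # actually occupied by data

unallocated  = device_size - allocated   # ~0.35 TiB never chunk-allocated
slack        = data_total - data_used    # ~0.01 TiB free inside Data chunks
overall_free = device_size - data_used   # ~0.38 TiB, Roman's "about 400 GiB"
print(round(overall_free * 1024))        # ~389 GiB
```

So "used 2.38TiB" in fi show counts chunk allocation, not occupied data, which is why the two readings disagree about fullness.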
Re: Very slow filesystem
On Thu, Jun 5, 2014 at 12:27 AM, Fajar A. Nugraha l...@fajar.net wrote: On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote: Hello, Why does btrfs become EXTREMELY slow after some time (months) of usage? # btrfs fi show Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba Total devices 1 FS bytes used 2.36TiB devid1 size 2.73TiB used 2.38TiB path /dev/sde # btrfs fi df /mnt/old Data, single: total=2.36TiB, used=2.35TiB Is that the fs that is slow? It's almost full. Most filesystems would exhibit really bad performance when close to full due to fragmentation issues (thresholds vary, but 80-90% full usually means you need to start adding space). You should free up some space (e.g. add a new disk so it becomes multi-device, or delete some files) and rebalance/defrag. -- Fajar Yes, this one is slow. I know it's getting full; I'm just copying to a new disk (it will take days or even weeks!). It shouldn't be so fragmented, data is mostly just added. But still, can reading become so slow just because of fullness and fragmentation? It just seems strange to me. I could understand 60MB/s instead of 130, but it's so much slower. I'll delete some files and see if it will be faster, but it will take hours to copy them to the new disk.
Re: Very slow filesystem
I may be mistaken, but I think: btrfstune -x dev # can improve performance because it decreases metadata size. Also, in recent versions of btrfs-progs the default node size changed from 4k to 16k, which can also help (but for this you must reformat the fs). To clean up 'btrfs fi df /', you can try: btrfs bal start -f -sconvert=dup,soft -mconvert=dup,soft path Data, single: total=52.01GiB, used=49.29GiB System, DUP: total=8.00MiB, used=16.00KiB Metadata, DUP: total=1.50GiB, used=483.77MiB Also, disable compression or use it without the force option; if I understand properly, it also causes additional fragmentation (filefrag is helpful here). Also, to defragment data (if you need to defrag some files), you can do it with a plain copy-paste, which creates a non-fragmented copy. 2014-06-05 1:45 GMT+03:00 Igor M igor...@gmail.com: On Thu, Jun 5, 2014 at 12:27 AM, Fajar A. Nugraha l...@fajar.net wrote: On Thu, Jun 5, 2014 at 5:15 AM, Igor M igor...@gmail.com wrote: Hello, Why does btrfs become EXTREMELY slow after some time (months) of usage? # btrfs fi show Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba Total devices 1 FS bytes used 2.36TiB devid1 size 2.73TiB used 2.38TiB path /dev/sde # btrfs fi df /mnt/old Data, single: total=2.36TiB, used=2.35TiB Is that the fs that is slow? It's almost full. Most filesystems would exhibit really bad performance when close to full due to fragmentation issues (thresholds vary, but 80-90% full usually means you need to start adding space). You should free up some space (e.g. add a new disk so it becomes multi-device, or delete some files) and rebalance/defrag. -- Fajar Yes, this one is slow. I know it's getting full; I'm just copying to a new disk (it will take days or even weeks!). It shouldn't be so fragmented, data is mostly just added. But still, can reading become so slow just because of fullness and fragmentation? It just seems strange to me. I could understand 60MB/s instead of 130, but it's so much slower. 
I'll delete some files and see if it will be faster, but it will take hours to copy them to the new disk. -- Best regards, Timofey.
Re: [RFC 00/32] making inode time stamps y2038 ready
On 06/04/2014 12:24 PM, Arnd Bergmann wrote: For other timekeeping stuff in the kernel, I agree that using some 64-bit representation (nanoseconds, 32/32 unsigned seconds/nanoseconds, ...) has advantages, that's exactly the point I was making earlier against simply extending the internal time_t/timespec to 64-bit seconds for everything. How much of a performance issue is it to make time_t 64 bits, and for the bits there are, how hard are they to fix? -hpa
[PATCH] btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
To return EOPNOTSUPP is more user friendly than to return EINVAL: the user-space tool will then show that the dev_replace operation for raid56 is not currently supported, rather than showing that there is an invalid argument. Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- fs/btrfs/dev-replace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 9f22905..2af6e66 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -313,7 +313,7 @@ int btrfs_dev_replace_start(struct btrfs_root *root, if (btrfs_fs_incompat(fs_info, RAID56)) { btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6"); - return -EINVAL; + return -EOPNOTSUPP; } switch (args->start.cont_reading_from_srcdev_mode) { -- 1.8.1.4
[PATCH] btrfs-progs: show meaningful msgs for replace cmd upon raid56
This depends on the kernel patch: [PATCH] btrfs: replace EINVAL with EOPNOTSUPP for dev_replace This catches the EOPNOTSUPP and outputs a msg saying that dev_replace for raid56 is not currently supported. Note that the msg will only be shown when dev_replace is not run in the background. Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- cmds-replace.c | 4 1 file changed, 4 insertions(+) diff --git a/cmds-replace.c b/cmds-replace.c index 9eb981b..8b18110 100644 --- a/cmds-replace.c +++ b/cmds-replace.c @@ -301,6 +301,10 @@ static int cmd_start_replace(int argc, char **argv) "ERROR: ioctl(DEV_REPLACE_START) failed on \"%s\": %s, %s\n", path, strerror(errno), replace_dev_result2string(start_args.result)); + + if (errno == EOPNOTSUPP) + fprintf(stderr, "WARNING: dev_replace cannot yet handle RAID5/RAID6\n"); + goto leave_with_error; } -- 1.8.1.4
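Together the two patches amount to a simple errno-based dispatch on the user-space side. A minimal sketch (hypothetical function name, not actual btrfs-progs code):

```python
import errno, os

def replace_error_message(err):
    # EOPNOTSUPP is what the amended kernel returns for RAID5/6;
    # every other errno keeps the generic ioctl-failure text.
    if err == errno.EOPNOTSUPP:
        return "WARNING: dev_replace cannot yet handle RAID5/RAID6"
    return "ERROR: ioctl(DEV_REPLACE_START) failed: %s" % os.strerror(err)

print(replace_error_message(errno.EOPNOTSUPP))
```

With EINVAL, userspace could not distinguish "unsupported on raid56" from a genuinely malformed argument, which is the motivation for the kernel-side change.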
[PATCH] mount: add btrfs to mount.8
Based on Documentation/filesystems/btrfs.txt Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- sys-utils/mount.8 | 186 ++ 1 file changed, 186 insertions(+) diff --git a/sys-utils/mount.8 b/sys-utils/mount.8 index efa1ae8..ec8eab3 100644 --- a/sys-utils/mount.8 +++ b/sys-utils/mount.8 @@ -671,6 +671,7 @@ currently supported include: .IR adfs , .IR affs , .IR autofs , +.IR btrfs , .IR cifs , .IR coda , .IR coherent , @@ -1245,6 +1246,191 @@ Give blocksize. Allowed values are 512, 1024, 2048, 4096. These options are accepted but ignored. (However, quota utilities may react to such strings in .IR /etc/fstab .) +.SH Mount options for btrfs +Btrfs is a copy-on-write filesystem for Linux aimed at +implementing advanced features while focusing on fault tolerance, +repair and easy administration. +.TP +.BI alloc_start= bytes +Debugging option to force all block allocations above a certain +byte threshold on each block device. The value is specified in +bytes, optionally with a K, M, or G suffix, case insensitive. +Default is 1MB. +.TP +.B autodefrag +Disable/enable auto defragmentation. +Auto defragmentation detects small random writes into files and queues +them up for the defrag process. Works best for small files; +not well suited for large database workloads. +.TP +\fBcheck_int\fP|\fBcheck_int_data\fP|\fBcheck_int_print_mask=\fP\,\fIvalue\fP +These debugging options control the behavior of the integrity checking +module (the BTRFS_FS_CHECK_INTEGRITY config option is required). + +.B check_int +enables the integrity checker module, which examines all +block write requests to ensure on-disk consistency, at a large +memory and CPU cost. + +.B check_int_data +includes extent data in the integrity checks, and +implies the check_int option. + +.B check_int_print_mask +takes a bitmask of BTRFSIC_PRINT_MASK_* values +as defined in fs/btrfs/check-integrity.c, to control the integrity +checker module behavior. 
+ +See comments at the top of +.IR fs/btrfs/check-integrity.c +for more info. +.TP +.BI commit= seconds +Set the interval of periodic commit, 30 seconds by default. Higher +values defer data being synced to permanent storage, with obvious +consequences when the system crashes. The upper bound is not forced, +but a warning is printed if it's more than 300 seconds (5 minutes). +.TP +\fBcompress\fP|\fBcompress=\fP\,\fItype\fP|\fBcompress-force\fP|\fBcompress-force=\fP\,\fItype\fP +Control BTRFS file data compression. Type may be specified as zlib, +lzo or no (for no compression, used for remounting). If no type +is specified, zlib is used. If compress-force is specified, +all files will be compressed, whether or not they compress well. +If compression is enabled, nodatacow and nodatasum are disabled. +.TP +.B degraded +Allow mounts to continue with missing devices. A read-write mount may +fail with too many devices missing, for example if a stripe member +is completely missing. +.TP +.BI device= devicepath +Specify a device during mount so that ioctls on the control device +can be avoided. Especially useful when trying to mount a multi-device +setup as root. May be specified multiple times for multiple devices. +.TP +.B discard +Disable/enable discard mount option. +Discard issues frequent commands to let the block device reclaim space +freed by the filesystem. +This is useful for SSD devices, thinly provisioned +LUNs and virtual machine images, but may have a significant +performance impact. (The fstrim command is also available to +initiate batch trims from userspace.) +.TP +.B enospc_debug +Disable/enable debugging option to be more verbose in some ENOSPC conditions. +.TP +.BI fatal_errors= action +Action to take when encountering a fatal error: + bug - BUG() on a fatal error. This is the default. + panic - panic() on a fatal error. 
+.TP +.B flushoncommit +The +.B flushoncommit +mount option forces any data dirtied by a write in a +prior transaction to commit as part of the current commit. This makes +the committed state a fully consistent view of the file system from the +application's perspective (i.e., it includes all completed file system +operations). This was previously the behavior only when a snapshot was +created. +.TP +.B inode_cache +Enable free inode number caching. Defaults to off due to an overflow +problem when the free space crcs don't fit inside a single page. +.TP +.BI max_inline= bytes +Specify the maximum amount of space, in bytes, that can be inlined in +a metadata B-tree leaf. The value is specified in bytes, optionally +with a K, M, or G suffix, case insensitive. In practice, this value +is limited by the root sector size, with some space unavailable due +to leaf headers. For a 4k sectorsize, max inline data is ~3900 bytes. +.TP +.BI metadata_ratio= value +Specify that 1 metadata chunk should be allocated after every +.IR value +data chunks. Off by default. +.TP +.B noacl +Enable/disable support for Posix
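Several of the options above (alloc_start=, max_inline=) accept a byte count with an optional K, M, or G suffix, case insensitive. The parsing rule described in the man text can be sketched as:

```python
def parse_size(s):
    # bytes with an optional K/M/G suffix, case insensitive, as the
    # alloc_start= and max_inline= descriptions above specify
    multipliers = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    s = s.strip()
    if s and s[-1].lower() in multipliers:
        return int(s[:-1]) * multipliers[s[-1].lower()]
    return int(s)

print(parse_size("1M"))   # 1048576
print(parse_size("4k"))   # 4096
```

This is only a sketch of the documented behavior, not the kernel's actual option parser.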
Re: Very slow filesystem
Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted: Why btrfs becames EXTREMELY slow after some time (months) of usage ? This is now happened second time, first time I though it was hard drive fault, but now drive seems ok. Filesystem is mounted with compress-force=lzo and is used for MySQL databases, files are mostly big 2G-8G. That's the problem right there, database access pattern on files over 1 GiB in size, but the problem along with the fix has been repeated over and over and over and over... again on this list, and it's covered on the btrfs wiki as well, so I guess you haven't checked existing answers before you asked the same question yet again. Never-the-less, here's the basic answer yet again... Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a particular file rewrite pattern, that being frequently changed and rewritten data internal to an existing file (as opposed to appended to it, like a log file). In the normal case, such an internal-rewrite pattern triggers copies of the rewritten blocks every time they change, *HIGHLY* fragmenting this type of files after only a relatively short period. While compression changes things up a bit (filefrag doesn't know how to deal with it yet and its report isn't reliable), it's not unusual to see people with several-gig files with this sort of write pattern on btrfs without compression find filefrag reporting literally hundreds of thousands of extents! For smaller files with this access pattern (think firefox/thunderbird sqlite database files and the like), typically up to a few hundred MiB or so, btrfs' autodefrag mount option works reasonably well, as when it sees a file fragmenting due to rewrite, it'll queue up that file for background defrag via sequential copy, deleting the old fragmented copy after the defrag is done. 
For larger files (say a gig plus) with this access pattern, typically larger database files as well as VM images, autodefrag doesn't scale so well, as the whole file must be rewritten each time, and at that size the changes can come faster than the file can be rewritten. So a different solution must be used for them. The recommended solution for larger internal-rewrite-pattern files is to give them the NOCOW file attribute (chattr +C), so they're updated in place. However, this attribute cannot be added to a file with existing data and have things work as expected. NOCOW must be added to the file before it contains data. The easiest way to do that is to set the attribute on the subdir that will contain the files and let the files inherit the attribute as they are created. Then you can copy (not move, and don't use cp's --reflink option) existing files into the new subdir, such that the new copy gets created with the NOCOW attribute. NOCOW files are updated in-place, thereby eliminating the fragmentation that would otherwise occur, keeping them fast to access. However, there are a few caveats. Setting NOCOW turns off file compression and checksumming as well, which is actually what you want for such files as it eliminates race conditions and other complex issues that would otherwise occur when trying to update the files in-place (thus the reason such features aren't part of most non-COW filesystems, which update in-place by default). Additionally, taking a btrfs snapshot locks the existing data in place for the snapshot, so the first rewrite to a file block (4096 bytes, I believe) after a snapshot will always be COW, even if the file has the NOCOW attribute set. Some people run automatic snapshotting software and can be taking snapshots as often as once a minute. 
Obviously, this effectively almost kills NOCOW entirely, since it's then only effective on changes after the first one between snapshots, and with snapshots only a minute apart, the file fragments almost as fast as it would have otherwise! So snapshots and the NOCOW attribute basically don't get along with each other. But because snapshots stop at subvolume boundaries, one method to avoid snapshotting NOCOW files is to put your NOCOW files, already in their own subdirs if using the suggestion above, into dedicated subvolumes as well. That lets you continue taking snapshots of the parent subvolume, without snapshotting the dedicated subvolumes containing the NOCOW database or VM-image files. You'd then do conventional backups of your database and VM-image files, instead of snapshotting them. Of course if you're not using btrfs snapshots in the first place, you can avoid the whole subvolume thing, and just put your NOCOW files in their own subdirs, setting NOCOW on the subdir as suggested above, so files (and further subdirs, nested subdirs inherit the NOCOW as well) inherit the NOCOW of the subdir they're created in, at that creation. Meanwhile, it can be noted that once you turn off COW/compression/checksumming, and if you're not snapshotting, you're
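The fragmentation argument above can be illustrated with a toy model. This is an assumed simplification for illustration, not btrfs's real extent allocator: in COW mode every random block rewrite relocates the block into a new extent, while NOCOW rewrites in place:

```python
import random

def extents_after_rewrites(blocks, rewrites, cow, seed=0):
    rng = random.Random(seed)
    owner = [0] * blocks              # which "extent" each block belongs to
    next_id = 1
    for _ in range(rewrites):
        b = rng.randrange(blocks)
        if cow:                       # COW: rewritten block lands elsewhere
            owner[b] = next_id
            next_id += 1
        # NOCOW: rewritten in place, ownership unchanged
    # an extent boundary exists wherever adjacent blocks differ in owner
    return 1 + sum(owner[i] != owner[i - 1] for i in range(1, blocks))

print(extents_after_rewrites(1000, 500, cow=False))       # 1
print(extents_after_rewrites(1000, 500, cow=True) > 100)  # True
```

A few hundred random rewrites already shatter the COW file into hundreds of extents, while the NOCOW file stays at one, which is the effect the chattr +C recommendation targets.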
Re: Very slow filesystem
(resending to the list as plain text, the original reply was rejected due to HTML format) On Thu, Jun 5, 2014 at 10:05 AM, Duncan 1i5t5.dun...@cox.net wrote: Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted: Why does btrfs become EXTREMELY slow after some time (months) of usage? This has now happened a second time; the first time I thought it was a hard drive fault, but now the drive seems ok. The filesystem is mounted with compress-force=lzo and is used for MySQL databases; files are mostly big, 2G-8G. That's the problem right there, database access pattern on files over 1 GiB in size, but the problem along with the fix has been repeated over and over and over and over... again on this list, and it's covered on the btrfs wiki as well Which part on the wiki? It's not on https://btrfs.wiki.kernel.org/index.php/FAQ or https://btrfs.wiki.kernel.org/index.php/UseCases so I guess you haven't checked existing answers before you asked the same question yet again. Never-the-less, here's the basic answer yet again... Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a particular file rewrite pattern, that being frequently changed and rewritten data internal to an existing file (as opposed to appended to it, like a log file). In the normal case, such an internal-rewrite pattern triggers copies of the rewritten blocks every time they change, *HIGHLY* fragmenting this type of files after only a relatively short period. While compression changes things up a bit (filefrag doesn't know how to deal with it yet and its report isn't reliable), it's not unusual to see people with several-gig files with this sort of write pattern on btrfs without compression find filefrag reporting literally hundreds of thousands of extents! 
For smaller files with this access pattern (think firefox/thunderbird sqlite database files and the like), typically up to a few hundred MiB or so, btrfs' autodefrag mount option works reasonably well, as when it sees a file fragmenting due to rewrite, it'll queue up that file for background defrag via sequential copy, deleting the old fragmented copy after the defrag is done. For larger files (say a gig plus) with this access pattern, typically larger database files as well as VM images, autodefrag doesn't scale so well, as the whole file must be rewritten each time, and at that size the changes can come faster than the file can be rewritten. So a different solution must be used for them. If COW and rewrite is the main issue, why doesn't zfs experience the extreme slowdown (that is, not if you have sufficient free space available, like 20% or so)? -- Fajar
Re: Very slow filesystem
Fajar A. Nugraha posted on Thu, 05 Jun 2014 10:22:49 +0700 as excerpted: (resending to the list as plain text, the original reply was rejected due to HTML format) On Thu, Jun 5, 2014 at 10:05 AM, Duncan 1i5t5.dun...@cox.net wrote: Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted: Why does btrfs become EXTREMELY slow after some time (months) of usage? This has now happened a second time; the first time I thought it was a hard drive fault, but now the drive seems ok. The filesystem is mounted with compress-force=lzo and is used for MySQL databases; files are mostly big, 2G-8G. That's the problem right there, database access pattern on files over 1 GiB in size, but the problem along with the fix has been repeated over and over and over and over... again on this list, and it's covered on the btrfs wiki as well Which part on the wiki? It's not on https://btrfs.wiki.kernel.org/index.php/FAQ or https://btrfs.wiki.kernel.org/index.php/UseCases Most of the discussion and information is on the list, but there's a limited amount of information on the wiki in at least three places. Two are on the mount options page, in the autodefrag and nodatacow options description: * Autodefrag says it's well suited to bdb and sqlite dbs but not vm images or big dbs (yet). * Nodatacow says performance gain is usually under 5% *UNLESS* the workload is random writes to large db files, where the difference can be VERY large. (There's also mention of the fact that this turns off checksumming and compression.) Of course that's the nodatacow mount option, not the NOCOW file attribute, which isn't to my knowledge discussed on the wiki, and given the wiki wording, one does indeed have to read a bit between the lines, but it is there if one looks. That was certainly enough hint for me to mark the issue for further study as I did my initial pre-mkfs.btrfs research, for instance, and that it was a problem, with additional detail, was quickly confirmed once I checked the list. 
* Additionally, there is some discussion in the FAQ under Can copy-on-write be turned off for data blocks?, including discussion of the command used (chattr +C), a link to a script, a shell commands example, and the hint that it will produce a file suitable for a raw VM image -- the blocks will be updated in-place and are preallocated. FWIW, if I did wiki editing there'd probably be a dedicated page discussing it, but for better or worse, I seem to work best on mailing lists and newsgroups, and every time I've tried contributing on the web, even when it has been to a web forum which one would think would be close enough to lists/groups for me to adapt to, it simply hasn't gone much of anywhere. So these days I let other people more comfortable with editing wikis or doing web forums do that (and sometimes people do that by either actually quoting my list post nearly verbatim or simply linking to it, which I'm fine with, as after all that's where much of the info I post comes from in the first place), and I stick to the lists. Since I don't directly contribute to the wiki I don't much criticize it, but there are indeed at least hints there for those who can read them, something I did myself so I know it's not asking the impossible. If COW and rewrite is the main issue, why doesn't zfs experience the extreme slowdown (that is, not if you have sufficient free space available, like 20% or so)? My personal opinion? Primarily two things: 1) zfs is far more mature than btrfs and has been in production usage for many years now, while btrfs is still barely getting the huge warnings stripped off. There's a lot of btrfs optimization possible that simply hasn't occurred yet as the focus is still real data-destruction-risk bugs, and in fact, btrfs isn't yet feature-complete either, so there's still focus on raw feature development as well. When btrfs gets to the maturity level that zfs is at now, I expect a lot of the problems we have now will have been dramatically reduced if not eliminated. 
(And the devs are indeed working on this problem, among others.) 2) Stating the obvious, while both btrfs and zfs are COW based and have other similarities, btrfs is a different filesystem, with an entirely different implementation and somewhat different emphasis. There consequently WILL be some differences, even when they're both mature filesystems. It's entirely possible that something about the btrfs implementation makes it less suitable in general to this particular use-case. Additionally, while I don't have zfs experience myself nor do I find it a particularly feasible option for me due to licensing and political issues, from what I've read it tends to handle certain issues by simply throwing gigs on gigs of memory at the problem. Btrfs is designed to require far less memory, and as such, will by definition be somewhat more limited in spots. (Arguably, this is simply a specific case of #2 above, they're