Re: [RFC v4+ hot_track 00/19] vfs: hot data tracking
zwu.ker...@gmail.com writes: TODO List: 1.) Need to do scalability or performance tests. You're changing some of the most performance critical code in the kernel. This step is absolutely not optional. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL v2] Update LZO compression
- Post the entire patch series to lkml for review (I'd like a cc please) Already happened, multiple people reviewed and tested. um, I would not consider Looks ok to me from a quick look. and I couldn't tell from the github view, but I assume you follow standard coding style. to indicate a rigorous code review! Let's put it like this. LZO is mature widely used production software and Markus wrote and maintains it. Essentially this is a release from him. The main problem before with using his version directly was that it did not follow Linux coding standards, but that seems to be fixed now. It's also relatively simple and at least I didn't spot anything bad when I looked. It was runtime tested by several people on different architectures. If it was fuzzed so we can assume reasonably good security. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] Update LZO compression
If you think a little bit, I bet you could come up with a solution that operates at cacheline-aligned granularity, something that would be _even faster_ than simply fixing the code to do aligned accesses. Cache aligned compression is unlikely to compress anything at all. Compression algorithms are usually by definition unaligned. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] Update LZO compression
I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set, and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo. There doesn't seem to be a significant difference between all three variants. I found that in compression benchmarks it depends a lot on the data compressed. urandom (which should be essentially incompressible) will be handled by different code paths in the compressor than other more compressible data. It becomes a complicated memcpy then. But then there are IO benchmarks which also only do zeroes, which also gives an unrealistic picture. Usually it's best to use some corpus with different data types, from very compressible to less so; and look at the aggregate. For my snappy work I usually had at least large executables (medium) and some pdfs (already compressed; low) and then uncompressed source code tars (high compression) -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] Update LZO compression
On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote: Hi all, as suggested on the mailing list I have converted the updated LZO code into git, so please pull my lzo-update branch from git://github.com/markus-oberhumer/linux.git lzo-update You can browse the branch at https://github.com/markus-oberhumer/linux/compare/lzo-update Looks ok to me from a quick look. Since kernel lzo is security relevant, I assume the new version has been fuzz tested? I couldn't tell from the github view, but I assume you follow standard coding style. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: use a slab for btrfs_dio_private
Josef Bacik jba...@fusionio.com writes: This is in the IO path, let's try to avoid latencies by having our own slab. Thanks, What latencies does this avoid? -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] Add the snappy-c compressor to lib v2
(BTW: If you're ever reworking this patch set, I'd like to make an ad hoc request for slightly different names for fs/btrfs/snappy.c and lib/snappy.c) Why? When building a x86 kernel, I get the following errors: CC [M] lib/snappy.o lib/snappy.c: In function 'snappy_init_env': lib/snappy.c:1268:2: error: implicit declaration of function 'vmalloc' CC [M] fs/btrfs/free-space-cache.o lib/snappy.c:1268:18: warning: assignment makes pointer from integer without a cast lib/snappy.c: In function 'snappy_free_env': lib/snappy.c:1293:2: error: implicit declaration of function 'vfree' make[1]: *** [lib/snappy.o] Error 1 make: *** [lib] Error 2 The error clears with this patch: Added thanks. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFB] add LZ4 compression method to btrfs
Markus Lindberg marcuslindb...@gmail.com writes: Are you sure about these figures ? the difference seems too large. It's almost unbelievable. Yes, my benchmarks totally disagree with them. In my tests lz4 is generally slower than snappy-c. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ANN: linux-kernel-lzo-2.06.20120123 - update LZO to v2.06
On Mon, Jan 23, 2012 at 05:19:40PM +0100, Markus F.X.J. Oberhumer wrote: Hi, I've prepared a small package that updates the LZO version in the Linux kernel to LZO v2.06. I ran benchmarks on the new miniLZO and LZ4 on 64bit. LZ4 is generally slower than snappy/lzo in the micro benchmarks. The new LZO is better than the old one, but still loses to snappy most of the time (but often by very small amounts only) Will be worth checking the new LZO will the full distribution boot test. I agree it's definitely a good idea to update the kernel version. However I must say it would be a major project to bring it up to kernel coding standards. snappy is still interesting, but much less so than it was before. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Updated btrfs/crypto snappy interface ready for merging
It's because decompressing inline extents always fails. I've fixed it and will send the patch out in a new mail thread. Thanks for fixing. But seems there's bug in lib snappy code, which makes the decompressed data doesn't quite match the original data. Simply copy a file to a btrfs filesystem with snappy enabled, and clear page cache, and check the file: Hmm weird, I have never seen this. Do you have a reproducer? The basic compression code is quite well tested, I have a reasonable unit test. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Updated btrfs/crypto snappy interface ready for merging
At first I saved emails and patched them in linux-btrfs git tree, and I got weired result. Then I pulled the snappy branch directly, and now nothing is wrong.. Sorry for the noise. np. Thanks for testing. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] SNAPPY: Add dual GPL/BSD license to snappy module
From: Andi Kleen a...@linux.intel.com To avoid complaints from the module loader. I made it dual because the original compression code was BSD. Reported-by: Chris Mason Signed-off-by: Andi Kleen a...@linux.intel.com --- lib/snappy.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/lib/snappy.c b/lib/snappy.c index 0b39e07..3848c6c 100644 --- a/lib/snappy.c +++ b/lib/snappy.c @@ -1298,3 +1298,5 @@ void snappy_free_env(struct snappy_env *env) memset(env, 0, sizeof(struct snappy_env)); } EXPORT_SYMBOL(snappy_free_env); + +MODULE_LICENSE(Dual BSD/GPL); -- 1.7.7.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs
For the btrfs extent cache it's unclear if just RCUing is a good fit anyways: some workloads are very write heavy and RCU only performs well if you have a lot more reads than writes. For write heavy RCUification usually slows it down. FWIW, I'm mentioning this out of self interest - I need a RCU safe tree structure to index extents for lookless lookups in the XFS buffer cache, but I've got a long list of things to do before I get to it. If someone else implements the tree, that's most of the work done for me. :) FWIW there are fine grained rbtrees in papers too, but they are too fine grained imho: you may need to take a large number of locks for a single traversal. While atomics got a lot cheaper recently they are still somewhat expensive and you don't want too many of them in your fast path. Also I found when there is actual contention having too many bouncing locks is quite bad because the latencies of passing the cache lines around really add up. In these cases uses less fine locks is better. Mathieu also did RCU rbtrees but they are quite complicated. IMHO we would like to have something inbetween a global tree lock and a fully fine grained tree where the lock complexity cannot get out of band. May need a separate data structure for the locks. I don't have a leading candidate for that currently. There are some variants of rbtrees that are less strict and have a simpler rebalance which are interesting. But also some other tree data structures. Needs more work. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs
Liu Bo liubo2...@cn.fujitsu.com writes: Here we choose extent_map firstly, since it is a read mostly thing, and the change is quite direct, all we need to do is a) to replace rbtree with skiplist, b) to add rcu support. And more details are in patch 2 and patch 3. I've done some simple tests for performance on my 2-core box, there is no obvious difference, but I want to focus on the design side and make sure there is no more bug in it firstly. For long term goals, we want to ship skiplist to lib, like lib/rbtree.c. I looked at skiplists some time ago. What made them awkward for kernel use is that you have to size the per node skip array in advance and it's hard to resize. So you have a node that wastes memory in common small cases, but still degenerates to linked list on very large sizes. With fine grained locking it gets even worse because the nodes get larger. Con didn't worry about this problem because he assumed the scheduler run queues never could get too long. But for a very scalable subsystem that's definitely a problem. I think skiplists are not a good fit here. At least in our tests the older style trees got a lot better with Chris' recent locking improvements. Now replacing rbtrees is probably still a good idea, but not convinced skiplist are suitable here. There were various other tree alternatives with better locking. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Updated btrfs/crypto snappy interface ready for merging
Here's a slightly updated version of the BTRFS snappy interface. snappy is a faster compression algorithm that provides similar compression as LZO, but generally better performance. This version has: - Ported to latest tree - Extensive testing and benchmarking (in its 3.0 based incarnation) - A few cleanups - Adds a minor performance improvement Should be ready for pulling into the btrfs tree now. The crypto parts have been acked by Herbert. The following changes since commit 2485a4b610171f4e1c4ab0d053569747795c1bbe: Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input (2012-01-12 12:40:41 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc snappy Andi Kleen (3): Add the snappy-c compressor to lib v2 BTRFS: Add snappy support v2 Add snappy interface to crypto API crypto/Kconfig |9 + crypto/Makefile|1 + crypto/snappy.c| 99 fs/btrfs/Kconfig |1 + fs/btrfs/Makefile |2 +- fs/btrfs/compression.c |1 + fs/btrfs/compression.h |1 + fs/btrfs/ctree.h |9 +- fs/btrfs/disk-io.c |2 + fs/btrfs/ioctl.c |4 + fs/btrfs/snappy.c | 432 fs/btrfs/super.c |9 +- include/linux/snappy.h | 26 + lib/Kconfig|6 + lib/Makefile |4 + lib/snappy.c | 1300 16 files changed, 1901 insertions(+), 5 deletions(-) create mode 100644 crypto/snappy.c create mode 100644 fs/btrfs/snappy.c create mode 100644 include/linux/snappy.h create mode 100644 lib/snappy.c -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] BTRFS: Add snappy support v2
From: Andi Kleen a...@linux.intel.com Add support in btrfs for snappy compression. This is generally a faster compressor than LZO (except for higher memory usage) with comparable compression and should imho replace it. This is based on the lzo code with minor modifications. This has been tested with various test workloads and also in a real distribution. One challenge with the benchmarks is that a lot of of benchmarks only read/write zeroes, which are a uncommon good case for IO compression. However even with modified workloads that use different patterns gains are seen. In general snappy performs better with a 64bit CPU, but also generally outperforms LZO on a 32bit Atom system on IO workloads. The disk space usage is similar to LZO (within 0.2%) On a distribution boot test on a netboot snappy is about 11.5% faster than LZO or 15.5% faster than an uncompressed btrfs. Open: implement scatter-gather support and get rid of the temporary buffers. This would only work in some special cases, but should improve performance further by eliminating copies. Thanks to Jacob Sowles for extensive testing and benchmarking. v2: Some cleanups, port to latest tree. Signed-off-by: Andi Kleen a...@linux.intel.com --- fs/btrfs/Kconfig |1 + fs/btrfs/Makefile |2 +- fs/btrfs/compression.c |1 + fs/btrfs/compression.h |1 + fs/btrfs/ctree.h |9 +- fs/btrfs/disk-io.c |2 + fs/btrfs/ioctl.c |4 + fs/btrfs/snappy.c | 432 fs/btrfs/super.c |9 +- 9 files changed, 456 insertions(+), 5 deletions(-) create mode 100644 fs/btrfs/snappy.c diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig index ecb9fd3..d55df9c 100644 --- a/fs/btrfs/Kconfig +++ b/fs/btrfs/Kconfig @@ -6,6 +6,7 @@ config BTRFS_FS select ZLIB_DEFLATE select LZO_COMPRESS select LZO_DECOMPRESS + select SNAPPY help Btrfs is a new filesystem with extents, writable snapshotting, support for multiple devices and many more features. diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index c0ddfd2..b93ab52 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -8,6 +8,6 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ - reada.o backref.o + reada.o backref.o snappy.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 14f1c5a..f85f7fd 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -730,6 +730,7 @@ static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]; struct btrfs_compress_op *btrfs_compress_op[] = { btrfs_zlib_compress, btrfs_lzo_compress, + btrfs_snappy_compress, }; int __init btrfs_init_compress(void) diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index a12059f..971a425 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -79,5 +79,6 @@ struct btrfs_compress_op { extern struct btrfs_compress_op btrfs_zlib_compress; extern struct btrfs_compress_op btrfs_lzo_compress; +extern struct btrfs_compress_op btrfs_snappy_compress; #endif diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 6738503..a06a642 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -458,6 +458,7 @@ struct btrfs_super_block { #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL (1ULL 1) #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS(1ULL 2) #define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO(1ULL 3) +#define BTRFS_FEATURE_INCOMPAT_COMPRESS_SNAPPY (1ULL 4) #define BTRFS_FEATURE_COMPAT_SUPP 0ULL #define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL @@ -465,7 +466,8 @@ struct btrfs_super_block { (BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \ BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL |\ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \ -BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO) +BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \ +BTRFS_FEATURE_INCOMPAT_COMPRESS_SNAPPY) /* * A leaf is full of items. offset and size tell us where to find @@ -621,8 +623,9 @@ enum btrfs_compression_type { BTRFS_COMPRESS_NONE = 0, BTRFS_COMPRESS_ZLIB = 1, BTRFS_COMPRESS_LZO = 2, - BTRFS_COMPRESS_TYPES = 2, - BTRFS_COMPRESS_LAST = 3, + BTRFS_COMPRESS_SNAPPY = 3, + BTRFS_COMPRESS_TYPES = 3, + BTRFS_COMPRESS_LAST = 4, }; struct btrfs_inode_item { diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f99a099..925dd0a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2113,6 +2113,8 @@ struct btrfs_root *open_ctree(struct super_block *sb, features
[PATCH 1/3] Add the snappy-c compressor to lib v2
From: Andi Kleen a...@linux.intel.com This is a C port of the google snappy compressor. It has roughly comparable compression to LZO, but is significantly faster on many file types. For example it beats all other compressors on already compressed data. I ported the original C++ code over to C and did some changes to make it better fit into the kernel. It preallocates the worst case memory consumption now. While the code being larger than lzo it is still reasonable (about 5K on x86). Decompression needs very little memory, Compression currently 192K on 64bit and 128K on 32bit. For comparison LZO compression needs 128K on 64bit and 64K on 32bit. [This could be lowered significantly by not preallocating for most use cases, typically the footprint is much lower. The original C++ version only allocated most of this when (rarely) needed, but this is more problematic in the kernel] There are some minor divergences from the Linux coding standards: in particular I kept the C++/C99 style mixed statement/declarations. This was mainly to not diverge too much from the reference C++ source, so that improvements from there can be easily ported. There are some other left overs from the google style, but very little now. Performance: The compressor performs best on 64bit-LE systems, but is also quite good on 32bit. I haven't tested BE, but I don't expect that to add a lot of overhead. Here is some performance data (32bit, Nehalem): c/b = cycles/byte; lower numbers are better. x86-64 executable: (compression minimally slower than qlz, but much better at decompression, lzo is left in the dust): snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 8.13 uncomp 2.65 c/b lzo : emacs-gtk: 11007968 b: ratio 0.33: comp 12.74 uncomp 4.70 c/b zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 49.96 uncomp 13.14 c/b zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 64.17 uncomp 12.33 c/b lzf : emacs-gtk: 11007968 b: ratio 0.37: comp 9.85 uncomp 4.33 c/b qlz : emacs-gtk: 11007968 b: ratio 0.34: comp 7.51 uncomp 6.28 c/b fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 10.73 uncomp 4.97 c/b Compressed data (beats everything else): snappy: udev-151.tar.gz: 634842 b: ratio 1.00: comp 0.99 uncomp 0.33 c/b lzo : udev-151.tar.gz: 634842 b: ratio 1.00: comp 41.44 uncomp 0.66 c/b zlib1 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 116.99 uncomp 3.94 c/b zlib3 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 117.68 uncomp 3.94 c/b lzf : udev-151.tar.gz: 634842 b: ratio 1.03: comp 16.32 uncomp 1.14 c/b qlz : udev-151.tar.gz: 634842 b: ratio 1.00: comp 10.42 uncomp 0.42 c/b fastlz: udev-151.tar.gz: 634842 b: ratio 1.03: comp 19.35 uncomp 2.07 c/b Text file (compression somewhat slower than qlz, but decompression much better, lzo is much worse): snappy: manual.txt: 445343 b: ratio 0.47: comp 12.01 uncomp 3.12 c/b lzo : manual.txt: 445343 b: ratio 0.44: comp 16.32 uncomp 7.53 c/b zlib1 : manual.txt: 445343 b: ratio 0.35: comp 56.37 uncomp 15.59 c/b zlib3 : manual.txt: 445343 b: ratio 0.31: comp 73.45 uncomp 13.99 c/b lzf : manual.txt: 445343 b: ratio 0.46: comp 13.43 uncomp 5.47 c/b qlz : manual.txt: 445343 b: ratio 0.44: comp 9.16 uncomp 8.19 c/b fastlz: manual.txt: 445343 b: ratio 0.46: comp 14.22 uncomp 7.28 c/b As you can see snappy is a good all-around compressor. On 64bit the compression is even faster and beats everything else easily: snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 4.90 uncomp 2.65 c/b lzo : emacs-gtk: 11007968 b: ratio 0.33: comp 11.24 uncomp 4.46 c/b zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 41.67 uncomp 11.13 c/b zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 51.80 uncomp 10.54 c/b lzf : emacs-gtk: 11007968 b: ratio 0.37: comp 8.79 uncomp 4.05 c/b qlz : emacs-gtk: 11007968 b: ratio 0.34: comp 5.44 uncomp 5.46 c/b fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 9.91 uncomp 4.77 c/b On 64bit it's now nearly as fast as qlz on the text file too: snappy: manual.txt: 445343 b: ratio 0.47: comp 7.79 uncomp 3.47 c/b lzo : manual.txt: 445343 b: ratio 0.44: comp 15.46 uncomp 7.27 c/b zlib1 : manual.txt: 445343 b: ratio 0.35: comp 45.79 uncomp 12.78 c/b zlib3 : manual.txt: 445343 b: ratio 0.31: comp 60.52 uncomp 11.72 c/b lzf : manual.txt: 445343 b: ratio 0.46: comp 12.62 uncomp 5.30 c/b qlz : manual.txt: 445343 b: ratio 0.44: comp 6.81 uncomp 7.65 c/b fastlz: manual.txt: 445343 b: ratio 0.46: comp 13.75 uncomp 6.52 c/b Overall it's a good alternative to lzo, with the only drawback being the somewhat higher memory use. v2: Some minor performance improvements and cleanups. 32bit compression should be a few percent faster now. Signed-off-by: Andi Kleen a...@linux.intel.com --- include/linux/snappy.h | 26 + lib/Kconfig|6 + lib/Makefile |4 + lib/snappy.c | 1300 4 files changed, 1336 insertions(+), 0 deletions(-) create mode 100644 include/linux/snappy.h create mode 100644 lib/snappy.c diff --git
[PATCH 3/3] Add snappy interface to crypto API
From: Andi Kleen a...@linux.intel.com Mainly so that ubifs can use it. Acked-by: Herbert Xu herb...@gondor.apana.org.au Signed-off-by: Andi Kleen a...@linux.intel.com --- crypto/Kconfig |9 + crypto/Makefile |1 + crypto/snappy.c | 99 +++ 3 files changed, 109 insertions(+), 0 deletions(-) create mode 100644 crypto/snappy.c diff --git a/crypto/Kconfig b/crypto/Kconfig index e6cfe1a..768943dc 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -924,6 +924,15 @@ config CRYPTO_LZO help This is the LZO algorithm. +config CRYPTO_SNAPPY + tristate Snappy compression algorithm + select CRYPTO_ALGAPI + select SNAPPY + help + snappy is a faster alternative to the lzo compression algorithm + with comparable compression. It is very fast on 64bit systems, but also + good on 32bit systems. It especially excels at already compressed data. + comment Random Number Generation config CRYPTO_ANSI_CPRNG diff --git a/crypto/Makefile b/crypto/Makefile index f638063..8a0e6c6 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -82,6 +82,7 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o obj-$(CONFIG_CRYPTO_CRC32C) += crc32c.o obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o obj-$(CONFIG_CRYPTO_LZO) += lzo.o +obj-$(CONFIG_CRYPTO_SNAPPY) += snappy.o obj-$(CONFIG_CRYPTO_RNG2) += rng.o obj-$(CONFIG_CRYPTO_RNG2) += krng.o obj-$(CONFIG_CRYPTO_ANSI_CPRNG) += ansi_cprng.o diff --git a/crypto/snappy.c b/crypto/snappy.c new file mode 100644 index 000..43a64b9 --- /dev/null +++ b/crypto/snappy.c @@ -0,0 +1,99 @@ +/* + * Cryptographic API. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published by + * the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 51 + * Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + * + */ + +#include linux/init.h +#include linux/module.h +#include linux/crypto.h +#include linux/vmalloc.h +#include linux/snappy.h + +struct snappy_ctx { + struct snappy_env env; +}; + +/* Only needed for compression actually */ +static int snappy_init(struct crypto_tfm *tfm) +{ + struct snappy_ctx *ctx = crypto_tfm_ctx(tfm); + + return snappy_init_env(ctx-env); +} + +static void snappy_exit(struct crypto_tfm *tfm) +{ + struct snappy_ctx *ctx = crypto_tfm_ctx(tfm); + + snappy_free_env(ctx-env); +} + +static int snp_compress(struct crypto_tfm *tfm, const u8 *src, + unsigned int slen, u8 *dst, unsigned int *dlen) +{ + struct snappy_ctx *ctx = crypto_tfm_ctx(tfm); + size_t olen; + int err; + + /* very pessimistic. check in snappy? */ + if (*dlen snappy_max_compressed_length(*dlen)) + return -EINVAL; + err = snappy_compress(ctx-env, src, slen, dst, olen); + *dlen = olen; + return err; +} + +static int snp_decompress(struct crypto_tfm *tfm, const u8 *src, + unsigned int slen, u8 *dst, unsigned int *dlen) +{ + size_t ulen; + + if (!snappy_uncompressed_length(src, slen, ulen)) + return -EIO; + if (*dlen ulen) + return -EINVAL; + *dlen = ulen; + return snappy_uncompress(src, slen, dst) ? 0 : -EIO; +} + +static struct crypto_alg alg = { + .cra_name = snappy, + .cra_flags = CRYPTO_ALG_TYPE_COMPRESS, + .cra_ctxsize= sizeof(struct snappy_ctx), + .cra_module = THIS_MODULE, + .cra_list = LIST_HEAD_INIT(alg.cra_list), + .cra_init = snappy_init, + .cra_exit = snappy_exit, + .cra_u = { .compress = { + .coa_compress = snp_compress, + .coa_decompress = snp_decompress } } +}; + +static int __init snappy_mod_init(void) +{ + return crypto_register_alg(alg); +} + +static void __exit snappy_mod_fini(void) +{ + crypto_unregister_alg(alg); +} + +module_init(snappy_mod_init); +module_exit(snappy_mod_fini); + +MODULE_LICENSE(GPL); +MODULE_DESCRIPTION(Snappy Compression Algorithm); -- 1.7.7.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 07/99] btrfs: Use mempools for extent_state structures
Jeff Mahoney je...@suse.com writes: The extent_state structure is used at the core of the extent i/o code for managing flags, locking, etc. It requires allocations deep in the write code and if failures occur they are difficult to recover from. We avoid most of the failures by using a mempool, which can sleep when required, to honor the allocations. This allows future patches to convert most of the {set,clear,convert}_extent_bit and derivatives to return void. Is this really safe? iirc if there's any operation that needs multiple mempool objects you can get into the following ABBA style deadlock: thread 1 thread 2 get object A from pool 1 get object C from pool 2 get object B from pool 2 pool 2 full - block get object D from pool 2 pool1 full - block Now for thread 1 to make progress it needs thread 2 to free its object first, but that needs thread 1 to free its object also first, which is a deadlock. It would still work if there are other users which eventually make progress, but if you're really unlucky either pool 1 or 2 is complete used by threads doing a multiple object operation. So you got a nasty rare deadlock ... The only way to handle this would be to always preallocate all objects needed for a operation in an atomic way. Or never use more than one object. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] Add the snappy-c compressor to lib
From: Andi Kleen a...@linux.intel.com This is a C port of the google snappy compressor. It has roughly comparable compression to LZO, but is significantly faster on many file types. For example it beats all other compressors on already compressed data. I ported the original C++ code over to C and did some changes to make it better fit into the kernel. It preallocates the worst case memory consumption now. While the code being larger than lzo it is still reasonable (about 5K on x86). Decompression needs very little memory, Compression currently 192K on 64bit and 128K on 32bit. For comparison LZO compression needs 128K on 64bit and 64K on 32bit. [This could be lowered significantly by not preallocating for most use cases, typically the footprint is much lower. The original C++ version only allocated most of this when (rarely) needed, but this is more problematic in the kernel] There are some minor divergences from the Linux coding standards: in particular I kept the C++/C99 style mixed statement/declarations. This was mainly to not diverge too much from the reference C++ source, so that improvements from there can be easily ported. There are some other left overs from the google style, but very little now. checkpatch.pl has some complaints, but they are either caused by the above or checkpatch.pl bugs (it misparsed #define foo do {} while(0)) Performance: The compressor performs best on 64bit-LE systems, but is also quite good on 32bit. I haven't tested BE, but I don't expect that to add a lot of overhead. Here is some performance data (32bit, Nehalem): c/b = cycles/byte; lower numbers are better. x86-64 executable: (compression minimally slower than qlz, but much better at decompression, lzo is left in the dust): snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 8.13 uncomp 2.65 c/b lzo : emacs-gtk: 11007968 b: ratio 0.33: comp 12.74 uncomp 4.70 c/b zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 49.96 uncomp 13.14 c/b zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 64.17 uncomp 12.33 c/b lzf : emacs-gtk: 11007968 b: ratio 0.37: comp 9.85 uncomp 4.33 c/b qlz : emacs-gtk: 11007968 b: ratio 0.34: comp 7.51 uncomp 6.28 c/b fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 10.73 uncomp 4.97 c/b Compressed data (beats everything else): snappy: udev-151.tar.gz: 634842 b: ratio 1.00: comp 0.99 uncomp 0.33 c/b lzo : udev-151.tar.gz: 634842 b: ratio 1.00: comp 41.44 uncomp 0.66 c/b zlib1 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 116.99 uncomp 3.94 c/b zlib3 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 117.68 uncomp 3.94 c/b lzf : udev-151.tar.gz: 634842 b: ratio 1.03: comp 16.32 uncomp 1.14 c/b qlz : udev-151.tar.gz: 634842 b: ratio 1.00: comp 10.42 uncomp 0.42 c/b fastlz: udev-151.tar.gz: 634842 b: ratio 1.03: comp 19.35 uncomp 2.07 c/b Text file (compression somewhat slower than qlz, but decompression much better, lzo is much worse): snappy: manual.txt: 445343 b: ratio 0.47: comp 12.01 uncomp 3.12 c/b lzo : manual.txt: 445343 b: ratio 0.44: comp 16.32 uncomp 7.53 c/b zlib1 : manual.txt: 445343 b: ratio 0.35: comp 56.37 uncomp 15.59 c/b zlib3 : manual.txt: 445343 b: ratio 0.31: comp 73.45 uncomp 13.99 c/b lzf : manual.txt: 445343 b: ratio 0.46: comp 13.43 uncomp 5.47 c/b qlz : manual.txt: 445343 b: ratio 0.44: comp 9.16 uncomp 8.19 c/b fastlz: manual.txt: 445343 b: ratio 0.46: comp 14.22 uncomp 7.28 c/b As you can see snappy is a good all-around compressor. On 64bit the compression is even faster and beats everything else easily: snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 4.90 uncomp 2.65 c/b lzo : emacs-gtk: 11007968 b: ratio 0.33: comp 11.24 uncomp 4.46 c/b zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 41.67 uncomp 11.13 c/b zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 51.80 uncomp 10.54 c/b lzf : emacs-gtk: 11007968 b: ratio 0.37: comp 8.79 uncomp 4.05 c/b qlz : emacs-gtk: 11007968 b: ratio 0.34: comp 5.44 uncomp 5.46 c/b fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 9.91 uncomp 4.77 c/b On 64bit it's now nearly as fast as qlz on the text file too: snappy: manual.txt: 445343 b: ratio 0.47: comp 7.79 uncomp 3.47 c/b lzo : manual.txt: 445343 b: ratio 0.44: comp 15.46 uncomp 7.27 c/b zlib1 : manual.txt: 445343 b: ratio 0.35: comp 45.79 uncomp 12.78 c/b zlib3 : manual.txt: 445343 b: ratio 0.31: comp 60.52 uncomp 11.72 c/b lzf : manual.txt: 445343 b: ratio 0.46: comp 12.62 uncomp 5.30 c/b qlz : manual.txt: 445343 b: ratio 0.44: comp 6.81 uncomp 7.65 c/b fastlz: manual.txt: 445343 b: ratio 0.46: comp 13.75 uncomp 6.52 c/b Overall it's a good alternative to lzo, with the only drawback being the somewhat higher memory use. The memory use can be fixed for most cases (e.g. some of the current buffers are only used for SG), but isn't yet in this version. Open: it's pretty easy to add scatter-gather support since input/output is quite abstracted. This would benefit some users who could avoid temporary buffers. Signed-off-by: Andi Kleen a...@linux.intel.com
[PATCH 2/3] BTRFS: Add snappy support
From: Andi Kleen a...@linux.intel.com Add support in btrfs for snappy compression. This is based on the lzo code with minor modifications. The btrfs glue code could be significantly improved over LZO by exploiting some snappy features, but hasn't so far. Open: implement scatter-gather support and get rid of the temporary buffers. Some performance numbers (thanks to Jacob Sowles for running them) bonnie++, Core i7-64bit block output rewrite block input random K/sec K/secK/sec K/sec None 100% 100% 100%100% zlib+6.3%+5.4%+6.6% +11.6% lzo+11.5%+4.6%+12.4% +6.7% snappy +19.3%+28.1% +32.6% +9.3% Snappy does extremly well on the 64bit architecture, outperforming everything else, sometimes with a healthy margin. bonnie++, Atom-32bit block output rewrite block input random K/sec K/secK/sec K/sec None 100% 100% 100%100% zlib -43.1%-24.2% -19.0% +12.0% lzo+0.8% +2.6%+6.8% +14.8% snappy +19.5%+16.2% +24.0% +15.7% zlib does very poorly on Atom, actually degrading performance. snappy is generally faster or similar to LZO. The difference is not as big as on the 64bit CPU though, but still visible. bonnie++, files, Core-i7-64bit sequential createdelete random create delete files/sec None 100% 100% 100% 100% zlib +8.3% +10.5%+10.3% +1.4% lzo+3.8% +3.3% +5.4% -3.89% snappy +23.7%+37.2%+21.8% +23.8% bonnie++, files, Atom-32bit sequential createdelete random create delete files/sec None 100% 100% 100% 100% zlib +3.0% +7.9% +5.2% +5.1% lzo+8.2% +5.9% +4.8% +4.6% snappy +3.1% +8.5% +5.7% +1.3% Creation/Deletion on Atom is a case where snappy loses to LZO, however the loss is small. On 64bit Core it's a win. I should add that these benchmarks mainly use 0 filled IO, however FFSB was also quickly tested with more random data and the differences were similar. See also the micro benchmarks in the algorithm description for the behaviour with different data types. FFSB, 4 threads, stair case data pattern, Reads MB/s Core i7-64bit Atom-32bit MB/s MB/s None 100% 100% zlib +8.0% +4.2% lzo+9.3% +4.8% snappy +12.4% +7.9% In general snappy is a better replacement for LZO, especially on 64bit, but even on 32bit. Cc: chris.ma...@oracle.com Signed-off-by: Andi Kleen a...@linux.intel.com --- fs/btrfs/Kconfig |1 + fs/btrfs/Makefile |3 +- fs/btrfs/compression.c |1 + fs/btrfs/compression.h |1 + fs/btrfs/ctree.h |9 +- fs/btrfs/disk-io.c |2 + fs/btrfs/ioctl.c |4 + fs/btrfs/snappy.c | 435 fs/btrfs/super.c |9 +- lib/Makefile |1 + 10 files changed, 461 insertions(+), 5 deletions(-) create mode 100644 fs/btrfs/snappy.c diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig index ecb9fd3..d55df9c 100644 --- a/fs/btrfs/Kconfig +++ b/fs/btrfs/Kconfig @@ -6,6 +6,7 @@ config BTRFS_FS select ZLIB_DEFLATE select LZO_COMPRESS select LZO_DECOMPRESS + select SNAPPY help Btrfs is a new filesystem with extents, writable snapshotting, support for multiple devices and many more features. diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index 40e6ac0..7cd86e7 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -7,6 +7,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \ extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o \ - compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o + compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ + snappy.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 8ec5d86..b171858 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -729,6 +729,7 @@ static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]; struct btrfs_compress_op *btrfs_compress_op[] = { btrfs_zlib_compress, btrfs_lzo_compress, + btrfs_snappy_compress, }; int __init btrfs_init_compress(void) diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index a12059f..971a425 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -79,5 +79,6 @@ struct btrfs_compress_op { extern struct btrfs_compress_op btrfs_zlib_compress; extern struct btrfs_compress_op btrfs_lzo_compress; +extern struct
[PATCH 3/3] Add snappy interface to crypto API
From: Andi Kleen a...@linux.intel.com Mainly so that ubifs can use it. Snappy is a better compressor in the same niche as LZO. Only lightly tested so far. Experiences welcome. Cc: herb...@gondor.apana.org.au Cc: dedeki...@gmail.com Cc: adrian.hun...@intel.com Signed-off-by: Andi Kleen a...@linux.intel.com --- crypto/Kconfig |9 + crypto/Makefile |1 + crypto/snappy.c | 99 +++ 3 files changed, 109 insertions(+), 0 deletions(-) create mode 100644 crypto/snappy.c diff --git a/crypto/Kconfig b/crypto/Kconfig index ae27b753..2c85991 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -823,6 +823,15 @@ config CRYPTO_LZO help This is the LZO algorithm. +config CRYPTO_SNAPPY + tristate Snappy compression algorithm + select CRYPTO_ALGAPI + select SNAPPY + help + snappy is a faster alternative to the lzo compression algorithm + with comparable compression. It is very fast on 64bit systems, but also + good on 32bit systems. It especially excels at already compressed data. + comment Random Number Generation config CRYPTO_ANSI_CPRNG diff --git a/crypto/Makefile b/crypto/Makefile index ce5a813..cf92f6f 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -80,6 +80,7 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o obj-$(CONFIG_CRYPTO_CRC32C) += crc32c.o obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o obj-$(CONFIG_CRYPTO_LZO) += lzo.o +obj-$(CONFIG_CRYPTO_SNAPPY) += snappy.o obj-$(CONFIG_CRYPTO_RNG2) += rng.o obj-$(CONFIG_CRYPTO_RNG2) += krng.o obj-$(CONFIG_CRYPTO_ANSI_CPRNG) += ansi_cprng.o diff --git a/crypto/snappy.c b/crypto/snappy.c new file mode 100644 index 000..2f44d30 --- /dev/null +++ b/crypto/snappy.c @@ -0,0 +1,99 @@ +/* + * Cryptographic API. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published by + * the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 51 + * Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + * + */ + +#include linux/init.h +#include linux/module.h +#include linux/crypto.h +#include linux/vmalloc.h +#include linux/snappy.h + +struct snappy_ctx { + struct snappy_env env; +}; + +/* Only needed for compression actually */ +static int snappy_init(struct crypto_tfm *tfm) +{ + struct snappy_ctx *ctx = crypto_tfm_ctx(tfm); + + return snappy_init_env(ctx-env); +} + +static void snappy_exit(struct crypto_tfm *tfm) +{ + struct snappy_ctx *ctx = crypto_tfm_ctx(tfm); + + snappy_free_env(ctx-env); +} + +static int snp_compress(struct crypto_tfm *tfm, const u8 *src, + unsigned int slen, u8 *dst, unsigned int *dlen) +{ + struct snappy_ctx *ctx = crypto_tfm_ctx(tfm); + size_t olen; + int err; + + /* very pessimistic. check in snappy? */ + if (*dlen snappy_max_compressed_length(*dlen)) + return -EINVAL; + err = snappy_compress(ctx-env, src, slen, dst, olen); + *dlen = olen; + return err; +} + +static int snp_decompress(struct crypto_tfm *tfm, const u8 *src, + unsigned int slen, u8 *dst, unsigned int *dlen) +{ + size_t ulen; + + if (!snappy_uncompressed_length(src, slen, ulen)) + return -EIO; + if (*dlen ulen) + return -EINVAL; + *dlen = ulen; + return snappy_uncompress(src, slen, dst) ? 0 : -EIO; +} + +static struct crypto_alg alg = { + .cra_name = snappy, + .cra_flags = CRYPTO_ALG_TYPE_COMPRESS, + .cra_ctxsize= sizeof(struct snappy_ctx), + .cra_module = THIS_MODULE, + .cra_list = LIST_HEAD_INIT(alg.cra_list), + .cra_init = snappy_init, + .cra_exit = snappy_exit, + .cra_u = { .compress = { + .coa_compress = snp_compress, + .coa_decompress = snp_decompress } } +}; + +static int __init snappy_mod_init(void) +{ + return crypto_register_alg(alg); +} + +static void __exit snappy_mod_fini(void) +{ + crypto_unregister_alg(alg); +} + +module_init(snappy_mod_init); +module_exit(snappy_mod_fini); + +MODULE_LICENSE(GPL); +MODULE_DESCRIPTION(Snappy Compression Algorithm); -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org
Re: Honest timeline for btrfsck
Jeff Putney jeffrey.put...@gmail.com writes: http://en.wikipedia.org/wiki/Release_early,_release_often Well the other principle in free software you're forgetting is: It will be released when it's ready If you don't like Chris' ways to do releases you're free to write something on your own or pay someone to do so. Otherwise you just have to deal with his time frames, as shifty as they may be. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v0 07/18] btrfs: generic data structure to build unique lists
Arne Jansen sensi...@gmx.net writes: ulist is a generic data structures to hold a collection of unique u64 values. The only operations it supports is adding to the list and enumerating it. It is possible to store an auxiliary value along with the key. The implementation is preliminary and can probably be sped up significantly. It is used by subvolume quota to translate recursions into iterative loops. Hmm, sounds like a job for lib/idr.c What do your ulists do that idr doesn't? Ok idr doesn't have merge, but that should be simple enough to add. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v0 03/18] btrfs: add nested locking mode for paths
Arne Jansen sensi...@gmx.net writes: This patch adds the possibilty to read-lock an extent even if it is already write-locked from the same thread. Subvolume quota needs this capability. Recursive locking is generally strongly discouraged, it causes all kinds of problems and tends to eventuall ylead to locking hierarchies nobody can understand anymore. If you can find any other way to solve this problem I would encourage you to do so. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v0 17/18] btrfs: add qgroup ioctls
Arne Jansen sensi...@gmx.net writes: + + if (copy_to_user(arg, sa, sizeof(*sa))) + ret = -EFAULT; + + if (trans) { + err = btrfs_commit_transaction(trans, root); + if (err !ret) + ret = err; + } It would seem safer to put the copy to user outside the transaction. A cto can in principle cause new writes (e.g. if it causes COW), so you may end up with nested transactions. Even if that works somehow (not sure) it seems to be a thing better avoided. + + sa = memdup_user(arg, sizeof(*sa)); + if (IS_ERR(sa)) + return PTR_ERR(sa); + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + goto out; + } This code seems to be duplicated a lot. Can it be consolidated? -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v0 03/18] btrfs: add nested locking mode for paths
On Fri, Oct 07, 2011 at 12:44:30AM +0400, Andrey Kuzmin wrote: Perhaps you could just elaborate on needs this feature? In general, write lock gives one exclusive access, so the need for additional read (non-exclusive) lock does not appear easily understandable. Usually it's because the low level code can be called both with and without locking and it doesn't know. But that usually can be avoided with some restructuring. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] BTRFS: Fix lseek return value for error
On Mon, Sep 19, 2011 at 03:30:02PM -0400, Chris Mason wrote: Excerpts from Andi Kleen's message of 2011-09-19 13:52:03 -0400: Thanks everyone, I've put Jeff's last version of this in my queue. Can you post the version you merged? The previous ones all had issues. https://github.com/chrismason/linux/commit/48802c8ae2a9d618ec734a61283d645ad527e06c This was the last one sent, I thought it combined all the fixes. Ok looks good, but it will be all obsolete once my patchkit lseek get in (except for the SEEK_DATA/HOLE hunk) -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] BTRFS: Fix lseek return value for error
with an additional improvement if the offset is larger or equal to the file size, return -ENXIO in directly: if (offset = inode-i_size) { mutex_unlock(inode-i_mutex); return -ENXIO; } Except that is wrong, because it would then be impossible to write sparse files. And also i_size must be always read with i_size_read() Anyways clearly there's a problem in btrfs land with merging fixes in time. Is anyone collecting patches while Chris is gone? -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] BTRFS: Fix lseek return value for error
On Fri, Sep 16, 2011 at 11:48:15AM -0400, Christoph Hellwig wrote: On Thu, Sep 15, 2011 at 04:06:47PM -0700, Andi Kleen wrote: From: Andi Kleen a...@linux.intel.com Introduced by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c I think this should go to Chris/Linus ASAP. But a slightly better patch description wouldn't hurt either. Josef acked it earlier, but with a missing Chris the btrfs merge pipeline seems to be disfunctional at the moment. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] BTRFS: Free inode mutex on lseek error
On Sat, Aug 20, 2011 at 08:08:25AM -0400, Josef Bacik wrote: On 08/19/2011 08:07 PM, Andi Kleen wrote: From: Andi Kleen a...@linux.intel.com Introduced with b26751575a9aa55fd6dbf3febde3ff06dfadc44f This has already been fixed by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c It's not. That patch returns the wrong return value. -Andi --- From 3c3bb2aa661c2b3405c53ac72e23fc663aea794f Mon Sep 17 00:00:00 2001 From: Andi Kleen a...@linux.intel.com Date: Sat, 20 Aug 2011 05:49:47 -0700 Subject: [PATCH] BTRFS: Fix lseek return value for error Introduced by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c Signed-off-by: Andi Kleen a...@linux.intel.com diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index e7872e4..c6e493f 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1814,19 +1814,17 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin) case SEEK_DATA: case SEEK_HOLE: ret = find_desired_extent(inode, offset, origin); - if (ret) { - mutex_unlock(inode-i_mutex); - return ret; - } + if (ret) + goto error; } if (offset 0 !(file-f_mode FMODE_UNSIGNED_OFFSET)) { ret = -EINVAL; - goto out; + goto error; } if (offset inode-i_sb-s_maxbytes) { ret = -EINVAL; - goto out; + goto error; } /* Special lock needed here? */ @@ -1837,6 +1835,9 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin) out: mutex_unlock(inode-i_mutex); return offset; +error: + mutex_unlock(inode-i_mutex); + return ret; } const struct file_operations btrfs_file_operations = { -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] BTRFS: Free inode mutex on lseek error
From: Andi Kleen a...@linux.intel.com Introduced with b26751575a9aa55fd6dbf3febde3ff06dfadc44f Cc: jo...@redhat.com Cc: chris.ma...@oracle.com Signed-off-by: Andi Kleen a...@linux.intel.com --- fs/btrfs/file.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 658d669..8791613 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1798,16 +1798,15 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin) case SEEK_DATA: case SEEK_HOLE: ret = find_desired_extent(inode, offset, origin); - if (ret) { - mutex_unlock(inode-i_mutex); - return ret; - } + if (ret) + goto error; } + ret = -EINVAL; if (offset 0 !(file-f_mode FMODE_UNSIGNED_OFFSET)) - return -EINVAL; + goto error; if (offset inode-i_sb-s_maxbytes) - return -EINVAL; + goto error; /* Special lock needed here? */ if (offset != file-f_pos) { @@ -1817,6 +1816,9 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin) out: mutex_unlock(inode-i_mutex); return offset; +error: + mutex_unlock(inode-i_mutex); + return ret; } const struct file_operations btrfs_file_operations = { -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 4/4] btrfs: Moved repair code from inode.c to extent_io.c
Jan Schmidt list.bt...@jan-o-sch.net writes: Repair works that way: Whenever a read error occurs and we have more mirrors to try, note the failed mirror, and retry another. If we find a good one, check if we did note a failure earlier and if so, do not allow the read to complete until after the bad sector was written with the good data we just fetched. As we have the extent locked while reading, no one can change the data in between. This has the potential for error loops: when the write fails too you get another error in the log and can flood the log etc. I assume this could get really noisy if that disk completely went away. Perhaps it needs a threshold to see if there aren't too many errors on the mirror and then stop retrying at some point. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 4/4] btrfs: Moved repair code from inode.c to extent_io.c
I wasn't clear enough on that: We only track read errors, here. Ans error correction can only happen on the read path. So if the write attempt fails, we can't go into a loop. Not in a loop, but you trigger more IO errors, which can be nasty if the IO error logging triggers more IO (pretty common because syslogd calls fsync). And then your code does even more IO, floods more etc.etc. And the user will be unhappy if their console gets flooded. We've have a similar problems in the past with readahead causing error flooding. Any time where an error can cause more IO you have to be extremly careful. Right now this seems rather risky to me. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 8/8] btrfs: new ioctls to do logical-inode and inode-path resolving
Jan Schmidt list.bt...@jan-o-sch.net writes: + +static long btrfs_ioctl_logical_to_ino(struct btrfs_root *root, + void __user *arg) +{ + int ret = 0; + int size; + u64 extent_offset; + struct btrfs_ioctl_logical_ino_args *loi; + struct btrfs_data_container *inodes = NULL; + struct btrfs_path *path = NULL; + struct btrfs_key key; This really needs to be root-only for obvious reasons. The same for the ino_path function + + loi = memdup_user(arg, sizeof(*loi)); + if (IS_ERR(loi)) { + ret = PTR_ERR(loi); + loi = NULL; + goto out; + } + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + size = min(loi-size, 4096); This is likely a root hole. loi-size is signed! Consider the case of a negative value being passed in. Same for the earlier function. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: initial online fsck support
The two big features of btrfs are self-healing and online fsck, those have to Are they? be implemented in kernel space. Why? There have been online fscks in user space in the past, e.g. the various schemes using LVM snapshots for ext* and other related work on the BSD FFS. I don't see any principal reason why it couldn't be done for btrfs either. A good fsck is quite complex and you are unlikely to want all that code in kernel space. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: initial online fsck support
Li Zefan l...@cn.fujitsu.com writes: This is an initial version of online fsck. What it does is: - check the dir item and dir index pointing to a file. - check the structure of extents of a file. As furthur work, we should consider: - fix but not only check the structure of a file. - verify the extent allocation tree on the fly. It's scary to have a fsck in kernel space. Is there no way to do this from user space? -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: save preloaded extent_state's in a percpu cache V2
Josef Bacik jo...@redhat.com writes: When doing DIO tracing I noticed we were doing a ton of allocations, a lot of the time for extent_states. Some of the time we don't even use the prealloc'ed extent_state, it just get's free'd up. So instead create a per-cpu cache like the radix tree stuff. So we will check to see if our per-cpu cache has a prealloc'ed extent_state in it and if so we just continue, else we alloc a new one and fill the cache. Then if we need to use a prealloc'ed extent_state we can just take it out of our per-cpu cache. We will also refill the cache on free to try and limit the number of times we have to ask the allocator for caches. With this patch dbench 50 goes from ~210 mb/s to ~260 mb/s. Thanks, You're just reimplementing a poor man's custom slab cache -- all of this is already done in slab. If the difference is really that big better fix slab and have everyone benefit? Did you use slub or slab? Did you analyze where the cycles are spent? -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] Btrfs updates
Chris Mason chris.ma...@oracle.com writes: Hi everyone, The for-linus branch of the btrfs unstable tree: git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable.git for-linus Has our current queue of fixes. Josef's is the biggest pile, mostly in the allocator. Josef and I both managed to merge his patch to avoid mapping the extent buffer if skip_locking was set, git merge is just a little too easy sometimes (I double checked the resulting code). The new in 3.0 btrfs warnings on every build are still there: fs/btrfs/sysfs.c:76: warning: ‘btrfs_root_attrs’ defined but not used fs/btrfs/sysfs.c:97: warning: ‘btrfs_super_attrs’ defined but not used fs/btrfs/sysfs.c:153: warning: ‘btrfs_super_release’ defined but not used fs/btrfs/sysfs.c:160: warning: ‘btrfs_root_release’ defined but not used These are not even used inside any ifdef. It's unclear to me: were these supposed to be used or removed? Probably better to remove since they seem to be untested, unless it was a merge error? -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG 3.0-rc1] oops during file removal, severe lock contention
Dave Chinner da...@fromorbit.com writes: Also, there is massive lock contention while running these workloads. perf top shows this for the create after about 5m inodes have been created: We saw pretty much the same thing in some simple tests on large systems (extent io tree locking and higher level b*tree locks are a problem) It is being looked at I believe. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Btrfs: load the key from the dir item in readdir into a fake dentry
On Thu, May 26, 2011 at 03:02:42PM -0400, Josef Bacik wrote: + /* + * If this dentry needs lookup, don't set the referenced flag so that it + * is more likely to be cleaned up by the dcache shrinker in case of + * memory pressure. + */ + if (!d_need_lookup(dentry)) + dentry-d_flags |= DCACHE_REFERENCED; No it doesn't at all. The allocation will just push everything else out. Really you cannot view this by only looking at the dcache. You have to look at the complete VM behaviour. All the caches and the other memory interact. d_alloc uses a normal GFP_KERNEL, which is quite in appropiate for this. It should at least reclaim and probably more, but even then it's risky. Ah yeah I guess I should have probably used GFP_KERNEL. Sorry about that, GFP_KERNEL is already used, but it's wrong. I'm not sure any of the existing GFP_* flags will give you the semantics you need in fact. The new flag Minchan added for readahead may come near, but even that is probably not enough. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags
Putting them at the end of the cache LRU instead of the head would allow them to be dropped quickly under memory pressure. This still would fill up your memory for find /, potentially pushing out other stuff. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags
On Fri, May 20, 2011 at 08:30:19PM -0400, Josef Bacik wrote: On 05/20/2011 05:31 PM, Andi Kleen wrote: Putting them at the end of the cache LRU instead of the head would allow them to be dropped quickly under memory pressure. This still would fill up your memory for find /, potentially pushing out other stuff. -Andi So these things are just hashed on dput, so they don't have any references to them and they are automatically put on the LRU list, so if we get under memory pressure they will be easily discarded, especially if nobody is actually stating them. Thanks, They are allocated. The allocation will push out other things too. There's no mechanism to only push dentries when allocating other dentries, or limit the total consumption from the dcache. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] BTRFS: Remove unused node_lock
From: Andi Kleen a...@linux.intel.com 240f62c8756 replaced the node_lock with rcu_read_lock, but forgot to remove the actual lock in the data structure. Remove it here. Signed-off-by: Andi Kleen a...@linux.intel.com --- fs/btrfs/ctree.h |3 --- fs/btrfs/disk-io.c |1 - 2 files changed, 0 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8f4b81d..f290b98 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1088,9 +1088,6 @@ struct btrfs_fs_info { struct btrfs_root { struct extent_buffer *node; - /* the node lock is held while changing the node pointer */ - spinlock_t node_lock; - struct extent_buffer *commit_root; struct btrfs_root *log_root; struct btrfs_root *reloc_root; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 228cf36..64b2896 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1064,7 +1064,6 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize, INIT_LIST_HEAD(root-dirty_list); INIT_LIST_HEAD(root-orphan_list); INIT_LIST_HEAD(root-root_list); - spin_lock_init(root-node_lock); spin_lock_init(root-orphan_lock); spin_lock_init(root-inode_lock); spin_lock_init(root-accounting_lock); -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: stratified B-trees
Karn Kallio tierplusplusli...@gmail.com writes: Are these stratified B-trees something which the btrfs project could use? The current b*tree is pretty much hardcoded in the disk format, so it would be hard to change in a compatible way. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 3/6] btrfs: add scrub code and prototypes
Arne Jansen sensi...@gmx.net writes: + */ + mutex_lock(fs_info-scrub_lock); + atomic_inc(fs_info-scrubs_running); + mutex_unlock(fs_info-scrub_lock); It seems odd to protect an atomic_inc with a mutex. Is that done for some side effect? Otherwise you either don't need atomic or don't need the lock. That seems to be all over the source file. +int btrfs_scrub_pause(struct btrfs_root *root) +{ + struct btrfs_fs_info *fs_info = root-fs_info; + mutex_lock(fs_info-scrub_lock); As I understand it you take that mutex on every transaction commit, which is a fast path for normal IO. For me that looks like a scalability problem with enough cores. Did you do any performance testing of this on a system with a reasonable number of cores? btrfs already has enough scalability problems, please don't add new ones. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)
I have a question though: the deactivation of multiple page-io submission support most likely only would affect bigger systems or also desktop systems (like mine) ? I think this is not a final fix, just a workaround. The problem with the other path still really needs to be tracked down. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)
512MB. 'free' reports 75MB, 419MB free. I originally noticed the problem on really real hardware (thinkpad T61p), however. If you can easily reproduce it could you try a git bisect? -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [dm-devel] hunt for 2.6.37 dm-crypt+ext4 corruption?
I've been using a kernel which is between 2.6.37-rc2 and -rc3 with a LUKS / dm-crypt / LVM / ext4 setup for my primary file systems, and I haven't observed any corruption for the last two weeks or so. It's on my todo list to upgrade to top of Linus's tree, but perhaps this is a useful data point. The reported corruptions are with a .37 forward port someone did of my parallel kcryptd patch. The 2.6.36 (and earlier) versions are rock solid with many users. The classic dmcrypt is single threaded and very slow. As another thought, what version of GCC are people using who are having difficulty? Could this perhaps be a compiler-related issue? A compiler problem seems very unlikely here. What may be an useful experiment would be to test 2.6.37-rc + ext4 over device mapper, but not dmcrypt. If that fails too then it's likely some generic device mapper problem. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem
Kay Sievers kay.siev...@vrfy.org writes: Yeah, we thought about that too, but a btrfs mount does not show up as a block device, like md/dm, so there is no place for a slaves/ directory in /sys with the individual disks listed. How could be solve that? Create some fake blockdev for every btrfs mount, but that can't be used to read/write raw blocks? You could simply create a new class for btrfs? (or maybe a generic compound class) -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 16 Sep 2010 15:16:31 +0800 Miao Xie mi...@cn.fujitsu.com wrote: On Thu, 16 Sep 2010 08:48:25 +0200 (cest), Andi Kleen wrote: When the dest and the src do overlap and the memory area is large, memmove of x86_64 is very inefficient, and it led to bad performance, such as btrfs's file deletion performance. This patch improved the performance of memmove on x86_64 by using __memcpy_bwd() instead of byte copy when doing large memory area copy (len 64). I still don't understand why you don't simply use a backwards string copy (with std) ? That should be much simpler and hopefully be as optimized for kernel copies on recent CPUs. But according to the comment of memcpy, some CPUs don't support REP instruction, I think you misread the comment. REP prefixes are in all x86 CPUs. On some very old systems it wasn't optimized very well, but it probably doesn't make too much sense to optimize for those. On newer CPUs in fact REP should be usually faster than an unrolled loop. I'm not sure how optimized the backwards copy is, but most likely it is optimized too. Here's an untested patch that implements backwards copy with string instructions. Could you run it through your test harness? -Andi Implement the 64bit memmmove backwards case using string instructions Signed-off-by: Andi Kleen a...@linux.intel.com diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S index bcbcd1e..6e8258d 100644 --- a/arch/x86/lib/memcpy_64.S +++ b/arch/x86/lib/memcpy_64.S @@ -141,3 +141,28 @@ ENDPROC(__memcpy) .byte .Lmemcpy_e - .Lmemcpy_c .byte .Lmemcpy_e - .Lmemcpy_c .previous + +/* + * Copy memory backwards (for memmove) + * rdi target + * rsi source + * rdx count + */ + +ENTRY(memcpy_backwards): + CFI_STARTPROC + std + movq %rdi, %rax + movl %edx, %ecx + add %rdx, %rdi + add %rdx, %rsi + shrl $3, %ecx + andl $7, %edx + rep movsq + movl %edx, %ecx + rep movsb + cld + ret + CFI_ENDPROC +ENDPROC(memcpy_backwards) + diff --git a/arch/x86/lib/memmove_64.c b/arch/x86/lib/memmove_64.c index 0a33909..6c00304 100644 --- a/arch/x86/lib/memmove_64.c +++ b/arch/x86/lib/memmove_64.c @@ -5,16 +5,16 @@ #include linux/string.h #include linux/module.h +extern void asmlinkage memcpy_backwards(void *dst, const void *src, + size_t count); + #undef memmove void *memmove(void *dest, const void *src, size_t count) { if (dest src) { return memcpy(dest, src, count); } else { - char *p = dest + count; - const char *s = src + count; - while (count--) - *--p = *--s; + return memcpy_backwards(dest, src, count); } return dest; } -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
According to the data, the length of the most copies is =128. Thanks for the data. Large is easier to optimize than small, that's good. Could you also measure how many memsets need the backwards copy? (should be easy to add) If the number is small that needs backwards then the easiest fix would be to simply call the normal memcpy in the forward case. That is for backward could also use a string instruction copy of course, just have to set the direction flag. That would be a very small code change. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Wed, 08 Sep 2010 20:57:07 +0800 Miao Xie mi...@cn.fujitsu.com wrote: On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote: According to the data, the length of the most copies is=128. Thanks for the data. Large is easier to optimize than small, that's good. Could you also measure how many memsets need the backwards copy? (should be easy to add) I think memset doesn't need the backwards copy. I meant for memmove of course. Obviously memset doesn't need a backwards copy. That was just the only thing the script didn't measure because the original version didn't have memmove support. Your whole thread was about making memmove faster, right? -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
Miao Xie mi...@cn.fujitsu.com writes: Changes from V1 to V2: - change the version of GPL from version 2.1 to version 2 the kernel's memcpy and memmove is very inefficient. But the glibc version is quite fast, in some cases it is 10 times faster than the kernel version. So I Can you elaborate on which CPUs and with what workloads you measured that? The kernel memcpy is optimized for copies smaller than a page size for example (kernel very rarely does anything on larger than 4k), the glibc isn't. etc. There are various other differences. memcpy and memmove are very different. AFAIK noone has tried to optimize memmove() before because traditionally it wasn't used for anything performance critical in the kernel. Has that that changed? memcpy on the other hand while not perfect is actually quite optimized for typical workloads. One big difference between the kernel and glibc is that kernel is often cache cold, so you e.g. the cost of a very large code footprint memcpy/memset is harder to amortize. Microbenchmarks often leave out that crucial variable. I have some systemtap scripts to measure size/alignment distributions of copies on a kernel, if you have a particular workload you're interested in those could be tried. Just copying the glibc bloat uncritical is very likely the wrong move at least. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Poor read performance on high-end server
Jens Axboe ax...@kernel.dk writes: Also, I didn't see Chris mention this, but if you have a newer intel box you can use hw accellerated crc32c instead. For some reason my test box always loads crc32c and not crc32c-intel, so I need to do that manually. I have a patch for that, will post it later: autoloading of modules based on x86 cpuinfo. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design
Daniel Taylor daniel.tay...@wdc.com writes: As long as no object smaller than the disk block size is ever flushed to media, and all flushed objects are aligned to the disk blocks, there should be no real performance hit from that. The question is just how large such a block needs to be. Traditionally some RAID controllers (and possibly some SSDs now) needed very large blocks upto MBs. Otherwise we end up with the damage for the ext[234] family, where the file blocks can be aligned, but the 1K inode updates cause the read-modify-write (RMW) cycles and and cost 10% performance hit for creation/update of large numbers of files. Fixing that doesn't require a new file system layout, just some effort to read/write inodes in batches of multiple of them. XFS did similar things for a long time, I believe there were some efforts for this for ext4 too. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [13/23] BTRFS: Clean up unused variables -- nonbugs
These are all the cases where a variable is set, but not read which are not bugs as far as I can see, but simply leftovers. Still needs more review. Found by gcc 4.6's new warnings Cc: chris.ma...@oracle.com Cc: linux-btrfs@vger.kernel.org fs/btrfs/compression.c|2 -- fs/btrfs/ctree.c | 20 ++-- fs/btrfs/dir-item.c |2 +- fs/btrfs/disk-io.c| 10 -- fs/btrfs/extent-tree.c|2 -- fs/btrfs/extent_io.c | 10 ++ fs/btrfs/inode.c | 19 +++ fs/btrfs/ioctl.c |2 -- fs/btrfs/ordered-data.c |2 -- fs/btrfs/relocation.c |2 ++ fs/btrfs/root-tree.c |2 -- fs/btrfs/super.c |1 + fs/btrfs/tree-defrag.c|2 -- fs/btrfs/tree-log.c | 15 +-- fs/btrfs/volumes.c|4 fs/btrfs/xattr.c |2 -- fs/btrfs/zlib.c |5 - Signed-off-by: Andi Kleen a...@linux.intel.com --- fs/btrfs/compression.c |2 -- fs/btrfs/ctree.c| 20 ++-- fs/btrfs/disk-io.c | 11 --- fs/btrfs/extent-tree.c |2 -- fs/btrfs/extent_io.c|9 - fs/btrfs/inode.c| 14 -- fs/btrfs/ioctl.c|2 -- fs/btrfs/ordered-data.c |2 -- fs/btrfs/root-tree.c|2 -- fs/btrfs/super.c|6 ++ fs/btrfs/tree-defrag.c |2 -- fs/btrfs/tree-log.c | 15 --- fs/btrfs/volumes.c |4 fs/btrfs/xattr.c|2 -- fs/btrfs/zlib.c |5 - 15 files changed, 4 insertions(+), 94 deletions(-) Index: linux-2.6.35-rc2-gcc/fs/btrfs/ctree.c === --- linux-2.6.35-rc2-gcc.orig/fs/btrfs/ctree.c +++ linux-2.6.35-rc2-gcc/fs/btrfs/ctree.c @@ -200,7 +200,6 @@ int btrfs_copy_root(struct btrfs_trans_h struct extent_buffer **cow_ret, u64 new_root_objectid) { struct extent_buffer *cow; - u32 nritems; int ret = 0; int level; struct btrfs_disk_key disk_key; @@ -210,7 +209,6 @@ int btrfs_copy_root(struct btrfs_trans_h WARN_ON(root-ref_cows trans-transid != root-last_trans); level = btrfs_header_level(buf); - nritems = btrfs_header_nritems(buf); if (level == 0) btrfs_item_key(buf, disk_key, 0); else @@ -1008,7 +1006,6 @@ static noinline int balance_level(struct int wret; int pslot; int orig_slot = path-slots[level]; - int err_on_enospc = 0; u64 orig_ptr; if (level == 0) @@ -1071,8 +1068,7 @@ static noinline int balance_level(struct BTRFS_NODEPTRS_PER_BLOCK(root) / 4) return 0; - if (btrfs_header_nritems(mid) 2) - err_on_enospc = 1; + btrfs_header_nritems(mid); left = read_node_slot(root, parent, pslot - 1); if (left) { @@ -1103,8 +1099,7 @@ static noinline int balance_level(struct wret = push_node_left(trans, root, left, mid, 1); if (wret 0) ret = wret; - if (btrfs_header_nritems(mid) 2) - err_on_enospc = 1; + btrfs_header_nritems(mid); } /* @@ -1224,14 +1219,12 @@ static noinline int push_nodes_for_inser int wret; int pslot; int orig_slot = path-slots[level]; - u64 orig_ptr; if (level == 0) return 1; mid = path-nodes[level]; WARN_ON(btrfs_header_generation(mid) != trans-transid); - orig_ptr = btrfs_node_blockptr(mid, orig_slot); if (level BTRFS_MAX_LEVEL - 1) parent = path-nodes[level + 1]; @@ -2534,7 +2527,6 @@ static noinline int __push_leaf_left(str { struct btrfs_disk_key disk_key; struct extent_buffer *right = path-nodes[0]; - int slot; int i; int push_space = 0; int push_items = 0; @@ -2546,8 +2538,6 @@ static noinline int __push_leaf_left(str u32 this_item_size; u32 old_left_item_size; - slot = path-slots[1]; - if (empty) nr = right_nritems; else @@ -3239,7 +3229,6 @@ int btrfs_truncate_item(struct btrfs_tra { int ret = 0; int slot; - int slot_orig; struct extent_buffer *leaf; struct btrfs_item *item; u32 nritems; @@ -3249,7 +3238,6 @@ int btrfs_truncate_item(struct btrfs_tra unsigned int size_diff; int i; - slot_orig = path-slots[0]; leaf = path-nodes[0]; slot = path-slots[0]; @@ -3354,7 +3342,6 @@ int btrfs_extend_item(struct btrfs_trans { int ret = 0; int slot; - int slot_orig; struct extent_buffer *leaf
[PATCH] [12/23] BTRFS: Clean up unused variables -- bugs
These are all the cases where a variable is set, but not read which are really bugs. - Couple of incorrect error handling fixed. - One incorrect use of a allocation policy - Some other things Still needs more review. Found by gcc 4.6's new warnings Cc: chris.ma...@oracle.com cc: linux-btrfs@vger.kernel.org Signed-off-by: Andi Kleen a...@linux.intel.com --- fs/btrfs/dir-item.c|2 +- fs/btrfs/extent-tree.c |3 +-- fs/btrfs/extent_io.c |2 ++ fs/btrfs/inode.c |6 +++--- fs/btrfs/relocation.c |4 +++- fs/btrfs/tree-log.c|2 +- 6 files changed, 11 insertions(+), 8 deletions(-) Index: linux-2.6.35-rc2-gcc/fs/btrfs/extent-tree.c === --- linux-2.6.35-rc2-gcc.orig/fs/btrfs/extent-tree.c +++ linux-2.6.35-rc2-gcc/fs/btrfs/extent-tree.c @@ -3337,8 +3337,7 @@ struct btrfs_block_rsv *btrfs_alloc_bloc btrfs_init_block_rsv(block_rsv); alloc_target = btrfs_get_alloc_profile(root, 0); - block_rsv-space_info = __find_space_info(fs_info, - BTRFS_BLOCK_GROUP_METADATA); + block_rsv-space_info = __find_space_info(fs_info, alloc_target); return block_rsv; } Index: linux-2.6.35-rc2-gcc/fs/btrfs/dir-item.c === --- linux-2.6.35-rc2-gcc.orig/fs/btrfs/dir-item.c +++ linux-2.6.35-rc2-gcc/fs/btrfs/dir-item.c @@ -427,5 +427,5 @@ int btrfs_delete_one_dir_name(struct btr ret = btrfs_truncate_item(trans, root, path, item_len - sub_item_len, 1); } - return 0; + return ret; } Index: linux-2.6.35-rc2-gcc/fs/btrfs/extent_io.c === --- linux-2.6.35-rc2-gcc.orig/fs/btrfs/extent_io.c +++ linux-2.6.35-rc2-gcc/fs/btrfs/extent_io.c @@ -2825,6 +2825,8 @@ int extent_prepare_write(struct extent_i NULL, 1, end_bio_extent_preparewrite, 0, 0, 0); + if (ret !err) + err = ret; iocount++; block_start = block_start + iosize; } else { Index: linux-2.6.35-rc2-gcc/fs/btrfs/inode.c === --- linux-2.6.35-rc2-gcc.orig/fs/btrfs/inode.c +++ linux-2.6.35-rc2-gcc/fs/btrfs/inode.c @@ -1372,7 +1372,7 @@ int btrfs_merge_bio_hook(struct page *pa if (map_length length + size) return 1; - return 0; + return ret; } /* @@ -2672,7 +2672,7 @@ static int check_path_shared(struct btrf { struct extent_buffer *eb; int level; - int ret; + int ret = 0; u64 refs; for (level = 0; level BTRFS_MAX_LEVEL; level++) { @@ -2686,7 +2686,7 @@ static int check_path_shared(struct btrf if (refs 1) return 1; } - return 0; + return ret; /* XXX callers? */ } /* Index: linux-2.6.35-rc2-gcc/fs/btrfs/tree-log.c === --- linux-2.6.35-rc2-gcc.orig/fs/btrfs/tree-log.c +++ linux-2.6.35-rc2-gcc/fs/btrfs/tree-log.c @@ -2273,7 +2273,7 @@ fail: } btrfs_end_log_trans(root); - return 0; + return err; } /* see comments for btrfs_del_dir_entries_in_log */ Index: linux-2.6.35-rc2-gcc/fs/btrfs/relocation.c === --- linux-2.6.35-rc2-gcc.orig/fs/btrfs/relocation.c +++ linux-2.6.35-rc2-gcc/fs/btrfs/relocation.c @@ -3098,6 +3098,8 @@ static int add_tree_block(struct reloc_c BUG_ON(item_size != sizeof(struct btrfs_extent_item_v0)); ret = get_ref_objectid_v0(rc, path, extent_key, ref_owner, NULL); + if (ret 0) + return ret; BUG_ON(ref_owner = BTRFS_MAX_LEVEL); level = (int)ref_owner; /* FIXME: get real generation */ @@ -4142,7 +4144,7 @@ int btrfs_reloc_clone_csums(struct inode btrfs_add_ordered_sum(inode, ordered, sums); } btrfs_put_ordered_extent(ordered); - return 0; + return ret; } void btrfs_reloc_cow_block(struct btrfs_trans_handle *trans, -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Fwd: Re: Linking two files together][RFC]
Roberto Ragusa m...@robertoragusa.it writes: I hope that ideas about btrfs are not off-topic for this mailing list. The forwarded message below was written by me on fedora-users. The thread is about the ability to link two files in a manner similar to cat 1 2 3 rm 1 2 while avoiding any data movement on the disk. OCFS2 can do this today with reflinks -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: check alloc return value before use handle and struct
Steven Liu lingjiujia...@gmail.com writes: diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c index e9103b3..230a131 100644 --- a/fs/btrfs/dir-item.c +++ b/fs/btrfs/dir-item.c @@ -142,6 +142,8 @@ int btrfs_insert_dir_item(struct btrfs_trans_handle *trans, struct btrfs_root key.offset = btrfs_name_hash(name, name_len); path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; The big problem is handling state unwind in all the callers, not adding it to the low level code. I attempted it some time ago but it's hard. Just spewing BUG_ON() all over on memory allocation failure is not helpful imho, that's not better than simply having clean NULL pointer faults. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Poor interactive performance with I/O loads with fsync()ing
Has the reason for this been identified? Judging from the nature of metadata loads, it would seem that it should be substantially easier to implement fsync() efficiently. By design a copy on write tree fs would need to flush a whole tree hierarchy on a sync. btrfs avoids this by using a special log for fsync, but that causes more overhead if you have that log on the same disk. So IO subsystem will do more work. It's a bit like JBD data journaling. However it should not have the stalls inherent in ext3's journaling. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 17/18] Btrfs: Full direct I/O and AIO read implementation.
On Wed, Mar 24, 2010 at 11:08:07PM -0400, jim owens wrote: Andi Kleen wrote: On Tue, Mar 23, 2010 at 05:40:00PM -0400, jim owens wrote: Andi Kleen wrote: With checksums enabled, uncompressed reads aligned on the 4k block are classic direct IO to user memory except at EOF. Hmm, but what happens if the user modifies the memory in parallel? Would spurious checksum failures be reported then? It does put a warning in the log but it does not fail the read because I circumvent that by doing the failed-checksum-retry as a buffered read and retest. The checksum passes and we copy the data to the user memory (where they can then trash it again). Ok. That will work I guess. I was going to put a comment about that but felt my comment density was already over the btrfs style guide limit. :) Hehe. Same for writing I guess (data end up on disk with wrong checksum)? Well we don't have any code done yet for writing and that was just one interesting challenge that needed to be solved. Those both would seem like serious flaws to me. Agree, so the write design needs to prevent bad checksums. How? Do you have a plan for that? Read is already correct and if people do not want a log warning that the application is misbehaving that can be eliminated. I guess if it's strictly rate limitted it might be ok. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 17/18] Btrfs: Full direct I/O and AIO read implementation.
On Tue, Mar 23, 2010 at 05:40:00PM -0400, jim owens wrote: Andi Kleen wrote: One thing that stroke me while reading is that, except for the out of line no data checksum case, this isn't real classical zero copy direct IO because you always have to copy through some buffer. Uh no, unless I really messed up or don't understand what you mean. No I misread the code: you don't set temp_pages in that path. Uncompressed data with no checksums only buffers on an error or EOF. With checksums enabled, uncompressed reads aligned on the 4k block are classic direct IO to user memory except at EOF. Hmm, but what happens if the user modifies the memory in parallel? Would spurious checksum failures be reported then? Same for writing I guess (data end up on disk with wrong checksum)? Those both would seem like serious flaws to me. Is there any particular reason this wasn't done? Was it because of aio? I know the page cache currently doesn't support that today, but presumably it wouldn't be too hard to add. The only reason I did not do something like that is: Ok. 1) I did not want to disturb the page cache with throw-away pages. 2) uncached IO makes it even less like classic direct IO. 3) Writing that page cache code might not be simpler. 4) aio support (although It would be cool if someone finally did proper aio page cache code) As further argument against uncached IO, Chris sent a very simple patch up to read into page cache then purge it for btrfs direct IO reads and it was NACKed. I see. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 17/18] Btrfs: Full direct I/O and AIO read implementation.
jim owens owens6...@gmail.com writes: Hi Jim, I read through large chunks of that patch. I don't claim to fully understand all the btrfs infrastructure details enough, so it wasn't really serious code review. One thing that stroke me while reading is that, except for the out of line no data checksum case, this isn't real classical zero copy direct IO because you always have to copy through some buffer. It's more like uncached IO I was wondering that at least for those cases wouldn't it be simpler to use the normal page cache IO path and use new hints that disable prefetch/write-behind/caching in the page cache after the IO operation? Is there any particular reason this wasn't done? Was it because of aio? I know the page cache currently doesn't support that today, but presumably it wouldn't be too hard to add. I guess the code would be much simpler if it only did the no checksum case. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: file/extent checksums for dedup/sync...
Daniel J Blueman daniel.blue...@gmail.com writes: For purposes of data deduplication and data synchronisation, it would be a powerful tool to expose file data checksums. Since eg BTRFS uses the crc32c algorithm [1], it's possible to compute the file's overall CRC from the accumulation of the CRCs from all it's extents' CRCs. For now, exposing this via an IOCTL may be sufficient, though any ideas for introducing it in a more standard way? (it's a pity that when stat64 was introduced, reserved fields weren't added) The problem of doing it in any standard way is that it would hard code the way the file system does checksums in the applications. So the file system could never change it without breaking user space. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 00/12] btrfs core patches for direct I/O
jim owens jow...@hp.com writes: The existing core code for pagecache doesn't work for directio because the existing I/O routines depend on peeking inside a struct page to get the valid btrfs inode info. Sorry, no can do, we don't own that page. And rewriting the whole pagecache I/O stack to eliminate dependence on struct page would be too scary IMO. That would simply need another passed argument in a few strategic places, won't it? I can't imagine it would need a full rewrite -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix possible pointer dereference
Diego Calleja dieg...@gmail.com writes: We should always check btrfs_alloc_path(). Some places BUG(), others return -ENOMEM, btrfs_insert_dir_item() seems like it can return safely. The problem is that all the callers don't handle errors. It doesn't make sense to fix it low-level currently when it cannot be handled properly higher up anyways. Proper out of memory handling needs much more work, one liners don't really help. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: potential NULL dereferences
Roel Kluin roel.kl...@gmail.com writes: Allocations may fail, prevent NULL dereferences. Signed-off-by: Roel Kluin roel.kl...@gmail.com --- In several sections of fs/btrfs code a kmalloc() occurs without a check whether it succeeded. this potentially leads to dereferences of a NULL pointer. Are there reasons why we do not check the allocations? Did I choose an incorrect way to err out? please review. Yes, the erroring out needs much more work because often the callers don't handle errors and it can need quite a lot of surgery Until that is done it's actually better to oops than to silently leak resources. BUG_ON(name == NULL) is also fairly useless because it oopses anyways in a obvious way. I had some patches to add more error handling for ENOMEM, but it's fairly complicated. Should probably resurrect my old patchkit. It still wasn't fully complete. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Christoph Hellwig h...@lst.de writes: Add support for the standard attributes set via chattr and read vis lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. The code would be nicer/shorter if you used tables for the flags and loops instead of open coding everything. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
GCC 4.3.2. Maybe i missed something obvious? The typical use case of restrict is to tell it that multiple given arrays are independent and then give the loop optimizer more freedom to handle expressions in the loop that accesses these arrays. Since there are no loops in the list functions nothing changed. Ok presumably there are some other optimizations which rely on that alias information too, but again the list_* stuff is probably too simple to trigger any of them. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
The point is that the compiler is then free to do it. If things slow down after the compiler gets *more* information, then that is a problem with the compiler heuristics rather than the information we give it. The point was the -Os disables it typically then. (not always, compiler heuristics are far from perfect) Then x86s tend to have very very fast L1 caches and if something is not in L1 on reads then the cost of fetching something for a read dwarfs the few cycles you can typically get out of this. Well most architectures have L1 caches of several cycles. And L2 miss typically means going to L2 which in some cases the compiler is expected to attempt to cover as much as possible (eg in-order architectures). L2 cache is so much slower that scheduling a few instructions more doesn't help much. stall, so you still want to get loads out early if possible. Even a lot of OOOE CPUs I think won't have the best alias anaysis, so all else being equal, it wouldn't hurt them to move loads earlier. Hmm, but if the load is nearby it won't matter if a store is in the middle, because the CPU will just execute over it. The real big win is if you do some computation inbetween, but at least for typical list manipulation there isn't really any. Also at least x86 gcc normally doesn't do scheduling beyond basic blocks, so any if () shuts it up. I don't think any of this is a reason not to use restrict, though. But... there are so many places we could add it to the kernel, and probably so few where it makes much difference. Maybe it should be able to help some critical core code, though. Frankly I think it would be another unlikely(). -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs and swap files on SSD's ?
Dmitri Nikulin dniku...@gmail.com writes: On Wed, Jan 21, 2009 at 12:02 AM, Chris Mason chris.ma...@oracle.com wrote: There are patches to support swap over NFS that might make it safe to use on btrfs. At any rate, it is a fixable problem. FreeBSD has been able to run swap over NFS for as long as I can remember, what is different in Linux that makes it especially difficult? One big traditional difference is that FreeBSD uses fixed isolated pools for their networking buffers (that is why you had to tune most systems for higher network workloads), while Linux has fully unified[1] memory management including the network stack. Now I believe recent BSD also moved to more unified network management and it wouldn't surprise me if they had trouble with this now too. [1] at least for now, there are unfortunately some tendencies to move back to fixed pools too. I've read that swap over non-trivial filesystems is hazardous as it may lead to a situation in which memory allocation can fail in the swap/FS code that was meant to make allocation possible again. A lot of this has been fixed in the 2.6 timeframe (e.g. there's now a better enforced global dirty limit), but there are likely still corner cases that could run into difficulties, so noone is really declaring it 100% safe yet. If btrfs is to take the role of a RAID and volume manager, it would certainly be very useful to be able to run swap on it, since that frees up other volumes from an administrative standpoint. The fixed extent mapping of the swap files is really a different problem, independent of the memory allocation issue. In general the memory allocation problem on write out has to be solved in any ways (even if you don't support swap files), because any dirty mmap'ed file effectively acts like a swap file. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
The problem with 'restrict' is that almost nobody uses it, and it does Also gcc traditionally didn't do a very good job using it (this might be better in the very latest versions). At least some of the 3.x often discarded this information. automatically. But it should work well as a way to get Fortran-like performance from HPC workloads written in C - which is where most of the people are who really want the alias analysis. It's more than just HPC -- a lot of code has critical loops. it seems like a nice opt-in thing that can be used where the aliases are verified and the code is particularly performance critical... Yes. I think we could use it in the kernel, although I'm not sure how many cases we would ever find where we really care. Very little I suspect. Also the optimizations that gcc does with this often increase the code size. While that can be a win, with people judging gcc's output apparently *ONLY* on the code size as seen in this thread[1] it would obviously not compete well. -Andi [1] although there are compilers around that generate smaller code than gcc at its best. -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote: Something at the back of my mind said aliasing. $ gcc linus.c -O2 -S ; grep subl linus.s subl$1624, %esp $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s subl$824, %esp That's with 4.3.2. Interesting. Nonsensical, but interesting. What I find nonsensical is that -fno-strict-aliasing generates better code here. Normally one would expect the compiler seeing more aliases with that option and then be more conservative regarding any sharing. But it seems to be the other way round here. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, Jan 11, 2009 at 11:25:32AM -0800, Linus Torvalds wrote: On Sun, 11 Jan 2009, Andi Kleen wrote: The proposal was to use -fno-inline-functions-called-once (but the resulting numbers were not promising) Well, the _optimal_ situation would be to not need it, because gcc does a good job without it. That implies trying to find a better balance between worth it and causes problems. Rigth now, it does sound like gcc simply doesn't try to balance AT ALL, or balances only when we add some very version-specific random options (ie the stack-usage one). The gcc 4.3 inliner takes stack growth into account by default (without any special options). I experimented a bit with it when that was introduced and found the default thresholds are too large for the kernel and don't change the checkstack.pl picture much. I asked back then and was told --param large-stack-frame is expected to be a reasonable stable --param (as much as these can be) and I did a patch to lower it, but I couldn't get myself to actually submit it [if you really want it I can send it]. But of course that only helps for gcc 4.3+, older gccs would need a different workaround. On the other hand (my personal opinion, not shared by everyone) is that the ioctl switch stack issue is mostly only a problem with 4K stacks and in the rare cases when I still run 32bit kernels I never set that option because I consider it russian roulette (because there undoutedly dangerous dynamic stack growth cases that checkstack.pl doesn't flag) And even those options don't actually make much sense - yes, they balance things, but they don't do it in a sensible manner. For example: stack usage is undeniably a problem (we've hit it over and over again), but it's not about stack must not be larger than X bytes. If the call is done unconditionally, then inlining _one_ function will grow the static stack usage of the function we inline into, but it will _not_ grow the dynamic stack usage one whit - so deciding to not inline because of stack usage is pointless. Don't think the current inliner takes that into account from a quick look at the sources, although it probably could. Maybe Honza can comment. But even if it did it could only do that for a single file, but if the function is in a different file gcc doesn't have the information (unless you run with David's --combine hack). This means the kernel developers have to do it anyways. On the other hand I'm not sure it's that big a problem. Just someone should run make checkstack occasionally and add noinlines to large offenders. -Andi [keep quote for Honza's benefit] See? So stop inlining when you hit a stack limit IS THE WRONG THING TO DO TOO! Because it just means that the compiler continues to do bad inlining decisions until it hits some magical limit - but since the problem isn't the static stack size of any _single_ function, but the combined stack size of a dynamic chain of them, that's totally idiotic. You still grew the dynamic stack, and you have no way of knowing by how much - the limit on the static one simply has zero bearing what-so-ever on the dynamic one. So no, limit static stack usage is not a good option, because it stops inlining when it doesn't matter (single unconditional call), and doesn't stop inlining when it might (lots of sequential calls, in a deep chain). The other alternative is to let gcc do what it does, but (a) remove lots of unnecessary 'inline's. And we should likely do this regardless of any -fno-inline-functions-called-once issues. (b) add lots of 'noinline's to avoid all the cases where gcc screws up so badly that it's either a debugging disaster or an actual correctness issue. The problem with (b) is that it's a lot of hard thinking, and debugging disasters always happen in code that you didn't realize would be a problem (because if you had, it simply wouldn't be the debugging issue it is). Linus -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
I vote for the, get rid of the current inline, rename __always_inline to There is some code that absolutely requires inline for correctness, like the x86-64 vsyscall code. I would advocate to keep the explicit __always_inline at least there to make it very clear. inline, and then remove all non needed inlines from the kernel. Most inlines in .c should be probably dropped. We'll, probably start adding a lot more noinlines. That would cost you, see the numbers I posted (~4.1% text increase) -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 10:28:01AM -0700, Matthew Wilcox wrote: On Fri, Jan 09, 2009 at 06:20:11PM +0100, Andi Kleen wrote: Also cc Honza in case he has comments (you might want to review more of the thread in the archives) I think this particular bug is already known and discussed: I thought so initially too, but: http://gcc.gnu.org/ml/gcc/2008-12/msg00365.html and it hints at being fixed with gcc 4.4. Does anyone want to test that? Hugh already tested with 4.4 and it didn't work well. At least a lot of the low level asm inlines were not inlined. So it looks like it's still mistuned for the kernel. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
What happens when you say -fno-inline-functions-called-once? Does it disable inlining for those functions IN GENERAL, or just for the LARGE It does disable it in general, unless they're marked inline explicitely : The traditional gcc 2.x rules were: I Only inline what is marked inline (but it can decide to not inline) II Also inline others heuristically only with -O3 / -finline-functions [which we don't set in the kernel] Then at some point this additional rule was added: - Also inline everything static that is only called once [on the theory that this shrinks code size which is true according to my measurements] -fno-inline-functions-called once disables this new rule. It's very well and clearly defined. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 08:10:20PM +0100, Richard Guenther wrote: On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen a...@firstfloor.org wrote: GCC 4.3 should be good in compiling the kernel with default -Os settings. It's unfortunately not. It doesn't inline a lot of simple asm() inlines for example. Reading Ingos posting with the actual numbers states the opposite. Hugh had some numbers upto 4.4.0 20090102 in http://thread.gmane.org/gmane.linux.kernel/775254/focus=777231 which demonstrated the problem. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
I've done a finegrained size analysis today (see my other mail in this thread), and it turns out that on gcc 4.3.x the main (and pretty much only) inlining annotation that matters in arch/x86/include/asm/*.h is the onliner patch attached below, annotating constant_test_bit(). That's pretty cool. Should definitely file a gcc bug report for that though so that they can fix gcc. Did you already do that or should I? -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
What's the cost/benefit of that 4%? Does it actually improve performance? Especially if you then want to keep DWARF unwind information in memory in order to fix up some of the problems it causes? At that point, you lost dwarf unwind information has nothing to do with this, it doesn't tell you anything about inlining or not inlining. It just gives you finished frames after all of that has been done. Full line number information would help, but I don't think anyone proposed to keep that in memory. Does it help I$ utilization (which can speed things up a lot more, and is probably the main reason -Os actually tends to perform better)? Likely not. Sure, shrinking code is good for I$, but on the other hand inlining can actually be bad for I$ density because if you inline a function that doesn't get called, you now fragmented your footprint a lot more. Not sure that is always true; the gcc basic block reordering based on its standard branch prediction heuristics (e.g. 0 or == NULL unlikely or the unlikely macro) might well put it all out of line. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
I thought -Os actually disabled the basic-block reordering, doesn't it? Not in current gcc head no (just verified by stepping through) And I thought it did that exactly because it generates bigger code and much worse I$ patterns (ie you have a lot of conditional branch to other place and then unconditional branch back instead of conditional branch over the non-taken code. Also, I think we've had about as much good luck with guessing likely/unlikely as we've had with inline ;) That's true. But if you look at the default heuristics that gcc has (gcc/predict.def in the gcc sources) like == NULL, 0, branch guarding etc. I would expect a lot of them to DTRT for the kernel. Honza at some point even fixed goto to be unlikely after I complained :) Sadly, apart from some of the never happens error cases, the kernel doesn't tend to have lots of nice patterns. We have almost no loops (well, there are loops all over, but most of them we hopefully just loop over once or twice in any good situation), and few really predictable things. That actually makes us well suited to gcc, it has a relatively poor loop optimizer compared to other compilers ;-) Or rather, they can easily be very predictable under one particular load, and the totally the other way around under another .. Yes that is why we got good branch predictors in CPUs I guess. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, Jan 10, 2009 at 12:53:42AM +, Jamie Lokier wrote: Harvey Harrison wrote: Oh yeah, and figure out what actually breaks on alpha such that they added the following (arch/alpha/include/asm/compiler.h) #ifdef __KERNEL__ /* Some idiots over in linux/compiler.h thought inline should imply always_inline. This breaks stuff. We'll include this file whenever we run into such problems. */ Does always_inline complain if the function isn't inlinable, while Yes it does. inline allows it? (unless you set -Winline which the kernel doesn't) -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote: Harvey Harrison wrote: We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. Or just make it clear that inline shouldn't (unless for a very good reason) _ever_ be used in a .c file. The question is if that would produce acceptable quality code. In theory it should, but I'm more than wondering if it really will. I actually often use noinline when developing code simply because it makes it easier to read oopses when gcc doesn't inline ever static (which it normally does if it only has a single caller). You know roughly where it crashed without having to decode the line number. I believe others do that too, I notice it's all over btrfs for example. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, Jan 08, 2009 at 07:42:48PM -0800, Linus Torvalds wrote: I actually often use noinline when developing code simply because it makes it easier to read oopses when gcc doesn't inline ever static (which it normally does if it only has a single caller) Yes. Gcc inlining is a total piece of sh*t. The static inlining by default (unfortunately) saves a lot of text size. For testing I built an x86-64 allyesconfig kernel with -fno-inline-functions-called-once (which disables the default static inlining), and it increased text size by ~4.1% (over 2MB for a allyesconfig kernel). So I think we have to keep that, dropping it would cost too much :/ -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
I appreciate this is sample code, but using __get_user() on non-userspace pointers messes up architectures which have separate user/kernel spaces (eg the old 4G/4G split for x86-32). Do we have an appropriate function for kernel space pointers? probe_kernel_address(). But it's slow. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs for mainline
On Fri, Jan 02, 2009 at 02:32:29PM -0500, Chris Mason wrote: If combination spinlocks/mutexes are really a win they should be in the generic mutex framework. And I'm still dubious on the hardcoded numbers. Sure, I'm happy to use a generic framework there (or help create one). They are definitely a win for btrfs, and show up in most benchmarks. If they are such a big win then likely they will help other users too and should be generic in some form. - compat.h needs to go Projects that are out of mainline have a difficult task of making sure development can continue until they are in mainline and being clean enough to merge. I'd rather get rid of the small amount of compat code that I have left after btrfs is in (compat.h is 32 lines). It's fine for an out of tree variant, but the in tree version shouldn't have compat.h. For out of tree you just apply a patch that adds the includes. e.g.compat-wireless and lots of other projects do it this way. Yes, I tried to mark those as I did them (a very small number of functions). In general they were copied to avoid adding exports, and that is easily fixed. - there should be manpages for all the ioctls and other interfaces. - ioctl.c was not explicitely root protected. security issues? Christoph added a CAP_SYS_ADMIN check to the trans start ioctl, but I do need to add one to the device add/remove/balance code as well. Ok. Didn't see that. It still needs to be carefully audited for security holes even with root checks. Another thing is that once auto mounting is enabled each usb stick with btrfs on it could be a root hole if you have buffer overflows somewhere triggerable by disk data. I guess that would need some checking too. The subvol/snapshot creation is meant to be user callable (controlled by something similar to quotas later on). But right now that's not there so it should be root only. Also there used to be a lot of BUG_ON()s on memory allocation failure even. - In general BUG_ONs need review I think. Lots of externally triggerable ones. - various checkpath.pl level problems I think (e.g. printk levels) - the printks should all include which file system they refer to In general I think the whole thing needs more review. I don't disagree, please do keep in mind that I'm not suggesting anyone use this in production yet. When it's in mainline I suspect people will start using it for that. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Some very basic questions
Chris Mason [EMAIL PROTECTED] writes: Started interactively? I'm not entirely sure what that means, but in general when you ask the user a question about if/how to fix a corruption, they will have no idea what the correct answer is. While that's true today, I'm not sure it has to be true always. I always thought traditional fsck user interfaces were a UI desaster and could be done much better with some simple tweaks. For example the fsck could present the user a list of files that ended up in lost+found and let them examine them, instead of asking a lot of useless questions. Or it could give a high level summary on how many files in which part of the directory tree were corrupted. etc.etc. Or it could default to a high level mode that only gives such high level information to the user. So I don't think all corruptions could be done perfectly user friendly, but at least the basic user friendliness in many situations could be much improved. -Andi -- [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Data-deduplication?
On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote: Andi Kleen wrote: Ray Van Dolson [EMAIL PROTECTED] writes: I recall their being a thread here a number of months back regarding data-deduplication support for bttfs. Did anyone end up picking that up and giving a go at it? Block level data dedup would be *awesome* in a Linux filesystem. It does wonders for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't have this feature yet (although I've read discussions on them looking to add it). There are some patches to do in QEMU's cow format for KVM. That's user level only. And thus, doesn't work for sharing between different images, especially at runtime. It would work if the images are all based once on a reference image, won't it? I would imagine that's the common situation for installing lots of VMs. I'd really, really [any number of reallies], really like to see btrfs deduplication. Sure it would be useful for a couple of things. -Andi -- [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Data-deduplication?
Ray Van Dolson [EMAIL PROTECTED] writes: I recall their being a thread here a number of months back regarding data-deduplication support for bttfs. Did anyone end up picking that up and giving a go at it? Block level data dedup would be *awesome* in a Linux filesystem. It does wonders for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't have this feature yet (although I've read discussions on them looking to add it). There are some patches to do in QEMU's cow format for KVM. That's user level only. -Andi -- [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: avoid NULL deref after failed allocation
Jim Meyering [EMAIL PROTECTED] writes: However, in some places, the trend is to BUG_ON(!ptr), so I've done that, too. Even if it's a trend, it's wrong. Better don't add more. And also unnecessary because next reference will obviously oops anyways. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs_tree_lock trylock
On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote: On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote: The idea is to try to spin for a bit to avoid scheduling away, which is especially important for the high levels. Most holders of the mutex let it go very quickly. Ok but that surely should be implemented in the general mutex code then or at least in a standard adaptive mutex wrapper? That depends, am I the only one crazy enough to think its a good idea? Adaptive mutexes are classic, a lot of other OS have it. Gregory et.al. also saw big improvements in the RT kernel (they posted a patchkit a few times) But a lot of people don't like them for some reason. Anyways hiding them in a fs is probably wrong. -Andi -- [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: New feature Idea
Morey Roof [EMAIL PROTECTED] writes: I have been thinking about a new feature to start work on that I am interested in and I was hoping people could give me some feedback and ideas of how to tackle it. Anyways, I want to create a data deduplication system that can work in two different modes. One mode is that when the system is idle or not beyond a set load point a background process would scan the volume for duplicate blocks. The other mode would be used for systems that are nearline or backup systems that don't really care about the performance and it would do the deduplication during block allocation. Seems like a special case of compression? Perhaps compression would help more? One of the ways I was thinking of to find the duplicate blocks would be to use the checksums as a quick compare. If the checksums match then do a complete compare before adjusting the nodes on the files. However, I believe that I will need to create a tree based on the checksum values. If you really want to do deduplication: It might be advantageous to do this on larger units. If you assume that data is usually shared between similar files (which is a reasonable assumption) and do the deduplication on whole files you can also use the size as an index and avoid checksumming all files with a unique size. I wrote a user level duplicated file checker some time ago that used this trick successfully. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs v0.16 released
So, the mirroring turns a single large write into two large writes. Definitely not free, but always a fixed cost. Thanks for the explanation and the numbers. I see that's the advantage of copy-on-write that you can actually always cluster the metadata together and get always batched IO this way and then afford to do more of it. Still wondering what that will do to read seekiness. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs v0.16 released
Chris Mason [EMAIL PROTECTED] writes: Metadata is duplicated by default even on single spindle drives, Can you please say a bit how much that impacts performance? That sounds costly. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html