Re: [RFC v4+ hot_track 00/19] vfs: hot data tracking

2012-10-29 Thread Andi Kleen
zwu.ker...@gmail.com writes:

 TODO List:

  1.) Need to do scalability or performance tests.

You're changing some of the most performance critical code in the
kernel. This step is absolutely not optional.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL v2] Update LZO compression

2012-10-03 Thread Andi Kleen
   - Post the entire patch series to lkml for review (I'd like a cc please)
  
  Already happened, multiple people reviewed and tested.
 
 um, I would not consider Looks ok to me from a quick look. and I
 couldn't tell from the github view, but I assume you follow standard
 coding style. to indicate a rigorous code review!

Let's put it like this. LZO is mature widely used production software and 
Markus wrote and maintains it.  Essentially this is a release from him.

The main problem before with using his version directly was that it did 
not follow Linux coding standards, but that seems to be fixed now. It's 
also relatively simple and at least I didn't spot anything bad when I looked.

It was runtime tested by several people on different architectures.
If it was fuzzed so we can assume reasonably good security.

-Andi 

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Update LZO compression

2012-08-16 Thread Andi Kleen
 If you think a little bit, I bet you could come up with a solution that 
 operates at cacheline-aligned granularity, something that would be _even 
 faster_ than simply fixing the code to do aligned accesses.

Cache aligned compression is unlikely to compress anything at all.
Compression algorithms are usually by definition unaligned.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Update LZO compression

2012-08-16 Thread Andi Kleen
 I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using 
 cpufreq-set,
 and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with 
 lzo.
 There doesn't seem to be a significant difference between all three variants.

I found that in compression benchmarks it depends a lot on the data
compressed.

urandom (which should be essentially incompressible) will be handled
by different code paths in the compressor than other more compressible data.
It becomes a complicated memcpy then.

But then there are IO benchmarks which also only do zeroes, which
also gives an unrealistic picture.

Usually it's best to use some corpus with different data types, from very
compressible to less so; and look at the aggregate.

For my snappy work I usually had at least large executables (medium) and some
pdfs (already compressed; low) and then uncompressed source code tars (high
compression)

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Update LZO compression

2012-08-13 Thread Andi Kleen
On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
 Hi all,
 
 as suggested on the mailing list I have converted the updated LZO
 code into git, so please pull my lzo-update branch from
 
   git://github.com/markus-oberhumer/linux.git lzo-update
 
 You can browse the branch at
 
   https://github.com/markus-oberhumer/linux/compare/lzo-update

Looks ok to me from a quick look.

Since kernel lzo is security relevant, I assume the new version
has been fuzz tested?

I couldn't tell from the github view, but I assume you follow
standard coding style.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: use a slab for btrfs_dio_private

2012-08-07 Thread Andi Kleen
Josef Bacik jba...@fusionio.com writes:

 This is in the IO path, let's try to avoid latencies by having our own slab.
 Thanks,

What latencies does this avoid?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] Add the snappy-c compressor to lib v2

2012-02-14 Thread Andi Kleen



(BTW:  If you're ever reworking this patch set, I'd like to make an ad
hoc request for slightly different names for fs/btrfs/snappy.c and
lib/snappy.c)


Why?


When building a x86 kernel, I get the following errors:
   CC [M]  lib/snappy.o
lib/snappy.c: In function 'snappy_init_env':
lib/snappy.c:1268:2: error: implicit declaration of function 'vmalloc'
   CC [M]  fs/btrfs/free-space-cache.o
lib/snappy.c:1268:18: warning: assignment makes pointer from integer
without a cast
lib/snappy.c: In function 'snappy_free_env':
lib/snappy.c:1293:2: error: implicit declaration of function 'vfree'
make[1]: *** [lib/snappy.o] Error 1
make: *** [lib] Error 2

The error clears with this patch:


Added thanks.

-Andi

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFB] add LZ4 compression method to btrfs

2012-02-14 Thread Andi Kleen
Markus Lindberg marcuslindb...@gmail.com writes:


 Are you sure about these figures ? the difference seems too large. It's almost
 unbelievable.

Yes, my benchmarks totally disagree with them. In my tests lz4 is
generally slower than snappy-c.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ANN: linux-kernel-lzo-2.06.20120123 - update LZO to v2.06

2012-01-24 Thread Andi Kleen
On Mon, Jan 23, 2012 at 05:19:40PM +0100, Markus F.X.J. Oberhumer wrote:
 Hi,
 
 I've prepared a small package that updates the LZO version in the Linux
 kernel to LZO v2.06.

I ran benchmarks on the new miniLZO and LZ4 on 64bit. LZ4 is generally slower 
than snappy/lzo in the micro benchmarks. The new LZO is better than the old
one, but still loses to snappy most of the time (but often by very
small amounts only)

Will be worth checking the new LZO will the full distribution boot test.

I agree it's definitely a good idea to update the kernel version.
However I must say it would be a major project to bring it up
to kernel coding standards.

snappy is still interesting, but much less so than it was before.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Updated btrfs/crypto snappy interface ready for merging

2012-01-17 Thread Andi Kleen
 It's because decompressing inline extents always fails. I've fixed it
 and will send the patch out in a new mail thread.

Thanks for fixing.

 
 But seems there's bug in lib snappy code, which makes the decompressed
 data doesn't quite match the original data.
 
 Simply copy a file to a btrfs filesystem with snappy enabled, and clear
 page cache, and check the file:

Hmm weird, I have never seen this. Do you have a reproducer?

The basic compression code is quite well tested, I have a reasonable
unit test.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Updated btrfs/crypto snappy interface ready for merging

2012-01-17 Thread Andi Kleen
 At first I saved emails and patched them in linux-btrfs git tree, and I
 got weired result. Then I pulled the snappy branch directly, and now
 nothing is wrong.. Sorry for the noise.

np. Thanks for testing.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] SNAPPY: Add dual GPL/BSD license to snappy module

2012-01-15 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

To avoid complaints from the module loader.

I made it dual because the original compression code was BSD.

Reported-by: Chris Mason
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 lib/snappy.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/lib/snappy.c b/lib/snappy.c
index 0b39e07..3848c6c 100644
--- a/lib/snappy.c
+++ b/lib/snappy.c
@@ -1298,3 +1298,5 @@ void snappy_free_env(struct snappy_env *env)
memset(env, 0, sizeof(struct snappy_env));
 }
 EXPORT_SYMBOL(snappy_free_env);
+
+MODULE_LICENSE(Dual BSD/GPL);
-- 
1.7.7.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-13 Thread Andi Kleen

For the btrfs extent cache it's unclear if just RCUing is a good
fit anyways: some workloads are very write heavy and RCU only
performs well if you have a lot more reads than writes.
For write heavy RCUification usually slows it down.

 FWIW, I'm mentioning this out of self interest - I need a RCU safe
 tree structure to index extents for lookless lookups in the XFS
 buffer cache, but I've got a long list of things to do before I get
 to it. If someone else implements the tree, that's most of the work
 done for me. :)

FWIW there are fine grained rbtrees in papers too, but they are too fine 
grained imho: you may need to take a large number of locks for a single 
traversal. While atomics got a lot cheaper recently they are still
somewhat expensive and you don't want too many of them in your 
fast path. Also I found when there is actual contention having
too many bouncing locks is quite bad because the latencies of passing
the cache lines around really add up. In these cases uses less fine
locks is better.

Mathieu also did RCU rbtrees but they are quite complicated.

IMHO we would like to have something inbetween a global tree lock and a
fully fine grained tree where the lock complexity cannot get out of band.
May need a separate data structure for the locks.

I don't have a leading candidate for that currently.

There are some variants of rbtrees that are less strict and have
a simpler rebalance which are interesting. But also some
other tree data structures. Needs more work.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-12 Thread Andi Kleen
Liu Bo liubo2...@cn.fujitsu.com writes:

 Here we choose extent_map firstly, since it is a read mostly thing,
 and the change is quite direct, all we need to do is
 a) to replace rbtree with skiplist,
 b) to add rcu support.
 And more details are in patch 2 and patch 3.

 I've done some simple tests for performance on my 2-core box, there is no
 obvious difference, but I want to focus on the design side and make sure
 there is no more bug in it firstly.

 For long term goals, we want to ship skiplist to lib, like lib/rbtree.c.

I looked at skiplists some time ago. What made them awkward for kernel
use is that you have to size the per node skip array in advance and it's
hard to resize. So you have a node that wastes memory in common small
cases, but still degenerates to linked list on very large sizes.
With fine grained locking it gets even worse because the nodes get larger.

Con didn't worry about this problem because he assumed the scheduler
run queues never could get too long.

But for a very scalable subsystem that's definitely a problem.

I think skiplists are not a good fit here.

At least in our tests the older style trees got a lot better with Chris'
recent locking improvements. 

Now replacing rbtrees is probably still a good idea, but not convinced
skiplist are suitable here. There were various other tree alternatives
with better locking.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Updated btrfs/crypto snappy interface ready for merging

2012-01-12 Thread Andi Kleen
Here's a slightly updated version of the BTRFS snappy interface.
snappy is a faster compression algorithm that provides similar
compression as LZO, but generally better performance.

This version has:

- Ported to latest tree
- Extensive testing and benchmarking (in its 3.0 based incarnation)
- A few cleanups
- Adds a minor performance improvement

Should be ready for pulling into the btrfs tree now.
The crypto parts have been acked by Herbert.

The following changes since commit 2485a4b610171f4e1c4ab0d053569747795c1bbe:

  Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input (2012-01-12 12:40:41 
-0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc snappy

Andi Kleen (3):
  Add the snappy-c compressor to lib v2
  BTRFS: Add snappy support v2
  Add snappy interface to crypto API

 crypto/Kconfig |9 +
 crypto/Makefile|1 +
 crypto/snappy.c|   99 
 fs/btrfs/Kconfig   |1 +
 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |1 +
 fs/btrfs/compression.h |1 +
 fs/btrfs/ctree.h   |9 +-
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |4 +
 fs/btrfs/snappy.c  |  432 
 fs/btrfs/super.c   |9 +-
 include/linux/snappy.h |   26 +
 lib/Kconfig|6 +
 lib/Makefile   |4 +
 lib/snappy.c   | 1300 
 16 files changed, 1901 insertions(+), 5 deletions(-)
 create mode 100644 crypto/snappy.c
 create mode 100644 fs/btrfs/snappy.c
 create mode 100644 include/linux/snappy.h
 create mode 100644 lib/snappy.c

-Andi

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] BTRFS: Add snappy support v2

2012-01-12 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

Add support in btrfs for snappy compression. This is generally a faster
compressor than LZO (except for higher memory usage) with comparable
compression and should imho replace it.

This is based on the lzo code with minor modifications.

This has been tested with various test workloads and also
in a real distribution. One challenge with the benchmarks
is that a lot of of benchmarks only read/write zeroes, which
are a uncommon good case for IO compression. However even with
modified workloads that use different patterns gains are seen.

In general snappy performs better with a 64bit CPU, but also
generally outperforms LZO on a 32bit Atom system on IO workloads.

The disk space usage is similar to LZO (within 0.2%)

On a distribution boot test on a netboot snappy is about 11.5% faster
than LZO or 15.5% faster than an uncompressed btrfs.

Open: implement scatter-gather support and get rid of the temporary
buffers. This would only work in some special cases, but should
improve performance further by eliminating copies.

Thanks to Jacob Sowles for extensive testing and benchmarking.

v2: Some cleanups, port to latest tree.
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 fs/btrfs/Kconfig   |1 +
 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |1 +
 fs/btrfs/compression.h |1 +
 fs/btrfs/ctree.h   |9 +-
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |4 +
 fs/btrfs/snappy.c  |  432 
 fs/btrfs/super.c   |9 +-
 9 files changed, 456 insertions(+), 5 deletions(-)
 create mode 100644 fs/btrfs/snappy.c

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index ecb9fd3..d55df9c 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -6,6 +6,7 @@ config BTRFS_FS
select ZLIB_DEFLATE
select LZO_COMPRESS
select LZO_DECOMPRESS
+   select SNAPPY
help
  Btrfs is a new filesystem with extents, writable snapshotting,
  support for multiple devices and many more features.
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index c0ddfd2..b93ab52 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,6 +8,6 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-  reada.o backref.o
+  reada.o backref.o snappy.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 14f1c5a..f85f7fd 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -730,6 +730,7 @@ static wait_queue_head_t 
comp_workspace_wait[BTRFS_COMPRESS_TYPES];
 struct btrfs_compress_op *btrfs_compress_op[] = {
btrfs_zlib_compress,
btrfs_lzo_compress,
+   btrfs_snappy_compress,
 };
 
 int __init btrfs_init_compress(void)
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index a12059f..971a425 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -79,5 +79,6 @@ struct btrfs_compress_op {
 
 extern struct btrfs_compress_op btrfs_zlib_compress;
 extern struct btrfs_compress_op btrfs_lzo_compress;
+extern struct btrfs_compress_op btrfs_snappy_compress;
 
 #endif
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6738503..a06a642 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -458,6 +458,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL  (1ULL  1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS(1ULL  2)
 #define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO(1ULL  3)
+#define BTRFS_FEATURE_INCOMPAT_COMPRESS_SNAPPY (1ULL  4)
 
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SUPP   0ULL
@@ -465,7 +466,8 @@ struct btrfs_super_block {
(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \
 BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL |\
 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |  \
-BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO)
+BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO |  \
+BTRFS_FEATURE_INCOMPAT_COMPRESS_SNAPPY)
 
 /*
  * A leaf is full of items. offset and size tell us where to find
@@ -621,8 +623,9 @@ enum btrfs_compression_type {
BTRFS_COMPRESS_NONE  = 0,
BTRFS_COMPRESS_ZLIB  = 1,
BTRFS_COMPRESS_LZO   = 2,
-   BTRFS_COMPRESS_TYPES = 2,
-   BTRFS_COMPRESS_LAST  = 3,
+   BTRFS_COMPRESS_SNAPPY = 3,
+   BTRFS_COMPRESS_TYPES = 3,
+   BTRFS_COMPRESS_LAST  = 4,
 };
 
 struct btrfs_inode_item {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f99a099..925dd0a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2113,6 +2113,8 @@ struct btrfs_root *open_ctree(struct super_block *sb,
features

[PATCH 1/3] Add the snappy-c compressor to lib v2

2012-01-12 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

This is a C port of the google snappy compressor. It has roughly
comparable compression to LZO, but is significantly faster on many file
types. For example it beats all other compressors on already
compressed data.

I ported the original C++ code over to C and did some changes
to make it better fit into the kernel. It preallocates the worst
case memory consumption now. While the code being larger
than lzo it is still reasonable (about 5K on x86).

Decompression needs very little memory, Compression
currently 192K on 64bit and 128K on 32bit. For comparison
LZO compression needs 128K on 64bit and 64K on 32bit.

[This could be lowered significantly by not preallocating
for most use cases, typically the footprint is much lower.
The original C++ version only allocated most of this
when (rarely) needed, but this is more problematic in the kernel]

There are some minor divergences from the Linux coding standards:
in particular I kept the C++/C99 style mixed statement/declarations.
This was mainly to not diverge too much from the reference C++
source, so that improvements from there can be easily ported.
There are some other left overs from the google style, but very
little now.

Performance:

The compressor performs best on 64bit-LE systems,
but is also quite good on 32bit. I haven't tested BE, but
I don't expect that to add a lot of overhead.

Here is some performance data (32bit, Nehalem):
c/b = cycles/byte; lower numbers are better.

x86-64 executable: (compression minimally slower than qlz, but
much better at decompression, lzo is left in the dust):

snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 8.13 uncomp 2.65 c/b
lzo   : emacs-gtk: 11007968 b: ratio 0.33: comp 12.74 uncomp 4.70 c/b
zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 49.96 uncomp 13.14 c/b
zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 64.17 uncomp 12.33 c/b
lzf   : emacs-gtk: 11007968 b: ratio 0.37: comp 9.85 uncomp 4.33 c/b
qlz   : emacs-gtk: 11007968 b: ratio 0.34: comp 7.51 uncomp 6.28 c/b
fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 10.73 uncomp 4.97 c/b

Compressed data (beats everything else):

snappy: udev-151.tar.gz: 634842 b: ratio 1.00: comp 0.99 uncomp 0.33 c/b
lzo   : udev-151.tar.gz: 634842 b: ratio 1.00: comp 41.44 uncomp 0.66 c/b
zlib1 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 116.99 uncomp 3.94 c/b
zlib3 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 117.68 uncomp 3.94 c/b
lzf   : udev-151.tar.gz: 634842 b: ratio 1.03: comp 16.32 uncomp 1.14 c/b
qlz   : udev-151.tar.gz: 634842 b: ratio 1.00: comp 10.42 uncomp 0.42 c/b
fastlz: udev-151.tar.gz: 634842 b: ratio 1.03: comp 19.35 uncomp 2.07 c/b

Text file (compression somewhat slower than qlz, but decompression
much better, lzo is much worse):

snappy: manual.txt: 445343 b: ratio 0.47: comp 12.01 uncomp 3.12 c/b
lzo   : manual.txt: 445343 b: ratio 0.44: comp 16.32 uncomp 7.53 c/b
zlib1 : manual.txt: 445343 b: ratio 0.35: comp 56.37 uncomp 15.59 c/b
zlib3 : manual.txt: 445343 b: ratio 0.31: comp 73.45 uncomp 13.99 c/b
lzf   : manual.txt: 445343 b: ratio 0.46: comp 13.43 uncomp 5.47 c/b
qlz   : manual.txt: 445343 b: ratio 0.44: comp 9.16 uncomp 8.19 c/b
fastlz: manual.txt: 445343 b: ratio 0.46: comp 14.22 uncomp 7.28 c/b

As you can see snappy is a good all-around compressor.

On 64bit the compression is even faster and beats everything else easily:

snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 4.90 uncomp 2.65 c/b
lzo   : emacs-gtk: 11007968 b: ratio 0.33: comp 11.24 uncomp 4.46 c/b
zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 41.67 uncomp 11.13 c/b
zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 51.80 uncomp 10.54 c/b
lzf   : emacs-gtk: 11007968 b: ratio 0.37: comp 8.79 uncomp 4.05 c/b
qlz   : emacs-gtk: 11007968 b: ratio 0.34: comp 5.44 uncomp 5.46 c/b
fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 9.91 uncomp 4.77 c/b

On 64bit it's now nearly as fast as qlz on the text file too:

snappy: manual.txt: 445343 b: ratio 0.47: comp 7.79 uncomp 3.47 c/b
lzo   : manual.txt: 445343 b: ratio 0.44: comp 15.46 uncomp 7.27 c/b
zlib1 : manual.txt: 445343 b: ratio 0.35: comp 45.79 uncomp 12.78 c/b
zlib3 : manual.txt: 445343 b: ratio 0.31: comp 60.52 uncomp 11.72 c/b
lzf   : manual.txt: 445343 b: ratio 0.46: comp 12.62 uncomp 5.30 c/b
qlz   : manual.txt: 445343 b: ratio 0.44: comp 6.81 uncomp 7.65 c/b
fastlz: manual.txt: 445343 b: ratio 0.46: comp 13.75 uncomp 6.52 c/b

Overall it's a good alternative to lzo, with the only
drawback being the somewhat higher memory use.

v2: Some minor performance improvements and cleanups.
32bit compression should be a few percent faster now.
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 include/linux/snappy.h |   26 +
 lib/Kconfig|6 +
 lib/Makefile   |4 +
 lib/snappy.c   | 1300 
 4 files changed, 1336 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/snappy.h
 create mode 100644 lib/snappy.c

diff --git

[PATCH 3/3] Add snappy interface to crypto API

2012-01-12 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

Mainly so that ubifs can use it.

Acked-by: Herbert Xu herb...@gondor.apana.org.au
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 crypto/Kconfig  |9 +
 crypto/Makefile |1 +
 crypto/snappy.c |   99 +++
 3 files changed, 109 insertions(+), 0 deletions(-)
 create mode 100644 crypto/snappy.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index e6cfe1a..768943dc 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -924,6 +924,15 @@ config CRYPTO_LZO
help
  This is the LZO algorithm.
 
+config CRYPTO_SNAPPY
+   tristate Snappy compression algorithm
+   select CRYPTO_ALGAPI
+   select SNAPPY
+   help
+ snappy is a faster alternative to the lzo compression algorithm
+ with comparable compression. It is very fast on 64bit systems, but 
also
+ good on 32bit systems. It especially excels at already compressed 
data.
+
 comment Random Number Generation
 
 config CRYPTO_ANSI_CPRNG
diff --git a/crypto/Makefile b/crypto/Makefile
index f638063..8a0e6c6 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o
 obj-$(CONFIG_CRYPTO_CRC32C) += crc32c.o
 obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o
 obj-$(CONFIG_CRYPTO_LZO) += lzo.o
+obj-$(CONFIG_CRYPTO_SNAPPY) += snappy.o
 obj-$(CONFIG_CRYPTO_RNG2) += rng.o
 obj-$(CONFIG_CRYPTO_RNG2) += krng.o
 obj-$(CONFIG_CRYPTO_ANSI_CPRNG) += ansi_cprng.o
diff --git a/crypto/snappy.c b/crypto/snappy.c
new file mode 100644
index 000..43a64b9
--- /dev/null
+++ b/crypto/snappy.c
@@ -0,0 +1,99 @@
+/*
+ * Cryptographic API.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 51
+ * Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ *
+ */
+
+#include linux/init.h
+#include linux/module.h
+#include linux/crypto.h
+#include linux/vmalloc.h
+#include linux/snappy.h
+
+struct snappy_ctx {
+   struct snappy_env env;
+};
+
+/* Only needed for compression actually */
+static int snappy_init(struct crypto_tfm *tfm)
+{
+   struct snappy_ctx *ctx = crypto_tfm_ctx(tfm);
+
+   return snappy_init_env(ctx-env);
+}
+
+static void snappy_exit(struct crypto_tfm *tfm)
+{
+   struct snappy_ctx *ctx = crypto_tfm_ctx(tfm);
+
+   snappy_free_env(ctx-env);
+}
+
+static int snp_compress(struct crypto_tfm *tfm, const u8 *src,
+   unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+   struct snappy_ctx *ctx = crypto_tfm_ctx(tfm);
+   size_t olen;
+   int err;
+
+   /*  very pessimistic. check in snappy? */
+   if (*dlen  snappy_max_compressed_length(*dlen))
+   return -EINVAL;
+   err = snappy_compress(ctx-env, src, slen, dst, olen);
+   *dlen = olen;
+   return err;
+}
+
+static int snp_decompress(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+   size_t ulen;
+
+   if (!snappy_uncompressed_length(src, slen, ulen))
+   return -EIO;
+   if (*dlen  ulen)
+   return -EINVAL;
+   *dlen = ulen;
+   return snappy_uncompress(src, slen, dst) ? 0 : -EIO;
+}
+
+static struct crypto_alg alg = {
+   .cra_name   = snappy,
+   .cra_flags  = CRYPTO_ALG_TYPE_COMPRESS,
+   .cra_ctxsize= sizeof(struct snappy_ctx),
+   .cra_module = THIS_MODULE,
+   .cra_list   = LIST_HEAD_INIT(alg.cra_list),
+   .cra_init   = snappy_init,
+   .cra_exit   = snappy_exit,
+   .cra_u  = { .compress = {
+   .coa_compress   = snp_compress,
+   .coa_decompress = snp_decompress } }
+};
+
+static int __init snappy_mod_init(void)
+{
+   return crypto_register_alg(alg);
+}
+
+static void __exit snappy_mod_fini(void)
+{
+   crypto_unregister_alg(alg);
+}
+
+module_init(snappy_mod_init);
+module_exit(snappy_mod_fini);
+
+MODULE_LICENSE(GPL);
+MODULE_DESCRIPTION(Snappy Compression Algorithm);
-- 
1.7.7.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 07/99] btrfs: Use mempools for extent_state structures

2011-11-28 Thread Andi Kleen
Jeff Mahoney je...@suse.com writes:

  The extent_state structure is used at the core of the extent i/o code
  for managing flags, locking, etc. It requires allocations deep in the
  write code and if failures occur they are difficult to recover from.

  We avoid most of the failures by using a mempool, which can sleep when
  required, to honor the allocations. This allows future patches to convert
  most of the {set,clear,convert}_extent_bit and derivatives to return
  void.

Is this really safe?

iirc if there's any operation that needs multiple mempool objects you
can get into the following ABBA style deadlock:

thread 1 thread 2

get object A from pool 1
 get object C from pool 2 
get object B from pool 2
pool 2 full - block
 get object D from pool 2
 pool1 full - block 


Now for thread 1 to make progress it needs thread 2 to free its
object first, but that needs thread 1 to free its object also 
first, which is a deadlock.

It would still work if there are other users which eventually
make progress, but if you're really unlucky either pool 1 or 2 
is complete used by threads doing a multiple object operation.
So you got a nasty rare deadlock ...

The only way to handle this would be to always preallocate
all objects needed for a operation in an atomic way.
Or never use more than one object.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Add the snappy-c compressor to lib

2011-10-20 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

This is a C port of the google snappy compressor. It has roughly
comparable compression to LZO, but is significantly faster on many file
types. For example it beats all other compressors on already
compressed data.

I ported the original C++ code over to C and did some changes
to make it better fit into the kernel. It preallocates the worst
case memory consumption now. While the code being larger
than lzo it is still reasonable (about 5K on x86).

Decompression needs very little memory, Compression
currently 192K on 64bit and 128K on 32bit. For comparison
LZO compression needs 128K on 64bit and 64K on 32bit.

[This could be lowered significantly by not preallocating
for most use cases, typically the footprint is much lower.
The original C++ version only allocated most of this
when (rarely) needed, but this is more problematic in the kernel]

There are some minor divergences from the Linux coding standards:
in particular I kept the C++/C99 style mixed statement/declarations.
This was mainly to not diverge too much from the reference C++
source, so that improvements from there can be easily ported.
There are some other left overs from the google style, but very
little now.

checkpatch.pl has some complaints, but they are either caused by the
above or checkpatch.pl bugs (it misparsed #define foo do {} while(0))

Performance:

The compressor performs best on 64bit-LE systems,
but is also quite good on 32bit. I haven't tested BE, but
I don't expect that to add a lot of overhead.

Here is some performance data (32bit, Nehalem):
c/b = cycles/byte; lower numbers are better.

x86-64 executable: (compression minimally slower than qlz, but
much better at decompression, lzo is left in the dust):

snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 8.13 uncomp 2.65 c/b
lzo   : emacs-gtk: 11007968 b: ratio 0.33: comp 12.74 uncomp 4.70 c/b
zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 49.96 uncomp 13.14 c/b
zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 64.17 uncomp 12.33 c/b
lzf   : emacs-gtk: 11007968 b: ratio 0.37: comp 9.85 uncomp 4.33 c/b
qlz   : emacs-gtk: 11007968 b: ratio 0.34: comp 7.51 uncomp 6.28 c/b
fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 10.73 uncomp 4.97 c/b

Compressed data (beats everything else):

snappy: udev-151.tar.gz: 634842 b: ratio 1.00: comp 0.99 uncomp 0.33 c/b
lzo   : udev-151.tar.gz: 634842 b: ratio 1.00: comp 41.44 uncomp 0.66 c/b
zlib1 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 116.99 uncomp 3.94 c/b
zlib3 : udev-151.tar.gz: 634842 b: ratio 1.00: comp 117.68 uncomp 3.94 c/b
lzf   : udev-151.tar.gz: 634842 b: ratio 1.03: comp 16.32 uncomp 1.14 c/b
qlz   : udev-151.tar.gz: 634842 b: ratio 1.00: comp 10.42 uncomp 0.42 c/b
fastlz: udev-151.tar.gz: 634842 b: ratio 1.03: comp 19.35 uncomp 2.07 c/b

Text file (compression somewhat slower than qlz, but decompression
much better, lzo is much worse):

snappy: manual.txt: 445343 b: ratio 0.47: comp 12.01 uncomp 3.12 c/b
lzo   : manual.txt: 445343 b: ratio 0.44: comp 16.32 uncomp 7.53 c/b
zlib1 : manual.txt: 445343 b: ratio 0.35: comp 56.37 uncomp 15.59 c/b
zlib3 : manual.txt: 445343 b: ratio 0.31: comp 73.45 uncomp 13.99 c/b
lzf   : manual.txt: 445343 b: ratio 0.46: comp 13.43 uncomp 5.47 c/b
qlz   : manual.txt: 445343 b: ratio 0.44: comp 9.16 uncomp 8.19 c/b
fastlz: manual.txt: 445343 b: ratio 0.46: comp 14.22 uncomp 7.28 c/b

As you can see snappy is a good all-around compressor.

On 64bit the compression is even faster and beats everything else easily:

snappy: emacs-gtk: 11007968 b: ratio 0.38: comp 4.90 uncomp 2.65 c/b
lzo   : emacs-gtk: 11007968 b: ratio 0.33: comp 11.24 uncomp 4.46 c/b
zlib1 : emacs-gtk: 11007968 b: ratio 0.27: comp 41.67 uncomp 11.13 c/b
zlib3 : emacs-gtk: 11007968 b: ratio 0.26: comp 51.80 uncomp 10.54 c/b
lzf   : emacs-gtk: 11007968 b: ratio 0.37: comp 8.79 uncomp 4.05 c/b
qlz   : emacs-gtk: 11007968 b: ratio 0.34: comp 5.44 uncomp 5.46 c/b
fastlz: emacs-gtk: 11007968 b: ratio 0.37: comp 9.91 uncomp 4.77 c/b

On 64bit it's now nearly as fast as qlz on the text file too:

snappy: manual.txt: 445343 b: ratio 0.47: comp 7.79 uncomp 3.47 c/b
lzo   : manual.txt: 445343 b: ratio 0.44: comp 15.46 uncomp 7.27 c/b
zlib1 : manual.txt: 445343 b: ratio 0.35: comp 45.79 uncomp 12.78 c/b
zlib3 : manual.txt: 445343 b: ratio 0.31: comp 60.52 uncomp 11.72 c/b
lzf   : manual.txt: 445343 b: ratio 0.46: comp 12.62 uncomp 5.30 c/b
qlz   : manual.txt: 445343 b: ratio 0.44: comp 6.81 uncomp 7.65 c/b
fastlz: manual.txt: 445343 b: ratio 0.46: comp 13.75 uncomp 6.52 c/b

Overall it's a good alternative to lzo, with the only
drawback being the somewhat higher memory use. The memory use
can be fixed for most cases (e.g. some of the current
buffers are only used for SG), but isn't yet in this version.

Open: it's pretty easy to add scatter-gather support since
input/output is quite abstracted. This would benefit
some users who could avoid temporary buffers.

Signed-off-by: Andi Kleen a...@linux.intel.com

[PATCH 2/3] BTRFS: Add snappy support

2011-10-20 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

Add support in btrfs for snappy compression.

This is based on the lzo code with minor modifications.
The btrfs glue code could be significantly improved over LZO
by exploiting some snappy features, but hasn't so far.

Open: implement scatter-gather support and get rid of the temporary
buffers.

Some performance numbers (thanks to Jacob Sowles for running them)

bonnie++, Core i7-64bit

block output rewrite  block input random
   K/sec K/secK/sec   K/sec
None   100%  100% 100%100%
zlib+6.3%+5.4%+6.6%   +11.6%
lzo+11.5%+4.6%+12.4%  +6.7%
snappy +19.3%+28.1%   +32.6%  +9.3%

Snappy does extremly well on the 64bit architecture,
outperforming everything else, sometimes with a healthy
margin.

bonnie++, Atom-32bit

block output rewrite  block input random

   K/sec K/secK/sec   K/sec
None   100%  100% 100%100%
zlib   -43.1%-24.2%   -19.0%  +12.0%
lzo+0.8% +2.6%+6.8%   +14.8%
snappy +19.5%+16.2%   +24.0%  +15.7%

zlib does very poorly on Atom, actually degrading
performance. snappy is generally faster or similar to LZO.
The difference is not as big as on the 64bit CPU though, but
still visible.

bonnie++, files, Core-i7-64bit

sequential createdelete random create delete
   files/sec
None   100%  100%  100%   100%
zlib   +8.3% +10.5%+10.3%  +1.4%
lzo+3.8% +3.3% +5.4%   -3.89%
snappy +23.7%+37.2%+21.8%  +23.8%

bonnie++, files, Atom-32bit

sequential createdelete random create delete
   files/sec
None   100%  100%  100%   100%
zlib   +3.0% +7.9% +5.2%  +5.1%
lzo+8.2% +5.9% +4.8%  +4.6%
snappy +3.1% +8.5% +5.7%  +1.3%

Creation/Deletion on Atom is a case where snappy loses to LZO,
however the loss is small. On 64bit Core it's a win.

I should add that these benchmarks mainly use 0 filled IO,
however FFSB was also quickly tested with more random data
and the differences were similar. See also the micro benchmarks
in the algorithm description for the behaviour with different
data types.

FFSB, 4 threads, stair case data pattern, Reads MB/s

Core i7-64bit Atom-32bit
   MB/s   MB/s
None   100%   100%
zlib   +8.0%  +4.2%
lzo+9.3%  +4.8%
snappy +12.4% +7.9%

In general snappy is a better replacement for LZO, especially
on 64bit, but even on 32bit.

Cc: chris.ma...@oracle.com
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 fs/btrfs/Kconfig   |1 +
 fs/btrfs/Makefile  |3 +-
 fs/btrfs/compression.c |1 +
 fs/btrfs/compression.h |1 +
 fs/btrfs/ctree.h   |9 +-
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |4 +
 fs/btrfs/snappy.c  |  435 
 fs/btrfs/super.c   |9 +-
 lib/Makefile   |1 +
 10 files changed, 461 insertions(+), 5 deletions(-)
 create mode 100644 fs/btrfs/snappy.c

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index ecb9fd3..d55df9c 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -6,6 +6,7 @@ config BTRFS_FS
select ZLIB_DEFLATE
select LZO_COMPRESS
select LZO_DECOMPRESS
+   select SNAPPY
help
  Btrfs is a new filesystem with extents, writable snapshotting,
  support for multiple devices and many more features.
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 40e6ac0..7cd86e7 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,6 +7,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
-  compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o
+  compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
+  snappy.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8ec5d86..b171858 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -729,6 +729,7 @@ static wait_queue_head_t 
comp_workspace_wait[BTRFS_COMPRESS_TYPES];
 struct btrfs_compress_op *btrfs_compress_op[] = {
btrfs_zlib_compress,
btrfs_lzo_compress,
+   btrfs_snappy_compress,
 };
 
 int __init btrfs_init_compress(void)
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index a12059f..971a425 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -79,5 +79,6 @@ struct btrfs_compress_op {
 
 extern struct btrfs_compress_op btrfs_zlib_compress;
 extern struct btrfs_compress_op btrfs_lzo_compress;
+extern struct

[PATCH 3/3] Add snappy interface to crypto API

2011-10-20 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

Mainly so that ubifs can use it.

Snappy is a better compressor in the same niche as LZO.

Only lightly tested so far. Experiences welcome.

Cc: herb...@gondor.apana.org.au
Cc: dedeki...@gmail.com
Cc: adrian.hun...@intel.com
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 crypto/Kconfig  |9 +
 crypto/Makefile |1 +
 crypto/snappy.c |   99 +++
 3 files changed, 109 insertions(+), 0 deletions(-)
 create mode 100644 crypto/snappy.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index ae27b753..2c85991 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -823,6 +823,15 @@ config CRYPTO_LZO
help
  This is the LZO algorithm.
 
+config CRYPTO_SNAPPY
+   tristate Snappy compression algorithm
+   select CRYPTO_ALGAPI
+   select SNAPPY
+   help
+ snappy is a faster alternative to the lzo compression algorithm
+ with comparable compression. It is very fast on 64bit systems, but 
also
+ good on 32bit systems. It especially excels at already compressed 
data.
+
 comment Random Number Generation
 
 config CRYPTO_ANSI_CPRNG
diff --git a/crypto/Makefile b/crypto/Makefile
index ce5a813..cf92f6f 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -80,6 +80,7 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o
 obj-$(CONFIG_CRYPTO_CRC32C) += crc32c.o
 obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o
 obj-$(CONFIG_CRYPTO_LZO) += lzo.o
+obj-$(CONFIG_CRYPTO_SNAPPY) += snappy.o
 obj-$(CONFIG_CRYPTO_RNG2) += rng.o
 obj-$(CONFIG_CRYPTO_RNG2) += krng.o
 obj-$(CONFIG_CRYPTO_ANSI_CPRNG) += ansi_cprng.o
diff --git a/crypto/snappy.c b/crypto/snappy.c
new file mode 100644
index 000..2f44d30
--- /dev/null
+++ b/crypto/snappy.c
@@ -0,0 +1,99 @@
+/*
+ * Cryptographic API.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 51
+ * Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ *
+ */
+
+#include linux/init.h
+#include linux/module.h
+#include linux/crypto.h
+#include linux/vmalloc.h
+#include linux/snappy.h
+
+struct snappy_ctx {
+   struct snappy_env env;
+};
+
+/* Only needed for compression actually */
+static int snappy_init(struct crypto_tfm *tfm)
+{
+   struct snappy_ctx *ctx = crypto_tfm_ctx(tfm);
+
+   return snappy_init_env(ctx-env);
+}
+
+static void snappy_exit(struct crypto_tfm *tfm)
+{
+   struct snappy_ctx *ctx = crypto_tfm_ctx(tfm);
+
+   snappy_free_env(ctx-env);
+}
+
+static int snp_compress(struct crypto_tfm *tfm, const u8 *src,
+   unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+   struct snappy_ctx *ctx = crypto_tfm_ctx(tfm);
+   size_t olen;
+   int err;
+
+   /*  very pessimistic. check in snappy? */
+   if (*dlen  snappy_max_compressed_length(*dlen))
+   return -EINVAL;
+   err = snappy_compress(ctx-env, src, slen, dst, olen);
+   *dlen = olen;
+   return err;
+}
+
+static int snp_decompress(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+   size_t ulen;
+
+   if (!snappy_uncompressed_length(src, slen, ulen))
+   return -EIO;
+   if (*dlen  ulen)
+   return -EINVAL;
+   *dlen = ulen;
+   return snappy_uncompress(src, slen, dst) ? 0 : -EIO;
+}
+
+static struct crypto_alg alg = {
+   .cra_name   = snappy,
+   .cra_flags  = CRYPTO_ALG_TYPE_COMPRESS,
+   .cra_ctxsize= sizeof(struct snappy_ctx),
+   .cra_module = THIS_MODULE,
+   .cra_list   = LIST_HEAD_INIT(alg.cra_list),
+   .cra_init   = snappy_init,
+   .cra_exit   = snappy_exit,
+   .cra_u  = { .compress = {
+   .coa_compress   = snp_compress,
+   .coa_decompress = snp_decompress } }
+};
+
+static int __init snappy_mod_init(void)
+{
+   return crypto_register_alg(alg);
+}
+
+static void __exit snappy_mod_fini(void)
+{
+   crypto_unregister_alg(alg);
+}
+
+module_init(snappy_mod_init);
+module_exit(snappy_mod_fini);
+
+MODULE_LICENSE(GPL);
+MODULE_DESCRIPTION(Snappy Compression Algorithm);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org

Re: Honest timeline for btrfsck

2011-10-06 Thread Andi Kleen
Jeff Putney jeffrey.put...@gmail.com writes:

 http://en.wikipedia.org/wiki/Release_early,_release_often

Well the other principle in free software you're forgetting 
is:

It will be released when it's ready

If you don't like Chris' ways to do releases you're free to write
something on your own or pay someone to do so. Otherwise
you just have to deal with his time frames, as shifty
as they may be.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v0 07/18] btrfs: generic data structure to build unique lists

2011-10-06 Thread Andi Kleen
Arne Jansen sensi...@gmx.net writes:

 ulist is a generic data structures to hold a collection of unique u64
 values. The only operations it supports is adding to the list and
 enumerating it.
 It is possible to store an auxiliary value along with the key.
 The implementation is preliminary and can probably be sped up
 significantly.
 It is used by subvolume quota to translate recursions into iterative
 loops.

Hmm, sounds like a job for lib/idr.c 

What do your ulists do that idr doesn't?
Ok idr doesn't have merge, but that should be simple
enough to add.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v0 03/18] btrfs: add nested locking mode for paths

2011-10-06 Thread Andi Kleen
Arne Jansen sensi...@gmx.net writes:

 This patch adds the possibilty to read-lock an extent
 even if it is already write-locked from the same thread.
 Subvolume quota needs this capability.

Recursive locking is generally strongly discouraged, it causes all kinds
of problems and tends to eventuall ylead to locking hierarchies nobody
can understand anymore.

If you can find any other way to solve this problem I would
encourage you to do so.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v0 17/18] btrfs: add qgroup ioctls

2011-10-06 Thread Andi Kleen
Arne Jansen sensi...@gmx.net writes:
 +
 + if (copy_to_user(arg, sa, sizeof(*sa)))
 + ret = -EFAULT;
 +
 + if (trans) {
 + err = btrfs_commit_transaction(trans, root);
 + if (err  !ret)
 + ret = err;
 + }

It would seem safer to put the copy to user outside the transaction.
A cto can in principle cause new writes (e.g. if it causes COW), so 
you may end up with nested transactions. Even if that works somehow
(not sure) it seems to be a thing better avoided.

 +
 + sa = memdup_user(arg, sizeof(*sa));
 + if (IS_ERR(sa))
 + return PTR_ERR(sa);
 +
 + trans = btrfs_join_transaction(root);
 + if (IS_ERR(trans)) {
 + ret = PTR_ERR(trans);
 + goto out;
 + }

This code seems to be duplicated a lot. Can it be consolidated?

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v0 03/18] btrfs: add nested locking mode for paths

2011-10-06 Thread Andi Kleen
On Fri, Oct 07, 2011 at 12:44:30AM +0400, Andrey Kuzmin wrote:
 Perhaps you could just elaborate on needs this feature? In general, write
 lock gives one exclusive access, so the need for additional read
 (non-exclusive) lock does not appear easily understandable.

Usually it's because the low level code can be called both with and
without locking and it doesn't know.

But that usually can be avoided with some restructuring.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] BTRFS: Fix lseek return value for error

2011-09-19 Thread Andi Kleen
On Mon, Sep 19, 2011 at 03:30:02PM -0400, Chris Mason wrote:
 Excerpts from Andi Kleen's message of 2011-09-19 13:52:03 -0400:
   Thanks everyone, I've put Jeff's last version of this in my queue.
  
  Can you post the version you merged? The previous ones all had issues.
 
 https://github.com/chrismason/linux/commit/48802c8ae2a9d618ec734a61283d645ad527e06c
 
 This was the last one sent, I thought it combined all the fixes.

Ok looks good, but it will be all obsolete once my patchkit lseek
get in (except for the SEEK_DATA/HOLE hunk)

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] BTRFS: Fix lseek return value for error

2011-09-17 Thread Andi Kleen
  with an additional improvement if the offset is larger or equal to the
  file size, return -ENXIO in directly:
  
 if (offset = inode-i_size) {
 mutex_unlock(inode-i_mutex);
 return -ENXIO;
 }
 
 Except that is wrong, because it would then be impossible to write sparse 
 files. 

And also i_size must be always read with i_size_read()

Anyways clearly there's a problem in btrfs land with merging fixes in time.
Is anyone collecting patches while Chris is gone?

-Andi

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] BTRFS: Fix lseek return value for error

2011-09-16 Thread Andi Kleen
On Fri, Sep 16, 2011 at 11:48:15AM -0400, Christoph Hellwig wrote:
 On Thu, Sep 15, 2011 at 04:06:47PM -0700, Andi Kleen wrote:
  From: Andi Kleen a...@linux.intel.com
  
  Introduced by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c
 
 I think this should go to Chris/Linus ASAP.  But a slightly better
 patch description wouldn't hurt either.

Josef acked it earlier, but with a missing Chris the btrfs merge
pipeline seems to be disfunctional at the moment.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] BTRFS: Free inode mutex on lseek error

2011-08-20 Thread Andi Kleen
On Sat, Aug 20, 2011 at 08:08:25AM -0400, Josef Bacik wrote:
 On 08/19/2011 08:07 PM, Andi Kleen wrote:
  From: Andi Kleen a...@linux.intel.com
  
  Introduced with b26751575a9aa55fd6dbf3febde3ff06dfadc44f
  
 
 This has already been fixed by
 
 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c

It's not. That patch returns the wrong return value.

-Andi

---

From 3c3bb2aa661c2b3405c53ac72e23fc663aea794f Mon Sep 17 00:00:00 2001
From: Andi Kleen a...@linux.intel.com
Date: Sat, 20 Aug 2011 05:49:47 -0700
Subject: [PATCH] BTRFS: Fix lseek return value for error

Introduced by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c

Signed-off-by: Andi Kleen a...@linux.intel.com

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e7872e4..c6e493f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1814,19 +1814,17 @@ static loff_t btrfs_file_llseek(struct file *file, 
loff_t offset, int origin)
case SEEK_DATA:
case SEEK_HOLE:
ret = find_desired_extent(inode, offset, origin);
-   if (ret) {
-   mutex_unlock(inode-i_mutex);
-   return ret;
-   }
+   if (ret)
+   goto error;
}
 
if (offset  0  !(file-f_mode  FMODE_UNSIGNED_OFFSET)) {
ret = -EINVAL;
-   goto out;
+   goto error;
}
if (offset  inode-i_sb-s_maxbytes) {
ret = -EINVAL;
-   goto out;
+   goto error;
}
 
/* Special lock needed here? */
@@ -1837,6 +1835,9 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t 
offset, int origin)
 out:
mutex_unlock(inode-i_mutex);
return offset;
+error:
+   mutex_unlock(inode-i_mutex);
+   return ret;
 }
 
 const struct file_operations btrfs_file_operations = {
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] BTRFS: Free inode mutex on lseek error

2011-08-19 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

Introduced with b26751575a9aa55fd6dbf3febde3ff06dfadc44f

Cc: jo...@redhat.com
Cc: chris.ma...@oracle.com
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 fs/btrfs/file.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 658d669..8791613 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1798,16 +1798,15 @@ static loff_t btrfs_file_llseek(struct file *file, 
loff_t offset, int origin)
case SEEK_DATA:
case SEEK_HOLE:
ret = find_desired_extent(inode, offset, origin);
-   if (ret) {
-   mutex_unlock(inode-i_mutex);
-   return ret;
-   }
+   if (ret)
+   goto error;
}
 
+   ret = -EINVAL;
if (offset  0  !(file-f_mode  FMODE_UNSIGNED_OFFSET))
-   return -EINVAL;
+   goto error;
if (offset  inode-i_sb-s_maxbytes)
-   return -EINVAL;
+   goto error;
 
/* Special lock needed here? */
if (offset != file-f_pos) {
@@ -1817,6 +1816,9 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t 
offset, int origin)
 out:
mutex_unlock(inode-i_mutex);
return offset;
+error:
+   mutex_unlock(inode-i_mutex);
+   return ret;
 }
 
 const struct file_operations btrfs_file_operations = {
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 4/4] btrfs: Moved repair code from inode.c to extent_io.c

2011-07-24 Thread Andi Kleen
Jan Schmidt list.bt...@jan-o-sch.net writes:

 Repair works that way: Whenever a read error occurs and we have more
 mirrors to try, note the failed mirror, and retry another. If we find a
 good one, check if we did note a failure earlier and if so, do not allow
 the read to complete until after the bad sector was written with the good
 data we just fetched. As we have the extent locked while reading, no one
 can change the data in between.

This has the potential for error loops: when the write fails too
you get another error in the log and can flood the log etc. 
I assume this could get really noisy if that disk completely
went away.

Perhaps it needs a threshold to see if there aren't too many errors
on the mirror and then stop retrying at some point.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 4/4] btrfs: Moved repair code from inode.c to extent_io.c

2011-07-24 Thread Andi Kleen
 I wasn't clear enough on that: We only track read errors, here. Ans
 error correction can only happen on the read path. So if the write
 attempt fails, we can't go into a loop.

Not in a loop, but you trigger more IO errors, which can be nasty 
if the IO error logging triggers more IO (pretty common because
syslogd calls fsync). And then your code does even more IO, floods
more etc.etc. And the user will be unhappy if their
console gets flooded.

We've have a similar problems in the past with readahead causing
error flooding.

Any time where an error can cause more IO you have to be extremly
careful.

Right now this seems rather risky to me.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 8/8] btrfs: new ioctls to do logical-inode and inode-path resolving

2011-07-21 Thread Andi Kleen
Jan Schmidt list.bt...@jan-o-sch.net writes:
 +
 +static long btrfs_ioctl_logical_to_ino(struct btrfs_root *root,
 + void __user *arg)
 +{
 + int ret = 0;
 + int size;
 + u64 extent_offset;
 + struct btrfs_ioctl_logical_ino_args *loi;
 + struct btrfs_data_container *inodes = NULL;
 + struct btrfs_path *path = NULL;
 + struct btrfs_key key;

This really needs to be root-only for obvious reasons.
The same for the ino_path function

 +
 + loi = memdup_user(arg, sizeof(*loi));
 + if (IS_ERR(loi)) {
 + ret = PTR_ERR(loi);
 + loi = NULL;
 + goto out;
 + }
 +
 + path = btrfs_alloc_path();
 + if (!path) {
 + ret = -ENOMEM;
 + goto out;
 + }
 +
 + size = min(loi-size, 4096);

This is likely a root hole. loi-size is signed! Consider the case
of a negative value being passed in.

Same for the earlier function.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: initial online fsck support

2011-07-02 Thread Andi Kleen
 The two big features of btrfs are self-healing and online fsck, those have to 

Are they?

 be implemented in kernel space.

Why? There have been online fscks in user space in the past,
e.g. the various schemes using LVM snapshots for ext* and 
other related work on the BSD FFS. I don't see any principal
reason why it couldn't be done for btrfs either.

A good fsck is quite complex and you are unlikely to want all
that code in kernel space.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: initial online fsck support

2011-07-01 Thread Andi Kleen
Li Zefan l...@cn.fujitsu.com writes:

 This is an initial version of online fsck. What it does is:

 - check the dir item and dir index pointing to a file.
 - check the structure of extents of a file.

 As furthur work, we should consider:

 - fix but not only check the structure of a file.
 - verify the extent allocation tree on the fly.

It's scary to have a fsck in kernel space. Is there no way to do
this from user space?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: save preloaded extent_state's in a percpu cache V2

2011-06-21 Thread Andi Kleen
Josef Bacik jo...@redhat.com writes:

 When doing DIO tracing I noticed we were doing a ton of allocations, a lot of
 the time for extent_states.  Some of the time we don't even use the 
 prealloc'ed
 extent_state, it just get's free'd up.  So instead create a per-cpu cache like
 the radix tree stuff.  So we will check to see if our per-cpu cache has a
 prealloc'ed extent_state in it and if so we just continue, else we alloc a new
 one and fill the cache.  Then if we need to use a prealloc'ed extent_state we
 can just take it out of our per-cpu cache.  We will also refill the cache on
 free to try and limit the number of times we have to ask the allocator for
 caches.  With this patch dbench 50 goes from ~210 mb/s to ~260 mb/s.  Thanks,

You're just reimplementing a poor man's custom slab cache -- all of this is 
already
done in slab.

If the difference is really that big better fix slab and have everyone 
benefit?

Did you use slub or slab?
Did you analyze where the cycles are spent?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates

2011-06-12 Thread Andi Kleen
Chris Mason chris.ma...@oracle.com writes:

 Hi everyone,

 The for-linus branch of the btrfs unstable tree:

 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable.git 
 for-linus


 Has our current queue of fixes.  Josef's is the biggest pile, mostly in
 the allocator.  Josef and I both managed to merge his patch to avoid
 mapping the extent buffer if skip_locking was set, git merge is just a
 little too easy sometimes (I double checked the resulting code).

The new in 3.0 btrfs warnings on every build are still there:

fs/btrfs/sysfs.c:76: warning: ‘btrfs_root_attrs’ defined but not used
fs/btrfs/sysfs.c:97: warning: ‘btrfs_super_attrs’ defined but not used
fs/btrfs/sysfs.c:153: warning: ‘btrfs_super_release’ defined but not used   
fs/btrfs/sysfs.c:160: warning: ‘btrfs_root_release’ defined but not used

These are not even used inside any ifdef. It's unclear to me: were
these supposed to be used or removed?

Probably better to remove since they seem to be untested, unless
it was a merge error?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG 3.0-rc1] oops during file removal, severe lock contention

2011-06-02 Thread Andi Kleen
Dave Chinner da...@fromorbit.com writes:

 Also, there is massive lock contention while running these workloads.
 perf top shows this for the create after about 5m inodes have been
 created:

We saw pretty much the same thing in some simple tests on large systems
(extent io tree locking and higher level b*tree locks are a problem)
It is being looked at I believe.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Btrfs: load the key from the dir item in readdir into a fake dentry

2011-05-26 Thread Andi Kleen
On Thu, May 26, 2011 at 03:02:42PM -0400, Josef Bacik wrote:
 + /*
 +  * If this dentry needs lookup, don't set the referenced flag so that it
 +  * is more likely to be cleaned up by the dcache shrinker in case of
 +  * memory pressure.
 +  */
 + if (!d_need_lookup(dentry))
 + dentry-d_flags |= DCACHE_REFERENCED;

No it doesn't at all. The allocation will just push everything else
out.

Really you cannot view this by only looking at the dcache.
You have to look at the complete VM behaviour. All the caches
and the other memory interact.
 
 
  d_alloc uses a normal GFP_KERNEL, which is quite in appropiate for this.
  
  It should at least reclaim and probably more, but even then it's
  risky.
  
 
 Ah yeah I guess I should have probably used GFP_KERNEL.  Sorry about that,

GFP_KERNEL is already used, but it's wrong. I'm not sure any
of the existing GFP_* flags will give you the semantics you
need in fact. The new flag Minchan added for readahead may come
near, but even that is probably not enough.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags

2011-05-20 Thread Andi Kleen
 Putting them at the end of the cache LRU instead of the head would allow them 
 to be dropped quickly under memory pressure. 

This still would fill up your memory for find /, potentially pushing
out other stuff.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add a DCACHE_NEED_LOOKUP flag for d_flags

2011-05-20 Thread Andi Kleen
On Fri, May 20, 2011 at 08:30:19PM -0400, Josef Bacik wrote:
 On 05/20/2011 05:31 PM, Andi Kleen wrote:
  Putting them at the end of the cache LRU instead of the head would allow 
  them to be dropped quickly under memory pressure. 
  
  This still would fill up your memory for find /, potentially pushing
  out other stuff.
  
  -Andi
 
 So these things are just hashed on dput, so they don't have any
 references to them and they are automatically put on the LRU list, so if
 we get under memory pressure they will be easily discarded, especially
 if nobody is actually stating them.  Thanks,

They are allocated. The allocation will push out other things too.
There's no mechanism to only push dentries when allocating other dentries,
or limit the total consumption from the dcache.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] BTRFS: Remove unused node_lock

2011-05-17 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

240f62c8756 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 fs/btrfs/ctree.h   |3 ---
 fs/btrfs/disk-io.c |1 -
 2 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8f4b81d..f290b98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1088,9 +1088,6 @@ struct btrfs_fs_info {
 struct btrfs_root {
struct extent_buffer *node;
 
-   /* the node lock is held while changing the node pointer */
-   spinlock_t node_lock;
-
struct extent_buffer *commit_root;
struct btrfs_root *log_root;
struct btrfs_root *reloc_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 228cf36..64b2896 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1064,7 +1064,6 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 
sectorsize,
INIT_LIST_HEAD(root-dirty_list);
INIT_LIST_HEAD(root-orphan_list);
INIT_LIST_HEAD(root-root_list);
-   spin_lock_init(root-node_lock);
spin_lock_init(root-orphan_lock);
spin_lock_init(root-inode_lock);
spin_lock_init(root-accounting_lock);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stratified B-trees

2011-03-23 Thread Andi Kleen
Karn Kallio tierplusplusli...@gmail.com writes:

 Are these stratified B-trees something which the btrfs project could use?

The current b*tree is pretty much hardcoded in the disk format, so it 
would be hard to change in a compatible way.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 3/6] btrfs: add scrub code and prototypes

2011-03-16 Thread Andi Kleen

Arne Jansen sensi...@gmx.net writes:
 +  */
 + mutex_lock(fs_info-scrub_lock);
 + atomic_inc(fs_info-scrubs_running);
 + mutex_unlock(fs_info-scrub_lock);

It seems odd to protect an atomic_inc with a mutex. 
Is that done for some side effect? Otherwise you either
don't need atomic or don't need the lock.

That seems to be all over the source file.

 +int btrfs_scrub_pause(struct btrfs_root *root)
 +{
 + struct btrfs_fs_info *fs_info = root-fs_info;
 + mutex_lock(fs_info-scrub_lock);


As I understand it you take that mutex on every transaction
commit, which is a fast path for normal IO. 

For me that looks like a scalability problem with enough
cores. Did you do any performance testing of this on a system
with a reasonable number of cores?

btrfs already has enough scalability problems, please don't
add new ones.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-15 Thread Andi Kleen
 I have a question though: the deactivation of multiple page-io
 submission support most likely only would affect bigger systems or
 also desktop systems (like mine) ?

I think this is not a final fix, just a workaround.
The problem with the other path still really needs to be tracked down.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-09 Thread Andi Kleen
 512MB.
 
 'free' reports 75MB, 419MB free.
 
 I originally noticed the problem on really real hardware (thinkpad
 T61p), however.

If you can easily reproduce it could you try a git bisect?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [dm-devel] hunt for 2.6.37 dm-crypt+ext4 corruption?

2010-12-05 Thread Andi Kleen
 I've been using a kernel which is between 2.6.37-rc2 and -rc3 with a LUKS / 
 dm-crypt / LVM / ext4 setup for my primary file systems, and I haven't 
 observed any corruption for the last two weeks or so.   It's on my todo list 
 to upgrade to top of Linus's tree, but perhaps this is a useful data point.

The reported corruptions are with a .37 forward port someone did of my parallel 
kcryptd patch.  The 2.6.36 (and earlier) versions are rock solid with many 
users.

The classic dmcrypt is single threaded and very slow.

 As another thought, what version of GCC are people using who are having 
 difficulty?   Could this perhaps be a compiler-related issue?

A compiler problem seems very unlikely here.

What may be an useful experiment would be to test 2.6.37-rc + ext4 over
device mapper, but not dmcrypt. If that fails too then it's likely
some generic device mapper problem.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Andi Kleen
Kay Sievers kay.siev...@vrfy.org writes:

 Yeah, we thought about that too, but a btrfs mount does not show up as
 a block device, like md/dm, so there is no place for a slaves/
 directory in /sys with the individual disks listed. How could be solve
 that? Create some fake blockdev for every btrfs mount,  but that can't
 be used to read/write raw blocks?

You could simply create a new class for btrfs? (or maybe a generic
compound class)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86_64/lib: improve the performance of memmove

2010-09-16 Thread Andi Kleen
On Thu, 16 Sep 2010 15:16:31 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 On Thu, 16 Sep 2010 08:48:25 +0200 (cest), Andi Kleen wrote:
  When the dest and the src do overlap and the memory area is large,
  memmove of
  x86_64 is very inefficient, and it led to bad performance, such as
  btrfs's file
  deletion performance. This patch improved the performance of
  memmove on x86_64
  by using __memcpy_bwd() instead of byte copy when doing large
  memory area copy
  (len  64).
 
 
  I still don't understand why you don't simply use a backwards
  string copy (with std) ? That should be much simpler and
  hopefully be as optimized for kernel copies on recent CPUs.
 
 But according to the comment of memcpy, some CPUs don't support REP
 instruction, 

I think you misread the comment. REP prefixes are in all x86 CPUs.
On some very old systems it wasn't optimized very well,
but it probably doesn't make too much sense to optimize for those.
On newer CPUs in fact REP should be usually faster than 
an unrolled loop.

I'm not sure how optimized the backwards copy is, but most likely
it is optimized too.

Here's an untested patch that implements backwards copy with string
instructions. Could you run it through your test harness?

-Andi


Implement the 64bit memmmove backwards case using string instructions

Signed-off-by: Andi Kleen a...@linux.intel.com

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index bcbcd1e..6e8258d 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -141,3 +141,28 @@ ENDPROC(__memcpy)
.byte .Lmemcpy_e - .Lmemcpy_c
.byte .Lmemcpy_e - .Lmemcpy_c
.previous
+
+/* 
+ * Copy memory backwards (for memmove)
+ * rdi target
+ * rsi source
+ * rdx count
+ */
+
+ENTRY(memcpy_backwards):
+   CFI_STARTPROC
+   std
+   movq %rdi, %rax
+   movl %edx, %ecx
+   add  %rdx, %rdi
+   add  %rdx, %rsi
+   shrl $3, %ecx
+   andl $7, %edx
+   rep movsq
+   movl %edx, %ecx
+   rep movsb
+   cld
+   ret
+   CFI_ENDPROC
+ENDPROC(memcpy_backwards)
+   
diff --git a/arch/x86/lib/memmove_64.c b/arch/x86/lib/memmove_64.c
index 0a33909..6c00304 100644
--- a/arch/x86/lib/memmove_64.c
+++ b/arch/x86/lib/memmove_64.c
@@ -5,16 +5,16 @@
 #include linux/string.h
 #include linux/module.h
 
+extern void asmlinkage memcpy_backwards(void *dst, const void *src,
+   size_t count);
+
 #undef memmove
 void *memmove(void *dest, const void *src, size_t count)
 {
if (dest  src) {
return memcpy(dest, src, count);
} else {
-   char *p = dest + count;
-   const char *s = src + count;
-   while (count--)
-   *--p = *--s;
+   return memcpy_backwards(dest, src, count);
}
return dest;
 }



-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-08 Thread Andi Kleen

 According to the data, the length of the most copies is =128.

Thanks for the data. Large is easier to optimize than small, that's good.

Could you also measure how many memsets need the backwards copy?
(should be easy to add)

If the number is small that needs backwards then the easiest fix
would be to simply call the normal memcpy in the forward case.

That is for backward could also use a string instruction copy
of course, just have to set the direction flag.

That would be a very small code change.

-Andi


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-08 Thread Andi Kleen
On Wed, 08 Sep 2010 20:57:07 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
 
  According to the data, the length of the most copies is=128.
 
  Thanks for the data. Large is easier to optimize than small, that's
  good.
 
  Could you also measure how many memsets need the backwards copy?
  (should be easy to add)
 
 I think memset doesn't need the backwards copy.

I meant for memmove of course. Obviously memset doesn't need a backwards
copy. That was just the only thing the script didn't measure because
the original version didn't have memmove support.

Your whole thread was about making memmove faster, right?

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Andi Kleen
Miao Xie mi...@cn.fujitsu.com writes:

 Changes from V1 to V2:
 - change the version of GPL from version 2.1 to version 2

 the kernel's memcpy and memmove is very inefficient. But the glibc version is
 quite fast, in some cases it is 10 times faster than the kernel version. So I


Can you elaborate on which CPUs and with what workloads you measured that?

The kernel memcpy is optimized for copies smaller than a page size 
for example (kernel very rarely does anything on larger than 4k), 
the glibc isn't. etc. There are various other differences.

memcpy and memmove are very different. AFAIK noone has tried
to optimize memmove() before because traditionally it wasn't
used for anything performance critical in the kernel. Has that
that changed? memcpy on the other hand while not perfect
is actually quite optimized for typical workloads.

One big difference between the kernel and glibc is that kernel 
is often cache cold, so you e.g. the cost of a very large code footprint
memcpy/memset is harder to amortize.

Microbenchmarks often leave out that crucial variable.

I have some systemtap scripts to measure size/alignment distributions of
copies on a kernel, if you have a particular workload you're interested
in those could be tried.

Just copying the glibc bloat uncritical is very likely
the wrong move at least.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor read performance on high-end server

2010-08-08 Thread Andi Kleen
Jens Axboe ax...@kernel.dk writes:

 Also, I didn't see Chris mention this, but if you have a newer intel box
 you can use hw accellerated crc32c instead. For some reason my test box
 always loads crc32c and not crc32c-intel, so I need to do that manually.

I have a patch for that, will post it later: autoloading of modules
based on x86 cpuinfo.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design

2010-06-25 Thread Andi Kleen
Daniel Taylor daniel.tay...@wdc.com writes:

 As long as no object smaller than the disk block size is ever
 flushed to media, and all flushed objects are aligned to the disk
 blocks, there should be no real performance hit from that.

The question is just how large such a block needs to be.
Traditionally some RAID controllers (and possibly some SSDs now) 
needed very large blocks upto MBs.


 Otherwise we end up with the damage for the ext[234] family, where
 the file blocks can be aligned, but the 1K inode updates cause
 the read-modify-write (RMW) cycles and and cost 10% performance
 hit for creation/update of large numbers of files.

Fixing that doesn't require a new file system layout, just some effort
to read/write inodes in batches of multiple of them. XFS did similar
things for a long time, I believe there were some efforts for this
for ext4 too.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [13/23] BTRFS: Clean up unused variables -- nonbugs

2010-06-10 Thread Andi Kleen

These are all the cases where a variable is set, but not 
read which are not bugs as far as I can see, but simply
leftovers. 

Still needs more review.

Found by gcc 4.6's new warnings

Cc: chris.ma...@oracle.com
Cc: linux-btrfs@vger.kernel.org

 fs/btrfs/compression.c|2 --
 fs/btrfs/ctree.c  |   20 ++--
 fs/btrfs/dir-item.c   |2 +-
 fs/btrfs/disk-io.c|   10 --
 fs/btrfs/extent-tree.c|2 --
 fs/btrfs/extent_io.c  |   10 ++
 fs/btrfs/inode.c  |   19 +++
 fs/btrfs/ioctl.c  |2 --
 fs/btrfs/ordered-data.c   |2 --
 fs/btrfs/relocation.c |2 ++
 fs/btrfs/root-tree.c  |2 --
 fs/btrfs/super.c  |1 +
 fs/btrfs/tree-defrag.c|2 --
 fs/btrfs/tree-log.c   |   15 +--
 fs/btrfs/volumes.c|4 
 fs/btrfs/xattr.c  |2 --
 fs/btrfs/zlib.c   |5 -

Signed-off-by: Andi Kleen a...@linux.intel.com

---
 fs/btrfs/compression.c  |2 --
 fs/btrfs/ctree.c|   20 ++--
 fs/btrfs/disk-io.c  |   11 ---
 fs/btrfs/extent-tree.c  |2 --
 fs/btrfs/extent_io.c|9 -
 fs/btrfs/inode.c|   14 --
 fs/btrfs/ioctl.c|2 --
 fs/btrfs/ordered-data.c |2 --
 fs/btrfs/root-tree.c|2 --
 fs/btrfs/super.c|6 ++
 fs/btrfs/tree-defrag.c  |2 --
 fs/btrfs/tree-log.c |   15 ---
 fs/btrfs/volumes.c  |4 
 fs/btrfs/xattr.c|2 --
 fs/btrfs/zlib.c |5 -
 15 files changed, 4 insertions(+), 94 deletions(-)

Index: linux-2.6.35-rc2-gcc/fs/btrfs/ctree.c
===
--- linux-2.6.35-rc2-gcc.orig/fs/btrfs/ctree.c
+++ linux-2.6.35-rc2-gcc/fs/btrfs/ctree.c
@@ -200,7 +200,6 @@ int btrfs_copy_root(struct btrfs_trans_h
  struct extent_buffer **cow_ret, u64 new_root_objectid)
 {
struct extent_buffer *cow;
-   u32 nritems;
int ret = 0;
int level;
struct btrfs_disk_key disk_key;
@@ -210,7 +209,6 @@ int btrfs_copy_root(struct btrfs_trans_h
WARN_ON(root-ref_cows  trans-transid != root-last_trans);
 
level = btrfs_header_level(buf);
-   nritems = btrfs_header_nritems(buf);
if (level == 0)
btrfs_item_key(buf, disk_key, 0);
else
@@ -1008,7 +1006,6 @@ static noinline int balance_level(struct
int wret;
int pslot;
int orig_slot = path-slots[level];
-   int err_on_enospc = 0;
u64 orig_ptr;
 
if (level == 0)
@@ -1071,8 +1068,7 @@ static noinline int balance_level(struct
BTRFS_NODEPTRS_PER_BLOCK(root) / 4)
return 0;
 
-   if (btrfs_header_nritems(mid)  2)
-   err_on_enospc = 1;
+   btrfs_header_nritems(mid);
 
left = read_node_slot(root, parent, pslot - 1);
if (left) {
@@ -1103,8 +1099,7 @@ static noinline int balance_level(struct
wret = push_node_left(trans, root, left, mid, 1);
if (wret  0)
ret = wret;
-   if (btrfs_header_nritems(mid)  2)
-   err_on_enospc = 1;
+   btrfs_header_nritems(mid);
}
 
/*
@@ -1224,14 +1219,12 @@ static noinline int push_nodes_for_inser
int wret;
int pslot;
int orig_slot = path-slots[level];
-   u64 orig_ptr;
 
if (level == 0)
return 1;
 
mid = path-nodes[level];
WARN_ON(btrfs_header_generation(mid) != trans-transid);
-   orig_ptr = btrfs_node_blockptr(mid, orig_slot);
 
if (level  BTRFS_MAX_LEVEL - 1)
parent = path-nodes[level + 1];
@@ -2534,7 +2527,6 @@ static noinline int __push_leaf_left(str
 {
struct btrfs_disk_key disk_key;
struct extent_buffer *right = path-nodes[0];
-   int slot;
int i;
int push_space = 0;
int push_items = 0;
@@ -2546,8 +2538,6 @@ static noinline int __push_leaf_left(str
u32 this_item_size;
u32 old_left_item_size;
 
-   slot = path-slots[1];
-
if (empty)
nr = right_nritems;
else
@@ -3239,7 +3229,6 @@ int btrfs_truncate_item(struct btrfs_tra
 {
int ret = 0;
int slot;
-   int slot_orig;
struct extent_buffer *leaf;
struct btrfs_item *item;
u32 nritems;
@@ -3249,7 +3238,6 @@ int btrfs_truncate_item(struct btrfs_tra
unsigned int size_diff;
int i;
 
-   slot_orig = path-slots[0];
leaf = path-nodes[0];
slot = path-slots[0];
 
@@ -3354,7 +3342,6 @@ int btrfs_extend_item(struct btrfs_trans
 {
int ret = 0;
int slot;
-   int slot_orig;
struct extent_buffer *leaf

[PATCH] [12/23] BTRFS: Clean up unused variables -- bugs

2010-06-10 Thread Andi Kleen

These are all the cases where a variable is set, but not
read which are really bugs.

- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things

Still needs more review.

Found by gcc 4.6's new warnings

Cc: chris.ma...@oracle.com
cc: linux-btrfs@vger.kernel.org


Signed-off-by: Andi Kleen a...@linux.intel.com

---
 fs/btrfs/dir-item.c|2 +-
 fs/btrfs/extent-tree.c |3 +--
 fs/btrfs/extent_io.c   |2 ++
 fs/btrfs/inode.c   |6 +++---
 fs/btrfs/relocation.c  |4 +++-
 fs/btrfs/tree-log.c|2 +-
 6 files changed, 11 insertions(+), 8 deletions(-)

Index: linux-2.6.35-rc2-gcc/fs/btrfs/extent-tree.c
===
--- linux-2.6.35-rc2-gcc.orig/fs/btrfs/extent-tree.c
+++ linux-2.6.35-rc2-gcc/fs/btrfs/extent-tree.c
@@ -3337,8 +3337,7 @@ struct btrfs_block_rsv *btrfs_alloc_bloc
btrfs_init_block_rsv(block_rsv);
 
alloc_target = btrfs_get_alloc_profile(root, 0);
-   block_rsv-space_info = __find_space_info(fs_info,
- BTRFS_BLOCK_GROUP_METADATA);
+   block_rsv-space_info = __find_space_info(fs_info, alloc_target);
 
return block_rsv;
 }
Index: linux-2.6.35-rc2-gcc/fs/btrfs/dir-item.c
===
--- linux-2.6.35-rc2-gcc.orig/fs/btrfs/dir-item.c
+++ linux-2.6.35-rc2-gcc/fs/btrfs/dir-item.c
@@ -427,5 +427,5 @@ int btrfs_delete_one_dir_name(struct btr
ret = btrfs_truncate_item(trans, root, path,
  item_len - sub_item_len, 1);
}
-   return 0;
+   return ret;
 }
Index: linux-2.6.35-rc2-gcc/fs/btrfs/extent_io.c
===
--- linux-2.6.35-rc2-gcc.orig/fs/btrfs/extent_io.c
+++ linux-2.6.35-rc2-gcc/fs/btrfs/extent_io.c
@@ -2825,6 +2825,8 @@ int extent_prepare_write(struct extent_i
 NULL, 1,
 end_bio_extent_preparewrite, 0,
 0, 0);
+   if (ret  !err)
+   err = ret;
iocount++;
block_start = block_start + iosize;
} else {
Index: linux-2.6.35-rc2-gcc/fs/btrfs/inode.c
===
--- linux-2.6.35-rc2-gcc.orig/fs/btrfs/inode.c
+++ linux-2.6.35-rc2-gcc/fs/btrfs/inode.c
@@ -1372,7 +1372,7 @@ int btrfs_merge_bio_hook(struct page *pa
 
if (map_length  length + size)
return 1;
-   return 0;
+   return ret;
 }
 
 /*
@@ -2672,7 +2672,7 @@ static int check_path_shared(struct btrf
 {
struct extent_buffer *eb;
int level;
-   int ret;
+   int ret = 0;
u64 refs;
 
for (level = 0; level  BTRFS_MAX_LEVEL; level++) {
@@ -2686,7 +2686,7 @@ static int check_path_shared(struct btrf
if (refs  1)
return 1;
}
-   return 0;
+   return ret; /* XXX callers? */
 }
 
 /*
Index: linux-2.6.35-rc2-gcc/fs/btrfs/tree-log.c
===
--- linux-2.6.35-rc2-gcc.orig/fs/btrfs/tree-log.c
+++ linux-2.6.35-rc2-gcc/fs/btrfs/tree-log.c
@@ -2273,7 +2273,7 @@ fail:
}
btrfs_end_log_trans(root);
 
-   return 0;
+   return err;
 }
 
 /* see comments for btrfs_del_dir_entries_in_log */
Index: linux-2.6.35-rc2-gcc/fs/btrfs/relocation.c
===
--- linux-2.6.35-rc2-gcc.orig/fs/btrfs/relocation.c
+++ linux-2.6.35-rc2-gcc/fs/btrfs/relocation.c
@@ -3098,6 +3098,8 @@ static int add_tree_block(struct reloc_c
BUG_ON(item_size != sizeof(struct btrfs_extent_item_v0));
ret = get_ref_objectid_v0(rc, path, extent_key,
  ref_owner, NULL);
+   if (ret  0)
+   return ret;
BUG_ON(ref_owner = BTRFS_MAX_LEVEL);
level = (int)ref_owner;
/* FIXME: get real generation */
@@ -4142,7 +4144,7 @@ int btrfs_reloc_clone_csums(struct inode
btrfs_add_ordered_sum(inode, ordered, sums);
}
btrfs_put_ordered_extent(ordered);
-   return 0;
+   return ret;
 }
 
 void btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Fwd: Re: Linking two files together][RFC]

2010-06-09 Thread Andi Kleen
Roberto Ragusa m...@robertoragusa.it writes:

 I hope that ideas about btrfs are not off-topic for this mailing list.

 The forwarded message below was written by me on fedora-users.
 The thread is about the ability to link two files in a manner
 similar to cat 1 2 3  rm 1 2 while avoiding any data
 movement on the disk.

OCFS2 can do this today with reflinks

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: check alloc return value before use handle and struct

2010-05-19 Thread Andi Kleen
Steven Liu lingjiujia...@gmail.com writes:
 diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
 index e9103b3..230a131 100644
 --- a/fs/btrfs/dir-item.c
 +++ b/fs/btrfs/dir-item.c
 @@ -142,6 +142,8 @@ int btrfs_insert_dir_item(struct
 btrfs_trans_handle *trans, struct btrfs_root
   key.offset = btrfs_name_hash(name, name_len);

   path = btrfs_alloc_path();
 + if (!path)
 + return -ENOMEM;


The big problem is handling state unwind in all the callers, not adding
it to the low level code. I attempted it some time ago but it's hard.

Just spewing BUG_ON() all over on memory allocation failure is not
helpful imho, that's not better than simply having clean NULL pointer
faults.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor interactive performance with I/O loads with fsync()ing

2010-04-11 Thread Andi Kleen
 Has the reason for this been identified? Judging from the nature of metadata
 loads, it would seem that it should be substantially easier to implement
 fsync() efficiently.

By design a copy on write tree fs would need to flush a whole
tree hierarchy on a sync. btrfs avoids this by using a special
log for fsync, but that causes more overhead if you have that
log on the same disk.  So IO subsystem will do more work.

It's a bit like JBD data journaling.

However it should not have the stalls inherent in ext3's journaling.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 17/18] Btrfs: Full direct I/O and AIO read implementation.

2010-03-25 Thread Andi Kleen
On Wed, Mar 24, 2010 at 11:08:07PM -0400, jim owens wrote:
 Andi Kleen wrote:
   On Tue, Mar 23, 2010 at 05:40:00PM -0400, jim owens wrote:
   Andi Kleen wrote:
  
   With checksums enabled, uncompressed reads aligned on the 4k block
   are classic direct IO to user memory except at EOF.
   
   Hmm, but what happens if the user modifies the memory in parallel?
   Would spurious checksum failures be reported then?
 
 It does put a warning in the log but it does not fail the read
 because I circumvent that by doing the failed-checksum-retry as
 a buffered read and retest.  The checksum passes and we copy
 the data to the user memory (where they can then trash it again).

Ok. That will work I guess.


 I was going to put a comment about that but felt my comment
 density was already over the btrfs style guide limit.  :) 

Hehe.

 
   Same for writing I guess (data end up on disk with wrong checksum)?
 
 Well we don't have any code done yet for writing and that was
 just one interesting challenge that needed to be solved.
 
   Those both would seem like serious flaws to me.
 
 Agree, so the write design needs to prevent bad checksums.

How? Do you have a plan for that?

 
 Read is already correct and if people do not want a log warning
 that the application is misbehaving that can be eliminated.

I guess if it's strictly rate limitted it might be ok.
-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 17/18] Btrfs: Full direct I/O and AIO read implementation.

2010-03-23 Thread Andi Kleen
On Tue, Mar 23, 2010 at 05:40:00PM -0400, jim owens wrote:
 Andi Kleen wrote:
 
  One thing that stroke me while reading is that, except for the out of line 
  no data
  checksum case, this isn't real classical zero copy direct IO because
  you always have to copy through some buffer.
 
 Uh no, unless I really messed up or don't understand what you mean.

No I misread the code: you don't set temp_pages in that path. 

 
 Uncompressed data with no checksums only buffers on an error or EOF. 
 
 With checksums enabled, uncompressed reads aligned on the 4k block
 are classic direct IO to user memory except at EOF.

Hmm, but what happens if the user modifies the memory in parallel?
Would spurious checksum failures be reported then?

Same for writing I guess (data end up on disk with wrong checksum)?

Those both would seem like serious flaws to me.

  Is there any particular reason this wasn't done? Was it because
  of aio?
  
  I know the page cache currently doesn't support that today, but
  presumably it wouldn't be too hard to add.
 
 The only reason I did not do something like that is:

Ok.

 1) I did not want to disturb the page cache with throw-away pages.
 2) uncached IO makes it even less like classic direct IO.
 3) Writing that page cache code might not be simpler.
4) aio support (although It would be cool if someone finally did
proper aio page cache code)

 As further argument against uncached IO, Chris sent a very simple
 patch up to read into page cache then purge it for btrfs direct IO
 reads and it was NACKed.

I see.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 17/18] Btrfs: Full direct I/O and AIO read implementation.

2010-03-22 Thread Andi Kleen
jim owens owens6...@gmail.com writes:

Hi Jim,

I read through large chunks of that patch. I don't claim to fully
understand all the btrfs infrastructure details enough, so it wasn't
really serious code review.

One thing that stroke me while reading is that, except for the out of line no 
data
checksum case, this isn't real classical zero copy direct IO because
you always have to copy through some buffer.

It's more like uncached IO

I was wondering that at least for those cases wouldn't it be simpler
to use the normal page cache IO path and use new hints that disable
prefetch/write-behind/caching in the page cache after the IO operation?

Is there any particular reason this wasn't done? Was it because
of aio?

I know the page cache currently doesn't support that today, but
presumably it wouldn't be too hard to add.

I guess the code would be much simpler if it only did the no checksum
case.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: file/extent checksums for dedup/sync...

2010-01-27 Thread Andi Kleen
Daniel J Blueman daniel.blue...@gmail.com writes:

 For purposes of data deduplication and data synchronisation, it would
 be a powerful tool to expose file data checksums.

 Since eg BTRFS uses the crc32c algorithm [1], it's possible to compute
 the file's overall CRC from the accumulation of the CRCs from all it's
 extents' CRCs.

 For now, exposing this via an IOCTL may be sufficient, though any
 ideas for introducing it in a more standard way? (it's a pity that
 when stat64 was introduced, reserved fields weren't added)

The problem of doing it in any standard way is that it would
hard code the way the file system does checksums in the applications.
So the file system could never change it without breaking
user space.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 00/12] btrfs core patches for direct I/O

2010-01-06 Thread Andi Kleen
jim owens jow...@hp.com writes:

 The existing core code for pagecache
 doesn't work for directio because the existing I/O routines depend
 on peeking inside a struct page to get the valid btrfs inode info.
 Sorry, no can do, we don't own that page. And rewriting the whole
 pagecache I/O stack to eliminate dependence on struct page would
 be too scary IMO.

That would simply need another passed argument in a few strategic
places, won't it? I can't imagine it would need a full rewrite

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix possible pointer dereference

2009-10-06 Thread Andi Kleen
Diego Calleja dieg...@gmail.com writes:

 We should always check btrfs_alloc_path(). Some places BUG(),
 others return -ENOMEM, btrfs_insert_dir_item() seems like it can return
 safely.

The problem is that all the callers don't handle errors.
It doesn't make sense to fix it low-level currently when it cannot be handled
properly higher up anyways.

Proper out of memory handling needs much more work, one liners
don't really help.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: potential NULL dereferences

2009-09-06 Thread Andi Kleen
Roel Kluin roel.kl...@gmail.com writes:

 Allocations may fail, prevent NULL dereferences.

 Signed-off-by: Roel Kluin roel.kl...@gmail.com
 ---
 In several sections of fs/btrfs code a kmalloc() occurs without a
 check whether it succeeded. this potentially leads to dereferences
 of a NULL pointer. Are there reasons why we do not check the
 allocations? Did I choose an incorrect way to err out? please
 review.

Yes, the erroring out needs much more work because often
the callers don't handle errors and it can need quite a lot of surgery
Until that is done it's actually better to oops than to silently
leak resources. BUG_ON(name == NULL) is also fairly useless
because it oopses anyways in a obvious way.

I had some patches to add more error handling for ENOMEM, but it's
fairly complicated. Should probably resurrect my old patchkit. It
still wasn't fully complete.

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION

2009-04-25 Thread Andi Kleen
Christoph Hellwig h...@lst.de writes:

 Add support for the standard attributes set via chattr and read vis
 lsattr.  Currently we store the attributes in the flags value in
 the btrfs inode, but I wonder whether we should split it into two so
 that we don't have to keep converting between the two formats.

The code would be nicer/shorter if you used tables for the flags and 
loops instead of open coding everything. 

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Andi Kleen
 GCC 4.3.2. Maybe i missed something obvious?

The typical use case of restrict is to tell it that multiple given
arrays are independent and then give the loop optimizer 
more freedom to handle expressions in the loop that
accesses these arrays.

Since there are no loops in the list functions nothing changed.

Ok presumably there are some other optimizations which 
rely on that alias information too, but again the list_*
stuff is probably too simple to trigger any of them.

-Andi

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Andi Kleen
 The point is that the compiler is then free to do it. If things
 slow down after the compiler gets *more* information, then that
 is a problem with the compiler heuristics rather than the
 information we give it.

The point was the -Os disables it typically then.
(not always, compiler heuristics are far from perfect)

 
  
  Then x86s tend to have very very fast L1 caches and
  if something is not in L1 on reads then the cost of fetching
  something for a read dwarfs the few cycles you can typically
  get out of this.
 
 Well most architectures have L1 caches of several cycles. And
 L2 miss typically means going to L2 which in some cases the
 compiler is expected to attempt to cover as much as possible
 (eg in-order architectures).

L2 cache is so much slower that scheduling a few instructions
more doesn't help much.

 stall, so you still want to get loads out early if possible.
 
 Even a lot of OOOE CPUs I think won't have the best alias
 anaysis, so all else being equal, it wouldn't hurt them to
 move loads earlier.

Hmm, but if the load is nearby it won't matter if a 
store is in the middle, because the CPU will just execute
over it.

The real big win is if you do some computation inbetween,
but at least for typical list manipulation there isn't 
really any.

  Also at least x86 gcc normally doesn't do scheduling 
  beyond basic blocks, so any if () shuts it up.
 
 I don't think any of this is a reason not to use restrict, though.
 But... there are so many places we could add it to the kernel, and
 probably so few where it makes much difference. Maybe it should be
 able to help some critical core code, though.

Frankly I think it would be another unlikely().

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs and swap files on SSD's ?

2009-01-20 Thread Andi Kleen
Dmitri Nikulin dniku...@gmail.com writes:

 On Wed, Jan 21, 2009 at 12:02 AM, Chris Mason chris.ma...@oracle.com wrote:
 There are patches to support swap over NFS that might make it safe to
 use on btrfs. At any rate, it is a fixable problem.

 FreeBSD has been able to run swap over NFS for as long as I can
 remember, what is different in Linux that makes it especially
 difficult?

One big traditional difference is that FreeBSD uses fixed isolated
pools for their networking buffers (that is why you had to tune most
systems for higher network workloads), while Linux has fully
unified[1] memory management including the network stack.
Now I believe recent BSD also moved to more unified network
management and it wouldn't surprise me if they had trouble
with this now too.

[1] at least for now, there are unfortunately some tendencies to move
back to fixed pools too.

 I've read that swap over non-trivial filesystems is hazardous as it
 may lead to a situation in which memory allocation can fail in the
 swap/FS code that was meant to make allocation possible again.

A lot of this has been fixed in the 2.6 timeframe (e.g. there's
now a better enforced global dirty limit), but there are likely still
corner cases that could run into difficulties, so noone is 
really declaring it 100% safe yet.

 If btrfs is to take the role of a RAID and volume manager, it would
 certainly be very useful to be able to run swap on it, since that
 frees up other volumes from an administrative standpoint.

The fixed extent mapping of the swap files is really a different problem,
independent of the memory allocation issue.

In general the memory allocation problem on write out has to be solved
in any ways (even if you don't support swap files), because any dirty 
mmap'ed file effectively acts like a swap file.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-19 Thread Andi Kleen
 The problem with 'restrict' is that almost nobody uses it, and it does 

Also gcc traditionally didn't do a very good job using it (this
might be better in the very latest versions). At least some of the 3.x
often discarded this information. 

 automatically. But it should work well as a way to get Fortran-like 
 performance from HPC workloads written in C - which is where most of the 
 people are who really want the alias analysis.

It's more than just HPC  -- a lot of code has critical loops.

  it seems like a nice opt-in thing that can be used where the aliases are 
  verified and the code is particularly performance critical...
 
 Yes. I think we could use it in the kernel, although I'm not sure how many 
 cases we would ever find where we really care. 

Very little I suspect. Also the optimizations that gcc does with this
often increase the code size. While that can be a win, with people
judging gcc's output apparently *ONLY* on the code size as seen
in this thread[1] it would obviously not compete well.

-Andi 

[1] although there are compilers around that generate smaller code
than gcc at its best.

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Andi Kleen
On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote:
  Something at the back of my mind said aliasing.
  
  $ gcc linus.c -O2 -S ; grep subl linus.s
  subl$1624, %esp
  $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
  subl$824, %esp
  
  That's with 4.3.2.
 
 Interesting. 
 
 Nonsensical, but interesting.

What I find nonsensical is that -fno-strict-aliasing generates
better code here. Normally one would expect the compiler seeing
more aliases with that option and then be more conservative
regarding any sharing. But it seems to be the other way round
here.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Andi Kleen
On Sun, Jan 11, 2009 at 11:25:32AM -0800, Linus Torvalds wrote:
 
 
 On Sun, 11 Jan 2009, Andi Kleen wrote:
  
  The proposal was to use -fno-inline-functions-called-once (but 
  the resulting numbers were not promising)
 
 Well, the _optimal_ situation would be to not need it, because gcc does a 
 good job without it. That implies trying to find a better balance between 
 worth it and causes problems. 
 
 Rigth now, it does sound like gcc simply doesn't try to balance AT ALL, or 
 balances only when we add some very version-specific random options (ie 
 the stack-usage one). 

The gcc 4.3 inliner takes stack growth into account by default (without
any special options). I experimented a bit with it when that
was introduced and found the default thresholds are too large for the kernel 
and don't change the checkstack.pl picture much.

I asked back then and was told --param large-stack-frame
is expected to be a reasonable stable --param (as much as these can
be) and I did a patch to lower it, but I couldn't get myself
to actually submit it [if you really want it I can send it]. 

But of course that only helps for gcc 4.3+, older gccs would need
a different workaround.

On the other hand (my personal opinion, not shared by everyone) is 
that the ioctl switch stack issue is mostly only a problem with 4K 
stacks and in the rare cases when I still run 32bit kernels
I never set that option because I consider it russian roulette
(because there undoutedly dangerous dynamic stack growth cases that 
checkstack.pl doesn't flag) 

 And even those options don't actually make much 
 sense - yes, they balance things, but they don't do it in a sensible 
 manner.
 
 For example: stack usage is undeniably a problem (we've hit it over and 
 over again), but it's not about stack must not be larger than X bytes. 
 
 If the call is done unconditionally, then inlining _one_ function will 
 grow the static stack usage of the function we inline into, but it will 
 _not_ grow the dynamic stack usage one whit - so deciding to not inline 
 because of stack usage is pointless.

Don't think the current inliner takes that into account from
a quick look at the sources, although it probably could.  
Maybe Honza can comment.

But even if it did it could only do that for a single file,
but if the function is in a different file gcc doesn't
have the information (unless you run with David's --combine hack).
This means the kernel developers have to do it anyways.

On the other hand I'm not sure it's that big a problem. Just someone
should run make checkstack occasionally and add noinlines to large
offenders.

-Andi

[keep quote for Honza's benefit]

 
 See? So stop inlining when you hit a stack limit IS THE WRONG THING TO 
 DO TOO! Because it just means that the compiler continues to do bad 
 inlining decisions until it hits some magical limit - but since the 
 problem isn't the static stack size of any _single_ function, but the 
 combined stack size of a dynamic chain of them, that's totally idiotic. 
 You still grew the dynamic stack, and you have no way of knowing by how 
 much - the limit on the static one simply has zero bearing what-so-ever on 
 the dynamic one.
 
 So no, limit static stack usage is not a good option, because it stops 
 inlining when it doesn't matter (single unconditional call), and doesn't 
 stop inlining when it might (lots of sequential calls, in a deep chain).
 
 The other alternative is to let gcc do what it does, but 
 
  (a) remove lots of unnecessary 'inline's. And we should likely do this 
  regardless of any -fno-inline-functions-called-once issues.
 
  (b) add lots of 'noinline's to avoid all the cases where gcc screws up so 
  badly that it's either a debugging disaster or an actual correctness 
  issue.
 
 The problem with (b) is that it's a lot of hard thinking, and debugging 
 disasters always happen in code that you didn't realize would be a problem 
 (because if you had, it simply wouldn't be the debugging issue it is).
 
   Linus
 

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread Andi Kleen
 I vote for the, get rid of the current inline, rename __always_inline to 

There is some code that absolutely requires inline for correctness,
like the x86-64 vsyscall code. I would advocate to keep the
explicit __always_inline at least there to make it very clear.

 inline, and then remove all non needed inlines from the kernel.

Most inlines in .c should be probably dropped.

 
 We'll, probably start adding a lot more noinlines.

That would cost you, see the numbers I posted (~4.1% text increase)

-Andi
-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread Andi Kleen
On Fri, Jan 09, 2009 at 10:28:01AM -0700, Matthew Wilcox wrote:
 On Fri, Jan 09, 2009 at 06:20:11PM +0100, Andi Kleen wrote:
  Also cc Honza in case he has comments (you might want
  to review more of the thread in the archives) 
 
 I think this particular bug is already known and discussed:

I thought so initially too, but:

 
 http://gcc.gnu.org/ml/gcc/2008-12/msg00365.html
 
 and it hints at being fixed with gcc 4.4.  Does anyone want to test
 that?

Hugh already tested with 4.4 and it didn't work well. At least
a lot of the low level asm inlines were not inlined.
So it looks like it's still mistuned for the kernel.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread Andi Kleen
 What happens when you say -fno-inline-functions-called-once? Does it 
 disable inlining for those functions IN GENERAL, or just for the LARGE 

It does disable it in general, unless they're marked inline explicitely :

The traditional gcc 2.x rules were:

I Only inline what is marked inline (but it can decide to
not inline)
II Also inline others heuristically only with 
-O3 / -finline-functions [which we don't set in the kernel]

Then at some point this additional rule was added:

- Also inline everything static that is only called once
[on the theory that this shrinks code size which is true
according to my measurements]

-fno-inline-functions-called once disables this new rule.
It's very well and clearly defined.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread Andi Kleen
On Fri, Jan 09, 2009 at 08:10:20PM +0100, Richard Guenther wrote:
 On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen a...@firstfloor.org wrote:
  GCC 4.3 should be good in compiling the
  kernel with default -Os settings.
 
  It's unfortunately not. It doesn't inline a lot of simple asm() inlines
  for example.
 
 Reading Ingos posting with the actual numbers states the opposite.

Hugh had some numbers upto 4.4.0 20090102 in

http://thread.gmane.org/gmane.linux.kernel/775254/focus=777231

which demonstrated the problem.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Andi Kleen
 I've done a finegrained size analysis today (see my other mail in this 
 thread), and it turns out that on gcc 4.3.x the main (and pretty much 
 only) inlining annotation that matters in arch/x86/include/asm/*.h is the 
 onliner patch attached below, annotating constant_test_bit().

That's pretty cool.

Should definitely file a gcc bug report for that though so that
they can fix gcc. Did you already do that or should I?

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread Andi Kleen
 What's the cost/benefit of that 4%? Does it actually improve performance? 
 Especially if you then want to keep DWARF unwind information in memory in 
 order to fix up some of the problems it causes? At that point, you lost 

dwarf unwind information has nothing to do with this, it doesn't tell
you anything about inlining or not inlining.  It just gives you
finished frames after all of that has been done.

Full line number information would help, but I don't think anyone 
proposed to keep that in memory.

 Does it help I$ utilization (which can speed things up a lot more, and is 
 probably the main reason -Os actually tends to perform better)? Likely 
 not. Sure, shrinking code is good for I$, but on the other hand inlining 
 can actually be bad for I$ density because if you inline a function that 
 doesn't get called, you now fragmented your footprint a lot more.

Not sure that is always true; the gcc basic block reordering 
based on its standard branch prediction heuristics (e.g.  0 or
== NULL unlikely or the unlikely macro) might well put it all out of line.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread Andi Kleen
 I thought -Os actually disabled the basic-block reordering, doesn't it?

Not in current gcc head no (just verified by stepping through) 

 
 And I thought it did that exactly because it generates bigger code and 
 much worse I$ patterns (ie you have a lot of conditional branch to other 
 place and then unconditional branch back instead of conditional branch 
 over the non-taken code.
 
 Also, I think we've had about as much good luck with guessing 
 likely/unlikely as we've had with inline ;)

That's true.

But if you look at the default heuristics that gcc has (gcc/predict.def
in the gcc sources) like == NULL,  0, branch guarding etc.
I would expect a lot of them to DTRT for the kernel.

Honza at some point even fixed goto to be unlikely after I complained :)
 
 Sadly, apart from some of the never happens error cases, the kernel 
 doesn't tend to have lots of nice patterns. We have almost no loops (well, 
 there are loops all over, but most of them we hopefully just loop over 
 once or twice in any good situation), and few really predictable things.

That actually makes us well suited to gcc, it has a relatively poor
loop optimizer compared to other compilers ;-)

 Or rather, they can easily be very predictable under one particular load, 
 and the totally the other way around under another ..

Yes that is why we got good branch predictors in CPUs I guess.

-Andi
-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Andi Kleen
On Sat, Jan 10, 2009 at 12:53:42AM +, Jamie Lokier wrote:
 Harvey Harrison wrote:
  Oh yeah, and figure out what actually breaks on alpha such that they added
  the following (arch/alpha/include/asm/compiler.h)
  
  #ifdef __KERNEL__
  /* Some idiots over in linux/compiler.h thought inline should imply
 always_inline.  This breaks stuff.  We'll include this file whenever
 we run into such problems.  */
 
 Does always_inline complain if the function isn't inlinable, while

Yes it does.

 inline allows it? 

(unless you set -Winline which the kernel doesn't) 

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Andi Kleen
On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote:
 Harvey Harrison wrote:
 
  We might still try the second or third options, as i think we shouldnt go 
  back into the business of managing the inline attributes of ~100,000 
  kernel functions.
  
  Or just make it clear that inline shouldn't (unless for a very good reason)
  _ever_ be used in a .c file.
  
 
 The question is if that would produce acceptable quality code.  In
 theory it should, but I'm more than wondering if it really will.

I actually often use noinline when developing code simply because it 
makes it easier to read oopses when gcc doesn't inline ever static
(which it normally does if it only has a single caller). You know
roughly where it crashed without having to decode the line number.

I believe others do that too, I notice it's all over btrfs for example.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Andi Kleen
On Thu, Jan 08, 2009 at 07:42:48PM -0800, Linus Torvalds wrote:
  I actually often use noinline when developing code simply because it 
  makes it easier to read oopses when gcc doesn't inline ever static
  (which it normally does if it only has a single caller)
 
 Yes. Gcc inlining is a total piece of sh*t.

The static inlining by default (unfortunately) saves a lot of text size.

For testing I built an x86-64 allyesconfig kernel with 
-fno-inline-functions-called-once (which disables the default static
inlining), and it increased text size by ~4.1% (over 2MB for a allyesconfig 
kernel). So I think we have to keep that, dropping it would cost too
much :/

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Andi Kleen
 I appreciate this is sample code, but using __get_user() on
 non-userspace pointers messes up architectures which have separate
 user/kernel spaces (eg the old 4G/4G split for x86-32).  Do we have an
 appropriate function for kernel space pointers? 

probe_kernel_address().

But it's slow.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs for mainline

2009-01-02 Thread Andi Kleen
On Fri, Jan 02, 2009 at 02:32:29PM -0500, Chris Mason wrote:
  If combination spinlocks/mutexes are really a win they should be 
  in the generic mutex framework. And I'm still dubious on the hardcoded 
  numbers.
 
 Sure, I'm happy to use a generic framework there (or help create one).
 They are definitely a win for btrfs, and show up in most benchmarks.

If they are such a big win then likely they will help other users
too and should be generic in some form.

 
  - compat.h needs to go
 
 Projects that are out of mainline have a difficult task of making sure
 development can continue until they are in mainline and being clean
 enough to merge.  I'd rather get rid of the small amount of compat code
 that I have left after btrfs is in (compat.h is 32 lines).  

It's fine for an out of tree variant, but the in tree version
shouldn't have compat.h. For out of tree you just apply a patch
that adds the includes. e.g.compat-wireless and lots of other
projects do it this way. 

 
 Yes, I tried to mark those as I did them (a very small number of
 functions).  In general they were copied to avoid adding exports, and
 that is easily fixed.
 
  - there should be manpages for all the ioctls and other interfaces.
  - ioctl.c was not explicitely root protected. security issues?
 
 Christoph added a CAP_SYS_ADMIN check to the trans start ioctl, but I do
 need to add one to the device add/remove/balance code as well.

Ok. Didn't see that.

It still needs to be carefully audited for security holes 
even with root checks.

Another thing is that once auto mounting is enabled each usb stick
with btrfs on it could be a root hole if you have buffer overflows
somewhere triggerable by disk data. I guess that would need some
checking too.

 The subvol/snapshot creation is meant to be user callable (controlled by
 something similar to quotas later on).

But right now that's not there so it should be root only.

 
   Also there used to be a lot of BUG_ON()s on
  memory allocation failure even.
  - In general BUG_ONs need review I think. Lots of externally triggerable
  ones.
  - various checkpath.pl level problems I think (e.g. printk levels) 
  - the printks should all include which file system they refer to
  
  In general I think the whole thing needs more review.
 
 I don't disagree, please do keep in mind that I'm not suggesting anyone
 use this in production yet.

When it's in mainline I suspect people will start using it for that.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Some very basic questions

2008-10-21 Thread Andi Kleen
Chris Mason [EMAIL PROTECTED] writes:

 Started interactively?  I'm not entirely sure what that means, but in
 general when you ask the user a question about if/how to fix a
 corruption, they will have no idea what the correct answer is.

While that's true today, I'm not sure it has to be true always.
I always thought traditional fsck user interfaces were a
UI desaster and could be done much better with some simple tweaks. 

For example the fsck could present the user a list of files that ended
up in lost+found and let them examine them, instead of asking a lot of
useless questions. Or it could give a high level summary on how many
files in which part of the directory tree were corrupted. etc.etc.  Or
it could default to a high level mode that only gives such high level
information to the user.

So I don't think all corruptions could be done perfectly user
friendly, but at least the basic user friendliness in many
situations could be much improved.

-Andi


-- 
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data-deduplication?

2008-10-15 Thread Andi Kleen
On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
 Andi Kleen wrote:
 Ray Van Dolson [EMAIL PROTECTED] writes:
 
   
 I recall their being a thread here a number of months back regarding
 data-deduplication support for bttfs.
 
 Did anyone end up picking that up and giving a go at it?  Block level
 data dedup would be *awesome* in a Linux filesystem.  It does wonders
 for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
 have this feature yet (although I've read discussions on them looking
 to add it).
 
 
 There are some patches to do in QEMU's cow format for KVM. That's
 user level only.
   
 
 And thus, doesn't work for sharing between different images, especially 
 at runtime. 

It would work if the images are all based once on a reference image, won't it?
I would imagine that's the common situation for installing lots of VMs.

 I'd really, really [any number of reallies], really like to 
 see btrfs deduplication.

Sure it would be useful for a couple of things.

-Andi

-- 
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data-deduplication?

2008-10-13 Thread Andi Kleen
Ray Van Dolson [EMAIL PROTECTED] writes:

 I recall their being a thread here a number of months back regarding
 data-deduplication support for bttfs.

 Did anyone end up picking that up and giving a go at it?  Block level
 data dedup would be *awesome* in a Linux filesystem.  It does wonders
 for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
 have this feature yet (although I've read discussions on them looking
 to add it).

There are some patches to do in QEMU's cow format for KVM. That's
user level only.

-Andi
-- 
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: avoid NULL deref after failed allocation

2008-10-02 Thread Andi Kleen
Jim Meyering [EMAIL PROTECTED] writes:

 However, in some places, the trend is
 to BUG_ON(!ptr), so I've done that, too.

Even if it's a trend, it's wrong.  Better don't add more. 
And also unnecessary because next reference will obviously oops anyways.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs_tree_lock trylock

2008-09-08 Thread Andi Kleen
On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote:
 On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote:
   The idea is to try to spin for a bit to avoid scheduling away, which is
   especially important for the high levels.  Most holders of the mutex
   let it go very quickly.
  
  Ok but that surely should be implemented in the general mutex code then
  or at least in a standard adaptive mutex wrapper? 
 
 That depends, am I the only one crazy enough to think its a good idea?

Adaptive mutexes are classic, a lot of other OS have it.

Gregory et.al. also saw big improvements in the RT kernel (they 
posted a patchkit a few times) 

But a lot of people don't like them for some reason. Anyways
hiding them in a fs is probably wrong.

-Andi
-- 
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New feature Idea

2008-08-13 Thread Andi Kleen
Morey Roof [EMAIL PROTECTED] writes:

 I have been thinking about a new feature to start work on that I am
 interested in and I was hoping people could give me some feedback and
 ideas of how to tackle it.  Anyways, I want to create a data
 deduplication system that can work in two different modes.  One mode
 is that when the system is idle or not beyond a set load point a
 background process would scan the volume for duplicate blocks.  The
 other mode would be used for systems that are nearline or backup
 systems that don't really care about the performance and it would do
 the deduplication during block allocation.

Seems like a special case of compression? Perhaps compression would help
more?

 One of the ways I was thinking of to find the duplicate blocks would
 be to use the checksums as a quick compare.  If the checksums match
 then do a complete compare before adjusting the nodes on the files.
 However, I believe that I will need to create a tree based on the
 checksum values.

If you really want to do deduplication: It might be advantageous to do 
this on larger units.

If you assume that data is usually shared between similar files (which
is a reasonable assumption) and do the deduplication on whole files
you can also use the size as an index and avoid checksumming all files
with a unique size.  I wrote a user level duplicated file checker some
time ago that used this trick successfully.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs v0.16 released

2008-08-08 Thread Andi Kleen
 So, the mirroring turns a single large write into two large writes.
 Definitely not free, but always a fixed cost.

Thanks for the explanation and the numbers. I see that's the advantage of 
copy-on-write that you can actually always cluster the metadata together and 
get always batched IO this way and then afford to do more of it. 

Still wondering what that will do to read seekiness.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs v0.16 released

2008-08-07 Thread Andi Kleen
Chris Mason [EMAIL PROTECTED] writes:

 Metadata is duplicated by default even on single spindle drives, 

Can you please say a bit how much that impacts performance? That sounds 
costly.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html