Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-13 Thread Andi Kleen

For the btrfs extent cache it's unclear if just RCUing is a good
fit anyways: some workloads are very write heavy and RCU only
performs well if you have a lot more reads than writes.
For write heavy RCUification usually slows it down.

 FWIW, I'm mentioning this out of self interest - I need a RCU safe
 tree structure to index extents for lookless lookups in the XFS
 buffer cache, but I've got a long list of things to do before I get
 to it. If someone else implements the tree, that's most of the work
 done for me. :)

FWIW there are fine grained rbtrees in papers too, but they are too fine 
grained imho: you may need to take a large number of locks for a single 
traversal. While atomics got a lot cheaper recently they are still
somewhat expensive and you don't want too many of them in your 
fast path. Also I found when there is actual contention having
too many bouncing locks is quite bad because the latencies of passing
the cache lines around really add up. In these cases uses less fine
locks is better.

Mathieu also did RCU rbtrees but they are quite complicated.

IMHO we would like to have something inbetween a global tree lock and a
fully fine grained tree where the lock complexity cannot get out of band.
May need a separate data structure for the locks.

I don't have a leading candidate for that currently.

There are some variants of rbtrees that are less strict and have
a simpler rebalance which are interesting. But also some
other tree data structures. Needs more work.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-12 Thread Andi Kleen
Liu Bo liubo2...@cn.fujitsu.com writes:

 Here we choose extent_map firstly, since it is a read mostly thing,
 and the change is quite direct, all we need to do is
 a) to replace rbtree with skiplist,
 b) to add rcu support.
 And more details are in patch 2 and patch 3.

 I've done some simple tests for performance on my 2-core box, there is no
 obvious difference, but I want to focus on the design side and make sure
 there is no more bug in it firstly.

 For long term goals, we want to ship skiplist to lib, like lib/rbtree.c.

I looked at skiplists some time ago. What made them awkward for kernel
use is that you have to size the per node skip array in advance and it's
hard to resize. So you have a node that wastes memory in common small
cases, but still degenerates to linked list on very large sizes.
With fine grained locking it gets even worse because the nodes get larger.

Con didn't worry about this problem because he assumed the scheduler
run queues never could get too long.

But for a very scalable subsystem that's definitely a problem.

I think skiplists are not a good fit here.

At least in our tests the older style trees got a lot better with Chris'
recent locking improvements. 

Now replacing rbtrees is probably still a good idea, but not convinced
skiplist are suitable here. There were various other tree alternatives
with better locking.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-12 Thread Liu Bo
On 01/13/2012 05:28 AM, Andi Kleen wrote:
 Liu Bo liubo2...@cn.fujitsu.com writes:
 Here we choose extent_map firstly, since it is a read mostly thing,
 and the change is quite direct, all we need to do is
 a) to replace rbtree with skiplist,
 b) to add rcu support.
 And more details are in patch 2 and patch 3.

 I've done some simple tests for performance on my 2-core box, there is no
 obvious difference, but I want to focus on the design side and make sure
 there is no more bug in it firstly.

 For long term goals, we want to ship skiplist to lib, like lib/rbtree.c.
 
 I looked at skiplists some time ago. What made them awkward for kernel
 use is that you have to size the per node skip array in advance and it's
 hard to resize. So you have a node that wastes memory in common small
 cases, but still degenerates to linked list on very large sizes.
 With fine grained locking it gets even worse because the nodes get larger.
 
 Con didn't worry about this problem because he assumed the scheduler
 run queues never could get too long.
 
 But for a very scalable subsystem that's definitely a problem.
 
 I think skiplists are not a good fit here.
 
 At least in our tests the older style trees got a lot better with Chris'
 recent locking improvements. 
 
 Now replacing rbtrees is probably still a good idea, but not convinced
 skiplist are suitable here. There were various other tree alternatives
 with better locking.
 

Hi Andi,

I know what you're worried about, that still keeps biting me, too. ;)

Here we decide to make such an experiment of skiplist, since we have some
in-memory data structures that are dominated by reads, and what we want to
try is to apply RCU, a lockless read scheme, on them.

Yes, skiplist is not good enough for kernel use, but maybe RCU-skiplist can
be a candidate.

According to RCU semantic, once a RCU-skiplist is built, the read most thing
can ensure us that the whole skiplist will remain almost unchanged while 
running.
Thus, to some extent, we do not need to resize the nodes frequently.

So what do you think about this? :)

thanks,
liubo

 -Andi
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-12 Thread Dave Chinner
On Fri, Jan 13, 2012 at 10:18:06AM +0800, Liu Bo wrote:
 On 01/13/2012 05:28 AM, Andi Kleen wrote:
  Liu Bo liubo2...@cn.fujitsu.com writes:
  Here we choose extent_map firstly, since it is a read mostly thing,
  and the change is quite direct, all we need to do is
  a) to replace rbtree with skiplist,
  b) to add rcu support.
  And more details are in patch 2 and patch 3.
 
  I've done some simple tests for performance on my 2-core box, there is no
  obvious difference, but I want to focus on the design side and make sure
  there is no more bug in it firstly.
 
  For long term goals, we want to ship skiplist to lib, like lib/rbtree.c.
  
  I looked at skiplists some time ago. What made them awkward for kernel
  use is that you have to size the per node skip array in advance and it's
  hard to resize. So you have a node that wastes memory in common small
  cases, but still degenerates to linked list on very large sizes.
  With fine grained locking it gets even worse because the nodes get larger.

  But for a very scalable subsystem that's definitely a problem.
  
  I think skiplists are not a good fit here.

  Now replacing rbtrees is probably still a good idea, but not convinced
  skiplist are suitable here. There were various other tree alternatives
  with better locking.
  
 
 Hi Andi,
 
 I know what you're worried about, that still keeps biting me, too. ;)
 
 Here we decide to make such an experiment of skiplist, since we have some
 in-memory data structures that are dominated by reads, and what we want to
 try is to apply RCU, a lockless read scheme, on them.
 
 Yes, skiplist is not good enough for kernel use, but maybe RCU-skiplist can
 be a candidate.
 
 According to RCU semantic, once a RCU-skiplist is built, the read most thing
 can ensure us that the whole skiplist will remain almost unchanged while 
 running.
 Thus, to some extent, we do not need to resize the nodes frequently.
 
 So what do you think about this? :)

I don't think RCU lookups matter here - it's the fact that the
skiplist needs to be a pre-determined size that is the problem
because one size does not fit all users.

If you want a RCU-based tree structure for extent lookups, then an
RCU aware btree is a better bet. Dynamic resizing can be done in an
RCU aware manner (the radix trees do it) so you should probably take
the lib/btree.c code and look to making that RCU safe.  IIRC, the
implementation was based on a RCU-btree prototype so maybe you might
want to read up on that first:

http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch

FWIW, I'm mentioning this out of self interest - I need a RCU safe
tree structure to index extents for lookless lookups in the XFS
buffer cache, but I've got a long list of things to do before I get
to it. If someone else implements the tree, that's most of the work
done for me. :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-09 Thread Liu Bo
Since we are inclined to apply a lockless scheme on some objects of btrfs for
higher performance, we want to build a RCU version the Probabilistic Skiplist.

Here our skiplist algorithm is based on the skiplist experiments of
Con Kolivas ker...@kolivas.org for BFS cpu scheduler.
And more details about skiplist design are in patch 1.

Right now we have a plan to apply skiplist on extent_map and extent_state.

Here we choose extent_map firstly, since it is a read mostly thing,
and the change is quite direct, all we need to do is
a) to replace rbtree with skiplist,
b) to add rcu support.
And more details are in patch 2 and patch 3.

I've done some simple tests for performance on my 2-core box, there is no
obvious difference, but I want to focus on the design side and make sure
there is no more bug in it firstly.

For long term goals, we want to ship skiplist to lib, like lib/rbtree.c.

MORE TESTS ARE WELCOME!

---
changes v2:
- fix a bug reported by David Sterba d...@jikos.cz, thanks a lot!
- use mutex lock to protect extent_map updater side, so that we can make
  the reclaim code much easier.
  And I've ran through xfstests, no panic occurred but they failed at
  273 and 274, and I've tested them without my patches and
  found that they still fails on the upstream.
---

Liu Bo (3):
  Btrfs: add the Probabilistic Skiplist
  Btrfs: rebuild extent_map based on skiplist
  Btrfs: convert rwlock to RCU for extent_map

 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |8 +-
 fs/btrfs/disk-io.c |9 +-
 fs/btrfs/extent_io.c   |   13 +--
 fs/btrfs/extent_map.c  |  278 +++-
 fs/btrfs/extent_map.h  |   19 +++-
 fs/btrfs/file.c|   11 +-
 fs/btrfs/inode.c   |   28 +++---
 fs/btrfs/ioctl.c   |8 +-
 fs/btrfs/relocation.c  |4 +-
 fs/btrfs/scrub.c   |4 +-
 fs/btrfs/skiplist.c|  101 +
 fs/btrfs/skiplist.h|  217 +
 fs/btrfs/volumes.c |   58 +-
 14 files changed, 585 insertions(+), 175 deletions(-)
 create mode 100644 fs/btrfs/skiplist.c
 create mode 100644 fs/btrfs/skiplist.h

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html