Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs
For the btrfs extent cache it's unclear if just RCUing is a good fit anyways: some workloads are very write heavy and RCU only performs well if you have a lot more reads than writes. For write heavy RCUification usually slows it down. FWIW, I'm mentioning this out of self interest - I need a RCU safe tree structure to index extents for lookless lookups in the XFS buffer cache, but I've got a long list of things to do before I get to it. If someone else implements the tree, that's most of the work done for me. :) FWIW there are fine grained rbtrees in papers too, but they are too fine grained imho: you may need to take a large number of locks for a single traversal. While atomics got a lot cheaper recently they are still somewhat expensive and you don't want too many of them in your fast path. Also I found when there is actual contention having too many bouncing locks is quite bad because the latencies of passing the cache lines around really add up. In these cases uses less fine locks is better. Mathieu also did RCU rbtrees but they are quite complicated. IMHO we would like to have something inbetween a global tree lock and a fully fine grained tree where the lock complexity cannot get out of band. May need a separate data structure for the locks. I don't have a leading candidate for that currently. There are some variants of rbtrees that are less strict and have a simpler rebalance which are interesting. But also some other tree data structures. Needs more work. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs
Liu Bo liubo2...@cn.fujitsu.com writes: Here we choose extent_map firstly, since it is a read mostly thing, and the change is quite direct, all we need to do is a) to replace rbtree with skiplist, b) to add rcu support. And more details are in patch 2 and patch 3. I've done some simple tests for performance on my 2-core box, there is no obvious difference, but I want to focus on the design side and make sure there is no more bug in it firstly. For long term goals, we want to ship skiplist to lib, like lib/rbtree.c. I looked at skiplists some time ago. What made them awkward for kernel use is that you have to size the per node skip array in advance and it's hard to resize. So you have a node that wastes memory in common small cases, but still degenerates to linked list on very large sizes. With fine grained locking it gets even worse because the nodes get larger. Con didn't worry about this problem because he assumed the scheduler run queues never could get too long. But for a very scalable subsystem that's definitely a problem. I think skiplists are not a good fit here. At least in our tests the older style trees got a lot better with Chris' recent locking improvements. Now replacing rbtrees is probably still a good idea, but not convinced skiplist are suitable here. There were various other tree alternatives with better locking. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs
On 01/13/2012 05:28 AM, Andi Kleen wrote: Liu Bo liubo2...@cn.fujitsu.com writes: Here we choose extent_map firstly, since it is a read mostly thing, and the change is quite direct, all we need to do is a) to replace rbtree with skiplist, b) to add rcu support. And more details are in patch 2 and patch 3. I've done some simple tests for performance on my 2-core box, there is no obvious difference, but I want to focus on the design side and make sure there is no more bug in it firstly. For long term goals, we want to ship skiplist to lib, like lib/rbtree.c. I looked at skiplists some time ago. What made them awkward for kernel use is that you have to size the per node skip array in advance and it's hard to resize. So you have a node that wastes memory in common small cases, but still degenerates to linked list on very large sizes. With fine grained locking it gets even worse because the nodes get larger. Con didn't worry about this problem because he assumed the scheduler run queues never could get too long. But for a very scalable subsystem that's definitely a problem. I think skiplists are not a good fit here. At least in our tests the older style trees got a lot better with Chris' recent locking improvements. Now replacing rbtrees is probably still a good idea, but not convinced skiplist are suitable here. There were various other tree alternatives with better locking. Hi Andi, I know what you're worried about, that still keeps biting me, too. ;) Here we decide to make such an experiment of skiplist, since we have some in-memory data structures that are dominated by reads, and what we want to try is to apply RCU, a lockless read scheme, on them. Yes, skiplist is not good enough for kernel use, but maybe RCU-skiplist can be a candidate. According to RCU semantic, once a RCU-skiplist is built, the read most thing can ensure us that the whole skiplist will remain almost unchanged while running. Thus, to some extent, we do not need to resize the nodes frequently. So what do you think about this? :) thanks, liubo -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs
On Fri, Jan 13, 2012 at 10:18:06AM +0800, Liu Bo wrote: On 01/13/2012 05:28 AM, Andi Kleen wrote: Liu Bo liubo2...@cn.fujitsu.com writes: Here we choose extent_map firstly, since it is a read mostly thing, and the change is quite direct, all we need to do is a) to replace rbtree with skiplist, b) to add rcu support. And more details are in patch 2 and patch 3. I've done some simple tests for performance on my 2-core box, there is no obvious difference, but I want to focus on the design side and make sure there is no more bug in it firstly. For long term goals, we want to ship skiplist to lib, like lib/rbtree.c. I looked at skiplists some time ago. What made them awkward for kernel use is that you have to size the per node skip array in advance and it's hard to resize. So you have a node that wastes memory in common small cases, but still degenerates to linked list on very large sizes. With fine grained locking it gets even worse because the nodes get larger. But for a very scalable subsystem that's definitely a problem. I think skiplists are not a good fit here. Now replacing rbtrees is probably still a good idea, but not convinced skiplist are suitable here. There were various other tree alternatives with better locking. Hi Andi, I know what you're worried about, that still keeps biting me, too. ;) Here we decide to make such an experiment of skiplist, since we have some in-memory data structures that are dominated by reads, and what we want to try is to apply RCU, a lockless read scheme, on them. Yes, skiplist is not good enough for kernel use, but maybe RCU-skiplist can be a candidate. According to RCU semantic, once a RCU-skiplist is built, the read most thing can ensure us that the whole skiplist will remain almost unchanged while running. Thus, to some extent, we do not need to resize the nodes frequently. So what do you think about this? :) I don't think RCU lookups matter here - it's the fact that the skiplist needs to be a pre-determined size that is the problem because one size does not fit all users. If you want a RCU-based tree structure for extent lookups, then an RCU aware btree is a better bet. Dynamic resizing can be done in an RCU aware manner (the radix trees do it) so you should probably take the lib/btree.c code and look to making that RCU safe. IIRC, the implementation was based on a RCU-btree prototype so maybe you might want to read up on that first: http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch FWIW, I'm mentioning this out of self interest - I need a RCU safe tree structure to index extents for lookless lookups in the XFS buffer cache, but I've got a long list of things to do before I get to it. If someone else implements the tree, that's most of the work done for me. :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs
Since we are inclined to apply a lockless scheme on some objects of btrfs for higher performance, we want to build a RCU version the Probabilistic Skiplist. Here our skiplist algorithm is based on the skiplist experiments of Con Kolivas ker...@kolivas.org for BFS cpu scheduler. And more details about skiplist design are in patch 1. Right now we have a plan to apply skiplist on extent_map and extent_state. Here we choose extent_map firstly, since it is a read mostly thing, and the change is quite direct, all we need to do is a) to replace rbtree with skiplist, b) to add rcu support. And more details are in patch 2 and patch 3. I've done some simple tests for performance on my 2-core box, there is no obvious difference, but I want to focus on the design side and make sure there is no more bug in it firstly. For long term goals, we want to ship skiplist to lib, like lib/rbtree.c. MORE TESTS ARE WELCOME! --- changes v2: - fix a bug reported by David Sterba d...@jikos.cz, thanks a lot! - use mutex lock to protect extent_map updater side, so that we can make the reclaim code much easier. And I've ran through xfstests, no panic occurred but they failed at 273 and 274, and I've tested them without my patches and found that they still fails on the upstream. --- Liu Bo (3): Btrfs: add the Probabilistic Skiplist Btrfs: rebuild extent_map based on skiplist Btrfs: convert rwlock to RCU for extent_map fs/btrfs/Makefile |2 +- fs/btrfs/compression.c |8 +- fs/btrfs/disk-io.c |9 +- fs/btrfs/extent_io.c | 13 +-- fs/btrfs/extent_map.c | 278 +++- fs/btrfs/extent_map.h | 19 +++- fs/btrfs/file.c| 11 +- fs/btrfs/inode.c | 28 +++--- fs/btrfs/ioctl.c |8 +- fs/btrfs/relocation.c |4 +- fs/btrfs/scrub.c |4 +- fs/btrfs/skiplist.c| 101 + fs/btrfs/skiplist.h| 217 + fs/btrfs/volumes.c | 58 +- 14 files changed, 585 insertions(+), 175 deletions(-) create mode 100644 fs/btrfs/skiplist.c create mode 100644 fs/btrfs/skiplist.h -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html