Fwd: Examining the VM splay tree effectiveness

2010-10-04 Thread Mayur
By mistake the last mail went to hackers; sending it to freebsd-current.

--
Hi,
Sorry for pitching in late on the discussion.
It looks like I have missed many mails, so I will start from this point in
the discussion.



I just took a look (not in-depth though) at the patch and can't follow
your conclusion. It is not ready to go into svn in its current state.

Even though it is called a radix trie it doesn't look like one.  On first
impression it looks much more like an AA tree.  A radix trie, which we
already have in our network routing table code, is a variable length
(mask) tree that does path compression.  See net/radix.[ch] and


The data structure is just like a page table where the index is chopped
into sub-indices; the i-th sub-index is the offset into the corresponding
node on the (i+1)-th level.  The difference between the Wikipedia
description of a radix tree and our implementation is that nodes with a
single child are not merged with the parent (level compression).  Also, we
started with variable-size slots (sub-indices) but Jeff thought it was a
little too flexible.  The current implementation works with fixed-size
sub-indices (the size is a multiple of sizeof(original index)).
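
A minimal sketch of that walk, under my own assumptions (the names and the
4-bit sub-index width are hypothetical, not taken from the actual patch):

#include <stddef.h>
#include <stdint.h>

#define RT_BITS_PER_LEVEL 4                 /* hypothetical fixed sub-index width */
#define RT_SLOTS          (1 << RT_BITS_PER_LEVEL)
#define RT_MASK           (RT_SLOTS - 1)

struct rt_node {
    void *slot[RT_SLOTS];   /* interior levels: child nodes; last level: values */
};

/*
 * Page-table-style walk: peel RT_BITS_PER_LEVEL bits off the index at
 * each level.  The index is never stored in the tree; it is implicit in
 * the path taken from the root down to the value.
 */
static void *
rt_lookup(struct rt_node *root, uint64_t index, int levels)
{
    struct rt_node *node = root;
    int level;

    for (level = levels - 1; level > 0; level--) {
        if (node == NULL)
            return (NULL);
        node = node->slot[(index >> (level * RT_BITS_PER_LEVEL)) & RT_MASK];
    }
    return (node != NULL ? node->slot[index & RT_MASK] : NULL);
}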

About the benchmark: it works on uniformly random numbers.  The objective
was to optimize random accesses on large objects (~4G-2T).

Take a look at radix_tree_shrink(); this function reduces the depth of the
tree (if possible).  It is a special case of path compression.
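
Roughly, and this is only my guess at the mechanism rather than the code
from the patch (reusing the definitions from the sketch above), such a
shrink step could look like:

struct rt_root {
    struct rt_node *rn_root;
    int             rn_height;  /* number of levels in the tree */
};

static void
rt_shrink(struct rt_root *t)
{
    struct rt_node *node;
    int i;

    while (t->rn_height > 1) {
        node = t->rn_root;
        if (node->slot[0] == NULL)
            break;
        for (i = 1; i < RT_SLOTS; i++)
            if (node->slot[i] != NULL)
                return;              /* larger indices still need this level */
        t->rn_root = node->slot[0];  /* the only child becomes the new root */
        t->rn_height--;
        /* the old root node would be returned to its pool here */
    }
}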


http://en.wikipedia.org/wiki/Radix_tree
Extrapolating in a complete guesstimating way from the lookup function
I'd say it may perform only slightly better in an ideal case than a RB
tree but with the added overall expense of requiring external memory to
store the index and branch nodes.  This is probably a nasty disadvantage.
I don't get you, please elaborate.  The way the tree is structured we do not
need space to store the index (nor does a radix trie); it is implicit from
the position of the "value" corresponding to the index.


Thanks,
Mayur


Re: Examining the VM splay tree effectiveness

2010-10-03 Thread Ivan Voras
On 10/01/10 10:54, Andre Oppermann wrote:
> On 30.09.2010 19:51, Ivan Voras wrote:
>> On 09/30/10 18:37, Andre Oppermann wrote:
>>
>>> Both the vmmap and page table make use of splay trees to manage the
>>> entries and to speed up lookups compared to long to traverse linked
>>> lists or more memory expensive hash tables.  Some structures though
>>> do have an additional linked list to simplify ordered traversals.
>>
>> The property of splay tree requiring *writes* for nearly every read
>> really is a thorn in the eye for SMP. It seems to me that even if the
>> immediate benefits from converting to something else are not directly
>> observable, it will still be worth doing it.
> 
> Fully agreed.
> 
>> It's a shame that RCU is still a patent minefield :/
>>
>> http://mirror.leaseweb.com/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf
>>
> 
> I'm not convinced that RCU is the only scalable way of sharing a
> data structure across a possibly large number of CPU's.

Of course, it's just well understood currently.

> The term "lockless" is often used and frequently misunderstood.
> Having a lockless data structure *doesn't* mean that it is either
> performant, scalable or both.  It heavily depends on a number
> of additional factors.  Many times "lockless" just replaces a
> simple lock/unlock cycle with a number of atomic operations on
> the data structure.  This can easily backfire because an atomic

Yes.

> operation just hides the computational complexity and also dirties
> the CPU's cache lines.  Generally on cache coherent architectures
> almost all of the atomic operation is in HW with bus lock cycles,
> bus snooping and whatnot.  While seemingly simple from the programmer's
> point of view, the overhead and latency is still there.  Needless

Yes, you basically just offload the operation to hardware but the steps
it needs to make are the same in concept.

>  a) make sure the lock is held for only a small amount of time
> to avoid lock contention.
>  b) do everything you can outside of the lock.
>  c) if the lock is found to be heavily contended rethink the
> whole approach and check if other data structures can be used.
>  d) minimize write accesses to memory in the lock protected
> shared data structure.
>  e) PROFILE, DON'T SPECULATE! Measure the access pattern and
> measure the locking/data access strategy's cost in terms
> of CPU cycles consumed.
> 
>  f) on lookup heavy data structures avoid writing to memory and
> by it dirtying CPU caches.
>  g) on modify heavy data structures avoid touching too many
> elements.
>  h) on lookup and modify heavy data structure that are used
> across many CPU's all bets are off and a different data
> structure approach should be considered resulting ideally
> in case f).
> 
> It all starts with the hypothesis that a data structure is not
> optimally locked.

This looks like material for a developer-centric wiki page :) There is a
lot of dispersed wisdom in this thread which it would be nice to gather
in one place.





Re: Examining the VM splay tree effectiveness

2010-10-01 Thread Ed Schouten
Andre,

* Andre Oppermann  wrote:
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree

Even though a red-black tree is quite good since it guarantees a $2 \log
n$ upper bound, the problem is that it's quite computationally intensive.

Maybe it would be worth looking at other types of balanced trees? For
example, another type of tree which has only $O(\log n)$ amortized
insertion/removal/lookup time, but could already be a lot better in
practice, is a Treap.
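
For reference, a treap keeps nodes in BST order on the key and in max-heap
order on a random per-node priority; a minimal illustrative sketch (not an
existing kernel API) of the insert path:

#include <stddef.h>
#include <stdint.h>

struct treap_node {
    uintptr_t          key;
    unsigned int       prio;        /* caller assigns a random priority */
    struct treap_node *left, *right;
};

static struct treap_node *
rotate_right(struct treap_node *n)
{
    struct treap_node *l = n->left;

    n->left = l->right;
    l->right = n;
    return (l);
}

static struct treap_node *
rotate_left(struct treap_node *n)
{
    struct treap_node *r = n->right;

    n->right = r->left;
    r->left = n;
    return (r);
}

/*
 * Ordinary BST insert followed by rotations that restore the heap order
 * on 'prio'; the random priorities keep the expected depth at O(log n).
 * The new node 'n' must come in with left == right == NULL.
 */
static struct treap_node *
treap_insert(struct treap_node *root, struct treap_node *n)
{
    if (root == NULL)
        return (n);
    if (n->key < root->key) {
        root->left = treap_insert(root->left, n);
        if (root->left->prio > root->prio)
            root = rotate_right(root);
    } else {
        root->right = treap_insert(root->right, n);
        if (root->right->prio > root->prio)
            root = rotate_left(root);
    }
    return (root);
}

Lookups are plain BST searches with no writes, which is the interesting
property in this context.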

Greetings,
-- 
 Ed Schouten 
 WWW: http://80386.nl/




Re: Examining the VM splay tree effectiveness

2010-10-01 Thread Andre Oppermann

On 01.10.2010 06:49, Matthew Dillon wrote:

 I don't remember the reference but I read a comprehensive comparison
 between various indexing methods about a year ago and the splay tree
 did considerably better than a RB-tree.  The RB-tree actually did
 fairly poorly.


It heavily depends on the access pattern of the data structure.  Is
it lookup, modify or insert/delete dominated?  Or a mix of any
of them.  How heavily is the data structure shared across CPU's?
Without this information it is impossible to make a qualified choice.

Making general comparative statements on indexing methods without
taking the access pattern and SMP/CPU cache behavior into account
is going to lead to the wrong approach 90% of the time. (Made that number
up ;-)


 Any binary tree-like structure makes fairly poor use of cpu caches.
 Splay trees work somewhat better as long as there is some locality
 of reference in the lookups, since the node being looked up is moved
 to the root.  It isn't a bad trade-off.


Again, it hugely depends on how good the locality is and how expensive
the CPU cache line dirtying of the splay rotation is.  You can quickly
fall off an amortization *cliff* here.

I agree with binary tree structures being a bit less optimal for CPU
caches because the tree node is embedded with the data element.  On
the plus side not many other data structures are either.  And as long
as memory is only read it can be cached on multiple CPU's.  Touching
it throws it out everywhere else and causes a high latency memory access
on the next read.


 On the negative side all binary-tree-like structures tend to be
 difficult to MP-lock in a fine-grained manner (not that very many
 structures are locked at that fine a grain anyway, but it's a data
 point).  Splay-trees are impossible to lock at a fine-grain due to
 the massive modifications made on any search and the movement
 of the root, and there are MP access issues too.


I doubt that fine grained locking of such data structures is beneficial
in many cases.  Fine grained locking implies more locks, more bus lock
cycles, more memory barriers and more CPU cache dirtying.  As long as
a data structure's global lock is not significantly contended on and
based on that a finer locking doesn't lead to parallel operation it just
creates a lot of overhead for nothing.


 --

 What turned out to be the best indexing mechanism was a chained
 hash table whose hoppers were small linear arrays instead of single
 elements.  So instead of pointer-chaining each element you have a small
 for() loop for 4-8 elements before you chain.  The structure being
 indexed would NOT be integrated into the index directly, the index
 would point at the final structure from the hopper.


This makes a lot of sense if the index is sufficiently small, let's say
one or two ints.  When you go beyond that the advantage quickly fades
away.


 For our purposes such linear arrays would contain a pointer and
 an indexing value in as small an element as possible (8-16 bytes),
 the idea being that you make the absolute best use of your cache line
 and L1 cache / memory burst.  One random access (initial hash index),
 then linear accesses using a small indexing element, then one final
 random access to get to the result structure and validate that
 it's the one desired (at that point we'd be 99.9% sure that we have
 the right structure because we have already compared the index value
 stored in the hopper).  As a plus the initial hash index also makes
 MP locking the base of the chains easier.


Agreed under the premise that the access pattern fits this behavior.


 I don't use arrayized chained hash tables in DFly either, but only
 because it's stayed fairly low on the priority list.  cpu isn't really
 a major issue on systems these days.  I/O is the bigger problem.
 RB-Trees are simply extremely convenient from the programming side,
 which is why we still use them.


Agreed with the emphasis on including lock/atomic cycles and CPU
cache hierarchy into I/O.


 I was surprised that splay trees did so well vs RB-trees, I never liked
 the memory writes splay trees do let alone the impossibility of
 fine-grained locking.  Originally I thought the writes would make
 performance worse, but they actually don't.  Still, if I were to change
 any topologies now I would definitely go with the chained-hash /
 small-linear-array / chain / small-linear-array / chain mechanic.  It
 seems to be the clear winner.


Without first studying the access pattern and applying it to
the various data structures this is a very speculative statement
to make.

--
Andre

Re: Examining the VM splay tree effectiveness

2010-10-01 Thread Andre Oppermann

On 30.09.2010 19:51, Ivan Voras wrote:

On 09/30/10 18:37, Andre Oppermann wrote:


Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long to traverse linked
lists or more memory expensive hash tables.  Some structures though
do have an additional linked list to simplify ordered traversals.


The property of splay tree requiring *writes* for nearly every read
really is a thorn in the eye for SMP. It seems to me that even if the
immediate benefits from converting to something else are not directly
observable, it will still be worth doing it.


Fully agreed.


It's a shame that RCU is still a patent minefield :/

http://mirror.leaseweb.com/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf


I'm not convinced that RCU is the only scalable way of sharing a
data structure across a possibly large number of CPU's.

The term "lockless" is often used and frequently misunderstood.
Having a lockless data structure *doesn't* mean that it is either
performant, scalable or both.  It heavily depends on a number
of additional factors.  Many times "lockless" just replaces a
simple lock/unlock cycle with a number of atomic operations on
the data structure.  This can easily backfire because an atomic
operation just hides the computational complexity and also dirties
the CPU's cache lines.  Generally on cache coherent architectures
almost all of the atomic operation is in HW with bus lock cycles,
bus snooping and whatnot.  While seemingly simple from the programmer's
point of view, the overhead and latency is still there.  Needless
to say that on more relaxed architectures (SPARC, Alpha, ...) the
overhead is higher.  Also important in the overall picture are the
memory barrier semantics of locks.  Some newer CPU's have started
to provide hardware implemented lock managers and combine it with
SMT features so that access to an already locked lock causes an
immediate HW thread switch and on unlock a switch back.  We also
have rm_locks (read mostly locks) that do not require synchronization
on read but have a more expensive write lock.  In UMA we use a mix
of global pools of elements with per-CPU caching of free elements.
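
To illustrate the point about atomics with a generic sketch (userland C11,
nothing kernel-specific): a "lockless" LIFO push replaces the lock/unlock
pair with a CAS loop, but every CAS is still a locked cycle that dirties
the cache line holding the list head:

#include <stdatomic.h>
#include <stddef.h>

struct item {
    struct item *next;
};

static _Atomic(struct item *) head;

/*
 * No mutex, but each iteration performs an atomic compare-and-swap,
 * i.e. a locked cycle that invalidates the cache line holding 'head'
 * on every other CPU that has it cached.
 */
static void
push_lockless(struct item *it)
{
    struct item *old;

    do {
        old = atomic_load_explicit(&head, memory_order_relaxed);
        it->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&head, &old, it,
        memory_order_release, memory_order_relaxed));
}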

As always the best approach depends on the dominant access pattern
of a structure.  It all comes down to the amortization cost of the
different locking or "lockless" strategies.  It's also important
to make sure that worst case behavior doesn't bring down the box.

As a rule of thumb I use this:

 a) make sure the lock is held for only a small amount of time
to avoid lock contention.
 b) do everything you can outside of the lock.
 c) if the lock is found to be heavily contended rethink the
whole approach and check if other data structures can be used.
 d) minimize write accesses to memory in the lock protected
shared data structure.
 e) PROFILE, DON'T SPECULATE! Measure the access pattern and
measure the locking/data access strategy's cost in terms
of CPU cycles consumed.

 f) on lookup heavy data structures avoid writing to memory and
by it dirtying CPU caches.
 g) on modify heavy data structures avoid touching too many
elements.
 h) on lookup and modify heavy data structure that are used
across many CPU's all bets are off and a different data
structure approach should be considered resulting ideally
in case f).

It all starts with the hypothesis that a data structure is not
optimally locked.

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Adrian Chadd
On 1 October 2010 12:49, Matthew Dillon  wrote:

>    What turned out to be the best indexing mechanism was a chained
>    hash table whose hoppers were small linear arrays instead of single
>    elements.  So instead of pointer-chaining each element you have a small
>    for() loop for 4-8 elements before you chain.  The structure being
>    indexed would NOT be integrated into the index directly, the index
>    would point at the final structure from the hopper.
>
>    For our purposes such linear arrays would contain a pointer and
>    an indexing value in as small an element as possible (8-16 bytes),
>    the idea being that you make the absolute best use of your cache line
>    and L1 cache / memory burst.  One random access (initial hash index),
>    then linear accesses using a small indexing element, then one final
>    random access to get to the result structure and validate that
>    it's the one desired (at that point we'd be 99.9% sure that we have
>    the right structure because we have already compared the index value
>    stored in the hopper).  As a plus the initial hash index also makes
>    MP locking the base of the chains easier.

Sounds like B+tree style stuff. Minimise the "seek" operations, as
random lookup times are orders of magnitude slower than sequential
access times.

(Memory is hierarchical, who would've thunk. :-)


Adrian


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Matthew Dillon
I don't remember the reference but I read a comprehensive comparison
between various indexing methods about a year ago and the splay tree
did considerably better than a RB-tree.  The RB-tree actually did
fairly poorly.

Any binary tree-like structure makes fairly poor use of cpu caches.
Splay trees work somewhat better as long as there is some locality
of reference in the lookups, since the node being looked up is moved
to the root.  It isn't a bad trade-off.

On the negative side all binary-tree-like structures tend to be
difficult to MP-lock in a fine-grained manner (not that very many
structures are locked at that fine a grain anyway, but it's a data
point).  Splay-trees are impossible to lock at a fine-grain due to
the massive modifications made on any search and the movement
of the root, and there are MP access issues too.

--

What turned out to be the best indexing mechanism was a chained
hash table whose hoppers were small linear arrays instead of single
elements.  So instead of pointer-chaining each element you have a small
for() loop for 4-8 elements before you chain.  The structure being
indexed would NOT be integrated into the index directly, the index
would point at the final structure from the hopper.

For our purposes such linear arrays would contain a pointer and
an indexing value in as small an element as possible (8-16 bytes),
the idea being that you make the absolute best use of your cache line
and L1 cache / memory burst.  One random access (initial hash index),
then linear accesses using a small indexing element, then one final
random access to get to the result structure and validate that
it's the one desired (at that point we'd be 99.9% sure that we have
the right structure because we have already compared the index value
stored in the hopper).  As a plus the initial hash index also makes
MP locking the base of the chains easier.
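
A rough sketch of that layout (sizes and names are illustrative, not
DragonFly's actual code): each hopper packs a few (index value, pointer)
pairs plus a chain pointer, so a probe is one random access to the bucket
head, a short cache-friendly scan, and one final dereference:

#include <stddef.h>
#include <stdint.h>

#define HOPPER_SLOTS 4

struct hopper {
    uint64_t       key[HOPPER_SLOTS];   /* index values compared in place */
    void          *obj[HOPPER_SLOTS];   /* final structures, not embedded */
    struct hopper *next;                /* chain to the next hopper */
};

struct hhash {
    struct hopper **buckets;
    size_t          nbuckets;           /* power of two */
};

static void *
hhash_lookup(struct hhash *h, uint64_t key)
{
    struct hopper *hp;
    int i;

    hp = h->buckets[key & (h->nbuckets - 1)];         /* one random access */
    for (; hp != NULL; hp = hp->next) {
        for (i = 0; i < HOPPER_SLOTS; i++) {          /* linear, cache friendly */
            if (hp->obj[i] != NULL && hp->key[i] == key)
                return (hp->obj[i]);                  /* final random access */
        }
    }
    return (NULL);
}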

I don't use arrayized chained hash tables in DFly either, but only
because it's stayed fairly low on the priority list.  cpu isn't really
a major issue on systems these days.  I/O is the bigger problem.
RB-Trees are simply extremely convenient from the programming side,
which is why we still use them.

I was surprised that splay trees did so well vs RB-trees, I never liked
the memory writes splay trees do let alone the impossibility of
fine-grained locking.  Originally I thought the writes would make
performance worse, but they actually don't.  Still, if I were to change
any topologies now I would definitely go with the chained-hash /
small-linear-array / chain / small-linear-array / chain mechanic.  It
seems to be the clear winner.

-Matt
Matthew Dillon 



Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Freddie Cash
For the curious, DragonflyBSD went through this back in 2005.  All the
gory details are in the thread with Subject: "splay tree and red-black
tree for vm_map entry lookups" [1].  While things are most likely
different now between the FreeBSD VM and the DragonflyBSD VM, it may
be worthwhile checking out what they did, and why.  They considered
the FreeBSD splay-tree and compared it to red-black tree, and went
with red-black.

There's mention in that thread that NetBSD uses red-black trees.  No
idea if this is still correct.

[1] http://leaf.dragonflybsd.org/mailarchive/kernel/2005-01/msg00121.html

-- 
Freddie Cash
fjwc...@gmail.com


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 23:44, Andre Oppermann wrote:

On 30.09.2010 20:04, Roman Divacky wrote:

On Thu, Sep 30, 2010 at 07:49:00PM +0200, Roman Divacky wrote:

On Thu, Sep 30, 2010 at 07:46:32PM +0200, Andre Oppermann wrote:

On 30.09.2010 19:24, Roman Divacky wrote:

are you aware of Summer of Code 2008 project by Mayur Shardul?


I remember that there was this project but I never saw any numbers
or other outcome of it. Haven't checked p4 to look at the code
though.


there's a little something:

http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement


and the actual patch + userspace implementation + benchmarking suite
is here:

http://code.google.com/p/google-summer-of-code-2008-freebsd/downloads/detail?name=Mayur_Shardul.tar.gz&can=2&q=


it looks like a solid work with solid results to me


I just took a look (not in-depth though) at the patch and can't follow
your conclusion. It is not ready to go into svn in its current state.

Even though it is called a radix trie it doesn't look like one. On first
impression it looks much more like an AA tree. A radix trie, which we
already have in our network routing table code, is a variable length
(mask) tree that does path compression. See net/radix.[ch] and
http://en.wikipedia.org/wiki/Radix_tree


Thinking more about this I can't see any advantage of the radix trie
becoming useful for the vmmap.  The index of the vmmap is the address
range and can't be expressed in a path-compressible power-of-2 quantity
and is non-overlapping.  The key for the trie, a pointer, has a fixed
small size and can't be optimized away.  And it is not a balanced tree.
Many keys in the same region lead to long node traversals.

In short: VMmap and a radix trie have an impedance mismatch and are unfit
for each other imho.

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:04, Roman Divacky wrote:

On Thu, Sep 30, 2010 at 07:49:00PM +0200, Roman Divacky wrote:

On Thu, Sep 30, 2010 at 07:46:32PM +0200, Andre Oppermann wrote:

On 30.09.2010 19:24, Roman Divacky wrote:

are you aware of Summer of Code 2008 project by Mayur Shardul?


I remember that there was this project but I never saw any numbers
or other outcome of it.  Haven't checked p4 to look at the code
though.


there's a little something:

http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement


and the actual patch + userspace implementation + benchmarking suite
is here:

http://code.google.com/p/google-summer-of-code-2008-freebsd/downloads/detail?name=Mayur_Shardul.tar.gz&can=2&q=

it looks like a solid work with solid results to me


I just took a look (not in-depth though) at the patch and can't follow
your conclusion. It is not ready to go into svn in its current state.

Even though it is called a radix trie it doesn't look like one.  On first
impression it looks much more like an AA tree.  A radix trie, which we
already have in our network routing table code, is a variable length
(mask) tree that does path compression.  See net/radix.[ch] and
 http://en.wikipedia.org/wiki/Radix_tree

Extrapolating in a complete guesstimating way from the lookup function
I'd say it may perform only slightly better in an ideal case than a RB
tree but with the added overall expense of requiring external memory to
store the index and branch nodes.  This is probably a nasty disadvantage.

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Attilio Rao
2010/9/30 Andre Oppermann :
> On 30.09.2010 19:24, Roman Divacky wrote:
>>
>> are you aware of Summer of Code 2008 project by Mayur Shardul?
>
> I remember that there was this project but I never saw any numbers
> or other outcome of it.  Haven't checked p4 to look at the code
> though.

The final code (Jeff took Mayur's initial patches and refined them further)
resulted in very variable performance for simple tests, so I think
that the last posted code had the same performance, overall, as the
splay tree.
I recall, however, that there was still room for improvement.  Ask
Jeff about the last patches.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:38, Alfred Perlstein wrote:

Andre,

Your observations on the effectiveness of the splay tree
mirror the concerns I have with it when I read about it.

I have always wondered, though, what would happen if the splay-tree
algorithm were modified to only perform rotations when a lookup required
more than "N" traversals to reach a node.

This would self-balance the tree and maintain cache without
the expense of writes for nearly all lookups.


This would break the underlying assumption of the splay tree.
A splay tree is not a balanced tree.  With this approach you
can easily get stuck in a sub-optimal situation with many
lookups traversing N-1 nodes and not getting the cache benefit.
When N nodes are traversed for an outlier, and the rotate happens,
you rotate again on the next lookup or you get stuck in another
sub-optimal situation.  In effect you get the worst of a splay
tree while not being able to gain on the amortization effect
when the root node hit rate is high.


I'm wondering if you have the time/interest in rerunning
your tests, but modifying the algorithm to only rebalance
the splay if a lookup requires more than let's say 3, 5, 7
or 10 traversals.


As the assumption of the algorithm is broken I don't think it's
useful to spend time on this.  I'd rather try and add a last-match
pointer to the RB tree to take home the locality effect in vmmap.
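
A minimal sketch of that idea (hypothetical, not existing vm_map code),
using the sys/tree.h node layout but doing the range descent by hand:
check the cached last-match entry first, and only touch it again when the
slow path runs:

#include <sys/tree.h>
#include <stddef.h>
#include <stdint.h>

struct map_entry {
    RB_ENTRY(map_entry) link;
    uintptr_t           start, end;     /* non-overlapping address range */
};

RB_HEAD(map_tree, map_entry);

struct map {
    struct map_tree   root;
    struct map_entry *hint;             /* entry found by the last lookup */
};

/*
 * Repeated lookups hit the hint and never descend the tree.  Storing
 * the new hint is a single pointer write and could be skipped (done
 * lazily) while only a read lock is held.  Insert/remove would use the
 * usual RB_GENERATE machinery and keep the hint valid.
 */
static struct map_entry *
map_lookup(struct map *m, uintptr_t addr)
{
    struct map_entry *e = m->hint;

    if (e != NULL && addr >= e->start && addr < e->end)
        return (e);                     /* hint hit: a pure read */

    for (e = RB_ROOT(&m->root); e != NULL; ) {
        if (addr < e->start)
            e = RB_LEFT(e, link);
        else if (addr >= e->end)
            e = RB_RIGHT(e, link);
        else {
            m->hint = e;                /* remember for the next lookup */
            return (e);
        }
    }
    return (NULL);
}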

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:01, Alan Cox wrote:

On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:


On 30.09.2010 18:37, Andre Oppermann wrote:


Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.



Correcting myself regarding the history: The splay tree for vmmap was
done about 8 years ago by alc@ to replace a simple linked list and was
a huge improvement.  The change in vmpage from a hash to the same splay
tree as in vmmap was committed by dillon@ about 7.5 years ago with some
involvement of a...@.



Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.


This is indeed an expected property of the splay tree.  For the vmmap
the hit rate on the root node, that is the most recently accessed node,
is very high.  Other nodes that are frequently accessed as well should be high
up too.  Nonetheless the churn rate on the nodes in tree rotations is
considerable.

The question now is whether we can get the same positive effect but with
less negative side effects.  Of particular concern is the CPU cache
dirtying and thrashing by the splay rotations.  Especially with today's
and tomorrows many core, many socket, many threads/processes machines.
Another concern is worst case behavior when the root node hit rate is
low.  Currently this is the case in vmpage for certain.  It may also
become a concern in vmmap with the use of memguard (fragmenting the
address space) or an adversary trying to exploit worst case behavior
on a shared host by mapping only every other page.

This concern is validated when accepting the measurement by the 2008 SoC
student Mayur (thanks to rdivacky@ for digging up his page):
 http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement

In the measurements of his alternative radix trie implementation the
lookup time is *230 times less* than for the splay tree (as measured
with the TSC).  The description of the performed benchmark is extremely
sparse and it is not clear what access pattern he simulated.  Probably
a random lookup pattern which is of course very bad for the splay tree.
Any balanced tree will be better for random lookups.

Whether his measured improvement is due to using a radix trie or if it
would perform equally to a normal balanced binary tree I can't say (yet).


I teach class shortly.  I'll provide more details later.


--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alfred Perlstein
Andre,

Your observations on the effectiveness of the splay tree
mirror the concerns I have with it when I read about it.

I have always wondered, though, what would happen if the splay-tree
algorithm were modified to only perform rotations when a lookup required
more than "N" traversals to reach a node.

This would self-balance the tree and maintain cache without 
the expense of writes for nearly all lookups.
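
Something like the following rough, self-contained sketch (it uses simple
move-to-root rotations instead of the real zig-zig/zig-zag splay, and the
threshold is made up): the search itself stays read-only, and only lookups
that had to descend deeper than the threshold pay for rotations:

#include <stddef.h>
#include <stdint.h>

#define SPLAY_THRESHOLD 5   /* hypothetical "N" */

struct snode {
    uintptr_t     key;
    struct snode *left, *right;
};

/* Bring the node with 'key' to the root by single rotations. */
static struct snode *
move_to_root(struct snode *root, uintptr_t key)
{
    struct snode *child;

    if (root == NULL || root->key == key)
        return (root);
    if (key < root->key && root->left != NULL) {
        root->left = move_to_root(root->left, key);
        child = root->left;
        if (child->key == key) {        /* rotate right */
            root->left = child->right;
            child->right = root;
            return (child);
        }
    } else if (key > root->key && root->right != NULL) {
        root->right = move_to_root(root->right, key);
        child = root->right;
        if (child->key == key) {        /* rotate left */
            root->right = child->left;
            child->left = root;
            return (child);
        }
    }
    return (root);
}

static struct snode *
lookup_lazy_splay(struct snode **rootp, uintptr_t key)
{
    struct snode *n = *rootp;
    int depth = 0;

    while (n != NULL && n->key != key) {        /* plain read-only search */
        n = (key < n->key) ? n->left : n->right;
        depth++;
    }
    if (n != NULL && depth > SPLAY_THRESHOLD)
        *rootp = move_to_root(*rootp, key);     /* only deep hits rotate */
    return (n);
}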

I'm wondering if you have the time/interest in rerunning
your tests, but modifying the algorithm to only rebalance
the splay if a lookup requires more than let's say 3, 5, 7
or 10 traversals.

what do you think?

-Alfred

* Andre Oppermann  [100930 09:38] wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.
> 
> The VM system has two major structures:
>  1) the VM map which is per process and manages its address space
> by tracking all regions that are allocated and their permissions.
>  2) the VM page table which is per VM map and provides the backing
> memory pages to the regions and interfaces with the pmap layer
> (virtual->physical).
> 
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability is crucial for many operations beginning with fork/
> exec, going through malloc/mmap and ending with free/exit of a process.
> 
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.
> 
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree
> 
> This behavior has a few important implications:
>  1) the root node and constant rotation work like a cache with the
> most recently looked up node at the top and less recently used ones
> close to the top;
>  2) every lookup that doesn't match the root node, ie. a cache miss,
> causes a rotation of the tree to make the newly found node the new
> root;
>  3) during the rotate it has to re-arrange and touch possibly a large
> number of other nodes;
>  4) in worst case the tree may resemble a linked list.  A splay tree
> makes no guarantee that it is balanced.
> 
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balance if the majority of
> lookups are root node (cache) hits, otherwise the cost of
> rebalancing destroys any advantage and quickly turns into a
> large overhead.
>  b) the overhead becomes even worse due to touching many nodes and
> causing a lot of CPU cache line dirtying.  For processes with
> shared memory or threads across CPU's this overhead becomes
> almost excessive because the CPU's stomp on each others cached
> nodes and get their own caches invalidated.  The penalty is a
> high-latency memory refetch not only for the root node lookup
> but every hop in the following rebalancing operation => thrashing.
> 
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay tree in the vmmap and page table.
> 
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
> 
> The results are very interesting:
> 
>  The page table shows a *very* poor root node hit rate and an excessive
>  amount of rotational node modifications representing almost the
>  worst case for splay trees.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly horrible even on small
> deviations from the normal case in the vmmap.  For the page table it
> is always horrible.  Not to mention the locking complications due to
> modifying the tree on lookups.
> 
> Based on an examinati

Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alan Cox
On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann  wrote:

> On 30.09.2010 18:37, Andre Oppermann wrote:
>
>> Just for the kick of it I decided to take a closer look at the use of
>> splay trees (inherited from Mach if I read the history correctly) in
>> the FreeBSD VM system suspecting an interesting journey.
>>
>
> Correcting myself regarding the history: The splay tree for vmmap was
> done about 8 years ago by alc@ to replace a simple linked list and was
> a huge improvement.  The change in vmpage from a hash to the same splay
> tree as in vmmap was committed by dillon@ about 7.5 years ago with some
> involvement of a...@.
>

Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.

I teach class shortly.  I'll provide more details later.

Regards,
Alan


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Roman Divacky
On Thu, Sep 30, 2010 at 07:49:00PM +0200, Roman Divacky wrote:
> On Thu, Sep 30, 2010 at 07:46:32PM +0200, Andre Oppermann wrote:
> > On 30.09.2010 19:24, Roman Divacky wrote:
> > >are you aware of Summer of Code 2008 project by Mayur Shardul?
> > 
> > I remember that there was this project but I never saw any numbers
> > or other outcome of it.  Haven't checked p4 to look at the code
> > though.
> 
> there's a little something: 
> 
>   http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement

and the actual patch + userspace implementation + benchmarking suite
is here:

http://code.google.com/p/google-summer-of-code-2008-freebsd/downloads/detail?name=Mayur_Shardul.tar.gz&can=2&q=

it looks like a solid work with solid results to me


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Ivan Voras
On 09/30/10 18:37, Andre Oppermann wrote:

> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.

The property of splay tree requiring *writes* for nearly every read
really is a thorn in the eye for SMP. It seems to me that even if the
immediate benefits from converting to something else are not directly
observable, it will still be worth doing it.

It's a shame that RCU is still a patent minefield :/

http://mirror.leaseweb.com/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf

Slightly off-topic: a scare-mongering topic on Slashdot:
http://hardware.slashdot.org/story/10/09/30/1528229/Linux-May-Need-a-Rewrite-Beyond-48-Cores




Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Roman Divacky
On Thu, Sep 30, 2010 at 07:46:32PM +0200, Andre Oppermann wrote:
> On 30.09.2010 19:24, Roman Divacky wrote:
> >are you aware of Summer of Code 2008 project by Mayur Shardul?
> 
> I remember that there was this project but I never saw any numbers
> or other outcome of it.  Haven't checked p4 to look at the code
> though.

there's a little something: 

http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 19:24, Roman Divacky wrote:

are you aware of Summer of Code 2008 project by Mayur Shardul?


I remember that there was this project but I never saw any numbers
or other outcome of it.  Haven't checked p4 to look at the code
though.

--
Andre


quoting: http://www.freebsd.org/projects/summerofcode-2008.html

Project: VM Algorithm Improvement
Student: Mayur Shardul
Mentor: Jeff Roberson
Summary:
A new data structure, viz. radix tree, was implemented and used for
management of the resident pages.  The objective is efficient use of memory
and faster performance.  The biggest challenge was to service insert requests
on the data structure without blocking.  Because of this constraint memory
allocation failures were not acceptable; to solve the problem the required
memory was allocated at boot time.  Both data structures were used in
parallel to check the correctness and we also benchmarked the data structures
and found that radix trees gave much better performance than splay trees.

Ready to enter CVS/SVN: We will investigate some more approaches to handle
allocation failures before the new data structure goes in CVS.



Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 19:15, Matthew Fleming wrote:

On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann  wrote:

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.

The VM system has two major structures:
  1) the VM map which is per process and manages its address space
by tracking all regions that are allocated and their permissions.
  2) the VM page table which is per VM map and provides the backing
memory pages to the regions and interfaces with the pmap layer
(virtual->physical).

Both of these are very frequently accessed through memory allocations
from userspace and page faults from the pmap layer.  Their efficiency
and SMP scalability is crucial for many operations beginning with fork/
exec, going through malloc/mmap and ending with free/exit of a process.

Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long to traverse linked
lists or more memory expensive hash tables.  Some structures though
do have an additional linked list to simplify ordered traversals.

A splay tree is an interesting binary search tree with insertion,
lookup and removal performed in O(log n) *amortized* time.  With
the *amortized* time being the crucial difference to other binary trees.
On every access *including* lookup it rotates the tree to make the
just found element the new root node.  For all gory details see:
  http://en.wikipedia.org/wiki/Splay_tree

This behavior has a few important implications:
  1) the root node and constant rotation work like a cache with the
most recently looked up node at the top and less recently used ones
close to the top;
  2) every lookup that doesn't match the root node, ie. a cache miss,
causes a rotation of the tree to make the newly found node the new
root;
  3) during the rotate it has to re-arrange and touch possibly a large
number of other nodes;
  4) in worst case the tree may resemble a linked list.  A splay tree
makes no guarantee that it is balanced.

For the kernel and the splay tree usage in the VM map and page table
some more implications come up:
  a) the amortization effect/cost only balance if the majority of
lookups are root node (cache) hits, otherwise the cost of
rebalancing destroys any advantage and quickly turns into a
large overhead.
  b) the overhead becomes even worse due to touching many nodes and
causing a lot of CPU cache line dirtying.  For processes with
shared memory or threads across CPU's this overhead becomes
almost excessive because the CPU's stomp on each others cached
nodes and get their own caches invalidated.  The penalty is a
high-latency memory refetch not only for the root node lookup
but every hop in the following rebalancing operation =>  thrashing.

To quantify the positive and negative effects of the splay tree I
instrumented the code and added counters to track the behavior of
the splay tree in the vmmap and page table.

The test case is an AMD64 kernel build to get a first overview.
Other system usages may have different results depending on their
fork and memory usage patterns.

The results are very interesting:

  The page table shows a *very* poor root node hit rate and an excessive
  amount of rotational node modifications representing almost the
  worst case for splay trees.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

  The vmmap shows a high root node hit rate but still a significant
  amount of rotational node modifications.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

 From these observations I come to the conclusion that a splay tree
is suboptimal for the normal case and quickly horrible even on small
deviations from the normal case in the vmmap.  For the page table it
is always horrible.  Not to mention the locking complications due to
modifying the tree on lookups.

Based on an examination of other binary trees I put forward the
hypothesis that a Red-Black tree is a much better, if not the optimal,
fit for the vmmap and page table.  RB trees are balanced binary trees
and O(log n) in all cases.  The big advantage in this context is that
lookups are pure reads and do not cause CPU cache invalidations on
other CPU's and always only require a read lock without the worst
case properties of the unbalanced splay tree.  The high cache locality
of the vmmap lookups can be used with the RB tree as well by simply
adding a pointer to the most recently found node.  To prevent write locking
this can be done lazily.  More profiling is required to make
a non-speculative statement on this t

Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 18:37, Andre Oppermann wrote:

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.


Correcting myself regarding the history: The splay tree for vmmap was
done about 8 years ago by alc@ to replace a simple linked list and was
a huge improvement.  The change in vmpage from a hash to the same splay
tree as in vmmap was committed by dillon@ about 7.5 years ago with some
involvement of a...@.

--
Andre


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Roman Divacky
are you aware of Summer of Code 2008 project by Mayur Shardul?

quoting: http://www.freebsd.org/projects/summerofcode-2008.html

Project: VM Algorithm Improvement
Student: Mayur Shardul
Mentor: Jeff Roberson
Summary:
A new data structure, viz. radix tree, was implemented and used for
management of the resident pages.  The objective is efficient use of memory
and faster performance.  The biggest challenge was to service insert requests
on the data structure without blocking.  Because of this constraint memory
allocation failures were not acceptable; to solve the problem the required
memory was allocated at boot time.  Both data structures were used in
parallel to check the correctness and we also benchmarked the data structures
and found that radix trees gave much better performance than splay trees.

Ready to enter CVS/SVN: We will investigate some more approaches to handle
allocation failures before the new data structure goes in CVS.


On Thu, Sep 30, 2010 at 06:37:38PM +0200, Andre Oppermann wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.
> 
> The VM system has two major structures:
>  1) the VM map which is per process and manages its address space
> by tracking all regions that are allocated and their permissions.
>  2) the VM page table which is per VM map and provides the backing
> memory pages to the regions and interfaces with the pmap layer
> (virtual->physical).
> 
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability is crucial for many operations beginning with fork/
> exec, going through malloc/mmap and ending with free/exit of a process.
> 
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.
> 
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree
> 
> This behavior has a few important implications:
>  1) the root node and constant rotation work like a cache with the
> most recently looked up node at the top and less recently used ones
> close to the top;
>  2) every lookup that doesn't match the root node, ie. a cache miss,
> causes a rotation of the tree to make the newly found node the new
> root;
>  3) during the rotate it has to re-arrange and touch possibly a large
> number of other nodes;
>  4) in worst case the tree may resemble a linked list.  A splay tree
> makes no guarantee that it is balanced.
> 
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balance if the majority of
> lookups are root node (cache) hits, otherwise the cost of
> rebalancing destroys any advantage and quickly turns into a
> large overhead.
>  b) the overhead becomes even worse due to touching many nodes and
> causing a lot of CPU cache line dirtying.  For processes with
> shared memory or threads across CPU's this overhead becomes
> almost excessive because the CPU's stomp on each others cached
> nodes and get their own caches invalidated.  The penalty is a
> high-latency memory refetch not only for the root node lookup
> but every hop in the following rebalancing operation => thrashing.
> 
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay tree in the vmmap and page table.
> 
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
> 
> The results are very interesting:
> 
>  The page table shows a *very* poor root node hit rate and an excessive
>  amount of rotational node modifications representing almost the
>  worst case for splay trees.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|ca

Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Matthew Fleming
On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann  wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.
>
> The VM system has two major structures:
>  1) the VM map which is per process and manages its address space
>    by tracking all regions that are allocated and their permissions.
>  2) the VM page table which is per VM map and provides the backing
>    memory pages to the regions and interfaces with the pmap layer
>    (virtual->physical).
>
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability is crucial for many operations beginning with fork/
> exec, going through malloc/mmap and ending with free/exit of a process.
>
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.
>
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree
>
> This behavior has a few important implications:
>  1) the root node and constant rotation work like a cache with the
>    most recently looked up node at the top and less recently used ones
>    close to the top;
>  2) every lookup that doesn't match the root node, ie. a cache miss,
>    causes a rotation of the tree to make the newly found node the new
>    root;
>  3) during the rotate it has to re-arrange and touch possibly a large
>    number of other nodes;
>  4) in worst case the tree may resemble a linked list.  A splay tree
>    makes no guarantee that it is balanced.
>
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balance if the majority of
>    lookups are root node (cache) hits, otherwise the cost of
>    rebalancing destroys any advantage and quickly turns into a
>    large overhead.
>  b) the overhead becomes even worse due to touching many nodes and
>    causing a lot of CPU cache line dirtying.  For processes with
>    shared memory or threads across CPU's this overhead becomes
>    almost excessive because the CPU's stomp on each others cached
>    nodes and get their own caches invalidated.  The penalty is a
>    high-latency memory refetch not only for the root node lookup
>    but every hop in the following rebalancing operation => thrashing.
>
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay tree in the vmmap and page table.
>
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
>
> The results are very interesting:
>
>  The page table shows a *very* poor root node hit rate and an excessive
>  amount of rotational node modifications representing almost the
>  worst case for splay trees.
>
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
>
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly horrible even on small
> deviations from the normal case in the vmmap.  For the page table it
> is always horrible.  Not to mention the locking complications due to
> modifying the tree on lookups.
>
> Based on an examination of other binary trees I put forward the
> hypothesis that a Red-Black tree is a much better, if not the optimal,
> fit for the vmmap and page table.  RB trees are balanced binary trees
> and O(log n) in all cases.  The big advantage in this context is that
> lookups are pure reads and do not cause CPU cache invalidations on
> other CPU's and always only require a read lock without the worst
> case properties of the unbalanced splay tree.  The high cache locality
> of the vmmap lookups can be used with the RB tree as well by simply
> adding a pointer to the most recently found node.  To prevent write locking
> this can be 

Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.

The VM system has two major structures:
 1) the VM map which is per process and manages its address space
by tracking all regions that are allocated and their permissions.
 2) the VM page table which is per VM map and provides the backing
memory pages to the regions and interfaces with the pmap layer
(virtual->physical).

Both of these are very frequently accessed through memory allocations
from userspace and page faults from the pmap layer.  Their efficiency
and SMP scalability is crucial for many operations beginning with fork/
exec, going through malloc/mmap and ending with free/exit of a process.

Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long to traverse linked
lists or more memory expensive hash tables.  Some structures though
do have an additional linked list to simplify ordered traversals.

A splay tree is an interesting binary search tree with insertion,
lookup and removal performed in O(log n) *amortized* time.  With
the *amortized* time being the crucial difference to other binary trees.
On every access *including* lookup it rotates the tree to make the
just found element the new root node.  For all gory details see:
 http://en.wikipedia.org/wiki/Splay_tree

This behavior has a few important implications:
 1) the root node and constant rotation work like a cache with the
most recently looked up node at the top and less recently used ones
close to the top;
 2) every lookup that doesn't match the root node, ie. a cache miss,
causes a rotation of the tree to make the newly found node the new
root;
 3) during the rotate it has to re-arrange and touch possibly a large
number of other nodes;
 4) in worst case the tree may resemble a linked list.  A splay tree
makes no guarantee that it is balanced.

For the kernel and the splay tree usage in the VM map and page table
some more implications come up:
 a) the amortization effect/cost only balance if the majority of
lookups are root node (cache) hits, otherwise the cost of
rebalancing destroys any advantage and quickly turns into a
large overhead.
 b) the overhead becomes even worse due to touching many nodes and
causing a lot of CPU cache line dirtying.  For processes with
shared memory or threads across CPU's this overhead becomes
almost excessive because the CPU's stomp on each others cached
nodes and get their own caches invalidated.  The penalty is a
high-latency memory refetch not only for the root node lookup
but every hop in the following rebalancing operation => thrashing.

To quantify the positive and negative effects of the splay tree I
instrumented the code and added counters to track the behavior of
the splay tree in the vmmap and page table.
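
The instrumentation itself can be as simple as a handful of counters bumped
from the splay lookup/rotate paths and exported read-only; a hypothetical
sketch (the actual patch was not posted), matching the categories charted
below:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/* Hypothetical counters, incremented from the vm_map splay code. */
static unsigned long vmmap_splay_lookups;   /* total lookups */
static unsigned long vmmap_splay_roothits;  /* lookups satisfied by the root */
static unsigned long vmmap_splay_rotmods;   /* nodes modified while splaying */

SYSCTL_ULONG(_vm, OID_AUTO, map_splay_lookups, CTLFLAG_RD,
    &vmmap_splay_lookups, 0, "vm_map splay tree lookups");
SYSCTL_ULONG(_vm, OID_AUTO, map_splay_roothits, CTLFLAG_RD,
    &vmmap_splay_roothits, 0, "vm_map splay tree root node hits");
SYSCTL_ULONG(_vm, OID_AUTO, map_splay_rotmods, CTLFLAG_RD,
    &vmmap_splay_rotmods, 0, "vm_map splay tree rotation modifications");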

The test case is an AMD64 kernel build to get a first overview.
Other system usages may have different results depending on their
fork and memory usage patterns.

The results are very interesting:

 The page table shows a *very* poor root node hit rate and an excessive
 amount of rotational node modifications representing almost the
 worst case for splay trees.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

 The vmmap shows a high root node hit rate but still a significant
 amount of rotational node modifications.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

From these observations I come to the conclusion that a splay tree
is suboptimal for the normal case and quickly horrible even on small
deviations from the normal case in the vmmap.  For the page table it
is always horrible.  Not to mention the locking complications due to
modifying the tree on lookups.

Based on an examination of other binary trees I put forward the
hypothesis that a Red-Black tree is a much better, if not the optimal,
fit for the vmmap and page table.  RB trees are balanced binary trees
and O(log n) in all cases.  The big advantage in this context is that
lookups are pure reads and do not cause CPU cache invalidations on
other CPU's and always only require a read lock without the worst
case properties of the unbalanced splay tree.  The high cache locality
of the vmmap lookups can be used with the RB tree as well by simply
adding a pointer to the most recently found node.  To prevent write
locking this can be done lazily.  More profiling is required to make

a non-speculative statement on this though.  In addition a few of
the additional linked lists that currently accompany the vmmap and
page structures are n