Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-07 Thread Nick Piggin
On Tue, Mar 06, 2007 at 02:20:55PM -0800, David Miller wrote:
 From: Robert Olsson [EMAIL PROTECTED]
 Date: Tue, 6 Mar 2007 14:26:04 +0100
 
  David Miller writes:
   
Actually, more accurately, the conflict exists in how this GC
logic is implemented.  The core issue is that hash table size
guides the GC processing, and hash table growth therefore
modifies those GC goals.  So with the patch below we'll just
keep growing the hash table instead of giving GC some time to
try to keep the working set in equilibrium before doing the
hash grow.
   
   AFAIK the equilibrium is a resizing function as well, but using a fixed
   hash table. So can we do without equilibrium resizing if tables
   are dynamic?  I think so.
  
   With the hash data structure we could monitor the average chain
   length or just the size, and resize the hash accordingly.
 
 I'm not so sure, it may be a mistake to eliminate the equilibrium
 logic.  One error I think it does have is the usage of chain length.
 
 Even a nearly perfect hash has small lumps in distribution, and we
 should not penalize entries which fall into these lumps.
 
 Let us call T the threshold at which we would grow the routing hash
 table.  As we approach T we start to GC.  Let's assume hash table
 has shift = 2, and T would (with the T = N + (N >> 1) algorithm) therefore be
 6.
 
 TABLE: [0] DST1, DST2
        [1] DST3, DST4, DST5
 
 DST6 arrives, what should we do?
 
 If we just accept it and don't GC some existing entries, we
 will grow the hash table.  This is the wrong thing to do if
 our true working set is smaller than 6 entries and thus some
 of the existing entries are unlikely to be reused and thus
 could be purged to keep us from hitting T.
 
 If they are all active, growing is the right thing to do.
 
 This is the crux of the whole routing cache problem.

I guess this is similar to our problems with bdev and filesystem
caches as well.

What we do in that case (as you would know), is to let the caches
expand to the size of memory, and the problem just becomes balancing
their relative importance. We just try to go with a reasonable default,
and provide a knob or two for fine tuning.


 I am of the opinion that LRU, for routes not attached to sockets, is
 probably the best thing to do here.
 
 Furthermore at high packet rates, the current rt_may_expire() logic
 probably is not very effective since its granularity is limited to
 jiffies.  We can quite easily create 100,000 or more entries per
 jiffy when HZ=100 during rDOS, for example.  So perhaps some global
 LRU algorithm using ktime is more appropriate.
 
 Global LRU is not easy without touching a lot of memory.  But I'm
 sure some clever trick can be discovered by someone :)

Well we do a pseudo LRU in most vm/vfs caches, where we just set a
bit if it has been touched since last checked. Add another working
set list to promote popular routes into, and you have something
like our pagecache active/inactive reclaim.

I don't know if that really applies here, but it might if you decide
to try hooking this cache into the slab shrinker thingy...
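
To make the pagecache analogy concrete, a rough sketch of such a pseudo-LRU
(names are illustrative, nothing below comes from the patch, and it assumes
the usual <linux/list.h> helpers): a referenced bit set on lookup, plus a
second working-set list that reclaim promotes popular entries onto.

#include <linux/list.h>

struct cache_entry {
	struct list_head lru;		/* on the inactive or the active list */
	unsigned int	 referenced:1;	/* touched since the last reclaim scan? */
};

static void cache_entry_free(struct cache_entry *e);	/* assumed destructor */

/* Lookup hit: no list moves, no global lock, just mark the entry. */
static inline void cache_entry_touch(struct cache_entry *e)
{
	e->referenced = 1;
}

/* Reclaim: untouched inactive entries are freed; touched ones are promoted
 * to the active list, roughly like pagecache active/inactive reclaim. */
static void cache_shrink(struct list_head *inactive,
			 struct list_head *active, int nr_to_scan)
{
	while (nr_to_scan-- && !list_empty(inactive)) {
		struct cache_entry *e = list_first_entry(inactive,
					struct cache_entry, lru);

		list_del(&e->lru);
		if (e->referenced) {
			e->referenced = 0;
			list_add_tail(&e->lru, active);
		} else {
			cache_entry_free(e);
		}
	}
}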


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread David Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Tue, 06 Mar 2007 08:58:45 +0100

 Yes, but on bootup you have an appropriate NUMA active policy. (Well... we 
 hope so, but it broke several times in the past)
 I am not sure what kind of mm policy is active for scheduled works.

Good point, that definitely needs investigation.

Thanks for all of your comments so far Eric, very useful :)


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Nick Piggin
On Mon, Mar 05, 2007 at 08:26:32PM -0800, David Miller wrote:
 
 This is essentially a port of Nick Piggin's dcache hash table
 patches to the routing cache.  It solves the locking issues
 during table grow/shrink that I couldn't handle properly last
 time I tried to code up a patch like this.
 
 But one of the core issues of this kind of change still remains.
 There is a conflict between the desire of routing cache garbage
 collection to reach a state of equilibrium and the hash table
 grow code's desire to match the table size to the current state
 of affairs.
 
 Actually, more accurately, the conflict exists in how this GC
 logic is implemented.  The core issue is that hash table size
 guides the GC processing, and hash table growth therefore
 modifies those GC goals.  So with the patch below we'll just
 keep growing the hash table instead of giving GC some time to
 try to keep the working set in equilibrium before doing the
 hash grow.
 
 One idea is to put the hash grow check in the garbage collector,
 and put the hash shrink check in rt_del().
 
 In fact, it would be a good time to perhaps hack up some entirely
 new passive GC logic for the routing cache.
 
 BTW, another thing that plays into this is that Robert's TRASH work
 could make this patch not necessary :-)
 
 Finally, I know that (due to some of Nick's helpful comments the
 other day) that I'm missing some rcu_assign_pointer()'s in here.
 Fixes in this area are most welcome.

Cool! I have some fixes for the rcu barrier issues, with some C-style
comments and questions :)

I was going to send you a fix first for the rcu barriers, then a
second to convert the read-side to a barrier-less one that I described,
however considering that your patch is a work in progress anyway, I
won't worry too much about the normal protocol.

I _think_ my reasoning regarding the rcu barriers and grace periods
is correct. I'll keep thinking about it though. (Paul cc'ed).

I'm not so familiar with this code, so I have sprinkled around a lot
of comments that could be pure crap ;) They are mainly just to help
you ensure that you cover all bases... compile tested only at this
stage.

--
Index: linux-2.6/net/ipv4/route.c
===
--- linux-2.6.orig/net/ipv4/route.c
+++ linux-2.6/net/ipv4/route.c
@@ -311,6 +311,8 @@ static void rthash_free(struct rt_hash_b
 static unsigned int rt_hash_code(struct rt_hash *hashtable,
 u32 daddr, u32 saddr)
 {
+   /* BUG_ON(!rcu_read_protected()) */
+
return (jhash_2words(daddr, saddr, rt_hash_rnd)
 		& hashtable->mask);
 }
@@ -343,11 +345,16 @@ static void rt_hash_resize_work(struct w
 
old_rt_hash = rt_hash;
/*
-* ensure that if the reader sees the new dentry_hash,
-* then they will also see the old_dentry_hash assignment,
-* above.
+* ensure that if the reader sees the new rt_hash, then they will also
+* see the old_rt_hash assignment, above. synchronize_rcu() is used
+* rather than smp_wmb(), in order to avoid the smp_rmb() in the
+* read-side. However synchronize_rcu() also implies a smp_wmb(), so
+* that also means we can skip rcu_assign_pointer().
+*
+* The readers can then also skip rcu_dereference, because a grace
+* period implies that all readers have performed memory barriers.
 */
-   smp_wmb();
+   synchronize_rcu();
rt_hash = new_hash;
synchronize_rcu();
 
@@ -1100,6 +1107,8 @@ static int rt_intern_hash(struct rt_hash
int chain_length;
int attempts = !in_softirq();
 
+   /* BUG_ON(!rcu_read_protected()) */
+
 restart:
chain_length = 0;
min_score = ~(u32)0;
@@ -1286,6 +1295,8 @@ static void rt_del(struct rt_hash *h, un
 {
struct rtable **rthp;
 
+   /* BUG_ON(!rcu_read_protected()) */
+
spin_lock_bh(rt_hash_lock_addr(hash));
ip_rt_put(rt);
	for (rthp = &h->table[hash].chain; *rthp;
@@ -1328,12 +1339,24 @@ void ip_rt_redirect(__be32 old_gw, __be3
 
	for (i = 0; i < 2; i++) {
		for (k = 0; k < 2; k++) {
-   struct rt_hash *h = rt_hash;
-   unsigned hash = rt_hashfn(h, daddr, skeys[i], ikeys[k]);
+   struct rt_hash *h;
+   unsigned hash;
+
+   /*
+* rcu_read_lock() must cover the load of rt_hash, in
+* order to satisfy our RCU protected dynamic hash
+* sizing scheme; and it must also cover the hash list
+* traversal, to satisfy our RCU protected lockless
+* hash entry lookups.
+*
+* This note applies throughout the file.
+*/
+   rcu_read_lock();
+   h = 

Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread David Miller
From: Nick Piggin [EMAIL PROTECTED]
Date: Tue, 6 Mar 2007 10:11:12 +0100

 @@ -1449,6 +1472,12 @@ static struct dst_entry *ipv4_negative_a
 %u.%u.%u.%u/%02x dropped\n,
		  NIPQUAD(rt->rt_dst), rt->fl.fl4_tos);
  #endif
 + /* XXX:
 +  * What if rt does not exist in rt_hash, but is in
 +  * old_rt_hash? Don't we have to also check there?
 +  * Similar questions for a couple of other places that
 +  * look at rt_hash, but not old_rt_hash.
 +  */
   rt_del(h, hash, rt);
   ret = NULL;
   }

For the cases like ip_rt_redirect() I made the decision that we'll
just not add the complexity of having to look in the old_rt_hash
table.

In these kinds of cases it's OK to miss events, they will just happen
again.

It's one of the nice things about the routing cache, if you lose
information it's OK because we'll just cook up a new entry from
the persistent backing store that is the real routing tables.
And for events like redirects, if we miss it, we'll just send the
packet to the wrong next-hop again and receive another redirect
message which we'll (hopefully) propagate to a routing cache
entry.

Thanks for your feedback patch Nick, I'll process it tomorrow
hopefully after getting some sleep.


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Nick Piggin
On Tue, Mar 06, 2007 at 01:17:06AM -0800, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Tue, 6 Mar 2007 10:11:12 +0100
 
  @@ -1449,6 +1472,12 @@ static struct dst_entry *ipv4_negative_a
%u.%u.%u.%u/%02x dropped\n,
		 NIPQUAD(rt->rt_dst), rt->fl.fl4_tos);
   #endif
  +   /* XXX:
  +* What if rt does not exist in rt_hash, but is in
  +* old_rt_hash? Don't we have to also check there?
  +* Similar questions for a couple of other places that
  +* look at rt_hash, but not old_rt_hash.
  +*/
  rt_del(h, hash, rt);
  ret = NULL;
  }
 
 For the cases like ip_rt_redirect() I made the decision that we'll
 just not add the complexity of having to look in the old_rt_hash
 table.
 
 In these kinds of cases it's OK to miss events, they will just happen
 again.
 
 It's one of the nice things about the routing cache, if you lose
 information it's OK because we'll just cook up a new entry from
 the persistent backing store that is the real routing tables.
 And for events like redirects, if we miss it, we'll just send the
 packet to the wrong next-hop again and receive another redirect
 message which we'll (hopefully) propagate to a routing cache
 entry.

Ah, that's a very neat trick. OK so with that question out of the
way, there _may_ just be a few other places where you're working
with an rt_hash table outside an rcu read critical section.

I tried to fix up a couple of obvious ones, and I've just put in
the BUG_ON assertions for you to verify the rest.

 Thanks for your feedback patch Nick, I'll process it tomorrow
 hopefully after getting some sleep.

No problem. Thanks.


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Eric Dumazet
On Tuesday 06 March 2007 10:11, Nick Piggin wrote:

 Cool! I have some fixes for the rcu barrier issues, with some C-style
 comments and questions :)

 I was going to send you a fix first for the rcu barriers, then a
 second to convert the read-side to a barrier-less one that I described,
 however considering that your patch is a work in progress anyway, I
 won't worry too much about the normal protocol.

 I _think_ my reasoning regarding the rcu barriers and grace periods
 is correct. I'll keep thinking about it though. (Paul cc'ed).

 I'm not so familiar with this code, so I have sprinkled around a lot
 of comments that could be pure crap ;) They are mainly just to help
 you ensure that you cover all bases... compile tested only at this
 stage.

I think we missed :

+static void rt_hash_resize_work(struct work_struct *work)

+
+	*head = rth->u.dst.rt_next;
+
+	hash = rt_hashfn(rt_hash,
+			 rth->fl.fl4_dst,
+			 rth->fl.fl4_src,
+			 iface);
+	rth->u.dst.rt_next = rt_hash->table[hash].chain;
+	rt_hash->table[hash].chain = rth;

This really needs some ..._del_rcu()/..._add_rcu() ... primitives, no?
Or else a reader might be very confused...



Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Nick Piggin
On Tue, Mar 06, 2007 at 10:23:44AM +0100, Eric Dumazet wrote:
 On Tuesday 06 March 2007 10:11, Nick Piggin wrote:
 
  Cool! I have some fixes for the rcu barrier issues, with some C-style
  comments and questions :)
 
  I was going to send you a fix first for the rcu barriers, then a
  second to convert the read-side to a barrier-less one that I described,
  however considering that your patch is a work in progress anyway, I
  won't worry too much about the normal protocol.
 
  I _think_ my reasoning regarding the rcu barriers and grace periods
  is correct. I'll keep thinking about it though. (Paul cc'ed).
 
  I'm not so familiar with this code, so I have sprinkled around a lot
  of comments that could be pure crap ;) They are mainly just to help
  you ensure that you cover all bases... compile tested only at this
  stage.
 
 I think we missed :
 
 +static void rt_hash_resize_work(struct work_struct *work)
 
 +
 +	*head = rth->u.dst.rt_next;
 +
 +	hash = rt_hashfn(rt_hash,
 +			 rth->fl.fl4_dst,
 +			 rth->fl.fl4_src,
 +			 iface);
 +	rth->u.dst.rt_next = rt_hash->table[hash].chain;
 +	rt_hash->table[hash].chain = rth;
 
 This really needs some ..._del_rcu()/..._add_rcu() ... primitives, no?
 Or else a reader might be very confused...

I'm not sure... this code really depends on the hash table management,
rather than the management of the hash tables, if you understand me ;)

From what I can _see_, this is similar to how rt_intern_hash does it.
I don't know exactly why rt_intern_hash can get away without using
rcu_assign_pointer in some cases, however:

Note that we don't need an rcu_assign_pointer for this, because the
memory operations that initialized the entry have already been ordered
when it was first inserted.
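
To spell out the distinction being argued here (a sketch only, with
illustrative names; rcu_assign_pointer() comes from <linux/rcupdate.h>): the
barrier matters for the first publication of an entry, whereas the resize
copy loop moves entries that readers could already see.

struct node {
	struct node *next;
	int payload;			/* initialized before first publication */
};

/* First publication: the payload stores must not be reordered past the
 * store that makes the node reachable, hence rcu_assign_pointer(). */
static void publish_new_node(struct node **chain, struct node *n)
{
	n->next = *chain;
	rcu_assign_pointer(*chain, n);
}

/* Moving a node readers can already see (as in the resize copy loop):
 * the argument above is that the payload was ordered when it was first
 * inserted, so only the chain pointer changes and a plain store is used. */
static void move_published_node(struct node **new_chain, struct node *n)
{
	n->next = *new_chain;
	*new_chain = n;
}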



[RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

David Miller writes:
 
 Interesting.
 
  Actually, more accurately, the conflict exists in how this GC
  logic is implemented.  The core issue is that hash table size
  guides the GC processing, and hash table growth therefore
  modifies those GC goals.  So with the patch below we'll just
  keep growing the hash table instead of giving GC some time to
  try to keep the working set in equilibrium before doing the
  hash grow.
 
 AFAIK the equilibrium is a resizing function as well, but using a fixed
 hash table. So can we do without equilibrium resizing if tables
 are dynamic?  I think so.

 With the hash data structure we could monitor the average chain
 length or just the size, and resize the hash accordingly.

  One idea is to put the hash grow check in the garbage collector,
  and put the hash shrink check in rt_del().
  
  In fact, it would be a good time to perhaps hack up some entirely
  new passive GC logic for the routing cache.

 Could be. Remember the GC in the hash chain too, which was added later;
 although it doesn't decrease the number of entries, it gives
 an upper limit. Also the gc-goal must be picked so it does not force
 unwanted resizing.

  BTW, another thing that plays into this is that Robert's TRASH work
  could make this patch not necessary :-)

 It has built-in resize and chain control, and the gc-goal is chosen not
 to unnecessarily resize the root node.

 Cheers.
--ro


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

Eric Dumazet writes:

  Well, maybe... but after looking at Robert's trash, I discovered its model is 
  essentially a big (2^18 slots) root node (our hash table), and very few 
  order:1,2,3 nodes.
 
 It's getting hash-like, yes. I guess all effective algorithms today do
 some sort of index lookup, and for a large number of entries we cannot expect
 to find the next node in the same cache line, so the tree depth becomes a
 crucial performance factor. IMO nothing can beat a perfectly distributed and
 perfectly sized hash. The trash work is an effort to get close with a dynamic
 data structure.

 Cheers
--ro


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Eric Dumazet
On Tuesday 06 March 2007 14:42, Robert Olsson wrote:
 Eric Dumazet writes:
   Well, maybe... but after looking at Robert's trash, I discovered its model
   is essentially a big (2^18 slots) root node (our hash table), and very
   few order:1,2,3 nodes.

   It's getting hash-like, yes. I guess all effective algorithms today
  do some sort of index lookup, and for a large number of entries we cannot
  expect to find the next node in the same cache line, so the tree depth
  becomes a crucial performance factor. IMO nothing can beat a perfectly
  distributed and perfectly sized hash. The trash work is an effort to get
  close with a dynamic data structure.

Indeed. It would be nice to see how it performs with say 2^20 elements...
Because with your data, I wonder if the extra complexity of the trash is worth 
it (since most lookups are going to only hit the hash and give the answer 
without intermediate nodes)


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

Eric Dumazet writes:

  Indeed. It would be nice to see how it performs with say 2^20 elements...
  Because with your data, I wonder if the extra complexity of the trash is 
  worth 
  it (since most lookups are going to only hit the hash and give the answer 
  without intermediate nodes)

 I don't know if I understand you fully. Yes, in most cases the first lookup via
 the hash header will take us directly to the correct leaf. If there are
 collisions we have to sort them out by adding intermediate nodes.

 Something like where you have a resizable hash where each bucket in turn is a
 resizable hash, etc.

 Cheers
--ro


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Eric Dumazet
On Tuesday 06 March 2007 18:05, Robert Olsson wrote:
 Eric Dumazet writes:
   Indeed. It would be nice to see how it performs with say 2^20
   elements... Because with your data, I wonder if the extra complexity of
   the trash is worth it (since most lookups are going to only hit the hash
   and give the answer without intermediate nodes)

  I don't know if I understand you fully. Yes, in most cases the first lookup
 via the hash header will take us directly to the correct leaf. If there are
 collisions we have to sort them out by adding intermediate nodes.

With 2^20 entries, your current limit of 2^19 entries in the root node will
probably show us quite different numbers for order-1,2,3,4... tnodes


  Something like where you have a resizable hash where each bucket in turn
 is a resizable hash, etc.

Yes, the numbers you gave us basically showed a big root node, mainly leaves,
and very few tnodes.

I was interested to see the distribution in case the root-node limit is hit, 
and we load into the table a *lot* of entries.



Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

Eric Dumazet writes:

  With 2^20 entries, your current limit of 2^19 entries in the root node will
  probably show us quite different numbers for order-1,2,3,4... tnodes

 Yep, the trie will get deeper and lookups more costly, as will insert and delete.
 The 2^19 was because I was getting a memory allocation problem that I never
 sorted out.

  Yes, the numbers you gave us basically showed a big root node, mainly leaves,
  and very few tnodes.
  
  I was interested to see the distribution in case the root-node limit is hit, 
  and we load into the table a *lot* of entries.

 Max length etc... well, maybe the root restriction should be removed and just have
 a max size instead.

 Cheers
--ro


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread David Miller
From: Robert Olsson [EMAIL PROTECTED]
Date: Tue, 6 Mar 2007 14:26:04 +0100

 David Miller writes:
  
   Actually, more accurately, the conflict exists in how this GC
   logic is implemented.  The core issue is that hash table size
   guides the GC processing, and hash table growth therefore
   modifies those GC goals.  So with the patch below we'll just
   keep growing the hash table instead of giving GC some time to
   try to keep the working set in equilibrium before doing the
   hash grow.
  
  AFAIK the equilibrium is a resizing function as well, but using a fixed
  hash table. So can we do without equilibrium resizing if tables
  are dynamic?  I think so.

  With the hash data structure we could monitor the average chain
  length or just the size, and resize the hash accordingly.

I'm not so sure, it may be a mistake to eliminate the equilibrium
logic.  One error I think it does have is the usage of chain length.

Even a nearly perfect hash has small lumps in distribution, and we
should not penalize entries which fall into these lumps.

Let us call T the threshold at which we would grow the routing hash
table.  As we approach T we start to GC.  Let's assume hash table
has shift = 2, and T would (with the T = N + (N >> 1) algorithm) therefore be
6.

TABLE:  [0] DST1, DST2
        [1] DST3, DST4, DST5

DST6 arrives, what should we do?

If we just accept it and don't GC some existing entries, we
will grow the hash table.  This is the wrong thing to do if
our true working set is smaller than 6 entries and thus some
of the existing entries are unlikely to be reused and thus
could be purged to keep us from hitting T.

If they are all active, growing is the right thing to do.

This is the crux of the whole routing cache problem.
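
Spelling out the arithmetic: shift = 2 gives 4 buckets, so T = 4 + (4 >> 1) = 6.
A standalone restatement of the check_nr_rthash() test from the patch (the
names here are illustrative):

/* Grow once the entry count exceeds N + N/2, shrink once it drops
 * below N/2, where N is the current number of buckets. */
static inline int rthash_should_grow(unsigned int nr_entries,
				     unsigned int nr_buckets)
{
	return nr_entries > nr_buckets + (nr_buckets >> 1);
}

static inline int rthash_should_shrink(unsigned int nr_entries,
				       unsigned int nr_buckets)
{
	return nr_entries < (nr_buckets >> 1);
}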

I am of the opinion that LRU, for routes not attached to sockets, is
probably the best thing to do here.

Furthermore at high packet rates, the current rt_may_expire() logic
probably is not very effective since its granularity is limited to
jiffies.  We can quite easily create 100,000 or more entries per
jiffy when HZ=100 during rDOS, for example.  So perhaps some global
LRU algorithm using ktime is more appropriate.

Global LRU is not easy without touching a lot of memory.  But I'm
sure some clever trick can be discovered by someone :)
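
Purely as a sketch of the finer-grained timestamp idea (nothing like this is
in the patch; the names and the choice of ktime helpers are assumptions):

#include <linux/ktime.h>

/* Hypothetical per-entry stamp: ktime resolution can tell apart entries
 * created within the same jiffy, which a jiffies-based rt_may_expire()
 * test cannot. */
struct rt_lru_stamp {
	ktime_t last_use;
};

static inline void rt_lru_touch(struct rt_lru_stamp *s)
{
	s->last_use = ktime_get();
}

static inline int rt_lru_older_than(const struct rt_lru_stamp *s,
				    ktime_t cutoff)
{
	return ktime_to_ns(s->last_use) < ktime_to_ns(cutoff);
}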

It is amusing, but it seems that for the rDOS workload the most optimal
routing hash would be a tiny one like my example above.  All packets
essentially miss the routing cache and create a new entry.  So
keeping the working set as small as possible is what you want
to do, since no matter how large you grow it, your hit rate will be
zero :-)



[RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-05 Thread David Miller

This is essentially a port of Nick Piggin's dcache hash table
patches to the routing cache.  It solves the locking issues
during table grow/shrink that I couldn't handle properly last
time I tried to code up a patch like this.

But one of the core issues of this kind of change still remains.
There is a conflict between the desire of routing cache garbage
collection to reach a state of equilibrium and the hash table
grow code's desire to match the table size to the current state
of affairs.

Actually, more accurately, the conflict exists in how this GC
logic is implemented.  The core issue is that hash table size
guides the GC processing, and hash table growth therefore
modifies those GC goals.  So with the patch below we'll just
keep growing the hash table instead of giving GC some time to
try to keep the working set in equilibrium before doing the
hash grow.

One idea is to put the hash grow check in the garbage collector,
and put the hash shrink check in rt_del().

In fact, it would be a good time to perhaps hack up some entirely
new passive GC logic for the routing cache.

BTW, another thing that plays into this is that Robert's TRASH work
could make this patch not necessary :-)

Finally, I know that (due to some of Nick's helpful comments the
other day) that I'm missing some rcu_assign_pointer()'s in here.
Fixes in this area are most welcome.

This patch passes basic testing on UP sparc64, but please handle
with care :)

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0b3d7bf..57e004a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -92,6 +92,9 @@
 #include <linux/jhash.h>
 #include <linux/rcupdate.h>
 #include <linux/times.h>
+#include <linux/workqueue.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
 #include <net/protocol.h>
 #include <net/ip.h>
 #include <net/route.h>
@@ -242,28 +245,195 @@ static spinlock_t	*rt_hash_locks;
 # define rt_hash_lock_init()
 #endif
 
-static struct rt_hash_bucket	*rt_hash_table;
-static unsigned			rt_hash_mask;
-static int			rt_hash_log;
-static unsigned int		rt_hash_rnd;
+#define MIN_RTHASH_SHIFT 4
+#if BITS_PER_LONG == 32
+#define MAX_RTHASH_SHIFT 24
+#else
+#define MAX_RTHASH_SHIFT 30
+#endif
+
+struct rt_hash {
+   struct rt_hash_bucket   *table;
+	unsigned int		mask;
+	unsigned int		log;
+};
+
+struct rt_hash *rt_hash __read_mostly;
+struct rt_hash *old_rt_hash __read_mostly;
+static unsigned int rt_hash_rnd __read_mostly;
+static DEFINE_SEQLOCK(resize_transfer_lock);
+static DEFINE_MUTEX(resize_mutex);
 
 static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
 #define RT_CACHE_STAT_INC(field) \
(__raw_get_cpu_var(rt_cache_stat).field++)
 
-static int rt_intern_hash(unsigned hash, struct rtable *rth,
-   struct rtable **res);
+static void rt_hash_resize(unsigned int new_shift);
+static void check_nr_rthash(void)
+{
+	unsigned int sz = rt_hash->mask + 1;
+	unsigned int nr = atomic_read(&ipv4_dst_ops.entries);
+
+	if (unlikely(nr > (sz + (sz >> 1))))
+		rt_hash_resize(rt_hash->log + 1);
+	else if (unlikely(nr < (sz >> 1)))
+		rt_hash_resize(rt_hash->log - 1);
+}
 
-static unsigned int rt_hash_code(u32 daddr, u32 saddr)
+static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
+{
+   struct rt_hash_bucket *n;
+
+	if (sz <= PAGE_SIZE)
+   n = kmalloc(sz, GFP_KERNEL);
+   else if (hashdist)
+   n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
+   else
+   n = (struct rt_hash_bucket *)
+   __get_free_pages(GFP_KERNEL, get_order(sz));
+
+   if (n)
+   memset(n, 0, sz);
+
+   return n;
+}
+
+static void rthash_free(struct rt_hash_bucket *r, unsigned int sz)
+{
+	if (sz <= PAGE_SIZE)
+   kfree(r);
+   else if (hashdist)
+   vfree(r);
+   else
+   free_pages((unsigned long)r, get_order(sz));
+}
+
+static unsigned int rt_hash_code(struct rt_hash *hashtable,
+u32 daddr, u32 saddr)
 {
return (jhash_2words(daddr, saddr, rt_hash_rnd)
-		& rt_hash_mask);
+		& hashtable->mask);
 }
 
-#define rt_hash(daddr, saddr, idx) \
-   rt_hash_code((__force u32)(__be32)(daddr),\
+#define rt_hashfn(htab, daddr, saddr, idx) \
+   rt_hash_code(htab, (__force u32)(__be32)(daddr),\
 		 (__force u32)(__be32)(saddr) ^ ((idx) << 5))
 
+static unsigned int resize_new_shift;
+
+static void rt_hash_resize_work(struct work_struct *work)
+{
+   struct rt_hash *new_hash, *old_hash;
+   unsigned int new_size, old_size, transferred;
+   int i;
+
+	if (!mutex_trylock(&resize_mutex))
+   goto out;
+
+   new_hash = kmalloc(sizeof(struct rt_hash), GFP_KERNEL);
+   if (!new_hash)
+   goto out_unlock;
+
+   

Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-05 Thread Eric Dumazet

David Miller wrote:

This is essentially a port of Nick Piggin's dcache hash table
patches to the routing cache.  It solves the locking issues
during table grow/shrink that I couldn't handle properly last
time I tried to code up a patch like this.

But one of the core issues of this kind of change still remains.
There is a conflict between the desire of routing cache garbage
collection to reach a state of equilibrium and the hash table
grow code's desire to match the table size to the current state
of affairs.

Actually, more accurately, the conflict exists in how this GC
logic is implemented.  The core issue is that hash table size
guides the GC processing, and hash table growth therefore
modifies those GC goals.  So with the patch below we'll just
keep growing the hash table instead of giving GC some time to
try to keep the working set in equilibrium before doing the
hash grow.

One idea is to put the hash grow check in the garbage collector,
and put the hash shrink check in rt_del().

In fact, it would be a good time to perhaps hack up some entirely
new passive GC logic for the routing cache.

BTW, another thing that plays into this is that Robert's TRASH work
could make this patch not necessary :-)


Well, maybe... but after looking at Robert's trash, I discovered its model is 
essentially a big (2^18 slots) root node (our hash table), and very few 
order:1,2,3 nodes.


Almost all leaves... work in progress anyway.

Please find my comments in your patch


Finally, I know that (due to some of Nick's helpful comments the
other day) that I'm missing some rcu_assign_pointer()'s in here.
Fixes in this area are most welcome.

This patch passes basic testing on UP sparc64, but please handle
with care :)

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0b3d7bf..57e004a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -92,6 +92,9 @@
 #include <linux/jhash.h>
 #include <linux/rcupdate.h>
 #include <linux/times.h>
+#include <linux/workqueue.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
 #include <net/protocol.h>
 #include <net/ip.h>
 #include <net/route.h>
@@ -242,28 +245,195 @@ static spinlock_t	*rt_hash_locks;
 # define rt_hash_lock_init()
 #endif
 
-static struct rt_hash_bucket 	*rt_hash_table;

-static unsigned			rt_hash_mask;
-static int			rt_hash_log;
-static unsigned int		rt_hash_rnd;
+#define MIN_RTHASH_SHIFT 4


I wonder... are you sure this has no relation with the size of rt_hash_locks / 
RT_HASH_LOCK_SZ ?

One entry must have the same lock in the two tables when resizing is in flight.
#define MIN_RTHASH_SHIFT LOG2(RT_HASH_LOCK_SZ)
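
A small sketch of why that constraint is sufficient, assuming the chain lock
is picked from the low bits of the bucket index as rt_hash_lock_addr() does
(my reading of the point, not text from the thread):

/* bucket = jhash & mask, and the lock index is the low bits of the
 * bucket.  If both the old and the new mask cover at least the
 * RT_HASH_LOCK_SZ - 1 low bits (i.e. shift >= log2(RT_HASH_LOCK_SZ)),
 * this reduces to jhash & (RT_HASH_LOCK_SZ - 1) for both tables, so an
 * entry maps to the same lock before and after the resize. */
static inline unsigned int rt_lock_index(u32 hashval, unsigned int mask)
{
	return (hashval & mask) & (RT_HASH_LOCK_SZ - 1);
}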


+#if BITS_PER_LONG == 32
+#define MAX_RTHASH_SHIFT 24
+#else
+#define MAX_RTHASH_SHIFT 30
+#endif
+
+struct rt_hash {
+   struct rt_hash_bucket   *table;
+	unsigned int		mask;
+	unsigned int		log;
+};
+
+struct rt_hash *rt_hash __read_mostly;
+struct rt_hash *old_rt_hash __read_mostly;
+static unsigned int rt_hash_rnd __read_mostly;
+static DEFINE_SEQLOCK(resize_transfer_lock);
+static DEFINE_MUTEX(resize_mutex);


I think a better model would be a structure, with a part containing 'read
mostly' data, and a part with 'highly modified' data with appropriate align_to_cache.


For example, resize_transfer_lock should be in the first part, like rt_hash
and old_rt_hash, don't you think?


All static data of this file should be placed in this single structure so that
we can easily avoid false sharing and have optimal placement.


 
 static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);

 #define RT_CACHE_STAT_INC(field) \
(__raw_get_cpu_var(rt_cache_stat).field++)
 
-static int rt_intern_hash(unsigned hash, struct rtable *rth,

-   struct rtable **res);
+static void rt_hash_resize(unsigned int new_shift);
+static void check_nr_rthash(void)
+{
+	unsigned int sz = rt_hash->mask + 1;
+	unsigned int nr = atomic_read(&ipv4_dst_ops.entries);
+
+	if (unlikely(nr > (sz + (sz >> 1))))
+		rt_hash_resize(rt_hash->log + 1);
+	else if (unlikely(nr < (sz >> 1)))
+		rt_hash_resize(rt_hash->log - 1);
+}
 
-static unsigned int rt_hash_code(u32 daddr, u32 saddr)

+static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
+{
+   struct rt_hash_bucket *n;
+
+	if (sz <= PAGE_SIZE)
+   n = kmalloc(sz, GFP_KERNEL);
+   else if (hashdist)
+   n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
+   else
+   n = (struct rt_hash_bucket *)
+   __get_free_pages(GFP_KERNEL, get_order(sz));


I don't feel good about this.
Maybe we could try a __get_free_pages(), and in case of failure, fall back to
vmalloc(). Then keep a flag to be able to free the memory correctly. Anyway, if
(get_order(sz) >= MAX_ORDER) we know __get_free_pages() will fail.





+
+   if (n)
+   memset(n, 0, sz);
+
+   return n;
+}
+

Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-05 Thread David Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Tue, 06 Mar 2007 08:14:46 +0100

 I wonder... are you sure this has no relation with the size of rt_hash_locks /
 RT_HASH_LOCK_SZ ?
 One entry must have the same lock in the two tables when resizing is in
 flight.
 #define MIN_RTHASH_SHIFT LOG2(RT_HASH_LOCK_SZ)

Good point.

  +static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
  +{
  +   struct rt_hash_bucket *n;
  +
  +	if (sz <= PAGE_SIZE)
  +   n = kmalloc(sz, GFP_KERNEL);
  +   else if (hashdist)
  +   n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
  +   else
  +   n = (struct rt_hash_bucket *)
  +   __get_free_pages(GFP_KERNEL, get_order(sz));
 
 I don't feel good about this.
 Maybe we could try a __get_free_pages(), and in case of failure, fall back to
 vmalloc(). Then keep a flag to be able to free the memory correctly. Anyway, if
 (get_order(sz) >= MAX_ORDER) we know __get_free_pages() will fail.

We have to use vmalloc() for the hashdist case so that the pages
are spread out properly on NUMA systems.  That's exactly what the
large system hash allocator is going to do on bootup anyways.

Look, either both are right or both are wrong.  I'm just following
protocol above and you'll note the PRECISE same logic exists in other
dynamically growing hash table implementations such as
net/xfrm/xfrm_hash.c

 Could you add const qualifiers to 'struct rt_hash *' in prototypes where 
 appropriate ?

Sure, no problem.

 Maybe for small tables (less than PAGE_SIZE/2), we could embed them in
 'struct rt_hash'.

Not worth the pain nor the in-kernel-image-space it would chew up,
in my opinion.

After you visit a handful of web sites you'll get beyond that
threshold.

 Could we group all static vars at the beginning of this file, so that we 
 clearly see where we should place them, to avoid false sharing.

Sure.

  +
  +static void rt_hash_resize(unsigned int new_shift)

Damn, please don't quote such huge portions of a patch without any
comments, this has to go out to several thousand recipients you know
:-/


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-05 Thread Eric Dumazet

David Miller wrote:

From: Eric Dumazet [EMAIL PROTECTED]
Date: Tue, 06 Mar 2007 08:14:46 +0100

I wonder... are you sure this has no relation with the size of rt_hash_locks / 
RT_HASH_LOCK_SZ ?

One entry must have the same lock in the two tables when resizing is in flight.
#define MIN_RTHASH_SHIFT LOG2(RT_HASH_LOCK_SZ)


Good point.


+static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
+{
+   struct rt_hash_bucket *n;
+
+	if (sz <= PAGE_SIZE)
+   n = kmalloc(sz, GFP_KERNEL);
+   else if (hashdist)
+   n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
+   else
+   n = (struct rt_hash_bucket *)
+   __get_free_pages(GFP_KERNEL, get_order(sz));

I don't feel good about this.
Maybe we could try a __get_free_pages(), and in case of failure, fall back to
vmalloc(). Then keep a flag to be able to free the memory correctly. Anyway, if
(get_order(sz) >= MAX_ORDER) we know __get_free_pages() will fail.


We have to use vmalloc() for the hashdist case so that the pages
are spread out properly on NUMA systems.  That's exactly what the
large system hash allocator is going to do on bootup anyways.


Yes, but on bootup you have an appropriate NUMA active policy. (Well... we 
hope so, but it broke several times in the past)

I am not sure what kind of mm policy is active for scheduled works.

Anyway I have some XX GB machines, non-NUMA, and I would love to be able to
have a 2^20-slot hash table, without having to increase MAX_ORDER.
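
For a sense of the sizes involved (my arithmetic, assuming 4 KB pages,
8-byte bucket pointers and the default MAX_ORDER of 11):

/*
 * 2^20 buckets * 8 bytes per bucket  = 8 MB of hash table
 * largest page-allocator block       = 2^(MAX_ORDER - 1) pages = 4 MB
 *
 * get_order(8 MB) = 11 >= MAX_ORDER, so __get_free_pages() cannot satisfy
 * the request; the options are vmalloc() or a larger MAX_ORDER.
 */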




Look, either both are right or both are wrong.  I'm just following
protocol above and you'll note the PRECISE same logic exists in other
dynamically growing hash table implementations such as
net/xfrm/xfrm_hash.c




Yes, they are both wrong/dumb :)

Can we be smarter, or do we have to stay dumb ? :)

struct rt_hash_bucket *n = NULL;

if (sz <= PAGE_SIZE) {
	n = kmalloc(sz, GFP_KERNEL);
	*kind = allocated_by_kmalloc;
} else if (!hashdist) {
	n = (struct rt_hash_bucket *)
		__get_free_pages(GFP_KERNEL, get_order(sz));
	*kind = allocated_by_get_free_pages;
}
if (!n) {
	n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
	*kind = allocated_by_vmalloc;
}
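
For symmetry, a sketch of what the matching free path could look like under
this scheme (the enum and the function name are illustrative, not from the
thread; it mirrors the existing rthash_free()):

enum rthash_alloc_kind {
	allocated_by_kmalloc,
	allocated_by_get_free_pages,
	allocated_by_vmalloc,
};

/* Free according to how the table was obtained, so nothing has to be
 * guessed from sz or hashdist at teardown time. */
static void rthash_free_kind(struct rt_hash_bucket *r, unsigned int sz,
			     enum rthash_alloc_kind kind)
{
	switch (kind) {
	case allocated_by_kmalloc:
		kfree(r);
		break;
	case allocated_by_get_free_pages:
		free_pages((unsigned long)r, get_order(sz));
		break;
	case allocated_by_vmalloc:
		vfree(r);
		break;
	}
}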
