Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-07 Thread Eric Dumazet

David S. Miller wrote:

> Eric, how important do you honestly think the per-hashchain spinlocks
> are?  That's the big barrier from making rt_secret_rebuild() a simple
> rehash instead of flushing the whole table as it does now.



No problem for me in going to a single spinlock.
I did the hashed spinlock patch in order to reduce the size of the route hash
table without hurting big NUMA machines. If you think a single spinlock is OK,
that's even better!



> The lock is only grabbed for updates, and the access to these locks is
> random and as such probably non-local when taken anyways.  Back before
> we used RCU for reads, this array-of-spinlock thing made a lot more
> sense.
>
> I mean something like this patch:
>
> +static DEFINE_SPINLOCK(rt_hash_lock);



Just one point: this should be cache-line aligned, and should use one full
cache line, to avoid false sharing at least. (If a CPU takes the lock, there
is no need to invalidate *rt_hash_table for all the other CPUs.)


Eric


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-07 Thread David S. Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Sat, 07 Jan 2006 08:53:52 +0100

 I have no problem with this, since the biggest server I have is 4
 way, but are you sure big machines wont suffer from this single
 spinlock ?

It is the main question.

 Also I dont understand what you want to do after this single
 spinlock patch.  How is it supposed to help the 'ip route flush
 cache' problem ?  In my case, I have about 600.000 dst-entries :

I don't claim to have a solution to this problem currently.

Doing RCU and going through the whole DST GC machinery is overkill for
an active system.  So, perhaps a very simple solution will do:

1) On rt_run_flush(), do not rt_free(), instead collect all active
   routing cache entries onto a global list, begin a timer to
   fire in 10 seconds (or some sysctl configurable amount).

2) When a new routing cache entry is needed, check the global
   list appended to in #1 above first, failing that do dst_alloc()
   as is done currently.

3) If timer expires, rt_free() any entries in the global list.

The missing trick is how to ensure RCU semantics when reallocating
from the global list.

The idea is that an active system will immediately repopulate itself
with all of these entries just flushed from the table.
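
A very rough sketch of steps 1-3, in case it helps picture them.  All names
(rt_parked_list, rt_park(), a spare rt_parked list_head assumed in struct
rtable) are invented, and it deliberately ignores the missing RCU trick and
the re-initialisation of reused entries:

/* Sketch only, not a tested patch. */
static LIST_HEAD(rt_parked_list);
static DEFINE_SPINLOCK(rt_parked_lock);
static struct timer_list rt_parked_timer;	/* init_timer() + .function = rt_parked_expire */

/* 1) In rt_run_flush(): park the entry instead of rt_free()ing it. */
static void rt_park(struct rtable *rth)
{
	spin_lock_bh(&rt_parked_lock);
	list_add_tail(&rth->rt_parked, &rt_parked_list);
	spin_unlock_bh(&rt_parked_lock);
	mod_timer(&rt_parked_timer, jiffies + 10 * HZ);	/* or a sysctl'd delay */
}

/* 2) Allocation path: try the parked list before dst_alloc(). */
static struct rtable *rt_alloc_or_reuse(void)
{
	struct rtable *rth = NULL;

	spin_lock_bh(&rt_parked_lock);
	if (!list_empty(&rt_parked_list)) {
		rth = list_entry(rt_parked_list.next, struct rtable, rt_parked);
		list_del(&rth->rt_parked);
	}
	spin_unlock_bh(&rt_parked_lock);

	return rth ? rth : (struct rtable *)dst_alloc(&ipv4_dst_ops);
}

/* 3) Timer expiry: whatever is still parked really does get rt_free()d. */
static void rt_parked_expire(unsigned long unused)
{
	spin_lock_bh(&rt_parked_lock);
	while (!list_empty(&rt_parked_list)) {
		struct rtable *rth = list_entry(rt_parked_list.next,
						struct rtable, rt_parked);
		list_del(&rth->rt_parked);
		rt_free(rth);
	}
	spin_unlock_bh(&rt_parked_lock);
}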

RCU really doesn't handle this kind of problem very well.  It truly
excels when work is generated by process context work, not interrupt
work.


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-07 Thread Paul E. McKenney
On Sat, Jan 07, 2006 at 12:36:25AM -0800, David S. Miller wrote:
 From: Eric Dumazet [EMAIL PROTECTED]
 Date: Sat, 07 Jan 2006 08:53:52 +0100
 
  I have no problem with this, since the biggest server I have is 4
  way, but are you sure big machines wont suffer from this single
  spinlock ?
 
 It is the main question.
 
  Also I dont understand what you want to do after this single
  spinlock patch.  How is it supposed to help the 'ip route flush
  cache' problem ?  In my case, I have about 600.000 dst-entries :
 
 I don't claim to have a solution to this problem currently.
 
 Doing RCU and going through the whole DST GC machinery is overkill for
 an active system.  So, perhaps a very simple solution will do:
 
 1) On rt_run_flush(), do not rt_free(), instead collect all active
routing cache entries onto a global list, begin a timer to
fire in 10 seconds (or some sysctl configurable amount).
 
 2) When a new routing cache entry is needed, check the global
list appended to in #1 above first, failing that do dst_alloc()
as is done currently.
 
 3) If timer expires, rt_free() any entries in the global list.
 
 The missing trick is how to ensure RCU semantics when reallocating
 from the global list.

The straightforward ways of doing this require a per-entry lock in
addition to the dst_entry reference count -- lots of read-side overhead.

More complex approaches use a generation number that is incremented
when adding to or removing from the global list.  When the generation
number overflows, unconditionally rt_free() it rather than adding
to the global list again.  Then there needs to be some clever code
on the read side to detect the case when the generation number
changes while acquiring a reference.  And memory barriers.  Also
lots of read-side overhead.  Also, it is now -always- necessary to
acquire a reference on the read-side.
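
For illustration only, here is the rough shape of the read side being
described, with invented names; it glosses over how the entry's memory itself
stays valid, which is part of what makes the scheme unattractive:

struct cache_entry {
	atomic_t	refcnt;
	unsigned int	gen;	/* bumped whenever the entry moves to/from the global list */
	/* ... payload ... */
};

/*
 * Read side: always take a reference, then detect recycling via the
 * generation number.  Note the extra atomics and barriers on every
 * lookup, i.e. the read-side overhead mentioned above.
 */
static struct cache_entry *entry_get_checked(struct cache_entry *e)
{
	unsigned int gen = e->gen;

	smp_rmb();			/* read gen before bumping refcnt         */
	atomic_inc(&e->refcnt);		/* unconditional reference acquisition    */
	smp_mb();			/* make refcnt visible before re-checking */
	if (e->gen != gen) {		/* entry was recycled under us            */
		atomic_dec(&e->refcnt);
		return NULL;		/* caller treats this as a cache miss     */
	}
	return e;
}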

 The idea is that an active system will immediately repopulate itself
 with all of these entries just flushed from the table.
 
 RCU really doesn't handle this kind of problem very well.  It truly
 excels when work is generated by process context work, not interrupt
 work.

Sounds like a challenge to me.  ;-)

Well, one possible way to attack Eric's workload might be the following:

o   Size the hash table to strike the appropriate balance between
read-side search overhead and memory consumption.  Call the
number of hash-chain headers N.

o   Create a hashed array of locks sized to allow the update to
proceed sufficiently quickly.  Call the number of locks M,
probably a power of two.  This means that M CPUs can be doing
the update in parallel.

o   Create an array of M^2 list headers (call it xfer[][]), but
since this is only needed during an update, it can be allocated
and deallocated if need be.  (I, with my big-server experience,
would probably just create the array, since M is not likely to
be too large.  But your mileage may vary.  And you really only
need M*(M-1) list headers, but that makes the index calculation
a bit more annoying.)

o   Use a two-phase update.  In the first phase, each updating
CPU acquires the corresponding lock and removes entries from
the corresponding partition of the hash table.  If the new
location of a given entry falls into the same partition, it
is added back to the appropriate hash chain of that partition.
Otherwise, add the entry to xfer[dst][src], where src and 
dst are indexes of the corresponding partitions.

o   When all CPUs finish removing entries from their partition,
they check into a barrier.  Once all have checked in, they
can start the second phase of the update.

o   In the second phase, each CPU removes the entries from the
xfer array that are destined for its partition and adds them
to the hash chain that they are destined for.
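
If it helps to see the two phases concretely, here is a user-space toy model
(all names invented, pthreads for locking, error handling and barrier/mutex
setup omitted; unlike the description above it stages same-partition entries
in xfer[src][src] too, just to keep the loop simple):

#include <pthread.h>

#define N 1024				/* hash-chain headers		*/
#define M 4				/* lock partitions		*/

struct entry { struct entry *next; unsigned int key; };

static struct entry *table[N];
static pthread_mutex_t part_lock[M];	/* pthread_mutex_init() at setup */
static struct entry *xfer[M][M];	/* xfer[dst][src]; cache-align these in real code */
static pthread_barrier_t barrier;	/* pthread_barrier_init(&barrier, NULL, M) */
static unsigned int secret;		/* the new hash secret being switched to */

static unsigned int new_chain(unsigned int key) { return (key ^ secret) % N; }
static int part(unsigned int chain) { return chain % M; }

/* Phase 1: worker "src" empties its own partition under its own lock. */
static void phase1(int src)
{
	pthread_mutex_lock(&part_lock[src]);
	for (unsigned int c = 0; c < N; c++) {
		if (part(c) != src)
			continue;
		struct entry *e = table[c];
		table[c] = NULL;
		while (e) {
			struct entry *next = e->next;
			unsigned int nc = new_chain(e->key);
			/* stage everything, even same-partition moves */
			e->next = xfer[part(nc)][src];
			xfer[part(nc)][src] = e;
			e = next;
		}
	}
	pthread_mutex_unlock(&part_lock[src]);
}

/* Phase 2: after the barrier, worker "dst" drains everything staged for it. */
static void phase2(int dst)
{
	pthread_mutex_lock(&part_lock[dst]);
	for (int src = 0; src < M; src++) {
		struct entry *e = xfer[dst][src];
		xfer[dst][src] = NULL;
		while (e) {
			struct entry *next = e->next;
			unsigned int nc = new_chain(e->key);
			e->next = table[nc];
			table[nc] = e;
			e = next;
		}
	}
	pthread_mutex_unlock(&part_lock[dst]);
}

static void *rehash_worker(void *arg)
{
	int cpu = (int)(long)arg;

	phase1(cpu);
	pthread_barrier_wait(&barrier);	/* all removals done before inserts begin */
	phase2(cpu);
	return NULL;
}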

Some commentary and variations, in the hope that this inspires someone
to come up with an even better idea:

o   Unless M is at least three, there is no performance gain
over a single global lock with a single CPU doing the update,
since each element must now undergo four list operations rather
than just two.

o   The xfer[][] array must have each entry cache-aligned, or
you lose big on cacheline effects.  Note that it is -not-
sufficient to simply align the rows or the columns, since
each CPU has its own column when inserting and its own
row when removing from xfer[][].

o   And the data-skew effects are less severe if this procedure
runs from process context.  A spinning barrier must be used
otherwise.  But note that the per-partition locks could remain
spinlocks; only the barrier need involve sleeping (in case
that helps, I am getting a bit ahead of my understanding of
this part of the kernel).

Re: [PATCH, RFC] RCU : OOM avoidance and lower latency (Version 2), HOTPLUG_CPU fix

2006-01-06 Thread Eric Dumazet

First patch was buggy, sorry :(

This 2nd version makes no RCU assumptions anymore, because only the 'donelist'
queue is used to pick an item to free. Items on the donelist are already ready
to be freed.


This V2 also corrects a problem in the CPU hotplug case: we forgot to update
the ->count variable when transferring a queue to another one.


-
In order to avoid OOM triggered by a flood of call_rcu() calls, we increased
maxbatch in Linux 2.6.14 from 10 to 10000, and conditionally call
set_need_resched() in call_rcu().


This solution doesn't solve all the problems and has drawbacks.

1) Using a big maxbatch has a bad impact on latency.
2) A flood of call_rcu_bh() can still OOM.

I have some servers that once in a while crash when the ip route cache is
flushed. After raising /proc/sys/net/ipv4/route/secret_interval (so that *no*
flush is done), I got better uptime for these servers. But in some cases I
think the network stack can flood call_rcu_bh(), and a fatal OOM occurs.


I suggest in this patch :

1) To lower maxbatch to a more reasonable value (as far as the latency is 
concerned)


2) To be able to guard an RCU CPU queue against a maximum count (10.000 for
example). If this limit is reached, free the oldest entry (if available from 
the donelist queue).


3) Bug correction in __rcu_offline_cpu(), where we forgot to adjust the ->count
field when transferring a queue to another one.


In my stress tests, I could not reproduce OOM anymore after applying this patch.

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
--- linux-2.6.15/kernel/rcupdate.c  2006-01-03 04:21:10.0 +0100
+++ linux-2.6.15-edum/kernel/rcupdate.c 2006-01-06 13:32:02.0 +0100
@@ -71,14 +71,14 @@
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int maxbatch = 100;
 
 #ifndef __HAVE_ARCH_CMPXCHG
 /*
  * We use an array of spinlocks for the rcurefs -- similar to ones in sparc
  * 32 bit atomic_t implementations, and a hash function similar to that
  * for our refcounting needs.
- * Can't help multiprocessors which donot have cmpxchg :(
+ * Can't help multiprocessors which dont have cmpxchg :(
  */
 
 spinlock_t __rcuref_hash[RCUREF_HASH_SIZE] = {
@@ -110,9 +110,19 @@
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
 
-	if (unlikely(++rdp->count > 10000))
-		set_need_resched();
-
+/*
+ * OOM avoidance : If we queued too many items in this queue,
+ *  free the oldest entry (from the donelist only to respect
+ *  RCU constraints)
+ */
+	if (unlikely(++rdp->count > 10000 && (head = rdp->donelist))) {
+		rdp->count--;
+		rdp->donelist = head->next;
+		if (!rdp->donelist)
+			rdp->donetail = &rdp->donelist;
+		local_irq_restore(flags);
+		return head->func(head);
+	}
 	local_irq_restore(flags);
 }
 
@@ -148,12 +158,19 @@
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-	rdp->count++;
 /*
- *  Should we directly call rcu_do_batch() here ?
- *  if (unlikely(rdp->count > 10000))
- *  rcu_do_batch(rdp);
+ * OOM avoidance : If we queued too many items in this queue,
+ *  free the oldest entry (from the donelist only to respect
+ *  RCU constraints)
  */
+	if (unlikely(++rdp->count > 10000 && (head = rdp->donelist))) {
+		rdp->count--;
+		rdp->donelist = head->next;
+		if (!rdp->donelist)
+			rdp->donetail = &rdp->donelist;
+		local_irq_restore(flags);
+		return head->func(head);
+	}
 	local_irq_restore(flags);
 }
 
@@ -208,19 +225,20 @@
  */
 static void rcu_do_batch(struct rcu_data *rdp)
 {
-	struct rcu_head *next, *list;
-	int count = 0;
+	struct rcu_head *next = NULL, *list;
+	int count = maxbatch;
 
 	list = rdp->donelist;
 	while (list) {
-		next = rdp->donelist = list->next;
+		next = list->next;
 		list->func(list);
 		list = next;
 		rdp->count--;
-		if (++count >= maxbatch)
+		if (--count <= 0)
 			break;
 	}
-	if (!rdp->donelist)
+	rdp->donelist = next;
+	if (!next)
 		rdp->donetail = &rdp->donelist;
 	else
 		tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
@@ -344,11 +362,9 @@
 static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
struct rcu_head **tail)
 {
-   local_irq_disable();
 	*this_rdp->nxttail = list;
 	if (list)
 		this_rdp->nxttail = tail;
-   local_irq_enable();
 }
 
 static void __rcu_offline_cpu(struct rcu_data *this_rdp,
@@ -362,9 +378,12 @@
 	if (rcp->cur != rcp->completed)
 

Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Andi Kleen
On Friday 06 January 2006 11:17, Eric Dumazet wrote:


 I assume that if a CPU queued 10.000 items in its RCU queue, then the
 oldest entry cannot still be in use by another CPU. This might sounds as a
 violation of RCU rules, (I'm not an RCU expert) but seems quite reasonable.

I don't think it's a good assumption. Another CPU might be stuck in a long 
running interrupt, and still have a reference in the code running below
the interrupt handler.

And in general letting correctness depend on magic numbers like this is 
very nasty.

-Andi


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Eric Dumazet

Andi Kleen wrote:

> On Friday 06 January 2006 11:17, Eric Dumazet wrote:
>> I assume that if a CPU queued 10.000 items in its RCU queue, then the
>> oldest entry cannot still be in use by another CPU. This might sounds as a
>> violation of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
>
> I don't think it's a good assumption. Another CPU might be stuck in a long
> running interrupt, and still have a reference in the code running below
> the interrupt handler.
>
> And in general letting correctness depend on magic numbers like this is
> very nasty.

I agree Andi, I posted a 2nd version of the patch with no more assumptions.

Eric




Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Eric Dumazet

Alan Cox wrote:

> On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
>> I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest
>> entry cannot still be in use by another CPU. This might sounds as a violation
>> of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
>
> Fixing the real problem in the routing code would be the real fix.



So far nobody has succeeded in 'fixing the routing code'; few people can even
read the code from the first line to the last one...

I think this code is not buggy: it only makes general RCU assumptions about
delayed freeing of dst entries. In some cases, the general assumptions are
just wrong. We can fix it at the RCU level, and future users of call_rcu_bh()
won't have to think *hard* about 'general assumptions'.


Of course, we can ignore the RCU problem and mark somewhere on a sticker: 
***DONT USE OR RISK CRASHES***

***USE IT ONLY FOR FUN***


> The underlying problem of RCU and memory usage could be solved more
> safely by making sure that the sleeping memory allocator path always
> waits until at least one RCU cleanup has occurred after it fails an
> allocation before it starts trying harder. That ought to also naturally
> throttle memory consumers more in the situation which is the right
> behaviour.



In the case of call_rcu_bh(), you can be sure that the caller cannot afford
'sleeping memory allocations'. Better to drop a frame than block the stack, no?


Eric


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Alan Cox
On Gwe, 2006-01-06 at 15:00 +0100, Eric Dumazet wrote:
 In the case of call_rcu_bh(), you can be sure that the caller cannot afford 
 'sleeping memory allocations'. Better drop a frame than block the stack, no ?

Atomic allocations can't sleep and will fail, which is fine. If memory
allocation pressure exists for sleeping allocations because of a large
RCU backlog, we want to be sure that the RCU backlog from the networking
stack or other sources does not cause us to OOM kill or take incorrect
action.

So if, for example, we want to grow a process stack and the memory is
there, just stuck in the RCU lists pending recovery, we want to let the
RCU recovery happen before making drastic decisions.
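
Purely as an illustration of that ordering (alloc_retry_after_rcu() and
try_alloc() are invented helpers, not existing mm/ hooks):

/*
 * Hypothetical slow-path fragment: before escalating to reclaim or the
 * OOM killer, give RCU a chance to return the memory it is holding.
 */
static struct page *alloc_retry_after_rcu(gfp_t gfp_mask, unsigned int order)
{
	struct page *page = try_alloc(gfp_mask, order);	/* invented helper */

	if (!page && (gfp_mask & __GFP_WAIT)) {
		/*
		 * Wait for a grace period so queued callbacks can run and
		 * free memory; rcu_barrier() would be the stronger "all
		 * pending callbacks have finished" form of the same idea.
		 */
		synchronize_rcu();
		page = try_alloc(gfp_mask, order);
	}
	return page;		/* only now consider OOM-killing anything */
}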




Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Paul E. McKenney
On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
 On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
  I assume that if a CPU queued 10.000 items in its RCU queue, then the 
  oldest 
  entry cannot still be in use by another CPU. This might sounds as a 
  violation 
  of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
 
 Fixing the real problem in the routing code would be the real fix. 
 
 The underlying problem of RCU and memory usage could be solved more
 safely by making sure that the sleeping memory allocator path always
 waits until at least one RCU cleanup has occurred after it fails an
 allocation before it starts trying harder. That ought to also naturally
 throttle memory consumers more in the situation which is the right
 behaviour.

A quick look at rt_garbage_collect() leads me to believe that although
the IP route cache does try to limit its use of memory, it does not
fully account for memory that it has released to RCU, but that RCU has
not yet freed due to a grace period not having elapsed.

The following appears to be possible:

1.  rt_garbage_collect() sees that there are too many entries,
and sets goal to the number to free up, based on a
computed equilibrium value.

2.  The number of entries is (correctly) decremented only when
the corresponding RCU callback is invoked, which actually
frees the entry.

3.  Between the time that rt_garbage_collect() is invoked the
first time and when the RCU grace period ends, rt_garbage_collect()
is invoked again.  It still sees too many entries (since
RCU has not yet freed the ones released by the earlier
invocation in step (1) above), so frees a bunch more.

4.  Packets routed now miss the route cache, because the corresponding
entries are waiting for a grace period, slowing the system down.
Therefore, even more entries are freed to make room for new
entries corresponding to the new packets.

If my (likely quite naive) reading of the IP route cache code is correct,
it would be possible to end up in a steady state with most of the entries
always being in RCU rather than in the route cache.

Eric, could this be what is happening to your system?

If it is, one straightforward fix would be to keep a count of the number
of route-cache entries waiting on RCU, and for rt_garbage_collect()
to subtract this number of entries from its goal.  Does this make sense?
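
In rough terms (the counter name is mine, and while the rt_free()/dst_rcu_free()
shapes follow net/ipv4/route.c, treat this as a sketch rather than a tested patch):

/* Invented counter: route-cache entries handed to RCU but not yet freed. */
static atomic_t rt_entries_in_rcu = ATOMIC_INIT(0);

/* The RCU callback is where the entry really goes away. */
static void dst_rcu_free(struct rcu_head *head)
{
	atomic_dec(&rt_entries_in_rcu);
	/* ... existing freeing of the dst_entry ... */
}

static void rt_free(struct rtable *rt)
{
	atomic_inc(&rt_entries_in_rcu);
	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}

Then, near the top of rt_garbage_collect(), something like
goal -= min_t(int, goal, atomic_read(&rt_entries_in_rcu)); would avoid
charging the same entries against the cache a second time.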

Thanx, Paul


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Eric Dumazet

Paul E. McKenney wrote:

On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:

On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest 
entry cannot still be in use by another CPU. This might sounds as a violation 
of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
Fixing the real problem in the routing code would be the real fix. 


The underlying problem of RCU and memory usage could be solved more
safely by making sure that the sleeping memory allocator path always
waits until at least one RCU cleanup has occurred after it fails an
allocation before it starts trying harder. That ought to also naturally
throttle memory consumers more in the situation which is the right
behaviour.


A quick look at rt_garbage_collect() leads me to believe that although
the IP route cache does try to limit its use of memory, it does not
fully account for memory that it has released to RCU, but that RCU has
not yet freed due to a grace period not having elapsed.

The following appears to be possible:

1.  rt_garbage_collect() sees that there are too many entries,
and sets goal to the number to free up, based on a
computed equilibrium value.

2.  The number of entries is (correctly) decremented only when
the corresponding RCU callback is invoked, which actually
frees the entry.

3.  Between the time that rt_garbage_collect() is invoked the
first time and when the RCU grace period ends, rt_garbage_collect()
is invoked again.  It still sees too many entries (since
RCU has not yet freed the ones released by the earlier
invocation in step (1) above), so frees a bunch more.

4.  Packets routed now miss the route cache, because the corresponding
entries are waiting for a grace period, slowing the system down.
Therefore, even more entries are freed to make room for new
entries corresponding to the new packets.

If my (likely quite naive) reading of the IP route cache code is correct,
it would be possible to end up in a steady state with most of the entries
always being in RCU rather than in the route cache.

Eric, could this be what is happening to your system?

If it is, one straightforward fix would be to keep a count of the number
of route-cache entries waiting on RCU, and for rt_garbage_collect()
to subtract this number of entries from its goal.  Does this make sense?



Hi Paul

Thanks for reviewing route code :)

As I said, the problem comes from the 'route flush cache' that is periodically
done by rt_run_flush(), triggered by rt_flush_timer.

The 10% of LOWMEM RAM that was used by route-cache entries is pushed into RCU
queues (with call_rcu_bh()) while the network continues to receive
packets from *many* sources that want their route-cache entry.


Eric



Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Lee Revell
On Fri, 2006-01-06 at 13:58 +0100, Andi Kleen wrote:
 Another CPU might be stuck in a long 
 running interrupt

Shouldn't a long running interrupt be considered a bug?

Lee



Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Lee Revell
On Fri, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
 I have some servers that once in a while crashes when the ip route
 cache is flushed. After
 raising /proc/sys/net/ipv4/route/secret_interval (so that *no* 
 flush is done), I got better uptime for these servers. 

Argh, where is that documented?  I have been banging my head against
this for weeks - how do I keep the kernel from flushing 4096 routes at
once in softirq context causing huge (~8-20ms) latency problems?

I tried all the route related sysctls I could find and nothing worked...

Lee



Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Paul E. McKenney
On Fri, Jan 06, 2006 at 06:19:15PM +0100, Eric Dumazet wrote:
 Paul E. McKenney wrote:
 On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
 On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
 I assume that if a CPU queued 10.000 items in its RCU queue, then the 
 oldest entry cannot still be in use by another CPU. This might sounds as 
 a violation of RCU rules, (I'm not an RCU expert) but seems quite 
 reasonable.
 Fixing the real problem in the routing code would be the real fix. 
 
 The underlying problem of RCU and memory usage could be solved more
 safely by making sure that the sleeping memory allocator path always
 waits until at least one RCU cleanup has occurred after it fails an
 allocation before it starts trying harder. That ought to also naturally
 throttle memory consumers more in the situation which is the right
 behaviour.
 
 A quick look at rt_garbage_collect() leads me to believe that although
 the IP route cache does try to limit its use of memory, it does not
 fully account for memory that it has released to RCU, but that RCU has
 not yet freed due to a grace period not having elapsed.
 
 The following appears to be possible:
 
 1.   rt_garbage_collect() sees that there are too many entries,
  and sets goal to the number to free up, based on a
  computed equilibrium value.
 
 2.   The number of entries is (correctly) decremented only when
  the corresponding RCU callback is invoked, which actually
  frees the entry.
 
 3.   Between the time that rt_garbage_collect() is invoked the
  first time and when the RCU grace period ends, rt_garbage_collect()
  is invoked again.  It still sees too many entries (since
  RCU has not yet freed the ones released by the earlier
  invocation in step (1) above), so frees a bunch more.
 
 4.   Packets routed now miss the route cache, because the corresponding
  entries are waiting for a grace period, slowing the system down.
  Therefore, even more entries are freed to make room for new
  entries corresponding to the new packets.
 
 If my (likely quite naive) reading of the IP route cache code is correct,
 it would be possible to end up in a steady state with most of the entries
 always being in RCU rather than in the route cache.
 
 Eric, could this be what is happening to your system?
 
 If it is, one straightforward fix would be to keep a count of the number
 of route-cache entries waiting on RCU, and for rt_garbage_collect()
 to subtract this number of entries from its goal.  Does this make sense?
 
 
 Hi Paul
 
 Thanks for reviewing route code :)
 
 As I said, the problem comes from 'route flush cache', that is periodically 
 done by rt_run_flush(), triggered by rt_flush_timer.
 
 The 10% of LOWMEM ram that was used by route-cache entries are pushed into 
 rcu queues (with call_rcu_bh()) and network continue to receive
 packets from *many* sources that want their route-cache entry.

Hello, Eric,

The rt_run_flush() function could indeed be suffering from the same
problem.  Dipankar's recent patch should help RCU grace periods proceed
more quickly; does that help?

If not, it may be worthwhile to limit the number of times that
rt_run_flush() runs per RCU grace period.

Thanx, Paul


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Andi Kleen
On Saturday 07 January 2006 01:17, David S. Miller wrote:

 
 I mean something like this patch:

Looks like a good idea to me.

I always disliked the per chain spinlocks even for other hash tables like
TCP/UDP multiplex - it would be much nicer to use a much smaller separately 
hashed lock table and save cache. In this case the special case of using
a one entry only lock hash table makes sense.
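
For readers unfamiliar with the idea, the scheme being described is roughly
the following (illustrative names only; a table size of 1 degenerates into
the single global lock discussed above):

#define LOCK_TABLE_SZ	256		/* power of two, much smaller than the hash table */

static spinlock_t lock_table[LOCK_TABLE_SZ];	/* spin_lock_init() each at setup */

/* Many hash chains share one lock; only writers take it (readers use RCU). */
static inline spinlock_t *chain_lock(unsigned int hash)
{
	return &lock_table[hash & (LOCK_TABLE_SZ - 1)];
}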

-Andi


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread David S. Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Sat, 7 Jan 2006 02:09:01 +0100

 I always disliked the per chain spinlocks even for other hash tables like
 TCP/UDP multiplex - it would be much nicer to use a much smaller separately 
 hashed lock table and save cache. In this case the special case of using
 a one entry only lock hash table makes sense.

I used to think they were a great technique.  But in each case I
thought they could be applied, better schemes have come along.
In the case of the page cache we went to a per-address-space tree,
and here in the routing cache we went to RCU.

There are RCU patches around for the TCP hashes and I'd like to
put those in at some point as well.  In fact, they'd be even
more far reaching since Arnaldo abstracted away the socket
hashing stuff into an inet_hashtables subsystem.


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Eric Dumazet

Andi Kleen wrote:

> I always disliked the per chain spinlocks even for other hash tables like
> TCP/UDP multiplex - it would be much nicer to use a much smaller separately
> hashed lock table and save cache. In this case the special case of using
> a one entry only lock hash table makes sense.



I agree, I do use a hashed spinlock array on my local tree for TCP, mainly to
reduce the hash table size by a factor of 2.


Eric


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread David S. Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Sat, 07 Jan 2006 08:34:35 +0100

 I agree, I do use a hashed spinlock array on my local tree for TCP,
 mainly to reduce the hash table size by a 2 factor.

So what do you think about going to a single spinlock for the
routing cache?


Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

2006-01-06 Thread Eric Dumazet

David S. Miller wrote:

> From: Eric Dumazet [EMAIL PROTECTED]
> Date: Sat, 07 Jan 2006 08:34:35 +0100
>
>> I agree, I do use a hashed spinlock array on my local tree for TCP,
>> mainly to reduce the hash table size by a 2 factor.
>
> So what do you think about going to a single spinlock for the
> routing cache?


I have no problem with this, since the biggest server I have is 4-way, but are
you sure big machines won't suffer from this single spinlock ?

Also I don't understand what you want to do after this single spinlock patch.
How is it supposed to help the 'ip route flush cache' problem ?

In my case, I have about 600.000 dst-entries :

# grep ip_dst /proc/slabinfo
ip_dst_cache  616250 622440    320   12    1 : tunables   54   27    8 : slabdata  51870  51870      0



Eric