Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tue, Mar 06, 2007 at 02:20:55PM -0800, David Miller wrote:

From: Robert Olsson [EMAIL PROTECTED]
Date: Tue, 6 Mar 2007 14:26:04 +0100

David Miller writes:

Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow.

AFAIK the equilibrium is a resizing function as well, but using a fixed hash table. So can we do without equilibrium resizing if tables are dynamic? I think so. With the hash data structure we could monitor the average chain length, or just the size, and resize the hash after that.

I'm not so sure, it may be a mistake to eliminate the equilibrium logic. One error I think it does have is the usage of chain length. Even a nearly perfect hash has small lumps in its distribution, and we should not penalize entries which fall into these lumps.

Let us call T the threshold at which we would grow the routing hash table. As we approach T we start to GC. Let's assume the hash table has shift = 2, i.e. N = 4, and T (with the T = N + (N >> 1) algorithm) would therefore be 6.

TABLE: [0] DST1, DST2
       [1] DST3, DST4, DST5

DST6 arrives, what should we do? If we just accept it and don't GC some existing entries, we will grow the hash table. This is the wrong thing to do if our true working set is smaller than 6 entries, and thus some of the existing entries are unlikely to be reused and could be purged to keep us from hitting T. If they are all active, growing is the right thing to do. This is the crux of the whole routing cache problem.

I guess this is similar to our problems with bdev and filesystem caches as well. What we do in that case (as you would know) is to let the caches expand to the size of memory, and the problem just becomes balancing their relative importance.
We just try to go with a reasonable default, and provide a knob or two for fine tuning.

I am of the opinion that LRU, for routes not attached to sockets, is probably the best thing to do here. Furthermore, at high packet rates, the current rt_may_expire() logic probably is not very effective since its granularity is limited to jiffies. We can quite easily create 100,000 or more entries per jiffie when HZ=100 during rDOS, for example. So perhaps some global LRU algorithm using ktime is more appropriate.

Global LRU is not easy without touching a lot of memory. But I'm sure some clever trick can be discovered by someone :)

Well, we do a pseudo-LRU in most vm/vfs caches, where we just set a bit if it has been touched since last checked. Add another working set list to promote popular routes into, and you have something like our pagecache active/inactive reclaim. I don't know if that really applies here, but it might if you decide to try hooking this cache into the slab shrinker thingy...

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
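The touched-bit pseudo-LRU described above can be sketched in a few lines of userspace C. The struct and function names here are made up for illustration and are not taken from the kernel; this is just the one-bit clock idea: lookups set a referenced flag, and the reclaim pass clears it, only evicting entries that stayed idle for a whole pass.

```c
#include <stdbool.h>

/* Hypothetical entry with a one-bit reference history. */
struct rt_entry {
	bool referenced;
};

/* Called on every cache hit: cheap, no list movement, no lock. */
static void rt_touch(struct rt_entry *e)
{
	e->referenced = true;
}

/* One reclaim pass over an entry: returns true if it should be evicted. */
static bool rt_reclaim_pass(struct rt_entry *e)
{
	if (e->referenced) {
		e->referenced = false;	/* spare it, but clear the bit */
		return false;
	}
	return true;			/* idle since the last pass: evict */
}
```

A second "active" list to promote repeatedly-touched entries into would then give something shaped like the pagecache active/inactive scheme Nick mentions.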
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
From: Eric Dumazet [EMAIL PROTECTED]
Date: Tue, 06 Mar 2007 08:58:45 +0100

Yes, but on bootup you have an appropriate NUMA active policy. (Well... we hope so, but it broke several times in the past.) I am not sure what kind of mm policy is active for scheduled works.

Good point, that definitely needs investigation. Thanks for all of your comments so far Eric, very useful :)
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Mon, Mar 05, 2007 at 08:26:32PM -0800, David Miller wrote:

This is essentially a port of Nick Piggin's dcache hash table patches to the routing cache. It solves the locking issues during table grow/shrink that I couldn't handle properly last time I tried to code up a patch like this.

But one of the core issues of this kind of change still remains. There is a conflict between the desire of routing cache garbage collection to reach a state of equilibrium and the hash table grow code's desire to match the table size to the current state of affairs.

Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow.

One idea is to put the hash grow check in the garbage collector, and put the hash shrink check in rt_del(). In fact, it would be a good time to perhaps hack up some entirely new passive GC logic for the routing cache.

BTW, another thing that plays into this is that Robert's TRASH work could make this patch not necessary :-)

Finally, I know (due to some of Nick's helpful comments the other day) that I'm missing some rcu_assign_pointer()'s in here. Fixes in this area are most welcome.

Cool! I have some fixes for the rcu barrier issues, with some C-style comments and questions :) I was going to send you a fix first for the rcu barriers, then a second to convert the read-side to a barrier-less one that I described; however, considering that your patch is a WIP anyway, I won't worry too much about the normal protocol. I _think_ my reasoning regarding the rcu barriers and grace periods is correct. I'll keep thinking about it though. (Paul cc'ed).
I'm not so familiar with this code, so I have sprinkled around a lot of comments that could be pure crap ;) They are mainly just to help you ensure that you cover all bases... compile tested only at this stage.

--
Index: linux-2.6/net/ipv4/route.c
===================================================================
--- linux-2.6.orig/net/ipv4/route.c
+++ linux-2.6/net/ipv4/route.c
@@ -311,6 +311,8 @@ static void rthash_free(struct rt_hash_b
 static unsigned int rt_hash_code(struct rt_hash *hashtable,
 				 u32 daddr, u32 saddr)
 {
+	/* BUG_ON(!rcu_read_protected()) */
+
 	return (jhash_2words(daddr, saddr, rt_hash_rnd) & hashtable->mask);
 }
@@ -343,11 +345,16 @@ static void rt_hash_resize_work(struct w
 	old_rt_hash = rt_hash;
 	/*
-	 * ensure that if the reader sees the new dentry_hash,
-	 * then they will also see the old_dentry_hash assignment,
-	 * above.
+	 * ensure that if the reader sees the new rt_hash, then they will also
+	 * see the old_rt_hash assignment, above. synchronize_rcu() is used
+	 * rather than smp_wmb(), in order to avoid the smp_rmb() in the
+	 * read-side. However synchronize_rcu() also implies a smp_wmb(), so
+	 * that also means we can skip rcu_assign_pointer().
+	 *
+	 * The readers can then also skip rcu_dereference, because a grace
+	 * period implies that all readers have performed memory barriers.
	 */
-	smp_wmb();
+	synchronize_rcu();
 	rt_hash = new_hash;
 	synchronize_rcu();
@@ -1100,6 +1107,8 @@ static int rt_intern_hash(struct rt_hash
 	int chain_length;
 	int attempts = !in_softirq();
+	/* BUG_ON(!rcu_read_protected()) */
+
 restart:
 	chain_length = 0;
 	min_score = ~(u32)0;
@@ -1286,6 +1295,8 @@ static void rt_del(struct rt_hash *h, un
 {
 	struct rtable **rthp;
+	/* BUG_ON(!rcu_read_protected()) */
+
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	ip_rt_put(rt);
 	for (rthp = &h->table[hash].chain; *rthp;
@@ -1328,12 +1339,24 @@ void ip_rt_redirect(__be32 old_gw, __be3
 	for (i = 0; i < 2; i++) {
 		for (k = 0; k < 2; k++) {
-			struct rt_hash *h = rt_hash;
-			unsigned hash = rt_hashfn(h, daddr, skeys[i], ikeys[k]);
+			struct rt_hash *h;
+			unsigned hash;
+
+			/*
+			 * rcu_read_lock() must cover the load of rt_hash, in
+			 * order to satisfy our RCU protected dynamic hash
+			 * sizing scheme; and it must also cover the hash list
+			 * traversal, to satisfy our RCU protected lockless
+			 * hash entry lookups.
+			 *
+			 * This note applies throughout the file.
+			 */
+			rcu_read_lock();
+			h =
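The read side this scheme implies — look in the current table first, and fall back to the old table while a resize transfer is in flight — can be modeled in plain single-threaded userspace C. RCU and locking are omitted, and all of the type and function names below are illustrative rather than the kernel's:

```c
#include <stddef.h>

struct entry {
	unsigned int daddr;
	struct entry *next;	/* hash chain link */
};

struct table {
	struct entry **buckets;
	unsigned int mask;	/* buckets - 1 */
};

/* Walk one chain looking for a destination address. */
static struct entry *lookup_one(struct table *t, unsigned int daddr)
{
	struct entry *e;

	for (e = t->buckets[daddr & t->mask]; e; e = e->next)
		if (e->daddr == daddr)
			return e;
	return NULL;
}

/*
 * Readers check the current table first; if a resize is in flight
 * (old != NULL), entries not yet transferred are found in the old one.
 */
static struct entry *rt_lookup(struct table *cur, struct table *old,
			       unsigned int daddr)
{
	struct entry *e = lookup_one(cur, daddr);

	if (!e && old)
		e = lookup_one(old, daddr);
	return e;
}
```

In the real patch the ordering guarantees between publishing old_rt_hash and the new rt_hash come from the synchronize_rcu() calls discussed above; this model only shows the lookup shape.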
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
From: Nick Piggin [EMAIL PROTECTED]
Date: Tue, 6 Mar 2007 10:11:12 +0100

@@ -1449,6 +1472,12 @@ static struct dst_entry *ipv4_negative_a
 				"%u.%u.%u.%u/%02x dropped\n",
 				NIPQUAD(rt->rt_dst), rt->fl.fl4_tos);
 #endif
+			/* XXX:
+			 * What if rt does not exist in rt_hash, but is in
+			 * old_rt_hash? Don't we have to also check there?
+			 * Similar questions for a couple of other places that
+			 * look at rt_hash, but not old_rt_hash.
+			 */
 			rt_del(h, hash, rt);
 			ret = NULL;
 		}

For the cases like ip_rt_redirect() I made the decision that we'll just not add the complexity of having to look in the old_rt_hash table. In these kinds of cases it's OK to miss events, they will just happen again.

It's one of the nice things about the routing cache: if you lose information it's OK, because we'll just cook up a new entry from the persistent backing store that is the real routing tables. And for events like redirects, if we miss one, we'll just send the packet to the wrong next-hop again and receive another redirect message, which we'll (hopefully) propagate to a routing cache entry.

Thanks for your feedback patch Nick, I'll process it tomorrow hopefully after getting some sleep.
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tue, Mar 06, 2007 at 01:17:06AM -0800, David Miller wrote:

From: Nick Piggin [EMAIL PROTECTED]
Date: Tue, 6 Mar 2007 10:11:12 +0100

@@ -1449,6 +1472,12 @@ static struct dst_entry *ipv4_negative_a
 				"%u.%u.%u.%u/%02x dropped\n",
 				NIPQUAD(rt->rt_dst), rt->fl.fl4_tos);
 #endif
+			/* XXX:
+			 * What if rt does not exist in rt_hash, but is in
+			 * old_rt_hash? Don't we have to also check there?
+			 * Similar questions for a couple of other places that
+			 * look at rt_hash, but not old_rt_hash.
+			 */
 			rt_del(h, hash, rt);
 			ret = NULL;
 		}

For the cases like ip_rt_redirect() I made the decision that we'll just not add the complexity of having to look in the old_rt_hash table. In these kinds of cases it's OK to miss events, they will just happen again. It's one of the nice things about the routing cache: if you lose information it's OK, because we'll just cook up a new entry from the persistent backing store that is the real routing tables. And for events like redirects, if we miss one, we'll just send the packet to the wrong next-hop again and receive another redirect message, which we'll (hopefully) propagate to a routing cache entry.

Ah, that's a very neat trick. OK, so with that question out of the way, there _may_ just be a few other places where you're working with an rt_hash table outside an rcu read critical section. I tried to fix up a couple of obvious ones, and I've just put in the BUG_ON assertions for you to verify the rest.

Thanks for your feedback patch Nick, I'll process it tomorrow hopefully after getting some sleep.

No problem. Thanks.
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tuesday 06 March 2007 10:11, Nick Piggin wrote:

Cool! I have some fixes for the rcu barrier issues, with some C-style comments and questions :) I was going to send you a fix first for the rcu barriers, then a second to convert the read-side to a barrier-less one that I described; however, considering that your patch is a WIP anyway, I won't worry too much about the normal protocol. I _think_ my reasoning regarding the rcu barriers and grace periods is correct. I'll keep thinking about it though. (Paul cc'ed).

I'm not so familiar with this code, so I have sprinkled around a lot of comments that could be pure crap ;) They are mainly just to help you ensure that you cover all bases... compile tested only at this stage.

I think we missed:

+static void rt_hash_resize_work(struct work_struct *work)
+
+			*head = rth->u.dst.rt_next;
+
+			hash = rt_hashfn(rt_hash,
+					 rth->fl.fl4_dst,
+					 rth->fl.fl4_src,
+					 iface);
+			rth->u.dst.rt_next = rt_hash->table[hash].chain;
+			rt_hash->table[hash].chain = rth;

This really needs some ..._del_rcu()/..._add_rcu() primitives, no? Or else a reader might be very confused...
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tue, Mar 06, 2007 at 10:23:44AM +0100, Eric Dumazet wrote:

I think we missed:

+static void rt_hash_resize_work(struct work_struct *work)
+
+			*head = rth->u.dst.rt_next;
+
+			hash = rt_hashfn(rt_hash,
+					 rth->fl.fl4_dst,
+					 rth->fl.fl4_src,
+					 iface);
+			rth->u.dst.rt_next = rt_hash->table[hash].chain;
+			rt_hash->table[hash].chain = rth;

This really needs some ..._del_rcu()/..._add_rcu() primitives, no? Or else a reader might be very confused...

I'm not sure... this code really depends on the hash table management, rather than the management of the hash tables, if you understand me ;)

From what I can _see_, this is similar to how rt_intern_hash does it. I don't know exactly why rt_intern_hash can get away without using rcu_assign_pointer in some cases, however:

Note that we don't need an rcu_assign_pointer for this, because the memory operations that initialized the entry have already been ordered when it was first inserted.
[RFC PATCH]: Dynamically sized routing cache hash table.
David Miller writes:

Interesting.

Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow.

AFAIK the equilibrium is a resizing function as well, but using a fixed hash table. So can we do without equilibrium resizing if tables are dynamic? I think so. With the hash data structure we could monitor the average chain length or just the size and resize the hash after that.

One idea is to put the hash grow check in the garbage collector, and put the hash shrink check in rt_del(). In fact, it would be a good time to perhaps hack up some entirely new passive GC logic for the routing cache.

Could be. Remember there is GC in the hash chain as well, which was added later; although it doesn't decrease the number of entries, it gives an upper limit. Also the gc-goal must be picked so it does not force unwanted resizing.

BTW, another thing that plays into this is that Robert's TRASH work could make this patch not necessary :-)

It has built-in resize and chain control, and the gc-goal is chosen so as not to unnecessarily resize the root node.

Cheers.
--ro
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
Eric Dumazet writes:

Well, maybe... but after looking at Robert's trash, I discovered its model is essentially a big (2^18 slots) root node (our hash table), and very few order:1,2,3 nodes.

It's getting hash-like, yes. I guess all effective algorithms today are doing some sort of index lookup, and for a large number of entries we cannot expect to find the next node in the same cache line, so the tree depth becomes a crucial performance factor. IMO nothing can beat a perfectly distributed and perfectly sized hash. The trash work is an effort to get close with a dynamic data structure.

Cheers
--ro
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tuesday 06 March 2007 14:42, Robert Olsson wrote:

Eric Dumazet writes:

Well, maybe... but after looking at Robert's trash, I discovered its model is essentially a big (2^18 slots) root node (our hash table), and very few order:1,2,3 nodes.

It's getting hash-like, yes. I guess all effective algorithms today are doing some sort of index lookup, and for a large number of entries we cannot expect to find the next node in the same cache line, so the tree depth becomes a crucial performance factor. IMO nothing can beat a perfectly distributed and perfectly sized hash. The trash work is an effort to get close with a dynamic data structure.

Indeed. It would be nice to see how it performs with say 2^20 elements... Because with your data, I wonder if the extra complexity of the trash is worth it (since most lookups are going to only hit the hash and give the answer without intermediate nodes).
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
Eric Dumazet writes:

Indeed. It would be nice to see how it performs with say 2^20 elements... Because with your data, I wonder if the extra complexity of the trash is worth it (since most lookups are going to only hit the hash and give the answer without intermediate nodes).

I don't know if I understand you fully. Yes, in most cases the first lookup via the hash header will take us directly to the correct leaf. If there are collisions we have to sort them out by adding intermediate nodes. Something like a resizable hash where each bucket is in turn a resizable hash, etc.

Cheers
--ro
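Robert's "resizable hash where each bucket is in turn a resizable hash" could be sketched as a two-level lookup. Everything below is a hypothetical illustration of that shape, not TRASH itself: the common case hits a leaf directly off the root table, and only colliding buckets grow an intermediate subtable.

```c
#include <stddef.h>

struct leaf {
	unsigned int key;
};

struct bucket {
	struct leaf *direct;	/* common case: one entry, no subtable */
	struct leaf **subtable;	/* collision case: nested hash table */
	unsigned int submask;	/* subtable size - 1 */
};

static struct leaf *lookup(struct bucket *root, unsigned int rootmask,
			   unsigned int key)
{
	struct bucket *b = &root[key & rootmask];

	if (b->direct && b->direct->key == key)
		return b->direct;	/* first probe hits the leaf */
	if (b->subtable) {
		struct leaf *l = b->subtable[key & b->submask];

		if (l && l->key == key)	/* intermediate node resolves it */
			return l;
	}
	return NULL;
}
```

With a well-distributed root table almost every lookup terminates at `direct`, which matches Eric's observation that most lookups hit the hash without intermediate nodes.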
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tuesday 06 March 2007 18:05, Robert Olsson wrote:

Eric Dumazet writes:

Indeed. It would be nice to see how it performs with say 2^20 elements... Because with your data, I wonder if the extra complexity of the trash is worth it (since most lookups are going to only hit the hash and give the answer without intermediate nodes).

I don't know if I understand you fully. Yes, in most cases the first lookup via the hash header will take us directly to the correct leaf. If there are collisions we have to sort them out by adding intermediate nodes.

With 2^20 entries, your actual limit of 2^19 entries in the root node will probably show us quite different numbers for order-1,2,3,4... tnodes.

Something like a resizable hash where each bucket is in turn a resizable hash, etc.

Yes, the numbers you gave us basically showed a big root node, and mainly leaves and very few tnodes. I was interested to see the distribution in case the root-node limit is hit, and we load into the table a *lot* of entries.
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
Eric Dumazet writes:

With 2^20 entries, your actual limit of 2^19 entries in the root node will probably show us quite different numbers for order-1,2,3,4... tnodes.

Yes, the trie will get deeper, and lookup more costly, as will insert and delete. The 2^19 limit was because I was getting a memory allocation problem that I never sorted out.

Yes, the numbers you gave us basically showed a big root node, and mainly leaves and very few tnodes. I was interested to see the distribution in case the root-node limit is hit, and we load into the table a *lot* of entries.

Max length etc... well, maybe the root restriction should be removed and we should just have a max size instead.

Cheers
--ro
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
From: Robert Olsson [EMAIL PROTECTED]
Date: Tue, 6 Mar 2007 14:26:04 +0100

David Miller writes:

Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow.

AFAIK the equilibrium is a resizing function as well, but using a fixed hash table. So can we do without equilibrium resizing if tables are dynamic? I think so. With the hash data structure we could monitor the average chain length or just the size and resize the hash after that.

I'm not so sure, it may be a mistake to eliminate the equilibrium logic. One error I think it does have is the usage of chain length. Even a nearly perfect hash has small lumps in its distribution, and we should not penalize entries which fall into these lumps.

Let us call T the threshold at which we would grow the routing hash table. As we approach T we start to GC. Let's assume the hash table has shift = 2, i.e. N = 4, and T (with the T = N + (N >> 1) algorithm) would therefore be 6.

TABLE: [0] DST1, DST2
       [1] DST3, DST4, DST5

DST6 arrives, what should we do? If we just accept it and don't GC some existing entries, we will grow the hash table. This is the wrong thing to do if our true working set is smaller than 6 entries, and thus some of the existing entries are unlikely to be reused and could be purged to keep us from hitting T. If they are all active, growing is the right thing to do. This is the crux of the whole routing cache problem.

I am of the opinion that LRU, for routes not attached to sockets, is probably the best thing to do here. Furthermore, at high packet rates, the current rt_may_expire() logic probably is not very effective since its granularity is limited to jiffies.
We can quite easily create 100,000 or more entries per jiffie when HZ=100 during rDOS, for example. So perhaps some global LRU algorithm using ktime is more appropriate.

Global LRU is not easy without touching a lot of memory. But I'm sure some clever trick can be discovered by someone :)

It is amusing, but it seems that for the rDOS workload the most optimal routing hash would be a tiny one, like my example above. All packets essentially miss the routing cache and create new entries. So keeping the working set as small as possible is what you want to do, since no matter how large you grow, your hit rate will be zero :-)
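The threshold arithmetic from the example earlier in the thread (shift = 2, so N = 4 and T = N + (N >> 1) = 6, with a shrink point at N >> 1) can be written down as a standalone helper. The function name is made up for illustration; it only models the grow/shrink decision, not the GC that is supposed to run before it:

```c
/*
 * Decide the next table shift given the current shift and entry count:
 * grow past T = N + N/2, shrink below N/2, otherwise stay in the
 * equilibrium band. Hypothetical helper, not kernel code.
 */
static int next_shift(unsigned int shift, unsigned int nr_entries)
{
	unsigned int sz = 1u << shift;		/* N */

	if (nr_entries > sz + (sz >> 1))	/* past T: grow */
		return shift + 1;
	if (nr_entries < (sz >> 1))		/* working set shrank */
		return shift - 1;
	return shift;
}
```

The debate above is precisely about what happens inside the equilibrium band: whether GC should purge stale entries to keep `nr_entries` under T, or whether crossing T should simply grow the table.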
[RFC PATCH]: Dynamically sized routing cache hash table.
This is essentially a port of Nick Piggin's dcache hash table patches to the routing cache. It solves the locking issues during table grow/shrink that I couldn't handle properly last time I tried to code up a patch like this.

But one of the core issues of this kind of change still remains. There is a conflict between the desire of routing cache garbage collection to reach a state of equilibrium and the hash table grow code's desire to match the table size to the current state of affairs.

Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow.

One idea is to put the hash grow check in the garbage collector, and put the hash shrink check in rt_del(). In fact, it would be a good time to perhaps hack up some entirely new passive GC logic for the routing cache.

BTW, another thing that plays into this is that Robert's TRASH work could make this patch not necessary :-)

Finally, I know (due to some of Nick's helpful comments the other day) that I'm missing some rcu_assign_pointer()'s in here. Fixes in this area are most welcome.

This patch passes basic testing on UP sparc64, but please handle with care :)

Signed-off-by: David S.
Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0b3d7bf..57e004a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -92,6 +92,9 @@
 #include <linux/jhash.h>
 #include <linux/rcupdate.h>
 #include <linux/times.h>
+#include <linux/workqueue.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
 #include <net/protocol.h>
 #include <net/ip.h>
 #include <net/route.h>
@@ -242,28 +245,195 @@ static spinlock_t	*rt_hash_locks;
 # define rt_hash_lock_init()
 #endif
-static struct rt_hash_bucket	*rt_hash_table;
-static unsigned			rt_hash_mask;
-static int			rt_hash_log;
-static unsigned int		rt_hash_rnd;
+#define MIN_RTHASH_SHIFT 4
+#if BITS_PER_LONG == 32
+#define MAX_RTHASH_SHIFT 24
+#else
+#define MAX_RTHASH_SHIFT 30
+#endif
+
+struct rt_hash {
+	struct rt_hash_bucket	*table;
+	unsigned int		mask;
+	unsigned int		log;
+};
+
+struct rt_hash *rt_hash __read_mostly;
+struct rt_hash *old_rt_hash __read_mostly;
+static unsigned int rt_hash_rnd __read_mostly;
+static DEFINE_SEQLOCK(resize_transfer_lock);
+static DEFINE_MUTEX(resize_mutex);
 
 static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
 #define RT_CACHE_STAT_INC(field) \
 	(__raw_get_cpu_var(rt_cache_stat).field++)
 
-static int rt_intern_hash(unsigned hash, struct rtable *rth,
-			  struct rtable **res);
+static void rt_hash_resize(unsigned int new_shift);
+static void check_nr_rthash(void)
+{
+	unsigned int sz = rt_hash->mask + 1;
+	unsigned int nr = atomic_read(&ipv4_dst_ops.entries);
+
+	if (unlikely(nr > (sz + (sz >> 1))))
+		rt_hash_resize(rt_hash->log + 1);
+	else if (unlikely(nr < (sz >> 1)))
+		rt_hash_resize(rt_hash->log - 1);
+}
 
-static unsigned int rt_hash_code(u32 daddr, u32 saddr)
+static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
+{
+	struct rt_hash_bucket *n;
+
+	if (sz <= PAGE_SIZE)
+		n = kmalloc(sz, GFP_KERNEL);
+	else if (hashdist)
+		n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
+	else
+		n = (struct rt_hash_bucket *)
+			__get_free_pages(GFP_KERNEL, get_order(sz));
+
+	if (n)
+		memset(n, 0, sz);
+
+	return n;
+}
+
+static void rthash_free(struct rt_hash_bucket *r, unsigned int sz)
+{
+	if (sz <= PAGE_SIZE)
+		kfree(r);
+	else if (hashdist)
+		vfree(r);
+	else
+		free_pages((unsigned long)r, get_order(sz));
+}
+
+static unsigned int rt_hash_code(struct rt_hash *hashtable,
+				 u32 daddr, u32 saddr)
 {
 	return (jhash_2words(daddr, saddr, rt_hash_rnd) &
-		rt_hash_mask);
+		hashtable->mask);
 }
 
-#define rt_hash(daddr, saddr, idx) \
-	rt_hash_code((__force u32)(__be32)(daddr),\
+#define rt_hashfn(htab, daddr, saddr, idx) \
+	rt_hash_code(htab, (__force u32)(__be32)(daddr),\
 		(__force u32)(__be32)(saddr) ^ ((idx) << 5))
 
+static unsigned int resize_new_shift;
+
+static void rt_hash_resize_work(struct work_struct *work)
+{
+	struct rt_hash *new_hash, *old_hash;
+	unsigned int new_size, old_size, transferred;
+	int i;
+
+	if (!mutex_trylock(&resize_mutex))
+		goto out;
+
+	new_hash = kmalloc(sizeof(struct rt_hash), GFP_KERNEL);
+	if (!new_hash)
+		goto out_unlock;
+
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
David Miller wrote:

This is essentially a port of Nick Piggin's dcache hash table patches to the routing cache. It solves the locking issues during table grow/shrink that I couldn't handle properly last time I tried to code up a patch like this.

But one of the core issues of this kind of change still remains. There is a conflict between the desire of routing cache garbage collection to reach a state of equilibrium and the hash table grow code's desire to match the table size to the current state of affairs.

Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow.

One idea is to put the hash grow check in the garbage collector, and put the hash shrink check in rt_del(). In fact, it would be a good time to perhaps hack up some entirely new passive GC logic for the routing cache.

BTW, another thing that plays into this is that Robert's TRASH work could make this patch not necessary :-)

Well, maybe... but after looking at Robert's trash, I discovered its model is essentially a big (2^18 slots) root node (our hash table), and very few order:1,2,3 nodes. Almost all leaves... work in progress anyway.

Please find my comments in your patch.

Finally, I know (due to some of Nick's helpful comments the other day) that I'm missing some rcu_assign_pointer()'s in here. Fixes in this area are most welcome.

This patch passes basic testing on UP sparc64, but please handle with care :)

Signed-off-by: David S.
Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0b3d7bf..57e004a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -92,6 +92,9 @@
 #include <linux/jhash.h>
 #include <linux/rcupdate.h>
 #include <linux/times.h>
+#include <linux/workqueue.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
 #include <net/protocol.h>
 #include <net/ip.h>
 #include <net/route.h>
@@ -242,28 +245,195 @@ static spinlock_t	*rt_hash_locks;
 # define rt_hash_lock_init()
 #endif
-static struct rt_hash_bucket	*rt_hash_table;
-static unsigned			rt_hash_mask;
-static int			rt_hash_log;
-static unsigned int		rt_hash_rnd;
+#define MIN_RTHASH_SHIFT 4

I wonder... are you sure this has no relation with the size of rt_hash_locks / RT_HASH_LOCK_SZ? One entry must have the same lock in the two tables when resizing is in flight.

#define MIN_RTHASH_SHIFT LOG2(RT_HASH_LOCK_SZ)

+#if BITS_PER_LONG == 32
+#define MAX_RTHASH_SHIFT 24
+#else
+#define MAX_RTHASH_SHIFT 30
+#endif
+
+struct rt_hash {
+	struct rt_hash_bucket	*table;
+	unsigned int		mask;
+	unsigned int		log;
+};
+
+struct rt_hash *rt_hash __read_mostly;
+struct rt_hash *old_rt_hash __read_mostly;
+static unsigned int rt_hash_rnd __read_mostly;
+static DEFINE_SEQLOCK(resize_transfer_lock);
+static DEFINE_MUTEX(resize_mutex);

I think a better model would be a structure, with a part containing 'read mostly' data, and a part of 'highly modified' data, with appropriate align_to_cache. For example, resize_transfer_lock should be in the first part, like rt_hash and old_rt_hash, don't you think? All static data of this file should be placed in this single structure so that we can easily avoid false sharing and have optimal placement.
 static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
 #define RT_CACHE_STAT_INC(field) \
 	(__raw_get_cpu_var(rt_cache_stat).field++)
 
-static int rt_intern_hash(unsigned hash, struct rtable *rth,
-			  struct rtable **res);
+static void rt_hash_resize(unsigned int new_shift);
+static void check_nr_rthash(void)
+{
+	unsigned int sz = rt_hash->mask + 1;
+	unsigned int nr = atomic_read(&ipv4_dst_ops.entries);
+
+	if (unlikely(nr > (sz + (sz >> 1))))
+		rt_hash_resize(rt_hash->log + 1);
+	else if (unlikely(nr < (sz >> 1)))
+		rt_hash_resize(rt_hash->log - 1);
+}
 
-static unsigned int rt_hash_code(u32 daddr, u32 saddr)
+static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
+{
+	struct rt_hash_bucket *n;
+
+	if (sz <= PAGE_SIZE)
+		n = kmalloc(sz, GFP_KERNEL);
+	else if (hashdist)
+		n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
+	else
+		n = (struct rt_hash_bucket *)
+			__get_free_pages(GFP_KERNEL, get_order(sz));

I don't feel well with this. Maybe we could try a __get_free_pages(), and in case of failure, fall back to vmalloc(). Then keep a flag to be able to free the memory correctly. Anyway, if (get_order(sz) >= MAX_ORDER) we know __get_free_pages() will fail.

+
+	if (n)
+		memset(n, 0, sz);
+
+	return n;
+}
+
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
From: Eric Dumazet [EMAIL PROTECTED]
Date: Tue, 06 Mar 2007 08:14:46 +0100

 I wonder... are you sure this has no relation to the size of rt_hash_locks / RT_HASH_LOCK_SZ ? One entry must have the same lock in the two tables while a resize is in flight.

 #define MIN_RTHASH_SHIFT LOG2(RT_HASH_LOCK_SZ)

Good point.

 +static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
 +{
 +	struct rt_hash_bucket *n;
 +
 +	if (sz <= PAGE_SIZE)
 +		n = kmalloc(sz, GFP_KERNEL);
 +	else if (hashdist)
 +		n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
 +	else
 +		n = (struct rt_hash_bucket *)
 +			__get_free_pages(GFP_KERNEL, get_order(sz));

 I don't feel comfortable with this. Maybe we could try __get_free_pages() and, in case of failure, fall back to vmalloc(), then keep a flag so we can free the memory correctly. Anyway, if (get_order(sz) >= MAX_ORDER) we know __get_free_pages() will fail.

We have to use vmalloc() for the hashdist case so that the pages are spread out properly on NUMA systems. That's exactly what the large system hash allocator is going to do on bootup anyway.

Look, either both are right or both are wrong. I'm just following the protocol above, and you'll note that the PRECISE same logic exists in other dynamically growing hash table implementations, such as net/xfrm/xfrm_hash.c.

 Could you add const qualifiers to 'struct rt_hash *' in prototypes where appropriate ?

Sure, no problem.

 Maybe for small tables (less than PAGE_SIZE/2) we could embed them in 'struct rt_hash'.

Not worth the pain nor the in-kernel-image space it would chew up, in my opinion. After you visit a handful of web sites you'll get beyond that threshold.

 Could we group all the static vars at the beginning of this file, so that we clearly see where we should place them, to avoid false sharing?

Sure.
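Eric's MIN_RTHASH_SHIFT observation can be checked with a small userspace model. The names and constants below are illustrative (LOCK_SZ stands in for RT_HASH_LOCK_SZ, and the masks in the test for the old and new table masks): the chain lock is chosen from the low bits of the bucket index, so as long as both tables have at least LOCK_SZ buckets (shift >= log2(LOCK_SZ)), an entry's bucket index in the old and new tables has identical low bits and both chains map to the same lock while the transfer is in flight.

```c
#include <assert.h>

#define LOCK_SZ 16	/* illustrative stand-in for RT_HASH_LOCK_SZ (power of 2) */

/* Which lock in the shared lock array guards the chain for this entry?
 * bucket index = hash & table_mask; lock index = its low bits. */
static unsigned int lock_index(unsigned int hash, unsigned int table_mask)
{
	return (hash & table_mask) & (LOCK_SZ - 1);
}
```

Because both masks cover at least the LOCK_SZ - 1 low bits, the same hash yields the same lock index regardless of which table it is indexed into.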
 +
 +static void rt_hash_resize(unsigned int new_shift)

Damn, please don't quote such huge portions of a patch without any comments; this has to go out to several thousand recipients, you know :-/

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
David Miller wrote:

 From: Eric Dumazet [EMAIL PROTECTED]
 Date: Tue, 06 Mar 2007 08:14:46 +0100

  I wonder... are you sure this has no relation to the size of rt_hash_locks / RT_HASH_LOCK_SZ ? One entry must have the same lock in the two tables while a resize is in flight.

  #define MIN_RTHASH_SHIFT LOG2(RT_HASH_LOCK_SZ)

 Good point.

  +static struct rt_hash_bucket *rthash_alloc(unsigned int sz)
  +{
  +	struct rt_hash_bucket *n;
  +
  +	if (sz <= PAGE_SIZE)
  +		n = kmalloc(sz, GFP_KERNEL);
  +	else if (hashdist)
  +		n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
  +	else
  +		n = (struct rt_hash_bucket *)
  +			__get_free_pages(GFP_KERNEL, get_order(sz));

  I don't feel comfortable with this. Maybe we could try __get_free_pages() and, in case of failure, fall back to vmalloc(), then keep a flag so we can free the memory correctly. Anyway, if (get_order(sz) >= MAX_ORDER) we know __get_free_pages() will fail.

 We have to use vmalloc() for the hashdist case so that the pages are spread out properly on NUMA systems. That's exactly what the large system hash allocator is going to do on bootup anyway.

Yes, but on bootup you have an appropriate NUMA policy active. (Well... we hope so; it has broken several times in the past.) I am not sure what kind of mm policy is active for scheduled works. Anyway, I have some XX GB machines, non-NUMA, and I would love to be able to have a 2^20-slot hash table without having to increase MAX_ORDER.

 Look, either both are right or both are wrong. I'm just following the protocol above, and you'll note that the PRECISE same logic exists in other dynamically growing hash table implementations, such as net/xfrm/xfrm_hash.c.

Yes, they are both wrong/dumb :) Can we be smarter, or do we have to stay dumb?
:)

	struct rt_hash_bucket *n = NULL;

	if (sz <= PAGE_SIZE) {
		n = kmalloc(sz, GFP_KERNEL);
		*kind = allocated_by_kmalloc;
	} else if (!hashdist) {
		n = (struct rt_hash_bucket *)
			__get_free_pages(GFP_KERNEL, get_order(sz));
		*kind = allocated_by_get_free_pages;
	}
	if (!n) {
		n = __vmalloc(sz, GFP_KERNEL, PAGE_KERNEL);
		*kind = allocated_by_vmalloc;
	}
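The allocator selection in Eric's sketch can be modeled in userspace. The sketch below is illustrative, not kernel code: it captures only the decision of which allocator gets used, with pages_available standing in for "__get_free_pages() would succeed" (it fails, e.g., when get_order(sz) >= MAX_ORDER). Note how every path that doesn't produce memory falls through to vmalloc(), and the chosen kind is recorded so the table can later be freed with the matching routine.

```c
#include <assert.h>
#include <stddef.h>

enum alloc_kind {
	allocated_by_kmalloc,
	allocated_by_get_free_pages,
	allocated_by_vmalloc,
};

/* Userspace model of the allocator choice in Eric's proposal. */
static enum alloc_kind pick_allocator(size_t sz, size_t page_size,
				      int hashdist, int pages_available)
{
	if (sz <= page_size)
		return allocated_by_kmalloc;		/* small: slab */
	if (!hashdist && pages_available)
		return allocated_by_get_free_pages;	/* contiguous pages */
	return allocated_by_vmalloc;			/* NUMA spread, or fallback
							   when high-order pages fail */
}
```

The freeing side would switch on the recorded kind and call kfree(), free_pages(), or vfree() accordingly.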