Re: [PATCH -rt] Fix initialization of spinlock in irttp_dup()
* Deepak Saxena [EMAIL PROTECTED] wrote:

This was found around the 2.6.10 timeframe when testing with the -rt patch, and I believe it is still an issue. irttp_dup() does a memcpy() of the tsap_cb structure, causing the spinlock protecting various fields in the structure to be duped. This works OK in the non-RT case, but in the RT case we end up with two mutexes pointing to the same wait_list, leading to an oops. The fix is to simply initialize the spinlock after the memcpy().

Note that memcpy-based lock initialization is a problem for lockdep too.

	Ingo

- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Oops in filter add
On Tue, 2007-20-03 at 11:58 +0100, Patrick McHardy wrote:

jamal wrote: So the resolution (as Dave points out) was wrong. In any case, restoring queue_lock for now would slow things but will remove the race.

Yes. I think that's what we should do for 2.6.21, since fixing this while keeping ingress_lock is quite intrusive. Reasonable. I'm on it. I'm using the opportunity to try to simplify the qdisc locking.

Ok, thanks Patrick. BTW, I was just staring at the code and I think I have found a probably long-standing minor bug on the holding of the tree lock. I will post a patch shortly if I don't get disrupted.

cheers,
jamal
[PATCH 1/1][PKT_CLS] Avoid multiple tree locks
Seems to have been around a while. IMO, material for 2.6.21 but not stable. I have only compile-tested but it looks right(tm). I could have moved the lock down, but this looked safer.

cheers,
jamal

[PKT_CLS] Avoid multiple tree locks

This fixes: when dumping filters, the tree is locked first in the main dump function and then again when looking up the qdisc.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 4a52cdd599f259b05320219d7aba1bac58fdf6d0
tree e9e4b83f7a2925b4408e4f18211365c3f9bff3fa
parent 0a14fe6e5efd0af0f9c6c01e0433445d615d0110
author Jamal Hadi Salim [EMAIL PROTECTED] Wed, 21 Mar 2007 05:27:55 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Wed, 21 Mar 2007 05:27:55 -0400

 include/net/pkt_sched.h |    1 +
 net/sched/cls_api.c     |    2 +-
 net/sched/sch_api.c     |    2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index f6afee7..dd930bd 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -212,6 +212,7 @@ extern struct Qdisc_ops bfifo_qdisc_ops;
 extern int register_qdisc(struct Qdisc_ops *qops);
 extern int unregister_qdisc(struct Qdisc_ops *qops);
+extern struct Qdisc *__qdisc_lookup(struct net_device *dev, u32 handle);
 extern struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle);
 extern struct Qdisc *qdisc_lookup_class(struct net_device *dev, u32 handle);
 extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 5c6ffdb..17d4d37 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -403,7 +403,7 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	if (!tcm->tcm_parent)
 		q = dev->qdisc_sleeping;
 	else
-		q = qdisc_lookup(dev, TC_H_MAJ(tcm->tcm_parent));
+		q = __qdisc_lookup(dev, TC_H_MAJ(tcm->tcm_parent));
 	if (!q)
 		goto out;
 	if ((cops = q->ops->cl_ops) == NULL)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ecc988a..1a3b65e 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -190,7 +190,7 @@ int unregister_qdisc(struct Qdisc_ops *qops)
  * (root qdisc, all its children, children of children etc.)
  */
-static struct Qdisc *__qdisc_lookup(struct net_device *dev, u32 handle)
+struct Qdisc *__qdisc_lookup(struct net_device *dev, u32 handle)
 {
 	struct Qdisc *q;
Re: [PATCH 1/1][PKT_CLS] Avoid multiple tree locks
jamal wrote: Seems to have been around a while. IMO, material for 2.6.21 but not stable. I have only compile-tested but it looks right(tm).

It's harmless since it's a read lock, which can be nested. I actually don't see any need for qdisc_tree_lock at all; all changes and all walking are done under the RTNL, which is why I've removed it in my (upcoming) patches. I suggest leaving it as is for now so I don't need to change the __qdisc_lookup back to qdisc_lookup in 2.6.22.
Re: [PATCH 1/1][PKT_CLS] Avoid multiple tree locks
Patrick McHardy wrote: jamal wrote: Seems to have been around a while. IMO, material for 2.6.21 but not stable. I have only compile-tested but it looks right(tm).

It's harmless since it's a read lock, which can be nested. I actually don't see any need for qdisc_tree_lock at all; all changes and all walking are done under the RTNL, which is why I've removed it in my (upcoming) patches. I suggest leaving it as is for now so I don't need to change the __qdisc_lookup back to qdisc_lookup in 2.6.22.

Alexey just explained to me why we do need qdisc_tree_lock in private mail. While dumping, only the first skb is filled under the RTNL; while filling further skbs we don't hold the RTNL anymore. So I will probably have to drop that patch.
Re: [PATCH 1/1][PKT_CLS] Avoid multiple tree locks
On Wed, 2007-21-03 at 11:10 +0100, Patrick McHardy wrote: It's harmless since it's a read lock, which can be nested. I actually don't see any need for qdisc_tree_lock at all; all changes and all walking are done under the RTNL, which is why I've removed it in my (upcoming) patches. I suggest leaving it as is for now so I don't need to change the __qdisc_lookup back to qdisc_lookup in 2.6.22.

Sounds good to me.

cheers,
jamal
[1/1] netlink: no need to crash if table does not exist.
We would already do that on init. Some things become very confused when nl_table is not used to store netlink sockets.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

23ebdcf1f439cde050a63f33897d5b099fe08c95
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 9b69d9b..071e4d7 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1330,8 +1330,6 @@ netlink_kernel_create(int unit, unsigned int groups,
 	struct netlink_sock *nlk;
 	unsigned long *listeners = NULL;
 
-	BUG_ON(!nl_table);
-
 	if (unit < 0 || unit >= MAX_LINKS)
 		return NULL;

-- 
Evgeniy Polyakov
Re: [1/1] netlink: no need to crash if table does not exist.
Evgeniy Polyakov wrote: We would already do that on init. Some things become very confused when nl_table is not used to store netlink sockets.

It's unnecessary, but I don't understand what the problem is. Why would it be NULL, and what gets confused?
Re: [1/1] netlink: no need to crash if table does not exist.
On Wed, Mar 21, 2007 at 11:54:45AM +0100, Patrick McHardy ([EMAIL PROTECTED]) wrote: It's unnecessary, but I don't understand what the problem is. Why would it be NULL, and what gets confused?

There is no problem as-is, but I am implementing a unified cache for different sockets (currently tcp/udp/raw and netlink are supported) which does not use that table, so I currently wrap all access code into special ifdefs. This one could be wrapped too, but since it is not needed, removing it saves a couple of lines of code.

-- 
Evgeniy Polyakov
Re: [irda-users] [2.6.20-rt8] Neighbour table overflow.
(Short recap for those newly added to cc: netdev: I'm seeing an skb leak in 2.6.20 during an IrDA IrNET+ppp UDP test with periodic connection disruptions)

On Wed, 21 Mar 2007, Guennadi Liakhovetski wrote:
On Tue, 20 Mar 2007, Guennadi Liakhovetski wrote:

Ok, looks like all leaked skbuffs come from ip_append_data(), like this:

(sock_alloc_send_skb+0x2c8/0x2e4)
(ip_append_data+0x7fc/0xa80)
(udp_sendmsg+0x248/0x68c)
(inet_sendmsg+0x60/0x64)
(sock_sendmsg+0xb4/0xe4) r4 = C3CB4960
(sys_sendto+0xc8/0xf0) r4 =
(sys_socketcall+0x168/0x1f0)
(ret_fast_syscall+0x0/0x2c)

This call to sock_alloc_send_skb() in ip_append_data() is not from the inlined ip_ufo_append_data(), it is here:

	/* The last fragment gets additional space at tail.
	 * Note, with MSG_MORE we overallocate on fragments,
	 * because we have no idea what fragment will be
	 * the last.
	 */
	if (datalen == length + fraggap)
		alloclen += rt->u.dst.trailer_len;

	if (transhdrlen) {
		skb = sock_alloc_send_skb(sk,
				alloclen + hh_len + 15,
				(flags & MSG_DONTWAIT), &err);
	} else {

Then, I traced a couple of paths how such a skbuff, coming down from ip_append_data() and allocated above, gets freed (when it does):

[c0182380] (__kfree_skb+0x0/0x170) from [c0182514] (kfree_skb+0x24/0x50)
 r5 = C332BC00 r4 = C332BC00
[c01824f0] (kfree_skb+0x0/0x50) from [bf0fac58] (irlap_update_nr_received+0x94/0xc8 [irda])
[bf0fabc4] (irlap_update_nr_received+0x0/0xc8 [irda]) from [bf0fda98] (irlap_state_nrm_p+0x530/0x7c0 [irda])
 r7 = 0001 r6 = C0367EC0 r5 = C332BC00 r4 =
[bf0fd568] (irlap_state_nrm_p+0x0/0x7c0 [irda]) from [bf0fbd90] (irlap_do_event+0x68/0x18c [irda])
[bf0fbd28] (irlap_do_event+0x0/0x18c [irda]) from [bf1008cc] (irlap_driver_rcv+0x1f0/0xd38 [irda])
[bf1006dc] (irlap_driver_rcv+0x0/0xd38 [irda]) from [c01892c0] (netif_receive_skb+0x244/0x338)
[c018907c] (netif_receive_skb+0x0/0x338) from [c0189468] (process_backlog+0xb4/0x194)
[c01893b4] (process_backlog+0x0/0x194) from [c01895f8] (net_rx_action+0xb0/0x210)
[c0189548] (net_rx_action+0x0/0x210) from [c0042f7c] (ksoftirqd+0x108/0x1cc)
[c0042e74] (ksoftirqd+0x0/0x1cc) from [c0053614] (kthread+0x10c/0x138)
[c0053508] (kthread+0x0/0x138) from [c003f918] (do_exit+0x0/0x8b0)
 r8 = r7 = r6 = r5 = r4 =

and

[c0182380] (__kfree_skb+0x0/0x170) from [c0182514] (kfree_skb+0x24/0x50)
 r5 = C03909E0 r4 = C1A97400
[c01824f0] (kfree_skb+0x0/0x50) from [c0199bf8] (pfifo_fast_enqueue+0xb4/0xd0)
[c0199b44] (pfifo_fast_enqueue+0x0/0xd0) from [c0188c30] (dev_queue_xmit+0x17c/0x25c)
 r8 = C1A2DCE0 r7 = FFF4 r6 = C3393114 r5 = C03909E0 r4 = C3393000
[c0188ab4] (dev_queue_xmit+0x0/0x25c) from [c01a7c18] (ip_output+0x150/0x254)
 r7 = C3717120 r6 = C03909E0 r5 = r4 = C1A2DCE0
[c01a7ac8] (ip_output+0x0/0x254) from [c01a93d0] (ip_push_pending_frames+0x368/0x4d4)
[c01a9068] (ip_push_pending_frames+0x0/0x4d4) from [c01c6954] (udp_push_pending_frames+0x14c/0x310)
[c01c6808] (udp_push_pending_frames+0x0/0x310) from [c01c70d8] (udp_sendmsg+0x5c0/0x690)
[c01c6b18] (udp_sendmsg+0x0/0x690) from [c01ceafc] (inet_sendmsg+0x60/0x64)
[c01cea9c] (inet_sendmsg+0x0/0x64) from [c017c970] (sock_sendmsg+0xb4/0xe4)
 r7 = C2CEFDF4 r6 = 0064 r5 = C2CEFEA8 r4 = C3C94080
[c017c8bc] (sock_sendmsg+0x0/0xe4) from [c017dd9c] (sys_sendto+0xc8/0xf0)
 r7 = 0064 r6 = C3571580 r5 = C2CEFEC4 r4 =
[c017dcd4] (sys_sendto+0x0/0xf0) from [c017e654] (sys_socketcall+0x168/0x1f0)
[c017e4ec] (sys_socketcall+0x0/0x1f0) from [c001ff40] (ret_fast_syscall+0x0/0x2c)
 r5 = 00415344 r4 =

I would be grateful for any hints how I can identify which skbuffs get lost and why, and where and who should free them. I am not subscribed to netdev, please keep me in cc.

Thanks
Guennadi

---
Guennadi Liakhovetski, Ph.D.
DSA Daten- und Systemtechnik GmbH
Pascalstr. 28, D-52076 Aachen, Germany
Re: [irda-users] [2.6.20-rt8] Neighbour table overflow.
On 3/21/2007, Guennadi Liakhovetski [EMAIL PROTECTED] wrote:

(Short recap for those newly added to cc: netdev: I'm seeing an skb leak in 2.6.20 during an IrDA IrNET+ppp UDP test with periodic connection disruptions) Ok, looks like all leaked skbuffs come from ip_append_data(). [...]

[c0182380] (__kfree_skb+0x0/0x170) from [c0182514] (kfree_skb+0x24/0x50)
[c01824f0] (kfree_skb+0x0/0x50) from [bf0fac58] (irlap_update_nr_received+0x94/0xc8 [irda])
[... RX-path trace snipped, quoted in full above ...]

This is the IrDA RX path, so I doubt the corresponding skb ever got through ip_append_data(). The skb was allocated by your HW driver upon packet reception, then queued to the net input queue, and finally passed to the IrDA stack. Are you sure your tracing is correct?

[c0182380] (__kfree_skb+0x0/0x170) from [c0182514] (kfree_skb+0x24/0x50)
[c01824f0] (kfree_skb+0x0/0x50) from [c0199bf8] (pfifo_fast_enqueue+0xb4/0xd0)
[... TX-path trace snipped, quoted in full above ...]

This one is on the TX path, yes. However, it got dropped and freed because your TX queue was full. Any idea in which situation that happens?

I would be grateful for any hints how I can identify which skbuffs get lost and why, and where and who should free them.

You're seeing skb leaks when cutting the ppp connection periodically, right? Do you see such leaks when not cutting the ppp connection? If not, could you send me a kernel trace (with irda debug set to 5) when the ppp connection is shut down? It would narrow down the problem a bit. I'm quite sure the leak is in the IrDA code rather than in the ppp or ipv4 one, hence the need for full irda debug...

Cheers,
Samuel.
Re: [PATCH 5/5] [NETLINK]: Ignore control messages directly in netlink_run_queue()
* Thomas Graf [EMAIL PROTECTED] 2007-03-21 12:45 * Patrick McHardy [EMAIL PROTECTED] 2007-03-21 05:44 This looks like it would break nfnetlink, which appears to be using 0 as smallest message type. It shouldn't do that, the first 16 message types are reserved for control messages. Alright, even though nfnetlink is wrong and buggy we can't break it at this point. Dave, please ignore this last patch for now.
Re: [irda-users] [2.6.20-rt8] Neighbour table overflow.
On Wed, 21 Mar 2007, Samuel Ortiz wrote:

[c0182380] (__kfree_skb+0x0/0x170) from [c0182514] (kfree_skb+0x24/0x50)
[c01824f0] (kfree_skb+0x0/0x50) from [bf0fac58] (irlap_update_nr_received+0x94/0xc8 [irda])
[... RX-path trace snipped, quoted in full above ...]

This is the IrDA RX path, so I doubt the corresponding skb ever got through ip_append_data(). The skb was allocated by your HW driver upon packet reception, then queued to the net input queue, and finally passed to the IrDA stack. Are you sure your tracing is correct?

I've added a bitfield to struct sk_buff:

	__u8	pkt_type:3,
		fclone:2,
-		ipvs_property:1;
+		ipvs_property:1,
+		trace_dbg:1;

and I set it in ip_append_data() before sock_alloc_send_skb() is called. Then I check this bit in __kfree_skb(). The bit is set to 0 in __alloc_skb per

	memset(skb, 0, offsetof(struct sk_buff, truesize));

So, if it was a freshly allocated skb, the tracing should be correct.

[c0182380] (__kfree_skb+0x0/0x170) from [c0182514] (kfree_skb+0x24/0x50)
[c01824f0] (kfree_skb+0x0/0x50) from [c0199bf8] (pfifo_fast_enqueue+0xb4/0xd0)
[... TX-path trace snipped, quoted in full above ...]

This one is on the TX path, yes. However, it got dropped and freed because your TX queue was full. Any idea in which situation that happens?

No. I can only describe what communication is running while ppp is disrupted - it's just some sort of udp mirror test - udp packets are sent one after another and mirrored back.

You're seeing skb leaks when cutting the ppp connection periodically, right?

Right.

Do you see such leaks when not cutting the ppp connection?

Looks like I don't.

If not, could you send me a kernel trace (with irda debug set to 5) when the ppp connection is shut down? It would narrow down the problem a bit.

Attached bzipped... It's a complete log starting from irda up, running udp packets over the link, closing the link and bringing irda completely down.

I'm quite sure the leak is in the IrDA code rather than in the ppp or ipv4 one, hence the need for full irda debug...

Likely, yes. Why I am asking the netdev guys for help is just because I have very little idea about the data flow in the network stack(s). And the more experienced eyes we have on the problem, the sooner we might solve it, I hope...

Thanks
Guennadi

---
Guennadi Liakhovetski, Ph.D.
DSA Daten- und Systemtechnik GmbH
Pascalstr. 28, D-52076 Aachen, Germany

Attachment: mpppdown.bz2 (binary data)
Re: [PATCH 5/5] [NETLINK]: Ignore control messages directly in netlink_run_queue()
Thomas Graf wrote: * Patrick McHardy [EMAIL PROTECTED] 2007-03-21 05:44 This looks like it would break nfnetlink, which appears to be using 0 as smallest message type. It shouldn't do that, the first 16 message types are reserved for control messages.

I'm afraid it does:

	enum cntl_msg_types {
		IPCTNL_MSG_CT_NEW,
		IPCTNL_MSG_CT_GET,
		IPCTNL_MSG_CT_DELETE,
		IPCTNL_MSG_CT_GET_CTRZERO,
		IPCTNL_MSG_MAX
	};

This is totally broken of course since it also uses netlink_ack(), netlink_dump() etc. :( Any smart ideas how to fix this without breaking compatibility?
Re: [PATCH 5/5] [NETLINK]: Ignore control messages directly in netlink_run_queue()
Patrick McHardy wrote: [...] This is totally broken of course since it also uses netlink_ack(), netlink_dump() etc. :( Any smart ideas how to fix this without breaking compatibility?

Seems like we're lucky, nfnetlink encodes the subsystem ID in the upper 8 bits of the message type and uses 1 as the smallest ID:

	/* netfilter netlink message types are split in two pieces:
	 * 8 bit subsystem, 8bit operation.
	 */
	#define NFNL_SUBSYS_ID(x)	((x & 0xff00) >> 8)
	#define NFNL_MSG_TYPE(x)	(x & 0x00ff)

	#define NFNL_SUBSYS_NONE		0
	#define NFNL_SUBSYS_CTNETLINK		1
	#define NFNL_SUBSYS_CTNETLINK_EXP	2
	#define NFNL_SUBSYS_QUEUE		3
	#define NFNL_SUBSYS_ULOG		4
	#define NFNL_SUBSYS_COUNT		5

So this should work fine.
Re: [PATCH 5/5] [NETLINK]: Ignore control messages directly in netlink_run_queue()
* Patrick McHardy [EMAIL PROTECTED] 2007-03-21 13:06 [...] This is totally broken of course since it also uses netlink_ack(), netlink_dump() etc. :( Any smart ideas how to fix this without breaking compatibility?

Hmm... I think nfnetlink isn't even broken:

	/* netfilter netlink message types are split in two pieces:
	 * 8 bit subsystem, 8bit operation.
	 */
	#define NFNL_SUBSYS_ID(x)	((x & 0xff00) >> 8)
	#define NFNL_MSG_TYPE(x)	(x & 0x00ff)

	/* No enum here, otherwise __stringify() trick of
	 * MODULE_ALIAS_NFNL_SUBSYS() won't work anymore */
	#define NFNL_SUBSYS_NONE		0
	#define NFNL_SUBSYS_CTNETLINK		1
	#define NFNL_SUBSYS_CTNETLINK_EXP	2
	#define NFNL_SUBSYS_QUEUE		3
	#define NFNL_SUBSYS_ULOG		4
	#define NFNL_SUBSYS_COUNT		5

A msg_type < 0x10 would just trigger an -EINVAL as no 0x0 subsystem can ever be registered.
Re: [PATCH 5/5] [NETLINK]: Ignore control messages directly in netlink_run_queue()
* Patrick McHardy [EMAIL PROTECTED] 2007-03-21 13:21 Seems like we're lucky, nfnetlink encodes the subsystem ID in the upper 8 bits of the message type and uses 1 as the smallest ID: Alright, you've been quicker :-)
[NETFILTER] nfnetlink: netlink_run_queue() already checks for NLM_F_REQUEST
Patrick has made use of netlink_run_queue() in nfnetlink while my patches have been waiting for net-2.6.22 to open. So this check for NLM_F_REQUEST can go as well.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6.22/net/netfilter/nfnetlink.c
===================================================================
--- net-2.6.22.orig/net/netfilter/nfnetlink.c	2007-03-21 13:27:48.000000000 +0100
+++ net-2.6.22/net/netfilter/nfnetlink.c	2007-03-21 13:28:11.000000000 +0100
@@ -207,10 +207,6 @@ static int nfnetlink_rcv_msg(struct sk_b
 		return -1;
 	}
 
-	/* Only requests are handled by kernel now. */
-	if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
-		return 0;
-
 	/* All the messages must at least contain nfgenmsg */
 	if (nlh->nlmsg_len < NLMSG_SPACE(sizeof(struct nfgenmsg)))
 		return 0;
Re: [PATCH 1/1][PKT_CLS] Avoid multiple tree locks
Patrick McHardy wrote: It's harmless since it's a read lock, which can be nested. I actually don't see any need for qdisc_tree_lock at all; all changes and all walking are done under the RTNL, which is why I've removed it in my (upcoming) patches. I suggest leaving it as is for now so I don't need to change the __qdisc_lookup back to qdisc_lookup in 2.6.22.

Alexey just explained to me why we do need qdisc_tree_lock in private mail. While dumping, only the first skb is filled under the RTNL; while filling further skbs we don't hold the RTNL anymore. So I will probably have to drop that patch.

What we could do is replace the netlink cb_lock spinlock by a user-supplied mutex (supplied to netlink_kernel_create, rtnl_mutex in this case). That would put the entire dump under the rtnl and allow us to get rid of qdisc_tree_lock and avoid the need to take dev_base_lock during qdisc dumping. Same in other spots like rtnl_dump_ifinfo, inet_dump_ifaddr, ... What do you think?
Re: [NETFILTER] nfnetlink: netlink_run_queue() already checks for NLM_F_REQUEST
Thomas Graf wrote: Patrick has made use of netlink_run_queue() in nfnetlink while my patches have been waiting for net-2.6.22 to open. So this check for NLM_F_REQUEST can go as well. Looks good, thanks. I've added it to my queue.
Re: [PATCH 10/12] [IPv6]: Use rtnl registration interface
* YOSHIFUJI Hideaki [EMAIL PROTECTED] 2007-03-21 02:01 In article [EMAIL PROTECTED] (at Wed, 21 Mar 2007 01:06:03 +0100), Thomas Graf [EMAIL PROTECTED] says:

-static int
-inet6_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
+static int nl_addr_del(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 {
 	struct ifaddrmsg *ifm;

I'd rather not favor changing function names here...

I was trying to achieve consistent naming among all message handlers. All these functions are static with a single reference.
Re: [PATCH 1/1][PKT_CLS] Avoid multiple tree locks
Patrick McHardy wrote: Alexey just explained to me why we do need qdisc_tree_lock in private mail. While dumping only the first skb is filled under the RTNL, while filling further skbs we don't hold the RTNL anymore. So I will probably have to drop that patch. What we could do is replace the netlink cb_lock spinlock by a user-supplied mutex (supplied to netlink_kernel_create, rtnl_mutex in this case). That would put the entire dump under the rtnl and allow us to get rid of qdisc_tree_lock and avoid the need to take dev_base_lock during qdisc dumping. Same in other spots like rtnl_dump_ifinfo, inet_dump_ifaddr, ... These (compile tested) patches demonstrate the idea. The first one lets netlink_kernel_create users specify a mutex that should be held during dump callbacks, the second one uses this for rtnetlink and changes inet_dump_ifaddr for demonstration. A complete patch would allow us to simplify locking in lots of spots, all rtnetlink users currently need to implement extra locking just for the dump functions, and a number of them already get it wrong and seem to rely on the rtnl. If there are no objections to this change I'm going to update the second patch to include all rtnetlink users. [NET_SCHED]: cls_basic: fix NULL pointer dereference cls_basic doesn't allocate tp-root before it is linked into the active classifier list, resulting in a NULL pointer dereference when packets hit the classifier before its -change function is called. 
Reported by Chris Madden [EMAIL PROTECTED] Signed-off-by: Patrick McHardy [EMAIL PROTECTED] --- commit f1b9a0694552e18e7a43c292d21abe3b51dfcae2 tree f5ae39c1746fdc1ffbee6c1d90d035ee48ca4904 parent 0a14fe6e5efd0af0f9c6c01e0433445d615d0110 author Patrick McHardy [EMAIL PROTECTED] Tue, 20 Mar 2007 16:08:54 +0100 committer Patrick McHardy [EMAIL PROTECTED] Tue, 20 Mar 2007 16:08:54 +0100 net/sched/cls_basic.c | 16 +++- 1 files changed, 7 insertions(+), 9 deletions(-) diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c index fad08e5..70fe36e 100644 --- a/net/sched/cls_basic.c +++ b/net/sched/cls_basic.c @@ -81,6 +81,13 @@ static void basic_put(struct tcf_proto * static int basic_init(struct tcf_proto *tp) { + struct basic_head *head; + + head = kzalloc(sizeof(*head), GFP_KERNEL); + if (head == NULL) + return -ENOBUFS; + INIT_LIST_HEAD(head-flist); + tp-root = head; return 0; } @@ -176,15 +183,6 @@ static int basic_change(struct tcf_proto } err = -ENOBUFS; - if (head == NULL) { - head = kzalloc(sizeof(*head), GFP_KERNEL); - if (head == NULL) - goto errout; - - INIT_LIST_HEAD(head-flist); - tp-root = head; - } - f = kzalloc(sizeof(*f), GFP_KERNEL); if (f == NULL) goto errout; [NET_SCHED]: Fix ingress locking Ingress queueing uses a seperate lock for serializing enqueue operations, but fails to properly protect itself against concurrent changes to the qdisc tree. Use queue_lock for now since the real fix it quite intrusive. 
Signed-off-by: Patrick McHardy [EMAIL PROTECTED]
---
commit 11985909b582dc688b5a7c0f73f16244224116f4
tree 0ee26bec34053f6c9b5f905ffbc1437881428eeb
parent f1b9a0694552e18e7a43c292d21abe3b51dfcae2
author Patrick McHardy [EMAIL PROTECTED] Tue, 20 Mar 2007 16:11:56 +0100
committer Patrick McHardy [EMAIL PROTECTED] Tue, 20 Mar 2007 16:11:56 +0100

 net/core/dev.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index cf71614..5984b55 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1750,10 +1750,10 @@ static int ing_filter(struct sk_buff *sk
 		skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

-		spin_lock(&dev->ingress_lock);
+		spin_lock(&dev->queue_lock);
 		if ((q = dev->qdisc_ingress) != NULL)
 			result = q->enqueue(skb, q);
-		spin_unlock(&dev->ingress_lock);
+		spin_unlock(&dev->queue_lock);
 	}
Re: [PATCH 1/1][PKT_CLS] Avoid multiple tree locks
Patrick McHardy wrote:
Patrick McHardy wrote: [...] If there are no objections to this change I'm going to update the second patch to include all rtnetlink users.

D'oh .. wrong patches.

[NETLINK]: Put dump callback under mutex, optionally user supplied

Replace the callback spinlock by a mutex and allow users to supply their own mutex to allow getting rid of separate locking in dump callbacks. For users that don't supply their own mutex nothing changes.
Signed-off-by: Patrick McHardy [EMAIL PROTECTED] --- commit c3400c45267a1fd291da75b0fe4b7970c846ff50 tree 96a4dc6050d74e72b4fffe9c047a0e695085e6db parent 2c31e4429748f2629c59379b1113931a13a0cca9 author Patrick McHardy [EMAIL PROTECTED] Wed, 21 Mar 2007 14:43:02 +0100 committer Patrick McHardy [EMAIL PROTECTED] Wed, 21 Mar 2007 14:43:02 +0100 drivers/connector/connector.c |2 +- drivers/scsi/scsi_netlink.c |3 ++- drivers/scsi/scsi_transport_iscsi.c |2 +- fs/ecryptfs/netlink.c |2 +- include/linux/netlink.h |5 - lib/kobject_uevent.c|2 +- net/bridge/netfilter/ebt_ulog.c |2 +- net/core/rtnetlink.c|2 +- net/decnet/netfilter/dn_rtmsg.c |2 +- net/ipv4/fib_frontend.c |2 +- net/ipv4/inet_diag.c|2 +- net/ipv4/netfilter/ip_queue.c |2 +- net/ipv4/netfilter/ipt_ULOG.c |2 +- net/ipv6/netfilter/ip6_queue.c |2 +- net/netfilter/nfnetlink.c |2 +- net/netlink/af_netlink.c| 30 +++--- net/netlink/genetlink.c |2 +- net/xfrm/xfrm_user.c|2 +- security/selinux/netlink.c |2 +- 19 files changed, 41 insertions(+), 29 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 7f9c4fb..a7b9e9b 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -448,7 +448,7 @@ static int __devinit cn_init(void) dev-nls = netlink_kernel_create(NETLINK_CONNECTOR, CN_NETLINK_USERS + 0xf, - dev-input, THIS_MODULE); + dev-input, NULL, THIS_MODULE); if (!dev-nls) return -EIO; diff --git a/drivers/scsi/scsi_netlink.c b/drivers/scsi/scsi_netlink.c index 45646a2..4bf9aa5 100644 --- a/drivers/scsi/scsi_netlink.c +++ b/drivers/scsi/scsi_netlink.c @@ -168,7 +168,8 @@ scsi_netlink_init(void) } scsi_nl_sock = netlink_kernel_create(NETLINK_SCSITRANSPORT, -SCSI_NL_GRP_CNT, scsi_nl_rcv, THIS_MODULE); +SCSI_NL_GRP_CNT, scsi_nl_rcv, NULL, +THIS_MODULE); if (!scsi_nl_sock) { printk(KERN_ERR %s: register of recieve handler failed\n, __FUNCTION__); diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index 10590cd..aabaa05 100644 --- 
a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -1435,7 +1435,7 @@ static __init int iscsi_transport_init(void) if (err) goto unregister_conn_class; - nls = netlink_kernel_create(NETLINK_ISCSI, 1, iscsi_if_rx, + nls = netlink_kernel_create(NETLINK_ISCSI, 1, iscsi_if_rx, NULL, THIS_MODULE); if (!nls) { err = -ENOBUFS; diff --git a/fs/ecryptfs/netlink.c b/fs/ecryptfs/netlink.c index 8405d21..fe91863 100644 --- a/fs/ecryptfs/netlink.c +++ b/fs/ecryptfs/netlink.c @@ -229,7 +229,7 @@ int ecryptfs_init_netlink(void) ecryptfs_nl_sock = netlink_kernel_create(NETLINK_ECRYPTFS, 0, ecryptfs_receive_nl_message, - THIS_MODULE); + NULL, THIS_MODULE); if (!ecryptfs_nl_sock) { rc = -EIO; ecryptfs_printk(KERN_ERR, Failed to create netlink socket\n); diff --git a/include/linux/netlink.h b/include/linux/netlink.h index 0d11f6a..f41688f 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -157,7 +157,10 @@ struct netlink_skb_parms #define NETLINK_CREDS(skb)
[PATCH 2/5] netem: use better types for time values
The random number generator always generates 32 bit values. The time values are limited by psched_tdiff_t.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/sched/sch_netem.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

--- net-2.6.22.orig/net/sched/sch_netem.c
+++ net-2.6.22/net/sched/sch_netem.c
@@ -56,19 +56,20 @@ struct netem_sched_data {
 	struct Qdisc	*qdisc;
 	struct qdisc_watchdog watchdog;

-	u32 latency;
+	psched_tdiff_t latency;
+	psched_tdiff_t jitter;
+
 	u32 loss;
 	u32 limit;
 	u32 counter;
 	u32 gap;
-	u32 jitter;
 	u32 duplicate;
 	u32 reorder;
 	u32 corrupt;

 	struct crndstate {
-		unsigned long last;
-		unsigned long rho;
+		u32 last;
+		u32 rho;
 	} delay_cor, loss_cor, dup_cor, reorder_cor, corrupt_cor;

 	struct disttable {
@@ -95,7 +96,7 @@ static void init_crandom(struct crndstat
  * Next number depends on last value.
  * rho is scaled to avoid floating point.
  */
-static unsigned long get_crandom(struct crndstate *state)
+static u32 get_crandom(struct crndstate *state)
 {
 	u64 value, rho;
 	unsigned long answer;
@@ -114,11 +115,13 @@ static unsigned long get_crandom(struct
  * std deviation sigma.  Uses table lookup to approximate the desired
  * distribution, and a uniformly-distributed pseudo-random source.
  */
-static long tabledist(unsigned long mu, long sigma,
-		      struct crndstate *state, const struct disttable *dist)
-{
-	long t, x;
-	unsigned long rnd;
+static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
+				struct crndstate *state,
+				const struct disttable *dist)
+{
+	psched_tdiff_t x;
+	long t;
+	u32 rnd;

 	if (sigma == 0)
 		return mu;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] netem performance improvements
The following patches for the 2.6.22 net tree increase the performance of netem by about 2x. With 2.6.20 we were getting about 100K (out of a possible 300K) packets per second; after these patches we are now at over 200K pps.
[PATCH 5/5] qdisc: avoid transmit softirq on watchdog wakeup
If possible, avoid having to do a transmit softirq when a qdisc watchdog decides to re-enable. The watchdog routine runs off a timer, so it is already in the same effective context as the softirq.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/sched/sch_api.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- net-2.6.22.orig/net/sched/sch_api.c
+++ net-2.6.22/net/sched/sch_api.c
@@ -296,10 +296,16 @@ static enum hrtimer_restart qdisc_watchd
 {
 	struct qdisc_watchdog *wd = container_of(timer, struct qdisc_watchdog,
 						 timer);
+	struct net_device *dev = wd->qdisc->dev;

 	wd->qdisc->flags &= ~TCQ_F_THROTTLED;
 	smp_wmb();
-	netif_schedule(wd->qdisc->dev);
+	if (spin_trylock(&dev->queue_lock)) {
+		qdisc_run(dev);
+		spin_unlock(&dev->queue_lock);
+	} else
+		netif_schedule(dev);
+
 	return HRTIMER_NORESTART;
 }
--
[PATCH 3/5] netem: optimize tfifo
In most cases, the next packet will be sent after the last one. So optimize that case.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/sched/sch_netem.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

--- net-2.6.22.orig/net/sched/sch_netem.c
+++ net-2.6.22/net/sched/sch_netem.c
@@ -478,22 +478,28 @@ static int netem_change(struct Qdisc *sc
 */
 struct fifo_sched_data {
 	u32 limit;
+	psched_time_t oldest;
 };

 static int tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
 	struct fifo_sched_data *q = qdisc_priv(sch);
 	struct sk_buff_head *list = &sch->q;
-	const struct netem_skb_cb *ncb
-		= (const struct netem_skb_cb *)nskb->cb;
+	psched_time_t tnext = ((struct netem_skb_cb *)nskb->cb)->time_to_send;
 	struct sk_buff *skb;

 	if (likely(skb_queue_len(list) < q->limit)) {
+		/* Optimize for add at tail */
+		if (likely(skb_queue_empty(list) || !PSCHED_TLESS(tnext, q->oldest))) {
+			q->oldest = tnext;
+			return qdisc_enqueue_tail(nskb, sch);
+		}
+
 		skb_queue_reverse_walk(list, skb) {
 			const struct netem_skb_cb *cb
 				= (const struct netem_skb_cb *)skb->cb;

-			if (!PSCHED_TLESS(ncb->time_to_send, cb->time_to_send))
+			if (!PSCHED_TLESS(tnext, cb->time_to_send))
 				break;
 		}

@@ -506,7 +512,7 @@ static int tfifo_enqueue(struct sk_buff
 		return NET_XMIT_SUCCESS;
 	}

-	return qdisc_drop(nskb, sch);
+	return qdisc_reshape_fail(nskb, sch);
 }

 static int tfifo_init(struct Qdisc *sch, struct rtattr *opt)
@@ -522,6 +528,7 @@ static int tfifo_init(struct Qdisc *sch,
 	} else
 		q->limit = max_t(u32, sch->dev->tx_queue_len, 1);

+	PSCHED_SET_PASTPERFECT(q->oldest);
 	return 0;
 }
--
[PATCH 1/5] netem: report reorder percent correctly.
If you set up netem to just delay packets, tc qdisc ls will report the reordering as 100%. Well, it's a lie: reorder isn't used unless gap is set, so just set the value to 0 so the output of the utility is correct.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/sched/sch_netem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- net-2.6.22.orig/net/sched/sch_netem.c
+++ net-2.6.22/net/sched/sch_netem.c
@@ -428,7 +428,8 @@ static int netem_change(struct Qdisc *sc
 		/* for compatiablity with earlier versions.
 		 * if gap is set, need to assume 100% probablity
 		 */
-		q->reorder = ~0;
+		if (q->gap)
+			q->reorder = ~0;

 	/* Handle nested options after initial queue options.
 	 * Should have put all options in nested format but too late now.
--
[PATCH 4/5] netem: avoid excessive requeues
The netem code would call getnstimeofday() and dequeue/requeue after every packet, even if it was waiting. Avoid this overhead by using the throttled flag.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/sched/sch_api.c   |  3 +++
 net/sched/sch_netem.c | 21 ++++++++++-----------
 2 files changed, 15 insertions(+), 9 deletions(-)

--- net-2.6.22.orig/net/sched/sch_api.c
+++ net-2.6.22/net/sched/sch_api.c
@@ -298,6 +298,7 @@ static enum hrtimer_restart qdisc_watchd
 					 timer);

 	wd->qdisc->flags &= ~TCQ_F_THROTTLED;
+	smp_wmb();
 	netif_schedule(wd->qdisc->dev);
 	return HRTIMER_NORESTART;
 }
@@ -315,6 +316,7 @@ void qdisc_watchdog_schedule(struct qdis
 	ktime_t time;

 	wd->qdisc->flags |= TCQ_F_THROTTLED;
+	smp_wmb();
 	time = ktime_set(0, 0);
 	time = ktime_add_ns(time, PSCHED_US2NS(expires));
 	hrtimer_start(&wd->timer, time, HRTIMER_MODE_ABS);
@@ -325,6 +327,7 @@ void qdisc_watchdog_cancel(struct qdisc_
 {
 	hrtimer_cancel(&wd->timer);
 	wd->qdisc->flags &= ~TCQ_F_THROTTLED;
+	smp_wmb();
 }
 EXPORT_SYMBOL(qdisc_watchdog_cancel);

--- net-2.6.22.orig/net/sched/sch_netem.c
+++ net-2.6.22/net/sched/sch_netem.c
@@ -272,6 +272,10 @@ static struct sk_buff *netem_dequeue(str
 	struct netem_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;

+	smp_mb();
+	if (sch->flags & TCQ_F_THROTTLED)
+		return NULL;
+
 	skb = q->qdisc->dequeue(q->qdisc);
 	if (skb) {
 		const struct netem_skb_cb *cb
@@ -284,18 +288,17 @@ static struct sk_buff *netem_dequeue(str
 		if (PSCHED_TLESS(cb->time_to_send, now)) {
 			pr_debug("netem_dequeue: return skb=%p\n", skb);
 			sch->q.qlen--;
-			sch->flags &= ~TCQ_F_THROTTLED;
 			return skb;
-		} else {
-			qdisc_watchdog_schedule(&q->watchdog, cb->time_to_send);
+		}

-			if (q->qdisc->ops->requeue(skb, q->qdisc) != NET_XMIT_SUCCESS) {
-				qdisc_tree_decrease_qlen(q->qdisc, 1);
-				sch->qstats.drops++;
-				printk(KERN_ERR "netem: queue discpline %s could not requeue\n",
-				       q->qdisc->ops->id);
-			}
+		if (unlikely(q->qdisc->ops->requeue(skb, q->qdisc) != NET_XMIT_SUCCESS)) {
+			qdisc_tree_decrease_qlen(q->qdisc, 1);
+			sch->qstats.drops++;
+			printk(KERN_ERR "netem: %s could not requeue\n",
+			       q->qdisc->ops->id);
 		}
+
+		qdisc_watchdog_schedule(&q->watchdog, cb->time_to_send);
 	}

 	return NULL;
--
iproute2-2.6.20-070313 bug ?
Possibly I discovered a bug, but maybe it is specific to my setup. In your sources (tc/tc_core.h) I notice #define TIME_UNITS_PER_SEC 1000000. When I change it to #define TIME_UNITS_PER_SEC 1000000.0 (the value it had before in the sources), everything works fine. Otherwise tbf is not working at all; it drops all packets. Did anyone test the new iproute2 with tbf? -- Virtual ISP S.A.L.
[RESEND 0/4] was: [PATCH 0/3] myri10ge updates for 2.6.21
Brice Goglin wrote: Hi Jeff, Here are 3 minor updates for myri10ge in 2.6.21:
1. use regular firmware on Serverworks HT2100
2. update wcfifo and intr_coal_delay default values
3. update driver version to 1.3.0-1.225
Please apply. Thanks, Brice

I just got a last minute fix (management of allocated pages was wrong on architectures with page size != 4kB). Please drop this series, I am going to resend all the patches. Thanks, Brice
[PATCH 1/4] myri10ge: Serverworks HT2100 provides aligned PCIe completion
Use the regular firmware on Serverworks HT2100 PCIe ports since this chipset provides aligned PCIe completion.

Signed-off-by: Brice Goglin [EMAIL PROTECTED]
---
 drivers/net/myri10ge/myri10ge.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-rc/drivers/net/myri10ge/myri10ge.c
===
--- linux-rc.orig/drivers/net/myri10ge/myri10ge.c	2007-03-18 21:01:42.0 +0100
+++ linux-rc/drivers/net/myri10ge/myri10ge.c	2007-03-18 21:14:12.0 +0100
@@ -2483,6 +2483,8 @@
 #define PCI_DEVICE_ID_INTEL_E5000_PCIE23 0x25f7
 #define PCI_DEVICE_ID_INTEL_E5000_PCIE47 0x25fa
+#define PCI_DEVICE_ID_SERVERWORKS_HT2100_PCIE_FIRST 0x140
+#define PCI_DEVICE_ID_SERVERWORKS_HT2100_PCIE_LAST 0x142

 static void myri10ge_select_firmware(struct myri10ge_priv *mgp)
 {
@@ -2514,6 +2516,12 @@
 	    ((bridge->vendor == PCI_VENDOR_ID_SERVERWORKS
	      && bridge->device == PCI_DEVICE_ID_SERVERWORKS_HT2000_PCIE)
+	     /* ServerWorks HT2100 */
+	     || (bridge->vendor == PCI_VENDOR_ID_SERVERWORKS
+		 && bridge->device >=
+		    PCI_DEVICE_ID_SERVERWORKS_HT2100_PCIE_FIRST
+		 && bridge->device <=
+		    PCI_DEVICE_ID_SERVERWORKS_HT2100_PCIE_LAST)
 	     /* All Intel E5000 PCIE ports */
 	     || (bridge->vendor == PCI_VENDOR_ID_INTEL
		 && bridge->device >=
[PATCH 2/4] myri10ge: update wcfifo and intr_coal_delay default values
Update the default value of 2 module parameters:
 * wcfifo disabled
 * intr_coal_delay 75us

Signed-off-by: Brice Goglin [EMAIL PROTECTED]
---
 drivers/net/myri10ge/myri10ge.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-rc/drivers/net/myri10ge/myri10ge.c
===
--- linux-rc.orig/drivers/net/myri10ge/myri10ge.c	2007-03-18 21:14:12.0 +0100
+++ linux-rc/drivers/net/myri10ge/myri10ge.c	2007-03-18 21:14:21.0 +0100
@@ -234,7 +234,7 @@
 module_param(myri10ge_msi, int, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(myri10ge_msi, "Enable Message Signalled Interrupts\n");

-static int myri10ge_intr_coal_delay = 25;
+static int myri10ge_intr_coal_delay = 75;
 module_param(myri10ge_intr_coal_delay, int, S_IRUGO);
 MODULE_PARM_DESC(myri10ge_intr_coal_delay, "Interrupt coalescing delay\n");

@@ -279,7 +279,7 @@
 module_param(myri10ge_fill_thresh, int, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(myri10ge_fill_thresh, "Number of empty rx slots allowed\n");

-static int myri10ge_wcfifo = 1;
+static int myri10ge_wcfifo = 0;
 module_param(myri10ge_wcfifo, int, S_IRUGO);
 MODULE_PARM_DESC(myri10ge_wcfifo, "Enable WC Fifo when WC is enabled\n");
[PATCH 4/4] myri10ge: update driver version to 1.3.0-1.226
Driver version is now 1.3.0-1.226.

Signed-off-by: Brice Goglin [EMAIL PROTECTED]
---
 drivers/net/myri10ge/myri10ge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-rc/drivers/net/myri10ge/myri10ge.c
===
--- linux-rc.orig/drivers/net/myri10ge/myri10ge.c	2007-03-18 21:14:21.0 +0100
+++ linux-rc/drivers/net/myri10ge/myri10ge.c	2007-03-18 21:14:23.0 +0100
@@ -71,7 +71,7 @@
 #include "myri10ge_mcp.h"
 #include "myri10ge_mcp_gen_header.h"

-#define MYRI10GE_VERSION_STR "1.2.0"
+#define MYRI10GE_VERSION_STR "1.3.0-1.226"

 MODULE_DESCRIPTION("Myricom 10G driver (10GbE)");
 MODULE_AUTHOR("Maintainer: [EMAIL PROTECTED]");
[PATCH] netxen: enum and #define cleanups
This patch cleans up some rather generically named items in the netxen driver. It seems bad to use names like USER_START and FLASH_TOTAL_SIZE, so I added a NETXEN_ to the front of them. This has been compile tested. Signed-off-by: Andy Gospodarek [EMAIL PROTECTED] --- netxen_nic.h | 51 ++- netxen_nic_ethtool.c |8 netxen_nic_hw.c | 10 +- netxen_nic_init.c| 23 --- 4 files changed, 47 insertions(+), 45 deletions(-) diff --git a/drivers/net/netxen/netxen_nic.h b/drivers/net/netxen/netxen_nic.h index dd8ce35..8310584 100644 --- a/drivers/net/netxen/netxen_nic.h +++ b/drivers/net/netxen/netxen_nic.h @@ -65,12 +65,13 @@ #define _NETXEN_NIC_LINUX_MAJOR 3 #define _NETXEN_NIC_LINUX_MINOR 3 -#define _NETXEN_NIC_LINUX_SUBVERSION 3 -#define NETXEN_NIC_LINUX_VERSIONID 3.3.3 +#define _NETXEN_NIC_LINUX_SUBVERSION 4 +#define NETXEN_NIC_LINUX_VERSIONID 3.3.4 -#define NUM_FLASH_SECTORS (64) -#define FLASH_SECTOR_SIZE (64 * 1024) -#define FLASH_TOTAL_SIZE (NUM_FLASH_SECTORS * FLASH_SECTOR_SIZE) +#define NETXEN_NUM_FLASH_SECTORS (64) +#define NETXEN_FLASH_SECTOR_SIZE (64 * 1024) +#define NETXEN_FLASH_TOTAL_SIZE (NETXEN_NUM_FLASH_SECTORS \ + * NETXEN_FLASH_SECTOR_SIZE) #define PHAN_VENDOR_ID 0x4040 @@ -671,28 +672,28 @@ struct netxen_new_user_info { /* Flash memory map */ typedef enum { - CRBINIT_START = 0, /* Crbinit section */ - BRDCFG_START = 0x4000, /* board config */ - INITCODE_START = 0x6000,/* pegtune code */ - BOOTLD_START = 0x1, /* bootld */ - IMAGE_START = 0x43000, /* compressed image */ - SECONDARY_START = 0x20, /* backup images */ - PXE_START = 0x3E, /* user defined region */ - USER_START = 0x3E8000, /* User defined region for new boards */ - FIXED_START = 0x3F /* backup of crbinit */ + NETXEN_CRBINIT_START = 0, /* Crbinit section */ + NETXEN_BRDCFG_START = 0x4000, /* board config */ + NETXEN_INITCODE_START = 0x6000, /* pegtune code */ + NETXEN_BOOTLD_START = 0x1, /* bootld */ + NETXEN_IMAGE_START = 0x43000, /* compressed image */ + NETXEN_SECONDARY_START = 0x20, /* backup 
images */ + NETXEN_PXE_START = 0x3E,/* user defined region */ + NETXEN_USER_START = 0x3E8000, /* User defined region for new boards */ + NETXEN_FIXED_START = 0x3F /* backup of crbinit */ } netxen_flash_map_t; -#define USER_START_OLD PXE_START /* for backward compatibility */ - -#define FLASH_START(CRBINIT_START) -#define INIT_SECTOR(0) -#define PRIMARY_START (BOOTLD_START) -#define FLASH_CRBINIT_SIZE (0x4000) -#define FLASH_BRDCFG_SIZE (sizeof(struct netxen_board_info)) -#define FLASH_USER_SIZE(sizeof(struct netxen_user_info)/sizeof(u32)) -#define FLASH_SECONDARY_SIZE (USER_START-SECONDARY_START) -#define NUM_PRIMARY_SECTORS(0x20) -#define NUM_CONFIG_SECTORS (1) +#define NETXEN_USER_START_OLD NETXEN_PXE_START /* for backward compatibility */ + +#define NETXEN_FLASH_START (NETXEN_CRBINIT_START) +#define NETXEN_INIT_SECTOR (0) +#define NETXEN_PRIMARY_START (NETXEN_BOOTLD_START) +#define NETXEN_FLASH_CRBINIT_SIZE (0x4000) +#define NETXEN_FLASH_BRDCFG_SIZE (sizeof(struct netxen_board_info)) +#define NETXEN_FLASH_USER_SIZE (sizeof(struct netxen_user_info)/sizeof(u32)) +#define NETXEN_FLASH_SECONDARY_SIZE (NETXEN_USER_START-NETXEN_SECONDARY_START) +#define NETXEN_NUM_PRIMARY_SECTORS (0x20) +#define NETXEN_NUM_CONFIG_SECTORS (1) #define PFX NetXen: extern char netxen_nic_driver_name[]; diff --git a/drivers/net/netxen/netxen_nic_ethtool.c b/drivers/net/netxen/netxen_nic_ethtool.c index ee1b5a2..4dfa76b 100644 --- a/drivers/net/netxen/netxen_nic_ethtool.c +++ b/drivers/net/netxen/netxen_nic_ethtool.c @@ -94,7 +94,7 @@ static const char netxen_nic_gstrings_test[][ETH_GSTRING_LEN] = { static int netxen_nic_get_eeprom_len(struct net_device *dev) { - return FLASH_TOTAL_SIZE; + return NETXEN_FLASH_TOTAL_SIZE; } static void @@ -475,7 +475,7 @@ netxen_nic_set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom, return 0; } - if (offset == BOOTLD_START) { + if (offset == NETXEN_BOOTLD_START) { ret = netxen_flash_erase_primary(adapter); if (ret != FLASH_SUCCESS) { 
printk(KERN_ERR %s: Flash erase failed.\n, @@ -483,10 +483,10 @@ netxen_nic_set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom, return ret; } - ret = netxen_rom_se(adapter, USER_START); + ret = netxen_rom_se(adapter, NETXEN_USER_START); if (ret != FLASH_SUCCESS)
Re: many sockets, slow sendto
Eric Dumazet a écrit : Currently, udp_hash[UDP_HTABLE_SIZE] is using a hash function based on dport number only. In your case, as you use a single port value, all sockets are in a single slot of this hash table : To find the good socket, __udp4_lib_lookup() has to search in a list with thousands of elements. Not that good, isnt it ? :( In case you want to try, here is a patch that could help you :) [PATCH] INET : IPV4 UDP lookups converted to a 2 pass algo Some people want to have many UDP sockets, binded to a single port but many different addresses. We currently hash all those sockets into a single chain. Processing of incoming packets is very expensive, because the whole chain must be examined to find the best match. I chose in this patch to hash UDP sockets with a hash function that take into account both their port number and address : This has a drawback because we need two lookups : one with a given address, one with a wildcard (null) address. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 71b0b60..27437e7 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -114,14 +114,33 @@ DEFINE_RWLOCK(udp_hash_lock); static int udp_port_rover; -static inline int __udp_lib_lport_inuse(__u16 num, struct hlist_head udptable[]) +/* + * Note about this hash function : + * Typical use is probably daddr = 0, only dport is going to vary hash + */ +static inline unsigned int hash_port_and_addr(__u16 port, __be32 addr) +{ + addr ^= addr 16; + addr ^= addr 8; + return port ^ addr; +} + +static inline int __udp_lib_port_inuse(unsigned int hash, int port, + __be32 daddr, struct hlist_head udptable[]) { struct sock *sk; struct hlist_node *node; + struct inet_sock *inet; - sk_for_each(sk, node, udptable[num (UDP_HTABLE_SIZE - 1)]) - if (sk-sk_hash == num) + sk_for_each(sk, node, udptable[hash (UDP_HTABLE_SIZE - 1)]) { + if (sk-sk_hash != hash) + continue; + inet = inet_sk(sk); + if (inet-num != port) + continue; + if (inet-rcv_saddr == 
daddr) return 1; + } return 0; } @@ -142,6 +161,7 @@ int __udp_lib_get_port(struct sock *sk, struct hlist_node *node; struct hlist_head *head; struct sock *sk2; + unsigned int hash; interror = 1; write_lock_bh(udp_hash_lock); @@ -156,7 +176,9 @@ int __udp_lib_get_port(struct sock *sk, for (i = 0; i UDP_HTABLE_SIZE; i++, result++) { int size; - head = udptable[result (UDP_HTABLE_SIZE - 1)]; + hash = hash_port_and_addr(result, + inet_sk(sk)-rcv_saddr); + head = udptable[hash (UDP_HTABLE_SIZE - 1)]; if (hlist_empty(head)) { if (result sysctl_local_port_range[1]) result = sysctl_local_port_range[0] + @@ -181,7 +203,10 @@ int __udp_lib_get_port(struct sock *sk, result = sysctl_local_port_range[0] + ((result - sysctl_local_port_range[0]) (UDP_HTABLE_SIZE - 1)); - if (! __udp_lib_lport_inuse(result, udptable)) + hash = hash_port_and_addr(result, + inet_sk(sk)-rcv_saddr); + if (! __udp_lib_port_inuse(hash, result, + inet_sk(sk)-rcv_saddr, udptable)) break; } if (i = (1 16) / UDP_HTABLE_SIZE) @@ -189,11 +214,13 @@ int __udp_lib_get_port(struct sock *sk, gotit: *port_rover = snum = result; } else { - head = udptable[snum (UDP_HTABLE_SIZE - 1)]; + hash = hash_port_and_addr(snum, inet_sk(sk)-rcv_saddr); + head = udptable[hash (UDP_HTABLE_SIZE - 1)]; sk_for_each(sk2, node, head) - if (sk2-sk_hash == snum + if (sk2-sk_hash == hash sk2 != sk + inet_sk(sk2)-num == snum (!sk2-sk_reuse|| !sk-sk_reuse) (!sk2-sk_bound_dev_if || !sk-sk_bound_dev_if || sk2-sk_bound_dev_if == sk-sk_bound_dev_if) @@ -201,9 +228,9 @@ gotit: goto fail; } inet_sk(sk)-num = snum; - sk-sk_hash = snum; + sk-sk_hash = hash; if (sk_unhashed(sk)) { - head =
Re: many sockets, slow sendto
Zacco wrote:
Actually, the source address would be more important in my case, as my clients (each with a different IP address) want to connect to the same server, i.e. to the same address and port.

I don't understand why you need many sockets then. A single socket should be enough.

I think the current design is fair enough for server implementations and for regular clients. But even though my application is not typical, as far as I know (and it can become important with the fast performance growth of regular PCs), the make-up should be general enough to cope with special circumstances like mine. My initial idea was to somehow include the complete socket pair, i.e. source address:port and destination address:port, keeping in mind that it should work for both IPv4 and IPv6. Maybe it's overkill, I don't know.

Could you send me a copy of your application source, or detailed specs, because I am confused right now...
[PATCH 0/3] [PATCHSET] netlink error management
This series of patches simplifies the error management of netlink_run_queue() message handlers and the way they signal that a dump has started. It touches a fair bit of nfnetlink code, as the error pointer has been passed on to subsystems.
[PATCH 1/3] [NETLINK]: Remove error pointer from netlink message handler
The error pointer argument in netlink message handlers is used to signal the special case where processing has to be interrupted because a dump was started but no error happened. Instead it is simpler and more clear to return -EINTR and have netlink_run_queue() deal with getting the queue right. nfnetlink passed on this error pointer to its subsystem handlers but only uses it to signal the start of a netlink dump. Therefore it can be removed there as well. This patch also cleans up the error handling in the affected message handlers to be consistent since it had to be touched anyway. Signed-off-by: Thomas Graf [EMAIL PROTECTED] Index: net-2.6.22/net/core/rtnetlink.c === --- net-2.6.22.orig/net/core/rtnetlink.c2007-03-21 15:36:28.0 +0100 +++ net-2.6.22/net/core/rtnetlink.c 2007-03-21 18:38:32.0 +0100 @@ -851,8 +851,7 @@ static int rtattr_max; /* Process one rtnetlink message. */ -static __inline__ int -rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, int *errp) +static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) { rtnl_doit_func doit; int sz_idx, kind; @@ -862,10 +861,8 @@ rtnetlink_rcv_msg(struct sk_buff *skb, s int err; type = nlh-nlmsg_type; - - /* Unknown message: reply with EINVAL */ if (type RTM_MAX) - goto err_inval; + return -EINVAL; type -= RTM_BASE; @@ -874,40 +871,33 @@ rtnetlink_rcv_msg(struct sk_buff *skb, s return 0; family = ((struct rtgenmsg*)NLMSG_DATA(nlh))-rtgen_family; - if (family = NPROTO) { - *errp = -EAFNOSUPPORT; - return -1; - } + if (family = NPROTO) + return -EAFNOSUPPORT; sz_idx = type2; kind = type3; - if (kind != 2 security_netlink_recv(skb, CAP_NET_ADMIN)) { - *errp = -EPERM; - return -1; - } + if (kind != 2 security_netlink_recv(skb, CAP_NET_ADMIN)) + return -EPERM; if (kind == 2 nlh-nlmsg_flagsNLM_F_DUMP) { rtnl_dumpit_func dumpit; dumpit = rtnl_get_dumpit(family, type); if (dumpit == NULL) - goto err_inval; + return -EINVAL; - if ((*errp = netlink_dump_start(rtnl, skb, nlh, - dumpit, NULL)) != 0) { 
- return -1; - } - - netlink_queue_skip(nlh, skb); - return -1; + err = netlink_dump_start(rtnl, skb, nlh, dumpit, NULL); + if (err == 0) + err = -EINTR; + return err; } memset(rta_buf, 0, (rtattr_max * sizeof(struct rtattr *))); min_len = rtm_min[sz_idx]; if (nlh-nlmsg_len min_len) - goto err_inval; + return -EINVAL; if (nlh-nlmsg_len min_len) { int attrlen = nlh-nlmsg_len - NLMSG_ALIGN(min_len); @@ -917,7 +907,7 @@ rtnetlink_rcv_msg(struct sk_buff *skb, s unsigned flavor = attr-rta_type; if (flavor) { if (flavor rta_max[sz_idx]) - goto err_inval; + return -EINVAL; rta_buf[flavor-1] = attr; } attr = RTA_NEXT(attr, attrlen); @@ -926,15 +916,9 @@ rtnetlink_rcv_msg(struct sk_buff *skb, s doit = rtnl_get_doit(family, type); if (doit == NULL) - goto err_inval; - err = doit(skb, nlh, (void *)rta_buf[0]); - - *errp = err; - return err; + return -EINVAL; -err_inval: - *errp = -EINVAL; - return -1; + return doit(skb, nlh, (void *)rta_buf[0]); } static void rtnetlink_rcv(struct sock *sk, int len) Index: net-2.6.22/net/netlink/genetlink.c === --- net-2.6.22.orig/net/netlink/genetlink.c 2007-03-21 15:42:18.0 +0100 +++ net-2.6.22/net/netlink/genetlink.c 2007-03-21 18:38:32.0 +0100 @@ -295,60 +295,49 @@ int genl_unregister_family(struct genl_f return -ENOENT; } -static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, - int *errp) +static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) { struct genl_ops *ops; struct genl_family *family; struct genl_info info; struct genlmsghdr *hdr = nlmsg_data(nlh); - int hdrlen, err = -EINVAL; + int hdrlen, err; family = genl_family_find_byid(nlh-nlmsg_type); - if (family == NULL) { - err = -ENOENT; - goto errout; - } + if
[PATCH 2/3] [IPv4] diag: Use netlink_run_queue() to process the receive queue
Makes use of netlink_run_queue() to process the receive queue and converts inet_diag_rcv_msg() to use the type safe netlink interface. Signed-off-by: Thomas Graf [EMAIL PROTECTED] Index: net-2.6.22/net/ipv4/inet_diag.c === --- net-2.6.22.orig/net/ipv4/inet_diag.c2007-03-21 18:40:29.0 +0100 +++ net-2.6.22/net/ipv4/inet_diag.c 2007-03-22 00:08:05.0 +0100 @@ -806,68 +806,48 @@ done: return skb-len; } -static inline int inet_diag_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) +static int inet_diag_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) { - if (!(nlh-nlmsg_flagsNLM_F_REQUEST)) - return 0; + int hdrlen = sizeof(struct inet_diag_req); - if (nlh-nlmsg_type = INET_DIAG_GETSOCK_MAX) - goto err_inval; + if (nlh-nlmsg_type = INET_DIAG_GETSOCK_MAX || + nlmsg_len(nlh) hdrlen) + return -EINVAL; if (inet_diag_table[nlh-nlmsg_type] == NULL) return -ENOENT; - if (NLMSG_LENGTH(sizeof(struct inet_diag_req)) skb-len) - goto err_inval; - - if (nlh-nlmsg_flagsNLM_F_DUMP) { - if (nlh-nlmsg_len - (4 + NLMSG_SPACE(sizeof(struct inet_diag_req { - struct rtattr *rta = (void *)(NLMSG_DATA(nlh) + -sizeof(struct inet_diag_req)); - if (rta-rta_type != INET_DIAG_REQ_BYTECODE || - rta-rta_len 8 || - rta-rta_len - (nlh-nlmsg_len - -NLMSG_SPACE(sizeof(struct inet_diag_req - goto err_inval; - if (inet_diag_bc_audit(RTA_DATA(rta), RTA_PAYLOAD(rta))) - goto err_inval; - } - return netlink_dump_start(idiagnl, skb, nlh, - inet_diag_dump, NULL); - } else - return inet_diag_get_exact(skb, nlh); - -err_inval: - return -EINVAL; -} + if (nlh-nlmsg_flags NLM_F_DUMP) { + int err; + if (nlmsg_attrlen(nlh, hdrlen)) { + struct nlattr *attr; -static inline void inet_diag_rcv_skb(struct sk_buff *skb) -{ - if (skb-len = NLMSG_SPACE(0)) { - int err; - struct nlmsghdr *nlh = nlmsg_hdr(skb); + attr = nlmsg_find_attr(nlh, hdrlen, + INET_DIAG_REQ_BYTECODE); + if (attr == NULL || + nla_len(attr) sizeof(struct inet_diag_bc_op) || + inet_diag_bc_audit(nla_data(attr), nla_len(attr))) + return -EINVAL; + } - if 
(nlh-nlmsg_len sizeof(*nlh) || - skb-len nlh-nlmsg_len) - return; - err = inet_diag_rcv_msg(skb, nlh); - if (err || nlh-nlmsg_flags NLM_F_ACK) - netlink_ack(skb, nlh, err); + err = netlink_dump_start(idiagnl, skb, nlh, +inet_diag_dump, NULL); + if (err == 0) + err = -EINTR; + return err; } + + return inet_diag_get_exact(skb, nlh); } static void inet_diag_rcv(struct sock *sk, int len) { - struct sk_buff *skb; - unsigned int qlen = skb_queue_len(sk-sk_receive_queue); + unsigned int qlen = 0; - while (qlen-- (skb = skb_dequeue(sk-sk_receive_queue))) { - inet_diag_rcv_skb(skb); - kfree_skb(skb); - } + do { + netlink_run_queue(sk, qlen, inet_diag_rcv_msg); + } while (qlen); } static DEFINE_SPINLOCK(inet_diag_register_lock); -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: many sockets, slow sendto
Zacco wrote: So, my worry is confirmed then. But how could that delay disappear when splitting the sender and receiver onto distinct hosts? Even in that case the right socket must be found somehow. When the receiver and sender are on the same machine, sendto() passes the packet to loopback and enters the receiving side. With that many sockets, the time to go through all the sockets may be 100 us. So your sendto() seems to be slow, but the slow part is the receiver. If you put two machines, the sender might send XX.XXX frames per second (full speed), but the receiver might handle 5% of them and drop 95%. This is all speculation, since you didn't give us the exact setup you use.
fix up misplaced inlines.
Turning up the warnings on gcc makes it emit warnings about the placement of 'inline' in function declarations. Here's everything that was under net/ Signed-off-by: Dave Jones [EMAIL PROTECTED] diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c index 4c914df..ecfe8da 100644 --- a/net/bluetooth/hidp/core.c +++ b/net/bluetooth/hidp/core.c @@ -319,7 +319,7 @@ static int __hidp_send_ctrl_message(struct hidp_session *session, return 0; } -static int inline hidp_send_ctrl_message(struct hidp_session *session, +static inline int hidp_send_ctrl_message(struct hidp_session *session, unsigned char hdr, unsigned char *data, int size) { int err; diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c index 7712d76..5439a3c 100644 --- a/net/bridge/br_netfilter.c +++ b/net/bridge/br_netfilter.c @@ -61,7 +61,7 @@ static int brnf_filter_vlan_tagged __read_mostly = 1; #define brnf_filter_vlan_tagged 1 #endif -static __be16 inline vlan_proto(const struct sk_buff *skb) +static inline __be16 vlan_proto(const struct sk_buff *skb) { return vlan_eth_hdr(skb)-h_vlan_encapsulated_proto; } diff --git a/net/core/sock.c b/net/core/sock.c index 8d65d64..27c4f62 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -808,7 +808,7 @@ lenout: * * (We also register the sk_lock with the lock validator.) 
*/ -static void inline sock_lock_init(struct sock *sk) +static inline void sock_lock_init(struct sock *sk) { sock_lock_init_class_and_name(sk, af_family_slock_key_strings[sk-sk_family], diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index a7fee6b..1b61699 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -804,7 +804,7 @@ struct ipv6_saddr_score { #define IPV6_SADDR_SCORE_LABEL 0x0020 #define IPV6_SADDR_SCORE_PRIVACY 0x0040 -static int inline ipv6_saddr_preferred(int type) +static inline int ipv6_saddr_preferred(int type) { if (type (IPV6_ADDR_MAPPED|IPV6_ADDR_COMPATv4| IPV6_ADDR_LOOPBACK|IPV6_ADDR_RESERVED)) @@ -813,7 +813,7 @@ static int inline ipv6_saddr_preferred(int type) } /* static matching label */ -static int inline ipv6_saddr_label(const struct in6_addr *addr, int type) +static inline int ipv6_saddr_label(const struct in6_addr *addr, int type) { /* *prefix (longest match) label @@ -3318,7 +3318,7 @@ errout: rtnl_set_sk_err(RTNLGRP_IPV6_IFADDR, err); } -static void inline ipv6_store_devconf(struct ipv6_devconf *cnf, +static inline void ipv6_store_devconf(struct ipv6_devconf *cnf, __s32 *array, int bytes) { BUG_ON(bytes (DEVCONF_MAX * 4)); diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 0e1f4b2..a6b3117 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -308,7 +308,7 @@ static inline void rt6_probe(struct rt6_info *rt) /* * Default Router Selection (RFC 2461 6.3.6) */ -static int inline rt6_check_dev(struct rt6_info *rt, int oif) +static inline int rt6_check_dev(struct rt6_info *rt, int oif) { struct net_device *dev = rt-rt6i_dev; int ret = 0; @@ -328,7 +328,7 @@ static int inline rt6_check_dev(struct rt6_info *rt, int oif) return ret; } -static int inline rt6_check_neigh(struct rt6_info *rt) +static inline int rt6_check_neigh(struct rt6_info *rt) { struct neighbour *neigh = rt-rt6i_nexthop; int m = 0; diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c index ee4b84a..93c4223 100644 --- a/net/ipv6/xfrm6_tunnel.c +++ 
b/net/ipv6/xfrm6_tunnel.c @@ -58,7 +58,7 @@ static struct kmem_cache *xfrm6_tunnel_spi_kmem __read_mostly; static struct hlist_head xfrm6_tunnel_spi_byaddr[XFRM6_TUNNEL_SPI_BYADDR_HSIZE]; static struct hlist_head xfrm6_tunnel_spi_byspi[XFRM6_TUNNEL_SPI_BYSPI_HSIZE]; -static unsigned inline xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr) +static inline unsigned xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr) { unsigned h; @@ -70,7 +70,7 @@ static unsigned inline xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr) return h; } -static unsigned inline xfrm6_tunnel_spi_hash_byspi(u32 spi) +static inline unsigned xfrm6_tunnel_spi_hash_byspi(u32 spi) { return spi % XFRM6_TUNNEL_SPI_BYSPI_HSIZE; } diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c index e85df07..abc47cc 100644 --- a/net/sched/cls_route.c +++ b/net/sched/cls_route.c @@ -93,7 +93,7 @@ void route4_reset_fastmap(struct net_device *dev, struct route4_head *head, u32 spin_unlock_bh(dev-queue_lock); } -static void __inline__ +static inline void route4_set_fastmap(struct route4_head *head, u32 id, int iif, struct route4_filter *f) { diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 9678995..e81e2fb 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -2025,7 +2025,7 @@ nlmsg_failure: return -1; } -static int inline
Re: many sockets, slow sendto
From: Zacco [EMAIL PROTECTED] Date: Wed, 21 Mar 2007 22:53:13 +0100 Do you think there is interest in such a modification? If so, how could we go on with it? The best thing you can do is hash on both saddr/sport. In order to handle the saddr==0 case the socket lookup has to try two lookups: one with the packet's saddr, and one with saddr zero. If the first lookup hits, we use that, since a precise match should take precedence over a wildcard saddr; otherwise we use the result of the second lookup. I'm not very inclined to hack on this, so anyone else is welcome to. FWIW, pretty much every other networking stack hashes only on sport for UDP, just like Linux.
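Dave's two-lookup scheme can be sketched in plain C. This is a hypothetical miniature socket table, not the kernel's hash implementation — a real version would hash (saddr, sport) into buckets and probe two chains — but the exact-match-then-wildcard fallback logic is the same:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature socket table: each entry is bound to
 * (local address, local port); addr == 0 means wildcard (INADDR_ANY). */
struct sock_ent { uint32_t addr; uint16_t port; };

static struct sock_ent table[] = {
    { 0x0a000001, 53 },   /* bound to 10.0.0.1:53 */
    { 0,          53 },   /* bound to *:53        */
    { 0x0a000002, 80 },   /* bound to 10.0.0.2:80 */
};

/* Two-pass lookup: try the exact (addr, port) match first, and only
 * fall back to a wildcard entry if no exact binding exists. */
static struct sock_ent *lookup(uint32_t addr, uint16_t port)
{
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].port == port && table[i].addr == addr)
            return &table[i];
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].port == port && table[i].addr == 0)
            return &table[i];
    return NULL;
}
```

The linear scans stand in for what would be two hash-chain probes in the real stack; the point is only the lookup order.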
[PATCH -mm 3/4] Blackfin: on-chip ethernet MAC controller update driver
Hi folks, As we moved four pieces of identical board-specific code, get_bf537_ether_addr(), into arch/blackfin/mach-bf537/boards/eth_mac.c, the comment in the driver should be updated accordingly. Signed-off-by: Bryan Wu [EMAIL PROTECTED] --- drivers/net/bfin_mac.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6/drivers/net/bfin_mac.c === --- linux-2.6.orig/drivers/net/bfin_mac.c +++ linux-2.6/drivers/net/bfin_mac.c @@ -842,7 +842,7 @@ /* Is it valid? (Did the bootloader initialize it?) */ if (!is_valid_ether_addr(dev->dev_addr)) { /* Grab the MAC from the board somehow - this is done in the - arch/blackfin/boards/bf537/boardname.c */ + arch/blackfin/mach-bf537/boards/eth_mac.c */ get_bf537_ether_addr(dev->dev_addr); } _ Thanks -Bryan
Re: [PATCH] tcp_cubic: use 32 bit math
On Tue, 13 Mar 2007 21:50:20 +0100 Willy Tarreau [EMAIL PROTECTED] wrote: Hi Stephen, On Mon, Mar 12, 2007 at 02:11:56PM -0700, Stephen Hemminger wrote: Oh BTW, I have a newer version with a first approximation of the cbrt() before the div64_64, which allows us to reduce from 3 div64s to only 2 div64s. This results in a version which is twice as fast as the initial one (ncubic), but with slightly less accuracy (0.286% compared to 0.247%). But I see that other functions such as hcbrt() had a 1.5% avg error, so I think this is not dramatic. Ignore my hcbrt(); it was a less accurate version of Andi's stuff. OK. Also, I managed to remove all the other divides, to be kind to CPUs with a slow divide instruction or no divide at all. Since we compute on a limited range (22 bits), we can multiply then shift right. It shows me even slightly better times on a Pentium-M and an Athlon, with a slightly higher avg error (0.297% compared to 0.286%), and slightly smaller code. What does the code look like? Well, I have cleaned it up a little; there were more comments and ifdefs than code! I've appended it to the end of this mail. I have changed it a bit, because I noticed that integer divide precision was so coarse that there were other possibilities to play with the bits. I have experimented with combinations of several methods:
- replace integer divides with multiplies/shifts where possible.
- compensate for divide imprecision by adding/removing small values before/after them. Often, the integer result of 1/(x*(x-1)) is closer to (float)1/(float)x^2 than 1/(x*x). This is because the divide always truncates the result.
- use direct result lookup for small values. Small inputs give small outputs which have very few moving bits. Many different values fit in a 32-bit integer, so we use a shift offset to look up the value. I used this in an fls function I wrote a while ago, which I should also post because it is up to twice as fast as the kernel's.
Sometimes it seems faster to look the value up from memory, sometimes it is faster to use an immediate value. Maybe more visible differences would show up on RISC CPUs where loading a 32-bit immediate needs two instructions. I don't know; I've not tested on my sparc yet.
- use small lookup tables (64 bytes) with 6-bit inputs and at least as many bits on output. We only look up the 6 MSBs and return the 2-3 MSBs of the result.
- iterative search and manual refinement of the lookup tables for best accuracy. The avg error rate can easily be halved this way.
I have tried several functions with 0, 1, 2 and 3 divides. Several of them offer better accuracy than what we currently have, in fewer cycles. Others offer faster results (up to 5 times) with slightly less accuracy. There is one function which is not to be used, but is just here for comparison (ncubic_0div). It does no divide but has an awful avg error. But one which is interesting is ncubic_tab0. It does not use any divide at all, not even a div64. It shows a 0.6% avg error, which I'm not sure is enough or not. It is 6.7 times faster than the initial ncubic() with less accuracy, and 4 times smaller. I suspect that it can differ more on architectures which have no divide instruction. If the 0.6% avg error rate is too much, ncubic_tab1() uses one single div64 and is twice as slow (still nearly 3 times faster than ncubic). It shows 0.195% avg error, which is better than the initial ncubic. I think that is a good tradeoff. If best accuracy is an absolute requirement, then I have a variation of ncubic (ncubic_3div) which does 0.17% in 2/3 of the time (compared to 0.247%), and which is slightly smaller. I have also added a size column, indicating approximate function size, provided that the compiler does not reorder the code. On gcc 3.4, it's OK, but 4.1 returns garbage. That does not matter; it's just a rough estimate anyway.
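The "replace integer divides with multiplies/shifts" method can be illustrated with a small self-contained sketch. The constant 43691 = ceil(2^17 / 3) is an assumption chosen here for a division by 3 over a 16-bit input range, where the result is provably exact:

```c
#include <assert.h>
#include <stdint.h>

/* Divide a 16-bit value by 3 using one multiply and one shift:
 * 43691 == ceil(2^17 / 3), and the rounding error x/393216 stays
 * below 1/6 for x < 65536, so (x * 43691) >> 17 == x / 3 exactly
 * over the whole 16-bit range. */
static inline uint32_t div3_u16(uint32_t x)
{
    return (x * 43691u) >> 17;
}
```

For wider inputs (such as the 22-bit range mentioned above) the reciprocal constant and shift have to be re-derived, and depending on the range the result may be off by one in the last place, which is exactly the kind of imprecision the compensation tricks above deal with.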
Here are the results classed by speed :

/* Sample output on a Pentium-M 600 MHz :
Function      clocks  mean(us)  max(us)  std(us)  Avg err  size
ncubic_tab0       79      0.66     7.20     1.04   0.613%   160
ncubic_0div       84      0.70     7.64     1.57   4.521%   192
ncubic_1div      178      1.48    16.27     1.81   0.443%   336
ncubic_tab1      179      1.49    16.34     1.85   0.195%   320
ncubic_ndiv3     263      2.18    24.04     3.59   0.250%   512
ncubic_2div      270      2.24    24.70     2.77   0.187%   512
ncubic32_1       359      2.98    32.81     3.59   0.238%   544
ncubic_3div      361      2.99    33.08     3.79   0.170%   656
ncubic32         364      3.02    33.29     3.51   0.247%   544
ncubic           529      4.39    48.39     4.92   0.247%   720
hcbrt            539      4.47    49.25     5.98   1.580%    96
ocubic           732      4.93    61.83     7.22   0.274%   320
Re: [PATCH] tcp_cubic: use 32 bit math
Hi Stephen, On Wed, Mar 21, 2007 at 11:54:19AM -0700, Stephen Hemminger wrote: On Tue, 13 Mar 2007 21:50:20 +0100 Willy Tarreau [EMAIL PROTECTED] wrote: [...] ( cut my boring part ) Here are the results classed by speed :

/* Sample output on a Pentium-M 600 MHz :
Function      clocks  mean(us)  max(us)  std(us)  Avg err  size
ncubic_tab0       79      0.66     7.20     1.04   0.613%   160
ncubic_0div       84      0.70     7.64     1.57   4.521%   192
ncubic_1div      178      1.48    16.27     1.81   0.443%   336
ncubic_tab1      179      1.49    16.34     1.85   0.195%   320
ncubic_ndiv3     263      2.18    24.04     3.59   0.250%   512
ncubic_2div      270      2.24    24.70     2.77   0.187%   512
ncubic32_1       359      2.98    32.81     3.59   0.238%   544
ncubic_3div      361      2.99    33.08     3.79   0.170%   656
ncubic32         364      3.02    33.29     3.51   0.247%   544
ncubic           529      4.39    48.39     4.92   0.247%   720
hcbrt            539      4.47    49.25     5.98   1.580%    96
ocubic           732      4.93    61.83     7.22   0.274%   320
acbrt            842      6.98    76.73     8.55   0.275%   192
bictcp          1032      6.95    86.30     9.04   0.172%   768

[...] The following version of div64_64 is faster because do_div() is already optimized for the 32-bit case. Cool, this is interesting because I first wanted to optimize it but did not find how to start. You seem to get very good results. BTW, you did not append your changes. However, one thing I do not understand is why your avg error is about 1/3 below the original one. Was there a precision bug in the original div64_64, or did you extend the values used in the test? Or perhaps you built with -ffast-math and the original cbrt() is less precise in that case? I get the following results on a ULV Core Solo (i.e. a slow current processor) and on a 64-bit Core Duo. ncubic_tab1 seems like the best (no additional error and about as fast). OK. It was the one I preferred too, unless tab0's avg error was acceptable.
ULV Core Solo
Function      clocks  mean(us)  max(us)  std(us)  Avg err  size
ncubic_tab0      192     11.24    45.10    15.28   0.450%  -2262
ncubic_0div      201     11.77    47.23    27.40   3.357%  -2404
ncubic_1div      324     19.02    76.32    25.82   0.189%  -2567
ncubic_tab1      326     19.13    76.73    23.71   0.043%  -2059
ncubic_2div      456     26.72   108.92   493.16   0.028%  -2790
ncubic_ndiv3     463     27.15   133.37  1889.39   0.104%  -3344
ncubic32         549     32.18   130.59   508.97   0.041%  -3794
ncubic32_1       574     33.66   138.32   548.48   0.029%  -3604
ncubic_3div      581     34.04   140.24   608.55   0.018%  -3050
ncubic           733     42.92   173.35   523.19   0.041%    299
ocubic          1046     61.25   283.68  3305.65   0.027%  -2232
acbrt           1149     67.32   284.91  1941.55   0.029%    168
bictcp          1663     97.41   394.29   604.86   0.017%    628

Core 2 Duo
Function      clocks  mean(us)  max(us)  std(us)  Avg err  size
ncubic_0div       74      0.03     1.60     0.07   3.357%  -2101
ncubic_tab0       74      0.03     1.60     0.04   0.450%  -2029
ncubic_1div      142      0.07     3.11     1.05   0.189%  -2195
ncubic_tab1      144      0.07     3.18     1.02   0.043%  -1638
ncubic_2div      216      0.10     4.74     1.07   0.028%  -2326
ncubic_ndiv3     219      0.10     4.76     1.04   0.104%  -2709
ncubic32         269      0.13     5.87     1.13   0.041%  -1500
ncubic32_1       272      0.13     5.92     1.10   0.029%  -2881
ncubic           273      0.13     5.96     1.13   0.041%  -1763
ncubic_3div      290      0.14     6.32     1.01   0.018%  -2499
acbrt            430      0.20     9.42     1.18   0.029%     77
ocubic           444      0.21     9.82     1.82   0.027%  -1924
bictcp           549      0.26    12.06     1.68   0.017%    236

Thanks, Willy
[PATCH 2/2] tcp: cubic optimization
Use Willy's work on optimizing the cube root by using a table for small values. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] ---

--- net-2.6.22.orig/net/ipv4/tcp_cubic.c	2007-03-21 12:57:11.000000000 -0700
+++ net-2.6.22/net/ipv4/tcp_cubic.c	2007-03-21 13:04:59.000000000 -0700
@@ -91,23 +91,51 @@
 		tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
 }
 
-/*
- * calculate the cubic root of x using Newton-Raphson
+/* calculate the cubic root of x using a table lookup followed by one
+ * Newton-Raphson iteration.
+ * Avg err ~= 0.195%
  */
 static u32 cubic_root(u64 a)
 {
-	u32 x;
-
-	/* Initial estimate is based on:
-	 * cbrt(x) = exp(log(x) / 3)
+	u32 x, b, shift;
+	/*
+	 * cbrt(x) MSB values for x MSB values in [0..63].
+	 * Precomputed then refined by hand - Willy Tarreau
+	 *
+	 * For x in [0..63],
+	 *   v = cbrt(x << 18) - 1
+	 *   cbrt(x) = (v[x] + 10) >> 6
 	 */
-	x = 1u << (fls64(a)/3);
+	static const u8 v[] = {
+		/* 0x00 */    0,  54,  54,  54, 118, 118, 118, 118,
+		/* 0x08 */  123, 129, 134, 138, 143, 147, 151, 156,
+		/* 0x10 */  157, 161, 164, 168, 170, 173, 176, 179,
+		/* 0x18 */  181, 185, 187, 190, 192, 194, 197, 199,
+		/* 0x20 */  200, 202, 204, 206, 209, 211, 213, 215,
+		/* 0x28 */  217, 219, 221, 222, 224, 225, 227, 229,
+		/* 0x30 */  231, 232, 234, 236, 237, 239, 240, 242,
+		/* 0x38 */  244, 245, 246, 248, 250, 251, 252, 254,
+	};
+
+	b = fls64(a);
+	if (b < 7) {
+		/* a in [0..63] */
+		return ((u32)v[(u32)a] + 35) >> 6;
+	}
+
+	b = ((b * 84) >> 8) - 1;
+	shift = (a >> (b * 3));
 
-	/* converges to 32 bits in 3 iterations */
-	x = (2 * x + (u32)div64_64(a, (u64)x*(u64)x)) / 3;
-	x = (2 * x + (u32)div64_64(a, (u64)x*(u64)x)) / 3;
-	x = (2 * x + (u32)div64_64(a, (u64)x*(u64)x)) / 3;
+	x = ((u32)(((u32)v[shift] + 10) << b)) >> 6;
 
+	/*
+	 * Newton-Raphson iteration
+	 *                 2
+	 * x    = ( 2 * x  +  a / x  ) / 3
+	 *  k+1          k         k
+	 */
+	x = (2 * x + (u32)div64_64(a, (u64)x * (u64)(x - 1)));
+	x = ((x * 341) >> 10);
 	return x;
 }
-- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
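For reference, the pre-patch approach that the diff above replaces — a power-of-two initial estimate from fls64() followed by three Newton-Raphson steps — can be tried in userspace. This is a sketch, with plain C division standing in for the kernel's div64_64() and a naive loop standing in for fls64():

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the kernel's fls64(): position of the highest
 * set bit, 1-based (so fls64_(1) == 1, fls64_(0) == 0). */
static int fls64_(uint64_t x)
{
    int b = 0;

    while (x) {
        b++;
        x >>= 1;
    }
    return b;
}

/* Power-of-two first guess, then three Newton-Raphson steps
 *   x = (2*x + a/x^2) / 3
 * a must be non-zero to avoid dividing by zero. */
static uint32_t cubic_root_nr(uint64_t a)
{
    uint32_t x = 1u << (fls64_(a) / 3);

    x = (2 * x + (uint32_t)(a / ((uint64_t)x * x))) / 3;
    x = (2 * x + (uint32_t)(a / ((uint64_t)x * x))) / 3;
    x = (2 * x + (uint32_t)(a / ((uint64_t)x * x))) / 3;
    return x;
}
```

On perfect cubes the three iterations converge exactly; the patched kernel version trades one table lookup for two of the expensive 64-bit divides.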
[PATCH 1/2] div64_64 optimization
Minor optimization of div64_64. do_div() already optimizes the case of a 32-by-32 divide, so there is no need to do it here. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] ---

--- net-2.6.22.orig/lib/div64.c	2007-03-21 12:03:59.000000000 -0700
+++ net-2.6.22/lib/div64.c	2007-03-21 12:04:46.000000000 -0700
@@ -61,20 +61,18 @@
 /* 64bit divisor, dividend and result. dynamic precision */
 uint64_t div64_64(uint64_t dividend, uint64_t divisor)
 {
-	uint32_t d = divisor;
+	uint32_t high, d;
 
-	if (divisor > 0xffffffffULL) {
-		unsigned int shift = fls(divisor >> 32);
+	high = divisor >> 32;
+	if (high) {
+		unsigned int shift = fls(high);
 
 		d = divisor >> shift;
 		dividend >>= shift;
-	}
+	} else
+		d = divisor;
 
-	/* avoid 64 bit division if possible */
-	if (dividend >> 32)
-		do_div(dividend, d);
-	else
-		dividend = (uint32_t) dividend / d;
+	do_div(dividend, d);
 
 	return dividend;
 }
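The shift normalization at the heart of div64_64() can be modelled in userspace; a sketch, with plain C division standing in for do_div(). Note the result is exact when the divisor fits in 32 bits, and may be off by a small amount when both operands have to be shifted:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of div64_64(): if the divisor does not fit in
 * 32 bits, shift both operands right until it does, then do a 64/32
 * divide (modelled here with the C '/' operator). */
static uint64_t div64_64_model(uint64_t dividend, uint64_t divisor)
{
    uint32_t high = (uint32_t)(divisor >> 32);
    uint32_t d;

    if (high) {
        /* fls(high): how many bits must be dropped so the shifted
         * divisor fits in 32 bits */
        int shift = 0;
        while (high) {
            shift++;
            high >>= 1;
        }
        d = (uint32_t)(divisor >> shift);
        dividend >>= shift;
    } else {
        d = (uint32_t)divisor;
    }
    return dividend / d;
}
```

For example, with dividend 2^40 and divisor 2^33+1 the true quotient is 127, but after shifting both operands by 2 the model returns 128 — the "dynamic precision" the original comment refers to.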
Re: [PATCH 0/5] [RFC] AF_RXRPC socket family implementation [try #3]
David Howells [EMAIL PROTECTED] wrote: - recvmsg not supporting MSG_TRUNC is rather weird and really ought to be fixed one day, as it's useful to find out the size of the message pending when combined with MSG_PEEK Hmmm... I hadn't considered that. I assumed MSG_TRUNC not to be useful, as arbitrarily chopping bits out of the request or reply would seem to be pointless. But why do I need to support MSG_TRUNC? I currently have things arranged so that if you do a recvmsg() that doesn't pull everything out of a packet, then the next time you do a recvmsg() you'll get the next part of the data in that packet. MSG_EOR is flagged when recvmsg copies across the last byte of data of a particular phase. Okay... I've rewritten my recvmsg implementation for RxRPC. The one I had could pull messages belonging to a call off the socket in the wrong order if two threads both tried to pull simultaneously. Also:
(1) If there's a sequence of data messages belonging to a particular call on the receive queue, then recvmsg() will keep eating them until it meets either a non-data message or a message belonging to a different call, or until it fills the user buffer. If it doesn't fill the user buffer, it will sleep unless it is non-blocking.
(2) MSG_PEEK operates similarly, but will return immediately if it has put any data in the buffer rather than waiting for further packets to arrive.
(3) If a packet is only partially consumed in filling a user buffer, then the shrunken packet will be left on the front of the queue for the next taker.
(4) If there is more data to be had on a call (we haven't copied the last byte of the last data packet in that phase yet), then MSG_MORE will be flagged.
(5) MSG_EOR will be flagged on the terminal message of a call. No more messages from that call will be received, and the user ID may be reused.
Patch attached.
David diff --git a/net/rxrpc/Makefile b/net/rxrpc/Makefile index 3369534..f12cd28 100644 --- a/net/rxrpc/Makefile +++ b/net/rxrpc/Makefile @@ -17,6 +17,7 @@ af-rxrpc-objs := \ ar-local.o \ ar-output.o \ ar-peer.o \ + ar-recvmsg.o \ ar-security.o \ ar-skbuff.o \ ar-transport.o diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c index b25d931..06963e6 100644 --- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -385,217 +385,6 @@ out: } /* - * receive a message from an RxRPC socket - */ -static int rxrpc_recvmsg(struct kiocb *iocb, struct socket *sock, -struct msghdr *msg, size_t len, int flags) -{ - struct rxrpc_skb_priv *sp; - struct rxrpc_call *call; - struct rxrpc_sock *rx = rxrpc_sk(sock-sk); - struct sk_buff *skb; - int copy, ret, ullen; - u32 abort_code; - - _enter(,,,%zu,%d, len, flags); - - if (flags (MSG_OOB | MSG_TRUNC)) - return -EOPNOTSUPP; - -try_again: - if (RB_EMPTY_ROOT(rx-calls) - rx-sk.sk_state != RXRPC_SERVER_LISTENING) - return -ENODATA; - - /* receive the next message from the common Rx queue */ - skb = skb_recv_datagram(rx-sk, flags, flags MSG_DONTWAIT, ret); - if (!skb) { - _leave( = %d, ret); - return ret; - } - - sp = rxrpc_skb(skb); - call = sp-call; - ASSERT(call != NULL); - - /* make sure we wait for the state to be updated in this call */ - spin_lock_bh(call-lock); - spin_unlock_bh(call-lock); - - if (test_bit(RXRPC_CALL_RELEASED, call-flags)) { - _debug(packet from release call); - rxrpc_free_skb(skb); - goto try_again; - } - - rxrpc_get_call(call); - - /* copy the peer address. */ - if (msg-msg_name msg-msg_namelen 0) - memcpy(msg-msg_name, call-conn-trans-peer-srx, - sizeof(call-conn-trans-peer-srx)); - - /* set up the control messages */ - ullen = msg-msg_flags MSG_CMSG_COMPAT ? 
4 : sizeof(unsigned long); - - sock_recv_timestamp(msg, rx-sk, skb); - - if (skb-mark == RXRPC_SKB_MARK_NEW_CALL) { - _debug(RECV NEW CALL); - ret = put_cmsg(msg, SOL_RXRPC, RXRPC_NEW_CALL, 0, abort_code); - if (ret 0) - goto error_requeue_packet; - goto done; - } - - ret = put_cmsg(msg, SOL_RXRPC, RXRPC_USER_CALL_ID, - ullen, call-user_call_ID); - if (ret 0) - goto error_requeue_packet; - ASSERT(test_bit(RXRPC_CALL_HAS_USERID, call-flags)); - - switch (skb-mark) { - case RXRPC_SKB_MARK_DATA: - _debug(recvmsg DATA #%u { %d, %d }, - ntohl(sp-hdr.seq), skb-len, sp-offset); - - ASSERTCMP(ntohl(sp-hdr.seq), =, call-rx_data_recv); - ASSERTCMP(ntohl(sp-hdr.seq), =, call-rx_data_recv + 1); -
Re: [PATCH 2.6.21 2/4] cxgb3 - Auto-load FW if mismatch detected
On Sun, 18 Mar 2007 13:10:06 -0700 [EMAIL PROTECTED] wrote: config CHELSIO_T3 tristate Chelsio Communications T3 10Gb Ethernet support depends on PCI + select FW_LOADER Something has gone wrong with the indenting there. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] [RFC] AF_RXRPC socket family implementation [try #3]
David Howells [EMAIL PROTECTED] wrote: - recvmsg not supporting MSG_TRUNC is rather weird and really ought to be fixed one day, as it's useful to find out the size of the message pending when combined with MSG_PEEK Hmmm... I hadn't considered that. I assumed MSG_TRUNC not to be useful, as arbitrarily chopping bits out of the request or reply would seem to be pointless. But why do I need to support MSG_TRUNC? I currently have things arranged so that if you do a recvmsg() that doesn't pull everything out of a packet, then the next time you do a recvmsg() you'll get the next part of the data in that packet. MSG_EOR is flagged when recvmsg copies across the last byte of data of a particular phase. I might at some point in the future enable recvmsg() to keep pulling packets off the Rx queue and copying them into userspace until the userspace buffer is full or we find that the next packet is not the logical next in sequence. Hmmm... I'm actually overloading MSG_EOR. MSG_EOR is flagged on the last data read, and is also flagged for terminal messages (end of reply data, abort, net error, final ACK, etc). I wonder if I should use MSG_MORE (or its lack) instead to indicate the end of data, and only set MSG_EOR on the terminal message. MSG_MORE is set by the app to flag to sendmsg() that there's more data to come, so it would be consistent to use it for recvmsg() too. David
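The MSG_PEEK + MSG_TRUNC combination under discussion can be demonstrated from userspace on Linux with an AF_UNIX datagram socketpair: with both flags set, recv() reports the real datagram length even though the peek buffer is smaller, and the datagram remains queued. (MSG_TRUNC support for UNIX datagram sockets is Linux-specific.)

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send a 100-byte datagram over an AF_UNIX socketpair, then peek at it
 * with only a 10-byte buffer.  With MSG_PEEK|MSG_TRUNC, recv() returns
 * the full datagram length without consuming it. */
static ssize_t peeked_len(void)
{
    int fds[2];
    char big[100], small[10];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0)
        return -1;
    memset(big, 'x', sizeof(big));
    if (send(fds[0], big, sizeof(big), 0) != (ssize_t)sizeof(big))
        return -1;

    /* peek: only 10 bytes are copied, but the real length comes back */
    n = recv(fds[1], small, sizeof(small), MSG_PEEK | MSG_TRUNC);

    /* the datagram is still queued; consume it for real */
    if (recv(fds[1], big, sizeof(big), 0) != (ssize_t)sizeof(big))
        return -1;
    return n;
}
```

This is the userspace pattern the reviewer is asking RxRPC to support: applications size a receive buffer by peeking first.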
Re: [PATCH 2.6.21 2/4] cxgb3 - Auto-load FW if mismatch detected
Andrew Morton wrote: On Sun, 18 Mar 2007 13:10:06 -0700 [EMAIL PROTECTED] wrote: config CHELSIO_T3 tristate Chelsio Communications T3 10Gb Ethernet support depends on PCI + select FW_LOADER Something has gone wrong with the indenting there. The added line is fine. The surrounding lines are not: they use spaces instead of tabs. I'll send a patch on top of the last series to use tabs in drivers/net/Kconfig. Cheers, Divy
Re: [PATCH 2.6.21 3/4] cxgb3 - Fix potential MAC hang
Andrew Morton wrote: On Sun, 18 Mar 2007 13:10:12 -0700 [EMAIL PROTECTED] wrote: From: Divy Le Ray [EMAIL PROTECTED] Under rare conditions, the MAC might hang while generating a pause frame. This patch fine tunes the MAC settings to avoid the issue, allows for periodic MAC state check, and triggers a recovery if hung. Also fix one MAC statistics counter for the rev board T3B2. This conflicts with your previously-submitted, not-yet-merged-by-Jeff cxgb3-add-sw-lro-support.patch. What should we do about this? I can send you a patch against the -mm tree, if it is acceptable. Cheers, Divy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2.6.21 5/4] cxgb3 - fix white spaces in drivers/net/Kconfig
From: Divy Le Ray [EMAIL PROTECTED] Use tabs instead of white spaces for CHELSIO_T3 entry. Signed-off-by: Divy Le Ray [EMAIL PROTECTED] --- drivers/net/Kconfig | 24 1 files changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 1b6459b..c3f9f59 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -2372,23 +2372,23 @@ config CHELSIO_T1_NAPI when the driver is receiving lots of packets from the card. config CHELSIO_T3 -tristate Chelsio Communications T3 10Gb Ethernet support -depends on PCI + tristate Chelsio Communications T3 10Gb Ethernet support + depends on PCI select FW_LOADER -help - This driver supports Chelsio T3-based gigabit and 10Gb Ethernet - adapters. + help + This driver supports Chelsio T3-based gigabit and 10Gb Ethernet + adapters. - For general information about Chelsio and our products, visit - our website at http://www.chelsio.com. + For general information about Chelsio and our products, visit + our website at http://www.chelsio.com. - For customer support, please visit our customer support page at - http://www.chelsio.com/support.htm. + For customer support, please visit our customer support page at + http://www.chelsio.com/support.htm. - Please send feedback to [EMAIL PROTECTED]. + Please send feedback to [EMAIL PROTECTED]. - To compile this driver as a module, choose M here: the module - will be called cxgb3. + To compile this driver as a module, choose M here: the module + will be called cxgb3. config EHEA tristate eHEA Ethernet support - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1][PKT_CLS] Avoid multiple tree locks
On Wed, 2007-21-03 at 15:04 +0100, Patrick McHardy wrote: Patrick McHardy wrote: What we could do is replace the netlink cb_lock spinlock by a user-supplied mutex (supplied to netlink_kernel_create, rtnl_mutex in this case). That would put the entire dump under the rtnl and allow us to get rid of qdisc_tree_lock and avoid the need to take dev_base_lock during qdisc dumping. Same in other spots like rtnl_dump_ifinfo, inet_dump_ifaddr, ... These (compile tested) patches demonstrate the idea. The first one lets netlink_kernel_create users specify a mutex that should be held during dump callbacks; the second one uses this for rtnetlink and changes inet_dump_ifaddr for demonstration. A complete patch would allow us to simplify locking in lots of spots; all rtnetlink users currently need to implement extra locking just for the dump functions, and a number of them already get it wrong and seem to rely on the rtnl. The mutex is certainly a cleaner approach, and a lot of the RCU protection would go away. I like it. Knowing you, I sense there's something clever in there that I am missing. I don't see how you could get rid of the tree locking, since we still need to protect against the data path, no? Or are you looking at that as a separate effort? If there are no objections to this change I'm going to update the second patch to include all rtnetlink users. No objections here. cheers, jamal
Re: many sockets, slow sendto
On Wed, 21 Mar 2007 18:15:10 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Eric Dumazet [EMAIL PROTECTED] Date: Wed, 21 Mar 2007 23:12:40 +0100 I chose in this patch to hash UDP sockets with a hash function that take into account both their port number and address : This has a drawback because we need two lookups : one with a given address, one with a wildcard (null) address. Thanks for doing this work Eric, I'll review this when I get home tomorrow night or Friday. You're welcome :) I knew you were busy with this new wii game^Wprogram :) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html