Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Hi Dave, David Miller wrote on 10/10/2007 02:13:31 AM: Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path. Indeed, I also realized last night that for the default qdiscs we do a lot of stupid useless work. If the queue is a FIFO and the device can take packets, we should send it directly and not stick it into the qdisc at all. Since you are talking of how it should be done in the *current* code, I feel LLTX drivers will not work nicely with this. Actually I was trying this change a couple of weeks back, but felt that doin go would result in out of order packets (skbs present in q which were not sent out for LLTX failure will be sent out only at next net_tx_action, while other skbs are sent ahead). One option is to first call qdisc_run() and then process this skb, but that is ugly (requeue handling). However I guess this can be done cleanly once LLTX is removed. Thanks, - KK - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rtnl: Simplify ASSERT_RTNL
David Miller [EMAIL PROTECTED] writes: From: [EMAIL PROTECTED] (Eric W. Biederman) Date: Fri, 28 Sep 2007 18:59:08 -0600 Currently we have the call path: macvlan_open - dev_unicast_add - __dev_set_rx_mode - __dev_set_promiscuity - ASSERT_RTNL - mutex_trylock When mutex debugging is on taking a mutex complains if we are not allowed to sleep. At that point we have called netif_tx_lock_bh so we are clearly not allowed to sleep. Arguably this is not a problem for mutex_trylock. However we can avoid the complaint and make the ASSERT_RTNL code cheaper, faster and more obvious by simply calling mutex_is_locked. So this patch adds rtnl_is_locked (which does mutex_is_locked on the rtnl_mutex) and changes ASSERT_RTNL to use that. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] There was a lot of discussion about how to do this right and therefore you'll need to resubmit all of this with this discussioned issues addressed. Thanks for the update on where this patch sits. My understanding is that the patch as I submitted it is correct. The code path that sparked this patch has been seen to be extremely convoluted from a locking perspective. Herbert and Patrick have been discussing that code path and it was my impression they were working together to figure out how to refactor that code path to make the locking simpler. There are enough convoluted details that I do not feel comfortable refactoring that code path. There was a practical suggestion by Herbert that ASSERT_RTNL have a might_sleep() added. That suggestion will currently result in ASSERT_RTNL firing unnecessarily from the macvlan_open code path. So I do not feel comfortable adding a might_sleep() into ASSERT_RTNL, as it appears that it will warn on currently correct code. Even if that code has code has nearly incomprehensible locking. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rtnl: Simplify ASSERT_RTNL
On Thu, Oct 11, 2007 at 12:57:31AM -0600, Eric W. Biederman wrote: There was a practical suggestion by Herbert that ASSERT_RTNL have a might_sleep() added. That suggestion will currently result in ASSERT_RTNL firing unnecessarily from the macvlan_open code path. As I've already said we should change the macvlan and core netdev code so that this doesn't happen in the first place. In gernal checking for the RTNL while holding a spin lock is a sign of a bug. So I would object to a patch that caused the RTNL_ASSERT to not warn about being called in an atomic context. I don't have a problem with your patch per se, it's the fact that the patch is removing the warning when it's called in an atomic context that I don't like. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PROBLEM: skb_clone SMP race?
Thank you. I was thinking about something like that when I said not find explicit checks, but I was confused because the sk_buff code is plenty of explicit checks and I understood those checks as a try to make the code self-protected against wrong uses: I, sk_buff, offer you this interface. Don´t worry too much about my internals: I will complain on your mistakes The fact the name of the functionality is generic fast_clone and not specific tcp_fast_clone drove me to think: Ok, the allocation of fast-clonable skbs is NOW only in ONE place inside TCP code and sure that they have the implicit checks to avoid this race when they call skb_clone, but -NOW- and -ONE- are not a big guarantee... A newbie's thought. :) 2007/10/11, Herbert Xu [EMAIL PROTECTED]: Santiago Font Arquer [EMAIL PROTECTED] wrote: If an skb with fast clone available (first if true) has references in different CPUs (skb-users1) (I do not find explicit checks for this to be impossible), if skb_clone is called simultaneously over that skb, both callers can get the same clone (the fast clone) and different problems follow: wrong clone_skb-users (1 as expected by the caller, but it should be, to be true, 2), fclone_ref set to 3 involving further problems, ... Fast clones are only used by TCP where the original skb is never given to the outside world. This plus the fact that a given TCP socket is single-threaded makes it safe. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rtnl: Simplify ASSERT_RTNL
Herbert Xu [EMAIL PROTECTED] writes: On Thu, Oct 11, 2007 at 12:57:31AM -0600, Eric W. Biederman wrote: There was a practical suggestion by Herbert that ASSERT_RTNL have a might_sleep() added. That suggestion will currently result in ASSERT_RTNL firing unnecessarily from the macvlan_open code path. As I've already said we should change the macvlan and core netdev code so that this doesn't happen in the first place. Agreed. Until that is done I am reluctant to add a warning to ASSERT_RTNL. Last I looked at that part of the thread it looked like you and Patrick were making good progress towards unraveling that, so I have no problem adding an extra warning when we don't expect it to ever trigger. In gernal checking for the RTNL while holding a spin lock is a sign of a bug. So I would object to a patch that caused the RTNL_ASSERT to not warn about being called in an atomic context. ASSERT_RTNL does not warn about being called in an atomic context today! I don't have a problem with your patch per se, it's the fact that the patch is removing the warning when it's called in an atomic context that I don't like. No my patch does not remove a warning. Way way deep in mutex debugging on the slowpath there is a unreadable and incomprehensible WARN_ON in muxtex_trylock that will trigger if you have 10 tons of debugging turned on, and you are in, interrupt context, and you manage to hit the slow path. I think that is a pretty unlikely scenario. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rtnl: Simplify ASSERT_RTNL
On Thu, Oct 11, 2007 at 02:23:35AM -0600, Eric W. Biederman wrote: So I would object to a patch that caused the RTNL_ASSERT to not warn about being called in an atomic context. ASSERT_RTNL does not warn about being called in an atomic context today! Well it did didn't it or we wouldn't be having this thread :) Way way deep in mutex debugging on the slowpath there is a unreadable and incomprehensible WARN_ON in muxtex_trylock that will trigger if you have 10 tons of debugging turned on, and you are in, interrupt context, and you manage to hit the slow path. I think that is a pretty unlikely scenario. Well thanks to that warning we're on our way of improving the code that triggered it in such a way that this warning will soon go silent. That's precisely the reason why I object to having this warning removed. Now you have a good point that this warning doesn't trigger all the time. The fix to that is to *make* it trigger always, not removing it. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.23-rc5] ipsec interfamily route handling fix
On Friday 14 September 2007 23:42:52 David Miller wrote: From: Joakim Koskela [EMAIL PROTECTED] Date: Thu, 6 Sep 2007 19:00:10 +0300 This patch addresses a couple of issues related to interfamily ipsec modes. The problem is that the structure of the routing info changes with the family during the __xfrmX_bundle_create, which hasn't been taken properly into account. Seems that by coincidence it hasn't Since nobody else found time to review this, I did :-) Thanks for taking the time, and sorry for not getting back on this until now.. It sets encap_type in the inner loop, but what if we find multiple entries some ipv4 and some ipv6? This logic can't be right. Instead, we need to treat these objects on an individual basis, I think, and that requires a bit more changes. Yes, this is what I was worried about. But as I'm not that familiar with how these dst_entries are used down the line or with subpolicy transformations I didn't feel comfortable rewriting that completely. I'm trying to get this thing solved (any help is of course appreciated :), but in the meantime I think the following bit could actually be separated as, although related, fixes an issue not directly tied to the original problem (..and would be great to get applied as it makes interfamily work quite ok for a number of setups). What do you think? ..and for a short description: This patch resets the ipv4-related flags in the new flow as their content will otherwise depend on the bits of the ipv6 addresses the struct was previously used for. For example, fl4_tos might have RTO_ONLINK set, which usually prevents the right route from being found. -- diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c index 15aa4c5..b4a0b54 100644 --- a/net/ipv6/xfrm6_policy.c +++ b/net/ipv6/xfrm6_policy.c @@ -185,6 +185,8 @@ __xfrm6_bundle_create(struct xfrm_policy *policy, struct xfrm_state **xfrm, int case AF_INET: fl_tunnel.fl4_dst = xfrm[i]-id.daddr.a4; fl_tunnel.fl4_src = xfrm[i]-props.saddr.a4; + fl_tunnel.fl4_tos = 0; + fl_tunnel.fl4_scope = 0; break; case AF_INET6: ipv6_addr_copy(fl_tunnel.fl6_dst, __xfrm6_bundle_addr_remote(xfrm[i], fl-fl6_dst)); - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
On Wed, 2007-10-10 at 13:51 -0600, Eric W. Biederman wrote: Yes. Netlink sockets are per-namespace and you can use the namespace of a netlink socket to look up a netdev. Ok, thanks. I still haven't really looked into the wireless vs. net namespaces problem but this will probably help. johannes signature.asc Description: This is a digitally signed message part
Re: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
On Thu, 11 Oct 2007, TAKANO Ryousei wrote: From: Ilpo Järvinen [EMAIL PROTECTED] Subject: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems Date: Tue, 9 Oct 2007 15:20:01 +0300 Thanks Ilpo! I am trying to evaluate this patch. There's a minor problem in this 2nd patch, it's just preventing the cnt == tp-retrans_out short-circuit from working, not a correctness problem though it could affect the performance. I'll post larger patch series among which is a fixed version (hopefully today). But, I got a kernel panic at net_rx_action() in our experimental setting. I am using the net-2.6.24 kernel _without_ this patch. (I will post a bug report separately). ...Please do. :-) Anyway, I will report the result as soon as possible. Thanks. ...It's very interesting to see because it's not that clear cut how the extra processing that is necessary affects high-speed performance, it could add yet another source of RTOs due processing latency. -- i.
Re: authenc compile warnings in current net-2.6.24
* David Miller | 2007-10-10 16:25:28 [-0700]: From: Sebastian Siewior [EMAIL PROTECTED] Date: Wed, 10 Oct 2007 21:53:37 +0200 * Oliver Hartkopp | 2007-10-10 19:53:53 [+0200]: CC [M] crypto/authenc.o crypto/authenc.c: In function ?crypto_authenc_hash?: crypto/authenc.c:88: warning: ?cryptlen? may be used uninitialized in this function crypto/authenc.c:87: warning: ?dst? may be used uninitialized in this function crypto/authenc.c: In function ?crypto_authenc_decrypt?: crypto/authenc.c:163: warning: ?cryptlen? may be used uninitialized in this function crypto/authenc.c:163: note: ?cryptlen? was declared here crypto/authenc.c:162: warning: ?src? may be used uninitialized in this function crypto/authenc.c:162: note: ?src? was declared here do you already know these warnings? Those warnings are looking like a compiler bug to me. To be honest I don't know of any compiler which commits enough flow variable analysis to support doing %100 accurate warnings in situations like this. gcc (GCC) 4.1.2 (Gentoo 4.1.2) did not produce any warnings in this case. Since the compiler is unlikely to do so, I think we should fix it somehow because useless warnings just distract. sure. Sebastian - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/7] [TCP]: Fix lost_retrans loop vs fastpath problems
Detection implemented with lost_retrans must work also when fastpath is taken, yet most of the queue is skipped including (very likely) those retransmitted skb's we're interested in. This problem appeared when the hints got added, which removed a need to always walk over the whole write queue head. Therefore decicion for the lost_retrans worker loop entry must be separated from the sacktag processing more than it was necessary before. It turns out to be problematic to optimize the worker loop very heavily because ack_seqs of skb may have a number of discontinuity points. Maybe similar approach as currently is implemented could be attempted but that's becoming more and more complex because the trend is towards less skb walking in sacktag marker. Trying a simple work until all rexmitted skbs heve been processed approach. Maybe after(highest_sack_end_seq, tp-high_seq) checking is not sufficiently accurate and causes entry too often in no-work-to-do cases. Since that's not known, I've separated solution to that from this patch. Noticed because of report against a related problem from TAKANO Ryousei [EMAIL PROTECTED]. He also provided a patch to that part of the problem. This patch includes solution to it (though this patch has to use somewhat different placement). TAKANO's description and patch is available here: http://marc.info/?l=linux-netdevm=119149311913288w=2 ...In short, TAKANO's problem is that end_seq the loop is using not necessarily the largest SACK block's end_seq because the current ACK may still have higher SACK blocks which are later by the loop. Cc: TAKANO Ryousei [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 37 ++--- 1 files changed, 22 insertions(+), 15 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index a066039..d5e0fcc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1109,27 +1109,34 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack, /* Check for lost retransmit. This superb idea is borrowed from ratehalving. * Event C. Later note: FACK people cheated me again 8), we have to account * for reordering! Ugly, but should help. + * + * Search retransmitted skbs from write_queue that were sent when snd_nxt was + * less than what is now known to be received by the other end (derived from + * SACK blocks by the caller). */ -static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans) +static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; int flag = 0; + int cnt = 0; tcp_for_write_queue(skb, sk) { u32 ack_seq = TCP_SKB_CB(skb)-ack_seq; if (skb == tcp_send_head(sk)) break; - if (after(TCP_SKB_CB(skb)-seq, lost_retrans)) + if (cnt == tp-retrans_out) break; if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una)) continue; - if ((TCP_SKB_CB(skb)-sacked TCPCB_SACKED_RETRANS) - after(lost_retrans, ack_seq) + if (!(TCP_SKB_CB(skb)-sacked TCPCB_SACKED_RETRANS)) + continue; + + if (after(received_upto, ack_seq) (tcp_is_fack(tp) || -!before(lost_retrans, +!before(received_upto, ack_seq + tp-reordering * tp-mss_cache))) { TCP_SKB_CB(skb)-sacked = ~TCPCB_SACKED_RETRANS; tp-retrans_out -= tcp_skb_pcount(skb); @@ -1143,6 +1150,8 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans) flag |= FLAG_DATA_SACKED; NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT); } + } else { + cnt += tcp_skb_pcount(skb); } } return flag; @@ -1225,7 +1234,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)3; int reord = tp-packets_out; int prior_fackets; - u32 lost_retrans = 0; + u32 highest_sack_end_seq = 0; int flag = 0; int found_dup_sack = 0; int cached_fack_count; @@ -1396,11 +1405,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ continue; } - if ((sackedTCPCB_SACKED_RETRANS) - after(end_seq, TCP_SKB_CB(skb)-ack_seq) - (!lost_retrans || after(end_seq, lost_retrans))) - lost_retrans = end_seq; - if (!in_sack) continue; @@ -1454,9
[PATCH 7/7] [TCP]: Limit processing lost_retrans loop to work-to-do cases
This addition of lost_retrans_low to tcp_sock might be unnecessary, it's not clear how often lost_retrans worker is executed when there wasn't work to do. Cc: TAKANO Ryousei [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/tcp.h |2 ++ net/ipv4/tcp_input.c | 14 +++--- net/ipv4/tcp_output.c |2 ++ 3 files changed, 15 insertions(+), 3 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 9ff456e..c5b94c1 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -348,6 +348,8 @@ struct tcp_sock { int lost_cnt_hint; int retransmit_cnt_hint; + u32 lost_retrans_low; /* Sent seq after any rxmit (lowest) */ + u16 advmss; /* Advertised MSS */ u16 prior_ssthresh; /* ssthresh saved at recovery start */ u32 lost_out; /* Lost packets */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index d5e0fcc..0a42e93 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1112,7 +1112,8 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack, * * Search retransmitted skbs from write_queue that were sent when snd_nxt was * less than what is now known to be received by the other end (derived from - * SACK blocks by the caller). + * SACK blocks by the caller). Also calculate the lowest snd_nxt among the + * remaining retransmitted skbs to avoid some costly processing per ACKs. */ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) { @@ -1120,6 +1121,7 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) struct sk_buff *skb; int flag = 0; int cnt = 0; + u32 new_low_seq = 0; tcp_for_write_queue(skb, sk) { u32 ack_seq = TCP_SKB_CB(skb)-ack_seq; @@ -1151,9 +1153,15 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT); } } else { + if (!new_low_seq || before(ack_seq, new_low_seq)) + new_low_seq = ack_seq; cnt += tcp_skb_pcount(skb); } } + + if (tp-retrans_out) + tp-lost_retrans_low = new_low_seq; + return flag; } @@ -1481,8 +1489,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ } } - if (tp-retrans_out highest_sack_end_seq - after(highest_sack_end_seq, tp-high_seq) + if (tp-retrans_out + after(highest_sack_end_seq, tp-lost_retrans_low) icsk-icsk_ca_state == TCP_CA_Recovery) flag |= tcp_mark_lost_retrans(sk, highest_sack_end_seq); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5329675..324b420 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1914,6 +1914,8 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) printk(KERN_DEBUG retrans_out leaked.\n); } #endif + if (!tp-retrans_out) + tp-lost_retrans_low = tp-snd_nxt; TCP_SKB_CB(skb)-sacked |= TCPCB_RETRANS; tp-retrans_out += tcp_skb_pcount(skb); -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6 0/7] TCP: small changes/fixes + lost_retrans brokeness fix
Some small and simple changes. Minor testing done, until I got one LostRetrans counted, which turned out to be hard task than I thought (without specifically arranged scenario, just some basic netem delay drops). Lost_retrans handling fix is now tested, I fixed one incorrectly placed if from the lost_retrans_low patch after reading it through. ...Recount removal patch could be more elegant in the way it plays around with the tcp_clear_retrans, but I just wasn't able to figure out how to do that nicely enough... Those cleanups are preparation for DSACK fix, which is yet to be done but IMHO they won't hurt if applied alone either. Dave, please apply at will. FYI, I'll post also the SACK rewrite patchset rebased after this (built on top of these). -- i. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] [TCP]: Add bytes_acked (ABC) clearing to FRTO too
I was reading tcp_enter_loss while looking for Cedric's bug and noticed bytes_acked adjustment is missing from FRTO side. Since bytes_acked will only be used in tcp_cong_avoid, I think it's safe to assume RTO would be spurious. During FRTO cwnd will be not controlled by tcp_cong_avoid and if FRTO calls for conventional recovery, cwnd is adjusted and the result of wrong assumption is cleared from bytes_acked. If RTO was in fact spurious, we did normal ABC already and can continue without any additional adjustments. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 101054c..e40857e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1687,6 +1687,7 @@ static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag) tp-snd_cwnd_cnt = 0; tp-snd_cwnd_stamp = tcp_time_stamp; tp-frto_counter = 0; + tp-bytes_acked = 0; tp-reordering = min_t(unsigned int, tp-reordering, sysctl_tcp_reordering); @@ -2824,6 +2825,7 @@ static void tcp_conservative_spur_to_response(struct tcp_sock *tp) { tp-snd_cwnd = min(tp-snd_cwnd, tp-snd_ssthresh); tp-snd_cwnd_cnt = 0; + tp-bytes_acked = 0; TCP_ECN_queue_cwr(tp); tcp_moderate_cwnd(tp); } -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/7] [TCP]: Fix mark_head_lost to ignore R-bit when trying to mark L
This condition (plain R) can arise at least in recovery that is triggered after tcp_undo_loss. There isn't any reason why they should not be marked as lost, not marking makes in_flight estimator to return too large values. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e40857e..c827285 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1987,7 +1987,7 @@ static void tcp_mark_head_lost(struct sock *sk, cnt += tcp_skb_pcount(skb); if (cnt packets || after(TCP_SKB_CB(skb)-end_seq, high_seq)) break; - if (!(TCP_SKB_CB(skb)-sackedTCPCB_TAGBITS)) { + if (!(TCP_SKB_CB(skb)-sacked (TCPCB_SACKED_ACKED|TCPCB_LOST))) { TCP_SKB_CB(skb)-sacked |= TCPCB_LOST; tp-lost_out += tcp_skb_pcount(skb); tcp_verify_retransmit_hint(tp, skb); -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/7] [TCP]: Extract tcp_match_queue_to_sack from sacktag code
This is necessary for upcoming DSACK bugfix. Reduces sacktag length which is not very sad thing at all... :-) Notice that there's a need to handle out-of-mem at caller's place. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 54 - 1 files changed, 35 insertions(+), 19 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5004704..bd18c25 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1181,6 +1181,38 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, return dup_sack; } +/* Check if skb is fully within the SACK block. In presence of GSO skbs, + * the incoming SACK may not exactly match but we can find smaller MSS + * aligned portion of it that matches. Therefore we might need to fragment + * which may fail and creates some hassle (caller must handle error case + * returns). + */ +int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, + u32 start_seq, u32 end_seq) +{ + int in_sack, err; + unsigned int pkt_len; + + in_sack = !after(start_seq, TCP_SKB_CB(skb)-seq) + !before(end_seq, TCP_SKB_CB(skb)-end_seq); + + if (tcp_skb_pcount(skb) 1 !in_sack + after(TCP_SKB_CB(skb)-end_seq, start_seq)) { + + in_sack = !after(start_seq, TCP_SKB_CB(skb)-seq); + + if (!in_sack) + pkt_len = start_seq - TCP_SKB_CB(skb)-seq; + else + pkt_len = end_seq - TCP_SKB_CB(skb)-seq; + err = tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)-gso_size); + if (err 0) + return err; + } + + return in_sack; +} + static int tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_una) { @@ -1333,25 +1365,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ if (!before(TCP_SKB_CB(skb)-seq, end_seq)) break; - in_sack = !after(start_seq, TCP_SKB_CB(skb)-seq) - !before(end_seq, TCP_SKB_CB(skb)-end_seq); - - if (tcp_skb_pcount(skb) 1 !in_sack - after(TCP_SKB_CB(skb)-end_seq, start_seq)) { - unsigned int pkt_len; - - in_sack = !after(start_seq, -TCP_SKB_CB(skb)-seq); - - if (!in_sack) - pkt_len = (start_seq - - TCP_SKB_CB(skb)-seq); - else - pkt_len = (end_seq - - TCP_SKB_CB(skb)-seq); - if (tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)-gso_size)) - break; - } + in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq); + if (in_sack 0) + break; fack_count += tcp_skb_pcount(skb); -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/7] [TCP]: Kill almost unused variable pcount from sacktag
It's on the way for future cutting of that function. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c |9 +++-- 1 files changed, 3 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c827285..5004704 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1314,7 +1314,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ flag |= FLAG_DATA_LOST; tcp_for_write_queue_from(skb, sk) { - int in_sack, pcount; + int in_sack; u8 sacked; if (skb == tcp_send_head(sk)) @@ -1336,9 +1336,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ in_sack = !after(start_seq, TCP_SKB_CB(skb)-seq) !before(end_seq, TCP_SKB_CB(skb)-end_seq); - pcount = tcp_skb_pcount(skb); - - if (pcount 1 !in_sack + if (tcp_skb_pcount(skb) 1 !in_sack after(TCP_SKB_CB(skb)-end_seq, start_seq)) { unsigned int pkt_len; @@ -1353,10 +1351,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ TCP_SKB_CB(skb)-seq); if (tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)-gso_size)) break; - pcount = tcp_skb_pcount(skb); } - fack_count += pcount; + fack_count += tcp_skb_pcount(skb); sacked = TCP_SKB_CB(skb)-sacked; -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/7] [TCP]: No need to re-count fackets_out/sacked_out at RTO
Both sacked_out and fackets_out are directly known from how parameter. Since fackets_out is accurate, there's no need for recounting (sacked_out was previously unnecessarily counted in the loop anyway). Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 26 -- 1 files changed, 16 insertions(+), 10 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bd18c25..a066039 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1711,18 +1711,23 @@ static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag) tcp_clear_retrans_hints_partial(tp); } -void tcp_clear_retrans(struct tcp_sock *tp) +static void tcp_clear_retrans_partial(struct tcp_sock *tp) { tp-retrans_out = 0; - - tp-fackets_out = 0; - tp-sacked_out = 0; tp-lost_out = 0; tp-undo_marker = 0; tp-undo_retrans = 0; } +void tcp_clear_retrans(struct tcp_sock *tp) +{ + tcp_clear_retrans_partial(tp); + + tp-fackets_out = 0; + tp-sacked_out = 0; +} + /* Enter Loss state. If how is not zero, forget all SACK information * and reset tags completely, otherwise preserve SACKs. If receiver * dropped its ofo queue, we will know this due to reneging detection. @@ -1732,7 +1737,6 @@ void tcp_enter_loss(struct sock *sk, int how) const struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; - int cnt = 0; /* Reduce ssthresh if it has not yet been made inside this window. */ if (icsk-icsk_ca_state = TCP_CA_Disorder || tp-snd_una == tp-high_seq || @@ -1746,7 +1750,10 @@ void tcp_enter_loss(struct sock *sk, int how) tp-snd_cwnd_stamp = tcp_time_stamp; tp-bytes_acked = 0; - tcp_clear_retrans(tp); + tcp_clear_retrans_partial(tp); + + if (tcp_is_reno(tp)) + tcp_reset_reno_sack(tp); if (!how) { /* Push undo marker, if it was plain RTO and nothing @@ -1754,13 +1761,15 @@ void tcp_enter_loss(struct sock *sk, int how) tp-undo_marker = tp-snd_una; tcp_clear_retrans_hints_partial(tp); } else { + tp-sacked_out = 0; + tp-fackets_out = 0; tcp_clear_all_retrans_hints(tp); } tcp_for_write_queue(skb, sk) { if (skb == tcp_send_head(sk)) break; - cnt += tcp_skb_pcount(skb); + if (TCP_SKB_CB(skb)-sackedTCPCB_RETRANS) tp-undo_marker = 0; TCP_SKB_CB(skb)-sacked = (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED; @@ -1768,9 +1777,6 @@ void tcp_enter_loss(struct sock *sk, int how) TCP_SKB_CB(skb)-sacked = ~TCPCB_SACKED_ACKED; TCP_SKB_CB(skb)-sacked |= TCPCB_LOST; tp-lost_out += tcp_skb_pcount(skb); - } else { - tp-sacked_out += tcp_skb_pcount(skb); - tp-fackets_out = cnt; } } tcp_verify_left_out(tp); -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH net-2.6] TCP: SACKtag rewrite
Dave ( others?), Here's the rebased SACKtag rewrite series rebased for you, applies on top of the seven patch series I sent earlier today (or on top of net-2.6 as soon as Dave picks them up and pushes out). Some trivial helper patches first (two first ones from tcp-2.6), the most interesting bits are in the fifth patch. I'll recode the included DSACK handling, it will be much cleaner then so please ignore that mess for now... This time only compile tested but should be still runnable because I tested an earlier version and diff against that looked sane. -- i. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] [TCP]: Convert highest_sack to sk_buff to allow direct access
It is going to replace the sack fastpath hint quite soon... :-) Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/tcp.h |6 -- include/net/tcp.h | 13 + net/ipv4/tcp_input.c | 11 ++- net/ipv4/tcp_output.c | 19 ++- 4 files changed, 33 insertions(+), 16 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index c5b94c1..0ec6bb6 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -332,8 +332,10 @@ struct tcp_sock { struct tcp_sack_block_wire recv_sack_cache[4]; - u32 highest_sack; /* Start seq of globally highest revd SACK -* (validity guaranteed only if sacked_out 0) */ + struct sk_buff *highest_sack; /* highest skb with SACK received +* (validity guaranteed only if +* sacked_out 0) +*/ /* from STCP, retrans queue hinting */ struct sk_buff* lost_skb_hint; diff --git a/include/net/tcp.h b/include/net/tcp.h index 92049e6..aead90a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1306,6 +1306,19 @@ static inline int tcp_write_queue_empty(struct sock *sk) return skb_queue_empty(sk-sk_write_queue); } +/* Start sequence of the highest skb with SACKed bit, valid only if + * sacked 0 or when the caller has ensured validity by itself. + */ +static inline u32 tcp_highest_sack_seq(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (WARN_ON(!tp-sacked_out + tp-highest_sack != tcp_write_queue_head(sk))) + return tp-snd_una; + return TCP_SKB_CB(tp-highest_sack)-seq; +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9d3d390..39d6a6a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1238,10 +1238,11 @@ int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, return in_sack; } -static void tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp, +static void tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, struct tcp_sacktag_state *state, int in_sack, int dup_sack, int fack_count, u32 end_seq) { + struct tcp_sock *tp = tcp_sk(sk); u8 sacked = TCP_SKB_CB(skb)-sacked; /* Account D-SACK for retransmitted packet. */ @@ -1321,8 +1322,8 @@ static void tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp, if (fack_count tp-fackets_out) tp-fackets_out = fack_count; - if (after(TCP_SKB_CB(skb)-seq, tp-highest_sack)) { - tp-highest_sack = TCP_SKB_CB(skb)-seq; + if (after(TCP_SKB_CB(skb)-seq, tcp_highest_sack_seq(sk))) { + tp-highest_sack = skb; state-highest_sack_end_seq = TCP_SKB_CB(skb)-end_seq; } } else { @@ -1363,7 +1364,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ if (!tp-sacked_out) { if (WARN_ON(tp-fackets_out)) tp-fackets_out = 0; - tp-highest_sack = tp-snd_una; + tp-highest_sack = tcp_write_queue_head(sk); } found_dup_sack = tcp_check_dsack(tp, ack_skb, sp, num_sacks, prior_snd_una); @@ -1497,7 +1498,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ fack_count += tcp_skb_pcount(skb); - tcp_sacktag_one(skb, tp, state, in_sack, + tcp_sacktag_one(skb, sk, state, in_sack, dup_sack, fack_count, end_seq); } } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 324b420..9603de8 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -657,13 +657,15 @@ static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned * tweak SACK fastpath hint too as it would overwrite all changes unless * hint is also changed. */ -static void tcp_adjust_fackets_out(struct tcp_sock *tp, struct sk_buff *skb, +static void tcp_adjust_fackets_out(struct sock *sk, struct sk_buff *skb, int decr) { + struct tcp_sock *tp = tcp_sk(sk); + if (!tp-sacked_out || tcp_is_reno(tp)) return; - if (!before(tp-highest_sack, TCP_SKB_CB(skb)-seq)) + if (!before(tcp_highest_sack_seq(sk), TCP_SKB_CB(skb)-seq)) tp-fackets_out -= decr; /* cnt_hint is off-by-one compared with fackets_out (see sacktag) */ @@ -712,9 +714,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss TCP_SKB_CB(buff)-end_seq = TCP_SKB_CB(skb)-end_seq;
[PATCH 1/6] [TCP]: Create tcp_sacktag_state.
From: David S. Miller [EMAIL PROTECTED] It is difficult to break out the inner-logic of tcp_sacktag_write_queue() into worker functions because so many local variables get updated in-place. Start to overcome this by creating a structure block of state variables that can be passed around into worker routines. [I made minor tweaks due to rebase/reordering of stuff in tcp-2.6 tree, and dropped found_dup_sack dup_sack from state -ij] Signed-off-by: David S. Miller [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 82 +++--- 1 files changed, 44 insertions(+), 38 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0a42e93..1f6cbce 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1165,6 +1165,14 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) return flag; } +struct tcp_sacktag_state { + unsigned int flag; + int reord; + int prior_fackets; + u32 highest_sack_end_seq; + int first_sack_index; +}; + static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, struct tcp_sack_block_wire *sp, int num_sacks, u32 prior_snd_una) @@ -1240,26 +1248,23 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ struct tcp_sack_block_wire *sp = (struct tcp_sack_block_wire *)(ptr+2); struct sk_buff *cached_skb; int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)3; - int reord = tp-packets_out; - int prior_fackets; - u32 highest_sack_end_seq = 0; - int flag = 0; - int found_dup_sack = 0; + struct tcp_sacktag_state state; + int found_dup_sack; int cached_fack_count; int i; - int first_sack_index; + int force_one_sack; + + state.flag = 0; if (!tp-sacked_out) { if (WARN_ON(tp-fackets_out)) tp-fackets_out = 0; tp-highest_sack = tp-snd_una; } - prior_fackets = tp-fackets_out; - found_dup_sack = tcp_check_dsack(tp, ack_skb, sp, -num_sacks, prior_snd_una); + found_dup_sack = tcp_check_dsack(tp, ack_skb, sp, num_sacks, prior_snd_una); if (found_dup_sack) - flag |= FLAG_DSACKING_ACK; + state.flag |= FLAG_DSACKING_ACK; /* Eliminate too old ACKs, but take into * account more or less fresh ones, they can @@ -1272,18 +1277,18 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ * if the only SACK change is the increase of the end_seq of * the first block then only apply that SACK block * and use retrans queue hinting otherwise slowpath */ - flag = 1; + force_one_sack = 1; for (i = 0; i num_sacks; i++) { __be32 start_seq = sp[i].start_seq; __be32 end_seq = sp[i].end_seq; if (i == 0) { if (tp-recv_sack_cache[i].start_seq != start_seq) - flag = 0; + force_one_sack = 0; } else { if ((tp-recv_sack_cache[i].start_seq != start_seq) || (tp-recv_sack_cache[i].end_seq != end_seq)) - flag = 0; + force_one_sack = 0; } tp-recv_sack_cache[i].start_seq = start_seq; tp-recv_sack_cache[i].end_seq = end_seq; @@ -1294,8 +1299,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ tp-recv_sack_cache[i].end_seq = 0; } - first_sack_index = 0; - if (flag) + state.first_sack_index = 0; + if (force_one_sack) num_sacks = 1; else { int j; @@ -1313,17 +1318,14 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ sp[j+1] = tmp; /* Track where the first SACK block goes to */ - if (j == first_sack_index) - first_sack_index = j+1; + if (j == state.first_sack_index) + state.first_sack_index = j+1; } } } } - /* clear flag as used for different purpose in following code */ - flag = 0; - /* Use SACK fastpath hint if valid */ cached_skb = tp-fastpath_skb_hint; cached_fack_count = tp-fastpath_cnt_hint; @@ -1332,12 +1334,16 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
[PATCH 2/6] [TCP]: Create tcp_sacktag_one().
From: David S. Miller [EMAIL PROTECTED] Worker function that implements the main logic of the inner-most loop of tcp_sacktag_write_queue(). Signed-off-by: David S. Miller [EMAIL PROTECTED] Acked-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 205 ++ 1 files changed, 106 insertions(+), 99 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 1f6cbce..9d3d390 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1238,6 +1238,110 @@ int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, return in_sack; } +static void tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp, + struct tcp_sacktag_state *state, int in_sack, + int dup_sack, int fack_count, u32 end_seq) +{ + u8 sacked = TCP_SKB_CB(skb)-sacked; + + /* Account D-SACK for retransmitted packet. */ + if ((dup_sack in_sack) + (sacked TCPCB_RETRANS) + after(TCP_SKB_CB(skb)-end_seq, tp-undo_marker)) + tp-undo_retrans--; + + /* The frame is ACKed. */ + if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una)) { + if (sacked TCPCB_RETRANS) { + if ((dup_sack in_sack) + (sacked TCPCB_SACKED_ACKED)) + state-reord = min(fack_count, state-reord); + } else { + /* If it was in a hole, we detected reordering. */ + if (fack_count state-prior_fackets + !(sacked TCPCB_SACKED_ACKED)) + state-reord = min(fack_count, state-reord); + } + + /* Nothing to do; acked frame is about to be dropped. */ + return; + } + + if (!in_sack) + return; + + if (!(sacked TCPCB_SACKED_ACKED)) { + if (sacked TCPCB_SACKED_RETRANS) { + /* If the segment is not tagged as lost, +* we do not clear RETRANS, believing +* that retransmission is still in flight. +*/ + if (sacked TCPCB_LOST) { + TCP_SKB_CB(skb)-sacked = + ~(TCPCB_LOST|TCPCB_SACKED_RETRANS); + tp-lost_out -= tcp_skb_pcount(skb); + tp-retrans_out -= tcp_skb_pcount(skb); + + /* clear lost hint */ + tp-retransmit_skb_hint = NULL; + } + } else { + /* New sack for not retransmitted frame, +* which was in hole. It is reordering. +*/ + if (!(sacked TCPCB_RETRANS) + fack_count state-prior_fackets) + state-reord = min(fack_count, state-reord); + + if (sacked TCPCB_LOST) { + TCP_SKB_CB(skb)-sacked = ~TCPCB_LOST; + tp-lost_out -= tcp_skb_pcount(skb); + + /* clear lost hint */ + tp-retransmit_skb_hint = NULL; + } + /* SACK enhanced F-RTO detection. +* Set flag if and only if non-rexmitted +* segments below frto_highmark are +* SACKed (RFC4138; Appendix B). +* Clearing correct due to in-order walk +*/ + if (after(end_seq, tp-frto_highmark)) { + state-flag = ~FLAG_ONLY_ORIG_SACKED; + } else { + if (!(sacked TCPCB_RETRANS)) + state-flag |= FLAG_ONLY_ORIG_SACKED; + } + } + + TCP_SKB_CB(skb)-sacked |= TCPCB_SACKED_ACKED; + state-flag |= FLAG_DATA_SACKED; + tp-sacked_out += tcp_skb_pcount(skb); + + if (fack_count tp-fackets_out) + tp-fackets_out = fack_count; + + if (after(TCP_SKB_CB(skb)-seq, tp-highest_sack)) { + tp-highest_sack = TCP_SKB_CB(skb)-seq; + state-highest_sack_end_seq = TCP_SKB_CB(skb)-end_seq; + } + } else { + if (dup_sack (sackedTCPCB_RETRANS)) + state-reord = min(fack_count, state-reord); + } + + /* D-SACK. We can detect redundant retransmission +* in S|R and plain R frames and clear it. +* undo_retrans is decreased above, L|R frames +* are accounted above as well. +*/ + if (dup_sack
[PATCH 4/6] [TCP]: Earlier SACK block verification simplify access to them
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/tcp.h |2 +- net/ipv4/tcp_input.c | 75 ++--- 2 files changed, 47 insertions(+), 30 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 0ec6bb6..3e412f2 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -330,7 +330,7 @@ struct tcp_sock { struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */ struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/ - struct tcp_sack_block_wire recv_sack_cache[4]; + struct tcp_sack_block recv_sack_cache[4]; struct sk_buff *highest_sack; /* highest skb with SACK received * (validity guaranteed only if diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 39d6a6a..c260642 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1350,9 +1350,11 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ struct tcp_sock *tp = tcp_sk(sk); unsigned char *ptr = (skb_transport_header(ack_skb) + TCP_SKB_CB(ack_skb)-sacked); - struct tcp_sack_block_wire *sp = (struct tcp_sack_block_wire *)(ptr+2); + struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2); + struct tcp_sack_block sp[4]; struct sk_buff *cached_skb; int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)3; + int used_sacks; struct tcp_sacktag_state state; int found_dup_sack; int cached_fack_count; @@ -1367,7 +1369,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ tp-highest_sack = tcp_write_queue_head(sk); } - found_dup_sack = tcp_check_dsack(tp, ack_skb, sp, num_sacks, prior_snd_una); + found_dup_sack = tcp_check_dsack(tp, ack_skb, sp_wire, num_sacks, prior_snd_una); if (found_dup_sack) state.flag |= FLAG_DSACKING_ACK; @@ -1378,14 +1380,46 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ if (before(TCP_SKB_CB(ack_skb)-ack_seq, prior_snd_una - tp-max_window)) return 0; + used_sacks = 0; + for (i = 0; i num_sacks; i++) { + int dup_sack = !i found_dup_sack; + + sp[used_sacks].start_seq = ntohl(get_unaligned(sp_wire[i].start_seq)); + sp[used_sacks].end_seq = ntohl(get_unaligned(sp_wire[i].end_seq)); + + if (!tcp_is_sackblock_valid(tp, dup_sack, + sp[used_sacks].start_seq, + sp[used_sacks].end_seq)) { + if (dup_sack) { + if (!tp-undo_marker) + NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO); + else + NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD); + } else { + /* Don't count olds caused by ACK reordering */ + if ((TCP_SKB_CB(ack_skb)-ack_seq != tp-snd_una) + !after(sp[used_sacks].end_seq, tp-snd_una)) + continue; + NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD); + } + continue; + } + + /* Ignore very old stuff early */ + if (!after(sp[used_sacks].end_seq, prior_snd_una)) + continue; + + used_sacks++; + } + /* SACK fastpath: * if the only SACK change is the increase of the end_seq of * the first block then only apply that SACK block * and use retrans queue hinting otherwise slowpath */ force_one_sack = 1; - for (i = 0; i num_sacks; i++) { - __be32 start_seq = sp[i].start_seq; - __be32 end_seq = sp[i].end_seq; + for (i = 0; i used_sacks; i++) { + u32 start_seq = sp[i].start_seq; + u32 end_seq = sp[i].end_seq; if (i == 0) { if (tp-recv_sack_cache[i].start_seq != start_seq) @@ -1406,17 +1440,16 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ state.first_sack_index = 0; if (force_one_sack) - num_sacks = 1; + used_sacks = 1; else { int j; tp-fastpath_skb_hint = NULL; /* order SACK blocks to allow in order walk of the retrans queue */ - for (i = num_sacks-1; i 0; i--) { + for (i = used_sacks-1; i 0; i--) { for (j = 0; j i; j++){ - if (after(ntohl(sp[j].start_seq), -
[RFC PATCH 5/6] [TCP]: Rewrite sack_recv_cache (WIP)
Key points of this patch are: - In case new SACK information is advance only type, no skb processing below previously discovered highest point is done - Optimize cases below highest point too since there's no need to always go up to highest point (which is very likely still present in that SACK), this is not entirely true though because I'm dropping the fastpath_skb_hint which could previously optimize those cases even better. Whether that significant, I'm not too sure. Corrently it will provide skipping by walking. Combined with RB-tree, all skipping would become fast too regardless of window size (can be done incrementally later). Previously a number of cases in TCP SACK processing fails to take advantage of costly stored information in sack_recv_cache. Most importantly expected events such as cumulative ACK and new hole ACKs. Processing on such ACKs result in rather long walks building up latencies (which easily gets nasty when window is large). Those latencies are often completely unnecessary compared with the amount of _new_ information received, usually for cumulative ACK there's no new information at all, yet TCP walked whole queue unnecessary potentially taking a number of costly cache misses on the way, etc.! Since the inclusion of highest_sack, there's a lot information that is very likely redundant (SACK fastpath hint stuff, fackets_out, highest_sack), though there's no ultimate guarantee that they'll remain the same whole the time (in all unearthly scenarios). Take advantage of this too and drop fastpath hint and use direct access to highest SACKed skb as replacement. Effectively special cased fastpath is dropped. This change adds some complexity to introduce better coveraged fastpath, though the added complexity should make TCP behave more cache friendly. The current ACK's SACK blocks are compared against each cached block individially and only ranges that are new are then scanned by the high constant walk. For other parts of write queue, even when in previously known part of the SACK blocks, a faster skip function is used (if necessary at all). In addition, whenever possible, TCP fast-forwards to highest_sack skb that was made available by an earlier patch. In typical case, no other things but this fast-forward and mandatory markings after that occur making the access pattern quite similar to the former fastpath special case. DSACKs are special case that must always be walked. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/tcp.h |3 - include/net/tcp.h |1 - net/ipv4/tcp_input.c | 253 ++-- net/ipv4/tcp_output.c | 14 +--- 4 files changed, 159 insertions(+), 112 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 3e412f2..91c4430 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -343,10 +343,7 @@ struct tcp_sock { struct sk_buff *scoreboard_skb_hint; struct sk_buff *retransmit_skb_hint; struct sk_buff *forward_skb_hint; - struct sk_buff *fastpath_skb_hint; - int fastpath_cnt_hint; /* Lags behind by current skb's pcount -* compared to respective fackets_out */ int lost_cnt_hint; int retransmit_cnt_hint; diff --git a/include/net/tcp.h b/include/net/tcp.h index aead90a..cff2d86 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1078,7 +1078,6 @@ static inline void tcp_clear_retrans_hints_partial(struct tcp_sock *tp) static inline void tcp_clear_all_retrans_hints(struct tcp_sock *tp) { tcp_clear_retrans_hints_partial(tp); - tp-fastpath_skb_hint = NULL; } /* MD5 Signature */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c260642..c13f1af 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1170,10 +1170,14 @@ struct tcp_sacktag_state { int reord; int prior_fackets; u32 highest_sack_end_seq; - int first_sack_index; + int fack_count; + u32 dup_start; + u32 dup_end; }; -static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, +static int tcp_check_dsack(struct tcp_sock *tp, + struct tcp_sacktag_state *state, + struct sk_buff *ack_skb, struct tcp_sack_block_wire *sp, int num_sacks, u32 prior_snd_una) { @@ -1183,6 +1187,8 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, if (before(start_seq_0, TCP_SKB_CB(ack_skb)-ack_seq)) { dup_sack = 1; + state-dup_start = start_seq_0; + state-dup_end = end_seq_0; tcp_dsack_seen(tp); NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV); } else if (num_sacks 1) { @@ -1192,6 +1198,8 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb,
[PATCH 6/6] [TCP]: Track sacktag (DEVEL PATCH)
This is not intented to go to mainline, provided just for those who are interested enough about the algorithm internals during a test. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/snmp.h | 20 +++ net/ipv4/proc.c | 20 +++ net/ipv4/tcp_input.c | 52 +++-- 3 files changed, 89 insertions(+), 3 deletions(-) diff --git a/include/linux/snmp.h b/include/linux/snmp.h index 89f0c2b..42b8c07 100644 --- a/include/linux/snmp.h +++ b/include/linux/snmp.h @@ -214,6 +214,26 @@ enum LINUX_MIB_TCPDSACKIGNOREDOLD, /* TCPSACKIgnoredOld */ LINUX_MIB_TCPDSACKIGNOREDNOUNDO,/* TCPSACKIgnoredNoUndo */ LINUX_MIB_TCPSPURIOUSRTOS, /* TCPSpuriousRTOs */ + LINUX_MIB_TCP_SACKTAG, + LINUX_MIB_TCP_SACK0, + LINUX_MIB_TCP_SACK1, + LINUX_MIB_TCP_SACK2, + LINUX_MIB_TCP_SACK3, + LINUX_MIB_TCP_SACK4, + LINUX_MIB_TCP_WALKEDSKBS, + LINUX_MIB_TCP_WALKEDDSACKS, + LINUX_MIB_TCP_SKIPPEDSKBS, + LINUX_MIB_TCP_NOCACHE, + LINUX_MIB_TCP_FULLWALK, + LINUX_MIB_TCP_HEADWALK, + LINUX_MIB_TCP_TAILWALK, + LINUX_MIB_TCP_FULLSKIP, + LINUX_MIB_TCP_TAILSKIP, + LINUX_MIB_TCP_HEADSKIP, + LINUX_MIB_TCP_FULLSKIP_TOHIGH, + LINUX_MIB_TCP_TAILSKIP_TOHIGH, + LINUX_MIB_TCP_NEWSKIP, + LINUX_MIB_TCP_CACHEREMAINING, __LINUX_MIB_MAX }; diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index e5b05b0..9909178 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -246,6 +246,26 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM(TCPDSACKIgnoredOld, LINUX_MIB_TCPDSACKIGNOREDOLD), SNMP_MIB_ITEM(TCPDSACKIgnoredNoUndo, LINUX_MIB_TCPDSACKIGNOREDNOUNDO), SNMP_MIB_ITEM(TCPSpuriousRTOs, LINUX_MIB_TCPSPURIOUSRTOS), + SNMP_MIB_ITEM(TCP_SACKTAG, LINUX_MIB_TCP_SACKTAG), + SNMP_MIB_ITEM(TCP_SACK0, LINUX_MIB_TCP_SACK0), + SNMP_MIB_ITEM(TCP_SACK1, LINUX_MIB_TCP_SACK1), + SNMP_MIB_ITEM(TCP_SACK2, LINUX_MIB_TCP_SACK2), + SNMP_MIB_ITEM(TCP_SACK3, LINUX_MIB_TCP_SACK3), + SNMP_MIB_ITEM(TCP_SACK4, LINUX_MIB_TCP_SACK4), + SNMP_MIB_ITEM(TCP_WALKEDSKBS, LINUX_MIB_TCP_WALKEDSKBS), + SNMP_MIB_ITEM(TCP_WALKEDDSACKS, LINUX_MIB_TCP_WALKEDDSACKS), + SNMP_MIB_ITEM(TCP_SKIPPEDSKBS, LINUX_MIB_TCP_SKIPPEDSKBS), + SNMP_MIB_ITEM(TCP_NOCACHE, LINUX_MIB_TCP_NOCACHE), + SNMP_MIB_ITEM(TCP_FULLWALK, LINUX_MIB_TCP_FULLWALK), + SNMP_MIB_ITEM(TCP_HEADWALK, LINUX_MIB_TCP_HEADWALK), + SNMP_MIB_ITEM(TCP_TAILWALK, LINUX_MIB_TCP_TAILWALK), + SNMP_MIB_ITEM(TCP_FULLSKIP, LINUX_MIB_TCP_FULLSKIP), + SNMP_MIB_ITEM(TCP_TAILSKIP, LINUX_MIB_TCP_TAILSKIP), + SNMP_MIB_ITEM(TCP_HEADSKIP, LINUX_MIB_TCP_HEADSKIP), + SNMP_MIB_ITEM(TCP_FULLSKIP_TOHIGH, LINUX_MIB_TCP_FULLSKIP_TOHIGH), + SNMP_MIB_ITEM(TCP_TAILSKIP_TOHIGH, LINUX_MIB_TCP_TAILSKIP_TOHIGH), + SNMP_MIB_ITEM(TCP_NEWSKIP, LINUX_MIB_TCP_NEWSKIP), + SNMP_MIB_ITEM(TCP_CACHEREMAINING, LINUX_MIB_TCP_CACHEREMAINING), SNMP_MIB_SENTINEL }; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c13f1af..02b34a5 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1382,6 +1382,10 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, tcp_sacktag_one(skb, sk, state, in_sack, dup_sack, state-fack_count, end_seq); + + NET_INC_STATS_BH(LINUX_MIB_TCP_WALKEDSKBS); + if (dup_sack) + NET_INC_STATS_BH(LINUX_MIB_TCP_WALKEDDSACKS); } return skb; } @@ -1405,6 +1409,7 @@ static struct sk_buff *tcp_sacktag_skip(struct sk_buff *skb, struct sock *sk, skb = tcp_sacktag_walk(skb, sk, state, state-dup_start, state-dup_end, 1); } + NET_INC_STATS_BH(LINUX_MIB_TCP_SKIPPEDSKBS); } return skb; } @@ -1535,9 +1540,22 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ cache++; } + NET_INC_STATS_BH(LINUX_MIB_TCP_SACKTAG); + switch (used_sacks) { + case 0: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK0); break; + case 1: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK1); break; + case 2: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK2); break; + case 3: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK3); break; + case 4: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK4); break; + } + + if (!tcp_sack_cache_ok(tp, cache)) + NET_INC_STATS_BH(LINUX_MIB_TCP_NOCACHE); + while (i used_sacks) { u32 start_seq = sp[i].start_seq; u32 end_seq = sp[i].end_seq; + int fullwalk = 0; /* Event B in the comment above. */
[PATCH respin] ucc_geth: fix module removal
- uccf should be set to NULL to not double-free memory on subsequent calls; - ind_hash_q and group_hash_q lists should be initialized in the probe() function, instead of struct_init() (called by open()), otherwise there will be an oops if ucc_geth_driver removed prior 'ifconfig ethX up'; - add unregister_netdev(); - reorder geth_remove() steps. Signed-off-by: Anton Vorontsov [EMAIL PROTECTED] --- drivers/net/ucc_geth.c | 17 ++--- 1 files changed, 10 insertions(+), 7 deletions(-) diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c index 7dedc96..18a6f48 100644 --- a/drivers/net/ucc_geth.c +++ b/drivers/net/ucc_geth.c @@ -2080,8 +2080,10 @@ static void ucc_geth_memclean(struct ucc_geth_private *ugeth) if (!ugeth) return; - if (ugeth-uccf) + if (ugeth-uccf) { ucc_fast_free(ugeth-uccf); + ugeth-uccf = NULL; + } if (ugeth-p_thread_data_tx) { qe_muram_free(ugeth-thread_dat_tx_offset); @@ -2312,10 +2314,6 @@ static int ucc_struct_init(struct ucc_geth_private *ugeth) ug_info = ugeth-ug_info; uf_info = ug_info-uf_info; - /* Create CQs for hash tables */ - INIT_LIST_HEAD(ugeth-group_hash_q); - INIT_LIST_HEAD(ugeth-ind_hash_q); - if (!((uf_info-bd_mem_part == MEM_PART_SYSTEM) || (uf_info-bd_mem_part == MEM_PART_MURAM))) { if (netif_msg_probe(ugeth)) @@ -3949,6 +3947,10 @@ static int ucc_geth_probe(struct of_device* ofdev, const struct of_device_id *ma ugeth = netdev_priv(dev); spin_lock_init(ugeth-lock); + /* Create CQs for hash tables */ + INIT_LIST_HEAD(ugeth-group_hash_q); + INIT_LIST_HEAD(ugeth-ind_hash_q); + dev_set_drvdata(device, dev); /* Set the dev-base_addr to the gfar reg region */ @@ -4002,9 +4004,10 @@ static int ucc_geth_remove(struct of_device* ofdev) struct net_device *dev = dev_get_drvdata(device); struct ucc_geth_private *ugeth = netdev_priv(dev); - dev_set_drvdata(device, NULL); - ucc_geth_memclean(ugeth); + unregister_netdev(dev); free_netdev(dev); + ucc_geth_memclean(ugeth); + dev_set_drvdata(device, NULL); return 0; } -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ucc_geth: add support for netpoll
This patch adds netpoll support for the QE UCC Gigabit Ethernet driver. The approach is very similar to the gianfar driver. Tested using netconsole. Signed-off-by: Anton Vorontsov [EMAIL PROTECTED] --- drivers/net/ucc_geth.c | 19 +++ 1 files changed, 19 insertions(+), 0 deletions(-) diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c index 18a6f48..06807ce 100644 --- a/drivers/net/ucc_geth.c +++ b/drivers/net/ucc_geth.c @@ -3691,6 +3691,22 @@ static irqreturn_t ucc_geth_irq_handler(int irq, void *info) return IRQ_HANDLED; } +#ifdef CONFIG_NET_POLL_CONTROLLER +/* + * Polling 'interrupt' - used by things like netconsole to send skbs + * without having to re-enable interrupts. It's not called while + * the interrupt routine is executing. + */ +static void ucc_netpoll(struct net_device *dev) +{ + struct ucc_geth_private *ugeth = netdev_priv(dev); + + disable_irq(ugeth-ug_info-uf_info.irq); + ucc_geth_irq_handler(ugeth-ug_info-uf_info.irq, dev); + enable_irq(ugeth-ug_info-uf_info.irq); +} +#endif /* CONFIG_NET_POLL_CONTROLLER */ + /* Called when something needs to use the ethernet device */ /* Returns 0 for success. */ static int ucc_geth_open(struct net_device *dev) @@ -3969,6 +3985,9 @@ static int ucc_geth_probe(struct of_device* ofdev, const struct of_device_id *ma dev-poll = ucc_geth_poll; dev-weight = UCC_GETH_DEV_WEIGHT; #endif /* CONFIG_UGETH_NAPI */ +#ifdef CONFIG_NET_POLL_CONTROLLER + dev-poll_controller = ucc_netpoll; +#endif dev-stop = ucc_geth_close; dev-get_stats = ucc_geth_get_stats; //dev-change_mtu = ucc_geth_change_mtu; -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH respin] phy: implement release function
Lately I've got this nice badness on mdio bus removal: Device 'e0103120:06' does not have a release() function, it is broken and must be fixed. [ cut here ] Badness at drivers/base/core.c:107 NIP: c015c1a8 LR: c015c1a8 CTR: c0157488 REGS: c34bdcf0 TRAP: 0700 Not tainted (2.6.23-rc5-g9ebadfbb-dirty) MSR: 00029032 EE,ME,IR,DR CR: 24088422 XER: ... [c34bdda0] [c015c1a8] device_release+0x78/0x80 (unreliable) [c34bddb0] [c01354cc] kobject_cleanup+0x80/0xbc [c34bddd0] [c01365f0] kref_put+0x54/0x6c [c34bdde0] [c013543c] kobject_put+0x24/0x34 [c34bddf0] [c015c384] put_device+0x1c/0x2c [c34bde00] [c0180e84] mdiobus_unregister+0x2c/0x58 ... Though actually there is nothing broken, it just device subsystem core expects another pattern of resource managment. This patch implement phy device's release function, thus we're getting rid of this badness. Also small hidden bug fixed, hope none other introduced. ;-) Signed-off-by: Anton Vorontsov [EMAIL PROTECTED] Acked-by: Andy Fleming [EMAIL PROTECTED] --- drivers/net/phy/mdio_bus.c |9 + drivers/net/phy/phy_device.c | 13 + include/linux/phy.h |1 + 3 files changed, 19 insertions(+), 4 deletions(-) diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c index fc2f0e6..c30196d 100644 --- a/drivers/net/phy/mdio_bus.c +++ b/drivers/net/phy/mdio_bus.c @@ -91,9 +91,12 @@ int mdiobus_register(struct mii_bus *bus) err = device_register(phydev-dev); - if (err) + if (err) { printk(KERN_ERR phy %d failed to register\n, i); + phy_device_free(phydev); + phydev = NULL; + } } bus-phy_map[i] = phydev; @@ -110,10 +113,8 @@ void mdiobus_unregister(struct mii_bus *bus) int i; for (i = 0; i PHY_MAX_ADDR; i++) { - if (bus-phy_map[i]) { + if (bus-phy_map[i]) device_unregister(bus-phy_map[i]-dev); - kfree(bus-phy_map[i]); - } } } EXPORT_SYMBOL(mdiobus_unregister); diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 49328e0..8e25bf7 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -44,6 +44,17 @@ static struct phy_driver genphy_driver; extern int mdio_bus_init(void); extern void mdio_bus_exit(void); +void phy_device_free(struct phy_device *phydev) +{ + kfree(phydev); +} +EXPORT_SYMBOL(phy_device_free); + +static void phy_device_release(struct device *dev) +{ + phy_device_free(to_phy_device(dev)); +} + struct phy_device* phy_device_create(struct mii_bus *bus, int addr, int phy_id) { struct phy_device *dev; @@ -54,6 +65,8 @@ struct phy_device* phy_device_create(struct mii_bus *bus, int addr, int phy_id) if (NULL == dev) return (struct phy_device*) PTR_ERR((void*)-ENOMEM); + dev-dev.release = phy_device_release; + dev-speed = 0; dev-duplex = -1; dev-pause = dev-asym_pause = 0; diff --git a/include/linux/phy.h b/include/linux/phy.h index 2a65978..9ec1363 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -398,6 +398,7 @@ int phy_mii_ioctl(struct phy_device *phydev, int phy_start_interrupts(struct phy_device *phydev); void phy_print_status(struct phy_device *phydev); struct phy_device* phy_device_create(struct mii_bus *bus, int addr, int phy_id); +void phy_device_free(struct phy_device *phydev); extern struct bus_type mdio_bus_type; #endif /* __PHY_H */ -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 4/4] UDP memory usage accounting (take 4): memory limitation
Hi Stephen, Thank you for your comment. { +.ctl_name = NET_UDP_MEM, +.procname = udp_mem, +.data = sysctl_udp_mem, +.maxlen = sizeof(sysctl_udp_mem), +.mode = 0644, +.proc_handler = proc_dointvec +}, +{ .ctl_name = NET_TCP_APP_WIN, .procname = tcp_app_win, .data = sysctl_tcp_app_win, if you use proc_dointvec_minmax, then you could inforce min/max values for udp_mem for the sysctl udp_mem has two meanings: * turn off this limitation function (currently udp_mem=4096) * limit udp memory (currently udp_mem4096) To realize this, udp_mem is evaluated whether udp_mem equals 4096 or smaller in UDP and IP layers. If udp_mem has proc_dointvec_minmax or dedicated proc handler, turn off check must be done in UDP and IP layers. This means there is no reduction of the check in UDP and IP layers. If you pointed out that minus value of udp_mem is strange, I agree. I'll fix it. How about this? min=4096 (and turn off limitation) udp_mem4096 (and turn on limitation) Satoshi Oshima - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Regression in net-2.6.24?
From: Ilpo Järvinen [EMAIL PROTECTED] Subject: Re: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems Date: Thu, 11 Oct 2007 13:12:49 +0300 (EEST) But, I got a kernel panic at net_rx_action() in our experimental setting. I am using the net-2.6.24 kernel _without_ this patch. (I will post a bug report separately). ...Please do. :-) I have got a kernel panic, when I run the Iperf benchmark in the following setting. Each node has x86_64 dual processors and 2 tg3 NICs (BCM95704A6). If the router does not regulate the bandwidth, a kernel panic does not occur. Node A Router --- Delay --- Node B (Policing rate: emulator 500Mbps)(RTT: 20ms) Ggit-bisect told me that the following commit causes a regression: commit bea3348eef27e6044b6161fd04c3152215f96411 Author: Stephen Hemminger [EMAIL PROTECTED] Date: Wed Oct 3 16:41:36 2007 -0700 [NET]: Make NAPI polling independent of struct net_device objects. Here is Oops message: Unable to handle kernel paging request at 00100108 RIP: [80421d59] net_rx_action+0x169/0x1c0 PGD f384d067 PUD 0 Oops: 0002 [1] SMP CPU 0 Modules linked in: 8021q tcp_bic netconsole evdev joydev sg st sr_mod ohci_hcd i2c_amd756 i2c_amd8111 i2c_core ipv6 tg3 usbhid usbcore ff_memless dm_mod ext3 jbd sata_sil libata sd_mod scsi_mod Pid: 0, comm: swapper Not tainted 2.6.23-rc9 #2 RIP: 0010:[80421d59] [80421d59] net_rx_action+0x169/0x1c0 RSP: 0018:80847eb0 EFLAGS: 00010046 RAX: 00200200 RBX: 8100f63ef750 RCX: 0246 RDX: 00100100 RSI: c20002e60204 RDI: 8100f63ef6c0 RBP: 80847ef0 R08: 810178b72fd8 R09: 0001 R10: 0003 R11: R12: 0040 R13: 0040 R14: 810003f51640 R15: FS: 40331960() GS:805d7000() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: 00100108 CR3: f45b6000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process swapper (pid: 0, threadinfo 806de000, task 8058b540) Stack: 80847eb0 00ec80847eb0 0001000b17ec 0009 805e10b0 8083eee0 0009 80847f30 8023d935 0046 Call Trace: IRQ [8023d935] __do_softirq+0x75/0x100 [8020cb8c] call_softirq+0x1c/0x30 [8020dd89] do_softirq+0x49/0xb0 [8023d4e5] irq_exit+0x45/0x50 [8020e053] do_IRQ+0x83/0x100 [8020a310] default_idle+0x0/0x50 [8020bf11] ret_from_intr+0x0/0xb EOI [8020a33d] default_idle+0x2d/0x50 [8020aa72] enter_idle+0x22/0x30 [8020ab0c] cpu_idle+0x8c/0xd0 [804af6ec] rest_init+0x5c/0x70 [806e8baf] start_kernel+0x28f/0x300 [806e8140] _sinittext+0x140/0x180 Code: 48 89 42 08 48 89 10 49 8b 46 08 4c 89 33 49 89 5e 08 48 89 RIP [80421d59] net_rx_action+0x169/0x1c0 RSP 80847eb0 CR2: 00100108 Kernel panic - not syncing: Aiee, killing interrupt handler! RIP points at __list_del (net_rx_action - list_move_tail - __list_del). $ objdump -S net/core/dev.o | less snip static inline void __list_del(struct list_head * prev, struct list_head * next) { b42: 48 8b 43 08 mov0x8(%rbx),%rax b46: 48 8b 13mov(%rbx),%rdx next-prev = prev; b49: 48 89 42 08 mov%rax,0x8(%rdx) prev-next = next; b4d: 48 89 10mov%rdx,(%rax) b50: 49 8b 46 08 mov0x8(%r14),%rax b54: 4c 89 33mov%r14,(%rbx) b57: 49 89 5e 08 mov%rbx,0x8(%r14) b5b: 48 89 18mov%rbx,(%rax) What is happen and what information do you need to fix it? Thanks in advance, Ryousei Takano - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue
Roland Dreier wrote: I also ran a test for the code in the branch of 2.6.24 and found a problem. I see that ifconfig down doesn't return (for IPoIB interfaces) and it's stuck in napi_disable() in the kernel (any idea why?) For what it's worth, I took the upstream 2.6.23 git tree and merged in Dave's latest net-2.6.24 tree and my latest for-2.6.24 tree and tried that. I brought up an IPoIB interface, sent a few pings, and did ifconfig down, and it worked fine. Can you try the same thing without the bonding patches to see if your setup works OK too? Also can you give more details about what you do to get ifconfig down stuck? - R. Without bonding ifconfig down works fine. It happens only when ib interfaces are slaves of a bonding device. I thought before that the stuck is in napi_disable() but it's almost right. I put prints before and after call to napi_disable and see that it is called twice. I'll try to investigate in this direction. ib0: stopping interface ib0: before napi_disable ib0: after napi_disable ib0: downing ib_dev ib0: All sends and receives done. ib0: stopping interface ib0: before napi_disable There is also a dump of the kernel log after 'echo t /proc/sysrq-trigger' (for ifconfig) SysRq : Show State ifconfig S 0 6311 6099 810034f49d18 0086 810037e747c0 810037e747c0 00013481e000 81003a851a78 81003a851840 3b0c8c00 802358ee Call Trace: [8023cc89] lock_timer_base+0x24/0x49 [80403754] schedule_timeout+0x8a/0xad [8023d241] process_timeout+0x0/0x5 [8023d6ec] msleep_interruptible+0x11/0x39 [884081a7] :ib_ipoib:ipoib_stop+0x64/0x12c [8039fc07] dev_close+0x3e/0x56 [803a1c31] dev_change_flags+0xa7/0x15f [803e5bee] devinet_ioctl+0x293/0x5ed [803e775b] inet_ioctl+0x7f/0x9d [80395b2e] sock_ioctl+0x0/0x1fe [80395d08] sock_ioctl+0x1da/0x1fe [802947d9] do_ioctl+0x29/0x6f [80294a75] vfs_ioctl+0x256/0x267 [80294adf] sys_ioctl+0x59/0x7a [8020bc0e] system_call+0x7e/0x83 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH]race between open and disconnect in irda-usb
Hi, it seems to me that irda_usb_net_open() must set self-netopen under spinlock or disconnect() may fail to kill all URBs. Signed-off-by: Oliver Neukum [EMAIL PROTECTED] Regards Oliver --- a/drivers/net/irda/irda-usb.c 2007-10-11 17:24:35.0 +0200 +++ b/drivers/net/irda/irda-usb.c 2007-10-11 17:30:30.0 +0200 @@ -1168,6 +1168,7 @@ static int stir421x_patch_device(struct static int irda_usb_net_open(struct net_device *netdev) { struct irda_usb_cb *self; + unsigned long flags; charhwname[16]; int i; @@ -1177,13 +1178,16 @@ static int irda_usb_net_open(struct net_ self = (struct irda_usb_cb *) netdev-priv; IRDA_ASSERT(self != NULL, return -1;); + spin_lock_irqsave(self-lock, flags); /* Can only open the device if it's there */ if(!self-present) { + spin_unlock_irqrestore(self-lock, flags); IRDA_WARNING(%s(), device not present!\n, __FUNCTION__); return -1; } if(self-needspatch) { + spin_unlock_irqrestore(self-lock, flags); IRDA_WARNING(%s(), device needs patch\n, __FUNCTION__) ; return -EIO ; } @@ -1198,6 +1202,7 @@ static int irda_usb_net_open(struct net_ /* To do *before* submitting Rx urbs and starting net Tx queue * Jean II */ self-netopen = 1; + spin_unlock_irqrestore(self-lock, flags); /* * Now that everything should be initialized properly, - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e100 problems in .23rc8 ?
Herbert Xu wrote: On Wed, Oct 10, 2007 at 08:36:38PM -0400, Dave Jones wrote: The e1000 changes you reference above, is this the changeset you mean? commit 416b5d10afdc797c21c457ade3714e8f2f75edd9 Author: Auke Kok [EMAIL PROTECTED] Date: Fri Jun 1 10:22:39 2007 -0700 e1000: disable polling before registering netdevice Yep. this patch actually called napi_disable() in the probe routine which was wrong, but e100 does not do that. Nonetheless e100 doesn't call netif_carrier_off() and netif_stop_queue(), so to make e100 the same as e1000 we should probably do this, see below. Dave, can you see if this resolves the issue for you? If so then we might want to push this to -stable. Auke --- e100: disable netdevice explicitly to avoid rx irq oops Several reported OOPS messages suggest that e100 has a race that was fixed in e1000 before where incoming interrupts trigger an OOPS immediately after probe() finishes. Signed-off-by: Auke Kok [EMAIL PROTECTED] diff --git a/drivers/net/e100.c b/drivers/net/e100.c index 280313b..ded5f68 100644 --- a/drivers/net/e100.c +++ b/drivers/net/e100.c @@ -2682,6 +2682,10 @@ static int __devinit e100_probe(struct pci_dev *pdev, if (err) DPRINTK(PROBE, ERR, Error clearing wake event\n); + /* tell the stack to leave us alone until e100_open() is called */ + netif_carrier_off(netdev); + netif_stop_queue(netdev); + strcpy(netdev-name, eth%d); if((err = register_netdev(netdev))) { DPRINTK(PROBE, ERR, Cannot register net device, aborting.\n); - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rtnl: Simplify ASSERT_RTNL
Herbert Xu [EMAIL PROTECTED] writes: Well thanks to that warning we're on our way of improving the code that triggered it in such a way that this warning will soon go silent. That's precisely the reason why I object to having this warning removed. Now you have a good point that this warning doesn't trigger all the time. The fix to that is to *make* it trigger always, not removing it. I'm almost convinced but. Where people deliberately use convoluted locking is where we most need things like ASSERT_RTNL. Having ASSERT_RTNL warn if you were sleeping does not seem intuitive from the name. This instance of convoluted locking seems like a complete one off to me, and if it will warn about other constructs currently in the kernel it seems wrong. Frankly I don't feel comfortable adding the check because I can't defend the presence of might_sleep() in ASSERT_RTNL. If I can't understand a change well enough to defend it I will not take responsibility for it, and I will not add my Signed-off-by to it. The patch I wrote was trivial a trivial optimization and obviously correct. Adding the might_sleep() and the patch becomes the start of a crusade for better code that I don't believe in. So I would rather forget this patch then make that one line addition. Thanks, Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: authenc compile warnings in current net-2.6.24
Sebastian Siewior wrote: * David Miller | 2007-10-10 16:25:28 [-0700]: From: Sebastian Siewior [EMAIL PROTECTED] Date: Wed, 10 Oct 2007 21:53:37 +0200 * Oliver Hartkopp | 2007-10-10 19:53:53 [+0200]: CC [M] crypto/authenc.o crypto/authenc.c: In function ?crypto_authenc_hash?: crypto/authenc.c:88: warning: ?cryptlen? may be used uninitialized in this function crypto/authenc.c:87: warning: ?dst? may be used uninitialized in this function crypto/authenc.c: In function ?crypto_authenc_decrypt?: crypto/authenc.c:163: warning: ?cryptlen? may be used uninitialized in this function crypto/authenc.c:163: note: ?cryptlen? was declared here crypto/authenc.c:162: warning: ?src? may be used uninitialized in this function crypto/authenc.c:162: note: ?src? was declared here do you already know these warnings? Those warnings are looking like a compiler bug to me. To be honest I don't know of any compiler which commits enough flow variable analysis to support doing %100 accurate warnings in situations like this. gcc (GCC) 4.1.2 (Gentoo 4.1.2) did not produce any warnings in this case. Hi Sebasian, my gcc was the lastest Debian unstable one: gcc -v Using built-in specs. Target: i486-linux-gnu Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --enable-targets=all --disable-werror --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu Thread model: posix gcc version 4.2.1 (Debian 4.2.1-5) Regards, Oliver - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH]race between open and disconnect in irda-usb
On Thu, Oct 11, 2007 at 05:45:28PM +0200, Oliver Neukum wrote: Hi, Hi, You need to cc Samuel Ortiz and the IrDA mailing list if you want any meaningful response. it seems to me that irda_usb_net_open() must set self-netopen under spinlock or disconnect() may fail to kill all URBs. I believe to patch is correct, but in practice I fail to see how it could happen. Anyway, Samuel will have the final word. Signed-off-by: Oliver Neukum [EMAIL PROTECTED] Regards Oliver Have fun... Jean --- a/drivers/net/irda/irda-usb.c 2007-10-11 17:24:35.0 +0200 +++ b/drivers/net/irda/irda-usb.c 2007-10-11 17:30:30.0 +0200 @@ -1168,6 +1168,7 @@ static int stir421x_patch_device(struct static int irda_usb_net_open(struct net_device *netdev) { struct irda_usb_cb *self; + unsigned long flags; charhwname[16]; int i; @@ -1177,13 +1178,16 @@ static int irda_usb_net_open(struct net_ self = (struct irda_usb_cb *) netdev-priv; IRDA_ASSERT(self != NULL, return -1;); + spin_lock_irqsave(self-lock, flags); /* Can only open the device if it's there */ if(!self-present) { + spin_unlock_irqrestore(self-lock, flags); IRDA_WARNING(%s(), device not present!\n, __FUNCTION__); return -1; } if(self-needspatch) { + spin_unlock_irqrestore(self-lock, flags); IRDA_WARNING(%s(), device needs patch\n, __FUNCTION__) ; return -EIO ; } @@ -1198,6 +1202,7 @@ static int irda_usb_net_open(struct net_ /* To do *before* submitting Rx urbs and starting net Tx queue * Jean II */ self-netopen = 1; + spin_unlock_irqrestore(self-lock, flags); /* * Now that everything should be initialized properly, - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fwd: [PATCH#2 3/4] [PPC] Compile fix for 8xx CPM Ehernet driver
Hi Jeff, Kumar Gala wrote: Begin forwarded message: From: Jochen Friedrich [EMAIL PROTECTED] Date: September 24, 2007 12:15:35 PM CDT To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED], Marcelo Tosatti [EMAIL PROTECTED] Subject: [PATCH#2 3/4] [PPC] Compile fix for 8xx CPM Ehernet driver Jeff, Please pick up for 2.6.23 if you don't mind. Please send an apply-able patch... Jeff, off to bed I rebased git://git.bocc.de/dbox2.git ppc-fixes against 2.6.23. This patch is the only remaining one in this head. I'll resend. Thanks, Jochen - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [PPC] Compile fix for 8xx CPM Ehernet driver
Add #include asm/cacheflush.h for flush_dcache_range to make the driver compile again. CC arch/ppc/8xx_io/enet.o arch/ppc/8xx_io/enet.c: In function 'scc_enet_start_xmit': arch/ppc/8xx_io/enet.c:240: error: implicit declaration of function 'flush_dcache_range' make[1]: *** [arch/ppc/8xx_io/enet.o] Error 1 make: *** [arch/ppc/8xx_io] Error 2 Signed-off-by: Jochen Friedrich [EMAIL PROTECTED] --- arch/ppc/8xx_io/enet.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/ppc/8xx_io/enet.c b/arch/ppc/8xx_io/enet.c index 703d47e..eace3bc 100644 --- a/arch/ppc/8xx_io/enet.c +++ b/arch/ppc/8xx_io/enet.c @@ -44,6 +44,7 @@ #include asm/mpc8xx.h #include asm/uaccess.h #include asm/commproc.h +#include asm/cacheflush.h /* *Theory of Operation
Re: [PATCH][BNX2X] round three
[added Andy Whitcroft, who is listed as a CHECKPATCH maintainer] Stephen Hemminger wrote: Minor formatting nits reported by checkpatch.pl script: Thanks, I Will fix them. ... WARNING: no space between function name and open parenthesis '(' #777: FILE: drivers/net/bnx2x.c:722: + case (RAMROD_CMD_ID_ETH_PORT_SETUP | BNX2X_STATE_OPENING_WAIT4_PORT): WARNING: no space between function name and open parenthesis '(' #782: FILE: drivers/net/bnx2x.c:727: + case (RAMROD_CMD_ID_ETH_HALT | BNX2X_STATE_CLOSING_WAIT4_HALT): WARNING: no space between function name and open parenthesis '(' #788: FILE: drivers/net/bnx2x.c:733: + case (RAMROD_CMD_ID_ETH_PORT_DEL | BNX2X_STATE_CLOSING_WAIT4_DELETE): WARNING: no space between function name and open parenthesis '(' #793: FILE: drivers/net/bnx2x.c:738: + case (RAMROD_CMD_ID_ETH_SET_MAC | BNX2X_STATE_OPEN): These look like false positives ... CHECK: spinlock_t definition without comment #9486: FILE: drivers/net/bnx2x.h:508: + spinlock_t spq_lock; /* Used to serialize slowpath This one too ... ERROR: Macros with complex values should be enclosed in parenthesis #21196: FILE: drivers/net/bnx2x_init.h:138: +#define INIT_INTERNAL0_MEM_WR(block_bar, block, reg, \ + part, hw, value, off, len) \ Not sure, I think this one is too, but I'm going to rewrite bnx2x_init.h anyway, this code is going away. ... Thanks Eliezer - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e100 problems in .23rc8 ?
On Thu, Oct 11, 2007 at 09:10:34AM -0700, Kok, Auke wrote: Herbert Xu wrote: On Wed, Oct 10, 2007 at 08:36:38PM -0400, Dave Jones wrote: The e1000 changes you reference above, is this the changeset you mean? commit 416b5d10afdc797c21c457ade3714e8f2f75edd9 Author: Auke Kok [EMAIL PROTECTED] Date: Fri Jun 1 10:22:39 2007 -0700 e1000: disable polling before registering netdevice Yep. this patch actually called napi_disable() in the probe routine which was wrong, but e100 does not do that. Nonetheless e100 doesn't call netif_carrier_off() and netif_stop_queue(), so to make e100 the same as e1000 we should probably do this, see below. Dave, can you see if this resolves the issue for you? If so then we might want to push this to -stable. Will do, thanks Auke. Eric/David, the Fedora 8 RPM version 2.6.23-6.fc8 will have this if you want to give it a shot too. It'll be at http://people.redhat.com/davej/kernels/Fedora/f7.92/ when it's done building in an hour or so. Dave -- http://www.codemonkey.org.uk - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[IPv6] Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493, try2
Hi, From RFC 3493, Section 5.2: IPV6_MULTICAST_IF Set the interface to use for outgoing multicast packets. The argument is the index of the interface to use. If the interface index is specified as zero, the system selects the interface (for example, by looking up the address in a routing table and using the resulting interface). This patch adds support for (index == 0) to reset the value to it's original state, allowing the system to choose the best interface. IPv4 already behaves this way. -Brian Signed-off-by: Brian Haley [EMAIL PROTECTED] diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c index 532425d..1334fc1 100644 --- a/net/ipv6/ipv6_sockglue.c +++ b/net/ipv6/ipv6_sockglue.c @@ -539,12 +539,15 @@ done: case IPV6_MULTICAST_IF: if (sk-sk_type == SOCK_STREAM) goto e_inval; - if (sk-sk_bound_dev_if sk-sk_bound_dev_if != val) - goto e_inval; - if (__dev_get_by_index(init_net, val) == NULL) { - retv = -ENODEV; - break; + if (val) { + if (sk-sk_bound_dev_if sk-sk_bound_dev_if != val) +goto e_inval; + + if (__dev_get_by_index(init_net, val) == NULL) { +retv = -ENODEV; +break; + } } np-mcast_oif = val; retv = 0;
Re: [IPv6] Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493, try2
Acked-by: David L Stevens [EMAIL PROTECTED] Signed-off-by: Brian Haley [EMAIL PROTECTED] diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c index 532425d..1334fc1 100644 --- a/net/ipv6/ipv6_sockglue.c +++ b/net/ipv6/ipv6_sockglue.c @@ -539,12 +539,15 @@ done: case IPV6_MULTICAST_IF: if (sk-sk_type == SOCK_STREAM) goto e_inval; - if (sk-sk_bound_dev_if sk-sk_bound_dev_if != val) - goto e_inval; - if (__dev_get_by_index(init_net, val) == NULL) { - retv = -ENODEV; - break; + if (val) { + if (sk-sk_bound_dev_if sk-sk_bound_dev_if != val) +goto e_inval; + + if (__dev_get_by_index(init_net, val) == NULL) { +retv = -ENODEV; +break; + } } np-mcast_oif = val; retv = 0; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
Johannes Berg [EMAIL PROTECTED] writes: On Wed, 2007-10-10 at 13:51 -0600, Eric W. Biederman wrote: Yes. Netlink sockets are per-namespace and you can use the namespace of a netlink socket to look up a netdev. Ok, thanks. I still haven't really looked into the wireless vs. net namespaces problem but this will probably help. I think I may even have some patches in my proof of concept tree that address some of the wireless issues. Especially rtnetlink ones. Generally those cases haven't been hard to spot. Having hash tables and the like that hash and do key compares on an ifindex instead of a net_device * are the in kernel places that make it very hard to have duplicate ifindexes. Thinking about it probably the biggest challenge to deal with is iff in struct sk_buff. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][BNX2X] round three - sparse warnings.
On Wed, 2007-10-10 at 18:54 -0700, Stephen Hemminger wrote: Please fix these... Thanks, All that code is going to be rewritten, as per Dave's request, but I will make sure to test the new code with sparse. Eliezer - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][BNX2X] round three
On Wed, 2007-10-10 at 17:59 -0700, David Miller wrote: ... I was going to add this to the tree for 2.6.24 but there is simply too much super-ugly stuff in this driver for me to do so. We will rewrite bnx2x_hsi.h bnx2x_init.h and bnx2x_init_vlaues.h. Does bnx2x_asm.h look OK? Look, there is zero way I'm adding a driver that's written like this to the tree. ... This huge header file full of register programming magic goes way beyond the limits of reasonable and has to go. Will it be OK if we replace it with 4-5 headers so we can maintain them better manually. (I'm asking what is the upper bound to the number of header files you would consider reasonable for this driver.) Thanks Eliezer - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
iproute2: resend of patches from Debian.
Hello all! Posting this set of patches again, since there where no reaction last time I sent them. Regards, Andreas Henriksson Hello Stephen Hemminger and the rest of the people on the nevdev list! I'm posting a bunch of patches for iproute2. A few I've written myself and have a Signed-off-by with my name in them, the others I've picked up from the iproute package in Debian. I tried my best to add a decent description, references to debian bug number and when possible a From header suggesting who the original author might be. They should all apply with some offset and sometimes a bit fuzz to the current iproute2 git tree. new patches posted to the debian bug tracking system: add-mpath-docs.diff doublefree.diff fix-layer-syntax.diff new-manpages.diff tick2time-fix.diff wrandom-fix.diff fixes in the debian iproute package: empty_linkname.diff ip_address_flush_loop.diff lartc.diff netbug_fix.diff okey-ikey.diff remove_tc_filters_reference.diff patches updating help text syntax: ip_route_usage.diff ip_rule_usage.diff patches fixing documentation typos: fix_ss_typo.diff ip.8-typo.diff ip_address.diff libnetlink_typo.diff manpages-typo.diff tcb_htb_typo.diff tc_cbq_details_typo.diff -- Regards, Andreas Henriksson Disallow empty link name. From: Alexander Wirt [EMAIL PROTECTED] Patch from debian iproute package. diff -urNad iproute-20060323~/ip/iplink.c iproute-20060323/ip/iplink.c --- iproute-20060323~/ip/iplink.c 2006-03-22 00:57:50.0 +0100 +++ iproute-20060323/ip/iplink.c 2006-09-08 21:07:14.0 +0200 @@ -384,6 +384,10 @@ } if (newname strcmp(dev, newname)) { + if (strlen(newname) == 0) { + printf(\\ is not valid device identifier\n,dev); + return -1; + } if (do_changename(dev, newname) 0) return -1; dev = newname; Abort flush after 10 seconds. From: Alexander Wirt [EMAIL PROTECTED] Patch from Debian iproute package. diff -urNad iproute-20060323~/ip/ipaddress.c iproute-20060323/ip/ipaddress.c --- iproute-20060323~/ip/ipaddress.c 2006-09-08 17:02:03.0 +0200 +++ iproute-20060323/ip/ipaddress.c 2006-09-08 17:03:01.0 +0200 @@ -15,6 +15,7 @@ #include stdio.h #include stdlib.h #include unistd.h +#include time.h #include syslog.h #include fcntl.h #include sys/ioctl.h @@ -589,6 +590,7 @@ if (flush) { int round = 0; char flushb[4096-512]; + time_t start = time(0); filter.flushb = flushb; filter.flushp = 0; @@ -617,6 +619,12 @@ printf(Warum?\n); return 1; +if (time(0) - start 10) { +printf(\n*** Flush not completed after %ld seconds, %d entries remain ***\n, + time(0) - start, filter.flushed); +exit(1); +} + if (show_stats) { printf(\n*** Round %d, deleting %d addresses ***\n, round, filter.flushed); fflush(stdout); Add references to lartc, also drop bogus reference to tc-filters Patch from Debian iproute package. diff -Nur old/man/man8/ip.8 new/man/man8/ip.8 --- old/man/man8/ip.8 2005-01-05 22:40:29.0 + +++ new/man/man8/ip.8 2005-01-05 22:47:03.0 + @@ -1803,6 +1803,8 @@ .RB IP Command reference ip-cref.ps .br .RB IP tunnels ip-cref.ps +.br +.RB http://lartc.org/ .SH AUTHOR diff -Nur old/man/man8/tc.8 new/man/man8/tc.8 --- old/man/man8/tc.8 2004-10-19 20:49:02.0 + +++ new/man/man8/tc.8 2005-01-05 22:46:15.0 + @@ -341,7 +341,7 @@ .BR tc-pfifo (8), .BR tc-bfifo (8), .BR tc-pfifo_fast (8), -.BR tc-filters (8) +.BR http://lartc.org/ .SH AUTHOR Manpage maintained by bert hubert ([EMAIL PROTECTED]) Fix misc/netbug script. See the following debian bugs: http://bugs.debian.org/289541 http://bugs.debian.org/313540 http://bugs.debian.org/313541 http://bugs.debian.org/313544 http://bugs.debian.org/347699 Changes from Debian iproute package rediffed to apply against current iproute2 git tree. --- iproute2/misc/netbug 2007-09-09 17:36:19.0 +0200 +++ iproute-20070313/misc/netbug 2007-09-09 20:42:01.0 +0200 @@ -1,23 +1,16 @@ #! /bin/bash +set -e + echo -n Send network configuration summary to [ENTER means [EMAIL PROTECTED] IFS= read mail || exit 1 [ -z $mail ] [EMAIL PROTECTED] +netbug=`mktemp -d -t netbug.XX` || (echo $0: Cannot create temporary directory 2; exit 1; ) +netbugtar=`tempfile -d $netbug --suffix=tar.gz` || (echo $0: Cannot create temporary file 2; exit 1; ) +tmppath=$netbug +trap /bin/rm -rf $netbug $netbugtar 0 1 2 3 13 15 -netbug= -while [ $netbug = ]; do - netbug=`echo netbug.$$.$RANDOM` - if [ -e /tmp/$netbug ]; then - netbug= - fi -done - -tmppath=/tmp/$netbug - -trap rm -rf $tmppath $tmppath.tar.gz 0 SIGINT - -mkdir $tmppath mkdir $tmppath/net cat /proc/slabinfo $tmppath/slabinfo @@ -44,9 +37,8 @@ fi cd /tmp -tar c $netbug | gzip -9c $netbug.tar.gz - -uuencode $netbug.tar.gz $netbug.tar.gz | mail -s $netbug $mail +tar c
Re: e100 problems in .23rc8 ?
Eric/David, the Fedora 8 RPM version 2.6.23-6.fc8 will have this if you want to give it a shot too. It'll be at http://people.redhat.com/davej/kernels/Fedora/f7.92/ when it's done building in an hour or so. Dave Thanks, I'll give it a whirl this evening. I put a new net card in that box 'cause I got tired of resetting it :) -Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
On Thu, Oct 11, 2007 at 08:25:32PM +0200, Andreas Henriksson wrote: Patch from debian iproute package. diff -urNad iproute-20060323~/ip/iplink.c iproute-20060323/ip/iplink.c --- iproute-20060323~/ip/iplink.c 2006-03-22 00:57:50.0 +0100 +++ iproute-20060323/ip/iplink.c 2006-09-08 21:07:14.0 +0200 @@ -384,6 +384,10 @@ } if (newname strcmp(dev, newname)) { + if (strlen(newname) == 0) { + printf(\\ is not valid device identifier\n,dev); + return -1; + } if (do_changename(dev, newname) 0) return -1; dev = newname; Isn't that printf missing somewhere for the 'dev' argument to go? -- Len Sorensen - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 4/4] UDP memory usage accounting (take 4): memory limitation
On Thu, 11 Oct 2007 21:51:14 +0900 Satoshi OSHIMA [EMAIL PROTECTED] wrote: Hi Stephen, Thank you for your comment. { + .ctl_name = NET_UDP_MEM, + .procname = udp_mem, + .data = sysctl_udp_mem, + .maxlen = sizeof(sysctl_udp_mem), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { .ctl_name = NET_TCP_APP_WIN, .procname = tcp_app_win, .data = sysctl_tcp_app_win, if you use proc_dointvec_minmax, then you could inforce min/max values for udp_mem for the sysctl One other comment. Sysctl value indexes are deprecated at this point so all new values should use CTL_UNNUMBERED. Therefore unless NET_UDP_MEM already exists, please don't add it. -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
On tor, 2007-10-11 at 15:25 -0400, Lennart Sorensen wrote: On Thu, Oct 11, 2007 at 08:25:32PM +0200, Andreas Henriksson wrote: Patch from debian iproute package. diff -urNad iproute-20060323~/ip/iplink.c iproute-20060323/ip/iplink.c --- iproute-20060323~/ip/iplink.c 2006-03-22 00:57:50.0 +0100 +++ iproute-20060323/ip/iplink.c2006-09-08 21:07:14.0 +0200 @@ -384,6 +384,10 @@ } if (newname strcmp(dev, newname)) { + if (strlen(newname) == 0) { + printf(\\ is not valid device identifier\n,dev); + return -1; + } if (do_changename(dev, newname) 0) return -1; dev = newname; Isn't that printf missing somewhere for the 'dev' argument to go? This is not my patch so I don't know what the original intention was, but now that you point it out I'd say that the dev argument should just be removed. The new check is for an empty new name old name (dev) probably isn't interesting. I think Alexander Wirt is the original author, lets CC and see if he has a comment... -- Regards, Andreas Henriksson - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue
It happens only when ib interfaces are slaves of a bonding device. I thought before that the stuck is in napi_disable() but it's almost right. I put prints before and after call to napi_disable and see that it is called twice. I'll try to investigate in this direction. ib0: stopping interface ib0: before napi_disable ib0: after napi_disable ib0: downing ib_dev ib0: All sends and receives done. ib0: stopping interface ib0: before napi_disable Yes, two napi_disable()s in a row without a matching napi_enable() will deadlock. I guess the question is why the ipoib interface is being stopped twice. If you just take the net-2.6.24 tree (without bonding patches), does bonding for ethernet interfaces work OK, or is there a similar problem with double napi_disable()? How about bonding of ethernet after this batch of bonding patches? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
On Thu, 11 Oct 2007 20:25:32 +0200 Andreas Henriksson [EMAIL PROTECTED] wrote: Hello all! Posting this set of patches again, since there where no reaction last time I sent them. Regards, Andreas Henriksson Since there are so many. Please send them as git separate emails to me and make sure they are in the proper format for importing into git. If you really want to make it easy, just put them in git yourself and then use 'git formatpatch origin' so all the formatting is correct. The current iproute2 git repository is: git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git (or symlink git://git.kernel.org/pub/network/iproute2/iproute2.git) -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
Andreas Henriksson wrote: Abort flush after 10 seconds. This should be optional IMO. Add references to lartc, also drop bogus reference to tc-filters There's a bad habit of reporting bugs on lartc that belong to netdev, please also make it perfectly clear that lartc is not for bug-reports. Fix overflow in time2tick / tick2time. The helper functions gets passed an unsigned int, which gets cast to long and overflows. See Debian bug #175462 - http://bugs.debian.org/175462 Signed-off-by: Andreas Henriksson [EMAIL PROTECTED] diff -uri iproute-20070313.orig/tc/tc_core.c iproute-20070313/tc/tc_core.c --- iproute-20070313.orig/tc/tc_core.c2007-03-13 22:50:56.0 +0100 +++ iproute-20070313/tc/tc_core.c 2007-08-15 00:41:30.0 +0200 @@ -35,12 +35,12 @@ } -long tc_core_time2tick(long time) +unsigned tc_core_time2tick(unsigned time) { return time*tick_in_usec; } -long tc_core_tick2time(long tick) +unsigned tc_core_tick2time(unsigned tick) { return tick/tick_in_usec; } What about the other functions returning long? I think all of these should be unsigned. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
On tor, 2007-10-11 at 13:11 -0700, Stephen Hemminger wrote: Since there are so many. Please send them as git separate emails to me and make sure they are in the proper format for importing into git. If you really want to make it easy, just put them in git yourself and then use 'git formatpatch origin' so all the formatting is correct. I'm not so great with git, but I'll read up on how to do this and hopefully resend properly tomorrow. (I'll make sure the patch Len commented on also gets fixed by then.) -- Regards, Andreas Henriksson - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
Andreas Henriksson wrote: Add mpath to ip manpage. From: Norbert Buchmuller [EMAIL PROTECTED] The 'mpath' parameter of 'ip route' is not documented in the manual page nor in ip-cref.tex. ...huge part of the text in the patch was taken from the net/ipv4/Kconfig file in the Linux kernel source (2.6.18). I suppose that's OK because both Linux and iproute2 are GPL'd, but I let you know anyway. See Debian bug #428442 - http://bugs.debian.org/428442 diff -Naur iproute-20061002/doc/ip-cref.tex iproute-20061002.fixed/doc/ip-cref.tex --- iproute-20061002/doc/ip-cref.tex 2007-06-11 19:26:52.0 +0200 +++ iproute-20061002.fixed/doc/ip-cref.tex2007-06-11 20:34:09.0 +0200 @@ -1348,6 +1348,28 @@ route reflecting its relative bandwidth or quality. \end{itemize} +\item \verb|mpath MP_ALGO| This has been removed from the kernel again because of complete brokeness, so this patch shouldn't be merged. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
On tor, 2007-10-11 at 22:30 +0200, Patrick McHardy wrote: Andreas Henriksson wrote: Abort flush after 10 seconds. This should be optional IMO. Copying Alexander Wirt here again, as I think this is his patch. I'll skip this patch in my next resend... Add references to lartc, also drop bogus reference to tc-filters There's a bad habit of reporting bugs on lartc that belong to netdev, please also make it perfectly clear that lartc is not for bug-reports. I'll update the patch to mention this. Fix overflow in time2tick / tick2time. The helper functions gets passed an unsigned int, which gets cast to long and overflows. See Debian bug #175462 - http://bugs.debian.org/175462 Signed-off-by: Andreas Henriksson [EMAIL PROTECTED] diff -uri iproute-20070313.orig/tc/tc_core.c iproute-20070313/tc/tc_core.c --- iproute-20070313.orig/tc/tc_core.c 2007-03-13 22:50:56.0 +0100 +++ iproute-20070313/tc/tc_core.c 2007-08-15 00:41:30.0 +0200 @@ -35,12 +35,12 @@ } -long tc_core_time2tick(long time) +unsigned tc_core_time2tick(unsigned time) { return time*tick_in_usec; } -long tc_core_tick2time(long tick) +unsigned tc_core_tick2time(unsigned tick) { return tick/tick_in_usec; } What about the other functions returning long? I think all of these should be unsigned. Since I'm no superhacker, I was afraid to touch anything else then the things I could test and this change is what's needed to fix the reported bug. (Actually only one of them was needed IIRC, but it would be stupid to only change one.) I'll have a look at the other helper functions as well, probably adding a separate patch so that it can be easily dropped if I screw something up there.. :) I'll also drop the mpath stuff you mentioned in the other mail... Thanks for reviewing the patches! -- Regards, Andreas Henriksson - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFD] iptables: mangle table obsoletes filter table
With the existence of the mangle table, how useful is the filter table? Other than requiring the REJECT target to be ported to the mangle table, is the filter table faster than the mangle table? If not, then shouldn't the filter table be obsoleted to avoid confusion? Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPv6] Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493, try2
From: David Stevens [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 10:49:14 -0700 Acked-by: David L Stevens [EMAIL PROTECTED] Applied, thanks everyone! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][BNX2X] round three
From: Eliezer Tamir [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 19:53:39 +0200 Will it be OK if we replace it with 4-5 headers so we can maintain them better manually. (I'm asking what is the upper bound to the number of header files you would consider reasonable for this driver.) It's not the count, it's what's in them that's the problem. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
From: Andreas Henriksson [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 23:22:12 +0200 Copying Alexander Wirt here again, as I think this is his patch. It is important that whoever is the package maintainer for an upstream piece of code in any distribution: 1) Completely understands the patches he is applying. and therefore: 2) When he submits upstream, he doesn't have to wake up 2,000 patch submitters from the dead just to figure out why a change was made. Neither seems to be occuring here. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue
Roland Dreier [EMAIL PROTECTED] wrote: [...] Yes, two napi_disable()s in a row without a matching napi_enable() will deadlock. I guess the question is why the ipoib interface is being stopped twice. If you just take the net-2.6.24 tree (without bonding patches), does bonding for ethernet interfaces work OK, or is there a similar problem with double napi_disable()? How about bonding of ethernet after this batch of bonding patches? I just checked this on an x86 box. The bonding in stock net-2.6 pulled this morning or last night works ok (I did some basic tests, including ifconfig down / up, with e100). This remains true with the IPoIB bonding patches applied. I do not have hardware available to test IPoIB. I did get a whammy from tg3, but I think this is unrelated to bonding (as it happens when tg3 comes up, before bonding is involved): BUG: unable to handle kernel paging request at virtual address 4214 printing eip: e0828017 *pde = Oops: 0002 [#1] SMP Modules linked in: thermal processor fan button loop e1000 sg evdev tg3 e100 rtb CPU:0 EIP:0060:[e0828017]Not tainted VLI EFLAGS: 00010206 (2.6.23-ipv6 #1) EIP is at tg3_ape_write32+0x7/0x10 [tg3] eax: de9304c0 ebx: dde8fe18 ecx: edx: 4214 esi: de9304c0 edi: ebp: dde8fe28 esp: dde8fdd4 ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process ip (pid: 2817, ti=dde8e000 task=dff4e0b0 task.ti=dde8e000) Stack: e082fb2e dde8fdf4 c01ece3e dde8fdf8 03fe 5400 0800 1aa0 e083b340 08001aa0 0060 e083ce00 08001b20 0030 e083ce80 0101 de9304c0 0001 dde56800 dde8fe38 e0830178 dff69000 Call Trace: [c010536a] show_trace_log_lvl+0x1a/0x30 [c0105429] show_stack_log_lvl+0xa9/0xd0 [c0105639] show_registers+0x1e9/0x2f0 [c0105851] die+0x111/0x260 [c011c5dc] do_page_fault+0x18c/0x6a0 [c0319bea] error_code+0x72/0x78 [e0830178] tg3_init_hw+0x38/0x50 [tg3] [e0838886] tg3_open+0x276/0x5d0 [tg3] [c02aead8] dev_open+0x38/0x80 [c02ad5cd] dev_change_flags+0x7d/0x1a0 [c02f63d8] devinet_ioctl+0x4c8/0x660 [c02f698b] inet_ioctl+0x6b/0x90 [c02a0e5a] sock_ioctl+0x5a/0x210 [c017cd98] do_ioctl+0x28/0x80 [c017ce47] vfs_ioctl+0x57/0x290 [c017d0b9] sys_ioctl+0x39/0x60 [c01042a2] sysenter_past_esp+0x5f/0x99 === Code: 89 0a c3 8d b6 00 00 00 00 55 8b 48 50 89 e5 5d 01 ca 8b 02 c3 8d EIP: [e0828017] tg3_ape_write32+0x7/0x10 [tg3] SS:ESP 0068:dde8fdd4 Kernel panic - not syncing: Fatal exception in interrupt I haven't investigated this further. I'm using a BCM5704 card; if this isn't a known problem and anyone is curious, I can supply additional info. -J --- -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
On tor, 2007-10-11 at 14:48 -0700, David Miller wrote: From: Andreas Henriksson [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 23:22:12 +0200 Copying Alexander Wirt here again, as I think this is his patch. It is important that whoever is the package maintainer for an upstream piece of code in any distribution: 1) Completely understands the patches he is applying. and therefore: 2) When he submits upstream, he doesn't have to wake up 2,000 patch submitters from the dead just to figure out why a change was made. Neither seems to be occuring here. Alexander Wirt is both the Debian package maintainer and (probably) the author of both of the two discussed patches (and only a single person as far as I know). I don't consider it rude to notify him of the comments made about the patches he's been carrying for a long time. While doing a couple of fixes to iproute I thought it would be best to not only submit those but all the patches that has been carried in debian for review upstream and possibly inclusion. The extra review seemed to have been a success so far and I will fix up the patches myself if I have to. If you want me to STFU please just say so and I'll give you all my sincere appologies and go back to the status quo of no fixes shared with others. -- Regards, Andreas Henriksson - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
From: Andreas Henriksson [EMAIL PROTECTED] Date: Fri, 12 Oct 2007 00:35:07 +0200 If you want me to STFU please just say so and I'll give you all my sincere appologies and go back to the status quo of no fixes shared with others. I applaud your efforts, it's not what the problem is. I just wonder, therefore, when Alexander planned on submitting this sizable backlog of local iproute2 patches. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
bytes received from recvmsg doesn't match FIONREAD
I wanted to verify that the size of a multicast UDP message received with recvmsg matches the size of the message the kernel thinks the message is. So I went about using the FIONREAD ioctl as follows: res = ioctl (fd, FIONCREAD, value); assert (res != -1); bytes_received = recvmsg (fd, msg_recv, MSG_NOSIGNAL | MSG_DONTWAIT); assert (bytes_received == value); It appears the value and bytes_received do not match after many thousands of runs with a UDP multicast protocol (www.openais.org). (neither is -1) Am I using this ioctl improperly? Shouldn't I expect that the full datagram is returned by recvmsg no matter the state of the blocking or nonblocking mode of the file descriptor? pls copy me on responses - i am not onlist. Regards -steve - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e100 problems in .23rc8 ?
On Thu, Oct 11, 2007 at 09:10:34AM -0700, Kok, Auke wrote: commit 416b5d10afdc797c21c457ade3714e8f2f75edd9 Author: Auke Kok [EMAIL PROTECTED] Date: Fri Jun 1 10:22:39 2007 -0700 e1000: disable polling before registering netdevice this patch actually called napi_disable() in the probe routine which was wrong, but e100 does not do that. Nonetheless e100 doesn't call netif_carrier_off() and netif_stop_queue(), so to make e100 the same as e1000 we should probably do this, see below. Back then we didn't have napi_disable at all. That patch calls netif_poll_disable which has different semantics. Dave, can you see if this resolves the issue for you? If so then we might want to push this to -stable. The problem is with netif_poll so this patch probably won't help. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Question on TSO maximum segment sizes.
I'm having an issue with TSO right vs. hardware that can't take the maximum segment size sent from the stack. I've been told that the maximum packet size that can be sent to the hardware today is 64k, but my hardware can only take 32k in certain modes per queue due to hardware limitations. I have two questions regarding this: 1) where is this value set in the TCP code, and 2) Is this something that can be configured on the fly? If the answer to 2 is no, I will try and put something together to allow this to happen. Thanks in advance, PJ Waskiewicz Intel Corporation [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bytes received from recvmsg doesn't match FIONREAD
From: Steven Dake [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:08:15 -0700 I wanted to verify that the size of a multicast UDP message received with recvmsg matches the size of the message the kernel thinks the message is. So I went about using the FIONREAD ioctl as follows: res = ioctl (fd, FIONCREAD, value); ^ (typo? should be FIONREAD not FION_C_READ) assert (res != -1); bytes_received = recvmsg (fd, msg_recv, MSG_NOSIGNAL | MSG_DONTWAIT); assert (bytes_received == value); I think you want to use SIOCINQ, FIONREAD is only valid on files. I'm surprised you didn't get an error return from ioctl() as the VFS code seems to enforce this. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bytes received from recvmsg doesn't match FIONREAD
Questions of this sort should normally be directed to linux-net mailing list. From the code you quoted, I see at least one case where it will fail -- when the allocated buffer you pass to recvmsg is smaller than value (ie. the datagram is too big for the read buffer). If that's not the problem, I suggest you reproduce it with a small test case and post real code and results when it happens. +-DLS - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on TSO maximum segment sizes.
From: Waskiewicz Jr, Peter P [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:27:14 -0700 I'm having an issue with TSO right vs. hardware that can't take the maximum segment size sent from the stack. I've been told that the maximum packet size that can be sent to the hardware today is 64k, but my hardware can only take 32k in certain modes per queue due to hardware limitations. I have two questions regarding this: 1) where is this value set in the TCP code, and 2) Is this something that can be configured on the fly? If the answer to 2 is no, I will try and put something together to allow this to happen. The TCP code just builds the maximum possible for the underlying protocol, be it ipv4 or ipv6. It takes the underlying protocol maximum packet length, subtracts the amount of header space it knows will be used, and uses that. You'll need to use GSO sw segmentation to split the TSO packets which are too big for your HW to handle. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
From: TAKANO Ryousei [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 22:51:46 +0900 (JST) Modules linked in: 8021q tcp_bic netconsole evdev joydev sg st sr_mod ohci_hcd i2c_amd756 i2c_amd8111 i2c_core ipv6 tg3 usbhid usbcore ff_memless dm_mod ext3 jbd sata_sil libata sd_mod scsi_mod ... RIP points at __list_del (net_rx_action - list_move_tail - __list_del). There is a contract between the driver's -poll() method and net_rx_action() in that it is assumed that if the entire quota has been used up, the driver will not perform a netif_rx_complete(). It seems that in a corner case the tg3 driver, which you appear to be using, will not abide by this rule. That corner case is when the card has exactly budget receive packets pending. In such a case tg3_has_work() will be false, and we will RX complete when work_done = budget, which violates the above mentioned rule. Can you see if the following test patch makes the crash go away? Michael, I know you're not pleased with this patch and neither am I. It might be better to just strictly RX complete when (work budget) and if tg3_has_work(), trigger a HW interrupt. Alternatively we could loop in tg3_poll() until either budget is exhausted or tg3_has_work() returns false. Actually, this sounds like a cleaner scheme the more I think about it. BNX2 likely has a similar issue. diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c index d2b30fb..8c27962 100644 --- a/drivers/net/tg3.c +++ b/drivers/net/tg3.c @@ -3599,7 +3599,7 @@ static int tg3_poll(struct napi_struct *napi, int budget) sblk-status = ~SD_STATUS_UPDATED; /* if no more work, tell net stack and NIC we're done */ - if (!tg3_has_work(tp)) { + if ((work_done budget) !tg3_has_work(tp)) { netif_rx_complete(netdev, napi); tg3_restart_ints(tp); } - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Non-linear SKBs
Hello, I have developed a small patch for the TCP code in 2.6.19 and it works flawlessly. A couple of days ago I decided to make it compatible with 2.6.22.5 and have stumbled upon a problem I cannot solve. In 2.6.19 it seems that all packets (at least the ones my patch work with) are linear, while they are non-linear in 2.6.22.5. I have searched through the code (focusing on tcp_sendmsg) to try to figure out what happens, but can't find any differences that would explain this. Does anyone know what might be the cause and if there is an easy way to return to linear skbs (unless that is totally stupid)? I would also like the benefits offered by the collapsing when retransmitting (which requires number of fragments to be 0). Currently I us e the skb_linearize(skb) on all fragmentet packets, which is not nice :) Both kernels are vanilla kernels with the patch applied, and I used OpenSuse with 2.6.19 and Ubuntu with 2.6.22.5 (but I guess that shouldn't matter in this case). The reason this is a problem is that I copy data between SKBs, and in 2.6.19 I have used memcpy for this purpose. I have looked at the way the kernel copies data into a non-linear skb in the else-part of the large if-test in tcp_sendmsg and the skb_copy_bits-function, but I have to admit that I think the first one is a bit advanced and I don't quite understand how to use the last one. Does anyone have any easier to understand examples or explanations on how to copy data from one non-linear skb to another? Thanks. -Kristian - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iproute2: resend of patches from Debian.
On Thu, 11 Oct 2007 14:48:43 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Andreas Henriksson [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 23:22:12 +0200 Copying Alexander Wirt here again, as I think this is his patch. It is important that whoever is the package maintainer for an upstream piece of code in any distribution: 1) Completely understands the patches he is applying. and therefore: 2) When he submits upstream, he doesn't have to wake up 2,000 patch submitters from the dead just to figure out why a change was made. Neither seems to be occuring here. Also, I want the correct author name for legal and technical reasons. -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on TSO maximum segment sizes.
David Miller wrote: From: Waskiewicz Jr, Peter P [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:27:14 -0700 I'm having an issue with TSO right vs. hardware that can't take the maximum segment size sent from the stack. I've been told that the maximum packet size that can be sent to the hardware today is 64k, but my hardware can only take 32k in certain modes per queue due to hardware limitations. Bletch. I have two questions regarding this: 1) where is this value set in the TCP code, and 2) Is this something that can be configured on the fly? If the answer to 2 is no, I will try and put something together to allow this to happen. The TCP code just builds the maximum possible for the underlying protocol, be it ipv4 or ipv6. It takes the underlying protocol maximum packet length, subtracts the amount of header space it knows will be used, and uses that. You'll need to use GSO sw segmentation to split the TSO packets which are too big for your HW to handle. For just messing about, might it be possible to tweak the socket buffer sizes and tcp_tso_win_divisor to kludge things for a short while? Couldn't ship that way certainly, but assuming Peter's going to get his broken hardware fixed it might let him limp along until then. rick jones - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
On Thu, 11 Oct 2007 16:48:27 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: TAKANO Ryousei [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 22:51:46 +0900 (JST) Modules linked in: 8021q tcp_bic netconsole evdev joydev sg st sr_mod ohci_hcd i2c_amd756 i2c_amd8111 i2c_core ipv6 tg3 usbhid usbcore ff_memless dm_mod ext3 jbd sata_sil libata sd_mod scsi_mod ... RIP points at __list_del (net_rx_action - list_move_tail - __list_del). There is a contract between the driver's -poll() method and net_rx_action() in that it is assumed that if the entire quota has been used up, the driver will not perform a netif_rx_complete(). It seems that in a corner case the tg3 driver, which you appear to be using, will not abide by this rule. That corner case is when the card has exactly budget receive packets pending. In such a case tg3_has_work() will be false, and we will RX complete when work_done = budget, which violates the above mentioned rule. Can you see if the following test patch makes the crash go away? Michael, I know you're not pleased with this patch and neither am I. It might be better to just strictly RX complete when (work budget) and if tg3_has_work(), trigger a HW interrupt. Alternatively we could loop in tg3_poll() until either budget is exhausted or tg3_has_work() returns false. Actually, this sounds like a cleaner scheme the more I think about it. BNX2 likely has a similar issue. sky2 as well. -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Non-linear SKBs
From: Kristian Evensen [EMAIL PROTECTED] Date: Fri, 12 Oct 2007 00:54:37 +0200 I have developed a small patch for the TCP code in 2.6.19 and it works flawlessly. A couple of days ago I decided to make it compatible with 2.6.22.5 and have stumbled upon a problem I cannot solve. In 2.6.19 it seems that all packets (at least the ones my patch work with) are linear, while they are non-linear in 2.6.22.5. I have searched through the code (focusing on tcp_sendmsg) to try to figure out what happens, but can't find any differences that would explain this. Does anyone know what might be the cause and if there is an easy way to return to linear skbs (unless that is totally stupid)? I would also like the benefits offered by the collapsing when retransmitting (which requires number of fragments to be 0). If the underlying device can do scatter-gather and checksumming, the TCP code builds outgoing packets by copying user date into full system pages, and then attaching those pages into the SKB. The protocol headers sit under the skb-data linear area, and the user data mostly sits in the user pages under skb_shinfo(skb)-frags[] This increases the density of data packed into the memory allocated compared to using skb-data for it. It also enormously simplifies the logic necessary to support TSO. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on TSO maximum segment sizes.
From: Rick Jones [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:50:46 -0700 For just messing about, might it be possible to tweak the socket buffer sizes and tcp_tso_win_divisor to kludge things for a short while? Couldn't ship that way certainly, but assuming Peter's going to get his broken hardware fixed it might let him limp along until then. TCP dynamically grows the socket buffer sizes unless the application explicitly sets them via setsockopt() and the limits imposed in those cases are controlled by tcp_{,r,w}mem[] sysctls. Decreasing those will kill performance exactly for the cases this person cares about. No, the only way to deal with this is to GSO segment incoming frames inside of the driver when they exceed the HW limits. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Question on TSO maximum segment sizes.
From: Rick Jones [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:50:46 -0700 For just messing about, might it be possible to tweak the socket buffer sizes and tcp_tso_win_divisor to kludge things for a short while? Couldn't ship that way certainly, but assuming Peter's going to get his broken hardware fixed it might let him limp along until then. TCP dynamically grows the socket buffer sizes unless the application explicitly sets them via setsockopt() and the limits imposed in those cases are controlled by tcp_{,r,w}mem[] sysctls. Decreasing those will kill performance exactly for the cases this person cares about. No, the only way to deal with this is to GSO segment incoming frames inside of the driver when they exceed the HW limits. Thanks Dave and Rick. I'll mess with this and make my driver happy again. Cheers, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:55:48 -0700 On Thu, 11 Oct 2007 16:48:27 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: Alternatively we could loop in tg3_poll() until either budget is exhausted or tg3_has_work() returns false. Actually, this sounds like a cleaner scheme the more I think about it. BNX2 likely has a similar issue. sky2 as well. Thanks for the heads up Stephen. Here is a patch that implements the looping idea in tg3, bnx2, and sky2. diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c index bbfbdaf..b015d52 100644 --- a/drivers/net/bnx2.c +++ b/drivers/net/bnx2.c @@ -2633,15 +2633,11 @@ bnx2_has_work(struct bnx2 *bp) return 0; } -static int -bnx2_poll(struct napi_struct *napi, int budget) +static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget) { - struct bnx2 *bp = container_of(napi, struct bnx2, napi); - struct net_device *dev = bp-dev; struct status_block *sblk = bp-status_blk; u32 status_attn_bits = sblk-status_attn_bits; u32 status_attn_bits_ack = sblk-status_attn_bits_ack; - int work_done = 0; if ((status_attn_bits STATUS_ATTN_EVENTS) != (status_attn_bits_ack STATUS_ATTN_EVENTS)) { @@ -2660,27 +2656,43 @@ bnx2_poll(struct napi_struct *napi, int budget) bnx2_tx_int(bp); if (bp-status_blk-status_rx_quick_consumer_index0 != bp-hw_rx_cons) - work_done = bnx2_rx_int(bp, budget); + work_done += bnx2_rx_int(bp, budget - work_done); - bp-last_status_idx = bp-status_blk-status_idx; - rmb(); + return work_done; +} + +static int bnx2_poll(struct napi_struct *napi, int budget) +{ + struct bnx2 *bp = container_of(napi, struct bnx2, napi); + int work_done = 0; + + while (1) { + work_done += bnx2_poll_work(bp, work_done, budget); - if (!bnx2_has_work(bp)) { - netif_rx_complete(dev, napi); - if (likely(bp-flags USING_MSI_FLAG)) { + if (unlikely(work_done = budget)) + break; + + if (likely(!bnx2_has_work(bp))) { + bp-last_status_idx = bp-status_blk-status_idx; + rmb(); + + netif_rx_complete(bp-dev, napi); + if (likely(bp-flags USING_MSI_FLAG)) { + REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, + BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | + bp-last_status_idx); + return 0; + } REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | + BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bp-last_status_idx); - return 0; - } - REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, - BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | - BNX2_PCICFG_INT_ACK_CMD_MASK_INT | - bp-last_status_idx); - REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, - BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | - bp-last_status_idx); + REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, + BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | + bp-last_status_idx); + break; + } } return work_done; diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c index fe0e756..25da238 100644 --- a/drivers/net/sky2.c +++ b/drivers/net/sky2.c @@ -2605,33 +2605,41 @@ static void sky2_err_intr(struct sky2_hw *hw, u32 status) static int sky2_poll(struct napi_struct *napi, int work_limit) { struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi); - u32 status = sky2_read32(hw, B0_Y2_SP_EISR); - int work_done; + int work_done = 0; - if (unlikely(status Y2_IS_ERROR)) - sky2_err_intr(hw, status); + while (1) { + u32 status = sky2_read32(hw, B0_Y2_SP_EISR); - if (status Y2_IS_IRQ_PHY1) - sky2_phy_intr(hw, 0); + if (unlikely(status Y2_IS_ERROR)) + sky2_err_intr(hw, status); - if (status Y2_IS_IRQ_PHY2) - sky2_phy_intr(hw, 1); + if (status Y2_IS_IRQ_PHY1) + sky2_phy_intr(hw, 0); - work_done = sky2_status_intr(hw, work_limit); + if (status Y2_IS_IRQ_PHY2) + sky2_phy_intr(hw, 1); - /* More work? */ - if (hw-st_idx == sky2_read16(hw, STAT_PUT_IDX)) { - /* Bug/Errata workaround? -* Need to kick the TX irq moderation timer. -*/ - if (sky2_read8(hw,
Re: Question on TSO maximum segment sizes.
Waskiewicz Jr, Peter P wrote: From: Rick Jones [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:50:46 -0700 For just messing about, might it be possible to tweak the socket buffer sizes and tcp_tso_win_divisor to kludge things for a short while? Couldn't ship that way certainly, but assuming Peter's going to get his broken hardware fixed it might let him limp along until then. TCP dynamically grows the socket buffer sizes unless the application explicitly sets them via setsockopt() and the limits imposed in those cases are controlled by tcp_{,r,w}mem[] sysctls. Decreasing those will kill performance exactly for the cases this person cares about. I just tried turning off my explicit SO_SNDBUF/SO_RCVBUG settings in my app, and the connection ran very poorly through a link with even a small bit of latency (~2-4ms I believe). It ran near gige line speed through a cross-over cable. I have the sysctl max values set very generous, though the min and default are fairly small. This was with kernel 2.6.20. Was the auto-tuning put in after 2.6.20? If not, has this been tested through a higher latency link? Or, am I confused and you are talking about some other setsockopt? Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on TSO maximum segment sizes.
From: Ben Greear [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 17:17:48 -0700 Was the auto-tuning put in after 2.6.20? If not, has this been tested through a higher latency link? Or, am I confused and you are talking about some other setsockopt? It's been there for quite some time. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rtnl: Simplify ASSERT_RTNL
From: [EMAIL PROTECTED] (Eric W. Biederman) Date: Thu, 11 Oct 2007 10:33:39 -0600 Having ASSERT_RTNL warn if you were sleeping does not seem intuitive from the name. This instance of convoluted locking seems like a complete one off to me, and if it will warn about other constructs currently in the kernel it seems wrong. RTNL is a semaphore, therefore it sleeps. Therefore anything that requires RTNL is held can also assume that it can do things like GFP_KERNEL allocations and other sleeping actions. This is why any code path that runs with spinlocks held or interrupts disabled, and hits an RTNL assertion, is buggy. It is the chain of dependencies of what is allowed in such contexts. ASSERT_RTNL(); page = alloc_page(GFP_KERNEL); That sequence above is absolutely always legal. The might_sleep() check is just letting us know the problem exists. If spinlocks or interrupt disabling is needed to implement something deeper in the call chain, that's fine, it just cannot call back into the RTNL asserted domain with those spinlocks held or interrupts disabled. If the mac_vlan driver, or whichever one has the problem, does things like this it must be fixed and putting or not putting a proper might_sleep() check here doesn't change that. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] [TCP]: Add bytes_acked (ABC) clearing to FRTO too
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:01 +0300 I was reading tcp_enter_loss while looking for Cedric's bug and noticed bytes_acked adjustment is missing from FRTO side. Since bytes_acked will only be used in tcp_cong_avoid, I think it's safe to assume RTO would be spurious. During FRTO cwnd will be not controlled by tcp_cong_avoid and if FRTO calls for conventional recovery, cwnd is adjusted and the result of wrong assumption is cleared from bytes_acked. If RTO was in fact spurious, we did normal ABC already and can continue without any additional adjustments. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Applied. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
On Thu, 11 Oct 2007 17:17:43 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:55:48 -0700 On Thu, 11 Oct 2007 16:48:27 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: Alternatively we could loop in tg3_poll() until either budget is exhausted or tg3_has_work() returns false. Actually, this sounds like a cleaner scheme the more I think about it. BNX2 likely has a similar issue. sky2 as well. Thanks for the heads up Stephen. Here is a patch that implements the looping idea in tg3, bnx2, and sky2. diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c index bbfbdaf..b015d52 100644 --- a/drivers/net/bnx2.c +++ b/drivers/net/bnx2.c @@ -2633,15 +2633,11 @@ bnx2_has_work(struct bnx2 *bp) return 0; } -static int -bnx2_poll(struct napi_struct *napi, int budget) +static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget) { - struct bnx2 *bp = container_of(napi, struct bnx2, napi); - struct net_device *dev = bp-dev; struct status_block *sblk = bp-status_blk; u32 status_attn_bits = sblk-status_attn_bits; u32 status_attn_bits_ack = sblk-status_attn_bits_ack; - int work_done = 0; if ((status_attn_bits STATUS_ATTN_EVENTS) != (status_attn_bits_ack STATUS_ATTN_EVENTS)) { @@ -2660,27 +2656,43 @@ bnx2_poll(struct napi_struct *napi, int budget) bnx2_tx_int(bp); if (bp-status_blk-status_rx_quick_consumer_index0 != bp-hw_rx_cons) - work_done = bnx2_rx_int(bp, budget); + work_done += bnx2_rx_int(bp, budget - work_done); - bp-last_status_idx = bp-status_blk-status_idx; - rmb(); + return work_done; +} + +static int bnx2_poll(struct napi_struct *napi, int budget) +{ + struct bnx2 *bp = container_of(napi, struct bnx2, napi); + int work_done = 0; + + while (1) { + work_done += bnx2_poll_work(bp, work_done, budget); - if (!bnx2_has_work(bp)) { - netif_rx_complete(dev, napi); - if (likely(bp-flags USING_MSI_FLAG)) { + if (unlikely(work_done = budget)) + break; + + if (likely(!bnx2_has_work(bp))) { + bp-last_status_idx = bp-status_blk-status_idx; + rmb(); + + netif_rx_complete(bp-dev, napi); + if (likely(bp-flags USING_MSI_FLAG)) { + REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, +BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | +bp-last_status_idx); + return 0; + } REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | +BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bp-last_status_idx); - return 0; - } - REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, -BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | -BNX2_PCICFG_INT_ACK_CMD_MASK_INT | -bp-last_status_idx); - REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, -BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | -bp-last_status_idx); + REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, +BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | +bp-last_status_idx); + break; + } } return work_done; diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c index fe0e756..25da238 100644 --- a/drivers/net/sky2.c +++ b/drivers/net/sky2.c @@ -2605,33 +2605,41 @@ static void sky2_err_intr(struct sky2_hw *hw, u32 status) static int sky2_poll(struct napi_struct *napi, int work_limit) { struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi); - u32 status = sky2_read32(hw, B0_Y2_SP_EISR); - int work_done; + int work_done = 0; - if (unlikely(status Y2_IS_ERROR)) - sky2_err_intr(hw, status); + while (1) { + u32 status = sky2_read32(hw, B0_Y2_SP_EISR); - if (status Y2_IS_IRQ_PHY1) - sky2_phy_intr(hw, 0); + if (unlikely(status Y2_IS_ERROR)) + sky2_err_intr(hw, status); - if (status Y2_IS_IRQ_PHY2) - sky2_phy_intr(hw, 1); + if (status Y2_IS_IRQ_PHY1) + sky2_phy_intr(hw, 0); - work_done = sky2_status_intr(hw, work_limit); + if (status Y2_IS_IRQ_PHY2) + sky2_phy_intr(hw, 1); - /* More work? */ - if (hw-st_idx == sky2_read16(hw, STAT_PUT_IDX)) { - /* Bug/Errata workaround? - * Need to kick the TX irq moderation timer.
Re: [PATCH 2/7] [TCP]: Fix mark_head_lost to ignore R-bit when trying to mark L
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:02 +0300 This condition (plain R) can arise at least in recovery that is triggered after tcp_undo_loss. There isn't any reason why they should not be marked as lost, not marking makes in_flight estimator to return too large values. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Nice observation, applied, thanks Ilpo. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/7] [TCP]: Kill almost unused variable pcount from sacktag
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:03 +0300 It's on the way for future cutting of that function. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Applied. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7] [TCP]: Extract tcp_match_queue_to_sack from sacktag code
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:04 +0300 This is necessary for upcoming DSACK bugfix. Reduces sacktag length which is not very sad thing at all... :-) Notice that there's a need to handle out-of-mem at caller's place. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Applied. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/7] [TCP]: No need to re-count fackets_out/sacked_out at RTO
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:05 +0300 Both sacked_out and fackets_out are directly known from how parameter. Since fackets_out is accurate, there's no need for recounting (sacked_out was previously unnecessarily counted in the loop anyway). Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Applied. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/7] [TCP]: Fix lost_retrans loop vs fastpath problems
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:06 +0300 Detection implemented with lost_retrans must work also when fastpath is taken, yet most of the queue is skipped including (very likely) those retransmitted skb's we're interested in. This problem appeared when the hints got added, which removed a need to always walk over the whole write queue head. Therefore decicion for the lost_retrans worker loop entry must be separated from the sacktag processing more than it was necessary before. It turns out to be problematic to optimize the worker loop very heavily because ack_seqs of skb may have a number of discontinuity points. Maybe similar approach as currently is implemented could be attempted but that's becoming more and more complex because the trend is towards less skb walking in sacktag marker. Trying a simple work until all rexmitted skbs heve been processed approach. Maybe after(highest_sack_end_seq, tp-high_seq) checking is not sufficiently accurate and causes entry too often in no-work-to-do cases. Since that's not known, I've separated solution to that from this patch. Noticed because of report against a related problem from TAKANO Ryousei [EMAIL PROTECTED]. He also provided a patch to that part of the problem. This patch includes solution to it (though this patch has to use somewhat different placement). TAKANO's description and patch is available here: http://marc.info/?l=linux-netdevm=119149311913288w=2 ...In short, TAKANO's problem is that end_seq the loop is using not necessarily the largest SACK block's end_seq because the current ACK may still have higher SACK blocks which are later by the loop. Cc: TAKANO Ryousei [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Applied. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] [TCP]: Limit processing lost_retrans loop to work-to-do cases
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:07 +0300 This addition of lost_retrans_low to tcp_sock might be unnecessary, it's not clear how often lost_retrans worker is executed when there wasn't work to do. Cc: TAKANO Ryousei [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Applied. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6 0/7] TCP: small changes/fixes + lost_retrans brokeness fix
From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 14:41:00 +0300 Dave, please apply at will. Done. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 17:31:49 -0700 You don't need to re-read the status register and process the PHY irq's inside loop. Try this: Are you sure? What if a PHY interrupt comes in during the loop? I'm just preserving the semantics of the driver when -poll() is invoked multiple times per interrupt. And I think preserving that makes sense while we're purely trying to fix this bug, so we don't add some new ones. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
On Thu, 11 Oct 2007 17:40:26 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 17:31:49 -0700 You don't need to re-read the status register and process the PHY irq's inside loop. Try this: Are you sure? What if a PHY interrupt comes in during the loop? The interrupt is level triggered, and will rearm. I'm just preserving the semantics of the driver when -poll() is invoked multiple times per interrupt. And I think preserving that makes sense while we're purely trying to fix this bug, so we don't add some new ones. -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
APE changes cause tg3 regressions...
On my onboard SunBlade1500 5703 chips, which lack firmware, the APE changes introduce bus timeouts while bringing the chip up, specifically I get crashes in tg3_write_sig_post_reset() such as: [ 36.066603] eth1: Tigon3 [partno(BCM95703A30U) rev 1002 PHY(5703)] (PCI:33MHz:64-bit) 10/100/1000Base-T Ethernet 00:10:18:04:13:02 [ 36.076040] eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1] [ 36.085393] eth1: dma_rwctrl[763f] dma_mask[32-bit] [ 36.095909] PCI: Enabling device: (:00:08.0), cmd 3 [ 37.541359] tg3: eth1: No firmware running. [ 37.588662] ERROR(0): Cheetah error trap taken afsr[4000] afar[4210] TL1(0) [ 37.598106] ERROR(0): TPC[1004a700] TNPC[1004a704] O7[42f488] TSTATE[80009602] [ 37.607555] ERROR(0): TPCtg3_write_sig_post_reset+0x68/0x78 [tg3] [ 37.617040] ERROR(0): M_SYND(0), E_SYND(0) [ 37.626422] ERROR(0): Highest priority error (4000) Error due to unsupported store [ 37.636046] ERROR(0): AFAR E-syndrome [???] [ 37.645619] ERROR(0): D-cache idx[0] tag[] utag[] stag[] [ 37.655394] ERROR(0): D-cache data0[] data1[] data2[] data3[] [ 37.665254] ERROR(0): I-cache idx[0] tag[] utag[] stag[] u[] l[] [ 37.675194] ERROR(0): I-cache INSN0[] INSN1[] INSN2[] INSN3[] [ 37.685138] ERROR(0): I-cache INSN4[] INSN5[] INSN6[] INSN7[] [ 37.695069] ERROR(0): E-cache idx[4200] tag[02000fb1] [ 37.704920] ERROR(0): E-cache data0[b0102000a6922000] data1[02800057a004c010] data2[9410001192100019] data3[a2042250b0100013] [ 37.714978] /[EMAIL PROTECTED],70: Safari/JBUS interrupt, UNMAPPED error, interrogating IOMMUs. My initial suspicion is that TG3_FLG3_ENABLE_APE isn't being set right on these firmware-less cards, or some APE programming is not being guarded properly with tests on that feature bit. Please fix this, thanks! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 17:50:59 -0700 On Thu, 11 Oct 2007 17:40:26 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 17:31:49 -0700 You don't need to re-read the status register and process the PHY irq's inside loop. Try this: Are you sure? What if a PHY interrupt comes in during the loop? The interrupt is level triggered, and will rearm. Fair enough. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
From: David Miller [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 18:00:31 -0700 (PDT) From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 17:50:59 -0700 On Thu, 11 Oct 2007 17:40:26 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 17:31:49 -0700 You don't need to re-read the status register and process the PHY irq's inside loop. Try this: Are you sure? What if a PHY interrupt comes in during the loop? The interrupt is level triggered, and will rearm. Fair enough. Actually, your change isn't right for another reason. You missed the necessary budget reducing logic that I used in the original changes. You need to adjust work_limit like this: work_done += sky2_status_intr(hw, work_limit - work_done); Otherwise if you just pass plain work_limit, and we do loop, the driver uses more quota than it really should. And I've made that above correction to your patch in my tree. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in net-2.6.24?
Here is what I'm checking into net-2.6 for now: commit 6f535763165331bb91277d7519b507fed22034e5 Author: David S. Miller [EMAIL PROTECTED] Date: Thu Oct 11 18:08:29 2007 -0700 [NET]: Fix NAPI completion handling in some drivers. In order for the list handling in net_rx_action() to be correct, drivers must follow certain rules as stated by this comment in net_rx_action(): /* Drivers must not modify the NAPI state if they * consume the entire weight. In such cases this code * still owns the NAPI instance and therefore can * move the instance around on the list at-will. */ A few drivers do not do this because they mix the budget checks with reading hardware state, resulting in crashes like the one reported by [EMAIL PROTECTED] BNX2 and TG3 are taken care of here, SKY2 fix is from Stephen Hemminger. Signed-off-by: David S. Miller [EMAIL PROTECTED] diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c index bbfbdaf..d68acce 100644 --- a/drivers/net/bnx2.c +++ b/drivers/net/bnx2.c @@ -2633,15 +2633,11 @@ bnx2_has_work(struct bnx2 *bp) return 0; } -static int -bnx2_poll(struct napi_struct *napi, int budget) +static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget) { - struct bnx2 *bp = container_of(napi, struct bnx2, napi); - struct net_device *dev = bp-dev; struct status_block *sblk = bp-status_blk; u32 status_attn_bits = sblk-status_attn_bits; u32 status_attn_bits_ack = sblk-status_attn_bits_ack; - int work_done = 0; if ((status_attn_bits STATUS_ATTN_EVENTS) != (status_attn_bits_ack STATUS_ATTN_EVENTS)) { @@ -2660,27 +2656,43 @@ bnx2_poll(struct napi_struct *napi, int budget) bnx2_tx_int(bp); if (bp-status_blk-status_rx_quick_consumer_index0 != bp-hw_rx_cons) - work_done = bnx2_rx_int(bp, budget); + work_done += bnx2_rx_int(bp, budget - work_done); - bp-last_status_idx = bp-status_blk-status_idx; - rmb(); + return work_done; +} + +static int bnx2_poll(struct napi_struct *napi, int budget) +{ + struct bnx2 *bp = container_of(napi, struct bnx2, napi); + int work_done = 0; + + while (1) { + work_done = bnx2_poll_work(bp, work_done, budget); - if (!bnx2_has_work(bp)) { - netif_rx_complete(dev, napi); - if (likely(bp-flags USING_MSI_FLAG)) { + if (unlikely(work_done = budget)) + break; + + if (likely(!bnx2_has_work(bp))) { + bp-last_status_idx = bp-status_blk-status_idx; + rmb(); + + netif_rx_complete(bp-dev, napi); + if (likely(bp-flags USING_MSI_FLAG)) { + REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, + BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | + bp-last_status_idx); + return 0; + } REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | + BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bp-last_status_idx); - return 0; - } - REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, - BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | - BNX2_PCICFG_INT_ACK_CMD_MASK_INT | - bp-last_status_idx); - REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, - BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | - bp-last_status_idx); + REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, + BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | + bp-last_status_idx); + break; + } } return work_done; diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c index fe0e756..4e569fa 100644 --- a/drivers/net/sky2.c +++ b/drivers/net/sky2.c @@ -2606,7 +2606,7 @@ static int sky2_poll(struct napi_struct *napi, int work_limit) { struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi); u32 status = sky2_read32(hw, B0_Y2_SP_EISR); - int work_done; + int work_done = 0; if (unlikely(status Y2_IS_ERROR)) sky2_err_intr(hw, status); @@ -2617,10 +2617,16 @@ static int sky2_poll(struct napi_struct *napi, int work_limit) if (status Y2_IS_IRQ_PHY2) sky2_phy_intr(hw, 1); - work_done = sky2_status_intr(hw, work_limit); + for(;;) { + work_done += sky2_status_intr(hw, work_limit); + + if (work_done = work_limit) + break; + +
[PATCH 1/6] sky2: status polling loop
Handle the corner case where budget is exhausted correctly. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- a/drivers/net/sky2.c2007-10-11 18:11:44.0 -0700 +++ b/drivers/net/sky2.c2007-10-11 18:12:50.0 -0700 @@ -2245,15 +2245,13 @@ static inline void sky2_tx_done(struct n } /* Process status response ring */ -static int sky2_status_intr(struct sky2_hw *hw, int to_do) +static int sky2_status_intr(struct sky2_hw *hw, int to_do, u16 idx) { int work_done = 0; unsigned rx[2] = { 0, 0 }; - u16 hwidx = sky2_read16(hw, STAT_PUT_IDX); rmb(); - - while (hw-st_idx != hwidx) { + do { struct sky2_port *sky2; struct sky2_status_le *le = hw-st_le + hw-st_idx; unsigned port = le-css CSS_LINK_BIT; @@ -2364,7 +2362,7 @@ static int sky2_status_intr(struct sky2_ printk(KERN_WARNING PFX unknown status opcode 0x%x\n, le-opcode); } - } + } while (hw-st_idx != idx); /* Fully processed status ring so clear irq */ sky2_write32(hw, STAT_CTRL, SC_STAT_CLR_IRQ); @@ -2606,7 +2604,8 @@ static int sky2_poll(struct napi_struct { struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi); u32 status = sky2_read32(hw, B0_Y2_SP_EISR); - int work_done; + int work_done = 0; + u16 idx; if (unlikely(status Y2_IS_ERROR)) sky2_err_intr(hw, status); @@ -2617,21 +2616,24 @@ static int sky2_poll(struct napi_struct if (status Y2_IS_IRQ_PHY2) sky2_phy_intr(hw, 1); - work_done = sky2_status_intr(hw, work_limit); + while ((idx = sky2_read16(hw, STAT_PUT_IDX)) != hw-st_idx) { + work_done += sky2_status_intr(hw, work_limit - work_done, idx); - /* More work? */ - if (hw-st_idx == sky2_read16(hw, STAT_PUT_IDX)) { - /* Bug/Errata workaround? -* Need to kick the TX irq moderation timer. -*/ - if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) { - sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP); - sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START); - } + if (work_done = work_limit) + goto done; + } - napi_complete(napi); - sky2_read32(hw, B0_Y2_SP_LISR); + /* Bug/Errata workaround? +* Need to kick the TX irq moderation timer. +*/ + if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) { + sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP); + sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START); } + napi_complete(napi); + sky2_read32(hw, B0_Y2_SP_LISR); +done: + return work_done; } -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] sky2: fix power settings on Yukon XL
Make sure PCI register for PHY power gets set correctly. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- a/drivers/net/sky2.c2007-10-11 18:12:56.0 -0700 +++ b/drivers/net/sky2.c2007-10-11 18:13:00.0 -0700 @@ -606,20 +606,19 @@ static void sky2_phy_power(struct sky2_h { struct pci_dev *pdev = hw-pdev; u32 reg1; - static const u32 phy_power[] - = { PCI_Y2_PHY1_POWD, PCI_Y2_PHY2_POWD }; - - /* looks like this XL is back asswards .. */ - if (hw-chip_id == CHIP_ID_YUKON_XL hw-chip_rev 1) - onoff = !onoff; + static const u32 phy_power[] = { PCI_Y2_PHY1_POWD, PCI_Y2_PHY2_POWD }; + static const u32 coma_mode[] = { PCI_Y2_PHY1_COMA, PCI_Y2_PHY2_COMA }; pci_read_config_dword(pdev, PCI_DEV_REG1, reg1); + /* Turn on/off phy power saving */ if (onoff) - /* Turn off phy power saving */ reg1 = ~phy_power[port]; else reg1 |= phy_power[port]; + if (onoff hw-chip_id == CHIP_ID_YUKON_XL hw-chip_rev 1) + reg1 |= coma_mode[port]; + pci_write_config_dword(pdev, PCI_DEV_REG1, reg1); pci_read_config_dword(pdev, PCI_DEV_REG1, reg1); -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/6] sky2 patches for net-2.6
Includes: * fix for new NAPI status loop * use internal net_stats * fixes for fiber PHY power -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] sky2: fiber advertise bits initialization (trivial)
Put initialization in sequential order (same as other constants). Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- a/drivers/net/sky2.c2007-10-11 18:13:00.0 -0700 +++ b/drivers/net/sky2.c2007-10-11 18:13:01.0 -0700 @@ -296,10 +296,10 @@ static const u16 copper_fc_adv[] = { /* flow control to advertise bits when using 1000BaseX */ static const u16 fiber_fc_adv[] = { - [FC_BOTH] = PHY_M_P_BOTH_MD_X, + [FC_NONE] = PHY_M_P_NO_PAUSE_X, [FC_TX] = PHY_M_P_ASYM_MD_X, [FC_RX] = PHY_M_P_SYM_MD_X, - [FC_NONE] = PHY_M_P_NO_PAUSE_X, + [FC_BOTH] = PHY_M_P_BOTH_MD_X, }; /* flow control to GMA disable bits */ -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html