Re: [PATCH net] net: try harder to not reuse ifindex when moving interfaces
Hi Thomas, On Thu, Oct 22, 2015, at 18:45, Thomas Graf wrote: > On 10/22/15 at 05:00pm, Jiri Benc wrote: > > On Thu, 22 Oct 2015 16:52:13 +0200, Nicolas Dichtel wrote: > > > With the proposed scenario: > > > 1. create netns 'new_netns' > > > 2. in root netns, move the interface with ifindex 2 to new_netns > > > 3. in new_netns, delete the interface with ifindex 2 > > > 4. in new_netns, create an interface - it will get ifindex 2 > > > > > > Operation 2 and 4 are done by dev_change_net_namespace() under > > > rtnl_lock(). > > > RTM_DELLINK(root netns) and RTM_NEWLINK(new_netns) are sent by this > > > function. > > > It means that operation 3 has been done before and that > > > RTM_DELLINK(new_netns) > > > has been sent before. > > > > Imagine the application trying to configure the interface with ifindex 2 > > after your step 2. It constructs a netlink message and sends it to the > > kernel; but while doing so, steps 3 and 4 happen. Now the application > > ends up configuring a different interface than it intended to. After > > that, it polls the netlink socket and receives the notifications about > > interface disappearing and a new one appearing. > > > > I don't see any way the user space application can prevent this. There > > will always be a race between receiving netlink notifications and > > sending config requests. > > > > I guess Thomas Haller can elaborate more as he ran into this. > > I understand the race but when does it occur? Whoever creates > the original interface owns it and is responsible for its > lifecycle. *Iff* for some reason multiple entities manipulate > the interface, then it's probably a lot safer to just use flock > or something similar to serialize access entirely in user space. This only works if all networking configuration programs would standardize on the same flock. Also, under memory pressure we lose netlink monitor messages, so we need to deal with timeouts and retries and manual sync up on the networking configuration, which makes this scheme a lot harder. For normal socket io, where we specify e.g. ifindex in sin6_addr, this is not really usable at all. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
Daniel Borkmannwrites: > On 10/20/2015 08:56 PM, Eric W. Biederman wrote: > ... >> Just FYI: Using a device for this kind of interface is pretty >> much a non-starter as that quickly gets you into situations where >> things do not work in containers. If someone gets a version of device >> namespaces past GregKH it might be up for discussion to use character >> devices. > > Okay, you are referring to this discussion here: > > http://thread.gmane.org/gmane.linux.kernel.containers/26760 That is a piece of it. It is an old old discussion (which generally has been handled poorly). For the forseeable future device namespaces have a firm NACK by GregKH. Which means that dynamic character device based interfaces do not work in containers. Which means if you are not talking about physical hardware, character devices are a poor fit. Making a character based interface for eBPF not workable. Eric p.s. There are plenty of reasons (even if privilege remains a requirement) to ask how can this functionality be used in a container. If for no other reason than sandboxing privileged applications is typically a good idea. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 11:55:42AM +0100, Alan Burlison wrote: > On 22/10/2015 05:21, Al Viro wrote: > > >>Most of the work on using a file descriptor is local to the thread. > > > >Using - sure, but what of cacheline dirtied every time you resolve a > >descriptor to file reference? > > Don't you have to do that anyway, to do anything useful with the file? Dirtying the cacheline that contains struct file itself is different, but that's not per-descriptor. > >In case of Linux we have two bitmaps and an array of pointers associated > >with descriptor table. They grow on demand (in parallel) > > * reserving a descriptor is done under ->file_lock (dropped/regained > >around memory allocation if we end up expanding the sucker, actual > >reassignment > >of pointers to array/bitmaps is under that spinlock) > > * installing a pointer is lockless (we wait for ongoing resize to > >settle, RCU takes care of the rest) > > * grabbing a file by index is lockless as well > > * removing a pointer is under ->file_lock, so's replacing it by dup2(). > > Is that table per-process or global? Usually it's per-process, but any thread could ask for a private instance to work with (and then spawn more threads sharing that instance - or getting independent copies). It's common for Plan 9-inspired models - basically, you treat every thread as a machine that consists of * memory * file descriptor table * namespace * signal handlers ... * CPU (i.e. actual thread of execution). The last part can't be shared; anything else can. fork(2) variant used to start new threads (clone(2) in case of Linux, rfork(2) in Plan 9 and *BSD) is told which components should be copies of parent's ones and which should be shared with the parent. fork(2) is simply "copy everything except for the namespace". It's fairly common to have "share everything", but intermediate variants are also possible. There are constraints (e.g. you can't share signal handlers without sharing the memory space), but descriptor table can be shared independently from memory space just fine. There's also a way to say "unshare this, this and that components" - mapped to unshare(2) in Linux and to rfork(2) in Plan 9. Best way to think of that is to consider descriptor table as a first-class object a thread can be connected to. Usually you have one for each process, with all threads belonging to that process connected to the same thing, but that's just the most common use. > I don't think that it's possible to claim that a non-atomic dup2() > is POSIX-compliant. Except that it's in non-normative part of dup2(2), AFAICS. I certainly agree that it would be a standard lawyering beyond reason, but "not possible to claim" is too optimistic. Maybe I'm just more cynical... > ThreadA remains sat in accept on fd1 which is now a plain file, not > a socket. No. accept() is not an operation on file descriptors; it's an operation on file descriptions (pardon for use of that terminology). They are specified by passing descriptors, but there's a hell of a difference between e.g. dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek() (operations on descriptions). Lookups are done once per syscall; the only exception is F_SETFL{,W}, where we recheck that descriptor is refering to the same thing before granting the lock. Again, POSIX is still underspecifying the semantics of shared descriptor tables; back when the bulk of it had been written there had been no way to have a descriptor -> description mapping changed under a syscall by action of another thread. Hell, they still hadn't picked on some things that happened in early 80s, let alone early-to-mid 90s... Linux and Solaris happen to cover these gaps differently; FreeBSD and OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 14/15] net: wireless: ath: Remove unneeded variable ret returning 0
This patch is to the ath5k/eeprom.c that fixes up warning caught by coccicheck: -Unneeded variable: "ret". Return "0" on line 1733 Remove unneccesary variable ret created to return zero. Signed-off-by: Punit Vara--- drivers/net/wireless/ath/ath5k/eeprom.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/net/wireless/ath/ath5k/eeprom.c b/drivers/net/wireless/ath/ath5k/eeprom.c index 94d34ee..0c82ea5 100644 --- a/drivers/net/wireless/ath/ath5k/eeprom.c +++ b/drivers/net/wireless/ath/ath5k/eeprom.c @@ -1707,8 +1707,7 @@ ath5k_eeprom_read_spur_chans(struct ath5k_hw *ah) struct ath5k_eeprom_info *ee = >ah_capabilities.cap_eeprom; u32 offset; u16 val; - int ret = 0, i; - + int i; offset = AR5K_EEPROM_CTL(ee->ee_version) + AR5K_EEPROM_N_CTLS(ee->ee_version); @@ -1730,7 +1729,7 @@ ath5k_eeprom_read_spur_chans(struct ath5k_hw *ah) } } - return ret; + return 0; } -- 2.5.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote: >> >> >> >And I'm really curious about the things Solaris would do with dup2() there. >> >Does it take into account the possibility of new accept() coming just as >> >dup2() is trying to terminate the ongoing ones? Is there a window when >> >descriptor-to-file lookups would fail? Looks like a race/deadlock >> >country... >> >> Solaris does not "terminate" threads, instead it tells them that the >> file descriptor information used is stale and wkae's up the thread. > >Sorry, lousy wording - I meant "terminate syscall in another thread". >Better yet, make that "what happens if new accept(newfd) comes while dup2() >waits for affected syscalls in other threads to finish"? Assuming it >does wait, that is.. No there is no such window; the accept() call either returns EBADF (dup2()) wins the race or it returns a new file descriptor (and dup2() then closes the listening descriptor). One or the other. >While we are at it, what's the relative order of record locks removal >and switching the meaning of newfd? In our kernel it happens *after* >the switchover (i.e. if another thread is waiting for a record lock held on >any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see >the state with newfd still refering to what it used to; note that waiter >might be using any descriptor refering to the file newfd used to refer >to, so it won't be affected by the "wake those who had the meaning of >their arguments change" side of things). The external behaviour atomic; you cannot distinguish the order between the closing of the original file (and waking up other threads waiting for a record lock) or changing the file referenced by that newfd. But this not include a global or process specific lock. Casper -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote: > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > > The tcp memory controller has extensive provisions for future memory > > accounting interfaces that won't materialize after all. Cut the code > > base down to what's actually used, now and in the likely future. > > > > - There won't be any different protocol counters in the future, so a > > direct sock->sk_memcg linkage is enough. This eliminates a lot of > > callback maze and boilerplate code, and restores most of the socket > > allocation code to pre-tcp_memcontrol state. > > > > - There won't be a tcp control soft limit, so integrating the memcg > > In fact, the code is ready for the "soft" limit (I mean min, pressure, > max tuple), it just lacks a knob. Yeah, but that's not going to materialize if the entire interface for dedicated tcp throttling is considered obsolete. > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const > > struct sock *sk) > > if (!sk->sk_prot->memory_pressure) > > return false; > > > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > > - return !!sk->sk_cgrp->memory_pressure; > > - > > AFAIU, now we won't shrink the window on hitting the limit, i.e. this > patch subtly changes the behavior of the existing knobs, potentially > breaking them. Hm, but there is no grace period in which something meaningful could happen with the window shrinking, is there? Any buffer allocation is still going to fail hard. I don't see how this would change anything in practice. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[BUG] bnx2x_config_vlan_mac called a NULL function pointer
Hello netdev, I probably found a bug in kernel-4.3.0-0.rc5 (bnx2x driver). So I opened new bug report in our bugzilla [0]. Michal Schmidt told me the best way to solve an upstream bug is to contact you directly to netdev list.. so here I am :). Can somebody take a look at it? [0] https://bugzilla.redhat.com/show_bug.cgi?id=1273894 thanks, Ota -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 1/2] iwlwifi: pcie: allow to build an A-MSDU using TSO core
On 10/22/2015 05:27 AM, Eric Dumazet wrote: > On Thu, 2015-10-22 at 00:14 +, Grumbach, Emmanuel wrote: > >> >> Well. I guess I should at least check, but even with very small MSS, our >> device supports up to 20 pointers for the same 802.11 packet: 2 are for >> metadata. So basically, so leaves me only 18 pointers. for each MSS I >> need at least 2 (one for the headers and one for the payload), so I will >> have at most 9 of these for one packet, even with a tiny MSS. >> > > I did not see in your patch where you made the checks about 18 segs in a > TSO packet ? It is in the other patch: iwlwifi: mvm: send large SKBs to the transport mvm is the op_mode and the op_mode needs to make sure that the payload fits in one 802.11 packet AND it doesn't exhaust the number of pointers. I'll add a comment here. > >> I agree that all this should be added to the code in a comment. >> Speaking of which... >> int tso_count_descs(struct sk_buff *skb) >> { >> /* The Marvell Way */ >> return skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags; >> } >> >> What if there is some payload in the header? >> To me it sounds safer to return: >> >> skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags + 1; >> >> or maybe to test if there is some payload in the header and then add 1? >> If there is payload in the header, it should be considered as another >> frag, shouldn't it? > > Minimal count is gso_segs (one per MSS) > > Then you have to add extra for the cases we have a mss spanning a frag > in skb. > > Thats a max of (skb_shinfo(skb)->nr_frags - 1) + (data_in_head() ? 1 : > 0); > So I believe formula would be correct. > I needed a piece of paper and a few drawings to understand you are right... :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] xfrm: Fix pmtu discovery for local generated packets.
Commit 044a832a777 ("xfrm: Fix local error reporting crash with interfamily tunnels") moved the setting of skb->protocol behind the last access of the inner mode family to fix an interfamily crash. Unfortunately now skb->protocol might not be set at all, so we fail dispatch to the inner address family. As a reault, the local error handler is not called and the mtu value is not reported back to userspace. We fix this by setting skb->protocol on message size errors before we call xfrm_local_error. Fixes: 044a832a7779c ("xfrm: Fix local error reporting crash with interfamily tunnels") Signed-off-by: Steffen Klassert--- net/ipv4/xfrm4_output.c | 2 ++ net/ipv6/xfrm6_output.c | 1 + 2 files changed, 3 insertions(+) diff --git a/net/ipv4/xfrm4_output.c b/net/ipv4/xfrm4_output.c index 2878dbf..41a2613 100644 --- a/net/ipv4/xfrm4_output.c +++ b/net/ipv4/xfrm4_output.c @@ -30,6 +30,8 @@ static int xfrm4_tunnel_check_size(struct sk_buff *skb) mtu = dst_mtu(skb_dst(skb)); if (skb->len > mtu) { + skb->protocol = htons(ETH_P_IP); + if (skb->sk) xfrm_local_error(skb, mtu); else diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c index be033f2..e15feb7 100644 --- a/net/ipv6/xfrm6_output.c +++ b/net/ipv6/xfrm6_output.c @@ -79,6 +79,7 @@ static int xfrm6_tunnel_check_size(struct sk_buff *skb) if (!skb->ignore_df && skb->len > mtu) { skb->dev = dst->dev; + skb->protocol = htons(ETH_P_IPV6); if (xfrm6_local_dontfrag(skb)) xfrm6_local_rxpmtu(skb, mtu); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] xfrm: Fix state threshold configuration from userspace
From: Michael RossbergAllow to change the replay threshold (XFRMA_REPLAY_THRESH) and expiry timer (XFRMA_ETIMER_THRESH) of a state without having to set other attributes like replay counter and byte lifetime. Changing these other values while traffic flows will break the state. Signed-off-by: Michael Rossberg Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_user.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index a8de9e3..24e06a2 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1928,8 +1928,10 @@ static int xfrm_new_ae(struct sk_buff *skb, struct nlmsghdr *nlh, struct nlattr *rp = attrs[XFRMA_REPLAY_VAL]; struct nlattr *re = attrs[XFRMA_REPLAY_ESN_VAL]; struct nlattr *lt = attrs[XFRMA_LTIME_VAL]; + struct nlattr *et = attrs[XFRMA_ETIMER_THRESH]; + struct nlattr *rt = attrs[XFRMA_REPLAY_THRESH]; - if (!lt && !rp && !re) + if (!lt && !rp && !re && !et && !rt) return err; /* pedantic mode - thou shalt sayeth replaceth */ -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
pull request (net): ipsec 2015-10-22
1) Fix IPsec pre-encap fragmentation for GSO packets. From Herbert Xu. 2) Fix some header checks in _decode_session6. We skip the header informations if the data pointer points already behind the header in question for some protocols. This is because we call pskb_may_pull with a negative value converted to unsigened int from pskb_may_pull in this case. Skipping the header informations can lead to incorrect policy lookups. From Mathias Krause. 3) Allow to change the replay threshold and expiry timer of a state without having to set other attributes like replay counter and byte lifetime. Changing these other attributes may break the SA. From Michael Rossberg. 4) Fix pmtu discovery for local generated packets. We may fail dispatch to the inner address family. As a reault, the local error handler is not called and the mtu value is not reported back to userspace. Please pull or let me know if there are problems. Thanks! The following changes since commit 724a7636ad026a3a68f3fc626ccd04111f65cfd9: Merge branch 'sctp-fixes' (2015-09-03 15:43:06 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master for you to fetch changes up to ca064bd89363a6e7e71b1c5226ff1b718957a9d4: xfrm: Fix pmtu discovery for local generated packets. (2015-10-19 10:30:05 +0200) Herbert Xu (1): ipv6: Fix IPsec pre-encap fragmentation check Mathias Krause (1): xfrm6: Fix ICMPv6 and MH header checks in _decode_session6 Michael Rossberg (1): xfrm: Fix state threshold configuration from userspace Steffen Klassert (1): xfrm: Fix pmtu discovery for local generated packets. net/ipv4/xfrm4_output.c | 2 ++ net/ipv6/xfrm6_output.c | 18 -- net/ipv6/xfrm6_policy.c | 6 -- net/xfrm/xfrm_user.c| 4 +++- 4 files changed, 21 insertions(+), 9 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] xfrm6: Fix ICMPv6 and MH header checks in _decode_session6
From: Mathias KrauseEnsure there's enough data left prior calling pskb_may_pull(). If skb->data was already advanced, we'll call pskb_may_pull() with a negative value converted to unsigned int -- leading to a huge positive value. That won't matter in practice as pskb_may_pull() will likely fail in this case, but it leads to underflow reports on kernels handling such kind of over-/underflows, e.g. a PaX enabled kernel instrumented with the size_overflow plugin. Reported-by: satmd Reported-and-tested-by: Marcin Jurkowski Signed-off-by: Mathias Krause Cc: PaX Team Signed-off-by: Steffen Klassert --- net/ipv6/xfrm6_policy.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c index 30caa28..f10b940 100644 --- a/net/ipv6/xfrm6_policy.c +++ b/net/ipv6/xfrm6_policy.c @@ -178,7 +178,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse) return; case IPPROTO_ICMPV6: - if (!onlyproto && pskb_may_pull(skb, nh + offset + 2 - skb->data)) { + if (!onlyproto && (nh + offset + 2 < skb->data || + pskb_may_pull(skb, nh + offset + 2 - skb->data))) { u8 *icmp; nh = skb_network_header(skb); @@ -192,7 +193,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse) #if IS_ENABLED(CONFIG_IPV6_MIP6) case IPPROTO_MH: offset += ipv6_optlen(exthdr); - if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) { + if (!onlyproto && (nh + offset + 3 < skb->data || + pskb_may_pull(skb, nh + offset + 3 - skb->data))) { struct ip6_mh *mh; nh = skb_network_header(skb); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] ipv6: Fix IPsec pre-encap fragmentation check
From: Herbert XuThe IPv6 IPsec pre-encap path performs fragmentation for tunnel-mode packets. That is, we perform fragmentation pre-encap rather than post-encap. A check was added later to ensure that proper MTU information is passed back for locally generated traffic. Unfortunately this check was performed on all IPsec packets, including transport-mode packets. What's more, the check failed to take GSO into account. The end result is that transport-mode GSO packets get dropped at the check. This patch fixes it by moving the tunnel mode check forward as well as adding the GSO check. Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error") Signed-off-by: Herbert Xu Signed-off-by: Steffen Klassert --- net/ipv6/xfrm6_output.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c index 09c76a7..be033f2 100644 --- a/net/ipv6/xfrm6_output.c +++ b/net/ipv6/xfrm6_output.c @@ -136,6 +136,7 @@ static int __xfrm6_output(struct sock *sk, struct sk_buff *skb) struct dst_entry *dst = skb_dst(skb); struct xfrm_state *x = dst->xfrm; int mtu; + bool toobig; #ifdef CONFIG_NETFILTER if (!x) { @@ -144,25 +145,29 @@ static int __xfrm6_output(struct sock *sk, struct sk_buff *skb) } #endif + if (x->props.mode != XFRM_MODE_TUNNEL) + goto skip_frag; + if (skb->protocol == htons(ETH_P_IPV6)) mtu = ip6_skb_dst_mtu(skb); else mtu = dst_mtu(skb_dst(skb)); - if (skb->len > mtu && xfrm6_local_dontfrag(skb)) { + toobig = skb->len > mtu && !skb_is_gso(skb); + + if (toobig && xfrm6_local_dontfrag(skb)) { xfrm6_local_rxpmtu(skb, mtu); return -EMSGSIZE; - } else if (!skb->ignore_df && skb->len > mtu && skb->sk) { + } else if (!skb->ignore_df && toobig && skb->sk) { xfrm_local_error(skb, mtu); return -EMSGSIZE; } - if (x->props.mode == XFRM_MODE_TUNNEL && - ((skb->len > mtu && !skb_is_gso(skb)) || - dst_allfrag(skb_dst(skb { + if (toobig || dst_allfrag(skb_dst(skb))) return ip6_fragment(sk, skb, x->outer_mode->afinfo->output_finish); - } + +skip_frag: return x->outer_mode->afinfo->output_finish(sk, skb); } -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro>Except that in this case "correctness" is the matter of rather obscure and >ill-documented areas in POSIX. Don't get me wrong - this semantics isn't >inherently bad, but it's nowhere near being an absolute requirement. It would more fruitful to have such a discussion in one of the OpenGroup mailing lists; people gathered there have a lot of experience and it is also possible to fix the standard when it turns out that it indeed as vague as you claim it is (I don't quite agree with that) Casper -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] can: xilinx: use readl/writel instead of ioread/iowrite
Hi Marc, > -Original Message- > From: Marc Kleine-Budde [mailto:m...@pengutronix.de] > Sent: Thursday, October 22, 2015 1:52 PM > To: Arnd Bergmann; linux-arm-ker...@lists.infradead.org > Cc: Appana Durga Kedareswara Rao; Anirudha Sarangi; w...@grandegger.com; > Michal Simek; Soren Brinkmann; Appana Durga Kedareswara Rao; > netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux- > c...@vger.kernel.org > Subject: Re: [PATCH 1/2] can: xilinx: use readl/writel instead of > ioread/iowrite > > On 10/22/2015 10:14 AM, Arnd Bergmann wrote: > > On Thursday 22 October 2015 10:16:02 Kedareswara rao Appana wrote: > >> The driver only supports memory-mapped I/O [by ioremap()], so > >> readl/writel is actually the right thing to do, IMO. > >> During the validation of this driver or IP on ARM 64-bit processor > >> while sending lot of packets observed that the tx packet drop with > >> iowrite Putting the barriers for each tx fifo register write fixes > >> this issue Instead of barriers using writel also fixed this issue. > >> > >> Signed-off-by: Kedareswara rao Appana> > > > The two should really do the same thing: iowrite32() is just a static > > inline calling writel() on both ARM32 and ARM64. On which kernel > > version did you observe the difference? It's possible that an older > > version used CONFIG_GENERIC_IOMAP, which made this slightly more > expensive. > > > > If there are barriers that you want to get rid of for performance > > reasons, you should use writel_relaxed(), but be careful to > > synchronize them correctly with regard to DMA. It should be fine in > > this driver, as it does not perform any DMA, but be aware that there > > is no big-endian version of > > writel_relaxed() at the moment. > > We don't have DMA in CAN drivers, but usually a certain write triggers > sending. > Do we need a barrier before triggering the sending? Yes During validation of this IP on ARM 64 bit processor with using iowrite32() and sending a lot of packets it requires barriers before triggering the send. With using writel() barriers are not needed. Regards, Kedar. > > Marc > > -- > Pengutronix e.K. | Marc Kleine-Budde | > Industrial Linux Solutions| Phone: +49-231-2826-924 | > Vertretung West/Dortmund | Fax: +49-5121-206917- | > Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
[PATCH 0/3] Refactor nested mdiobus read/write functions
In order to avoid locked signal false positive for nested mdiobus read/write calls, nested code was introduced in mv88e6xxx and mdio-mux. But mv88e6060 also needs such nested mdiobus read/write calls. For sake of refactoring, introduce nested variants of mdiobus read/write and make them used by mv88e6xxx and mv88e6060. In a next patch, mdio-mux should also use these variant calls. Neil Armstrong (3): net: phy: Add nested variants of mdiobus read/write net: dsa: Make mv88e6xxx use nested mdiobus read/write net: dsa: Make mv88e6060 use nested mdiobus read/write drivers/net/dsa/mv88e6060.c | 4 ++-- drivers/net/dsa/mv88e6xxx.c | 46 - drivers/net/phy/mdio_bus.c | 55 + include/linux/phy.h | 2 ++ 4 files changed, 68 insertions(+), 39 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] net: phy: Add nested variants of mdiobus read/write
Since nested variants of mdiobus_read/write are used in multiple drivers, add nested variants in the mdiobus core. Suggested-by: Andrew LunnSigned-off-by: Neil Armstrong --- drivers/net/phy/mdio_bus.c | 55 ++ include/linux/phy.h| 2 ++ 2 files changed, 57 insertions(+) diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c index 12f44c5..88cb459 100644 --- a/drivers/net/phy/mdio_bus.c +++ b/drivers/net/phy/mdio_bus.c @@ -372,6 +372,33 @@ struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr) EXPORT_SYMBOL(mdiobus_scan); /** + * mdiobus_read_nested - Nested version of the mdiobus_read function + * @bus: the mii_bus struct + * @addr: the phy address + * @regnum: register number to read + * + * In case of nested MDIO bus access avoid lockdep false positives by + * using mutex_lock_nested(). + * + * NOTE: MUST NOT be called from interrupt context, + * because the bus read/write functions may wait for an interrupt + * to conclude the operation. + */ +int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum) +{ + int retval; + + BUG_ON(in_interrupt()); + + mutex_lock_nested(>mdio_lock, SINGLE_DEPTH_NESTING); + retval = bus->read(bus, addr, regnum); + mutex_unlock(>mdio_lock); + + return retval; +} +EXPORT_SYMBOL(mdiobus_read_nested); + +/** * mdiobus_read - Convenience function for reading a given MII mgmt register * @bus: the mii_bus struct * @addr: the phy address @@ -396,6 +423,34 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum) EXPORT_SYMBOL(mdiobus_read); /** + * mdiobus_write_nested - Nested version of the mdiobus_write function + * @bus: the mii_bus struct + * @addr: the phy address + * @regnum: register number to write + * @val: value to write to @regnum + * + * In case of nested MDIO bus access avoid lockdep false positives by + * using mutex_lock_nested(). + * + * NOTE: MUST NOT be called from interrupt context, + * because the bus read/write functions may wait for an interrupt + * to conclude the operation. + */ +int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val) +{ + int err; + + BUG_ON(in_interrupt()); + + mutex_lock_nested(>mdio_lock, SINGLE_DEPTH_NESTING); + err = bus->write(bus, addr, regnum, val); + mutex_unlock(>mdio_lock); + + return err; +} +EXPORT_SYMBOL(mdiobus_write_nested); + +/** * mdiobus_write - Convenience function for writing a given MII mgmt register * @bus: the mii_bus struct * @addr: the phy address diff --git a/include/linux/phy.h b/include/linux/phy.h index 4c477e6..05fde31 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -213,7 +213,9 @@ static inline struct mii_bus *devm_mdiobus_alloc(struct device *dev) void devm_mdiobus_free(struct device *dev, struct mii_bus *bus); struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr); int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum); +int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum); int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val); +int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val); #define PHY_INTERRUPT_DISABLED 0x0 -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 05:44:58AM +0100, Al Viro wrote: > Except that in this case "correctness" is the matter of rather obscure and > ill-documented areas in POSIX. Don't get me wrong - this semantics isn't > inherently bad, but it's nowhere near being an absolute requirement. PS: in principle, a fairly ugly trick might suffice for accept(2), but I'm less than happy with going there. Namely, we could * have ->accept() get descriptor number * have ->flush() get descriptor number in addition to current->files and have it DTRT for sockets in the middle of accept(2). However, in addition to being ugly as hell, it has the problem with the points where we call ->flush(), specifically do_dup2() and __close_fd(). It's done *after* the replacement/removal from descriptor table, so another socket might have already gotten the same descriptor and we'd get spurious termination of accept(2). And I'm really curious about the things Solaris would do with dup2() there. Does it take into account the possibility of new accept() coming just as dup2() is trying to terminate the ongoing ones? Is there a window when descriptor-to-file lookups would fail? Looks like a race/deadlock country... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper
On 10/21/15 10:31 PM, Wangnan (F) wrote: +if ((attr->type != PERF_TYPE_RAW && + !(attr->type == PERF_TYPE_SOFTWARE && + attr->config == PERF_COUNT_SW_BPF_OUTPUT) && + attr->type != PERF_TYPE_HARDWARE) || +attr->inherit) { This 'if' statement is so complex. What about using a inline function instead? hmm. don't see how inline function will help readability. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next RFC 1/2] vhost: introduce vhost_has_work()
On Thu, Oct 22, 2015 at 01:27:28AM -0400, Jason Wang wrote: > This path introduces a helper which can give a hint for whether or not > there's a work queued in the work list. > > Signed-off-by: Jason Wang> --- > drivers/vhost/vhost.c | 6 ++ > drivers/vhost/vhost.h | 1 + > 2 files changed, 7 insertions(+) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index eec2f11..d42d11e 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -245,6 +245,12 @@ void vhost_work_queue(struct vhost_dev *dev, struct > vhost_work *work) > } > EXPORT_SYMBOL_GPL(vhost_work_queue); > > +bool vhost_has_work(struct vhost_dev *dev) > +{ > + return !list_empty(>work_list); > +} > +EXPORT_SYMBOL_GPL(vhost_has_work); > + > void vhost_poll_queue(struct vhost_poll *poll) > { > vhost_work_queue(poll->dev, >work); This doesn't take a lock so it's unreliable. I think it's ok in this case since it's just an optimization - but pls document this. > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h > index 4772862..ea0327d 100644 > --- a/drivers/vhost/vhost.h > +++ b/drivers/vhost/vhost.h > @@ -37,6 +37,7 @@ struct vhost_poll { > > void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); > void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work); > +bool vhost_has_work(struct vhost_dev *dev); > > void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, >unsigned long mask, struct vhost_dev *dev); > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] can: xilinx: use readl/writel instead of ioread/iowrite
On Thursday 22 October 2015 10:21:58 Marc Kleine-Budde wrote: > On 10/22/2015 10:14 AM, Arnd Bergmann wrote: > > On Thursday 22 October 2015 10:16:02 Kedareswara rao Appana wrote: > >> The driver only supports memory-mapped I/O [by ioremap()], > >> so readl/writel is actually the right thing to do, IMO. > >> During the validation of this driver or IP on ARM 64-bit processor > >> while sending lot of packets observed that the tx packet drop with iowrite > >> Putting the barriers for each tx fifo register write fixes this issue > >> Instead of barriers using writel also fixed this issue. > >> > >> Signed-off-by: Kedareswara rao Appana> > > > The two should really do the same thing: iowrite32() is just a static inline > > calling writel() on both ARM32 and ARM64. On which kernel version did you > > observe the difference? It's possible that an older version used > > CONFIG_GENERIC_IOMAP, which made this slightly more expensive. > > > > If there are barriers that you want to get rid of for performance reasons, > > you should use writel_relaxed(), but be careful to synchronize them > > correctly > > with regard to DMA. It should be fine in this driver, as it does not > > perform any DMA, but be aware that there is no big-endian version of > > writel_relaxed() at the moment. > > We don't have DMA in CAN drivers, but usually a certain write triggers > sending. Do we need a barrier before triggering the sending? No, the relaxed writes are not well-defined across architectures. On ARM, the CPU guarantees that stores to an MMIO area are still in order with respect to one another, the barrier is only needed for actual DMA, so you are fine. I would expect the same to be true everywhere, otherwise a lot of other drivers would be broken too. To be on the safe side, that last write() could remain a writel() instead of writel_relaxed(), and that would be guaranteed to work on all architectures even if they end relax the ordering between MMIO writes. If there is a measurable performance difference, just use writel_relaxed() and add a comment. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/1] eventfd: implementation of EFD_MASK flag
Hello, > diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h > index ff0b981..87de343 100644 > --- a/include/linux/eventfd.h > +++ b/include/linux/eventfd.h > > -/* > - * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining > - * new flags, since they might collide with O_* ones. We want > - * to re-use O_* flags that couldn't possibly have a meaning > - * from eventfd, in order to leave a free define-space for > - * shared O_* flags. > - */ > -#define EFD_SEMAPHORE (1 << 0) > -#define EFD_CLOEXEC O_CLOEXEC > -#define EFD_NONBLOCK O_NONBLOCK > - > -#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) > -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE) > - > struct file; > > #ifdef CONFIG_EVENTFD > diff --git a/include/uapi/linux/eventfd.h b/include/uapi/linux/eventfd.h > new file mode 100644 > index 000..097dcad > --- /dev/null > +++ b/include/uapi/linux/eventfd.h > @@ -0,0 +1,33 @@ > + > +/* > + * CAREFUL: Check include/asm-generic/fcntl.h when defining > + * new flags, since they might collide with O_* ones. We want > + * to re-use O_* flags that couldn't possibly have a meaning > + * from eventfd, in order to leave a free define-space for > + * shared O_* flags. > + */ > + > +/* Provide semaphore-like semantics for reads from the eventfd. */ > +#define EFD_SEMAPHORE (1 << 0) > +/* Provide event mask semantics for the eventfd. */ > +#define EFD_MASK (1 << 1) > +/* Set the close-on-exec (FD_CLOEXEC) flag on the eventfd. */ > +#define EFD_CLOEXEC O_CLOEXEC > +/* Create the eventfd in non-blocking mode. */ > +#define EFD_NONBLOCK O_NONBLOCK > +#endif /* _UAPI_LINUX_EVENTFD_H */ > Since the latest version of this patch adds only the EFD_MASK definition to the eventfd header, I was wondering if it was really necessary/recommended to move the definitions from linux/eventfd.h to linux/uapi/eventfd.h. From my understanding, the EFD_SEMAPHORE (and now EFD_MASK) define(s) are provided to user space from the libc headers only. Any advice would be greatly appreciated. Thank you, Damian -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
* Wangnan (F)wrote: > > > On 2015/10/22 0:57, Peter Zijlstra wrote: > >On Wed, Oct 21, 2015 at 11:06:47PM +0800, pi3orama wrote: > >>>So explain; how does this eBPF stuff work. > >>I think I get your point this time, and let me explain the eBPF stuff to > >>you. > >> > >>You are aware that BPF programmer can break the system in this way: > >> > >>A=get_non_local_perf_event() > >>perf_event_read_local(A) > >>BOOM! > >> > >>However the above logic is impossible because BPF program can't work this > >>way. > >> > >>First of all, it is impossible for a BPF program directly invoke a > >>kernel function. Doesn't like kernel module, BPF program can only > >>invoke functions designed for them, like what this patch does. So the > >>ability of BPF programs is strictly restricted by kernel. If we don't > >>allow BPF program call perf_event_read_local() across core, we can > >>check this and return error in function we provide for them. > >> > >>Second: there's no way for a BPF program directly access a perf event. > >>All perf events have to be wrapped by a map and be accessed by BPF > >>functions described above. We don't allow BPF program fetch array > >>element from that map. So pointers of perf event is safely protected > >>from BPF program. > >> > >>In summary, your either-or logic doesn't hold in BPF world. A BPF > >>program can only access perf event in a highly restricted way. We > >>don't allow it calling perf_event_read_local() across core, so it > >>can't. > >Urgh, that's still horridly inconsistent. Can we please come up with a > >consistent interface to perf? > > BPF program and kernel module are two different worlds as I said before. > > I don't think making them to share a common interface is a good idea because > such sharing will give BPF programs too much freedom than it really need, > then > it will be hard prevent them to do something bad. If we really need kernel > interface, I think what we need is kernel module, not BPF program. What do you mean, as this does not parse for me. We obviously can (and very likely should) make certain perf functionality available to BPF programs. It should still be a well defined yet flexible iterface, with safe behavior, obviously - all in line with existing BPF sandboxing principles. 'Kernel modules' don't enter this consideration at all, not sure why you mention them - all this functionality is also available if CONFIG_MODULES is turned off completely. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/22 14:21, Alexei Starovoitov wrote: On 10/21/15 10:31 PM, Wangnan (F) wrote: +if ((attr->type != PERF_TYPE_RAW && + !(attr->type == PERF_TYPE_SOFTWARE && + attr->config == PERF_COUNT_SW_BPF_OUTPUT) && + attr->type != PERF_TYPE_HARDWARE) || +attr->inherit) { This 'if' statement is so complex. What about using a inline function instead? hmm. don't see how inline function will help readability. For example (not tested): static inline bool perf_event_can_insert_to_map(struct perf_event_attr *attr) { /* is inherit? */ if (attr->inherit) return false; /* is software event? */ if (attr->type == PERF_TYPE_SOFTWARE) if (attr->config == PERF_COUNT_SW_BPF_OUTPUT) return true; else return false; /* Comment... */ if (attr->type == PERF_TYPE_RAW) return true; if (attr->type == PERF_TYPE_HARDWARE) return true; return false; } ... if (!perf_event_can_insert_to_map(attr)) Do you think redability is improved? Thank you. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] can: xilinx: use readl/writel instead of ioread/iowrite
On Thursday 22 October 2015 10:16:02 Kedareswara rao Appana wrote: > The driver only supports memory-mapped I/O [by ioremap()], > so readl/writel is actually the right thing to do, IMO. > During the validation of this driver or IP on ARM 64-bit processor > while sending lot of packets observed that the tx packet drop with iowrite > Putting the barriers for each tx fifo register write fixes this issue > Instead of barriers using writel also fixed this issue. > > Signed-off-by: Kedareswara rao AppanaThe two should really do the same thing: iowrite32() is just a static inline calling writel() on both ARM32 and ARM64. On which kernel version did you observe the difference? It's possible that an older version used CONFIG_GENERIC_IOMAP, which made this slightly more expensive. If there are barriers that you want to get rid of for performance reasons, you should use writel_relaxed(), but be careful to synchronize them correctly with regard to DMA. It should be fine in this driver, as it does not perform any DMA, but be aware that there is no big-endian version of writel_relaxed() at the moment. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>It's been said that the current mechanisms in Linux & some BSD variants >can be subject to races, and the behaviour exhibited doesn't conform to >POSIX, for example requiring the use of shutdown() on unconnected >sockets because close() doesn't kick off other threads accept()ing on >the same fd. I'd be interested to hear if there's a better and more >performant way of handling the situation that doesn't involve doing the >sort of bookkeeping Casper described,. Of course, the implementation is now around 18 years old; clearly a lot of things have changed since then. In the particular case of Linux close() on a socket, surely it must be possible to detect at close that it is a listening socket and that you are about to close the last reference; the kernel could then do the shutdown() all by itself. Casper -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: dsa: mv88e6060: Fix false positive lockdep splat
On 10/21/2015 06:14 PM, Andrew Lunn wrote: > On Wed, Oct 21, 2015 at 05:37:45PM +0200, Neil Armstrong wrote: >> Like the change made for mv88e6xxx, use mutex_lock_nested() to avoid >> lockdep to give false positives because of nested MDIO busses. > > Hi Neil > > We now have three instances of this, since mdio-mux.c has the same > code. Maybe now would be a good time to refactor this code into > mdiobus_read_nested() and mdiobus_write_nested() in mdio_bus.c? At > the same time, add BUG_ON(in_interrupt()) similar to the non-nested > versions? > > Andrew > Well, mdio-mux also calls switch_fn inside the mdio_lock, clean refactoring would introduce a separate lock and call the nested variants. Is that ok ? Can someone test mdio-mux is I make the change ? Neil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: dsa: mv88e6060: Fix false positive lockdep splat
On 10/21/2015 06:14 PM, Andrew Lunn wrote: > On Wed, Oct 21, 2015 at 05:37:45PM +0200, Neil Armstrong wrote: >> Like the change made for mv88e6xxx, use mutex_lock_nested() to avoid >> lockdep to give false positives because of nested MDIO busses. > > Hi Neil > > We now have three instances of this, since mdio-mux.c has the same > code. Maybe now would be a good time to refactor this code into > mdiobus_read_nested() and mdiobus_write_nested() in mdio_bus.c? At > the same time, add BUG_ON(in_interrupt()) similar to the non-nested > versions? > > Andrew > Well, mdio-mux also calls switch_fn inside the mdio_lock, clean refactoring would introduce a separate lock and call the nested variants. Is that ok ? Can someone test mdio-mux if I make the change ? Neil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] can: xilinx: use readl/writel instead of ioread/iowrite
Hi Arnd, > -Original Message- > From: Arnd Bergmann [mailto:a...@arndb.de] > Sent: Thursday, October 22, 2015 1:45 PM > To: linux-arm-ker...@lists.infradead.org > Cc: Appana Durga Kedareswara Rao; Anirudha Sarangi; w...@grandegger.com; > m...@pengutronix.de; Michal Simek; Soren Brinkmann; Appana Durga > Kedareswara Rao; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux- > c...@vger.kernel.org > Subject: Re: [PATCH 1/2] can: xilinx: use readl/writel instead of > ioread/iowrite > > On Thursday 22 October 2015 10:16:02 Kedareswara rao Appana wrote: > > The driver only supports memory-mapped I/O [by ioremap()], so > > readl/writel is actually the right thing to do, IMO. > > During the validation of this driver or IP on ARM 64-bit processor > > while sending lot of packets observed that the tx packet drop with > > iowrite Putting the barriers for each tx fifo register write fixes > > this issue Instead of barriers using writel also fixed this issue. > > > > Signed-off-by: Kedareswara rao Appana> > The two should really do the same thing: iowrite32() is just a static inline > calling > writel() on both ARM32 and ARM64. On which kernel version did you observe the > difference? It's possible that an older version used CONFIG_GENERIC_IOMAP, > which made this slightly more expensive. I observed this issue with the 4.0.0 kernel version > > If there are barriers that you want to get rid of for performance reasons, you > should use writel_relaxed(), but be careful to synchronize them correctly with > regard to DMA. It should be fine in this driver, as it does not perform any > DMA, > but be aware that there is no big-endian version of > writel_relaxed() at the moment. There is no DMA in CAN for this IP. Regards, Kedar. > > Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH iproute2] ip-address: properly display zero IPv4 peer address
Kernel allows for zero IPv4 peer addresses (IFA_ADDRESS): ip address add 192.168.5.1 peer 0.0.0.0/24 dev dummy which is distinct from a usual address like: ip address add 192.168.5.1/24 dev dummy ip address add 192.168.5.1 peer 192.168.5.1/24 dev dummy For IPv4, a missing IFA_ADDRESS attribute means that the peer is 0.0.0.0. See inet_fill_ifaddr(), which does: if ((ifa->ifa_address && nla_put_in_addr(skb, IFA_ADDRESS, ifa->ifa_address)) || Signed-off-by: Thomas Haller--- ip/ipaddress.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/ip/ipaddress.c b/ip/ipaddress.c index f290205..6fc4520 100644 --- a/ip/ipaddress.c +++ b/ip/ipaddress.c @@ -949,7 +949,8 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n, if (!rta_tb[IFA_LOCAL]) rta_tb[IFA_LOCAL] = rta_tb[IFA_ADDRESS]; - if (!rta_tb[IFA_ADDRESS]) + if (!rta_tb[IFA_ADDRESS] && + ifa->ifa_family != AF_INET) rta_tb[IFA_ADDRESS] = rta_tb[IFA_LOCAL]; if (filter.ifindex && filter.ifindex != ifa->ifa_index) @@ -1034,16 +1035,18 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n, RTA_DATA(rta_tb[IFA_LOCAL]), abuf, sizeof(abuf))); - if (rta_tb[IFA_ADDRESS] == NULL || + if (rta_tb[IFA_ADDRESS] != NULL && memcmp(RTA_DATA(rta_tb[IFA_ADDRESS]), RTA_DATA(rta_tb[IFA_LOCAL]), ifa->ifa_family == AF_INET ? 4 : 16) == 0) { fprintf(fp, "/%d ", ifa->ifa_prefixlen); } else { fprintf(fp, " peer %s/%d ", - format_host(ifa->ifa_family, - RTA_PAYLOAD(rta_tb[IFA_ADDRESS]), - RTA_DATA(rta_tb[IFA_ADDRESS]), - abuf, sizeof(abuf)), + rta_tb[IFA_ADDRESS] + ? format_host(ifa->ifa_family, + RTA_PAYLOAD(rta_tb[IFA_ADDRESS]), + RTA_DATA(rta_tb[IFA_ADDRESS]), + abuf, sizeof(abuf)) + : "0.0.0.0", ifa->ifa_prefixlen); } } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 05/15] net: wireless: ti: Return flow can be simplified for wl1271_cmd_interrogate
On Wed, Oct 21, 2015 at 10:07 PM, Punit Varawrote: > Remove int ret suggested by kbuild test robot > > This patch is to the wlcore/acx.c file that fixes up warning > reported by coccicheck: > > WARNING: end returns can be simplified if negative or 0 value > > Prefer direct return value instead of writing 2-3 more sentence. > > Signed-off-by: Punit Vara > --- > drivers/net/wireless/ti/wlcore/acx.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/drivers/net/wireless/ti/wlcore/acx.c > b/drivers/net/wireless/ti/wlcore/acx.c > index f28fa3b..6b566d9 100644 > --- a/drivers/net/wireless/ti/wlcore/acx.c > +++ b/drivers/net/wireless/ti/wlcore/acx.c > @@ -158,16 +158,11 @@ out: > int wl1271_acx_mem_map(struct wl1271 *wl, struct acx_header *mem_map, >size_t len) > { > - int ret; > > wl1271_debug(DEBUG_ACX, "acx mem map"); > > - ret = wl1271_cmd_interrogate(wl, ACX_MEM_MAP, mem_map, > + return wl1271_cmd_interrogate(wl, ACX_MEM_MAP, mem_map, > sizeof(struct acx_header), len); > - if (ret < 0) > - return ret; > - > - return 0; > } > this changes the return value in case of positive values. have you verified it can't happen / won't affect the code flow? i'm not sure you really want to blindly patch it... Eliad. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: dsa: mv88e6060: Fix false positive lockdep splat
Hi Andrew, On 10/21/2015 06:14 PM, Andrew Lunn wrote: > On Wed, Oct 21, 2015 at 05:37:45PM +0200, Neil Armstrong wrote: >> Like the change made for mv88e6xxx, use mutex_lock_nested() to avoid >> lockdep to give false positives because of nested MDIO busses. > > Hi Neil > > We now have three instances of this, since mdio-mux.c has the same > code. Maybe now would be a good time to refactor this code into > mdiobus_read_nested() and mdiobus_write_nested() in mdio_bus.c? At > the same time, add BUG_ON(in_interrupt()) similar to the non-nested > versions? > > Andrew > Indeed, you are right, I will post a serie with this refactoring. Neil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/22 15:39, Ingo Molnar wrote: * Wangnan (F)wrote: [SNIP] In summary, your either-or logic doesn't hold in BPF world. A BPF program can only access perf event in a highly restricted way. We don't allow it calling perf_event_read_local() across core, so it can't. Urgh, that's still horridly inconsistent. Can we please come up with a consistent interface to perf? BPF program and kernel module are two different worlds as I said before. I don't think making them to share a common interface is a good idea because such sharing will give BPF programs too much freedom than it really need, then it will be hard prevent them to do something bad. If we really need kernel interface, I think what we need is kernel module, not BPF program. What do you mean, as this does not parse for me. Because I'm not very sure what the meaning of "inconsistent" in Peter's words... I think what Peter want us to do is to provide similar (consistent) interface between kernel and eBPF that, if kernel reads from a perf_event through perf_event_read_local(struct perf_event *), BPF program should do this work with similar code, or at least similar logic, so we need to create handler for a perf event, and provide a BPF function called BPF_FUNC_perf_event_read_local then pass such handler to it. I don't think like this because if we want kernel interface we'd better use kernel module, not eBPF so I mentioned kernel module here. Ingo, do you think BPF inerface should be *consistent* with anything? Thank you. We obviously can (and very likely should) make certain perf functionality available to BPF programs. It should still be a well defined yet flexible iterface, with safe behavior, obviously - all in line with existing BPF sandboxing principles. 'Kernel modules' don't enter this consideration at all, not sure why you mention them - all this functionality is also available if CONFIG_MODULES is turned off completely. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/2] net: mvneta: add ethtool statistics
Add support for the ethtool statistic interface, returning the full set of statistics which both Armada 370, 38x and Armada XP can support. Tested-by: Andrew LunnSigned-off-by: Russell King --- drivers/net/ethernet/marvell/mvneta.c | 108 ++ 1 file changed, 108 insertions(+) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 514df76fc70f..69d80ca4fba2 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -277,6 +277,50 @@ #define MVNETA_RX_BUF_SIZE(pkt_size) ((pkt_size) + NET_SKB_PAD) +struct mvneta_statistic { + unsigned short offset; + unsigned short type; + const char name[ETH_GSTRING_LEN]; +}; + +#define T_REG_32 32 +#define T_REG_64 64 + +static const struct mvneta_statistic mvneta_statistics[] = { + { 0x3000, T_REG_64, "good_octets_received", }, + { 0x3010, T_REG_32, "good_frames_received", }, + { 0x3008, T_REG_32, "bad_octets_received", }, + { 0x3014, T_REG_32, "bad_frames_received", }, + { 0x3018, T_REG_32, "broadcast_frames_received", }, + { 0x301c, T_REG_32, "multicast_frames_received", }, + { 0x3050, T_REG_32, "unrec_mac_control_received", }, + { 0x3058, T_REG_32, "good_fc_received", }, + { 0x305c, T_REG_32, "bad_fc_received", }, + { 0x3060, T_REG_32, "undersize_received", }, + { 0x3064, T_REG_32, "fragments_received", }, + { 0x3068, T_REG_32, "oversize_received", }, + { 0x306c, T_REG_32, "jabber_received", }, + { 0x3070, T_REG_32, "mac_receive_error", }, + { 0x3074, T_REG_32, "bad_crc_event", }, + { 0x3078, T_REG_32, "collision", }, + { 0x307c, T_REG_32, "late_collision", }, + { 0x2484, T_REG_32, "rx_discard", }, + { 0x2488, T_REG_32, "rx_overrun", }, + { 0x3020, T_REG_32, "frames_64_octets", }, + { 0x3024, T_REG_32, "frames_65_to_127_octets", }, + { 0x3028, T_REG_32, "frames_128_to_255_octets", }, + { 0x302c, T_REG_32, "frames_256_to_511_octets", }, + { 0x3030, T_REG_32, "frames_512_to_1023_octets", }, + { 0x3034, T_REG_32, "frames_1024_to_max_octets", }, + { 0x3038, T_REG_64, "good_octets_sent", }, + { 0x3040, T_REG_32, "good_frames_sent", }, + { 0x3044, T_REG_32, "excessive_collision", }, + { 0x3048, T_REG_32, "multicast_frames_sent", }, + { 0x304c, T_REG_32, "broadcast_frames_sent", }, + { 0x3054, T_REG_32, "fc_sent", }, + { 0x300c, T_REG_32, "internal_mac_transmit_err", }, +}; + struct mvneta_pcpu_stats { struct u64_stats_sync syncp; u64 rx_packets; @@ -312,6 +356,8 @@ struct mvneta_port { unsigned int speed; unsigned int tx_csum_limit; int use_inband_status:1; + + u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)]; }; /* The mvneta_tx_desc and mvneta_rx_desc structures describe the @@ -2875,6 +2921,65 @@ static int mvneta_ethtool_set_ringparam(struct net_device *dev, return 0; } +static void mvneta_ethtool_get_strings(struct net_device *netdev, u32 sset, + u8 *data) +{ + if (sset == ETH_SS_STATS) { + int i; + + for (i = 0; i < ARRAY_SIZE(mvneta_statistics); i++) + memcpy(data + i * ETH_GSTRING_LEN, + mvneta_statistics[i].name, ETH_GSTRING_LEN); + } +} + +static void mvneta_ethtool_update_stats(struct mvneta_port *pp) +{ + const struct mvneta_statistic *s; + void __iomem *base = pp->base; + u32 high, low, val; + int i; + + for (i = 0, s = mvneta_statistics; +s < mvneta_statistics + ARRAY_SIZE(mvneta_statistics); +s++, i++) { + val = 0; + + switch (s->type) { + case T_REG_32: + val = readl_relaxed(base + s->offset); + break; + case T_REG_64: + /* Docs say to read low 32-bit then high */ + low = readl_relaxed(base + s->offset); + high = readl_relaxed(base + s->offset + 4); + val = (u64)high << 32 | low; + break; + } + + pp->ethtool_stats[i] += val; + } +} + +static void mvneta_ethtool_get_stats(struct net_device *dev, +struct ethtool_stats *stats, u64 *data) +{ + struct mvneta_port *pp = netdev_priv(dev); + int i; + + mvneta_ethtool_update_stats(pp); + + for (i = 0; i < ARRAY_SIZE(mvneta_statistics); i++) + *data++ = pp->ethtool_stats[i]; +} + +static int mvneta_ethtool_get_sset_count(struct net_device *dev, int sset) +{ + if (sset == ETH_SS_STATS) + return ARRAY_SIZE(mvneta_statistics); + return -EOPNOTSUPP; +} +
[PATCH v3 0/2] mvneta ethtool statistics
Sorry for v3 - I forgot to update the commit message on patch 1 as requested by Marcin. This short series adds ethtool statistics reporting to mvneta. Having discussed with Andrew on IRC, we decided I'd pick up his patch into my series. My change for patch 1 compared to the previous RFC splits out the reading of the statistics from the hardware into a separate function, in order to facilitate work going on elsewhere to arrange for the statistics to be preserved across a suspend/resume cycle. drivers/net/ethernet/marvell/mvneta.c | 117 +- 1 file changed, 115 insertions(+), 2 deletions(-) -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/2] net: mvneta: Fix clearing of MIB statistics
From: Andrew LunnThe existing function to clear the MIB statatistics was using the wrong address for the registers. Also, the counters would of been cleared when the interface was brought up, not during the probe. Fix both of these. Signed-off-by: Andrew Lunn Signed-off-by: Russell King --- drivers/net/ethernet/marvell/mvneta.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 69d80ca4fba2..8829c481fe1b 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -100,6 +100,8 @@ #define MVNETA_TXQ_CMD 0x2448 #define MVNETA_TXQ_DISABLE_SHIFT8 #define MVNETA_TXQ_ENABLE_MASK 0x00ff +#define MVNETA_RX_DISCARD_FRAME_COUNT 0x2484 +#define MVNETA_OVERRUN_FRAME_COUNT 0x2488 #define MVNETA_GMAC_CLOCK_DIVIDER0x24f4 #define MVNETA_GMAC_1MS_CLOCK_ENABLEBIT(31) #define MVNETA_ACC_MODE 0x2500 @@ -191,7 +193,7 @@ #define MVNETA_GMAC_AN_FLOW_CTRL_EN BIT(11) #define MVNETA_GMAC_CONFIG_FULL_DUPLEX BIT(12) #define MVNETA_GMAC_AN_DUPLEX_ENBIT(13) -#define MVNETA_MIB_COUNTERS_BASE 0x3080 +#define MVNETA_MIB_COUNTERS_BASE 0x3000 #define MVNETA_MIB_LATE_COLLISION 0x7c #define MVNETA_DA_FILT_SPEC_MCAST0x3400 #define MVNETA_DA_FILT_OTH_MCAST 0x3500 @@ -564,6 +566,8 @@ static void mvneta_mib_counters_clear(struct mvneta_port *pp) /* Perform dummy reads from MIB counters */ for (i = 0; i < MVNETA_MIB_LATE_COLLISION; i += 4) dummy = mvreg_read(pp, (MVNETA_MIB_COUNTERS_BASE + i)); + dummy = mvreg_read(pp, MVNETA_RX_DISCARD_FRAME_COUNT); + dummy = mvreg_read(pp, MVNETA_OVERRUN_FRAME_COUNT); } /* Get System Network Statistics */ @@ -792,7 +796,6 @@ static void mvneta_port_up(struct mvneta_port *pp) u32 q_map; /* Enable all initialized TXs. */ - mvneta_mib_counters_clear(pp); q_map = 0; for (queue = 0; queue < txq_number; queue++) { struct mvneta_tx_queue *txq = >txqs[queue]; @@ -1076,6 +1079,8 @@ static void mvneta_defaults_set(struct mvneta_port *pp) mvreg_write(pp, MVNETA_INTR_ENABLE, (MVNETA_RXQ_INTR_ENABLE_ALL_MASK | MVNETA_TXQ_INTR_ENABLE_ALL_MASK)); + + mvneta_mib_counters_clear(pp); } /* Set max sizes for tx queues */ -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
XFRM Policy lookup in flowcache
Hi All, I have a question regarding speeding up xfrm policy lookup in flow cache. We are working on a product where 3 policies (in,fwd and out) are setup per tunnel. Tunnels comes up and down all the time, so the policy data base will be keep changing. Looks like there is an impact in performance since every time policy database is changed, policies in flow cache gets invalidated. This eventually results in almost all lookups going to policy database. Is there any mechanism to get around this to get faster policy lookup? thanks --naji -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity
On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote: ... > @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct > scan_control *sc, > } > } > > + vmpressure(sc->gfp_mask, memcg, > +sc->nr_scanned - scanned, > +sc->nr_reclaimed - reclaimed); > + > /* >* Direct reclaim and kswapd have to scan all memory >* cgroups to fulfill the overall scan target for the > @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct > scan_control *sc, > } > } while ((memcg = mem_cgroup_iter(root, memcg, ))); > > - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, > -sc->nr_scanned - nr_scanned, > -sc->nr_reclaimed - nr_reclaimed); > - > if (sc->nr_reclaimed - nr_reclaimed) > reclaimable = true; > I may be mistaken, but AFAIU this patch subtly changes the behavior of vmpressure visible from the userspace: w/o this patch a userspace process will only receive a notification for a memory cgroup only if *this* memory cgroup calls reclaimer; with this patch userspace notification will be issued even if reclaimer is invoked by any cgroup up the hierarchy. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 14/15] net: wireless: ath: Remove unneeded variable ret returning 0
Remove black line suggested by Sergei This patch is to the ath5k/eeprom.c that fixes up warning caught by coccicheck: Unneeded variable: "ret". Return "0" on line 980 Remove unneeded variable ret created to return zero. Signed-off-by: Punit Vara--- drivers/net/wireless/ath/wcn36xx/main.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/net/wireless/ath/wcn36xx/main.c b/drivers/net/wireless/ath/wcn36xx/main.c index 900e72a..94bcc08 100644 --- a/drivers/net/wireless/ath/wcn36xx/main.c +++ b/drivers/net/wireless/ath/wcn36xx/main.c @@ -935,8 +935,6 @@ static const struct ieee80211_ops wcn36xx_ops = { static int wcn36xx_init_ieee80211(struct wcn36xx *wcn) { - int ret = 0; - static const u32 cipher_suites[] = { WLAN_CIPHER_SUITE_WEP40, WLAN_CIPHER_SUITE_WEP104, @@ -977,7 +975,7 @@ static int wcn36xx_init_ieee80211(struct wcn36xx *wcn) wcn->hw->sta_data_size = sizeof(struct wcn36xx_sta); wcn->hw->vif_data_size = sizeof(struct wcn36xx_vif); - return ret; + return 0; } static int wcn36xx_platform_get_resources(struct wcn36xx *wcn, -- 2.5.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote: ... > @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk) > */ > bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > + unsigned int batch = max(CHARGE_BATCH, nr_pages); > struct page_counter *counter; > + bool force = false; > > - if (page_counter_try_charge(>skmem, nr_pages, )) > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { > + if (page_counter_try_charge(>skmem, nr_pages, )) > + return true; > + page_counter_charge(>skmem, nr_pages); > + return false; > + } > + > + if (consume_stock(memcg, nr_pages)) > return true; > +retry: > + if (page_counter_try_charge(>memory, batch, )) > + goto done; Currently, we use memcg->memory only for charging memory pages. Besides, every page charged to this counter (including kmem) has ->mem_cgroup field set appropriately. This looks consistent and nice. As an extra benefit, we can track all pages charged to a memory cgroup via /proc/kapgecgroup. Now, you charge "window size" to it, which AFAIU isn't necessarily equal to the amount of memory actually consumed by the cgroup for socket buffers. I think this looks ugly and inconsistent with the existing behavior. I agree that we need to charge socker buffers to ->memory, but IMO we should do that per each skb page, using memcg_kmem_charge_kmem somewhere in alloc_skb_with_frags invoking the reclaimer just as we do for kmalloc, while tcp window size control should stay aside. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller
David Millerwrites: > From: Mans Rullgard > Date: Thu, 22 Oct 2015 15:02:38 +0100 > >> This adds a driver for the Aurora VLSI NB8800 Ethernet controller. >> It is an almost complete rewrite of a driver originally found in >> a Sigma Designs 2.6.22 tree. >> >> Signed-off-by: Mans Rullgard > > The netdev list is going to have to see patches #1 and #2 as well as > your cover letter in order to review and integrate this driver > properly. Patch 1 only added the "aurora" DT vendor prefix. Patch 2 was the DT binding for this device. I'll include netdev when I send an updated version. -- Måns Rullgård m...@mansr.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 18:05, Al Viro wrote: Oh, for... Right in this thread an example of complete BS has been quoted from POSIX close(2). The part about closing a file when the last descriptor gets closed. _Nothing_ is POSIX-compliant in that respect (nor should it be). That's not exactly what it says, we've already discussed, for example in the case of pending async IO on a filehandle. Semantics around the distinction between file descriptors and file descriptions is underspecified, not to mention being very poorly written. I agree that part could do with some polishing. You want to add something along the lines of "if any action by another thread changes the mapping from file descriptors to file descriptions for any file descriptor passed to syscall, such and such things should happen" - go ahead and specify what should happen. As it is, I don't see anything of that sort in e.g. accept(2). And no, [EBADF] The socket argument is not a valid file descriptor. in there is nowhere near being unambiguous enough - everyone agrees that argument should be a valid descriptor at the time of call, but I would be very surprised to find _any_ implementation (including Solaris one) recheck that upon exit to userland. The scenario I described previously, where dup2() is used to modify a fd that's being used in accept, result in the accept call terminating in the same way as if close had been called on it - Casper gave details earlier. For more bullshit from the same source (issue 7, close(2)): If fildes refers to a socket, close() shall cause the socket to be destroyed. If the socket is in connection-mode, and the SO_LINGER option is set for the socket with non-zero linger time, and the socket has untransmitted data, then close() shall block for up to the current linger interval until all data is transmitted. I challenge you to find *any* implementation that would have fd = socket(...); close(dup(fd)); do what this wonder of technical prose clearly requests. In the same text we also have When all file descriptors associated with a pipe or FIFO special file are closed, any data remaining in the pipe or FIFO shall be discarded. as well as explicit (and underspecified, but perhaps they do it elsewhere) "last close" in parts related to sockets and ptys. Yes, Casper has just reported that to TOG, see http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11573. Our assessment is that sockets should behave the same way as plain files, i.e. 'last close'. And that is not to mention the dup2(2) wording in there: If fildes2 is already a valid open file descriptor, it shall be closed first which is (a) inviting misinterpretation that would make the damn thing non-atomic (the only mentioning of atomicity is in non-normative sections) I've already explained why I believe atomic behaviour of dup2() is required by POSIX. If you feel it's not clear then we can ask POSIX for clarification. and (b) says fsck-all about the effects of closing descriptor. The latter is a problem, since nothing in close(2) bothers making a distinction between the effects specific to particular syscall and those common to all ways of closing a descriptor. And no, it's not a nitpicking - consider e.g. the parts concerning the order of events triggered by close(2) (such and such should be completed before close(2) returns); should it be taken as "same events should be completed before newfd is associated with the file description refered to by oldfd"? It _is_ user-visible, since close(2) removes fcntl locks. Sure, there is (otherwise unexplained) The dup2() function is not intended for use in critical regions as a synchronization mechanism. down in informative sections, so one can infer that event order here isn't to be relied upon. With no way to guess whether the event order concerning e.g. effect on ongoing accept(newfd) is any different in that respect. I think "it shall be closed first" makes it pretty clear that what is expected is the same behaviour as any direct invocation of close, and that has to happen before the reassignment. What makes you believe that's isn't the case? The entire area in Issue 7 stinks. It might make sense to try and fix it up, but let's not pretend that what's in there right now does specify the semantics in this kind of situations. Sorry, I disagree. I'm not saying that Solaris approach yields an inherently bad semantics or that it's impossible to implement without high scalability price and/or high memory footprint. But waving the flag of POSIX compliance when you are actually talking about the ways your implementation plugs the holes in a badly incomplete spec... Personally I believe the spec is clear enough to allow an unambiguous interpretation of the required behavior in this area. If you think there are areas where the Solaris
Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > The tcp memory controller has extensive provisions for future memory > accounting interfaces that won't materialize after all. Cut the code > base down to what's actually used, now and in the likely future. > > - There won't be any different protocol counters in the future, so a > direct sock->sk_memcg linkage is enough. This eliminates a lot of > callback maze and boilerplate code, and restores most of the socket > allocation code to pre-tcp_memcontrol state. > > - There won't be a tcp control soft limit, so integrating the memcg In fact, the code is ready for the "soft" limit (I mean min, pressure, max tuple), it just lacks a knob. > code into the global skmem limiting scheme complicates things > unnecessarily. Replace all that with simple and clear charge and > uncharge calls--hidden behind a jump label--to account skb memory. > > - The previous jump label code was an elaborate state machine that > tracked the number of cgroups with an active socket limit in order > to enable the skmem tracking and accounting code only when actively > necessary. But this is overengineered: it was meant to protect the > people who never use this feature in the first place. Simply enable > the branches once when the first limit is set until the next reboot. > ... > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const > struct sock *sk) > if (!sk->sk_prot->memory_pressure) > return false; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - AFAIU, now we won't shrink the window on hitting the limit, i.e. this patch subtly changes the behavior of the existing knobs, potentially breaking them. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] net: encx24j600: Fix mask to update LED configuration
From: Jon RingleThis fixes the mask used to update the LED configuration so that it clears the necessary bits as well as setting the bits according to the mask. Also reverse the LED configuration to show the Link state + collisions in LEDA and the Link state + TX/RX events in LEDB. Signed-off-by: Jon Ringle --- drivers/net/ethernet/microchip/encx24j600.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/microchip/encx24j600.c b/drivers/net/ethernet/microchip/encx24j600.c index e1329d9..bf08ce2 100644 --- a/drivers/net/ethernet/microchip/encx24j600.c +++ b/drivers/net/ethernet/microchip/encx24j600.c @@ -617,10 +617,10 @@ static int encx24j600_hw_init(struct encx24j600_priv *priv) (eidled & REVID_MASK) >> REVID_SHIFT); /* PHY Leds: link status, -* LEDA: Link + transmit/receive events -* LEDB: Link State + colision events +* LEDA: Link State + collision events +* LEDB: Link State + transmit/receive events */ - encx24j600_update_reg(priv, EIDLED, 0xbc00, 0xbc00); + encx24j600_update_reg(priv, EIDLED, 0xff00, 0xcb00); /* Loopback disabled */ encx24j600_write_reg(priv, MACON1, 0x9); -- 2.4.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 1/2] geneve: implement support for IPv6-based tunnels
NOTE: Link-local IPv6 addresses for remote endpoints are not supported, since the driver currently has no capacity for binding a geneve interface to a specific link. Signed-off-by: John W. Linville--- v5: - wrap declaration of sock6 in geneve_dev with IS_ENABLED(CONFIG_IPV6) - remove superfluous '!!' when assigning geneve->collect_md to bool - use skb_scrub_packet in IPv4 tx path as well - check for NULL ip_tunnel_info pointer in geneve[6]_xmit_skb - use ipv6_addr_equal for comparing IPv6 addresses - more use of IS_ENABLED(CONFIG_IPV6) for preserving build integrity - reject link-local ipv6 address for remote tunnel endpoint v4: - treat mode field of ip_tunnel_info as flags - add a missing IS_ENABLED(CONFIG_IPV6) to geneve_rx - remove unneeded flags field in geneve_dev - NULL-check parameter for __geneve_sock_release - check remote socket family for AF_UNSPEC in geneve_configure - rename geneve_get_{rt,dst} as geneve_get_{v4_rt,v6_dst} - refactor some error handling in the xmit paths v3: - declare geneve_remote_unspec as static v2: - do not require remote address for tx on metadata tunnels - pass correct sockaddr family to udp_tun_rx_dst in geneve_rx - accommodate both ipv4 and ipv6 sockets open on same tunnel - move declaration of geneve_get_dst for aesthetic purposes drivers/net/geneve.c | 482 +++ include/uapi/linux/if_link.h | 1 + 2 files changed, 395 insertions(+), 88 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index cde29f8a37bf..47f7512f02cf 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -46,16 +46,27 @@ struct geneve_net { static int geneve_net_id; +union geneve_addr { + struct sockaddr_in sin; + struct sockaddr_in6 sin6; + struct sockaddr sa; +}; + +static union geneve_addr geneve_remote_unspec = { .sa.sa_family = AF_UNSPEC, }; + /* Pseudo network device */ struct geneve_dev { struct hlist_node hlist; /* vni hash table */ struct net *net;/* netns for packet i/o */ struct net_device *dev;/* netdev for geneve tunnel */ - struct geneve_sock *sock; /* socket used for geneve tunnel */ + struct geneve_sock *sock4; /* IPv4 socket used for geneve tunnel */ +#if IS_ENABLED(CONFIG_IPV6) + struct geneve_sock *sock6; /* IPv6 socket used for geneve tunnel */ +#endif u8 vni[3]; /* virtual network ID for tunnel */ u8 ttl; /* TTL override */ u8 tos; /* TOS override */ - struct sockaddr_in remote; /* IPv4 address for link partner */ + union geneve_addr remote; /* IP address for link partner */ struct list_head next;/* geneve's per namespace list */ __be16 dst_port; bool collect_md; @@ -103,12 +114,32 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs, vni_list_head = >vni_list[hash]; hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) { if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) && - addr == geneve->remote.sin_addr.s_addr) + addr == geneve->remote.sin.sin_addr.s_addr) return geneve; } return NULL; } +#if IS_ENABLED(CONFIG_IPV6) +static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs, +struct in6_addr addr6, u8 vni[]) +{ + struct hlist_head *vni_list_head; + struct geneve_dev *geneve; + __u32 hash; + + /* Find the device for this VNI */ + hash = geneve_net_vni_hash(vni); + vni_list_head = >vni_list[hash]; + hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) { + if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) && + ipv6_addr_equal(, >remote.sin6.sin6_addr)) + return geneve; + } + return NULL; +} +#endif + static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb) { return (struct genevehdr *)(udp_hdr(skb) + 1); @@ -121,24 +152,49 @@ static void geneve_rx(struct geneve_sock *gs, struct sk_buff *skb) struct metadata_dst *tun_dst = NULL; struct geneve_dev *geneve = NULL; struct pcpu_sw_netstats *stats; - struct iphdr *iph; - u8 *vni; + struct iphdr *iph = NULL; __be32 addr; - int err; + static u8 zero_vni[3]; + u8 *vni; + int err = 0; + sa_family_t sa_family; +#if IS_ENABLED(CONFIG_IPV6) + struct ipv6hdr *ip6h = NULL; + struct in6_addr addr6; + static struct in6_addr zero_addr6; +#endif + + sa_family = gs->sock->sk->sk_family; - iph = ip_hdr(skb); /* outer IP header... */ + if (sa_family == AF_INET) { + iph = ip_hdr(skb); /* outer IP header... */ - if
[PATCH v5 2/2] geneve: handle ipv6 priority like ipv4 tos
Signed-off-by: John W. LinvilleReported-by: Jesse Gross Reviewed-by: Jesse Gross --- v5 -- same as previous revision drivers/net/geneve.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 47f7512f02cf..4d4d8ca9eb7a 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -755,6 +755,7 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb, struct geneve_dev *geneve = netdev_priv(dev); struct geneve_sock *gs6 = geneve->sock6; struct dst_entry *dst = NULL; + __u8 prio; memset(fl6, 0, sizeof(*fl6)); fl6->flowi6_mark = skb->mark; @@ -763,7 +764,16 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb, if (info) { fl6->daddr = info->key.u.ipv6.dst; fl6->saddr = info->key.u.ipv6.src; + fl6->flowi6_tos = RT_TOS(info->key.tos); } else { + prio = geneve->tos; + if (prio == 1) { + const struct iphdr *iip = ip_hdr(skb); + + prio = ip_tunnel_get_dsfield(iip, skb); + } + + fl6->flowi6_tos = RT_TOS(prio); fl6->daddr = geneve->remote.sin6.sin6_addr; } @@ -884,8 +894,9 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct geneve_dev *geneve = netdev_priv(dev); struct geneve_sock *gs6 = geneve->sock6; struct dst_entry *dst = NULL; + const struct iphdr *iip; /* interior IP header */ struct flowi6 fl6; - __u8 ttl; + __u8 prio, ttl; __be16 sport; bool udp_csum; int err; @@ -914,6 +925,8 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true); skb_reset_mac_header(skb); + iip = ip_hdr(skb); + if (info) { const struct ip_tunnel_key *key = >key; u8 *opts = NULL; @@ -930,6 +943,7 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, if (unlikely(err)) goto err; + prio = ip_tunnel_ecn_encap(key->tos, iip, skb); ttl = key->ttl; } else { udp_csum = false; @@ -938,13 +952,14 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, if (unlikely(err)) goto err; + prio = ip_tunnel_ecn_encap(fl6.flowi6_tos, iip, skb); ttl = geneve->ttl; if (!ttl && ipv6_addr_is_multicast()) ttl = 1; ttl = ttl ? : ip6_dst_hoplimit(dst); } err = udp_tunnel6_xmit_skb(dst, gs6->sock->sk, skb, dev, - , , 0, ttl, + , , prio, ttl, sport, geneve->dst_port, !udp_csum); iptunnel_xmit_stats(err, >stats, dev->tstats); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote: > >> The external behaviour atomic; you cannot distinguish the order >> between the closing of the original file (and waking up other threads >> waiting for a record lock) or changing the file referenced by that newfd. >> >> But this not include a global or process specific lock. > >Interesting... Do you mean that decriptor-to-file lookup blocks until that >rundown finishes? For that particular file descriptor, yes. (I'm assuming you mean the Solaris kernel running down all lwps who have a system in progress on that particular file descriptor). All other fd to file lookups are not blocked at all by this locking. It should be clear that any such occurrences are application errors and should be hardly ever seen in practice. It is also known when this is needed so it is hardly even attempted. Casper -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/15] net: wireless: ath: use | instead of + for summing bitmasks
On Thu, Oct 22, 2015 at 3:13 AM, Sergei Shtylyovwrote: > Hello. > > On 10/21/2015 05:55 PM, Punit Vara wrote: > >> This patch is to the ath10k/pci.h file that fixes following warning > > >pci.c, you mean? > > >> reported by coccicheck: >> >> WARNING: sum of probable bitmasks, consider | >> >> I have replaced + with OR operator | for summing bitmasks >> >> Signed-off-by: Punit Vara >> --- >> drivers/net/wireless/ath/ath10k/pci.c | 10 +- >> 1 file changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/net/wireless/ath/ath10k/pci.c >> b/drivers/net/wireless/ath/ath10k/pci.c >> index 1046ab6..165a318 100644 >> --- a/drivers/net/wireless/ath/ath10k/pci.c >> +++ b/drivers/net/wireless/ath/ath10k/pci.c >> @@ -775,7 +775,7 @@ static u32 ath10k_pci_targ_cpu_to_ce_addr(struct >> ath10k *ar, u32 addr) >> switch (ar->hw_rev) { >> case ATH10K_HW_QCA988X: >> case ATH10K_HW_QCA6174: >> - val = (ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS + >> + val = (ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS | >> CORE_CTRL_ADDRESS) & >>0x7ff) << 21; >> break; >> @@ -1443,10 +1443,10 @@ static void ath10k_pci_irq_msi_fw_mask(struct >> ath10k *ar) >> switch (ar->hw_rev) { >> case ATH10K_HW_QCA988X: >> case ATH10K_HW_QCA6174: >> - val = ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS + >> + val = ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS | >> CORE_CTRL_ADDRESS); >> val &= ~CORE_CTRL_PCIE_REG_31_MASK; >> - ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + >> + ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS | > > >Don't think these 2 are justified. > > >> @@ -1464,10 +1464,10 @@ static void ath10k_pci_irq_msi_fw_unmask(struct >> ath10k *ar) >> switch (ar->hw_rev) { >> case ATH10K_HW_QCA988X: >> case ATH10K_HW_QCA6174: >> - val = ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS + >> + val = ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS | >> CORE_CTRL_ADDRESS); >> val |= CORE_CTRL_PCIE_REG_31_MASK; >> - ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + >> + ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS | >>CORE_CTRL_ADDRESS, val); > > >And these too. > > [...] > > MBR, Sergei > CORE_CTRL_ADDRESS is 0x so it will not mask .. these patch should be rejected ...I have modified by looking at coccicheck .Actually First time I have used that tool I do know it can also generate false warning sometime .I have experience about checkpatch.pl ..Sorry for this patch rest I have resend -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 15/15] net: wireless: ath: Remove unneeded variable ret returning 0
On Thu, Oct 22, 2015 at 3:16 AM, Sergei Shtylyovwrote: > On 10/21/2015 05:55 PM, Punit Vara wrote: > >> This patch is to the ath5k/eeprom.c that fixes up warning caught by >> coccicheck: >> >> Unneeded variable: "ret". Return "0" on line 980 >> >> Remove unneeded variable ret created to return zero. >> >> Signed-off-by: Punit Vara >> --- >> drivers/net/wireless/ath/wcn36xx/main.c | 3 +-- >> 1 file changed, 1 insertion(+), 2 deletions(-) >> >> diff --git a/drivers/net/wireless/ath/wcn36xx/main.c >> b/drivers/net/wireless/ath/wcn36xx/main.c >> index 7c169ab..82982d5 100644 >> --- a/drivers/net/wireless/ath/wcn36xx/main.c >> +++ b/drivers/net/wireless/ath/wcn36xx/main.c >> @@ -935,7 +935,6 @@ static const struct ieee80211_ops wcn36xx_ops = { >> >> static int wcn36xx_init_ieee80211(struct wcn36xx *wcn) >> { >> - int ret = 0; >> > > This empty line should be removed too. > >> static const u32 cipher_suites[] = { >> WLAN_CIPHER_SUITE_WEP40, > > [...] > > MBR, Sergei > I have resend this patch as you said Sergei . Thanks for feedback .. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 14/15] net: wireless: ath: Remove unneeded variable ret returning 0
This patch is to the ath5k/eeprom.c that fixes up warning caught by coccicheck: -Unneeded variable: "ret". Return "0" on line 1733 Remove unneccesary variable ret created to return zero. Also removed empty line suggested by Sergei Signed-off-by: Punit Vara--- drivers/net/wireless/ath/ath5k/eeprom.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/net/wireless/ath/ath5k/eeprom.c b/drivers/net/wireless/ath/ath5k/eeprom.c index 94d34ee..0c82ea5 100644 --- a/drivers/net/wireless/ath/ath5k/eeprom.c +++ b/drivers/net/wireless/ath/ath5k/eeprom.c @@ -1707,8 +1707,7 @@ ath5k_eeprom_read_spur_chans(struct ath5k_hw *ah) struct ath5k_eeprom_info *ee = >ah_capabilities.cap_eeprom; u32 offset; u16 val; - int ret = 0, i; - + int i; offset = AR5K_EEPROM_CTL(ee->ee_version) + AR5K_EEPROM_N_CTLS(ee->ee_version); @@ -1730,7 +1729,7 @@ ath5k_eeprom_read_spur_chans(struct ath5k_hw *ah) } } - return ret; + return 0; } -- 2.5.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Hi Johannes, On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: ... > Patch #5 adds accounting and tracking of socket memory to the unified > hierarchy memory controller, as described above. It uses the existing > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > Patch #8 uses the vmpressure extension to equalize pressure between > the pages tracked natively by the VM and socket buffer pages. As the > pool is shared, it makes sense that while natively tracked pages are > under duress the network transmit windows are also not increased. First of all, I've no experience in networking, so I'm likely to be mistaken. Nevertheless I beg to disagree that this patch set is a step in the right direction. Here goes why. I admit that your idea to get rid of explicit tcp window control knobs and size it dynamically basing on memory pressure instead does sound tempting, but I don't think it'd always work. The problem is that in contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only stop growing them. Now suppose a system hasn't experienced memory pressure for a while. If we don't have explicit tcp window limit, tcp buffers on such a system might have eaten almost all available memory (because of network load/problems). If a user workload that needs a significant amount of memory is started suddenly then, the network code will receive a notification and surely stop growing buffers, but all those buffers accumulated won't disappear instantly. As a result, the workload might be unable to find enough free memory and have no choice but invoke OOM killer. This looks unexpected from the user POV. That said, I think we do need per memcg tcp window control similar to what we have system-wide. In other words, Glauber's work makes sense to me. You might want to point me at my RFC patch where I proposed to revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my mind since then. Now I think I was mistaken, luckily I was stopped. However, I may be mistaken again :-) Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] net: try harder to not reuse ifindex when moving interfaces
On 10/22/15 at 07:21pm, Hannes Frederic Sowa wrote: > Hi Thomas, > > On Thu, Oct 22, 2015, at 18:45, Thomas Graf wrote: > > I understand the race but when does it occur? Whoever creates > > the original interface owns it and is responsible for its > > lifecycle. *Iff* for some reason multiple entities manipulate > > the interface, then it's probably a lot safer to just use flock > > or something similar to serialize access entirely in user space. > > This only works if all networking configuration programs would > standardize on the same flock. Also, under memory pressure we lose > netlink monitor messages, so we need to deal with timeouts and retries > and manual sync up on the networking configuration, which makes this > scheme a lot harder. For normal socket io, where we specify e.g. ifindex > in sin6_addr, this is not really usable at all. Again, what is the scenario where this happens? Is this being hit or are we talking theoretical races? I'd like to understand the background of this. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote: > On 22/10/2015 18:05, Al Viro wrote: > > >Oh, for... Right in this thread an example of complete BS has been quoted > >from POSIX close(2). The part about closing a file when the last descriptor > >gets closed. _Nothing_ is POSIX-compliant in that respect (nor should > >it be). > > That's not exactly what it says, we've already discussed, for > example in the case of pending async IO on a filehandle. Sigh... It completely fails to mention descriptor-passing. Which a) is relevant to what "last close" means and b) had been there for nearly the third of a century. > I agree that part could do with some polishing. google("wire brush of enlightenment") is what comes to mind... > >and (b) says fsck-all about the effects of closing descriptor. The latter > >is a problem, since nothing in close(2) bothers making a distinction between > >the effects specific to particular syscall and those common to all ways of > >closing a descriptor. And no, it's not a nitpicking - consider e.g. the > >parts concerning the order of events triggered by close(2) (such and such > >should be completed before close(2) returns); should it be taken as "same > >events should be completed before newfd is associated with the file > >description > >refered to by oldfd"? It _is_ user-visible, since close(2) removes fcntl > >locks. Sure, there is (otherwise unexplained) > > The dup2() function is not intended for use in critical regions > > as a synchronization mechanism. > >down in informative sections, so one can infer that event order here isn't > >to be relied upon. With no way to guess whether the event order concerning > >e.g. effect on ongoing accept(newfd) is any different in that respect. > > I think "it shall be closed first" makes it pretty clear that what > is expected is the same behaviour as any direct invocation of close, > and that has to happen before the reassignment. What makes you > believe that's isn't the case? So unless I'm misparsing something, you want thread A: accept(newfd) thread B: dup2(oldfd, newfd) have accept() bugger off before the switchover happens? What should happen if thread C does accept(newfd) right as B has decided that there's nothing more to wait? For close(newfd) it would be simple - we are going to have lookup by descriptor fail with EBADF anyway, so making it do so as soon as we go hunting for those who are currently in accept(newfd) would do the trick - no new threads like that shall appear and as long as the descriptor is not declared free for taking by descriptor allocation nobody is going to be screwed by open() picking that slot of descriptor table too early. Trying to do that for dup2() would lose atomicity. I honestly don't know how Solaris behaves in that case, BTW - the race (if any) would probably be hard to hit, so in case of Linux I would have to go and RTFS before saying that there isn't one. I can't do that in with Solaris; all I can do here is ask you guys... Moreover, see above for record locks removal. Should that happen prior to switchover? If you have dup(fd, fd2); set a record lock on fd2 spawn a thread in child, try to grab the same lock on fd2 in parent, do some work and close(fd) you are guaranteed that child won't see fd refering to the same file after it acquires the lock. Replace close(fd) with dup(fd3, fd); should the same hold true in that case? FWIW, Linux behaviour in that area is to have record locks removal done between the switchover and return to userland in case of dup2() and between the removal from descriptor table and return to userland in case of close(). > Personally I believe the spec is clear enough to allow an > unambiguous interpretation of the required behavior in this area. If > you think there are areas where the Solaris behaviour is in > disagreement with the spec then I'd be interested to hear them. The spec is so vague that I strongly suspect that *both* Solaris and Linux behaviours are not in disagreement with it (modulo shutdown(2) extension Linux-side and we are really stuck with that one). -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote: > The external behaviour atomic; you cannot distinguish the order > between the closing of the original file (and waking up other threads > waiting for a record lock) or changing the file referenced by that newfd. > > But this not include a global or process specific lock. Interesting... Do you mean that decriptor-to-file lookup blocks until that rundown finishes? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
CONFIG_XPS depends on L1_CACHE_BYTES being greater than sizeof(struct xps_map)
Hi Tom & David, I've queued-up a patch for the parisc architecture which reduces L1_CACHE_BYTES from 32 to 16: https://patchwork.kernel.org/patch/7399291/ But this change will break the kernel build like this: In file included from net/core/dev.c:92:0: net/core/dev.c: In function ‘expand_xps_map’: include/linux/netdevice.h:721:27: warning: overflow in implicit constant conversion [-Woverflow] #define XPS_MIN_MAP_ALLOC ((L1_CACHE_BYTES - sizeof(struct xps_map)) \ net/core/dev.c:1972:18: note: in expansion of macro ‘XPS_MIN_MAP_ALLOC’ int alloc_len = XPS_MIN_MAP_ALLOC; Do you see an easy way to fix this ? Thanks, Helge -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 14/15] net: wireless: ath: Remove unneeded variable ret returning 0
On Fri, Oct 23, 2015 at 12:05 AM, Sergei Shtylyovwrote: > Hello. > > On 10/22/2015 09:26 PM, Punit Vara wrote: > >> This patch is to the ath5k/eeprom.c that fixes up warning caught by >> coccicheck: >> >> -Unneeded variable: "ret". Return "0" on line 1733 >> >> Remove unneccesary variable ret created to return zero. >> >> Also removed empty line suggested by Sergei >> >> Signed-off-by: Punit Vara >> --- >> drivers/net/wireless/ath/ath5k/eeprom.c | 5 ++--- >> 1 file changed, 2 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/net/wireless/ath/ath5k/eeprom.c >> b/drivers/net/wireless/ath/ath5k/eeprom.c >> index 94d34ee..0c82ea5 100644 >> --- a/drivers/net/wireless/ath/ath5k/eeprom.c >> +++ b/drivers/net/wireless/ath/ath5k/eeprom.c >> @@ -1707,8 +1707,7 @@ ath5k_eeprom_read_spur_chans(struct ath5k_hw *ah) >> struct ath5k_eeprom_info *ee = >ah_capabilities.cap_eeprom; >> u32 offset; >> u16 val; >> - int ret = 0, i; >> - > > >No, this one shouldn't have been removed. There should an empty line > between the declarations and the statements. > >> + int i; >> offset = AR5K_EEPROM_CTL(ee->ee_version) + >> AR5K_EEPROM_N_CTLS(ee->ee_version); >> > > MBR, Sergei > I have resent patch with update but I cant see it in my mailbox I dont know Why.. Sorry if You are getting mail more time. I am sending final version v5 as last.Please sorry for inconvenience -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 14/15] net: wireless: ath: Remove unneeded variable ret returning 0
Remove empty line suggested by Sergei This patch is to the ath5k/eeprom.c that fixes up warning caught by coccicheck: Unneeded variable: "ret". Return "0" on line 980 Remove unneeded variable ret created to return zero. Signed-off-by: Punit Vara--- drivers/net/wireless/ath/wcn36xx/main.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/net/wireless/ath/wcn36xx/main.c b/drivers/net/wireless/ath/wcn36xx/main.c index 900e72a..94bcc08 100644 --- a/drivers/net/wireless/ath/wcn36xx/main.c +++ b/drivers/net/wireless/ath/wcn36xx/main.c @@ -935,8 +935,6 @@ static const struct ieee80211_ops wcn36xx_ops = { static int wcn36xx_init_ieee80211(struct wcn36xx *wcn) { - int ret = 0; - static const u32 cipher_suites[] = { WLAN_CIPHER_SUITE_WEP40, WLAN_CIPHER_SUITE_WEP104, @@ -977,7 +975,7 @@ static int wcn36xx_init_ieee80211(struct wcn36xx *wcn) wcn->hw->sta_data_size = sizeof(struct wcn36xx_sta); wcn->hw->vif_data_size = sizeof(struct wcn36xx_vif); - return ret; + return 0; } static int wcn36xx_platform_get_resources(struct wcn36xx *wcn, -- 2.5.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure
On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote: ... > @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work) > vmpr->reclaimed = 0; > spin_unlock(>sr_lock); > > + level = vmpressure_calc_level(scanned, reclaimed); > + > + if (level > VMPRESSURE_LOW) { So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL? > + struct mem_cgroup *memcg; > + /* > + * Let the socket buffer allocator know that we are > + * having trouble reclaiming LRU pages. > + * > + * For hysteresis, keep the pressure state asserted > + * for a second in which subsequent pressure events > + * can occur. > + * > + * XXX: is vmpressure a global feature or part of > + * memcg? There shouldn't be anything memcg-specific > + * about exporting reclaim success ratios from the VM. > + */ > + memcg = container_of(vmpr, struct mem_cgroup, vmpressure); > + if (memcg != root_mem_cgroup) > + memcg->socket_pressure = jiffies + HZ; Why 1 second? Thanks, Vladimir > + } > + > do { > - if (vmpressure_event(vmpr, scanned, reclaimed)) > + if (vmpressure_event(vmpr, level)) > break; > /* >* If not handled, propagate the event upward into the -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote: > > > >And I'm really curious about the things Solaris would do with dup2() there. > >Does it take into account the possibility of new accept() coming just as > >dup2() is trying to terminate the ongoing ones? Is there a window when > >descriptor-to-file lookups would fail? Looks like a race/deadlock country... > > Solaris does not "terminate" threads, instead it tells them that the > file descriptor information used is stale and wkae's up the thread. Sorry, lousy wording - I meant "terminate syscall in another thread". Better yet, make that "what happens if new accept(newfd) comes while dup2() waits for affected syscalls in other threads to finish"? Assuming it does wait, that is... While we are at it, what's the relative order of record locks removal and switching the meaning of newfd? In our kernel it happens *after* the switchover (i.e. if another thread is waiting for a record lock held on any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see the state with newfd still refering to what it used to; note that waiter might be using any descriptor refering to the file newfd used to refer to, so it won't be affected by the "wake those who had the meaning of their arguments change" side of things). -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 14/15] net: wireless: ath: Remove unneeded variable ret returning 0
Hello. On 10/22/2015 09:26 PM, Punit Vara wrote: This patch is to the ath5k/eeprom.c that fixes up warning caught by coccicheck: -Unneeded variable: "ret". Return "0" on line 1733 Remove unneccesary variable ret created to return zero. Also removed empty line suggested by Sergei Signed-off-by: Punit Vara--- drivers/net/wireless/ath/ath5k/eeprom.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/net/wireless/ath/ath5k/eeprom.c b/drivers/net/wireless/ath/ath5k/eeprom.c index 94d34ee..0c82ea5 100644 --- a/drivers/net/wireless/ath/ath5k/eeprom.c +++ b/drivers/net/wireless/ath/ath5k/eeprom.c @@ -1707,8 +1707,7 @@ ath5k_eeprom_read_spur_chans(struct ath5k_hw *ah) struct ath5k_eeprom_info *ee = >ah_capabilities.cap_eeprom; u32 offset; u16 val; - int ret = 0, i; - No, this one shouldn't have been removed. There should an empty line between the declarations and the statements. + int i; offset = AR5K_EEPROM_CTL(ee->ee_version) + AR5K_EEPROM_N_CTLS(ee->ee_version); MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 14/15] net: wireless: ath: Remove unneeded variable ret returning 0
On 10/22/2015 09:47 PM, Punit Vara wrote: Remove black line suggested by Sergei Such kind of comments should be under the --- tear line. This patch is to the ath5k/eeprom.c that fixes up warning caught by coccicheck: Unneeded variable: "ret". Return "0" on line 980 Remove unneeded variable ret created to return zero. Signed-off-by: Punit Vara[...] MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] devicetree: add binding for Aurora VLSI NB8800 Ethernet controller
On Thu, Oct 22, 2015 at 11:28 AM, Mans Rullgardwrote: > This adds a binding for the Aurora VLSI NB8800 Ethernet controller > using the "aurora,nb8800" compatible string. When used in Sigma > Designs chips a few additional control registers are available. > This variant is indicated by the "sigma,smp8640-ethernet" compatible > string. > > Signed-off-by: Mans Rullgard Acked-by: Rob Herring > --- > Changes: > - correct order of compatible strings > --- > .../devicetree/bindings/net/aurora,nb8800.txt | 26 > ++ > 1 file changed, 26 insertions(+) > create mode 100644 Documentation/devicetree/bindings/net/aurora,nb8800.txt > > diff --git a/Documentation/devicetree/bindings/net/aurora,nb8800.txt > b/Documentation/devicetree/bindings/net/aurora,nb8800.txt > new file mode 100644 > index 000..cf7108d > --- /dev/null > +++ b/Documentation/devicetree/bindings/net/aurora,nb8800.txt > @@ -0,0 +1,26 @@ > +* Aurora VLSI AU-NB8800 Ethernet controller > + > +Required properties: > +- compatible: Should be "sigma,smp8640-ethernet", "aurora,nb8800" > + The former indicates presence of extra features added by Sigma > Designs. > +- reg: Should be MMIO address space of the device > +- interrupts: Should contain the interrupt specifier for the device > +- interrupt-parent: Should be a phandle for the interrupt controller > +- clocks: Should be a phandle for the clock for the device > + > +Common properties described in ethernet.txt: > +- local-mac-address > +- mac-address > +- max-speed > +- phy-mode > + > +Example: > + > +ethernet@26000 { > + compatible = "aurora,nb8800"; > + reg = <0x1 0x800>; > + interrupts = <42>; > + clocks = <_clk>; > + max-speed = <1000>; > + phy-connection-type = "rgmii"; > +}; > -- > 2.5.3 > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro>On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote: >> On 22/10/2015 18:05, Al Viro wrote: >> >> >Oh, for... Right in this thread an example of complete BS has been quoted >> >from POSIX close(2). The part about closing a file when the last descriptor >> >gets closed. _Nothing_ is POSIX-compliant in that respect (nor should >> >it be). >> >> That's not exactly what it says, we've already discussed, for >> example in the case of pending async IO on a filehandle. > >Sigh... It completely fails to mention descriptor-passing. Which > a) is relevant to what "last close" means and > b) had been there for nearly the third of a century. Why is that different? These clearly count as file descriptors. >> I agree that part could do with some polishing. > >google("wire brush of enlightenment") is what comes to mind... Standardese is similar to legalese; it not writing that is directly open to interpretation to those who are not inducted in writing may have some problem interpreting what exactly is meant by wording of the standard. >> I think "it shall be closed first" makes it pretty clear that what >> is expected is the same behaviour as any direct invocation of close, >> and that has to happen before the reassignment. What makes you >> believe that's isn't the case? > >So unless I'm misparsing something, you want >thread A: accept(newfd) >thread B: dup2(oldfd, newfd) >have accept() bugger off before the switchover happens? Well, certainly *before* we return from dup2(). (and clearly only once we have determined that dup2() will return successfully) >What should happen if thread C does accept(newfd) right as B has decided that >there's nothing more to wait? For close(newfd) it would be simple - we are >going to have lookup by descriptor fail with EBADF anyway, so making it do >so as soon as we go hunting for those who are currently in accept(newfd) >would do the trick - no new threads like that shall appear and as long as >the descriptor is not declared free for taking by descriptor allocation nobody >is going to be screwed by open() picking that slot of descriptor table too >early. Trying to do that for dup2() would lose atomicity. I honestly don't >know how Solaris behaves in that case, BTW - the race (if any) would probably >be hard to hit, so in case of Linux I would have to go and RTFS before saying >that there isn't one. I can't do that in with Solaris; all I can do here >is ask you guys... Solaris dup2() behaves exactly like close(). >Moreover, see above for record locks removal. Should that happen prior to >switchover? If you have > >dup(fd, fd2); >set a record lock on fd2 >spawn a thread >in child, try to grab the same lock on fd2 >in parent, do some work and close(fd) >you are guaranteed that child won't see fd refering to the same file after it >acquires the lock. Here's you are talking about a lock held by the "parent" and that the "child" will only get the lock once close(fd) is done? Yes. The final "close" is done *after* the pointer has been removed from the file descriptor table. >Replace close(fd) with dup(fd3, fd); should the same hold true in that case? Yes. >FWIW, Linux behaviour in that area is to have record locks removal done >between the switchover and return to userland in case of dup2() and between >the removal from descriptor table and return to userland in case of close(). > >> Personally I believe the spec is clear enough to allow an >> unambiguous interpretation of the required behavior in this area. If >> you think there are areas where the Solaris behaviour is in >> disagreement with the spec then I'd be interested to hear them. > >The spec is so vague that I strongly suspect that *both* Solaris and Linux >behaviours are not in disagreement with it (modulo shutdown(2) extension >Linux-side and we are really stuck with that one). I'm not sure if the standard allows a handful of threads in accept() for a file descriptor which has already been closed *and* can be re-issued for other uses. Casper -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: of_mdiobus_register_phy() and deferred probe
Hello. On 10/22/2015 04:31 PM, Geert Uytterhoeven wrote: Due to a probe deferral of an interrupt controller[1], the Micrel Ethernet PHY on r8a7791/koelsch started failing to get its IRQ: no irq domain found for /interrupt-controller@e61c ! However, of_mdiobus_register_phy() uses irq_of_parse_and_map(), which plainly ignores EPROBE_DEFER, and it just continues. Later I get: sh-eth ee70.ethernet eth0: attached PHY 1 (IRQ -1) to driver Micrel KSZ8041RNLI instead of sh-eth ee70.ethernet eth0: attached PHY 1 (IRQ 408) to driver Micrel KSZ8041RNLI Ethernet still works, as the interrupt seems to be unneeded(?). Yes, the phylib uses PHY polling anyway, IRQ isn't strictly necessary. Has anyone already looked into fixing of_mdio to handle deferred probing? It's the first time I hear about that. Will have to look into this... Gr{oetje,eeting}s, Geert MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] can: xilinx: use readl/writel instead of ioread/iowrite
On 10/22/2015 10:14 AM, Arnd Bergmann wrote: > On Thursday 22 October 2015 10:16:02 Kedareswara rao Appana wrote: >> The driver only supports memory-mapped I/O [by ioremap()], >> so readl/writel is actually the right thing to do, IMO. >> During the validation of this driver or IP on ARM 64-bit processor >> while sending lot of packets observed that the tx packet drop with iowrite >> Putting the barriers for each tx fifo register write fixes this issue >> Instead of barriers using writel also fixed this issue. >> >> Signed-off-by: Kedareswara rao Appana> > The two should really do the same thing: iowrite32() is just a static inline > calling writel() on both ARM32 and ARM64. On which kernel version did you > observe the difference? It's possible that an older version used > CONFIG_GENERIC_IOMAP, which made this slightly more expensive. > > If there are barriers that you want to get rid of for performance reasons, > you should use writel_relaxed(), but be careful to synchronize them correctly > with regard to DMA. It should be fine in this driver, as it does not > perform any DMA, but be aware that there is no big-endian version of > writel_relaxed() at the moment. We don't have DMA in CAN drivers, but usually a certain write triggers sending. Do we need a barrier before triggering the sending? Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions| Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917- | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | signature.asc Description: OpenPGP digital signature
[PATCH 3/3] net: dsa: Make mv88e6060 use nested mdiobus read/write
Like mv88e6xxx and mdio-mux, to avoid lockdep give false positives because of nested MDIO busses, switch to previously introduced nested mdiobus_read/write variants. Signed-off-by: Neil Armstrong--- drivers/net/dsa/mv88e6060.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c index c29aebe..9093577 100644 --- a/drivers/net/dsa/mv88e6060.c +++ b/drivers/net/dsa/mv88e6060.c @@ -26,7 +26,7 @@ static int reg_read(struct dsa_switch *ds, int addr, int reg) if (bus == NULL) return -EINVAL; - return mdiobus_read(bus, ds->pd->sw_addr + addr, reg); + return mdiobus_read_nested(bus, ds->pd->sw_addr + addr, reg); } #define REG_READ(addr, reg)\ @@ -47,7 +47,7 @@ static int reg_write(struct dsa_switch *ds, int addr, int reg, u16 val) if (bus == NULL) return -EINVAL; - return mdiobus_write(bus, ds->pd->sw_addr + addr, reg, val); + return mdiobus_write_nested(bus, ds->pd->sw_addr + addr, reg, val); } #define REG_WRITE(addr, reg, val) \ -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] net: dsa: Make mv88e6xxx use nested mdiobus read/write
Make the mv88e6xxx driver use the previously introduced nested variants of mdiobus_read/write functions. Signed-off-by: Neil Armstrong--- drivers/net/dsa/mv88e6xxx.c | 46 + 1 file changed, 9 insertions(+), 37 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 4591240..e3cc66e 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -26,34 +26,6 @@ #include #include "mv88e6xxx.h" -/* MDIO bus access can be nested in the case of PHYs connected to the - * internal MDIO bus of the switch, which is accessed via MDIO bus of - * the Ethernet interface. Avoid lockdep false positives by using - * mutex_lock_nested(). - */ -static int mv88e6xxx_mdiobus_read(struct mii_bus *bus, int addr, u32 regnum) -{ - int ret; - - mutex_lock_nested(>mdio_lock, SINGLE_DEPTH_NESTING); - ret = bus->read(bus, addr, regnum); - mutex_unlock(>mdio_lock); - - return ret; -} - -static int mv88e6xxx_mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, - u16 val) -{ - int ret; - - mutex_lock_nested(>mdio_lock, SINGLE_DEPTH_NESTING); - ret = bus->write(bus, addr, regnum, val); - mutex_unlock(>mdio_lock); - - return ret; -} - /* If the switch's ADDR[4:0] strap pins are strapped to zero, it will * use all 32 SMI bus addresses on its SMI bus, and all switch registers * will be directly accessible on some {device address,register address} @@ -68,7 +40,7 @@ static int mv88e6xxx_reg_wait_ready(struct mii_bus *bus, int sw_addr) int i; for (i = 0; i < 16; i++) { - ret = mv88e6xxx_mdiobus_read(bus, sw_addr, SMI_CMD); + ret = mdiobus_read_nested(bus, sw_addr, SMI_CMD); if (ret < 0) return ret; @@ -84,7 +56,7 @@ int __mv88e6xxx_reg_read(struct mii_bus *bus, int sw_addr, int addr, int reg) int ret; if (sw_addr == 0) - return mv88e6xxx_mdiobus_read(bus, addr, reg); + return mdiobus_read_nested(bus, addr, reg); /* Wait for the bus to become free. */ ret = mv88e6xxx_reg_wait_ready(bus, sw_addr); @@ -92,8 +64,8 @@ int __mv88e6xxx_reg_read(struct mii_bus *bus, int sw_addr, int addr, int reg) return ret; /* Transmit the read command. */ - ret = mv88e6xxx_mdiobus_write(bus, sw_addr, SMI_CMD, - SMI_CMD_OP_22_READ | (addr << 5) | reg); + ret = mdiobus_write_nested(bus, sw_addr, SMI_CMD, + SMI_CMD_OP_22_READ | (addr << 5) | reg); if (ret < 0) return ret; @@ -103,7 +75,7 @@ int __mv88e6xxx_reg_read(struct mii_bus *bus, int sw_addr, int addr, int reg) return ret; /* Read the data. */ - ret = mv88e6xxx_mdiobus_read(bus, sw_addr, SMI_DATA); + ret = mdiobus_read_nested(bus, sw_addr, SMI_DATA); if (ret < 0) return ret; @@ -147,7 +119,7 @@ int __mv88e6xxx_reg_write(struct mii_bus *bus, int sw_addr, int addr, int ret; if (sw_addr == 0) - return mv88e6xxx_mdiobus_write(bus, addr, reg, val); + return mdiobus_write_nested(bus, addr, reg, val); /* Wait for the bus to become free. */ ret = mv88e6xxx_reg_wait_ready(bus, sw_addr); @@ -155,13 +127,13 @@ int __mv88e6xxx_reg_write(struct mii_bus *bus, int sw_addr, int addr, return ret; /* Transmit the data to write. */ - ret = mv88e6xxx_mdiobus_write(bus, sw_addr, SMI_DATA, val); + ret = mdiobus_write_nested(bus, sw_addr, SMI_DATA, val); if (ret < 0) return ret; /* Transmit the write command. */ - ret = mv88e6xxx_mdiobus_write(bus, sw_addr, SMI_CMD, - SMI_CMD_OP_22_WRITE | (addr << 5) | reg); + ret = mdiobus_write_nested(bus, sw_addr, SMI_CMD, + SMI_CMD_OP_22_WRITE | (addr << 5) | reg); if (ret < 0) return ret; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: dsa: mv88e6060: Fix false positive lockdep splat
> Well, mdio-mux also calls switch_fn inside the mdio_lock, clean refactoring > would introduce a separate lock and call the nested variants. > Is that ok ? Can someone test mdio-mux if I make the change ? Hi Neil I would not touch mdio-mux. As you said, it does more than lock, read, unlock. It is not something sufficiently generic to place into shared code. Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 9/9] treewide: Remove newlines inside DEFINE_PER_CPU() macros
On 2015-10-22 13:31, Prarit Bhargava wrote: > > > On 10/21/2015 03:52 PM, Michal Marek wrote: >> Dne 21.10.2015 v 21:27 Prarit Bhargava napsal(a): >>> On 10/15/2015 04:16 PM, Michal Marek wrote: Otherwise make tags can't parse them: ctags: Warning: arch/ia64/kernel/smp.c:60: null expansion of name pattern "\1" ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1" ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1" ctags: Warning: kernel/locking/lockdep.c:153: null expansion of name pattern "\1" ctags: Warning: kernel/workqueue.c:305: null expansion of name pattern "\1" ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1" ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1" ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1" ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1" ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1" >>> >>> I guarantee you're going to end up fixing this issue over and over again as >>> more >>> code is added in. >> >> This is certainly going to happen, but it should be quickly spotted by >> anybody running make tags on linux-next. And 10 instances since the >> beginning of git is not too many. > > Not everyone uses 'make tags'. 'make cscope' exists and functions correctly > ;) cscope works, but unfortunately it cannot be extended to understand the preprocessor constructs. But it does not suffer from the problem at hand, obviously. >> It's not ctags itself parsing the DEFINE_PER_CPU() macro, but a >> user-supplied regex specified on commandline. Which can only operate on >> single lines. >> > > What's the regex? See https://lkml.kernel.org/r/1444940195-28272-9-git-send-email-mma...@suse.com It used to require a closing parenthesis, so it would not match the multiline macro invocations at all. Now it matches them, but ctags correctly warns that the empty string is probably not what we intended to match. Michal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
pull request: bluetooth-next 2015-10-22
Hi Dave, Here's probably the last bluetooth-next pull request for 4.4. Among several other changes it contains the rest of the fixes & cleanups from the Bluetooth UnplugFest (that didn't need to be hurried to 4.3). - Refactoring & cleanups to 6lowpan code - New USB ids for two Atheros controllers and BCM43142A0 from Broadcom - Fix (quirk) for broken Broadcom BCM2045 controllers - Support for latest Apple controllers - Improvements to the vendor diagnostic message support Please let me know if there are any issues pulling. Thanks. Johan --- The following changes since commit 26440c835f8b1a491e2704118ac55bf87334366c: Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2015-10-20 06:08:27 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git for-upstream for you to fetch changes up to 13972adc3240ea8b18b44906b819c622941a64b6: Bluetooth: Increase minor version of core module (2015-10-22 13:37:26 +0300) Alexander Aring (13): mac802154: llsec: use kzfree bluetooth: 6lowpan: use lowpan dispatch helpers 6lowpan: introduce LOWPAN_IPHC_MAX_HC_BUF_LEN 6lowpan: cleanup lowpan_header_compress 6lowpan: cleanup lowpan_header_decompress 6lowpan: remove lowpan_fetch_skb_u8 6lowpan: nhc: move iphc manipulation out of nhc 6lowpan: move IPHC functionality defines 6lowpan: remove lowpan_is_addr_broadcast 6lowpan: iphc: change define values 6lowpan: rework tc and flow label handling 6lowpan: put mcast compression in an own function ieee802154: 6lowpan: fix memory leak Arnd Bergmann (1): Bluetooth: bpa10x: fix BT_HCIUART dependency Dan Carpenter (1): Bluetooth: hci_bcm: checking for ERR_PTR instead of NULL David Herrmann (1): Bluetooth: hidp: fix device disconnect on idle timeout Dean Jenkins (3): Bluetooth: Unwind l2cap_sock_shutdown() Bluetooth: Reorganize mutex lock in l2cap_sock_shutdown() Bluetooth: l2cap_disconnection_req priority over shutdown Dmitry Tunin (2): Bluetooth: ath3k: Add new AR3012 0930:021c id Bluetooth: ath3k: Add support of AR3012 0cf3:817b device Johan Hedberg (16): Bluetooth: Don't use remote address type to decide IRK persistency Bluetooth: Fix removing connection parameters when unpairing Bluetooth: Fix missing hdev locking for LE scan cleanup Bluetooth: Add le_addr_type() helper function Bluetooth: Add hci_conn_hash_lookup_le() helper function Bluetooth: Use hci_conn_hash_lookup_le() when possible Bluetooth: 6lowpan: Use hci_conn_hash_lookup_le() when possible Bluetooth: Remove unnecessary indentation in unpair_device() Bluetooth: Add hdev helper variable to hci_le_create_connection_cancel Bluetooth: Remove redundant (and possibly wrong) flag clearing Bluetooth: Remove unnecessary hci_explicit_connect_lookup function Bluetooth: Disable auto-connection parameters when unpairing Bluetooth: Fix crash in SMP when unpairing Bluetooth: Introduce hci_req helper to abort a connection Bluetooth: Take advantage of connection abort helpers Bluetooth: Make hci_disconnect() behave correctly for all states Marcel Holtmann (21): Bluetooth: bpa10x: Fix missing BT_HCIUART dependency Bluetooth: btusb: Add support for Broadcom LM_DIAG interface Bluetooth: btintel: Add support for enabling tracing functionality Bluetooth: Remove quirk for HCI_VENDOR_PKT filter handling Bluetooth: Restrict valid packet types via HCI_CHANNEL_RAW Bluetooth: Queue diagnostic messages together with HCI packets Bluetooth: btusb: Print information of Intel SfP lock states Bluetooth: Add new quirk for non-persistent diagnostic settings Bluetooth: btintel: Set quirk for non-persistent diagnostic settings Bluetooth: btintel: Add diagnostic support for older controllers Bluetooth: btusb: Mark BCM2045 devices to have broken link key commands Bluetooth: btbcm: Fix firmware version number calculation Bluetooth: btbcm: Read USB product information for Apple devices Bluetooth: Add support setup stage internal notification event Bluetooth: btusb: Set early vendor info for Intel and Broadcom Bluetooth: btusb: Add support for latest Apple controllers Bluetooth: hci_uart: Provide initial manufacturer information Bluetooth: Introduce driver specific post init callback Bluetooth: btusb: Set manufacturer for Intel bootloader devices Bluetooth: btintel: Enable extra Intel vendor events Bluetooth: Increase minor version of core module Santtu Rekilä (1): Bluetooth: btusb: Add support for Foxconn/Lenovo BCM43142A0 (105b:e065) drivers/bluetooth/Kconfig| 2 +- drivers/bluetooth/ath3k.c| 4 + drivers/bluetooth/btbcm.c
Re: [PATCH 9/9] treewide: Remove newlines inside DEFINE_PER_CPU() macros
On 10/21/2015 03:52 PM, Michal Marek wrote: > Dne 21.10.2015 v 21:27 Prarit Bhargava napsal(a): >> On 10/15/2015 04:16 PM, Michal Marek wrote: >>> Otherwise make tags can't parse them: >>> >>> ctags: Warning: arch/ia64/kernel/smp.c:60: null expansion of name pattern >>> "\1" >>> ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name >>> pattern "\1" >>> ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name >>> pattern "\1" >>> ctags: Warning: kernel/locking/lockdep.c:153: null expansion of name >>> pattern "\1" >>> ctags: Warning: kernel/workqueue.c:305: null expansion of name pattern "\1" >>> ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern >>> "\1" >>> ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern >>> "\1" >>> ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1" >>> ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern >>> "\1" >>> ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern >>> "\1" >> >> I guarantee you're going to end up fixing this issue over and over again as >> more >> code is added in. > > This is certainly going to happen, but it should be quickly spotted by > anybody running make tags on linux-next. And 10 instances since the > beginning of git is not too many. Not everyone uses 'make tags'. 'make cscope' exists and functions correctly ;) > > >> OOC, why not fix ctags to recognize newlines? > > It's not ctags itself parsing the DEFINE_PER_CPU() macro, but a > user-supplied regex specified on commandline. Which can only operate on > single lines. > What's the regex? P. > Michal > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 9/9] treewide: Remove newlines inside DEFINE_PER_CPU() macros
On 10/22/2015 08:06 AM, Michal Marek wrote: > On 2015-10-22 13:31, Prarit Bhargava wrote: >> >> >> On 10/21/2015 03:52 PM, Michal Marek wrote: >>> Dne 21.10.2015 v 21:27 Prarit Bhargava napsal(a): On 10/15/2015 04:16 PM, Michal Marek wrote: > Otherwise make tags can't parse them: > > ctags: Warning: arch/ia64/kernel/smp.c:60: null expansion of name pattern > "\1" > ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name > pattern "\1" > ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name > pattern "\1" > ctags: Warning: kernel/locking/lockdep.c:153: null expansion of name > pattern "\1" > ctags: Warning: kernel/workqueue.c:305: null expansion of name pattern > "\1" > ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name > pattern "\1" > ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name > pattern "\1" > ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1" > ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern > "\1" > ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern > "\1" I guarantee you're going to end up fixing this issue over and over again as more code is added in. >>> >>> This is certainly going to happen, but it should be quickly spotted by >>> anybody running make tags on linux-next. And 10 instances since the >>> beginning of git is not too many. >> >> Not everyone uses 'make tags'. 'make cscope' exists and functions correctly >> ;) > > cscope works, but unfortunately it cannot be extended to understand the > preprocessor constructs. But it does not suffer from the problem at > hand, obviously. > > >>> It's not ctags itself parsing the DEFINE_PER_CPU() macro, but a >>> user-supplied regex specified on commandline. Which can only operate on >>> single lines. >>> >> >> What's the regex? > > See > https://lkml.kernel.org/r/1444940195-28272-9-git-send-email-mma...@suse.com > > It used to require a closing parenthesis, so it would not match the > multiline macro invocations at all. Now it matches them, but ctags > correctly warns that the empty string is probably not what we intended > to match. It seems wrong to change kernel code, not for a bug, but for a userspace search. P. > > Michal > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 07:51, casper@oracle.com wrote: It would more fruitful to have such a discussion in one of the OpenGroup mailing lists; people gathered there have a lot of experience and it is also possible to fix the standard when it turns out that it indeed as vague as you claim it is (I don't quite agree with that) +1. If there's interest in doing that I'll ask our POSIX rep the best way of initiating such a conversation. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC v3 2/2] iwlwifi: mvm: send large SKBs to the transport
> > On Wed, 2015-10-21 at 21:34 +0300, Emmanuel Grumbach wrote: > > + > > + if (skb->protocol == htons(ETH_P_IP)) { > > + ip_hdr(tmp)->id = ip_hdr(skb)->id; > > Too late, you already called consume_skb(skb). > So this is a potential use after free. Ouch - thanks for catching this! > > > + be16_add_cpu(_hdr(tmp)->id, i * > num_subframes); > > + } > > + > > > I would use > base_id = ip_hdr(skb)->id; // before the consume_skb(skb) > ip_hdr(tmp)->id = htons(base_id + i * num_subframes); > Sure - I will. I will send a v4 with the fixes.
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 05:21, Al Viro wrote: Most of the work on using a file descriptor is local to the thread. Using - sure, but what of cacheline dirtied every time you resolve a descriptor to file reference? Don't you have to do that anyway, to do anything useful with the file? How much does it cover and to what degree is that local to thread? When it's a refcount inside struct file - no big deal, we'll be reading the same cacheline anyway and unless several threads are pounding on the same file with syscalls at the same time, that won't be a big deal. But when refcounts are associated with descriptors... There is a refcount in the struct held in the per-process list of open files and the 'slow path' processing is only taken if there's more than one LWP in the process that's accessing the file. In case of Linux we have two bitmaps and an array of pointers associated with descriptor table. They grow on demand (in parallel) * reserving a descriptor is done under ->file_lock (dropped/regained around memory allocation if we end up expanding the sucker, actual reassignment of pointers to array/bitmaps is under that spinlock) * installing a pointer is lockless (we wait for ongoing resize to settle, RCU takes care of the rest) * grabbing a file by index is lockless as well * removing a pointer is under ->file_lock, so's replacing it by dup2(). Is that table per-process or global? The point is, dup2() over _unused_ descriptor is inherently racy, but dup2() over a descriptor we'd opened and kept open should be safe. As it is, their implementation boils down to "userland must do process-wide exclusion between *any* dup2() and all syscalls that might create a new descriptor - open()/pipe()/socket()/accept()/recvmsg()/dup()/etc". At the very least, it's a big QoI problem, especially since such userland exclusion would have to be taken around the operations that can block for a long time. Sure, POSIX wording regarding dup2 is so weak that this behaviour can be argued to be compliant, but... replacement of the opened file associated with newfd really ought to be atomic to be of any use for multithreaded processes. There's existing language in the Issue 7 dup2() description that says dup2() has to be atomic: "the dup2( ) function provides unique services, as no other interface is able to atomically replace an existing file descriptor." And there is some new language in Issue 7 Technical Corrigenda 2 that reinforces that, when it's talking about reassignment of stdin/stdout/stderr: "Furthermore, a close() followed by a reopen operation (e.g. open(), dup() etc) is not atomic; dup2() should be used to change standard file descriptors." I don't think that it's possible to claim that a non-atomic dup2() is POSIX-compliant. IOW, if newfd is an opened descriptor prior to dup2() and no other thread attempts to close it by any means, there should be no window during which it would appear to be not opened. Linux and FreeBSD satisfy that; OpenBSD seems to do the same, from the quick look. NetBSD doesn't, no idea about Solaris. FWIW, older NetBSD implementation (prior to "File descriptor changes, discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it had fixed a lot of crap, so it's entirely possible that OpenBSD simply has kept the old implementation, with tons of other races in that area, but this dup2() race got introduced in that rewrite. Related to dup2(), there's some rather surprising behaviour on Linux. Here's the scenario: -- ThreadA opens, listens and accepts on socket fd1, waiting for incoming connections. ThreadB waits for a while, then opens normal file fd2 for read/write. ThreadB uses dup2 to make fd1 a clone of fd2. ThreadB closes fd2. ThreadA remains sat in accept on fd1 which is now a plain file, not a socket. ThreadB writes to fd1, the result of which appears in the file, so fd1 is indeed operating as a plain file. ThreadB exits. ThreadA is still sat in accept on fd1. A connection is made to fd1 by another process. The accept call succeeds and returns the incoming connection. fd1 is still operating as a socket, even though it's now actually a plain file. -- I assume this is another consequence of the fact that threads waiting in accept don't get a notification if the fd they are using is closed, either directly by a call to close or by a syscall such as dup2. Not waking up other threads on a fd when it is closed seems like it's undesirable behaviour. I can see the reasoning behind allowing shutdown to be used to do such a wakeup even if that's not POSIX-compliant - it may make it slightly easier for applications avoid fd recycling races. However the current situation is that shutdown is the *only* way to perform such a wakeup - simply closing the fd has no effect on any other threads. That seems incorrect. -- Alan Burlison -- -- To unsubscribe from this list: send the line
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 05:44, Al Viro wrote: It's been said that the current mechanisms in Linux & some BSD variants can be subject to races You do realize that it goes for the entire area? And the races found in this thread are in the BSD variant that tries to do something similar to what you guys say Solaris is doing, so I'm not sure which way does that argument go. A high-level description with the same level of details will be race-free in _all_ of them. The devil is in the details, of course, and historically they had been very easy to get wrong. And extra complexity in that area doesn't make things better. Yes, I absolutely agree it's difficult to get it right. Modulo undetected bugs I believe the Solaris implementation is race free, and if it isn't I think it's fair to say we'd consider that to be a bug. , and the behaviour exhibited doesn't conform to POSIX, for example requiring the use of shutdown() on unconnected sockets because close() doesn't kick off other threads accept()ing on the same fd. Umm... The old kernels (and even more - old userland) are not going to disappear, so we are stuck with such uses of shutdown(2) anyway, POSIX or no POSIX. Yes, I understand the problem and in an earlier part of the discussion I said that I suspected that all that could really be done was to document the behaviour as changing it would break existing code. I still think that's probably the only workable option. I'd be interested to hear if there's a better and more performant way of handling the situation that doesn't involve doing the sort of bookkeeping Casper described,. So would a lot of other people. :-) To quote one of my colleague's favourite sayings: Performance is a goal, correctness is a constraint. Except that in this case "correctness" is the matter of rather obscure and ill-documented areas in POSIX. Don't get me wrong - this semantics isn't inherently bad, but it's nowhere near being an absolute requirement. I don't think it's *that* obscure, when I found the original shutdown() problem google showed up another occurrence pretty quickly - https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html. If a fd is closed then allowing other uses of it to continue in other threads seems incorrect to me, as in the dup2() scenario I posted. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, 2015-10-22 at 08:15 +0200, casper@oracle.com wrote: > >It's been said that the current mechanisms in Linux & some BSD variants > >can be subject to races, and the behaviour exhibited doesn't conform to > >POSIX, for example requiring the use of shutdown() on unconnected > >sockets because close() doesn't kick off other threads accept()ing on > >the same fd. I'd be interested to hear if there's a better and more > >performant way of handling the situation that doesn't involve doing the > >sort of bookkeeping Casper described,. > > Of course, the implementation is now around 18 years old; clearly a lot of > things have changed since then. > > In the particular case of Linux close() on a socket, surely it must be > possible to detect at close that it is a listening socket and that you are > about to close the last reference; the kernel could then do the shutdown() > all by itself. We absolutely do not _want_ to do this just so that linux becomes slower to the point Solaris can compete, or you guys can avoid some work. close(fd) is very far from knowing a file is a 'listener' or even a 'socket' without extra cache line misses. To force a close of an accept() or whatever blocking socket related system call a shutdown() makes a lot of sense. This would have zero additional overhead for the fast path. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 12:30, Eric Dumazet wrote: We absolutely do not _want_ to do this just so that linux becomes slower to the point Solaris can compete, or you guys can avoid some work. Sentiments such as that really have no place in a discussion that's been focussed primarily on the behaviour of interfaces, albeit with digressions into the potential performance impacts. The discussion has been cordial and I for one appreciate Al Viro's posts on the subject, from which I've leaned a lot. Can we please keep it that way? Thanks. close(fd) is very far from knowing a file is a 'listener' or even a 'socket' without extra cache line misses. To force a close of an accept() or whatever blocking socket related system call a shutdown() makes a lot of sense. This would have zero additional overhead for the fast path. Yes, that would I believe be a significant improvement. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 2/2] iwlwifi: mvm: send large SKBs to the transport
On Wed, 2015-10-21 at 21:34 +0300, Emmanuel Grumbach wrote: > + > + if (skb->protocol == htons(ETH_P_IP)) { > + ip_hdr(tmp)->id = ip_hdr(skb)->id; Too late, you already called consume_skb(skb). So this is a potential use after free. > + be16_add_cpu(_hdr(tmp)->id, i * num_subframes); > + } > + I would use base_id = ip_hdr(skb)->id; // before the consume_skb(skb) ip_hdr(tmp)->id = htons(base_id + i * num_subframes); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, 2015-10-22 at 12:58 +0100, Alan Burlison wrote: > On 22/10/2015 12:30, Eric Dumazet wrote: > > > We absolutely do not _want_ to do this just so that linux becomes slower > > to the point Solaris can compete, or you guys can avoid some work. > > Sentiments such as that really have no place in a discussion that's been > focussed primarily on the behaviour of interfaces, albeit with > digressions into the potential performance impacts. The discussion has > been cordial and I for one appreciate Al Viro's posts on the subject, > from which I've leaned a lot. Can we please keep it that way? Thanks. Certainly not. I am a major linux networking developper and wont accept linux is hijacked by guys who never contributed to it, just so it meets their unreasonable expectations. We absolutely care about performance. And I do not care you focus on POSIX crap. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net,stable] qmi_wwan: add Sierra Wireless MC74xx/EM74xx
New device IDs shamelessly lifted from the vendor driver. Signed-off-by: Bjørn Mork--- drivers/net/usb/qmi_wwan.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index 355842b85ee9..2a7c1be23c4f 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -765,6 +765,10 @@ static const struct usb_device_id products[] = { {QMI_FIXED_INTF(0x1199, 0x9056, 8)},/* Sierra Wireless Modem */ {QMI_FIXED_INTF(0x1199, 0x9057, 8)}, {QMI_FIXED_INTF(0x1199, 0x9061, 8)},/* Sierra Wireless Modem */ + {QMI_FIXED_INTF(0x1199, 0x9070, 8)},/* Sierra Wireless MC74xx/EM74xx */ + {QMI_FIXED_INTF(0x1199, 0x9070, 10)}, /* Sierra Wireless MC74xx/EM74xx */ + {QMI_FIXED_INTF(0x1199, 0x9071, 8)},/* Sierra Wireless MC74xx/EM74xx */ + {QMI_FIXED_INTF(0x1199, 0x9071, 10)}, /* Sierra Wireless MC74xx/EM74xx */ {QMI_FIXED_INTF(0x1bbb, 0x011e, 4)},/* Telekom Speedstick LTE II (Alcatel One Touch L100V LTE) */ {QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},/* Alcatel L800MA */ {QMI_FIXED_INTF(0x2357, 0x0201, 4)},/* TP-LINK HSUPA Modem MA180 */ -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: igmp broadcast storm Red Hat EL7 3.1 Kernel
On 8/31/15, Jeffrey Merkeywrote: > I was able to reproduce it last night with 2.6.32 on Centos 6.3 so it > does not seem confined to just the 3.X tree. I think this is > something that's been lurking in the kernel for years and just has a > hard time showing up. Bugs that go away on their own come back on > their own. > > I got a trace of the traffic but need one of the os. I'll attempt to > get a trace of this again since I seem to be able to reproduce it by > routing NetFlix traffic through the router while I am streaming live > videos on Linux. > > I will setup the system to trap the bug and post the traces. > > Jeff > > > On 8/31/15, Cong Wang wrote: >> (Cc'ing netdev) >> >> On Sat, Aug 29, 2015 at 3:50 PM, Jeffrey Merkey >> wrote: >>> When installing Centos 7 running kernel 3.1 (Centos 7 distro) the >>> system generates an igmp broadcast storm which locks up and knocks >>> down a Century Link DSL router. Powering the router off clears the >>> broadcast storm. >>> >>> The bug is kernel related and resembles very closely another bug seen >>> about 10 years ago with Caldera's distro's on 2.4. The bug occurs on >>> the ia32 beta build. The bug DOES NOT exist in the 64 bit x86_64 >>> versions.I will attempt to debug it further since I have my own >>> debugger and try to run it down and provide more info. >>> >> >> A reproducer and/or bisect would help a lot. >> >> Thanks. >> > I got to the bottom of this, however, I am not convinced it's a bug in linux any more than configuring a bridge in a loop is a bug. I addressed it to the ISP since this is where the problem lies, even though Linux contributes routed traffic to the problem. Jeff -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: sh_eth.c::sh_eth_rx(): mdp->rx_skbuff[entry] can be NULL
Hello. On 10/21/2015 10:26 AM, Yasushi SHOJI wrote: Thank your for your reply. Not at all, I'm virtually a maintainer for that driver now, so trying to filter out the related mails even if I don't have time to read thru all the netdev mail. On 10/19/2015 06:01 PM, Yasushi SHOJI wrote: I'm not that familiar with this code base so I'm note including any patch yet. I appreciate if someone with insight in this code give a quick look and tell me that it's a real one or not. if this is a real case, I can take a deep look. If you got the oops, it's real. Thanks for the reporting. I guess I should check the new ravb driver as well... Do you want to try fixing the bug yourself? Sure. I can dive in to this. I appreciate if someone who has worked on sh_eth.c give me some design advises or tell me the initial design thoughts / what was the intention when allocation if failed. Hm, well, I seem to have some time to spend on fixing the issues in this driver (I noticed a couple while doing the AVB driver), so spending time on your "education" would seem somewhat inefficient... :-) My idea right now is to simply invalidate the descriptor when netdev_alloc_skb() failed. Well, it depends. If you're talking about the second loop in sh_eth_rx(), that seems a good idea (and it's what I've done for the dma_mapping_error() case in the ravb driver -- I just set the descriptor's data size field to 0). The OOM case seems to have been un-addressed in both drivers so far... If we take sh_eth_ring_format(), I believe the best course of action is to just fail on OOM since the driver doesn't correctly handle that case anyway AFAIR; and that was implemented in the ravb driver. When next packet arrived, in near future, the driver can try again to allocate the buffer and update the corresponding descriptor if succeeds. It would be too late, unless you still mean the RX refilling loop in this function. If memory is not yet available when the controller is trying to use the invalid descriptor, the controller will see it and DMA will stop. That means leaving RACT=0 and that's what the driver is even doing... Hm, then I don't understand how the error you've described can occur, unless we encounter OOM during sh_eth_ring_format()... Is it acceptable path to go? I'm not seeing a bug in this function, perhaps I'm missing something? Here is how I understand this driver: [...] The driver utilizes array of sk_buffs for tx and rx. For rx, the driver has an array of pointers of sk_buffs, rx_skbuff[]. This rx_skbuff[] is filled with sk_buffs in sh_eth_ring_format() which is called when the driver is open()ed. The controller, the driver is targeted to, is GETHER. Well, it depends on your SoC, it may be 100 Mbps Ether. A receive descriptor corresponds to one sk_buff. The controller expects array of descriptors in the system memory and treat it as a ring, meaning that the controller process each descriptor one by one. Once the controller finished the last descriptor, it will go back to the first one. Yes, it seems a correct description. To achieve zero copy, the driver push the sk_buffs filled with received packet to the netdev core with netif_receive_skb() then netdev_alloc_skb() sk_buffs in the sh_eth_rx(), the poll method of the driver, and update the corresponding descriptor. If the allocation failed, it just leave the function, leaving old pointer in the descriptor as is. Yes, but note that it also leaves RACT=0, which basically means an invalid descriptor, encountering which the reception should just stop. In some future, the controller will access the descriptor and writes to the old memory address. (I haven't checked the state of bits in the descriptor yet) Check it. BTW, if any one has a bit of time, I have questions regarding to the atomic allocation: Sorry, I'm constantly short of time. Someone else will have to answer that. :-) MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] openvswitch: Fix egress tunnel info.
Hi Pravin, [auto build test WARNING on net/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base] url: https://github.com/0day-ci/linux/commits/Pravin-B-Shelar/openvswitch-Fix-egress-tunnel-info/20151023-053247 config: x86_64-randconfig-x010-10211707 (attached as .config) reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): net/openvswitch/actions.c: In function 'output_userspace': >> net/openvswitch/actions.c:771:24: warning: unused variable 'info' >> [-Wunused-variable] struct ip_tunnel_info info; ^ vim +/info +771 net/openvswitch/actions.c 7f8a436e Joe Stringer 2015-08-26 755 ethertype = vlan_get_protocol(skb); 7f8a436e Joe Stringer 2015-08-26 756 } 7f8a436e Joe Stringer 2015-08-26 757 7f8a436e Joe Stringer 2015-08-26 758 ovs_fragment(vport, skb, mru, ethertype); 7f8a436e Joe Stringer 2015-08-26 759 } else { 7f8a436e Joe Stringer 2015-08-26 760 kfree_skb(skb); 7f8a436e Joe Stringer 2015-08-26 761 } 7f8a436e Joe Stringer 2015-08-26 762 } else { 738967b8 Andy Zhou2014-09-08 763 kfree_skb(skb); ccb1352e Jesse Gross 2011-10-25 764 } 7f8a436e Joe Stringer 2015-08-26 765 } ccb1352e Jesse Gross 2011-10-25 766 ccb1352e Jesse Gross 2011-10-25 767 static int output_userspace(struct datapath *dp, struct sk_buff *skb, ccea7445 Neil McKee 2015-05-26 768 struct sw_flow_key *key, const struct nlattr *attr, ccea7445 Neil McKee 2015-05-26 769 const struct nlattr *actions, int actions_len) ccb1352e Jesse Gross 2011-10-25 770 { 1d8fff90 Thomas Graf 2015-07-21 @771 struct ip_tunnel_info info; ccb1352e Jesse Gross 2011-10-25 772 struct dp_upcall_info upcall; ccb1352e Jesse Gross 2011-10-25 773 const struct nlattr *a; ccb1352e Jesse Gross 2011-10-25 774 int rem; ccb1352e Jesse Gross 2011-10-25 775 ccea7445 Neil McKee 2015-05-26 776 memset(, 0, sizeof(upcall)); ccb1352e Jesse Gross 2011-10-25 777 upcall.cmd = OVS_PACKET_CMD_ACTION; 7f8a436e Joe Stringer 2015-08-26 778 upcall.mru = OVS_CB(skb)->mru; ccb1352e Jesse Gross 2011-10-25 779 :: The code at line 771 was first introduced by commit :: 1d8fff907342d2339796dbd27ea47d0e76a6a2d0 ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic :: TO: Thomas Graf:: CC: David S. Miller --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH 3/3] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller
Hi Mans, [auto build test WARNING on net/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base] url: https://github.com/0day-ci/linux/commits/Mans-Rullgard/devicetree-add-vendor-prefix-for-Aurora-VLSI/20151022-220753 config: sparc64-allyesconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=sparc64 All warnings (new ones prefixed by >>): drivers/net/ethernet/aurora/nb8800.c: In function 'nb8800_xmit': drivers/net/ethernet/aurora/nb8800.c:381:14: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] cpsz = (8 - (u32)skb->data) & 7; ^ drivers/net/ethernet/aurora/nb8800.c: In function 'nb8800_probe': >> drivers/net/ethernet/aurora/nb8800.c:1006:2: warning: format '%x' expects >> argument of type 'unsigned int', but argument 3 has type 'resource_size_t' >> [-Wformat=] dev_info(>dev, "AU-NB8800 Ethernet at 0x%x\n", res->start); ^ vim +1006 drivers/net/ethernet/aurora/nb8800.c 990 res = platform_get_resource(pdev, IORESOURCE_MEM, 0); 991 if (!res) { 992 dev_err(>dev, "No MMIO base\n"); 993 return -EINVAL; 994 } 995 996 irq = platform_get_irq(pdev, 0); 997 if (irq <= 0) { 998 dev_err(>dev, "No IRQ\n"); 999 return -EINVAL; 1000 } 1001 1002 base = devm_ioremap_resource(>dev, res); 1003 if (IS_ERR(base)) 1004 return PTR_ERR(base); 1005 > 1006 dev_info(>dev, "AU-NB8800 Ethernet at 0x%x\n", > res->start); 1007 1008 dev = alloc_etherdev(sizeof(*priv)); 1009 if (!dev) 1010 return -ENOMEM; 1011 1012 platform_set_drvdata(pdev, dev); 1013 SET_NETDEV_DEV(dev, >dev); 1014 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH v6] can: xilinx: Convert to runtime_pm
Hi Kedar, On Thu, 2015-10-22 at 10:15AM +0530, Kedareswara rao Appana wrote: > Instead of enabling/disabling clocks at several locations in the driver, > Use the runtime_pm framework. This consolidates the actions for runtime PM > In the appropriate callbacks and makes the driver more readable and > mantainable. > > Signed-off-by: Kedareswara rao Appana[...] > /** > * xcan_probe - Platform registration call > @@ -1072,7 +1103,7 @@ static int xcan_probe(struct platform_device *pdev) > return -ENOMEM; > > priv = netdev_priv(ndev); > - priv->dev = ndev; > + priv->dev = >dev; > priv->can.bittiming_const = _bittiming_const; > priv->can.do_set_mode = xcan_do_set_mode; > priv->can.do_get_berr_counter = xcan_get_berr_counter; > @@ -1114,21 +1145,30 @@ static int xcan_probe(struct platform_device *pdev) > } > } > > - ret = clk_prepare_enable(priv->can_clk); > + ret = clk_prepare(priv->can_clk); > if (ret) { > dev_err(>dev, "unable to enable device clock\n"); > goto err_free; > } > > - ret = clk_prepare_enable(priv->bus_clk); > + ret = clk_prepare(priv->bus_clk); Are these clk_prepare calls needed here? The runtime PM calls do clk_prepare_enable and clk_disable_unprepare. > if (ret) { > dev_err(>dev, "unable to enable bus clock\n"); > - goto err_unprepare_disable_dev; > + goto err_unprepare_dev; > } > > priv->write_reg = xcan_write_reg_le; > priv->read_reg = xcan_read_reg_le; > > + pm_runtime_irq_safe(>dev); > + pm_runtime_enable(>dev); > + ret = pm_runtime_get_sync(>dev); > + if (ret < 0) { > + netdev_err(ndev, "%s: pm_runtime_get failed(%d)\n", > + __func__, ret); > + goto err_unprepare_busclk; > + } > + > if (priv->read_reg(priv, XCAN_SR_OFFSET) != XCAN_SR_CONFIG_MASK) { > priv->write_reg = xcan_write_reg_be; > priv->read_reg = xcan_read_reg_be; > @@ -1141,22 +1181,26 @@ static int xcan_probe(struct platform_device *pdev) > ret = register_candev(ndev); > if (ret) { > dev_err(>dev, "fail to register failed (err=%d)\n", ret); > - goto err_unprepare_disable_busclk; > + goto err_disableclks; > } > > devm_can_led_init(ndev); > - clk_disable_unprepare(priv->bus_clk); > - clk_disable_unprepare(priv->can_clk); > + > + pm_runtime_put(>dev); > + > netdev_dbg(ndev, "reg_base=0x%p irq=%d clock=%d, tx fifo depth:%d\n", > priv->reg_base, ndev->irq, priv->can.clock.freq, > priv->tx_max); > > return 0; > > -err_unprepare_disable_busclk: > - clk_disable_unprepare(priv->bus_clk); > -err_unprepare_disable_dev: > - clk_disable_unprepare(priv->can_clk); > +err_disableclks: > + pm_runtime_put(priv->dev); > +err_unprepare_busclk: > + pm_runtime_disable(>dev); > + clk_unprepare(priv->bus_clk); > +err_unprepare_dev: > + clk_unprepare(priv->can_clk); > err_free: > free_candev(ndev); > err: > @@ -1175,11 +1219,11 @@ static int xcan_remove(struct platform_device *pdev) > struct net_device *ndev = platform_get_drvdata(pdev); > struct xcan_priv *priv = netdev_priv(ndev); > > - if (set_reset_mode(ndev) < 0) > - netdev_err(ndev, "mode resetting failed!\n"); > - > unregister_candev(ndev); > + pm_runtime_disable(>dev); > netif_napi_del(>napi); > + clk_unprepare(priv->bus_clk); > + clk_unprepare(priv->can_clk); I think this can go away when the prepare calls in probe go away. Thanks, Sören -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/15 19:16, Al Viro wrote: Don't you have to do that anyway, to do anything useful with the file? Dirtying the cacheline that contains struct file itself is different, but that's not per-descriptor. Yes, true enough. Usually it's per-process, but any thread could ask for a private instance to work with (and then spawn more threads sharing that instance - or getting independent copies). [snip] Thanks again for the info, interesting. Solaris also does things along the same lines. In fact we recently extended posix_spawn so it could be used by the JVM to create subprocesses without wholescale copying, and Casper has done some optimisation work on posix_spawn as well. I don't think that it's possible to claim that a non-atomic dup2() is POSIX-compliant. Except that it's in non-normative part of dup2(2), AFAICS. I certainly agree that it would be a standard lawyering beyond reason, but "not possible to claim" is too optimistic. Maybe I'm just more cynical... Possibly so, and possibly justifiably so as well ;-) ThreadA remains sat in accept on fd1 which is now a plain file, not a socket. No. accept() is not an operation on file descriptors; it's an operation on file descriptions (pardon for use of that terminology). They are specified by passing descriptors, but there's a hell of a difference between e.g. dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek() (operations on descriptions). Lookups are done once per syscall; the only exception is F_SETFL{,W}, where we recheck that descriptor is refering to the same thing before granting the lock. Yes, but if you believe that dup2() requires an implicit close() within it and that dup2() has to be atomic then expecting that other threads waiting on the same fd in accept() will get a notification seems reasonable enough. Again, POSIX is still underspecifying the semantics of shared descriptor tables; back when the bulk of it had been written there had been no way to have a descriptor -> description mapping changed under a syscall by action of another thread. Hell, they still hadn't picked on some things that happened in early 80s, let alone early-to-mid 90s... That's indisputably true - much of the POSIX behaviour predates threads and it shows, quite badly, in some places. Linux and Solaris happen to cover these gaps differently; FreeBSD and OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one. Yes, true enough. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net v2] openvswitch: Fix egress tunnel info.
While transitioning to netdev based vport we broke OVS feature which allows user to retrieve tunnel packet egress information for lwtunnel devices. Following patch fixes it by introducing ndo operation to get the tunnel egress info. Same ndo operation can be used for lwtunnel devices and compat ovs-tnl-vport devices. So after adding such device operation we can remove similar operation from ovs-vport. Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device"). Signed-off-by: Pravin B Shelar--- v1-v2: - changed ndo operation name to ndo_fill_metadata_dst() - Fix geneve stats update --- drivers/net/geneve.c | 40 ++- drivers/net/vxlan.c| 41 include/linux/netdevice.h |7 + include/net/dst_metadata.h | 32 ++ net/core/dev.c | 27 ++ net/ipv4/ip_gre.c | 46 +-- net/openvswitch/actions.c |8 ++--- net/openvswitch/datapath.c |5 +-- net/openvswitch/datapath.h |1 - net/openvswitch/flow_netlink.c | 18 +--- net/openvswitch/flow_netlink.h |6 ++-- net/openvswitch/vport-geneve.c | 13 - net/openvswitch/vport-gre.c|8 - net/openvswitch/vport-vxlan.c | 19 - net/openvswitch/vport.c| 58 net/openvswitch/vport.h| 35 16 files changed, 192 insertions(+), 172 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index cde29f8..445071c 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -594,14 +594,12 @@ static struct rtable *geneve_get_rt(struct sk_buff *skb, rt = ip_route_output_key(geneve->net, fl4); if (IS_ERR(rt)) { netdev_dbg(dev, "no route to %pI4\n", >daddr); - dev->stats.tx_carrier_errors++; - return rt; + return ERR_PTR(-ENETUNREACH); } if (rt->dst.dev == dev) { /* is this necessary? */ netdev_dbg(dev, "circular route to %pI4\n", >daddr); - dev->stats.collisions++; ip_rt_put(rt); - return ERR_PTR(-EINVAL); + return ERR_PTR(-ELOOP); } return rt; } @@ -627,12 +625,12 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev) struct ip_tunnel_info *info = NULL; struct rtable *rt = NULL; const struct iphdr *iip; /* interior IP header */ + int err = -EINVAL; struct flowi4 fl4; __u8 tos, ttl; __be16 sport; bool udp_csum; __be16 df; - int err; if (geneve->collect_md) { info = skb_tunnel_info(skb); @@ -647,7 +645,7 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev) rt = geneve_get_rt(skb, dev, , info); if (IS_ERR(rt)) { netdev_dbg(dev, "no route to %pI4\n", ); - dev->stats.tx_carrier_errors++; + err = PTR_ERR(rt); goto tx_error; } @@ -699,10 +697,37 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev) tx_error: dev_kfree_skb(skb); err: - dev->stats.tx_errors++; + if (err == -ELOOP) + dev->stats.collisions++; + else if (err == -ENETUNREACH) + dev->stats.tx_carrier_errors++; + else + dev->stats.tx_errors++; return NETDEV_TX_OK; } +static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb) +{ + struct ip_tunnel_info *info = skb_tunnel_info(skb); + struct geneve_dev *geneve = netdev_priv(dev); + struct rtable *rt; + struct flowi4 fl4; + + if (ip_tunnel_info_af(info) != AF_INET) + return -EINVAL; + + rt = geneve_get_rt(skb, dev, , info); + if (IS_ERR(rt)) + return PTR_ERR(rt); + + ip_rt_put(rt); + info->key.u.ipv4.src = fl4.saddr; + info->key.tp_src = udp_flow_src_port(geneve->net, skb, +1, USHRT_MAX, true); + info->key.tp_dst = geneve->dst_port; + return 0; +} + static const struct net_device_ops geneve_netdev_ops = { .ndo_init = geneve_init, .ndo_uninit = geneve_uninit, @@ -713,6 +738,7 @@ static const struct net_device_ops geneve_netdev_ops = { .ndo_change_mtu = eth_change_mtu, .ndo_validate_addr = eth_validate_addr, .ndo_set_mac_address= eth_mac_addr, + .ndo_fill_metadata_dst = geneve_fill_metadata_dst, }; static void geneve_get_drvinfo(struct net_device *dev, diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index afdc65f..c1587ec 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2337,6 +2337,46 @@
Re: CONFIG_XPS depends on L1_CACHE_BYTES being greater than sizeof(struct xps_map)
On Thu, Oct 22, 2015 at 1:00 PM, Helge Dellerwrote: > Hi Tom & David, > > I've queued-up a patch for the parisc architecture which reduces > L1_CACHE_BYTES from 32 to 16: > https://patchwork.kernel.org/patch/7399291/ > > But this change will break the kernel build like this: > > In file included from net/core/dev.c:92:0: > net/core/dev.c: In function ‘expand_xps_map’: > include/linux/netdevice.h:721:27: warning: overflow in implicit constant > conversion [-Woverflow] >#define XPS_MIN_MAP_ALLOC ((L1_CACHE_BYTES - sizeof(struct xps_map)) \ > net/core/dev.c:1972:18: note: in expansion of macro ‘XPS_MIN_MAP_ALLOC’ >int alloc_len = XPS_MIN_MAP_ALLOC; > > Do you see an easy way to fix this ? > How about #define XPS_MIN_MAP_ALLOC ((L1_CACHE_BYTES - ((sizeof(struct xps_map) % L1_CACHE_BYTES)) \ Tom > Thanks, > Helge > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CONFIG_XPS depends on L1_CACHE_BYTES being greater than sizeof(struct xps_map)
On Thu, 2015-10-22 at 22:00 +0200, Helge Deller wrote: > Hi Tom & David, > > I've queued-up a patch for the parisc architecture which reduces > L1_CACHE_BYTES from 32 to 16: > https://patchwork.kernel.org/patch/7399291/ > > But this change will break the kernel build like this: > > In file included from net/core/dev.c:92:0: > net/core/dev.c: In function ‘expand_xps_map’: > include/linux/netdevice.h:721:27: warning: overflow in implicit constant > conversion [-Woverflow] >#define XPS_MIN_MAP_ALLOC ((L1_CACHE_BYTES - sizeof(struct xps_map)) \ > net/core/dev.c:1972:18: note: in expansion of macro ‘XPS_MIN_MAP_ALLOC’ >int alloc_len = XPS_MIN_MAP_ALLOC; > > Do you see an easy way to fix this ? Using L2_CACHE_BYTES would be better, but it unfortunately does not exist. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 09:51:05PM +0200, casper@oracle.com wrote: > > >On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote: > > > >> The external behaviour atomic; you cannot distinguish the order > >> between the closing of the original file (and waking up other threads > >> waiting for a record lock) or changing the file referenced by that newfd. > >> > >> But this not include a global or process specific lock. > > > >Interesting... Do you mean that decriptor-to-file lookup blocks until that > >rundown finishes? > > For that particular file descriptor, yes. (I'm assuming you mean the > Solaris kernel running down all lwps who have a system in progress on that > particular file descriptor). All other fd to file lookups are not blocked > at all by this locking. > > It should be clear that any such occurrences are application errors and > should be hardly ever seen in practice. It is also known when this is > needed so it is hardly even attempted. Ho-hum... It could even be made lockless in fast path; the problems I see are * descriptor-to-file lookup becomes unsafe in a lot of locking conditions. Sure, most of that happens on the entry to some syscall, with very light locking environment, but... auditing every sodding ioctl that might be doing such lookups is an interesting exercise, and then there are ->mount() instances doing the same thing. And procfs accesses. Probably nothing impossible to deal with, but nothing pleasant either. * memory footprint. In case of Linux on amd64 or sparc64, main() { int i; for (i = 0; i < 1<<24; dup2(0, i++))// 16M descriptors ; } will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n, of course). How much will Solaris eat on the same? * related to the above - how much cacheline sharing will that involve? These per-descriptor use counts are bitch to pack, and giving each a cacheline of its own... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] net: dsa: mv88e6xxx: remove debugfs interface
It is preferable to have a common debugfs interface for DSA or switchdev instead of a driver specific one. Thus remove the mv88e6xxx debug code. Signed-off-by: Vivien Didelot--- drivers/net/dsa/mv88e6xxx.c | 291 drivers/net/dsa/mv88e6xxx.h | 2 - 2 files changed, 293 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 39664d8..7af9630 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -11,7 +11,6 @@ * (at your option) any later version. */ -#include #include #include #include @@ -21,7 +20,6 @@ #include #include #include -#include #include #include #include "mv88e6xxx.h" @@ -876,13 +874,6 @@ static int _mv88e6xxx_atu_wait(struct dsa_switch *ds) GLOBAL_ATU_OP_BUSY); } -/* Must be called with SMI lock held */ -static int _mv88e6xxx_scratch_wait(struct dsa_switch *ds) -{ - return _mv88e6xxx_wait(ds, REG_GLOBAL2, GLOBAL2_SCRATCH_MISC, - GLOBAL2_SCRATCH_BUSY); -} - /* Must be called with SMI mutex held */ static int _mv88e6xxx_phy_read_indirect(struct dsa_switch *ds, int addr, int regnum) @@ -2107,273 +2098,9 @@ int mv88e6xxx_setup_ports(struct dsa_switch *ds) return 0; } -static int mv88e6xxx_regs_show(struct seq_file *s, void *p) -{ - struct dsa_switch *ds = s->private; - - struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); - int reg, port; - - seq_puts(s, "GLOBAL GLOBAL2 "); - for (port = 0 ; port < ps->num_ports; port++) - seq_printf(s, " %2d ", port); - seq_puts(s, "\n"); - - for (reg = 0; reg < 32; reg++) { - seq_printf(s, "%2x: ", reg); - seq_printf(s, " %4x%4x ", - mv88e6xxx_reg_read(ds, REG_GLOBAL, reg), - mv88e6xxx_reg_read(ds, REG_GLOBAL2, reg)); - - for (port = 0 ; port < ps->num_ports; port++) - seq_printf(s, "%4x ", - mv88e6xxx_reg_read(ds, REG_PORT(port), reg)); - seq_puts(s, "\n"); - } - - return 0; -} - -static int mv88e6xxx_regs_open(struct inode *inode, struct file *file) -{ - return single_open(file, mv88e6xxx_regs_show, inode->i_private); -} - -static const struct file_operations mv88e6xxx_regs_fops = { - .open = mv88e6xxx_regs_open, - .read = seq_read, - .llseek = no_llseek, - .release = single_release, - .owner = THIS_MODULE, -}; - -static void mv88e6xxx_atu_show_header(struct seq_file *s) -{ - seq_puts(s, "DB T/P Vec State Addr\n"); -} - -static void mv88e6xxx_atu_show_entry(struct seq_file *s, int dbnum, -unsigned char *addr, int data) -{ - bool trunk = !!(data & GLOBAL_ATU_DATA_TRUNK); - int portvec = ((data & GLOBAL_ATU_DATA_PORT_VECTOR_MASK) >> - GLOBAL_ATU_DATA_PORT_VECTOR_SHIFT); - int state = data & GLOBAL_ATU_DATA_STATE_MASK; - - seq_printf(s, "%03x %5s %10pb %x %pM\n", - dbnum, (trunk ? "Trunk" : "Port"), , state, addr); -} - -static int mv88e6xxx_atu_show_db(struct seq_file *s, struct dsa_switch *ds, -int dbnum) -{ - unsigned char bcast[] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}; - unsigned char addr[6]; - int ret, data, state; - - ret = _mv88e6xxx_atu_mac_write(ds, bcast); - if (ret < 0) - return ret; - - do { - ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_ATU_FID, - dbnum); - if (ret < 0) - return ret; - - ret = _mv88e6xxx_atu_cmd(ds, GLOBAL_ATU_OP_GET_NEXT_DB); - if (ret < 0) - return ret; - - data = _mv88e6xxx_reg_read(ds, REG_GLOBAL, GLOBAL_ATU_DATA); - if (data < 0) - return data; - - state = data & GLOBAL_ATU_DATA_STATE_MASK; - if (state == GLOBAL_ATU_DATA_STATE_UNUSED) - break; - ret = _mv88e6xxx_atu_mac_read(ds, addr); - if (ret < 0) - return ret; - mv88e6xxx_atu_show_entry(s, dbnum, addr, data); - } while (state != GLOBAL_ATU_DATA_STATE_UNUSED); - - return 0; -} - -static int mv88e6xxx_atu_show(struct seq_file *s, void *p) -{ - struct dsa_switch *ds = s->private; - struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); - int dbnum; - - mv88e6xxx_atu_show_header(s); - - for (dbnum = 0; dbnum < 255; dbnum++) { - mutex_lock(>smi_mutex); - mv88e6xxx_atu_show_db(s, ds, dbnum); - mutex_unlock(>smi_mutex); - } - - return 0;
[PATCH net-next v5 0/2] mpls: multipath support
From: Roopa PrabhuThis patch adds support for MPLS multipath routes. Includes following changes to support multipath: - splits struct mpls_route into 'struct mpls_route + struct mpls_nh'. - struct mpls_nh represents a mpls nexthop label forwarding entry - Adds support to parse/fill RTA_MULTIPATH netlink attribute for multipath routes similar to ipv4/v6 fib - In the process of restructuring, this patch also consistently changes all labels to u8 $ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \ nexthop as 700 via inet 10.1.1.6 dev swp2 \ nexthop as 800 via inet 40.1.1.2 dev swp3 $ip -f mpls route show 100 nexthop as to 200 via inet 10.1.1.2 dev swp1 nexthop as to 700 via inet 10.1.1.6 dev swp2 nexthop as to 800 via inet 40.1.1.2 dev swp3 Roopa Prabhu (1): mpls: multipath support Robert Shearman (1): mpls: flow-based multipath selection Signed-off-by: Roopa Prabhu v2: - Incorporate some feedback from Robert: use dynamic allocation (list) instead of static allocation for nexthops v3: - Move back to arrays (same as v1), also suggested by Eric Biederman v4: - address a few comments from Eric Biederman Plan to address the following pending comments in incremental patches after this infrastructure changes go in. - Move VIA size to 16 bytes - use ipv6 flow label in ecmp calculations - dead route handling during multipath route selection (I had planned this in an incremental patch initially). v5: feedback from Eric Biederman - Removed some dead code feedback from Robert - Moved dev_put into find_outdev to make it clear that we dont need a hold on the dev because we are under rtnl - move the unused variable fix into the correct patch file include/net/mpls_iptunnel.h | 2 +- net/mpls/af_mpls.c | 668 ++-- net/mpls/internal.h | 57 +++- 3 files changed, 572 insertions(+), 155 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v5 1/2] mpls: multipath route support
From: Roopa PrabhuThis patch adds support for MPLS multipath routes. Includes following changes to support multipath: - splits struct mpls_route into 'struct mpls_route + struct mpls_nh' - 'struct mpls_nh' represents a mpls nexthop label forwarding entry - moves mpls route and nexthop structures into internal.h - A mpls_route can point to multiple mpls_nh structs - the nexthops are maintained as a array (similar to ipv4 fib) - In the process of restructuring, this patch also consistently changes all labels to u8 - Adds support to parse/fill RTA_MULTIPATH netlink attribute for multipath routes similar to ipv4/v6 fib - In this patch, the multipath route nexthop selection algorithm simply returns the first nexthop. It is replaced by a hash based algorithm from Robert Shearman in the next patch - mpls_route_update cleanup: remove 'dev' handling in mpls_route_update. mpls_route_update though implemented to update based on dev, it was never used that way. And the dev handling gets tricky with multiple nexthops. Cannot match against any single nexthops dev. So, this patch removes the unused 'dev' handling in mpls_route_update. Example: $ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \ nexthop as 700 via inet 10.1.1.6 dev swp2 \ nexthop as 800 via inet 40.1.1.2 dev swp3 $ip -f mpls route show 100 nexthop as to 200 via inet 10.1.1.2 dev swp1 nexthop as to 700 via inet 10.1.1.6 dev swp2 nexthop as to 800 via inet 40.1.1.2 dev swp3 Signed-off-by: Roopa Prabhu --- include/net/mpls_iptunnel.h | 2 +- net/mpls/af_mpls.c | 493 +++- net/mpls/internal.h | 52 - 3 files changed, 398 insertions(+), 149 deletions(-) diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h index 4757997..179253f 100644 --- a/include/net/mpls_iptunnel.h +++ b/include/net/mpls_iptunnel.h @@ -18,7 +18,7 @@ struct mpls_iptunnel_encap { u32 label[MAX_NEW_LABELS]; - u32 labels; + u8 labels; }; static inline struct mpls_iptunnel_encap *mpls_lwtunnel_encap(struct lwtunnel_state *lwtstate) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index bb185a2..3f95499 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -19,37 +19,9 @@ #include #include #endif +#include #include "internal.h" -#define LABEL_NOT_SPECIFIED (1<<20) -#define MAX_NEW_LABELS 2 - -/* This maximum ha length copied from the definition of struct neighbour */ -#define MAX_VIA_ALEN (ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))) - -enum mpls_payload_type { - MPT_UNSPEC, /* IPv4 or IPv6 */ - MPT_IPV4 = 4, - MPT_IPV6 = 6, - - /* Other types not implemented: -* - Pseudo-wire with or without control word (RFC4385) -* - GAL (RFC5586) -*/ -}; - -struct mpls_route { /* next hop label forwarding entry */ - struct net_device __rcu *rt_dev; - struct rcu_head rt_rcu; - u32 rt_label[MAX_NEW_LABELS]; - u8 rt_protocol; /* routing protocol that set this entry */ - u8 rt_payload_type; - u8 rt_labels; - u8 rt_via_alen; - u8 rt_via_table; - u8 rt_via[0]; -}; - static int zero = 0; static int label_limit = (1 << 20) - 1; @@ -80,10 +52,10 @@ bool mpls_output_possible(const struct net_device *dev) } EXPORT_SYMBOL_GPL(mpls_output_possible); -static unsigned int mpls_rt_header_size(const struct mpls_route *rt) +static unsigned int mpls_nh_header_size(const struct mpls_nh *nh) { /* The size of the layer 2.5 labels to be added for this route */ - return rt->rt_labels * sizeof(struct mpls_shim_hdr); + return nh->nh_labels * sizeof(struct mpls_shim_hdr); } unsigned int mpls_dev_mtu(const struct net_device *dev) @@ -105,6 +77,12 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) } EXPORT_SYMBOL_GPL(mpls_pkt_too_big); +static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt) +{ + /* assume single nexthop for now */ + return >rt_nh[0]; +} + static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, struct mpls_entry_decoded dec) { @@ -159,6 +137,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev, struct net *net = dev_net(dev); struct mpls_shim_hdr *hdr; struct mpls_route *rt; + struct mpls_nh *nh; struct mpls_entry_decoded dec; struct net_device *out_dev; struct mpls_dev *mdev; @@ -196,8 +175,12 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev, if (!rt) goto drop; + nh = mpls_select_multipath(rt); + if (!nh) + goto
[PATCH net-next v5 2/2] mpls: flow-based multipath selection
From: Robert ShearmanChange the selection of a multipath route to use a flow-based hash. This more suitable for traffic sensitive to reordering within a flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution of traffic given enough flows. Selection of the path for a multipath route is done using a hash of: 1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and including entropy label, whichever is first. 2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS payload, if present. Naturally, a 5-tuple hash using L4 information in addition would be possible and be better in some scenarios, but there is a tradeoff between looking deeper into the packet to achieve good distribution, and packet forwarding performance, and I have erred on the side of the latter as the default. Signed-off-by: Robert Shearman --- net/mpls/af_mpls.c | 87 +++--- 1 file changed, 83 insertions(+), 4 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 3f95499..c6392aa 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -22,6 +22,11 @@ #include #include "internal.h" +/* Maximum number of labels to look ahead at when selecting a path of + * a multipath route + */ +#define MAX_MP_SELECT_LABELS 4 + static int zero = 0; static int label_limit = (1 << 20) - 1; @@ -77,10 +82,78 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) } EXPORT_SYMBOL_GPL(mpls_pkt_too_big); -static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt) +static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt, +struct sk_buff *skb, bool bos) { - /* assume single nexthop for now */ - return >rt_nh[0]; + struct mpls_entry_decoded dec; + struct mpls_shim_hdr *hdr; + bool eli_seen = false; + int label_index; + int nh_index = 0; + u32 hash = 0; + + /* No need to look further into packet if there's only +* one path +*/ + if (rt->rt_nhn == 1) + goto out; + + for (label_index = 0; label_index < MAX_MP_SELECT_LABELS && !bos; +label_index++) { + if (!pskb_may_pull(skb, sizeof(*hdr) * label_index)) + break; + + /* Read and decode the current label */ + hdr = mpls_hdr(skb) + label_index; + dec = mpls_entry_decode(hdr); + + /* RFC6790 - reserved labels MUST NOT be used as keys +* for the load-balancing function +*/ + if (likely(dec.label >= MPLS_LABEL_FIRST_UNRESERVED)) { + hash = jhash_1word(dec.label, hash); + + /* The entropy label follows the entropy label +* indicator, so this means that the entropy +* label was just added to the hash - no need to +* go any deeper either in the label stack or in the +* payload +*/ + if (eli_seen) + break; + } else if (dec.label == MPLS_LABEL_ENTROPY) { + eli_seen = true; + } + + bos = dec.bos; + if (bos && pskb_may_pull(skb, sizeof(*hdr) * label_index + +sizeof(struct iphdr))) { + const struct iphdr *v4hdr; + + v4hdr = (const struct iphdr *)(mpls_hdr(skb) + + label_index); + if (v4hdr->version == 4) { + hash = jhash_3words(ntohl(v4hdr->saddr), + ntohl(v4hdr->daddr), + v4hdr->protocol, hash); + } else if (v4hdr->version == 6 && + pskb_may_pull(skb, sizeof(*hdr) * label_index + + sizeof(struct ipv6hdr))) { + const struct ipv6hdr *v6hdr; + + v6hdr = (const struct ipv6hdr *)(mpls_hdr(skb) + + label_index); + + hash = __ipv6_addr_jhash(>saddr, hash); + hash = __ipv6_addr_jhash(>daddr, hash); + hash = jhash_1word(v6hdr->nexthdr, hash); + } + } + } + + nh_index = hash % rt->rt_nhn; +out: + return >rt_nh[nh_index]; } static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, @@ -175,7 +248,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev, if (!rt) goto drop; -
[PATCH 2/2] e1000e: Fix msi-x interrupt automask
Since the introduction of 82574 support in e1000e, the driver has worked on the assumption that msi-x interrupt generation is automatically disabled after each irq. As it turns out, this is not the case. Currently, rx interrupts can fire multiple times before and during napi processing. This can be a problem for users because frames that arrive in a certain window (after adapter->clean_rx() but before napi_complete_done() has cleared NAPI_STATE_SCHED) generate an interrupt which does not lead to napi_schedule(). These frames sit in the rx queue until another frame arrives (a tcp retransmit for example). While the EIAC and CTRL_EXT registers are properly configured for irq automask, the modification of IAM in e1000_configure_msix() is what prevents automask from working as intended. This patch removes that erroneous write and fixes interrupt rearming for tx and "other" interrupts. Since e1000_msix_other() reads ICR, all interrupts must be rearmed in that function. Reported-by: Frank SteinerSigned-off-by: Benjamin Poirier --- drivers/net/ethernet/intel/e1000e/netdev.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index a228167..8881256 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1921,7 +1921,8 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) no_link_interrupt: if (!test_bit(__E1000_DOWN, >state)) - ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER); + ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | +E1000_IMS_LSC); return IRQ_HANDLED; } @@ -1940,6 +1941,9 @@ static irqreturn_t e1000_intr_msix_tx(int __always_unused irq, void *data) /* Ring was not completely cleaned, so fire another interrupt */ ew32(ICS, tx_ring->ims_val); + if (!test_bit(__E1000_DOWN, >state)) + ew32(IMS, E1000_IMS_TXQ0); + return IRQ_HANDLED; } @@ -2027,11 +2031,7 @@ static void e1000_configure_msix(struct e1000_adapter *adapter) /* enable MSI-X PBA support */ ctrl_ext = er32(CTRL_EXT); - ctrl_ext |= E1000_CTRL_EXT_PBA_CLR; - - /* Auto-Mask Other interrupts upon ICR read */ - ew32(IAM, ~E1000_EIAC_MASK_82574 | E1000_IMS_OTHER); - ctrl_ext |= E1000_CTRL_EXT_EIAME; + ctrl_ext |= E1000_CTRL_EXT_PBA_CLR | E1000_CTRL_EXT_EIAME; ew32(CTRL_EXT, ctrl_ext); e1e_flush(); } -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] e1000e: remove unreachable code
msi-x interrupts are not shared so there's no need to check if the interrupt was really from this adapter. Signed-off-by: Benjamin Poirier--- drivers/net/ethernet/intel/e1000e/netdev.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 0a854a4..a228167 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1907,12 +1907,6 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) struct e1000_hw *hw = >hw; u32 icr = er32(ICR); - if (!(icr & E1000_ICR_INT_ASSERTED)) { - if (!test_bit(__E1000_DOWN, >state)) - ew32(IMS, E1000_IMS_OTHER); - return IRQ_NONE; - } - if (icr & adapter->eiac_mask) ew32(ICS, (icr & adapter->eiac_mask)); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper
Fix safety checks for bpf_perf_event_read(): - only non-inherited events can be added to perf_event_array map (do this check statically at map insertion time) - dynamically check that event is local and !pmu->count Otherwise buggy bpf program can cause kernel splat. Also fix error path after perf_event_attrs() and remove redundant 'extern'. Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter") Signed-off-by: Alexei Starovoitov--- v2->v3: . refactor checks based on Wangnan's and Peter's feedback while refactoring realized that these two issues need fixes as well: . fix perf_event_attrs() error path . remove redundant extern v1->v2: fix compile in case of !CONFIG_PERF_EVENTS Even in the worst case the crash is not possible. Only warn_on_once, so imo net-next is ok. include/linux/bpf.h |1 - kernel/bpf/arraymap.c| 25 - kernel/trace/bpf_trace.c |7 ++- 3 files changed, 22 insertions(+), 11 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index e3a51b74e275..75718fa28260 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -194,7 +194,6 @@ extern const struct bpf_func_proto bpf_map_lookup_elem_proto; extern const struct bpf_func_proto bpf_map_update_elem_proto; extern const struct bpf_func_proto bpf_map_delete_elem_proto; -extern const struct bpf_func_proto bpf_perf_event_read_proto; extern const struct bpf_func_proto bpf_get_prandom_u32_proto; extern const struct bpf_func_proto bpf_get_smp_processor_id_proto; extern const struct bpf_func_proto bpf_tail_call_proto; diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index e3cfe46b074f..3f4c99e06c6b 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -292,16 +292,23 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd) attr = perf_event_attrs(event); if (IS_ERR(attr)) - return (void *)attr; + goto err; - if (attr->type != PERF_TYPE_RAW && - !(attr->type == PERF_TYPE_SOFTWARE && - attr->config == PERF_COUNT_SW_BPF_OUTPUT) && - attr->type != PERF_TYPE_HARDWARE) { - perf_event_release_kernel(event); - return ERR_PTR(-EINVAL); - } - return event; + if (attr->inherit) + goto err; + + if (attr->type == PERF_TYPE_RAW) + return event; + + if (attr->type == PERF_TYPE_HARDWARE) + return event; + + if (attr->type == PERF_TYPE_SOFTWARE && + attr->config == PERF_COUNT_SW_BPF_OUTPUT) + return event; +err: + perf_event_release_kernel(event); + return ERR_PTR(-EINVAL); } static void perf_event_fd_array_put_ptr(void *ptr) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 47febbe7998e..003df3887287 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -199,6 +199,11 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5) if (!event) return -ENOENT; + /* make sure event is local and doesn't have pmu::count */ + if (event->oncpu != smp_processor_id() || + event->pmu->count) + return -EINVAL; + /* * we don't know if the function is run successfully by the * return value. It can be judged in other places, such as @@ -207,7 +212,7 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5) return perf_event_read_local(event); } -const struct bpf_func_proto bpf_perf_event_read_proto = { +static const struct bpf_func_proto bpf_perf_event_read_proto = { .func = bpf_perf_event_read, .gpl_only = false, .ret_type = RET_INTEGER, -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] xen-netfront: limit max queues number to online cpus
Should not allocate queues number more than online cpus. Signed-off-by: Joe JinCc: Boris Ostrovsky Cc: Konrad Rzeszutek Wilk Cc: David S. Miller --- drivers/net/xen-netfront.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c index f821a97..1eebd4e 100644 --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -2137,7 +2137,7 @@ static int __init netif_init(void) /* Allow as many queues as there are CPUs if user has not * specified a value. */ - if (xennet_max_queues == 0) + if (xennet_max_queues == 0 || xennet_max_queues > num_online_cpus()) xennet_max_queues = num_online_cpus(); return xenbus_register_frontend(_driver); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net 2/2] i40e: fix annoying message
From: Jesse BrandeburgThe driver was printing a message about not being able to assign VMDq because of a lack of MSI-X vectors. This was because a line was missing that initialized a variable, simply a merge error. Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index dd44faf..3dd26cd 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -7911,6 +7911,7 @@ static int i40e_sw_init(struct i40e_pf *pf) if (pf->hw.func_caps.vmdq) { pf->num_vmdq_vsis = I40E_DEFAULT_NUM_VMDQ_VSI; pf->flags |= I40E_FLAG_VMDQ_ENABLED; + pf->num_vmdq_qps = i40e_default_queues_per_vmdq(pf); } #ifdef I40E_FCOE -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html