[PATCH] igmp: Fix regression caused by igmp sysctl namespace code.
Commit dcd87999d415 ("igmp: net: Move igmp namespace init to correct file") moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This function is only called as part of per-namespace initialization, only if CONFIG_IP_MULTICAST is defined, otherwise igmp_mc_init() call in ip_init is compiled out, casuing the igmp pernet ops to not be registerd and those sysctl being left initialized with 0. However, there are certain functions, such as ip_mc_join_group which are always compiled and make use of some of those sysctls. Let's do a partial revert of the aforementioned commit and move the sysctl initialization back into tcp_sk_init, that way they will always have sane values. Fixes: dcd87999d415 ("igmp: net: Move igmp namespace init to correct file") Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595 Reported-by: Gerardo Exequiel PozziTested-by: Gerardo Exequiel Pozzi Signed-off-by: Nikolay Borisov Cc: # 4.6 --- net/ipv4/igmp.c | 6 -- net/ipv4/tcp_ipv4.c | 6 ++ 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index 28f14afd0dd3..498706b072fb 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -2974,12 +2974,6 @@ static int __net_init igmp_net_init(struct net *net) goto out_sock; } - /* Sysctl initialization */ - net->ipv4.sysctl_igmp_max_memberships = 20; - net->ipv4.sysctl_igmp_max_msf = 10; - /* IGMP reports for link-local multicast groups are enabled by default */ - net->ipv4.sysctl_igmp_llm_reports = 1; - net->ipv4.sysctl_igmp_qrv = 2; return 0; out_sock: diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index a20e7f03d5f7..64ba2c93d396 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2528,6 +2528,12 @@ static int __net_init tcp_sk_init(struct net *net) net->ipv4.sysctl_tcp_window_scaling = 1; net->ipv4.sysctl_tcp_timestamps = 1; + net->ipv4.sysctl_igmp_max_memberships = 20; + net->ipv4.sysctl_igmp_max_msf = 10; + /* IGMP reports for link-local multicast groups are enabled by default */ + net->ipv4.sysctl_igmp_llm_reports = 1; + net->ipv4.sysctl_igmp_qrv = 2; + return 0; fail: tcp_sk_exit(net); -- 2.7.4
Re:Re: Re:Re: Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan
At 2017-08-09 03:45:53, "Cong Wang"wrote: >On Mon, Aug 7, 2017 at 6:10 PM, Gao Feng wrote: >> >> Sorry, I don't get you clearly. Why the sock_hold() isn't helpful? > >I already told you, the dereference happends before sock_hold(). > >sock = rcu_dereference(callid_sock[call_id]); >if (sock) { >opt = >proto.pptp; >if (opt->dst_addr.sin_addr.s_addr != s_addr) <=== HERE >sock = NULL; >else >sock_hold(sk_pppox(sock)); >} > >If we don't wait for readers properly, sock could be freed at the >same time when deference it. Maybe I didn't show my explanation clearly. I think it won't happen as I mentioned in the last email. Because the pptp_release invokes the synchronize_rcu to make sure it, and actually there is no one which would invoke del_chan except pptp_release. It is guaranteed by that the pptp_release doesn't put the sock refcnt until complete all cleanup include marking sk_state as PPPOX_DEAD. In other words, even though the pptp_release is not the last user of this sock, the other one wouldn't invoke del_chan in pptp_sock_destruct. Because the condition "!(sk->sk_state & PPPOX_DEAD)" must be false. As summary, the del_chan and pppox_unbind_sock in pptp_sock_destruct are unnecessary. And it even brings confusing. Best Regards Feng > >> The pptp_release invokes synchronize_rcu after del_chan, it could make sure >> the others has increased the sock refcnt if necessary >> and the lookup is over. >> There is no one could get the sock after synchronize_rcu in pptp_release. > > >If this were true, then this code in pptp_sock_destruct() >would be unneeded: > >if (!(sk->sk_state & PPPOX_DEAD)) { >del_chan(pppox_sk(sk)); >pppox_unbind_sock(sk); >} > > >> >> >> But I think about another problem. >> It seems the pptp_sock_destruct should not invoke del_chan and >> pppox_unbind_sock. >> Because when the sock refcnt is 0, the pptp_release must have be invoked >> already. >> > > >I don't know. Looks like sock_orphan() is only called >in pptp_release(), but I am not sure if there is a case >we call sock destructor before release. > >Also note, this socket is very special, it doesn't support >poll(), sendmsg() or recvmsg()..
[PATCH] net: dsa: make dsa_switch_ops const
Make these structures const as they are only stored in the ops field of a dsa_switch structure, which is const. Done using Coccinelle. Signed-off-by: Bhumika Goyal--- drivers/net/dsa/dsa_loop.c | 2 +- drivers/net/dsa/lan9303-core.c | 2 +- drivers/net/dsa/mt7530.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c index 76d6660..7819a9f 100644 --- a/drivers/net/dsa/dsa_loop.c +++ b/drivers/net/dsa/dsa_loop.c @@ -257,7 +257,7 @@ static int dsa_loop_port_vlan_del(struct dsa_switch *ds, int port, return 0; } -static struct dsa_switch_ops dsa_loop_driver = { +static const struct dsa_switch_ops dsa_loop_driver = { .get_tag_protocol = dsa_loop_get_protocol, .setup = dsa_loop_setup, .get_strings= dsa_loop_get_strings, diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c index 15befd1..f058f98 100644 --- a/drivers/net/dsa/lan9303-core.c +++ b/drivers/net/dsa/lan9303-core.c @@ -797,7 +797,7 @@ static void lan9303_port_disable(struct dsa_switch *ds, int port, } } -static struct dsa_switch_ops lan9303_switch_ops = { +static const struct dsa_switch_ops lan9303_switch_ops = { .get_tag_protocol = lan9303_get_tag_protocol, .setup = lan9303_setup, .get_strings = lan9303_get_strings, diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c index 1270071..7d8cf927 100644 --- a/drivers/net/dsa/mt7530.c +++ b/drivers/net/dsa/mt7530.c @@ -979,7 +979,7 @@ static void mt7530_adjust_link(struct dsa_switch *ds, int port, return 0; } -static struct dsa_switch_ops mt7530_switch_ops = { +static const struct dsa_switch_ops mt7530_switch_ops = { .get_tag_protocol = mtk_get_tag_protocol, .setup = mt7530_setup, .get_strings= mt7530_get_strings, -- 1.9.1
Re: panic at sock_zerocopy_put when I ssh into a VM
Thanks for the report, David, and sorry for the breakage. I am not able to reproduce the issue with my qemu setup with vhost-net with experimental_zcopytx so far. But looking at the code from that codepath point of view, I do see that there are incorrect assumptions on ubuf_info fields being initialized anytime skb_zcopy(skb) is true, that are not true for the legacy zerocopy case. Specifically, uarg->mmp and uarg->zerocopy are only valid for msg_zerocopy. The first can conceivably result in dereferencing a garbage pointer if an ubuf_info from vhost is passed that does not have this field properly initialized. I will take a deeper look. As a first attempt, the following may fix the issue for this vhost case (only): diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index ba08b78ed630..e1e96d97de71 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -253,7 +253,7 @@ static int vhost_net_set_ubuf_info(struct vhost_net *n) zcopy = vhost_net_zcopy_mask & (0x1 << i); if (!zcopy) continue; - n->vqs[i].ubuf_info = kmalloc(sizeof(*n->vqs[i].ubuf_info) * + n->vqs[i].ubuf_info = kzalloc(sizeof(*n->vqs[i].ubuf_info) * UIO_MAXIOV, GFP_KERNEL); if (!n->vqs[i].ubuf_info) goto err; Less critical is correctly returning whether the operation completed without resorting to copying. Boolean uarg->zerocopy is undefined. This should not cause a kernel panic, as the vhost driver must handle both cases safely. Only msg_zerocopy sets bot SKBTX_ZEROCOPY_FRAG and SKBTX_DEV_ZEROCOPY, which is one way to identify this special case. diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 8c0708d2e5e6..7fb8b11ba8f6 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1273,7 +1273,10 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy) struct ubuf_info *uarg = skb_zcopy(skb); if (uarg) { - uarg->zerocopy = uarg->zerocopy && zerocopy; + if (skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG) + uarg->zerocopy = uarg->zerocopy && zerocopy; + else + uarg->zerocopy = zerocopy; sock_zerocopy_put(uarg); skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG; } On Tue, Aug 8, 2017 at 6:14 PM, David Ahernwrote: > Willem: > > I updated my host server this morning to top of net-next -- commit > 53b948356554. I am not doing anything fancy or intentionally using the > zerocopy code. I launch a VM with vhost and attempt to login via ssh. > Doing that triggers a panic in the host at sock_zerocopy_put. The > attached is a snapshot of the console -- best I can get for the stack trace. > > David
Re: [PATCH net-next v2] net: ipv6: avoid overhead when no custom FIB rules are installed
From: Vincent BernatDate: Tue, 8 Aug 2017 20:23:49 +0200 > If the user hasn't installed any custom rules, don't go through the > whole FIB rules layer. This is pretty similar to f4530fa574df (ipv4: > Avoid overhead when no custom FIB rules are installed). > > Using a micro-benchmark module [1], timing ip6_route_output() with > get_cycles(), with 40,000 routes in the main routing table, before this > patch: > > min=606 max=12911 count=627 average=1959 95th=4903 90th=3747 50th=1602 > mad=821 > table=254 avgdepth=21.8 maxdepth=39 > value │ ┊count > 600 │▒▒▒ 199 > 880 │▒▒▒ 43 > 1160 │▒▒▒ 48 > 1440 │▒▒▒░░░ 43 > 1720 │░░░ 59 > 2000 │▒▒▒ 50 > 2280 │▒▒░░░26 > 2560 │▒▒░ 31 > 2840 │▒▒ 28 > 3120 │▒░░ 17 > 3400 │▒░░░ 17 > 3680 │░ 8 > 3960 │░░ 11 > 4240 │░░6 > 4520 │░░░ 6 > 4800 │░░░ 9 > > After: > > min=544 max=11687 count=627 average=1776 95th=4546 90th=3585 50th=1227 > mad=565 > table=254 avgdepth=21.8 maxdepth=39 > value │ ┊count > 540 │201 > 800 │▒63 > 1060 │▒░ 68 > 1320 │▒▒▒░░39 > 1580 │▒▒░░ 32 > 1840 │▒▒ 32 > 2100 │▒▒░░░34 > 2360 │▒▒░░ 33 > 2620 │▒▒ 26 > 2880 │▒░░ 22 > 3140 │ 9 > 3400 │░ 8 > 3660 │░ 9 > 3920 │░░8 > 4180 │░░░ 8 > 4440 │░░░ 8 > > At the frequency of the host during the bench (~ 3.7 GHz), this is > about a 100 ns difference on the median value. > > A next step would be to collapse local and main tables, as in > 0ddcf43d5d4a (ipv4: FIB Local/MAIN table collapse). > > [1]: > https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c > > Signed-off-by: Vincent Bernat > Reviewed-by: Jiri Pirko Looks great, applied, thanks!
Re: [PATCH net] net: avoid skb_warn_bad_offload false positives on UFO
From: Willem de BruijnDate: Tue, 8 Aug 2017 14:22:55 -0400 > From: Willem de Bruijn > > skb_warn_bad_offload triggers a warning when an skb enters the GSO > stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL > checksum offload set. > > Commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise") > observed that SKB_GSO_DODGY producers can trigger the check and > that passing those packets through the GSO handlers will fix it > up. But, the software UFO handler will set ip_summed to > CHECKSUM_NONE. > > When __skb_gso_segment is called from the receive path, this > triggers the warning again. > > Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On > Tx these two are equivalent. On Rx, this better matches the > skb state (checksum computed), as CHECKSUM_NONE here means no > checksum computed. > > See also this thread for context: > http://patchwork.ozlabs.org/patch/799015/ > > Fixes: b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise") > Signed-off-by: Willem de Bruijn Applied and queued up for -stable, thank you.
Re: [PATCH net] Revert "vhost: cache used event for better performance"
On Wed, Aug 09, 2017 at 10:38:10AM +0800, Jason Wang wrote: > I think don't think current code can work well if vq.num is grater than > 2^15. Since all cached idx is u16. This looks like a bug which needs to be > fixed. That's a limitation of virtio 1.0. > > * else if the interval of vq.num is [2^15, 2^16): > > the logic in the original patch (809ecb9bca6a9) suffices > > * else (= less than 2^15) (optional): > > checking only (vring_need_event(vq->last_used_event, new + vq->num, new) > > would suffice. > > > > Am I missing something, or is this irrelevant? Could you pls repost the suggestion copying virtio-dev mailing list (subscriber only, sorry about that, but host/guest ABI discussions need to copy that list)? > Looks not, I think this may work. Let me do some test. > > Thanks I think that at this point it's prudent to add a feature bit as the virtio spec does not require to never move the event index back.
Re: [PATCH net-next 0/7] rtnetlink: allow to run selected handlers without rtnl
From: Florian WestphalDate: Tue, 8 Aug 2017 18:02:29 +0200 > Unfortunately RTNL mutex is a performance issue, e.g. a cpu adding > an ip address prevents other cpus from seemingly unrelated tasks > such as dumping tc classifiers. It is related if somehow the TC entries refer to IP addresses. Someone could create something like that. > Initial no-rtnl spots are ip6 fib add/del and netns new/getid. I could see the netns stuff being ok, but IPv6 route add/del I'm not so sure of. Because of things like nexthops etc. there are dependencies on other configuration things. That's the whole reason we have this unfortunate global synchronization point. If I'm changing some aspect of network configuration, I know I can atomically test any piece of networking configuration state. If I test a network address to make sure I can properly reacy X and use X as a nexthop in the route I'm adding, it will be there throughout the entire operation. There really is a hierachy of these dependencies. Device state, up to neighbour table state, up to protocol address state, up to routes, up to FIB tables, etc. etc. etc. I'd really like to make this operate more freely, but this is an extremely delicate area which has been bottled up like this for two decades so good luck :-)
Re: [PATCH net,stable] qmi_wwan: fix NULL deref on disconnect
From: Bjørn MorkDate: Tue, 8 Aug 2017 18:02:11 +0200 > qmi_wwan_disconnect is called twice when disconnecting devices with > separate control and data interfaces. The first invocation will set > the interface data to NULL for both interfaces to flag that the > disconnect has been handled. But the matching NULL check was left > out when qmi_wwan_disconnect was added, resulting in this oops: > > usb 2-1.4: USB disconnect, device number 4 > qmi_wwan 2-1.4:1.6 wwp0s29u1u4i6: unregister 'qmi_wwan' > usb-:00:1d.0-1.4, WWAN/QMI device > BUG: unable to handle kernel NULL pointer dereference at 00e0 > IP: qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan] > PGD 0 > P4D 0 > Oops: [#1] SMP > Modules linked in: > CPU: 2 PID: 33 Comm: kworker/2:1 Tainted: GE > 4.12.3-nr44-normandy-r1500619820+ #1 > Hardware name: LENOVO 4291LR7/4291LR7, BIOS CBET4000 4.6-810-g50522254fb > 07/21/2017 > Workqueue: usb_hub_wq hub_event [usbcore] > task: 8c882b716040 task.stack: b8e800d84000 > RIP: 0010:qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan] > RSP: 0018:b8e800d87b38 EFLAGS: 00010246 > RAX: RBX: RCX: > RDX: 0001 RSI: 8c8824f3f1d0 RDI: 8c8824ef6400 > RBP: 8c8824ef6400 R08: R09: > R10: b8e800d87780 R11: 0011 R12: c07ea0e8 > R13: 8c8824e2e000 R14: 8c8824e2e098 R15: > FS: () GS:8c883530() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 00e0 CR3: 000229ca5000 CR4: 000406e0 > Call Trace: >? usb_unbind_interface+0x71/0x270 [usbcore] >? device_release_driver_internal+0x154/0x210 >? qmi_wwan_unbind+0x6d/0xc0 [qmi_wwan] >? usbnet_disconnect+0x6c/0xf0 [usbnet] >? qmi_wwan_disconnect+0x87/0xc0 [qmi_wwan] >? usb_unbind_interface+0x71/0x270 [usbcore] >? device_release_driver_internal+0x154/0x210 > > Reported-and-tested-by: Nathaniel Roach > Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support") > Cc: Daniele Palmas > Signed-off-by: Bjørn Mork Applied and queued up for -stable, thanks.
Re: [PATCH] hv_set_ifconfig.sh double check before setting ip
From: Eduardo OtuboDate: Tue, 8 Aug 2017 15:53:45 +0200 > This patch fixes the behavior of the hv_set_ifconfig script when setting > the interface ip. Sometimes the interface has already been configured by > network daemon, in this case hv_set_ifconfig causes "RTNETLINK: file > exists error"; in order to avoid this error this patch makes sure double > checks the interface before trying anything. > > Signed-off-by: Eduardo Otubo And if the daemon sets the address after you test it but before you try to set it in the script, what happens? This is why I hate changes like this. They don't remove the problem, they make it smaller. And smaller in a bad way. Smaller makes the problem even more harder to diagnose when it happens. There is implicitly no synchonization between network configuration daemons and things people run by hand like this script. So, caveat emptor. I'm not applying this, sorry.
Re: [PATCH 33/35] wireless: realtek: rtl8192cu: constify usb_device_id
On 08/08/2017 11:05 AM, Arvind Yadav wrote: usb_device_id are not supposed to change at runtime. All functions working with usb_device_id provided by work with const usb_device_id. So mark the non-const structs as const. Signed-off-by: Arvind Yadav--- Acked-by: Larry Finger Thanks, Larry drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c index 96c923b..e673bc2 100644 --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c @@ -275,7 +275,7 @@ static struct rtl_hal_cfg rtl92cu_hal_cfg = { #define USB_VENDER_ID_REALTEK 0x0bda /* 2010-10-19 DID_USB_V3.4 */ -static struct usb_device_id rtl8192c_usb_ids[] = { +static const struct usb_device_id rtl8192c_usb_ids[] = { /*=== Realtek demoboard ===*/ /* Default ID */
Re: [PATCH][next] net: phy: mdio-bcm-unimac: fix unsigned wrap-around when decrementing timeout
From: Colin KingDate: Tue, 8 Aug 2017 10:52:32 +0100 > From: Colin Ian King > > Change post-decrement compare to pre-decrement to avoid an > unsigned integer wrap-around on timeout. This leads to the following > !timeout check to never to be true so -ETIMEDOUT is never returned. > > Detected by CoverityScan, CID#1452623 ("Logically dead code") > > Fixes: 69a60b0579a4 ("net: phy: mdio-bcm-unimac: factor busy polling loop") > Signed-off-by: Colin Ian King Applied.
Re: [PATCH net] ppp: fix xmit recursion detection on ppp channels
From: Guillaume NaultDate: Tue, 8 Aug 2017 11:43:24 +0200 > Commit e5dadc65f9e0 ("ppp: Fix false xmit recursion detect with two ppp > devices") dropped the xmit_recursion counter incrementation in > ppp_channel_push() and relied on ppp_xmit_process() for this task. > But __ppp_channel_push() can also send packets directly (using the > .start_xmit() channel callback), in which case the xmit_recursion > counter isn't incremented anymore. If such packets get routed back to > the parent ppp unit, ppp_xmit_process() won't notice the recursion and > will call ppp_channel_push() on the same channel, effectively creating > the deadlock situation that the xmit_recursion mechanism was supposed > to prevent. > > This patch re-introduces the xmit_recursion counter incrementation in > ppp_channel_push(). Since the xmit_recursion variable is now part of > the parent ppp unit, incrementation is skipped if the channel doesn't > have any. This is fine because only packets routed through the parent > unit may enter the channel recursively. > > Finally, we have to ensure that pch->ppp is not going to be modified > while executing ppp_channel_push(). Instead of taking this lock only > while calling ppp_xmit_process(), we now have to hold it for the full > ppp_channel_push() execution. This respects the ppp locks ordering > which requires locking ->upl before ->downl. > > Fixes: e5dadc65f9e0 ("ppp: Fix false xmit recursion detect with two ppp > devices") > Signed-off-by: Guillaume Nault Applied, thank you.
Re: [PATCH] net: phy: Use tab for indentation in Kconfig
From: Michal SimekDate: Tue, 8 Aug 2017 11:32:25 +0200 > Using tabs instead of space for indentaion > > Signed-off-by: Michal Simek This really isn't appropriate for the 'net' tree, it doesn't fix anything it just makes the spacing consistent. Please respin this patch against net-next, thank you.
Re: [PATCH net] rds: Reintroduce statistics counting
From: Håkon BuggeDate: Tue, 8 Aug 2017 11:13:32 +0200 > In commit 7e3f2952eeb1 ("rds: don't let RDS shutdown a connection > while senders are present"), refilling the receive queue was removed > from rds_ib_recv(), along with the increment of > s_ib_rx_refill_from_thread. > > Commit 73ce4317bf98 ("RDS: make sure we post recv buffers") > re-introduces filling the receive queue from rds_ib_recv(), but does > not add the statistics counter. rds_ib_recv() was later renamed to > rds_ib_recv_path(). > > This commit reintroduces the statistics counting of > s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq. > > Signed-off-by: Håkon Bugge > Reviewed-by: Knut Omang > Reviewed-by: Wei Lin Guay Applied.
Re: [PATCH net] tcp: fastopen: tcp_connect() must refresh the route
From: Eric DumazetDate: Tue, 08 Aug 2017 01:41:58 -0700 > From: Eric Dumazet > > With new TCP_FASTOPEN_CONNECT socket option, there is a possibility > to call tcp_connect() while socket sk_dst_cache is either NULL > or invalid. > > +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 > +0 connect(4, ..., ...) = 0 > > << sk->sk_dst_cache becomes obsolete, or even set to NULL >> > > +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > > > We need to refresh the route otherwise bad things can happen, > especially when syzkaller is running on the host :/ > > Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support") > Reported-by: Dmitry Vyukov > Signed-off-by: Eric Dumazet Applied and queued up for -stable.
[PATCH iproute2 master] examples/bpf: update list of examples
Remove deleted examples and add the new map in map example. Signed-off-by: Alexander Alemayhu--- examples/bpf/README | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/examples/bpf/README b/examples/bpf/README index 4247257850eb..1bbdda3f8dc1 100644 --- a/examples/bpf/README +++ b/examples/bpf/README @@ -1,13 +1,8 @@ eBPF toy code examples (running in kernel) to familiarize yourself with syntax and features: - - bpf_prog.c -> Classifier examples with using maps - bpf_shared.c-> Ingress/egress map sharing example - bpf_tailcall.c -> Using tail call chains - bpf_cyclic.c-> Simple cycle as tail calls - bpf_graft.c -> Demo on altering runtime behaviour - -User space code example: - - - bpf_agent.c -> Counterpart to bpf_prog.c for user - space to transfer/read out map data + - bpf_map_in_map.c -> Using map in map example -- 2.13.3
Re: [PATCHv3 net] net: sched: set xt_tgchk_param par.net properly in ipt_init_target
From: Xin LongDate: Tue, 8 Aug 2017 15:25:25 +0800 > Now xt_tgchk_param par in ipt_init_target is a local varibale, > par.net is not initialized there. Later when xt_check_target > calls target's checkentry in which it may access par.net, it > would cause kernel panic. > > Jaroslav found this panic when running: > > # ip link add TestIface type dummy > # tc qd add dev TestIface ingress handle : > # tc filter add dev TestIface parent : u32 match u32 0 0 \ > action xt -j CONNMARK --set-mark 4 > > This patch is to pass net param into ipt_init_target and set > par.net with it properly in there. > > v1->v2: > As Wang Cong pointed, I missed ipt_net_id != xt_net_id, so fix > it by also passing net_id to __tcf_ipt_init. > v2->v3: > Missed the fixes tag, so add it. > > Fixes: ecb2421b5ddf ("netfilter: add and use nf_ct_netns_get/put") > Reported-by: Jaroslav Aster > Signed-off-by: Xin Long Applied and queued up for -stable, thanks.
Re: [PATCH net-next] cxgb4: Clear On FLASH config file after a FW upgrade
From: Ganesh GoudarDate: Tue, 8 Aug 2017 11:20:52 +0530 > From: Arjun Vynipadath > > Because Firmware and the Firmware Configuration File need to be > in sync; clear out any On-FLASH Firmware Configuration File when new > Firmware is loaded. This will avoid difficult to diagnose and fix > problems with a mis-matched Firmware Configuration File which prevents the > adapter from being initialized. > > Original work by: Casey Leedom > Signed-off-by: Arjun Vynipadath > Signed-off-by: Ganesh Goudar Applied, thanks.
Re: [PATCH RFC v2 3/5] samples/bpf: Fix inline asm issues building samples on arm64
From: Joel FernandesDate: Mon, 7 Aug 2017 18:20:49 -0700 > On Mon, Aug 7, 2017 at 11:28 AM, David Miller wrote: >> The amount of hellish hacks we are adding to deal with this is getting >> way out of control. > > I agree with you that hellish hacks are being added which is why it > keeps breaking. I think one of the things my series does is to add > back inclusion of asm headers that were previously removed (that is > the worst hellish hack in my opinion that existing in mainline). So in > that respect my patch is an improvement and makes it possible to build > for arm64 platforms (which is currently broken in mainline). Yeah that is a problem. Perhaps another avenue of attack is to separate "type" header files from stuff that has functiond declarations and inline assembler code.
Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported
On Tue, Aug 08, 2017 at 09:22:39PM -0500, Bjorn Helgaas wrote: > On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote: > > When bit4 is set in the PCIe Device Control register, it indicates > > whether the device is permitted to use relaxed ordering. > > On some platforms using relaxed ordering can have performance issues or > > due to erratum can cause data-corruption. In such cases devices must avoid > > using relaxed ordering. > > > > This patch checks if there is any node in the hierarchy that indicates that > > using relaxed ordering is not safe. ... > > +EXPORT_SYMBOL(pcie_relaxed_ordering_supported); > > This is misnamed. This doesn't tell us anything about whether the > device *supports* relaxed ordering. It only tells us whether the > device is *permitted* to use it. > > When a device initiates a transaction, the hardware should set the RO > bit in the TLP with logic something like this: > > RO = && > && > > > The issue you're fixing is that some Completers don't handle RO > correctly. The determining factor is not the Requester, but the > Completer (for this series, a Root Port). So I think this should be > something like: > > int pcie_relaxed_ordering_broken(struct pci_dev *completer) > { > if (!completer) > return 0; > > return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING; > } > > and the caller should do something like this: > > if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev))) >adapter->flags |= ROOT_NO_RELAXED_ORDERING; > > That way it's obvious where the issue is, and it's obvious that the > answer might be different for peer-to-peer transactions than it is for > transactions to the root port, i.e., to coherent memory. After looking at the driver, I wonder if it would be simpler like this: int pcie_relaxed_ordering_enabled(struct pci_dev *dev) { u16 ctl; pcie_capability_read_word(dev, PCI_EXP_DEVCTL, ); return ctl & PCI_EXP_DEVCTL_RELAX_EN; } EXPORT_SYMBOL(pcie_relaxed_ordering_enabled); static void pci_configure_relaxed_ordering(struct pci_dev *dev) { struct pci_dev *root; if (dev->is_virtfn) return; /* PCI_EXP_DEVCTL_RELAX_EN is RsvdP in VFs */ if (!pcie_relaxed_ordering_enabled(dev)) return; /* * For now, we only deal with Relaxed Ordering issues with Root * Ports. Peer-to-peer DMA is another can of worms. */ root = pci_find_pcie_root_port(dev); if (!root) return; if (root->relaxed_ordering_broken) pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN); } This doesn't check every intervening switch, but I don't think we know about any issues except with root ports. And the driver could do: if (!pcie_relaxed_ordering_enabled(pdev)) adapter->flags |= ROOT_NO_RELAXED_ORDERING; The driver code wouldn't show anything about coherent memory vs. peer-to-peer, but we really don't have a clue about how to handle that yet anyway. I guess this is back to exactly what you proposed, except that I changed the name of pcie_relaxed_ordering_supported() to pcie_relaxed_ordering_enabled(), which I think is slightly more specific from the device's point of view. Bjorn
Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
On Wed, Aug 09, 2017 at 01:40:01AM +, Casey Leedom wrote: > | From: Bjorn Helgaas> | Sent: Tuesday, August 8, 2017 4:22 PM > | > | This needs to include a link to the Intel spec > | > (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf, > | sec 3.9.1). > > In the commit message or as a comment? Regardless, I agree. It's always > nice to be able to go back and see what the official documentation says. > However, that said, links on the internet are ... fragile as time goes by, > so we might want to simply quote section 3.9.1 in the commit message since > it's relatively short: > > 3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory > and Toward MMIO Regions (P2P) > > In order to maximize performance for PCIe devices in the processors > listed in Table 3-6 below, the soft- ware should determine whether the > accesses are toward coherent memory (system memory) or toward MMIO > regions (P2P access to other devices). If the access is toward MMIO > region, then software can command HW to set the RO bit in the TLP > header, as this would allow hardware to achieve maximum throughput for > these types of accesses. For accesses toward coherent memory, software > can command HW to clear the RO bit in the TLP header (no RO), as this > would allow hardware to achieve maximum throughput for these types of > accesses. > > Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing >PCIe Performance > > ProcessorCPU RP Device IDs > > Intel Xeon processors based on 6F01H-6F0EH > Broadwell microarchitecture > > Intel Xeon processors based on 2F01H-2F0EH > Haswell microarchitecture Agreed, links are prone to being broken. I would include in the changelog the complete title and order number, along with the link as a footnote. Wouldn't hurt to quote the section too, since it's short. > | It should also include a pointer to the AMD erratum, if available, or > | at least some reference to how we know it doesn't obey the rules. > > Getting an ACK from AMD seems like a forlorn cause at this point. My > contact was Bob Shaw and he stopped responding to me > messages almost a year ago saying that all of AMD's energies were being > redirected towards upcoming x86 products (likely Ryzen as we now know). As > far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM > SoC. > > On the specific issue, I can certainly write up somthing even more > extensive than I wrote up for the comment in drivers/pci/quirks.c. Please > review the comment I wrote up and tell me if you'd like something even more > detailed -- I'm usually acused of writing comments which are too long, so > this would be a new one on me ... :-) If you have any bug reports with info about how you debugged it and concluded that Seattle is broken, you could include a link (probably in the changelog). But if there isn't anything, there isn't anything. I might reorganize those patches as: 1) Add a PCI_DEV_FLAGS_RELAXED_ORDERING_BROKEN flag, the quirk that sets it, and the current patch [2/4] that uses it. 2) Add the Intel DECLARE_PCI_FIXUP_CLASS_EARLY()s with the Intel details. 3) Add the AMD DECLARE_PCI_FIXUP_CLASS_EARLY()s with the AMD details.
[PATCH 1/3] security: keys: Replace time_t/timespec with time64_t
The 'struct key' will use 'time_t' which we try to remove in the kernel, since 'time_t' is not year 2038 safe on 32bit systems. Also the 'struct keyring_search_context' will use 'timespec' type to record current time, which is also not year 2038 safe on 32bit systems. Thus this patch replaces 'time_t' with 'time64_t' which is year 2038 safe for 'struct key', and replace 'timespec' with 'time64_t' for the 'struct keyring_search_context', since we only look at the the seconds part of 'timespec' variable. Moreover we also change the codes where using the 'time_t' and 'timespec', and we can get current time by ktime_get_real_seconds() instead of current_kernel_time(), and use 'TIME64_MAX' macro to initialize the 'time64_t' type variable. Especially in proc.c file, we have replaced 'unsigned long' and 'timespec' type with 'u64' and 'time64_t' type to save the timeout value, which means user will get one 'u64' type timeout value by issuing proc_keys_show() function. Signed-off-by: Baolin Wang--- include/linux/key.h |7 --- security/keys/gc.c | 20 ++-- security/keys/internal.h |8 security/keys/key.c | 19 ++- security/keys/keyring.c | 18 +- security/keys/permission.c |3 +-- security/keys/proc.c | 20 ++-- security/keys/process_keys.c |2 +- 8 files changed, 45 insertions(+), 52 deletions(-) diff --git a/include/linux/key.h b/include/linux/key.h index 0441141..6d10f84 100644 --- a/include/linux/key.h +++ b/include/linux/key.h @@ -24,6 +24,7 @@ #include #include #include +#include #ifdef __KERNEL__ #include @@ -157,10 +158,10 @@ struct key { struct key_user *user; /* owner of this key */ void*security; /* security data for this key */ union { - time_t expiry; /* time at which key expires (or 0) */ - time_t revoked_at; /* time at which key was revoked */ + time64_texpiry; /* time at which key expires (or 0) */ + time64_trevoked_at; /* time at which key was revoked */ }; - time_t last_used_at; /* last time used for LRU keyring discard */ + time64_tlast_used_at; /* last time used for LRU keyring discard */ kuid_t uid; kgid_t gid; key_perm_t perm; /* access permissions */ diff --git a/security/keys/gc.c b/security/keys/gc.c index 87cb260..c99700e 100644 --- a/security/keys/gc.c +++ b/security/keys/gc.c @@ -32,7 +32,7 @@ static void key_gc_timer_func(unsigned long); static DEFINE_TIMER(key_gc_timer, key_gc_timer_func, 0, 0); -static time_t key_gc_next_run = LONG_MAX; +static time64_t key_gc_next_run = TIME64_MAX; static struct key_type *key_gc_dead_keytype; static unsigned long key_gc_flags; @@ -53,12 +53,12 @@ struct key_type key_type_dead = { * Schedule a garbage collection run. * - time precision isn't particularly important */ -void key_schedule_gc(time_t gc_at) +void key_schedule_gc(time64_t gc_at) { unsigned long expires; - time_t now = current_kernel_time().tv_sec; + time64_t now = ktime_get_real_seconds(); - kenter("%ld", gc_at - now); + kenter("%lld", gc_at - now); if (gc_at <= now || test_bit(KEY_GC_REAP_KEYTYPE, _gc_flags)) { kdebug("IMMEDIATE"); @@ -87,7 +87,7 @@ void key_schedule_gc_links(void) static void key_gc_timer_func(unsigned long data) { kenter(""); - key_gc_next_run = LONG_MAX; + key_gc_next_run = TIME64_MAX; key_schedule_gc_links(); } @@ -184,11 +184,11 @@ static void key_garbage_collector(struct work_struct *work) struct rb_node *cursor; struct key *key; - time_t new_timer, limit; + time64_t new_timer, limit; kenter("[%lx,%x]", key_gc_flags, gc_state); - limit = current_kernel_time().tv_sec; + limit = ktime_get_real_seconds(); if (limit > key_gc_delay) limit -= key_gc_delay; else @@ -204,7 +204,7 @@ static void key_garbage_collector(struct work_struct *work) gc_state |= KEY_GC_REAPING_DEAD_1; kdebug("new pass %x", gc_state); - new_timer = LONG_MAX; + new_timer = TIME64_MAX; /* As only this function is permitted to remove things from the key * serial tree, if cursor is non-NULL then it will always point to a @@ -235,7 +235,7 @@ static void key_garbage_collector(struct work_struct *work) if (gc_state & KEY_GC_SET_TIMER) { if (key->expiry > limit && key->expiry < new_timer) { - kdebug("will expire %x in %ld", + kdebug("will expire %x in
[PATCH 0/3] Fix y2038 issues for security/keys subsystem
Since 'time_t', 'timeval' and 'timespec' types are not year 2038 safe on 32 bits system, this patchset tries to fix this issues for security/keys subsystem and net/rxrpc subsystem which is connected with security/keys subsystem. Baolin Wang (3): security: keys: Replace time_t/timespec with time64_t security: keys: Replace time_t with time64_t for struct key_preparsed_payload net: rxrpc: Replace time_t type with time64_t type include/keys/rxrpc-type.h| 21 + include/linux/key-type.h |2 +- include/linux/key.h |7 --- net/rxrpc/ar-internal.h |2 +- net/rxrpc/key.c | 22 ++ net/rxrpc/rxkad.c| 14 +++--- security/keys/gc.c | 20 ++-- security/keys/internal.h |8 security/keys/key.c | 27 ++- security/keys/keyring.c | 18 +- security/keys/permission.c |3 +-- security/keys/proc.c | 20 ++-- security/keys/process_keys.c |2 +- 13 files changed, 93 insertions(+), 73 deletions(-) -- 1.7.9.5
[PATCH 2/3] security: keys: Replace time_t with time64_t for struct key_preparsed_payload
The 'struct key_preparsed_payload' will use 'time_t' which we will try to remove in the kernel, since 'time_t' is not year 2038 safe on 32bits systems. Thus this patch replaces 'time_t' with 'time64_t' which is year 2038 safe on 32 bits system for 'struct key_preparsed_payload', moreover we should use the 'TIME64_MAX' macro to initialize the 'time64_t' type variable. Signed-off-by: Baolin Wang--- include/linux/key-type.h |2 +- security/keys/key.c |8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/key-type.h b/include/linux/key-type.h index 8496cf6..4beb006 100644 --- a/include/linux/key-type.h +++ b/include/linux/key-type.h @@ -44,7 +44,7 @@ struct key_preparsed_payload { const void *data; /* Raw data */ size_t datalen;/* Raw datalen */ size_t quotalen; /* Quota length for proposed payload */ - time_t expiry; /* Expiry time of key */ + time64_texpiry; /* Expiry time of key */ }; typedef int (*request_key_actor_t)(struct key_construction *key, diff --git a/security/keys/key.c b/security/keys/key.c index 291a67c..d5c8941 100644 --- a/security/keys/key.c +++ b/security/keys/key.c @@ -446,7 +446,7 @@ static int __key_instantiate_and_link(struct key *key, if (authkey) key_revoke(authkey); - if (prep->expiry != TIME_T_MAX) { + if (prep->expiry != TIME64_MAX) { key->expiry = prep->expiry; key_schedule_gc(prep->expiry + key_gc_delay); } @@ -492,7 +492,7 @@ int key_instantiate_and_link(struct key *key, prep.data = data; prep.datalen = datalen; prep.quotalen = key->type->def_datalen; - prep.expiry = TIME_T_MAX; + prep.expiry = TIME64_MAX; if (key->type->preparse) { ret = key->type->preparse(); if (ret < 0) @@ -834,7 +834,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref, prep.data = payload; prep.datalen = plen; prep.quotalen = index_key.type->def_datalen; - prep.expiry = TIME_T_MAX; + prep.expiry = TIME64_MAX; if (index_key.type->preparse) { ret = index_key.type->preparse(); if (ret < 0) { @@ -968,7 +968,7 @@ int key_update(key_ref_t key_ref, const void *payload, size_t plen) prep.data = payload; prep.datalen = plen; prep.quotalen = key->type->def_datalen; - prep.expiry = TIME_T_MAX; + prep.expiry = TIME64_MAX; if (key->type->preparse) { ret = key->type->preparse(); if (ret < 0) -- 1.7.9.5
[PATCH 3/3] net: rxrpc: Replace time_t type with time64_t type
Since the 'expiry' variable of 'struct key_preparsed_payload' has been changed to 'time64_t' type, which is year 2038 safe on 32bits system. In net/rxrpc subsystem, we need convert 'u32' type to 'time64_t' type when copying ticket expires time to 'prep->expiry', then this patch introduces two helper functions to help convert 'u32' to 'time64_t' type. This patch also uses ktime_get_real_seconds() to get current time instead of get_seconds() which is not year 2038 safe on 32bits system. Signed-off-by: Baolin Wang--- include/keys/rxrpc-type.h | 21 + net/rxrpc/ar-internal.h |2 +- net/rxrpc/key.c | 22 ++ net/rxrpc/rxkad.c | 14 +++--- 4 files changed, 43 insertions(+), 16 deletions(-) diff --git a/include/keys/rxrpc-type.h b/include/keys/rxrpc-type.h index 5de0673..76421e2 100644 --- a/include/keys/rxrpc-type.h +++ b/include/keys/rxrpc-type.h @@ -127,4 +127,25 @@ struct rxrpc_key_data_v1 { #define AFSTOKEN_K5_ADDRESSES_MAX 16 /* max K5 addresses */ #define AFSTOKEN_K5_AUTHDATA_MAX 16 /* max K5 pieces of auth data */ +/* + * truncate a time64_t to the range from 1970 to 2106 as + * in the network protocol + */ +static inline u32 rxrpc_time64_to_u32(time64_t time) +{ + if (time < 0) + return 0; + + if (time > UINT_MAX) + return UINT_MAX; + + return (u32)time; +} + +/* extend u32 back to time64_t using the same 1970-2106 range */ +static inline time64_t rxrpc_u32_to_time64(u32 time) +{ + return (time64_t)time; +} + #endif /* _KEYS_RXRPC_TYPE_H */ diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 69b9733..3c11443 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -894,7 +894,7 @@ void rxrpc_new_incoming_connection(struct rxrpc_sock *, int rxrpc_request_key(struct rxrpc_sock *, char __user *, int); int rxrpc_server_keyring(struct rxrpc_sock *, char __user *, int); -int rxrpc_get_server_data_key(struct rxrpc_connection *, const void *, time_t, +int rxrpc_get_server_data_key(struct rxrpc_connection *, const void *, time64_t, u32); /* diff --git a/net/rxrpc/key.c b/net/rxrpc/key.c index 5436922..e2d3661 100644 --- a/net/rxrpc/key.c +++ b/net/rxrpc/key.c @@ -92,6 +92,7 @@ static int rxrpc_preparse_xdr_rxkad(struct key_preparsed_payload *prep, const __be32 *xdr, unsigned int toklen) { struct rxrpc_key_token *token, **pptoken; + time64_t expiry; size_t plen; u32 tktlen; @@ -158,8 +159,9 @@ static int rxrpc_preparse_xdr_rxkad(struct key_preparsed_payload *prep, pptoken = &(*pptoken)->next) continue; *pptoken = token; - if (token->kad->expiry < prep->expiry) - prep->expiry = token->kad->expiry; + expiry = rxrpc_u32_to_time64(token->kad->expiry); + if (expiry < prep->expiry) + prep->expiry = expiry; _leave(" = 0"); return 0; @@ -433,6 +435,7 @@ static int rxrpc_preparse_xdr_rxk5(struct key_preparsed_payload *prep, struct rxrpc_key_token *token, **pptoken; struct rxk5_key *rxk5; const __be32 *end_xdr = xdr + (toklen >> 2); + time64_t expiry; int ret; _enter(",{%x,%x,%x,%x},%u", @@ -533,8 +536,9 @@ static int rxrpc_preparse_xdr_rxk5(struct key_preparsed_payload *prep, pptoken = &(*pptoken)->next) continue; *pptoken = token; - if (token->kad->expiry < prep->expiry) - prep->expiry = token->kad->expiry; + expiry = rxrpc_u32_to_time64(token->kad->expiry); + if (expiry < prep->expiry) + prep->expiry = expiry; _leave(" = 0"); return 0; @@ -691,6 +695,7 @@ static int rxrpc_preparse(struct key_preparsed_payload *prep) { const struct rxrpc_key_data_v1 *v1; struct rxrpc_key_token *token, **pp; + time64_t expiry; size_t plen; u32 kver; int ret; @@ -777,8 +782,9 @@ static int rxrpc_preparse(struct key_preparsed_payload *prep) while (*pp) pp = &(*pp)->next; *pp = token; - if (token->kad->expiry < prep->expiry) - prep->expiry = token->kad->expiry; + expiry = rxrpc_u32_to_time64(token->kad->expiry); + if (expiry < prep->expiry) + prep->expiry = expiry; token = NULL; ret = 0; @@ -955,7 +961,7 @@ int rxrpc_server_keyring(struct rxrpc_sock *rx, char __user *optval, */ int rxrpc_get_server_data_key(struct rxrpc_connection *conn, const void *session_key, - time_t expiry, + time64_t expiry, u32 kvno) { const struct cred *cred = current_cred(); @@ -982,7 +988,7 @@ int rxrpc_get_server_data_key(struct
Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support
To be clear, the OVS implementation is a placeholder. It will get replaced by whatever netdev implements, and that's OK. I didn't focus on making it perfect because I knew that. Instead, I just made sure it was good enough for an internal OVS implementation that doesn't fix any ABI or API. OVS can even change the user-visible action names, as long as we do that soon (and encap/decap versus push/pop doesn't matter to me). The considerations for netdev are different and more permanent. On Wed, Aug 09, 2017 at 02:05:12AM +, Yang, Yi Y wrote: > Hi, Jiri > > Thank you for your comments. > > __be32 c[4] is the name Ben Pfaff suggested, the original name is c1, c2, c3, > c4, they are context data, so c seems ok, too :-) > > OVS has merged it and has the same name, maybe the better way is adding > comment /* Context data */ after it. > > For MD type 2, struct ovs_key_nsh is very difficult to cover it, so far we > don't know how to support MD type 2 better, Geneve defined 64 > tun_metadata0-63 to handle this, those keys are parts of struct flow_tnl, the > highest possibility is to reuse those keys. > > So for future MD type 2, we will have two parts of keys, one is from struct > ovs_key_nsh, another is from struct flow_tnl, this won't break the old uAPI. > > "#define OVS_ENCAP_NSH_MAX_MD_LEN 16" is changed per Ben's comment from 256, > Ben thinks 256 is too big but we only support MD type 1 now. We still have > ways to extend it, for example: > > struct ovs_action_encap_nsh * oaen = (struct ovs_action_encap_nsh *) malloc > (sizeof(struct ovs_action_encap_nsh) + ANY_SIZE); > > nl_msg_put_unspec(actions, OVS_ACTION_ATTR_ENCAP_NSH, > oaen, sizeof(struct ovs_action_encap_nsh) + > ANY_SIZE); > > In addition, we also need to consider, OVS userspace code must be consistent > with here, so keeping it intact will be better, we have way to support > dynamically extension when we add MD type 2 support. > > About action name, unfortunately, userspace data plane has named them as > encap_nsh & decap_nsh, Jan, what do you think about Jiri's suggestion? > > But from my understanding, encap_* & decap_* are better because they > corresponding to generic encap & decap actions, in addition, encap semantics > are different from push, encap just pushed an empty header with default > values, users must use set_field to set the content of the header. > > Again, OVS userspace code must be consistent with here, so keeping it intact > will be better because OVS userspace code was there. > > > -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On > Behalf Of Jiri Benc > Sent: Tuesday, August 8, 2017 10:28 PM > To: Yang, Yi Y> Cc: netdev@vger.kernel.org; d...@openvswitch.org; da...@davemloft.net > Subject: Re: [PATCH net-next] openvswitch: add NSH support > > On Tue, 8 Aug 2017 12:59:40 +0800, Yi Yang wrote: > > +struct ovs_key_nsh { > > + __u8 flags; > > + __u8 mdtype; > > + __u8 np; > > + __u8 pad; > > + __be32 path_hdr; > > + __be32 c[4]; > > "c" is a very poor name. Please rename it to something that expresses what > this field contains. > > Also, this looks like MD type 1 only. How are those fields going to work with > MD type 2? I don't think MD type 2 implementation is necessary in this patch > but I'd like to know how this is going to work - it's uAPI and thus set in > stone once this is merged. The uAPI needs to be designed with future use in > mind. > > > +#define OVS_ENCAP_NSH_MAX_MD_LEN 16 > > +/* > > + * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH > > + * @flags: NSH header flags. > > + * @mdtype: NSH metadata type. > > + * @mdlen: Length of NSH metadata in bytes. > > + * @np: NSH next_protocol: Inner packet type. > > + * @path_hdr: NSH service path id and service index. > > + * @metadata: NSH metadata for MD type 1 or 2 */ struct > > +ovs_action_encap_nsh { > > + __u8 flags; > > + __u8 mdtype; > > + __u8 mdlen; > > + __u8 np; > > + __be32 path_hdr; > > + __u8 metadata[OVS_ENCAP_NSH_MAX_MD_LEN]; > > This is wrong. The metadata size is set to a fixed size by this and cannot be > ever extended, or at least not easily. Netlink has attributes for exactly > these cases and that's what needs to be used here. > > > @@ -835,6 +866,8 @@ enum ovs_action_attr { > > OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */ > > OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */ > > OVS_ACTION_ATTR_POP_ETH, /* No argument. */ > > + OVS_ACTION_ATTR_ENCAP_NSH,/* struct ovs_action_encap_nsh. */ > > + OVS_ACTION_ATTR_DECAP_NSH,/* No argument. */ > > Use "push" and "pop", not "encap" and "decap" to be consistent with the > naming in the rest of the file. We use encap and decap for tunnel operations. > This code does not use lwtunnels, thus push and pop is more appropriate. > > Jiri >
Re: [PATCH net] Revert "vhost: cache used event for better performance"
On 2017年07月30日 14:26, K. Den wrote: On Wed, 2017-07-26 at 19:08 +0300, Michael S. Tsirkin wrote: On Wed, Jul 26, 2017 at 09:37:15PM +0800, Jason Wang wrote: On 2017年07月26日 21:18, Jason Wang wrote: On 2017年07月26日 20:57, Michael S. Tsirkin wrote: On Wed, Jul 26, 2017 at 04:03:17PM +0800, Jason Wang wrote: This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92. Since it was reported to break vhost_net. We want to cache used event and use it to check for notification. We try to valid cached used event by checking whether or not it was ahead of new, but this is not correct all the time, it could be stale and there's no way to know about this. Signed-off-by: Jason WangCould you supply a bit more data here please? How does it get stale? What does guest need to do to make it stale? This will be helpful if anyone wants to bring it back, or if we want to extend the protocol. The problem we don't know whether or not guest has published a new used event. The check vring_need_event(vq->last_used_event, new + vq->num, new) is not sufficient to check for this. Thanks More notes, the previous assumption is that we don't move used event back, but this could happen in fact if idx is wrapper around. You mean if the 16 bit index wraps around after 64K entries. Makes sense. Will repost and add this into commit log. Thanks Hi, Hi, sorry for the late reply, was on vacation last week. I am just curious but I have got a question: AFAIU, if you wanted to keep the caching mechanism alive in the code base, the following two changes could clear off the issue, or not?: (1) Always fetch the latest event value from guest when signalled_used event is invalid, which includes last_used_idx wraps-around case. Otherwise we might need changes which would complicate too much the logic to properly decide whether or not to skip signalling in the next vhost_notify round. (2) On top of that, split the signal-postponing logic to three cases like: * if the interval of vq.num is [2^16, UINT_MAX]: any cached event is in should-postpone-signalling interval, so paradoxically must always do signalling. I think don't think current code can work well if vq.num is grater than 2^15. Since all cached idx is u16. This looks like a bug which needs to be fixed. * else if the interval of vq.num is [2^15, 2^16): the logic in the original patch (809ecb9bca6a9) suffices * else (= less than 2^15) (optional): checking only (vring_need_event(vq->last_used_event, new + vq->num, new) would suffice. Am I missing something, or is this irrelevant? Looks not, I think this may work. Let me do some test. Thanks I would appreciate if you could elaborate a bit more how the situation where event idx wraps around and moves back would make trouble. Thanks.
[PATCH net-next] liquidio: napi cleanup
From: Intiyaz BashaDisable napi when interface is going down. Delete napi when destroying the interface. Signed-off-by: Intiyaz Basha Signed-off-by: Felix Manlunas --- drivers/net/ethernet/cavium/liquidio/lio_main.c| 15 +++ drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 14 ++ 2 files changed, 29 insertions(+) diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c index 3ec0dd9..cbd6287 100644 --- a/drivers/net/ethernet/cavium/liquidio/lio_main.c +++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c @@ -1736,6 +1736,10 @@ static void liquidio_destroy_nic_device(struct octeon_device *oct, int ifidx) oct->droq[0]->ops.poll_mode = 0; } + /* Delete NAPI */ + list_for_each_entry_safe(napi, n, >napi_list, dev_list) + netif_napi_del(napi); + if (atomic_read(>ifstate) & LIO_IFSTATE_REGISTERED) unregister_netdev(netdev); @@ -2770,6 +2774,17 @@ static int liquidio_stop(struct net_device *netdev) { struct lio *lio = GET_LIO(netdev); struct octeon_device *oct = lio->oct_dev; + struct napi_struct *napi, *n; + + if (oct->props[lio->ifidx].napi_enabled) { + list_for_each_entry_safe(napi, n, >napi_list, dev_list) + napi_disable(napi); + + oct->props[lio->ifidx].napi_enabled = 0; + + if (OCTEON_CN23XX_PF(oct)) + oct->droq[0]->ops.poll_mode = 0; + } ifstate_reset(lio, LIO_IFSTATE_RUNNING); diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c index 935ff29..c6f52f2 100644 --- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c +++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c @@ -1137,6 +1137,10 @@ static void liquidio_destroy_nic_device(struct octeon_device *oct, int ifidx) oct->droq[0]->ops.poll_mode = 0; } + /* Delete NAPI */ + list_for_each_entry_safe(napi, n, >napi_list, dev_list) + netif_napi_del(napi); + if (atomic_read(>ifstate) & LIO_IFSTATE_REGISTERED) unregister_netdev(netdev); @@ -1784,6 +1788,16 @@ static int liquidio_stop(struct net_device *netdev) { struct lio *lio = GET_LIO(netdev); struct octeon_device *oct = lio->oct_dev; + struct napi_struct *napi, *n; + + if (oct->props[lio->ifidx].napi_enabled) { + list_for_each_entry_safe(napi, n, >napi_list, dev_list) + napi_disable(napi); + + oct->props[lio->ifidx].napi_enabled = 0; + + oct->droq[0]->ops.poll_mode = 0; + } netif_info(lio, ifdown, lio->netdev, "Stopping interface!\n"); /* Inform that netif carrier is down */
Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported
On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote: > When bit4 is set in the PCIe Device Control register, it indicates > whether the device is permitted to use relaxed ordering. > On some platforms using relaxed ordering can have performance issues or > due to erratum can cause data-corruption. In such cases devices must avoid > using relaxed ordering. > > This patch checks if there is any node in the hierarchy that indicates that > using relaxed ordering is not safe. I think you only check the devices between the root port and the target device. For example, you don't check siblings or cousins of the target device. > In such cases the patch turns off the > relaxed ordering by clearing the eapability for this device. s/eapability/capability/ > And if the > device is probably running in a guest machine, we should do nothing. I don't know what this sentence means. "Probably running in a guest machine" doesn't really make sense, and there's nothing in your patch that explicitly checks for being in a guest machine. > Signed-off-by: Ding Tianhong> Acked-by: Alexander Duyck > Acked-by: Ashok Raj > --- > drivers/pci/pci.c | 29 + > drivers/pci/probe.c | 37 + > include/linux/pci.h | 2 ++ > 3 files changed, 68 insertions(+) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index af0cc34..4f9d7c1 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -4854,6 +4854,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps) > EXPORT_SYMBOL(pcie_set_mps); > > /** > + * pcie_clear_relaxed_ordering - clear PCI Express relaxed ordering bit > + * @dev: PCI device to query > + * > + * If possible clear relaxed ordering Why "If possible"? The bit is required to be RW or hardwired to zero, so PCI_EXP_DEVCTL_RELAX_EN should *always* be zero when this returns. > + */ > +int pcie_clear_relaxed_ordering(struct pci_dev *dev) > +{ > + return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, > + PCI_EXP_DEVCTL_RELAX_EN); > +} > +EXPORT_SYMBOL(pcie_clear_relaxed_ordering); The current series doesn't add any callers of this except pci_configure_relaxed_ordering(), so it doesn't need to be exported to modules. I think I would put both of these functions in drivers/pci/probe.c. Then this one could be static and you'd only have to add pcie_relaxed_ordering_supported() to include/linux/pci.h. > + > +/** > + * pcie_relaxed_ordering_supported - Probe for PCIe relexed ordering support s/relexed/relaxed/ > + * @dev: PCI device to query > + * > + * Returns true if the device support relaxed ordering attribute. > + */ > +bool pcie_relaxed_ordering_supported(struct pci_dev *dev) > +{ > + u16 v; > + > + pcie_capability_read_word(dev, PCI_EXP_DEVCTL, ); > + > + return !!(v & PCI_EXP_DEVCTL_RELAX_EN); > +} > +EXPORT_SYMBOL(pcie_relaxed_ordering_supported); This is misnamed. This doesn't tell us anything about whether the device *supports* relaxed ordering. It only tells us whether the device is *permitted* to use it. When a device initiates a transaction, the hardware should set the RO bit in the TLP with logic something like this: RO = && && The issue you're fixing is that some Completers don't handle RO correctly. The determining factor is not the Requester, but the Completer (for this series, a Root Port). So I think this should be something like: int pcie_relaxed_ordering_broken(struct pci_dev *completer) { if (!completer) return 0; return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING; } and the caller should do something like this: if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev))) adapter->flags |= ROOT_NO_RELAXED_ORDERING; That way it's obvious where the issue is, and it's obvious that the answer might be different for peer-to-peer transactions than it is for transactions to the root port, i.e., to coherent memory. > + > +/** > * pcie_get_minimum_link - determine minimum link settings of a PCI device > * @dev: PCI device to query > * @speed: storage for minimum speed > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > index c31310d..48df012 100644 > --- a/drivers/pci/probe.c > +++ b/drivers/pci/probe.c > @@ -1762,6 +1762,42 @@ static void pci_configure_extended_tags(struct pci_dev > *dev) >PCI_EXP_DEVCTL_EXT_TAG); > } > > +/** > + * pci_dev_should_disable_relaxed_ordering - check if the PCI device > + * should disable the relaxed ordering attribute. > + * @dev: PCI device > + * > + * Return true if any of the PCI devices above us do not support > + * relaxed ordering. > + */ > +static bool pci_dev_should_disable_relaxed_ordering(struct pci_dev *dev) > +{ > + while (dev) { > + if (dev->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING) > +
RE: [PATCH net-next] openvswitch: add NSH support
Hi, Jiri Thank you for your comments. __be32 c[4] is the name Ben Pfaff suggested, the original name is c1, c2, c3, c4, they are context data, so c seems ok, too :-) OVS has merged it and has the same name, maybe the better way is adding comment /* Context data */ after it. For MD type 2, struct ovs_key_nsh is very difficult to cover it, so far we don't know how to support MD type 2 better, Geneve defined 64 tun_metadata0-63 to handle this, those keys are parts of struct flow_tnl, the highest possibility is to reuse those keys. So for future MD type 2, we will have two parts of keys, one is from struct ovs_key_nsh, another is from struct flow_tnl, this won't break the old uAPI. "#define OVS_ENCAP_NSH_MAX_MD_LEN 16" is changed per Ben's comment from 256, Ben thinks 256 is too big but we only support MD type 1 now. We still have ways to extend it, for example: struct ovs_action_encap_nsh * oaen = (struct ovs_action_encap_nsh *) malloc (sizeof(struct ovs_action_encap_nsh) + ANY_SIZE); nl_msg_put_unspec(actions, OVS_ACTION_ATTR_ENCAP_NSH, oaen, sizeof(struct ovs_action_encap_nsh) + ANY_SIZE); In addition, we also need to consider, OVS userspace code must be consistent with here, so keeping it intact will be better, we have way to support dynamically extension when we add MD type 2 support. About action name, unfortunately, userspace data plane has named them as encap_nsh & decap_nsh, Jan, what do you think about Jiri's suggestion? But from my understanding, encap_* & decap_* are better because they corresponding to generic encap & decap actions, in addition, encap semantics are different from push, encap just pushed an empty header with default values, users must use set_field to set the content of the header. Again, OVS userspace code must be consistent with here, so keeping it intact will be better because OVS userspace code was there. -Original Message- From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Jiri Benc Sent: Tuesday, August 8, 2017 10:28 PM To: Yang, Yi YCc: netdev@vger.kernel.org; d...@openvswitch.org; da...@davemloft.net Subject: Re: [PATCH net-next] openvswitch: add NSH support On Tue, 8 Aug 2017 12:59:40 +0800, Yi Yang wrote: > +struct ovs_key_nsh { > + __u8 flags; > + __u8 mdtype; > + __u8 np; > + __u8 pad; > + __be32 path_hdr; > + __be32 c[4]; "c" is a very poor name. Please rename it to something that expresses what this field contains. Also, this looks like MD type 1 only. How are those fields going to work with MD type 2? I don't think MD type 2 implementation is necessary in this patch but I'd like to know how this is going to work - it's uAPI and thus set in stone once this is merged. The uAPI needs to be designed with future use in mind. > +#define OVS_ENCAP_NSH_MAX_MD_LEN 16 > +/* > + * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH > + * @flags: NSH header flags. > + * @mdtype: NSH metadata type. > + * @mdlen: Length of NSH metadata in bytes. > + * @np: NSH next_protocol: Inner packet type. > + * @path_hdr: NSH service path id and service index. > + * @metadata: NSH metadata for MD type 1 or 2 */ struct > +ovs_action_encap_nsh { > + __u8 flags; > + __u8 mdtype; > + __u8 mdlen; > + __u8 np; > + __be32 path_hdr; > + __u8 metadata[OVS_ENCAP_NSH_MAX_MD_LEN]; This is wrong. The metadata size is set to a fixed size by this and cannot be ever extended, or at least not easily. Netlink has attributes for exactly these cases and that's what needs to be used here. > @@ -835,6 +866,8 @@ enum ovs_action_attr { > OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */ > OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */ > OVS_ACTION_ATTR_POP_ETH, /* No argument. */ > + OVS_ACTION_ATTR_ENCAP_NSH,/* struct ovs_action_encap_nsh. */ > + OVS_ACTION_ATTR_DECAP_NSH,/* No argument. */ Use "push" and "pop", not "encap" and "decap" to be consistent with the naming in the rest of the file. We use encap and decap for tunnel operations. This code does not use lwtunnels, thus push and pop is more appropriate. Jiri
Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Bjorn Helgaas| Sent: Tuesday, August 8, 2017 4:22 PM | | This needs to include a link to the Intel spec | (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf, | sec 3.9.1). In the commit message or as a comment? Regardless, I agree. It's always nice to be able to go back and see what the official documentation says. However, that said, links on the internet are ... fragile as time goes by, so we might want to simply quote section 3.9.1 in the commit message since it's relatively short: 3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory and Toward MMIO Regions (P2P) In order to maximize performance for PCIe devices in the processors listed in Table 3-6 below, the soft- ware should determine whether the accesses are toward coherent memory (system memory) or toward MMIO regions (P2P access to other devices). If the access is toward MMIO region, then software can command HW to set the RO bit in the TLP header, as this would allow hardware to achieve maximum throughput for these types of accesses. For accesses toward coherent memory, software can command HW to clear the RO bit in the TLP header (no RO), as this would allow hardware to achieve maximum throughput for these types of accesses. Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing PCIe Performance ProcessorCPU RP Device IDs Intel Xeon processors based on 6F01H-6F0EH Broadwell microarchitecture Intel Xeon processors based on 2F01H-2F0EH Haswell microarchitecture | It should also include a pointer to the AMD erratum, if available, or | at least some reference to how we know it doesn't obey the rules. Getting an ACK from AMD seems like a forlorn cause at this point. My contact was Bob Shaw and he stopped responding to me messages almost a year ago saying that all of AMD's energies were being redirected towards upcoming x86 products (likely Ryzen as we now know). As far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM SoC. On the specific issue, I can certainly write up somthing even more extensive than I wrote up for the comment in drivers/pci/quirks.c. Please review the comment I wrote up and tell me if you'd like something even more detailed -- I'm usually acused of writing comments which are too long, so this would be a new one on me ... :-) | Ashok, thanks for chiming in. Now that you have, I have a few more | questions for you: I can answer a few of these: | - Is the above doc the one you mentioned as being now public? Yes. Ashok worked with me to the extent he was allowed prior to the publishing of the public technocal note, but he couldn't say much. (Believe it or not, it is possible to say less than the quoted section above.) When the note was published, Patrick Cramer sent me the note about it and pointed me at section 3.9.1. | - Is this considered a hardware erratum? I certainly consider it a Hardware Bug. And I'm really hoping that Ashok will be able to find a "Chicken Bit" which allows the broken feature to be turned off. Remember, the Relaxed Ordering Attribute on a Transaction Layer Packet is simply a HINT. It is perfectly reasonable for a compliant implementation to simply ignore the Relaxed Ordering Attribute on an incoming TLP Request. The sole responsibility of a compliant implementation is to return the exact same Relaxed Ordering and No Snoop Attributes in any TLP Response (The rules for ID-Based Ordering Attribute are more complex.) Earlier Intel Root Complexes did exactly this: they ignored the Relaxed Ordering Attribute and there was no performance difference for using/not-using it. It's pretty obvious that an attempt was made to implement optimizations surounding the use of Relaxed Ordering and they didn't work. | - If so, is there a pointer to that as well? Intel is historically tight-lipped about admiting any bugs/errata in their products. I'm guessing that the above quoted Section 3.9.1 is likely to be all we ever get. The language above regarding TLPs targetting Coherent Shared Memory are basically as much of an admission that they got it wrong as we're going to get. But heck, maybe we'll get lucky ... Especially with regard to the hoped for "Chicken Bit" ... | - If this is not considered an erratum, can you provide any guidance |about how an OS should determine when it should use RO? Software? We don't need no stinking software! Sorry, I couldn't resist. | Relying on a list of device IDs in an optimization manual is OK for an | erratum, but if it's *not* an erratum, it seems like a hole in the specs | because as far as I know there's no generic way for the OS to discover | whether to use RO. Well, here's to hoping that Ashok and/or Patrick are able to offer
Re: [Patch net-next] net_sched: get rid of some forward declarations
From: Cong WangDate: Mon, 7 Aug 2017 15:26:50 -0700 > If we move up tcf_fill_node() we can get rid of these > forward declarations. > > Also, move down tfilter_notify_chain() to group them together. > > Reported-by: Jamal Hadi Salim > Cc: Jamal Hadi Salim > Signed-off-by: Cong Wang Applied.
Re: [RFC PATCH] net: don't set __LINK_STATE_START until after dev->open() call
From: Jacob KellerDate: Mon, 7 Aug 2017 15:24:21 -0700 > Fix an issue with relying on netif_running() which could be true during > when dev->open() handler is being called, even if it would exit with > a failure. This ensures the state does not get set and removed with > a narrow race for other callers to read it as open when infact it never > finished opening. > > Signed-off-by: Jacob Keller > --- > I found this as a result of debugging a race condition in the i40evf > driver, in which we assumed that netif_running() would not be true until > after dev->open() had been called and succeeded. Unfortunately we can't > hold the rtnl_lock() while checking netif_running() because it would > cause a deadlock between our reset task and our ndo_open handler. > > I am wondering whether the proposed change is acceptable here, or > whether some ndo_open handlers rely on __LINK_STATE_START being true > prior to their being called? I think this has the potential to break a bunch of drivers, but I cannot prove this. A lot of drivers have several pieces of state setup when they bring the device up. And these routines are also invoked from other code paths like suspend/resume, PCI-E error recovery, etc. and they probably do netif_running() calls here and there. This behavior has been this way for a very long time, so the risk is quite high I think.
Re: [PATCH net-next] net: dsa: lan9303: Only allocate 3 ports
From: Egil HjelmelandDate: Tue, 8 Aug 2017 00:22:21 +0200 > Save 2628 bytes on arm eabi by allocate only the required 3 ports. > > Now that ds->num_ports is correct: In net/dsa/tag_lan9303.c > eliminate duplicate LAN9303_MAX_PORTS, use ds->num_ports. > (Matching the pattern of other net/dsa/tag_xxx.c files.) > > Signed-off-by: Egil Hjelmeland Applied.
Re: [PATCH net-next] liquidio: fix misspelled firmware image filenames
From: Felix ManlunasDate: Mon, 7 Aug 2017 12:22:15 -0700 > From: Derek Chickles > > Fix misspelled firmware image filenames advertised via MODULE_FIRMWARE(). > > Signed-off-by: Derek Chickles > Signed-off-by: Felix Manlunas Applied.
Re: [PATCH net-next] selftests: bpf: add a test for XDP redirect
From: William TuDate: Mon, 7 Aug 2017 13:14:42 -0700 > Add test for xdp_redirect by creating two namespaces with two > veth peers, then forward packets in-between. > > Signed-off-by: William Tu Applied, thank you.
Re: [PATCH net-next v2 1/2] bpf: Move check_uarg_tail_zero() upward
From: Mickaël SalaünDate: Mon, 7 Aug 2017 20:45:19 +0200 > The function check_uarg_tail_zero() may be useful for other part of the > code in the syscall.c file. Move this function at the beginning of the > file. > > Signed-off-by: Mickaël Salaün > Acked-by: Daniel Borkmann Applied.
Re: [PATCH net-next v2 2/2] bpf: Extend check_uarg_tail_zero() checks
From: Mickaël SalaünDate: Mon, 7 Aug 2017 20:45:20 +0200 > The function check_uarg_tail_zero() was created from bpf(2) for > BPF_OBJ_GET_INFO_BY_FD without taking the access_ok() nor the PAGE_SIZE > checks. Make this checks more generally available while unlikely to be > triggered, extend the memory range check and add an explanation > including why the ToCToU should not be a security concern. > > Signed-off-by: Mickaël Salaün > Acked-by: Daniel Borkmann > Cc: Alexei Starovoitov > Cc: David S. Miller > Cc: Kees Cook > Cc: Martin KaFai Lau > Link: > https://lkml.kernel.org/r/CAGXu5j+vRGFvJZmjtAcT8Hi8B+Wz0e1b6VKYZHfQP_=dxzc...@mail.gmail.com Applied.
Re: [PATCH net-next 1/1] netvsc: make sure and unregister datapath
From: Stephen HemmingerDate: Mon, 7 Aug 2017 11:30:00 -0700 > Go back to switching datapath directly in the notifier callback. > Otherwise datapath might not get switched on unregister. > > No need for calling the NOTIFY_PEERS notifier since that is only for > a gratitious ARP/ND packet; but that is not required with Hyper-V > because both VF and synthetic NIC have the same MAC address. > > Reported-by: Vitaly Kuznetsov > Fixes: 0c195567a8f6 ("netvsc: transparent VF management") > Signed-off-by: Stephen Hemminger Applied, thanks Stephen.
Re: [PATCH net-next] liquidio: fix wrong info about vf rx/tx ring parameters reported to ethtool
From: Felix ManlunasDate: Mon, 7 Aug 2017 10:39:00 -0700 > From: Intiyaz Basha > > Information reported to ethtool about vf rx/tx ring parameters is wrong. > Fix it by adding the missing initializations. > > Signed-off-by: Intiyaz Basha > Signed-off-by: Felix Manlunas Applied.
Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure
From: Edward CreeDate: Tue, 8 Aug 2017 21:23:03 +0100 > In any case, if you go with the enum approach and later it _does_ prove > necessary to have more flexibility, you can have enum values dynamically > assigned (like genetlink manages to do); and programs using the existing > fixed IDs will continue to work. Indeed: > It's much harder to go the other way... THIS!
Re: [PATCH] net: dsa: mediatek: add adjust link support for user ports
From: John CrispinDate: Mon, 7 Aug 2017 16:20:49 +0200 > Manually adjust the port settings of user ports once PHY polling has > completed. This patch extends the adjust_link callback to configure the > per port PMCR register, applying the proper values polled from the PHY. > Without this patch flow control was not always getting setup properly. > > Signed-off-by: Shashidhar Lakkavalli > Signed-off-by: Muciri Gatimu > Signed-off-by: John Crispin Applied, thank you.
Re: [PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
From: Saeed MahameedDate: Tue, 8 Aug 2017 19:16:52 +0300 > On Thu, Aug 3, 2017 at 11:54 PM, Davide Caratti wrote: >> if the NIC fails to validate the checksum on TCP/UDP, and validation of IP >> checksum is successful, the driver subtracts the pseudo-header checksum >> from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't >> do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails. >> >> V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set >> >> Reported-by: Shuang Li >> Fixes: f8c6455bb04b ("net/mlx4_en: Extend checksum offloading by CHECKSUM >> COMPLETE") >> Signed-off-by: Davide Caratti > > Acked-by: Saeed Mahameed Applied and queued up for -stable.
Re: [PATCH v5 net-next 00/12] bpf: rewrite value tracking in verifier
From: Daniel BorkmannDate: Tue, 08 Aug 2017 02:46:16 +0200 > On 08/07/2017 04:21 PM, Edward Cree wrote: >> This series simplifies alignment tracking, generalises bounds tracking >> and >> fixes some bounds-tracking bugs in the BPF verifier. Pointer >> arithmetic on >> packet pointers, stack pointers, map value pointers and context >> pointers has >> been unified, and bounds on these pointers are only checked when the >> pointer >> is dereferenced. >> Operations on pointers which destroy all relation to the original >> pointer >> (such as multiplies and shifts) are disallowed if >> !env->allow_ptr_leaks, >> otherwise they convert the pointer to an unknown scalar and feed it to >> the >> normal scalar arithmetic handling. >> Pointer types have been unified with the corresponding >> adjusted-pointer types >> where those existed (e.g. PTR_TO_MAP_VALUE[_ADJ] or FRAME_PTR vs >> PTR_TO_STACK); similarly, CONST_IMM and UNKNOWN_VALUE have been >> unified into >> SCALAR_VALUE. >> Pointer types (except CONST_PTR_TO_MAP, PTR_TO_MAP_VALUE_OR_NULL and >> PTR_TO_PACKET_END, which do not allow arithmetic) have a 'fixed >> offset' and >> a 'variable offset'; the former is used when e.g. adding an immediate >> or a >> known-constant register, as long as it does not overflow. Otherwise >> the >> latter is used, and any operation creating a new variable offset >> creates a >> new 'id' (and, for PTR_TO_PACKET, clears the 'range'). >> SCALAR_VALUEs use the 'variable offset' fields to track the range of >> possible >> values; the 'fixed offset' should never be set on a scalar. > > Been testing and reviewing the series over the last several days, > looks > reasonable to me as far as I can tell. Thanks for all the hard work on > unifying this, Edward! > > Acked-by: Daniel Borkmann Series applied, thanks everyone!
Re: [PATCH 0/6] In-kernel QMI handling
On Tue, 2017-08-08 at 15:42 -0700, Bjorn Andersson wrote: > On Tue 08 Aug 04:02 PDT 2017, Bj?rn Mork wrote: > > > Bjorn Anderssonwrites: > > > > > This series starts by moving the common definitions of the QMUX > > > protocol to the > > > uapi header, as they are shared with clients - both in kernel and > > > userspace. > > > > > > This series then introduces in-kernel helper functions for aiding > > > the handling > > > of QMI encoded messages in the kernel. QMI encoding is a wire- > > > format used in > > > exchanging messages between the majority of QRTR clients and > > > services. > > > > Interesting! I tried to add some QMI handling in the kernel a few > > years > > ago, but was thankfully voted down. See > > https://www.spinics.net/lists/netdev/msg183101.html and the > > following > > discussion. I am convinced that was the right decision, for the > > client > > side at least. The protocol is just too extensive and ever-growing > > to be > > implemented in the kernel. We would be catching up forever. > > > > Note that I had very limited knowledge of the protocol at the time > > I > > wrote that driver. Still have, in fact :-) > > > > Thanks for the pointer, I definitely think there's more work to be > done > here to figure out the proper way to interact with these devices. > > But I think that Dan's reply shows a huge source of confusion here; > the > acronym "QMI" covers a large amount of different things - and means > different things for different people. I would agree, sorry for any confusion caused. Great discussion so far. > In the modem world QMI seems to mean a defined set of logical > endpoints > that accepts TLV-encoded messages to do modem-related things. But the > TLV-encoding is used for non-modem related services and the only > common > denominator of everything called QMI is the TLV-encoding. > > > Due to my limited exposure to the USB attached "QMI thingies" I > haven't > previously looked into the exact differences. The proposed patches > aimed > to support implementing a few non-modem-related clients using > QMI-encoded messages over ipcrouter. > > Looking at your patch above, and oPhono, seems to highlight a few > important differences that will take some thinking to overcome. > > = Transport > The transport header in the USB case is your struct qmux, which > contains > the type of message (in "flags") and the transaction id. The > "service" > in the QMUX header matches the service id being communicated with. > But > in order to communicate with a service it seems like one requests a > client-id from the control service. Correct. You cannot talk to a service on the modem without getting an allocated client ID from the CTL service, which has a well-defined client ID. > In the smartphone world (with shared memory communication) the > transport > is ipcrouter - with a header very similar to UDP - and there's no > information about the payload, it provides only the means of > delivering Can you explain a bit about the relationship of SMD to [I/R]PC, qrtr, and QMI? A couple years ago there was smd_qmi.c (like for the Nexus 4 with APQ8064 and a discrete MDM9215) which from a 10 minute fresh look appears to just push QMUX+QMI via SMD rather than being backed by the RPC/IPC stuff. I could be wrong, there's a lot of indirection there and it may well end up going over the router. But that's buried deeper than a 10m look for me. Is it perhaps only with on-chip blocks where the QMUX/QMI/qrtr/irpc stuff you describe here is used? If so, perhaps that's the distinction to be made. I'll let you correct me here since you clearly know more than I about the internals of these devices. > messages from one address/port to another address/port. A typical > smartphone has 3-4 nodes (modem, sensors, audio etc) and ports are > dynamically allocated. The control messages in the QMUX protocol (not > the same QMUX protocol as in the USB case!) are used for clients to > find > the mapping from service id to a port on the given address. The > source > port is dynamically allocated in this case. > > = QMI-encoded messages > The list of TLV-entries have a "QMI header" prepended in both cases, > but > in the QMUX case the header consists only of "msgid" and length. > > In the ipcrouter case the transport doesn't carry any information > regarding the payload, so the header prepended the TLV entries > includes > "type", transaction id, "msg_id" and length. I'll assume that in this case, because the client has already found out how to contact the target service directly, that it has no use for a "fat" QMUX header that includes the client ID and service stuff. I don't really have an issue with the kernel doing "thin" QMUX-related stuff. That's pretty simple. > It looks as if once past the differences in the transport and QMI > message header the messages (TLV-encoded data) are the same. But I'm > not > yet sure about how we can hide the transport
[PATCH net] geneve: maximum value of VNI cannot be used
Geneve's Virtual Network Identifier (VNI) is 24 bit long, so the range of values for it would be from 0 to 16777215 (2^24 -1). However, one cannot create a geneve device with VNI set to 16777215. This patch fixes this issue. Signed-off-by: Girish Moodalbail--- drivers/net/geneve.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 745d57ae..8b8565d 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -1099,7 +1099,7 @@ static int geneve_validate(struct nlattr *tb[], struct nlattr *data[], if (data[IFLA_GENEVE_ID]) { __u32 vni = nla_get_u32(data[IFLA_GENEVE_ID]); - if (vni >= GENEVE_VID_MASK) + if (vni >= GENEVE_N_VID) return -ERANGE; } -- 1.8.3.1
Re: [PATCH] netfilter: nf_nat_h323: fix logical-not-parentheses warning
bumping for review On Mon, Jul 31, 2017 at 11:39 AM, Nick Desaulnierswrote: > Clang produces the following warning: > > net/ipv4/netfilter/nf_nat_h323.c:553:6: error: > logical not is only applied to the left hand side of this comparison > [-Werror,-Wlogical-not-parentheses] > if (!set_h225_addr(skb, protoff, data, dataoff, taddr, > ^ > add parentheses after the '!' to evaluate the comparison first > add parentheses around left hand side expression to silence this warning > > There's not necessarily a bug here, but it's cleaner to use the form: > > if (x != 0) > > rather than: > > if (!x == 0) > > Signed-off-by: Nick Desaulniers > --- > Also, it's even cleaner to use the form: > > if (x) > > but then if the return codes change from treating 0 as success (unlikely), > then all call sites must be updated. > > I'm happy to send v2 that changes to that form, and updates the other call > sites to be: > > if (set_h225_addr()) > handle_failures() > else > handle_success() > > net/ipv4/netfilter/nf_nat_h323.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/net/ipv4/netfilter/nf_nat_h323.c > b/net/ipv4/netfilter/nf_nat_h323.c > index 574f7ebba0b6..d8fb251fa6e3 100644 > --- a/net/ipv4/netfilter/nf_nat_h323.c > +++ b/net/ipv4/netfilter/nf_nat_h323.c > @@ -550,9 +550,9 @@ static int nat_callforwarding(struct sk_buff *skb, struct > nf_conn *ct, > } > > /* Modify signal */ > - if (!set_h225_addr(skb, protoff, data, dataoff, taddr, > - >tuplehash[!dir].tuple.dst.u3, > - htons(nated_port)) == 0) { > + if (set_h225_addr(skb, protoff, data, dataoff, taddr, > + >tuplehash[!dir].tuple.dst.u3, > + htons(nated_port)) != 0) { > nf_ct_unexpect_related(exp); > return -1; > } > -- > 2.14.0.rc0.400.g1c36432dff-goog > -- Thanks, ~Nick Desaulniers
Re: [PATCH v06 35/36] uapi linux/tls.h: don't include in user space
On Sun, Aug 06, 2017 at 06:44:26PM +0200, Mikko Rapeli wrote: > It is not needed and not part of uapi headers, but causes > user space compilation error: > > fatal error: net/tcp.h: No such file or directory > #include > ^ > > Signed-off-by: Mikko Rapeli> Cc: Dave Watson > Cc: Ilya Lesokhin > Cc: Aviad Yehezkel > --- > include/uapi/linux/tls.h | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h > index cc1d21db35d8..d87c698623f2 100644 > --- a/include/uapi/linux/tls.h > +++ b/include/uapi/linux/tls.h > @@ -37,7 +37,9 @@ > #include > #include > #include > +#ifdef __KERNEL__ > #include > +#endif Let's move it to include/net/tls.h instead. -- ldv signature.asc Description: PGP signature
Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
On Sat, Aug 05, 2017 at 03:15:10PM +0800, Ding Tianhong wrote: > From: Casey Leedom> > The patch adds a new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING to indicate that > Relaxed Ordering (RO) attribute should not be used for Transaction Layer > Packets (TLP) targetted towards these affected root complexes. Current list > of affected parts include some Intel Xeon processors root complex which > suffers from > flow control credits that result in performance issues. On these affected > parts RO can still be used for peer-2-peer traffic. AMD A1100 ARM ("SEATTLE") > Root complexes don't obey PCIe 3.0 ordering rules, hence could lead to > data-corruption. This needs to include a link to the Intel spec (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf, sec 3.9.1). It should also include a pointer to the AMD erratum, if available, or at least some reference to how we know it doesn't obey the rules. Ashok, thanks for chiming in. Now that you have, I have a few more questions for you: - Is the above doc the one you mentioned as being now public? - Is this considered a hardware erratum? - If so, is there a pointer to that as well? - If this is not considered an erratum, can you provide any guidance about how an OS should determine when it should use RO? Relying on a list of device IDs in an optimization manual is OK for an erratum, but if it's *not* an erratum, it seems like a hole in the specs because as far as I know there's no generic way for the OS to discover whether to use RO. Bjorn
Re: [PATCH net-next v2] wan: dscc4: add checks for dma mapping errors
Alexey Khoroshilov: [...] > diff --git a/drivers/net/wan/dscc4.c b/drivers/net/wan/dscc4.c > index 799830f..6a9ffac 100644 > --- a/drivers/net/wan/dscc4.c > +++ b/drivers/net/wan/dscc4.c > @@ -518,23 +518,31 @@ static void dscc4_release_ring(struct dscc4_dev_priv > *dpriv) > static inline int try_get_rx_skb(struct dscc4_dev_priv *dpriv, >struct net_device *dev) > { > + struct pci_dev *pdev = dpriv->pci_priv->pdev; > unsigned int dirty = dpriv->rx_dirty%RX_RING_SIZE; > struct RxFD *rx_fd = dpriv->rx_fd + dirty; > const int len = RX_MAX(HDLC_MAX_MRU); For the edification of the masses, you may follow a strict inverted xmas tree style (aka longer lines first as long as it does not bug). [...] > @@ -1147,14 +1155,22 @@ static netdev_tx_t dscc4_start_xmit(struct sk_buff > *skb, > struct dscc4_dev_priv *dpriv = dscc4_priv(dev); > struct dscc4_pci_priv *ppriv = dpriv->pci_priv; > struct TxFD *tx_fd; > + dma_addr_t addr; > int next; > > + addr = pci_map_single(ppriv->pdev, skb->data, skb->len, > + PCI_DMA_TODEVICE); Use a local struct pci_dev *pdev and it fits on a single line. At some point it will probably be converted to plain dma api and use a 'd' dev. [...] > @@ -1887,16 +1903,22 @@ static struct sk_buff *dscc4_init_dummy_skb(struct > dscc4_dev_priv *dpriv) > > skb = dev_alloc_skb(DUMMY_SKB_SIZE); > if (skb) { > + struct pci_dev *pdev = dpriv->pci_priv->pdev; > int last = dpriv->tx_dirty%TX_RING_SIZE; > struct TxFD *tx_fd = dpriv->tx_fd + last; > + dma_addr_t addr; > > skb->len = DUMMY_SKB_SIZE; > skb_copy_to_linear_data(skb, version, > strlen(version) % DUMMY_SKB_SIZE); > tx_fd->state = FrameEnd | TO_STATE_TX(DUMMY_SKB_SIZE); > - tx_fd->data = cpu_to_le32(pci_map_single(dpriv->pci_priv->pdev, > - skb->data, DUMMY_SKB_SIZE, > - PCI_DMA_TODEVICE)); > + addr = pci_map_single(pdev, skb->data, DUMMY_SKB_SIZE, > + PCI_DMA_TODEVICE); > + if (pci_dma_mapping_error(pdev, addr)) { > + dev_kfree_skb_any(skb); > + return NULL; > + } > + tx_fd->data = cpu_to_le32(addr); > dpriv->tx_skbuff[last] = skb; > } > return skb; It isn't technically wrong but please don't update tx_fd before the mapping succeeds. It will look marginally better. Thanks. -- Ueimor
Re: [PATCH v06 15/36] uapi linux/socket.h: include sys/socket.h in user space
On Sun, Aug 06, 2017 at 06:44:06PM +0200, Mikko Rapeli wrote: > This libc header has sockaddr definition in user space. > > Fixes user space compilation errors like these from kernel headers including > only linux/socket.h: > > error: field ‘ifru_addr’ has incomplete type > struct sockaddr ifru_addr; > error: field ‘_sockaddr’ has incomplete type > struct sockaddr _sockaddr; > error: invalid application of ‘sizeof’ to incomplete type ‘struct sockaddr’ > > With this following uapi headers now compile in user space: > > rdma/rdma_user_rxe.h > linux/vm_sockets.h > linux/ncp_fs.h > linux/nfc.h > linux/phonet.h > > Signed-off-by: Mikko Rapeli> Cc: netdev@vger.kernel.org > Cc: Dmitry V. Levin > --- > include/uapi/linux/socket.h | 4 > 1 file changed, 4 insertions(+) > > diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h > index 76ab0c68561e..8a81197cc08b 100644 > --- a/include/uapi/linux/socket.h > +++ b/include/uapi/linux/socket.h > @@ -1,6 +1,10 @@ > #ifndef _UAPI_LINUX_SOCKET_H > #define _UAPI_LINUX_SOCKET_H > > +#ifndef __KERNEL__ > +#include > +#endif This is scary because of infamous libc vs uapi interoperability issues. Couldn't we fix affected headers instead? -- ldv signature.asc Description: PGP signature
Re: skb allocation from interrupt handler?
From: Murali KaricheriDate: Tue, 8 Aug 2017 18:17:52 -0400 > Is there an skb_alloc function that can be used from interrupt handler? Looks > like netdev_alloc_skb() > can't be used since I see following trace with kernel hack debug options > enabled. > > [ 652.481713] [] (unwind_backtrace) from [] > (show_stack+0x10/0x14) > [ 652.481725] [] (show_stack) from [] > (dump_stack+0x98/0xc4) > [ 652.481736] [] (dump_stack) from [] > (___might_sleep+0x1b8/0x2a4) > [ 652.481746] [] (___might_sleep) from [] > (rt_spin_lock+0x24/0x5c) > [ 652.481755] [] (rt_spin_lock) from [] > (__netdev_alloc_skb+0xd0/0x254) > [ 652.481774] [] (__netdev_alloc_skb) from [] > (emac_rx_hardirq+0x374/0x554 [prueth]) > [ 652.481793] [] (emac_rx_hardirq [prueth]) from [] > (__handle_irq_event_percpu+0x9c/0x128) > > This is running under RT kernel off 4.9.y Your receive handler should be running from a NAPI poll, which is in software interrupt. You should not be doing packet processing in hardware interrupt context as hardware interrupts should be as short as possible, and with NAPI polling packet input processing can be properly distributed amongst several devices, and if the system is overloaded such processing can be deferred to a kernel thread. NAPI polling has a large number of other advantages as well, more streamlined GRO support, automatic support for busypolling... the list goes on and on and on. I could show you how to do an SKB allocation in a hardware interrupt, but instead I'd rather teach you how to fish properly, and encourage you to convert your driver to NAPI polling instead. Thanks.
Re: [PATCH 0/6] In-kernel QMI handling
On Tue 08 Aug 04:02 PDT 2017, Bj?rn Mork wrote: > Bjorn Anderssonwrites: > > > This series starts by moving the common definitions of the QMUX protocol to > > the > > uapi header, as they are shared with clients - both in kernel and userspace. > > > > This series then introduces in-kernel helper functions for aiding the > > handling > > of QMI encoded messages in the kernel. QMI encoding is a wire-format used in > > exchanging messages between the majority of QRTR clients and services. > > Interesting! I tried to add some QMI handling in the kernel a few years > ago, but was thankfully voted down. See > https://www.spinics.net/lists/netdev/msg183101.html and the following > discussion. I am convinced that was the right decision, for the client > side at least. The protocol is just too extensive and ever-growing to be > implemented in the kernel. We would be catching up forever. > > Note that I had very limited knowledge of the protocol at the time I > wrote that driver. Still have, in fact :-) > Thanks for the pointer, I definitely think there's more work to be done here to figure out the proper way to interact with these devices. But I think that Dan's reply shows a huge source of confusion here; the acronym "QMI" covers a large amount of different things - and means different things for different people. In the modem world QMI seems to mean a defined set of logical endpoints that accepts TLV-encoded messages to do modem-related things. But the TLV-encoding is used for non-modem related services and the only common denominator of everything called QMI is the TLV-encoding. Due to my limited exposure to the USB attached "QMI thingies" I haven't previously looked into the exact differences. The proposed patches aimed to support implementing a few non-modem-related clients using QMI-encoded messages over ipcrouter. Looking at your patch above, and oPhono, seems to highlight a few important differences that will take some thinking to overcome. = Transport The transport header in the USB case is your struct qmux, which contains the type of message (in "flags") and the transaction id. The "service" in the QMUX header matches the service id being communicated with. But in order to communicate with a service it seems like one requests a client-id from the control service. In the smartphone world (with shared memory communication) the transport is ipcrouter - with a header very similar to UDP - and there's no information about the payload, it provides only the means of delivering messages from one address/port to another address/port. A typical smartphone has 3-4 nodes (modem, sensors, audio etc) and ports are dynamically allocated. The control messages in the QMUX protocol (not the same QMUX protocol as in the USB case!) are used for clients to find the mapping from service id to a port on the given address. The source port is dynamically allocated in this case. = QMI-encoded messages The list of TLV-entries have a "QMI header" prepended in both cases, but in the QMUX case the header consists only of "msgid" and length. In the ipcrouter case the transport doesn't carry any information regarding the payload, so the header prepended the TLV entries includes "type", transaction id, "msg_id" and length. It looks as if once past the differences in the transport and QMI message header the messages (TLV-encoded data) are the same. But I'm not yet sure about how we can hide the transport differences. Regards, Bjorn
Re: skb allocation from interrupt handler?
Il giorno mar, 08/08/2017 alle 18.17 -0400, Murali Karicheri ha scritto: > Is there an skb_alloc function that can be used from interrupt > handler? Looks like netdev_alloc_skb() > can't be used since I see following trace with kernel hack debug > options enabled. > > [ 652.481713] [] (unwind_backtrace) from [] > (show_stack+0x10/0x14) > [ 652.481725] [] (show_stack) from [] > (dump_stack+0x98/0xc4) > [ 652.481736] [] (dump_stack) from [] > (___might_sleep+0x1b8/0x2a4) > [ 652.481746] [] (___might_sleep) from [] > (rt_spin_lock+0x24/0x5c) > [ 652.481755] [] (rt_spin_lock) from [] > (__netdev_alloc_skb+0xd0/0x254) > [ 652.481774] [] (__netdev_alloc_skb) from [] > (emac_rx_hardirq+0x374/0x554 [prueth]) > [ 652.481793] [] (emac_rx_hardirq [prueth]) from > [] (__handle_irq_event_percpu+0x9c/0x128) > > This is running under RT kernel off 4.9.y > netdev_alloc_skb() passes GFP_ATOMIC to alloc_skb() so it should work in an interrupt handler too. -- Matteo Croce per aspera ad upstream
skb allocation from interrupt handler?
Is there an skb_alloc function that can be used from interrupt handler? Looks like netdev_alloc_skb() can't be used since I see following trace with kernel hack debug options enabled. [ 652.481713] [] (unwind_backtrace) from [] (show_stack+0x10/0x14) [ 652.481725] [] (show_stack) from [] (dump_stack+0x98/0xc4) [ 652.481736] [] (dump_stack) from [] (___might_sleep+0x1b8/0x2a4) [ 652.481746] [] (___might_sleep) from [] (rt_spin_lock+0x24/0x5c) [ 652.481755] [] (rt_spin_lock) from [] (__netdev_alloc_skb+0xd0/0x254) [ 652.481774] [] (__netdev_alloc_skb) from [] (emac_rx_hardirq+0x374/0x554 [prueth]) [ 652.481793] [] (emac_rx_hardirq [prueth]) from [] (__handle_irq_event_percpu+0x9c/0x128) This is running under RT kernel off 4.9.y -- Murali Karicheri Linux Kernel, Keystone
Re: [PATCH] net: systemport: Fix software statistics for SYSTEMPORT Lite
From: Florian FainelliDate: Tue, 8 Aug 2017 14:44:40 -0700 > On 08/08/2017 02:39 PM, Florian Fainelli wrote: >> With SYSTEMPORT Lite we have holes in our statistics layout that make us >> skip over the hardware MIB counters, bcm_sysport_get_stats() was not >> taking that into account, resulting in reporting 0 for all SW-maintained >> statistics, fix this by skipping accordingly. >> >> Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite") >> Signed-off-by: Florian Fainelli > > David, please ignore this version, I accidentally sent it with the > debugging left. Ok.
Re: [RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk
From: James HoganDate: Tue, 08 Aug 2017 22:20:05 +0100 > cool, i hadn't realised unmentioned elements in an initialiser are > always zeroed, even when non-global/static, so had interpreted the > whole array as uninitialised. learn something new every day :-) > sorry for the noise. You didn't have to know in the first place, you could have simply compiled the code into assembler by running: make kernel/trace/bpf_trace.s and seen for yourself before putting all of this time and effort into this patch and discussion. If you don't know what the compiler does, simply look!
[PATCH net-next] net: ipv6: lower ndisc notifier priority below addrconf
ndisc_notify is used to send unsolicited neighbor advertisements (e.g., on a link up). Currently, the ndisc notifier is run before the addrconf notifer which means NA's are not sent for link-local addresses which are added by the addrconf notifier. Fix by lowering the priority of the ndisc notifier. Setting the priority to ADDRCONF_NOTIFY_PRIORITY - 5 means it runs after addrconf and before the route notifier which is ADDRCONF_NOTIFY_PRIORITY - 10. Signed-off-by: David Ahern--- net/ipv6/ndisc.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index 0327c1f2e6fc..5e338eb89509 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -1779,6 +1779,7 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event, static struct notifier_block ndisc_netdev_notifier = { .notifier_call = ndisc_netdev_event, + .priority = ADDRCONF_NOTIFY_PRIORITY - 5, }; #ifdef CONFIG_SYSCTL -- 2.9.3
[PATCH net v2] net: systemport: Fix software statistics for SYSTEMPORT Lite
With SYSTEMPORT Lite we have holes in our statistics layout that make us skip over the hardware MIB counters, bcm_sysport_get_stats() was not taking that into account resulting in reporting 0 for all SW-maintained statistics, fix this by skipping accordingly. Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite") Signed-off-by: Florian Fainelli--- Changes in v2: - no debugging drivers/net/ethernet/broadcom/bcmsysport.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index 5333601f855f..dc3052751bc1 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -449,6 +449,10 @@ static void bcm_sysport_get_stats(struct net_device *dev, p = (char *)>stats; else p = (char *)priv; + + if (priv->is_lite && !bcm_sysport_lite_stat_valid(s->type)) + continue; + p += s->stat_offset; data[j] = *(unsigned long *)p; j++; -- 2.9.3
Re: [PATCH] net: systemport: Fix software statistics for SYSTEMPORT Lite
On 08/08/2017 02:39 PM, Florian Fainelli wrote: > With SYSTEMPORT Lite we have holes in our statistics layout that make us > skip over the hardware MIB counters, bcm_sysport_get_stats() was not > taking that into account, resulting in reporting 0 for all SW-maintained > statistics, fix this by skipping accordingly. > > Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite") > Signed-off-by: Florian FainelliDavid, please ignore this version, I accidentally sent it with the debugging left. -- Florian
[PATCH] net: systemport: Fix software statistics for SYSTEMPORT Lite
With SYSTEMPORT Lite we have holes in our statistics layout that make us skip over the hardware MIB counters, bcm_sysport_get_stats() was not taking that into account, resulting in reporting 0 for all SW-maintained statistics, fix this by skipping accordingly. Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite") Signed-off-by: Florian Fainelli--- drivers/net/ethernet/broadcom/bcmsysport.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index 5333601f855f..abf175372719 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -449,6 +449,10 @@ static void bcm_sysport_get_stats(struct net_device *dev, p = (char *)>stats; else p = (char *)priv; + + if (priv->is_lite && !bcm_sysport_lite_stat_valid(s->type)) + continue; + p += s->stat_offset; data[j] = *(unsigned long *)p; j++; @@ -608,8 +612,8 @@ static struct sk_buff *bcm_sysport_rx_refill(struct bcm_sysport_priv *priv, /* Allocate a new SKB for a new packet */ skb = netdev_alloc_skb(priv->netdev, RX_BUF_LENGTH); + priv->mib.alloc_rx_buff_failed++; if (!skb) { - priv->mib.alloc_rx_buff_failed++; netif_err(priv, rx_err, ndev, "SKB alloc failed\n"); return NULL; } -- 2.9.3
Re: [RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk
On 8 August 2017 17:48:57 BST, David Millerwrote: >From: Daniel Borkmann >Date: Tue, 08 Aug 2017 10:46:52 +0200 > >> On 08/08/2017 12:25 AM, James Hogan wrote: >>> In bpf_trace_printk(), the elements in mod[] are left uninitialised, >>> but >>> they are then incremented to track the width of the formats. Zero >>> initialise the array just in case the memory contains non-zero >values >>> on >>> entry. >>> >>> Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call >>> bpf_trace_printk()") >>> Signed-off-by: James Hogan >>> Cc: Alexei Starovoitov >>> Cc: Daniel Borkmann >>> Cc: Steven Rostedt >>> Cc: Ingo Molnar >>> Cc: netdev@vger.kernel.org >>> --- >>> When I checked (on MIPS32), the elements tended to have the value >zero >>> anyway (does BPF zero the stack or something clever?), so this is a >>> purely theoretical fix. >>> --- >>> kernel/trace/bpf_trace.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >>> index 32dcbe1b48f2..86a52857d941 100644 >>> --- a/kernel/trace/bpf_trace.c >>> +++ b/kernel/trace/bpf_trace.c >>> @@ -129,7 +129,7 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, >>> fmt_size, u64, arg1, >>>u64, arg2, u64, arg3) >>> { >>> bool str_seen = false; >>> - int mod[3] = {}; >>> + int mod[3] = { 0, 0, 0 }; >> >> I'm probably missing something, but is the behavior of gcc wrt >> above initializers different on mips (it zeroes just fine on x86 >> at least)? If yes, we'd probably need a cocci script to also check >> rest of the kernel given this is used in a number of places. Hm, >> could you elaborate? > >This change is not necessary at all. > >An empty initializer must clear the whole object to zero. > >"theoretical" fix indeed... :-( cool, i hadn't realised unmentioned elements in an initialiser are always zeroed, even when non-global/static, so had interpreted the whole array as uninitialised. learn something new every day :-) sorry for the noise. cheers James
Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure
On Tue, Aug 8, 2017 at 1:23 PM, Edward Creewrote: > On 08/08/17 20:50, Tom Herbert wrote: >> It's a tradeoff. The nice thing about using strings is that we don't >> need maintain a universal enum. > Hmm, that makes it sound as though you're intending for random out-of-tree > modules to add these things; since if they're in-tree it's easy for them > to get enum values assigned when they're added. Do we really want to > encourage sticking random module code into the network stack like this? > > In any case, if you go with the enum approach and later it _does_ prove > necessary to have more flexibility, you can have enum values dynamically > assigned (like genetlink manages to do); and programs using the existing > fixed IDs will continue to work. It's much harder to go the other way... > There is history and precedence. The string mechanism for ulp_ops a direct port of the original ULP infrastructure done for kTLS. That code based the mechanism on TCP congestion ops and that was introduced into the kernel twelve years ago. This method doesn't seem to have been viewed as a problem before now... Tom > -Ed
[PATCH ericsson v2 1/1] tipc: remove premature ESTABLISH FSM event at link synchronization
When a link between two nodes come up, both endpoints will initially send out a STATE message to the peer, to increase the probability that the peer endpoint also is up when the first traffic message arrives. Thereafter, if the establishing link is the second link between two nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message, helping the peer to perform initial synchronization between the two links. However, the initial STATE message may be lost, in which case the SYNCH message will be the first one arriving at the peer. This should also work, as the SYNCH message itself will be used to take up the link endpoint before initializing synchronization. Unfortunately the code for this case is broken. Currently, the link is brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH arrives, whereupon __tipc_node_link_up() is called to distribute the link slots and take the link into traffic. But, __tipc_node_link_up() is itself starting with a test for whether the link is up, and if true, returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call is unnecessary, since tipc_node_link_up() is itself issuing such an event, but also harmful, since it inhibits tipc_node_link_up() to perform the test of its tasks, and the link endpoint in question hence is never taken into traffic. This problem has been exposed when we set up dual links between pre- and post-4.4 kernels, because the former ones don't send out the initial STATE message described above. We fix this by removing the unnecessary event call. Signed-off-by: Jon Maloy--- net/tipc/node.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/net/tipc/node.c b/net/tipc/node.c index aeef801..9b4dcb6 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -1455,10 +1455,8 @@ static bool tipc_node_check_state(struct tipc_node *n, struct sk_buff *skb, /* Initiate synch mode if applicable */ if ((usr == TUNNEL_PROTOCOL) && (mtyp == SYNCH_MSG) && (oseqno == 1)) { syncpt = iseqno + exp_pkts - 1; - if (!tipc_link_is_up(l)) { - tipc_link_fsm_evt(l, LINK_ESTABLISH_EVT); + if (!tipc_link_is_up(l)) __tipc_node_link_up(n, bearer_id, xmitq); - } if (n->state == SELF_UP_PEER_UP) { n->sync_point = syncpt; tipc_link_fsm_evt(l, LINK_SYNCH_BEGIN_EVT); -- 2.1.4
[net 1/1] tipc: remove premature ESTABLISH FSM event at link synchronization
When a link between two nodes come up, both endpoints will initially send out a STATE message to the peer, to increase the probability that the peer endpoint also is up when the first traffic message arrives. Thereafter, if the establishing link is the second link between two nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message, helping the peer to perform initial synchronization between the two links. However, the initial STATE message may be lost, in which case the SYNCH message will be the first one arriving at the peer. This should also work, as the SYNCH message itself will be used to take up the link endpoint before initializing synchronization. Unfortunately the code for this case is broken. Currently, the link is brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH arrives, whereupon __tipc_node_link_up() is called to distribute the link slots and take the link into traffic. But, __tipc_node_link_up() is itself starting with a test for whether the link is up, and if true, returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call is unnecessary, since tipc_node_link_up() is itself issuing such an event, but also harmful, since it inhibits tipc_node_link_up() to perform the test of its tasks, and the link endpoint in question hence is never taken into traffic. This problem has been exposed when we set up dual links between pre- and post-4.4 kernels, because the former ones don't send out the initial STATE message described above. We fix this by removing the unnecessary event call. Signed-off-by: Jon Maloy--- net/tipc/node.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/net/tipc/node.c b/net/tipc/node.c index aeef801..9b4dcb6 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -1455,10 +1455,8 @@ static bool tipc_node_check_state(struct tipc_node *n, struct sk_buff *skb, /* Initiate synch mode if applicable */ if ((usr == TUNNEL_PROTOCOL) && (mtyp == SYNCH_MSG) && (oseqno == 1)) { syncpt = iseqno + exp_pkts - 1; - if (!tipc_link_is_up(l)) { - tipc_link_fsm_evt(l, LINK_ESTABLISH_EVT); + if (!tipc_link_is_up(l)) __tipc_node_link_up(n, bearer_id, xmitq); - } if (n->state == SELF_UP_PEER_UP) { n->sync_point = syncpt; tipc_link_fsm_evt(l, LINK_SYNCH_BEGIN_EVT); -- 2.1.4
Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure
On 08/08/17 20:50, Tom Herbert wrote: > It's a tradeoff. The nice thing about using strings is that we don't > need maintain a universal enum. Hmm, that makes it sound as though you're intending for random out-of-tree modules to add these things; since if they're in-tree it's easy for them to get enum values assigned when they're added. Do we really want to encourage sticking random module code into the network stack like this? In any case, if you go with the enum approach and later it _does_ prove necessary to have more flexibility, you can have enum values dynamically assigned (like genetlink manages to do); and programs using the existing fixed IDs will continue to work. It's much harder to go the other way... -Ed
Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure
On Tue, Aug 8, 2017 at 12:30 PM, John Fastabendwrote: > On 08/08/2017 10:04 AM, Tom Herbert wrote: >> On Tue, Aug 8, 2017 at 8:31 AM, John Fastabend >> wrote: >>> On 08/07/2017 10:28 AM, Tom Herbert wrote: Generalize the ULP infrastructure that was recently introduced to support kTLS. This adds a SO_ULP socket option and creates new fields in sock structure for ULP ops and ULP data. Also, the interface allows additional per ULP parameters to be set so that a ULP can be pushed and operations started in one shot. In this patch set: - Minor dependency fix in inet_common.h - Implement ULP infrastructure as a socket mechanism - Fixes TCP and TLS to use the new method (maintaining backwards API compatibility) - Adds a ulp.txt document Tested: Ran simple ULP. Dave Watson verified kTLS works. -v2: Fix compilation errors when CONFIG_ULP_SOCK not set. -v3: Fix one more build issue, check that sk_protocol is IPPROTO_TCP in tsl_init. Also, fix a couple of minor issues related to introducing locked versions of sendmsg, send page. Thanks to Dave Watson, John Fastabend, and Mat Martineau for testing and providing fixes. >>> >>> >>> Hi Tom, Dave, >>> >>> I'm concerned about the performance impact of walking a list and >>> doing string compares on every socket we create with kTLS. Dave >>> do you have any request/response tests for kTLS that would put pressure >>> on the create/destroy time of this infrastructure? We should do some >>> tests with dummy entries in the ULP list to understand the impact of >>> this list walk. >>> >>> I like the underlying TCP generalized hooks, but do we really expect a >>> lot of these hooks to exist? If we only have two on the roadmap >>> (kTLS and socktap) it seems a bit overkill. Further, if we really expect >>> many ULP objects then the list walk and compare will become more expensive >>> perhaps becoming noticeable in request per second metrics. >>> >>> Why not just create another socktap socketopt? That will be better from >>> complexity and likely performance sides. >>> >> IMO, given that there is at most two even proposed at this point I >> don't there's much point addressing performance. When ULP feature >> catches on and we start see a whole bunch of them then it's >> straightforward to use a hash table or some more efficient mechanism. >> > > OTOH these optimizations are usually easiest to do at the beginning. And > building an enum of ULP types would allow removing string comparisons and > to do simpler unsigned comparisons. I wont complain too much here though > because this series didn't introduce the lists. > Hi John, It's a tradeoff. The nice thing about using strings is that we don't need maintain a universal enum. A related problem is how to combine different ULPs on the same socket. For instance, I might want to do filtering on the application layer messages being sent over TLS (stap+kTLS ULPs). So far I don't see an obvious way to do that. The buffering requirement for crypto seems to convolute this some. Tom
Re: Re:Re: Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan
On Mon, Aug 7, 2017 at 6:10 PM, Gao Fengwrote: > > Sorry, I don't get you clearly. Why the sock_hold() isn't helpful? I already told you, the dereference happends before sock_hold(). sock = rcu_dereference(callid_sock[call_id]); if (sock) { opt = >proto.pptp; if (opt->dst_addr.sin_addr.s_addr != s_addr) <=== HERE sock = NULL; else sock_hold(sk_pppox(sock)); } If we don't wait for readers properly, sock could be freed at the same time when deference it. > The pptp_release invokes synchronize_rcu after del_chan, it could make sure > the others has increased the sock refcnt if necessary > and the lookup is over. > There is no one could get the sock after synchronize_rcu in pptp_release. If this were true, then this code in pptp_sock_destruct() would be unneeded: if (!(sk->sk_state & PPPOX_DEAD)) { del_chan(pppox_sk(sk)); pppox_unbind_sock(sk); } > > > But I think about another problem. > It seems the pptp_sock_destruct should not invoke del_chan and > pppox_unbind_sock. > Because when the sock refcnt is 0, the pptp_release must have be invoked > already. > I don't know. Looks like sock_orphan() is only called in pptp_release(), but I am not sure if there is a case we call sock destructor before release. Also note, this socket is very special, it doesn't support poll(), sendmsg() or recvmsg()..
Re: [PATCH v3 net-next 4/5] tcp: Adjust TCP ULP to defer to sockets ULP
On 08/07/2017 10:28 AM, Tom Herbert wrote: > Fix TCP and TLS to use the newer ULP infrastructure in sockets. > > Signed-off-by: Tom Herbert> --- Looks OK to me. Acked-by: John Fastabend
Re: [PATCH net] net: sched: fix NULL pointer dereference when action calls some targets
On Mon, Aug 7, 2017 at 7:33 PM, Xin Longwrote: > On Tue, Aug 8, 2017 at 9:15 AM, Cong Wang wrote: >> This looks like a completely API burden? > netfilter xt targets are not really compatible with netsched action. > I've got to say, the patch is just a way to make checkentry return > false and avoid panic. like [1] said I don't doubt you fix a crash, I am thinking if we can "fix" the API instead of fixing the caller. I am not familiar with this API, so just my 2 cents...
Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure
On 08/08/2017 10:04 AM, Tom Herbert wrote: > On Tue, Aug 8, 2017 at 8:31 AM, John Fastabend> wrote: >> On 08/07/2017 10:28 AM, Tom Herbert wrote: >>> Generalize the ULP infrastructure that was recently introduced to >>> support kTLS. This adds a SO_ULP socket option and creates new fields in >>> sock structure for ULP ops and ULP data. Also, the interface allows >>> additional per ULP parameters to be set so that a ULP can be pushed >>> and operations started in one shot. >>> >>> In this patch set: >>> - Minor dependency fix in inet_common.h >>> - Implement ULP infrastructure as a socket mechanism >>> - Fixes TCP and TLS to use the new method (maintaining backwards >>> API compatibility) >>> - Adds a ulp.txt document >>> >>> Tested: Ran simple ULP. Dave Watson verified kTLS works. >>> >>> -v2: Fix compilation errors when CONFIG_ULP_SOCK not set. >>> -v3: Fix one more build issue, check that sk_protocol is IPPROTO_TCP >>> in tsl_init. Also, fix a couple of minor issues related to >>> introducing locked versions of sendmsg, send page. Thanks to >>> Dave Watson, John Fastabend, and Mat Martineau for testing and >>> providing fixes. >>> >> >> >> Hi Tom, Dave, >> >> I'm concerned about the performance impact of walking a list and >> doing string compares on every socket we create with kTLS. Dave >> do you have any request/response tests for kTLS that would put pressure >> on the create/destroy time of this infrastructure? We should do some >> tests with dummy entries in the ULP list to understand the impact of >> this list walk. >> >> I like the underlying TCP generalized hooks, but do we really expect a >> lot of these hooks to exist? If we only have two on the roadmap >> (kTLS and socktap) it seems a bit overkill. Further, if we really expect >> many ULP objects then the list walk and compare will become more expensive >> perhaps becoming noticeable in request per second metrics. >> >> Why not just create another socktap socketopt? That will be better from >> complexity and likely performance sides. >> > IMO, given that there is at most two even proposed at this point I > don't there's much point addressing performance. When ULP feature > catches on and we start see a whole bunch of them then it's > straightforward to use a hash table or some more efficient mechanism. > OTOH these optimizations are usually easiest to do at the beginning. And building an enum of ULP types would allow removing string comparisons and to do simpler unsigned comparisons. I wont complain too much here though because this series didn't introduce the lists. > Tom > >> Thanks, >> .John >>
[PATCH net-next v2] wan: dscc4: add checks for dma mapping errors
The driver does not check if mapping dma memory succeed. The patch adds the checks and failure handling. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov--- v2: Fix issues noted by David Miller and Francois Romieu. drivers/net/wan/dscc4.c | 52 +++-- 1 file changed, 37 insertions(+), 15 deletions(-) diff --git a/drivers/net/wan/dscc4.c b/drivers/net/wan/dscc4.c index 799830f..6a9ffac 100644 --- a/drivers/net/wan/dscc4.c +++ b/drivers/net/wan/dscc4.c @@ -518,23 +518,31 @@ static void dscc4_release_ring(struct dscc4_dev_priv *dpriv) static inline int try_get_rx_skb(struct dscc4_dev_priv *dpriv, struct net_device *dev) { + struct pci_dev *pdev = dpriv->pci_priv->pdev; unsigned int dirty = dpriv->rx_dirty%RX_RING_SIZE; struct RxFD *rx_fd = dpriv->rx_fd + dirty; const int len = RX_MAX(HDLC_MAX_MRU); struct sk_buff *skb; - int ret = 0; + dma_addr_t addr; skb = dev_alloc_skb(len); + if (!skb) + goto err_out; + + skb->protocol = hdlc_type_trans(skb, dev); + addr = pci_map_single(pdev, skb->data, len, PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(pdev, addr)) + goto err_free_skb; + dpriv->rx_skbuff[dirty] = skb; - if (skb) { - skb->protocol = hdlc_type_trans(skb, dev); - rx_fd->data = cpu_to_le32(pci_map_single(dpriv->pci_priv->pdev, - skb->data, len, PCI_DMA_FROMDEVICE)); - } else { - rx_fd->data = 0; - ret = -1; - } - return ret; + rx_fd->data = cpu_to_le32(addr); + return 0; + +err_free_skb: + dev_kfree_skb_any(skb); +err_out: + rx_fd->data = 0; + return -1; } /* @@ -1147,14 +1155,22 @@ static netdev_tx_t dscc4_start_xmit(struct sk_buff *skb, struct dscc4_dev_priv *dpriv = dscc4_priv(dev); struct dscc4_pci_priv *ppriv = dpriv->pci_priv; struct TxFD *tx_fd; + dma_addr_t addr; int next; + addr = pci_map_single(ppriv->pdev, skb->data, skb->len, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(ppriv->pdev, addr)) { + dev_kfree_skb_any(skb); + dev->stats.tx_dropped++; + return NETDEV_TX_OK; + } + next = dpriv->tx_current%TX_RING_SIZE; dpriv->tx_skbuff[next] = skb; tx_fd = dpriv->tx_fd + next; tx_fd->state = FrameEnd | TO_STATE_TX(skb->len); - tx_fd->data = cpu_to_le32(pci_map_single(ppriv->pdev, skb->data, skb->len, -PCI_DMA_TODEVICE)); + tx_fd->data = cpu_to_le32(addr); tx_fd->complete = 0x; tx_fd->jiffies = jiffies; mb(); @@ -1887,16 +1903,22 @@ static struct sk_buff *dscc4_init_dummy_skb(struct dscc4_dev_priv *dpriv) skb = dev_alloc_skb(DUMMY_SKB_SIZE); if (skb) { + struct pci_dev *pdev = dpriv->pci_priv->pdev; int last = dpriv->tx_dirty%TX_RING_SIZE; struct TxFD *tx_fd = dpriv->tx_fd + last; + dma_addr_t addr; skb->len = DUMMY_SKB_SIZE; skb_copy_to_linear_data(skb, version, strlen(version) % DUMMY_SKB_SIZE); tx_fd->state = FrameEnd | TO_STATE_TX(DUMMY_SKB_SIZE); - tx_fd->data = cpu_to_le32(pci_map_single(dpriv->pci_priv->pdev, -skb->data, DUMMY_SKB_SIZE, -PCI_DMA_TODEVICE)); + addr = pci_map_single(pdev, skb->data, DUMMY_SKB_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(pdev, addr)) { + dev_kfree_skb_any(skb); + return NULL; + } + tx_fd->data = cpu_to_le32(addr); dpriv->tx_skbuff[last] = skb; } return skb; -- 2.7.4
Re: [PATCH net-next v2] net: ipv6: avoid overhead when no custom FIB rules are installed
On 8/8/17 12:23 PM, Vincent Bernat wrote: > If the user hasn't installed any custom rules, don't go through the > whole FIB rules layer. This is pretty similar to f4530fa574df (ipv4: > Avoid overhead when no custom FIB rules are installed). > > Using a micro-benchmark module [1], timing ip6_route_output() with > get_cycles(), with 40,000 routes in the main routing table, before this > patch: ... > At the frequency of the host during the bench (~ 3.7 GHz), this is > about a 100 ns difference on the median value. > > A next step would be to collapse local and main tables, as in > 0ddcf43d5d4a (ipv4: FIB Local/MAIN table collapse). > > [1]: > https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c > > Signed-off-by: Vincent Bernat> Reviewed-by: Jiri Pirko > --- > include/net/netns/ipv6.h | 1 + > net/ipv6/fib6_rules.c| 40 +++- > net/ipv6/route.c | 1 + > 3 files changed, 29 insertions(+), 13 deletions(-) > LGTM. Acked-by: David Ahern
Re: sysctl, argument parsing, possible bug
On Tue, 8 Aug 2017 20:26:36 +0200 Massimo Salawrote: > I make another test with kernel 4.9.32-15.41 > > sysctl procps version 3.2.8 > > sysctl net.ipv4.conf.eth0.100.forwarding > error: "net.ipv4.conf.eth0.100.forwarding" is an unknown key > > > so I install busybox : > > BusyBox v1.19.3 > > busybox sysctl net.ipv4.conf.eth0.100.forwarding > net.ipv4.conf.eth0.100.forwarding = 0 > > It is working, as I expect reading busybox source sysctl.c > > > > Stephen, I test > sysctl net/ipv4/conf/eth0.100/forwarding > I confirm it works. > > > What is the problem ? > > As sysctl, also automation tools and scripts cannot be "netdev names > aware", and so they fail using the usual dot notation. > > > I don't pretend to change sysctl to read from the /proc/sys/ > directory, as busybox does. > > I suggest to add a remark to the man page of sysctl, reporting the > difference between the two tools and an example of the alternate > syntax : > sysctl net/ipv4/conf/eth0.100/forwarding > > > Thank you for your attention. > Best regards, Massimo Busybox has always been a restricted subset of the upstream standard tools. If you have problems with busybox take it up with those developers directly; this is not the right mailing list for that.
Re: [PATCH 00/35] constify net usb_device_id
Arvind Yadavwrites: > usb_device_id are not supposed to change at runtime. All functions > working with usb_device_id provided by work with > const usb_device_id. So mark the non-const structs as const. [...] > [PATCH 16/35] wireless: ath: ar5523: constify usb_device_id > [PATCH 17/35] wireless: ath: ath6kl: constify usb_device_id > [PATCH 18/35] wireless: ath: ath9k: constify usb_device_id > [PATCH 19/35] wireless: ath: carl9170: constify usb_device_id > [PATCH 20/35] wireless: atmel: at76c50x: constify usb_device_id > [PATCH 21/35] wireless: broadcom: brcm80211: constify usb_device_id > [PATCH 22/35] wireless: intersil: orinoco: constify usb_device_id > [PATCH 23/35] wireless: intersil: p54: constify usb_device_id > [PATCH 24/35] wireless: marvell: libertas: constify usb_device_id > [PATCH 25/35] wireless: marvell: libertas_tf: constify usb_device_id > [PATCH 26/35] wireless: marvell: mwifiex: constify usb_device_id > [PATCH 27/35] wireless: mediatek: mt7601u: constify usb_device_id > [PATCH 28/35] wireless: ralink: rt2500usb: constify usb_device_id > [PATCH 29/35] wireless: ralink: rt2800usb: constify usb_device_id > [PATCH 30/35] wireless: ralink: rt73usb: constify usb_device_id > [PATCH 31/35] wireless: realtek: rtl8187: constify usb_device_id > [PATCH 32/35] wireless: realtek: rtl8xxxu: constify usb_device_id > [PATCH 33/35] wireless: realtek: rtl8192cu: constify usb_device_id > [PATCH 34/35] wireless: zydas: zd1201: constify usb_device_id > [PATCH 35/35] wireless: zydas: zd1211rw: constify usb_device_id No need to put the whole path to the title, it's enough to have the name of the driver there. For example: [PATCH 16/35] ar5523: constify usb_device_id [PATCH 17/35] ath6kl: constify usb_device_id [PATCH 18/35] ath9k: constify usb_device_id Please resubmit the wireless patches separately in their own patchset. -- Kalle Valo
Re: sysctl, argument parsing, possible bug
I make another test with kernel 4.9.32-15.41 sysctl procps version 3.2.8 sysctl net.ipv4.conf.eth0.100.forwarding error: "net.ipv4.conf.eth0.100.forwarding" is an unknown key so I install busybox : BusyBox v1.19.3 busybox sysctl net.ipv4.conf.eth0.100.forwarding net.ipv4.conf.eth0.100.forwarding = 0 It is working, as I expect reading busybox source sysctl.c Stephen, I test sysctl net/ipv4/conf/eth0.100/forwarding I confirm it works. What is the problem ? As sysctl, also automation tools and scripts cannot be "netdev names aware", and so they fail using the usual dot notation. I don't pretend to change sysctl to read from the /proc/sys/ directory, as busybox does. I suggest to add a remark to the man page of sysctl, reporting the difference between the two tools and an example of the alternate syntax : sysctl net/ipv4/conf/eth0.100/forwarding Thank you for your attention. Best regards, Massimo
[PATCH net-next v2] net: ipv6: avoid overhead when no custom FIB rules are installed
If the user hasn't installed any custom rules, don't go through the whole FIB rules layer. This is pretty similar to f4530fa574df (ipv4: Avoid overhead when no custom FIB rules are installed). Using a micro-benchmark module [1], timing ip6_route_output() with get_cycles(), with 40,000 routes in the main routing table, before this patch: min=606 max=12911 count=627 average=1959 95th=4903 90th=3747 50th=1602 mad=821 table=254 avgdepth=21.8 maxdepth=39 value │ ┊count 600 │▒▒▒ 199 880 │▒▒▒ 43 1160 │▒▒▒ 48 1440 │▒▒▒░░░ 43 1720 │░░░ 59 2000 │▒▒▒ 50 2280 │▒▒░░░26 2560 │▒▒░ 31 2840 │▒▒ 28 3120 │▒░░ 17 3400 │▒░░░ 17 3680 │░ 8 3960 │░░ 11 4240 │░░6 4520 │░░░ 6 4800 │░░░ 9 After: min=544 max=11687 count=627 average=1776 95th=4546 90th=3585 50th=1227 mad=565 table=254 avgdepth=21.8 maxdepth=39 value │ ┊count 540 │201 800 │▒63 1060 │▒░ 68 1320 │▒▒▒░░39 1580 │▒▒░░ 32 1840 │▒▒ 32 2100 │▒▒░░░34 2360 │▒▒░░ 33 2620 │▒▒ 26 2880 │▒░░ 22 3140 │ 9 3400 │░ 8 3660 │░ 9 3920 │░░8 4180 │░░░ 8 4440 │░░░ 8 At the frequency of the host during the bench (~ 3.7 GHz), this is about a 100 ns difference on the median value. A next step would be to collapse local and main tables, as in 0ddcf43d5d4a (ipv4: FIB Local/MAIN table collapse). [1]: https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c Signed-off-by: Vincent BernatReviewed-by: Jiri Pirko --- include/net/netns/ipv6.h | 1 + net/ipv6/fib6_rules.c| 40 +++- net/ipv6/route.c | 1 + 3 files changed, 29 insertions(+), 13 deletions(-) diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h index abdf3b40303b..0e50bf3ed097 100644 --- a/include/net/netns/ipv6.h +++ b/include/net/netns/ipv6.h @@ -65,6 +65,7 @@ struct netns_ipv6 { unsigned int ip6_rt_gc_expire; unsigned longip6_rt_last_gc; #ifdef CONFIG_IPV6_MULTIPLE_TABLES + bool fib6_has_custom_rules; struct rt6_info *ip6_prohibit_entry; struct rt6_info *ip6_blk_hole_entry; struct fib6_table *fib6_local_tbl; diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c index 2f29e4e33bd3..b240f24a6e52 100644 --- a/net/ipv6/fib6_rules.c +++ b/net/ipv6/fib6_rules.c @@ -63,19 +63,32 @@ unsigned int fib6_rules_seq_read(struct net *net) struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6, int flags, pol_lookup_t lookup) { - struct fib_lookup_arg arg = { - .lookup_ptr = lookup, - .flags = FIB_LOOKUP_NOREF, - }; - - /* update flow if oif or iif point to device enslaved to l3mdev */ - l3mdev_update_flow(net, flowi6_to_flowi(fl6)); - - fib_rules_lookup(net->ipv6.fib6_rules_ops, -flowi6_to_flowi(fl6), flags, ); - - if (arg.result) - return arg.result; + if
[PATCH net] net: avoid skb_warn_bad_offload false positives on UFO
From: Willem de Bruijnskb_warn_bad_offload triggers a warning when an skb enters the GSO stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL checksum offload set. Commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise") observed that SKB_GSO_DODGY producers can trigger the check and that passing those packets through the GSO handlers will fix it up. But, the software UFO handler will set ip_summed to CHECKSUM_NONE. When __skb_gso_segment is called from the receive path, this triggers the warning again. Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On Tx these two are equivalent. On Rx, this better matches the skb state (checksum computed), as CHECKSUM_NONE here means no checksum computed. See also this thread for context: http://patchwork.ozlabs.org/patch/799015/ Fixes: b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise") Signed-off-by: Willem de Bruijn --- net/core/dev.c | 2 +- net/ipv4/udp_offload.c | 2 +- net/ipv6/udp_offload.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 8515f8fe0460..ce15a06d5558 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2739,7 +2739,7 @@ static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path) { if (tx_path) return skb->ip_summed != CHECKSUM_PARTIAL && - skb->ip_summed != CHECKSUM_NONE; + skb->ip_summed != CHECKSUM_UNNECESSARY; return skb->ip_summed == CHECKSUM_NONE; } diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index 781250151d40..0932c85b42af 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -235,7 +235,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb, if (uh->check == 0) uh->check = CSUM_MANGLED_0; - skb->ip_summed = CHECKSUM_NONE; + skb->ip_summed = CHECKSUM_UNNECESSARY; /* If there is no outer header we can fake a checksum offload * due to the fact that we have already done the checksum in diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c index a2267f80febb..e7d378c032cb 100644 --- a/net/ipv6/udp_offload.c +++ b/net/ipv6/udp_offload.c @@ -72,7 +72,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb, if (uh->check == 0) uh->check = CSUM_MANGLED_0; - skb->ip_summed = CHECKSUM_NONE; + skb->ip_summed = CHECKSUM_UNNECESSARY; /* If there is no outer header we can fake a checksum offload * due to the fact that we have already done the checksum in -- 2.14.0.rc1.383.gd1ce394fe2-goog
Re: [PATCH net] tcp: fastopen: tcp_connect() must refresh the route
On Tue, Aug 8, 2017 at 1:41 AM, Eric Dumazetwrote: > From: Eric Dumazet > > With new TCP_FASTOPEN_CONNECT socket option, there is a possibility > to call tcp_connect() while socket sk_dst_cache is either NULL > or invalid. > > +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 > +0 connect(4, ..., ...) = 0 > > << sk->sk_dst_cache becomes obsolete, or even set to NULL >> > > +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > > > We need to refresh the route otherwise bad things can happen, > especially when syzkaller is running on the host :/ > > Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support") > Reported-by: Dmitry Vyukov > Signed-off-by: Eric Dumazet > Cc: Wei Wang > Cc: Yuchung Cheng > --- Acked-by: Yuchung Cheng Thanks for the fix! > net/ipv4/tcp_output.c |4 > 1 file changed, 4 insertions(+) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 276406a83a37..b7661a68d498 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3436,6 +3436,10 @@ int tcp_connect(struct sock *sk) > int err; > > tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB); > + > + if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk)) > + return -EHOSTUNREACH; /* Routing failure or similar. */ > + > tcp_connect_init(sk); > > if (unlikely(tp->repair)) { > >
Re: [PATCH] net: Reduce skb_warn_bad_offload() noise.
>> @@ -2670,6 +2670,7 @@ static inline bool skb_needs_check(struct >> sk_buff *skb, bool tx_path) >> { >> if (tx_path) >> return skb->ip_summed != CHECKSUM_PARTIAL && >> + skb->ip_summed != CHECKSUM_UNNECESSARY && >>skb->ip_summed != CHECKSUM_NONE; > > Good catch. Only, the CHECKSUM_NONE case was added specifically to > work around this UFO issue on the tx path in commit 6e7bc478c9a0 > ("net: skb_needs_check() accepts CHECKSUM_NONE for tx"). If we change > the value generated by UFO, we can remove that statement, so > > + skb->ip_summed != CHECKSUM_UNNECESSARY; > -skb->ip_summed != CHECKSUM_NONE; > > Else the entire check becomes a NOOP. These are the only three valid > states on tx. With very few codepaths generating CHECKSUM_UNNECESSARY > to begin with, it arguably already is practically a NOOP. I need to > look more closely what the statement is intended to protect against, > before we relax it even further. On transmit, packets entering skb_gso_segment are expected to always have ip_summed CHECKSUM_PARTIAL. This check was added to track down unexpected exceptions in commit 67fd1a731ff1 ("net: Add debug info to track down GSO checksum bug"). Only when called for the second time, after skb_mac_gso_segment, do we have to possibly handle the case where the GSO layer computes the checksum and changes ip_summed. Since this only goes into 4.11 to 4.13, making two separate skb_needs_check variants for these two call sites seems overkill. I will send the simple fix to convert CHECKSUM_NONE to CHECKSUM_UNNECESSARY. As a side effect of removing UFO in 4.14-rc1, we can also revert commit 6e7bc478c9a0 ("net: skb_needs_check() accepts CHECKSUM_NONE for tx") in net-next.
Re: [PATCH net-next] ibmvnic: Add netdev_dbg output for debugging
On 08/08/2017 11:27 AM, Stephen Hemminger wrote: > On Mon, 07 Aug 2017 15:02:58 -0400 > Nathan Fontenotwrote: > >> To ease debugging of the ibmvnic driver add a series of netdev_dbg() >> statements to track driver status, especially during initialization, >> removal, and resetting of the driver. >> >> Signed-off-by: Nathan Fontenot > > Maybe use netif_dbg() and the message type stuff. Oh! I like this even more. This just begs to update all of the message reporting in the driver though. Let's scrap this patch. I'll work on a new patchset based on using the netif_*() infrastructure. -Nathan
Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482
On Mon, Aug 7, 2017 at 11:16 AM, Rao Shoaibwrote: > Change from version 0: Rationale behind the change: > > The man page for tcp(7) states > > when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will > override keepalive to determine when to close a connection due to keepalive > failure. > > This is ambigious at best. user expectation is most likely that the connection > will be reset after TCP_USER_TIMEOUT milliseconds of inactivity. ccing the original author Jerry Chu who can tell more. > > The code however waits for the keepalive to kick-in (default 2hrs) and than > after one failure resets the conenction. > > What is the rationale for that ? The same effect can be obtained by simply > changing the value of tcp_keep_alive_probes. > > Since the TCP_USER_TIMEOUT option was added based on RFC 5482 we need to > follow > the RFC. Which states > > 4.2 TCP keep-Alives: >Some TCP implementations, such as those in BSD systems, use a >different abort policy for TCP keep-alives than for user data. Thus, >the TCP keep-alive mechanism might abort a connection that would >otherwise have survived the transient period without connectivity. >Therefore, if a connection that enables keep-alives is also using the >TCP User Timeout Option, then the keep-alive timer MUST be set to a >value larger than that of the adopted USER TIMEOUT. > > This patch enforces the MUST and also dis-associates user timeout from keep > alive. A man page patch will be submitted separately. > > Signed-off-by: Rao Shoaib > --- > net/ipv4/tcp.c | 10 -- > net/ipv4/tcp_timer.c | 9 + > 2 files changed, 9 insertions(+), 10 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 71ce33d..f2af44d 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -2628,7 +2628,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level, > break; > > case TCP_KEEPIDLE: > - if (val < 1 || val > MAX_TCP_KEEPIDLE) > + /* Per RFC5482 keepalive_time must be > user_timeout */ > + if (val < 1 || val > MAX_TCP_KEEPIDLE || > + ((val * HZ) <= icsk->icsk_user_timeout)) > err = -EINVAL; > else { > tp->keepalive_time = val * HZ; > @@ -2724,8 +2726,12 @@ static int do_tcp_setsockopt(struct sock *sk, int > level, > case TCP_USER_TIMEOUT: > /* Cap the max time in ms TCP will retry or probe the window > * before giving up and aborting (ETIMEDOUT) a connection. > +* Per RFC5482 TCP user timeout must be < keepalive_time. > +* If the default value changes later -- all bets are off. > */ > - if (val < 0) > + if (val < 0 || (tp->keepalive_time && > + tp->keepalive_time <= msecs_to_jiffies(val)) > || > + net->ipv4.sysctl_tcp_keepalive_time <= > msecs_to_jiffies(val)) > err = -EINVAL; > else > icsk->icsk_user_timeout = msecs_to_jiffies(val); > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c > index c0f..d39fe60 100644 > --- a/net/ipv4/tcp_timer.c > +++ b/net/ipv4/tcp_timer.c > @@ -664,14 +664,7 @@ static void tcp_keepalive_timer (unsigned long data) > elapsed = keepalive_time_elapsed(tp); > > if (elapsed >= keepalive_time_when(tp)) { > - /* If the TCP_USER_TIMEOUT option is enabled, use that > -* to determine when to timeout instead. > -*/ > - if ((icsk->icsk_user_timeout != 0 && > - elapsed >= icsk->icsk_user_timeout && > - icsk->icsk_probes_out > 0) || > - (icsk->icsk_user_timeout == 0 && > - icsk->icsk_probes_out >= keepalive_probes(tp))) { > + if (icsk->icsk_probes_out >= keepalive_probes(tp)) { > tcp_send_active_reset(sk, GFP_ATOMIC); > tcp_write_err(sk); > goto out; > -- > 2.7.4 >
Re: [PATCH v3 net-next 3/5] sock: ULP infrastructure
On Tue, Aug 8, 2017 at 9:38 AM, Hannes Frederic Sowawrote: > Tom Herbert writes: > >> +#ifdef CONFIG_MODULES >> + if (!ulp && capable(CAP_NET_ADMIN)) { >> + rcu_read_unlock(); >> + request_module("%s", name); >> + rcu_read_lock(); >> + ulp = ulp_find(name); >> + } >> +#endif > > It looks to me that this allows users with only CAP_NET_ADMIN > privileges to load every module? It's a carryover. Probably should remove the check. Tom
Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure
On Tue, Aug 8, 2017 at 8:31 AM, John Fastabendwrote: > On 08/07/2017 10:28 AM, Tom Herbert wrote: >> Generalize the ULP infrastructure that was recently introduced to >> support kTLS. This adds a SO_ULP socket option and creates new fields in >> sock structure for ULP ops and ULP data. Also, the interface allows >> additional per ULP parameters to be set so that a ULP can be pushed >> and operations started in one shot. >> >> In this patch set: >> - Minor dependency fix in inet_common.h >> - Implement ULP infrastructure as a socket mechanism >> - Fixes TCP and TLS to use the new method (maintaining backwards >> API compatibility) >> - Adds a ulp.txt document >> >> Tested: Ran simple ULP. Dave Watson verified kTLS works. >> >> -v2: Fix compilation errors when CONFIG_ULP_SOCK not set. >> -v3: Fix one more build issue, check that sk_protocol is IPPROTO_TCP >> in tsl_init. Also, fix a couple of minor issues related to >> introducing locked versions of sendmsg, send page. Thanks to >> Dave Watson, John Fastabend, and Mat Martineau for testing and >> providing fixes. >> > > > Hi Tom, Dave, > > I'm concerned about the performance impact of walking a list and > doing string compares on every socket we create with kTLS. Dave > do you have any request/response tests for kTLS that would put pressure > on the create/destroy time of this infrastructure? We should do some > tests with dummy entries in the ULP list to understand the impact of > this list walk. > > I like the underlying TCP generalized hooks, but do we really expect a > lot of these hooks to exist? If we only have two on the roadmap > (kTLS and socktap) it seems a bit overkill. Further, if we really expect > many ULP objects then the list walk and compare will become more expensive > perhaps becoming noticeable in request per second metrics. > > Why not just create another socktap socketopt? That will be better from > complexity and likely performance sides. > IMO, given that there is at most two even proposed at this point I don't there's much point addressing performance. When ULP feature catches on and we start see a whole bunch of them then it's straightforward to use a hash table or some more efficient mechanism. Tom > Thanks, > .John >
Re: [PATCH net] tcp: fastopen: tcp_connect() must refresh the route
On Tue, Aug 8, 2017 at 1:41 AM, Eric Dumazetwrote: > From: Eric Dumazet > > With new TCP_FASTOPEN_CONNECT socket option, there is a possibility > to call tcp_connect() while socket sk_dst_cache is either NULL > or invalid. > > +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 > +0 connect(4, ..., ...) = 0 > > << sk->sk_dst_cache becomes obsolete, or even set to NULL >> > > +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > > > We need to refresh the route otherwise bad things can happen, > especially when syzkaller is running on the host :/ > > Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support") > Reported-by: Dmitry Vyukov > Signed-off-by: Eric Dumazet > Cc: Wei Wang > Cc: Yuchung Cheng > --- Thanks a lot for the fix, Eric. Acked-by: Wei Wang
Re: [PATCH V3 net-next 03/21] net-next/hinic: Initialize api cmd resources
From: Aviad KrawczykDate: Tue, 8 Aug 2017 19:23:32 +0300 > Is it a new rule? It's at least 10 years old. > We can fix it, but I can look over the Linux Kernel source and there are > a lot of examples that have the same problem. Stop right there. Just because there is code in the kernel with a problem doesn't mean you can submit new code with that problem.
[PATCH] isdn: hisax: hfc_usb: constify usb_device_id
usb_device_id are not supposed to change at runtime. All functions working with usb_device_id provided by work with const usb_device_id. So mark the non-const structs as const. Signed-off-by: Arvind Yadav--- drivers/isdn/hisax/hfc_usb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/isdn/hisax/hfc_usb.c b/drivers/isdn/hisax/hfc_usb.c index ef47480..e821218 100644 --- a/drivers/isdn/hisax/hfc_usb.c +++ b/drivers/isdn/hisax/hfc_usb.c @@ -65,7 +65,7 @@ typedef struct { } hfcsusb_vdata; /* VID/PID device list */ -static struct usb_device_id hfcusb_idtab[] = { +static const struct usb_device_id hfcusb_idtab[] = { { USB_DEVICE(0x0959, 0x2bd0), .driver_info = (unsigned long) &((hfcsusb_vdata) -- 2.7.4
[PATCH] isdn: hfcsusb: constify usb_device_id
usb_device_id are not supposed to change at runtime. All functions working with usb_device_id provided by work with const usb_device_id. So mark the non-const structs as const. Signed-off-by: Arvind Yadav--- drivers/isdn/hardware/mISDN/hfcsusb.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/isdn/hardware/mISDN/hfcsusb.h b/drivers/isdn/hardware/mISDN/hfcsusb.h index 4157311..5f8f1d9 100644 --- a/drivers/isdn/hardware/mISDN/hfcsusb.h +++ b/drivers/isdn/hardware/mISDN/hfcsusb.h @@ -337,7 +337,7 @@ static const char *HFC_NT_LAYER1_STATES[HFC_MAX_NT_LAYER1_STATE + 1] = { }; /* supported devices */ -static struct usb_device_id hfcsusb_idtab[] = { +static const struct usb_device_id hfcsusb_idtab[] = { { USB_DEVICE(0x0959, 0x2bd0), .driver_info = (unsigned long) &((struct hfcsusb_vdata) -- 2.7.4
Re: [RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk
From: Daniel BorkmannDate: Tue, 08 Aug 2017 10:46:52 +0200 > On 08/08/2017 12:25 AM, James Hogan wrote: >> In bpf_trace_printk(), the elements in mod[] are left uninitialised, >> but >> they are then incremented to track the width of the formats. Zero >> initialise the array just in case the memory contains non-zero values >> on >> entry. >> >> Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call >> bpf_trace_printk()") >> Signed-off-by: James Hogan >> Cc: Alexei Starovoitov >> Cc: Daniel Borkmann >> Cc: Steven Rostedt >> Cc: Ingo Molnar >> Cc: netdev@vger.kernel.org >> --- >> When I checked (on MIPS32), the elements tended to have the value zero >> anyway (does BPF zero the stack or something clever?), so this is a >> purely theoretical fix. >> --- >> kernel/trace/bpf_trace.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index 32dcbe1b48f2..86a52857d941 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -129,7 +129,7 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, >> fmt_size, u64, arg1, >> u64, arg2, u64, arg3) >> { >> bool str_seen = false; >> -int mod[3] = {}; >> +int mod[3] = { 0, 0, 0 }; > > I'm probably missing something, but is the behavior of gcc wrt > above initializers different on mips (it zeroes just fine on x86 > at least)? If yes, we'd probably need a cocci script to also check > rest of the kernel given this is used in a number of places. Hm, > could you elaborate? This change is not necessary at all. An empty initializer must clear the whole object to zero. "theoretical" fix indeed... :-(
Re: [PATCH net] rds: Reintroduce statistics counting
On 8/8/2017 2:13 AM, Håkon Bugge wrote: In commit 7e3f2952eeb1 ("rds: don't let RDS shutdown a connection while senders are present"), refilling the receive queue was removed from rds_ib_recv(), along with the increment of s_ib_rx_refill_from_thread. Commit 73ce4317bf98 ("RDS: make sure we post recv buffers") re-introduces filling the receive queue from rds_ib_recv(), but does not add the statistics counter. rds_ib_recv() was later renamed to rds_ib_recv_path(). This commit reintroduces the statistics counting of s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq. Signed-off-by: Håkon BuggeReviewed-by: Knut Omang Reviewed-by: Wei Lin Guay --- Looks fine. Acked-by: Santosh Shilimkar
Re: [PATCH v3 net-next 3/5] sock: ULP infrastructure
Tom Herbertwrites: > +#ifdef CONFIG_MODULES > + if (!ulp && capable(CAP_NET_ADMIN)) { > + rcu_read_unlock(); > + request_module("%s", name); > + rcu_read_lock(); > + ulp = ulp_find(name); > + } > +#endif It looks to me that this allows users with only CAP_NET_ADMIN privileges to load every module?
Re: [PATCH net-next] ibmvnic: Add netdev_dbg output for debugging
On Mon, 07 Aug 2017 15:02:58 -0400 Nathan Fontenotwrote: > To ease debugging of the ibmvnic driver add a series of netdev_dbg() > statements to track driver status, especially during initialization, > removal, and resetting of the driver. > > Signed-off-by: Nathan Fontenot Maybe use netif_dbg() and the message type stuff.
Re: [PATCH V3 net-next 03/21] net-next/hinic: Initialize api cmd resources
Hi David, Is it a new rule? We can fix it, but I can look over the Linux Kernel source and there are a lot of examples that have the same problem. I can see even code that inserted to 4.10 with the same problem. Thanks for your time and review, Aviad On 8/4/2017 1:35 AM, David Miller wrote: > From: Aviad Krawczyk> Date: Thu, 3 Aug 2017 17:54:09 +0800 > >> +static int alloc_cmd_buf(struct hinic_api_cmd_chain *chain, >> + struct hinic_api_cmd_cell *cell, int cell_idx) >> +{ >> +struct hinic_hwif *hwif = chain->hwif; >> +struct pci_dev *pdev = hwif->pdev; >> +struct hinic_api_cmd_cell_ctxt *cell_ctxt; >> +dma_addr_t cmd_paddr; >> +u8 *cmd_vaddr; >> +int err = 0; > > Order local variables from longest to shortest line. > >> +static int api_cmd_create_cell(struct hinic_api_cmd_chain *chain, >> + int cell_idx, >> + struct hinic_api_cmd_cell *pre_node, >> + struct hinic_api_cmd_cell **node_vaddr) >> +{ >> +struct hinic_hwif *hwif = chain->hwif; >> +struct pci_dev *pdev = hwif->pdev; >> +struct hinic_api_cmd_cell_ctxt *cell_ctxt; >> +struct hinic_api_cmd_cell *node; >> +dma_addr_t node_paddr; >> +int err; > > Likewise. >> +static void api_cmd_destroy_cell(struct hinic_api_cmd_chain *chain, >> + int cell_idx) >> +{ >> +struct hinic_hwif *hwif = chain->hwif; >> +struct pci_dev *pdev = hwif->pdev; >> +struct hinic_api_cmd_cell_ctxt *cell_ctxt; >> +struct hinic_api_cmd_cell *node; >> +dma_addr_t node_paddr; >> +size_t node_size; > > Likewise. > > etc. etc. etc. > > Please audit your entire submission for this problem. > > Thanks. > > . >
Question about via-ircc.ko
Hello. While searching for races in the Linux kernel I've come across "drivers/net/irda/via-ircc.ko" module. Here are questions that I came up with while analyzing results. Lines are given using the info from Linux v4.12. Consider the following case: Thread 1:Thread 2: via_ircc_net_open request_irq via_ircc_interrupt -> via_ircc_dma_receive -> RxTimerHandler (via-ircc.c: line 1488) (via-ircc.c: line 1315) self->... = ... ... = self->... In the via_ircc_dma_receive a lot of fields of 'self' structure are initialized and via_ircc_interrupt with RxTimerHandler use those fields. If no initialization happened interrupt handler and other functions that it calls may work with incorrect data. I'm not sure how bad this case can be and thus here are my questions. Is this situation feasible from your point of view? If it is feasible, is it a benign race or something serious? Thank you for your time. -- Anton Volkov Linux Verification Center, ISPRAS web: http://linuxtesting.org e-mail: avol...@ispras.ru
Re: [PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
On Thu, Aug 3, 2017 at 11:54 PM, Davide Carattiwrote: > if the NIC fails to validate the checksum on TCP/UDP, and validation of IP > checksum is successful, the driver subtracts the pseudo-header checksum > from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't > do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails. > > V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set > > Reported-by: Shuang Li > Fixes: f8c6455bb04b ("net/mlx4_en: Extend checksum offloading by CHECKSUM > COMPLETE") > Signed-off-by: Davide Caratti Acked-by: Saeed Mahameed
Re: [PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
On Tue, Aug 8, 2017 at 7:06 PM, Davide Carattiwrote: > On Tue, 2017-08-08 at 18:07 +0300, Saeed Mahameed wrote: >> > { >> > __u16 length_for_csum = 0; >> > __wsum csum_pseudo_header = 0; >> > + __u8 ipproto = iph->protocol; >> > + >> > + if (unlikely(ipproto == IPPROTO_SCTP)) >> > + return -1; >> > >> >> Hi Davide >> > > hi Saeed, > > thank you for looking at this! > >> If you got to here then it means this is a non UDP/TCP ipv4 packet and >> the HW failed to validate it's checksum but >> you get from the connectx3 HW a 1's complement 16-bit sum of IP >> Payload + IP pseudo-header. >> so if you return -1 here the driver will report checksum none for this >> packet (and you will abandon any checsum offload/help from HW). >> >> I am not an SCTP expert but it seems that you decided here that >> connectX3 hw checksum (described above) can't be used to calculate >> SCTP packet checksum >> is that correct? >> > Yes, that's correct. SCTP uses CRC32c, not 1's complement (and does not use > pseudo-headers): therefore, the checksum computed by the NIC hardware can't > be used. > > The issue I observed is skb->ip_summed set to CHECKSUM_COMPLETE, that made > CRC32c validation fail in my setup (that was a netfilter REJECT rule, matching > SCTP packets). AFAIK, CHECKSUM_COMPLETE is valid only for the Internet > Checksum; > setting CHECKSUM_NONE on rx packets carrying IPPROTO_SCTP fixed my test > scenario. > >> If so, then i am ok with this patch. > > I planned to post this some weeks ago, after agreeing v2 with Tariq > (https://www.spinics.net/lists/netdev/msg441231.html), but took some time to > find a ConnectX-3 (from what I saw the issue is not present on ConnectX-3 Pro, > since it has MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP bit set to 0). > Thanks for the clarification, I will ack the patch. > regards, > -- > davide >
[PATCH 04/35] net: irda: irda-usb: constify usb_device_id
usb_device_id are not supposed to change at runtime. All functions working with usb_device_id provided by work with const usb_device_id. So mark the non-const structs as const. Signed-off-by: Arvind Yadav--- drivers/net/irda/irda-usb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/irda/irda-usb.c b/drivers/net/irda/irda-usb.c index 6f3c805..723e49b 100644 --- a/drivers/net/irda/irda-usb.c +++ b/drivers/net/irda/irda-usb.c @@ -72,7 +72,7 @@ static int qos_mtt_bits = 0; /* These are the currently known IrDA USB dongles. Add new dongles here */ -static struct usb_device_id dongles[] = { +static const struct usb_device_id dongles[] = { /* ACTiSYS Corp., ACT-IR2000U FIR-USB Adapter */ { USB_DEVICE(0x9c4, 0x011), .driver_info = IUC_SPEED_BUG | IUC_NO_WINDOW }, /* Look like ACTiSYS, Report : IBM Corp., IBM UltraPort IrDA */ -- 2.7.4
[PATCH 05/35] net: irda: kingsun: constify usb_device_id
usb_device_id are not supposed to change at runtime. All functions working with usb_device_id provided by work with const usb_device_id. So mark the non-const structs as const. Signed-off-by: Arvind Yadav--- drivers/net/irda/kingsun-sir.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/irda/kingsun-sir.c b/drivers/net/irda/kingsun-sir.c index 24c0f16..4fd4ac2 100644 --- a/drivers/net/irda/kingsun-sir.c +++ b/drivers/net/irda/kingsun-sir.c @@ -85,7 +85,7 @@ #define KING_PRODUCT_ID 0x4200 /* These are the currently known USB ids */ -static struct usb_device_id dongles[] = { +static const struct usb_device_id dongles[] = { /* KingSun Co,Ltd IrDA/USB Bridge */ { USB_DEVICE(KING_VENDOR_ID, KING_PRODUCT_ID) }, { } -- 2.7.4
[PATCH 07/35] net: irda: ksdazzle: constify usb_device_id
usb_device_id are not supposed to change at runtime. All functions working with usb_device_id provided by work with const usb_device_id. So mark the non-const structs as const. Signed-off-by: Arvind Yadav--- drivers/net/irda/ksdazzle-sir.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/irda/ksdazzle-sir.c b/drivers/net/irda/ksdazzle-sir.c index 741452c..d2a0755 100644 --- a/drivers/net/irda/ksdazzle-sir.c +++ b/drivers/net/irda/ksdazzle-sir.c @@ -97,7 +97,7 @@ #define KSDAZZLE_PRODUCT_ID 0x4100 /* These are the currently known USB ids */ -static struct usb_device_id dongles[] = { +static const struct usb_device_id dongles[] = { /* KingSun Co,Ltd IrDA/USB Bridge */ {USB_DEVICE(KSDAZZLE_VENDOR_ID, KSDAZZLE_PRODUCT_ID)}, {} -- 2.7.4