Re: [PATCH v7 03/37] sparc: io: To use the define of ioremap_[nocache|wc|wb] in asm-generic/io.h
2018-02-14 22:43 GMT+08:00 Arnd Bergmann:
> On Tue, Feb 13, 2018 at 10:09 AM, Greentime Hu wrote:
>> A commit for the nds32 architecture bootstrap ("asm-generic/io.h: move
>> ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef CONFIG_MMU")
>> will move ioremap_nocache out of the CONFIG_MMU ifdef. This means that
>> in order to suppress re-definition errors we need to remove the #define
>> in io_32.h.
>>
>> Also, the change adds a prototype for ioremap where size is size_t and
>> offset is phys_addr_t, so fix that as well.
>>
>> Signed-off-by: Greentime Hu
>
> This patch should have been addressed to the sparclinux mailing list so
> the maintainers can see it; otherwise they are unlikely to notice.
>
> Added it to Cc now.
>
> Can you confirm that the patches are ordered correctly in your series, so
> that at no point sparc is in a state that fails to build cleanly?
>
> If not, this may have to get merged into the other patch.

Hi, Arnd:

These two patches will cause a sparc build error in either order.

commit af84603e339a6832052071aca1f2a16b8963cc2e
Author: Greentime Hu
Date: Tue Feb 13 17:09:07 2018 +0800

    sparc: io: To use the define of ioremap_[nocache|wc|wb] in asm-generic/io.h

    A commit for the nds32 architecture bootstrap ("asm-generic/io.h: move
    ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef CONFIG_MMU")
    will move ioremap_nocache out of the CONFIG_MMU ifdef. This means that
    in order to suppress re-definition errors we need to remove the #define
    in io_32.h.

    Also, the change adds a prototype for ioremap where size is size_t and
    offset is phys_addr_t, so fix that as well.

    Signed-off-by: Greentime Hu

commit 1995b30ea6628b261192662741940c22ac978884
Author: Greentime Hu
Date: Tue Feb 13 17:09:06 2018 +0800

    asm-generic/io.h: move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt
    out of ifndef CONFIG_MMU

    It allows some architectures to use this generic macro instead of
    defining theirs.
    Signed-off-by: Vincent Chen
    Signed-off-by: Greentime Hu
    Acked-by: Arnd Bergmann

Should I merge them together like this?

asm-generic/io.h: move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt
out of ifndef CONFIG_MMU

It allows some architectures to use this generic macro instead of
defining theirs.

sparc: io: To use the define of ioremap_[nocache|wc|wb] in asm-generic/io.h

It will move ioremap_nocache out of the CONFIG_MMU ifdef. This means that
in order to suppress re-definition errors we need to remove the #define
in arch/sparc/include/asm/io_32.h.

Also, the change adds a prototype for ioremap where size is size_t and
offset is phys_addr_t, so fix that as well.

Signed-off-by: Greentime Hu
Re: [PATCH v2] sched: report if filter is too large to dump
On 02/21/2018 08:45 AM, Phil Sutter wrote:
> Hi Roman,
>
> On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:
>> So far, if the filter was too large to fit in the allocated skb, the
>> kernel did not return any error and stopped dumping. Modify the dumper
>> so that it returns -EMSGSIZE when a filter fails to dump and it is the
>> first filter in the skb. If we are not first, we will get a next chance
>> with more room.
>>
>> I understand this is pretty near to being an API change, but the
>> original design (silent truncation) can be considered a bug.
>>
>> Note: The error case can happen pretty easily if you create a filter
>> with 32 actions and have 4kb pages. Also recent versions of iproute try
>> to be clever with their buffer allocation size, which in turn leads to
>
> I'm curious, what does it lead to? :)
>
> Thanks, Phil

tc will dump all filters up to the big one, then it will stop silently.
So it will seem as if you have fewer filters.

The new behavior is the same, but tc will at least print the EMSGSIZE
error. It does not handle it in any other way.
Re: [PATCH 2/3] net: Make cleanup_list and net::cleanup_list of llist type
On 20.02.2018 22:42, Cong Wang wrote:
> On Mon, Feb 19, 2018 at 1:58 AM, Kirill Tkhai wrote:
>>  void __put_net(struct net *net)
>>  {
>>  	/* Cleanup the network namespace in process context */
>> -	unsigned long flags;
>> -
>> -	spin_lock_irqsave(&cleanup_list_lock, flags);
>> -	list_add(&net->cleanup_list, &cleanup_list);
>> -	spin_unlock_irqrestore(&cleanup_list_lock, flags);
>> -
>> +	llist_add(&net->cleanup_list, &cleanup_list);
>>  	queue_work(netns_wq, &net_cleanup_work);
>>  }
>
> Is llist safe against IRQ too?

Yes, it's safe, and it's aimed at cases like this. There is no "locked"
state like a spinlock has; there is a single cmpxchg(). You may find
examples of its use in the ./kernel directory.

Kirill
Re: WARNING: ODEBUG bug in __queue_work
On Wed, Feb 21, 2018 at 1:59 AM, syzbot wrote:
> Hello,
>
> syzbot hit the following crash on upstream commit
> 1a2a7d3ee659e477e0768ac3fc7579794f89071b (Fri Feb 16 17:11:30 2018 +)
> Merge tag 'sound-4.16-rc2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
>
> So far this crash happened 2 times on upstream.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.

This is already fixed:

#syz fix: netfilter: IDLETIMER: be syzkaller friendly

> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+8288e492017b059ff...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> x_tables: ip6_tables: TCPMSS target: only valid for protocol 6
> x_tables: ip6_tables: TCPMSS target: only valid for protocol 6
> x_tables: ip6_tables: TCPMSS target: only valid for protocol 6
> [ cut here ]
> ODEBUG: activate not available (active state 0) object type: work_struct
> hint: 0x8
> WARNING: CPU: 0 PID: 4178 at lib/debugobjects.c:291
> debug_print_object+0x166/0x220 lib/debugobjects.c:288
> Kernel panic - not syncing: panic_on_warn set ...
> > CPU: 0 PID: 4178 Comm: syzkaller885166 Not tainted 4.16.0-rc1+ #315 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:53 > panic+0x1e4/0x41c kernel/panic.c:183 > __warn+0x1dc/0x200 kernel/panic.c:547 > report_bug+0x211/0x2d0 lib/bug.c:184 > fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178 > fixup_bug arch/x86/kernel/traps.c:247 [inline] > do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296 > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 > invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957 > RIP: 0010:debug_print_object+0x166/0x220 lib/debugobjects.c:288 > RSP: 0018:8801db407400 EFLAGS: 00010082 > RAX: dc08 RBX: 0005 RCX: 815abdbe > RDX: 0100 RSI: 11003b680e30 RDI: 11003b680e05 > RBP: 8801db407440 R08: R09: 11003b680dd7 > R10: 8801db407300 R11: 0002 R12: 0001 > R13: 86b14d80 R14: 86007d60 R15: 8147ac00 > debug_object_activate+0x390/0x730 lib/debugobjects.c:475 > debug_work_activate kernel/workqueue.c:493 [inline] > __queue_work+0x163/0x1230 kernel/workqueue.c:1382 > queue_work_on+0x16a/0x1c0 kernel/workqueue.c:1488 > queue_work include/linux/workqueue.h:488 [inline] > schedule_work include/linux/workqueue.h:546 [inline] > idletimer_tg_expired+0x44/0x60 net/netfilter/xt_IDLETIMER.c:116 > call_timer_fn+0x228/0x820 kernel/time/timer.c:1326 > expire_timers kernel/time/timer.c:1363 [inline] > __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666 > run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692 > __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 > invoke_softirq kernel/softirq.c:365 [inline] > irq_exit+0x1cc/0x200 kernel/softirq.c:405 > exiting_irq arch/x86/include/asm/apic.h:541 [inline] > smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 > apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:796 > > RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:777 > [inline] > RIP: 
0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 > [inline] > RIP: 0010:_raw_spin_unlock_irqrestore+0x5e/0xba > kernel/locking/spinlock.c:184 > RSP: 0018:8801b1a7f3c8 EFLAGS: 0282 ORIG_RAX: ff12 > RAX: dc00 RBX: 0282 RCX: 0006 > RDX: 10d592d5 RSI: 110035eb5d5b RDI: 0282 > RBP: 8801b1a7f3d8 R08: 11003634fe47 R09: > R10: R11: R12: 86cfcbe0 > R13: 0282 R14: ed003634fe8b R15: 8801b33f25e8 > fill_pool lib/debugobjects.c:119 [inline] > __debug_object_init+0xa7a/0x1040 lib/debugobjects.c:339 > debug_object_init+0x17/0x20 lib/debugobjects.c:391 > __init_work+0x2b/0x60 kernel/workqueue.c:506 > idletimer_tg_create net/netfilter/xt_IDLETIMER.c:152 [inline] > idletimer_tg_checkentry+0x691/0xb00 net/netfilter/xt_IDLETIMER.c:213 > xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:850 > check_target net/ipv6/netfilter/ip6_tables.c:533 [inline] > find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:575 > translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:744 > do_replace net/ipv6/netfilter/ip6_tables.c:1160 [inline] > do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1686 > nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] > nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 >
Re: [PATCH net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins
Hi Andrew,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url: https://github.com/0day-ci/linux/commits/Andrew-Lunn/net-dsa-mv88e6xxx-scratch-registers-and-external-MDIO-pins/20180221-150325
config: i386-randconfig-i0-201807 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers/net/dsa/mv88e6xxx/serdes.o: In function `mv88e6xxx_g2_scratch_gpio_set_ext_smi':
>> drivers/net/dsa/mv88e6xxx/global2.h:478: multiple definition of `mv88e6xxx_g2_scratch_gpio_set_ext_smi'
   drivers/net/dsa/mv88e6xxx/chip.o:drivers/net/dsa/mv88e6xxx/global2.h:478: first defined here

vim +478 drivers/net/dsa/mv88e6xxx/global2.h

   475
   476	int mv88e6xxx_g2_scratch_gpio_set_ext_smi(struct mv88e6xxx_chip *chip,
   477						  bool external)
 > 478	{
   479		return -EOPNOTSUPP;
   480	}
   481

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH v2] sched: report if filter is too large to dump
On 02/21/2018 09:42 AM, Phil Sutter wrote:
> Hi Roman,
>
> On Wed, Feb 21, 2018 at 09:38:52AM +0100, Roman Kapl wrote:
>> On 02/21/2018 08:45 AM, Phil Sutter wrote:
>>> On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:
>>>> Note: The error case can happen pretty easily if you create a filter
>>>> with 32 actions and have 4kb pages. Also recent versions of iproute try
>>>> to be clever with their buffer allocation size, which in turn leads to
>>> I'm curious, what does it lead to? :)
>>>
>>> Thanks, Phil
>> tc will dump all filters up to the big one, then it will stop silently.
>> So it will seem as if you have fewer filters.
>>
>> The new behavior is the same, but tc will at least print the EMSGSIZE
>> error. It does not handle it in any other way.
> I got that, yes. Though your commit message stops mid-sentence and I
> wondered what the dynamic buffer allocation "in turn leads to".
>
> Thanks, Phil

Aaah... It leads to "smaller SKBs in kernel". I've lost that line in v2.
Should I send a v3 or something, David?

Sorry, Roman Kapl
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>> stuck with this ugly thing forever...
>>
>> Could you at least make some common code that is shared in between
>> netvsc and virtio_net so this is handled in exactly the same way in both?
>
>IMHO netvsc is a vendor specific driver which made a mistake on what
>behaviour it provides (or tried to align itself with Windows SR-IOV).
>Let's not make a far, far more commonly deployed and important driver
>(virtio) bug-compatible with netvsc.

Yeah, the netvsc solution is a dangerous precedent here, and in my
opinion it was a huge mistake to merge it. I personally would vote to
unmerge it and make the solution based on team/bond.

>To Jiri's initial comments, I feel the same way; in fact I've talked to
>the NetworkManager guys to get auto-bonding based on MACs handled in
>user space. I think it may very well get done in next versions of NM,
>but isn't done yet. Stephen also raised the point that not everybody is
>using NM.

Can be done in NM, networkd or other network management tools. Even
easier to do this in teamd and let them all benefit.

Actually, I took a stab at implementing this in teamd. Took me like an
hour and a half. You can just run teamd with the config option "kidnap"
like this:

# teamd/teamd -c '{"kidnap": true }'

Whenever teamd sees another netdev appear with the same MAC as its own,
or whenever teamd sees another netdev change its MAC to its own, it
enslaves it.
Here's the patch (quick and dirty):

Subject: [patch teamd] teamd: introduce kidnap feature

Signed-off-by: Jiri Pirko
---
 include/team.h             |  7 +++
 libteam/ifinfo.c           | 20
 teamd/teamd.c              | 17 +
 teamd/teamd.h              |  5 +
 teamd/teamd_events.c       | 17 +
 teamd/teamd_ifinfo_watch.c |  9 +
 teamd/teamd_per_port.c     |  7 ++-
 7 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/team.h b/include/team.h
index 9ae517d..b0c19c8 100644
--- a/include/team.h
+++ b/include/team.h
@@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
 #define team_for_each_ifinfo(ifinfo, th)				\
 	for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;		\
 	     ifinfo = team_get_next_ifinfo(th, ifinfo))
+
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+						  struct team_ifinfo *ifinfo);
+#define team_for_each_unlinked_ifinfo(ifinfo, th)			\
+	for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;	\
+	     ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
+
 /* ifinfo getters */
 bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
 uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
index 5c32a9c..8f9548e 100644
--- a/libteam/ifinfo.c
+++ b/libteam/ifinfo.c
@@ -494,6 +494,26 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
 	return NULL;
 }

+/**
+ * @param th		libteam library context
+ * @param ifinfo	ifinfo structure
+ *
+ * @details Get next unlinked ifinfo in list.
+ *
+ * @return Ifinfo next to ifinfo passed.
+ **/
+TEAM_EXPORT
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+						  struct team_ifinfo *ifinfo)
+{
+	do {
+		ifinfo = list_get_next_node_entry(&th->ifinfo_list, ifinfo, list);
+		if (ifinfo && !ifinfo->linked)
+			return ifinfo;
+	} while (ifinfo);
+	return NULL;
+}
+
 /**
  * @param ifinfo	ifinfo structure
  *
diff --git a/teamd/teamd.c b/teamd/teamd.c
index aac2511..069c7f0 100644
--- a/teamd/teamd.c
+++ b/teamd/teamd.c
@@ -926,8 +926,25 @@ static int teamd_event_watch_port_added(struct teamd_context *ctx,
 	return 0;
 }

+static int teamd_event_watch_unlinked_hwaddr_changed(struct teamd_context *ctx,
+						     struct team_ifinfo *ifinfo,
+						     void *priv)
+{
+	int err;
+	bool kidnap;
+
+	err = teamd_config_bool_get(ctx, &kidnap, "$.kidnap");
+	if (err || !kidnap ||
+	    ctx->hwaddr_len != team_get_ifinfo_hwaddr_len(ifinfo) ||
+	    memcmp(team_get_ifinfo_hwaddr(ifinfo),
+		   ctx->hwaddr, ctx->hwaddr_len))
+		return 0;
+	return teamd_port_add(ctx, team_get_ifinfo_ifindex(ifinfo));
+}
+
 static const struct teamd_event_watch_ops teamd_port_watch_ops = {
 	.port_added = teamd_event_watch_port_added,
+	.unlinked_hwaddr_changed = teamd_event_watch_unlinked_hwaddr_changed,
 };

 static int teamd_port_watch_init(struct teamd_context *ctx)
Re: [PATCH v2] sched: report if filter is too large to dump
Hi Roman,

On Wed, Feb 21, 2018 at 09:38:52AM +0100, Roman Kapl wrote:
> On 02/21/2018 08:45 AM, Phil Sutter wrote:
> > On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:
> >> So far, if the filter was too large to fit in the allocated skb, the
> >> kernel did not return any error and stopped dumping. Modify the dumper
> >> so that it returns -EMSGSIZE when a filter fails to dump and it is the
> >> first filter in the skb. If we are not first, we will get a next chance
> >> with more room.
> >>
> >> I understand this is pretty near to being an API change, but the
> >> original design (silent truncation) can be considered a bug.
> >>
> >> Note: The error case can happen pretty easily if you create a filter
> >> with 32 actions and have 4kb pages. Also recent versions of iproute try
> >> to be clever with their buffer allocation size, which in turn leads to
> > I'm curious, what does it lead to? :)
> >
> > Thanks, Phil
>
> tc will dump all filters up to the big one, then it will stop silently.
> So it will seem as if you have fewer filters.
>
> The new behavior is the same, but tc will at least print the EMSGSIZE
> error. It does not handle it in any other way.

I got that, yes. Though your commit message stops mid-sentence and I
wondered what the dynamic buffer allocation "in turn leads to".

Thanks, Phil
Re: [PATCH 1/3] net: Kill net_mutex
Hi, Stephen,

On 21.02.2018 02:18, Stephen Hemminger wrote:
> On Mon, 19 Feb 2018 12:58:38 +0300
> Kirill Tkhai wrote:
>
>> +	struct list_head	exit_list;	/* To linked to call pernet exit
>> +						 * methods on dead net (net_sem
>> +						 * read locked), or to unregister
>> +						 * pernet ops (net_sem wr locked).
>> +						 */
>
> Sorry, that comment is completely unparseable.
> Either you know what it does, and therefore comment is unnecessary
> Or change comment to a valid explanation of the semantics of the list.
>
> Maybe comments about locking model are best left to where
> it is used in the code.

Let's improve it :) It's used to call pernet exit methods, and net ns
logic guarantees we never call exit methods for the same net in
parallel. How about writing this directly, without mention of net_sem?
Something like this:

/* To link net to call pernet exit methods */

Or maybe you have a better variant?

Thanks,
Kirill
Re: [PATCH net-next 0/5] net/smc: fixes 2018-02-21
On 21.02.2018 12:32, Ursula Braun wrote:
> Dave,
>
> here are some smc-patches for net-next. Besides cleanups, the link id
> field of the LLC confirm link reply is changed to comply with RFC7609.
>
> Thanks, Ursula

Looks good to me!

Ciao, Stefan
[PATCH net] tcp_bbr: better deal with suboptimal GSO
From: Eric Dumazet

BBR uses tcp_tso_autosize() in an attempt to probe what would be the
burst sizes and to adjust cwnd in bbr_target_cwnd() with following
gold formula :

/* Allow enough full-sized skbs in flight to utilize end systems. */
cwnd += 3 * bbr->tso_segs_goal;

But GSO can be lacking or be constrained to very small
units (ip link set dev ... gso_max_segs 2)

What we really want is to have enough packets in flight so that both
GSO and GRO are efficient.

So in the case GSO is off or downgraded, we still want to have the same
number of packets in flight as if GSO/TSO was fully operational, so
that GRO can hopefully be working efficiently.

To fix this issue, we make tcp_tso_autosize() unaware of
sk->sk_gso_max_segs

Only tcp_tso_segs() has to enforce the gso_max_segs limit.

Tested:

ethtool -K eth0 tso off gso off
tc qd replace dev eth0 root pfifo_fast

Before patch:
for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
 691 (ss -temoi shows cwnd is stuck around 6 )
 667
 651
 631
 517

After patch :
# for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
 1733 (ss -temoi shows cwnd is around 386 )
 1778
 1746
 1781
 1718

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Eric Dumazet
Reported-by: Oleksandr Natalenko
---
 net/ipv4/tcp_output.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b2bca373f8bee35267df49b5947a6793fed71a12..6818042cd8a9a1778f54637861647091afd9a769 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1730,7 +1730,7 @@ u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
 	 */
 	segs = max_t(u32, bytes / mss_now, min_tso_segs);
 
-	return min_t(u32, segs, sk->sk_gso_max_segs);
+	return segs;
 }
 EXPORT_SYMBOL(tcp_tso_autosize);
 
@@ -1742,9 +1742,10 @@ static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
 	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
 	u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
 
-	return tso_segs ? :
-		tcp_tso_autosize(sk, mss_now,
-				 sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
+	if (!tso_segs)
+		tso_segs = tcp_tso_autosize(sk, mss_now,
+				sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
+	return min_t(u32, tso_segs, sk->sk_gso_max_segs);
 }
 
 /* Returns the portion of skb which can be sent right away */
Re: [PATCH net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins
On Wed, Feb 21, 2018 at 05:40:29PM +0800, kbuild test robot wrote:
> Hi Andrew,
>
> I love your patch! Yet something to improve:
>
> [auto build test ERROR on net-next/master]
>
> url: https://github.com/0day-ci/linux/commits/Andrew-Lunn/net-dsa-mv88e6xxx-scratch-registers-and-external-MDIO-pins/20180221-150325
> config: i386-randconfig-i0-201807 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    drivers/net/dsa/mv88e6xxx/serdes.o: In function `mv88e6xxx_g2_scratch_gpio_set_ext_smi':
> >> drivers/net/dsa/mv88e6xxx/global2.h:478: multiple definition of `mv88e6xxx_g2_scratch_gpio_set_ext_smi'
>    drivers/net/dsa/mv88e6xxx/chip.o:drivers/net/dsa/mv88e6xxx/global2.h:478: first defined here
>
> vim +478 drivers/net/dsa/mv88e6xxx/global2.h
>
>    475
>    476	int mv88e6xxx_g2_scratch_gpio_set_ext_smi(struct mv88e6xxx_chip *chip,
>    477						  bool external)
>  > 478	{
>    479		return -EOPNOTSUPP;
>    480	}

The stub function needs to be inline. I will fix up in v2.

Andrew
Re: [PATCH] netlink: put module reference if dump start fails
On Wed, Feb 21, 2018 at 6:47 AM, Eric Dumazet wrote:
>> This probably should be queued up for stable.
>
> When was the bug added ? This would help a lot stable teams ...

This needs to be backported to 4.16-rc0+, 4.15+, 4.14+, 4.13.14+, and 4.9.63+.
Re: [PATCH] netlink: put module reference if dump start fails
On Wed, 2018-02-21 at 15:54 +0100, Jason A. Donenfeld wrote:
> On Wed, Feb 21, 2018 at 6:47 AM, Eric Dumazet wrote:
> > > This probably should be queued up for stable.
> >
> > When was the bug added ? This would help a lot stable teams ...
>
> This needs to be backported to 4.16-rc0+, 4.15+, 4.14+, 4.13.14+, and 4.9.63+.

I guess I should have asked you to provide an appropriate Fixes: tag,
clearly identifying the commit that added the bug.

Fixes: 12-digit-sha1 ("patch title")

Thanks.
Qualcomm rmnet driver and qmi_wwan
Hello,

in the rmnet kernel documentation I read: "This driver can be used to
register onto any physical network device in IP mode. Physical
transports include USB, HSIC, PCIe and IP accelerator."

Does this mean that it can be used in association with the qmi_wwan
driver? If yes, can someone give me a hint on the steps to follow? If
not, does anyone know if it is possible to modify qmi_wwan in order to
take advantage of the features provided by the rmnet driver? In this
case hints on the changes for modifying qmi_wwan are welcome.

Thanks in advance,
Daniele
Re: [PATCH v7 02/37] asm-generic/io.h: move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef CONFIG_MMU
On Wed, Feb 21, 2018 at 12:21 PM, Greentime Hu wrote:
> 2018-02-16 18:47 GMT+08:00 kbuild test robot:
>
> From: Greentime Hu
> Date: Wed, 21 Feb 2018 14:21:23 +0800
> Subject: [PATCH] xtensa: add ioremap_nocache declaration before include
>  asm-generic/io.h.
>
> A future commit for the nds32 architecture bootstrap ("asm-generic/io.h:
> move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef
> CONFIG_MMU") will move ioremap_nocache out of the CONFIG_MMU ifdef.
> This means that in order to suppress re-definition errors we need to
> set up #define's before importing asm-generic/io.h.
>
> Signed-off-by: Greentime Hu

Don't you need to override both ioremap and ioremap_nocache?

> ---
>  arch/xtensa/include/asm/io.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/xtensa/include/asm/io.h b/arch/xtensa/include/asm/io.h
> index c38e5a732d86..acc5bb2cf1c7 100644
> --- a/arch/xtensa/include/asm/io.h
> +++ b/arch/xtensa/include/asm/io.h
> @@ -52,6 +52,7 @@ static inline void __iomem *ioremap_cache(unsigned long offset,
>  	return xtensa_ioremap_cache(offset, size);
>  }
>  #define ioremap_cache ioremap_cache
> +#define ioremap_nocache ioremap_nocache
>
>  #define ioremap_wc ioremap_nocache
[PATCH iproute2-next] rdma: Add batch command support
From: Leon Romanovsky

Implement an option (-b) to execute RDMAtool commands from a supplied
file. This follows the same model as in use for the ip and devlink
tools, by expecting every new command to be on a new line. These
commands are expected to be without any -* (e.g. -d, -j, e.t.c) global
flags, which should be called externally.

Signed-off-by: Leon Romanovsky
---
 man/man8/rdma.8 | 16 +
 rdma/rdma.c     | 70 ++---
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
index 798b33d3..fba77693 100644
--- a/man/man8/rdma.8
+++ b/man/man8/rdma.8
@@ -11,6 +11,12 @@ rdma \- RDMA tool
 .BR help " }"
 .sp
 
+.ti -8
+.B rdma
+.RB "[ " -force " ] "
+.BI "-batch " filename
+.sp
+
 .ti -8
 .IR OBJECT " := { "
 .BR dev " | " link " }"
@@ -31,6 +37,16 @@ Print the version of the
 .B rdma
 tool and exit.
 
+.TP
+.BR "\-b", " \-batch "
+Read commands from provided file or standard input and invoke them.
+First failure will cause termination of rdma.
+
+.TP
+.BR "\-force"
+Don't terminate rdma on errors in batch mode.
+If there were any errors during execution of the commands, the application return code will be non zero.
+
 .TP
 .BR "\-d" , " --details"
 Output detailed information.
diff --git a/rdma/rdma.c b/rdma/rdma.c
index 19608f41..ab8f98d2 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -15,8 +15,9 @@ static void help(char *name)
 {
 	pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
+	       "       %s [ -f[orce] ] -b[atch] filename\n"
 	       "where  OBJECT := { dev | link | resource | help }\n"
-	       "       OPTIONS := { -V[ersion] | -d[etails] | -j[son] | -p[retty]}\n", name);
+	       "       OPTIONS := { -V[ersion] | -d[etails] | -j[son] | -p[retty]}\n", name, name);
 }
 
 static int cmd_help(struct rd *rd)
@@ -25,7 +26,7 @@ static int cmd_help(struct rd *rd)
 	return 0;
 }
 
-static int rd_cmd(struct rd *rd)
+static int rd_cmd(struct rd *rd, int argc, char **argv)
 {
 	const struct rd_cmd cmds[] = {
 		{ NULL,		cmd_help },
@@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
 		{ 0 }
 	};
 
+	rd->argc = argc;
+	rd->argv = argv;
+
 	return rd_exec_cmd(rd, cmds, "object");
 }
 
-static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
+static int rd_batch(struct rd *rd, const char *name, bool force)
+{
+	char *line = NULL;
+	size_t len = 0;
+	int ret = 0;
+
+	if (name && strcmp(name, "-") != 0) {
+		if (!freopen(name, "r", stdin)) {
+			pr_err("Cannot open file \"%s\" for reading: %s\n",
+			       name, strerror(errno));
+			return errno;
+		}
+	}
+
+	cmdlineno = 0;
+	while (getcmdline(&line, &len, stdin) != -1) {
+		char *largv[512];
+		int largc;
+
+		largc = makeargs(line, largv, 100);
+		if (!largc)
+			continue;	/* blank line */
+
+		ret = rd_cmd(rd, largc, largv);
+		if (ret) {
+			pr_err("Command failed %s:%d\n", name, cmdlineno);
+			if (!force)
+				break;
+		}
+	}
+
+	free(line);
+
+	return ret;
+}
+
+static int rd_init(struct rd *rd, char *filename)
 {
 	uint32_t seq;
 	int ret;
 
 	rd->filename = filename;
-	rd->argc = argc;
-	rd->argv = argv;
 	INIT_LIST_HEAD(&rd->dev_map_list);
 	INIT_LIST_HEAD(&rd->filter_list);
 
@@ -87,11 +125,15 @@ int main(int argc, char **argv)
 		{ "json",	no_argument,		NULL, 'j' },
 		{ "pretty",	no_argument,		NULL, 'p' },
 		{ "details",	no_argument,		NULL, 'd' },
+		{ "force",	no_argument,		NULL, 'f' },
+		{ "batch",	required_argument,	NULL, 'b' },
 		{ NULL, 0, NULL, 0 }
 	};
+	const char *batch_file = NULL;
 	bool pretty_output = false;
 	bool show_details = false;
 	bool json_output = false;
+	bool force = false;
 	char *filename;
 	struct rd rd;
 	int opt;
@@ -99,7 +141,7 @@ int main(int argc, char **argv)
 
 	filename = basename(argv[0]);
 
-	while ((opt = getopt_long(argc, argv, "Vhdpj",
+	while ((opt = getopt_long(argc, argv, ":Vhdpjfb:",
 				  long_options, NULL)) >= 0) {
 		switch (opt) {
 		case 'V':
@@ -115,9 +157,18 @@ int main(int argc, char **argv)
 		case 'j':
 			json_output = true;
 			break;
+		case 'f':
+			force = true;
+
Re: net: hang in unregister_netdevice: waiting for lo to become free
On 20.02.2018 18:26, Neil Horman wrote:
> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>> Is this meant to be fixed already? I am still seeing this on the
>>>> latest upstream tree.
>>>
>>> These two commits are in v4.16-rc1:
>>>
>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>> Author: Tommi Rantala
>>> Date: Mon Feb 5 21:48:14 2018 +0200
>>>
>>>     sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>     ...
>>>     Fixes: 410f03831 ("sctp: add routing output fallback")
>>>     Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>>
>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>> Author: Alexey Kodanev
>>> Date: Mon Feb 5 15:10:35 2018 +0300
>>>
>>>     sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>     ...
>>>     Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>>
>>> I guess we missed something if it's still reproducible.
>>> I can check it later this week, unless someone else beat me to it.
>>
>> Hi Tommi,
>>
>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>> another one then. But I am still seeing these:
>>
>> [ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>
>> on upstream tree pulled ~12 hours ago.
>
> Can you write a systemtap script to probe dev_hold, and dev_put,
> printing out a backtrace if the device name matches "lo". That should
> tell us definitively if the problem is in the same location or not

Hi Dmitry,

I tested with the reproducer and the kernel .config file that you sent
in the first email in this thread:

With 4.16-rc2: unable to reproduce.

With 4.15-rc9: bug reproducible, and I get "unregister_netdevice:
waiting for lo to become free. Usage count = 3"

With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in
sctp_v6_get_dst()" cherry-picked on top: unable to reproduce.

Is syzkaller doing something else now to trigger the bug...?
Can you still trigger the bug with the same reproducer?

Tommi
Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO
Hi,

On Wed, 2018-02-21 at 06:43 -0800, Eric Dumazet wrote:
> From: Eric Dumazet
>
> BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> burst sizes and to adjust cwnd in bbr_target_cwnd() with following
> gold formula :
>
> /* Allow enough full-sized skbs in flight to utilize end systems. */
> cwnd += 3 * bbr->tso_segs_goal;
>
> But GSO can be lacking or be constrained to very small
> units (ip link set dev ... gso_max_segs 2)
>
> What we really want is to have enough packets in flight so that both
> GSO and GRO are efficient.
>
> So in the case GSO is off or downgraded, we still want to have the same
> number of packets in flight as if GSO/TSO was fully operational, so
> that GRO can hopefully be working efficiently.
>
> To fix this issue, we make tcp_tso_autosize() unaware of
> sk->sk_gso_max_segs
>
> Only tcp_tso_segs() has to enforce the gso_max_segs limit.
>
> Tested:
>
> ethtool -K eth0 tso off gso off
> tc qd replace dev eth0 root pfifo_fast
>
> Before patch:
> for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
>  691 (ss -temoi shows cwnd is stuck around 6 )
>  667
>  651
>  631
>  517
>
> After patch :
> # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
>  1733 (ss -temoi shows cwnd is around 386 )
>  1778
>  1746
>  1781
>  1718
>
> Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> Signed-off-by: Eric Dumazet
> Reported-by: Oleksandr Natalenko
> ---
>  net/ipv4/tcp_output.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b2bca373f8bee35267df49b5947a6793fed71a12..6818042cd8a9a1778f54637861647091afd9a769 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1730,7 +1730,7 @@ u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
>  	 */
>  	segs = max_t(u32, bytes / mss_now, min_tso_segs);
>
> -	return min_t(u32, segs, sk->sk_gso_max_segs);
> +	return segs;
>  }
>  EXPORT_SYMBOL(tcp_tso_autosize);
>
> @@ -1742,9 +1742,10 @@ static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
>  	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
>  	u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
>
> -	return tso_segs ? :
> -		tcp_tso_autosize(sk, mss_now,
> -				 sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
> +	if (!tso_segs)
> +		tso_segs = tcp_tso_autosize(sk, mss_now,
> +				sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
> +	return min_t(u32, tso_segs, sk->sk_gso_max_segs);
>  }
>
>  /* Returns the portion of skb which can be sent right away */

Very minor nit, why don't:

	return max_t(u32, bytes / mss_now, min_tso_segs);

and drop the 'segs' local variable?

Thanks,

Paolo
Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO
On Wed, Feb 21, 2018 at 7:01 AM, Paolo Abeni wrote: > > Very minor nit, why don't: > > return max_t(u32, bytes / mss_now, min_tso_segs); > > and drop the 'segs' local variable? Simply to ease backports. We had some constant changes in this function in the past.
Re: [PATCH V7 2/4] sctp: Add ip option support
On Tue, Feb 20, 2018 at 07:15:27PM +, Richard Haines wrote: > Add ip option support to allow LSM security modules to utilise CIPSO/IPv4 > and CALIPSO/IPv6 services. > > Signed-off-by: Richard Haines LGTM too, thanks! Acked-by: Marcelo Ricardo Leitner > --- > All SCTP lksctp-tools/src/func_tests run correctly in enforcing mode. > All "./sctp-tests run" obtained from: https://github.com/sctp/sctp-tests > pass. > > V7 Changes: > 1) Log when copy ip options fail for IPv4 and IPv6 > 2) Correct sctp_setsockopt_maxseg() function. Note that the lksctp-tools > func_tests do not test with struct sctp_assoc_value. Just used simple test > and okay. > 3) Move calculation of overheads to sctp_packet_config(). > NOTE: Initially in sctp_packet_reset() I set packet->size and > packet->overhead to zero (as it is a reset). This was okay for all the > lksctp-tools function tests, however when running "sctp-tests" ndatshched > tests it causes these to fail with an st_s.log entry of: > sid: 3, expected: 3 > sid: 3, expected: 3 > unexpected sid packet !!!
> sid: 1, expected: 3 > > I then found sctp_packet_transmit() relies on setting > "packet->size = packet->overhead;" to reset size to the current overhead > after sending packets, hence the comment in sctp_packet_reset() > > include/net/sctp/sctp.h | 4 +++- > include/net/sctp/structs.h | 2 ++ > net/sctp/chunk.c | 10 +++--- > net/sctp/ipv6.c | 45 ++--- > net/sctp/output.c | 34 +- > net/sctp/protocol.c | 38 ++ > net/sctp/socket.c | 11 --- > 7 files changed, 117 insertions(+), 27 deletions(-) > > diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h > index f7ae6b0..25c5c87 100644 > --- a/include/net/sctp/sctp.h > +++ b/include/net/sctp/sctp.h > @@ -441,9 +441,11 @@ static inline int sctp_list_single_entry(struct > list_head *head) > static inline int sctp_frag_point(const struct sctp_association *asoc, int > pmtu) > { > struct sctp_sock *sp = sctp_sk(asoc->base.sk); > + struct sctp_af *af = sp->pf->af; > int frag = pmtu; > > - frag -= sp->pf->af->net_header_len; > + frag -= af->ip_options_len(asoc->base.sk); > + frag -= af->net_header_len; > frag -= sizeof(struct sctphdr) + sctp_datachk_len(&asoc->stream); > > if (asoc->user_frag) > diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h > index 03e92dd..ead5fce 100644 > --- a/include/net/sctp/structs.h > +++ b/include/net/sctp/structs.h > @@ -491,6 +491,7 @@ struct sctp_af { > void (*ecn_capable)(struct sock *sk); > __u16 net_header_len; > int sockaddr_len; > + int (*ip_options_len)(struct sock *sk); > sa_family_t sa_family; > struct list_head list; > }; > @@ -515,6 +516,7 @@ struct sctp_pf { > int (*addr_to_user)(struct sctp_sock *sk, union sctp_addr *addr); > void (*to_sk_saddr)(union sctp_addr *, struct sock *sk); > void (*to_sk_daddr)(union sctp_addr *, struct sock *sk); > + void (*copy_ip_options)(struct sock *sk, struct sock *newsk); > struct sctp_af *af; > }; > > diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c > index 991a530..d726d21 100644 > --- a/net/sctp/chunk.c > +++ b/net/sctp/chunk.c > @@
-171,6 +171,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct > sctp_association *asoc, > struct list_head *pos, *temp; > struct sctp_chunk *chunk; > struct sctp_datamsg *msg; > + struct sctp_sock *sp; > + struct sctp_af *af; > int err; > > msg = sctp_datamsg_new(GFP_KERNEL); > @@ -189,9 +191,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct > sctp_association *asoc, > /* This is the biggest possible DATA chunk that can fit into > * the packet > */ > - max_data = asoc->pathmtu - > - sctp_sk(asoc->base.sk)->pf->af->net_header_len - > - sizeof(struct sctphdr) - sctp_datachk_len(&asoc->stream); > + sp = sctp_sk(asoc->base.sk); > + af = sp->pf->af; > + max_data = asoc->pathmtu - af->net_header_len - > + sizeof(struct sctphdr) - sctp_datachk_len(&asoc->stream) - > + af->ip_options_len(asoc->base.sk); > max_data = SCTP_TRUNC4(max_data); > > /* If the the peer requested that we authenticate DATA chunks > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c > index e35d4f7..30a05a8 100644 > --- a/net/sctp/ipv6.c > +++ b/net/sctp/ipv6.c > @@ -427,6 +427,41 @@ static void sctp_v6_copy_addrlist(struct list_head > *addrlist, > rcu_read_unlock(); > } > > +/* Copy over any ip options */ > +static void sctp_v6_copy_ip_options(struct sock *sk, struct sock *newsk) > +{ > + struct ipv6_pinfo *newnp, *np = inet6_sk(sk); > + struct
Re: [PATCH net-next v2 1/1] net: Allow a rule to track originating protocol
From: Donald Sharp Date: Tue, 20 Feb 2018 08:55:58 -0500 > Allow a rule that is being added/deleted/modified or > dumped to contain the originating protocol's id. > > The protocol is handled just like a routes originating > protocol is. This is especially useful because there > is starting to be a plethora of different user space > programs adding rules. > > Allow the vrf device to specify that the kernel is the originator > of the rule created for this device. > > Signed-off-by: Donald Sharp Looks good, applied, thanks Donald.
Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open
From: Anders Roxell Date: Wed, 21 Feb 2018 22:30:01 +0100 > bpf builds a test program for loading BPF ELF files. Add the executable > to the .gitignore list. > > Signed-off-by: Anders Roxell Acked-by: David S. Miller
[PATCH net-next 4/7] net: Rename NETEVENT_MULTIPATH_HASH_UPDATE
Rename NETEVENT_MULTIPATH_HASH_UPDATE to NETEVENT_IPV4_MPATH_HASH_UPDATE to denote it relates to a change in the IPv4 hash policy. Signed-off-by: David Ahern --- drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 +- include/net/netevent.h | 2 +- net/ipv4/sysctl_net_ipv4.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c index 05146970c19c..da8ca721f2fc 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c @@ -2427,7 +2427,7 @@ static int mlxsw_sp_router_netevent_event(struct notifier_block *nb, mlxsw_core_schedule_work(&net_work->work); mlxsw_sp_port_dev_put(mlxsw_sp_port); break; - case NETEVENT_MULTIPATH_HASH_UPDATE: + case NETEVENT_IPV4_MPATH_HASH_UPDATE: net = ptr; if (!net_eq(net, &init_net)) diff --git a/include/net/netevent.h b/include/net/netevent.h index 40e7bab68490..baee605a94ab 100644 --- a/include/net/netevent.h +++ b/include/net/netevent.h @@ -26,7 +26,7 @@ enum netevent_notif_type { NETEVENT_NEIGH_UPDATE = 1, /* arg is struct neighbour ptr */ NETEVENT_REDIRECT, /* arg is struct netevent_redirect ptr */ NETEVENT_DELAY_PROBE_TIME_UPDATE, /* arg is struct neigh_parms ptr */ - NETEVENT_MULTIPATH_HASH_UPDATE, /* arg is struct net ptr */ + NETEVENT_IPV4_MPATH_HASH_UPDATE, /* arg is struct net ptr */ }; int register_netevent_notifier(struct notifier_block *nb); diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 89683d868b37..011de9a20ec6 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -400,7 +400,7 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); if (write && ret == 0) - call_netevent_notifiers(NETEVENT_MULTIPATH_HASH_UPDATE, net); + call_netevent_notifiers(NETEVENT_IPV4_MPATH_HASH_UPDATE, net); return ret; } -- 2.11.0
Re: [PATCH RFC PoC 0/3] nftables meets bpf
On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote: > We also have a large range of TCAM based hardware offload outthere > that will _not_ work with your BPF HW offload infrastructure. What > this bpf infrastructure pushes into the kernel is just a blob > expressing things in a very low-level instruction-set: trying to find > a mapping of that to typical HW intermediate representations in the > TCAM based HW offload world will be simply crazy. I'm not sure where the TCAM talk is coming from. Think much smaller - cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs. The information the verifier is gathering will be crucial for optimizing those. Please don't discount the value of being able to use heterogeneous processing units by the networking stack.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: > I haven't checked emails for days and did not realize the new revision > had already came out. And thank you for the effort, this revision > really looks to be a step forward towards our use case and is close to > what we wanted to do. A few questions in line. > > On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck > wrote: >> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: >>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: Ppatch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use alternate datapath when available and registered. When BACKUP feature is enabled, virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as 'backup' netdev and a passthru/vf device with the same MAC gets registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev as default for transmits when it is available with link up and running. >>> >>> Thank you do doing this. >>> We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. >>> >>> Out of curiosity - why do you link the master netdev to the virtio >>> struct device? >> >> The basic idea of all this is that we wanted this to work with an >> existing VM image that was using virtio.
As such we were trying to >> make it so that the bypass interface takes the place of the original >> virtio and get udev to rename the bypass to what the original >> virtio_net was. > > Could it made it also possible to take over the config from VF instead > of virtio on an existing VM image? And get udev rename the bypass > netdev to what the original VF was. I don't say tightly binding the > bypass master to only virtio or VF, but I think we should provide both > options to support different upgrade paths. Possibly we could tweak > the device tree layout to reuse the same PCI slot for the master > bypass netdev, such that udev would not get confused when renaming the > device. The VF needs to use a different function slot afterwards. > Perhaps we might need to a special multiseat like QEMU device for that > purpose? > > Our case we'll upgrade the config from VF to virtio-bypass directly. So if I am understanding what you are saying you are wanting to flip the backup interface from the virtio to a VF. The problem is that becomes a bit of a vendor lock-in solution since it would rely on a specific VF driver. I would agree with Jiri that we don't want to go down that path. We don't want every VF out there firing up its own separate bond. Ideally you want the hypervisor to be able to manage all of this which is why it makes sense to have virtio manage this and why this is associated with the virtio_net interface. The other bits get into more complexity then we are ready to handle for now. I think I might have talked about something similar that I was referring to as a "virtio-bond" where you would have a PCI/PCIe tree topology that makes this easier to sort out, and the "virtio-bond would be used to handle coordination/configuration of a much more complex interface. >> >>> FWIW two solutions that immediately come to mind is to export "backup" >>> as phys_port_name of the backup virtio link and/or assign a name to the >>> master like you are doing already. 
I think team uses team%d and bond >>> uses bond%d, soft naming of master devices seems quite natural in this >>> case. >> >> I figured I had overlooked something like that.. Thanks for pointing >> this out. Okay so I think the phys_port_name approach might resolve >> the original issue. If I am reading things correctly what we end up >> with is the master showing up as "ens1" for example and the backup >> showing up as "ens1nbackup". Am I understanding that right? >> >> The problem with the team/bond%d approach is that it creates a new >> netdevice and so it would require guest configuration changes. >> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio >>> link is quite neat. >> >> I agree. For non-"backup" virio_net devices would it be okay for us to >> just return -EOPNOTSUPP? I assume it would be and that way the legacy >> behavior could be maintained although the function still exists. >> - When
Re: [PATCH RFC PoC 0/3] nftables meets bpf
On 02/21/2018 03:46 PM, Jakub Kicinski wrote: > On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote: >> We also have a large range of TCAM based hardware offload outthere >> that will _not_ work with your BPF HW offload infrastructure. What >> this bpf infrastructure pushes into the kernel is just a blob >> expressing things in a very low-level instruction-set: trying to find >> a mapping of that to typical HW intermediate representations in the >> TCAM based HW offload world will be simply crazy. > > I'm not sure where the TCAM talk is coming from. Think much smaller - > cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs. The > information the verifier is gathering will be crucial for optimizing > those. Please don't discount the value of being able to use > heterogeneous processing units by the networking stack. > The only use case that we have a good answer for is when there is no HW offload capability available, because there, we know that eBPF is our best possible solution for a software fast path, in large part because of all the efforts that went into making it both safe and fast. When there is offloading HW available, there does not appear to be a perfect answer to this problem of, given a standard Linux utility that can express any sort of match + action, be it ethtool::rxnfc, tc/cls_{u32,flower}, nftables, how do I transform that into what makes most sense to my HW? You could: - have hardware that understands BPF bytecode directly, great, then you don't have to do anything, just pass it up the driver baby, oh wait, it's not that simple, the NFP driver is not small - transform BPF back into something that your hardware understands, does that belong in the kernel?
Maybe, maybe not - use a completely different intermediate representation like P4, brainfuck, I don't know Maybe first things first, we have at least 3 different programming interfaces, if not more: ethtool::rxnfc, tc/cls_{u32,flower}, nftables that are all capable of programming TCAMs and hardware capable of match + action, how about we start with having some sort of common library code that: - validates input parameters against HW capabilities - does the adequate transformation from any of these interfaces into a generic set of input parameters - define what the appropriate behavior is when programming through all of these 3 interfaces that ultimately access the same shared piece of HW, and therefore need to manage resource allocation? -- Florian
Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open
On 02/21/2018 03:48 PM, David Miller wrote: > From: Anders Roxell > Date: Wed, 21 Feb 2018 22:30:01 +0100 > >> bpf builds a test program for loading BPF ELF files. Add the executable >> to the .gitignore list. >> >> Signed-off-by: Anders Roxell > > Acked-by: David S. Miller > > Thanks. I will pull this in for 4.16-rc -- Shuah
Re: [PATCH] bpf: clean up unused-variable warning
On 02/20/2018 11:07 PM, Arnd Bergmann wrote: > The only user of this variable is inside of an #ifdef, causing > a warning without CONFIG_INET: > > net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set': > net/core/filter.c:3382:6: error: unused variable 'val' > [-Werror=unused-variable] > int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS; > > This replaces the #ifdef with a nicer IS_ENABLED() check that > makes the code more readable and avoids the warning. > > Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock") > Signed-off-by: Arnd Bergmann Now applied to bpf, thanks Arnd!
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
On (02/21/18 18:45), Willem de Bruijn wrote: > > I do mean returning 0 instead of -EAGAIN if control data is ready. > Something like > > @@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr > *msg, size_t size, > > if (!rds_next_incoming(rs, &inc)) { > if (nonblock) { > - ret = -EAGAIN; > + ncookies = rds_recvmsg_zcookie(rs, msg); > + ret = ncookies ? 0 : -EAGAIN; > break; > } Yes, but you now have an implicit branch based on ncookies, so I'm not sure it saved all that much? like I said let me revisit this > By the way, the put_cmsg is unconditional even if the caller did > not supply msg_control. So it is basically no longer safe to ever > call read, recv or recvfrom on a socket if zerocopy notifications > are outstanding. Wait, I thought put_cmsg already checks for these things. > It is possible to check msg_controllen before even deciding whether > to try to dequeue notifications (and take the lock). I see that this is > not common. But RDS of all cases seems to do this, in > rds_notify_queue_get: yes the comment above that code suggests that this bit of code was done to avoid calling put_cmsg while holding the rs_lock. One bit of administrivia though - if I now drop sk_error_queue for PF_RDS, I'll have to fix selftests in the same patch too, so the patch will get a bit bulky (and thus a bit more difficult to review).
Bug with 'ip' command that builds a stuck list of interfaces
Hello. Running 'ip tunnel add tun0 mode ipip dev enp1s0' creates two interfaces, 'tunl0' and 'tun0'. Later, 'ip tunnel del tun0' removes only 'tun0' and leaves 'tunl0' behind in the list. After that nothing more can be done with it: it cannot be created again and it cannot be deleted. It is stuck. # ip link list : : : 4: tunl0@NONE: mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ipip 0.0.0.0 brd 0.0.0.0
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck wrote: > On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: >> I haven't checked emails for days and did not realize the new revision >> had already came out. And thank you for the effort, this revision >> really looks to be a step forward towards our use case and is close to >> what we wanted to do. A few questions in line. >> >> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck >> wrote: >>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: > Ppatch 2 is in response to the community request for a 3 netdev > solution. However, it creates some issues we'll get into in a moment. > It extends virtio_net to use alternate datapath when available and > registered. When BACKUP feature is enabled, virtio_net driver creates > an additional 'bypass' netdev that acts as a master device and controls > 2 slave devices. The original virtio_net netdev is registered as > 'backup' netdev and a passthru/vf device with the same MAC gets > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are > associated with the same 'pci' device. The user accesses the network > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev > as default for transmits when it is available with link up and running. Thank you do doing this. > We noticed a couple of issues with this approach during testing. > - As both 'bypass' and 'backup' netdevs are associated with the same > virtio pci device, udev tries to rename both of them with the same name > and the 2nd rename will fail. This would be OK as long as the first > netdev > to be renamed is the 'bypass' netdev, but the order in which udev gets > to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? >>> >>> The basic idea of all this is that we wanted this to work with an >>> existing VM image that was using virtio.
As such we were trying to >>> make it so that the bypass interface takes the place of the original >>> virtio and get udev to rename the bypass to what the original >>> virtio_net was. >> >> Could it made it also possible to take over the config from VF instead >> of virtio on an existing VM image? And get udev rename the bypass >> netdev to what the original VF was. I don't say tightly binding the >> bypass master to only virtio or VF, but I think we should provide both >> options to support different upgrade paths. Possibly we could tweak >> the device tree layout to reuse the same PCI slot for the master >> bypass netdev, such that udev would not get confused when renaming the >> device. The VF needs to use a different function slot afterwards. >> Perhaps we might need to a special multiseat like QEMU device for that >> purpose? >> >> Our case we'll upgrade the config from VF to virtio-bypass directly. > > So if I am understanding what you are saying you are wanting to flip > the backup interface from the virtio to a VF. The problem is that > becomes a bit of a vendor lock-in solution since it would rely on a > specific VF driver. I would agree with Jiri that we don't want to go > down that path. We don't want every VF out there firing up its own > separate bond. Ideally you want the hypervisor to be able to manage > all of this which is why it makes sense to have virtio manage this and > why this is associated with the virtio_net interface. No, that's not what I was talking about of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure. 
However, the upgrade path I was seeking is to use the bypass interface to take the place of the original VF interface while retaining the name and network configs, which generally can be done simply with a kernel upgrade. It would become limiting as this patch makes the bypass interface share the same virtio pci device with the virtio backup. Can this bypass interface be made general to take the place of any pci device other than virtio-net? This would be more helpful, as cloud users who have an existing setup on a VF interface wouldn't have to recreate it on virtio-net and VF separately again. > > The other bits get into more complexity then we are ready to handle > for now. I think I might have talked about something similar that I > was referring to as a "virtio-bond" where you would have a PCI/PCIe > tree topology that makes this easier to sort out, and the "virtio-bond > would be used to handle coordination/configuration of a much more > complex interface. That was one way to solve this problem but I'd like to see
[PATCH v2 net-next 2/2] lan743x: Update MAINTAINERS to include lan743x driver
Update MAINTAINERS to include lan743x driver Signed-off-by: Bryan Whitehead --- MAINTAINERS | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 9a7f76e..c340125 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9149,6 +9149,13 @@ F: drivers/net/dsa/microchip/* F: include/linux/platform_data/microchip-ksz.h F: Documentation/devicetree/bindings/net/dsa/ksz.txt +MICROCHIP LAN743X ETHERNET DRIVER +M: Bryan Whitehead +M: Microchip Linux Driver Support +L: netdev@vger.kernel.org +S: Maintained +F: drivers/net/ethernet/microchip/lan743x_* + MICROCHIP USB251XB DRIVER M: Richard Leitner L: linux-...@vger.kernel.org -- 2.7.4
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
On Wed, Feb 21, 2018 at 3:19 PM, Sowmini Varadhan wrote: > This commit is an optimization that builds on top of commit 01883eda72bd > ("rds: support for zcopy completion notification") for PF_RDS sockets. > > Cookies associated with zerocopy completion are passed up on the POLLIN > channel, piggybacked with data whereever possible. Such cookies are passed > up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when > the returned value of recvmsg() is >= 0. A max of SO_EE_ORIGIN_MAX_ZCOOKIES > may be passed with each message. I'd put this optimization behind a socket option. Either that, or always store cookies on an RDS specific queue, return as recv cmsg and then remove the alternative error queue path completely. Having both is confusing, also in terms of system behavior (e.g., see signaling and sk_err handling in sock_dequeue_err_skb). > +static int rds_recvmsg_zcookie(struct rds_sock *rs, struct msghdr *msg) > +{ > + struct sk_buff *skb, *tmp; > + struct sock_exterr_skb *serr; > + struct sock *sk = rds_rs_to_sk(rs); > + struct sk_buff_head *q = &sk->sk_error_queue; > + struct rds_zcopy_cookies done; > + u32 *ptr; > + int i; > + unsigned long flags; > + > + spin_lock_irqsave(&q->lock, flags); This seems expensive on every recvmsg. Even if zerocopy is not enabled. > + if (skb_queue_empty(q)) { > + spin_unlock_irqrestore(&q->lock, flags); > + return 0; > + } > + skb_queue_walk_safe(q, skb, tmp) { This too. If anything, just peek at the queue head and skip otherwise. Anything else will cause an error, which blocks regular read and write on the socket.
> int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, > int msg_flags) > { > @@ -586,6 +623,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, > size_t size, > int ret = 0, nonblock = msg_flags & MSG_DONTWAIT; > DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg->msg_name); > struct rds_incoming *inc = NULL; > + int ncookies; > > /* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */ > timeo = sock_rcvtimeo(sk, nonblock); > @@ -609,6 +647,14 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, > size_t size, > break; > } > > + if (list_empty(&rs->rs_recv_queue) && nonblock) { > + ncookies = rds_recvmsg_zcookie(rs, msg); > + if (ncookies) { > + ret = 0; This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?
Re: syzcaller patch postings...
David Miller wrote: > I have to mention this now before it gets out of control. > > I would like to ask that syzkaller stop posting the patch it is > testing when it posts to netdev. Same for netfilter-devel. I could not get a reproducer to trigger and asked syzbot to test the patch (great feature, thanks!) -- I did not cc any mailing list. So I was very surprised syzbot announced the test result by adding back all CCs from the original report. I think it's better to announce the result to the patch author only.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
I haven't checked emails for days and did not realize the new revision had already come out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: > On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: >> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: >>> Ppatch 2 is in response to the community request for a 3 netdev >>> solution. However, it creates some issues we'll get into in a moment. >>> It extends virtio_net to use alternate datapath when available and >>> registered. When BACKUP feature is enabled, virtio_net driver creates >>> an additional 'bypass' netdev that acts as a master device and controls >>> 2 slave devices. The original virtio_net netdev is registered as >>> 'backup' netdev and a passthru/vf device with the same MAC gets >>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>> associated with the same 'pci' device. The user accesses the network >>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >>> as default for transmits when it is available with link up and running. >> >> Thank you do doing this. >> >>> We noticed a couple of issues with this approach during testing. >>> - As both 'bypass' and 'backup' netdevs are associated with the same >>> virtio pci device, udev tries to rename both of them with the same name >>> and the 2nd rename will fail. This would be OK as long as the first netdev >>> to be renamed is the 'bypass' netdev, but the order in which udev gets >>> to rename the 2 netdevs is not reliable. >> >> Out of curiosity - why do you link the master netdev to the virtio >> struct device? > > The basic idea of all this is that we wanted this to work with an > existing VM image that was using virtio.
As such we were trying to > make it so that the bypass interface takes the place of the original > virtio and get udev to rename the bypass to what the original > virtio_net was. Could it also be made possible to take over the config from the VF instead of virtio on an existing VM image, and get udev to rename the bypass netdev to what the original VF was? I'm not saying we should tightly bind the bypass master to only virtio or a VF; I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need a special multiseat-like QEMU device for that purpose? In our case we'll upgrade the config from VF to virtio-bypass directly. > >> FWIW two solutions that immediately come to mind is to export "backup" >> as phys_port_name of the backup virtio link and/or assign a name to the >> master like you are doing already. I think team uses team%d and bond >> uses bond%d, soft naming of master devices seems quite natural in this >> case. > > I figured I had overlooked something like that.. Thanks for pointing > this out. Okay so I think the phys_port_name approach might resolve > the original issue. If I am reading things correctly what we end up > with is the master showing up as "ens1" for example and the backup > showing up as "ens1nbackup". Am I understanding that right? > > The problem with the team/bond%d approach is that it creates a new > netdevice and so it would require guest configuration changes. > >> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio >> link is quite neat. > > I agree. For non-"backup" virio_net devices would it be okay for us to > just return -EOPNOTSUPP? I assume it would be and that way the legacy > behavior could be maintained although the function still exists.
> >>> - When the 'active' netdev is unplugged OR not present on a destination >>> system after live migration, the user will see 2 virtio_net netdevs. >> >> That's necessary and expected, all configuration applies to the master >> so master must exist. > > With the naming issue resolved this is the only item left outstanding. > This becomes a matter of form vs function. > > The main complaint about the "3 netdev" solution is a bit confusing to > have the 2 netdevs present if the VF isn't there. The idea is that > having the extra "master" netdev there if there isn't really a bond is > a bit ugly. Is this uglier in terms of user experience rather than functionality? I don't want it dynamically changed between 2-netdev and 3-netdev depending on the presence of the VF. That gets back to my original question and suggestion earlier: why not just hide the lower netdevs from udev renaming and such? What important observability benefits would users get from exposing the lower netdevs? Thanks, -Siwei > > The downside of the "2 netdev" solution is that you have to deal
nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter
On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
> > Obvious candidates are: meta, numgen, limit, objref, quota, reject.
>
> We should probably also consider removing
> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
> build both too (at least rbtree since that offers interval).
>
> For the indirect call issue we can use direct calls from eval loop for
> some of the more frequently used ones, similar to what we do already
> for nft_cmp_fast_expr.

nft_cmp_fast_expr and the other expressions mentioned above got me thinking... do we have the same issue with the nft interpreter as we had with the bpf one? The bpf interpreter was used as part of the spectre2 attack to leak information via a cache side channel and let a VM read hypervisor memory. Due to that issue we removed the bpf interpreter from the kernel code. That's what CONFIG_BPF_JIT_ALWAYS_ON is for... but we still have the nft interpreter in the kernel, and it can also execute arbitrary nft expressions.

Jann's exploit used the following bpf instructions:

struct bpf_insn evil_bytecode_instrs[] = {
	// rax = target_byte_addr
	{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr },
	{ .imm = target_byte_addr>>32 },
	// rdi = timing_leak_array
	{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr },
	{ .imm = host_timing_leak_addr>>32 },
	// rax = *(u8*)rax
	{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 },
	// rax = rax << ...
	{ .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx },
	// rax = rax & 0x400
	{ .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 },
	// rax = rdi + rax
	{ .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 },
	// *(u8*) (rax + 0x800)
	{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 },

and a gadget to jump into __bpf_prog_run with insn pointing to memory controlled by the guest while accessible (at a different virtual address) by the hypervisor.
It seems possible to construct a similar sequence of instructions out of nft expressions and use a gadget that jumps into nft_do_chain(). Compared to the bpf interpreter attack, the attacker would need to discover more kernel addresses: nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval, nft_lookup_eval, and nft_bitmap_lookup, in order to populate nft chains, rules and expressions in guest memory.

Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv), pkt needs to point to a fake struct sk_buff in guest memory with skb->head == target_byte_addr.

The first nft expression can be nft_payload_fast_eval(). If it's properly constructed with (nft_payload->base == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, dreg == 1) it will do an arbitrary load of *(u8 *)dest = *(u8 *)ptr; from target_byte_addr into register 1 of the nft state machine (dest is the u32 array of registers in the stack of nft_do_chain).

The second nft expression can be nft_bitwise_eval() to mask a particular bit in register 1. Then nft_cmp_eval() checks whether the bit is one or zero and conditionally NFT_BREAKs out of the first nft rule into the second nft rule. The last conditional nft_immediate_eval() in the first rule will set register 1 to 0x400 * 8, while the first nft_bitwise_eval() in the second rule will do r1 &= 0x400 * 8. So at this point r1 will hold either 0x400 * 8 or 0, depending on the value of the speculatively loaded bit.

The last expression can be nft_lookup_eval() with nft_lookup->set->ops->lookup == nft_bitmap_lookup, which will do nft_bitmap->bitmap[idx] where idx = r1 / 8. The memory used for this last nft_lookup/bitmap expression is both an instruction and the timing_leak_array itself.

If I'm not mistaken, this sequence of nft expressions will speculatively execute logic very similar to evil_bytecode_instrs[]. The amount of actual speculative native cpu loads/stores/branches is probably more than that executed by the bpf interpreter for these evil bytecodes, but likely well within the cpu speculation window of 100+ insns.
Obviously such an exploit is harder to pull off than the bpf based one. Do we need to do anything about it? Maybe it's easier to find gadgets in the .text of vmlinux instead of messing with interpreters? Jann, can you comment on removing interpreters in general? Do we need to worry about having the bpf and/or nft interpreter in the kernel?
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
On (02/21/18 16:54), Willem de Bruijn wrote:
> > I'd put this optimization behind a socket option.

Yes, that thought occurred to me as well - I think RDS applications are unlikely to use the error_queue path because, as I mentioned before, these are heavily request-response based, so you're going to be getting something back for each sendmsg anyway. So if I had a sockopt, it would be something that would be "piggyback completion" most (all?) of the time anyway, but I suppose making it explicit is useful to have.

> Either that, or always store cookies on an RDS specific queue, return as recv
> cmsg and then remove the alternative error queue path completely.

I think the error queue path is good to have, if only to be aligned with other socket families.

> > +	spin_lock_irqsave(&q->lock, flags);
>
> This seems expensive on every recvmsg. Even if zerocopy is not enabled.

Noted, I'll predicate it on both zcopy and the sockopt suggestion.

> > +	if (skb_queue_empty(q)) {
> > +		spin_unlock_irqrestore(&q->lock, flags);
> > +		return 0;
> > +	}
> > +	skb_queue_walk_safe(q, skb, tmp) {
>
> This too. If anything, just peek at the queue head and skip otherwise.
> Anything else will cause an error, which blocks regular read and write
> on the socket.

But that would be technically incorrect, because you could have a mix/match of the zcopy notification with other sk_error_queue messages (given that rds_rm_zerocopy_callback looks at the tail, and creates a new skb if the tail is not a SO_EE_ORIGIN_ZCOOKIE). Of course, it's true that *today* the only thing in the rds socket error queue is the cookie list, so this queue walk is a bit of overkill.. but maybe I am missing something you are concerned about? The queue walk is intended to pull out the first SO_EE_ORIGIN_ZCOOKIE (if it exists) and return quietly otherwise (leaving err_queue unchanged) - where do you see the blocking error?
> > +	if (list_empty(&rs->rs_recv_queue) && nonblock) {
> > +		ncookies = rds_recvmsg_zcookie(rs, msg);
> > +		if (ncookies) {
> > +			ret = 0;
>
> This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?

The only time MSG_DONTWAIT was previously returning 0 for PF_RDS was when there was a congestion notification or rdma completion, i.e., the checks for rs_notify_queue and rs_cong_notify earlier in the function, and in those 2 cases it was not returning data (i.e. ret was always 0).

In all other cases rds_recvmsg would always return either
- ret == -1, with errno set to EAGAIN or ETIMEDOUT, depending on the value of MSG_DONTWAIT, or,
- ret > 0 with data bytes.

That behavior (as well as the existing behavior for congestion notification and rdma completion) has not changed. The only thing that changed is that if there is no data to pass up, ret will be 0, errno will be 0, and the ancillary data may have completion cookies.

--Sowmini
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
>> Okay. If callers must already handle 0 as a valid return code, then
>> it is fine to add another case that does this.
>>
>> The extra branch in the hot path is still rather unfortunate. Could
>> this be integrated in the existing if (nonblock) branch below?
>
> that's where I first started. it got even hairier because the
> callers expect a retval of 0 (-EAGAIN threw rds-stress into an
> infinite loop of continually trying to recv) and the end result
> was just confusing code with the same number of branches..

I do mean returning 0 instead of -EAGAIN if control data is ready. Something like

@@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,

 	if (!rds_next_incoming(rs, )) {
 		if (nonblock) {
-			ret = -EAGAIN;
+			ncookies = rds_recvmsg_zcookie(rs, msg);
+			ret = ncookies ? 0 : -EAGAIN;
 			break;
 		}

By the way, the put_cmsg is unconditional even if the caller did not supply msg_control. So it is basically no longer safe to ever call read, recv or recvfrom on a socket if zerocopy notifications are outstanding. It is possible to check msg_controllen before even deciding whether to try to dequeue notifications (and take the lock). I see that this is not common. But RDS of all cases seems to do this, in rds_notify_queue_get:

	max_messages = msghdr->msg_controllen / CMSG_SPACE(sizeof(cmsg));
[PATCH net-next] ibmvnic: Fix TX descriptor tracking
With the recent change, transmissions that only needed one descriptor were being missed. The result is that such packets were tracked as outstanding transmissions but never removed when their completion notifications were received.

Fixes: ffc385b95adb ("ibmvnic: Keep track of supplementary TX descriptors")
Signed-off-by: Thomas Falcon
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 340e1ab..b3a34d9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1478,7 +1478,6 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	if ((*hdrs >> 7) & 1) {
 		build_hdr_descs_arr(tx_buff, &num_entries, *hdrs);
 		tx_crq.v1.n_crq_elem = num_entries;
-		tx_buff->num_entries = num_entries;
 		tx_buff->indir_arr[0] = tx_crq;
 		tx_buff->indir_dma = dma_map_single(dev, tx_buff->indir_arr,
 						    sizeof(tx_buff->indir_arr),
@@ -1533,6 +1532,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 		netif_stop_subqueue(netdev, queue_num);
 	}

+	tx_buff->num_entries = num_entries;
 	tx_packets++;
 	tx_bytes += skb->len;
 	txq->trans_start = jiffies;
-- 
1.8.3.1
[PATCH net] ibmvnic: Fix early release of login buffer
The login buffer is released before the driver can perform sanity checks between the resources the driver requested and what firmware will provide. Don't release the login buffer until the sanity check is performed.

Fixes: 34f0f4e3f488 ("ibmvnic: Fix login buffer memory leaks")
Signed-off-by: Thomas Falcon
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 1703b88..340e1ab 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3790,7 +3790,6 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
 	dma_unmap_single(dev, adapter->login_buf_token, adapter->login_buf_sz,
 			 DMA_BIDIRECTIONAL);
-	release_login_buffer(adapter);
 	dma_unmap_single(dev, adapter->login_rsp_buf_token,
 			 adapter->login_rsp_buf_sz, DMA_BIDIRECTIONAL);
@@ -3821,6 +3820,7 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
 		ibmvnic_remove(adapter->vdev);
 		return -EIO;
 	}
+	release_login_buffer(adapter);
 	complete(&adapter->init_done);
 	return 0;
-- 
1.8.3.1
[PATCH v2 iproute2-next 2/3] ip: Display ip rule protocol used
Modify 'ip rule' command to notice when the kernel passes to us the originating protocol. Add code to allow the `ip rule flush protocol XXX` command to be accepted and properly handled. Modify the documentation to reflect these code changes. Signed-off-by: Donald Sharp--- ip/iprule.c| 29 ++--- man/man8/ip-rule.8 | 18 +- 2 files changed, 39 insertions(+), 8 deletions(-) diff --git a/ip/iprule.c b/ip/iprule.c index 00a6c26a..39008768 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -47,6 +47,7 @@ static void usage(void) "[ iif STRING ] [ oif STRING ] [ pref NUMBER ] [ l3mdev ]\n" "[ uidrange NUMBER-NUMBER ]\n" "ACTION := [ table TABLE_ID ]\n" + " [ protocol RPROTO ]\n" " [ nat ADDRESS ]\n" " [ realms [SRCREALM/]DSTREALM ]\n" " [ goto NUMBER ]\n" @@ -71,6 +72,8 @@ static struct struct fib_rule_uid_range range; inet_prefix src; inet_prefix dst; + int protocol; + int protocolmask; } filter; static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb) @@ -338,6 +341,10 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) rtnl_rtntype_n2a(frh->action, b1, sizeof(b1))); + if (frh->proto != RTPROT_UNSPEC) + fprintf(fp, " proto %s ", + rtnl_rtprot_n2a(frh->proto, b1, sizeof(b1))); + fprintf(fp, "\n"); fflush(fp); return 0; @@ -391,6 +398,9 @@ static int flush_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len); + if ((filter.protocol^frh->proto)) + return 0; + if (tb[FRA_PRIORITY]) { n->nlmsg_type = RTM_DELRULE; n->nlmsg_flags = NLM_F_REQUEST; @@ -415,12 +425,6 @@ static int iprule_list_flush_or_save(int argc, char **argv, int action) if (af == AF_UNSPEC) af = AF_INET; - if (action != IPRULE_LIST && argc > 0) { - fprintf(stderr, "\"ip rule %s\" does not take any arguments.\n", - action == IPRULE_SAVE ? 
"save" : "flush"); - return -1; - } - switch (action) { case IPRULE_SAVE: if (save_rule_prep()) @@ -508,7 +512,18 @@ static int iprule_list_flush_or_save(int argc, char **argv, int action) NEXT_ARG(); if (get_prefix(, *argv, af)) invarg("from value is invalid\n", *argv); - } else { + } else if (matches(*argv, "protocol") == 0) { + __u32 prot; + NEXT_ARG(); + filter.protocolmask = -1; + if (rtnl_rtprot_a2n(, *argv)) { + if (strcmp(*argv, "all") != 0) + invarg("invalid \"protocol\"\n", *argv); + prot = 0; + filter.protocolmask = 0; + } + filter.protocol = prot; + } else{ if (matches(*argv, "dst") == 0 || matches(*argv, "to") == 0) { NEXT_ARG(); diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8 index a5c47981..98b2573d 100644 --- a/man/man8/ip-rule.8 +++ b/man/man8/ip-rule.8 @@ -50,6 +50,8 @@ ip-rule \- routing policy database management .IR ACTION " := [ " .B table .IR TABLE_ID " ] [ " +.B protocol +.IR RPROTO " ] [ " .B nat .IR ADDRESS " ] [ " .B realms @@ -240,6 +242,10 @@ The options preference and order are synonyms with priority. the routing table identifier to lookup if the rule selector matches. It is also possible to use lookup instead of table. +.TP +.BI protocol " RPROTO" +the protocol who installed the rule in question. + .TP .BI suppress_prefixlength " NUMBER" reject routing decisions that have a prefix length of NUMBER or less. @@ -275,7 +281,11 @@ updates, it flushes the routing cache with .RE .TP .B ip rule flush - also dumps all the deleted rules. -This command has no arguments. +.RS +.TP +.BI protocol " RPROTO" +Select the originating protocol. +.RE .TP .B ip rule show - list rules This command has no arguments. @@ -283,6 +293,12 @@ The options list or lst are synonyms with show. .TP .B ip rule save +.RS +.TP +.BI protocl " RPROTO" +Select the originating protocol. +.RE +.TP save rules table information to stdout .RS This command behaves like -- 2.14.3
[PATCH v2 iproute2-next 0/3] Allow 'ip rule' command to use protocol
Fix iprule.c to use the actual `struct fib_rule_hdr` and to allow the end user to see and use the protocol keyword for rule manipulations. v2: Rearrange and code changes as per David Ahern Donald Sharp (3): ip: Use the `struct fib_rule_hdr` for rules ip: Display ip rule protocol used ip: Allow rules to accept a specified protocol include/uapi/linux/fib_rules.h | 2 +- ip/iprule.c| 164 - man/man8/ip-rule.8 | 18 - 3 files changed, 114 insertions(+), 70 deletions(-) -- 2.14.3
[PATCH v2 iproute2-next 1/3] ip: Use the `struct fib_rule_hdr` for rules
The iprule.c code was using `struct rtmsg` as the data type to pass into the kernel for the netlink message. While 'struct rtmsg' and `struct fib_rule_hdr` are the same size and mostly the same, we should use the correct data structure. This commit translates the data structures to have iprule.c use the correct one. Additionally copy over the modified fib_rules.h file Signed-off-by: Donald Sharp--- include/uapi/linux/fib_rules.h | 2 +- ip/iprule.c| 129 ++--- 2 files changed, 69 insertions(+), 62 deletions(-) diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 2b642bf9..92553917 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -23,8 +23,8 @@ struct fib_rule_hdr { __u8tos; __u8table; + __u8proto; __u8res1; /* reserved */ - __u8res2; /* reserved */ __u8action; __u32 flags; diff --git a/ip/iprule.c b/ip/iprule.c index a3abf2f6..00a6c26a 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -73,25 +73,33 @@ static struct inet_prefix dst; } filter; +static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb) +{ + __u32 table = frh->table; + if (tb[RTA_TABLE]) + table = rta_getattr_u32(tb[RTA_TABLE]); + return table; +} + static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) { - struct rtmsg *r = NLMSG_DATA(n); + struct fib_rule_hdr *frh = NLMSG_DATA(n); __u32 table; - if (preferred_family != AF_UNSPEC && r->rtm_family != preferred_family) + if (preferred_family != AF_UNSPEC && frh->family != preferred_family) return false; if (filter.prefmask && filter.pref ^ (tb[FRA_PRIORITY] ? 
rta_getattr_u32(tb[FRA_PRIORITY]) : 0)) return false; - if (filter.not && !(r->rtm_flags & FIB_RULE_INVERT)) + if (filter.not && !(frh->flags & FIB_RULE_INVERT)) return false; if (filter.src.family) { inet_prefix *f_src = - if (f_src->family != r->rtm_family || - f_src->bitlen > r->rtm_src_len) + if (f_src->family != frh->family || + f_src->bitlen > frh->src_len) return false; if (inet_addr_match_rta(f_src, tb[FRA_SRC])) @@ -101,15 +109,15 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) if (filter.dst.family) { inet_prefix *f_dst = - if (f_dst->family != r->rtm_family || - f_dst->bitlen > r->rtm_dst_len) + if (f_dst->family != frh->family || + f_dst->bitlen > frh->dst_len) return false; if (inet_addr_match_rta(f_dst, tb[FRA_DST])) return false; } - if (filter.tosmask && filter.tos ^ r->rtm_tos) + if (filter.tosmask && filter.tos ^ frh->tos) return false; if (filter.fwmark) { @@ -159,7 +167,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) return false; } - table = rtm_get_table(r, tb); + table = frh_get_table(frh, tb); if (filter.tb > 0 && filter.tb ^ table) return false; @@ -169,7 +177,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) { FILE *fp = (FILE *)arg; - struct rtmsg *r = NLMSG_DATA(n); + struct fib_rule_hdr *frh = NLMSG_DATA(n); int len = n->nlmsg_len; int host_len = -1; __u32 table; @@ -180,13 +188,13 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE) return 0; - len -= NLMSG_LENGTH(sizeof(*r)); + len -= NLMSG_LENGTH(sizeof(*frh)); if (len < 0) return -1; - parse_rtattr(tb, FRA_MAX, RTM_RTA(r), len); + parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len); - host_len = af_bit_len(r->rtm_family); + host_len = af_bit_len(frh->family); if (!filter_nlmsg(n, tb, host_len)) return 0; @@ -200,41 
+208,41 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) else fprintf(fp, "0:\t"); - if (r->rtm_flags & FIB_RULE_INVERT) + if (frh->flags & FIB_RULE_INVERT) fprintf(fp, "not "); if (tb[FRA_SRC]) { - if (r->rtm_src_len != host_len) { + if (frh->src_len != host_len) { fprintf(fp,
[PATCH v2 iproute2-next 3/3] ip: Allow rules to accept a specified protocol
Allow the specification of a protocol when the user adds/modifies/deletes a rule.

Signed-off-by: Donald Sharp
---
 ip/iprule.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index 39008768..192fe215 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -683,6 +683,12 @@ static int iprule_modify(int cmd, int argc, char **argv)
 			if (get_rt_realms_or_raw(, *argv))
 				invarg("invalid realms\n", *argv);
 			addattr32(&req.n, sizeof(req), FRA_FLOW, realm);
+		} else if (matches(*argv, "protocol") == 0) {
+			__u32 proto;
+
+			NEXT_ARG();
+			if (rtnl_rtprot_a2n(&proto, *argv))
+				invarg("\"protocol\" value is invalid\n", *argv);
+			req.frh.proto = proto;
 		} else if (matches(*argv, "table") == 0 ||
 			   strcmp(*argv, "lookup") == 0) {
 			NEXT_ARG();
-- 
2.14.3
Re: ss issue on arm not showing UDP listening ports
Thank you for the suggestions. This is on a raspberry pi 3 not sure if that fact matters. I will notify Raspbian of the issue. On 02/21/2018 03:03 PM, Stefano Brivio wrote: > On Wed, 21 Feb 2018 12:37:31 -0500 > jesse_coo...@codeholics.com wrote: > >> ss utility, iproute2-ss161212 > > Works for me on iproute2-ss161212 and 4.9.0 kernel on armv7l. Unless > somebody on the list has other ideas, I guess you should either try > more recent versions, debug it (strace should show a pair of > recvmsg/sendmsg for each UDP socket) or file a ticket for your > distribution. >
Re: [PATCH net-next v4 0/2] Remove IPVlan module dependencies on IPv6 and L3 Master dev
From: Matteo Croce
Date: Wed, 21 Feb 2018 01:31:12 +0100

> The IPVlan module currently depends on IPv6 and L3 Master dev.
> Refactor the code to allow building IPVlan module regardless of the value
> of CONFIG_IPV6 as done in other drivers like VxLAN or GENEVE.
> Also change the CONFIG_NET_L3_MASTER_DEV dependency into a select,
> since compiling L3 Master device alone has little sense.
>
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> CONFIG_IPV6=y
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 48K drivers/net/ipvlan/ipvlan.ko
>
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> # CONFIG_IPV6 is not set
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 44K drivers/net/ipvlan/ipvlan.ko

Series applied, thanks Matteo.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/21/2018 6:02 PM, Jakub Kicinski wrote:

On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:

I don't see why the team cannot be there always.

It is more the logistical nightmare. Part of the goal here was to work with the cloud base images that are out there such as https://alt.fedoraproject.org/cloud/. With just the kernel changes the overhead for this stays fairly small and would be pulled in as just a standard part of the kernel update process. The virtio bypass only pops up if the backup bit is present. With the team solution it requires that the base image use the team driver on virtio_net when it sees one. I doubt the OSVs would want to do that just because SR-IOV isn't that popular of a case.

IIUC we need to monitor for a "backup hint", spawn the master, rename it to maintain backwards compatibility with no-VF setups and enslave the VF if it appears. All those sound possible from user space, the advantage of the kernel solution right now is that it has more complete code. Am I misunderstanding?

I think there is some misunderstanding about the exact requirement and the usecase we are trying to solve. If the guest is allowed to do this configuration, we already have a solution with either bond/team based user space configuration. This is to enable cloud service providers to provide an accelerated datapath by simply letting tenants use their own images, with the only requirement being to enable their kernels with a newer virtio_net driver with BACKUP support and the VF driver.

To recap from an earlier thread, here is a response from Stephen that talks about the requirement for the netvsc solution; we would like to provide a similar solution for KVM based cloud deployments.

> The requirement with Azure accelerated network was that a stock distribution image from the
> store must be able to run unmodified and get accelerated networking.
> Not sure if other environments need to work the same, but it would be nice.
> That meant no additional setup scripts (aka no bonding) and also it must > work transparently with hot-plug. Also there are diverse set of environments: > openstack, cloudinit, network manager and systemd. The solution had to not depend > on any one of them, but also not break any of them. Thanks Sridhar
Re: [PATCH] selftests/bpf: tcpbpf_kern: use in6_* macros from glibc
On 02/21/2018 05:51 PM, Anders Roxell wrote: > Both glibc and the kernel have in6_* macros definitions. Build fails > because it picks up wrong in6_* macro from the kernel header and not the > header from glibc. > > Fixes build error below: > clang -I. -I./include/uapi -I../../../include/uapi > -Wno-compare-distinct-pointer-types \ > -O2 -target bpf -emit-llvm -c test_tcpbpf_kern.c -o - | \ > llc -march=bpf -mcpu=generic -filetype=obj > -o .../tools/testing/selftests/bpf/test_tcpbpf_kern.o > In file included from test_tcpbpf_kern.c:12: > .../netinet/in.h:101:5: error: expected identifier > IPPROTO_HOPOPTS = 0, /* IPv6 Hop-by-Hop options. */ > ^ > .../linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS' > ^ > In file included from test_tcpbpf_kern.c:12: > /usr/include/netinet/in.h:103:5: error: expected identifier > IPPROTO_ROUTING = 43, /* IPv6 routing header. */ > ^ > .../linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING' > ^ > In file included from test_tcpbpf_kern.c:12: > .../netinet/in.h:105:5: error: expected identifier > IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header. */ > ^ > > Since both glibc and the kernel have in6_* macros definitions, use the > one from glibc. Kernel headers will check for previous libc definitions > by including include/linux/libc-compat.h. > > Reported-by: Daniel Díaz> Signed-off-by: Anders Roxell Applied to bpf tree, thanks Anders!
Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open
Hi Shuah, On 02/22/2018 12:03 AM, Shuah Khan wrote: > On 02/21/2018 03:48 PM, David Miller wrote: >> From: Anders Roxell>> Date: Wed, 21 Feb 2018 22:30:01 +0100 >> >>> bpf builds a test program for loading BPF ELF files. Add the executable >>> to the .gitignore list. >>> >>> Signed-off-by: Anders Roxell >> >> Acked-by: David S. Miller > > Thanks. I will pull this in for 4.16-rc I would have taken this into bpf tree, but fair enough. This one in particular doesn't cause any conflicts right now, so feel free to pick it up then. But usual paths are via bpf / bpf-next as we otherwise would run into real ugly merge conflicts. Thanks, Daniel
Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open
On 02/22/2018 01:37 AM, Shuah Khan wrote: > On 02/21/2018 05:33 PM, Daniel Borkmann wrote: >> Hi Shuah, >> >> On 02/22/2018 12:03 AM, Shuah Khan wrote: >>> On 02/21/2018 03:48 PM, David Miller wrote: From: Anders RoxellDate: Wed, 21 Feb 2018 22:30:01 +0100 > bpf builds a test program for loading BPF ELF files. Add the executable > to the .gitignore list. > > Signed-off-by: Anders Roxell Acked-by: David S. Miller >>> >>> Thanks. I will pull this in for 4.16-rc >> >> I would have taken this into bpf tree, but fair enough. This one >> in particular doesn't cause any conflicts right now, so feel free >> to pick it up then. But usual paths are via bpf / bpf-next as we >> otherwise would run into real ugly merge conflicts. > > Daniel, > > Go for it. I know bpf tree stuff causes conflicts. That is why I usually > stay away from bpfs selftests and let you handle them. > > You are taking the other one away, pick this up as well. Okay, great, thanks for your understanding, Shuah. Just applied to bpf tree. > Acked-by: Shuah Khan > > thanks, > -- Shuah >
[PATCH v2] selftests/bpf/test_maps: exit child process without error in ENOMEM case
From: Li Zhijian

test_maps contains a series of stress tests, and previously it would break the remaining tests when it failed to allocate memory.

---
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
Failed to create hashmap key=16 value=262144 'Cannot allocate memory'
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
test_maps: test_maps.c:955: run_parallel: Assertion `status == 0' failed.
Aborted
not ok 1..3 selftests: test_maps [FAIL]
---

After this patch, the remaining tests will continue when an ENOMEM failure occurs.

CC: Alexei Starovoitov
CC: Philip Li
Suggested-by: Daniel Borkmann
Signed-off-by: Li Zhijian
---
 tools/testing/selftests/bpf/test_maps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 436c4c7..9e03a4c 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -126,6 +126,8 @@ static void test_hashmap_sizes(int task, void *data)
 			fd = bpf_create_map(BPF_MAP_TYPE_HASH, i, j,
 					    2, map_flags);
 			if (fd < 0) {
+				if (errno == ENOMEM)
+					return;
 				printf("Failed to create hashmap key=%d value=%d '%s'\n",
 				       i, j, strerror(errno));
 				exit(1);
-- 
2.7.4
Re: [PATCH bpf] bpf, x64: implement retpoline for tail call
On Wed, 2018-02-21 at 19:43 -0800, Alexei Starovoitov wrote:
> On Wed, Feb 21, 2018 at 07:04:02PM -0800, Eric Dumazet wrote:
> > On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote:
> >
> > ...
> >
> > > +/* Instead of plain jmp %rax, we emit a retpoline to control
> > > + * speculative execution for the indirect branch.
> > > + */
> > > +static void emit_retpoline_rax_trampoline(u8 **pprog)
> > > +{
> > > +	u8 *prog = *pprog;
> > > +	int cnt = 0;
> > > +
> > > +	EMIT1_off32(0xE8, 7);	 /* callq */
> > > +	/* capture_spec: */
> > > +	EMIT2(0xF3, 0x90);	 /* pause */
> > > +	EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
> > > +	EMIT2(0xEB, 0xF9);	 /* jmp */
> > > +	/* set_up_target: */
> > > +	EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
> > > +	EMIT1(0xC3);		 /* retq */
> > > +
> > > +	BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
> > > +	*pprog = prog;
> >
> > You might define the actual code sequence (and length) in
> > arch/x86/include/asm/nospec-branch.h
> >
> > If we need to adjust code sequences for RETPOLINE, then we won't
> > forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded.
>
> like adding a comment to asm/nospec-branch.h that says
> "dont forget to adjust bpf_jit_comp.c" ?
> but clang/gcc generate slightly different sequences for
> retpoline anyway, so even if '.macro RETPOLINE_JMP' in
> nospec-branch.h changes it doesn't mean that x64 jit has to change.
> So what kinda comment there would make sense?

I was thinking of something very explicit:

/* byte sequence for following assembly code used by eBPF call
   ...
   ...
   retq
 */
#define RETPOLINE_RAX_DIRECT_FOR_EBPF \
	EMIT1_off32(0xE8, 7);	 /* callq */ \
	/* capture_spec: */ \
	EMIT2(0xF3, 0x90);	 /* pause */ \
	EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \
	EMIT2(0xEB, 0xF9);	 /* jmp */ \
	/* set_up_target: */ \
	EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \
	EMIT1(0xC3);		 /* retq */ \

It might simply be byte encoded (an array of 17 bytes).

Well, something like that anyway...
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On 2018年02月21日 00:52, John Fastabend wrote: On 02/20/2018 03:17 AM, Jesper Dangaard Brouer wrote: On Fri, 16 Feb 2018 09:19:02 -0800 John Fastabendwrote: On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote: On Fri, 16 Feb 2018 13:31:37 +0800 Jason Wang wrote: On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: The virtio_net code have three different RX code-paths in receive_buf(). Two of these code paths can handle XDP, but one of them is broken for at least XDP_REDIRECT. Function(1): receive_big() does not support XDP. Function(2): receive_small() support XDP fully and uses build_skb(). Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). The simple explanation is that receive_mergeable() is broken because it uses napi_alloc_skb(), which violates XDP given XDP assumes packet header+data in single page and enough tail room for skb_shared_info. The longer explaination is that receive_mergeable() tries to work-around and satisfy these XDP requiresments e.g. by having a function xdp_linearize_page() that allocates and memcpy RX buffers around (in case packet is scattered across multiple rx buffers). This does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because we have not implemented bpf_xdp_adjust_tail yet). The XDP_REDIRECT action combined with cpumap is broken, and cause hard to debug crashes. The main issue is that the RX packet does not have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing skb_shared_info to overlap the next packets head-room (in which cpumap stores info). Reproducing depend on the packet payload length and if RX-buffer size happened to have tail-room for skb_shared_info or not. But to make this even harder to troubleshoot, the RX-buffer size is runtime dynamically change based on an Exponentially Weighted Moving Average (EWMA) over the packet length, when refilling RX rings. This patch only disable XDP_REDIRECT support in receive_mergeable() case, because it can cause a real crash. 
But IMHO we should NOT support XDP in receive_mergeable() at all, because the principles behind XDP are to gain speed by (1) code simplicity, (2) sacrificing memory and (3) where possible moving runtime checks to setup time. These principles are clearly being violated in receive_mergeable(), which e.g. tracks the average buffer size at runtime to save memory consumption. I agree to disable it for -net now. Okay... I'll send an official patch later. For net-next, we probably can do: - drop xdp_linearize_page() and do XDP through the generic XDP helper after the skb was built I disagree strongly here - it makes no sense. Why do you want to explicitly fall back to generic XDP? (... then all the performance gain is gone!) And besides, a couple of function calls later, the generic XDP code will/can get invoked anyhow... Hi, Can we get EWMA to ensure for the majority of cases we have the extra head room? Seems we could just over-estimate the size by N bytes. In some cases we may under-estimate and then would need to fall back to generic XDP or otherwise grow the buffer, which of course would be painful and slow, but presumably would happen rarely. Hmmm... (first of all, it is missing tail room, not head room). Second, having all this extra size-estimating code, and fallback options, leaves a very bad taste in my mouth... this sounds like a sure way to kill performance. I think it would be much better to keep this feature vs kill it and make its configuration even more painful to get XDP working on virtio. Based on your request, I'm going to fix as much as possible of the XDP code path in the virtio_net driver... I now have 4 fix patches... Thanks a lot! There is no way around disabling XDP_REDIRECT in receive_mergeable(), as XDP does not have a way to detect/know the "data_hard_end" of the data "frame". Disabling EWMA also seems reasonable to me.
To me, it seems more reasonable to have a separate RX function call when an XDP program gets attached, and in that process change the memory model so it is compatible with XDP. I would be OK with that but would be curious to see what Jason and Michael think. When I originally wrote the XDP support for virtio, the XDP infra was still primitive and we didn't have metadata, cpu maps, etc. Yes, that's why cpumap fails. yet. I suspect there might need to be some additional coordination between guest and host though to switch the packet modes. If I recall, this was where some of the original trouble came from. Maybe, but I think just having a separate refill function should be sufficient? Then we could reuse the existing code to deal with e.g. synchronization. Take a step back: What is the reason/use-case for implementing XDP inside virtio_net? From a DDoS/performance perspective XDP in virtio_net happens on the "wrong side" as it is activated _inside_ the guest OS, which is too late for a DDoS filter, as the guest kick/switch
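The tail-room problem Jesper describes can be modeled outside the kernel. The sketch below is illustrative only: the alignment and the skb_shared_info size are stand-in values, not the kernel's actual SKB_DATA_ALIGN or struct layout.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the XDP buffer requirement discussed above:
 * header + data must sit in one contiguous buffer with cache-aligned
 * tail room reserved for skb_shared_info. Sizes are illustrative
 * stand-ins, not the kernel's actual values. */
#define CACHE_ALIGN 64UL
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

struct fake_shared_info { char bytes[320]; }; /* stand-in size */

static int xdp_buffer_ok(size_t buf_size, size_t payload)
{
	size_t tailroom = ALIGN_UP(sizeof(struct fake_shared_info), CACHE_ALIGN);

	/* A buffer sized only for the payload has nowhere to put
	 * skb_shared_info; it overlaps the next packet's head room,
	 * which is exactly the cpumap crash described above. */
	return payload + tailroom <= buf_size;
}
```

With an EWMA-sized buffer close to the payload length, the check fails intermittently, which matches the "hard to reproduce" symptom in the report.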
Re: [PATCH net v2 2/2] tuntap: correctly add the missing xdp flush
Hello!

On 2/22/2018 9:24 AM, Jason Wang wrote:

> Commit 762c330d670e ("tuntap: add missing xdp flush") tries to fix the
> devmap stall caused by missed xdp flush by counting the pending xdp
> redirected packets and flush when it exceeds NAPI_POLL_WEIGHT or
> MSG_MORE is clear. This may lead BUG() since xdp_do_flush() was

   Lead to BUG().

> called under process context with preemption enabled. Simply disable

   s/under/in the/?

> preemption may silent the warning but be not enough since process may

   Silence.

> move between different CPUS during a batch which cause xdp_do_flush()
> misses some CPU where the process run previously. Consider the several
> fallouts, that commit was reverted. To fix the issue correctly, we can
> simply calling xdp_do_flush() immediately after xdp_do_redirect(),

   Call.

> a side effect is that this removes any possibility of batching which
> could be addressed in the future.
>
> Reported-by: Christoffer Dall
> Fixes: 762c330d670e ("tuntap: add missing xdp flush")
> Signed-off-by: Jason Wang
[...]

MBR, Sergei
Re: [PATCH net v2 1/2] Revert "tuntap: add missing xdp flush"
On 2/22/2018 9:24 AM, Jason Wang wrote:

> This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The
> reason is we try to batch packets for devmap which causes calling
> xdp_do_flush() under the process context. Simply disable premmption

   s/under/in/. Disabling preemption.

> may not work since process may move among processors which lead
> xdp_do_flush() to miss some flushes on some processors. So simply
> revert the patch, a follow-up path will add the xdp flush correctly.
>
> Reported-by: Christoffer Dall
> Fixes: 762c330d670e ("tuntap: add missing xdp flush")
> Signed-off-by: Jason Wang
[...]

MBR, Sergei
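The per-CPU nature of the bug in this changelog can be shown with a toy model; the two-CPU setup and all names below are illustrative, not the kernel's flush-map machinery.

```c
#include <assert.h>

/* Toy model of the bug described in the changelog above: XDP redirects
 * are queued in per-CPU flush state, so a process that migrates between
 * redirect and flush flushes the wrong CPU's queue and strands packets. */
#define NCPUS 2

static int pending[NCPUS];
static int delivered;

static void toy_redirect(int cpu) { pending[cpu]++; }

static void toy_flush(int cpu)
{
	delivered += pending[cpu];
	pending[cpu] = 0;
}

/* The v2 fix: flush immediately on the same CPU as the redirect,
 * giving up batching but never missing a queue. */
static void toy_redirect_and_flush(int cpu)
{
	toy_redirect(cpu);
	toy_flush(cpu);
}
```

Queue on CPU 0, migrate, flush on CPU 1: nothing is delivered and CPU 0's packet stalls, which is the devmap stall the original patch tried to fix and the reason batching had to go.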
[PATCH iproute2-next v1] rdma: Add batch command support
From: Leon Romanovsky

Implement an option (-b) to execute RDMAtool commands from a supplied file. This follows the same model as in use for the ip and devlink tools, by expecting every new command to be on a new line. These commands are expected to be without any -* (e.g. -d, -j, etc.) global flags, which should be called externally. Signed-off-by: Leon Romanovsky --- Changelog v0->v1: * Used ARRAY_SIZE instead of hardcoded value as an input to makeargs() David, This patch is based on iproute2.git because iproute2-next doesn't have the latest restrack code. The patch itself is completely independent from that code and is supposed to go to -next, but it has conflicts (manual page and help line). Can you please merge iproute2 master into iproute2-next prior to applying this patch? Thanks --- man/man8/rdma.8 | 16 + rdma/rdma.c | 70 ++--- 2 files changed, 78 insertions(+), 8 deletions(-) diff --git a/man/man8/rdma.8 b/man/man8/rdma.8 index 798b33d3..fba77693 100644 --- a/man/man8/rdma.8 +++ b/man/man8/rdma.8 @@ -11,6 +11,12 @@ rdma \- RDMA tool .BR help " }" .sp +.ti -8 +.B rdma +.RB "[ " -force " ] " +.BI "-batch " filename +.sp + .ti -8 .IR OBJECT " := { " .BR dev " | " link " }" @@ -31,6 +37,16 @@ Print the version of the .B rdma tool and exit. +.TP +.BR "\-b", " \-batch " +Read commands from provided file or standard input and invoke them. +First failure will cause termination of rdma. + +.TP +.BR "\-force" +Don't terminate rdma on errors in batch mode. +If there were any errors during execution of the commands, the application return code will be non zero. + .TP .BR "\-d" , " --details" Otuput detailed information. 
diff --git a/rdma/rdma.c b/rdma/rdma.c index 19608f41..ab2c9608 100644 --- a/rdma/rdma.c +++ b/rdma/rdma.c @@ -15,8 +15,9 @@ static void help(char *name) { pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n" + " %s [ -f[orce] ] -b[atch] filename\n" "where OBJECT := { dev | link | resource | help }\n" - " OPTIONS := { -V[ersion] | -d[etails] | -j[son] | -p[retty]}\n", name); + " OPTIONS := { -V[ersion] | -d[etails] | -j[son] | -p[retty]}\n", name, name); } static int cmd_help(struct rd *rd) @@ -25,7 +26,7 @@ static int cmd_help(struct rd *rd) return 0; } -static int rd_cmd(struct rd *rd) +static int rd_cmd(struct rd *rd, int argc, char **argv) { const struct rd_cmd cmds[] = { { NULL, cmd_help }, @@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd) { 0 } }; + rd->argc = argc; + rd->argv = argv; + return rd_exec_cmd(rd, cmds, "object"); } -static int rd_init(struct rd *rd, int argc, char **argv, char *filename) +static int rd_batch(struct rd *rd, const char *name, bool force) +{ + char *line = NULL; + size_t len = 0; + int ret = 0; + + if (name && strcmp(name, "-") != 0) { + if (!freopen(name, "r", stdin)) { + pr_err("Cannot open file \"%s\" for reading: %s\n", + name, strerror(errno)); + return errno; + } + } + + cmdlineno = 0; + while (getcmdline(&line, &len, stdin) != -1) { + char *largv[512]; + int largc; + + largc = makeargs(line, largv, ARRAY_SIZE(largv)); + if (!largc) + continue; /* blank line */ + + ret = rd_cmd(rd, largc, largv); + if (ret) { + pr_err("Command failed %s:%d\n", name, cmdlineno); + if (!force) + break; + } + } + + free(line); + + return ret; +} + +static int rd_init(struct rd *rd, char *filename) { uint32_t seq; int ret; rd->filename = filename; - rd->argc = argc; - rd->argv = argv; INIT_LIST_HEAD(&rd->dev_map_list); INIT_LIST_HEAD(&rd->filter_list); @@ -87,11 +125,15 @@ int main(int argc, char **argv) { "json", no_argument,NULL, 'j' }, { "pretty", no_argument,NULL, 'p' }, { "details",no_argument,NULL, 'd' }, + { "force", no_argument,NULL, 'f' }, + { 
"batch", required_argument, NULL, 'b' }, { NULL, 0, NULL, 0 } }; + const char *batch_file = NULL; bool pretty_output = false; bool show_details = false; bool json_output = false; + bool force = false; char *filename; struct rd rd; int opt; @@ -99,7 +141,7 @@ int main(int argc, char **argv) filename = basename(argv[0]); - while ((opt =
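The tokenizing step the batch loop relies on can be sketched stand-alone. This is a simplified stand-in for iproute2's makeargs() (the real one lives in lib/utils.c and handles quoting); only whitespace splitting is shown here.

```c
#include <assert.h>
#include <string.h>

/* Minimal sketch of the batch-mode tokenizer used above: split one
 * command line into argv-style tokens. A return of 0 signals a blank
 * line, which the rd_batch() loop skips with `continue`. */
static int makeargs_sketch(char *line, char **argv, int maxargs)
{
	int argc = 0;
	char *tok;

	for (tok = strtok(line, " \t\n");
	     tok && argc < maxargs; tok = strtok(NULL, " \t\n"))
		argv[argc++] = tok;

	return argc;
}
```

Sizing the argv array with ARRAY_SIZE, as the v1 changelog notes, keeps the bound and the array declaration from drifting apart.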
[PATCH bpf] bpf, x64: implement retpoline for tail call
Implement a retpoline [0] for the BPF tail call JIT'ing that converts the indirect jump via jmp %rax that is used to make the long jump into another JITed BPF image. Since this is subject to speculative execution, we need to control the transient instruction sequence here as well when CONFIG_RETPOLINE is set, and direct it into a pause + lfence loop. The latter aligns also with what gcc / clang emits (e.g. [1]). JIT dump after patch: # bpftool p d x i 1 0: (18) r2 = map[id:1] 2: (b7) r3 = 0 3: (85) call bpf_tail_call#12 4: (b7) r0 = 2 5: (95) exit With CONFIG_RETPOLINE: # bpftool p d j i 1 [...] 33: cmp %edx,0x24(%rsi) 36: jbe 0x0072 |* 38: mov 0x24(%rbp),%eax 3e: cmp $0x20,%eax 41: ja 0x0072 | 43: add $0x1,%eax 46: mov %eax,0x24(%rbp) 4c: mov 0x90(%rsi,%rdx,8),%rax 54: test %rax,%rax 57: je 0x0072 | 59: mov 0x28(%rax),%rax 5d: add $0x25,%rax 61: callq 0x006d |+ 66: pause | 68: lfence | 6b: jmp 0x0066 | 6d: mov %rax,(%rsp) | 71: retq | 72: mov $0x2,%eax [...] * relative fall-through jumps in error case + retpoline for indirect jump Without CONFIG_RETPOLINE: # bpftool p d j i 1 [...] 33: cmp %edx,0x24(%rsi) 36: jbe 0x0063 |* 38: mov 0x24(%rbp),%eax 3e: cmp $0x20,%eax 41: ja 0x0063 | 43: add $0x1,%eax 46: mov %eax,0x24(%rbp) 4c: mov 0x90(%rsi,%rdx,8),%rax 54: test %rax,%rax 57: je 0x0063 | 59: mov 0x28(%rax),%rax 5d: add $0x25,%rax 61: jmpq *%rax |- 63: mov $0x2,%eax [...] 
* relative fall-through jumps in error case - plain indirect jump as before [0] https://support.google.com/faqs/answer/7625886 [1] https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b Signed-off-by: Daniel Borkmann --- arch/x86/net/bpf_jit_comp.c | 52 - 1 file changed, 42 insertions(+), 10 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 4923d92..7e8d562 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -261,6 +261,35 @@ static void emit_prologue(u8 **pprog, u32 stack_depth) *pprog = prog; } +#ifdef CONFIG_RETPOLINE +# define RETPOLINE_SIZE 17 +# define OFFSET_ADJ RETPOLINE_SIZE + +/* Instead of plain jmp %rax, we emit a retpoline to control + * speculative execution for the indirect branch. + */ +static void emit_retpoline_rax_trampoline(u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + + EMIT1_off32(0xE8, 7); /* callq */ + /* capture_spec: */ + EMIT2(0xF3, 0x90); /* pause */ + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ + EMIT2(0xEB, 0xF9); /* jmp */ + /* set_up_target: */ + EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ + EMIT1(0xC3); /* retq */ + + BUILD_BUG_ON(cnt != RETPOLINE_SIZE); + *pprog = prog; +} +#else +/* Plain jmp %rax version used. */ +# define OFFSET_ADJ 2 +#endif + /* generate the following code: * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ... 
* if (index >= array->map.max_entries) @@ -290,7 +319,7 @@ static void emit_bpf_tail_call(u8 **pprog) EMIT2(0x89, 0xD2);/* mov edx, edx */ EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */ offsetof(struct bpf_array, map.max_entries)); -#define OFFSET1 43 /* number of bytes to jump */ +#define OFFSET1 (41 + OFFSET_ADJ) /* number of bytes to jump */ EMIT2(X86_JBE, OFFSET1); /* jbe out */ label1 = cnt; @@ -299,7 +328,7 @@ static void emit_bpf_tail_call(u8 **pprog) */ EMIT2_off32(0x8B, 0x85, 36); /* mov eax, dword ptr [rbp + 36] */ EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ -#define OFFSET2 32 +#define OFFSET2 (30 + OFFSET_ADJ) EMIT2(X86_JA, OFFSET2); /* ja out */ label2 = cnt; EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ @@ -313,7 +342,7 @@ static void emit_bpf_tail_call(u8 **pprog) * goto out; */ EMIT3(0x48, 0x85, 0xC0); /* test rax,rax */ -#define OFFSET3 10 +#define OFFSET3 (8 + OFFSET_ADJ) EMIT2(X86_JE, OFFSET3); /* je out */ label3 = cnt; @@ -326,16 +355,19 @@ static void emit_bpf_tail_call(u8 **pprog) * rdi == ctx (1st arg) * rax == prog->bpf_func + prologue_size */ - EMIT2(0xFF, 0xE0);/* jmp rax */ - - /* out: */ -
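The byte sequence the patch emits can be written out as a plain buffer to sanity-check the 17-byte RETPOLINE_SIZE claim. This is a userspace sketch mirroring the EMIT macros above, not the JIT itself; the little-endian rel32 encoding is an x86 assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the retpoline trampoline bytes from the patch above,
 * written as a plain buffer so the RETPOLINE_SIZE arithmetic can be
 * checked outside the kernel JIT. */
static int emit_retpoline_rax(uint8_t *buf)
{
	int n = 0;
	uint32_t rel = 7; /* skip over pause + lfence + jmp */

	buf[n++] = 0xE8;                  /* callq set_up_target */
	memcpy(buf + n, &rel, 4);         /* rel32, little-endian */
	n += 4;
	/* capture_spec: trap speculative execution in a benign loop */
	buf[n++] = 0xF3; buf[n++] = 0x90; /* pause */
	buf[n++] = 0x0F; buf[n++] = 0xAE; buf[n++] = 0xE8; /* lfence */
	buf[n++] = 0xEB; buf[n++] = 0xF9; /* jmp capture_spec (rel8 -7) */
	/* set_up_target: overwrite the return address with %rax */
	buf[n++] = 0x48; buf[n++] = 0x89;
	buf[n++] = 0x04; buf[n++] = 0x24; /* mov %rax,(%rsp) */
	buf[n++] = 0xC3;                  /* retq */
	return n;
}
```

The architectural path is callq straight to set_up_target, then retq to the tail-call target; only speculation ever reaches the pause/lfence loop, which is the whole point of the construct.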
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
On Wed, Feb 21, 2018 at 7:26 PM, Sowmini Varadhan wrote: > On (02/21/18 18:45), Willem de Bruijn wrote: >> >> I do mean returning 0 instead of -EAGAIN if control data is ready. >> Something like >> >> @@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr >> *msg, size_t size, >> >> if (!rds_next_incoming(rs, &inc)) { >> if (nonblock) { >> - ret = -EAGAIN; >> + ncookies = rds_recvmsg_zcookie(rs, msg); >> + ret = ncookies ? 0 : -EAGAIN; >> break; >> } > > Yes, but you now have an implicit branch based on ncookies, so I'm > not sure it saved all that much? At least it removes the extra list empty check in the hot path and relegates this to the relatively rare branch when the queue is empty and the syscall is non-blocking. > like I said let me revisit this Okay. I won't harp on this further. >> By the way, the put_cmsg is unconditional even if the caller did >> not supply msg_control. So it is basically no longer safe to ever >> call read, recv or recvfrom on a socket if zerocopy notifications >> are outstanding. > > Wait, I thought put_cmsg already checks for these things. It does, and sets MSG_CTRUNC to signal that it was unable to write all control data. But by then the notifications have already been dequeued. >> It is possible to check msg_controllen before even deciding whether >> to try to dequeue notifications (and take the lock). I see that this is >> not common. But RDS of all cases seems to do this, in >> rds_notify_queue_get: > > yes the comment above that code suggests that this bit of code > was done to avoid calling put_cmsg while holding the rs_lock. > > One bit of administrivia though- if I now drop sk_error_queue for > PF_RDS, I'll have to fix selftests in the same patch too, so the > patch will get a bit bulky (and thus a bit more difficult to review). Understood. It might be cleanest to split into three patches. One revert of the error queue code, one new feature and one update to the test.
[PATCH v2 net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins
MV88E6352 and later switches support GPIO control through the "Scratch & Misc" global2 register. Two of the pins controlled this way on the mv88e6390 family are the external MDIO pins. They can either be used as part of the MII interface for port 0, GPIOs, or MDIO. Add a function to configure them for MDIO, if possible, and call it when registering the external MDIO bus. Suggested-by: Russell King Signed-off-by: Andrew Lunn --- v2: Make stub function static inline, as reported by 0-day. drivers/net/dsa/mv88e6xxx/chip.c| 9 + drivers/net/dsa/mv88e6xxx/global2.h | 14 drivers/net/dsa/mv88e6xxx/global2_scratch.c | 51 + 3 files changed, 74 insertions(+) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 39c7ad7e490f..e1b5c5c66fce 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -2165,6 +2165,15 @@ static int mv88e6xxx_mdio_register(struct mv88e6xxx_chip *chip, struct mii_bus *bus; int err; + if (external) { + mutex_lock(&chip->reg_lock); + err = mv88e6xxx_g2_scratch_gpio_set_smi(chip, true); + mutex_unlock(&chip->reg_lock); + + if (err) + return err; + } + bus = devm_mdiobus_alloc_size(chip->dev, sizeof(*mdio_bus)); if (!bus) return -ENOMEM; diff --git a/drivers/net/dsa/mv88e6xxx/global2.h b/drivers/net/dsa/mv88e6xxx/global2.h index 25f92b3d7157..d85c91036c0f 100644 --- a/drivers/net/dsa/mv88e6xxx/global2.h +++ b/drivers/net/dsa/mv88e6xxx/global2.h @@ -266,6 +266,11 @@ #define MV88E6352_G2_SCRATCH_GPIO_PCTL5 0x6D #define MV88E6352_G2_SCRATCH_GPIO_PCTL6 0x6E #define MV88E6352_G2_SCRATCH_GPIO_PCTL7 0x6F +#define MV88E6352_G2_SCRATCH_CONFIG_DATA0 0x70 +#define MV88E6352_G2_SCRATCH_CONFIG_DATA1 0x71 +#define MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU BIT(2) +#define MV88E6352_G2_SCRATCH_CONFIG_DATA2 0x72 +#define MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK 0x3 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_GPIO 0 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_TRIG 1 @@ -325,6 +330,9 @@ extern const struct mv88e6xxx_avb_ops 
mv88e6390_avb_ops; extern const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops; +int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip, + bool external); + #else /* !CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */ static inline int mv88e6xxx_g2_require(struct mv88e6xxx_chip *chip) @@ -465,6 +473,12 @@ static const struct mv88e6xxx_avb_ops mv88e6390_avb_ops = {}; static const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = {}; +static inline int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip, + bool external) +{ + return -EOPNOTSUPP; +} + #endif /* CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */ #endif /* _MV88E6XXX_GLOBAL2_H */ diff --git a/drivers/net/dsa/mv88e6xxx/global2_scratch.c b/drivers/net/dsa/mv88e6xxx/global2_scratch.c index 0ff12bff9f0e..3f92b8892dc7 100644 --- a/drivers/net/dsa/mv88e6xxx/global2_scratch.c +++ b/drivers/net/dsa/mv88e6xxx/global2_scratch.c @@ -238,3 +238,54 @@ const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = { .get_pctl = mv88e6352_g2_scratch_gpio_get_pctl, .set_pctl = mv88e6352_g2_scratch_gpio_set_pctl, }; + +/** + * mv88e6xxx_g2_gpio_set_smi - set gpio muxing for external smi + * @chip: chip private data + * @external: set mux for external smi, or free for gpio usage + * + * Some mv88e6xxx models have GPIO pins that may be configured as + * an external SMI interface, or they may be made free for other + * GPIO uses. 
+ */ +int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip, + bool external) +{ + int misc_cfg = MV88E6352_G2_SCRATCH_MISC_CFG; + int config_data1 = MV88E6352_G2_SCRATCH_CONFIG_DATA1; + int config_data2 = MV88E6352_G2_SCRATCH_CONFIG_DATA2; + bool no_cpu; + u8 p0_mode; + int err; + u8 val; + + err = mv88e6xxx_g2_scratch_read(chip, config_data2, &val); + if (err) + return err; + + p0_mode = val & MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK; + + if (p0_mode == 0x01 || p0_mode == 0x02) + return -EBUSY; + + err = mv88e6xxx_g2_scratch_read(chip, config_data1, &val); + if (err) + return err; + + no_cpu = !!(val & MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU); + + err = mv88e6xxx_g2_scratch_read(chip, misc_cfg, &val); + if (err) + return err; + + /* NO_CPU being 0 inverts the meaning of the bit */ + if (!no_cpu) + external = !external; + + if (external) + val |= MV88E6352_G2_SCRATCH_MISC_CFG_NORMALSMI; + else +
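The mux decision in mv88e6xxx_g2_scratch_gpio_set_smi() can be isolated into a small pure function for illustration. This is a sketch, not driver code: the return values and the two-bit P0 mode encodings mirror the patch, but register I/O is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the decision logic from the patch above: P0 modes
 * 0x01/0x02 mean the pins are busy as a MII interface, and a clear
 * NO_CPU bit inverts the meaning of the "external" request. */
static int smi_mux_decision(uint8_t p0_mode, int no_cpu, int external,
			    int *use_external)
{
	if (p0_mode == 0x01 || p0_mode == 0x02)
		return -1; /* -EBUSY in the driver */

	/* NO_CPU being 0 inverts the meaning of the bit */
	*use_external = no_cpu ? external : !external;
	return 0;
}
```

Keeping the inversion in one spot makes the hardware quirk (the bit's meaning flips with NO_CPU) harder to get wrong when reading the register-poking code around it.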
[PATCH iproute2] ip: Properly display AF_BRIDGE address information for neighbor events
The vxlan driver when a neighbor add/delete event occurs sends NDA_DST filled with a union: union vxlan_addr { struct sockaddr_in sin; struct sockaddr_in6 sin6; struct sockaddr sa; }; This eventually calls rt_addr_n2a_r which had no handler for the AF_BRIDGE family and "???" was being printed. Add code to properly display this data when requested. Signed-off-by: Donald Sharp --- lib/utils.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/lib/utils.c b/lib/utils.c index 24aeddd8..e01e18a7 100644 --- a/lib/utils.c +++ b/lib/utils.c @@ -1004,6 +1004,24 @@ const char *rt_addr_n2a_r(int af, int len, } case AF_PACKET: return ll_addr_n2a(addr, len, ARPHRD_VOID, buf, buflen); + case AF_BRIDGE: + { + unsigned short family = ((struct sockaddr *)addr)->sa_family; + struct sockaddr_in6 *sin6; + struct sockaddr_in *sin; + + switch(family) { + case AF_INET: + sin = (struct sockaddr_in *)addr; + return inet_ntop(AF_INET, &sin->sin_addr, buf, buflen); + case AF_INET6: + sin6 = (struct sockaddr_in6 *)addr; + return inet_ntop(AF_INET6, &sin6->sin6_addr, +buf, buflen); + } + + /* fallthrough */ + } default: return "???"; } -- 2.14.3
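The dispatch the patch adds can be exercised stand-alone in userspace, since it only uses inet_ntop() on the sockaddr union; this sketch mirrors the new AF_BRIDGE branch but is not the iproute2 code itself.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Userspace sketch of the AF_BRIDGE handling added above: NDA_DST from
 * the vxlan driver carries a union of sockaddrs, so dispatch on the
 * embedded sa_family and print the IPv4/IPv6 address it holds. */
static const char *bridge_addr_n2a(const void *addr, char *buf,
				   socklen_t buflen)
{
	const struct sockaddr *sa = addr;

	switch (sa->sa_family) {
	case AF_INET:
		return inet_ntop(AF_INET,
				 &((const struct sockaddr_in *)addr)->sin_addr,
				 buf, buflen);
	case AF_INET6:
		return inet_ntop(AF_INET6,
				 &((const struct sockaddr_in6 *)addr)->sin6_addr,
				 buf, buflen);
	default:
		return "???"; /* what rt_addr_n2a_r printed before the fix */
	}
}
```

The key point is that the outer address family is AF_BRIDGE, so the real family has to be recovered from the payload itself.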
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/21/2018 6:35 PM, Samudrala, Sridhar wrote: On 2/21/2018 5:59 PM, Siwei Liu wrote: On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck wrote: On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: I haven't checked emails for days and did not realize the new revision had already come out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use an alternate datapath when available and registered. When the BACKUP feature is enabled, the virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as the 'backup' netdev and a passthru/vf device with the same MAC gets registered as the 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via the 'bypass' netdev. The 'bypass' netdev chooses the 'active' netdev as default for transmits when it is available with link up and running. Thank you for doing this. We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? The basic idea of all this is that we wanted this to work with an existing VM image that was using virtio. 
As such we were trying to make it so that the bypass interface takes the place of the original virtio and get udev to rename the bypass to what the original virtio_net was. Could it also be made possible to take over the config from the VF instead of virtio on an existing VM image? And get udev to rename the bypass netdev to what the original VF was. I don't say tightly binding the bypass master to only virtio or VF, but I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need a special multiseat-like QEMU device for that purpose? In our case we'll upgrade the config from VF to virtio-bypass directly. So if I am understanding what you are saying, you are wanting to flip the backup interface from the virtio to a VF. The problem is that becomes a bit of a vendor lock-in solution since it would rely on a specific VF driver. I would agree with Jiri that we don't want to go down that path. We don't want every VF out there firing up its own separate bond. Ideally you want the hypervisor to be able to manage all of this, which is why it makes sense to have virtio manage this and why this is associated with the virtio_net interface. No, that's not what I was talking about of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure. However the upgrade path I was seeking is to use the bypass interface to take the place of the original VF interface while retaining the name and network configs, which generally can be done simply with a kernel upgrade. 
It would become limiting as this patch makes the bypass interface share the same virtio pci device with the virtio backup. Can this bypass interface be made general enough to take the place of any pci device other than virtio-net? This would be more helpful, as cloud users who have an existing setup on a VF interface don't have to recreate it on virtio-net and the VF separately again. Yes. This sounds interesting. Looks like you want an existing VM image with a VF-only configuration to get transparent live migration support by adding virtio_net with the BACKUP feature. We may need another feature bit to switch between these 2 options. After thinking some more, this may be more involved than adding a new feature bit. This requires a netdev created by virtio to take over the name of a VF netdev associated with a PCI device that may not be plugged in when the virtio driver is coming up. This definitely requires some new messages
Re: [PATCH bpf v2] bpf: fix memory leak in lpm_trie map_free callback function
On 2/21/18 7:40 PM, Eric Dumazet wrote: On Tue, 2018-02-13 at 19:17 -0800, Alexei Starovoitov wrote: On Tue, Feb 13, 2018 at 07:00:21PM -0800, Yonghong Song wrote: There is a memory leak happening in lpm_trie map_free callback function trie_free. The trie structure itself does not get freed. Also, trie_free function did not do synchronize_rcu before freeing various data structures. This is incorrect as some rcu_read_lock region(s) for lookup, update, delete or get_next_key may not complete yet. The fix is to add synchronize_rcu in the beginning of trie_free. The useless spin_lock is removed from this function as well. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reported-by: Mathieu Malaterre Reported-by: Alexei Starovoitov Tested-by: Mathieu Malaterre Signed-off-by: Yonghong Song --- kernel/bpf/lpm_trie.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) v1->v2: Make comments more precise and make label name more appropriate, as suggested by Daniel Applied to bpf tree, Thanks Yonghong. This does not look good. LOCKDEP surely should complain to node = rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock)); Since we no longer hold trie->lock Eric, Thanks for spotting this issue. Will fix this issue soon. Yonghong
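The ordering the fix establishes in trie_free() can be shown with a toy teardown. This is a userspace sketch under loud assumptions: wait_for_readers() merely stands in for synchronize_rcu(), and the node list stands in for the trie.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace sketch of the map_free ordering from the fix above: wait
 * out all readers first, then free every node, and finally free the
 * trie struct itself (the leak the patch plugs). */
struct toy_node { struct toy_node *next; };
struct toy_trie { struct toy_node *root; };

static int readers_waited;
static void wait_for_readers(void) { readers_waited = 1; } /* stand-in */

static int toy_trie_free(struct toy_trie *trie)
{
	struct toy_node *node, *next;
	int freed = 0;

	/* Lookups may still be in flight under rcu_read_lock(); nothing
	 * may be freed before they drain, hence synchronize_rcu() first. */
	wait_for_readers();

	for (node = trie->root; node; node = next) {
		next = node->next;
		free(node);
		freed++;
	}

	free(trie); /* without this line, the struct itself leaks */
	return freed;
}
```

The spin_lock the patch removes bought nothing here: once no new readers or writers can reach the map and the grace period has elapsed, teardown needs no lock at all, which is also why Eric's lockdep observation about lockdep_is_held(&trie->lock) follows directly.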
Re: [PATCH bpf] bpf, x64: implement retpoline for tail call
On Wed, Feb 21, 2018 at 07:53:22PM -0800, Eric Dumazet wrote: > > So what kinda comment there would make sense? > > I was thinking of something very explicit : > > /* byte sequence for following assembly code used by eBPF >call ... >... >retq > */ > #define RETPOLINE_RAX_DIRECT_FOR_EBPF \ > EMIT1_off32(0xE8, 7);/* callq */ \ > /* capture_spec: */\ > EMIT2(0xF3, 0x90); /* pause */ \ > EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \ > EMIT2(0xEB, 0xF9); /* jmp */ \ > /* set_up_target: */ \ > EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \ > EMIT1(0xC3); /* retq */\ got it. yeah. makes sense to me.
[PATCH net v2 1/2] Revert "tuntap: add missing xdp flush"
This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The reason is we try to batch packets for devmap which causes calling xdp_do_flush() under the process context. Simply disable premmption may not work since process may move among processors which lead xdp_do_flush() to miss some flushes on some processors. So simply revert the patch, a follow-up path will add the xdp flush correctly. Reported-by: Christoffer Dall Fixes: 762c330d670e ("tuntap: add missing xdp flush") Signed-off-by: Jason Wang --- drivers/net/tun.c | 15 --- 1 file changed, 15 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index b52258c..2823a4a 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -181,7 +181,6 @@ struct tun_file { struct tun_struct *detached; struct ptr_ring tx_ring; struct xdp_rxq_info xdp_rxq; - int xdp_pending_pkts; }; struct tun_flow_entry { @@ -1662,7 +1661,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun, case XDP_REDIRECT: get_page(alloc_frag->page); alloc_frag->offset += buflen; - ++tfile->xdp_pending_pkts; err = xdp_do_redirect(tun->dev, &xdp, xdp_prog); if (err) goto err_redirect; @@ -1984,11 +1982,6 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from) result = tun_get_user(tun, tfile, NULL, from, file->f_flags & O_NONBLOCK, false); - if (tfile->xdp_pending_pkts) { - tfile->xdp_pending_pkts = 0; - xdp_do_flush_map(); - } - tun_put(tun); return result; } @@ -2325,13 +2318,6 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter, m->msg_flags & MSG_DONTWAIT, m->msg_flags & MSG_MORE); - - if (tfile->xdp_pending_pkts >= NAPI_POLL_WEIGHT || - !(m->msg_flags & MSG_MORE)) { - tfile->xdp_pending_pkts = 0; - xdp_do_flush_map(); - } - tun_put(tun); return ret; } @@ -3163,7 +3149,6 @@ static int tun_chr_open(struct inode *inode, struct file * file) sock_set_flag(&tfile->sk, SOCK_ZEROCOPY); memset(&tfile->tx_ring, 0, sizeof(tfile->tx_ring)); - 
tfile->xdp_pending_pkts = 0; return 0; } -- 2.7.4
[PATCH] net/smc9194: Remove bogus CONFIG_MAC reference
AFAIK the only version of smc9194.c with Mac support is the one in the linux-mac68k CVS repo, which never made it to the mainline. Despite that, from v2.3.45, arch/m68k/config.in listed CONFIG_SMC9194 under CONFIG_MAC. This mistake got carried over into Kconfig in v2.5.55. (See pre-git era "[PATCH] add m68k dependencies to net driver config".) Signed-off-by: Finn Thain--- drivers/net/ethernet/smsc/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/smsc/Kconfig b/drivers/net/ethernet/smsc/Kconfig index 63aca9f847e1..4c2f612e4414 100644 --- a/drivers/net/ethernet/smsc/Kconfig +++ b/drivers/net/ethernet/smsc/Kconfig @@ -20,7 +20,7 @@ if NET_VENDOR_SMSC config SMC9194 tristate "SMC 9194 support" - depends on (ISA || MAC && BROKEN) + depends on ISA select CRC32 ---help--- This is support for the SMC9xxx based Ethernet cards. Choose this -- 2.16.1
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
On Wed, Feb 21, 2018 at 5:14 PM, Sowmini Varadhan wrote: > On (02/21/18 16:54), Willem de Bruijn wrote: >> >> I'd put this optimization behind a socket option. > > Yes, that thought occurred to me as well- I think RDS applications > are unlikely to use the error_queue path because, as I mentioned > before, these are heavily request-response based, so you're > going to be getting something back for each sendmsg anyway. > > So if I had a sockopt, it would be something that would be > "piggyback completion" most (all?) of the time anyway, but I suppose > making it explicit is useful to have. > >> Either that, or always store cookies on an RDS specific queue, return as recv >> cmsg and then remove the alternative error queue path completely. > > I think the error queue path is good to have, if only to be aligned > with other socket families. > >> > + spin_lock_irqsave(&q->lock, flags); >> >> This seems expensive on every recvmsg. Even if zerocopy is not enabled. > > noted, I'll predicate it on both zcopy and the sockopt suggestion. > >> > + if (skb_queue_empty(q)) { >> > + spin_unlock_irqrestore(&q->lock, flags); >> > + return 0; >> > + } >> > + skb_queue_walk_safe(q, skb, tmp) { >> >> This too. If anything, just peek at the queue head and skip otherwise. >> Anything else will cause an error, which blocks regular read and write >> on the socket. > > but that would be technically incorrect, because you could have a > mix/match of the zcopy notification with other sk_error_queue messages > (given that rds_rm_zerocopy_callback looks at the tail, and > creates a new skb if the tail is not a SO_EE_ORIGIN_ZCOOKIE). > > Of course, its true that *today* the only thing in the rds socket > error queue is the cookie list, so this queue walk is a bit of overkill.. > > but maybe I am missing something you are concerned about? 
> The queue walk is intended to pull out the first SO_EE_ORIGIN_ZCOOKIE (if it exists) > and return quietly otherwise Yes, avoiding out-of-order completions is sensible. In the common case no more than one notification will be outstanding, but with a fixed number of notifications per packet, in edge cases this list may be long. Reading using a reverse walk avoids that problem. > (leaving err_queue unchanged)- where > do you see the blocking error? Socket functions block if sk_err is non-zero. See for instance tcp_sendmsg_locked. It is set by most code that also calls sock_queue_err_skb and also on dequeue from the err skb queue. This is the main reason that I would consider dropping the error queue completely if you expect all users of RDS to use the cmsg on regular read to get these notifications. >> > + if (list_empty(&rs->rs_recv_queue) && nonblock) { >> > + ncookies = rds_recvmsg_zcookie(rs, msg); >> > + if (ncookies) { >> > + ret = 0; >> >> This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now? > > The only time MSG_DONTWAIT was previously returning 0 for PF_RDS > was when there was a congestion notification or rdma completion > i.e., the checks for rs_notify_queue and rs_cong_notify earlier > in the function, and in those 2 cases it was not returning data > (i.e. ret was always 0). In all other cases rds_recvmsg would > always return either > - ret == -1, with errno set to EAGAIN or ETIMEDOUT, depending on the value > of MSG_DONTWAIT, or, > - ret > 0 with data bytes. > > That behavior (as well as existing behavior for congestion notification > and rdma completion) has not changed. The only thing that changed is that > if there is no data to pass up, the ret will be 0, errno will be 0, > and the ancillary data may have completion cookies. Okay. If callers must already handle 0 as a valid return code, then it is fine to add another case that does this. The extra branch in the hot path is still rather unfortunate.
Could this be integrated in the existing if (nonblock) branch below?
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
On (02/21/18 17:50), Willem de Bruijn wrote: > > In the common case no more than one notification will be outstanding, > but with a fixed number of notifications per packet, in edge cases this > list may be long. : > Socket functions block if sk_err is non-zero. See for instance > tcp_sendmsg_locked. It is set by most code that also calls > sock_queue_err_skb and also on dequeue from err skb. > > This is the main reason that I would consider dropping error > queue completely if you expect all users of RDS to use the > cmsg on regular read to get these notifications. I see. That's a good point, and maybe it makes sense to just have a struct sk_buff_head rs_zcookie_queue on the rds_sock, and have rds_rm_zerocopy_callback chain cookies to this rs_zcookie_queue. [discussion regarding rds_recvmsg return values elided] > Okay. If callers must already handle 0 as a valid return code, then > it is fine to add another case that does this. > > The extra branch in the hot path is still rather unfortunate. Could > this be integrated in the existing if (nonblock) branch below? That's where I first started. It got even hairier because the callers expect a retval of 0 (-EAGAIN threw rds-stress into an infinite loop of continually trying to recv) and the end result was just confusing code with the same number of branches.. let me revisit this when I spin out V2 without the sk_error_queue.. --Sowmini
Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open
On 02/21/2018 05:33 PM, Daniel Borkmann wrote: > Hi Shuah, > > On 02/22/2018 12:03 AM, Shuah Khan wrote: >> On 02/21/2018 03:48 PM, David Miller wrote: >>> From: Anders Roxell>>> Date: Wed, 21 Feb 2018 22:30:01 +0100 >>> bpf builds a test program for loading BPF ELF files. Add the executable to the .gitignore list. Signed-off-by: Anders Roxell >>> >>> Acked-by: David S. Miller >> >> Thanks. I will pull this in for 4.16-rc > > I would have taken this into bpf tree, but fair enough. This one > in particular doesn't cause any conflicts right now, so feel free > to pick it up then. But usual paths are via bpf / bpf-next as we > otherwise would run into real ugly merge conflicts. > Daniel, Go for it. I know bpf tree stuff causes conflicts. That is why I usually stay away from bpfs selftests and let you handle them. You are taking the other one away, pick this up as well. Acked-by: Shuah Khan thanks, -- Shuah
Re: [PATCH RFC PoC 0/3] nftables meets bpf
On Wed, 21 Feb 2018 16:30:07 -0800, Florian Fainelli wrote: > On 02/21/2018 03:46 PM, Jakub Kicinski wrote: > > On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote: > >> We also have a large range of TCAM based hardware offload out there > >> that will _not_ work with your BPF HW offload infrastructure. What > >> this bpf infrastructure pushes into the kernel is just a blob > >> expressing things in a very low-level instruction-set: trying to find > >> a mapping of that to typical HW intermediate representations in the > >> TCAM based HW offload world will be simply crazy. > > > > I'm not sure where the TCAM talk is coming from. Think much smaller - > > cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs. The > > information the verifier is gathering will be crucial for optimizing > > those. Please don't discount the value of being able to use > > heterogeneous processing units by the networking stack. > > The only use case that we have a good answer for is when there is no HW > offload capability available, because there, we know that eBPF is our > best possible solution for a software fast path, in large part because > of all the efforts that went into making it both safe and fast. I was trying to point out that JITing eBPF for the host on 32 bit systems is already a pain, Jiong Wang is leading an effort to improve this both from LLVM and verifier angles, IOW running through the verifier may become useful even for host JITs :) > When there is offloading HW available, there does not appear to be a > perfect answer to this problem of, given a standard Linux utility that > can express any sort of match + action, be it ethtool::rxnfc, > tc/cls_{u32,flower}, nftables, how do I transform that into what makes > most sense to my HW?
You could: > > - have hardware that understands BPF bytecode directly, great, then you > don't have to do anything, just pass it up the driver baby, oh wait, > it's not that simple, the NFP driver is not small True, it's not the largest but fair point, IMHO we should be trying to push for sharing as much code between drivers as possible, and on all fronts, but that's a topic for another time... > - transform BPF back into something that your hardware understand, does > that belong in the kernel? Maybe, maybe not Personally, I think there is non-zero probability of AMP CPUs/systems becoming more common. NFP is very powerful and fast, but less advanced solution may just use an off-the-shelf MIPS/ARM/Andes core. Taking it slightly further from home to the cellular/WiFi wake up problem which was mentioned by Android folks at one of netdevs - if we have MIPS/ARM/Andes *host* JIT in the kernel, and the NIC processor is built on one of those all the driver needs to provide is some glue and we can offload filtering to the MCU on the NIC/modem! > - use a completely different intermediate representation like P4, > brainfuck, I don't know > > Maybe first things first, we have at least 3 different programming > interfaces, if not more: ethtool::rxnfc, tc/cls_{u32,flower}, nftables > that are all capable of programming TCAMs and hardware capable of match > + action, how about we start with having some sort of common library > code that: > > - validates input parameters against HW capabilities This one may be quite hard. > - does the adequate transformation from any of these interfaces into a > generic set of input parameters > - define what the appropriate behavior is when programming through all > of these 3 interfaces that ultimately access the same shared piece of > HW, and therefore need to manage resources allocation? That would be great! :) Flower stands out today as the most feature rich and a go-to for TCAM offloads. >
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote: > > I don't see why the team cannot be there always. > > It is more the logistical nightmare. Part of the goal here was to work > with the cloud base images that are out there such as > https://alt.fedoraproject.org/cloud/. With just the kernel changes the > overhead for this stays fairly small and would be pulled in as just a > standard part of the kernel update process. The virtio bypass only > pops up if the backup bit is present. With the team solution it > requires that the base image use the team driver on virtio_net when it > sees one. I doubt the OSVs would want to do that just because SR-IOV > isn't that popular of a case. IIUC we need to monitor for a "backup hint", spawn the master, rename it to maintain backwards compatibility with no-VF setups and enslave the VF if it appears. All those sound possible from user space, the advantage of the kernel solution right now is that it has more complete code. Am I misunderstanding?
Re: [PATCH bpf] bpf, x64: implement retpoline for tail call
On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote: ... > +/* Instead of plain jmp %rax, we emit a retpoline to control > + * speculative execution for the indirect branch. > + */ > +static void emit_retpoline_rax_trampoline(u8 **pprog) > +{ > + u8 *prog = *pprog; > + int cnt = 0; > + > + EMIT1_off32(0xE8, 7); /* callq */ > + /* capture_spec: */ > + EMIT2(0xF3, 0x90); /* pause */ > + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ > + EMIT2(0xEB, 0xF9); /* jmp */ > + /* set_up_target: */ > + EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ > + EMIT1(0xC3); /* retq */ > + > + BUILD_BUG_ON(cnt != RETPOLINE_SIZE); > + *pprog = prog; You might define the actual code sequence (and length) in arch/x86/include/asm/nospec-branch.h. If we need to adjust code sequences for RETPOLINE, then we won't forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded. Thanks Daniel.
[PATCH iproute2] README: re-add updated information link
From: Quentin Monnet The "Information" link was removed from README file in commit d7843207e6fd ("README: update location of git repositories, remove broken info link"), because it redirected to a page that no longer existed on the Linux Foundation wiki. This page has just been restored, so we can add the link back again. Since the previous link was a redirection, use the updated link instead. Thanks to Luca Boccassi for investigating this issue, restoring and updating the page. Signed-off-by: Quentin Monnet --- README | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README b/README index 1b7f44272f5a..f66fd5faf4cf 100644 --- a/README +++ b/README @@ -1,5 +1,8 @@ This is a set of utilities for Linux networking. +Information: +https://wiki.linuxfoundation.org/networking/iproute2 + Download: http://www.kernel.org/pub/linux/utils/net/iproute2/ -- 2.15.1
[PATCH net-next 3/3] nfp: advertise firmware for mixed 10G/25G mode
From: Dirk van der Merwe The AMDA0099-0001 platform can support the 1x10G + 1x25G mixed mode operation. Recently, firmware has been added for this configuration mode. Signed-off-by: Dirk van der Merwe Acked-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index ab301d56430b..c4b1f344b4da 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -645,6 +645,7 @@ MODULE_FIRMWARE("netronome/nic_AMDA0097-0001_4x10_1x40.nffw"); MODULE_FIRMWARE("netronome/nic_AMDA0097-0001_8x10.nffw"); MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x10.nffw"); MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x25.nffw"); +MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_1x10_1x25.nffw"); MODULE_AUTHOR("Netronome Systems "); MODULE_LICENSE("GPL"); -- 2.15.1
[PATCH net-next] ibmvnic: Split counters for scrq/pools/napi
The approach of one counter to rule them all when tracking the number of active sub-crqs, pools, and napi has problems handling some failover scenarios. This is due to the split in initializing the sub crqs, pools and napi in different places and the placement of updating the active counts. This patch simplifies this by having a counter for tx and rx sub-crqs, pools, and napi. Signed-off-by: Nathan Fontenot --- drivers/net/ethernet/ibm/ibmvnic.c | 38 drivers/net/ethernet/ibm/ibmvnic.h | 7 +-- 2 files changed, 22 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 1703b881252f..8ca88f7cc661 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -461,7 +461,7 @@ static void release_rx_pools(struct ibmvnic_adapter *adapter) if (!adapter->rx_pool) return; - for (i = 0; i < adapter->num_active_rx_scrqs; i++) { + for (i = 0; i < adapter->num_active_rx_pools; i++) { rx_pool = &adapter->rx_pool[i]; netdev_dbg(adapter->netdev, "Releasing rx_pool[%d]\n", i); @@ -484,6 +484,7 @@ static void release_rx_pools(struct ibmvnic_adapter *adapter) kfree(adapter->rx_pool); adapter->rx_pool = NULL; + adapter->num_active_rx_pools = 0; } static int init_rx_pools(struct net_device *netdev) @@ -508,6 +509,8 @@ static int init_rx_pools(struct net_device *netdev) return -1; } + adapter->num_active_rx_pools = rxadd_subcrqs; + for (i = 0; i < rxadd_subcrqs; i++) { rx_pool = &adapter->rx_pool[i]; @@ -608,7 +611,7 @@ static void release_tx_pools(struct ibmvnic_adapter *adapter) if (!adapter->tx_pool) return; - for (i = 0; i < adapter->num_active_tx_scrqs; i++) { + for (i = 0; i < adapter->num_active_tx_pools; i++) { netdev_dbg(adapter->netdev, "Releasing tx_pool[%d]\n", i); tx_pool = &adapter->tx_pool[i]; kfree(tx_pool->tx_buff); @@ -619,6 +622,7 @@ static void release_tx_pools(struct ibmvnic_adapter *adapter) kfree(adapter->tx_pool); adapter->tx_pool = NULL; + adapter->num_active_tx_pools = 0; } static int
init_tx_pools(struct net_device *netdev) @@ -635,6 +639,8 @@ static int init_tx_pools(struct net_device *netdev) if (!adapter->tx_pool) return -1; + adapter->num_active_tx_pools = tx_subcrqs; + for (i = 0; i < tx_subcrqs; i++) { tx_pool = &adapter->tx_pool[i]; @@ -745,6 +751,7 @@ static int init_napi(struct ibmvnic_adapter *adapter) ibmvnic_poll, NAPI_POLL_WEIGHT); } + adapter->num_active_rx_napi = adapter->req_rx_queues; return 0; } @@ -755,7 +762,7 @@ static void release_napi(struct ibmvnic_adapter *adapter) if (!adapter->napi) return; - for (i = 0; i < adapter->num_active_rx_scrqs; i++) { + for (i = 0; i < adapter->num_active_rx_napi; i++) { if (&adapter->napi[i]) { netdev_dbg(adapter->netdev, "Releasing napi[%d]\n", i); @@ -765,6 +772,7 @@ static void release_napi(struct ibmvnic_adapter *adapter) kfree(adapter->napi); adapter->napi = NULL; + adapter->num_active_rx_napi = 0; } static int ibmvnic_login(struct net_device *netdev) @@ -998,10 +1006,6 @@ static int init_resources(struct ibmvnic_adapter *adapter) return rc; rc = init_tx_pools(netdev); - - adapter->num_active_tx_scrqs = adapter->req_tx_queues; - adapter->num_active_rx_scrqs = adapter->req_rx_queues; - return rc; } @@ -1706,9 +1710,6 @@ static int do_reset(struct ibmvnic_adapter *adapter, release_napi(adapter); init_napi(adapter); - - adapter->num_active_tx_scrqs = adapter->req_tx_queues; - adapter->num_active_rx_scrqs = adapter->req_rx_queues; } else { rc = reset_tx_pools(adapter); if (rc) @@ -2398,19 +2399,10 @@ static struct ibmvnic_sub_crq_queue *init_sub_crq_queue(struct ibmvnic_adapter static void release_sub_crqs(struct ibmvnic_adapter *adapter, bool do_h_free) { - u64 num_tx_scrqs, num_rx_scrqs; int i; - if (adapter->state == VNIC_PROBED) { - num_tx_scrqs = adapter->req_tx_queues; - num_rx_scrqs = adapter->req_rx_queues; - } else { - num_tx_scrqs = adapter->num_active_tx_scrqs; - num_rx_scrqs = adapter->num_active_rx_scrqs; - } - if (adapter->tx_scrq) { - for (i = 0; i < num_tx_scrqs; i++) { + for (i = 0; i <
adapter->num_active_tx_scrqs; i++) {
[PATCH net-next 1/3] nfp: add Makefiles to all directories
To be able to build separate objects we need to provide Kbuild with a Makefile in each directory. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/bpf/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/flower/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/nfpcore/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/nic/Makefile | 2 ++ 5 files changed, 10 insertions(+) create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/flower/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/nic/Makefile diff --git a/drivers/net/ethernet/netronome/nfp/bpf/Makefile b/drivers/net/ethernet/netronome/nfp/bpf/Makefile new file mode 100644 index ..805fa28f391a --- /dev/null +++ b/drivers/net/ethernet/netronome/nfp/bpf/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +# kbuild requires Makefile in a directory to build individual objects diff --git a/drivers/net/ethernet/netronome/nfp/flower/Makefile b/drivers/net/ethernet/netronome/nfp/flower/Makefile new file mode 100644 index ..805fa28f391a --- /dev/null +++ b/drivers/net/ethernet/netronome/nfp/flower/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +# kbuild requires Makefile in a directory to build individual objects diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile b/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile new file mode 100644 index ..805fa28f391a --- /dev/null +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +# kbuild requires Makefile in a directory to build individual objects diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile new file mode 100644 index ..805fa28f391a --- /dev/null +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +# kbuild requires Makefile in a directory to build individual objects diff --git a/drivers/net/ethernet/netronome/nfp/nic/Makefile b/drivers/net/ethernet/netronome/nfp/nic/Makefile new file mode 100644 index ..805fa28f391a --- /dev/null +++ b/drivers/net/ethernet/netronome/nfp/nic/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +# kbuild requires Makefile in a directory to build individual objects -- 2.15.1
[PATCH net-next 0/3] nfp: build and FW initramfs updates
Hi! This set brings empty makefiles to allow building single object files (useful for build-testing), Kbuild does not cater to this use case too well. There are two ethernet drivers right now which suffer from this (nfp, aquantia), both are fixed. Dirk adds an uncommon FW image name to the list of firmware files module may request. Dirk van der Merwe (1): nfp: advertise firmware for mixed 10G/25G mode Jakub Kicinski (2): nfp: add Makefiles to all directories aquantia: add Makefiles to all directories drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/bpf/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/flower/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/nfp_main.c | 1 + drivers/net/ethernet/netronome/nfp/nfpcore/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile | 2 ++ drivers/net/ethernet/netronome/nfp/nic/Makefile | 2 ++ 7 files changed, 13 insertions(+) create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/flower/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile create mode 100644 drivers/net/ethernet/netronome/nfp/nic/Makefile -- 2.15.1
[PATCH net-next 2/3] aquantia: add Makefiles to all directories
To be able to build separate objects we need to provide Kbuild with a Makefile in each directory. Signed-off-by: Jakub Kicinski--- CC: Igor Russkikh drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile b/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile new file mode 100644 index ..805fa28f391a --- /dev/null +++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +# kbuild requires Makefile in a directory to build individual objects -- 2.15.1
[PATCH iproute2 net-next] ss: print skmeminfo for packet sockets
From: Roopa Prabhu before: $ss --packet -p -m p_raw0 0*:eth0 users:(("lldpd",pid=2240,fd=11)) after: $ss --packet -p -m p_raw0 0*:eth0 users:(("lldpd",pid=2240,fd=11)) skmem:(r0,rb266240,t0,tb266240,f0,w0,o320,bl0,d0) Signed-off-by: Roopa Prabhu --- misc/ss.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/misc/ss.c b/misc/ss.c index 29a2507..49f9c49 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -3920,6 +3920,9 @@ static int packet_show_sock(const struct sockaddr_nl *addr, fil++; } } + + if (show_mem) + print_skmeminfo(tb, PACKET_DIAG_MEMINFO); return 0; } -- 2.1.4
[PATCH net v2 2/2] tuntap: correctly add the missing xdp flush
Commit 762c330d670e ("tuntap: add missing xdp flush") tries to fix the devmap stall caused by a missed xdp flush by counting the pending xdp redirected packets and flushing when the count exceeds NAPI_POLL_WEIGHT or MSG_MORE is clear. This may lead to BUG() since xdp_do_flush() was called under process context with preemption enabled. Simply disabling preemption may silence the warning but is not enough, since the process may move between different CPUs during a batch, which causes xdp_do_flush() to miss some CPU where the process ran previously. Considering the several fallouts, that commit was reverted. To fix the issue correctly, we can simply call xdp_do_flush() immediately after xdp_do_redirect(); a side effect is that this removes any possibility of batching, which could be addressed in the future. Reported-by: Christoffer Dall Fixes: 762c330d670e ("tuntap: add missing xdp flush") Signed-off-by: Jason Wang --- drivers/net/tun.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2823a4a..a363ea2 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1662,6 +1662,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun, get_page(alloc_frag->page); alloc_frag->offset += buflen; err = xdp_do_redirect(tun->dev, &xdp, xdp_prog); + xdp_do_flush_map(); if (err) goto err_redirect; rcu_read_unlock(); -- 2.7.4
Re: [PATCH v2 net-next 1/2] lan743x: Add main source files for new lan743x driver
> +static void lan743x_intr_unregister_isr(struct lan743x_adapter *adapter, > + int vector_index) > +{ > + struct lan743x_vector *vector = &adapter->intr.vector_list > + [vector_index]; > + > + devm_free_irq(&adapter->pci.pdev->dev, vector->irq, vector); Hi Bryan The point of devm_ is that you don't need to free resources you have allocated using devm_. The core will release them when the device is removed. In this case, you might want to ensure the hardware will not generate any more interrupts, but you can leave the core to call free_irq(). Please look at all your devm_*free() like calls, and remove them if they are not needed. > +static void lan743x_mdiobus_cleanup(struct lan743x_adapter *adapter) > +{ > + if (adapter->init_flags & LAN743X_INIT_FLAG_MDIOBUS_REGISTERED) { > + mdiobus_unregister(adapter->mdiobus); > + adapter->init_flags &= ~LAN743X_INIT_FLAG_MDIOBUS_REGISTERED; > + } > + if (adapter->init_flags & LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED) { > + devm_mdiobus_free(&adapter->pci.pdev->dev, adapter->mdiobus); > + adapter->mdiobus = NULL; > + adapter->init_flags &= ~LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED; > + } > +} So you can delete devm_mdiobus_free(). That probably means you can also remove LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED. > + > +static void lan743x_full_cleanup(struct lan743x_adapter *adapter) > +{ > + if (adapter->init_flags & LAN743X_INIT_FLAG_NETDEV_REGISTERED) { > + unregister_netdev(adapter->netdev); > + adapter->init_flags &= ~LAN743X_INIT_FLAG_NETDEV_REGISTERED; > + } > + lan743x_mdiobus_cleanup(adapter); > + lan743x_hardware_cleanup(adapter); > + if (adapter->init_flags & LAN743X_COMPONENT_FLAG_PCI) { > + lan743x_pci_cleanup(adapter); > + adapter->init_flags &= ~LAN743X_COMPONENT_FLAG_PCI; > + } > + > + /* would have freed netdev here. > + * but netdev was allocated with devm_alloc_etherdev. > + * and devm_free_netdev is not accessible. > + * so it is expected to be freed by the devm subsystem. > + */ And this comment can go. Andrew
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On 02/16/2018 23:41, Jesper Dangaard Brouer wrote: On Fri, 16 Feb 2018 13:31:37 +0800 Jason Wang wrote: On 02/16/2018 06:43, Jesper Dangaard Brouer wrote: The virtio_net code has three different RX code-paths in receive_buf(). Two of these code paths can handle XDP, but one of them is broken for at least XDP_REDIRECT. Function(1): receive_big() does not support XDP. Function(2): receive_small() supports XDP fully and uses build_skb(). Function(3): receive_mergeable() has broken XDP_REDIRECT and uses napi_alloc_skb(). The simple explanation is that receive_mergeable() is broken because it uses napi_alloc_skb(), which violates XDP given XDP assumes packet header+data in a single page and enough tail room for skb_shared_info. The longer explanation is that receive_mergeable() tries to work around and satisfy these XDP requirements e.g. by having a function xdp_linearize_page() that allocates and memcpys RX buffers around (in case the packet is scattered across multiple rx buffers). This does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because we have not implemented bpf_xdp_adjust_tail yet). The XDP_REDIRECT action combined with cpumap is broken, and causes hard to debug crashes. The main issue is that the RX packet does not have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing skb_shared_info to overlap the next packet's head-room (in which cpumap stores info). Reproducing depends on the packet payload length and whether the RX-buffer size happened to have tail-room for skb_shared_info or not. But to make this even harder to troubleshoot, the RX-buffer size is dynamically changed at runtime based on an Exponentially Weighted Moving Average (EWMA) over the packet length, when refilling RX rings. This patch only disables XDP_REDIRECT support in the receive_mergeable() case, because it can cause a real crash.
But IMHO we should NOT support XDP in receive_mergeable() at all, because the principles behind XDP are to gain speed by (1) code simplicity, (2) sacrificing memory and (3) where possible moving runtime checks to setup time. These principles are clearly being violated in receive_mergeable(), that e.g. runtime track average buffer size to save memory consumption. I agree to disable it for -net now. Okay... I'll send an official patch later. For net-next, we probably can do: - drop xdp_linearize_page() and do XDP through generic XDP helper after skb was built I disagree strongly here - it makes no sense. Why do you want to explicit fallback to Generic-XDP? (... then all the performance gain is gone!) Note this only happens when: 1) Rx buffer size is under estimated, we could disable estimation and then this won't happen 2) headroom is not sufficient, we try hard to not stop device during XDP set, so this can happen but only for the first several packets So this looks pretty fine and remove a lot of complex codes. And besides, a couple of function calls later, the generic XDP code will/can get invoked anyhow... How if we choose to use native mode of XDP? Take a step back: What is the reason/use-case for implementing XDP inside virtio_net? From a DDoS/performance perspective XDP in virtio_net happens on the "wrong-side" as it is activated _inside_ the guest OS, which is too late for a DDoS filter, as the guest kick/switch overhead have already occurred. I don't see any difference of virtio-net now. Consider a real hardward NIC, XDP (except for the offload case) also start to drop packet after it reach hardware. I do use XDP_DROP inside the guest (driver virtio_net), but just to perform what I can zoom-in benchmarking, for perf-record isolating the early RX code path in the guest. (Using iptables "raw" table drop is almost as useful for that purpose). We could not assume the type of virtio-net backend, it could be dpdk or other high speed implementation. 
And I'm pretty sure there are more use cases, here are two: - Use XDP to accelerate nest VM - XDP offload to host The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is significantly more interesting. As it allow us to skip large parts of the network stack and redirect from a physical device (ixgbe) into a guest device. Ran a benchmark: - 0.5 Mpps with normal code path into device with driver tun - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device Plus, there are indications that 3.7Mpps is not the real limit, as guest CPU doing XDP_DROP is 75% idle... thus this is a likely a scheduling + queue size issue. Yes, XDP_DROP can do more (but I forget the exact number). Btw testpmd (in guest) can give me about 3Mpps when doing forwarding (io mode). The main bottleneck in this case is vhost, XDP_REDIRECT can provides about 8Mpps to tun, but vhost can only receive about 4Mpps, and vhost tx can only have 3Mpps. Thanks - disable EWMA when XDP is set and reserve enough tailroom. Besides the described bug: Update(1): There is also a OOM leak in the
Re: [PATCH bpf] bpf, x64: implement retpoline for tail call
On Wed, Feb 21, 2018 at 07:04:02PM -0800, Eric Dumazet wrote: > On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote: > > ... > > > +/* Instead of plain jmp %rax, we emit a retpoline to control > > + * speculative execution for the indirect branch. > > + */ > > +static void emit_retpoline_rax_trampoline(u8 **pprog) > > +{ > > + u8 *prog = *pprog; > > + int cnt = 0; > > + > > + EMIT1_off32(0xE8, 7);/* callq */ > > + /* capture_spec: */ > > + EMIT2(0xF3, 0x90); /* pause */ > > + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ > > + EMIT2(0xEB, 0xF9); /* jmp */ > > + /* set_up_target: */ > > + EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ > > + EMIT1(0xC3); /* retq */ > > + > > + BUILD_BUG_ON(cnt != RETPOLINE_SIZE); > > + *pprog = prog; > > You might define the actual code sequence (and length) in > arch/x86/include/asm/nospec-branch.h > > If we need to adjust code sequences for RETPOLINE, then we wont > forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded. like adding a comment to asm/nospec-branch.h that says "dont forget to adjust bpf_jit_comp.c" ? but clang/gcc generate slightly different sequences for retpoline anyway, so even if '.macro RETPOLINE_JMP' in nospec-branch.h changes it doesn't mean that x64 jit has to change. So what kinda comment there would make sense?
Re: [PATCH] netlink: put module reference if dump start fails
Hi, On Wed, Feb 21, 2018 at 04:41:05PM +0100, Jason A. Donenfeld wrote: Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs") I think it would be better to resend it. Bo,
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/21/2018 5:59 PM, Siwei Liu wrote: On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck wrote: On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: I haven't checked emails for days and did not realize the new revision had already come out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use an alternate datapath when available and registered. When the BACKUP feature is enabled, the virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as the 'backup' netdev and a passthru/vf device with the same MAC gets registered as the 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via the 'bypass' netdev. The 'bypass' netdev chooses the 'active' netdev as default for transmits when it is available with link up and running. Thank you for doing this. We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? The basic idea of all this is that we wanted this to work with an existing VM image that was using virtio.
As such we were trying to make it so that the bypass interface takes the place of the original virtio and get udev to rename the bypass to what the original virtio_net was.

Could it also be made possible to take over the config from the VF instead of virtio on an existing VM image, and get udev to rename the bypass netdev to what the original VF was? I'm not saying we should tightly bind the bypass master to only virtio or the VF, but I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need a special multiseat-like QEMU device for that purpose? In our case we'll upgrade the config from the VF to virtio-bypass directly.

So if I am understanding what you are saying, you are wanting to flip the backup interface from the virtio to a VF. The problem is that becomes a bit of a vendor lock-in solution since it would rely on a specific VF driver. I would agree with Jiri that we don't want to go down that path. We don't want every VF out there firing up its own separate bond. Ideally you want the hypervisor to be able to manage all of this, which is why it makes sense to have virtio manage this and why this is associated with the virtio_net interface.

No, that's not what I was talking about, of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure. However, the upgrade path I was seeking is to use the bypass interface to take the place of the original VF interface while retaining the name and network configs, which generally can be done simply with a kernel upgrade.
It would become limiting as this patch makes the bypass interface share the same virtio pci device with the virtio backup. Can this bypass interface be made general to take the place of any pci device other than virtio-net? This would be more helpful, as cloud users who have an existing setup on a VF interface wouldn't have to recreate it on virtio-net and the VF separately again.

Yes. This sounds interesting. Looks like you want an existing VM image with a VF-only configuration to get transparent live migration support by adding virtio_net with the BACKUP feature. We may need another feature bit to switch between these 2 options.

The other bits get into more complexity than we are ready to handle for now. I think I might have talked about something similar that I was referring to as a "virtio-bond", where you would have a PCI/PCIe tree topology that makes this easier to sort out, and the "virtio-bond" would be used to handle coordination/configuration of a much more complex interface. That was
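The slave-selection rule described in this thread (the 'bypass' master prefers the 'active' VF slave when it is present with link up, and otherwise falls back to the 'backup' virtio netdev) can be sketched in a few lines. This is a minimal userspace illustration; the struct layout and function names are made up for the sketch and are not the kernel API.

```c
#include <assert.h>

/* Illustrative miniature of the 3-netdev model from the thread:
 * a bypass master with an 'active' (passthru/VF) slave and a
 * 'backup' (original virtio_net) slave. */
struct slave {
	int present;	/* slave netdev registered */
	int link_up;	/* carrier up and running */
};

struct bypass_master {
	struct slave active;	/* VF datapath */
	struct slave backup;	/* virtio_net datapath */
};

/* Return 1 if transmits should go out the active slave,
 * 0 if they should fall back to the backup slave. */
static int bypass_pick_active(const struct bypass_master *m)
{
	return m->active.present && m->active.link_up;
}
```

During live migration the VF is unplugged, `present` drops to 0, and the same check naturally routes traffic back through the virtio backup.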
Re: [PATCH bpf v2] bpf: fix memory leak in lpm_trie map_free callback function
On Tue, 2018-02-13 at 19:17 -0800, Alexei Starovoitov wrote:
> On Tue, Feb 13, 2018 at 07:00:21PM -0800, Yonghong Song wrote:
> > There is a memory leak happening in the lpm_trie map_free callback
> > function trie_free. The trie structure itself does not get freed.
> >
> > Also, the trie_free function did not do synchronize_rcu before freeing
> > various data structures. This is incorrect as some rcu_read_lock
> > region(s) for lookup, update, delete or get_next_key may not have completed yet.
> > The fix is to add synchronize_rcu at the beginning of trie_free.
> > The useless spin_lock is removed from this function as well.
> >
> > Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map
> > implementation")
> > Reported-by: Mathieu Malaterre
> > Reported-by: Alexei Starovoitov
> > Tested-by: Mathieu Malaterre
> > Signed-off-by: Yonghong Song
> > ---
> > kernel/bpf/lpm_trie.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > v1->v2:
> > Make comments more precise and make the label name more appropriate,
> > as suggested by Daniel
>
> Applied to bpf tree, Thanks Yonghong.

This does not look good. LOCKDEP surely should complain about

node = rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock));

since we no longer hold trie->lock.
[PATCH bpf] bpf: fix rcu lockdep warning for lpm_trie map_free callback
Commit 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback function") fixed a memory leak and removed unnecessary locks in the map_free callback function. Unfortunately, it introduced a lockdep warning. When lockdep checking is turned on, running tools/testing/selftests/bpf/test_lpm_map will have:

[   98.294321] =============================
[   98.294807] WARNING: suspicious RCU usage
[   98.295359] 4.16.0-rc2+ #193 Not tainted
[   98.295907] -----------------------------
[   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
[   98.297657]
[   98.297657] other info that might help us debug this:
[   98.297657]
[   98.298663]
[   98.298663] rcu_scheduler_active = 2, debug_locks = 1
[   98.299536] 2 locks held by kworker/2:1/54:
[   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<196bc1f0>] process_one_work+0x157/0x5c0
[   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<196bc1f0>] process_one_work+0x157/0x5c0

Since the actual trie tree removal happens only after no other accesses to the tree are possible, this patch simply converts all RCU-protected pointer accesses to normal accesses, which removes the above warning.
Fixes: 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback function")
Reported-by: Eric Dumazet
Signed-off-by: Yonghong Song
---
 kernel/bpf/lpm_trie.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index a75e02c..0c15813 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -552,7 +552,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 static void trie_free(struct bpf_map *map)
 {
 	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
-	struct lpm_trie_node __rcu **slot;
+	struct lpm_trie_node **slot;
 	struct lpm_trie_node *node;
 
 	/* Wait for outstanding programs to complete
@@ -569,23 +569,22 @@ static void trie_free(struct bpf_map *map)
 		slot = &trie->root;
 
 		for (;;) {
-			node = rcu_dereference_protected(*slot,
-					lockdep_is_held(&trie->lock));
+			node = *slot;
 			if (!node)
 				goto out;
 
-			if (rcu_access_pointer(node->child[0])) {
+			if (node->child[0]) {
 				slot = &node->child[0];
 				continue;
 			}
 
-			if (rcu_access_pointer(node->child[1])) {
+			if (node->child[1]) {
 				slot = &node->child[1];
 				continue;
 			}
 
 			kfree(node);
-			RCU_INIT_POINTER(*slot, NULL);
+			*slot = NULL;
 			break;
 		}
 	}
-- 
2.9.5
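For readers following the diff, the teardown loop has a simple userspace analogue: once synchronize_rcu() has guaranteed no readers remain, the tree can be walked with plain pointer loads — descend to a childless node, free it, clear the parent's slot, and restart from the root. The sketch below mirrors that shape with simplified names; it is an illustration, not the kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Two-child node, like lpm_trie_node's child[2] (payload omitted). */
struct node {
	struct node *child[2];
};

static struct node *node_new(void)
{
	return calloc(1, sizeof(struct node));
}

/* Iterative free in the style of trie_free(): no recursion, no RCU
 * primitives needed once readers are excluded. Returns the number of
 * nodes freed (for demonstration). */
static unsigned int tree_free(struct node **root)
{
	unsigned int freed = 0;

	for (;;) {
		struct node **slot = root;

		for (;;) {
			struct node *n = *slot;

			if (!n)
				return freed;	/* tree is empty: done */
			if (n->child[0]) {
				slot = &n->child[0];
				continue;
			}
			if (n->child[1]) {
				slot = &n->child[1];
				continue;
			}
			/* leaf: free it, nullify parent's reference,
			 * restart from the root */
			free(n);
			*slot = NULL;
			freed++;
			break;
		}
	}
}
```

Restarting from the root after every free is O(n * depth), which is acceptable for a destructor path that runs once per map.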
Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO
On Wed, Feb 21, 2018 at 9:43 AM, Eric Dumazet wrote:
>
> From: Eric Dumazet
>
> BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> burst sizes and to adjust cwnd in bbr_target_cwnd() with the following
> gold formula:
>
> /* Allow enough full-sized skbs in flight to utilize end systems. */
> cwnd += 3 * bbr->tso_segs_goal;
>
> But GSO can be lacking or be constrained to very small
> units (ip link set dev ... gso_max_segs 2)
>
> What we really want is to have enough packets in flight so that both
> GSO and GRO are efficient.
>
> So in the case GSO is off or downgraded, we still want to have the same
> number of packets in flight as if GSO/TSO was fully operational, so
> that GRO can hopefully be working efficiently.
>
> To fix this issue, we make tcp_tso_autosize() unaware of
> sk->sk_gso_max_segs
>
> Only tcp_tso_segs() has to enforce the gso_max_segs limit.
>
> Tested:
>
> ethtool -K eth0 tso off gso off
> tc qd replace dev eth0 root pfifo_fast
>
> Before patch:
> for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> 691  (ss -temoi shows cwnd is stuck around 6)
> 667
> 651
> 631
> 517
>
> After patch:
> # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> 1733  (ss -temoi shows cwnd is around 386)
> 1778
> 1746
> 1781
> 1718
>
> Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> Signed-off-by: Eric Dumazet
> Reported-by: Oleksandr Natalenko
> ---
> net/ipv4/tcp_output.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)

Acked-by: Neal Cardwell

Thanks, Eric!

neal
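The split the patch describes can be shown with toy arithmetic: the autosizing used for cwnd/pacing targets ignores the device segment cap, and only the actual segmentation step clamps to it. The functions below are named after the patch but are deliberate simplifications with made-up numbers, not the real tcp_tso_autosize()/tcp_tso_segs() implementations.

```c
#include <assert.h>

/* Simplified autosize: how many MSS-sized segments fit the pacing
 * budget. After the patch this is *not* clamped by gso_max_segs, so
 * BBR's packet-count target stays the same even when GSO is off or
 * restricted. */
static unsigned int tcp_tso_autosize(unsigned int pacing_bytes,
				     unsigned int mss)
{
	unsigned int segs = pacing_bytes / mss;

	return segs > 2 ? segs : 2;	/* keep at least 2 segments */
}

/* Only the segmentation path enforces the device limit. */
static unsigned int tcp_tso_segs(unsigned int pacing_bytes,
				 unsigned int mss,
				 unsigned int gso_max_segs)
{
	unsigned int segs = tcp_tso_autosize(pacing_bytes, mss);

	return segs < gso_max_segs ? segs : gso_max_segs;
}
```

With `gso_max_segs 2`, each skb still carries at most 2 segments, but the congestion-control side keeps sizing cwnd as if full-size bursts were possible — which is what lets GRO on the receiver stay efficient.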
Re: [PATCH] netlink: put module reference if dump start fails
Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")
Re: [PATCH iproute2-next] rdma: Add batch command support
On Wed, Feb 21, 2018 at 09:10:45AM -0700, David Ahern wrote:
> On 2/21/18 5:38 AM, Leon Romanovsky wrote:
> > @@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
> > 		{ 0 }
> > 	};
> >
> > +	rd->argc = argc;
> > +	rd->argv = argv;
> > +
> > 	return rd_exec_cmd(rd, cmds, "object");
> > }
> >
> > -static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
> > +static int rd_batch(struct rd *rd, const char *name, bool force)
> > +{
> > +	char *line = NULL;
> > +	size_t len = 0;
> > +	int ret = 0;
> > +
> > +	if (name && strcmp(name, "-") != 0) {
> > +		if (!freopen(name, "r", stdin)) {
> > +			pr_err("Cannot open file \"%s\" for reading: %s\n",
> > +			       name, strerror(errno));
> > +			return errno;
> > +		}
> > +	}
> > +
> > +	cmdlineno = 0;
> > +	while (getcmdline(&line, &len, stdin) != -1) {
> > +		char *largv[512];
> > +		int largc;
> > +
> > +		largc = makeargs(line, largv, 100);
>
> you have largv[512] declared but passing a max of 100. I realize other
> batch commands have it hardcoded, but ARRAY_SIZE is better.

Thanks for catching it, it was my copy/paste error. I'll wait for a day for other comments and will resubmit if needed. The fix is:

diff --git a/rdma/rdma.c b/rdma/rdma.c
index ab8f98d2..ab2c9608 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -62,7 +62,7 @@ static int rd_batch(struct rd *rd, const char *name, bool force)
 		char *largv[512];
 		int largc;
 
-		largc = makeargs(line, largv, 100);
+		largc = makeargs(line, largv, ARRAY_SIZE(largv));
 		if (!largc)
 			continue;	/* blank line */
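The pattern the fix settles on is worth spelling out: declare the argument vector once and derive the tokenizer's bound with ARRAY_SIZE, so the two can never drift apart the way `largv[512]` and the hardcoded `100` did. The `makeargs()` below is a simplified stand-in for the iproute2 helper, not its real implementation.

```c
#include <assert.h>
#include <string.h>

/* Bound derived from the declaration, never hardcoded. */
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Simplified stand-in for iproute2's makeargs(): split a writable
 * line into whitespace-separated tokens, never writing more than
 * maxargs pointers into argv. Returns the token count. */
static int makeargs(char *line, char *argv[], int maxargs)
{
	int argc = 0;
	char *tok = strtok(line, " \t\n");

	while (tok && argc < maxargs) {
		argv[argc++] = tok;
		tok = strtok(NULL, " \t\n");
	}
	return argc;
}
```

The call site then reads `makeargs(line, largv, ARRAY_SIZE(largv))`: resizing the array later automatically resizes the bound.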
Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO
On Wed, Feb 21, 2018 at 10:14 AM Neal Cardwell wrote:
> On Wed, Feb 21, 2018 at 9:43 AM, Eric Dumazet wrote:
> >
> > From: Eric Dumazet
> >
> > BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> > burst sizes and to adjust cwnd in bbr_target_cwnd() with the following
> > gold formula:
> >
> > /* Allow enough full-sized skbs in flight to utilize end systems. */
> > cwnd += 3 * bbr->tso_segs_goal;
> >
> > But GSO can be lacking or be constrained to very small
> > units (ip link set dev ... gso_max_segs 2)
> >
> > What we really want is to have enough packets in flight so that both
> > GSO and GRO are efficient.
> >
> > So in the case GSO is off or downgraded, we still want to have the same
> > number of packets in flight as if GSO/TSO was fully operational, so
> > that GRO can hopefully be working efficiently.
> >
> > To fix this issue, we make tcp_tso_autosize() unaware of
> > sk->sk_gso_max_segs
> >
> > Only tcp_tso_segs() has to enforce the gso_max_segs limit.
> >
> > Tested:
> >
> > ethtool -K eth0 tso off gso off
> > tc qd replace dev eth0 root pfifo_fast
> >
> > Before patch:
> > for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> > 691  (ss -temoi shows cwnd is stuck around 6)
> > 667
> > 651
> > 631
> > 517
> >
> > After patch:
> > # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> > 1733  (ss -temoi shows cwnd is around 386)
> > 1778
> > 1746
> > 1781
> > 1718
> >
> > Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> > Signed-off-by: Eric Dumazet
> > Reported-by: Oleksandr Natalenko
> > ---
> > net/ipv4/tcp_output.c | 9 +++++----
> > 1 file changed, 5 insertions(+), 4 deletions(-)
> Acked-by: Neal Cardwell

Acked-by: Soheil Hassas Yeganeh

Thank you Eric for the nice patch!
Re: [4.15-stable 1/1] uapi/if_ether.h: move __UAPI_DEF_ETHHDR libc define
From: Mark Asselstine
Date: Wed, 21 Feb 2018 10:13:18 -0500

> Please promote commit da360299b6734135a5f66d7db458dcc7801c826a [uapi/if_ether.h:
> move __UAPI_DEF_ETHHDR libc define] to linux-4.15.y stable.
> Without this fix users of the uapi headers will run into the issue as
> described in the commit log. Specifically we have seen this with building
> ebtables.

Please check my stable patchwork queue, it should be in there.

Thanks.
ss issue on arm not showing UDP listening ports
Hi, I currently have an issue with ss where it is not displaying the UDP listening ports. This was on:

Linux 4.9.59-v7+ #1047 SMP Sun Oct 29 12:19:23 GMT 2017 armv7l GNU/Linux

ss -ul
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port

netstat -ul
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address            Foreign Address  State
udp        0      0 0.0.0.0:mdns             0.0.0.0:*
udp        0      0 0.0.0.0:bootpc           0.0.0.0:*
udp        0      0 0.0.0.0:55887            0.0.0.0:*
udp        0      0 192.168.1.80:ntp         0.0.0.0:*
udp        0      0 localhost:ntp            0.0.0.0:*
udp        0      0 0.0.0.0:ntp              0.0.0.0:*
udp        0      0 0.0.0.0:46739            0.0.0.0:*
udp        0      0 0.0.0.0:snmp             0.0.0.0:*
udp6       0      0 [::]:mdns                [::]:*
udp6       0      0 [::]:40880               [::]:*
udp6       0      0 fe80::bcdd:e603:7d8:ntp  [::]:*
udp6       0      0 localhost:ntp            [::]:*
udp6       0      0 [::]:ntp                 [::]:*

I hope I have found the proper way to report this issue.
Re: [PATCH v2] sched: report if filter is too large to dump
From: Roman KaplDate: Wed, 21 Feb 2018 09:49:31 +0100 > Should I send a v3 or something, David? No, it's too late to fix it, sorry.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote:
> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>> On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>> stuck with this ugly thing forever...
>>>
>>> Could you at least make some common code that is shared between
>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>
>> IMHO netvsc is a vendor-specific driver which made a mistake on what
>> behaviour it provides (or tried to align itself with Windows SR-IOV).
>> Let's not make a far, far more commonly deployed and important driver
>> (virtio) bug-compatible with netvsc.
>
> Yeah. The netvsc solution is a dangerous precedent here and in my opinion
> it was a huge mistake to merge it. I personally would vote to unmerge it
> and make the solution based on team/bond.
>
>> To Jiri's initial comments, I feel the same way; in fact I've talked to
>> the NetworkManager guys to get auto-bonding based on MACs handled in
>> user space. I think it may very well get done in next versions of NM,
>> but isn't done yet. Stephen also raised the point that not everybody is
>> using NM.
>
> Can be done in NM, networkd or other network management tools.
> Even easier to do this in teamd and let them all benefit.
>
> Actually, I took a stab at implementing this in teamd. Took me like an hour
> and a half.
>
> You can just run teamd with the config option "kidnap" like this:
> # teamd/teamd -c '{"kidnap": true }'
>
> Whenever teamd sees another netdev appear with the same MAC as its own,
> or whenever teamd sees another netdev change its MAC to its own,
> it enslaves it.
>
> Here's the patch (quick and dirty):
>
> Subject: [patch teamd] teamd: introduce kidnap feature
>
> Signed-off-by: Jiri Pirko

So this doesn't really address the original problem we were trying to solve. You asked earlier why the netdev name mattered and it mostly has to do with configuration.
Specifically, what our patch is attempting to resolve is the issue of how to allow a cloud provider to upgrade their customer to SR-IOV support and live migration without requiring them to reconfigure their guest. So the general idea with our patch is to take a VM that is running with virtio_net only and allow it to instead spawn a virtio_bypass master using the same netdev name as the original virtio, and then have the virtio_net and VF come up and be enslaved by the bypass interface. Doing it this way we can allow for multi-vendor SR-IOV live migration support using a guest that was originally configured for virtio only.

The problem with your solution is we already have teaming and bonding, as you said. There is already a write-up from Red Hat on how to do it (https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). That is all well and good as long as you are willing to keep around two VM images, one for virtio, and one for SR-IOV with live migration. The problem is nobody wants to do that. What they want is to maintain one guest image, and if they decide to upgrade to SR-IOV they still want their live migration and they don't want to have to reconfigure the guest.

That said, it does seem to make the existing Red Hat solution easier to manage since you wouldn't be guessing at the ifname, so I have provided some feedback below.
> ---
> include/team.h | 7 +++
> libteam/ifinfo.c | 20
> teamd/teamd.c | 17 +
> teamd/teamd.h | 5 +
> teamd/teamd_events.c | 17 +
> teamd/teamd_ifinfo_watch.c | 9 +
> teamd/teamd_per_port.c | 7 ++-
> 7 files changed, 81 insertions(+), 1 deletion(-)
>
> diff --git a/include/team.h b/include/team.h
> index 9ae517d..b0c19c8 100644
> --- a/include/team.h
> +++ b/include/team.h
> @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
>  #define team_for_each_ifinfo(ifinfo, th) \
>  	for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo; \
>  	     ifinfo = team_get_next_ifinfo(th, ifinfo))
> +
> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
> +						  struct team_ifinfo *ifinfo);
> +#define team_for_each_unlinked_ifinfo(ifinfo, th) \
> +	for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo; \
> +	     ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
> +
>  /* ifinfo getters */
>  bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
>  uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
> diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
> index 5c32a9c..8f9548e 100644
> --- a/libteam/ifinfo.c
> +++ b/libteam/ifinfo.c
> @@ -494,6 +494,26 @@ struct team_ifinfo
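The "kidnap" policy described above boils down to a MAC comparison made whenever an ifinfo event reports a new or re-addressed netdev. A hypothetical distillation of that decision (not the teamd code; the function name and signature are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Enslave a netdev iff it is not already a port and its hardware
 * address matches the team device's. This mirrors the event-driven
 * check the kidnap patch hooks into ifinfo changes. */
static int should_kidnap(const uint8_t team_mac[ETH_ALEN],
			 const uint8_t dev_mac[ETH_ALEN],
			 int already_enslaved)
{
	return !already_enslaved &&
	       memcmp(team_mac, dev_mac, ETH_ALEN) == 0;
}
```

The same predicate fires both when a netdev appears with a matching MAC and when an existing netdev changes its MAC to match.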
Re: [PATCH iproute2-next] rdma: Add batch command support
On 2/21/18 5:38 AM, Leon Romanovsky wrote:
> @@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
> 		{ 0 }
> 	};
>
> +	rd->argc = argc;
> +	rd->argv = argv;
> +
> 	return rd_exec_cmd(rd, cmds, "object");
> }
>
> -static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
> +static int rd_batch(struct rd *rd, const char *name, bool force)
> +{
> +	char *line = NULL;
> +	size_t len = 0;
> +	int ret = 0;
> +
> +	if (name && strcmp(name, "-") != 0) {
> +		if (!freopen(name, "r", stdin)) {
> +			pr_err("Cannot open file \"%s\" for reading: %s\n",
> +			       name, strerror(errno));
> +			return errno;
> +		}
> +	}
> +
> +	cmdlineno = 0;
> +	while (getcmdline(&line, &len, stdin) != -1) {
> +		char *largv[512];
> +		int largc;
> +
> +		largc = makeargs(line, largv, 100);

you have largv[512] declared but passing a max of 100. I realize other batch commands have it hardcoded, but ARRAY_SIZE is better.
Re: [PATCH RFC net-next 5/7] net/ipv6: Add support for path selection using hash of 5-tuple
On Mon, Feb 12, 2018 at 04:06:00PM -0800, David Ahern wrote:
> Some operators prefer IPv6 path selection to use a standard 5-tuple
> hash rather than just an L3 hash with the flow label. To that end
> add support to IPv6 for multipath hash policy similar to bf4e0a3db97eb
> ("net: ipv4: add support for ECMP hash policy choice"). The default
> is still L3 which covers source and destination addresses along with
> flow label and IPv6 protocol.
>
> Signed-off-by: David Ahern

[...]

> @@ -1819,20 +1820,53 @@ static void ip6_multipath_l3_keys(const struct sk_buff *skb,
>  }
>
>  /* if skb is set it will be used and fl6 can be NULL */
> -u32 rt6_multipath_hash(const struct flowi6 *fl6, const struct sk_buff *skb)
> +u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6,
> +		       const struct sk_buff *skb)
>  {
>  	struct flow_keys hash_keys;
>  	u32 mhash;
>
> -	memset(&hash_keys, 0, sizeof(hash_keys));
> -	hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> -	if (skb) {
> -		ip6_multipath_l3_keys(skb, &hash_keys);
> -	} else {
> -		hash_keys.addrs.v6addrs.src = fl6->saddr;
> -		hash_keys.addrs.v6addrs.dst = fl6->daddr;
> -		hash_keys.tags.flow_label = (__force u32)fl6->flowlabel;
> -		hash_keys.basic.ip_proto = fl6->flowi6_proto;
> +	switch (net->ipv6.sysctl.multipath_hash_policy) {
> +	case 0:
> +		memset(&hash_keys, 0, sizeof(hash_keys));
> +		hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> +		if (skb) {
> +			ip6_multipath_l3_keys(skb, &hash_keys);
> +		} else {
> +			hash_keys.addrs.v6addrs.src = fl6->saddr;
> +			hash_keys.addrs.v6addrs.dst = fl6->daddr;
> +			hash_keys.tags.flow_label = (__force u32)fl6->flowlabel;
> +			hash_keys.basic.ip_proto = fl6->flowi6_proto;
> +		}
> +		break;
> +	case 1:
> +		if (skb) {
> +			unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
> +			struct flow_keys keys;
> +
> +			/* short-circuit if we already have L4 hash present */
> +			if (skb->l4_hash)
> +				return skb_get_hash_raw(skb) >> 1;
> +
> +			memset(&hash_keys, 0, sizeof(hash_keys));
> +
> +			skb_flow_dissect_flow_keys(skb, &keys, flag);
> +
> +			hash_keys.addrs.v6addrs.src = keys.addrs.v6addrs.src;
> +			hash_keys.addrs.v6addrs.dst = keys.addrs.v6addrs.dst;

Shouldn't you add:

hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;

? Otherwise flow_hash_from_keys() will not consistentify the ports and the addresses and it will also not use the addresses for the hash computation. It's missing from fib_multipath_hash() as well, so I might be missing something.

> +			hash_keys.ports.src = keys.ports.src;
> +			hash_keys.ports.dst = keys.ports.dst;
> +			hash_keys.tags.flow_label = keys.tags.flow_label;

Why are you using the flow label?

> +			hash_keys.basic.ip_proto = keys.basic.ip_proto;
> +		} else {
> +			memset(&hash_keys, 0, sizeof(hash_keys));
> +			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> +			hash_keys.addrs.v6addrs.src = fl6->saddr;
> +			hash_keys.addrs.v6addrs.dst = fl6->daddr;
> +			hash_keys.ports.src = fl6->fl6_sport;
> +			hash_keys.ports.dst = fl6->fl6_dport;
> +			hash_keys.basic.ip_proto = fl6->flowi6_proto;
> +		}
>  	}
>  	mhash = flow_hash_from_keys(&hash_keys);
>
> @@ -1858,7 +1892,7 @@ void ip6_route_input(struct sk_buff *skb)
>  	if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX))
>  		fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id;
>  	if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6))
> -		fl6.mp_hash = rt6_multipath_hash(&fl6, skb);
> +		fl6.mp_hash = rt6_multipath_hash(net, &fl6, skb);
>  	skb_dst_drop(skb);
>  	skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags));
>  }
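To make the policy split discussed above concrete, here is a toy model of the two hash policies: policy 0 mixes only L3 fields (addresses, flow label, protocol), while policy 1 folds in the L4 ports. Everything here is illustrative — the key struct is truncated and the mixing function is an arbitrary stand-in for flow_hash_from_keys()/jhash, chosen only to be deterministic and injective per step.

```c
#include <assert.h>
#include <stdint.h>

/* Truncated flow keys, for illustration only (real IPv6 addresses
 * are 128-bit). */
struct flow6_keys {
	uint32_t saddr, daddr;
	uint32_t flow_label;
	uint16_t sport, dport;
	uint8_t  ip_proto;
};

/* Simple invertible mixing step: xor, odd multiply, xor-shift. */
static uint32_t mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;	/* golden-ratio multiplier */
	return h ^ (h >> 16);
}

/* policy 0: L3-only hash; policy 1: 5-tuple hash. */
static uint32_t multipath_hash(const struct flow6_keys *k, int policy)
{
	uint32_t h = mix(mix(0x12345678u, k->saddr), k->daddr);

	h = mix(h, k->ip_proto);
	if (policy == 0)
		h = mix(h, k->flow_label);
	else
		h = mix(h, ((uint32_t)k->sport << 16) | k->dport);
	return h;
}
```

The observable difference is exactly the one the sysctl controls: under policy 0 two flows that differ only in ports hash to the same next hop, under policy 1 they may spread across paths.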
Re: [PATCH v2 iproute2-next 0/5] bridge: json and color support
On 2/20/18 12:24 PM, Stephen Hemminger wrote:
> From: Stephen Hemminger
>
> This set of patches adds color and full JSON support to the bridge command.
>
> The output format for the bridge link command changes so that
> $ bridge link show
> and
> $ ip link show
> use the same basic format.
>
> The "-c" flag to bridge changes from a shortened form of "-compressvlan"
> to a shortened form of "-color". Once again this is so that the ip
> and bridge commands take similar options.
>
> Lastly the JSON output format changes slightly, but this
> could not impact any real user, because in several cases
> the current format was invalid JSON!
>
> v2
>   rebase to updated iproute2-next
>   use common pretty flag
>   use common print_name_and_link
>
> Stephen Hemminger (5):
>   bridge: implement json pretty print flag
>   bridge: colorize output and use JSON print library
>   bridge: add json support for link command
>   bridge: update man page for new color and json changes
>   ip: always print interface name in color
>
>  bridge/br_common.h |   4 +-
>  bridge/bridge.c    |  16 ++-
>  bridge/fdb.c       | 279 -
>  bridge/link.c      | 245 +++-
>  bridge/mdb.c       | 354 ++---
>  bridge/vlan.c      | 279 -
>  include/utils.h    |   2 +-
>  ip/ipaddress.c     |   4 +-
>  lib/utils.c        |   4 +-
>  man/man8/bridge.8  |  15 ++-
>  10 files changed, 521 insertions(+), 681 deletions(-)

series applied to iproute2-next.