Re: [PATCH v7 03/37] sparc: io: To use the define of ioremap_[nocache|wc|wb] in asm-generic/io.h

2018-02-21 Thread Greentime Hu
2018-02-14 22:43 GMT+08:00 Arnd Bergmann :
> On Tue, Feb 13, 2018 at 10:09 AM, Greentime Hu  wrote:
>> A commit for the nds32 architecture bootstrap("asm-generic/io.h: move
>> ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef CONFIG_MMU")
>> will move the ioremap_nocache out of the CONFIG_MMU ifdef. This means that
>> in order to suppress re-definition errors we need to remove the #define
>> in io_32.h.
>>
>> Also, the change adds a prototype for ioremap where size is size_t and
>> offset is phys_addr_t so fix that as well.
>>
>> Signed-off-by: Greentime Hu 
>
> This patch should have been addressed to the sparclinux mailing list to
> the maintainers can see it, otherwise they are unlikely to notice.
>
> Added it to Cc now.
>
> Can you confirm that the patches are ordered correctly in your series so that
> at no point, sparc is in a state that fails to be build cleanly?
>
> If not, this may have to get merged into the other patch.

Hi, Arnd:

These 2 patch will cause sparc building error in any order.

commit af84603e339a6832052071aca1f2a16b8963cc2e
Author: Greentime Hu 
Date:   Tue Feb 13 17:09:07 2018 +0800

sparc: io: To use the define of ioremap_[nocache|wc|wb] in asm-generic/io.h

A commit for the nds32 architecture bootstrap("asm-generic/io.h: move
ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef
CONFIG_MMU")
will move the ioremap_nocache out of the CONFIG_MMU ifdef. This means that
in order to suppress re-definition errors we need to remove the #define
in io_32.h.

Also, the change adds a prototype for ioremap where size is size_t and
offset is phys_addr_t so fix that as well.

Signed-off-by: Greentime Hu 

commit 1995b30ea6628b261192662741940c22ac978884
Author: Greentime Hu 
Date:   Tue Feb 13 17:09:06 2018 +0800

asm-generic/io.h: move
ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef
CONFIG_MMU

It allows some architectures to use this generic macro instead of
defining theirs.

Signed-off-by: Vincent Chen 
Signed-off-by: Greentime Hu 
Acked-by: Arnd Bergmann 

Should I merge them together like this?

asm-generic/io.h: move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of
ifndef CONFIG_MMU

It allows some architectures to use this generic macro instead of
defining theirs.

sparc: io: To use the define of ioremap_[nocache|wc|wb] in asm-generic/io.h
It will move the ioremap_nocache out of the CONFIG_MMU ifdef. This means that
in order to suppress re-definition errors we need to remove the #define
in arch/sparc/include/asm/io_32.h. Also, the change adds a prototype for
ioremap where size is size_t and offset is phys_addr_t so fix that as well.

Signed-off-by: Greentime Hu 


Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-21 Thread Roman Kapl

On 02/21/2018 08:45 AM, Phil Sutter wrote:

Hi Roman,

On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:

So far, if the filter was too large to fit in the allocated skb, the
kernel did not return any error and stopped dumping. Modify the dumper
so that it returns -EMSGSIZE when a filter fails to dump and it is the
first filter in the skb. If we are not first, we will get a next chance
with more room.

I understand this is pretty near to being an API change, but the
original design (silent truncation) can be considered a bug.

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to

I'm curious, what does it lead to? :)

Thanks, Phil


tc will dump all filters up to the big one, then it will stop silently. 
So it will seem as if you have less filters.


The new behavior is the same, but tc will at leas print the EMSGSIZE 
error. It does not handle it in any other way.




Re: [PATCH 2/3] net: Make cleanup_list and net::cleanup_list of llist type

2018-02-21 Thread Kirill Tkhai
On 20.02.2018 22:42, Cong Wang wrote:
> On Mon, Feb 19, 2018 at 1:58 AM, Kirill Tkhai  wrote:
>>  void __put_net(struct net *net)
>>  {
>> /* Cleanup the network namespace in process context */
>> -   unsigned long flags;
>> -
>> -   spin_lock_irqsave(_list_lock, flags);
>> -   list_add(>cleanup_list, _list);
>> -   spin_unlock_irqrestore(_list_lock, flags);
>> -
>> +   llist_add(>cleanup_list, _list);
>> queue_work(netns_wq, _cleanup_work);
>>  }
> 
> Is llist safe against IRQ too?

Yes, it's safe and it's aimed for the cases like this. There is no "locked"
state like spinlock has, there is single cmpxchg().

You may find examples it's used in ./kernel directory.

Kirill


Re: WARNING: ODEBUG bug in __queue_work

2018-02-21 Thread Dmitry Vyukov
On Wed, Feb 21, 2018 at 1:59 AM, syzbot
 wrote:
> Hello,
>
> syzbot hit the following crash on upstream commit
> 1a2a7d3ee659e477e0768ac3fc7579794f89071b (Fri Feb 16 17:11:30 2018 +)
> Merge tag 'sound-4.16-rc2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
>
> So far this crash happened 2 times on upstream.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.


This is already fixed:

#syz fix: netfilter: IDLETIMER: be syzkaller friendly



> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+8288e492017b059ff...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> x_tables: ip6_tables: TCPMSS target: only valid for protocol 6
> x_tables: ip6_tables: TCPMSS target: only valid for protocol 6
> x_tables: ip6_tables: TCPMSS target: only valid for protocol 6
> [ cut here ]
> ODEBUG: activate not available (active state 0) object type: work_struct
> hint: 0x8
> WARNING: CPU: 0 PID: 4178 at lib/debugobjects.c:291
> debug_print_object+0x166/0x220 lib/debugobjects.c:288
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 0 PID: 4178 Comm: syzkaller885166 Not tainted 4.16.0-rc1+ #315
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957
> RIP: 0010:debug_print_object+0x166/0x220 lib/debugobjects.c:288
> RSP: 0018:8801db407400 EFLAGS: 00010082
> RAX: dc08 RBX: 0005 RCX: 815abdbe
> RDX: 0100 RSI: 11003b680e30 RDI: 11003b680e05
> RBP: 8801db407440 R08:  R09: 11003b680dd7
> R10: 8801db407300 R11: 0002 R12: 0001
> R13: 86b14d80 R14: 86007d60 R15: 8147ac00
>  debug_object_activate+0x390/0x730 lib/debugobjects.c:475
>  debug_work_activate kernel/workqueue.c:493 [inline]
>  __queue_work+0x163/0x1230 kernel/workqueue.c:1382
>  queue_work_on+0x16a/0x1c0 kernel/workqueue.c:1488
>  queue_work include/linux/workqueue.h:488 [inline]
>  schedule_work include/linux/workqueue.h:546 [inline]
>  idletimer_tg_expired+0x44/0x60 net/netfilter/xt_IDLETIMER.c:116
>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
>  expire_timers kernel/time/timer.c:1363 [inline]
>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
>  run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  invoke_softirq kernel/softirq.c:365 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:541 [inline]
>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>  apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:796
>  
> RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:777
> [inline]
> RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160
> [inline]
> RIP: 0010:_raw_spin_unlock_irqrestore+0x5e/0xba
> kernel/locking/spinlock.c:184
> RSP: 0018:8801b1a7f3c8 EFLAGS: 0282 ORIG_RAX: ff12
> RAX: dc00 RBX: 0282 RCX: 0006
> RDX: 10d592d5 RSI: 110035eb5d5b RDI: 0282
> RBP: 8801b1a7f3d8 R08: 11003634fe47 R09: 
> R10:  R11:  R12: 86cfcbe0
> R13: 0282 R14: ed003634fe8b R15: 8801b33f25e8
>  fill_pool lib/debugobjects.c:119 [inline]
>  __debug_object_init+0xa7a/0x1040 lib/debugobjects.c:339
>  debug_object_init+0x17/0x20 lib/debugobjects.c:391
>  __init_work+0x2b/0x60 kernel/workqueue.c:506
>  idletimer_tg_create net/netfilter/xt_IDLETIMER.c:152 [inline]
>  idletimer_tg_checkentry+0x691/0xb00 net/netfilter/xt_IDLETIMER.c:213
>  xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:850
>  check_target net/ipv6/netfilter/ip6_tables.c:533 [inline]
>  find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:575
>  translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:744
>  do_replace net/ipv6/netfilter/ip6_tables.c:1160 [inline]
>  do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1686
>  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
>  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
>  

Re: [PATCH net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins

2018-02-21 Thread kbuild test robot
Hi Andrew,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Andrew-Lunn/net-dsa-mv88e6xxx-scratch-registers-and-external-MDIO-pins/20180221-150325
config: i386-randconfig-i0-201807 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net/dsa/mv88e6xxx/serdes.o: In function 
`mv88e6xxx_g2_scratch_gpio_set_ext_smi':
>> drivers/net/dsa/mv88e6xxx/global2.h:478: multiple definition of 
>> `mv88e6xxx_g2_scratch_gpio_set_ext_smi'
   drivers/net/dsa/mv88e6xxx/chip.o:drivers/net/dsa/mv88e6xxx/global2.h:478: 
first defined here

vim +478 drivers/net/dsa/mv88e6xxx/global2.h

   475  
   476  int mv88e6xxx_g2_scratch_gpio_set_ext_smi(struct mv88e6xxx_chip *chip,
   477bool external)
 > 478  {
   479  return -EOPNOTSUPP;
   480  }
   481  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-21 Thread Roman Kapl

On 02/21/2018 09:42 AM, Phil Sutter wrote:

Hi Roman,

On Wed, Feb 21, 2018 at 09:38:52AM +0100, Roman Kapl wrote:

On 02/21/2018 08:45 AM, Phil Sutter wrote:

On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to

I'm curious, what does it lead to? :)

Thanks, Phil

tc will dump all filters up to the big one, then it will stop silently.
So it will seem as if you have less filters.

The new behavior is the same, but tc will at leas print the EMSGSIZE
error. It does not handle it in any other way.

I got that, yes. Though your commit message stops mid-sentence and I
wondered what the dynamic buffer allocation "in turn leads to".

Thanks, Phil


Aaah... It leads to "smaller SKBs in kernel". I've lost that line in v2.

Should I send a v3 or something, David?

Sorry, Roman Kapl



Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jiri Pirko
Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>> stuck with this ugly thing forever...
>> 
>> Could you at least make some common code that is shared in between
>> netvsc and virtio_net so this is handled in exacly the same way in both?
>
>IMHO netvsc is a vendor specific driver which made a mistake on what
>behaviour it provides (or tried to align itself with Windows SR-IOV).
>Let's not make a far, far more commonly deployed and important driver
>(virtio) bug-compatible with netvsc.

Yeah. netvsc solution is a dangerous precedent here and in my opinition
it was a huge mistake to merge it. I personally would vote to unmerge it
and make the solution based on team/bond.


>
>To Jiri's initial comments, I feel the same way, in fact I've talked to
>the NetworkManager guys to get auto-bonding based on MACs handled in
>user space.  I think it may very well get done in next versions of NM,
>but isn't done yet.  Stephen also raised the point that not everybody is
>using NM.

Can be done in NM, networkd or other network management tools.
Even easier to do this in teamd and let them all benefit.

Actually, I took a stab to implement this in teamd. Took me like an hour
and half.

You can just run teamd with config option "kidnap" like this:
# teamd/teamd -c '{"kidnap": true }'

Whenever teamd sees another netdev to appear with the same mac as his,
or whenever teamd sees another netdev to change mac to his,
it enslaves it.

Here's the patch (quick and dirty):

Subject: [patch teamd] teamd: introduce kidnap feature

Signed-off-by: Jiri Pirko 
---
 include/team.h |  7 +++
 libteam/ifinfo.c   | 20 
 teamd/teamd.c  | 17 +
 teamd/teamd.h  |  5 +
 teamd/teamd_events.c   | 17 +
 teamd/teamd_ifinfo_watch.c |  9 +
 teamd/teamd_per_port.c |  7 ++-
 7 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/team.h b/include/team.h
index 9ae517d..b0c19c8 100644
--- a/include/team.h
+++ b/include/team.h
@@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct 
team_handle *th,
 #define team_for_each_ifinfo(ifinfo, th)   \
for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;   \
 ifinfo = team_get_next_ifinfo(th, ifinfo))
+
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+ struct team_ifinfo *ifinfo);
+#define team_for_each_unlinked_ifinfo(ifinfo, th)  \
+   for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;  \
+ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
+
 /* ifinfo getters */
 bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
 uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
index 5c32a9c..8f9548e 100644
--- a/libteam/ifinfo.c
+++ b/libteam/ifinfo.c
@@ -494,6 +494,26 @@ struct team_ifinfo *team_get_next_ifinfo(struct 
team_handle *th,
return NULL;
 }
 
+/**
+ * @param th   libteam library context
+ * @param ifinfo   ifinfo structure
+ *
+ * @details Get next unlinked ifinfo in list.
+ *
+ * @return Ifinfo next to ifinfo passed.
+ **/
+TEAM_EXPORT
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+ struct team_ifinfo *ifinfo)
+{
+   do {
+   ifinfo = list_get_next_node_entry(>ifinfo_list, ifinfo, 
list);
+   if (ifinfo && !ifinfo->linked)
+   return ifinfo;
+   } while (ifinfo);
+   return NULL;
+}
+
 /**
  * @param ifinfo   ifinfo structure
  *
diff --git a/teamd/teamd.c b/teamd/teamd.c
index aac2511..069c7f0 100644
--- a/teamd/teamd.c
+++ b/teamd/teamd.c
@@ -926,8 +926,25 @@ static int teamd_event_watch_port_added(struct 
teamd_context *ctx,
return 0;
 }
 
+static int teamd_event_watch_unlinked_hwaddr_changed(struct teamd_context *ctx,
+struct team_ifinfo *ifinfo,
+void *priv)
+{
+   int err;
+   bool kidnap;
+
+   err = teamd_config_bool_get(ctx, , "$.kidnap");
+   if (err || !kidnap ||
+   ctx->hwaddr_len != team_get_ifinfo_hwaddr_len(ifinfo) ||
+   memcmp(team_get_ifinfo_hwaddr(ifinfo),
+  ctx->hwaddr, ctx->hwaddr_len))
+   return 0;
+   return teamd_port_add(ctx, team_get_ifinfo_ifindex(ifinfo));
+}
+
 static const struct teamd_event_watch_ops teamd_port_watch_ops = {
.port_added = teamd_event_watch_port_added,
+   .unlinked_hwaddr_changed = teamd_event_watch_unlinked_hwaddr_changed,
 };
 
 static int teamd_port_watch_init(struct teamd_context *ctx)

Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-21 Thread Phil Sutter
Hi Roman,

On Wed, Feb 21, 2018 at 09:38:52AM +0100, Roman Kapl wrote:
> On 02/21/2018 08:45 AM, Phil Sutter wrote:
> > On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:
> >> So far, if the filter was too large to fit in the allocated skb, the
> >> kernel did not return any error and stopped dumping. Modify the dumper
> >> so that it returns -EMSGSIZE when a filter fails to dump and it is the
> >> first filter in the skb. If we are not first, we will get a next chance
> >> with more room.
> >>
> >> I understand this is pretty near to being an API change, but the
> >> original design (silent truncation) can be considered a bug.
> >>
> >> Note: The error case can happen pretty easily if you create a filter
> >> with 32 actions and have 4kb pages. Also recent versions of iproute try
> >> to be clever with their buffer allocation size, which in turn leads to
> > I'm curious, what does it lead to? :)
> >
> > Thanks, Phil
> 
> tc will dump all filters up to the big one, then it will stop silently. 
> So it will seem as if you have less filters.
> 
> The new behavior is the same, but tc will at leas print the EMSGSIZE 
> error. It does not handle it in any other way.

I got that, yes. Though your commit message stops mid-sentence and I
wondered what the dynamic buffer allocation "in turn leads to".

Thanks, Phil


Re: [PATCH 1/3] net: Kill net_mutex

2018-02-21 Thread Kirill Tkhai
Hi, Stephen,

On 21.02.2018 02:18, Stephen Hemminger wrote:
> On Mon, 19 Feb 2018 12:58:38 +0300
> Kirill Tkhai  wrote:
> 
>> +struct list_headexit_list;  /* To linked to call pernet exit
>> + * methods on dead net (net_sem
>> + * read locked), or to 
>> unregister
>> + * pernet ops (net_sem wr 
>> locked).
>> + */
> 
> Sorry, that comment is completely unparseable.
> Either you know what it does, and therefore comment is unnecessary
> Or change comment to a valid explanation of the semantics of the list.
> 
> Maybe comments about locking model are best left to where
> it is used in the code.

Let's improve it :) It's used to call pernet exit methods, and net ns logic
guarantees, we never call exit methods for the same net in parallel. How
about writing this directly without mention of net_sem? Something like this:

/* To link net to call pernet exit methods */

Or maybe you have better variant?

Thanks,
Kirill


Re: [PATCH net-next 0/5] net/smc: fixes 2018-02-21

2018-02-21 Thread Stefan Raspl
On 21.02.2018 12:32, Ursula Braun wrote:
> Dave,
> 
> here are some smc-patches for net-next. Besides cleanups, the link id
> field of the LLC confirm link reply is changed to comply with RFC7609.
> 
> Thanks, Ursula

Looks good to me!

Ciao,
Stefan



[PATCH net] tcp_bbr: better deal with suboptimal GSO

2018-02-21 Thread Eric Dumazet
From: Eric Dumazet 

BBR uses tcp_tso_autosize() in an attempt to probe what would be the
burst sizes and to adjust cwnd in bbr_target_cwnd() with following
gold formula :

/* Allow enough full-sized skbs in flight to utilize end systems. */
cwnd += 3 * bbr->tso_segs_goal;

But GSO can be lacking or be constrained to very small
units (ip link set dev ... gso_max_segs 2)

What we really want is to have enough packets in flight so that both
GSO and GRO are efficient.

So in the case GSO is off or downgraded, we still want to have the same
number of packets in flight as if GSO/TSO was fully operational, so
that GRO can hopefully be working efficiently.

To fix this issue, we make tcp_tso_autosize() unaware of
sk->sk_gso_max_segs

Only tcp_tso_segs() has to enforce the gso_max_segs limit.

Tested:

ethtool -K eth0 tso off gso off
tc qd replace dev eth0 root pfifo_fast

Before patch:
for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
691  (ss -temoi shows cwnd is stuck around 6 )
667
651
631
517

After patch :
# for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
   1733 (ss -temoi shows cwnd is around 386 )
   1778
   1746
   1781
   1718

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Eric Dumazet 
Reported-by: Oleksandr Natalenko 
---
 net/ipv4/tcp_output.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
b2bca373f8bee35267df49b5947a6793fed71a12..6818042cd8a9a1778f54637861647091afd9a769
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1730,7 +1730,7 @@ u32 tcp_tso_autosize(const struct sock *sk, unsigned int 
mss_now,
 */
segs = max_t(u32, bytes / mss_now, min_tso_segs);
 
-   return min_t(u32, segs, sk->sk_gso_max_segs);
+   return segs;
 }
 EXPORT_SYMBOL(tcp_tso_autosize);
 
@@ -1742,9 +1742,10 @@ static u32 tcp_tso_segs(struct sock *sk, unsigned int 
mss_now)
const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
 
-   return tso_segs ? :
-   tcp_tso_autosize(sk, mss_now,
-sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
+   if (!tso_segs)
+   tso_segs = tcp_tso_autosize(sk, mss_now,
+   sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
+   return min_t(u32, tso_segs, sk->sk_gso_max_segs);
 }
 
 /* Returns the portion of skb which can be sent right away */



Re: [PATCH net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins

2018-02-21 Thread Andrew Lunn
On Wed, Feb 21, 2018 at 05:40:29PM +0800, kbuild test robot wrote:
> Hi Andrew,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on net-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Andrew-Lunn/net-dsa-mv88e6xxx-scratch-registers-and-external-MDIO-pins/20180221-150325
> config: i386-randconfig-i0-201807 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>drivers/net/dsa/mv88e6xxx/serdes.o: In function 
> `mv88e6xxx_g2_scratch_gpio_set_ext_smi':
> >> drivers/net/dsa/mv88e6xxx/global2.h:478: multiple definition of 
> >> `mv88e6xxx_g2_scratch_gpio_set_ext_smi'
>drivers/net/dsa/mv88e6xxx/chip.o:drivers/net/dsa/mv88e6xxx/global2.h:478: 
> first defined here
> 
> vim +478 drivers/net/dsa/mv88e6xxx/global2.h
> 
>475
>476int mv88e6xxx_g2_scratch_gpio_set_ext_smi(struct mv88e6xxx_chip 
> *chip,
>477  bool external)
>  > 478{
>479return -EOPNOTSUPP;
>480}

The stub function needs to be inline. I will fix up in v2.

Andrew


Re: [PATCH] netlink: put module reference if dump start fails

2018-02-21 Thread Jason A. Donenfeld
On Wed, Feb 21, 2018 at 6:47 AM, Eric Dumazet  wrote:
>> This probably should be queued up for stable.
>
> When was the bug added ? This would help a lot stable teams ...

This needs to be backported to 4.16-rc0+, 4.15+, 4.14+, 4.13.14+, and 4.9.63+.


Re: [PATCH] netlink: put module reference if dump start fails

2018-02-21 Thread Eric Dumazet
On Wed, 2018-02-21 at 15:54 +0100, Jason A. Donenfeld wrote:
> On Wed, Feb 21, 2018 at 6:47 AM, Eric Dumazet  wrote:
> > > This probably should be queued up for stable.
> > 
> > When was the bug added ? This would help a lot stable teams ...
> 
> This needs to be backported to 4.16-rc0+, 4.15+, 4.14+, 4.13.14+, and 4.9.63+.

I guess I should have asked you to provide an appropriate Fixes: tag,
clearly identifying the commit that added the bug.

Fixes: 12-digit-sha1 ("patch title")

Thanks.



Qualcomm rmnet driver and qmi_wwan

2018-02-21 Thread Daniele Palmas
Hello,

in rmnet kernel documentation I read:

"This driver can be used to register onto any physical network device in
IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator."

Does this mean that it can be used in association with the qmi_wwan driver?

If yes, can someone give me an hint on the steps to follow?

If not, does anyone know if it is possible to modify qmi_wwan in order
to take advantage of the features provided by the rmnet driver?

In this case hint on the changes for modifying qmi_wwan are welcome.

Thanks in advance,
Daniele


Re: [PATCH v7 02/37] asm-generic/io.h: move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef CONFIG_MMU

2018-02-21 Thread Arnd Bergmann
On Wed, Feb 21, 2018 at 12:21 PM, Greentime Hu  wrote:
> 2018-02-16 18:47 GMT+08:00 kbuild test robot :

>
> From: Greentime Hu 
> Date: Wed, 21 Feb 2018 14:21:23 +0800
> Subject: [PATCH] xtensa: add ioremap_nocache declaration before include
>  asm-generic/io.h.
>
> A future commit for the nds32 architecture bootstrap("asm-generic/io.h:
> move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef
> CONFIG_MMU") will move the ioremap_nocache out of the CONFIG_MMU ifdef.
> This means that in order to suppress re-definition errors we need to
> setup #define's before importing asm-generic/io.h.
>
> Signed-off-by: Greentime Hu 

Don't you need to override both ioremap and ioremap_nocache?

> ---
>  arch/xtensa/include/asm/io.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/xtensa/include/asm/io.h b/arch/xtensa/include/asm/io.h
> index c38e5a732d86..acc5bb2cf1c7 100644
> --- a/arch/xtensa/include/asm/io.h
> +++ b/arch/xtensa/include/asm/io.h
> @@ -52,6 +52,7 @@ static inline void __iomem *ioremap_cache(unsigned
> long offset,
> return xtensa_ioremap_cache(offset, size);
>  }
>  #define ioremap_cache ioremap_cache
> +#define ioremap_nocache ioremap_nocache
>
>  #define ioremap_wc ioremap_nocache


[PATCH iproute2-next] rdma: Add batch command support

2018-02-21 Thread Leon Romanovsky
From: Leon Romanovsky 

Implement an option (-b) to execute RDMAtool commands
from supplied file. This follows the same model as
in use for ip and devlink tools, by expecting
every new command to be on new line.

These commands are expected to be without any -*
(e.g. -d, -j, e.t.c) global flags, which should be
called externally.

Signed-off-by: Leon Romanovsky 
---
 man/man8/rdma.8 | 16 +
 rdma/rdma.c | 70 ++---
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
index 798b33d3..fba77693 100644
--- a/man/man8/rdma.8
+++ b/man/man8/rdma.8
@@ -11,6 +11,12 @@ rdma \- RDMA tool
 .BR help " }"
 .sp

+.ti -8
+.B rdma
+.RB "[ " -force " ] "
+.BI "-batch " filename
+.sp
+
 .ti -8
 .IR OBJECT " := { "
 .BR dev " | " link " }"
@@ -31,6 +37,16 @@ Print the version of the
 .B rdma
 tool and exit.

+.TP
+.BR "\-b", " \-batch " 
+Read commands from provided file or standard input and invoke them.
+First failure will cause termination of rdma.
+
+.TP
+.BR "\-force"
+Don't terminate rdma on errors in batch mode.
+If there were any errors during execution of the commands, the application 
return code will be non zero.
+
 .TP
 .BR "\-d" , " --details"
 Otuput detailed information.
diff --git a/rdma/rdma.c b/rdma/rdma.c
index 19608f41..ab8f98d2 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -15,8 +15,9 @@
 static void help(char *name)
 {
pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
+  "   %s [ -f[orce] ] -b[atch] filename\n"
   "where  OBJECT := { dev | link | resource | help }\n"
-  "   OPTIONS := { -V[ersion] | -d[etails] | -j[son] | 
-p[retty]}\n", name);
+  "   OPTIONS := { -V[ersion] | -d[etails] | -j[son] | 
-p[retty]}\n", name, name);
 }

 static int cmd_help(struct rd *rd)
@@ -25,7 +26,7 @@ static int cmd_help(struct rd *rd)
return 0;
 }

-static int rd_cmd(struct rd *rd)
+static int rd_cmd(struct rd *rd, int argc, char **argv)
 {
const struct rd_cmd cmds[] = {
{ NULL, cmd_help },
@@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
{ 0 }
};

+   rd->argc = argc;
+   rd->argv = argv;
+
return rd_exec_cmd(rd, cmds, "object");
 }

-static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
+static int rd_batch(struct rd *rd, const char *name, bool force)
+{
+   char *line = NULL;
+   size_t len = 0;
+   int ret = 0;
+
+   if (name && strcmp(name, "-") != 0) {
+   if (!freopen(name, "r", stdin)) {
+   pr_err("Cannot open file \"%s\" for reading: %s\n",
+  name, strerror(errno));
+   return errno;
+   }
+   }
+
+   cmdlineno = 0;
+   while (getcmdline(, , stdin) != -1) {
+   char *largv[512];
+   int largc;
+
+   largc = makeargs(line, largv, 100);
+   if (!largc)
+   continue;   /* blank line */
+
+   ret = rd_cmd(rd, largc, largv);
+   if (ret) {
+   pr_err("Command failed %s:%d\n", name, cmdlineno);
+   if (!force)
+   break;
+   }
+   }
+
+   free(line);
+
+   return ret;
+}
+
+static int rd_init(struct rd *rd, char *filename)
 {
uint32_t seq;
int ret;

rd->filename = filename;
-   rd->argc = argc;
-   rd->argv = argv;
INIT_LIST_HEAD(>dev_map_list);
INIT_LIST_HEAD(>filter_list);

@@ -87,11 +125,15 @@ int main(int argc, char **argv)
{ "json",   no_argument,NULL, 'j' },
{ "pretty", no_argument,NULL, 'p' },
{ "details",no_argument,NULL, 'd' },
+   { "force",  no_argument,NULL, 'f' },
+   { "batch",  required_argument,  NULL, 'b' },
{ NULL, 0, NULL, 0 }
};
+   const char *batch_file = NULL;
bool pretty_output = false;
bool show_details = false;
bool json_output = false;
+   bool force = false;
char *filename;
struct rd rd;
int opt;
@@ -99,7 +141,7 @@ int main(int argc, char **argv)

filename = basename(argv[0]);

-   while ((opt = getopt_long(argc, argv, "Vhdpj",
+   while ((opt = getopt_long(argc, argv, ":Vhdpjfb:",
  long_options, NULL)) >= 0) {
switch (opt) {
case 'V':
@@ -115,9 +157,18 @@ int main(int argc, char **argv)
case 'j':
json_output = true;
break;
+   case 'f':
+   force = true;
+ 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-02-21 Thread Tommi Rantala

On 20.02.2018 18:26, Neil Horman wrote:

On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:

On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
 wrote:

On 19.02.2018 20:59, Dmitry Vyukov wrote:

Is this meant to be fixed already? I am still seeing this on the
latest upstream tree.



These two commits are in v4.16-rc1:

commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
Author: Tommi Rantala 
Date:   Mon Feb 5 21:48:14 2018 +0200

 sctp: fix dst refcnt leak in sctp_v4_get_dst
...
 Fixes: 410f03831 ("sctp: add routing output fallback")
 Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary
addresses")


commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
Author: Alexey Kodanev 
Date:   Mon Feb 5 15:10:35 2018 +0300

 sctp: fix dst refcnt leak in sctp_v6_get_dst()
...
 Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary
addresses for ipv6")


I guess we missed something if it's still reproducible.

I can check it later this week, unless someone else beat me to it.


Hi Tommi,

Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
another one then. But I am still seeing these:

[   58.799130] unregister_netdevice: waiting for lo to become free.
Usage count = 4
[   60.847138] unregister_netdevice: waiting for lo to become free.
Usage count = 4
[   62.895093] unregister_netdevice: waiting for lo to become free.
Usage count = 4
[   64.943103] unregister_netdevice: waiting for lo to become free.
Usage count = 4

on upstream tree pulled ~12 hours ago.


Can you write a systemtap script to probe dev_hold, and dev_put, printing out a
backtrace if the device name matches "lo".  That should tell us definitively if
the problem is in the same location or not


Hi Dmitry, I tested with the reproducer and the kernel .config file that 
you sent in the first email in this thread:


With 4.16-rc2 unable to reproduce.

With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting 
for lo to become free. Usage count = 3"


With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in 
sctp_v6_get_dst()" cherry-picked on top, unable to reproduce.



Is syzkaller doing something else now to trigger the bug...?
Can you still trigger the bug with the same reproducer?


Tommi


Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO

2018-02-21 Thread Paolo Abeni
Hi,

On Wed, 2018-02-21 at 06:43 -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> burst sizes and to adjust cwnd in bbr_target_cwnd() with following
> gold formula :
> 
> /* Allow enough full-sized skbs in flight to utilize end systems. */
> cwnd += 3 * bbr->tso_segs_goal;
> 
> But GSO can be lacking or be constrained to very small
> units (ip link set dev ... gso_max_segs 2)
> 
> What we really want is to have enough packets in flight so that both
> GSO and GRO are efficient.
> 
> So in the case GSO is off or downgraded, we still want to have the same
> number of packets in flight as if GSO/TSO was fully operational, so
> that GRO can hopefully be working efficiently.
> 
> To fix this issue, we make tcp_tso_autosize() unaware of
> sk->sk_gso_max_segs
> 
> Only tcp_tso_segs() has to enforce the gso_max_segs limit.
> 
> Tested:
> 
> ethtool -K eth0 tso off gso off
> tc qd replace dev eth0 root pfifo_fast
> 
> Before patch:
> for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> 691  (ss -temoi shows cwnd is stuck around 6 )
> 667
> 651
> 631
> 517
> 
> After patch :
> # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
>1733 (ss -temoi shows cwnd is around 386 )
>1778
>1746
>1781
>1718
> 
> Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> Signed-off-by: Eric Dumazet 
> Reported-by: Oleksandr Natalenko 
> ---
>  net/ipv4/tcp_output.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> b2bca373f8bee35267df49b5947a6793fed71a12..6818042cd8a9a1778f54637861647091afd9a769
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1730,7 +1730,7 @@ u32 tcp_tso_autosize(const struct sock *sk, unsigned 
> int mss_now,
>*/
>   segs = max_t(u32, bytes / mss_now, min_tso_segs);
>  
> - return min_t(u32, segs, sk->sk_gso_max_segs);
> + return segs;
>  }
>  EXPORT_SYMBOL(tcp_tso_autosize);
>  

Very minor nit, why don't:

return max_t(u32, bytes / mss_now, min_tso_segs);

and drop the 'segs' local variable?

Thanks,

Paolo


Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO

2018-02-21 Thread Eric Dumazet
On Wed, Feb 21, 2018 at 7:01 AM, Paolo Abeni  wrote:
>
> Very minor nit, why don't:
>
> return max_t(u32, bytes / mss_now, min_tso_segs);
>
> and drop the 'segs' local variable?

Simply to ease backports.

We had some constant changes in this function in the past.


Re: [PATCH V7 2/4] sctp: Add ip option support

2018-02-21 Thread Marcelo Ricardo Leitner
On Tue, Feb 20, 2018 at 07:15:27PM +, Richard Haines wrote:
> Add ip option support to allow LSM security modules to utilise CIPSO/IPv4
> and CALIPSO/IPv6 services.
> 
> Signed-off-by: Richard Haines 

LGTM too, thanks!

Acked-by: Marcelo Ricardo Leitner 

> ---
> All SCTP lksctp-tools/src/func_tests run correctly in enforcing mode.
> All "./sctp-tests run" obtained from: https://github.com/sctp/sctp-tests
> pass.
> 
> V7 Changes:
> 1) Log when copy ip options fail for IPv4 and IPv6
> 2) Correct sctp_setsockopt_maxseg() function. Note that the lksctp-tools
> func_tests do not test with struct sctp_assoc_value. Just used simple test
> and okay.
> 3) Move calculation of overheads to sctp_packet_config().
> NOTE: Initially in sctp_packet_reset() I set packet->size and
> packet->overhead to zero (as it is a reset). This was okay for all the
> lksctp-tools function tests, however when running "sctp-tests" ndatshched
> tests it causes these to fail with an st_s.log entry of:
>   sid: 3, expected: 3
>   sid: 3, expected: 3
>   unexpected sid packet !!!
>   sid: 1, expected: 3
> 
> I then found sctp_packet_transmit() relies on setting
> "packet->size = packet->overhead;" to reset size to the current overhead
> after sending packets, hence the comment in sctp_packet_reset()
> 
>  include/net/sctp/sctp.h|  4 +++-
>  include/net/sctp/structs.h |  2 ++
>  net/sctp/chunk.c   | 10 +++---
>  net/sctp/ipv6.c| 45 ++---
>  net/sctp/output.c  | 34 +-
>  net/sctp/protocol.c| 38 ++
>  net/sctp/socket.c  | 11 ---
>  7 files changed, 117 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index f7ae6b0..25c5c87 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -441,9 +441,11 @@ static inline int sctp_list_single_entry(struct 
> list_head *head)
>  static inline int sctp_frag_point(const struct sctp_association *asoc, int 
> pmtu)
>  {
>   struct sctp_sock *sp = sctp_sk(asoc->base.sk);
> + struct sctp_af *af = sp->pf->af;
>   int frag = pmtu;
>  
> - frag -= sp->pf->af->net_header_len;
> + frag -= af->ip_options_len(asoc->base.sk);
> + frag -= af->net_header_len;
>   frag -= sizeof(struct sctphdr) + sctp_datachk_len(>stream);
>  
>   if (asoc->user_frag)
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 03e92dd..ead5fce 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -491,6 +491,7 @@ struct sctp_af {
>   void(*ecn_capable)(struct sock *sk);
>   __u16   net_header_len;
>   int sockaddr_len;
> + int (*ip_options_len)(struct sock *sk);
>   sa_family_t sa_family;
>   struct list_head list;
>  };
> @@ -515,6 +516,7 @@ struct sctp_pf {
>   int (*addr_to_user)(struct sctp_sock *sk, union sctp_addr *addr);
>   void (*to_sk_saddr)(union sctp_addr *, struct sock *sk);
>   void (*to_sk_daddr)(union sctp_addr *, struct sock *sk);
> + void (*copy_ip_options)(struct sock *sk, struct sock *newsk);
>   struct sctp_af *af;
>  };
>  
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 991a530..d726d21 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -171,6 +171,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
> sctp_association *asoc,
>   struct list_head *pos, *temp;
>   struct sctp_chunk *chunk;
>   struct sctp_datamsg *msg;
> + struct sctp_sock *sp;
> + struct sctp_af *af;
>   int err;
>  
>   msg = sctp_datamsg_new(GFP_KERNEL);
> @@ -189,9 +191,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
> sctp_association *asoc,
>   /* This is the biggest possible DATA chunk that can fit into
>* the packet
>*/
> - max_data = asoc->pathmtu -
> -sctp_sk(asoc->base.sk)->pf->af->net_header_len -
> -sizeof(struct sctphdr) - sctp_datachk_len(>stream);
> + sp = sctp_sk(asoc->base.sk);
> + af = sp->pf->af;
> + max_data = asoc->pathmtu - af->net_header_len -
> +sizeof(struct sctphdr) - sctp_datachk_len(>stream) -
> +af->ip_options_len(asoc->base.sk);
>   max_data = SCTP_TRUNC4(max_data);
>  
>   /* If the the peer requested that we authenticate DATA chunks
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index e35d4f7..30a05a8 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -427,6 +427,41 @@ static void sctp_v6_copy_addrlist(struct list_head 
> *addrlist,
>   rcu_read_unlock();
>  }
>  
> +/* Copy over any ip options */
> +static void sctp_v6_copy_ip_options(struct sock *sk, struct sock *newsk)
> +{
> + struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
> + struct 

Re: [PATCH net-next v2 1/1] net: Allow a rule to track originating protocol

2018-02-21 Thread David Miller
From: Donald Sharp 
Date: Tue, 20 Feb 2018 08:55:58 -0500

> Allow a rule that is being added/deleted/modified or
> dumped to contain the originating protocol's id.
> 
> The protocol is handled just like a routes originating
> protocol is.  This is especially useful because there
> is starting to be a plethora of different user space
> programs adding rules.
> 
> Allow the vrf device to specify that the kernel is the originator
> of the rule created for this device.
> 
> Signed-off-by: Donald Sharp 

Looks good, applied, thanks Donald.


Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread David Miller
From: Anders Roxell 
Date: Wed, 21 Feb 2018 22:30:01 +0100

> bpf builds a test program for loading BPF ELF files. Add the executable
> to the .gitignore list.
> 
> Signed-off-by: Anders Roxell 

Acked-by: David S. Miller 


[PATCH net-next 4/7] net: Rename NETEVENT_MULTIPATH_HASH_UPDATE

2018-02-21 Thread David Ahern
Rename NETEVENT_MULTIPATH_HASH_UPDATE to
NETEVENT_IPV4_MPATH_HASH_UPDATE to denote it relates to a change
in the IPv4 hash policy.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 +-
 include/net/netevent.h| 2 +-
 net/ipv4/sysctl_net_ipv4.c| 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 05146970c19c..da8ca721f2fc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2427,7 +2427,7 @@ static int mlxsw_sp_router_netevent_event(struct 
notifier_block *nb,
mlxsw_core_schedule_work(_work->work);
mlxsw_sp_port_dev_put(mlxsw_sp_port);
break;
-   case NETEVENT_MULTIPATH_HASH_UPDATE:
+   case NETEVENT_IPV4_MPATH_HASH_UPDATE:
net = ptr;
 
if (!net_eq(net, _net))
diff --git a/include/net/netevent.h b/include/net/netevent.h
index 40e7bab68490..baee605a94ab 100644
--- a/include/net/netevent.h
+++ b/include/net/netevent.h
@@ -26,7 +26,7 @@ enum netevent_notif_type {
NETEVENT_NEIGH_UPDATE = 1, /* arg is struct neighbour ptr */
NETEVENT_REDIRECT, /* arg is struct netevent_redirect ptr */
NETEVENT_DELAY_PROBE_TIME_UPDATE, /* arg is struct neigh_parms ptr */
-   NETEVENT_MULTIPATH_HASH_UPDATE, /* arg is struct net ptr */
+   NETEVENT_IPV4_MPATH_HASH_UPDATE, /* arg is struct net ptr */
 };
 
 int register_netevent_notifier(struct notifier_block *nb);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 89683d868b37..011de9a20ec6 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -400,7 +400,7 @@ static int proc_fib_multipath_hash_policy(struct ctl_table 
*table, int write,
 
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
if (write && ret == 0)
-   call_netevent_notifiers(NETEVENT_MULTIPATH_HASH_UPDATE, net);
+   call_netevent_notifiers(NETEVENT_IPV4_MPATH_HASH_UPDATE, net);
 
return ret;
 }
-- 
2.11.0



Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-21 Thread Jakub Kicinski
On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote:
> We also have a large range of TCAM based hardware offload outthere
> that will _not_ work with your BPF HW offload infrastructure. What
> this bpf infrastructure pushes into the kernel is just a blob
> expressing things in a very low-level instruction-set: trying to find
> a mapping of that to typical HW intermediate representations in the
> TCAM based HW offload world will be simply crazy.

I'm not sure where the TCAM talk is coming from.  Think much smaller -
cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs.  The
information the verifier is gathering will be crucial for optimizing
those.  Please don't discount the value of being able to use
heterogeneous processing units by the networking stack.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck
On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:
> I haven't checked emails for days and did not realize the new revision
> had already came out. And thank you for the effort, this revision
> really looks to be a step forward towards our use case and is close to
> what we wanted to do. A few questions in line.
>
> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>  wrote:
>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
 Ppatch 2 is in response to the community request for a 3 netdev
 solution.  However, it creates some issues we'll get into in a moment.
 It extends virtio_net to use alternate datapath when available and
 registered. When BACKUP feature is enabled, virtio_net driver creates
 an additional 'bypass' netdev that acts as a master device and controls
 2 slave devices.  The original virtio_net netdev is registered as
 'backup' netdev and a passthru/vf device with the same MAC gets
 registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
 associated with the same 'pci' device.  The user accesses the network
 interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
 as default for transmits when it is available with link up and running.
>>>
>>> Thank you do doing this.
>>>
 We noticed a couple of issues with this approach during testing.
 - As both 'bypass' and 'backup' netdevs are associated with the same
   virtio pci device, udev tries to rename both of them with the same name
   and the 2nd rename will fail. This would be OK as long as the first 
 netdev
   to be renamed is the 'bypass' netdev, but the order in which udev gets
   to rename the 2 netdevs is not reliable.
>>>
>>> Out of curiosity - why do you link the master netdev to the virtio
>>> struct device?
>>
>> The basic idea of all this is that we wanted this to work with an
>> existing VM image that was using virtio. As such we were trying to
>> make it so that the bypass interface takes the place of the original
>> virtio and get udev to rename the bypass to what the original
>> virtio_net was.
>
> Could it made it also possible to take over the config from VF instead
> of virtio on an existing VM image? And get udev rename the bypass
> netdev to what the original VF was. I don't say tightly binding the
> bypass master to only virtio or VF, but I think we should provide both
> options to support different upgrade paths. Possibly we could tweak
> the device tree layout to reuse the same PCI slot for the master
> bypass netdev, such that udev would not get confused when renaming the
> device. The VF needs to use a different function slot afterwards.
> Perhaps we might need to a special multiseat like QEMU device for that
> purpose?
>
> Our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

The other bits get into more complexity then we are ready to handle
for now. I think I might have talked about something similar that I
was referring to as a "virtio-bond" where you would have a PCI/PCIe
tree topology that makes this easier to sort out, and the "virtio-bond
would be used to handle coordination/configuration of a much more
complex interface.

>>
>>> FWIW two solutions that immediately come to mind is to export "backup"
>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>> master like you are doing already.  I think team uses team%d and bond
>>> uses bond%d, soft naming of master devices seems quite natural in this
>>> case.
>>
>> I figured I had overlooked something like that.. Thanks for pointing
>> this out. Okay so I think the phys_port_name approach might resolve
>> the original issue. If I am reading things correctly what we end up
>> with is the master showing up as "ens1" for example and the backup
>> showing up as "ens1nbackup". Am I understanding that right?
>>
>> The problem with the team/bond%d approach is that it creates a new
>> netdevice and so it would require guest configuration changes.
>>
>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>> link is quite neat.
>>
>> I agree. For non-"backup" virio_net devices would it be okay for us to
>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>> behavior could be maintained although the function still exists.
>>
 - When 

Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-21 Thread Florian Fainelli
On 02/21/2018 03:46 PM, Jakub Kicinski wrote:
> On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote:
>> We also have a large range of TCAM based hardware offload outthere
>> that will _not_ work with your BPF HW offload infrastructure. What
>> this bpf infrastructure pushes into the kernel is just a blob
>> expressing things in a very low-level instruction-set: trying to find
>> a mapping of that to typical HW intermediate representations in the
>> TCAM based HW offload world will be simply crazy.
> 
> I'm not sure where the TCAM talk is coming from.  Think much smaller -
> cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs.  The
> information the verifier is gathering will be crucial for optimizing
> those.  Please don't discount the value of being able to use
> heterogeneous processing units by the networking stack.
> 

The only use case that we have a good answer for is when there is no HW
offload capability available, because there, we know that eBPF is our
best possible solution for a software fast path, in large part because
of all the efforts that went into making it both safe and fast.

When there is offloading HW available, there does not appear to be a
perfect answer to this problem of, given a standard Linux utility that
can express any sort of match + action, be it ethtool::rxnfc,
tc/cls_{u32,flower}, nftables, how do I transform that into what makes
most sense to my HW? You could:

- have hardware that understands BPF bytecode directly, great, then you
don't have to do anything, just pass it up the driver baby, oh wait,
it's not that simple, the NFP driver is not small

- transform BPF back into something that your hardware understand, does
that belong in the kernel? Maybe, maybe not

- use a completely different intermediate representation like P4,
brainfuck, I don't know

Maybe first things first, we have at least 3 different programming
interfaces, if not more: ethtool::rxnfc, tc/cls_{u32,flower}, nftables
that are all capable of programming TCAMs and hardware capable of match
+ action, how about we start with having some sort of common library
code that:

- validates input parameters against HW capabilities
- does the adequate transformation from any of these interfaces into a
generic set of input parameters
- define what the appropriate behavior is when programming through all
of these 3 interfaces that ultimately access the same shared piece of
HW, and therefore need to manage resources allocation?


-- 
Florian


Proposal

2018-02-21 Thread melisa mehmet
Hello

Greetings to you and everyone around you please did you get my previous email 
regarding my proposal ?
please let me know if we can work together on this.

Best Reagrds


Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Shuah Khan
On 02/21/2018 03:48 PM, David Miller wrote:
> From: Anders Roxell 
> Date: Wed, 21 Feb 2018 22:30:01 +0100
> 
>> bpf builds a test program for loading BPF ELF files. Add the executable
>> to the .gitignore list.
>>
>> Signed-off-by: Anders Roxell 
> 
> Acked-by: David S. Miller 
> 
> 

Thanks. I will pull this in for 4.16-rc

-- Shuah


Re: [PATCH] bpf: clean up unused-variable warning

2018-02-21 Thread Daniel Borkmann
On 02/20/2018 11:07 PM, Arnd Bergmann wrote:
> The only user of this variable is inside of an #ifdef, causing
> a warning without CONFIG_INET:
> 
> net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set':
> net/core/filter.c:3382:6: error: unused variable 'val' 
> [-Werror=unused-variable]
>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> 
> This replaces the #ifdef with a nicer IS_ENABLED() check that
> makes the code more readable and avoids the warning.
> 
> Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
> Signed-off-by: Arnd Bergmann 

Now applied to bpf, thanks Arnd!


Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Sowmini Varadhan
On (02/21/18 18:45), Willem de Bruijn wrote:
> 
> I do mean returning 0 instead of -EAGAIN if control data is ready.
> Something like
> 
> @@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr
> *msg, size_t size,
> 
> if (!rds_next_incoming(rs, )) {
> if (nonblock) {
> -   ret = -EAGAIN;
> +   ncookies = rds_recvmsg_zcookie(rs, msg);
> +   ret = ncookies ? 0 : -EAGAIN;
> break;
> }

Yes, but you now have an implicit branch based on ncookies, so I'm
not sure it saved all that much? like I said let me revisit this

> By the way, the put_cmsg is unconditional even if the caller did
> not supply msg_control. So it is basically no longer safe to ever
> call read, recv or recvfrom on a socket if zerocopy notifications
> are outstanding.

Wait, I thought put_cmsg already checks for these things. 

> It is possible to check msg_controllen before even deciding whether
> to try to dequeue notifications (and take the lock). I see that this is
> not common. But RDS of all cases seems to do this, in
> rds_notify_queue_get:

yes the comment above that code suggests that this bit of code
was done to avoid calling put_cmsg while holding the rs_lock.

One bit of administrivia though- if I now drop sk_error_queue for
PF_RDS, I'll have to fix selftests in the same patch too, so the
patch will get a bit bulky (and thus a bit more difficult to review).



Bug with 'ip' command where build a stuck list of interfaces

2018-02-21 Thread Rm Beer
Hello.
With 'ip tunnel add tun0 mode ipip dev enp1s0' make two interfaces
called 'tunl0' and 'tun0'.
Late with 'ip tunnel del tun0' remove only 'tun0' and forget the
'tunl0' in the list.
with this you can not do anything else in the list, it can not be
created and it can not be deleted. It is hanging.

# ip link list
:
:
:
4: tunl0@NONE:  mtu 1480 qdisc noop state DOWN mode DEFAULT
group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu
On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
 wrote:
> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:
>> I haven't checked emails for days and did not realize the new revision
>> had already came out. And thank you for the effort, this revision
>> really looks to be a step forward towards our use case and is close to
>> what we wanted to do. A few questions in line.
>>
>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>  wrote:
>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
 On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
> Ppatch 2 is in response to the community request for a 3 netdev
> solution.  However, it creates some issues we'll get into in a moment.
> It extends virtio_net to use alternate datapath when available and
> registered. When BACKUP feature is enabled, virtio_net driver creates
> an additional 'bypass' netdev that acts as a master device and controls
> 2 slave devices.  The original virtio_net netdev is registered as
> 'backup' netdev and a passthru/vf device with the same MAC gets
> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
> associated with the same 'pci' device.  The user accesses the network
> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
> as default for transmits when it is available with link up and running.

 Thank you do doing this.

> We noticed a couple of issues with this approach during testing.
> - As both 'bypass' and 'backup' netdevs are associated with the same
>   virtio pci device, udev tries to rename both of them with the same name
>   and the 2nd rename will fail. This would be OK as long as the first 
> netdev
>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>   to rename the 2 netdevs is not reliable.

 Out of curiosity - why do you link the master netdev to the virtio
 struct device?
>>>
>>> The basic idea of all this is that we wanted this to work with an
>>> existing VM image that was using virtio. As such we were trying to
>>> make it so that the bypass interface takes the place of the original
>>> virtio and get udev to rename the bypass to what the original
>>> virtio_net was.
>>
>> Could it made it also possible to take over the config from VF instead
>> of virtio on an existing VM image? And get udev rename the bypass
>> netdev to what the original VF was. I don't say tightly binding the
>> bypass master to only virtio or VF, but I think we should provide both
>> options to support different upgrade paths. Possibly we could tweak
>> the device tree layout to reuse the same PCI slot for the master
>> bypass netdev, such that udev would not get confused when renaming the
>> device. The VF needs to use a different function slot afterwards.
>> Perhaps we might need to a special multiseat like QEMU device for that
>> purpose?
>>
>> Our case we'll upgrade the config from VF to virtio-bypass directly.
>
> So if I am understanding what you are saying you are wanting to flip
> the backup interface from the virtio to a VF. The problem is that
> becomes a bit of a vendor lock-in solution since it would rely on a
> specific VF driver. I would agree with Jiri that we don't want to go
> down that path. We don't want every VF out there firing up its own
> separate bond. Ideally you want the hypervisor to be able to manage
> all of this which is why it makes sense to have virtio manage this and
> why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virito backup. Can this bypass
interface be made general to take place of any pci device other than
virtio-net? This will be more helpful as the cloud users who has
existing setup on VF interface don't have to recreate it on virtio-net
and VF separately again.

>
> The other bits get into more complexity then we are ready to handle
> for now. I think I might have talked about something similar that I
> was referring to as a "virtio-bond" where you would have a PCI/PCIe
> tree topology that makes this easier to sort out, and the "virtio-bond
> would be used to handle coordination/configuration of a much more
> complex interface.

That was one way to solve this problem but I'd like to see 

[PATCH v2 net-next 2/2] lan743x: Update MAINTAINERS to include lan743x driver

2018-02-21 Thread Bryan Whitehead
Update MAINTAINERS to include lan743x driver

Signed-off-by: Bryan Whitehead 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a7f76e..c340125 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9149,6 +9149,13 @@ F:   drivers/net/dsa/microchip/*
 F: include/linux/platform_data/microchip-ksz.h
 F: Documentation/devicetree/bindings/net/dsa/ksz.txt
 
+MICROCHIP LAN743X ETHERNET DRIVER
+M: Bryan Whitehead 
+M: Microchip Linux Driver Support 
+L: netdev@vger.kernel.org
+S: Maintained
+F: drivers/net/ethernet/microchip/lan743x_*
+
 MICROCHIP USB251XB DRIVER
 M: Richard Leitner 
 L: linux-...@vger.kernel.org
-- 
2.7.4



Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn
On Wed, Feb 21, 2018 at 3:19 PM, Sowmini Varadhan
 wrote:
> This commit is an optimization that builds on top of commit 01883eda72bd
> ("rds: support for zcopy completion notification") for PF_RDS sockets.
>
> Cookies associated with zerocopy completion are passed up on the POLLIN
> channel, piggybacked with data whereever possible. Such cookies are passed
> up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when
> the returned value of recvmsg() is >= 0. A max of SO_EE_ORIGIN_MAX_ZCOOKIES
> may be passed with each message.

I'd put this optimization behind a socket option.

Either that, or always store cookies on an RDS specific queue, return as recv
cmsg and then remove the alternative error queue path completely.

Having both is confusing, also in terms of system behavior (e.g., see signaling
and sk_err handling in sock_dequeue_err_skb).

> +static int rds_recvmsg_zcookie(struct rds_sock *rs, struct msghdr *msg)
> +{
> +   struct sk_buff *skb, *tmp;
> +   struct sock_exterr_skb *serr;
> +   struct sock *sk = rds_rs_to_sk(rs);
> +   struct sk_buff_head *q = >sk_error_queue;
> +   struct rds_zcopy_cookies done;
> +   u32 *ptr;
> +   int i;
> +   unsigned long flags;
> +
> +   spin_lock_irqsave(>lock, flags);

This seems expensive on every recvmsg. Even if zerocopy is not enabled.

> +   if (skb_queue_empty(q)) {
> +   spin_unlock_irqrestore(>lock, flags);
> +   return 0;
> +   }
> +   skb_queue_walk_safe(q, skb, tmp) {

This too. If anything, just peek at the queue head and skip otherwise.
Anything else will cause an error, which blocks regular read and write
on the socket.

>  int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> int msg_flags)
>  {
> @@ -586,6 +623,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
> size_t size,
> int ret = 0, nonblock = msg_flags & MSG_DONTWAIT;
> DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg->msg_name);
> struct rds_incoming *inc = NULL;
> +   int ncookies;
>
> /* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */
> timeo = sock_rcvtimeo(sk, nonblock);
> @@ -609,6 +647,14 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
> size_t size,
> break;
> }
>
> +   if (list_empty(>rs_recv_queue) && nonblock) {
> +   ncookies = rds_recvmsg_zcookie(rs, msg);
> +   if (ncookies) {
> +   ret = 0;

This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?


Re: syzcaller patch postings...

2018-02-21 Thread Florian Westphal
David Miller  wrote:
> I have to mention this now before it gets out of control.
> 
> I would like to ask that syzkaller stop posting the patch it is
> testing when it posts to netdev.

Same for netfilter-devel.

I could not get a reproducer to trigger and asked syzbot to test
the patch (great feature, thanks!) -- i did not cc any mailing list.

So I was very surprised syzbot announced test result by adding
back all CCs from original report.  I think its better to
announce the result to patch author only.



Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu
I haven't checked emails for days and did not realize the new revision
had already came out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:
> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>> Ppatch 2 is in response to the community request for a 3 netdev
>>> solution.  However, it creates some issues we'll get into in a moment.
>>> It extends virtio_net to use alternate datapath when available and
>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>> an additional 'bypass' netdev that acts as a master device and controls
>>> 2 slave devices.  The original virtio_net netdev is registered as
>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> associated with the same 'pci' device.  The user accesses the network
>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>> as default for transmits when it is available with link up and running.
>>
>> Thank you do doing this.
>>
>>> We noticed a couple of issues with this approach during testing.
>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>   virtio pci device, udev tries to rename both of them with the same name
>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>   to rename the 2 netdevs is not reliable.
>>
>> Out of curiosity - why do you link the master netdev to the virtio
>> struct device?
>
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

Could it made it also possible to take over the config from VF instead
of virtio on an existing VM image? And get udev rename the bypass
netdev to what the original VF was. I don't say tightly binding the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need to a special multiseat like QEMU device for that
purpose?

Our case we'll upgrade the config from VF to virtio-bypass directly.

>
>> FWIW two solutions that immediately come to mind is to export "backup"
>> as phys_port_name of the backup virtio link and/or assign a name to the
>> master like you are doing already.  I think team uses team%d and bond
>> uses bond%d, soft naming of master devices seems quite natural in this
>> case.
>
> I figured I had overlooked something like that.. Thanks for pointing
> this out. Okay so I think the phys_port_name approach might resolve
> the original issue. If I am reading things correctly what we end up
> with is the master showing up as "ens1" for example and the backup
> showing up as "ens1nbackup". Am I understanding that right?
>
> The problem with the team/bond%d approach is that it creates a new
> netdevice and so it would require guest configuration changes.
>
>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> link is quite neat.
>
> I agree. For non-"backup" virio_net devices would it be okay for us to
> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> behavior could be maintained although the function still exists.
>
>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>
>> That's necessary and expected, all configuration applies to the master
>> so master must exist.
>
> With the naming issue resolved this is the only item left outstanding.
> This becomes a matter of form vs function.
>
> The main complaint about the "3 netdev" solution is a bit confusing to
> have the 2 netdevs present if the VF isn't there. The idea is that
> having the extra "master" netdev there if there isn't really a bond is
> a bit ugly.

Is it this uglier in terms of user experience rather than
functionality? I don't want it dynamically changed between 2-netdev
and 3-netdev depending on the presence of VF. That gets back to my
original question and suggestion earlier: why not just hide the lower
netdevs from udev renaming and such? Which important observability
benefits users may get if exposing the lower netdevs?

Thanks,
-Siwei

>
> The downside of the "2 netdev" solution is that you have to deal 

nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter

2018-02-21 Thread Alexei Starovoitov
On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
> 
> Obvious candidates are: meta, numgen, limit, objref, quota, reject.
> 
> We should probably also consider removing
> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
> build both too (at least rbtree since that offers interval).
> 
> For the indirect call issue we can use direct calls from eval loop for
> some of the more frequently used ones, similar to what we do already
> for nft_cmp_fast_expr. 

nft_cmp_fast_expr and other expressions mentioned above made me thinking...

do we have the same issue with nft interpreter as we had with bpf one?
bpf interpreter was used as part of spectre2 attack to leak
information via cache side channel and let VM read hypervisor memory.
Due to that issue we removed bpf interpreter from the kernel code.
That's what CONFIG_BPF_JIT_ALWAYS_ON for...
but we still have nft interpreter in the kernel that can also
execute arbitrary nft expressions.

Jann's exploit used the following bpf instructions:
struct bpf_insn evil_bytecode_instrs[] = {
// rax = target_byte_addr
{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr }, { 
.imm = target_byte_addr>>32 },
// rdi = timing_leak_array
{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr 
}, { .imm = host_timing_leak_addr>>32 },
// rax = *(u8*)rax
{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 },
// rax = rax << ...
{ .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx },
// rax = rax & 0x400
{ .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 },
// rax = rdi + rax
{ .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 },
// *(u8*) (rax + 0x800)
{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 },

and a gadget to jump into __bpf_prog_run with insn pointing
to memory controlled by the guest while accessible
(at different virt address) by the hypervisor.

It seems possible to construct similar sequence of instructions
out of nft expressions and use gadget that jumps into nft_do_chain().
The attacker would need to discover more kernel addresses:
nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval,
nft_lookup_eval, and nft_bitmap_lookup
to populate nft chains, rules and expressions in guest memory
comparing to bpf interpreter attack.

Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv)
pkt needs to point to fake struct sk_buff in guest memory with
skb->head == target_byte_addr
The first nft expression can be nft_payload_fast_eval().
If it's properly constructed with
(nft_payload->based == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, dreg 
== 1)
it will do arbitrary load of
*(u8 *)dest = *(u8 *)ptr;
from target_byte_addr into register 1 of nft state machine
(dest is u32 array of registers in the stack of nft_do_chain)
Second nft expression can be nft_bitwise_eval() to mask particular
bit in register 1.
Then nft_cmp_eval() to check whether bit is one or zero and
conditional NFT_BREAK out of first nft expression into second nft rule.
The last conditional nft_immediate_eval() in the first rule will set
register 1 to 0x400 * 8 while the first nft_bitwise_eval() in
the second rule with do r1 &= 0x400 * 8.
So at this point r1 will have either 0x400 * 8 or 0 depending
on value of speculatively loaded bit.
The last expression can be nft_lookup_eval() with 
nft_lookup->set->ops->lookup == nft_bitmap_lookup
which will do nft_bitmap->bitmap[idx] where idx = r1 / 8
The memory used for this last nft_lookup/bitmap expression is
both an instruction and timing_leak_array itself.
If I'm not mistaken, this sequence of nft expression will
speculatively execute very similar logic as in evil_bytecode_instrs[]

The amount of actual speculative native cpu load/stores/branches is
probably more than executed by bpf interpreter for these evil bytecodes,
but likely well within cpu speculation window of 100+ insns.

Obviously such exploit is harder to do than bpf based one.
Do we need to do anything about it ?
May be it's easier to find gadgets in .text of vmlinux
instead of messing with interpreters?

Jann,
can you comment on removing interpreters in general?
Do we need to worry about having bpf and/or nft interpreter
in the kernel?



Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Sowmini Varadhan
On (02/21/18 16:54), Willem de Bruijn wrote:
> 
> I'd put this optimization behind a socket option.

Yes, that thought occurred to me as well- I think RDS applications
are unlikely to use the error_queue path because, as I mentioned
before, these are heavily request-response based, so you're 
going to be getting something back for each sendmsg anyway.

So if I had a sockopt, it would be something that would be
"piggyback completion" most (all?) of the time anyway, but I suppose
making it explicit is useful to have.

> Either that, or always store cookies on an RDS specific queue, return as recv
> cmsg and then remove the alternative error queue path completely.

I think the error queue path is good to have, if only to be aligned
with other socket families.

> > +   spin_lock_irqsave(>lock, flags);
> 
> This seems expensive on every recvmsg. Even if zerocopy is not enabled.

noted, I'll predicate it on both zcopy and the sockopt suggestion.

> > +   if (skb_queue_empty(q)) {
> > +   spin_unlock_irqrestore(>lock, flags);
> > +   return 0;
> > +   }
> > +   skb_queue_walk_safe(q, skb, tmp) {
> 
> This too. If anything, just peek at the queue head and skip otherwise.
> Anything else will cause an error, which blocks regular read and write
> on the socket.

but that would be technically incorrect, because you could have a 
mix/match of the zcopy notification with other sk_error_queue messages
(given that rds_rm_zerocopy_callback looks at the tail, and
creates a new skb if the tail is not a SO_EE_ORIGIN_ZCOOKIE).

Of course, its true that *today* the only thing in the rds socket
error queue is the cookie list, so this queue walk is a bit of overkill..

but maybe I am missing something you are concerned about? The queue walk is
intended to pull out the first SO_EE_ORIGIN_ZCOOKIE (if it exists)
and return quietly otherwise (leaving err_queue unchanged)- where
do you see the blocking error?


> > +   if (list_empty(>rs_recv_queue) && nonblock) {
> > +   ncookies = rds_recvmsg_zcookie(rs, msg);
> > +   if (ncookies) {
> > +   ret = 0;
> 
> This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?

The only time MSG_DONTWAIT was previously returning 0 for PF_RDS
was when there was a congestion notification or rdma completion
i.e., the checks for rs_notify_queue and rs_cong_notify earlier
in the function, and in those 2 cases it was not returning data
(i.e. ret was always 0). In all other cases  rds_recvmsg would
always return either
- ret == -1, with errno set to EAGAIN or ETIMEDOUT, depending on the value
  of MSG_DONTWAIT, or,
- ret > 0 with data bytes.

That behavior (as well existing behavior for congestion notification
and rdma completion) has not changed. The only thing that changed is that
if there is no data to pass up, the ret will be 0, errno will be 0,
and the ancillary data may have completion cookies.

--Sowmini
  


Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn
>> Okay. If callers must already handle 0 as a valid return code, then
>> it is fine to add another case that does this.
>>
>> The extra branch in the hot path is still rather unfortunately. Could
>> this be integrated in the existing if (nonblock) branch below?
>
> that's where I first started. it got even hairier because the
> callers expect a retval of 0 (-EAGAIN threw rds-stress into an
> infinite loop of continulally trying to recv) and the end result
> was just confusing code with the same number of branches..

I do mean returning 0 instead of -EAGAIN if control data is ready.
Something like

@@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr
*msg, size_t size,

if (!rds_next_incoming(rs, )) {
if (nonblock) {
-   ret = -EAGAIN;
+   ncookies = rds_recvmsg_zcookie(rs, msg);
+   ret = ncookies ? 0 : -EAGAIN;
break;
}

By the way, the put_cmsg is unconditional even if the caller did
not supply msg_control. So it is basically no longer safe to ever
call read, recv or recvfrom on a socket if zerocopy notifications
are outstanding.

It is possible to check msg_controllen before even deciding whether
to try to dequeue notifications (and take the lock). I see that this is
not common. But RDS of all cases seems to do this, in
rds_notify_queue_get:

max_messages = msghdr->msg_controllen /
CMSG_SPACE(sizeof(cmsg));


[PATCH net-next] ibmvnic: Fix TX descriptor tracking

2018-02-21 Thread Thomas Falcon
With the recent change, transmissions that only needed
one descriptor were being missed. The result is that such
packets were tracked as outstanding transmissions but never
removed when its completion notification was received.

Fixes: ffc385b95adb ("ibmvnic: Keep track of supplementary TX descriptors")
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 340e1ab..b3a34d9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1478,7 +1478,6 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
if ((*hdrs >> 7) & 1) {
build_hdr_descs_arr(tx_buff, _entries, *hdrs);
tx_crq.v1.n_crq_elem = num_entries;
-   tx_buff->num_entries = num_entries;
tx_buff->indir_arr[0] = tx_crq;
tx_buff->indir_dma = dma_map_single(dev, tx_buff->indir_arr,
sizeof(tx_buff->indir_arr),
@@ -1533,6 +1532,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
netif_stop_subqueue(netdev, queue_num);
}
 
+   tx_buff->num_entries = num_entries;
tx_packets++;
tx_bytes += skb->len;
txq->trans_start = jiffies;
-- 
1.8.3.1



[PATCH net] ibmvnic: Fix early release of login buffer

2018-02-21 Thread Thomas Falcon
The login buffer is released before the driver can perform
sanity checks between resources the driver requested and what
firmware will provide. Don't release the login buffer until
the sanity check is performed.

Fixes: 34f0f4e3f488 ("ibmvnic: Fix login buffer memory leaks")
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 1703b88..340e1ab 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3790,7 +3790,6 @@ static int handle_login_rsp(union ibmvnic_crq 
*login_rsp_crq,
 
dma_unmap_single(dev, adapter->login_buf_token, adapter->login_buf_sz,
 DMA_BIDIRECTIONAL);
-   release_login_buffer(adapter);
dma_unmap_single(dev, adapter->login_rsp_buf_token,
 adapter->login_rsp_buf_sz, DMA_BIDIRECTIONAL);
 
@@ -3821,6 +3820,7 @@ static int handle_login_rsp(union ibmvnic_crq 
*login_rsp_crq,
ibmvnic_remove(adapter->vdev);
return -EIO;
}
+   release_login_buffer(adapter);
complete(>init_done);
 
return 0;
-- 
1.8.3.1



[PATCH v2 iproute2-next 2/3] ip: Display ip rule protocol used

2018-02-21 Thread Donald Sharp
Modify 'ip rule' command to notice when the kernel passes
to us the originating protocol.

Add code to allow the `ip rule flush protocol XXX`
command to be accepted and properly handled.

Modify the documentation to reflect these code changes.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c| 29 ++---
 man/man8/ip-rule.8 | 18 +-
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/ip/iprule.c b/ip/iprule.c
index 00a6c26a..39008768 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -47,6 +47,7 @@ static void usage(void)
"[ iif STRING ] [ oif STRING ] [ pref NUMBER ] [ 
l3mdev ]\n"
"[ uidrange NUMBER-NUMBER ]\n"
"ACTION := [ table TABLE_ID ]\n"
+   "  [ protocol RPROTO ]\n"
"  [ nat ADDRESS ]\n"
"  [ realms [SRCREALM/]DSTREALM ]\n"
"  [ goto NUMBER ]\n"
@@ -71,6 +72,8 @@ static struct
struct fib_rule_uid_range range;
inet_prefix src;
inet_prefix dst;
+   int protocol;
+   int protocolmask;
 } filter;
 
 static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb)
@@ -338,6 +341,10 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
rtnl_rtntype_n2a(frh->action,
 b1, sizeof(b1)));
 
+   if (frh->proto != RTPROT_UNSPEC)
+   fprintf(fp, " proto %s ",
+   rtnl_rtprot_n2a(frh->proto, b1, sizeof(b1)));
+
fprintf(fp, "\n");
fflush(fp);
return 0;
@@ -391,6 +398,9 @@ static int flush_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
 
parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
+   if ((filter.protocol^frh->proto))
+   return 0;
+
if (tb[FRA_PRIORITY]) {
n->nlmsg_type = RTM_DELRULE;
n->nlmsg_flags = NLM_F_REQUEST;
@@ -415,12 +425,6 @@ static int iprule_list_flush_or_save(int argc, char 
**argv, int action)
if (af == AF_UNSPEC)
af = AF_INET;
 
-   if (action != IPRULE_LIST && argc > 0) {
-   fprintf(stderr, "\"ip rule %s\" does not take any arguments.\n",
-   action == IPRULE_SAVE ? "save" : "flush");
-   return -1;
-   }
-
switch (action) {
case IPRULE_SAVE:
if (save_rule_prep())
@@ -508,7 +512,18 @@ static int iprule_list_flush_or_save(int argc, char 
**argv, int action)
NEXT_ARG();
if (get_prefix(, *argv, af))
invarg("from value is invalid\n", *argv);
-   } else {
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 prot;
+   NEXT_ARG();
+   filter.protocolmask = -1;
+   if (rtnl_rtprot_a2n(, *argv)) {
+   if (strcmp(*argv, "all") != 0)
+   invarg("invalid \"protocol\"\n", *argv);
+   prot = 0;
+   filter.protocolmask = 0;
+   }
+   filter.protocol = prot;
+   } else{
if (matches(*argv, "dst") == 0 ||
matches(*argv, "to") == 0) {
NEXT_ARG();
diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8
index a5c47981..98b2573d 100644
--- a/man/man8/ip-rule.8
+++ b/man/man8/ip-rule.8
@@ -50,6 +50,8 @@ ip-rule \- routing policy database management
 .IR ACTION " := [ "
 .B  table
 .IR TABLE_ID " ] [ "
+.B  protocol
+.IR RPROTO " ] [ "
 .B  nat
 .IR ADDRESS " ] [ "
 .B realms
@@ -240,6 +242,10 @@ The options preference and order are synonyms with 
priority.
 the routing table identifier to lookup if the rule selector matches.
 It is also possible to use lookup instead of table.
 
+.TP
+.BI protocol " RPROTO"
+the protocol who installed the rule in question.
+
 .TP
 .BI suppress_prefixlength " NUMBER"
 reject routing decisions that have a prefix length of NUMBER or less.
@@ -275,7 +281,11 @@ updates, it flushes the routing cache with
 .RE
 .TP
 .B ip rule flush - also dumps all the deleted rules.
-This command has no arguments.
+.RS
+.TP
+.BI protocol " RPROTO"
+Select the originating protocol.
+.RE
 .TP
 .B ip rule show - list rules
 This command has no arguments.
@@ -283,6 +293,12 @@ The options list or lst are synonyms with show.
 
 .TP
 .B ip rule save
+.RS
+.TP
+.BI protocl " RPROTO"
+Select the originating protocol.
+.RE
+.TP
 save rules table information to stdout
 .RS
 This command behaves like
-- 
2.14.3



[PATCH v2 iproute2-next 0/3] Allow 'ip rule' command to use protocol

2018-02-21 Thread Donald Sharp
Fix iprule.c to use the actual `struct fib_rule_hdr` and to
allow the end user to see and use the protocol keyword
for rule manipulations.

v2: Rearrange and code changes as per David Ahern

Donald Sharp (3):
  ip: Use the `struct fib_rule_hdr` for rules
  ip: Display ip rule protocol used
  ip: Allow rules to accept a specified protocol

 include/uapi/linux/fib_rules.h |   2 +-
 ip/iprule.c| 164 -
 man/man8/ip-rule.8 |  18 -
 3 files changed, 114 insertions(+), 70 deletions(-)

-- 
2.14.3



[PATCH v2 iproute2-next 1/3] ip: Use the `struct fib_rule_hdr` for rules

2018-02-21 Thread Donald Sharp
The iprule.c code was using `struct rtmsg` as the data
type to pass into the kernel for the netlink message.
While 'struct rtmsg' and `struct fib_rule_hdr` are
the same size and mostly the same, we should use
the correct data structure.  This commit translates
the data structures to have iprule.c use the correct
one.

Additionally copy over the modified fib_rules.h file

Signed-off-by: Donald Sharp 
---
 include/uapi/linux/fib_rules.h |   2 +-
 ip/iprule.c| 129 ++---
 2 files changed, 69 insertions(+), 62 deletions(-)

diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b642bf9..92553917 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -23,8 +23,8 @@ struct fib_rule_hdr {
__u8tos;
 
__u8table;
+   __u8proto;
__u8res1;   /* reserved */
-   __u8res2;   /* reserved */
__u8action;
 
__u32   flags;
diff --git a/ip/iprule.c b/ip/iprule.c
index a3abf2f6..00a6c26a 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -73,25 +73,33 @@ static struct
inet_prefix dst;
 } filter;
 
+static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb)
+{
+   __u32 table = frh->table;
+   if (tb[RTA_TABLE])
+   table = rta_getattr_u32(tb[RTA_TABLE]);
+   return table;
+}
+
 static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 {
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
__u32 table;
 
-   if (preferred_family != AF_UNSPEC && r->rtm_family != preferred_family)
+   if (preferred_family != AF_UNSPEC && frh->family != preferred_family)
return false;
 
if (filter.prefmask &&
filter.pref ^ (tb[FRA_PRIORITY] ? rta_getattr_u32(tb[FRA_PRIORITY]) 
: 0))
return false;
-   if (filter.not && !(r->rtm_flags & FIB_RULE_INVERT))
+   if (filter.not && !(frh->flags & FIB_RULE_INVERT))
return false;
 
if (filter.src.family) {
inet_prefix *f_src = 
 
-   if (f_src->family != r->rtm_family ||
-   f_src->bitlen > r->rtm_src_len)
+   if (f_src->family != frh->family ||
+   f_src->bitlen > frh->src_len)
return false;
 
if (inet_addr_match_rta(f_src, tb[FRA_SRC]))
@@ -101,15 +109,15 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct 
rtattr **tb, int host_len)
if (filter.dst.family) {
inet_prefix *f_dst = 
 
-   if (f_dst->family != r->rtm_family ||
-   f_dst->bitlen > r->rtm_dst_len)
+   if (f_dst->family != frh->family ||
+   f_dst->bitlen > frh->dst_len)
return false;
 
if (inet_addr_match_rta(f_dst, tb[FRA_DST]))
return false;
}
 
-   if (filter.tosmask && filter.tos ^ r->rtm_tos)
+   if (filter.tosmask && filter.tos ^ frh->tos)
return false;
 
if (filter.fwmark) {
@@ -159,7 +167,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
return false;
}
 
-   table = rtm_get_table(r, tb);
+   table = frh_get_table(frh, tb);
if (filter.tb > 0 && filter.tb ^ table)
return false;
 
@@ -169,7 +177,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
 int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
FILE *fp = (FILE *)arg;
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
int len = n->nlmsg_len;
int host_len = -1;
__u32 table;
@@ -180,13 +188,13 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE)
return 0;
 
-   len -= NLMSG_LENGTH(sizeof(*r));
+   len -= NLMSG_LENGTH(sizeof(*frh));
if (len < 0)
return -1;
 
-   parse_rtattr(tb, FRA_MAX, RTM_RTA(r), len);
+   parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
-   host_len = af_bit_len(r->rtm_family);
+   host_len = af_bit_len(frh->family);
 
if (!filter_nlmsg(n, tb, host_len))
return 0;
@@ -200,41 +208,41 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
else
fprintf(fp, "0:\t");
 
-   if (r->rtm_flags & FIB_RULE_INVERT)
+   if (frh->flags & FIB_RULE_INVERT)
fprintf(fp, "not ");
 
if (tb[FRA_SRC]) {
-   if (r->rtm_src_len != host_len) {
+   if (frh->src_len != host_len) {
fprintf(fp, 

[PATCH v2 iproute2-next 3/3] ip: Allow rules to accept a specified protocol

2018-02-21 Thread Donald Sharp
Allow the specification of a protocol when the user
adds/modifies/deletes a rule.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index 39008768..192fe215 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -683,6 +683,12 @@ static int iprule_modify(int cmd, int argc, char **argv)
if (get_rt_realms_or_raw(, *argv))
invarg("invalid realms\n", *argv);
addattr32(, sizeof(req), FRA_FLOW, realm);
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 proto;
+   NEXT_ARG();
+   if (rtnl_rtprot_a2n(, *argv))
+   invarg("\"protocol\" value is invalid\n", 
*argv);
+   req.frh.proto = proto;
} else if (matches(*argv, "table") == 0 ||
   strcmp(*argv, "lookup") == 0) {
NEXT_ARG();
-- 
2.14.3



Re: ss issue on arm not showing UDP listening ports

2018-02-21 Thread Jesse Cooper
Thank you for the suggestions. This is on a raspberry pi 3 not sure if
that fact matters. I will notify Raspbian of the issue.

On 02/21/2018 03:03 PM, Stefano Brivio wrote:
> On Wed, 21 Feb 2018 12:37:31 -0500
> jesse_coo...@codeholics.com wrote:
> 
>> ss utility, iproute2-ss161212
> 
> Works for me on iproute2-ss161212 and 4.9.0 kernel on armv7l. Unless
> somebody on the list has other ideas, I guess you should either try
> more recent versions, debug it (strace should show a pair of
> recvmsg/sendmsg for each UDP socket) or file a ticket for your
> distribution.
> 


Re: [PATCH net-next v4 0/2] Remove IPVlan module dependencies on IPv6 and L3 Master dev

2018-02-21 Thread David Miller
From: Matteo Croce 
Date: Wed, 21 Feb 2018 01:31:12 +0100

> The IPVlan module currently depends on IPv6 and L3 Master dev.
> Refactor the code to allow building IPVlan module regardless of the value
> of CONFIG_IPV6 as done in other drivers like VxLAN or GENEVE.
> Also change the CONFIG_NET_L3_MASTER_DEV dependency into a select,
> since compiling L3 Master device alone has little sense.
> 
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> CONFIG_IPV6=y
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 48K drivers/net/ipvlan/ipvlan.ko
> 
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> # CONFIG_IPV6 is not set
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 44K drivers/net/ipvlan/ipvlan.ko

Series applied, thanks Matteo.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Samudrala, Sridhar

On 2/21/2018 6:02 PM, Jakub Kicinski wrote:

On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:

I don't see why the team cannot be there always.

It is more the logistical nightmare. Part of the goal here was to work
with the cloud base images that are out there such as
https://alt.fedoraproject.org/cloud/. With just the kernel changes the
overhead for this stays fairly small and would be pulled in as just a
standard part of the kernel update process. The virtio bypass only
pops up if the backup bit is present. With the team solution it
requires that the base image use the team driver on virtio_net when it
sees one. I doubt the OSVs would want to do that just because SR-IOV
isn't that popular of a case.

IIUC we need to monitor for a "backup hint", spawn the master, rename it
to maintain backwards compatibility with no-VF setups and enslave the VF
if it appears.

All those sound possible from user space, the advantage of the kernel
solution right now is that it has more complete code.

Am I misunderstanding?


I think there is some misunderstanding about the exact requirement and 
the usecase
we are trying to solve.  If the Guest is allowed to do this 
configuration, we already have

a solution with either bond/team based user space configuration.

This is to enable cloud service providers to provide a accelerated 
datapath by simply
letting to tenants to get their own images with the only requirement to 
enable their

kernels with newer virtio_net driver with BACKUP support and the VF driver.

To recap from an earlier thread, here is a response from Stephen that 
talks about the
requirement for the netvsc solution and we would like to provide similar 
solution for

KVM based cloud deployments.

> The requirement with Azure accelerated network was that a stock 
distribution image from the

> store must be able to run unmodified and get accelerated networking.
>  Not sure if other environments need to work the same, but it would 
be nice.

>  That meant no additional setup scripts (aka no bonding) and also it must
> work transparently with hot-plug. Also there are diverse set of 
environments:
> openstack, cloudinit, network manager and systemd. The solution had 
to not depend

> on any one of them, but also not break any of them.

Thanks
Sridhar


Re: [PATCH] selftests/bpf: tcpbpf_kern: use in6_* macros from glibc

2018-02-21 Thread Daniel Borkmann
On 02/21/2018 05:51 PM, Anders Roxell wrote:
> Both glibc and the kernel have in6_* macros definitions. Build fails
> because it picks up wrong in6_* macro from the kernel header and not the
> header from glibc.
> 
> Fixes build error below:
> clang -I. -I./include/uapi -I../../../include/uapi
>  -Wno-compare-distinct-pointer-types \
>  -O2 -target bpf -emit-llvm -c test_tcpbpf_kern.c -o - |  \
> llc -march=bpf -mcpu=generic -filetype=obj
>  -o .../tools/testing/selftests/bpf/test_tcpbpf_kern.o
> In file included from test_tcpbpf_kern.c:12:
> .../netinet/in.h:101:5: error: expected identifier
> IPPROTO_HOPOPTS = 0,   /* IPv6 Hop-by-Hop options.  */
> ^
> .../linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS'
> ^
> In file included from test_tcpbpf_kern.c:12:
> /usr/include/netinet/in.h:103:5: error: expected identifier
> IPPROTO_ROUTING = 43,  /* IPv6 routing header.  */
> ^
> .../linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING'
> ^
> In file included from test_tcpbpf_kern.c:12:
> .../netinet/in.h:105:5: error: expected identifier
> IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header.  */
> ^
> 
> Since both glibc and the kernel have in6_* macros definitions, use the
> one from glibc.  Kernel headers will check for previous libc definitions
> by including include/linux/libc-compat.h.
> 
> Reported-by: Daniel Díaz 
> Signed-off-by: Anders Roxell 

Applied to bpf tree, thanks Anders!


Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Daniel Borkmann
Hi Shuah,

On 02/22/2018 12:03 AM, Shuah Khan wrote:
> On 02/21/2018 03:48 PM, David Miller wrote:
>> From: Anders Roxell 
>> Date: Wed, 21 Feb 2018 22:30:01 +0100
>>
>>> bpf builds a test program for loading BPF ELF files. Add the executable
>>> to the .gitignore list.
>>>
>>> Signed-off-by: Anders Roxell 
>>
>> Acked-by: David S. Miller 
> 
> Thanks. I will pull this in for 4.16-rc

I would have taken this into bpf tree, but fair enough. This one
in particular doesn't cause any conflicts right now, so feel free
to pick it up then. But usual paths are via bpf / bpf-next as we
otherwise would run into real ugly merge conflicts.

Thanks,
Daniel


Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Daniel Borkmann
On 02/22/2018 01:37 AM, Shuah Khan wrote:
> On 02/21/2018 05:33 PM, Daniel Borkmann wrote:
>> Hi Shuah,
>>
>> On 02/22/2018 12:03 AM, Shuah Khan wrote:
>>> On 02/21/2018 03:48 PM, David Miller wrote:
 From: Anders Roxell 
 Date: Wed, 21 Feb 2018 22:30:01 +0100

> bpf builds a test program for loading BPF ELF files. Add the executable
> to the .gitignore list.
>
> Signed-off-by: Anders Roxell 

 Acked-by: David S. Miller 
>>>
>>> Thanks. I will pull this in for 4.16-rc
>>
>> I would have taken this into bpf tree, but fair enough. This one
>> in particular doesn't cause any conflicts right now, so feel free
>> to pick it up then. But usual paths are via bpf / bpf-next as we
>> otherwise would run into real ugly merge conflicts.
> 
> Daniel,
> 
> Go for it. I know bpf tree stuff causes conflicts. That is why I usually
> stay away from bpfs selftests and let you handle them.
> 
> You are taking the other one away, pick this up as well.

Okay, great, thanks for your understanding, Shuah. Just applied
to bpf tree.

> Acked-by: Shuah Khan 
> 
> thanks,
> -- Shuah
> 



[PATCH v2] selftests/bpf/test_maps: exit child process without error in ENOMEM case

2018-02-21 Thread Li Zhijian
From: Li Zhijian 

test_maps contains a series of stress tests, and previously it will break the
rest tests when it failed to alloc memory.
---
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
Failed to create hashmap key=16 value=262144 'Cannot allocate memory'
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
test_maps: test_maps.c:955: run_parallel: Assertion `status == 0' failed.
Aborted
not ok 1..3 selftests:  test_maps [FAIL]
---
after this patch, the rest tests will be continue when it occurs an ENOMEM 
failure

CC: Alexei Starovoitov 
CC: Philip Li 
Suggested-by: Daniel Borkmann 
Signed-off-by: Li Zhijian 
---
 tools/testing/selftests/bpf/test_maps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 436c4c7..9e03a4c 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -126,6 +126,8 @@ static void test_hashmap_sizes(int task, void *data)
fd = bpf_create_map(BPF_MAP_TYPE_HASH, i, j,
2, map_flags);
if (fd < 0) {
+   if (errno == ENOMEM)
+   return;
printf("Failed to create hashmap key=%d 
value=%d '%s'\n",
   i, j, strerror(errno));
exit(1);
-- 
2.7.4



Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Eric Dumazet
On Wed, 2018-02-21 at 19:43 -0800, Alexei Starovoitov wrote:
> On Wed, Feb 21, 2018 at 07:04:02PM -0800, Eric Dumazet wrote:
> > On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote:
> > 
> > ...
> > 
> > > +/* Instead of plain jmp %rax, we emit a retpoline to control
> > > + * speculative execution for the indirect branch.
> > > + */
> > > +static void emit_retpoline_rax_trampoline(u8 **pprog)
> > > +{
> > > + u8 *prog = *pprog;
> > > + int cnt = 0;
> > > +
> > > + EMIT1_off32(0xE8, 7);/* callq  */
> > > + /* capture_spec: */
> > > + EMIT2(0xF3, 0x90);   /* pause */
> > > + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
> > > + EMIT2(0xEB, 0xF9);   /* jmp  */
> > > + /* set_up_target: */
> > > + EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
> > > + EMIT1(0xC3); /* retq */
> > > +
> > > + BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
> > > + *pprog = prog;
> > 
> > You might define the actual code sequence (and length) in 
> > arch/x86/include/asm/nospec-branch.h
> > 
> > If we need to adjust code sequences for RETPOLINE, then we wont
> > forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded.
> 
> like adding a comment to asm/nospec-branch.h that says
> "dont forget to adjust bpf_jit_comp.c" ?
> but clang/gcc generate slightly different sequences for
> retpoline anyway, so even if '.macro RETPOLINE_JMP' in
> nospec-branch.h changes it doesn't mean that x64 jit has to change.
> So what kinda comment there would make sense?

I was thinking of something very explicit :

/* byte sequence for following assembly code used by eBPF
   call ...
   ...
   retq
*/
#define RETPOLINE_RAX_DIRECT_FOR_EBPF \
   EMIT1_off32(0xE8, 7);/* callq  */   \
   /* capture_spec: */\
   EMIT2(0xF3, 0x90);   /* pause */   \
   EMIT3(0x0F, 0xAE, 0xE8); /* lfence */  \
   EMIT2(0xEB, 0xF9);   /* jmp  */  \
   /* set_up_target: */   \
   EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */   \
   EMIT1(0xC3); /* retq */\

Might be simply byte encoded, (array of 17 bytes)

Well, something like that anyway...



Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case

2018-02-21 Thread Jason Wang



On 2018年02月21日 00:52, John Fastabend wrote:

On 02/20/2018 03:17 AM, Jesper Dangaard Brouer wrote:

On Fri, 16 Feb 2018 09:19:02 -0800
John Fastabend  wrote:


On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote:

On Fri, 16 Feb 2018 13:31:37 +0800
Jason Wang  wrote:
   

On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote:

The virtio_net code have three different RX code-paths in receive_buf().
Two of these code paths can handle XDP, but one of them is broken for
at least XDP_REDIRECT.

Function(1): receive_big() does not support XDP.
Function(2): receive_small() support XDP fully and uses build_skb().
Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb().

The simple explanation is that receive_mergeable() is broken because
it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
header+data in single page and enough tail room for skb_shared_info.

The longer explaination is that receive_mergeable() tries to
work-around and satisfy these XDP requiresments e.g. by having a
function xdp_linearize_page() that allocates and memcpy RX buffers
around (in case packet is scattered across multiple rx buffers).  This
does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
we have not implemented bpf_xdp_adjust_tail yet).

The XDP_REDIRECT action combined with cpumap is broken, and cause hard
to debug crashes.  The main issue is that the RX packet does not have
the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
skb_shared_info to overlap the next packets head-room (in which cpumap
stores info).

Reproducing depend on the packet payload length and if RX-buffer size
happened to have tail-room for skb_shared_info or not.  But to make
this even harder to troubleshoot, the RX-buffer size is runtime
dynamically change based on an Exponentially Weighted Moving Average
(EWMA) over the packet length, when refilling RX rings.

This patch only disable XDP_REDIRECT support in receive_mergeable()
case, because it can cause a real crash.

But IMHO we should NOT support XDP in receive_mergeable() at all,
because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time.  These principles are clearly being
violated in receive_mergeable(), that e.g. runtime track average
buffer size to save memory consumption.

I agree to disable it for -net now.

Okay... I'll send an official patch later.
   

For net-next, we probably can do:

- drop xdp_linearize_page() and do XDP through generic XDP helper
   after skb was built

I disagree strongly here - it makes no sense.

Why do you want to explicit fallback to Generic-XDP?
(... then all the performance gain is gone!)
And besides, a couple of function calls later, the generic XDP code
will/can get invoked anyhow...
   

Hi, Can we get EWMA to ensure for majority of cases we have the extra
head room? Seems we could just over-estimate the size by N-bytes. In
some cases we may under-estimate and then would need to fall back to
generic-xdp or otherwise growing the buffer which of course would be
painful and slow, but presumably would happen rarely.

Hmmm... (first of all it is missing tail-room not head-room).
Second having all this extra size estimating code, and fallback options
leaves a very bad taste in my mouth... this sounds like a sure way to
kill performance.



I think it would be much better to keep this feature vs kill it and
make its configuration even more painful to get XDP working on virtio.

Based on you request, I'm going to fixing as much as possible of the
XDP code path in driver virtio_net... I now have 4 fix patches...


Thanks a lot!


There is no way around disabling XDP_REDIRECT in receive_mergeable(),
as XDP does not have a way to detect/know the "data_hard_end" of the
data "frame".


Disabling EWMA also seems reasonable to me.

To me, it seems more reasonable to have a separate RX function call
when an XDP program gets attached, and in that process change to the
memory model so it is compatible with XDP.


I would be OK with that but would be curious to see what Jason and
Michael think. When I original wrote the XDP for virtio support the
XDP infra was still primitive and we didn't have metadata, cpu maps,
etc.


Yes, that why cpumap fails.


  yet. I suspect there might need to be some additional coordination
between guest and host though to switch the packet modes. If I recall
this was where some of the original trouble came from.


Maybe, but I think just have a separate refill function should be 
sufficient? Then we could reuse the exist code to deal with e.g 
synchronization.




  

Take a step back:
  What is the reason/use-case for implementing XDP inside virtio_net?

 From a DDoS/performance perspective XDP in virtio_net happens on the
"wrong-side" as it is activated _inside_ the guest OS, which is too
late for a DDoS filter, as the guest kick/switch 

Re: [PATCH net v2 2/2] tuntap: correctly add the missing xdp flush

2018-02-21 Thread Sergei Shtylyov

Hello!

On 2/22/2018 9:24 AM, Jason Wang wrote:


Commit 762c330d670e ("tuntap: add missing xdp flush") tries to fix the
devmap stall caused by missed xdp flush by counting the pending xdp
redirected packets and flush when it exceeds NAPI_POLL_WEIGHT or
MSG_MORE is clear. This may lead BUG() since xdp_do_flush() was


   Lead to BUG().


called under process context with preemption enabled. Simply disable


   s/under/in the/?


preemption may silent the warning but be not enough since process may


   Silence.


move between different CPUS during a batch which cause xdp_do_flush()
misses some CPU where the process run previously. Consider the several
fallouts, that commit was reverted. To fix the issue correctly, we can
simply calling xdp_do_flush() immediately after xdp_do_redirect(),


   Call.


a side effect is that this removes any possibility of batching which
could be addressed in the future.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 

[...]

MBR, Sergei


Re: [PATCH net v2 1/2] Revert "tuntap: add missing xdp flush"

2018-02-21 Thread Sergei Shtylyov

On 2/22/2018 9:24 AM, Jason Wang wrote:


This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The
reason is we try to batch packets for devmap which causes calling
xdp_do_flush() under the process context. Simply disable premmption


   s/under/in/.
   Disabling preemption.


may not work since process may move among processors which lead
xdp_do_flush() to miss some flushes on some processors.

So simply revert the patch, a follow-up path will add the xdp flush
correctly.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 

[...]

MBR, Sergei


[PATCH iproute2-next v1] rdma: Add batch command support

2018-02-21 Thread Leon Romanovsky
From: Leon Romanovsky 

Implement an option (-b) to execute RDMAtool commands
from supplied file. This follows the same model as
in use for ip and devlink tools, by expecting
every new command to be on new line.

These commands are expected to be without any -*
(e.g. -d, -j, e.t.c) global flags, which should be
called externally.

Signed-off-by: Leon Romanovsky 
---

Changelog v0->v1:
  * Used ARRAY_SIZE instead of hardcoded value as an input to makeargs()

David,

This patch is based on iproute2.git because iproute2-next doesn't
have latest restrack code. The patch itself is completely independent
from that code and is supposed to go to -next, but it has conflicts
(manual page and help line).

Can you please merge iproute2 master into iproute2-next prior to
applying this patch?

Thanks
---
 man/man8/rdma.8 | 16 +
 rdma/rdma.c | 70 ++---
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
index 798b33d3..fba77693 100644
--- a/man/man8/rdma.8
+++ b/man/man8/rdma.8
@@ -11,6 +11,12 @@ rdma \- RDMA tool
 .BR help " }"
 .sp

+.ti -8
+.B rdma
+.RB "[ " -force " ] "
+.BI "-batch " filename
+.sp
+
 .ti -8
 .IR OBJECT " := { "
 .BR dev " | " link " }"
@@ -31,6 +37,16 @@ Print the version of the
 .B rdma
 tool and exit.

+.TP
+.BR "\-b", " \-batch " 
+Read commands from provided file or standard input and invoke them.
+First failure will cause termination of rdma.
+
+.TP
+.BR "\-force"
+Don't terminate rdma on errors in batch mode.
+If there were any errors during execution of the commands, the application 
return code will be non zero.
+
 .TP
 .BR "\-d" , " --details"
 Otuput detailed information.
diff --git a/rdma/rdma.c b/rdma/rdma.c
index 19608f41..ab2c9608 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -15,8 +15,9 @@
 static void help(char *name)
 {
pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
+  "   %s [ -f[orce] ] -b[atch] filename\n"
   "where  OBJECT := { dev | link | resource | help }\n"
-  "   OPTIONS := { -V[ersion] | -d[etails] | -j[son] | 
-p[retty]}\n", name);
+  "   OPTIONS := { -V[ersion] | -d[etails] | -j[son] | 
-p[retty]}\n", name, name);
 }

 static int cmd_help(struct rd *rd)
@@ -25,7 +26,7 @@ static int cmd_help(struct rd *rd)
return 0;
 }

-static int rd_cmd(struct rd *rd)
+static int rd_cmd(struct rd *rd, int argc, char **argv)
 {
const struct rd_cmd cmds[] = {
{ NULL, cmd_help },
@@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
{ 0 }
};

+   rd->argc = argc;
+   rd->argv = argv;
+
return rd_exec_cmd(rd, cmds, "object");
 }

-static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
+static int rd_batch(struct rd *rd, const char *name, bool force)
+{
+   char *line = NULL;
+   size_t len = 0;
+   int ret = 0;
+
+   if (name && strcmp(name, "-") != 0) {
+   if (!freopen(name, "r", stdin)) {
+   pr_err("Cannot open file \"%s\" for reading: %s\n",
+  name, strerror(errno));
+   return errno;
+   }
+   }
+
+   cmdlineno = 0;
+   while (getcmdline(, , stdin) != -1) {
+   char *largv[512];
+   int largc;
+
+   largc = makeargs(line, largv, ARRAY_SIZE(largv));
+   if (!largc)
+   continue;   /* blank line */
+
+   ret = rd_cmd(rd, largc, largv);
+   if (ret) {
+   pr_err("Command failed %s:%d\n", name, cmdlineno);
+   if (!force)
+   break;
+   }
+   }
+
+   free(line);
+
+   return ret;
+}
+
+static int rd_init(struct rd *rd, char *filename)
 {
uint32_t seq;
int ret;

rd->filename = filename;
-   rd->argc = argc;
-   rd->argv = argv;
INIT_LIST_HEAD(>dev_map_list);
INIT_LIST_HEAD(>filter_list);

@@ -87,11 +125,15 @@ int main(int argc, char **argv)
{ "json",   no_argument,NULL, 'j' },
{ "pretty", no_argument,NULL, 'p' },
{ "details",no_argument,NULL, 'd' },
+   { "force",  no_argument,NULL, 'f' },
+   { "batch",  required_argument,  NULL, 'b' },
{ NULL, 0, NULL, 0 }
};
+   const char *batch_file = NULL;
bool pretty_output = false;
bool show_details = false;
bool json_output = false;
+   bool force = false;
char *filename;
struct rd rd;
int opt;
@@ -99,7 +141,7 @@ int main(int argc, char **argv)

filename = basename(argv[0]);

-   while ((opt = 

[PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Daniel Borkmann
Implement a retpoline [0] for the BPF tail call JIT'ing that converts
the indirect jump via jmp %rax that is used to make the long jump into
another JITed BPF image. Since this is subject to speculative execution,
we need to control the transient instruction sequence here as well
when CONFIG_RETPOLINE is set, and direct it into a pause + lfence loop.
The latter aligns also with what gcc / clang emits (e.g. [1]).

JIT dump after patch:

  # bpftool p d x i 1
   0: (18) r2 = map[id:1]
   2: (b7) r3 = 0
   3: (85) call bpf_tail_call#12
   4: (b7) r0 = 2
   5: (95) exit

With CONFIG_RETPOLINE:

  # bpftool p d j i 1
  [...]
  33:   cmp%edx,0x24(%rsi)
  36:   jbe0x0072  |*
  38:   mov0x24(%rbp),%eax
  3e:   cmp$0x20,%eax
  41:   ja 0x0072  |
  43:   add$0x1,%eax
  46:   mov%eax,0x24(%rbp)
  4c:   mov0x90(%rsi,%rdx,8),%rax
  54:   test   %rax,%rax
  57:   je 0x0072  |
  59:   mov0x28(%rax),%rax
  5d:   add$0x25,%rax
  61:   callq  0x006d  |+
  66:   pause  |
  68:   lfence |
  6b:   jmp0x0066  |
  6d:   mov%rax,(%rsp) |
  71:   retq   |
  72:   mov$0x2,%eax
  [...]

  * relative fall-through jumps in error case
  + retpoline for indirect jump

Without CONFIG_RETPOLINE:

  # bpftool p d j i 1
  [...]
  33:   cmp%edx,0x24(%rsi)
  36:   jbe0x0063  |*
  38:   mov0x24(%rbp),%eax
  3e:   cmp$0x20,%eax
  41:   ja 0x0063  |
  43:   add$0x1,%eax
  46:   mov%eax,0x24(%rbp)
  4c:   mov0x90(%rsi,%rdx,8),%rax
  54:   test   %rax,%rax
  57:   je 0x0063  |
  59:   mov0x28(%rax),%rax
  5d:   add$0x25,%rax
  61:   jmpq   *%rax   |-
  63:   mov$0x2,%eax
  [...]

  * relative fall-through jumps in error case
  - plain indirect jump as before

  [0] https://support.google.com/faqs/answer/7625886
  [1] 
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Signed-off-by: Daniel Borkmann 
---
 arch/x86/net/bpf_jit_comp.c | 52 -
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4923d92..7e8d562 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -261,6 +261,35 @@ static void emit_prologue(u8 **pprog, u32 stack_depth)
*pprog = prog;
 }
 
+#ifdef CONFIG_RETPOLINE
+# define RETPOLINE_SIZE17
+# define OFFSET_ADJRETPOLINE_SIZE
+
+/* Instead of plain jmp %rax, we emit a retpoline to control
+ * speculative execution for the indirect branch.
+ */
+static void emit_retpoline_rax_trampoline(u8 **pprog)
+{
+   u8 *prog = *pprog;
+   int cnt = 0;
+
+   EMIT1_off32(0xE8, 7);/* callq  */
+   /* capture_spec: */
+   EMIT2(0xF3, 0x90);   /* pause */
+   EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
+   EMIT2(0xEB, 0xF9);   /* jmp  */
+   /* set_up_target: */
+   EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
+   EMIT1(0xC3); /* retq */
+
+   BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
+   *pprog = prog;
+}
+#else
+/* Plain jmp %rax version used. */
+# define OFFSET_ADJ2
+#endif
+
 /* generate the following code:
  * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
  *   if (index >= array->map.max_entries)
@@ -290,7 +319,7 @@ static void emit_bpf_tail_call(u8 **pprog)
EMIT2(0x89, 0xD2);/* mov edx, edx */
EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], 
edx */
  offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 43 /* number of bytes to jump */
+#define OFFSET1 (41 + OFFSET_ADJ) /* number of bytes to jump */
EMIT2(X86_JBE, OFFSET1);  /* jbe out */
label1 = cnt;
 
@@ -299,7 +328,7 @@ static void emit_bpf_tail_call(u8 **pprog)
 */
EMIT2_off32(0x8B, 0x85, 36);  /* mov eax, dword ptr [rbp + 
36] */
EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT 
*/
-#define OFFSET2 32
+#define OFFSET2 (30 + OFFSET_ADJ)
EMIT2(X86_JA, OFFSET2);   /* ja out */
label2 = cnt;
EMIT3(0x83, 0xC0, 0x01);  /* add eax, 1 */
@@ -313,7 +342,7 @@ static void emit_bpf_tail_call(u8 **pprog)
 *   goto out;
 */
EMIT3(0x48, 0x85, 0xC0);  /* test rax,rax */
-#define OFFSET3 10
+#define OFFSET3 (8 + OFFSET_ADJ)
EMIT2(X86_JE, OFFSET3);   /* je out */
label3 = cnt;
 
@@ -326,16 +355,19 @@ static void emit_bpf_tail_call(u8 **pprog)
 * rdi == ctx (1st arg)
 * rax == prog->bpf_func + prologue_size
 */
-   EMIT2(0xFF, 0xE0);/* jmp rax */
-
-   /* out: */
-   

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn
On Wed, Feb 21, 2018 at 7:26 PM, Sowmini Varadhan
 wrote:
> On (02/21/18 18:45), Willem de Bruijn wrote:
>>
>> I do mean returning 0 instead of -EAGAIN if control data is ready.
>> Something like
>>
>> @@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr
>> *msg, size_t size,
>>
>> if (!rds_next_incoming(rs, )) {
>> if (nonblock) {
>> -   ret = -EAGAIN;
>> +   ncookies = rds_recvmsg_zcookie(rs, msg);
>> +   ret = ncookies ? 0 : -EAGAIN;
>> break;
>> }
>
> Yes, but you now have an implicit branch based on ncookies, so I'm
> not sure it saved all that much?

At least it removes the extra list empty check in the hot path and
relegates this to the relatively rare branch when the queue is empty
and the syscall is non-blocking.

> like I said let me revisit this

Okay. I won't harp on this further.

>> By the way, the put_cmsg is unconditional even if the caller did
>> not supply msg_control. So it is basically no longer safe to ever
>> call read, recv or recvfrom on a socket if zerocopy notifications
>> are outstanding.
>
> Wait, I thought put_cmsg already checks for these things.

It does, and sets MSG_CTRUNC to signal that it was unable to
write all control data. But by then the notifications have already
been dequeued.

>> It is possible to check msg_controllen before even deciding whether
>> to try to dequeue notifications (and take the lock). I see that this is
>> not common. But RDS of all cases seems to do this, in
>> rds_notify_queue_get:
>
> yes the comment above that code suggests that this bit of code
> was done to avoid calling put_cmsg while holding the rs_lock.
>
> One bit of administrivia though- if I now drop sk_error_queue for
> PF_RDS, I'll have to fix selftests in the same patch too, so the
> patch will get a bit bulky (and thus a bit more difficult to review).

Understood. It might be cleanest to split into three patches. One
revert of the error queue code, one new feature and one update
to the test.


[PATCH v2 net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins

2018-02-21 Thread Andrew Lunn
MV88E6352 and later switches support GPIO control through the "Scratch
& Misc" global2 register. Two of the pins controlled this way on the
mv88e6390 family are the external MDIO pins. They can either by used
as part of the MII interface for port 0, GPIOs, or MDIO. Add a
function to configure them for MDIO, if possible, and call it when
registering the external MDIO bus.

Suggested-by: Russell King 
Signed-off-by: Andrew Lunn 
---
v2: Make stub function static inline, as reported by 0-day.

 drivers/net/dsa/mv88e6xxx/chip.c|  9 +
 drivers/net/dsa/mv88e6xxx/global2.h | 14 
 drivers/net/dsa/mv88e6xxx/global2_scratch.c | 51 +
 3 files changed, 74 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 39c7ad7e490f..e1b5c5c66fce 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2165,6 +2165,15 @@ static int mv88e6xxx_mdio_register(struct mv88e6xxx_chip 
*chip,
struct mii_bus *bus;
int err;
 
+   if (external) {
+   mutex_lock(>reg_lock);
+   err = mv88e6xxx_g2_scratch_gpio_set_smi(chip, true);
+   mutex_unlock(>reg_lock);
+
+   if (err)
+   return err;
+   }
+
bus = devm_mdiobus_alloc_size(chip->dev, sizeof(*mdio_bus));
if (!bus)
return -ENOMEM;
diff --git a/drivers/net/dsa/mv88e6xxx/global2.h 
b/drivers/net/dsa/mv88e6xxx/global2.h
index 25f92b3d7157..d85c91036c0f 100644
--- a/drivers/net/dsa/mv88e6xxx/global2.h
+++ b/drivers/net/dsa/mv88e6xxx/global2.h
@@ -266,6 +266,11 @@
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL50x6D
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL60x6E
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL70x6F
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA0  0x70
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA1  0x71
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU   BIT(2)
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA2  0x72
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK 0x3
 
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_GPIO0
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_TRIG1
@@ -325,6 +330,9 @@ extern const struct mv88e6xxx_avb_ops mv88e6390_avb_ops;
 
 extern const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops;
 
+int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip,
+ bool external);
+
 #else /* !CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */
 
 static inline int mv88e6xxx_g2_require(struct mv88e6xxx_chip *chip)
@@ -465,6 +473,12 @@ static const struct mv88e6xxx_avb_ops mv88e6390_avb_ops = 
{};
 
 static const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = {};
 
+static inline int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip 
*chip,
+   bool external)
+{
+   return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */
 
 #endif /* _MV88E6XXX_GLOBAL2_H */
diff --git a/drivers/net/dsa/mv88e6xxx/global2_scratch.c 
b/drivers/net/dsa/mv88e6xxx/global2_scratch.c
index 0ff12bff9f0e..3f92b8892dc7 100644
--- a/drivers/net/dsa/mv88e6xxx/global2_scratch.c
+++ b/drivers/net/dsa/mv88e6xxx/global2_scratch.c
@@ -238,3 +238,54 @@ const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = {
.get_pctl = mv88e6352_g2_scratch_gpio_get_pctl,
.set_pctl = mv88e6352_g2_scratch_gpio_set_pctl,
 };
+
+/**
+ * mv88e6xxx_g2_gpio_set_smi - set gpio muxing for external smi
+ * @chip: chip private data
+ * @external: set mux for external smi, or free for gpio usage
+ *
+ * Some mv88e6xxx models have GPIO pins that may be configured as
+ * an external SMI interface, or they may be made free for other
+ * GPIO uses.
+ */
+int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip,
+ bool external)
+{
+   int misc_cfg = MV88E6352_G2_SCRATCH_MISC_CFG;
+   int config_data1 = MV88E6352_G2_SCRATCH_CONFIG_DATA1;
+   int config_data2 = MV88E6352_G2_SCRATCH_CONFIG_DATA2;
+   bool no_cpu;
+   u8 p0_mode;
+   int err;
+   u8 val;
+
+   err = mv88e6xxx_g2_scratch_read(chip, config_data2, );
+   if (err)
+   return err;
+
+   p0_mode = val & MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK;
+
+   if (p0_mode == 0x01 || p0_mode == 0x02)
+   return -EBUSY;
+
+   err = mv88e6xxx_g2_scratch_read(chip, config_data1, );
+   if (err)
+   return err;
+
+   no_cpu = !!(val & MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU);
+
+   err = mv88e6xxx_g2_scratch_read(chip, misc_cfg, );
+   if (err)
+   return err;
+
+   /* NO_CPU being 0 inverts the meaning of the bit */
+   if (!no_cpu)
+   external = !external;
+
+   if (external)
+   val |= MV88E6352_G2_SCRATCH_MISC_CFG_NORMALSMI;
+   else
+   

[PATCH iproute2] ip: Properly display AF_BRIDGE address information for neighbor events

2018-02-21 Thread Donald Sharp
The vxlan driver when a neighbor add/delete event occurs sends
NDA_DST filled with a union:

union vxlan_addr {
struct sockaddr_in sin;
struct sockaddr_in6 sin6;
struct sockaddr sa;
};

This eventually calls rt_addr_n2a_r which had no handler for the
AF_BRIDGE family and "???" was being printed.

Add code to properly display this data when requested.

Signed-off-by: Donald Sharp 
---
 lib/utils.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/lib/utils.c b/lib/utils.c
index 24aeddd8..e01e18a7 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -1004,6 +1004,24 @@ const char *rt_addr_n2a_r(int af, int len,
}
case AF_PACKET:
return ll_addr_n2a(addr, len, ARPHRD_VOID, buf, buflen);
+   case AF_BRIDGE:
+   {
+   unsigned short family = ((struct sockaddr *)addr)->sa_family;
+   struct sockaddr_in6 *sin6;
+   struct sockaddr_in *sin;
+
+   switch(family) {
+   case AF_INET:
+   sin = (struct sockaddr_in *)addr;
+   return inet_ntop(AF_INET, >sin_addr, buf, buflen);
+   case AF_INET6:
+   sin6 = (struct sockaddr_in6 *)addr;
+   return inet_ntop(AF_INET6, >sin6_addr,
+buf, buflen);
+   }
+
+   /* fallthrough */
+   }
default:
return "???";
}
-- 
2.14.3



Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Samudrala, Sridhar

On 2/21/2018 6:35 PM, Samudrala, Sridhar wrote:

On 2/21/2018 5:59 PM, Siwei Liu wrote:

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
 wrote:

On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:

I haven't checked emails for days and did not realize the new revision
had already came out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:
On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  
wrote:

On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:

Ppatch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a 
moment.

It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver 
creates
an additional 'bypass' netdev that acts as a master device and 
controls

2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' 
netdevs are
associated with the same 'pci' device.  The user accesses the 
network
interface via 'bypass' netdev. The 'bypass' netdev chooses 
'active' netdev
as default for transmits when it is available with link up and 
running.

Thank you do doing this.


We noticed a couple of issues with this approach during testing.
- As both 'bypass' and 'backup' netdevs are associated with the 
same
   virtio pci device, udev tries to rename both of them with the 
same name
   and the 2nd rename will fail. This would be OK as long as the 
first netdev
   to be renamed is the 'bypass' netdev, but the order in which 
udev gets

   to rename the 2 netdevs is not reliable.

Out of curiosity - why do you link the master netdev to the virtio
struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.

Could it made it also possible to take over the config from VF instead
of virtio on an existing VM image? And get udev rename the bypass
netdev to what the original VF was. I don't say tightly binding the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need to a special multiseat like QEMU device for that
purpose?

Our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virito backup. Can this bypass
interface be made general to take place of any pci device other than
virtio-net? This will be more helpful as the cloud users who has
existing setup on VF interface don't have to recreate it on virtio-net
and VF separately again.


Yes. This sounds interesting. Looks like you want an existing VM image 
with

VF only configuration to get transparent live migration support by adding
virtio_net with BACKUP feature.  We may need another feature bit to 
switch

between these 2 options.


After thinking some more, this may be more involved than adding a new
feature bit.  This requires a netdev created by virtio to take over the 
name of

a VF netdev associated with a PCI device that may not be plugged in when
the virtio driver is coming up. This definitely requires some new messages

Re: [PATCH bpf v2] bpf: fix memory leak in lpm_trie map_free callback function

2018-02-21 Thread Yonghong Song



On 2/21/18 7:40 PM, Eric Dumazet wrote:

On Tue, 2018-02-13 at 19:17 -0800, Alexei Starovoitov wrote:

On Tue, Feb 13, 2018 at 07:00:21PM -0800, Yonghong Song wrote:

There is a memory leak happening in lpm_trie map_free callback
function trie_free. The trie structure itself does not get freed.

Also, trie_free function did not do synchronize_rcu before freeing
various data structures. This is incorrect as some rcu_read_lock
region(s) for lookup, update, delete or get_next_key may not complete yet.
The fix is to add synchronize_rcu in the beginning of trie_free.
The useless spin_lock is removed from this function as well.

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reported-by: Mathieu Malaterre 
Reported-by: Alexei Starovoitov 
Tested-by: Mathieu Malaterre 
Signed-off-by: Yonghong Song 
---
  kernel/bpf/lpm_trie.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

v1->v2:
   Make comments more precise and make label name more appropriate,
   as suggested by Daniel


Applied to bpf tree, Thanks Yonghong.



This does not look good.

LOCKDEP surely should complain to

node = rcu_dereference_protected(*slot, lockdep_is_held(>lock));

Since we no longer hold trie->lock


Eric,

Thanks for spotting this issue. Will fix this issue soon.

Yonghong


Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Alexei Starovoitov
On Wed, Feb 21, 2018 at 07:53:22PM -0800, Eric Dumazet wrote:
> > So what kinda comment there would make sense?
> 
> I was thinking of something very explicit :
> 
> /* byte sequence for following assembly code used by eBPF
>call ...
>...
>retq
> */
> #define RETPOLINE_RAX_DIRECT_FOR_EBPF \
>    EMIT1_off32(0xE8, 7);/* callq  */   \
>    /* capture_spec: */\
>    EMIT2(0xF3, 0x90);   /* pause */   \
>    EMIT3(0x0F, 0xAE, 0xE8); /* lfence */  \
>    EMIT2(0xEB, 0xF9);   /* jmp  */  \
>    /* set_up_target: */   \
>    EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */   \
>    EMIT1(0xC3); /* retq */\

got it. yeah. makes sense to me.




F.LLI PISTOLESI Snc

2018-02-21 Thread . F.LLI PISTOLESI Snc
Hello , 
 


I am looking for a reliable supplier /manufacturer of products for sell in 
Europe.

I came across your listing and wanted to get some information regarding minimum 
Order Quantities, FOB pricing and also the possibility of packaging including 
payments terms.

So could you please get back to be with the above informations as soon as 
possible .

My email is :tm6428...@gmail.com

Many thanks and i looking forward to hearing from you and hopefully placing an 
order with you company.

Best Regards
Lorenzo Delleani.

F.LLI PISTOLESI Snc Company P.O. box 205
2740 AE Waddinxveen
The Netherlands


[PATCH net v2 1/2] Revert "tuntap: add missing xdp flush"

2018-02-21 Thread Jason Wang
This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The
reason is we try to batch packets for devmap which causes calling
xdp_do_flush() under the process context. Simply disable premmption
may not work since process may move among processors which lead
xdp_do_flush() to miss some flushes on some processors.

So simply revert the patch, a follow-up path will add the xdp flush
correctly.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b52258c..2823a4a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -181,7 +181,6 @@ struct tun_file {
struct tun_struct *detached;
struct ptr_ring tx_ring;
struct xdp_rxq_info xdp_rxq;
-   int xdp_pending_pkts;
 };
 
 struct tun_flow_entry {
@@ -1662,7 +1661,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
case XDP_REDIRECT:
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
-   ++tfile->xdp_pending_pkts;
err = xdp_do_redirect(tun->dev, , xdp_prog);
if (err)
goto err_redirect;
@@ -1984,11 +1982,6 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
result = tun_get_user(tun, tfile, NULL, from,
  file->f_flags & O_NONBLOCK, false);
 
-   if (tfile->xdp_pending_pkts) {
-   tfile->xdp_pending_pkts = 0;
-   xdp_do_flush_map();
-   }
-
tun_put(tun);
return result;
 }
@@ -2325,13 +2318,6 @@ static int tun_sendmsg(struct socket *sock, struct 
msghdr *m, size_t total_len)
ret = tun_get_user(tun, tfile, m->msg_control, >msg_iter,
   m->msg_flags & MSG_DONTWAIT,
   m->msg_flags & MSG_MORE);
-
-   if (tfile->xdp_pending_pkts >= NAPI_POLL_WEIGHT ||
-   !(m->msg_flags & MSG_MORE)) {
-   tfile->xdp_pending_pkts = 0;
-   xdp_do_flush_map();
-   }
-
tun_put(tun);
return ret;
 }
@@ -3163,7 +3149,6 @@ static int tun_chr_open(struct inode *inode, struct file 
* file)
sock_set_flag(>sk, SOCK_ZEROCOPY);
 
memset(>tx_ring, 0, sizeof(tfile->tx_ring));
-   tfile->xdp_pending_pkts = 0;
 
return 0;
 }
-- 
2.7.4



[PATCH] net/smc9194: Remove bogus CONFIG_MAC reference

2018-02-21 Thread Finn Thain
AFAIK the only version of smc9194.c with Mac support is the one in the
linux-mac68k CVS repo, which never made it to the mainline.

Despite that, from v2.3.45, arch/m68k/config.in listed CONFIG_SMC9194
under CONFIG_MAC. This mistake got carried over into Kconfig in v2.5.55.
(See pre-git era "[PATCH] add m68k dependencies to net driver config".)

Signed-off-by: Finn Thain 
---
 drivers/net/ethernet/smsc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/smsc/Kconfig 
b/drivers/net/ethernet/smsc/Kconfig
index 63aca9f847e1..4c2f612e4414 100644
--- a/drivers/net/ethernet/smsc/Kconfig
+++ b/drivers/net/ethernet/smsc/Kconfig
@@ -20,7 +20,7 @@ if NET_VENDOR_SMSC
 
 config SMC9194
tristate "SMC 9194 support"
-   depends on (ISA || MAC && BROKEN)
+   depends on ISA
select CRC32
---help---
  This is support for the SMC9xxx based Ethernet cards. Choose this
-- 
2.16.1


Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn
On Wed, Feb 21, 2018 at 5:14 PM, Sowmini Varadhan
 wrote:
> On (02/21/18 16:54), Willem de Bruijn wrote:
>>
>> I'd put this optimization behind a socket option.
>
> Yes, that thought occurred to me as well- I think RDS applications
> are unlikely to use the error_queue path because, as I mentioned
> before, these are heavily request-response based, so you're
> going to be getting something back for each sendmsg anyway.
>
> So if I had a sockopt, it would be something that would be
> "piggyback completion" most (all?) of the time anyway, but I suppose
> making it explicit is useful to have.
>
>> Either that, or always store cookies on an RDS specific queue, return as recv
>> cmsg and then remove the alternative error queue path completely.
>
> I think the error queue path is good to have, if only to be aligned
> with other socket families.
>
>> > +   spin_lock_irqsave(>lock, flags);
>>
>> This seems expensive on every recvmsg. Even if zerocopy is not enabled.
>
> noted, I'll predicate it on both zcopy and the sockopt suggestion.
>
>> > +   if (skb_queue_empty(q)) {
>> > +   spin_unlock_irqrestore(>lock, flags);
>> > +   return 0;
>> > +   }
>> > +   skb_queue_walk_safe(q, skb, tmp) {
>>
>> This too. If anything, just peek at the queue head and skip otherwise.
>> Anything else will cause an error, which blocks regular read and write
>> on the socket.
>
> but that would be technically incorrect, because you could have a
> mix/match of the zcopy notification with other sk_error_queue messages
> (given that rds_rm_zerocopy_callback looks at the tail, and
> creates a new skb if the tail is not a SO_EE_ORIGIN_ZCOOKIE).
>
> Of course, its true that *today* the only thing in the rds socket
> error queue is the cookie list, so this queue walk is a bit of overkill..
>
> but maybe I am missing something you are concerned about? The queue walk is
> intended to pull out the first SO_EE_ORIGIN_ZCOOKIE (if it exists)
> and return quietly otherwise

Yes, avoiding out of order completions is sensible.

In the common case no more than one notification will be outstanding,
but with a fixed number of notifications per packet, in edge cases this
list may be long.

Reading using a reverse walk avoids that problem.

> (leaving err_queue unchanged)- where
> do you see the blocking error?

Socket functions block if sk_err is non-zero. See for instance
tcp_sendmsg_locked. It is set by most code that also calls
sock_queue_err_skb and also on dequeue from err skb.

This is the main reason that I would consider dropping error
queue completely if you expect all users of RDS to use the
cmsg on regular read to get these notifications.

>> > +   if (list_empty(>rs_recv_queue) && nonblock) {
>> > +   ncookies = rds_recvmsg_zcookie(rs, msg);
>> > +   if (ncookies) {
>> > +   ret = 0;
>>
>> This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?
>
> The only time MSG_DONTWAIT was previously returning 0 for PF_RDS
> was when there was a congestion notification or rdma completion
> i.e., the checks for rs_notify_queue and rs_cong_notify earlier
> in the function, and in those 2 cases it was not returning data
> (i.e. ret was always 0). In all other cases  rds_recvmsg would
> always return either
> - ret == -1, with errno set to EAGAIN or ETIMEDOUT, depending on the value
>   of MSG_DONTWAIT, or,
> - ret > 0 with data bytes.
>
> That behavior (as well existing behavior for congestion notification
> and rdma completion) has not changed. The only thing that changed is that
> if there is no data to pass up, the ret will be 0, errno will be 0,
> and the ancillary data may have completion cookies.

Okay. If callers must already handle 0 as a valid return code, then
it is fine to add another case that does this.

The extra branch in the hot path is still rather unfortunately. Could
this be integrated in the existing if (nonblock) branch below?


Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Sowmini Varadhan
On (02/21/18 17:50), Willem de Bruijn wrote:
> 
> In the common case no more than one notification will be outstanding,
> but with a fixed number of notifications per packet, in edge cases this
> list may be long.
   :
> Socket functions block if sk_err is non-zero. See for instance
> tcp_sendmsg_locked. It is set by most code that also calls
> sock_queue_err_skb and also on dequeue from err skb.
> 
> This is the main reason that I would consider dropping error
> queue completely if you expect all users of RDS to use the
> cmsg on regular read to get these notifications.

I see. That's a good point, and maybe it makes sense to just have
a struct sk_buff_head rs_zcookie_quese on the rds_sock, and
have rds_rm_zerocopy_callback chain cookies ot this  rs_zcookie_queue.

[discussion regarding rds_recvmsg return values elided]

> Okay. If callers must already handle 0 as a valid return code, then
> it is fine to add another case that does this.
> 
> The extra branch in the hot path is still rather unfortunately. Could
> this be integrated in the existing if (nonblock) branch below?

that's where I first started. it got even hairier because the
callers expect a retval of 0 (-EAGAIN threw rds-stress into an
infinite loop of continulally trying to recv) and the end result
was just confusing code with the same number of branches.. 
let me revisit this when I spin out V2 without the sk_error_queue..

--Sowmini


Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Shuah Khan
On 02/21/2018 05:33 PM, Daniel Borkmann wrote:
> Hi Shuah,
> 
> On 02/22/2018 12:03 AM, Shuah Khan wrote:
>> On 02/21/2018 03:48 PM, David Miller wrote:
>>> From: Anders Roxell 
>>> Date: Wed, 21 Feb 2018 22:30:01 +0100
>>>
 bpf builds a test program for loading BPF ELF files. Add the executable
 to the .gitignore list.

 Signed-off-by: Anders Roxell 
>>>
>>> Acked-by: David S. Miller 
>>
>> Thanks. I will pull this in for 4.16-rc
> 
> I would have taken this into bpf tree, but fair enough. This one
> in particular doesn't cause any conflicts right now, so feel free
> to pick it up then. But usual paths are via bpf / bpf-next as we
> otherwise would run into real ugly merge conflicts.
> 

Daniel,

Go for it. I know bpf tree stuff causes conflicts. That is why I usually
stay away from bpfs selftests and let you handle them.

You are taking the other one away, pick this up as well.

Acked-by: Shuah Khan 

thanks,
-- Shuah



Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-21 Thread Jakub Kicinski
On Wed, 21 Feb 2018 16:30:07 -0800, Florian Fainelli wrote:
> On 02/21/2018 03:46 PM, Jakub Kicinski wrote:
> > On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote:  
> >> We also have a large range of TCAM based hardware offload outthere
> >> that will _not_ work with your BPF HW offload infrastructure. What
> >> this bpf infrastructure pushes into the kernel is just a blob
> >> expressing things in a very low-level instruction-set: trying to find
> >> a mapping of that to typical HW intermediate representations in the
> >> TCAM based HW offload world will be simply crazy.  
> > 
> > I'm not sure where the TCAM talk is coming from.  Think much smaller -
> > cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs.  The
> > information the verifier is gathering will be crucial for optimizing
> > those.  Please don't discount the value of being able to use
> > heterogeneous processing units by the networking stack.
> 
> The only use case that we have a good answer for is when there is no HW
> offload capability available, because there, we know that eBPF is our
> best possible solution for a software fast path, in large part because
> of all the efforts that went into making it both safe and fast.

I was trying to point out that JITing eBPF for the host on 32 bit
systems is already a pain, Jiong Wang is leading an effort to improve
this both from LLVM and verifier angles, IOW running through the
verifier may become useful even for host JITs :)

> When there is offloading HW available, there does not appear to be a
> perfect answer to this problem of, given a standard Linux utility that
> can express any sort of match + action, be it ethtool::rxnfc,
> tc/cls_{u32,flower}, nftables, how do I transform that into what makes
> most sense to my HW? You could:
> 
> - have hardware that understands BPF bytecode directly, great, then you
> don't have to do anything, just pass it up the driver baby, oh wait,
> it's not that simple, the NFP driver is not small

True, it's not the largest but fair point, IMHO we should be trying to
push for sharing as much code between drivers as possible, and on all
fronts, but that's a topic for another time...

> - transform BPF back into something that your hardware understand, does
> that belong in the kernel? Maybe, maybe not

Personally, I think there is non-zero probability of AMP CPUs/systems
becoming more common.  NFP is very powerful and fast, but less advanced
solution may just use an off-the-shelf MIPS/ARM/Andes core.  Taking it
slightly further from home to the cellular/WiFi wake up problem which
was mentioned by Android folks at one of netdevs - if we have
MIPS/ARM/Andes *host* JIT in the kernel, and the NIC processor is built
on one of those all the driver needs to provide is some glue and we can
offload filtering to the MCU on the NIC/modem!

> - use a completely different intermediate representation like P4,
> brainfuck, I don't know
>
> Maybe first things first, we have at least 3 different programming
> interfaces, if not more: ethtool::rxnfc, tc/cls_{u32,flower}, nftables
> that are all capable of programming TCAMs and hardware capable of match
> + action, how about we start with having some sort of common library
> code that:
> 
> - validates input parameters against HW capabilities

This one may be quite hard.

> - does the adequate transformation from any of these interfaces into a
> generic set of input parameters
> - define what the appropriate behavior is when programming through all
> of these 3 interfaces that ultimately access the same shared piece of
> HW, and therefore need to manage resources allocation?

That would be great! :)  Flower stands out today as the most feature
rich and a go-to for TCAM offloads.

> 


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jakub Kicinski
On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:
> > I don't see why the team cannot be there always.  
> 
> It is more the logistical nightmare. Part of the goal here was to work
> with the cloud base images that are out there such as
> https://alt.fedoraproject.org/cloud/. With just the kernel changes the
> overhead for this stays fairly small and would be pulled in as just a
> standard part of the kernel update process. The virtio bypass only
> pops up if the backup bit is present. With the team solution it
> requires that the base image use the team driver on virtio_net when it
> sees one. I doubt the OSVs would want to do that just because SR-IOV
> isn't that popular of a case.

IIUC we need to monitor for a "backup hint", spawn the master, rename it
to maintain backwards compatibility with no-VF setups and enslave the VF
if it appears.

All those sound possible from user space, the advantage of the kernel
solution right now is that it has more complete code.

Am I misunderstanding?


Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Eric Dumazet
On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote:

...

> +/* Instead of plain jmp %rax, we emit a retpoline to control
> + * speculative execution for the indirect branch.
> + */
> +static void emit_retpoline_rax_trampoline(u8 **pprog)
> +{
> + u8 *prog = *pprog;
> + int cnt = 0;
> +
> + EMIT1_off32(0xE8, 7);/* callq  */
> + /* capture_spec: */
> + EMIT2(0xF3, 0x90);   /* pause */
> + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
> + EMIT2(0xEB, 0xF9);   /* jmp  */
> + /* set_up_target: */
> + EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
> + EMIT1(0xC3); /* retq */
> +
> + BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
> + *pprog = prog;

You might define the actual code sequence (and length) in 
arch/x86/include/asm/nospec-branch.h

If we need to adjust code sequences for RETPOLINE, then we wont
forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded.

Thanks Daniel.



[PATCH iproute2] README: re-add updated information link

2018-02-21 Thread Jakub Kicinski
From: Quentin Monnet 

The "Information" link was removed from README file in commit
d7843207e6fd ("README: update location of git repositories, remove
broken info link"), because it redirected to a page that no longer
existed on the Linux Foundation wiki.

This page has just been restored, so we can add the link back again.
Since the previous link was a redirection, use the updated link instead.

Thanks to Luca Boccassi for investigating this issue, restoring and
updating the page.

Signed-off-by: Quentin Monnet 
---
 README | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/README b/README
index 1b7f44272f5a..f66fd5faf4cf 100644
--- a/README
+++ b/README
@@ -1,5 +1,8 @@
 This is a set of utilities for Linux networking.
 
+Information:
+https://wiki.linuxfoundation.org/networking/iproute2
+
 Download:
 http://www.kernel.org/pub/linux/utils/net/iproute2/
 
-- 
2.15.1



[PATCH net-next 3/3] nfp: advertise firmware for mixed 10G/25G mode

2018-02-21 Thread Jakub Kicinski
From: Dirk van der Merwe 

The AMDA0099-0001 platform can support the 1x10G + 1x25G mixed mode
operation. Recently, firmware has been added for this configuration
mode.

Signed-off-by: Dirk van der Merwe 
Acked-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index ab301d56430b..c4b1f344b4da 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -645,6 +645,7 @@ 
MODULE_FIRMWARE("netronome/nic_AMDA0097-0001_4x10_1x40.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0097-0001_8x10.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x10.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x25.nffw");
+MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_1x10_1x25.nffw");
 
 MODULE_AUTHOR("Netronome Systems ");
 MODULE_LICENSE("GPL");
-- 
2.15.1



[PATCH net-next] ibmvnic: Split counters for scrq/pools/napi

2018-02-21 Thread Nathan Fontenot
The approach of one counter to rule them all when tracking the number
of active sub-crqs, pools, and napi has problems handling some failover
scenarios. This is due to the split in initializing the sub crqs,
pools and napi in different places and the placement of updating
the active counts.

This patch simplifies this by having a counter for tx and rx
sub-crqs, pools, and napi.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |   38 
 drivers/net/ethernet/ibm/ibmvnic.h |7 +--
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 1703b881252f..8ca88f7cc661 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -461,7 +461,7 @@ static void release_rx_pools(struct ibmvnic_adapter 
*adapter)
if (!adapter->rx_pool)
return;
 
-   for (i = 0; i < adapter->num_active_rx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_rx_pools; i++) {
rx_pool = >rx_pool[i];
 
netdev_dbg(adapter->netdev, "Releasing rx_pool[%d]\n", i);
@@ -484,6 +484,7 @@ static void release_rx_pools(struct ibmvnic_adapter 
*adapter)
 
kfree(adapter->rx_pool);
adapter->rx_pool = NULL;
+   adapter->num_active_rx_pools = 0;
 }
 
 static int init_rx_pools(struct net_device *netdev)
@@ -508,6 +509,8 @@ static int init_rx_pools(struct net_device *netdev)
return -1;
}
 
+   adapter->num_active_rx_pools = rxadd_subcrqs;
+
for (i = 0; i < rxadd_subcrqs; i++) {
rx_pool = >rx_pool[i];
 
@@ -608,7 +611,7 @@ static void release_tx_pools(struct ibmvnic_adapter 
*adapter)
if (!adapter->tx_pool)
return;
 
-   for (i = 0; i < adapter->num_active_tx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_tx_pools; i++) {
netdev_dbg(adapter->netdev, "Releasing tx_pool[%d]\n", i);
tx_pool = >tx_pool[i];
kfree(tx_pool->tx_buff);
@@ -619,6 +622,7 @@ static void release_tx_pools(struct ibmvnic_adapter 
*adapter)
 
kfree(adapter->tx_pool);
adapter->tx_pool = NULL;
+   adapter->num_active_tx_pools = 0;
 }
 
 static int init_tx_pools(struct net_device *netdev)
@@ -635,6 +639,8 @@ static int init_tx_pools(struct net_device *netdev)
if (!adapter->tx_pool)
return -1;
 
+   adapter->num_active_tx_pools = tx_subcrqs;
+
for (i = 0; i < tx_subcrqs; i++) {
tx_pool = >tx_pool[i];
 
@@ -745,6 +751,7 @@ static int init_napi(struct ibmvnic_adapter *adapter)
   ibmvnic_poll, NAPI_POLL_WEIGHT);
}
 
+   adapter->num_active_rx_napi = adapter->req_rx_queues;
return 0;
 }
 
@@ -755,7 +762,7 @@ static void release_napi(struct ibmvnic_adapter *adapter)
if (!adapter->napi)
return;
 
-   for (i = 0; i < adapter->num_active_rx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_rx_napi; i++) {
if (>napi[i]) {
netdev_dbg(adapter->netdev,
   "Releasing napi[%d]\n", i);
@@ -765,6 +772,7 @@ static void release_napi(struct ibmvnic_adapter *adapter)
 
kfree(adapter->napi);
adapter->napi = NULL;
+   adapter->num_active_rx_napi = 0;
 }
 
 static int ibmvnic_login(struct net_device *netdev)
@@ -998,10 +1006,6 @@ static int init_resources(struct ibmvnic_adapter *adapter)
return rc;
 
rc = init_tx_pools(netdev);
-
-   adapter->num_active_tx_scrqs = adapter->req_tx_queues;
-   adapter->num_active_rx_scrqs = adapter->req_rx_queues;
-
return rc;
 }
 
@@ -1706,9 +1710,6 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 
release_napi(adapter);
init_napi(adapter);
-
-   adapter->num_active_tx_scrqs = adapter->req_tx_queues;
-   adapter->num_active_rx_scrqs = adapter->req_rx_queues;
} else {
rc = reset_tx_pools(adapter);
if (rc)
@@ -2398,19 +2399,10 @@ static struct ibmvnic_sub_crq_queue 
*init_sub_crq_queue(struct ibmvnic_adapter
 
 static void release_sub_crqs(struct ibmvnic_adapter *adapter, bool do_h_free)
 {
-   u64 num_tx_scrqs, num_rx_scrqs;
int i;
 
-   if (adapter->state == VNIC_PROBED) {
-   num_tx_scrqs = adapter->req_tx_queues;
-   num_rx_scrqs = adapter->req_rx_queues;
-   } else {
-   num_tx_scrqs = adapter->num_active_tx_scrqs;
-   num_rx_scrqs = adapter->num_active_rx_scrqs;
-   }
-
if (adapter->tx_scrq) {
-   for (i = 0; i < num_tx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_tx_scrqs; i++) {

[PATCH net-next 1/3] nfp: add Makefiles to all directories

2018-02-21 Thread Jakub Kicinski
To be able to build separate objects we need to provide
Kbuild with a Makefile in each directory.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/flower/Makefile  | 2 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nic/Makefile | 2 ++
 5 files changed, 10 insertions(+)
 create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/flower/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nic/Makefile

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/Makefile 
b/drivers/net/ethernet/netronome/nfp/bpf/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/bpf/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/flower/Makefile 
b/drivers/net/ethernet/netronome/nfp/flower/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/flower/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile 
b/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/nic/Makefile 
b/drivers/net/ethernet/netronome/nfp/nic/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nic/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
-- 
2.15.1



[PATCH net-next 0/3] nfp: build and FW initramfs updates

2018-02-21 Thread Jakub Kicinski
Hi!

This set brings empty makefiles to allow building single object files
(useful for build-testing), Kbuild does not cater to this use case
too well.  There are two ethernet drivers right now which suffer
from this (nfp, aquantia), both are fixed.

Dirk adds an uncommon FW image name to the list of firmware files
module may request.

Dirk van der Merwe (1):
  nfp: advertise firmware for mixed 10G/25G mode

Jakub Kicinski (2):
  nfp: add Makefiles to all directories
  aquantia: add Makefiles to all directories

 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile  | 2 ++
 drivers/net/ethernet/netronome/nfp/bpf/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/flower/Makefile  | 2 ++
 drivers/net/ethernet/netronome/nfp/nfp_main.c   | 1 +
 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nic/Makefile | 2 ++
 7 files changed, 13 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/flower/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nic/Makefile

-- 
2.15.1



[PATCH net-next 2/3] aquantia: add Makefiles to all directories

2018-02-21 Thread Jakub Kicinski
To be able to build separate objects we need to provide
Kbuild with a Makefile in each directory.

Signed-off-by: Jakub Kicinski 
---
CC: Igor Russkikh 

 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile | 2 ++
 1 file changed, 2 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile 
b/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
-- 
2.15.1



[PATCH iproute2 net-next] ss: print skmeminfo for packet sockets

2018-02-21 Thread Roopa Prabhu
From: Roopa Prabhu 

before:
$ss --packet -p -m
p_raw0  0*:eth0
  users:(("lldpd",pid=2240,fd=11))

after:
$ss --packet -p -m
p_raw0  0*:eth0
  users:(("lldpd",pid=2240,fd=11))
  skmem:(r0,rb266240,t0,tb266240,f0,w0,o320,bl0,d0)

Signed-off-by: Roopa Prabhu 
---
 misc/ss.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index 29a2507..49f9c49 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -3920,6 +3920,9 @@ static int packet_show_sock(const struct sockaddr_nl 
*addr,
fil++;
}
}
+
+   if (show_mem)
+   print_skmeminfo(tb, PACKET_DIAG_MEMINFO);
return 0;
 }
 
-- 
2.1.4



[PATCH net v2 2/2] tuntap: correctly add the missing xdp flush

2018-02-21 Thread Jason Wang
Commit 762c330d670e ("tuntap: add missing xdp flush") tries to fix the
devmap stall caused by missed xdp flush by counting the pending xdp
redirected packets and flush when it exceeds NAPI_POLL_WEIGHT or
MSG_MORE is clear. This may lead BUG() since xdp_do_flush() was
called under process context with preemption enabled. Simply disable
preemption may silent the warning but be not enough since process may
move between different CPUS during a batch which cause xdp_do_flush()
misses some CPU where the process run previously. Consider the several
fallouts, that commit was reverted. To fix the issue correctly, we can
simply calling xdp_do_flush() immediately after xdp_do_redirect(),
a side effect is that this removes any possibility of batching which
could be addressed in the future.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2823a4a..a363ea2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1662,6 +1662,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
err = xdp_do_redirect(tun->dev, , xdp_prog);
+   xdp_do_flush_map();
if (err)
goto err_redirect;
rcu_read_unlock();
-- 
2.7.4



Re: [PATCH v2 net-next 1/2] lan743x: Add main source files for new lan743x driver

2018-02-21 Thread Andrew Lunn
> +static void lan743x_intr_unregister_isr(struct lan743x_adapter *adapter,
> + int vector_index)
> +{
> + struct lan743x_vector *vector = >intr.vector_list
> + [vector_index];
> +
> + devm_free_irq(>pci.pdev->dev, vector->irq, vector);

Hu Bryan

The point of devm_ is that you don't need to free resources you have
allocated using devm_. The core will release them when the device is
removed.

In this case, you might want to ensure the hardware will not generate
any more interrupts, but you can leave the core to call free_irq().

Please look at all your devm_*free() like calls, and remove them if
they are not needed. 

> +static void lan743x_mdiobus_cleanup(struct lan743x_adapter *adapter)
> +{
> + if (adapter->init_flags & LAN743X_INIT_FLAG_MDIOBUS_REGISTERED) {
> + mdiobus_unregister(adapter->mdiobus);
> + adapter->init_flags &= ~LAN743X_INIT_FLAG_MDIOBUS_REGISTERED;
> + }
> + if (adapter->init_flags & LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED) {
> + devm_mdiobus_free(>pci.pdev->dev, adapter->mdiobus);
> + adapter->mdiobus = NULL;
> + adapter->init_flags &= ~LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED;
> + }
> +}

So you can delete devm_mdiobus_free(). That probably means you can also remove 
LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED.

> +
> +static void lan743x_full_cleanup(struct lan743x_adapter *adapter)
> +{
> + if (adapter->init_flags & LAN743X_INIT_FLAG_NETDEV_REGISTERED) {
> + unregister_netdev(adapter->netdev);
> + adapter->init_flags &= ~LAN743X_INIT_FLAG_NETDEV_REGISTERED;
> + }
> + lan743x_mdiobus_cleanup(adapter);
> + lan743x_hardware_cleanup(adapter);
> + if (adapter->init_flags & LAN743X_COMPONENT_FLAG_PCI) {
> + lan743x_pci_cleanup(adapter);
> + adapter->init_flags &= ~LAN743X_COMPONENT_FLAG_PCI;
> + }
> +
> + /* would have freed netdev here.
> +  * but netdev was allocated with devm_alloc_etherdev.
> +  * and devm_free_netdev is not accessible.
> +  * so it is expected to be freed by the devm subsystem.
> +  */

And this comment can go.

Andrew


Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case

2018-02-21 Thread Jason Wang



On 2018年02月16日 23:41, Jesper Dangaard Brouer wrote:

On Fri, 16 Feb 2018 13:31:37 +0800
Jason Wang  wrote:


On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote:

The virtio_net code have three different RX code-paths in receive_buf().
Two of these code paths can handle XDP, but one of them is broken for
at least XDP_REDIRECT.

Function(1): receive_big() does not support XDP.
Function(2): receive_small() support XDP fully and uses build_skb().
Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb().

The simple explanation is that receive_mergeable() is broken because
it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
header+data in single page and enough tail room for skb_shared_info.

The longer explaination is that receive_mergeable() tries to
work-around and satisfy these XDP requiresments e.g. by having a
function xdp_linearize_page() that allocates and memcpy RX buffers
around (in case packet is scattered across multiple rx buffers).  This
does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
we have not implemented bpf_xdp_adjust_tail yet).

The XDP_REDIRECT action combined with cpumap is broken, and cause hard
to debug crashes.  The main issue is that the RX packet does not have
the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
skb_shared_info to overlap the next packets head-room (in which cpumap
stores info).

Reproducing depend on the packet payload length and if RX-buffer size
happened to have tail-room for skb_shared_info or not.  But to make
this even harder to troubleshoot, the RX-buffer size is runtime
dynamically change based on an Exponentially Weighted Moving Average
(EWMA) over the packet length, when refilling RX rings.

This patch only disable XDP_REDIRECT support in receive_mergeable()
case, because it can cause a real crash.

But IMHO we should NOT support XDP in receive_mergeable() at all,
because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time.  These principles are clearly being
violated in receive_mergeable(), that e.g. runtime track average
buffer size to save memory consumption.

I agree to disable it for -net now.

Okay... I'll send an official patch later.


For net-next, we probably can do:

- drop xdp_linearize_page() and do XDP through generic XDP helper
   after skb was built

I disagree strongly here - it makes no sense.

Why do you want to explicit fallback to Generic-XDP?
(... then all the performance gain is gone!)


Note this only happens when:

1) Rx buffer size is under estimated, we could disable estimation and 
then this won't happen
2) headroom is not sufficient, we try hard to not stop device during XDP 
set, so this can happen but only for the first several packets


So this looks pretty fine and remove a lot of complex codes.


And besides, a couple of function calls later, the generic XDP code
will/can get invoked anyhow...


How if we choose to use native mode of XDP?




Take a step back:
  What is the reason/use-case for implementing XDP inside virtio_net?

 From a DDoS/performance perspective XDP in virtio_net happens on the
"wrong-side" as it is activated _inside_ the guest OS, which is too
late for a DDoS filter, as the guest kick/switch overhead have already
occurred.


I don't see any difference of virtio-net now. Consider a real hardward 
NIC, XDP (except for the offload case) also start to drop packet after 
it reach hardware.




I do use XDP_DROP inside the guest (driver virtio_net), but just to
perform what I can zoom-in benchmarking, for perf-record isolating the
early RX code path in the guest.  (Using iptables "raw" table drop is
almost as useful for that purpose).



We could not assume the type of virtio-net backend, it could be dpdk or 
other high speed implementation. And I'm pretty sure there are more use 
cases, here are two:


- Use XDP to accelerate nest VM
- XDP offload to host



The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is
significantly more interesting.  As it allow us to skip large parts of
the network stack and redirect from a physical device (ixgbe) into a
guest device.  Ran a benchmark:
  - 0.5 Mpps with normal code path into device with driver tun
  - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device

Plus, there are indications that 3.7Mpps is not the real limit, as
guest CPU doing XDP_DROP is 75% idle... thus this is a likely a
scheduling + queue size issue.


Yes, XDP_DROP can do more (but I forget the exact number). Btw testpmd 
(in guest) can give me about 3Mpps when doing forwarding (io mode). The 
main bottleneck in this case is vhost, XDP_REDIRECT can provides about 
8Mpps to tun, but vhost can only receive about 4Mpps, and vhost tx can 
only have 3Mpps.


Thanks





- disable EWMA when XDP is set and reserve enough tailroom.


Besides the described bug:

Update(1): There is also a OOM leak in the 

Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Alexei Starovoitov
On Wed, Feb 21, 2018 at 07:04:02PM -0800, Eric Dumazet wrote:
> On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote:
> 
> ...
> 
> > +/* Instead of plain jmp %rax, we emit a retpoline to control
> > + * speculative execution for the indirect branch.
> > + */
> > +static void emit_retpoline_rax_trampoline(u8 **pprog)
> > +{
> > +   u8 *prog = *pprog;
> > +   int cnt = 0;
> > +
> > +   EMIT1_off32(0xE8, 7);/* callq  */
> > +   /* capture_spec: */
> > +   EMIT2(0xF3, 0x90);   /* pause */
> > +   EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
> > +   EMIT2(0xEB, 0xF9);   /* jmp  */
> > +   /* set_up_target: */
> > +   EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
> > +   EMIT1(0xC3); /* retq */
> > +
> > +   BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
> > +   *pprog = prog;
> 
> You might define the actual code sequence (and length) in 
> arch/x86/include/asm/nospec-branch.h
> 
> If we need to adjust code sequences for RETPOLINE, then we wont
> forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded.

like adding a comment to asm/nospec-branch.h that says
"dont forget to adjust bpf_jit_comp.c" ?
but clang/gcc generate slightly different sequences for
retpoline anyway, so even if '.macro RETPOLINE_JMP' in
nospec-branch.h changes it doesn't mean that x64 jit has to change.
So what kinda comment there would make sense?



Re: [PATCH] netlink: put module reference if dump start fails

2018-02-21 Thread Bo YU

Hi,
On Wed, Feb 21, 2018 at 04:41:05PM +0100, Jason A. Donenfeld wrote:

Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")

I think you Would better to resend it.

Bo,


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Samudrala, Sridhar

On 2/21/2018 5:59 PM, Siwei Liu wrote:

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
 wrote:

On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:

I haven't checked emails for days and did not realize the new revision
had already came out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:

On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:

On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:

Ppatch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a moment.
It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

Thank you do doing this.


We noticed a couple of issues with this approach during testing.
- As both 'bypass' and 'backup' netdevs are associated with the same
   virtio pci device, udev tries to rename both of them with the same name
   and the 2nd rename will fail. This would be OK as long as the first netdev
   to be renamed is the 'bypass' netdev, but the order in which udev gets
   to rename the 2 netdevs is not reliable.

Out of curiosity - why do you link the master netdev to the virtio
struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.

Could it made it also possible to take over the config from VF instead
of virtio on an existing VM image? And get udev rename the bypass
netdev to what the original VF was. I don't say tightly binding the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need to a special multiseat like QEMU device for that
purpose?

Our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virito backup. Can this bypass
interface be made general to take place of any pci device other than
virtio-net? This will be more helpful as the cloud users who has
existing setup on VF interface don't have to recreate it on virtio-net
and VF separately again.


Yes. This sounds interesting. Looks like you want an existing VM image with
VF only configuration to get transparent live migration support by adding
virtio_net with BACKUP feature.  We may need another feature bit to switch
between these 2 options.





The other bits get into more complexity then we are ready to handle
for now. I think I might have talked about something similar that I
was referring to as a "virtio-bond" where you would have a PCI/PCIe
tree topology that makes this easier to sort out, and the "virtio-bond
would be used to handle coordination/configuration of a much more
complex interface.

That was 

Re: [PATCH bpf v2] bpf: fix memory leak in lpm_trie map_free callback function

2018-02-21 Thread Eric Dumazet
On Tue, 2018-02-13 at 19:17 -0800, Alexei Starovoitov wrote:
> On Tue, Feb 13, 2018 at 07:00:21PM -0800, Yonghong Song wrote:
> > There is a memory leak happening in lpm_trie map_free callback
> > function trie_free. The trie structure itself does not get freed.
> > 
> > Also, trie_free function did not do synchronize_rcu before freeing
> > various data structures. This is incorrect as some rcu_read_lock
> > region(s) for lookup, update, delete or get_next_key may not complete yet.
> > The fix is to add synchronize_rcu in the beginning of trie_free.
> > The useless spin_lock is removed from this function as well.
> > 
> > Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map 
> > implementation")
> > Reported-by: Mathieu Malaterre 
> > Reported-by: Alexei Starovoitov 
> > Tested-by: Mathieu Malaterre 
> > Signed-off-by: Yonghong Song 
> > ---
> >  kernel/bpf/lpm_trie.c | 11 +++
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> > 
> > v1->v2:
> >   Make comments more precise and make label name more appropriate,
> >   as suggested by Daniel
> 
> Applied to bpf tree, Thanks Yonghong.


This does not look good.

LOCKDEP surely should complain to

node = rcu_dereference_protected(*slot, lockdep_is_held(>lock));

Since we no longer hold trie->lock




[PATCH bpf] bpf: fix rcu lockdep warning for lpm_trie map_free callback

2018-02-21 Thread Yonghong Song
Commit 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback 
function")
fixed a memory leak and removed unnecessary locks in map_free callback function.
Unfortrunately, it introduced a lockdep warning. When lockdep checking is 
turned on,
running tools/testing/selftests/bpf/test_lpm_map will have:

  [   98.294321] =
  [   98.294807] WARNING: suspicious RCU usage
  [   98.295359] 4.16.0-rc2+ #193 Not tainted
  [   98.295907] -
  [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious 
rcu_dereference_check() usage!
  [   98.297657]
  [   98.297657] other info that might help us debug this:
  [   98.297657]
  [   98.298663]
  [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
  [   98.299536] 2 locks held by kworker/2:1/54:
  [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: 
[<196bc1f0>] process_one_work+0x157/0x5c0
  [   98.301381]  #1:  ((work_completion)(>work)){+.+.}, at: 
[<196bc1f0>] process_one_work+0x157/0x5c0

Since actual trie tree removal happens only after no other
accesses to the tree are possible, this patch simply converted all
rcu protected pointer access to normal access, which removed the
above warning.

Fixes: 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback 
function")
Reported-by: Eric Dumazet 
Signed-off-by: Yonghong Song 
---
 kernel/bpf/lpm_trie.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index a75e02c..0c15813 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -552,7 +552,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 static void trie_free(struct bpf_map *map)
 {
struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
-   struct lpm_trie_node __rcu **slot;
+   struct lpm_trie_node **slot;
struct lpm_trie_node *node;
 
/* Wait for outstanding programs to complete
@@ -569,23 +569,22 @@ static void trie_free(struct bpf_map *map)
slot = >root;
 
for (;;) {
-   node = rcu_dereference_protected(*slot,
-   lockdep_is_held(>lock));
+   node = *slot;
if (!node)
goto out;
 
-   if (rcu_access_pointer(node->child[0])) {
+   if (node->child[0]) {
slot = >child[0];
continue;
}
 
-   if (rcu_access_pointer(node->child[1])) {
+   if (node->child[1]) {
slot = >child[1];
continue;
}
 
kfree(node);
-   RCU_INIT_POINTER(*slot, NULL);
+   *slot = NULL;
break;
}
}
-- 
2.9.5



Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO

2018-02-21 Thread Neal Cardwell
On Wed, Feb 21, 2018 at 9:43 AM, Eric Dumazet  wrote:
>
> From: Eric Dumazet 
>
> BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> burst sizes and to adjust cwnd in bbr_target_cwnd() with following
> gold formula :
>
> /* Allow enough full-sized skbs in flight to utilize end systems. */
> cwnd += 3 * bbr->tso_segs_goal;
>
> But GSO can be lacking or be constrained to very small
> units (ip link set dev ... gso_max_segs 2)
>
> What we really want is to have enough packets in flight so that both
> GSO and GRO are efficient.
>
> So in the case GSO is off or downgraded, we still want to have the same
> number of packets in flight as if GSO/TSO was fully operational, so
> that GRO can hopefully be working efficiently.
>
> To fix this issue, we make tcp_tso_autosize() unaware of
> sk->sk_gso_max_segs
>
> Only tcp_tso_segs() has to enforce the gso_max_segs limit.
>
> Tested:
>
> ethtool -K eth0 tso off gso off
> tc qd replace dev eth0 root pfifo_fast
>
> Before patch:
> for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> 691  (ss -temoi shows cwnd is stuck around 6 )
> 667
> 651
> 631
> 517
>
> After patch :
> # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
>1733 (ss -temoi shows cwnd is around 386 )
>1778
>1746
>1781
>1718
>
> Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> Signed-off-by: Eric Dumazet 
> Reported-by: Oleksandr Natalenko 
> ---
>  net/ipv4/tcp_output.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)

Acked-by: Neal Cardwell 

Thanks, Eric!

neal


Re: [PATCH] netlink: put module reference if dump start fails

2018-02-21 Thread Jason A. Donenfeld
Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")


Re: [PATCH iproute2-next] rdma: Add batch command support

2018-02-21 Thread Leon Romanovsky
On Wed, Feb 21, 2018 at 09:10:45AM -0700, David Ahern wrote:
> On 2/21/18 5:38 AM, Leon Romanovsky wrote:
> > @@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
> > { 0 }
> > };
> >
> > +   rd->argc = argc;
> > +   rd->argv = argv;
> > +
> > return rd_exec_cmd(rd, cmds, "object");
> >  }
> >
> > -static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
> > +static int rd_batch(struct rd *rd, const char *name, bool force)
> > +{
> > +   char *line = NULL;
> > +   size_t len = 0;
> > +   int ret = 0;
> > +
> > +   if (name && strcmp(name, "-") != 0) {
> > +   if (!freopen(name, "r", stdin)) {
> > +   pr_err("Cannot open file \"%s\" for reading: %s\n",
> > +  name, strerror(errno));
> > +   return errno;
> > +   }
> > +   }
> > +
> > +   cmdlineno = 0;
> > +   while (getcmdline(, , stdin) != -1) {
> > +   char *largv[512];
> > +   int largc;
> > +
> > +   largc = makeargs(line, largv, 100);
>
> you have largv[512] declared but passing a max of 100. I realize other
> batch commands have it hardcoded, but ARRAY_SIZE is better.

Thanks for catching it, it was my copy/paste error.

I'll wait for a day for other comments and will resubmit if it is needed.

The fix is:

diff --git a/rdma/rdma.c b/rdma/rdma.c
index ab8f98d2..ab2c9608 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -62,7 +62,7 @@ static int rd_batch(struct rd *rd, const char *name, bool 
force)
char *largv[512];
int largc;

-   largc = makeargs(line, largv, 100);
+   largc = makeargs(line, largv, ARRAY_SIZE(largv));
if (!largc)
continue;   /* blank line */



signature.asc
Description: PGP signature


Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO

2018-02-21 Thread Soheil Hassas Yeganeh
On Wed, Feb 21, 2018 at 10:14 AM Neal Cardwell  wrote:

> On Wed, Feb 21, 2018 at 9:43 AM, Eric Dumazet 
wrote:
> >
> > From: Eric Dumazet 
> >
> > BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> > burst sizes and to adjust cwnd in bbr_target_cwnd() with following
> > gold formula :
> >
> > /* Allow enough full-sized skbs in flight to utilize end systems. */
> > cwnd += 3 * bbr->tso_segs_goal;
> >
> > But GSO can be lacking or be constrained to very small
> > units (ip link set dev ... gso_max_segs 2)
> >
> > What we really want is to have enough packets in flight so that both
> > GSO and GRO are efficient.
> >
> > So in the case GSO is off or downgraded, we still want to have the same
> > number of packets in flight as if GSO/TSO was fully operational, so
> > that GRO can hopefully be working efficiently.
> >
> > To fix this issue, we make tcp_tso_autosize() unaware of
> > sk->sk_gso_max_segs
> >
> > Only tcp_tso_segs() has to enforce the gso_max_segs limit.
> >
> > Tested:
> >
> > ethtool -K eth0 tso off gso off
> > tc qd replace dev eth0 root pfifo_fast
> >
> > Before patch:
> > for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> > 691  (ss -temoi shows cwnd is stuck around 6 )
> > 667
> > 651
> > 631
> > 517
> >
> > After patch :
> > # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> >1733 (ss -temoi shows cwnd is around 386 )
> >1778
> >1746
> >1781
> >1718
> >
> > Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> > Signed-off-by: Eric Dumazet 
> > Reported-by: Oleksandr Natalenko 
> > ---
> >  net/ipv4/tcp_output.c |9 +
> >  1 file changed, 5 insertions(+), 4 deletions(-)

> Acked-by: Neal Cardwell 

Acked-by: Soheil Hassas Yeganeh 

Thank you Eric for the nice patch!


Re: [4.15-stable 1/1] uapi/if_ether.h: move __UAPI_DEF_ETHHDR libc define

2018-02-21 Thread David Miller
From: Mark Asselstine 
Date: Wed, 21 Feb 2018 10:13:18 -0500

> Please promote commit da360299b6734135a5f66d7db458dcc7801c826a [uapi/
> if_ether.h: move __UAPI_DEF_ETHHDR libc define] to linux-4.15.y stable. 
> Without this fix users of the uapi headers will run into the issue as 
> described in the commit log. Specifically we have seen this with building 
> ebtables.

Please check my stable patchwork queue, it should be in there.

Thanks.


ss issue on arm not showing UDP listening ports

2018-02-21 Thread jesse_cooper


Hi,

I currently had the follow issues with ss where it was not displaying  
the UDP listening ports.This was on:


Linux 4.9.59-v7+ #1047 SMP Sun Oct 29 12:19:23 GMT 2017 armv7l GNU/Linux

ss -ul
State Recv-Q Send-Q Local Address:Port Peer Address:Port

netstat -ul
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address   Foreign Address State
udp0  0 0.0.0.0:mdns0.0.0.0:*
udp0  0 0.0.0.0:bootpc  0.0.0.0:*
udp0  0 0.0.0.0:55887   0.0.0.0:*
udp0  0 192.168.1.80:ntp0.0.0.0:*
udp0  0 localhost:ntp   0.0.0.0:*
udp0  0 0.0.0.0:ntp 0.0.0.0:*
udp0  0 0.0.0.0:46739   0.0.0.0:*
udp0  0 0.0.0.0:snmp0.0.0.0:*
udp6   0  0 [::]:mdns   [::]:*
udp6   0  0 [::]:40880  [::]:*
udp6   0  0 fe80::bcdd:e603:7d8:ntp [::]:*
udp6   0  0 localhost:ntp   [::]:*
udp6   0  0 [::]:ntp[::]:*

I hope I have found the proper way to report this issue.



Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-21 Thread David Miller
From: Roman Kapl 
Date: Wed, 21 Feb 2018 09:49:31 +0100

> Should I send a v3 or something, David?

No, it's too late to fix it, sorry.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck
On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>> stuck with this ugly thing forever...
>>>
>>> Could you at least make some common code that is shared in between
>>> netvsc and virtio_net so this is handled in exacly the same way in both?
>>
>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>Let's not make a far, far more commonly deployed and important driver
>>(virtio) bug-compatible with netvsc.
>
> Yeah. netvsc solution is a dangerous precedent here and in my opinition
> it was a huge mistake to merge it. I personally would vote to unmerge it
> and make the solution based on team/bond.
>
>
>>
>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>user space.  I think it may very well get done in next versions of NM,
>>but isn't done yet.  Stephen also raised the point that not everybody is
>>using NM.
>
> Can be done in NM, networkd or other network management tools.
> Even easier to do this in teamd and let them all benefit.
>
> Actually, I took a stab to implement this in teamd. Took me like an hour
> and half.
>
> You can just run teamd with config option "kidnap" like this:
> # teamd/teamd -c '{"kidnap": true }'
>
> Whenever teamd sees another netdev to appear with the same mac as his,
> or whenever teamd sees another netdev to change mac to his,
> it enslaves it.
>
> Here's the patch (quick and dirty):
>
> Subject: [patch teamd] teamd: introduce kidnap feature
>
> Signed-off-by: Jiri Pirko 

So this doesn't really address the original problem we were trying to
solve. You asked earlier why the netdev name mattered and it mostly
has to do with configuration. Specifically what our patch is
attempting to resolve is the issue of how to allow a cloud provider to
upgrade their customer to SR-IOV support and live migration without
requiring them to reconfigure their guest. So the general idea with
our patch is to take a VM that is running with virtio_net only and
allow it to instead spawn a virtio_bypass master using the same netdev
name as the original virtio, and then have the virtio_net and VF come
up and be enslaved by the bypass interface. Doing it this way we can
allow for multi-vendor SR-IOV live migration support using a guest
that was originally configured for virtio only.

The problem with your solution is we already have teaming and bonding
as you said. There is already a write-up from Red Hat on how to do it
(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
That is all well and good as long as you are willing to keep around
two VM images, one for virtio, and one for SR-IOV with live migration.
The problem is nobody wants to do that. What they want is to maintain
one guest image and if they decide to upgrade to SR-IOV they still
want their live migration and they don't want to have to reconfigure
the guest.

That said it does seem to make the existing Red Hat solution easier to
manage since you wouldn't be guessing at ifname so I have provided
some feedback below.

> ---
>  include/team.h |  7 +++
>  libteam/ifinfo.c   | 20 
>  teamd/teamd.c  | 17 +
>  teamd/teamd.h  |  5 +
>  teamd/teamd_events.c   | 17 +
>  teamd/teamd_ifinfo_watch.c |  9 +
>  teamd/teamd_per_port.c |  7 ++-
>  7 files changed, 81 insertions(+), 1 deletion(-)
>
> diff --git a/include/team.h b/include/team.h
> index 9ae517d..b0c19c8 100644
> --- a/include/team.h
> +++ b/include/team.h
> @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct 
> team_handle *th,
>  #define team_for_each_ifinfo(ifinfo, th)   \
> for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;   \
>  ifinfo = team_get_next_ifinfo(th, ifinfo))
> +
> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
> + struct team_ifinfo *ifinfo);
> +#define team_for_each_unlinked_ifinfo(ifinfo, th)  \
> +   for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;  \
> +ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
> +
>  /* ifinfo getters */
>  bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
>  uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
> diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
> index 5c32a9c..8f9548e 100644
> --- a/libteam/ifinfo.c
> +++ b/libteam/ifinfo.c
> @@ -494,6 +494,26 @@ struct team_ifinfo 

Re: [PATCH iproute2-next] rdma: Add batch command support

2018-02-21 Thread David Ahern
On 2/21/18 5:38 AM, Leon Romanovsky wrote:
> @@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
>   { 0 }
>   };
> 
> + rd->argc = argc;
> + rd->argv = argv;
> +
>   return rd_exec_cmd(rd, cmds, "object");
>  }
> 
> -static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
> +static int rd_batch(struct rd *rd, const char *name, bool force)
> +{
> + char *line = NULL;
> + size_t len = 0;
> + int ret = 0;
> +
> + if (name && strcmp(name, "-") != 0) {
> + if (!freopen(name, "r", stdin)) {
> + pr_err("Cannot open file \"%s\" for reading: %s\n",
> +name, strerror(errno));
> + return errno;
> + }
> + }
> +
> + cmdlineno = 0;
> + while (getcmdline(, , stdin) != -1) {
> + char *largv[512];
> + int largc;
> +
> + largc = makeargs(line, largv, 100);

you have largv[512] declared but passing a max of 100. I realize other
batch commands have it hardcoded, but ARRAY_SIZE is better.


Re: [PATCH RFC net-next 5/7] net/ipv6: Add support for path selection using hash of 5-tuple

2018-02-21 Thread Ido Schimmel
On Mon, Feb 12, 2018 at 04:06:00PM -0800, David Ahern wrote:
> Some operators prefer IPv6 path selection to use a standard 5-tuple
> hash rather than just an L3 hash with the flow the label. To that end
> add support to IPv6 for multipath hash policy similar to bf4e0a3db97eb
> ("net: ipv4: add support for ECMP hash policy choice"). The default
> is still L3 which covers source and destination addresses along with
> flow label and IPv6 protocol.
> 
> Signed-off-by: David Ahern 

[...]

> @@ -1819,20 +1820,53 @@ static void ip6_multipath_l3_keys(const struct 
> sk_buff *skb,
>  }
>  
>  /* if skb is set it will be used and fl6 can be NULL */
> -u32 rt6_multipath_hash(const struct flowi6 *fl6, const struct sk_buff *skb)
> +u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6,
> +const struct sk_buff *skb)
>  {
>   struct flow_keys hash_keys;
>   u32 mhash;
>  
> - memset(_keys, 0, sizeof(hash_keys));
> - hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> - if (skb) {
> - ip6_multipath_l3_keys(skb, _keys);
> - } else {
> - hash_keys.addrs.v6addrs.src = fl6->saddr;
> - hash_keys.addrs.v6addrs.dst = fl6->daddr;
> - hash_keys.tags.flow_label = (__force u32)fl6->flowlabel;
> - hash_keys.basic.ip_proto = fl6->flowi6_proto;
> + switch (net->ipv6.sysctl.multipath_hash_policy) {
> + case 0:
> + memset(_keys, 0, sizeof(hash_keys));
> + hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> + if (skb) {
> + ip6_multipath_l3_keys(skb, _keys);
> + } else {
> + hash_keys.addrs.v6addrs.src = fl6->saddr;
> + hash_keys.addrs.v6addrs.dst = fl6->daddr;
> + hash_keys.tags.flow_label = (__force u32)fl6->flowlabel;
> + hash_keys.basic.ip_proto = fl6->flowi6_proto;
> + }
> + break;
> + case 1:
> + if (skb) {
> + unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
> + struct flow_keys keys;
> +
> + /* short-circuit if we already have L4 hash present */
> + if (skb->l4_hash)
> + return skb_get_hash_raw(skb) >> 1;
> +
> + memset(_keys, 0, sizeof(hash_keys));
> +
> + skb_flow_dissect_flow_keys(skb, , flag);
> +
> + hash_keys.addrs.v6addrs.src = keys.addrs.v6addrs.src;
> + hash_keys.addrs.v6addrs.dst = keys.addrs.v6addrs.dst;

Shouldn't you add:

hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;

?

Otherwise flow_hash_from_keys() will not consistentify the ports and the
addresses and it will also not use the addresses for the hash
computation.

It's missing from fib_multipath_hash() as well, so I might be missing
something.

> + hash_keys.ports.src = keys.ports.src;
> + hash_keys.ports.dst = keys.ports.dst;
> + hash_keys.tags.flow_label = keys.tags.flow_label;

Why are you using the flow label?

> + hash_keys.basic.ip_proto = keys.basic.ip_proto;
> + } else {
> + memset(_keys, 0, sizeof(hash_keys));
> + hash_keys.control.addr_type = 
> FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> + hash_keys.addrs.v6addrs.src = fl6->saddr;
> + hash_keys.addrs.v6addrs.dst = fl6->daddr;
> + hash_keys.ports.src = fl6->fl6_sport;
> + hash_keys.ports.dst = fl6->fl6_dport;
> + hash_keys.basic.ip_proto = fl6->flowi6_proto;
> + }
>   }
>   mhash = flow_hash_from_keys(_keys);
>  
> @@ -1858,7 +1892,7 @@ void ip6_route_input(struct sk_buff *skb)
>   if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX))
>   fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id;
>   if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6))
> - fl6.mp_hash = rt6_multipath_hash(, skb);
> + fl6.mp_hash = rt6_multipath_hash(net, , skb);
>   skb_dst_drop(skb);
>   skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, , flags));
>  }


Re: [PATCH v2 iproute2-next 0/5] bridge: json and color support

2018-02-21 Thread David Ahern
On 2/20/18 12:24 PM, Stephen Hemminger wrote:
> From: Stephen Hemminger 
> 
> This set of patches adds color and full JSON support to bridge command.
> 
> The output format for bridge link command changes so that 
>   $ bridge link show
> and
>   $ ip link show
> use same basic format.
> 
> The "-c" flag to bridge changes from shortened form of "-compressvlan"
> to shortened form of "-color".  Once again this is so that ip
> and bridge command take similar options.
> 
> Lastly the JSON output format changes slightly but this
> could not impact any real user, because in several cases
> the current format was invalid JSON!
> 
> v2
>   rebase to updated iproute2-next
>   use common pretty flag
>   use common print_name_and_link
> 
> Stephen Hemminger (5):
>   bridge: implement json pretty print flag
>   bridge: colorize output and use JSON print library
>   bridge: add json support for link command
>   bridge: update man page for new color and json changes
>   ip: always print interface name in color
> 
>  bridge/br_common.h |   4 +-
>  bridge/bridge.c|  16 ++-
>  bridge/fdb.c   | 279 -
>  bridge/link.c  | 245 +++-
>  bridge/mdb.c   | 354 
> ++---
>  bridge/vlan.c  | 279 -
>  include/utils.h|   2 +-
>  ip/ipaddress.c |   4 +-
>  lib/utils.c|   4 +-
>  man/man8/bridge.8  |  15 ++-
>  10 files changed, 521 insertions(+), 681 deletions(-)
> 

series applied to iproute2-next.



  1   2   >