Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device

2018-04-17 Thread Yanjun Zhu



On 2018/4/17 23:37, Tariq Toukan wrote:



On 16/04/2018 4:02 AM, Zhu Yanjun wrote:

When a faulty cable is used or the HCA firmware hits an error, the HCA
device goes offline. When the driver accesses this offline device, the
following call trace appears.

"
...
   [] dump_stack+0x63/0x81
   [] panic+0xcc/0x21b
   [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
   [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
   [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
   [] __mlx4_cmd+0xb0/0x160 [mlx4_core]
   [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
   [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"
In the above call trace, the function mlx4_cmd_poll calls the function
mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post
returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls
mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out.

This is not reasonable. Since HCA device is offline when it is being
accessed, it should not be reset again.

In this patch, since HCA is offline, the function mlx4_cmd_post returns
an error -EINVAL. Per -EINVAL, the function mlx4_cmd_poll directly returns
instead of resetting HCA.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Suggested-by: Håkon Bugge 
Signed-off-by: Zhu Yanjun 
---
  drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c

index 6a9086d..f1c8c42 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,

   * Device is going through error recovery
   * and cannot accept commands.
   */
+    mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+    ret = -EINVAL;
  goto out;
  }
  @@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,

  }
    out_reset:
+    if (err == -EINVAL)
+    goto out;
+


See below.


  if (err)
  err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
  out:
@@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,

  *out_param = context->out_param;
    out_reset:
+    if (err == -EINVAL)
+    goto out;
+
  if (err)


Instead, just do here: if (err && err != -EINVAL)


  err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
  out:



I am not sure this does not mistakenly cover other cases that already 
exist and have (err == -EINVAL).


For example, this line is hard to predict:
err = mlx4_status_to_errno
and later on, we might get into
if (mlx4_closing_cmd_fatal_error(op, stat))
which leads to out_reset.

Thanks a lot.
Sure. I agree with you that "err = mlx4_status_to_errno" and "if
(mlx4_closing_cmd_fatal_error(op, stat))" can also result in "err = -EINVAL".

That case would mistakenly go to out instead of resetting the HCA device.

I will make a new patch to avoid the above error.

Zhu Yanjun


We must have a deeper look at this.
But a better option is, change the error indication to uniquely 
indicate "already in error recovery".
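
The idea of a unique "already in error recovery" indication can be sketched
in plain user-space C. This is only a model under stated assumptions, not the
mlx4 code: struct fake_dev, the fake_cmd_* helpers and the value
CMD_ALREADY_IN_RECOVERY are all hypothetical names invented for illustration.
The point it shows is that a private code which no status-to-errno mapping
can produce lets the poll path skip the reset without also swallowing
genuine -EINVAL results.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real struct mlx4_dev. */
struct fake_dev {
	bool in_error_recovery;	/* device already offline / being recovered */
	int  resets;		/* how many times the reset flow ran */
};

/* Hypothetical private code, picked far outside the errno range so that it
 * cannot collide with anything a status-to-errno mapping might return.
 */
#define CMD_ALREADY_IN_RECOVERY	(-100000)

/* Models the post step: refuses to touch a device that is in recovery. */
static int fake_cmd_post(struct fake_dev *dev, int fw_err)
{
	assert(dev);
	if (dev->in_error_recovery)
		return CMD_ALREADY_IN_RECOVERY;
	return fw_err;	/* 0 on success, negative errno on failure */
}

/* Models the reset flow: counts the reset and reports -EIO to the caller. */
static int fake_cmd_reset_flow(struct fake_dev *dev)
{
	dev->resets++;
	dev->in_error_recovery = true;
	return -EIO;
}

/* Models the poll step: reset on real errors, skip when already recovering. */
static int fake_cmd_poll(struct fake_dev *dev, int fw_err)
{
	int err = fake_cmd_post(dev, fw_err);

	if (err == CMD_ALREADY_IN_RECOVERY)
		return -EIO;	/* report failure without resetting again */
	if (err)
		err = fake_cmd_reset_flow(dev);
	return err;
}
```

In this model a command that fails with -EINVAL on a healthy device still
triggers the reset flow once, while later commands on the same device fail
without a second reset.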






Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)

2018-04-17 Thread Finn Thain
On Wed, 18 Apr 2018, Michael Schmitz wrote:

> All,
> 
> just noticed belatedly that the Makefile hunk of patch 9 no
> longer applies cleanly in 4.17-rc1, sorry. My series was based on 4.16.
> I'll resend that one, OK?
> 

It might end up simpler to resend the whole series --

> Cheers,
> 
>   Michael
> 
> 
> > 1/9 net: phy: new Asix Electronics PHY driver
> > 2/9 net: ax88796: Fix MAC address reading
> > 3/9 net: ax88796: Attach MII bus only when open
> > 4/9 net: ax88796: Do not free IRQ in ax_remove() (already freed in 
> > ax_close()).
> > 5/9 net: ax88796: Add block_input/output hooks to ax_plat_data

I found that git am rejects this one, though 'patch' applies it with fuzz.

> > 6/9 net: ax88796: add interrupt status callback to platform data
> > 7/9 net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as 
> > shareable
> > 8/9 net: ax88796: release platform device drvdata on probe error and module 
> > remove
> > 9/9 net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)

git am rejected this one and also complained about trailing whitespace.

I'd rebase on v4.17-rc1 and also run checkpatch over the results.

-- 

> >
> >  drivers/net/ethernet/8390/Kconfig|   17 ++-
> >  drivers/net/ethernet/8390/Makefile   |1 +
> >  drivers/net/ethernet/8390/ax88796.c  |  228 
> >  drivers/net/ethernet/8390/xsurf100.c |  381 
> > ++
> >  drivers/net/phy/Kconfig  |6 +
> >  drivers/net/phy/Makefile |1 +
> >  drivers/net/phy/asix.c   |   65 ++
> >  drivers/net/phy/phy_device.c |3 +-
> >  include/linux/phy.h  |1 +
> >  include/net/ax88796.h|   14 ++
> >  10 files changed, 621 insertions(+), 96 deletions(-)
> >
> > Cheers,
> >
> >   Michael


Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)

2018-04-17 Thread Michael Schmitz
All,

just noticed belatedly that the Makefile hunk of patch 9 no
longer applies cleanly in 4.17-rc1, sorry. My series was based on 4.16.
I'll resend that one, OK?

Cheers,

  Michael


On Wed, Apr 18, 2018 at 4:26 PM, Michael Schmitz  wrote:
> This patch series adds support for the Individual Computers X-Surf 100
> network card for m68k Amiga, a network adapter based on the AX88796 chip set.
>
> The driver was originally written for kernel version 3.19 by Michael Karcher
> (see CC:), and adapted to 4.16 for submission to netdev by me. Questions
> regarding motivation for some of the changes are probably best directed at
> Michael Karcher.
>
> The driver has been tested by Adrian  who will
> send his Tested-by tag separately.
>
> A few changes to the ax88796 driver were required:
> - to read the MAC address, some setup of the ax88796 chip must be done,
> - attach to the MII bus only on device open to allow module unloading,
> - allow to supersede ax_block_input/ax_block_output by card-specific
>   optimized code,
> - use an optional interrupt status callback to allow easier sharing of the
>   card interrupt,
> - set IRQF_SHARED if platform IRQ resource is marked shareable,
>
> The Asix Electronics PHY used on the X-Surf 100 is buggy, and causes the
> software reset to hang if the previous command sent to the PHY was also
> a soft reset. This bug requires addition of a PHY driver for Asix PHYs
> to provide a fixed .soft_reset function, included in this series.
>
> Some additional cleanup:
> - do not attempt to free IRQ in ax_remove (complements 82533ad9a1c),
> - clear platform drvdata on probe fail and module remove.
>
> Changes since v1:
>
> Raised in review by Andrew Lunn:
> - move MII code around to avoid need for forward declaration
> - combine patches 2 and 7 to add cleanup in error path
>
> Changes since v2:
>
> - corrected authorship attribution to Michael Karcher
>
> Suggested by Geert Uytterhoeven:
> - use ei_local->reset_8390() instead of duplicating ax_reset_8390()
> - use %pR to format struct resource pointers
> - assign pdev and xs100 pointers in declaration
> - don't split error messages
> - change Kconfig logic to only require XSURF100 set on Amiga
>
> Suggested by Andrew Lunn:
> - add COMPILE_TEST to ax88796 Kconfig options
> - use new Asix PHY driver for X-Surf 100
>
> Suggested by Andrew Lunn/Finn Thain:
> - declare struct sk_buff in ax88796.h
> - correct whitespace error in ax88796.h
>
> This series' patches, in order:
>
> 1/9 net: phy: new Asix Electronics PHY driver
> 2/9 net: ax88796: Fix MAC address reading
> 3/9 net: ax88796: Attach MII bus only when open
> 4/9 net: ax88796: Do not free IRQ in ax_remove() (already freed in 
> ax_close()).
> 5/9 net: ax88796: Add block_input/output hooks to ax_plat_data
> 6/9 net: ax88796: add interrupt status callback to platform data
> 7/9 net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as 
> shareable
> 8/9 net: ax88796: release platform device drvdata on probe error and module 
> remove
> 9/9 net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
>
>  drivers/net/ethernet/8390/Kconfig|   17 ++-
>  drivers/net/ethernet/8390/Makefile   |1 +
>  drivers/net/ethernet/8390/ax88796.c  |  228 
>  drivers/net/ethernet/8390/xsurf100.c |  381 
> ++
>  drivers/net/phy/Kconfig  |6 +
>  drivers/net/phy/Makefile |1 +
>  drivers/net/phy/asix.c   |   65 ++
>  drivers/net/phy/phy_device.c |3 +-
>  include/linux/phy.h  |1 +
>  include/net/ax88796.h|   14 ++
>  10 files changed, 621 insertions(+), 96 deletions(-)
>
> Cheers,
>
>   Michael


Re: [Regression] net/phy/micrel.c v4.9.94

2018-04-17 Thread Chris Ruehl

On Wednesday, April 18, 2018 09:34 AM, Chris Ruehl wrote:

Hello,

I'd like to give you a heads-up about a regression introduced in 4.9.94.
A commit in that release leads to a kernel oops and makes the network unusable
on my customized MX6DL board.

Race condition: resume is called on startup while the PHY is not yet initialized.

[    7.313366] Unable to handle kernel NULL pointer dereference at virtual 
address 0008

[    7.321602] pgd = ecfc

[    7.324950] [0008] *pgd=8e901831

[    7.328652] Internal error: Oops: 17 [#1] PREEMPT SMP ARM

[    7.334061] Modules linked in:

[    7.337146] CPU: 0 PID: 269 Comm: ip Not tainted 4.9.94 #11

[    7.342725] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)

[    7.349259] task: ece59900 task.stack: ec9ea000

[    7.353809] PC is at kszphy_config_reset+0x14/0x148

[    7.358703] LR is at kszphy_resume+0x1c/0x6c

[    7.362983] pc : []    lr : []    psr: 60030013

[    7.362983] sp : ec9eb918  ip : ec9eb938  fp : ec9eb934

[    7.374467] r10: 0007  r9 :   r8 : ee693c00

[    7.379700] r7 :   r6 :   r5 :   r4 : ee6fc000

[    7.386234] r3 : 0001  r2 :   r1 : 0110  r0 : ee6fc000

[    7.392768] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none

[    7.399911] Control: 10c5387d  Table: 3cfc004a  DAC: 0051

[    7.405663] Process ip (pid: 269, stack limit = 0xec9ea210)

[    7.411244] Stack: (0xec9eb918 to 0xec9ec000)

[    7.415611] b900: ee6fc000 
[    7.423800] b920: ee031000  ec9eb94c ec9eb938 c056a4fc c056a244 
ee6fc000 
[    7.431988] b940: ec9eb97c ec9eb950 c05681e4 c056a4ec 0007 ee6fc000 
ee6fc000 c056ce7c
[    7.440174] b960: c056ce7c ee031000 ee55c818  ec9eb99c ec9eb980 
c05683cc c0568134
[    7.448364] b980: 0007 ec9eba10 ee6fc000 0007 ec9eb9c4 ec9eb9a0 
c0568450 c05683bc
[    7.456550] b9a0: 0007 0005 ee031000 ec9eb9d3 0200 c1508da4 
ec9eba6c ec9eb9c8
[    7.464736] b9c0: c056ce24 c0568410 0005 ee03162c 3201 30383831 
652e3030 72656874
[    7.472921] b9e0: 2d74656e 0031 03e8 00c8 c01732ec c0172adc 
03e8 00c8
[    7.481109] ba00: 024000c0 ee55c000 c150e454 024000c0 38383132 2e303030 
65687465 74656e72
[    7.489296] ba20: 303a312d ee35 ec9eba6c ec9eba38 c0224b50 c0175eb8 
ec9eba6c c056eb44
[    7.497482] ba40: c056bbe0 f0c16000 ee031000 ee55c000 0200 f0c16000 
ee031000 ee55c000
[    7.505667] ba60: ec9ebaa4 ec9eba70 c056eba4 c056cd1c 0001 ee03162c 
ec9ebaa4 ee031000
[    7.513855] ba80:  c09566ec ee031030  ec9ccd10 ecb39900 
ec9ebacc ec9ebaa8
[    7.522043] baa0: c06ad6e0 c056e92c ec9ebacc ee031000 ee031000 0001 
1003 1002
[    7.530229] bac0: ec9ebaf4 ec9ebad0 c06ad99c c06ad63c 1002 ee031000 
ee031148 1002
[    7.538414] bae0:   ec9ebb1c ec9ebaf8 c06ada6c c06ad90c 
1002 
[    7.546601] bb00: ee031000 ec9ebc28  c09566ec ec9ebb94 ec9ebb20 
c06c1034 c06ada58
[    7.554787] bb20: c0c50df8 2e184000 ec9ebb44 ec9ebb38 c0173528 c0173320 
ec9ebbd4 c0e82b6c
[    7.562972] bb40:  ece59dc8 ebb4e9d0 c9eae3f3 ece59900 0003 
ece59900 005e
[    7.571157] bb60: c14e30ec c0d1e51c ece59900  ee031000 ec9ccd00 
 
[    7.579346] bb80: ec9ebb98  ec9ebd04 ec9ebb98 c06c30cc c06c0d68 
ec9ebbc4 
[    7.587531] bba0: c01758bc ecb39900 c09eb3a0 ec9ccd20  ec9ccd10 
0001 ece59900
[    7.595715] bbc0: c01e0e64   0001 ec9ebbfc  
 
[    7.603900] bbe0:    ff00 ec9ebc0c ec9ebc00 
c0173528 c0173320
[    7.612084] bc00: ec9ebc9c ec9ebc10 c01e0e64 c0173520  000e 
ece59900 0096
[    7.620269] bc20: c14e30ec c0d1e51c     
 
[    7.628452] bc40:       
 
[    7.636636] bc60:       
 
[    7.644819] bc80:       
 
[    7.653003] bca0:       
 
[    7.661186] bcc0:       
c06d3870 
[    7.669372] bce0: ec9ccd00 ecb39900 c15226e4   ecb39900 
ec9ebd44 ec9ebd08
[    7.677556] bd00: c06c343c c06c2bdc c0869c2c c0173520 0001  
c06c06e4 
[    7.685741] bd20:  ec9ccd00 c06c32b8 ecb39900 ecb39900  
ec9ebd64 ec9ebd48
[    7.693926] bd40: c06d86cc c06c32c4  ecb39900 0020 ec970400 
ec9ebd7c ec9ebd68
[    7.702110] bd60: c06c06f4 c06d8630 c06c06c4 ee15f400 ec9ebdac ec9ebd80 
c06d802c c06c06d0
[    7.710294] bd80: ec9ebf50 7fff ec970400 ec9ebf48 ec970400  
0020 
[    7.718477] bda0: ec9ebe0c ec9ebdb0 c06d84e8 c06d7ec8 000c ec9ebe48 
000c 
[    7.726661] bdc0: beee97bc 

Re: [PATCH iproute2 net-next] vxlan: fix ttl inherit behavior

2018-04-17 Thread Hangbin Liu
Hi Stephen,

The patch's subject contains fix. But the kernel feature is applied on net-next.
So I'm not sure if iproute2 net-next is suitable. If you are OK with the patch,
please feel free to apply it on the branch which you think is suitable.

Thanks
Hangbin

On 18 April 2018 at 13:05, Hangbin Liu  wrote:
> Like kernel net-next commit 72f6d71e491e6 ("vxlan: add ttl inherit support"),
> vxlan ttl inherit should mean inheriting the inner protocol's ttl value.
>
> But currently when we add vxlan with "ttl inherit", we only set ttl 0,
> which actually means using whatever the default value is instead of
> inheriting the inner protocol's ttl value.
>
> To distinguish ttl inherit from ttl == 0, we add an attribute
> IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is specified, and use "ttl auto"
> to mean "use whatever default value", the same behavior as ttl == 0.
>
> Reported-by: Jianlin Shi 
> Suggested-by: Jiri Benc 
> Signed-off-by: Hangbin Liu 


[PATCH iproute2 net-next] vxlan: fix ttl inherit behavior

2018-04-17 Thread Hangbin Liu
Like kernel net-next commit 72f6d71e491e6 ("vxlan: add ttl inherit support"),
vxlan ttl inherit should mean inheriting the inner protocol's ttl value.

But currently when we add vxlan with "ttl inherit", we only set ttl 0,
which actually means using whatever the default value is instead of
inheriting the inner protocol's ttl value.

To distinguish ttl inherit from ttl == 0, we add an attribute
IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is specified, and use "ttl auto"
to mean "use whatever default value", the same behavior as ttl == 0.

Reported-by: Jianlin Shi 
Suggested-by: Jiri Benc 
Signed-off-by: Hangbin Liu 
---
 include/uapi/linux/if_link.h | 1 +
 ip/iplink_vxlan.c| 8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index dab5246..387f873 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -511,6 +511,7 @@ enum {
IFLA_VXLAN_COLLECT_METADATA,
IFLA_VXLAN_LABEL,
IFLA_VXLAN_GPE,
+   IFLA_VXLAN_TTL_INHERIT,
__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 661eaa7..5804db3 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -165,14 +165,18 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 
NEXT_ARG();
check_duparg(, IFLA_VXLAN_TTL, "ttl", *argv);
-   if (strcmp(*argv, "inherit") != 0) {
+   if (strcmp(*argv, "inherit") == 0) {
+   addattr_l(n, 1024, IFLA_VXLAN_TTL_INHERIT, NULL, 0);
+   } else if (strcmp(*argv, "auto") == 0) {
+   addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
+   } else {
if (get_unsigned(&uval, *argv, 0))
invarg("invalid TTL", *argv);
if (uval > 255)
invarg("TTL must be <= 255", *argv);
ttl = uval;
+   addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
}
-   addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
} else if (!matches(*argv, "tos") ||
   !matches(*argv, "dsfield")) {
__u32 uval;
-- 
2.5.5
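
The three-way handling of the "ttl" argument described above ("inherit",
"auto", or a number up to 255) can be sketched as a small stand-alone parser.
This is an illustration under assumptions, not the iproute2 code: the names
parse_ttl, struct ttl_choice and enum ttl_mode are hypothetical, and where
the patch emits netlink attributes the sketch only records which case was
chosen. A numeric 0 is reported as TTL_AUTO, following the commit message's
statement that "auto" behaves the same as ttl == 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type; the patch itself emits netlink attributes. */
enum ttl_mode { TTL_AUTO, TTL_INHERIT, TTL_FIXED };

struct ttl_choice {
	enum ttl_mode mode;
	unsigned char ttl;	/* meaningful for TTL_FIXED; 0 otherwise */
};

/* Returns 0 on success, -1 if arg is not "inherit", "auto" or 0..255. */
static int parse_ttl(const char *arg, struct ttl_choice *out)
{
	char *end;
	unsigned long val;

	assert(arg && out);
	out->ttl = 0;

	if (strcmp(arg, "inherit") == 0) {
		out->mode = TTL_INHERIT;	/* copy the inner protocol's TTL */
		return 0;
	}
	if (strcmp(arg, "auto") == 0) {
		out->mode = TTL_AUTO;		/* use whatever the default is */
		return 0;
	}

	/* Anything else must be a number; reject signs, blanks and words. */
	if (*arg < '0' || *arg > '9')
		return -1;
	errno = 0;
	val = strtoul(arg, &end, 0);
	if (errno || *end != '\0' || val > 255)
		return -1;

	out->mode = val ? TTL_FIXED : TTL_AUTO;
	out->ttl = (unsigned char)val;
	return 0;
}
```

Keeping "inherit" as its own mode rather than folding it into ttl 0 is the
distinction the patch introduces with IFLA_VXLAN_TTL_INHERIT.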



Re: [PATCH ipsec-next] selftests: add xfrm state-policy-monitor to rtnetlink.sh

2018-04-17 Thread Steffen Klassert
On Thu, Apr 12, 2018 at 03:59:59PM -0700, Shannon Nelson wrote:
> Add a simple set of tests for the IPsec xfrm commands.
> 
> Signed-off-by: Shannon Nelson 

Applied to ipsec-next, thanks Shannon!


Re: [PATCH net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust

2018-04-17 Thread Yafang Shao
On Wed, Apr 18, 2018 at 7:44 AM, Alexei Starovoitov
 wrote:
> On Mon, Apr 16, 2018 at 08:43:31AM -0700, Eric Dumazet wrote:
>>
>>
>> On 04/16/2018 08:33 AM, Yafang Shao wrote:
>> > tcp_rcv_space_adjust is called every time data is copied to user space,
>> > introducing a tcp tracepoint for which could show us when the packet is
>> > copied to user.
>> > This could help us figure out whether there's latency in user process.
>> >
>> > When a tcp packet arrives, tcp_rcv_established() will be called and with
>> > the existed tracepoint tcp_probe we could get the time when this packet
>> > arrives.
>> > Then this packet will be copied to user, and tcp_rcv_space_adjust will
>> > be called and with this new introduced tracepoint we could get the time
>> > when this packet is copied to user.
>> >
>> > arrives time : user process time=> latency caused by user
>> > tcp_probe  tcp_rcv_space_adjust
>> >
>> > Hence in the printk message, sk is printed as a key to connect these two
>> > tracepoints.
>> >
>>
>> socket pointer is not a key.
>>
>> TCP sockets can be reused pretty fast after free.
>>
>> I suggest you go for cookie instead, this is a unique 64bit identifier.
>> ( sock_gen_cookie() for details )
>
> I think it would be even better if the stack would do this sock_gen_cookie()
> on its own in some way that user cannot infer the order.
> In many cases we wanted to use socket cookie, but since it's not inited
> by default it's kinda useless.
> Turning this tracepoint on just to get cookie would be an ugly workaround.
>

Could we init it in sk_alloc() ?
Then in other code paths, for example sock_getsockopt or tracepoints,
we only read the value through a new inline function named
sock_read_cookie().


Thanks
Yafang
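
The proposal above (assign the cookie once at allocation, then only read it
elsewhere) can be modelled in user-space C roughly as follows. This is a
sketch under assumptions, not kernel code: struct fake_sock, fake_sk_alloc()
and fake_sock_read_cookie() are hypothetical stand-ins for struct sock,
sk_alloc() and the suggested sock_read_cookie(). It uses a plain global
counter, so it does not address the concern raised earlier in the thread
that users should not be able to infer allocation order from cookie values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a socket; not the kernel's struct sock. */
struct fake_sock {
	uint64_t cookie;	/* assigned once at allocation, never 0 */
};

/* Global generator; starts at 0 so the first cookie handed out is 1. */
static atomic_uint_fast64_t cookie_gen;

/* Models the allocation path: every new object gets a fresh cookie. */
static void fake_sk_alloc(struct fake_sock *sk)
{
	assert(sk);
	sk->cookie = atomic_fetch_add(&cookie_gen, 1) + 1;
}

/* Models the proposed read-only accessor: no lazy initialisation here. */
static inline uint64_t fake_sock_read_cookie(const struct fake_sock *sk)
{
	return sk->cookie;
}
```

Because the cookie changes whenever the object is allocated again, a reused
memory address no longer makes two different connections look like the same
key, which is the problem with printing the socket pointer.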


Re: [PATCH 10/10] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)

2018-04-17 Thread Michael Schmitz
Hi Geert,

On Wed, Apr 18, 2018 at 1:53 AM, Geert Uytterhoeven
 wrote:
>> --- /dev/null
>> +++ b/drivers/net/ethernet/8390/xsurf100.c
>> @@ -0,0 +1,411 @@
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#define ZORRO_PROD_INDIVIDUAL_COMPUTERS_X_SURF100 \
>> +   ZORRO_ID(INDIVIDUAL_COMPUTERS, 0x64, 0)
>
> Another long define to get rid of? ;-)

I decided to leave it that way - it doesn't stick out quite as badly
as the one in the ESP driver. Give me a yell if you insist.

Cheers,

  Michael


[PATCH v3 1/9] net: phy: new Asix Electronics PHY driver

2018-04-17 Thread Michael Schmitz
The Asix Electronics PHY found on the X-Surf 100 Amiga Zorro network
card by Individual Computers is buggy, and needs the reset bit toggled
as a workaround to make a PHY soft reset succeed.

Add workaround driver just for this special case. Export phy_poll_reset()
from core phy_device driver to avoid code duplication.

Signed-off-by: Michael Schmitz 
---
 drivers/net/phy/Kconfig  |6 
 drivers/net/phy/Makefile |1 +
 drivers/net/phy/asix.c   |   65 ++
 drivers/net/phy/phy_device.c |3 +-
 include/linux/phy.h  |1 +
 5 files changed, 75 insertions(+), 1 deletions(-)
 create mode 100644 drivers/net/phy/asix.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index bdfbabb..f5b484c 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -218,6 +218,12 @@ config AQUANTIA_PHY
---help---
  Currently supports the Aquantia AQ1202, AQ2104, AQR105, AQR405
 
+config ASIX_PHY
+   tristate "Asix PHYs"
+   ---help---
+ Currently supports the Asix Electronics PHY found in the X-Surf 100
+ AX88796 package.
+
 config AT803X_PHY
tristate "AT803X PHYs"
---help---
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 01acbcb..701ca0b 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -45,6 +45,7 @@ obj-y += $(sfp-obj-y) $(sfp-obj-m)
 
 obj-$(CONFIG_AMD_PHY)  += amd.o
 obj-$(CONFIG_AQUANTIA_PHY) += aquantia.o
+obj-$(CONFIG_ASIX_PHY) += asix.o
 obj-$(CONFIG_AT803X_PHY)   += at803x.o
 obj-$(CONFIG_BCM63XX_PHY)  += bcm63xx.o
 obj-$(CONFIG_BCM7XXX_PHY)  += bcm7xxx.o
diff --git a/drivers/net/phy/asix.c b/drivers/net/phy/asix.c
new file mode 100644
index 000..15e8a0e
--- /dev/null
+++ b/drivers/net/phy/asix.c
@@ -0,0 +1,65 @@
+/*
+ * Driver for Asix PHYs
+ *
+ * Author: Michael Schmitz 
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PHY_ID_ASIX 0x003b1841
+
+MODULE_DESCRIPTION("Asix PHY driver");
+MODULE_AUTHOR("Michael Schmitz ");
+MODULE_LICENSE("GPL");
+
+/**
+ * asix_soft_reset - software reset the PHY via BMCR_RESET bit
+ * @phydev: target phy_device struct
+ *
+ * Description: Perform a software PHY reset using the standard
+ * BMCR_RESET bit and poll for the reset bit to be cleared.
+ * Toggle BMCR_RESET bit off to accommodate broken PHY implementations
+ * such as used on the Individual Computers' X-Surf 100 Zorro card.
+ *
+ * Returns: 0 on success, < 0 on failure
+ */
+static int asix_soft_reset(struct phy_device *phydev)
+{
+   int ret;
+
+   /* Asix PHY won't reset unless reset bit toggles */
+   ret = phy_write(phydev, MII_BMCR, 0);
+   if (ret < 0)
+   return ret;
+
+   phy_write(phydev, MII_BMCR, BMCR_RESET);
+
+   return phy_poll_reset(phydev);
+}
+
+static struct phy_driver asix_driver[] = { {
+   .phy_id = PHY_ID_ASIX,
+   .name   = "Asix Electronics",
+   .phy_id_mask= 0xfff0,
+   .features   = PHY_BASIC_FEATURES,
+   .soft_reset = asix_soft_reset,
+} };
+
+module_phy_driver(asix_driver);
+
+static struct mdio_device_id __maybe_unused asix_tbl[] = {
+   { PHY_ID_ASIX, 0xfff0 },
+   { }
+};
+
+MODULE_DEVICE_TABLE(mdio, asix_tbl);
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 777912b..fb8c13b 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -833,7 +833,7 @@ void phy_disconnect(struct phy_device *phydev)
  *   standard phy_init_hw() which will zero all the other bits in the BMCR
  *   and reapply all driver-specific and board-specific fixups.
  */
-static int phy_poll_reset(struct phy_device *phydev)
+int phy_poll_reset(struct phy_device *phydev)
 {
/* Poll until the reset bit clears (50ms per retry == 0.6 sec) */
unsigned int retries = 12;
@@ -854,6 +854,7 @@ static int phy_poll_reset(struct phy_device *phydev)
msleep(1);
return 0;
 }
+EXPORT_SYMBOL(phy_poll_reset);
 
 int phy_init_hw(struct phy_device *phydev)
 {
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 7c4c237..fa0c4fd 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -980,6 +980,7 @@ void phy_attached_print(struct phy_device *phydev, const char *fmt, ...)
 int genphy_resume(struct phy_device *phydev);
 int genphy_loopback(struct phy_device *phydev, bool enable);
 int genphy_soft_reset(struct phy_device *phydev);
+int phy_poll_reset(struct phy_device *phydev);
 static inline int genphy_no_soft_reset(struct 

[PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)

2018-04-17 Thread Michael Schmitz
This patch series adds support for the Individual Computers X-Surf 100
network card for m68k Amiga, a network adapter based on the AX88796 chip set.

The driver was originally written for kernel version 3.19 by Michael Karcher
(see CC:), and adapted to 4.16 for submission to netdev by me. Questions
regarding motivation for some of the changes are probably best directed at
Michael Karcher.

The driver has been tested by Adrian  who will
send his Tested-by tag separately.

A few changes to the ax88796 driver were required:
- to read the MAC address, some setup of the ax88796 chip must be done,
- attach to the MII bus only on device open to allow module unloading,
- allow to supersede ax_block_input/ax_block_output by card-specific
  optimized code,
- use an optional interrupt status callback to allow easier sharing of the
  card interrupt,
- set IRQF_SHARED if platform IRQ resource is marked shareable,

The Asix Electronics PHY used on the X-Surf 100 is buggy, and causes the
software reset to hang if the previous command sent to the PHY was also
a soft reset. This bug requires addition of a PHY driver for Asix PHYs
to provide a fixed .soft_reset function, included in this series.
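
Why clearing the control register first helps can be shown with a toy model
of the quirk. Everything here is an assumption made for illustration:
struct fake_phy and the fake_* helpers are hypothetical, and the model simply
assumes that the PHY only starts a reset when it sees the reset bit go from
0 to 1. The real hardware behaviour is known only from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_BMCR_RESET 0x8000	/* same bit position as the standard BMCR reset bit */

/* Hypothetical model of the quirky PHY; not the real register interface. */
struct fake_phy {
	uint16_t bmcr;		/* last value written to the control register */
	bool     reset_done;	/* set when a reset was actually triggered */
};

/* Models the assumed quirk: a reset starts only on a 0 -> 1 edge of the bit. */
static void fake_phy_write_bmcr(struct fake_phy *phy, uint16_t val)
{
	bool was_set = phy->bmcr & FAKE_BMCR_RESET;
	bool now_set = val & FAKE_BMCR_RESET;

	if (!was_set && now_set)
		phy->reset_done = true;
	phy->bmcr = val;
}

/* Naive reset: just write the reset bit. */
static bool fake_naive_reset(struct fake_phy *phy)
{
	assert(phy);
	phy->reset_done = false;
	fake_phy_write_bmcr(phy, FAKE_BMCR_RESET);
	return phy->reset_done;
}

/* Workaround: clear the register first so the bit is seen to toggle. */
static bool fake_toggle_reset(struct fake_phy *phy)
{
	assert(phy);
	phy->reset_done = false;
	fake_phy_write_bmcr(phy, 0);
	fake_phy_write_bmcr(phy, FAKE_BMCR_RESET);
	return phy->reset_done;
}
```

In this model the naive reset works once and then fails when repeated, while
the toggling variant succeeds every time.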

Some additional cleanup:
- do not attempt to free IRQ in ax_remove (complements 82533ad9a1c),
- clear platform drvdata on probe fail and module remove.

Changes since v1:

Raised in review by Andrew Lunn:
- move MII code around to avoid need for forward declaration
- combine patches 2 and 7 to add cleanup in error path

Changes since v2:

- corrected authorship attribution to Michael Karcher

Suggested by Geert Uytterhoeven:
- use ei_local->reset_8390() instead of duplicating ax_reset_8390()
- use %pR to format struct resource pointers
- assign pdev and xs100 pointers in declaration
- don't split error messages
- change Kconfig logic to only require XSURF100 set on Amiga

Suggested by Andrew Lunn:
- add COMPILE_TEST to ax88796 Kconfig options
- use new Asix PHY driver for X-Surf 100

Suggested by Andrew Lunn/Finn Thain:
- declare struct sk_buff in ax88796.h
- correct whitespace error in ax88796.h

This series' patches, in order:

1/9 net: phy: new Asix Electronics PHY driver
2/9 net: ax88796: Fix MAC address reading
3/9 net: ax88796: Attach MII bus only when open
4/9 net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
5/9 net: ax88796: Add block_input/output hooks to ax_plat_data
6/9 net: ax88796: add interrupt status callback to platform data
7/9 net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as shareable
8/9 net: ax88796: release platform device drvdata on probe error and module remove
9/9 net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)

 drivers/net/ethernet/8390/Kconfig|   17 ++-
 drivers/net/ethernet/8390/Makefile   |1 +
 drivers/net/ethernet/8390/ax88796.c  |  228 
 drivers/net/ethernet/8390/xsurf100.c |  381 ++
 drivers/net/phy/Kconfig  |6 +
 drivers/net/phy/Makefile |1 +
 drivers/net/phy/asix.c   |   65 ++
 drivers/net/phy/phy_device.c |3 +-
 include/linux/phy.h  |1 +
 include/net/ax88796.h|   14 ++
 10 files changed, 621 insertions(+), 96 deletions(-)

Cheers,

  Michael


[PATCH v3 9/9] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)

2018-04-17 Thread Michael Schmitz
From: Michael Karcher 

Add platform device driver to populate the ax88796 platform data from
information provided by the XSurf100 zorro device driver. The ax88796
module will be loaded through this module's probe function.

Signed-off-by: Michael Karcher 
Signed-off-by: Michael Schmitz 

---

Changes in v3:
Suggested by Geert Uytterhoeven:
- use ei_local->reset_8390() instead of duplicating ax_reset_8390()
- use %pR to format struct resource pointers
- assign pdev and xs100 pointers in declaration
- don't split error messages
- change Kconfig logic to only require XSURF100 set on Amiga

Suggested by Andrew Lunn:
- add COMPILE_TEST to ax88796 Kconfig options
- use new Asix PHY driver for X-Surf 100
---
 drivers/net/ethernet/8390/Kconfig|   17 ++-
 drivers/net/ethernet/8390/Makefile   |1 +
 drivers/net/ethernet/8390/xsurf100.c |  381 ++
 3 files changed, 397 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/8390/xsurf100.c

diff --git a/drivers/net/ethernet/8390/Kconfig b/drivers/net/ethernet/8390/Kconfig
index fdc6734..607dc00 100644
--- a/drivers/net/ethernet/8390/Kconfig
+++ b/drivers/net/ethernet/8390/Kconfig
@@ -29,8 +29,8 @@ config PCMCIA_AXNET
  called axnet_cs.  If unsure, say N.
 
 config AX88796
-   tristate "ASIX AX88796 NE2000 clone support"
-   depends on (ARM || MIPS || SUPERH)
+   tristate "ASIX AX88796 NE2000 clone support" if !ZORRO
+   depends on (ARM || MIPS || SUPERH || ZORRO || COMPILE_TEST)
select CRC32
select PHYLIB
select MDIO_BITBANG
@@ -45,6 +45,19 @@ config AX88796_93CX6
---help---
  Select this if your platform comes with an external 93CX6 eeprom.
 
+config XSURF100
+   tristate "Amiga XSurf 100 AX88796/NE2000 clone support"
+   depends on ZORRO
+   select AX88796
+   select ASIX_PHY
+   ---help---
+ This driver is for the Individual Computers X-Surf 100 Ethernet
+ card (based on the Asix AX88796 chip). If you have such a card,
+ say Y. Otherwise, say N.
+
+ To compile this driver as a module, choose M here: the module
+ will be called xsurf100.
+
 config HYDRA
tristate "Hydra support"
depends on ZORRO
diff --git a/drivers/net/ethernet/8390/Makefile b/drivers/net/ethernet/8390/Makefile
index f975c2f..3715f8d 100644
--- a/drivers/net/ethernet/8390/Makefile
+++ b/drivers/net/ethernet/8390/Makefile
@@ -16,4 +16,5 @@ obj-$(CONFIG_PCMCIA_PCNET) += pcnet_cs.o 8390.o
 obj-$(CONFIG_STNIC) += stnic.o 8390.o
 obj-$(CONFIG_ULTRA) += smc-ultra.o 8390.o
 obj-$(CONFIG_WD80x3) += wd.o 8390.o
+obj-$(CONFIG_XSURF100) += xsurf100.o
 obj-$(CONFIG_ZORRO8390) += zorro8390.o 8390.o
diff --git a/drivers/net/ethernet/8390/xsurf100.c b/drivers/net/ethernet/8390/xsurf100.c
new file mode 100644
index 000..7ab5ca0
--- /dev/null
+++ b/drivers/net/ethernet/8390/xsurf100.c
@@ -0,0 +1,381 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define ZORRO_PROD_INDIVIDUAL_COMPUTERS_X_SURF100 \
+   ZORRO_ID(INDIVIDUAL_COMPUTERS, 0x64, 0)
+
+#define XS100_IRQSTATUS_BASE 0x40
+#define XS100_8390_BASE 0x800
+
+/* Longword-access area. Translated to 2 16-bit access cycles by the
+ * X-Surf 100 FPGA
+ */
+#define XS100_8390_DATA32_BASE 0x8000
+#define XS100_8390_DATA32_SIZE 0x2000
+/* Sub-Areas for fast data register access; addresses relative to area begin */
+#define XS100_8390_DATA_READ32_BASE 0x0880
+#define XS100_8390_DATA_WRITE32_BASE 0x0C80
+#define XS100_8390_DATA_AREA_SIZE 0x80
+
+#define __NS8390_init ax_NS8390_init
+
+/* force unsigned long back to 'void __iomem *' */
+#define ax_convert_addr(_a) ((void __force __iomem *)(_a))
+
+#define ei_inb(_a) z_readb(ax_convert_addr(_a))
+#define ei_outb(_v, _a) z_writeb(_v, ax_convert_addr(_a))
+
+#define ei_inw(_a) z_readw(ax_convert_addr(_a))
+#define ei_outw(_v, _a) z_writew(_v, ax_convert_addr(_a))
+
+#define ei_inb_p(_a) ei_inb(_a)
+#define ei_outb_p(_v, _a) ei_outb(_v, _a)
+
+/* define EI_SHIFT() to take into account our register offsets */
+#define EI_SHIFT(x) (ei_local->reg_offset[(x)])
+
+/* Ensure we have our RCR base value */
+#define AX88796_PLATFORM
+
+static unsigned char version[] =
+   "ax88796.c: Copyright 2005,2007 Simtec Electronics\n";
+
+#include "lib8390.c"
+
+/* from ne.c */
+#define NE_CMD EI_SHIFT(0x00)
+#define NE_RESET   EI_SHIFT(0x1f)
+#define NE_DATAPORT EI_SHIFT(0x10)
+
+struct xsurf100_ax_plat_data {
+   struct ax_plat_data ax;
+   void __iomem *base_regs;
+   void __iomem *data_area;
+};
+
+static int is_xsurf100_network_irq(struct platform_device *pdev)
+{
+   struct xsurf100_ax_plat_data *xs100 = dev_get_platdata(&pdev->dev);
+
+   return (readw(xs100->base_regs + XS100_IRQSTATUS_BASE) & 0x) != 0;
+}
+
+/* These functions guarantee that the iomem is accessed 

[PATCH v3 4/9] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).

2018-04-17 Thread Michael Schmitz
From: Michael Karcher 

This complements the fix in 82533ad9a1c ("net: ethernet: ax88796:
don't call free_irq without request_irq first") that removed the
free_irq call in the error path of probe, to also not call free_irq
when remove is called to revert the effects of probe.

Fixes: 82533ad9a1c (net: ethernet: ax88796: don't call free_irq without request_irq first)
Signed-off-by: Michael Karcher 
Signed-off-by: Michael Schmitz 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/net/ethernet/8390/ax88796.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/8390/ax88796.c b/drivers/net/ethernet/8390/ax88796.c
index 83e59ae..ecf104c 100644
--- a/drivers/net/ethernet/8390/ax88796.c
+++ b/drivers/net/ethernet/8390/ax88796.c
@@ -793,7 +793,6 @@ static int ax_remove(struct platform_device *pdev)
struct resource *mem;
 
unregister_netdev(dev);
-   free_irq(dev->irq, dev);
 
iounmap(ei_local->mem);
mem = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-- 
1.7.0.4



[PATCH v3 5/9] net: ax88796: Add block_input/output hooks to ax_plat_data

2018-04-17 Thread Michael Schmitz
From: Michael Karcher 

Add platform specific hooks for block transfer reads/writes of packet
buffer data, superseding the default provided ax_block_input/output.
Currently used for m68k Amiga XSurf100.

Signed-off-by: Michael Karcher 
Signed-off-by: Michael Schmitz 

---

Changes in v3:

Suggested by Andrew Lunn/Finn Thain:
- declare struct sk_buff in ax88796.h
- correct whitespace error
---
 drivers/net/ethernet/8390/ax88796.c |   10 --
 include/net/ax88796.h   |9 +
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/8390/ax88796.c b/drivers/net/ethernet/8390/ax88796.c
index ecf104c..29cde38 100644
--- a/drivers/net/ethernet/8390/ax88796.c
+++ b/drivers/net/ethernet/8390/ax88796.c
@@ -760,8 +760,14 @@ static int ax_init_dev(struct net_device *dev)
 #endif
 
    ei_local->reset_8390 = &ax_reset_8390;
-   ei_local->block_input = &ax_block_input;
-   ei_local->block_output = &ax_block_output;
+   if (ax->plat->block_input)
+       ei_local->block_input = ax->plat->block_input;
+   else
+       ei_local->block_input = &ax_block_input;
+   if (ax->plat->block_output)
+       ei_local->block_output = ax->plat->block_output;
+   else
+       ei_local->block_output = &ax_block_output;
    ei_local->get_8390_hdr = &ax_get_8390_hdr;
    ei_local->priv = 0;
    ei_local->msg_enable = ax_msg_enable;
diff --git a/include/net/ax88796.h b/include/net/ax88796.h
index b9a3bec..363b0ca 100644
--- a/include/net/ax88796.h
+++ b/include/net/ax88796.h
@@ -12,6 +12,9 @@
 #ifndef __NET_AX88796_PLAT_H
 #define __NET_AX88796_PLAT_H
 
+struct sk_buff;
+struct net_device;
+
 #define AXFLG_HAS_EEPROM   (1<<0)
 #define AXFLG_MAC_FROMDEV  (1<<1)  /* device already has MAC */
 #define AXFLG_HAS_93CX6    (1<<2)  /* use eeprom_93cx6 driver */
@@ -26,6 +29,12 @@ struct ax_plat_data {
u32 *reg_offsets;   /* register offsets */
u8  *mac_addr;  /* MAC addr (only used when
   AXFLG_MAC_FROMPLATFORM is used */
+
+   /* uses default ax88796 buffer if set to NULL */
+   void (*block_output)(struct net_device *dev, int count,
+   const unsigned char *buf, int star_page);
+   void (*block_input)(struct net_device *dev, int count,
+   struct sk_buff *skb, int ring_offset);
 };
 
 #endif /* __NET_AX88796_PLAT_H */
-- 
1.7.0.4
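
The "NULL means use the default" convention from this patch can be sketched in plain userspace C. This is only an illustration under stated assumptions: `plat_hooks`, `default_block_input`, `wide_block_input` and `pick_block_input` are hypothetical names, not the driver's types, and only the selection logic mirrors the hunk in ax_init_dev().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook type: copy count bytes from the card buffer to dst. */
typedef void (*block_input_fn)(unsigned char *dst, const unsigned char *src,
			       size_t count);

/* Hypothetical stand-in for the platform data; NULL selects the default. */
struct plat_hooks {
	block_input_fn block_input;
};

/* Default path (stands in for ax_block_input): plain byte copy. */
static void default_block_input(unsigned char *dst, const unsigned char *src,
				size_t count)
{
	memcpy(dst, src, count);
}

/* Example platform hook: copies four bytes at a time, then the tail. */
static void wide_block_input(unsigned char *dst, const unsigned char *src,
			     size_t count)
{
	size_t i = 0;

	for (; i + 4 <= count; i += 4)
		memcpy(dst + i, src + i, 4);
	for (; i < count; i++)
		dst[i] = src[i];
}

/* Same decision as the patch: prefer the platform hook when it is set. */
static block_input_fn pick_block_input(const struct plat_hooks *plat)
{
	if (plat->block_input)
		return plat->block_input;
	return default_block_input;
}
```

Existing platforms that leave the new members zero-initialised keep the old behaviour, which is why no other board file needs to change.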



[PATCH v3 6/9] net: ax88796: add interrupt status callback to platform data

2018-04-17 Thread Michael Schmitz
From: Michael Karcher 

To be able to tell the ax88796 driver whether it is sensible to enter
the 8390 interrupt handler, an "is this interrupt caused by the 88796"
callback has been added to the ax_plat_data structure (with NULL being
compatible to the previous behaviour).

Signed-off-by: Michael Karcher 
Signed-off-by: Michael Schmitz 
---
 drivers/net/ethernet/8390/ax88796.c |   23 +--
 include/net/ax88796.h   |5 +
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/8390/ax88796.c b/drivers/net/ethernet/8390/ax88796.c
index 29cde38..c799441 100644
--- a/drivers/net/ethernet/8390/ax88796.c
+++ b/drivers/net/ethernet/8390/ax88796.c
@@ -165,6 +165,21 @@ static void ax_reset_8390(struct net_device *dev)
ei_outb(ENISR_RESET, addr + EN0_ISR);   /* Ack intr. */
 }
 
+/* Wrapper for __ei_interrupt for platforms that have a platform-specific
+ * way to find out whether the interrupt request might be caused by
+ * the ax88796 chip.
+ */
+static irqreturn_t ax_ei_interrupt_filtered(int irq, void *dev_id)
+{
+   struct net_device *dev = dev_id;
+   struct ax_device *ax = to_ax_dev(dev);
+   struct platform_device *pdev = to_platform_device(dev->dev.parent);
+
+   if (!ax->plat->check_irq(pdev))
+   return IRQ_NONE;
+
+   return ax_ei_interrupt(irq, dev_id);
+}
 
 static void ax_get_8390_hdr(struct net_device *dev, struct e8390_pkt_hdr *hdr,
int ring_page)
@@ -484,8 +499,12 @@ static int ax_open(struct net_device *dev)
if (ret)
goto failed_mii;
 
-   ret = request_irq(dev->irq, ax_ei_interrupt, ax->irqflags,
- dev->name, dev);
+   if (ax->plat->check_irq)
+   ret = request_irq(dev->irq, ax_ei_interrupt_filtered,
+ ax->irqflags, dev->name, dev);
+   else
+   ret = request_irq(dev->irq, ax_ei_interrupt, ax->irqflags,
+ dev->name, dev);
if (ret)
goto failed_request_irq;
 
diff --git a/include/net/ax88796.h b/include/net/ax88796.h
index 363b0ca..84b3785 100644
--- a/include/net/ax88796.h
+++ b/include/net/ax88796.h
@@ -14,6 +14,7 @@
 
 struct sk_buff;
 struct net_device;
+struct platform_device;
 
 #define AXFLG_HAS_EEPROM   (1<<0)
 #define AXFLG_MAC_FROMDEV  (1<<1)  /* device already has MAC */
@@ -35,6 +36,10 @@ struct ax_plat_data {
const unsigned char *buf, int star_page);
void (*block_input)(struct net_device *dev, int count,
struct sk_buff *skb, int ring_offset);
+   /* returns nonzero if a pending interrupt request might be caused by
+    * the ax88796. Handles all interrupts if set to NULL
+    */
+   int (*check_irq)(struct platform_device *pdev);
 };
 
 #endif /* __NET_AX88796_PLAT_H */
-- 
1.7.0.4
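
The filtering wrapper can be modelled in userspace roughly as follows. This is a sketch, not driver code: `fake_card`, its status register and the `IRQ_RESULT_*` values are hypothetical stand-ins for the kernel's types. It shows the one property that matters on a shared line: the real handler runs only when the platform check says the card may have raised the interrupt.

```c
#include <stdbool.h>

/* Local stand-ins for the kernel's irqreturn_t values. */
enum irq_result { IRQ_RESULT_NONE = 0, IRQ_RESULT_HANDLED = 1 };

struct fake_card {
	unsigned int irq_status;	/* hypothetical status register */
	unsigned int handled;		/* how often the real handler ran */
	bool (*check_irq)(const struct fake_card *card);
};

/* Stands in for the 8390 interrupt handler. */
static enum irq_result real_handler(struct fake_card *card)
{
	card->handled++;
	return IRQ_RESULT_HANDLED;
}

/* Mirrors ax_ei_interrupt_filtered(): bail out early when the platform
 * callback reports that this card did not raise the interrupt. */
static enum irq_result filtered_handler(struct fake_card *card)
{
	if (!card->check_irq(card))
		return IRQ_RESULT_NONE;
	return real_handler(card);
}

/* Example platform callback: bit 0 of the status register means "ours". */
static bool status_bit_set(const struct fake_card *card)
{
	return (card->irq_status & 0x1) != 0;
}
```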



[PATCH v3 3/9] net: ax88796: Attach MII bus only when open

2018-04-17 Thread Michael Schmitz
From: Michael Karcher 

Call ax_mii_init in ax_open(), and unregister/remove mdiobus resources
in ax_close().

This is needed to be able to unload the module, as the module is busy
while the MII bus is attached.

Signed-off-by: Michael Karcher 
Signed-off-by: Michael Schmitz 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/8390/ax88796.c |  183 ++-
 1 files changed, 95 insertions(+), 88 deletions(-)

diff --git a/drivers/net/ethernet/8390/ax88796.c b/drivers/net/ethernet/8390/ax88796.c
index 2a256aa..83e59ae 100644
--- a/drivers/net/ethernet/8390/ax88796.c
+++ b/drivers/net/ethernet/8390/ax88796.c
@@ -389,6 +389,90 @@ static void ax_phy_switch(struct net_device *dev, int on)
ei_outb(reg_gpoc, ei_local->mem + EI_SHIFT(0x17));
 }
 
+static void ax_bb_mdc(struct mdiobb_ctrl *ctrl, int level)
+{
+   struct ax_device *ax = container_of(ctrl, struct ax_device, bb_ctrl);
+
+   if (level)
+   ax->reg_memr |= AX_MEMR_MDC;
+   else
+   ax->reg_memr &= ~AX_MEMR_MDC;
+
+   ei_outb(ax->reg_memr, ax->addr_memr);
+}
+
+static void ax_bb_dir(struct mdiobb_ctrl *ctrl, int output)
+{
+   struct ax_device *ax = container_of(ctrl, struct ax_device, bb_ctrl);
+
+   if (output)
+   ax->reg_memr &= ~AX_MEMR_MDIR;
+   else
+   ax->reg_memr |= AX_MEMR_MDIR;
+
+   ei_outb(ax->reg_memr, ax->addr_memr);
+}
+
+static void ax_bb_set_data(struct mdiobb_ctrl *ctrl, int value)
+{
+   struct ax_device *ax = container_of(ctrl, struct ax_device, bb_ctrl);
+
+   if (value)
+   ax->reg_memr |= AX_MEMR_MDO;
+   else
+   ax->reg_memr &= ~AX_MEMR_MDO;
+
+   ei_outb(ax->reg_memr, ax->addr_memr);
+}
+
+static int ax_bb_get_data(struct mdiobb_ctrl *ctrl)
+{
+   struct ax_device *ax = container_of(ctrl, struct ax_device, bb_ctrl);
+   int reg_memr = ei_inb(ax->addr_memr);
+
+   return reg_memr & AX_MEMR_MDI ? 1 : 0;
+}
+
+static const struct mdiobb_ops bb_ops = {
+   .owner = THIS_MODULE,
+   .set_mdc = ax_bb_mdc,
+   .set_mdio_dir = ax_bb_dir,
+   .set_mdio_data = ax_bb_set_data,
+   .get_mdio_data = ax_bb_get_data,
+};
+
+static int ax_mii_init(struct net_device *dev)
+{
+   struct platform_device *pdev = to_platform_device(dev->dev.parent);
+   struct ei_device *ei_local = netdev_priv(dev);
+   struct ax_device *ax = to_ax_dev(dev);
+   int err;
+
+   ax->bb_ctrl.ops = &bb_ops;
+   ax->addr_memr = ei_local->mem + AX_MEMR;
+   ax->mii_bus = alloc_mdio_bitbang(&ax->bb_ctrl);
+   if (!ax->mii_bus) {
+   err = -ENOMEM;
+   goto out;
+   }
+
+   ax->mii_bus->name = "ax88796_mii_bus";
+   ax->mii_bus->parent = dev->dev.parent;
+   snprintf(ax->mii_bus->id, MII_BUS_ID_SIZE, "%s-%x",
+   pdev->name, pdev->id);
+
+   err = mdiobus_register(ax->mii_bus);
+   if (err)
+   goto out_free_mdio_bitbang;
+
+   return 0;
+
+ out_free_mdio_bitbang:
+   free_mdio_bitbang(ax->mii_bus);
+ out:
+   return err;
+}
+
 static int ax_open(struct net_device *dev)
 {
struct ax_device *ax = to_ax_dev(dev);
@@ -396,6 +480,10 @@ static int ax_open(struct net_device *dev)
 
netdev_dbg(dev, "open\n");
 
+   ret = ax_mii_init(dev);
+   if (ret)
+   goto failed_mii;
+
ret = request_irq(dev->irq, ax_ei_interrupt, ax->irqflags,
  dev->name, dev);
if (ret)
@@ -423,6 +511,10 @@ static int ax_open(struct net_device *dev)
ax_phy_switch(dev, 0);
free_irq(dev->irq, dev);
  failed_request_irq:
+   /* unregister mdiobus */
+   mdiobus_unregister(ax->mii_bus);
+   free_mdio_bitbang(ax->mii_bus);
+ failed_mii:
return ret;
 }
 
@@ -442,6 +534,9 @@ static int ax_close(struct net_device *dev)
phy_disconnect(dev->phydev);
 
free_irq(dev->irq, dev);
+
+   mdiobus_unregister(ax->mii_bus);
+   free_mdio_bitbang(ax->mii_bus);
return 0;
 }
 
@@ -541,92 +636,8 @@ static void ax_eeprom_register_write(struct eeprom_93cx6 *eeprom)
 #endif
 };
 
-static void ax_bb_mdc(struct mdiobb_ctrl *ctrl, int level)
-{
-   struct ax_device *ax = container_of(ctrl, struct ax_device, bb_ctrl);
-
-   if (level)
-   ax->reg_memr |= AX_MEMR_MDC;
-   else
-   ax->reg_memr &= ~AX_MEMR_MDC;
-
-   ei_outb(ax->reg_memr, ax->addr_memr);
-}
-
-static void ax_bb_dir(struct mdiobb_ctrl *ctrl, int output)
-{
-   struct ax_device *ax = container_of(ctrl, struct ax_device, bb_ctrl);
-
-   if (output)
-   ax->reg_memr &= ~AX_MEMR_MDIR;
-   else
-   ax->reg_memr |= AX_MEMR_MDIR;
-
-   ei_outb(ax->reg_memr, ax->addr_memr);
-}
-
-static void ax_bb_set_data(struct mdiobb_ctrl *ctrl, int value)
-{
-   
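
The resource pairing this patch sets up (MII bus and IRQ acquired in open, released in close, with goto-based unwinding on failure) can be sketched in userspace as below. All names (`demo_nic`, `demo_open`, `demo_close` and the helpers) are hypothetical, and the "resources" are just flags. The point is only the ordering of acquisition and release.

```c
/* Hypothetical device state: flags record which resources are held. */
struct demo_nic {
	int mii_registered;
	int irq_requested;
	int fail_irq;		/* test knob: make the IRQ request fail */
};

static int demo_mii_init(struct demo_nic *n)
{
	n->mii_registered = 1;
	return 0;
}

static void demo_mii_exit(struct demo_nic *n)
{
	n->mii_registered = 0;
}

static int demo_request_irq(struct demo_nic *n)
{
	if (n->fail_irq)
		return -1;
	n->irq_requested = 1;
	return 0;
}

static void demo_free_irq(struct demo_nic *n)
{
	n->irq_requested = 0;
}

/* Acquire in order; on failure release what was already acquired. */
static int demo_open(struct demo_nic *n)
{
	int ret;

	ret = demo_mii_init(n);
	if (ret)
		goto failed_mii;

	ret = demo_request_irq(n);
	if (ret)
		goto failed_request_irq;

	return 0;

failed_request_irq:
	demo_mii_exit(n);
failed_mii:
	return ret;
}

/* Release everything open acquired, so remove has nothing left to free. */
static void demo_close(struct demo_nic *n)
{
	demo_free_irq(n);
	demo_mii_exit(n);
}
```

Because close undoes everything open did, a later remove step must not release the IRQ or the MII bus a second time.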

[PATCH v3 8/9] net: ax88796: release platform device drvdata on probe error and module remove

2018-04-17 Thread Michael Schmitz
The net device struct pointer is stored as platform device drvdata on
module probe - clear the drvdata entry on probe fail there, as well as
when unloading the module.

Signed-off-by: Michael Schmitz 
---
 drivers/net/ethernet/8390/ax88796.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/8390/ax88796.c b/drivers/net/ethernet/8390/ax88796.c
index a72dfbc..eb72282 100644
--- a/drivers/net/ethernet/8390/ax88796.c
+++ b/drivers/net/ethernet/8390/ax88796.c
@@ -829,6 +829,7 @@ static int ax_remove(struct platform_device *pdev)
release_mem_region(mem->start, resource_size(mem));
}
 
+   platform_set_drvdata(pdev, NULL);
free_netdev(dev);
 
return 0;
@@ -962,6 +963,7 @@ static int ax_probe(struct platform_device *pdev)
release_mem_region(mem->start, mem_size);
 
  exit_mem:
+   platform_set_drvdata(pdev, NULL);
free_netdev(dev);
 
return ret;
-- 
1.7.0.4



[PATCH v3 7/9] net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as shareable

2018-04-17 Thread Michael Schmitz
From: Michael Karcher 

On the Amiga X-Surf100, the network card interrupt is shared with many
other interrupt sources, so requires the IRQF_SHARED flag to register.

Signed-off-by: Michael Karcher 
Signed-off-by: Michael Schmitz 
---
 drivers/net/ethernet/8390/ax88796.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/8390/ax88796.c b/drivers/net/ethernet/8390/ax88796.c
index c799441..a72dfbc 100644
--- a/drivers/net/ethernet/8390/ax88796.c
+++ b/drivers/net/ethernet/8390/ax88796.c
@@ -875,6 +875,9 @@ static int ax_probe(struct platform_device *pdev)
dev->irq = irq->start;
ax->irqflags = irq->flags & IRQF_TRIGGER_MASK;
 
+   if (irq->flags &  IORESOURCE_IRQ_SHAREABLE)
+   ax->irqflags |= IRQF_SHARED;
+
mem = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!mem) {
dev_err(&pdev->dev, "no MEM specified\n");
-- 
1.7.0.4
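
The flag translation in this hunk amounts to "keep the trigger bits, and add the shared flag if the resource says the line is shareable". A minimal userspace sketch follows. The `DEMO_*` constants are illustrative stand-ins chosen for this example and should not be taken as the kernel header values.

```c
/* Illustrative values only; check the kernel headers for the real ones. */
#define DEMO_RES_TRIGGER_MASK	0x0fu	/* low bits: trigger type */
#define DEMO_RES_SHAREABLE	0x10u	/* resource marked shareable */
#define DEMO_IRQF_SHARED	0x80u	/* flag handed to the IRQ request */

/* Translate resource flags into the flags used when requesting the IRQ. */
static unsigned int demo_irqflags(unsigned int res_flags)
{
	unsigned int irqflags = res_flags & DEMO_RES_TRIGGER_MASK;

	if (res_flags & DEMO_RES_SHAREABLE)
		irqflags |= DEMO_IRQF_SHARED;

	return irqflags;
}
```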



[PATCH v3 2/9] net: ax88796: Fix MAC address reading

2018-04-17 Thread Michael Schmitz
From: Michael Karcher 

To read the MAC address from the (virtual) SAprom, the remote DMA
unit needs to be set up like for every other process access to card-local
memory.

Signed-off-by: Michael Karcher 
Signed-off-by: Michael Schmitz 
---
 drivers/net/ethernet/8390/ax88796.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/8390/ax88796.c b/drivers/net/ethernet/8390/ax88796.c
index 2455547..2a256aa 100644
--- a/drivers/net/ethernet/8390/ax88796.c
+++ b/drivers/net/ethernet/8390/ax88796.c
@@ -671,10 +671,16 @@ static int ax_init_dev(struct net_device *dev)
if (ax->plat->flags & AXFLG_HAS_EEPROM) {
unsigned char SA_prom[32];
 
+   ei_outb(6, ioaddr + EN0_RCNTLO);
+   ei_outb(0, ioaddr + EN0_RCNTHI);
+   ei_outb(0, ioaddr + EN0_RSARLO);
+   ei_outb(0, ioaddr + EN0_RSARHI);
+   ei_outb(E8390_RREAD + E8390_START, ioaddr + NE_CMD);
for (i = 0; i < sizeof(SA_prom); i += 2) {
SA_prom[i] = ei_inb(ioaddr + NE_DATAPORT);
SA_prom[i + 1] = ei_inb(ioaddr + NE_DATAPORT);
}
+   ei_outb(ENISR_RDC, ioaddr + EN0_ISR);   /* Ack intr. */
 
if (ax->plat->wordlength == 2)
for (i = 0; i < 16; i++)
-- 
1.7.0.4



Re: [PATCH] VSOCK: make af_vsock.ko removable again

2018-04-17 Thread Stefan Hajnoczi
On Tue, Apr 17, 2018 at 09:45:12AM -0400, David Miller wrote:
> From: Stefan Hajnoczi 
> Date: Tue, 17 Apr 2018 14:25:58 +0800
> 
> > Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
> > vsock_init_tables()") introduced a module_init() function without a
> > corresponding module_exit() function.
> > 
> > Modules with an init function can only be removed if they also have an
> > exit function.  Therefore the vsock module was considered "permanent"
> > and could not be removed.
> > 
> > This patch adds an empty module_exit() function so that "rmmod vsock"
> > works.  No explicit cleanup is required because:
> > 
> > 1. Transports call vsock_core_exit() upon exit and cannot be removed
> >while sockets are still alive.
> > 2. vsock_diag.ko does not perform any action that requires cleanup by
> >vsock.ko.
> > 
> > Reported-by: Xiumei Mu 
> > Cc: Cong Wang 
> > Cc: Jorgen Hansen 
> > Signed-off-by: Stefan Hajnoczi 
> 
> Applied, but please provide a proper Fixes: tag next time.  I added it
> for you this time.

Will do.  Thanks!

Stefan




Re: [PATCH 04/10] net: ax88796: Add block_input/output hooks to ax_plat_data

2018-04-17 Thread Michael Schmitz
Hi Finn,

On Wed, Apr 18, 2018 at 1:23 PM, Finn Thain <fth...@telegraphics.com.au> wrote:
> On Wed, 18 Apr 2018, Michael Schmitz wrote:
>
>> I think this is a false positive - we're encouraged to provide the
>> full parameter list for functions, so the struct sk_buff* can't be
>> avoided.
>>
>
> I don't think it's a false positive. I think ax88796.h would need to
> #include .
>
> You may be able to get away with a forward declaration, as in,
> struct skbuff;
> but I'm not sure about that. I would have to build mach-anubis.c to check.

I've added a forward declaration for now - worked for struct
net_device as well (would have been missing from the mach-anubis.c
build as well because of the missing netdevice header).
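
A forward declaration is enough here because the header only ever uses pointers to these types. The sketch below uses hypothetical `demo_*` names in place of the real structures. Without the two declarations at the top, the struct tags would first appear inside a parameter list and be scoped to that prototype only, which is what the quoted gcc warning complains about.

```c
/* Incomplete (forward) declarations: sufficient for pointer parameters. */
struct demo_skb;
struct demo_netdev;

/* Hypothetical platform data with a hook that takes both pointer types. */
struct demo_plat_data {
	void (*block_input)(struct demo_netdev *dev, int count,
			    struct demo_skb *skb, int ring_offset);
};

/* A user that never dereferences the pointers compiles without ever
 * seeing the full structure definitions. */
static int demo_has_hook(const struct demo_plat_data *p)
{
	return p->block_input != 0;
}
```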

> But why do you need to pass an skbuff pointer here? xs100_block_input()
> only accesses skb->data.

I'm forced to use the same interface as ax_block_input()
(xs100_block_input is a plug-in replacement for that). But both could
be changed. Let's leave that for later please.

> BTW, this patch has an unrelated whitespace change.

Fixed, thanks.

Cheers,

  Michael

>
> --
>
>> Cheers,
>>
>>   Michael
>>
>>
>> On Wed, Apr 18, 2018 at 6:46 AM, kbuild test robot <l...@intel.com> wrote:
>> > Hi Michael,
>> >
>> > I love your patch! Perhaps something to improve:
>> >
>> > [auto build test WARNING on v4.16]
>> > [cannot apply to net-next/master net/master v4.17-rc1 next-20180417]
>> > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>> >
>> > url:
>> > https://github.com/0day-ci/linux/commits/Michael-Schmitz/New-network-driver-for-Amiga-X-Surf-100-m68k/20180417-141150
>> > config: arm-samsung (attached as .config)
>> > compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
>> > reproduce:
>> > wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>> > chmod +x ~/bin/make.cross
>> > # save the attached .config to linux build tree
>> > make.cross ARCH=arm
>> >
>> > All warnings (new ones prefixed by >>):
>> >
>> >In file included from arch/arm/mach-s3c24xx/mach-anubis.c:42:0:
>> >>> include/net/ax88796.h:35:11: warning: 'struct sk_buff' declared inside parameter list will not be visible outside of this definition or declaration
>> >struct sk_buff *skb, int ring_offset);
>> >   ^~~
>> >
>> > vim +35 include/net/ax88796.h
>> >
>> > 20
>> > 21  struct ax_plat_data {
>> > 22          unsigned int flags;
>> > 23          unsigned char wordlength;  /* 1 or 2 */
>> > 24          unsigned char dcr_val;     /* default value for DCR */
>> > 25          unsigned char rcr_val;     /* default value for RCR */
>> > 26          unsigned char gpoc_val;    /* default value for GPOC */
>> > 27          u32 *reg_offsets;          /* register offsets */
>> > 28          u8  *mac_addr;             /* MAC addr (only used when
>> > 29                                        AXFLG_MAC_FROMPLATFORM is used */
>> > 30
>> > 31          /* uses default ax88796 buffer if set to NULL */
>> > 32          void (*block_output)(struct net_device *dev, int count,
>> > 33                  const unsigned char *buf, int star_page);
>> > 34          void (*block_input)(struct net_device *dev, int count,
>> >   > 35                  struct sk_buff *skb, int ring_offset);
>> > 36  };
>> > 37
>> >
>> > ---
>> > 0-DAY kernel test infrastructure                Open Source Technology Center
>> > https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-m68k" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>


Re: [PATCH net-next v2 00/21] net/ipv6: Separate data structures for FIB and data path

2018-04-17 Thread David Miller
From: David Ahern 
Date: Tue, 17 Apr 2018 17:33:06 -0700

> IPv6 uses the same data struct for both control plane (FIB entries) and
> data path (dst entries). This struct has elements needed for both paths
> adding memory overhead and complexity (taking a dst hold in most places
> but an additional reference on rt6i_ref in a few). Furthermore, because
> of the dst_alloc tie, all FIB entries are allocated with GFP_ATOMIC.
> 
> This patch set separates FIB entries from dst entries, better aligning
> IPv6 code with IPv4, simplifying the reference counting and allowing
> FIB entries added by userspace (not autoconf) to use GFP_KERNEL. It is
> the first step to a number of performance and scalability changes.
> 
> The end result of this patch set:
>   - FIB entries (fib6_info):
> /* size: 208, cachelines: 4, members: 25 */
> /* sum members: 207, holes: 1, sum holes: 1 */
> 
>   - dst entries (rt6_info)
>/* size: 240, cachelines: 4, members: 11 */
> 
> Versus the single rt6_info struct today for both paths:
>   /* size: 320, cachelines: 5, members: 28 */
> 
> This amounts to a 35% reduction in memory use for FIB entries and a
> 25% reduction for dst entries.

Looks great, series applied, thanks David!
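
The percentages quoted in the cover letter follow directly from the struct sizes it lists (320 bytes before, 208 and 240 bytes after). A small sketch of the arithmetic, with a hypothetical helper name:

```c
/* Percentage saved when a structure shrinks from old_size to new_size
 * bytes (integer arithmetic, rounded down). */
static unsigned int reduction_percent(unsigned int old_size,
				      unsigned int new_size)
{
	return (old_size - new_size) * 100u / old_size;
}
```

With the sizes above, (320 - 208) / 320 gives 35% for FIB entries and (320 - 240) / 320 gives 25% for dst entries.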


Re: [PATCH 04/10] net: ax88796: Add block_input/output hooks to ax_plat_data

2018-04-17 Thread Michael Schmitz
Hi Andrew,

ax88796 includes it via linux/netdevice.h. mach-anubis.c doesn't.

Michael Karcher's patches have added forward declarations for struct
net_device and struct platform_device already - I'll add struct sk_buff
as suggested by Finn.

Cheers,

  Michael


On Wed, Apr 18, 2018 at 1:19 PM, Andrew Lunn  wrote:
> On Wed, Apr 18, 2018 at 12:53:21PM +1200, Michael Schmitz wrote:
>> I think this is a false positive - we're encouraged to provide the
>> full parameter list for functions, so the struct sk_buff* can't be
>> avoided.
>
> Hi Michael
>
> How is  being included?
>
> You probably want to build using the .config file and see.
>
> Andrew


Re: [PATCH v2] net: change the comment of dev_mc_init

2018-04-17 Thread David Miller
From: sunlianwen 
Date: Wed, 18 Apr 2018 08:29:52 +0800

> The comment of dev_mc_init() is wrong: it uses dev_mc_flush
> instead of dev_mc_init.
> 
> Signed-off-by: Lianwen Sun 

Patch is still corrupted by your email client.

> - * dev_mc_flush - Init multicast address list
> + * dev_mc_init - Init multicast address list

The character after "*" is a TAB yet it is a sequence of SPACES
in your patch.

Your email client is doing this.

Please do not resend this patch to the mailing list until you can
successfully email the patch to yourself and apply the patch cleanly.


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-17 Thread Willem de Bruijn
On Tue, Apr 17, 2018 at 10:25 PM, Samudrala, Sridhar
 wrote:
>
> On 4/17/2018 2:07 PM, Willem de Bruijn wrote:
>>
>> On Tue, Apr 17, 2018 at 4:48 PM, Sowmini Varadhan
>>  wrote:
>>>
>>> On (04/17/18 16:23), Willem de Bruijn wrote:

>>>> Assuming IPv4 with an MTU of 1500 and the maximum segment
>>>> size of 1472, the receiver will see three datagrams with MSS of
>>>> 1472B, 528B and 512B.
>>>
>>> so the recvmsg will also pass up 1472, 528, 512, right?
>>
>> That's right.
>>
>>> If yes, how will the recvmsg differentiate between the case
>>> (2000 byte message followed by 512 byte message) and
>>> (1472 byte message, 528 byte message, then 512 byte message),
>>> in other words, how are UDP message boundary semantics preserved?
>>
>> They aren't. This is purely an optimization to amortize the cost of
>> repeated tx stack traversal. Unlike UFO, which would preserve the
>> boundaries of the original larger than MTU datagram.
>
>
> Doesn't this break UDP applications that expect message boundary
> preservation semantics? Is it possible to negotiate this feature?

A process has to explicitly request the feature with the socket option
or cmsg UDP_SEGMENT. By setting that to the gso size, it signals
its intent to send multiple datagrams in one call.

Or were you responding to the hypothetical GRO example below?
Yes, that clearly would have to be limited to negotiated flows, not
unlike how foo-over-udp tunneling is detected. It is also not a serious
suggestion at this point.
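
The sizes discussed in this thread come from simple segmentation arithmetic: a payload is cut into gso-size pieces, and the last piece carries the remainder. The sketch below is a userspace illustration with a hypothetical helper name, not the kernel's segmentation code.

```c
#include <stddef.h>

/* Split a payload of len bytes into segments of at most gso_size bytes.
 * Writes up to max_out segment lengths into out and returns how many
 * were written. */
static size_t split_payload(size_t len, size_t gso_size, size_t *out,
			    size_t max_out)
{
	size_t n = 0;

	if (gso_size == 0)
		return 0;

	while (len > 0 && n < max_out) {
		size_t seg = len < gso_size ? len : gso_size;

		out[n++] = seg;
		len -= seg;
	}
	return n;
}
```

A 2000-byte send with a segment size of 1472 yields 1472 and 528 bytes, and a following 512-byte send stays a single datagram, so the receiver cannot tell where the sender's original calls began or ended.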

>> A prime use case is bulk transfer of data. Think video streaming
>> with QUIC. It must send MTU sized or smaller packets, but has
>> no application-layer requirement to reconstruct large packets on
>> the peer.
>>
>> That said, for negotiated flows an inverse GRO feature could
>> conceivably be implemented to reduce rx stack traversal, too.
>> Though due to interleaving of packets on the wire, aggregation
>> would be best effort, similar to TCP TSO and GRO using the
>> PSH bit as packetization signal.


Re: [PATCH RFC net-next 06/11] udp: add gso support to virtual devices

2018-04-17 Thread Willem de Bruijn
On Tue, Apr 17, 2018 at 8:43 PM, Dimitris Michailidis
 wrote:
> On Tue, Apr 17, 2018 at 1:00 PM, Willem de Bruijn
>  wrote:
>> From: Willem de Bruijn 
>>
>> Virtual devices such as tunnels and bonding can handle large packets.
>> Only segment packets when reaching a physical or loopback device.
>>
>> Signed-off-by: Willem de Bruijn 
>> ---
>>  include/linux/netdev_features.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>> index 35b79f47a13d..1e4883bb02a7 100644
>> --- a/include/linux/netdev_features.h
>> +++ b/include/linux/netdev_features.h
>> @@ -80,6 +80,7 @@ enum {
>>
>> NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */
>> NETIF_F_HW_TLS_RECORD_BIT,  /* Offload TLS record */
>> +   NETIF_F_GSO_UDP_L4_BIT, /* UDP payload GSO (not UFO) */
>
> Please add an entry for the new flag to
> net/core/ethtool.c:netdev_features_strings
> and a description to Documentation/networking/netdev-features.txt.

Will do. I initially wrote this as a transparent kernel-internal feature,
but indeed it should be observable and configurable.


Fw: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing null pointer vulnerability.

2018-04-17 Thread Stephen Hemminger
This may already be fixed.

Begin forwarded message:

Date: Wed, 18 Apr 2018 01:52:59 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing null pointer vulnerability.


https://bugzilla.kernel.org/show_bug.cgi?id=199429

            Bug ID: 199429
           Summary: smc_shutdown(net/smc/af_smc.c) has a UAF causing null pointer vulnerability.
           Product: Networking
           Version: 2.5
    Kernel Version: 4.16.0-rc7
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: step...@networkplumber.org
          Reporter: 1773876...@qq.com
        Regression: No

Created attachment 275431
  --> https://bugzilla.kernel.org/attachment.cgi?id=275431&action=edit
POC

Syzkaller hit 'general protection fault in kernel_sock_shutdown' bug.

NET: Registered protocol family 43
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN PTI
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in: smc ib_core binfmt_misc joydev hid_generic snd_pcm snd_timer
snd usbmouse usbhid soundcore psmouse e1000 hid pcspkr parport_pc input_leds
i2c_piix4 parport serio_raw floppy qemu_fw_cfg evbug mac_hid
CPU: 1 PID: 1751 Comm: syzkaller252340 Not tainted 4.16.0-rc7+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1
04/01/2014
RIP: 0010:kernel_sock_shutdown+0x29/0x70 net/socket.c:3255
RSP: 0018:88000666fcf8 EFLAGS: 00010206
RAX: dc00 RBX:  RCX: 829206e4
RDX: 0005 RSI:  RDI: 0028
RBP: 88003b43a0d2 R08: 0003 R09: 0002b3c0
R10: 0ae7 R11: 00eb R12: 
R13:  R14:  R15: 
FS:  0225b880() GS:88003fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f5b8580 CR3: 3bcde004 CR4: 001606e0
Call Trace:
 smc_shutdown+0x431/0x4a0 [smc]
 SYSC_shutdown net/socket.c:1901 [inline]
 SyS_shutdown+0x140/0x250 net/socket.c:1892
 do_syscall_64+0x2ee/0x580 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x4431a9
RSP: 002b:7ffcccb77758 EFLAGS: 0217 ORIG_RAX: 0030
RAX: ffda RBX: 004003d0 RCX: 004431a9
RDX: 004431a9 RSI:  RDI: 0003
RBP: 00401800 R08: 004003d0 R09: 004003d0
R10: 004003d0 R11: 0217 R12: 00401890
R13:  R14: 006b1018 R15: 
Code: 00 00 0f 1f 44 00 00 41 54 55 41 89 f4 53 48 89 fb e8 4c bd ad fe 48 8d
7b 28 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 74 05 e8
7c 62 e0 fe 48 8b 6b 28 48 b8 00 00 00 00 
RIP: kernel_sock_shutdown+0x29/0x70 net/socket.c:3255 RSP: 88000666fcf8
---[ end trace ac1ba3c5e5bfa977 ]---

0xa02d1a82  1258        rc = smc_close_active(smc);
Dump of assembler code from 0xa02d1a82 to 0xa02d1a8c:
=> 0xa02d1a82 :  call   0xa02f3c50
   0xa02d1a87 :  mov    r13d,eax
   0xa02d1a8a :  call   0x813fc430
End of assembler dump.
rax            0x88005a6217c0   -131939878955072
rbx            0x88005be55b40   -131939853575360
rcx            0xa02d1a7f       -1607656833
rdx            0x0              0
rsi            0xfe01           4294966785
rdi            0x88005be55b40   -131939853575360
rbp            0x88005be55b52   0x88005be55b52
rsp            0x88005e887d18   0x88005e887d18
r8             0x88005f9d0258   -131939791207848
r9             0x880060e2bc00   -131939769861120
r10            0x88005f9e7340   -131939791113408
r11            0xb9ed           47597
r12            0x0              0
r13            0x0              0
r14            0x0              0
r15            0x0              0
rip            0xa02d1a82       0xa02d1a82
eflags         0x293            [ CF AF SF IF ]
cs             0x10             16
ss             0x18             24
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
ni:3: Error in sourced command file:
Could not fetch register "fs_base"; remote failure reply 'E14'
(gdb) b *0xa02d1a87
Breakpoint 36 at 0xa02d1a87: file ../net/smc/af_smc.c, line 1258.
(gdb) c
Continuing.
[Switching to Thread 4]

Thread 4 hit Hardware watchpoint 34: ((struct smc_sock*)
0x88005be55b40)->clcsock

Old value = (struct socket *) 0x880058fa5100
New value = (struct socket 

[PATCH bpf-next] tools: bpftool: make it easier to feed hex bytes to bpftool

2018-04-17 Thread Jakub Kicinski
From: Quentin Monnet 

bpftool uses hexadecimal values when it dumps map contents:

# bpftool map dump id 1337
key: ff 13 37 ff  value: a1 b2 c3 d4 ff ff ff ff
Found 1 element

In order to lookup or update values with bpftool, the natural reflex is
then to copy and paste the values to the command line, and to try to run
something like:

# bpftool map update id 1337 key ff 13 37 ff \
value 00 00 00 00 00 00 1a 2b
Error: error parsing byte: ff

bpftool complains because it uses strtoul() with a 0 base to parse the
bytes, so without a "0x" prefix the bytes are treated as decimal
values (or even octal if they start with "0").

To feed hexadecimal values instead, one needs to add "0x" prefixes
everywhere necessary:

# bpftool map update id 1337 key 0xff 0x13 0x37 0xff \
value 0 0 0 0 0 0 0x1a 0x2b

To make it easier to use hexadecimal values, add an optional "hex"
keyword to put after "key" or "value" to tell bpftool to consider the
digits as hexadecimal. We can now do:

# bpftool map update id 1337 key hex ff 13 37 ff \
value hex 0 0 0 0 0 0 1a 2b

Without the "hex" keyword, the bytes are still parsed according to
normal integer notation (decimal if no prefix, or hexadecimal or octal
if "0x" or "0" prefix is used, respectively).
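
The parsing behaviour described above can be reproduced with a few lines of standard C. This is a sketch under stated assumptions, not bpftool's actual parser: `demo_parse_bytes` and its `use_hex` argument are hypothetical names, and only strtoul() with base 0 versus base 16 is taken from the description.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse ntok textual tokens into bytes. With use_hex set, every token is
 * read as base 16; otherwise base 0 lets strtoul() pick decimal, octal
 * ("0" prefix) or hexadecimal ("0x" prefix). Returns the number of bytes
 * parsed, or -1 on the first token that is not a valid byte. */
static int demo_parse_bytes(char **tokens, int ntok, unsigned char *out,
			    int use_hex)
{
	int base = use_hex ? 16 : 0;
	int i;

	for (i = 0; i < ntok; i++) {
		char *end;
		unsigned long val;

		errno = 0;
		val = strtoul(tokens[i], &end, base);
		/* Reject empty parses, trailing junk and values over 0xff. */
		if (errno || end == tokens[i] || *end != '\0' || val > 0xff)
			return -1;
		out[i] = (unsigned char)val;
	}
	return ntok;
}
```

With base 0 the token "ff" contains no valid digits, which matches the "error parsing byte: ff" failure shown earlier; with base 16 the same token parses as 255.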

The patch also adds related documentation and bash completion for the
"hex" keyword.

Suggested-by: Daniel Borkmann 
Suggested-by: David Beckett 
Signed-off-by: Quentin Monnet 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/Documentation/bpftool-map.rst | 29 +
 tools/bpf/bpftool/bash-completion/bpftool   |  8 ---
 tools/bpf/bpftool/map.c | 17 ++-
 3 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 457e868bd32f..5f512b14bff9 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -23,10 +23,10 @@ MAP COMMANDS
 
 |  **bpftool** **map { show | list }**   [*MAP*]
 |  **bpftool** **map dump**    *MAP*
-|  **bpftool** **map update**  *MAP*  **key** *BYTES*   **value** *VALUE* [*UPDATE_FLAGS*]
-|  **bpftool** **map lookup**  *MAP*  **key** *BYTES*
-|  **bpftool** **map getnext** *MAP* [**key** *BYTES*]
-|  **bpftool** **map delete**  *MAP*  **key** *BYTES*
+|  **bpftool** **map update**  *MAP*  **key** [**hex**] *BYTES*   **value** [**hex**] *VALUE* [*UPDATE_FLAGS*]
+|  **bpftool** **map lookup**  *MAP*  **key** [**hex**] *BYTES*
+|  **bpftool** **map getnext** *MAP* [**key** [**hex**] *BYTES*]
+|  **bpftool** **map delete**  *MAP*  **key** [**hex**] *BYTES*
 |  **bpftool** **map pin** *MAP*  *FILE*
 |  **bpftool** **map help**
 |
@@ -48,20 +48,26 @@ DESCRIPTION
    **bpftool map dump**    *MAP*
  Dump all entries in a given *MAP*.
 
-   **bpftool map update**  *MAP*  **key** *BYTES*   **value** *VALUE* [*UPDATE_FLAGS*]
+   **bpftool map update**  *MAP*  **key** [**hex**] *BYTES*   **value** [**hex**] *VALUE* [*UPDATE_FLAGS*]
  Update map entry for a given *KEY*.
 
  *UPDATE_FLAGS* can be one of: **any** update existing entry
  or add if doesn't exist; **exist** update only if entry already
  exists; **noexist** update only if entry doesn't exist.
 
-   **bpftool map lookup**  *MAP*  **key** *BYTES*
+ If the **hex** keyword is provided in front of the bytes
+ sequence, the bytes are parsed as hexadecimal values, even if
+ no "0x" prefix is added. If the keyword is not provided, then
+ the bytes are parsed as decimal values, unless a "0x" prefix
+ (for hexadecimal) or a "0" prefix (for octal) is provided.
+
+   **bpftool map lookup**  *MAP*  **key** [**hex**] *BYTES*
  Lookup **key** in the map.
 
-   **bpftool map getnext** *MAP* [**key** *BYTES*]
+   **bpftool map getnext** *MAP* [**key** [**hex**] *BYTES*]
  Get next key.  If *key* is not specified, get first key.
 
-   **bpftool map delete**  *MAP*  **key** *BYTES*
+   **bpftool map delete**  *MAP*  **key** [**hex**] *BYTES*
  Remove entry from the map.
 
**bpftool map pin** *MAP*  *FILE*
@@ -98,7 +104,12 @@ EXAMPLES
   10: hash  name some_map  flags 0x0
key 4B  value 8B  max_entries 2048  memlock 167936B
 
-**# bpftool map update id 10 key 13 00 07 00 value 02 00 00 00 01 02 03 04**
+The following three commands are equivalent:
+
+|
+| **# bpftool map update id 10 key hex   20   c4   b7   00 value hex   0f   ff   ff   ab   01   02   03   4c**
+| **# bpftool map 
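
The parsing rule described in the documentation hunk above can be modelled in a few lines of Python. This is only a sketch of the documented behavior, not bpftool code: the helper name `parse_bytes` and its `hex_keyword` parameter are made up for illustration, and the base detection without the keyword is assumed to follow the usual strtoul-style convention ("0x" for hexadecimal, leading "0" for octal, decimal otherwise).

```python
def parse_bytes(tokens, hex_keyword=False):
    """Model of the documented rule; parse_bytes is a hypothetical helper."""
    out = []
    for tok in tokens:
        if hex_keyword:
            base = 16          # "hex" given: hexadecimal, "0x" prefix optional
        elif tok.lower().startswith("0x"):
            base = 16          # explicit hexadecimal prefix
        elif len(tok) > 1 and tok.startswith("0"):
            base = 8           # leading "0" means octal
        else:
            base = 10          # plain decimal
        value = int(tok, base)
        if not 0 <= value <= 0xFF:
            raise ValueError("byte out of range: %r" % tok)
        out.append(value)
    return bytes(out)


# Three spellings of the same 4-byte key.
with_keyword = parse_bytes(["20", "c4", "b7", "00"], hex_keyword=True)
with_prefix = parse_bytes(["0x20", "0xc4", "0xb7", "0"])
in_decimal = parse_bytes(["32", "196", "183", "0"])
```

All three calls yield the same key bytes, which is the equivalence the EXAMPLES hunk is illustrating.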

Re: [PATCH net-next 3/5] ipv4: support sport, dport and ip protocol in RTM_GETROUTE

2018-04-17 Thread Roopa Prabhu
On Tue, Apr 17, 2018 at 1:10 AM, Ido Schimmel  wrote:
> On Mon, Apr 16, 2018 at 01:41:36PM -0700, Roopa Prabhu wrote:
>> @@ -2757,6 +2796,12 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
>>   fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
>>   fl4.flowi4_mark = mark;
>>   fl4.flowi4_uid = uid;
>> + if (sport)
>> + fl4.fl4_sport = sport;
>> + if (dport)
>> + fl4.fl4_dport = dport;
>> + if (ip_proto)
>> + fl4.flowi4_proto = ip_proto;
>
> Hi Roopa,
>
> This info isn't set in the synthesized skb, but only in the flow info
> and therefore not used for input routes. I see you added a test case,
> but it's only for output routes. I believe an input route test case will
> fail.

Yep. I made a note to myself to work through the input case, but missed it
before I sent the series.

>
> Also, note that the skb as synthesized now is invalid - iph->ihl is 0
> for example - so the flow dissector will spit it out. It effectively
> means that route get is broken when L4 hashing is used. It also affects
> output routes because since commit 3765d35ed8b9 ("net: ipv4: Convert
> inet_rtm_getroute to rcu versions of route lookup") the skb is used to
> calculate the multipath hash.

Yep, I remember that. Will look. Thanks, Ido.


Re: [PATCH] PCI: Add PCIe to pcie_print_link_status() messages

2018-04-17 Thread Jakub Kicinski
On Fri, 13 Apr 2018 11:16:38 -0700, Jakub Kicinski wrote:
> Currently pcie_print_link_status() prints PCIe bandwidth and link width
> information but does not mention that it pertains to PCIe.  Since this
> and related functions are used exclusively by networking drivers today,
> users may get confused into thinking that it's the NIC bandwidth that is
> being talked about.  Insert a "PCIe" into the messages.
> 
> Signed-off-by: Jakub Kicinski 

Hi Bjorn!

Could this small change still make it into 4.17 or are you planning to
apply it in 4.18 cycle?  IMHO the message clarification may be worth
considering for 4.17..

>  drivers/pci/pci.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index aa86e904f93c..73a0a4993f6a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5273,11 +5273,11 @@ void pcie_print_link_status(struct pci_dev *dev)
>   bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
>  
>   if (bw_avail >= bw_cap)
> - pci_info(dev, "%u.%03u Gb/s available bandwidth (%s x%d link)\n",
> + pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
>bw_cap / 1000, bw_cap % 1000,
>PCIE_SPEED2STR(speed_cap), width_cap);
>   else
> - pci_info(dev, "%u.%03u Gb/s available bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
> + pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
>bw_avail / 1000, bw_avail % 1000,
>PCIE_SPEED2STR(speed), width,
>limiting_dev ? pci_name(limiting_dev) : "",



Re: [PATCH RFC net-next 00/11] udp gso

2018-04-17 Thread Samudrala, Sridhar


On 4/17/2018 2:07 PM, Willem de Bruijn wrote:

On Tue, Apr 17, 2018 at 4:48 PM, Sowmini Varadhan
 wrote:

On (04/17/18 16:23), Willem de Bruijn wrote:

Assuming IPv4 with an MTU of 1500 and the maximum segment
size of 1472, the receiver will see three datagrams with MSS of
1472B, 528B and 512B.

so the recvmsg will also pass up 1472, 528, 512, right?

That's right.


If yes, how will the recvmsg differentiate between the case
(2000 byte message followed by 512 byte message) and
(1472 byte message, 528 byte message, then 512 byte message),
in other words, how are UDP message boundary semantics preserved?

They aren't. This is purely an optimization to amortize the cost of
repeated tx stack traversal. This is unlike UFO, which would preserve the
boundaries of the original larger-than-MTU datagram.
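
The arithmetic in this exchange can be checked with a small Python model. It is a sketch only: the function names are invented for illustration, and it assumes each send is cut into MSS-sized datagrams with one shorter final datagram, with no IP options (20-byte IPv4 header, 8-byte UDP header).

```python
IPV4_HEADER_LEN = 20   # assumes no IP options
UDP_HEADER_LEN = 8


def max_udp_payload(mtu):
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_LEN - UDP_HEADER_LEN


def gso_datagram_sizes(send_sizes, mss):
    """Sizes of the datagrams the receiver sees for a sequence of sends."""
    seen = []
    for size in send_sizes:
        while size > mss:
            seen.append(mss)    # full-sized segment
            size -= mss
        seen.append(size)       # remainder (or the whole send if it fits)
    return seen


mss = max_udp_payload(1500)
segmented = gso_datagram_sizes([2000, 512], mss)
unsegmented = gso_datagram_sizes([1472, 528, 512], mss)
```

Both send patterns produce the same sequence of datagram sizes at the receiver, which is why the original message boundaries cannot be recovered.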


Doesn't this break UDP applications that expect message boundary
preservation semantics? Is it possible to negotiate this feature?



A prime use case is bulk transfer of data. Think video streaming
with QUIC. It must send MTU sized or smaller packets, but has
no application-layer requirement to reconstruct large packets on
the peer.

That said, for negotiated flows an inverse GRO feature could
conceivably be implemented to reduce rx stack traversal, too.
Though due to interleaving of packets on the wire, aggregation
would be best effort, similar to TCP TSO and GRO using the
PSH bit as packetization signal.




Re: [PATCH bpf-next 10/10] [bpf]: make virtio compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Jason Wang



On 2018-04-17 14:51, Nikita V. Shirokov wrote:

With the bpf_xdp_adjust_tail helper, XDP's data_end pointer can be changed as
well (only decreasing the pointer is going to be supported).
Changing this pointer changes the packet's size.
For the virtio driver we need to adjust XDP_PASS handling by recalculating
the length of the packet when it is passed to the TCP/IP stack.

Signed-off-by: Nikita V. Shirokov 
---
  drivers/net/virtio_net.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..115d85f7360a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,6 +604,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
case XDP_PASS:
/* Recalculate length in case bpf program changed it */
delta = orig_data - xdp.data;
+   len = xdp.data_end - xdp.data;
break;
case XDP_TX:
sent = __virtnet_xdp_xmit(vi, &xdp);
@@ -637,7 +638,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
goto err;
}
skb_reserve(skb, headroom - delta);
-   skb_put(skb, len + delta);
+   skb_put(skb, len);
if (!delta) {
buf += header_offset;
memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len);
@@ -752,6 +753,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
offset = xdp.data -
page_address(xdp_page) - vi->hdr_len;
  
+   /* recalculate len if xdp.data or xdp.data_end were
+* adjusted
+*/
+   len = xdp.data_end - xdp.data;
/* We can only create skb based on xdp_page. */
if (unlikely(xdp_page != page)) {
rcu_read_unlock();


Reviewed-by: Jason Wang 
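
Why `skb_put(skb, len + delta)` becomes `skb_put(skb, len)` can be seen with plain integer offsets. The following Python sketch is an illustration only: the function names and the sample offsets are invented, and pointers are modelled as integers into a buffer.

```python
def len_from_head_delta(orig_len, orig_data, data):
    """Old approach: correct only for movement of the packet head."""
    delta = orig_data - data
    return orig_len + delta


def len_from_pointers(data, data_end):
    """New approach: take the length directly from both pointers."""
    return data_end - data


# Sample packet: starts at offset 100, 60 bytes long.
orig_data, orig_len = 100, 60
orig_data_end = orig_data + orig_len

# Case 1: the program only moves the head forward by 14 bytes.
head_only_old = len_from_head_delta(orig_len, orig_data, 114)
head_only_new = len_from_pointers(114, orig_data_end)

# Case 2: the program additionally shrinks the tail by 10 bytes.
tail_old = len_from_head_delta(orig_len, orig_data, 114)
tail_new = len_from_pointers(114, orig_data_end - 10)
```

With only a head adjustment both computations agree; once the tail is trimmed, only the pointer difference reflects the real packet length.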



Re: [PATCH bpf-next 09/10] [bpf]: make tun compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Jason Wang



On 2018-04-17 14:51, Nikita V. Shirokov wrote:

With the bpf_xdp_adjust_tail helper, XDP's data_end pointer can be changed as
well (only decreasing the pointer is going to be supported).
Changing this pointer changes the packet's size.
For the tun driver we need to adjust XDP_PASS handling by recalculating
the length of the packet when it is passed to the TCP/IP stack
(in case the data_end pointer was adjusted after the XDP program ran).

Signed-off-by: Nikita V. Shirokov 
---
  drivers/net/tun.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 28583aa0c17d..0b488a958076 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1688,6 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
return NULL;
case XDP_PASS:
delta = orig_data - xdp.data;
+   len = xdp.data_end - xdp.data;
break;
default:
bpf_warn_invalid_xdp_action(act);
@@ -1708,7 +1709,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
}
  
skb_reserve(skb, pad - delta);
-   skb_put(skb, len + delta);
+   skb_put(skb, len);
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
  


Reviewed-by: Jason Wang 



Re: [PATCHv2 net-next] vxlan: add ttl inherit support

2018-04-17 Thread Hangbin Liu
On Tue, Apr 17, 2018 at 03:16:27PM -0400, David Miller wrote:
> From: Hangbin Liu 
> Date: Tue, 17 Apr 2018 20:52:54 +0800
> 
> > Like tos inherit, ttl inherit should also mean inheriting the inner
> > protocol's ttl value, which is not actually implemented in vxlan yet.
> > 
> > But we could not treat ttl == 0 as "use the inner TTL", because that would
> > also be used when the "ttl" option is not specified; that would be a
> > behavior change and would break real use cases.
> > 
> > So add a different attribute IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is
> > specified.
> > 
> > ---
> > v2: As suggested by Stefano, clean up function ip_tunnel_get_ttl().
> > 
> > Suggested-by: Jiri Benc 
> > Signed-off-by: Hangbin Liu 
> 
> I already applied V1 of your patch.
> 
> Furthermore, this commit message would cause your signoffs and other tags
> to be removed due to the "---" delimiter.
> 
> I generally encourage people to leave the change history text _in_ the
> commit message anyways.  It is useful information for the future.

Thanks for the reminder. I will keep this in mind.

Cheers
Hangbin


[PATCH 2/3] mac80211: Add support for ethtool gstats2 API.

2018-04-17 Thread greearb
From: Ben Greear 

This enables users to request fewer stats to be refreshed
in cases where firmware does not need to be probed.

Signed-off-by: Ben Greear 
---
 include/net/mac80211.h|  6 ++
 net/mac80211/driver-ops.h |  9 +++--
 net/mac80211/ethtool.c| 18 +-
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/net/mac80211.h b/include/net/mac80211.h
index d2279b2..4854f33 100644
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -3361,6 +3361,8 @@ enum ieee80211_reconfig_type {
  *
  * @get_et_stats:  Ethtool API to get a set of u64 stats.
  *
+ * @get_et_stats2:  Ethtool API to get a set of u64 stats, with flags.
+ *
  * @get_et_strings:  Ethtool API to get a set of strings to describe stats
  * and perhaps other supported types of ethtool data-sets.
  *
@@ -3692,6 +3694,10 @@ struct ieee80211_ops {
void(*get_et_stats)(struct ieee80211_hw *hw,
struct ieee80211_vif *vif,
struct ethtool_stats *stats, u64 *data);
+   void(*get_et_stats2)(struct ieee80211_hw *hw,
+struct ieee80211_vif *vif,
+struct ethtool_stats *stats, u64 *data,
+u32 flags);
void(*get_et_strings)(struct ieee80211_hw *hw,
  struct ieee80211_vif *vif,
  u32 sset, u8 *data);
diff --git a/net/mac80211/driver-ops.h b/net/mac80211/driver-ops.h
index 4d82fe7..519d2db 100644
--- a/net/mac80211/driver-ops.h
+++ b/net/mac80211/driver-ops.h
@@ -58,10 +58,15 @@ static inline void drv_get_et_strings(struct ieee80211_sub_if_data *sdata,
 
 static inline void drv_get_et_stats(struct ieee80211_sub_if_data *sdata,
struct ethtool_stats *stats,
-   u64 *data)
+   u64 *data, u32 flags)
 {
struct ieee80211_local *local = sdata->local;
-   if (local->ops->get_et_stats) {
+   if (local->ops->get_et_stats2) {
+   trace_drv_get_et_stats(local);
+   local->ops->get_et_stats2(&local->hw, &sdata->vif, stats, data,
+ flags);
+   trace_drv_return_void(local);
+   } else if (local->ops->get_et_stats) {
trace_drv_get_et_stats(local);
local->ops->get_et_stats(&local->hw, &sdata->vif, stats, data);
trace_drv_return_void(local);
diff --git a/net/mac80211/ethtool.c b/net/mac80211/ethtool.c
index 9cc986d..b67520e 100644
--- a/net/mac80211/ethtool.c
+++ b/net/mac80211/ethtool.c
@@ -61,9 +61,9 @@ static int ieee80211_get_sset_count(struct net_device *dev, int sset)
return rv;
 }
 
-static void ieee80211_get_stats(struct net_device *dev,
-   struct ethtool_stats *stats,
-   u64 *data)
+static void ieee80211_get_stats2(struct net_device *dev,
+struct ethtool_stats *stats,
+u64 *data, u32 flags)
 {
struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev);
struct ieee80211_chanctx_conf *chanctx_conf;
@@ -199,7 +199,14 @@ static void ieee80211_get_stats(struct net_device *dev,
if (WARN_ON(i != STA_STATS_LEN))
return;
 
-   drv_get_et_stats(sdata, stats, &(data[STA_STATS_LEN]));
+   drv_get_et_stats(sdata, stats, &data[STA_STATS_LEN], flags);
+}
+
+static void ieee80211_get_stats(struct net_device *dev,
+   struct ethtool_stats *stats,
+   u64 *data)
+{
+   ieee80211_get_stats2(dev, stats, data, 0);
 }
 
 static void ieee80211_get_strings(struct net_device *dev, u32 sset, u8 *data)
@@ -211,7 +218,7 @@ static void ieee80211_get_strings(struct net_device *dev, u32 sset, u8 *data)
sz_sta_stats = sizeof(ieee80211_gstrings_sta_stats);
memcpy(data, ieee80211_gstrings_sta_stats, sz_sta_stats);
}
-   drv_get_et_strings(sdata, sset, &(data[sz_sta_stats]));
+   drv_get_et_strings(sdata, sset, &data[sz_sta_stats]);
 }
 
 static int ieee80211_get_regs_len(struct net_device *dev)
@@ -238,5 +245,6 @@ const struct ethtool_ops ieee80211_ethtool_ops = {
.set_ringparam = ieee80211_set_ringparam,
.get_strings = ieee80211_get_strings,
.get_ethtool_stats = ieee80211_get_stats,
+   .get_ethtool_stats2 = ieee80211_get_stats2,
.get_sset_count = ieee80211_get_sset_count,
 };
-- 
2.4.11
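
The callback selection in drv_get_et_stats() above can be sketched in Python as follows. This is a simplified model, not mac80211 code: the ops table is a plain dict, and the two sample callbacks and their return strings are invented for illustration.

```python
ETHTOOL_GS2_SKIP_NONE = 0
ETHTOOL_GS2_SKIP_FW = 1 << 0


def drv_get_et_stats(ops, flags):
    """Prefer the flags-aware callback; fall back to the legacy one."""
    if ops.get("get_et_stats2") is not None:
        return ops["get_et_stats2"](flags)
    if ops.get("get_et_stats") is not None:
        return ops["get_et_stats"]()     # legacy callback never sees flags
    return None


def legacy_stats():
    # Hypothetical driver callback without flag support.
    return "refreshed from firmware"


def flagged_stats(flags):
    # Hypothetical driver callback that honours ETHTOOL_GS2_SKIP_FW.
    if flags & ETHTOOL_GS2_SKIP_FW:
        return "cached"
    return "refreshed from firmware"


old_driver = {"get_et_stats": legacy_stats}
new_driver = {"get_et_stats": legacy_stats, "get_et_stats2": flagged_stats}
```

A driver that only implements the legacy callback keeps refreshing everything regardless of the flags, so the skip request is advisory rather than guaranteed.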



[PATCH] ethtool: Support ETHTOOL_GSTATS2 API.

2018-04-17 Thread greearb
From: Ben Greear 

This allows users to specify flags to the get-stats
API, potentially saving expensive stats queries when
they are not desired.

Signed-off-by: Ben Greear 
---
 ethtool-copy.h |  9 +
 ethtool.c  | 25 -
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 8cc61e9..11ce456 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -1390,11 +1390,20 @@ enum ethtool_fec_config_bits {
 #define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 #define ETHTOOL_GFECPARAM  0x0050 /* Get FEC settings */
 #define ETHTOOL_SFECPARAM  0x0051 /* Set FEC settings */
+#define ETHTOOL_GSTATS2    0x0052 /* get NIC-specific statistics
+   * with ability to specify flags.
+   * See ETHTOOL_GS2* below.
+   */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
 #define SPARC_ETH_SSET ETHTOOL_SSET
 
+/* GSTATS2 flags */
+#define ETHTOOL_GS2_SKIP_NONE (0)/* default is to update all stats */
+#define ETHTOOL_GS2_SKIP_FW   (1<<0) /* Skip reading stats that probe firmware,
+ * and thus are slow/expensive.
+ */
 /* Link mode bit indices */
 enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_10baseT_Half_BIT  = 0,
diff --git a/ethtool.c b/ethtool.c
index 3289e0f..6a11077 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -3440,14 +3440,14 @@ static int do_phys_id(struct cmd_context *ctx)
 }
 
 static int do_gstats(struct cmd_context *ctx, int cmd, int stringset,
-   const char *name)
+const char *name, u32 flags)
 {
struct ethtool_gstrings *strings;
struct ethtool_stats *stats;
unsigned int n_stats, sz_stats, i;
int err;
 
-   if (ctx->argc != 0)
+   if ((ctx->argc != 0) && (flags == ETHTOOL_GS2_SKIP_NONE))
exit_bad_args();
 
strings = get_stringset(ctx, stringset,
@@ -3475,7 +3475,10 @@ static int do_gstats(struct cmd_context *ctx, int cmd, int stringset,
}
 
stats->cmd = cmd;
-   stats->n_stats = n_stats;
+   if (cmd == ETHTOOL_GSTATS2)
+   stats->n_stats = flags;
+   else
+   stats->n_stats = n_stats;
err = send_ioctl(ctx, stats);
if (err < 0) {
perror("Cannot get stats information");
@@ -3500,12 +3503,22 @@ static int do_gstats(struct cmd_context *ctx, int cmd, int stringset,
 
 static int do_gnicstats(struct cmd_context *ctx)
 {
-   return do_gstats(ctx, ETHTOOL_GSTATS, ETH_SS_STATS, "NIC");
+   return do_gstats(ctx, ETHTOOL_GSTATS, ETH_SS_STATS, "NIC",
+                    ETHTOOL_GS2_SKIP_NONE);
+}
+
+static int do_gnicstats2(struct cmd_context *ctx)
+{
+   u32 flags = ETHTOOL_GS2_SKIP_NONE;
+   if (ctx->argc >= 1)
+   if (strcmp(ctx->argp[0], "nofw") == 0)
+   flags |= ETHTOOL_GS2_SKIP_FW;
+   return do_gstats(ctx, ETHTOOL_GSTATS2, ETH_SS_STATS, "NIC", flags);
 }
 
 static int do_gphystats(struct cmd_context *ctx)
 {
-   return do_gstats(ctx, ETHTOOL_GPHYSTATS, ETH_SS_PHY_STATS, "PHY");
+   return do_gstats(ctx, ETHTOOL_GPHYSTATS, ETH_SS_PHY_STATS, "PHY",
+ETHTOOL_GS2_SKIP_NONE);
 }
 
 static int do_srxntuple(struct cmd_context *ctx,
@@ -5118,6 +5131,8 @@ static const struct option {
{ "-t|--test", 1, do_test, "Execute adapter self test",
  "   [ online | offline | external_lb ]\n" },
{ "-S|--statistics", 1, do_gnicstats, "Show adapter statistics" },
+   { "-2|--S2", 1, do_gnicstats2, "Show adapter statistics with flags",
+ "   [ nofw ]\n" },
{ "--phy-statistics", 1, do_gphystats,
  "Show phy statistics" },
{ "-n|-u|--show-nfc|--show-ntuple", 1, do_grxclass,
-- 
2.4.11
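
A compact Python model of the user-space side of this patch, for readers who want to see the flag handling in isolation. It is a sketch under assumptions: the function names and the dict standing in for struct ethtool_stats are invented, and only the behavior visible in the diff is modelled (the "nofw" argument sets ETHTOOL_GS2_SKIP_FW, and for ETHTOOL_GSTATS2 the flags travel in the n_stats field of the request).

```python
ETHTOOL_GSTATS = 0x1d            # existing get-stats command
ETHTOOL_GSTATS2 = 0x52           # value taken from the patch
ETHTOOL_GS2_SKIP_NONE = 0
ETHTOOL_GS2_SKIP_FW = 1 << 0


def gstats2_flags(args):
    """Translate the optional command-line argument into flags."""
    flags = ETHTOOL_GS2_SKIP_NONE
    if len(args) >= 1 and args[0] == "nofw":
        flags |= ETHTOOL_GS2_SKIP_FW
    return flags


def build_request(cmd, n_stats, flags):
    """Fill the request the way do_gstats() does in the patch."""
    if cmd == ETHTOOL_GSTATS2:
        return {"cmd": cmd, "n_stats": flags}    # flags ride in n_stats
    return {"cmd": cmd, "n_stats": n_stats}


skip_fw = build_request(ETHTOOL_GSTATS2, 40, gstats2_flags(["nofw"]))
legacy = build_request(ETHTOOL_GSTATS, 40, gstats2_flags([]))
```

Reusing n_stats as the flags carrier avoids a new structure, at the cost of giving the field two meanings depending on the command.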



[PATCH 3/3] ath10k: Support ethtool gstats2 API.

2018-04-17 Thread greearb
From: Ben Greear 

Skip a firmware stats update when calling
code indicates the stats refresh is not needed.

Signed-off-by: Ben Greear 
---
 drivers/net/wireless/ath/ath10k/debug.c | 18 +++---
 drivers/net/wireless/ath/ath10k/debug.h |  4 
 drivers/net/wireless/ath/ath10k/mac.c   |  1 +
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/debug.c b/drivers/net/wireless/ath/ath10k/debug.c
index bac832c..d559a3f 100644
--- a/drivers/net/wireless/ath/ath10k/debug.c
+++ b/drivers/net/wireless/ath/ath10k/debug.c
@@ -1159,9 +1159,10 @@ int ath10k_debug_get_et_sset_count(struct ieee80211_hw *hw,
return 0;
 }
 
-void ath10k_debug_get_et_stats(struct ieee80211_hw *hw,
-  struct ieee80211_vif *vif,
-  struct ethtool_stats *stats, u64 *data)
+void ath10k_debug_get_et_stats2(struct ieee80211_hw *hw,
+   struct ieee80211_vif *vif,
+   struct ethtool_stats *stats, u64 *data,
+   u32 flags)
 {
struct ath10k *ar = hw->priv;
static const struct ath10k_fw_stats_pdev zero_stats = {};
@@ -1170,6 +1171,9 @@ void ath10k_debug_get_et_stats(struct ieee80211_hw *hw,
 
mutex_lock(&ar->conf_mutex);
 
+   if (flags & ETHTOOL_GS2_SKIP_FW)
+   goto skip_query_fw_stats;
+
if (ar->state == ATH10K_STATE_ON) {
ret = ath10k_debug_fw_stats_request(ar);
if (ret) {
@@ -1180,6 +1184,7 @@ void ath10k_debug_get_et_stats(struct ieee80211_hw *hw,
}
}
 
+skip_query_fw_stats:
pdev_stats = list_first_entry_or_null(&ar->debug.fw_stats.pdevs,
  struct ath10k_fw_stats_pdev,
  list);
@@ -1244,6 +1249,13 @@ void ath10k_debug_get_et_stats(struct ieee80211_hw *hw,
WARN_ON(i != ATH10K_SSTATS_LEN);
 }
 
+void ath10k_debug_get_et_stats(struct ieee80211_hw *hw,
+  struct ieee80211_vif *vif,
+  struct ethtool_stats *stats, u64 *data)
+{
+   ath10k_debug_get_et_stats2(hw, vif, stats, data, 0);
+}
+
 static const struct file_operations fops_fw_dbglog = {
.read = ath10k_read_fw_dbglog,
.write = ath10k_write_fw_dbglog,
diff --git a/drivers/net/wireless/ath/ath10k/debug.h b/drivers/net/wireless/ath/ath10k/debug.h
index 0afca5c..595d964 100644
--- a/drivers/net/wireless/ath/ath10k/debug.h
+++ b/drivers/net/wireless/ath/ath10k/debug.h
@@ -117,6 +117,10 @@ int ath10k_debug_get_et_sset_count(struct ieee80211_hw *hw,
 void ath10k_debug_get_et_stats(struct ieee80211_hw *hw,
   struct ieee80211_vif *vif,
   struct ethtool_stats *stats, u64 *data);
+void ath10k_debug_get_et_stats2(struct ieee80211_hw *hw,
+   struct ieee80211_vif *vif,
+   struct ethtool_stats *stats, u64 *data,
+   u32 flags);
 
 static inline u64 ath10k_debug_get_fw_dbglog_mask(struct ath10k *ar)
 {
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index bf05a36..27b793c 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -7734,6 +7734,7 @@ static const struct ieee80211_ops ath10k_ops = {
.ampdu_action   = ath10k_ampdu_action,
.get_et_sset_count  = ath10k_debug_get_et_sset_count,
.get_et_stats   = ath10k_debug_get_et_stats,
+   .get_et_stats2  = ath10k_debug_get_et_stats2,
.get_et_strings = ath10k_debug_get_et_strings,
.add_chanctx= ath10k_mac_op_add_chanctx,
.remove_chanctx = ath10k_mac_op_remove_chanctx,
-- 
2.4.11



[PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.

2018-04-17 Thread greearb
From: Ben Greear 

This is similar to ETHTOOL_GSTATS, but it allows you to specify
flags.  These flags can be used by the driver to decrease the
amount of stats refreshed.  In particular, this helps with ath10k
since getting the firmware stats can be slow.

Signed-off-by: Ben Greear 
---
 include/linux/ethtool.h  | 12 
 include/uapi/linux/ethtool.h | 10 ++
 net/core/ethtool.c   | 40 +++-
 3 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe4181..a4aa11f 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -243,6 +243,15 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
  * @get_ethtool_stats: Return extended statistics about the device.
  * This is only useful if the device maintains statistics not
  * included in  rtnl_link_stats64.
+ * @get_ethtool_stats2: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in  rtnl_link_stats64.
+ *  Takes a flags argument:  0 means all (same as get_ethtool_stats),
+ *  0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
+ *  Other flags are reserved for now.
+ *  Same number of stats will be returned, but some of them might
+ *  not be as accurate/refreshed.  This is to allow not querying
+ *  firmware or other expensive-to-read stats, for instance.
  * @begin: Function to be called before any other operation.  Returns a
  * negative error code or zero.
  * @complete: Function to be called after any other operation except
@@ -355,6 +364,9 @@ struct ethtool_ops {
int (*set_phys_id)(struct net_device *, enum ethtool_phys_id_state);
void(*get_ethtool_stats)(struct net_device *,
 struct ethtool_stats *, u64 *);
+   void(*get_ethtool_stats2)(struct net_device *dev,
+ struct ethtool_stats *gstats, u64 *data,
+ u32 flags);
int (*begin)(struct net_device *);
void(*complete)(struct net_device *);
u32 (*get_priv_flags)(struct net_device *);
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 4ca65b5..1c74f3e 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1396,11 +1396,21 @@ enum ethtool_fec_config_bits {
 #define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 #define ETHTOOL_GFECPARAM  0x0050 /* Get FEC settings */
 #define ETHTOOL_SFECPARAM  0x0051 /* Set FEC settings */
+#define ETHTOOL_GSTATS2    0x0052 /* get NIC-specific statistics
+   * with ability to specify flags.
+   * See ETHTOOL_GS2* below.
+   */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
 #define SPARC_ETH_SSET ETHTOOL_SSET
 
+/* GSTATS2 flags */
+#define ETHTOOL_GS2_SKIP_NONE (0)/* default is to update all stats */
+#define ETHTOOL_GS2_SKIP_FW   (1<<0) /* Skip reading stats that probe firmware,
+ * and thus are slow/expensive.
+ */
+
 /* Link mode bit indices */
 enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_10baseT_Half_BIT  = 0,
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6..6ec3413 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1952,16 +1952,14 @@ static int ethtool_phys_id(struct net_device *dev, void __user *useraddr)
return rc;
 }
 
-static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
+static int _ethtool_get_stats(struct net_device *dev, void __user *useraddr,
+ u32 flags)
 {
struct ethtool_stats stats;
const struct ethtool_ops *ops = dev->ethtool_ops;
u64 *data;
int ret, n_stats;
 
-   if (!ops->get_ethtool_stats || !ops->get_sset_count)
-   return -EOPNOTSUPP;
-
n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
if (n_stats < 0)
return n_stats;
@@ -1976,7 +1974,10 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
if (n_stats && !data)
return -ENOMEM;
 
-   ops->get_ethtool_stats(dev, &stats, data);
+   if (flags != ETHTOOL_GS2_SKIP_NONE)
+   ops->get_ethtool_stats2(dev, &stats, data, flags);
+   else
+   ops->get_ethtool_stats(dev, &stats, data);
 
ret = -EFAULT;
if (copy_to_user(useraddr, &stats, sizeof(stats)))
@@ -1991,6 +1992,31 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
return ret;
 }
 
+static int ethtool_get_stats(struct 

Re: [PATCH v3 net,stable] tun: fix vlan packet truncation

2018-04-17 Thread Jason Wang



On 2018-04-18 04:46, Bjørn Mork wrote:

Bogus trimming in tun_net_xmit() causes truncated vlan packets.

skb->len is correct whether or not skb_vlan_tag_present() is true. There
is no more reason to adjust the skb length on xmit in this driver than
any other driver. tun_put_user() adds 4 bytes to the total for tagged
packets because it transmits the tag inline to userspace.  This is
similar to a nic transmitting the tag inline on the wire.

Reproducing the bug by sending any tagged packet through back-to-back
connected tap interfaces:

  socat TUN,tun-type=tap,iff-up,tun-name=in TUN,tun-type=tap,iff-up,tun-name=out &
  ip link add link in name in.20 type vlan id 20
  ip addr add 10.9.9.9/24 dev in.20
  ip link set in.20 up
  tshark -nxxi in -f arp -c1 2>/dev/null &
  tshark -nxxi out -f arp -c1 2>/dev/null &
  ping -c 1 10.9.9.5 >/dev/null 2>&1

The output from the 'in' and 'out' interfaces are different when the
bug is present:

  Capturing on 'in'
  0000  ff ff ff ff ff ff 76 cf 76 37 d5 0a 81 00 00 14   ..v.v7..
  0010  08 06 00 01 08 00 06 04 00 01 76 cf 76 37 d5 0a   ..v.v7..
  0020  0a 09 09 09 00 00 00 00 00 00 0a 09 09 05 ..

  Capturing on 'out'
  0000  ff ff ff ff ff ff 76 cf 76 37 d5 0a 81 00 00 14   ..v.v7..
  0010  08 06 00 01 08 00 06 04 00 01 76 cf 76 37 d5 0a   ..v.v7..
  0020  0a 09 09 09 00 00 00 00 00 00 ..

Fixes: aff3d70a07ff ("tun: allow to attach ebpf socket filter")
Cc: Jason Wang 
Signed-off-by: Bjørn Mork 
---
v2:
  - Must still call pskb_trim() after running the filter, as pointed out by
Jason and David. But no need to check if len < 0 anymore, since
    run_ebpf_filter() returns unsigned ints.

v3:
  - actually change the len <= 0 test as mentioned above


  drivers/net/tun.c | 7 +--
  1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 28583aa0c17d..ef33950a45d9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1102,12 +1102,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
goto drop;
  
len = run_ebpf_filter(tun, skb, len);
-
-   /* Trim extra bytes since we may insert vlan proto & TCI
-* in tun_put_user().
-*/
-   len -= skb_vlan_tag_present(skb) ? sizeof(struct veth) : 0;
-   if (len <= 0 || pskb_trim(skb, len))
+   if (len == 0 || pskb_trim(skb, len))
goto drop;
  
if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))

Acked-by: Jason Wang 

Thanks
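
The 4-byte difference between the two captures follows directly from the removed trimming. Here is a minimal Python model of the byte counts; it is a sketch with invented names, and it assumes the ARP request skb is 42 bytes before the tag is inserted (14-byte Ethernet header plus 28-byte ARP payload) with the VLAN tag carried out of band.

```python
VLAN_TAG_LEN = 4   # 2-byte protocol + 2-byte TCI, what sizeof(struct veth) covered


def bytes_to_userspace(skb_len, vlan_tag_present, trim_for_vlan):
    """Number of bytes the reader of the tun fd receives for one packet."""
    length = skb_len
    if trim_for_vlan and vlan_tag_present:
        length -= VLAN_TAG_LEN     # the bogus trimming removed by this patch
    if vlan_tag_present:
        length += VLAN_TAG_LEN     # tun_put_user() writes the tag inline
    return length


arp_skb_len = 14 + 28
before_fix = bytes_to_userspace(arp_skb_len, True, trim_for_vlan=True)
after_fix = bytes_to_userspace(arp_skb_len, True, trim_for_vlan=False)
```

The model gives 42 bytes before the fix and 46 after it, matching the lengths of the 'out' and 'in' hex dumps in the commit message.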


Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support

2018-04-17 Thread Marcelo Ricardo Leitner
On Tue, Apr 17, 2018 at 04:35:18PM -0400, Vlad Yasevich wrote:
> On 04/02/2018 10:47 AM, Marcelo Ricardo Leitner wrote:
> > On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
> >> Now that we have SCTP offload capabilities in the kernel, we can add
> >> them to virtio as well.  First step is SCTP checksum.
> >
> > Thanks.
> >
> >> As for GSO, the way sctp GSO is currently implemented buys us nothing
> >> in added support to virtio.  To add true GSO, would require a lot of
> >> re-work inside of SCTP and would require extensions to the virtio
> >> net header to carry extra sctp data.
> >
> > Can you please elaborate more on this? Is this because SCTP GSO relies
> > on the gso skb format for knowing how to segment it instead of having
> > a list of sizes?
> >
>
> It's mainly because all the true segmentation, placing data into chunks,
> has already happened.  All that GSO does is allow for a higher bundling
> rate between VMs. If that is all SCTP GSO is ever going to do, that's fine,
> but the goal is to do real GSO eventually and potentially reduce the
> amount of memory copying we are doing.
> If we do that, any current attempt at GSO in virtio would have to be
> deprecated and we'd need GSO2 or something like that.
>
> This is why, after doing the GSO support, I decided not to include it.

Gotcha. I don't think it will ever go further than what we have now.
Placing data into chunks later is not really feasible/wanted,
especially now with stream schedulers and idata chunks. Doesn't seem
worth the hassle... we would have to support things like, "segment
half of this message plus a third of this other one from that other
stream." (in case it's round robin).

  Marcelo


[Regression] net/phy/micrel.c v4.9.94

2018-04-17 Thread Chris Ruehl

Hello,

I would like to give you a heads-up about a regression introduced in 4.9.94.
A commit in that release leads to a kernel oops and makes the network unusable
on my customized MX6DL board.

It looks like a race condition: resume is called on startup while the PHY is
not yet initialized.

[7.313366] Unable to handle kernel NULL pointer dereference at virtual address 0008
[7.321602] pgd = ecfc
[7.324950] [0008] *pgd=8e901831
[7.328652] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[7.334061] Modules linked in:
[7.337146] CPU: 0 PID: 269 Comm: ip Not tainted 4.9.94 #11
[7.342725] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[7.349259] task: ece59900 task.stack: ec9ea000
[7.353809] PC is at kszphy_config_reset+0x14/0x148
[7.358703] LR is at kszphy_resume+0x1c/0x6c
[7.362983] pc : []lr : []psr: 60030013
[7.362983] sp : ec9eb918  ip : ec9eb938  fp : ec9eb934
[7.374467] r10: 0007  r9 :   r8 : ee693c00
[7.379700] r7 :   r6 :   r5 :   r4 : ee6fc000
[7.386234] r3 : 0001  r2 :   r1 : 0110  r0 : ee6fc000
[7.392768] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[7.399911] Control: 10c5387d  Table: 3cfc004a  DAC: 0051
[7.405663] Process ip (pid: 269, stack limit = 0xec9ea210)
[7.411244] Stack: (0xec9eb918 to 0xec9ec000)
[7.415611] b900: ee6fc000
[7.423800] b920: ee031000  ec9eb94c ec9eb938 c056a4fc c056a244 ee6fc000
[7.431988] b940: ec9eb97c ec9eb950 c05681e4 c056a4ec 0007 ee6fc000 ee6fc000 c056ce7c
[7.440174] b960: c056ce7c ee031000 ee55c818  ec9eb99c ec9eb980 c05683cc c0568134
[7.448364] b980: 0007 ec9eba10 ee6fc000 0007 ec9eb9c4 ec9eb9a0 c0568450 c05683bc
[7.456550] b9a0: 0007 0005 ee031000 ec9eb9d3 0200 c1508da4 ec9eba6c ec9eb9c8
[7.464736] b9c0: c056ce24 c0568410 0005 ee03162c 3201 30383831 652e3030 72656874
[7.472921] b9e0: 2d74656e 0031 03e8 00c8 c01732ec c0172adc 03e8 00c8
[7.481109] ba00: 024000c0 ee55c000 c150e454 024000c0 38383132 2e303030 65687465 74656e72
[7.489296] ba20: 303a312d ee35 ec9eba6c ec9eba38 c0224b50 c0175eb8 ec9eba6c c056eb44
[7.497482] ba40: c056bbe0 f0c16000 ee031000 ee55c000 0200 f0c16000 ee031000 ee55c000
[7.505667] ba60: ec9ebaa4 ec9eba70 c056eba4 c056cd1c 0001 ee03162c ec9ebaa4 ee031000
[7.513855] ba80:  c09566ec ee031030  ec9ccd10 ecb39900 ec9ebacc ec9ebaa8
[7.522043] baa0: c06ad6e0 c056e92c ec9ebacc ee031000 ee031000 0001 1003 1002
[7.530229] bac0: ec9ebaf4 ec9ebad0 c06ad99c c06ad63c 1002 ee031000 ee031148 1002
[7.538414] bae0:   ec9ebb1c ec9ebaf8 c06ada6c c06ad90c 1002
[7.546601] bb00: ee031000 ec9ebc28  c09566ec ec9ebb94 ec9ebb20 c06c1034 c06ada58
[7.554787] bb20: c0c50df8 2e184000 ec9ebb44 ec9ebb38 c0173528 c0173320 ec9ebbd4 c0e82b6c
[7.562972] bb40:  ece59dc8 ebb4e9d0 c9eae3f3 ece59900 0003 ece59900 005e
[7.571157] bb60: c14e30ec c0d1e51c ece59900  ee031000 ec9ccd00
[7.579346] bb80: ec9ebb98  ec9ebd04 ec9ebb98 c06c30cc c06c0d68 ec9ebbc4
[7.587531] bba0: c01758bc ecb39900 c09eb3a0 ec9ccd20  ec9ccd10 0001 ece59900
[7.595715] bbc0: c01e0e64   0001 ec9ebbfc
[7.603900] bbe0:    ff00 ec9ebc0c ec9ebc00 c0173528 c0173320
[7.612084] bc00: ec9ebc9c ec9ebc10 c01e0e64 c0173520  000e ece59900 0096
[7.620269] bc20: c14e30ec c0d1e51c
[7.628452] bc40:
[7.636636] bc60:
[7.644819] bc80:
[7.653003] bca0:
[7.661186] bcc0:       c06d3870
[7.669372] bce0: ec9ccd00 ecb39900 c15226e4   ecb39900 ec9ebd44 ec9ebd08
[7.677556] bd00: c06c343c c06c2bdc c0869c2c c0173520 0001  c06c06e4
[7.685741] bd20:  ec9ccd00 c06c32b8 ecb39900 ecb39900  ec9ebd64 ec9ebd48
[7.693926] bd40: c06d86cc c06c32c4  ecb39900 0020 ec970400 ec9ebd7c ec9ebd68
[7.702110] bd60: c06c06f4 c06d8630 c06c06c4 ee15f400 ec9ebdac ec9ebd80 c06d802c c06c06d0
[7.710294] bd80: ec9ebf50 7fff ec970400 ec9ebf48 ec970400  0020
[7.718477] bda0: ec9ebe0c ec9ebdb0 c06d84e8 c06d7ec8

Re:Re: [PATCH net] net: Fix one possible memleak in ip_setup_cork

2018-04-17 Thread Gao Feng
At 2018-04-17 05:18:25, "Eric Dumazet"  wrote:
>
>
>On 04/16/2018 09:58 AM, David Miller wrote:
>> From: gfree.w...@vip.163.com
>> Date: Mon, 16 Apr 2018 10:16:45 +0800
>> 
>>> From: Gao Feng 
>>>
>>> It would allocate memory in this function when the cork->opt is NULL. But
>>> the memory isn't freed if failed in the latter rt check, and return error
>>> directly. It causes the memleak if its caller is ip_make_skb which also
>>> doesn't free the cork->opt when meet a error.
>>>
>>> Now move the rt check ahead to avoid the memleak.
>>>
>>> Signed-off-by: Gao Feng 
>> 
>> Looks good, applied and queued up for -stable.
>> 
>> I guess in the other code paths, ip_flush_pending_frames() or similar
>> would clean up the in-sock cork information.
>> 
>
>I am not sure ip_make_skb() can be called with a NULL rt.
>
>Patch makes no harm, but does not seem to fix a bug.
>

Thanks Eric.

I just looked through all current callers of ip_make_skb and ip_append_data;
they check that the rt is valid beforehand. So the current code indeed won't
pass a NULL rt to ip_setup_cork.

So this patch is just an enhancement, not a fix.
As a programming rule, a function should free the memory it allocated itself
when it fails.
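
The rule above can be illustrated with a small user-space sketch. This is not the kernel's ip_setup_cork(); struct cork, setup_cork() and the route pointer are hypothetical stand-ins. The function checks its inputs before it allocates, so an early error return leaves nothing behind for the caller to free.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the cork state; not the kernel structure. */
struct cork {
    void *opt;          /* lazily allocated option buffer */
    const void *route;  /* route the cork will use */
};

/*
 * Validate first, allocate second: if the route check fails we return
 * before any memory has been allocated, so nothing can leak on that path.
 */
static int setup_cork(struct cork *cork, const void *route, size_t opt_len)
{
    if (!route)
        return -EFAULT;

    if (!cork->opt) {
        cork->opt = malloc(opt_len);
        if (!cork->opt)
            return -ENOBUFS;
    }

    cork->route = route;
    return 0;
}
```

Calling setup_cork() with a NULL route returns an error and leaves cork->opt untouched, which is the property the reordering is after.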

Best Regards
Feng


Re: [PATCH 04/10] net: ax88796: Add block_input/output hooks to ax_plat_data

2018-04-17 Thread Finn Thain
On Wed, 18 Apr 2018, Michael Schmitz wrote:

> I think this is a false positive - we're encouraged to provide the
> full parameter list for functions, so the struct sk_buff * can't be
> avoided.
> 

I don't think it's a false positive. I think ax88796.h would need to 
#include .

You may be able to get away with a forward declaration, as in,
struct skbuff;
but I'm not sure about that. I would have to build mach-anubis.c to check.
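
The forward-declaration idea can be checked in isolation. A minimal sketch in plain C, where struct packet and struct plat_hooks are hypothetical stand-ins for the skb type and for ax_plat_data: declaring the struct tag at file scope before the prototype gives the parameter type file scope, which is what avoids the "declared inside parameter list" warning. The full definition is only needed by code that dereferences the pointer.

```c
#include <stddef.h>

/* Forward declaration: introduces the tag at file scope as an incomplete type. */
struct packet;

/* Hooks structure that only passes the pointer through (hypothetical names). */
struct plat_hooks {
    void (*block_input)(int count, struct packet *pkt, int ring_offset);
};

/* The full definition is needed only where members are accessed. */
struct packet {
    unsigned char data[64];
    size_t len;
};

/* Example hook that actually touches the structure. */
static void demo_block_input(int count, struct packet *pkt, int ring_offset)
{
    (void)ring_offset;
    pkt->len = (size_t)count;
}
```

Whether a forward declaration is enough for the real header still depends on what else the header does with the type, so building the affected board file remains the real test.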

But why do you need to pass an skbuff pointer here? xs100_block_input() 
only accesses skb->data.

BTW, this patch has an unrelated whitespace change.

-- 

> Cheers,
> 
>   Michael
> 
> 
> On Wed, Apr 18, 2018 at 6:46 AM, kbuild test robot <l...@intel.com> wrote:
> > Hi Michael,
> >
> > I love your patch! Perhaps something to improve:
> >
> > [auto build test WARNING on v4.16]
> > [cannot apply to net-next/master net/master v4.17-rc1 next-20180417]
> > [if your patch is applied to the wrong git tree, please drop us a note to 
> > help improve the system]
> >
> > url:
> > https://github.com/0day-ci/linux/commits/Michael-Schmitz/New-network-driver-for-Amiga-X-Surf-100-m68k/20180417-141150
> > config: arm-samsung (attached as .config)
> > compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
> > reproduce:
> > wget 
> > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> > ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # save the attached .config to linux build tree
> > make.cross ARCH=arm
> >
> > All warnings (new ones prefixed by >>):
> >
> >In file included from arch/arm/mach-s3c24xx/mach-anubis.c:42:0:
> >>> include/net/ax88796.h:35:11: warning: 'struct sk_buff' declared inside 
> >>> parameter list will not be visible outside of this definition or 
> >>> declaration
> >struct sk_buff *skb, int ring_offset);
> >   ^~~
> >
> > vim +35 include/net/ax88796.h
> >
> > 20
> > 21  struct ax_plat_data {
> > 22  unsigned int flags;
> > 23  unsigned char wordlength; /* 1 or 2 */
> > 24  unsigned char dcr_val;    /* default value for DCR */
> > 25  unsigned char rcr_val;    /* default value for RCR */
> > 26  unsigned char gpoc_val;   /* default value for GPOC */
> > 27  u32 *reg_offsets;   /* register offsets */
> > 28  u8  *mac_addr;  /* MAC addr (only used when
> > 29 AXFLG_MAC_FROMPLATFORM 
> > is used */
> > 30
> > 31  /* uses default ax88796 buffer if set to NULL */
> > 32  void (*block_output)(struct net_device *dev, int count,
> > 33  const unsigned char *buf, int star_page);
> > 34  void (*block_input)(struct net_device *dev, int count,
> >   > 35  struct sk_buff *skb, int ring_offset);
> > 36  };
> > 37
> >
> > ---
> > 0-DAY kernel test infrastructureOpen Source Technology 
> > Center
> > https://lists.01.org/pipermail/kbuild-all   Intel 
> > Corporation


[PATCH v3] net: change the comment of dev_mc_init

2018-04-17 Thread sunlianwen
The comment for dev_mc_init() is wrong: it names dev_mc_flush
instead of dev_mc_init.

Signed-off-by: Lianwen Sun 

Re: [RFC v2] virtio: support packed ring

2018-04-17 Thread Tiwei Bie
On Tue, Apr 17, 2018 at 06:54:51PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2018 at 10:56:26PM +0800, Tiwei Bie wrote:
> > On Tue, Apr 17, 2018 at 05:04:59PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > > > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > [...]
> > > > > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, 
> > > > > > > > > > unsigned int head,
> > > > > > > > > > + void **ctx)
> > > > > > > > > > +{
> > > > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > > > +   unsigned int i, j;
> > > > > > > > > > +
> > > > > > > > > > +   /* Clear data ptr. */
> > > > > > > > > > +   vq->desc_state[head].data = NULL;
> > > > > > > > > > +
> > > > > > > > > > +   i = head;
> > > > > > > > > > +
> > > > > > > > > > +   for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > > > > +   desc->flags = 0x0;
> > > > > > > > > Looks like this is unnecessary.
> > > > > > > > It's safer to zero it. If we don't zero it, after we
> > > > > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > > > > this function, the desc is still available to the
> > > > > > > > device.
> > > > > > > 
> > > > > > > Well detach_unused_buf_packed() should be called after device is 
> > > > > > > stopped,
> > > > > > > otherwise even if you try to clear, there will still be a window 
> > > > > > > that device
> > > > > > > may use it.
> > > > > > 
> > > > > > This is not about whether the device has been stopped or
> > > > > > not. We don't have other places to re-initialize the ring
> > > > > > descriptors and wrap_counter. So they need to be set to
> > > > > > the correct values when doing detach_unused_buf.
> > > > > > 
> > > > > > Best regards,
> > > > > > Tiwei Bie
> > > > > 
> > > > > find vqs is the time to do it.
> > > > 
> > > > The .find_vqs() will call .setup_vq() which will eventually
> > > > call vring_create_virtqueue(). It's a different case. Here
> > > > we're talking about re-initializing the descs and updating
> > > > the wrap counter when detaching the unused descs (In this
> > > > case, split ring just needs to decrease vring.avail->idx).
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > 
> > > There's no requirement that  virtqueue_detach_unused_buf re-initializes
> > > the descs. It happens on cleanup path just before drivers delete the
> > > vqs.
> > 
> > Cool, I wasn't aware of it. I saw split ring decrease
> > vring.avail->idx after detaching an unused desc, so I
> > thought detaching unused desc also needs to make sure
> > that the ring state will be updated correspondingly.
> 
> 
> Hmm. You are right. Seems to be our console driver being out of spec.
> Will have to look at how to fix that :(
> 
> It was done here:
> 
> Commit b3258ff1d6086bd2b9eeb556844a868ad7d49bc8
> Author: Amit Shah 
> Date:   Wed Mar 16 19:12:10 2011 +0530
> 
> virtio: Decrement avail idx on buffer detach
> 
> When detaching a buffer from a vq, the avail.idx value should be
> decremented as well.
> 
> This was noticed by hot-unplugging a virtio console port and then
> plugging in a new one on the same number (re-using the vqs which were
> just 'disowned').  qemu reported
> 
>'Guest moved used index from 0 to 256'
> 
> when any IO was attempted on the new port.
> 
> CC: sta...@kernel.org
> Reported-by: juzhang 
> Signed-off-by: Amit Shah 
> Signed-off-by: Rusty Russell 
> 
> The spec is quite explicit though:
>   A driver MUST NOT decrement the available idx on a live virtqueue (ie. 
> there is no way to “unexpose”
>   buffers).
> 

Hmm.. Got it. Thanks!

Best regards,
Tiwei Bie


> 
> 
> 
> 
> > If there is no such requirement, do you think it's OK
> > to remove below two lines:
> > 
> > vq->avail_idx_shadow--;
> > vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> > 
> > from virtqueue_detach_unused_buf(), and we could have
> > one generic function to handle both rings:
> > 
> > void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
> > {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > unsigned int num, i;
> > void *buf;
> > 
> > START_USE(vq);
> > 
> > num = vq->packed ? vq->vring_packed.num : vq->vring.num;
> > 
> > for (i = 0; i < num; i++) {
> > if 

Re: [PATCH 04/10] net: ax88796: Add block_input/output hooks to ax_plat_data

2018-04-17 Thread Andrew Lunn
On Wed, Apr 18, 2018 at 12:53:21PM +1200, Michael Schmitz wrote:
> I think this is a false positive - we're encouraged to provide the
> full parameter list for functions, so the struct sk_buff * can't be
> avoided.

Hi Michael

How is  being included?

You probably want to build using the .config file and see.

Andrew


[PATCH net-next] ipv6: frags: fix a lockdep false positive

2018-04-17 Thread Eric Dumazet
lockdep does not know that the locks used by IPv4 defrag
and IPv6 reassembly units are of different classes.

It complains because of following chains :

1) sch_direct_xmit()(lock txq->_xmit_lock)
dev_hard_start_xmit()
 xmit_one()
  dev_queue_xmit_nit()
   packet_rcv_fanout()
ip_check_defrag()
 ip_defrag()
  spin_lock() (lock frag queue spinlock)

2) ip6_input_finish()
ipv6_frag_rcv()   (lock frag queue spinlock)
 ip6_frag_queue()
  icmpv6_param_prob() (lock txq->_xmit_lock at some point)

We could add lockdep annotations, but we also can make sure IPv6
calls icmpv6_param_prob() only after the release of the frag queue spinlock,
since this naturally makes frag queue spinlock a leaf in lock hierarchy.

Signed-off-by: Eric Dumazet 
---
Note to David: I chose net-next because of recent changes in net-next,
and because it is a false positive, but can respin for net tree
if you prefer. Thanks !

 net/ipv6/reassembly.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 
2cdf3dcf8c2c1f7629154f71a7bd199b2bf05fd1..b939b94e7e91ddae1552f0b6f6a54c42ab180615
 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -163,7 +163,8 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr 
*hdr, int iif)
 }
 
 static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
-  struct frag_hdr *fhdr, int nhoff)
+ struct frag_hdr *fhdr, int nhoff,
+ u32 *prob_offset)
 {
struct sk_buff *prev, *next;
struct net_device *dev;
@@ -179,11 +180,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
if ((unsigned int)end > IPV6_MAXPLEN) {
-   __IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- ((u8 *)&fhdr->frag_off -
-  skb_network_header(skb)));
+   *prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
return -1;
}
 
@@ -214,10 +211,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct 
sk_buff *skb,
/* RFC2460 says always send parameter problem in
 * this case. -DaveM
 */
-   __IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
-   IPSTATS_MIB_INHDRERRORS);
-   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
- offsetof(struct ipv6hdr, 
payload_len));
+   *prob_offset = offsetof(struct ipv6hdr, payload_len);
return -1;
}
if (end > fq->q.len) {
@@ -519,15 +513,22 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
+   u32 prob_offset = 0;
int ret;
 
spin_lock(&fq->q.lock);
 
fq->iif = iif;
-   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff);
+   ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff,
+&prob_offset);
 
spin_unlock(&fq->q.lock);
inet_frag_put(&fq->q);
+   if (prob_offset) {
+   __IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
+   IPSTATS_MIB_INHDRERRORS);
+   icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, prob_offset);
+   }
return ret;
}
 
-- 
2.17.0.484.g0c8726318c-goog
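
The deferral pattern used by the patch above can be sketched in user space. This is a single-threaded toy under stated assumptions, not the kernel code: struct queue, queue_lock(), report_problem() and the offset value are all hypothetical. The locked helper only records what went wrong, and the caller reports it after dropping the lock, so the reporting path never nests inside the queue lock.

```c
#include <stdbool.h>

/*
 * Toy single-threaded "lock" so the sketch can observe ordering; real code
 * would use a spinlock or mutex. All names here are hypothetical.
 */
struct queue {
    bool locked;
    unsigned int max_len;
};

static int reports_sent;        /* calls to report_problem() */
static int reports_under_lock;  /* calls made while the queue was locked */

static void queue_lock(struct queue *q)   { q->locked = true; }
static void queue_unlock(struct queue *q) { q->locked = false; }

/* Stand-in for a call that may take other locks internally. */
static void report_problem(const struct queue *q, unsigned int offset)
{
    (void)offset;
    reports_sent++;
    if (q->locked)
        reports_under_lock++;
}

/* Runs with the queue locked: records the offset, does not report. */
static int queue_add_locked(struct queue *q, unsigned int end,
                            unsigned int *prob_offset)
{
    if (end > q->max_len) {
        *prob_offset = 4;   /* arbitrary non-zero value; 0 means nothing to report */
        return -1;
    }
    return 0;
}

static int queue_add(struct queue *q, unsigned int end)
{
    unsigned int prob_offset = 0;
    int ret;

    queue_lock(q);
    ret = queue_add_locked(q, end, &prob_offset);
    queue_unlock(q);

    /* The lock is already dropped, so the report cannot nest inside it. */
    if (prob_offset)
        report_problem(q, prob_offset);
    return ret;
}
```

The sketch relies on 0 never being a valid offset to report, the same convention the patch uses for prob_offset.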



Re: [PATCH bpf-next v3 2/8] bpf: add documentation for eBPF helpers (01-11)

2018-04-17 Thread Alexei Starovoitov
On Tue, Apr 17, 2018 at 03:34:32PM +0100, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Alexei:
> 
> - bpf_map_lookup_elem()
> - bpf_map_update_elem()
> - bpf_map_delete_elem()
> - bpf_probe_read()
> - bpf_ktime_get_ns()
> - bpf_trace_printk()
> - bpf_skb_store_bytes()
> - bpf_l3_csum_replace()
> - bpf_l4_csum_replace()
> - bpf_tail_call()
> - bpf_clone_redirect()
> 
> v3:
> - bpf_map_lookup_elem(): Fix description of restrictions for flags
>   related to the existence of the entry.
> - bpf_trace_printk(): State that trace_pipe can be configured. Fix
>   return value in case an unknown format specifier is met. Add a note on
>   kernel log notice when the helper is used. Edit example.
> - bpf_tail_call(): Improve comment on stack inheritance.
> - bpf_clone_redirect(): Improve description of BPF_F_INGRESS flag.
> 
> Cc: Alexei Starovoitov 
> Signed-off-by: Quentin Monnet 

Acked-by: Alexei Starovoitov 



Re: [PATCH v2 bpf-next 0/3] Add missing types to bpftool, libbpf

2018-04-17 Thread Alexei Starovoitov
On Tue, Apr 17, 2018 at 10:28:43AM -0700, Andrey Ignatov wrote:
> v1->v2:
> - add new types to bpftool-cgroup man page;
> - add new types to bash completion for bpftool;
> - don't add types that should not be in bpftool cgroup.
> 
> Add support for various BPF prog types and attach types that have been
> added to kernel recently but not to bpftool or libbpf yet.

lgtm. for the set:
Acked-by: Alexei Starovoitov 



Re: [PATCH 04/10] net: ax88796: Add block_input/output hooks to ax_plat_data

2018-04-17 Thread Michael Schmitz
I think this is a false positive - we're encouraged to provide the
full parameter list for functions, so the struct sk_buff * can't be
avoided.

Cheers,

  Michael


On Wed, Apr 18, 2018 at 6:46 AM, kbuild test robot <l...@intel.com> wrote:
> Hi Michael,
>
> I love your patch! Perhaps something to improve:
>
> [auto build test WARNING on v4.16]
> [cannot apply to net-next/master net/master v4.17-rc1 next-20180417]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Michael-Schmitz/New-network-driver-for-Amiga-X-Surf-100-m68k/20180417-141150
> config: arm-samsung (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm
>
> All warnings (new ones prefixed by >>):
>
>In file included from arch/arm/mach-s3c24xx/mach-anubis.c:42:0:
>>> include/net/ax88796.h:35:11: warning: 'struct sk_buff' declared inside 
>>> parameter list will not be visible outside of this definition or declaration
>struct sk_buff *skb, int ring_offset);
>   ^~~
>
> vim +35 include/net/ax88796.h
>
> 20
> 21  struct ax_plat_data {
> 22  unsigned int flags;
> 23  unsigned char wordlength; /* 1 or 2 */
> 24  unsigned char dcr_val;    /* default value for DCR */
> 25  unsigned char rcr_val;    /* default value for RCR */
> 26  unsigned char gpoc_val;   /* default value for GPOC */
> 27  u32 *reg_offsets;   /* register offsets */
> 28  u8  *mac_addr;  /* MAC addr (only used when
> 29 AXFLG_MAC_FROMPLATFORM is 
> used */
> 30
> 31  /* uses default ax88796 buffer if set to NULL */
> 32  void (*block_output)(struct net_device *dev, int count,
> 33  const unsigned char *buf, int star_page);
> 34  void (*block_input)(struct net_device *dev, int count,
>   > 35  struct sk_buff *skb, int ring_offset);
> 36  };
> 37
>
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH RFC net-next 06/11] udp: add gso support to virtual devices

2018-04-17 Thread Dimitris Michailidis
On Tue, Apr 17, 2018 at 1:00 PM, Willem de Bruijn
 wrote:
> From: Willem de Bruijn 
>
> Virtual devices such as tunnels and bonding can handle large packets.
> Only segment packets when reaching a physical or loopback device.
>
> Signed-off-by: Willem de Bruijn 
> ---
>  include/linux/netdev_features.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 35b79f47a13d..1e4883bb02a7 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -80,6 +80,7 @@ enum {
>
> NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */
> NETIF_F_HW_TLS_RECORD_BIT,  /* Offload TLS record */
> +   NETIF_F_GSO_UDP_L4_BIT, /* UDP payload GSO (not UFO) */

Please add an entry for the new flag to
net/core/ethtool.c:netdev_features_strings
and a description to Documentation/networking/netdev-features.txt.

>
> /*
>  * Add your fresh new feature above and remember to update
> @@ -147,6 +148,7 @@ enum {
>  #define NETIF_F_HW_ESP_TX_CSUM __NETIF_F(HW_ESP_TX_CSUM)
>  #define NETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
>  #define NETIF_F_HW_TLS_RECORD  __NETIF_F(HW_TLS_RECORD)
> +#define NETIF_F_GSO_UDP_L4 __NETIF_F(GSO_UDP_L4)
>
>  #define for_each_netdev_feature(mask_addr, bit)\
> for_each_set_bit(bit, (unsigned long *)mask_addr, 
> NETDEV_FEATURE_COUNT)
> @@ -216,6 +218,7 @@ enum {
>  NETIF_F_GSO_GRE_CSUM | \
>  NETIF_F_GSO_IPXIP4 |   \
>  NETIF_F_GSO_IPXIP6 |   \
> +NETIF_F_GSO_UDP_L4 |   \
>  NETIF_F_GSO_UDP_TUNNEL |   \
>  NETIF_F_GSO_UDP_TUNNEL_CSUM)
>
> --
> 2.17.0.484.g0c8726318c-goog
>


Re: [PATCH bpf-next 08/10] [bpf]: make netronome nfp compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Jakub Kicinski
On Tue, 17 Apr 2018 16:08:29 -0700, Alexei Starovoitov wrote:
> On Mon, Apr 16, 2018 at 11:51:29PM -0700, Nikita V. Shirokov wrote:
> > w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> > well (only "decrease" of pointer's location is going to be supported).
> > changing of this pointer will change packet's size.
> > for nfp driver we will just calculate packet's length unconditionally
> > 
> > Signed-off-by: Nikita V. Shirokov 
> > ---
> >  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
> > b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> > index 1eb6549f2a54..d9111c077699 100644
> > --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> > +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> > @@ -1722,7 +1722,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring 
> > *rx_ring, int budget)
> >  
> > > act = bpf_prog_run_xdp(xdp_prog, &xdp);
> >  
> > -   pkt_len -= xdp.data - orig_data;
> > +   pkt_len = xdp.data_end - xdp.data;  
> 
> Looks correct, but Jakub please review.

Indeed:

Acked-by: Jakub Kicinski 

Thanks!
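
A small user-space sketch of why the one-line change in this thread matters. struct mini_xdp is a hypothetical miniature holding only the two packet pointers, not the kernel's xdp_buff. Subtracting the head movement from the old length misses any change to the tail, while taking the distance between the two pointers accounts for both.

```c
#include <stddef.h>

/* Hypothetical miniature of an XDP buffer: just the two packet pointers. */
struct mini_xdp {
    unsigned char *data;
    unsigned char *data_end;
};

/* Old approach: only accounts for movement of the head pointer. */
static ptrdiff_t len_from_head_delta(ptrdiff_t pkt_len,
                                     const struct mini_xdp *xdp,
                                     const unsigned char *orig_data)
{
    return pkt_len - (xdp->data - orig_data);
}

/* New approach: head and tail adjustments are both reflected. */
static ptrdiff_t len_from_pointers(const struct mini_xdp *xdp)
{
    return xdp->data_end - xdp->data;
}
```

For a 100-byte packet where a program strips 14 bytes from the head and trims 20 bytes from the tail, the first function still reports 86 bytes while the second reports the actual 66.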


Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2018-04-17 Thread David Ahern
On 4/17/18 5:29 PM, Ben Greear wrote:
> 
> FYI, problem still happens in 4.16.  I'm going to re-enable my hack below
> for this kernel as well...I had hopes it might be fixed...

Interesting. I was hoping the same.

> 
> BUG: unable to handle kernel NULL pointer dereference at 8
> IP: fib6_walk_continue+0x5b/0x140 [ipv6]
> PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
> Oops:  [#1] PREEMPT SMP PTI
> Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink
> nf_defrag_ipv4 libcrc32c vrf]
> CPU: 3 PID: 15117 Comm: ip Tainted: G   O 4.16.0+ #5
> Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b
> 05/02/2017
> RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
> RSP: 0018:c90008c3bc10 EFLAGS: 00010287
> RAX: 88085ac45050 RBX: 8807e03008a0 RCX: 
> RDX:  RSI: c90008c3bc48 RDI: 8232b240
> RBP: 880819167600 R08: 0008 R09: 8807dff10071
> R10: c90008c3bbd0 R11:  R12: 8807e03008a0
> R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
> FS:  7f2f04342700() GS:88087fcc()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  inet6_dump_fib+0x14b/0x2c0 [ipv6]
>  netlink_dump+0x216/0x2a0
>  netlink_recvmsg+0x254/0x400
>  ? copy_msghdr_from_user+0xb5/0x110
>  ___sys_recvmsg+0xe9/0x230
>  ? find_held_lock+0x3b/0xb0
>  ? __handle_mm_fault+0x617/0x1180
>  ? __audit_syscall_entry+0xb3/0x110
>  ? __sys_recvmsg+0x39/0x70
>  __sys_recvmsg+0x39/0x70
>  do_syscall_64+0x63/0x120
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x7f2f03a72030
> RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
> RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
> RDX:  RSI: 7fffab3de570 RDI: 0004
> RBP:  R08: 7e6c R09: 7fffab3e63a8
> R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
> R13: 0066b460 R14: 7e6c R15: 
> Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83
> ea 01 89 53 2c c7 4
> RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
> CR2: 0008
> ---[ end trace bd03458864eb266c ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: disabled
> Rebooting in 10 seconds..
> ACPI MEMORY or I/O RESET_REG.
> 


Since you can reproduce, would you mind trying
https://github.com/dsahern/linux.git ipv6/fib6-change-v2

Hopefully these will be committed upstream soon. It changes the game a
bit with the FIB walker. Would be interesting to know if this problem
goes away.


[PATCH net-next v2 04/21] net/ipv6: Pass net to fib6_update_sernum

2018-04-17 Thread David Ahern
Pass net namespace to fib6_update_sernum. It can not be marked const
as fib6_new_sernum will change ipv6.fib6_sernum.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  2 +-
 net/ipv6/ip6_fib.c|  3 +--
 net/ipv6/route.c  | 10 +-
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 5e86fd9dc857..f0aaf1c8f1a8 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -408,7 +408,7 @@ void __net_exit fib6_notifier_exit(struct net *net);
 unsigned int fib6_tables_seq_read(struct net *net);
 int fib6_tables_dump(struct net *net, struct notifier_block *nb);
 
-void fib6_update_sernum(struct rt6_info *rt);
+void fib6_update_sernum(struct net *net, struct rt6_info *rt);
 void fib6_update_sernum_upto_root(struct net *net, struct rt6_info *rt);
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deab2db6692e..74d2a3748e2f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -105,9 +105,8 @@ enum {
FIB6_NO_SERNUM_CHANGE = 0,
 };
 
-void fib6_update_sernum(struct rt6_info *rt)
+void fib6_update_sernum(struct net *net, struct rt6_info *rt)
 {
-   struct net *net = dev_net(rt->dst.dev);
struct fib6_node *fn;
 
fn = rcu_dereference_protected(rt->rt6i_node,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1d738bfe893b..0a99cda9fd7b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1352,7 +1352,7 @@ static int rt6_insert_exception(struct rt6_info *nrt,
/* Update fn->fn_sernum to invalidate all cached dst */
if (!err) {
spin_lock_bh(&ort->rt6i_table->tb6_lock);
-   fib6_update_sernum(ort);
+   fib6_update_sernum(net, ort);
spin_unlock_bh(&ort->rt6i_table->tb6_lock);
fib6_force_start_gc(net);
}
@@ -3786,11 +3786,11 @@ void rt6_multipath_rebalance(struct rt6_info *rt)
 static int fib6_ifup(struct rt6_info *rt, void *p_arg)
 {
const struct arg_netdev_event *arg = p_arg;
-   const struct net *net = dev_net(arg->dev);
+   struct net *net = dev_net(arg->dev);
 
if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev) {
rt->rt6i_nh_flags &= ~arg->nh_flags;
-   fib6_update_sernum_upto_root(dev_net(rt->dst.dev), rt);
+   fib6_update_sernum_upto_root(net, rt);
rt6_multipath_rebalance(rt);
}
 
@@ -3869,7 +3869,7 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 {
const struct arg_netdev_event *arg = p_arg;
const struct net_device *dev = arg->dev;
-   const struct net *net = dev_net(dev);
+   struct net *net = dev_net(dev);
 
if (rt == net->ipv6.ip6_null_entry)
return 0;
@@ -3892,7 +3892,7 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
}
rt6_multipath_nh_flags_set(rt, dev, RTNH_F_DEAD |
   RTNH_F_LINKDOWN);
-   fib6_update_sernum(rt);
+   fib6_update_sernum(net, rt);
rt6_multipath_rebalance(rt);
}
return -2;
-- 
2.11.0



[PATCH net-next v2 09/21] net/ipv6: Defer initialization of dst to data path

2018-04-17 Thread David Ahern
Defer setting dst input, output and error until fib entry is copied.

The reject path from ip6_route_info_create is moved to a new function
ip6_rt_init_dst_reject with a helper doing the conversion from fib6_type
to dst error.

The remainder of the new ip6_rt_init_dst is an amalgamation of dst code
from addrconf_dst_alloc and the non-reject path of ip6_route_info_create.
The dst output function is always ip6_output and the input function is
either ip6_input (local routes), ip6_mc_input (multicast routes) or
ip6_forward (anything else).

A couple of places using dst.error are updated to look at rt6i_flags.
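
The helper that converts a route type to a dst error is a table lookup. A minimal user-space sketch of that technique follows; the enum and its values are hypothetical, not the kernel's RTN_* constants, and the bounds check is an addition made for this sketch.

```c
#include <errno.h>

/* Hypothetical route types for the sketch; not the kernel's RTN_* values. */
enum route_type {
    ROUTE_UNICAST,
    ROUTE_BLACKHOLE,
    ROUTE_UNREACHABLE,
    ROUTE_PROHIBIT,
    ROUTE_THROW,
    ROUTE_TYPE_MAX = ROUTE_THROW,
};

/* Designated initializers keep each type next to its error code. */
static const int route_error[ROUTE_TYPE_MAX + 1] = {
    [ROUTE_UNICAST]     = 0,
    [ROUTE_BLACKHOLE]   = -EINVAL,
    [ROUTE_UNREACHABLE] = -EHOSTUNREACH,
    [ROUTE_PROHIBIT]    = -EACCES,
    [ROUTE_THROW]       = -EAGAIN,
};

static int route_type_to_error(enum route_type type)
{
    /* Bounds check added for the sketch; out-of-range types map to -EINVAL. */
    if ((unsigned int)type > ROUTE_TYPE_MAX)
        return -EINVAL;
    return route_error[type];
}
```

Keeping the mapping in one table means the error no longer has to be stored at route creation time and can be derived whenever the dst is initialized.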

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 115 +++
 1 file changed, 74 insertions(+), 41 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 222af19d3403..3b301aafd2ed 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -920,6 +920,75 @@ static struct net_device *ip6_rt_get_dev_rcu(struct 
rt6_info *rt)
return dev;
 }
 
+static const int fib6_prop[RTN_MAX + 1] = {
+   [RTN_UNSPEC]= 0,
+   [RTN_UNICAST]   = 0,
+   [RTN_LOCAL] = 0,
+   [RTN_BROADCAST] = 0,
+   [RTN_ANYCAST]   = 0,
+   [RTN_MULTICAST] = 0,
+   [RTN_BLACKHOLE] = -EINVAL,
+   [RTN_UNREACHABLE] = -EHOSTUNREACH,
+   [RTN_PROHIBIT]  = -EACCES,
+   [RTN_THROW] = -EAGAIN,
+   [RTN_NAT]   = -EINVAL,
+   [RTN_XRESOLVE]  = -EINVAL,
+};
+
+static int ip6_rt_type_to_error(u8 fib6_type)
+{
+   return fib6_prop[fib6_type];
+}
+
+static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct rt6_info *ort)
+{
+   rt->dst.error = ip6_rt_type_to_error(ort->fib6_type);
+
+   switch (ort->fib6_type) {
+   case RTN_BLACKHOLE:
+   rt->dst.output = dst_discard_out;
+   rt->dst.input = dst_discard;
+   break;
+   case RTN_PROHIBIT:
+   rt->dst.output = ip6_pkt_prohibit_out;
+   rt->dst.input = ip6_pkt_prohibit;
+   break;
+   case RTN_THROW:
+   case RTN_UNREACHABLE:
+   default:
+   rt->dst.output = ip6_pkt_discard_out;
+   rt->dst.input = ip6_pkt_discard;
+   break;
+   }
+}
+
+static void ip6_rt_init_dst(struct rt6_info *rt, struct rt6_info *ort)
+{
+   if (ort->rt6i_flags & RTF_REJECT) {
+   ip6_rt_init_dst_reject(rt, ort);
+   return;
+   }
+
+   rt->dst.error = 0;
+   rt->dst.output = ip6_output;
+
+   if (ort->fib6_type == RTN_LOCAL) {
+   rt->dst.flags |= DST_HOST;
+   rt->dst.input = ip6_input;
+   } else if (ipv6_addr_type(&ort->rt6i_dst.addr) & IPV6_ADDR_MULTICAST) {
+   rt->dst.input = ip6_mc_input;
+   } else {
+   rt->dst.input = ip6_forward;
+   }
+
+   if (ort->fib6_nh.nh_lwtstate) {
+   rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate);
+   lwtunnel_set_redirect(&rt->dst);
+   }
+
+   rt->dst.lastuse = jiffies;
+}
+
 static void rt6_set_from(struct rt6_info *rt, struct rt6_info *from)
 {
BUG_ON(from->from);
@@ -932,14 +1001,12 @@ static void rt6_set_from(struct rt6_info *rt, struct 
rt6_info *from)
 
 static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort)
 {
-   rt->dst.input = ort->dst.input;
-   rt->dst.output = ort->dst.output;
+   ip6_rt_init_dst(rt, ort);
+
rt->rt6i_dst = ort->rt6i_dst;
-   rt->dst.error = ort->dst.error;
rt->rt6i_idev = ort->rt6i_idev;
if (rt->rt6i_idev)
in6_dev_hold(rt->rt6i_idev);
-   rt->dst.lastuse = jiffies;
rt->rt6i_gateway = ort->fib6_nh.nh_gw;
rt->rt6i_flags = ort->rt6i_flags;
rt6_set_from(rt, ort);
@@ -2329,7 +2396,7 @@ static struct rt6_info *__ip6_route_redirect(struct net 
*net,
continue;
if (rt6_check_expired(rt))
continue;
-   if (rt->dst.error)
+   if (rt->rt6i_flags & RTF_REJECT)
break;
if (!(rt->rt6i_flags & RTF_GATEWAY))
continue;
@@ -2357,7 +2424,7 @@ static struct rt6_info *__ip6_route_redirect(struct net 
*net,
 
if (!rt)
rt = net->ipv6.ip6_null_entry;
-   else if (rt->dst.error) {
+   else if (rt->rt6i_flags & RTF_REJECT) {
rt = net->ipv6.ip6_null_entry;
goto out;
}
@@ -2900,15 +2967,6 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
 
addr_type = ipv6_addr_type(&cfg->fc_dst);
 
-   if (addr_type & IPV6_ADDR_MULTICAST)
-   rt->dst.input = ip6_mc_input;
-   else if (cfg->fc_flags & RTF_LOCAL)
-   rt->dst.input = ip6_input;
-   else
-   rt->dst.input = ip6_forward;
-
-   rt->dst.output = ip6_output;
-
if (cfg->fc_encap) {

[PATCH net-next v2 03/21] vrf: Move fib6_table into net_vrf

2018-04-17 Thread David Ahern
A later patch removes rt6i_table from rt6_info. Save the ipv6
table for a VRF in net_vrf. fib tables can not be deleted so
no reference counting or locking is required.

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c | 25 ++---
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 0a2b180d138a..90b5f3900c22 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -48,6 +48,9 @@ static unsigned int vrf_net_id;
 struct net_vrf {
struct rtable __rcu *rth;
struct rt6_info __rcu   *rt6;
+#if IS_ENABLED(CONFIG_IPV6)
+   struct fib6_table   *fib6_table;
+#endif
u32 tb_id;
 };
 
@@ -496,7 +499,6 @@ static int vrf_rt6_create(struct net_device *dev)
int flags = DST_HOST | DST_NOPOLICY | DST_NOXFRM;
struct net_vrf *vrf = netdev_priv(dev);
struct net *net = dev_net(dev);
-   struct fib6_table *rt6i_table;
struct rt6_info *rt6;
int rc = -ENOMEM;
 
@@ -504,8 +506,8 @@ static int vrf_rt6_create(struct net_device *dev)
if (!ipv6_mod_enabled())
return 0;
 
-   rt6i_table = fib6_new_table(net, vrf->tb_id);
-   if (!rt6i_table)
+   vrf->fib6_table = fib6_new_table(net, vrf->tb_id);
+   if (!vrf->fib6_table)
goto out;
 
/* create a dst for routing packets out a VRF device */
@@ -513,7 +515,6 @@ static int vrf_rt6_create(struct net_device *dev)
if (!rt6)
goto out;
 
-   rt6->rt6i_table = rt6i_table;
rt6->dst.output = vrf_output6;
 
rcu_assign_pointer(vrf->rt6, rt6);
@@ -946,22 +947,8 @@ static struct rt6_info *vrf_ip6_route_lookup(struct net 
*net,
 int flags)
 {
struct net_vrf *vrf = netdev_priv(dev);
-   struct fib6_table *table = NULL;
-   struct rt6_info *rt6;
-
-   rcu_read_lock();
-
-   /* fib6_table does not have a refcnt and can not be freed */
-   rt6 = rcu_dereference(vrf->rt6);
-   if (likely(rt6))
-   table = rt6->rt6i_table;
-
-   rcu_read_unlock();
-
-   if (!table)
-   return NULL;
 
-   return ip6_pol_route(net, table, ifindex, fl6, skb, flags);
+   return ip6_pol_route(net, vrf->fib6_table, ifindex, fl6, skb, flags);
 }
 
 static void vrf_ip6_input_dst(struct sk_buff *skb, struct net_device *vrf_dev,
-- 
2.11.0



[PATCH net-next v2 06/21] net/ipv6: Move support functions up in route.c

2018-04-17 Thread David Ahern
Code move only.

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 119 +++
 1 file changed, 59 insertions(+), 60 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 045811a3da76..0daf4c9c9f2b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -78,7 +78,6 @@ enum rt6_nud_state {
RT6_NUD_SUCCEED = 1
 };
 
-static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort);
 static struct dst_entry*ip6_dst_check(struct dst_entry *dst, u32 
cookie);
 static unsigned int ip6_default_advmss(const struct dst_entry *dst);
 static unsigned int ip6_mtu(const struct dst_entry *dst);
@@ -879,6 +878,65 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
 }
 #endif
 
+/*
+ * Misc support functions
+ */
+
+/* called with rcu_lock held */
+static struct net_device *ip6_rt_get_dev_rcu(struct rt6_info *rt)
+{
+   struct net_device *dev = rt->dst.dev;
+
+   if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST)) {
+   /* for copies of local routes, dst->dev needs to be the
+* device if it is a master device, the master device if
+* device is enslaved, and the loopback as the default
+*/
+   if (netif_is_l3_slave(dev) &&
+   !rt6_need_strict(&rt->rt6i_dst.addr))
+   dev = l3mdev_master_dev_rcu(dev);
+   else if (!netif_is_l3_master(dev))
+   dev = dev_net(dev)->loopback_dev;
+   /* last case is netif_is_l3_master(dev) is true in which
+* case we want dev returned to be dev
+*/
+   }
+
+   return dev;
+}
+
+static void rt6_set_from(struct rt6_info *rt, struct rt6_info *from)
+{
+   BUG_ON(from->from);
+
+   rt->rt6i_flags &= ~RTF_EXPIRES;
+   dst_hold(&from->dst);
+   rt->from = from;
+   dst_init_metrics(&rt->dst, dst_metrics_ptr(&from->dst), true);
+}
+
+static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort)
+{
+   rt->dst.input = ort->dst.input;
+   rt->dst.output = ort->dst.output;
+   rt->rt6i_dst = ort->rt6i_dst;
+   rt->dst.error = ort->dst.error;
+   rt->rt6i_idev = ort->rt6i_idev;
+   if (rt->rt6i_idev)
+   in6_dev_hold(rt->rt6i_idev);
+   rt->dst.lastuse = jiffies;
+   rt->rt6i_gateway = ort->rt6i_gateway;
+   rt->rt6i_flags = ort->rt6i_flags;
+   rt6_set_from(rt, ort);
+   rt->rt6i_metric = ort->rt6i_metric;
+#ifdef CONFIG_IPV6_SUBTREES
+   rt->rt6i_src = ort->rt6i_src;
+#endif
+   rt->rt6i_prefsrc = ort->rt6i_prefsrc;
+   rt->rt6i_table = ort->rt6i_table;
+   rt->dst.lwtstate = lwtstate_get(ort->dst.lwtstate);
+}
+
 static struct fib6_node* fib6_backtrack(struct fib6_node *fn,
struct in6_addr *saddr)
 {
@@ -1024,29 +1082,6 @@ int ip6_ins_rt(struct net *net, struct rt6_info *rt)
return __ip6_ins_rt(rt, &info, &mxc, NULL);
 }
 
-/* called with rcu_lock held */
-static struct net_device *ip6_rt_get_dev_rcu(struct rt6_info *rt)
-{
-   struct net_device *dev = rt->dst.dev;
-
-   if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST)) {
-   /* for copies of local routes, dst->dev needs to be the
-* device if it is a master device, the master device if
-* device is enslaved, and the loopback as the default
-*/
-   if (netif_is_l3_slave(dev) &&
-   !rt6_need_strict(&rt->rt6i_dst.addr))
-   dev = l3mdev_master_dev_rcu(dev);
-   else if (!netif_is_l3_master(dev))
-   dev = dev_net(dev)->loopback_dev;
-   /* last case is netif_is_l3_master(dev) is true in which
-* case we want dev returned to be dev
-*/
-   }
-
-   return dev;
-}
-
 static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
   const struct in6_addr *daddr,
   const struct in6_addr *saddr)
@@ -3270,42 +3305,6 @@ static void rt6_do_redirect(struct dst_entry *dst, 
struct sock *sk, struct sk_bu
neigh_release(neigh);
 }
 
-/*
- * Misc support functions
- */
-
-static void rt6_set_from(struct rt6_info *rt, struct rt6_info *from)
-{
-   BUG_ON(from->from);
-
-   rt->rt6i_flags &= ~RTF_EXPIRES;
-   dst_hold(&from->dst);
-   rt->from = from;
-   dst_init_metrics(&rt->dst, dst_metrics_ptr(&from->dst), true);
-}
-
-static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort)
-{
-   rt->dst.input = ort->dst.input;
-   rt->dst.output = ort->dst.output;
-   rt->rt6i_dst = ort->rt6i_dst;
-   rt->dst.error = ort->dst.error;
-   rt->rt6i_idev = ort->rt6i_idev;
-   if (rt->rt6i_idev)
-   in6_dev_hold(rt->rt6i_idev);
-   rt->dst.lastuse = jiffies;
-   rt->rt6i_gateway = 

[PATCH net-next v2 12/21] net/ipv6: Add fib6_null_entry

2018-04-17 Thread David Ahern
ip6_null_entry will stay a dst based return for lookups that fail to
match an entry.

Add a new fib6_null_entry which constitutes the root node and leafs
for fibs. Replace existing references to ip6_null_entry with the
new fib6_null_entry when dealing with FIBs.

Signed-off-by: David Ahern 
---
 include/net/netns/ipv6.h |  3 ++-
 net/ipv6/ip6_fib.c   | 26 ++--
 net/ipv6/route.c | 62 +---
 3 files changed, 58 insertions(+), 33 deletions(-)
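
Note for readers of the archive: the root-node handling described above can be
pictured with a small user-space sketch. It is only an illustration under
stated assumptions; demo_route, demo_root and demo_null_entry are hypothetical
names, not kernel symbols, and the real code uses RCU accessors and refcounts
that are left out here. The point is that a dedicated sentinel marks an empty
root, is replaced on the first insert, is put back on removal, and is skipped
when dumping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's types. */
struct demo_route {
	int metric;
};

struct demo_root {
	struct demo_route *leaf;
};

/* Sentinel meaning "no real route stored here". */
static struct demo_route demo_null_entry;

void demo_root_init(struct demo_root *root)
{
	root->leaf = &demo_null_entry;
}

void demo_root_add(struct demo_root *root, struct demo_route *rt)
{
	assert(rt && rt != &demo_null_entry);
	root->leaf = rt;		/* replaces the sentinel */
}

void demo_root_del(struct demo_root *root)
{
	root->leaf = &demo_null_entry;	/* put the sentinel back */
}

/* Returns the number of real routes a dump would report (0 or 1 here). */
int demo_root_dump(const struct demo_root *root)
{
	if (root->leaf == &demo_null_entry)
		return 0;
	return 1;
}
```

Keeping the FIB sentinel separate from the dst-based null entry means a pointer
comparison like the one in demo_root_dump() never confuses "empty tree" with
"lookup failed".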

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index c29f09cfc9d7..74e4e1e449d5 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -60,7 +60,8 @@ struct netns_ipv6 {
 #endif
struct xt_table *ip6table_nat;
 #endif
-   struct rt6_info *ip6_null_entry;
+   struct rt6_info *fib6_null_entry;
+   struct rt6_info *ip6_null_entry;
struct rt6_statistics   *rt6_stats;
struct timer_list   ip6_fib_timer;
struct hlist_head   *fib_table_hash;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index f25f4d9831e8..280b69497ad0 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -231,7 +231,7 @@ static struct fib6_table *fib6_alloc_table(struct net *net, 
u32 id)
if (table) {
table->tb6_id = id;
rcu_assign_pointer(table->tb6_root.leaf,
-  net->ipv6.ip6_null_entry);
+  net->ipv6.fib6_null_entry);
table->tb6_root.fn_flags = RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO;
inet_peer_base_init(&table->tb6_peers);
}
@@ -369,7 +369,7 @@ struct fib6_dump_arg {
 
 static void fib6_rt_dump(struct rt6_info *rt, struct fib6_dump_arg *arg)
 {
-   if (rt == arg->net->ipv6.ip6_null_entry)
+   if (rt == arg->net->ipv6.fib6_null_entry)
return;
call_fib6_entry_notifier(arg->nb, arg->net, FIB_EVENT_ENTRY_ADD, rt);
 }
@@ -658,7 +658,7 @@ static struct fib6_node *fib6_add_1(struct net *net,
/* remove null_entry in the root node */
} else if (fn->fn_flags & RTN_TL_ROOT &&
   rcu_access_pointer(fn->leaf) ==
-  net->ipv6.ip6_null_entry) {
+  net->ipv6.fib6_null_entry) {
RCU_INIT_POINTER(fn->leaf, NULL);
}
 
@@ -1171,9 +1171,9 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
if (!sfn)
goto failure;
 
-   atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
+   atomic_inc(&info->nl_net->ipv6.fib6_null_entry->rt6i_ref);
rcu_assign_pointer(sfn->leaf,
-  info->nl_net->ipv6.ip6_null_entry);
+  info->nl_net->ipv6.fib6_null_entry);
sfn->fn_flags = RTN_ROOT;
 
/* Now add the first leaf node to new subtree */
@@ -1212,7 +1212,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
if (fn->fn_flags & RTN_TL_ROOT) {
/* put back null_entry for root node */
rcu_assign_pointer(fn->leaf,
-   info->nl_net->ipv6.ip6_null_entry);
+   info->nl_net->ipv6.fib6_null_entry);
} else {
atomic_inc(&rt->rt6i_ref);
rcu_assign_pointer(fn->leaf, rt);
@@ -1251,7 +1251,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
if (!pn_leaf) {
WARN_ON(!pn_leaf);
pn_leaf =
-   info->nl_net->ipv6.ip6_null_entry;
+   info->nl_net->ipv6.fib6_null_entry;
}
 #endif
atomic_inc(&pn_leaf->rt6i_ref);
@@ -1494,7 +1494,7 @@ static struct rt6_info *fib6_find_prefix(struct net *net,
struct fib6_node *child_left, *child_right;
 
if (fn->fn_flags & RTN_ROOT)
-   return net->ipv6.ip6_null_entry;
+   return net->ipv6.fib6_null_entry;
 
while (fn) {
child_left = rcu_dereference_protected(fn->left,
@@ -1531,7 +1531,7 @@ static struct fib6_node *fib6_repair_tree(struct net *net,
 
/* Set fn->leaf to null_entry for root node. */
if (fn->fn_flags & RTN_TL_ROOT) {
-   rcu_assign_pointer(fn->leaf, net->ipv6.ip6_null_entry);
+   rcu_assign_pointer(fn->leaf, net->ipv6.fib6_null_entry);
return fn;
}
 
@@ -1576,7 

[PATCH net-next v2 02/21] net: Handle null dst in rtnl_put_cacheinfo

2018-04-17 Thread David Ahern
Need to keep expires time for IPv6 routes in a dump of FIB entries.
Update rtnl_put_cacheinfo to allow dst to be NULL in which case
rta_cacheinfo will only contain non-dst data.

Signed-off-by: David Ahern 
---
 net/core/rtnetlink.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
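
Note for readers of the archive: a minimal user-space sketch of the pattern the
message describes, assuming simplified stand-in types. demo_dst,
demo_cacheinfo and demo_fill_cacheinfo are hypothetical names, not the
rtnetlink API. The fields that do not depend on the dst are always filled in;
the dst-derived fields are filled in only when a dst is supplied, and stay
zero otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures. */
struct demo_dst {
	unsigned long lastuse;
	uint32_t use;
	int refcnt;
};

struct demo_cacheinfo {
	uint32_t lastuse;
	uint32_t used;
	uint32_t clntref;
	uint32_t error;
	uint32_t id;
};

/* Build the cache info; dst may be NULL, in which case only id/error are set. */
struct demo_cacheinfo demo_fill_cacheinfo(const struct demo_dst *dst,
					  unsigned long now,
					  uint32_t id, uint32_t error)
{
	/* Members not named in the initializer start out as zero. */
	struct demo_cacheinfo ci = {
		.error = error,
		.id = id,
	};

	if (dst) {
		ci.lastuse = (uint32_t)(now - dst->lastuse);
		ci.used = dst->use;
		ci.clntref = (uint32_t)dst->refcnt;
	}

	return ci;
}
```

Moving the dst reads out of the initializer is what makes the NULL case safe:
nothing is dereferenced before the check.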

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 45936922d7e2..80802546c279 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -785,13 +785,15 @@ int rtnl_put_cacheinfo(struct sk_buff *skb, struct 
dst_entry *dst, u32 id,
   long expires, u32 error)
 {
struct rta_cacheinfo ci = {
-   .rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse),
-   .rta_used = dst->__use,
-   .rta_clntref = atomic_read(&(dst->__refcnt)),
.rta_error = error,
.rta_id =  id,
};
 
+   if (dst) {
+   ci.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse);
+   ci.rta_used = dst->__use;
+   ci.rta_clntref = atomic_read(&dst->__refcnt);
+   }
if (expires) {
unsigned long clock;
 
-- 
2.11.0



[PATCH net-next v2 18/21] net/ipv6: introduce fib6_info struct and helpers

2018-04-17 Thread David Ahern
Add fib6_info struct and alloc, destroy, hold and release helpers.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h | 55 ++
 net/ipv6/ip6_fib.c| 60 +++
 2 files changed, 115 insertions(+)
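
Note for readers of the archive: the alloc, hold, release and destroy helpers
follow the usual reference-counting shape. The sketch below is a user-space
approximation using C11 atomics; demo_entry and the demo_* functions are
hypothetical names, and the plain array stands in for the per-cpu allocation.
It is not the kernel implementation, only an outline of the lifetime rules:
alloc returns an object holding one reference, and the final release destroys
it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for a refcounted FIB entry. */
struct demo_entry {
	atomic_int ref;
	int *percpu;	/* stands in for per-cpu data owned by the entry */
};

static int demo_destroyed;	/* counts destroy calls, for illustration */

struct demo_entry *demo_entry_alloc(void)
{
	struct demo_entry *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;

	e->percpu = calloc(4, sizeof(*e->percpu));
	if (!e->percpu) {
		free(e);	/* undo the first allocation on failure */
		return NULL;
	}

	atomic_init(&e->ref, 1);	/* caller owns the first reference */
	return e;
}

void demo_entry_destroy(struct demo_entry *e)
{
	assert(atomic_load(&e->ref) == 0);
	free(e->percpu);
	free(e);
	demo_destroyed++;
}

void demo_entry_hold(struct demo_entry *e)
{
	atomic_fetch_add(&e->ref, 1);
}

void demo_entry_release(struct demo_entry *e)
{
	/* fetch_sub returns the old value; 1 means the last reference went away */
	if (e && atomic_fetch_sub(&e->ref, 1) == 1)
		demo_entry_destroy(e);
}
```

As in the patch, release tolerates a NULL pointer, so error paths can call it
unconditionally.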

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 159f651dee55..630392ae12d8 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -38,6 +38,7 @@
 #endif
 
 struct rt6_info;
+struct fib6_info;
 
 struct fib6_config {
u32 fc_table;
@@ -132,6 +133,46 @@ struct fib6_nh {
int nh_weight;
 };
 
+struct fib6_info {
+   struct fib6_table   *rt6i_table;
+   struct fib6_info __rcu  *rt6_next;
+   struct fib6_node __rcu  *rt6i_node;
+
+   /* Multipath routes:
+* siblings is a list of fib6_info that have the same metric/weight,
+* destination, but not the same gateway. nsiblings is just a cache
+* to speed up lookup.
+*/
+   struct list_headrt6i_siblings;
+   unsigned intrt6i_nsiblings;
+
+   atomic_trt6i_ref;
+   struct inet6_dev*rt6i_idev;
+   unsigned long   expires;
+   struct dst_metrics  *fib6_metrics;
+#define fib6_pmtu  fib6_metrics->metrics[RTAX_MTU-1]
+
+   struct rt6key   rt6i_dst;
+   u32 rt6i_flags;
+   struct rt6key   rt6i_src;
+   struct rt6key   rt6i_prefsrc;
+
+   struct rt6_info * __percpu  *rt6i_pcpu;
+   struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
+
+   u32 rt6i_metric;
+   u8  rt6i_protocol;
+   u8  fib6_type;
+   u8  exception_bucket_flushed:1,
+   should_flush:1,
+   dst_nocount:1,
+   dst_nopolicy:1,
+   dst_host:1,
+   unused:3;
+
+   struct fib6_nh  fib6_nh;
+};
+
 struct rt6_info {
struct dst_entrydst;
struct rt6_info __rcu   *rt6_next;
@@ -291,6 +332,20 @@ static inline void ip6_rt_put(struct rt6_info *rt)
 
 void rt6_free_pcpu(struct rt6_info *non_pcpu_rt);
 
+struct rt6_info *fib6_info_alloc(gfp_t gfp_flags);
+void fib6_info_destroy(struct rt6_info *f6i);
+
+static inline void fib6_info_hold(struct rt6_info *f6i)
+{
+   atomic_inc(&f6i->rt6i_ref);
+}
+
+static inline void fib6_info_release(struct rt6_info *f6i)
+{
+   if (f6i && atomic_dec_and_test(&f6i->rt6i_ref))
+   fib6_info_destroy(f6i);
+}
+
 static inline void rt6_hold(struct rt6_info *rt)
 {
atomic_inc(&rt->rt6i_ref);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 53df4a98a7f7..d07578d84db0 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -145,6 +145,66 @@ static __be32 addr_bit_set(const void *token, int fn_bit)
   addr[fn_bit >> 5];
 }
 
+struct rt6_info *fib6_info_alloc(gfp_t gfp_flags)
+{
+   struct rt6_info *f6i;
+
+   f6i = kzalloc(sizeof(*f6i), gfp_flags);
+   if (!f6i)
+   return NULL;
+
+   f6i->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
+   if (!f6i->rt6i_pcpu) {
+   kfree(f6i);
+   return NULL;
+   }
+
+   INIT_LIST_HEAD(&f6i->rt6i_siblings);
+   f6i->fib6_metrics = (struct dst_metrics *)&dst_default_metrics;
+
+   atomic_inc(&f6i->rt6i_ref);
+
+   return f6i;
+}
+
+void fib6_info_destroy(struct rt6_info *f6i)
+{
+   struct rt6_exception_bucket *bucket;
+
+   WARN_ON(f6i->rt6i_node);
+
+   bucket = rcu_dereference_protected(f6i->rt6i_exception_bucket, 1);
+   if (bucket) {
+   f6i->rt6i_exception_bucket = NULL;
+   kfree(bucket);
+   }
+
+   if (f6i->rt6i_pcpu) {
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   struct rt6_info **ppcpu_rt;
+   struct rt6_info *pcpu_rt;
+
+   ppcpu_rt = per_cpu_ptr(f6i->rt6i_pcpu, cpu);
+   pcpu_rt = *ppcpu_rt;
+   if (pcpu_rt) {
+   dst_dev_put(&pcpu_rt->dst);
+   dst_release(&pcpu_rt->dst);
+   *ppcpu_rt = NULL;
+   }
+   }
+   }
+
+   if (f6i->rt6i_idev)
+   in6_dev_put(f6i->rt6i_idev);
+   if (f6i->fib6_nh.nh_dev)
+   dev_put(f6i->fib6_nh.nh_dev);
+
+   kfree(f6i);
+}
+EXPORT_SYMBOL_GPL(fib6_info_destroy);
+
 static struct fib6_node *node_alloc(struct net *net)
 {

[PATCH net-next v2 21/21] net/ipv6: Remove unused code and variables for rt6_info

2018-04-17 Thread David Ahern
Drop unneeded elements from rt6_info struct and rearrange layout to
something more relevant for the data path.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h   | 60 +++--
 net/ipv6/ip6_fib.c  | 22 --
 net/ipv6/route.c| 27 ++
 net/ipv6/xfrm6_policy.c |  2 --
 4 files changed, 5 insertions(+), 106 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index d41b7bd69fb3..a36116b92100 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -175,58 +175,20 @@ struct fib6_info {
 
 struct rt6_info {
struct dst_entrydst;
-   struct rt6_info __rcu   *rt6_next;
struct fib6_info*from;
 
-   /*
-* Tail elements of dst_entry (__refcnt etc.)
-* and these elements (rarely used in hot path) are in
-* the same cache line.
-*/
-   struct fib6_table   *rt6i_table;
-   struct fib6_node __rcu  *rt6i_node;
-
+   struct rt6key   rt6i_dst;
+   struct rt6key   rt6i_src;
struct in6_addr rt6i_gateway;
-
-   /* Multipath routes:
-* siblings is a list of rt6_info that have the the same metric/weight,
-* destination, but not the same gateway. nsiblings is just a cache
-* to speed up lookup.
-*/
-   struct list_headrt6i_siblings;
-   unsigned intrt6i_nsiblings;
-
-   atomic_trt6i_ref;
-
-   /* These are in a separate cache line. */
-   struct rt6key   rt6i_dst cacheline_aligned_in_smp;
+   struct inet6_dev*rt6i_idev;
u32 rt6i_flags;
-   struct rt6key   rt6i_src;
struct rt6key   rt6i_prefsrc;
 
struct list_headrt6i_uncached;
struct uncached_list*rt6i_uncached_list;
 
-   struct inet6_dev*rt6i_idev;
-   struct rt6_info * __percpu  *rt6i_pcpu;
-   struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
-
-   u32 rt6i_metric;
/* more non-fragment space at head required */
unsigned short  rt6i_nfheader_len;
-   u8  rt6i_protocol;
-   u8  fib6_type;
-   u8  exception_bucket_flushed:1,
-   should_flush:1,
-   dst_nocount:1,
-   dst_nopolicy:1,
-   dst_host:1,
-   unused:3;
-
-   unsigned long   expires;
-   struct dst_metrics  *fib6_metrics;
-#define fib6_pmtu  fib6_metrics->metrics[RTAX_MTU-1]
-   struct fib6_nh  fib6_nh;
 };
 
 #define for_each_fib6_node_rt_rcu(fn)  \
@@ -328,8 +290,6 @@ static inline void ip6_rt_put(struct rt6_info *rt)
dst_release(&rt->dst);
 }
 
-void rt6_free_pcpu(struct rt6_info *non_pcpu_rt);
-
 struct fib6_info *fib6_info_alloc(gfp_t gfp_flags);
 void fib6_info_destroy(struct fib6_info *f6i);
 
@@ -344,20 +304,6 @@ static inline void fib6_info_release(struct fib6_info *f6i)
fib6_info_destroy(f6i);
 }
 
-static inline void rt6_hold(struct rt6_info *rt)
-{
-   atomic_inc(&rt->rt6i_ref);
-}
-
-static inline void rt6_release(struct rt6_info *rt)
-{
-   if (atomic_dec_and_test(&rt->rt6i_ref)) {
-   rt6_free_pcpu(rt);
-   dst_dev_put(&rt->dst);
-   dst_release(&rt->dst);
-   }
-}
-
 enum fib6_walk_state {
 #ifdef CONFIG_IPV6_SUBTREES
FWS_S,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 77cf43b2d858..2ab49b7cac22 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -240,28 +240,6 @@ static void node_free(struct net *net, struct fib6_node 
*fn)
net->ipv6.rt6_stats->fib_nodes--;
 }
 
-void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
-{
-   int cpu;
-
-   if (!non_pcpu_rt->rt6i_pcpu)
-   return;
-
-   for_each_possible_cpu(cpu) {
-   struct rt6_info **ppcpu_rt;
-   struct rt6_info *pcpu_rt;
-
-   ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu);
-   pcpu_rt = *ppcpu_rt;
-   if (pcpu_rt) {
-   dst_dev_put(&pcpu_rt->dst);
-   dst_release(&pcpu_rt->dst);
-   *ppcpu_rt = NULL;
-   }
-   }
-}
-EXPORT_SYMBOL_GPL(rt6_free_pcpu);
-
 static void fib6_free_table(struct fib6_table *table)
 {
inetpeer_invalidate_tree(&table->tb6_peers);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 2ccf939e1a20..f9c363327d62 100644
--- 

[PATCH net-next v2 13/21] net/ipv6: Add rt6_info create function for ip6_pol_route_lookup

2018-04-17 Thread David Ahern
ip6_pol_route_lookup is the lookup function for ip6_route_lookup and
rt6_lookup. At the moment it returns either a reference to a FIB entry
or a cached exception. To move FIB entries to a separate struct, this
lookup function needs to convert FIB entries to an rt6_info that is
returned to the caller.

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7c141394d4f1..e293692174ba 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1055,6 +1055,19 @@ static bool ip6_hold_safe(struct net *net, struct 
rt6_info **prt,
return false;
 }
 
+/* called with rcu_lock held */
+static struct rt6_info *ip6_create_rt_rcu(struct rt6_info *rt)
+{
+   struct net_device *dev = rt->fib6_nh.nh_dev;
+   struct rt6_info *nrt;
+
+   nrt = __ip6_dst_alloc(dev_net(dev), dev, 0);
+   if (nrt)
+   ip6_rt_copy_init(nrt, rt);
+
+   return nrt;
+}
+
 static struct rt6_info *ip6_pol_route_lookup(struct net *net,
 struct fib6_table *table,
 struct flowi6 *fl6,
@@ -1087,18 +1100,26 @@ static struct rt6_info *ip6_pol_route_lookup(struct net 
*net,
}
/* Search through exception table */
rt_cache = rt6_find_cached_rt(rt, &fl6->daddr, &fl6->saddr);
-   if (rt_cache)
+   if (rt_cache) {
rt = rt_cache;
+   if (ip6_hold_safe(net, &rt, true))
+   dst_use_noref(&rt->dst, jiffies);
+   } else if (dst_hold_safe(&rt->dst)) {
+   struct rt6_info *nrt;
 
-   if (ip6_hold_safe(net, &rt, true))
-   dst_use_noref(&rt->dst, jiffies);
+   nrt = ip6_create_rt_rcu(rt);
+   dst_release(&rt->dst);
+   rt = nrt;
+   } else {
+   rt = net->ipv6.ip6_null_entry;
+   dst_hold(&rt->dst);
+   }
 
rcu_read_unlock();
 
trace_fib6_table_lookup(net, rt, table, fl6);
 
return rt;
-
 }
 
 struct dst_entry *ip6_route_lookup(struct net *net, struct flowi6 *fl6,
-- 
2.11.0



[PATCH net-next v2 08/21] net/ipv6: Move nexthop data to fib6_nh

2018-04-17 Thread David Ahern
Introduce fib6_nh structure and move nexthop related data from
rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
lwtstate from a FIB lookup perspective are converted to use fib6_nh;
datapath references to dst version are left as is.

Signed-off-by: David Ahern 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  32 ++--
 include/net/ip6_fib.h  |  16 +-
 include/net/ip6_route.h|   6 +-
 net/ipv6/addrconf.c|   2 +-
 net/ipv6/ip6_fib.c |   6 +-
 net/ipv6/route.c   | 162 +++--
 6 files changed, 125 insertions(+), 99 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 1904c0323d39..d995a0b52d7c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2770,9 +2770,9 @@ mlxsw_sp_nexthop6_group_cmp(const struct 
mlxsw_sp_nexthop_group *nh_grp,
struct in6_addr *gw;
int ifindex, weight;
 
-   ifindex = mlxsw_sp_rt6->rt->dst.dev->ifindex;
-   weight = mlxsw_sp_rt6->rt->rt6i_nh_weight;
-   gw = &mlxsw_sp_rt6->rt->rt6i_gateway;
+   ifindex = mlxsw_sp_rt6->rt->fib6_nh.nh_dev->ifindex;
+   weight = mlxsw_sp_rt6->rt->fib6_nh.nh_weight;
+   gw = &mlxsw_sp_rt6->rt->fib6_nh.nh_gw;
if (!mlxsw_sp_nexthop6_group_has_nexthop(nh_grp, gw, ifindex,
 weight))
return false;
@@ -2838,7 +2838,7 @@ mlxsw_sp_nexthop6_group_hash(struct mlxsw_sp_fib6_entry 
*fib6_entry, u32 seed)
struct net_device *dev;
 
list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
-   dev = mlxsw_sp_rt6->rt->dst.dev;
+   dev = mlxsw_sp_rt6->rt->fib6_nh.nh_dev;
val ^= dev->ifindex;
}
 
@@ -3836,9 +3836,9 @@ mlxsw_sp_rt6_nexthop(struct mlxsw_sp_nexthop_group 
*nh_grp,
struct mlxsw_sp_nexthop *nh = &nh_grp->nexthops[i];
struct rt6_info *rt = mlxsw_sp_rt6->rt;
 
-   if (nh->rif && nh->rif->dev == rt->dst.dev &&
+   if (nh->rif && nh->rif->dev == rt->fib6_nh.nh_dev &&
ipv6_addr_equal((const struct in6_addr *) &nh->gw_addr,
-   &rt->rt6i_gateway))
+   &rt->fib6_nh.nh_gw))
return nh;
continue;
}
@@ -3895,7 +3895,7 @@ mlxsw_sp_fib6_entry_offload_set(struct mlxsw_sp_fib_entry 
*fib_entry)
 
if (fib_entry->type == MLXSW_SP_FIB_ENTRY_TYPE_LOCAL) {
list_first_entry(&fib6_entry->rt6_list, struct mlxsw_sp_rt6,
-list)->rt->rt6i_nh_flags |= RTNH_F_OFFLOAD;
+list)->rt->fib6_nh.nh_flags |= RTNH_F_OFFLOAD;
return;
}
 
@@ -3905,9 +3905,9 @@ mlxsw_sp_fib6_entry_offload_set(struct mlxsw_sp_fib_entry 
*fib_entry)
 
nh = mlxsw_sp_rt6_nexthop(nh_grp, mlxsw_sp_rt6);
if (nh && nh->offloaded)
-   mlxsw_sp_rt6->rt->rt6i_nh_flags |= RTNH_F_OFFLOAD;
+   mlxsw_sp_rt6->rt->fib6_nh.nh_flags |= RTNH_F_OFFLOAD;
else
-   mlxsw_sp_rt6->rt->rt6i_nh_flags &= ~RTNH_F_OFFLOAD;
+   mlxsw_sp_rt6->rt->fib6_nh.nh_flags &= ~RTNH_F_OFFLOAD;
}
 }
 
@@ -3922,7 +3922,7 @@ mlxsw_sp_fib6_entry_offload_unset(struct 
mlxsw_sp_fib_entry *fib_entry)
list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
struct rt6_info *rt = mlxsw_sp_rt6->rt;
 
-   rt->rt6i_nh_flags &= ~RTNH_F_OFFLOAD;
+   rt->fib6_nh.nh_flags &= ~RTNH_F_OFFLOAD;
}
 }
 
@@ -4818,8 +4818,8 @@ static bool mlxsw_sp_nexthop6_ipip_type(const struct 
mlxsw_sp *mlxsw_sp,
const struct rt6_info *rt,
enum mlxsw_sp_ipip_type *ret)
 {
-   return rt->dst.dev &&
-  mlxsw_sp_netdev_ipip_type(mlxsw_sp, rt->dst.dev, ret);
+   return rt->fib6_nh.nh_dev &&
+  mlxsw_sp_netdev_ipip_type(mlxsw_sp, rt->fib6_nh.nh_dev, ret);
 }
 
 static int mlxsw_sp_nexthop6_type_init(struct mlxsw_sp *mlxsw_sp,
@@ -4829,7 +4829,7 @@ static int mlxsw_sp_nexthop6_type_init(struct mlxsw_sp 
*mlxsw_sp,
 {
const struct mlxsw_sp_ipip_ops *ipip_ops;
struct mlxsw_sp_ipip_entry *ipip_entry;
-   struct net_device *dev = rt->dst.dev;
+   struct net_device *dev = rt->fib6_nh.nh_dev;
struct mlxsw_sp_rif *rif;
int err;
 
@@ -4872,11 +4872,11 @@ static int mlxsw_sp_nexthop6_init(struct mlxsw_sp 
*mlxsw_sp,
  struct mlxsw_sp_nexthop 

[PATCH net-next v2 19/21] net/ipv6: separate handling of FIB entries from dst based routes

2018-04-17 Thread David Ahern
Last step before flipping the data type for FIB entries:
- use fib6_info_alloc to create FIB entries in ip6_route_info_create
  and addrconf_dst_alloc
- use fib6_info_release in place of dst_release, ip6_rt_put and
  rt6_release
- remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt
- when purging routes, drop per-cpu routes
- replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release
- use rt->from since it points to the FIB entry
- drop references to exception bucket, fib6_metrics and per-cpu from
  dst entries (those are relevant for fib entries only)

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h   |   4 +-
 include/net/ip6_route.h |   3 +-
 net/ipv6/addrconf.c |  18 +++--
 net/ipv6/anycast.c  |   7 +-
 net/ipv6/ip6_fib.c  |  55 ++--
 net/ipv6/ip6_output.c   |   3 +-
 net/ipv6/ndisc.c|   6 +-
 net/ipv6/route.c| 171 +---
 8 files changed, 115 insertions(+), 152 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 630392ae12d8..6c3d92bb3459 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -314,9 +314,7 @@ static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 
if (rt->rt6i_flags & RTF_PCPU ||
(unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-   rt = rt->from;
-
-   rt6_get_cookie_safe(rt, &cookie);
+   rt6_get_cookie_safe(rt->from, &cookie);
 
return cookie;
 }
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 686cdc7f356a..57d0d45667f1 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -114,8 +114,7 @@ static inline int ip6_route_get_saddr(struct net *net, 
struct rt6_info *rt,
  unsigned int prefs,
  struct in6_addr *saddr)
 {
-   struct inet6_dev *idev =
-   rt ? ip6_dst_idev((struct dst_entry *)rt) : NULL;
+   struct inet6_dev *idev = rt ? rt->rt6i_idev : NULL;
int err = 0;
 
if (rt && rt->rt6i_prefsrc.plen)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 915cd0734b27..e533a447f68c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -916,7 +916,6 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
pr_warn("Freeing alive inet6 address %p\n", ifp);
return;
}
-   ip6_rt_put(ifp->rt);
 
kfree_rcu(ifp, rcu);
 }
@@ -1102,8 +1101,8 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
inet6addr_notifier_call_chain(NETDEV_UP, ifa);
 out:
if (unlikely(err < 0)) {
-   if (rt)
-   ip6_rt_put(rt);
+   fib6_info_release(rt);
+
if (ifa) {
if (ifa->idev)
in6_dev_put(ifa->idev);
@@ -1191,7 +1190,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned 
long expires, bool del_r
else {
if (!(rt->rt6i_flags & RTF_EXPIRES))
fib6_set_expires(rt, expires);
-   ip6_rt_put(rt);
+   fib6_info_release(rt);
}
}
 }
@@ -2375,8 +2374,7 @@ static struct rt6_info *addrconf_get_prefix_route(const 
struct in6_addr *pfx,
continue;
if ((rt->rt6i_flags & noflags) != 0)
continue;
-   if (!dst_hold_safe(&rt->dst))
-   rt = NULL;
+   fib6_info_hold(rt);
break;
}
 out:
@@ -2687,7 +2685,7 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, 
int len, bool sllao)
addrconf_prefix_route(&pinfo->prefix, pinfo->prefix_len,
  dev, expires, flags, GFP_ATOMIC);
}
-   ip6_rt_put(rt);
+   fib6_info_release(rt);
}
 
/* Try to figure out our local address for this prefix */
@@ -3361,7 +3359,7 @@ static int fixup_permanent_addr(struct net *net,
ifp->rt = rt;
spin_unlock(&ifp->lock);
 
-   ip6_rt_put(prev);
+   fib6_info_release(prev);
}
 
if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
@@ -5636,8 +5634,8 @@ static void __ipv6_ifa_notify(int event, struct 
inet6_ifaddr *ifp)
ip6_del_rt(net, rt);
}
if (ifp->rt) {
-   if (dst_hold_safe(&ifp->rt->dst))
-   ip6_del_rt(net, ifp->rt);
+   ip6_del_rt(net, ifp->rt);
+   ifp->rt = NULL;
}
rt_genid_bump_ipv6(net);
break;
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index e456386fe4d5..3db8fe10322b 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -213,7 

[PATCH net-next v2 20/21] net/ipv6: Flip FIB entries to fib6_info

2018-04-17 Thread David Ahern
Convert all code paths referencing a FIB entry from
rt6_info to fib6_info.

Signed-off-by: David Ahern 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  64 ++---
 include/net/if_inet6.h |   4 +-
 include/net/ip6_fib.h  |  42 ++--
 include/net/ip6_route.h|  28 +--
 include/net/netns/ipv6.h   |   2 +-
 net/ipv6/addrconf.c|  20 +-
 net/ipv6/anycast.c |   4 +-
 net/ipv6/ip6_fib.c | 116 -
 net/ipv6/ndisc.c   |   2 +-
 net/ipv6/route.c   | 259 +++--
 10 files changed, 271 insertions(+), 270 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index d995a0b52d7c..b85b15851841 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -442,7 +442,7 @@ struct mlxsw_sp_fib6_entry {
 
 struct mlxsw_sp_rt6 {
struct list_head list;
-   struct rt6_info *rt;
+   struct fib6_info *rt;
 };
 
 struct mlxsw_sp_lpm_tree {
@@ -3834,7 +3834,7 @@ mlxsw_sp_rt6_nexthop(struct mlxsw_sp_nexthop_group 
*nh_grp,
 
for (i = 0; i < nh_grp->count; i++) {
struct mlxsw_sp_nexthop *nh = &nh_grp->nexthops[i];
-   struct rt6_info *rt = mlxsw_sp_rt6->rt;
+   struct fib6_info *rt = mlxsw_sp_rt6->rt;
 
if (nh->rif && nh->rif->dev == rt->fib6_nh.nh_dev &&
ipv6_addr_equal((const struct in6_addr *) &nh->gw_addr,
@@ -3920,7 +3920,7 @@ mlxsw_sp_fib6_entry_offload_unset(struct 
mlxsw_sp_fib_entry *fib_entry)
fib6_entry = container_of(fib_entry, struct mlxsw_sp_fib6_entry,
  common);
list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
-   struct rt6_info *rt = mlxsw_sp_rt6->rt;
+   struct fib6_info *rt = mlxsw_sp_rt6->rt;
 
rt->fib6_nh.nh_flags &= ~RTNH_F_OFFLOAD;
}
@@ -4699,7 +4699,7 @@ static void mlxsw_sp_router_fib4_del(struct mlxsw_sp 
*mlxsw_sp,
mlxsw_sp_fib_node_put(mlxsw_sp, fib_node);
 }
 
-static bool mlxsw_sp_fib6_rt_should_ignore(const struct rt6_info *rt)
+static bool mlxsw_sp_fib6_rt_should_ignore(const struct fib6_info *rt)
 {
/* Packets with link-local destination IP arriving to the router
 * are trapped to the CPU, so no need to program specific routes
@@ -4721,7 +4721,7 @@ static bool mlxsw_sp_fib6_rt_should_ignore(const struct 
rt6_info *rt)
return false;
 }
 
-static struct mlxsw_sp_rt6 *mlxsw_sp_rt6_create(struct rt6_info *rt)
+static struct mlxsw_sp_rt6 *mlxsw_sp_rt6_create(struct fib6_info *rt)
 {
struct mlxsw_sp_rt6 *mlxsw_sp_rt6;
 
@@ -4734,18 +4734,18 @@ static struct mlxsw_sp_rt6 *mlxsw_sp_rt6_create(struct 
rt6_info *rt)
 * memory.
 */
mlxsw_sp_rt6->rt = rt;
-   rt6_hold(rt);
+   fib6_info_hold(rt);
 
return mlxsw_sp_rt6;
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
-static void mlxsw_sp_rt6_release(struct rt6_info *rt)
+static void mlxsw_sp_rt6_release(struct fib6_info *rt)
 {
-   rt6_release(rt);
+   fib6_info_release(rt);
 }
 #else
-static void mlxsw_sp_rt6_release(struct rt6_info *rt)
+static void mlxsw_sp_rt6_release(struct fib6_info *rt)
 {
 }
 #endif
@@ -4756,13 +4756,13 @@ static void mlxsw_sp_rt6_destroy(struct mlxsw_sp_rt6 
*mlxsw_sp_rt6)
kfree(mlxsw_sp_rt6);
 }
 
-static bool mlxsw_sp_fib6_rt_can_mp(const struct rt6_info *rt)
+static bool mlxsw_sp_fib6_rt_can_mp(const struct fib6_info *rt)
 {
/* RTF_CACHE routes are ignored */
return (rt->rt6i_flags & (RTF_GATEWAY | RTF_ADDRCONF)) == RTF_GATEWAY;
 }
 
-static struct rt6_info *
+static struct fib6_info *
 mlxsw_sp_fib6_entry_rt(const struct mlxsw_sp_fib6_entry *fib6_entry)
 {
return list_first_entry(&fib6_entry->rt6_list, struct mlxsw_sp_rt6,
@@ -4771,7 +4771,7 @@ mlxsw_sp_fib6_entry_rt(const struct mlxsw_sp_fib6_entry 
*fib6_entry)
 
 static struct mlxsw_sp_fib6_entry *
 mlxsw_sp_fib6_node_mp_entry_find(const struct mlxsw_sp_fib_node *fib_node,
-const struct rt6_info *nrt, bool replace)
+const struct fib6_info *nrt, bool replace)
 {
struct mlxsw_sp_fib6_entry *fib6_entry;
 
@@ -4779,7 +4779,7 @@ mlxsw_sp_fib6_node_mp_entry_find(const struct 
mlxsw_sp_fib_node *fib_node,
return NULL;
 
list_for_each_entry(fib6_entry, &fib_node->entry_list, common.list) {
-   struct rt6_info *rt = mlxsw_sp_fib6_entry_rt(fib6_entry);
+   struct fib6_info *rt = mlxsw_sp_fib6_entry_rt(fib6_entry);
 
/* RT6_TABLE_LOCAL and RT6_TABLE_MAIN share the same
 * virtual router.
@@ 

[PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions

2018-04-17 Thread David Ahern
Most FIB entries can be added using memory allocated with GFP_KERNEL.
Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
can be reached from the packet path (e.g., ndisc and autoconfig) or
atomic notifiers use GFP_ATOMIC; paths from user context (adding
addresses and routes) use GFP_KERNEL.

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h |  6 --
 net/ipv6/addrconf.c | 39 +++
 net/ipv6/anycast.c  |  2 +-
 net/ipv6/route.c| 18 ++
 4 files changed, 38 insertions(+), 27 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index cb6fb7e16a28..ff70266e30d7 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -100,7 +100,8 @@ void ip6_route_cleanup(void);
 
 int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg);
 
-int ip6_route_add(struct fib6_config *cfg, struct netlink_ext_ack *extack);
+int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
+ struct netlink_ext_ack *extack);
 int ip6_ins_rt(struct net *net, struct rt6_info *rt);
 int ip6_del_rt(struct net *net, struct rt6_info *rt);
 
@@ -138,7 +139,8 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev, 
struct flowi6 *fl6);
 void fib6_force_start_gc(struct net *net);
 
 struct rt6_info *addrconf_dst_alloc(struct net *net, struct inet6_dev *idev,
-   const struct in6_addr *addr, bool anycast);
+   const struct in6_addr *addr, bool anycast,
+   gfp_t gfp_flags);
 
 struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev,
   int flags);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c8df5c6499db..915cd0734b27 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1037,7 +1037,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
goto out;
}
 
-   rt = addrconf_dst_alloc(net, idev, addr, false);
+   rt = addrconf_dst_alloc(net, idev, addr, false, gfp_flags);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
rt = NULL;
@@ -2320,7 +2320,7 @@ static void  ipv6_try_regen_rndid(struct inet6_dev *idev, 
struct in6_addr *tmpad
 
 static void
 addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev,
- unsigned long expires, u32 flags)
+ unsigned long expires, u32 flags, gfp_t gfp_flags)
 {
struct fib6_config cfg = {
.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_PREFIX,
@@ -2345,7 +2345,7 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, 
struct net_device *dev,
cfg.fc_flags |= RTF_NONEXTHOP;
 #endif
 
-   ip6_route_add(, NULL);
+   ip6_route_add(, gfp_flags, NULL);
 }
 
 
@@ -2401,7 +2401,7 @@ static void addrconf_add_mroute(struct net_device *dev)
 
ipv6_addr_set(_dst, htonl(0xFF00), 0, 0, 0);
 
-   ip6_route_add(, NULL);
+   ip6_route_add(, GFP_ATOMIC, NULL);
 }
 
 static struct inet6_dev *addrconf_add_dev(struct net_device *dev)
@@ -2685,7 +2685,7 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, 
int len, bool sllao)
expires = jiffies_to_clock_t(rt_expires);
}
addrconf_prefix_route(>prefix, pinfo->prefix_len,
- dev, expires, flags);
+ dev, expires, flags, GFP_ATOMIC);
}
ip6_rt_put(rt);
}
@@ -2900,7 +2900,7 @@ static int inet6_addr_add(struct net *net, int ifindex,
if (!IS_ERR(ifp)) {
if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) {
addrconf_prefix_route(>addr, ifp->prefix_len, dev,
- expires, flags);
+ expires, flags, GFP_KERNEL);
}
 
/* Send a netlink notification if DAD is enabled and
@@ -3053,7 +3053,8 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
 
if (addr.s6_addr32[3]) {
add_addr(idev, , plen, scope);
-   addrconf_prefix_route(, plen, idev->dev, 0, pflags);
+   addrconf_prefix_route(, plen, idev->dev, 0, pflags,
+ GFP_ATOMIC);
return;
}
 
@@ -3078,7 +3079,7 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
 
add_addr(idev, , plen, flag);
addrconf_prefix_route(, plen, idev->dev, 0,
- pflags);
+ pflags, GFP_ATOMIC);
}
}
}
@@ -3118,7 +3119,8 @@ void addrconf_add_linklocal(struct 

[PATCH net-next v2 17/21] net/ipv6: Cleanup exception and cache route handling

2018-04-17 Thread David Ahern
IPv6 FIB will only contain FIB entries with exception routes added to
the FIB entry. Once this transformation is complete, FIB lookups will
return a fib6_info with the lookup functions still returning a dst
based rt6_info. The current code uses rt6_info for both paths and
overloads the rt6_info variable usually called 'rt'.

This patch introduces a new 'f6i' variable name for the result of the FIB
lookup and keeps 'rt' as the dst based return variable. 'f6i' becomes a
fib6_info in a later patch, which is why it is introduced as f6i now;
this avoids additional churn in the later patch.

In addition, remove RTF_CACHE and dst checks from fib6 add and delete
since they can not happen now and will never happen after the data
type flip.

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h |   1 -
 net/ipv6/ip6_fib.c  |  16 +-
 net/ipv6/route.c| 142 +++-
 3 files changed, 81 insertions(+), 78 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ff70266e30d7..686cdc7f356a 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -106,7 +106,6 @@ int ip6_ins_rt(struct net *net, struct rt6_info *rt);
 int ip6_del_rt(struct net *net, struct rt6_info *rt);
 
 void rt6_flush_exceptions(struct rt6_info *rt);
-int rt6_remove_exception_rt(struct rt6_info *rt);
 void rt6_age_exceptions(struct rt6_info *rt, struct fib6_gc_args *gc_args,
unsigned long now);
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 280b69497ad0..53df4a98a7f7 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1074,7 +1074,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
rt6_info *rt,
 static void fib6_start_gc(struct net *net, struct rt6_info *rt)
 {
if (!timer_pending(>ipv6.ip6_fib_timer) &&
-   (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
+   (rt->rt6i_flags & RTF_EXPIRES))
mod_timer(>ipv6.ip6_fib_timer,
  jiffies + net->ipv6.sysctl.ip6_rt_gc_interval);
 }
@@ -1125,8 +1125,6 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 
if (WARN_ON_ONCE(!atomic_read(>dst.__refcnt)))
return -EINVAL;
-   if (WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE))
-   return -EINVAL;
 
if (info->nlh) {
if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
@@ -1650,8 +1648,6 @@ static void fib6_del_route(struct fib6_table *table, 
struct fib6_node *fn,
 
RT6_TRACE("fib6_del_route\n");
 
-   WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE);
-
/* Unlink it */
*rtp = rt->rt6_next;
rt->rt6i_node = NULL;
@@ -1720,21 +1716,11 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
struct rt6_info __rcu **rtp;
struct rt6_info __rcu **rtp_next;
 
-#if RT6_DEBUG >= 2
-   if (rt->dst.obsolete > 0) {
-   WARN_ON(fn);
-   return -ENOENT;
-   }
-#endif
if (!fn || rt == net->ipv6.fib6_null_entry)
return -ENOENT;
 
WARN_ON(!(fn->fn_flags & RTN_RTINFO));
 
-   /* remove cached dst from exception table */
-   if (rt->rt6i_flags & RTF_CACHE)
-   return rt6_remove_exception_rt(rt);
-
/*
 *  Walk the leaf entries looking for ourself
 */
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 56a854b426a6..ad9eaecf539c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1013,8 +1013,8 @@ static void rt6_set_from(struct rt6_info *rt, struct 
rt6_info *from)
BUG_ON(from->from);
 
rt->rt6i_flags &= ~RTF_EXPIRES;
-   dst_hold(>dst);
-   rt->from = from;
+   if (dst_hold_safe(>dst))
+   rt->from = from;
dst_init_metrics(>dst, from->fib6_metrics->metrics, true);
if (from->fib6_metrics != _default_metrics) {
rt->dst._metrics |= DST_METRICS_REFCOUNTED;
@@ -1097,8 +1097,9 @@ static struct rt6_info *ip6_pol_route_lookup(struct net 
*net,
 const struct sk_buff *skb,
 int flags)
 {
-   struct rt6_info *rt, *rt_cache;
+   struct rt6_info *f6i;
struct fib6_node *fn;
+   struct rt6_info *rt;
 
if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF)
flags &= ~RT6_LOOKUP_F_IFACE;
@@ -1106,36 +1107,36 @@ static struct rt6_info *ip6_pol_route_lookup(struct net 
*net,
rcu_read_lock();
fn = fib6_lookup(>tb6_root, >daddr, >saddr);
 restart:
-   rt = rcu_dereference(fn->leaf);
-   if (!rt) {
-   rt = net->ipv6.fib6_null_entry;
+   f6i = rcu_dereference(fn->leaf);
+   if (!f6i) {
+   f6i = net->ipv6.fib6_null_entry;
} else {
-   rt = rt6_device_match(net, rt, >saddr,
+   f6i = rt6_device_match(net, f6i, >saddr,
  fl6->flowi6_oif, flags);

[PATCH net-next v2 14/21] net/ipv6: Move dst flags to booleans in fib entries

2018-04-17 Thread David Ahern
Continuing to wean FIB paths off of dst_entry, use a bool to hold
requests for certain dst settings. Add a helper to convert the
flags to DST flags when a FIB entry is converted to a dst_entry.
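
The conversion described above can be sketched in userspace as follows.
This is only an illustration: struct ex_fib_entry and the EX_DST_* bit
values are placeholders invented for the sketch, not the kernel's
definitions.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel defines the real DST_*
 * constants in include/net/dst.h and they may differ from these. */
#define EX_DST_HOST	0x0001
#define EX_DST_NOPOLICY	0x0004
#define EX_DST_NOCOUNT	0x0008

/* Hypothetical, trimmed-down FIB entry holding only the three requests. */
struct ex_fib_entry {
	unsigned int dst_nocount:1,
		     dst_nopolicy:1,
		     dst_host:1;
};

/* Build the dst flag mask from the per-entry booleans. */
unsigned short ex_fib_dst_flags(const struct ex_fib_entry *e)
{
	unsigned short flags = 0;

	assert(e);
	if (e->dst_nocount)
		flags |= EX_DST_NOCOUNT;
	if (e->dst_nopolicy)
		flags |= EX_DST_NOPOLICY;
	if (e->dst_host)
		flags |= EX_DST_HOST;

	return flags;
}
```

The FIB entry only records the request; the mask is built at the point
where a dst is actually allocated or initialized.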

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  5 -
 net/ipv6/addrconf.c   |  4 ++--
 net/ipv6/route.c  | 29 -
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index c73b985734f5..159f651dee55 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -177,7 +177,10 @@ struct rt6_info {
u8  fib6_type;
u8  exception_bucket_flushed:1,
should_flush:1,
-   unused:6;
+   dst_nocount:1,
+   dst_nopolicy:1,
+   dst_host:1,
+   unused:3;
 
unsigned long   expires;
struct dst_metrics  *fib6_metrics;
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a156f1a0b1a7..c8df5c6499db 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1046,7 +1046,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
 
if (net->ipv6.devconf_all->disable_policy ||
idev->cnf.disable_policy)
-   rt->dst.flags |= DST_NOPOLICY;
+   rt->dst_nopolicy = true;
 
neigh_parms_data_state_setall(idev->nd_parms);
 
@@ -5981,7 +5981,7 @@ void addrconf_disable_policy_idev(struct inet6_dev *idev, 
int val)
int cpu;
 
rcu_read_lock();
-   addrconf_set_nopolicy(ifa->rt, val);
+   ifa->rt->dst_nopolicy = val ? true : false;
if (rt->rt6i_pcpu) {
for_each_possible_cpu(cpu) {
struct rt6_info **rtp;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e293692174ba..1a3e0db31b34 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -937,6 +937,20 @@ static int ip6_rt_type_to_error(u8 fib6_type)
return fib6_prop[fib6_type];
 }
 
+static unsigned short fib6_info_dst_flags(struct rt6_info *rt)
+{
+   unsigned short flags = 0;
+
+   if (rt->dst_nocount)
+   flags |= DST_NOCOUNT;
+   if (rt->dst_nopolicy)
+   flags |= DST_NOPOLICY;
+   if (rt->dst_host)
+   flags |= DST_HOST;
+
+   return flags;
+}
+
 static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct rt6_info *ort)
 {
rt->dst.error = ip6_rt_type_to_error(ort->fib6_type);
@@ -961,6 +975,8 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, 
struct rt6_info *ort)
 
 static void ip6_rt_init_dst(struct rt6_info *rt, struct rt6_info *ort)
 {
+   rt->dst.flags |= fib6_info_dst_flags(ort);
+
if (ort->rt6i_flags & RTF_REJECT) {
ip6_rt_init_dst_reject(rt, ort);
return;
@@ -970,7 +986,6 @@ static void ip6_rt_init_dst(struct rt6_info *rt, struct 
rt6_info *ort)
rt->dst.output = ip6_output;
 
if (ort->fib6_type == RTN_LOCAL) {
-   rt->dst.flags |= DST_HOST;
rt->dst.input = ip6_input;
} else if (ipv6_addr_type(>rt6i_dst.addr) & IPV6_ADDR_MULTICAST) {
rt->dst.input = ip6_mc_input;
@@ -1058,10 +1073,11 @@ static bool ip6_hold_safe(struct net *net, struct 
rt6_info **prt,
 /* called with rcu_lock held */
 static struct rt6_info *ip6_create_rt_rcu(struct rt6_info *rt)
 {
+   unsigned short flags = fib6_info_dst_flags(rt);
struct net_device *dev = rt->fib6_nh.nh_dev;
struct rt6_info *nrt;
 
-   nrt = __ip6_dst_alloc(dev_net(dev), dev, 0);
+   nrt = __ip6_dst_alloc(dev_net(dev), dev, flags);
if (nrt)
ip6_rt_copy_init(nrt, rt);
 
@@ -1229,12 +1245,13 @@ static struct rt6_info *ip6_rt_cache_alloc(struct 
rt6_info *ort,
 
 static struct rt6_info *ip6_rt_pcpu_alloc(struct rt6_info *rt)
 {
+   unsigned short flags = fib6_info_dst_flags(rt);
struct net_device *dev;
struct rt6_info *pcpu_rt;
 
rcu_read_lock();
dev = ip6_rt_get_dev_rcu(rt);
-   pcpu_rt = __ip6_dst_alloc(dev_net(dev), dev, rt->dst.flags);
+   pcpu_rt = __ip6_dst_alloc(dev_net(dev), dev, flags);
rcu_read_unlock();
if (!pcpu_rt)
return NULL;
@@ -2965,7 +2982,7 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
ipv6_addr_prefix(>rt6i_dst.addr, >fc_dst, cfg->fc_dst_len);
rt->rt6i_dst.plen = cfg->fc_dst_len;
if (rt->rt6i_dst.plen == 128)
-   rt->dst.flags |= DST_HOST;
+   rt->dst_host = true;
 
 #ifdef CONFIG_IPV6_SUBTREES

[PATCH net-next v2 15/21] net/ipv6: Create a neigh_lookup for FIB entries

2018-04-17 Thread David Ahern
The router discovery code has a FIB entry and wants to validate that the
gateway has a neighbor entry. Refactor the existing dst_neigh_lookup
for IPv6 and create a new function that takes the gateway and device
and returns a neighbor entry. Use the new function in
ndisc_router_discovery to validate the gateway.
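
As a rough userspace sketch of the address selection that the refactored
lookup relies on: the gateway is preferred when it is set, then the
packet's destination, then the caller's fallback. The function name and
the pkt_daddr parameter (standing in for the address read from the skb)
are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>

/* Pick the address to look up in the neighbour table. pkt_daddr is a
 * hypothetical stand-in for the destination in the skb's IPv6 header. */
const struct in6_addr *ex_choose_neigh_daddr(const struct in6_addr *gw,
					     const struct in6_addr *pkt_daddr,
					     const struct in6_addr *daddr)
{
	assert(gw);
	if (!IN6_IS_ADDR_UNSPECIFIED(gw))
		return gw;		/* a configured gateway wins */
	if (pkt_daddr)
		return pkt_daddr;	/* else the packet's destination */
	return daddr;			/* else the caller's fallback */
}
```

Taking the gateway and device as plain arguments is what lets a caller
that holds only a FIB entry, and no dst, reuse the same lookup.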

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h |  3 +++
 net/ipv6/ndisc.c|  8 ++--
 net/ipv6/route.c| 33 -
 3 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 655e13017a45..cb6fb7e16a28 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -279,4 +279,7 @@ static inline bool rt6_duplicate_nexthop(struct rt6_info 
*a, struct rt6_info *b)
   !lwtunnel_cmp_encap(a->fib6_nh.nh_lwtstate, 
b->fib6_nh.nh_lwtstate);
 }
 
+struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw,
+  struct net_device *dev, struct sk_buff *skb,
+  const void *daddr);
 #endif
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index e4d9eea92139..556717154fa3 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1276,7 +1276,9 @@ static void ndisc_router_discovery(struct sk_buff *skb)
rt = rt6_get_dflt_router(net, _hdr(skb)->saddr, skb->dev);
 
if (rt) {
-   neigh = dst_neigh_lookup(>dst, _hdr(skb)->saddr);
+   neigh = ip6_neigh_lookup(>fib6_nh.nh_gw,
+rt->fib6_nh.nh_dev, NULL,
+ _hdr(skb)->saddr);
if (!neigh) {
ND_PRINTK(0, err,
  "RA: %s got default router without 
neighbour\n",
@@ -1304,7 +1306,9 @@ static void ndisc_router_discovery(struct sk_buff *skb)
return;
}
 
-   neigh = dst_neigh_lookup(>dst, _hdr(skb)->saddr);
+   neigh = ip6_neigh_lookup(>fib6_nh.nh_gw,
+rt->fib6_nh.nh_dev, NULL,
+ _hdr(skb)->saddr);
if (!neigh) {
ND_PRINTK(0, err,
  "RA: %s got default router without 
neighbour\n",
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1a3e0db31b34..d635d71f7d51 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -182,12 +182,10 @@ static void rt6_uncached_list_flush_dev(struct net *net, 
struct net_device *dev)
}
 }
 
-static inline const void *choose_neigh_daddr(struct rt6_info *rt,
+static inline const void *choose_neigh_daddr(const struct in6_addr *p,
 struct sk_buff *skb,
 const void *daddr)
 {
-   struct in6_addr *p = >rt6i_gateway;
-
if (!ipv6_addr_any(p))
return (const void *) p;
else if (skb)
@@ -195,18 +193,27 @@ static inline const void *choose_neigh_daddr(struct 
rt6_info *rt,
return daddr;
 }
 
-static struct neighbour *ip6_neigh_lookup(const struct dst_entry *dst,
- struct sk_buff *skb,
- const void *daddr)
+struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw,
+  struct net_device *dev,
+  struct sk_buff *skb,
+  const void *daddr)
 {
-   struct rt6_info *rt = (struct rt6_info *) dst;
struct neighbour *n;
 
-   daddr = choose_neigh_daddr(rt, skb, daddr);
-   n = __ipv6_neigh_lookup(dst->dev, daddr);
+   daddr = choose_neigh_daddr(gw, skb, daddr);
+   n = __ipv6_neigh_lookup(dev, daddr);
if (n)
return n;
-   return neigh_create(_tbl, daddr, dst->dev);
+   return neigh_create(_tbl, daddr, dev);
+}
+
+static struct neighbour *ip6_dst_neigh_lookup(const struct dst_entry *dst,
+ struct sk_buff *skb,
+ const void *daddr)
+{
+   const struct rt6_info *rt = container_of(dst, struct rt6_info, dst);
+
+   return ip6_neigh_lookup(>rt6i_gateway, dst->dev, skb, daddr);
 }
 
 static void ip6_confirm_neigh(const struct dst_entry *dst, const void *daddr)
@@ -214,7 +221,7 @@ static void ip6_confirm_neigh(const struct dst_entry *dst, 
const void *daddr)
struct net_device *dev = dst->dev;
struct rt6_info *rt = (struct rt6_info *)dst;
 
-   daddr = choose_neigh_daddr(rt, NULL, daddr);
+   daddr = choose_neigh_daddr(>rt6i_gateway, NULL, daddr);
if (!daddr)
return;
if (dev->flags & (IFF_NOARP | IFF_LOOPBACK))
@@ -239,7 +246,7 @@ static struct dst_ops ip6_dst_ops_template = {
.update_pmtu=   ip6_rt_update_pmtu,
.redirect 

[PATCH net-next v2 10/21] net/ipv6: move metrics from dst to rt6_info

2018-04-17 Thread David Ahern
Similar to IPv4, add fib metrics to the fib struct, which at the moment
is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics
into dst by reference using refcount.

To make the transition:
- add dst_metrics to rt6_info. Default to dst_default_metrics if no
  metrics are passed during route add. No need for a separate pmtu
  entry; it can reference the MTU slot in fib6_metrics

- ip6_convert_metrics allocates memory in the FIB entry and uses
  ip_metrics_convert to copy from netlink attribute to metrics entry

- the convert metrics call is done in ip6_route_info_create simplifying
  the route add path
  + fib6_commit_metrics and fib6_copy_metrics and the temporary
    mx6_config are no longer needed

- add fib6_metric_set helper to change the value of a metric in the
  fib entry since dst_metric_set can no longer be used

- cow_metrics for IPv6 can drop to dst_cow_metrics_generic

- rt6_dst_from_metrics_check is no longer needed

- rt6_fill_node needs the FIB entry and dst as separate arguments to
  keep compatibility with existing output. Current dst address is
  renamed to dest.
  (to be consistent with IPv4 rt6_fill_node really should be split
  into 2 functions similar to fib_dump_info and rt_fill_info)

- rt6_fill_node no longer needs the temporary metrics variable
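
The fib6_metric_set item above is a copy-on-write pattern: an entry that
still points at the shared read-only defaults gets its own block before
the first write. A minimal userspace sketch, assuming invented ex_* names,
a plain int instead of refcount_t, and an int return value (the helper in
the patch returns void):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_METRIC_MAX 16	/* hypothetical number of metric slots */

struct ex_metrics {
	uint32_t metrics[EX_METRIC_MAX];
	int refcnt;		/* plain int here; the kernel uses refcount_t */
};

/* Shared defaults that every entry points at until its first write. */
const struct ex_metrics ex_default_metrics = { .refcnt = 1 };

struct ex_fib_entry {
	struct ex_metrics *fib_metrics;
};

void ex_fib_entry_init(struct ex_fib_entry *e)
{
	assert(e);
	/* Cast away const; the defaults are never written through this. */
	e->fib_metrics = (struct ex_metrics *)&ex_default_metrics;
}

/* Set a 1-based metric; returns 0 on success, -1 if allocation fails. */
int ex_metric_set(struct ex_fib_entry *e, int metric, uint32_t val)
{
	assert(e && metric >= 1 && metric <= EX_METRIC_MAX);

	if (e->fib_metrics == &ex_default_metrics) {
		struct ex_metrics *p = calloc(1, sizeof(*p));

		if (!p)
			return -1;
		p->refcnt = 1;
		e->fib_metrics = p;	/* private copy from now on */
	}

	e->fib_metrics->metrics[metric - 1] = val;
	return 0;
}
```

Entries that never change a metric keep sharing the defaults, so the
extra allocation is paid only by routes that need it.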

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  17 ++--
 net/core/dst.c|   1 +
 net/ipv6/ip6_fib.c|  66 +
 net/ipv6/ndisc.c  |  10 +-
 net/ipv6/route.c  | 257 +++---
 5 files changed, 133 insertions(+), 218 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index f0a88370ba95..1f8dc9d12abb 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -94,11 +94,6 @@ struct fib6_gc_args {
 #define FIB6_SUBTREE(fn)   (rcu_dereference_protected((fn)->subtree, 1))
 #endif
 
-struct mx6_config {
-   const u32 *mx;
-   DECLARE_BITMAP(mx_valid, RTAX_MAX);
-};
-
 /*
  * routing information
  *
@@ -176,7 +171,6 @@ struct rt6_info {
struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
 
u32 rt6i_metric;
-   u32 rt6i_pmtu;
/* more non-fragment space at head required */
unsigned short  rt6i_nfheader_len;
u8  rt6i_protocol;
@@ -185,6 +179,8 @@ struct rt6_info {
should_flush:1,
unused:6;
 
+   struct dst_metrics  *fib6_metrics;
+#define fib6_pmtu  fib6_metrics->metrics[RTAX_MTU-1]
struct fib6_nh  fib6_nh;
 };
 
@@ -390,8 +386,7 @@ void fib6_clean_all(struct net *net, int (*func)(struct 
rt6_info *, void *arg),
void *arg);
 
 int fib6_add(struct fib6_node *root, struct rt6_info *rt,
-struct nl_info *info, struct mx6_config *mxc,
-struct netlink_ext_ack *extack);
+struct nl_info *info, struct netlink_ext_ack *extack);
 int fib6_del(struct rt6_info *rt, struct nl_info *info);
 
 void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info,
@@ -420,6 +415,12 @@ int fib6_tables_dump(struct net *net, struct 
notifier_block *nb);
 void fib6_update_sernum(struct net *net, struct rt6_info *rt);
 void fib6_update_sernum_upto_root(struct net *net, struct rt6_info *rt);
 
+void fib6_metric_set(struct rt6_info *f6i, int metric, u32 val);
+static inline bool fib6_metric_locked(struct rt6_info *f6i, int metric)
+{
+   return !!(f6i->fib6_metrics->metrics[RTAX_LOCK - 1] & (1 << metric));
+}
+
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 int fib6_rules_init(void);
 void fib6_rules_cleanup(void);
diff --git a/net/core/dst.c b/net/core/dst.c
index 007aa0b08291..2d9b37f8944a 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -58,6 +58,7 @@ const struct dst_metrics dst_default_metrics = {
 */
.refcnt = REFCOUNT_INIT(1),
 };
+EXPORT_SYMBOL(dst_default_metrics);
 
 void dst_init(struct dst_entry *dst, struct dst_ops *ops,
  struct net_device *dev, int initial_ref, int initial_obsolete,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 64b73e65f114..0d94c56c3e41 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -578,6 +578,24 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
return res;
 }
 
+void fib6_metric_set(struct rt6_info *f6i, int metric, u32 val)
+{
+   if (!f6i)
+   return;
+
+   if (f6i->fib6_metrics == _default_metrics) {
+   struct dst_metrics *p = kzalloc(sizeof(*p), GFP_ATOMIC);
+
+   if (!p)
+   return;
+
+   refcount_set(>refcnt, 1);
+   f6i->fib6_metrics = p;
+   }
+
+   f6i->fib6_metrics->metrics[metric - 1] = val;
+}
+
 /*
  * Routing Table
  *
@@ -801,38 +819,6 @@ static 

[PATCH net-next v2 07/21] net/ipv6: Save route type in rt6_info

2018-04-17 Thread David Ahern
The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags
and dst.error. Since dst is going to be removed, it can no longer be
relied on for FIB dumps so save the route type as fib6_type.

fc_type is set in current users based on the algorithm in rt6_fill_node:
  - rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL
  - rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST
  - else fc_type = RTN_UNICAST
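
The three rules above can be sketched as a small function. The ex_*
names and constant values are local stand-ins chosen for the example;
the real RTF_* and RTN_* constants come from the kernel's uapi headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for illustration only, not the kernel's values. */
#define EX_RTF_ANYCAST	0x1u
#define EX_RTF_LOCAL	0x2u

enum ex_rtn { EX_RTN_UNICAST, EX_RTN_LOCAL, EX_RTN_ANYCAST };

/* Hypothetical, trimmed-down route carrying only its flags word. */
struct ex_route {
	uint32_t flags;
};

/* Apply the rules in order: LOCAL first, then ANYCAST, else UNICAST. */
enum ex_rtn ex_route_type(const struct ex_route *rt)
{
	assert(rt);
	if (rt->flags & EX_RTF_LOCAL)
		return EX_RTN_LOCAL;
	if (rt->flags & EX_RTF_ANYCAST)
		return EX_RTN_ANYCAST;
	return EX_RTN_UNICAST;
}
```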

Similarly, fib6_type is set in the rt6_info templates based on the
RTF_REJECT section of rt6_fill_node converting dst.error to RTN type.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  1 +
 net/ipv6/addrconf.c   |  2 ++
 net/ipv6/route.c  | 46 --
 3 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index f0aaf1c8f1a8..0165820bbafb 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -174,6 +174,7 @@ struct rt6_info {
int rt6i_nh_weight;
unsigned short  rt6i_nfheader_len;
u8  rt6i_protocol;
+   u8  fib6_type;
u8  exception_bucket_flushed:1,
should_flush:1,
unused:6;
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 7ff7466c52e5..f71fcf2635d5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2331,6 +2331,7 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, 
struct net_device *dev,
.fc_flags = RTF_UP | flags,
.fc_nlinfo.nl_net = dev_net(dev),
.fc_protocol = RTPROT_KERNEL,
+   .fc_type = RTN_UNICAST,
};
 
cfg.fc_dst = *pfx;
@@ -2394,6 +2395,7 @@ static void addrconf_add_mroute(struct net_device *dev)
.fc_ifindex = dev->ifindex,
.fc_dst_len = 8,
.fc_flags = RTF_UP,
+   .fc_type = RTN_UNICAST,
.fc_nlinfo.nl_net = dev_net(dev),
};
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0daf4c9c9f2b..1cc01f5bb773 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -307,6 +307,7 @@ static const struct rt6_info ip6_null_entry_template = {
.rt6i_protocol  = RTPROT_KERNEL,
.rt6i_metric= ~(u32) 0,
.rt6i_ref   = ATOMIC_INIT(1),
+   .fib6_type  = RTN_UNREACHABLE,
 };
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
@@ -324,6 +325,7 @@ static const struct rt6_info ip6_prohibit_entry_template = {
.rt6i_protocol  = RTPROT_KERNEL,
.rt6i_metric= ~(u32) 0,
.rt6i_ref   = ATOMIC_INIT(1),
+   .fib6_type  = RTN_PROHIBIT,
 };
 
 static const struct rt6_info ip6_blk_hole_entry_template = {
@@ -339,6 +341,7 @@ static const struct rt6_info ip6_blk_hole_entry_template = {
.rt6i_protocol  = RTPROT_KERNEL,
.rt6i_metric= ~(u32) 0,
.rt6i_ref   = ATOMIC_INIT(1),
+   .fib6_type  = RTN_BLACKHOLE,
 };
 
 #endif
@@ -2802,6 +2805,11 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
goto out;
}
 
+   if (cfg->fc_type > RTN_MAX) {
+   NL_SET_ERR_MSG(extack, "Invalid route type");
+   goto out;
+   }
+
if (cfg->fc_dst_len > 128) {
NL_SET_ERR_MSG(extack, "Invalid prefix length");
goto out;
@@ -2914,6 +2922,8 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
rt->rt6i_metric = cfg->fc_metric;
rt->rt6i_nh_weight = 1;
 
+   rt->fib6_type = cfg->fc_type;
+
/* We cannot add true routes via loopback here,
   they would result in kernel looping; promote them to reject routes
 */
@@ -3354,6 +3364,7 @@ static struct rt6_info *rt6_add_route_info(struct net 
*net,
.fc_flags   = RTF_GATEWAY | RTF_ADDRCONF | RTF_ROUTEINFO |
  RTF_UP | RTF_PREF(pref),
.fc_protocol = RTPROT_RA,
+   .fc_type = RTN_UNICAST,
.fc_nlinfo.portid = 0,
.fc_nlinfo.nlh = NULL,
.fc_nlinfo.nl_net = net,
@@ -3410,6 +3421,7 @@ struct rt6_info *rt6_add_dflt_router(struct net *net,
.fc_flags   = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT |
  RTF_UP | RTF_EXPIRES | RTF_PREF(pref),
.fc_protocol = RTPROT_RA,
+   .fc_type = RTN_UNICAST,
.fc_nlinfo.portid = 0,
.fc_nlinfo.nlh = NULL,
.fc_nlinfo.nl_net = net,
@@ -3485,6 +3497,7 @@ static void rtmsg_to_fib6_config(struct net *net,
cfg->fc_dst_len = rtmsg->rtmsg_dst_len;
cfg->fc_src_len = rtmsg->rtmsg_src_len;
cfg->fc_flags = rtmsg->rtmsg_flags;
+   cfg->fc_type = rtmsg->rtmsg_type;
 

[PATCH net-next v2 00/21] net/ipv6: Separate data structures for FIB and data path

2018-04-17 Thread David Ahern
IPv6 uses the same data struct for both control plane (FIB entries) and
data path (dst entries). This struct has elements needed for both paths
adding memory overhead and complexity (taking a dst hold in most places
but an additional reference on rt6i_ref in a few). Furthermore, because
of the dst_alloc tie, all FIB entries are allocated with GFP_ATOMIC.

This patch set separates FIB entries from dst entries, better aligning
IPv6 code with IPv4, simplifying the reference counting and allowing
FIB entries added by userspace (not autoconf) to use GFP_KERNEL. It is
the first step toward a number of performance and scalability changes.

The end result of this patch set:
  - FIB entries (fib6_info):
    /* size: 208, cachelines: 4, members: 25 */
    /* sum members: 207, holes: 1, sum holes: 1 */

  - dst entries (rt6_info):
    /* size: 240, cachelines: 4, members: 11 */

Versus the single rt6_info struct today for both paths:
  /* size: 320, cachelines: 5, members: 28 */

This amounts to a 35% reduction in memory use for FIB entries and a
25% reduction for dst entries.
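
The percentages follow directly from the struct sizes quoted above. A
small helper (the name is made up for this example) makes the arithmetic
explicit:

```c
#include <assert.h>

/* Percent saved when a struct shrinks from old_size to new_size bytes. */
unsigned int ex_pct_reduction(unsigned int old_size, unsigned int new_size)
{
	assert(old_size > 0 && new_size <= old_size);
	return 100 - (new_size * 100) / old_size;
}
```

With the sizes above, ex_pct_reduction(320, 208) gives 35 and
ex_pct_reduction(320, 240) gives 25.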

With respect to locking, FIB entries use RCU and a single atomic
counter with fib6_info_hold and fib6_info_release helpers to manage
the reference counting. dst entries use only the traditional dst
refcounts with dst_hold and dst_release.
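
A minimal userspace sketch of the hold/release pattern with a single
atomic counter, using C11 atomics. The ex_* names are invented for the
example, and the entry is freed immediately here, whereas an RCU-protected
object in the kernel would have its free deferred past a grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a FIB entry: just the reference counter. */
struct ex_fib_info {
	atomic_int ref;
};

/* Allocate an entry that starts with one reference owned by the caller. */
struct ex_fib_info *ex_fib_info_alloc(void)
{
	struct ex_fib_info *f = calloc(1, sizeof(*f));

	if (f)
		atomic_init(&f->ref, 1);
	return f;
}

/* Take an additional reference. */
void ex_fib_info_hold(struct ex_fib_info *f)
{
	assert(f);
	atomic_fetch_add(&f->ref, 1);
}

/* Drop a reference; returns 1 if this call freed the entry. */
int ex_fib_info_release(struct ex_fib_info *f)
{
	if (f && atomic_fetch_sub(&f->ref, 1) == 1) {
		free(f);
		return 1;
	}
	return 0;
}
```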

FIB entries for host routes are referenced by inet6_ifaddr and
ifacaddr6. In both cases, additional holds are taken -- similar to
what is done for devices.

This set is the first of many changes to improve the scalability of the
IPv6 code. Follow on changes include:
- consolidating duplicate fib6_info references like IPv4 does with
  duplicate fib_info

- moving fib6_info into a slab cache to avoid allocation roundups to
  power of 2 (the 208 size becomes a 256 actual allocation)

- allowing FIB lookups without generating a dst (e.g., most rt6_lookup
  users just want to verify the egress device). This means moving dst
  allocation to the other side of fib6_rule_lookup, which again aligns
  with IPv4 behavior

- using separate standalone nexthop objects which have performance
  benefits beyond fib_info consolidation
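
The slab cache item above rests on a rounding effect: a generic
allocation of 208 bytes is assumed to land in the next power-of-two size
class. A sketch of that rounding, with a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= n. This models the power-of-two
 * rounding mentioned in the list above, not any real allocator API. */
size_t ex_roundup_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0 && n <= SIZE_MAX / 2);
	while (p < n)
		p <<= 1;
	return p;
}
```

For the 208-byte fib6_info this yields 256, so 48 bytes per entry are
padding that a dedicated cache sized to the struct would avoid.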

At this point I am not seeing any refcount leaks or underflows, no
oops or bug_ons, or warnings from kasan, so I think it is ready for
others to beat up on it finding errors in code paths I have missed.

v2 changes
- rebased to top of tree
- improved commit message on patch 7

v1 changes
- rebased to top of tree
- fix memory leak of metrics as noted by Ido
- MTU fixes based on pmtu tests (thanks Stefano Brivio for writing)

RFC v2 changes
- improved commit messages
- move common metrics code from dst.c to net/ipv4/metrics.c (comment
  from DaveM)
- address comments from Wei Wang and Martin KaFai Lau (let me know if
  I missed something)
- fixes detected by kernel test robots
  + added fib6_metric_set to change metric on a FIB entry which could
    be pointing to read-only dst_default_metrics
  + 0day testing found a problem with an intermediate patch; added
    dst_hold_safe on rt->from. Code is removed 3 patches later
- allow cacheinfo to handle NULL dst; means only expires is pushed to
  userspace

David Ahern (21):
  net: Move fib_convert_metrics to metrics file
  net: Handle null dst in rtnl_put_cacheinfo
  vrf: Move fib6_table into net_vrf
  net/ipv6: Pass net to fib6_update_sernum
  net/ipv6: Pass net namespace to route functions
  net/ipv6: Move support functions up in route.c
  net/ipv6: Save route type in rt6_info
  net/ipv6: Move nexthop data to fib6_nh
  net/ipv6: Defer initialization of dst to data path
  net/ipv6: move metrics from dst to rt6_info
  net/ipv6: move expires into rt6_info
  net/ipv6: Add fib6_null_entry
  net/ipv6: Add rt6_info create function for ip6_pol_route_lookup
  net/ipv6: Move dst flags to booleans in fib entries
  net/ipv6: Create a neigh_lookup for FIB entries
  net/ipv6: Add gfp_flags to route add functions
  net/ipv6: Cleanup exception and cache route handling
  net/ipv6: introduce fib6_info struct and helpers
  net/ipv6: separate handling of FIB entries from dst based routes
  net/ipv6: Flip FIB entries to fib6_info
  net/ipv6: Remove unused code and variables for rt6_info

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |   96 +-
 drivers/net/vrf.c  |   25 +-
 include/net/if_inet6.h |4 +-
 include/net/ip.h   |3 +
 include/net/ip6_fib.h  |  151 ++-
 include/net/ip6_route.h|   45 +-
 include/net/netns/ipv6.h   |3 +-
 net/core/dst.c |1 +
 net/core/rtnetlink.c   |8 +-
 net/ipv4/Makefile  |3 +-
 net/ipv4/fib_semantics.c   |   43 +-
 net/ipv4/metrics.c |   53 +
 net/ipv6/addrconf.c|  

[PATCH net-next v2 01/21] net: Move fib_convert_metrics to metrics file

2018-04-17 Thread David Ahern
Move logic of fib_convert_metrics into ip_metrics_convert. This allows
the code that converts netlink attributes into metrics struct to be
re-used in a later patch by IPv6.

This is mostly a code move with the following changes to variable names:
  - fi->fib_net becomes net
  - fc_mx and fc_mx_len are passed as inputs pulled from fib_config
  - metrics array is passed as an input from fi->fib_metrics->metrics
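
Part of the logic being moved is the clamping of a few metric values, as
seen in the code this patch removes from fib_convert_metrics. A userspace
sketch of only that clamping step, with a hypothetical enum standing in
for the RTAX_* identifiers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical metric identifiers; not the kernel's RTAX_* values. */
enum ex_metric { EX_ADVMSS = 1, EX_MTU, EX_HOPLIMIT, EX_OTHER };

/* Clamp a metric to the same upper bounds the moved code applies. */
uint32_t ex_clamp_metric(enum ex_metric type, uint32_t val)
{
	assert(type >= EX_ADVMSS && type <= EX_OTHER);

	if (type == EX_ADVMSS && val > 65535 - 40)
		val = 65535 - 40;
	if (type == EX_MTU && val > 65535 - 15)
		val = 65535 - 15;
	if (type == EX_HOPLIMIT && val > 255)
		val = 255;

	return val;
}
```

Because the helper takes its inputs as plain arguments rather than a
fib_info, the same bounds can be applied by any caller that has a
metrics array.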

Signed-off-by: David Ahern 
---
 include/net/ip.h |  3 +++
 net/ipv4/Makefile|  3 ++-
 net/ipv4/fib_semantics.c | 43 ++-
 net/ipv4/metrics.c   | 53 
 4 files changed, 60 insertions(+), 42 deletions(-)
 create mode 100644 net/ipv4/metrics.c

diff --git a/include/net/ip.h b/include/net/ip.h
index ecffd843e7b8..dc4a2d6e58a5 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -396,6 +396,9 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
 }
 
+int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len,
+  u32 *metrics);
+
 u32 ip_idents_reserve(u32 hash, int segs);
 void __ip_select_ident(struct net *net, struct iphdr *iph, int segs);
 
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index a07b7dd06def..b379520f9133 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -13,7 +13,8 @@ obj-y := route.o inetpeer.o protocol.o \
 tcp_offload.o datagram.o raw.o udp.o udplite.o \
 udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
-inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+metrics.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c27122f01b87..6608db23f54b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1019,47 +1019,8 @@ static bool fib_valid_prefsrc(struct fib_config *cfg, 
__be32 fib_prefsrc)
 static int
 fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
 {
-   bool ecn_ca = false;
-   struct nlattr *nla;
-   int remaining;
-
-   if (!cfg->fc_mx)
-   return 0;
-
-   nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) {
-   int type = nla_type(nla);
-   u32 val;
-
-   if (!type)
-   continue;
-   if (type > RTAX_MAX)
-   return -EINVAL;
-
-   if (type == RTAX_CC_ALGO) {
-   char tmp[TCP_CA_NAME_MAX];
-
-   nla_strlcpy(tmp, nla, sizeof(tmp));
-   val = tcp_ca_get_key_by_name(fi->fib_net, tmp, &ecn_ca);
-   if (val == TCP_CA_UNSPEC)
-   return -EINVAL;
-   } else {
-   val = nla_get_u32(nla);
-   }
-   if (type == RTAX_ADVMSS && val > 65535 - 40)
-   val = 65535 - 40;
-   if (type == RTAX_MTU && val > 65535 - 15)
-   val = 65535 - 15;
-   if (type == RTAX_HOPLIMIT && val > 255)
-   val = 255;
-   if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK))
-   return -EINVAL;
-   fi->fib_metrics->metrics[type - 1] = val;
-   }
-
-   if (ecn_ca)
-   fi->fib_metrics->metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
-
-   return 0;
+   return ip_metrics_convert(fi->fib_net, cfg->fc_mx, cfg->fc_mx_len,
+ fi->fib_metrics->metrics);
 }
 
 struct fib_info *fib_create_info(struct fib_config *cfg,
diff --git a/net/ipv4/metrics.c b/net/ipv4/metrics.c
new file mode 100644
index ..5121c6475e6b
--- /dev/null
+++ b/net/ipv4/metrics.c
@@ -0,0 +1,53 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len,
+  u32 *metrics)
+{
+   bool ecn_ca = false;
+   struct nlattr *nla;
+   int remaining;
+
+   if (!fc_mx)
+   return 0;
+
+   nla_for_each_attr(nla, fc_mx, fc_mx_len, remaining) {
+   int type = nla_type(nla);
+   u32 val;
+
+   if (!type)
+   continue;
+   if (type > RTAX_MAX)
+   return -EINVAL;
+
+   if (type == RTAX_CC_ALGO) {
+   char tmp[TCP_CA_NAME_MAX];
+
+   nla_strlcpy(tmp, nla, sizeof(tmp));
+   val = tcp_ca_get_key_by_name(net, tmp, &ecn_ca);
+   if (val == TCP_CA_UNSPEC)
+ 

[PATCH net-next v2 05/21] net/ipv6: Pass net namespace to route functions

2018-04-17 Thread David Ahern
Pass network namespace reference into route add, delete and get
functions.

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h | 12 ++-
 net/ipv6/addrconf.c | 33 --
 net/ipv6/anycast.c  | 10 +
 net/ipv6/ndisc.c| 12 ++-
 net/ipv6/route.c| 54 +
 5 files changed, 66 insertions(+), 55 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 08b132381984..1130a1144dfd 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -101,8 +101,8 @@ void ip6_route_cleanup(void);
 int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg);
 
 int ip6_route_add(struct fib6_config *cfg, struct netlink_ext_ack *extack);
-int ip6_ins_rt(struct rt6_info *);
-int ip6_del_rt(struct rt6_info *);
+int ip6_ins_rt(struct net *net, struct rt6_info *rt);
+int ip6_del_rt(struct net *net, struct rt6_info *rt);
 
 void rt6_flush_exceptions(struct rt6_info *rt);
 int rt6_remove_exception_rt(struct rt6_info *rt);
@@ -137,7 +137,7 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev, struct flowi6 *fl6);
 
 void fib6_force_start_gc(struct net *net);
 
-struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
+struct rt6_info *addrconf_dst_alloc(struct net *net, struct inet6_dev *idev,
const struct in6_addr *addr, bool anycast);
 
 struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev,
@@ -147,9 +147,11 @@ struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev,
  * support functions for ND
  *
  */
-struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr,
+struct rt6_info *rt6_get_dflt_router(struct net *net,
+const struct in6_addr *addr,
 struct net_device *dev);
-struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr,
+struct rt6_info *rt6_add_dflt_router(struct net *net,
+const struct in6_addr *gwaddr,
 struct net_device *dev, unsigned int pref);
 
 void rt6_purge_dflt_routers(struct net *net);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index b2c0175125db..7ff7466c52e5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1037,7 +1037,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
goto out;
}
 
-   rt = addrconf_dst_alloc(idev, addr, false);
+   rt = addrconf_dst_alloc(net, idev, addr, false);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
rt = NULL;
@@ -1187,7 +1187,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long expires, bool del_r
   0, RTF_GATEWAY | RTF_DEFAULT);
if (rt) {
if (del_rt)
-   ip6_del_rt(rt);
+   ip6_del_rt(dev_net(ifp->idev->dev), rt);
else {
if (!(rt->rt6i_flags & RTF_EXPIRES))
rt6_set_expires(rt, expires);
@@ -2666,7 +2666,7 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao)
if (rt) {
/* Autoconf prefix route */
if (valid_lft == 0) {
-   ip6_del_rt(rt);
+   ip6_del_rt(net, rt);
rt = NULL;
} else if (addrconf_finite_timeout(rt_expires)) {
/* not infinity */
@@ -,7 +,8 @@ static void addrconf_gre_config(struct net_device *dev)
 }
 #endif
 
-static int fixup_permanent_addr(struct inet6_dev *idev,
+static int fixup_permanent_addr(struct net *net,
+   struct inet6_dev *idev,
struct inet6_ifaddr *ifp)
 {
/* !rt6i_node means the host route was removed from the
@@ -3343,7 +3344,7 @@ static int fixup_permanent_addr(struct inet6_dev *idev,
if (!ifp->rt || !ifp->rt->rt6i_node) {
struct rt6_info *rt, *prev;
 
-   rt = addrconf_dst_alloc(idev, &ifp->addr, false);
+   rt = addrconf_dst_alloc(net, idev, &ifp->addr, false);
if (IS_ERR(rt))
return PTR_ERR(rt);
 
@@ -3367,7 +3368,7 @@ static int fixup_permanent_addr(struct inet6_dev *idev,
return 0;
 }
 
-static void addrconf_permanent_addr(struct net_device *dev)
+static void addrconf_permanent_addr(struct net *net, struct net_device *dev)
 {
struct inet6_ifaddr *ifp, *tmp;
struct inet6_dev *idev;
@@ -3380,7 +3381,7 @@ static void addrconf_permanent_addr(struct net_device *dev)
 
list_for_each_entry_safe(ifp, tmp, &idev->addr_list, if_list) {
if ((ifp->flags & IFA_F_PERMANENT) &&
-   

[PATCH net-next v2 11/21] net/ipv6: move expires into rt6_info

2018-04-17 Thread David Ahern
Add expires to rt6_info for FIB entries, and add fib6 helpers to
manage it. Data path use of dst.expires remains.

The transition is fairly straightforward: when working with fib entries,
rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with
fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and
rt6_check_expired becomes fib6_check_expired, where the fib6 versions
are added by this patch.
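
As a rough illustration of the helper semantics described above, here is a minimal userspace sketch. It is not the kernel code: the struct name fib_entry_model and the model_* function names are hypothetical, the flag value is only a placeholder, and the current time is passed in as a parameter instead of being read from jiffies. The comparison follows the wraparound-safe idea behind the kernel's time_after().

```c
#include <stdbool.h>

/* Placeholder flag bit for this sketch; the real flag is defined by the kernel. */
#define MODEL_RTF_EXPIRES 0x00400000u

/* Hypothetical stand-in for the FIB entry fields the helpers touch. */
struct fib_entry_model {
	unsigned int  flags;    /* models rt6i_flags */
	unsigned long expires;  /* models the expires field added to rt6_info */
};

/* Wraparound-safe "a is later than b" on a free-running tick counter. */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Record an expiry time and mark the entry as expiring. */
static void model_set_expires(struct fib_entry_model *e, unsigned long expires)
{
	e->expires = expires;
	e->flags |= MODEL_RTF_EXPIRES;
}

/* Make the entry permanent again. */
static void model_clean_expires(struct fib_entry_model *e)
{
	e->flags &= ~MODEL_RTF_EXPIRES;
	e->expires = 0;
}

/* An entry is expired only if it carries the flag and "now" is past expires. */
static bool model_check_expired(const struct fib_entry_model *e,
				unsigned long now)
{
	if (e->flags & MODEL_RTF_EXPIRES)
		return model_time_after(now, e->expires);
	return false;
}
```

The point of keeping the flag and the timestamp together on the FIB entry is that a zero timestamp is never consulted unless the flag is set, so an entry without the flag never expires regardless of the clock value.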

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h | 27 +++
 net/ipv6/addrconf.c   |  6 +++---
 net/ipv6/ip6_fib.c|  8 
 net/ipv6/ndisc.c  |  2 +-
 net/ipv6/route.c  | 20 +++-
 5 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 1f8dc9d12abb..c73b985734f5 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -179,6 +179,7 @@ struct rt6_info {
should_flush:1,
unused:6;
 
+   unsigned long   expires;
struct dst_metrics  *fib6_metrics;
 #define fib6_pmtu  fib6_metrics->metrics[RTAX_MTU-1]
struct fib6_nh  fib6_nh;
@@ -197,6 +198,26 @@ static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
return ((struct rt6_info *)dst)->rt6i_idev;
 }
 
+static inline void fib6_clean_expires(struct rt6_info *f6i)
+{
+   f6i->rt6i_flags &= ~RTF_EXPIRES;
+   f6i->expires = 0;
+}
+
+static inline void fib6_set_expires(struct rt6_info *f6i,
+   unsigned long expires)
+{
+   f6i->expires = expires;
+   f6i->rt6i_flags |= RTF_EXPIRES;
+}
+
+static inline bool fib6_check_expired(const struct rt6_info *f6i)
+{
+   if (f6i->rt6i_flags & RTF_EXPIRES)
+   return time_after(jiffies, f6i->expires);
+   return false;
+}
+
 static inline void rt6_clean_expires(struct rt6_info *rt)
 {
rt->rt6i_flags &= ~RTF_EXPIRES;
@@ -211,11 +232,9 @@ static inline void rt6_set_expires(struct rt6_info *rt, unsigned long expires)
 
 static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
 {
-   struct rt6_info *rt;
+   if (!(rt0->rt6i_flags & RTF_EXPIRES) && rt0->from)
+   rt0->dst.expires = rt0->from->expires;
 
-   for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES); rt = rt->from);
-   if (rt && rt != rt0)
-   rt0->dst.expires = rt->dst.expires;
dst_set_expires(&rt0->dst, timeout);
rt0->rt6i_flags |= RTF_EXPIRES;
 }
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 483c8772e856..a156f1a0b1a7 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1190,7 +1190,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long expires, bool del_r
ip6_del_rt(dev_net(ifp->idev->dev), rt);
else {
if (!(rt->rt6i_flags & RTF_EXPIRES))
-   rt6_set_expires(rt, expires);
+   fib6_set_expires(rt, expires);
ip6_rt_put(rt);
}
}
@@ -2672,9 +2672,9 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao)
rt = NULL;
} else if (addrconf_finite_timeout(rt_expires)) {
/* not infinity */
-   rt6_set_expires(rt, jiffies + rt_expires);
+   fib6_set_expires(rt, jiffies + rt_expires);
} else {
-   rt6_clean_expires(rt);
+   fib6_clean_expires(rt);
}
} else if (valid_lft) {
clock_t expires = 0;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 0d94c56c3e41..f25f4d9831e8 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -906,9 +906,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
if (!(iter->rt6i_flags & RTF_EXPIRES))
return -EEXIST;
if (!(rt->rt6i_flags & RTF_EXPIRES))
-   rt6_clean_expires(iter);
+   fib6_clean_expires(iter);
else
-   rt6_set_expires(iter, rt->dst.expires);
+   fib6_set_expires(iter, rt->expires);
fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu);
return -EEXIST;
}
@@ -2003,8 +2003,8 @@ static int fib6_age(struct rt6_info *rt, void *arg)
 *  Routes are expired even if they are in use.
 */
 
-   if (rt->rt6i_flags & RTF_EXPIRES && rt->dst.expires) {
-  

Re: [PATCH net-next 2/2] openvswitch: Support conntrack zone limit

2018-04-17 Thread Yi-Hung Wei
> s/to commit/from committing/
> s/entry/entries/

Thanks, will fix that in both patches in v2.


> I think this is a great idea, but I suggest porting it to the iproute2 package
> so everyone can use it, and then getting rid of the OVS-specific prefixes.
> That presumes, of course, that the conntrack connection limit backend works
> there as well.  If it doesn't, then I'd suggest extending it.  This is a nice
> feature for all users in my opinion, and then OVS can take advantage of it
> as well.

Thanks for the comment.  And yes, I think the iptables connlimit
extension currently does support limiting the # of connections.  Users
need to configure the zone properly, and the iptables connlimit
extension already uses netfilter's nf_conncount backend.

The main goal of this patch is to use the netfilter backend
(nf_conncount) to count and limit the number of connections. OVS needs
the proposed OVS_CT_LIMIT netlink API and the corresponding bookkeeping
data structure because the current nf_conncount backend only counts
the # of connections; it does not keep track of the connection
limit itself.

Thanks,

-Yi-Hung


[PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-17 Thread Yi-Hung Wei
Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
VMs and containers that reside in the same namespace share the same
conntrack table, and the total # of conntrack entries for all of them
is limited by nf_conntrack_max.  In this case, if one of the
VMs/containers abuses the conntrack entries, it blocks the others from
committing valid conntrack entries into the conntrack table.  Even if
we could put each VM in a different network namespace, the current
nf_conntrack_max configuration is rigid: we cannot give different
VMs/containers different limits on the # of conntrack entries.

To address the aforementioned issue, this patch proposes a
fine-grained mechanism that can further limit the # of conntrack entries
per zone.  For example, we can assign a different zone to each VM
and set a conntrack limit on each zone.  With this isolation, a
misbehaving VM only consumes the conntrack entries in its own zone, and
it does not affect other well-behaved VMs.  Moreover, users can
set a different conntrack limit for each zone based on their preference.

The proposed implementation uses Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connections is above the configured limit, OVS returns ENOMEM to
userspace.  If userspace does not configure a zone limit, the limit
defaults to zero, which means no limit; this is backward compatible with
the behavior without this patch.
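
The enforcement rule in the paragraph above can be sketched in a few lines of userspace C. This is only an illustration under stated assumptions: zone_limit_check and CT_LIMIT_UNLIMITED are hypothetical names, and the sketch assumes the count handed in already includes the connection being committed.

```c
#include <errno.h>
#include <stdint.h>

/* A limit of zero means "no limit", which is the default. */
#define CT_LIMIT_UNLIMITED 0u

/*
 * Decide whether a connection may be committed to a zone.
 *
 * count_with_new: connections tracked in the zone, including the one
 *                 being committed (assumption of this sketch).
 * limit:          configured limit for the zone, or 0 for unlimited.
 *
 * Returns 0 if the commit is allowed, -ENOMEM if the zone is over its limit.
 */
static int zone_limit_check(uint32_t count_with_new, uint32_t limit)
{
	if (limit == CT_LIMIT_UNLIMITED)
		return 0;
	if (count_with_new > limit)
		return -ENOMEM;
	return 0;
}
```

Treating zero as "unlimited" is what keeps an unconfigured system behaving exactly as it did before: every zone starts with limit 0, so the check never rejects anything until userspace sets a limit.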

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.

Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  62 +
 net/openvswitch/Kconfig  |   3 +-
 net/openvswitch/conntrack.c  | 498 ++-
 net/openvswitch/conntrack.h  |   9 +-
 net/openvswitch/datapath.c   |   7 +-
 net/openvswitch/datapath.h   |   1 +
 6 files changed, 574 insertions(+), 6 deletions(-)

-- 
2.7.4



[PATCH net-next v2 1/2] openvswitch: Add conntrack limit netlink definition

2018-04-17 Thread Yi-Hung Wei
Define netlink messages and attributes to support user-kernel
communication for the conntrack limit feature.

Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h | 62 
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..ca63c16375ce 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,66 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+   OVS_CT_LIMIT_CMD_UNSPEC,
+   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
+   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
+   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+   OVS_CT_LIMIT_ATTR_UNSPEC,
+   OVS_CT_LIMIT_ATTR_OPTION,   /* Nested OVS_CT_LIMIT_ATTR_* */
+   __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+/**
+ * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
+ * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
+ * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
+ */
+enum ovs_ct_limit_option_attr {
+   OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_SET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_RLY, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   __OVS_CT_LIMIT_OPTION_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
+
+enum ovs_ct_zone_limit_attr {
+   OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,   /* u32 default conntrack limit
+* for all zones. */
+   OVS_CT_ZONE_LIMIT_ATTR_ZONE,/* u16 conntrack zone id. */
+   OVS_CT_ZONE_LIMIT_ATTR_LIMIT,   /* u32 max number of conntrack
+* entries allowed in the
+* corresponding zone. */
+   OVS_CT_ZONE_LIMIT_ATTR_COUNT,   /* u32 number of conntrack
+* entries in the corresponding
+* zone. */
+   __OVS_CT_ZONE_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4



[PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit

2018-04-17 Thread Yi-Hung Wei
Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
VMs and containers that reside in the same namespace share the same
conntrack table, and the total # of conntrack entries for all of them
is limited by nf_conntrack_max.  In this case, if one of the
VMs/containers abuses the conntrack entries, it blocks the others from
committing valid conntrack entries into the conntrack table.  Even if
we could put each VM in a different network namespace, the current
nf_conntrack_max configuration is rigid: we cannot give different
VMs/containers different limits on the # of conntrack entries.

To address the aforementioned issue, this patch proposes a
fine-grained mechanism that can further limit the # of conntrack entries
per zone.  For example, we can assign a different zone to each VM
and set a conntrack limit on each zone.  With this isolation, a
misbehaving VM only consumes the conntrack entries in its own zone, and
it does not affect other well-behaved VMs.  Moreover, users can
set a different conntrack limit for each zone based on their preference.

The proposed implementation uses Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connections is above the configured limit, OVS returns ENOMEM to
userspace.  If userspace does not configure a zone limit, the limit
defaults to zero, which means no limit; this is backward compatible with
the behavior without this patch.

The following high-level APIs are provided to userspace:
  - OVS_CT_LIMIT_CMD_SET:
    * set the default connection limit for all zones
    * set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
    * remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
    * get the default connection limit for all zones
    * get the connection limit for a particular zone
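
The three commands listed above can be modelled with a small table in userspace C. This is a sketch under assumptions, not the patch's implementation: all names here (ct_limit_model, model_set, model_del, model_get, MODEL_MAX_ZONES) are hypothetical, the table is a fixed-size array instead of a hash table, and the sketch assumes that a zone without its own limit reports the default limit.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_ZONES 16  /* capacity of this toy table */

struct zone_limit_model {
	bool     used;
	uint16_t zone;   /* conntrack zone id */
	uint32_t limit;  /* per-zone limit, 0 means unlimited */
};

struct ct_limit_model {
	uint32_t default_limit;  /* applies to zones without their own entry */
	struct zone_limit_model zones[MODEL_MAX_ZONES];
};

/* Return the slot index holding `zone`, or -1 if the zone has no entry. */
static int model_find(const struct ct_limit_model *m, uint16_t zone)
{
	for (int i = 0; i < MODEL_MAX_ZONES; i++)
		if (m->zones[i].used && m->zones[i].zone == zone)
			return i;
	return -1;
}

/* SET, default variant: change the limit used by unconfigured zones. */
static void model_set_default(struct ct_limit_model *m, uint32_t limit)
{
	m->default_limit = limit;
}

/* SET, per-zone variant: add or modify. Returns 0, or -1 if the table is full. */
static int model_set(struct ct_limit_model *m, uint16_t zone, uint32_t limit)
{
	int i = model_find(m, zone);

	if (i < 0) {
		for (i = 0; i < MODEL_MAX_ZONES && m->zones[i].used; i++)
			;
		if (i == MODEL_MAX_ZONES)
			return -1;
		m->zones[i].used = true;
		m->zones[i].zone = zone;
	}
	m->zones[i].limit = limit;
	return 0;
}

/* DEL: drop the per-zone entry so the zone falls back to the default. */
static void model_del(struct ct_limit_model *m, uint16_t zone)
{
	int i = model_find(m, zone);

	if (i >= 0)
		m->zones[i].used = false;
}

/* GET: per-zone limit if configured, otherwise the default limit. */
static uint32_t model_get(const struct ct_limit_model *m, uint16_t zone)
{
	int i = model_find(m, zone);

	return i >= 0 ? m->zones[i].limit : m->default_limit;
}
```

Note that in this model DEL does not set a zone to "unlimited"; it only removes the override, so the zone is governed by whatever the default limit is at lookup time.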

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/Kconfig |   3 +-
 net/openvswitch/conntrack.c | 498 +++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   1 +
 5 files changed, 512 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-(!NF_NAT_IPV6 || NF_NAT_IPV6)))
+(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+(!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c5904f629091..d09b572f72b4 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -17,7 +17,9 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +78,38 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+
+struct ovs_ct_limit {
+   /* Elements in ovs_ct_limit_info->limits hash table */
+   struct hlist_node hlist_node;
+   struct rcu_head rcu;
+   u16 zone;
+   u32 limit;
+};
+
+struct ovs_ct_limit_info {
+   u32 default_limit;
+   struct hlist_head *limits;
+   struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_LIMIT_ATTR_OPTION] = { .type = NLA_NESTED, },
+};
+
+static const struct nla_policy
+   ct_zone_limit_policy[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_ZONE] = { .type = NLA_U16, },
+   [OVS_CT_ZONE_LIMIT_ATTR_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_COUNT] = { .type = NLA_U32, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1070,94 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
return false;
 }
 
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+   const struct ovs_ct_limit_info *info, u16 zone)
+{
+   return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */

[PATCH v2] net: change the comment of dev_mc_init

2018-04-17 Thread sunlianwen

The kernel-doc comment of dev_mc_init() is wrong: it names dev_mc_flush
instead of dev_mc_init.

Signed-off-by: Lianwen Sun 
---
 net/core/dev_addr_lists.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index e3e6a3e2ca22..d884d8f5f0e5 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -839,7 +839,7 @@ void dev_mc_flush(struct net_device *dev)
 EXPORT_SYMBOL(dev_mc_flush);

 /**
- * dev_mc_flush - Init multicast address list
+ * dev_mc_init - Init multicast address list
  * @dev: device
  *
  * Init multicast address list.
--
2.17.0




Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-17 Thread Siwei Liu
I ran this by a few folks offline and gathered some good feedback
that I'd like to share to revive the discussion.

First of all, as illustrated in the reply below, cloud service
providers require transparent live migration. Specifically, the main
target of our case is to support SR-IOV live migration via kernel
upgrade while keeping the userspace of old distros unmodified. If it's
because this use case is not appealing enough for the mainline to
adopt, I will shut up and not continue discussing, although
technically it's entirely possible (and there's precedent in other
implementation) to do so to benefit any cloud service providers.

If it's just that the implementation of hiding the netdev itself needs to be
improved, such as implementing it as an attribute flag or adding a linkdump
API, that's completely fine and we can look into that. However, the
specific issue that needs to be understood beforehand is how to let the
transparent SR-IOV device take over the name (and so inherit all the configs)
from the lower netdev, which needs some games with uevents and name
space reservation. So far I don't think it's been well discussed.

One thing in particular I'd like to point out is that the 3-netdev
model currently fails to address the core problem of live migration:
migration of hardware-specific features/state, e.g. ethtool configs
and hardware offloading state. Only general network state (IP
address, gateway, etc.) associated with the bypass interface can be
migrated. As follow-up work, the bypass driver can/should be enhanced to
save and apply those hardware-specific configs before or after
migration as needed. The transparent 1-netdev model being proposed as
part of this patch series will be able to solve that problem naturally
by making all hardware-specific configuration go through the central
bypass driver, such that hardware configuration can be replayed when
a new VF or passthrough device gets plugged back in. Although the
corresponding function hasn't been implemented yet, I'd like to
remind everyone that this is the core problem any live migration
proposal should address.

If it would make things clearer to defer netdev hiding until all
functionality for centralizing and replaying configs is implemented,
we'd take that advice and move on to implementing those features
as follow-up patches. Once all needed features are done, we'd resume
the work on hiding the lower netdev. I think it would be
best for everyone to understand the big picture in advance before
going too far.

Thanks, comments welcome.

-Siwei


On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu  wrote:
> On Sun, Apr 8, 2018 at 9:32 AM, David Miller  wrote:
>> From: Siwei Liu 
>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>
>>> And I assume everyone here understands the use case for live
>>> migration (in the context of providing cloud service) is very
>>> different, and we have to hide the netdevs. If not, I'm more than
>>> happy to clarify.
>>
>> I think you still need to clarify.
>
> OK. The short answer is cloud users really want *transparent* live migration.
>
> By being transparent it means they don't and shouldn't care about the
> existence and the occurrence of live migration, but they do if
> userspace toolstack and libraries have to be updated or modified,
> which means potential dependency breakage for their applications. They
> don't like any change to the userspace environment (existing apps
> lift-and-shift, no recompilation, no re-packaging, no re-certification
> needed), while hardly anyone cares about ABI or API compatibility at
> the kernel level, as long as their applications don't break.
>
> I agree the current bypass solution for SR-IOV live migration requires
> guest cooperation, though that doesn't mean guest *userspace*
> cooperation. As a matter of fact, technically it shouldn't involve
> userspace at all to get SR-IOV migration working. It's the kernel that
> does the real work. If I understand the goal of this in-kernel
> approach correctly, it was meant to save userspace from modification
> or corresponding toolstack support, as those 2 additional interfaces
> are more a side product of this approach than something necessary
> for users to be aware of. All the user needs to deal with is one
> single interface, and that's what they care about. It's more trouble
> than help when they see 2 extra interfaces present. Management
> tools in the old distros don't recognize them and try to bring up
> those extra interfaces on their own. Various odd warnings start to spew
> out, and there are a lot of caveats for the users to get around...
>
> On the other hand, if we "teach" those cloud users to update the
> userspace toolstack just for trading a feature they don't need, no one
> is likely going to embrace the change. As such there's just no real
> value of adopting this in-kernel bypass facility for any cloud service
> provider. It does not 

Re: One question about __tcp_select_window()

2018-04-17 Thread Wang Jian
Thanks for your reply, Eric.

Actually, this is a question that came up while I was reading the code.
From my instinct and the comment, I think we should choose the bigger
one, but maybe I am missing something (like you said, autotuning).
Anyway, I will read more code and do more tests.

Thanks.

On Tue, Apr 17, 2018 at 10:43 PM, Eric Dumazet  wrote:
>
>
> On 04/17/2018 06:53 AM, Wang Jian wrote:
>> I tested the fix with 4.17.0-rc1+ and it seems to work.
>>
>> 1. iperf -c IP -i 20 -t 60 -w 1K
>>  with-fix vs without-fix : 1.15Gbits/sec vs 1.05Gbits/sec
>> I also tried other windows and got similar results.
>>
>> 2. Use tcp probe trace snd_wind.
>> with-fix vs without-fix: 1245568 vs 1042816
>>
>> 3. I don't see extra retransmit/drops.
>>
>
> Unfortunately I have no idea what exact problem you had to solve.
>
> Setting small windows is not exactly the path we are taking.
>
> And I do not know how many side effects your change will have for
> 'standard' flows using autotuning or sane windows.
>


Re: [RFC PATCH v3 bpf-next 2/5] bpf/verifier: rewrite subprog boundary detection

2018-04-17 Thread Alexei Starovoitov
On Fri, Apr 06, 2018 at 06:13:59PM +0100, Edward Cree wrote:
> By storing a subprogno in each insn's aux data, we avoid the need to keep
>  the list of subprog starts sorted or bsearch() it in find_subprog().
> Also, get rid of the weird one-based indexing of subprog numbers.
> 
> Signed-off-by: Edward Cree 
> ---
>  include/linux/bpf_verifier.h |   3 +-
>  kernel/bpf/verifier.c        | 284 ++-
>  2 files changed, 177 insertions(+), 110 deletions(-)
> 
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 8f70dc181e23..17990dd56e65 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -146,6 +146,7 @@ struct bpf_insn_aux_data {
>   s32 call_imm;   /* saved imm field of call insn */
>   };
>   int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
> + u16 subprogno; /* subprog in which this insn resides */
>   bool seen; /* this insn was processed by the verifier */
>  };

As I was saying before, this is a no-go.
subprogno is meaningless in the hierarchy of: prog -> func -> bb -> insn
Soon bpf will have libraries and this field would need to become
a pointer back to the bb or func structure, creating an unnecessary circular dependency.
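
For readers following the thread, the two lookup strategies under discussion can be sketched in userspace C. This is only an illustration of the trade-off, not verifier code and not an endorsement of either side: all names are hypothetical, and the sketch assumes the start list is sorted, begins with instruction 0, and uses zero-based subprog numbers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparator for bsearch() over sorted instruction indices. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/*
 * Strategy 1: keep a sorted array of subprog start indices and look up
 * an exact start with bsearch(). Returns the subprog number, or -1 if
 * insn_idx is not the first instruction of any subprog.
 */
static int find_subprog_bsearch(const int *starts, size_t n, int insn_idx)
{
	const int *p = bsearch(&insn_idx, starts, n, sizeof(*starts), cmp_int);

	return p ? (int)(p - starts) : -1;
}

/* Strategy 2: annotate every instruction with the subprog it belongs to. */
struct insn_aux_model {
	unsigned short subprogno;
};

/* One linear pass fills the annotation; later lookups are a plain array read. */
static void annotate_subprogs(struct insn_aux_model *aux, size_t insn_cnt,
			      const int *starts, size_t n)
{
	size_t cur = 0;

	for (size_t i = 0; i < insn_cnt; i++) {
		/* Move to the next subprog once its first instruction is reached. */
		while (cur + 1 < n && (size_t)starts[cur + 1] <= i)
			cur++;
		aux[i].subprogno = (unsigned short)cur;
	}
}
```

The first strategy stores nothing per instruction but needs the start list kept sorted; the second answers "which subprog is this instruction in" for any instruction in constant time, at the cost of tying each instruction's metadata to a subprog number.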



Re: [PATCH net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 08:43:31AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/16/2018 08:33 AM, Yafang Shao wrote:
> > tcp_rcv_space_adjust is called every time data is copied to user space,
> > so introducing a tcp tracepoint for it could show us when the packet is
> > copied to user.
> > This could help us figure out whether there's latency in the user process.
> > 
> > When a tcp packet arrives, tcp_rcv_established() will be called, and with
> > the existing tracepoint tcp_probe we can get the time when this packet
> > arrives.
> > Then this packet will be copied to user, tcp_rcv_space_adjust will
> > be called, and with this newly introduced tracepoint we can get the time
> > when this packet is copied to user.
> > 
> > arrival time : user process time    => latency caused by user
> > tcp_probe      tcp_rcv_space_adjust
> > 
> > Hence in the printk message, sk is printed as a key to connect these two
> > tracepoints.
> > 
> 
> socket pointer is not a key.
> 
> TCP sockets can be reused pretty fast after free.
> 
> I suggest you go for the cookie instead; this is a unique 64-bit identifier.
> ( see sock_gen_cookie() for details )

I think it would be even better if the stack did this sock_gen_cookie()
on its own, in some way that the user cannot infer the order.
In many cases we wanted to use the socket cookie, but since it's not initialized
by default it's kind of useless.
Turning this tracepoint on just to get the cookie would be an ugly workaround.
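
To make the pointer-versus-cookie point concrete, here is a minimal userspace sketch of a lazily assigned cookie using C11 atomics. It is a model of the idea only: sock_model, cookie_gen and model_gen_cookie are hypothetical names, and the sketch makes no claim about how the kernel orders or hides cookie values.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Global generator; the first cookie handed out is 1, so 0 can mean "unset". */
static _Atomic uint64_t cookie_gen;

/* Hypothetical stand-in for a socket; only the cookie field matters here. */
struct sock_model {
	_Atomic uint64_t cookie;  /* 0 until a cookie is first requested */
};

/*
 * Return the socket's cookie, assigning one on first use. If two callers
 * race, the compare-and-exchange lets exactly one value win and both
 * callers end up returning that same value on the next loop iteration.
 */
static uint64_t model_gen_cookie(struct sock_model *sk)
{
	for (;;) {
		uint64_t cur = atomic_load(&sk->cookie);

		if (cur)
			return cur;

		uint64_t fresh = atomic_fetch_add(&cookie_gen, 1) + 1;
		uint64_t expected = 0;

		/* Publish only if nobody assigned a cookie in the meantime. */
		atomic_compare_exchange_strong(&sk->cookie, &expected, fresh);
	}
}
```

Unlike a socket pointer, a value drawn from a monotonically increasing 64-bit counter is not reused when the memory is freed and reallocated, which is why it works as a key for joining two tracepoints. The lazy assignment is also what the complaint above is about: until someone calls the generator for a socket, its cookie is still zero.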



Re: fix for bnx2x panic during ethtool reporting

2018-04-17 Thread Florian Fainelli
+netdev, Ariel,

On 04/17/2018 10:21 AM, Sebastian Kuzminsky wrote:
> "ethtool -i" on a bnx2x interface causes kernel panic when the
> firmware version is longer than expected.  The attached patch fixes
> the problem by simplifying the string handling in bnx2x_fill_fw_str().
> It applies cleanly to 4.14 and 4.17-rc1.

If you want to have a chance of getting your patch included, you should
make sure you copy the driver maintainers and the network mailing list,
which I am doing here.
-- 
Florian


Re: [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation

2018-04-17 Thread Jakub Kicinski
On Tue, 17 Apr 2018 16:23:48 +0300, Or Gerlitz wrote:
> On Thu, Mar 22, 2018 at 1:55 PM, Jiri Pirko  wrote:
> > From: Jiri Pirko 
> >
> > This patchset resolves 2 issues we have right now:
> > 1) There are many netdevices / ports in the system, for port, pf, vf
> >    representation but the user has no way to see which is which
> > 2) The ndo_get_phys_port_name is implemented in each driver separately,
> >    which may lead to inconsistent names between drivers.
> >
> > This patchset introduces port flavours which should address the first
> > problem. I'm testing this with Netronome nfp hardware. When the user
> > has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:  
> 
> J/J (Jiri/Jakub) --
> 
> re "2 physical ports, 1 pf, and 4 vfs" --- does NFP expose one PF for
> both physical ports?

Yes there are multiple PCIe PFs on the card, but the basic CX card just
uses one for all uplinks (like mlx4).

> FWIW note that in mlx5 and AFAIK any other device except for mlx4 (...)
> folks have FPP (Function Per Port) scheme.
> 
> [..]
> 
> > The desired output should look like this:
> > # devlink port
> > pci/:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
> > pci/:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
> > pci/:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
> > pci/:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
> > pci/:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
> > pci/:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
> > pci/:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3
> > As you can see, the netdev names are generated according to the flavour
> > and port number. In case the port is split, the split subnumber is also 
> > included.  
> 
> What is the purpose/role in getting devlink ports here? Is it such
> that at the end of the day the driver would do a
> devlink_port_get_phys_port_name() call in their get phys port name ndo?
> Or do we buy more advantages out of doing so?

IMHO having a way to get all netdevs and the netdev type from devlink is
quite user friendly.  As of today we also use the devlink ports for port
splitting on 40/100G parts.  Hopefully more functionality migrates over
from ethtool over time.


[PATCH net] net: qualcomm: rmnet: Fix warning seen with fill_info

2018-04-17 Thread Subash Abhinov Kasiviswanathan
When the last rmnet device attached to a real device is removed, the
real device is unregistered from rmnet. As a result, the real device
lookup fails resulting in a warning when the fill_info handler is
called as part of the rmnet device unregistration.

Fix this by returning the rmnet flags as 0 when no real device is
present.

WARNING: CPU: 0 PID: 1779 at net/core/rtnetlink.c:3254
rtmsg_ifinfo_build_skb+0xca/0x10d
Modules linked in:
CPU: 0 PID: 1779 Comm: ip Not tainted 4.16.0-11872-g7ce2367 #1
Stack:
 7fe655f0 60371ea3  
 60282bc6 6006b116 7fe65600 60371ee8
 7fe65660 6003a68c  9
Call Trace:
 [<6006b116>] ? printk+0x0/0x94
 [<6001f375>] show_stack+0xfe/0x158
 [<60371ea3>] ? dump_stack_print_info+0xe8/0xf1
 [<60282bc6>] ? rtmsg_ifinfo_build_skb+0xca/0x10d
 [<6006b116>] ? printk+0x0/0x94
 [<60371ee8>] dump_stack+0x2a/0x2c
 [<6003a68c>] __warn+0x10e/0x13e
 [<6003a82c>] warn_slowpath_null+0x48/0x4f
 [<60282bc6>] rtmsg_ifinfo_build_skb+0xca/0x10d
 [<60282c4d>] rtmsg_ifinfo_event.part.37+0x1e/0x43
 [<60282c2f>] ? rtmsg_ifinfo_event.part.37+0x0/0x43
 [<60282d03>] rtmsg_ifinfo+0x24/0x28
 [<60264e86>] dev_close_many+0xba/0x119
 [<60282cdf>] ? rtmsg_ifinfo+0x0/0x28
 [<6027c225>] ? rtnl_is_locked+0x0/0x1c
 [<6026ca67>] rollback_registered_many+0x1ae/0x4ae
 [<600314be>] ? unblock_signals+0x0/0xae
 [<6026cdc0>] ? unregister_netdevice_queue+0x19/0xec
 [<6026ceec>] unregister_netdevice_many+0x21/0xa1
 [<6027c765>] rtnl_delete_link+0x3e/0x4e
 [<60280ecb>] rtnl_dellink+0x262/0x29c
 [<6027c241>] ? rtnl_get_link+0x0/0x3e
 [<6027f867>] rtnetlink_rcv_msg+0x235/0x274

Fixes: be81a85f5f87 ("net: qualcomm: rmnet: Implement fill_info")
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index d339885..5f4e447 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -350,15 +350,16 @@ static int rmnet_fill_info(struct sk_buff *skb, const struct net_device *dev)
 
real_dev = priv->real_dev;
 
-   if (!rmnet_is_real_dev_registered(real_dev))
-   return -ENODEV;
-
if (nla_put_u16(skb, IFLA_RMNET_MUX_ID, priv->mux_id))
goto nla_put_failure;
 
-   port = rmnet_get_port_rtnl(real_dev);
+   if (rmnet_is_real_dev_registered(real_dev)) {
+   port = rmnet_get_port_rtnl(real_dev);
+   f.flags = port->data_format;
+   } else {
+   f.flags = 0;
+   }
 
-   f.flags = port->data_format;
f.mask  = ~0;
 
if (nla_put(skb, IFLA_RMNET_FLAGS, sizeof(f), &f))
-- 
1.9.1



Re: [PATCH] samples/bpf: correct comment in sock_example.c

2018-04-17 Thread Alexei Starovoitov
On Tue, Apr 17, 2018 at 10:25:20AM +0800, Wang Sheng-Hui wrote:
> The program runs against the loopback interface "lo", not "eth0".
> Correct the comment.
> 
> Signed-off-by: Wang Sheng-Hui 

Acked-by: Alexei Starovoitov 

for future patches please use the following format for the subject:
[PATCH bpf-next] samples/bpf: ...



Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

2018-04-17 Thread Ben Greear

On 01/24/2018 03:59 PM, Ben Greear wrote:

On 06/20/2017 08:03 PM, David Ahern wrote:

On 6/20/17 5:41 PM, Ben Greear wrote:

On 06/20/2017 11:05 AM, Michal Kubecek wrote:

On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:

On 06/14/2017 03:25 PM, David Ahern wrote:

On 6/14/17 4:23 PM, Ben Greear wrote:

On 06/13/2017 07:27 PM, David Ahern wrote:


Let's try a targeted debug patch. See attached


I had to change it to pr_err so it would go to our serial console
since the system locked hard on crash,
and that appears to be enough to change the timing where we can no
longer
reproduce the problem.



ok, let's figure out which one is doing that. There are 3 debug
statements. I suspect fib6_del_route is the one setting the state to
FWS_U. Can you remove the debug prints in fib6_repair_tree and
fib6_walk_continue and try again?


We cannot reproduce with just that one printf in the kernel either.  It
must change the timing too much to trigger the bug.


You might try trace_printk() which should have less impact (don't forget
to enable /proc/sys/kernel/ftrace_dump_on_oops).


We cannot reproduce with trace_printk() either.


I think that suggests the walker state is set to FWS_U in
fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
triggers the fault -- the null parent (pn = fn->parent). So we have the
2 areas of code that are interacting.

I'm on a road trip through the end of this week with little time to
focus on this problem. I'll get back to you with another suggestion when I can.


FYI, problem still happens in 4.16.  I'm going to re-enable my hack below
for this kernel as well...I had hopes it might be fixed...

BUG: unable to handle kernel NULL pointer dereference at 8
IP: fib6_walk_continue+0x5b/0x140 [ipv6]
PGD 8007dfc0c067 P4D 8007dfc0c067 PUD 7e66ff067 PMD 0
Oops:  [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 
libcrc32c vrf]
CPU: 3 PID: 15117 Comm: ip Tainted: G   O 4.16.0+ #5
Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
RSP: 0018:c90008c3bc10 EFLAGS: 00010287
RAX: 88085ac45050 RBX: 8807e03008a0 RCX: 
RDX:  RSI: c90008c3bc48 RDI: 8232b240
RBP: 880819167600 R08: 0008 R09: 8807dff10071
R10: c90008c3bbd0 R11:  R12: 8807e03008a0
R13: 0002 R14: 8807e05744c8 R15: 8807e08ef000
FS:  7f2f04342700() GS:88087fcc() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0008 CR3: 0007e0556002 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 inet6_dump_fib+0x14b/0x2c0 [ipv6]
 netlink_dump+0x216/0x2a0
 netlink_recvmsg+0x254/0x400
 ? copy_msghdr_from_user+0xb5/0x110
 ___sys_recvmsg+0xe9/0x230
 ? find_held_lock+0x3b/0xb0
 ? __handle_mm_fault+0x617/0x1180
 ? __audit_syscall_entry+0xb3/0x110
 ? __sys_recvmsg+0x39/0x70
 __sys_recvmsg+0x39/0x70
 do_syscall_64+0x63/0x120
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f2f03a72030
RSP: 002b:7fffab3de508 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fffab3e641c RCX: 7f2f03a72030
RDX:  RSI: 7fffab3de570 RDI: 0004
RBP:  R08: 7e6c R09: 7fffab3e63a8
R10: 7fffab3de5b0 R11: 0246 R12: 7fffab3e6608
R13: 0066b460 R14: 7e6c R15: 
Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 
89 53 2c c7 4
RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: c90008c3bc10
CR2: 0008
---[ end trace bd03458864eb266c ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.



So, though I don't know the right way to fix it, the patch below appears
to make the system not crash.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
pn = fn->parent;
w->node = pn;
 #ifdef CONFIG_IPV6_SUBTREES
+   if (WARN_ON_ONCE(!pn)) {
+   pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
+  w, fn, pn);
+   /* Attempt to work around crash that has been here forever. --Ben */
+   return 0;
+   }
if (FIB6_SUBTREE(pn) == fn) {
WARN_ON(!(fn->fn_flags & RTN_ROOT));
w->state = FWS_L;



The printout looks like this (when adding 4000 

Re: SRIOV switchdev mode BoF minutes

2018-04-17 Thread Jakub Kicinski
On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote:
> There is also a school of thought that the VF reps could be
> pre-allocated on the SmartNIC so that any application processing that
> traffic would sit idle when no traffic arrives on the rep, but could
> process frames that do arrive when the VFs were created on the host.
> This implementation will depend on how resources are allocated on a
> given bit of hardware, but can really work well.

+1 if there are no FW resource allocation issues IMHO it's okay to
just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/
controller.  The reprs should just show link down as if PCIe cable
was unplugged until host actually enables them.

A similar issue exists on multi-host for PFs, right?  If one of the
hosts is down do we still show their PF repr?  IMHO yes.

That makes the thing look more like a switch with cables being plugged
in and out.


Re: [PATCH v2 bpf-next 1/3] bpftool: Support new prog types and attach types

2018-04-17 Thread Jakub Kicinski
On Tue, 17 Apr 2018 10:28:44 -0700, Andrey Ignatov wrote:
> Add recently added prog types to `bpftool prog` and attach types to
> `bpftool cgroup`.
> 
> Update bpftool documentation and bash completion appropriately.
> 
> Signed-off-by: Andrey Ignatov 

Acked-by: Jakub Kicinski 

Thank you!!


Re: [PATCH bpf-next 08/10] [bpf]: make netronome nfp compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 11:51:29PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for nfp driver we will just calculate packet's length unconditionally
> 
> Signed-off-by: Nikita V. Shirokov 
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> index 1eb6549f2a54..d9111c077699 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> @@ -1722,7 +1722,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
>  
>   act = bpf_prog_run_xdp(xdp_prog, &xdp);
>  
> - pkt_len -= xdp.data - orig_data;
> + pkt_len = xdp.data_end - xdp.data;

Looks correct, but Jakub please review.
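
To illustrate why the old expression is not enough once the tail can move,
here is a small userspace sketch. It is not driver code: model_xdp and both
helpers are hypothetical names, and the only thing modelled is the pointer
arithmetic from the hunk above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two xdp_buff pointers that matter here. */
struct model_xdp {
	unsigned char *data;
	unsigned char *data_end;
};

/* Old style: only accounts for movement of the head pointer. */
static size_t len_head_only(size_t pkt_len, const struct model_xdp *xdp,
			    const unsigned char *orig_data)
{
	return (size_t)((ptrdiff_t)pkt_len - (xdp->data - orig_data));
}

/* New style: recompute from both pointers, so head and tail changes count. */
static size_t len_unconditional(const struct model_xdp *xdp)
{
	assert(xdp->data <= xdp->data_end);
	return (size_t)(xdp->data_end - xdp->data);
}
```

With a 64-byte buffer, moving data forward by 14 and data_end back by 10
gives 50 from the head-only version (the tail change is missed) and 40 from
the unconditional one.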



Re: [PATCH bpf-next 06/10] [bpf]: make bnxt compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 11:51:27PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for bnxt driver we will just calculate packet's length unconditionally
> 
> Signed-off-by: Nikita V. Shirokov 

Acked-by: Alexei Starovoitov 



Re: [PATCH bpf-next 07/10] [bpf]: make cavium thunder compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 11:51:28PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for cavium's thunder driver we will just calculate packet's length
> unconditionally
> 
> Signed-off-by: Nikita V. Shirokov 

Acked-by: Alexei Starovoitov 



Re: [PATCH bpf-next 05/10] [bpf]: make mlx4 compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 11:51:26PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for mlx4 driver we will just calculate packet's length unconditionally
> (the same way as it's already being done in mlx5)
> 
> Signed-off-by: Nikita V. Shirokov 

Acked-by: Alexei Starovoitov 



Re: [PATCH bpf-next 04/10] [bpf]: make generic xdp compatible w/ bpf_xdp_adjust_tail

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 11:51:25PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for generic XDP we need to reflect this packet's length change by
> adjusting skb's tail pointer
> 
> Signed-off-by: Nikita V. Shirokov 

Acked-by: Alexei Starovoitov 

pls also change the order of the test/sample patches.
they should come last, since they will work only after this one
and all other driver support.



Re: [PATCH bpf-next 02/10] [bpf]: adding tests for bpf_xdp_adjust_tail

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 11:51:23PM -0700, Nikita V. Shirokov wrote:
> adding selftests for bpf_xdp_adjust_tail helper. in this synthetic test
> we are testing that 1) if data_end < data helper will return EINVAL
> 2) for normal use case packet's length would be reduced.
> 
> aside from adding new tests i'm changing behaviour of bpf_prog_test_run
> so it would recalculate packet's length if only data_end pointer was
> changed
> 
> Signed-off-by: Nikita V. Shirokov 
> ---
>  net/bpf/test_run.c |  3 ++-
>  tools/include/uapi/linux/bpf.h | 11 -
>  tools/testing/selftests/bpf/Makefile   |  2 +-
>  tools/testing/selftests/bpf/bpf_helpers.h  |  3 +++
>  tools/testing/selftests/bpf/test_adjust_tail.c | 29 +++
>  tools/testing/selftests/bpf/test_progs.c   | 32 ++
>  6 files changed, 77 insertions(+), 3 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c
> 
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index 2ced48662c1f..68c3578343b4 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -170,7 +170,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
>   xdp.rxq = &rxqueue->xdp_rxq;
>  
>   retval = bpf_test_run(prog, &xdp, repeat, &duration);
> - if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN)
> + if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN ||
> + xdp.data_end != xdp.data + size)

please split fixing prog_test_run for adjust_tail into separate patch
and selftests into another one.

>   size = xdp.data_end - xdp.data;
>   ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration);
>   kfree(data);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 9d07465023a2..9a2d1a04eb24 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -755,6 +755,13 @@ union bpf_attr {
>   * @addr: pointer to struct sockaddr to bind socket to
>   * @addr_len: length of sockaddr structure
>   * Return: 0 on success or negative error code
> + *
> + * int bpf_xdp_adjust_tail(xdp_md, delta)
> + * Adjust the xdp_md.data_end by delta. Only shrinking of packet's
> + * size is supported.
> + * @xdp_md: pointer to xdp_md
> + * @delta: A negative integer to be added to xdp_md.data_end
> + * Return: 0 on success or negative on error
>   */
>  #define __BPF_FUNC_MAPPER(FN)\
>   FN(unspec), \
> @@ -821,7 +828,8 @@ union bpf_attr {
>   FN(msg_apply_bytes),\
>   FN(msg_cork_bytes), \
>   FN(msg_pull_data),  \
> - FN(bind),
> + FN(bind),   \
> + FN(xdp_adjust_tail),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -864,6 +872,7 @@ enum bpf_func_id {
>  /* BPF_FUNC_skb_set_tunnel_key flags. */
>  #define BPF_F_ZERO_CSUM_TX   (1ULL << 1)
>  #define BPF_F_DONT_FRAGMENT  (1ULL << 2)
> +#define BPF_F_SEQ_NUMBER (1ULL << 3)

William Tu missed adding it to tools/include/uapi/bpf.h when it was added
to main uapi/bpf.h
but don't add it as part of this patch.
I saw a separate patch for this passing by in tip tree from Arnaldo.
I'm not sure how quickly it will get into Linus tree,
let's not create extra merge conflicts.

>  
>  /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
>   * BPF_FUNC_perf_event_read_value flags.
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 0a315ddabbf4..3e819dc70bee 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -31,7 +31,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
>   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
>   test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
>   sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
> - sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o
> + sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o
>  
>  # Order correspond to 'make run_tests' order
>  TEST_PROGS := test_kmod.sh \
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index d8223d99f96d..50c607014b22 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -96,6 +96,9 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
>   (void *) BPF_FUNC_msg_pull_data;
>  static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
>   (void *) BPF_FUNC_bind;
> +static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
> + (void *) 

Re: [PATCH v2 8/8] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)

2018-04-17 Thread Michael Schmitz
Hi Andrew,

On Wed, Apr 18, 2018 at 1:26 AM, Andrew Lunn  wrote:
> On Tue, Apr 17, 2018 at 02:08:15PM +1200, Michael Schmitz wrote:
>> Add platform device driver to populate the ax88796 platform data from
>> information provided by the XSurf100 zorro device driver.
>> This driver will have to be loaded before loading the ax88796 module,
>> or compiled as built-in.
>>
>> Signed-off-by: Michael Karcher 
>> Signed-off-by: Michael Schmitz 
>> ---
>>  drivers/net/ethernet/8390/Kconfig|   14 +-
>>  drivers/net/ethernet/8390/Makefile   |1 +
>>  drivers/net/ethernet/8390/xsurf100.c |  411 ++
>>  3 files changed, 425 insertions(+), 1 deletions(-)
>>  create mode 100644 drivers/net/ethernet/8390/xsurf100.c
>>
>> diff --git a/drivers/net/ethernet/8390/Kconfig b/drivers/net/ethernet/8390/Kconfig
>> index fdc6734..0cadd45 100644
>> --- a/drivers/net/ethernet/8390/Kconfig
>> +++ b/drivers/net/ethernet/8390/Kconfig
>> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>>
>>  config AX88796
>>   tristate "ASIX AX88796 NE2000 clone support"
>> - depends on (ARM || MIPS || SUPERH)
>> + depends on (ARM || MIPS || SUPERH || AMIGA)
>
> Hi Michael
>
> Will it compile on other platforms? If so, it is a good idea to add
> COMPILE_TEST as well.

I suppose it will - nothing in there that wouldn't be portable. Well,
let's find out, shall we?

Cheers,

  Michael

>
>  Andrew


Re: [PATCH bpf-next 01/10] [bpf]: adding bpf_xdp_adjust_tail helper

2018-04-17 Thread Alexei Starovoitov
On Mon, Apr 16, 2018 at 11:51:22PM -0700, Nikita V. Shirokov wrote:
> Adding new bpf helper which would allow us to manipulate
> xdp's data_end pointer, and allow us to reduce packet's size
> indended use case: to generate ICMP messages from XDP context,
> where such message would contain truncated original packet.
> 
> Signed-off-by: Nikita V. Shirokov 
> ---
>  include/uapi/linux/bpf.h | 10 +-
>  net/core/filter.c| 29 -
>  2 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c5ec89732a8d..9a2d1a04eb24 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -755,6 +755,13 @@ union bpf_attr {
>   * @addr: pointer to struct sockaddr to bind socket to
>   * @addr_len: length of sockaddr structure
>   * Return: 0 on success or negative error code
> + *
> + * int bpf_xdp_adjust_tail(xdp_md, delta)
> + * Adjust the xdp_md.data_end by delta. Only shrinking of packet's
> + * size is supported.
> + * @xdp_md: pointer to xdp_md
> + * @delta: A negative integer to be added to xdp_md.data_end
> + * Return: 0 on success or negative on error
>   */
>  #define __BPF_FUNC_MAPPER(FN)\
>   FN(unspec), \
> @@ -821,7 +828,8 @@ union bpf_attr {
>   FN(msg_apply_bytes),\
>   FN(msg_cork_bytes), \
>   FN(msg_pull_data),  \
> - FN(bind),
> + FN(bind),   \
> + FN(xdp_adjust_tail),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d31aff93270d..6c8ac7b548d6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2717,6 +2717,30 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>   .arg2_type  = ARG_ANYTHING,
>  };
>  
> +BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
> +{
> + /* only shrinking is allowed for now. */
> + if (unlikely(offset > 0))
> + return -EINVAL;

why allow offset == 0 ?
It's a nop. xdp_adjust_head allows it, but it's not a reason
to repeat the same here.
Like we may decide to do something with offset==0 in the future.
Let's keep it reserved.
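
A minimal userspace sketch of the check being asked for, under stated
assumptions: model_xdp_buff and model_xdp_adjust_tail are hypothetical
names, and the lower-bound test is modelled on the selftest description
(EINVAL when data_end would end up below data), not copied from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the two xdp_buff pointers the helper touches. */
struct model_xdp_buff {
	unsigned char *data;
	unsigned char *data_end;
};

/* Model of a tail-adjust helper that only allows strict shrinking. */
static int model_xdp_adjust_tail(struct model_xdp_buff *xdp, int offset)
{
	assert(xdp && xdp->data <= xdp->data_end);

	/* offset == 0 is rejected as well, so it stays reserved. */
	if (offset >= 0)
		return -EINVAL;

	/* Shrinking by more than the current length would cross data. */
	if ((long long)(xdp->data_end - xdp->data) < -(long long)offset)
		return -EINVAL;

	xdp->data_end += offset;
	return 0;
}
```

With `offset >= 0` as the test, a later change can give offset == 0 a
meaning without breaking programs that relied on it being a nop.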

In the subject please replace
[bpf]: adding bpf_xdp_adjust_tail helper
with
bpf: adding bpf_xdp_adjust_tail helper

"[bpf] foo bar" subject used to be llvm patch convention,
but lately we switched it to kernel style as well with "bpf: foo bar"



Re: [PATCH 10/10] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)

2018-04-17 Thread Michael Schmitz
Hi Geert,

thanks for your suggestions!

On Wed, Apr 18, 2018 at 1:53 AM, Geert Uytterhoeven wrote:
> Hi Michael,
>
> Thanks for your patch!
>
> On Tue, Apr 17, 2018 at 12:04 AM, Michael Schmitz wrote:
>> Add platform device driver to populate the ax88796 platform data from
>> information provided by the XSurf100 zorro device driver.
>> This driver will have to be loaded before loading the ax88796 module,
>> or compiled as built-in.
>
> Is that really true? The platform device should be probed when both the
> device and driver have been registered, but order shouldn't matter.

Loading the xsurf100 module will pull in the ax88796 module, so order
does not matter. I'll drop that.

>
>> Signed-off-by: Michael Karcher 
>
> Missing "From: Michael Karcher ..."?

Fixed the authorship now - probably got mangled when squashing in my
local edits.

>
>> Signed-off-by: Michael Schmitz 
>
>> --- a/drivers/net/ethernet/8390/Kconfig
>> +++ b/drivers/net/ethernet/8390/Kconfig
>> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>>
>>  config AX88796
>> tristate "ASIX AX88796 NE2000 clone support"
>> -   depends on (ARM || MIPS || SUPERH)
>> +   depends on (ARM || MIPS || SUPERH || AMIGA)
>
> s/AMIGA/ZORRO/, for consistency with the below.

Will do.

>
>> select CRC32
>> select PHYLIB
>> select MDIO_BITBANG
>> @@ -45,6 +45,18 @@ config AX88796_93CX6
>> ---help---
>>   Select this if your platform comes with an external 93CX6 eeprom.
>>
>> +config XSURF100
>> +   tristate "Amiga XSurf 100 AX88796/NE2000 clone support"
>> +   depends on ZORRO
>> +   depends on AX88796
>
> It's a bit unfortunate the user has to enable _two_ config options to enable
> this driver.
>
> I see two solutions for that:
>
> 1) Hide the XSURF100 symbol, so it gets enabled automatically if AX88796 is
>enabled on a Zorro bus system:
>
> config XSURF100
> tristate
> depends on ZORRO
> default AX88796
>
> 2) Hide the AX88796 symbol, and let it be selected by XSURF100:
>
> config AX88796
> tristate "ASIX AX88796 NE2000 clone support" if !ZORRO
> depends on ARM || MIPS || SUPERH || ZORRO
> ...
>
> config XSURF100
> tristate "Amiga XSurf 100 AX88796/NE2000 clone support"
> depends on ZORRO
> select AX88796

I'll use the latter -

>> --- /dev/null
>> +++ b/drivers/net/ethernet/8390/xsurf100.c
>> @@ -0,0 +1,411 @@
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#define ZORRO_PROD_INDIVIDUAL_COMPUTERS_X_SURF100 \
>> +   ZORRO_ID(INDIVIDUAL_COMPUTERS, 0x64, 0)
>
> Another long define to get rid of? ;-)
>
>> +/* Hard reset the card. This used to pause for the same period that a
>> + * 8390 reset command required, but that shouldn't be necessary.
>> + */
>> +static void ax_reset_8390(struct net_device *dev)
>> +{
>> +   struct ei_device *ei_local = netdev_priv(dev);
>> +   unsigned long reset_start_time = jiffies;
>> +   void __iomem *addr = (void __iomem *)dev->base_addr;
>> +
>> +   netif_dbg(ei_local, hw, dev, "resetting the 8390 t=%ld...\n", jiffies);
>> +
>> +   ei_outb(ei_inb(addr + NE_RESET), addr + NE_RESET);
>> +
>> +   ei_local->txing = 0;
>> +   ei_local->dmaing = 0;
>> +
>> +   /* This check _should_not_ be necessary, omit eventually. */
>> +   while ((ei_inb(addr + EN0_ISR) & ENISR_RESET) == 0) {
>> +   if (time_after(jiffies, reset_start_time + 2 * HZ / 100)) {
>> +   netdev_warn(dev, "%s: did not complete.\n", __func__);
>> +   break;
>> +   }
>
> cpu_relax()?
>
> How long does this usually take? If > 1 ms, you can use e.g. msleep(1)
> instead of cpu_relax().

No idea how long this will take - the reset function is lifted
straight out of ax88796.c with no modifications whatsoever.

Come to think of it - it's exported as ei_local->reset_8390 there, so
there is no good reason for even duplicating the code that I can see.
I'll drop it.

>
>> +   }
>> +
>> +   ei_outb(ENISR_RESET, addr + EN0_ISR);   /* Ack intr. */
>> +}
>
>> +   if (ei_local->dmaing) {
>> +   netdev_err(dev,
>> +  "DMAing conflict in %s "
>> +  "[DMAstat:%d][irqlock:%d].\n",
>
> Please don't split error messages, as that makes it more difficult to
> grep for them.

Again, found like that in ax88796.c. Will fix here (and eventually in
ax88796.c).

>> +  __func__,
>> +  ei_local->dmaing, ei_local->irqlock);
>> +   return;
>
>> +static int xsurf100_probe(struct zorro_dev *zdev,
>> + const struct zorro_device_id *ent)
>> +{
>
>> +   /* error handling for ioremap regs */
>> +   if 
