Re: [PATCH net-next v2 1/2] fs/crashdd: add API to collect hardware dump in second kernel
On 4/2/2018 12:12 PM, Jiri Pirko wrote:

Fri, Mar 30, 2018 at 05:11:29PM CEST, and...@lunn.ch wrote:

Please see: http://patchwork.ozlabs.org/project/netdev/list/?series=36524

I believe that the solution in the patchset could be used for your use case too.

Hi Jiri

https://lkml.org/lkml/2018/3/20/436

How well does this API work for a 2 Gbyte snapshot?

Ccing Alex who did the tests.

I didn't check the performance for such a large snapshot. From my measurement it takes 0.09 s for 1 MB of data, which means about ~3 min. This can be tuned and improved since this is a socket application.

Andrew
linux-next: build failure after merge of the tip tree
Hi all,

After merging the tip tree, today's linux-next build (x86_64 allmodconfig) failed like this:

net/rxrpc/call_object.c: In function 'rxrpc_rcu_destroy_call':
net/rxrpc/call_object.c:661:3: error: implicit declaration of function 'wake_up_atomic_t'; did you mean 'wake_up_bit'? [-Werror=implicit-function-declaration]
   wake_up_atomic_t(&rxnet->nr_calls);
   ^~~~
   wake_up_bit
net/rxrpc/call_object.c: In function 'rxrpc_destroy_all_calls':
net/rxrpc/call_object.c:728:2: error: implicit declaration of function 'wait_on_atomic_t'; did you mean 'wait_on_bit'? [-Werror=implicit-function-declaration]
  wait_on_atomic_t(&rxnet->nr_calls, atomic_t_wait, TASK_UNINTERRUPTIBLE);
  ^~~~
  wait_on_bit
net/rxrpc/call_object.c:728:37: error: 'atomic_t_wait' undeclared (first use in this function); did you mean 'atomic_long_t'?
  wait_on_atomic_t(&rxnet->nr_calls, atomic_t_wait, TASK_UNINTERRUPTIBLE);
                                     ^
                                     atomic_long_t
net/rxrpc/call_object.c:728:37: note: each undeclared identifier is reported only once for each function it appears in
net/rxrpc/call_accept.c: In function 'rxrpc_discard_prealloc':
net/rxrpc/call_accept.c:223:4: error: implicit declaration of function 'wake_up_atomic_t'; did you mean 'wake_up_bit'? [-Werror=implicit-function-declaration]
    wake_up_atomic_t(&rxnet->nr_conns);
    ^~~~
    wake_up_bit

Caused by commit

  9b8cce52c4b5 ("sched/wait: Remove the wait_on_atomic_t() API")

interacting with commits

  d3be4d244330 ("rxrpc: Fix potential call vs socket/net destruction race")
  31f5f9a1691e ("rxrpc: Fix apparent leak of rxrpc_local objects")

from the net-next tree.

Haven't we figured out how to remove/change APIs yet? :-(

That tip tree commit is now in Linus' tree (merged since I started this morning) so the net-next tree will need the below patch (or something similar) when it is merged.
I have applied the following merge fix patch (this may need more work):

From: Stephen Rothwell
Date: Tue, 3 Apr 2018 15:34:48 +1000
Subject: [PATCH] sched/wait: merge fix up for wait_on_atomic_t() API removal

Signed-off-by: Stephen Rothwell
---
 net/rxrpc/call_accept.c | 2 +-
 net/rxrpc/call_object.c | 4 ++--
 net/rxrpc/conn_object.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c
index f67017dcb25e..a9a9be5519b9 100644
--- a/net/rxrpc/call_accept.c
+++ b/net/rxrpc/call_accept.c
@@ -220,7 +220,7 @@ void rxrpc_discard_prealloc(struct rxrpc_sock *rx)
 		write_unlock(&rxnet->conn_lock);
 		kfree(conn);
 		if (atomic_dec_and_test(&rxnet->nr_conns))
-			wake_up_atomic_t(&rxnet->nr_conns);
+			wake_up_var(&rxnet->nr_conns);
 		tail = (tail + 1) & (size - 1);
 	}
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index f721c2b7e234..f6734d8cb01a 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -658,7 +658,7 @@ static void rxrpc_rcu_destroy_call(struct rcu_head *rcu)
 	kfree(call->rxtx_annotations);
 	kmem_cache_free(rxrpc_call_jar, call);
 	if (atomic_dec_and_test(&rxnet->nr_calls))
-		wake_up_atomic_t(&rxnet->nr_calls);
+		wake_up_var(&rxnet->nr_calls);
 }
 
 /*
@@ -725,5 +725,5 @@ void rxrpc_destroy_all_calls(struct rxrpc_net *rxnet)
 	write_unlock(&rxnet->call_lock);
 	atomic_dec(&rxnet->nr_calls);
-	wait_on_atomic_t(&rxnet->nr_calls, atomic_t_wait, TASK_UNINTERRUPTIBLE);
+	wait_var_event(&rxnet->nr_calls, !atomic_read(&rxnet->nr_calls));
 }
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index 0950ee3d26f5..4c77a78a252a 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -367,7 +367,7 @@ static void rxrpc_destroy_connection(struct rcu_head *rcu)
 	rxrpc_put_peer(conn->params.peer);
 	if (atomic_dec_and_test(&conn->params.local->rxnet->nr_conns))
-		wake_up_atomic_t(&conn->params.local->rxnet->nr_conns);
+		wake_up_var(&conn->params.local->rxnet->nr_conns);
 	rxrpc_put_local(conn->params.local);
 	kfree(conn);
@@ -482,6 +482,6 @@ void rxrpc_destroy_all_connections(struct rxrpc_net *rxnet)
 	/* We need to wait for the connections to be destroyed by RCU as they
 	 * pin things that we still need to get rid of.
 	 */
-	wait_on_atomic_t(&rxnet->nr_conns, atomic_t_wait, TASK_UNINTERRUPTIBLE);
+	wait_var_event(&rxnet->nr_conns, !atomic_read(&rxnet->nr_conns));
 	_leave("");
 }
-- 
2.16.1

-- 
Cheers,
Stephen Rothwell
Re: WARNING: refcount bug in should_fail
On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:
> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
> of bugs (most of them preexisting, but I should've checked instead
> of assuming that everything's fine). Will push out later tonight.

OK, see vfs.git#for-linus. Caught: 4 old bugs (allocation failure in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp., and double-dput in late failure exit in rpc_fill_super()) and 5 regressions from register_shrinker() failure recovery.
Re: [GIT PULL] remove in-kernel calls to syscalls
On Mon, Apr 2, 2018 at 12:04 PM, Dominik Brodowski wrote:
>
> This patchset removes all in-kernel calls to syscall functions in the
> kernel with the exception of arch/.

Ok, this finished off my arch updates for today, I'll probably move on to driver pulls tomorrow. Anyway, it's in my tree, will push out once my test build finishes.

Linus
[PATCH iproute2-next v1] rdma: Print net device name and index for RDMA device
From: Leon Romanovsky

RDMA devices operating in RoCE and iWARP modes have a net device underneath. Present their names in regular output and their net index in detailed mode.

[root@nps ~]# rdma link show mlx5_3/1
4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7

[root@nps ~]# rdma link show mlx5_3/1 -d
4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7 netdev_index 7
    caps:

Signed-off-by: Leon Romanovsky
Reviewed-by: Steve Wise
---
Changes v0->v1:
 * Resend after commit 29122c1aae35 ("rdma: update rdma_netlink.h") which
   updated relevant netlink attributes.
 * Added Steve's ROB
---
 rdma/include/uapi/rdma/rdma_netlink.h |  4 ++++
 rdma/link.c                           | 21 +++++++++++++++++++++
 rdma/utils.c                          |  2 ++
 3 files changed, 27 insertions(+)

diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
index 9446a721..45474f13 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -388,6 +388,10 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_RES_LOCAL_DMA_LKEY,	/* u32 */
 	RDMA_NLDEV_ATTR_RES_UNSAFE_GLOBAL_RKEY,	/* u32 */
 
+	/* Netdev information for relevant protocols, like RoCE and iWARP */
+	RDMA_NLDEV_ATTR_NDEV_INDEX,		/* u32 */
+	RDMA_NLDEV_ATTR_NDEV_NAME,		/* string */
+
 	RDMA_NLDEV_ATTR_MAX
 };
 #endif /* _RDMA_NETLINK_H */
diff --git a/rdma/link.c b/rdma/link.c
index 66bcd50e..7e914c87 100644
--- a/rdma/link.c
+++ b/rdma/link.c
@@ -205,6 +205,26 @@ static void link_print_phys_state(struct rd *rd, struct nlattr **tb)
 	pr_out("physical_state %s ", phys_state_to_str(phys_state));
 }
 
+static void link_print_netdev(struct rd *rd, struct nlattr **tb)
+{
+	const char *netdev_name;
+	uint32_t idx;
+
+	if (!tb[RDMA_NLDEV_ATTR_NDEV_NAME] || !tb[RDMA_NLDEV_ATTR_NDEV_INDEX])
+		return;
+
+	netdev_name = mnl_attr_get_str(tb[RDMA_NLDEV_ATTR_NDEV_NAME]);
+	idx = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_NDEV_INDEX]);
+	if (rd->json_output) {
+		jsonw_string_field(rd->jw, "netdev", netdev_name);
+		jsonw_uint_field(rd->jw, "netdev_index", idx);
+	} else {
+		pr_out("netdev %s ", netdev_name);
+		if (rd->show_details)
+			pr_out("netdev_index %u ", idx);
+	}
+}
+
 static int link_parse_cb(const struct nlmsghdr *nlh, void *data)
 {
 	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX] = {};
@@ -241,6 +261,7 @@ static int link_parse_cb(const struct nlmsghdr *nlh, void *data)
 	link_print_lmc(rd, tb);
 	link_print_state(rd, tb);
 	link_print_phys_state(rd, tb);
+	link_print_netdev(rd, tb);
 
 	if (rd->show_details)
 		link_print_caps(rd, tb);
diff --git a/rdma/utils.c b/rdma/utils.c
index a2e08e91..f38f0459 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -391,6 +391,8 @@ static const enum mnl_attr_data_type nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_RES_LKEY] = MNL_TYPE_U32,
 	[RDMA_NLDEV_ATTR_RES_IOVA] = MNL_TYPE_U64,
 	[RDMA_NLDEV_ATTR_RES_MRLEN] = MNL_TYPE_U64,
+	[RDMA_NLDEV_ATTR_NDEV_INDEX] = MNL_TYPE_U32,
+	[RDMA_NLDEV_ATTR_NDEV_NAME] = MNL_TYPE_NUL_STRING,
 };
 
 int rd_attr_cb(const struct nlattr *attr, void *data)
-- 
2.14.3
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On Mon, Mar 26, 2018 at 01:01:58AM +0200, Andrew Lunn wrote: > The phylib core code will take the phydev lock before calling into the > driver. By violating the layering, we are missing on this lock. That lock protects the fields within the struct phy_device, like the state field. None of the time stamping methods need to read or write any part of that data structure. Actually it is not true that the core always takes the lock before calling the driver methods. See .ack_interrupt for example. > Maybe the one driver which currently implements these calls does not > need locking. It has locking. If a specific device needs locking to protect the integrity of its registers or other internal data structures, then it can and should implement those locks. Thanks, Richard
Re: WARNING in refcount_dec
No. Only the first crash (WARNING in refcount_dec) is reproduced by the attached reproducer. The second crash (kernel BUG at af_packet.c:3107) is reproduced by another reproducer. We reported it here:
http://lkml.iu.edu/hypermail/linux/kernel/1803.3/05324.html

On Sun, Apr 1, 2018 at 4:38 PM, Willem de Bruijn wrote:
> On Thu, Mar 29, 2018 at 1:16 AM, Cong Wang wrote:
>> (Cc'ing netdev and Willem)
>>
>> On Wed, Mar 28, 2018 at 12:03 PM, Byoungyoung Lee wrote:
>>> Another crash patterns observed: race between (setsockopt$packet_int)
>>> and (bind$packet).
>>>
>>> --
>>> [ 357.731597] kernel BUG at /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/net/packet/af_packet.c:3107!
>>> [ 357.733382] invalid opcode: [#1] SMP KASAN
>>> [ 357.734017] Modules linked in:
>>> [ 357.734662] CPU: 1 PID: 3871 Comm: repro.exe Not tainted 4.16.0-rc3 #1
>>> [ 357.735791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>>> [ 357.737434] RIP: 0010:packet_do_bind+0x88d/0x950
>>> [ 357.738121] RSP: 0018:8800b2787b08 EFLAGS: 00010293
>>> [ 357.738906] RAX: 8800b2fdc780 RBX: 880234358cc0 RCX: 838b244c
>>> [ 357.739905] RDX: RSI: 838b257d RDI: 0001
>>> [ 357.741315] RBP: 8800b2787c10 R08: 8800b2fdc780 R09:
>>> [ 357.743055] R10: 0001 R11: R12: 88023352ecc0
>>> [ 357.744744] R13: R14: 0001 R15: 1d00
>>> [ 357.746377] FS: 7f4b43733700() GS:8800b8b0() knlGS:
>>> [ 357.749599] CS: 0010 DS: ES: CR0: 80050033
>>> [ 357.752096] CR2: 20058000 CR3: 0002334b8000 CR4: 06e0
>>> [ 357.755045] Call Trace:
>>> [ 357.755822] ? compat_packet_setsockopt+0x100/0x100
>>> [ 357.757324] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
>>> [ 357.758810] packet_bind+0xa2/0xe0
>>> [ 357.759640] SYSC_bind+0x279/0x2f0
>>> [ 357.760364] ? move_addr_to_kernel.part.19+0xc0/0xc0
>>> [ 357.761491] ? __handle_mm_fault+0x25d0/0x25d0
>>> [ 357.762449] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.763663] ? __do_page_fault+0x417/0xba0
>>> [ 357.764569] ? vmalloc_fault+0x910/0x910
>>> [ 357.765405] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.766525] ? mark_held_locks+0x25/0xb0
>>> [ 357.767336] ? SyS_socketpair+0x4a0/0x4a0
>>> [ 357.768182] SyS_bind+0x24/0x30
>>> [ 357.768851] do_syscall_64+0x209/0x5d0
>>> [ 357.769650] ? syscall_return_slowpath+0x3e0/0x3e0
>>> [ 357.770665] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.771779] ? syscall_return_slowpath+0x260/0x3e0
>>> [ 357.772748] ? mark_held_locks+0x25/0xb0
>>> [ 357.773581] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.774720] ? retint_user+0x18/0x18
>>> [ 357.775493] ? trace_hardirqs_off_caller+0xb5/0x120
>>> [ 357.776567] ? trace_hardirqs_off_thunk+0x1a/0x1c
>>> [ 357.777512] entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>> [ 357.778508] RIP: 0033:0x4503a9
>>> [ 357.779156] RSP: 002b:7f4b43732ce8 EFLAGS: 0246 ORIG_RAX: 0031
>>> [ 357.780737] RAX: ffda RBX: RCX: 004503a9
>>> [ 357.782169] RDX: 0014 RSI: 20058000 RDI: 0003
>>> [ 357.783710] RBP: 7f4b43732d10 R08: R09:
>>> [ 357.785202] R10: R11: 0246 R12:
>>> [ 357.786664] R13: R14: 7f4b437339c0 R15: 7f4b43733700
>>> [ 357.788210] Code: c0 fd 48 c7 c2 00 c8 d9 84 be ab 02 00 00 48 c7 c7 60 c8 d9 84 c6 05 e7 a2 48 02 01 e8 3f 17 af fd e9 60 fb ff ff e8 43 b3 c0 fd <0f> 0b e8 3c b3 c0 fd 48 8b bd 20 ff ff ff e8 60 1e e7 fd 4c 89
>>> [ 357.792260] RIP: packet_do_bind+0x88d/0x950 RSP: 8800b2787b08
>>> [ 357.793698] ---[ end trace 0c5a2539f0247369 ]---
>>> [ 357.794696] Kernel panic - not syncing: Fatal exception
>>> [ 357.795918] Kernel Offset: disabled
>>> [ 357.796614] Rebooting in 86400 seconds..
>>>
>>> On Wed, Mar 28, 2018 at 1:19 AM, Byoungyoung Lee wrote:
>>>> We report the crash: WARNING in refcount_dec
>>>>
>>>> This crash has been found in v4.16-rc3 using RaceFuzzer (a modified
>>>> version of Syzkaller), which we describe more at the end of this report.
>>>> Our analysis shows that the race occurs when invoking two syscalls
>>>> concurrently, (setsockopt$packet_int) and (setsockopt$packet_rx_ring).
>>>>
>>>> C repro code: https://kiwi.cs.purdue.edu/static/race-fuzzer/repro-refcount_dec.c
>>>> kernel config: https://kiwi.cs.purdue.edu/static/race-fuzzer/kernel-config-v4.16-rc3
>>
>>
I tried your
Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek
On Tue, Apr 03, 2018 at 01:41:26PM +1000, NeilBrown wrote:
>
> Do we really need a rhashtable_walk_peek() interface?
>
> I imagine that a seqfile ->start function can do:
>
>	if (*ppos == 0 && last_pos != 0) {
>		rhashtable_walk_exit(&iter);
>		rhashtable_walk_enter(&ht, &iter);
>		last_pos = 0;
>	}
>	rhashtable_walk_start(&iter);
>	if (*ppos == last_pos && iter.p)
>		return iter.p;
>	last_pos = *ppos;

We don't want users poking into the internals of iter. If you're suggesting we could simplify rhashtable_walk_peek to just this after your patch then yes possibly. You also need to ensure that not only seqfs users continue to work but also netlink dump users which are slightly different.

> It might be OK to have a function call instead of expecting people to
> use iter.p directly.

Yes that would definitely be the preferred option.

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On Sun, Mar 25, 2018 at 04:01:49PM -0700, Florian Fainelli wrote: > The best that I can think about and it still is a hack in some way, is > to you have your time stamping driver create a proxy mii_bus whose > purpose is just to hook to mdio/phy_device events (such as link changes) > in order to do what is necessary, or at least, this would indicate its > transparent nature towards the MDIO/MDC lines... That won't work at all, AFAICT. There is only one mii_bus per netdev, that is one that is attached to the phydev. > Tangential: the existing PHY time stamping logic should probably be > generalized to a mdio_device (which the phy_device is a specialized > superset of) instead of to the phy_device. This would still allow > existing use cases but it would also allow us to support possible "pure > MDIO" devices would that become some thing in the future. So this is exactly what I did. The time stamping methods were pushed down into the mdio_device. The active device (mdiots pointer) points either to a non-PHY mdio_device or to the mdio_device embedded in the phydev. Thanks, Richard
Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek
On Fri, Mar 30 2018, Herbert Xu wrote:
> On Thu, Mar 29, 2018 at 06:52:34PM +0200, Andreas Gruenbacher wrote:
>>
>> Should rhashtable_walk_peek be kept around even if there are no more
>> users? I have my doubts.
>
> Absolutely. All netlink dumps using rhashtable_walk_next are buggy
> and need to switch over to rhashtable_walk_peek. As otherwise
> the object that triggers the out-of-space condition will be skipped
> upon resumption.

Do we really need a rhashtable_walk_peek() interface?

I imagine that a seqfile ->start function can do:

	if (*ppos == 0 && last_pos != 0) {
		rhashtable_walk_exit(&iter);
		rhashtable_walk_enter(&ht, &iter);
		last_pos = 0;
	}
	rhashtable_walk_start(&iter);
	if (*ppos == last_pos && iter.p)
		return iter.p;
	last_pos = *ppos;
	return rhashtable_walk_next(&iter);

and the ->next function just does

	last_pos = *ppos;
	*ppos += 1;
	do
		p = rhashtable_walk_next(&iter);
	while (IS_ERR(p));
	return p;

It might be OK to have a function call instead of expecting people to use iter.p directly.

	static inline void *rhashtable_walk_prev(struct rhashtable_iter *iter)
	{
		return iter->p;
	}

Thoughts?

Thanks,
NeilBrown
linux-next: manual merge of the net-next tree with the pci tree
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/mellanox/mlx5/core/en_main.c

between commit:

  2907938d2375 ("net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth")

from the pci tree and commit:

  0608d4dbaf4e ("net/mlx5e: Unify slow PCI heuristic")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 884337f88589,0aab3afc6885..
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@@ -3880,16 -4026,50 +4033,20 @@@ void mlx5e_build_default_indir_rqt(u32 
  		indirection_rqt[i] = i % num_channels;
  }
  
- static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw)
 -static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
 -{
 -	enum pcie_link_width width;
 -	enum pci_bus_speed speed;
 -	int err = 0;
 -
 -	err = pcie_get_minimum_link(mdev->pdev, &speed, &width);
 -	if (err)
 -		return err;
 -
 -	if (speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
 -		return -EINVAL;
 -
 -	switch (speed) {
 -	case PCIE_SPEED_2_5GT:
 -		*pci_bw = 2500 * width;
 -		break;
 -	case PCIE_SPEED_5_0GT:
 -		*pci_bw = 5000 * width;
 -		break;
 -	case PCIE_SPEED_8_0GT:
 -		*pci_bw = 8000 * width;
 -		break;
 -	default:
 -		return -EINVAL;
 -	}
 -
 -	return 0;
 -}
 -
+ static bool slow_pci_heuristic(struct mlx5_core_dev *mdev)
  {
- 	return (link_speed && pci_bw &&
- 		(pci_bw < 40000) && (pci_bw < link_speed));
- }
+ 	u32 link_speed = 0;
+ 	u32 pci_bw = 0;
  
- static bool hw_lro_heuristic(u32 link_speed, u32 pci_bw)
- {
- 	return !(link_speed && pci_bw &&
- 		 (pci_bw <= 16000) && (pci_bw < link_speed));
+ 	mlx5e_get_max_linkspeed(mdev, &link_speed);
- 	mlx5e_get_pci_bw(mdev, &pci_bw);
++	pci_bw = pcie_bandwidth_available(mdev->pdev, NULL, NULL, NULL);
+ 	mlx5_core_dbg_once(mdev, "Max link speed = %d, PCI BW = %d\n",
+ 			   link_speed, pci_bw);
+ 
+ #define MLX5E_SLOW_PCI_RATIO (2)
+ 
+ 	return link_speed && pci_bw &&
+ 		link_speed > MLX5E_SLOW_PCI_RATIO * pci_bw;
  }
  
  void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
Re: [PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
Alex, On 4/2/2018 3:06 PM, Sinan Kaya wrote: > Code includes wmb() followed by writel() in multiple places. writel() > already has a barrier on some architectures like arm64. > > This ends up CPU observing two barriers back to back before executing the > register write. > > Since code already has an explicit barrier call, changing writel() to > writel_relaxed(). > > I did a regex search for wmb() followed by writel() in each drivers > directory. I scrubbed the ones I care about in this series. > > I considered "ease of change", "popular usage" and "performance critical > path" as the determining criteria for my filtering. > > We used relaxed API heavily on ARM for a long time but > it did not exist on other architectures. For this reason, relaxed > architectures have been paying double penalty in order to use the common > drivers. > > Now that relaxed API is present on all architectures, we can go and scrub > all drivers to see what needs to change and what can remain. > > We start with mostly used ones and hope to increase the coverage over time. > It will take a while to cover all drivers. > > Feel free to apply patches individually. > > Change since 7: > > * API clarification with Linus and several architecture folks regarding > writel() > > https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html > > * removed wmb() in front of writel() at several places. > * remove old IA64 comments regarding ordering. > What do you think about this version? Did I miss any SMP barriers? 
> Sinan Kaya (7): > i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs > ixgbe: eliminate duplicate barriers on weakly-ordered archs > igbvf: eliminate duplicate barriers on weakly-ordered archs > igb: eliminate duplicate barriers on weakly-ordered archs > fm10k: Eliminate duplicate barriers on weakly-ordered archs > ixgbevf: keep writel() closer to wmb() > ixgbevf: eliminate duplicate barriers on weakly-ordered archs > > drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++--- > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 -- > drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +-- > drivers/net/ethernet/intel/igb/igb_main.c | 14 ++-- > drivers/net/ethernet/intel/igbvf/netdev.c | 16 +- > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++- > drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 - > drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 > --- > 8 files changed, 30 insertions(+), 100 deletions(-) > Sinan -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH 0/2] rhashtable_walk fixes
From: NeilBrown
Date: Tue, 03 Apr 2018 12:23:40 +1000

> I'm sorry if I've caused some confusion, but I didn't think that I was
> submitting patches to you and know nothing about your two trees.
> I was submitting patches to Thomas and Herbert, the registered
> maintainers of rhashtable. I assumed they would review, respond, and
> take responsibility for getting them upstream, if that's what they
> decided, based on whatever arrangements they have in place.
>
> If it is appropriate I can resend all of my patches that receive an
> Ack as a coherent series, and send this to you nominating a particular
> tree, but I'm unlikely to do that unless asked and told which tree to
> nominate.

Herbert and Thomas generally review rhashtable patches, but rhashtable itself is generally maintained in the networking tree(s). So once they review and ACK it, I would apply it.
Re: [PATCH 0/2] rhashtable_walk fixes
On Fri, Mar 30 2018, David Miller wrote:
> From: NeilBrown
> Date: Thu, 29 Mar 2018 12:19:09 +1100
>
>> These two patches apply on top of my previous "rhashtable: reset iter
>> when rhashtable_walk_start sees new table" patch.
>>
>> The first fixes a bug that I found in rhltable_insert().
>>
>> The second is an alternate to my "rhashtable: allow a walk of the hash
>> table without missing object."
>> This version doesn't require an API change and should be reliable for
>> rhltables too (my first version didn't handle these correctly).
>
> Neil, please don't mix and match patches.
>
> Also when you need to change a patch in a series, please post the entire
> new series not just the patch that changes.
>
> Patch #1 in this series is unnecessary. As Herbert explained this has
> been fixed already.
>
> So please repost freshly the patches that are relevant and you want me
> to consider for inclusion. Also be explicit and clear about which of
> my two networking trees you are targetting these changes.

Hi Dave,

I'm sorry if I've caused some confusion, but I didn't think that I was submitting patches to you and know nothing about your two trees. I was submitting patches to Thomas and Herbert, the registered maintainers of rhashtable. I assumed they would review, respond, and take responsibility for getting them upstream, if that's what they decided, based on whatever arrangements they have in place.

If it is appropriate I can resend all of my patches that receive an Ack as a coherent series, and send this to you nominating a particular tree, but I'm unlikely to do that unless asked and told which tree to nominate.

Thanks,
NeilBrown
Re: [PATCH net] net: dsa: mt7530: Use NULL instead of plain integer
On Mon, 2018-04-02 at 16:24 -0700, Florian Fainelli wrote:
> We would be passing 0 instead of NULL as the rsp argument to
> mt7530_fdb_cmd(), fix that.

Acked-by: Sean Wang

BTW, should part of the commit message be updated to say "passing NULL instead of 0"?

> Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
> Signed-off-by: Florian Fainelli
> ---
>  drivers/net/dsa/mt7530.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index 4e53c5ce23ff..a7f6c0a62fc7 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c
> @@ -917,7 +917,7 @@ mt7530_port_fdb_add(struct dsa_switch *ds, int port,
>
>  	mutex_lock(&priv->reg_mutex);
>  	mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_ENT);
> -	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0);
> +	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL);
>  	mutex_unlock(&priv->reg_mutex);
>
>  	return ret;
> @@ -933,7 +933,7 @@ mt7530_port_fdb_del(struct dsa_switch *ds, int port,
>
>  	mutex_lock(&priv->reg_mutex);
>  	mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_EMP);
> -	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0);
> +	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL);
>  	mutex_unlock(&priv->reg_mutex);
>
>  	return ret;
> @@ -1293,7 +1293,7 @@ mt7530_setup(struct dsa_switch *ds)
>  	}
>
>  	/* Flush the FDB table */
> -	ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, 0);
> +	ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, NULL);
>  	if (ret < 0)
>  		return ret;
>
[PATCH net-next] pptp: remove a buggy dst release in pptp_connect()
Once dst has been cached in socket via sk_setup_caps(), it is illegal to call ip_rt_put() (or dst_release()), since sk_setup_caps() did not change dst refcount. We can still dereference it since we hold the socket lock. Caught by syzbot:

BUG: KASAN: use-after-free in atomic_dec_return include/asm-generic/atomic-instrumented.h:198 [inline]
BUG: KASAN: use-after-free in dst_release+0x27/0xa0 net/core/dst.c:185
Write of size 4 at addr 8801c54dc040 by task syz-executor4/20088

CPU: 1 PID: 20088 Comm: syz-executor4 Not tainted 4.16.0+ #376
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1a7/0x27d lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x23c/0x360 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
 atomic_dec_return include/asm-generic/atomic-instrumented.h:198 [inline]
 dst_release+0x27/0xa0 net/core/dst.c:185
 sk_dst_set include/net/sock.h:1812 [inline]
 sk_dst_reset include/net/sock.h:1824 [inline]
 sock_setbindtodevice net/core/sock.c:610 [inline]
 sock_setsockopt+0x431/0x1b20 net/core/sock.c:707
 SYSC_setsockopt net/socket.c:1845 [inline]
 SyS_setsockopt+0x2ff/0x360 net/socket.c:1828
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4552d9
RSP: 002b:7f4878126c68 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 7f48781276d4 RCX: 004552d9
RDX: 0019 RSI: 0001 RDI: 0013
RBP: 0072bea0 R08: 0010 R09:
R10: 200010c0 R11: 0246 R12:
R13: 0526 R14: 006fac30 R15:

Allocated by task 20088:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3542
 dst_alloc+0x11f/0x1a0 net/core/dst.c:104
 rt_dst_alloc+0xe9/0x540 net/ipv4/route.c:1520
 __mkroute_output net/ipv4/route.c:2265 [inline]
 ip_route_output_key_hash_rcu+0xa49/0x2c60 net/ipv4/route.c:2493
 ip_route_output_key_hash+0x20b/0x370 net/ipv4/route.c:2322
 __ip_route_output_key include/net/route.h:126 [inline]
 ip_route_output_flow+0x26/0xa0 net/ipv4/route.c:2577
 ip_route_output_ports include/net/route.h:163 [inline]
 pptp_connect+0xa84/0x1170 drivers/net/ppp/pptp.c:453
 SYSC_connect+0x213/0x4a0 net/socket.c:1639
 SyS_connect+0x24/0x30 net/socket.c:1620
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Freed by task 20082:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527
 __cache_free mm/slab.c:3486 [inline]
 kmem_cache_free+0x83/0x2a0 mm/slab.c:3744
 dst_destroy+0x266/0x380 net/core/dst.c:140
 dst_destroy_rcu+0x16/0x20 net/core/dst.c:153
 __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
 rcu_do_batch kernel/rcu/tree.c:2675 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2930 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2897 [inline]
 rcu_process_callbacks+0xd6c/0x17b0 kernel/rcu/tree.c:2914
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285

The buggy address belongs to the object at 8801c54dc000 which belongs to the cache ip_dst_cache of size 168
The buggy address is located 64 bytes inside of 168-byte region [8801c54dc000, 8801c54dc0a8)
The buggy address belongs to the page:
page:ea0007153700 count:1 mapcount:0 mapping:8801c54dc000 index:0x0
flags: 0x2fffc000100(slab)
raw: 02fffc000100 8801c54dc000 00010010
raw: ea0006b34b20 ea0006b6c1e0 8801d674a1c0
page dumped because: kasan: bad access detected

Signed-off-by: Eric Dumazet
---
 drivers/net/ppp/pptp.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c
index 8249d46a784493faedbfad5ec78212afab8c872e..c4267ecefd85271ac561e94d57c586179e66777c 100644
--- a/drivers/net/ppp/pptp.c
+++ b/drivers/net/ppp/pptp.c
@@ -464,7 +464,6 @@ static int pptp_connect(struct socket *sock, struct sockaddr *uservaddr,
 	po->chan.mtu = dst_mtu(&rt->dst);
 	if (!po->chan.mtu)
 		po->chan.mtu = PPP_MRU;
-	ip_rt_put(rt);
 	po->chan.mtu -= PPTP_HEADER_OVERHEAD;
 
 	po->chan.hdrlen = 2 + sizeof(struct pptp_gre_header);
-- 
2.17.0.rc1.321.gba9d0f2565-goog
Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
On Thu, Mar 29, 2018 at 11:44:17PM +0100, Edward Cree wrote:
> By storing subprog boundaries as a subprogno mark on each insn, rather than
> a start (and implicit end) for each subprog, we collect a number of gains:
> * More efficient determination of which subprog contains a given insn, and
>   thus of find_subprog (which subprog begins at a given insn).
> * Number of verifier "full recursive walk" passes is reduced, since most of
>   the work is done in the main insn walk (do_check()). Leftover work in
>   other passes is mostly linear scans (O(insn_cnt)) or, in the case of
>   check_max_stack_depth(), a topological sort (O(subprog_cnt)).
>
> Some other changes were also included to support this:
> * Per-subprog info is stored in env->subprog_info, an array of structs,
>   rather than several arrays with a common index.
> * Call graph is now stored in the new bpf_subprog_info struct; used here
>   for check_max_stack_depth() but may have other uses too.
>
> Along with this, patch #3 puts parent pointers (used by liveness analysis)
> in the registers instead of the func_state or verifier_state, so that we
> don't need skip_callee() machinery. This also does the right thing for
> stack slots, so they don't need their own special handling for liveness
> marking either.

I like patch 3 and am going to play with it. How did you test it? Do you have processed_insn numbers for cilium or selftests programs before and after? There should be no difference, right?

Essentially it's trading extra run-time cost of skip_callee logic for higher memory overhead due to parent pointers in every register state. I guess that's ok, since it does clean things up nicely.

As far as patch 1, it was very difficult to review, since several logically different things were clamped together. So breaking it apart:
- converting two arrays of subprog_starts and subprog_stack_depth into a single array of struct bpf_subprog_info is a good thing
- tsort is interesting, but not sure it's correct.
more below - but main change of combining subprog logic with do_check is no good. The verifier has to move towards compiler style code analysis instead of the other way around. It has to analyze the program in simple and hopefully easy to understand passes instead of combining everything into one loop. We _have_ to get rid of the do_check() approach and corresponding insn_processed counter. That was and still is the biggest and most pressing issue we need to solve in bpf verification. The verifier has to switch to compiler style code analysis to scale. The algorithms should be such that they scale with thousands of instructions and thousands of functions. The only way I see reaching that goal is to do hierarchical program analysis in passes: - identify subprograms - identify basic blocks, build control flow graph - build dom graph, ssa and so on - and do per-basic block liveness and data flow analysis that propagates the combined result further into other BBs along cfg edges. There will be no 'do_check() across whole program' walk. Combining the subprog pass with do_check is going in the opposite direction of this long term work. Divide and conquer. Combining more things into do_check is the opposite of this programming principle. My short term plan is to split basic instruction correctness checks out of the do_check() loop into a separate pass and run it early on. It's necessary for bpf libraries to work, since the verifier cannot do full data flow analysis at this point, but should build cfg and move as much as possible out of the instruction by instruction walk to scale with the number of bpf programs/libraries and their sizes. As far as the tsort approach for determining max stack depth... It's an interesting idea, but the implementation suffers from the same 'combine everything into one loop' coding issue. I think updating the total_stack_depth math should be separate from sorting, since the same function can be part of different stacks with different max.
I don't see how updating the global subprog_info[i].total_stack_depth can be correct. It has to be different for every stack, and the longest stack is not necessarily the deepest. Maybe I'm missing something, but the combined sort and stack_depth math didn't make it easy to review. I find the existing check_max_stack_depth() algo much easier to understand.
Re: [PATCH net-next 4/5] tun: Add support for SCTP checksum offload
Hi Vladislav, Thank you for the patch! Yet something to improve: [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Vladislav-Yasevich/virtio-net-Add-SCTP-checksum-offload-support/20180402-221407 config: m68k-hp300_defconfig (attached as .config) compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=m68k Note: the linux-review/Vladislav-Yasevich/virtio-net-Add-SCTP-checksum-offload-support/20180402-221407 HEAD 5e0497a085e70055a1981959802173f4ff05c86b builds fine. It only hurts bisectibility. All errors (new ones prefixed by >>): drivers/net/tun.c: In function 'set_offload': >> drivers/net/tun.c:2722:12: error: 'TUN_F_SCTP_CSUM' undeclared (first use in >> this function); did you mean 'TUN_F_CSUM'? if (arg & TUN_F_SCTP_CSUM) { ^~~ TUN_F_CSUM drivers/net/tun.c:2722:12: note: each undeclared identifier is reported only once for each function it appears in vim +2722 drivers/net/tun.c 2696 2697 /* This is like a cut-down ethtool ops, except done via tun fd so no 2698 * privs required. 
*/ 2699 static int set_offload(struct tun_struct *tun, unsigned long arg) 2700 { 2701 netdev_features_t features = 0; 2702 2703 if (arg & TUN_F_CSUM) { 2704 features |= NETIF_F_HW_CSUM; 2705 arg &= ~TUN_F_CSUM; 2706 2707 if (arg & (TUN_F_TSO4|TUN_F_TSO6)) { 2708 if (arg & TUN_F_TSO_ECN) { 2709 features |= NETIF_F_TSO_ECN; 2710 arg &= ~TUN_F_TSO_ECN; 2711 } 2712 if (arg & TUN_F_TSO4) 2713 features |= NETIF_F_TSO; 2714 if (arg & TUN_F_TSO6) 2715 features |= NETIF_F_TSO6; 2716 arg &= ~(TUN_F_TSO4|TUN_F_TSO6); 2717 } 2718 2719 arg &= ~TUN_F_UFO; 2720 } 2721 > 2722 if (arg & TUN_F_SCTP_CSUM) { 2723 features |= NETIF_F_SCTP_CRC; 2724 arg &= ~TUN_F_SCTP_CSUM; 2725 } 2726 2727 /* This gives the user a way to test for new features in future by 2728 * trying to set them. */ 2729 if (arg) 2730 return -EINVAL; 2731 2732 tun->set_features = features; 2733 tun->dev->wanted_features &= ~TUN_USER_FEATURES; 2734 tun->dev->wanted_features |= features; 2735 netdev_update_features(tun->dev); 2736 2737 return 0; 2738 } 2739 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
On Mon, Apr 2, 2018 at 7:05 AM, Bjorn Helgaas wrote: > +/* PCIe speed to Mb/s reduced by encoding overhead */ > +#define PCIE_SPEED2MBS_ENC(speed) \ > + ((speed) == PCIE_SPEED_16_0GT ? (16000*(128/130)) : \ > +(speed) == PCIE_SPEED_8_0GT ? (8000*(128/130)) : \ > +(speed) == PCIE_SPEED_5_0GT ? (5000*(8/10)) : \ > +(speed) == PCIE_SPEED_2_5GT ? (2500*(8/10)) : \ > +0) > + Should this be "(speed * x) / y" instead? Wouldn't they calculate 128/130 and truncate that to zero before multiplying by the speed? Or are compilers smart enough to do this the other way to avoid the losses? Thanks, Jake
[PATCH net] net: dsa: mt7530: Use NULL instead of plain integer
We would be passing 0 instead of NULL as the rsp argument to mt7530_fdb_cmd(), fix that. Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Signed-off-by: Florian Fainelli --- drivers/net/dsa/mt7530.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c index 4e53c5ce23ff..a7f6c0a62fc7 100644 --- a/drivers/net/dsa/mt7530.c +++ b/drivers/net/dsa/mt7530.c @@ -917,7 +917,7 @@ mt7530_port_fdb_add(struct dsa_switch *ds, int port, mutex_lock(&priv->reg_mutex); mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_ENT); - ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0); + ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL); mutex_unlock(&priv->reg_mutex); return ret; @@ -933,7 +933,7 @@ mt7530_port_fdb_del(struct dsa_switch *ds, int port, mutex_lock(&priv->reg_mutex); mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_EMP); - ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0); + ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL); mutex_unlock(&priv->reg_mutex); return ret; @@ -1293,7 +1293,7 @@ mt7530_setup(struct dsa_switch *ds) } /* Flush the FDB table */ - ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, 0); + ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, NULL); if (ret < 0) return ret; -- 2.14.1
[PATCH net] net: dsa: b53: Fix sparse warnings in b53_mmap.c
sparse complains about the following warnings: drivers/net/dsa/b53/b53_mmap.c:33:31: warning: incorrect type in initializer (different address spaces) drivers/net/dsa/b53/b53_mmap.c:33:31: expected unsigned char [noderef] [usertype] *regs drivers/net/dsa/b53/b53_mmap.c:33:31: got void *priv and indeed, while what we are doing is functional, we are dereferencing a void * pointer into a void __iomem * which is not great. Just use the defined b53_mmap_priv structure which holds our register base and use that. Fixes: 967dd82ffc52 ("net: dsa: b53: Add support for Broadcom RoboSwitch") Signed-off-by: Florian Fainelli --- drivers/net/dsa/b53/b53_mmap.c | 33 ++++++++++++++++++++++++--------- 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/drivers/net/dsa/b53/b53_mmap.c b/drivers/net/dsa/b53/b53_mmap.c index ef63d24fef81..c628d0980c0b 100644 --- a/drivers/net/dsa/b53/b53_mmap.c +++ b/drivers/net/dsa/b53/b53_mmap.c @@ -30,7 +30,8 @@ struct b53_mmap_priv { static int b53_mmap_read8(struct b53_device *dev, u8 page, u8 reg, u8 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; *val = readb(regs + (page << 8) + reg); @@ -39,7 +40,8 @@ static int b53_mmap_read8(struct b53_device *dev, u8 page, u8 reg, u8 *val) static int b53_mmap_read16(struct b53_device *dev, u8 page, u8 reg, u16 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 2)) return -EINVAL; @@ -54,7 +56,8 @@ static int b53_mmap_read16(struct b53_device *dev, u8 page, u8 reg, u16 *val) static int b53_mmap_read32(struct b53_device *dev, u8 page, u8 reg, u32 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 4)) return -EINVAL; @@ -69,7 +72,8 @@ static int b53_mmap_read32(struct b53_device *dev, u8 page, u8 reg, u32 *val) static int b53_mmap_read48(struct b53_device *dev, u8 page, u8 reg, u64 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 2)) return -EINVAL; @@ -107,7 +111,8 @@ static int b53_mmap_read48(struct b53_device *dev, u8 page, u8 reg, u64 *val) static int b53_mmap_read64(struct b53_device *dev, u8 page, u8 reg, u64 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; u32 hi, lo; if (WARN_ON(reg % 4)) @@ -128,7 +133,8 @@ static int b53_mmap_read64(struct b53_device *dev, u8 page, u8 reg, u64 *val) static int b53_mmap_write8(struct b53_device *dev, u8 page, u8 reg, u8 value) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; writeb(value, regs + (page << 8) + reg); @@ -138,7 +144,8 @@ static int b53_mmap_write8(struct b53_device *dev, u8 page, u8 reg, u8 value) static int b53_mmap_write16(struct b53_device *dev, u8 page, u8 reg, u16 value) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 2)) return -EINVAL; @@ -154,7 +161,8 @@ static int b53_mmap_write16(struct b53_device *dev, u8 page, u8 reg, static int b53_mmap_write32(struct b53_device *dev, u8 page, u8 reg, u32 value) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 4)) return -EINVAL; @@ -223,12 +231,19 @@ static const struct b53_io_ops b53_mmap_ops = { static int b53_mmap_probe(struct platform_device *pdev) { struct b53_platform_data *pdata = pdev->dev.platform_data; + struct b53_mmap_priv *priv; struct b53_device *dev; if (!pdata) return -EINVAL; - dev = b53_switch_alloc(&pdev->dev, &b53_mmap_ops, pdata->regs); + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->regs = pdata->regs; + + dev = b53_switch_alloc(&pdev->dev, &b53_mmap_ops, priv); if (!dev) return -ENOMEM; -- 2.14.1
[PATCH net 0/2] net: Broadcom drivers sparse fixes
Hi all, This patch series fixes the same warning reported by sparse in bcmsysport and bcmgenet in the code that deals with inserting the TX checksum pointers: drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: warning: cast from restricted __be16 drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: warning: incorrect type in argument 1 (different base types) drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: expected unsigned short [unsigned] [usertype] val drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: got restricted __be16 [usertype] protocol This series fixes both issues by using the same construct: instead of swapping skb->protocol, swap the constant values we are checking against. Florian Fainelli (2): net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum() net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb() drivers/net/ethernet/broadcom/bcmsysport.c | 11 ++++++----- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 11 ++++++----- 2 files changed, 12 insertions(+), 10 deletions(-) -- 2.14.1
[PATCH net 2/2] net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()
skb->protocol is a __be16 that we were calling htons() against; while this is not wrong per se, since it correctly results in swapping the value on LE hosts, it still upsets sparse. Adopt a similar pattern to what other drivers do: assign ip_ver from skb->protocol, and then use htons() against the different constants such that the compiler can resolve the values at build time. Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver") Signed-off-by: Florian Fainelli --- drivers/net/ethernet/broadcom/bcmsysport.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index 3fc549b88c43..879db1c700bd 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -1133,7 +1133,7 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb, u32 csum_info; u8 ip_proto; u16 csum_start; - u16 ip_ver; + __be16 ip_ver; /* Re-allocate SKB if needed */ if (unlikely(skb_headroom(skb) < sizeof(*tsb))) { @@ -1152,12 +1152,12 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb, memset(tsb, 0, sizeof(*tsb)); if (skb->ip_summed == CHECKSUM_PARTIAL) { - ip_ver = htons(skb->protocol); + ip_ver = skb->protocol; switch (ip_ver) { - case ETH_P_IP: + case htons(ETH_P_IP): ip_proto = ip_hdr(skb)->protocol; break; - case ETH_P_IPV6: + case htons(ETH_P_IPV6): ip_proto = ipv6_hdr(skb)->nexthdr; break; default: @@ -1171,7 +1171,8 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb, if (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP) { csum_info |= L4_LENGTH_VALID; - if (ip_proto == IPPROTO_UDP && ip_ver == ETH_P_IP) + if (ip_proto == IPPROTO_UDP && + ip_ver == htons(ETH_P_IP)) csum_info |= L4_UDP; } else { csum_info = 0; -- 2.14.1
[PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()
skb->protocol is a __be16 that we were calling htons() against; while this is not wrong per se, since it correctly results in swapping the value on LE hosts, it still upsets sparse. Adopt a similar pattern to what other drivers do: assign ip_ver from skb->protocol, and then use htons() against the different constants such that the compiler can resolve the values at build time. Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file") Signed-off-by: Florian Fainelli --- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c index b1e35a9accf1..aeb329690f78 100644 --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c @@ -1460,7 +1460,7 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct net_device *dev, struct sk_buff *new_skb; u16 offset; u8 ip_proto; - u16 ip_ver; + __be16 ip_ver; u32 tx_csum_info; if (unlikely(skb_headroom(skb) < sizeof(*status))) { @@ -1480,12 +1480,12 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct net_device *dev, status = (struct status_64 *)skb->data; if (skb->ip_summed == CHECKSUM_PARTIAL) { - ip_ver = htons(skb->protocol); + ip_ver = skb->protocol; switch (ip_ver) { - case ETH_P_IP: + case htons(ETH_P_IP): ip_proto = ip_hdr(skb)->protocol; break; - case ETH_P_IPV6: + case htons(ETH_P_IPV6): ip_proto = ipv6_hdr(skb)->nexthdr; break; default: @@ -1501,7 +1501,8 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct net_device *dev, */ if (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP) { tx_csum_info |= STATUS_TX_CSUM_LV; - if (ip_proto == IPPROTO_UDP && ip_ver == ETH_P_IP) + if (ip_proto == IPPROTO_UDP && + ip_ver == htons(ETH_P_IP)) tx_csum_info |= STATUS_TX_CSUM_PROTO_UDP; } else { tx_csum_info = 0; -- 2.14.1
[PATCH net-next] rxrpc: Fix undefined packet handling
By analogy with other Rx implementations, RxRPC packet types 9, 10 and 11 should just be discarded rather than being aborted like other undefined packet types. Reported-by: Jeffrey Altman Signed-off-by: David Howells --- net/rxrpc/input.c | 6 ++++++ net/rxrpc/protocol.h | 6 ++++++ 2 files changed, 12 insertions(+) diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c index 21800e6f5019..0410d2277ca2 100644 --- a/net/rxrpc/input.c +++ b/net/rxrpc/input.c @@ -1200,6 +1200,12 @@ void rxrpc_data_ready(struct sock *udp_sk) !rxrpc_validate_jumbo(skb)) goto bad_message; break; + + /* Packet types 9-11 should just be ignored. */ + case RXRPC_PACKET_TYPE_PARAMS: + case RXRPC_PACKET_TYPE_10: + case RXRPC_PACKET_TYPE_11: + goto discard; } rcu_read_lock(); diff --git a/net/rxrpc/protocol.h b/net/rxrpc/protocol.h index 4bddcf3face3..93da73bf7098 100644 --- a/net/rxrpc/protocol.h +++ b/net/rxrpc/protocol.h @@ -46,6 +46,9 @@ struct rxrpc_wire_header { #define RXRPC_PACKET_TYPE_CHALLENGE 6 /* connection security challenge (SRVR->CLNT) */ #define RXRPC_PACKET_TYPE_RESPONSE 7 /* connection security response (CLNT->SRVR) */ #define RXRPC_PACKET_TYPE_DEBUG 8 /* debug info request */ +#define RXRPC_PACKET_TYPE_PARAMS 9 /* Parameter negotiation (unspec'd, ignore) */ +#define RXRPC_PACKET_TYPE_10 10 /* Ignored */ +#define RXRPC_PACKET_TYPE_11 11 /* Ignored */ #define RXRPC_PACKET_TYPE_VERSION 13 /* version string request */ #define RXRPC_N_PACKET_TYPES 14 /* number of packet types (incl type 0) */ @@ -78,6 +81,9 @@ struct rxrpc_wire_header { (1 << RXRPC_PACKET_TYPE_CHALLENGE) |\ (1 << RXRPC_PACKET_TYPE_RESPONSE) | \ /*(1 << RXRPC_PACKET_TYPE_DEBUG) | */ \ + (1 << RXRPC_PACKET_TYPE_PARAMS) | \ + (1 << RXRPC_PACKET_TYPE_10) | \ + (1 << RXRPC_PACKET_TYPE_11) | \ (1 << RXRPC_PACKET_TYPE_VERSION))
[PATCH] PCI: allow drivers to limit the number of VFs to 0
Some user space depends on enabling sriov_totalvfs number of VFs to not fail, e.g.: $ cat .../sriov_totalvfs > .../sriov_numvfs For devices which VF support depends on loaded FW we have the pci_sriov_{g,s}et_totalvfs() API. However, this API uses 0 as a special "unset" value, meaning drivers can't limit sriov_totalvfs to 0. Remove the special values completely and simply initialize driver_max_VFs to total_VFs. Then always use driver_max_VFs. Add a helper for drivers to reset the VF limit back to total. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_main.c | 6 +++--- drivers/pci/iov.c | 27 +++++++++++++++++++++------ include/linux/pci.h | 2 ++ 3 files changed, 26 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index c4b1f344b4da..a76d177e40dd 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -123,7 +123,7 @@ static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf) return pci_sriov_set_totalvfs(pf->pdev, pf->limit_vfs); pf->limit_vfs = ~0; - pci_sriov_set_totalvfs(pf->pdev, 0); /* 0 is unset */ + pci_sriov_reset_totalvfs(pf->pdev); /* Allow any setting for backwards compatibility if symbol not found */ if (err == -ENOENT) return 0; @@ -537,7 +537,7 @@ static int nfp_pci_probe(struct pci_dev *pdev, err_net_remove: nfp_net_pci_remove(pf); err_sriov_unlimit: - pci_sriov_set_totalvfs(pf->pdev, 0); + pci_sriov_reset_totalvfs(pf->pdev); err_fw_unload: kfree(pf->rtbl); nfp_mip_close(pf->mip); @@ -570,7 +570,7 @@ static void nfp_pci_remove(struct pci_dev *pdev) nfp_hwmon_unregister(pf); nfp_pcie_sriov_disable(pdev); - pci_sriov_set_totalvfs(pf->pdev, 0); + pci_sriov_reset_totalvfs(pf->pdev); nfp_net_pci_remove(pf); diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 677924ae0350..c63ea870d8be 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -443,6 +443,7 @@ static int sriov_init(struct pci_dev *dev, int pos) iov->nres = nres; iov->ctrl = ctrl; iov->total_VFs = total; + iov->driver_max_VFs = total; pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); iov->pgsz = pgsz; iov->self = dev; @@ -788,12 +789,29 @@ int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs) } EXPORT_SYMBOL_GPL(pci_sriov_set_totalvfs); +/** + * pci_sriov_reset_totalvfs -- return the TotalVFs value to the default + * @dev: the PCI PF device + * + * Should be called from PF driver's remove routine with + * device's mutex held. + */ +void pci_sriov_reset_totalvfs(struct pci_dev *dev) +{ + /* Shouldn't change if VFs already enabled */ + if (!dev->is_physfn || dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE) + return; + + dev->sriov->driver_max_VFs = dev->sriov->total_VFs; +} +EXPORT_SYMBOL_GPL(pci_sriov_reset_totalvfs); + /** * pci_sriov_get_totalvfs -- get total VFs supported on this device * @dev: the PCI PF device * - * For a PCIe device with SRIOV support, return the PCIe - * SRIOV capability value of TotalVFs or the value of driver_max_VFs + * For a PCIe device with SRIOV support, return the value of driver_max_VFs + * which can be equal to the PCIe SRIOV capability value of TotalVFs or lower * if the driver reduced it. Otherwise 0.
*/ int pci_sriov_get_totalvfs(struct pci_dev *dev) @@ -801,9 +819,6 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev) if (!dev->is_physfn) return 0; - if (dev->sriov->driver_max_VFs) - return dev->sriov->driver_max_VFs; - - return dev->sriov->total_VFs; + return dev->sriov->driver_max_VFs; } EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs); diff --git a/include/linux/pci.h b/include/linux/pci.h index 024a1beda008..95fde8850393 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1952,6 +1952,7 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id); int pci_num_vf(struct pci_dev *dev); int pci_vfs_assigned(struct pci_dev *dev); int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs); +void pci_sriov_reset_totalvfs(struct pci_dev *dev); int pci_sriov_get_totalvfs(struct pci_dev *dev); resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno); void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe); @@ -1978,6 +1979,7 @@ static inline int pci_vfs_assigned(struct pci_dev *dev) { return 0; } static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs) { return 0; } +static inline void pci_sriov_reset_totalvfs(struct pci_dev *dev) { } static inline int pci_sriov_get_totalvfs(struct pci_dev *dev) { return 0; } static inline resource_size_t
Re: WARNING: refcount bug in should_fail
On Mon, Apr 02, 2018 at 10:52:12PM +0100, Al Viro wrote: > On Mon, Apr 02, 2018 at 03:30:56PM -0500, Eric W. Biederman wrote: > > Tetsuo Handa writes: > > > > I don't think this is a dup of existing bug. > > > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080. > > > > Even if expanding mount_ns to more filesystems was magically fixed, > > proc would still have this issue with the pid namespace rather than > > the net namespace. > > > > This is a mess. I will take a look and see if I can see a fix. > > It's trivially fixable, and there's no need to modify mount_ns() at > all. > > All we need is for rpc_kill_sb() to recognize whether we are already > through the point in rpc_fill_super() where the refcount is bumped. > That's it. > > The most trivial way to do that is to move > net = get_net(sb->s_fs_info); > past > if (!root) > return -ENOMEM; > in the latter and have > out: > if (!sb->s_root) > net = NULL; > kill_litter_super(sb); > if (net) > put_net(net); > in the end of the former. And similar changes in other affected > instances. FWIW, I'm going through the ->kill_sb() instances, fixing that sort of bugs (most of them preexisting, but I should've checked instead of assuming that everything's fine). Will push out later tonight.
Re: WARNING: refcount bug in should_fail
On Mon, Apr 02, 2018 at 03:30:56PM -0500, Eric W. Biederman wrote: > Tetsuo Handa writes: > > I don't think this is a dup of existing bug. > > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080. > > Even if expanding mount_ns to more filesystems was magically fixed, > proc would still have this issue with the pid namespace rather than > the net namespace. > > This is a mess. I will take a look and see if I can see a fix. It's trivially fixable, and there's no need to modify mount_ns() at all. All we need is for rpc_kill_sb() to recognize whether we are already through the point in rpc_fill_super() where the refcount is bumped. That's it. The most trivial way to do that is to move net = get_net(sb->s_fs_info); past if (!root) return -ENOMEM; in the latter and have out: if (!sb->s_root) net = NULL; kill_litter_super(sb); if (net) put_net(net); in the end of the former. And similar changes in other affected instances.
Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
On Fri, 30 Mar 2018 09:54:37 -0700, Alexander Duyck wrote: > On Fri, Mar 30, 2018 at 4:49 AM, Christoph Hellwig wrote: > > On Thu, Mar 29, 2018 at 11:22:31AM -0700, Jakub Kicinski wrote: > >> Some user space depends on driver allowing sriov_totalvfs to be > >> enabled. > > > > I can't make sense of this sentence. Can you explain what user space > > code depends on what semantics? The sriov_totalvfs file should show > > up for any device supporting SR-IOV as far as I can tell. > > > >> > >> For devices which VF support depends on loaded FW we > >> have the pci_sriov_{g,s}et_totalvfs() API. However, this API > >> uses 0 as a special "unset" value, meaning drivers can't limit > >> sriov_totalvfs to 0. Change the special value to be U16_MAX. > >> Use simple min() to determine actual totalvfs. > > > > Please use a PCI_MAX_VFS or similar define instead of plain U16_MAX or ~0. > > Actually is there any reason why driver_max_VFs couldn't just be > initialized to the same value as total_VFs? > > Also looking over the patch I don't see how writing ~0 would be > accepted unless you also make changes to pci_sriov_set_totalvfs since > it should fail the "numvfs > dev->sriov->total_VFs" check. You might > just want to look at adding a new function that would reset the > driver_max_VFs value instead of trying to write it to an arbitrary > value from the driver. Ack, the reset function plus using total_VFs as unset seems a lot cleaner, thanks!
Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
On Fri, 30 Mar 2018 04:49:05 -0700, Christoph Hellwig wrote: > On Thu, Mar 29, 2018 at 11:22:31AM -0700, Jakub Kicinski wrote: > > Some user space depends on driver allowing sriov_totalvfs to be > > enabled. > > I can't make sense of this sentence. Can you explain what user space > code depends on what semantics? The sriov_totalvfs file should show > up for any device supporting SR-IOV as far as I can tell. Ugh, it took me a while to understand what I meant myself... So what I meant is that this should generally work: # cat .../sriov_totalvfs > .../sriov_numvfs I.e. one should be able to "enable" the number of VFs advertised in sriov_totalvfs. I will rephrase! > > For devices which VF support depends on loaded FW we > > have the pci_sriov_{g,s}et_totalvfs() API. However, this API > > uses 0 as a special "unset" value, meaning drivers can't limit > > sriov_totalvfs to 0. Change the special value to be U16_MAX. > > Use simple min() to determine actual totalvfs. > > Please use a PCI_MAX_VFS or similar define instead of plain U16_MAX or ~0. I think I may just go with what I think Alex is suggesting and "unset" the value to total_VFs.
Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
On 4/2/2018 11:25 PM, Keller, Jacob E wrote: -Original Message- From: Bjorn Helgaas [mailto:helg...@kernel.org] Sent: Monday, April 02, 2018 12:58 PM To: Keller, Jacob ECc: Tal Gilboa ; Tariq Toukan ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux- ker...@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote: -Original Message- From: Bjorn Helgaas [mailto:helg...@kernel.org] Sent: Friday, March 30, 2018 2:05 PM To: Tal Gilboa Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired- l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited From: Tal Gilboa Add pcie_print_link_status(). This logs the current settings of the link (speed, width, and total available bandwidth). If the device is capable of more bandwidth but is limited by a slower upstream link, we include information about the link that limits the device's performance. The user may be able to move the device to a different slot for better performance. This provides a unified method for all PCI devices to report status and issues, instead of each device reporting in a different way, using different code. 
Signed-off-by: Tal Gilboa [bhelgaas: changelog, reword log messages, print device capabilities when not limited] Signed-off-by: Bjorn Helgaas --- drivers/pci/pci.c | 29 + include/linux/pci.h | 1 + 2 files changed, 30 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index e00d56b12747..cec7aed09f6b 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed, return *width * PCIE_SPEED2MBS_ENC(*speed); } +/** + * pcie_print_link_status - Report the PCI device's link speed and width + * @dev: PCI device to query + * + * Report the available bandwidth at the device. If this is less than the + * device is capable of, report the device's maximum possible bandwidth and + * the upstream link that limits its performance to less than that. + */ +void pcie_print_link_status(struct pci_dev *dev) +{ + enum pcie_link_width width, width_cap; + enum pci_bus_speed speed, speed_cap; + struct pci_dev *limiting_dev = NULL; + u32 bw_avail, bw_cap; + + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width); + + if (bw_avail >= bw_cap) + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n", +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); + else + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n", +bw_avail, PCIE_SPEED2STR(speed), width, +limiting_dev ? pci_name(limiting_dev) : "<unknown>", +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); +} Personally, I would make this last one a pci_warn() to indicate it at a higher log level, but I'm ok with the wording, and if consensus is that this should be at info, I'm ok with that. Tal's original patch did have a pci_warn() here, and we went back and forth a bit. They get bug reports when a device doesn't perform as expected, which argues for pci_warn().
But they also got feedback saying warnings are a bit too much, which argues for pci_info() [1] I don't have a really strong opinion either way. I have a slight preference for info because the user may not be able to do anything about it (there may not be a faster slot available), and I think distros are usually configured so a warning interrupts the smooth graphical boot. It looks like mlx4, fm10k, and ixgbe currently use warnings, while bnx2x, bnxt_en, and cxgb4 use info. It's a tie so far :) [1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b- 5ffaf261c...@mellanox.com With that information, I'm fine with the proposal to display this as only an info. The message is still printed and can be used for debugging purposes, and I think that's really enough. Here's a proposal for printing the bandwidth as "x.xxx Gb/s": Nice, I like that also. Regards, Jake Same here for
RE: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
> -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] > On Behalf Of Bjorn Helgaas > Sent: Monday, April 02, 2018 1:32 PM > To: Keller, Jacob E> Cc: Tal Gilboa ; Tariq Toukan ; Ariel > Elior ; Ganesh Goudar ; > Kirsher, Jeffrey T ; everest-linux...@cavium.com; > intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux- > ker...@vger.kernel.org; linux-...@vger.kernel.org > Subject: Re: [PATCH v5 12/14] fm10k: Report PCIe link properties with > pcie_print_link_status() > > On Mon, Apr 02, 2018 at 03:56:06PM +, Keller, Jacob E wrote: > > > -Original Message- > > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > > Sent: Friday, March 30, 2018 2:06 PM > > > To: Tal Gilboa > > > Cc: Tariq Toukan ; Keller, Jacob E > > > ; Ariel Elior ; Ganesh > > > Goudar ; Kirsher, Jeffrey T > > > ; everest-linux...@cavium.com; intel-wired- > > > l...@lists.osuosl.org; netdev@vger.kernel.org; > > > linux-ker...@vger.kernel.org; > > > linux-...@vger.kernel.org > > > Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with > > > pcie_print_link_status() > > > > > > From: Bjorn Helgaas > > > > > > Use pcie_print_link_status() to report PCIe link speed and possible > > > limitations instead of implementing this in the driver itself. > > > > > > Note that pcie_get_minimum_link() can return misleading information > because > > > it finds the slowest link and the narrowest link without considering the > > > total bandwidth of the link. If the path contains a 16 GT/s x1 link and a > > > 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which > > > corresponds to 250 MB/s of bandwidth, not the actual available bandwidth > > > of > > > about 2000 MB/s for a 16 GT/s x1 link. > > > > This comment is about what's being fixed, so it would have been easier to > > parse if it were written to more clearly indicate that we're removing > > (and not adding) this behavior. > > Good point. Is this any better? 
> > fm10k: Report PCIe link properties with pcie_print_link_status() > > Previously the driver used pcie_get_minimum_link() to warn when the NIC > is in a slot that can't supply as much bandwidth as the NIC could use. > > pcie_get_minimum_link() can be misleading because it finds the slowest link > and the narrowest link (which may be different links) without considering > the total bandwidth of each link. For a path with a 16 GT/s x1 link and a > 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of > bandwidth, not the true available bandwidth of about 1969 MB/s for a > 16 GT/s x1 link. > > Use pcie_print_link_status() to report PCIe link speed and possible > limitations instead of implementing this in the driver itself. This finds > the slowest link in the path to the device by computing the total bandwidth > of each link and compares that with the capabilities of the device. > > Note that the driver previously used dev_warn() to suggest using a > different slot, but pcie_print_link_status() uses dev_info() because if the > platform has no faster slot available, the user can't do anything about the > warning and may not want to be bothered with it. Perfect! Thanks! -Jake
Re: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
On Mon, Apr 02, 2018 at 03:56:06PM +, Keller, Jacob E wrote: > > -Original Message- > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > Sent: Friday, March 30, 2018 2:06 PM > > To: Tal Gilboa> > Cc: Tariq Toukan ; Keller, Jacob E > > ; Ariel Elior ; Ganesh > > Goudar ; Kirsher, Jeffrey T > > ; everest-linux...@cavium.com; intel-wired- > > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > > linux-...@vger.kernel.org > > Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with > > pcie_print_link_status() > > > > From: Bjorn Helgaas > > > > Use pcie_print_link_status() to report PCIe link speed and possible > > limitations instead of implementing this in the driver itself. > > > > Note that pcie_get_minimum_link() can return misleading information because > > it finds the slowest link and the narrowest link without considering the > > total bandwidth of the link. If the path contains a 16 GT/s x1 link and a > > 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which > > corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of > > about 2000 MB/s for a 16 GT/s x1 link. > > This comment is about what's being fixed, so it would have been easier to > parse if it were written to more clearly indicate that we're removing > (and not adding) this behavior. Good point. Is this any better? fm10k: Report PCIe link properties with pcie_print_link_status() Previously the driver used pcie_get_minimum_link() to warn when the NIC is in a slot that can't supply as much bandwidth as the NIC could use. pcie_get_minimum_link() can be misleading because it finds the slowest link and the narrowest link (which may be different links) without considering the total bandwidth of each link. For a path with a 16 GT/s x1 link and a 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of bandwidth, not the true available bandwidth of about 1969 MB/s for a 16 GT/s x1 link. 
Use pcie_print_link_status() to report PCIe link speed and possible limitations instead of implementing this in the driver itself. This finds the slowest link in the path to the device by computing the total bandwidth of each link and compares that with the capabilities of the device. Note that the driver previously used dev_warn() to suggest using a different slot, but pcie_print_link_status() uses dev_info() because if the platform has no faster slot available, the user can't do anything about the warning and may not want to be bothered with it.
Re: WARNING: refcount bug in should_fail
Tetsuo Handa writes: > syzbot wrote: >> > On Sun, Mar 4, 2018 at 6:57 AM, Tetsuo Handa >> > wrote: >> >> Switching from mm to fsdevel, for this report says that put_net(net) in >> >> rpc_kill_sb() made net->count < 0 when mount_ns() failed due to >> >> register_shrinker() failure. >> >> >> Relevant commits will be >> >> commit 9ee332d99e4d5a97 ("sget(): handle failures of >> >> register_shrinker()") and >> >> commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to >> >> mount_ns."). >> >> >> When sget_userns() in mount_ns() failed, mount_ns() returns an error >> >> code to >> >> the caller without calling fill_super(). That is, get_net(sb->s_fs_info) >> >> was >> >> not called by rpc_fill_super() (via fill_super callback passed to >> >> mount_ns()) >> >> but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb() >> >> from >> >> deactivate_locked_super()). >> >> >> -- >> >> static struct dentry * >> >> rpc_mount(struct file_system_type *fs_type, >> >> int flags, const char *dev_name, void *data) >> >> { >> >> struct net *net = current->nsproxy->net_ns; >> >> return mount_ns(fs_type, flags, data, net, net->user_ns, >> >> rpc_fill_super); >> >> } >> >> -- >> >> > Messed kernel output, this is definitely not in should_fail. >> >> > #syz dup: WARNING: refcount bug in sk_alloc >> >> Can't find the corresponding bug. >> > I don't think this is a dup of existing bug. > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080. Even if expanding mount_ns to more filesystems was magically fixed, proc would still have this issue with the pid namespace rather than the net namespace. This is a mess. I will take a look and see if I can see a fix. Eric
[PATCH net-next] nfp: add a separate counter for packets with CHECKSUM_COMPLETE
We are currently counting packets with CHECKSUM_COMPLETE as "hw_rx_csum_ok". This is confusing. Add a new counter. To make sure it fits in the same cacheline move the less used error counter to a different location. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 4 +++- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 16 +--- 3 files changed, 13 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 787df47ec430..bd7d8ae31e17 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -391,6 +391,7 @@ struct nfp_net_rx_ring { * @rx_drops: Number of packets dropped on RX due to lack of resources * @hw_csum_rx_ok: Counter of packets where the HW checksum was OK * @hw_csum_rx_inner_ok: Counter of packets where the inner HW checksum was OK + * @hw_csum_rx_complete: Counter of packets with CHECKSUM_COMPLETE reported * @hw_csum_rx_error: Counter of packets with bad checksums * @tx_sync: Seqlock for atomic updates of TX stats * @tx_pkts: Number of Transmitted packets @@ -434,7 +435,7 @@ struct nfp_net_r_vector { u64 rx_drops; u64 hw_csum_rx_ok; u64 hw_csum_rx_inner_ok; - u64 hw_csum_rx_error; + u64 hw_csum_rx_complete; struct nfp_net_tx_ring *xdp_ring; @@ -446,6 +447,7 @@ struct nfp_net_r_vector { u64 tx_gather; u64 tx_lso; + u64 hw_csum_rx_error; u64 rx_replace_buf_alloc_fail; u64 tx_errors; u64 tx_busy; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 43a9c207a049..1eb6549f2a54 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1406,7 +1406,7 @@ static void nfp_net_rx_csum(struct nfp_net_dp *dp, skb->ip_summed = meta->csum_type; skb->csum = meta->csum;
u64_stats_update_begin(&r_vec->rx_sync); - r_vec->hw_csum_rx_ok++; + r_vec->hw_csum_rx_complete++; u64_stats_update_end(&r_vec->rx_sync); return; } diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c index e1dae0616f52..c9016419bfa0 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c @@ -179,7 +179,7 @@ static const struct nfp_et_stat nfp_mac_et_stats[] = { #define NN_ET_GLOBAL_STATS_LEN ARRAY_SIZE(nfp_net_et_stats) #define NN_ET_SWITCH_STATS_LEN 9 -#define NN_RVEC_GATHER_STATS 8 +#define NN_RVEC_GATHER_STATS 9 #define NN_RVEC_PER_Q_STATS 3 static void nfp_net_get_nspinfo(struct nfp_app *app, char *version) @@ -468,6 +468,7 @@ static u8 *nfp_vnic_get_sw_stats_strings(struct net_device *netdev, u8 *data) data = nfp_pr_et(data, "hw_rx_csum_ok"); data = nfp_pr_et(data, "hw_rx_csum_inner_ok"); + data = nfp_pr_et(data, "hw_rx_csum_complete"); data = nfp_pr_et(data, "hw_rx_csum_err"); data = nfp_pr_et(data, "rx_replace_buf_alloc_fail"); data = nfp_pr_et(data, "hw_tx_csum"); @@ -493,18 +494,19 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data) data[0] = nn->r_vecs[i].rx_pkts; tmp[0] = nn->r_vecs[i].hw_csum_rx_ok; tmp[1] = nn->r_vecs[i].hw_csum_rx_inner_ok; - tmp[2] = nn->r_vecs[i].hw_csum_rx_error; - tmp[3] = nn->r_vecs[i].rx_replace_buf_alloc_fail; + tmp[2] = nn->r_vecs[i].hw_csum_rx_complete; + tmp[3] = nn->r_vecs[i].hw_csum_rx_error; + tmp[4] = nn->r_vecs[i].rx_replace_buf_alloc_fail; } while (u64_stats_fetch_retry(&nn->r_vecs[i].rx_sync, start)); do { start = u64_stats_fetch_begin(&nn->r_vecs[i].tx_sync); data[1] = nn->r_vecs[i].tx_pkts; data[2] = nn->r_vecs[i].tx_busy; - tmp[4] = nn->r_vecs[i].hw_csum_tx; - tmp[5] = nn->r_vecs[i].hw_csum_tx_inner; - tmp[6] = nn->r_vecs[i].tx_gather; - tmp[7] = nn->r_vecs[i].tx_lso; + tmp[5] = nn->r_vecs[i].hw_csum_tx; + tmp[6] = nn->r_vecs[i].hw_csum_tx_inner; + tmp[7] = 
nn->r_vecs[i].tx_gather; + tmp[8] = nn->r_vecs[i].tx_lso; } while (u64_stats_fetch_retry(&nn->r_vecs[i].tx_sync, start));
Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
On Mon, 2 Apr 2018 07:43:49 + Alexander Kurz wrote: > Remove the duplicated code for asix88179_178a bind and reset methods. > > Signed-off-by: Alexander Kurz > --- > drivers/net/usb/ax88179_178a.c | 137 > ++--- > 1 file changed, 31 insertions(+), 106 deletions(-) The good news is that this patch doesn't seem to break anything this time. A few remarks below: > > diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c > index a6ef75907ae9..fea4c7b877cc 100644 > --- a/drivers/net/usb/ax88179_178a.c > +++ b/drivers/net/usb/ax88179_178a.c > @@ -1223,7 +1223,7 @@ static int ax88179_led_setting(struct usbnet *dev) > return 0; > } > > -static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf) > +static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset) > { > u8 buf[5]; > u16 *tmp16; > @@ -1231,12 +1231,11 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > struct ax88179_data *ax179_data = (struct ax88179_data *)dev->data; > struct ethtool_eee eee_data; > > - usbnet_get_endpoints(dev, intf); > - So you move this to "bind"... > tmp16 = (u16 *)buf; > tmp = (u8 *)buf; > > - memset(ax179_data, 0, sizeof(*ax179_data)); > + if (!do_reset) > + memset(ax179_data, 0, sizeof(*ax179_data)); ... but not that. Why? They both are exclusive to the bind operation. 
> > /* Power up ethernet PHY */ > *tmp16 = 0; > @@ -1249,9 +1248,13 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_CLK_SELECT, 1, 1, tmp); > msleep(100); > > + if (do_reset) > + ax88179_auto_detach(dev, 0); > + > ax88179_read_cmd(dev, AX_ACCESS_MAC, AX_NODE_ID, ETH_ALEN, > ETH_ALEN, dev->net->dev_addr); > - memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN); > + if (!do_reset) > + memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN); > > /* RX bulk configuration */ > memcpy(tmp, &AX88179_BULKIN_SIZE[0], 5); > @@ -1266,19 +1269,21 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_PAUSE_WATERLVL_HIGH, > 1, 1, tmp); > > - dev->net->netdev_ops = &ax88179_netdev_ops; > - dev->net->ethtool_ops = &ax88179_ethtool_ops; > - dev->net->needed_headroom = 8; > - dev->net->max_mtu = 4088; > - > - /* Initialize MII structure */ > - dev->mii.dev = dev->net; > - dev->mii.mdio_read = ax88179_mdio_read; > - dev->mii.mdio_write = ax88179_mdio_write; > - dev->mii.phy_id_mask = 0xff; > - dev->mii.reg_num_mask = 0xff; > - dev->mii.phy_id = 0x03; > - dev->mii.supports_gmii = 1; > + if (!do_reset) { > + dev->net->netdev_ops = &ax88179_netdev_ops; > + dev->net->ethtool_ops = &ax88179_ethtool_ops; > + dev->net->needed_headroom = 8; > + dev->net->max_mtu = 4088; > + > + /* Initialize MII structure */ > + dev->mii.dev = dev->net; > + dev->mii.mdio_read = ax88179_mdio_read; > + dev->mii.mdio_write = ax88179_mdio_write; > + dev->mii.phy_id_mask = 0xff; > + dev->mii.reg_num_mask = 0xff; > + dev->mii.phy_id = 0x03; > + dev->mii.supports_gmii = 1; > + } > > dev->net->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | > NETIF_F_RXCSUM; > @@ -1330,6 +1335,13 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > return 0; > } > > +static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf) > +{ > + usbnet_get_endpoints(dev, intf); > + > + return 
ax88179_bind_or_reset(dev, false); > +} I find the whole construct extremely messy. You shouldn't need to "bind or reset" the adapter. You first reset it (that's the HW part), and you then bind it (that's the SW part). I understand that the code is quite messy already, and that you're trying to improve it. I'm just not convinced that having a single function containing everything and the kitchen sink is the most sensible approach. Instead, you probably want to extract stuff that needs to be done at reset time from what can be done at bind time, and keep the two quite separate. The fact that bind is mostly a superset of reset is a bit of an odd thing, and it'd be good to get to the bottom of that (I fully admit that my understanding of USB networking is close to zero). I came up with the below hack on top of your patch, and things seem to still behave. Thanks, M. diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c index fea4c7b877cc..aed98d400d7a 100644 --- a/drivers/net/usb/ax88179_178a.c +++ b/drivers/net/usb/ax88179_178a.c @@ -1223,7 +1223,7 @@ static int
RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
> -Original Message- > From: Bjorn Helgaas [mailto:helg...@kernel.org] > Sent: Monday, April 02, 2018 12:58 PM > To: Keller, Jacob E> Cc: Tal Gilboa ; Tariq Toukan ; Ariel > Elior ; Ganesh Goudar ; > Kirsher, Jeffrey T ; everest-linux...@cavium.com; > intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux- > ker...@vger.kernel.org; linux-...@vger.kernel.org > Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link > speed > and whether it's limited > > On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote: > > > -Original Message- > > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > > Sent: Friday, March 30, 2018 2:05 PM > > > To: Tal Gilboa > > > Cc: Tariq Toukan ; Keller, Jacob E > > > ; Ariel Elior ; Ganesh > > > Goudar ; Kirsher, Jeffrey T > > > ; everest-linux...@cavium.com; intel-wired- > > > l...@lists.osuosl.org; netdev@vger.kernel.org; > > > linux-ker...@vger.kernel.org; > > > linux-...@vger.kernel.org > > > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link > > > speed > and > > > whether it's limited > > > > > > From: Tal Gilboa > > > > > > Add pcie_print_link_status(). This logs the current settings of the link > > > (speed, width, and total available bandwidth). > > > > > > If the device is capable of more bandwidth but is limited by a slower > > > upstream link, we include information about the link that limits the > > > device's performance. > > > > > > The user may be able to move the device to a different slot for better > > > performance. > > > > > > This provides a unified method for all PCI devices to report status and > > > issues, instead of each device reporting in a different way, using > > > different code. 
> > > > > > Signed-off-by: Tal Gilboa > > > [bhelgaas: changelog, reword log messages, print device capabilities when > > > not limited] > > > Signed-off-by: Bjorn Helgaas > > > --- > > > drivers/pci/pci.c | 29 + > > > include/linux/pci.h |1 + > > > 2 files changed, 30 insertions(+) > > > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > > index e00d56b12747..cec7aed09f6b 100644 > > > --- a/drivers/pci/pci.c > > > +++ b/drivers/pci/pci.c > > > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, > > > enum pci_bus_speed *speed, > > > return *width * PCIE_SPEED2MBS_ENC(*speed); > > > } > > > > > > +/** > > > + * pcie_print_link_status - Report the PCI device's link speed and width > > > + * @dev: PCI device to query > > > + * > > > + * Report the available bandwidth at the device. If this is less than > > > the > > > + * device is capable of, report the device's maximum possible bandwidth > > > and > > > + * the upstream link that limits its performance to less than that. > > > + */ > > > +void pcie_print_link_status(struct pci_dev *dev) > > > +{ > > > + enum pcie_link_width width, width_cap; > > > + enum pci_bus_speed speed, speed_cap; > > > + struct pci_dev *limiting_dev = NULL; > > > + u32 bw_avail, bw_cap; > > > + > > > + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); > > > + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, > > > &width); > > > + > > > + if (bw_avail >= bw_cap) > > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n", > > > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > > + else > > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d > > > link at %s (capable of %d Mb/s with %s x%d link)\n", > > > + bw_avail, PCIE_SPEED2STR(speed), width, > > > + limiting_dev ? 
pci_name(limiting_dev) : "", > > > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > > +} > > Personally, I would make this last one a pci_warn() to indicate it at a > > higher log level, but I'm ok with the wording, and if consensus is that > > this should be at info, I'm ok with that. > > Tal's original patch did have a pci_warn() here, and we went back and > forth a bit. They get bug reports when a device doesn't perform as > expected, which argues for pci_warn(). But they also got feedback > saying warnings are a bit too much, which argues for pci_info() [1] > > I don't have a really strong opinion either way. I have a slight > preference for info because the user may not be able to do anything > about it (there may not be a faster slot available), and I think > distros are usually configured so a warning interrupts the smooth > graphical boot. > > It looks like mlx4, fm10k, and ixgbe currently use warnings, while > bnx2x, bnxt_en, and cxgb4 use info. It's a tie so far
Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote: > > -Original Message- > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > Sent: Friday, March 30, 2018 2:05 PM > > To: Tal Gilboa> > Cc: Tariq Toukan ; Keller, Jacob E > > ; Ariel Elior ; Ganesh > > Goudar ; Kirsher, Jeffrey T > > ; everest-linux...@cavium.com; intel-wired- > > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > > linux-...@vger.kernel.org > > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link > > speed and > > whether it's limited > > > > From: Tal Gilboa > > > > Add pcie_print_link_status(). This logs the current settings of the link > > (speed, width, and total available bandwidth). > > > > If the device is capable of more bandwidth but is limited by a slower > > upstream link, we include information about the link that limits the > > device's performance. > > > > The user may be able to move the device to a different slot for better > > performance. > > > > This provides a unified method for all PCI devices to report status and > > issues, instead of each device reporting in a different way, using > > different code. > > > > Signed-off-by: Tal Gilboa > > [bhelgaas: changelog, reword log messages, print device capabilities when > > not limited] > > Signed-off-by: Bjorn Helgaas > > --- > > drivers/pci/pci.c | 29 + > > include/linux/pci.h |1 + > > 2 files changed, 30 insertions(+) > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > index e00d56b12747..cec7aed09f6b 100644 > > --- a/drivers/pci/pci.c > > +++ b/drivers/pci/pci.c > > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, > > enum pci_bus_speed *speed, > > return *width * PCIE_SPEED2MBS_ENC(*speed); > > } > > > > +/** > > + * pcie_print_link_status - Report the PCI device's link speed and width > > + * @dev: PCI device to query > > + * > > + * Report the available bandwidth at the device. 
If this is less than the > > + * device is capable of, report the device's maximum possible bandwidth and > > + * the upstream link that limits its performance to less than that. > > + */ > > +void pcie_print_link_status(struct pci_dev *dev) > > +{ > > + enum pcie_link_width width, width_cap; > > + enum pci_bus_speed speed, speed_cap; > > + struct pci_dev *limiting_dev = NULL; > > + u32 bw_avail, bw_cap; > > + > > + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); > > + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, > > &width); > > + > > + if (bw_avail >= bw_cap) > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n", > > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > + else > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d > > link at %s (capable of %d Mb/s with %s x%d link)\n", > > +bw_avail, PCIE_SPEED2STR(speed), width, > > +limiting_dev ? pci_name(limiting_dev) : "", > > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > +} > > Personally, I would make this last one a pci_warn() to indicate it at a > higher log level, but I'm ok with the wording, and if consensus is that > this should be at info, I'm ok with that. Tal's original patch did have a pci_warn() here, and we went back and forth a bit. They get bug reports when a device doesn't perform as expected, which argues for pci_warn(). But they also got feedback saying warnings are a bit too much, which argues for pci_info() [1] I don't have a really strong opinion either way. I have a slight preference for info because the user may not be able to do anything about it (there may not be a faster slot available), and I think distros are usually configured so a warning interrupts the smooth graphical boot. It looks like mlx4, fm10k, and ixgbe currently use warnings, while bnx2x, bnxt_en, and cxgb4 use info. 
It's a tie so far :) [1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b-5ffaf261c...@mellanox.com Here's a proposal for printing the bandwidth as "x.xxx Gb/s": commit ad370f38c1b5e9b8bb941eaed84ebb676c4bdaa4 Author: Tal Gilboa Date: Fri Mar 30 08:56:47 2018 -0500 PCI: Add pcie_print_link_status() to log link speed and whether it's limited Add pcie_print_link_status(). This logs the current settings of the link (speed, width, and total available bandwidth). If the device is capable of more bandwidth but is limited by a slower upstream link, we include information about the link that limits the device's performance. The user may be able to move the device to a different slot for better performance. This provides a unified method for all
Re: [bpf-next PATCH 1/2] bpf: sockmap, free memory on sock close with cork data
On 04/02/2018 12:50 PM, John Fastabend wrote: > If a socket with pending cork data is closed we do not return the > memory to the socket until the garbage collector frees the psock > structure. The garbage collector though can run after the sock has > completed its close operation. If this ordering happens the sock code > will throw a WARN_ON because there is still outstanding memory > accounted to the sock. > > To resolve this ensure we return memory to the sock when a socket > is closed. > > Signed-off-by: John Fastabend > Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper") > --- Hi Alexei, Daniel, These two fixes apply against current bpf-next or bpf after bpf-next is merged. I could resend later I suppose but I think it makes sense to get these in sooner rather than later. Thanks, John
[bpf-next PATCH 2/2] bpf: sockmap, duplicates release calls may NULL sk_prot
It is possible to have multiple ULP tcp_release call paths in flight if a sock is closed and simultaneously being removed from the sockmap control path. The result would be setting the sk_prot to the saved values on the first iteration and then on the second iteration setting the value to NULL. This patch resolves this by ensuring we only reset the sk_prot pointer if we have a valid saved state to set. Fixes: 4f738adba30a7 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: John Fastabend --- kernel/bpf/sockmap.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index 8ddf326..8dd9210 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -182,8 +182,10 @@ static void bpf_tcp_release(struct sock *sk) psock->cork = NULL; } - sk->sk_prot = psock->sk_proto; - psock->sk_proto = NULL; + if (psock->sk_proto) { + sk->sk_prot = psock->sk_proto; + psock->sk_proto = NULL; + } out: rcu_read_unlock(); }
[bpf-next PATCH 1/2] bpf: sockmap, free memory on sock close with cork data
If a socket with pending cork data is closed we do not return the memory to the socket until the garbage collector frees the psock structure. The garbage collector though can run after the sock has completed its close operation. If this ordering happens the sock code will throw a WARN_ON because there is still outstanding memory accounted to the sock. To resolve this ensure we return memory to the sock when a socket is closed. Signed-off-by: John Fastabend Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper") --- kernel/bpf/sockmap.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index d2bda5a..8ddf326 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -211,6 +211,12 @@ static void bpf_tcp_close(struct sock *sk, long timeout) close_fun = psock->save_close; write_lock_bh(&sk->sk_callback_lock); + if (psock->cork) { + free_start_sg(psock->sock, psock->cork); + kfree(psock->cork); + psock->cork = NULL; + } + list_for_each_entry_safe(md, mtmp, &psock->ingress, list) { list_del(&md->list); free_start_sg(psock->sock, md);
Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
On Mon, Apr 02, 2018 at 04:00:16PM +, Keller, Jacob E wrote: > > -Original Message- > > From: Tal Gilboa [mailto:ta...@mellanox.com] > > Sent: Monday, April 02, 2018 7:34 AM > > To: Bjorn Helgaas> > Cc: Tariq Toukan ; Keller, Jacob E > > ; Ariel Elior ; Ganesh > > Goudar ; Kirsher, Jeffrey T > > ; everest-linux...@cavium.com; intel-wired- > > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > > linux-...@vger.kernel.org > > Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute > > max supported link bandwidth > > > > On 4/2/2018 5:05 PM, Bjorn Helgaas wrote: > > > On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote: > > >> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote: > > >>> On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote: > > On 3/31/2018 12:05 AM, Bjorn Helgaas wrote: > > > From: Tal Gilboa > > > > > > Add pcie_bandwidth_capable() to compute the max link bandwidth > > supported by > > > a device, based on the max link speed and width, adjusted by the > > encoding > > > overhead. > > > > > > The maximum bandwidth of the link is computed as: > > > > > > max_link_speed * max_link_width * (1 - encoding_overhead) > > > > > > The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using > > 8b/10b > > > encoding, and about 1.5% for 8 GT/s or higher speed links using > > > 128b/130b > > > encoding. 
> > > > > > Signed-off-by: Tal Gilboa > > > [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap() > > > signatures, don't export outside drivers/pci] > > > Signed-off-by: Bjorn Helgaas > > > Reviewed-by: Tariq Toukan > > > --- > > > drivers/pci/pci.c | 21 + > > > drivers/pci/pci.h |9 + > > > 2 files changed, 30 insertions(+) > > > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > > index 43075be79388..9ce89e254197 100644 > > > --- a/drivers/pci/pci.c > > > +++ b/drivers/pci/pci.c > > > @@ -5208,6 +5208,27 @@ enum pcie_link_width > > pcie_get_width_cap(struct pci_dev *dev) > > > return PCIE_LNK_WIDTH_UNKNOWN; > > > } > > > +/** > > > + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth > > capability > > > + * @dev: PCI device > > > + * @speed: storage for link speed > > > + * @width: storage for link width > > > + * > > > + * Calculate a PCI device's link bandwidth by querying for its link > > > speed > > > + * and width, multiplying them, and applying encoding overhead. > > > + */ > > > +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed > > *speed, > > > +enum pcie_link_width *width) > > > +{ > > > + *speed = pcie_get_speed_cap(dev); > > > + *width = pcie_get_width_cap(dev); > > > + > > > + if (*speed == PCI_SPEED_UNKNOWN || *width == > > PCIE_LNK_WIDTH_UNKNOWN) > > > + return 0; > > > + > > > + return *width * PCIE_SPEED2MBS_ENC(*speed); > > > +} > > > + > > > /** > > > * pci_select_bars - Make BAR mask from the type of resource > > > * @dev: the PCI device for which BAR mask is made > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > > index 66738f1050c0..2a50172b9803 100644 > > > --- a/drivers/pci/pci.h > > > +++ b/drivers/pci/pci.h > > > @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev > > *dev); > > >(speed) == PCIE_SPEED_2_5GT ? 
"2.5 GT/s" : \ > > >"Unknown speed") > > > +/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for > > gen3 */ > > > +#define PCIE_SPEED2MBS_ENC(speed) \ > > > > Missing gen4. > > >>> > > >>> I made it "gen3+". I think that's accurate, isn't it? The spec > > >>> doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2 > > >>> says rates of 8 GT/s or higher (which I think includes gen3 and gen4) > > >>> use 128b/130b encoding. > > >>> > > >> > > >> I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it > > >> wasn't > > >> added. Need to return 15754. > > > > > > Oh, duh, of course! Sorry for being dense. What about the following? > > > I included the calculation as opposed to just the magic numbers to try > > > to make it clear how they're derived. This has the disadvantage of > > > truncating the result instead of rounding, but I doubt that's > > > significant in this context. If it is, we could use the magic numbers > > > and put the computation in a comment. > > > > We can always use DIV_ROUND_UP((speed * enc_nominator), > >
Re: [PATCH v2] KEYS: DNS: limit the length of option strings
On Fri, Mar 23, 2018 at 01:21:22PM -0700, Eric Biggers wrote: > On Mon, Mar 12, 2018 at 10:57:07AM -0700, Eric Biggers wrote: > > On Wed, Mar 07, 2018 at 03:54:37PM +, David Howells wrote: > > > Eric Biggers wrote: > > > > > > > Fix it by limiting option strings (combined name + value) to a much more > > > > reasonable 128 bytes. The exact limit is arbitrary, but currently the > > > > only recognized option is formatted as "dnserror=%lu" which fits well > > > > within this limit. > > > > > > There will be more options coming ("ipv4", "ipv6") but they shouldn't > > > overrun > > > this limit and we can always extend the limit if need be. > > > > > > David > > > > David (Howells) do you want to take this patch through the keyrings tree or > > should I ask David Miller to take it through net-next? > > > > Eric > > Ping. Ping again.
[PATCH v8 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to have a sequence as wmb() writel_relaxed() mmiowb() Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index afadba9..c17924b 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1696,12 +1696,6 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(i, rx_ring->tail); } } @@ -2467,10 +2461,6 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, if (xdp_xmit) { struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()]; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. -*/ - wmb(); writel(ring->next_to_use, ring->tail); xdp_do_flush_map(); @@ -8080,12 +8070,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring, /* set the timestamp */ first->time_stamp = jiffies; - /* -* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). 
-* -* We also need this memory barrier to make certain all of the + /* We need this memory barrier to make certain all of the * status bits have been updated before next_to_watch is written. */ wmb(); @@ -8102,7 +8087,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring, ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED); if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems @@ -10034,10 +10019,6 @@ static void ixgbe_xdp_flush(struct net_device *dev) if (unlikely(!ring)) return; - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. -*/ - wmb(); writel(ring->next_to_use, ring->tail); return; -- 2.7.4
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
❦ 2 April 2018 22:05 +0300, Julian Anastasov: > Sorry to say it, but maybe you missed the discussion > on lvs-devel about the new MH scheduler implemented by Inju Song: > > https://www.spinics.net/lists/lvs-devel/msg04928.html > http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html > > In the last 6 months we fixed all issues and I acked > v4 just yesterday: > > http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html For some reason, "ipvs maglev" in Google doesn't yield those results. But having code already reviewed is great news for me: I can use it with confidence. :) -- Make sure your code "does nothing" gracefully. - The Elements of Programming Style (Kernighan & Plauger)
[PATCH v8 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 +- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +--- 2 files changed, 10 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index f174c72..1b9fa7a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -186,7 +186,13 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter *fdir_data, /* Mark the data descriptor to be watched */ first->next_to_watch = tx_desc; - writel(tx_ring->next_to_use, tx_ring->tail); + writel_relaxed(tx_ring->next_to_use, tx_ring->tail); + + /* We need this if more than one processor can write to our tail +* at a time, it synchronizes IO on IA64/Altix systems +*/ + mmiowb(); + return 0; dma_fail: @@ -1523,12 +1529,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = val; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). 
-*/ - wmb(); writel(val, rx_ring->tail); } @@ -2274,11 +2274,7 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring, static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring) { - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. -*/ - wmb(); - writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail); + writel(xdp_ring->next_to_use, xdp_ring->tail); } /** @@ -3444,7 +3440,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, /* notify HW of packet */ if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c index 12bd937..eb5556e 100644 --- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c @@ -804,12 +804,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = val; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(val, rx_ring->tail); } @@ -2379,7 +2373,7 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, /* notify HW of packet */ if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems -- 2.7.4
[PATCH v8 5/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index df86070..41e3aaa 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -172,13 +172,6 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 cleaned_count) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); - /* notify hardware of new descriptors */ writel(i, rx_ring->tail); } @@ -1036,11 +1029,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring, /* record SW timestamp if HW timestamp is not available */ skb_tx_timestamp(first->skb); - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). -* -* We also need this memory barrier to make certain all of the + /* We need this memory barrier to make certain all of the * status bits have been updated before next_to_watch is written. 
*/ wmb(); @@ -1055,7 +1044,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring, /* notify HW of packet */ if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems -- 2.7.4
[PATCH v8 4/7] igb: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/igb/igb_main.c | 14 ++ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index c1c0bc3..c3f7130 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -5652,11 +5652,7 @@ static int igb_tx_map(struct igb_ring *tx_ring, /* set the timestamp */ first->time_stamp = jiffies; - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). -* -* We also need this memory barrier to make certain all of the + /* We need this memory barrier to make certain all of the * status bits have been updated before next_to_watch is written. 
*/ wmb(); @@ -5674,7 +5670,7 @@ static int igb_tx_map(struct igb_ring *tx_ring, igb_maybe_stop_tx(tx_ring, DESC_NEEDED); if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems @@ -8073,12 +8069,6 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 cleaned_count) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(i, rx_ring->tail); } } -- 2.7.4
[PATCH v8 6/7] ixgbevf: keep writel() closer to wmb()
Remove ixgbevf_write_tail() in favor of moving writel() close to wmb(). Signed-off-by: Sinan Kaya Reviewed-by: Alexander Duyck --- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 - drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++--- 2 files changed, 3 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 447ce1d..c75ea1f 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -312,11 +312,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring *ring) return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1; } -static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value) -{ - writel(value, ring->tail); -} - #define IXGBEVF_RX_DESC(R, i) \ (&(((union ixgbe_adv_rx_desc *)((R)->desc))[i])) #define IXGBEVF_TX_DESC(R, i) \ diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index e3d04f2..757dac6 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring, * such as IA-64). */ wmb(); - ixgbevf_write_tail(rx_ring, i); + writel(i, rx_ring->tail); } } @@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, * know there are new descriptors to fetch. */ wmb(); - ixgbevf_write_tail(xdp_ring, xdp_ring->next_to_use); + writel(xdp_ring->next_to_use, xdp_ring->tail); } u64_stats_update_begin(_ring->syncp); @@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, tx_ring->next_to_use = i; /* notify HW of packet */ - ixgbevf_write_tail(tx_ring, i); + writel(i, tx_ring->tail); return; dma_error: -- 2.7.4
[PATCH v8 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 23 +++ 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 757dac6..29b71a7 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -719,12 +719,6 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring, /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(i, rx_ring->tail); } } @@ -1228,10 +1222,6 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, struct ixgbevf_ring *xdp_ring = adapter->xdp_ring[rx_ring->queue_index]; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. -*/ - wmb(); writel(xdp_ring->next_to_use, xdp_ring->tail); } @@ -3985,11 +3975,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, /* set the timestamp */ first->time_stamp = jiffies; - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). 
-* -* We also need this memory barrier (wmb) to make certain all of the + /* We also need this memory barrier (wmb) to make certain all of the * status bits have been updated before next_to_watch is written. */ wmb(); @@ -4004,7 +3990,12 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, tx_ring->next_to_use = i; /* notify HW of packet */ - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); + + /* We need this if more than one processor can write to our tail +* at a time, it synchronizes IO on IA64/Altix systems +*/ + mmiowb(); return; dma_error: -- 2.7.4
Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
[dropping Freddy as I'm getting bounces from asix.com.tw] On Mon, 2 Apr 2018 15:21:08 + Alexander Kurz wrote: Alexander, > Hi Marc, David, > with the v2 patch ("net: usb: asix88179_178a: de-duplicate code") > I made an embarrassingly stupid mistake of removing the wrong function. > The v2 patch accidentally changed ax88179_link_reset() instead of > ax88179_reset(). Hunk 6 of v2 ("net: usb: asix88179_178a: de-duplicate > code") is just utterly wrong. OK, that explains what I saw here. > ax88179_bind() and ax88179_reset() were the correct targets to be > de-duplicated, as done in the v3 patch. > > Sorry for this, Alexander We all make mistakes, so let's try to learn from them. You can improve your process by testing what you're about to send (a very basic requirement), and documenting the changes you make in each version you send (a cover letter is a good place to put it). Answering reviewer questions helps build trust between the contributor and the maintainer on the receiving end of the patch, and you probably want to answer them before sending a new version so that a proper discussion can take place, especially if the reviewer doesn't quite see what you're aiming for. I'll comment on the patch separately. Hope this helps, M. -- Without deviation from the norm, progress is not possible.
[PATCH v8 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/igbvf/netdev.c | 16 +--- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c index e2b7502..d9f186a 100644 --- a/drivers/net/ethernet/intel/igbvf/netdev.c +++ b/drivers/net/ethernet/intel/igbvf/netdev.c @@ -246,12 +246,6 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring *rx_ring, else i--; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). - */ - wmb(); writel(i, adapter->hw.hw_addr + rx_ring->tail); } } @@ -2289,16 +2283,16 @@ static inline void igbvf_tx_queue_adv(struct igbvf_adapter *adapter, } tx_desc->read.cmd_type_len |= cpu_to_le32(adapter->txd_cmd); - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). + + /* We use this memory barrier to make certain all of the +* status bits have been updated before next_to_watch is +* written. 
*/ wmb(); tx_ring->buffer_info[first].next_to_watch = tx_desc; tx_ring->next_to_use = i; - writel(i, adapter->hw.hw_addr + tx_ring->tail); + writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems */ -- 2.7.4
[PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel() in multiple places. writel() already has a barrier on some architectures like arm64. This ends up with the CPU observing two barriers back to back before executing the register write. Since the code already has an explicit barrier call, change writel() to writel_relaxed(). I did a regex search for wmb() followed by writel() in each driver's directory. I scrubbed the ones I care about in this series. I considered "ease of change", "popular usage" and "performance critical path" as the determining criteria for my filtering. We have used the relaxed API heavily on ARM for a long time, but it did not exist on other architectures. For this reason, weakly-ordered architectures have been paying a double penalty in order to use the common drivers. Now that the relaxed API is present on all architectures, we can go and scrub all drivers to see what needs to change and what can remain. We start with the most used ones and hope to increase the coverage over time. It will take a while to cover all drivers. Feel free to apply patches individually. Changes since v7: * API clarification with Linus and several architecture folks regarding writel() https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html * removed wmb() in front of writel() at several places. * removed old IA64 comments regarding ordering. 
Sinan Kaya (7): i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs ixgbe: eliminate duplicate barriers on weakly-ordered archs igbvf: eliminate duplicate barriers on weakly-ordered archs igb: eliminate duplicate barriers on weakly-ordered archs fm10k: Eliminate duplicate barriers on weakly-ordered archs ixgbevf: keep writel() closer to wmb() ixgbevf: eliminate duplicate barriers on weakly-ordered archs drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++--- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 -- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +-- drivers/net/ethernet/intel/igb/igb_main.c | 14 ++-- drivers/net/ethernet/intel/igbvf/netdev.c | 16 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 - drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 --- 8 files changed, 30 insertions(+), 100 deletions(-) -- 2.7.4
[GIT PULL] remove in-kernel calls to syscalls
Linus, please pull the following changes since commit 0c8efd610b58cb23cefdfa12015799079aef94ae: Linux 4.16-rc5 (2018-03-11 17:25:09 -0700) which are available in the Git repository at: https://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux.git syscalls-next up to commit c9a211951c7c79cfb5de888d7d9550872868b086: bpf: whitelist all syscalls for error injection (2018-04-02 20:16:21 +0200) to remove all in-kernel calls to syscalls except from arch/ . Since the last time I sent the patches out for review,[*] I have solely added a few more ACKs. Jon Corbet raised the question whether the documentation really should go to Documentation/process/adding-syscalls.rst and not to Documentation/process/coding-style.rst (even though, as he said, that isn't quite right either). As most of the existing instances where syscalls were called in the kernel were (1) common codepaths for old and new syscalls, (2) common codepaths for native and compat syscalls, and (3) syscall multiplexers like sys_ipc(), I have kept it at the former location for the time being, but will be happy to submit a follow-up patch to move the documentation bits to a different file. [*] lkml.kernel.org/r/20180329112426.23043-1-li...@dominikbrodowski.net All these patches have been in -next, but got rebased a few minutes ago to include another ACK in patch 2/109 (no code changes). There were/are a few trivial conflicts against the net, sparc and vfs trees, but not (yet) against what is in your tree up to commit 86bbbebac1933e6e95e8234c4f7d220c5ddd38bc. Thanks, Dominik System calls are interaction points between userspace and the kernel. Therefore, system call functions such as sys_xyzzy() or compat_sys_xyzzy() should only be called from userspace via the syscall table, but not from elsewhere in the kernel. 
At least on 64-bit x86, it will likely be a hard requirement from v4.17 onwards to not call system call functions in the kernel: It is better to use a different calling convention for system calls there, where struct pt_regs is decoded on-the-fly in a syscall wrapper which then hands processing over to the actual syscall function. This means that only those parameters which are actually needed for a specific syscall are passed on during syscall entry, instead of filling in six CPU registers with random user space content all the time (which may cause serious trouble down the call chain). Those x86-specific patches will be pushed through the x86 tree in the near future. Moreover, rules on how data may be accessed may differ between kernel data and user data. This is another reason why calling sys_xyzzy() is generally a bad idea, and -- at most -- acceptable in arch-specific code. This patchset removes all in-kernel calls to syscall functions in the kernel with the exception of arch/. On top of this, it cleans up the three places where many syscalls are referenced or prototyped, namely kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h. First goes a patch which defines the goal and explains the rationale: syscalls: define and explain goal to not call syscalls in the kernel A few codepaths can trivially be converted to existing in-kernel interfaces: kernel: use kernel_wait4() instead of sys_wait4() kernel: open-code sys_rt_sigpending() in sys_sigpending() kexec: call do_kexec_load() in compat syscall directly mm: use do_futex() instead of sys_futex() in mm_release() x86: use _do_fork() in compat_sys_x86_clone() x86: remove compat_sys_x86_waitpid() Then follow many patches which only affect specific subsystems each, and replace sys_*() with internal helpers named __sys_*() or do_sys_*(). 
Let's start with net/: net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall net: socket: add __sys_sendto() helper; remove in-kernel call to syscall net: socket: add __sys_accept4() helper; remove in-kernel call to syscall net: socket: add __sys_socket() helper; remove in-kernel call to syscall net: socket: add __sys_bind() helper; remove in-kernel call to syscall net: socket: add __sys_connect() helper; remove in-kernel call to syscall net: socket: add __sys_listen() helper; remove in-kernel call to syscall net: socket: add __sys_getsockname() helper; remove in-kernel call to syscall net: socket: add __sys_getpeername() helper; remove in-kernel call to syscall net: socket: add __sys_socketpair() helper; remove in-kernel call to syscall net: socket: add __sys_shutdown() helper; remove in-kernel call to syscall net: socket: add __sys_setsockopt() helper; remove in-kernel call to syscall net: socket: add __sys_getsockopt() helper; remove in-kernel call to syscall net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to syscall net: socket: move check for forbid_cmsg_compat to __sys_...msg() net: socket: replace calls to sys_send() with __sys_sendto() net:
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
Hello, On Mon, 2 Apr 2018, Vincent Bernat wrote: > Based on Google's Maglev algorithm [1][2], this scheduler builds a > lookup table in a way that minimizes disruption when a change > occurs. This helps in case of an active/active setup without > synchronization. Like for classic source hashing, this lookup table is > used to assign connections to a real server. > > Both source address and port are used to compute the hash (unlike sh > where this is optional). > > Weights are correctly handled. Unlike sh, servers with a weight of 0 > are considered as absent. Also, unlike sh, when a server becomes > unavailable due to a threshold, no fallback is possible: doing so > would seriously impair the usefulness of using a consistent hash. > > There is a small hack to detect when all real servers have a weight of > 0. It relies on the fact that it is not possible for the weight of a real > server to change during the execution of the assignment. I believe > this is the case as modifications through netlink are subject to a > mutex, but the use of atomic_read() is unsettling. > > The value of 65537 for the hash table size is currently not modifiable > at compile-time. This is the value suggested in the Maglev > paper. Another possible value is 257 (for small tests) and 655373 (for > very large setups). 
> > [1]: https://research.google.com/pubs/pub44824.html > [2]: > https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/ Sorry to say it, but maybe you missed the discussion on lvs-devel about the new MH scheduler implemented by Inju Song: https://www.spinics.net/lists/lvs-devel/msg04928.html http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html In the last 6 months we fixed all issues and I acked v4 just yesterday: http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html This scheduler supports: - tables with different size (prime): IP_VS_MH_TAB_INDEX - gcd of weights: ip_vs_mh_gcd_weight - shifted weights: ip_vs_mh_shift_weight - weight can be changed any time > Signed-off-by: Vincent Bernat> --- > include/net/ip_vs.h| 27 > net/netfilter/ipvs/Kconfig | 13 ++ > net/netfilter/ipvs/Makefile| 1 + > net/netfilter/ipvs/ip_vs_csh.c | 339 > + > net/netfilter/ipvs/ip_vs_sh.c | 32 +--- > 5 files changed, 381 insertions(+), 31 deletions(-) > create mode 100644 net/netfilter/ipvs/ip_vs_csh.c Regards -- Julian Anastasov
Re: [PATCH 15/15] ARM: pxa: change SSP DMA channels allocation
Hi Robert, I love your patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v4.16] [cannot apply to arm-soc/for-next next-20180329] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Robert-Jarzmik/ARM-pxa-switch-to-DMA-slave-maps/20180402-233029 config: arm-allmodconfig (attached as .config) compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=arm All errors (new ones prefixed by >>): >> arch/arm/plat-pxa/ssp.c:19:10: fatal error: mach/audio.h: No such file or directory #include <mach/audio.h> ^~~~~~~~~~~~~~ compilation terminated. vim +19 arch/arm/plat-pxa/ssp.c > 19 #include <mach/audio.h> 20 #include 21 #include 22 #include 23 #include 24 #include 25 #include 26 #include 27 #include 28 #include 29 #include 30 #include 31 #include 32 #include 33 #include 34 #include 35 #include 36 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [net-next 0/2] Add promiscous mode support in k2g network driver
On 04/02/2018 12:28 PM, David Miller wrote: > From: Murali Karicheri> Date: Mon, 2 Apr 2018 12:17:17 -0400 > >> This patch adds support for promiscuous mode in network driver for K2G >> SoC. This depends on v3 of my series at >> https://www.spinics.net/lists/kernel/msg2765942.html > > The net-next tree is closed, please resubmit this series after the merge > window when the net-next tree is opened back up. > > Thank you. > Sure! As I have indicated in my cover letter, I will fold this into the other series and re-submit it when the merge window opens. Thanks -- Murali Karicheri Linux Kernel, Keystone
Re: [net-next 2/2] net: netcp: ethss: k2g: add promiscuous mode support
Andrew, Thanks for reviewing this! On 04/02/2018 12:47 PM, Andrew Lunn wrote: > On Mon, Apr 02, 2018 at 12:17:19PM -0400, Murali Karicheri wrote: >> +static int gbe_set_rx_mode(void *intf_priv, bool promisc) >> +{ >> +struct gbe_intf *gbe_intf = intf_priv; >> +struct gbe_priv *gbe_dev = gbe_intf->gbe_dev; >> +struct cpsw_ale *ale = gbe_dev->ale; >> +unsigned long timeout; >> +int i, ret = -ETIMEDOUT; >> + >> +/* Disable(1)/Enable(0) Learn for all ports (host is port 0 and >> + * slaves are port 1 and up >> + */ >> +for (i = 0; i <= gbe_dev->num_slaves; i++) { >> +cpsw_ale_control_set(ale, i, >> + ALE_PORT_NOLEARN, !!promisc); >> +cpsw_ale_control_set(ale, i, >> + ALE_PORT_NO_SA_UPDATE, !!promisc); >> +} > > Hi Murali > > Does this mean that in promisc mode, switching of frames between ports > in hardware is disabled? You are relying on the software bridge to > perform such bridging between ports? The K2G switch has only one slave port. The other port is the host port. So there is no switching applicable here. At the egress, the driver provides the frame with a PS_FLAG to indicate which port the frame is forwarded to, and in the ingress direction it forwards the received frame to the host port, which is the only other port in a K2G switch (2u). Implementing promiscuous mode requires the ALE to be enabled, and takes advantage of the ALE feature that floods all received unicast frames to the host port. In non-promiscuous mode, that feature is disabled, so only frames with a destination MAC address match are forwarded. For other K2 devices that have more than one port available, what you say is applicable. However, we have not implemented the switch mode on these devices with multiple ports and don't plan to do so in the future, as this device line is already mature and adding the feature at this point doesn't make much sense. The drivers on these devices currently bypass the ALE and implement plain Ethernet interfaces (n port) for Ethernet connectivity. 
> > You might want to look at skb->offload_fwd_mark. By setting this, you > can tell the software bridge the hardware has already bridged the > frame. You might then be able to have promisc enabled, and the > hardware still doing the forwarding. Yes, if we decide to support switch mode for K2 devices, I will certainly look at this and add support as you have suggested. > >Andrew > -- Murali Karicheri Linux Kernel, Keystone
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On Mon, Apr 02, 2018 at 12:09:44PM -0600, David Ahern wrote: > On 4/2/18 12:03 PM, John Fastabend wrote: > > > > Can the above be a normal BPF helper that returns an > > ifindex? Then something roughly like this pattern would > > work for all drivers with redirect support, > > > > > > route_ifindex = ip_route_lookup(__daddr, ) > > if (!route_ifindex) > >return do_foo() > > return xdp_redirect(route_ifindex); > > > > So my suggestion is, > > > > 1. enable veth xdp (including redirect support) > > 2. add a helper to lookup route from routing table > > > > Alternatively you can skip step (2) and encode the routing > > table in BPF directly. Maybe we need a more efficient data > > structure but that should also work. > > > > That's what I have here: > > https://github.com/dsahern/linux/commit/bab42f158c0925339f7519df7fb2cde8eac33aa8 Was wondering what's up with the delay and when you are going to submit them officially... The use case came up several times.
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On 4/2/18 12:03 PM, John Fastabend wrote: > > Can the above be a normal BPF helper that returns an > ifindex? Then something roughly like this pattern would > work for all drivers with redirect support, > > > route_ifindex = ip_route_lookup(__daddr, ) > if (!route_ifindex) >return do_foo() > return xdp_redirect(route_ifindex); > > So my suggestion is, > > 1. enable veth xdp (including redirect support) > 2. add a helper to lookup route from routing table > > Alternatively you can skip step (2) and encode the routing > table in BPF directly. Maybe we need a more efficient data > structure but that should also work. > That's what I have here: https://github.com/dsahern/linux/commit/bab42f158c0925339f7519df7fb2cde8eac33aa8 And Jesper has done some measurements showing a 400% improvement in throughput. I have not had time to come back to the xdp forwarding set. It needs to handle vlan devices at a minimum before I send an RFC.
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On 04/01/2018 05:47 PM, Md. Islam wrote: > This patch implements IPv4 forwarding on xdp_buff. I added a new > config option XDP_ROUTER. Kernel would forward packets through fast > path when this option is enabled. But it would require driver support. > Currently it only works with veth. Here I have modified veth such that > it outputs xdp_buff. I created a testbed in Mininet. The Mininet > script (topology.py) is attached. Here the topology is: > > h1 -r1-h2 (r1 acts as a router) > > This patch improves the throughput from 53.8Gb/s to 60Gb/s on my > machine. Median RTT also improved from around .055 ms to around .035 > ms. > > Then I disabled hyperthreading and cpu frequency scaling in order to > utilize CPU cache (DPDK also utilizes CPU cache to improve > forwarding). This further improves per-packet forwarding latency from > around 400ns to 200 ns. More specifically, header parsing and fib > lookup only takes around 82 ns. This shows that this could be used to > implement linerate packet forwarding in kernel. > > The patch has been generated on 4.15.0+. Please let me know your > feedback and suggestions. Please feel free to let me know if this > approach makes sense. Makes sense, although let's try to avoid hard-coded routing in the XDP xmit routines. See details below. > +#ifdef CONFIG_XDP_ROUTER > +int veth_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) > +{ This is nice, but instead of building a new config_xdp_router, just enable standard XDP for veth plus a new helper call to do routing. Then it will be immediately usable from any XDP-enabled device.
> +struct veth_priv *priv = netdev_priv(dev); > +struct net_device *rcv; > +struct ethhdr *ethh; > +struct sk_buff *skb; > +int length = xdp->data_end - xdp->data; > + > +rcu_read_lock(); > +rcv = rcu_dereference(priv->peer); > +if (unlikely(!rcv)) { > +kfree(xdp); > +goto drop; > +} > + > +/* Update MAC address and checksum */ > +ethh = eth_hdr_xdp(xdp); > +ether_addr_copy(ethh->h_source, dev->dev_addr); > +ether_addr_copy(ethh->h_dest, rcv->dev_addr); > + > +/* if IP forwarding is enabled on the receiver, > + * call xdp_router_forward() > + */ > +if (is_forwarding_enabled(rcv)) { > +prefetch_xdp(xdp); > +if (likely(xdp_router_forward(rcv, xdp) == NET_RX_SUCCESS)) { > +struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats); > + > +u64_stats_update_begin(&stats->syncp); > +stats->bytes += length; > +stats->packets++; > +u64_stats_update_end(&stats->syncp); > +goto success; > +} > +} > + > +/* Local deliver */ > +skb = (struct sk_buff *)xdp->data_meta; > +if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) { > +struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats); > + > +u64_stats_update_begin(&stats->syncp); > +stats->bytes += length; > +stats->packets++; > +u64_stats_update_end(&stats->syncp); > +} else { > +drop: > +atomic64_inc(&priv->dropped); > +} > +success: > +rcu_read_unlock(); > +return NETDEV_TX_OK; > +} > +#endif > + > static const struct net_device_ops veth_netdev_ops = { > .ndo_init= veth_dev_init, > .ndo_open= veth_open, > @@ -290,6 +370,9 @@ static const struct net_device_ops veth_netdev_ops = { > .ndo_get_iflink= veth_get_iflink, > .ndo_features_check= passthru_features_check, > .ndo_set_rx_headroom= veth_set_rx_headroom, > +#ifdef CONFIG_XDP_ROUTER > +.ndo_xdp_xmit= veth_xdp_xmit, > +#endif > }; > [...] > +#ifdef CONFIG_XDP_ROUTER > +int ip_route_lookup(__be32 daddr, __be32 saddr, > + u8 tos, struct net_device *dev, > + struct fib_result *res); > +#endif > + Can the above be a normal BPF helper that returns an ifindex?
Then something roughly like this pattern would work for all drivers with redirect support, route_ifindex = ip_route_lookup(__daddr, ) if (!route_ifindex) return do_foo() return xdp_redirect(route_ifindex); So my suggestion is, 1. enable veth xdp (including redirect support) 2. add a helper to lookup route from routing table Alternatively you can skip step (2) and encode the routing table in BPF directly. Maybe we need a more efficient data structure but that should also work. Thanks, John
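John's pattern (look up an egress ifindex, then redirect) can be modeled outside the kernel. The sketch below is a plain userspace C toy: the FIB array, the `ip_route_lookup()` name, and the ifindex values are all made up for illustration. In a real XDP program the lookup would be a BPF helper and the redirect would be a call like xdp_redirect() above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of "lookup route, get ifindex".
 * A real implementation would consult the kernel FIB via a helper. */
struct route {
    uint32_t prefix;   /* IPv4 prefix, host byte order for the sketch */
    uint8_t  plen;     /* prefix length in bits */
    int      ifindex;  /* egress device; 0 would mean "no route" */
};

static const struct route fib[] = {
    { 0x0a000000, 8,  2 },   /* 10.0.0.0/8  -> ifindex 2 */
    { 0x0a010000, 16, 3 },   /* 10.1.0.0/16 -> ifindex 3 */
    { 0x00000000, 0,  1 },   /* default     -> ifindex 1 */
};

/* Longest-prefix match over the toy FIB, returning an ifindex. */
static int ip_route_lookup(uint32_t daddr)
{
    int best_len = -1, best_if = 0;

    for (unsigned i = 0; i < sizeof(fib) / sizeof(fib[0]); i++) {
        /* plen == 0 means "match everything"; avoid a 32-bit shift by 32 */
        uint32_t mask = fib[i].plen ? ~0u << (32 - fib[i].plen) : 0;

        if ((daddr & mask) == fib[i].prefix && fib[i].plen > best_len) {
            best_len = fib[i].plen;
            best_if = fib[i].ifindex;
        }
    }
    return best_if;
}
```

The XDP program would then pass the returned ifindex to the redirect action instead of hard-coding a peer device.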
[Patch net] af_unix: remove redundant lockdep class
After commit 581319c58600 ("net/socket: use per af lockdep classes for sk queues") sock queue locks now have per-af lockdep classes, including unix socket. It is no longer necessary to workaround it. I noticed this while looking at a syzbot deadlock report, this patch itself doesn't fix it (this is why I don't add Reported-by). Fixes: 581319c58600 ("net/socket: use per af lockdep classes for sk queues") Cc: Paolo Abeni Signed-off-by: Cong Wang --- net/unix/af_unix.c | 10 -- 1 file changed, 10 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 2d465bdeccbc..45971e173924 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -745,14 +745,6 @@ static struct proto unix_proto = { .obj_size = sizeof(struct unix_sock), }; -/* - * AF_UNIX sockets do not interact with hardware, hence they - * dont trigger interrupts - so it's safe for them to have - * bh-unsafe locking for their sk_receive_queue.lock. Split off - * this special lock-class by reinitializing the spinlock key: - */ -static struct lock_class_key af_unix_sk_receive_queue_lock_key; - static struct sock *unix_create1(struct net *net, struct socket *sock, int kern) { struct sock *sk = NULL; @@ -767,8 +759,6 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern) goto out; sock_init_data(sock, sk); - lockdep_set_class(&sk->sk_receive_queue.lock, - &af_unix_sk_receive_queue_lock_key); sk->sk_allocation = GFP_KERNEL_ACCOUNT; sk->sk_write_space = unix_write_space; -- 2.13.0
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
❦ 2 April 2018 10:33 -0700, Eric Dumazet: >> +static inline u32 >> +ip_vs_csh_permutation(struct ip_vs_dest *d, int j) >> +{ >> +u32 offset, skip; >> +__be32 addr_fold = d->addr.ip; >> + >> +#ifdef CONFIG_IP_VS_IPV6 >> +if (d->af == AF_INET6) >> +addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^ >> +d->addr.ip6[2]^d->addr.ip6[3]; >> +#endif >> +addr_fold = ntohl(addr_fold) + ntohs(d->port); >> +offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE; >> +skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1; >> +return (offset + j * skip) % IP_VS_CSH_TAB_SIZE; >> +} >> + > > This does not look very strong to me, particularly the IPv6 folding > > I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(), > even if it is hard coded ;) I can switch to ipv6_addr_hash(). However, switching to __ipv6_addr_jhash seems useless as I would need to hardcode the initial value: people use source hashing to get the same result from one host to another. Am I missing something? -- Each module should do one thing well. - The Elements of Programming Style (Kernighan & Plauger)
Re: [GIT] Networking
Sorry, this is a dup of the bug fix pull request from last week. I'll send you the right one.
Re: [bpf-next PATCH 4/4] bpf: sockmap, add hash map support
On 04/02/2018 08:54 AM, Alexei Starovoitov wrote: > On Sun, Apr 01, 2018 at 08:01:10AM -0700, John Fastabend wrote: >> Sockmap is currently backed by an array and enforces keys to be >> four bytes. This works well for many use cases and was originally >> modeled after devmap which also uses four bytes keys. However, >> this has become limiting in larger use cases where a hash would >> be more appropriate. For example users may want to use the 5-tuple >> of the socket as the lookup key. >> >> To support this add hash support. >> >> Signed-off-by: John Fastabend> > api looks good, but I think it came a bit too late for this release. > _nulls part you don't need for this hash. Few other nits: > Yeah no problem will push when {bpf|net}-next opens again. We also have some fields we need to access from sockmap programs as well such as ip addrs and ports. I'll respin the fixes (first two patches) for bpf-net. >> +static void htab_elem_free_rcu(struct rcu_head *head) >> +{ >> +struct htab_elem *l = container_of(head, struct htab_elem, rcu); >> + >> +/* must increment bpf_prog_active to avoid kprobe+bpf triggering while >> + * we're calling kfree, otherwise deadlock is possible if kprobes >> + * are placed somewhere inside of slub >> + */ >> +preempt_disable(); >> +__this_cpu_inc(bpf_prog_active); >> +kfree(l); >> +__this_cpu_dec(bpf_prog_active); >> +preempt_enable(); > > I don't think it's necessary. > Yep agreed. >> +static struct bpf_map *sock_hash_alloc(union bpf_attr *attr) >> +{ >> +struct bpf_htab *htab; >> +int i, err; >> +u64 cost; >> + >> +if (!capable(CAP_NET_ADMIN)) >> +return ERR_PTR(-EPERM); >> + >> +/* check sanity of attributes */ >> +if (attr->max_entries == 0 || >> +attr->map_flags & ~SOCK_CREATE_FLAG_MASK) >> +return ERR_PTR(-EINVAL); >> + >> +if (attr->value_size > KMALLOC_MAX_SIZE) >> +return ERR_PTR(-E2BIG); > > doesn't seem to match > + u32 fd = *(u32 *)value; > that is done later. > Yep. 
>> +static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, >> + u32 hash, void *key, u32 key_size) >> +{ >> +struct hlist_nulls_node *n; >> +struct htab_elem *l; >> + >> +hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) >> +if (l->hash == hash && !memcmp(&l->key, key, key_size)) >> +return l; > > if nulls is needed, there gotta be a comment explaining it. > Sure its not needed lets drop it. > please add tests for all methods. > I'll add these tests to test_maps. >> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c >> index f95fa67..2fa4cbb 100644 >> --- a/tools/bpf/bpftool/map.c >> +++ b/tools/bpf/bpftool/map.c >> @@ -67,6 +67,7 @@ >> [BPF_MAP_TYPE_DEVMAP] = "devmap", >> [BPF_MAP_TYPE_SOCKMAP] = "sockmap", >> [BPF_MAP_TYPE_CPUMAP] = "cpumap", >> +[BPF_MAP_TYPE_SOCKHASH] = "sockhash", >> }; >> >> static unsigned int get_possible_cpus(void) >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> index 9d07465..1a19450 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -115,6 +115,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_DEVMAP, >> BPF_MAP_TYPE_SOCKMAP, >> BPF_MAP_TYPE_CPUMAP, >> +BPF_MAP_TYPE_SOCKHASH, > > tools/* updates should be in separate commit. > Great, thanks for the review.
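As a side note on the 5-tuple use case mentioned above, a sockhash key would just be a small struct whose bytes are hashed. The sketch below is purely illustrative: the struct layout and the FNV-style hash are assumptions for the example, not the kernel's actual key format or hash function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 5-tuple key for a sockhash-style map. */
struct sock_key {
    uint32_t sip4;    /* source address */
    uint32_t dip4;    /* destination address */
    uint16_t sport;   /* source port */
    uint16_t dport;   /* destination port */
    uint8_t  proto;   /* IPPROTO_TCP / IPPROTO_UDP */
};

/* Toy FNV-1a-style mix over the fields.  Fields are mixed explicitly
 * rather than hashing raw struct bytes, so compiler-inserted padding
 * cannot leak into the result. */
static uint32_t key_hash(const struct sock_key *k)
{
    uint32_t h = 2166136261u;
    uint32_t parts[4] = {
        k->sip4, k->dip4,
        ((uint32_t)k->sport << 16) | k->dport,
        k->proto
    };

    for (int i = 0; i < 4; i++) {
        h ^= parts[i];
        h *= 16777619u;
    }
    return h;
}

/* Helper for exercising the hash with one varying field. */
static uint32_t hash_example(uint16_t dport)
{
    struct sock_key k = { .sip4 = 0x0a000001, .dip4 = 0x0a000002,
                          .sport = 40000, .dport = dport, .proto = 6 };
    return key_hash(&k);
}
```

The point is only that the key is an opaque fixed-size blob from the map's perspective; the map code hashes whatever bytes the program supplies.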
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
On 04/02/2018 10:20 AM, Vincent Bernat wrote: > +static inline u32 > +ip_vs_csh_permutation(struct ip_vs_dest *d, int j) > +{ > + u32 offset, skip; > + __be32 addr_fold = d->addr.ip; > + > +#ifdef CONFIG_IP_VS_IPV6 > + if (d->af == AF_INET6) > + addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^ > + d->addr.ip6[2]^d->addr.ip6[3]; > +#endif > + addr_fold = ntohl(addr_fold) + ntohs(d->port); > + offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE; > + skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1; > + return (offset + j * skip) % IP_VS_CSH_TAB_SIZE; > +} > + This does not look very strong to me, particularly the IPv6 folding I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(), even if it is hard coded ;)
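Eric's objection to the folding can be made concrete: xor-ing the four 32-bit words of an IPv6 address together collides for any two addresses whose words are permutations of each other. A minimal demonstration (userspace sketch; the word values are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

/* Fold an IPv6 address the way the patch does: xor of its four words.
 * Swapping any two words leaves the fold unchanged, so distinct
 * addresses trivially share the same hash input. */
static uint32_t ipv6_fold(const uint32_t w[4])
{
    return w[0] ^ w[1] ^ w[2] ^ w[3];
}

/* Two distinct addresses differing only by a swap of words 1 and 2. */
static const uint32_t addr_a[4] = { 0x20010db8u, 0x00000001u,
                                    0x00000002u, 0x00000003u };
static const uint32_t addr_b[4] = { 0x20010db8u, 0x00000002u,
                                    0x00000001u, 0x00000003u };
```

A jhash-style mix over all 128 bits, as Eric suggests, does not have this structural collision.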
[PATCH net-next v1] ipvs: add consistent source hashing scheduling
Based on Google's Maglev algorithm [1][2], this scheduler builds a lookup table in a way that minimizes disruption when a change occurs. This helps in case of an active/active setup without synchronization. Like for classic source hashing, this lookup table is used to assign connections to a real server. Both source address and port are used to compute the hash (unlike sh, where this is optional). Weights are correctly handled. Unlike sh, servers with a weight of 0 are considered absent. Also, unlike sh, when a server becomes unavailable due to a threshold, no fallback is possible: doing so would seriously impair the usefulness of using a consistent hash. There is a small hack to detect when all real servers have a weight of 0. It relies on the fact that it is not possible for the weight of a real server to change during the execution of the assignment. I believe this is the case as modifications through netlink are subject to a mutex, but the use of atomic_read() is unsettling. The value of 65537 for the hash table size is hardcoded and can currently only be changed at compile time. This is the value suggested in the Maglev paper. Another possible value is 257 (for small tests) and 655373 (for very large setups).
[1]: https://research.google.com/pubs/pub44824.html [2]: https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/ Signed-off-by: Vincent Bernat--- include/net/ip_vs.h| 27 net/netfilter/ipvs/Kconfig | 13 ++ net/netfilter/ipvs/Makefile| 1 + net/netfilter/ipvs/ip_vs_csh.c | 339 + net/netfilter/ipvs/ip_vs_sh.c | 32 +--- 5 files changed, 381 insertions(+), 31 deletions(-) create mode 100644 net/netfilter/ipvs/ip_vs_csh.c diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h index eb0bec043c96..2184b43b7320 100644 --- a/include/net/ip_vs.h +++ b/include/net/ip_vs.h @@ -195,6 +195,33 @@ static inline int ip_vs_addr_equal(int af, const union nf_inet_addr *a, return a->ip == b->ip; } +static inline __be16 ip_vs_get_port(const struct sk_buff *skb, struct ip_vs_iphdr *iph) +{ + __be16 _ports[2], *ports; + + /* At this point we know that we have a valid packet of some kind. +* Because ICMP packets are only guaranteed to have the first 8 +* bytes, let's just grab the ports. Fortunately they're in the +* same position for all three of the protocols we care about. +*/ + switch (iph->protocol) { + case IPPROTO_TCP: + case IPPROTO_UDP: + case IPPROTO_SCTP: + ports = skb_header_pointer(skb, iph->len, sizeof(_ports), + &_ports); + if (unlikely(!ports)) + return 0; + + if (likely(!ip_vs_iph_inverse(iph))) + return ports[0]; + else + return ports[1]; + default: + return 0; + } +} + #ifdef CONFIG_IP_VS_DEBUG #include diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig index b32fb0dbe237..bfd091e020af 100644 --- a/net/netfilter/ipvs/Kconfig +++ b/net/netfilter/ipvs/Kconfig @@ -225,6 +225,19 @@ config IP_VS_SH If you want to compile it in kernel, say Y. To compile it as a module, choose M here. If unsure, say N. 
+config IP_VS_CSH + tristate "consistent source hashing scheduling" + ---help--- + The consistent source hashing scheduling algorithm assigns + network connections to the servers through looking up a + statically assigned hash table by their source IP addresses + and ports. The hash table is built to minimize disruption + when a server becomes unavailable. It relies on the Maglev + algorithm. + + If you want to compile it in kernel, say Y. To compile it as a + module, choose M here. If unsure, say N. + config IP_VS_SED tristate "shortest expected delay scheduling" ---help--- diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile index c552993fa4b9..7d0badf86dfe 100644 --- a/net/netfilter/ipvs/Makefile +++ b/net/netfilter/ipvs/Makefile @@ -33,6 +33,7 @@ obj-$(CONFIG_IP_VS_LBLC) += ip_vs_lblc.o obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o +obj-$(CONFIG_IP_VS_CSH) += ip_vs_csh.o obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o diff --git a/net/netfilter/ipvs/ip_vs_csh.c b/net/netfilter/ipvs/ip_vs_csh.c new file mode 100644 index ..c70db61ba934 --- /dev/null +++ b/net/netfilter/ipvs/ip_vs_csh.c @@ -0,0 +1,339 @@ +/* + * IPVS:Consistent Hashing scheduling module using Google's Maglev + * + * Authors: Vincent Bernat + * + * This program is free software; you can redistribute it and/or + * modify it under
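For readers unfamiliar with Maglev, the table-population loop the patch description refers to can be sketched in a few lines of userspace C. This is a toy model only: the table size is 13 instead of 65537, there are three servers identified by a plain integer, and `toy_hash()` stands in for the kernel's hash_32() over the folded address and port.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define M    13   /* small prime for the sketch; the patch uses 65537 */
#define NSRV 3    /* number of real servers in the toy setup */

/* Any reasonable integer hash works for illustrating the loop. */
static uint32_t toy_hash(uint32_t x)
{
    x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16;
    return x;
}

/* perm(i, j) as in ip_vs_csh_permutation(): offset + j * skip (mod M).
 * Because M is prime and skip is in [1, M-1], each server's sequence
 * visits every slot. */
static uint32_t permutation(uint32_t id, int j)
{
    uint32_t offset = toy_hash(id) % M;
    uint32_t skip = toy_hash(id + 1) % (M - 1) + 1;
    return (offset + (uint32_t)j * skip) % M;
}

/* Maglev population: round-robin over servers, each claims the next
 * free slot of its own permutation, until every slot is owned. */
static void populate(int table[M])
{
    int next[NSRV] = {0};
    int filled = 0;

    memset(table, -1, M * sizeof(int));
    while (filled < M) {
        for (uint32_t i = 0; i < NSRV && filled < M; i++) {
            uint32_t c = permutation(i, next[i]);

            while (table[c] >= 0)
                c = permutation(i, ++next[i]);
            table[c] = (int)i;
            next[i]++;
            filled++;
        }
    }
}

/* Convenience for checking: how many slots does server s own? */
static int slots_owned(int s)
{
    static int table[M];
    static int done;
    int n = 0;

    if (!done) { populate(table); done = 1; }
    for (int k = 0; k < M; k++)
        if (table[k] == s)
            n++;
    return n;
}
```

The round-robin claim order is what gives the near-equal split (here 5/4/4 across three servers) and, with weights, the proportional split the patch description mentions.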
Re: [PATCH net 0/2] net: bgmac: Couple of sparse warnings
From: Florian Fainelli Date: Mon, 2 Apr 2018 10:17:36 -0700 > On 04/01/2018 07:21 PM, David Miller wrote: >> From: Florian Fainelli >> Date: Sun, 1 Apr 2018 10:26:28 -0700 >> >>> This patch series fixes a couple of warnings reported by sparse, should not >>> cause any functional problems since bgmac is typically used on LE platforms >>> anyway. >> >> Series applied, thanks Florian. > > I did not see this in net/master or included in your recent pull request > to Linus, not a big deal, but I just want to make sure this is not lost > somewhere ;) All changes go into net-next as I prepare for a merge window pull request.
Re: [PATCH net 0/2] net: bgmac: Couple of sparse warnings
On 04/01/2018 07:21 PM, David Miller wrote: > From: Florian Fainelli> Date: Sun, 1 Apr 2018 10:26:28 -0700 > >> This patch series fixes a couple of warnings reported by sparse, should not >> cause any functional problems since bgmac is typically used on LE platforms >> anyway. > > Series applied, thanks Florian. I did not see this in net/master or included in your recent pull request to Linus, not a big deal, but I just want to make sure this is not lost somewhere ;) -- Florian
[ANNOUNCE] iproute 4.16
Release of iproute2 for Linux 4.16

Latest version of the iproute2 utility, supporting new features in Linux 4.16. This release covers a wide range of small changes: lots of changes to bpf, vrf, devlink, flower, and rdma support, plus more changes to ss and JSON support enhancements.

The tarball can be downloaded from: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.16.0.tar.gz

The upstream repositories for the master and net-next branches are now split. The master branch is at: git://git.kernel.org/pub/scm/network/iproute2/iproute2.git and patches for the next release are in (master branch): git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git There are symlinks so that old paths still work.

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

---

Adam Vyskovsky (1):
      tc: fix an off-by-one error while printing tc actions

Alexander Alemayhu (4):
      man: add examples to ip.8
      man: fix man page warnings
      tc: bpf: add ppc64 and sparc64 to list of archs with eBPF support
      examples/bpf: update list of examples

Alexander Aring (5):
      tc: m_ife: allow ife type to zero
      tc: m_ife: print IEEE ethertype format
      tc: m_ife: report about kernels default type
      man: tc-ife: add default type note
      tc: m_ife: fix match tcindex parsing

Alexander Heinlein (1):
      ip/xfrm: Fix deleteall when having many policies installed

Alexander Zubkov (5):
      iproute: list/flush/save filter also by metric
      iproute: "list/flush/save default" selected all of the routes
      treat "default" and "all"/"any" addresses differenty
      treat "default" and "all"/"any" addresses differenty
      arrange prefix parsing code after redundant patches

Alexey Kodanev (1):
      fix typo in ip-xfrm man page, rmd610 -> rmd160

Amir Vadai (14):
      libnetlink: Introduce rta_getattr_be*()
      tc/cls_flower: Classify packet in ip tunnels
      tc/act_tunnel: Introduce ip tunnel action
      tc/pedit: Fix a typo in pedit usage message
      tc/pedit: Extend pedit to specify offset relative to mac/transport headers
      tc/pedit: Introduce 'add' operation
      tc/pedit: p_ip: introduce editing ttl header
      tc/pedit: Support fields bigger than 32 bits
      tc/pedit: p_eth: ETH header editor
      tc/pedit: p_tcp: introduce pedit tcp support
      pedit: Fix a typo in warning
      pedit: Do not allow using retain for too big fields
      pedit: Check for extended capability in protocol parser
      pedit: Introduce ipv6 support

Amritha Nambiar (4):
      tc/mqprio: Offload mode and shaper options in mqprio
      flower: Represent HW traffic classes as classid values
      man: tc-mqprio: add documentation for new offload options
      man: tc-flower: add explanation for hw_tc option

Andreas Henriksson (1):
      ss: fix help/man TCP-STATE description for listening

Antonio Quartulli (2):
      ss: fix crash when skipping disabled header field
      ss: fix NULL pointer access when parsing unix sockets with oldformat

Arkadi Sharshevsky (15):
      devlink: Change netlink attribute validation
      devlink: Add support for pipeline debug (dpipe)
      bridge: Distinguish between externally learned vs offloaded FDBs
      devlink: Make match/action parsing more flexible
      devlink: Add support for special format protocol headers
      devlink: Add support for protocol IPv4/IPv6/Ethernet special formats
      devlink: Ignore unknown attributes
      devlink: Change empty line indication with indentations
      devlink: mnlg: Add support for extended ack
      devlink: Add support for devlink resource abstraction
      devlink: Add support for hot reload
      devlink: Move dpipe context from heap to stack
      devlink: Add support for resource/dpipe relation
      devlink: Update man pages and add resource man
      devlink: Fix error reporting

Asbjørn Sloth Tønnesen (2):
      testsuite: refactor kernel config search
      testsuite: search for kernel config in /boot

Baruch Siach (3):
      tc: add missing limits.h header
      ip: include libc headers first
      lib: fix multiple strlcpy definition

Benjamin LaHaise (2):
      f_flower: don't set TCA_FLOWER_KEY_ETH_TYPE for "protocol all"
      tc: flower: support for matching MPLS labels

Boris Pismenny (1):
      ip xfrm: Add xfrm state crypto offload

Casey Callendrello (1):
      netns: make /var/run/netns bind-mount recursive

Chris Mi (3):
      tc: fix command "tc actions del" hang issue
      lib/libnetlink: Add a new function rtnl_talk_iov
      tc: Add batchsize feature for filter and actions

Christian Brauner (1):
      netns: allow negative nsid

Christian Ehrhardt (2):
      tests: read limited amount from /dev/urandom
      tests: make sure rand_dev suffix has 6 chars

Christoph Paasch (1):
      ip: add fastopen_no_cookie option to ip route

Craig Gallek (2):
      gre6: fix copy/paste bugs in GREv6 attribute manipulation
      iplink: Expose IFLA_*_FWMARK attributes
Re: SO_TCP_NODELAY implementation in TCP stack
On Sun, Apr 1, 2018 at 4:06 AM Naruto Nguyen wrote: > Hello everyone, > As I know we have a socket option SO_TCP_NODELAY to disable Nagle > Algorithm, and I found it is implemented in TCP/IP stack at > https://elixir.bootlin.com/linux/v4.4.90/source/net/ipv4/tcp.c#L2401 . > However, I do not know where the source code the Nagle Algorithm is > implemented in kernel. If you know, could you please help me? The Linux TCP code for Nagle's algorithm is largely in tcp_output.c in the following functions: tcp_nagle_check(), tcp_minshall_check(), tcp_minshall_update() The variant of the Nagle algorithm that Linux implements is described here: https://tools.ietf.org/html/draft-minshall-nagle-00 Basically, the algorithm avoids having more than one outstanding unacknowledged packet that is less than one MSS in size. The Nagle algorithm is disabled using the TCP_NODELAY socket option. neal
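The send decision Neal describes can be condensed into a small predicate. This is a simplified model only, not the kernel's actual code path, which tracks the outstanding small segment via sequence numbers in tcp_minshall_check():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minshall-Nagle send test, modeled as a pure predicate: a sub-MSS
 * segment may go out only if no other sub-MSS segment is still
 * unacknowledged, unless Nagle is disabled via TCP_NODELAY. */
static bool nagle_allows_send(size_t payload, size_t mss,
                              bool small_unacked, bool nodelay)
{
    if (nodelay)
        return true;        /* TCP_NODELAY disables the algorithm */
    if (payload >= mss)
        return true;        /* full-sized segments always go out */
    return !small_unacked;  /* at most one small segment in flight */
}
```

The effect is that many small application writes coalesce into one segment per round-trip instead of one segment per write.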
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On Sun, 1 Apr 2018 20:47:28 -0400 "Md. Islam" wrote: > This patch implements IPv4 forwarding on xdp_buff. I added a new > config option XDP_ROUTER. Kernel would forward packets through fast > path when this option is enabled. But it would require driver support. > Currently it only works with veth. Here I have modified veth such that > it outputs xdp_buff. I created a testbed in Mininet. The Mininet > script (topology.py) is attached. Here the topology is: Having XDP routing would be great. The solution you have chosen, changing each driver, does not scale well, since it requires lots of changes and is not ready for upstream.
Re: [net-next 2/2] net: netcp: ethss: k2g: add promiscuous mode support
On Mon, Apr 02, 2018 at 12:17:19PM -0400, Murali Karicheri wrote: > +static int gbe_set_rx_mode(void *intf_priv, bool promisc) > +{ > + struct gbe_intf *gbe_intf = intf_priv; > + struct gbe_priv *gbe_dev = gbe_intf->gbe_dev; > + struct cpsw_ale *ale = gbe_dev->ale; > + unsigned long timeout; > + int i, ret = -ETIMEDOUT; > + > + /* Disable(1)/Enable(0) Learn for all ports (host is port 0 and > + * slaves are port 1 and up > + */ > + for (i = 0; i <= gbe_dev->num_slaves; i++) { > + cpsw_ale_control_set(ale, i, > + ALE_PORT_NOLEARN, !!promisc); > + cpsw_ale_control_set(ale, i, > + ALE_PORT_NO_SA_UPDATE, !!promisc); > + } Hi Murali Does this mean that in promisc mode, switching of frames between ports in hardware is disabled? You are relying on the software bridge to perform such bridging between ports? You might want to look at skb->offload_fwd_mark. By setting this, you can tell the software bridge the hardware has already bridged the frame. You might then be able to have promisc enabled, and the hardware still doing the forwarding. Andrew
Re: [PATCH 12/15] dmaengine: pxa: make the filter function internal
Hi Robert,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.16]
[cannot apply to arm-soc/for-next next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Robert-Jarzmik/ARM-pxa-switch-to-DMA-slave-maps/20180402-233029
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__

sparse warnings: (new ones prefixed by >>)

   drivers/mtd/nand/marvell_nand.c:2621:17: sparse: undefined identifier 'pxad_filter_fn'
>> drivers/mtd/nand/marvell_nand.c:2621:17: sparse: call with no type!
   In file included from drivers/mtd/nand/marvell_nand.c:21:0:
   drivers/mtd/nand/marvell_nand.c: In function 'marvell_nfc_init_dma':
   drivers/mtd/nand/marvell_nand.c:2621:42: error: 'pxad_filter_fn' undeclared (first use in this function); did you mean 'dma_filter_fn'?
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^
   drivers/mtd/nand/marvell_nand.c:2621:42: note: each undeclared identifier is reported only once for each function it appears in
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^

vim +2621 drivers/mtd/nand/marvell_nand.c

02f26ecf Miquel Raynal 2018-01-09  2588
02f26ecf Miquel Raynal 2018-01-09  2589  static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
02f26ecf Miquel Raynal 2018-01-09  2590  {
02f26ecf Miquel Raynal 2018-01-09  2591  	struct platform_device *pdev = container_of(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2592  						    struct platform_device,
02f26ecf Miquel Raynal 2018-01-09  2593  						    dev);
02f26ecf Miquel Raynal 2018-01-09  2594  	struct dma_slave_config config = {};
02f26ecf Miquel Raynal 2018-01-09  2595  	struct resource *r;
02f26ecf Miquel Raynal 2018-01-09  2596  	dma_cap_mask_t mask;
02f26ecf Miquel Raynal 2018-01-09  2597  	struct pxad_param param;
02f26ecf Miquel Raynal 2018-01-09  2598  	int ret;
02f26ecf Miquel Raynal 2018-01-09  2599
02f26ecf Miquel Raynal 2018-01-09  2600  	if (!IS_ENABLED(CONFIG_PXA_DMA)) {
02f26ecf Miquel Raynal 2018-01-09  2601  		dev_warn(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2602  			 "DMA not enabled in configuration\n");
02f26ecf Miquel Raynal 2018-01-09  2603  		return -ENOTSUPP;
02f26ecf Miquel Raynal 2018-01-09  2604  	}
02f26ecf Miquel Raynal 2018-01-09  2605
02f26ecf Miquel Raynal 2018-01-09  2606  	ret = dma_set_mask_and_coherent(nfc->dev, DMA_BIT_MASK(32));
02f26ecf Miquel Raynal 2018-01-09  2607  	if (ret)
02f26ecf Miquel Raynal 2018-01-09  2608  		return ret;
02f26ecf Miquel Raynal 2018-01-09  2609
02f26ecf Miquel Raynal 2018-01-09  2610  	r = platform_get_resource(pdev, IORESOURCE_DMA, 0);
02f26ecf Miquel Raynal 2018-01-09  2611  	if (!r) {
02f26ecf Miquel Raynal 2018-01-09  2612  		dev_err(nfc->dev, "No resource defined for data DMA\n");
02f26ecf Miquel Raynal 2018-01-09  2613  		return -ENXIO;
02f26ecf Miquel Raynal 2018-01-09  2614  	}
02f26ecf Miquel Raynal 2018-01-09  2615
02f26ecf Miquel Raynal 2018-01-09  2616  	param.drcmr = r->start;
02f26ecf Miquel Raynal 2018-01-09  2617  	param.prio = PXAD_PRIO_LOWEST;
02f26ecf Miquel Raynal 2018-01-09  2618  	dma_cap_zero(mask);
02f26ecf Miquel Raynal 2018-01-09  2619  	dma_cap_set(DMA_SLAVE, mask);
02f26ecf Miquel Raynal 2018-01-09  2620  	nfc->dma_chan =
02f26ecf Miquel Raynal 2018-01-09 @2621  		dma_request_slave_channel_compat(mask, pxad_filter_fn,
02f26ecf Miquel Raynal 2018-01-09  2622  						 &param, nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2623  						 "data");
02f26ecf Miquel Raynal 2018-01-09  2624  	if (!nfc->dma_chan) {
02f26ecf Miquel Raynal 2018-01-09  2625  		dev_err(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2626  			"Unable to request data DMA channel\n");
02f26ecf Miquel Raynal 2018-01-09  2627  		return -ENODEV;
[GIT] Networking
1) Fix RCU locking in xfrm_local_error(), from Taehee Yoo.

2) Fix return value assignments and thus error checking in iwl_mvm_start_ap_ibss(), from Johannes Berg.

3) Don't count header length twice in vti4, from Stefano Brivio.

4) Fix deadlock in rt6_age_examine_exception, from Eric Dumazet.

5) Fix out-of-bounds access in nf_sk_lookup_slow{v4,v6}() from Subash Abhinov.

6) Check nladdr size in netlink_connect(), from Alexander Potapenko.

7) VF representor SQ numbers are 32 not 16 bits, in mlx5 driver, from Or Gerlitz.

8) Out of bounds read in skb_network_protocol(), from Eric Dumazet.

9) r8169 driver sets driver data pointer after register_netdev(), which is too late. Fix from Heiner Kallweit.

10) Fix memory leak in mlx4 driver, from Moshe Shemesh.

11) The multi-VLAN decap fix added a regression when dealing with devices that lack a MAC header, such as tun. Fix from Toshiaki Makita.

12) Fix integer overflow in dynamic interrupt coalescing code. From Tal Gilboa.

13) Use after free in vrf code, from David Ahern.

14) IPV6 route leak between VRFs fix, also from David Ahern.

Please pull, thanks a lot!
The following changes since commit f36b7534b83357cf52e747905de6d65b4f7c2512:

  Merge branch 'akpm' (patches from Andrew) (2018-03-22 18:48:43 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

for you to fetch changes up to e81b5e01c14add8395dfba7130f8829206bb507d:

  net: mvneta: fix enable of all initialized RXQs (2018-03-30 14:27:47 -0400)

Alexander Potapenko (1):
      netlink: make sure nladdr has correct size in netlink_connect()

Andrei Otcheretianski (2):
      iwlwifi: mvm: Increase session protection time after CS
      iwlwifi: mvm: Move unused phy's to a default channel

Avraham Stern (3):
      iwlwifi: mvm: clear tx queue id when unreserving aggregation queue
      iwlwifi: mvm: make sure internal station has a valid id
      iwlwifi: mvm: fix array out of bounds reference

Beni Lev (1):
      iwlwifi: mvm: Correctly set IGTK for AP

Colin Ian King (1):
      atm: iphase: fix spelling mistake: "Receiverd" -> "Received"

Cong Wang (1):
      llc: properly handle dev_queue_xmit() return value

Craig Dillabaugh (1):
      net sched actions: fix dumping which requires several messages to user space

Dave Watson (1):
      strparser: Fix sign of err codes

David Ahern (2):
      vrf: Fix use after free and double free in vrf_finish_output
      net/ipv6: Fix route leaking between VRFs

David Lebrun (1):
      ipv6: sr: fix seg6 encap performances with TSO enabled

David S. Miller (11):
      Merge branch 'mlxsw-GRE-mtu-changes'
      Merge git://git.kernel.org/.../pablo/nf
      Merge tag 'wireless-drivers-for-davem-2018-03-24' of git://git.kernel.org/.../kvalo/wireless-drivers
      Merge branch 'bond-hwaddr-sync-fixes'
      Merge tag 'batadv-net-for-davem-20180326' of git://git.open-mesh.org/linux-merge
      Merge tag 'mlx5-fixes-2018-03-23' of git://git.kernel.org/.../saeed/linux
      Merge branch 'mlx4-misc-fixes-for-4.16'
      Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
      ip_tunnel: Resolve ipsec merge conflict properly.
      Merge git://git.kernel.org/.../bpf/bpf
      Merge branch 'vlan-fix'

Emmanuel Grumbach (1):
      iwlwifi: mvm: set the correct tid when we flush the MCAST sta

Eran Ben Elisha (1):
      net/mlx4_en: Fix mixed PFC and Global pause user control requests

Eric Dumazet (2):
      ipv6: fix possible deadlock in rt6_age_examine_exception()
      net: fix possible out-of-bound read in skb_network_protocol()

Florian Westphal (3):
      netfilter: nf_tables: meter: pick a set backend that supports updates
      netfilter: nf_tables: permit second nat hook if colliding hook is going away
      netfilter: nf_tables: add missing netlink attrs to policies

Giuseppe Lippolis (1):
      net-usb: add qmi_wwan if on lte modem wistron neweb d18q1

Hans Wippel (1):
      net/ipv4: disable SMC TCP option with SYN Cookies

Heiner Kallweit (1):
      r8169: fix setting driver_data after register_netdev

Jakub Kicinski (2):
      tools: bpftool: don't use hex numbers in JSON output
      nfp: bpf: fix check of program max insn count

Jason Wang (3):
      vhost_net: add missing lock nesting notation
      vhost: correctly remove wait queue during poll failure
      vhost: validate log when IOTLB is enabled

Jianbo Liu (2):
      net/mlx5e: Don't override vport admin link state in switchdev mode
      net/mlx5e: Fix memory usage issues in offloading TC flows

Johannes Berg (1):
      iwlwifi: mvm: fix error checking for multi/broadcast sta

John Fastabend (1):
      net: sched, fix OOO packets with pfifo_fast

Kalle Valo (2):
      Merge tag 'iwlwifi-for-kalle-2018-03-16' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes
      Merge tag 'iwlwifi-for-kalle-2018-03-19' of
Re: [net-next 0/2] Add promiscous mode support in k2g network driver
From: Murali Karicheri
Date: Mon, 2 Apr 2018 12:17:17 -0400

> This patch adds support for promiscuous mode in network driver for K2G
> SoC. This depends on v3 of my series at
> https://www.spinics.net/lists/kernel/msg2765942.html

The net-next tree is closed, please resubmit this series after the merge
window when the net-next tree is opened back up.

Thank you.
Re: [PATCH 12/15] dmaengine: pxa: make the filter function internal
Hi Robert,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16]
[cannot apply to arm-soc/for-next next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Robert-Jarzmik/ARM-pxa-switch-to-DMA-slave-maps/20180402-233029
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   In file included from drivers/mtd/nand/marvell_nand.c:21:0:
   drivers/mtd/nand/marvell_nand.c: In function 'marvell_nfc_init_dma':
>> drivers/mtd/nand/marvell_nand.c:2621:42: error: 'pxad_filter_fn' undeclared (first use in this function); did you mean 'dma_filter_fn'?
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^
   drivers/mtd/nand/marvell_nand.c:2621:42: note: each undeclared identifier is reported only once for each function it appears in
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^

vim +2621 drivers/mtd/nand/marvell_nand.c

02f26ecf Miquel Raynal 2018-01-09  2588
02f26ecf Miquel Raynal 2018-01-09  2589  static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
02f26ecf Miquel Raynal 2018-01-09  2590  {
02f26ecf Miquel Raynal 2018-01-09  2591  	struct platform_device *pdev = container_of(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2592  						    struct platform_device,
02f26ecf Miquel Raynal 2018-01-09  2593  						    dev);
02f26ecf Miquel Raynal 2018-01-09  2594  	struct dma_slave_config config = {};
02f26ecf Miquel Raynal 2018-01-09  2595  	struct resource *r;
02f26ecf Miquel Raynal 2018-01-09  2596  	dma_cap_mask_t mask;
02f26ecf Miquel Raynal 2018-01-09  2597  	struct pxad_param param;
02f26ecf Miquel Raynal 2018-01-09  2598  	int ret;
02f26ecf Miquel Raynal 2018-01-09  2599
02f26ecf Miquel Raynal 2018-01-09  2600  	if (!IS_ENABLED(CONFIG_PXA_DMA)) {
02f26ecf Miquel Raynal 2018-01-09  2601  		dev_warn(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2602  			 "DMA not enabled in configuration\n");
02f26ecf Miquel Raynal 2018-01-09  2603  		return -ENOTSUPP;
02f26ecf Miquel Raynal 2018-01-09  2604  	}
02f26ecf Miquel Raynal 2018-01-09  2605
02f26ecf Miquel Raynal 2018-01-09  2606  	ret = dma_set_mask_and_coherent(nfc->dev, DMA_BIT_MASK(32));
02f26ecf Miquel Raynal 2018-01-09  2607  	if (ret)
02f26ecf Miquel Raynal 2018-01-09  2608  		return ret;
02f26ecf Miquel Raynal 2018-01-09  2609
02f26ecf Miquel Raynal 2018-01-09  2610  	r = platform_get_resource(pdev, IORESOURCE_DMA, 0);
02f26ecf Miquel Raynal 2018-01-09  2611  	if (!r) {
02f26ecf Miquel Raynal 2018-01-09  2612  		dev_err(nfc->dev, "No resource defined for data DMA\n");
02f26ecf Miquel Raynal 2018-01-09  2613  		return -ENXIO;
02f26ecf Miquel Raynal 2018-01-09  2614  	}
02f26ecf Miquel Raynal 2018-01-09  2615
02f26ecf Miquel Raynal 2018-01-09  2616  	param.drcmr = r->start;
02f26ecf Miquel Raynal 2018-01-09  2617  	param.prio = PXAD_PRIO_LOWEST;
02f26ecf Miquel Raynal 2018-01-09  2618  	dma_cap_zero(mask);
02f26ecf Miquel Raynal 2018-01-09  2619  	dma_cap_set(DMA_SLAVE, mask);
02f26ecf Miquel Raynal 2018-01-09  2620  	nfc->dma_chan =
02f26ecf Miquel Raynal 2018-01-09 @2621  		dma_request_slave_channel_compat(mask, pxad_filter_fn,
02f26ecf Miquel Raynal 2018-01-09  2622  						 &param, nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2623  						 "data");
02f26ecf Miquel Raynal 2018-01-09  2624  	if (!nfc->dma_chan) {
02f26ecf Miquel Raynal 2018-01-09  2625  		dev_err(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2626  			"Unable to request data DMA channel\n");
02f26ecf Miquel Raynal 2018-01-09  2627  		return -ENODEV;
02f26ecf Miquel Raynal 2018-01-09  2628  	}
02f26ecf Miquel Raynal 2018-01-09  2629
02f26ecf Miquel Raynal 2018-01-09  2630
RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
> -----Original Message-----
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, March 30, 2018 2:05 PM
> To: Tal Gilboa
> Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
>
> From: Tal Gilboa
>
> Add pcie_print_link_status().  This logs the current settings of the link
> (speed, width, and total available bandwidth).
>
> If the device is capable of more bandwidth but is limited by a slower
> upstream link, we include information about the link that limits the
> device's performance.
>
> The user may be able to move the device to a different slot for better
> performance.
>
> This provides a unified method for all PCI devices to report status and
> issues, instead of each device reporting in a different way, using
> different code.
>
> Signed-off-by: Tal Gilboa
> [bhelgaas: changelog, reword log messages, print device capabilities when
> not limited]
> Signed-off-by: Bjorn Helgaas
> ---
>  drivers/pci/pci.c   | 29 +
>  include/linux/pci.h |  1 +
>  2 files changed, 30 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e00d56b12747..cec7aed09f6b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
>  	return *width * PCIE_SPEED2MBS_ENC(*speed);
>  }
>
> +/**
> + * pcie_print_link_status - Report the PCI device's link speed and width
> + * @dev: PCI device to query
> + *
> + * Report the available bandwidth at the device.  If this is less than the
> + * device is capable of, report the device's maximum possible bandwidth and
> + * the upstream link that limits its performance to less than that.
> + */
> +void pcie_print_link_status(struct pci_dev *dev)
> +{
> +	enum pcie_link_width width, width_cap;
> +	enum pci_bus_speed speed, speed_cap;
> +	struct pci_dev *limiting_dev = NULL;
> +	u32 bw_avail, bw_cap;
> +
> +	bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> +	bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
> +
> +	if (bw_avail >= bw_cap)
> +		pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> +	else
> +		pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n",
> +			 bw_avail, PCIE_SPEED2STR(speed), width,
> +			 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> +}

Personally, I would make this last one a pci_warn() to indicate it at a
higher log level, but I'm ok with the wording, and if consensus is that
this should be at info, I'm ok with that.

Thanks,
Jake

> +EXPORT_SYMBOL(pcie_print_link_status);
> +
>  /**
>   * pci_select_bars - Make BAR mask from the type of resource
>   * @dev: the PCI device for which BAR mask is made
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index f2bf2b7a66c7..38f7957121ef 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1086,6 +1086,7 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
>  u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
>  			     enum pci_bus_speed *speed,
>  			     enum pcie_link_width *width);
> +void pcie_print_link_status(struct pci_dev *dev);
>  void pcie_flr(struct pci_dev *dev);
>  int __pci_reset_function_locked(struct pci_dev *dev);
>  int pci_reset_function(struct pci_dev *dev);
[net-next 2/2] net: netcp: ethss: k2g: add promiscuous mode support
From: WingMan Kwok

This patch adds support for promiscuous mode in k2g's network driver.
When the upper layer instructs to transition from non-promiscuous mode to
promiscuous mode or vice versa, the K2G network driver needs to configure
the ALE accordingly so that in non-promiscuous mode the ALE will not
flood all unicast packets to the host port, while in promiscuous mode it
will pass all received unicast packets to the host port.

Signed-off-by: WingMan Kwok
Signed-off-by: Murali Karicheri
---
 drivers/net/ethernet/ti/netcp_ethss.c | 56 +++
 1 file changed, 56 insertions(+)

diff --git a/drivers/net/ethernet/ti/netcp_ethss.c b/drivers/net/ethernet/ti/netcp_ethss.c
index f7af999..1ac2cd6 100644
--- a/drivers/net/ethernet/ti/netcp_ethss.c
+++ b/drivers/net/ethernet/ti/netcp_ethss.c
@@ -2771,6 +2771,61 @@ static inline int gbe_hwtstamp_set(struct gbe_intf *gbe_intf, struct ifreq *req)
 }
 #endif /* CONFIG_TI_CPTS */
 
+static int gbe_set_rx_mode(void *intf_priv, bool promisc)
+{
+	struct gbe_intf *gbe_intf = intf_priv;
+	struct gbe_priv *gbe_dev = gbe_intf->gbe_dev;
+	struct cpsw_ale *ale = gbe_dev->ale;
+	unsigned long timeout;
+	int i, ret = -ETIMEDOUT;
+
+	/* Disable(1)/Enable(0) Learn for all ports (host is port 0 and
+	 * slaves are port 1 and up
+	 */
+	for (i = 0; i <= gbe_dev->num_slaves; i++) {
+		cpsw_ale_control_set(ale, i,
+				     ALE_PORT_NOLEARN, !!promisc);
+		cpsw_ale_control_set(ale, i,
+				     ALE_PORT_NO_SA_UPDATE, !!promisc);
+	}
+
+	if (!promisc) {
+		/* Don't Flood All Unicast Packets to Host port */
+		cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 0);
+		dev_vdbg(gbe_dev->dev, "promiscuous mode disabled\n");
+		return 0;
+	}
+
+	timeout = jiffies + HZ;
+
+	/* Clear All Untouched entries */
+	cpsw_ale_control_set(ale, 0, ALE_AGEOUT, 1);
+	do {
+		cpu_relax();
+		if (cpsw_ale_control_get(ale, 0, ALE_AGEOUT)) {
+			ret = 0;
+			break;
+		}
+
+	} while (time_after(timeout, jiffies));
+
+	/* Make sure it is not a false timeout */
+	if (ret && !cpsw_ale_control_get(ale, 0, ALE_AGEOUT))
+		return ret;
+
+	cpsw_ale_control_set(ale, 0, ALE_AGEOUT, 1);
+
+	/* Clear all mcast from ALE */
+	cpsw_ale_flush_multicast(ale,
+				 GBE_PORT_MASK(gbe_dev->ale_ports),
+				 -1);
+
+	/* Flood All Unicast Packets to Host port */
+	cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
+	dev_vdbg(gbe_dev->dev, "promiscuous mode enabled\n");
+	return ret;
+}
+
 static int gbe_ioctl(void *intf_priv, struct ifreq *req, int cmd)
 {
 	struct gbe_intf *gbe_intf = intf_priv;
@@ -3523,6 +3578,7 @@ static int gbe_probe(struct netcp_device *netcp_device, struct device *dev,
 		gbe_dev->max_num_slaves = 8;
 	} else if (of_device_is_compatible(node, "ti,netcp-gbe-2")) {
 		gbe_dev->max_num_slaves = 1;
+		gbe_module.set_rx_mode = gbe_set_rx_mode;
 	} else if (of_device_is_compatible(node, "ti,netcp-xgbe")) {
 		gbe_dev->max_num_slaves = 2;
 	} else {
-- 
1.9.1
[net-next 0/2] Add promiscous mode support in k2g network driver
This patch adds support for promiscuous mode in network driver for K2G
SoC. This depends on v3 of my series at
https://www.spinics.net/lists/kernel/msg2765942.html

I plan to fold this into the above series and submit again when the
net-next merge window opens. At this time, please review and let me know
if it looks good or needs any re-work. I would like to get this ready so
that it can be merged along with the above series.

The boot and promiscuous mode test logs are at
https://pastebin.ubuntu.com/p/XQCvFS3QZb/

WingMan Kwok (2):
  net: netcp: add api to support set rx mode in netcp modules
  net: netcp: ethss: k2g: add promiscuous mode support

 drivers/net/ethernet/ti/netcp.h       |  1 +
 drivers/net/ethernet/ti/netcp_core.c  | 19
 drivers/net/ethernet/ti/netcp_ethss.c | 56 +++
 3 files changed, 76 insertions(+)

-- 
1.9.1
[net-next 1/2] net: netcp: add api to support set rx mode in netcp modules
From: WingMan Kwok

This patch adds an API to support setting rx mode in netcp modules. If a
netcp module needs to be notified when the upper layer transitions from
one rx mode to another and react accordingly, such a module will
implement the new API set_rx_mode added in this patch. Currently the rx
modes supported are PROMISCUOUS and NON_PROMISCUOUS modes.

Signed-off-by: WingMan Kwok
Signed-off-by: Murali Karicheri
---
 drivers/net/ethernet/ti/netcp.h      |  1 +
 drivers/net/ethernet/ti/netcp_core.c | 19 +++
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/ti/netcp.h b/drivers/net/ethernet/ti/netcp.h
index 416f732..c4ffdf4 100644
--- a/drivers/net/ethernet/ti/netcp.h
+++ b/drivers/net/ethernet/ti/netcp.h
@@ -214,6 +214,7 @@ struct netcp_module {
 	int	(*add_vid)(void *intf_priv, int vid);
 	int	(*del_vid)(void *intf_priv, int vid);
 	int	(*ioctl)(void *intf_priv, struct ifreq *req, int cmd);
+	int	(*set_rx_mode)(void *intf_priv, bool promisc);
 
 	/* used internally */
 	struct list_head	module_list;
diff --git a/drivers/net/ethernet/ti/netcp_core.c b/drivers/net/ethernet/ti/netcp_core.c
index 736f6f7..e40aa3e 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1509,6 +1509,24 @@ static void netcp_addr_sweep_add(struct netcp_intf *netcp)
 	}
 }
 
+static int netcp_set_promiscuous(struct netcp_intf *netcp, bool promisc)
+{
+	struct netcp_intf_modpriv *priv;
+	struct netcp_module *module;
+	int error;
+
+	for_each_module(netcp, priv) {
+		module = priv->netcp_module;
+		if (!module->set_rx_mode)
+			continue;
+
+		error = module->set_rx_mode(priv->module_priv, promisc);
+		if (error)
+			return error;
+	}
+	return 0;
+}
+
 static void netcp_set_rx_mode(struct net_device *ndev)
 {
 	struct netcp_intf *netcp = netdev_priv(ndev);
@@ -1538,6 +1556,7 @@ static void netcp_set_rx_mode(struct net_device *ndev)
 	/* finally sweep and callout into modules */
 	netcp_addr_sweep_del(netcp);
 	netcp_addr_sweep_add(netcp);
+	netcp_set_promiscuous(netcp, promisc);
 
 	spin_unlock(&netcp->lock);
 }
-- 
1.9.1
Re: [PATCH 0/4] RFC: Realtek 83xx SMI driver core
Hi Linus,

did you make any progress with this?

I noticed that the Vodafone Easybox 904xdsl/904lte models both make use
of the RTL8367 switch. About one million of these routers have been
deployed in Germany. There is an OpenWrt fork at
https://github.com/Quallenauge/Easybox-904-XDSL/commits/master-lede
which depends on the out-of-tree patches which seem to be the basis for
your Realtek 83xx driver patches.

Having your Realtek 83xx patches in the upstream Linux kernel would help
tremendously in getting support for those router models merged in
OpenWrt.

Regards,
Carl-Daniel
RE: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
> -----Original Message-----
> From: Tal Gilboa [mailto:ta...@mellanox.com]
> Sent: Monday, April 02, 2018 7:34 AM
> To: Bjorn Helgaas
> Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
>
> On 4/2/2018 5:05 PM, Bjorn Helgaas wrote:
> > On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote:
> >> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:
> >>> On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:
> >>>> On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> >>>>> From: Tal Gilboa
> >>>>>
> >>>>> Add pcie_bandwidth_capable() to compute the max link bandwidth supported by
> >>>>> a device, based on the max link speed and width, adjusted by the encoding
> >>>>> overhead.
> >>>>>
> >>>>> The maximum bandwidth of the link is computed as:
> >>>>>
> >>>>>    max_link_speed * max_link_width * (1 - encoding_overhead)
> >>>>>
> >>>>> The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
> >>>>> encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
> >>>>> encoding.
> >>>>>
> >>>>> Signed-off-by: Tal Gilboa
> >>>>> [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
> >>>>> signatures, don't export outside drivers/pci]
> >>>>> Signed-off-by: Bjorn Helgaas
> >>>>> Reviewed-by: Tariq Toukan
> >>>>> ---
> >>>>>  drivers/pci/pci.c | 21 +
> >>>>>  drivers/pci/pci.h |  9 +
> >>>>>  2 files changed, 30 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >>>>> index 43075be79388..9ce89e254197 100644
> >>>>> --- a/drivers/pci/pci.c
> >>>>> +++ b/drivers/pci/pci.c
> >>>>> @@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev)
> >>>>>  	return PCIE_LNK_WIDTH_UNKNOWN;
> >>>>>  }
> >>>>>
> >>>>> +/**
> >>>>> + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth capability
> >>>>> + * @dev: PCI device
> >>>>> + * @speed: storage for link speed
> >>>>> + * @width: storage for link width
> >>>>> + *
> >>>>> + * Calculate a PCI device's link bandwidth by querying for its link speed
> >>>>> + * and width, multiplying them, and applying encoding overhead.
> >>>>> + */
> >>>>> +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
> >>>>> +			   enum pcie_link_width *width)
> >>>>> +{
> >>>>> +	*speed = pcie_get_speed_cap(dev);
> >>>>> +	*width = pcie_get_width_cap(dev);
> >>>>> +
> >>>>> +	if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
> >>>>> +		return 0;
> >>>>> +
> >>>>> +	return *width * PCIE_SPEED2MBS_ENC(*speed);
> >>>>> +}
> >>>>> +
> >>>>>  /**
> >>>>>   * pci_select_bars - Make BAR mask from the type of resource
> >>>>>   * @dev: the PCI device for which BAR mask is made
> >>>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> >>>>> index 66738f1050c0..2a50172b9803 100644
> >>>>> --- a/drivers/pci/pci.h
> >>>>> +++ b/drivers/pci/pci.h
> >>>>> @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
> >>>>>  	 (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
> >>>>>  	 "Unknown speed")
> >>>>>
> >>>>> +/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for gen3 */
> >>>>> +#define PCIE_SPEED2MBS_ENC(speed) \
> >>>>
> >>>> Missing gen4.
> >>>
> >>> I made it "gen3+".  I think that's accurate, isn't it?  The spec
> >>> doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2
> >>> says rates of 8 GT/s or higher (which I think includes gen3 and gen4)
> >>> use 128b/130b encoding.
> >>
> >> I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it
> >> wasn't added. Need to return 15754.
> >
> > Oh, duh, of course!  Sorry for being dense.  What about the following?
> > I included the calculation as opposed to just the magic numbers to try
> > to make it clear how they're derived.  This has the disadvantage of
> > truncating the result instead of rounding, but I doubt that's
> > significant in this context.  If it is, we could use the magic numbers
> > and put the computation in a comment.
>
> We can always use DIV_ROUND_UP((speed * enc_nominator),
> enc_denominator). I think this is confusing and since this introduces a
> bandwidth limit I would prefer to give a wider limit than a wrong one,
> even if it is by less than 1Mb/s. My vote is for leaving it as you wrote
> below.
>
> > Another question: we currently deal in
RE: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
> -----Original Message-----
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, March 30, 2018 2:06 PM
> To: Tal Gilboa
> Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
>
> From: Bjorn Helgaas
>
> Use pcie_print_link_status() to report PCIe link speed and possible
> limitations instead of implementing this in the driver itself.
>
> Note that pcie_get_minimum_link() can return misleading information because
> it finds the slowest link and the narrowest link without considering the
> total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
> 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
> corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
> about 2000 MB/s for a 16 GT/s x1 link.

This comment is about what's being fixed, so it would have been easier to
parse if it were written to more clearly indicate that we're removing (and
not adding) this behavior.

Aside from the commit message (which I don't feel strongly enough needs a
re-send of the patch) this looks good to me.

Acked-by: Jacob Keller

Thanks Bjorn and Tal for fixing this!

>
> Signed-off-by: Bjorn Helgaas
> ---
>  drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 87 --
>  1 file changed, 1 insertion(+), 86 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> index a434fecfdfeb..aa05fb534942 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> @@ -2120,91 +2120,6 @@ static int fm10k_sw_init(struct fm10k_intfc *interface,
>  	return 0;
>  }
>
> -static void fm10k_slot_warn(struct fm10k_intfc *interface)
> -{
> -	enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
> -	enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
> -	struct fm10k_hw *hw = &interface->hw;
> -	int max_gts = 0, expected_gts = 0;
> -
> -	if (pcie_get_minimum_link(interface->pdev, &speed, &width) ||
> -	    speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
> -		dev_warn(&interface->pdev->dev,
> -			 "Unable to determine PCI Express bandwidth.\n");
> -		return;
> -	}
> -
> -	switch (speed) {
> -	case PCIE_SPEED_2_5GT:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		max_gts = 2 * width;
> -		break;
> -	case PCIE_SPEED_5_0GT:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		max_gts = 4 * width;
> -		break;
> -	case PCIE_SPEED_8_0GT:
> -		/* 128b/130b encoding has less than 2% impact on throughput */
> -		max_gts = 8 * width;
> -		break;
> -	default:
> -		dev_warn(&interface->pdev->dev,
> -			 "Unable to determine PCI Express bandwidth.\n");
> -		return;
> -	}
> -
> -	dev_info(&interface->pdev->dev,
> -		 "PCI Express bandwidth of %dGT/s available\n",
> -		 max_gts);
> -	dev_info(&interface->pdev->dev,
> -		 "(Speed:%s, Width: x%d, Encoding Loss:%s, Payload:%s)\n",
> -		 (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
> -		  speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
> -		  speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
> -		  "Unknown"),
> -		 hw->bus.width,
> -		 (speed == PCIE_SPEED_2_5GT ? "20%" :
> -		  speed == PCIE_SPEED_5_0GT ? "20%" :
> -		  speed == PCIE_SPEED_8_0GT ? "<2%" :
> -		  "Unknown"),
> -		 (hw->bus.payload == fm10k_bus_payload_128 ? "128B" :
> -		  hw->bus.payload == fm10k_bus_payload_256 ? "256B" :
> -		  hw->bus.payload == fm10k_bus_payload_512 ? "512B" :
> -		  "Unknown"));
> -
> -	switch (hw->bus_caps.speed) {
> -	case fm10k_bus_speed_2500:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		expected_gts = 2 * hw->bus_caps.width;
> -		break;
> -	case fm10k_bus_speed_5000:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		expected_gts = 4 * hw->bus_caps.width;
> -		break;
> -	case fm10k_bus_speed_8000:
> -		/* 128b/130b encoding has less than 2% impact on throughput */
> -		expected_gts = 8 * hw->bus_caps.width;
> -		break;
> -	default:
> -		dev_warn(&interface->pdev->dev,
> -			 "Unable to
Re: [bpf-next PATCH 4/4] bpf: sockmap, add hash map support
On Sun, Apr 01, 2018 at 08:01:10AM -0700, John Fastabend wrote:
> Sockmap is currently backed by an array and enforces keys to be
> four bytes. This works well for many use cases and was originally
> modeled after devmap which also uses four bytes keys. However,
> this has become limiting in larger use cases where a hash would
> be more appropriate. For example users may want to use the 5-tuple
> of the socket as the lookup key.
>
> To support this add hash support.
>
> Signed-off-by: John Fastabend

api looks good, but I think it came a bit too late for this release.
_nulls part you don't need for this hash.
Few other nits:

> +static void htab_elem_free_rcu(struct rcu_head *head)
> +{
> +	struct htab_elem *l = container_of(head, struct htab_elem, rcu);
> +
> +	/* must increment bpf_prog_active to avoid kprobe+bpf triggering while
> +	 * we're calling kfree, otherwise deadlock is possible if kprobes
> +	 * are placed somewhere inside of slub
> +	 */
> +	preempt_disable();
> +	__this_cpu_inc(bpf_prog_active);
> +	kfree(l);
> +	__this_cpu_dec(bpf_prog_active);
> +	preempt_enable();

I don't think it's necessary.

> +static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_htab *htab;
> +	int i, err;
> +	u64 cost;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 ||
> +	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (attr->value_size > KMALLOC_MAX_SIZE)
> +		return ERR_PTR(-E2BIG);

doesn't seem to match
	u32 fd = *(u32 *)value;
that is done later.

> +static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head,
> +					 u32 hash, void *key, u32 key_size)
> +{
> +	struct hlist_nulls_node *n;
> +	struct htab_elem *l;
> +
> +	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> +		if (l->hash == hash && !memcmp(&l->key, key, key_size))
> +			return l;

if nulls is needed, there gotta be a comment explaining it.

please add tests for all methods.

> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index f95fa67..2fa4cbb 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -67,6 +67,7 @@
> 	[BPF_MAP_TYPE_DEVMAP]	= "devmap",
> 	[BPF_MAP_TYPE_SOCKMAP]	= "sockmap",
> 	[BPF_MAP_TYPE_CPUMAP]	= "cpumap",
> +	[BPF_MAP_TYPE_SOCKHASH]	= "sockhash",
> };
>
> static unsigned int get_possible_cpus(void)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 9d07465..1a19450 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -115,6 +115,7 @@ enum bpf_map_type {
> 	BPF_MAP_TYPE_DEVMAP,
> 	BPF_MAP_TYPE_SOCKMAP,
> 	BPF_MAP_TYPE_CPUMAP,
> +	BPF_MAP_TYPE_SOCKHASH,

tools/* updates should be in separate commit.
Re: [PATCH net-next] net: ipv6/gre: Add GRO support
On 02/04/2018 6:19 PM, Eric Dumazet wrote:
On 04/02/2018 08:00 AM, Eran Ben Elisha wrote:

Seems good, but why isn't this handled directly in GRO native layer?

ip6_tunnel and ip6_gre do not share initialization flow functions (unlike
ipv4). Changing the ipv6 init infrastructure should not be part of this
patch. We prefer to keep this one minimal, simple and safe.

Looking at gre_gro_receive() and gre_gro_complete() I could not see why
they could not be copied/pasted to IPv6.

These functions to handle GRO over GRE are already assigned in
gre_offload_init() (in net/ipv4/gre_offload.c under CONFIG_IPV6).
However, without initializing the gro_cells, the receive path will not go
via the napi_gro_receive path, but directly to netif_rx. So AFAIU, only
gcells->cells was missing for gro_cells_receive() to really go via the
GRO flow.

Maybe give more details on the changelog, it is really not obvious.

Hopefully the above filled this request.

Not really :/

So you're referring to the native interface. We thought you meant the
native IP module.

gro_cells_receive() is not really useful with native GRO, since the
packet is already a GRO packet by the time it reaches ip_tunnel_rcv() or
__ip6_tnl_rcv().

Right. If GRO on the native interface is ON, our patch doesn't help much.
The case we improve here is: native has GRO OFF, GRE has GRO ON. Before
this patch there were no GRO packets at all in this case, only MTU-sized
packets went up the stack.

Sure, it might be useful if native GRO (happening on eth0 if you prefer)
did not handle a particular encapsulation. Or it is turned OFF.

gro_cell was a workaround before we extended GRO to be able to decap some
tunnel headers.

It seems we have to extend this to also support GRE6.
Re: [BUG/Q] can_pernet_exit() leaves devices on dead net
Hi, Oliver,

On 02.04.2018 18:28, Oliver Hartkopp wrote:
> Hi Kirill, Marc,
>
> I checked the code once more and added some debug output to the other
> parts in CAN notifier code.
>
> In fact the code pointed to by both of you seems to be obsolete as I
> only wanted to be 'really sure' that no leftovers of the CAN filters
> remain at module unloading.
>
>> Yes, this one looks good:
>> https://marc.info/?l=linux-can=150169589119335=2
>>
>> Regards,
>> Kirill
>
> I was obviously too cautious ;-)
>
> All tests I made resulted in the function iterating through all the CAN
> netdevices doing exactly nothing.
>
> I'm fine with removing that stuff - but I'm not sure whether it's worth
> to push that patch to stable 4.12+ or even before 4.12 (without
> namespace support - and removing rcu_barrier() too).
>
> Any opinions?

I think the same -- it's not needed for stable as there is just an
iteration over an empty list, i.e., a noop.

Kirill
Re: [BUG/Q] can_pernet_exit() leaves devices on dead net
Hi Kirill, Marc,

I checked the code once more and added some debug output to the other
parts in CAN notifier code.

In fact the code pointed to by both of you seems to be obsolete as I only
wanted to be 'really sure' that no leftovers of the CAN filters remain at
module unloading.

> Yes, this one looks good:
> https://marc.info/?l=linux-can=150169589119335=2
>
> Regards,
> Kirill

I was obviously too cautious ;-)

All tests I made resulted in the function iterating through all the CAN
netdevices doing exactly nothing.

I'm fine with removing that stuff - but I'm not sure whether it's worth
to push that patch to stable 4.12+ or even before 4.12 (without namespace
support - and removing rcu_barrier() too).

Any opinions?

Best regards,
Oliver
Re: [PATCH net-next] bridge: Allow max MTU when multiple VLANs present
On Mon, Apr 2, 2018 at 11:08 AM, Roopa Prabhuwrote: > On Fri, Mar 30, 2018 at 12:54 PM, Chas Williams <3ch...@gmail.com> wrote: >> On Thu, Mar 29, 2018 at 9:02 PM, Toshiaki Makita >> wrote: >>> On 2018/03/30 1:49, Roopa Prabhu wrote: On Thu, Mar 22, 2018 at 9:53 PM, Roopa Prabhu wrote: > On Thu, Mar 22, 2018 at 8:34 AM, Chas Williams <3ch...@gmail.com> wrote: >> If the bridge is allowing multiple VLANs, some VLANs may have >> different MTUs. Instead of choosing the minimum MTU for the >> bridge interface, choose the maximum MTU of the bridge members. >> With this the user only needs to set a larger MTU on the member >> ports that are participating in the large MTU VLANS. >> >> Signed-off-by: Chas Williams <3ch...@gmail.com> >> --- > > Acked-by: Roopa Prabhu > > This or an equivalent fix is necessary: as stated above, today the > bridge mtu capped at min port mtu limits all > vlan devices on top of the vlan filtering bridge to min port mtu. On further thought, since this patch changes default behavior, it may upset people. ie with this patch, a vlan device on the bridge by default will now use the bridge max mtu and that could cause unexpected drops in the bridge driver if the xmit port had a lower mtu. This may surprise users. >> >> It only changes the default behavior when you are using VLAN aware bridges. >> The behavior remains the same otherwise. I don't know if VLAN aware bridges >> are that popular yet so there probably isn't any particular >> expectation from those >> bridges. > > they are popular...in-fact they are the default bridge mode on our > network switches. > And they have been around for some time now to ignore its users. > Plus it is not right to change default mtu behavior for one mode of the bridge > and not the others (bridge mtu handling from user-space is complex enough > today > due to dynamic mtu changes on port enslave/deslave). I don't see the issue with one mode of bridge behaving differently from another mode. 
The VLAN behavior between the two bridge modes is completely different, so having a different MTU behavior doesn't seem that surprising. You are potentially mixing different sized VLANs on the same bridge. The only sane choice is to pick the largest MTU for the bridge. This lets you have whatever MTU is appropriate on the child VLAN interfaces of the bridge. If you attempt to forward from a port with a larger MTU to one with a smaller MTU, you get the expected behavior. Forcing the end user to configure all the ports to the maximum MTU of all the VLANs on the bridge is wrong IMHO. You then risk attempting to forward oversize packets on a network that can't support that.

>> I don't think those drops are unexpected. If a user has misconfigured
>> the bridge we can't be expected to fix that for them. It is the user's
>> responsibility to ensure that the ports on the VLAN have a size
>> consistent with the traffic they expect to pass.
>
> By default they are not expected today. The problem is that changing the
> bridge to max mtu changes 'all' the vlan devices on top of the vlan
> aware bridge to max mtu by default, which makes drops at the bridge
> driver more common if the user had mixed mtu on its ports.

That's not been my experience. The MTU on the vlan devices is only limited by the bridge's MTU. Setting the bridge MTU doesn't change the child VLAN devices' MTUs.
[RFC] vhost: introduce mdev based hardware vhost backend
This patch introduces a mdev (mediated device) based hardware vhost backend. This backend is an abstraction of the various hardware vhost accelerators (potentially any device that uses a virtio ring can be used as a vhost accelerator). Some generic mdev parent ops are provided for accelerator drivers to support generating mdev instances.

What's this
===

The idea is that we can set up a virtio ring compatible device with the messages available at the vhost backend. Originally, these messages were used to implement a software vhost backend, but now we will use them to set up a virtio ring compatible hardware device. The hardware device will then work with the guest virtio driver in the VM just as the software backend does. That is to say, we can implement a hardware based vhost backend in QEMU, and potentially any virtio ring compatible device can be used with this backend. (We also call it vDPA -- vhost Data Path Acceleration.)

One problem is that different virtio ring compatible devices may have different device interfaces. That is to say, we would need different drivers in QEMU. It could be troublesome. And that's what this patch is trying to fix.

The idea behind this patch is very simple: mdev is a standard way to emulate devices in the kernel. So we define a standard device based on mdev which is able to accept vhost messages. When the mdev emulation code (i.e. the generic mdev parent ops provided by this patch) gets vhost messages, it will parse and deliver them to accelerator drivers. Drivers can use these messages to set up accelerators.

That is to say, the generic mdev parent ops (e.g. read()/write()/ioctl()/...) will be provided for accelerator drivers to register accelerators as mdev parent devices. And each accelerator device will support generating standard mdev instance(s).

With this standard device interface, we will be able to develop just one userspace driver to implement the hardware based vhost backend in QEMU.
Difference between vDPA and PCI passthru
===

The key difference between vDPA and PCI passthru is that, in vDPA, only the data path of the device (e.g. DMA ring, notify region and queue interrupt) is passed through to the VM; the device control path (e.g. PCI configuration space and MMIO regions) is still defined and emulated by QEMU.

The benefits of keeping virtio device emulation in QEMU compared with virtio device PCI passthru include (but are not limited to):

- consistent device interface for the guest OS in the VM;
- max flexibility on the hardware design, especially the accelerator
  for each vhost backend doesn't have to be a full PCI device;
- leveraging the existing virtio live-migration framework;

The interface of this mdev based device
===

1. BAR0

The MMIO region described by BAR0 is the main control interface. Messages will be written to or read from this region. The message type is determined by the `request` field in the message header; the message size is encoded in the message header too. The message format looks like this:

struct vhost_vfio_op {
	__u64 request;
	__u32 flags;
	/* Flag values: */
#define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
	__u32 size;
	union {
		__u64 u64;
		struct vhost_vring_state state;
		struct vhost_vring_addr addr;
		struct vhost_memory memory;
	} payload;
};

The existing vhost-kernel ioctl cmds are reused as the message requests in the above structure.
Each message will be written to or read from this region at offset 0:

int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
	struct vhost_vfio *vfio = dev->opaque;
	int ret;

	ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
	if (ret != count)
		return -1;

	return 0;
}

int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
	struct vhost_vfio *vfio = dev->opaque;
	uint64_t request = op->request;
	int ret;

	ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
	if (ret != count || request != op->request)
		return -1;

	return 0;
}

It's quite straightforward to set things to the device. We just need to write the message to the device directly:

int vhost_vfio_set_features(struct vhost_dev *dev, uint64_t features)
{
	struct vhost_vfio_op op;

	op.request = VHOST_SET_FEATURES;
	op.flags = 0;
	op.size = sizeof(features);
	op.payload.u64 = features;

	return vhost_vfio_write(dev, &op);
}

To get things from the device, two steps are needed. Take VHOST_GET_FEATURES as an example:

int vhost_vfio_get_features(struct vhost_dev *dev, uint64_t *features)
{
	struct vhost_vfio_op op;
Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
Hi Marc, David,

with the v2 patch ("net: usb: asix88179_178a: de-duplicate code") I made the embarrassingly stupid mistake of removing the wrong function. The v2 patch accidentally changed ax88179_link_reset() instead of ax88179_reset(). Hunk 6 of v2 ("net: usb: asix88179_178a: de-duplicate code") is just utterly wrong. ax88179_bind() and ax88179_reset() were the correct targets to be de-duplicated, as done in the v3 patch.

Sorry for this,
Alexander

On Mon, 2 Apr 2018, David Miller wrote:
> From: Marc Zyngier
> Date: Mon, 02 Apr 2018 10:45:40 +0100
>
>> What has changed between this patch and the previous one? Having a bit
>> of a change-log would certainly help. Also, I would have appreciated a
>> reply to the questions I had on v2 before you posted a third version.
>
> Agreed, and I'm not applying these patches until this is sorted out
> and explained properly.
Re: [PATCH net-next] net: ipv6/gre: Add GRO support
On 04/02/2018 08:00 AM, Eran Ben Elisha wrote:
>>>> Seems good, but why isn't this handled directly in GRO native layer ?
>>> ip6_tunnel and ip6_gre do not share initialization flow functions
>>> (unlike ipv4). Changing the ipv6 init infrastructure should not be
>>> part of this patch. we prefer to keep this one minimal, simple and safe.
>>
>> Looking at gre_gro_receive() and gre_gro_complete() I could not see why
>> they could not be copied/pasted to IPv6.
>
> These functions to handle GRO over GRE are already assigned in
> gre_offload_init() (in net/ipv4/gre_offload.c under CONFIG_IPV6).
> However without initializing the gro_cells, the receive path will not
> go via the napi_gro_receive path, but directly to netif_rx.
> So AFAIU, only gcells->cells was missing for gro_cells_receive to
> really go via the GRO flow.
>
>> Maybe give more details on the changelog, it is really not obvious.
>
> Hopefully the above filled this request.

Not really :/

gro_cells_receive() is not really useful with native GRO, since the packet is already a GRO packet by the time it reaches ip_tunnel_rcv() or __ip6_tnl_rcv().

Sure, it might be useful if native GRO (happening on eth0 if you prefer) did not handle a particular encapsulation. gro_cells was a workaround before we extended GRO to be able to decap some tunnel headers.

It seems we have to extend this to also support GRE6.
Re: [PATCH] connector: add parent pid and tgid to coredump and exit events
Hi David,

I don't see how it breaks UAPI. The point is that the structures coredump_proc_event and exit_proc_event are members of the *union* event_data, so the position of the existing data in the structure is unchanged. Furthermore, this change won't increase the size of struct proc_event, because comm_proc_event (also a member of event_data) is bigger than the changed structures. If I'm wrong, could you please explain what exactly the change will break in the UAPI?

On 30/03/18 19:59, David Miller wrote:
> From: Stefan Strogin
> Date: Thu, 29 Mar 2018 17:12:47 +0300
>
>> diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
>> index 68ff25414700..db210625cee8 100644
>> --- a/include/uapi/linux/cn_proc.h
>> +++ b/include/uapi/linux/cn_proc.h
>> @@ -116,12 +116,16 @@ struct proc_event {
>>  		struct coredump_proc_event {
>>  			__kernel_pid_t process_pid;
>>  			__kernel_pid_t process_tgid;
>> +			__kernel_pid_t parent_pid;
>> +			__kernel_pid_t parent_tgid;
>>  		} coredump;
>>
>>  		struct exit_proc_event {
>>  			__kernel_pid_t process_pid;
>>  			__kernel_pid_t process_tgid;
>>  			__u32 exit_code, exit_signal;
>> +			__kernel_pid_t parent_pid;
>> +			__kernel_pid_t parent_tgid;
>>  		} exit;
>>
>> 	} event_data;
>
> I don't think you can add these members without breaking UAPI.
Re: [PATCH net v5 2/3] ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
From: Alexey Kodanev
Date: Mon, 2 Apr 2018 17:00:34 +0300

> +++ b/net/ipv6/ip6_output.c
> @@ -1105,23 +1105,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6,
>   * @sk: socket which provides the dst cache and route info
>   * @fl6: flow to lookup
>   * @final_dst: final destination address for ipsec lookup
>   * @connected: whether @sk is connected or not
 ...
> struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
> -					 const struct in6_addr *final_dst)
> +					 const struct in6_addr *final_dst,
> +					 int connected)

Please use type 'bool' and true/false for this new parameter.

Thank you.
Re: [PATCH v3 0/2] net: mvneta: improve suspend/resume
From: Jisheng Zhang
Date: Mon, 2 Apr 2018 11:22:29 +0800

> This series tries to optimize the mvneta's suspend/resume
> implementation by only taking necessary actions.
>
> Since v2:
> - keep rtnl lock when calling mvneta_start_dev() and mvneta_stop_dev()
>   Thanks to Russell for pointing this out
>
> Since v1:
> - unify ret check
> - try best to keep the suspend/resume behavior
> - split txq deinit into sw/hw parts as well
> - adjust mvneta_stop_dev() location

Series applied, thank you.
Re: [PATCH net-next] bridge: Allow max MTU when multiple VLANs present
On Fri, Mar 30, 2018 at 12:54 PM, Chas Williams <3ch...@gmail.com> wrote:
> On Thu, Mar 29, 2018 at 9:02 PM, Toshiaki Makita wrote:
>> On 2018/03/30 1:49, Roopa Prabhu wrote:
>>> On Thu, Mar 22, 2018 at 9:53 PM, Roopa Prabhu wrote:
>>>> On Thu, Mar 22, 2018 at 8:34 AM, Chas Williams <3ch...@gmail.com> wrote:
>>>>> If the bridge is allowing multiple VLANs, some VLANs may have
>>>>> different MTUs. Instead of choosing the minimum MTU for the
>>>>> bridge interface, choose the maximum MTU of the bridge members.
>>>>> With this the user only needs to set a larger MTU on the member
>>>>> ports that are participating in the large MTU VLANS.
>>>>>
>>>>> Signed-off-by: Chas Williams <3ch...@gmail.com>
>>>>> ---
>>>>
>>>> Acked-by: Roopa Prabhu
>>>>
>>>> This or an equivalent fix is necessary: as stated above, today the
>>>> bridge mtu capped at min port mtu limits all vlan devices on top of
>>>> the vlan filtering bridge to min port mtu.
>>>
>>> On further thought, since this patch changes default behavior, it may
>>> upset people. i.e. with this patch, a vlan device on the bridge by
>>> default will now use the bridge max mtu and that could cause
>>> unexpected drops in the bridge driver if the xmit port had a lower
>>> mtu. This may surprise users.
>
> It only changes the default behavior when you are using VLAN aware
> bridges. The behavior remains the same otherwise. I don't know if VLAN
> aware bridges are that popular yet so there probably isn't any
> particular expectation from those bridges.

they are popular... in fact they are the default bridge mode on our network switches. And they have been around for some time now to ignore their users. Plus it is not right to change default mtu behavior for one mode of the bridge and not the others (bridge mtu handling from user-space is complex enough today due to dynamic mtu changes on port enslave/deslave).

> I don't think those drops are unexpected. If a user has misconfigured
> the bridge we can't be expected to fix that for them. It is the user's
> responsibility to ensure that the ports on the VLAN have a size
> consistent with the traffic they expect to pass.

By default they are not expected today. The problem is that changing the bridge to max mtu changes 'all' the vlan devices on top of the vlan aware bridge to max mtu by default, which makes drops at the bridge driver more common if the user had mixed mtu on its ports.
Re: [net-next PATCH v3 00/11] Add support for netcp driver on K2G SoC
On 04/02/2018 10:40 AM, David Miller wrote:
> The net-next tree is closed, please resubmit this after the merge window
> and the net-next tree is open back up again.

Ok. Will do.

Thanks

--
Murali Karicheri
Linux Kernel, Keystone
Re: [PATCH net-next] net: ipv6/gre: Add GRO support
>>> Seems good, but why isn't this handled directly in GRO native layer ?
>> ip6_tunnel and ip6_gre do not share initialization flow functions
>> (unlike ipv4). Changing the ipv6 init infrastructure should not be
>> part of this patch. we prefer to keep this one minimal, simple and safe.
>
> Looking at gre_gro_receive() and gre_gro_complete() I could not see why
> they could not be copied/pasted to IPv6.

These functions to handle GRO over GRE are already assigned in gre_offload_init() (in net/ipv4/gre_offload.c under CONFIG_IPV6). However without initializing the gro_cells, the receive path will not go via the napi_gro_receive path, but directly to netif_rx. So AFAIU, only gcells->cells was missing for gro_cells_receive to really go via the GRO flow.

> Maybe give more details on the changelog, it is really not obvious.

Hopefully the above filled this request.
Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support
On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
> Now that we have SCTP offload capabilities in the kernel, we can add
> them to virtio as well. First step is SCTP checksum.

Thanks.

> As for GSO, the way sctp GSO is currently implemented buys us nothing
> in added support to virtio. To add true GSO would require a lot of
> re-work inside of SCTP and would require extensions to the virtio
> net header to carry extra sctp data.

Can you please elaborate more on this? Is this because SCTP GSO relies on the gso skb format for knowing how to segment it instead of having a list of sizes?

  Marcelo
Re: [PATCH] net: implement IP_RECVHDRS option to get full headers through recvmsg cmsg.
From: Maciej Żenczykowski
Date: Sat, 31 Mar 2018 22:43:14 -0700

> From: Luigi Rizzo
>
> We have all sorts of different ways to fetch pre-UDP payload metadata:
>   IP_RECVTOS
>   IP_RECVTTL
>   IP_RECVOPTS
>   IP_RETOPTS
>
> But nothing generic which simply allows you to receive the entire
> packet header.
>
> This is in a similar vein to TCP_SAVE_SYN but for UDP and other
> datagram sockets.
>
> This is envisioned as a way to get GUE extension metadata for
> encapsulated packets, but implemented in a way to be much more
> future proof.
>
> (Implemented by Luigi, who asked me to send it upstream)
>
> Cc: Eric Dumazet
> Signed-off-by: Luigi Rizzo
> Signed-off-by: Maciej Żenczykowski

This is an ipv4 level socket option, so why are you copying in the MAC header(s)?

That part I don't like at all.

First of all, you have no idea what the link level protocol is for that MAC header, so how could you even begin to interpret its contents correctly?

Second of all, MAC level details do not belong in AF_INET socket interfaces.

Thank you.