Re: [PATCH v2 iproute2-next 4/6] rdma: Add CQ resource tracking information

2018-03-13 Thread Leon Romanovsky
On Tue, Feb 27, 2018 at 08:07:11AM -0800, Steve Wise wrote:
> Sample output:
>
> # rdma resource show cq
> link cxgb4_0/- cqe 46 users 2 pid 30503 comm rping
> link cxgb4_0/- cqe 46 users 2 pid 30498 comm rping
> link mlx4_0/- cqe 63 users 2 pid 30494 comm rping
> link mlx4_0/- cqe 63 users 2 pid 30489 comm rping
> link mlx4_0/- cqe 1023 users 2 poll_ctx WORKQUEUE pid 0 comm [ib_core]
>
> # rdma resource show cq pid 30489
> link mlx4_0/- cqe 63 users 2 pid 30489 comm rping
>
> Signed-off-by: Steve Wise 
> ---
>  rdma/res.c   | 136 +++
>  rdma/utils.c |   5 +++
>  2 files changed, 141 insertions(+)
>

Thanks,
Reviewed-by: Leon Romanovsky 




[bug, bisected] pfifo_fast causes packet reordering

2018-03-13 Thread Jakob Unterwurzacher
During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on 
Linux v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of 
packets are delivered out-of-order.


We have tracked the problem down to the driver interface level, and it 
seems that the driver's net_device_ops.ndo_start_xmit() function gets 
the packets handed over in the wrong order.
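
One way to quantify "a small fraction of packets delivered out-of-order" in such a stress test is sketched below. It assumes the sender stamps each frame with a 32-bit sequence number that increments by one and does not wrap during a run, and it counts every frame that arrives after a higher-numbered one. The helper name is hypothetical and not part of the ucan driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count frames that arrive with a sequence number
 * lower than the highest one already seen, i.e. frames overtaken by a
 * later frame. Assumes numbering starts at 0 and never wraps. */
size_t count_reordered(const uint32_t *seq, size_t n)
{
	size_t reordered = 0;
	uint32_t highest = 0;

	assert(seq != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (seq[i] < highest)
			reordered++;		/* overtaken by a later frame */
		else
			highest = seq[i];	/* new high-water mark */
	}
	return reordered;
}
```

For example, the arrival order 0, 1, 3, 2, 4 counts one reordered frame (2).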


This behavior was not observed on Linux v4.15 and I have bisected the 
problem down to this patch:



commit c5ad119fb6c09b0297446be05bd66602fa564758
Author: John Fastabend 
Date:   Thu Dec 7 09:58:19 2017 -0800

   net: sched: pfifo_fast use skb_array

   This converts the pfifo_fast qdisc to use the skb_array data structure
   and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
   the lockless bit that can be a child of a qdisc requiring locking. So
   we add logic to clear the lock bit on initialization in these cases when
   the qdisc graft operation occurs.

   This also removes the logic used to pick the next band to dequeue from
   and instead just checks a per priority array for packets from top priority
   to lowest. This might need to be a bit more clever but seems to work
   for now.

   Signed-off-by: John Fastabend 
   Signed-off-by: David S. Miller 
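
The sketch below illustrates the dequeue order the commit message describes. It is a single-threaded strict-priority scan over three bands (pfifo_fast's band count), with small fixed-size FIFOs standing in for skb_array. All names are hypothetical. It shows only the top-to-lowest scan and does not model the lockless, multi-CPU behaviour that the bisection implicates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBANDS   3	/* band 0 = highest priority */
#define BAND_CAP 16	/* hypothetical per-band capacity */

struct band {
	int items[BAND_CAP];
	size_t head, count;
};

struct prio_fifo {
	struct band bands[NBANDS];
};

bool prio_enqueue(struct prio_fifo *q, unsigned int band, int v)
{
	struct band *b;

	assert(band < NBANDS);
	b = &q->bands[band];
	if (b->count == BAND_CAP)
		return false;	/* band full: drop */
	b->items[(b->head + b->count) % BAND_CAP] = v;
	b->count++;
	return true;
}

/* Scan bands from highest to lowest priority; pop the first item found. */
bool prio_dequeue(struct prio_fifo *q, int *out)
{
	for (unsigned int i = 0; i < NBANDS; i++) {
		struct band *b = &q->bands[i];

		if (b->count) {
			*out = b->items[b->head];
			b->head = (b->head + 1) % BAND_CAP;
			b->count--;
			return true;
		}
	}
	return false;	/* all bands empty */
}
```

Within one band, items leave in FIFO order. Across bands, a higher band always wins, regardless of arrival time.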


The patch does not revert cleanly, but moving to one commit earlier 
makes the problem go away.


Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem 
go away as well.


Is this an unintended side-effect of the patch or is there something the 
driver has to do to request in-order delivery?


Thanks,
Jakob


Re: [bug, bisected] pfifo_fast causes packet reordering

2018-03-13 Thread Dave Taht
On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher wrote:
> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
> delivered out-of-order.
>
> We have tracked the problem down to the driver interface level, and it seems
> that the driver's net_device_ops.ndo_start_xmit() function gets the packets
> handed over in the wrong order.
>
> This behavior was not observed on Linux v4.15 and I have bisected the
> problem down to this patch:
>
>> commit c5ad119fb6c09b0297446be05bd66602fa564758
>> Author: John Fastabend 
>> Date:   Thu Dec 7 09:58:19 2017 -0800
>>
>>    net: sched: pfifo_fast use skb_array
>>
>>    This converts the pfifo_fast qdisc to use the skb_array data structure
>>    and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
>>    the lockless bit that can be a child of a qdisc requiring locking. So
>>    we add logic to clear the lock bit on initialization in these cases when
>>    the qdisc graft operation occurs.
>>
>>    This also removes the logic used to pick the next band to dequeue from
>>    and instead just checks a per priority array for packets from top priority
>>    to lowest. This might need to be a bit more clever but seems to work
>>    for now.
>>
>>    Signed-off-by: John Fastabend
>>    Signed-off-by: David S. Miller
>
>
> The patch does not revert cleanly, but moving to one commit earlier makes
> the problem go away.
>
> Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem go
> away as well.

I am, of course, a fan of obsoleting pfifo_fast. There's no good reason
for it anymore.

>
> Is this an unintended side-effect of the patch or is there something the
> driver has to do to request in-order delivery?
>
> Thanks,
> Jakob



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


[PATCH v2 net] kcm: lock lower socket in kcm_attach

2018-03-13 Thread Tom Herbert
Need to lock lower socket in order to provide mutual exclusion
with kcm_unattach.
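
The fix follows the common kernel pattern of taking the lock once and routing every early return through a single "out:" label that releases it. Below is a user-space sketch of that shape. A pthread mutex stands in for lock_sock()/release_sock(), and a hypothetical struct stands in for the lower socket.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lower (TCP) socket. */
struct lower_sock {
	pthread_mutex_t lock;	/* plays the role of lock_sock()/release_sock() */
	bool is_tcp;
	bool listening;
	bool attached;
};

/* Every exit after the lock is taken goes through "out", so the lock is
 * released exactly once whichever check fails. */
int attach_sketch(struct lower_sock *s)
{
	int err = 0;

	assert(s != NULL);
	pthread_mutex_lock(&s->lock);

	if (!s->is_tcp || s->listening) {
		err = -EOPNOTSUPP;
		goto out;
	}
	if (s->attached) {
		err = -EALREADY;
		goto out;
	}
	s->attached = true;	/* state change made while holding the lock */

out:
	pthread_mutex_unlock(&s->lock);
	return err;
}
```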

v2: Add Reported-by for syzbot

Fixes: ab7ac4eb9832e32a09f4e804 ("kcm: Kernel Connection Multiplexor module")
Reported-by: syzbot+ea75c0ffcd353d32515f064aaebefc5279e61...@syzkaller.appspotmail.com
Signed-off-by: Tom Herbert 
---
 net/kcm/kcmsock.c | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index f297d53a11aa..34355fd19f27 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1381,24 +1381,32 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
.parse_msg = kcm_parse_func_strparser,
.read_sock_done = kcm_read_sock_done,
};
-   int err;
+   int err = 0;
 
csk = csock->sk;
if (!csk)
return -EINVAL;
 
+   lock_sock(csk);
+
/* Only allow TCP sockets to be attached for now */
if ((csk->sk_family != AF_INET && csk->sk_family != AF_INET6) ||
-   csk->sk_protocol != IPPROTO_TCP)
-   return -EOPNOTSUPP;
+   csk->sk_protocol != IPPROTO_TCP) {
+   err = -EOPNOTSUPP;
+   goto out;
+   }
 
/* Don't allow listeners or closed sockets */
-   if (csk->sk_state == TCP_LISTEN || csk->sk_state == TCP_CLOSE)
-   return -EOPNOTSUPP;
+   if (csk->sk_state == TCP_LISTEN || csk->sk_state == TCP_CLOSE) {
+   err = -EOPNOTSUPP;
+   goto out;
+   }
 
psock = kmem_cache_zalloc(kcm_psockp, GFP_KERNEL);
-   if (!psock)
-   return -ENOMEM;
+   if (!psock) {
+   err = -ENOMEM;
+   goto out;
+   }
 
psock->mux = mux;
psock->sk = csk;
@@ -1407,7 +1415,7 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
err = strp_init(&psock->strp, csk, );
if (err) {
kmem_cache_free(kcm_psockp, psock);
-   return err;
+   goto out;
}
 
write_lock_bh(&csk->sk_callback_lock);
@@ -1419,7 +1427,8 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
write_unlock_bh(&csk->sk_callback_lock);
strp_done(&psock->strp);
kmem_cache_free(kcm_psockp, psock);
-   return -EALREADY;
+   err = -EALREADY;
+   goto out;
}
 
psock->save_data_ready = csk->sk_data_ready;
@@ -1455,7 +1464,10 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
/* Schedule RX work in case there are already bytes queued */
strp_check_rcv(&psock->strp);
 
-   return 0;
+out:
+   release_sock(csk);
+
+   return err;
 }
 
 static int kcm_attach_ioctl(struct socket *sock, struct kcm_attach *info)
@@ -1507,6 +1519,7 @@ static void kcm_unattach(struct kcm_psock *psock)
 
if (WARN_ON(psock->rx_kcm)) {
write_unlock_bh(&csk->sk_callback_lock);
+   release_sock(csk);
return;
}
 
-- 
2.11.0



Re: [PATCH 12/15] ice: Add stats and ethtool support

2018-03-13 Thread Venkataramanan, Anirudh
On Fri, 2018-03-09 at 15:14 -0800, Jakub Kicinski wrote:
> On Fri,  9 Mar 2018 09:21:33 -0800, Anirudh Venkataramanan wrote:
> > +static const struct ice_stats ice_net_stats[] = {
> > +   ICE_NETDEV_STAT(rx_packets),
> > +   ICE_NETDEV_STAT(tx_packets),
> > +   ICE_NETDEV_STAT(rx_bytes),
> > +   ICE_NETDEV_STAT(tx_bytes),
> > +   ICE_NETDEV_STAT(rx_errors),
> > +   ICE_NETDEV_STAT(tx_errors),
> > +   ICE_NETDEV_STAT(rx_dropped),
> > +   ICE_NETDEV_STAT(tx_dropped),
> > +   ICE_NETDEV_STAT(multicast),
> > +   ICE_NETDEV_STAT(rx_length_errors),
> > +   ICE_NETDEV_STAT(rx_crc_errors),
> > +};
> 
> Please don't duplicate standard netdev stats in ethtool -S.

Jakub,

Thanks for the feedback. I am not sure I understand what's being asked
here. Do you mean to say that standard netdev stats should not be
printed when we do ethtool -S or something else?

Thanks!
Ani



Re: [PATCH v2 iproute2-next 5/6] rdma: Add MR resource tracking information

2018-03-13 Thread Leon Romanovsky
On Tue, Feb 27, 2018 at 08:07:17AM -0800, Steve Wise wrote:
> Sample output:
>
> Without CAP_NET_ADMIN:
>
> $ rdma resource show mr mrlen 65536
> link mlx4_0/- mrlen 65536 pid 0 comm [nvme_rdma]
> link cxgb4_0/- mrlen 65536 pid 0 comm [nvme_rdma]
>
> With CAP_NET_ADMIN:
>
> # rdma resource show mr mrlen 65536
> link mlx4_0/- rkey 0x12702 lkey 0x12702 iova 0x85724a000 mrlen 65536 pid 0 comm [nvme_rdma]
> link cxgb4_0/- rkey 0x68fe4e9 lkey 0x68fe4e9 iova 0x835b91000 mrlen 65536 pid 0 comm [nvme_rdma]
>
> Signed-off-by: Steve Wise 
> ---
>  include/json_writer.h |   2 +
>  lib/json_writer.c     |  11 +
>  rdma/res.c            | 125 ++
>  rdma/utils.c          |   6 +++
>  4 files changed, 144 insertions(+)
>

Thanks,
Reviewed-by: Leon Romanovsky 




[PATCH] hv_netvsc: Make sure out channel is fully opened on send

2018-03-13 Thread Mohammed Gamal
During high network traffic, changes to network interface parameters,
such as the number of channels or the MTU, can cause a kernel panic with
a NULL pointer dereference. This is due to netvsc_device_remove() being
called and deallocating the channel ring buffers, which can then be
accessed by netvsc_send_pkt() before they are allocated again by
netvsc_device_add().

The patch fixes this problem by checking the channel state and returning
-ENODEV if the channel is not yet opened. We also move the call to
hv_ringbuf_avail_percent(), which may access the uninitialized ring buffer,
so that it runs only after the channel state has been checked.
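
The sketch below shows the ordering the patch establishes: refuse the send while the channel is rescinded or not yet opened, and only then read ring-buffer occupancy. The structs, state names and percentage helper are hypothetical stand-ins, not the Hyper-V VMBus API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum chan_state { CHAN_CLOSED, CHAN_OPENING, CHAN_OPENED };

struct ring {
	unsigned int size;
	unsigned int used;
};

struct channel {
	bool rescind;
	enum chan_state state;
	struct ring *outbound;	/* assumed valid only once CHAN_OPENED */
};

/* Percentage of the ring still free; dereferences the ring buffer. */
static unsigned int ring_avail_percent(const struct ring *r)
{
	return (r->size - r->used) * 100 / r->size;
}

int send_sketch(struct channel *ch, unsigned int *avail_out)
{
	assert(ch != NULL && avail_out != NULL);

	/* Check state first: the ring may be freed or not yet allocated. */
	if (ch->rescind || ch->state != CHAN_OPENED)
		return -ENODEV;

	*avail_out = ring_avail_percent(ch->outbound);
	return 0;
}
```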

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0265d70..44a8358 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -757,7 +757,7 @@ static inline int netvsc_send_pkt(
struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
u64 req_id;
int ret;
-   u32 ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
+   u32 ring_avail;
 
nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
if (skb)
@@ -773,7 +773,7 @@ static inline int netvsc_send_pkt(
 
req_id = (ulong)skb;
 
-   if (out_channel->rescind)
+   if (out_channel->rescind || out_channel->state != CHANNEL_OPENED_STATE)
return -ENODEV;
 
if (packet->page_buf_cnt) {
@@ -791,6 +791,7 @@ static inline int netvsc_send_pkt(
   
VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
}
 
+   ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
if (ret == 0) {
atomic_inc_return(>queue_sends);
 
-- 
1.8.3.1



Re: [PATCH v2 iproute2-next 6/6] rdma: Add PD resource tracking information

2018-03-13 Thread Leon Romanovsky
On Tue, Feb 27, 2018 at 08:07:23AM -0800, Steve Wise wrote:
> Sample output:
>
> Without CAP_NET_ADMIN capability:
>
> link mlx4_0/- users 0 pid 0 comm [ib_srpt]
> link mlx4_0/- users 0 pid 0 comm [ib_srp]
> link mlx4_0/- users 1 pid 0 comm [ib_core]
> link cxgb4_0/- users 0 pid 0 comm [ib_srp]
>
> With CAP_NET_ADMIN capability:
> link mlx4_0/- local_dma_lkey 0x8000 users 0 pid 0 comm [ib_srpt]
> link mlx4_0/- local_dma_lkey 0x8000 users 0 pid 0 comm [ib_srp]
> link mlx4_0/- local_dma_lkey 0x8000 users 1 pid 0 comm [ib_core]
> link cxgb4_0/- local_dma_lkey 0x0 users 0 pid 0 comm [ib_srp]
>
> Signed-off-by: Steve Wise 
> ---
>  rdma/res.c | 92 ++
>  1 file changed, 92 insertions(+)
>

Thanks,
Reviewed-by: Leon Romanovsky 




[PATCH net 2/2] vmxnet3: use correct flag to indicate LRO feature

2018-03-13 Thread Ronak Doshi
'Commit 45dac1d6ea04 ("vmxnet3: Changes for vmxnet3 adapter version 2
(fwd)")' introduced a flag "lro" in structure vmxnet3_adapter which is
used to indicate whether LRO is enabled or not. However, the patch
did not set the flag and hence it was never exercised.

So, when LRO is enabled, it resulted in poor TCP performance due to
delayed acks. This issue is seen with packets which are larger than
the mss getting a delayed ack rather than an immediate ack, thus
resulting in high latency.

This patch removes the lro flag and instead checks the device features
for NETIF_F_LRO to determine whether LRO is enabled.

Reported-by: Rachel Lunnon 
Signed-off-by: Ronak Doshi 
Acked-by: Shrikrishna Khare 
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 3 ++-
 drivers/net/vmxnet3/vmxnet3_int.h | 5 ++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 052eef2f729f..86c4d6e4dfaa 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1473,7 +1473,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
vmxnet3_rx_csum(adapter, skb,
(union Vmxnet3_GenericDesc *)rcd);
skb->protocol = eth_type_trans(skb, adapter->netdev);
-   if (!rcd->tcp || !adapter->lro)
+   if (!rcd->tcp ||
+   !(adapter->netdev->features & NETIF_F_LRO))
goto not_lro;
 
if (segCnt != 0 && mss != 0) {
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index b94fdfd0b6f1..99387a4a20a8 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -69,10 +69,10 @@
 /*
  * Version numbers
  */
-#define VMXNET3_DRIVER_VERSION_STRING   "1.4.12.0-k"
+#define VMXNET3_DRIVER_VERSION_STRING   "1.4.13.0-k"
 
 /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
-#define VMXNET3_DRIVER_VERSION_NUM  0x01040c00
+#define VMXNET3_DRIVER_VERSION_NUM  0x01040d00
 
 #if defined(CONFIG_PCI_MSI)
/* RSS only makes sense if MSI-X is supported. */
@@ -343,7 +343,6 @@ struct vmxnet3_adapter {
u8  version;
 
boolrxcsum;
-   boollro;
 
 #ifdef VMXNET3_RSS
struct UPT1_RSSConf *rss_conf;
-- 
2.11.0
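
The two version macros in the vmxnet3_int.h hunk change together because VMXNET3_DRIVER_VERSION_NUM packs one version component per byte, with the major version in the top byte. The small compile-time check below (the macro name is hypothetical) shows that 1.4.13.0 maps to 0x01040d00, just as 1.4.12.0 mapped to 0x01040c00.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a four-part version, one byte per component, major in the top byte. */
#define PACK_VERSION(maj, min, pat, bld)                          \
	(((uint32_t)(maj) << 24) | ((uint32_t)(min) << 16) |      \
	 ((uint32_t)(pat) << 8) | (uint32_t)(bld))

static_assert(PACK_VERSION(1, 4, 12, 0) == 0x01040c00, "1.4.12.0");
static_assert(PACK_VERSION(1, 4, 13, 0) == 0x01040d00, "1.4.13.0");
```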



Re: [PATCH] net: dsa: drop some VLAs in switch.c

2018-03-13 Thread Vivien Didelot
Hi Salvatore,

Salvatore Mesoraca  writes:

> dsa_switch's num_ports is currently fixed to DSA_MAX_PORTS. So we avoid
> 2 VLAs[1] by using DSA_MAX_PORTS instead of ds->num_ports.
>
> [1] https://lkml.org/lkml/2018/3/7/621
>
> Signed-off-by: Salvatore Mesoraca 

NAK.

We are in the process of removing hardcoded limits such as DSA_MAX_PORTS
and DSA_MAX_SWITCHES, so we have to stick with ds->num_ports.


Thanks,

Vivien


Re: [PATCH] net: dsa: drop some VLAs in switch.c

2018-03-13 Thread Florian Fainelli
On 03/13/2018 12:58 PM, Vivien Didelot wrote:
> Hi Salvatore,
> 
> Salvatore Mesoraca  writes:
> 
>> dsa_switch's num_ports is currently fixed to DSA_MAX_PORTS. So we avoid
>> 2 VLAs[1] by using DSA_MAX_PORTS instead of ds->num_ports.
>>
>> [1] https://lkml.org/lkml/2018/3/7/621
>>
>> Signed-off-by: Salvatore Mesoraca 
> 
> NAK.
> 
> We are in the process of removing hardcoded limits such as DSA_MAX_PORTS
> and DSA_MAX_SWITCHES, so we have to stick with ds->num_ports.

Then this means that we need to allocate a bitmap from the heap, which
sounds a bit superfluous and could theoretically fail... not sure which
way is better, but bumping the size to DSA_MAX_PORTS definitely does
help people working on enabling -Wvla.
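
The trade-off can be sketched in user-space C. A bitmap bounded by a compile-time maximum needs no VLA and no allocation but imposes a hard limit. A bitmap sized from the runtime port count has no limit, but its allocation can fail. MAX_PORTS_SKETCH and the helpers are hypothetical. In-kernel code would use the <linux/bitmap.h> helpers rather than hand-rolled ones.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_PORTS_SKETCH 12	/* hypothetical compile-time bound */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define WORDS_FOR(n) (((n) + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Option 1: fixed upper bound; can live on the stack and cannot fail. */
struct port_mask_fixed {
	unsigned long bits[WORDS_FOR(MAX_PORTS_SKETCH)];
};

/* Option 2: sized from the runtime port count; may return NULL. */
unsigned long *port_mask_alloc(unsigned int num_ports)
{
	return calloc(WORDS_FOR(num_ports), sizeof(unsigned long));
}

void port_mask_set(unsigned long *bits, unsigned int port)
{
	assert(bits != NULL);
	bits[port / BITS_PER_WORD] |= 1UL << (port % BITS_PER_WORD);
}

bool port_mask_test(const unsigned long *bits, unsigned int port)
{
	return bits[port / BITS_PER_WORD] & (1UL << (port % BITS_PER_WORD));
}
```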
-- 
Florian


Re: [RESEND PATCH] rsi: Remove stack VLA usage

2018-03-13 Thread Tobin C. Harding
On Sun, Mar 11, 2018 at 09:06:10PM -0500, Larry Finger wrote:
> On 03/11/2018 08:43 PM, Tobin C. Harding wrote:
> >The kernel would like to have all stack VLA usage removed[1].  rsi uses
> >a VLA based on 'blksize'.  Elsewhere in the SDIO code maximum block size
> >is defined using a magic number.  We can use a pre-processor defined
> >constant and declare the array to maximum size.  We add a check before
> >accessing the array in case of programmer error.
> >
> >[1]: https://lkml.org/lkml/2018/3/7/621
> >
> >Signed-off-by: Tobin C. Harding 
> >---
> >
> >RESEND: add wireless mailing list to CC's (requested by Kalle)
> >
> >  drivers/net/wireless/rsi/rsi_91x_hal.c  | 13 +++--
> >  drivers/net/wireless/rsi/rsi_91x_sdio.c |  9 +++--
> >  2 files changed, 14 insertions(+), 8 deletions(-)
> >
> >diff --git a/drivers/net/wireless/rsi/rsi_91x_hal.c b/drivers/net/wireless/rsi/rsi_91x_hal.c
> >index 1176de646942..839ebdd602df 100644
> >--- a/drivers/net/wireless/rsi/rsi_91x_hal.c
> >+++ b/drivers/net/wireless/rsi/rsi_91x_hal.c
> >@@ -641,7 +641,7 @@ static int ping_pong_write(struct rsi_hw *adapter, u8 cmd, u8 *addr, u32 size)
> > u32 cmd_addr;
> > u16 cmd_resp, cmd_req;
> > u8 *str;
> >-int status;
> >+int status, ret;
> > if (cmd == PING_WRITE) {
> > cmd_addr = PING_BUFFER_ADDRESS;
> >@@ -655,12 +655,13 @@ static int ping_pong_write(struct rsi_hw *adapter, u8 cmd, u8 *addr, u32 size)
> > str = "PONG_VALID";
> > }
> >-status = hif_ops->load_data_master_write(adapter, cmd_addr, size,
> >+ret = hif_ops->load_data_master_write(adapter, cmd_addr, size,
> > block_size, addr);
> >-if (status) {
> >-rsi_dbg(ERR_ZONE, "%s: Unable to write blk at addr %0x\n",
> >-__func__, *addr);
> >-return status;
> >+if (ret) {
> >+if (ret != -EINVAL)
> >+rsi_dbg(ERR_ZONE, "%s: Unable to write blk at addr %0x\n",
> >+__func__, *addr);
> >+return ret;
> > }
> > status = bl_cmd(adapter, cmd_req, cmd_resp, str);
> >diff --git a/drivers/net/wireless/rsi/rsi_91x_sdio.c b/drivers/net/wireless/rsi/rsi_91x_sdio.c
> >index b0cf41195051..b766578b591a 100644
> >--- a/drivers/net/wireless/rsi/rsi_91x_sdio.c
> >+++ b/drivers/net/wireless/rsi/rsi_91x_sdio.c
> >@@ -20,6 +20,8 @@
> >  #include "rsi_common.h"
> >  #include "rsi_hal.h"
> >+#define RSI_MAX_BLOCK_SIZE 256
> >+
> >  /**
> >   * rsi_sdio_set_cmd52_arg() - This function prepares cmd 52 read/write arg.
> >   * @rw: Read/write
> >@@ -362,7 +364,7 @@ static int rsi_setblocklength(struct rsi_hw *adapter, u32 length)
> > rsi_dbg(INIT_ZONE, "%s: Setting the block length\n", __func__);
> > status = sdio_set_block_size(dev->pfunction, length);
> >-dev->pfunction->max_blksize = 256;
> >+dev->pfunction->max_blksize = RSI_MAX_BLOCK_SIZE;
> > adapter->block_size = dev->pfunction->max_blksize;
> > rsi_dbg(INFO_ZONE,
> >@@ -567,9 +569,12 @@ static int rsi_sdio_load_data_master_write(struct rsi_hw *adapter,
> >  {
> > u32 num_blocks, offset, i;
> > u16 msb_address, lsb_address;
> >-u8 temp_buf[block_size];
> >+u8 temp_buf[RSI_MAX_BLOCK_SIZE];
> > int status;
> >+if (block_size > RSI_MAX_BLOCK_SIZE)
> >+return -EINVAL;
> >+
> > num_blocks = instructions_sz / block_size;
> > msb_address = base_address >> 16;
> 
> I am not giving this patch a negative review, but my solution to the same
> problem has been to change the on-stack array into a u8 pointer, use
> kmalloc() to assign the space, and then free that space at the end. That way
> large stack allocations are avoided, with a minimum of changes.
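
Larry's suggestion is sketched below in user space, with malloc()/free() standing in for kmalloc()/kfree(). The scratch buffer is sized from the runtime block size, so neither a VLA nor a fixed maximum is needed, at the cost of an allocation that can fail. The function name, and the copy that stands in for the SDIO block write, are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: move `len` bytes through a heap scratch buffer of
 * `block_size` bytes, one block at a time (stand-in for SDIO writes). */
int write_in_blocks(const unsigned char *src, size_t len, size_t block_size,
		    unsigned char *dst)
{
	unsigned char *temp_buf;
	size_t off;

	assert(dst != NULL || len == 0);
	if (block_size == 0)
		return -EINVAL;

	temp_buf = malloc(block_size);	/* kmalloc(block_size, GFP_KERNEL) in-kernel */
	if (!temp_buf)
		return -ENOMEM;

	for (off = 0; off < len; off += block_size) {
		size_t chunk = len - off < block_size ? len - off : block_size;

		memcpy(temp_buf, src + off, chunk);
		memcpy(dst + off, temp_buf, chunk);	/* "write" the block */
	}

	free(temp_buf);	/* kfree() in-kernel; the only exit after allocation */
	return 0;
}
```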

Your idea is better, Larry. Have you got a patch done already, or do you
want me to knock one up?

thanks,
Tobin.


Re: [PATCH iproute2] Revert "iproute: "list/flush/save default" selected all of the routes"

2018-03-13 Thread Luca Boccassi
On Tue, 2018-03-13 at 21:12 +0100, Alexander Zubkov wrote:
> Hi,
> 
> I just realized that you need a patch for v4.15.0, which is easier to do.
> I'll send it as a separate message now. I will make a patch for the
> master branch, but later.

Thanks but don't worry about 4.15 - Stephen's revert will be enough for
now. I'm going to push 4.16 as soon as it's out anyway, so you can just
do the changes for master if you wish.

> On 13.03.2018 13:02, Luca Boccassi wrote:
> > On Tue, 2018-03-13 at 12:05 +0100, Alexander Zubkov wrote:
> > > Hello again,
> > > 
> > > The fun thing is that before the commit "ip route ls all" showed all
> > > routes, but "ip -[4|6] route ls all" showed only default. So it was
> > > broken too, but in another way.
> > > I see parsing of prefix was changed since my patch. So I need several
> > > days to propose a fix. I think if "ip route ls [all|any]" shows all
> > > routes and "ip route ls default" shows only default, everybody will
> > > be happy with that?
> > 
> > Hi,
> > 
> > My only concern is that behaviour of existing commands that have been
> > in releases is not changed, otherwise I get bugs raised :-)
> > 
> > Thank you for your work!
> > 
> > > 13.03.2018, 09:46, "Alexander Zubkov" :
> > > > Hello.
> > > > 
> > > > Maybe the better way would be to change how the "all"/"any" argument
> > > > behaves? My original concern was about "default" only. I agree too
> > > > that "all" or "any" should work for all routes, but not for the
> > > > default.
> > > > 
> > > > 12.03.2018, 22:37, "Luca Boccassi" :
> > > > >   On Mon, 2018-03-12 at 14:03 -0700, Stephen Hemminger wrote:
> > > > > >    This reverts commit 9135c4d6037ff9f1818507bac0049fc44db8c3d2.
> > > > > >
> > > > > >    Debian maintainer found that basic command:
> > > > > >    # ip route flush all
> > > > > >    No longer worked as expected which breaks user scripts and
> > > > > >    expectations. It no longer flushed all IPv4 routes.
> > > > > >
> > > > > >    Reported-by: Luca Boccassi
> > > > > >    Signed-off-by: Stephen Hemminger
> > > > > >    ---
> > > > > >     ip/iproute.c | 65 ++------
> > > > > >     lib/utils.c  | 13 
> > > > > >     2 files changed, 32 insertions(+), 46 deletions(-)
> > > > > 
> > > > >   Tested-by: Luca Boccassi 
> > > > > 
> > > > >   Thanks, solves the problem. I'll backport it to Debian.
> > > > > 
> > > > >   Alexander, reproducing the issue is quite simple - before that commit,
> > > > >   ip route ls all showed all routes, but with the change it started
> > > > >   showing only the default table. Same for ip route flush.
> > > > > 
> > > > >   --
> > > > >   Kind regards,
> > > > >   Luca Boccassi
> 
> 

-- 
Kind regards,
Luca Boccassi



Re: [RESEND] rsi: Remove stack VLA usage

2018-03-13 Thread tcharding
On Mon, Mar 12, 2018 at 09:46:06AM +, Kalle Valo wrote:
> tcharding  wrote:
> 
> > The kernel would like to have all stack VLA usage removed[1].  rsi uses
> > a VLA based on 'blksize'.  Elsewhere in the SDIO code maximum block size
> > is defined using a magic number.  We can use a pre-processor defined
> > constant and declare the array to maximum size.  We add a check before
> > accessing the array in case of programmer error.
> > 
> > [1]: https://lkml.org/lkml/2018/3/7/621
> > 
> > Signed-off-by: Tobin C. Harding 
> 
> Tobin, your name in patchwork.kernel.org is just "tcharding" but it should be
> "Tobin C. Harding". Patchwork is braindead in a way as it takes the name from
> its database instead of the From header of the patch in question.
> 
> I can fix that manually but it would be helpful if you could register to
> patchwork and fix your name during registration. You have only one chance to
> fix your name (another braindead feature!) so be careful :)

Hi Kalle,

I logged into my patchwork account but I don't see any way to set the
name.  Within 'profile' there is only 'change password' and 'link
email'.  I thought I could unregister then re-register but I can't see
how to do that either.  Is there a maintainer of patchwork.kernel.org
who I can email to manually remove me from the system?

thanks,
Tobin.


Re: [PATCH iproute2] Revert "iproute: "list/flush/save default" selected all of the routes"

2018-03-13 Thread Alexander Zubkov

Hi,

I just realized that you need a patch for v4.15.0, which is easier to do.
I'll send it as a separate message now. I will make a patch for the master
branch, but later.


On 13.03.2018 13:02, Luca Boccassi wrote:
> On Tue, 2018-03-13 at 12:05 +0100, Alexander Zubkov wrote:
> > Hello again,
> >
> > The fun thing is that before the commit "ip route ls all" showed all
> > routes, but "ip -[4|6] route ls all" showed only default. So it was
> > broken too, but in another way.
> > I see parsing of prefix was changed since my patch. So I need several
> > days to propose a fix. I think if "ip route ls [all|any]" shows all
> > routes and "ip route ls default" shows only default, everybody will
> > be happy with that?
>
> Hi,
>
> My only concern is that behaviour of existing commands that have been
> in releases is not changed, otherwise I get bugs raised :-)
>
> Thank you for your work!
>
> > 13.03.2018, 09:46, "Alexander Zubkov" :
> > > Hello.
> > >
> > > Maybe the better way would be to change how the "all"/"any" argument
> > > behaves? My original concern was about "default" only. I agree too
> > > that "all" or "any" should work for all routes, but not for the
> > > default.
> > >
> > > 12.03.2018, 22:37, "Luca Boccassi" :
> > > >   On Mon, 2018-03-12 at 14:03 -0700, Stephen Hemminger wrote:
> > > > >    This reverts commit 9135c4d6037ff9f1818507bac0049fc44db8c3d2.
> > > > >
> > > > >    Debian maintainer found that basic command:
> > > > >    # ip route flush all
> > > > >    No longer worked as expected which breaks user scripts and
> > > > >    expectations. It no longer flushed all IPv4 routes.
> > > > >
> > > > >    Reported-by: Luca Boccassi
> > > > >    Signed-off-by: Stephen Hemminger
> > > > >    ---
> > > > >     ip/iproute.c | 65 ++--
> > > > >     lib/utils.c  | 13 
> > > > >     2 files changed, 32 insertions(+), 46 deletions(-)
> > > >
> > > >   Tested-by: Luca Boccassi
> > > >
> > > >   Thanks, solves the problem. I'll backport it to Debian.
> > > >
> > > >   Alexander, reproducing the issue is quite simple - before that commit,
> > > >   ip route ls all showed all routes, but with the change it started
> > > >   showing only the default table. Same for ip route flush.
> > > >
> > > >   --
> > > >   Kind regards,
> > > >   Luca Boccassi






[PATCH iproute2] treat "default" and "all"/"any" parameters differently

2018-03-13 Thread Alexander Zubkov

Debian maintainer found that basic command:
# ip route flush all
No longer worked as expected which breaks user scripts and
expectations. It no longer flushed all IPv4 routes.

Recently, the behaviour of the "default" prefix parameter was corrected. But
at the same time the behaviour of "all"/"any" was altered too, because they
were handled in the same branch of the code. As those parameters mean
different things, they need to be treated differently in code too. This
patch reflects the difference.
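
Below is a simplified sketch of the intended distinction. It assumes that pinning the prefix length is what restricts matching: "default" sets the length-specified flag so only /0 routes match, while "all"/"any" leave it clear so every route matches. The struct, flag and functions are hypothetical and much smaller than iproute2's inet_prefix handling. Address comparison is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PLEN_SPECIFIED 0x1	/* hypothetical analogue of PREFIXLEN_SPECIFIED */

struct prefix_sketch {
	unsigned int bitlen;
	unsigned int flags;
};

/* Parse only the keyword forms; returns false for anything else. */
bool parse_keyword(const char *arg, struct prefix_sketch *p)
{
	assert(arg != NULL && p != NULL);
	if (strcmp(arg, "default") != 0 && strcmp(arg, "all") != 0 &&
	    strcmp(arg, "any") != 0)
		return false;

	p->bitlen = 0;
	p->flags = 0;
	if (strcmp(arg, "default") == 0)
		p->flags |= PLEN_SPECIFIED;	/* pin /0: only default routes */
	return true;
}

/* Match if no length was pinned, or the route's length equals the filter's. */
bool route_matches(const struct prefix_sketch *filter, unsigned int route_len)
{
	return !(filter->flags & PLEN_SPECIFIED) || route_len == filter->bitlen;
}
```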

Reported-by: Luca Boccassi 
Signed-off-by: Alexander Zubkov 
---
 lib/utils.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/utils.c b/lib/utils.c
index 9fa5220..b289d9c 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -658,7 +658,8 @@ int get_prefix_1(inet_prefix *dst, char *arg, int family)

dst->family = family;
dst->bytelen = 0;
dst->bitlen = 0;
-   dst->flags |= PREFIXLEN_SPECIFIED;
+   if (strcmp(arg, "default") == 0)
+   dst->flags |= PREFIXLEN_SPECIFIED;
return 0;
}

--
1.9.1



[PATCH net-next 1/2] selftests/txtimestamp: Add more configurable parameters

2018-03-13 Thread Vinicius Costa Gomes
Add a way to configure whether poll() should wait forever for an event,
the number of packets that should be sent for each test, and whether
there should be any delay between packets.
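
For reference, poll()'s timeout is in milliseconds: 0 returns immediately, and a negative value blocks until an event arrives. The -F option relies on this by setting cfg_poll_timeout to -1. The small sketch below demonstrates both cases on a pipe. The helper names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Wait on one fd for `events`; timeout in ms, 0 = don't block, -1 = forever. */
int poll_one(int fd, short events, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = events };

	assert(fd >= 0);
	return poll(&pfd, 1, timeout_ms);
}

/* Show both timeouts on a pipe: 0 returns at once with nothing pending,
 * -1 returns as soon as data is readable. */
int pipe_demo(int *immediate, int *forever)
{
	int p[2];

	if (pipe(p) != 0)
		return -1;
	*immediate = poll_one(p[0], POLLIN, 0);	/* nothing written yet */
	if (write(p[1], "x", 1) != 1) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	*forever = poll_one(p[0], POLLIN, -1);	/* data ready: does not block */
	close(p[0]);
	close(p[1]);
	return 0;
}
```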

Signed-off-by: Vinicius Costa Gomes 
---
 .../selftests/networking/timestamping/txtimestamp.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/networking/timestamping/txtimestamp.c b/tools/testing/selftests/networking/timestamping/txtimestamp.c
index 5df07047ca86..5190b1dd78b1 100644
--- a/tools/testing/selftests/networking/timestamping/txtimestamp.c
+++ b/tools/testing/selftests/networking/timestamping/txtimestamp.c
@@ -68,9 +68,11 @@ static int cfg_num_pkts = 4;
 static int do_ipv4 = 1;
 static int do_ipv6 = 1;
 static int cfg_payload_len = 10;
+static int cfg_poll_timeout = 100;
 static bool cfg_show_payload;
 static bool cfg_do_pktinfo;
 static bool cfg_loop_nodata;
+static bool cfg_no_delay;
 static uint16_t dest_port = 9000;
 
 static struct sockaddr_in daddr;
@@ -171,7 +173,7 @@ static void __poll(int fd)
 
memset(&pollfd, 0, sizeof(pollfd));
pollfd.fd = fd;
-   ret = poll(&pollfd, 1, 100);
+   ret = poll(&pollfd, 1, cfg_poll_timeout);
if (ret != 1)
error(1, errno, "poll");
 }
@@ -371,7 +373,8 @@ static void do_test(int family, unsigned int opt)
error(1, errno, "send");
 
/* wait for all errors to be queued, else ACKs arrive OOO */
-   usleep(50 * 1000);
+   if (!cfg_no_delay)
+   usleep(50 * 1000);
 
__poll(fd);
 
@@ -397,6 +400,9 @@ static void __attribute__((noreturn)) usage(const char *filepath)
"  -n:   set no-payload option\n"
"  -r:   use raw\n"
"  -R:   use raw (IP_HDRINCL)\n"
+   "  -D:   no delay between packets\n"
+   "  -F:   poll() waits forever for an event\n"
+   "  -c N: number of packets for each test\n"
"  -p N: connect to port N\n"
"  -u:   use udp\n"
"  -x:   show payload (up to 70 bytes)\n",
@@ -409,7 +415,7 @@ static void parse_opt(int argc, char **argv)
int proto_count = 0;
char c;
 
-   while ((c = getopt(argc, argv, "46hIl:np:rRux")) != -1) {
+   while ((c = getopt(argc, argv, "46hIl:np:rRuxc:DF")) != -1) {
switch (c) {
case '4':
do_ipv6 = 0;
@@ -447,6 +453,15 @@ static void parse_opt(int argc, char **argv)
case 'x':
cfg_show_payload = true;
break;
+   case 'c':
+   cfg_num_pkts = strtoul(optarg, NULL, 10);
+   break;
+   case 'D':
+   cfg_no_delay = true;
+   break;
+   case 'F':
+   cfg_poll_timeout = -1;
+   break;
case 'h':
default:
usage(argv[0]);
-- 
2.16.2



[PATCH net-next 0/2] skbuff: Fix applications not being woken for errors

2018-03-13 Thread Vinicius Costa Gomes
Hi,

Changes from the RFC:
 - tweaked commit messages;

Original cover letter:

This is actually a "bug report"-RFC instead of the more usual "new
feature"-RFC.

We are developing an application that uses TX hardware timestamping to
make some measurements, and during development Randy Witt initially
reported that the application poll() never unblocked when TX hardware
timestamping was enabled.

After some investigation, it turned out the problem wasn't only
exclusive to hardware timestamping, and could be reproduced with
software timestamping.

Applying patch (1), and running txtimestamp like this, for example:

$ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F

('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
packets for each test, '-D' to remove the delay between packets, '-l
1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
wait forever)

will cause the application to become stuck in the poll() call most of
the time. (Note: I couldn't reproduce the issue running against an
address that is routed through loopback.)

Another interesting fact is that if the POLLIN event is added to the
poll() .events, poll() no longer becomes stuck, and more interestingly
the returned event in .revents is only POLLERR.

After a few debugging sessions, we got to 'sock_queue_err_skb()' and
how it notifies applications of the error just enqueued. Changing it
to use 'sk->sk_error_report()' fixes the issue for hardware and
software timestamping. That is patch (2).

The "solution" proposed in patch (2) looks like too big a hammer. If it
is not, then it seems that this problem has existed for a long time
(since before git) and it was uncommon for folks to reach the necessary
conditions to trigger it (my hypothesis is that it only triggers when the
error is reported from a different task context than the application).

Am I missing something here?


Cheers,
--

Vinicius Costa Gomes (2):
  selftests/txtimestamp: Add more configurable parameters
  skbuff: Fix not waking applications when errors are enqueued

 net/core/skbuff.c   |  2 +-
 .../selftests/networking/timestamping/txtimestamp.c | 21 ++---
 2 files changed, 19 insertions(+), 4 deletions(-)

--
2.16.2


[PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued

2018-03-13 Thread Vinicius Costa Gomes
When errors are enqueued to the error queue via sock_queue_err_skb()
function, it is possible that the waiting application is not notified.

Calling 'sk->sk_data_ready()' would not notify applications that
selected only POLLERR events in poll() (for example).

Reported-by: Randy E. Witt 
Signed-off-by: Vinicius Costa Gomes 
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..6def3534f509 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4181,7 +4181,7 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 
skb_queue_tail(&sk->sk_error_queue, skb);
if (!sock_flag(sk, SOCK_DEAD))
-   sk->sk_data_ready(sk);
+   sk->sk_error_report(sk);
return 0;
 }
 EXPORT_SYMBOL(sock_queue_err_skb);
-- 
2.16.2



Re: [PATCH 12/15] ice: Add stats and ethtool support

2018-03-13 Thread Venkataramanan, Anirudh
On Sat, 2018-03-10 at 08:42 -0800, Stephen Hemminger wrote:
> On Fri,  9 Mar 2018 09:21:33 -0800
> Anirudh Venkataramanan  wrote:
> 
> > +   /* VSI stats */
> > +   struct rtnl_link_stats64 net_stats;
> > +   struct rtnl_link_stats64 net_stats_prev;
> > +   struct ice_eth_stats eth_stats;
> > +   struct ice_eth_stats eth_stats_prev;
> 
> You also don't need current and previous as separate copies since
> previous is only used while computing the current values.

Thanks for the feedback, Stephen.

eth_stats_prev is used in ice_update_eth_stats when we update
eth_stats.
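One common reason a driver keeps a persistent "previous" copy is that hardware counters are free-running and fixed-width, so the reported value is computed against a baseline captured on the first read and must survive between updates. A minimal sketch under that assumption; stat_update() and the 40-bit width are hypothetical and not taken from the ice patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed counter width for this sketch; such counters wrap around. */
#define STAT_BITS 40
#define STAT_WRAP (UINT64_C(1) << STAT_BITS)

/* Hypothetical helper: report a free-running hardware counter relative to
 * the first reading. *prev is the baseline captured on the first call,
 * which is why it has to persist from one update to the next. */
void stat_update(uint64_t new_data, bool prev_loaded,
		 uint64_t *prev, uint64_t *cur)
{
	if (!prev_loaded)
		*prev = new_data;
	if (new_data >= *prev)
		*cur = new_data - *prev;
	else  /* the counter wrapped since the baseline was taken */
		*cur = (new_data + STAT_WRAP) - *prev;
	*cur &= STAT_WRAP - 1;
}
```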

While looking into this though, I found that net_stats_prev field in
struct ice_vsi (and consequently *prev_ns and *prev_es pointers in
ice_update_vsi_stats) may not be needed. Is this what you meant?

Thanks!
Ani




Re: [PATCH v2 iproute2-next 0/6] cm_id, cq, mr, and pd resource tracking

2018-03-13 Thread David Ahern
On 3/13/18 1:32 AM, Leon Romanovsky wrote:
> On Mon, Mar 12, 2018 at 10:53:03AM -0700, David Ahern wrote:
>> On 3/12/18 8:16 AM, Steve Wise wrote:
>>> Hey all,
>>>
>>> The kernel side of this series has been merged for rdma-next [1].  Let me
>>> know if this iproute2 series can be merged, or if it needs more changes.
>>>
>>
>> The problem is that iproute2 headers are synced to kernel headers from
>> DaveM's tree (net-next mainly). I take it this series will not appear in
>> Dave's tree until after a merge through Linus' tree. Correct?
> 
> David,
> 
> Technically, you are right, and we would like to ask you for an extra tweak
> to the flow for the RDMAtool, because the current scheme causes a delay of
> at least one cycle.
> 
> Every RDMAtool patchset that requires changes to headers always
> includes the header patch. Can you please accept those series and, once
> you bring in new net-next headers from Linus, simply overwrite all our
> headers?

I did not follow the discussion back when this decision was made, so how
did rdma tool end up in iproute2? I do not need the overhead of
sometimes syncing the rdma header file and sometimes not.

One option that comes to mind is to move the rdma header file under the
rdma directory. It breaks the uapi model, but it seems that iproute2 is
just a delivery vehicle for this command.


linux-next: build warning after merge of the net-next tree

2018-03-13 Thread Stephen Rothwell
Hi all,

After merging the net-next tree, today's linux-next build (sparc
defconfig) produced this warning:

net/core/pktgen.c: In function 'pktgen_if_write':
net/core/pktgen.c:1710:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
 }
 ^

Introduced by commit

  35951393bbff ("pktgen: Remove VLA usage")

-- 
Cheers,
Stephen Rothwell




Re: Problem with bridge (mcast-to-ucast + hairpin) and Broadcom's 802.11f in their FullMAC fw

2018-03-13 Thread Felix Fietkau
On 2018-02-27 11:08, Rafał Miłecki wrote:
> I have a problem when using OpenWrt/LEDE on a home router with Broadcom's
> FullMAC WiFi chipset.
> 
> 
> First of all OpenWrt/LEDE uses bridge interface for LAN network with:
> 1) IFLA_BRPORT_MCAST_TO_UCAST
> 2) Clients isolation in hostapd
> 3) Hairpin mode enabled
> 
> For more details please see Linus's patch description:
> https://patchwork.kernel.org/patch/9530669/
> and maybe hairpin mode patch:
> https://lwn.net/Articles/347344/
> 
> Short version: in that setup packets received from a bridged wireless
> interface can be handed back to it for transmission.
> 
> 
> Now, Broadcom's firmware for their FullMAC chipsets in AP mode
> supports the obsolete 802.11f (AKA IAPP) standard. It's a roaming
> standard that was replaced by 802.11r.
> 
> Whenever a new station associates, firmware generates a packet like:
> ff ff ff ff  ff ff ec 10  7b 5f ?? ??  00 06 00 01  af 81 01 00
> (just masked 2 bytes of my MAC)
> 
> For more details, see the discussion in my brcmfmac patch thread:
> https://patchwork.kernel.org/patch/10191451/
> 
> 
> The problem is that the bridge (in the setup above) hands such a packet
> back to the device.
> 
> That makes Broadcom's FullMAC firmware believe that a given station
> just connected to another AP in a network (which doesn't even exist).
> As a result firmware immediately disassociates that station. It's
> simply impossible to connect to the router. Every association is
> followed by immediate disassociation.
> 
> 
> Can you see any solution for this problem? Is it an option to stop
> multicast-to-unicast from touching 802.11f packets? Any other ideas?
> Obviously I can't modify Broadcom's firmware to drop that obsolete
> standard.
Let's look at it from a different angle: Since these packets are
forwarded as normal packets by the bridge, and the Broadcom firmware
reacts to them in this nasty way, that's basically a local DoS security
issue. In my opinion, that matters a lot more than having support for an
obsolete feature that almost nobody will ever want to use.

I think the right approach to deal with this issue is to drop these
garbage packets in both the receive and transmit path of brcmfmac.
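Reading the dump against 802.2 LLC framing (our interpretation, not stated in the thread), the frame is a broadcast with the station's MAC as source, an 802.3 length field of 6, and a 6-byte LLC XID PDU (DSAP 0x00, SSAP 0x01, control 0xaf, XID info 81 01 00). A hypothetical classifier that a driver could call on both paths to drop it; the function name is invented and it matches only the bytes visible in the dump:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical check for the 802.11f (IAPP) frame from the dump above:
 *   bytes  0..5   destination: broadcast
 *   bytes  6..11  source: the associating station (not checked)
 *   bytes 12..13  802.3 length field = 6
 *   bytes 14..19  LLC: DSAP 0x00, SSAP 0x01, control 0xaf (XID),
 *                 XID information 0x81 0x01 0x00 */
bool is_iapp_l2_update(const uint8_t *frame, size_t len)
{
	static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
	static const uint8_t tail[8] = {
		0x00, 0x06,                   /* length */
		0x00, 0x01, 0xaf,             /* DSAP, SSAP, XID control */
		0x81, 0x01, 0x00,             /* XID information field */
	};

	if (len < 20)
		return false;
	return memcmp(frame, bcast, sizeof(bcast)) == 0 &&
	       memcmp(frame + 12, tail, sizeof(tail)) == 0;
}
```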

- Felix


Re: aio poll, io_pgetevents and a new in-kernel poll API V5

2018-03-13 Thread Christoph Hellwig
ping?

On Mon, Mar 05, 2018 at 01:27:07PM -0800, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for the IOCB_CMD_POLL operation to poll for the
> readiness of file descriptors using the aio subsystem.  The API is based
> on patches that existed in RHAS2.1 and RHEL3, which means it already is
> supported by libaio.  To implement the poll support efficiently, new
> methods to poll are introduced in struct file_operations:  get_poll_head
> and poll_mask.  The first one returns a wait_queue_head to wait on
> (lifetime is bound by the file), and the second does a non-blocking
> check for the POLL* events.  This allows aio poll to work without
> any additional context switches, unlike epoll.
> 
> To make the interface fully useful, a new io_pgetevents system call is
> added, which atomically saves and restores the signal mask over the
> io_pgetevents system call.  It is the logical equivalent of pselect and
> ppoll for io_pgetevents.
> 
> The corresponding libaio changes for io_pgetevents support and
> documentation, as well as a test case will be posted in a separate
> series.
> 
> The changes were sponsored by Scylladb, and improve performance
> of the seastar framework by up to 10%, while also removing the need
> for a privileged SCHED_FIFO epoll listener thread.
> 
> git://git.infradead.org/users/hch/vfs.git aio-poll.5
> 
> Gitweb:
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.5
> 
> Libaio changes:
> 
> https://pagure.io/libaio.git io-poll
> 
> Seastar changes (not updated for the new io_pgetevents ABI yet):
> 
> https://github.com/avikivity/seastar/commits/aio
> 
> Changes since V4:
>  - rebased on top of Linux 4.16-rc4
> 
> Changes since V3:
>  - remove the pre-sleep ->poll_mask call in vfs_poll,
>    allow ->get_poll_head to return POLL* values.
> 
> Changes since V2:
>  - removed a double initialization
>  - new vfs_get_poll_head helper
>  - document that ->get_poll_head can return NULL
>  - call ->poll_mask before sleeping
>  - various ACKs
>  - add conversion of random to ->poll_mask
>  - add conversion of af_alg to ->poll_mask
>  - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
>  - reshuffled the series so that prep patches and everything not
>    requiring the new in-kernel poll API is in the beginning
> 
> Changes since V1:
>  - handle the NULL ->poll case in vfs_poll
>  - dropped the file argument to the ->poll_mask socket operation
>  - replace the ->pre_poll socket operation with ->get_poll_head as
>    in the file operations
---end quoted text---


Re: WARNING in kmalloc_slab (4)

2018-03-13 Thread Dmitry Vyukov
On Tue, Mar 13, 2018 at 10:51 AM, Steffen Klassert
 wrote:
> On Tue, Mar 13, 2018 at 12:33:02AM -0700, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> f44b1886a5f876c87b5889df463ad7b97834ba37 (Fri Mar 9 18:10:06 2018 +)
>> Merge branch 's390-qeth-next'
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output is attached.
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+6a7e7ed886bde4346...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> WARNING: CPU: 1 PID: 27333 at mm/slab_common.c:1012 kmalloc_slab+0x5d/0x70
>> mm/slab_common.c:1012
>> Kernel panic - not syncing: panic_on_warn set ...
>>
>> syz-executor0: vmalloc: allocation failure: 17045651456 bytes,
>> mode:0x14080c0(GFP_KERNEL|__GFP_ZERO), nodemask=(null)
>> CPU: 1 PID: 27333 Comm: syz-executor2 Not tainted 4.16.0-rc4+ #260
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>>  panic+0x1e4/0x41c kernel/panic.c:183
>> syz-executor0 cpuset=
>>  __warn+0x1dc/0x200 kernel/panic.c:547
>> /
>>  mems_allowed=0
>>  report_bug+0x211/0x2d0 lib/bug.c:184
>>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>>  invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
>> RIP: 0010:kmalloc_slab+0x5d/0x70 mm/slab_common.c:1012
>> RSP: 0018:8801ccfc72f0 EFLAGS: 00010246
>> RAX:  RBX: 1018 RCX: 84ec4fc8
>> RDX: 0ba7 RSI:  RDI: 1018
>> RBP: 8801ccfc72f0 R08:  R09: 1100399f8e21
>> R10: 8801ccfc7040 R11: 0001 R12: 0018
>> R13: 8801ccfc7598 R14: 014080c0 R15: 8801aebaad80
>>  __do_kmalloc mm/slab.c:3700 [inline]
>>  __kmalloc+0x25/0x760 mm/slab.c:3714
>>  kmalloc include/linux/slab.h:517 [inline]
>>  kzalloc include/linux/slab.h:701 [inline]
>>  xfrm_alloc_replay_state_esn net/xfrm/xfrm_user.c:442 [inline]
>
> This is likely fixed with:
>
> commit d97ca5d714a5334aecadadf696875da40f1fbf3e
> xfrm_user: uncoditionally validate esn replay attribute struct
>
> The patch is included in the ipsec pull request for the net
> tree I've sent this morning.

Let's tell syzbot:

#syz fix: xfrm_user: uncoditionally validate esn replay attribute struct
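The trace shows xfrm_alloc_replay_state_esn() asking for a 17045651456-byte allocation; per the fix's title, the ESN replay attribute is now validated unconditionally. The general rule can be sketched in user space: bound a user-supplied element count before using it to size an allocation, on every path. The layout, the 4096-byte cap, and all model_* names are hypothetical; this is not the xfrm_user code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a fixed header followed by bmp_len 32-bit words. */
struct model_replay_esn {
	uint32_t bmp_len;
	uint32_t seq;
	uint32_t bmp[];
};

/* Assumed cap for this sketch: at most 4096 bytes of bitmap. */
#define MODEL_MAX_BMP_WORDS (4096 / sizeof(uint32_t))

size_t model_esn_len(uint32_t bmp_len)
{
	return sizeof(struct model_replay_esn) + (size_t)bmp_len * sizeof(uint32_t);
}

/* Check the attribute before bmp_len is trusted. Returns 0 if valid. */
int model_verify_replay(const void *attr, size_t attr_len)
{
	uint32_t bmp_len;

	if (attr_len < sizeof(struct model_replay_esn))
		return -1;
	memcpy(&bmp_len, attr, sizeof(bmp_len));
	if (bmp_len > MODEL_MAX_BMP_WORDS)
		return -1;
	if (attr_len < model_esn_len(bmp_len))
		return -1;
	return 0;
}

/* Validation runs unconditionally (no flag can skip it), so the size
 * passed to malloc() is always bounded. Returns NULL on invalid input. */
struct model_replay_esn *model_alloc_replay(const void *attr, size_t attr_len)
{
	struct model_replay_esn *rs;
	uint32_t bmp_len;

	if (model_verify_replay(attr, attr_len) != 0)
		return NULL;
	memcpy(&bmp_len, attr, sizeof(bmp_len));
	rs = malloc(model_esn_len(bmp_len));
	if (rs)
		memcpy(rs, attr, model_esn_len(bmp_len));
	return rs;
}
```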


Re: [PATCH v2 net] net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()

2018-03-13 Thread Steffen Klassert
On Wed, Mar 07, 2018 at 02:42:53PM -0800, Greg Hackmann wrote:
> f7c83bcbfaf5 ("net: xfrm: use __this_cpu_read per-cpu helper") added a
> __this_cpu_read() call inside ipcomp_alloc_tfms().
> 
> At the time, __this_cpu_read() required the caller to either not care
> about races or to handle preemption/interrupt issues.  3.15 tightened
> the rules around some per-cpu operations, and now __this_cpu_read()
> should never be used in a preemptible context.  On 3.15 and later, we
> need to use this_cpu_read() instead.
> 

...

> Signed-off-by: Greg Hackmann 

Patch applied, thanks!


Re: BUG_ON triggered in skb_segment

2018-03-13 Thread Yunsheng Lin
Hi, Song

On 2018/3/13 13:45, Yonghong Song wrote:
> Hi,
> 
> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> net-next function skb_segment, line 3667.
> 
> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
> 3473 netdev_features_t features)
> 3474 {
> 3475 struct sk_buff *segs = NULL;
> 3476 struct sk_buff *tail = NULL;
> ...
> 3665 while (pos < offset + len) {
> 3666 if (i >= nfrags) {
> 3667 BUG_ON(skb_headlen(list_skb));
> 3668
> 3669 i = 0;
> 3670 nfrags = skb_shinfo(list_skb)->nr_frags;
> 3671 frag = skb_shinfo(list_skb)->frags;
> 3672 frag_skb = list_skb;
> ...
> 
> call stack:
> ...
> #0 [883ffef034f8] machine_kexec at 81044c41
>  #1 [883ffef03558] __crash_kexec at 8110c525
>  #2 [883ffef03620] crash_kexec at 8110d5cc
>  #3 [883ffef03640] oops_end at 8101d7e7
>  #4 [883ffef03668] die at 8101deb2
>  #5 [883ffef03698] do_trap at 8101a700
>  #6 [883ffef036e8] do_error_trap at 8101abfe
>  #7 [883ffef037a0] do_invalid_op at 8101acd0
>  #8 [883ffef037b0] invalid_op at 81a00bab
> [exception RIP: skb_segment+3044]
> RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
> RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
> RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
> RBP: 883ffef03928   R8: 2ce2   R9: 27da
> R10: 01ea  R11: 2d82  R12: 883f90a1ee80
> R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
> ORIG_RAX:   CS: 0010  SS: 0018
>  #9 [883ffef03930] tcp_gso_segment at 818713e7
> #10 [883ffef03990] tcp4_gso_segment at 818717d8
> #11 [883ffef039b0] inet_gso_segment at 81882c9b
> #12 [883ffef03a10] skb_mac_gso_segment at 817f39b8
> #13 [883ffef03a38] __skb_gso_segment at 817f3ac9
> #14 [883ffef03a68] validate_xmit_skb at 817f3eed
> #15 [883ffef03aa8] validate_xmit_skb_list at 817f40a2
> #16 [883ffef03ad8] sch_direct_xmit at 81824efb
> #17 [883ffef03b20] __qdisc_run at 818251aa
> #18 [883ffef03b90] __dev_queue_xmit at 817f45ed
> #19 [883ffef03c08] dev_queue_xmit at 817f4b90
> #20 [883ffef03c18] __bpf_redirect at 81812b66
> #21 [883ffef03c40] skb_do_redirect at 81813209
> #22 [883ffef03c60] __netif_receive_skb_core at 817f310d
> #23 [883ffef03cc8] __netif_receive_skb at 817f32e8
> #24 [883ffef03ce8] netif_receive_skb_internal at 817f5538
> #25 [883ffef03d10] napi_gro_complete at 817f56c0
> #26 [883ffef03d28] dev_gro_receive at 817f5ea6
> #27 [883ffef03d78] napi_gro_receive at 817f6168
> #28 [883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at 817381c2
> #29 [883ffef03e30] mlx5e_poll_rx_cq at 817386c2
> #30 [883ffef03e80] mlx5e_napi_poll at 8173926e
> #31 [883ffef03ed0] net_rx_action at 817f5a6e
> #32 [883ffef03f48] __softirqentry_text_start at 81c000c3
> #33 [883ffef03fa8] irq_exit at 8108f515
> #34 [883ffef03fb8] do_IRQ at 81a01b11
> ---  ---
> bt: cannot transition from IRQ stack to current process stack:
> IRQ stack pointer: 883ffef034f8
> process stack pointer: 81a01ae9
>current stack base: c9000c5c4000
> ...
> Setup:
> =
> 
> The test will involve three machines:
>   M_ipv6 <-> M_nat <-> M_ipv4
> 
> The M_nat will do ipv4<->ipv6 address translation and then forward packet
> to proper destination. The control plane will configure M_nat properly
> will understand virtual ipv4 address for machine M_ipv6, and
> virtual ipv6 address for machine M_ipv4.
> 
> M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
> The program uses bpf_skb_change_proto to do protocol conversion.
> bpf_skb_change_proto will adjust skb header_len and len properly
> based on protocol change.
> After the conversion, the program will make proper change on
> ethhdr and ip4/6 header, recalculate checksum, and send the packet out
> through bpf_redirect.
> 
> Experiment:
> ===
> 
> MTU: 1500B for all three machines.
> 
> The tso/lro/gro are enabled on the M_nat box.
> 
> ping works on both ways of M_ipv6 <-> M_ipv4.
> It works for transferring a small file (4KB) between M_ipv6 and M_ipv4 (both ways).
> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails with the
> above BUG_ON, really fast.
> Did not really test from M_ipv4 to M_ipv6 with large file.
> 
> The error path likely to be (also from the above call 

Re: [PATCH net-next v2 3/4] ibmvnic: Pad small packets to minimum MTU size

2018-03-13 Thread kbuild test robot
Hi Thomas,

I love your patch! Yet something to improve:

[auto build test ERROR on v4.16-rc4]
[also build test ERROR on next-20180309]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Thomas-Falcon/ibmvnic-Fix-VLAN-and-other-device-errata/20180313-125518
config: powerpc-allmodconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All error/warnings (new ones prefixed by >>):

   drivers/net/ethernet/ibm/ibmvnic.c: In function 'ibmvnic_xmit':
>> drivers/net/ethernet/ibm/ibmvnic.c:1386:36: error: passing argument 2 of 'ibmvnic_xmit_workarounds' from incompatible pointer type [-Werror=incompatible-pointer-types]
 if (ibmvnic_xmit_workarounds(skb, adapter)) {
   ^~~
   drivers/net/ethernet/ibm/ibmvnic.c:1336:12: note: expected 'struct net_device *' but argument is of type 'struct ibmvnic_adapter *'
static int ibmvnic_xmit_workarounds(struct sk_buff *skb,
   ^~~~
   drivers/net/ethernet/ibm/ibmvnic.c: In function 'ibmvnic_xmit_workarounds':
>> drivers/net/ethernet/ibm/ibmvnic.c:1347:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
   cc1: some warnings being treated as errors
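Both diagnostics point at the same helper: the caller passes adapter where a struct net_device * is expected (ibmvnic_xmit() already has netdev in scope), and the function returns nothing when no padding is needed. A user-space sketch of the intended shape; struct pkt and pad_to_min_len() are hypothetical stand-ins for struct sk_buff and skb_put_padto():

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size packet buffer standing in for struct sk_buff. */
struct pkt {
	unsigned char data[128];
	size_t len;
};

/* Zero-pad p to at least min_len bytes. Every path returns a value:
 * 0 when no padding is needed or padding succeeded, -ENOMEM when
 * min_len does not fit in the buffer. */
int pad_to_min_len(struct pkt *p, size_t min_len)
{
	if (p->len >= min_len)
		return 0;
	if (min_len > sizeof(p->data))
		return -ENOMEM;
	memset(p->data + p->len, 0, min_len - p->len);
	p->len = min_len;
	return 0;
}
```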

vim +/ibmvnic_xmit_workarounds +1386 drivers/net/ethernet/ibm/ibmvnic.c

  1335  
  1336  static int ibmvnic_xmit_workarounds(struct sk_buff *skb,
  1337  struct net_device *netdev)
  1338  {
  1339  /* For some backing devices, mishandling of small packets
  1340   * can result in a loss of connection or TX stall. Device
  1341   * architects recommend that no packet should be smaller
  1342   * than the minimum MTU value provided to the driver, so
  1343   * pad any packets to that length
  1344   */
  1345  if (skb->len < netdev->min_mtu)
  1346  return skb_put_padto(skb, netdev->min_mtu);
> 1347  }
  1348  
  1349  static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
  1350  {
  1351  struct ibmvnic_adapter *adapter = netdev_priv(netdev);
  1352  int queue_num = skb_get_queue_mapping(skb);
  1353  u8 *hdrs = (u8 *)&adapter->tx_rx_desc_req;
  1354  struct device *dev = &adapter->vdev->dev;
  1355  struct ibmvnic_tx_buff *tx_buff = NULL;
  1356  struct ibmvnic_sub_crq_queue *tx_scrq;
  1357  struct ibmvnic_tx_pool *tx_pool;
  1358  unsigned int tx_send_failed = 0;
  1359  unsigned int tx_map_failed = 0;
  1360  unsigned int tx_dropped = 0;
  1361  unsigned int tx_packets = 0;
  1362  unsigned int tx_bytes = 0;
  1363  dma_addr_t data_dma_addr;
  1364  struct netdev_queue *txq;
  1365  unsigned long lpar_rc;
  1366  union sub_crq tx_crq;
  1367  unsigned int offset;
  1368  int num_entries = 1;
  1369  unsigned char *dst;
  1370  u64 *handle_array;
  1371  int index = 0;
  1372  u8 proto = 0;
  1373  int ret = 0;
  1374  
  1375  if (adapter->resetting) {
  1376  if (!netif_subqueue_stopped(netdev, skb))
  1377  netif_stop_subqueue(netdev, queue_num);
  1378  dev_kfree_skb_any(skb);
  1379  
  1380  tx_send_failed++;
  1381  tx_dropped++;
  1382  ret = NETDEV_TX_OK;
  1383  goto out;
  1384  }
  1385  
> 1386  if (ibmvnic_xmit_workarounds(skb, adapter)) {
  1387  tx_dropped++;
  1388  tx_send_failed++;
  1389  ret = NETDEV_TX_OK;
  1390  goto out;
  1391  }
  1392  
  1393  tx_pool = &adapter->tx_pool[queue_num];
  1394  tx_scrq = adapter->tx_scrq[queue_num];
  1395  txq = netdev_get_tx_queue(netdev, skb_get_queue_mapping(skb));
  1396  handle_array = (u64 *)((u8 *)(adapter->login_rsp_buf) +
  1397  be32_to_cpu(adapter->login_rsp_buf->off_txsubm_subcrqs));
  1398  
  1399  index = tx_pool->free_map[tx_pool->consumer_index];
  1400  
  1401  if (skb_is_gso(skb)) {
  1402  offset = tx_pool->tso_index * IBMVNIC_TSO_BUF_SZ;
  1403  dst = tx_pool->tso_ltb.buff + offset;
  1404  memset(dst, 0, IBMVNIC_TSO_BUF_SZ);
  1405  data_dma_addr = tx_pool->tso_ltb.addr + offset;
  1406  tx_pool->tso_index++;
  1407 

Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources

2018-03-13 Thread Christoph Hellwig
On Mon, Mar 12, 2018 at 01:17:00PM -0700, Alexander Duyck wrote:
> No, I am aware of those. The problem is they aren't accessed as
> function pointers. As such converting them to static inline functions
> is easy. As I am sure you are aware an "inline" function doesn't
> normally generate a function pointer.

I think Keith's original idea of defining them to NULL is right.  That
takes care of all the current trivial assign to struct cases.

If someone wants to call these functions they'll still need the ifdef
around the call as those won't otherwise compile, but they probably
want the ifdef around the whole caller anyway.
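Keith's approach can be sketched in user space. This is a hypothetical model: CONFIG_PCI_IOV stands in for the kernel config symbol, and the simplified callback signature, the struct, and the model_* names are invented for illustration (the real callback takes a struct pci_dev *):

```c
#include <stddef.h>

/* Simplified, hypothetical driver structure with an SR-IOV callback slot. */
struct model_pci_driver {
	const char *name;
	int (*sriov_configure)(void *pdev, int num_vfs);
};

#ifdef CONFIG_PCI_IOV
int pci_sriov_configure_simple(void *pdev, int num_vfs)
{
	(void)pdev;
	return num_vfs;  /* pretend all requested VFs were enabled */
}
#else
/* Without SR-IOV support the helper is simply NULL: assigning it to a
 * callback slot still compiles, while a direct call would not. */
#define pci_sriov_configure_simple NULL
#endif

static const struct model_pci_driver example_driver = {
	.name = "example_pf",
	.sriov_configure = pci_sriov_configure_simple,
};

/* Core-side caller: a NULL callback means "not supported" (-1 here). */
int model_configure_vfs(const struct model_pci_driver *drv, int num_vfs)
{
	if (!drv->sriov_configure)
		return -1;
	return drv->sriov_configure(NULL, num_vfs);
}

int model_example_configure(int num_vfs)
{
	return model_configure_vfs(&example_driver, num_vfs);
}
```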


Re: [PATCH net-next v2] sctp: fix error return code in sctp_sendmsg_new_asoc()

2018-03-13 Thread Xin Long
On Tue, Mar 13, 2018 at 11:03 AM, Wei Yongjun  wrote:
> Return error code -EINVAL in the address len check error handling
> case since 'err' can be overwritten to 0 by 'err = sctp_verify_addr()'
> in the for loop.
>
> Fixes: 2c0dbaa0c43d ("sctp: add support for SCTP_DSTADDRV4/6 Information for sendmsg")
> Signed-off-by: Wei Yongjun 
> Acked-by: Neil Horman 
> ---
> v1 -> v2: remove the 'err' initialization
> ---
>  net/sctp/socket.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 7d3476a..af5cf29 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1677,7 +1677,7 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
> struct sctp_association *asoc;
> enum sctp_scope scope;
> struct cmsghdr *cmsg;
> -   int err = -EINVAL;
> +   int err;
>
> *tp = NULL;
>
> @@ -1761,16 +1761,20 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
> memset(daddr, 0, sizeof(*daddr));
> dlen = cmsg->cmsg_len - sizeof(struct cmsghdr);
> if (cmsg->cmsg_type == SCTP_DSTADDRV4) {
> -   if (dlen < sizeof(struct in_addr))
> +   if (dlen < sizeof(struct in_addr)) {
> +   err = -EINVAL;
> goto free;
> +   }
>
> dlen = sizeof(struct in_addr);
> daddr->v4.sin_family = AF_INET;
> daddr->v4.sin_port = htons(asoc->peer.port);
> memcpy(&daddr->v4.sin_addr, CMSG_DATA(cmsg), dlen);
> } else {
> -   if (dlen < sizeof(struct in6_addr))
> +   if (dlen < sizeof(struct in6_addr)) {
> +   err = -EINVAL;
> goto free;
> +   }
>
> dlen = sizeof(struct in6_addr);
> daddr->v6.sin6_family = AF_INET6;
>
Reviewed-by: Xin Long 


[PATCH 6/9] xfrm: Fix infinite loop in xfrm_get_dst_nexthop with transport mode.

2018-03-13 Thread Steffen Klassert
In transport mode, we forget to fetch the child dst_entry
before we continue the while loop; this leads to an infinite
loop. Fix this by fetching the child dst_entry before we
continue the while loop.
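The bug pattern is a loop whose `continue` skips the statement that advances the cursor. A hypothetical user-space model of the fixed walk (simplified: it ignores the care-of-address flags, and model_* names are invented):

```c
#include <stddef.h>

enum model_mode { MODEL_TRANSPORT, MODEL_TUNNEL };

/* Hypothetical dst chain: entries with has_xfrm set carry a transform;
 * the chain ends at the first entry without one. */
struct model_dst {
	int has_xfrm;
	enum model_mode mode;
	const char *daddr;
	const struct model_dst *child;
};

/* Return the address of the last tunnel-mode transform in the chain,
 * or daddr if there is none. Advancing to the child *before* any
 * `continue` is what guarantees the loop terminates. */
const char *model_get_nexthop(const struct model_dst *dst, const char *daddr)
{
	while (dst->has_xfrm) {
		const struct model_dst *cur = dst;

		dst = cur->child;  /* advance first, as in the fix */
		if (cur->mode == MODEL_TRANSPORT)
			continue;
		daddr = cur->daddr;
	}
	return daddr;
}
```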

Fixes: 0f6c480f23f4 ("xfrm: Move dst->path into struct xfrm_dst")
Reported-by: syzbot+7d03c810e50aaedef...@syzkaller.appspotmail.com
Tested-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 150d46633ce6..625b3fca5704 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2732,14 +2732,14 @@ static const void *xfrm_get_dst_nexthop(const struct dst_entry *dst,
while (dst->xfrm) {
const struct xfrm_state *xfrm = dst->xfrm;
 
+   dst = xfrm_dst_child(dst);
+
if (xfrm->props.mode == XFRM_MODE_TRANSPORT)
continue;
if (xfrm->type->flags & XFRM_TYPE_REMOTE_COADDR)
daddr = xfrm->coaddr;
else if (!(xfrm->type->flags & XFRM_TYPE_LOCAL_COADDR))
daddr = &xfrm->id.daddr;
-
-   dst = xfrm_dst_child(dst);
}
return daddr;
 }
-- 
2.14.1



[PATCH 4/9] xfrm: reuse uncached_list to track xdsts

2018-03-13 Thread Steffen Klassert
From: Xin Long 

Previously, when an xdst was freed, it was first inserted into
dst_garbage.list. Then, if its refcnt was still held somewhere,
dst_gc_task() later moved it to dst_busy_list.

When a dev was being unregistered, the dev of each dst in
dst_busy_list was set to loopback_dev and the reference on the
original dev was put. This kept the dev's removal from being
blocked and avoided the kmsg warning:

  kernel:unregister_netdevice: waiting for veth0 to become \
  free. Usage count = 2

However, after commit 52df157f17e5 ("xfrm: take refcnt of dst
when creating struct xfrm_dst bundle"), the xdst is no longer
freed by dst gc, and this warning appears.

To fix it, we need to find the xdsts that are still held by
others when the dev is removed, release their reference on the
dev, and set their dev to loopback_dev.

Unfortunately, since the xfrm flow_cache was removed, no list
tracks them anymore, so we need to keep these xdsts somewhere
in order to release their dev later.

To make this easier, this patch reuses uncached_list to track
xdsts, so that the dev refcnt can be released when
fib_netdev_notifier processes the NETDEV_UNREGISTER event.
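The add/del/flush pattern behind this fix can be sketched as a single-threaded user-space model (the kernel keeps one list per CPU under a spinlock; struct layouts and all model_* names are hypothetical). Each tracked entry holds a device reference, and flushing a device repoints its entries at loopback and moves the reference, so the device's count can drop to its final value:

```c
#include <stddef.h>

struct model_dev {
	const char *name;
	int refcnt;
};

struct model_dst {
	struct model_dev *dev;          /* holds a reference on dev */
	struct model_dst *prev, *next;  /* linkage on the uncached list */
};

struct model_uncached_list {
	struct model_dst head;  /* circular list sentinel */
};

void model_list_init(struct model_uncached_list *ul)
{
	ul->head.dev = NULL;
	ul->head.prev = ul->head.next = &ul->head;
}

/* Counterpart of rt_add_uncached_list(): start tracking dst. */
void model_list_add(struct model_uncached_list *ul, struct model_dst *dst)
{
	dst->next = &ul->head;
	dst->prev = ul->head.prev;
	ul->head.prev->next = dst;
	ul->head.prev = dst;
}

/* Counterpart of rt_del_uncached_list(): stop tracking on destroy. */
void model_list_del(struct model_dst *dst)
{
	dst->prev->next = dst->next;
	dst->next->prev = dst->prev;
	dst->prev = dst->next = dst;
}

/* Counterpart of rt_flush_dev(): on unregister, move every tracked dst
 * still using dev over to lo, transferring the reference. */
void model_flush_dev(struct model_uncached_list *ul,
		     struct model_dev *dev, struct model_dev *lo)
{
	for (struct model_dst *d = ul->head.next; d != &ul->head; d = d->next) {
		if (d->dev != dev)
			continue;
		d->dev = lo;
		lo->refcnt++;   /* like dev_hold(lo) */
		dev->refcnt--;  /* like dev_put(dev) */
	}
}
```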

Thanks to Florian, we were able to move this fix forward quickly.

Fixes: 52df157f17e5 ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
Reported-by: Jianlin Shi 
Reported-by: Hangbin Liu 
Tested-by: Eyal Birger 
Signed-off-by: Xin Long 
Signed-off-by: Steffen Klassert 
---
 include/net/ip6_route.h |  3 +++
 include/net/route.h |  3 +++
 net/ipv4/route.c| 21 +
 net/ipv4/xfrm4_policy.c |  4 +++-
 net/ipv6/route.c|  4 ++--
 net/ipv6/xfrm6_policy.c |  5 +
 6 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 27d23a65f3cd..ac0866bb9e93 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -179,6 +179,9 @@ void rt6_disable_ip(struct net_device *dev, unsigned long event);
 void rt6_sync_down_dev(struct net_device *dev, unsigned long event);
 void rt6_multipath_rebalance(struct rt6_info *rt);
 
+void rt6_uncached_list_add(struct rt6_info *rt);
+void rt6_uncached_list_del(struct rt6_info *rt);
+
 static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb)
 {
const struct dst_entry *dst = skb_dst(skb);
diff --git a/include/net/route.h b/include/net/route.h
index 1eb9ce470e25..40b870d58f38 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -227,6 +227,9 @@ struct in_ifaddr;
 void fib_add_ifaddr(struct in_ifaddr *);
 void fib_del_ifaddr(struct in_ifaddr *, struct in_ifaddr *);
 
+void rt_add_uncached_list(struct rtable *rt);
+void rt_del_uncached_list(struct rtable *rt);
+
 static inline void ip_rt_put(struct rtable *rt)
 {
/* dst_release() accepts a NULL parameter.
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 49cc1c1df1ba..1d1e4abe04b0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1383,7 +1383,7 @@ struct uncached_list {
 
 static DEFINE_PER_CPU_ALIGNED(struct uncached_list, rt_uncached_list);
 
-static void rt_add_uncached_list(struct rtable *rt)
+void rt_add_uncached_list(struct rtable *rt)
 {
struct uncached_list *ul = raw_cpu_ptr(&rt_uncached_list);
 
@@ -1394,14 +1394,8 @@ static void rt_add_uncached_list(struct rtable *rt)
spin_unlock_bh(&ul->lock);
 }
 
-static void ipv4_dst_destroy(struct dst_entry *dst)
+void rt_del_uncached_list(struct rtable *rt)
 {
-   struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
-   struct rtable *rt = (struct rtable *) dst;
-
-   if (p != &dst_default_metrics && refcount_dec_and_test(&p->refcnt))
-   kfree(p);
-
if (!list_empty(&rt->rt_uncached)) {
struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1411,6 +1405,17 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
}
 }
 
+static void ipv4_dst_destroy(struct dst_entry *dst)
+{
+   struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
+   struct rtable *rt = (struct rtable *)dst;
+
+   if (p != &dst_default_metrics && refcount_dec_and_test(&p->refcnt))
+   kfree(p);
+
+   rt_del_uncached_list(rt);
+}
+
 void rt_flush_dev(struct net_device *dev)
 {
struct net *net = dev_net(dev);
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 05017e2c849c..8d33f7b311f4 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -102,6 +102,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
xdst->u.rt.rt_pmtu = rt->rt_pmtu;
xdst->u.rt.rt_table_id = rt->rt_table_id;
INIT_LIST_HEAD(&xdst->u.rt.rt_uncached);
+   rt_add_uncached_list(&xdst->u.rt);
 
return 0;
 }
@@ -241,7 +242,8 @@ static void xfrm4_dst_destroy(struct dst_entry *dst)
struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
 
  

pull request (net): ipsec 2018-03-13

2018-03-13 Thread Steffen Klassert
1) Refuse to insert 32 bit userspace socket policies on 64
   bit systems, as we already do for standard policies. We
   don't have a compat layer, so inserting socket policies
   from 32 bit userspace would lead to a broken configuration.

2) Make the policy hold queue work without the flowcache.
   Dummy bundles are not cached anymore, so we need to
   generate a new one on each lookup as long as the SAs
   are not yet in place.

3) Fix the validation of the esn replay attribute. The
   sanity check in verify_replay() is bypassed if
   the XFRM_STATE_ESN flag is not set. Fix this by doing
   the sanity check unconditionally.
   From Florian Westphal.

4) Since most of the dst_entry garbage collection code
   was removed, we may leak xfrm_dst entries as they are
   neither cached nor tracked anywhere. Fix this by
   reusing the 'uncached_list' to track xfrm_dst entries
   too. From Xin Long.

5) Fix an rcu_read_lock/rcu_read_unlock imbalance in
   xfrm_get_tos(). From Xin Long.

6) Fix an infinite loop in xfrm_get_dst_nexthop. In
   transport mode we fetch the child dst_entry after
   we continue, so this pointer is never updated.
   Fix this by fetching it before we continue.

7) Fix ESN sequence number gap after IPsec GSO packets.
   We accidentally increment the sequence number counter
   on the xfrm_state by one packet too many in the ESN
   case. Fix this by setting the sequence number to the
   correct value.

8) Reset the ethernet protocol after decapsulation only if a
   mac header was set. Otherwise it breaks configurations
   with TUN devices. From Yossi Kuperman.

9) Fix __this_cpu_read() usage in preemptible code. Use
   this_cpu_read() instead in ipcomp_alloc_tfms().
   From Greg Hackmann.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 743efac1c670c6618742c923f6275d819604:

  net: pxa168_eth: add netconsole support (2018-02-01 14:58:37 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 0dcd7876029b58770f769cbb7b484e88e4a305e5:

  net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms() (2018-03-13 07:46:37 +0100)


Florian Westphal (1):
  xfrm_user: uncoditionally validate esn replay attribute struct

Greg Hackmann (1):
  net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()

Steffen Klassert (4):
  xfrm: Refuse to insert 32 bit userspace socket policies on 64 bit systems
  xfrm: Fix policy hold queue after flowcache removal.
  xfrm: Fix infinite loop in xfrm_get_dst_nexthop with transport mode.
  xfrm: Fix ESN sequence number handling for IPsec GSO packets.

Xin Long (2):
  xfrm: reuse uncached_list to track xdsts
  xfrm: do not call rcu_read_unlock when afinfo is NULL in xfrm_get_tos

Yossi Kuperman (1):
  xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_proto

 include/net/ip6_route.h  |  3 +++
 include/net/route.h  |  3 +++
 net/ipv4/route.c | 21 +
 net/ipv4/xfrm4_mode_tunnel.c |  3 ++-
 net/ipv4/xfrm4_policy.c  |  4 +++-
 net/ipv6/route.c |  4 ++--
 net/ipv6/xfrm6_mode_tunnel.c |  3 ++-
 net/ipv6/xfrm6_policy.c  |  5 +
 net/xfrm/xfrm_ipcomp.c   |  2 +-
 net/xfrm/xfrm_policy.c   | 13 -
 net/xfrm/xfrm_replay.c   |  2 +-
 net/xfrm/xfrm_state.c|  5 +
 net/xfrm/xfrm_user.c | 21 -
 13 files changed, 56 insertions(+), 33 deletions(-)


Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support

2018-03-13 Thread Christoph Hellwig
On Tue, Mar 13, 2018 at 08:12:52AM +, David Woodhouse wrote:
> I'd also *really* like to see a way to enable this for PFs which don't
> have (and don't need) a driver. We seem to have lost that along the
> way.

We've gone back and forth on that.  I agree that not having any driver
just seems dangerous.  If your PF really does nothing we should just
have a trivial pf_stub driver that does nothing but wiring up
pci_sriov_configure_simple.  We can then add PCI IDs to it either
statically, or using the dynamic ids mechanism.


[PATCH 5/9] xfrm: do not call rcu_read_unlock when afinfo is NULL in xfrm_get_tos

2018-03-13 Thread Steffen Klassert
From: Xin Long 

When xfrm_policy_get_afinfo returns NULL, it will not hold rcu
read lock. In this case, rcu_read_unlock should not be called
in xfrm_get_tos, just like other places where it's calling
xfrm_policy_get_afinfo.

Fixes: f5e2bb4f5b22 ("xfrm: policy: xfrm_get_tos cannot fail")
Signed-off-by: Xin Long 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 8b3811ff002d..150d46633ce6 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1458,10 +1458,13 @@ xfrm_tmpl_resolve(struct xfrm_policy **pols, int npols, const struct flowi *fl,
 static int xfrm_get_tos(const struct flowi *fl, int family)
 {
const struct xfrm_policy_afinfo *afinfo;
-   int tos = 0;
+   int tos;
 
afinfo = xfrm_policy_get_afinfo(family);
-   tos = afinfo ? afinfo->get_tos(fl) : 0;
+   if (!afinfo)
+   return 0;
+
+   tos = afinfo->get_tos(fl);
 
rcu_read_unlock();
 
-- 
2.14.1
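The locking rule behind this fix can be sketched in plain userspace C. This is a hypothetical model, not kernel code: `rcu_depth` stands in for the RCU read-side nesting, and `model_get_afinfo()` mimics what the commit message says about xfrm_policy_get_afinfo(), namely that it holds the lock on return only when it hands back a non-NULL pointer. The old pattern leaves the counter unbalanced on the NULL path; the fixed one does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: rcu_depth stands in for RCU read-side nesting. */
static int rcu_depth;

struct afinfo_model {
	int (*get_tos)(int fl);
};

static int model_tos(int fl)
{
	return fl & 0xfc; /* arbitrary stand-in for afinfo->get_tos() */
}

static const struct afinfo_model inet_model = { model_tos };

/* Like xfrm_policy_get_afinfo(): the "lock" is held on return only on success. */
static const struct afinfo_model *model_get_afinfo(int family)
{
	rcu_depth++;                 /* rcu_read_lock() */
	if (family != 2) {           /* 2 plays the role of AF_INET here */
		rcu_depth--;         /* rcu_read_unlock() before returning NULL */
		return NULL;
	}
	return &inet_model;
}

/* Pattern before the fix: unlocks even when no lock is held. */
static int model_get_tos_old(int fl, int family)
{
	const struct afinfo_model *afinfo = model_get_afinfo(family);
	int tos = afinfo ? afinfo->get_tos(fl) : 0;

	rcu_depth--;                 /* unbalanced when afinfo == NULL */
	return tos;
}

/* Pattern after the fix: return early, unlock only on the success path. */
static int model_get_tos_fixed(int fl, int family)
{
	const struct afinfo_model *afinfo = model_get_afinfo(family);
	int tos;

	if (!afinfo)
		return 0;
	tos = afinfo->get_tos(fl);
	rcu_depth--;                 /* rcu_read_unlock() */
	return tos;
}
```

Calling the old variant with an unknown family drives the counter to -1, which in the kernel corresponds to an rcu_read_unlock() without a matching rcu_read_lock().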



[PATCH 8/9] xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_proto

2018-03-13 Thread Steffen Klassert
From: Yossi Kuperman 

Artem Savkov reported that commit 5efec5c655dd leads to packet loss in an
IPsec configuration. It appears that his setup consists of a TUN device,
which does not have a MAC header.

Make sure a MAC header exists before overwriting its protocol field.

Note: the TUN device sets a MAC header pointer, although it does not have one.

Fixes: 5efec5c655dd ("xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version")
Reported-by: Artem Savkov 
Tested-by: Artem Savkov 
Signed-off-by: Yossi Kuperman 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/xfrm4_mode_tunnel.c | 3 ++-
 net/ipv6/xfrm6_mode_tunnel.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index 63faeee989a9..2a9764bd1719 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -92,7 +92,8 @@ static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 
skb_reset_network_header(skb);
skb_mac_header_rebuild(skb);
-   eth_hdr(skb)->h_proto = skb->protocol;
+   if (skb->mac_len)
+   eth_hdr(skb)->h_proto = skb->protocol;
 
err = 0;
 
diff --git a/net/ipv6/xfrm6_mode_tunnel.c b/net/ipv6/xfrm6_mode_tunnel.c
index bb935a3b7fea..de1b0b8c53b0 100644
--- a/net/ipv6/xfrm6_mode_tunnel.c
+++ b/net/ipv6/xfrm6_mode_tunnel.c
@@ -92,7 +92,8 @@ static int xfrm6_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 
skb_reset_network_header(skb);
skb_mac_header_rebuild(skb);
-   eth_hdr(skb)->h_proto = skb->protocol;
+   if (skb->mac_len)
+   eth_hdr(skb)->h_proto = skb->protocol;
 
err = 0;
 
-- 
2.14.1
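A simplified userspace model of the guard added above. `struct skb_model` and its fields are hypothetical stand-ins for the sk_buff fields involved. In the model, a TUN-style packet has mac_len == 0 and a MAC header offset equal to the start of the IP header, so an unguarded write of h_proto would land 12 bytes into the IP header (where the IPv4 source address starts); with the guard, those bytes are left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_ETH_HLEN 14
#define MODEL_H_PROTO_OFF 12  /* offset of h_proto within struct ethhdr */

/* Hypothetical stand-in for the sk_buff fields involved. */
struct skb_model {
	uint8_t buf[64];
	size_t mac_off;      /* like skb->mac_header */
	size_t mac_len;      /* like skb->mac_len: 0 for a TUN device */
	uint16_t protocol;   /* like skb->protocol (already in wire order) */
};

/* Mirrors the fixed code: only rewrite h_proto when a MAC header exists. */
static void model_set_h_proto(struct skb_model *skb)
{
	if (skb->mac_len)
		memcpy(skb->buf + skb->mac_off + MODEL_H_PROTO_OFF,
		       &skb->protocol, sizeof(skb->protocol));
}
```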



[PATCH 7/9] xfrm: Fix ESN sequence number handling for IPsec GSO packets.

2018-03-13 Thread Steffen Klassert
When IPsec offloading was introduced, we accidentally incremented
the sequence number counter on the xfrm_state by one packet
too much in the ESN case. This leads to a sequence number gap of
one packet after each GSO packet. Fix this by setting the sequence
number to the correct value.

Fixes: d7dbefc45cf5 ("xfrm: Add xfrm_replay_overflow functions for offloading")
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_replay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index 1d38c6acf8af..9e3a5e85f828 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -660,7 +660,7 @@ static int xfrm_replay_overflow_offload_esn(struct xfrm_state *x, struct sk_buff
} else {
XFRM_SKB_CB(skb)->seq.output.low = oseq + 1;
XFRM_SKB_CB(skb)->seq.output.hi = oseq_hi;
-   xo->seq.low = oseq = oseq + 1;
+   xo->seq.low = oseq + 1;
xo->seq.hi = oseq_hi;
oseq += skb_shinfo(skb)->gso_segs;
}
-- 
2.14.1
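The off-by-one can be reproduced with the counter arithmetic alone. This is a hypothetical model of the low 32 bits only, ignoring the branch that handles wraparound into seq.hi; the function names are made up for illustration. `oseq` is the last sequence number already used, so a GSO packet of `gso_segs` segments should consume `oseq + 1` through `oseq + gso_segs`.

```c
#include <assert.h>
#include <stdint.h>

/* Before the fix: oseq is advanced once on assignment, then again by gso_segs. */
static uint32_t model_advance_old(uint32_t oseq, uint32_t gso_segs,
				  uint32_t *first)
{
	*first = oseq = oseq + 1;  /* advances oseq here ... */
	oseq += gso_segs;          /* ... and again by the segment count */
	return oseq;
}

/* After the fix: only the segment count advances oseq. */
static uint32_t model_advance_fixed(uint32_t oseq, uint32_t gso_segs,
				    uint32_t *first)
{
	*first = oseq + 1;         /* sequence number of the first segment */
	oseq += gso_segs;          /* sequence number of the last segment */
	return oseq;
}
```

With four segments following sequence number 100, the fixed version ends at 104 so the next packet starts at 105; the old version ends at 105, so sequence number 105 is never sent.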



[PATCH 9/9] net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()

2018-03-13 Thread Steffen Klassert
From: Greg Hackmann 

f7c83bcbfaf5 ("net: xfrm: use __this_cpu_read per-cpu helper") added a
__this_cpu_read() call inside ipcomp_alloc_tfms().

At the time, __this_cpu_read() required the caller to either not care
about races or to handle preemption/interrupt issues.  3.15 tightened
the rules around some per-cpu operations, and now __this_cpu_read()
should never be used in a preemptible context.  On 3.15 and later, we
need to use this_cpu_read() instead.

syzkaller reported this leading to the following kernel BUG while
fuzzing sendmsg:

BUG: using __this_cpu_read() in preemptible [] code: repro/3101
caller is ipcomp_init_state+0x185/0x990
CPU: 3 PID: 3101 Comm: repro Not tainted 4.16.0-rc4-00123-g86f84779d8e9 #154
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0xb9/0x115
 check_preemption_disabled+0x1cb/0x1f0
 ipcomp_init_state+0x185/0x990
 ? __xfrm_init_state+0x876/0xc20
 ? lock_downgrade+0x5e0/0x5e0
 ipcomp4_init_state+0xaa/0x7c0
 __xfrm_init_state+0x3eb/0xc20
 xfrm_init_state+0x19/0x60
 pfkey_add+0x20df/0x36f0
 ? pfkey_broadcast+0x3dd/0x600
 ? pfkey_sock_destruct+0x340/0x340
 ? pfkey_seq_stop+0x80/0x80
 ? __skb_clone+0x236/0x750
 ? kmem_cache_alloc+0x1f6/0x260
 ? pfkey_sock_destruct+0x340/0x340
 ? pfkey_process+0x62a/0x6f0
 pfkey_process+0x62a/0x6f0
 ? pfkey_send_new_mapping+0x11c0/0x11c0
 ? mutex_lock_io_nested+0x1390/0x1390
 pfkey_sendmsg+0x383/0x750
 ? dump_sp+0x430/0x430
 sock_sendmsg+0xc0/0x100
 ___sys_sendmsg+0x6c8/0x8b0
 ? copy_msghdr_from_user+0x3b0/0x3b0
 ? pagevec_lru_move_fn+0x144/0x1f0
 ? find_held_lock+0x32/0x1c0
 ? do_huge_pmd_anonymous_page+0xc43/0x11e0
 ? lock_downgrade+0x5e0/0x5e0
 ? get_kernel_page+0xb0/0xb0
 ? _raw_spin_unlock+0x29/0x40
 ? do_huge_pmd_anonymous_page+0x400/0x11e0
 ? __handle_mm_fault+0x553/0x2460
 ? __fget_light+0x163/0x1f0
 ? __sys_sendmsg+0xc7/0x170
 __sys_sendmsg+0xc7/0x170
 ? SyS_shutdown+0x1a0/0x1a0
 ? __do_page_fault+0x5a0/0xca0
 ? lock_downgrade+0x5e0/0x5e0
 SyS_sendmsg+0x27/0x40
 ? __sys_sendmsg+0x170/0x170
 do_syscall_64+0x19f/0x640
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f0ee73dfb79
RSP: 002b:7ffe14fc15a8 EFLAGS: 0207 ORIG_RAX: 002e
RAX: ffda RBX:  RCX: 7f0ee73dfb79
RDX:  RSI: 208befc8 RDI: 0004
RBP: 7ffe14fc15b0 R08: 7ffe14fc15c0 R09: 7ffe14fc15c0
R10:  R11: 0207 R12: 00400440
R13: 7ffe14fc16b0 R14:  R15: 

Signed-off-by: Greg Hackmann 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_ipcomp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index ccfdc7115a83..a00ec715aa46 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -283,7 +283,7 @@ static struct crypto_comp * __percpu *ipcomp_alloc_tfms(const char *alg_name)
struct crypto_comp *tfm;
 
/* This can be any valid CPU ID so we don't need locking. */
-   tfm = __this_cpu_read(*pos->tfms);
+   tfm = this_cpu_read(*pos->tfms);
 
if (!strcmp(crypto_comp_name(tfm), alg_name)) {
pos->users++;
-- 
2.14.1



[PATCH 3/9] xfrm_user: uncoditionally validate esn replay attribute struct

2018-03-13 Thread Steffen Klassert
From: Florian Westphal 

The sanity test added in ecd7918745234 can be bypassed: validation
only occurs if the XFRM_STATE_ESN flag is set, but the rest of the code
doesn't care and just checks whether the attribute itself is present.

So always validate.  The alternative is to reject the attribute when the
flag is missing, but that would change the ABI.

Reported-by: syzbot+0ab777c27d2bb7588...@syzkaller.appspotmail.com
Cc: Mathias Krause 
Fixes: ecd7918745234 ("xfrm_user: ensure user supplied esn replay window is valid")
Fixes: d8647b79c3b7e ("xfrm: Add user interface for esn and big anti-replay windows")
Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_user.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 7f52b8eb177d..080035f056d9 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -121,22 +121,17 @@ static inline int verify_replay(struct xfrm_usersa_info *p,
struct nlattr *rt = attrs[XFRMA_REPLAY_ESN_VAL];
struct xfrm_replay_state_esn *rs;
 
-   if (p->flags & XFRM_STATE_ESN) {
-   if (!rt)
-   return -EINVAL;
+   if (!rt)
+   return (p->flags & XFRM_STATE_ESN) ? -EINVAL : 0;
 
-   rs = nla_data(rt);
+   rs = nla_data(rt);
 
-   if (rs->bmp_len > XFRMA_REPLAY_ESN_MAX / sizeof(rs->bmp[0]) / 8)
-   return -EINVAL;
-
-   if (nla_len(rt) < (int)xfrm_replay_state_esn_len(rs) &&
-   nla_len(rt) != sizeof(*rs))
-   return -EINVAL;
-   }
+   if (rs->bmp_len > XFRMA_REPLAY_ESN_MAX / sizeof(rs->bmp[0]) / 8)
+   return -EINVAL;
 
-   if (!rt)
-   return 0;
+   if (nla_len(rt) < (int)xfrm_replay_state_esn_len(rs) &&
+   nla_len(rt) != sizeof(*rs))
+   return -EINVAL;
 
/* As only ESP and AH support ESN feature. */
if ((p->id.proto != IPPROTO_ESP) && (p->id.proto != IPPROTO_AH))
-- 
2.14.1
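The bypass described above can be reproduced with a small userspace model. The `model_` names and `struct esn_attr_model` are hypothetical; the two constants are meant to mirror XFRM_STATE_ESN and XFRMA_REPLAY_ESN_MAX (check include/uapi/linux/xfrm.h before relying on the values), and the nla_len() length check is left out for brevity. An attribute with an oversized bmp_len passes the old check when the flag is clear, but is rejected by the fixed one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_STATE_ESN 128        /* mirrors XFRM_STATE_ESN */
#define MODEL_REPLAY_ESN_MAX 4096  /* mirrors XFRMA_REPLAY_ESN_MAX (bits) */

/* Hypothetical stand-in for struct xfrm_replay_state_esn. */
struct esn_attr_model {
	uint32_t bmp_len;          /* bitmap length in 32-bit words */
};

static int bmp_len_ok(const struct esn_attr_model *rs)
{
	return rs->bmp_len <= MODEL_REPLAY_ESN_MAX / sizeof(uint32_t) / 8;
}

/* Before the fix: the attribute is only checked when the flag is set. */
static int model_verify_replay_old(uint8_t flags, const struct esn_attr_model *rs)
{
	if (flags & MODEL_STATE_ESN) {
		if (!rs || !bmp_len_ok(rs))
			return -EINVAL;
	}
	return 0;
}

/* After the fix: a present attribute is always validated. */
static int model_verify_replay(uint8_t flags, const struct esn_attr_model *rs)
{
	if (!rs)
		return (flags & MODEL_STATE_ESN) ? -EINVAL : 0;
	if (!bmp_len_ok(rs))
		return -EINVAL;
	return 0;
}
```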



[PATCH 1/9] xfrm: Refuse to insert 32 bit userspace socket policies on 64 bit systems

2018-03-13 Thread Steffen Klassert
We don't have a compat layer for xfrm, so userspace and kernel
structures have different sizes in this case. This results in
a broken configuration, so refuse to configure socket policies
when trying to insert from 32 bit userspace, as we already do
for policies inserted via netlink.

Reported-and-tested-by: syzbot+e1a1577ca8bcb47b7...@syzkaller.appspotmail.com
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_state.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 54e21f19d722..f9d2f2233f09 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -2056,6 +2056,11 @@ int xfrm_user_policy(struct sock *sk, int optname, u8 __user *optval, int optlen
struct xfrm_mgr *km;
struct xfrm_policy *pol = NULL;
 
+#ifdef CONFIG_COMPAT
+   if (in_compat_syscall())
+   return -EOPNOTSUPP;
+#endif
+
if (!optval && !optlen) {
xfrm_sk_policy_insert(sk, XFRM_POLICY_IN, NULL);
xfrm_sk_policy_insert(sk, XFRM_POLICY_OUT, NULL);
-- 
2.14.1



[PATCH 2/9] xfrm: Fix policy hold queue after flowcache removal.

2018-03-13 Thread Steffen Klassert
Now that the flowcache is removed we need to generate
a new dummy bundle every time we check if the needed
SAs are in place because the dummy bundle is not cached
anymore. Fix it by passing the XFRM_LOOKUP_QUEUE flag
to xfrm_lookup(). This makes sure that we get a dummy
bundle in case the SAs are not yet in place.

Fixes: 3ca28286ea80 ("xfrm_policy: bypass flow_cache_lookup")
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7a23078132cf..8b3811ff002d 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1891,7 +1891,7 @@ static void xfrm_policy_queue_process(struct timer_list *t)
spin_unlock(>hold_queue.lock);
 
dst_hold(xfrm_dst_path(dst));
-   dst = xfrm_lookup(net, xfrm_dst_path(dst), , sk, 0);
+   dst = xfrm_lookup(net, xfrm_dst_path(dst), , sk, XFRM_LOOKUP_QUEUE);
if (IS_ERR(dst))
goto purge_queue;
 
-- 
2.14.1



Re: WARNING in kmalloc_slab (4)

2018-03-13 Thread Steffen Klassert
On Tue, Mar 13, 2018 at 12:33:02AM -0700, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on net-next commit
> f44b1886a5f876c87b5889df463ad7b97834ba37 (Fri Mar 9 18:10:06 2018 +)
> Merge branch 's390-qeth-next'
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+6a7e7ed886bde4346...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
> 
> WARNING: CPU: 1 PID: 27333 at mm/slab_common.c:1012 kmalloc_slab+0x5d/0x70
> mm/slab_common.c:1012
> Kernel panic - not syncing: panic_on_warn set ...
> 
> syz-executor0: vmalloc: allocation failure: 17045651456 bytes,
> mode:0x14080c0(GFP_KERNEL|__GFP_ZERO), nodemask=(null)
> CPU: 1 PID: 27333 Comm: syz-executor2 Not tainted 4.16.0-rc4+ #260
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
> syz-executor0 cpuset=
>  __warn+0x1dc/0x200 kernel/panic.c:547
> /
>  mems_allowed=0
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
> RIP: 0010:kmalloc_slab+0x5d/0x70 mm/slab_common.c:1012
> RSP: 0018:8801ccfc72f0 EFLAGS: 00010246
> RAX:  RBX: 1018 RCX: 84ec4fc8
> RDX: 0ba7 RSI:  RDI: 1018
> RBP: 8801ccfc72f0 R08:  R09: 1100399f8e21
> R10: 8801ccfc7040 R11: 0001 R12: 0018
> R13: 8801ccfc7598 R14: 014080c0 R15: 8801aebaad80
>  __do_kmalloc mm/slab.c:3700 [inline]
>  __kmalloc+0x25/0x760 mm/slab.c:3714
>  kmalloc include/linux/slab.h:517 [inline]
>  kzalloc include/linux/slab.h:701 [inline]
>  xfrm_alloc_replay_state_esn net/xfrm/xfrm_user.c:442 [inline]

This is likely fixed with:

commit d97ca5d714a5334aecadadf696875da40f1fbf3e
xfrm_user: uncoditionally validate esn replay attribute struct

The patch is included in the ipsec pull request for the net
tree I've sent this morning.


Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support

2018-03-13 Thread David Woodhouse
On Mon, 2018-03-12 at 10:23 -0700, Alexander Duyck wrote:
> 
> -   .sriov_configure = ena_sriov_configure,
> +#ifdef CONFIG_PCI_IOV
> +   .sriov_configure = pci_sriov_configure_simple,
> +#endif
>  };

I'd like to see that ifdef go away, as discussed. I agree that just
#define pci_sriov_configure_simple NULL
should suffice. As Christoph points out, it's not going to compile if
people try to just invoke it directly.
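A userspace sketch of the header pattern being suggested, with hypothetical `model_` names standing in for struct pci_dev, struct pci_driver and pci_sriov_configure_simple(). When the config symbol is not defined, the function name expands to NULL, so the driver initializer needs no #ifdef, while a direct call would not compile because NULL is not callable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_dev and struct pci_driver. */
struct model_pci_dev {
	int num_vfs;
};

struct model_pci_driver {
	const char *name;
	int (*sriov_configure)(struct model_pci_dev *dev, int nr_virtfn);
};

#ifdef MODEL_CONFIG_PCI_IOV
static int model_sriov_configure_simple(struct model_pci_dev *dev, int nr_virtfn)
{
	dev->num_vfs = nr_virtfn;
	return nr_virtfn;
}
#else
/* With SR-IOV compiled out, the name expands to NULL: it can sit in an
 * initializer without an #ifdef, but a direct call such as
 * model_sriov_configure_simple(dev, 1) becomes a compile error. */
#define model_sriov_configure_simple NULL
#endif

/* No #ifdef needed around the initializer. */
static const struct model_pci_driver model_ena_driver = {
	.name = "model_ena",
	.sriov_configure = model_sriov_configure_simple,
};
```

Building with -DMODEL_CONFIG_PCI_IOV instead wires in the real callback.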

I'd also *really* like to see a way to enable this for PFs which don't
have (and don't need) a driver. We seem to have lost that along the
way.




Re: BUG_ON triggered in skb_segment

2018-03-13 Thread Yonghong Song



On 3/12/18 11:04 PM, Eric Dumazet wrote:



On 03/12/2018 10:45 PM, Yonghong Song wrote:

Hi,

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
net-next function skb_segment, line 3667.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = 
skb_shinfo(list_skb)->nr_frags;

3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...

call stack:
...
#0 [883ffef034f8] machine_kexec at 81044c41
  #1 [883ffef03558] __crash_kexec at 8110c525
  #2 [883ffef03620] crash_kexec at 8110d5cc
  #3 [883ffef03640] oops_end at 8101d7e7
  #4 [883ffef03668] die at 8101deb2
  #5 [883ffef03698] do_trap at 8101a700
  #6 [883ffef036e8] do_error_trap at 8101abfe
  #7 [883ffef037a0] do_invalid_op at 8101acd0
  #8 [883ffef037b0] invalid_op at 81a00bab
 [exception RIP: skb_segment+3044]
 RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
 RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
 RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
 RBP: 883ffef03928   R8: 2ce2   R9: 27da
 R10: 01ea  R11: 2d82  R12: 883f90a1ee80
 R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
 ORIG_RAX:   CS: 0010  SS: 0018
  #9 [883ffef03930] tcp_gso_segment at 818713e7
#10 [883ffef03990] tcp4_gso_segment at 818717d8
#11 [883ffef039b0] inet_gso_segment at 81882c9b
#12 [883ffef03a10] skb_mac_gso_segment at 817f39b8
#13 [883ffef03a38] __skb_gso_segment at 817f3ac9
#14 [883ffef03a68] validate_xmit_skb at 817f3eed
#15 [883ffef03aa8] validate_xmit_skb_list at 817f40a2
#16 [883ffef03ad8] sch_direct_xmit at 81824efb
#17 [883ffef03b20] __qdisc_run at 818251aa
#18 [883ffef03b90] __dev_queue_xmit at 817f45ed
#19 [883ffef03c08] dev_queue_xmit at 817f4b90
#20 [883ffef03c18] __bpf_redirect at 81812b66
#21 [883ffef03c40] skb_do_redirect at 81813209
#22 [883ffef03c60] __netif_receive_skb_core at 817f310d
#23 [883ffef03cc8] __netif_receive_skb at 817f32e8
#24 [883ffef03ce8] netif_receive_skb_internal at 817f5538
#25 [883ffef03d10] napi_gro_complete at 817f56c0
#26 [883ffef03d28] dev_gro_receive at 817f5ea6
#27 [883ffef03d78] napi_gro_receive at 817f6168
#28 [883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at 817381c2
#29 [883ffef03e30] mlx5e_poll_rx_cq at 817386c2
#30 [883ffef03e80] mlx5e_napi_poll at 8173926e
#31 [883ffef03ed0] net_rx_action at 817f5a6e
#32 [883ffef03f48] __softirqentry_text_start at 81c000c3
#33 [883ffef03fa8] irq_exit at 8108f515
#34 [883ffef03fb8] do_IRQ at 81a01b11
---  ---
bt: cannot transition from IRQ stack to current process stack:
 IRQ stack pointer: 883ffef034f8
 process stack pointer: 81a01ae9
    current stack base: c9000c5c4000
...
Setup:
=

The test will involve three machines:
   M_ipv6 <-> M_nat <-> M_ipv4

M_nat does ipv4<->ipv6 address translation and then forwards packets
to the proper destination. The control plane configures M_nat so that
it knows a virtual ipv4 address for machine M_ipv6 and a virtual ipv6
address for machine M_ipv4.

M_nat runs a bpf program attached to the clsact (ingress) qdisc.
The program uses bpf_skb_change_proto to do the protocol conversion;
bpf_skb_change_proto adjusts skb header_len and len according to the
protocol change. After the conversion, the program updates the ethhdr
and ip4/6 headers, recalculates the checksum, and sends the packet out
through bpf_redirect.

Experiment:
===

MTU: 1500B for all three machines.

The tso/lro/gro are enabled on the M_nat box.

ping works in both directions between M_ipv6 and M_ipv4.
Transferring a small file (4KB) between M_ipv6 and M_ipv4 also works
(both ways).
Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails
very quickly with the above BUG_ON.

I did not really test transferring a large file from M_ipv4 to M_ipv6.

The error path is likely to be (judging also from the above call stack):
   nic -> lro/gro -> bpf_program -> gso (BUG_ON)

In one of the experiments, I explicitly printed skb->len and
skb->data_len. The values are

Re: Problem with bridge (mcast-to-ucast + hairpin) and Broadcom's 802.11f in their FullMAC fw

2018-03-13 Thread Rafał Miłecki

On 03/13/2018 12:01 AM, Stephen Hemminger wrote:

On Mon, 12 Mar 2018 23:42:48 +0100
Rafał Miłecki  wrote:


2) Blame bridge + mcast-to-ucast + hairpin for 802.11f incompatibility

If we agree that 802.11f support in FullMAC firmware is acceptable, then
we have to make sure Linux's bridge doesn't break it by passing 802.11f
(broadcast) frames back to the source interface. That would require a
check like the one in the diff below, plus proper code for handling such
packets. I'm afraid I'm not familiar enough with the bridge code to complete that.

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index edae702..9e5d6ea 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -126,6 +126,27 @@ static void br_do_proxy_arp(struct sk_buff *skb, struct net_bridge *br,
}
  }

+static bool br_skb_is_iapp_add_packet(struct sk_buff *skb)
+{
+   const u8 iapp_add_packet[6] __aligned(2) = {
+   0x00, 0x01, 0xaf, 0x81, 0x01, 0x00,
+   };
+#if !defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+   const u16 *a = (const u16 *)skb->data;
+   const u16 *b = (const u16 *)iapp_add_packet;
+#endif
+
+   if (skb->len != 6)
+   return false;
+
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+   return !(((*(const u32 *)skb->data) ^ (*(const u32 *)iapp_add_packet)) |
+            ((*(const u16 *)(skb->data + 4)) ^ (*(const u16 *)(iapp_add_packet + 4))));
+#else
+   return !((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2]));
+#endif
+}
+
  /* note: already called with rcu_read_lock */
  int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
@@ -155,6 +176,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
if (is_multicast_ether_addr(dest)) {
/* by definition the broadcast is also a multicast address */
if (is_broadcast_ether_addr(dest)) {
+   if (br_skb_is_iapp_add_packet(skb))
+   pr_warn("This packet should not be passed back to the source interface!\n");
pkt_type = BR_PKT_BROADCAST;
local_rcv = true;
} else {



I don't like the bridge doing special-case code for magic received values directly in
the input path.
It really needs to be generic, which is why I suggested ebtables.


We need an in-bridge solution only if we decide to support FullMAC
firmware with an 802.11f implementation.

In that case, is it possible to use ebtables as a workaround at all?
Can I really use ebtables to make the switch not pass 802.11f ADD frames
back to the original interface?
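For comparison, the same six-byte match from the quoted br_skb_is_iapp_add_packet() can be written portably with memcmp(), which sidesteps the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS split entirely; the XOR form in the diff follows the style of ether_addr_equal(). This is a hypothetical userspace helper in which `data`/`len` stand in for skb->data/skb->len, and it says nothing about where in the bridge such a check should live.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 6-byte payload the quoted diff treats as an 802.11f (IAPP) ADD frame. */
static const uint8_t iapp_add_packet[6] = { 0x00, 0x01, 0xaf, 0x81, 0x01, 0x00 };

/* Hypothetical helper: true only for an exact 6-byte IAPP ADD payload. */
static bool is_iapp_add_payload(const uint8_t *data, size_t len)
{
	return len == sizeof(iapp_add_packet) &&
	       memcmp(data, iapp_add_packet, sizeof(iapp_add_packet)) == 0;
}
```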


[PATCH net 0/2] Fix vlan untag and insertion for bridge and vlan with reorder_hdr off

2018-03-13 Thread Toshiaki Makita
As Brandon Carpenter reported[1], sending non-vlan-offloaded packets from
bridge devices ends up with corrupted packets. He narrowed down this problem
and found that the root cause is in skb_reorder_vlan_header().

While I was working on fixing this problem, I found that the function does
not work properly for double tagged packets with reorder_hdr off as well.

Patch 1 fixes these 2 problems in skb_reorder_vlan_header().

While testing the fix, it also turned out that fixing skb_reorder_vlan_header()
is not sufficient to receive double tagged packets with reorder_hdr off:
vlan tags got out of order when vlan devices with reorder_hdr disabled
were stacked. Patch 2 fixes this problem.

[1] https://www.spinics.net/lists/linux-ethernet-bridging/msg07039.html

Toshiaki Makita (2):
  net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off
  vlan: Fix out of order vlan headers with reorder header off

 include/linux/if_vlan.h   | 66 +++
 include/uapi/linux/if_ether.h |  1 +
 net/8021q/vlan_core.c |  4 +--
 net/core/skbuff.c |  7 +++--
 4 files changed, 63 insertions(+), 15 deletions(-)

-- 
1.8.3.1




[PATCH net 2/2] vlan: Fix out of order vlan headers with reorder header off

2018-03-13 Thread Toshiaki Makita
With reorder header off, received packets are untagged in skb_vlan_untag()
called from within __netif_receive_skb_core(), and later the tag will be
inserted back in vlan_do_receive().

This causes out-of-order vlan headers when we create a vlan device on top
of another vlan device, because vlan_do_receive() inserts a tag as the
outermost vlan tag. E.g. the outer tag is first removed in skb_vlan_untag()
and inserted back in vlan_do_receive(), then the inner tag is next removed
and inserted back as the outermost tag.

This patch fixes the behaviour by inserting the inner tag at the right
position.

Signed-off-by: Toshiaki Makita 
---
 include/linux/if_vlan.h | 66 -
 net/8021q/vlan_core.c   |  4 +--
 2 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 5e6a2d4..c4a1cff 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -300,30 +300,34 @@ static inline bool vlan_hw_offload_capable(netdev_features_t features,
 }
 
 /**
- * __vlan_insert_tag - regular VLAN tag inserting
+ * __vlan_insert_inner_tag - inner VLAN tag inserting
  * @skb: skbuff to tag
  * @vlan_proto: VLAN encapsulation protocol
  * @vlan_tci: VLAN TCI to insert
+ * @mac_len: MAC header length including outer vlan headers
  *
- * Inserts the VLAN tag into @skb as part of the payload
+ * Inserts the VLAN tag into @skb as part of the payload at offset mac_len
  * Returns error if skb_cow_head failes.
  *
  * Does not change skb->protocol so this function can be used during receive.
  */
-static inline int __vlan_insert_tag(struct sk_buff *skb,
-   __be16 vlan_proto, u16 vlan_tci)
+static inline int __vlan_insert_inner_tag(struct sk_buff *skb,
+ __be16 vlan_proto, u16 vlan_tci,
+ unsigned int mac_len)
 {
struct vlan_ethhdr *veth;
 
if (skb_cow_head(skb, VLAN_HLEN) < 0)
return -ENOMEM;
 
-   veth = skb_push(skb, VLAN_HLEN);
+   skb_push(skb, VLAN_HLEN);
 
-   /* Move the mac addresses to the beginning of the new header. */
-   memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN);
+   /* Move the mac header sans proto to the beginning of the new header. */
+   memmove(skb->data, skb->data + VLAN_HLEN, mac_len - ETH_TLEN);
skb->mac_header -= VLAN_HLEN;
 
+   veth = (struct vlan_ethhdr *)(skb->data + mac_len - ETH_HLEN);
+
/* first, the ethernet type */
veth->h_vlan_proto = vlan_proto;
 
@@ -334,12 +338,30 @@ static inline int __vlan_insert_tag(struct sk_buff *skb,
 }
 
 /**
- * vlan_insert_tag - regular VLAN tag inserting
+ * __vlan_insert_tag - regular VLAN tag inserting
  * @skb: skbuff to tag
  * @vlan_proto: VLAN encapsulation protocol
  * @vlan_tci: VLAN TCI to insert
  *
  * Inserts the VLAN tag into @skb as part of the payload
+ * Returns error if skb_cow_head failes.
+ *
+ * Does not change skb->protocol so this function can be used during receive.
+ */
+static inline int __vlan_insert_tag(struct sk_buff *skb,
+   __be16 vlan_proto, u16 vlan_tci)
+{
+   return __vlan_insert_inner_tag(skb, vlan_proto, vlan_tci, ETH_HLEN);
+}
+
+/**
+ * vlan_insert_inner_tag - inner VLAN tag inserting
+ * @skb: skbuff to tag
+ * @vlan_proto: VLAN encapsulation protocol
+ * @vlan_tci: VLAN TCI to insert
+ * @mac_len: MAC header length including outer vlan headers
+ *
+ * Inserts the VLAN tag into @skb as part of the payload at offset mac_len
  * Returns a VLAN tagged skb. If a new skb is created, @skb is freed.
  *
  * Following the skb_unshare() example, in case of error, the calling function
@@ -347,12 +369,14 @@ static inline int __vlan_insert_tag(struct sk_buff *skb,
  *
  * Does not change skb->protocol so this function can be used during receive.
  */
-static inline struct sk_buff *vlan_insert_tag(struct sk_buff *skb,
- __be16 vlan_proto, u16 vlan_tci)
+static inline struct sk_buff *vlan_insert_inner_tag(struct sk_buff *skb,
+   __be16 vlan_proto,
+   u16 vlan_tci,
+   unsigned int mac_len)
 {
int err;
 
-   err = __vlan_insert_tag(skb, vlan_proto, vlan_tci);
+   err = __vlan_insert_inner_tag(skb, vlan_proto, vlan_tci, mac_len);
if (err) {
dev_kfree_skb_any(skb);
return NULL;
@@ -361,6 +385,26 @@ static inline struct sk_buff *vlan_insert_tag(struct sk_buff *skb,
 }
 
 /**
+ * vlan_insert_tag - regular VLAN tag inserting
+ * @skb: skbuff to tag
+ * @vlan_proto: VLAN encapsulation protocol
+ * @vlan_tci: VLAN TCI to insert
+ *
+ * Inserts the VLAN tag into @skb as part of the payload
+ * Returns a VLAN tagged skb. If a new skb 

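A byte-buffer model of what __vlan_insert_inner_tag() in patch 2/2 does, assuming the frame already carries one outer tag (mac_len = 18: two MAC addresses, outer TPID/TCI, EtherType). Everything except the final 2-byte EtherType (ETH_TLEN) slides forward into the 4 bytes of headroom (VLAN_HLEN), and the new tag is written directly in front of the EtherType, i.e. inside the outer tag rather than in front of it. The `model_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_VLAN_HLEN 4  /* bytes in one 802.1Q tag (TPID + TCI) */
#define MODEL_ETH_TLEN  2  /* bytes in the EtherType field */

/*
 * Hypothetical model of __vlan_insert_inner_tag(). 'frame' points at the
 * start of the MAC header and must have MODEL_VLAN_HLEN bytes of headroom
 * in front of it; mac_len includes any outer VLAN tags. Returns the new
 * start of the frame.
 */
static uint8_t *model_insert_inner_tag(uint8_t *frame, uint16_t tpid,
				       uint16_t tci, size_t mac_len)
{
	uint8_t *data = frame - MODEL_VLAN_HLEN;  /* like skb_push() */

	/* Move the MAC header, minus its final EtherType, into the headroom. */
	memmove(data, data + MODEL_VLAN_HLEN, mac_len - MODEL_ETH_TLEN);

	/* Write the new tag just before the (unmoved) EtherType, big-endian. */
	data[mac_len - 2] = (uint8_t)(tpid >> 8);
	data[mac_len - 1] = (uint8_t)(tpid & 0xff);
	data[mac_len]     = (uint8_t)(tci >> 8);
	data[mac_len + 1] = (uint8_t)(tci & 0xff);
	return data;
}
```

Passing mac_len = 14 (ETH_HLEN) reduces this to moving only the two MAC addresses, which matches the old __vlan_insert_tag() behaviour that the patch keeps as a wrapper.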
Re: [PATCH v2] sctp: Fix double free in sctp_sendmsg_to_asoc

2018-03-13 Thread Xin Long
On Tue, Mar 13, 2018 at 2:15 AM, Neil Horman  wrote:
> syzbot/kasan detected a double free in sctp_sendmsg_to_asoc:
> BUG: KASAN: use-after-free in sctp_association_free+0x7b7/0x930
> net/sctp/associola.c:332
> Read of size 8 at addr 8801d8006ae0 by task syzkaller914861/4202
>
> CPU: 1 PID: 4202 Comm: syzkaller914861 Not tainted 4.16.0-rc4+ #258
> Hardware name: Google Google Compute Engine/Google Compute Engine
> 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report+0x23c/0x360 mm/kasan/report.c:412
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>  sctp_association_free+0x7b7/0x930 net/sctp/associola.c:332
>  sctp_sendmsg+0xc67/0x1a80 net/sctp/socket.c:2075
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:639
>  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
>  SyS_sendto+0x40/0x50 net/socket.c:1716
>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> This was introduced by commit:
> f84af33 sctp: factor out sctp_sendmsg_to_asoc from sctp_sendmsg
>
> The newly refactored function moved the wait_for_sndbuf call to a
> point after the association was connected, allowing peeloff events
> to occur, which in turn caused wait_for_sndbuf to return -EPIPE, which
> was not caught by the logic that determines whether an association
> should be freed.
>
> Fix it the easy way by restoring the old ordering of
> sctp_primitive_ASSOCIATE and sctp_wait_for_sndbuf, to
> ensure that EPIPE will not happen.
>
> Tested by myself using the syzbot reproducers with positive results
>
> Signed-off-by: Neil Horman 
> CC: da...@davemloft.net
> CC: Xin Long 
> Reported-by: syzbot+a4e4112c3aff00c8c...@syzkaller.appspotmail.com
>
> ---
> Change notes
> v2)
>  * Moved additional calls to restore original ordering
>  * add sctp prefix
> ---
>  net/sctp/socket.c | 26 +-
>  1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 7d3476a4860d..4bbfcf9532c2 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1876,6 +1876,19 @@ static int sctp_sendmsg_to_asoc(struct sctp_association *asoc,
> goto err;
> }
>
> +   if (asoc->pmtu_pending)
> +   sctp_assoc_pending_pmtu(asoc);
> +
> +   if (sctp_wspace(asoc) < msg_len)
> +   sctp_prsctp_prune(asoc, sinfo, msg_len - sctp_wspace(asoc));
> +
> +   if (!sctp_wspace(asoc)) {
> +   timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> +   err = sctp_wait_for_sndbuf(asoc, , msg_len);
> +   if (err)
> +   goto err;
> +   }
> +
> if (sctp_state(asoc, CLOSED)) {
> err = sctp_primitive_ASSOCIATE(net, asoc, NULL);
> if (err)
> @@ -1893,19 +1906,6 @@ static int sctp_sendmsg_to_asoc(struct sctp_association *asoc,
> pr_debug("%s: we associated primitively\n", __func__);
> }
>
> -   if (asoc->pmtu_pending)
> -   sctp_assoc_pending_pmtu(asoc);
> -
> -   if (sctp_wspace(asoc) < msg_len)
> -   sctp_prsctp_prune(asoc, sinfo, msg_len - sctp_wspace(asoc));
> -
> -   if (!sctp_wspace(asoc)) {
> -   timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> -   err = sctp_wait_for_sndbuf(asoc, , msg_len);
> -   if (err)
> -   goto err;
> -   }
> -
> datamsg = sctp_datamsg_from_user(asoc, sinfo, >msg_iter);
> if (IS_ERR(datamsg)) {
> err = PTR_ERR(datamsg);
> --
> 2.14.3
>
Reviewed-by: Xin Long 


Re: [2/2] net/usb/ax88179_178a: Delete three unnecessary variables in ax88179_chk_eee()

2018-03-13 Thread SF Markus Elfring
>> Use three values directly for a condition check without assigning them
>> to intermediate variables.
> 
> Hi,
> 
> what is the benefit of this?

I proposed a small source code reduction.

Other software design directions might become more interesting for this use 
case.

Regards,
Markus


Re: BUG_ON triggered in skb_segment

2018-03-13 Thread Eric Dumazet



On 03/12/2018 10:45 PM, Yonghong Song wrote:

Hi,

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
net-next function skb_segment, line 3667.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...

call stack:
...
#0 [883ffef034f8] machine_kexec at 81044c41
  #1 [883ffef03558] __crash_kexec at 8110c525
  #2 [883ffef03620] crash_kexec at 8110d5cc
  #3 [883ffef03640] oops_end at 8101d7e7
  #4 [883ffef03668] die at 8101deb2
  #5 [883ffef03698] do_trap at 8101a700
  #6 [883ffef036e8] do_error_trap at 8101abfe
  #7 [883ffef037a0] do_invalid_op at 8101acd0
  #8 [883ffef037b0] invalid_op at 81a00bab
     [exception RIP: skb_segment+3044]
     RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
     RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
     RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
     RBP: 883ffef03928   R8: 2ce2   R9: 27da
     R10: 01ea  R11: 2d82  R12: 883f90a1ee80
     R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
     ORIG_RAX:   CS: 0010  SS: 0018
  #9 [883ffef03930] tcp_gso_segment at 818713e7
#10 [883ffef03990] tcp4_gso_segment at 818717d8
#11 [883ffef039b0] inet_gso_segment at 81882c9b
#12 [883ffef03a10] skb_mac_gso_segment at 817f39b8
#13 [883ffef03a38] __skb_gso_segment at 817f3ac9
#14 [883ffef03a68] validate_xmit_skb at 817f3eed
#15 [883ffef03aa8] validate_xmit_skb_list at 817f40a2
#16 [883ffef03ad8] sch_direct_xmit at 81824efb
#17 [883ffef03b20] __qdisc_run at 818251aa
#18 [883ffef03b90] __dev_queue_xmit at 817f45ed
#19 [883ffef03c08] dev_queue_xmit at 817f4b90
#20 [883ffef03c18] __bpf_redirect at 81812b66
#21 [883ffef03c40] skb_do_redirect at 81813209
#22 [883ffef03c60] __netif_receive_skb_core at 817f310d
#23 [883ffef03cc8] __netif_receive_skb at 817f32e8
#24 [883ffef03ce8] netif_receive_skb_internal at 817f5538
#25 [883ffef03d10] napi_gro_complete at 817f56c0
#26 [883ffef03d28] dev_gro_receive at 817f5ea6
#27 [883ffef03d78] napi_gro_receive at 817f6168
#28 [883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at 817381c2
#29 [883ffef03e30] mlx5e_poll_rx_cq at 817386c2
#30 [883ffef03e80] mlx5e_napi_poll at 8173926e
#31 [883ffef03ed0] net_rx_action at 817f5a6e
#32 [883ffef03f48] __softirqentry_text_start at 81c000c3
#33 [883ffef03fa8] irq_exit at 8108f515
#34 [883ffef03fb8] do_IRQ at 81a01b11
---  ---
bt: cannot transition from IRQ stack to current process stack:
     IRQ stack pointer: 883ffef034f8
     process stack pointer: 81a01ae9
    current stack base: c9000c5c4000
...
Setup:
=

The test will involve three machines:
   M_ipv6 <-> M_nat <-> M_ipv4

The M_nat will do ipv4<->ipv6 address translation and then forward each packet
to the proper destination. The control plane configures M_nat so that it
understands the virtual ipv4 address for machine M_ipv6 and the
virtual ipv6 address for machine M_ipv4.

M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
The program uses bpf_skb_change_proto to do protocol conversion.
bpf_skb_change_proto will adjust skb header_len and len properly
based on the protocol change.
After the conversion, the program makes the proper changes to the
ethhdr and ip4/6 headers, recalculates the checksum, and sends the packet out
through bpf_redirect.

Experiment:
===

MTU: 1500B for all three machines.

The tso/lro/gro are enabled on the M_nat box.

ping works both ways between M_ipv6 and M_ipv4.
Transferring a small file (4KB) between M_ipv6 and M_ipv4 also works
(both ways).
Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 failed
very quickly with the above BUG_ON.

I did not really test a large file transfer from M_ipv4 to M_ipv6.

The error path is likely to be (also from the above call stack):
   nic -> lro/gro -> bpf_program -> gso (BUG_ON)

In one of the experiments, I explicitly printed the skb->len and
skb->data_len. The values are below:

   skb_segment: len 2856, data_len 

Re: BUG_ON triggered in skb_segment

2018-03-13 Thread Eric Dumazet



On 03/12/2018 11:08 PM, Yonghong Song wrote:



On 3/12/18 11:04 PM, Eric Dumazet wrote:



On 03/12/2018 10:45 PM, Yonghong Song wrote:

Hi,

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
net-next function skb_segment, line 3667.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...

call stack:
...
#0 [883ffef034f8] machine_kexec at 81044c41
  #1 [883ffef03558] __crash_kexec at 8110c525
  #2 [883ffef03620] crash_kexec at 8110d5cc
  #3 [883ffef03640] oops_end at 8101d7e7
  #4 [883ffef03668] die at 8101deb2
  #5 [883ffef03698] do_trap at 8101a700
  #6 [883ffef036e8] do_error_trap at 8101abfe
  #7 [883ffef037a0] do_invalid_op at 8101acd0
  #8 [883ffef037b0] invalid_op at 81a00bab
 [exception RIP: skb_segment+3044]
 RIP: 817e4dd4  RSP: 883ffef03860  RFLAGS: 00010216
 RAX: 2bf6  RBX: 883feb7aaa00  RCX: 0011
 RDX: 883fb87910c0  RSI: 0011  RDI: 883feb7ab500
 RBP: 883ffef03928   R8: 2ce2   R9: 27da
 R10: 01ea  R11: 2d82  R12: 883f90a1ee80
 R13: 883fb8791120  R14: 883feb7abc00  R15: 2ce2
 ORIG_RAX:   CS: 0010  SS: 0018
  #9 [883ffef03930] tcp_gso_segment at 818713e7
#10 [883ffef03990] tcp4_gso_segment at 818717d8
#11 [883ffef039b0] inet_gso_segment at 81882c9b
#12 [883ffef03a10] skb_mac_gso_segment at 817f39b8
#13 [883ffef03a38] __skb_gso_segment at 817f3ac9
#14 [883ffef03a68] validate_xmit_skb at 817f3eed
#15 [883ffef03aa8] validate_xmit_skb_list at 817f40a2
#16 [883ffef03ad8] sch_direct_xmit at 81824efb
#17 [883ffef03b20] __qdisc_run at 818251aa
#18 [883ffef03b90] __dev_queue_xmit at 817f45ed
#19 [883ffef03c08] dev_queue_xmit at 817f4b90
#20 [883ffef03c18] __bpf_redirect at 81812b66
#21 [883ffef03c40] skb_do_redirect at 81813209
#22 [883ffef03c60] __netif_receive_skb_core at 817f310d
#23 [883ffef03cc8] __netif_receive_skb at 817f32e8
#24 [883ffef03ce8] netif_receive_skb_internal at 817f5538
#25 [883ffef03d10] napi_gro_complete at 817f56c0
#26 [883ffef03d28] dev_gro_receive at 817f5ea6
#27 [883ffef03d78] napi_gro_receive at 817f6168
#28 [883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at 817381c2
#29 [883ffef03e30] mlx5e_poll_rx_cq at 817386c2
#30 [883ffef03e80] mlx5e_napi_poll at 8173926e
#31 [883ffef03ed0] net_rx_action at 817f5a6e
#32 [883ffef03f48] __softirqentry_text_start at 81c000c3
#33 [883ffef03fa8] irq_exit at 8108f515
#34 [883ffef03fb8] do_IRQ at 81a01b11
---  ---
bt: cannot transition from IRQ stack to current process stack:
 IRQ stack pointer: 883ffef034f8
 process stack pointer: 81a01ae9
    current stack base: c9000c5c4000
...
Setup:
=

The test will involve three machines:
   M_ipv6 <-> M_nat <-> M_ipv4

The M_nat will do ipv4<->ipv6 address translation and then forward each packet
to the proper destination. The control plane configures M_nat so that it
understands the virtual ipv4 address for machine M_ipv6 and the
virtual ipv6 address for machine M_ipv4.

M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
The program uses bpf_skb_change_proto to do protocol conversion.
bpf_skb_change_proto will adjust skb header_len and len properly
based on the protocol change.
After the conversion, the program makes the proper changes to the
ethhdr and ip4/6 headers, recalculates the checksum, and sends the packet out
through bpf_redirect.

Experiment:
===

MTU: 1500B for all three machines.

The tso/lro/gro are enabled on the M_nat box.

ping works both ways between M_ipv6 and M_ipv4.
Transferring a small file (4KB) between M_ipv6 and M_ipv4 also works
(both ways).
Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 failed
very quickly with the above BUG_ON.

I did not really test a large file transfer from M_ipv4 to M_ipv6.

The error path is likely to be (also from the above call stack):
   nic -> lro/gro -> bpf_program -> gso (BUG_ON)

In one of the experiments, I explicitly 

Re: Problem with bridge (mcast-to-ucast + hairpin) and Broadcom's 802.11f in their FullMAC fw

2018-03-13 Thread Felix Fietkau
[resent with fixed typo in linux-wireless address]

On 2018-02-27 11:08, Rafał Miłecki wrote:
> I have a problem when using OpenWrt/LEDE on a home router with Broadcom's
> FullMAC WiFi chipset.
> 
> 
> First of all OpenWrt/LEDE uses bridge interface for LAN network with:
> 1) IFLA_BRPORT_MCAST_TO_UCAST
> 2) Clients isolation in hostapd
> 3) Hairpin mode enabled
> 
> For more details please see Linus's patch description:
> https://patchwork.kernel.org/patch/9530669/
> and maybe hairpin mode patch:
> https://lwn.net/Articles/347344/
> 
> Short version: in that setup, packets received from a bridged wireless
> interface can be handed back to it for transmission.
> 
> 
> Now, Broadcom's firmware for their FullMAC chipsets in AP mode
> supports an obsoleted 802.11f AKA IAPP standard. It's a roaming
> standard that was replaced by 802.11r.
> 
> Whenever a new station associates, firmware generates a packet like:
> ff ff ff ff  ff ff ec 10  7b 5f ?? ??  00 06 00 01  af 81 01 00
> (just masked 2 bytes of my MAC)
> 
> For more details you can see the discussion in my brcmfmac patch thread:
> https://patchwork.kernel.org/patch/10191451/
> 
> 
> The problem is that the bridge (in a setup as above) hands such a packet
> back to the device.
> 
> That makes Broadcom's FullMAC firmware believe that a given station
> just connected to another AP in a network (which doesn't even exist).
> As a result firmware immediately disassociates that station. It's
> simply impossible to connect to the router. Every association is
> followed by immediate disassociation.
> 
> 
> Can you see any solution for this problem? Is that an option to stop
> multicast-to-unicast from touching 802.11f packets? Some other ideas?
> Obviously I can't modify Broadcom's firmware and drop that obsoleted
> standard.
Let's look at it from a different angle: Since these packets are
forwarded as normal packets by the bridge, and the Broadcom firmware
reacts to them in this nasty way, that's basically a local DoS security
issue. In my opinion that matters a lot more than having support for an
obsolete feature that almost nobody will ever want to use.

I think the right approach to deal with this issue is to drop these
garbage packets in both the receive and transmit path of brcmfmac.

- Felix
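For illustration only, here is a userspace sketch of the kind of check Felix suggests, matching the 20-byte frame from Rafał's dump. The function name is hypothetical and this is not brcmfmac code; the byte layout (broadcast destination, the partly masked source MAC, an 802.3 length field of 6, then the LLC bytes 00 01 af 81 01 00) is read off the dump rather than taken from the 802.11f text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not brcmfmac code). Assumed layout, read off the dump:
 *   bytes 0-5    destination MAC, broadcast ff:ff:ff:ff:ff:ff
 *   bytes 6-11   source MAC (partly masked in the dump)
 *   bytes 12-13  802.3 length field, 0x0006 (not an EtherType)
 *   bytes 14-19  LLC payload 00 01 af 81 01 00
 */
static bool is_iapp_l2_update(const uint8_t *frame, size_t len)
{
	static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
	static const uint8_t llc[6] = { 0x00, 0x01, 0xaf, 0x81, 0x01, 0x00 };

	/* Require at least the 20 bytes shown in the dump; any trailing
	 * padding is ignored. */
	if (len < 20)
		return false;
	if (memcmp(frame, bcast, sizeof(bcast)) != 0)
		return false;
	/* Length field of 6: a 6-byte LLC payload follows. */
	if (frame[12] != 0x00 || frame[13] != 0x06)
		return false;
	return memcmp(frame + 14, llc, sizeof(llc)) == 0;
}
```

Following Felix's suggestion, a driver would run such a check on both the receive and the transmit path and drop matching frames.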


Re: BUG_ON triggered in skb_segment

2018-03-13 Thread Steffen Klassert
On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/12/2018 11:08 PM, Yonghong Song wrote:
> > 
> > 
> > On 3/12/18 11:04 PM, Eric Dumazet wrote:
> > > 
> > > 
> > > On 03/12/2018 10:45 PM, Yonghong Song wrote:
> > > > ...
> > > > Setup:
> > > > =
> > > > 
> > > > The test will involve three machines:
> > > >    M_ipv6 <-> M_nat <-> M_ipv4
> > > > 
> > > > The M_nat will do ipv4<->ipv6 address translation and then
> > > > forward each packet to the proper destination. The control plane
> > > > configures M_nat so that it understands the virtual ipv4 address
> > > > for machine M_ipv6 and the virtual ipv6 address for machine M_ipv4.
> > > > 
> > > > M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
> > > > The program uses bpf_skb_change_proto to do protocol conversion.
> > > > bpf_skb_change_proto will adjust skb header_len and len properly
> > > > based on the protocol change.
> > > > After the conversion, the program makes the proper changes to the
> > > > ethhdr and ip4/6 headers, recalculates the checksum, and sends the
> > > > packet out through bpf_redirect.
> > > > 
> > > > Experiment:
> > > > ===
> > > > 
> > > > MTU: 1500B for all three machines.
> > > > 
> > > > The tso/lro/gro are enabled on the M_nat box.
> > > > 
> > > > ping works both ways between M_ipv6 and M_ipv4.
> > > > Transferring a small file (4KB) between M_ipv6 and M_ipv4 also
> > > > works (both ways).
> > > > Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4
> > > > failed very quickly with the above BUG_ON.
> > > > I did not really test a large file transfer from M_ipv4 to M_ipv6.
> > > > 
> > > > The error path is likely to be (also from the above call stack):
> > > >    nic -> lro/gro -> bpf_program -> gso (BUG_ON)

Just out of curiosity, are these packets created with LRO or GRO?
Usually LRO is disabled if forwarding is enabled on a machine,
because segmented LRO packets are likely corrupt.

These packets take an alternative redirect path, so I'm not sure what
happens here.

> > > > 
> > > > In one of the experiments, I explicitly printed the skb->len and
> > > > skb->data_len. The values are below:
> > > >    skb_segment: len 2856, data_len 2686
> > > > They should be equal to avoid BUG.
> > > > 
> > > > In another experiment, I got:
> > > >    skb_segment: len 1428, data_len 1258
> > > > 
> > > > In both cases, the difference is 170 bytes. Not sure whether
> > > > this is just a coincidence or not.
> > > > 
> > > > Workaround:
> > > > ===
> > > > 
> > > > A workaround to avoid the BUG_ON is to disable lro/gro. This way, the
> > > > kernel will not receive big packets and hence gso is not really called.
> > > > 
> > > > I am not familiar with the gso code. Has anybody hit this BUG_ON before?
> > > > Any suggestion on how to debug this?
> > > > 
> > > 
> > > skb_segment() works if incoming GRO packet is not modified in its
> > > geometry.
> > > 
> > > In your case it seems you had to adjust gso_size (calling
> > > skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
> > > skb_segment() badly, because geometry changes, unless you had
> > > specific MTU/MSS restrictions.
> > > 
> > > You will have to make skb_segment() more generic if you really want this.
> > 
> > In net/core/filter.c function bpf_skb_change_proto, which is called
> > in the bpf program, does some GSO adjustment. Could you help check
> > whether it satisfies my above use case or not? Thanks!
> 
> As I said, this helper ends up modifying gso_size by +/- 20 (sizeof(ipv6
> header) - sizeof(ipv4 header))
> 
> So it won't work if skb_segment() is called after this change.

Even HW TSO uses gso_size to segment the packets. Wouldn't this
result in broken packets too, if gso_size is modified on a
forwarding path?

> 
> Not clear why the GRO packet is not sent as is (as a TSO packet) since
> mlx4/mlx5 NICs certainly support TSO.

If the packets are generated with GRO, there could be data chained
at the frag_list pointer. Most NICs can't offload such skbs, so if
skb_segment() can't split at the frag_list pointer, it will just
segment the packets based on gso_size.
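To make Eric's point about geometry concrete: sizeof(ipv6 header) - sizeof(ipv4 header) is 40 - 20 = 20 bytes, so gso_size moves by 20 while the chained buffers of the GRO packet keep their original size. The toy model below is not skb_segment() itself; it assumes each chained buffer holds exactly old_mss bytes of payload and counts how many re-segmentation cut points fall inside a buffer instead of on a buffer boundary. The MSS of 1428 in the usage note is only an illustrative value (1500-byte MTU minus a 40-byte IPv6 header and a 32-byte TCP header with timestamps).

```c
#include <stddef.h>

/* Toy model, not skb_segment(): a GRO packet is assumed to be a chain of
 * nbufs buffers, each holding exactly old_mss bytes of TCP payload. After
 * gso_size becomes new_mss, output segments end at multiples of new_mss.
 * Returns how many of those cut points land inside a chained buffer rather
 * than on a buffer boundary. */
static size_t misaligned_cuts(size_t old_mss, size_t nbufs, size_t new_mss)
{
	size_t total = old_mss * nbufs;
	size_t bad = 0;

	if (old_mss == 0 || new_mss == 0)
		return 0;

	/* The last segment simply ends with the payload, so only interior
	 * cut points are checked. */
	for (size_t end = new_mss; end < total; end += new_mss)
		if (end % old_mss != 0)
			bad++;
	return bad;
}
```

With two buffers of 1428 bytes, misaligned_cuts(1428, 2, 1428) is 0, while misaligned_cuts(1428, 2, 1448) is 1 and misaligned_cuts(1428, 2, 1408) is 2. Any nonzero count means a segment would have to start in the middle of a chained buffer, which is the kind of geometry change Eric says skb_segment() does not cope with.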



[PATCH V2 net 0/1] net/smc: listen socket closing

2018-03-13 Thread Ursula Braun
Hi Dave,

last week you asked for a better solution using generic infrastructure
to fix the closing of a listening SMC socket.
This made me realize that flush_work() is an appropriate way to make
sure the smc_tcp_listen_worker has finished processing (including the
release of the internal clcsock).

Thanks, Ursula

Ursula Braun (1):
  net/smc: simplify wait when closing listen socket

 net/smc/af_smc.c|  4 
 net/smc/smc_close.c | 25 +++--
 2 files changed, 3 insertions(+), 26 deletions(-)

-- 
2.13.5



[PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-13 Thread Ilpo Järvinen
Currently, the TCP code is overly eager to increase the window on
almost every ACK. As a result, ACKs that the receiver should send
as dupACKs look like window updates, and the non-SACK sender-side
code does not consider a window update to be a real dupACK.
Therefore the sender needs to resort to RTO to recover
losses, as fast retransmit/fast recovery cannot be triggered
by such masked duplicate ACKs.

This change makes sure that when an out-of-order (ofo) segment
is received, no change to the window is applied if we are going
to send a dupACK. Even with this change, the window updates keep
being maintained, but that occurs "behind the scenes". That is,
this change does not interfere with the memory management of the
flow, which could have a long-term impact on the progress of the
flow; it only prevents those updates from being seen on the wire
in the short term. It's ok to change the window for non-dupACKs,
such as the first ACK after ofo arrivals start if that ACK was
using delayed ACKs, and also whenever the ack field advances. As
the ack field typically advances once per RTT as the first hole
is retransmitted, the window updates are not delayed entirely
during long recoveries.

Even before this fix, tcp_select_window did not allow an ACK to
shrink the window for duplicate ACKs (that was previously even
called "treason", but the old charming message is gone now).
The advertized window can only shrink when the ack field also
changes, which will not be blocked by this change as such an ACK
is not a duplicate ACK.
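A minimal userspace model of the mechanism described above, assuming a heavily simplified socket: the field names mirror struct tcp_sock, but the freshly computed window is passed in instead of coming from __tcp_select_window(), and window scaling and alignment are ignored. model_ofo_arrival() and model_select_window() are hypothetical names that mirror the tcp_data_queue() and tcp_select_window() hunks of this patch.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_tp {
	uint32_t rcv_wnd;      /* window advertised in the previous ACK */
	bool sack_ok;          /* SACK negotiated for this flow */
	bool dupack_wnd_lock;  /* freeze the window for the next ACK */
};

/* Mirrors the tcp_data_queue() hunk: an out-of-order arrival on a non-SACK
 * flow with no delayed-ACK timer pending locks the window for the dupACK. */
static void model_ofo_arrival(struct model_tp *tp, bool delack_timer_pending)
{
	if (!tp->sack_ok && !delack_timer_pending)
		tp->dupack_wnd_lock = true;
}

/* Mirrors the tcp_select_window() hunk. cur_win is the window still open
 * from the last advertisement, fresh_win the newly computed one. */
static uint32_t model_select_window(struct model_tp *tp, uint32_t cur_win,
				    uint32_t fresh_win)
{
	uint32_t new_win;

	if (tp->dupack_wnd_lock) {
		/* Re-advertise the old window so the ACK is a real dupACK;
		 * the lock is consumed by this one ACK. */
		new_win = tp->rcv_wnd;
		tp->dupack_wnd_lock = false;
	} else {
		new_win = fresh_win;
		/* Never shrink the offered window. */
		if (new_win < cur_win)
			new_win = cur_win;
	}
	tp->rcv_wnd = new_win;
	return new_win;
}
```

In the model, after an ofo arrival the next ACK repeats the previous window even if a larger one is available, and the following ACK advertises the larger window again.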

Signed-off-by: Ilpo Järvinen 
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 -
 net/ipv4/tcp_output.c | 43 +--
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
is_sack_reneg:1,/* in recovery from loss with SACK reneg? */
-   unused:2;
+   dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+   unused:1;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
unused1 : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 270aa48..4ff192b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4626,6 +4626,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   struct inet_connection_sock *icsk = inet_csk(sk);
bool fragstolen;
int eaten;
 
@@ -4669,7 +4670,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 * gap in queue is filled.
 */
if (RB_EMPTY_ROOT(>out_of_order_queue))
-   inet_csk(sk)->icsk_ack.pingpong = 0;
+   icsk->icsk_ack.pingpong = 0;
}
 
if (tp->rx_opt.num_sacks)
@@ -4719,6 +4720,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
goto queue_and_out;
}
 
+   if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+   tp->dupack_wnd_lock = 1;
tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
u32 old_win = tp->rcv_wnd;
-   u32 cur_win = tcp_receive_window(tp);
-   u32 new_win = __tcp_select_window(sk);
-
-   /* Never shrink the offered window */
-   if (new_win < cur_win) {
-   /* Danger Will Robinson!
-* Don't update rcv_wup/rcv_wnd here or else
-* we will not be able to advertise a zero
-* window in time.  --DaveM
-*
-* Relax Will Robinson.
-*/
-   if (new_win == 0)
-   NET_INC_STATS(sock_net(sk),
- LINUX_MIB_TCPWANTZEROWINDOWADV);
-   new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+   u32 cur_win;
+   u32 new_win;
+
+   if (tp->dupack_wnd_lock) {
+   new_win = old_win;
+   tp->dupack_wnd_lock = 0;
+   } else {
+   cur_win = tcp_receive_window(tp);
+   new_win = __tcp_select_window(sk);
+   /* Never shrink the offered window */
+   if (new_win < cur_win) {
+   /* Danger Will Robinson!
+ 

Re: BUG: corrupted list in sctp_association_free

2018-03-13 Thread Xin Long
On Tue, Mar 13, 2018 at 3:34 PM, syzbot
 wrote:
> Hello,
>
> syzbot hit the following crash on net-next commit
> fd372a7a9e5e9d8011a0222d10edd3523abcd3b1 (Thu Mar 8 19:43:48 2018 +)
> Merge tag 'mlx5-updates-2018-02-28-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
>
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+e56a5d45f832ef33a...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> selinux_nlmsg_perm: 1 callbacks suppressed
> SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
> sclass=netlink_route_socket pig=12502 comm=syz-executor3
> SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
> sclass=netlink_route_socket pig=12528 comm=syz-executor3
> list_del corruption, fcc5fb27->next is LIST_POISON1
> (cb16e51d)
> [ cut here ]
> kernel BUG at lib/list_debug.c:47!
> invalid opcode:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 12537 Comm: syz-executor2 Not tainted 4.16.0-rc4+ #258
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__list_del_entry_valid+0xd3/0x150 lib/list_debug.c:45
> RSP: 0018:8801b6387778 EFLAGS: 00010286
> RAX: 004e RBX: dead0200 RCX: 
> RDX: 004e RSI: c90002ed6000 RDI: ed0036c70ee3
> RBP: 8801b6387790 R08: 110036c70e3b R09: 
> R10:  R11:  R12: dead0100
> R13: 8801d3164000 R14: 8801d8502220 R15: 8801b6387c58
> FS:  7ff42042f700() GS:8801db20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7ff42040ddb8 CR3: 0001bd840003 CR4: 001606f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  __list_del_entry include/linux/list.h:117 [inline]
>  list_del include/linux/list.h:125 [inline]
>  sctp_association_free+0x133/0x930 net/sctp/associola.c:341
>  sctp_sendmsg+0xc67/0x1a80 net/sctp/socket.c:2075
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:639
>  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
>  SyS_sendto+0x40/0x50 net/socket.c:1716
>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x453e69
> RSP: 002b:7ff42042ec68 EFLAGS: 0246 ORIG_RAX: 002c
> RAX: ffda RBX: 7ff42042f6d4 RCX: 00453e69
> RDX: 0001 RSI: 2340 RDI: 0015
> RBP: 0072c0c8 R08: 204d9000 R09: 001c
> R10:  R11: 0246 R12: 
> R13: 04cd R14: 006f73d8 R15: 0003
> Code: 8f 00 00 00 49 8b 54 24 08 48 39 f2 75 3b 48 83 c4 08 b8 01 00 00 00
> 5b 41 5c 5d c3 4c 89 e2 48 c7 c7 c0 7c 40 86 e8 75 f6 fb fe <0f> 0b 48 c7 c7
> 20 7d 40 86 e8 67 f6 fb fe 0f 0b 48 c7 c7 80 7d
> RIP: __list_del_entry_valid+0xd3/0x150 lib/list_debug.c:45 RSP:
> 8801b6387778
> ---[ end trace a6b157f61f9bd43a ]---
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
I'd think the patch Neil just posted would fix it.
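For readers of the report: the first line says list_del() found an entry whose next pointer already held LIST_POISON1, the value list_del() writes into an entry it removes, which usually points to the same entry being deleted twice. Below is a userspace sketch of that check with illustrative poison values; the kernel adds POISON_POINTER_DELTA to 0x100/0x200, which is why the dead...0100 and dead...0200 values show up in R12 and RBX above. The model_* names are hypothetical, not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Illustrative poison values: non-NULL addresses that are never valid. */
#define MODEL_POISON1 ((struct list_head *)(uintptr_t)0x100)
#define MODEL_POISON2 ((struct list_head *)(uintptr_t)0x200)

static void model_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void model_list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Like list_del() with list debugging, except that it reports and returns
 * false instead of hitting BUG() when the entry already carries the poison
 * left by a previous delete. */
static bool model_list_del(struct list_head *entry)
{
	if (entry->next == MODEL_POISON1) {
		fprintf(stderr, "list_del corruption, %p->next is POISON1\n",
			(void *)entry);
		return false;
	}
	if (entry->prev == MODEL_POISON2) {
		fprintf(stderr, "list_del corruption, %p->prev is POISON2\n",
			(void *)entry);
		return false;
	}
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = MODEL_POISON1;
	entry->prev = MODEL_POISON2;
	return true;
}
```

Deleting the same entry a second time trips the first check, which matches the message in the report.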


[PATCH] xfrm: fix rcu_read_unlock usage in xfrm_local_error

2018-03-13 Thread Taehee Yoo
In xfrm_local_error(), rcu_read_unlock() should be called only when
afinfo is not NULL, because xfrm_state_get_afinfo() already calls
rcu_read_unlock() itself if afinfo is NULL.
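A userspace model of the locking contract described above, with a counter standing in for the RCU read-side nesting depth. All model_* names are hypothetical; model_get_afinfo() mimics the described behaviour of xfrm_state_get_afinfo() by dropping the lock itself when it returns NULL, so the caller must unlock only on success.

```c
#include <stddef.h>

struct model_afinfo {
	void (*local_error)(int mtu);
};

static int rcu_depth; /* stands in for the RCU read-side nesting count */

static void model_rcu_read_lock(void) { rcu_depth++; }
static void model_rcu_read_unlock(void) { rcu_depth--; }

/* Placeholder handler so the model has something to call. */
static void noop_local_error(int mtu) { (void)mtu; }

/* Take the read lock, and drop it again here if the lookup found nothing. */
static struct model_afinfo *model_get_afinfo(struct model_afinfo *entry)
{
	model_rcu_read_lock();
	if (!entry)
		model_rcu_read_unlock();
	return entry;
}

/* The fixed caller: unlock only when an afinfo was returned. */
static void model_local_error(struct model_afinfo *entry, int mtu)
{
	struct model_afinfo *afinfo = model_get_afinfo(entry);

	if (afinfo) {
		afinfo->local_error(mtu);
		model_rcu_read_unlock();
	}
}
```

With the pre-patch caller, which unlocked unconditionally, the NULL case would drive rcu_depth to -1, i.e. an unbalanced unlock; with the fixed caller it stays at 0 in both cases.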

Signed-off-by: Taehee Yoo 
---
 net/xfrm/xfrm_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 2346867..89b178a7 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -285,8 +285,9 @@ void xfrm_local_error(struct sk_buff *skb, int mtu)
return;
 
afinfo = xfrm_state_get_afinfo(proto);
-   if (afinfo)
+   if (afinfo) {
afinfo->local_error(skb, mtu);
-   rcu_read_unlock();
+   rcu_read_unlock();
+   }
 }
 EXPORT_SYMBOL_GPL(xfrm_local_error);
-- 
2.9.3



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-13 Thread Greg Kroah-Hartman
On Mon, Mar 12, 2018 at 10:22:00AM -0700, Alexei Starovoitov wrote:
> On 3/10/18 7:34 AM, Luis R. Rodriguez wrote:
> > Also,
> > 
> > Alexei, you never answered my questions about aliases with the umh modules.
> > Long term, this is important to consider.
> 
> aliases always felt like a crutch to me.
> I can see an argument when they're used as 'alias pci:* foo'
> but the way it's used in networking with ip_set_* and nf-* is
> something I prefer not to ever do again.
> Definitely no aliases for bpfilter umh.

I agree, let's not do that if at all possible for these types of
binaries.

greg k-h


Re: Problem with bridge (mcast-to-ucast + hairpin) and Broadcom's 802.11f in their FullMAC fw

2018-03-13 Thread Arend van Spriel

On 3/13/2018 8:20 AM, Felix Fietkau wrote:

[resent with fixed typo in linux-wireless address]

On 2018-02-27 11:08, Rafał Miłecki wrote:

I have a problem when using OpenWrt/LEDE on a home router with Broadcom's
FullMAC WiFi chipset.


First of all OpenWrt/LEDE uses bridge interface for LAN network with:
1) IFLA_BRPORT_MCAST_TO_UCAST
2) Clients isolation in hostapd
3) Hairpin mode enabled

For more details please see Linus's patch description:
https://patchwork.kernel.org/patch/9530669/
and maybe hairpin mode patch:
https://lwn.net/Articles/347344/

Short version: in that setup, packets received from a bridged wireless
interface can be handed back to it for transmission.


Now, Broadcom's firmware for their FullMAC chipsets in AP mode
supports an obsoleted 802.11f AKA IAPP standard. It's a roaming
standard that was replaced by 802.11r.

Whenever a new station associates, firmware generates a packet like:
ff ff ff ff  ff ff ec 10  7b 5f ?? ??  00 06 00 01  af 81 01 00
(just masked 2 bytes of my MAC)

For more details you can see the discussion in my brcmfmac patch thread:
https://patchwork.kernel.org/patch/10191451/


The problem is that the bridge (in a setup as above) hands such a packet
back to the device.

That makes Broadcom's FullMAC firmware believe that a given station
just connected to another AP in a network (which doesn't even exist).
As a result firmware immediately disassociates that station. It's
simply impossible to connect to the router. Every association is
followed by immediate disassociation.


Can you see any solution for this problem? Is that an option to stop
multicast-to-unicast from touching 802.11f packets? Some other ideas?
Obviously I can't modify Broadcom's firmware and drop that obsoleted
standard.

Let's look at it from a different angle: Since these packets are
forwarded as normal packets by the bridge, and the Broadcom firmware
reacts to them in this nasty way, that's basically a local DoS security
issue. In my opinion that matters a lot more than having support for an
obsolete feature that almost nobody will ever want to use.

I think the right approach to deal with this issue is to drop these
garbage packets in both the receive and transmit path of brcmfmac.


My approach was to get rid of it in firmware as this never made it into 
the 802.11 spec. So I asked internally whether it was still used. It turns 
out that we still rely on it for some customers. I am fine with dropping 
these "garbage" packets, but given that there is still use for it I 
would like to see that under a Kconfig flag. Dropping it may be the default.


Regards,
Arend



Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address

2018-03-13 Thread kbuild test robot
Hi Song,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20180309]
[also build test WARNING on v4.16-rc5]
[cannot apply to linus/master v4.16-rc4 v4.16-rc3 v4.16-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Song-Liu/bpf-stackmap-with-build_id-offset/20180313-085825


coccinelle warnings: (new ones prefixed by >>)

>> kernel/bpf/stackmap.c:177:2-3: Unneeded semicolon

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH] bpf: fix semicolon.cocci warnings

2018-03-13 Thread kbuild test robot
From: Fengguang Wu 

kernel/bpf/stackmap.c:177:2-3: Unneeded semicolon


 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Fixes: fd5c09555695 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
CC: Song Liu 
Signed-off-by: Fengguang Wu 
---

 stackmap.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -174,7 +174,7 @@ static inline int stack_map_parse_build_
if (new_offs <= note_offs)  /* overflow */
break;
note_offs = new_offs;
-   };
+   }
return -EINVAL;
 }
 


[PATCH v3 net 0/5] tcp: fixes to non-SACK TCP

2018-03-13 Thread Ilpo Järvinen
Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection. These are not fixes to just some
remote corner cases of recovery; many, if not almost all,
recoveries without SACK will be impacted by one (or even more
than one) of them. The sender-side changes (1-4) are not mainly
meant to improve performance (throughput), if at all; they address
correctness (congestion control responses should not get
incorrectly reverted) and burstiness (bursts may cause additional
problems later, as some of the packets in them may get dropped,
again requiring loss recovery that is likely similarly bursty).

v1 -> v2:
- Tried to improve changelogs
- Reworked FRTO undo fix location
- Removed extra parentheses from EXPR (and, while at it, also
  reversed the sides of &&)
- Pass prior_snd_una around rather than a flag to avoid moving
  tcp_packet_delayed call
- Pass tp instead of sk. Sk was there only due to a subsequent
  change (that I think is only net-next material) limiting the
  use of the transient state to only RTO recoveries as it won't
  be needed after NewReno recovery that won't do unnecessary
  rexmits like the non-SACK RTO recovery does

v2 -> v3:
- Remove unnecessarily added braces

tcp: feed correct number of pkts acked to cc
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized


[PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-13 Thread Ilpo Järvinen
When a cumulative ACK lands on high_seq at the end of loss
recovery and SACK is not enabled, the sender needs to avoid
false fast retransmits (RFC6582). The avoidance mechanism is
implemented by remaining in the loss recovery CA state until
one additional cumulative ACK arrives. During the operation of
this avoidance mechanism, there is an internal transient in the
use of state variables which will always trigger a bogus undo.

When we enter this transient state in tcp_try_undo_recovery,
tcp_any_retrans_done is often (always?) false, resulting in
retrans_stamp being cleared. On the next cumulative ACK,
tcp_try_undo_recovery again executes because CA state still
remains in the same recovery state and tcp_may_undo will always
return true because tcp_packet_delayed has this condition:
return !tp->retrans_stamp || ...

Check in tcp_packet_delayed whether the false fast retransmit
avoidance transient is in progress, to avoid bogus undos. Since
snd_una has already advanced on this ACK but the CA state still
remains unchanged (the CA state is updated slightly later than
undo is checked), prior_snd_una needs to be passed to
tcp_packet_delayed (instead of tp->snd_una). Passing prior_snd_una
down to tcp_packet_delayed makes this change look more involved
than it really is.

The additional checks done in this change only affect the non-SACK
case; the SACK case remains the same.

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 42 ++
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 72ecfbb..270aa48 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
+ const u32 prior_snd_una)
 {
-   return !tp->retrans_stamp ||
-  tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+   if (!tp->retrans_stamp) {
+   /* Sender will be in a transient state with cleared
+* retrans_stamp during false fast retransmit prevention
+* mechanism
+*/
+   return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
+   }
+   return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct tcp_sock *tp,
+   const u32 prior_snd_una)
 {
-   return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+   return tp->undo_marker &&
+  (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_may_undo(tp)) {
+   if (tcp_may_undo(tp, prior_snd_una)) {
int mib_idx;
 
/* Happy end! We did not retransmit anything
@@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
+ bool frto_undo)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (frto_undo || tcp_may_undo(tp)) {
+   if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
tcp_undo_cwnd_reduction(sk, true);
 
DBGUNDO(sk, "partial loss");
@@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
 static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
-int *rexmit)
+int *rexmit, const u32 prior_snd_una)
 {
struct tcp_sock *tp = tcp_sk(sk);
bool recovered = !before(tp->snd_una, tp->high_seq);
 
if ((flag & FLAG_SND_UNA_ADVANCED) &&
-   tcp_try_undo_loss(sk, false))
+   tcp_try_undo_loss(sk, prior_snd_una, false))
return;
 
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 * lost, i.e., never-retransmitted data are (s)acked.
 */
 

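A minimal userspace model of the transient described in patch 4/5 may help when reading the hunks above. It assumes a stripped-down, hypothetical stand-in for struct tcp_sock (struct model_tp) and replaces tcp_tsopt_ecr_before() with a plain boolean. Only the decision logic of tcp_packet_delayed() before and after the change is reproduced, together with the tcp_false_fast_retrans_possible() condition from patch 3/5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for struct tcp_sock. */
struct model_tp {
	bool     is_reno;        /* SACK not in use */
	uint32_t high_seq;       /* recovery point */
	uint32_t retrans_stamp;  /* 0 once cleared in the transient */
	bool     ecr_before;     /* stand-in for tcp_tsopt_ecr_before() */
};

/* Same condition as tcp_false_fast_retrans_possible() in patch 3/5. */
static bool false_fr_possible(const struct model_tp *tp, uint32_t snd_una)
{
	return tp->is_reno && snd_una == tp->high_seq;
}

/* Pre-patch: a cleared retrans_stamp always reads as "delayed",
 * which lets tcp_may_undo() succeed (the bogus undo). */
static bool packet_delayed_old(const struct model_tp *tp)
{
	return !tp->retrans_stamp || tp->ecr_before;
}

/* Patched: a cleared retrans_stamp only reads as "delayed" when the
 * sender is not parked in the false fast retransmit transient, judged
 * against snd_una as it was before this ACK (prior_snd_una). */
static bool packet_delayed_new(const struct model_tp *tp,
			       uint32_t prior_snd_una)
{
	if (!tp->retrans_stamp)
		return !false_fr_possible(tp, prior_snd_una);
	return tp->ecr_before;
}
```

With SACK disabled, prior_snd_una == high_seq and a cleared retrans_stamp, the old logic reports the packet as delayed (and so permits the undo) while the patched logic does not. A SACK flow gets the same answer either way.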
[PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-13 Thread Ilpo Järvinen
If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any acknowledged non-retransmitted segment will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it was actually delivered (the scoreboard is not
kept with TCPCB_SACKED_ACKED bits in the non-SACK case, so the check
for that bit does not help as it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use a stricter condition for the non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted,
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing the FRTO undo to proceed in
the non-SACK case.

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a26c09..c60745c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
pkts_acked = rexmit_acked + newdata_acked;
 
tcp_remove_reno_sacks(sk, pkts_acked);
+
+   /* If any of the cumulatively ACKed segments was
+* retransmitted, non-SACK case cannot confirm that
+* progress was due to original transmission due to
+* lack of TCPCB_SACKED_ACKED bits even if some of
+* the packets may have been never retransmitted.
+*/
+   if (flag & FLAG_RETRANS_DATA_ACKED)
+   flag &= ~FLAG_ORIG_SACK_ACKED;
} else {
int delta;
 
-- 
2.7.4
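The effect of the extra check in patch 2/5 can be modelled on a single cumulative ACK. This is a sketch under simplifying assumptions: the flag values and struct model_seg are hypothetical, and only the two flags the patch touches are tracked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real FLAG_* constants are private to
 * net/ipv4/tcp_input.c and are not reproduced here. */
enum {
	M_FLAG_RETRANS_DATA_ACKED = 1 << 0,
	M_FLAG_ORIG_SACK_ACKED    = 1 << 1,
};

/* One cumulatively ACKed segment, as seen by a simplified cleanup loop. */
struct model_seg {
	bool retransmitted;   /* TCPCB_RETRANS was set */
	bool below_high_seq;  /* end_seq is not after high_seq */
};

/* Flags a non-SACK flow would report for one cumulative ACK covering
 * segs[0..n-1], with or without the patch's extra check. */
static int model_clean_rtx_flags(const struct model_seg *segs, int n,
				 bool patched)
{
	int flag = 0;

	for (int i = 0; i < n; i++) {
		if (segs[i].retransmitted)
			flag |= M_FLAG_RETRANS_DATA_ACKED;
		else if (segs[i].below_high_seq)
			flag |= M_FLAG_ORIG_SACK_ACKED;
	}
	/* No TCPCB_SACKED_ACKED scoreboard without SACK: once any covered
	 * segment was retransmitted, we cannot prove the others arrived on
	 * their original transmission, so drop the "original" evidence. */
	if (patched && (flag & M_FLAG_RETRANS_DATA_ACKED))
		flag &= ~M_FLAG_ORIG_SACK_ACKED;
	return flag;
}
```

An ACK that covers the RTO retransmission plus one never-retransmitted segment sets FLAG_ORIG_SACK_ACKED only in the unpatched variant, which is what drives the spurious FRTO undo.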



[PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()

2018-03-13 Thread Ilpo Järvinen
No functional changes. This change simplifies the next change
slightly.

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c60745c..72ecfbb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp,
+   const u32 snd_una)
+{
+   return tcp_is_reno(tp) && (snd_una == tp->high_seq);
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
} else if (tp->rack.reo_wnd_persist) {
tp->rack.reo_wnd_persist--;
}
-   if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-   /* Hold old state until something *above* high_seq
-* is ACKed. For Reno it is MUST to prevent false
-* fast retransmits (RFC2582). SACK TCP is safe. */
+   if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) {
+   /* Hold old state until something *above* high_seq is ACKed
+* if false fast retransmit is possible.
+*/
if (!tcp_any_retrans_done(sk))
tp->retrans_stamp = 0;
return true;
-- 
2.7.4



[PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-13 Thread Ilpo Järvinen
A miscalculation of the number of acknowledged packets occurs during
RTO recovery whenever SACK is not enabled and a cumulative ACK covers
any non-retransmitted skbs. The reason is that the pkts_acked value
calculated in tcp_clean_rtx_queue is not correct for slow start after
RTO, as it may include segments that were not lost and therefore did
not need retransmission in the slow start following the RTO.
tcp_slow_start then adds the excess to cwnd, bloating it and
triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..4a26c09 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
long seq_rtt_us = -1L;
long ca_rtt_us = -1L;
u32 pkts_acked = 0;
+   u32 rexmit_acked = 0;
+   u32 newdata_acked = 0;
u32 last_in_flight = 0;
bool rtt_update;
int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
}
 
if (unlikely(sacked & TCPCB_RETRANS)) {
-   if (sacked & TCPCB_SACKED_RETRANS)
+   if (sacked & TCPCB_SACKED_RETRANS) {
tp->retrans_out -= acked_pcount;
+   rexmit_acked += acked_pcount;
+   }
flag |= FLAG_RETRANS_DATA_ACKED;
} else if (!(sacked & TCPCB_SACKED_ACKED)) {
last_ackt = skb->skb_mstamp;
@@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
reord = start_seq;
if (!after(scb->end_seq, tp->high_seq))
flag |= FLAG_ORIG_SACK_ACKED;
+   else
+   newdata_acked += acked_pcount;
}
 
if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
}
 
if (tcp_is_reno(tp)) {
+   /* Due to discontinuity on RTO in the artificial
+* sacked_out calculations, TCP must restrict
+* pkts_acked without SACK to rexmits and new data
+* segments
+*/
+   if (icsk->icsk_ca_state == TCP_CA_Loss)
+   pkts_acked = rexmit_acked + newdata_acked;
+
tcp_remove_reno_sacks(sk, pkts_acked);
} else {
int delta;
-- 
2.7.4
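Here is a toy calculation of what patch 1/5 changes for a non-SACK flow in TCP_CA_Loss. The skb model is hypothetical and simplified: it ignores SACKed skbs and only distinguishes "still marked retransmitted", "new data above high_seq" and "original segment that was never lost". The example ACK covers one retransmitted segment at the head plus three never-lost originals.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical view of one cumulatively ACKed skb. */
struct model_skb {
	unsigned int pcount;   /* segments in this skb */
	bool sacked_retrans;   /* TCPCB_SACKED_RETRANS set */
	bool above_high_seq;   /* new data sent after the RTO */
};

/* pkts_acked handed to congestion control for one cumulative ACK in
 * TCP_CA_Loss without SACK.  fixed == false counts every covered
 * segment (pre-patch); fixed == true counts only retransmissions and
 * new data, i.e. rexmit_acked + newdata_acked. */
static unsigned int model_pkts_acked(const struct model_skb *skbs, int n,
				     bool fixed)
{
	unsigned int all = 0, rexmit = 0, newdata = 0;

	for (int i = 0; i < n; i++) {
		all += skbs[i].pcount;
		if (skbs[i].sacked_retrans)
			rexmit += skbs[i].pcount;
		else if (skbs[i].above_high_seq)
			newdata += skbs[i].pcount;
	}
	return fixed ? rexmit + newdata : all;
}
```

Slow start grows cwnd by roughly the acked count, so for this ACK the old value would add 4 segments where only 1 was actually recovered by retransmission.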



Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-13 Thread Ilpo Järvinen
On Fri, 9 Mar 2018, David Miller wrote:

> From: Ilpo Järvinen 
> Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET)
> 
> > Unfortunately I don't have now permission to publish the time-seq
> > graph about it but I've tried to improve the changelog messages so
> > that you can better understand under which conditions the problem
> > occurs.
> 
> It is indeed extremely unfortunate that you wish to justify a change
> for which you cannot provide the supporting data at all.

Here is the time-seqno graph of the issue:

https://www.cs.helsinki.fi/u/ijjarvin/linux/nonsackbugs/recovery_undo_bug.pdf

First the correct CC action (wnd reduction) occurs; then a bogus undo 
causes a burst back to the window with which the congestion losses 
occurred earlier; because of the burst, some packets are lost to 
congestion again.

The sender is actually somewhat lucky here: if only one packet had been 
lost instead of three, the same process would have repeated in the next 
recovery (as the cumulative ACK to high_seq condition would recur).


-- 
 i.

Re: linux-next: manual merge of the net-next tree with the net tree

2018-03-13 Thread Petr Machata
Stephen Rothwell  writes:

> Today's linux-next merge of the net-next tree got conflicts in:
>
>   drivers/net/ethernet/mellanox/mlxsw/spectrum.h
>   drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>
> between commit:
>
>   663f1b26f9c1 ("mlxsw: spectrum: Prevent duplicate mirrors")
>
> from the net tree and commit:
>
>   a629ef210d89 ("mlxsw: spectrum: Move SPAN code to separate module")
>
> from the net-next tree.
>
> I fixed it up

Looks good.

Thanks,
Petr


Re: [PATCH iproute2] Revert "iproute: "list/flush/save default" selected all of the routes"

2018-03-13 Thread Alexander Zubkov
Hello again,

The funny thing is that before the commit, "ip route ls all" showed all routes, 
but "ip -[4|6] route ls all" showed only default. So it was broken too, but in 
a different way.
I see the prefix parsing has changed since my patch, so I need several days to 
propose a fix. If "ip route ls [all|any]" shows all routes and "ip route 
ls default" shows only the default route, will everybody be happy with that?

13.03.2018, 09:46, "Alexander Zubkov" :
> Hello.
>
> Maybe the better way would be to change how the "all"/"any" argument behaves? My 
> original concern was about "default" only. I agree too, that "all" or "any" 
> should work for all routes. But not for the default.
>
> 12.03.2018, 22:37, "Luca Boccassi" :
>>  On Mon, 2018-03-12 at 14:03 -0700, Stephen Hemminger wrote:
>>>   This reverts commit 9135c4d6037ff9f1818507bac0049fc44db8c3d2.
>>>
>>>   Debian maintainer found that basic command:
>>>   # ip route flush all
>>>   No longer worked as expected which breaks user scripts and
>>>   expectations. It no longer flushed all IPv4 routes.
>>>
>>>   Reported-by: Luca Boccassi 
>>>   Signed-off-by: Stephen Hemminger 
>>>   ---
>>>    ip/iproute.c | 65 ++--
>>>   
>>>    lib/utils.c  | 13 
>>>    2 files changed, 32 insertions(+), 46 deletions(-)
>>
>>  Tested-by: Luca Boccassi 
>>
>>  Thanks, solves the problem. I'll backport it to Debian.
>>
>>  Alexander, reproducing the issue is quite simple - before that commit,
>>  ip route ls all showed all routes, but with the change it started
>>  showing only the default table. Same for ip route flush.
>>
>>  --
>>  Kind regards,
>>  Luca Boccassi
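The semantics Alexander proposes can be written down as a small predicate. This is a hypothetical sketch, not iproute2 code. It ignores address families, routing tables and the real prefix parser, and it treats a prefix length of 0 as "default".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical selector kinds for "ip route ls/flush <word>". */
enum sel_kind { SEL_ALL, SEL_DEFAULT };

struct model_route {
	unsigned char plen;   /* prefix length; 0 means a default route */
};

/* "all" and "any" select every route; "default" selects only default
 * routes.  Returns false for any other word. */
static bool parse_selector(const char *word, enum sel_kind *out)
{
	if (strcmp(word, "all") == 0 || strcmp(word, "any") == 0) {
		*out = SEL_ALL;
		return true;
	}
	if (strcmp(word, "default") == 0) {
		*out = SEL_DEFAULT;
		return true;
	}
	return false;
}

static bool route_matches(enum sel_kind sel, const struct model_route *rt)
{
	return sel == SEL_ALL || rt->plen == 0;
}
```

Keeping "all"/"any" and "default" as distinct selectors is what lets "ip route flush all" keep flushing everything while "ip route ls default" narrows the output.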


Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support

2018-03-13 Thread David Woodhouse


On Tue, 2018-03-13 at 09:16 +0100, Christoph Hellwig wrote:
> On Tue, Mar 13, 2018 at 08:12:52AM +, David Woodhouse wrote:
> > 
> > I'd also *really* like to see a way to enable this for PFs which don't
> > have (and don't need) a driver. We seem to have lost that along the
> > way.
> We've gone back and forth on that.  I agree that not having any driver
> just seems dangerous.  If your PF really does nothing we should just
> have a trivial pf_stub driver that does nothing but wiring up
> pci_sriov_configure_simple.  We can then add PCI IDs to it either
> statically, or using the dynamic ids mechanism.

Or just add it to the existing pci-stub. What's the point in having a
new driver? 



[PATCH net] qed: Use after free in qed_rdma_free()

2018-03-13 Thread Dan Carpenter
We're dereferencing "p_hwfn->p_rdma_info", but it was freed on the
previous line by qed_rdma_resc_free(p_hwfn).

Fixes: 9de506a547c0 ("qed: Free RoCE ILT Memory on rmmod qedr")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
index f3ee6538b553..a411f9c702a1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
@@ -379,8 +379,8 @@ static void qed_rdma_free(struct qed_hwfn *p_hwfn)
DP_VERBOSE(p_hwfn, QED_MSG_RDMA, "Freeing RDMA\n");
 
qed_rdma_free_reserved_lkey(p_hwfn);
-   qed_rdma_resc_free(p_hwfn);
qed_cxt_free_proto_ilt(p_hwfn, p_hwfn->p_rdma_info->proto);
+   qed_rdma_resc_free(p_hwfn);
 }
 
 static void qed_rdma_get_guid(struct qed_hwfn *p_hwfn, u8 *guid)
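The fix is purely an ordering change: read everything needed from p_rdma_info, then free it. Below is a userspace model with hypothetical stand-ins for the qed structures and helpers. The model sets the pointer to NULL after freeing, which is an assumption and not necessarily what qed_rdma_resc_free() does, so that the buggy order would crash instead of silently reading freed memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for qed_hwfn and its RDMA info block. */
struct model_rdma_info { int proto; };
struct model_hwfn {
	struct model_rdma_info *p_rdma_info;
	int freed_proto;             /* records what the ILT free consumed */
};

/* Plays the role of qed_cxt_free_proto_ilt(): consumes the protocol id. */
static void model_free_proto_ilt(struct model_hwfn *p_hwfn, int proto)
{
	p_hwfn->freed_proto = proto;
}

/* Plays the role of qed_rdma_resc_free(): releases p_rdma_info. */
static void model_resc_free(struct model_hwfn *p_hwfn)
{
	free(p_hwfn->p_rdma_info);
	p_hwfn->p_rdma_info = NULL;  /* model only: makes misuse obvious */
}

/* Fixed ordering: dereference p_rdma_info first, free it last. */
static void model_rdma_free(struct model_hwfn *p_hwfn)
{
	model_free_proto_ilt(p_hwfn, p_hwfn->p_rdma_info->proto);
	model_resc_free(p_hwfn);
}
```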


[PATCH 1/2] udp: Move the udp sysctl to namespace.

2018-03-13 Thread Tonghao Zhang
This patch moves the udp_rmem_min and udp_wmem_min sysctls into the
network namespace and initializes udp_l3mdev_accept explicitly.

Signed-off-by: Tonghao Zhang 
---
 include/net/netns/ipv4.h   |  3 ++
 net/ipv4/sysctl_net_ipv4.c | 32 -
 net/ipv4/udp.c | 86 +++---
 net/ipv6/udp.c | 52 ++--
 4 files changed, 96 insertions(+), 77 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 3a970e4..382bfd7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -168,6 +168,9 @@ struct netns_ipv4 {
atomic_t tfo_active_disable_times;
unsigned long tfo_active_disable_stamp;
 
+   int sysctl_udp_wmem_min;
+   int sysctl_udp_rmem_min;
+
 #ifdef CONFIG_NET_L3_MASTER_DEV
int sysctl_udp_l3mdev_accept;
 #endif
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 011de9a..5b72d97 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -520,22 +520,6 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
.mode   = 0644,
.proc_handler   = proc_doulongvec_minmax,
},
-   {
-   .procname   = "udp_rmem_min",
-   .data   = &sysctl_udp_rmem_min,
-   .maxlen = sizeof(sysctl_udp_rmem_min),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_minmax,
-   .extra1 = 
-   },
-   {
-   .procname   = "udp_wmem_min",
-   .data   = &sysctl_udp_wmem_min,
-   .maxlen = sizeof(sysctl_udp_wmem_min),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_minmax,
-   .extra1 = 
-   },
{ }
 };
 
@@ -1167,6 +1151,22 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
.proc_handler   = proc_dointvec_minmax,
.extra1 = ,
},
+   {
+   .procname   = "udp_rmem_min",
+   .data   = &init_net.ipv4.sysctl_udp_rmem_min,
+   .maxlen = sizeof(init_net.ipv4.sysctl_udp_rmem_min),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = 
+   },
+   {
+   .procname   = "udp_wmem_min",
+   .data   = &init_net.ipv4.sysctl_udp_wmem_min,
+   .maxlen = sizeof(init_net.ipv4.sysctl_udp_wmem_min),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = 
+   },
{ }
 };
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404..7ae77f2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -122,12 +122,6 @@
 long sysctl_udp_mem[3] __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_mem);
 
-int sysctl_udp_rmem_min __read_mostly;
-EXPORT_SYMBOL(sysctl_udp_rmem_min);
-
-int sysctl_udp_wmem_min __read_mostly;
-EXPORT_SYMBOL(sysctl_udp_wmem_min);
-
 atomic_long_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
@@ -2533,35 +2527,35 @@ int udp_abort(struct sock *sk, int err)
 EXPORT_SYMBOL_GPL(udp_abort);
 
 struct proto udp_prot = {
-   .name  = "UDP",
-   .owner = THIS_MODULE,
-   .close = udp_lib_close,
-   .connect   = ip4_datagram_connect,
-   .disconnect= udp_disconnect,
-   .ioctl = udp_ioctl,
-   .init  = udp_init_sock,
-   .destroy   = udp_destroy_sock,
-   .setsockopt= udp_setsockopt,
-   .getsockopt= udp_getsockopt,
-   .sendmsg   = udp_sendmsg,
-   .recvmsg   = udp_recvmsg,
-   .sendpage  = udp_sendpage,
-   .release_cb= ip4_datagram_release_cb,
-   .hash  = udp_lib_hash,
-   .unhash= udp_lib_unhash,
-   .rehash= udp_v4_rehash,
-   .get_port  = udp_v4_get_port,
-   .memory_allocated  = &udp_memory_allocated,
-   .sysctl_mem= sysctl_udp_mem,
-   .sysctl_wmem   = &sysctl_udp_wmem_min,
-   .sysctl_rmem   = &sysctl_udp_rmem_min,
-   .obj_size  = sizeof(struct udp_sock),
-   .h.udp_table   = &udp_table,
+   .name   = "UDP",
+   .owner  = THIS_MODULE,
+   .close  = udp_lib_close,
+   .connect= ip4_datagram_connect,
+   .disconnect = udp_disconnect,
+   .ioctl  = udp_ioctl,
+   .init   = udp_init_sock,
+   .destroy= udp_destroy_sock,
+   .setsockopt = udp_setsockopt,
+   .getsockopt = udp_getsockopt,
+   .sendmsg= udp_sendmsg,
+   
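In the sysctl hunk above, the new per-net entries point .data into init_net, as the sizeof() in .maxlen shows. My understanding, which this patch does not state, is that ipv4_sysctl_init_net() duplicates the table for every other namespace and rebases each .data pointer by that namespace's distance from init_net. The sketch below models that rebasing in portable C under that assumption. The structures are hypothetical, the 4096 default is taken from the SK_MEM_QUANTUM note in the follow-up doc patch, and the offset is computed within init_net so the pointer arithmetic stays well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-namespace sysctl storage. */
struct model_net {
	int udp_rmem_min;
	int udp_wmem_min;
};

struct model_ctl {
	const char *procname;
	void *data;
};

static struct model_net init_net = { 4096, 4096 };

/* Template table: .data points into init_net. */
static const struct model_ctl net_table[] = {
	{ "udp_rmem_min", &init_net.udp_rmem_min },
	{ "udp_wmem_min", &init_net.udp_wmem_min },
};

/* Copy the template for another namespace and rebase each .data so it
 * points at the same field inside *net.  Caller frees the result. */
static struct model_ctl *model_table_for(struct model_net *net)
{
	size_t n = sizeof(net_table) / sizeof(net_table[0]);
	struct model_ctl *t = malloc(sizeof(net_table));

	if (!t)
		return NULL;
	memcpy(t, net_table, sizeof(net_table));
	for (size_t i = 0; i < n; i++) {
		/* Field offset inside init_net (same-object subtraction). */
		ptrdiff_t off = (char *)net_table[i].data - (char *)&init_net;
		t[i].data = (char *)net + off;
	}
	return t;
}
```

A write through the rebased table changes only the owning namespace's value, which is the point of moving these sysctls out of globals.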

[PATCH net-next 3/4] net: Convert tipc_net_ops

2018-03-13 Thread Kirill Tkhai
TIPC looks self-contained, and other pernet_operations do not
seem to touch its entities.

tipc_net_ops look per-net divided, so they should be safe to
execute in parallel for several nets at the same time.

Signed-off-by: Kirill Tkhai 
---
 net/tipc/core.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 0b982d048fb9..04fd91bb11d7 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -105,6 +105,7 @@ static struct pernet_operations tipc_net_ops = {
.exit = tipc_exit_net,
.id   = &tipc_net_id,
.size = sizeof(struct tipc_net),
+   .async = true,
 };
 
 static int __init tipc_init(void)



[PATCH net-next 4/4] net: Convert rds_tcp_net_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations create and destroy a sysctl table
and a listen socket. The exit method also flushes a global
workqueue and work. Everything looks per-net safe,
so we can mark them async.

Signed-off-by: Kirill Tkhai 
---
 net/rds/tcp.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 08230a145042..eb04e7fa2467 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -515,6 +515,7 @@ static struct pernet_operations rds_tcp_net_ops = {
.exit = rds_tcp_exit_net,
.id = &rds_tcp_netid,
.size = sizeof(struct rds_tcp_net),
+   .async = true,
 };
 
 static void rds_tcp_kill_sock(struct net *net)



[PATCH net-next 1/4] net: Convert sctp_defaults_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations deal with sysctl, /proc
entries and statistics. The exit method also frees the
net::sctp::addr_waitq queue and net::sctp::local_addr_list.
All of them look per-net divided, and these items seem
to be used only by sctp_defaults_ops, which are therefore
safe to execute in parallel.

Signed-off-by: Kirill Tkhai 
---
 net/sctp/protocol.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 91813e686c67..32be52304f98 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1330,6 +1330,7 @@ static void __net_exit sctp_defaults_exit(struct net *net)
 static struct pernet_operations sctp_defaults_ops = {
.init = sctp_defaults_init,
.exit = sctp_defaults_exit,
+   .async = true,
 };
 
 static int __net_init sctp_ctrlsock_init(struct net *net)



Re: [PATCH net-next 2/4] net: Convert sctp_ctrlsock_ops

2018-03-13 Thread Neil Horman
On Tue, Mar 13, 2018 at 01:37:02PM +0300, Kirill Tkhai wrote:
> These pernet_operations create and destroy net::sctp::ctl_sock.
> Since pernet_operations do not send sctp packets to each other,
> they look safe to be marked as async.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  net/sctp/protocol.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 32be52304f98..606361ee9e4a 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1354,6 +1354,7 @@ static void __net_init sctp_ctrlsock_exit(struct net *net)
>  static struct pernet_operations sctp_ctrlsock_ops = {
>   .init = sctp_ctrlsock_init,
>   .exit = sctp_ctrlsock_exit,
> + .async = true,
>  };
>  
>  /* Initialize the universe into something sensible.  */
> 
> 
Acked-by: Neil Horman 



Re: linux-next: build warning after merge of the net-next tree

2018-03-13 Thread Gustavo A. R. Silva

Hi Stephen,

On 03/13/2018 01:11 AM, Stephen Rothwell wrote:

> Hi all,
>
> After merging the net-next tree, today's linux-next build (sparc
> defconfig) produced this warning:
>
> net/core/pktgen.c: In function 'pktgen_if_write':
> net/core/pktgen.c:1710:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
>   }
>   ^
>
> Introduced by commit
>
>    35951393bbff ("pktgen: Remove VLA usage")



Thanks for the report.

David:

If this code is not going to be executed very often [1], then I think it 
is safe to use dynamic memory allocation instead, as it will not 
impact performance.


What do you think?

[1] https://lkml.org/lkml/2018/3/9/630

Thanks
--
Gustavo
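This is a sketch of the trade-off Gustavo suggests. Moving a large scratch buffer from the stack to the heap keeps the frame small at the cost of one allocation per call, which is cheap if the path rarely runs. The function and buffer size are hypothetical and do not reproduce pktgen_if_write(). In the kernel the calls would presumably be kmalloc()/kfree() rather than malloc()/free().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size: on the stack, this plus a few locals would exceed
 * a 1024-byte -Wframe-larger-than= limit. */
#define MODEL_BUF_LEN 1024

/* Copies at most MODEL_BUF_LEN - 1 bytes of input into a heap buffer,
 * NUL-terminates it and returns the resulting string length, or -1 if
 * the allocation fails.  Only a pointer lives in the stack frame. */
static int model_if_write(const char *input, size_t len)
{
	char *buf;
	int ret;

	if (len > MODEL_BUF_LEN - 1)
		len = MODEL_BUF_LEN - 1;

	buf = malloc(MODEL_BUF_LEN);   /* kernel: kmalloc(..., GFP_KERNEL) */
	if (!buf)
		return -1;

	memcpy(buf, input, len);
	buf[len] = '\0';
	ret = (int)strlen(buf);        /* stand-in for parsing the command */

	free(buf);                     /* kernel: kfree(buf) */
	return ret;
}
```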




[PATCH net-next] cxgb4: Add HMA support

2018-03-13 Thread Arjun Vynipadath
HMA (Host Memory Access) maps part of host memory for T6-SO memfree cards.

This commit does the following:
- Queries the FW to check for HMA support. If it is supported, the params
  query returns the HMA size configured in the FW, and host memory is
  DMA-mapped based on this size.
- Exposes HMA memory information via debugfs.

Signed-off-by: Arjun Vynipadath 
Signed-off-by: Santosh Rastapur 
Signed-off-by: Michael Werner 
Signed-off-by: Ganesh GR 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  13 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  10 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c| 228 -
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c |   2 +-
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h  |  56 +
 5 files changed, 303 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index d3fa53d..b2df0ff 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -831,6 +831,16 @@ struct vf_info {
u16 vlan;
 };
 
+enum {
+   HMA_DMA_MAPPED_FLAG = 1
+};
+
+struct hma_data {
+   unsigned char flags;
+   struct sg_table *sgt;
+   dma_addr_t *phy_addr;   /* physical address of the page */
+};
+
 struct mbox_list {
struct list_head list;
 };
@@ -946,6 +956,9 @@ struct adapter {
 
/* Ethtool Dump */
struct ethtool_dump eth_dump;
+
+   /* HMA */
+   struct hma_data hma;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 2822bbf..de2ba86 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2617,7 +2617,7 @@ int mem_open(struct inode *inode, struct file *file)
 
file->private_data = inode->i_private;
 
-   mem = (uintptr_t)file->private_data & 0x3;
+   mem = (uintptr_t)file->private_data & 0x7;
adap = file->private_data - mem;
 
(void)t4_fwcache(adap, FW_PARAM_DEV_FWCACHE_FLUSH);
@@ -2630,7 +2630,7 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 {
loff_t pos = *ppos;
loff_t avail = file_inode(file)->i_size;
-   unsigned int mem = (uintptr_t)file->private_data & 3;
+   unsigned int mem = (uintptr_t)file->private_data & 0x7;
struct adapter *adap = file->private_data - mem;
__be32 *data;
int ret;
@@ -3042,6 +3042,12 @@ int t4_setup_debugfs(struct adapter *adap)
add_debugfs_mem(adap, "mc", MEM_MC,
EXT_MEM_SIZE_G(size));
}
+
+   if (i & HMA_MUX_F) {
+   size = t4_read_reg(adap, MA_EXT_MEMORY1_BAR_A);
+   add_debugfs_mem(adap, "hma", MEM_HMA,
+   EXT_MEM1_SIZE_G(size));
+   }
}
 
de = debugfs_create_file_size("flash", S_IRUSR, adap->debugfs_root, adap,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 1b44652..d1e2786 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1733,10 +1733,11 @@ EXPORT_SYMBOL(cxgb4_sync_txq_pidx);
 
 int cxgb4_read_tpte(struct net_device *dev, u32 stag, __be32 *tpte)
 {
-   struct adapter *adap;
-   u32 offset, memtype, memaddr;
u32 edc0_size, edc1_size, mc0_size, mc1_size, size;
u32 edc0_end, edc1_end, mc0_end, mc1_end;
+   u32 offset, memtype, memaddr;
+   struct adapter *adap;
+   u32 hma_size = 0;
int ret;
 
adap = netdev2adap(dev);
@@ -1756,6 +1757,10 @@ int cxgb4_read_tpte(struct net_device *dev, u32 stag, __be32 *tpte)
size = t4_read_reg(adap, MA_EXT_MEMORY0_BAR_A);
mc0_size = EXT_MEM0_SIZE_G(size) << 20;
 
+   if (t4_read_reg(adap, MA_TARGET_MEM_ENABLE_A) & HMA_MUX_F) {
+   size = t4_read_reg(adap, MA_EXT_MEMORY1_BAR_A);
+   hma_size = EXT_MEM1_SIZE_G(size) << 20;
+   }
edc0_end = edc0_size;
edc1_end = edc0_end + edc1_size;
mc0_end = edc1_end + mc0_size;
@@ -1767,7 +1772,10 @@ int cxgb4_read_tpte(struct net_device *dev, u32 stag, __be32 *tpte)
memtype = MEM_EDC1;
memaddr = offset - edc0_end;
} else {
-   if (offset < mc0_end) {
+   if (hma_size && (offset < (edc1_end + hma_size))) {
+   memtype = MEM_HMA;
+   memaddr = offset - edc1_end;
+   } else if (offset < mc0_end) {
memtype = MEM_MC0;
memaddr = offset - edc1_end;

Re: [2/2] net/usb/ax88179_178a: Delete three unnecessary variables in ax88179_chk_eee()

2018-03-13 Thread Oliver Neukum
On Tuesday, 13.03.2018 at 08:24 +0100, SF Markus Elfring wrote:
> > 
> > > 
> > > Use three values directly for a condition check without assigning them
> > > to intermediate variables.
> > 
> > Hi,
> > 
> > what is the benefit of this?
> 
> I proposed a small source code reduction.
> 
> Other software design directions might become more interesting for this use 
> case.

Yes, and in doing so you killed three meaningful names that tell
us what these checks actually test for. That is not an improvement.

Regards
Oliver



[PATCH net-next] net: Add comment about pernet_operations methods and synchronization

2018-03-13 Thread Kirill Tkhai
Make the locking scheme visible to users, and add a comment
explaining why exit_batch() methods are needed and when they
should be used.

Signed-off-by: Kirill Tkhai 
---
 include/net/net_namespace.h |   14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index d4417495773a..71abc8d79178 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -312,6 +312,20 @@ struct net *get_net_ns_by_id(struct net *net, int id);
 
 struct pernet_operations {
struct list_head list;
+   /*
+* Below methods are called without any exclusive locks.
+* More than one net may be constructed and destructed
+* in parallel on several cpus. Every pernet_operations
+* have to keep in mind all other pernet_operations and
+* to introduce a locking, if they share common resources.
+*
+* Exit methods using blocking RCU primitives, such as
+* synchronize_rcu(), should be implemented via exit_batch.
+* Then, destruction of a group of net requires single
+* synchronize_rcu() related to these pernet_operations,
+* instead of separate synchronize_rcu() for every net.
+* Please, avoid synchronize_rcu() at all, where it's possible.
+*/
int (*init)(struct net *net);
void (*exit)(struct net *net);
void (*exit_batch)(struct list_head *net_exit_list);
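The batching argument in the new comment can be illustrated by counting grace periods. In this hypothetical model, model_synchronize_rcu() only counts calls, whereas the real synchronize_rcu() blocks for a full grace period. An array also stands in for the net_exit_list list_head. Tearing down 8 namespaces through a per-net exit costs 8 waits; the batched variant costs one.

```c
#include <assert.h>
#include <stddef.h>

/* Counts simulated grace periods; stands in for synchronize_rcu(). */
static int grace_periods;

static void model_synchronize_rcu(void)
{
	grace_periods++;
}

struct model_net {
	int published;   /* state RCU readers might still observe */
};

/* Per-net ->exit(): unpublish, then wait for readers of THIS net. */
static void model_exit(struct model_net *net)
{
	net->published = 0;
	model_synchronize_rcu();
}

/* ->exit_batch(): unpublish every net in the batch, then wait once,
 * since one grace period covers all earlier unpublish operations. */
static void model_exit_batch(struct model_net *nets, size_t n)
{
	for (size_t i = 0; i < n; i++)
		nets[i].published = 0;
	model_synchronize_rcu();
}
```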



Re: BUG: corrupted list in sctp_association_free

2018-03-13 Thread Dmitry Vyukov
On Tue, Mar 13, 2018 at 1:44 PM, Xin Long  wrote:
> On Tue, Mar 13, 2018 at 3:34 PM, syzbot
>  wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> fd372a7a9e5e9d8011a0222d10edd3523abcd3b1 (Thu Mar 8 19:43:48 2018 +)
>> Merge tag 'mlx5-updates-2018-02-28-2' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output is attached.
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+e56a5d45f832ef33a...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> selinux_nlmsg_perm: 1 callbacks suppressed
>> SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
>> sclass=netlink_route_socket pig=12502 comm=syz-executor3
>> SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
>> sclass=netlink_route_socket pig=12528 comm=syz-executor3
>> list_del corruption, fcc5fb27->next is LIST_POISON1
>> (cb16e51d)
>> [ cut here ]
>> kernel BUG at lib/list_debug.c:47!
>> invalid opcode:  [#1] SMP KASAN
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Modules linked in:
>> CPU: 0 PID: 12537 Comm: syz-executor2 Not tainted 4.16.0-rc4+ #258
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> RIP: 0010:__list_del_entry_valid+0xd3/0x150 lib/list_debug.c:45
>> RSP: 0018:8801b6387778 EFLAGS: 00010286
>> RAX: 004e RBX: dead0200 RCX: 
>> RDX: 004e RSI: c90002ed6000 RDI: ed0036c70ee3
>> RBP: 8801b6387790 R08: 110036c70e3b R09: 
>> R10:  R11:  R12: dead0100
>> R13: 8801d3164000 R14: 8801d8502220 R15: 8801b6387c58
>> FS:  7ff42042f700() GS:8801db20() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 7ff42040ddb8 CR3: 0001bd840003 CR4: 001606f0
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> Call Trace:
>>  __list_del_entry include/linux/list.h:117 [inline]
>>  list_del include/linux/list.h:125 [inline]
>>  sctp_association_free+0x133/0x930 net/sctp/associola.c:341
>>  sctp_sendmsg+0xc67/0x1a80 net/sctp/socket.c:2075
>>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
>>  sock_sendmsg_nosec net/socket.c:629 [inline]
>>  sock_sendmsg+0xca/0x110 net/socket.c:639
>>  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
>>  SyS_sendto+0x40/0x50 net/socket.c:1716
>>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>> RIP: 0033:0x453e69
>> RSP: 002b:7ff42042ec68 EFLAGS: 0246 ORIG_RAX: 002c
>> RAX: ffda RBX: 7ff42042f6d4 RCX: 00453e69
>> RDX: 0001 RSI: 2340 RDI: 0015
>> RBP: 0072c0c8 R08: 204d9000 R09: 001c
>> R10:  R11: 0246 R12: 
>> R13: 04cd R14: 006f73d8 R15: 0003
>> Code: 8f 00 00 00 49 8b 54 24 08 48 39 f2 75 3b 48 83 c4 08 b8 01 00 00 00
>> 5b 41 5c 5d c3 4c 89 e2 48 c7 c7 c0 7c 40 86 e8 75 f6 fb fe <0f> 0b 48 c7 c7
>> 20 7d 40 86 e8 67 f6 fb fe 0f 0b 48 c7 c7 80 7d
>> RIP: __list_del_entry_valid+0xd3/0x150 lib/list_debug.c:45 RSP:
>> 8801b6387778
>> ---[ end trace a6b157f61f9bd43a ]---
>> Kernel panic - not syncing: Fatal exception
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Kernel Offset: disabled
>> Rebooting in 86400 seconds..
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to syzkal...@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is merged
>> into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug report.
>> Note: all commands must start from beginning of the line in the email body.
> I'd think the patch Neil just posted would fix it.


Hi Xin,

Could you point me to that commit? We need to tell syzbot about it.

Thanks
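The oops fires in __list_del_entry_valid(), reached through list_del() in sctp_association_free(). In the dump above, R12 and RBX show what look like truncated copies of the x86_64 LIST_POISON1 and LIST_POISON2 values, which list_del() writes into an entry after unlinking it; that pattern usually points to the same entry being deleted twice, although the report alone does not prove it. The following is a minimal userspace sketch of the CONFIG_DEBUG_LIST check, not the kernel's code: the struct, the function names and the POISON1/POISON2 constants (assumed to match x86_64 with a 0xdead000000000000 poison delta) are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal doubly linked list node, modeled on struct list_head. */
struct node {
	struct node *next;
	struct node *prev;
};

/* Assumed to mirror x86_64 LIST_POISON1/LIST_POISON2; these pointers are
 * only compared, never dereferenced. */
#define POISON1 ((struct node *)(uintptr_t)0xdead000000000100ULL)
#define POISON2 ((struct node *)(uintptr_t)0xdead000000000200ULL)

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Same idea as __list_del_entry_valid(): refuse to unlink an entry whose
 * pointers are poisoned or whose neighbours no longer point back at it. */
static bool del_entry_valid(const struct node *n)
{
	if (n->next == POISON1 || n->prev == POISON2)
		return false;		/* entry was already deleted */
	if (n->prev->next != n || n->next->prev != n)
		return false;		/* neighbours are corrupted */
	return true;
}

/* Unlink n and poison it; returns false where the debug kernel hit BUG(). */
static bool list_del_checked(struct node *n)
{
	if (!del_entry_valid(n))
		return false;
	n->next->prev = n->prev;
	n->prev->next = n->next;
	n->next = POISON1;
	n->prev = POISON2;
	return true;
}
```

Calling list_del_checked() twice on the same node returns false on the second call, which is the situation the debug check turns into the BUG() in the report.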


Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support

2018-03-13 Thread David Woodhouse


On Tue, 2018-03-13 at 09:54 +0100, Christoph Hellwig wrote:
> On Tue, Mar 13, 2018 at 08:45:19AM +, David Woodhouse wrote:
> Because binding to pci-stub means that you'd now enable the simple
> SR-IOV for any device bound to PCI stub.  Which often might be the wrong
> thing.

No, *using* it would be the wrong thing (bad root; no biscuit).

Except when the PF doesn't have SR-IOV capability anyway, in which case
who cares.

Or when the PF does have SR-IOV capability and root has ensured that
she's doing the right thing.

I understand the arguments about disallowing root from doing bad
things. Not that I agree with them. But simply changing to a
*different* driver seems pointless.





[PATCH 2/2] doc: Change the udp/sctp rmem/wmem default value.

2018-03-13 Thread Tonghao Zhang
SK_MEM_QUANTUM was changed from PAGE_SIZE to 4096, so update the
documented udp/sctp rmem/wmem defaults from "1 page" to "4K".

Signed-off-by: Tonghao Zhang 
---
 Documentation/networking/ip-sysctl.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 783675a..1d11207 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -755,13 +755,13 @@ udp_rmem_min - INTEGER
Minimal size of receive buffer used by UDP sockets in moderation.
Each UDP socket is able to use the size for receiving data, even if
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
-   Default: 1 page
+   Default: 4K
 
 udp_wmem_min - INTEGER
Minimal size of send buffer used by UDP sockets in moderation.
Each UDP socket is able to use the size for sending data, even if
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
-   Default: 1 page
+   Default: 4K
 
 CIPSOv4 Variables:
 
@@ -2101,7 +2101,7 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
It is guaranteed to each SCTP socket (but not association) even
under moderate memory pressure.
 
-   Default: 1 page
+   Default: 4K
 
 sctp_wmem  - vector of 3 INTEGERs: min, default, max
Currently this tunable has no effect.
-- 
1.8.3.1



Re: [PATCH 1/2] udp: Move the udp sysctl to namespace.

2018-03-13 Thread Paolo Abeni
Hi,

On Tue, 2018-03-13 at 02:57 -0700, Tonghao Zhang wrote:
> This patch moves the udp_rmem_min, udp_wmem_min
> to namespace and init the udp_l3mdev_accept explicitly.

Can you please be a little more descriptive on why this is
needed/helpful?

> Signed-off-by: Tonghao Zhang 
> ---
>  include/net/netns/ipv4.h   |  3 ++
>  net/ipv4/sysctl_net_ipv4.c | 32 -
>  net/ipv4/udp.c | 86 
> +++---
>  net/ipv6/udp.c | 52 ++--
>  4 files changed, 96 insertions(+), 77 deletions(-)
> 
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 3a970e4..382bfd7 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -168,6 +168,9 @@ struct netns_ipv4 {
>   atomic_t tfo_active_disable_times;
>   unsigned long tfo_active_disable_stamp;
>  
> + int sysctl_udp_wmem_min;
> + int sysctl_udp_rmem_min;
> +
>  #ifdef CONFIG_NET_L3_MASTER_DEV
>   int sysctl_udp_l3mdev_accept;
>  #endif
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 011de9a..5b72d97 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -520,22 +520,6 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
>   .mode   = 0644,
>   .proc_handler   = proc_doulongvec_minmax,
>   },
> - {
> - .procname   = "udp_rmem_min",
> - .data   = &sysctl_udp_rmem_min,
> - .maxlen = sizeof(sysctl_udp_rmem_min),
> - .mode   = 0644,
> - .proc_handler   = proc_dointvec_minmax,
> - .extra1 = 
> - },
> - {
> - .procname   = "udp_wmem_min",
> - .data   = &sysctl_udp_wmem_min,
> - .maxlen = sizeof(sysctl_udp_wmem_min),
> - .mode   = 0644,
> - .proc_handler   = proc_dointvec_minmax,
> - .extra1 = 
> - },
>   { }
>  };
>  
> @@ -1167,6 +1151,22 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
>   .proc_handler   = proc_dointvec_minmax,
>   .extra1 = ,
>   },
> + {
> + .procname   = "udp_rmem_min",
> + .data   = &init_net.ipv4.sysctl_udp_rmem_min,
> + .maxlen = sizeof(init_net.ipv4.sysctl_udp_rmem_min),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = 
> + },
> + {
> + .procname   = "udp_wmem_min",
> + .data   = &init_net.ipv4.sysctl_udp_wmem_min,
> + .maxlen = sizeof(init_net.ipv4.sysctl_udp_wmem_min),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = 
> + },
>   { }
>  };
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 3013404..7ae77f2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -122,12 +122,6 @@
>  long sysctl_udp_mem[3] __read_mostly;
>  EXPORT_SYMBOL(sysctl_udp_mem);
>  
> -int sysctl_udp_rmem_min __read_mostly;
> -EXPORT_SYMBOL(sysctl_udp_rmem_min);
> -
> -int sysctl_udp_wmem_min __read_mostly;
> -EXPORT_SYMBOL(sysctl_udp_wmem_min);
> -
>  atomic_long_t udp_memory_allocated;
>  EXPORT_SYMBOL(udp_memory_allocated);
>  
> @@ -2533,35 +2527,35 @@ int udp_abort(struct sock *sk, int err)
>  EXPORT_SYMBOL_GPL(udp_abort);
>  
>  struct proto udp_prot = {
> - .name  = "UDP",
> - .owner = THIS_MODULE,
> - .close = udp_lib_close,
> - .connect   = ip4_datagram_connect,
> - .disconnect= udp_disconnect,
> - .ioctl = udp_ioctl,
> - .init  = udp_init_sock,
> - .destroy   = udp_destroy_sock,
> - .setsockopt= udp_setsockopt,
> - .getsockopt= udp_getsockopt,
> - .sendmsg   = udp_sendmsg,
> - .recvmsg   = udp_recvmsg,
> - .sendpage  = udp_sendpage,
> - .release_cb= ip4_datagram_release_cb,
> - .hash  = udp_lib_hash,
> - .unhash= udp_lib_unhash,
> - .rehash= udp_v4_rehash,
> - .get_port  = udp_v4_get_port,
> - .memory_allocated  = &udp_memory_allocated,
> - .sysctl_mem= sysctl_udp_mem,
> - .sysctl_wmem   = &sysctl_udp_wmem_min,
> - .sysctl_rmem   = &sysctl_udp_rmem_min,
> - .obj_size  = sizeof(struct udp_sock),
> - .h.udp_table   = &udp_table,
> + .name   = "UDP",
> + .owner  = THIS_MODULE,
> + .close  = udp_lib_close,
> + .connect= ip4_datagram_connect,
> + .disconnect = udp_disconnect,
> + .ioctl  = udp_ioctl,
> + 

[PATCH net-next] net: fix sysctl_fb_tunnels_only_for_init_net link error

2018-03-13 Thread Arnd Bergmann
The new variable is only available when CONFIG_SYSCTL is enabled,
otherwise we get a link error:

net/ipv4/ip_tunnel.o: In function `ip_tunnel_init_net':
ip_tunnel.c:(.text+0x278b): undefined reference to `sysctl_fb_tunnels_only_for_init_net'
net/ipv6/sit.o: In function `sit_init_net':
sit.c:(.init.text+0x4c): undefined reference to `sysctl_fb_tunnels_only_for_init_net'
net/ipv6/ip6_tunnel.o: In function `ip6_tnl_init_net':
ip6_tunnel.c:(.init.text+0x39): undefined reference to `sysctl_fb_tunnels_only_for_init_net'

This adds an extra condition, keeping the traditional behavior when
CONFIG_SYSCTL is disabled.

Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces")
Signed-off-by: Arnd Bergmann 
---
 include/linux/netdevice.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5fbb9f1da7fd..913b1cc882cf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -589,7 +589,9 @@ extern int sysctl_fb_tunnels_only_for_init_net;
 
 static inline bool net_has_fallback_tunnels(const struct net *net)
 {
-   return net == &init_net || !sysctl_fb_tunnels_only_for_init_net;
+   return net == &init_net ||
+  !IS_ENABLED(CONFIG_SYSCTL) ||
+  !sysctl_fb_tunnels_only_for_init_net;
 }
 
 static inline int netdev_queue_numa_node_read(const struct netdev_queue *q)
-- 
2.9.0
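The fix relies on IS_ENABLED(CONFIG_SYSCTL) being a compile-time constant: when it is 0, the second operand of the || chain is always true, so the (always optimizing) kernel build can drop the reference to sysctl_fb_tunnels_only_for_init_net and the link succeeds. A userspace sketch of the resulting truth table follows; SKETCH_SYSCTL_ENABLED is a hypothetical stand-in for IS_ENABLED(CONFIG_SYSCTL), and unlike the kernel this sketch always defines the knob, so it does not reproduce the link error itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_SYSCTL); build with
 * -DSKETCH_SYSCTL_ENABLED=0 to model a CONFIG_SYSCTL=n kernel. */
#ifndef SKETCH_SYSCTL_ENABLED
#define SKETCH_SYSCTL_ENABLED 1
#endif

struct net {
	int id;
};

static struct net init_net;			/* models the initial namespace */
static int fb_tunnels_only_for_init_net;	/* models the sysctl knob */

/* Same boolean structure as the patched net_has_fallback_tunnels(). */
static bool has_fallback_tunnels(const struct net *net)
{
	return net == &init_net ||
	       !SKETCH_SYSCTL_ENABLED ||
	       !fb_tunnels_only_for_init_net;
}
```

Building with -DSKETCH_SYSCTL_ENABLED=0 makes has_fallback_tunnels() return true for every namespace, which matches the traditional behavior the patch keeps for CONFIG_SYSCTL=n.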



Re: [PATCH v2 iproute2-next 0/6] cm_id, cq, mr, and pd resource tracking

2018-03-13 Thread Leon Romanovsky
On Mon, Mar 12, 2018 at 10:53:03AM -0700, David Ahern wrote:
> On 3/12/18 8:16 AM, Steve Wise wrote:
> > Hey all,
> >
> > The kernel side of this series has been merged for rdma-next [1].  Let me
> > know if this iproute2 series can be merged, or if it needs more changes.
> >
>
> The problem is that iproute2 headers are synced to kernel headers from
> DaveM's tree (net-next mainly). I take it this series will not appear in
> Dave's tree until after a merge through Linus' tree. Correct?

David,

Technically, you are right, and we would like to ask you for an extra tweak
to the flow for RDMAtool, because the current scheme causes a delay of at
least one cycle.

Every RDMAtool patchset that requires header changes always includes a
header patch. Can you please accept such series and, once you bring in
the new net-next headers from Linus, simply overwrite all our headers?

Thanks

>




RE: [PATCH net] qed: Use after free in qed_rdma_free()

2018-03-13 Thread Kalderon, Michal
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Tuesday, March 13, 2018 11:10 AM
> To: Elior, Ariel ; Kalderon, Michal
> 
> Cc: Dept-Eng Everest Linux L2 ;
> netdev@vger.kernel.org; kernel-janit...@vger.kernel.org
> Subject: [PATCH net] qed: Use after free in qed_rdma_free()
> 
> We're dereferencing "p_hwfn->p_rdma_info" but that is freed on the line
> before in qed_rdma_resc_free(p_hwfn).
> 
> Fixes: 9de506a547c0 ("qed: Free RoCE ILT Memory on rmmod qedr")
> Signed-off-by: Dan Carpenter 
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> index f3ee6538b553..a411f9c702a1 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> @@ -379,8 +379,8 @@ static void qed_rdma_free(struct qed_hwfn *p_hwfn)
>   DP_VERBOSE(p_hwfn, QED_MSG_RDMA, "Freeing RDMA\n");
> 
>   qed_rdma_free_reserved_lkey(p_hwfn);
> - qed_rdma_resc_free(p_hwfn);
>   qed_cxt_free_proto_ilt(p_hwfn, p_hwfn->p_rdma_info->proto);
> + qed_rdma_resc_free(p_hwfn);
>  }
> 
>  static void qed_rdma_get_guid(struct qed_hwfn *p_hwfn, u8 *guid)
Thanks,

Acked-by: Michal Kalderon 
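The ordering problem described in the patch can be shown with a small userspace model: the protocol value has to be read out of the RDMA info structure before the function that frees that structure runs. The struct and function names below are hypothetical stand-ins for qed_hwfn, p_rdma_info, qed_cxt_free_proto_ilt() and qed_rdma_resc_free(), not the driver's definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for p_rdma_info and qed_hwfn. */
struct rdma_info {
	int proto;
};

struct hwfn {
	struct rdma_info *rdma_info;
	int freed_proto;	/* records what free_proto_ilt() was given */
};

/* Stand-in for qed_cxt_free_proto_ilt(): only needs the proto value. */
static void free_proto_ilt(struct hwfn *h, int proto)
{
	h->freed_proto = proto;
}

/* Stand-in for qed_rdma_resc_free(). Clearing the pointer is an addition
 * for the sketch, so the old order would crash on NULL instead of
 * silently reading freed memory. */
static void rdma_resc_free(struct hwfn *h)
{
	free(h->rdma_info);
	h->rdma_info = NULL;
}

/* Patched order: read rdma_info->proto first, then free rdma_info. */
static void rdma_free(struct hwfn *h)
{
	free_proto_ilt(h, h->rdma_info->proto);
	rdma_resc_free(h);
}
```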


Re: question about bpf.test_progs.fail

2018-03-13 Thread Daniel Borkmann
Hi Shaoting,

On 03/12/2018 02:52 AM, lst wrote:
> hi, I have a question need your help.
> 
> I get failure "libbpf: incorrect bpf_call opcode" when running below two 
> cases on v4.16-rc3:
> -
> test_l4lb_all();
>     const char *file2 = "./test_l4lb_noinline.o";
> 
> test_xdp_noinline();
> -
> 
> and from the file test_libbpf.sh, it seems libbpf can't load noinline 
> functions.
> -
> # TODO: fix libbpf to load noinline functions
> # [warning] libbpf: incorrect bpf_call opcode
> #libbpf_open_file test_l4lb_noinline.o
> -
> 
> They all point to bpf_object__open(filename) at last.
> Here(test_progs) test "test_l4lb_noinline.o" but test_libbpf.sh don't.
> 
> So, I guess there must be some setting (like certain kernel kconfig or 
> compiling) to make the test work.
> 
> Can you tell me how can I make this test(test_progs) pass?

Are you using/linking against an old libbpf version that doesn't support
BPF to BPF calls? Latest works fine for me:

# clang --version
clang version 7.0.0 (http://llvm.org/git/clang.git 491b0d6736475fbb9509877edcc18051272b30bd) (http://llvm.org/git/llvm.git 4b957aaea877317bd77344eaa8d8a6645fbaf822)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin

# llc --version
LLVM (http://llvm.org/):
  LLVM version 7.0.0svn
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: skylake

  Registered Targets:
bpf   - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)

# uname -a
Linux linux.home 4.16.0-rc4+ #48 SMP Mon Mar 12 15:22:34 CET 2018 x86_64 x86_64 x86_64 GNU/Linux

# cd tools/testing/selftests/bpf
# make clean > /dev/null
# make > /dev/null
Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'
Warning: Kernel ABI header at 'tools/include/uapi/linux/if_link.h' differs from latest version at 'include/uapi/linux/if_link.h'
# ./test_progs
test_pkt_access:PASS:ipv4 148 nsec
test_pkt_access:PASS:ipv6 122 nsec
test_xdp:PASS:ipv4 6898 nsec
test_xdp:PASS:ipv6 15287 nsec
test_l4lb:PASS:ipv4 1126 nsec
test_l4lb:PASS:ipv6 1824 nsec
test_l4lb:PASS:ipv4 1754 nsec
test_l4lb:PASS:ipv6 3254 nsec
test_xdp_noinline:PASS:ipv4 2620 nsec
test_xdp_noinline:PASS:ipv6 3994 nsec
test_tcp_estats:PASS: 0 nsec
test_bpf_obj_id:PASS:get-fd-by-notexist-prog-id 0 nsec
test_bpf_obj_id:PASS:get-fd-by-notexist-map-id 0 nsec
test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-prog-fd-bad-nr-map-ids 0 nsec
test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-prog-fd-bad-nr-map-ids 0 nsec
test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:check total prog id found by get_next_id 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:check total map id found by get_next_id 0 nsec
test_pkt_md_access:PASS: 263 nsec
test_obj_name:PASS:check-bpf-prog-name 0 nsec
test_obj_name:PASS:check-bpf-map-name 0 nsec
test_obj_name:PASS:check-bpf-prog-name 0 nsec
test_obj_name:PASS:check-bpf-map-name 0 nsec
test_obj_name:PASS:check-bpf-prog-name 0 nsec
test_obj_name:PASS:check-bpf-map-name 0 nsec
test_obj_name:PASS:check-bpf-prog-name 0 nsec
test_obj_name:PASS:check-bpf-map-name 0 nsec
test_tp_attach_query:PASS:open 0 nsec
test_tp_attach_query:PASS:read 0 nsec
test_tp_attach_query:PASS:prog_load 0 nsec
test_tp_attach_query:PASS:bpf_obj_get_info_by_fd 0 nsec
test_tp_attach_query:PASS:perf_event_open 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_enable 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_query_bpf 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_set_bpf 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_query_bpf 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_query_bpf 0 nsec
test_tp_attach_query:PASS:prog_load 0 nsec
test_tp_attach_query:PASS:bpf_obj_get_info_by_fd 0 nsec
test_tp_attach_query:PASS:perf_event_open 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_enable 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_set_bpf 0 nsec
test_tp_attach_query:PASS:perf_event_ioc_query_bpf 0 nsec

Re: [PATCH iproute2] Revert "iproute: "list/flush/save default" selected all of the routes"

2018-03-13 Thread Alexander Zubkov
Hello.

Maybe a better way would be to change how the "all"/"any" argument behaves? My
original concern was only about "default". I also agree that "all" or "any"
should select all routes, but "default" should not.

12.03.2018, 22:37, "Luca Boccassi" :
> On Mon, 2018-03-12 at 14:03 -0700, Stephen Hemminger wrote:
>>  This reverts commit 9135c4d6037ff9f1818507bac0049fc44db8c3d2.
>>
>>  Debian maintainer found that basic command:
>>  # ip route flush all
>>  No longer worked as expected which breaks user scripts and
>>  expectations. It no longer flushed all IPv4 routes.
>>
>>  Reported-by: Luca Boccassi 
>>  Signed-off-by: Stephen Hemminger 
>>  ---
>>   ip/iproute.c | 65 ++--
>>  
>>   lib/utils.c  | 13 
>>   2 files changed, 32 insertions(+), 46 deletions(-)
>
> Tested-by: Luca Boccassi 
>
> Thanks, solves the problem. I'll backport it to Debian.
>
> Alexander, reproducing the issue is quite simple - before that commit,
> ip route ls all showed all routes, but with the change it started
> showing only the default table. Same for ip route flush.
>
> --
> Kind regards,
> Luca Boccassi


Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support

2018-03-13 Thread Christoph Hellwig
On Tue, Mar 13, 2018 at 08:45:19AM +, David Woodhouse wrote:
> 
> 
> On Tue, 2018-03-13 at 09:16 +0100, Christoph Hellwig wrote:
> > On Tue, Mar 13, 2018 at 08:12:52AM +, David Woodhouse wrote:
> > > 
> > > I'd also *really* like to see a way to enable this for PFs which don't
> > > have (and don't need) a driver. We seem to have lost that along the
> > > way.
> > We've been forth and back on that.  I agree that not having any driver
> > just seems dangerous.  If your PF really does nothing we should just
> > have a trivial pf_stub driver that does nothing but wiring up
> > pci_sriov_configure_simple.  We can then add PCI IDs to it either
> > statically, or using the dynamic ids mechanism.
> 
> Or just add it to the existing pci-stub. What's the point in having a
> new driver? 

Because binding to pci-stub means that you'd now enable the simple
SR-IOV for any device bound to pci-stub, which might often be the wrong
thing.


[PATCH V2 net 1/1] net/smc: simplify wait when closing listen socket

2018-03-13 Thread Ursula Braun
Closing a listen socket wakes up kernel_accept() in
smc_tcp_listen_worker() and then has to wait until smc_tcp_listen_worker()
releases the internal clcsock. The wait logic introduced with
commit 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
might wait longer than necessary. This patch implements the wait with
just flush_work() and removes the extra smc_close_wait_listen_clcsock()
function.

Fixes: 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
Reported-by: Hans Wippel 
Signed-off-by: Ursula Braun 
---
 net/smc/af_smc.c|  4 
 net/smc/smc_close.c | 25 +++--
 2 files changed, 3 insertions(+), 26 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8cc97834d4f6..1e0d780855c3 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -978,10 +978,6 @@ static void smc_tcp_listen_work(struct work_struct *work)
lsmc->clcsock = NULL;
}
release_sock(lsk);
-   /* no more listening, wake up smc_close_wait_listen_clcsock and
-* accept
-*/
-   lsk->sk_state_change(lsk);
sock_put(&lsmc->sk); /* sock_hold in smc_listen */
 }
 
diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c
index e339c0186dcf..fa41d9881741 100644
--- a/net/smc/smc_close.c
+++ b/net/smc/smc_close.c
@@ -30,27 +30,6 @@ static void smc_close_cleanup_listen(struct sock *parent)
smc_close_non_accepted(sk);
 }
 
-static void smc_close_wait_listen_clcsock(struct smc_sock *smc)
-{
-   DEFINE_WAIT_FUNC(wait, woken_wake_function);
-   struct sock *sk = &smc->sk;
-   signed long timeout;
-
-   timeout = SMC_CLOSE_WAIT_LISTEN_CLCSOCK_TIME;
-   add_wait_queue(sk_sleep(sk), &wait);
-   do {
-   release_sock(sk);
-   if (smc->clcsock)
-   timeout = wait_woken(&wait, TASK_UNINTERRUPTIBLE,
-timeout);
-   sched_annotate_sleep();
-   lock_sock(sk);
-   if (!smc->clcsock)
-   break;
-   } while (timeout);
-   remove_wait_queue(sk_sleep(sk), &wait);
-}
-
 /* wait for sndbuf data being transmitted */
 static void smc_close_stream_wait(struct smc_sock *smc, long timeout)
 {
@@ -204,9 +183,11 @@ int smc_close_active(struct smc_sock *smc)
rc = kernel_sock_shutdown(smc->clcsock, SHUT_RDWR);
/* wake up kernel_accept of smc_tcp_listen_worker */
smc->clcsock->sk->sk_data_ready(smc->clcsock->sk);
-   smc_close_wait_listen_clcsock(smc);
}
smc_close_cleanup_listen(sk);
+   release_sock(sk);
+   flush_work(&smc->tcp_listen_work);
+   lock_sock(sk);
break;
case SMC_ACTIVE:
smc_close_stream_wait(smc, timeout);
-- 
2.13.5
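The new close path drops the socket lock, waits for the listen worker with flush_work(), and takes the lock again. The lock has to be released first because smc_tcp_listen_work() itself holds the socket lock while clearing clcsock (it calls release_sock(lsk) afterwards), so waiting with the lock held would deadlock. Below is a loose userspace analogy using POSIX threads: a mutex stands in for the socket lock, a condition variable for the kernel_accept() wakeup, and pthread_join() for flush_work(). flush_work() waits for a queued work item rather than for a thread to exit, so the mapping is approximate, and all names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Loose analogy: mutex = socket lock, condvar = kernel_accept() wakeup,
 * pthread_join() = flush_work(). All names are hypothetical. */
struct listen_sock {
	pthread_mutex_t lock;
	pthread_cond_t wake;
	int stop;		/* set when the listen socket is shut down */
	void *clcsock;		/* non-NULL while the worker still owns it */
	pthread_t worker;
};

/* Worker: block until woken, then give up clcsock under the lock. */
static void *listen_work(void *arg)
{
	struct listen_sock *ls = arg;

	pthread_mutex_lock(&ls->lock);
	while (!ls->stop)
		pthread_cond_wait(&ls->wake, &ls->lock);
	ls->clcsock = NULL;
	pthread_mutex_unlock(&ls->lock);
	return NULL;
}

static int listen_start(struct listen_sock *ls, void *clcsock)
{
	pthread_mutex_init(&ls->lock, NULL);
	pthread_cond_init(&ls->wake, NULL);
	ls->stop = 0;
	ls->clcsock = clcsock;
	return pthread_create(&ls->worker, NULL, listen_work, ls);
}

/* Called with ls->lock held, like the listen branch of smc_close_active(). */
static void close_listen_locked(struct listen_sock *ls)
{
	ls->stop = 1;				/* ~ kernel_sock_shutdown() */
	pthread_cond_signal(&ls->wake);		/* ~ sk_data_ready() */
	pthread_mutex_unlock(&ls->lock);	/* ~ release_sock(): worker needs it */
	pthread_join(ls->worker, NULL);		/* ~ flush_work() */
	pthread_mutex_lock(&ls->lock);		/* ~ lock_sock() */
}
```

Swapping the unlock and pthread_join() calls would deadlock, because the worker cannot finish without taking the lock; that is why the patch brackets flush_work() with release_sock() and lock_sock().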



[PATCH net-next 2/4] net: Convert sctp_ctrlsock_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations create and destroy net::sctp::ctl_sock.
Since these pernet_operations do not send SCTP packets to each other,
they look safe to be marked as async.

Signed-off-by: Kirill Tkhai 
---
 net/sctp/protocol.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 32be52304f98..606361ee9e4a 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1354,6 +1354,7 @@ static void __net_init sctp_ctrlsock_exit(struct net *net)
 static struct pernet_operations sctp_ctrlsock_ops = {
.init = sctp_ctrlsock_init,
.exit = sctp_ctrlsock_exit,
+   .async = true,
 };
 
 /* Initialize the universe into something sensible.  */



[PATCH net-next 0/4] Converting pernet_operations (part #6)

2018-03-13 Thread Kirill Tkhai
Hi,

this series continues reviewing and converting pernet_operations
so that they can be executed in parallel for several net namespaces
at the same time. This series covers sctp, tipc and rds.

Thanks,
Kirill
---

Kirill Tkhai (4):
  net: Convert sctp_defaults_ops
  net: Convert sctp_ctrlsock_ops
  net: Convert tipc_net_ops
  net: Convert rds_tcp_net_ops


 net/rds/tcp.c   |1 +
 net/sctp/protocol.c |2 ++
 net/tipc/core.c |1 +
 3 files changed, 4 insertions(+)

--
Signed-off-by: Kirill Tkhai 


[PATCH net-next nfs 5/6] net: Convert nfs4blocklayout_net_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations create and destroy a per-net pipe
and dentry, and they seem safe to be marked as async.

Signed-off-by: Kirill Tkhai 
---
 fs/nfs/blocklayout/rpc_pipefs.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/blocklayout/rpc_pipefs.c b/fs/nfs/blocklayout/rpc_pipefs.c
index 9fb067a6f7e0..ef9fa111b009 100644
--- a/fs/nfs/blocklayout/rpc_pipefs.c
+++ b/fs/nfs/blocklayout/rpc_pipefs.c
@@ -261,6 +261,7 @@ static void nfs4blocklayout_net_exit(struct net *net)
 static struct pernet_operations nfs4blocklayout_net_ops = {
.init = nfs4blocklayout_net_init,
.exit = nfs4blocklayout_net_exit,
+   .async = true,
 };
 
 int __init bl_init_pipefs(void)



Re: [PATCH net-next v2] sctp: fix error return code in sctp_sendmsg_new_asoc()

2018-03-13 Thread Neil Horman
On Tue, Mar 13, 2018 at 03:03:30AM +, Wei Yongjun wrote:
> Return error code -EINVAL in the address len check error handling
> case since 'err' can be overwrite to 0 by 'err = sctp_verify_addr()'
> in the for loop.
> 
> Fixes: 2c0dbaa0c43d ("sctp: add support for SCTP_DSTADDRV4/6 Information for sendmsg")
> Signed-off-by: Wei Yongjun 
> Acked-by: Neil Horman 
> ---
> v1 -> v2: remove the 'err' initialization
> ---
>  net/sctp/socket.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 7d3476a..af5cf29 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1677,7 +1677,7 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>   struct sctp_association *asoc;
>   enum sctp_scope scope;
>   struct cmsghdr *cmsg;
> - int err = -EINVAL;
> + int err;
>  
>   *tp = NULL;
>  
> @@ -1761,16 +1761,20 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>   memset(daddr, 0, sizeof(*daddr));
>   dlen = cmsg->cmsg_len - sizeof(struct cmsghdr);
>   if (cmsg->cmsg_type == SCTP_DSTADDRV4) {
> - if (dlen < sizeof(struct in_addr))
> + if (dlen < sizeof(struct in_addr)) {
> + err = -EINVAL;
>   goto free;
> + }
>  
>   dlen = sizeof(struct in_addr);
>   daddr->v4.sin_family = AF_INET;
>   daddr->v4.sin_port = htons(asoc->peer.port);
>   memcpy(&daddr->v4.sin_addr, CMSG_DATA(cmsg), dlen);
>   } else {
> - if (dlen < sizeof(struct in6_addr))
> + if (dlen < sizeof(struct in6_addr)) {
> + err = -EINVAL;
>   goto free;
> + }
>  
>   dlen = sizeof(struct in6_addr);
>   daddr->v6.sin6_family = AF_INET6;
> 
> 
Acked-by: Neil Horman 
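The bug pattern is a shared err variable that an earlier call resets to 0, so a later goto to the error path reports success. A userspace sketch of that control flow, with v2's explicit err = -EINVAL on the failing branch, follows; verify_addr_stub() is a hypothetical stand-in for sctp_verify_addr(), and dlens[] models the per-cmsg address lengths.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for sctp_verify_addr(): 0 on success. */
static int verify_addr_stub(size_t len)
{
	return len <= 64 ? 0 : -EINVAL;
}

/* Models the loop in sctp_sendmsg_new_asoc(): each iteration checks an
 * address length, then calls the verifier, which resets err to 0. */
static int new_asoc(const size_t *dlens, size_t n, size_t min_len)
{
	int err;
	size_t i;

	for (i = 0; i < n; i++) {
		if (dlens[i] < min_len) {
			/* v2 fix: without this, err may still hold the 0
			 * left by the previous iteration's verifier call. */
			err = -EINVAL;
			goto free;
		}
		err = verify_addr_stub(dlens[i]);
		if (err)
			goto free;
	}
	return 0;

free:
	/* roughly where the real function frees the new association */
	return err;
}
```

With dlens = {4, 2} and min_len 4, the first iteration leaves err at 0; dropping the err = -EINVAL line would make the second iteration's length failure return 0, which is the bug the patch addresses.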


[PATCH net-next nfs 0/6] Converting pernet_operations (part #7)

2018-03-13 Thread Kirill Tkhai
Hi,

this series continues reviewing and converting pernet_operations
so that they can be executed in parallel for several net namespaces
at the same time. This series covers the nfs pernet_operations. They
are all similar to each other: with small exceptions, they mostly
create and destroy caches.

Also included is rxrpc_net_ops, which is used by AFS.

Thanks,
Kirill
---

Kirill Tkhai (6):
  net: Convert rpcsec_gss_net_ops
  net: Convert sunrpc_net_ops
  net: Convert nfsd_net_ops
  net: Convert nfs4_dns_resolver_ops
  net: Convert nfs4blocklayout_net_ops
  net: Convert rxrpc_net_ops


 fs/nfs/blocklayout/rpc_pipefs.c |1 +
 fs/nfs/dns_resolve.c|1 +
 fs/nfsd/nfsctl.c|1 +
 net/rxrpc/net_ns.c  |1 +
 net/sunrpc/auth_gss/auth_gss.c  |1 +
 net/sunrpc/sunrpc_syms.c|1 +
 6 files changed, 6 insertions(+)

--
Signed-off-by: Kirill Tkhai 


[PATCH net-next nfs 2/6] net: Convert sunrpc_net_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations look similar to rpcsec_gss_net_ops;
they just create and destroy other caches, so they can also
be async.

Signed-off-by: Kirill Tkhai 
---
 net/sunrpc/sunrpc_syms.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index 56f9eff74150..68287e921847 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -79,6 +79,7 @@ static struct pernet_operations sunrpc_net_ops = {
.exit = sunrpc_exit_net,
.id = &sunrpc_net_id,
.size = sizeof(struct sunrpc_net),
+   .async = true,
 };
 
 static int __init



[PATCH net-next nfs 1/6] net: Convert rpcsec_gss_net_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations initialize and destroy the per-net items
referenced by sunrpc_net_id. The only global list they use is
cache_list, and accesses to it are already serialized.

sunrpc_destroy_cache_detail() checks list_empty() without holding
cache_list_lock, but when it is called from unregister_pernet_subsys()
there can be no parallel callers, so the unlocked list_empty() check
is safe in this case.

Signed-off-by: Kirill Tkhai 
---
 net/sunrpc/auth_gss/auth_gss.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 9463af4b32e8..44f939cb6bc8 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -2063,6 +2063,7 @@ static __net_exit void rpcsec_gss_exit_net(struct net *net)
 static struct pernet_operations rpcsec_gss_net_ops = {
.init = rpcsec_gss_init_net,
.exit = rpcsec_gss_exit_net,
+   .async = true,
 };
 
 /*



[PATCH net-next nfs 3/6] net: Convert nfsd_net_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations look similar to rpcsec_gss_net_ops;
they just create and destroy other caches, so they can also
be async.

Signed-off-by: Kirill Tkhai 
---
 fs/nfsd/nfsctl.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index d107b4426f7e..1e3824e6cce0 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1263,6 +1263,7 @@ static struct pernet_operations nfsd_net_ops = {
.exit = nfsd_exit_net,
.id   = &nfsd_net_id,
.size = sizeof(struct nfsd_net),
+   .async = true,
 };
 
 static int __init init_nfsd(void)



[PATCH net-next nfs 4/6] net: Convert nfs4_dns_resolver_ops

2018-03-13 Thread Kirill Tkhai
These pernet_operations look similar to rpcsec_gss_net_ops;
they just create and destroy another cache. They also create
and destroy a directory, so they also look safe to be async.

Signed-off-by: Kirill Tkhai 
---
 fs/nfs/dns_resolve.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c
index 060c658eab66..e90bd69ab653 100644
--- a/fs/nfs/dns_resolve.c
+++ b/fs/nfs/dns_resolve.c
@@ -410,6 +410,7 @@ static void nfs4_dns_net_exit(struct net *net)
 static struct pernet_operations nfs4_dns_resolver_ops = {
.init = nfs4_dns_net_init,
.exit = nfs4_dns_net_exit,
+   .async = true,
 };
 
 static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event,


