Re: [PATCH net 2/2] sfp: fix module initialisation with netdev already up
From: Russell King Date: Tue, 10 Jul 2018 12:05:36 +0100 > It has been observed that with a particular order of initialisation, > the netdev can be up, but the SFP module still has its TX_DISABLE > signal asserted. This occurs when the network device is brought up before > the SFP kernel module has been inserted by userspace. > > This occurs because the sfp-bus layer does not hear about the change in > network device state, and so assumes that it is still down. Set > netdev->sfp when the upstream is registered to work around this problem. > > Signed-off-by: Russell King Applied.
Re: [PATCH net 1/2] sfp: ensure we clean up properly on bus registration failure
From: Russell King Date: Tue, 10 Jul 2018 12:05:31 +0100 > We fail to correctly clean up after a bus registration failure, which > can lead to an incorrect assumption about the registration state of > the upstream or sfp cage. > > Signed-off-by: Russell King Applied.
Re: [PATCH net-next 0/3] mlxsw: ERSPAN: Take LACP state into consideration
From: Ido Schimmel Date: Tue, 10 Jul 2018 10:02:56 +0300 > Petr says: > > When offloading mirror-to-gretap, mlxsw needs to preroute the path that > the encapsulated packet will take. That path may include a LAG device > above a front panel port. So far, mlxsw resolved the path to the first > up front panel slave of the LAG interface, but that only reflects > administrative state of the port. It neglects to consider whether the > port actually has a carrier, and what the LACP state is. This patch set > aims to address these problems. > > Patch #1 publishes team_port_get_rcu(). > > Then in patch #2, a new function is introduced, > mlxsw_sp_port_dev_check(). That returns, for a given netdevice that is a > slave of a LAG device, whether that device is "txable", i.e. whether the > LAG master would send traffic through it. Since there's no good place to > put LAG-wide helpers, introduce a new header include/net/lag.h. > > Finally in patch #3, fix the slave selection logic to take into > consideration whether a given slave has a carrier and whether it is > txable. Series applied, thank you.
Re: [PATCH net-next] macvlan: Change status when lower device goes down
From: Travis Brown Date: Tue, 10 Jul 2018 00:35:01 + > Today macvlan ignores the notification when a lower device goes > administratively down, preventing the lack of connectivity from > bubbling up. > > Processing NETDEV_DOWN results in a macvlan state of LOWERLAYERDOWN > with NO-CARRIER which should be easy to interpret in userspace. > > 2: lower: mtu 1500 qdisc mq state DOWN mode DEFAULT > group default qlen 1000 > 3: macvlan@lower: mtu 1500 qdisc > noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000 > > Signed-off-by: Suresh Krishnan > Signed-off-by: Travis Brown Seems reasonable, applied, thanks.
Re: [net-next 0/7][pull request] L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09
From: Jeff Kirsher Date: Mon, 9 Jul 2018 15:20:35 -0700 > This patch series is meant to allow support for the L2 forward offload, aka > MACVLAN offload without the need for using ndo_select_queue. > > The existing solution currently requires that we use ndo_select_queue in > the transmit path if we want to associate specific Tx queues with a given > MACVLAN interface. In order to get away from this we need to repurpose the > tc_to_txq array and XPS pointer for the MACVLAN interface and use those as > a means of accessing the queues on the lower device. As a result we cannot > offload a device that is configured as multiqueue, however it doesn't > really make sense to configure a macvlan interface as being multiqueue > anyway since it doesn't really have a qdisc of its own in the first place. > > The big changes in this set are: > Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN > Disable XPS for single queue devices > Replace accel_priv with sb_dev in ndo_select_queue > Add sb_dev parameter to fallback function for ndo_select_queue > Consolidated ndo_select_queue functions that appeared to be duplicates > > The following are changes since commit > c47078d6a33fd78d882200cdaacbcfcd63318234: > tcp: remove redundant SOCK_DONE checks > and are available in the git repository at: > git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE Pulled, thanks Jeff.
Re: [PATCH net-next] tcp: expose both send and receive intervals for rate sample
From: Deepti Raghavan Date: Mon, 9 Jul 2018 17:53:39 + > Congestion control algorithms, which access the rate sample > through the tcp_cong_control function, only have access to the maximum > of the send and receive interval, for cases where the acknowledgment > rate may be inaccurate due to ACK compression or decimation. Algorithms > may want to use send rates and receive rates as separate signals. > > Signed-off-by: Deepti Raghavan Applied.
[PATCH] scripts/tags.sh: Add BPF_CALL
Signed-off-by: Constantine Shulyupin
---
 scripts/tags.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/tags.sh b/scripts/tags.sh
index 66f08bb1cce9..db0d56ebe9b9 100755
--- a/scripts/tags.sh
+++ b/scripts/tags.sh
@@ -152,6 +152,7 @@ regex_asm=(
 )
 regex_c=(
 	'/^SYSCALL_DEFINE[0-9](\([[:alnum:]_]*\).*/sys_\1/'
+	'/^BPF_CALL_[0-9](\([[:alnum:]_]*\).*/\1/'
 	'/^COMPAT_SYSCALL_DEFINE[0-9](\([[:alnum:]_]*\).*/compat_sys_\1/'
 	'/^TRACE_EVENT(\([[:alnum:]_]*\).*/trace_\1/'
 	'/^TRACE_EVENT(\([[:alnum:]_]*\).*/trace_\1_rcuidle/'
-- 
2.17.1
Re: [PATCH net-next] net: sched: fix unprotected access to rcu cookie pointer
From: Vlad Buslov Date: Mon, 9 Jul 2018 20:26:47 +0300 > Fix action attribute size calculation function to take rcu read lock and > access act_cookie pointer with rcu dereference. > > Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") > Reported-by: Marcelo Ricardo Leitner > Signed-off-by: Vlad Buslov Applied.
Re: [PATCH net-next 0/2] cxgb4: move stats fetched from firmware to debugfs
From: Rahul Lakkireddy Date: Mon, 9 Jul 2018 21:42:45 +0530 > Some stats are fetched via slow firmware mailbox, which can cause > packet drops under heavy load. So, this series removes these stats > from ethtool -S and expose them via debugfs. > > Patch 1 removes stats fetched via firmware from ethtool -S. > Patch 2 exposes stats removed in Patch 1 via debugfs. Series applied, thanks.
Re: [PATCH net-next] net: sched: act_ife: fix memory leak in ife init
From: Vlad Buslov Date: Mon, 9 Jul 2018 14:33:26 +0300 > Free params if tcf_idr_check_alloc() returned error. > > Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") > Reported-by: Dan Carpenter > Signed-off-by: Vlad Buslov Applied.
Re: [PATCH net-next] cxgb4: specify IQTYPE in fw_iq_cmd
From: Ganesh Goudar Date: Mon, 9 Jul 2018 16:52:03 +0530 > From: Arjun Vynipadath > > congestion argument passed to t4_sge_alloc_rxq() is used > to differentiate between nic/ofld queues. > > Signed-off-by: Arjun Vynipadath > Signed-off-by: Ganesh Goudar Applied.
Re: [PATCH net v2 0/5] net/ipv6: addr_gen_mode fixes
From: Sabrina Dubroca Date: Mon, 9 Jul 2018 12:25:13 +0200 > This series fixes bugs in handling of the addr_gen_mode option, mainly > related to the sysctl. A minor netlink issue was also present in the > initial commit introducing the option on a per-netdevice basis. > > v2: add patch 4, requested by David Ahern during review of v1 > add patch 5, missing documentation for the sysctl > patches 1, 2, 3 are unchanged I know there is still some discussion going on about sysctl semantics, but I'll apply this for now and any further refinements can be submitted on top. Thanks.
Re: [PATCH resend] rhashtable: detect when object movement might have invalidated a lookup
From: David Miller Date: Wed, 11 Jul 2018 22:46:58 -0700 (PDT) > From: NeilBrown > Date: Fri, 06 Jul 2018 17:08:35 +1000 > >> >> Some users of rhashtable might need to change the key >> of an object and move it to a different location in the table. >> Other users might want to allocate objects using >> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation >> being used for a different (type-compatible) purpose and similarly >> end up in a different hash-chain. >> >> To support these, we store a unique NULLS_MARKER at the end of >> each chain, and when a search fails to find a match, we check >> if the NULLS marker found was the expected one. If not, >> the search is repeated. >> >> The unique NULLS_MARKER is derived from the address of the >> head of the chain. >> >> If an object is removed and re-added to the same hash chain, we won't >> notice by looking at the NULLS marker. In this case we must be sure >> that it was not re-added *after* its original location, or a lookup may >> incorrectly fail. The easiest solution is to ensure it is inserted at >> the start of the chain. insert_slow() already does that, >> insert_fast() does not. So this patch changes insert_fast to always >> insert at the head of the chain. >> >> Note that such a user must do their own double-checking of >> the object found by rhashtable_lookup_fast() after ensuring >> mutual exclusion with anything that might change the key, such as >> successfully taking a new reference. >> >> Signed-off-by: NeilBrown > > Applied to net-next. Actually, reverted, it doesn't even compile. lib/rhashtable.c: In function ‘rht_bucket_nested’: lib/rhashtable.c:1187:39: error: macro "INIT_RHT_NULLS_HEAD" passed 3 arguments, but takes just 1 INIT_RHT_NULLS_HEAD(rhnull, NULL, 0); ^ lib/rhashtable.c:1187:4: error: ‘INIT_RHT_NULLS_HEAD’ undeclared (first use in this function); did you mean ‘INIT_LIST_HEAD’? 
INIT_RHT_NULLS_HEAD(rhnull, NULL, 0); ^~~ INIT_LIST_HEAD lib/rhashtable.c:1187:4: note: each undeclared identifier is reported only once for each function it appears in
Re: [PATCH net-next] net/sched: flower: Fix null pointer dereference when run tc vlan command
From: Jianbo Liu Date: Mon, 9 Jul 2018 02:26:20 + > Zahari issued a tc vlan command without setting vlan_ethtype, which will > crash the kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] > is not null before using it. > Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. > > Fixes: d64efd0926ba ('net/sched: flower: Add supprt for matching on QinQ vlan > headers') > Signed-off-by: Jianbo Liu > Reported-by: Zahari Doychev Applied.
[PATCH net-next] net/tls: Removed redundant variable from 'struct tls_sw_context_rx'
The variable 'decrypted' in 'struct tls_sw_context_rx' is redundant and is being set/unset without purpose. Simplified the code by removing it. Signed-off-by: Vakul Garg --- include/net/tls.h | 1 - net/tls/tls_sw.c | 87 --- 2 files changed, 38 insertions(+), 50 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 70c273777fe9..528d0c2d6cc2 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -113,7 +113,6 @@ struct tls_sw_context_rx { struct poll_table_struct *wait); struct sk_buff *recv_pkt; u8 control; - bool decrypted; char rx_aad_ciphertext[TLS_AAD_SPACE_SIZE]; char rx_aad_plaintext[TLS_AAD_SPACE_SIZE]; diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 0d670c8adf18..e5f2de2c3fd6 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -81,8 +81,6 @@ static int tls_do_decryption(struct sock *sk, rxm->full_len -= tls_ctx->rx.overhead_size; tls_advance_record_sn(sk, &tls_ctx->rx); - ctx->decrypted = true; - ctx->saved_data_ready(sk); out: @@ -756,6 +754,9 @@ int tls_sw_recvmsg(struct sock *sk, bool cmsg = false; int target, err = 0; long timeo; + int page_count; + int to_copy; + flags |= nonblock; @@ -792,46 +793,38 @@ int tls_sw_recvmsg(struct sock *sk, goto recv_end; } - if (!ctx->decrypted) { - int page_count; - int to_copy; - - page_count = iov_iter_npages(&msg->msg_iter, -MAX_SKB_FRAGS); - to_copy = rxm->full_len - tls_ctx->rx.overhead_size; - if (to_copy <= len && page_count < MAX_SKB_FRAGS && - likely(!(flags & MSG_PEEK))) { - struct scatterlist sgin[MAX_SKB_FRAGS + 1]; - int pages = 0; - - zc = true; - sg_init_table(sgin, MAX_SKB_FRAGS + 1); - sg_set_buf(&sgin[0], ctx->rx_aad_plaintext, - TLS_AAD_SPACE_SIZE); - - err = zerocopy_from_iter(sk, &msg->msg_iter, -to_copy, &pages, -&chunk, &sgin[1], -MAX_SKB_FRAGS, false); - if (err < 0) - goto fallback_to_reg_recv; - - err = decrypt_skb(sk, skb, sgin); - for (; pages > 0; pages--) - put_page(sg_page(&sgin[pages])); - if (err < 0) { - tls_err_abort(sk, EBADMSG); - goto recv_end; - } - } 
else { + page_count = iov_iter_npages(&msg->msg_iter, MAX_SKB_FRAGS); + to_copy = rxm->full_len - tls_ctx->rx.overhead_size; + + if (to_copy <= len && page_count < MAX_SKB_FRAGS && + likely(!(flags & MSG_PEEK))) { + struct scatterlist sgin[MAX_SKB_FRAGS + 1]; + int pages = 0; + + zc = true; + sg_init_table(sgin, MAX_SKB_FRAGS + 1); + sg_set_buf(&sgin[0], ctx->rx_aad_plaintext, + TLS_AAD_SPACE_SIZE); + err = zerocopy_from_iter(sk, &msg->msg_iter, to_copy, +&pages, &chunk, &sgin[1], +MAX_SKB_FRAGS, false); + if (err < 0) + goto fallback_to_reg_recv; + + err = decrypt_skb(sk, skb, sgin); + for (; pages > 0; pages--) + put_page(sg_page(&sgin[pages])); + if (err < 0) { + tls_err_abort(sk, EBADMSG); + goto recv_end; + } + } else { fallback_to_reg_recv: - err = decrypt_skb(sk, skb, NULL); - if (err < 0) { - tls_err_abort(sk, EBADMSG); - goto recv_end; - } + err = decrypt_skb(sk, skb, NULL); + if (err < 0) { + tls_err_abort(sk, EBADMSG); + goto recv_end; } -
Re: [PATCH iproute2-next] ipaddress: fix label matching
❦ 11 July 2018 21:01 -0400, David Ahern : >> +++ b/ip/ipaddress.c >> @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, >> if (!name) >> return -1; >> >> -if (filter.label && >> -(!filter.family || filter.family == AF_PACKET) && >> -fnmatch(filter.label, name, 0)) >> -return -1; >> - > > The offending commit changed the return code: > > if (filter.label && > (!filter.family || filter.family == AF_PACKET) && > - fnmatch(filter.label, RTA_DATA(tb[IFLA_IFNAME]), 0)) > - return 0; > + fnmatch(filter.label, name, 0)) > + return -1; > > > Vincent: can you try leaving the code as is, but change the return to 0? Yes, it works by just returning 0. The code still doesn't make sense. -- Many pages make a thick book, except for pocket Bibles which are on very very thin paper.
Re: Reply: [PATCH] net: convert gro_count to bitmask
From: "Li,Rongqing" Date: Thu, 12 Jul 2018 03:03:51 + > > >> -Original Message- >> From: David Miller [mailto:da...@davemloft.net] >> Sent: 12 Jul 2018 10:49 >> To: Li,Rongqing >> Cc: netdev@vger.kernel.org >> Subject: Re: [PATCH] net: convert gro_count to bitmask >> >> From: Li RongQing >> Date: Wed, 11 Jul 2018 17:15:53 +0800 >> >> > + clear_bit(index, &napi->gro_bitmask); >> >> Please don't use atomics here, at least use __clear_bit(). >> > > Thanks, this is same as Eric's suggestion. > > >> This is why I did the operations by hand in my version of the patch. >> Also, if you are going to preempt my patch, at least retain the comment I >> added around the GRO_HASH_BUCKETS definitions which warns the reader >> about the limit. >> > > I added a BUILD_BUG_ON in netdev_init, so I think we do not need to add the comment That's a good compile time check, but the person thinking about editing the definition doesn't see the limit in the header file nor know why the limit exists in the first place.
[PATCH bpf-next 1/7] xdp: add per mode attributes for attached programs
In preparation for support of simultaneous driver and hardware XDP support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID will still be reported, but user space can now also access the program ID in a new IFLA_XDP__PROG_ID attribute. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- include/uapi/linux/if_link.h | 3 +++ net/core/rtnetlink.c | 30 ++ 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index cf01b6824244..bc86c2b105ec 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -928,6 +928,9 @@ enum { IFLA_XDP_ATTACHED, IFLA_XDP_FLAGS, IFLA_XDP_PROG_ID, + IFLA_XDP_DRV_PROG_ID, + IFLA_XDP_SKB_PROG_ID, + IFLA_XDP_HW_PROG_ID, __IFLA_XDP_MAX, }; diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index e3f743c141b3..8ab95de1114c 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -964,7 +964,8 @@ static size_t rtnl_xdp_size(void) { size_t xdp_size = nla_total_size(0) + /* nest IFLA_XDP */ nla_total_size(1) + /* XDP_ATTACHED */ - nla_total_size(4);/* XDP_PROG_ID */ + nla_total_size(4) + /* XDP_PROG_ID */ + nla_total_size(4);/* XDP__PROG_ID */ return xdp_size; } @@ -1378,16 +1379,17 @@ static u8 rtnl_xdp_attached_mode(struct net_device *dev, u32 *prog_id) static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) { + u32 prog_attr, prog_id; struct nlattr *xdp; - u32 prog_id; int err; + u8 mode; xdp = nla_nest_start(skb, IFLA_XDP); if (!xdp) return -EMSGSIZE; - err = nla_put_u8(skb, IFLA_XDP_ATTACHED, -rtnl_xdp_attached_mode(dev, &prog_id)); + mode = rtnl_xdp_attached_mode(dev, &prog_id); + err = nla_put_u8(skb, IFLA_XDP_ATTACHED, mode); if (err) goto err_cancel; @@ -1395,6 +1397,26 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) err = nla_put_u32(skb, IFLA_XDP_PROG_ID, prog_id); if (err) goto err_cancel; + + switch (mode) { + case XDP_ATTACHED_DRV: + prog_attr = IFLA_XDP_DRV_PROG_ID; + break; + 
case XDP_ATTACHED_SKB: + prog_attr = IFLA_XDP_SKB_PROG_ID; + break; + case XDP_ATTACHED_HW: + prog_attr = IFLA_XDP_HW_PROG_ID; + break; + case XDP_ATTACHED_NONE: + default: + err = -EINVAL; + goto err_cancel; + } + + err = nla_put_u32(skb, prog_attr, prog_id); + if (err) + goto err_cancel; } nla_nest_end(skb, xdp); -- 2.17.1
[PATCH bpf-next 4/7] xdp: support simultaneous driver and hw XDP attachment
Split the query of HW-attached program from the software one. Introduce new .ndo_bpf command to query HW-attached program. This will allow drivers to install different programs in HW and SW at the same time. Netlink can now also carry multiple programs on dump (in which case mode will be set to XDP_ATTACHED_MULTI and user has to check per-attachment point attributes, IFLA_XDP_PROG_ID will not be present). We reuse IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size() doesn't need to be updated. Note that the installation side is still not there, since all drivers currently reject installing more than one program at the time. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- .../ethernet/netronome/nfp/nfp_net_common.c | 6 ++ drivers/net/netdevsim/bpf.c | 6 ++ include/linux/netdevice.h | 7 +- include/uapi/linux/if_link.h | 1 + net/core/dev.c| 45 + net/core/rtnetlink.c | 93 +++ 6 files changed, 96 insertions(+), 62 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 4bb589dbffbc..bb1e72e8dbc2 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3453,6 +3453,12 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_bpf *xdp) case XDP_SETUP_PROG_HW: return nfp_net_xdp_setup(nn, xdp); case XDP_QUERY_PROG: + if (nn->dp.bpf_offload_xdp) + return 0; + return xdp_attachment_query(&nn->xdp, xdp); + case XDP_QUERY_PROG_HW: + if (!nn->dp.bpf_offload_xdp) + return 0; return xdp_attachment_query(&nn->xdp, xdp); default: return nfp_app_bpf(nn->app, nn, xdp); diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index c485d97b5df4..5544c9b51173 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -561,6 +561,12 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) nsim_bpf_destroy_prog(bpf->offload.prog); return 0; case XDP_QUERY_PROG: + if 
(ns->xdp_prog_mode != XDP_ATTACHED_DRV) + return 0; + return xdp_attachment_query(&ns->xdp, bpf); + case XDP_QUERY_PROG_HW: + if (ns->xdp_prog_mode != XDP_ATTACHED_HW) + return 0; return xdp_attachment_query(&ns->xdp, bpf); case XDP_SETUP_PROG: err = nsim_setup_prog_checks(ns, bpf); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 69a664789b33..2422c0e88f5c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -820,6 +820,7 @@ enum bpf_netdev_command { XDP_SETUP_PROG, XDP_SETUP_PROG_HW, XDP_QUERY_PROG, + XDP_QUERY_PROG_HW, /* BPF program for offload callbacks, invoked at program load time. */ BPF_OFFLOAD_VERIFIER_PREP, BPF_OFFLOAD_TRANSLATE, @@ -843,7 +844,7 @@ struct netdev_bpf { struct bpf_prog *prog; struct netlink_ext_ack *extack; }; - /* XDP_QUERY_PROG */ + /* XDP_QUERY_PROG, XDP_QUERY_PROG_HW */ struct { u32 prog_id; /* flags with which program was installed */ @@ -3533,8 +3534,8 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, int fd, u32 flags); -void __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op, -struct netdev_bpf *xdp); +u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op, + enum bpf_netdev_command cmd); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index bc86c2b105ec..8759cfb8aa2e 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -920,6 +920,7 @@ enum { XDP_ATTACHED_DRV, XDP_ATTACHED_SKB, XDP_ATTACHED_HW, + XDP_ATTACHED_MULTI, }; enum { diff --git a/net/core/dev.c b/net/core/dev.c index 0bc8fee2156b..00880c3e9af5 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7592,21 +7592,19 @@ int dev_change_proto_down(struct 
net_device *dev, bool proto_down) } EXPORT_SYMBOL(dev_change_proto_down); -void __dev_xdp_query(struct net_device *dev, bpf_op_t bpf_op, -struct netdev_bpf *xdp) +u32 __dev_xd
[PATCH bpf-next 3/7] xdp: factor out common program/flags handling from drivers
Basic operations drivers perform during xdp setup and query can be moved to helpers in the core. Encapsulate program and flags into a structure and add helpers. Note that the structure is intended as the "main" program information source in the driver. Most drivers will additionally place the program pointer in their fast path or ring structures. The helpers don't have a huge impact now, but they will decrease the code duplication when programs can be installed in HW and driver at the same time. Encapsulating the basic operations in helpers will hopefully also reduce the number of changes to drivers which adopt them. Helpers could really be static inline, but they depend on definition of struct netdev_bpf which means they'd have to be placed in netdevice.h, an already 4500 line header. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 6 ++-- .../ethernet/netronome/nfp/nfp_net_common.c | 28 ++- drivers/net/netdevsim/bpf.c | 16 +++-- drivers/net/netdevsim/netdevsim.h | 4 +-- include/net/xdp.h | 13 +++ net/core/xdp.c| 34 +++ tools/testing/selftests/bpf/test_offload.py | 4 +-- 7 files changed, 67 insertions(+), 38 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 2a71a9ffd095..2021dda595b7 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -553,8 +553,7 @@ struct nfp_net_dp { * @rss_cfg:RSS configuration * @rss_key:RSS secret key * @rss_itbl: RSS indirection table - * @xdp_flags: Flags with which XDP prog was loaded - * @xdp_prog: XDP prog (for ctrl path, both DRV and HW modes) + * @xdp: Information about the attached XDP program * @max_r_vecs:Number of allocated interrupt vectors for RX/TX * @max_tx_rings: Maximum number of TX rings supported by the Firmware * @max_rx_rings: Maximum number of RX rings supported by the Firmware @@ -610,8 +609,7 @@ struct nfp_net { u8 
rss_key[NFP_NET_CFG_RSS_KEY_SZ]; u8 rss_itbl[NFP_NET_CFG_RSS_ITBL_SZ]; - u32 xdp_flags; - struct bpf_prog *xdp_prog; + struct xdp_attachment_info xdp; unsigned int max_tx_rings; unsigned int max_rx_rings; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index d20714598613..4bb589dbffbc 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3417,34 +3417,29 @@ nfp_net_xdp_setup_drv(struct nfp_net *nn, struct bpf_prog *prog, return nfp_net_ring_reconfig(nn, dp, extack); } -static int -nfp_net_xdp_setup(struct nfp_net *nn, struct bpf_prog *prog, u32 flags, - struct netlink_ext_ack *extack) +static int nfp_net_xdp_setup(struct nfp_net *nn, struct netdev_bpf *bpf) { struct bpf_prog *drv_prog, *offload_prog; int err; - if (nn->xdp_prog && (flags ^ nn->xdp_flags) & XDP_FLAGS_MODES) + if (!xdp_attachment_flags_ok(&nn->xdp, bpf)) return -EBUSY; /* Load both when no flags set to allow easy activation of driver path * when program is replaced by one which can't be offloaded. */ - drv_prog = flags & XDP_FLAGS_HW_MODE ? NULL : prog; - offload_prog = flags & XDP_FLAGS_DRV_MODE ? NULL : prog; + drv_prog = bpf->flags & XDP_FLAGS_HW_MODE ? NULL : bpf->prog; + offload_prog = bpf->flags & XDP_FLAGS_DRV_MODE ? 
NULL : bpf->prog; - err = nfp_net_xdp_setup_drv(nn, drv_prog, extack); + err = nfp_net_xdp_setup_drv(nn, drv_prog, bpf->extack); if (err) return err; - err = nfp_app_xdp_offload(nn->app, nn, offload_prog, extack); - if (err && flags & XDP_FLAGS_HW_MODE) + err = nfp_app_xdp_offload(nn->app, nn, offload_prog, bpf->extack); + if (err && bpf->flags & XDP_FLAGS_HW_MODE) return err; - if (nn->xdp_prog) - bpf_prog_put(nn->xdp_prog); - nn->xdp_prog = prog; - nn->xdp_flags = flags; + xdp_attachment_setup(&nn->xdp, bpf); return 0; } @@ -3456,12 +3451,9 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: case XDP_SETUP_PROG_HW: - return nfp_net_xdp_setup(nn, xdp->prog, xdp->flags, -xdp->extack); + return nfp_net_xdp_setup(nn, xdp); case XDP_QUERY_PROG: - xdp->prog_id = nn->xdp_prog ? nn->xdp_prog->aux->id : 0; - xdp->prog_flags = nn->xdp_prog ? nn->xdp_flags : 0; -
[PATCH bpf-next 5/7] netdevsim: add support for simultaneous driver and hw XDP
Allow netdevsim to accept driver and offload attachment of XDP BPF programs at the same time. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/netdevsim/bpf.c | 32 +++-- drivers/net/netdevsim/netdev.c | 3 +- drivers/net/netdevsim/netdevsim.h | 2 +- tools/testing/selftests/bpf/test_offload.py | 8 -- 4 files changed, 12 insertions(+), 33 deletions(-) diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 5544c9b51173..c36d2a768202 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -92,7 +92,7 @@ static const struct bpf_prog_offload_ops nsim_bpf_analyzer_ops = { static bool nsim_xdp_offload_active(struct netdevsim *ns) { - return ns->xdp_prog_mode == XDP_ATTACHED_HW; + return ns->xdp_hw.prog; } static void nsim_prog_set_loaded(struct bpf_prog *prog, bool loaded) @@ -195,11 +195,13 @@ static int nsim_xdp_offload_prog(struct netdevsim *ns, struct netdev_bpf *bpf) return nsim_bpf_offload(ns, bpf->prog, nsim_xdp_offload_active(ns)); } -static int nsim_xdp_set_prog(struct netdevsim *ns, struct netdev_bpf *bpf) +static int +nsim_xdp_set_prog(struct netdevsim *ns, struct netdev_bpf *bpf, + struct xdp_attachment_info *xdp) { int err; - if (!xdp_attachment_flags_ok(&ns->xdp, bpf)) + if (!xdp_attachment_flags_ok(xdp, bpf)) return -EBUSY; if (bpf->command == XDP_SETUP_PROG && !ns->bpf_xdpdrv_accept) { @@ -217,14 +219,7 @@ static int nsim_xdp_set_prog(struct netdevsim *ns, struct netdev_bpf *bpf) return err; } - xdp_attachment_setup(&ns->xdp, bpf); - - if (!bpf->prog) - ns->xdp_prog_mode = XDP_ATTACHED_NONE; - else if (bpf->command == XDP_SETUP_PROG) - ns->xdp_prog_mode = XDP_ATTACHED_DRV; - else - ns->xdp_prog_mode = XDP_ATTACHED_HW; + xdp_attachment_setup(xdp, bpf); return 0; } @@ -284,10 +279,6 @@ static int nsim_setup_prog_checks(struct netdevsim *ns, struct netdev_bpf *bpf) NSIM_EA(bpf->extack, "MTU too large w/ XDP enabled"); return -EINVAL; } - if (nsim_xdp_offload_active(ns)) { - 
NSIM_EA(bpf->extack, "xdp offload active, can't load drv prog"); - return -EBUSY; - } return 0; } @@ -561,25 +552,21 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) nsim_bpf_destroy_prog(bpf->offload.prog); return 0; case XDP_QUERY_PROG: - if (ns->xdp_prog_mode != XDP_ATTACHED_DRV) - return 0; return xdp_attachment_query(&ns->xdp, bpf); case XDP_QUERY_PROG_HW: - if (ns->xdp_prog_mode != XDP_ATTACHED_HW) - return 0; - return xdp_attachment_query(&ns->xdp, bpf); + return xdp_attachment_query(&ns->xdp_hw, bpf); case XDP_SETUP_PROG: err = nsim_setup_prog_checks(ns, bpf); if (err) return err; - return nsim_xdp_set_prog(ns, bpf); + return nsim_xdp_set_prog(ns, bpf, &ns->xdp); case XDP_SETUP_PROG_HW: err = nsim_setup_prog_hw_checks(ns, bpf); if (err) return err; - return nsim_xdp_set_prog(ns, bpf); + return nsim_xdp_set_prog(ns, bpf, &ns->xdp_hw); case BPF_OFFLOAD_MAP_ALLOC: if (!ns->bpf_map_accept) return -EOPNOTSUPP; @@ -635,5 +622,6 @@ void nsim_bpf_uninit(struct netdevsim *ns) WARN_ON(!list_empty(&ns->bpf_bound_progs)); WARN_ON(!list_empty(&ns->bpf_bound_maps)); WARN_ON(ns->xdp.prog); + WARN_ON(ns->xdp_hw.prog); WARN_ON(ns->bpf_offloaded); } diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c index b2f9d0df93b0..a7b179f0d954 100644 --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -228,8 +228,7 @@ static int nsim_change_mtu(struct net_device *dev, int new_mtu) { struct netdevsim *ns = netdev_priv(dev); - if (ns->xdp_prog_mode == XDP_ATTACHED_DRV && - new_mtu > NSIM_XDP_MAX_MTU) + if (ns->xdp.prog && new_mtu > NSIM_XDP_MAX_MTU) return -EBUSY; dev->mtu = new_mtu; diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h index 69ffb4a2d14b..0aeabbe81cc6 100644 --- a/drivers/net/netdevsim/netdevsim.h +++ b/drivers/net/netdevsim/netdevsim.h @@ -69,7 +69,7 @@ struct netdevsim { u32 bpf_offloaded_id; struct xdp_attachment_info xdp; - int xdp_prog_mode; + struct xdp_attachment_info 
xdp_hw; u32 prog_id_gen; diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/s
[PATCH bpf-next 7/7] nfp: add support for simultaneous driver and hw XDP
Split handling of offloaded and driver programs completely. Since offloaded programs always come with XDP_FLAGS_HW_MODE set in reality there could be no sharing, anyway, programs would only be installed in driver or in hardware. Splitting the handling allows us to install programs in HW and in driver at the same time. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 11 + drivers/net/ethernet/netronome/nfp/nfp_net.h | 6 +-- .../ethernet/netronome/nfp/nfp_net_common.c | 49 --- 3 files changed, 26 insertions(+), 40 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 4dbf7cba6377..b95b94d008cf 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -66,26 +66,19 @@ nfp_bpf_xdp_offload(struct nfp_app *app, struct nfp_net *nn, struct bpf_prog *prog, struct netlink_ext_ack *extack) { bool running, xdp_running; - int ret; if (!nfp_net_ebpf_capable(nn)) return -EINVAL; running = nn->dp.ctrl & NFP_NET_CFG_CTRL_BPF; - xdp_running = running && nn->dp.bpf_offload_xdp; + xdp_running = running && nn->xdp_hw.prog; if (!prog && !xdp_running) return 0; if (prog && running && !xdp_running) return -EBUSY; - ret = nfp_net_bpf_offload(nn, prog, running, extack); - /* Stop offload if replace not possible */ - if (ret) - return ret; - - nn->dp.bpf_offload_xdp = !!prog; - return ret; + return nfp_net_bpf_offload(nn, prog, running, extack); } static const char *nfp_bpf_extra_cap(struct nfp_app *app, struct nfp_net *nn) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 2021dda595b7..8970ec981e11 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -485,7 +485,6 @@ struct nfp_stat_pair { * @dev: Backpointer to struct device * @netdev:Backpointer to net_device structure * @is_vf: Is the 
driver attached to a VF? - * @bpf_offload_xdp: Offloaded BPF program is XDP * @chained_metadata_format: Firemware will use new metadata format * @rx_dma_dir:Mapping direction for RX buffers * @rx_dma_off:Offset at which DMA packets (for XDP headroom) @@ -510,7 +509,6 @@ struct nfp_net_dp { struct net_device *netdev; u8 is_vf:1; - u8 bpf_offload_xdp:1; u8 chained_metadata_format:1; u8 rx_dma_dir; @@ -553,7 +551,8 @@ struct nfp_net_dp { * @rss_cfg:RSS configuration * @rss_key:RSS secret key * @rss_itbl: RSS indirection table - * @xdp: Information about the attached XDP program + * @xdp: Information about the driver XDP program + * @xdp_hw:Information about the HW XDP program * @max_r_vecs:Number of allocated interrupt vectors for RX/TX * @max_tx_rings: Maximum number of TX rings supported by the Firmware * @max_rx_rings: Maximum number of RX rings supported by the Firmware @@ -610,6 +609,7 @@ struct nfp_net { u8 rss_itbl[NFP_NET_CFG_RSS_ITBL_SZ]; struct xdp_attachment_info xdp; + struct xdp_attachment_info xdp_hw; unsigned int max_tx_rings; unsigned int max_rx_rings; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index bb1e72e8dbc2..a712e83c3f0f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1710,8 +1710,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget) } } - if (xdp_prog && !(rxd->rxd.flags & PCIE_DESC_RX_BPF && - dp->bpf_offload_xdp) && !meta.portid) { + if (xdp_prog && !meta.portid) { void *orig_data = rxbuf->frag + pkt_off; unsigned int dma_off; int act; @@ -3393,14 +3392,18 @@ static void nfp_net_del_vxlan_port(struct net_device *netdev, nfp_net_set_vxlan_port(nn, idx, 0); } -static int -nfp_net_xdp_setup_drv(struct nfp_net *nn, struct bpf_prog *prog, - struct netlink_ext_ack *extack) +static int nfp_net_xdp_setup_drv(struct nfp_net *nn, struct netdev_bpf *bpf) { + struct bpf_prog *prog = 
bpf->prog; struct nfp_net_dp *dp; + int err; + + if (!xdp_attachment_flags_ok(&nn->xdp, bpf)) + return -EBUSY; if (!prog == !nn->dp.xdp_prog) { WRITE_ONCE(nn->dp.xdp_prog, prog); + xdp_attachment_s
[PATCH bpf-next 0/7] xdp: simultaneous driver and HW XDP
Hi! This set is adding support for loading driver and offload XDP at the same time. This enables advanced use cases where some of the work is offloaded to the NIC and some is done by the host. Separate netlink attributes are added for each mode of operation. Driver callbacks for offload are cleaned up a little, including removal of .prog_attached flag. Jakub Kicinski (7): xdp: add per mode attributes for attached programs xdp: don't make drivers report attachment mode xdp: factor out common program/flags handling from drivers xdp: support simultaneous driver and hw XDP attachment netdevsim: add support for simultaneous driver and hw XDP selftests/bpf: add test for multiple programs nfp: add support for simultaneous driver and hw XDP drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 1 - .../net/ethernet/cavium/thunder/nicvf_main.c | 1 - drivers/net/ethernet/intel/i40e/i40e_main.c | 1 - drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 1 - .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 - .../net/ethernet/mellanox/mlx4/en_netdev.c| 1 - .../net/ethernet/mellanox/mlx5/core/en_main.c | 1 - drivers/net/ethernet/netronome/nfp/bpf/main.c | 11 +-- drivers/net/ethernet/netronome/nfp/nfp_net.h | 10 ++- .../ethernet/netronome/nfp/nfp_net_common.c | 58 ++- .../net/ethernet/qlogic/qede/qede_filter.c| 1 - drivers/net/netdevsim/bpf.c | 41 --- drivers/net/netdevsim/netdev.c| 3 +- drivers/net/netdevsim/netdevsim.h | 6 +- drivers/net/tun.c | 1 - drivers/net/virtio_net.c | 1 - include/linux/netdevice.h | 12 ++-- include/net/xdp.h | 13 include/uapi/linux/if_link.h | 4 ++ net/core/dev.c| 48 +++-- net/core/rtnetlink.c | 71 ++- net/core/xdp.c| 34 + tools/testing/selftests/bpf/test_offload.py | 71 --- 23 files changed, 246 insertions(+), 146 deletions(-) -- 2.17.1
[PATCH bpf-next 6/7] selftests/bpf: add test for multiple programs
Add tests for having an XDP program attached in the driver and another one attached in HW simultaneously. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- tools/testing/selftests/bpf/test_offload.py | 63 + 1 file changed, 63 insertions(+) diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py index 4f982a0255c2..b746227eaff2 100755 --- a/tools/testing/selftests/bpf/test_offload.py +++ b/tools/testing/selftests/bpf/test_offload.py @@ -339,6 +339,11 @@ netns = [] # net namespaces to be removed self.dfs = DebugfsDir(self.dfs_dir) return self.dfs +def dfs_read(self, f): +path = os.path.join(self.dfs_dir, f) +_, data = cmd('cat %s' % (path)) +return data.strip() + def dfs_num_bound_progs(self): path = os.path.join(self.dfs_dir, "bpf_bound_progs") _, progs = cmd('ls %s' % (path)) @@ -814,6 +819,10 @@ netns = [] "Device parameters reported for non-offloaded program") start_test("Test XDP prog replace with bad flags...") +ret, _, err = sim.set_xdp(obj, "generic", force=True, + fail=False, include_stderr=True) +fail(ret == 0, "Replaced XDP program with a program in different mode") +fail(err.count("File exists") != 1, "Replaced driver XDP with generic") ret, _, err = sim.set_xdp(obj, "", force=True, fail=False, include_stderr=True) fail(ret == 0, "Replaced XDP program with a program in different mode") @@ -883,6 +892,60 @@ netns = [] rm(pin_file) bpftool_prog_list_wait(expected=0) +start_test("Test multi-attachment XDP - attach...") +sim.set_xdp(obj, "offload") +xdp = sim.ip_link_show(xdp=True)["xdp"] +offloaded = sim.dfs_read("bpf_offloaded_id") +fail("prog" not in xdp, "Base program not reported in single program mode") +fail(len(ipl["xdp"]["attached"]) != 1, + "Wrong attached program count with one program") + +sim.set_xdp(obj, "") +two_xdps = sim.ip_link_show(xdp=True)["xdp"] +offloaded2 = sim.dfs_read("bpf_offloaded_id") + +fail(two_xdps["mode"] != 4, "Bad mode reported with multiple programs") +fail("prog" 
in two_xdps, "Base program reported in multi program mode") +fail(xdp["attached"][0] not in two_xdps["attached"], + "Offload program not reported after driver activated") +fail(len(two_xdps["attached"]) != 2, + "Wrong attached program count with two programs") +fail(two_xdps["attached"][0]["prog"]["id"] == + two_xdps["attached"][1]["prog"]["id"], + "offloaded and drv programs have the same id") +fail(offloaded != offloaded2, + "offload ID changed after loading driver program") + +start_test("Test multi-attachment XDP - replace...") +ret, _, err = sim.set_xdp(obj, "offload", fail=False, include_stderr=True) +fail(err.count("busy") != 1, "Replaced one of programs without -force") + +start_test("Test multi-attachment XDP - detach...") +ret, _, err = sim.unset_xdp("drv", force=True, +fail=False, include_stderr=True) +fail(ret == 0, "Removed program with a bad mode") +check_extack(err, "program loaded with different flags.", args) + +sim.unset_xdp("offload") +xdp = sim.ip_link_show(xdp=True)["xdp"] +offloaded = sim.dfs_read("bpf_offloaded_id") + +fail(xdp["mode"] != 1, "Bad mode reported after multiple programs") +fail("prog" not in xdp, + "Base program not reported after multi program mode") +fail(xdp["attached"][0] not in two_xdps["attached"], + "Offload program not reported after driver activated") +fail(len(ipl["xdp"]["attached"]) != 1, + "Wrong attached program count with remaining programs") +fail(offloaded != "0", "offload ID reported with only driver program left") + +start_test("Test multi-attachment XDP - device remove...") +sim.set_xdp(obj, "offload") +sim.remove() + +sim = NetdevSim() +sim.set_ethtool_tc_offloads(True) + start_test("Test mixing of TC and XDP...") sim.tc_add_ingress() sim.set_xdp(obj, "offload") -- 2.17.1
[PATCH bpf-next 2/7] xdp: don't make drivers report attachment mode
prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id a long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached with XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags the driver reports. Remove the prog_attached member. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 1 - drivers/net/ethernet/cavium/thunder/nicvf_main.c| 1 - drivers/net/ethernet/intel/i40e/i40e_main.c | 1 - drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 1 - drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 - drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 1 - drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 - drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 3 --- drivers/net/ethernet/qlogic/qede/qede_filter.c | 1 - drivers/net/netdevsim/bpf.c | 1 - drivers/net/tun.c | 1 - drivers/net/virtio_net.c| 1 - include/linux/netdevice.h | 5 - net/core/dev.c | 7 +++ net/core/rtnetlink.c| 8 ++-- 15 files changed, 9 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 1f0e872d0667..0584d07c8c33 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -219,7 +219,6 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp) rc = bnxt_xdp_set(bp, xdp->prog); break; case XDP_QUERY_PROG: - xdp->prog_attached = !!bp->xdp_prog; xdp->prog_id = bp->xdp_prog ? 
bp->xdp_prog->aux->id : 0; rc = 0; break; diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 135766c4296b..768f584f8392 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -1848,7 +1848,6 @@ static int nicvf_xdp(struct net_device *netdev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return nicvf_xdp_setup(nic, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!nic->xdp_prog; xdp->prog_id = nic->xdp_prog ? nic->xdp_prog->aux->id : 0; return 0; default: diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 426b0ccb1fc6..51762428b40e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -11841,7 +11841,6 @@ static int i40e_xdp(struct net_device *dev, case XDP_SETUP_PROG: return i40e_xdp_setup(vsi, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = i40e_enabled_xdp_vsi(vsi); xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0; return 0; default: diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index a8e21becb619..3862fea1c923 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -9966,7 +9966,6 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return ixgbe_xdp_setup(dev, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!(adapter->xdp_prog); xdp->prog_id = adapter->xdp_prog ? 
adapter->xdp_prog->aux->id : 0; return 0; diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 59416eddd840..d86446d202d5 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -4462,7 +4462,6 @@ static int ixgbevf_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return ixgbevf_xdp_setup(dev, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!(adapter->xdp_prog); xdp->prog_id = adapter->xdp_prog ? adapter->xdp_prog->aux->id : 0; return 0; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 65eb06e017e4..6785661d1a72 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -2926,7 +2926,6 @@ static int mlx4_xdp(struct net_device *dev, struc
RE: [PATCH] net: convert gro_count to bitmask
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: 11 July 2018 19:32
> To: Li,Rongqing ; netdev@vger.kernel.org
> Subject: Re: [PATCH] net: convert gro_count to bitmask
>
> On 07/11/2018 02:15 AM, Li RongQing wrote:
> > gro_hash is 192 bytes in size and uses 3 cache lines. If there are few
> > flows, gro_hash may not be fully used, so it is unnecessary to iterate
> > over all of gro_hash in napi_gro_flush() and touch those cache lines.
> >
> > Convert gro_count to a bitmask and rename it gro_bitmask. Each bit
> > represents an element of gro_hash; only flush a gro_hash element if the
> > related bit is set, to speed up napi_gro_flush().
> >
> > Also update gro_bitmask only if it will actually change, to reduce
> > cache updates.
> >
> > Suggested-by: Eric Dumazet
> > Signed-off-by: Li RongQing
> > ---
> > include/linux/netdevice.h | 2 +-
> > net/core/dev.c | 35 +++
> > 2 files changed, 24 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b683971e500d..df49b36ef378 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -322,7 +322,7 @@ struct napi_struct {
> >
> > 	unsigned long	state;
> > 	int		weight;
> > -	unsigned int	gro_count;
> > +	unsigned long	gro_bitmask;
> > 	int		(*poll)(struct napi_struct *, int);
> > #ifdef CONFIG_NETPOLL
> > 	int		poll_owner;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index d13cddcac41f..a08dbdd217a6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5171,9 +5171,11 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
> > 			return;
> > 		list_del_init(&skb->list);
> > 		napi_gro_complete(skb);
> > -		napi->gro_count--;
> > 		napi->gro_hash[index].count--;
> > 	}
> > +
> > +	if (!napi->gro_hash[index].count)
> > +		clear_bit(index, &napi->gro_bitmask);
>
> I suggest you not add an atomic operation here.
>
> Current cpu owns this NAPI after all.
>
> Same remark for the whole patch.
>
> -> __clear_bit(), __set_bit() and similar operators
>
> Ideally you should provide TCP_RR numbers with busy polling enabled, to
> eventually catch regressions.

I will change it and do the test. Thank you.

-RongQing

> Thanks.
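To see what the bitmask buys, here is a rough user-space sketch (illustrative Python, not kernel code — names merely mirror the kernel's) of flushing only the buckets whose bits are set, instead of scanning every bucket the way the gro_count version had to:

```python
# GRO_HASH_BUCKETS is 8 in the patch under discussion; the bitmask must
# fit in one unsigned long, i.e. the bucket count may not exceed 64.
GRO_HASH_BUCKETS = 8
BITS_PER_LONG = 64
assert GRO_HASH_BUCKETS <= BITS_PER_LONG

def napi_gro_flush_sketch(gro_bitmask, flush_bucket):
    """Visit only the hash buckets whose bit is set in gro_bitmask."""
    while gro_bitmask:
        # isolate the lowest set bit, like __ffs()/for_each_set_bit()
        index = (gro_bitmask & -gro_bitmask).bit_length() - 1
        flush_bucket(index)
        gro_bitmask &= gro_bitmask - 1  # clear the bit we just handled

flushed = []
napi_gro_flush_sketch(0b10100101, flushed.append)
print(flushed)  # buckets 0, 2, 5 and 7 were non-empty
```

With few active flows most bits are clear, so the loop exits after touching only the populated buckets — which is the cache-line saving the commit message describes.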
RE: [PATCH] net: convert gro_count to bitmask
> -----Original Message-----
> From: David Miller [mailto:da...@davemloft.net]
> Sent: 12 July 2018 10:49
> To: Li,Rongqing
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH] net: convert gro_count to bitmask
>
> From: Li RongQing
> Date: Wed, 11 Jul 2018 17:15:53 +0800
>
> > +	clear_bit(index, &napi->gro_bitmask);
>
> Please don't use atomics here, at least use __clear_bit().

Thanks, this is the same as Eric's suggestion.

> This is why I did the operations by hand in my version of the patch.
> Also, if you are going to preempt my patch, at least retain the comment I
> added around the GRO_HASH_BUCKETS definitions which warns the reader
> about the limit.

I added a BUILD_BUG_ON in netdev_init(), so I think we do not need the comment:

@@ -9151,6 +9159,9 @@ static struct hlist_head * __net_init netdev_create_hash(void)
 /* Initialize per network namespace state */
 static int __net_init netdev_init(struct net *net)
 {
+	BUILD_BUG_ON(GRO_HASH_BUCKETS >
+		     FIELD_SIZEOF(struct napi_struct, gro_bitmask));
+

-RongQing

> Thanks.
Re: [PATCH] net: convert gro_count to bitmask
From: Li RongQing
Date: Wed, 11 Jul 2018 17:15:53 +0800

> +	clear_bit(index, &napi->gro_bitmask);

Please don't use atomics here, at least use __clear_bit().

This is why I did the operations by hand in my version of the patch.

Also, if you are going to preempt my patch, at least retain the comment I
added around the GRO_HASH_BUCKETS definitions which warns the reader
about the limit.

Thanks.
Re: Bug report: epoll can fail to report EPOLLOUT when unix datagram socket peer is closed
On 06/26/2018 10:18 AM, Ian Lance Taylor wrote:
> I'm reporting what appears to be a bug in the Linux kernel's epoll
> support. It seems that epoll appears to sometimes fail to report an
> EPOLLOUT event when the other side of an AF_UNIX/SOCK_DGRAM socket is
> closed. This bug report started as a Go program reported at
> https://golang.org/issue/23604. I've written a C program that
> demonstrates the same symptoms, at
> https://github.com/golang/go/issues/23604#issuecomment-398945027 .
>
> The C program sets up an AF_UNIX/SOCK_DGRAM server and several
> identical clients, all running in non-blocking mode. All the
> non-blocking sockets are added to epoll, using EPOLLET. The server
> periodically closes and reopens its socket. The clients look for
> ECONNREFUSED errors on their write calls, and close and reopen their
> sockets when they see one.
>
> The clients will sometimes fill up their buffer and block with EAGAIN.
> At that point they expect the poller to return an EPOLLOUT event to
> tell them when they are ready to write again. The expectation is that
> either the server will read data, freeing up buffer space, or will
> close the socket, which should cause the sending packets to be
> discarded, freeing up buffer space. Generally the EPOLLOUT event
> happens. But sometimes, the poller never returns such an event, and
> the client stalls. In the test program this is reported as a client
> that waits more than 20 seconds to be told to continue.
>
> A similar bug report was made, with few details, at
> https://stackoverflow.com/questions/38441059/edge-triggered-epoll-for-unix-domain-socket .
>
> I've tested the program and seen the failure on kernel 4.9.0-6-amd64.
> A colleague has tested the program and seen the failure on
> 4.18.0-smp-DEV #3 SMP @1529531011 x86_64 GNU/Linux.
>
> If there is a better way for me to report this, please let me know.
>
> Thanks for your attention.
>
> Ian

Hi,

Thanks for the report and the test program.
The patch below seems to have cured the reproducer for me. But perhaps you can confirm?

Thanks,

-Jason

[PATCH] af_unix: ensure POLLOUT on remote close() for connected dgram socket

Applications use ECONNREFUSED as returned from write() in order to determine that a socket should be closed. When using connected dgram unix sockets in a poll/write loop, this relies on POLLOUT being signaled when the remote end closes. However, due to a race POLLOUT can be missed when the remote closes:

  thread 1 (client)                  thread 2 (server)

  connect() to server
  write() returns -EAGAIN
  unix_dgram_poll()
   -> unix_recvq_full() is true
                                     close()
                                     -> unix_release_sock()
                                        -> wake_up_interruptible_all()
  unix_dgram_poll() (due to
   the wake_up_interruptible_all)
   -> unix_recvq_full() still is true
                                     -> free all skbs

Now thread 1 is stuck and will not receive any more wakeups. In this case, when thread 1 gets the -EAGAIN, it has not queued any skbs, otherwise the 'free all skbs' step would in fact cause a wakeup and a POLLOUT return. So the race here is probably fairly rare because it means there are no skbs that thread 1 queued and that thread 1 schedules before the 'free all skbs' step. Nevertheless, this has been observed in the wild via syslog.

The proposed fix is to move the wake_up_interruptible_all() call after the 'free all skbs' step.
Signed-off-by: Jason Baron
---
 net/unix/af_unix.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e5473c0..de242cf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -529,8 +529,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	sk->sk_state = TCP_CLOSE;
 	unix_state_unlock(sk);
 
-	wake_up_interruptible_all(&u->peer_wait);
-
 	skpair = unix_peer(sk);
 
 	if (skpair != NULL) {
@@ -560,6 +558,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 		kfree_skb(skb);
 	}
 
+	/* after freeing skbs to make sure POLLOUT triggers */
+	wake_up_interruptible_all(&u->peer_wait);
+
 	if (path.dentry)
 		path_put(&path);
-- 
2.7.4
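For readers who want to poke at the mechanism without building the C reproducer: the clients' reconnect logic hinges on write() failing with ECONNREFUSED once the peer is gone. A minimal Python sketch of that contract — this shows the Linux error path the clients rely on, not the wakeup race itself:

```python
import errno
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "dgram.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(path)

client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
client.connect(path)
client.send(b"ping")                   # delivered while the server is alive
assert server.recv(16) == b"ping"

# The test server periodically closes and reopens its socket; emulate the
# close half. The connected client's next send() must now fail.
server.close()
os.unlink(path)

err = None
try:
    client.send(b"ping")
except OSError as e:
    err = e.errno

# On Linux, ECONNREFUSED is the signal the clients use to reconnect.
assert err == errno.ECONNREFUSED
client.close()
```

The race in the thread is about the other half of the contract: a client blocked after EAGAIN must get an EPOLLOUT wakeup when the peer's close frees the queued skbs, which is exactly what reordering wake_up_interruptible_all() guarantees.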
RE: [PATCH] net: convert gro_count to bitmask
> -----Original Message-----
> From: Stefano Brivio [mailto:sbri...@redhat.com]
> Sent: 11 July 2018 18:52
> To: Li,Rongqing
> Cc: netdev@vger.kernel.org; Eric Dumazet
> Subject: Re: [PATCH] net: convert gro_count to bitmask
>
> On Wed, 11 Jul 2018 17:15:53 +0800
> Li RongQing wrote:
>
> > @@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
> > 	if (grow > 0)
> > 		gro_pull_from_frag0(skb, grow);
> > ok:
> > +	if (napi->gro_hash[hash].count)
> > +		if (!test_bit(hash, &napi->gro_bitmask))
> > +			set_bit(hash, &napi->gro_bitmask);
> > +		else if (test_bit(hash, &napi->gro_bitmask))
> > +			clear_bit(hash, &napi->gro_bitmask);
>
> This might not do what you want.

Could you show more detail?

-RongQing

> -- 
> Stefano
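A note for readers puzzling over the hunk quoted above: in C the `else` pairs with the innermost `if`, so when the bucket is non-empty and its bit is already set, this code clears the bit — presumably not the intent — and when the bucket is empty nothing runs at all. If the goal is simply "the bit tracks whether the bucket is non-empty", the update collapses to an unconditional set-or-clear. A sketch of that logic (illustrative Python, not kernel code):

```python
def update_bucket_bit(gro_bitmask, hash_idx, bucket_count):
    """Keep bit hash_idx in sync with whether bucket hash_idx holds flows."""
    if bucket_count:
        return gro_bitmask | (1 << hash_idx)   # __set_bit() equivalent
    return gro_bitmask & ~(1 << hash_idx)      # __clear_bit() equivalent

mask = update_bucket_bit(0b0001, 3, 2)   # bucket 3 now holds flows
assert mask == 0b1001
mask = update_bucket_bit(mask, 3, 2)     # setting again is idempotent
assert mask == 0b1001
mask = update_bucket_bit(mask, 3, 0)     # bucket drained -> bit cleared
assert mask == 0b0001
```

Note that both branches are harmless to repeat, so no test_bit() guard is needed; per Eric's and David's remarks, the kernel versions should also be the non-atomic __set_bit()/__clear_bit(), since the owning CPU is the only writer.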
Re: [PATCH net-next] net: sched: fix unprotected access to rcu cookie pointer
On Mon, Jul 09, 2018 at 11:44:38PM +0300, Vlad Buslov wrote: > > On Mon 09 Jul 2018 at 20:34, Marcelo Ricardo Leitner > wrote: > > On Mon, Jul 09, 2018 at 08:26:47PM +0300, Vlad Buslov wrote: > >> Fix action attribute size calculation function to take rcu read lock and > >> access act_cookie pointer with rcu dereference. > >> > >> Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") > >> Reported-by: Marcelo Ricardo Leitner > >> Signed-off-by: Vlad Buslov > >> --- > >> net/sched/act_api.c | 9 +++-- > >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> > >> diff --git a/net/sched/act_api.c b/net/sched/act_api.c > >> index 66dc19746c63..148a89ab789b 100644 > >> --- a/net/sched/act_api.c > >> +++ b/net/sched/act_api.c > >> @@ -149,10 +149,15 @@ EXPORT_SYMBOL(__tcf_idr_release); > >> > >> static size_t tcf_action_shared_attrs_size(const struct tc_action *act) > >> { > >> + struct tc_cookie *act_cookie; > >>u32 cookie_len = 0; > >> > >> - if (act->act_cookie) > >> - cookie_len = nla_total_size(act->act_cookie->len); > >> + rcu_read_lock(); > >> + act_cookie = rcu_dereference(act->act_cookie); > >> + > >> + if (act_cookie) > >> + cookie_len = nla_total_size(act_cookie->len); > >> + rcu_read_unlock(); > > > > I am not sure if this is enough to fix the entire issue. Now it will > > fetch the length correctly but, what guarantees that when it tries to > > actually copy the key (tcf_action_dump_1), the same act_cookie pointer > > will be used? As in, can't the new re-fetch be different/smaller than > > the object used here? > > I checked the code of nlmsg_put() and similar functions, and they check > that there is enough free space at skb tailroom. If not, they fail > gracefully and return error. Am I missing something? Talked offline with Vlad and I agree that this is fine as is. Reviewed-by: Marcelo Ricardo Leitner Thanks, Marcelo
Re: [PATCH iproute2-next] ipaddress: fix label matching
On 7/11/18 7:36 AM, Vincent Bernat wrote: > diff --git a/ip/ipaddress.c b/ip/ipaddress.c > index 5009bfe6d2e3..20ef6724944e 100644 > --- a/ip/ipaddress.c > +++ b/ip/ipaddress.c > @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, > if (!name) > return -1; > > - if (filter.label && > - (!filter.family || filter.family == AF_PACKET) && > - fnmatch(filter.label, name, 0)) > - return -1; > - The offending commit changed the return code: if (filter.label && (!filter.family || filter.family == AF_PACKET) && - fnmatch(filter.label, RTA_DATA(tb[IFLA_IFNAME]), 0)) - return 0; + fnmatch(filter.label, name, 0)) + return -1; Vincent: can you try leaving the code as is, but change the return to 0?
Re: [PATCH v4 iproute2-next 0/3] Add support for ETF qdisc
On 7/9/18 7:56 PM, Jesus Sanchez-Palencia wrote: > fixes since v3: > - Add support for clock names with the "CLOCK_" prefix; > - Print clock name on print_opt(); > - Use strcasecmp() instead of strncasecmp(). > > > The ETF (earliest txtime first) qdisc was recently merged into net-next > [1], so this patchset adds support for it through the tc command line > tool. > > An initial man page is also provided. > > The first commit in this series is adding an updated version of > include/uapi/linux/pkt_sched.h and is not meant to be merged. It's > provided here just as a convenience for those who want to easily build > this patchset. > > [1] https://patchwork.ozlabs.org/cover/938991/ > applied to iproute2-next. Thanks,
[PATCH bpf-next 2/6] bpf: Sync bpf.h to tools/
Sync BPF_SOCK_OPS_TCP_LISTEN_CB related UAPI changes to tools/. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/include/uapi/linux/bpf.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 59b19b6a40d7..3b0ab93bc94f 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2555,6 +2555,9 @@ enum { * Arg1: old_state * Arg2: new_state */ + BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after +* socket transition to LISTEN state. +*/ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect -- 2.17.1
[PATCH bpf-next 0/6] TCP-BPF callback for listening sockets
This patchset adds TCP-BPF callback for listening sockets. Patch 0001 provides more details and is the main patch in the set. Patch 0006 adds selftest for the new callback. Other patches are bug fixes and improvements in TCP-BPF selftest to make it easier to extend in 0006. Andrey Ignatov (6): bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB bpf: Sync bpf.h to tools/ selftests/bpf: Fix const'ness in cgroup_helpers selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers selftests/bpf: Better verification in test_tcpbpf selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB include/uapi/linux/bpf.h | 3 + net/ipv4/af_inet.c| 1 + tools/include/uapi/linux/bpf.h| 3 + tools/testing/selftests/bpf/Makefile | 1 + tools/testing/selftests/bpf/cgroup_helpers.c | 6 +- tools/testing/selftests/bpf/cgroup_helpers.h | 6 +- tools/testing/selftests/bpf/test_tcpbpf.h | 1 + .../testing/selftests/bpf/test_tcpbpf_kern.c | 17 ++- .../testing/selftests/bpf/test_tcpbpf_user.c | 119 +- 9 files changed, 88 insertions(+), 69 deletions(-) -- 2.17.1
[PATCH bpf-next 1/6] bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB
Add new TCP-BPF callback that is called on listen(2) right after socket transition to TCP_LISTEN state. It fills the gap for listening sockets in TCP-BPF. For example BPF program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening and track later transition from TCP_LISTEN to TCP_CLOSE with BPF_SOCK_OPS_STATE_CB callback. Before there was no way to do it with TCP-BPF and other options were much harder to work with. E.g. socket state tracking can be done with tracepoints (either raw or regular) but they can't be attached to cgroup and their lifetime has to be managed separately. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 3 +++ net/ipv4/af_inet.c | 1 + 2 files changed, 4 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index b7db3261c62d..aa11cdcbfcaf 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2557,6 +2557,9 @@ enum { * Arg1: old_state * Arg2: new_state */ + BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after +* socket transition to LISTEN state. +*/ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index c716be13d58c..f2a0a3bab6b5 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -229,6 +229,7 @@ int inet_listen(struct socket *sock, int backlog) err = inet_csk_listen_start(sk, backlog); if (err) goto out; + tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_LISTEN_CB, 0, NULL); } sk->sk_max_ack_backlog = backlog; err = 0; -- 2.17.1
[PATCH bpf-next 6/6] selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB
Cover new TCP-BPF callback in test_tcpbpf: when listen() is called on socket, set BPF_SOCK_OPS_STATE_CB_FLAG so that BPF_SOCK_OPS_STATE_CB callback can be called on future state transition, and when such a transition happens (TCP_LISTEN -> TCP_CLOSE), track it in the map and verify it in user space later. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/test_tcpbpf.h | 1 + tools/testing/selftests/bpf/test_tcpbpf_kern.c | 17 - tools/testing/selftests/bpf/test_tcpbpf_user.c | 4 +++- 3 files changed, 16 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/test_tcpbpf.h b/tools/testing/selftests/bpf/test_tcpbpf.h index 2fe43289943c..7bcfa6207005 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf.h +++ b/tools/testing/selftests/bpf/test_tcpbpf.h @@ -12,5 +12,6 @@ struct tcpbpf_globals { __u32 good_cb_test_rv; __u64 bytes_received; __u64 bytes_acked; + __u32 num_listen; }; #endif diff --git a/tools/testing/selftests/bpf/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/test_tcpbpf_kern.c index 3e645ee41ed5..4b7fd540cea9 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_kern.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_kern.c @@ -96,15 +96,22 @@ int bpf_testcb(struct bpf_sock_ops *skops) if (!gp) break; g = *gp; - g.total_retrans = skops->total_retrans; - g.data_segs_in = skops->data_segs_in; - g.data_segs_out = skops->data_segs_out; - g.bytes_received = skops->bytes_received; - g.bytes_acked = skops->bytes_acked; + if (skops->args[0] == BPF_TCP_LISTEN) { + g.num_listen++; + } else { + g.total_retrans = skops->total_retrans; + g.data_segs_in = skops->data_segs_in; + g.data_segs_out = skops->data_segs_out; + g.bytes_received = skops->bytes_received; + g.bytes_acked = skops->bytes_acked; + } bpf_map_update_elem(&global_map, &key, &g, BPF_ANY); } break; + case BPF_SOCK_OPS_TCP_LISTEN_CB: + bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_STATE_CB_FLAG); + break; default: rv = -1; } diff --git 
a/tools/testing/selftests/bpf/test_tcpbpf_user.c b/tools/testing/selftests/bpf/test_tcpbpf_user.c index 971f1644b9c7..a275c2971376 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_user.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c @@ -37,7 +37,8 @@ int verify_result(const struct tcpbpf_globals *result) (1 << BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) | (1 << BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) | (1 << BPF_SOCK_OPS_NEEDS_ECN) | - (1 << BPF_SOCK_OPS_STATE_CB)); + (1 << BPF_SOCK_OPS_STATE_CB) | + (1 << BPF_SOCK_OPS_TCP_LISTEN_CB)); EXPECT_EQ(expected_events, result->event_map, "#" PRIx32); EXPECT_EQ(501ULL, result->bytes_received, "llu"); @@ -46,6 +47,7 @@ int verify_result(const struct tcpbpf_globals *result) EXPECT_EQ(1, result->data_segs_out, PRIu32); EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32); EXPECT_EQ(0, result->good_cb_test_rv, PRIu32); + EXPECT_EQ(1, result->num_listen, PRIu32); return 0; err: -- 2.17.1
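The selftest drives the kernel side entirely from user space via tcp_server.py. Stripped to its essence, the sequence that fires the new BPF_SOCK_OPS_TCP_LISTEN_CB and then produces the tracked TCP_LISTEN -> TCP_CLOSE transition is just the following (an illustrative sketch, not the selftest itself):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
srv.listen(1)               # listen(2): fires BPF_SOCK_OPS_TCP_LISTEN_CB
port = srv.getsockname()[1]
srv.close()                 # TCP_LISTEN -> TCP_CLOSE, seen by STATE_CB
```

On a kernel carrying this series, with the test's BPF program attached to the cgroup, the listen() call is what lets the program set BPF_SOCK_OPS_STATE_CB_FLAG, and the close() is what bumps num_listen.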
[PATCH bpf-next 4/6] selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers
Switch to cgroup_helpers to simplify the code and fix cgroup cleanup: before cgroup was not cleaned up after the test. It also removes SYSTEM macro, that only printed error, but didn't terminate the test. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/Makefile | 1 + .../testing/selftests/bpf/test_tcpbpf_user.c | 55 +++ 2 files changed, 22 insertions(+), 34 deletions(-) diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 7a6214e9ae58..478bf1bcbbf5 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -61,6 +61,7 @@ $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c $(OUTPUT)/test_sock: cgroup_helpers.c $(OUTPUT)/test_sock_addr: cgroup_helpers.c $(OUTPUT)/test_sockmap: cgroup_helpers.c +$(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c $(OUTPUT)/test_progs: trace_helpers.c $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c diff --git a/tools/testing/selftests/bpf/test_tcpbpf_user.c b/tools/testing/selftests/bpf/test_tcpbpf_user.c index 84ab5163c828..fa97ec6428de 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_user.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c @@ -1,25 +1,18 @@ // SPDX-License-Identifier: GPL-2.0 #include #include -#include #include #include -#include #include -#include -#include -#include #include -#include -#include #include -#include -#include #include #include -#include "bpf_util.h" + #include "bpf_rlimit.h" -#include +#include "bpf_util.h" +#include "cgroup_helpers.h" + #include "test_tcpbpf.h" static int bpf_find_map(const char *test, struct bpf_object *obj, @@ -35,42 +28,32 @@ static int bpf_find_map(const char *test, struct bpf_object *obj, return bpf_map__fd(map); } -#define SYSTEM(CMD)\ - do {\ - if (system(CMD)) { \ - printf("system(%s) FAILS!\n", CMD); \ - } \ - } while (0) - int main(int argc, char **argv) { const char *file = "test_tcpbpf_kern.o"; struct tcpbpf_globals g = {0}; - int cg_fd, prog_fd, map_fd; + 
const char *cg_path = "/foo"; bool debug_flag = false; int error = EXIT_FAILURE; struct bpf_object *obj; - char cmd[100], *dir; - struct stat buffer; + int prog_fd, map_fd; + int cg_fd = -1; __u32 key = 0; - int pid; int rv; if (argc > 1 && strcmp(argv[1], "-d") == 0) debug_flag = true; - dir = "/tmp/cgroupv2/foo"; + if (setup_cgroup_environment()) + goto err; + + cg_fd = create_and_get_cgroup(cg_path); + if (!cg_fd) + goto err; - if (stat(dir, &buffer) != 0) { - SYSTEM("mkdir -p /tmp/cgroupv2"); - SYSTEM("mount -t cgroup2 none /tmp/cgroupv2"); - SYSTEM("mkdir -p /tmp/cgroupv2/foo"); - } - pid = (int) getpid(); - sprintf(cmd, "echo %d >> /tmp/cgroupv2/foo/cgroup.procs", pid); - SYSTEM(cmd); + if (join_cgroup(cg_path)) + goto err; - cg_fd = open(dir, O_DIRECTORY, O_RDONLY); if (bpf_prog_load(file, BPF_PROG_TYPE_SOCK_OPS, &obj, &prog_fd)) { printf("FAILED: load_bpf_file failed for: %s\n", file); goto err; @@ -83,7 +66,10 @@ int main(int argc, char **argv) goto err; } - SYSTEM("./tcp_server.py"); + if (system("./tcp_server.py")) { + printf("FAILED: TCP server\n"); + goto err; + } map_fd = bpf_find_map(__func__, obj, "global_map"); if (map_fd < 0) @@ -123,6 +109,7 @@ int main(int argc, char **argv) error = 0; err: bpf_prog_detach(cg_fd, BPF_CGROUP_SOCK_OPS); + close(cg_fd); + cleanup_cgroup_environment(); return error; - } -- 2.17.1
[PATCH bpf-next 5/6] selftests/bpf: Better verification in test_tcpbpf
Reduce amount of copy/paste for debug info when result is verified in the test and keep that info together with values being checked so that they won't get out of sync. It also improves debug experience: instead of checking manually what doesn't match in debug output for all fields, only unexpected field is printed. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- .../testing/selftests/bpf/test_tcpbpf_user.c | 64 +++ 1 file changed, 39 insertions(+), 25 deletions(-) diff --git a/tools/testing/selftests/bpf/test_tcpbpf_user.c b/tools/testing/selftests/bpf/test_tcpbpf_user.c index fa97ec6428de..971f1644b9c7 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_user.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include @@ -15,6 +16,42 @@ #include "test_tcpbpf.h" +#define EXPECT_EQ(expected, actual, fmt) \ + do {\ + if ((expected) != (actual)) { \ + printf(" Value of: " #actual "\n" \ + "Actual: %" fmt "\n" \ + " Expected: %" fmt "\n",\ + (actual), (expected)); \ + goto err; \ + } \ + } while (0) + +int verify_result(const struct tcpbpf_globals *result) +{ + __u32 expected_events; + + expected_events = ((1 << BPF_SOCK_OPS_TIMEOUT_INIT) | + (1 << BPF_SOCK_OPS_RWND_INIT) | + (1 << BPF_SOCK_OPS_TCP_CONNECT_CB) | + (1 << BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) | + (1 << BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) | + (1 << BPF_SOCK_OPS_NEEDS_ECN) | + (1 << BPF_SOCK_OPS_STATE_CB)); + + EXPECT_EQ(expected_events, result->event_map, "#" PRIx32); + EXPECT_EQ(501ULL, result->bytes_received, "llu"); + EXPECT_EQ(1002ULL, result->bytes_acked, "llu"); + EXPECT_EQ(1, result->data_segs_in, PRIu32); + EXPECT_EQ(1, result->data_segs_out, PRIu32); + EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32); + EXPECT_EQ(0, result->good_cb_test_rv, PRIu32); + + return 0; +err: + return -1; +} + static int bpf_find_map(const char *test, struct bpf_object *obj, const char *name) { @@ -33,7 +70,6 @@ int 
main(int argc, char **argv) const char *file = "test_tcpbpf_kern.o"; struct tcpbpf_globals g = {0}; const char *cg_path = "/foo"; - bool debug_flag = false; int error = EXIT_FAILURE; struct bpf_object *obj; int prog_fd, map_fd; @@ -41,9 +77,6 @@ int main(int argc, char **argv) __u32 key = 0; int rv; - if (argc > 1 && strcmp(argv[1], "-d") == 0) - debug_flag = true; - if (setup_cgroup_environment()) goto err; @@ -81,30 +114,11 @@ int main(int argc, char **argv) goto err; } - if (g.bytes_received != 501 || g.bytes_acked != 1002 || - g.data_segs_in != 1 || g.data_segs_out != 1 || - (g.event_map ^ 0x47e) != 0 || g.bad_cb_test_rv != 0x80 || - g.good_cb_test_rv != 0) { + if (verify_result(&g)) { printf("FAILED: Wrong stats\n"); - if (debug_flag) { - printf("\n"); - printf("bytes_received: %d (expecting 501)\n", - (int)g.bytes_received); - printf("bytes_acked:%d (expecting 1002)\n", - (int)g.bytes_acked); - printf("data_segs_in: %d (expecting 1)\n", - g.data_segs_in); - printf("data_segs_out: %d (expecting 1)\n", - g.data_segs_out); - printf("event_map: 0x%x (at least 0x47e)\n", - g.event_map); - printf("bad_cb_test_rv: 0x%x (expecting 0x80)\n", - g.bad_cb_test_rv); - printf("good_cb_test_rv:0x%x (expecting 0)\n", - g.good_cb_test_rv); - } goto err; } + printf("PASSED!\n"); error = 0; err: -- 2.17.1
[PATCH bpf-next 3/6] selftests/bpf: Fix const'ness in cgroup_helpers
Lack of const in cgroup helpers signatures forces to write ugly client code. Fix it. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/cgroup_helpers.c | 6 +++--- tools/testing/selftests/bpf/cgroup_helpers.h | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c index c87b4e052ce9..cf16948aad4a 100644 --- a/tools/testing/selftests/bpf/cgroup_helpers.c +++ b/tools/testing/selftests/bpf/cgroup_helpers.c @@ -118,7 +118,7 @@ static int join_cgroup_from_top(char *cgroup_path) * * On success, it returns 0, otherwise on failure it returns 1. */ -int join_cgroup(char *path) +int join_cgroup(const char *path) { char cgroup_path[PATH_MAX + 1]; @@ -158,7 +158,7 @@ void cleanup_cgroup_environment(void) * On success, it returns the file descriptor. On failure it returns 0. * If there is a failure, it prints the error to stderr. */ -int create_and_get_cgroup(char *path) +int create_and_get_cgroup(const char *path) { char cgroup_path[PATH_MAX + 1]; int fd; @@ -186,7 +186,7 @@ int create_and_get_cgroup(char *path) * which is an invalid cgroup id. * If there is a failure, it prints the error to stderr. 
*/ -unsigned long long get_cgroup_id(char *path) +unsigned long long get_cgroup_id(const char *path) { int dirfd, err, flags, mount_id, fhsize; union { diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h index 20a4a5dcd469..d64bb8957090 100644 --- a/tools/testing/selftests/bpf/cgroup_helpers.h +++ b/tools/testing/selftests/bpf/cgroup_helpers.h @@ -9,10 +9,10 @@ __FILE__, __LINE__, clean_errno(), ##__VA_ARGS__) -int create_and_get_cgroup(char *path); -int join_cgroup(char *path); +int create_and_get_cgroup(const char *path); +int join_cgroup(const char *path); int setup_cgroup_environment(void); void cleanup_cgroup_environment(void); -unsigned long long get_cgroup_id(char *path); +unsigned long long get_cgroup_id(const char *path); #endif -- 2.17.1
Re: [PATCH bpf 0/4] Consistent sendmsg error reporting in AF_XDP
On Wed, Jul 11, 2018 at 10:12:48AM +0200, Magnus Karlsson wrote: > This patch set adjusts the AF_XDP TX error reporting so that it becomes > consistent between copy mode and zero-copy. First some background: > > Copy-mode for TX uses the SKB path in which the action of sending the > packet is performed from process context using the sendmsg > syscall. Completions are usually done asynchronously from NAPI mode by > using a TX interrupt. In this mode, send errors can be returned back > through the syscall. > > In zero-copy mode both the sending of the packet and the completions > are done asynchronously from NAPI mode for performance reasons. In > this mode, the sendmsg syscall only makes sure that the TX NAPI loop > will be run that performs both the actions of sending and > completing. In this mode it is therefore not possible to return errors > through the sendmsg syscall as the sending is done from the NAPI > loop. Note that it is possible to implement a synchronous send with > our API, but in our benchmarks that made the TX performance drop by > nearly half due to synchronization requirements and cache line > bouncing. But for some netdevs this might be preferable so let us > leave it up to the implementation to decide. > > The problem is that the current code base returns some errors in > copy-mode that are not possible to return in zero-copy mode. This > patch set aligns them so that the two modes always return the same > error code. We achieve this by removing some of the errors returned by > sendmsg in copy-mode (and in one case adding an error message for > zero-copy mode) and offering alternative error detection methods that > are consistent between the two modes. > > The structure of the patch set is as follows: > > Patch 1: removes the ENXIO return code from copy-mode when someone has > forcefully changed the number of queues on the device so that the > queue bound to the socket is no longer available. 
Just silently stop > sending anything as in zero-copy mode. > > Patch 2: stop returning EAGAIN in copy mode when the completion queue > is full as zero-copy does not do this. Instead this situation can be > detected by comparing the head and tail pointers of the completion > queue in both modes. In any case, EAGAIN was not the correct error code > here since no amount of calling sendmsg will solve the problem. Only > consuming one or more messages on the completion queue will fix this. > > Patch 3: Always return ENOBUFS from sendmsg if there is no TX queue > configured. This was not the case for zero-copy mode. > > Patch 4: stop returning EMSGSIZE when the size of the packet is larger > than the MTU. Just send it to the device so that it will drop it as in > zero-copy mode. > > Note that copy-mode can still return EAGAIN in certain circumstances, > but as these conditions cannot occur in zero-copy mode it is fine for > copy-mode to return them. > > Question: For patch 4, is it fine to let the device drop a packet > that is greater than its MTU, or should I have a check for this in > both zero-copy and copy-mode and drop the packet up in the AF_XDP > code? The drawback of this is that it will have performance > implications for zero-copy mode as we will touch one more cache line > with dev->mtu. > > Thanks: Magnus

for the set:
Acked-by: Alexei Starovoitov
Re: [BUG] bonded interfaces drop bpdu (stp) frames
On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: > > Hi, > > As weird as that sounds, this is what I observed today after bumping > kernel version. I have a setup where 2 bonds are attached to linux > bridge and physically are connected to two switches doing MSTP (and > linux bridge is just passing them). > > Initially I suspected some changes related to bridge code - but quick > peek at the code showed nothing suspicious - and the part of it that > explicitly passes stp frames if stp is not enabled has seen little > changes (e.g. per-port group_fwd_mask added recently). Furthermore - if > regular non-bonded interfaces are attached everything works fine. > > Just to be sure I detached the bond (802.3ad mode) and checked it with > simple tcpdump (ether proto \stp) - and indeed no hello packets were > there (with them being present just fine on the active enslaved interface, > or on the bond device in earlier kernels). > > If time permits I'll bisect tomorrow to pinpoint the commit, but from > a quick test today - 4.9.x is working fine, while 4.16.16 (tested on > debian) and 4.17.3 (tested on archlinux) are failing. > > Unless this is already a known issue (or you have any suggestions what > could be responsible).

I believe these are link-local multicast messages, and some time back a change went in to not pass those frames to the bonding master. This could be a side effect of that.
Re: [PATCH bpf] bpf: fix panic due to oob in bpf_prog_test_run_skb
On Wed, Jul 11, 2018 at 03:30:14PM +0200, Daniel Borkmann wrote: > syzkaller triggered several panics similar to the below: > > [...] > [ 248.851531] BUG: KASAN: use-after-free in _copy_to_user+0x5c/0x90 > [ 248.857656] Read of size 985 at addr 88080172 by task a.out/1425 > [...] > [ 248.865902] CPU: 1 PID: 1425 Comm: a.out Not tainted 4.18.0-rc4+ #13 > [ 248.865903] Hardware name: Supermicro SYS-5039MS-H12TRF/X11SSE-F, BIOS > 2.1a 03/08/2018 > [ 248.865905] Call Trace: > [ 248.865910] dump_stack+0xd6/0x185 > [ 248.865911] ? show_regs_print_info+0xb/0xb > [ 248.865913] ? printk+0x9c/0xc3 > [ 248.865915] ? kmsg_dump_rewind_nolock+0xe4/0xe4 > [ 248.865919] print_address_description+0x6f/0x270 > [ 248.865920] kasan_report+0x25b/0x380 > [ 248.865922] ? _copy_to_user+0x5c/0x90 > [ 248.865924] check_memory_region+0x137/0x190 > [ 248.865925] kasan_check_read+0x11/0x20 > [ 248.865927] _copy_to_user+0x5c/0x90 > [ 248.865930] bpf_test_finish.isra.8+0x4f/0xc0 > [ 248.865932] bpf_prog_test_run_skb+0x6a0/0xba0 > [...] > > After scrubbing the BPF prog a bit from the noise, turns out it called > bpf_skb_change_head() for the lwt_xmit prog with headroom of 2. Nothing > wrong in that, however, this was run with repeat >> 0 in > bpf_prog_test_run_skb() > and the same skb thus keeps changing until the pskb_expand_head() called > from skb_cow() keeps bailing out in atomic alloc context with -ENOMEM. > So upon return we'll basically have 0 headroom left yet blindly do the > __skb_push() of 14 bytes and keep copying data from there in bpf_test_finish() > out of bounds. Fix to check if we have enough headroom and if > pskb_expand_head() > fails, bail out with error.
> > Another bug independent of this fix (but related in triggering above) is > that BPF_PROG_TEST_RUN should be reworked to reset the skb/xdp buffer to > its original state from input as otherwise repeating the same test in a > loop won't work for benchmarking when underlying input buffer is getting > changed by the prog each time and reused for the next run leading to > unexpected results. > > Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command") > Reported-by: syzbot+709412e651e55ed96...@syzkaller.appspotmail.com > Reported-by: syzbot+54f39d6ab58f39720...@syzkaller.appspotmail.com > Signed-off-by: Daniel Borkmann

Applied, Thanks
[BUG net-next] BUG triggered with GRO SKB list_head changes
Starting with the following net-next commit, I see a BUG when starting a LXD container inside of a KVM guest using virtio-net: d4546c2509b1 net: Convert GRO SKB handling to list_head. Here's what the kernel spits out: kernel BUG at /var/scm/kernel/linux/include/linux/skbuff.h:2080! invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 0 PID: 1362 Comm: libvirtd Not tainted 4.18.0-rc2+ #69 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 RIP: 0010:skb_pull+0x36/0x40 Code: c6 77 24 29 f0 3b 87 84 00 00 00 89 87 80 00 00 00 72 17 89 f6 48 89 f0 48 03 87 d8 00 00 00 48 89 87 d8 00 00 00 c3 31 c0 c3 <0f> 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 39 b7 80 00 00 00 76 RSP: :96737f6039f0 EFLAGS: 00010297 RAX: 9c66e2f2 RBX: RCX: 0501 RDX: 0001 RSI: 000e RDI: 96737f7e3938 RBP: 967379f40020 R08: R09: R10: 96737f603988 R11: c0461335 R12: 967379f409e0 R13: 96737f7e3938 R14: R15: 967379e96ac0 FS: 7fc96087e640() GS:96737f60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fc913608aa0 CR3: 5dacc001 CR4: 001606f0 Call Trace: br_dev_xmit+0xe1/0x3d0 [bridge] dev_hard_start_xmit+0xbc/0x3b0 __dev_queue_xmit+0xb98/0xc30 ip_finish_output2+0x3e5/0x670 ? ip_output+0x7f/0x250 ip_output+0x7f/0x250 ? ip_fragment.constprop.5+0x80/0x80 ip_forward+0x3e2/0x650 ? ipv4_frags_init_net+0x130/0x130 ip_rcv+0x2be/0x500 ? ip_local_deliver_finish+0x3b0/0x3b0 __netif_receive_skb_core+0x6a8/0xb30 ? lock_acquire+0xab/0x200 ? netif_receive_skb_internal+0x2a/0x380 netif_receive_skb_internal+0x73/0x380 ? napi_gro_complete+0xcf/0x1b0 dev_gro_receive+0x374/0x730 napi_gro_receive+0x4f/0x1d0 receive_buf+0x4b6/0x1930 [virtio_net] ? 
detach_buf+0x69/0x120 [virtio_ring] virtnet_poll+0x122/0x2e0 [virtio_net] net_rx_action+0x207/0x450 __do_softirq+0x149/0x4ea irq_exit+0xbf/0xd0 do_IRQ+0x6c/0x130 common_interrupt+0xf/0xf RIP: 0010:__radix_tree_lookup+0x28/0xe0 Code: 00 00 53 49 89 ca 41 bb 40 00 00 00 4c 8b 47 50 4c 89 c0 83 e0 03 48 83 f8 01 0f 85 a8 00 00 00 4c 89 c0 48 83 e0 fe 0f b6 08 <4c> 89 d8 48 d3 e0 48 83 e8 01 48 39 c6 76 11 e9 9f 00 00 00 4c 89 RSP: :ae150048fcc0 EFLAGS: 0282 ORIG_RAX: ffd9 RAX: 96735d2ef908 RBX: 001f RCX: 0006 RDX: RSI: 02e2 RDI: 96735d10b788 RBP: 02e2 R08: 96735d2ef909 R09: R10: R11: 0040 R12: 001f R13: ec01c15f3a80 R14: 001f R15: ae150048fd18 __do_page_cache_readahead+0x11f/0x2e0 filemap_fault+0x408/0x660 ext4_filemap_fault+0x2f/0x40 __do_fault+0x1f/0xd0 __handle_mm_fault+0x915/0xfa0 handle_mm_fault+0x1c2/0x390 __do_page_fault+0x2f6/0x580 ? async_page_fault+0x5/0x20 async_page_fault+0x1b/0x20 RIP: 0033:0x7fc913608aa0 Code: Bad RIP value. RSP: 002b:7ffcfa9c7f08 EFLAGS: 00010206 RAX: RBX: 0003 RCX: 0080 RDX: 0006 RSI: 7fc913a74bf8 RDI: 7fc913df9720 RBP: 0001 R08: 55df45795700 R09: R10: 55df4574c010 R11: 0001 R12: 7ffcfa9c8c38 R13: 7ffcfa9c8c48 R14: 7fc913dc3d70 R15: 55df4578ab30 Modules linked in: veth ebtable_filter ebtables ipt_MASQUERADE xt_CHECKSUM xt_comment xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_filter bpfilter bridge stp llc fuse kvm_intel kvm irqbypass 9pnet_virtio 9pnet virtio_balloon ib_iser rdma_cm configfs iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables virtio_net net_failover virtio_blk failover crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper virtio_pci psmouse virtio_ring virtio I'm not very familiar with the GRO or IP fragmentation code but I was able to identify that this change "fixes" the issue: diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 7ccc601b55d9..a5cea572a7f1 
100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -666,6 +666,7 @@ struct sk_buff { /* These two members must be first. */ struct sk_buff *next; struct sk_buff *prev; + struct list_head list; union { struct net_device *dev; @@ -678,7 +679,6 @@ struct sk_buff { }; }; struct rb_node rbnode; /* used in netem & tcp stack */ - struct list_head list; }; struct sock *s
[BUG] bonded interfaces drop bpdu (stp) frames
Hi, As weird as that sounds, this is what I observed today after bumping kernel version. I have a setup where 2 bonds are attached to linux bridge and physically are connected to two switches doing MSTP (and linux bridge is just passing them). Initially I suspected some changes related to bridge code - but quick peek at the code showed nothing suspicious - and the part of it that explicitly passes stp frames if stp is not enabled has seen little changes (e.g. per-port group_fwd_mask added recently). Furthermore - if regular non-bonded interfaces are attached everything works fine. Just to be sure I detached the bond (802.3ad mode) and checked it with simple tcpdump (ether proto \stp) - and indeed no hello packets were there (with them being present just fine on the active enslaved interface, or on the bond device in earlier kernels). If time permits I'll bisect tomorrow to pinpoint the commit, but from a quick test today - 4.9.x is working fine, while 4.16.16 (tested on debian) and 4.17.3 (tested on archlinux) are failing. Unless this is already a known issue (or you have any suggestions what could be responsible).
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
On 11.07.2018 23:33, Florian Fainelli wrote: > > > On 07/11/2018 02:08 PM, Heiner Kallweit wrote: >> On 11.07.2018 22:55, Andrew Lunn wrote: +/** + * phy_speed_down - set speed to lowest speed supported by both link partners + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Typically used to save energy when waiting for a WoL packet + */ +int phy_speed_down(struct phy_device *phydev, bool sync) >>> >>> This sync parameter needs some more thought. I'm not sure it is safe. >>> >>> How does a PHY trigger a WoL wake up? I guess some use the interrupt >>> pin. How does a PHY indicate auto-neg has completed? It triggers an >>> interrupt. So it seems like there is a danger here we suspend, and >>> then wake up 2 seconds later when auto-neg has completed. >>> >>> I'm not sure we can safely suspend until auto-neg has completed. >>> +/** + * phy_speed_up - (re)set advertised speeds to all supported speeds + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Used to revert the effect of phy_speed_down + */ +int phy_speed_up(struct phy_device *phydev, bool sync) >>> >>> And here, i'm thinking the opposite. A MAC driver needs to be ready >>> for the PHY state to change at any time. So why do we need to wait? >>> Just let the normal mechanisms inform the MAC when the link is up. >>> >> I see your points, thanks for the feedback. In my case WoL triggers >> a PCI PME and the code works as expected, but I agree this may be >> different in other setups (external PHY). >> >> The sync parameter was inspired by following comment from Florian: >> "One thing that bothers me a bit is that this should ideally be >> offered as both blocking and non-blocking options" >> So let's see which comments he may have before preparing a v2. 
> > What I had in mind is that you would be able to register a callback that > would tell you when auto-negotiation completes, and not register one if > you did not want to have that information. > > As Andrew points out though, with PHY using interrupts, this might be a > bit challenging to do because you will get an interrupt about "something > has changed" and you would have to run the callback from the PHY state > machine to determine this was indeed a result of triggering > auto-negotiation. Maybe polling for auto-negotiation like you do here is > good enough. > OK, then I would poll for autoneg finished in phy_speed_down and remove the polling option from phy_speed_up. I will do some tests with this before submitting a v2. > One nit, you might have to check for those functions that the PHY did > have auto-negotiation enabled and was not forced. > This I'm doing already, or do you mean something different?
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
On 07/11/2018 02:08 PM, Heiner Kallweit wrote: > On 11.07.2018 22:55, Andrew Lunn wrote: >>> +/** >>> + * phy_speed_down - set speed to lowest speed supported by both link >>> partners >>> + * @phydev: the phy_device struct >>> + * @sync: perform action synchronously >>> + * >>> + * Description: Typically used to save energy when waiting for a WoL packet >>> + */ >>> +int phy_speed_down(struct phy_device *phydev, bool sync) >> >> This sync parameter needs some more thought. I'm not sure it is safe. >> >> How does a PHY trigger a WoL wake up? I guess some use the interrupt >> pin. How does a PHY indicate auto-neg has completed? It triggers an >> interrupt. So it seems like there is a danger here we suspend, and >> then wake up 2 seconds later when auto-neg has completed. >> >> I'm not sure we can safely suspend until auto-neg has completed. >> >>> +/** >>> + * phy_speed_up - (re)set advertised speeds to all supported speeds >>> + * @phydev: the phy_device struct >>> + * @sync: perform action synchronously >>> + * >>> + * Description: Used to revert the effect of phy_speed_down >>> + */ >>> +int phy_speed_up(struct phy_device *phydev, bool sync) >> >> And here, i'm thinking the opposite. A MAC driver needs to be ready >> for the PHY state to change at any time. So why do we need to wait? >> Just let the normal mechanisms inform the MAC when the link is up. >> > I see your points, thanks for the feedback. In my case WoL triggers > a PCI PME and the code works as expected, but I agree this may be > different in other setups (external PHY). > > The sync parameter was inspired by following comment from Florian: > "One thing that bothers me a bit is that this should ideally be > offered as both blocking and non-blocking options" > So let's see which comments he may have before preparing a v2. 
What I had in mind is that you would be able to register a callback that would tell you when auto-negotiation completes, and not register one if you did not want to have that information. As Andrew points out though, with PHY using interrupts, this might be a bit challenging to do because you will get an interrupt about "something has changed" and you would have to run the callback from the PHY state machine to determine this was indeed a result of triggering auto-negotiation. Maybe polling for auto-negotiation like you do here is good enough. One nit, you might have to check for those functions that the PHY did have auto-negotiation enabled and was not forced. -- Florian
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
On 11.07.2018 22:55, Andrew Lunn wrote: >> +/** >> + * phy_speed_down - set speed to lowest speed supported by both link >> partners >> + * @phydev: the phy_device struct >> + * @sync: perform action synchronously >> + * >> + * Description: Typically used to save energy when waiting for a WoL packet >> + */ >> +int phy_speed_down(struct phy_device *phydev, bool sync) > > This sync parameter needs some more thought. I'm not sure it is safe. > > How does a PHY trigger a WoL wake up? I guess some use the interrupt > pin. How does a PHY indicate auto-neg has completed? It triggers an > interrupt. So it seems like there is a danger here we suspend, and > then wake up 2 seconds later when auto-neg has completed. > > I'm not sure we can safely suspend until auto-neg has completed. > >> +/** >> + * phy_speed_up - (re)set advertised speeds to all supported speeds >> + * @phydev: the phy_device struct >> + * @sync: perform action synchronously >> + * >> + * Description: Used to revert the effect of phy_speed_down >> + */ >> +int phy_speed_up(struct phy_device *phydev, bool sync) > > And here, i'm thinking the opposite. A MAC driver needs to be ready > for the PHY state to change at any time. So why do we need to wait? > Just let the normal mechanisms inform the MAC when the link is up. > I see your points, thanks for the feedback. In my case WoL triggers a PCI PME and the code works as expected, but I agree this may be different in other setups (external PHY). The sync parameter was inspired by following comment from Florian: "One thing that bothers me a bit is that this should ideally be offered as both blocking and non-blocking options" So let's see which comments he may have before preparing a v2. > Andrew > Heiner
Re: [PATCH net-next 1/2] net: phy: add helper phy_config_aneg
On 07/11/2018 01:30 PM, Heiner Kallweit wrote: > This functionality will also be needed in subsequent patches of this > series, therefore factor it out to a helper. > > Signed-off-by: Heiner Kallweit Reviewed-by: Florian Fainelli -- Florian
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
> +/** > + * phy_speed_down - set speed to lowest speed supported by both link partners > + * @phydev: the phy_device struct > + * @sync: perform action synchronously > + * > + * Description: Typically used to save energy when waiting for a WoL packet > + */ > +int phy_speed_down(struct phy_device *phydev, bool sync) This sync parameter needs some more thought. I'm not sure it is safe. How does a PHY trigger a WoL wake up? I guess some use the interrupt pin. How does a PHY indicate auto-neg has completed? It triggers an interrupt. So it seems like there is a danger here we suspend, and then wake up 2 seconds later when auto-neg has completed. I'm not sure we can safely suspend until auto-neg has completed. > +/** > + * phy_speed_up - (re)set advertised speeds to all supported speeds > + * @phydev: the phy_device struct > + * @sync: perform action synchronously > + * > + * Description: Used to revert the effect of phy_speed_down > + */ > +int phy_speed_up(struct phy_device *phydev, bool sync) And here, i'm thinking the opposite. A MAC driver needs to be ready for the PHY state to change at any time. So why do we need to wait? Just let the normal mechanisms inform the MAC when the link is up. Andrew
Re: [PATCH net-next 1/2] net: phy: add helper phy_config_aneg
On Wed, Jul 11, 2018 at 10:30:27PM +0200, Heiner Kallweit wrote: > This functionality will also be needed in subsequent patches of this > series, therefore factor it out to a helper. > > Signed-off-by: Heiner Kallweit Reviewed-by: Andrew Lunn Andrew
Re: [PATCH net-next] tc-testing: add geneve options in tunnel_key unit tests
On Tue, Jul 10, 2018 at 9:22 PM, Jakub Kicinski wrote: > From: Pieter Jansen van Vuuren > > Extend tc tunnel_key action unit tests with geneve options. Tests > include testing single and multiple geneve options, as well as > testing geneve options that are expected to fail. > > Signed-off-by: Pieter Jansen van Vuuren Acked-by: Lucas Bates
Re: [PATCH bpf-next] bpf: better availability probing for seg6 helpers
On 07/10/2018 09:20 PM, Daniel Borkmann wrote: > On 07/10/2018 06:54 PM, Mathieu Xhonneux wrote: >> bpf_lwt_seg6_* helpers require CONFIG_IPV6_SEG6_BPF, and currently >> return -EOPNOTSUPP to indicate unavailability. This patch forces the >> BPF verifier to reject programs using these helpers when >> !CONFIG_IPV6_SEG6_BPF, allowing users to more easily probe if they are >> available or not. >> >> Signed-off-by: Mathieu Xhonneux > > Note, just fyi, this would need to go to bpf tree (and not bpf-next) as > otherwise there's a change in behavior. Applied, thanks Mathieu!
[PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
Some network drivers include functionality to speed down the PHY when suspending and just waiting for a WoL packet because this saves energy. This functionality is quite generic, therefore let's factor it out to phylib. Signed-off-by: Heiner Kallweit --- drivers/net/phy/phy.c | 78 +++ include/linux/phy.h | 2 ++ 2 files changed, 80 insertions(+) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index c4aa360d..0547c603 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -551,6 +551,84 @@ int phy_start_aneg(struct phy_device *phydev) } EXPORT_SYMBOL(phy_start_aneg); +static int phy_poll_aneg_done(struct phy_device *phydev) +{ + unsigned int retries = 100; + int ret; + + do { + msleep(100); + ret = phy_aneg_done(phydev); + } while (!ret && --retries); + + if (!ret) + return -ETIMEDOUT; + + return ret < 0 ? ret : 0; +} + +/** + * phy_speed_down - set speed to lowest speed supported by both link partners + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Typically used to save energy when waiting for a WoL packet + */ +int phy_speed_down(struct phy_device *phydev, bool sync) +{ + u32 adv = phydev->lp_advertising & phydev->supported; + u32 adv_old = phydev->advertising; + int ret; + + if (phydev->autoneg != AUTONEG_ENABLE) + return 0; + + if (adv & PHY_10BT_FEATURES) + phydev->advertising &= ~(PHY_100BT_FEATURES | +PHY_1000BT_FEATURES); + else if (adv & PHY_100BT_FEATURES) + phydev->advertising &= ~PHY_1000BT_FEATURES; + + if (phydev->advertising == adv_old) + return 0; + + ret = phy_config_aneg(phydev); + if (ret) + return ret; + + return sync ? 
phy_poll_aneg_done(phydev) : 0; +} +EXPORT_SYMBOL_GPL(phy_speed_down); + +/** + * phy_speed_up - (re)set advertised speeds to all supported speeds + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Used to revert the effect of phy_speed_down + */ +int phy_speed_up(struct phy_device *phydev, bool sync) +{ + u32 mask = PHY_10BT_FEATURES | PHY_100BT_FEATURES | PHY_1000BT_FEATURES; + u32 adv_old = phydev->advertising; + int ret; + + if (phydev->autoneg != AUTONEG_ENABLE) + return 0; + + phydev->advertising = (adv_old & ~mask) | (phydev->supported & mask); + + if (phydev->advertising == adv_old) + return 0; + + ret = phy_config_aneg(phydev); + if (ret) + return ret; + + return sync ? phy_poll_aneg_done(phydev) : 0; +} +EXPORT_SYMBOL_GPL(phy_speed_up); + /** * phy_start_machine - start PHY state machine tracking * @phydev: the phy_device struct diff --git a/include/linux/phy.h b/include/linux/phy.h index 6cd09098..275f528e 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -942,6 +942,8 @@ void phy_start(struct phy_device *phydev); void phy_stop(struct phy_device *phydev); int phy_start_aneg(struct phy_device *phydev); int phy_aneg_done(struct phy_device *phydev); +int phy_speed_down(struct phy_device *phydev, bool sync); +int phy_speed_up(struct phy_device *phydev, bool sync); int phy_stop_interrupts(struct phy_device *phydev); int phy_restart_aneg(struct phy_device *phydev); -- 2.18.0
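The advertising-reduction step at the heart of phy_speed_down() can be sketched as plain C. The feature bits below are illustrative stand-ins, not the real PHY_*_FEATURES values from include/linux/phy.h:

```c
#include <stdint.h>

/* Stand-in feature bits; the real PHY_*_FEATURES masks differ. */
#define F_10BT   0x1u
#define F_100BT  0x2u
#define F_1000BT 0x4u

/* Sketch of phy_speed_down()'s advertising reduction: if both link
 * partners can do 10BASE-T, stop advertising everything faster;
 * otherwise, if both can do 100BASE-T, stop advertising gigabit. */
uint32_t speed_down_advert(uint32_t advertising, uint32_t lp_advertising,
                           uint32_t supported)
{
    uint32_t adv = lp_advertising & supported;

    if (adv & F_10BT)
        advertising &= ~(F_100BT | F_1000BT);
    else if (adv & F_100BT)
        advertising &= ~F_1000BT;

    return advertising;
}
```

With all three speeds common to both sides this leaves only the 10BASE-T bit set; when the result equals the old advertising mask, phy_speed_down() returns early without renegotiating.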
Re: [PATCH iproute2-next] ipaddress: fix label matching
❦ 11 July 2018 13:03 -0700, Stephen Hemminger : >> Since 9516823051ce, "ip addr show label lo:1" doesn't work >> anymore (doesn't show any address, despite a matching label). >> Reverting to return 0 instead of -1 fixes the issue. >> >> However, the condition says: "if we filter by label [...] and the >> label does NOT match the interface name". It makes little sense to >> compare the label with the interface name. There is also logic >> around whether a filter family is provided. The match against the >> label is done by ifa_label_match_rta() in print_addrinfo() and >> ipaddr_filter(). >> >> Just removing the condition makes "ip addr show" work as expected >> with or without specifying a label, both when the label is matching >> and not matching. It also works if we specify a label and the label is >> the interface name. The flush operation also works as expected. >> >> Fixes: 9516823051ce ("ipaddress: Improve print_linkinfo()") >> Signed-off-by: Vincent Bernat >> --- >> ip/ipaddress.c | 5 - >> 1 file changed, 5 deletions(-) >> >> diff --git a/ip/ipaddress.c b/ip/ipaddress.c >> index 5009bfe6d2e3..20ef6724944e 100644 >> --- a/ip/ipaddress.c >> +++ b/ip/ipaddress.c >> @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, >> if (!name) >> return -1; >> >> -if (filter.label && >> -(!filter.family || filter.family == AF_PACKET) && >> -fnmatch(filter.label, name, 0)) >> -return -1; >> - >> if (tb[IFLA_GROUP]) { >> int group = rta_getattr_u32(tb[IFLA_GROUP]); >> > > If this is a regression, it should go to iproute2 not iproute2-next. > > Surprised by the solution since it is removing code that was there > before the commit you referenced in Fixes. Yes, but as I explain in the commit message, the condition does not make sense to me: why would we match the label against the interface name? This code has existed for a long time. -- The lunatic, the lover, and the poet, Are of imagination all compact... -- Wm. Shakespeare, "A Midsummer Night's Dream"
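The matching behaviour under discussion comes down to fnmatch(): ifa_label_match_rta() compares the filter pattern against the address label, while the removed print_linkinfo() check ran the same comparison against the interface name. A minimal userspace sketch shows why a label such as "lo:1" can never match the interface name "lo":

```c
#include <fnmatch.h>

/* Returns 1 when the user-supplied pattern matches the given string,
 * mirroring ifa_label_match_rta()'s use of fnmatch(). Run against the
 * interface name instead of the label (as the removed check did),
 * "lo:1" never matches "lo", so every address was filtered out. */
int label_matches(const char *pattern, const char *label)
{
    return fnmatch(pattern, label, 0) == 0;
}
```

Shell-style wildcards still work through this path, e.g. pattern "lo*" matches label "lo:1".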
[PATCH net-next 1/2] net: phy: add helper phy_config_aneg
This functionality will also be needed in subsequent patches of this series, so factor it out into a helper. Signed-off-by: Heiner Kallweit --- drivers/net/phy/phy.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 537297d2..c4aa360d 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -467,6 +467,14 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq *ifr, int cmd) } EXPORT_SYMBOL(phy_mii_ioctl); +static int phy_config_aneg(struct phy_device *phydev) +{ + if (phydev->drv->config_aneg) + return phydev->drv->config_aneg(phydev); + else + return genphy_config_aneg(phydev); +} + /** * phy_start_aneg_priv - start auto-negotiation for this PHY device * @phydev: the phy_device struct @@ -493,10 +501,7 @@ static int phy_start_aneg_priv(struct phy_device *phydev, bool sync) /* Invalidate LP advertising flags */ phydev->lp_advertising = 0; - if (phydev->drv->config_aneg) - err = phydev->drv->config_aneg(phydev); - else - err = genphy_config_aneg(phydev); + err = phy_config_aneg(phydev); if (err < 0) goto out_unlock; -- 2.18.0
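The helper's hook-or-fallback shape can be sketched standalone; the structs below are minimal stand-ins for struct phy_device / struct phy_driver, not the real definitions:

```c
#include <stddef.h>

/* Minimal stand-ins; only the config_aneg hook matters here. */
struct fake_drv { int (*config_aneg)(void *phydev); };
struct fake_phy { struct fake_drv *drv; };

static int generic_config_aneg(void *phydev)
{
    (void)phydev;
    return 0;   /* pretend genphy_config_aneg() succeeded */
}

/* Example driver-specific hook, used only to exercise the helper. */
static int sample_drv_aneg(void *phydev)
{
    (void)phydev;
    return 42;
}

/* Mirrors the new phy_config_aneg() helper: prefer the driver's hook
 * when one is provided, otherwise fall back to the generic routine. */
int config_aneg(struct fake_phy *phydev)
{
    if (phydev->drv->config_aneg)
        return phydev->drv->config_aneg(phydev);
    return generic_config_aneg(phydev);
}
```

Centralising this if/else means later callers (such as phy_speed_down()/phy_speed_up() in patch 2/2) cannot forget the generic fallback.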
[PATCH net-next 0/2] net: phy: add functionality to speed down PHY when waiting for WoL packet
Some network drivers include functionality to speed down the PHY when suspending and just waiting for a WoL packet because this saves energy. This series is based on our recent discussion about factoring out this functionality to phylib. The first user will be the r8169 driver. Heiner Kallweit (2): net: phy: add helper phy_config_aneg net: phy: add phy_speed_down and phy_speed_up drivers/net/phy/phy.c | 91 +-- include/linux/phy.h | 2 + 2 files changed, 89 insertions(+), 4 deletions(-) -- 2.18.0
Re: [net-next PATCH] net: ipv4: fix listify ip_rcv_finish in case of forwarding
On Wed, 11 Jul 2018 16:41:35 +0100 Edward Cree wrote: > On 11/07/18 16:01, Jesper Dangaard Brouer wrote: > > In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling > > dst_input(skb) was split-out. The ip_sublist_rcv_finish() just calls > > dst_input(skb) in a loop. > > > > The problem is that ip_sublist_rcv_finish() forgot to remove the SKB > > from the list before invoking dst_input(). Furthermore, we need to > > clear skb->next as other parts of the network stack use another kind > > of SKB lists for xmit_more (see dev_hard_start_xmit). > > > > A crash occurs if e.g. dst_input() invokes ip_forward(), which calls > > dst_output()/ip_output() that eventually calls __dev_queue_xmit() + > > sch_direct_xmit(), with the crash then surfacing in validate_xmit_skb_list(). > > > > This patch only fixes the crash, but there is a huge potential for > > a performance boost if we can pass an SKB-list through to ip_forward. > > > > Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") > > Signed-off-by: Jesper Dangaard Brouer > Acked-by: Edward Cree > > But it feels weird and asymmetric to only NULL skb->next (not ->prev), and > to have to do this by hand rather than e.g. being able to use > list_del_init(&skb->list). Hopefully this can be revisited once > sch_direct_xmit() has been changed to use the list_head rather than SKB > special lists. I cannot use list_del_init(&skb->list); it would also break. This is a fix, and this code should be revisited. The reason I used the list_del() + skb->next = NULL combo is to keep as much of the list-poisoning as possible, e.g. 'prev' will be LIST_POISON2. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
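The unlink pattern being defended here can be sketched on a minimal doubly-linked list; the poison values are illustrative copies of the kernel's LIST_POISON1/LIST_POISON2:

```c
#include <stddef.h>

struct node { struct node *next, *prev; };

/* Illustrative stand-ins for the kernel's list poison constants. */
#define POISON1 ((struct node *)0x100)
#define POISON2 ((struct node *)0x200)

/* list_del(): unlink and poison both pointers, as include/linux/list.h
 * does, so a use-after-unlink faults loudly. */
static void node_del(struct node *n)
{
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = POISON1;
    n->prev = POISON2;
}

/* The fix's pattern: after unlinking, clear only next, because code such
 * as dev_hard_start_xmit() walks ->next as an SKB chain and must see a
 * one-entry chain; prev keeps its poison value to catch misuse. */
void unlink_for_dst_input(struct node *n)
{
    node_del(n);
    n->next = NULL;
}
```

list_del_init() would instead point both next and prev back at the node itself, losing the poisoning and confusing ->next-walking code, which is why it "would also break" here.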
Re: [PATCH bpf-next v3 00/13] tools: bpf: extend bpftool prog load
On 07/10/2018 11:42 PM, Jakub Kicinski wrote: > Hi! > > This series starts with two minor clean-ups to the test_offload.py > selftest script. > > The next 11 patches extend the abilities of bpftool prog load > beyond the simple cgroup use cases. Three new parameters are > added: > > - type - allows specifying program type, independent of how >code sections are named; > - map - allows reusing existing maps, instead of creating a new >map on every program load; > - dev - offload/binding to a device. > > A number of changes to libbpf are required to accomplish the task. > The section-name-to-program-type mapping logic is exposed. We should > probably aim to use the libbpf program section naming everywhere. > For reuse of maps we need to allow users to set the FD for a bpf map > object in libbpf. > > Examples > > Load program my_xdp.o and pin it as /sys/fs/bpf/my_xdp, for xdp > program type: > > $ bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp \ > type xdp > > As above but for offload: > > $ bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp \ > type xdp \ > dev netdevsim0 > > Load program my_maps.o, but for the first map reuse map id 17, > and for the map called "other_map" reuse pinned map /sys/fs/bpf/map0: > > $ bpftool prog load my_maps.o /sys/fs/bpf/prog \ > map idx 0 id 17 \ > map name other_map pinned /sys/fs/bpf/map0 > > --- > v3: > - fix return codes in patch 5; > - rename libbpf_prog_type_by_string() -> libbpf_prog_type_by_name(); > - fold file path into xattr in patch 8; > - add patch 10; > - use dup3() in patch 12; > - depend on fd value in patch 12; > - close old fd in patch 12. > v2: > - add compat for reallocarray(). Applied to bpf-next, thanks Jakub!
Re: [net-next PATCH] net: ipv4: fix listify ip_rcv_finish in case of forwarding
On Wed, 11 Jul 2018 19:05:20 + Saeed Mahameed wrote: > On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote: > > Only driver sfc actually uses this, but I don't have this NIC, so I > > tested this on mlx5, with my own changes to make it use > > netif_receive_skb_list(), > > but I'm not ready to upstream the mlx5 driver change yet. > > > Thanks Jesper for sharing this; should we look forward to those patches > or do you want us to implement them? Well, I would prefer you to implement those. I just did a quick implementation (it's trivially easy) so I have something to benchmark with. The performance boost is quite impressive! One reason I didn't "just" send a patch is that Edward so far only implemented netif_receive_skb_list() and not napi_gro_receive_list(). And your driver uses napi_gro_receive(). This effectively disables GRO for your driver, which is not a choice I can make. Interestingly, I get around the same netperf TCP_STREAM performance. I assume we can get even better perf if we "listify" napi_gro_receive. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH] of: mdio: Support fixed links in of_phy_get_and_connect()
On Wed, Jul 11, 2018 at 07:45:11PM +0200, Linus Walleij wrote: > By a simple extension of of_phy_get_and_connect(), drivers > that connect over e.g. RGMII can also support > fixed links, so in addition to: > > ethernet-port { > phy-mode = "rgmii"; > phy-handle = <&foo>; > }; > > This setup with a fixed-link node and no phy-handle will > now also work just fine: > > ethernet-port { > phy-mode = "rgmii"; > fixed-link { > speed = <1000>; > full-duplex; > pause; > }; > }; > > This is very helpful for connecting random ethernet ports > to e.g. DSA switches that typically reside on fixed links. > > The phy-mode is still there as the fixed link in this case > is still an RGMII link. > > Tested on the Cortina Gemini driver with the Vitesse DSA > router chip on a fixed 1Gbit link. > > Suggested-by: Andrew Lunn > Signed-off-by: Linus Walleij Reviewed-by: Andrew Lunn What probably makes sense as a follow-up is to add an of_phy_disconnect_and_put(). When the module is unloaded, you leak a fixed link, because of_phy_deregister_fixed_link() is not being called. You also hold a reference to np which does not appear to be released. Andrew
Re: [PATCH bpf-next v4 3/3] bpf: btf: print map dump and lookup with btf info
On Tue, 10 Jul 2018 20:21:11 -0700, Okash Khawaja wrote: > + if (err || btf_info.btf_size > last_size) { > + err = errno; errno may not be set in the btf_info.btf_size > last_size case. Also, errno is positive here, while the other error return codes are negative. > + goto exit_free; > + }
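The two review points can be illustrated with a small sketch (check_btf_size() is a hypothetical stand-in for the code under review, not a bpftool function): a purely local size check never touches errno, so "err = errno" there reads stale state, and a raw errno is positive while the surrounding code returns negative error codes.

```c
#include <errno.h>

/* Sketch of the suggested shape: for a local consistency failure,
 * return an explicit negative code instead of reading errno, which the
 * last failed syscall may or may not have set. */
int check_btf_size(unsigned int btf_size, unsigned int last_size)
{
    if (btf_size > last_size)
        return -E2BIG;      /* explicit and negative; errno untouched */
    return 0;
}
```

For the genuine syscall-failure branch, the conventional conversion is err = -errno, taken immediately after the failing call.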
Re: [PATCH iproute2-next] ipaddress: fix label matching
On Wed, 11 Jul 2018 13:36:03 +0200 Vincent Bernat wrote: > Since 9516823051ce, "ip addr show label lo:1" doesn't work > anymore (doesn't show any address, despite a matching label). > Reverting to return 0 instead of -1 fixes the issue. > > However, the condition says: "if we filter by label [...] and the > label does NOT match the interface name". It makes little sense to > compare the label with the interface name. There is also logic > around whether a filter family is provided. The match against the > label is done by ifa_label_match_rta() in print_addrinfo() and > ipaddr_filter(). > > Just removing the condition makes "ip addr show" work as expected > with or without specifying a label, both when the label is matching > and not matching. It also works if we specify a label and the label is > the interface name. The flush operation also works as expected. > > Fixes: 9516823051ce ("ipaddress: Improve print_linkinfo()") > Signed-off-by: Vincent Bernat > --- > ip/ipaddress.c | 5 - > 1 file changed, 5 deletions(-) > > diff --git a/ip/ipaddress.c b/ip/ipaddress.c > index 5009bfe6d2e3..20ef6724944e 100644 > --- a/ip/ipaddress.c > +++ b/ip/ipaddress.c > @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, > if (!name) > return -1; > > - if (filter.label && > - (!filter.family || filter.family == AF_PACKET) && > - fnmatch(filter.label, name, 0)) > - return -1; > - > if (tb[IFLA_GROUP]) { > int group = rta_getattr_u32(tb[IFLA_GROUP]); > If this is a regression, it should go to iproute2 not iproute2-next. Surprised by the solution since it is removing code that was there before the commit you referenced in Fixes.
Re: [PATCH net-next 2/5 v3] net: gemini: Improve connection prints
On Wed, Jul 11, 2018 at 09:32:42PM +0200, Linus Walleij wrote: > Switch over to using a module parameter and debug prints > that can be controlled by this or ethtool like everyone > else. Depromote all other prints to debug messages. > > The phy_print_status() was already in place, albeit never > really used because the debuglevel hiding it had to be > set up using ethtool. > > Signed-off-by: Linus Walleij Reviewed-by: Andrew Lunn Andrew
Re: [PATCH net-next 5/5 v3] net: gemini: Indicate that we can handle jumboframes
On Wed, Jul 11, 2018 at 09:32:45PM +0200, Linus Walleij wrote: > The hardware supposedly handles frames up to 10236 bytes and > implements .ndo_change_mtu() so accept 10236 minus the ethernet > header for a VLAN tagged frame on the netdevices. Use > ETH_MIN_MTU as minimum MTU. > > Signed-off-by: Linus Walleij Reviewed-by: Andrew Lunn Andrew
[PATCH v3 net-next 01/19] net: Add decrypted field to skb
The decrypted bit is propagated to cloned/copied skbs. This will be used later by the inline crypto receive side offload of tls. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- include/linux/skbuff.h | 7 ++- net/core/skbuff.c | 6 ++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 7601838..3ceb8dc 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -630,6 +630,7 @@ enum { * @hash: the packet hash * @queue_mapping: Queue mapping for multiqueue devices * @xmit_more: More SKBs are pending for this queue + * @decrypted: Decrypted SKB * @ndisc_nodetype: router type (from link layer) * @ooo_okay: allow the mapping of a socket to a queue to be changed * @l4_hash: indicate hash is a canonical 4-tuple hash over transport @@ -736,7 +737,11 @@ struct sk_buff { peeked:1, head_frag:1, xmit_more:1, - __unused:1; /* one bit hole */ +#ifdef CONFIG_TLS_DEVICE + decrypted:1; +#else + __unused:1; +#endif /* fields enclosed in headers_start/headers_end are copied * using a single memcpy() in __copy_skb_header() diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c4e24ac..cfd6c6f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -805,6 +805,9 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old) * It is not yet because we do not want to have a 16 bit hole */ new->queue_mapping = old->queue_mapping; +#ifdef CONFIG_TLS_DEVICE + new->decrypted = old->decrypted; +#endif memcpy(&new->headers_start, &old->headers_start, offsetof(struct sk_buff, headers_end) - @@ -865,6 +868,9 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb) C(head_frag); C(data); C(truesize); +#ifdef CONFIG_TLS_DEVICE + C(decrypted); +#endif refcount_set(&n->users, 1); atomic_inc(&(skb_shinfo(skb)->dataref)); -- 1.8.3.1
[PATCH v3 net-next 18/19] net/mlx5e: IPsec, fix byte count in CQE
This patch fixes the byte count indication in CQE for processed IPsec packets that contain a metadata header. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c index fda7929..128a82b 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c @@ -364,6 +364,7 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, } remove_metadata_hdr(skb); + *cqe_bcnt -= MLX5E_METADATA_ETHER_LEN; return skb; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h index 2bfbbef..ca47c05 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h @@ -41,7 +41,7 @@ #include "en.h" struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, - struct sk_buff *skb); + struct sk_buff *skb, u32 *cqe_bcnt); void mlx5e_ipsec_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe); void mlx5e_ipsec_inverse_table_init(void); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 847e195..4a85b26 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -1470,7 +1470,7 @@ void mlx5e_ipsec_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe) mlx5e_free_rx_wqe(rq, wi); goto wq_ll_pop; } - skb = mlx5e_ipsec_handle_rx_skb(rq->netdev, skb); + skb = mlx5e_ipsec_handle_rx_skb(rq->netdev, skb, &cqe_bcnt); if (unlikely(!skb)) { 
mlx5e_free_rx_wqe(rq, wi); goto wq_ll_pop; -- 1.8.3.1
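The fix pairs header removal with the byte-count adjustment. A userspace sketch of that pairing (META_LEN is a stand-in for MLX5E_METADATA_ETHER_LEN, and the frame layout is simplified):

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6
#define META_LEN 8u   /* stand-in for MLX5E_METADATA_ETHER_LEN */

/* Sketch: the inline metadata sits between the two MAC addresses and
 * the real ethertype. Removing it slides the 12 address bytes forward
 * and, crucially, shrinks the completion byte count by the same
 * amount, which is the adjustment this patch adds. */
uint8_t *strip_metadata(uint8_t *data, uint32_t *cqe_bcnt)
{
    memmove(data + META_LEN, data, 2 * ETH_ALEN);
    *cqe_bcnt -= META_LEN;
    return data + META_LEN;   /* new frame start, as after skb_pull() */
}
```

Without the subtraction, upper layers would account for METADATA_ETHER_LEN bytes that are no longer part of the frame.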
[PATCH v3 net-next 03/19] net: Add TLS rx resync NDO
Add new netdev tls op for resynchronizing HW tls context Signed-off-by: Boris Pismenny --- include/linux/netdevice.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b683971..0434df3 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -903,6 +903,8 @@ struct tlsdev_ops { void (*tls_dev_del)(struct net_device *netdev, struct tls_context *ctx, enum tls_offload_ctx_dir direction); + void (*tls_dev_resync_rx)(struct net_device *netdev, + struct sock *sk, u32 seq, u64 rcd_sn); }; #endif -- 1.8.3.1
[PATCH v3 net-next 19/19] net/mlx5e: Kconfig, mutually exclude compilation of TLS and IPsec accel
We currently have no devices that support both TLS and IPsec using the accel framework, and the current code does not handle that combination either. This patch prevents such configurations. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index 2545296..d3e8c70 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -93,6 +93,7 @@ config MLX5_EN_TLS depends on TLS_DEVICE depends on TLS=y || MLX5_CORE=m depends on MLX5_ACCEL + depends on !MLX5_EN_IPSEC default n ---help--- Build support for TLS cryptography-offload accelaration in the NIC. -- 1.8.3.1
KASAN: slab-out-of-bounds Read in rds_cong_queue_updates (2)
Hello, syzbot found the following crash on: HEAD commit:0026129c8629 rhashtable: add restart routine in rhashtable.. git tree: net console output: https://syzkaller.appspot.com/x/log.txt?x=10b7ced040 kernel config: https://syzkaller.appspot.com/x/.config?x=b88de6eac8694da6 dashboard link: https://syzkaller.appspot.com/bug?extid=0570fef57a5e020bdc87 compiler: gcc (GCC) 8.0.1 20180413 (experimental) Unfortunately, I don't have any reproducer for this crash yet. IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+0570fef57a5e020bd...@syzkaller.appspotmail.com == BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline] BUG: KASAN: slab-out-of-bounds in refcount_read include/linux/refcount.h:42 [inline] BUG: KASAN: slab-out-of-bounds in check_net include/net/net_namespace.h:237 [inline] BUG: KASAN: slab-out-of-bounds in rds_destroy_pending net/rds/rds.h:902 [inline] BUG: KASAN: slab-out-of-bounds in rds_cong_queue_updates+0x25d/0x5b0 net/rds/cong.c:226 Read of size 4 at addr 88019f8ec204 by task syz-executor1/27023 CPU: 0 PID: 27023 Comm: syz-executor1 Not tainted 4.18.0-rc3+ #5 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline] refcount_read include/linux/refcount.h:42 [inline] check_net include/net/net_namespace.h:237 [inline] rds_destroy_pending net/rds/rds.h:902 [inline] rds_cong_queue_updates+0x25d/0x5b0 net/rds/cong.c:226 rds_recv_rcvbuf_delta.part.3+0x332/0x3e0 
net/rds/recv.c:123 rds_recv_rcvbuf_delta net/rds/recv.c:382 [inline] rds_recv_incoming+0x85a/0x1320 net/rds/recv.c:382 netlink: 'syz-executor2': attribute type 18 has an invalid length. rds_loop_xmit+0x16a/0x340 net/rds/loop.c:95 rds_send_xmit+0x1343/0x29c0 net/rds/send.c:355 netlink: 180 bytes leftover after parsing attributes in process `syz-executor5'. rds_sendmsg+0x229e/0x2a40 net/rds/send.c:1243 netlink: 180 bytes leftover after parsing attributes in process `syz-executor5'. sock_sendmsg_nosec net/socket.c:641 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:651 __sys_sendto+0x3d7/0x670 net/socket.c:1797 __do_sys_sendto net/socket.c:1809 [inline] __se_sys_sendto net/socket.c:1805 [inline] __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x455e29 Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7fd164b21c68 EFLAGS: 0246 ORIG_RAX: 002c RAX: ffda RBX: 7fd164b226d4 RCX: 00455e29 RDX: 0481 RSI: 2000 RDI: 0013 RBP: 0072bea0 R08: 2069affb R09: 0010 R10: R11: 0246 R12: R13: 004c14f2 R14: 004d1a08 R15: Allocated by task 26052: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554 getname_flags+0xd0/0x5a0 fs/namei.c:140 getname+0x19/0x20 fs/namei.c:211 do_sys_open+0x3a2/0x760 fs/open.c:1095 __do_sys_open fs/open.c:1119 [inline] __se_sys_open fs/open.c:1114 [inline] __x64_sys_open+0x7e/0xc0 fs/open.c:1114 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 26052: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kmem_cache_free+0x86/0x2d0 mm/slab.c:3756 putname+0xf2/0x130 fs/namei.c:261 do_sys_open+0x569/0x760 fs/open.c:1110 __do_sys_open fs/open.c:1119 [inline] __se_sys_open fs/open.c:1114 [inline] __x64_sys_open+0x7e/0xc0 fs/open.c:1114 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at 88019f8ec280 which belongs to the cache names_cache of size 4096 The buggy address is located 124 bytes to the left of 4096-byte region [88019f8ec280, 88019f8ed280) The buggy add
[PATCH v3 net-next 14/19] net/mlx5e: TLS, add Innova TLS rx data path
Implement the TLS rx offload data path according to the requirements of the TLS generic NIC offload infrastructure. Special metadata ethertype is used to pass information to the hardware. When hardware loses synchronization a special resync request metadata message is used to request resync. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 112 - .../mellanox/mlx5/core/en_accel/tls_rxtx.h | 3 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 6 ++ 3 files changed, 118 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index c96196f..d460fda 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -33,6 +33,12 @@ #include "en_accel/tls.h" #include "en_accel/tls_rxtx.h" +#include +#include + +#define SYNDROM_DECRYPTED 0x30 +#define SYNDROM_RESYNC_REQUEST 0x31 +#define SYNDROM_AUTH_FAILED 0x32 #define SYNDROME_OFFLOAD_REQUIRED 32 #define SYNDROME_SYNC 33 @@ -44,10 +50,26 @@ struct sync_info { skb_frag_t frags[MAX_SKB_FRAGS]; }; -struct mlx5e_tls_metadata { +struct recv_metadata_content { + u8 syndrome; + u8 reserved; + __be32 sync_seq; +} __packed; + +struct send_metadata_content { /* One byte of syndrome followed by 3 bytes of swid */ __be32 syndrome_swid; __be16 first_seq; +} __packed; + +struct mlx5e_tls_metadata { + union { + /* from fpga to host */ + struct recv_metadata_content recv; + /* from host to fpga */ + struct send_metadata_content send; + unsigned char raw[6]; + } __packed content; /* packet type ID field */ __be16 ethertype; } __packed; @@ -68,7 +90,8 @@ static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid) 2 * ETH_ALEN); eth->h_proto = cpu_to_be16(MLX5E_METADATA_ETHER_TYPE); - pet->syndrome_swid = htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid; + pet->content.send.syndrome_swid = + 
htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid; return 0; } @@ -149,7 +172,7 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb, pet = (struct mlx5e_tls_metadata *)(nskb->data + sizeof(struct ethhdr)); memcpy(pet, &syndrome, sizeof(syndrome)); - pet->first_seq = htons(tcp_seq); + pet->content.send.first_seq = htons(tcp_seq); /* MLX5 devices don't care about the checksum partial start, offset * and pseudo header @@ -276,3 +299,86 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev, out: return skb; } + +static int tls_update_resync_sn(struct net_device *netdev, + struct sk_buff *skb, + struct mlx5e_tls_metadata *mdata) +{ + struct sock *sk = NULL; + struct iphdr *iph; + struct tcphdr *th; + __be32 seq; + + if (mdata->ethertype != htons(ETH_P_IP)) + return -EINVAL; + + iph = (struct iphdr *)(mdata + 1); + + th = ((void *)iph) + iph->ihl * 4; + + if (iph->version == 4) { + sk = inet_lookup_established(dev_net(netdev), &tcp_hashinfo, +iph->saddr, th->source, iph->daddr, +th->dest, netdev->ifindex); +#if IS_ENABLED(CONFIG_IPV6) + } else { + struct ipv6hdr *ipv6h = (struct ipv6hdr *)iph; + + sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo, + &ipv6h->saddr, th->source, + &ipv6h->daddr, th->dest, + netdev->ifindex, 0); +#endif + } + if (!sk || sk->sk_state == TCP_TIME_WAIT) + goto out; + + skb->sk = sk; + skb->destructor = sock_edemux; + + memcpy(&seq, &mdata->content.recv.sync_seq, sizeof(seq)); + tls_offload_rx_resync_request(sk, seq); +out: + return 0; +} + +void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, +u32 *cqe_bcnt) +{ + struct mlx5e_tls_metadata *mdata; + struct ethhdr *old_eth; + struct ethhdr *new_eth; + __be16 *ethtype; + + /* Detect inline metadata */ + if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) + return; + ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); + if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + return; + + /* Use the metadata */ + mdata = (struct 
mlx5e_tls_metadata *)(skb->data + ETH_HLEN); + switch (mdata->content.recv.syndrome) { + case SYND
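The detection step at the top of the rx handler can be sketched in isolation; the ethertype bytes and metadata length below are illustrative stand-ins, not the real MLX5E_METADATA_ETHER_TYPE/LEN values:

```c
#include <stdint.h>

#define ETH_ALEN 6
/* Illustrative metadata ethertype bytes; the real value is
 * hardware-defined. */
#define META_TYPE_HI 0x8C
#define META_TYPE_LO 0xE4

/* Mirrors the detection in mlx5e_tls_handle_rx_skb(): the special
 * ethertype sits immediately after the two MAC addresses. Frames that
 * are too short, or that carry a normal ethertype there, have no
 * inline metadata and are passed through untouched. */
int has_inline_metadata(const uint8_t *data, uint32_t len,
                        uint32_t meta_len)
{
    if (len < 2 * ETH_ALEN + 2 + meta_len)
        return 0;
    return data[2 * ETH_ALEN] == META_TYPE_HI &&
           data[2 * ETH_ALEN + 1] == META_TYPE_LO;
}
```

Only when this check fires does the handler read the syndrome from the metadata and, for a resync request, look up the socket and feed the sequence number to tls_offload_rx_resync_request().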
[PATCH v3 net-next 16/19] net/mlx5e: TLS, build TLS netdev from capabilities
This patch enables TLS Rx based on available HW capabilities. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index 541e6f4..eddd7702 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -183,13 +183,27 @@ static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk, void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { + u32 caps = mlx5_accel_tls_device_caps(priv->mdev); struct net_device *netdev = priv->netdev; if (!mlx5_accel_is_tls_device(priv->mdev)) return; - netdev->features |= NETIF_F_HW_TLS_TX; - netdev->hw_features |= NETIF_F_HW_TLS_TX; + if (caps & MLX5_ACCEL_TLS_TX) { + netdev->features |= NETIF_F_HW_TLS_TX; + netdev->hw_features |= NETIF_F_HW_TLS_TX; + } + + if (caps & MLX5_ACCEL_TLS_RX) { + netdev->features |= NETIF_F_HW_TLS_RX; + netdev->hw_features |= NETIF_F_HW_TLS_RX; + } + + if (!(caps & MLX5_ACCEL_TLS_LRO)) { + netdev->features &= ~NETIF_F_LRO; + netdev->hw_features &= ~NETIF_F_LRO; + } + netdev->tlsdev_ops = &mlx5e_tls_ops; } -- 1.8.3.1
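The caps-to-features mapping this patch introduces can be sketched with stand-in bit values (the real MLX5_ACCEL_TLS_* caps and NETIF_F_* feature flags are defined elsewhere and differ from these):

```c
#include <stdint.h>

/* Stand-in bit values for illustration only. */
#define CAP_TLS_TX   0x1u
#define CAP_TLS_RX   0x2u
#define CAP_TLS_LRO  0x4u
#define FEAT_TLS_TX  0x10u
#define FEAT_TLS_RX  0x20u
#define FEAT_LRO     0x40u

/* Sketch of mlx5e_tls_build_netdev() after this patch: advertise TLS
 * TX and RX offload per the device caps, and drop LRO when the device
 * cannot combine LRO with TLS offload. */
uint32_t build_features(uint32_t features, uint32_t caps)
{
    if (caps & CAP_TLS_TX)
        features |= FEAT_TLS_TX;
    if (caps & CAP_TLS_RX)
        features |= FEAT_TLS_RX;
    if (!(caps & CAP_TLS_LRO))
        features &= ~FEAT_LRO;
    return features;
}
```

Driving the netdev feature bits from the capability word keeps the advertised offloads honest per device, instead of unconditionally setting NETIF_F_HW_TLS_TX as before.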
[PATCH v3 net-next 05/19] tls: Refactor tls_offload variable names
For symmetry, we rename tls_offload_context to tls_offload_context_tx before we add tls_offload_context_rx. Signed-off-by: Boris Pismenny --- .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 6 +++--- include/net/tls.h | 16 +++--- net/tls/tls_device.c | 25 +++--- net/tls/tls_device_fallback.c | 8 +++ 4 files changed, 27 insertions(+), 28 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index b616217..b82f4de 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -50,7 +50,7 @@ struct mlx5e_tls { }; struct mlx5e_tls_offload_context { - struct tls_offload_context base; + struct tls_offload_context_tx base; u32 expected_seq; __be32 swid; }; @@ -59,8 +59,8 @@ struct mlx5e_tls_offload_context { mlx5e_get_tls_tx_context(struct tls_context *tls_ctx) { BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) > -TLS_OFFLOAD_CONTEXT_SIZE); - return container_of(tls_offload_ctx(tls_ctx), +TLS_OFFLOAD_CONTEXT_SIZE_TX); + return container_of(tls_offload_ctx_tx(tls_ctx), struct mlx5e_tls_offload_context, base); } diff --git a/include/net/tls.h b/include/net/tls.h index 70c2737..5dcd808 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -128,7 +128,7 @@ struct tls_record_info { skb_frag_t frags[MAX_SKB_FRAGS]; }; -struct tls_offload_context { +struct tls_offload_context_tx { struct crypto_aead *aead_send; spinlock_t lock;/* protects records list */ struct list_head records_list; @@ -147,8 +147,8 @@ struct tls_offload_context { #define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *))) }; -#define TLS_OFFLOAD_CONTEXT_SIZE \ - (ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) + \ +#define TLS_OFFLOAD_CONTEXT_SIZE_TX \ + (ALIGN(sizeof(struct tls_offload_context_tx), sizeof(void *)) +\ TLS_DRIVER_STATE_SIZE) enum { @@ -239,7 +239,7 @@ int tls_device_sendpage(struct sock *sk, struct page *page, void 
tls_device_init(void); void tls_device_cleanup(void); -struct tls_record_info *tls_get_record(struct tls_offload_context *context, +struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context, u32 seq, u64 *p_record_sn); static inline bool tls_record_is_start_marker(struct tls_record_info *rec) @@ -380,10 +380,10 @@ static inline struct tls_sw_context_tx *tls_sw_ctx_tx( return (struct tls_sw_context_tx *)tls_ctx->priv_ctx_tx; } -static inline struct tls_offload_context *tls_offload_ctx( - const struct tls_context *tls_ctx) +static inline struct tls_offload_context_tx * +tls_offload_ctx_tx(const struct tls_context *tls_ctx) { - return (struct tls_offload_context *)tls_ctx->priv_ctx_tx; + return (struct tls_offload_context_tx *)tls_ctx->priv_ctx_tx; } int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, @@ -396,7 +396,7 @@ struct sk_buff *tls_validate_xmit_skb(struct sock *sk, struct sk_buff *skb); int tls_sw_fallback_init(struct sock *sk, -struct tls_offload_context *offload_ctx, +struct tls_offload_context_tx *offload_ctx, struct tls_crypto_info *crypto_info); #endif /* _TLS_OFFLOAD_H */ diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index a7a8f8e..332a5d1 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -52,9 +52,8 @@ static void tls_device_free_ctx(struct tls_context *ctx) { - struct tls_offload_context *offload_ctx = tls_offload_ctx(ctx); + kfree(tls_offload_ctx_tx(ctx)); - kfree(offload_ctx); kfree(ctx); } @@ -125,7 +124,7 @@ static void destroy_record(struct tls_record_info *record) kfree(record); } -static void delete_all_records(struct tls_offload_context *offload_ctx) +static void delete_all_records(struct tls_offload_context_tx *offload_ctx) { struct tls_record_info *info, *temp; @@ -141,14 +140,14 @@ static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq) { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_record_info *info, *temp; - struct tls_offload_context *ctx; + struct 
tls_offload_context_tx *ctx; u64 deleted_records = 0; unsigned long flags; if (!tls_ctx) return; - ctx = tls_offload_ctx(tls_ctx); + ctx = tls_offload_ctx_tx(tls_ctx); spin_lock_irqsave(&ctx->lock, flags
[PATCH v3 net-next 17/19] net/mlx5: Accel, add common metadata functions
This patch adds common functions to handle mellanox metadata headers. These functions are used by IPsec and TLS to process FPGA metadata. Signed-off-by: Boris Pismenny --- .../net/ethernet/mellanox/mlx5/core/accel/accel.h | 37 ++ .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 19 +++ .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 18 +++ 3 files changed, 45 insertions(+), 29 deletions(-) create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h new file mode 100644 index 000..c132604 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h @@ -0,0 +1,37 @@ +#ifndef __MLX5E_ACCEL_H__ +#define __MLX5E_ACCEL_H__ + +#ifdef CONFIG_MLX5_ACCEL + +#include +#include +#include "en.h" + +static inline bool is_metadata_hdr_valid(struct sk_buff *skb) +{ + __be16 *ethtype; + + if (unlikely(skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN)) + return false; + ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); + if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + return false; + return true; +} + +static inline void remove_metadata_hdr(struct sk_buff *skb) +{ + struct ethhdr *old_eth; + struct ethhdr *new_eth; + + /* Remove the metadata from the buffer */ + old_eth = (struct ethhdr *)skb->data; + new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN); + memmove(new_eth, old_eth, 2 * ETH_ALEN); + /* Ethertype is already in its new place */ + skb_pull_inline(skb, MLX5E_METADATA_ETHER_LEN); +} + +#endif /* CONFIG_MLX5_ACCEL */ + +#endif /* __MLX5E_EN_ACCEL_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c index c245d8e..fda7929 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c @@ -37,6 +37,7 @@ #include "en_accel/ipsec_rxtx.h" #include 
"en_accel/ipsec.h" +#include "accel/accel.h" #include "en.h" enum { @@ -346,19 +347,12 @@ struct sk_buff *mlx5e_ipsec_handle_tx_skb(struct net_device *netdev, } struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, - struct sk_buff *skb) + struct sk_buff *skb, u32 *cqe_bcnt) { struct mlx5e_ipsec_metadata *mdata; - struct ethhdr *old_eth; - struct ethhdr *new_eth; struct xfrm_state *xs; - __be16 *ethtype; - /* Detect inline metadata */ - if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) - return skb; - ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); - if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + if (!is_metadata_hdr_valid(skb)) return skb; /* Use the metadata */ @@ -369,12 +363,7 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, return NULL; } - /* Remove the metadata from the buffer */ - old_eth = (struct ethhdr *)skb->data; - new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN); - memmove(new_eth, old_eth, 2 * ETH_ALEN); - /* Ethertype is already in its new place */ - skb_pull_inline(skb, MLX5E_METADATA_ETHER_LEN); + remove_metadata_hdr(skb); return skb; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index ecfc764..92d3745 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -33,6 +33,8 @@ #include "en_accel/tls.h" #include "en_accel/tls_rxtx.h" +#include "accel/accel.h" + #include #include @@ -350,16 +352,9 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, u32 *cqe_bcnt) { struct mlx5e_tls_metadata *mdata; - struct ethhdr *old_eth; - struct ethhdr *new_eth; - __be16 *ethtype; struct mlx5e_priv *priv; - /* Detect inline metadata */ - if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) - return; - ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); - if (*ethtype != 
cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + if (!is_metadata_hdr_valid(skb)) return; /* Use the metadata */ @@ -383,11 +378,6 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, return; } - /* Remove the metadata from the buffer */ - old_eth = (struct ethhdr *)skb->data; - new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN); - memmove(new_eth,
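The two helpers factored out above can be exercised in plain userspace C. The following is a minimal sketch, not the driver code: META_ETHER_LEN and META_ETHER_TYPE are assumed placeholder values standing in for the driver's MLX5E_METADATA_ETHER_LEN and MLX5E_METADATA_ETHER_TYPE, and a raw byte buffer stands in for the skb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN        6
#define ETH_HLEN        14
#define META_ETHER_LEN  8       /* assumed metadata header length */
#define META_ETHER_TYPE 0x8CE4U /* assumed metadata ethertype */

/* Mirrors is_metadata_hdr_valid(): the metadata header sits right after
 * the two MAC addresses, in the slot where the ethertype normally is. */
static bool metadata_hdr_valid(const uint8_t *data, size_t len)
{
	uint16_t ethtype;

	if (len < ETH_HLEN + META_ETHER_LEN)
		return false;
	/* 16-bit ethertype, network (big-endian) byte order */
	ethtype = ((uint16_t)data[2 * ETH_ALEN] << 8) | data[2 * ETH_ALEN + 1];
	return ethtype == META_ETHER_TYPE;
}

/* Mirrors remove_metadata_hdr(): slide the 12 MAC-address bytes forward
 * over the metadata, then advance the frame start (the kernel version
 * does the same with memmove() + skb_pull_inline()). */
static uint8_t *remove_metadata_hdr(uint8_t *data)
{
	memmove(data + META_ETHER_LEN, data, 2 * ETH_ALEN);
	return data + META_ETHER_LEN;
}
```

Note that only the 12 MAC bytes move; the real ethertype that follows the metadata is "already in its new place", exactly as the comment in the patch says.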
[PATCH v3 net-next 11/19] net/mlx5e: TLS, refactor variable names
For symmetry, we rename mlx5e_tls_offload_context to mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx. Signed-off-by: Boris Pismenny Reviewed-by: Aviad Yehezkel Reviewed-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 8 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c | 6 +++--- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index d167845..7fb9c75 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -123,7 +123,7 @@ static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk, goto free_flow; if (direction == TLS_OFFLOAD_CTX_DIR_TX) { - struct mlx5e_tls_offload_context *tx_ctx = + struct mlx5e_tls_offload_context_tx *tx_ctx = mlx5e_get_tls_tx_context(tls_ctx); u32 swid; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index b82f4de..e26222a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -49,19 +49,19 @@ struct mlx5e_tls { struct mlx5e_tls_sw_stats sw_stats; }; -struct mlx5e_tls_offload_context { +struct mlx5e_tls_offload_context_tx { struct tls_offload_context_tx base; u32 expected_seq; __be32 swid; }; -static inline struct mlx5e_tls_offload_context * +static inline struct mlx5e_tls_offload_context_tx * mlx5e_get_tls_tx_context(struct tls_context *tls_ctx) { - BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) > + BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context_tx) > TLS_OFFLOAD_CONTEXT_SIZE_TX); return container_of(tls_offload_ctx_tx(tls_ctx), - struct mlx5e_tls_offload_context, + struct mlx5e_tls_offload_context_tx, base); } diff --git 
a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index 15aef71..c96196f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -73,7 +73,7 @@ static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid) return 0; } -static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context *context, +static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context_tx *context, u32 tcp_seq, struct sync_info *info) { int remaining, i = 0, ret = -EINVAL; @@ -161,7 +161,7 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb, } static struct sk_buff * -mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context, +mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context, struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5e_tx_wqe **wqe, u16 *pi, @@ -239,7 +239,7 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev, u16 *pi) { struct mlx5e_priv *priv = netdev_priv(netdev); - struct mlx5e_tls_offload_context *context; + struct mlx5e_tls_offload_context_tx *context; struct tls_context *tls_ctx; u32 expected_seq; int datalen; -- 1.8.3.1
[PATCH v3 net-next 06/19] tls: Split decrypt_skb to two functions
Previously, decrypt_skb also updated the TLS context. Now, decrypt_skb only decrypts the payload using the current context, while decrypt_skb_update also updates the state. Later, in the tls_device Rx flow, we will use decrypt_skb directly. Signed-off-by: Boris Pismenny --- include/net/tls.h | 2 ++ net/tls/tls_sw.c | 44 ++-- 2 files changed, 28 insertions(+), 18 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 5dcd808..49b8922 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -390,6 +390,8 @@ int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, unsigned char *record_type); void tls_register_device(struct tls_device *device); void tls_unregister_device(struct tls_device *device); +int decrypt_skb(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout); struct sk_buff *tls_validate_xmit_skb(struct sock *sk, struct net_device *dev, diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 3bd7c14..99d0347 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -53,7 +53,6 @@ static int tls_do_decryption(struct sock *sk, { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); - struct strp_msg *rxm = strp_msg(skb); struct aead_request *aead_req; int ret; @@ -74,18 +73,6 @@ static int tls_do_decryption(struct sock *sk, ret = crypto_wait_req(crypto_aead_decrypt(aead_req), &ctx->async_wait); - if (ret < 0) - goto out; - - rxm->offset += tls_ctx->rx.prepend_size; - rxm->full_len -= tls_ctx->rx.overhead_size; - tls_advance_record_sn(sk, &tls_ctx->rx); - - ctx->decrypted = true; - - ctx->saved_data_ready(sk); - -out: kfree(aead_req); return ret; } @@ -670,8 +657,29 @@ static struct sk_buff *tls_wait_data(struct sock *sk, int flags, return skb; } -static int decrypt_skb(struct sock *sk, struct sk_buff *skb, - struct scatterlist *sgout) +static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + 
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct strp_msg *rxm = strp_msg(skb); + int err = 0; + + err = decrypt_skb(sk, skb, sgout); + if (err < 0) + return err; + + rxm->offset += tls_ctx->rx.prepend_size; + rxm->full_len -= tls_ctx->rx.overhead_size; + tls_advance_record_sn(sk, &tls_ctx->rx); + ctx->decrypted = true; + ctx->saved_data_ready(sk); + + return err; +} + +int decrypt_skb(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout) { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); @@ -821,7 +829,7 @@ int tls_sw_recvmsg(struct sock *sk, if (err < 0) goto fallback_to_reg_recv; - err = decrypt_skb(sk, skb, sgin); + err = decrypt_skb_update(sk, skb, sgin); for (; pages > 0; pages--) put_page(sg_page(&sgin[pages])); if (err < 0) { @@ -830,7 +838,7 @@ int tls_sw_recvmsg(struct sock *sk, } } else { fallback_to_reg_recv: - err = decrypt_skb(sk, skb, NULL); + err = decrypt_skb_update(sk, skb, NULL); if (err < 0) { tls_err_abort(sk, EBADMSG); goto recv_end; @@ -901,7 +909,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, } if (!ctx->decrypted) { - err = decrypt_skb(sk, skb, NULL); + err = decrypt_skb_update(sk, skb, NULL); if (err < 0) { tls_err_abort(sk, EBADMSG); -- 1.8.3.1
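The bookkeeping that this patch moves out of tls_do_decryption() and into decrypt_skb_update() can be illustrated in isolation. This is a sketch under assumed TLS 1.2 AES-GCM-128 sizes (5-byte record header, 8-byte explicit nonce, 16-byte tag); the kernel derives the real prepend_size/overhead_size per cipher, and struct record_view is a stand-in for strp_msg.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed TLS 1.2 AES-GCM-128 framing sizes. */
#define TLS_HEADER_SIZE   5	/* type + version + length */
#define TLS_GCM_IV_SIZE   8	/* explicit nonce */
#define TLS_GCM_TAG_SIZE 16	/* authentication tag */

struct record_view {
	size_t offset;   /* start of usable data within the record buffer */
	size_t full_len; /* bytes from offset to end of record */
};

/* Mirrors the state update in decrypt_skb_update(): after a successful
 * decrypt, skip past header + IV, and drop header + IV + tag from the
 * usable length, leaving only the plaintext window. */
static void record_mark_decrypted(struct record_view *rxm)
{
	size_t prepend  = TLS_HEADER_SIZE + TLS_GCM_IV_SIZE;
	size_t overhead = prepend + TLS_GCM_TAG_SIZE;

	rxm->offset   += prepend;
	rxm->full_len -= overhead;
}
```

Splitting the pure decryption from this update is what lets the tls_device Rx path call decrypt_skb() alone, without advancing the record sequence number.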
[PATCH v3 net-next 08/19] tls: Fill software context without allocation
This patch allows tls_set_sw_offload to fill the context in case it was already allocated previously. We will use it in TLS_DEVICE to fill the RX software context. Signed-off-by: Boris Pismenny --- net/tls/tls_sw.c | 34 ++ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 86e22bc..5073676 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -1090,28 +1090,38 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx) } if (tx) { - sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL); - if (!sw_ctx_tx) { - rc = -ENOMEM; - goto out; + if (!ctx->priv_ctx_tx) { + sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL); + if (!sw_ctx_tx) { + rc = -ENOMEM; + goto out; + } + ctx->priv_ctx_tx = sw_ctx_tx; + } else { + sw_ctx_tx = + (struct tls_sw_context_tx *)ctx->priv_ctx_tx; } - crypto_init_wait(&sw_ctx_tx->async_wait); - ctx->priv_ctx_tx = sw_ctx_tx; } else { - sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL); - if (!sw_ctx_rx) { - rc = -ENOMEM; - goto out; + if (!ctx->priv_ctx_rx) { + sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL); + if (!sw_ctx_rx) { + rc = -ENOMEM; + goto out; + } + ctx->priv_ctx_rx = sw_ctx_rx; + } else { + sw_ctx_rx = + (struct tls_sw_context_rx *)ctx->priv_ctx_rx; } - crypto_init_wait(&sw_ctx_rx->async_wait); - ctx->priv_ctx_rx = sw_ctx_rx; } if (tx) { + crypto_init_wait(&sw_ctx_tx->async_wait); crypto_info = &ctx->crypto_send; cctx = &ctx->tx; aead = &sw_ctx_tx->aead_send; } else { + crypto_init_wait(&sw_ctx_rx->async_wait); crypto_info = &ctx->crypto_recv; cctx = &ctx->rx; aead = &sw_ctx_rx->aead_recv; -- 1.8.3.1
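The allocate-if-missing pattern this patch introduces can be sketched on its own. The structures below are illustrative stand-ins (not the kernel's tls_context/tls_sw_context_rx), and calloc replaces kzalloc:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-ins for the kernel structures. */
struct sw_ctx { int initialized; };
struct conn_ctx { struct sw_ctx *priv_ctx_rx; };

/* Reuse a context that an earlier layer (tls_device in the patch) has
 * already allocated; otherwise allocate a zeroed one. */
static struct sw_ctx *get_or_alloc_rx_ctx(struct conn_ctx *ctx)
{
	if (!ctx->priv_ctx_rx)
		ctx->priv_ctx_rx = calloc(1, sizeof(struct sw_ctx));
	return ctx->priv_ctx_rx;
}
```

This is why the patch also moves crypto_init_wait() after the branch: initialization must run whether the context was freshly allocated or handed in pre-allocated.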
[PATCH v3 net-next 07/19] tls: Split tls_sw_release_resources_rx
This patch splits tls_sw_release_resources_rx into two functions: one that releases all inner software TLS structures, and another that also frees the containing structure. In TLS_DEVICE we will need to release the software structures without freeing the containing structure, which contains other information. Signed-off-by: Boris Pismenny --- include/net/tls.h | 1 + net/tls/tls_sw.c | 10 +- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/net/tls.h b/include/net/tls.h index 49b8922..7a485de 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -223,6 +223,7 @@ int tls_sw_sendpage(struct sock *sk, struct page *page, void tls_sw_close(struct sock *sk, long timeout); void tls_sw_free_resources_tx(struct sock *sk); void tls_sw_free_resources_rx(struct sock *sk); +void tls_sw_release_resources_rx(struct sock *sk); int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len); unsigned int tls_sw_poll(struct file *file, struct socket *sock, diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 99d0347..86e22bc 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -1039,7 +1039,7 @@ void tls_sw_free_resources_tx(struct sock *sk) kfree(ctx); } -void tls_sw_free_resources_rx(struct sock *sk) +void tls_sw_release_resources_rx(struct sock *sk) { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); @@ -1058,6 +1058,14 @@ void tls_sw_free_resources_rx(struct sock *sk) strp_done(&ctx->strp); lock_sock(sk); } +} + +void tls_sw_free_resources_rx(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + + tls_sw_release_resources_rx(sk); kfree(ctx); } -- 1.8.3.1
[PATCH v3 net-next 10/19] tls: Fix zerocopy_from_iter iov handling
zerocopy_from_iter iterates over the message, but it doesn't revert the updates made by the iov iteration. This patch fixes it. Now, the iov can be used after calling zerocopy_from_iter. Fixes: 3c4d75591 ("tls: kernel TLS support") Signed-off-by: Boris Pismenny --- net/tls/tls_sw.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 2a6ba0f..37ac220 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -318,6 +318,7 @@ static int zerocopy_from_iter(struct sock *sk, struct iov_iter *from, out: *size_used = size; *pages_used = num_elem; + iov_iter_revert(from, size); return rc; } -- 1.8.3.1
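The semantics of the one-line fix can be shown with a toy cursor model. This is a sketch only: struct iter is a hypothetical stand-in for iov_iter, and iter_revert plays the role of iov_iter_revert().

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an iov_iter-style cursor over a buffer. */
struct iter { size_t pos, len; };

/* Advance the cursor by up to n bytes, as mapping pages for zerocopy
 * does to the real iterator. */
static size_t iter_consume(struct iter *it, size_t n)
{
	if (n > it->len - it->pos)
		n = it->len - it->pos;
	it->pos += n;
	return n;
}

/* Equivalent of iov_iter_revert(): walk the cursor back so the caller
 * sees the iterator in its pre-call state, which is what the added
 * iov_iter_revert(from, size) guarantees after zerocopy_from_iter(). */
static void iter_revert(struct iter *it, size_t n)
{
	it->pos -= n;
}
```

Without the revert, a caller that needs to fall back to copying (or retry) would resume from the wrong position in the user iov.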
[PATCH v3 net-next 02/19] net: Add TLS RX offload feature
From: Ilya Lesokhin This patch adds a netdev feature to configure TLS RX inline crypto offload. Signed-off-by: Ilya Lesokhin Signed-off-by: Boris Pismenny --- include/linux/netdev_features.h | 2 ++ net/core/ethtool.c | 1 + 2 files changed, 3 insertions(+) diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index 623bb8c..2b2a6dc 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -79,6 +79,7 @@ enum { NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */ NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */ NETIF_F_HW_TLS_TX_BIT, /* Hardware TLS TX offload */ + NETIF_F_HW_TLS_RX_BIT, /* Hardware TLS RX offload */ NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */ NETIF_F_HW_TLS_RECORD_BIT, /* Offload TLS record */ @@ -151,6 +152,7 @@ enum { #define NETIF_F_HW_TLS_RECORD __NETIF_F(HW_TLS_RECORD) #define NETIF_F_GSO_UDP_L4 __NETIF_F(GSO_UDP_L4) #define NETIF_F_HW_TLS_TX __NETIF_F(HW_TLS_TX) +#define NETIF_F_HW_TLS_RX __NETIF_F(HW_TLS_RX) #define for_each_netdev_feature(mask_addr, bit)\ for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT) diff --git a/net/core/ethtool.c b/net/core/ethtool.c index e677a20..c9993c6 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -111,6 +111,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info) [NETIF_F_RX_UDP_TUNNEL_PORT_BIT] = "rx-udp_tunnel-port-offload", [NETIF_F_HW_TLS_RECORD_BIT] = "tls-hw-record", [NETIF_F_HW_TLS_TX_BIT] ="tls-hw-tx-offload", + [NETIF_F_HW_TLS_RX_BIT] ="tls-hw-rx-offload", }; static const char -- 1.8.3.1
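The feature added here is one bit in the netdev feature mask, paired with an ethtool string. A sketch of the bit-number/flag-mask scheme, with hypothetical F_* names mirroring (but not identical to) the kernel's NETIF_F_*/__NETIF_F machinery:

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers, as in enum netdev_features; order defines the bit. */
enum {
	F_HW_TLS_TX_BIT,     /* Hardware TLS TX offload */
	F_HW_TLS_RX_BIT,     /* Hardware TLS RX offload (the new bit) */
	F_FEATURE_COUNT,
};

/* Turn a bit number into a flag mask, like the kernel's __NETIF_F(). */
#define F_BIT(name) ((uint64_t)1 << F_##name##_BIT)
#define F_HW_TLS_TX F_BIT(HW_TLS_TX)
#define F_HW_TLS_RX F_BIT(HW_TLS_RX)
```

A driver advertises the capability by OR-ing the flag into its feature mask; ethtool toggles it through the "tls-hw-rx-offload" string registered in the second hunk.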
[PATCH v3 net-next 00/19] TLS offload rx, netdev & mlx5
Hi, The following series provides TLS RX inline crypto offload.

v2->v3:
- Fix typo
- Adjust cover letter
- Fix bug in zero copy flows
- Use network byte order for the record number in resync
- Adjust the sequence provided in resync

v1->v2:
- Fix bisectability problems due to variable name changes
- Fix potential uninitialized return value

This series completes the generic infrastructure to offload TLS crypto to network devices. It enables the kernel TLS socket to skip decryption and authentication operations for SKBs marked as decrypted on the receive side of the data path, leaving those computationally expensive operations to the NIC. This infrastructure doesn't require a TCP offload engine. Instead, the NIC decrypts a packet's payload if the packet contains the expected TCP sequence number. The TLS record authentication tag remains unmodified regardless of decryption. If the packet is decrypted successfully and it contains an authentication tag, then the authentication check has passed. Otherwise, if the authentication fails, then the packet is provided unmodified and the KTLS layer is responsible for handling it. Out-Of-Order TCP packets are provided unmodified. As a result, in the slow path some of the SKBs are decrypted while others remain as ciphertext. The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs. In the worst case, a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be re-encrypted, only to be decrypted. The notable differences between SW KTLS and NIC-offloaded TLS implementations are as follows:
1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering.
2. Resynchronization - tls_read_size calls the device driver to resynchronize HW whenever it lost track of the TLS record framing in the TCP stream.
The infrastructure should be extendable to support various NIC offload implementations. However it is currently written with the implementation below in mind: The NIC identifies packets that should be offloaded according to the 5-tuple and the TCP sequence number. If these match and the packet is decrypted and authenticated successfully, then a syndrome is provided to software. Otherwise, the packet is unmodified. Decrypted and non-decrypted packets aren't coalesced by the network stack, and the KTLS layer decrypts and authenticates partially decrypted records. The NIC provides an indication whenever a resync is required. The resync operation is triggered by the KTLS layer while parsing TLS record headers. Finally, we measure the performance obtained by running single stream iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound) and KTLS-Offload running both in Tx and Rx. The results show that the performance of offload is comparable to TCP. 
                  | Bandwidth (Gbps) | CPU Tx (%) | CPU Rx (%)
TCP               |             28.8 |          5 |         12
KTLS-Offload-Tx-Rx|             28.6 |          7 |         14

Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf

Boris Pismenny (18):
  net: Add decrypted field to skb
  net: Add TLS rx resync NDO
  tcp: Don't coalesce decrypted and encrypted SKBs
  tls: Refactor tls_offload variable names
  tls: Split decrypt_skb to two functions
  tls: Split tls_sw_release_resources_rx
  tls: Fill software context without allocation
  tls: Add rx inline crypto offload
  tls: Fix zerocopy_from_iter iov handling
  net/mlx5e: TLS, refactor variable names
  net/mlx5: Accel, add TLS rx offload routines
  net/mlx5e: TLS, add innova rx support
  net/mlx5e: TLS, add Innova TLS rx data path
  net/mlx5e: TLS, add software statistics
  net/mlx5e: TLS, build TLS netdev from capabilities
  net/mlx5: Accel, add common metadata functions
  net/mlx5e: IPsec, fix byte count in CQE
  net/mlx5e: Kconfig, mutually exclude compilation of TLS and IPsec accel

Ilya Lesokhin (1):
  net: Add TLS RX offload feature

 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |   1 +
 .../net/ethernet/mellanox/mlx5/core/accel/accel.h  |  37 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  23 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  26 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c       |  20 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.h       |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |  69 +++--
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  33 ++-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 117 +++-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 113 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  18 +-
 include/linux/mlx5/mlx5_ifc
[PATCH v3 net-next 15/19] net/mlx5e: TLS, add software statistics
This patch adds software statistics for TLS to count important events. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 3 +++ drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 4 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c | 11 ++- 3 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index 68368c9..541e6f4 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -169,7 +169,10 @@ static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk, rx_ctx = mlx5e_get_tls_rx_context(tls_ctx); + netdev_info(netdev, "resyncing seq %d rcd %lld\n", seq, + be64_to_cpu(rcd_sn)); mlx5_accel_tls_resync_rx(priv->mdev, rx_ctx->handle, seq, rcd_sn); + atomic64_inc(&priv->tls->sw_stats.rx_tls_resync_reply); } static const struct tlsdev_ops mlx5e_tls_ops = { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index 2d40ede..3f5d721 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -43,6 +43,10 @@ struct mlx5e_tls_sw_stats { atomic64_t tx_tls_drop_resync_alloc; atomic64_t tx_tls_drop_no_sync_data; atomic64_t tx_tls_drop_bypass_required; + atomic64_t rx_tls_drop_resync_request; + atomic64_t rx_tls_resync_request; + atomic64_t rx_tls_resync_reply; + atomic64_t rx_tls_auth_fail; }; struct mlx5e_tls { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index d460fda..ecfc764 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -330,8 +330,12 @@ static int tls_update_resync_sn(struct net_device *netdev, 
netdev->ifindex, 0); #endif } - if (!sk || sk->sk_state == TCP_TIME_WAIT) + if (!sk || sk->sk_state == TCP_TIME_WAIT) { + struct mlx5e_priv *priv = netdev_priv(netdev); + + atomic64_inc(&priv->tls->sw_stats.rx_tls_drop_resync_request); goto out; + } skb->sk = sk; skb->destructor = sock_edemux; @@ -349,6 +353,7 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, struct ethhdr *old_eth; struct ethhdr *new_eth; __be16 *ethtype; + struct mlx5e_priv *priv; /* Detect inline metadata */ if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) @@ -365,9 +370,13 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, break; case SYNDROM_RESYNC_REQUEST: tls_update_resync_sn(netdev, skb, mdata); + priv = netdev_priv(netdev); + atomic64_inc(&priv->tls->sw_stats.rx_tls_resync_request); break; case SYNDROM_AUTH_FAILED: /* Authentication failure will be observed and verified by kTLS */ + priv = netdev_priv(netdev); + atomic64_inc(&priv->tls->sw_stats.rx_tls_auth_fail); break; default: /* Bypass the metadata header to others */ -- 1.8.3.1
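The counters added above are plain lock-free event counters: the kernel uses atomic64_t with atomic64_inc(). A userspace sketch of the same idea using C11 atomics (the struct below is an abbreviated stand-in for mlx5e_tls_sw_stats):

```c
#include <assert.h>
#include <stdatomic.h>

/* Abbreviated stand-in for mlx5e_tls_sw_stats. */
struct tls_sw_stats {
	atomic_ullong rx_tls_resync_request;
	atomic_ullong rx_tls_auth_fail;
};

/* Count an event; safe from concurrent contexts with no lock, which is
 * why the data path can bump these counters directly. */
static void count_resync_request(struct tls_sw_stats *s)
{
	atomic_fetch_add(&s->rx_tls_resync_request, 1);
}
```

Because each counter is an independent atomic, readers (e.g. an ethtool stats dump) may see a momentarily inconsistent set of counters, which is acceptable for statistics.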
[PATCH v3 net-next 13/19] net/mlx5e: TLS, add innova rx support
Add the mlx5 implementation of the TLS Rx routines to add/del TLS contexts, also add the tls_dev_resync_rx routine to work with the TLS inline Rx crypto offload infrastructure. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 46 +++--- .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 15 +++ 2 files changed, 46 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index 7fb9c75..68368c9 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -110,9 +110,7 @@ static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk, u32 caps = mlx5_accel_tls_device_caps(mdev); int ret = -ENOMEM; void *flow; - - if (direction != TLS_OFFLOAD_CTX_DIR_TX) - return -EINVAL; + u32 swid; flow = kzalloc(MLX5_ST_SZ_BYTES(tls_flow), GFP_KERNEL); if (!flow) @@ -122,18 +120,23 @@ static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk, if (ret) goto free_flow; + ret = mlx5_accel_tls_add_flow(mdev, flow, crypto_info, + start_offload_tcp_sn, &swid, + direction == TLS_OFFLOAD_CTX_DIR_TX); + if (ret < 0) + goto free_flow; + if (direction == TLS_OFFLOAD_CTX_DIR_TX) { struct mlx5e_tls_offload_context_tx *tx_ctx = mlx5e_get_tls_tx_context(tls_ctx); - u32 swid; - - ret = mlx5_accel_tls_add_tx_flow(mdev, flow, crypto_info, -start_offload_tcp_sn, &swid); - if (ret < 0) - goto free_flow; tx_ctx->swid = htonl(swid); tx_ctx->expected_seq = start_offload_tcp_sn; + } else { + struct mlx5e_tls_offload_context_rx *rx_ctx = + mlx5e_get_tls_rx_context(tls_ctx); + + rx_ctx->handle = htonl(swid); } return 0; @@ -147,19 +150,32 @@ static void mlx5e_tls_del(struct net_device *netdev, enum tls_offload_ctx_dir direction) { struct mlx5e_priv *priv = netdev_priv(netdev); + unsigned int handle; - if (direction == TLS_OFFLOAD_CTX_DIR_TX) { - u32 swid = 
ntohl(mlx5e_get_tls_tx_context(tls_ctx)->swid); + handle = ntohl((direction == TLS_OFFLOAD_CTX_DIR_TX) ? + mlx5e_get_tls_tx_context(tls_ctx)->swid : + mlx5e_get_tls_rx_context(tls_ctx)->handle); - mlx5_accel_tls_del_tx_flow(priv->mdev, swid); - } else { - netdev_err(netdev, "unsupported direction %d\n", direction); - } + mlx5_accel_tls_del_flow(priv->mdev, handle, + direction == TLS_OFFLOAD_CTX_DIR_TX); +} + +static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk, + u32 seq, u64 rcd_sn) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct mlx5e_priv *priv = netdev_priv(netdev); + struct mlx5e_tls_offload_context_rx *rx_ctx; + + rx_ctx = mlx5e_get_tls_rx_context(tls_ctx); + + mlx5_accel_tls_resync_rx(priv->mdev, rx_ctx->handle, seq, rcd_sn); } static const struct tlsdev_ops mlx5e_tls_ops = { .tls_dev_add = mlx5e_tls_add, .tls_dev_del = mlx5e_tls_del, + .tls_dev_resync_rx = mlx5e_tls_resync_rx, }; void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index e26222a..2d40ede 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -65,6 +65,21 @@ struct mlx5e_tls_offload_context_tx { base); } +struct mlx5e_tls_offload_context_rx { + struct tls_offload_context_rx base; + __be32 handle; +}; + +static inline struct mlx5e_tls_offload_context_rx * +mlx5e_get_tls_rx_context(struct tls_context *tls_ctx) +{ + BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context_rx) > +TLS_OFFLOAD_CONTEXT_SIZE_RX); + return container_of(tls_offload_ctx_rx(tls_ctx), + struct mlx5e_tls_offload_context_rx, + base); +} + void mlx5e_tls_build_netdev(struct mlx5e_priv *priv); int mlx5e_tls_init(struct mlx5e_priv *priv); void mlx5e_tls_cleanup(struct mlx5e_priv *priv); -- 1.8.3.1
[PATCH v3 net-next 09/19] tls: Add rx inline crypto offload
This patch completes the generic infrastructure to offload TLS crypto to a network device. It enables the kernel to skip decryption and authentication of some skbs marked as decrypted by the NIC. In the fast path, all packets received are decrypted by the NIC and the performance is comparable to plain TCP. This infrastructure doesn't require a TCP offload engine. Instead, the NIC only decrypts packets that contain the expected TCP sequence number. Out-Of-Order TCP packets are provided unmodified. As a result, in the worst case a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be re-encrypted, only to be decrypted. The notable differences between SW KTLS Rx and this offload are as follows:
1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering.
2. Resynchronization - tls_read_size calls the device driver to resynchronize HW after HW lost track of TLS record framing in the TCP stream.
Signed-off-by: Boris Pismenny --- include/net/tls.h | 63 +- net/tls/tls_device.c | 278 ++ net/tls/tls_device_fallback.c | 1 + net/tls/tls_main.c| 32 +++-- net/tls/tls_sw.c | 24 +++- 5 files changed, 355 insertions(+), 43 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 7a485de..d8b3b65 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -83,6 +83,16 @@ struct tls_device { void (*unhash)(struct tls_device *device, struct sock *sk); }; +enum { + TLS_BASE, + TLS_SW, +#ifdef CONFIG_TLS_DEVICE + TLS_HW, +#endif + TLS_HW_RECORD, + TLS_NUM_CONFIG, +}; + struct tls_sw_context_tx { struct crypto_aead *aead_send; struct crypto_wait async_wait; @@ -197,6 +207,7 @@ struct tls_context { int (*push_pending_record)(struct sock *sk, int flags); void (*sk_write_space)(struct sock *sk); + void (*sk_destruct)(struct sock *sk); void (*sk_proto_close)(struct sock *sk, long timeout); int (*setsockopt)(struct sock *sk, int level, @@ -209,13 +220,27 @@ struct tls_context { void (*unhash)(struct sock *sk); }; +struct tls_offload_context_rx { + /* sw must be the first member of tls_offload_context_rx */ + struct tls_sw_context_rx sw; + atomic64_t resync_req; + u8 driver_state[]; + /* The TLS layer reserves room for driver specific state +* Currently the belief is that there is not enough +* driver specific state to justify another layer of indirection +*/ +}; + +#define TLS_OFFLOAD_CONTEXT_SIZE_RX\ + (ALIGN(sizeof(struct tls_offload_context_rx), sizeof(void *)) + \ +TLS_DRIVER_STATE_SIZE) + int wait_on_pending_writer(struct sock *sk, long *timeo); int tls_sk_query(struct sock *sk, int optname, char __user *optval, int __user *optlen); int tls_sk_attach(struct sock *sk, int optname, char __user *optval, unsigned int optlen); - int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx); int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size); int tls_sw_sendpage(struct sock *sk, struct page *page, @@ -290,11 +315,19 @@ static inline bool 
tls_is_pending_open_record(struct tls_context *tls_ctx) { return tls_ctx->pending_open_record_frags; } +struct sk_buff * +tls_validate_xmit_skb(struct sock *sk, struct net_device *dev, + struct sk_buff *skb); + static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk) { - return sk_fullsock(sk) && - /* matches smp_store_release in tls_set_device_offload */ - smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct; +#ifdef CONFIG_SOCK_VALIDATE_XMIT + return sk_fullsock(sk) && + (smp_load_acquire(&sk->sk_validate_xmit_skb) == + &tls_validate_xmit_skb); +#else + return false; +#endif } static inline void tls_err_abort(struct sock *sk, int err) @@ -387,10 +420,27 @@ static inline struct tls_sw_context_tx *tls_sw_ctx_tx( return (struct tls_offload_context_tx *)tls_ctx->priv_ctx_tx; } +static inline struct tls_offload_context_rx * +tls_offload_ctx_rx(const struct tls_context *tls_ctx) +{ + return (struct tls_offload_context_rx *)tls_ctx->priv_ctx_rx; +} + +/* The TLS context is valid until sk_destruct is called */ +static inline void tls_offload_rx_resync_request(struct sock *sk, __be32 seq) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_rx *rx_ctx = tls_offload_ctx_rx(tls_ctx); + + atomic64_set(&rx_ctx->resync_req, ((((uint64_t)ntohl(seq)) << 32) | 1)); +} + + int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, unsigned char *record_type); void tls_r
[PATCH v3 net-next 12/19] net/mlx5: Accel, add TLS rx offload routines
In Innova TLS, TLS contexts are added or deleted via a command message over the SBU connection. The HW then sends a response message over the same connection. Complete the implementation for Innova TLS (FPGA-based) hardware by adding support for rx inline crypto offload. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- .../net/ethernet/mellanox/mlx5/core/accel/tls.c| 23 +++-- .../net/ethernet/mellanox/mlx5/core/accel/tls.h| 26 +++-- drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 113 - drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h | 18 ++-- include/linux/mlx5/mlx5_ifc_fpga.h | 1 + 5 files changed, 135 insertions(+), 46 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c index 77ac19f..da7bd26 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c @@ -37,17 +37,26 @@ #include "mlx5_core.h" #include "fpga/tls.h" -int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, - struct tls_crypto_info *crypto_info, - u32 start_offload_tcp_sn, u32 *p_swid) +int mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow, + struct tls_crypto_info *crypto_info, + u32 start_offload_tcp_sn, u32 *p_swid, + bool direction_sx) { - return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info, -start_offload_tcp_sn, p_swid); + return mlx5_fpga_tls_add_flow(mdev, flow, crypto_info, + start_offload_tcp_sn, p_swid, + direction_sx); } -void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) +void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid, +bool direction_sx) { - mlx5_fpga_tls_del_tx_flow(mdev, swid, GFP_KERNEL); + mlx5_fpga_tls_del_flow(mdev, swid, GFP_KERNEL, direction_sx); +} + +int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, u32 seq, +u64 rcd_sn) +{ + return mlx5_fpga_tls_resync_rx(mdev, handle, seq, rcd_sn); } bool mlx5_accel_is_tls_device(struct 
mlx5_core_dev *mdev) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h index 6f9c9f4..2228c10 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h @@ -60,10 +60,14 @@ struct mlx5_ifc_tls_flow_bits { u8 reserved_at_2[0x1e]; }; -int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, - struct tls_crypto_info *crypto_info, - u32 start_offload_tcp_sn, u32 *p_swid); -void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid); +int mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow, + struct tls_crypto_info *crypto_info, + u32 start_offload_tcp_sn, u32 *p_swid, + bool direction_sx); +void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid, +bool direction_sx); +int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, u32 seq, +u64 rcd_sn); bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev); u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev); int mlx5_accel_tls_init(struct mlx5_core_dev *mdev); @@ -71,11 +75,15 @@ int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, #else -static inline int -mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, - struct tls_crypto_info *crypto_info, - u32 start_offload_tcp_sn, u32 *p_swid) { return 0; } -static inline void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) { } +static inline int +mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow, + struct tls_crypto_info *crypto_info, + u32 start_offload_tcp_sn, u32 *p_swid, + bool direction_sx) { return -ENOTSUPP; } +static inline void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid, + bool direction_sx) { } +static inline int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, + u32 seq, u64 rcd_sn) { return 0; } static inline bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev) { return false; } 
static inline u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev) { return 0; } static inline int mlx5_accel_tls_init(struct mlx5_core_dev *mdev) { return 0; } d
KASAN: use-after-free Read in p9_fd_poll
Hello, syzbot found the following crash on: HEAD commit:30c2c32d7f70 Merge tag 'drm-fixes-2018-07-10' of git://ano.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=1662c5b240 kernel config: https://syzkaller.appspot.com/x/.config?x=25856fac4e580aa7 dashboard link: https://syzkaller.appspot.com/bug?extid=0442e6e2f7e1e33b1037 compiler: gcc (GCC) 8.0.1 20180413 (experimental) Unfortunately, I don't have any reproducer for this crash yet. IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+0442e6e2f7e1e33b1...@syzkaller.appspotmail.com 9pnet: p9_errstr2errno: server reported unknown error etz0e&��?�d$5ܱI3� QAT: Invalid ioctl == BUG: KASAN: use-after-free in p9_fd_poll+0x280/0x2b0 net/9p/trans_fd.c:238 Read of size 8 at addr 8801c647ec80 by task kworker/1:3/5005 CPU: 1 PID: 5005 Comm: kworker/1:3 Not tainted 4.18.0-rc4+ #140 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events p9_poll_workfn Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 p9_fd_poll+0x280/0x2b0 net/9p/trans_fd.c:238 p9_poll_mux net/9p/trans_fd.c:617 [inline] p9_poll_workfn+0x463/0x6d0 net/9p/trans_fd.c:1107 process_one_work+0xc73/0x1ba0 kernel/workqueue.c:2153 worker_thread+0x189/0x13c0 kernel/workqueue.c:2296 kthread+0x345/0x410 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 Allocated by task 29121: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620 kmalloc include/linux/slab.h:513 [inline] kzalloc include/linux/slab.h:707 [inline] p9_fd_open 
net/9p/trans_fd.c:796 [inline] p9_fd_create+0x1a7/0x3f0 net/9p/trans_fd.c:1036 p9_client_create+0x915/0x16c9 net/9p/client.c:1062 v9fs_session_init+0x21a/0x1a80 fs/9p/v9fs.c:400 v9fs_mount+0x7c/0x900 fs/9p/vfs_super.c:135 mount_fs+0xae/0x328 fs/super.c:1277 vfs_kern_mount.part.34+0xdc/0x4e0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:1027 [inline] do_new_mount fs/namespace.c:2518 [inline] do_mount+0x581/0x30e0 fs/namespace.c:2848 ksys_mount+0x12d/0x140 fs/namespace.c:3064 __do_sys_mount fs/namespace.c:3078 [inline] __se_sys_mount fs/namespace.c:3075 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 29121: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xd9/0x260 mm/slab.c:3813 p9_fd_close+0x416/0x5b0 net/9p/trans_fd.c:893 p9_client_create+0xac2/0x16c9 net/9p/client.c:1076 v9fs_session_init+0x21a/0x1a80 fs/9p/v9fs.c:400 v9fs_mount+0x7c/0x900 fs/9p/vfs_super.c:135 mount_fs+0xae/0x328 fs/super.c:1277 vfs_kern_mount.part.34+0xdc/0x4e0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:1027 [inline] do_new_mount fs/namespace.c:2518 [inline] do_mount+0x581/0x30e0 fs/namespace.c:2848 ksys_mount+0x12d/0x140 fs/namespace.c:3064 __do_sys_mount fs/namespace.c:3078 [inline] __se_sys_mount fs/namespace.c:3075 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at 8801c647ec80 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 0 bytes inside of 512-byte region [8801c647ec80, 8801c647ee80) The buggy address belongs to the page: page:ea0007191f80 count:1 mapcount:0 mapping:8801da800940 index:0x0 flags: 0x2fffc000100(slab) raw: 
02fffc000100 ea0006a8cc48 ea00074be548 8801da800940 raw: 8801c647e000 00010006 page dumped because: kasan: bad access detected Memory state around the buggy address: 8801c647eb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 8801c647ec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 8801c647ec80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ 8801c647ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 8801c647ed80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb == --- This bug is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegr
[PATCH v3 net-next 04/19] tcp: Don't coalesce decrypted and encrypted SKBs
Prevent coalescing of decrypted and encrypted SKBs in GRO and at the TCP layer. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- net/ipv4/tcp_input.c | 12 net/ipv4/tcp_offload.c | 3 +++ 2 files changed, 15 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 814ea43..f89d86a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4343,6 +4343,11 @@ static bool tcp_try_coalesce(struct sock *sk, if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq) return false; +#ifdef CONFIG_TLS_DEVICE + if (from->decrypted != to->decrypted) + return false; +#endif + if (!skb_try_coalesce(to, from, fragstolen, &delta)) return false; @@ -4872,6 +4877,9 @@ void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb) break; memcpy(nskb->cb, skb->cb, sizeof(skb->cb)); +#ifdef CONFIG_TLS_DEVICE + nskb->decrypted = skb->decrypted; +#endif TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start; if (list) __skb_queue_before(list, skb, nskb); @@ -4899,6 +4907,10 @@ void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb) skb == tail || (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) goto end; +#ifdef CONFIG_TLS_DEVICE + if (skb->decrypted != nskb->decrypted) + goto end; +#endif } } } diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index f5aee64..870b0a3 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -262,6 +262,9 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb) flush |= (len - 1) >= mss; flush |= (ntohl(th2->seq) + skb_gro_len(p)) ^ ntohl(th->seq); +#ifdef CONFIG_TLS_DEVICE + flush |= p->decrypted ^ skb->decrypted; +#endif if (flush || skb_gro_receive(p, skb)) { mss = 1; -- 1.8.3.1
[PATCH net-next 4/5 v3] net: gemini: Move main init to port
The initialization sequence for the ethernet, setting up interrupt routing and such things, needs to be done after both the ports are clocked and reset. Before this, the config will not "take". Move the initialization to the port probe function and keep track of init status in the state. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - No changes, just resending with the rest. ChangeLog v1->v2: - No changes, just resending with the rest. --- drivers/net/ethernet/cortina/gemini.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 2457a1239d69..0f1d26441177 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -151,6 +151,7 @@ struct gemini_ethernet { void __iomem *base; struct gemini_ethernet_port *port0; struct gemini_ethernet_port *port1; + bool initialized; spinlock_t irq_lock; /* Locks IRQ-related registers */ unsigned int freeq_order; @@ -2303,6 +2304,14 @@ static void gemini_port_remove(struct gemini_ethernet_port *port) static void gemini_ethernet_init(struct gemini_ethernet *geth) { + /* Only do this once both ports are online */ + if (geth->initialized) + return; + if (geth->port0 && geth->port1) + geth->initialized = true; + else + return; + writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_0_REG); writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_1_REG); writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_2_REG); @@ -2450,6 +2459,10 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev) geth->port0 = port; else geth->port1 = port; + + /* This will just be done once both ports are up and reset */ + gemini_ethernet_init(geth); + platform_set_drvdata(pdev, port); /* Set up and register the netdev */ @@ -2567,7 +2580,6 @@ static int gemini_ethernet_probe(struct platform_device *pdev) spin_lock_init(&geth->irq_lock); spin_lock_init(&geth->freeq_lock); - gemini_ethernet_init(geth); /* The children will use 
this */ platform_set_drvdata(pdev, geth); @@ -2580,8 +2592,8 @@ static int gemini_ethernet_remove(struct platform_device *pdev) { struct gemini_ethernet *geth = platform_get_drvdata(pdev); - gemini_ethernet_init(geth); geth_cleanup_freeq(geth); + geth->initialized = false; return 0; } -- 2.17.1
[PATCH net-next 2/5 v3] net: gemini: Improve connection prints
Switch over to using a module parameter and debug prints that can be controlled by this or ethtool like everyone else. Demote all other prints to debug messages. The phy_print_status() was already in place, albeit never really used because the debuglevel hiding it had to be set up using ethtool. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - Use phy_attached_info() like all other drivers. - Put it in an if (netif_msg_link()) clause like the other message from phy_print_status(). - Explain more in the commit message. ChangeLog v1->v2: - Use a module parameter and the message levels like all other drivers and stop trying to be special. --- drivers/net/ethernet/cortina/gemini.c | 46 +++ 1 file changed, 26 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 8fc31723f700..f0ab6426daca 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -46,6 +46,11 @@ #define DRV_NAME "gmac-gemini" #define DRV_VERSION "1.0" +#define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK) +static int debug = -1; +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + #define HSIZE_8 0x00 #define HSIZE_16 0x01 #define HSIZE_32 0x02 @@ -300,23 +305,26 @@ static void gmac_speed_set(struct net_device *netdev) status.bits.speed = GMAC_SPEED_1000; if (phydev->interface == PHY_INTERFACE_MODE_RGMII) status.bits.mii_rmii = GMAC_PHY_RGMII_1000; - netdev_info(netdev, "connect to RGMII @ 1Gbit\n"); + netdev_dbg(netdev, "connect %s to RGMII @ 1Gbit\n", + phydev_name(phydev)); break; case 100: status.bits.speed = GMAC_SPEED_100; if (phydev->interface == PHY_INTERFACE_MODE_RGMII) status.bits.mii_rmii = GMAC_PHY_RGMII_100_10; - netdev_info(netdev, "connect to RGMII @ 100 Mbit\n"); + netdev_dbg(netdev, "connect %s to RGMII @ 100 Mbit\n", + phydev_name(phydev)); break; case 10: status.bits.speed = GMAC_SPEED_10; if 
(phydev->interface == PHY_INTERFACE_MODE_RGMII) status.bits.mii_rmii = GMAC_PHY_RGMII_100_10; - netdev_info(netdev, "connect to RGMII @ 10 Mbit\n"); + netdev_dbg(netdev, "connect %s to RGMII @ 10 Mbit\n", + phydev_name(phydev)); break; default: - netdev_warn(netdev, "Not supported PHY speed (%d)\n", - phydev->speed); + netdev_warn(netdev, "Unsupported PHY speed (%d) on %s\n", + phydev->speed, phydev_name(phydev)); } if (phydev->duplex == DUPLEX_FULL) { @@ -363,12 +371,6 @@ static int gmac_setup_phy(struct net_device *netdev) return -ENODEV; netdev->phydev = phy; - netdev_info(netdev, "connected to PHY \"%s\"\n", - phydev_name(phy)); - phy_attached_print(phy, "phy_id=0x%.8lx, phy_mode=%s\n", - (unsigned long)phy->phy_id, - phy_modes(phy->interface)); - phy->supported &= PHY_GBIT_FEATURES; phy->supported |= SUPPORTED_Asym_Pause | SUPPORTED_Pause; phy->advertising = phy->supported; @@ -376,19 +378,19 @@ static int gmac_setup_phy(struct net_device *netdev) /* set PHY interface type */ switch (phy->interface) { case PHY_INTERFACE_MODE_MII: - netdev_info(netdev, "set GMAC0 to GMII mode, GMAC1 disabled\n"); + netdev_dbg(netdev, + "MII: set GMAC0 to GMII mode, GMAC1 disabled\n"); status.bits.mii_rmii = GMAC_PHY_MII; - netdev_info(netdev, "connect to MII\n"); break; case PHY_INTERFACE_MODE_GMII: - netdev_info(netdev, "set GMAC0 to GMII mode, GMAC1 disabled\n"); + netdev_dbg(netdev, + "GMII: set GMAC0 to GMII mode, GMAC1 disabled\n"); status.bits.mii_rmii = GMAC_PHY_GMII; - netdev_info(netdev, "connect to GMII\n"); break; case PHY_INTERFACE_MODE_RGMII: - dev_info(dev, "set GMAC0 and GMAC1 to MII/RGMII mode\n"); + netdev_dbg(netdev, + "RGMII: set GMAC0 and GMAC1 to MII/RGMII mode\n"); status.bits.mii_rmii = GMAC_PHY_RGMII_100_10; - netdev_info(netdev, "connect to RGMII\n"); break; default: netdev_err(netdev, "Unsupported MII interface\n"); @@ -398,6 +400,9 @@ static int gmac_setup_ph
[PATCH net-next 3/5 v3] net: gemini: Allow multiple ports to instantiate
The code was not tested with two ports actually in use at the same time. (I blame this on the lack of actual hardware using that feature.) Now, after locating a system using both ports, add the necessary fix to make both ports come up. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - No changes, just resending with the rest. ChangeLog v1->v2: - No changes, just resending with the rest. --- drivers/net/ethernet/cortina/gemini.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index f0ab6426daca..2457a1239d69 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -1789,7 +1789,10 @@ static int gmac_open(struct net_device *netdev) phy_start(netdev->phydev); err = geth_resize_freeq(port); - if (err) { + /* It's fine if it's just busy, the other port has set up +* the freeq in that case. +*/ + if (err && (err != -EBUSY)) { netdev_err(netdev, "could not resize freeq\n"); goto err_stop_phy; } -- 2.17.1
[PATCH net-next 5/5 v3] net: gemini: Indicate that we can handle jumboframes
The hardware supposedly handles frames up to 10236 bytes and implements .ndo_change_mtu(), so accept 10236 minus the ethernet header for a VLAN-tagged frame on the netdevices. Use ETH_MIN_MTU as minimum MTU. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - No changes, just resending with the rest. ChangeLog v1->v2: - Change the min MTU from 256 (vendor code) to ETH_MIN_MTU which makes more sense. --- drivers/net/ethernet/cortina/gemini.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 0f1d26441177..22f495b490d4 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -2476,6 +2476,11 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev) netdev->hw_features = GMAC_OFFLOAD_FEATURES; netdev->features |= GMAC_OFFLOAD_FEATURES | NETIF_F_GRO; + /* We can handle jumbo frames up to 10236 bytes, so let's accept +* payloads of 10236 bytes minus VLAN and ethernet header +*/ + netdev->min_mtu = ETH_MIN_MTU; + netdev->max_mtu = 10236 - VLAN_ETH_HLEN; port->freeq_refill = 0; netif_napi_add(netdev, &port->napi, gmac_napi_poll, -- 2.17.1
[PATCH net-next 1/5 v3] net: gemini: Look up L3 maxlen from table
The code to calculate the hardware register enumerator for the maximum L3 length isn't entirely simple to read. Use the existing defines and rewrite the function into a table look-up. Acked-by: Michał Mirosław Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - Collected Michał's ACK. ChangeLog v1->v2: - No changes, just resending with the rest. --- drivers/net/ethernet/cortina/gemini.c | 61 --- 1 file changed, 46 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 6d7404f66f84..8fc31723f700 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -401,26 +401,57 @@ static int gmac_setup_phy(struct net_device *netdev) return 0; } -static int gmac_pick_rx_max_len(int max_l3_len) -{ - /* index = CONFIG_MAXLEN_XXX values */ - static const int max_len[8] = { - 1536, 1518, 1522, 1542, - 9212, 10236, 1518, 1518 - }; - int i, n = 5; +/* The maximum frame length is not logically enumerated in the + * hardware, so we do a table lookup to find the applicable max + * frame length. 
+ */ +struct gmac_max_framelen { + unsigned int max_l3_len; + u8 val; +}; - max_l3_len += ETH_HLEN + VLAN_HLEN; +static const struct gmac_max_framelen gmac_maxlens[] = { + { + .max_l3_len = 1518, + .val = CONFIG0_MAXLEN_1518, + }, + { + .max_l3_len = 1522, + .val = CONFIG0_MAXLEN_1522, + }, + { + .max_l3_len = 1536, + .val = CONFIG0_MAXLEN_1536, + }, + { + .max_l3_len = 1542, + .val = CONFIG0_MAXLEN_1542, + }, + { + .max_l3_len = 9212, + .val = CONFIG0_MAXLEN_9k, + }, + { + .max_l3_len = 10236, + .val = CONFIG0_MAXLEN_10k, + }, +}; + +static int gmac_pick_rx_max_len(unsigned int max_l3_len) +{ + const struct gmac_max_framelen *maxlen; + int maxtot; + int i; - if (max_l3_len > max_len[n]) - return -1; + maxtot = max_l3_len + ETH_HLEN + VLAN_HLEN; - for (i = 0; i < 5; i++) { - if (max_len[i] >= max_l3_len && max_len[i] < max_len[n]) - n = i; + for (i = 0; i < ARRAY_SIZE(gmac_maxlens); i++) { + maxlen = &gmac_maxlens[i]; + if (maxtot <= maxlen->max_l3_len) + return maxlen->val; } - return n; + return -1; } static int gmac_init(struct net_device *netdev) -- 2.17.1
Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote: > On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner > wrote: > > > > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote: > > >Changing TC_PRIO_MAX from 15 to 63 risks breaking backward > > > compatibility > > > with applications. > > > > If done, it needs to be done carefully, indeed. I don't know if it's > > doable, neither I know how hard is your requirement for 64 different > > priorities. > > struct tc_prio_qopt { > int bands; /* Number of bands */ > __u8 priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band > */ > }; > > How would you do it carefully? quick shot, multiplex v1 and v2 formats based on bands and sizeof(): #define TCQ_PRIO_BANDS_V1 16 #define TCQ_PRIO_BANDS_V2 64 #define TC_PRIO_MAX_V2 63 struct tc_prio_qopt_v2 { int bands; /* Number of bands */ __u8 priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */ }; static int prio_tune(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct prio_sched_data *q = qdisc_priv(sch); struct Qdisc *queues[TCQ_PRIO_BANDS_V2]; int oldbands = q->bands, i; struct tc_prio_qopt_v2 *qopt; if (nla_len(opt) < sizeof(int)) return -EINVAL; qopt = nla_data(opt); if (qopt->bands <= TCQ_PRIO_BANDS_V1 && nla_len(opt) < sizeof(struct tc_prio_qopt)) return -EINVAL; if (qopt->bands > TCQ_PRIO_BANDS_V1 && nla_len(opt) < sizeof(*qopt)) return -EINVAL; /* By here, if it has up to 16 bands, we can assume it is using the _v1 * layout, while if it has more than that (up to TCQ_PRIO_BANDS_V2) it is * using the _v2 format. */ if (qopt->bands > TCQ_PRIO_BANDS_V2 || qopt->bands < 2) return -EINVAL; ... With something like this I think it can keep compatibility with old software while also allowing the new usage. > Also, it is not only used by prio but also pfifo_fast. Yes. More is needed, indeed. prio2band would also need to be expanded, etc. Yet, I still don't see any blocker.
Re: [PATCH net-next 5/5 v2] net: gemini: Indicate that we can handle jumboframes
On Wed, Jul 4, 2018 at 10:35 PM Andrew Lunn wrote: > > On Wed, Jul 04, 2018 at 08:33:24PM +0200, Linus Walleij wrote: > > The hardware supposedly handles frames up to 10236 bytes and > > implements .ndo_change_mtu() so accept 10236 minus the ethernet > > header for a VLAN tagged frame on the netdevices. Use > > ETH_MIN_MTU as minimum MTU. > > > > Signed-off-by: Linus Walleij > > Hi Linus > > Did you try with an MTU of 68? Maybe the vendor picked 256 because of > a hardware limit? Yeah works fine: ping -s 68 169.254.1.2 PING 169.254.1.2 (169.254.1.2) 68(96) bytes of data. 76 bytes from 169.254.1.2: icmp_seq=1 ttl=64 time=0.359 ms 76 bytes from 169.254.1.2: icmp_seq=2 ttl=64 time=0.346 ms 76 bytes from 169.254.1.2: icmp_seq=3 ttl=64 time=0.351 ms This also works fine: ping -s 9000 169.254.1.2 PING 169.254.1.2 (169.254.1.2) 9000(9028) bytes of data. 9008 bytes from 169.254.1.2: icmp_seq=1 ttl=64 time=1.45 ms 9008 bytes from 169.254.1.2: icmp_seq=2 ttl=64 time=1.68 ms 9008 bytes from 169.254.1.2: icmp_seq=3 ttl=64 time=1.55 ms I'll send new patches with all suggested changes soon :) Thanks a lot for your help! Yours, Linus Walleij
Re: [net-next PATCH] net: ipv4: fix listify ip_rcv_finish in case of forwarding
On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote: > Only driver sfc actually uses this, but I don't have this NIC, so I > tested this on mlx5, with my own changes to make it use > netif_receive_skb_list(), > but I'm not ready to upstream the mlx5 driver change yet. Thanks Jesper for sharing this. Should we look forward to those patches, or do you want us to implement them? Thanks, Saeed.
Re: [PATCH net-next v2 04/11] devlink: Add support for region get command
On Wed, 11 Jul 2018 13:43:01 +0300, Alex Vesker wrote: > + DEVLINK_ATTR_REGION_SIZE, /* u32 */ > + err = nla_put_u64_64bit(msg, DEVLINK_ATTR_REGION_SIZE, > + region->size, > + DEVLINK_ATTR_PAD); Size in the comment looks incorrect.
Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
On Tue, Jul 10, 2018 at 07:32:43PM -0700, Cong Wang wrote: > On Mon, Jul 9, 2018 at 12:53 PM Marcelo Ricardo Leitner > wrote: > > > > On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote: > > > > > >2. sch_prio.c does not have a global limit on the number of packets on > > > all its queues, only a limit per queue. > > > > It can be useful to sch_prio.c as well, why not? > > prio_enqueue() > > { > > ... > > + if (count > sch->global_limit) > > + prio_tail_drop(sch); /* to be implemented */ > > ret = qdisc_enqueue(skb, qdisc, to_free); > > > > Isn't the whole point of sch_prio offloading the queueing to > each class? If you need a limit, there is one for each child > qdisc if you use for example pfifo or bfifo (depending on you > want to limit bytes or packets). Yes, but Michel wants to drop from other lower priorities if needed, and that's not possible if you handle the limit already in a child qdisc as they don't know about their siblings. The idea in the example above is to discard it from whatever lower priority is needed, then queue it. (ok, the example misses checking the priority level) As for the different units, sch_prio holds a count of how many packets are queued on its children, and that's what would be used for the limit. > > Also, what's your plan for backward compatibility here? say: if (sch->global_limit && count > sch->global_limit) as in, only do the limit check/enforcing if needed.
[PATCH] of: mdio: Support fixed links in of_phy_get_and_connect()
By a simple extension of of_phy_get_and_connect(), drivers that have their port on e.g. RGMII can also support fixed links, so in addition to: ethernet-port { phy-mode = "rgmii"; phy-handle = <&foo>; }; This setup with a fixed-link node and no phy-handle will now also work just fine: ethernet-port { phy-mode = "rgmii"; fixed-link { speed = <1000>; full-duplex; pause; }; }; This is very helpful for connecting random ethernet ports to e.g. DSA switches that typically reside on fixed links. The phy-mode is still there as the fixed link in this case is still an RGMII link. Tested on the Cortina Gemini driver with the Vitesse DSA router chip on a fixed 1Gbit link. Suggested-by: Andrew Lunn Signed-off-by: Linus Walleij --- drivers/of/of_mdio.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c index d963baf8e53a..e92391d6d1bd 100644 --- a/drivers/of/of_mdio.c +++ b/drivers/of/of_mdio.c @@ -367,14 +367,23 @@ struct phy_device *of_phy_get_and_connect(struct net_device *dev, phy_interface_t iface; struct device_node *phy_np; struct phy_device *phy; + int ret; iface = of_get_phy_mode(np); if (iface < 0) return NULL; - - phy_np = of_parse_phandle(np, "phy-handle", 0); - if (!phy_np) - return NULL; + if (of_phy_is_fixed_link(np)) { + ret = of_phy_register_fixed_link(np); + if (ret < 0) { + netdev_err(dev, "broken fixed-link specification\n"); + return NULL; + } + phy_np = of_node_get(np); + } else { + phy_np = of_parse_phandle(np, "phy-handle", 0); + if (!phy_np) + return NULL; + } phy = of_phy_connect(dev, phy_np, hndlr, 0, iface); -- 2.17.1