Re: [PATCH net 2/2] sfp: fix module initialisation with netdev already up
From: Russell King Date: Tue, 10 Jul 2018 12:05:36 +0100 > It has been observed that with a particular order of initialisation, > the netdev can be up, but the SFP module still has its TX_DISABLE > signal asserted. This occurs when the network device is brought up before > the SFP kernel module has been inserted by userspace. > > This occurs because the sfp-bus layer does not hear about the change in > network device state, and so assumes that it is still down. Set > netdev->sfp when the upstream is registered to work around this problem. > > Signed-off-by: Russell King Applied.
Re: [PATCH net 1/2] sfp: ensure we clean up properly on bus registration failure
From: Russell King Date: Tue, 10 Jul 2018 12:05:31 +0100 > We fail to correctly clean up after a bus registration failure, which > can lead to an incorrect assumption about the registration state of > the upstream or sfp cage. > > Signed-off-by: Russell King Applied.
Re: [PATCH net-next 0/3] mlxsw: ERSPAN: Take LACP state into consideration
From: Ido Schimmel Date: Tue, 10 Jul 2018 10:02:56 +0300 > Petr says: > > When offloading mirror-to-gretap, mlxsw needs to preroute the path that > the encapsulated packet will take. That path may include a LAG device > above a front panel port. So far, mlxsw resolved the path to the first > up front panel slave of the LAG interface, but that only reflects > administrative state of the port. It neglects to consider whether the > port actually has a carrier, and what the LACP state is. This patch set > aims to address these problems. > > Patch #1 publishes team_port_get_rcu(). > > Then in patch #2, a new function is introduced, > mlxsw_sp_port_dev_check(). That returns, for a given netdevice that is a > slave of a LAG device, whether that device is "txable", i.e. whether the > LAG master would send traffic through it. Since there's no good place to > put LAG-wide helpers, introduce a new header include/net/lag.h. > > Finally in patch #3, fix the slave selection logic to take into > consideration whether a given slave has a carrier and whether it is > txable. Series applied, thank you.
Re: [PATCH net-next] macvlan: Change status when lower device goes down
From: Travis Brown Date: Tue, 10 Jul 2018 00:35:01 + > Today macvlan ignores the notification when a lower device goes > administratively down, preventing the lack of connectivity from > bubbling up. > > Processing NETDEV_DOWN results in a macvlan state of LOWERLAYERDOWN > with NO-CARRIER which should be easy to interpret in userspace. > > 2: lower: mtu 1500 qdisc mq state DOWN mode DEFAULT > group default qlen 1000 > 3: macvlan@lower: mtu 1500 qdisc > noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000 > > Signed-off-by: Suresh Krishnan > Signed-off-by: Travis Brown Seems reasonable, applied, thanks.
Re: [net-next 0/7][pull request] L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09
From: Jeff Kirsher Date: Mon, 9 Jul 2018 15:20:35 -0700 > This patch series is meant to allow support for the L2 forward offload, aka > MACVLAN offload without the need for using ndo_select_queue. > > The existing solution currently requires that we use ndo_select_queue in > the transmit path if we want to associate specific Tx queues with a given > MACVLAN interface. In order to get away from this we need to repurpose the > tc_to_txq array and XPS pointer for the MACVLAN interface and use those as > a means of accessing the queues on the lower device. As a result we cannot > offload a device that is configured as multiqueue, however it doesn't > really make sense to configure a macvlan interface as being multiqueue > anyway since it doesn't really have a qdisc of its own in the first place. > > The big changes in this set are: > Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN > Disable XPS for single queue devices > Replace accel_priv with sb_dev in ndo_select_queue > Add sb_dev parameter to fallback function for ndo_select_queue > Consolidated ndo_select_queue functions that appeared to be duplicates > > The following are changes since commit > c47078d6a33fd78d882200cdaacbcfcd63318234: > tcp: remove redundant SOCK_DONE checks > and are available in the git repository at: > git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE Pulled, thanks Jeff.
Re: [PATCH net-next] tcp: expose both send and receive intervals for rate sample
From: Deepti Raghavan Date: Mon, 9 Jul 2018 17:53:39 + > Congestion control algorithms, which access the rate sample > through the tcp_cong_control function, only have access to the maximum > of the send and receive interval, for cases where the acknowledgment > rate may be inaccurate due to ACK compression or decimation. Algorithms > may want to use send rates and receive rates as separate signals. > > Signed-off-by: Deepti Raghavan Applied.
[PATCH] scripts/tags.sh: Add BPF_CALL
Signed-off-by: Constantine Shulyupin
---
 scripts/tags.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/tags.sh b/scripts/tags.sh
index 66f08bb1cce9..db0d56ebe9b9 100755
--- a/scripts/tags.sh
+++ b/scripts/tags.sh
@@ -152,6 +152,7 @@ regex_asm=(
 )
 regex_c=(
 	'/^SYSCALL_DEFINE[0-9](\([[:alnum:]_]*\).*/sys_\1/'
+	'/^BPF_CALL_[0-9](\([[:alnum:]_]*\).*/\1/'
 	'/^COMPAT_SYSCALL_DEFINE[0-9](\([[:alnum:]_]*\).*/compat_sys_\1/'
 	'/^TRACE_EVENT(\([[:alnum:]_]*\).*/trace_\1/'
 	'/^TRACE_EVENT(\([[:alnum:]_]*\).*/trace_\1_rcuidle/'
-- 
2.17.1
Re: [PATCH net-next] net: sched: fix unprotected access to rcu cookie pointer
From: Vlad Buslov Date: Mon, 9 Jul 2018 20:26:47 +0300 > Fix action attribute size calculation function to take rcu read lock and > access act_cookie pointer with rcu dereference. > > Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") > Reported-by: Marcelo Ricardo Leitner > Signed-off-by: Vlad Buslov Applied.
Re: [PATCH net-next 0/2] cxgb4: move stats fetched from firmware to debugfs
From: Rahul Lakkireddy Date: Mon, 9 Jul 2018 21:42:45 +0530 > Some stats are fetched via slow firmware mailbox, which can cause > packet drops under heavy load. So, this series removes these stats > from ethtool -S and expose them via debugfs. > > Patch 1 removes stats fetched via firmware from ethtool -S. > Patch 2 exposes stats removed in Patch 1 via debugfs. Series applied, thanks.
Re: [PATCH net-next] net: sched: act_ife: fix memory leak in ife init
From: Vlad Buslov Date: Mon, 9 Jul 2018 14:33:26 +0300 > Free params if tcf_idr_check_alloc() returned error. > > Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") > Reported-by: Dan Carpenter > Signed-off-by: Vlad Buslov Applied.
Re: [PATCH net-next] cxgb4: specify IQTYPE in fw_iq_cmd
From: Ganesh Goudar Date: Mon, 9 Jul 2018 16:52:03 +0530 > From: Arjun Vynipadath > > congestion argument passed to t4_sge_alloc_rxq() is used > to differentiate between nic/ofld queues. > > Signed-off-by: Arjun Vynipadath > Signed-off-by: Ganesh Goudar Applied.
Re: [PATCH net v2 0/5] net/ipv6: addr_gen_mode fixes
From: Sabrina Dubroca Date: Mon, 9 Jul 2018 12:25:13 +0200 > This series fixes bugs in handling of the addr_gen_mode option, mainly > related to the sysctl. A minor netlink issue was also present in the > initial commit introducing the option on a per-netdevice basis. > > v2: add patch 4, requested by David Ahern during review of v1 > add patch 5, missing documentation for the sysctl > patches 1, 2, 3 are unchanged I know there is still some discussion going on about sysctl semantics, but I'll apply this for now and any further refinements can be submitted on top. Thanks.
Re: [PATCH resend] rhashtable: detect when object movement might have invalidated a lookup
From: David Miller Date: Wed, 11 Jul 2018 22:46:58 -0700 (PDT) > From: NeilBrown > Date: Fri, 06 Jul 2018 17:08:35 +1000 > >> >> Some users of rhashtable might need to change the key >> of an object and move it to a different location in the table. >> Other users might want to allocate objects using >> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation >> being used for a different (type-compatible) purpose and similarly >> end up in a different hash-chain. >> >> To support these, we store a unique NULLS_MARKER at the end of >> each chain, and when a search fails to find a match, we check >> if the NULLS marker found was the expected one. If not, >> the search is repeated. >> >> The unique NULLS_MARKER is derived from the address of the >> head of the chain. >> >> If an object is removed and re-added to the same hash chain, we won't >> notice by looking at the NULLS marker. In this case we must be sure >> that it was not re-added *after* its original location, or a lookup may >> incorrectly fail. The easiest solution is to ensure it is inserted at >> the start of the chain. insert_slow() already does that, >> insert_fast() does not. So this patch changes insert_fast to always >> insert at the head of the chain. >> >> Note that such a user must do their own double-checking of >> the object found by rhashtable_lookup_fast() after ensuring >> mutual exclusion with anything that might change the key, such as >> successfully taking a new reference. >> >> Signed-off-by: NeilBrown > > Applied to net-next. Actually, reverted, it doesn't even compile. lib/rhashtable.c: In function ‘rht_bucket_nested’: lib/rhashtable.c:1187:39: error: macro "INIT_RHT_NULLS_HEAD" passed 3 arguments, but takes just 1 INIT_RHT_NULLS_HEAD(rhnull, NULL, 0); ^ lib/rhashtable.c:1187:4: error: ‘INIT_RHT_NULLS_HEAD’ undeclared (first use in this function); did you mean ‘INIT_LIST_HEAD’? 
INIT_RHT_NULLS_HEAD(rhnull, NULL, 0); ^~~ INIT_LIST_HEAD lib/rhashtable.c:1187:4: note: each undeclared identifier is reported only once for each function it appears in
Re: [PATCH net-next] net/sched: flower: Fix null pointer dereference when run tc vlan command
From: Jianbo Liu Date: Mon, 9 Jul 2018 02:26:20 + > Zahari issued a tc vlan command without setting vlan_ethtype, which will > crash the kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] > is not null before using it. > Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. > > Fixes: d64efd0926ba ('net/sched: flower: Add supprt for matching on QinQ vlan > headers') > Signed-off-by: Jianbo Liu > Reported-by: Zahari Doychev Applied.
[PATCH net-next] net/tls: Removed redundant variable from 'struct tls_sw_context_rx'
The variable 'decrypted' in 'struct tls_sw_context_rx' is redundant and is being set/unset without purpose. Simplified the code by removing it. Signed-off-by: Vakul Garg --- include/net/tls.h | 1 - net/tls/tls_sw.c | 87 --- 2 files changed, 38 insertions(+), 50 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 70c273777fe9..528d0c2d6cc2 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -113,7 +113,6 @@ struct tls_sw_context_rx { struct poll_table_struct *wait); struct sk_buff *recv_pkt; u8 control; - bool decrypted; char rx_aad_ciphertext[TLS_AAD_SPACE_SIZE]; char rx_aad_plaintext[TLS_AAD_SPACE_SIZE]; diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 0d670c8adf18..e5f2de2c3fd6 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -81,8 +81,6 @@ static int tls_do_decryption(struct sock *sk, rxm->full_len -= tls_ctx->rx.overhead_size; tls_advance_record_sn(sk, &tls_ctx->rx); - ctx->decrypted = true; - ctx->saved_data_ready(sk); out: @@ -756,6 +754,9 @@ int tls_sw_recvmsg(struct sock *sk, bool cmsg = false; int target, err = 0; long timeo; + int page_count; + int to_copy; + flags |= nonblock; @@ -792,46 +793,38 @@ int tls_sw_recvmsg(struct sock *sk, goto recv_end; } - if (!ctx->decrypted) { - int page_count; - int to_copy; - - page_count = iov_iter_npages(&msg->msg_iter, -MAX_SKB_FRAGS); - to_copy = rxm->full_len - tls_ctx->rx.overhead_size; - if (to_copy <= len && page_count < MAX_SKB_FRAGS && - likely(!(flags & MSG_PEEK))) { - struct scatterlist sgin[MAX_SKB_FRAGS + 1]; - int pages = 0; - - zc = true; - sg_init_table(sgin, MAX_SKB_FRAGS + 1); - sg_set_buf(&sgin[0], ctx->rx_aad_plaintext, - TLS_AAD_SPACE_SIZE); - - err = zerocopy_from_iter(sk, &msg->msg_iter, -to_copy, &pages, -&chunk, &sgin[1], -MAX_SKB_FRAGS, false); - if (err < 0) - goto fallback_to_reg_recv; - - err = decrypt_skb(sk, skb, sgin); - for (; pages > 0; pages--) - put_page(sg_page(&sgin[pages])); - if (err < 0) { - tls_err_abort(sk, EBADMSG); - goto recv_end; - } - } 
else { + page_count = iov_iter_npages(&msg->msg_iter, MAX_SKB_FRAGS); + to_copy = rxm->full_len - tls_ctx->rx.overhead_size; + + if (to_copy <= len && page_count < MAX_SKB_FRAGS && + likely(!(flags & MSG_PEEK))) { + struct scatterlist sgin[MAX_SKB_FRAGS + 1]; + int pages = 0; + + zc = true; + sg_init_table(sgin, MAX_SKB_FRAGS + 1); + sg_set_buf(&sgin[0], ctx->rx_aad_plaintext, + TLS_AAD_SPACE_SIZE); + err = zerocopy_from_iter(sk, &msg->msg_iter, to_copy, +&pages, &chunk, &sgin[1], +MAX_SKB_FRAGS, false); + if (err < 0) + goto fallback_to_reg_recv; + + err = decrypt_skb(sk, skb, sgin); + for (; pages > 0; pages--) + put_page(sg_page(&sgin[pages])); + if (err < 0) { + tls_err_abort(sk, EBADMSG); + goto recv_end; + } + } else { fallback_to_reg_recv: - err = decrypt_skb(sk, skb, NULL); - if (err < 0) { - tls_err_abort(sk, EBADMSG); - goto recv_end; - } + err = decrypt_skb(sk, skb, NULL); + if (err < 0) { + tls_err_abort(sk, EBADMSG); + goto recv_end; } -
Re: [PATCH iproute2-next] ipaddress: fix label matching
❦ 11 July 2018 21:01 -0400, David Ahern : >> +++ b/ip/ipaddress.c >> @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, >> if (!name) >> return -1; >> >> -if (filter.label && >> -(!filter.family || filter.family == AF_PACKET) && >> -fnmatch(filter.label, name, 0)) >> -return -1; >> - > > The offending commit changed the return code: > > if (filter.label && > (!filter.family || filter.family == AF_PACKET) && > - fnmatch(filter.label, RTA_DATA(tb[IFLA_IFNAME]), 0)) > - return 0; > + fnmatch(filter.label, name, 0)) > + return -1; > > > Vincent: can you try leaving the code as is, but change the return to 0? Yes, it works by just returning 0. The code still doesn't make sense. -- Many pages make a thick book, except for pocket Bibles which are on very very thin paper.
Re: Reply: [PATCH] net: convert gro_count to bitmask
From: "Li,Rongqing" Date: Thu, 12 Jul 2018 03:03:51 + > > >> -Original Message- >> From: David Miller [mailto:da...@davemloft.net] >> Sent: 12 Jul 2018 10:49 >> To: Li,Rongqing >> Cc: netdev@vger.kernel.org >> Subject: Re: [PATCH] net: convert gro_count to bitmask >> >> From: Li RongQing >> Date: Wed, 11 Jul 2018 17:15:53 +0800 >> >> > + clear_bit(index, &napi->gro_bitmask); >> >> Please don't use atomics here, at least use __clear_bit(). >> > > Thanks, this is same as Eric's suggestion. > > >> This is why I did the operations by hand in my version of the patch. >> Also, if you are going to preempt my patch, at least retain the comment I >> added around the GRO_HASH_BUCKETS definitions which warns the reader >> about the limit. >> > > I added a BUILD_BUG_ON in netdev_init, so I think we do not need to add the comment That's a good compile time check, but the person thinking about editing the definition doesn't see the limit in the header file nor know why the limit exists in the first place.
[PATCH bpf-next 1/7] xdp: add per mode attributes for attached programs
In preparation for support of simultaneous driver and hardware XDP support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID will still be reported, but user space can now also access the program ID in a new IFLA_XDP__PROG_ID attribute. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- include/uapi/linux/if_link.h | 3 +++ net/core/rtnetlink.c | 30 ++ 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index cf01b6824244..bc86c2b105ec 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -928,6 +928,9 @@ enum { IFLA_XDP_ATTACHED, IFLA_XDP_FLAGS, IFLA_XDP_PROG_ID, + IFLA_XDP_DRV_PROG_ID, + IFLA_XDP_SKB_PROG_ID, + IFLA_XDP_HW_PROG_ID, __IFLA_XDP_MAX, }; diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index e3f743c141b3..8ab95de1114c 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -964,7 +964,8 @@ static size_t rtnl_xdp_size(void) { size_t xdp_size = nla_total_size(0) + /* nest IFLA_XDP */ nla_total_size(1) + /* XDP_ATTACHED */ - nla_total_size(4);/* XDP_PROG_ID */ + nla_total_size(4) + /* XDP_PROG_ID */ + nla_total_size(4);/* XDP__PROG_ID */ return xdp_size; } @@ -1378,16 +1379,17 @@ static u8 rtnl_xdp_attached_mode(struct net_device *dev, u32 *prog_id) static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) { + u32 prog_attr, prog_id; struct nlattr *xdp; - u32 prog_id; int err; + u8 mode; xdp = nla_nest_start(skb, IFLA_XDP); if (!xdp) return -EMSGSIZE; - err = nla_put_u8(skb, IFLA_XDP_ATTACHED, -rtnl_xdp_attached_mode(dev, &prog_id)); + mode = rtnl_xdp_attached_mode(dev, &prog_id); + err = nla_put_u8(skb, IFLA_XDP_ATTACHED, mode); if (err) goto err_cancel; @@ -1395,6 +1397,26 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) err = nla_put_u32(skb, IFLA_XDP_PROG_ID, prog_id); if (err) goto err_cancel; + + switch (mode) { + case XDP_ATTACHED_DRV: + prog_attr = IFLA_XDP_DRV_PROG_ID; + break; + 
case XDP_ATTACHED_SKB: + prog_attr = IFLA_XDP_SKB_PROG_ID; + break; + case XDP_ATTACHED_HW: + prog_attr = IFLA_XDP_HW_PROG_ID; + break; + case XDP_ATTACHED_NONE: + default: + err = -EINVAL; + goto err_cancel; + } + + err = nla_put_u32(skb, prog_attr, prog_id); + if (err) + goto err_cancel; } nla_nest_end(skb, xdp); -- 2.17.1
[PATCH bpf-next 4/7] xdp: support simultaneous driver and hw XDP attachment
Split the query of HW-attached program from the software one. Introduce new .ndo_bpf command to query HW-attached program. This will allow drivers to install different programs in HW and SW at the same time. Netlink can now also carry multiple programs on dump (in which case mode will be set to XDP_ATTACHED_MULTI and user has to check per-attachment point attributes, IFLA_XDP_PROG_ID will not be present). We reuse IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size() doesn't need to be updated. Note that the installation side is still not there, since all drivers currently reject installing more than one program at the time. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- .../ethernet/netronome/nfp/nfp_net_common.c | 6 ++ drivers/net/netdevsim/bpf.c | 6 ++ include/linux/netdevice.h | 7 +- include/uapi/linux/if_link.h | 1 + net/core/dev.c| 45 + net/core/rtnetlink.c | 93 +++ 6 files changed, 96 insertions(+), 62 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 4bb589dbffbc..bb1e72e8dbc2 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3453,6 +3453,12 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_bpf *xdp) case XDP_SETUP_PROG_HW: return nfp_net_xdp_setup(nn, xdp); case XDP_QUERY_PROG: + if (nn->dp.bpf_offload_xdp) + return 0; + return xdp_attachment_query(&nn->xdp, xdp); + case XDP_QUERY_PROG_HW: + if (!nn->dp.bpf_offload_xdp) + return 0; return xdp_attachment_query(&nn->xdp, xdp); default: return nfp_app_bpf(nn->app, nn, xdp); diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index c485d97b5df4..5544c9b51173 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -561,6 +561,12 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) nsim_bpf_destroy_prog(bpf->offload.prog); return 0; case XDP_QUERY_PROG: + if 
(ns->xdp_prog_mode != XDP_ATTACHED_DRV) + return 0; + return xdp_attachment_query(&ns->xdp, bpf); + case XDP_QUERY_PROG_HW: + if (ns->xdp_prog_mode != XDP_ATTACHED_HW) + return 0; return xdp_attachment_query(&ns->xdp, bpf); case XDP_SETUP_PROG: err = nsim_setup_prog_checks(ns, bpf); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 69a664789b33..2422c0e88f5c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -820,6 +820,7 @@ enum bpf_netdev_command { XDP_SETUP_PROG, XDP_SETUP_PROG_HW, XDP_QUERY_PROG, + XDP_QUERY_PROG_HW, /* BPF program for offload callbacks, invoked at program load time. */ BPF_OFFLOAD_VERIFIER_PREP, BPF_OFFLOAD_TRANSLATE, @@ -843,7 +844,7 @@ struct netdev_bpf { struct bpf_prog *prog; struct netlink_ext_ack *extack; }; - /* XDP_QUERY_PROG */ + /* XDP_QUERY_PROG, XDP_QUERY_PROG_HW */ struct { u32 prog_id; /* flags with which program was installed */ @@ -3533,8 +3534,8 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, int fd, u32 flags); -void __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op, -struct netdev_bpf *xdp); +u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op, + enum bpf_netdev_command cmd); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index bc86c2b105ec..8759cfb8aa2e 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -920,6 +920,7 @@ enum { XDP_ATTACHED_DRV, XDP_ATTACHED_SKB, XDP_ATTACHED_HW, + XDP_ATTACHED_MULTI, }; enum { diff --git a/net/core/dev.c b/net/core/dev.c index 0bc8fee2156b..00880c3e9af5 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7592,21 +7592,19 @@ int dev_change_proto_down(struct 
net_device *dev, bool proto_down) } EXPORT_SYMBOL(dev_change_proto_down); -void __dev_xdp_query(struct net_device *dev, bpf_op_t bpf_op, -struct netdev_bpf *xdp) +u32 __dev_xd
[PATCH bpf-next 3/7] xdp: factor out common program/flags handling from drivers
Basic operations drivers perform during xdp setup and query can be moved to helpers in the core. Encapsulate program and flags into a structure and add helpers. Note that the structure is intended as the "main" program information source in the driver. Most drivers will additionally place the program pointer in their fast path or ring structures. The helpers don't have a huge impact now, but they will decrease the code duplication when programs can be installed in HW and driver at the same time. Encapsulating the basic operations in helpers will hopefully also reduce the number of changes to drivers which adopt them. Helpers could really be static inline, but they depend on definition of struct netdev_bpf which means they'd have to be placed in netdevice.h, an already 4500 line header. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 6 ++-- .../ethernet/netronome/nfp/nfp_net_common.c | 28 ++- drivers/net/netdevsim/bpf.c | 16 +++-- drivers/net/netdevsim/netdevsim.h | 4 +-- include/net/xdp.h | 13 +++ net/core/xdp.c| 34 +++ tools/testing/selftests/bpf/test_offload.py | 4 +-- 7 files changed, 67 insertions(+), 38 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 2a71a9ffd095..2021dda595b7 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -553,8 +553,7 @@ struct nfp_net_dp { * @rss_cfg:RSS configuration * @rss_key:RSS secret key * @rss_itbl: RSS indirection table - * @xdp_flags: Flags with which XDP prog was loaded - * @xdp_prog: XDP prog (for ctrl path, both DRV and HW modes) + * @xdp: Information about the attached XDP program * @max_r_vecs:Number of allocated interrupt vectors for RX/TX * @max_tx_rings: Maximum number of TX rings supported by the Firmware * @max_rx_rings: Maximum number of RX rings supported by the Firmware @@ -610,8 +609,7 @@ struct nfp_net { u8 
rss_key[NFP_NET_CFG_RSS_KEY_SZ]; u8 rss_itbl[NFP_NET_CFG_RSS_ITBL_SZ]; - u32 xdp_flags; - struct bpf_prog *xdp_prog; + struct xdp_attachment_info xdp; unsigned int max_tx_rings; unsigned int max_rx_rings; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index d20714598613..4bb589dbffbc 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3417,34 +3417,29 @@ nfp_net_xdp_setup_drv(struct nfp_net *nn, struct bpf_prog *prog, return nfp_net_ring_reconfig(nn, dp, extack); } -static int -nfp_net_xdp_setup(struct nfp_net *nn, struct bpf_prog *prog, u32 flags, - struct netlink_ext_ack *extack) +static int nfp_net_xdp_setup(struct nfp_net *nn, struct netdev_bpf *bpf) { struct bpf_prog *drv_prog, *offload_prog; int err; - if (nn->xdp_prog && (flags ^ nn->xdp_flags) & XDP_FLAGS_MODES) + if (!xdp_attachment_flags_ok(&nn->xdp, bpf)) return -EBUSY; /* Load both when no flags set to allow easy activation of driver path * when program is replaced by one which can't be offloaded. */ - drv_prog = flags & XDP_FLAGS_HW_MODE ? NULL : prog; - offload_prog = flags & XDP_FLAGS_DRV_MODE ? NULL : prog; + drv_prog = bpf->flags & XDP_FLAGS_HW_MODE ? NULL : bpf->prog; + offload_prog = bpf->flags & XDP_FLAGS_DRV_MODE ? 
NULL : bpf->prog; - err = nfp_net_xdp_setup_drv(nn, drv_prog, extack); + err = nfp_net_xdp_setup_drv(nn, drv_prog, bpf->extack); if (err) return err; - err = nfp_app_xdp_offload(nn->app, nn, offload_prog, extack); - if (err && flags & XDP_FLAGS_HW_MODE) + err = nfp_app_xdp_offload(nn->app, nn, offload_prog, bpf->extack); + if (err && bpf->flags & XDP_FLAGS_HW_MODE) return err; - if (nn->xdp_prog) - bpf_prog_put(nn->xdp_prog); - nn->xdp_prog = prog; - nn->xdp_flags = flags; + xdp_attachment_setup(&nn->xdp, bpf); return 0; } @@ -3456,12 +3451,9 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: case XDP_SETUP_PROG_HW: - return nfp_net_xdp_setup(nn, xdp->prog, xdp->flags, -xdp->extack); + return nfp_net_xdp_setup(nn, xdp); case XDP_QUERY_PROG: - xdp->prog_id = nn->xdp_prog ? nn->xdp_prog->aux->id : 0; - xdp->prog_flags = nn->xdp_prog ? nn->xdp_flags : 0; -
[PATCH bpf-next 5/7] netdevsim: add support for simultaneous driver and hw XDP
Allow netdevsim to accept driver and offload attachment of XDP BPF programs at the same time. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/netdevsim/bpf.c | 32 +++-- drivers/net/netdevsim/netdev.c | 3 +- drivers/net/netdevsim/netdevsim.h | 2 +- tools/testing/selftests/bpf/test_offload.py | 8 -- 4 files changed, 12 insertions(+), 33 deletions(-) diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 5544c9b51173..c36d2a768202 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -92,7 +92,7 @@ static const struct bpf_prog_offload_ops nsim_bpf_analyzer_ops = { static bool nsim_xdp_offload_active(struct netdevsim *ns) { - return ns->xdp_prog_mode == XDP_ATTACHED_HW; + return ns->xdp_hw.prog; } static void nsim_prog_set_loaded(struct bpf_prog *prog, bool loaded) @@ -195,11 +195,13 @@ static int nsim_xdp_offload_prog(struct netdevsim *ns, struct netdev_bpf *bpf) return nsim_bpf_offload(ns, bpf->prog, nsim_xdp_offload_active(ns)); } -static int nsim_xdp_set_prog(struct netdevsim *ns, struct netdev_bpf *bpf) +static int +nsim_xdp_set_prog(struct netdevsim *ns, struct netdev_bpf *bpf, + struct xdp_attachment_info *xdp) { int err; - if (!xdp_attachment_flags_ok(&ns->xdp, bpf)) + if (!xdp_attachment_flags_ok(xdp, bpf)) return -EBUSY; if (bpf->command == XDP_SETUP_PROG && !ns->bpf_xdpdrv_accept) { @@ -217,14 +219,7 @@ static int nsim_xdp_set_prog(struct netdevsim *ns, struct netdev_bpf *bpf) return err; } - xdp_attachment_setup(&ns->xdp, bpf); - - if (!bpf->prog) - ns->xdp_prog_mode = XDP_ATTACHED_NONE; - else if (bpf->command == XDP_SETUP_PROG) - ns->xdp_prog_mode = XDP_ATTACHED_DRV; - else - ns->xdp_prog_mode = XDP_ATTACHED_HW; + xdp_attachment_setup(xdp, bpf); return 0; } @@ -284,10 +279,6 @@ static int nsim_setup_prog_checks(struct netdevsim *ns, struct netdev_bpf *bpf) NSIM_EA(bpf->extack, "MTU too large w/ XDP enabled"); return -EINVAL; } - if (nsim_xdp_offload_active(ns)) { - 
NSIM_EA(bpf->extack, "xdp offload active, can't load drv prog"); - return -EBUSY; - } return 0; } @@ -561,25 +552,21 @@ int nsim_bpf(struct net_device *dev, struct netdev_bpf *bpf) nsim_bpf_destroy_prog(bpf->offload.prog); return 0; case XDP_QUERY_PROG: - if (ns->xdp_prog_mode != XDP_ATTACHED_DRV) - return 0; return xdp_attachment_query(&ns->xdp, bpf); case XDP_QUERY_PROG_HW: - if (ns->xdp_prog_mode != XDP_ATTACHED_HW) - return 0; - return xdp_attachment_query(&ns->xdp, bpf); + return xdp_attachment_query(&ns->xdp_hw, bpf); case XDP_SETUP_PROG: err = nsim_setup_prog_checks(ns, bpf); if (err) return err; - return nsim_xdp_set_prog(ns, bpf); + return nsim_xdp_set_prog(ns, bpf, &ns->xdp); case XDP_SETUP_PROG_HW: err = nsim_setup_prog_hw_checks(ns, bpf); if (err) return err; - return nsim_xdp_set_prog(ns, bpf); + return nsim_xdp_set_prog(ns, bpf, &ns->xdp_hw); case BPF_OFFLOAD_MAP_ALLOC: if (!ns->bpf_map_accept) return -EOPNOTSUPP; @@ -635,5 +622,6 @@ void nsim_bpf_uninit(struct netdevsim *ns) WARN_ON(!list_empty(&ns->bpf_bound_progs)); WARN_ON(!list_empty(&ns->bpf_bound_maps)); WARN_ON(ns->xdp.prog); + WARN_ON(ns->xdp_hw.prog); WARN_ON(ns->bpf_offloaded); } diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c index b2f9d0df93b0..a7b179f0d954 100644 --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -228,8 +228,7 @@ static int nsim_change_mtu(struct net_device *dev, int new_mtu) { struct netdevsim *ns = netdev_priv(dev); - if (ns->xdp_prog_mode == XDP_ATTACHED_DRV && - new_mtu > NSIM_XDP_MAX_MTU) + if (ns->xdp.prog && new_mtu > NSIM_XDP_MAX_MTU) return -EBUSY; dev->mtu = new_mtu; diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h index 69ffb4a2d14b..0aeabbe81cc6 100644 --- a/drivers/net/netdevsim/netdevsim.h +++ b/drivers/net/netdevsim/netdevsim.h @@ -69,7 +69,7 @@ struct netdevsim { u32 bpf_offloaded_id; struct xdp_attachment_info xdp; - int xdp_prog_mode; + struct xdp_attachment_info 
xdp_hw; u32 prog_id_gen; diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/s
[PATCH bpf-next 7/7] nfp: add support for simultaneous driver and hw XDP
Split handling of offloaded and driver programs completely. Since offloaded programs always come with XDP_FLAGS_HW_MODE set in reality there could be no sharing, anyway, programs would only be installed in driver or in hardware. Splitting the handling allows us to install programs in HW and in driver at the same time. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 11 + drivers/net/ethernet/netronome/nfp/nfp_net.h | 6 +-- .../ethernet/netronome/nfp/nfp_net_common.c | 49 --- 3 files changed, 26 insertions(+), 40 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 4dbf7cba6377..b95b94d008cf 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -66,26 +66,19 @@ nfp_bpf_xdp_offload(struct nfp_app *app, struct nfp_net *nn, struct bpf_prog *prog, struct netlink_ext_ack *extack) { bool running, xdp_running; - int ret; if (!nfp_net_ebpf_capable(nn)) return -EINVAL; running = nn->dp.ctrl & NFP_NET_CFG_CTRL_BPF; - xdp_running = running && nn->dp.bpf_offload_xdp; + xdp_running = running && nn->xdp_hw.prog; if (!prog && !xdp_running) return 0; if (prog && running && !xdp_running) return -EBUSY; - ret = nfp_net_bpf_offload(nn, prog, running, extack); - /* Stop offload if replace not possible */ - if (ret) - return ret; - - nn->dp.bpf_offload_xdp = !!prog; - return ret; + return nfp_net_bpf_offload(nn, prog, running, extack); } static const char *nfp_bpf_extra_cap(struct nfp_app *app, struct nfp_net *nn) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 2021dda595b7..8970ec981e11 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -485,7 +485,6 @@ struct nfp_stat_pair { * @dev: Backpointer to struct device * @netdev:Backpointer to net_device structure * @is_vf: Is the 
driver attached to a VF? - * @bpf_offload_xdp: Offloaded BPF program is XDP * @chained_metadata_format: Firemware will use new metadata format * @rx_dma_dir:Mapping direction for RX buffers * @rx_dma_off:Offset at which DMA packets (for XDP headroom) @@ -510,7 +509,6 @@ struct nfp_net_dp { struct net_device *netdev; u8 is_vf:1; - u8 bpf_offload_xdp:1; u8 chained_metadata_format:1; u8 rx_dma_dir; @@ -553,7 +551,8 @@ struct nfp_net_dp { * @rss_cfg:RSS configuration * @rss_key:RSS secret key * @rss_itbl: RSS indirection table - * @xdp: Information about the attached XDP program + * @xdp: Information about the driver XDP program + * @xdp_hw:Information about the HW XDP program * @max_r_vecs:Number of allocated interrupt vectors for RX/TX * @max_tx_rings: Maximum number of TX rings supported by the Firmware * @max_rx_rings: Maximum number of RX rings supported by the Firmware @@ -610,6 +609,7 @@ struct nfp_net { u8 rss_itbl[NFP_NET_CFG_RSS_ITBL_SZ]; struct xdp_attachment_info xdp; + struct xdp_attachment_info xdp_hw; unsigned int max_tx_rings; unsigned int max_rx_rings; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index bb1e72e8dbc2..a712e83c3f0f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1710,8 +1710,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget) } } - if (xdp_prog && !(rxd->rxd.flags & PCIE_DESC_RX_BPF && - dp->bpf_offload_xdp) && !meta.portid) { + if (xdp_prog && !meta.portid) { void *orig_data = rxbuf->frag + pkt_off; unsigned int dma_off; int act; @@ -3393,14 +3392,18 @@ static void nfp_net_del_vxlan_port(struct net_device *netdev, nfp_net_set_vxlan_port(nn, idx, 0); } -static int -nfp_net_xdp_setup_drv(struct nfp_net *nn, struct bpf_prog *prog, - struct netlink_ext_ack *extack) +static int nfp_net_xdp_setup_drv(struct nfp_net *nn, struct netdev_bpf *bpf) { + struct bpf_prog *prog = 
bpf->prog; struct nfp_net_dp *dp; + int err; + + if (!xdp_attachment_flags_ok(&nn->xdp, bpf)) + return -EBUSY; if (!prog == !nn->dp.xdp_prog) { WRITE_ONCE(nn->dp.xdp_prog, prog); + xdp_attachment_s
[PATCH bpf-next 0/7] xdp: simultaneous driver and HW XDP
Hi! This set is adding support for loading driver and offload XDP at the same time. This enables advanced use cases where some of the work is offloaded to the NIC and some is done by the host. Separate netlink attributes are added for each mode of operation. Driver callbacks for offload are cleaned up a little, including removal of .prog_attached flag. Jakub Kicinski (7): xdp: add per mode attributes for attached programs xdp: don't make drivers report attachment mode xdp: factor out common program/flags handling from drivers xdp: support simultaneous driver and hw XDP attachment netdevsim: add support for simultaneous driver and hw XDP selftests/bpf: add test for multiple programs nfp: add support for simultaneous driver and hw XDP drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 1 - .../net/ethernet/cavium/thunder/nicvf_main.c | 1 - drivers/net/ethernet/intel/i40e/i40e_main.c | 1 - drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 1 - .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 - .../net/ethernet/mellanox/mlx4/en_netdev.c| 1 - .../net/ethernet/mellanox/mlx5/core/en_main.c | 1 - drivers/net/ethernet/netronome/nfp/bpf/main.c | 11 +-- drivers/net/ethernet/netronome/nfp/nfp_net.h | 10 ++- .../ethernet/netronome/nfp/nfp_net_common.c | 58 ++- .../net/ethernet/qlogic/qede/qede_filter.c| 1 - drivers/net/netdevsim/bpf.c | 41 --- drivers/net/netdevsim/netdev.c| 3 +- drivers/net/netdevsim/netdevsim.h | 6 +- drivers/net/tun.c | 1 - drivers/net/virtio_net.c | 1 - include/linux/netdevice.h | 12 ++-- include/net/xdp.h | 13 include/uapi/linux/if_link.h | 4 ++ net/core/dev.c| 48 +++-- net/core/rtnetlink.c | 71 ++- net/core/xdp.c| 34 + tools/testing/selftests/bpf/test_offload.py | 71 --- 23 files changed, 246 insertions(+), 146 deletions(-) -- 2.17.1
[PATCH bpf-next 6/7] selftests/bpf: add test for multiple programs
Add tests for having an XDP program attached in the driver and another one attached in HW simultaneously. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- tools/testing/selftests/bpf/test_offload.py | 63 + 1 file changed, 63 insertions(+) diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py index 4f982a0255c2..b746227eaff2 100755 --- a/tools/testing/selftests/bpf/test_offload.py +++ b/tools/testing/selftests/bpf/test_offload.py @@ -339,6 +339,11 @@ netns = [] # net namespaces to be removed self.dfs = DebugfsDir(self.dfs_dir) return self.dfs +def dfs_read(self, f): +path = os.path.join(self.dfs_dir, f) +_, data = cmd('cat %s' % (path)) +return data.strip() + def dfs_num_bound_progs(self): path = os.path.join(self.dfs_dir, "bpf_bound_progs") _, progs = cmd('ls %s' % (path)) @@ -814,6 +819,10 @@ netns = [] "Device parameters reported for non-offloaded program") start_test("Test XDP prog replace with bad flags...") +ret, _, err = sim.set_xdp(obj, "generic", force=True, + fail=False, include_stderr=True) +fail(ret == 0, "Replaced XDP program with a program in different mode") +fail(err.count("File exists") != 1, "Replaced driver XDP with generic") ret, _, err = sim.set_xdp(obj, "", force=True, fail=False, include_stderr=True) fail(ret == 0, "Replaced XDP program with a program in different mode") @@ -883,6 +892,60 @@ netns = [] rm(pin_file) bpftool_prog_list_wait(expected=0) +start_test("Test multi-attachment XDP - attach...") +sim.set_xdp(obj, "offload") +xdp = sim.ip_link_show(xdp=True)["xdp"] +offloaded = sim.dfs_read("bpf_offloaded_id") +fail("prog" not in xdp, "Base program not reported in single program mode") +fail(len(ipl["xdp"]["attached"]) != 1, + "Wrong attached program count with one program") + +sim.set_xdp(obj, "") +two_xdps = sim.ip_link_show(xdp=True)["xdp"] +offloaded2 = sim.dfs_read("bpf_offloaded_id") + +fail(two_xdps["mode"] != 4, "Bad mode reported with multiple programs") +fail("prog" 
in two_xdps, "Base program reported in multi program mode") +fail(xdp["attached"][0] not in two_xdps["attached"], + "Offload program not reported after driver activated") +fail(len(two_xdps["attached"]) != 2, + "Wrong attached program count with two programs") +fail(two_xdps["attached"][0]["prog"]["id"] == + two_xdps["attached"][1]["prog"]["id"], + "offloaded and drv programs have the same id") +fail(offloaded != offloaded2, + "offload ID changed after loading driver program") + +start_test("Test multi-attachment XDP - replace...") +ret, _, err = sim.set_xdp(obj, "offload", fail=False, include_stderr=True) +fail(err.count("busy") != 1, "Replaced one of programs without -force") + +start_test("Test multi-attachment XDP - detach...") +ret, _, err = sim.unset_xdp("drv", force=True, +fail=False, include_stderr=True) +fail(ret == 0, "Removed program with a bad mode") +check_extack(err, "program loaded with different flags.", args) + +sim.unset_xdp("offload") +xdp = sim.ip_link_show(xdp=True)["xdp"] +offloaded = sim.dfs_read("bpf_offloaded_id") + +fail(xdp["mode"] != 1, "Bad mode reported after multiple programs") +fail("prog" not in xdp, + "Base program not reported after multi program mode") +fail(xdp["attached"][0] not in two_xdps["attached"], + "Offload program not reported after driver activated") +fail(len(ipl["xdp"]["attached"]) != 1, + "Wrong attached program count with remaining programs") +fail(offloaded != "0", "offload ID reported with only driver program left") + +start_test("Test multi-attachment XDP - device remove...") +sim.set_xdp(obj, "offload") +sim.remove() + +sim = NetdevSim() +sim.set_ethtool_tc_offloads(True) + start_test("Test mixing of TC and XDP...") sim.tc_add_ingress() sim.set_xdp(obj, "offload") -- 2.17.1
[PATCH bpf-next 2/7] xdp: don't make drivers report attachment mode
prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id a long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached with XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags the driver reports. Remove the prog_attached member. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 1 - drivers/net/ethernet/cavium/thunder/nicvf_main.c| 1 - drivers/net/ethernet/intel/i40e/i40e_main.c | 1 - drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 1 - drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 - drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 1 - drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 - drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 3 --- drivers/net/ethernet/qlogic/qede/qede_filter.c | 1 - drivers/net/netdevsim/bpf.c | 1 - drivers/net/tun.c | 1 - drivers/net/virtio_net.c| 1 - include/linux/netdevice.h | 5 - net/core/dev.c | 7 +++ net/core/rtnetlink.c| 8 ++-- 15 files changed, 9 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 1f0e872d0667..0584d07c8c33 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -219,7 +219,6 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp) rc = bnxt_xdp_set(bp, xdp->prog); break; case XDP_QUERY_PROG: - xdp->prog_attached = !!bp->xdp_prog; xdp->prog_id = bp->xdp_prog ? 
bp->xdp_prog->aux->id : 0; rc = 0; break; diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 135766c4296b..768f584f8392 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -1848,7 +1848,6 @@ static int nicvf_xdp(struct net_device *netdev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return nicvf_xdp_setup(nic, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!nic->xdp_prog; xdp->prog_id = nic->xdp_prog ? nic->xdp_prog->aux->id : 0; return 0; default: diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 426b0ccb1fc6..51762428b40e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -11841,7 +11841,6 @@ static int i40e_xdp(struct net_device *dev, case XDP_SETUP_PROG: return i40e_xdp_setup(vsi, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = i40e_enabled_xdp_vsi(vsi); xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0; return 0; default: diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index a8e21becb619..3862fea1c923 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -9966,7 +9966,6 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return ixgbe_xdp_setup(dev, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!(adapter->xdp_prog); xdp->prog_id = adapter->xdp_prog ? 
adapter->xdp_prog->aux->id : 0; return 0; diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 59416eddd840..d86446d202d5 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -4462,7 +4462,6 @@ static int ixgbevf_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return ixgbevf_xdp_setup(dev, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!(adapter->xdp_prog); xdp->prog_id = adapter->xdp_prog ? adapter->xdp_prog->aux->id : 0; return 0; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 65eb06e017e4..6785661d1a72 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -2926,7 +2926,6 @@ static int mlx4_xdp(struct net_device *dev, struc
RE: [PATCH] net: convert gro_count to bitmask
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: 11 July 2018 19:32
> To: Li,Rongqing ; netdev@vger.kernel.org
> Subject: Re: [PATCH] net: convert gro_count to bitmask
>
> On 07/11/2018 02:15 AM, Li RongQing wrote:
> > gro_hash is 192 bytes in size and uses 3 cache lines. If there are few
> > flows, gro_hash may not be fully used, so it is unnecessary to iterate
> > over all of gro_hash in napi_gro_flush() and touch those cache lines.
> >
> > Convert gro_count to a bitmask and rename it gro_bitmask. Each bit
> > represents an element of gro_hash; only flush a gro_hash element if the
> > related bit is set, to speed up napi_gro_flush().
> >
> > Also update gro_bitmask only if it will actually change, to reduce
> > cache updates.
> >
> > Suggested-by: Eric Dumazet
> > Signed-off-by: Li RongQing
> > ---
> > include/linux/netdevice.h | 2 +-
> > net/core/dev.c | 35 +++
> > 2 files changed, 24 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b683971e500d..df49b36ef378 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -322,7 +322,7 @@ struct napi_struct {
> >
> > 	unsigned long	state;
> > 	int		weight;
> > -	unsigned int	gro_count;
> > +	unsigned long	gro_bitmask;
> > 	int		(*poll)(struct napi_struct *, int);
> > #ifdef CONFIG_NETPOLL
> > 	int		poll_owner;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index d13cddcac41f..a08dbdd217a6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5171,9 +5171,11 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
> > 			return;
> > 		list_del_init(&skb->list);
> > 		napi_gro_complete(skb);
> > -		napi->gro_count--;
> > 		napi->gro_hash[index].count--;
> > 	}
> > +
> > +	if (!napi->gro_hash[index].count)
> > +		clear_bit(index, &napi->gro_bitmask);
>
> I suggest you not add an atomic operation here.
>
> Current cpu owns this NAPI after all.
>
> Same remark for the whole patch.
>
> -> __clear_bit(), __set_bit() and similar operators
>
> Ideally you should provide TCP_RR numbers with busy polling enabled, to
> eventually catch regressions.

I will change it and do the test. Thank you.

-RongQing

> Thanks.
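To see what the bitmask buys, here is a rough user-space sketch (illustrative Python, not kernel code — names merely mirror the kernel's) of flushing only the buckets whose bits are set, instead of scanning every bucket the way the gro_count version had to:

```python
# GRO_HASH_BUCKETS is 8 in the patch under discussion; the bitmask must
# fit in one unsigned long, i.e. the bucket count may not exceed 64.
GRO_HASH_BUCKETS = 8
BITS_PER_LONG = 64
assert GRO_HASH_BUCKETS <= BITS_PER_LONG

def napi_gro_flush_sketch(gro_bitmask, flush_bucket):
    """Visit only the hash buckets whose bit is set in gro_bitmask."""
    while gro_bitmask:
        # isolate the lowest set bit, like __ffs()/for_each_set_bit()
        index = (gro_bitmask & -gro_bitmask).bit_length() - 1
        flush_bucket(index)
        gro_bitmask &= gro_bitmask - 1  # clear the bit we just handled

flushed = []
napi_gro_flush_sketch(0b10100101, flushed.append)
print(flushed)  # buckets 0, 2, 5 and 7 were non-empty
```

With few active flows most bits are clear, so the loop exits after touching only the populated buckets — which is the cache-line saving the commit message describes.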
RE: [PATCH] net: convert gro_count to bitmask
> -----Original Message-----
> From: David Miller [mailto:da...@davemloft.net]
> Sent: 12 July 2018 10:49
> To: Li,Rongqing
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH] net: convert gro_count to bitmask
>
> From: Li RongQing
> Date: Wed, 11 Jul 2018 17:15:53 +0800
>
> > +	clear_bit(index, &napi->gro_bitmask);
>
> Please don't use atomics here, at least use __clear_bit().

Thanks, this is the same as Eric's suggestion.

> This is why I did the operations by hand in my version of the patch.
> Also, if you are going to preempt my patch, at least retain the comment I
> added around the GRO_HASH_BUCKETS definitions which warns the reader
> about the limit.

I added a BUILD_BUG_ON in netdev_init(), so I think we do not need the comment:

@@ -9151,6 +9159,9 @@ static struct hlist_head * __net_init netdev_create_hash(void)
 /* Initialize per network namespace state */
 static int __net_init netdev_init(struct net *net)
 {
+	BUILD_BUG_ON(GRO_HASH_BUCKETS >
+		     FIELD_SIZEOF(struct napi_struct, gro_bitmask));
+

-RongQing

> Thanks.
Re: [PATCH] net: convert gro_count to bitmask
From: Li RongQing
Date: Wed, 11 Jul 2018 17:15:53 +0800

> +	clear_bit(index, &napi->gro_bitmask);

Please don't use atomics here, at least use __clear_bit().

This is why I did the operations by hand in my version of the patch.

Also, if you are going to preempt my patch, at least retain the comment I
added around the GRO_HASH_BUCKETS definitions which warns the reader
about the limit.

Thanks.
Re: Bug report: epoll can fail to report EPOLLOUT when unix datagram socket peer is closed
On 06/26/2018 10:18 AM, Ian Lance Taylor wrote:
> I'm reporting what appears to be a bug in the Linux kernel's epoll
> support. It seems that epoll appears to sometimes fail to report an
> EPOLLOUT event when the other side of an AF_UNIX/SOCK_DGRAM socket is
> closed. This bug report started as a Go program reported at
> https://golang.org/issue/23604. I've written a C program that
> demonstrates the same symptoms, at
> https://github.com/golang/go/issues/23604#issuecomment-398945027 .
>
> The C program sets up an AF_UNIX/SOCK_DGRAM server and several
> identical clients, all running in non-blocking mode. All the
> non-blocking sockets are added to epoll, using EPOLLET. The server
> periodically closes and reopens its socket. The clients look for
> ECONNREFUSED errors on their write calls, and close and reopen their
> sockets when they see one.
>
> The clients will sometimes fill up their buffer and block with EAGAIN.
> At that point they expect the poller to return an EPOLLOUT event to
> tell them when they are ready to write again. The expectation is that
> either the server will read data, freeing up buffer space, or will
> close the socket, which should cause the sending packets to be
> discarded, freeing up buffer space. Generally the EPOLLOUT event
> happens. But sometimes, the poller never returns such an event, and
> the client stalls. In the test program this is reported as a client
> that waits more than 20 seconds to be told to continue.
>
> A similar bug report was made, with few details, at
> https://stackoverflow.com/questions/38441059/edge-triggered-epoll-for-unix-domain-socket .
>
> I've tested the program and seen the failure on kernel 4.9.0-6-amd64.
> A colleague has tested the program and seen the failure on
> 4.18.0-smp-DEV #3 SMP @1529531011 x86_64 GNU/Linux.
>
> If there is a better way for me to report this, please let me know.
>
> Thanks for your attention.
>
> Ian

Hi,

Thanks for the report and the test program.
The patch below seems to have cured the reproducer for me. But perhaps you can confirm?

Thanks,

-Jason

[PATCH] af_unix: ensure POLLOUT on remote close() for connected dgram socket

Applications use ECONNREFUSED as returned from write() in order to determine that a socket should be closed. When using connected dgram unix sockets in a poll/write loop, this relies on POLLOUT being signaled when the remote end closes. However, due to a race POLLOUT can be missed when the remote closes:

  thread 1 (client)                  thread 2 (server)

  connect() to server
  write() returns -EAGAIN
  unix_dgram_poll()
   -> unix_recvq_full() is true
                                     close()
                                     -> unix_release_sock()
                                        -> wake_up_interruptible_all()
  unix_dgram_poll() (due to
   the wake_up_interruptible_all)
   -> unix_recvq_full() still is true
                                     -> free all skbs

Now thread 1 is stuck and will not receive any more wakeups. In this case, when thread 1 gets the -EAGAIN, it has not queued any skbs, otherwise the 'free all skbs' step would in fact cause a wakeup and a POLLOUT return. So the race here is probably fairly rare because it means there are no skbs that thread 1 queued and that thread 1 schedules before the 'free all skbs' step. Nevertheless, this has been observed in the wild via syslog.

The proposed fix is to move the wake_up_interruptible_all() call after the 'free all skbs' step.
Signed-off-by: Jason Baron
---
 net/unix/af_unix.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e5473c0..de242cf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -529,8 +529,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	sk->sk_state = TCP_CLOSE;
 	unix_state_unlock(sk);
 
-	wake_up_interruptible_all(&u->peer_wait);
-
 	skpair = unix_peer(sk);
 
 	if (skpair != NULL) {
@@ -560,6 +558,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 		kfree_skb(skb);
 	}
 
+	/* after freeing skbs to make sure POLLOUT triggers */
+	wake_up_interruptible_all(&u->peer_wait);
+
 	if (path.dentry)
 		path_put(&path);
-- 
2.7.4
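For readers who want to poke at the mechanism without building the C reproducer: the clients' reconnect logic hinges on write() failing with ECONNREFUSED once the peer is gone. A minimal Python sketch of that contract — this shows the Linux error path the clients rely on, not the wakeup race itself:

```python
import errno
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "dgram.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(path)

client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
client.connect(path)
client.send(b"ping")                   # delivered while the server is alive
assert server.recv(16) == b"ping"

# The test server periodically closes and reopens its socket; emulate the
# close half. The connected client's next send() must now fail.
server.close()
os.unlink(path)

err = None
try:
    client.send(b"ping")
except OSError as e:
    err = e.errno

# On Linux, ECONNREFUSED is the signal the clients use to reconnect.
assert err == errno.ECONNREFUSED
client.close()
```

The race in the thread is about the other half of the contract: a client blocked after EAGAIN must get an EPOLLOUT wakeup when the peer's close frees the queued skbs, which is exactly what reordering wake_up_interruptible_all() guarantees.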
RE: [PATCH] net: convert gro_count to bitmask
> -----Original Message-----
> From: Stefano Brivio [mailto:sbri...@redhat.com]
> Sent: 11 July 2018 18:52
> To: Li,Rongqing
> Cc: netdev@vger.kernel.org; Eric Dumazet
> Subject: Re: [PATCH] net: convert gro_count to bitmask
>
> On Wed, 11 Jul 2018 17:15:53 +0800
> Li RongQing wrote:
>
> > @@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
> > 	if (grow > 0)
> > 		gro_pull_from_frag0(skb, grow);
> > ok:
> > +	if (napi->gro_hash[hash].count)
> > +		if (!test_bit(hash, &napi->gro_bitmask))
> > +			set_bit(hash, &napi->gro_bitmask);
> > +		else if (test_bit(hash, &napi->gro_bitmask))
> > +			clear_bit(hash, &napi->gro_bitmask);
>
> This might not do what you want.

Could you show more detail?

-RongQing

> -- 
> Stefano
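A note for readers puzzling over the hunk quoted above: in C the `else` pairs with the innermost `if`, so when the bucket is non-empty and its bit is already set, this code clears the bit — presumably not the intent — and when the bucket is empty nothing runs at all. If the goal is simply "the bit tracks whether the bucket is non-empty", the update collapses to an unconditional set-or-clear. A sketch of that logic (illustrative Python, not kernel code):

```python
def update_bucket_bit(gro_bitmask, hash_idx, bucket_count):
    """Keep bit hash_idx in sync with whether bucket hash_idx holds flows."""
    if bucket_count:
        return gro_bitmask | (1 << hash_idx)   # __set_bit() equivalent
    return gro_bitmask & ~(1 << hash_idx)      # __clear_bit() equivalent

mask = update_bucket_bit(0b0001, 3, 2)   # bucket 3 now holds flows
assert mask == 0b1001
mask = update_bucket_bit(mask, 3, 2)     # setting again is idempotent
assert mask == 0b1001
mask = update_bucket_bit(mask, 3, 0)     # bucket drained -> bit cleared
assert mask == 0b0001
```

Note that both branches are harmless to repeat, so no test_bit() guard is needed; per Eric's and David's remarks, the kernel versions should also be the non-atomic __set_bit()/__clear_bit(), since the owning CPU is the only writer.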
Re: [PATCH net-next] net: sched: fix unprotected access to rcu cookie pointer
On Mon, Jul 09, 2018 at 11:44:38PM +0300, Vlad Buslov wrote: > > On Mon 09 Jul 2018 at 20:34, Marcelo Ricardo Leitner > wrote: > > On Mon, Jul 09, 2018 at 08:26:47PM +0300, Vlad Buslov wrote: > >> Fix action attribute size calculation function to take rcu read lock and > >> access act_cookie pointer with rcu dereference. > >> > >> Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") > >> Reported-by: Marcelo Ricardo Leitner > >> Signed-off-by: Vlad Buslov > >> --- > >> net/sched/act_api.c | 9 +++-- > >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> > >> diff --git a/net/sched/act_api.c b/net/sched/act_api.c > >> index 66dc19746c63..148a89ab789b 100644 > >> --- a/net/sched/act_api.c > >> +++ b/net/sched/act_api.c > >> @@ -149,10 +149,15 @@ EXPORT_SYMBOL(__tcf_idr_release); > >> > >> static size_t tcf_action_shared_attrs_size(const struct tc_action *act) > >> { > >> + struct tc_cookie *act_cookie; > >>u32 cookie_len = 0; > >> > >> - if (act->act_cookie) > >> - cookie_len = nla_total_size(act->act_cookie->len); > >> + rcu_read_lock(); > >> + act_cookie = rcu_dereference(act->act_cookie); > >> + > >> + if (act_cookie) > >> + cookie_len = nla_total_size(act_cookie->len); > >> + rcu_read_unlock(); > > > > I am not sure if this is enough to fix the entire issue. Now it will > > fetch the length correctly but, what guarantees that when it tries to > > actually copy the key (tcf_action_dump_1), the same act_cookie pointer > > will be used? As in, can't the new re-fetch be different/smaller than > > the object used here? > > I checked the code of nlmsg_put() and similar functions, and they check > that there is enough free space at skb tailroom. If not, they fail > gracefully and return error. Am I missing something? Talked offline with Vlad and I agree that this is fine as is. Reviewed-by: Marcelo Ricardo Leitner Thanks, Marcelo
Re: [PATCH iproute2-next] ipaddress: fix label matching
On 7/11/18 7:36 AM, Vincent Bernat wrote: > diff --git a/ip/ipaddress.c b/ip/ipaddress.c > index 5009bfe6d2e3..20ef6724944e 100644 > --- a/ip/ipaddress.c > +++ b/ip/ipaddress.c > @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, > if (!name) > return -1; > > - if (filter.label && > - (!filter.family || filter.family == AF_PACKET) && > - fnmatch(filter.label, name, 0)) > - return -1; > - The offending commit changed the return code: if (filter.label && (!filter.family || filter.family == AF_PACKET) && - fnmatch(filter.label, RTA_DATA(tb[IFLA_IFNAME]), 0)) - return 0; + fnmatch(filter.label, name, 0)) + return -1; Vincent: can you try leaving the code as is, but change the return to 0?
Re: [PATCH v4 iproute2-next 0/3] Add support for ETF qdisc
On 7/9/18 7:56 PM, Jesus Sanchez-Palencia wrote: > fixes since v3: > - Add support for clock names with the "CLOCK_" prefix; > - Print clock name on print_opt(); > - Use strcasecmp() instead of strncasecmp(). > > > The ETF (earliest txtime first) qdisc was recently merged into net-next > [1], so this patchset adds support for it through the tc command line > tool. > > An initial man page is also provided. > > The first commit in this series is adding an updated version of > include/uapi/linux/pkt_sched.h and is not meant to be merged. It's > provided here just as a convenience for those who want to easily build > this patchset. > > [1] https://patchwork.ozlabs.org/cover/938991/ > applied to iproute2-next. Thanks,
[PATCH bpf-next 2/6] bpf: Sync bpf.h to tools/
Sync BPF_SOCK_OPS_TCP_LISTEN_CB related UAPI changes to tools/. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/include/uapi/linux/bpf.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 59b19b6a40d7..3b0ab93bc94f 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2555,6 +2555,9 @@ enum { * Arg1: old_state * Arg2: new_state */ + BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after +* socket transition to LISTEN state. +*/ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect -- 2.17.1
[PATCH bpf-next 0/6] TCP-BPF callback for listening sockets
This patchset adds TCP-BPF callback for listening sockets. Patch 0001 provides more details and is the main patch in the set. Patch 0006 adds selftest for the new callback. Other patches are bug fixes and improvements in TCP-BPF selftest to make it easier to extend in 0006. Andrey Ignatov (6): bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB bpf: Sync bpf.h to tools/ selftests/bpf: Fix const'ness in cgroup_helpers selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers selftests/bpf: Better verification in test_tcpbpf selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB include/uapi/linux/bpf.h | 3 + net/ipv4/af_inet.c| 1 + tools/include/uapi/linux/bpf.h| 3 + tools/testing/selftests/bpf/Makefile | 1 + tools/testing/selftests/bpf/cgroup_helpers.c | 6 +- tools/testing/selftests/bpf/cgroup_helpers.h | 6 +- tools/testing/selftests/bpf/test_tcpbpf.h | 1 + .../testing/selftests/bpf/test_tcpbpf_kern.c | 17 ++- .../testing/selftests/bpf/test_tcpbpf_user.c | 119 +- 9 files changed, 88 insertions(+), 69 deletions(-) -- 2.17.1
[PATCH bpf-next 1/6] bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB
Add new TCP-BPF callback that is called on listen(2) right after socket transition to TCP_LISTEN state. It fills the gap for listening sockets in TCP-BPF. For example BPF program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening and track later transition from TCP_LISTEN to TCP_CLOSE with BPF_SOCK_OPS_STATE_CB callback. Before there was no way to do it with TCP-BPF and other options were much harder to work with. E.g. socket state tracking can be done with tracepoints (either raw or regular) but they can't be attached to cgroup and their lifetime has to be managed separately. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 3 +++ net/ipv4/af_inet.c | 1 + 2 files changed, 4 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index b7db3261c62d..aa11cdcbfcaf 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2557,6 +2557,9 @@ enum { * Arg1: old_state * Arg2: new_state */ + BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after +* socket transition to LISTEN state. +*/ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index c716be13d58c..f2a0a3bab6b5 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -229,6 +229,7 @@ int inet_listen(struct socket *sock, int backlog) err = inet_csk_listen_start(sk, backlog); if (err) goto out; + tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_LISTEN_CB, 0, NULL); } sk->sk_max_ack_backlog = backlog; err = 0; -- 2.17.1
[PATCH bpf-next 6/6] selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB
Cover new TCP-BPF callback in test_tcpbpf: when listen() is called on socket, set BPF_SOCK_OPS_STATE_CB_FLAG so that BPF_SOCK_OPS_STATE_CB callback can be called on future state transition, and when such a transition happens (TCP_LISTEN -> TCP_CLOSE), track it in the map and verify it in user space later. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/test_tcpbpf.h | 1 + tools/testing/selftests/bpf/test_tcpbpf_kern.c | 17 - tools/testing/selftests/bpf/test_tcpbpf_user.c | 4 +++- 3 files changed, 16 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/test_tcpbpf.h b/tools/testing/selftests/bpf/test_tcpbpf.h index 2fe43289943c..7bcfa6207005 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf.h +++ b/tools/testing/selftests/bpf/test_tcpbpf.h @@ -12,5 +12,6 @@ struct tcpbpf_globals { __u32 good_cb_test_rv; __u64 bytes_received; __u64 bytes_acked; + __u32 num_listen; }; #endif diff --git a/tools/testing/selftests/bpf/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/test_tcpbpf_kern.c index 3e645ee41ed5..4b7fd540cea9 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_kern.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_kern.c @@ -96,15 +96,22 @@ int bpf_testcb(struct bpf_sock_ops *skops) if (!gp) break; g = *gp; - g.total_retrans = skops->total_retrans; - g.data_segs_in = skops->data_segs_in; - g.data_segs_out = skops->data_segs_out; - g.bytes_received = skops->bytes_received; - g.bytes_acked = skops->bytes_acked; + if (skops->args[0] == BPF_TCP_LISTEN) { + g.num_listen++; + } else { + g.total_retrans = skops->total_retrans; + g.data_segs_in = skops->data_segs_in; + g.data_segs_out = skops->data_segs_out; + g.bytes_received = skops->bytes_received; + g.bytes_acked = skops->bytes_acked; + } bpf_map_update_elem(&global_map, &key, &g, BPF_ANY); } break; + case BPF_SOCK_OPS_TCP_LISTEN_CB: + bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_STATE_CB_FLAG); + break; default: rv = -1; } diff --git 
a/tools/testing/selftests/bpf/test_tcpbpf_user.c b/tools/testing/selftests/bpf/test_tcpbpf_user.c index 971f1644b9c7..a275c2971376 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_user.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c @@ -37,7 +37,8 @@ int verify_result(const struct tcpbpf_globals *result) (1 << BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) | (1 << BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) | (1 << BPF_SOCK_OPS_NEEDS_ECN) | - (1 << BPF_SOCK_OPS_STATE_CB)); + (1 << BPF_SOCK_OPS_STATE_CB) | + (1 << BPF_SOCK_OPS_TCP_LISTEN_CB)); EXPECT_EQ(expected_events, result->event_map, "#" PRIx32); EXPECT_EQ(501ULL, result->bytes_received, "llu"); @@ -46,6 +47,7 @@ int verify_result(const struct tcpbpf_globals *result) EXPECT_EQ(1, result->data_segs_out, PRIu32); EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32); EXPECT_EQ(0, result->good_cb_test_rv, PRIu32); + EXPECT_EQ(1, result->num_listen, PRIu32); return 0; err: -- 2.17.1
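The selftest drives the kernel side entirely from user space via tcp_server.py. Stripped to its essence, the sequence that fires the new BPF_SOCK_OPS_TCP_LISTEN_CB and then produces the tracked TCP_LISTEN -> TCP_CLOSE transition is just the following (an illustrative sketch, not the selftest itself):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
srv.listen(1)               # listen(2): fires BPF_SOCK_OPS_TCP_LISTEN_CB
port = srv.getsockname()[1]
srv.close()                 # TCP_LISTEN -> TCP_CLOSE, seen by STATE_CB
```

On a kernel carrying this series, with the test's BPF program attached to the cgroup, the listen() call is what lets the program set BPF_SOCK_OPS_STATE_CB_FLAG, and the close() is what bumps num_listen.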
[PATCH bpf-next 4/6] selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers
Switch to cgroup_helpers to simplify the code and fix cgroup cleanup: before cgroup was not cleaned up after the test. It also removes SYSTEM macro, that only printed error, but didn't terminate the test. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/Makefile | 1 + .../testing/selftests/bpf/test_tcpbpf_user.c | 55 +++ 2 files changed, 22 insertions(+), 34 deletions(-) diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 7a6214e9ae58..478bf1bcbbf5 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -61,6 +61,7 @@ $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c $(OUTPUT)/test_sock: cgroup_helpers.c $(OUTPUT)/test_sock_addr: cgroup_helpers.c $(OUTPUT)/test_sockmap: cgroup_helpers.c +$(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c $(OUTPUT)/test_progs: trace_helpers.c $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c diff --git a/tools/testing/selftests/bpf/test_tcpbpf_user.c b/tools/testing/selftests/bpf/test_tcpbpf_user.c index 84ab5163c828..fa97ec6428de 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_user.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c @@ -1,25 +1,18 @@ // SPDX-License-Identifier: GPL-2.0 #include #include -#include #include #include -#include #include -#include -#include -#include #include -#include -#include #include -#include -#include #include #include -#include "bpf_util.h" + #include "bpf_rlimit.h" -#include +#include "bpf_util.h" +#include "cgroup_helpers.h" + #include "test_tcpbpf.h" static int bpf_find_map(const char *test, struct bpf_object *obj, @@ -35,42 +28,32 @@ static int bpf_find_map(const char *test, struct bpf_object *obj, return bpf_map__fd(map); } -#define SYSTEM(CMD)\ - do {\ - if (system(CMD)) { \ - printf("system(%s) FAILS!\n", CMD); \ - } \ - } while (0) - int main(int argc, char **argv) { const char *file = "test_tcpbpf_kern.o"; struct tcpbpf_globals g = {0}; - int cg_fd, prog_fd, map_fd; + 
const char *cg_path = "/foo"; bool debug_flag = false; int error = EXIT_FAILURE; struct bpf_object *obj; - char cmd[100], *dir; - struct stat buffer; + int prog_fd, map_fd; + int cg_fd = -1; __u32 key = 0; - int pid; int rv; if (argc > 1 && strcmp(argv[1], "-d") == 0) debug_flag = true; - dir = "/tmp/cgroupv2/foo"; + if (setup_cgroup_environment()) + goto err; + + cg_fd = create_and_get_cgroup(cg_path); + if (!cg_fd) + goto err; - if (stat(dir, &buffer) != 0) { - SYSTEM("mkdir -p /tmp/cgroupv2"); - SYSTEM("mount -t cgroup2 none /tmp/cgroupv2"); - SYSTEM("mkdir -p /tmp/cgroupv2/foo"); - } - pid = (int) getpid(); - sprintf(cmd, "echo %d >> /tmp/cgroupv2/foo/cgroup.procs", pid); - SYSTEM(cmd); + if (join_cgroup(cg_path)) + goto err; - cg_fd = open(dir, O_DIRECTORY, O_RDONLY); if (bpf_prog_load(file, BPF_PROG_TYPE_SOCK_OPS, &obj, &prog_fd)) { printf("FAILED: load_bpf_file failed for: %s\n", file); goto err; @@ -83,7 +66,10 @@ int main(int argc, char **argv) goto err; } - SYSTEM("./tcp_server.py"); + if (system("./tcp_server.py")) { + printf("FAILED: TCP server\n"); + goto err; + } map_fd = bpf_find_map(__func__, obj, "global_map"); if (map_fd < 0) @@ -123,6 +109,7 @@ int main(int argc, char **argv) error = 0; err: bpf_prog_detach(cg_fd, BPF_CGROUP_SOCK_OPS); + close(cg_fd); + cleanup_cgroup_environment(); return error; - } -- 2.17.1
[PATCH bpf-next 5/6] selftests/bpf: Better verification in test_tcpbpf
Reduce amount of copy/paste for debug info when result is verified in the test and keep that info together with values being checked so that they won't get out of sync. It also improves debug experience: instead of checking manually what doesn't match in debug output for all fields, only unexpected field is printed. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- .../testing/selftests/bpf/test_tcpbpf_user.c | 64 +++ 1 file changed, 39 insertions(+), 25 deletions(-) diff --git a/tools/testing/selftests/bpf/test_tcpbpf_user.c b/tools/testing/selftests/bpf/test_tcpbpf_user.c index fa97ec6428de..971f1644b9c7 100644 --- a/tools/testing/selftests/bpf/test_tcpbpf_user.c +++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include @@ -15,6 +16,42 @@ #include "test_tcpbpf.h" +#define EXPECT_EQ(expected, actual, fmt) \ + do {\ + if ((expected) != (actual)) { \ + printf(" Value of: " #actual "\n" \ + "Actual: %" fmt "\n" \ + " Expected: %" fmt "\n",\ + (actual), (expected)); \ + goto err; \ + } \ + } while (0) + +int verify_result(const struct tcpbpf_globals *result) +{ + __u32 expected_events; + + expected_events = ((1 << BPF_SOCK_OPS_TIMEOUT_INIT) | + (1 << BPF_SOCK_OPS_RWND_INIT) | + (1 << BPF_SOCK_OPS_TCP_CONNECT_CB) | + (1 << BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) | + (1 << BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) | + (1 << BPF_SOCK_OPS_NEEDS_ECN) | + (1 << BPF_SOCK_OPS_STATE_CB)); + + EXPECT_EQ(expected_events, result->event_map, "#" PRIx32); + EXPECT_EQ(501ULL, result->bytes_received, "llu"); + EXPECT_EQ(1002ULL, result->bytes_acked, "llu"); + EXPECT_EQ(1, result->data_segs_in, PRIu32); + EXPECT_EQ(1, result->data_segs_out, PRIu32); + EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32); + EXPECT_EQ(0, result->good_cb_test_rv, PRIu32); + + return 0; +err: + return -1; +} + static int bpf_find_map(const char *test, struct bpf_object *obj, const char *name) { @@ -33,7 +70,6 @@ int 
main(int argc, char **argv) const char *file = "test_tcpbpf_kern.o"; struct tcpbpf_globals g = {0}; const char *cg_path = "/foo"; - bool debug_flag = false; int error = EXIT_FAILURE; struct bpf_object *obj; int prog_fd, map_fd; @@ -41,9 +77,6 @@ int main(int argc, char **argv) __u32 key = 0; int rv; - if (argc > 1 && strcmp(argv[1], "-d") == 0) - debug_flag = true; - if (setup_cgroup_environment()) goto err; @@ -81,30 +114,11 @@ int main(int argc, char **argv) goto err; } - if (g.bytes_received != 501 || g.bytes_acked != 1002 || - g.data_segs_in != 1 || g.data_segs_out != 1 || - (g.event_map ^ 0x47e) != 0 || g.bad_cb_test_rv != 0x80 || - g.good_cb_test_rv != 0) { + if (verify_result(&g)) { printf("FAILED: Wrong stats\n"); - if (debug_flag) { - printf("\n"); - printf("bytes_received: %d (expecting 501)\n", - (int)g.bytes_received); - printf("bytes_acked:%d (expecting 1002)\n", - (int)g.bytes_acked); - printf("data_segs_in: %d (expecting 1)\n", - g.data_segs_in); - printf("data_segs_out: %d (expecting 1)\n", - g.data_segs_out); - printf("event_map: 0x%x (at least 0x47e)\n", - g.event_map); - printf("bad_cb_test_rv: 0x%x (expecting 0x80)\n", - g.bad_cb_test_rv); - printf("good_cb_test_rv:0x%x (expecting 0)\n", - g.good_cb_test_rv); - } goto err; } + printf("PASSED!\n"); error = 0; err: -- 2.17.1
[PATCH bpf-next 3/6] selftests/bpf: Fix const'ness in cgroup_helpers
Lack of const in cgroup helpers signatures forces to write ugly client code. Fix it. Signed-off-by: Andrey Ignatov Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/cgroup_helpers.c | 6 +++--- tools/testing/selftests/bpf/cgroup_helpers.h | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c index c87b4e052ce9..cf16948aad4a 100644 --- a/tools/testing/selftests/bpf/cgroup_helpers.c +++ b/tools/testing/selftests/bpf/cgroup_helpers.c @@ -118,7 +118,7 @@ static int join_cgroup_from_top(char *cgroup_path) * * On success, it returns 0, otherwise on failure it returns 1. */ -int join_cgroup(char *path) +int join_cgroup(const char *path) { char cgroup_path[PATH_MAX + 1]; @@ -158,7 +158,7 @@ void cleanup_cgroup_environment(void) * On success, it returns the file descriptor. On failure it returns 0. * If there is a failure, it prints the error to stderr. */ -int create_and_get_cgroup(char *path) +int create_and_get_cgroup(const char *path) { char cgroup_path[PATH_MAX + 1]; int fd; @@ -186,7 +186,7 @@ int create_and_get_cgroup(char *path) * which is an invalid cgroup id. * If there is a failure, it prints the error to stderr. 
*/ -unsigned long long get_cgroup_id(char *path) +unsigned long long get_cgroup_id(const char *path) { int dirfd, err, flags, mount_id, fhsize; union { diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h index 20a4a5dcd469..d64bb8957090 100644 --- a/tools/testing/selftests/bpf/cgroup_helpers.h +++ b/tools/testing/selftests/bpf/cgroup_helpers.h @@ -9,10 +9,10 @@ __FILE__, __LINE__, clean_errno(), ##__VA_ARGS__) -int create_and_get_cgroup(char *path); -int join_cgroup(char *path); +int create_and_get_cgroup(const char *path); +int join_cgroup(const char *path); int setup_cgroup_environment(void); void cleanup_cgroup_environment(void); -unsigned long long get_cgroup_id(char *path); +unsigned long long get_cgroup_id(const char *path); #endif -- 2.17.1
Re: [PATCH bpf 0/4] Consistent sendmsg error reporting in AF_XDP
On Wed, Jul 11, 2018 at 10:12:48AM +0200, Magnus Karlsson wrote: > This patch set adjusts the AF_XDP TX error reporting so that it becomes > consistent between copy mode and zero-copy. First some background: > > Copy-mode for TX uses the SKB path in which the action of sending the > packet is performed from process context using the sendmsg > syscall. Completions are usually done asynchronously from NAPI mode by > using a TX interrupt. In this mode, send errors can be returned back > through the syscall. > > In zero-copy mode both the sending of the packet and the completions > are done asynchronously from NAPI mode for performance reasons. In > this mode, the sendmsg syscall only makes sure that the TX NAPI loop > will be run that performs both the actions of sending and > completing. In this mode it is therefore not possible to return errors > through the sendmsg syscall as the sending is done from the NAPI > loop. Note that it is possible to implement a synchronous send with > our API, but in our benchmarks that made the TX performance drop by > nearly half due to synchronization requirements and cache line > bouncing. But for some netdevs this might be preferable so let us > leave it up to the implementation to decide. > > The problem is that the current code base returns some errors in > copy-mode that are not possible to return in zero-copy mode. This > patch set aligns them so that the two modes always return the same > error code. We achieve this by removing some of the errors returned by > sendmsg in copy-mode (and in one case adding an error message for > zero-copy mode) and offering alternative error detection methods that > are consistent between the two modes. > > The structure of the patch set is as follows: > > Patch 1: removes the ENXIO return code from copy-mode when someone has > forcefully changed the number of queues on the device so that the > queue bound to the socket is no longer available. 
Just silently stop > sending anything as in zero-copy mode. > > Patch 2: stop returning EAGAIN in copy mode when the completion queue > is full as zero-copy does not do this. Instead this situation can be > detected by comparing the head and tail pointers of the completion > queue in both modes. In any case, EAGAIN was not the correct error code > here since no amount of calling sendmsg will solve the problem. Only > consuming one or more messages on the completion queue will fix this. > > Patch 3: Always return ENOBUFS from sendmsg if there is no TX queue > configured. This was not the case for zero-copy mode. > > Patch 4: stop returning EMSGSIZE when the size of the packet is larger > than the MTU. Just send it to the device so that it will drop it as in > zero-copy mode. > > Note that copy-mode can still return EAGAIN in certain circumstances, > but as these conditions cannot occur in zero-copy mode it is fine for > copy-mode to return them. > > Question: For patch 4, is it fine to let the device drop a packet > that is greater than its MTU, or should I have a check for this in > both zero-copy and copy-mode and drop the packet up in the AF_XDP > code? The drawback of this is that it will have performance > implications for zero-copy mode as we will touch one more cache line > with dev->mtu. > > Thanks: Magnus

for the set:
Acked-by: Alexei Starovoitov
Re: [BUG] bonded interfaces drop bpdu (stp) frames
On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: > > Hi, > > As weird as that sounds, this is what I observed today after bumping > kernel version. I have a setup where 2 bonds are attached to linux > bridge and physically are connected to two switches doing MSTP (and > linux bridge is just passing them). > > Initially I suspected some changes related to bridge code - but quick > peek at the code showed nothing suspicious - and the part of it that > explicitly passes stp frames if stp is not enabled has seen little > changes (e.g. per-port group_fwd_mask added recently). Furthermore - if > regular non-bonded interfaces are attached everything works fine. > > Just to be sure I detached the bond (802.3ad mode) and checked it with > simple tcpdump (ether proto \stp) - and indeed no hello packets were > there (with them being present just fine on the active enslaved interface, > or on the bond device in earlier kernels). > > If time permits I'll bisect tomorrow to pinpoint the commit, but from > a quick test today - 4.9.x is working fine, while 4.16.16 (tested on > debian) and 4.17.3 (tested on archlinux) are failing. > > Unless this is already a known issue (or you have any suggestions what > could be responsible).

I believe these are link-local multicast messages, and some time back a change went in to not pass those frames to the bonding master. This could be a side effect of that.
Re: [PATCH bpf] bpf: fix panic due to oob in bpf_prog_test_run_skb
On Wed, Jul 11, 2018 at 03:30:14PM +0200, Daniel Borkmann wrote: > syzkaller triggered several panics similar to the below: > > [...] > [ 248.851531] BUG: KASAN: use-after-free in _copy_to_user+0x5c/0x90 > [ 248.857656] Read of size 985 at addr 88080172 by task a.out/1425 > [...] > [ 248.865902] CPU: 1 PID: 1425 Comm: a.out Not tainted 4.18.0-rc4+ #13 > [ 248.865903] Hardware name: Supermicro SYS-5039MS-H12TRF/X11SSE-F, BIOS > 2.1a 03/08/2018 > [ 248.865905] Call Trace: > [ 248.865910] dump_stack+0xd6/0x185 > [ 248.865911] ? show_regs_print_info+0xb/0xb > [ 248.865913] ? printk+0x9c/0xc3 > [ 248.865915] ? kmsg_dump_rewind_nolock+0xe4/0xe4 > [ 248.865919] print_address_description+0x6f/0x270 > [ 248.865920] kasan_report+0x25b/0x380 > [ 248.865922] ? _copy_to_user+0x5c/0x90 > [ 248.865924] check_memory_region+0x137/0x190 > [ 248.865925] kasan_check_read+0x11/0x20 > [ 248.865927] _copy_to_user+0x5c/0x90 > [ 248.865930] bpf_test_finish.isra.8+0x4f/0xc0 > [ 248.865932] bpf_prog_test_run_skb+0x6a0/0xba0 > [...] > > After scrubbing the BPF prog a bit from the noise, turns out it called > bpf_skb_change_head() for the lwt_xmit prog with headroom of 2. Nothing > wrong in that, however, this was run with repeat >> 0 in > bpf_prog_test_run_skb() > and the same skb thus keeps changing until the pskb_expand_head() called > from skb_cow() keeps bailing out in atomic alloc context with -ENOMEM. > So upon return we'll basically have 0 headroom left yet blindly do the > __skb_push() of 14 bytes and keep copying data from there in bpf_test_finish() > out of bounds. Fix to check if we have enough headroom and if > pskb_expand_head() > fails, bail out with error.
> > Another bug independent of this fix (but related in triggering above) is > that BPF_PROG_TEST_RUN should be reworked to reset the skb/xdp buffer to > its original state from input as otherwise repeating the same test in a > loop won't work for benchmarking when underlying input buffer is getting > changed by the prog each time and reused for the next run leading to > unexpected results. > > Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command") > Reported-by: syzbot+709412e651e55ed96...@syzkaller.appspotmail.com > Reported-by: syzbot+54f39d6ab58f39720...@syzkaller.appspotmail.com > Signed-off-by: Daniel Borkmann

Applied, Thanks
[BUG net-next] BUG triggered with GRO SKB list_head changes
Starting with the following net-next commit, I see a BUG when starting a LXD container inside of a KVM guest using virtio-net: d4546c2509b1 net: Convert GRO SKB handling to list_head. Here's what the kernel spits out: kernel BUG at /var/scm/kernel/linux/include/linux/skbuff.h:2080! invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 0 PID: 1362 Comm: libvirtd Not tainted 4.18.0-rc2+ #69 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 RIP: 0010:skb_pull+0x36/0x40 Code: c6 77 24 29 f0 3b 87 84 00 00 00 89 87 80 00 00 00 72 17 89 f6 48 89 f0 48 03 87 d8 00 00 00 48 89 87 d8 00 00 00 c3 31 c0 c3 <0f> 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 39 b7 80 00 00 00 76 RSP: :96737f6039f0 EFLAGS: 00010297 RAX: 9c66e2f2 RBX: RCX: 0501 RDX: 0001 RSI: 000e RDI: 96737f7e3938 RBP: 967379f40020 R08: R09: R10: 96737f603988 R11: c0461335 R12: 967379f409e0 R13: 96737f7e3938 R14: R15: 967379e96ac0 FS: 7fc96087e640() GS:96737f60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fc913608aa0 CR3: 5dacc001 CR4: 001606f0 Call Trace: br_dev_xmit+0xe1/0x3d0 [bridge] dev_hard_start_xmit+0xbc/0x3b0 __dev_queue_xmit+0xb98/0xc30 ip_finish_output2+0x3e5/0x670 ? ip_output+0x7f/0x250 ip_output+0x7f/0x250 ? ip_fragment.constprop.5+0x80/0x80 ip_forward+0x3e2/0x650 ? ipv4_frags_init_net+0x130/0x130 ip_rcv+0x2be/0x500 ? ip_local_deliver_finish+0x3b0/0x3b0 __netif_receive_skb_core+0x6a8/0xb30 ? lock_acquire+0xab/0x200 ? netif_receive_skb_internal+0x2a/0x380 netif_receive_skb_internal+0x73/0x380 ? napi_gro_complete+0xcf/0x1b0 dev_gro_receive+0x374/0x730 napi_gro_receive+0x4f/0x1d0 receive_buf+0x4b6/0x1930 [virtio_net] ? 
detach_buf+0x69/0x120 [virtio_ring] virtnet_poll+0x122/0x2e0 [virtio_net] net_rx_action+0x207/0x450 __do_softirq+0x149/0x4ea irq_exit+0xbf/0xd0 do_IRQ+0x6c/0x130 common_interrupt+0xf/0xf RIP: 0010:__radix_tree_lookup+0x28/0xe0 Code: 00 00 53 49 89 ca 41 bb 40 00 00 00 4c 8b 47 50 4c 89 c0 83 e0 03 48 83 f8 01 0f 85 a8 00 00 00 4c 89 c0 48 83 e0 fe 0f b6 08 <4c> 89 d8 48 d3 e0 48 83 e8 01 48 39 c6 76 11 e9 9f 00 00 00 4c 89 RSP: :ae150048fcc0 EFLAGS: 0282 ORIG_RAX: ffd9 RAX: 96735d2ef908 RBX: 001f RCX: 0006 RDX: RSI: 02e2 RDI: 96735d10b788 RBP: 02e2 R08: 96735d2ef909 R09: R10: R11: 0040 R12: 001f R13: ec01c15f3a80 R14: 001f R15: ae150048fd18 __do_page_cache_readahead+0x11f/0x2e0 filemap_fault+0x408/0x660 ext4_filemap_fault+0x2f/0x40 __do_fault+0x1f/0xd0 __handle_mm_fault+0x915/0xfa0 handle_mm_fault+0x1c2/0x390 __do_page_fault+0x2f6/0x580 ? async_page_fault+0x5/0x20 async_page_fault+0x1b/0x20 RIP: 0033:0x7fc913608aa0 Code: Bad RIP value. RSP: 002b:7ffcfa9c7f08 EFLAGS: 00010206 RAX: RBX: 0003 RCX: 0080 RDX: 0006 RSI: 7fc913a74bf8 RDI: 7fc913df9720 RBP: 0001 R08: 55df45795700 R09: R10: 55df4574c010 R11: 0001 R12: 7ffcfa9c8c38 R13: 7ffcfa9c8c48 R14: 7fc913dc3d70 R15: 55df4578ab30 Modules linked in: veth ebtable_filter ebtables ipt_MASQUERADE xt_CHECKSUM xt_comment xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_filter bpfilter bridge stp llc fuse kvm_intel kvm irqbypass 9pnet_virtio 9pnet virtio_balloon ib_iser rdma_cm configfs iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables virtio_net net_failover virtio_blk failover crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper virtio_pci psmouse virtio_ring virtio I'm not very familiar with the GRO or IP fragmentation code but I was able to identify that this change "fixes" the issue: diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 7ccc601b55d9..a5cea572a7f1 
100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -666,6 +666,7 @@ struct sk_buff { /* These two members must be first. */ struct sk_buff *next; struct sk_buff *prev; + struct list_head list; union { struct net_device *dev; @@ -678,7 +679,6 @@ struct sk_buff { }; }; struct rb_node rbnode; /* used in netem & tcp stack */ - struct list_head list; }; struct sock *s
[BUG] bonded interfaces drop bpdu (stp) frames
Hi, As weird as that sounds, this is what I observed today after bumping kernel version. I have a setup where 2 bonds are attached to linux bridge and physically are connected to two switches doing MSTP (and linux bridge is just passing them). Initially I suspected some changes related to bridge code - but quick peek at the code showed nothing suspicious - and the part of it that explicitly passes stp frames if stp is not enabled has seen little changes (e.g. per-port group_fwd_mask added recently). Furthermore - if regular non-bonded interfaces are attached everything works fine. Just to be sure I detached the bond (802.3ad mode) and checked it with simple tcpdump (ether proto \stp) - and indeed no hello packets were there (with them being present just fine on the active enslaved interface, or on the bond device in earlier kernels). If time permits I'll bisect tomorrow to pinpoint the commit, but from a quick test today - 4.9.x is working fine, while 4.16.16 (tested on debian) and 4.17.3 (tested on archlinux) are failing. Unless this is already a known issue (or you have any suggestions what could be responsible).
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
On 11.07.2018 23:33, Florian Fainelli wrote: > > > On 07/11/2018 02:08 PM, Heiner Kallweit wrote: >> On 11.07.2018 22:55, Andrew Lunn wrote: +/** + * phy_speed_down - set speed to lowest speed supported by both link partners + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Typically used to save energy when waiting for a WoL packet + */ +int phy_speed_down(struct phy_device *phydev, bool sync) >>> >>> This sync parameter needs some more thought. I'm not sure it is safe. >>> >>> How does a PHY trigger a WoL wake up? I guess some use the interrupt >>> pin. How does a PHY indicate auto-neg has completed? It triggers an >>> interrupt. So it seems like there is a danger here we suspend, and >>> then wake up 2 seconds later when auto-neg has completed. >>> >>> I'm not sure we can safely suspend until auto-neg has completed. >>> +/** + * phy_speed_up - (re)set advertised speeds to all supported speeds + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Used to revert the effect of phy_speed_down + */ +int phy_speed_up(struct phy_device *phydev, bool sync) >>> >>> And here, i'm thinking the opposite. A MAC driver needs to be ready >>> for the PHY state to change at any time. So why do we need to wait? >>> Just let the normal mechanisms inform the MAC when the link is up. >>> >> I see your points, thanks for the feedback. In my case WoL triggers >> a PCI PME and the code works as expected, but I agree this may be >> different in other setups (external PHY). >> >> The sync parameter was inspired by following comment from Florian: >> "One thing that bothers me a bit is that this should ideally be >> offered as both blocking and non-blocking options" >> So let's see which comments he may have before preparing a v2. 
> > What I had in mind is that you would be able to register a callback that > would tell you when auto-negotiation completes, and not register one if > you did not want to have that information. > > As Andrew points out though, with PHY using interrupts, this might be a > bit challenging to do because you will get an interrupt about "something > has changed" and you would have to run the callback from the PHY state > machine to determine this was indeed a result of triggering > auto-negotiation. Maybe polling for auto-negotiation like you do here is > good enough. > OK, then I would poll for autoneg finished in phy_speed_down and remove the polling option from phy_speed_up. I will do some tests with this before submitting a v2. > One nit, you might have to check for those functions that the PHY did > have auto-negotiation enabled and was not forced. > This I'm doing already, or do you mean something different?
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
On 07/11/2018 02:08 PM, Heiner Kallweit wrote: > On 11.07.2018 22:55, Andrew Lunn wrote: >>> +/** >>> + * phy_speed_down - set speed to lowest speed supported by both link >>> partners >>> + * @phydev: the phy_device struct >>> + * @sync: perform action synchronously >>> + * >>> + * Description: Typically used to save energy when waiting for a WoL packet >>> + */ >>> +int phy_speed_down(struct phy_device *phydev, bool sync) >> >> This sync parameter needs some more thought. I'm not sure it is safe. >> >> How does a PHY trigger a WoL wake up? I guess some use the interrupt >> pin. How does a PHY indicate auto-neg has completed? It triggers an >> interrupt. So it seems like there is a danger here we suspend, and >> then wake up 2 seconds later when auto-neg has completed. >> >> I'm not sure we can safely suspend until auto-neg has completed. >> >>> +/** >>> + * phy_speed_up - (re)set advertised speeds to all supported speeds >>> + * @phydev: the phy_device struct >>> + * @sync: perform action synchronously >>> + * >>> + * Description: Used to revert the effect of phy_speed_down >>> + */ >>> +int phy_speed_up(struct phy_device *phydev, bool sync) >> >> And here, i'm thinking the opposite. A MAC driver needs to be ready >> for the PHY state to change at any time. So why do we need to wait? >> Just let the normal mechanisms inform the MAC when the link is up. >> > I see your points, thanks for the feedback. In my case WoL triggers > a PCI PME and the code works as expected, but I agree this may be > different in other setups (external PHY). > > The sync parameter was inspired by following comment from Florian: > "One thing that bothers me a bit is that this should ideally be > offered as both blocking and non-blocking options" > So let's see which comments he may have before preparing a v2. 
What I had in mind is that you would be able to register a callback that would tell you when auto-negotiation completes, and not register one if you did not want to have that information. As Andrew points out though, with PHY using interrupts, this might be a bit challenging to do because you will get an interrupt about "something has changed" and you would have to run the callback from the PHY state machine to determine this was indeed a result of triggering auto-negotiation. Maybe polling for auto-negotiation like you do here is good enough. One nit, you might have to check for those functions that the PHY did have auto-negotiation enabled and was not forced. -- Florian
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
On 11.07.2018 22:55, Andrew Lunn wrote: >> +/** >> + * phy_speed_down - set speed to lowest speed supported by both link >> partners >> + * @phydev: the phy_device struct >> + * @sync: perform action synchronously >> + * >> + * Description: Typically used to save energy when waiting for a WoL packet >> + */ >> +int phy_speed_down(struct phy_device *phydev, bool sync) > > This sync parameter needs some more thought. I'm not sure it is safe. > > How does a PHY trigger a WoL wake up? I guess some use the interrupt > pin. How does a PHY indicate auto-neg has completed? It triggers an > interrupt. So it seems like there is a danger here we suspend, and > then wake up 2 seconds later when auto-neg has completed. > > I'm not sure we can safely suspend until auto-neg has completed. > >> +/** >> + * phy_speed_up - (re)set advertised speeds to all supported speeds >> + * @phydev: the phy_device struct >> + * @sync: perform action synchronously >> + * >> + * Description: Used to revert the effect of phy_speed_down >> + */ >> +int phy_speed_up(struct phy_device *phydev, bool sync) > > And here, i'm thinking the opposite. A MAC driver needs to be ready > for the PHY state to change at any time. So why do we need to wait? > Just let the normal mechanisms inform the MAC when the link is up. > I see your points, thanks for the feedback. In my case WoL triggers a PCI PME and the code works as expected, but I agree this may be different in other setups (external PHY). The sync parameter was inspired by following comment from Florian: "One thing that bothers me a bit is that this should ideally be offered as both blocking and non-blocking options" So let's see which comments he may have before preparing a v2. > Andrew > Heiner
Re: [PATCH net-next 1/2] net: phy: add helper phy_config_aneg
On 07/11/2018 01:30 PM, Heiner Kallweit wrote: > This functionality will also be needed in subsequent patches of this > series, therefore factor it out to a helper. > > Signed-off-by: Heiner Kallweit Reviewed-by: Florian Fainelli -- Florian
Re: [PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
> +/** > + * phy_speed_down - set speed to lowest speed supported by both link partners > + * @phydev: the phy_device struct > + * @sync: perform action synchronously > + * > + * Description: Typically used to save energy when waiting for a WoL packet > + */ > +int phy_speed_down(struct phy_device *phydev, bool sync) This sync parameter needs some more thought. I'm not sure it is safe. How does a PHY trigger a WoL wake up? I guess some use the interrupt pin. How does a PHY indicate auto-neg has completed? It triggers an interrupt. So it seems like there is a danger here we suspend, and then wake up 2 seconds later when auto-neg has completed. I'm not sure we can safely suspend until auto-neg has completed. > +/** > + * phy_speed_up - (re)set advertised speeds to all supported speeds > + * @phydev: the phy_device struct > + * @sync: perform action synchronously > + * > + * Description: Used to revert the effect of phy_speed_down > + */ > +int phy_speed_up(struct phy_device *phydev, bool sync) And here, i'm thinking the opposite. A MAC driver needs to be ready for the PHY state to change at any time. So why do we need to wait? Just let the normal mechanisms inform the MAC when the link is up. Andrew
Re: [PATCH net-next 1/2] net: phy: add helper phy_config_aneg
On Wed, Jul 11, 2018 at 10:30:27PM +0200, Heiner Kallweit wrote: > This functionality will also be needed in subsequent patches of this > series, therefore factor it out to a helper. > > Signed-off-by: Heiner Kallweit Reviewed-by: Andrew Lunn Andrew
Re: [PATCH net-next] tc-testing: add geneve options in tunnel_key unit tests
On Tue, Jul 10, 2018 at 9:22 PM, Jakub Kicinski wrote: > From: Pieter Jansen van Vuuren > > Extend tc tunnel_key action unit tests with geneve options. Tests > include testing single and multiple geneve options, as well as > testing geneve options that are expected to fail. > > Signed-off-by: Pieter Jansen van Vuuren Acked-by: Lucas Bates
Re: [PATCH bpf-next] bpf: better availability probing for seg6 helpers
On 07/10/2018 09:20 PM, Daniel Borkmann wrote: > On 07/10/2018 06:54 PM, Mathieu Xhonneux wrote: >> bpf_lwt_seg6_* helpers require CONFIG_IPV6_SEG6_BPF, and currently >> return -EOPNOTSUPP to indicate unavailability. This patch forces the >> BPF verifier to reject programs using these helpers when >> !CONFIG_IPV6_SEG6_BPF, allowing users to more easily probe if they are >> available or not. >> >> Signed-off-by: Mathieu Xhonneux > > Note, just fyi, this would need to go to bpf tree (and not bpf-next) as > otherwise there's a change in behavior. Applied, thanks Mathieu!
[PATCH net-next 2/2] net: phy: add phy_speed_down and phy_speed_up
Some network drivers include functionality to speed down the PHY when suspending and just waiting for a WoL packet because this saves energy. This functionality is quite generic, therefore let's factor it out to phylib. Signed-off-by: Heiner Kallweit --- drivers/net/phy/phy.c | 78 +++ include/linux/phy.h | 2 ++ 2 files changed, 80 insertions(+) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index c4aa360d..0547c603 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -551,6 +551,84 @@ int phy_start_aneg(struct phy_device *phydev) } EXPORT_SYMBOL(phy_start_aneg); +static int phy_poll_aneg_done(struct phy_device *phydev) +{ + unsigned int retries = 100; + int ret; + + do { + msleep(100); + ret = phy_aneg_done(phydev); + } while (!ret && --retries); + + if (!ret) + return -ETIMEDOUT; + + return ret < 0 ? ret : 0; +} + +/** + * phy_speed_down - set speed to lowest speed supported by both link partners + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Typically used to save energy when waiting for a WoL packet + */ +int phy_speed_down(struct phy_device *phydev, bool sync) +{ + u32 adv = phydev->lp_advertising & phydev->supported; + u32 adv_old = phydev->advertising; + int ret; + + if (phydev->autoneg != AUTONEG_ENABLE) + return 0; + + if (adv & PHY_10BT_FEATURES) + phydev->advertising &= ~(PHY_100BT_FEATURES | +PHY_1000BT_FEATURES); + else if (adv & PHY_100BT_FEATURES) + phydev->advertising &= ~PHY_1000BT_FEATURES; + + if (phydev->advertising == adv_old) + return 0; + + ret = phy_config_aneg(phydev); + if (ret) + return ret; + + return sync ? 
phy_poll_aneg_done(phydev) : 0; +} +EXPORT_SYMBOL_GPL(phy_speed_down); + +/** + * phy_speed_up - (re)set advertised speeds to all supported speeds + * @phydev: the phy_device struct + * @sync: perform action synchronously + * + * Description: Used to revert the effect of phy_speed_down + */ +int phy_speed_up(struct phy_device *phydev, bool sync) +{ + u32 mask = PHY_10BT_FEATURES | PHY_100BT_FEATURES | PHY_1000BT_FEATURES; + u32 adv_old = phydev->advertising; + int ret; + + if (phydev->autoneg != AUTONEG_ENABLE) + return 0; + + phydev->advertising = (adv_old & ~mask) | (phydev->supported & mask); + + if (phydev->advertising == adv_old) + return 0; + + ret = phy_config_aneg(phydev); + if (ret) + return ret; + + return sync ? phy_poll_aneg_done(phydev) : 0; +} +EXPORT_SYMBOL_GPL(phy_speed_up); + /** * phy_start_machine - start PHY state machine tracking * @phydev: the phy_device struct diff --git a/include/linux/phy.h b/include/linux/phy.h index 6cd09098..275f528e 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -942,6 +942,8 @@ void phy_start(struct phy_device *phydev); void phy_stop(struct phy_device *phydev); int phy_start_aneg(struct phy_device *phydev); int phy_aneg_done(struct phy_device *phydev); +int phy_speed_down(struct phy_device *phydev, bool sync); +int phy_speed_up(struct phy_device *phydev, bool sync); int phy_stop_interrupts(struct phy_device *phydev); int phy_restart_aneg(struct phy_device *phydev); -- 2.18.0
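The advertising-reduction step at the heart of phy_speed_down() can be sketched as plain C. The feature bits below are illustrative stand-ins, not the real PHY_*_FEATURES values from include/linux/phy.h:

```c
#include <stdint.h>

/* Stand-in feature bits; the real PHY_*_FEATURES masks differ. */
#define F_10BT   0x1u
#define F_100BT  0x2u
#define F_1000BT 0x4u

/* Sketch of phy_speed_down()'s advertising reduction: if both link
 * partners can do 10BASE-T, stop advertising everything faster;
 * otherwise, if both can do 100BASE-T, stop advertising gigabit. */
uint32_t speed_down_advert(uint32_t advertising, uint32_t lp_advertising,
                           uint32_t supported)
{
    uint32_t adv = lp_advertising & supported;

    if (adv & F_10BT)
        advertising &= ~(F_100BT | F_1000BT);
    else if (adv & F_100BT)
        advertising &= ~F_1000BT;

    return advertising;
}
```

With all three speeds common to both sides this leaves only the 10BASE-T bit set; when the result equals the old advertising mask, phy_speed_down() returns early without renegotiating.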
Re: [PATCH iproute2-next] ipaddress: fix label matching
❦ 11 July 2018 13:03 -0700, Stephen Hemminger : >> Since 9516823051ce, "ip addr show label lo:1" doesn't work >> anymore (doesn't show any address, despite a matching label). >> Reverting to return 0 instead of -1 fixes the issue. >> >> However, the condition says: "if we filter by label [...] and the >> label does NOT match the interface name". It makes little sense to >> compare the label with the interface name. There is also logic >> around whether a filter family is provided. The match against the >> label is done by ifa_label_match_rta() in print_addrinfo() and >> ipaddr_filter(). >> >> Just removing the condition makes "ip addr show" work as expected >> with or without specifying a label, both when the label is matching >> and not matching. It also works if we specify a label and the label is >> the interface name. The flush operation also works as expected. >> >> Fixes: 9516823051ce ("ipaddress: Improve print_linkinfo()") >> Signed-off-by: Vincent Bernat >> --- >> ip/ipaddress.c | 5 - >> 1 file changed, 5 deletions(-) >> >> diff --git a/ip/ipaddress.c b/ip/ipaddress.c >> index 5009bfe6d2e3..20ef6724944e 100644 >> --- a/ip/ipaddress.c >> +++ b/ip/ipaddress.c >> @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, >> if (!name) >> return -1; >> >> -if (filter.label && >> -(!filter.family || filter.family == AF_PACKET) && >> -fnmatch(filter.label, name, 0)) >> -return -1; >> - >> if (tb[IFLA_GROUP]) { >> int group = rta_getattr_u32(tb[IFLA_GROUP]); >> > > If this is a regression, it should go to iproute2 not iproute2-next. > > Surprised by the solution since it is removing code that was there > before the commit you referenced in Fixes. Yes, but as I explain in the commit message, the condition does not make sense to me: why would we match the label against the interface name? This code has existed for a long time. -- The lunatic, the lover, and the poet, Are of imagination all compact... -- Wm. Shakespeare, "A Midsummer Night's Dream"
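The matching behaviour under discussion comes down to fnmatch(): ifa_label_match_rta() compares the filter pattern against the address label, while the removed print_linkinfo() check ran the same comparison against the interface name. A minimal userspace sketch shows why a label such as "lo:1" can never match the interface name "lo":

```c
#include <fnmatch.h>

/* Returns 1 when the user-supplied pattern matches the given string,
 * mirroring ifa_label_match_rta()'s use of fnmatch(). Run against the
 * interface name instead of the label (as the removed check did),
 * "lo:1" never matches "lo", so every address was filtered out. */
int label_matches(const char *pattern, const char *label)
{
    return fnmatch(pattern, label, 0) == 0;
}
```

Shell-style wildcards still work through this path, e.g. pattern "lo*" matches label "lo:1".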
[PATCH net-next 1/2] net: phy: add helper phy_config_aneg
This functionality will also be needed in subsequent patches of this series, so factor it out into a helper. Signed-off-by: Heiner Kallweit --- drivers/net/phy/phy.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 537297d2..c4aa360d 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -467,6 +467,14 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq *ifr, int cmd) } EXPORT_SYMBOL(phy_mii_ioctl); +static int phy_config_aneg(struct phy_device *phydev) +{ + if (phydev->drv->config_aneg) + return phydev->drv->config_aneg(phydev); + else + return genphy_config_aneg(phydev); +} + /** * phy_start_aneg_priv - start auto-negotiation for this PHY device * @phydev: the phy_device struct @@ -493,10 +501,7 @@ static int phy_start_aneg_priv(struct phy_device *phydev, bool sync) /* Invalidate LP advertising flags */ phydev->lp_advertising = 0; - if (phydev->drv->config_aneg) - err = phydev->drv->config_aneg(phydev); - else - err = genphy_config_aneg(phydev); + err = phy_config_aneg(phydev); if (err < 0) goto out_unlock; -- 2.18.0
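The helper's hook-or-fallback shape can be sketched standalone; the structs below are minimal stand-ins for struct phy_device / struct phy_driver, not the real definitions:

```c
#include <stddef.h>

/* Minimal stand-ins; only the config_aneg hook matters here. */
struct fake_drv { int (*config_aneg)(void *phydev); };
struct fake_phy { struct fake_drv *drv; };

static int generic_config_aneg(void *phydev)
{
    (void)phydev;
    return 0;   /* pretend genphy_config_aneg() succeeded */
}

/* Example driver-specific hook, used only to exercise the helper. */
static int sample_drv_aneg(void *phydev)
{
    (void)phydev;
    return 42;
}

/* Mirrors the new phy_config_aneg() helper: prefer the driver's hook
 * when one is provided, otherwise fall back to the generic routine. */
int config_aneg(struct fake_phy *phydev)
{
    if (phydev->drv->config_aneg)
        return phydev->drv->config_aneg(phydev);
    return generic_config_aneg(phydev);
}
```

Centralising this if/else means later callers (such as phy_speed_down()/phy_speed_up() in patch 2/2) cannot forget the generic fallback.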
[PATCH net-next 0/2] net: phy: add functionality to speed down PHY when waiting for WoL packet
Some network drivers include functionality to speed down the PHY when suspending and just waiting for a WoL packet because this saves energy. This series is based on our recent discussion about factoring out this functionality to phylib. The first user will be the r8169 driver. Heiner Kallweit (2): net: phy: add helper phy_config_aneg net: phy: add phy_speed_down and phy_speed_up drivers/net/phy/phy.c | 91 +-- include/linux/phy.h | 2 + 2 files changed, 89 insertions(+), 4 deletions(-) -- 2.18.0
Re: [net-next PATCH] net: ipv4: fix listify ip_rcv_finish in case of forwarding
On Wed, 11 Jul 2018 16:41:35 +0100 Edward Cree wrote: > On 11/07/18 16:01, Jesper Dangaard Brouer wrote: > > In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling > > dst_input(skb) was split-out. The ip_sublist_rcv_finish() just calls > > dst_input(skb) in a loop. > > > > The problem is that ip_sublist_rcv_finish() forgot to remove the SKB > > from the list before invoking dst_input(). Furthermore, we need to > > clear skb->next as other parts of the network stack use another kind > > of SKB lists for xmit_more (see dev_hard_start_xmit). > > > > A crash occurs if e.g. dst_input() invokes ip_forward(), which calls > > dst_output()/ip_output() that eventually calls __dev_queue_xmit() + > > sch_direct_xmit(), with the crash then surfacing in validate_xmit_skb_list(). > > > > This patch only fixes the crash, but there is a huge potential for > > a performance boost if we can pass an SKB-list through to ip_forward. > > > > Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") > > Signed-off-by: Jesper Dangaard Brouer > Acked-by: Edward Cree > > But it feels weird and asymmetric to only NULL skb->next (not ->prev), and > to have to do this by hand rather than e.g. being able to use > list_del_init(&skb->list). Hopefully this can be revisited once > sch_direct_xmit() has been changed to use the list_head rather than SKB > special lists. I cannot use list_del_init(&skb->list); it would also break. This is a fix, and this code should be revisited. The reason I used the list_del() + skb->next = NULL combo is to keep as much of the list-poisoning as possible, e.g. 'prev' will be LIST_POISON2. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
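The unlink pattern being defended here can be sketched on a minimal doubly-linked list; the poison values are illustrative copies of the kernel's LIST_POISON1/LIST_POISON2:

```c
#include <stddef.h>

struct node { struct node *next, *prev; };

/* Illustrative stand-ins for the kernel's list poison constants. */
#define POISON1 ((struct node *)0x100)
#define POISON2 ((struct node *)0x200)

/* list_del(): unlink and poison both pointers, as include/linux/list.h
 * does, so a use-after-unlink faults loudly. */
static void node_del(struct node *n)
{
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = POISON1;
    n->prev = POISON2;
}

/* The fix's pattern: after unlinking, clear only next, because code such
 * as dev_hard_start_xmit() walks ->next as an SKB chain and must see a
 * one-entry chain; prev keeps its poison value to catch misuse. */
void unlink_for_dst_input(struct node *n)
{
    node_del(n);
    n->next = NULL;
}
```

list_del_init() would instead point both next and prev back at the node itself, losing the poisoning and confusing ->next-walking code, which is why it "would also break" here.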
Re: [PATCH bpf-next v3 00/13] tools: bpf: extend bpftool prog load
On 07/10/2018 11:42 PM, Jakub Kicinski wrote: > Hi! > > This series starts with two minor clean-ups to the test_offload.py > selftest script. > > The next 11 patches extend the abilities of bpftool prog load > beyond the simple cgroup use cases. Three new parameters are > added: > > - type - allows specifying program type, independent of how >code sections are named; > - map - allows reusing existing maps, instead of creating a new >map on every program load; > - dev - offload/binding to a device. > > A number of changes to libbpf are required to accomplish the task. > The section-name-to-program-type mapping logic is exposed. We should > probably aim to use the libbpf program section naming everywhere. > For reuse of maps we need to allow users to set the FD for a bpf map > object in libbpf. > > Examples > > Load program my_xdp.o and pin it as /sys/fs/bpf/my_xdp, for xdp > program type: > > $ bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp \ > type xdp > > As above but for offload: > > $ bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp \ > type xdp \ > dev netdevsim0 > > Load program my_maps.o, but for the first map reuse map id 17, > and for the map called "other_map" reuse pinned map /sys/fs/bpf/map0: > > $ bpftool prog load my_maps.o /sys/fs/bpf/prog \ > map idx 0 id 17 \ > map name other_map pinned /sys/fs/bpf/map0 > > --- > v3: > - fix return codes in patch 5; > - rename libbpf_prog_type_by_string() -> libbpf_prog_type_by_name(); > - fold file path into xattr in patch 8; > - add patch 10; > - use dup3() in patch 12; > - depend on fd value in patch 12; > - close old fd in patch 12. > v2: > - add compat for reallocarray(). Applied to bpf-next, thanks Jakub!
Re: [net-next PATCH] net: ipv4: fix listify ip_rcv_finish in case of forwarding
On Wed, 11 Jul 2018 19:05:20 + Saeed Mahameed wrote: > On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote: > > Only driver sfc actually uses this, but I don't have this NIC, so I > > tested this on mlx5, with my own changes to make it use > > netif_receive_skb_list(), > > but I'm not ready to upstream the mlx5 driver change yet. > > > Thanks Jesper for sharing this; should we look forward to those patches > or do you want us to implement them? Well, I would prefer you to implement those. I just did a quick implementation (it's trivially easy) so I have something to benchmark with. The performance boost is quite impressive! One reason I didn't "just" send a patch is that Edward so far only implemented netif_receive_skb_list() and not napi_gro_receive_list(). And your driver uses napi_gro_receive(). This effectively disables GRO for your driver, which is not a choice I can make. Interestingly, I get around the same netperf TCP_STREAM performance. I assume we can get even better perf if we "listify" napi_gro_receive. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH] of: mdio: Support fixed links in of_phy_get_and_connect()
On Wed, Jul 11, 2018 at 07:45:11PM +0200, Linus Walleij wrote: > By a simple extension of of_phy_get_and_connect(), drivers > that connect over e.g. RGMII can also support > fixed links, so in addition to: > > ethernet-port { > phy-mode = "rgmii"; > phy-handle = <&foo>; > }; > > This setup with a fixed-link node and no phy-handle will > now also work just fine: > > ethernet-port { > phy-mode = "rgmii"; > fixed-link { > speed = <1000>; > full-duplex; > pause; > }; > }; > > This is very helpful for connecting random ethernet ports > to e.g. DSA switches that typically reside on fixed links. > > The phy-mode is still there as the fixed link in this case > is still an RGMII link. > > Tested on the Cortina Gemini driver with the Vitesse DSA > router chip on a fixed 1Gbit link. > > Suggested-by: Andrew Lunn > Signed-off-by: Linus Walleij Reviewed-by: Andrew Lunn What probably makes sense as a follow-up is to add an of_phy_disconnect_and_put(). When the module is unloaded, you leak a fixed link, because of_phy_deregister_fixed_link() is not being called. You also hold a reference to np which does not appear to be released. Andrew
Re: [PATCH bpf-next v4 3/3] bpf: btf: print map dump and lookup with btf info
On Tue, 10 Jul 2018 20:21:11 -0700, Okash Khawaja wrote: > + if (err || btf_info.btf_size > last_size) { > + err = errno; errno may not be set in the btf_info.btf_size > last_size case. Also, errno is positive here, while the other error return codes are negative. > + goto exit_free; > + }
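The two review points can be illustrated with a small sketch (check_btf_size() is a hypothetical stand-in for the code under review, not a bpftool function): a purely local size check never touches errno, so "err = errno" there reads stale state, and a raw errno is positive while the surrounding code returns negative error codes.

```c
#include <errno.h>

/* Sketch of the suggested shape: for a local consistency failure,
 * return an explicit negative code instead of reading errno, which the
 * last failed syscall may or may not have set. */
int check_btf_size(unsigned int btf_size, unsigned int last_size)
{
    if (btf_size > last_size)
        return -E2BIG;      /* explicit and negative; errno untouched */
    return 0;
}
```

For the genuine syscall-failure branch, the conventional conversion is err = -errno, taken immediately after the failing call.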
Re: [PATCH iproute2-next] ipaddress: fix label matching
On Wed, 11 Jul 2018 13:36:03 +0200 Vincent Bernat wrote: > Since 9516823051ce, "ip addr show label lo:1" doesn't work > anymore (doesn't show any address, despite a matching label). > Reverting to return 0 instead of -1 fixes the issue. > > However, the condition says: "if we filter by label [...] and the > label does NOT match the interface name". It makes little sense to > compare the label with the interface name. There is also logic > around whether a filter family is provided. The match against the > label is done by ifa_label_match_rta() in print_addrinfo() and > ipaddr_filter(). > > Just removing the condition makes "ip addr show" work as expected > with or without specifying a label, both when the label is matching > and not matching. It also works if we specify a label and the label is > the interface name. The flush operation also works as expected. > > Fixes: 9516823051ce ("ipaddress: Improve print_linkinfo()") > Signed-off-by: Vincent Bernat > --- > ip/ipaddress.c | 5 - > 1 file changed, 5 deletions(-) > > diff --git a/ip/ipaddress.c b/ip/ipaddress.c > index 5009bfe6d2e3..20ef6724944e 100644 > --- a/ip/ipaddress.c > +++ b/ip/ipaddress.c > @@ -837,11 +837,6 @@ int print_linkinfo(const struct sockaddr_nl *who, > if (!name) > return -1; > > - if (filter.label && > - (!filter.family || filter.family == AF_PACKET) && > - fnmatch(filter.label, name, 0)) > - return -1; > - > if (tb[IFLA_GROUP]) { > int group = rta_getattr_u32(tb[IFLA_GROUP]); > If this is a regression, it should go to iproute2 not iproute2-next. Surprised by the solution since it is removing code that was there before the commit you referenced in Fixes.
Re: [PATCH net-next 2/5 v3] net: gemini: Improve connection prints
On Wed, Jul 11, 2018 at 09:32:42PM +0200, Linus Walleij wrote: > Switch over to using a module parameter and debug prints > that can be controlled by this or ethtool like everyone > else. Depromote all other prints to debug messages. > > The phy_print_status() was already in place, albeit never > really used because the debuglevel hiding it had to be > set up using ethtool. > > Signed-off-by: Linus Walleij Reviewed-by: Andrew Lunn Andrew
Re: [PATCH net-next 5/5 v3] net: gemini: Indicate that we can handle jumboframes
On Wed, Jul 11, 2018 at 09:32:45PM +0200, Linus Walleij wrote: > The hardware supposedly handles frames up to 10236 bytes and > implements .ndo_change_mtu() so accept 10236 minus the ethernet > header for a VLAN tagged frame on the netdevices. Use > ETH_MIN_MTU as minimum MTU. > > Signed-off-by: Linus Walleij Reviewed-by: Andrew Lunn Andrew
[PATCH v3 net-next 01/19] net: Add decrypted field to skb
The decrypted bit is propagated to cloned/copied skbs. This will be used later by the inline crypto receive side offload of tls. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- include/linux/skbuff.h | 7 ++- net/core/skbuff.c | 6 ++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 7601838..3ceb8dc 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -630,6 +630,7 @@ enum { * @hash: the packet hash * @queue_mapping: Queue mapping for multiqueue devices * @xmit_more: More SKBs are pending for this queue + * @decrypted: Decrypted SKB * @ndisc_nodetype: router type (from link layer) * @ooo_okay: allow the mapping of a socket to a queue to be changed * @l4_hash: indicate hash is a canonical 4-tuple hash over transport @@ -736,7 +737,11 @@ struct sk_buff { peeked:1, head_frag:1, xmit_more:1, - __unused:1; /* one bit hole */ +#ifdef CONFIG_TLS_DEVICE + decrypted:1; +#else + __unused:1; +#endif /* fields enclosed in headers_start/headers_end are copied * using a single memcpy() in __copy_skb_header() diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c4e24ac..cfd6c6f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -805,6 +805,9 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old) * It is not yet because we do not want to have a 16 bit hole */ new->queue_mapping = old->queue_mapping; +#ifdef CONFIG_TLS_DEVICE + new->decrypted = old->decrypted; +#endif memcpy(&new->headers_start, &old->headers_start, offsetof(struct sk_buff, headers_end) - @@ -865,6 +868,9 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb) C(head_frag); C(data); C(truesize); +#ifdef CONFIG_TLS_DEVICE + C(decrypted); +#endif refcount_set(&n->users, 1); atomic_inc(&(skb_shinfo(skb)->dataref)); -- 1.8.3.1
[PATCH v3 net-next 18/19] net/mlx5e: IPsec, fix byte count in CQE
This patch fixes the byte count indication in CQE for processed IPsec packets that contain a metadata header. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c index fda7929..128a82b 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c @@ -364,6 +364,7 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, } remove_metadata_hdr(skb); + *cqe_bcnt -= MLX5E_METADATA_ETHER_LEN; return skb; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h index 2bfbbef..ca47c05 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h @@ -41,7 +41,7 @@ #include "en.h" struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, - struct sk_buff *skb); + struct sk_buff *skb, u32 *cqe_bcnt); void mlx5e_ipsec_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe); void mlx5e_ipsec_inverse_table_init(void); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 847e195..4a85b26 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -1470,7 +1470,7 @@ void mlx5e_ipsec_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe) mlx5e_free_rx_wqe(rq, wi); goto wq_ll_pop; } - skb = mlx5e_ipsec_handle_rx_skb(rq->netdev, skb); + skb = mlx5e_ipsec_handle_rx_skb(rq->netdev, skb, &cqe_bcnt); if (unlikely(!skb)) { 
mlx5e_free_rx_wqe(rq, wi); goto wq_ll_pop; -- 1.8.3.1
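The fix pairs header removal with the byte-count adjustment. A userspace sketch of that pairing (META_LEN is a stand-in for MLX5E_METADATA_ETHER_LEN, and the frame layout is simplified):

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6
#define META_LEN 8u   /* stand-in for MLX5E_METADATA_ETHER_LEN */

/* Sketch: the inline metadata sits between the two MAC addresses and
 * the real ethertype. Removing it slides the 12 address bytes forward
 * and, crucially, shrinks the completion byte count by the same
 * amount, which is the adjustment this patch adds. */
uint8_t *strip_metadata(uint8_t *data, uint32_t *cqe_bcnt)
{
    memmove(data + META_LEN, data, 2 * ETH_ALEN);
    *cqe_bcnt -= META_LEN;
    return data + META_LEN;   /* new frame start, as after skb_pull() */
}
```

Without the subtraction, upper layers would account for METADATA_ETHER_LEN bytes that are no longer part of the frame.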
[PATCH v3 net-next 03/19] net: Add TLS rx resync NDO
Add new netdev tls op for resynchronizing HW tls context Signed-off-by: Boris Pismenny --- include/linux/netdevice.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b683971..0434df3 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -903,6 +903,8 @@ struct tlsdev_ops { void (*tls_dev_del)(struct net_device *netdev, struct tls_context *ctx, enum tls_offload_ctx_dir direction); + void (*tls_dev_resync_rx)(struct net_device *netdev, + struct sock *sk, u32 seq, u64 rcd_sn); }; #endif -- 1.8.3.1
[PATCH v3 net-next 19/19] net/mlx5e: Kconfig, mutually exclude compilation of TLS and IPsec accel
We currently have no devices that support both TLS and IPsec using the accel framework, and the current code does not handle that combination either. This patch prevents such configurations. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index 2545296..d3e8c70 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -93,6 +93,7 @@ config MLX5_EN_TLS depends on TLS_DEVICE depends on TLS=y || MLX5_CORE=m depends on MLX5_ACCEL + depends on !MLX5_EN_IPSEC default n ---help--- Build support for TLS cryptography-offload accelaration in the NIC. -- 1.8.3.1
KASAN: slab-out-of-bounds Read in rds_cong_queue_updates (2)
Hello, syzbot found the following crash on: HEAD commit:0026129c8629 rhashtable: add restart routine in rhashtable.. git tree: net console output: https://syzkaller.appspot.com/x/log.txt?x=10b7ced040 kernel config: https://syzkaller.appspot.com/x/.config?x=b88de6eac8694da6 dashboard link: https://syzkaller.appspot.com/bug?extid=0570fef57a5e020bdc87 compiler: gcc (GCC) 8.0.1 20180413 (experimental) Unfortunately, I don't have any reproducer for this crash yet. IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+0570fef57a5e020bd...@syzkaller.appspotmail.com == BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline] BUG: KASAN: slab-out-of-bounds in refcount_read include/linux/refcount.h:42 [inline] BUG: KASAN: slab-out-of-bounds in check_net include/net/net_namespace.h:237 [inline] BUG: KASAN: slab-out-of-bounds in rds_destroy_pending net/rds/rds.h:902 [inline] BUG: KASAN: slab-out-of-bounds in rds_cong_queue_updates+0x25d/0x5b0 net/rds/cong.c:226 Read of size 4 at addr 88019f8ec204 by task syz-executor1/27023 CPU: 0 PID: 27023 Comm: syz-executor1 Not tainted 4.18.0-rc3+ #5 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline] refcount_read include/linux/refcount.h:42 [inline] check_net include/net/net_namespace.h:237 [inline] rds_destroy_pending net/rds/rds.h:902 [inline] rds_cong_queue_updates+0x25d/0x5b0 net/rds/cong.c:226 rds_recv_rcvbuf_delta.part.3+0x332/0x3e0 
net/rds/recv.c:123 rds_recv_rcvbuf_delta net/rds/recv.c:382 [inline] rds_recv_incoming+0x85a/0x1320 net/rds/recv.c:382 netlink: 'syz-executor2': attribute type 18 has an invalid length. rds_loop_xmit+0x16a/0x340 net/rds/loop.c:95 rds_send_xmit+0x1343/0x29c0 net/rds/send.c:355 netlink: 180 bytes leftover after parsing attributes in process `syz-executor5'. rds_sendmsg+0x229e/0x2a40 net/rds/send.c:1243 netlink: 180 bytes leftover after parsing attributes in process `syz-executor5'. sock_sendmsg_nosec net/socket.c:641 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:651 __sys_sendto+0x3d7/0x670 net/socket.c:1797 __do_sys_sendto net/socket.c:1809 [inline] __se_sys_sendto net/socket.c:1805 [inline] __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x455e29 Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7fd164b21c68 EFLAGS: 0246 ORIG_RAX: 002c RAX: ffda RBX: 7fd164b226d4 RCX: 00455e29 RDX: 0481 RSI: 2000 RDI: 0013 RBP: 0072bea0 R08: 2069affb R09: 0010 R10: R11: 0246 R12: R13: 004c14f2 R14: 004d1a08 R15: Allocated by task 26052: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554 getname_flags+0xd0/0x5a0 fs/namei.c:140 getname+0x19/0x20 fs/namei.c:211 do_sys_open+0x3a2/0x760 fs/open.c:1095 __do_sys_open fs/open.c:1119 [inline] __se_sys_open fs/open.c:1114 [inline] __x64_sys_open+0x7e/0xc0 fs/open.c:1114 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 26052: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kmem_cache_free+0x86/0x2d0 mm/slab.c:3756 putname+0xf2/0x130 fs/namei.c:261 do_sys_open+0x569/0x760 fs/open.c:1110 __do_sys_open fs/open.c:1119 [inline] __se_sys_open fs/open.c:1114 [inline] __x64_sys_open+0x7e/0xc0 fs/open.c:1114 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at 88019f8ec280 which belongs to the cache names_cache of size 4096 The buggy address is located 124 bytes to the left of 4096-byte region [88019f8ec280, 88019f8ed280) The buggy add
[PATCH v3 net-next 14/19] net/mlx5e: TLS, add Innova TLS rx data path
Implement the TLS rx offload data path according to the requirements of the TLS generic NIC offload infrastructure. Special metadata ethertype is used to pass information to the hardware. When hardware loses synchronization a special resync request metadata message is used to request resync. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 112 - .../mellanox/mlx5/core/en_accel/tls_rxtx.h | 3 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 6 ++ 3 files changed, 118 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index c96196f..d460fda 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -33,6 +33,12 @@ #include "en_accel/tls.h" #include "en_accel/tls_rxtx.h" +#include +#include + +#define SYNDROM_DECRYPTED 0x30 +#define SYNDROM_RESYNC_REQUEST 0x31 +#define SYNDROM_AUTH_FAILED 0x32 #define SYNDROME_OFFLOAD_REQUIRED 32 #define SYNDROME_SYNC 33 @@ -44,10 +50,26 @@ struct sync_info { skb_frag_t frags[MAX_SKB_FRAGS]; }; -struct mlx5e_tls_metadata { +struct recv_metadata_content { + u8 syndrome; + u8 reserved; + __be32 sync_seq; +} __packed; + +struct send_metadata_content { /* One byte of syndrome followed by 3 bytes of swid */ __be32 syndrome_swid; __be16 first_seq; +} __packed; + +struct mlx5e_tls_metadata { + union { + /* from fpga to host */ + struct recv_metadata_content recv; + /* from host to fpga */ + struct send_metadata_content send; + unsigned char raw[6]; + } __packed content; /* packet type ID field */ __be16 ethertype; } __packed; @@ -68,7 +90,8 @@ static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid) 2 * ETH_ALEN); eth->h_proto = cpu_to_be16(MLX5E_METADATA_ETHER_TYPE); - pet->syndrome_swid = htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid; + pet->content.send.syndrome_swid = + 
htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid; return 0; } @@ -149,7 +172,7 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb, pet = (struct mlx5e_tls_metadata *)(nskb->data + sizeof(struct ethhdr)); memcpy(pet, &syndrome, sizeof(syndrome)); - pet->first_seq = htons(tcp_seq); + pet->content.send.first_seq = htons(tcp_seq); /* MLX5 devices don't care about the checksum partial start, offset * and pseudo header @@ -276,3 +299,86 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev, out: return skb; } + +static int tls_update_resync_sn(struct net_device *netdev, + struct sk_buff *skb, + struct mlx5e_tls_metadata *mdata) +{ + struct sock *sk = NULL; + struct iphdr *iph; + struct tcphdr *th; + __be32 seq; + + if (mdata->ethertype != htons(ETH_P_IP)) + return -EINVAL; + + iph = (struct iphdr *)(mdata + 1); + + th = ((void *)iph) + iph->ihl * 4; + + if (iph->version == 4) { + sk = inet_lookup_established(dev_net(netdev), &tcp_hashinfo, +iph->saddr, th->source, iph->daddr, +th->dest, netdev->ifindex); +#if IS_ENABLED(CONFIG_IPV6) + } else { + struct ipv6hdr *ipv6h = (struct ipv6hdr *)iph; + + sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo, + &ipv6h->saddr, th->source, + &ipv6h->daddr, th->dest, + netdev->ifindex, 0); +#endif + } + if (!sk || sk->sk_state == TCP_TIME_WAIT) + goto out; + + skb->sk = sk; + skb->destructor = sock_edemux; + + memcpy(&seq, &mdata->content.recv.sync_seq, sizeof(seq)); + tls_offload_rx_resync_request(sk, seq); +out: + return 0; +} + +void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, +u32 *cqe_bcnt) +{ + struct mlx5e_tls_metadata *mdata; + struct ethhdr *old_eth; + struct ethhdr *new_eth; + __be16 *ethtype; + + /* Detect inline metadata */ + if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) + return; + ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); + if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + return; + + /* Use the metadata */ + mdata = (struct 
mlx5e_tls_metadata *)(skb->data + ETH_HLEN); + switch (mdata->content.recv.syndrome) { + case SYND
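The detection step at the top of the rx handler can be sketched in isolation; the ethertype bytes and metadata length below are illustrative stand-ins, not the real MLX5E_METADATA_ETHER_TYPE/LEN values:

```c
#include <stdint.h>

#define ETH_ALEN 6
/* Illustrative metadata ethertype bytes; the real value is
 * hardware-defined. */
#define META_TYPE_HI 0x8C
#define META_TYPE_LO 0xE4

/* Mirrors the detection in mlx5e_tls_handle_rx_skb(): the special
 * ethertype sits immediately after the two MAC addresses. Frames that
 * are too short, or that carry a normal ethertype there, have no
 * inline metadata and are passed through untouched. */
int has_inline_metadata(const uint8_t *data, uint32_t len,
                        uint32_t meta_len)
{
    if (len < 2 * ETH_ALEN + 2 + meta_len)
        return 0;
    return data[2 * ETH_ALEN] == META_TYPE_HI &&
           data[2 * ETH_ALEN + 1] == META_TYPE_LO;
}
```

Only when this check fires does the handler read the syndrome from the metadata and, for a resync request, look up the socket and feed the sequence number to tls_offload_rx_resync_request().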
[PATCH v3 net-next 16/19] net/mlx5e: TLS, build TLS netdev from capabilities
This patch enables TLS Rx based on available HW capabilities. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index 541e6f4..eddd7702 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -183,13 +183,27 @@ static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk, void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { + u32 caps = mlx5_accel_tls_device_caps(priv->mdev); struct net_device *netdev = priv->netdev; if (!mlx5_accel_is_tls_device(priv->mdev)) return; - netdev->features |= NETIF_F_HW_TLS_TX; - netdev->hw_features |= NETIF_F_HW_TLS_TX; + if (caps & MLX5_ACCEL_TLS_TX) { + netdev->features |= NETIF_F_HW_TLS_TX; + netdev->hw_features |= NETIF_F_HW_TLS_TX; + } + + if (caps & MLX5_ACCEL_TLS_RX) { + netdev->features |= NETIF_F_HW_TLS_RX; + netdev->hw_features |= NETIF_F_HW_TLS_RX; + } + + if (!(caps & MLX5_ACCEL_TLS_LRO)) { + netdev->features &= ~NETIF_F_LRO; + netdev->hw_features &= ~NETIF_F_LRO; + } + netdev->tlsdev_ops = &mlx5e_tls_ops; } -- 1.8.3.1
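The caps-to-features mapping this patch introduces can be sketched with stand-in bit values (the real MLX5_ACCEL_TLS_* caps and NETIF_F_* feature flags are defined elsewhere and differ from these):

```c
#include <stdint.h>

/* Stand-in bit values for illustration only. */
#define CAP_TLS_TX   0x1u
#define CAP_TLS_RX   0x2u
#define CAP_TLS_LRO  0x4u
#define FEAT_TLS_TX  0x10u
#define FEAT_TLS_RX  0x20u
#define FEAT_LRO     0x40u

/* Sketch of mlx5e_tls_build_netdev() after this patch: advertise TLS
 * TX and RX offload per the device caps, and drop LRO when the device
 * cannot combine LRO with TLS offload. */
uint32_t build_features(uint32_t features, uint32_t caps)
{
    if (caps & CAP_TLS_TX)
        features |= FEAT_TLS_TX;
    if (caps & CAP_TLS_RX)
        features |= FEAT_TLS_RX;
    if (!(caps & CAP_TLS_LRO))
        features &= ~FEAT_LRO;
    return features;
}
```

Driving the netdev feature bits from the capability word keeps the advertised offloads honest per device, instead of unconditionally setting NETIF_F_HW_TLS_TX as before.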
[PATCH v3 net-next 05/19] tls: Refactor tls_offload variable names
For symmetry, we rename tls_offload_context to tls_offload_context_tx before we add tls_offload_context_rx. Signed-off-by: Boris Pismenny --- .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 6 +++--- include/net/tls.h | 16 +++--- net/tls/tls_device.c | 25 +++--- net/tls/tls_device_fallback.c | 8 +++ 4 files changed, 27 insertions(+), 28 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index b616217..b82f4de 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -50,7 +50,7 @@ struct mlx5e_tls { }; struct mlx5e_tls_offload_context { - struct tls_offload_context base; + struct tls_offload_context_tx base; u32 expected_seq; __be32 swid; }; @@ -59,8 +59,8 @@ struct mlx5e_tls_offload_context { mlx5e_get_tls_tx_context(struct tls_context *tls_ctx) { BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) > -TLS_OFFLOAD_CONTEXT_SIZE); - return container_of(tls_offload_ctx(tls_ctx), +TLS_OFFLOAD_CONTEXT_SIZE_TX); + return container_of(tls_offload_ctx_tx(tls_ctx), struct mlx5e_tls_offload_context, base); } diff --git a/include/net/tls.h b/include/net/tls.h index 70c2737..5dcd808 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -128,7 +128,7 @@ struct tls_record_info { skb_frag_t frags[MAX_SKB_FRAGS]; }; -struct tls_offload_context { +struct tls_offload_context_tx { struct crypto_aead *aead_send; spinlock_t lock;/* protects records list */ struct list_head records_list; @@ -147,8 +147,8 @@ struct tls_offload_context { #define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *))) }; -#define TLS_OFFLOAD_CONTEXT_SIZE \ - (ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) + \ +#define TLS_OFFLOAD_CONTEXT_SIZE_TX \ + (ALIGN(sizeof(struct tls_offload_context_tx), sizeof(void *)) +\ TLS_DRIVER_STATE_SIZE) enum { @@ -239,7 +239,7 @@ int tls_device_sendpage(struct sock *sk, struct page *page, void 
tls_device_init(void); void tls_device_cleanup(void); -struct tls_record_info *tls_get_record(struct tls_offload_context *context, +struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context, u32 seq, u64 *p_record_sn); static inline bool tls_record_is_start_marker(struct tls_record_info *rec) @@ -380,10 +380,10 @@ static inline struct tls_sw_context_tx *tls_sw_ctx_tx( return (struct tls_sw_context_tx *)tls_ctx->priv_ctx_tx; } -static inline struct tls_offload_context *tls_offload_ctx( - const struct tls_context *tls_ctx) +static inline struct tls_offload_context_tx * +tls_offload_ctx_tx(const struct tls_context *tls_ctx) { - return (struct tls_offload_context *)tls_ctx->priv_ctx_tx; + return (struct tls_offload_context_tx *)tls_ctx->priv_ctx_tx; } int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, @@ -396,7 +396,7 @@ struct sk_buff *tls_validate_xmit_skb(struct sock *sk, struct sk_buff *skb); int tls_sw_fallback_init(struct sock *sk, -struct tls_offload_context *offload_ctx, +struct tls_offload_context_tx *offload_ctx, struct tls_crypto_info *crypto_info); #endif /* _TLS_OFFLOAD_H */ diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index a7a8f8e..332a5d1 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -52,9 +52,8 @@ static void tls_device_free_ctx(struct tls_context *ctx) { - struct tls_offload_context *offload_ctx = tls_offload_ctx(ctx); + kfree(tls_offload_ctx_tx(ctx)); - kfree(offload_ctx); kfree(ctx); } @@ -125,7 +124,7 @@ static void destroy_record(struct tls_record_info *record) kfree(record); } -static void delete_all_records(struct tls_offload_context *offload_ctx) +static void delete_all_records(struct tls_offload_context_tx *offload_ctx) { struct tls_record_info *info, *temp; @@ -141,14 +140,14 @@ static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq) { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_record_info *info, *temp; - struct tls_offload_context *ctx; + struct 
tls_offload_context_tx *ctx; u64 deleted_records = 0; unsigned long flags; if (!tls_ctx) return; - ctx = tls_offload_ctx(tls_ctx); + ctx = tls_offload_ctx_tx(tls_ctx); spin_lock_irqsave(&ctx->lock, flags
[PATCH v3 net-next 17/19] net/mlx5: Accel, add common metadata functions
This patch adds common functions to handle mellanox metadata headers. These functions are used by IPsec and TLS to process FPGA metadata. Signed-off-by: Boris Pismenny --- .../net/ethernet/mellanox/mlx5/core/accel/accel.h | 37 ++ .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 19 +++ .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 18 +++ 3 files changed, 45 insertions(+), 29 deletions(-) create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h new file mode 100644 index 000..c132604 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h @@ -0,0 +1,37 @@ +#ifndef __MLX5E_ACCEL_H__ +#define __MLX5E_ACCEL_H__ + +#ifdef CONFIG_MLX5_ACCEL + +#include +#include +#include "en.h" + +static inline bool is_metadata_hdr_valid(struct sk_buff *skb) +{ + __be16 *ethtype; + + if (unlikely(skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN)) + return false; + ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); + if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + return false; + return true; +} + +static inline void remove_metadata_hdr(struct sk_buff *skb) +{ + struct ethhdr *old_eth; + struct ethhdr *new_eth; + + /* Remove the metadata from the buffer */ + old_eth = (struct ethhdr *)skb->data; + new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN); + memmove(new_eth, old_eth, 2 * ETH_ALEN); + /* Ethertype is already in its new place */ + skb_pull_inline(skb, MLX5E_METADATA_ETHER_LEN); +} + +#endif /* CONFIG_MLX5_ACCEL */ + +#endif /* __MLX5E_EN_ACCEL_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c index c245d8e..fda7929 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c @@ -37,6 +37,7 @@ #include "en_accel/ipsec_rxtx.h" #include 
"en_accel/ipsec.h" +#include "accel/accel.h" #include "en.h" enum { @@ -346,19 +347,12 @@ struct sk_buff *mlx5e_ipsec_handle_tx_skb(struct net_device *netdev, } struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, - struct sk_buff *skb) + struct sk_buff *skb, u32 *cqe_bcnt) { struct mlx5e_ipsec_metadata *mdata; - struct ethhdr *old_eth; - struct ethhdr *new_eth; struct xfrm_state *xs; - __be16 *ethtype; - /* Detect inline metadata */ - if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) - return skb; - ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); - if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + if (!is_metadata_hdr_valid(skb)) return skb; /* Use the metadata */ @@ -369,12 +363,7 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev, return NULL; } - /* Remove the metadata from the buffer */ - old_eth = (struct ethhdr *)skb->data; - new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN); - memmove(new_eth, old_eth, 2 * ETH_ALEN); - /* Ethertype is already in its new place */ - skb_pull_inline(skb, MLX5E_METADATA_ETHER_LEN); + remove_metadata_hdr(skb); return skb; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index ecfc764..92d3745 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -33,6 +33,8 @@ #include "en_accel/tls.h" #include "en_accel/tls_rxtx.h" +#include "accel/accel.h" + #include #include @@ -350,16 +352,9 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, u32 *cqe_bcnt) { struct mlx5e_tls_metadata *mdata; - struct ethhdr *old_eth; - struct ethhdr *new_eth; - __be16 *ethtype; struct mlx5e_priv *priv; - /* Detect inline metadata */ - if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) - return; - ethtype = (__be16 *)(skb->data + ETH_ALEN * 2); - if (*ethtype != 
cpu_to_be16(MLX5E_METADATA_ETHER_TYPE)) + if (!is_metadata_hdr_valid(skb)) return; /* Use the metadata */ @@ -383,11 +378,6 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, return; } - /* Remove the metadata from the buffer */ - old_eth = (struct ethhdr *)skb->data; - new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN); - memmove(new_eth,
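The two helpers factored out above can be exercised in plain userspace C. The following is a minimal sketch, not the driver code: META_ETHER_LEN and META_ETHER_TYPE are assumed placeholder values standing in for the driver's MLX5E_METADATA_ETHER_LEN and MLX5E_METADATA_ETHER_TYPE, and a raw byte buffer stands in for the skb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN        6
#define ETH_HLEN        14
#define META_ETHER_LEN  8       /* assumed metadata header length */
#define META_ETHER_TYPE 0x8CE4U /* assumed metadata ethertype */

/* Mirrors is_metadata_hdr_valid(): the metadata header sits right after
 * the two MAC addresses, in the slot where the ethertype normally is. */
static bool metadata_hdr_valid(const uint8_t *data, size_t len)
{
	uint16_t ethtype;

	if (len < ETH_HLEN + META_ETHER_LEN)
		return false;
	/* 16-bit ethertype, network (big-endian) byte order */
	ethtype = ((uint16_t)data[2 * ETH_ALEN] << 8) | data[2 * ETH_ALEN + 1];
	return ethtype == META_ETHER_TYPE;
}

/* Mirrors remove_metadata_hdr(): slide the 12 MAC-address bytes forward
 * over the metadata, then advance the frame start (the kernel version
 * does the same with memmove() + skb_pull_inline()). */
static uint8_t *remove_metadata_hdr(uint8_t *data)
{
	memmove(data + META_ETHER_LEN, data, 2 * ETH_ALEN);
	return data + META_ETHER_LEN;
}
```

Note that only the 12 MAC bytes move; the real ethertype that follows the metadata is "already in its new place", exactly as the comment in the patch says.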
[PATCH v3 net-next 11/19] net/mlx5e: TLS, refactor variable names
For symmetry, we rename mlx5e_tls_offload_context to mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx. Signed-off-by: Boris Pismenny Reviewed-by: Aviad Yehezkel Reviewed-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 8 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c | 6 +++--- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index d167845..7fb9c75 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -123,7 +123,7 @@ static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk, goto free_flow; if (direction == TLS_OFFLOAD_CTX_DIR_TX) { - struct mlx5e_tls_offload_context *tx_ctx = + struct mlx5e_tls_offload_context_tx *tx_ctx = mlx5e_get_tls_tx_context(tls_ctx); u32 swid; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index b82f4de..e26222a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -49,19 +49,19 @@ struct mlx5e_tls { struct mlx5e_tls_sw_stats sw_stats; }; -struct mlx5e_tls_offload_context { +struct mlx5e_tls_offload_context_tx { struct tls_offload_context_tx base; u32 expected_seq; __be32 swid; }; -static inline struct mlx5e_tls_offload_context * +static inline struct mlx5e_tls_offload_context_tx * mlx5e_get_tls_tx_context(struct tls_context *tls_ctx) { - BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) > + BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context_tx) > TLS_OFFLOAD_CONTEXT_SIZE_TX); return container_of(tls_offload_ctx_tx(tls_ctx), - struct mlx5e_tls_offload_context, + struct mlx5e_tls_offload_context_tx, base); } diff --git 
a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index 15aef71..c96196f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -73,7 +73,7 @@ static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid) return 0; } -static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context *context, +static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context_tx *context, u32 tcp_seq, struct sync_info *info) { int remaining, i = 0, ret = -EINVAL; @@ -161,7 +161,7 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb, } static struct sk_buff * -mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context, +mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context, struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5e_tx_wqe **wqe, u16 *pi, @@ -239,7 +239,7 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev, u16 *pi) { struct mlx5e_priv *priv = netdev_priv(netdev); - struct mlx5e_tls_offload_context *context; + struct mlx5e_tls_offload_context_tx *context; struct tls_context *tls_ctx; u32 expected_seq; int datalen; -- 1.8.3.1
[PATCH v3 net-next 06/19] tls: Split decrypt_skb to two functions
Previously, decrypt_skb also updated the TLS context. Now, decrypt_skb only decrypts the payload using the current context, while decrypt_skb_update also updates the state. Later, in the tls_device Rx flow, we will use decrypt_skb directly. Signed-off-by: Boris Pismenny --- include/net/tls.h | 2 ++ net/tls/tls_sw.c | 44 ++-- 2 files changed, 28 insertions(+), 18 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 5dcd808..49b8922 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -390,6 +390,8 @@ int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, unsigned char *record_type); void tls_register_device(struct tls_device *device); void tls_unregister_device(struct tls_device *device); +int decrypt_skb(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout); struct sk_buff *tls_validate_xmit_skb(struct sock *sk, struct net_device *dev, diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 3bd7c14..99d0347 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -53,7 +53,6 @@ static int tls_do_decryption(struct sock *sk, { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); - struct strp_msg *rxm = strp_msg(skb); struct aead_request *aead_req; int ret; @@ -74,18 +73,6 @@ static int tls_do_decryption(struct sock *sk, ret = crypto_wait_req(crypto_aead_decrypt(aead_req), &ctx->async_wait); - if (ret < 0) - goto out; - - rxm->offset += tls_ctx->rx.prepend_size; - rxm->full_len -= tls_ctx->rx.overhead_size; - tls_advance_record_sn(sk, &tls_ctx->rx); - - ctx->decrypted = true; - - ctx->saved_data_ready(sk); - -out: kfree(aead_req); return ret; } @@ -670,8 +657,29 @@ static struct sk_buff *tls_wait_data(struct sock *sk, int flags, return skb; } -static int decrypt_skb(struct sock *sk, struct sk_buff *skb, - struct scatterlist *sgout) +static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + 
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct strp_msg *rxm = strp_msg(skb); + int err = 0; + + err = decrypt_skb(sk, skb, sgout); + if (err < 0) + return err; + + rxm->offset += tls_ctx->rx.prepend_size; + rxm->full_len -= tls_ctx->rx.overhead_size; + tls_advance_record_sn(sk, &tls_ctx->rx); + ctx->decrypted = true; + ctx->saved_data_ready(sk); + + return err; +} + +int decrypt_skb(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout) { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); @@ -821,7 +829,7 @@ int tls_sw_recvmsg(struct sock *sk, if (err < 0) goto fallback_to_reg_recv; - err = decrypt_skb(sk, skb, sgin); + err = decrypt_skb_update(sk, skb, sgin); for (; pages > 0; pages--) put_page(sg_page(&sgin[pages])); if (err < 0) { @@ -830,7 +838,7 @@ int tls_sw_recvmsg(struct sock *sk, } } else { fallback_to_reg_recv: - err = decrypt_skb(sk, skb, NULL); + err = decrypt_skb_update(sk, skb, NULL); if (err < 0) { tls_err_abort(sk, EBADMSG); goto recv_end; @@ -901,7 +909,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, } if (!ctx->decrypted) { - err = decrypt_skb(sk, skb, NULL); + err = decrypt_skb_update(sk, skb, NULL); if (err < 0) { tls_err_abort(sk, EBADMSG); -- 1.8.3.1
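The bookkeeping that this patch moves out of tls_do_decryption() and into decrypt_skb_update() can be illustrated in isolation. This is a sketch under assumed TLS 1.2 AES-GCM-128 sizes (5-byte record header, 8-byte explicit nonce, 16-byte tag); the kernel derives the real prepend_size/overhead_size per cipher, and struct record_view is a stand-in for strp_msg.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed TLS 1.2 AES-GCM-128 framing sizes. */
#define TLS_HEADER_SIZE   5	/* type + version + length */
#define TLS_GCM_IV_SIZE   8	/* explicit nonce */
#define TLS_GCM_TAG_SIZE 16	/* authentication tag */

struct record_view {
	size_t offset;   /* start of usable data within the record buffer */
	size_t full_len; /* bytes from offset to end of record */
};

/* Mirrors the state update in decrypt_skb_update(): after a successful
 * decrypt, skip past header + IV, and drop header + IV + tag from the
 * usable length, leaving only the plaintext window. */
static void record_mark_decrypted(struct record_view *rxm)
{
	size_t prepend  = TLS_HEADER_SIZE + TLS_GCM_IV_SIZE;
	size_t overhead = prepend + TLS_GCM_TAG_SIZE;

	rxm->offset   += prepend;
	rxm->full_len -= overhead;
}
```

Splitting the pure decryption from this update is what lets the tls_device Rx path call decrypt_skb() alone, without advancing the record sequence number.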
[PATCH v3 net-next 08/19] tls: Fill software context without allocation
This patch allows tls_set_sw_offload to fill the context in case it was already allocated previously. We will use it in TLS_DEVICE to fill the RX software context. Signed-off-by: Boris Pismenny --- net/tls/tls_sw.c | 34 ++ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 86e22bc..5073676 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -1090,28 +1090,38 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx) } if (tx) { - sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL); - if (!sw_ctx_tx) { - rc = -ENOMEM; - goto out; + if (!ctx->priv_ctx_tx) { + sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL); + if (!sw_ctx_tx) { + rc = -ENOMEM; + goto out; + } + ctx->priv_ctx_tx = sw_ctx_tx; + } else { + sw_ctx_tx = + (struct tls_sw_context_tx *)ctx->priv_ctx_tx; } - crypto_init_wait(&sw_ctx_tx->async_wait); - ctx->priv_ctx_tx = sw_ctx_tx; } else { - sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL); - if (!sw_ctx_rx) { - rc = -ENOMEM; - goto out; + if (!ctx->priv_ctx_rx) { + sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL); + if (!sw_ctx_rx) { + rc = -ENOMEM; + goto out; + } + ctx->priv_ctx_rx = sw_ctx_rx; + } else { + sw_ctx_rx = + (struct tls_sw_context_rx *)ctx->priv_ctx_rx; } - crypto_init_wait(&sw_ctx_rx->async_wait); - ctx->priv_ctx_rx = sw_ctx_rx; } if (tx) { + crypto_init_wait(&sw_ctx_tx->async_wait); crypto_info = &ctx->crypto_send; cctx = &ctx->tx; aead = &sw_ctx_tx->aead_send; } else { + crypto_init_wait(&sw_ctx_rx->async_wait); crypto_info = &ctx->crypto_recv; cctx = &ctx->rx; aead = &sw_ctx_rx->aead_recv; -- 1.8.3.1
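The allocate-if-missing pattern this patch introduces can be sketched on its own. The structures below are illustrative stand-ins (not the kernel's tls_context/tls_sw_context_rx), and calloc replaces kzalloc:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-ins for the kernel structures. */
struct sw_ctx { int initialized; };
struct conn_ctx { struct sw_ctx *priv_ctx_rx; };

/* Reuse a context that an earlier layer (tls_device in the patch) has
 * already allocated; otherwise allocate a zeroed one. */
static struct sw_ctx *get_or_alloc_rx_ctx(struct conn_ctx *ctx)
{
	if (!ctx->priv_ctx_rx)
		ctx->priv_ctx_rx = calloc(1, sizeof(struct sw_ctx));
	return ctx->priv_ctx_rx;
}
```

This is why the patch also moves crypto_init_wait() after the branch: initialization must run whether the context was freshly allocated or handed in pre-allocated.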
[PATCH v3 net-next 07/19] tls: Split tls_sw_release_resources_rx
This patch splits tls_sw_release_resources_rx into two functions: one that releases all inner software TLS structures, and another that also frees the containing structure. In TLS_DEVICE we will need to release the software structures without freeing the containing structure, which contains other information. Signed-off-by: Boris Pismenny --- include/net/tls.h | 1 + net/tls/tls_sw.c | 10 +- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/net/tls.h b/include/net/tls.h index 49b8922..7a485de 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -223,6 +223,7 @@ int tls_sw_sendpage(struct sock *sk, struct page *page, void tls_sw_close(struct sock *sk, long timeout); void tls_sw_free_resources_tx(struct sock *sk); void tls_sw_free_resources_rx(struct sock *sk); +void tls_sw_release_resources_rx(struct sock *sk); int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len); unsigned int tls_sw_poll(struct file *file, struct socket *sock, diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 99d0347..86e22bc 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -1039,7 +1039,7 @@ void tls_sw_free_resources_tx(struct sock *sk) kfree(ctx); } -void tls_sw_free_resources_rx(struct sock *sk) +void tls_sw_release_resources_rx(struct sock *sk) { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); @@ -1058,6 +1058,14 @@ void tls_sw_free_resources_rx(struct sock *sk) strp_done(&ctx->strp); lock_sock(sk); } +} + +void tls_sw_free_resources_rx(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + + tls_sw_release_resources_rx(sk); kfree(ctx); } -- 1.8.3.1
[PATCH v3 net-next 10/19] tls: Fix zerocopy_from_iter iov handling
zerocopy_from_iter iterates over the message, but it doesn't revert the updates made by the iov iteration. This patch fixes it. Now, the iov can be used after calling zerocopy_from_iter. Fixes: 3c4d75591 ("tls: kernel TLS support") Signed-off-by: Boris Pismenny --- net/tls/tls_sw.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 2a6ba0f..37ac220 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -318,6 +318,7 @@ static int zerocopy_from_iter(struct sock *sk, struct iov_iter *from, out: *size_used = size; *pages_used = num_elem; + iov_iter_revert(from, size); return rc; } -- 1.8.3.1
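The semantics of the one-line fix can be shown with a toy cursor model. This is a sketch only: struct iter is a hypothetical stand-in for iov_iter, and iter_revert plays the role of iov_iter_revert().

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an iov_iter-style cursor over a buffer. */
struct iter { size_t pos, len; };

/* Advance the cursor by up to n bytes, as mapping pages for zerocopy
 * does to the real iterator. */
static size_t iter_consume(struct iter *it, size_t n)
{
	if (n > it->len - it->pos)
		n = it->len - it->pos;
	it->pos += n;
	return n;
}

/* Equivalent of iov_iter_revert(): walk the cursor back so the caller
 * sees the iterator in its pre-call state, which is what the added
 * iov_iter_revert(from, size) guarantees after zerocopy_from_iter(). */
static void iter_revert(struct iter *it, size_t n)
{
	it->pos -= n;
}
```

Without the revert, a caller that needs to fall back to copying (or retry) would resume from the wrong position in the user iov.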
[PATCH v3 net-next 02/19] net: Add TLS RX offload feature
From: Ilya Lesokhin This patch adds a netdev feature to configure TLS RX inline crypto offload. Signed-off-by: Ilya Lesokhin Signed-off-by: Boris Pismenny --- include/linux/netdev_features.h | 2 ++ net/core/ethtool.c | 1 + 2 files changed, 3 insertions(+) diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index 623bb8c..2b2a6dc 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -79,6 +79,7 @@ enum { NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */ NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */ NETIF_F_HW_TLS_TX_BIT, /* Hardware TLS TX offload */ + NETIF_F_HW_TLS_RX_BIT, /* Hardware TLS RX offload */ NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */ NETIF_F_HW_TLS_RECORD_BIT, /* Offload TLS record */ @@ -151,6 +152,7 @@ enum { #define NETIF_F_HW_TLS_RECORD __NETIF_F(HW_TLS_RECORD) #define NETIF_F_GSO_UDP_L4 __NETIF_F(GSO_UDP_L4) #define NETIF_F_HW_TLS_TX __NETIF_F(HW_TLS_TX) +#define NETIF_F_HW_TLS_RX __NETIF_F(HW_TLS_RX) #define for_each_netdev_feature(mask_addr, bit)\ for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT) diff --git a/net/core/ethtool.c b/net/core/ethtool.c index e677a20..c9993c6 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -111,6 +111,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info) [NETIF_F_RX_UDP_TUNNEL_PORT_BIT] = "rx-udp_tunnel-port-offload", [NETIF_F_HW_TLS_RECORD_BIT] = "tls-hw-record", [NETIF_F_HW_TLS_TX_BIT] ="tls-hw-tx-offload", + [NETIF_F_HW_TLS_RX_BIT] ="tls-hw-rx-offload", }; static const char -- 1.8.3.1
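The feature added here is one bit in the netdev feature mask, paired with an ethtool string. A sketch of the bit-number/flag-mask scheme, with hypothetical F_* names mirroring (but not identical to) the kernel's NETIF_F_*/__NETIF_F machinery:

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers, as in enum netdev_features; order defines the bit. */
enum {
	F_HW_TLS_TX_BIT,     /* Hardware TLS TX offload */
	F_HW_TLS_RX_BIT,     /* Hardware TLS RX offload (the new bit) */
	F_FEATURE_COUNT,
};

/* Turn a bit number into a flag mask, like the kernel's __NETIF_F(). */
#define F_BIT(name) ((uint64_t)1 << F_##name##_BIT)
#define F_HW_TLS_TX F_BIT(HW_TLS_TX)
#define F_HW_TLS_RX F_BIT(HW_TLS_RX)
```

A driver advertises the capability by OR-ing the flag into its feature mask; ethtool toggles it through the "tls-hw-rx-offload" string registered in the second hunk.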
[PATCH v3 net-next 00/19] TLS offload rx, netdev & mlx5
Hi, The following series provides TLS RX inline crypto offload.

v2->v3:
- Fix typo
- Adjust cover letter
- Fix bug in zero copy flows
- Use network byte order for the record number in resync
- Adjust the sequence provided in resync

v1->v2:
- Fix bisectability problems due to variable name changes
- Fix potential uninitialized return value

This series completes the generic infrastructure to offload TLS crypto to network devices. It enables the kernel TLS socket to skip decryption and authentication operations for SKBs marked as decrypted on the receive side of the data path, leaving those computationally expensive operations to the NIC. This infrastructure doesn't require a TCP offload engine. Instead, the NIC decrypts a packet's payload if the packet contains the expected TCP sequence number. The TLS record authentication tag remains unmodified regardless of decryption. If the packet is decrypted successfully and it contains an authentication tag, then the authentication check has passed. Otherwise, if the authentication fails, then the packet is provided unmodified and the KTLS layer is responsible for handling it. Out-Of-Order TCP packets are provided unmodified. As a result, in the slow path some of the SKBs are decrypted while others remain as ciphertext. The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs. In the worst case, a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be re-encrypted, only to be decrypted. The notable differences between SW KTLS and NIC-offloaded TLS implementations are as follows:
1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering.
2. Resynchronization - tls_read_size calls the device driver to resynchronize HW whenever it lost track of the TLS record framing in the TCP stream.
The infrastructure should be extendable to support various NIC offload implementations. However it is currently written with the implementation below in mind: The NIC identifies packets that should be offloaded according to the 5-tuple and the TCP sequence number. If these match and the packet is decrypted and authenticated successfully, then a syndrome is provided to software. Otherwise, the packet is unmodified. Decrypted and non-decrypted packets aren't coalesced by the network stack, and the KTLS layer decrypts and authenticates partially decrypted records. The NIC provides an indication whenever a resync is required. The resync operation is triggered by the KTLS layer while parsing TLS record headers. Finally, we measure the performance obtained by running single stream iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound) and KTLS-Offload running both in Tx and Rx. The results show that the performance of offload is comparable to TCP. 
                  | Bandwidth (Gbps) | CPU Tx (%) | CPU Rx (%)
TCP               |             28.8 |          5 |         12
KTLS-Offload-Tx-Rx|             28.6 |          7 |         14

Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf

Boris Pismenny (18):
  net: Add decrypted field to skb
  net: Add TLS rx resync NDO
  tcp: Don't coalesce decrypted and encrypted SKBs
  tls: Refactor tls_offload variable names
  tls: Split decrypt_skb to two functions
  tls: Split tls_sw_release_resources_rx
  tls: Fill software context without allocation
  tls: Add rx inline crypto offload
  tls: Fix zerocopy_from_iter iov handling
  net/mlx5e: TLS, refactor variable names
  net/mlx5: Accel, add TLS rx offload routines
  net/mlx5e: TLS, add innova rx support
  net/mlx5e: TLS, add Innova TLS rx data path
  net/mlx5e: TLS, add software statistics
  net/mlx5e: TLS, build TLS netdev from capabilities
  net/mlx5: Accel, add common metadata functions
  net/mlx5e: IPsec, fix byte count in CQE
  net/mlx5e: Kconfig, mutually exclude compilation of TLS and IPsec accel

Ilya Lesokhin (1):
  net: Add TLS RX offload feature

 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |   1 +
 .../net/ethernet/mellanox/mlx5/core/accel/accel.h  |  37 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  23 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  26 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c       |  20 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.h       |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |  69 +++--
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  33 ++-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 117 +++-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 113 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  18 +-
 include/linux/mlx5/mlx5_ifc
[PATCH v3 net-next 15/19] net/mlx5e: TLS, add software statistics
This patch adds software statistics for TLS to count important events. Signed-off-by: Boris Pismenny --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 3 +++ drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 4 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c | 11 ++- 3 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index 68368c9..541e6f4 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -169,7 +169,10 @@ static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk, rx_ctx = mlx5e_get_tls_rx_context(tls_ctx); + netdev_info(netdev, "resyncing seq %d rcd %lld\n", seq, + be64_to_cpu(rcd_sn)); mlx5_accel_tls_resync_rx(priv->mdev, rx_ctx->handle, seq, rcd_sn); + atomic64_inc(&priv->tls->sw_stats.rx_tls_resync_reply); } static const struct tlsdev_ops mlx5e_tls_ops = { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index 2d40ede..3f5d721 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -43,6 +43,10 @@ struct mlx5e_tls_sw_stats { atomic64_t tx_tls_drop_resync_alloc; atomic64_t tx_tls_drop_no_sync_data; atomic64_t tx_tls_drop_bypass_required; + atomic64_t rx_tls_drop_resync_request; + atomic64_t rx_tls_resync_request; + atomic64_t rx_tls_resync_reply; + atomic64_t rx_tls_auth_fail; }; struct mlx5e_tls { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c index d460fda..ecfc764 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c @@ -330,8 +330,12 @@ static int tls_update_resync_sn(struct net_device *netdev, 
netdev->ifindex, 0); #endif } - if (!sk || sk->sk_state == TCP_TIME_WAIT) + if (!sk || sk->sk_state == TCP_TIME_WAIT) { + struct mlx5e_priv *priv = netdev_priv(netdev); + + atomic64_inc(&priv->tls->sw_stats.rx_tls_drop_resync_request); goto out; + } skb->sk = sk; skb->destructor = sock_edemux; @@ -349,6 +353,7 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, struct ethhdr *old_eth; struct ethhdr *new_eth; __be16 *ethtype; + struct mlx5e_priv *priv; /* Detect inline metadata */ if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN) @@ -365,9 +370,13 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, break; case SYNDROM_RESYNC_REQUEST: tls_update_resync_sn(netdev, skb, mdata); + priv = netdev_priv(netdev); + atomic64_inc(&priv->tls->sw_stats.rx_tls_resync_request); break; case SYNDROM_AUTH_FAILED: /* Authentication failure will be observed and verified by kTLS */ + priv = netdev_priv(netdev); + atomic64_inc(&priv->tls->sw_stats.rx_tls_auth_fail); break; default: /* Bypass the metadata header to others */ -- 1.8.3.1
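The counters added above are plain lock-free event counters: the kernel uses atomic64_t with atomic64_inc(). A userspace sketch of the same idea using C11 atomics (the struct below is an abbreviated stand-in for mlx5e_tls_sw_stats):

```c
#include <assert.h>
#include <stdatomic.h>

/* Abbreviated stand-in for mlx5e_tls_sw_stats. */
struct tls_sw_stats {
	atomic_ullong rx_tls_resync_request;
	atomic_ullong rx_tls_auth_fail;
};

/* Count an event; safe from concurrent contexts with no lock, which is
 * why the data path can bump these counters directly. */
static void count_resync_request(struct tls_sw_stats *s)
{
	atomic_fetch_add(&s->rx_tls_resync_request, 1);
}
```

Because each counter is an independent atomic, readers (e.g. an ethtool stats dump) may see a momentarily inconsistent set of counters, which is acceptable for statistics.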
[PATCH v3 net-next 13/19] net/mlx5e: TLS, add innova rx support
Add the mlx5 implementation of the TLS Rx routines to add/del TLS contexts, also add the tls_dev_resync_rx routine to work with the TLS inline Rx crypto offload infrastructure. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 46 +++--- .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 15 +++ 2 files changed, 46 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index 7fb9c75..68368c9 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -110,9 +110,7 @@ static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk, u32 caps = mlx5_accel_tls_device_caps(mdev); int ret = -ENOMEM; void *flow; - - if (direction != TLS_OFFLOAD_CTX_DIR_TX) - return -EINVAL; + u32 swid; flow = kzalloc(MLX5_ST_SZ_BYTES(tls_flow), GFP_KERNEL); if (!flow) @@ -122,18 +120,23 @@ static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk, if (ret) goto free_flow; + ret = mlx5_accel_tls_add_flow(mdev, flow, crypto_info, + start_offload_tcp_sn, &swid, + direction == TLS_OFFLOAD_CTX_DIR_TX); + if (ret < 0) + goto free_flow; + if (direction == TLS_OFFLOAD_CTX_DIR_TX) { struct mlx5e_tls_offload_context_tx *tx_ctx = mlx5e_get_tls_tx_context(tls_ctx); - u32 swid; - - ret = mlx5_accel_tls_add_tx_flow(mdev, flow, crypto_info, -start_offload_tcp_sn, &swid); - if (ret < 0) - goto free_flow; tx_ctx->swid = htonl(swid); tx_ctx->expected_seq = start_offload_tcp_sn; + } else { + struct mlx5e_tls_offload_context_rx *rx_ctx = + mlx5e_get_tls_rx_context(tls_ctx); + + rx_ctx->handle = htonl(swid); } return 0; @@ -147,19 +150,32 @@ static void mlx5e_tls_del(struct net_device *netdev, enum tls_offload_ctx_dir direction) { struct mlx5e_priv *priv = netdev_priv(netdev); + unsigned int handle; - if (direction == TLS_OFFLOAD_CTX_DIR_TX) { - u32 swid = 
ntohl(mlx5e_get_tls_tx_context(tls_ctx)->swid); + handle = ntohl((direction == TLS_OFFLOAD_CTX_DIR_TX) ? + mlx5e_get_tls_tx_context(tls_ctx)->swid : + mlx5e_get_tls_rx_context(tls_ctx)->handle); - mlx5_accel_tls_del_tx_flow(priv->mdev, swid); - } else { - netdev_err(netdev, "unsupported direction %d\n", direction); - } + mlx5_accel_tls_del_flow(priv->mdev, handle, + direction == TLS_OFFLOAD_CTX_DIR_TX); +} + +static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk, + u32 seq, u64 rcd_sn) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct mlx5e_priv *priv = netdev_priv(netdev); + struct mlx5e_tls_offload_context_rx *rx_ctx; + + rx_ctx = mlx5e_get_tls_rx_context(tls_ctx); + + mlx5_accel_tls_resync_rx(priv->mdev, rx_ctx->handle, seq, rcd_sn); } static const struct tlsdev_ops mlx5e_tls_ops = { .tls_dev_add = mlx5e_tls_add, .tls_dev_del = mlx5e_tls_del, + .tls_dev_resync_rx = mlx5e_tls_resync_rx, }; void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h index e26222a..2d40ede 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h @@ -65,6 +65,21 @@ struct mlx5e_tls_offload_context_tx { base); } +struct mlx5e_tls_offload_context_rx { + struct tls_offload_context_rx base; + __be32 handle; +}; + +static inline struct mlx5e_tls_offload_context_rx * +mlx5e_get_tls_rx_context(struct tls_context *tls_ctx) +{ + BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context_rx) > +TLS_OFFLOAD_CONTEXT_SIZE_RX); + return container_of(tls_offload_ctx_rx(tls_ctx), + struct mlx5e_tls_offload_context_rx, + base); +} + void mlx5e_tls_build_netdev(struct mlx5e_priv *priv); int mlx5e_tls_init(struct mlx5e_priv *priv); void mlx5e_tls_cleanup(struct mlx5e_priv *priv); -- 1.8.3.1
[PATCH v3 net-next 09/19] tls: Add rx inline crypto offload
This patch completes the generic infrastructure to offload TLS crypto to a network device. It enables the kernel to skip decryption and authentication of some skbs marked as decrypted by the NIC. In the fast path, all packets received are decrypted by the NIC and the performance is comparable to plain TCP. This infrastructure doesn't require a TCP offload engine. Instead, the NIC only decrypts packets that contain the expected TCP sequence number. Out-Of-Order TCP packets are provided unmodified. As a result, in the worst case a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be re-encrypted, only to be decrypted. The notable differences between SW KTLS Rx and this offload are as follows:
1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering.
2. Resynchronization - tls_read_size calls the device driver to resynchronize HW after HW lost track of TLS record framing in the TCP stream.
Signed-off-by: Boris Pismenny --- include/net/tls.h | 63 +- net/tls/tls_device.c | 278 ++ net/tls/tls_device_fallback.c | 1 + net/tls/tls_main.c| 32 +++-- net/tls/tls_sw.c | 24 +++- 5 files changed, 355 insertions(+), 43 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 7a485de..d8b3b65 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -83,6 +83,16 @@ struct tls_device { void (*unhash)(struct tls_device *device, struct sock *sk); }; +enum { + TLS_BASE, + TLS_SW, +#ifdef CONFIG_TLS_DEVICE + TLS_HW, +#endif + TLS_HW_RECORD, + TLS_NUM_CONFIG, +}; + struct tls_sw_context_tx { struct crypto_aead *aead_send; struct crypto_wait async_wait; @@ -197,6 +207,7 @@ struct tls_context { int (*push_pending_record)(struct sock *sk, int flags); void (*sk_write_space)(struct sock *sk); + void (*sk_destruct)(struct sock *sk); void (*sk_proto_close)(struct sock *sk, long timeout); int (*setsockopt)(struct sock *sk, int level, @@ -209,13 +220,27 @@ struct tls_context { void (*unhash)(struct sock *sk); }; +struct tls_offload_context_rx { + /* sw must be the first member of tls_offload_context_rx */ + struct tls_sw_context_rx sw; + atomic64_t resync_req; + u8 driver_state[]; + /* The TLS layer reserves room for driver specific state +* Currently the belief is that there is not enough +* driver specific state to justify another layer of indirection +*/ +}; + +#define TLS_OFFLOAD_CONTEXT_SIZE_RX\ + (ALIGN(sizeof(struct tls_offload_context_rx), sizeof(void *)) + \ +TLS_DRIVER_STATE_SIZE) + int wait_on_pending_writer(struct sock *sk, long *timeo); int tls_sk_query(struct sock *sk, int optname, char __user *optval, int __user *optlen); int tls_sk_attach(struct sock *sk, int optname, char __user *optval, unsigned int optlen); - int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx); int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size); int tls_sw_sendpage(struct sock *sk, struct page *page, @@ -290,11 +315,19 @@ static inline bool 
tls_is_pending_open_record(struct tls_context *tls_ctx) { return tls_ctx->pending_open_record_frags; } +struct sk_buff * +tls_validate_xmit_skb(struct sock *sk, struct net_device *dev, + struct sk_buff *skb); + static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk) { - return sk_fullsock(sk) && - /* matches smp_store_release in tls_set_device_offload */ - smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct; +#ifdef CONFIG_SOCK_VALIDATE_XMIT + return sk_fullsock(sk) && + (smp_load_acquire(&sk->sk_validate_xmit_skb) == + &tls_validate_xmit_skb); +#else + return false; +#endif } static inline void tls_err_abort(struct sock *sk, int err) @@ -387,10 +420,27 @@ static inline struct tls_sw_context_tx *tls_sw_ctx_tx( return (struct tls_offload_context_tx *)tls_ctx->priv_ctx_tx; } +static inline struct tls_offload_context_rx * +tls_offload_ctx_rx(const struct tls_context *tls_ctx) +{ + return (struct tls_offload_context_rx *)tls_ctx->priv_ctx_rx; +} + +/* The TLS context is valid until sk_destruct is called */ +static inline void tls_offload_rx_resync_request(struct sock *sk, __be32 seq) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_rx *rx_ctx = tls_offload_ctx_rx(tls_ctx); + + atomic64_set(&rx_ctx->resync_req, ((((uint64_t)ntohl(seq)) << 32) | 1)); +} + + int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, unsigned char *record_type); void tls_r
[PATCH v3 net-next 12/19] net/mlx5: Accel, add TLS rx offload routines
In Innova TLS, TLS contexts are added or deleted via a command message over the SBU connection. The HW then sends a response message over the same connection. Complete the implementation for Innova TLS (FPGA-based) hardware by adding support for rx inline crypto offload. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- .../net/ethernet/mellanox/mlx5/core/accel/tls.c| 23 +++-- .../net/ethernet/mellanox/mlx5/core/accel/tls.h| 26 +++-- drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 113 - drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h | 18 ++-- include/linux/mlx5/mlx5_ifc_fpga.h | 1 + 5 files changed, 135 insertions(+), 46 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c index 77ac19f..da7bd26 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c @@ -37,17 +37,26 @@ #include "mlx5_core.h" #include "fpga/tls.h" -int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, - struct tls_crypto_info *crypto_info, - u32 start_offload_tcp_sn, u32 *p_swid) +int mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow, + struct tls_crypto_info *crypto_info, + u32 start_offload_tcp_sn, u32 *p_swid, + bool direction_sx) { - return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info, -start_offload_tcp_sn, p_swid); + return mlx5_fpga_tls_add_flow(mdev, flow, crypto_info, + start_offload_tcp_sn, p_swid, + direction_sx); } -void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) +void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid, +bool direction_sx) { - mlx5_fpga_tls_del_tx_flow(mdev, swid, GFP_KERNEL); + mlx5_fpga_tls_del_flow(mdev, swid, GFP_KERNEL, direction_sx); +} + +int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, u32 seq, +u64 rcd_sn) +{ + return mlx5_fpga_tls_resync_rx(mdev, handle, seq, rcd_sn); } bool mlx5_accel_is_tls_device(struct 
mlx5_core_dev *mdev) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h index 6f9c9f4..2228c10 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h @@ -60,10 +60,14 @@ struct mlx5_ifc_tls_flow_bits { u8 reserved_at_2[0x1e]; }; -int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, - struct tls_crypto_info *crypto_info, - u32 start_offload_tcp_sn, u32 *p_swid); -void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid); +int mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow, + struct tls_crypto_info *crypto_info, + u32 start_offload_tcp_sn, u32 *p_swid, + bool direction_sx); +void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid, +bool direction_sx); +int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, u32 seq, +u64 rcd_sn); bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev); u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev); int mlx5_accel_tls_init(struct mlx5_core_dev *mdev); @@ -71,11 +75,15 @@ int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, #else -static inline int -mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow, - struct tls_crypto_info *crypto_info, - u32 start_offload_tcp_sn, u32 *p_swid) { return 0; } -static inline void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) { } +static inline int +mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow, + struct tls_crypto_info *crypto_info, + u32 start_offload_tcp_sn, u32 *p_swid, + bool direction_sx) { return -ENOTSUPP; } +static inline void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid, + bool direction_sx) { } +static inline int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, + u32 seq, u64 rcd_sn) { return 0; } static inline bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev) { return false; } 
static inline u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev) { return 0; } static inline int mlx5_accel_tls_init(struct mlx5_core_dev *mdev) { return 0; } d
KASAN: use-after-free Read in p9_fd_poll
Hello, syzbot found the following crash on: HEAD commit:30c2c32d7f70 Merge tag 'drm-fixes-2018-07-10' of git://ano.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=1662c5b240 kernel config: https://syzkaller.appspot.com/x/.config?x=25856fac4e580aa7 dashboard link: https://syzkaller.appspot.com/bug?extid=0442e6e2f7e1e33b1037 compiler: gcc (GCC) 8.0.1 20180413 (experimental) Unfortunately, I don't have any reproducer for this crash yet. IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+0442e6e2f7e1e33b1...@syzkaller.appspotmail.com 9pnet: p9_errstr2errno: server reported unknown error etz0e&��?�d$5ܱI3� QAT: Invalid ioctl == BUG: KASAN: use-after-free in p9_fd_poll+0x280/0x2b0 net/9p/trans_fd.c:238 Read of size 8 at addr 8801c647ec80 by task kworker/1:3/5005 CPU: 1 PID: 5005 Comm: kworker/1:3 Not tainted 4.18.0-rc4+ #140 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events p9_poll_workfn Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 p9_fd_poll+0x280/0x2b0 net/9p/trans_fd.c:238 p9_poll_mux net/9p/trans_fd.c:617 [inline] p9_poll_workfn+0x463/0x6d0 net/9p/trans_fd.c:1107 process_one_work+0xc73/0x1ba0 kernel/workqueue.c:2153 worker_thread+0x189/0x13c0 kernel/workqueue.c:2296 kthread+0x345/0x410 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 Allocated by task 29121: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620 kmalloc include/linux/slab.h:513 [inline] kzalloc include/linux/slab.h:707 [inline] p9_fd_open 
net/9p/trans_fd.c:796 [inline] p9_fd_create+0x1a7/0x3f0 net/9p/trans_fd.c:1036 p9_client_create+0x915/0x16c9 net/9p/client.c:1062 v9fs_session_init+0x21a/0x1a80 fs/9p/v9fs.c:400 v9fs_mount+0x7c/0x900 fs/9p/vfs_super.c:135 mount_fs+0xae/0x328 fs/super.c:1277 vfs_kern_mount.part.34+0xdc/0x4e0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:1027 [inline] do_new_mount fs/namespace.c:2518 [inline] do_mount+0x581/0x30e0 fs/namespace.c:2848 ksys_mount+0x12d/0x140 fs/namespace.c:3064 __do_sys_mount fs/namespace.c:3078 [inline] __se_sys_mount fs/namespace.c:3075 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 29121: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xd9/0x260 mm/slab.c:3813 p9_fd_close+0x416/0x5b0 net/9p/trans_fd.c:893 p9_client_create+0xac2/0x16c9 net/9p/client.c:1076 v9fs_session_init+0x21a/0x1a80 fs/9p/v9fs.c:400 v9fs_mount+0x7c/0x900 fs/9p/vfs_super.c:135 mount_fs+0xae/0x328 fs/super.c:1277 vfs_kern_mount.part.34+0xdc/0x4e0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:1027 [inline] do_new_mount fs/namespace.c:2518 [inline] do_mount+0x581/0x30e0 fs/namespace.c:2848 ksys_mount+0x12d/0x140 fs/namespace.c:3064 __do_sys_mount fs/namespace.c:3078 [inline] __se_sys_mount fs/namespace.c:3075 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at 8801c647ec80 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 0 bytes inside of 512-byte region [8801c647ec80, 8801c647ee80) The buggy address belongs to the page: page:ea0007191f80 count:1 mapcount:0 mapping:8801da800940 index:0x0 flags: 0x2fffc000100(slab) raw: 
02fffc000100 ea0006a8cc48 ea00074be548 8801da800940 raw: 8801c647e000 00010006 page dumped because: kasan: bad access detected Memory state around the buggy address: 8801c647eb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 8801c647ec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 8801c647ec80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ 8801c647ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 8801c647ed80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb == --- This bug is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegr
[PATCH v3 net-next 04/19] tcp: Don't coalesce decrypted and encrypted SKBs
Prevent coalescing of decrypted and encrypted SKBs in GRO and at the TCP layer. Signed-off-by: Boris Pismenny Signed-off-by: Ilya Lesokhin --- net/ipv4/tcp_input.c | 12 net/ipv4/tcp_offload.c | 3 +++ 2 files changed, 15 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 814ea43..f89d86a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4343,6 +4343,11 @@ static bool tcp_try_coalesce(struct sock *sk, if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq) return false; +#ifdef CONFIG_TLS_DEVICE + if (from->decrypted != to->decrypted) + return false; +#endif + if (!skb_try_coalesce(to, from, fragstolen, &delta)) return false; @@ -4872,6 +4877,9 @@ void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb) break; memcpy(nskb->cb, skb->cb, sizeof(skb->cb)); +#ifdef CONFIG_TLS_DEVICE + nskb->decrypted = skb->decrypted; +#endif TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start; if (list) __skb_queue_before(list, skb, nskb); @@ -4899,6 +4907,10 @@ void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb) skb == tail || (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) goto end; +#ifdef CONFIG_TLS_DEVICE + if (skb->decrypted != nskb->decrypted) + goto end; +#endif } } } diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index f5aee64..870b0a3 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -262,6 +262,9 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb) flush |= (len - 1) >= mss; flush |= (ntohl(th2->seq) + skb_gro_len(p)) ^ ntohl(th->seq); +#ifdef CONFIG_TLS_DEVICE + flush |= p->decrypted ^ skb->decrypted; +#endif if (flush || skb_gro_receive(p, skb)) { mss = 1; -- 1.8.3.1
[PATCH net-next 4/5 v3] net: gemini: Move main init to port
The initialization sequence for the ethernet, setting up interrupt routing and such things, needs to be done after both the ports are clocked and reset. Before this, the config will not "take". Move the initialization to the port probe function and keep track of init status in the state. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - No changes, just resending with the rest. ChangeLog v1->v2: - No changes, just resending with the rest. --- drivers/net/ethernet/cortina/gemini.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 2457a1239d69..0f1d26441177 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -151,6 +151,7 @@ struct gemini_ethernet { void __iomem *base; struct gemini_ethernet_port *port0; struct gemini_ethernet_port *port1; + bool initialized; spinlock_t irq_lock; /* Locks IRQ-related registers */ unsigned int freeq_order; @@ -2303,6 +2304,14 @@ static void gemini_port_remove(struct gemini_ethernet_port *port) static void gemini_ethernet_init(struct gemini_ethernet *geth) { + /* Only do this once both ports are online */ + if (geth->initialized) + return; + if (geth->port0 && geth->port1) + geth->initialized = true; + else + return; + writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_0_REG); writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_1_REG); writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_2_REG); @@ -2450,6 +2459,10 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev) geth->port0 = port; else geth->port1 = port; + + /* This will just be done once both ports are up and reset */ + gemini_ethernet_init(geth); + platform_set_drvdata(pdev, port); /* Set up and register the netdev */ @@ -2567,7 +2580,6 @@ static int gemini_ethernet_probe(struct platform_device *pdev) spin_lock_init(&geth->irq_lock); spin_lock_init(&geth->freeq_lock); - gemini_ethernet_init(geth); /* The children will use 
this */ platform_set_drvdata(pdev, geth); @@ -2580,8 +2592,8 @@ static int gemini_ethernet_remove(struct platform_device *pdev) { struct gemini_ethernet *geth = platform_get_drvdata(pdev); - gemini_ethernet_init(geth); geth_cleanup_freeq(geth); + geth->initialized = false; return 0; } -- 2.17.1
[PATCH net-next 2/5 v3] net: gemini: Improve connection prints
Switch over to using a module parameter and debug prints that can be controlled by this or ethtool like everyone else. Demote all other prints to debug messages. The phy_print_status() was already in place, albeit never really used because the debuglevel hiding it had to be set up using ethtool. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - Use phy_attached_info() like all other drivers. - Put it in an if (netif_msg_link()) clause like the other message from phy_print_status(). - Explain more in the commit message. ChangeLog v1->v2: - Use a module parameter and the message levels like all other drivers and stop trying to be special. --- drivers/net/ethernet/cortina/gemini.c | 46 +++ 1 file changed, 26 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 8fc31723f700..f0ab6426daca 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -46,6 +46,11 @@ #define DRV_NAME "gmac-gemini" #define DRV_VERSION "1.0" +#define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK) +static int debug = -1; +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + #define HSIZE_8 0x00 #define HSIZE_16 0x01 #define HSIZE_32 0x02 @@ -300,23 +305,26 @@ static void gmac_speed_set(struct net_device *netdev) status.bits.speed = GMAC_SPEED_1000; if (phydev->interface == PHY_INTERFACE_MODE_RGMII) status.bits.mii_rmii = GMAC_PHY_RGMII_1000; - netdev_info(netdev, "connect to RGMII @ 1Gbit\n"); + netdev_dbg(netdev, "connect %s to RGMII @ 1Gbit\n", + phydev_name(phydev)); break; case 100: status.bits.speed = GMAC_SPEED_100; if (phydev->interface == PHY_INTERFACE_MODE_RGMII) status.bits.mii_rmii = GMAC_PHY_RGMII_100_10; - netdev_info(netdev, "connect to RGMII @ 100 Mbit\n"); + netdev_dbg(netdev, "connect %s to RGMII @ 100 Mbit\n", + phydev_name(phydev)); break; case 10: status.bits.speed = GMAC_SPEED_10; if 
(phydev->interface == PHY_INTERFACE_MODE_RGMII) status.bits.mii_rmii = GMAC_PHY_RGMII_100_10; - netdev_info(netdev, "connect to RGMII @ 10 Mbit\n"); + netdev_dbg(netdev, "connect %s to RGMII @ 10 Mbit\n", + phydev_name(phydev)); break; default: - netdev_warn(netdev, "Not supported PHY speed (%d)\n", - phydev->speed); + netdev_warn(netdev, "Unsupported PHY speed (%d) on %s\n", + phydev->speed, phydev_name(phydev)); } if (phydev->duplex == DUPLEX_FULL) { @@ -363,12 +371,6 @@ static int gmac_setup_phy(struct net_device *netdev) return -ENODEV; netdev->phydev = phy; - netdev_info(netdev, "connected to PHY \"%s\"\n", - phydev_name(phy)); - phy_attached_print(phy, "phy_id=0x%.8lx, phy_mode=%s\n", - (unsigned long)phy->phy_id, - phy_modes(phy->interface)); - phy->supported &= PHY_GBIT_FEATURES; phy->supported |= SUPPORTED_Asym_Pause | SUPPORTED_Pause; phy->advertising = phy->supported; @@ -376,19 +378,19 @@ static int gmac_setup_phy(struct net_device *netdev) /* set PHY interface type */ switch (phy->interface) { case PHY_INTERFACE_MODE_MII: - netdev_info(netdev, "set GMAC0 to GMII mode, GMAC1 disabled\n"); + netdev_dbg(netdev, + "MII: set GMAC0 to GMII mode, GMAC1 disabled\n"); status.bits.mii_rmii = GMAC_PHY_MII; - netdev_info(netdev, "connect to MII\n"); break; case PHY_INTERFACE_MODE_GMII: - netdev_info(netdev, "set GMAC0 to GMII mode, GMAC1 disabled\n"); + netdev_dbg(netdev, + "GMII: set GMAC0 to GMII mode, GMAC1 disabled\n"); status.bits.mii_rmii = GMAC_PHY_GMII; - netdev_info(netdev, "connect to GMII\n"); break; case PHY_INTERFACE_MODE_RGMII: - dev_info(dev, "set GMAC0 and GMAC1 to MII/RGMII mode\n"); + netdev_dbg(netdev, + "RGMII: set GMAC0 and GMAC1 to MII/RGMII mode\n"); status.bits.mii_rmii = GMAC_PHY_RGMII_100_10; - netdev_info(netdev, "connect to RGMII\n"); break; default: netdev_err(netdev, "Unsupported MII interface\n"); @@ -398,6 +400,9 @@ static int gmac_setup_ph
[PATCH net-next 3/5 v3] net: gemini: Allow multiple ports to instantiate
The code was not tested with two ports actually in use at the same time. (I blame this on the lack of actual hardware using that feature.) Now, after locating a system using both ports, add the necessary fix to make both ports come up. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - No changes, just resending with the rest. ChangeLog v1->v2: - No changes, just resending with the rest. --- drivers/net/ethernet/cortina/gemini.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index f0ab6426daca..2457a1239d69 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -1789,7 +1789,10 @@ static int gmac_open(struct net_device *netdev) phy_start(netdev->phydev); err = geth_resize_freeq(port); - if (err) { + /* It's fine if it's just busy, the other port has set up +* the freeq in that case. +*/ + if (err && (err != -EBUSY)) { netdev_err(netdev, "could not resize freeq\n"); goto err_stop_phy; } -- 2.17.1
[PATCH net-next 5/5 v3] net: gemini: Indicate that we can handle jumboframes
The hardware supposedly handles frames up to 10236 bytes and implements .ndo_change_mtu(), so accept 10236 minus the ethernet header for a VLAN-tagged frame on the netdevices. Use ETH_MIN_MTU as minimum MTU. Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - No changes, just resending with the rest. ChangeLog v1->v2: - Change the min MTU from 256 (vendor code) to ETH_MIN_MTU which makes more sense. --- drivers/net/ethernet/cortina/gemini.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 0f1d26441177..22f495b490d4 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -2476,6 +2476,11 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev) netdev->hw_features = GMAC_OFFLOAD_FEATURES; netdev->features |= GMAC_OFFLOAD_FEATURES | NETIF_F_GRO; + /* We can handle jumbo frames up to 10236 bytes, so let's accept +* payloads of 10236 bytes minus VLAN and ethernet header +*/ + netdev->min_mtu = ETH_MIN_MTU; + netdev->max_mtu = 10236 - VLAN_ETH_HLEN; port->freeq_refill = 0; netif_napi_add(netdev, &port->napi, gmac_napi_poll, -- 2.17.1
[PATCH net-next 1/5 v3] net: gemini: Look up L3 maxlen from table
The code to calculate the hardware register enumerator for the maximum L3 length isn't entirely simple to read. Use the existing defines and rewrite the function into a table look-up. Acked-by: Michał Mirosław Signed-off-by: Linus Walleij --- ChangeLog v2->v3: - Collected Michał's ACK. ChangeLog v1->v2: - No changes, just resending with the rest. --- drivers/net/ethernet/cortina/gemini.c | 61 --- 1 file changed, 46 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 6d7404f66f84..8fc31723f700 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -401,26 +401,57 @@ static int gmac_setup_phy(struct net_device *netdev) return 0; } -static int gmac_pick_rx_max_len(int max_l3_len) -{ - /* index = CONFIG_MAXLEN_XXX values */ - static const int max_len[8] = { - 1536, 1518, 1522, 1542, - 9212, 10236, 1518, 1518 - }; - int i, n = 5; +/* The maximum frame length is not logically enumerated in the + * hardware, so we do a table lookup to find the applicable max + * frame length. 
+ */ +struct gmac_max_framelen { + unsigned int max_l3_len; + u8 val; +}; - max_l3_len += ETH_HLEN + VLAN_HLEN; +static const struct gmac_max_framelen gmac_maxlens[] = { + { + .max_l3_len = 1518, + .val = CONFIG0_MAXLEN_1518, + }, + { + .max_l3_len = 1522, + .val = CONFIG0_MAXLEN_1522, + }, + { + .max_l3_len = 1536, + .val = CONFIG0_MAXLEN_1536, + }, + { + .max_l3_len = 1542, + .val = CONFIG0_MAXLEN_1542, + }, + { + .max_l3_len = 9212, + .val = CONFIG0_MAXLEN_9k, + }, + { + .max_l3_len = 10236, + .val = CONFIG0_MAXLEN_10k, + }, +}; + +static int gmac_pick_rx_max_len(unsigned int max_l3_len) +{ + const struct gmac_max_framelen *maxlen; + int maxtot; + int i; - if (max_l3_len > max_len[n]) - return -1; + maxtot = max_l3_len + ETH_HLEN + VLAN_HLEN; - for (i = 0; i < 5; i++) { - if (max_len[i] >= max_l3_len && max_len[i] < max_len[n]) - n = i; + for (i = 0; i < ARRAY_SIZE(gmac_maxlens); i++) { + maxlen = &gmac_maxlens[i]; + if (maxtot <= maxlen->max_l3_len) + return maxlen->val; } - return n; + return -1; } static int gmac_init(struct net_device *netdev) -- 2.17.1
Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
On Tue, Jul 10, 2018 at 07:25:53PM -0700, Cong Wang wrote: > On Mon, Jul 9, 2018 at 2:40 PM Marcelo Ricardo Leitner > wrote: > > > > On Mon, Jul 09, 2018 at 05:03:31PM -0400, Michel Machado wrote: > > >Changing TC_PRIO_MAX from 15 to 63 risks breaking backward > > > compatibility > > > with applications. > > > > If done, it needs to be done carefully, indeed. I don't know if it's > > doable, neither I know how hard is your requirement for 64 different > > priorities. > > struct tc_prio_qopt { > int bands; /* Number of bands */ > __u8 priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band > */ > }; > > How would you do it carefully? quick shot, multiplex v1 and v2 formats based on bands and sizeof(): #define TCQ_PRIO_BANDS_V1 16 #define TCQ_PRIO_BANDS_V2 64 #define TC_PRIO_MAX_V2 63 struct tc_prio_qopt_v2 { int bands; /* Number of bands */ __u8 priomap[TC_PRIO_MAX_V2+1]; /* Map: logical priority -> PRIO band */ }; static int prio_tune(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct prio_sched_data *q = qdisc_priv(sch); struct Qdisc *queues[TCQ_PRIO_BANDS_V2]; int oldbands = q->bands, i; struct tc_prio_qopt_v2 *qopt; if (nla_len(opt) < sizeof(int)) return -EINVAL; qopt = nla_data(opt); if (qopt->bands <= TCQ_PRIO_BANDS_V1 && nla_len(opt) < sizeof(struct tc_prio_qopt)) return -EINVAL; if (qopt->bands > TCQ_PRIO_BANDS_V1 && nla_len(opt) < sizeof(*qopt)) return -EINVAL; /* By here, if it has up to 16 bands, we can assume it is using the _v1 * layout, while if it has more than that (up to TCQ_PRIO_BANDS_V2) it is * using the _v2 format. */ if (qopt->bands > TCQ_PRIO_BANDS_V2 || qopt->bands < 2) return -EINVAL; ... With something like this I think it can keep compatibility with old software while also allowing the new usage. > Also, it is not only used by prio but also pfifo_fast. Yes. More is needed, indeed. prio2band would also need to be expanded, etc. Yet, I still don't see any blocker.
Re: [PATCH net-next 5/5 v2] net: gemini: Indicate that we can handle jumboframes
On Wed, Jul 4, 2018 at 10:35 PM Andrew Lunn wrote: > > On Wed, Jul 04, 2018 at 08:33:24PM +0200, Linus Walleij wrote: > > The hardware supposedly handles frames up to 10236 bytes and > > implements .ndo_change_mtu() so accept 10236 minus the ethernet > > header for a VLAN tagged frame on the netdevices. Use > > ETH_MIN_MTU as minimum MTU. > > > > Signed-off-by: Linus Walleij > > Hi Linus > > Did you try with an MTU of 68? Maybe the vendor picked 256 because of > a hardware limit? Yeah works fine: ping -s 68 169.254.1.2 PING 169.254.1.2 (169.254.1.2) 68(96) bytes of data. 76 bytes from 169.254.1.2: icmp_seq=1 ttl=64 time=0.359 ms 76 bytes from 169.254.1.2: icmp_seq=2 ttl=64 time=0.346 ms 76 bytes from 169.254.1.2: icmp_seq=3 ttl=64 time=0.351 ms This also works fine: ping -s 9000 169.254.1.2 PING 169.254.1.2 (169.254.1.2) 9000(9028) bytes of data. 9008 bytes from 169.254.1.2: icmp_seq=1 ttl=64 time=1.45 ms 9008 bytes from 169.254.1.2: icmp_seq=2 ttl=64 time=1.68 ms 9008 bytes from 169.254.1.2: icmp_seq=3 ttl=64 time=1.55 ms I'll send new patches with all suggested changes soon :) Thanks a lot for your help! Yours, Linus Walleij
Re: [net-next PATCH] net: ipv4: fix listify ip_rcv_finish in case of forwarding
On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote: > Only driver sfc actually uses this, but I don't have this NIC, so I > tested this on mlx5, with my own changes to make it use > netif_receive_skb_list(), > but I'm not ready to upstream the mlx5 driver change yet. Thanks Jesper for sharing this. Should we look forward to those patches, or do you want us to implement them? Thanks, Saeed.
Re: [PATCH net-next v2 04/11] devlink: Add support for region get command
On Wed, 11 Jul 2018 13:43:01 +0300, Alex Vesker wrote: > + DEVLINK_ATTR_REGION_SIZE, /* u32 */ > + err = nla_put_u64_64bit(msg, DEVLINK_ATTR_REGION_SIZE, > + region->size, > + DEVLINK_ATTR_PAD); Size in the comment looks incorrect.
Re: [PATCH v3 net-next] net/sched: add skbprio scheduler
On Tue, Jul 10, 2018 at 07:32:43PM -0700, Cong Wang wrote: > On Mon, Jul 9, 2018 at 12:53 PM Marcelo Ricardo Leitner > wrote: > > > > On Mon, Jul 09, 2018 at 02:18:33PM -0400, Michel Machado wrote: > > > > > >2. sch_prio.c does not have a global limit on the number of packets on > > > all its queues, only a limit per queue. > > > > It can be useful to sch_prio.c as well, why not? > > prio_enqueue() > > { > > ... > > + if (count > sch->global_limit) > > + prio_tail_drop(sch); /* to be implemented */ > > ret = qdisc_enqueue(skb, qdisc, to_free); > > > > Isn't the whole point of sch_prio offloading the queueing to > each class? If you need a limit, there is one for each child > qdisc if you use for example pfifo or bfifo (depending on you > want to limit bytes or packets). Yes, but Michel wants to drop from other lower priorities if needed, and that's not possible if you handle the limit already in a child qdisc as they don't know about their siblings. The idea in the example above is to discard it from whatever lower priority is needed, then queue it. (ok, the example misses checking the priority level) As for the different units, sch_prio holds a count of how many packets are queued on its children, and that's what would be used for the limit. > > Also, what's your plan for backward compatibility here? say: if (sch->global_limit && count > sch->global_limit) as in, only do the limit check/enforcing if needed.
[PATCH] of: mdio: Support fixed links in of_phy_get_and_connect()
By a simple extension of of_phy_get_and_connect(), drivers that have their port on e.g. RGMII can also support fixed links, so in addition to: ethernet-port { phy-mode = "rgmii"; phy-handle = <&foo>; }; This setup with a fixed-link node and no phy-handle will now also work just fine: ethernet-port { phy-mode = "rgmii"; fixed-link { speed = <1000>; full-duplex; pause; }; }; This is very helpful for connecting random ethernet ports to e.g. DSA switches that typically reside on fixed links. The phy-mode is still there as the fixed link in this case is still an RGMII link. Tested on the Cortina Gemini driver with the Vitesse DSA router chip on a fixed 1Gbit link. Suggested-by: Andrew Lunn Signed-off-by: Linus Walleij --- drivers/of/of_mdio.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c index d963baf8e53a..e92391d6d1bd 100644 --- a/drivers/of/of_mdio.c +++ b/drivers/of/of_mdio.c @@ -367,14 +367,23 @@ struct phy_device *of_phy_get_and_connect(struct net_device *dev, phy_interface_t iface; struct device_node *phy_np; struct phy_device *phy; + int ret; iface = of_get_phy_mode(np); if (iface < 0) return NULL; - - phy_np = of_parse_phandle(np, "phy-handle", 0); - if (!phy_np) - return NULL; + if (of_phy_is_fixed_link(np)) { + ret = of_phy_register_fixed_link(np); + if (ret < 0) { + netdev_err(dev, "broken fixed-link specification\n"); + return NULL; + } + phy_np = of_node_get(np); + } else { + phy_np = of_parse_phandle(np, "phy-handle", 0); + if (!phy_np) + return NULL; + } phy = of_phy_connect(dev, phy_np, hndlr, 0, iface); -- 2.17.1