Re: [PATCH] bonding: Don't update slave->link until ready to commit
On Wed, May 24, 2017 at 7:45 PM, Nithin Nayak Sujirwrote: > In the loadbalance arp monitoring scheme, when a slave link change is > detected, the slave->link is immediately updated and slave_state_changed > is set. Later down the function, the rtnl_lock is acquired and the > changes are committed, updating the bond link state. > > However, the acquisition of the rtnl_lock can fail. The next time the > monitor runs, since slave->link is already updated, it determines that > link is unchanged. This results in the bond link state permanently out > of sync with the slave link. > > This patch modifies bond_loadbalance_arp_mon() to handle link changes > identical to bond_ab_arp_{inspect/commit}(). The new link state is > maintained in slave->new_link until we're ready to commit at which point > it's copied into slave->link. > > NOTE: miimon_{inspect/commit}() has a more complex state machine > requiring the use of the bond_{propose,commit}_link_state() functions > which maintains the intermediate state in slave->link_new_state. The arp > monitors don't require that. > > Testing: This bug is very easy to reproduce with the following steps. > 1. In a loop, toggle a slave link of a bond slave interface. > 2. In a separate loop, do ifconfig up/down of an unrelated interface to > create contention for rtnl_lock. > Within a few iterations, the bond link goes out of sync with the slave > link. > > Signed-off-by: Nithin Nayak Sujir Acked-by: Mahesh Bandewar > Cc: Mahesh Bandewar > Cc: Jay Vosburgh > --- > drivers/net/bonding/bond_main.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c > index 7331331..2359478b 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -2612,11 +2612,13 @@ static void bond_loadbalance_arp_mon(struct bonding > *bond) > bond_for_each_slave_rcu(bond, slave, iter) { > unsigned long trans_start = dev_trans_start(slave->dev); > > + slave->new_link = BOND_LINK_NOCHANGE; > + > if (slave->link != BOND_LINK_UP) { > if (bond_time_in_interval(bond, trans_start, 1) && > bond_time_in_interval(bond, slave->last_rx, 1)) { > > - slave->link = BOND_LINK_UP; > + slave->new_link = BOND_LINK_UP; > slave_state_changed = 1; > > /* primary_slave has no meaning in round-robin > @@ -2643,7 +2645,7 @@ static void bond_loadbalance_arp_mon(struct bonding > *bond) > if (!bond_time_in_interval(bond, trans_start, 2) || > !bond_time_in_interval(bond, slave->last_rx, 2)) { > > - slave->link = BOND_LINK_DOWN; > + slave->new_link = BOND_LINK_DOWN; > slave_state_changed = 1; > > if (slave->link_failure_count < UINT_MAX) > @@ -2674,6 +2676,11 @@ static void bond_loadbalance_arp_mon(struct bonding > *bond) > if (!rtnl_trylock()) > goto re_arm; > > + bond_for_each_slave(bond, slave, iter) { > + if (slave->new_link != BOND_LINK_NOCHANGE) > + slave->link = slave->new_link; > + } > + > if (slave_state_changed) { > bond_slave_state_change(bond); > if (BOND_MODE(bond) == BOND_MODE_XOR) > -- > 2.8.2 >
RE: [for-next 4/6] net/mlx5: FPGA, Add basic support for Innova
> -Original Message- > From: Alexei Starovoitov [mailto:alexei.starovoi...@gmail.com] > > On Tue, May 23, 2017 at 02:44:02PM +0300, Saeed Mahameed wrote: > > From: Ilan Tayari> > > > Mellanox Innova is a NIC with ConnectX and an FPGA on the same > > board. The FPGA is a bump-on-the-wire and thus affects operation of > > the mlx5_core driver on the ConnectX ASIC. > > > > Add basic support for Innova in mlx5_core. > > > > This allows using the Innova card as a regular NIC, by detecting > > the FPGA capability bit, and verifying its load state before > > initializing ConnectX interfaces. > > That doesn't sound like cx4/cx5 hw that mlx5 driver suppose to support. Hi Alexei, Thanks for your feedback. Let me address your concerns. I didn't state it, but the current iterations of Innova cards include the ConnectX-4LX chip, which is exactly what mlx5 driver supports. The PCI interface is *only* through the ConnectX chip with some new Firmware commands. This patch doesn't register any new PCI device, so it does not make sense to create a separate driver. > > > Also detect FPGA fatal runtime failures and enter error state if > > they ever happen. > > > > All new FPGA-related logic is placed in its own subdirectory 'fpga', > > which may be built by selecting CONFIG_MLX5_FPGA. > > This prepares for further support of various Innova features in later > > patchsets. > > Additional details about hardware architecture will be provided as > > more features get submitted. > > > > Signed-off-by: Ilan Tayari > > Reviewed-by: Boris Pismenny > > Signed-off-by: Saeed Mahameed > > --- > > MAINTAINERS| 10 + > > drivers/net/ethernet/mellanox/mlx5/core/Kconfig| 10 + > > drivers/net/ethernet/mellanox/mlx5/core/Makefile | 3 + > > drivers/net/ethernet/mellanox/mlx5/core/eq.c | 11 ++ > > drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c | 64 +++ > > drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.h | 59 ++ > > .../net/ethernet/mellanox/mlx5/core/fpga/core.c| 202 > + > > .../net/ethernet/mellanox/mlx5/core/fpga/core.h| 99 ++ > > drivers/net/ethernet/mellanox/mlx5/core/main.c | 19 +- > > include/linux/mlx5/device.h| 6 + > > include/linux/mlx5/driver.h| 5 + > > include/linux/mlx5/mlx5_ifc.h | 11 +- > > include/linux/mlx5/mlx5_ifc_fpga.h | 144 > +++ > > 13 files changed, 640 insertions(+), 3 deletions(-) > > Can you put it into different driver? Dumping everything into by far > the biggest nic driver already is already huge headache in terms on > maintainability, debugging and back ports. > Look at how intel splits their drivers. > ixgb, ixgbe, ixgbevf are different drivers thought they have a lot in > common. On one side it's a bit of copy paste, but on the other side > it makes drivers much easier to develop and maintain independently. > ConnectX-6 code and any future hw support doesn't belong to > mlx5 driver at all. If you build your driver without explicitly enabling CONFIG_MLX5_FPGA=y (default is N) then you get none of this, and your driver is practically the same as before. The FPGA and CX operation is very tightly integrated. If you don't want any of this, simply don't opt-in to CONFIG_MLX5_FPGA. If you do want this, then splitting some of the logic to a separate kernel object will not gain anything useful (logic would stay the same), and just pollute the exported symbol table and open up the door for issues of inter-module compatibility and so on. Furthermore, the next patchset will introduce IPSec offload capabilities Which are declared in netdev->hw_features. Those cannot be modified after the netdevice is created, so all the extra logic has to be integrated into the mlx5 ethernet driver. This is another reason why a separate driver is a bad idea. We will include details about the board and how things work in the cover letter of the IPSec offload patchset. Ilan.
Re: [PATCH net-next] ibmvnic: Enable TSO support
Hi Thomas, [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Thomas-Falcon/ibmvnic-Enable-TSO-support/20170525-111039 config: powerpc-allmodconfig (attached as .config) compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=powerpc All warnings (new ones prefixed by >>): In file included from include/linux/module.h:18:0, from drivers/net/ethernet/ibm/ibmvnic.c:46: drivers/net/ethernet/ibm/ibmvnic.c: In function '__check_large_send_offload': include/linux/moduleparam.h:148:27: error: return from incompatible pointer type [-Werror=incompatible-pointer-types] param_check_##type(name, &(value)); \ ^ include/linux/moduleparam.h:346:68: note: in definition of macro '__param_check' static inline type __always_unused *__check_##name(void) { return(p); } ^ include/linux/moduleparam.h:148:2: note: in expansion of macro 'param_check_bool' param_check_##type(name, &(value)); \ ^~~~ include/linux/moduleparam.h:128:2: note: in expansion of macro 'module_param_named' module_param_named(name, name, type, perm) ^~ >> drivers/net/ethernet/ibm/ibmvnic.c:85:1: note: in expansion of macro >> 'module_param' module_param(large_send_offload, bool, 0644); ^~~~ cc1: some warnings being treated as errors vim +/module_param +85 drivers/net/ethernet/ibm/ibmvnic.c 69 #include 70 #include 71 #include 72 #include 73 #include 74 #include 75 #include 76 #include 77 #include 78 79 #include "ibmvnic.h" 80 81 static const char ibmvnic_driver_name[] = "ibmvnic"; 82 static const char ibmvnic_driver_string[] = "IBM System i/p Virtual NIC Driver"; 83 84 static int large_send_offload; > 85 module_param(large_send_offload, bool, 0644); 86 MODULE_PARM_DESC(large_send_offload, 87 "Determines whether large send offload is enabled"); 88 89 MODULE_AUTHOR("Santiago Leon"); 90 MODULE_DESCRIPTION("IBM System i/p Virtual NIC Driver"); 91 MODULE_LICENSE("GPL"); 92 MODULE_VERSION(IBMVNIC_DRIVER_VERSION); 93 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
[PATCH net-next 5/6] net: mpls: Add extack messages for route add and delete failures
Add error messages for failures in adding and deleting mpls routes. This covers most of the annoying EINVAL errors. Signed-off-by: David Ahern--- net/mpls/af_mpls.c | 136 +--- net/mpls/internal.h | 2 +- 2 files changed, 98 insertions(+), 40 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index f3830951fb1c..d0120f8c8a2c 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -720,7 +720,8 @@ static int mpls_nh_build_from_cfg(struct mpls_route_config *cfg, static int mpls_nh_build(struct net *net, struct mpls_route *rt, struct mpls_nh *nh, int oif, struct nlattr *via, -struct nlattr *newdst, u8 max_labels) +struct nlattr *newdst, u8 max_labels, +struct netlink_ext_ack *extack) { int err = -ENOMEM; @@ -729,14 +730,14 @@ static int mpls_nh_build(struct net *net, struct mpls_route *rt, if (newdst) { err = nla_get_labels(newdst, max_labels, >nh_labels, -nh->nh_label, NULL); +nh->nh_label, extack); if (err) goto errout; } if (via) { err = nla_get_via(via, >nh_via_alen, >nh_via_table, - __mpls_nh_via(rt, nh)); + __mpls_nh_via(rt, nh), extack); if (err) goto errout; } else { @@ -803,7 +804,8 @@ static u8 mpls_count_nexthops(struct rtnexthop *rtnh, int len, } static int mpls_nh_build_multi(struct mpls_route_config *cfg, - struct mpls_route *rt, u8 max_labels) + struct mpls_route *rt, u8 max_labels, + struct netlink_ext_ack *extack) { struct rtnexthop *rtnh = cfg->rc_mp; struct nlattr *nla_via, *nla_newdst; @@ -837,7 +839,7 @@ static int mpls_nh_build_multi(struct mpls_route_config *cfg, err = mpls_nh_build(cfg->rc_nlinfo.nl_net, rt, nh, rtnh->rtnh_ifindex, nla_via, nla_newdst, - max_labels); + max_labels, extack); if (err) goto errout; @@ -856,7 +858,8 @@ static int mpls_nh_build_multi(struct mpls_route_config *cfg, return err; } -static int mpls_route_add(struct mpls_route_config *cfg) +static int mpls_route_add(struct mpls_route_config *cfg, + struct netlink_ext_ack *extack) { struct mpls_route __rcu **platform_label; struct net *net = cfg->rc_nlinfo.nl_net; @@ -876,17 +879,25 @@ static int mpls_route_add(struct mpls_route_config *cfg) } /* Reserved labels may not be set */ - if (index < MPLS_LABEL_FIRST_UNRESERVED) + if (index < MPLS_LABEL_FIRST_UNRESERVED) { + NL_SET_ERR_MSG(extack, + "Invalid label - must be MPLS_LABEL_FIRST_UNRESERVED or higher"); goto errout; + } /* The full 20 bit range may not be supported. */ - if (index >= net->mpls.platform_labels) + if (index >= net->mpls.platform_labels) { + NL_SET_ERR_MSG(extack, + "Label >= configured max in platform_labels"); goto errout; + } /* Append makes no sense with mpls */ err = -EOPNOTSUPP; - if (cfg->rc_nlflags & NLM_F_APPEND) + if (cfg->rc_nlflags & NLM_F_APPEND) { + NL_SET_ERR_MSG(extack, "MPLS does not support route append"); goto errout; + } err = -EEXIST; platform_label = rtnl_dereference(net->mpls.platform_label); @@ -913,8 +924,10 @@ static int mpls_route_add(struct mpls_route_config *cfg) nhs = 1; } - if (nhs == 0) + if (nhs == 0) { + NL_SET_ERR_MSG(extack, "Route does not contain a nexthop"); goto errout; + } err = -ENOMEM; rt = mpls_rt_alloc(nhs, max_via_alen, max_labels); @@ -928,7 +941,7 @@ static int mpls_route_add(struct mpls_route_config *cfg) rt->rt_ttl_propagate = cfg->rc_ttl_propagate; if (cfg->rc_mp) - err = mpls_nh_build_multi(cfg, rt, max_labels); + err = mpls_nh_build_multi(cfg, rt, max_labels, extack); else err = mpls_nh_build_from_cfg(cfg, rt); if (err) @@ -944,7 +957,8 @@ static int mpls_route_add(struct mpls_route_config *cfg) return err; } -static int mpls_route_del(struct mpls_route_config *cfg) +static int mpls_route_del(struct mpls_route_config *cfg, + struct netlink_ext_ack *extack) { struct net *net = cfg->rc_nlinfo.nl_net; unsigned
[PATCH net-next 3/6] net: plumb extack arg down to lwtunnel build state
Pass extack arg down to lwtunnel_build_state and the build_state callbacks. Add messages for failures in lwtunnel_build_state, and add the extarg to nla_parse where possible in the build_state callbacks. Signed-off-by: David Ahern--- include/linux/netlink.h | 10 ++ include/net/lwtunnel.h| 9 ++--- net/core/lwt_bpf.c| 5 +++-- net/core/lwtunnel.c | 20 +--- net/ipv4/fib_lookup.h | 3 ++- net/ipv4/fib_semantics.c | 20 +++- net/ipv4/fib_trie.c | 2 +- net/ipv4/ip_tunnel_core.c | 11 +++ net/ipv6/ila/ila_lwt.c| 5 +++-- net/ipv6/route.c | 2 +- net/ipv6/seg6_iptunnel.c | 5 +++-- net/mpls/mpls_iptunnel.c | 5 +++-- 12 files changed, 67 insertions(+), 30 deletions(-) diff --git a/include/linux/netlink.h b/include/linux/netlink.h index a68aad484c69..8664fd26eb5d 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -102,6 +102,16 @@ struct netlink_ext_ack { (extack)->bad_attr = (attr);\ } while (0) +#define NL_SET_ERR_MSG_ATTR(extack, attr, msg) do {\ + static const char __msg[] = (msg); \ + struct netlink_ext_ack *__extack = (extack);\ + \ + if (__extack) { \ + __extack->_msg = __msg; \ + __extack->bad_attr = (attr);\ + } \ +} while (0) + extern void netlink_kernel_release(struct sock *sk); extern int __netlink_change_ngroups(struct sock *sk, unsigned int groups); extern int netlink_change_ngroups(struct sock *sk, unsigned int groups); diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index ca6f002774ef..7c26863b8cf4 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -35,7 +35,8 @@ struct lwtunnel_state { struct lwtunnel_encap_ops { int (*build_state)(struct nlattr *encap, unsigned int family, const void *cfg, - struct lwtunnel_state **ts); + struct lwtunnel_state **ts, + struct netlink_ext_ack *extack); void (*destroy_state)(struct lwtunnel_state *lws); int (*output)(struct net *net, struct sock *sk, struct sk_buff *skb); int (*input)(struct sk_buff *skb); @@ -114,7 +115,8 @@ int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, int lwtunnel_build_state(u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, -struct lwtunnel_state **lws); +struct lwtunnel_state **lws, +struct netlink_ext_ack *extack); int lwtunnel_fill_encap(struct sk_buff *skb, struct lwtunnel_state *lwtstate); int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate); @@ -192,7 +194,8 @@ static inline int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, static inline int lwtunnel_build_state(u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, - struct lwtunnel_state **lws) + struct lwtunnel_state **lws, + struct netlink_ext_ack *extack) { return -EOPNOTSUPP; } diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index b3bc0a31af9f..1307731ddfe4 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -240,7 +240,8 @@ static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 1] = { static int bpf_build_state(struct nlattr *nla, unsigned int family, const void *cfg, - struct lwtunnel_state **ts) + struct lwtunnel_state **ts, + struct netlink_ext_ack *extack) { struct nlattr *tb[LWT_BPF_MAX + 1]; struct lwtunnel_state *newts; @@ -250,7 +251,7 @@ static int bpf_build_state(struct nlattr *nla, if (family != AF_INET && family != AF_INET6) return -EAFNOSUPPORT; - ret = nla_parse_nested(tb, LWT_BPF_MAX, nla, bpf_nl_policy, NULL); + ret = nla_parse_nested(tb, LWT_BPF_MAX, nla, bpf_nl_policy, extack); if (ret < 0) return ret; diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index ab840386a74d..a0feca1df7b5 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -103,25 +103,39 @@ EXPORT_SYMBOL(lwtunnel_encap_del_ops); int lwtunnel_build_state(u16 encap_type, struct nlattr *encap, unsigned int family, -const void *cfg, struct lwtunnel_state **lws) +const void *cfg, struct lwtunnel_state **lws, +
[PATCH net-next 4/6] net: Fill in extack for mpls lwt encap
Fill in extack for errors in build_state for mpls lwt encap including passing extack to nla_get_labels and adding error messages for failures in it. Signed-off-by: David Ahern--- net/mpls/af_mpls.c | 49 ++-- net/mpls/internal.h | 2 +- net/mpls/mpls_iptunnel.c | 12 +++- 3 files changed, 47 insertions(+), 16 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 257ec66009da..f3830951fb1c 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -728,8 +728,8 @@ static int mpls_nh_build(struct net *net, struct mpls_route *rt, goto errout; if (newdst) { - err = nla_get_labels(newdst, max_labels, ->nh_labels, nh->nh_label); + err = nla_get_labels(newdst, max_labels, >nh_labels, +nh->nh_label, NULL); if (err) goto errout; } @@ -782,7 +782,8 @@ static u8 mpls_count_nexthops(struct rtnexthop *rtnh, int len, nla = nla_find(attrs, attrlen, RTA_NEWDST); if (nla && - nla_get_labels(nla, MAX_NEW_LABELS, _labels, NULL) != 0) + nla_get_labels(nla, MAX_NEW_LABELS, _labels, + NULL, NULL) != 0) return 0; *max_labels = max_t(u8, *max_labels, n_labels); @@ -1541,8 +1542,8 @@ int nla_put_labels(struct sk_buff *skb, int attrtype, } EXPORT_SYMBOL_GPL(nla_put_labels); -int nla_get_labels(const struct nlattr *nla, - u8 max_labels, u8 *labels, u32 label[]) +int nla_get_labels(const struct nlattr *nla, u8 max_labels, u8 *labels, + u32 label[], struct netlink_ext_ack *extack) { unsigned len = nla_len(nla); struct mpls_shim_hdr *nla_label; @@ -1553,13 +1554,18 @@ int nla_get_labels(const struct nlattr *nla, /* len needs to be an even multiple of 4 (the label size). Number * of labels is a u8 so check for overflow. */ - if (len & 3 || len / 4 > 255) + if (len & 3 || len / 4 > 255) { + NL_SET_ERR_MSG_ATTR(extack, nla, + "Invalid length for labels attribute"); return -EINVAL; + } /* Limit the number of new labels allowed */ nla_labels = len/4; - if (nla_labels > max_labels) + if (nla_labels > max_labels) { + NL_SET_ERR_MSG(extack, "Too many labels"); return -EINVAL; + } /* when label == NULL, caller wants number of labels */ if (!label) @@ -1574,8 +1580,29 @@ int nla_get_labels(const struct nlattr *nla, /* Ensure the bottom of stack flag is properly set * and ttl and tc are both clear. */ - if ((dec.bos != bos) || dec.ttl || dec.tc) + if (dec.ttl) { + NL_SET_ERR_MSG_ATTR(extack, nla, + "TTL in label must be 0"); + return -EINVAL; + } + + if (dec.tc) { + NL_SET_ERR_MSG_ATTR(extack, nla, + "Traffic class in label must be 0"); return -EINVAL; + } + + if (dec.bos != bos) { + NL_SET_BAD_ATTR(extack, nla); + if (bos) { + NL_SET_ERR_MSG(extack, + "BOS bit must be set in first label"); + } else { + NL_SET_ERR_MSG(extack, + "BOS bit can only be set in first label"); + } + return -EINVAL; + } switch (dec.label) { case MPLS_LABEL_IMPLNULL: @@ -1583,6 +1610,8 @@ int nla_get_labels(const struct nlattr *nla, * assign and distribute, but which never * actually appears in the encapsulation. */ + NL_SET_ERR_MSG_ATTR(extack, nla, + "Implicit NULL Label (3) can not be used in encapsulation"); return -EINVAL; } @@ -1696,14 +1725,14 @@ static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh, case RTA_NEWDST: if (nla_get_labels(nla, MAX_NEW_LABELS, >rc_output_labels, - cfg->rc_output_label)) + cfg->rc_output_label, NULL)) goto errout; break; case RTA_DST:
[PATCH net-next 0/6] net: another round of extack handling for routing
This set focuses on passing extack through lwtunnel and MPLS with additional catches for IPv4 route add and minor cleanups in MPLS encountered passing the extack arg around. David Ahern (6): net: ipv4: Add extack message for invalid prefix or length net: lwtunnel: Add extack to encap attr validation net: plumb extack arg down to lwtunnel build state net: Fill in extack for mpls lwt encap net: mpls: Add extack messages for route add and delete failures net: mpls: minor cleanups include/linux/netlink.h | 10 +++ include/net/ip_fib.h | 3 +- include/net/lwtunnel.h| 22 -- net/core/lwt_bpf.c| 5 +- net/core/lwtunnel.c | 38 -- net/ipv4/fib_frontend.c | 12 +-- net/ipv4/fib_lookup.h | 3 +- net/ipv4/fib_semantics.c | 20 ++--- net/ipv4/fib_trie.c | 22 -- net/ipv4/ip_tunnel_core.c | 11 ++- net/ipv6/ila/ila_lwt.c| 5 +- net/ipv6/route.c | 6 +- net/ipv6/seg6_iptunnel.c | 5 +- net/mpls/af_mpls.c| 184 ++ net/mpls/internal.h | 4 +- net/mpls/mpls_iptunnel.c | 17 +++-- 16 files changed, 259 insertions(+), 108 deletions(-) -- 2.11.0 (Apple Git-81)
[PATCH net-next 2/6] net: lwtunnel: Add extack to encap attr validation
Pass extack down to lwtunnel_valid_encap_type and lwtunnel_valid_encap_type_attr. Add messages for unknown or unsupported encap types. Signed-off-by: David Ahern--- include/net/lwtunnel.h | 13 + net/core/lwtunnel.c | 18 +- net/ipv4/fib_frontend.c | 6 -- net/ipv6/route.c| 4 ++-- 4 files changed, 28 insertions(+), 13 deletions(-) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index ebfe237aad7e..ca6f002774ef 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -107,8 +107,10 @@ int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, unsigned int num); int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, unsigned int num); -int lwtunnel_valid_encap_type(u16 encap_type); -int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len); +int lwtunnel_valid_encap_type(u16 encap_type, + struct netlink_ext_ack *extack); +int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, + struct netlink_ext_ack *extack); int lwtunnel_build_state(u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, @@ -172,11 +174,14 @@ static inline int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, return -EOPNOTSUPP; } -static inline int lwtunnel_valid_encap_type(u16 encap_type) +static inline int lwtunnel_valid_encap_type(u16 encap_type, + struct netlink_ext_ack *extack) { + NL_SET_ERR_MSG(extack, "CONFIG_LWTUNNEL is not enabled in this kernel"); return -EOPNOTSUPP; } -static inline int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len) +static inline int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, +struct netlink_ext_ack *extack) { /* return 0 since we are not walking attr looking for * RTA_ENCAP_TYPE attribute on nexthops. diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index cfae3d5fe11f..ab840386a74d 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -126,14 +126,16 @@ int lwtunnel_build_state(u16 encap_type, } EXPORT_SYMBOL(lwtunnel_build_state); -int lwtunnel_valid_encap_type(u16 encap_type) +int lwtunnel_valid_encap_type(u16 encap_type, struct netlink_ext_ack *extack) { const struct lwtunnel_encap_ops *ops; int ret = -EINVAL; if (encap_type == LWTUNNEL_ENCAP_NONE || - encap_type > LWTUNNEL_ENCAP_MAX) + encap_type > LWTUNNEL_ENCAP_MAX) { + NL_SET_ERR_MSG(extack, "Unknown lwt encapsulation type"); return ret; + } rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); @@ -153,11 +155,16 @@ int lwtunnel_valid_encap_type(u16 encap_type) } } #endif - return ops ? 0 : -EOPNOTSUPP; + ret = ops ? 0 : -EOPNOTSUPP; + if (ret < 0) + NL_SET_ERR_MSG(extack, "lwt encapsulation type not supported"); + + return ret; } EXPORT_SYMBOL(lwtunnel_valid_encap_type); -int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int remaining) +int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int remaining, + struct netlink_ext_ack *extack) { struct rtnexthop *rtnh = (struct rtnexthop *)attr; struct nlattr *nla_entype; @@ -174,7 +181,8 @@ int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int remaining) if (nla_entype) { encap_type = nla_get_u16(nla_entype); - if (lwtunnel_valid_encap_type(encap_type) != 0) + if (lwtunnel_valid_encap_type(encap_type, + extack) != 0) return -EOPNOTSUPP; } } diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 2d55882d566e..f03fe8521f6e 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -684,7 +684,8 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb, break; case RTA_MULTIPATH: err = lwtunnel_valid_encap_type_attr(nla_data(attr), -nla_len(attr)); +nla_len(attr), +extack); if (err < 0) goto errout; cfg->fc_mp = nla_data(attr); @@ -701,7 +702,8 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb, break;
[PATCH net-next 1/6] net: ipv4: Add extack message for invalid prefix or length
Add extack error message for invalid prefix length and invalid prefix. Example of the latter is a route spec containing 172.16.100.1/24, where the /24 mask means the lower 8-bits should be 0. Amazing how easy that one is to overlook when an EINVAL is returned. Signed-off-by: David Ahern--- include/net/ip_fib.h| 3 ++- net/ipv4/fib_frontend.c | 6 +++--- net/ipv4/fib_trie.c | 20 +++- 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 42e8b8f55f7c..d490a0f8dc4c 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -265,7 +265,8 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, struct fib_result *res, int fib_flags); int fib_table_insert(struct net *, struct fib_table *, struct fib_config *, struct netlink_ext_ack *extack); -int fib_table_delete(struct net *, struct fib_table *, struct fib_config *); +int fib_table_delete(struct net *, struct fib_table *, struct fib_config *, +struct netlink_ext_ack *extack); int fib_table_dump(struct fib_table *table, struct sk_buff *skb, struct netlink_callback *cb); int fib_table_flush(struct net *net, struct fib_table *table); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 14d2f7bd7c76..2d55882d566e 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -588,7 +588,7 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg) if (cmd == SIOCDELRT) { tb = fib_get_table(net, cfg.fc_table); if (tb) - err = fib_table_delete(net, tb, ); + err = fib_table_delete(net, tb, , NULL); else err = -ESRCH; } else { @@ -732,7 +732,7 @@ static int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh, goto errout; } - err = fib_table_delete(net, tb, ); + err = fib_table_delete(net, tb, , extack); errout: return err; } @@ -851,7 +851,7 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad if (cmd == RTM_NEWROUTE) fib_table_insert(net, tb, , NULL); else - fib_table_delete(net, tb, ); + fib_table_delete(net, tb, , NULL); } void fib_add_ifaddr(struct in_ifaddr *ifa) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 6d0f6c79d9aa..015ff759b0fd 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1115,15 +1115,20 @@ int fib_table_insert(struct net *net, struct fib_table *tb, u32 key; int err; - if (plen > KEYLENGTH) + if (plen > KEYLENGTH) { + NL_SET_ERR_MSG(extack, "Invalid prefix length"); return -EINVAL; + } key = ntohl(cfg->fc_dst); pr_debug("Insert table=%u %08x/%d\n", tb->tb_id, key, plen); - if ((plen < KEYLENGTH) && (key << plen)) + if ((plen < KEYLENGTH) && (key << plen)) { + NL_SET_ERR_MSG(extack, + "Invalid prefix for given prefix length"); return -EINVAL; + } fi = fib_create_info(cfg, extack); if (IS_ERR(fi)) { @@ -1507,7 +1512,7 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp, /* Caller must hold RTNL. */ int fib_table_delete(struct net *net, struct fib_table *tb, -struct fib_config *cfg) +struct fib_config *cfg, struct netlink_ext_ack *extack) { struct trie *t = (struct trie *) tb->tb_data; struct fib_alias *fa, *fa_to_delete; @@ -1517,13 +1522,18 @@ int fib_table_delete(struct net *net, struct fib_table *tb, u8 tos = cfg->fc_tos; u32 key; - if (plen > KEYLENGTH) + if (plen > KEYLENGTH) { + NL_SET_ERR_MSG(extack, "Invalid prefix length"); return -EINVAL; + } key = ntohl(cfg->fc_dst); - if ((plen < KEYLENGTH) && (key << plen)) + if ((plen < KEYLENGTH) && (key << plen)) { + NL_SET_ERR_MSG(extack, + "Invalid prefix for given prefix length"); return -EINVAL; + } l = fib_find_node(t, , key); if (!l) -- 2.11.0 (Apple Git-81)
[PATCH net-next 6/6] net: mpls: minor cleanups
Noticed these doing the extack support: - nla_get_via is only used in af_mpls.c so remove from internal.h - err is initialized to EINVAL in mpls_nh_build_from_cfg but then set again before it is checked. Remove the EINVAL setting Signed-off-by: David Ahern--- net/mpls/af_mpls.c | 5 +++-- net/mpls/internal.h | 2 -- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index d0120f8c8a2c..ee468feb9e64 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -43,6 +43,9 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt, struct nlmsghdr *nlh, struct net *net, u32 portid, unsigned int nlm_flags); +static int nla_get_via(const struct nlattr *nla, u8 *via_alen, u8 *via_table, + u8 via[], struct netlink_ext_ack *extack); + static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) { struct mpls_route *rt = NULL; @@ -695,8 +698,6 @@ static int mpls_nh_build_from_cfg(struct mpls_route_config *cfg, if (!nh) return -ENOMEM; - err = -EINVAL; - nh->nh_labels = cfg->rc_output_labels; for (i = 0; i < nh->nh_labels; i++) nh->nh_label[i] = cfg->rc_output_label[i]; diff --git a/net/mpls/internal.h b/net/mpls/internal.h index a015a6a1143b..cf65aec2e551 100644 --- a/net/mpls/internal.h +++ b/net/mpls/internal.h @@ -204,8 +204,6 @@ int nla_put_labels(struct sk_buff *skb, int attrtype, u8 labels, const u32 label[]); int nla_get_labels(const struct nlattr *nla, u8 max_labels, u8 *labels, u32 label[], struct netlink_ext_ack *extack); -int nla_get_via(const struct nlattr *nla, u8 *via_alen, u8 *via_table, - u8 via[], struct netlink_ext_ack *extack); bool mpls_output_possible(const struct net_device *dev); unsigned int mpls_dev_mtu(const struct net_device *dev); bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu); -- 2.11.0 (Apple Git-81)
Re: [PATCH net-next 1/8] net: ipv4: refactor __ip_route_output_key_hash
On Wed, May 24, 2017 at 6:10 PM, David Ahernwrote: > On 5/24/17 1:33 PM, Rosen, Rami wrote: >> Hi, Rupa /David Ahern, >> >> First, thanks for this patch set! >> >> Second, it seems to me that something might be incorrect here. >> >> You have these additions in this patch (1/8): >> ... >> +struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 >> *flp, >> + const struct sk_buff *skb, >> + struct fib_result *res); >> ... >> +struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, >> + const struct sk_buff *skb) >> +{ >> + struct fib_result res; >> + struct rtable *rth; >> + >> + res.tclassid= 0; >> + res.fi = NULL; >> + res.table = NULL; >> + >> + rcu_read_lock(); >> + rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); >> rcu_read_unlock(); >> + >> return rth; >> } >> -EXPORT_SYMBOL_GPL(__ip_route_output_key_hash); >> +EXPORT_SYMBOL_GPL(ip_route_output_key_hash); >> >> >> So the third parameter to ip_route_output_key_hash_rcu() should be skb*, and >> the fourth parameter should be fib_result *. However, you do not pass the >> skb parameter >> when calling ip_route_output_key_hash_rcu() in >> ip_route_output_key_hash() (in fact you don't use it at all), and you pass >> mp_hash as the fourth parameter. > > Yep, it's a problem with the forward port of the first round of patches. yep, > > Roopa: in include/net/route.h, __ip_route_output_key_hash should be > removed as well. ack, thanks Rami and david. will fix it in v2.
[PATCH] bonding: Don't update slave->link until ready to commit
In the loadbalance arp monitoring scheme, when a slave link change is detected, the slave->link is immediately updated and slave_state_changed is set. Later down the function, the rtnl_lock is acquired and the changes are committed, updating the bond link state. However, the acquisition of the rtnl_lock can fail. The next time the monitor runs, since slave->link is already updated, it determines that link is unchanged. This results in the bond link state permanently out of sync with the slave link. This patch modifies bond_loadbalance_arp_mon() to handle link changes identical to bond_ab_arp_{inspect/commit}(). The new link state is maintained in slave->new_link until we're ready to commit at which point it's copied into slave->link. NOTE: miimon_{inspect/commit}() has a more complex state machine requiring the use of the bond_{propose,commit}_link_state() functions which maintains the intermediate state in slave->link_new_state. The arp monitors don't require that. Testing: This bug is very easy to reproduce with the following steps. 1. In a loop, toggle a slave link of a bond slave interface. 2. In a separate loop, do ifconfig up/down of an unrelated interface to create contention for rtnl_lock. Within a few iterations, the bond link goes out of sync with the slave link. Signed-off-by: Nithin Nayak SujirCc: Mahesh Bandewar Cc: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 7331331..2359478b 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2612,11 +2612,13 @@ static void bond_loadbalance_arp_mon(struct bonding *bond) bond_for_each_slave_rcu(bond, slave, iter) { unsigned long trans_start = dev_trans_start(slave->dev); + slave->new_link = BOND_LINK_NOCHANGE; + if (slave->link != BOND_LINK_UP) { if (bond_time_in_interval(bond, trans_start, 1) && bond_time_in_interval(bond, slave->last_rx, 1)) { - slave->link = BOND_LINK_UP; + slave->new_link = BOND_LINK_UP; slave_state_changed = 1; /* primary_slave has no meaning in round-robin @@ -2643,7 +2645,7 @@ static void bond_loadbalance_arp_mon(struct bonding *bond) if (!bond_time_in_interval(bond, trans_start, 2) || !bond_time_in_interval(bond, slave->last_rx, 2)) { - slave->link = BOND_LINK_DOWN; + slave->new_link = BOND_LINK_DOWN; slave_state_changed = 1; if (slave->link_failure_count < UINT_MAX) @@ -2674,6 +2676,11 @@ static void bond_loadbalance_arp_mon(struct bonding *bond) if (!rtnl_trylock()) goto re_arm; + bond_for_each_slave(bond, slave, iter) { + if (slave->new_link != BOND_LINK_NOCHANGE) + slave->link = slave->new_link; + } + if (slave_state_changed) { bond_slave_state_change(bond); if (BOND_MODE(bond) == BOND_MODE_XOR) -- 2.8.2
Re: [PATCH net-next 1/8] net: ipv4: refactor __ip_route_output_key_hash
Hi David, [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Roopa-Prabhu/net-extend-RTM_GETROUTE-to-return-fib-result/20170525-053253 config: openrisc-or1ksim_defconfig (attached as .config) compiler: or1k-linux-gcc (GCC) 5.4.0 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=openrisc Note: the linux-review/Roopa-Prabhu/net-extend-RTM_GETROUTE-to-return-fib-result/20170525-053253 HEAD 083c4ee9e124d0acf29d159ced8a22cb41665a7a builds fine. It only hurts bisectibility. All errors (new ones prefixed by >>): In file included from include/net/route.h:31:0, from include/net/lwtunnel.h:8, from include/net/ip_tunnels.h:17, from include/net/dst_metadata.h:5, from net/ipv4/route.c:94: net/ipv4/route.c: In function 'ip_route_output_key_hash_rcu': >> include/net/ip_fib.h:167:32: error: request for member 'fi' in something not >> a structure or union #define FIB_RES_NH(res) ((res).fi->fib_nh[0]) ^ include/net/ip_fib.h:196:28: note: in expansion of macro 'FIB_RES_NH' #define FIB_RES_DEV(res) (FIB_RES_NH(res).nh_dev) ^ net/ipv4/route.c:2399:35: note: in expansion of macro 'FIB_RES_DEV' dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : ^ net/ipv4/route.c: In function 'ip_route_output_key_hash': net/ipv4/route.c:2430:53: error: 'mp_hash' undeclared (first use in this function) rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^ net/ipv4/route.c:2430:53: note: each undeclared identifier is reported only once for each function it appears in net/ipv4/route.c:2430:47: error: passing argument 3 of 'ip_route_output_key_hash_rcu' from incompatible pointer type [-Werror=incompatible-pointer-types] rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^ net/ipv4/route.c:2249:16: note: expected 'const struct sk_buff *' but argument is of type 'struct fib_result *' struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4, ^ cc1: some warnings being treated as errors -- In file included from include/net/route.h:31:0, from include/net/lwtunnel.h:8, from include/net/ip_tunnels.h:17, from include/net/dst_metadata.h:5, from net//ipv4/route.c:94: net//ipv4/route.c: In function 'ip_route_output_key_hash_rcu': >> include/net/ip_fib.h:167:32: error: request for member 'fi' in something not >> a structure or union #define FIB_RES_NH(res) ((res).fi->fib_nh[0]) ^ include/net/ip_fib.h:196:28: note: in expansion of macro 'FIB_RES_NH' #define FIB_RES_DEV(res) (FIB_RES_NH(res).nh_dev) ^ net//ipv4/route.c:2399:35: note: in expansion of macro 'FIB_RES_DEV' dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : ^ net//ipv4/route.c: In function 'ip_route_output_key_hash': net//ipv4/route.c:2430:53: error: 'mp_hash' undeclared (first use in this function) rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^ net//ipv4/route.c:2430:53: note: each undeclared identifier is reported only once for each function it appears in net//ipv4/route.c:2430:47: error: passing argument 3 of 'ip_route_output_key_hash_rcu' from incompatible pointer type [-Werror=incompatible-pointer-types] rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^ net//ipv4/route.c:2249:16: note: expected 'const struct sk_buff *' but argument is of type 'struct fib_result *' struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4, ^ cc1: some warnings being treated as errors vim +/fi +167 include/net/ip_fib.h 5f300893f Thomas Graf2006-11-09 151u32 fl_mark; 246955fe4 Robert Olsson 2005-06-20 152unsigned char fl_tos; 246955fe4 Robert Olsson 2005-06-20 153unsigned char fl_scope; 246955fe4 Robert Olsson 2005-06-20 154unsigned char tb_id_in; 246955fe4 Robert Olsson 2005-06-20 155 246955fe4 Robert Olsson 2005-06-20 156unsigned char tb_id; /* Results */ 246955fe4 Robert Olsson 2005-06-20 157unsigned char prefixlen; 246955fe4 Robert Olsson 2005-06-20 158unsigned char nh_sel; 246955fe4 Robert Olsson 2005-06-20 159unsigned char type;
Re: [PATCH net-next 8/8] net: ipv6: RTM_GETROUTE: return matched fib result when requested
Since you have to do a v2 ... On 5/24/17 12:19 PM, Roopa Prabhu wrote: > @@ -3622,6 +3623,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, > struct nlmsghdr *nlh, > memset(, 0, sizeof(fl6)); > rtm = nlmsg_data(nlh); > fl6.flowlabel = ip6_make_flowinfo(rtm->rtm_tos, 0); > + fibmatch = (rtm->rtm_flags & RTM_F_FIB_MATCH) ? true : false; this is typically done as !!(rtm->rtm_flags & RTM_F_FIB_MATCH) > > if (tb[RTA_SRC]) { > if (nla_len(tb[RTA_SRC]) < sizeof(struct in6_addr)) > @@ -3667,12 +3669,27 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, > struct nlmsghdr *nlh, > if (!ipv6_addr_any()) > flags |= RT6_LOOKUP_F_HAS_SADDR; > > - rt = (struct rt6_info *)ip6_route_input_lookup(net, dev, , > -flags); > + if (!fibmatch) > + rt = (struct rt6_info *)ip6_route_input_lookup(net, dev, > +, > +flags); > } else { > fl6.flowi6_oif = oif; > > - rt = (struct rt6_info *)ip6_route_output(net, NULL, ); > + if (!fibmatch) > + rt = (struct rt6_info *)ip6_route_output_flags(net, > +NULL, > +, 0); > + } > + > + if (fibmatch) { > + rt = (struct rt6_info *)ip6_route_lookup(net, , 0); > + if (rt->dst.error) { > + err = rt->dst.error; > + ip6_rt_put(rt); > + goto errout; > + } > + I'd prefer to see the typecasts go away and use container_of to go from dst_entry to rt6_info. I realize some of this is movement of existing code, but better to clean up as we go.
[PATCH net-next 0/2] be2net: patch-set
Hi Dave, Please consider applying these two patches to net-next Suresh Reddy (2): be2net: Fix UE detection logic for BE3 be2net: Update the driver version to 11.4.0.0 drivers/net/ethernet/emulex/benet/be.h | 2 +- drivers/net/ethernet/emulex/benet/be_hw.h | 3 +++ drivers/net/ethernet/emulex/benet/be_main.c | 27 +++ 3 files changed, 23 insertions(+), 9 deletions(-) -- 1.8.3.1
[PATCH net-next 1/2] be2net: Fix UE detection logic for BE3
On certain platforms BE3 chips may indicate spurious UEs (unrecoverable error). Because of the UE detection logic was disabled in the driver for BE3 chips. Because of this, even in cases of a real UE, a failover will not occur. This patch re-enables UE detection on BE3 and if a UE is detected, reads the POST register. If the POST register, reports either a FAT_LOG_STATE or a ARMFW_UE, then it means that a valid UE occurred in the chip. Signed-off-by: Suresh Reddy--- drivers/net/ethernet/emulex/benet/be_hw.h | 3 +++ drivers/net/ethernet/emulex/benet/be_main.c | 27 +++ 2 files changed, 22 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/emulex/benet/be_hw.h b/drivers/net/ethernet/emulex/benet/be_hw.h index 36e4232..c967f45 100644 --- a/drivers/net/ethernet/emulex/benet/be_hw.h +++ b/drivers/net/ethernet/emulex/benet/be_hw.h @@ -49,6 +49,9 @@ #define POST_STAGE_BE_RESET0x3 /* Host wants to reset chip */ #define POST_STAGE_ARMFW_RDY 0xc000 /* FW is done with POST */ #define POST_STAGE_RECOVERABLE_ERR 0xE000 /* Recoverable err detected */ +/* FW has detected a UE and is dumping FAT log data */ +#define POST_STAGE_FAT_LOG_START 0x0D00 +#define POST_STAGE_ARMFW_UE0xF000 /*FW has asserted an UE*/ /* Lancer SLIPORT registers */ #define SLIPORT_STATUS_OFFSET 0x404 diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c index f3a09ab..8000551 100644 --- a/drivers/net/ethernet/emulex/benet/be_main.c +++ b/drivers/net/ethernet/emulex/benet/be_main.c @@ -3241,8 +3241,9 @@ void be_detect_error(struct be_adapter *adapter) { u32 ue_lo = 0, ue_hi = 0, ue_lo_mask = 0, ue_hi_mask = 0; u32 sliport_status = 0, sliport_err1 = 0, sliport_err2 = 0; - u32 i; struct device *dev = >pdev->dev; + u16 val; + u32 i; if (be_check_error(adapter, BE_ERROR_HW)) return; @@ -3280,15 +3281,25 @@ void be_detect_error(struct be_adapter *adapter) ue_lo = (ue_lo & ~ue_lo_mask); ue_hi = (ue_hi & ~ue_hi_mask); - /* On certain platforms BE hardware can indicate spurious UEs. -* Allow HW to stop working completely in case of a real UE. -* Hence not setting the hw_error for UE detection. -*/ - if (ue_lo || ue_hi) { + /* On certain platforms BE3 hardware can indicate +* spurious UEs. In case of a UE in the chip, +* the POST register correctly reports either a +* FAT_LOG_START state (FW is currently dumping +* FAT log data) or a ARMFW_UE state. Check for the +* above states to ascertain if the UE is valid or not. +*/ + if (BE3_chip(adapter)) { + val = be_POST_stage_get(adapter); + if ((val & POST_STAGE_FAT_LOG_START) +!= POST_STAGE_FAT_LOG_START && + (val & POST_STAGE_ARMFW_UE) +!= POST_STAGE_ARMFW_UE) + return; + } + dev_err(dev, "Error detected in the adapter"); - if (skyhawk_chip(adapter)) - be_set_error(adapter, BE_ERROR_UE); + be_set_error(adapter, BE_ERROR_UE); for (i = 0; ue_lo; ue_lo >>= 1, i++) { if (ue_lo & 1) -- 1.8.3.1
[PATCH net-next 2/2] be2net: Update the driver version to 11.4.0.0
Signed-off-by: Suresh Reddy--- drivers/net/ethernet/emulex/benet/be.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/emulex/benet/be.h b/drivers/net/ethernet/emulex/benet/be.h index 5056624..674cf9d 100644 --- a/drivers/net/ethernet/emulex/benet/be.h +++ b/drivers/net/ethernet/emulex/benet/be.h @@ -37,7 +37,7 @@ #include "be_hw.h" #include "be_roce.h" -#define DRV_VER"11.1.0.0" +#define DRV_VER"11.4.0.0" #define DRV_NAME "be2net" #define BE_NAME"Emulex BladeEngine2" #define BE3_NAME "Emulex BladeEngine3" -- 1.8.3.1
[PATCH net-next] ibmvnic: Enable TSO support
Enable TSO in the ibmvnic driver. Scatter-gather is also enabled, though there currently is no support for scatter-gather in vNIC-compatible hardware, resulting in forced linearization of all fragmented SKB's. Though not ideal, given the throughput improvement gained by enabling TSO, it has been decided that this is an acceptable tradeoff. The feature is also enabled by a module parameter. This parameter is necessary because TSO can not easily be enabled or disabled in firmware without reinitializing the driver. CC: Nathan FontenotCC: John Allen Signed-off-by: Thomas Falcon --- drivers/net/ethernet/ibm/ibmvnic.c | 39 +++--- 1 file changed, 36 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index abe2b6e..64e8d50 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -81,6 +81,11 @@ static const char ibmvnic_driver_name[] = "ibmvnic"; static const char ibmvnic_driver_string[] = "IBM System i/p Virtual NIC Driver"; +static int large_send_offload; +module_param(large_send_offload, bool, 0644); +MODULE_PARM_DESC(large_send_offload, +"Determines whether large send offload is enabled"); + MODULE_AUTHOR("Santiago Leon "); MODULE_DESCRIPTION("IBM System i/p Virtual NIC Driver"); MODULE_LICENSE("GPL"); @@ -1025,6 +1030,17 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) goto out; } + /* All scatter-gather SKB's will be linearized for the time being, but +* scatter-gather is only enabled if the user wishes to use TSO. +*/ + if (skb_shinfo(skb)->nr_frags && __skb_linearize(skb)) { + dev_kfree_skb_any(skb); + tx_send_failed++; + tx_dropped++; + ret = NETDEV_TX_OK; + goto out; + } + tx_pool = >tx_pool[queue_num]; tx_scrq = adapter->tx_scrq[queue_num]; txq = netdev_get_tx_queue(netdev, skb_get_queue_mapping(skb)); @@ -1082,6 +1098,11 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) tx_crq.v1.flags1 |= IBMVNIC_TX_CHKSUM_OFFLOAD; hdrs += 2; } + if (skb_is_gso(skb)) { + tx_crq.v1.flags1 |= IBMVNIC_TX_LSO; + tx_crq.v1.mss = cpu_to_be16(skb_shinfo(skb)->gso_size); + hdrs += 2; + } /* determine if l2/3/4 headers are sent to firmware */ if ((*hdrs >> 7) & 1 && (skb->protocol == htons(ETH_P_IP) || @@ -2629,10 +2650,10 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter) adapter->ip_offload_ctrl.udp_ipv4_chksum = buf->udp_ipv4_chksum; adapter->ip_offload_ctrl.tcp_ipv6_chksum = buf->tcp_ipv6_chksum; adapter->ip_offload_ctrl.udp_ipv6_chksum = buf->udp_ipv6_chksum; + adapter->ip_offload_ctrl.large_tx_ipv4 = buf->large_tx_ipv4; + adapter->ip_offload_ctrl.large_tx_ipv6 = buf->large_tx_ipv6; - /* large_tx/rx disabled for now, additional features needed */ - adapter->ip_offload_ctrl.large_tx_ipv4 = 0; - adapter->ip_offload_ctrl.large_tx_ipv6 = 0; + /* large_rx disabled for now, additional features needed */ adapter->ip_offload_ctrl.large_rx_ipv4 = 0; adapter->ip_offload_ctrl.large_rx_ipv6 = 0; @@ -2648,6 +2669,18 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter) (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM))) adapter->netdev->features |= NETIF_F_RXCSUM; + if (large_send_offload) { + /* Scatter-gather is not currently supported by firmware. +* It will only be enabled to support TSO operations. +*/ + adapter->netdev->features = NETIF_F_SG; + + if (buf->large_tx_ipv4) + adapter->netdev->features |= NETIF_F_TSO; + if (buf->large_tx_ipv6) + adapter->netdev->features |= NETIF_F_TSO6; + } + memset(, 0, sizeof(crq)); crq.control_ip_offload.first = IBMVNIC_CRQ_CMD; crq.control_ip_offload.cmd = CONTROL_IP_OFFLOAD; -- 1.8.3.1
Re: [PATCH net-next 7/8] net: ipv4: RTM_GETROUTE: return matched fib result when requested
On 5/24/17 12:19 PM, Roopa Prabhu wrote: > @@ -2746,8 +2748,15 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, > struct nlmsghdr *nlh, > if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE) > table_id = rt->rt_table_id; > > - err = rt_fill_info(net, dst, src, table_id, , skb, > -NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, rt); > + if (rtm->rtm_flags & RTM_F_FIB_MATCH) > + err = fib_dump_info(skb, NETLINK_CB(in_skb).portid, > + nlh->nlmsg_seq, RTM_NEWROUTE, table_id, > + rt->rt_type, res.prefix, res.prefixlen, > + fl4.flowi4_tos, res.fi, 0); I like this part -- much simpler than duplicating the attributes which is where I was headed.
Re: [PATCH net-next 1/8] net: ipv4: refactor __ip_route_output_key_hash
Hi David, [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Roopa-Prabhu/net-extend-RTM_GETROUTE-to-return-fib-result/20170525-053253 config: cris-etrax-100lx_v2_defconfig (attached as .config) compiler: cris-linux-gcc (GCC) 6.2.0 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=cris Note: the linux-review/Roopa-Prabhu/net-extend-RTM_GETROUTE-to-return-fib-result/20170525-053253 HEAD 083c4ee9e124d0acf29d159ced8a22cb41665a7a builds fine. It only hurts bisectibility. All error/warnings (new ones prefixed by >>): In file included from include/net/route.h:31:0, from include/net/lwtunnel.h:8, from include/net/ip_tunnels.h:17, from include/net/dst_metadata.h:5, from net/ipv4/route.c:94: net/ipv4/route.c: In function 'ip_route_output_key_hash_rcu': >> include/net/ip_fib.h:167:32: error: 'res' is a pointer; did you mean to use >> '->'? #define FIB_RES_NH(res) ((res).fi->fib_nh[0]) ^ -> >> include/net/ip_fib.h:196:28: note: in expansion of macro 'FIB_RES_NH' #define FIB_RES_DEV(res) (FIB_RES_NH(res).nh_dev) ^~ >> net/ipv4/route.c:2399:35: note: in expansion of macro 'FIB_RES_DEV' dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : ^~~ net/ipv4/route.c: In function 'ip_route_output_key_hash': >> net/ipv4/route.c:2430:53: error: 'mp_hash' undeclared (first use in this >> function) rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^~~ net/ipv4/route.c:2430:53: note: each undeclared identifier is reported only once for each function it appears in >> net/ipv4/route.c:2430:47: error: passing argument 3 of >> 'ip_route_output_key_hash_rcu' from incompatible pointer type >> [-Werror=incompatible-pointer-types] rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^ net/ipv4/route.c:2249:16: note: expected 'const struct sk_buff *' but argument is of type 'struct fib_result *' struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4, ^~~~ cc1: some warnings being treated as errors -- In file included from include/net/route.h:31:0, from include/net/lwtunnel.h:8, from include/net/ip_tunnels.h:17, from include/net/dst_metadata.h:5, from net//ipv4/route.c:94: net//ipv4/route.c: In function 'ip_route_output_key_hash_rcu': >> include/net/ip_fib.h:167:32: error: 'res' is a pointer; did you mean to use >> '->'? #define FIB_RES_NH(res) ((res).fi->fib_nh[0]) ^ -> >> include/net/ip_fib.h:196:28: note: in expansion of macro 'FIB_RES_NH' #define FIB_RES_DEV(res) (FIB_RES_NH(res).nh_dev) ^~ net//ipv4/route.c:2399:35: note: in expansion of macro 'FIB_RES_DEV' dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : ^~~ net//ipv4/route.c: In function 'ip_route_output_key_hash': net//ipv4/route.c:2430:53: error: 'mp_hash' undeclared (first use in this function) rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^~~ net//ipv4/route.c:2430:53: note: each undeclared identifier is reported only once for each function it appears in net//ipv4/route.c:2430:47: error: passing argument 3 of 'ip_route_output_key_hash_rcu' from incompatible pointer type [-Werror=incompatible-pointer-types] rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); ^ net//ipv4/route.c:2249:16: note: expected 'const struct sk_buff *' but argument is of type 'struct fib_result *' struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4, ^~~~ cc1: some warnings being treated as errors vim +/mp_hash +2430 net/ipv4/route.c 2393 fl4->saddr = res->fi->fib_prefsrc; 2394 else 2395 fl4->saddr = fl4->daddr; 2396 } 2397 2398 /* L3 master device is the loopback for that domain */ > 2399 dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : 2400 net->loopback_dev; 2401 fl4->flowi4_oif = dev_out->ifindex;
Re: Deleting a dynamic mac entry..
On 2017/05/25 3:05, Manohar Kumar wrote: > Thanks, Toshiaki. > > What is the right way to set the default_pvid using the bridge command > ? I tried this, which fails.. > > root@net-3:~# ip link set dev vxlan0 name untagged type vlan id 0 > RTNETLINK answers: Operation not supported > root@net-3:~# > > All the interfaces in the bridge are untagged. # ip link set br0 down # echo 0 > /sys/class/net/br0/bridge/default_pvid Toshiaki Makita
Re: [PATCH net-next 1/8] net: ipv4: refactor __ip_route_output_key_hash
On 5/24/17 1:33 PM, Rosen, Rami wrote: > Hi, Rupa /David Ahern, > > First, thanks for this patch set! > > Second, it seems to me that something might be incorrect here. > > You have these additions in this patch (1/8): > ... > +struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 > *flp, > + const struct sk_buff *skb, > + struct fib_result *res); > ... > +struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, > + const struct sk_buff *skb) > +{ > + struct fib_result res; > + struct rtable *rth; > + > + res.tclassid= 0; > + res.fi = NULL; > + res.table = NULL; > + > + rcu_read_lock(); > + rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); > rcu_read_unlock(); > + > return rth; > } > -EXPORT_SYMBOL_GPL(__ip_route_output_key_hash); > +EXPORT_SYMBOL_GPL(ip_route_output_key_hash); > > > So the third parameter to ip_route_output_key_hash_rcu() should be skb*, and > the fourth parameter should be fib_result *. However, you do not pass the skb > parameter > when calling ip_route_output_key_hash_rcu() in > ip_route_output_key_hash() (in fact you don't use it at all), and you pass > mp_hash as the fourth parameter. Yep, it's a problem with the forward port of the first round of patches. Roopa: in include/net/route.h, __ip_route_output_key_hash should be removed as well.
Re: [PATCH net-next 3/3] net: sysfs: Document PHY device sysfs attributes
On 05/24/2017 04:07 PM, Florian Fainelli wrote: > Hi Andrew, > > On 05/24/2017 01:10 PM, Andrew Lunn wrote: >>> +What: /sys/class/mdio_bus///phy_interface >>> +Date: February 2014 >>> +KernelVersion: 3.15 >>> +Contact: netdev@vger.kernel.org >>> +Description: >>> + String value indicating the PHY interface, possible >>> + values are in include/linux/phy.h function phy_modes. >> >> Hi Florian >> >> Does this suggest that these strings should be moved to >> include/uapi/linux/phy.h? > > Humm, I suppose we could do that, although I am not sure ally sure what > we would be putting in there (at least for now). I actually prefer to put the possible values in the sysfs documentation files instead of in a UAPI header. Will submit a v2. -- Florian
Re: [PATCH 0/2] Replace driver's usage of hard-coded device IDs to #defines
From: Myron StoweDate: Wed, 24 May 2017 16:47:34 -0600 > Noa Osherovich introduced a series of new Mellanox device ID definitions to > help differentiate specific controllers that needed INTx masking quirks [1]. > > Bjorn Helgaas followed on, using the device ID definitions Noa provided to > replace hard-coded values within the mxl4 ID table [2]. > > This patch continues along similar lines, adding a few additional Mellanox > device ID definitions and converting the net/mlx5e driver's mlx5 ID table to > use the defines so tools like 'grep' and 'cscope' can be used to help > identify relationships with other aspects (such as INTx masking). If you're adding pci_ids.h defines, it's only valid to do so if you actually use the defines in more than one location. This patch series is not doing that.
Re: [PATCH] test_bpf: Add a couple of tests for BPF_JSGE.
On 05/25/2017 01:35 AM, David Daney wrote: Some JITs can optimize comparisons with zero. Add a couple of BPF_JSGE tests against immediate zero. Signed-off-by: David DaneyThanks for the tests! Acked-by: Daniel Borkmann
[PATCH] test_bpf: Add a couple of tests for BPF_JSGE.
Some JITs can optimize comparisons with zero. Add a couple of BPF_JSGE tests against immediate zero. Signed-off-by: David Daney--- lib/test_bpf.c | 38 ++ 1 file changed, 38 insertions(+) diff --git a/lib/test_bpf.c b/lib/test_bpf.c index 889bc31..be88cba 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -4504,6 +4504,44 @@ static struct bpf_test tests[] = { { }, { { 0, 1 } }, }, + { + "JMP_JSGE_K: Signed jump: value walk 1", + .u.insns_int = { + BPF_ALU32_IMM(BPF_MOV, R0, 0), + BPF_LD_IMM64(R1, -3), + BPF_JMP_IMM(BPF_JSGE, R1, 0, 6), + BPF_ALU64_IMM(BPF_ADD, R1, 1), + BPF_JMP_IMM(BPF_JSGE, R1, 0, 4), + BPF_ALU64_IMM(BPF_ADD, R1, 1), + BPF_JMP_IMM(BPF_JSGE, R1, 0, 2), + BPF_ALU64_IMM(BPF_ADD, R1, 1), + BPF_JMP_IMM(BPF_JSGE, R1, 0, 1), + BPF_EXIT_INSN(),/* bad exit */ + BPF_ALU32_IMM(BPF_MOV, R0, 1), /* good exit */ + BPF_EXIT_INSN(), + }, + INTERNAL, + { }, + { { 0, 1 } }, + }, + { + "JMP_JSGE_K: Signed jump: value walk 2", + .u.insns_int = { + BPF_ALU32_IMM(BPF_MOV, R0, 0), + BPF_LD_IMM64(R1, -3), + BPF_JMP_IMM(BPF_JSGE, R1, 0, 4), + BPF_ALU64_IMM(BPF_ADD, R1, 2), + BPF_JMP_IMM(BPF_JSGE, R1, 0, 2), + BPF_ALU64_IMM(BPF_ADD, R1, 2), + BPF_JMP_IMM(BPF_JSGE, R1, 0, 1), + BPF_EXIT_INSN(),/* bad exit */ + BPF_ALU32_IMM(BPF_MOV, R0, 1), /* good exit */ + BPF_EXIT_INSN(), + }, + INTERNAL, + { }, + { { 0, 1 } }, + }, /* BPF_JMP | BPF_JGT | BPF_K */ { "JMP_JGT_K: if (3 > 2) return 1", -- 2.9.4
linux-next: manual merge of the net-next tree with the net tree
Hi all, Today's linux-next merge of the net-next tree got a conflict in: drivers/net/phy/marvell.c between commit: f2899788353c ("net: phy: marvell: Limit errata to 88m1101") from the net tree and commit: 0c3439bc7773 ("net: phy: Marvell: checkpatch - Comments") from the net-next tree. I fixed it up (I just used the former version) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell
Re: [PATCH net-next 3/3] net: sysfs: Document PHY device sysfs attributes
Hi Andrew, On 05/24/2017 01:10 PM, Andrew Lunn wrote: >> +What: /sys/class/mdio_bus///phy_interface >> +Date: February 2014 >> +KernelVersion: 3.15 >> +Contact:netdev@vger.kernel.org >> +Description: >> +String value indicating the PHY interface, possible >> +values are in include/linux/phy.h function phy_modes. > > Hi Florian > > Does this suggest that these strings should be moved to > include/uapi/linux/phy.h? Humm, I suppose we could do that, although I am not sure ally sure what we would be putting in there (at least for now). -- Florian
[PATCH net v2 2/5] bpf: properly reset caller saved regs after helper call and ld_abs/ind
Currently, after performing helper calls, we clear all caller saved registers, that is r0 - r5 and fill r0 depending on struct bpf_func_proto specification. The way we reset these regs can affect pruning decisions in later paths, since we only reset register's imm to 0 and type to NOT_INIT. However, we leave out clearing of other variables such as id, min_value, max_value, etc, which can later on lead to pruning mismatches due to stale data. Signed-off-by: Daniel BorkmannAcked-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 37 - 1 file changed, 16 insertions(+), 21 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e37e06b..339c8a1 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -463,19 +463,22 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5 }; +static void mark_reg_not_init(struct bpf_reg_state *regs, u32 regno) +{ + BUG_ON(regno >= MAX_BPF_REG); + + memset([regno], 0, sizeof(regs[regno])); + regs[regno].type = NOT_INIT; + regs[regno].min_value = BPF_REGISTER_MIN_RANGE; + regs[regno].max_value = BPF_REGISTER_MAX_RANGE; +} + static void init_reg_state(struct bpf_reg_state *regs) { int i; - for (i = 0; i < MAX_BPF_REG; i++) { - regs[i].type = NOT_INIT; - regs[i].imm = 0; - regs[i].min_value = BPF_REGISTER_MIN_RANGE; - regs[i].max_value = BPF_REGISTER_MAX_RANGE; - regs[i].min_align = 0; - regs[i].aux_off = 0; - regs[i].aux_off_align = 0; - } + for (i = 0; i < MAX_BPF_REG; i++) + mark_reg_not_init(regs, i); /* frame pointer */ regs[BPF_REG_FP].type = FRAME_PTR; @@ -1346,7 +1349,6 @@ static int check_call(struct bpf_verifier_env *env, int func_id, int insn_idx) struct bpf_verifier_state *state = >cur_state; const struct bpf_func_proto *fn = NULL; struct bpf_reg_state *regs = state->regs; - struct bpf_reg_state *reg; struct bpf_call_arg_meta meta; bool changes_data; int i, err; @@ -1413,11 +1415,8 @@ static int check_call(struct bpf_verifier_env *env, int func_id, int insn_idx) } /* reset caller saved regs */ - for (i = 0; i < CALLER_SAVED_REGS; i++) { - reg = regs + caller_saved[i]; - reg->type = NOT_INIT; - reg->imm = 0; - } + for (i = 0; i < CALLER_SAVED_REGS; i++) + mark_reg_not_init(regs, caller_saved[i]); /* update return register */ if (fn->ret_type == RET_INTEGER) { @@ -2445,7 +2444,6 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) { struct bpf_reg_state *regs = env->cur_state.regs; u8 mode = BPF_MODE(insn->code); - struct bpf_reg_state *reg; int i, err; if (!may_access_skb(env->prog->type)) { @@ -2478,11 +2476,8 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) } /* reset caller saved regs to unreadable */ - for (i = 0; i < CALLER_SAVED_REGS; i++) { - reg = regs + caller_saved[i]; - reg->type = NOT_INIT; - reg->imm = 0; - } + for (i = 0; i < CALLER_SAVED_REGS; i++) + mark_reg_not_init(regs, caller_saved[i]); /* mark destination R0 register as readable, since it contains * the value fetched from the packet -- 1.9.3
[PATCH net v2 4/5] bpf: fix wrong exposure of map_flags into fdinfo for lpm
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via attr->map_flags, since it does not support preallocation yet. We check the flag, but we never copy the flag into trie->map.map_flags, which is later on exposed into fdinfo and used by loaders such as iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test whether a pinned map has the same spec as the one from the BPF obj file and if not, bails out, which is currently the case for lpm since it exposes always 0 as flags. Also copy over flags in array_map_alloc() and stack_map_alloc(). They always have to be 0 right now, but we should make sure to not miss to copy them over at a later point in time when we add actual flags for them to use. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reported-by: Jarno RajahalmeSigned-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- kernel/bpf/arraymap.c | 1 + kernel/bpf/lpm_trie.c | 1 + kernel/bpf/stackmap.c | 1 + 3 files changed, 3 insertions(+) diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 5e00b23..172dc8e 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -86,6 +86,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) array->map.key_size = attr->key_size; array->map.value_size = attr->value_size; array->map.max_entries = attr->max_entries; + array->map.map_flags = attr->map_flags; array->elem_size = elem_size; if (!percpu) diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 39cfafd..b09185f 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -432,6 +432,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr) trie->map.key_size = attr->key_size; trie->map.value_size = attr->value_size; trie->map.max_entries = attr->max_entries; + trie->map.map_flags = attr->map_flags; trie->data_size = attr->key_size - offsetof(struct bpf_lpm_trie_key, data); trie->max_prefixlen = trie->data_size * 8; diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index 4dfd6f2..31147d7 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -88,6 +88,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr) smap->map.key_size = attr->key_size; smap->map.value_size = value_size; smap->map.max_entries = attr->max_entries; + smap->map.map_flags = attr->map_flags; smap->n_buckets = n_buckets; smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; -- 1.9.3
[PATCH net v2 3/5] bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data
The bpf_clone_redirect() still needs to be listed in bpf_helper_changes_pkt_data() since we call into bpf_try_make_head_writable() from there, thus we need to invalidate prior pkt regs as well. Fixes: 36bbef52c7eb ("bpf: direct packet write and access for helpers for clsact progs") Signed-off-by: Daniel BorkmannAcked-by: Alexei Starovoitov --- net/core/filter.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/core/filter.c b/net/core/filter.c index a253a61..a6bb95f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2281,6 +2281,7 @@ bool bpf_helper_changes_pkt_data(void *func) func == bpf_skb_change_head || func == bpf_skb_change_tail || func == bpf_skb_pull_data || + func == bpf_clone_redirect || func == bpf_l3_csum_replace || func == bpf_l4_csum_replace || func == bpf_xdp_adjust_head) -- 1.9.3
[PATCH net v2 5/5] bpf: add various verifier test cases
This patch adds various verifier test cases: 1) A test case for the pruning issue when tracking alignment is used. 2) Various PTR_TO_MAP_VALUE_OR_NULL tests to make sure pointer arithmetic turns such register into UNKNOWN_VALUE type. 3) Test cases for the special treatment of LD_ABS/LD_IND to make sure verifier doesn't break calling convention here. Latter is needed, since f.e. arm64 JIT uses r1 - r5 for storing temporary data, so they really must be marked as NOT_INIT. Signed-off-by: Daniel BorkmannAcked-by: Alexei Starovoitov --- include/linux/filter.h | 10 ++ tools/include/linux/filter.h| 10 ++ tools/testing/selftests/bpf/test_verifier.c | 239 +++- 3 files changed, 255 insertions(+), 4 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 56197f8..62d948f 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -272,6 +272,16 @@ .off = OFF, \ .imm = IMM }) +/* Unconditional jumps, goto pc + off16 */ + +#define BPF_JMP_A(OFF) \ + ((struct bpf_insn) {\ + .code = BPF_JMP | BPF_JA, \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = 0 }) + /* Function call */ #define BPF_EMIT_CALL(FUNC)\ diff --git a/tools/include/linux/filter.h b/tools/include/linux/filter.h index 390d7c9..4ce25d4 100644 --- a/tools/include/linux/filter.h +++ b/tools/include/linux/filter.h @@ -208,6 +208,16 @@ .off = OFF, \ .imm = IMM }) +/* Unconditional jumps, goto pc + off16 */ + +#define BPF_JMP_A(OFF) \ + ((struct bpf_insn) {\ + .code = BPF_JMP | BPF_JA, \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = 0 }) + /* Function call */ #define BPF_EMIT_CALL(FUNC)\ diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 3773562..cabb19b 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -49,6 +49,7 @@ #define MAX_NR_MAPS4 #define F_NEEDS_EFFICIENT_UNALIGNED_ACCESS (1 << 0) +#define F_LOAD_WITH_STRICT_ALIGNMENT (1 << 1) struct bpf_test { const char *descr; @@ -2615,6 +2616,30 @@ struct test_val { .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, { + "direct packet access: test17 (pruning, alignment)", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, + offsetof(struct __sk_buff, data)), + BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1, + offsetof(struct __sk_buff, data_end)), + BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1, + offsetof(struct __sk_buff, mark)), + BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 14), + BPF_JMP_IMM(BPF_JGT, BPF_REG_7, 1, 4), + BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1), + BPF_STX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, -4), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1), + BPF_JMP_A(-6), + }, + .errstr = "misaligned packet access off 2+15+-4 size 4", + .result = REJECT, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .flags = F_LOAD_WITH_STRICT_ALIGNMENT, + }, + { "helper access to packet: test1, valid packet_ptr range", .insns = { BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, @@ -3341,6 +3366,70 @@ struct test_val { .prog_type = BPF_PROG_TYPE_SCHED_CLS }, { + "alu ops on ptr_to_map_value_or_null, 1", + .insns = { + BPF_MOV64_IMM(BPF_REG_1, 10), + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_1, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), +
[PATCH net v2 1/5] bpf: fix incorrect pruning decision when alignment must be tracked
Currently, when we enforce alignment tracking on direct packet access, the verifier lets the following program pass despite doing a packet write with unaligned access: 0: (61) r2 = *(u32 *)(r1 +76) 1: (61) r3 = *(u32 *)(r1 +80) 2: (61) r7 = *(u32 *)(r1 +8) 3: (bf) r0 = r2 4: (07) r0 += 14 5: (25) if r7 > 0x1 goto pc+4 R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp 6: (2d) if r0 > r3 goto pc+1 R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14) R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp 7: (63) *(u32 *)(r0 -4) = r0 8: (b7) r0 = 0 9: (95) exit from 6 to 8: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp 8: (b7) r0 = 0 9: (95) exit from 5 to 10: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=2 R10=fp 10: (07) r0 += 1 11: (05) goto pc-6 6: safe <- here, wrongly found safe processed 15 insns However, if we enforce a pruning mismatch by adding state into r8 which is then being mismatched in states_equal(), we find that for the otherwise same program, the verifier detects a misaligned packet access when actually walking that path: 0: (61) r2 = *(u32 *)(r1 +76) 1: (61) r3 = *(u32 *)(r1 +80) 2: (61) r7 = *(u32 *)(r1 +8) 3: (b7) r8 = 1 4: (bf) r0 = r2 5: (07) r0 += 14 6: (25) if r7 > 0x1 goto pc+4 R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 7: (2d) if r0 > r3 goto pc+1 R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14) R3=pkt_end R7=inv,min_value=0,max_value=1 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 8: (63) *(u32 *)(r0 -4) = r0 9: (b7) r0 = 0 10: (95) exit from 7 to 9: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 9: (b7) r0 = 0 10: (95) exit from 6 to 11: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=2 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 11: (07) r0 += 1 12: (b7) r8 = 0 13: (05) goto pc-7<- mismatch due to r8 7: (2d) if r0 > r3 goto pc+1 R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15) R3=pkt_end R7=inv,min_value=2 R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp 8: (63) *(u32 *)(r0 -4) = r0 misaligned packet access off 2+15+-4 size 4 The reason why we fail to see it in states_equal() is that the third test in compare_ptrs_to_packet() ... if (old->off <= cur->off && old->off >= old->range && cur->off >= cur->range) return true; ... will let the above pass. The situation we run into is that old->off <= cur->off (14 <= 15), meaning that prior walked paths went with smaller offset, which was later used in the packet access after successful packet range check and found to be safe already. For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14 as in above program to it, results in R0=pkt(id=0,off=14,r=0) before the packet range test. Now, testing this against R3=pkt_end with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14) for the case when we're within bounds. A write into the packet at offset *(u32 *)(r0 -4), that is, 2 + 14 -4, is valid and aligned (2 is for NET_IP_ALIGN). After processing this with all fall-through paths, we later on check paths from branches. When the above skb->mark test is true, then we jump near the end of the program, perform r0 += 1, and jump back to the 'if r0 > r3 goto out' test we've visited earlier already. This time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune that part because this time we'll have a larger safe packet range, and we already found that with off=14 all further insn were already safe, so it's safe as well with a larger off. However, the problem is that the subsequent write into the packet with 2 + 15 -4 is then unaligned, and not caught by the alignment tracking. Note that min_align, aux_off, and aux_off_align were all 0 in this example. Since we cannot tell at this time what kind of packet access was performed in the prior walk and what minimal requirements it has (we might do so in the future, but that requires more complexity), fix it to disable this pruning case for strict alignment for now, and let the verifier do check such paths instead. With that applied, the test cases pass and reject the program due to misalignment. Fixes: d1174416747d ("bpf: Track alignment of register values in the verifier.") Reference: http://patchwork.ozlabs.org/patch/761909/ Signed-off-by: Daniel BorkmannAcked-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git
[PATCH net v2 0/5] Various BPF fixes
Follow-up to fix incorrect pruning when alignment tracking is in use and to properly clear regs after call to not leave stale data behind, also a fix that adds bpf_clone_redirect to the bpf_helper_changes_pkt_data helper and exposes correct map_flags for lpm map into fdinfo. For details, please see individual patches. Thanks! v1 -> v2: - Reworked first patch so that env->strict_alignment is the final indicator on whether we have to deal with strict alignment rather than having CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS checks on various locations, so only checking env->strict_alignment is sufficient after that. Thanks for spotting, Dave! - Added patch 3 and 4. - Rest as is. Daniel Borkmann (5): bpf: fix incorrect pruning decision when alignment must be tracked bpf: properly reset caller saved regs after helper call and ld_abs/ind bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data bpf: fix wrong exposure of map_flags into fdinfo for lpm bpf: add various verifier test cases include/linux/filter.h | 10 ++ kernel/bpf/arraymap.c | 1 + kernel/bpf/lpm_trie.c | 1 + kernel/bpf/stackmap.c | 1 + kernel/bpf/verifier.c | 56 +++ net/core/filter.c | 1 + tools/include/linux/filter.h| 10 ++ tools/testing/selftests/bpf/test_verifier.c | 239 +++- 8 files changed, 285 insertions(+), 34 deletions(-) -- 1.9.3
[PATCH 2/2] net/mlx5e: Use device ID defines
Use Mellanox device ID definitions in the mlx5 ID table so tools such as 'grep' and 'cscope' can be used to help find related aspects. No functional change intended. Signed-off-by: Myron Stowe--- drivers/net/ethernet/mellanox/mlx5/core/main.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 0c123d5..98642eb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -1508,17 +1508,17 @@ static void shutdown(struct pci_dev *pdev) } static const struct pci_device_id mlx5_core_pci_table[] = { - { PCI_VDEVICE(MELLANOX, 0x1011) }, /* Connect-IB */ + { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTIB) }, { PCI_VDEVICE(MELLANOX, 0x1012), MLX5_PCI_DEV_IS_VF}, /* Connect-IB VF */ - { PCI_VDEVICE(MELLANOX, 0x1013) }, /* ConnectX-4 */ + { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4) }, { PCI_VDEVICE(MELLANOX, 0x1014), MLX5_PCI_DEV_IS_VF}, /* ConnectX-4 VF */ - { PCI_VDEVICE(MELLANOX, 0x1015) }, /* ConnectX-4LX */ + { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4_LX) }, { PCI_VDEVICE(MELLANOX, 0x1016), MLX5_PCI_DEV_IS_VF}, /* ConnectX-4LX VF */ - { PCI_VDEVICE(MELLANOX, 0x1017) }, /* ConnectX-5, PCIe 3.0 */ + { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX5) }, { PCI_VDEVICE(MELLANOX, 0x1018), MLX5_PCI_DEV_IS_VF}, /* ConnectX-5 VF */ - { PCI_VDEVICE(MELLANOX, 0x1019) }, /* ConnectX-5 Ex */ + { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX5_EX) }, { PCI_VDEVICE(MELLANOX, 0x101a), MLX5_PCI_DEV_IS_VF}, /* ConnectX-5 Ex VF */ - { PCI_VDEVICE(MELLANOX, 0x101b) }, /* ConnectX-6 */ + { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX6) }, { PCI_VDEVICE(MELLANOX, 0x101c), MLX5_PCI_DEV_IS_VF}, /* ConnectX-6 VF */ { 0, } };
[PATCH 1/2] PCI: Add Mellanox device IDs
Add Mellanox device IDs for controllers covered by the mlx5 driver. Signed-off-by: Myron Stowe--- include/linux/pci_ids.h |3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index 5f6b71d..5626d5a 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -2266,6 +2266,9 @@ #define PCI_DEVICE_ID_MELLANOX_CONNECTIB 0x1011 #define PCI_DEVICE_ID_MELLANOX_CONNECTX4 0x1013 #define PCI_DEVICE_ID_MELLANOX_CONNECTX4_LX0x1015 +#define PCI_DEVICE_ID_MELLANOX_CONNECTX5 0x1017 +#define PCI_DEVICE_ID_MELLANOX_CONNECTX5_EX0x1019 +#define PCI_DEVICE_ID_MELLANOX_CONNECTX6 0x101b #define PCI_DEVICE_ID_MELLANOX_TAVOR 0x5a44 #define PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE0x5a46 #define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
[PATCH 0/2] Replace driver's usage of hard-coded device IDs to #defines
Noa Osherovich introduced a series of new Mellanox device ID definitions to help differentiate specific controllers that needed INTx masking quirks [1]. Bjorn Helgaas followed on, using the device ID definitions Noa provided to replace hard-coded values within the mxl4 ID table [2]. This patch continues along similar lines, adding a few additional Mellanox device ID definitions and converting the net/mlx5e driver's mlx5 ID table to use the defines so tools like 'grep' and 'cscope' can be used to help identify relationships with other aspects (such as INTx masking). [1] 7254383341b PCI: Add Mellanox device IDs d76d2fe05fd PCI: Convert Mellanox broken INTx quirks to be for listed devices only [2] c19e4b9037f net/mlx4_core: Use device ID defines Myron Stowe (2): PCI: Add Mellanox device IDs net/mlx5e: Use device ID defines drivers/net/ethernet/mellanox/mlx5/core/main.c | 12 ++-- include/linux/pci_ids.h|3 +++ 2 files changed, 9 insertions(+), 6 deletions(-) --
[PATCH net] sky2: Do not deadlock on sky2_hw_down
From: Joshua EmeleThe sky2_hw_down uses sky2_tx_complete to free pending frames stuck in the HW queue. Because sky2_hw_down can be called from a process context, the call to u64_stats_update_begin can result in deadlock. Because the statistics do not require update as part of the sky2_hw_down sequence, prevent the update to avoid the deadlock. [18198.003900] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [18198.009931] ifconfig/11604 [HC0[0]:SC0[0]:HE1:SE1] takes: [18198.015348] (>seq#2){+.?...}, at: [<805ed440>] sky2_hw_down+0x370/0x428 [18198.022827] {IN-SOFTIRQ-W} state was registered at: [18198.027723] [<801729a0>] lock_acquire+0x78/0x98 [18198.032484] [<805ed048>] sky2_tx_complete+0x1a8/0x230 [18198.037760] [<805f2718>] sky2_poll+0x7cc/0xfa8 [18198.042425] [<807692f8>] net_rx_action+0x200/0x330 [18198.047447] [<80129f64>] __do_softirq+0x130/0x31c [18198.052369] [<8012a4a0>] irq_exit+0xc8/0x13c [18198.056836] [<8017e398>] __handle_domain_irq+0x84/0xf8 [18198.062180] [<80101564>] gic_handle_irq+0x58/0xb8 [18198.067086] [<8010d838>] __irq_usr+0x58/0x80 [18198.071557] [<76e178ae>] 0x76e178ae [18198.075247] irq event stamp: 2109 [18198.078568] hardirqs last enabled at (2109): [<8012a2cc>] __local_bh_enable_ip+0x90/0x110 [18198.086860] hardirqs last disabled at (2107): [<8012a27c>] __local_bh_enable_ip+0x40/0x110 [18198.095150] softirqs last enabled at (2108): [<805ed368>] sky2_hw_down+0x298/0x428 [18198.102832] softirqs last disabled at (2106): [<805ed284>] sky2_hw_down+0x1b4/0x428 [18198.110515] other info that might help us debug this: [18198.117048] Possible unsafe locking scenario: [18198.122974]CPU0 [18198.125425] [18198.127876] lock(>seq#2); [18198.131332] [18198.133955] lock(>seq#2); [18198.137585] *** DEADLOCK *** [18198.143517] 1 lock held by ifconfig/11604: [18198.147618] #0: (rtnl_mutex){+.+.+.}, at: [<8077e874>] rtnl_lock+0x18/0x20 [18198.154772] stack backtrace: [18198.159143] CPU: 1 PID: 11604 Comm: ifconfig Not tainted 4.8.17 #1 [18198.165331] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) [18198.171864] Backtrace: [18198.174353] [<8010c634>] (dump_backtrace) from [<8010c828>] (show_stack+0x18/0x1c) [18198.181931] r6:60070093 r5: r4:80e23248 r3: [18198.187682] [<8010c810>] (show_stack) from [<8041edc4>] (dump_stack+0xb4/0xe8) [18198.194924] [<8041ed10>] (dump_stack) from [<801700cc>] (print_usage_bug+0x1dc/0x2d0) [18198.202762] r10:edec55c0 r9:80bd1178 r8:81649598 r7:0006 r6:edec5a90 r5:80fb09f4 [18198.210686] r4:edec55c0 r3:0002 [18198.214311] [<8016fef0>] (print_usage_bug) from [<80170348>] (mark_lock+0x188/0x6c4) [18198.222059] r9: r8:0004 r7:edec55c0 r6:0006 r5:edec5a90 r4:8016f44c [18198.229903] [<801701c0>] (mark_lock) from [<801714ac>] (__lock_acquire+0xb90/0x1c70) [18198.237653] r10:edec55c0 r9: r8:edec5aa4 r7:0004 r6:0001 r5: [18198.245573] r4:01cd r3:0001 [18198.249192] [<8017091c>] (__lock_acquire) from [<801729a0>] (lock_acquire+0x78/0x98) [18198.256940] r10:0001 r9:06d0 r8: r7:0001 r6:805ed440 r5:60070013 [18198.264860] r4: [18198.267428] [<80172928>] (lock_acquire) from [<805ed048>] (sky2_tx_complete+0x1a8/0x230) [18198.275523] r7:ee2d9dfc r6: r5:ee2d9dc0 r4: [18198.281266] [<805ecea0>] (sky2_tx_complete) from [<805ed440>] (sky2_hw_down+0x370/0x428) [18198.289360] r10:ee2cac00 r9:06d0 r8:ee2d9dc0 r7:f0d40aa8 r6:0001 r5:0d48 [18198.297281] r4: [18198.299850] [<805ed0d0>] (sky2_hw_down) from [<805efed8>] (sky2_close+0xac/0x11c) [18198.307337] r10:ee838600 r9: r8: r7:1003 r6:ee2d9dc0 r5: [18198.315258] r4:ee2cac00 [18198.317830] [<805efe2c>] (sky2_close) from [<8076a310>] (__dev_close_many+0xc4/0x118) [18198.325664] r7:1003 r6:1042 r5:eb047df8 r4:ee2d9800 [18198.331404] [<8076a24c>] (__dev_close_many) from [<8076ab24>] (__dev_close+0x30/0x48) [18198.339240] r5:0001 r4:ee2d9800 [18198.342866] [<8076aaf4>] (__dev_close) from [<8076dbac>] (__dev_change_flags+0x90/0x154) [18198.350970] [<8076db1c>] (__dev_change_flags) from [<8076dc90>] (dev_change_flags+0x20/0x50) [18198.359412] r8: r7:8914 r6:1003 r5:ee2d9948 r4:ee2d9800 r3:0014 [18198.367264] [<8076dc70>] (dev_change_flags) from [<807f06d0>] (devinet_ioctl+0x724/0x818) [18198.375448] r8: r7:8914 r6:ee2cae0c r5:7ed799ac r4:eb047e80 r3:0014 [18198.383291] [<807effac>] (devinet_ioctl) from [<807f2e98>] (inet_ioctl+0x19c/0x1c8) [18198.390951] r10:edfc3b80 r9:eb046000 r8:0004 r7:ee724b40 r6:7ed799ac r5:ee724b60 [18198.398872] r4:8914 [18198.401446] [<807f2cfc>] (inet_ioctl) from [<8074a234>] (sock_ioctl+0x158/0x300) [18198.408860] [<8074a0dc>] (sock_ioctl) from [<80239770>] (do_vfs_ioctl+0x98/0xa3c) [18198.416348] r7:8023a150
[PATCH] arp: fixed -Wuninitialized compiler warning
Commit 7d472a59c0e5ec117220a05de6b370447fb6cb66 ("arp: always override existing neigh entries with gratuitous ARP") introduced a compiler warning: net/ipv4/arp.c:880:35: warning: 'addr_type' may be used uninitialized in this function [-Wmaybe-uninitialized] While the code logic seems to be correct and doesn't allow the variable to be used uninitialized, and the warning is not consistently reproducible, it's still worth fixing it for other people not to waste time looking at the warning in case it pops up in the build environment. Yes, compiler is probably at fault, but we will need to accommodate. Fixes: 7d472a59c0e5ec117220a05de6b370447fb6cb66 Signed-off-by: Ihar Hrachyshka--- net/ipv4/arp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index ae96e6f..e9f3386 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -863,8 +863,8 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb) n = __neigh_lookup(_tbl, , dev, 0); + addr_type = -1; if (n || IN_DEV_ARP_ACCEPT(in_dev)) { - addr_type = -1; is_garp = arp_is_garp(net, dev, _type, arp->ar_op, sip, tip, sha, tha); } -- 2.9.4
Re: [Intel-wired-lan] [PATCH 3/4] [next-queue]net: i40e: Enable mqprio full offload mode in the i40e driver for configuring TCs and queue mapping
On Fri, May 19, 2017 at 5:58 PM, Amritha Nambiarwrote: > The i40e driver is modified to enable the new mqprio hardware > offload mode and factor the TCs and queue configuration by > creating channel VSIs. In this mode, the priority to traffic > class mapping and the user specified queue ranges are used > to configure the traffic classes when the 'hw' option is set > to 2. > > Example: > # tc qdisc add dev eth0 root mqprio num_tc 4\ > map 0 0 0 0 1 2 2 3 queues 2@0 2@2 1@4 1@5 hw 2 > > # tc qdisc show dev eth0 > qdisc mqprio 8038: root tc 4 map 0 0 0 0 1 2 2 3 0 0 0 0 0 0 0 0 > queues:(0:1) (2:3) (4:4) (5:5) > > The HW channels created are removed and all the queue configuration > is set to default when the qdisc is detached from the root of the > device. > > #tc qdisc del dev eth0 root > > This patch also disables setting up channels via ethtool (ethtool -L) > when the TCs are confgured using mqprio scheduler. > > Signed-off-by: Amritha Nambiar > --- > drivers/net/ethernet/intel/i40e/i40e.h |4 > drivers/net/ethernet/intel/i40e/i40e_ethtool.c |6 > drivers/net/ethernet/intel/i40e/i40e_main.c| 311 > ++-- > 3 files changed, 292 insertions(+), 29 deletions(-) > > diff --git a/drivers/net/ethernet/intel/i40e/i40e.h > b/drivers/net/ethernet/intel/i40e/i40e.h > index 0915b02..a62f65a 100644 > --- a/drivers/net/ethernet/intel/i40e/i40e.h > +++ b/drivers/net/ethernet/intel/i40e/i40e.h > @@ -54,6 +54,8 @@ > #include > #include > #include > +#include > + > #include "i40e_type.h" > #include "i40e_prototype.h" > #include "i40e_client.h" > @@ -685,6 +687,7 @@ struct i40e_vsi { > enum i40e_vsi_type type; /* VSI type, e.g., LAN, FCoE, etc */ > s16 vf_id; /* Virtual function ID for SRIOV VSIs */ > > + struct tc_mqprio_qopt_offload mqprio_qopt; /* queue parameters */ > struct i40e_tc_configuration tc_config; > struct i40e_aqc_vsi_properties_data info; > > @@ -710,6 +713,7 @@ struct i40e_vsi { > u16 cnt_q_avail; /* num of queues available for channel usage */ > u16 orig_rss_size; > u16 current_rss_size; > + bool reconfig_rss; > > /* keeps track of next_base_queue to be used for channel setup */ > atomic_t next_base_queue; > diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c > b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c > index 3d58762..ab52979 100644 > --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c > +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c > @@ -3841,6 +3841,12 @@ static int i40e_set_channels(struct net_device *dev, > if (vsi->type != I40E_VSI_MAIN) > return -EINVAL; > > + /* We do not support setting channels via ethtool when TCs are > +* configured through mqprio > +*/ > + if (pf->flags & I40E_FLAG_TC_MQPRIO) > + return -EINVAL; > + > /* verify they are not requesting separate vectors */ > if (!count || ch->rx_count || ch->tx_count) > return -EINVAL; > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c > b/drivers/net/ethernet/intel/i40e/i40e_main.c > index e1bea45..7f61d4f 100644 > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c > @@ -68,6 +68,7 @@ static int i40e_reset(struct i40e_pf *pf); > static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool > lock_acquired); > static void i40e_fdir_sb_setup(struct i40e_pf *pf); > static int i40e_veb_get_bw_info(struct i40e_veb *veb); > +static int i40e_vsi_config_rss(struct i40e_vsi *vsi); > > /* i40e_pci_tbl - PCI Device ID Table > * > @@ -1560,6 +1561,105 @@ static int i40e_set_mac(struct net_device *netdev, > void *p) > } > > /** > + * i40e_vsi_setup_queue_map_mqprio - Prepares VSI tc_config to have queue > + * configurations based on MQPRIO options. > + * @vsi: the VSI being configured, > + * @ctxt: VSI context structure > + * @enabled_tc: number of traffic classes to enable > + **/ > +static int i40e_vsi_setup_queue_map_mqprio(struct i40e_vsi *vsi, > + struct i40e_vsi_context *ctxt, > + u8 enabled_tc) > +{ > + u8 netdev_tc = 0, offset = 0; > + u16 qcount = 0, max_qcount, qmap, sections = 0; > + int i, override_q, pow, num_qps, ret; > + > + if (vsi->type != I40E_VSI_MAIN) > + return -EINVAL; > + > + sections = I40E_AQ_VSI_PROP_QUEUE_MAP_VALID; > + sections |= I40E_AQ_VSI_PROP_SCHED_VALID; > + > + vsi->tc_config.numtc = vsi->mqprio_qopt.qopt.num_tc; > + vsi->tc_config.enabled_tc = enabled_tc ? enabled_tc : 1; > + > + num_qps = vsi->mqprio_qopt.qopt.count[0]; > + > + /* find the next higher power-of-2 of num queue pairs */ > + pow = ilog2(num_qps); > + if (!is_power_of_2(num_qps)) > +
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 04:36 PM, Florian Fainelli wrote: OK, and there is no way you can run into the following race condition: CPU HW MDIO read intent polling starts disable HW autopoll polling continues Disabling of the HW autopoll waits for the poll to actually stop before continuing. You can see the code here: http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/qualcomm/emac/emac-phy.c#L102 MDIO read is done MDIO read done polling stops MDIO read value returned if you disable autopolling in HW this is guaranteed to immediately stop by the time the register value is seen in HW and your I/O read/write returns? It doesn't immediately stop, but the emac_phy_mdio_autopoll_disable() function waits for the MDIO bus to not be busy. But low-level details of this feature are not documented, so who knows what it does exactly? The original code that used this feature only supported one PHY and never expected there to be any asynchronous MDIO transactios. -- Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [Intel-wired-lan] [PATCH 1/4] [next-queue]net: mqprio: Introduce new hardware offload mode in mqprio for offloading full TC configurations
On Fri, May 19, 2017 at 5:58 PM, Amritha Nambiarwrote: > This patch introduces a new hardware offload mode in mqprio > which makes full use of the mqprio options, the TCs, the > queue configurations and the bandwidth rates for the TCs. > This is achieved by setting the value 2 for the "hw" option. > This new offload mode supports new attributes for traffic > class such as minimum and maximum values for bandwidth rate limits. > > Introduces a new datastructure 'tc_mqprio_qopt_offload' for offloading > mqprio queue options and use this to be shared between the kernel and > device driver. This contains a copy of the exisiting datastructure > for mqprio queue options. This new datastructure can be extended when > adding new attributes for traffic class such as bandwidth rate limits. The > existing datastructure for mqprio queue options will be shared between the > kernel and userspace. > > This patch enables configuring additional attributes associated > with a traffic class such as minimum and maximum bandwidth > rates and can be offloaded to the hardware in the new offload mode. > The min and max limits for bandwidth rates are provided > by the user along with the the TCs and the queue configurations > when creating the mqprio qdisc. > > Example: > # tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1\ > queues 4@0 4@4 min_rate 0Mbit 0Mbit max_rate 55Mbit 60Mbit hw 2 > > To dump the bandwidth rates: > > # tc qdisc show dev eth0 > qdisc mqprio 804a: root tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 > queues:(0:3) (4:7) > min rates:0bit 0bit > max rates:55Mbit 60Mbit > > Signed-off-by: Amritha Nambiar > --- > include/linux/netdevice.h |2 > include/net/pkt_cls.h |7 ++ > include/uapi/linux/pkt_sched.h | 13 +++ > net/sched/sch_mqprio.c | 169 > +--- > 4 files changed, 180 insertions(+), 11 deletions(-) > [...] > diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c > index 0a4cf27..6457ec9 100644 > --- a/net/sched/sch_mqprio.c > +++ b/net/sched/sch_mqprio.c > @@ -18,10 +18,13 @@ > #include > #include > #include > +#include > > struct mqprio_sched { > struct Qdisc**qdiscs; > int hw_offload; > + u32 flags; > + u64 min_rate[TC_QOPT_MAX_QUEUE], max_rate[TC_QOPT_MAX_QUEUE]; > }; > > static void mqprio_destroy(struct Qdisc *sch) > @@ -39,10 +42,21 @@ static void mqprio_destroy(struct Qdisc *sch) > } > > if (priv->hw_offload && dev->netdev_ops->ndo_setup_tc) { > - struct tc_mqprio_qopt offload = { 0 }; > - struct tc_to_netdev tc = { .type = TC_SETUP_MQPRIO, > - { .mqprio = } }; > + struct tc_mqprio_qopt_offload offload = { 0 }; So this is currently throwing a warning when I pull these patches and build this with gcc 6.2. I think the correct setup is "offload = {{ 0 }}" in order to indicate that we are initializing the inner structure and all other data to 0. > + struct tc_to_netdev tc = { 0 }; > > + switch (priv->hw_offload) { > + case TC_MQPRIO_HW_OFFLOAD_TCS: > + tc.type = TC_SETUP_MQPRIO; > + tc.mqprio = > + break; > + case TC_MQPRIO_HW_OFFLOAD: > + tc.type = TC_SETUP_MQPRIO_EXT; > + tc.mqprio_qopt = > + break; > + default: > + return; > + } > dev->netdev_ops->ndo_setup_tc(dev, sch->handle, 0, ); > } else { > netdev_set_num_tc(dev, 0);
Re: [PATCH v2 0/4] arp: always override existing neigh entries with gratuitous ARP
On Wed, May 24, 2017 at 11:32 PM, Ihar Hrachyshkawrote: > On 05/23/2017 01:56 PM, Arnd Bergmann wrote: >> >> This seems to have caused a build warning: >> >> net/ipv4/arp.c:880:35: warning: 'addr_type' may be used uninitialized >> in this function [-Wmaybe-uninitialized] >> > Not sure. How do you reproduce it? I just did 'make net' in the latest tree > that includes the patch, and it doesn't trigger the failure. Also the code > logic prevents it to be uninitialized (it's always set to -1 or the actual > type value). I saw it reported by the kernelci build bot: https://kernelci.org/build/id/5925b81859b514fb106b9590/logs/ It happened on arm64 with 'make allmodconfig', but none of the other architectures or the other configurations on that architecture. This kind of warning can be a real pain to shut up when the compiler gets it wrong. I haven't tested the latest kernel with your patch on my own build box yet, so I also haven't been able to reproduce it. Most likely, gcc gets confused when it needs to track the state of too many variables here (I certainly get confused when I read it too). My first attempt to resolve it would be: diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index ae96e6f3e0cb..39f26980a565 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -663,6 +663,8 @@ static bool arp_is_garp(struct net *net, struct net_device *dev, *addr_type = inet_addr_type_dev_table(net, dev, sip); if (*addr_type != RTN_UNICAST) is_garp = false; + } else { + *addr_type = -1; } return is_garp; } @@ -864,7 +866,6 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb) n = __neigh_lookup(_tbl, , dev, 0); if (n || IN_DEV_ARP_ACCEPT(in_dev)) { - addr_type = -1; is_garp = arp_is_garp(net, dev, _type, arp->ar_op, sip, tip, sha, tha); } but this clearly needs a lot of build testing to see if it's sufficient. Moving the initialization of addr_type=1 to the start of the function would certainly avoid the warning, but I suspect this would be incorrect here. Arnd
Re: [Intel-wired-lan] [PATCH 2/4] [next-queue]net: i40e: Add infrastructure for queue channel support with the TCs and queue configurations offloaded via mqprio scheduler
On Fri, May 19, 2017 at 5:58 PM, Amritha Nambiarwrote: > This patch sets up the infrastructure for offloading TCs and > queue configurations to the hardware by creating HW channels(VSI). > A new channel is created for each of the traffic class > configuration offloaded via mqprio framework except for the first TC > (TC0). TC0 for the main VSI is also reconfigured as per user provided > queue parameters. Queue counts that are not power-of-2 are handled by > reconfiguring RSS by reprogramming LUTs using the queue count value. > This patch also handles configuring the TX rings for the channels, > setting up the RX queue map for channel. > > Also, the channels so created are removed and all the queue > configuration is set to default when the qdisc is detached from the > root of the device. > > Signed-off-by: Amritha Nambiar > Signed-off-by: Kiran Patil > --- > drivers/net/ethernet/intel/i40e/i40e.h | 36 + > drivers/net/ethernet/intel/i40e/i40e_main.c | 740 > +++ > drivers/net/ethernet/intel/i40e/i40e_txrx.h |2 > 3 files changed, 771 insertions(+), 7 deletions(-) > > diff --git a/drivers/net/ethernet/intel/i40e/i40e.h > b/drivers/net/ethernet/intel/i40e/i40e.h > index 395ca94..0915b02 100644 [...] > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c > b/drivers/net/ethernet/intel/i40e/i40e_main.c > index 8d1d3b85..e1bea45 100644 > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c [...] > +/** > + * i40e_create_queue_channel - function to create channel > + * @vsi: VSI to be configured > + * @ch: ptr to channel (it contains channel specific params) > + * > + * This function creates channel (VSI) using num_queues specified by user, > + * reconfigs RSS if needed. > + **/ > +int i40e_create_queue_channel(struct i40e_vsi *vsi, > + struct i40e_channel *ch) > +{ > + struct i40e_pf *pf = vsi->back; > + bool reconfig_rss; > + int err; > + > + if (!ch) > + return -EINVAL; > + > + if (!ch->num_queue_pairs) { > + dev_err(>pdev->dev, "Invalid num_queues requested: %d\n", > + ch->num_queue_pairs); > + return -EINVAL; > + } > + > + /* validate user requested num_queues for channel */ > + err = i40e_validate_num_queues(pf, ch->num_queue_pairs, vsi, > + _rss); > + if (err) { > + dev_info(>pdev->dev, "Failed to validate num_queues > (%d)\n", > +ch->num_queue_pairs); > + return -EINVAL; > + } > + > + /* By default we are in VEPA mode, if this is the first VF/VMDq > +* VSI to be added switch to VEB mode. > +*/ > + if ((!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) || > + (!i40e_is_any_channel(vsi))) { > + if (!is_power_of_2(vsi->tc_config.tc_info[0].qcount)) { > + dev_info(>pdev->dev, > +"Failed to create channel. Override queues > (%u) not power of 2\n", > +vsi->tc_config.tc_info[0].qcount); > + return -EINVAL; > + } > + > + if (vsi->type == I40E_VSI_SRIOV) { > + if (!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) { > + dev_info(>pdev->dev, > +"Expected to be VEB mode by this > time\n"); > + return -EINVAL; > + } > + } > + if (!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) { > + pf->flags |= I40E_FLAG_VEB_MODE_ENABLED; > + > + if (vsi->type == I40E_VSI_MAIN) { > + if (pf->flags & I40E_FLAG_TC_MQPRIO) > + i40e_do_reset(pf, > + BIT_ULL(__I40E_PF_RESET_REQUESTED), > + true); > + else > + i40e_do_reset_safe(pf, > + BIT_ULL(__I40E_PF_RESET_REQUESTED)); So these BIT_ULL lines are triggering a check in checkpatch, and I have to say I don't really like this as it really is messed up in terms of formatting. If nothing else you might want to look at defining a macro that replaces the line. That way you could still represent the same data without having to resort to misaligning things to make it under 80 characters.
Re: [PATCH v2 2/4] arp: decompose is_garp logic into a separate function
On 05/18/2017 01:49 PM, Julian Anastasov wrote: All 4 patches look ok to me with only a small problem which comes from patch already included in kernel. I see that GARP replies can not work for 1394, is_garp will be cleared. May be 'tha' check should be moved in if expression, for example: if (is_garp && ar_op == htons(ARPOP_REPLY) && tha) is_garp = !memcmp(tha, sha, dev->addr_len); I can easily miss something substantial, so please correct me, but... If it's of REPLY type, the RFC 2002 requires that target hardware address field equals to source address field for a packet to be considered gratuitous. Since IEEE 1394 ARP standard defines its payload without target field, it seems to me that there is no such thing as a gratuitous ARP reply for IEEE 1394. That's why I think resetting is_garp to 0 for those packets is justified. Ihar
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 02:32 PM, Timur Tabi wrote: > On 05/24/2017 04:28 PM, Florian Fainelli wrote: > >> Yes phydev->lock which is used to serialize the state machine state >> changes. >> >> Most PHYs have many more registers than the 15 standard exposed >> directly, and so you need indirect reads/writes to access these >> registers, which typically involve switching a particular page, doing >> the indirect register access, and then flipping the page back. If you >> interrupt that scheme one way or another, your reads and writes are all >> messed up. > > Ah, and the at803x is a device like that. > > At worst, the autopoll feature could read a register from the wrong > page, and think that the link state has changed when it hasn't. But > that's still bad, and all my problems do revolve around link states. Absolutely, just like it's not clear whether autopoll does read BMSR.LINKSTAT or another at803x specific register? Also how is BMSR.LINKSTAT defined on at803x when there is SGMII + Copper involved? > >>> I forgot one detail. Every time you do an MDIO read/write, it >>> temporarily disables the feature. Although, I think that's not relevant >>> to your point. >> >> Is that done by the HW itself, or is this under SW control exclusively. > > Software. OK, and there is no way you can run into the following race condition: CPU HW MDIO read intent polling starts disable HW autopoll polling continues MDIO read is done MDIO read done polling stops MDIO read value returned if you disable autopolling in HW this is guaranteed to immediately stop by the time the register value is seen in HW and your I/O read/write returns? -- Florian
Re: [PATCH v2 0/4] arp: always override existing neigh entries with gratuitous ARP
On 05/23/2017 01:56 PM, Arnd Bergmann wrote: This seems to have caused a build warning: net/ipv4/arp.c:880:35: warning: 'addr_type' may be used uninitialized in this function [-Wmaybe-uninitialized] Not sure. How do you reproduce it? I just did 'make net' in the latest tree that includes the patch, and it doesn't trigger the failure. Also the code logic prevents it to be uninitialized (it's always set to -1 or the actual type value). Please advise, Ihar
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 04:28 PM, Florian Fainelli wrote: Yes phydev->lock which is used to serialize the state machine state changes. Most PHYs have many more registers than the 15 standard exposed directly, and so you need indirect reads/writes to access these registers, which typically involve switching a particular page, doing the indirect register access, and then flipping the page back. If you interrupt that scheme one way or another, your reads and writes are all messed up. Ah, and the at803x is a device like that. At worst, the autopoll feature could read a register from the wrong page, and think that the link state has changed when it hasn't. But that's still bad, and all my problems do revolve around link states. I forgot one detail. Every time you do an MDIO read/write, it temporarily disables the feature. Although, I think that's not relevant to your point. Is that done by the HW itself, or is this under SW control exclusively. Software.
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 02:20 PM, Timur Tabi wrote: > On 05/24/2017 04:15 PM, Andrew Lunn wrote: >>> My NIC has a feature called autopolling where it takes over the MDIO >>> bus and regularly polls the link state. When it detects that the >>> link state has changed, it generates a MAC interrupt. This is when >>> I call phy_mac_interrupt() normally. > >> Unfortunately, you need to keep this feature turned off. It will not >> respect the phydev mutex. It has no idea what page has been currently >> selected. It probably has no way to flip the page and see if the SGMII >> link is up. etc. > > phydev mutex? And what do you mean by page? Yes phydev->lock which is used to serialize the state machine state changes. Most PHYs have many more registers than the 15 standard exposed directly, and so you need indirect reads/writes to access these registers, which typically involve switching a particular page, doing the indirect register access, and then flipping the page back. If you interrupt that scheme one way or another, your reads and writes are all messed up. > > I forgot one detail. Every time you do an MDIO read/write, it > temporarily disables the feature. Although, I think that's not relevant > to your point. Is that done by the HW itself, or is this under SW control exclusively. > > Disabling this feature and switching from PHY_IGNORE_INTERRUPT to > PHY_POLL might fix everything. I will try it. > Humm yes, that seems like a worthwhile exercise at least. -- Florian
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 04:15 PM, Andrew Lunn wrote: My NIC has a feature called autopolling where it takes over the MDIO bus and regularly polls the link state. When it detects that the link state has changed, it generates a MAC interrupt. This is when I call phy_mac_interrupt() normally. Unfortunately, you need to keep this feature turned off. It will not respect the phydev mutex. It has no idea what page has been currently selected. It probably has no way to flip the page and see if the SGMII link is up. etc. phydev mutex? And what do you mean by page? I forgot one detail. Every time you do an MDIO read/write, it temporarily disables the feature. Although, I think that's not relevant to your point. Disabling this feature and switching from PHY_IGNORE_INTERRUPT to PHY_POLL might fix everything. I will try it.
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 01:57 PM, Timur Tabi wrote: > On 05/24/2017 02:34 PM, Andrew Lunn wrote: >>> Ok, I'm going to debug this some more. It turns out that the MAC >>> side of >>> the SGMII link can send an interrupt when it thinks that >>> auto-negotiation is >>> done. I might be able to use this. >> >> You can use this for your board. But it still leaves the phy driver >> broken for everybody else. > > Wait, I thought you said the at803x driver was not broken, since it > returns 0 when the SGMII side of the link hasn't finished auto-negotiating? > >>> What function should my MAC driver call when it wants the phy core to >>> call >>> at803x_aneg_done again to see if autonegotiation is done? >> >> You want to trigger the PHY state machine. There is only one exported >> API call to do this, phy_mac_interrupt(). But you are supposed to pass >> the new link state. And your MAC driver has no idea of that, it does >> not know if the copper side of the PHY is up. > > My NIC has a feature called autopolling where it takes over the MDIO bus > and regularly polls the link state. When it detects that the link state > has changed, it generates a MAC interrupt. This is when I call > phy_mac_interrupt() normally. > > I think I can use the SGMII interrupt to also call phy_mac_interrupt(). > The problem is the from the copper side, the link is already up, so if I > call phy_mac_interrupt() again and say the link is up, the phy core is > going to ignore that. The question is what the HW is auto-polling on? Most HW implementations that I have seen either auto-poll and assume that BMSR.LINKSTATUS will reflect what they want to, or they even let you specify which register and bit(s) to monitor for link status. So which one is it here? As Andrew already responded, this clashes with any reads/writes that the PHY library could do, unless your HW is smart enough to serialize MDIO reads/writes? > >> So it might be better if you export phy_trigger_machine(). > > I'll test that, but that does seem a bit hackish. > >>> Also, is there a way for the MAC driver to know that at803x_aneg_done() >>> previously returned 0, and that it needs to tell the phy core to >>> check again? >> >> Not that i know of. The MAC layer is not supposed to be messing around >> in the PHY layer. However, just triggering the PHY state machine >> should be enough. > > Can you tell my how PHY_HAS_INTERRUPT is supposed to work? How does the > PHY send an interrupt? PHY_HAS_INTERRUPT indicates that you have a dedicated interrupt line for your PHY device and that you have the ability to implement a custom interrupt handling callback in the PHY driver (ack_interrupt) that lets you deal with all possible sorts of the interrupts that the PHY could generate. > > I'm starting to think that my NIC's autopolling feature is not > compatible with phylib, and that I should use polling mode. That's pretty much what Andrew told you about 5 emails ago... -- Florian
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On Wed, May 24, 2017 at 03:57:06PM -0500, Timur Tabi wrote: > On 05/24/2017 02:34 PM, Andrew Lunn wrote: > >>Ok, I'm going to debug this some more. It turns out that the MAC side of > >>the SGMII link can send an interrupt when it thinks that auto-negotiation is > >>done. I might be able to use this. > > > >You can use this for your board. But it still leaves the phy driver > >broken for everybody else. > > Wait, I thought you said the at803x driver was not broken, since it > returns 0 when the SGMII side of the link hasn't finished > auto-negotiating? It is correct so far. But to work, it needs to interrupt again once the SGMII side has come up. Only then have we link. > My NIC has a feature called autopolling where it takes over the MDIO > bus and regularly polls the link state. When it detects that the > link state has changed, it generates a MAC interrupt. This is when > I call phy_mac_interrupt() normally. Unfortunately, you need to keep this feature turned off. It will not respect the phydev mutex. It has no idea what page has been currently selected. It probably has no way to flip the page and see if the SGMII link is up. etc. > Can you tell my how PHY_HAS_INTERRUPT is supposed to work? How does > the PHY send an interrupt? Generally, the PHY interrupt pin is connected to a GPIO. You then use the GPIO as an interrupt source. So it has an interrupt number. Put that in phydev->irq, eg using the interrupts property in device tree. The core will register an interrupt handler, and enable the interrupt. When it receives an interrupt, it calls the phy driver to service the interrupt. Andrew
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 02:34 PM, Andrew Lunn wrote: Ok, I'm going to debug this some more. It turns out that the MAC side of the SGMII link can send an interrupt when it thinks that auto-negotiation is done. I might be able to use this. You can use this for your board. But it still leaves the phy driver broken for everybody else. Wait, I thought you said the at803x driver was not broken, since it returns 0 when the SGMII side of the link hasn't finished auto-negotiating? What function should my MAC driver call when it wants the phy core to call at803x_aneg_done again to see if autonegotiation is done? You want to trigger the PHY state machine. There is only one exported API call to do this, phy_mac_interrupt(). But you are supposed to pass the new link state. And your MAC driver has no idea of that, it does not know if the copper side of the PHY is up. My NIC has a feature called autopolling where it takes over the MDIO bus and regularly polls the link state. When it detects that the link state has changed, it generates a MAC interrupt. This is when I call phy_mac_interrupt() normally. I think I can use the SGMII interrupt to also call phy_mac_interrupt(). The problem is the from the copper side, the link is already up, so if I call phy_mac_interrupt() again and say the link is up, the phy core is going to ignore that. So it might be better if you export phy_trigger_machine(). I'll test that, but that does seem a bit hackish. Also, is there a way for the MAC driver to know that at803x_aneg_done() previously returned 0, and that it needs to tell the phy core to check again? Not that i know of. The MAC layer is not supposed to be messing around in the PHY layer. However, just triggering the PHY state machine should be enough. Can you tell my how PHY_HAS_INTERRUPT is supposed to work? How does the PHY send an interrupt? I'm starting to think that my NIC's autopolling feature is not compatible with phylib, and that I should use polling mode.
Re: [PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always
On 05/24/2017 11:39 PM, Jason A. Donenfeld wrote: I've only looked on the last 2 patches. You can add my: Reviewed-by: Sergei Shtylyovif you want. :-) Will do. For the series, or just for 5/5? 5/5 only. :-) MBR, Sergei
Re: [PATCH v6 net-next 01/17] net: qualcomm: remove unnecessary includes
From: Stefan WahrenDate: Wed, 24 May 2017 22:05:26 +0200 (CEST) > AFAIK these ones above aren't necessary (no init, no kernel module, > no kernel parameter, no kernel version) for this C file. So i will > double check it. You need the endianness translators like cpu_to_be32() or whatever, so you need to figure out where you are getting that once these explicit headers are removed. And see, it's probably hidden inside of the private header's includes. So we can't tell.
Re: [PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always
On Wed, May 24, 2017 at 6:41 PM, Sergei Shtylyov >I've only looked on the last 2 patches. You can add my: > > Reviewed-by: Sergei Shtylyov> > if you want. :-) Will do. For the series, or just for 5/5?
Re: [patch net-next v2 0/8] mlxsw: Support firmware flash
From: Jiri PirkoDate: Tue, 23 May 2017 21:56:22 +0200 > From: Jiri Pirko > > Add support for device firmware flash on mlxsw spectrum. The firmware files > are expected to be in the Mellanox Firmware Archive version 2 (MFA2) > format. > > The firmware flash is triggered on driver initialization time if the device > firmware version does not meet the minimum firmware version supported by > the driver. > > Currently, to activate the newly flashed firmware, the user needs to > reboot his system. > > The first patch introduces the mlxfw module, which implements common logic > needed for the firmware flash process on Mellanox products, such as the > MFA2 format parsing and the firmware flash state machine logic. As the > module implements common logic which will be needed by various different > Mellanox drivers, it defines a set of callbacks needed to interact with the > specific device. > > Patches 1-5 implement the needed mlxfw callbacks in the mlxsw spectrum > driver. > > Patches 6 and 7 add boot-time firmware upgrade on the mlxsw spectrum > driver. > > Patch 8 adds a fix needed for new firmware versions. Series applied, although I hope you sort out the user interface soon otherwise most of this is completely dead unused code. Thanks.
Re: [PATCH net-next] tcp: fix TCP_SYNCNT flakes
From: Eric DumazetDate: Tue, 23 May 2017 12:38:35 -0700 > From: Eric Dumazet > > After the mentioned commit, some of our packetdrill tests became flaky. > > TCP_SYNCNT socket option can limit the number of SYN retransmits. > > retransmits_timed_out() has to compare times computations based on > local_clock() while timers are based on jiffies. With NTP adjustments > and roundings we can observe 999 ms delay for 1000 ms timers. > We end up sending one extra SYN packet. > > Gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate > TCP's connection close threshold as a time value") makes no > real sense for TCP_SYN_SENT sockets where no RTO backoff can happen at > all. > > Lets use a simpler logic for TCP_SYN_SENT sockets and remove @syn_set > parameter from retransmits_timed_out() > > Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock") > Signed-off-by: Eric Dumazet > Signed-off-by: Yuchung Cheng Applied, thanks Eric.
Re: [PATCH net-next] net: dsa: support cross-chip ageing time
From: Vivien DidelotDate: Tue, 23 May 2017 15:20:59 -0400 > Now that the switchdev bridge ageing time attribute is propagated to all > switch chips of the fabric, each switch can check if the requested value > is valid and program itself, so that the whole fabric shares a common > ageing time setting. > > This is especially needed for switch chips in between others, containing > no bridge port members but evidently used in the data path. > > To achieve that, remove the condition which skips the other switches. We > also don't need to identify the target switch anymore, thus remove the > sw_index member of the dsa_notifier_ageing_time_info notifier structure. > > On ZII Dev Rev B (with two 88E6352 and one 88E6185) and ZII Dev Rev C > (with two 88E6390X), we have the following hardware configuration: ... > Before this patch: ... > After this patch: ... > Signed-off-by: Vivien Didelot Applied, thanks.
Re: [PATCH V3 net 0/3] Fix checksum issues with Q-in-Q vlans
From: Vladislav YasevichDate: Tue, 23 May 2017 13:38:40 -0400 > TCP checksum appear broken on a lot of devices that > advertise NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM. This problem > becomes very visible/reproducable since the series > commit afb0bc972b526 ("Merge branch 'stacked_vlan_tso'"). > > In particular, the issue appeared consistently on bnx2 and be2net > drivers (not all drivers were tested). > > This short series corrects this by disabling checksum offload > support on packets sent through Q-in-Q vlans if the underlying HW only > enables IP specific checksum features. We currently 'assume' that > any drivers setting NETIF_F_HW_CSUM can correclty pass checksum offsets > to HW. It is up to individual drivers to enable it properly through > ndo_features_check if they have some support for Q-in-Q vlans. > > Additionally, be2net driver was fixed to make the proper call. > > While looking at the drivers, it was also found that virtio-net ended > up disabling accelerations, which is unnecessary. > > V3: Fixed checkpatch errors. > > V2: Instead of disabling checksuming for all devices, only devices using > NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM are now affected by this change. > For drivers using NETIF_F_HW_CSUM, we will continue to use checksum > offloading. If any drivers are found to be broken, they would need > be fixed individually. Series applied and queued up for -stable, thanks.
Re: [patch net-next v2 0/5] add tcp flags match support to flower and offload it
From: Jiri PirkoDate: Tue, 23 May 2017 18:40:43 +0200 > From: Jiri Pirko > > This patch adds support to dissect tcp flags, match on them using > flower classifier and offload such rules to mlxsw Spectrum devices. > > --- > v1->v2: > - removed no longer relevant comment from patch 1 as suggested by Or > - sent correct patches this time Series applied, thanks.
Re: [Patch net-next] net_sched: only create filter chains for new filters/actions
Wed, May 24, 2017 at 05:53:42PM CEST, xiyou.wangc...@gmail.com wrote: >On Tue, May 23, 2017 at 11:37 PM, Jiri Pirkowrote: >> Tue, May 23, 2017 at 06:42:37PM CEST, xiyou.wangc...@gmail.com wrote: >>>tcf_chain_get() always creates a new filter chain if not found >>>in existing ones. This is totally unnecessary when we get or >>>delete filters, new chain should be only created for new filters >>>(or new actions). >>> >>>Fixes: 5bc1701881e3 ("net: sched: introduce multichain support for filters") >>>Cc: Jamal Hadi Salim >>>Cc: Jiri Pirko >>>Signed-off-by: Cong Wang >>>--- >>> include/net/pkt_cls.h | 3 ++- >>> net/sched/act_api.c | 2 +- >>> net/sched/cls_api.c | 13 + >>> 3 files changed, 12 insertions(+), 6 deletions(-) >>> >>>diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h >>>index 2c213a6..f776229 100644 >>>--- a/include/net/pkt_cls.h >>>+++ b/include/net/pkt_cls.h >>>@@ -18,7 +18,8 @@ int register_tcf_proto_ops(struct tcf_proto_ops *ops); >>> int unregister_tcf_proto_ops(struct tcf_proto_ops *ops); >>> >>> #ifdef CONFIG_NET_CLS >>>-struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index); >>>+struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index, >>>+ bool create); >>> void tcf_chain_put(struct tcf_chain *chain); >>> int tcf_block_get(struct tcf_block **p_block, >>> struct tcf_proto __rcu **p_filter_chain); >>>diff --git a/net/sched/act_api.c b/net/sched/act_api.c >>>index 0ecf2a8..aed6cf2 100644 >>>--- a/net/sched/act_api.c >>>+++ b/net/sched/act_api.c >>>@@ -34,7 +34,7 @@ static int tcf_action_goto_chain_init(struct tc_action *a, >>>struct tcf_proto *tp) >>> >>> if (!tp) >>> return -EINVAL; >>>- a->goto_chain = tcf_chain_get(tp->chain->block, chain_index); >>>+ a->goto_chain = tcf_chain_get(tp->chain->block, chain_index, true); >>> if (!a->goto_chain) >>> return -ENOMEM; >>> return 0; >>>diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c >>>index 01a8b8b..23d2236 100644 >>>--- a/net/sched/cls_api.c >>>+++ b/net/sched/cls_api.c >>>@@ -220,7 +220,8 @@ static void tcf_chain_destroy(struct tcf_chain *chain) >>> kfree(chain); >>> } >>> >>>-struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index) >>>+struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index, >>>+ bool create) >>> { >>> struct tcf_chain *chain; >>> >>>@@ -230,7 +231,10 @@ struct tcf_chain *tcf_chain_get(struct tcf_block >>>*block, u32 chain_index) >>> return chain; >>> } >>> } >>>- return tcf_chain_create(block, chain_index); >>>+ if (create) >>>+ return tcf_chain_create(block, chain_index); >>>+ else >>>+ return NULL; >>> } >>> EXPORT_SYMBOL(tcf_chain_get); >>> >>>@@ -509,9 +513,10 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct >>>nlmsghdr *n, >>> err = -EINVAL; >>> goto errout; >>> } >>>- chain = tcf_chain_get(block, chain_index); >>>+ chain = tcf_chain_get(block, chain_index, >>>+n->nlmsg_type == RTM_NEWTFILTER); >> >> First of all, I really hate all these true/false arg dances. Totaly >> confusing all the time. > >Sounds like you are able to understand the code at all. >Sigh, I bet you never even read the changelog. ;) > > >> >> >> >>> if (!chain) { >>>- err = -ENOMEM; >>>+ err = n->nlmsg_type == RTM_NEWTFILTER ? -ENOMEM : -EINVAL; >> >> Confusing. Please do not obfuscate the code for a corner cases. Thanks. > >Either you don't understand the changelog or you don't understand >ternary conditional operator. I can't help you if the latter. > >Sorry. Heh. Really? All I say is, your patch is not needed at all. All it adds makes to code harder to understand and no benefit.
Re: [PATCH net 1/3] bpf: fix incorrect pruning decision when alignment must be tracked
On 05/24/2017 10:07 PM, David Miller wrote: From: Daniel BorkmannDate: Tue, 23 May 2017 18:30:41 +0200 + if (!env->strict_alignment && old->off <= cur->off && You can't just test env->strict_alignment by itself, that's just an override and doesn't determine the actual "strict" value we use which is a combination of env->strict_alignment and "!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS". So you'll have to update this test. Argh, good point, true. Will respin with a v2.
Re: [PATCH net-next 3/3] net: sysfs: Document PHY device sysfs attributes
> +What:/sys/class/mdio_bus///phy_interface > +Date:February 2014 > +KernelVersion: 3.15 > +Contact: netdev@vger.kernel.org > +Description: > + String value indicating the PHY interface, possible > + values are in include/linux/phy.h function phy_modes. Hi Florian Does this suggest that these strings should be moved to include/uapi/linux/phy.h? Andrew
Re: [PATCH net 1/3] bpf: fix incorrect pruning decision when alignment must be tracked
From: Daniel BorkmannDate: Tue, 23 May 2017 18:30:41 +0200 > + if (!env->strict_alignment && old->off <= cur->off && You can't just test env->strict_alignment by itself, that's just an override and doesn't determine the actual "strict" value we use which is a combination of env->strict_alignment and "!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS". So you'll have to update this test.
Re: [PATCH v6 net-next 01/17] net: qualcomm: remove unnecessary includes
> David Millerhat am 24. Mai 2017 um 21:41 geschrieben: > > > From: Stefan Wahren > Date: Tue, 23 May 2017 15:12:37 +0200 > > > Most of the includes in qca_7k.c are unnecessary so we better remove them. > > > > Signed-off-by: Stefan Wahren > > --- > > drivers/net/ethernet/qualcomm/qca_7k.c | 4 > > 1 file changed, 4 deletions(-) > > > > diff --git a/drivers/net/ethernet/qualcomm/qca_7k.c > > b/drivers/net/ethernet/qualcomm/qca_7k.c > > index f0066fb..557d53c 100644 > > --- a/drivers/net/ethernet/qualcomm/qca_7k.c > > +++ b/drivers/net/ethernet/qualcomm/qca_7k.c > > @@ -23,11 +23,7 @@ > > * kernel-based SPI device. > > */ > > > > -#include > > -#include > > -#include > > #include > > -#include > > > > #include "qca_7k.h" > > > > -- > > 2.1.4 > > > > Changes like this drive me crazy. > > The only reason you can remove those headers is because you are obtaining > things indirectly via qca_7k.h > > And if that is indeed the case, you are also getting qca_spi.h which > in turn includes linux/spi/spi.h > > So you could have removed that as well. > > But seriously, it is so much harder to understand a driver and what > interfaces it needs via header files when you hide _all_ of it behind > these local private header files which just include _everything_ > and then _every_ foo.c file in your driver gets _all_ of those kernel > headers whether they need it or not. > > So if just one foo.c file needs 20 extra kernel headers than the rest > of the files in the driver, every foo.c file eats that cost of > including them. > > I really don't like when drivers move in this direction for that > reason. And at best, as described at the beginning of my response, > this change is incomplete. > The intension of this change wasn't to hide the includes into qca_7k.h AFAIK these ones above aren't necessary (no init, no kernel module, no kernel parameter, no kernel version) for this C file. So i will double check it.
Re: [PATCH net-next] geneve: fix fill_info when using collect_metadata
On Wed, May 24, 2017 at 12:20:36PM -0700, Pravin Shelar wrote: > On Tue, May 23, 2017 at 3:37 PM, Eric Garverwrote: > > Since 9b4437a5b870 ("geneve: Unify LWT and netdev handling.") fill_info > > does not return UDP_ZERO_CSUM6_RX when using COLLECT_METADATA. This is > > because it uses ip_tunnel_info_af() with the device level info, which is > > not valid for COLLECT_METADATA. > > > > Fix by checking for the presence of the actual sockets. > > > > Fixes: 9b4437a5b870 ("geneve: Unify LWT and netdev handling.") > > Signed-off-by: Eric Garver > Thanks for the patch. > > Acked-by: Pravin B Shelar > > > I noticed that the MTU and encal_len calculation in geneve_configure() > also needs to be fixed for collect metadata case. Can you send patch > to fix it? Ah, yes. I'll take a look at those as well. Eric.
Re: [PATCH v3 net] net: phy: marvell: Limit errata to 88m1101
From: Andrew LunnDate: Tue, 23 May 2017 17:49:13 +0200 > The 88m1101 has an errata when configuring autoneg. However, it was > being applied to many other Marvell PHYs as well. Limit its scope to > just the 88m1101. > > Fixes: 76884679c644 ("phylib: Add support for Marvell 88eS and 88e1145") > Reported-by: Daniel Walker > Signed-off-by: Andrew Lunn > Acked-by: Harini Katakam Applied and queued up for -stable, thanks.
Re: [PATCH] net/phy: fix mdio-octeon dependency and build
From: Randy DunlapDate: Tue, 23 May 2017 08:19:49 -0700 > From: Randy Dunlap > > Fix build errors by making this driver depend on OF_MDIO, like > several other similar drivers do. > > drivers/built-in.o: In function `octeon_mdiobus_remove': > mdio-octeon.c:(.text+0x196ee0): undefined reference to `mdiobus_unregister' > mdio-octeon.c:(.text+0x196ee8): undefined reference to `mdiobus_free' > drivers/built-in.o: In function `octeon_mdiobus_probe': > mdio-octeon.c:(.text+0x196f1d): undefined reference to > `devm_mdiobus_alloc_size' > mdio-octeon.c:(.text+0x196ffe): undefined reference to `of_mdiobus_register' > mdio-octeon.c:(.text+0x197010): undefined reference to `mdiobus_free' > > Signed-off-by: Randy Dunlap Applied, thanks Randy.
Re: [pull request][net 0/8] Mellanox, mlx5 fixes 2017-05-23
From: Saeed MahameedDate: Tue, 23 May 2017 17:16:02 +0300 > This series contains some fixes for the mlx5 driver and one small patch that > adds csum actions accessors in include/net/tc_act/tc_csum.h needed by some of > mlx5 fixes patches. > > Details are below. > > Please pull and let me know if there's any problem. > > Note: This series doesn't introduce any merge conflict with the ongoing mlx5 > for-next submission. > > For -stable kernels >= 4.7: > ("net/mlx5: Avoid using pending command interface slots") > ("net/mlx5: Tolerate irq_set_affinity_hint() failures") Pulled and queued up for -stable, thanks.
Re: [PATCH v6 net-next 01/17] net: qualcomm: remove unnecessary includes
From: Stefan WahrenDate: Tue, 23 May 2017 15:12:37 +0200 > Most of the includes in qca_7k.c are unnecessary so we better remove them. > > Signed-off-by: Stefan Wahren > --- > drivers/net/ethernet/qualcomm/qca_7k.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/drivers/net/ethernet/qualcomm/qca_7k.c > b/drivers/net/ethernet/qualcomm/qca_7k.c > index f0066fb..557d53c 100644 > --- a/drivers/net/ethernet/qualcomm/qca_7k.c > +++ b/drivers/net/ethernet/qualcomm/qca_7k.c > @@ -23,11 +23,7 @@ > * kernel-based SPI device. > */ > > -#include > -#include > -#include > #include > -#include > > #include "qca_7k.h" > > -- > 2.1.4 > Changes like this drive me crazy. The only reason you can remove those headers is because you are obtaining things indirectly via qca_7k.h And if that is indeed the case, you are also getting qca_spi.h which in turn includes linux/spi/spi.h So you could have removed that as well. But seriously, it is so much harder to understand a driver and what interfaces it needs via header files when you hide _all_ of it behind these local private header files which just include _everything_ and then _every_ foo.c file in your driver gets _all_ of those kernel headers whether they need it or not. So if just one foo.c file needs 20 extra kernel headers than the rest of the files in the driver, every foo.c file eats that cost of including them. I really don't like when drivers move in this direction for that reason. And at best, as described at the beginning of my response, this change is incomplete.
Re: [PATCH net] net/mlx4: Fix the check in attaching steering rules
From: Tariq ToukanDate: Tue, 23 May 2017 15:50:07 +0300 > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c > b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c > index ae5fdc2df654..00a7cd3dcc2e 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c > @@ -1562,8 +1562,7 @@ static int mlx4_en_flow_replace(struct net_device *dev, > qpn = priv->drop_qp.qpn; > else if (cmd->fs.ring_cookie & EN_ETHTOOL_QP_ATTACH) { > qpn = cmd->fs.ring_cookie & (EN_ETHTOOL_QP_ATTACH - 1); > - if (qpn < priv->rss_map.base_qpn || > - qpn >= priv->rss_map.base_qpn + priv->rx_ring_num) { > + if (!mlx4_qp_lookup(priv->mdev->dev, qpn)) { > en_warn(priv, "rxnfc: QP (0x%x) doesn't exist\n", qpn); > return -EINVAL; > } > diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c > b/drivers/net/ethernet/mellanox/mlx4/qp.c > index 2d6abd4662b1..1eff2fe32a8b 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/qp.c > +++ b/drivers/net/ethernet/mellanox/mlx4/qp.c > @@ -384,6 +384,20 @@ static void mlx4_qp_free_icm(struct mlx4_dev *dev, int > qpn) > __mlx4_qp_free_icm(dev, qpn); > } > > +struct mlx4_qp *mlx4_qp_lookup(struct mlx4_dev *dev, u32 qpn) ... > +EXPORT_SYMBOL_GPL(mlx4_qp_lookup); > + This phony separation between MLX4_CORE and MLX4_EN is the only reason you need this unreasonable symbol export. I doubt you'll ever use this function anywhere outside of en_ethtool.c so this export is wasted space in the kernel image. Probably compiler could inline it decently as well. So find another way to do this without the symbol export. I don't really want to hear any stories about "clean separation" or whatever. What's happening here is exactly why this separate modules scheme results in ugly unreasonable code, and unnecessary gymnastics and wasted object space just to make routines available in one place from another.
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On Wed, May 24, 2017 at 01:58:09PM -0500, Timur Tabi wrote: > On 05/24/2017 09:09 AM, Andrew Lunn wrote: > > It could be, the copper side is up, but the SGMII side is down, at the > > point at803x_aneg_done() is called. So it is correctly returning > > 0. Sometime later the SGMII side goes up, but there is not a second > > interrupt. Hence the phy core does not know that the full, 2 stage MAC > > to PHY to peer PHY link is now up. > > Ok, I'm going to debug this some more. It turns out that the MAC side of > the SGMII link can send an interrupt when it thinks that auto-negotiation is > done. I might be able to use this. You can use this for your board. But it still leaves the phy driver broken for everybody else. > What function should my MAC driver call when it wants the phy core to call > at803x_aneg_done again to see if autonegotiation is done? You want to trigger the PHY state machine. There is only one exported API call to do this, phy_mac_interrupt(). But you are supposed to pass the new link state. And your MAC driver has no idea of that, it does not know if the copper side of the PHY is up. So it might be better if you export phy_trigger_machine(). > Also, is there a way for the MAC driver to know that at803x_aneg_done() > previously returned 0, and that it needs to tell the phy core to check again? Not that i know of. The MAC layer is not supposed to be messing around in the PHY layer. However, just triggering the PHY state machine should be enough. Andrew
[PATCH net-next 1/3] net: phy: Create sysfs reciprocal links for attached_dev/phydev
There is currently no way for a program scanning /sys to know whether a network device is attached to a particular PHY device, just like the PHY device is not pointed back to its attached network device. Create a symbolic link in the network device's namespace named "phydev" which points to the PHY device and create a symbolic link in the PHY device's namespace named "attached_dev" that points back to the network device. These links are set up during phy_attach_direct() and removed during phy_detach() for symetry. Signed-off-by: Florian Fainelli--- drivers/net/phy/phy_device.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 1219eeab69d1..8151477c7027 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -960,6 +960,15 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, phydev->attached_dev = dev; dev->phydev = phydev; + err = sysfs_create_link(>mdio.dev.kobj, >dev.kobj, + "attached_dev"); + if (err) + goto error; + + err = sysfs_create_link(>dev.kobj, >mdio.dev.kobj, + "phydev"); + if (err) + goto error; phydev->dev_flags = flags; @@ -1050,6 +1059,8 @@ void phy_detach(struct phy_device *phydev) struct mii_bus *bus; int i; + sysfs_remove_link(>dev.kobj, "phydev"); + sysfs_remove_link(>mdio.dev.kobj, "attached_dev"); phydev->attached_dev->phydev = NULL; phydev->attached_dev = NULL; phy_suspend(phydev); -- 2.9.3
[PATCH net-next 3/3] net: sysfs: Document PHY device sysfs attributes
Document the different sysfs attributes that exist for PHY devices: attached_dev, phy_has_fixups, phy_id and phy_interface. Signed-off-by: Florian Fainelli--- Documentation/ABI/testing/sysfs-class-net-phydev | 32 1 file changed, 32 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-net-phydev diff --git a/Documentation/ABI/testing/sysfs-class-net-phydev b/Documentation/ABI/testing/sysfs-class-net-phydev new file mode 100644 index ..a05831667c3d --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-net-phydev @@ -0,0 +1,32 @@ +What: /sys/class/mdio_bus///attached_dev +Date: May 2017 +KernelVersion: 4.13 +Contact: netdev@vger.kernel.org +Description: + Symbolic link to the network device this PHY device is + attached to. + +What: /sys/class/mdio_bus///phy_has_fixups +Date: February 2014 +KernelVersion: 3.15 +Contact: netdev@vger.kernel.org +Description: + Boolean value indicating whether the PHY device has + any fixups registered against it (phy_register_fixup) + +What: /sys/class/mdio_bus///phy_id +Date: November 2012 +KernelVersion: 3.8 +Contact: netdev@vger.kernel.org +Description: + 32-bit hexadecimal value corresponding to the PHY device's OUI, + model and revision number. + +What: /sys/class/mdio_bus///phy_interface +Date: February 2014 +KernelVersion: 3.15 +Contact: netdev@vger.kernel.org +Description: + String value indicating the PHY interface, possible + values are in include/linux/phy.h function phy_modes. + -- 2.9.3
[PATCH net-next 2/3] net: sysfs: Document "phydev" symbolic link
Now that we link the network device to its PHY device, document this sysfs symbolic link. Signed-off-by: Florian Fainelli--- Documentation/ABI/testing/sysfs-class-net | 8 1 file changed, 8 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net index 668604fc8e06..6856da99b6f7 100644 --- a/Documentation/ABI/testing/sysfs-class-net +++ b/Documentation/ABI/testing/sysfs-class-net @@ -251,3 +251,11 @@ Contact: netdev@vger.kernel.org Description: Indicates the unique physical switch identifier of a switch this port belongs to, as a string. + +What: /sys/class/net//phydev +Date: May 2017 +KernelVersion: 4.13 +Contact: netdev@vger.kernel.org +Description: + Symbolic link to the PHY device this network device is attached + to. -- 2.9.3
[PATCH net-next 0/3] net: phy: Create sysfs reciprocal links for attached_dev/phydev
Hi David, Andrew, This patch series addresses a device topology shortcoming where a program scanning /sys would not be able to establish a mapping between the network device and the PHY device. In the process it turned out that no PHY device documentation existed for sysfs attributes. Thanks! Florian Fainelli (3): net: phy: Create sysfs reciprocal links for attached_dev/phydev net: sysfs: Document "phydev" symbolic link net: sysfs: Document PHY device sysfs attributes Documentation/ABI/testing/sysfs-class-net| 8 ++ Documentation/ABI/testing/sysfs-class-net-phydev | 32 drivers/net/phy/phy_device.c | 11 3 files changed, 51 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-class-net-phydev -- 2.9.3
RE: [PATCH net-next 1/8] net: ipv4: refactor __ip_route_output_key_hash
Hi, Rupa /David Ahern, First, thanks for this patch set! Second, it seems to me that something might be incorrect here. You have these additions in this patch (1/8): ... +struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *flp, + const struct sk_buff *skb, + struct fib_result *res); ... +struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, + const struct sk_buff *skb) +{ + struct fib_result res; + struct rtable *rth; + + res.tclassid= 0; + res.fi = NULL; + res.table = NULL; + + rcu_read_lock(); + rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); rcu_read_unlock(); + return rth; } -EXPORT_SYMBOL_GPL(__ip_route_output_key_hash); +EXPORT_SYMBOL_GPL(ip_route_output_key_hash); So the third parameter to ip_route_output_key_hash_rcu() should be skb*, and the fourth parameter should be fib_result *. However, you do not pass the skb parameter when calling ip_route_output_key_hash_rcu() in ip_route_output_key_hash() (in fact you don't use it at all), and you pass mp_hash as the fourth parameter. Regards, Rami Rosen Intel Corporation
Re: [PATCH] net: rtnetlink: bail out from rtnl_fdb_dump() on parse error
From: Alexander PotapenkoDate: Tue, 23 May 2017 13:20:28 +0200 > rtnl_fdb_dump() failed to check the result of nlmsg_parse(), which led > to contents of |ifm| being uninitialized because nlh->nlmsglen was too > small to accommodate |ifm|. The uninitialized data may affect some > branches and result in unwanted effects, although kernel data doesn't > seem to leak to the userspace directly. > > The bug has been detected with KMSAN and syzkaller. > > Signed-off-by: Alexander Potapenko Applied, thanks.
Re: pull-request: mac80211 2017-05-23
From: Johannes BergDate: Tue, 23 May 2017 14:42:55 +0200 > I have just two fixes here, one of the scheduled scan issue that > Sander Eikelenboom found, and the other properly makes mesh more > strictly check its extension headers. > > Please pull and let me know if there's any problem. Pulled, thanks Johannes.
Re: [PATCH] net: rtnetlink: bail out from rtnl_fdb_dump() on parse error
On Tue, 2017-05-23 at 13:20 +0200, Alexander Potapenko wrote: > rtnl_fdb_dump() failed to check the result of nlmsg_parse(), which led > to contents of |ifm| being uninitialized because nlh->nlmsglen was too > small to accommodate |ifm|. The uninitialized data may affect some > branches and result in unwanted effects, although kernel data doesn't > seem to leak to the userspace directly. > > The bug has been detected with KMSAN and syzkaller. > > Signed-off-by: Alexander Potapenko> --- > For the record, here is the KMSAN report: > > == > BUG: KMSAN: use of unitialized memory in rtnl_fdb_dump+0x5dc/0x1000 > CPU: 0 PID: 1039 Comm: probe Not tainted 4.11.0-rc5+ #2727 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:16 > dump_stack+0x143/0x1b0 lib/dump_stack.c:52 > kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:1007 > __kmsan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:491 > rtnl_fdb_dump+0x5dc/0x1000 net/core/rtnetlink.c:3230 > netlink_dump+0x84f/0x1190 net/netlink/af_netlink.c:2168 > __netlink_dump_start+0xc97/0xe50 net/netlink/af_netlink.c:2258 > netlink_dump_start ./include/linux/netlink.h:165 > rtnetlink_rcv_msg+0xae9/0xb40 net/core/rtnetlink.c:4094 > netlink_rcv_skb+0x339/0x5a0 net/netlink/af_netlink.c:2339 > rtnetlink_rcv+0x83/0xa0 net/core/rtnetlink.c:4110 > netlink_unicast_kernel net/netlink/af_netlink.c:1272 > netlink_unicast+0x13b7/0x1480 net/netlink/af_netlink.c:1298 > netlink_sendmsg+0x10b8/0x10f0 net/netlink/af_netlink.c:1844 > sock_sendmsg_nosec net/socket.c:633 > sock_sendmsg net/socket.c:643 > ___sys_sendmsg+0xd4b/0x10f0 net/socket.c:1997 > __sys_sendmsg net/socket.c:2031 > SYSC_sendmsg+0x2c6/0x3f0 net/socket.c:2042 > SyS_sendmsg+0x87/0xb0 net/socket.c:2038 > do_syscall_64+0x102/0x150 arch/x86/entry/common.c:285 > entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246 > RIP: 0033:0x401300 > RSP: 002b:7ffc3b0e6d58 EFLAGS: 0246 ORIG_RAX: 002e > RAX: ffda RBX: 004002b0 RCX: 00401300 > RDX: RSI: 7ffc3b0e6d80 RDI: 0003 > RBP: 7ffc3b0e6e00 R08: 000b R09: 0004 > R10: 000d R11: 0246 R12: > R13: 004065a0 R14: 00406630 R15: > origin: 8fe00056 > save_stack_trace+0x59/0x60 arch/x86/kernel/stacktrace.c:59 > kmsan_save_stack_with_flags mm/kmsan/kmsan.c:352 > kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:247 > kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:260 > slab_alloc_node mm/slub.c:2743 > __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4349 > __kmalloc_reserve net/core/skbuff.c:138 > __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231 > alloc_skb ./include/linux/skbuff.h:933 > netlink_alloc_large_skb net/netlink/af_netlink.c:1144 > netlink_sendmsg+0x934/0x10f0 net/netlink/af_netlink.c:1819 > sock_sendmsg_nosec net/socket.c:633 > sock_sendmsg net/socket.c:643 > ___sys_sendmsg+0xd4b/0x10f0 net/socket.c:1997 > __sys_sendmsg net/socket.c:2031 > SYSC_sendmsg+0x2c6/0x3f0 net/socket.c:2042 > SyS_sendmsg+0x87/0xb0 net/socket.c:2038 > do_syscall_64+0x102/0x150 arch/x86/entry/common.c:285 > return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246 > == > > and the reproducer: > > == > #include > #include > #include > #include > > int main() > { > int sock = socket(PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK, 0); > struct msghdr msg; > memset(, 0, sizeof(msg)); > char nlmsg_buf[32]; > memset(nlmsg_buf, 0, sizeof(nlmsg_buf)); > struct nlmsghdr *nlmsg = nlmsg_buf; > nlmsg->nlmsg_len = 0x11; > nlmsg->nlmsg_type = 0x1e; // RTM_NEWROUTE = RTM_BASE + 0x0e > // type = 0x0e = 1110b > // kind = 2 > nlmsg->nlmsg_flags = 0x101; // NLM_F_ROOT | NLM_F_REQUEST > nlmsg->nlmsg_seq = 0; > nlmsg->nlmsg_pid = 0; > nlmsg_buf[16] = (char)7; > struct iovec iov; > iov.iov_base = nlmsg_buf; > iov.iov_len = 17; > msg.msg_iov = > msg.msg_iovlen = 1; > sendmsg(sock, , 0); > return 0; > } > == > --- > net/core/rtnetlink.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c > index 49a279a7cc15..9e2c0a7cb325 100644 > --- a/net/core/rtnetlink.c > +++ b/net/core/rtnetlink.c > @@ -3231,8 +3231,11 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct > netlink_callback *cb) > int err = 0; > int fidx = 0; > > - if (nlmsg_parse(cb->nlh, sizeof(struct ifinfomsg), tb, > - IFLA_MAX, ifla_policy, NULL) == 0) { > + err =
Re: [PATCH v2] net: fec: add post PHY reset delay DT property
From: Quentin SchulzDate: Tue, 23 May 2017 11:48:08 +0200 > Some PHY require to wait for a bit after the reset GPIO has been > toggled. This adds support for the DT property `phy-reset-post-delay` > which gives the delay in milliseconds to wait after reset. > > If the DT property is not given, no delay is observed. Post reset delay > greater than 1000ms are invalid. > > Signed-off-by: Quentin Schulz Applied, thanks.
Re: [PATCH net-next 2/4] nfp: register ports as devlink ports
On Wed, 24 May 2017 14:35:14 +0200, Jiri Pirko wrote: > >+void nfp_devlink_port_unregister(struct nfp_port *port) > >+{ > >+/* Due to unpleasant lock ordering we may see the port go away before > >+ * we have fully probed. > > Could you elaborate on this a bit more please? It's partially due to peculiarities of the management FW more than kernel stuff. Unfortunately some ethtool media config requires reboot to be applied, so we print a friendly message to the logs and unregister the associated netdevs. Which means once netdevs get registered ports may go away. Enter devlink, I need the ability to grab the adapater lock in split/unsplit callbacks to find the ports, which implies having to drop that lock before I register devlink. And only after I register devlink can I register the ports. I could do init without registering anything, drop the adapter lock, register devlink, and then grab the adapter lock back and register devlink ports and netdevs. But there is another issue... Since I look for ports on a list maintained in the adapter struct, driver code doesn't care if devlink_port has been registered or not. The moment devlink is registered, split/unsplit requests will be accepted - potentially trying to unregister devlink_port before the register could happen. Further down the line, also, the eswitch mode setting is coming. Which means the moment I register devlink itself ports will get shuffled (due to the plan of registering VFs as ports :)). I feel like registering devlink should be the last action of the driver, really. My plan was to keep that simple if() for now, and once we get to extending devlink with SR-IOV stuff also add the ability to pre-register ports. Allow registering ports on not-yet-registered devlink (probably put them on a private list within struct devlink). This would make devlink_register() a single point when everything devlink becomes visible, atomically, instead of devlink itself coming first and then ports following. Does that make sense? Am I misreading the code (again :S)?
Re: [PATCH net 0/2] sctp: a bunch of fixes for processing dupcookie
From: Xin LongDate: Tue, 23 May 2017 13:28:53 +0800 > After introducing transport hashtable and per stream info into sctp, > some regressions were caused when processing dupcookie, this patchset > is to fix them. Series applied, thanks.
Re: [PATCH net-next] geneve: fix fill_info when using collect_metadata
On Tue, May 23, 2017 at 3:37 PM, Eric Garverwrote: > Since 9b4437a5b870 ("geneve: Unify LWT and netdev handling.") fill_info > does not return UDP_ZERO_CSUM6_RX when using COLLECT_METADATA. This is > because it uses ip_tunnel_info_af() with the device level info, which is > not valid for COLLECT_METADATA. > > Fix by checking for the presence of the actual sockets. > > Fixes: 9b4437a5b870 ("geneve: Unify LWT and netdev handling.") > Signed-off-by: Eric Garver Thanks for the patch. Acked-by: Pravin B Shelar I noticed that the MTU and encal_len calculation in geneve_configure() also needs to be fixed for collect metadata case. Can you send patch to fix it?
Re: [PATCH v2 net-next 00/11] qed/qede: Mostly-cleanup series
From: Yuval MintzDate: Tue, 23 May 2017 09:41:17 +0300 > This series contains some cleanup of the qed and qede code: > - #1 contains mostly static/endian changes in order to allow qede to >pass sparse compilation cleanly. > - #2, #5 and #6 are either semantic or remove dead-code from driver. > - #9, #10 and #11 relate to printing and slightly change some APIs >between qed and the protocol drivers for that end [sharing the >interface names and information regarding device]. > > The rest of the patches are minor changes/fixes to various flows > in qed. Series applied, thanks.
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
On 05/24/2017 09:09 AM, Andrew Lunn wrote: > It could be, the copper side is up, but the SGMII side is down, at the > point at803x_aneg_done() is called. So it is correctly returning > 0. Sometime later the SGMII side goes up, but there is not a second > interrupt. Hence the phy core does not know that the full, 2 stage MAC > to PHY to peer PHY link is now up. Ok, I'm going to debug this some more. It turns out that the MAC side of the SGMII link can send an interrupt when it thinks that auto-negotiation is done. I might be able to use this. What function should my MAC driver call when it wants the phy core to call at803x_aneg_done again to see if autonegotiation is done? Also, is there a way for the MAC driver to know that at803x_aneg_done() previously returned 0, and that it needs to tell the phy core to check again? -- Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH net-next 1/8] net: ipv4: refactor __ip_route_output_key_hash
From: David AhernA later patch wants access to the fib result on an output route lookup with the rcu lock held. Refactor __ip_route_output_key_hash, pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper that takes the fib_result as an input arg. To keep the name length under control remove the leading underscores from the name. _rcu is added to the name of the new helper indicating it is called with the rcu read lock held. Signed-off-by: David Ahern Signed-off-by: Roopa Prabhu --- include/net/route.h | 7 ++- net/ipv4/icmp.c | 2 +- net/ipv4/route.c| 57 +++-- 3 files changed, 40 insertions(+), 26 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index 2cc0e14..5a92347 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -115,11 +115,16 @@ struct rt_cache_stat { void rt_flush_dev(struct net_device *dev); struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *flp, const struct sk_buff *skb); +struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *flp, + const struct sk_buff *skb); +struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *flp, + const struct sk_buff *skb, + struct fib_result *res); static inline struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *flp) { - return __ip_route_output_key_hash(net, flp, NULL); + return ip_route_output_key_hash(net, flp, NULL); } struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp, diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 43318b5..5610971 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -489,7 +489,7 @@ static struct rtable *icmp_route_lookup(struct net *net, fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev); security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4)); - rt = __ip_route_output_key_hash(net, fl4, skb_in); + rt = ip_route_output_key_hash(net, fl4, skb_in); if (IS_ERR(rt)) return rt; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 655d9ee..f787208 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2246,29 +2246,22 @@ static struct rtable *__mkroute_output(const struct fib_result *res, * Major route resolver routine. */ -struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, - const struct sk_buff *skb) +struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4, + const struct sk_buff *skb, + struct fib_result *res) { struct net_device *dev_out = NULL; + int orig_oif = fl4->flowi4_oif; __u8 tos = RT_FL_TOS(fl4); unsigned int flags = 0; - struct fib_result res; - struct rtable *rth; - int orig_oif; int err = -ENETUNREACH; - - res.tclassid= 0; - res.fi = NULL; - res.table = NULL; - - orig_oif = fl4->flowi4_oif; + struct rtable *rth; fl4->flowi4_iif = LOOPBACK_IFINDEX; fl4->flowi4_tos = tos & IPTOS_RT_MASK; fl4->flowi4_scope = ((tos & RTO_ONLINK) ? RT_SCOPE_LINK : RT_SCOPE_UNIVERSE); - rcu_read_lock(); if (fl4->saddr) { rth = ERR_PTR(-EINVAL); if (ipv4_is_multicast(fl4->saddr) || @@ -2354,15 +2347,15 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, fl4->daddr = fl4->saddr = htonl(INADDR_LOOPBACK); dev_out = net->loopback_dev; fl4->flowi4_oif = LOOPBACK_IFINDEX; - res.type = RTN_LOCAL; + res->type = RTN_LOCAL; flags |= RTCF_LOCAL; goto make_route; } - err = fib_lookup(net, fl4, , 0); + err = fib_lookup(net, fl4, res, 0); if (err) { - res.fi = NULL; - res.table = NULL; + res->fi = NULL; + res->table = NULL; if (fl4->flowi4_oif && (ipv4_is_multicast(fl4->daddr) || !netif_index_is_l3_master(net, fl4->flowi4_oif))) { @@ -2387,17 +2380,17 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, if (fl4->saddr == 0) fl4->saddr = inet_select_addr(dev_out, 0, RT_SCOPE_LINK); - res.type = RTN_UNICAST; + res->type =
[PATCH net-next 2/8] net: ipv4: refactor ip_route_input_noref
From: David AhernA later patch wants access to the fib result on an input route lookup with the rcu lock held. Refactor ip_route_input_noref pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper that takes the fib_result as an input arg. Signed-off-by: David Ahern Signed-off-by: Roopa Prabhu --- include/net/route.h | 3 +++ net/ipv4/route.c| 66 ++--- 2 files changed, 40 insertions(+), 29 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index 5a92347..8d209ad 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -180,6 +180,9 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4 int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src, u8 tos, struct net_device *devin); +int ip_route_input_rcu(struct sk_buff *skb, __be32 dst, __be32 src, + u8 tos, struct net_device *devin, + struct fib_result *res); static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src, u8 tos, struct net_device *devin) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index f787208..fe7b765 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1852,9 +1852,9 @@ static int ip_mkroute_input(struct sk_buff *skb, */ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, - u8 tos, struct net_device *dev) + u8 tos, struct net_device *dev, + struct fib_result *res) { - struct fib_result res; struct in_device *in_dev = __in_dev_get_rcu(dev); struct ip_tunnel_info *tun_info; struct flowi4 fl4; @@ -1884,8 +1884,8 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr)) goto martian_source; - res.fi = NULL; - res.table = NULL; + res->fi = NULL; + res->table = NULL; if (ipv4_is_lbcast(daddr) || (saddr == 0 && daddr == 0)) goto brd_input; @@ -1921,17 +1921,17 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, fl4.daddr = daddr; fl4.saddr = saddr; fl4.flowi4_uid = sock_net_uid(net, NULL); - err = fib_lookup(net, , , 0); + err = fib_lookup(net, , res, 0); if (err != 0) { if (!IN_DEV_FORWARD(in_dev)) err = -EHOSTUNREACH; goto no_route; } - if (res.type == RTN_BROADCAST) + if (res->type == RTN_BROADCAST) goto brd_input; - if (res.type == RTN_LOCAL) { + if (res->type == RTN_LOCAL) { err = fib_validate_source(skb, saddr, daddr, tos, 0, dev, in_dev, ); if (err < 0) @@ -1943,10 +1943,10 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, err = -EHOSTUNREACH; goto no_route; } - if (res.type != RTN_UNICAST) + if (res->type != RTN_UNICAST) goto martian_destination; - err = ip_mkroute_input(skb, , in_dev, daddr, saddr, tos); + err = ip_mkroute_input(skb, res, in_dev, daddr, saddr, tos); out: return err; brd_input: @@ -1960,14 +1960,14 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, goto martian_source; } flags |= RTCF_BROADCAST; - res.type = RTN_BROADCAST; + res->type = RTN_BROADCAST; RT_CACHE_STAT_INC(in_brd); local_input: do_cache = false; - if (res.fi) { + if (res->fi) { if (!itag) { - rth = rcu_dereference(FIB_RES_NH(res).nh_rth_input); + rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input); if (rt_cache_valid(rth)) { skb_dst_set_noref(skb, >dst); err = 0; @@ -1978,7 +1978,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, } rth = rt_dst_alloc(l3mdev_master_dev_rcu(dev) ? : net->loopback_dev, - flags | RTCF_LOCAL, res.type, + flags | RTCF_LOCAL, res->type, IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache); if (!rth) goto e_nobufs; @@ -1988,18 +1988,18 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, rth->dst.tclassid = itag; #endif rth->rt_is_input = 1; - if (res.table) - rth->rt_table_id = res.table->tb_id; + if (res->table) +
[PATCH net-next 8/8] net: ipv6: RTM_GETROUTE: return matched fib result when requested
From: Roopa PrabhuThis patch adds support to return matched fib result when RTM_F_FIB_MATCH flag is specified in RTM_GETROUTE request. This is useful for user-space applications/controllers wanting to query a matching route. Signed-off-by: Roopa Prabhu --- net/ipv6/route.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 80bda31..c4d6da9 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3612,6 +3612,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, struct rtmsg *rtm; struct flowi6 fl6; int err, iif = 0, oif = 0; + bool fibmatch; err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv6_policy, extack); @@ -3622,6 +3623,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, memset(, 0, sizeof(fl6)); rtm = nlmsg_data(nlh); fl6.flowlabel = ip6_make_flowinfo(rtm->rtm_tos, 0); + fibmatch = (rtm->rtm_flags & RTM_F_FIB_MATCH) ? true : false; if (tb[RTA_SRC]) { if (nla_len(tb[RTA_SRC]) < sizeof(struct in6_addr)) @@ -3667,12 +3669,27 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, if (!ipv6_addr_any()) flags |= RT6_LOOKUP_F_HAS_SADDR; - rt = (struct rt6_info *)ip6_route_input_lookup(net, dev, , - flags); + if (!fibmatch) + rt = (struct rt6_info *)ip6_route_input_lookup(net, dev, + , + flags); } else { fl6.flowi6_oif = oif; - rt = (struct rt6_info *)ip6_route_output(net, NULL, ); + if (!fibmatch) + rt = (struct rt6_info *)ip6_route_output_flags(net, + NULL, + , 0); + } + + if (fibmatch) { + rt = (struct rt6_info *)ip6_route_lookup(net, , 0); + if (rt->dst.error) { + err = rt->dst.error; + ip6_rt_put(rt); + goto errout; + } + } if (rt == net->ipv6.ip6_null_entry) { @@ -3689,10 +3706,15 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, } skb_dst_set(skb, >dst); - - err = rt6_fill_node(net, skb, rt, , , iif, - RTM_NEWROUTE, NETLINK_CB(in_skb).portid, - nlh->nlmsg_seq, 0); + if (fibmatch) { + err = rt6_fill_node(net, skb, rt, NULL, NULL, iif, + RTM_NEWROUTE, NETLINK_CB(in_skb).portid, + nlh->nlmsg_seq, 0); + } else { + err = rt6_fill_node(net, skb, rt, , , iif, + RTM_NEWROUTE, NETLINK_CB(in_skb).portid, + nlh->nlmsg_seq, 0); + } if (err < 0) { kfree_skb(skb); goto errout; -- 1.9.1
[PATCH net-next 0/8] net: extend RTM_GETROUTE to return fib result
From: Roopa PrabhuThis series adds a new RTM_F_FIB_MATCH flag to return matched fib result with RTM_GETROUTE. This is useful for applications and protocols in userspace wanting to query the selected route. examples (with patched iproute2): ipv4: $ip route show default via 192.168.0.2 dev eth0 10.0.14.0/24 nexthop via 172.16.0.3 dev dummy0 weight 1 nexthop via 172.16.1.3 dev dummy1 weight 1 $ip route get 10.0.14.2 10.0.14.2 via 172.16.1.3 dev dummy1 src 172.16.1.1 cache $ip route get fibmatch 10.0.14.2 10.0.14.0/24 nexthop via 172.16.0.3 dev dummy0 weight 1 nexthop via 172.16.1.3 dev dummy1 weight 1 ipv6: $ip -6 route show 2001:db9:100::/120 metric 1024 nexthop via 2001:db8:2::2 dev dummy0 weight 1 nexthop via 2001:db8:12::2 dev dummy1 weight 1 $ip -6 route get 2001:db9:100::1 2001:db9:100::1 from :: via 2001:db8:12::2 dev dummy1 src 2001:db8:12::1 metric 1024 pref medium $ip -6 route get fibmatch 2001:db9:100::1 2001:db9:100::/120 metric 1024 nexthop via 2001:db8:12::2 dev dummy1 weight 1 nexthop via 2001:db8:2::2 dev dummy0 weight 1 David Ahern (5): net: ipv4: refactor __ip_route_output_key_hash net: ipv4: refactor ip_route_input_noref net: ipv4: Remove event arg to rt_fill_info net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup net: ipv4: Save trie prefix to fib lookup result Roopa Prabhu (3): net: ipv4: add new RTM_F_FIB_MATCH flag for use with RTM_GETROUTE net: ipv4: RTM_GETROUTE: return matched fib result when requested net: ipv6: RTM_GETROUTE: return matched fib result when requested include/net/ip_fib.h | 1 + include/net/route.h| 10 ++- include/uapi/linux/rtnetlink.h | 1 + net/ipv4/fib_trie.c| 1 + net/ipv4/icmp.c| 2 +- net/ipv4/route.c | 160 - net/ipv6/route.c | 11 ++- 7 files changed, 116 insertions(+), 70 deletions(-) -- 1.9.1
[PATCH net-next 5/8] net: ipv4: Save trie prefix to fib lookup result
From: David AhernPrefix is needed for returning matching route spec on get route request. Signed-off-by: David Ahern Signed-off-by: Roopa Prabhu --- include/net/ip_fib.h | 1 + net/ipv4/fib_trie.c | 1 + 2 files changed, 2 insertions(+) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 42e8b8f..25f5c51 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -136,6 +136,7 @@ struct fib_info { struct fib_table; struct fib_result { + __be32 prefix; unsigned char prefixlen; unsigned char nh_sel; unsigned char type; diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 6d0f6c79..6e9df7d 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1452,6 +1452,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, if (!(fib_flags & FIB_LOOKUP_NOREF)) atomic_inc(>fib_clntref); + res->prefix = htonl(n->key); res->prefixlen = KEYLENGTH - fa->fa_slen; res->nh_sel = nhsel; res->type = fa->fa_type; -- 1.9.1
[PATCH net-next 3/8] net: ipv4: Remove event arg to rt_fill_info
From: David Ahernrt_fill_info has 1 caller with the event set to RTM_NEWROUTE. Given that remove the arg and use RTM_NEWROUTE directly in rt_fill_info. Signed-off-by: David Ahern Signed-off-by: Roopa Prabhu --- net/ipv4/route.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index fe7b765..9699e9b 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2536,7 +2536,7 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4, static int rt_fill_info(struct net *net, __be32 dst, __be32 src, u32 table_id, struct flowi4 *fl4, struct sk_buff *skb, u32 portid, - u32 seq, int event) + u32 seq) { struct rtable *rt = skb_rtable(skb); struct rtmsg *r; @@ -2545,7 +2545,7 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src, u32 table_id, u32 error; u32 metrics[RTAX_MAX]; - nlh = nlmsg_put(skb, portid, seq, event, sizeof(*r), 0); + nlh = nlmsg_put(skb, portid, seq, RTM_NEWROUTE, sizeof(*r), 0); if (!nlh) return -EMSGSIZE; @@ -2745,8 +2745,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, table_id = rt->rt_table_id; err = rt_fill_info(net, dst, src, table_id, , skb, - NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, - RTM_NEWROUTE); + NETLINK_CB(in_skb).portid, nlh->nlmsg_seq); if (err < 0) goto errout_free; -- 1.9.1
[PATCH net-next 7/8] net: ipv4: RTM_GETROUTE: return matched fib result when requested
From: Roopa PrabhuThis patch adds support to return matched fib result when RTM_F_FIB_MATCH flag is specified in RTM_GETROUTE request. This is useful for user-space applications/controllers wanting to query a matching route. Signed-off-by: Roopa Prabhu --- net/ipv4/route.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 6e0bd40..419bdba 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -114,6 +114,8 @@ #include #include +#include "fib_lookup.h" + #define RT_FL_TOS(oldflp4) \ ((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK)) @@ -2746,8 +2748,15 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE) table_id = rt->rt_table_id; - err = rt_fill_info(net, dst, src, table_id, , skb, - NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, rt); + if (rtm->rtm_flags & RTM_F_FIB_MATCH) + err = fib_dump_info(skb, NETLINK_CB(in_skb).portid, + nlh->nlmsg_seq, RTM_NEWROUTE, table_id, + rt->rt_type, res.prefix, res.prefixlen, + fl4.flowi4_tos, res.fi, 0); + else + err = rt_fill_info(net, dst, src, table_id, , skb, + NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, + rt); if (err < 0) goto errout_free; -- 1.9.1
[PATCH net-next 6/8] net: ipv4: add new RTM_F_FIB_MATCH flag for use with RTM_GETROUTE
From: Roopa PrabhuThis flag when specified will return matched fib result in response to a RTM_GETROUTE query. Signed-off-by: Roopa Prabhu --- include/uapi/linux/rtnetlink.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 6487b21..564790e 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -278,6 +278,7 @@ enum rt_scope_t { #define RTM_F_EQUALIZE 0x400 /* Multipath equalizer: NI */ #define RTM_F_PREFIX 0x800 /* Prefix addresses */ #define RTM_F_LOOKUP_TABLE 0x1000 /* set rtm_table to FIB lookup result */ +#define RTM_F_FIB_MATCH0x2000 /* return full fib lookup match */ /* Reserved table identifiers */ -- 1.9.1
[PATCH net-next 4/8] net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup
From: David AhernConvert inet_rtm_getroute to use ip_route_input_rcu and ip_route_output_key_hash_rcu passing the fib_result arg to both. The rcu lock is held through the creation of the response, so the rtable/dst does not need to be attached to the skb and is passed to rt_fill_info directly. In converting from ip_route_output_key to ip_route_output_key_hash_rcu the xfrm_lookup_route in ip_route_output_flow is dropped since flowi4_proto is not set for a route get request. Signed-off-by: David Ahern Signed-off-by: Roopa Prabhu --- net/ipv4/route.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 9699e9b..6e0bd40 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2404,7 +2404,7 @@ struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4, } /* L3 master device is the loopback for that domain */ - dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : + dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(*res)) ? : net->loopback_dev; fl4->flowi4_oif = dev_out->ifindex; flags |= RTCF_LOCAL; @@ -2435,7 +2435,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, res.table = NULL; rcu_read_lock(); - rth = ip_route_output_key_hash_rcu(net, fl4, , mp_hash); + rth = ip_route_output_key_hash_rcu(net, fl4, skb, ); rcu_read_unlock(); return rth; @@ -2534,11 +2534,11 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4, } EXPORT_SYMBOL_GPL(ip_route_output_flow); +/* called with rcu_read_lock held */ static int rt_fill_info(struct net *net, __be32 dst, __be32 src, u32 table_id, struct flowi4 *fl4, struct sk_buff *skb, u32 portid, - u32 seq) + u32 seq, struct rtable *rt) { - struct rtable *rt = skb_rtable(skb); struct rtmsg *r; struct nlmsghdr *nlh; unsigned long expires = 0; @@ -2653,6 +2653,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, struct net *net = sock_net(in_skb->sk); struct rtmsg *rtm; struct nlattr *tb[RTA_MAX+1]; + struct fib_result res = {}; struct rtable *rt = NULL; struct flowi4 fl4; __be32 dst = 0; @@ -2709,10 +2710,12 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, fl4.flowi4_mark = mark; fl4.flowi4_uid = uid; + rcu_read_lock(); + if (iif) { struct net_device *dev; - dev = __dev_get_by_index(net, iif); + dev = dev_get_by_index_rcu(net, iif); if (!dev) { err = -ENODEV; goto errout_free; @@ -2721,14 +2724,14 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, skb->protocol = htons(ETH_P_IP); skb->dev= dev; skb->mark = mark; - err = ip_route_input(skb, dst, src, rtm->rtm_tos, dev); + err = ip_route_input_rcu(skb, dst, src, rtm->rtm_tos, +dev, ); rt = skb_rtable(skb); if (err == 0 && rt->dst.error) err = -rt->dst.error; } else { - rt = ip_route_output_key(net, ); - + rt = ip_route_output_key_hash_rcu(net, , skb, ); err = 0; if (IS_ERR(rt)) err = PTR_ERR(rt); @@ -2737,7 +2740,6 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, if (err) goto errout_free; - skb_dst_set(skb, >dst); if (rtm->rtm_flags & RTM_F_NOTIFY) rt->rt_flags |= RTCF_NOTIFY; @@ -2745,15 +2747,18 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, table_id = rt->rt_table_id; err = rt_fill_info(net, dst, src, table_id, , skb, - NETLINK_CB(in_skb).portid, nlh->nlmsg_seq); + NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, rt); if (err < 0) goto errout_free; + rcu_read_unlock(); + err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid); errout: return err; errout_free: + rcu_read_unlock(); kfree_skb(skb); goto errout; } -- 1.9.1
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
On Tue, May 23, 2017 at 9:48 AM, Solar Designerwrote: >> >>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer >> >>> wrote: >> >>> > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: >> >>> >> *) When modules_autoload_mode is set to (2), automatic module loading >> >>> >> is >> >>> >> disabled for all. Once set, this value can not be changed. >> >>> > >> >>> > What purpose does this securelevel-like property ("Once set, this value >> >>> > can not be changed.") serve here? I think this mode 2 is needed, but >> >>> > without this extra property, which is bypassable by e.g. explicitly >> >>> > loaded kernel modules anyway (and that's OK). > > On Mon, May 22, 2017 at 04:07:56PM -0700, Kees Cook wrote: >> I'm on the fence. For modules_disabled and Yama, it was tied to >> CAP_SYS_ADMIN, basically designed to be a at-boot setting that could >> not later be undone by an attacker gaining that privilege, keeping >> them out of either kernel memory or existing user process memory. >> Here, it's CAP_SYS_MODULE... it's hard to imagine the situation where >> a CAP_SYS_MODULE-capable process could write to this sysctl but NOT >> issue direct modprobe requests, but it's _possible_ via crazy symlink >> games to trick capable processes into writing to sysctls. We've seen >> this multiple times before, and it's a way for attackers to turn a >> single privileged write into a privileged exec. > > OK, tricking a process via crazy symlink games is finally a potentially > valid reason. The question then becomes: are there perhaps so many > other important sysctl's, disk files, etc. (which the vulnerable capable > process could similarly be tricked into writing) so that specifically > resetting modules_autoload_mode isn't particularly lucrative? I think > that the answer to that is usually yes. Another related question: do we > really want to inconsistently single out a handful of sysctl's for this > kind of extra protection? I think not. > > I agree there are some other settings where being unable to reset them > makes sense, but I think this isn't one of those. > Alright, I already replied to Andy, since it was requested I will drop it. I definitely prefer that we have something merged and usable ;-) >> I might turn the question around, though: why would we want to have it >> changeable at this setting? > > Convenience for the sysadmin - being able to correct one's error (e.g., > wrong order of shell commands), respond to new findings (thought module > autoloading was unneeded after some point, then found out some software > relies on it), change one's mind, reuse a system differently than > originally intended without a forced reboot. > >> I'm fine leaving that piece off, either way. > > I'm also fine with either decision. I just thought I'd point out what > looked weird to me. > > I think this is an important patch that should get in, but primarily > for modules_autoload_mode=1, which many distros could make the default > (and maybe the kernel eventually should?) I do not think that desktop, or interactive systems will set it. Ubuntu has snaps for their app store, and they can use the per-task flag and other vendors too Flatpak, etc. > For modules_autoload_mode=2, we already seem to have the equivalent of > modprobe=/bin/true (or does it differ subtly, maybe in return values?), > which I already use at startup on a GPU box like this (preloading > modules so that the OpenCL backends wouldn't need the autoloading): > > nvidia-smi > nvidia-modprobe -u -c=0 > #modprobe nvidia_uvm > #modprobe fglrx > > sysctl -w kernel.modprobe=/bin/true > sysctl -w kernel.hotplug=/bin/true > > but it's good to also have this supported more explicitly and more > consistently through modules_autoload_mode=2 while we're at it. So I > support having this mode as well. I just question the need to have it > non-resettable. > Ok, yes with mode=2 it is clear interface, it logs what was denied, it is consistent with the per-task flag that we want for sandboxes and desktop, it will work with CONFIG_STATIC_USERMODEHELPER, more importantly it will avoid doing the usermod upcall, even if one day the kernel support upcall into namespaces, I'm pretty sure that no one will like the idea of namespaced modprobe paths, modules are global anyway, this allows to say we are safe by default from any future change that may make an upcall into the wrong context, we just avoid that. -- tixxdz
Re: Deleting a dynamic mac entry..
Thanks, Toshiaki. What is the right way to set the default_pvid using the bridge command ? I tried this, which fails.. root@net-3:~# ip link set dev vxlan0 name untagged type vlan id 0 RTNETLINK answers: Operation not supported root@net-3:~# All the interfaces in the bridge are untagged. thanks, Manohar. On Mon, May 22, 2017 at 8:17 PM, Toshiaki Makitawrote: > On 2017/05/21 11:28, Manohar Kumar wrote: >> Hello, >> >> In 3.19 the following bridge fdb command to delete a dynamically >> learned entry fails.. >> >> root@net-3:~# bridge fdb show | grep 02:42:0a:ff:00:06 >> 02:42:0a:ff:00:06 dev vxlan0 master br0 >> root@net-3:~# bridge fdb del 02:42:0a:ff:00:06 dev vxlan0 master >> RTNETLINK answers: No such file or directory >> >> It works in 4.4. >> >> Can someone please point to the patch that made this change ? > > 25d3b493a52d ("bridge: Fix inability to add non-vlan fdb entry") might > be what you are looking for, but you might want to do git-bisect to > track down any regression or fix. > >> In kernels without this patch is there an alternative to delete >> (actually I want to do it programmatically) dynamic mac entries ? > > If 25d3b493a52d is causing your problem, set default_pvid to 0 in order > to disable default_pvid, and delete any vlans which is already > configured in bridge's vlan_filtering. Then, delete the fdb entry. > > Toshiaki Makita >
Re: [PATCH] nfc: Ensure presence of required attributes in the activate_target netlink handler
On Wed, May 24, 2017 at 3:42 AM, Mateusz Jurczykwrote: > Check that the NFC_ATTR_TARGET_INDEX and NFC_ATTR_PROTOCOLS attributes (in > addition to NFC_ATTR_DEVICE_INDEX) are provided by the netlink client > prior to accessing them. This prevents potential unhandled NULL pointer > dereference exceptions which can be triggered by malicious user-mode > programs, if they omit one or both of these attributes. > > Signed-off-by: Mateusz Jurczyk Acked-by: Kees Cook -Kees > --- > net/nfc/netlink.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/net/nfc/netlink.c b/net/nfc/netlink.c > index 6b0850e63e09..b251fb936a27 100644 > --- a/net/nfc/netlink.c > +++ b/net/nfc/netlink.c > @@ -907,7 +907,9 @@ static int nfc_genl_activate_target(struct sk_buff *skb, > struct genl_info *info) > u32 device_idx, target_idx, protocol; > int rc; > > - if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) > + if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || > + !info->attrs[NFC_ATTR_TARGET_INDEX] || > + !info->attrs[NFC_ATTR_PROTOCOLS]) > return -EINVAL; > > device_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); > -- > 2.13.0.219.gdb65acc882-goog > -- Kees Cook Pixel Security
RE: [PATCH v2] drivers: phy: Add Cortina CS4340 driver
> -Original Message- > From: Florian Fainelli [mailto:f.faine...@gmail.com] > Sent: Wednesday, May 24, 2017 8:03 PM > To: Bogdan Purcareata; Andrew Lunn > > Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH v2] drivers: phy: Add Cortina CS4340 driver > > On 05/24/2017 07:34 AM, Bogdan Purcareata wrote: > >> You should do that as well. > > > > Okay, I will include this check in a probe function in v3. > > When you do so, can you change the subject to be net: phy: Add Cortina > CS4340 driver for the driver, and use dt-bindings: net: for the Device > Tree binding patch? Will do, thanks! Bogdan