Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
> I'm not sure if I am really a fan of trying to solve this in this way.
> It seems like this is going to be optimizing the paths for one case at
> the detriment of others. Historically mapping and unmapping has always
> been expensive, especially in the case of IOMMU enabled environments.
> I would much rather see us focus on having swiotlb_dma_ops replaced
> with dma_direct_ops in the cases where the device can access all of
> physical memory.

I am definitely not a fan either, but IFF indirect calls are such an
overhead it makes sense to avoid them for the common and simple case.
And the direct mapping is a common case present on just about every
architecture, and it is a very simple fast path that just adds an
offset to the physical address. So if we want to speed something up,
this is it.

> > -	if (ops->unmap_page)
> > +	if (!dev->is_dma_direct && ops->unmap_page)
>
> If I understand correctly this is only needed for the swiotlb case and
> not the dma_direct case. It would make much more sense to just
> overwrite the dev->dma_ops pointer with dma_direct_ops to address all
> of the sync and unmap cases.

Yes.

> > +	if (dev->dma_ops == &dma_direct_ops ||
> > +	    (dev->dma_ops == &swiotlb_dma_ops &&
> > +	     mask == DMA_BIT_MASK(64)))
> > +		dev->is_dma_direct = true;
> > +	else
> > +		dev->is_dma_direct = false;
>
> So I am not sure this will work on x86. If I am not mistaken I believe
> dev->dma_ops is normally not set and instead the default dma
> operations are pulled via get_arch_dma_ops which returns the global
> dma_ops pointer.

True, for x86 we'd need to check get_arch_dma_ops().

> What you may want to consider as an alternative would be to look at
> modifying drivers that are using the swiotlb so that you could just
> overwrite the dev->dma_ops with the dma_direct_ops in the cases where
> the hardware can support accessing all of physical memory and where
> we aren't forcing the use of the bounce buffers in the swiotlb.
> Then for the above code you only have to worry about the map calls,
> and you could just do a check against the dma_direct_ops pointer
> instead of having to add a new flag.

That would be the long term plan IFF we go down this route. For now I
just wanted a quick hack for performance testing.
[PATCH bpf-next 07/10] [bpf]: make cavium thunder compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be
changed as well (only "decreasing" the pointer's location is going to
be supported). Changing this pointer changes the packet's size. For
cavium's thunder driver we now just calculate the packet's length
unconditionally.

Signed-off-by: Nikita V. Shirokov
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 707db3304396..7135db45927e 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -538,9 +538,9 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 	action = bpf_prog_run_xdp(prog, &xdp);
 	rcu_read_unlock();
 
+	len = xdp.data_end - xdp.data;
 	/* Check if XDP program has changed headers */
 	if (orig_data != xdp.data) {
-		len = xdp.data_end - xdp.data;
 		offset = orig_data - xdp.data;
 		dma_addr -= offset;
 	}
-- 
2.15.1
[PATCH bpf-next 01/10] [bpf]: adding bpf_xdp_adjust_tail helper
Adding a new bpf helper which allows us to manipulate xdp's data_end
pointer, and so reduce a packet's size. Intended use case: generating
ICMP messages from XDP context, where such a message would contain a
truncated original packet.

Signed-off-by: Nikita V. Shirokov
---
 include/uapi/linux/bpf.h | 10 +-
 net/core/filter.c        | 29 -
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..9a2d1a04eb24 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -755,6 +755,13 @@ union bpf_attr {
  *	@addr: pointer to struct sockaddr to bind socket to
  *	@addr_len: length of sockaddr structure
  *	Return: 0 on success or negative error code
+ *
+ * int bpf_xdp_adjust_tail(xdp_md, delta)
+ *	Adjust the xdp_md.data_end by delta. Only shrinking of packet's
+ *	size is supported.
+ *	@xdp_md: pointer to xdp_md
+ *	@delta: A negative integer to be added to xdp_md.data_end
+ *	Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -821,7 +828,8 @@ union bpf_attr {
 	FN(msg_apply_bytes),		\
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
-	FN(bind),
+	FN(bind),			\
+	FN(xdp_adjust_tail),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index d31aff93270d..6c8ac7b548d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2717,6 +2717,30 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
+{
+	/* only shrinking is allowed for now. */
+	if (unlikely(offset > 0))
+		return -EINVAL;
+
+	void *data_end = xdp->data_end + offset;
+
+	if (unlikely(data_end < xdp->data + ETH_HLEN))
+		return -EINVAL;
+
+	xdp->data_end = data_end;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
+	.func		= bpf_xdp_adjust_tail,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 {
 	void *meta = xdp->data_meta + offset;
@@ -3053,7 +3077,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_l4_csum_replace ||
 	    func == bpf_xdp_adjust_head ||
 	    func == bpf_xdp_adjust_meta ||
-	    func == bpf_msg_pull_data)
+	    func == bpf_msg_pull_data ||
+	    func == bpf_xdp_adjust_tail)
 		return true;
 
 	return false;
@@ -3867,6 +3892,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_xdp_adjust_tail:
+		return &bpf_xdp_adjust_tail_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
2.15.1
Re: [PATCH][next] iwlwifi: mvm: remove division by size of sizeof(struct ieee80211_wmm_rule)
On Wed, 2018-04-11 at 14:05 +0100, Colin King wrote:
> From: Colin Ian King
>
> The subtraction of two struct ieee80211_wmm_rule pointers leaves a
> result that is automatically scaled down by the size of the
> pointed-to type, hence the division by
> sizeof(struct ieee80211_wmm_rule) is bogus and should be removed.
>
> Detected by CoverityScan, CID#146 ("Extra sizeof expression")
>
> Fixes: 77e30e10ee28 ("iwlwifi: mvm: query regdb for wmm rule if needed")
> Signed-off-by: Colin Ian King
> ---

Thanks, Colin! I've pushed this to our internal tree for review and if
everything goes fine it will land upstream following our normal
upstreaming process.

-- 
Cheers,
Luca.
[PATCH net-next 3/3] net: phy: Enable C45 PHYs with vendor specific address space
A search of the dev-addr property is done in of_mdiobus_register. If
the property is found in the PHY node,
of_mdiobus_register_vend_spec_phy() is called. This is a wrapper
function for of_mdiobus_register_phy() which finds the device in
package based on dev-addr, and fills devices_addrs, which is a new
field added to phy_c45_device_ids. This new field will store the
dev-addr property on the same index where the device in package has
been found.

of_mdiobus_register_phy() now takes an extra parameter,
struct phy_c45_device_ids *c45_ids. If c45_ids is not NULL,
get_vend_spec_addr_phy_device() is called and c45_ids is propagated
all the way to get_phy_c45_ids(). With dev-addr stored in
devices_addrs, when get_phy_c45_ids() probes the identifiers, dev-addr
can be extracted from devices_addrs and probed if
devices_addrs[current_identifier] is not 0.

Signed-off-by: Vicentiu Galanopulo
---
 drivers/net/phy/phy_device.c |  49 +--
 drivers/of/of_mdio.c         | 113 +--
 include/linux/phy.h          |  14 ++
 3 files changed, 169 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index ac23322..5c79fd8 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -457,7 +457,7 @@ static int get_phy_c45_devs_in_pkg(struct mii_bus *bus, int addr, int dev_addr,
 static int get_phy_c45_ids(struct mii_bus *bus, int addr, u32 *phy_id,
 			   struct phy_c45_device_ids *c45_ids) {
 	int phy_reg;
-	int i, reg_addr;
+	int i, reg_addr, dev_addr;
 	const int num_ids = ARRAY_SIZE(c45_ids->device_ids);
 	u32 *devs = &c45_ids->devices_in_package;
@@ -493,13 +493,23 @@ static int get_phy_c45_ids(struct mii_bus *bus, int addr, u32 *phy_id,
 		if (!(c45_ids->devices_in_package & (1 << i)))
 			continue;
 
-		reg_addr = MII_ADDR_C45 | i << 16 | MII_PHYSID1;
+		/* if c45_ids->devices_addrs for the current id is not 0,
+		 * then dev-addr was defined in the PHY device tree node,
+		 * and the PHY has been seen as a valid device, and added
+		 * in the package. In this case we can use the
+		 * dev-addr (c45_ids->devices_addrs[i]) to do the MDIO
+		 * reading of the PHY ID.
+		 */
+		dev_addr = !!c45_ids->devices_addrs[i] ?
+			   c45_ids->devices_addrs[i] : i;
+
+		reg_addr = MII_ADDR_C45 | dev_addr << 16 | MII_PHYSID1;
 		phy_reg = mdiobus_read(bus, addr, reg_addr);
 		if (phy_reg < 0)
 			return -EIO;
 		c45_ids->device_ids[i] = (phy_reg & 0xffff) << 16;
 
-		reg_addr = MII_ADDR_C45 | i << 16 | MII_PHYSID2;
+		reg_addr = MII_ADDR_C45 | dev_addr << 16 | MII_PHYSID2;
 		phy_reg = mdiobus_read(bus, addr, reg_addr);
 		if (phy_reg < 0)
 			return -EIO;
@@ -551,6 +561,39 @@ static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id,
 }
 
 /**
+ * get_vend_spec_addr_phy_device - reads the specified PHY device
+ *				   and returns its @phy_device struct
+ * @bus: the target MII bus
+ * @addr: PHY address on the MII bus
+ * @is_c45: If true the PHY uses the 802.3 clause 45 protocol
+ * @c45_ids: Query the c45_ids to see if a PHY with a vendor specific
+ *	     register address space was defined in the PHY device tree
+ *	     node by adding the "dev-addr" property to the node.
+ *	     Store the c45 ID information about the rest of the PHYs
+ *	     found on the MDIO bus during probing.
+ *
+ * Description: Reads the ID registers of the PHY at @addr on the
+ * @bus, then allocates and returns the phy_device to represent it.
+ */
+struct phy_device *get_vend_spec_addr_phy_device(struct mii_bus *bus,
+						 int addr, bool is_c45,
+						 struct phy_c45_device_ids *c45_ids)
+{
+	u32 phy_id = 0;
+	int r;
+
+	r = get_phy_id(bus, addr, &phy_id, is_c45, c45_ids);
+	if (r)
+		return ERR_PTR(r);
+
+	/* If the phy_id is mostly Fs, there is no device there */
+	if ((phy_id & 0x1fffffff) == 0x1fffffff)
+		return ERR_PTR(-ENODEV);
+
+	return phy_device_create(bus, addr, phy_id, is_c45, c45_ids);
+}
+
+/**
  * get_phy_device - reads the specified PHY device and returns its @phy_device
  *		    struct
  * @bus: the target MII bus
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 8c0c927..52e8bfb 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -45,7 +45,8 @@ static int of_get_phy_id(struct device_node *device, u32 *phy_id)
 }
 
 static int
[PATCH net-next 2/3] net: phy: Change the array size to 32 for device_ids
In the context of enabling the discovery of the PHYs which have the
C45 MDIO address space at a non-standard address: num_ids in
get_phy_c45_ids() has the value 8 (ARRAY_SIZE(c45_ids->device_ids)),
but the u32 *devs bitfield can flag 32 devices. If a device is stored
in *devs in bits 32 to 9 (bit counting in the lookup loop starts from
1), it will not be found.

Signed-off-by: Vicentiu Galanopulo
---
 include/linux/phy.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index f0b5870..26aa320 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -360,7 +360,7 @@ enum phy_state {
  */
 struct phy_c45_device_ids {
 	u32 devices_in_package;
-	u32 device_ids[8];
+	u32 device_ids[32];
 };
 
 /* phy_device: An instance of a PHY
-- 
2.7.4
[PATCH net-next 0/3] net: phy: Enable C45 vendor specific MDIO register addr space
Enabling the discovery on the MDIO bus of PHYs which have a vendor
specific address space for accessing the C45 MDIO registers.

Vicentiu Galanopulo (3):
  net: phy: Add binding for vendor specific C45 MDIO address space
  net: phy: Change the array size to 32 for device_ids
  net: phy: Enable C45 PHYs with vendor specific address space

 Documentation/devicetree/bindings/net/phy.txt |   6 ++
 drivers/net/phy/phy_device.c                  |  49 ++-
 drivers/of/of_mdio.c                          | 113 +-
 include/linux/phy.h                           |  16 +++-
 4 files changed, 176 insertions(+), 8 deletions(-)

-- 
2.7.4
Re: [PATCH net-next 3/5] ipv4: support sport, dport and ip protocol in RTM_GETROUTE
On Mon, Apr 16, 2018 at 01:41:36PM -0700, Roopa Prabhu wrote:
> @@ -2757,6 +2796,12 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
>  	fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
>  	fl4.flowi4_mark = mark;
>  	fl4.flowi4_uid = uid;
> +	if (sport)
> +		fl4.fl4_sport = sport;
> +	if (dport)
> +		fl4.fl4_dport = dport;
> +	if (ip_proto)
> +		fl4.flowi4_proto = ip_proto;

Hi Roopa,

This info isn't set in the synthesized skb, but only in the flow info,
and is therefore not used for input routes. I see you added a test
case, but it's only for output routes. I believe an input route test
case will fail.

Also, note that the skb as synthesized now is invalid - iph->ihl is 0,
for example - so the flow dissector will spit it out. It effectively
means that route get is broken when L4 hashing is used. It also
affects output routes, because since commit 3765d35ed8b9 ("net: ipv4:
Convert inet_rtm_getroute to rcu versions of route lookup") the skb is
used to calculate the multipath hash.
Re: tcp hang when socket fills up ?
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote:
> > I turned pr_debug on in tcp_in_window() for another try and it's a bit
> > mangled because the information is on multiple lines and the function
> > is called in parallel, but it looks like I do have some seq > maxend + 1
> >
> > Although it's weird, the maxend was set WAY earlier apparently?
> > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200
> > then window decreased drastically e.g. previous ack just before refusal:
> > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 end=1913292311
> > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 receiver end=505020155 maxend=505033596 maxwin=39296 scale=7
>
> scale=0 is suspect.
>
> Really if conntrack does not see SYN SYNACK packets, it should not
> make any window check, since windows can be scaled up to 14 :/

Hm... it doesn't seem to be the case here:

14.364038 tcp_in_window: START
14.364065 tcp_in_window:
14.364090 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364129 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364158 tcp_in_window:
14.364185 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364210 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364237 tcp_in_window: I=1 II=1 III=1 IV=1
14.364262 tcp_in_window: res=1 sender end=505004284 maxend=505004284 maxwin=29200 receiver end=0 maxend=29200 maxwin=0

looks like SYN packet

14.661682 tcp_in_window: START
14.661706 tcp_in_window:
14.661731 seq=1913287797 ack=0+(0) sack=0+(0) win=29200 end=1913287798
14.661828 tcp_in_window: sender end=0 maxend=29200 maxwin=0 scale=0 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.661867 tcp_in_window: START
14.661893 tcp_in_window:
14.661917 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 end=1025597635
14.661942 tcp_in_window: START
14.661966 tcp_in_window:
14.661993 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662020 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 end=505004284
14.662045 tcp_in_window:
14.662072 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 end=1025597635
14.662097 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662125 tcp_in_window:
14.662150 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662175 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 end=505004284
14.662202 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662226 tcp_in_window: I=1 II=1 III=1 IV=1
14.662251 tcp_in_window: I=1 II=1 III=1 IV=1
14.662277 tcp_in_window: res=1 sender end=505004284 maxend=505004284 maxwin=29200 receiver end=1913287798 maxend=1913316998 maxwin=29200
14.662302 tcp_in_window: res=1 sender end=1025597635 maxend=1025635103 maxwin=354378 receiver end=1542862349 maxend=1543171341 maxwin=37504

SYNACK response and (dataless) ACK in the original direction, mixed
with an unrelated packet.

14.687411 tcp_in_window: START
14.687522 tcp_in_window:
14.687570 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287798
14.687619 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687659 tcp_in_window:
14.687699 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287798
14.687739 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687774 tcp_in_window: I=1 II=1 III=1 IV=1
14.687806 tcp_in_window: res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200

ACK in the reply direction (no data). We still have scale=7 in both
directions.

14.688706 tcp_in_window: START
14.688733 tcp_in_window:
14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688824 tcp_in_window:
14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688911 tcp_in_window: I=1 II=1 III=1 IV=1
14.688938 tcp_in_window: res=1 sender end=1913287819
[PATCH] net: change the comment of dev_mc_init
The comment of dev_mc_init() is wrong: it says dev_mc_flush instead of
dev_mc_init.

Signed-off-by: Lianwen Sun
---
 net/core/dev_addr_lists.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index e3e6a3e2ca22..d884d8f5f0e5 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -839,7 +839,7 @@ void dev_mc_flush(struct net_device *dev)
 EXPORT_SYMBOL(dev_mc_flush);
 
 /**
- * dev_mc_flush - Init multicast address list
+ * dev_mc_init - Init multicast address list
  * @dev: device
  *
  * Init multicast address list.
-- 
2.17.0
[PATCH net-next v4 3/3] cxgb4: collect hardware dump in second kernel
Register callback to collect hardware/firmware dumps in second kernel
before hardware/firmware is initialized. The dumps for each device
will be available as elf notes in /proc/vmcore in second kernel.

Signed-off-by: Rahul Lakkireddy
Signed-off-by: Ganesh Goudar
---
v4:
- No changes.

v3:
- Replaced all crashdd* with vmcoredd*.
- Replaced crashdd_add_dump() with vmcore_add_device_dump().
- Updated comments and commit message.

v2:
- No Changes.

Changes since rfc v2:
- Update comments and commit message for sysfs change.

rfc v2:
- Updated dump registration to the new API in patch 1.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  4
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  3 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 10 ++
 4 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 688f95440af2..01e7aad4ce5b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -964,6 +965,9 @@ struct adapter {
 	struct hma_data hma;
 
 	struct srq_data *srq;
+
+	/* Dump buffer for collecting logs in kdump kernel */
+	struct vmcoredd_data vmcoredd;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 143686c60234..76433d4fe483 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -488,3 +488,28 @@ void cxgb4_init_ethtool_dump(struct adapter *adapter)
 	adapter->eth_dump.version = adapter->params.fw_vers;
 	adapter->eth_dump.len = 0;
 }
+
+static int cxgb4_cudbg_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
+{
+	struct adapter *adap = container_of(data, struct adapter, vmcoredd);
+	u32 len = data->size;
+
+	return cxgb4_cudbg_collect(adap, buf, &len, CXGB4_ETH_DUMP_ALL);
+}
+
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap)
+{
+	struct vmcoredd_data *data = &adap->vmcoredd;
+	u32 len;
+
+	len = sizeof(struct cudbg_hdr) +
+	      sizeof(struct cudbg_entity_hdr) * CUDBG_MAX_ENTITY;
+	len += CUDBG_DUMP_BUFF_SIZE;
+
+	data->size = len;
+	snprintf(data->name, sizeof(data->name), "%s_%s", cxgb4_driver_name,
+		 adap->name);
+	data->vmcoredd_callback = cxgb4_cudbg_vmcoredd_collect;
+
+	return vmcore_add_device_dump(data);
+}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
index ce1ac9a1c878..ef59ba1ed968 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
@@ -41,8 +41,11 @@ enum CXGB4_ETHTOOL_DUMP_FLAGS {
 	CXGB4_ETH_DUMP_HW = (1 << 1), /* various FW and HW dumps */
 };
 
+#define CXGB4_ETH_DUMP_ALL (CXGB4_ETH_DUMP_MEM | CXGB4_ETH_DUMP_HW)
+
 u32 cxgb4_get_dump_length(struct adapter *adap, u32 flag);
 int cxgb4_cudbg_collect(struct adapter *adap, void *buf, u32 *buf_size,
 			u32 flag);
 void cxgb4_init_ethtool_dump(struct adapter *adapter);
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap);
 #endif /* __CXGB4_CUDBG_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 24d2865b8806..32cad0acf76c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5544,6 +5544,16 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto out_free_adapter;
 
+	if (is_kdump_kernel()) {
+		/* Collect hardware state and append to /proc/vmcore */
+		err = cxgb4_cudbg_vmcore_add_dump(adapter);
+		if (err) {
+			dev_warn(adapter->pdev_dev,
+				 "Fail collecting vmcore device dump, err: %d. Continuing\n",
+				 err);
+			err = 0;
+		}
+	}
 
 	if (!is_t4(adapter->params.chip)) {
 		s_qpp = (QUEUESPERPAGEPF0_S +
-- 
2.14.1
[PATCH net-next v4 2/3] vmcore: append device dumps to vmcore as elf notes
Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensures that zero-filled data
can be avoided. This also ensures that the device dumps and the other
elf notes can be properly mmaped at page aligned addresses.

Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.

Suggested-by: Eric Biederman
Signed-off-by: Rahul Lakkireddy
Signed-off-by: Ganesh Goudar
---
v4:
- No changes.

v3:
- Patch added in this version.
- Exported dumps as elf notes. Suggested by Eric Biederman.

 fs/proc/vmcore.c | 247 ++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7395462d2f86..ed1ebd85e14e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -39,6 +39,8 @@ static size_t elfcorebuf_sz_orig;
 
 static char *elfnotes_buf;
 static size_t elfnotes_sz;
+/* Size of all notes minus the device dump notes */
+static size_t elfnotes_orig_sz;
 
 /* Total size of vmcore file. */
 static u64 vmcore_size;
@@ -51,6 +53,9 @@ static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
+/* Device Dump Size */
+static size_t vmcoredd_orig_sz;
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -185,6 +190,77 @@ static int copy_to(void *target, void *src, size_t size, int userbuf)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+static int vmcoredd_copy_dumps(void *dst, u64 start, size_t size, int userbuf)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (copy_to(dst, buf, tsz, userbuf)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+
+static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
+			       u64 start, size_t size)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (remap_vmalloc_range_partial(vma, dst, buf, tsz)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
@@ -222,10 +298,41 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
 	if (*fpos < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we
[PATCH] bpf: btf: fix semicolon.cocci warnings
From: Fengguang Wu

kernel/bpf/btf.c:353:2-3: Unneeded semicolon
kernel/bpf/btf.c:280:2-3: Unneeded semicolon
kernel/bpf/btf.c:663:2-3: Unneeded semicolon

Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Fixes: b22ac5b97dd9 ("bpf: btf: Validate type reference")
CC: Martin KaFai Lau
Signed-off-by: Fengguang Wu
---
 btf.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -277,7 +277,7 @@ static bool btf_type_is_modifier(const s
 	case BTF_KIND_CONST:
 	case BTF_KIND_RESTRICT:
 		return true;
-	};
+	}
 
 	return false;
 }
@@ -350,7 +350,7 @@ static bool btf_type_has_size(const stru
 	case BTF_KIND_UNION:
 	case BTF_KIND_ENUM:
 		return true;
-	};
+	}
 
 	return false;
 }
@@ -660,7 +660,7 @@ static bool env_type_is_resolve_sink(con
 		       !btf_type_is_struct(next_type);
 	default:
 		BUG_ON(1);
-	};
+	}
 }
 
 static bool env_type_is_resolved(const struct btf_verifier_env *env,
Re: [PATCH bpf-next v3 02/10] bpf: btf: Validate type reference
Hi Martin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url: https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/BTF-BPF-Type-Format/20180417-142247
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master

coccinelle warnings: (new ones prefixed by >>)

>> kernel/bpf/btf.c:353:2-3: Unneeded semicolon
   kernel/bpf/btf.c:280:2-3: Unneeded semicolon
   kernel/bpf/btf.c:663:2-3: Unneeded semicolon

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH 03/10] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
Hello!

On 4/17/2018 1:04 AM, Michael Schmitz wrote:

> From: John Paul Adrian Glaubitz
>
> This complements the fix in 82533ad9a1c that removed the free_irq

You also need to specify the commit's summary line enclosed in ("").

> call in the error path of probe, to also not call free_irq when remove
> is called to revert the effects of probe.
>
> Signed-off-by: Michael Karcher

[...]

MBR, Sergei
[PATCH net-next] liquidio: Enhanced ethtool stats
From: Intiyaz Basha

1. Added red_drops stats: inbound packets dropped by RED, buffer
   exhaustion.
2. Included fcs_err, jabber_err, l2_err and frame_err errors under
   rx_errors.
3. Included fifo_err, dmac_drop, red_drops, fw_err_pko, fw_err_link
   and fw_err_drop under rx_dropped.
4. Included max_collision_fail, max_deferral_fail, total_collisions,
   fw_err_pko, fw_err_link, fw_err_drop and fw_err_pki under
   tx_dropped.
5. Counting dma mapping errors.
6. Added some firmware stats descriptions and removed some.

Signed-off-by: Intiyaz Basha
Acked-by: Derek Chickles
Acked-by: Satanand Burla
Signed-off-by: Felix Manlunas
---
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 54 ++--
 drivers/net/ethernet/cavium/liquidio/lio_main.c    |  2 +
 .../net/ethernet/cavium/liquidio/liquidio_common.h | 75 ++
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   |  4 +-
 4 files changed, 86 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index a63ddf0..b40e8f5 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -96,11 +96,9 @@ enum {
 	"tx_packets",
 	"rx_bytes",
 	"tx_bytes",
-	"rx_errors",	/*jabber_err+l2_err+frame_err */
-	"tx_errors",	/*fw_err_pko+fw_err_link+fw_err_drop */
-	"rx_dropped",	/*st->fromwire.total_rcvd - st->fromwire.fw_total_rcvd +
-			 *st->fromwire.dmac_drop + st->fromwire.fw_err_drop
-			 */
+	"rx_errors",
+	"tx_errors",
+	"rx_dropped",
 	"tx_dropped",
 
 	"tx_total_sent",
@@ -119,7 +117,7 @@ enum {
 	"mac_tx_total_bytes",
 	"mac_tx_mcast_pkts",
 	"mac_tx_bcast_pkts",
-	"mac_tx_ctl_packets",	/*oct->link_stats.fromhost.ctl_sent */
+	"mac_tx_ctl_packets",
 	"mac_tx_total_collisions",
 	"mac_tx_one_collision",
 	"mac_tx_multi_collison",
@@ -170,17 +168,17 @@ enum {
 	"tx_packets",
 	"rx_bytes",
 	"tx_bytes",
-	"rx_errors",	/* jabber_err + l2_err+frame_err */
-	"tx_errors",	/* fw_err_pko + fw_err_link+fw_err_drop */
-	"rx_dropped",	/* total_rcvd - fw_total_rcvd + dmac_drop + fw_err_drop */
+	"rx_errors",
+	"tx_errors",
+	"rx_dropped",
 	"tx_dropped",
 	"link_state_changes",
 };
 
 /* statistics of host tx queue */
 static const char oct_iq_stats_strings[][ETH_GSTRING_LEN] = {
-	"packets",	/*oct->instr_queue[iq_no]->stats.tx_done*/
-	"bytes",	/*oct->instr_queue[iq_no]->stats.tx_tot_bytes*/
+	"packets",
+	"bytes",
 	"dropped",
 	"iq_busy",
 	"sgentry_sent",
@@ -197,13 +195,9 @@ enum {
 
 /* statistics of host rx queue */
 static const char oct_droq_stats_strings[][ETH_GSTRING_LEN] = {
-	"packets",	/*oct->droq[oq_no]->stats.rx_pkts_received */
-	"bytes",	/*oct->droq[oq_no]->stats.rx_bytes_received */
-	"dropped",	/*oct->droq[oq_no]->stats.rx_dropped+
-			 *oct->droq[oq_no]->stats.dropped_nodispatch+
-			 *oct->droq[oq_no]->stats.dropped_toomany+
-			 *oct->droq[oq_no]->stats.dropped_nomem
-			 */
+	"packets",
+	"bytes",
+	"dropped",
 	"dropped_nomem",
 	"dropped_toomany",
 	"fw_dropped",
@@ -1068,16 +1062,33 @@ static void lio_vf_set_msglevel(struct net_device *netdev, u32 msglvl)
 	data[i++] = CVM_CAST64(netstats->rx_bytes);
 	/*sum of oct->instr_queue[iq_no]->stats.tx_tot_bytes */
 	data[i++] = CVM_CAST64(netstats->tx_bytes);
-	data[i++] = CVM_CAST64(netstats->rx_errors);
+	data[i++] = CVM_CAST64(netstats->rx_errors +
+			       oct_dev->link_stats.fromwire.fcs_err +
+			       oct_dev->link_stats.fromwire.jabber_err +
+			       oct_dev->link_stats.fromwire.l2_err +
+			       oct_dev->link_stats.fromwire.frame_err);
 	data[i++] = CVM_CAST64(netstats->tx_errors);
 	/*sum of oct->droq[oq_no]->stats->rx_dropped +
 	 *oct->droq[oq_no]->stats->dropped_nodispatch +
 	 *oct->droq[oq_no]->stats->dropped_toomany +
 	 *oct->droq[oq_no]->stats->dropped_nomem
 	 */
-	data[i++] = CVM_CAST64(netstats->rx_dropped);
+	data[i++] = CVM_CAST64(netstats->rx_dropped +
+			       oct_dev->link_stats.fromwire.fifo_err +
+			       oct_dev->link_stats.fromwire.dmac_drop +
+			       oct_dev->link_stats.fromwire.red_drops +
[PATCH bpf-next 09/10] [bpf]: make tun compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, XDP's data_end pointer can be changed as well (only "decreasing" the pointer's location will be supported). Changing this pointer changes the packet's size. For the tun driver we need to adjust XDP_PASS handling by recalculating the length of the packet before it is passed to the TCP/IP stack, in case the data_end pointer was adjusted by the XDP program.

Signed-off-by: Nikita V. Shirokov
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 28583aa0c17d..0b488a958076 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1688,6 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		return NULL;
 	case XDP_PASS:
 		delta = orig_data - xdp.data;
+		len = xdp.data_end - xdp.data;
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -1708,7 +1709,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	}

 	skb_reserve(skb, pad - delta);
-	skb_put(skb, len + delta);
+	skb_put(skb, len);
 	get_page(alloc_frag->page);
 	alloc_frag->offset += buflen;
--
2.15.1
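The tun change is pure pointer arithmetic: the skb length must be derived from the adjusted data/data_end pair rather than from the pre-XDP length plus the head delta, because the old expression never sees a moved data_end. A userspace sketch of that arithmetic (the struct and all values are hypothetical, not the driver code):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the XDP_PASS length handling in tun_build_skb(). "Head"
 * adjustments come from bpf_xdp_adjust_head(), "tail" adjustments from
 * bpf_xdp_adjust_tail(). */
struct xdp_sketch {
	unsigned char *data;     /* start of packet payload */
	unsigned char *data_end; /* one past the last payload byte */
};

/* Length the old code passed to skb_put(): original len plus the head
 * delta only -- a tail shrink is silently ignored. */
static size_t old_skb_len(size_t orig_len, ptrdiff_t delta)
{
	return orig_len + delta;
}

/* Length after the patch: recomputed directly from the pointers, so a
 * tail shrink is reflected too. */
static size_t new_skb_len(const struct xdp_sketch *xdp)
{
	return (size_t)(xdp->data_end - xdp->data);
}
```

With a 100-byte packet whose head was advanced 14 bytes (delta = -14) and whose tail was shrunk 20 bytes, the old expression yields 86 while the recomputed length is 66.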
[PATCH bpf-next 10/10] [bpf]: make virtio compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, XDP's data_end pointer can be changed as well (only "decreasing" the pointer's location will be supported). Changing this pointer changes the packet's size. For the virtio driver we need to adjust XDP_PASS handling by recalculating the length of the packet before it is passed to the TCP/IP stack.

Signed-off-by: Nikita V. Shirokov
---
 drivers/net/virtio_net.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..115d85f7360a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,6 +604,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	case XDP_PASS:
 		/* Recalculate length in case bpf program changed it */
 		delta = orig_data - xdp.data;
+		len = xdp.data_end - xdp.data;
 		break;
 	case XDP_TX:
 		sent = __virtnet_xdp_xmit(vi, );
@@ -637,7 +638,7 @@
 		goto err;
 	}
 	skb_reserve(skb, headroom - delta);
-	skb_put(skb, len + delta);
+	skb_put(skb, len);
 	if (!delta) {
 		buf += header_offset;
 		memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len);
@@ -752,6 +753,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	offset = xdp.data - page_address(xdp_page) - vi->hdr_len;

+	/* recalculate len if xdp.data or xdp.data_end were
+	 * adjusted
+	 */
+	len = xdp.data_end - xdp.data;
 	/* We can only create skb based on xdp_page. */
 	if (unlikely(xdp_page != page)) {
 		rcu_read_unlock();
--
2.15.1
net: 4.9-stable regression in drivers/net/phy/micrel.c on 4.9.94
Hi We run into a NULL pointer dereference crash when booting 4.9.94 on our Artpec-6 board with stmmac ethernet and Micrel KSZ9031 phy. I traced this to the patch d7ba3c00047d ("net: phy: micrel: Restore led_mode and clk_sel on resume") that was added in 4.9.94. This patch makes kszphy_resume() depend on the kszphy_priv object having been created and this happens only for those Micrel PHYs that have a .probe callback assigned. This is not the case for KSZ9031. This is already fixed in later kernels by bfe72442578b ("net: phy: micrel: fix crash when statistic requested for KSZ9031 phy") thas assigns a probe function for all Micrel PHYs that depend on the kszphy_priv existing. Please consider applying this to the 4.9 stable tree. Crash dump splat: Unable to handle kernel NULL pointer dereference at virtual address 0008 pgd = bd8bc000 [0008] *pgd=3d98e831, *pte=, *ppte= Internal error: Oops: 17 [#1] PREEMPT SMP ARM Modules linked in: e1000e nvmem_artpec6_efuse nvmem_core artpec6_trng(O) artpec6_lcpu(O) CPU: 0 PID: 216 Comm: netd Tainted: G O4.9.94-axis5-devel #1 Hardware name: Axis ARTPEC-6 Platform task: bf344620 task.stack: bd10c000 PC is at kszphy_config_reset+0x14/0x148 LR is at kszphy_resume+0x1c/0x5c pc : [<804ad358>]lr : [<804ad608>]psr: 600c0113 sp : bd10dd00 ip : 8dc7 fp : bf393200 r10: r9 : 0002 r8 : r7 : bf3ad000 r6 : r5 : bf086000 r4 : bf3ad400 r3 : 0001 r2 : r1 : 00040003 r0 : bf3ad400 Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 3d8bc04a DAC: 0051 Process netd (pid: 216, stack limit = 0xbd10c210) Stack: (0xbd10dd00 to 0xbd10e000) dd00: bf3ad400 bf086000 bf3ad000 804ad608 bf3ad400 bf086000 dd20: 804ab404 bf3ad400 bf086000 804b345c 7ee94b94 dd40: 804ab5a4 bf3ad400 bf086000 804b345c 80509278 bf086000 bf086000 dd60: 0002 7ee94b94 804ae4a4 0002 0001 8014bc78 801647dc dd80: beb25cc0 beb634e0 be97c87c 00c3 800f0093 80682d1c dda0: 800f0093 beb634e0 80682d1c beb25cc0 802a6388 beb88444 ddc0: beb88450 00c3 0001 0001 0001 
801647dc 0001 beb88440 dde0: 0001 801414fc 0001 805de9f4 bd10dea0 0001 beae9b8c bf086000 de00: 0001 80743d58 bf086030 804b25b0 8064fd7c 801414fc fff2 bd10de64 de20: 000d 801414fc bf0864c0 bd10de64 000d 801414fc bf086000 bf086000 de40: 0001 80743d58 bf086030 7ee94b94 bf393200 80567b00 de60: bf086188 bf086000 80567da4 bf086000 0001 1003 1002 80567dcc de80: bf086000 1002 bf086148 7ee94b94 80567e9c bd10dec8 dea0: bf39320c 7ee94b94 805de4b0 bd10df00 8914 bf086000 dec0: 0014 bf39320c 30687465 1003 dee0: 8914 be643360 7ee94b94 be643340 7ee94b94 df00: 0008 80548420 7ee94b94 be643360 beae7ee0 0008 df20: 7ee94b94 80271884 beb0a700 be4f3360 df40: 0002 0023 beb0a708 76f216c4 8025ec18 8027db08 df60: beae7ee0 8027db08 beae7ee1 beae7ee0 8914 7ee94b94 0008 df80: 802721c4 01f1bcb0 76fadcf0 0001 0036 80108984 bd10c000 dfa0: 801087c0 01f1bcb0 76fadcf0 0008 8914 7ee94b94 01f1be48 dfc0: 01f1bcb0 76fadcf0 0001 0036 7ee94b94 0008 0004cd2c dfe0: 00063d60 7ee94b74 00027344 76b10b2c 600f0010 0008 7ee727f4 [<804ad358>] (kszphy_config_reset) from [<804ad608>] (kszphy_resume+0x1c/0x5c) [<804ad608>] (kszphy_resume) from [<804ab404>] (phy_attach_direct+0xbc/0x1c4) [<804ab404>] (phy_attach_direct) from [<804ab5a4>] (phy_connect_direct+0x1c/0x54) [<804ab5a4>] (phy_connect_direct) from [<80509278>] (of_phy_connect+0x40/0x68) [<80509278>] (of_phy_connect) from [<804ae4a4>] (stmmac_init_phy+0x50/0x1ec) [<804ae4a4>] (stmmac_init_phy) from [<804b25b0>] (stmmac_open+0x70/0xc90) [<804b25b0>] (stmmac_open) from [<80567b00>] (__dev_open+0xc4/0x140) [<80567b00>] (__dev_open) from [<80567dcc>] (__dev_change_flags+0x9c/0x14c) [<80567dcc>] (__dev_change_flags) from [<80567e9c>] (dev_change_flags+0x20/0x50) [<80567e9c>] (dev_change_flags) from [<805de4b0>] (devinet_ioctl+0x6d4/0x798) [<805de4b0>] (devinet_ioctl) from [<80548420>] (sock_ioctl+0x158/0x2e4) [<80548420>] (sock_ioctl) from [<80271884>] (do_vfs_ioctl+0xa8/0x974) [<80271884>] (do_vfs_ioctl) from [<802721c4>] (SyS_ioctl+0x74/0x84) [<802721c4>] 
(SyS_ioctl) from [<801087c0>] (ret_fast_syscall+0x0/0x48) Code: e52de004 e8bd4000 e1a04000 e59061d0 (e5d63008)
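The oops above is a plain NULL dereference: kszphy_config_reset() reads phydev->priv, which is only allocated by a .probe callback that the KSZ9031 entry lacked in 4.9.94. A hedged sketch of the failure mode and the defensive shape of the upstream fix (structure and field names are simplified stand-ins, not the actual micrel.c code; the real fix in bfe72442578b works by assigning a probe function so priv always exists):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the driver structures. */
struct kszphy_priv_sketch {
	int led_mode;
	int rmii_ref_clk_sel; /* clk_sel restored on resume */
};

struct phydev_sketch {
	struct kszphy_priv_sketch *priv; /* NULL when .probe never ran */
};

/* Shape of the 4.9.94 regression: resume unconditionally touches priv,
 * so PHYs without a probe callback (e.g. KSZ9031) oops on resume. */
static int resume_buggy(struct phydev_sketch *phydev)
{
	return phydev->priv->led_mode; /* NULL deref if probe never ran */
}

/* Shape of the fixed invariant: priv is guaranteed to exist before
 * resume touches it (modelled here by bailing out when it is missing). */
static int resume_fixed(struct phydev_sketch *phydev)
{
	if (!phydev->priv)
		return 0; /* nothing to restore */
	return phydev->priv->led_mode;
}
```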
[PATCH/RFC net-next 0/5] ravb: updates
Hi Sergei, this series is composed of otherwise unrelated RAVB patches from the R-Car BSP v3.6.2 which at a first pass seem worth considering for upstream. I would value your feedback on these patches so they can either proceed into net-next or remain local to the BSP. Thanks!

Kazuya Mizuguchi (4):
  ravb: fix ptp failure after suspend and resume
  ravb: do not write 1 to reserved bits
  ravb: remove undocumented processing
  ravb: remove tx buffer addr 4byte alignment restriction for R-Car Gen3

Masaru Nagai (1):
  ravb: fix inconsistent lock state at enabling tx timestamp

 drivers/net/ethernet/renesas/ravb.h      |  23 +++-
 drivers/net/ethernet/renesas/ravb_main.c | 192 ---
 drivers/net/ethernet/renesas/ravb_ptp.c  |   2 +-
 3 files changed, 117 insertions(+), 100 deletions(-)
--
2.11.0
[PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
From: Masaru Nagai[ 58.490829] = [ 58.495205] [ INFO: inconsistent lock state ] [ 58.499583] 4.9.0-yocto-standard-7-g2ef7caf #57 Not tainted [ 58.505529] - [ 58.509904] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. [ 58.515939] swapper/0/0 [HC1[1]:SC1[1]:HE0:SE0] takes: [ 58.521099] (&(>lock)->rlock#2){?.-...}, at: [] skb_queue_tail+0x2c/0x68 {HARDIRQ-ON-W} state was registered at: [ 58.533654] [ 58.535155] [] mark_lock+0x1c4/0x718 [ 58.540318] [ 58.541814] [] __lock_acquire+0x660/0x1890 [ 58.547501] [ 58.548997] [] lock_acquire+0xd0/0x290 [ 58.554334] [ 58.555834] [] _raw_spin_lock_bh+0x50/0x90 [ 58.561520] [ 58.563018] [] first_packet_length+0x40/0x2b0 [ 58.568965] [ 58.570461] [] udp_ioctl+0x58/0x120 [ 58.575535] [ 58.577032] [] inet_ioctl+0x58/0x128 [ 58.582194] [ 58.583691] [] sock_do_ioctl+0x40/0x88 [ 58.589028] [ 58.590523] [] sock_ioctl+0x284/0x350 [ 58.595773] [ 58.597271] [] do_vfs_ioctl+0xb0/0x7c0 [ 58.602607] [ 58.604103] [] SyS_ioctl+0x94/0xa8 [ 58.609090] [ 58.610588] [] __sys_trace_return+0x0/0x4 [ 58.616187] irq event stamp: 335205 [ 58.619690] hardirqs last enabled at (335204): [] __do_softirq+0xdc/0x5c4 [ 58.628168] hardirqs last disabled at (335205): [] el1_irq+0x70/0x12c [ 58.636211] softirqs last enabled at (335202): [] _local_bh_enable+0x28/0x50 [ 58.644950] softirqs last disabled at (335203): [] irq_exit+0xd4/0x100 [ 58.653076] [ 58.653076] other info that might help us debug this: [ 58.659632] Possible unsafe locking scenario: [ 58.659632] [ 58.665577]CPU0 [ 58.668031] [ 58.670484] lock(&(>lock)->rlock#2); [ 58.674799] [ 58.677427] lock(&(>lock)->rlock#2); [ 58.681916] [ 58.681916] *** DEADLOCK *** [ 58.681916] [ 58.687863] 1 lock held by swapper/0/0: [ 58.691713] #0: (&(>lock)->rlock){-.-...}, at: [] ravb_multi_interrupt+0x28/0x98 [ 58.701456] [ 58.701456] stack backtrace: [ 58.705833] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-yocto-standard-7-g2ef7caf #57 [ 58.714396] Hardware name: Renesas Salvator-X board based on r8a7796 
(DT) [ 58.721214] Call trace: [ 58.723672] [] dump_backtrace+0x0/0x1d8 [ 58.729095] [] show_stack+0x24/0x30 [ 58.734170] [] dump_stack+0xb0/0xe8 [ 58.740285] [] print_usage_bug.part.24+0x264/0x27c [ 58.747697] [] mark_lock+0x150/0x718 [ 58.753892] [] __lock_acquire+0xc10/0x1890 [ 58.760602] [] lock_acquire+0xd0/0x290 [ 58.766956] [] _raw_spin_lock_irqsave+0x58/0x98 [ 58.774089] [] skb_queue_tail+0x2c/0x68 [ 58.780518] [] sock_queue_err_skb+0xc8/0x138 [ 58.787364] [] __skb_complete_tx_timestamp+0x8c/0xb8 [ 58.794888] [] __skb_tstamp_tx+0xd8/0x130 [ 58.801437] [] skb_tstamp_tx+0x30/0x40 [ 58.807723] [] ravb_timestamp_interrupt+0x164/0x1a8 [ 58.815144] [] ravb_multi_interrupt+0x88/0x98 [ 58.822043] [] __handle_irq_event_percpu+0x94/0x418 [ 58.829464] [] handle_irq_event_percpu+0x28/0x60 [ 58.836622] [] handle_irq_event+0x50/0x80 [ 58.843166] [] handle_fasteoi_irq+0xdc/0x1e0 [ 58.849968] [] generic_handle_irq+0x34/0x50 [ 58.856681] [] __handle_domain_irq+0x8c/0x100 [ 58.863568] [] gic_handle_irq+0x60/0xb8 [ 58.869930] Exception stack(0x80063b0f9de0 to 0x80063b0f9f10) [ 58.877348] 9de0: 80063b0f9e10 0001 80063b0f9f40 08081810 [ 58.886159] 9e00: 6145 08082f70 09194b00 00190f2c [ 58.894961] 9e20: 800632171000 000a 0003a4d0 [ 58.903767] 9e40: 0016 0023 091952f8 [ 58.912568] 9e60: 0040 34d5d91d [ 58.921363] 9e80: [ 58.930133] 9ea0: 0918 080d76e4 0052 [ 58.938897] 9ec0: 08d7 0008 0001 [ 58.947660] 9ee0: 80063a428000 09185000 0918 80063b0f9f40 [ 58.956430] 9f00: 0808180c 80063b0f9f40 [ 58.962253] [] el1_irq+0xb4/0x12c [ 58.968096] [] irq_exit+0xd4/0x100 [ 58.974025] [] __handle_domain_irq+0x90/0x100 [ 58.980916] [] gic_handle_irq+0x60/0xb8 [ 58.987281] Exception stack(0x09183d20 to 0x09183e50) [ 58.994708] 3d20: 09194b00 00190f2b 800632171000 8c6318c6318c6320 [ 59.003554] 3d40: 0003a4d0 0016 002a [ 59.012416] 3d60: 091952f8 1000 [ 59.021279] 3d80: 34d5d91d [ 59.030111] 3da0: 000d9e3b53c4 [ 59.038913] 3dc0:
[PATCH/RFC net-next 5/5] ravb: remove tx buffer addr 4byte alignment restriction for R-Car Gen3
From: Kazuya MizuguchiThis patch sets from two descriptor to one descriptor because R-Car Gen3 does not have the 4 bytes alignment restriction of the transmission buffer. Signed-off-by: Kazuya Mizuguchi Signed-off-by: Simon Horman --- drivers/net/ethernet/renesas/ravb.h | 6 +- drivers/net/ethernet/renesas/ravb_main.c | 131 +++ 2 files changed, 85 insertions(+), 52 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index fcd04dbc7dde..3d0985305c26 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -964,7 +964,10 @@ enum RAVB_QUEUE { #define RX_QUEUE_OFFSET4 #define NUM_RX_QUEUE 2 #define NUM_TX_QUEUE 2 -#define NUM_TX_DESC2 /* TX descriptors per packet */ + +/* TX descriptors per packet */ +#define NUM_TX_DESC_GEN2 2 +#define NUM_TX_DESC_GEN3 1 struct ravb_tstamp_skb { struct list_head list; @@ -1043,6 +1046,7 @@ struct ravb_private { unsigned no_avb_link:1; unsigned avb_link_active_low:1; unsigned wol_enabled:1; + int num_tx_desc;/* TX descriptors per packet */ }; static inline u32 ravb_read(struct net_device *ndev, enum ravb_reg reg) diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 88056dd912ed..f137b62d5b52 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -189,12 +189,13 @@ static int ravb_tx_free(struct net_device *ndev, int q, bool free_txed_only) int free_num = 0; int entry; u32 size; + int num_tx_desc = priv->num_tx_desc; for (; priv->cur_tx[q] - priv->dirty_tx[q] > 0; priv->dirty_tx[q]++) { bool txed; entry = priv->dirty_tx[q] % (priv->num_tx_ring[q] * -NUM_TX_DESC); +num_tx_desc); desc = >tx_ring[q][entry]; txed = desc->die_dt == DT_FEMPTY; if (free_txed_only && !txed) @@ -203,12 +204,12 @@ static int ravb_tx_free(struct net_device *ndev, int q, bool free_txed_only) dma_rmb(); size = le16_to_cpu(desc->ds_tagl) & TX_DS; /* Free the original skb. 
*/ - if (priv->tx_skb[q][entry / NUM_TX_DESC]) { + if (priv->tx_skb[q][entry / num_tx_desc]) { dma_unmap_single(ndev->dev.parent, le32_to_cpu(desc->dptr), size, DMA_TO_DEVICE); /* Last packet descriptor? */ - if (entry % NUM_TX_DESC == NUM_TX_DESC - 1) { - entry /= NUM_TX_DESC; + if (entry % num_tx_desc == num_tx_desc - 1) { + entry /= num_tx_desc; dev_kfree_skb_any(priv->tx_skb[q][entry]); priv->tx_skb[q][entry] = NULL; if (txed) @@ -229,6 +230,7 @@ static void ravb_ring_free(struct net_device *ndev, int q) struct ravb_private *priv = netdev_priv(ndev); int ring_size; int i; + int num_tx_desc = priv->num_tx_desc; if (priv->rx_ring[q]) { for (i = 0; i < priv->num_rx_ring[q]; i++) { @@ -252,7 +254,7 @@ static void ravb_ring_free(struct net_device *ndev, int q) ravb_tx_free(ndev, q, false); ring_size = sizeof(struct ravb_tx_desc) * - (priv->num_tx_ring[q] * NUM_TX_DESC + 1); + (priv->num_tx_ring[q] * num_tx_desc + 1); dma_free_coherent(ndev->dev.parent, ring_size, priv->tx_ring[q], priv->tx_desc_dma[q]); priv->tx_ring[q] = NULL; @@ -284,9 +286,10 @@ static void ravb_ring_format(struct net_device *ndev, int q) struct ravb_ex_rx_desc *rx_desc; struct ravb_tx_desc *tx_desc; struct ravb_desc *desc; + int num_tx_desc = priv->num_tx_desc; int rx_ring_size = sizeof(*rx_desc) * priv->num_rx_ring[q]; int tx_ring_size = sizeof(*tx_desc) * priv->num_tx_ring[q] * - NUM_TX_DESC; + num_tx_desc; dma_addr_t dma_addr; int i; @@ -321,8 +324,10 @@ static void ravb_ring_format(struct net_device *ndev, int q) for (i = 0, tx_desc = priv->tx_ring[q]; i < priv->num_tx_ring[q]; i++, tx_desc++) { tx_desc->die_dt = DT_EEMPTY; - tx_desc++; - tx_desc->die_dt = DT_EEMPTY; + if (num_tx_desc >= 2) { + tx_desc++; + tx_desc->die_dt = DT_EEMPTY; + }
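Most of this patch rewrites ring-index arithmetic from the constant NUM_TX_DESC to the per-chip num_tx_desc (2 on Gen2, 1 on Gen3). The two mappings ravb_tx_free() relies on can be sketched and checked in isolation (plain C harness, not driver code):

```c
#include <assert.h>

/* Descriptors consumed per packet, as in the patch. */
#define NUM_TX_DESC_GEN2 2
#define NUM_TX_DESC_GEN3 1

/* The tx_skb[] slot that a given descriptor entry belongs to,
 * mirroring "entry / num_tx_desc" in ravb_tx_free(). */
static int skb_index(int entry, int num_tx_desc)
{
	return entry / num_tx_desc;
}

/* True for the last descriptor of a packet -- only then may the skb be
 * freed, mirroring "entry % num_tx_desc == num_tx_desc - 1". With one
 * descriptor per packet (Gen3) every entry is the last one. */
static int is_last_desc(int entry, int num_tx_desc)
{
	return entry % num_tx_desc == num_tx_desc - 1;
}
```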
[PATCH/RFC net-next 4/5] ravb: remove undocumented processing
From: Kazuya Mizuguchi

Signed-off-by: Kazuya Mizuguchi
Signed-off-by: Simon Horman
---
 drivers/net/ethernet/renesas/ravb.h      |  5 -
 drivers/net/ethernet/renesas/ravb_main.c | 15 ---
 2 files changed, 20 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index 57eea4a77826..fcd04dbc7dde 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -197,15 +197,11 @@ enum ravb_reg {
 	MAHR	= 0x05c0,
 	MALR	= 0x05c8,
 	TROCR	= 0x0700,	/* Undocumented? */
-	CDCR	= 0x0708,	/* Undocumented? */
-	LCCR	= 0x0710,	/* Undocumented? */
 	CEFCR	= 0x0740,
 	FRECR	= 0x0748,
 	TSFRCR	= 0x0750,
 	TLFRCR	= 0x0758,
 	RFCR	= 0x0760,
-	CERCR	= 0x0768,	/* Undocumented? */
-	CEECR	= 0x0770,	/* Undocumented? */
 	MAFCR	= 0x0778,
 };

@@ -223,7 +219,6 @@ enum CCC_BIT {
 	CCC_CSEL_HPB	= 0x0001,
 	CCC_CSEL_ETH_TX	= 0x0002,
 	CCC_CSEL_GMII_REF = 0x0003,
-	CCC_BOC		= 0x0010,	/* Undocumented? */
 	CCC_LBME	= 0x0100,
 };

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 736ca2f76a35..88056dd912ed 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -451,12 +451,6 @@ static int ravb_dmac_init(struct net_device *ndev)
 	ravb_ring_format(ndev, RAVB_BE);
 	ravb_ring_format(ndev, RAVB_NC);

-#if defined(__LITTLE_ENDIAN)
-	ravb_modify(ndev, CCC, CCC_BOC, 0);
-#else
-	ravb_modify(ndev, CCC, CCC_BOC, CCC_BOC);
-#endif
-
 	/* Set AVB RX */
 	ravb_write(ndev, RCR_EFFS | RCR_ENCF | RCR_ETS0 | RCR_ESF | 0x1800, RCR);

@@ -1660,15 +1654,6 @@ static struct net_device_stats *ravb_get_stats(struct net_device *ndev)
 	nstats->tx_dropped += ravb_read(ndev, TROCR);
 	ravb_write(ndev, 0, TROCR);	/* (write clear) */
-	nstats->collisions += ravb_read(ndev, CDCR);
-	ravb_write(ndev, 0, CDCR);	/* (write clear) */
-	nstats->tx_carrier_errors += ravb_read(ndev, LCCR);
-	ravb_write(ndev, 0, LCCR);	/* (write clear) */
-
-	nstats->tx_carrier_errors += ravb_read(ndev, CERCR);
-	ravb_write(ndev, 0, CERCR);	/* (write clear) */
-	nstats->tx_carrier_errors += ravb_read(ndev, CEECR);
-	ravb_write(ndev, 0, CEECR);	/* (write clear) */

 	nstats->rx_packets = stats0->rx_packets + stats1->rx_packets;
 	nstats->tx_packets = stats0->tx_packets + stats1->tx_packets;
--
2.11.0
[PATCH/RFC net-next 2/5] ravb: fix ptp failure after suspend and resume
From: Kazuya Mizuguchi

This patch fixes the problem that the ptp4l command does not work after suspend and resume. Stop the PTP clock in ravb_suspend() and re-initialize it in ravb_resume(), because otherwise ptp does not work after resume.

Fixes: a0d2f20650e8 ("Renesas Ethernet AVB PTP clock driver")
Signed-off-by: Kazuya Mizuguchi
Signed-off-by: Simon Horman
---
 drivers/net/ethernet/renesas/ravb_main.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index b311b1ac1286..dbde3d11458b 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -2295,6 +2295,9 @@ static int __maybe_unused ravb_suspend(struct device *dev)
 	else
 		ret = ravb_close(ndev);

+	if (priv->chip_id != RCAR_GEN2)
+		ravb_ptp_stop(ndev);
+
 	return ret;
 }

@@ -2302,6 +2305,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
 {
 	struct net_device *ndev = dev_get_drvdata(dev);
 	struct ravb_private *priv = netdev_priv(ndev);
+	struct platform_device *pdev = priv->pdev;
 	int ret = 0;

 	/* If WoL is enabled set reset mode to rearm the WoL logic */
@@ -2330,6 +2334,9 @@ static int __maybe_unused ravb_resume(struct device *dev)
 	/* Restore descriptor base address table */
 	ravb_write(ndev, priv->desc_bat_dma, DBAT);

+	if (priv->chip_id != RCAR_GEN2)
+		ravb_ptp_init(ndev, pdev);
+
 	if (netif_running(ndev)) {
 		if (priv->wol_enabled) {
 			ret = ravb_wol_restore(ndev);
--
2.11.0
[PATCH/RFC net-next 3/5] ravb: do not write 1 to reserved bits
From: Kazuya MizuguchiThis patch corrects writing 1 to reserved bits. The write value should be 0. Signed-off-by: Kazuya Mizuguchi Signed-off-by: Simon Horman --- drivers/net/ethernet/renesas/ravb.h | 12 drivers/net/ethernet/renesas/ravb_main.c | 9 + drivers/net/ethernet/renesas/ravb_ptp.c | 2 +- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index b81f4faf7b10..57eea4a77826 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -433,6 +433,8 @@ enum EIS_BIT { EIS_QFS = 0x0001, }; +#define EIS_RESERVED_BITS (u32)(GENMASK(31, 17) | GENMASK(15, 11)) + /* RIC0 */ enum RIC0_BIT { RIC0_FRE0 = 0x0001, @@ -477,6 +479,8 @@ enum RIS0_BIT { RIS0_FRF17 = 0x0002, }; +#define RIS0_RESERVED_BITS (u32)GENMASK(31, 18) + /* RIC1 */ enum RIC1_BIT { RIC1_RFWE = 0x8000, @@ -533,6 +537,8 @@ enum RIS2_BIT { RIS2_RFFF = 0x8000, }; +#define RIS2_RESERVED_BITS (u32)GENMASK_ULL(30, 18) + /* TIC */ enum TIC_BIT { TIC_FTE0= 0x0001, /* Undocumented? */ @@ -549,6 +555,10 @@ enum TIS_BIT { TIS_TFWF= 0x0200, }; +#define TIS_RESERVED_BITS (u32)(GENMASK_ULL(31, 20) | \ + GENMASK_ULL(15, 12) | \ + GENMASK_ULL(7, 4)) + /* ISS */ enum ISS_BIT { ISS_FRS = 0x0001, /* Undocumented? 
*/ @@ -622,6 +632,8 @@ enum GIS_BIT { GIS_PTMF= 0x0004, }; +#define GIS_RESERVED_BITS (u32)GENMASK(15, 10) + /* GIE (R-Car Gen3 only) */ enum GIE_BIT { GIE_PTCS= 0x0001, diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index dbde3d11458b..736ca2f76a35 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -742,10 +742,11 @@ static void ravb_error_interrupt(struct net_device *ndev) u32 eis, ris2; eis = ravb_read(ndev, EIS); - ravb_write(ndev, ~EIS_QFS, EIS); + ravb_write(ndev, ~(EIS_QFS | EIS_RESERVED_BITS), EIS); if (eis & EIS_QFS) { ris2 = ravb_read(ndev, RIS2); - ravb_write(ndev, ~(RIS2_QFF0 | RIS2_RFFF), RIS2); + ravb_write(ndev, ~(RIS2_QFF0 | RIS2_RFFF | RIS2_RESERVED_BITS), + RIS2); /* Receive Descriptor Empty int */ if (ris2 & RIS2_QFF0) @@ -913,7 +914,7 @@ static int ravb_poll(struct napi_struct *napi, int budget) /* Processing RX Descriptor Ring */ if (ris0 & mask) { /* Clear RX interrupt */ - ravb_write(ndev, ~mask, RIS0); + ravb_write(ndev, ~(mask | RIS0_RESERVED_BITS), RIS0); if (ravb_rx(ndev, , q)) goto out; } @@ -925,7 +926,7 @@ static int ravb_poll(struct napi_struct *napi, int budget) spin_lock_irqsave(>lock, flags); /* Clear TX interrupt */ - ravb_write(ndev, ~mask, TIS); + ravb_write(ndev, ~(mask | TIS_RESERVED_BITS), TIS); ravb_tx_free(ndev, q, true); netif_wake_subqueue(ndev, q); mmiowb(); diff --git a/drivers/net/ethernet/renesas/ravb_ptp.c b/drivers/net/ethernet/renesas/ravb_ptp.c index eede70ec37f8..ba3017ca5577 100644 --- a/drivers/net/ethernet/renesas/ravb_ptp.c +++ b/drivers/net/ethernet/renesas/ravb_ptp.c @@ -319,7 +319,7 @@ void ravb_ptp_interrupt(struct net_device *ndev) } } - ravb_write(ndev, ~gis, GIS); + ravb_write(ndev, ~(gis | GIS_RESERVED_BITS), GIS); } void ravb_ptp_init(struct net_device *ndev, struct platform_device *pdev) -- 2.11.0
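These interrupt-status registers clear a bit when 0 is written to it, so the driver acknowledges "mask" by writing ~mask; the patch additionally forces the reserved bits of that write to 0 by OR-ing a reserved mask before inverting. A userspace sketch of the mask arithmetic, using a 32-bit stand-in for the kernel's GENMASK (the TIS bit ranges come from the patch; the helper names here are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(h, l): bits h..l set. */
#define GENMASK_U32(h, l) \
	(((~UINT32_C(0)) << (l)) & ((~UINT32_C(0)) >> (31 - (h))))

/* Reserved bits of TIS, per the patch: 31..20, 15..12 and 7..4. */
#define TIS_RESERVED \
	(GENMASK_U32(31, 20) | GENMASK_U32(15, 12) | GENMASK_U32(7, 4))

/* Value written to acknowledge "mask" in a clear-on-write-0 register
 * while keeping every reserved bit at 0 instead of 1. */
static uint32_t ack_write(uint32_t mask)
{
	return ~(mask | TIS_RESERVED);
}
```

For example, acknowledging TIS_TFUF | TIS_TFWF (0x300) yields 0x000F0C0F: the acknowledged bits and all reserved bits are 0, everything else stays 1.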
[PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
On production servers running variety of workloads over time, kernel panic can happen sporadically after days or even months. It is important to collect as much debug logs as possible to root cause and fix the problem, that may not be easy to reproduce. Snapshot of underlying hardware/firmware state (like register dump, firmware logs, adapter memory, etc.), at the time of kernel panic will be very helpful while debugging the culprit device driver. This series of patches add new generic framework that enable device drivers to collect device specific snapshot of the hardware/firmware state of the underlying device in the crash recovery kernel. In crash recovery kernel, the collected logs are added as elf notes to /proc/vmcore, which is copied by user space scripts for post-analysis. The sequence of actions done by device drivers to append their device specific hardware/firmware logs to /proc/vmcore are as follows: 1. During probe (before hardware is initialized), device drivers register to the vmcore module (via vmcore_add_device_dump()), with callback function, along with buffer size and log name needed for firmware/hardware log collection. 2. vmcore module allocates the buffer with requested size. It adds an elf note and invokes the device driver's registered callback function. 3. Device driver collects all hardware/firmware logs into the buffer and returns control back to vmcore module. 
The device specific hardware/firmware logs can be seen as elf notes: # readelf -n /proc/vmcore Displaying notes found at file offset 0x1000 with length 0x04003288: Owner Data size Description VMCOREDD_cxgb4_:02:00.4 0x02000fd8Unknown note type: (0x0700) VMCOREDD_cxgb4_:04:00.4 0x02000fd8Unknown note type: (0x0700) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) CORE 0x0150 NT_PRSTATUS (prstatus structure) VMCOREINFO 0x074f Unknown note type: (0x) Patch 1 adds API to vmcore module to allow drivers to register callback to collect the device specific hardware/firmware logs. The logs will be added to /proc/vmcore as elf notes. Patch 2 updates read and mmap logic to append device specific hardware/ firmware logs as elf notes. Patch 3 shows a cxgb4 driver example using the API to collect hardware/firmware logs in crash recovery kernel, before hardware is initialized. Thanks, Rahul RFC v1: https://lkml.org/lkml/2018/3/2/542 RFC v2: https://lkml.org/lkml/2018/3/16/326 --- v4: - Made __vmcore_add_device_dump() static. - Moved compile check to define vmcore_add_device_dump() to crash_dump.h to fix compilation when vmcore.c is not compiled in. - Convert ---help--- to help in Kconfig as indicated by checkpatch. - Rebased to tip. v3: - Dropped sysfs crashdd module. - Exported dumps as elf notes. Suggested by Eric Biederman. Added as patch 2 in this version. - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device dump support. - Moved logic related to adding dumps from crashdd to vmcore module. - Rename all crashdd* to vmcoredd*. - Updated comments. v2: - Added ABI Documentation for crashdd. - Directly use octal permission instead of macro. 
Changes since rfc v2: - Moved exporting crashdd from procfs to sysfs. Suggested by Stephen Hemminger - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory. - Replaced all proc API with sysfs API and updated comments. - Calling driver callback before creating the binary file under crashdd sysfs. - Changed binary dump file permission from S_IRUSR to S_IRUGO. - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP. rfc v2: - Collecting logs in 2nd kernel instead of during kernel panic. Suggested by Eric Biederman . - Added new crashdd module that exports /proc/crashdd/ containing driver's registered hardware/firmware logs in patch 1. - Replaced the API to allow drivers to register their hardware/firmware log collect routine in crash recovery kernel in patch 1. - Updated patch 2 to use the new API in patch 1. Rahul Lakkireddy (3): vmcore: add API to collect hardware dump in second kernel vmcore: append device dumps to vmcore as elf notes cxgb4: collect hardware dump in second kernel drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 4 + drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25 ++ drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h | 3 +
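One detail patch 1 insists on is that each device-dump buffer size be rounded up to a whole number of pages, so the resulting elf-note region of /proc/vmcore can be mmaped. A sketch of that rounding (a 4096-byte page is assumed here, and the helper name is made up; the real allocation happens inside the vmcore module):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u /* assumed page size for the sketch */

/* Round a driver-requested dump size up to a whole number of pages so
 * the buffer backing the elf note can be exposed via mmap. */
static size_t vmcoredd_buf_size(size_t requested)
{
	return (requested + PAGE_SIZE_SKETCH - 1) &
	       ~(size_t)(PAGE_SIZE_SKETCH - 1);
}
```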
[PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel
The sequence of actions done by device drivers to append their device specific hardware/firmware logs to /proc/vmcore are as follows: 1. During probe (before hardware is initialized), device drivers register to the vmcore module (via vmcore_add_device_dump()), with callback function, along with buffer size and log name needed for firmware/hardware log collection. 2. vmcore module allocates the buffer with requested size. It adds an Elf note and invokes the device driver's registered callback function. 3. Device driver collects all hardware/firmware logs into the buffer and returns control back to vmcore module. Ensure that the device dump buffer size is always aligned to page size so that it can be mmaped. Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more generic and reserve NT_VMCOREDD note type to indicate vmcore device dump. Suggested-by: Eric Biederman. Signed-off-by: Rahul Lakkireddy Signed-off-by: Ganesh Goudar --- v4: - Made __vmcore_add_device_dump() static. - Moved compile check to define vmcore_add_device_dump() to crash_dump.h to fix compilation when vmcore.c is not compiled in. - Convert ---help--- to help in Kconfig as indicated by checkpatch. - Rebased to tip. v3: - Dropped sysfs crashdd module. - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device dump support. - Moved logic related to adding dumps from crashdd to vmcore module. - Rename all crashdd* to vmcoredd*. v2: - Added ABI Documentation for crashdd. - Directly use octal permission instead of macro. Changes since rfc v2: - Moved exporting crashdd from procfs to sysfs. Suggested by Stephen Hemminger - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory. - Replaced all proc API with sysfs API and updated comments. - Calling driver callback before creating the binary file under crashdd sysfs. - Changed binary dump file permission from S_IRUSR to S_IRUGO. - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP. 
rfc v2: - Collecting logs in 2nd kernel instead of during kernel panic. Suggested by Eric Biederman . - Patch added in this series. fs/proc/Kconfig| 10 +++ fs/proc/vmcore.c | 152 ++--- include/linux/crash_core.h | 4 ++ include/linux/crash_dump.h | 17 + include/linux/kcore.h | 6 ++ include/uapi/linux/elf.h | 1 + 6 files changed, 181 insertions(+), 9 deletions(-) diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig index 1ade1206bb89..504c77a444bd 100644 --- a/fs/proc/Kconfig +++ b/fs/proc/Kconfig @@ -43,6 +43,16 @@ config PROC_VMCORE help Exports the dump image of crashed kernel in ELF format. +config PROC_VMCORE_DEVICE_DUMP + bool "Device Hardware/Firmware Log Collection" + depends on PROC_VMCORE + default y + help + Device drivers can collect the device specific snapshot of + their hardware or firmware before they are initialized in + crash recovery kernel. If you say Y here, the device dumps + will be added as ELF notes to /proc/vmcore + config PROC_SYSCTL bool "Sysctl support (/proc/sys)" if EXPERT depends on PROC_FS diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index a45f0af22a60..7395462d2f86 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -44,6 +45,12 @@ static u64 vmcore_size; static struct proc_dir_entry *proc_vmcore; +#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP +/* Device Dump list and mutex to synchronize access to list */ +static LIST_HEAD(vmcoredd_list); +static DEFINE_MUTEX(vmcoredd_mutex); +#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */ + /* * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error * The called function has to take care of module refcounting. 
@@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = { }; /** - * alloc_elfnotes_buf - allocate buffer for ELF note segment in - * vmalloc memory - * - * @notes_sz: size of buffer + * vmcore_alloc_buf - allocate buffer in vmalloc memory + * @sizez: size of buffer * * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap * the buffer to user-space by means of remap_vmalloc_range(). @@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = { * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is * disabled and there's no need to allow users to mmap the buffer. */ -static inline char *alloc_elfnotes_buf(size_t notes_sz) +static inline char *vmcore_alloc_buf(size_t size) { #ifdef CONFIG_MMU - return vmalloc_user(notes_sz); + return vmalloc_user(size); #else - return vzalloc(notes_sz); + return vzalloc(size); #endif } @@ -665,7 +670,7 @@
Re: [PATCH] VSOCK: make af_vsock.ko removable again
> On Apr 17, 2018, at 8:25 AM, Stefan Hajnoczi wrote:
>
> Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
> vsock_init_tables()") introduced a module_init() function without a
> corresponding module_exit() function.
>
> Modules with an init function can only be removed if they also have an
> exit function. Therefore the vsock module was considered "permanent"
> and could not be removed.
>
> This patch adds an empty module_exit() function so that "rmmod vsock"
> works. No explicit cleanup is required because:
>
> 1. Transports call vsock_core_exit() upon exit and cannot be removed
>    while sockets are still alive.
> 2. vsock_diag.ko does not perform any action that requires cleanup by
>    vsock.ko.
>
> Reported-by: Xiumei Mu
> Cc: Cong Wang
> Cc: Jorgen Hansen
> Signed-off-by: Stefan Hajnoczi
> ---
> net/vmw_vsock/af_vsock.c | 6 ++
> 1 file changed, 6 insertions(+)
>
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index aac9b8f6552e..c1076c19b858 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -2018,7 +2018,13 @@ const struct vsock_transport
> *vsock_core_get_transport(void)
> }
> EXPORT_SYMBOL_GPL(vsock_core_get_transport);
>
> +static void __exit vsock_exit(void)
> +{
> +	/* Do nothing. This function makes this module removable. */
> +}
> +
> module_init(vsock_init_tables);
> +module_exit(vsock_exit);
>
> MODULE_AUTHOR("VMware, Inc.");
> MODULE_DESCRIPTION("VMware Virtual Socket Family");
> --
> 2.14.3
>

Looks good to me.

Reviewed-by: Jorgen Hansen
[PATCH net-next] vxlan: add ttl inherit support
Like tos inherit, ttl inherit should also mean inheriting the inner protocol's ttl value, which is actually not implemented in vxlan yet. But we cannot treat ttl == 0 as "use the inner TTL", because that value is also used when the "ttl" option is not specified, and changing its meaning would be a behavior change, breaking real use cases. So add a different attribute IFLA_VXLAN_TTL_INHERIT for when "ttl inherit" is specified with the ip command. Reported-by: Jianlin Shi Suggested-by: Jiri Benc Signed-off-by: Hangbin Liu --- drivers/net/vxlan.c | 17 ++--- include/net/ip_tunnels.h | 11 +++ include/net/vxlan.h | 1 + include/uapi/linux/if_link.h | 1 + 4 files changed, 27 insertions(+), 3 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index aa5f034..209a840 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2085,9 +2085,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, local_ip = vxlan->cfg.saddr; dst_cache = &vxlan->dst_cache; md->gbp = skb->mark; - ttl = vxlan->cfg.ttl; - if (!ttl && vxlan_addr_multicast(dst)) - ttl = 1; + if (flags & VXLAN_F_TTL_INHERIT) { + ttl = ip_tunnel_get_ttl(old_iph, skb); + } else { + ttl = vxlan->cfg.ttl; + if (!ttl && vxlan_addr_multicast(dst)) + ttl = 1; + } tos = vxlan->cfg.tos; if (tos == 1) @@ -2709,6 +2713,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { [IFLA_VXLAN_GBP]= { .type = NLA_FLAG, }, [IFLA_VXLAN_GPE]= { .type = NLA_FLAG, }, [IFLA_VXLAN_REMCSUM_NOPARTIAL] = { .type = NLA_FLAG }, + [IFLA_VXLAN_TTL_INHERIT]= { .type = NLA_FLAG }, }; static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[], @@ -3254,6 +3259,12 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[], if (data[IFLA_VXLAN_TTL]) conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]); + if (data[IFLA_VXLAN_TTL_INHERIT]) { + if (changelink) + return -EOPNOTSUPP; + conf->flags |= VXLAN_F_TTL_INHERIT; + } + if (data[IFLA_VXLAN_LABEL]) conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
IPV6_FLOWLABEL_MASK; diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index cbe5add..6c3c421 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -377,6 +377,17 @@ static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph, return 0; } +static inline u8 ip_tunnel_get_ttl(const struct iphdr *iph, + const struct sk_buff *skb) +{ + if (skb->protocol == htons(ETH_P_IP)) + return iph->ttl; + else if (skb->protocol == htons(ETH_P_IPV6)) + return ((const struct ipv6hdr *)iph)->hop_limit; + else + return 0; +} + /* Propogate ECN bits out */ static inline u8 ip_tunnel_ecn_encap(u8 tos, const struct iphdr *iph, const struct sk_buff *skb) diff --git a/include/net/vxlan.h b/include/net/vxlan.h index ad73d8b..b99a02ae 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -262,6 +262,7 @@ struct vxlan_dev { #define VXLAN_F_COLLECT_METADATA 0x2000 #define VXLAN_F_GPE0x4000 #define VXLAN_F_IPV6_LINKLOCAL 0x8000 +#define VXLAN_F_TTL_INHERIT0x1 /* Flags that are used in the receive path. These flags must match in * order for a socket to be shareable diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 11d0c0e..e771a63 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -516,6 +516,7 @@ enum { IFLA_VXLAN_COLLECT_METADATA, IFLA_VXLAN_LABEL, IFLA_VXLAN_GPE, + IFLA_VXLAN_TTL_INHERIT, __IFLA_VXLAN_MAX }; #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1) -- 2.5.5
[PATCH bpf-next 03/10] [bpf]: add bpf_xdp_adjust_tail sample prog
Adding a bpf sample program which uses the bpf_xdp_adjust_tail helper by generating an ICMPv4 "packet too big" message if the ingress packet's size is bigger than 600 bytes. Signed-off-by: Nikita V. Shirokov --- samples/bpf/Makefile | 4 + samples/bpf/xdp_adjust_tail_kern.c| 151 ++ samples/bpf/xdp_adjust_tail_user.c| 141 tools/testing/selftests/bpf/bpf_helpers.h | 2 + 4 files changed, 298 insertions(+) create mode 100644 samples/bpf/xdp_adjust_tail_kern.c create mode 100644 samples/bpf/xdp_adjust_tail_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 4d6a6edd4bf6..aa8c392e2e52 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor hostprogs-y += xdp_rxq_info hostprogs-y += syscall_tp hostprogs-y += cpustat +hostprogs-y += xdp_adjust_tail # Libbpf dependencies LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o @@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o +xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o always += xdp2skb_meta_kern.o always += syscall_tp_kern.o always += cpustat_kern.o +always += xdp_adjust_tail_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf HOSTLOADLIBES_xdp_rxq_info += -lelf HOSTLOADLIBES_syscall_tp += -lelf HOSTLOADLIBES_cpustat += -lelf +HOSTLOADLIBES_xdp_adjust_tail += -lelf # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c new file mode 100644
index ..17570559fd08 --- /dev/null +++ b/samples/bpf/xdp_adjust_tail_kern.c @@ -0,0 +1,151 @@ +/* Copyright (c) 2018 Facebook + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program shows how to use bpf_xdp_adjust_tail() by + * generating ICMPv4 "packet too big" (unreachable/ df bit set frag needed + * to be more precise in case of v4) when receiving packets bigger than + * 600 bytes. + */ +#define KBUILD_MODNAME "foo" +#include +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" + +#define DEFAULT_TTL 64 +#define MAX_PCKT_SIZE 600 +#define ICMP_TOOBIG_SIZE 98 +#define ICMP_TOOBIG_PAYLOAD_SIZE 92 + +struct bpf_map_def SEC("maps") icmpcnt = { + .type = BPF_MAP_TYPE_ARRAY, + .key_size = sizeof(__u32), + .value_size = sizeof(__u64), + .max_entries = 1, +}; + +static __always_inline void count_icmp(void) +{ + u64 key = 0; + u64 *icmp_count; + + icmp_count = bpf_map_lookup_elem(&icmpcnt, &key); + if (icmp_count) + *icmp_count += 1; +} + +static __always_inline void swap_mac(void *data, struct ethhdr *orig_eth) +{ + struct ethhdr *eth; + + eth = data; + memcpy(eth->h_source, orig_eth->h_dest, ETH_ALEN); + memcpy(eth->h_dest, orig_eth->h_source, ETH_ALEN); + eth->h_proto = orig_eth->h_proto; +} + +static __always_inline __u16 csum_fold_helper(__u32 csum) +{ + return ~((csum & 0xffff) + (csum >> 16)); +} + +static __always_inline void ipv4_csum(void *data_start, int data_size, + __u32 *csum) +{ + *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum); + *csum = csum_fold_helper(*csum); +} + +static __always_inline int send_icmp4_too_big(struct xdp_md *xdp) +{ + int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr); + + if (bpf_xdp_adjust_head(xdp, 0 - headroom)) + return XDP_DROP; + void *data = (void *)(long)xdp->data; + void *data_end = (void *)(long)xdp->data_end; + + if
(data + (ICMP_TOOBIG_SIZE + headroom) > data_end) + return XDP_DROP; + + struct iphdr *iph, *orig_iph; + struct icmphdr *icmp_hdr; + struct ethhdr *orig_eth; + __u32 csum = 0; + __u64 off = 0; + + orig_eth = data + headroom; + swap_mac(data, orig_eth); + off += sizeof(struct ethhdr); + iph = data + off; + off += sizeof(struct iphdr); + icmp_hdr = data + off; + off += sizeof(struct icmphdr); + orig_iph = data + off; + icmp_hdr->type = ICMP_DEST_UNREACH; + icmp_hdr->code = ICMP_FRAG_NEEDED; + icmp_hdr->un.frag.mtu =
[PATCH bpf-next 00/10] introduction of bpf_xdp_adjust_tail
In this patch series I'm adding a new bpf helper which allows manipulating xdp's data_end pointer. Right now only "shrinking" (reducing the packet's size by moving the pointer) is supported (and I see no use case for "growing"). The main use case for such a helper is to be able to generate control (ICMP) messages from XDP context. Such messages usually contain the first N bytes of the original packet as a payload, and this is exactly what this helper allows us to do (see patch 3 for a sample program, where we generate an ICMP "packet too big" message). This helper could be useful for load balancing applications where, after additional encapsulation, the resulting packet could be bigger than the interface MTU. Aside from the new helper, this patch series contains minor changes in the device drivers which require them, so that they recalculate the packet's length not only when the head pointer was adjusted but when the tail pointer was as well. Nikita V. Shirokov (10): [bpf]: adding bpf_xdp_adjust_tail helper [bpf]: adding tests for bpf_xdp_adjust_tail [bpf]: add bpf_xdp_adjust_tail sample prog [bpf]: make generic xdp compatible w/ bpf_xdp_adjust_tail [bpf]: make mlx4 compatible w/ bpf_xdp_adjust_tail [bpf]: make bnxt compatible w/ bpf_xdp_adjust_tail [bpf]: make cavium thunder compatible w/ bpf_xdp_adjust_tail [bpf]: make netronome nfp compatible w/ bpf_xdp_adjust_tail [bpf]: make tun compatible w/ bpf_xdp_adjust_tail [bpf]: make virtio compatible w/ bpf_xdp_adjust_tail drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- .../net/ethernet/netronome/nfp/nfp_net_common.c| 2 +- drivers/net/tun.c | 3 +- drivers/net/virtio_net.c | 7 +- include/uapi/linux/bpf.h | 10 +- net/bpf/test_run.c | 3 +- net/core/dev.c | 10 +- net/core/filter.c | 29 +++- samples/bpf/Makefile | 4 + samples/bpf/xdp_adjust_tail_kern.c | 151 + samples/bpf/xdp_adjust_tail_user.c | 141 +++ tools/include/uapi/linux/bpf.h | 11 +-
tools/testing/selftests/bpf/Makefile | 2 +- tools/testing/selftests/bpf/bpf_helpers.h | 5 + tools/testing/selftests/bpf/test_adjust_tail.c | 29 tools/testing/selftests/bpf/test_progs.c | 32 + 18 files changed, 433 insertions(+), 12 deletions(-) create mode 100644 samples/bpf/xdp_adjust_tail_kern.c create mode 100644 samples/bpf/xdp_adjust_tail_user.c create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c -- 2.15.1
[PATCH bpf-next 02/10] [bpf]: adding tests for bpf_xdp_adjust_tail
Adding selftests for the bpf_xdp_adjust_tail helper. In this synthetic test we are testing that 1) if data_end < data the helper will return EINVAL and 2) for the normal use case the packet's length is reduced. Aside from adding new tests, I'm changing the behaviour of bpf_prog_test_run so that it also recalculates the packet's length if only the data_end pointer was changed. Signed-off-by: Nikita V. Shirokov --- net/bpf/test_run.c | 3 ++- tools/include/uapi/linux/bpf.h | 11 - tools/testing/selftests/bpf/Makefile | 2 +- tools/testing/selftests/bpf/bpf_helpers.h | 3 +++ tools/testing/selftests/bpf/test_adjust_tail.c | 29 +++ tools/testing/selftests/bpf/test_progs.c | 32 ++ 6 files changed, 77 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 2ced48662c1f..68c3578343b4 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -170,7 +170,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, xdp.rxq = &rxqueue->xdp_rxq; retval = bpf_test_run(prog, &xdp, repeat, &duration); - if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN) + if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN || + xdp.data_end != xdp.data + size) size = xdp.data_end - xdp.data; ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration); kfree(data); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9d07465023a2..9a2d1a04eb24 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -755,6 +755,13 @@ union bpf_attr { * @addr: pointer to struct sockaddr to bind socket to * @addr_len: length of sockaddr structure * Return: 0 on success or negative error code + * + * int bpf_xdp_adjust_tail(xdp_md, delta) + * Adjust the xdp_md.data_end by delta. Only shrinking of packet's + * size is supported.
+ * @xdp_md: pointer to xdp_md + * @delta: A negative integer to be added to xdp_md.data_end + * Return: 0 on success or negative on error */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -821,7 +828,8 @@ union bpf_attr { FN(msg_apply_bytes),\ FN(msg_cork_bytes), \ FN(msg_pull_data), \ - FN(bind), + FN(bind), \ + FN(xdp_adjust_tail), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -864,6 +872,7 @@ enum bpf_func_id { /* BPF_FUNC_skb_set_tunnel_key flags. */ #define BPF_F_ZERO_CSUM_TX (1ULL << 1) #define BPF_F_DONT_FRAGMENT(1ULL << 2) +#define BPF_F_SEQ_NUMBER (1ULL << 3) /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and * BPF_FUNC_perf_event_read_value flags. diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 0a315ddabbf4..3e819dc70bee 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -31,7 +31,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \ test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \ sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \ - sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o + sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o # Order correspond to 'make run_tests' order TEST_PROGS := test_kmod.sh \ diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h index d8223d99f96d..50c607014b22 100644 --- a/tools/testing/selftests/bpf/bpf_helpers.h +++ b/tools/testing/selftests/bpf/bpf_helpers.h @@ -96,6 +96,9 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) = (void *) BPF_FUNC_msg_pull_data; static int (*bpf_bind)(void *ctx, void *addr, int addr_len) = (void *) BPF_FUNC_bind; +static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) = + (void *) 
BPF_FUNC_xdp_adjust_tail; + /* llvm builtin functions that eBPF C program may use to * emit BPF_LD_ABS and BPF_LD_IND instructions diff --git a/tools/testing/selftests/bpf/test_adjust_tail.c b/tools/testing/selftests/bpf/test_adjust_tail.c new file mode 100644 index ..86239e792d6d --- /dev/null +++ b/tools/testing/selftests/bpf/test_adjust_tail.c @@ -0,0 +1,29 @@ +/* Copyright (c) 2016,2017 Facebook + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + */ +#include
[PATCH bpf-next 05/10] [bpf]: make mlx4 compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For the mlx4 driver we will just calculate the packet's length unconditionally (the same way as it's already being done in mlx5). Signed-off-by: Nikita V. Shirokov --- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 5c613c6663da..efc55feddc5c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -775,8 +775,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud act = bpf_prog_run_xdp(xdp_prog, &xdp); + length = xdp.data_end - xdp.data; if (xdp.data != orig_data) { - length = xdp.data_end - xdp.data; frags[0].page_offset = xdp.data - xdp.data_hard_start; va = xdp.data; -- 2.15.1
[PATCH bpf-next 04/10] [bpf]: make generic xdp compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For generic XDP we need to reflect this change in the packet's length by adjusting the skb's tail pointer. Signed-off-by: Nikita V. Shirokov --- net/core/dev.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 969462ebb296..11c789231a03 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3996,9 +3996,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, struct bpf_prog *xdp_prog) { struct netdev_rx_queue *rxqueue; + void *orig_data, *orig_data_end; u32 metalen, act = XDP_DROP; struct xdp_buff xdp; - void *orig_data; int hlen, off; u32 mac_len; @@ -4037,6 +4037,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, xdp.data_meta = xdp.data; xdp.data_end = xdp.data + hlen; xdp.data_hard_start = skb->data - skb_headroom(skb); + orig_data_end = xdp.data_end; orig_data = xdp.data; rxqueue = netif_get_rxqueue(skb); @@ -4051,6 +4052,13 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, __skb_push(skb, -off); skb->mac_header += off; + /* check if bpf_xdp_adjust_tail was used. it can only "shrink" +* pckt. +*/ + off = orig_data_end - xdp.data_end; + if (off != 0) + skb_set_tail_pointer(skb, xdp.data_end - xdp.data); + switch (act) { case XDP_REDIRECT: case XDP_TX: -- 2.15.1
Re: tcp hang when socket fills up ?
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote: > > I turned pr_debug on in tcp_in_window() for another try and it's a bit > > mangled because the information on multiple lines and the function is > > called in parallel but it looks like I do have some seq > maxend +1 > > > > Although it's weird, the maxend was set WAY earlier apparently? > > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 > > receiver end=505004284 maxend=505033596 maxwin=29200 > > then window decreased drastically e.g. previous ack just before refusal: > > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 > > end=1913292311 > > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 > > receiver end=505020155 maxend=505033596 maxwin=39296 scale=7 > > scale=0 is suspect. > > Really if conntrack does not see SYN SYNACK packets, it should not > make any window check, since windows can be scaled up to 14 :/ Or maybe set the scaling to - TCP_MAX_WSCALE (14) by default - 0 when SYN or SYNACK without window scale option is seen - value of window scale option when SYN or SYNACK with it is seen Michal Kubecek
[PATCH bpf-next 06/10] [bpf]: make bnxt compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For the bnxt driver we will just calculate the packet's length unconditionally. Signed-off-by: Nikita V. Shirokov --- drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 1389ab5e05df..1f0e872d0667 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -113,10 +113,10 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons, if (tx_avail != bp->tx_ring_size) *event &= ~BNXT_RX_EVENT; + *len = xdp.data_end - xdp.data; if (orig_data != xdp.data) { offset = xdp.data - xdp.data_hard_start; *data_ptr = xdp.data_hard_start + offset; - *len = xdp.data_end - xdp.data; } switch (act) { case XDP_PASS: -- 2.15.1
[PATCH bpf-next 08/10] [bpf]: make netronome nfp compatible w/ bpf_xdp_adjust_tail
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer could be changed as well (only a "decrease" of the pointer's location is going to be supported). Changing this pointer changes the packet's size. For the nfp driver we will just calculate the packet's length unconditionally. Signed-off-by: Nikita V. Shirokov --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 1eb6549f2a54..d9111c077699 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1722,7 +1722,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget) act = bpf_prog_run_xdp(xdp_prog, &xdp); - pkt_len -= xdp.data - orig_data; + pkt_len = xdp.data_end - xdp.data; pkt_off += xdp.data - orig_data; switch (act) { -- 2.15.1
Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
On Mon, 16 Apr 2018 23:15:50 -0700 Christoph Hellwig wrote: > On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote: > > On X86 swiotlb fallback (via get_dma_ops -> get_arch_dma_ops) to use > > x86_swiotlb_dma_ops, instead of swiotlb_dma_ops. I also included that > > in below fix patch. > > x86_swiotlb_dma_ops should not exist any more, and x86 now uses > dma_direct_ops. Looks like you are applying it to an old kernel :) > > > Performance improved to 8.9 Mpps from approx 6.5 Mpps. > > > > (This was without my bulking for net_device->ndo_xdp_xmit, so that > > number should improve more). > > What is the number for the otherwise comparable setup without retpolines? Approx 12 Mpps. You forgot to handle the dma_direct_mapping_error() case, which still used the retpoline in the above 8.9 Mpps measurement; I fixed it up and performance increased to 9.6 Mpps. Notice, in this test there are still two retpolines/indirect calls left: the net_device->ndo_xdp_xmit and the invocation of the XDP BPF prog. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH net-next 1/3] net: phy: Add binding for vendor specific C45 MDIO address space
The extra property enables the discovery on the MDIO bus of the PHYs which have a vendor specific address space for accessing the C45 MDIO registers. Signed-off-by: Vicentiu Galanopulo --- Documentation/devicetree/bindings/net/phy.txt | 6 ++ 1 file changed, 6 insertions(+) diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt index d2169a5..82692e2 100644 --- a/Documentation/devicetree/bindings/net/phy.txt +++ b/Documentation/devicetree/bindings/net/phy.txt @@ -61,6 +61,11 @@ Optional Properties: - reset-deassert-us: Delay after the reset was deasserted in microseconds. If this property is missing the delay will be skipped. +- dev-addr: If set, it indicates the device address of the PHY to be used + when accessing the C45 PHY registers over MDIO. It is used for vendor specific + register space addresses that do not conform to the standard address for the MDIO + registers (e.g. MMD30) + Example: ethernet-phy@0 { @@ -72,4 +77,5 @@ ethernet-phy@0 { reset-gpios = < 4 GPIO_ACTIVE_LOW>; reset-assert-us = <1000>; reset-deassert-us = <2000>; + dev-addr = <0x1e>; }; -- 2.7.4
[PATCH net] vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi
Syzkaller spotted an old bug which leads to reading skb beyond tail by 4 bytes on vlan tagged packets. This is caused because skb_vlan_tagged_multi() did not check skb_headlen. BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283 [inline] BUG: KMSAN: uninit-value in skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline] BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672 [inline] BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline] BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009 CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 eth_type_vlan include/linux/if_vlan.h:283 [inline] skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline] vlan_features_check include/linux/if_vlan.h:672 [inline] dflt_features_check net/core/dev.c:2949 [inline] netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009 validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084 __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549 dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590 packet_snd net/packet/af_packet.c:2944 [inline] packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] sock_write_iter+0x3b9/0x470 net/socket.c:909 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x43ffa9 RSP: 002b:7fff2cff3948 EFLAGS: 
0217 ORIG_RAX: 0014 RAX: ffda RBX: 004002c8 RCX: 0043ffa9 RDX: 0001 RSI: 2080 RDI: 0003 RBP: 006cb018 R08: R09: R10: R11: 0217 R12: 004018d0 R13: 00401960 R14: R15: Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2737 [inline] __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369 __kmalloc_reserve net/core/skbuff.c:138 [inline] __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206 alloc_skb include/linux/skbuff.h:984 [inline] alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234 sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085 packet_alloc_skb net/packet/af_packet.c:2803 [inline] packet_snd net/packet/af_packet.c:2894 [inline] packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] sock_write_iter+0x3b9/0x470 net/socket.c:909 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 58e998c6d239 ("offloading: Force software GSO for multiple vlan tags.") Reported-and-tested-by: syzbot+0bbe42c764feafa82...@syzkaller.appspotmail.com Signed-off-by: Toshiaki Makita--- include/linux/if_vlan.h | 7 +-- net/core/dev.c | 2 +- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index d11f41d..78a5a90 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -663,7 +663,7 @@ static inline bool skb_vlan_tagged(const struct sk_buff *skb) * Returns true if the 
skb is tagged with multiple vlan headers, regardless * of whether it is hardware accelerated or not. */ -static inline bool skb_vlan_tagged_multi(const struct sk_buff *skb) +static inline bool skb_vlan_tagged_multi(struct sk_buff *skb) { __be16 protocol = skb->protocol; @@ -673,6 +673,9 @@ static inline bool skb_vlan_tagged_multi(const struct sk_buff *skb) if (likely(!eth_type_vlan(protocol))) return false; + if (unlikely(!pskb_may_pull(skb, VLAN_ETH_HLEN))) + return false; + veh = (struct vlan_ethhdr *)skb->data;
Re: [PATCH net 1/2] tipc: add policy for TIPC_NLA_NET_ADDR
On 04/16/2018 11:29 PM, Eric Dumazet wrote: > Before syzbot/KMSAN bites, add the missing policy for TIPC_NLA_NET_ADDR > > Fixes: 27c21416727a ("tipc: add net set to new netlink api") > Signed-off-by: Eric Dumazet> Cc: Jon Maloy > Cc: Ying Xue Acked-by: Ying Xue > --- > net/tipc/netlink.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/tipc/netlink.c b/net/tipc/netlink.c > index > b76f13f6fea10a53d00ed14a38cdf5cdf7afa44c..d4e0bbeee72793a060befaf8a9d0239731c0d48c > 100644 > --- a/net/tipc/netlink.c > +++ b/net/tipc/netlink.c > @@ -79,7 +79,8 @@ const struct nla_policy > tipc_nl_sock_policy[TIPC_NLA_SOCK_MAX + 1] = { > > const struct nla_policy tipc_nl_net_policy[TIPC_NLA_NET_MAX + 1] = { > [TIPC_NLA_NET_UNSPEC] = { .type = NLA_UNSPEC }, > - [TIPC_NLA_NET_ID] = { .type = NLA_U32 } > + [TIPC_NLA_NET_ID] = { .type = NLA_U32 }, > + [TIPC_NLA_NET_ADDR] = { .type = NLA_U32 }, > }; > > const struct nla_policy tipc_nl_link_policy[TIPC_NLA_LINK_MAX + 1] = { >
[PATCHv2 net-next] vxlan: add ttl inherit support
Like tos inherit, ttl inherit should also mean inheriting the inner protocol's ttl value, which is actually not implemented in vxlan yet. But we cannot treat ttl == 0 as "use the inner TTL", because that value is also used when the "ttl" option is not specified, and changing its meaning would be a behavior change, breaking real use cases. So add a different attribute IFLA_VXLAN_TTL_INHERIT for when "ttl inherit" is specified. --- v2: As suggested by Stefano, clean up function ip_tunnel_get_ttl(). Suggested-by: Jiri Benc Signed-off-by: Hangbin Liu --- drivers/net/vxlan.c | 17 ++--- include/net/ip_tunnels.h | 12 include/net/vxlan.h | 1 + include/uapi/linux/if_link.h | 1 + 4 files changed, 28 insertions(+), 3 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index aa5f034..209a840 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2085,9 +2085,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, local_ip = vxlan->cfg.saddr; dst_cache = &vxlan->dst_cache; md->gbp = skb->mark; - ttl = vxlan->cfg.ttl; - if (!ttl && vxlan_addr_multicast(dst)) - ttl = 1; + if (flags & VXLAN_F_TTL_INHERIT) { + ttl = ip_tunnel_get_ttl(old_iph, skb); + } else { + ttl = vxlan->cfg.ttl; + if (!ttl && vxlan_addr_multicast(dst)) + ttl = 1; + } tos = vxlan->cfg.tos; if (tos == 1) @@ -2709,6 +2713,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { [IFLA_VXLAN_GBP]= { .type = NLA_FLAG, }, [IFLA_VXLAN_GPE]= { .type = NLA_FLAG, }, [IFLA_VXLAN_REMCSUM_NOPARTIAL] = { .type = NLA_FLAG }, + [IFLA_VXLAN_TTL_INHERIT]= { .type = NLA_FLAG }, }; static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[], @@ -3254,6 +3259,12 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[], if (data[IFLA_VXLAN_TTL]) conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]); + if (data[IFLA_VXLAN_TTL_INHERIT]) { + if (changelink) + return -EOPNOTSUPP; + conf->flags |= VXLAN_F_TTL_INHERIT; + } + if (data[IFLA_VXLAN_LABEL]) conf->label =
nla_get_be32(data[IFLA_VXLAN_LABEL]) & IPV6_FLOWLABEL_MASK; diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index cbe5add..5a8ab9f 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -377,6 +377,18 @@ static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph, return 0; } +static inline u8 ip_tunnel_get_ttl(const struct iphdr *iph, + const struct sk_buff *skb) +{ + if (skb->protocol == htons(ETH_P_IP)) + return iph->ttl; + + if (skb->protocol == htons(ETH_P_IPV6)) + return ((const struct ipv6hdr *)iph)->hop_limit; + + return 0; +} + /* Propogate ECN bits out */ static inline u8 ip_tunnel_ecn_encap(u8 tos, const struct iphdr *iph, const struct sk_buff *skb) diff --git a/include/net/vxlan.h b/include/net/vxlan.h index ad73d8b..b99a02ae 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -262,6 +262,7 @@ struct vxlan_dev { #define VXLAN_F_COLLECT_METADATA 0x2000 #define VXLAN_F_GPE0x4000 #define VXLAN_F_IPV6_LINKLOCAL 0x8000 +#define VXLAN_F_TTL_INHERIT0x1 /* Flags that are used in the receive path. These flags must match in * order for a socket to be shareable diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 11d0c0e..e771a63 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -516,6 +516,7 @@ enum { IFLA_VXLAN_COLLECT_METADATA, IFLA_VXLAN_LABEL, IFLA_VXLAN_GPE, + IFLA_VXLAN_TTL_INHERIT, __IFLA_VXLAN_MAX }; #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1) -- 2.5.5
Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
On 04/17/2018 04:08 AM, Michael Schmitz wrote:
> From: John Paul Adrian Glaubitz

This should be:

From: Michael Karcher

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH v2 6/8] net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as shareable
On 04/17/2018 04:08 AM, Michael Schmitz wrote:
> From: John Paul Adrian Glaubitz

This should be:

From: Michael Karcher

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [linux-sunxi] Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device
于 2018年4月17日 GMT+08:00 下午7:59:38, Chen-Yu Tsai写到: >On Tue, Apr 17, 2018 at 7:52 PM, Maxime Ripard > wrote: >> On Mon, Apr 16, 2018 at 10:51:55PM +0800, Chen-Yu Tsai wrote: >>> On Mon, Apr 16, 2018 at 10:31 PM, Maxime Ripard >>> wrote: >>> > On Thu, Apr 12, 2018 at 11:23:30PM +0800, Chen-Yu Tsai wrote: >>> >> On Thu, Apr 12, 2018 at 11:11 PM, Icenowy Zheng >wrote: >>> >> > 于 2018年4月12日 GMT+08:00 下午10:56:28, Maxime Ripard > 写到: >>> >> >>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote: >>> >> >>> From: Chen-Yu Tsai >>> >> >>> >>> >> >>> On the Allwinner R40 SoC, the "GMAC clock" register is in the >CCU >>> >> >>> address space; on the A64 SoC this register is in the SRAM >controller >>> >> >>> address space, and with a different offset. >>> >> >>> >>> >> >>> To access the register from another device and hide the >internal >>> >> >>> difference between the device, let it register a regmap named >>> >> >>> "emac-clock". We can then get the device from the phandle, >and >>> >> >>> retrieve the regmap with dev_get_regmap(); in this situation >the >>> >> >>> regmap_field will be set up to access the only register in >the >>> >> >>regmap. 
>>> >> >>> >>> >> >>> Signed-off-by: Chen-Yu Tsai >>> >> >>> [Icenowy: change to use regmaps with single register, change >commit >>> >> >>> message] >>> >> >>> Signed-off-by: Icenowy Zheng >>> >> >>> --- >>> >> >>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 >>> >> >>++- >>> >> >>> 1 file changed, 46 insertions(+), 2 deletions(-) >>> >> >>> >>> >> >>> diff --git >a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>> index 1037f6c78bca..b61210c0d415 100644 >>> >> >>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >>> >> >>> @@ -85,6 +85,13 @@ const struct reg_field >old_syscon_reg_field = { >>> >> >>> .msb = 31, >>> >> >>> }; >>> >> >>> >>> >> >>> +/* Specially exported regmap which contains only EMAC >register */ >>> >> >>> +const struct reg_field single_reg_field = { >>> >> >>> +.reg = 0, >>> >> >>> +.lsb = 0, >>> >> >>> +.msb = 31, >>> >> >>> +}; >>> >> >>> + >>> >> >> >>> >> >>I'm not sure this would be wise. If we ever need some other >register >>> >> >>exported through the regmap, will have to change all the >calling sites >>> >> >>everywhere in the kernel, which will be a pain and will break >>> >> >>bisectability. >>> >> > >>> >> > In this situation the register can be exported as another >>> >> > regmap. Currently the code will access a regmap with name >>> >> > "emac-clock" for this register. >>> >> > >>> >> >> >>> >> >>Chen-Yu's (or was it yours?) initial solution with a custom >writeable >>> >> >>hook only allowing a single register seemed like a better one. >>> >> > >>> >> > But I remember you mentioned that you want it to hide the >>> >> > difference inside the device. >>> >> >>> >> The idea is that a device can export multiple regmaps. 
This one, >>> >> the one named "gmac" (in my soon to come v2) or "emac-clock" >here, >>> >> is but one of many possible regmaps, and it only exports the >register >>> >> needed by the GMAC/EMAC. >>> > >>> > I'm not sure this would be wise either. There's a single register >map, >>> > and as far as I know we don't have a binding to express this in >the >>> > DT. This means that the customer and provider would have to use >the >>> > same name, but without anything actually enforcing it aside from >>> > "someone in the community knows it". >>> > >>> > This is not a really good design, and I was actually preferring >your >>> > first option. We shouldn't rely on any undocumented rule. This >will be >>> > easy to break and hard to maintain. >>> >>> So, one regmap per device covering the whole register range, and the >>> consumer knows which register to poke by looking at its own >compatible. >>> >>> That sound right? >> >> Yep. And ideally, sending a single serie for both the A64 and the R40 >> cases, in order to provide the big picture. > >OK. I'll incorporate Icenowy's stuff into my series. In this situation maybe I should send newer revision of A64 drivers to you? > >ChenYu > >___ >linux-arm-kernel mailing list >linux-arm-ker...@lists.infradead.org >http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
[PATCH] net: qrtr: add MODULE_ALIAS_NETPROTO macro
To ensure that qrtr can be loaded automatically, when needed, if it is
compiled as a module.

Signed-off-by: Nicolas Dechesne
---
 net/qrtr/qrtr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
index b33e5aeb4c06..2aa07b547b16 100644
--- a/net/qrtr/qrtr.c
+++ b/net/qrtr/qrtr.c
@@ -1135,3 +1135,4 @@ module_exit(qrtr_proto_fini);

 MODULE_DESCRIPTION("Qualcomm IPC-router driver");
 MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_NETPROTO(PF_QIPCRTR);
--
2.14.2
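For background on why this one-liner enables autoloading: `MODULE_ALIAS_NETPROTO(proto)` records a module alias of the form "net-pf-<N>" in the module, and the kernel requests exactly that string from modprobe when `socket()` is called with an unregistered protocol family. A minimal userspace sketch of the stringification involved, assuming `PF_QIPCRTR` is 42 as in the current uapi headers:

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification, as in the kernel's __stringify(): the
 * extra level lets PF_QIPCRTR expand to its numeric value first. */
#define __stringify_1(x) #x
#define __stringify(x)   __stringify_1(x)

/* Assumed value from <linux/socket.h>: AF_QIPCRTR/PF_QIPCRTR == 42. */
#define PF_QIPCRTR 42

/* Sketch of the alias string MODULE_ALIAS_NETPROTO() embeds, which
 * userspace module loading matches against "net-pf-42". */
static const char qrtr_alias[] = "net-pf-" __stringify(PF_QIPCRTR);
```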
[PATCH net] sfc: check RSS is active for filter insert
For some firmware variants - specifically 'capture packed stream' -
RSS filters are not valid. We must check if RSS is actually active
rather than merely enabled.

Fixes: 42356d9a137b ("sfc: support RSS spreading of ethtool ntuple filters")
Signed-off-by: Bert Kenward
---
 drivers/net/ethernet/sfc/ef10.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 36f24c7e553a..83ce229f4eb7 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -5264,7 +5264,7 @@ static int efx_ef10_filter_insert_addr_list(struct efx_nic *efx,
 		ids = vlan->uc;
 	}

-	filter_flags = efx_rss_enabled(efx) ? EFX_FILTER_FLAG_RX_RSS : 0;
+	filter_flags = efx_rss_active(&efx->rss_context) ? EFX_FILTER_FLAG_RX_RSS : 0;

 	/* Insert/renew filters */
 	for (i = 0; i < addr_count; i++) {
@@ -5333,7 +5333,7 @@ static int efx_ef10_filter_insert_def(struct efx_nic *efx,
 	int rc;
 	u16 *id;

-	filter_flags = efx_rss_enabled(efx) ? EFX_FILTER_FLAG_RX_RSS : 0;
+	filter_flags = efx_rss_active(&efx->rss_context) ? EFX_FILTER_FLAG_RX_RSS : 0;

 	efx_filter_init_rx(&spec, EFX_FILTER_PRI_AUTO, filter_flags, 0);
--
2.13.6
Re: [PATCH/RFC net-next 2/5] ravb: correct ptp does failure after suspend and resume
> @@ -2302,6 +2305,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
>  {
>  	struct net_device *ndev = dev_get_drvdata(dev);
>  	struct ravb_private *priv = netdev_priv(ndev);
> +	struct platform_device *pdev = priv->pdev;

Minor nit: I'd save this line...

> +	if (priv->chip_id != RCAR_GEN2)
> +		ravb_ptp_init(ndev, pdev);

... and use

	ravb_ptp_init(ndev, priv->pdev);

here. But well, maybe just bike-shedding...
Re: [PATCH bpf-next 1/3] bpftool: Add missing prog types and attach types
2018-04-16 16:57 UTC-0700 ~ Andrey Ignatov> Jakub Kicinski [Mon, 2018-04-16 16:53 -0700]: >> On Mon, 16 Apr 2018 14:41:57 -0700, Andrey Ignatov wrote: >>> diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c >>> index cae32a6..8689916 100644 >>> --- a/tools/bpf/bpftool/cgroup.c >>> +++ b/tools/bpf/bpftool/cgroup.c >>> @@ -16,15 +16,28 @@ >>> #define HELP_SPEC_ATTACH_FLAGS >>> \ >>> "ATTACH_FLAGS := { multi | override }" >>> >>> -#define HELP_SPEC_ATTACH_TYPES >>> \ >>> - "ATTACH_TYPE := { ingress | egress | sock_create | sock_ops | device }" >>> +#define HELP_SPEC_ATTACH_TYPES >>>\ >>> + " ATTACH_TYPE := { ingress | egress | sock_create |\n" \ >>> + "sock_ops | stream_parser |\n" \ >>> + "stream_verdict | device | msg_verdict |\n"\ >>> + "bind4 | bind6 | connect4 | connect6 |\n" \ >>> + "post_bind4 | post_bind6 }" >>> >> >> Would you mind updating the man page in Documentation/ as well? > > Sure. Will update and send v2. > Hi Andrey, In addition to the Documentation, there would also be the bash completion to update. The patch below should make it, please feel free to incorporate it to your changes if it seems alright to you. Otherwise I'll submit it as a follow-up. 
Quentin --- diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index 71cc5dec3685..dad9109c2800 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -374,7 +374,8 @@ _bpftool() ;; attach|detach) local ATTACH_TYPES='ingress egress sock_create sock_ops \ -device' +stream_parser stream_verdict device msg_verdict bind4 \ +bind6 connect4 connect6 post_bind4 post_bind6' local ATTACH_FLAGS='multi override' local PROG_TYPE='id pinned tag' case $prev in @@ -382,7 +383,9 @@ _bpftool() _filedir return 0 ;; -ingress|egress|sock_create|sock_ops|device) +ingress|egress|sock_create|sock_ops|stream_parser|\ +stream_verdict|device|msg_verdict|bind4|bind6|\ +connect4|connect6|post_bind4|post_bind6) COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \ "$cur" ) ) return 0
Re: [linux-sunxi] Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device
On Tue, Apr 17, 2018 at 7:52 PM, Maxime Ripardwrote: > On Mon, Apr 16, 2018 at 10:51:55PM +0800, Chen-Yu Tsai wrote: >> On Mon, Apr 16, 2018 at 10:31 PM, Maxime Ripard >> wrote: >> > On Thu, Apr 12, 2018 at 11:23:30PM +0800, Chen-Yu Tsai wrote: >> >> On Thu, Apr 12, 2018 at 11:11 PM, Icenowy Zheng wrote: >> >> > 于 2018年4月12日 GMT+08:00 下午10:56:28, Maxime Ripard >> >> > 写到: >> >> >>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote: >> >> >>> From: Chen-Yu Tsai >> >> >>> >> >> >>> On the Allwinner R40 SoC, the "GMAC clock" register is in the CCU >> >> >>> address space; on the A64 SoC this register is in the SRAM controller >> >> >>> address space, and with a different offset. >> >> >>> >> >> >>> To access the register from another device and hide the internal >> >> >>> difference between the device, let it register a regmap named >> >> >>> "emac-clock". We can then get the device from the phandle, and >> >> >>> retrieve the regmap with dev_get_regmap(); in this situation the >> >> >>> regmap_field will be set up to access the only register in the >> >> >>regmap. 
>> >> >>> >> >> >>> Signed-off-by: Chen-Yu Tsai >> >> >>> [Icenowy: change to use regmaps with single register, change commit >> >> >>> message] >> >> >>> Signed-off-by: Icenowy Zheng >> >> >>> --- >> >> >>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 >> >> >>++- >> >> >>> 1 file changed, 46 insertions(+), 2 deletions(-) >> >> >>> >> >> >>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>> index 1037f6c78bca..b61210c0d415 100644 >> >> >>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> >> >>> @@ -85,6 +85,13 @@ const struct reg_field old_syscon_reg_field = { >> >> >>> .msb = 31, >> >> >>> }; >> >> >>> >> >> >>> +/* Specially exported regmap which contains only EMAC register */ >> >> >>> +const struct reg_field single_reg_field = { >> >> >>> +.reg = 0, >> >> >>> +.lsb = 0, >> >> >>> +.msb = 31, >> >> >>> +}; >> >> >>> + >> >> >> >> >> >>I'm not sure this would be wise. If we ever need some other register >> >> >>exported through the regmap, will have to change all the calling sites >> >> >>everywhere in the kernel, which will be a pain and will break >> >> >>bisectability. >> >> > >> >> > In this situation the register can be exported as another >> >> > regmap. Currently the code will access a regmap with name >> >> > "emac-clock" for this register. >> >> > >> >> >> >> >> >>Chen-Yu's (or was it yours?) initial solution with a custom writeable >> >> >>hook only allowing a single register seemed like a better one. >> >> > >> >> > But I remember you mentioned that you want it to hide the >> >> > difference inside the device. >> >> >> >> The idea is that a device can export multiple regmaps. This one, >> >> the one named "gmac" (in my soon to come v2) or "emac-clock" here, >> >> is but one of many possible regmaps, and it only exports the register >> >> needed by the GMAC/EMAC. 
>> > >> > I'm not sure this would be wise either. There's a single register map, >> > and as far as I know we don't have a binding to express this in the >> > DT. This means that the customer and provider would have to use the >> > same name, but without anything actually enforcing it aside from >> > "someone in the community knows it". >> > >> > This is not a really good design, and I was actually preferring your >> > first option. We shouldn't rely on any undocumented rule. This will be >> > easy to break and hard to maintain. >> >> So, one regmap per device covering the whole register range, and the >> consumer knows which register to poke by looking at its own compatible. >> >> That sound right? > > Yep. And ideally, sending a single serie for both the A64 and the R40 > cases, in order to provide the big picture. OK. I'll incorporate Icenowy's stuff into my series. ChenYu
Re: tcp hang when socket fills up ?
Michal Kubecek wrote on Tue, Apr 17, 2018: > Data (21 bytes) packet in reply direction. And somewhere between the > first and second debugging print, we ended up with sender scale=0 and > that value is then preserved from now on. > > The only place between the two debug prints where we could change only > one of the td_sender values are the two calls to tcp_options() but > neither should be called now unless I missed something. I'll try to > think about it some more. Could it have something to do with the way I setup the connection? I don't think the "both remotes call connect() with carefully selected source/dest port" is a very common case.. If you look at the tcpdump outputs I attached the sequence usually is something like server > client SYN client > server SYN server > client SYNACK client > server ACK ultimately it IS a connection, but with an extra SYN packet in front of it (that first SYN opens up the conntrack of the nat so that the client's syn can come in, the client's conntrack will be that of a normal connection since its first SYN goes in directly after the server's (it didn't see the server's SYN)) Looking at my logs again, I'm seeing the same as you: This looks like the actual SYN/SYN/SYNACK/ACK: - 14.364090 seq=505004283 likely SYN coming out of server - 14.661731 seq=1913287797 on next line it says receiver end=505004284 so likely the matching SYN from client Which this time gets a proper SYNACK from server: 14.662020 seq=505004283 ack=1913287798 And following final dataless ACK: 14.687570 seq=1913287798 ack=505004284 Then as you point out some data ACK, where the scale poofs: 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 14.688824 tcp_in_window: 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819 14.62 tcp_in_window: sender 
end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7

As you say, only tcp_options() will clear only one side of the scales.
We don't have sender->td_maxwin == 0 (printed), so I see no other way
than that we are in the last else if:
- we have after(end, sender->td_end) (end=1913287819 > sender
  end=1913287798)
- I assume the tcp state machine must be confused because of the
  SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, but
  since this is a data packet it doesn't have the tcp option for scale,
  thus the scale resets.

At least peeling through the logs myself helped me follow the process.
I'll sprinkle some carefully crafted logs tomorrow to check if this is
true, and will let you figure out what is best: trying to preserve the
scale if it was set before, setting a default of 14, or something else.

Thanks!
--
Dominique Martinet | Asmadeus
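To see why a lost scale factor matters here (illustrative arithmetic, not conntrack code): the 16-bit window field in the TCP header is shifted left by the negotiated window-scale option, so win=229 with scale=7 covers about 29 KB of in-flight data, while the same field with scale=0 covers only 229 bytes. Once more data than that is outstanding, conntrack judges packets out-of-window and the connection hangs, matching the symptom in the subject.

```c
#include <assert.h>
#include <stdint.h>

/* Effective receive window: the raw 16-bit window field shifted left
 * by the negotiated window-scale option (RFC 7323). */
static uint32_t effective_window(uint16_t win, uint8_t scale)
{
	return (uint32_t)win << scale;
}
```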
Re: [PATCH net-next 2/3] net: phy: Change the array size to 32 for device_ids
On Tue, Apr 17, 2018 at 04:02:32AM -0500, Vicentiu Galanopulo wrote:
> In the context of enabling the discovery of the PHYs
> which have the C45 MDIO address space in a non-standard
> address: num_ids in get_phy_c45_ids has the
> value 8 (ARRAY_SIZE(c45_ids->device_ids)), but the
> u32 *devs can store 32 devices in the bitfield.
>
> If a device is stored in *devs, in bits 32 to 9
> (bit counting in lookup loop starts from 1), it will
> not be found.

Reviewed-by: Andrew Lunn

	Andrew
Re: [PATCH net-next 1/3] net: phy: Add binding for vendor specific C45 MDIO address space
On Tue, Apr 17, 2018 at 04:02:31AM -0500, Vicentiu Galanopulo wrote:
> The extra property enables the discovery on the MDIO bus
> of the PHYs which have a vendor specific address space
> for accessing the C45 MDIO registers.
>
> Signed-off-by: Vicentiu Galanopulo

Hi Vicentiu

I think the binding is O.K., but the implementation needs work. So

Reviewed-by: Andrew Lunn

	Andrew
Re: [PATCH 08/10] net: ax88796: Make reset more robust on AX88796B
On Tue, Apr 17, 2018 at 07:18:10AM +0200, Michael Karcher wrote:
> [Andrew, sorry for the dup. I did hit reply-to-author instead of
> reply-to-all first.]
>
> Andrew Lunn schrieb:
> >> > This should really be fixed in the PHY driver, not the MAC.
> >>
> >> OK - do you want this separate, or as part of this series? Might have
> >> a few side effects on more commonly used hardware, perhaps?
> >
> > Hi Michael
> >
> > What PHY driver is used?
> The ax88796b comes with its own integrated (buggy) PHY needing this
> workaround. This PHY has its own ID which is not known by Linux, so it is
> using the genphy driver as fallback.
>
> > In the driver you can implement a .soft_reset
> > function which first does the dummy write, and then uses
> > genphy_soft_reset() to do the actual reset.
> We could do that - but I don't see the point in creating a PHY driver
> that is only ever used by this MAC driver, just to add a single line to
> the genphy driver. If the same PHY might be used with a different MAC,
> you definitely would have a point there, though.

Hi Michael

We try to keep the core code clean, and put all workarounds for buggy
hardware in drivers specific to them. It just helps keep the core code
maintainable.

I would prefer a driver specific to this PHY with the workaround. But
let's see what Florian says.

	Andrew
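The shape of the `.soft_reset` wrapper being suggested can be sketched as below. This is a hypothetical userspace model for illustration: `phy_write()`, `genphy_soft_reset()`, and the register chosen for the dummy write are stand-ins, not the real phylib API (the real hook takes a `struct phy_device *` and does MDIO bus I/O).

```c
#include <assert.h>

/* Userspace stand-ins for the phylib calls (assumptions). */
static int last_write_reg = -1;
static int genphy_resets;

static int phy_write(int reg, int val)
{
	last_write_reg = reg;
	(void)val;
	return 0;
}

static int genphy_soft_reset(void)
{
	genphy_resets++;
	return 0;
}

#define MII_DUMMY_REG 0x1f	/* hypothetical register for the dummy write */

/* The .soft_reset hook of a hypothetical ax88796b-specific PHY driver:
 * issue the dummy write first, then delegate to the generic reset. */
static int ax88796b_soft_reset(void)
{
	int ret = phy_write(MII_DUMMY_REG, 0);

	if (ret < 0)
		return ret;
	return genphy_soft_reset();
}
```

The point of this structure is that the workaround lives entirely in a driver matched on the buggy PHY's ID, so the generic genphy path stays clean.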
[PATCH net-next] cxgb4vf: display pause settings
Add support to display pause settings.

Signed-off-by: Ganesh Goudar
---
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 9a81b523..71f13bd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -1419,6 +1419,22 @@ static int cxgb4vf_get_link_ksettings(struct net_device *dev,
 		base->duplex = DUPLEX_UNKNOWN;
 	}

+	if (pi->link_cfg.fc & PAUSE_RX) {
+		if (pi->link_cfg.fc & PAUSE_TX) {
+			ethtool_link_ksettings_add_link_mode(link_ksettings,
+							     advertising,
+							     Pause);
+		} else {
+			ethtool_link_ksettings_add_link_mode(link_ksettings,
+							     advertising,
+							     Asym_Pause);
+		}
+	} else if (pi->link_cfg.fc & PAUSE_TX) {
+		ethtool_link_ksettings_add_link_mode(link_ksettings,
+						     advertising,
+						     Asym_Pause);
+	}
+
 	base->autoneg = pi->link_cfg.autoneg;
 	if (pi->link_cfg.pcaps & FW_PORT_CAP32_ANEG)
 		ethtool_link_ksettings_add_link_mode(link_ksettings,
--
2.1.0
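A compact sketch of the decision table the patch's if/else ladder implements: symmetric pause advertises Pause, a single direction advertises Asym_Pause. The bit values below are hypothetical stand-ins for illustration; the real driver uses the ethtool link-mode helpers shown in the diff.

```c
#include <assert.h>

/* Hypothetical flag values for the sketch (the driver's PAUSE_RX/TX). */
#define PAUSE_RX 0x1
#define PAUSE_TX 0x2

/* Advertised link-mode stand-ins. */
#define ADV_PAUSE      0x1
#define ADV_ASYM_PAUSE 0x2

/* Mirrors the patch: RX+TX -> Pause, RX only or TX only -> Asym_Pause,
 * neither -> nothing advertised. */
static int pause_modes(int fc)
{
	if (fc & PAUSE_RX)
		return (fc & PAUSE_TX) ? ADV_PAUSE : ADV_ASYM_PAUSE;
	if (fc & PAUSE_TX)
		return ADV_ASYM_PAUSE;
	return 0;
}
```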
[PATCH RESEND net-next] ipv6: provide Kconfig switch to disable accept_ra by default
Many distributions and users prefer to handle router advertisements in
userspace; one example is OpenWrt, which includes a combined RA and
DHCPv6 client. For such configurations, accept_ra should not be enabled
by default.

As setting net.ipv6.conf.default.accept_ra via sysctl.conf or similar
facilities may be too late to catch all interfaces, and common
sysctl.conf tools do not allow setting an option for all existing
interfaces, this patch provides a Kconfig option to control the default
value of default.accept_ra.

Using default.accept_ra is preferable to all.accept_ra for our usecase,
as disabling all.accept_ra would preclude users from explicitly enabling
accept_ra on individual interfaces.

Signed-off-by: Matthias Schiffer
---
 net/ipv6/Kconfig    | 12 ++++++++++++
 net/ipv6/addrconf.c |  2 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 6794ddf0547c..0f453110f288 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -20,6 +20,18 @@ menuconfig IPV6

 if IPV6

+config IPV6_ACCEPT_RA_DEFAULT
+	bool "IPv6: Accept router advertisements by default"
+	default y
+	help
+	  The kernel can internally handle IPv6 router advertisements for
+	  stateless address autoconfiguration (SLAAC) and route configuration,
+	  which can be configured in detail and per-interface using a number of
+	  sysctl options. This option controls the default value of
+	  net.ipv6.conf.default.accept_ra.
+
+	  If unsure, say Y.
+
 config IPV6_ROUTER_PREF
 	bool "IPv6: Router Preference (RFC 4191) support"
 	---help---
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 62b97722722c..5fd6d2120a35 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -245,7 +245,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.forwarding		= 0,
 	.hop_limit		= IPV6_DEFAULT_HOPLIMIT,
 	.mtu6			= IPV6_MIN_MTU,
-	.accept_ra		= 1,
+	.accept_ra		= IS_ENABLED(CONFIG_IPV6_ACCEPT_RA_DEFAULT),
 	.accept_redirects	= 1,
 	.autoconf		= 1,
 	.force_mld_version	= 0,
--
2.17.0
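The `IS_ENABLED()` used in the addrconf hunk resolves at compile time to 1 or 0 depending on whether the Kconfig symbol is defined. A simplified, self-contained sketch of the preprocessor trick behind it (the same two-step expansion mechanism as the kernel's include/linux/kconfig.h, minus the module-symbol handling):

```c
#include <assert.h>

/* When a Kconfig option is =y, the build defines CONFIG_FOO as 1.
 * The trick: __ARG_PLACEHOLDER_1 expands to "0," so that a defined
 * option shifts the literal 1 into the second-argument slot, while an
 * undefined option leaves the fallback 0 there. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x)      ___is_defined(x)
#define ___is_defined(val)   ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)

#define CONFIG_IPV6_ACCEPT_RA_DEFAULT 1	/* pretend the option is =y */
/* CONFIG_DISABLED_OPTION deliberately left undefined */

static const int ra_default      = __is_defined(CONFIG_IPV6_ACCEPT_RA_DEFAULT);
static const int disabled_option = __is_defined(CONFIG_DISABLED_OPTION);
```

This is why the one-line change to `.accept_ra` is enough: the initializer becomes a plain 1 or 0 constant with no runtime cost.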
Re: [PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
On Tue, Apr 17, 2018 at 10:50:26AM +0200, Simon Horman wrote:
> From: Masaru Nagai
>
> [   58.490829] =
> [   58.495205] [ INFO: inconsistent lock state ]
> [   58.499583] 4.9.0-yocto-standard-7-g2ef7caf #57 Not tainted
> [   58.505529] -
> [   58.509904] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> [   58.515939] swapper/0/0 [HC1[1]:SC1[1]:HE0:SE0] takes:
> [   58.521099] (&(>lock)->rlock#2){?.-...}, at: [] skb_queue_tail+0x2c/0x68
> {HARDIRQ-ON-W} state was registered at:

Maybe add a short text to the log to describe the approach of this fix?
Re: [linux-sunxi] Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device
On Mon, Apr 16, 2018 at 10:51:55PM +0800, Chen-Yu Tsai wrote: > On Mon, Apr 16, 2018 at 10:31 PM, Maxime Ripard >wrote: > > On Thu, Apr 12, 2018 at 11:23:30PM +0800, Chen-Yu Tsai wrote: > >> On Thu, Apr 12, 2018 at 11:11 PM, Icenowy Zheng wrote: > >> > 于 2018年4月12日 GMT+08:00 下午10:56:28, Maxime Ripard > >> > 写到: > >> >>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote: > >> >>> From: Chen-Yu Tsai > >> >>> > >> >>> On the Allwinner R40 SoC, the "GMAC clock" register is in the CCU > >> >>> address space; on the A64 SoC this register is in the SRAM controller > >> >>> address space, and with a different offset. > >> >>> > >> >>> To access the register from another device and hide the internal > >> >>> difference between the device, let it register a regmap named > >> >>> "emac-clock". We can then get the device from the phandle, and > >> >>> retrieve the regmap with dev_get_regmap(); in this situation the > >> >>> regmap_field will be set up to access the only register in the > >> >>regmap. 
> >> >>> > >> >>> Signed-off-by: Chen-Yu Tsai > >> >>> [Icenowy: change to use regmaps with single register, change commit > >> >>> message] > >> >>> Signed-off-by: Icenowy Zheng > >> >>> --- > >> >>> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 > >> >>++- > >> >>> 1 file changed, 46 insertions(+), 2 deletions(-) > >> >>> > >> >>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>> index 1037f6c78bca..b61210c0d415 100644 > >> >>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c > >> >>> @@ -85,6 +85,13 @@ const struct reg_field old_syscon_reg_field = { > >> >>> .msb = 31, > >> >>> }; > >> >>> > >> >>> +/* Specially exported regmap which contains only EMAC register */ > >> >>> +const struct reg_field single_reg_field = { > >> >>> +.reg = 0, > >> >>> +.lsb = 0, > >> >>> +.msb = 31, > >> >>> +}; > >> >>> + > >> >> > >> >>I'm not sure this would be wise. If we ever need some other register > >> >>exported through the regmap, will have to change all the calling sites > >> >>everywhere in the kernel, which will be a pain and will break > >> >>bisectability. > >> > > >> > In this situation the register can be exported as another > >> > regmap. Currently the code will access a regmap with name > >> > "emac-clock" for this register. > >> > > >> >> > >> >>Chen-Yu's (or was it yours?) initial solution with a custom writeable > >> >>hook only allowing a single register seemed like a better one. > >> > > >> > But I remember you mentioned that you want it to hide the > >> > difference inside the device. > >> > >> The idea is that a device can export multiple regmaps. This one, > >> the one named "gmac" (in my soon to come v2) or "emac-clock" here, > >> is but one of many possible regmaps, and it only exports the register > >> needed by the GMAC/EMAC. > > > > I'm not sure this would be wise either. 
There's a single register map, > > and as far as I know we don't have a binding to express this in the > > DT. This means that the customer and provider would have to use the > > same name, but without anything actually enforcing it aside from > > "someone in the community knows it". > > > > This is not a really good design, and I was actually preferring your > > first option. We shouldn't rely on any undocumented rule. This will be > > easy to break and hard to maintain. > > So, one regmap per device covering the whole register range, and the > consumer knows which register to poke by looking at its own compatible. > > That sound right? Yep. And ideally, sending a single serie for both the A64 and the R40 cases, in order to provide the big picture. Maxime -- Maxime Ripard, Bootlin (formerly Free Electrons) Embedded Linux and Kernel engineering https://bootlin.com signature.asc Description: PGP signature
Re: [PATCH net-next 3/3] net: phy: Enable C45 PHYs with vendor specific address space
On Tue, Apr 17, 2018 at 04:02:33AM -0500, Vicentiu Galanopulo wrote:
> A search of the dev-addr property is done in of_mdiobus_register.
> If the property is found in the PHY node, of_mdiobus_register_vend_spec_phy()
> is called. This is a wrapper function for of_mdiobus_register_phy()
> which finds the device in package based on dev-addr, and fills
> devices_addrs, which is a new field added to phy_c45_device_ids.
> This new field will store the dev-addr property on the same index
> where the device in package has been found.
>
> The of_mdiobus_register_phy() now contains an extra parameter,
> which is struct phy_c45_device_ids *c45_ids.
> If c45_ids is not NULL, get_vend_spec_addr_phy_device() is called
> and c45_ids are propagated all the way to get_phy_c45_ids().
>
> Having dev-addr stored in devices_addrs, in get_phy_c45_ids(),
> when probing the identifiers, dev-addr can be extracted from
> devices_addrs and probed if devices_addrs[current_identifier] is not 0.

This still needs work. But I don't want David to see the two
Reviewed-by tags and think the series is O.K. So let's make it clear:

NACK

More comments to follow.

	Andrew
Re: [PATCH/RFC net-next 4/5] ravb: remove undocumented processing
On Tue, Apr 17, 2018 at 10:50:29AM +0200, Simon Horman wrote:
> From: Kazuya Mizuguchi
>
> Signed-off-by: Kazuya Mizuguchi
> Signed-off-by: Simon Horman
> ---
>  drivers/net/ethernet/renesas/ravb.h      |  5 -----
>  drivers/net/ethernet/renesas/ravb_main.c | 15 ---------------
>  2 files changed, 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index 57eea4a77826..fcd04dbc7dde 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -197,15 +197,11 @@ enum ravb_reg {
> 	MAHR	= 0x05c0,
> 	MALR	= 0x05c8,
> 	TROCR	= 0x0700,	/* Undocumented? */

Why not this, too? Maybe some background info from the HW team for the
commit message would be nice to have...
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote: > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote: > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote: > > > On 2018年04月13日 15:15, Tiwei Bie wrote: > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote: > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > > [...] > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned > > > > > > int head, > > > > > > + void **ctx) > > > > > > +{ > > > > > > + struct vring_packed_desc *desc; > > > > > > + unsigned int i, j; > > > > > > + > > > > > > + /* Clear data ptr. */ > > > > > > + vq->desc_state[head].data = NULL; > > > > > > + > > > > > > + i = head; > > > > > > + > > > > > > + for (j = 0; j < vq->desc_state[head].num; j++) { > > > > > > + desc = >vring_packed.desc[i]; > > > > > > + vring_unmap_one_packed(vq, desc); > > > > > > + desc->flags = 0x0; > > > > > Looks like this is unnecessary. > > > > It's safer to zero it. If we don't zero it, after we > > > > call virtqueue_detach_unused_buf_packed() which calls > > > > this function, the desc is still available to the > > > > device. > > > > > > Well detach_unused_buf_packed() should be called after device is stopped, > > > otherwise even if you try to clear, there will still be a window that > > > device > > > may use it. > > > > This is not about whether the device has been stopped or > > not. We don't have other places to re-initialize the ring > > descriptors and wrap_counter. So they need to be set to > > the correct values when doing detach_unused_buf. > > > > Best regards, > > Tiwei Bie > > find vqs is the time to do it. The .find_vqs() will call .setup_vq() which will eventually call vring_create_virtqueue(). It's a different case. Here we're talking about re-initializing the descs and updating the wrap counter when detaching the unused descs (In this case, split ring just needs to decrease vring.avail->idx). Best regards, Tiwei Bie
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote: > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote: > > On 2018年04月13日 15:15, Tiwei Bie wrote: > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote: > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > [...] > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned > > > > > int head, > > > > > + void **ctx) > > > > > +{ > > > > > + struct vring_packed_desc *desc; > > > > > + unsigned int i, j; > > > > > + > > > > > + /* Clear data ptr. */ > > > > > + vq->desc_state[head].data = NULL; > > > > > + > > > > > + i = head; > > > > > + > > > > > + for (j = 0; j < vq->desc_state[head].num; j++) { > > > > > + desc = >vring_packed.desc[i]; > > > > > + vring_unmap_one_packed(vq, desc); > > > > > + desc->flags = 0x0; > > > > Looks like this is unnecessary. > > > It's safer to zero it. If we don't zero it, after we > > > call virtqueue_detach_unused_buf_packed() which calls > > > this function, the desc is still available to the > > > device. > > > > Well detach_unused_buf_packed() should be called after device is stopped, > > otherwise even if you try to clear, there will still be a window that device > > may use it. > > This is not about whether the device has been stopped or > not. We don't have other places to re-initialize the ring > descriptors and wrap_counter. So they need to be set to > the correct values when doing detach_unused_buf. > > Best regards, > Tiwei Bie find vqs is the time to do it. -- MST
Re: tcp hang when socket fills up ?
On Tue, Apr 17, 2018 at 02:34:37PM +0200, Dominique Martinet wrote: > Michal Kubecek wrote on Tue, Apr 17, 2018: > > Data (21 bytes) packet in reply direction. And somewhere between the > > first and second debugging print, we ended up with sender scale=0 and > > that value is then preserved from now on. > > > > The only place between the two debug prints where we could change only > > one of the td_sender values are the two calls to tcp_options() but > > neither should be called now unless I missed something. I'll try to > > think about it some more. > > Could it have something to do with the way I setup the connection? > I don't think the "both remotes call connect() with carefully selected > source/dest port" is a very common case.. > > If you look at the tcpdump outputs I attached the sequence usually is > something like > server > client SYN > client > server SYN > server > client SYNACK > client > server ACK This must be what nf_conntrack_proto_tcp.c calls "simultaneous open". > ultimately it IS a connection, but with an extra SYN packet in front of > it (that first SYN opens up the conntrack of the nat so that the > client's syn can come in, the client's conntrack will be that of a > normal connection since its first SYN goes in directly after the > server's (it didn't see the server's SYN)) > > > Looking at my logs again, I'm seeing the same as you: > > This looks like the actual SYN/SYN/SYNACK/ACK: > - 14.364090 seq=505004283 likely SYN coming out of server > - 14.661731 seq=1913287797 on next line it says receiver > end=505004284 so likely the matching SYN from client > Which this time gets a proper SYNACK from server: > 14.662020 seq=505004283 ack=1913287798 > And following final dataless ACK: > 14.687570 seq=1913287798 ack=505004284 > > Then as you point out some data ACK, where the scale poofs: > 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 > end=1913287819 > 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 
maxwin=29312 > scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 > 14.688824 tcp_in_window: > 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 > end=1913287819 > 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 > scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 > > As you say, only tcp_options() will clear only one side of the scales. > We don't have sender->td_maxwin == 0 (printed) so I see no other way > than we are in the last else if: > - we have after(end, sender->td_end) (end=1913287819 > sender > end=1913287798) > - I assume the tcp state machine must be confused because of the > SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, > but since this is a data packet it doesn't have the tcp option for scale > thus scale resets. I agree that sender->td_maxwin is not zero so that the handshake above probably left the conntrack in TCP_CONNTRACK_SYN_RECV state for some reason. I'll try to go through the code with the pattern you mentioned in mind. Michal Kubecek
[net-next V10 PATCH 16/16] xdp: transition into using xdp_frame for ndo_xdp_xmit
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync. This builds towards changing the API further to become a bulk API, because xdp_buff is not a queue-able object while xdp_frame is. V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP") V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 30 ++--- drivers/net/ethernet/intel/i40e/i40e_txrx.h |2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +++- drivers/net/tun.c | 19 ++-- drivers/net/virtio_net.c | 24 include/linux/netdevice.h |4 ++- net/core/filter.c | 17 +- 7 files changed, 74 insertions(+), 46 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index c8bf4d35fdea..87fb27ab9c24 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -2203,9 +2203,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring, #define I40E_XDP_CONSUMED 1 #define I40E_XDP_TX 2 -static int i40e_xmit_xdp_ring(struct xdp_buff *xdp, +static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf, struct i40e_ring *xdp_ring); +static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, +struct i40e_ring *xdp_ring) +{ + struct xdp_frame *xdpf = convert_to_xdp_frame(xdp); + + if (unlikely(!xdpf)) + return I40E_XDP_CONSUMED; + + return i40e_xmit_xdp_ring(xdpf, xdp_ring); +} + /** * i40e_run_xdp - run an XDP program * @rx_ring: Rx ring being processed @@ -2233,7 +2244,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring, break; case XDP_TX: xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index]; - result = i40e_xmit_xdp_ring(xdp, xdp_ring); + result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring); break; case XDP_REDIRECT: err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog); @@ -3480,21 +3491,14 @@ static inline int 
i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, * @xdp: data to transmit * @xdp_ring: XDP Tx ring **/ -static int i40e_xmit_xdp_ring(struct xdp_buff *xdp, +static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf, struct i40e_ring *xdp_ring) { u16 i = xdp_ring->next_to_use; struct i40e_tx_buffer *tx_bi; struct i40e_tx_desc *tx_desc; - struct xdp_frame *xdpf; + u32 size = xdpf->len; dma_addr_t dma; - u32 size; - - xdpf = convert_to_xdp_frame(xdp); - if (unlikely(!xdpf)) - return I40E_XDP_CONSUMED; - - size = xdpf->len; if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) { xdp_ring->tx_stats.tx_busy++; @@ -3684,7 +3688,7 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev) * * Returns Zero if sent, else an error code **/ -int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) +int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf) { struct i40e_netdev_priv *np = netdev_priv(dev); unsigned int queue_index = smp_processor_id(); @@ -3697,7 +3701,7 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs) return -ENXIO; - err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]); + err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]); if (err != I40E_XDP_TX) return -ENOSPC; diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h index 857b1d743c8d..4bf318b8be85 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h @@ -511,7 +511,7 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw); void i40e_detect_recover_hung(struct i40e_vsi *vsi); int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size); bool __i40e_chk_linearize(struct sk_buff *skb); -int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp); +int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf); void i40e_xdp_flush(struct net_device *dev); /** diff 
--git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 4f2864165723..51e7d82a5860 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -2262,7 +2262,7 @@ static struct sk_buff
[net-next V10 PATCH 15/16] xdp: transition into using xdp_frame for return API
Changing the xdp_return_frame() API to take a struct xdp_frame as argument seems like a natural choice, but there are some subtle performance details here that need extra care. De-referencing the xdp_frame on a remote CPU during DMA-TX completion moves its cache-line into the "Shared" state. Later, when the page is reused for RX, this xdp_frame cache-line is written, which changes the state to "Modified". This situation already happens (naturally) for virtio_net, tun and cpumap, as the xdp_frame pointer is the queued object. In tun and cpumap, the ptr_ring is used for efficiently transferring cache-lines (with pointers) between CPUs; thus, de-referencing the xdp_frame is the only option there. Only the ixgbe driver had an optimization that let it avoid the de-reference of the xdp_frame. The driver already has a TX-ring queue, which (in case of remote DMA-TX completion) has to be transferred between CPUs anyhow. In this data area, we stored a struct xdp_mem_info and a data pointer, which allowed us to avoid de-referencing the xdp_frame. To compensate for losing this, a prefetchw is used for telling the cache coherency protocol about our access pattern. My benchmarks show that this prefetchw is enough to compensate in the ixgbe driver.
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address and offset in dma_sync call") Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/i40e/i40e_txrx.c |5 ++--- drivers/net/ethernet/intel/ixgbe/ixgbe.h|4 +--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 17 +++-- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |1 + drivers/net/tun.c |4 ++-- drivers/net/virtio_net.c|2 +- include/net/xdp.h |2 +- kernel/bpf/cpumap.c |6 +++--- net/core/xdp.c |4 +++- 9 files changed, 25 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 96c54cbfb1f9..c8bf4d35fdea 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -638,8 +638,7 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring, if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB) kfree(tx_buffer->raw_buf); else if (ring_is_xdp(ring)) - xdp_return_frame(tx_buffer->xdpf->data, -_buffer->xdpf->mem); + xdp_return_frame(tx_buffer->xdpf); else dev_kfree_skb_any(tx_buffer->skb); if (dma_unmap_len(tx_buffer, len)) @@ -842,7 +841,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi, /* free the skb/XDP data */ if (ring_is_xdp(tx_ring)) - xdp_return_frame(tx_buf->xdpf->data, _buf->xdpf->mem); + xdp_return_frame(tx_buf->xdpf); else napi_consume_skb(tx_buf->skb, napi_budget); diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index abb5248e917e..7dd5038cfcc4 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -241,8 +241,7 @@ struct ixgbe_tx_buffer { unsigned long time_stamp; union { struct sk_buff *skb; - /* XDP uses address ptr on irq_clean */ - void *data; + struct xdp_frame *xdpf; }; unsigned int bytecount; unsigned short gso_segs; @@ -250,7 +249,6 @@ struct ixgbe_tx_buffer { 
DEFINE_DMA_UNMAP_ADDR(dma); DEFINE_DMA_UNMAP_LEN(len); u32 tx_flags; - struct xdp_mem_info xdp_mem; }; struct ixgbe_rx_buffer { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index f10904ec2172..4f2864165723 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, /* free the skb */ if (ring_is_xdp(tx_ring)) - xdp_return_frame(tx_buffer->data, _buffer->xdp_mem); + xdp_return_frame(tx_buffer->xdpf); else napi_consume_skb(tx_buffer->skb, napi_budget); @@ -2386,6 +2386,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, xdp.data_hard_start = xdp.data - ixgbe_rx_offset(rx_ring);
[net-next V10 PATCH 14/16] mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver-local page cache, which uses an elevated refcnt so that an SKB put_page is "caught" instead of returning the page through the page allocator, and, at the same time, have pages returned to the page_pool from ndo_xdp_xmit DMA completion. The performance improvement for XDP_REDIRECT in this patch is really good, especially considering that (currently) the xdp_return_frame API and page_pool_put_page() do per-frame operations of both an rhashtable ID-lookup and a locked return into the (page_pool) ptr_ring. (The plan is to remove these per-frame operations in a followup patchset.) The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe, with xdp_redirect_map (using devmap). The target/maximum capability of ixgbe is 13Mpps (on this HW setup). Before this patch, XDP-redirected frames from mlx5 were returned via the page allocator. The single-flow performance was 6Mpps, and with two flows the collective performance dropped to 4Mpps, because we hit the page allocator lock (further negative scaling occurs). Two test scenarios need to be covered for the xdp_return_frame API: DMA-TX completion running on the same CPU, or cross-CPU free/return. Results were same-CPU=10Mpps and cross-CPU=12Mpps, which is very close to our 13Mpps max target. The reason the max target isn't reached in the cross-CPU test is likely RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe-to-ixgbe testing). It is also planned to remove this unnecessary DMA unmap in a later patchset. V2: Adjustments requested by Tariq - Changed page_pool_create return codes to not return NULL, only ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V5: Updated patch desc V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params") V9: - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU") - Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V10: Req from Tariq - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ Signed-off-by: Jesper Dangaard BrouerReviewed-by: Tariq Toukan Acked-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en.h |3 ++ drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 41 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 16 ++-- 3 files changed, 48 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 1a05d1072c5e..3317a4da87cb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -53,6 +53,8 @@ #include "mlx5_core.h" #include "en_stats.h" +struct page_pool; + #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v) #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN) @@ -534,6 +536,7 @@ struct mlx5e_rq { unsigned int hw_mtu; struct mlx5e_xdpsq xdpsq; DECLARE_BITMAP(flags, 8); + struct page_pool *page_pool; /* control */ struct mlx5_wq_ctrlwq_ctrl; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 2dca0933dfd3..f10037419978 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -35,6 +35,7 @@ #include #include #include +#include #include "eswitch.h" #include "en.h" #include "en_tc.h" @@ -389,10 +390,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, struct mlx5e_rq_param *rqp, struct mlx5e_rq *rq) { + struct 
page_pool_params pp_params = { 0 }; struct mlx5_core_dev *mdev = c->mdev; void *rqc = rqp->rqc; void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); - u32 byte_count; + u32 byte_count, pool_size; int npages; int wq_sz; int err; @@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE; rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params); + pool_size = 1 << params->log_rq_mtu_frames; switch (rq->wq_type) { case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ: + + pool_size = MLX5_MPWRQ_PAGES_PER_WQE << mlx5e_mpwqe_get_log_rq_size(params); rq->post_wqes = mlx5e_post_rx_mpwqes; rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe; @@ -512,13 +517,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->mkey_be = c->mkey_be; } - /* This must only be activate for order-0
[net-next V10 PATCH 12/16] page_pool: refurbish version of page_pool code
Need a fast page recycle mechanism for the ndo_xdp_xmit API, for returning pages at DMA-TX completion time, which has good cross-CPU performance, given that DMA-TX completion can happen on a remote CPU. This refurbishes my page_pool code, which was presented[1] at MM-summit 2016. The page_pool code has been adapted to not depend on the page allocator or on integration into struct page. The DMA mapping feature is kept, even though it will not be activated/used in this patchset. [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf V2: Adjustments requested by Tariq - Changed page_pool_create return codes, don't return NULL, only ERR_PTR, as this simplifies err handling in drivers. V4: many small improvements and cleanups - Add DOC comment section, that can be used by kernel-doc - Improve fallback mode, to work better with refcnt based recycling e.g. remove a WARN as pointed out by Tariq e.g. quicker fallback if ptr_ring is empty. V5: Fixed SPDX license as pointed out by Alexei V6: Adjustments requested by Eric Dumazet - Adjust cacheline_aligned_in_smp usage/placement - Move rcu_head in struct page_pool - Free pages quicker on destroy, minimize resources delayed by an RCU period - Remove code for forward/backward compat ABI interface V8: Issues found by kbuild test robot - Address sparse should be static warnings - Only compile+link when a driver uses/selects page_pool, mlx5 selects CONFIG_PAGE_POOL, although it is first used in two patches Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/mellanox/mlx5/core/Kconfig |1 include/net/page_pool.h | 129 + net/Kconfig |3 net/core/Makefile |1 net/core/page_pool.c| 317 +++ 5 files changed, 451 insertions(+) create mode 100644 include/net/page_pool.h create mode 100644 net/core/page_pool.c diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index c032319f1cb9..12257034131e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -30,6 +30,7 @@ config MLX5_CORE_EN bool "Mellanox Technologies ConnectX-4 Ethernet support" depends on NETDEVICES && ETHERNET && INET && PCI && MLX5_CORE depends on IPV6=y || IPV6=n || MLX5_CORE=m + select PAGE_POOL default n ---help--- Ethernet support in Mellanox Technologies ConnectX-4 NIC. diff --git a/include/net/page_pool.h b/include/net/page_pool.h new file mode 100644 index ..1fe77db59518 --- /dev/null +++ b/include/net/page_pool.h @@ -0,0 +1,129 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * page_pool.h + * Author: Jesper Dangaard Brouer + * Copyright (C) 2016 Red Hat, Inc. + */ + +/** + * DOC: page_pool allocator + * + * This page_pool allocator is optimized for the XDP mode that + * uses one-frame-per-page, but have fallbacks that act like the + * regular page allocator APIs. + * + * Basic use involve replacing alloc_pages() calls with the + * page_pool_alloc_pages() call. Drivers should likely use + * page_pool_dev_alloc_pages() replacing dev_alloc_pages(). + * + * If page_pool handles DMA mapping (use page->private), then API user + * is responsible for invoking page_pool_put_page() once. In-case of + * elevated refcnt, the DMA state is released, assuming other users of + * the page will eventually call put_page(). + * + * If no DMA mapping is done, then it can act as shim-layer that + * fall-through to alloc_page. As no state is kept on the page, the + * regular put_page() call is sufficient. + */ +#ifndef _NET_PAGE_POOL_H +#define _NET_PAGE_POOL_H + +#include /* Needed by ptr_ring */ +#include +#include + +#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */ +#define PP_FLAG_ALLPP_FLAG_DMA_MAP + +/* + * Fast allocation side cache array/stack + * + * The cache size and refill watermark is related to the network + * use-case. The NAPI budget is 64 packets. 
After a NAPI poll the RX + * ring is usually refilled and the max consumed elements will be 64, + * thus a natural max size of objects needed in the cache. + * + * Keeping room for more objects, is due to XDP_DROP use-case. As + * XDP_DROP allows the opportunity to recycle objects directly into + * this array, as it shares the same softirq/NAPI protection. If + * cache is already full (or partly full) then the XDP_DROP recycles + * would have to take a slower code path. + */ +#define PP_ALLOC_CACHE_SIZE128 +#define PP_ALLOC_CACHE_REFILL 64 +struct pp_alloc_cache { + u32 count; + void *cache[PP_ALLOC_CACHE_SIZE]; +}; + +struct page_pool_params { + unsigned intflags; + unsigned intorder; + unsigned intpool_size; + int nid;
[net-next V10 PATCH 13/16] xdp: allow page_pool as an allocator type in xdp_return_frame
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage. The registered allocator page_pool pointer is not available directly from xdp_rxq_info, but it could be (if needed). For now, the driver should keep separate track of the page_pool pointer, which it should use for RX-ring page allocation. As suggested by Saeed, to maintain a symmetric API it is the drivers responsibility to allocate/create and free/destroy the page_pool. Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers responsibility to free the page_pool, but with a RCU free call. This is done easily via the page_pool helper page_pool_destroy() (which avoids touching any driver code during the RCU callback, which could happen after the driver have been unloaded). V8: address issues found by kbuild test robot - Address sparse should be static warnings - Allow xdp.o to be compiled without page_pool.o V9: Remove inline from .c file, compiler knows best Signed-off-by: Jesper Dangaard Brouer--- include/net/page_pool.h | 14 +++ include/net/xdp.h |3 ++ net/core/xdp.c | 60 ++- 3 files changed, 65 insertions(+), 12 deletions(-) diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 1fe77db59518..c79087153148 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -117,7 +117,12 @@ void __page_pool_put_page(struct page_pool *pool, static inline void page_pool_put_page(struct page_pool *pool, struct page *page) { + /* When page_pool isn't compiled-in, net/core/xdp.c doesn't +* allow registering MEM_TYPE_PAGE_POOL, but shield linker. 
+*/ +#ifdef CONFIG_PAGE_POOL __page_pool_put_page(pool, page, false); +#endif } /* Very limited use-cases allow recycle direct */ static inline void page_pool_recycle_direct(struct page_pool *pool, @@ -126,4 +131,13 @@ static inline void page_pool_recycle_direct(struct page_pool *pool, __page_pool_put_page(pool, page, true); } +static inline bool is_page_pool_compiled_in(void) +{ +#ifdef CONFIG_PAGE_POOL + return true; +#else + return false; +#endif +} + #endif /* _NET_PAGE_POOL_H */ diff --git a/include/net/xdp.h b/include/net/xdp.h index 5f67c62540aa..d0ee437753dc 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -36,6 +36,7 @@ enum xdp_mem_type { MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + MEM_TYPE_PAGE_POOL, MEM_TYPE_MAX, }; @@ -44,6 +45,8 @@ struct xdp_mem_info { u32 id; }; +struct page_pool; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; diff --git a/net/core/xdp.c b/net/core/xdp.c index 8b2cb79b5de0..33e382afbd95 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -8,6 +8,7 @@ #include #include #include +#include #include @@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht; struct xdp_mem_allocator { struct xdp_mem_info mem; - void *allocator; + union { + void *allocator; + struct page_pool *page_pool; + }; struct rhash_head node; struct rcu_head rcu; }; @@ -74,7 +78,9 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu) /* Allow this ID to be reused */ ida_simple_remove(_id_pool, xa->mem.id); - /* TODO: Depending on allocator type/pointer free resources */ + /* Notice, driver is expected to free the *allocator, +* e.g. page_pool, and MUST also use RCU free. 
+*/ /* Poison memory */ xa->mem.id = 0x; @@ -225,6 +231,17 @@ static int __mem_id_cyclic_get(gfp_t gfp) return id; } +static bool __is_supported_mem_type(enum xdp_mem_type type) +{ + if (type == MEM_TYPE_PAGE_POOL) + return is_page_pool_compiled_in(); + + if (type >= MEM_TYPE_MAX) + return false; + + return true; +} + int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, enum xdp_mem_type type, void *allocator) { @@ -238,13 +255,16 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, return -EFAULT; } - if (type >= MEM_TYPE_MAX) - return -EINVAL; + if (!__is_supported_mem_type(type)) + return -EOPNOTSUPP; xdp_rxq->mem.type = type; - if (!allocator) + if (!allocator) { + if (type == MEM_TYPE_PAGE_POOL) + return -EINVAL; /* Setup time check page_pool req */ return 0; + } /* Delay init of rhashtable to save memory if feature isn't used */ if (!mem_id_init) { @@ -290,15 +310,31 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model); void xdp_return_frame(void *data, struct xdp_mem_info *mem) { - if (mem->type == MEM_TYPE_PAGE_SHARED) { + struct
[PATCH net-next] ipv6: send netlink notifications for manually configured addresses
Send a netlink notification when userspace adds a manually configured address, if DAD is enabled and the optimistic flag isn't set. Moreover, send RTM_DELADDR notifications for tentative addresses. Some userspace applications (e.g. NetworkManager) are interested in addr netlink events even while the address is still in the tentative state; however, events are not sent if the DAD process has not completed. If the address is added and immediately removed, userspace listeners are not notified. This behaviour can be easily reproduced by using veth interfaces: $ ip -b - <<EOF > link add dev vm1 type veth peer name vm2 > link set dev vm1 up > link set dev vm2 up > addr add 2001:db8:a:b:1:2:3:4/64 dev vm1 > addr del 2001:db8:a:b:1:2:3:4/64 dev vm1 EOF This patch reverts the behaviour introduced by the commit f784ad3d79e5 ("ipv6: do not send RTM_DELADDR for tentative addresses") Suggested-by: Thomas HallerSigned-off-by: Lorenzo Bianconi --- net/ipv6/addrconf.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 62b97722722c..b2c0175125db 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2901,6 +2901,11 @@ static int inet6_addr_add(struct net *net, int ifindex, expires, flags); } + /* Send a netlink notification if DAD is enabled and +* optimistic flag is not set +*/ + if (!(ifp->flags & (IFA_F_OPTIMISTIC | IFA_F_NODAD))) + ipv6_ifa_notify(0, ifp); /* * Note that section 3.1 of RFC 4429 indicates * that the Optimistic flag should not be set for @@ -5028,14 +5033,6 @@ static void inet6_ifa_notify(int event, struct inet6_ifaddr *ifa) struct net *net = dev_net(ifa->idev->dev); int err = -ENOBUFS; - /* Don't send DELADDR notification for TENTATIVE address, -* since NEWADDR notification is sent only after removing -* TENTATIVE flag, if DAD has not failed.
-*/ - if (ifa->flags & IFA_F_TENTATIVE && !(ifa->flags & IFA_F_DADFAILED) && - event == RTM_DELADDR) - return; - skb = nlmsg_new(inet6_ifaddr_msgsize(), GFP_ATOMIC); if (!skb) goto errout; -- 2.14.3
Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx
On 16/04/2018 20:22, Rob Herring wrote: > On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote: >> The Microchip LAN78XX family of devices are Ethernet controllers with >> a USB interface. Despite being discoverable devices it can be useful to >> be able to configure them from Device Tree, particularly in low-cost >> applications without an EEPROM or programmed OTP. >> >> Document the supported properties in a bindings file, adding it to >> MAINTAINERS at the same time. >> >> Signed-off-by: Phil Elwell>> --- >> .../devicetree/bindings/net/microchip,lan78xx.txt | 44 >> ++ >> MAINTAINERS| 1 + >> 2 files changed, 45 insertions(+) >> create mode 100644 >> Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> >> diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> new file mode 100644 >> index 000..e7d7850 >> --- /dev/null >> +++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> @@ -0,0 +1,44 @@ >> +Microchip LAN78xx Gigabit Ethernet controller >> + >> +The LAN78XX devices are usually configured by programming their OTP or with >> +an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have >> neither. >> + >> +Please refer to ethernet.txt for a description of common Ethernet bindings. >> + >> +Optional properties: >> +- microchip,eee-enabled: if present, enable Energy Efficient Ethernet >> support; > > I see we have some flags for broken EEE, but nothing already defined to > enable EEE. Seems like this should either be a user option (therefore > not in DT) or we should use the broken EEE properties if this is h/w > dependent. In the downstream Raspberry Pi kernel we use DT as a way of passing user settings to drivers - it's more powerful than the command line. I understand that this is not the done thing here so I'm withdrawing this element of the patch series. Apologies for the noise. 
>> +- microchip,led-modes: a two-element vector, with each element configuring >> + the operating mode of an LED. The values supported by the device are; >> + 0: Link/Activity >> + 1: Link1000/Activity >> + 2: Link100/Activity >> + 3: Link10/Activity >> + 4: Link100/1000/Activity >> + 5: Link10/1000/Activity >> + 6: Link10/100/Activity >> + 7: RESERVED >> + 8: Duplex/Collision >> + 9: Collision >> + 10: Activity >> + 11: RESERVED >> + 12: Auto-negotiation Fault >> + 13: RESERVED >> + 14: Off >> + 15: On >> +- microchip,tx-lpi-timer: the delay (in microseconds) between the TX fifo >> + becoming empty and invoking Low Power Idles (default 600). > > Needs a unit suffix as defined in property-units.txt. > >> + >> +Example: >> + >> +/* Standard configuration for a Raspberry Pi 3 B+ */ >> +ethernet: usbether@1 { >> +compatible = "usb424,7800"; >> +reg = <1>; >> +microchip,eee-enabled; >> +microchip,tx-lpi-timer = <600>; >> +/* >> + * led0 = 1:link1000/activity >> + * led1 = 6:link10/100/activity >> + */ >> +microchip,led-modes = <1 6>; >> +}; >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 2328eed..b637aad 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -14482,6 +14482,7 @@ M: Microchip Linux Driver Support >> >> L: netdev@vger.kernel.org >> S: Maintained >> F: drivers/net/usb/lan78xx.* >> +F: Documentation/devicetree/bindings/net/microchip,lan78xx.txt >> >> USB MASS STORAGE DRIVER >> M: Alan Stern >> -- >> 2.7.4 >>
Re: [PATCH net 2/2] tipc: fix possible crash in __tipc_nl_net_set()
On 04/16/2018 11:29 PM, Eric Dumazet wrote: > syzbot reported a crash in __tipc_nl_net_set() caused by NULL dereference. > > We need to check that both TIPC_NLA_NET_NODEID and TIPC_NLA_NET_NODEID_W1 > are present. > > We also need to make sure userland provided u64 attributes. > > Fixes: d50ccc2d3909 ("tipc: add 128-bit node identifier") > Signed-off-by: Eric Dumazet> Cc: Jon Maloy > Cc: Ying Xue > Reported-by: syzbot Acked-by: Ying Xue > --- > net/tipc/net.c | 2 ++ > net/tipc/netlink.c | 2 ++ > 2 files changed, 4 insertions(+) > > diff --git a/net/tipc/net.c b/net/tipc/net.c > index > 856f9e97ea293210bea1d2003d2092482732ace9..4fbaa0464405370601cb2fd1dd3b03733836d342 > 100644 > --- a/net/tipc/net.c > +++ b/net/tipc/net.c > @@ -252,6 +252,8 @@ int __tipc_nl_net_set(struct sk_buff *skb, struct > genl_info *info) > u64 *w0 = (u64 *)_id[0]; > u64 *w1 = (u64 *)_id[8]; > > + if (!attrs[TIPC_NLA_NET_NODEID_W1]) > + return -EINVAL; > *w0 = nla_get_u64(attrs[TIPC_NLA_NET_NODEID]); > *w1 = nla_get_u64(attrs[TIPC_NLA_NET_NODEID_W1]); > tipc_net_init(net, node_id, 0); > diff --git a/net/tipc/netlink.c b/net/tipc/netlink.c > index > d4e0bbeee72793a060befaf8a9d0239731c0d48c..6ff2254088f647d4f7410c3335ccdae2e68ec522 > 100644 > --- a/net/tipc/netlink.c > +++ b/net/tipc/netlink.c > @@ -81,6 +81,8 @@ const struct nla_policy tipc_nl_net_policy[TIPC_NLA_NET_MAX > + 1] = { > [TIPC_NLA_NET_UNSPEC] = { .type = NLA_UNSPEC }, > [TIPC_NLA_NET_ID] = { .type = NLA_U32 }, > [TIPC_NLA_NET_ADDR] = { .type = NLA_U32 }, > + [TIPC_NLA_NET_NODEID] = { .type = NLA_U64 }, > + [TIPC_NLA_NET_NODEID_W1]= { .type = NLA_U64 }, > }; > > const struct nla_policy tipc_nl_link_policy[TIPC_NLA_LINK_MAX + 1] = { >
[net-next V10 PATCH 04/16] xdp: move struct xdp_buff from filter.h to xdp.h
This is done to prepare for the next patch, and it is also nice to move this XDP related struct out of filter.h. Signed-off-by: Jesper Dangaard Brouer--- include/linux/filter.h | 24 +--- include/net/xdp.h | 22 ++ 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index fc4e8f91b03d..4da8b2308174 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -30,6 +30,7 @@ struct sock; struct seccomp_data; struct bpf_prog_aux; struct xdp_rxq_info; +struct xdp_buff; /* ArgX, context and stack frame pointer register positions. Note, * Arg1, Arg2, Arg3, etc are used as argument mappings of function @@ -500,14 +501,6 @@ struct bpf_skb_data_end { void *data_end; }; -struct xdp_buff { - void *data; - void *data_end; - void *data_meta; - void *data_hard_start; - struct xdp_rxq_info *rxq; -}; - struct sk_msg_buff { void *data; void *data_end; @@ -772,21 +765,6 @@ int xdp_do_redirect(struct net_device *dev, struct bpf_prog *prog); void xdp_do_flush_map(void); -/* Drivers not supporting XDP metadata can use this helper, which - * rejects any room expansion for metadata as a result. 
- */ -static __always_inline void -xdp_set_data_meta_invalid(struct xdp_buff *xdp) -{ - xdp->data_meta = xdp->data + 1; -} - -static __always_inline bool -xdp_data_meta_unsupported(const struct xdp_buff *xdp) -{ - return unlikely(xdp->data_meta > xdp->data); -} - void bpf_warn_invalid_xdp_action(u32 act); struct sock *do_sk_redirect_map(struct sk_buff *skb); diff --git a/include/net/xdp.h b/include/net/xdp.h index e4207699c410..15f8ade008b5 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -50,6 +50,13 @@ struct xdp_rxq_info { struct xdp_mem_info mem; } cacheline_aligned; /* perf critical, avoid false-sharing */ +struct xdp_buff { + void *data; + void *data_end; + void *data_meta; + void *data_hard_start; + struct xdp_rxq_info *rxq; +}; static inline void xdp_return_frame(void *data, struct xdp_mem_info *mem) @@ -72,4 +79,19 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, enum xdp_mem_type type, void *allocator); +/* Drivers not supporting XDP metadata can use this helper, which + * rejects any room expansion for metadata as a result. + */ +static __always_inline void +xdp_set_data_meta_invalid(struct xdp_buff *xdp) +{ + xdp->data_meta = xdp->data + 1; +} + +static __always_inline bool +xdp_data_meta_unsupported(const struct xdp_buff *xdp) +{ + return unlikely(xdp->data_meta > xdp->data); +} + #endif /* __LINUX_NET_XDP_H__ */
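The helpers being moved rely on the xdp_buff pointer invariant data_hard_start <= data_meta <= data <= data_end, with data_meta > data serving as the "metadata unsupported" sentinel. A minimal userspace sketch of that convention (mirroring the helpers, not the kernel code itself):

```c
#include <assert.h>

/* Userspace stand-in for the xdp_buff pointer layout. */
struct xdp_buff_sketch {
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
};

/* Mirrors xdp_set_data_meta_invalid(): point data_meta past data,
 * which no valid metadata region can do. */
static void set_data_meta_invalid(struct xdp_buff_sketch *xdp)
{
	xdp->data_meta = (char *)xdp->data + 1;
}

/* Mirrors xdp_data_meta_unsupported(). */
static int data_meta_unsupported(const struct xdp_buff_sketch *xdp)
{
	return (char *)xdp->data_meta > (char *)xdp->data;
}
```

Valid metadata grows downward from data (data_meta <= data), so encoding "unsupported" as data + 1 costs no extra struct field.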
[net-next V10 PATCH 06/16] tun: convert to use generic xdp_frame and xdp_return_frame API
From: Jesper Dangaard Brouer

The tuntap driver invented its own driver-specific way of queuing XDP packets, by storing the xdp_buff information in the top of the XDP frame data. Convert it over to use the more generic xdp_frame structure. The main problem with the in-driver method is that the xdp_rxq_info pointer cannot be trusted/used when dequeueing the frame. V3: Remove check based on feedback from Jason Signed-off-by: Jesper Dangaard Brouer --- drivers/net/tun.c | 43 --- drivers/vhost/net.c|7 --- include/linux/if_tun.h |4 ++-- 3 files changed, 26 insertions(+), 28 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 28583aa0c17d..2c85e5cac2a9 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -248,11 +248,11 @@ struct veth { __be16 h_vlan_TCI; }; -bool tun_is_xdp_buff(void *ptr) +bool tun_is_xdp_frame(void *ptr) { return (unsigned long)ptr & TUN_XDP_FLAG; } -EXPORT_SYMBOL(tun_is_xdp_buff); +EXPORT_SYMBOL(tun_is_xdp_frame); void *tun_xdp_to_ptr(void *ptr) { @@ -660,10 +660,10 @@ void tun_ptr_free(void *ptr) { if (!ptr) return; - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - put_page(virt_to_head_page(xdp->data)); + xdp_return_frame(xdpf->data, &xdpf->mem); } else { __skb_array_destroy_skb(ptr); } @@ -1298,17 +1298,14 @@ static const struct net_device_ops tun_netdev_ops = { static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) { struct tun_struct *tun = netdev_priv(dev); - struct xdp_buff *buff = xdp->data_hard_start; - int headroom = xdp->data - xdp->data_hard_start; + struct xdp_frame *frame; struct tun_file *tfile; u32 numqueues; int ret = 0; - /* Assure headroom is available and buff is properly aligned */ - if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp))) - return -ENOSPC; - - *buff = *xdp; + frame = convert_to_xdp_frame(xdp); + if (unlikely(!frame)) + return -EOVERFLOW; rcu_read_lock(); @@ -1323,7 
+1320,7 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) /* Encode the XDP flag into lowest bit for consumer to differ * XDP buffer from sk_buff. */ - if (ptr_ring_produce(>tx_ring, tun_xdp_to_ptr(buff))) { + if (ptr_ring_produce(>tx_ring, tun_xdp_to_ptr(frame))) { this_cpu_inc(tun->pcpu_stats->tx_dropped); ret = -ENOSPC; } @@ -2001,11 +1998,11 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from) static ssize_t tun_put_user_xdp(struct tun_struct *tun, struct tun_file *tfile, - struct xdp_buff *xdp, + struct xdp_frame *xdp_frame, struct iov_iter *iter) { int vnet_hdr_sz = 0; - size_t size = xdp->data_end - xdp->data; + size_t size = xdp_frame->len; struct tun_pcpu_stats *stats; size_t ret; @@ -2021,7 +2018,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun, iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso)); } - ret = copy_to_iter(xdp->data, size, iter) + vnet_hdr_sz; + ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz; stats = get_cpu_ptr(tun->pcpu_stats); u64_stats_update_begin(>syncp); @@ -2189,11 +2186,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, return err; } - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - ret = tun_put_user_xdp(tun, tfile, xdp, to); - put_page(virt_to_head_page(xdp->data)); + ret = tun_put_user_xdp(tun, tfile, xdpf, to); + xdp_return_frame(xdpf->data, >mem); } else { struct sk_buff *skb = ptr; @@ -2432,10 +2429,10 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len, static int tun_ptr_peek_len(void *ptr) { if (likely(ptr)) { - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - return xdp->data_end - xdp->data; + return xdpf->len; } return __skb_array_len_with_tag(ptr); } else { diff
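The conversion above keeps tun's pointer-tagging trick: the lowest bit of each ptr_ring entry (TUN_XDP_FLAG, 0x1UL in the driver) marks whether the entry is an xdp_frame or an sk_buff, which works because both objects are at least word-aligned. A minimal userspace sketch of that scheme (not the kernel code itself):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch of tun's pointer tagging. TUN_XDP_FLAG mirrors the
 * driver's define; the functions mirror tun_is_xdp_frame(),
 * tun_xdp_to_ptr() and tun_ptr_to_xdp(). */
#define TUN_XDP_FLAG 0x1UL

static int is_xdp_frame(void *ptr)
{
	/* Low bit set => entry is a tagged xdp_frame, not an sk_buff */
	return (uintptr_t)ptr & TUN_XDP_FLAG;
}

static void *xdp_to_ptr(void *frame)
{
	/* Tag before producing into the ptr_ring */
	return (void *)((uintptr_t)frame | TUN_XDP_FLAG);
}

static void *ptr_to_xdp(void *ptr)
{
	/* Strip the tag on the consumer side */
	return (void *)((uintptr_t)ptr & ~TUN_XDP_FLAG);
}
```

This only round-trips cleanly for pointers whose low bit is naturally zero, which alignment guarantees for both struct types.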
[net-next V10 PATCH 07/16] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
The virtio_net driver assumes XDP frames are always released based on page refcnt (via put_page). Thus, it only queues the XDP data pointer address and uses virt_to_head_page() to retrieve struct page. Use the XDP return API to get away from such assumptions. Instead queue an xdp_frame, which allows us to use the xdp_return_frame API when releasing the frame. V8: Avoid endianness issues (found by kbuild test robot) V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan Carpenter) Signed-off-by: Jesper Dangaard Brouer --- drivers/net/virtio_net.c | 54 +- 1 file changed, 29 insertions(+), 25 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 7b187ec7411e..f50e1ad81ad4 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -415,38 +415,48 @@ static void virtnet_xdp_flush(struct net_device *dev) virtqueue_kick(sq->vq); } -static bool __virtnet_xdp_xmit(struct virtnet_info *vi, - struct xdp_buff *xdp) +static int __virtnet_xdp_xmit(struct virtnet_info *vi, + struct xdp_buff *xdp) { struct virtio_net_hdr_mrg_rxbuf *hdr; - unsigned int len; + struct xdp_frame *xdpf, *xdpf_sent; struct send_queue *sq; + unsigned int len; unsigned int qp; - void *xdp_sent; int err; qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id(); sq = &vi->sq[qp]; /* Free up any pending old buffers before queueing new ones. 
*/ - while ((xdp_sent = virtqueue_get_buf(sq->vq, )) != NULL) { - struct page *sent_page = virt_to_head_page(xdp_sent); + while ((xdpf_sent = virtqueue_get_buf(sq->vq, )) != NULL) + xdp_return_frame(xdpf_sent->data, _sent->mem); - put_page(sent_page); - } + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return -EOVERFLOW; + + /* virtqueue want to use data area in-front of packet */ + if (unlikely(xdpf->metasize > 0)) + return -EOPNOTSUPP; - xdp->data -= vi->hdr_len; + if (unlikely(xdpf->headroom < vi->hdr_len)) + return -EOVERFLOW; + + /* Make room for virtqueue hdr (also change xdpf->headroom?) */ + xdpf->data -= vi->hdr_len; /* Zero header and leave csum up to XDP layers */ - hdr = xdp->data; + hdr = xdpf->data; memset(hdr, 0, vi->hdr_len); + xdpf->len += vi->hdr_len; - sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data); + sg_init_one(sq->sg, xdpf->data, xdpf->len); - err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC); + err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC); if (unlikely(err)) - return false; /* Caller handle free/refcnt */ + return -ENOSPC; /* Caller handle free/refcnt */ - return true; + return 0; } static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) @@ -454,7 +464,6 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) struct virtnet_info *vi = netdev_priv(dev); struct receive_queue *rq = vi->rq; struct bpf_prog *xdp_prog; - bool sent; /* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this * indicate XDP resources have been successfully allocated. 
@@ -463,10 +472,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) if (!xdp_prog) return -ENXIO; - sent = __virtnet_xdp_xmit(vi, xdp); - if (!sent) - return -ENOSPC; - return 0; + return __virtnet_xdp_xmit(vi, xdp); } static unsigned int virtnet_get_headroom(struct virtnet_info *vi) @@ -555,7 +561,6 @@ static struct sk_buff *receive_small(struct net_device *dev, struct page *page = virt_to_head_page(buf); unsigned int delta = 0; struct page *xdp_page; - bool sent; int err; len -= vi->hdr_len; @@ -606,8 +611,8 @@ static struct sk_buff *receive_small(struct net_device *dev, delta = orig_data - xdp.data; break; case XDP_TX: - sent = __virtnet_xdp_xmit(vi, ); - if (unlikely(!sent)) { + err = __virtnet_xdp_xmit(vi, ); + if (unlikely(err)) { trace_xdp_exception(vi->dev, xdp_prog, act); goto err_xdp; } @@ -690,7 +695,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, struct bpf_prog *xdp_prog; unsigned int truesize; unsigned int headroom = mergeable_ctx_to_headroom(ctx); - bool sent; int err; head_skb =
[net-next V10 PATCH 08/16] bpf: cpumap convert to use generic xdp_frame
The generic xdp_frame format, was inspired by the cpumap own internal xdp_pkt format. It is now time to convert it over to the generic xdp_frame format. The cpumap needs one extra field dev_rx. Signed-off-by: Jesper Dangaard Brouer--- include/net/xdp.h |1 + kernel/bpf/cpumap.c | 100 ++- 2 files changed, 29 insertions(+), 72 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index 756c42811e78..ea3773f94f65 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -67,6 +67,7 @@ struct xdp_frame { * while mem info is valid on remote CPU. */ struct xdp_mem_info mem; + struct net_device *dev_rx; /* used by cpumap */ }; /* Convert xdp_buff to xdp_frame */ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 3e4bbcbe3e86..bcdc4dea5ce7 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work) kthread_stop(rcpu->kthread); } -/* For now, xdp_pkt is a cpumap internal data structure, with info - * carried between enqueue to dequeue. It is mapped into the top - * headroom of the packet, to avoid allocating separate mem. - */ -struct xdp_pkt { - void *data; - u16 len; - u16 headroom; - u16 metasize; - /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, -* while mem info is valid on remote CPU. -*/ - struct xdp_mem_info mem; - struct net_device *dev_rx; -}; - -/* Convert xdp_buff to xdp_pkt */ -static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp) -{ - struct xdp_pkt *xdp_pkt; - int metasize; - int headroom; - - /* Assure headroom is available for storing info */ - headroom = xdp->data - xdp->data_hard_start; - metasize = xdp->data - xdp->data_meta; - metasize = metasize > 0 ? 
metasize : 0; - if (unlikely((headroom - metasize) < sizeof(*xdp_pkt))) - return NULL; - - /* Store info in top of packet */ - xdp_pkt = xdp->data_hard_start; - - xdp_pkt->data = xdp->data; - xdp_pkt->len = xdp->data_end - xdp->data; - xdp_pkt->headroom = headroom - sizeof(*xdp_pkt); - xdp_pkt->metasize = metasize; - - /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ - xdp_pkt->mem = xdp->rxq->mem; - - return xdp_pkt; -} - static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, -struct xdp_pkt *xdp_pkt) +struct xdp_frame *xdpf) { unsigned int frame_size; void *pkt_data_start; @@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * would be preferred to set frame_size to 2048 or 4096 * depending on the driver. * frame_size = 2048; -* frame_len = frame_size - sizeof(*xdp_pkt); +* frame_len = frame_size - sizeof(*xdp_frame); * * Instead, with info avail, skb_shared_info in placed after * packet len. This, unfortunately fakes the truesize. @@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * is not at a fixed memory location, with mixed length * packets, which is bad for cache-line hotness. 
*/ - frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom + + frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - pkt_data_start = xdp_pkt->data - xdp_pkt->headroom; + pkt_data_start = xdpf->data - xdpf->headroom; skb = build_skb(pkt_data_start, frame_size); if (!skb) return NULL; - skb_reserve(skb, xdp_pkt->headroom); - __skb_put(skb, xdp_pkt->len); - if (xdp_pkt->metasize) - skb_metadata_set(skb, xdp_pkt->metasize); + skb_reserve(skb, xdpf->headroom); + __skb_put(skb, xdpf->len); + if (xdpf->metasize) + skb_metadata_set(skb, xdpf->metasize); /* Essential SKB info: protocol and skb->dev */ - skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx); + skb->protocol = eth_type_trans(skb, xdpf->dev_rx); /* Optional SKB info, currently missing: * - HW checksum info (skb->ip_summed) @@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring) * invoked cpu_map_kthread_stop(). Catch any broken behaviour * gracefully and warn once. */ - struct xdp_pkt *xdp_pkt; + struct xdp_frame *xdpf; - while ((xdp_pkt = ptr_ring_consume(ring))) - if (WARN_ON_ONCE(xdp_pkt)) - xdp_return_frame(xdp_pkt, _pkt->mem); + while ((xdpf = ptr_ring_consume(ring))) + if
[net-next V10 PATCH 11/16] xdp: rhashtable with allocator ID to pointer mapping
Use the IDA infrastructure to get a cyclically increasing ID number, used for keeping track of each registered allocator per RX-queue xdp_rxq_info. Instead of the IDR infrastructure, which uses a radix tree, use a dynamic rhashtable for the ID-to-pointer lookup table, because this is faster.

The problem being solved here is that the xdp_rxq_info pointer (stored in xdp_buff) cannot be used directly, as its guaranteed lifetime is too short. The info is needed on a (potentially) remote CPU during DMA-TX completion time. An xdp_frame stores the xdp_mem_info it was given when converted from an xdp_buff, which is sufficient for the simple page-refcnt based recycle schemes. For more advanced allocators there is a need to store a pointer to the registered allocator. Thus, there is a need to guard the lifetime or validity of the allocator pointer, which is done through this rhashtable ID-to-pointer map. The removal and validity of the allocator and helper struct xdp_mem_allocator are guarded by RCU.

The allocator will be created by the driver, and registered with xdp_rxq_info_reg_mem_model(). It is up for debate who is responsible for freeing the allocator pointer or invoking the allocator destructor function. In any case, this must happen via RCU freeing.

V4: Per req of Jason Wang - Use xdp_rxq_info_reg_mem_model() in all drivers implementing XDP_REDIRECT, even though it's not strictly necessary when allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero). 
V6: Per req of Alex Duyck - Introduce rhashtable_lookup() call in later patch V8: Address sparse should be static warnings (from kbuild test robot) Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |9 + drivers/net/tun.c |6 + drivers/net/virtio_net.c |7 + include/net/xdp.h | 14 -- net/core/xdp.c| 223 - 5 files changed, 241 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 0bfe6cf2bf8b..f10904ec2172 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -6370,7 +6370,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter, struct device *dev = rx_ring->dev; int orig_node = dev_to_node(dev); int ring_node = -1; - int size; + int size, err; size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count; @@ -6407,6 +6407,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter, rx_ring->queue_index) < 0) goto err; + err = xdp_rxq_info_reg_mem_model(_ring->xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if (err) { + xdp_rxq_info_unreg(_ring->xdp_rxq); + goto err; + } + rx_ring->xdp_prog = adapter->xdp_prog; return 0; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2c85e5cac2a9..283bde85c455 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -854,6 +854,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file, tun->dev, tfile->queue_index); if (err < 0) goto out; + err = xdp_rxq_info_reg_mem_model(>xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if (err < 0) { + xdp_rxq_info_unreg(>xdp_rxq); + goto out; + } err = 0; } diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index f50e1ad81ad4..42d338fe9a8d 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1305,6 +1305,13 @@ static int virtnet_open(struct net_device *dev) if (err < 0) return err; + err = xdp_rxq_info_reg_mem_model(>rq[i].xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if 
(err < 0) { + xdp_rxq_info_unreg(>rq[i].xdp_rxq); + return err; + } + virtnet_napi_enable(vi->rq[i].vq, >rq[i].napi); virtnet_napi_tx_enable(vi, vi->sq[i].vq, >sq[i].napi); } diff --git a/include/net/xdp.h b/include/net/xdp.h index ea3773f94f65..5f67c62540aa 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -41,6 +41,7 @@ enum xdp_mem_type { struct xdp_mem_info { u32 type; /* enum xdp_mem_type, but known size type */ + u32 id; }; struct
[net-next V10 PATCH 10/16] mlx5: register a memory model when XDP is enabled
Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame. This enable a different memory model, thus activating another code path in the xdp_return_frame API. V2: Fixed issues pointed out by Tariq. Signed-off-by: Jesper Dangaard BrouerReviewed-by: Tariq Toukan Acked-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c |8 1 file changed, 8 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index b29c1d93f058..2dca0933dfd3 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -512,6 +512,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->mkey_be = c->mkey_be; } + /* This must only be activate for order-0 pages */ + if (rq->xdp_prog) { + err = xdp_rxq_info_reg_mem_model(>xdp_rxq, +MEM_TYPE_PAGE_ORDER0, NULL); + if (err) + goto err_rq_wq_destroy; + } + for (i = 0; i < wq_sz; i++) { struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(>wq, i);
[net-next V10 PATCH 09/16] i40e: convert to use generic xdp_frame and xdp_return_frame API
Also convert driver i40e, which very recently got XDP_REDIRECT support in commit d9314c474d4f ("i40e: add support for XDP_REDIRECT"). V7: This patch got added in V7 of this patchset. Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 20 +++- drivers/net/ethernet/intel/i40e/i40e_txrx.h |1 + 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index f174c72480ab..96c54cbfb1f9 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -638,7 +638,8 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring, if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB) kfree(tx_buffer->raw_buf); else if (ring_is_xdp(ring)) - page_frag_free(tx_buffer->raw_buf); + xdp_return_frame(tx_buffer->xdpf->data, +_buffer->xdpf->mem); else dev_kfree_skb_any(tx_buffer->skb); if (dma_unmap_len(tx_buffer, len)) @@ -841,7 +842,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi, /* free the skb/XDP data */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buf->raw_buf); + xdp_return_frame(tx_buf->xdpf->data, _buf->xdpf->mem); else napi_consume_skb(tx_buf->skb, napi_budget); @@ -2225,6 +2226,8 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring, if (!xdp_prog) goto xdp_out; + prefetchw(xdp->data_hard_start); /* xdp_frame write */ + act = bpf_prog_run_xdp(xdp_prog, xdp); switch (act) { case XDP_PASS: @@ -3481,25 +3484,32 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, static int i40e_xmit_xdp_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring) { - u32 size = xdp->data_end - xdp->data; u16 i = xdp_ring->next_to_use; struct i40e_tx_buffer *tx_bi; struct i40e_tx_desc *tx_desc; + struct xdp_frame *xdpf; dma_addr_t dma; + u32 size; + + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return I40E_XDP_CONSUMED; + + size = xdpf->len; if 
(!unlikely(I40E_DESC_UNUSED(xdp_ring))) { xdp_ring->tx_stats.tx_busy++; return I40E_XDP_CONSUMED; } - dma = dma_map_single(xdp_ring->dev, xdp->data, size, DMA_TO_DEVICE); + dma = dma_map_single(xdp_ring->dev, xdpf->data, size, DMA_TO_DEVICE); if (dma_mapping_error(xdp_ring->dev, dma)) return I40E_XDP_CONSUMED; tx_bi = _ring->tx_bi[i]; tx_bi->bytecount = size; tx_bi->gso_segs = 1; - tx_bi->raw_buf = xdp->data; + tx_bi->xdpf = xdpf; /* record length, and DMA address */ dma_unmap_len_set(tx_bi, len, size); diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h index 3043483ec426..857b1d743c8d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h @@ -306,6 +306,7 @@ static inline unsigned int i40e_txd_use_count(unsigned int size) struct i40e_tx_buffer { struct i40e_tx_desc *next_to_watch; union { + struct xdp_frame *xdpf; struct sk_buff *skb; void *raw_buf; };
[net-next V10 PATCH 01/16] mlx5: basic XDP_REDIRECT forward support
This implements basic XDP redirect support in the mlx5 driver. Notice that ndo_xdp_xmit() is NOT implemented, because that API needs some changes that this patchset is working towards. The main purpose of this patch is to have different drivers doing XDP_REDIRECT, to show how different memory models behave in a cross-driver world. Update(pre-RFCv2 Tariq): Need to DMA unmap page before xdp_do_redirect, as the return API does not exist yet to keep this mapped. Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing, introduce xdpsq.db.redirect_flush boolean. V9: Adjust for commit 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") Signed-off-by: Jesper Dangaard Brouer Reviewed-by: Tariq Toukan Acked-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en.h|1 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 27 --- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 30cad07be2b5..1a05d1072c5e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -392,6 +392,7 @@ struct mlx5e_xdpsq { struct { struct mlx5e_dma_info *di; bool doorbell; + bool redirect_flush; } db; /* read only */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 176645762e49..0e24be05907f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -236,14 +236,20 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq, return 0; } +static void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, + struct mlx5e_dma_info *dma_info) +{ + dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq), + rq->buff.map_dir); +} + void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info, bool recycle) { if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info)) return; - 
dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq), - rq->buff.map_dir); + mlx5e_page_dma_unmap(rq, dma_info); put_page(dma_info->page); } @@ -800,9 +806,10 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di, void *va, u16 *rx_headroom, u32 *len) { - const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog); + struct bpf_prog *prog = READ_ONCE(rq->xdp_prog); struct xdp_buff xdp; u32 act; + int err; if (!prog) return false; @@ -823,6 +830,15 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, ))) trace_xdp_exception(rq->netdev, prog, act); return true; + case XDP_REDIRECT: + /* When XDP enabled then page-refcnt==1 here */ + err = xdp_do_redirect(rq->netdev, , prog); + if (!err) { + __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); + rq->xdpsq.db.redirect_flush = true; + mlx5e_page_dma_unmap(rq, di); + } + return true; default: bpf_warn_invalid_xdp_action(act); case XDP_ABORTED: @@ -1140,6 +1156,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget) xdpsq->db.doorbell = false; } + if (xdpsq->db.redirect_flush) { + xdp_do_flush_map(); + xdpsq->db.redirect_flush = false; + } + mlx5_cqwq_update_db_record(>wq); /* ensure cq space is freed before enabling more cqes */
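The redirect_flush flag added here implements a batching pattern: each XDP_REDIRECT packet only marks the flag, and xdp_do_flush_map() runs once at the end of the NAPI poll cycle instead of once per packet. A userspace sketch of that pattern (counters stand in for the real map-flush work; this is not the driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the mlx5e_poll_rx_cq() redirect_flush batching. */
static bool redirect_flush;
static int flushes;

static void handle_redirect_sketch(void)
{
	/* Per-packet work: only mark that a flush is pending */
	redirect_flush = true;
}

static void poll_end_sketch(void)
{
	/* Per-poll work: flush at most once, then clear the flag
	 * (xdp_do_flush_map() in the kernel) */
	if (redirect_flush) {
		flushes++;
		redirect_flush = false;
	}
}
```

Deferring the flush amortizes the (potentially expensive) map flush across all redirected packets in one poll.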
[net-next V10 PATCH 02/16] xdp: introduce xdp_return_frame API and use in cpumap
Introduce an xdp_return_frame API, and convert over cpumap as the first user, given it have queued XDP frame structure to leverage. V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck. V6: Remove comment that id will be added later (Req by Alex Duyck) V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot) Signed-off-by: Jesper Dangaard Brouer--- include/net/xdp.h | 27 +++ kernel/bpf/cpumap.c | 60 +++ net/core/xdp.c | 18 +++ 3 files changed, 81 insertions(+), 24 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index b2362ddfa694..e4207699c410 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -33,16 +33,43 @@ * also mandatory during RX-ring setup. */ +enum xdp_mem_type { + MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ + MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + MEM_TYPE_MAX, +}; + +struct xdp_mem_info { + u32 type; /* enum xdp_mem_type, but known size type */ +}; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; u32 reg_state; + struct xdp_mem_info mem; } cacheline_aligned; /* perf critical, avoid false-sharing */ + +static inline +void xdp_return_frame(void *data, struct xdp_mem_info *mem) +{ + if (mem->type == MEM_TYPE_PAGE_SHARED) + page_frag_free(data); + + if (mem->type == MEM_TYPE_PAGE_ORDER0) { + struct page *page = virt_to_page(data); /* Assumes order0 page*/ + + put_page(page); + } +} + int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, struct net_device *dev, u32 queue_index); void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq); void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq); bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, + enum xdp_mem_type type, void *allocator); #endif /* __LINUX_NET_XDP_H__ */ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index a4bb0b34375a..3e4bbcbe3e86 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -19,6 +19,7 @@ #include #include 
#include +#include #include #include @@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) return ERR_PTR(err); } -static void __cpu_map_queue_destructor(void *ptr) -{ - /* The tear-down procedure should have made sure that queue is -* empty. See __cpu_map_entry_replace() and work-queue -* invoked cpu_map_kthread_stop(). Catch any broken behaviour -* gracefully and warn once. -*/ - if (WARN_ON_ONCE(ptr)) - page_frag_free(ptr); -} - -static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) -{ - if (atomic_dec_and_test(>refcnt)) { - /* The queue should be empty at this point */ - ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor); - kfree(rcpu->queue); - kfree(rcpu); - } -} - static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) { atomic_inc(>refcnt); @@ -188,6 +168,10 @@ struct xdp_pkt { u16 len; u16 headroom; u16 metasize; + /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, +* while mem info is valid on remote CPU. +*/ + struct xdp_mem_info mem; struct net_device *dev_rx; }; @@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp) xdp_pkt->headroom = headroom - sizeof(*xdp_pkt); xdp_pkt->metasize = metasize; + /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ + xdp_pkt->mem = xdp->rxq->mem; + return xdp_pkt; } @@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, return skb; } +static void __cpu_map_ring_cleanup(struct ptr_ring *ring) +{ + /* The tear-down procedure should have made sure that queue is +* empty. See __cpu_map_entry_replace() and work-queue +* invoked cpu_map_kthread_stop(). Catch any broken behaviour +* gracefully and warn once. 
+*/ + struct xdp_pkt *xdp_pkt; + + while ((xdp_pkt = ptr_ring_consume(ring))) + if (WARN_ON_ONCE(xdp_pkt)) + xdp_return_frame(xdp_pkt, _pkt->mem); +} + +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) +{ + if (atomic_dec_and_test(>refcnt)) { + /* The queue should be empty at this point */ + __cpu_map_ring_cleanup(rcpu->queue); + ptr_ring_cleanup(rcpu->queue, NULL); + kfree(rcpu->queue); + kfree(rcpu); + } +} + static int cpu_map_kthread_run(void *data) { struct bpf_cpu_map_entry *rcpu = data; @@
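The core of the new API is that xdp_return_frame() dispatches on xdp_mem_info.type, so each memory model gets its own release path. A userspace sketch of that dispatch, with counters standing in for page_frag_free()/put_page() (this is an illustration, not the kernel helper):

```c
#include <assert.h>

/* Sketch mirroring enum xdp_mem_type and the type-dispatched return. */
enum mem_type_sketch {
	PAGE_SHARED = 0,	/* split-page refcnt based model */
	PAGE_ORDER0,		/* original XDP full-page model */
};

struct mem_info_sketch {
	unsigned int type;
};

static int frag_frees, page_puts;

static void return_frame_sketch(void *data, const struct mem_info_sketch *mem)
{
	(void)data;
	if (mem->type == PAGE_SHARED)
		frag_frees++;	/* page_frag_free(data) in the kernel */
	else if (mem->type == PAGE_ORDER0)
		page_puts++;	/* put_page(virt_to_page(data)) in the kernel */
}
```

Because the mem info travels with the frame, the release can run on a remote CPU (e.g. at DMA-TX completion) without touching the short-lived xdp_rxq_info.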
[net-next V10 PATCH 03/16] ixgbe: use xdp_return_frame API
Extend struct ixgbe_tx_buffer to store the xdp_mem_info. Notice that this could be optimized further by putting this into a union in the struct ixgbe_tx_buffer, but this patchset works towards removing this again. Thus, this is not done. Signed-off-by: Jesper Dangaard Brouer--- drivers/net/ethernet/intel/ixgbe/ixgbe.h |1 + drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |6 -- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index 4f08c712e58e..abb5248e917e 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -250,6 +250,7 @@ struct ixgbe_tx_buffer { DEFINE_DMA_UNMAP_ADDR(dma); DEFINE_DMA_UNMAP_LEN(len); u32 tx_flags; + struct xdp_mem_info xdp_mem; }; struct ixgbe_rx_buffer { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index afadba99f7b8..0bfe6cf2bf8b 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, /* free the skb */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buffer->data); + xdp_return_frame(tx_buffer->data, _buffer->xdp_mem); else napi_consume_skb(tx_buffer->skb, napi_budget); @@ -5797,7 +5797,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring) /* Free all the Tx ring sk_buffs */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buffer->data); + xdp_return_frame(tx_buffer->data, _buffer->xdp_mem); else dev_kfree_skb_any(tx_buffer->skb); @@ -8366,6 +8366,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter, dma_unmap_len_set(tx_buffer, len, len); dma_unmap_addr_set(tx_buffer, dma, dma); tx_buffer->data = xdp->data; + tx_buffer->xdp_mem = xdp->rxq->mem; + tx_desc->read.buffer_addr = cpu_to_le64(dma); /* put descriptor type bits */
[net-next V10 PATCH 05/16] xdp: introduce a new xdp_frame type
This is needed to convert drivers tuntap and virtio_net.

This is a generalization of what is done inside cpumap, which will be
converted later.

Signed-off-by: Jesper Dangaard Brouer
---
 include/net/xdp.h | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 15f8ade008b5..756c42811e78 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -58,6 +58,46 @@ struct xdp_buff {
 	struct xdp_rxq_info *rxq;
 };
 
+struct xdp_frame {
+	void *data;
+	u16 len;
+	u16 headroom;
+	u16 metasize;
+	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+	 * while mem info is valid on remote CPU.
+	 */
+	struct xdp_mem_info mem;
+};
+
+/* Convert xdp_buff to xdp_frame */
+static inline
+struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
+{
+	struct xdp_frame *xdp_frame;
+	int metasize;
+	int headroom;
+
+	/* Assure headroom is available for storing info */
+	headroom = xdp->data - xdp->data_hard_start;
+	metasize = xdp->data - xdp->data_meta;
+	metasize = metasize > 0 ? metasize : 0;
+	if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
+		return NULL;
+
+	/* Store info in top of packet */
+	xdp_frame = xdp->data_hard_start;
+
+	xdp_frame->data = xdp->data;
+	xdp_frame->len = xdp->data_end - xdp->data;
+	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
+	xdp_frame->metasize = metasize;
+
+	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+	xdp_frame->mem = xdp->rxq->mem;
+
+	return xdp_frame;
+}
+
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
[net-next V10 PATCH 00/16] XDP redirect memory return API
Resubmit V10 against net-next, as it contains NIC driver changes.

This patchset works towards supporting different XDP RX-ring memory
allocators, as this will be needed by the AF_XDP zero-copy mode.

The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT
RX-mode implemented, but not ndo_xdp_xmit (as this API is subject to
change throughout the patchset).

A new struct xdp_frame is introduced (modeled after the cpumap
xdp_pkt), and both ndo_xdp_xmit and the new xdp_return_frame end up
using this.

Support for a driver-supplied allocator is implemented, and a
refurbished version of page_pool is the first return-allocator type
introduced. This will be an integration point for AF_XDP zero-copy.

The mlx5 driver evolves into using the page_pool, and sees a
performance increase (with ndo_xdp_xmit out the ixgbe driver) from
6 Mpps to 12 Mpps.

The patchset stops at 16 patches (one over the limit), but more API
changes are planned; specifically, extending the ndo_xdp_xmit and
xdp_return_frame APIs to support bulking, as this will address some
known limits.
V2: Updated according to Tariq's feedback
V3: Updated based on feedback from Jason Wang and Alex Duyck
V4: Updated based on feedback from Tariq and Jason
V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
V6: Updated based on feedback from Eric Dumazet and Alex Duyck
V7: Adapt to i40e that got XDP_REDIRECT support in-between
V8: Updated based on feedback from kbuild test robot, and adjust for mlx5
    changes; page_pool only compiled into kernel when drivers Kconfig
    'select' feature
V9: Remove some inline statements, let compiler decide what to inline;
    fix return value in virtio_net driver;
    adjust for mlx5 changes in-between submissions
V10: Minor adjust for mlx5 requested by Tariq;
     resubmit against net-next

---

Jesper Dangaard Brouer (16):
      mlx5: basic XDP_REDIRECT forward support
      xdp: introduce xdp_return_frame API and use in cpumap
      ixgbe: use xdp_return_frame API
      xdp: move struct xdp_buff from filter.h to xdp.h
      xdp: introduce a new xdp_frame type
      tun: convert to use generic xdp_frame and xdp_return_frame API
      virtio_net: convert to use generic xdp_frame and xdp_return_frame API
      bpf: cpumap convert to use generic xdp_frame
      i40e: convert to use generic xdp_frame and xdp_return_frame API
      mlx5: register a memory model when XDP is enabled
      xdp: rhashtable with allocator ID to pointer mapping
      page_pool: refurbish version of page_pool code
      xdp: allow page_pool as an allocator type in xdp_return_frame
      mlx5: use page_pool for xdp_return_frame call
      xdp: transition into using xdp_frame for return API
      xdp: transition into using xdp_frame for ndo_xdp_xmit

 drivers/net/ethernet/intel/i40e/i40e_txrx.c       |  33 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h       |   3 
 drivers/net/ethernet/intel/ixgbe/ixgbe.h          |   3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  38 ++-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |   1 
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |   4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |  42 ++-
 drivers/net/tun.c         |  60 ++--
 drivers/net/virtio_net.c  |  67 +++-
 drivers/vhost/net.c       |   7 
 include/linux/filter.h    |  24 --
 include/linux/if_tun.h    |   4 
 include/linux/netdevice.h |   4 
 include/net/page_pool.h   | 143 +
 include/net/xdp.h         |  83 +
 kernel/bpf/cpumap.c       | 132 +++--
 net/Kconfig               |   3 
 net/core/Makefile         |   1 
 net/core/filter.c         |  17 +
 net/core/page_pool.c      | 317 +
 net/core/xdp.c            | 269 ++
 22 files changed, 1094 insertions(+), 198 deletions(-)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c
--
Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> On X86 swiotlb fallback (via get_dma_ops -> get_arch_dma_ops) to use
> x86_swiotlb_dma_ops, instead of swiotlb_dma_ops. I also included that
> in below fix patch.

x86_swiotlb_dma_ops should not exist any more, and x86 now uses
dma_direct_ops. Looks like you are applying it to an old kernel :)

> Performance improved to 8.9 Mpps from approx 6.5 Mpps.
>
> (This was without my bulking for net_device->ndo_xdp_xmit, so that
> number should improve more).

What is the number for the otherwise comparable setup without retpolines?
[PATCH] VSOCK: make af_vsock.ko removable again
Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
vsock_init_tables()") introduced a module_init() function without a
corresponding module_exit() function.

Modules with an init function can only be removed if they also have an
exit function. Therefore the vsock module was considered "permanent"
and could not be removed.

This patch adds an empty module_exit() function so that "rmmod vsock"
works. No explicit cleanup is required because:

1. Transports call vsock_core_exit() upon exit and cannot be removed
   while sockets are still alive.
2. vsock_diag.ko does not perform any action that requires cleanup by
   vsock.ko.

Reported-by: Xiumei Mu
Cc: Cong Wang
Cc: Jorgen Hansen
Signed-off-by: Stefan Hajnoczi
---
 net/vmw_vsock/af_vsock.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index aac9b8f6552e..c1076c19b858 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -2018,7 +2018,13 @@ const struct vsock_transport *vsock_core_get_transport(void)
 }
 EXPORT_SYMBOL_GPL(vsock_core_get_transport);
 
+static void __exit vsock_exit(void)
+{
+	/* Do nothing.  This function makes this module removable. */
+}
+
 module_init(vsock_init_tables);
+module_exit(vsock_exit);
 
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMware Virtual Socket Family");
-- 
2.14.3
Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
On 16/04/2018 7:51 PM, David Miller wrote:
> From: Zhu Yanjun
> Date: Sun, 15 Apr 2018 21:02:07 -0400
>
>> While a faulty cable is in use, or on an HCA firmware error, the HCA
>> device will go offline. When the driver accesses this offline device,
>> the following call trace pops out.
>> ...
>> In the above call trace, the function mlx4_cmd_poll calls the function
>> mlx4_cmd_post to access the HCA while the HCA is offline. Then
>> mlx4_cmd_post returns an error -EIO. On -EIO, the function
>> mlx4_cmd_poll calls mlx4_cmd_reset_flow to reset the HCA, and the
>> above call trace pops out. This is not reasonable: since the HCA
>> device is offline when it is being accessed, it should not be reset
>> again.
>>
>> In this patch, since the HCA is offline, the function mlx4_cmd_post
>> returns an error -EINVAL. On -EINVAL, the function mlx4_cmd_poll
>> returns directly instead of resetting the HCA.
>>
>> CC: Srinivas Eeda
>> CC: Junxiao Bi
>> Suggested-by: Håkon Bugge
>> Signed-off-by: Zhu Yanjun
>
> Tariq, I'm assuming you'll take this in and send it to me later.
> Thanks.

Yes, I will review and send if all is OK.

Thanks,
Tariq
Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
On Tue, Apr 17, 2018 at 09:07:01AM +0200, Jesper Dangaard Brouer wrote:
> > > number should improve more).
> >
> > What is the number for the otherwise comparable setup without retpolines?
>
> Approx 12 Mpps.
>
> You forgot to handle the dma_direct_mapping_error() case, which still
> used the retpoline in the above 8.9 Mpps measurement. I fixed it up and
> performance increased to 9.6 Mpps.
>
> Notice, in this test there are still two retpoline/indirect-calls
> left: the net_device->ndo_xdp_xmit and the invocation of the XDP BPF
> prog.

But that seems like a pretty clear indicator that we want the fast path
direct mapping. I'll try to find some time over the next weeks to do a
cleaner version of it.
Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
Hi Michael, Adrian,

Thanks for your patch!

On Tue, Apr 17, 2018 at 4:08 AM, Michael Schmitz wrote:
> From: John Paul Adrian Glaubitz
>
> This complements the fix in 82533ad9a1c that removed the free_irq

Please quote the commit's subject, too, like

    ... fix in commit 82533ad9a1c ("net: ethernet: ax88796: don't call
    free_irq without request_irq first")

BTW, I have a git alias for that:

    $ git help fixes
    `git fixes' is aliased to `show --format='Fixes: %h ("%s")' -s'
    $ git fixes 82533ad9a1c
    Fixes: 82533ad9a1c ("net: ethernet: ax88796: don't call free_irq
    without request_irq first")

> call in the error path of probe, to also not call free_irq when
> remove is called to revert the effects of probe.
>
> Signed-off-by: Michael Karcher

The patch is authored by Adrian, but his SoB is missing?
Michael (Schmitz): as you took the patch, you should add your SoB, too.

For the actual patch contents:
Reviewed-by: Geert Uytterhoeven

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
RE: [PATCH 0/3] Receive Side Coalescing for macb driver
From: David Miller [mailto:da...@davemloft.net]
Sent: 16 kwietnia 2018 17:09

> From: Rafal Ozieblo
> Date: Sat, 14 Apr 2018 21:53:07 +0100
>
>> This patch series adds support for receive side coalescing for the
>> Cadence GEM driver. Receive segmentation coalescing is a mechanism to
>> reduce CPU overhead. This is done by coalescing received TCP message
>> segments together into a single large message. This means that when
>> the message is complete the CPU only has to process the single header
>> and act upon the one data payload.
>
> You're going to have to think more deeply about enabling this feature.
>
> If you can't adjust the receive buffer offset, then the various packet
> header fields will be unaligned.
>
> On certain architectures this will result in unaligned traps all over
> the networking stack as the packet is being processed.
>
> So enabling this by default will hurt performance on such systems a lot.
>
> The whole "skb_reserve(skb, NET_IP_ALIGN)" is not just for fun, it is
> absolutely essential.

I totally agree with you. But the issue is with IP cores which have this
feature implemented. Even when the user does not want to use the feature,
if he bought an IP configuration that supports RSC, he has to switch off
IP alignment. There is no IP alignment with RSC in the GEM:
"When the gem rsc define has been set the receive buffer offset cannot
be changed in the network configuration register."
If the IP supports RSC and the skb has 2 bytes reserved for alignment,
we end up with no packets received correctly (2 bytes missing in each
skb). We can either leave a few customers without support in the Linux
driver, or let them use the driver with decreased performance.
Re: [PATCH v2 2/8] net: ax88796: Attach MII bus only when open
On Tue, Apr 17, 2018 at 02:08:09PM +1200, Michael Schmitz wrote:
> From: Michael Karcher
>
> Call ax_mii_init in ax_open(), and unregister/remove mdiobus resources
> in ax_close().
>
> This is needed to be able to unload the module, as the module is busy
> while the MII bus is attached.
>
> Signed-off-by: Michael Karcher
> Signed-off-by: Michael Schmitz

Reviewed-by: Andrew Lunn

    Andrew
Re: [PATCH v2 8/8] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
On Tue, Apr 17, 2018 at 02:08:15PM +1200, Michael Schmitz wrote:
> Add platform device driver to populate the ax88796 platform data from
> information provided by the XSurf100 zorro device driver.
> This driver will have to be loaded before loading the ax88796 module,
> or compiled as built-in.
>
> Signed-off-by: Michael Karcher
> Signed-off-by: Michael Schmitz
> ---
>  drivers/net/ethernet/8390/Kconfig    |  14 +-
>  drivers/net/ethernet/8390/Makefile   |   1 +
>  drivers/net/ethernet/8390/xsurf100.c | 411 ++++++++++++++++++++++++++++++++++
>  3 files changed, 425 insertions(+), 1 deletions(-)
>  create mode 100644 drivers/net/ethernet/8390/xsurf100.c
>
> diff --git a/drivers/net/ethernet/8390/Kconfig b/drivers/net/ethernet/8390/Kconfig
> index fdc6734..0cadd45 100644
> --- a/drivers/net/ethernet/8390/Kconfig
> +++ b/drivers/net/ethernet/8390/Kconfig
> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>
>  config AX88796
>  	tristate "ASIX AX88796 NE2000 clone support"
> -	depends on (ARM || MIPS || SUPERH)
> +	depends on (ARM || MIPS || SUPERH || AMIGA)

Hi Michael

Will it compile on other platforms? If so, it is a good idea to add
COMPILE_TEST as well.

    Andrew
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > [...]
> > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq,
> > > > > > > +			     unsigned int head,
> > > > > > > +			     void **ctx)
> > > > > > > +{
> > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > +	unsigned int i, j;
> > > > > > > +
> > > > > > > +	/* Clear data ptr. */
> > > > > > > +	vq->desc_state[head].data = NULL;
> > > > > > > +
> > > > > > > +	i = head;
> > > > > > > +
> > > > > > > +	for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > +		desc->flags = 0x0;
> > > > > >
> > > > > > Looks like this is unnecessary.
> > > > >
> > > > > It's safer to zero it. If we don't zero it, after we
> > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > this function, the desc is still available to the
> > > > > device.
> > > >
> > > > Well detach_unused_buf_packed() should be called after device is
> > > > stopped, otherwise even if you try to clear, there will still be
> > > > a window that device may use it.
> > >
> > > This is not about whether the device has been stopped or
> > > not. We don't have other places to re-initialize the ring
> > > descriptors and wrap_counter. So they need to be set to
> > > the correct values when doing detach_unused_buf.
> > >
> > > Best regards,
> > > Tiwei Bie
> >
> > find vqs is the time to do it.
>
> The .find_vqs() will call .setup_vq() which will eventually
> call vring_create_virtqueue(). It's a different case.
> Here we're talking about re-initializing the descs and updating
> the wrap counter when detaching the unused descs (In this
> case, split ring just needs to decrease vring.avail->idx).
>
> Best regards,
> Tiwei Bie

There's no requirement that virtqueue_detach_unused_buf
re-initializes the descs. It happens on the cleanup path just before
drivers delete the vqs.

-- 
MST
[bpf-next PATCH] samples/bpf: fix xdp_monitor user output for tracepoint exception
The variable rec_i contains an XDP action code, not an error. Thus,
using err2str() was wrong; it should have been action2str().

Signed-off-by: Jesper Dangaard Brouer
---
 samples/bpf/xdp_monitor_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/xdp_monitor_user.c b/samples/bpf/xdp_monitor_user.c
index eec14520d513..894bc64c2cac 100644
--- a/samples/bpf/xdp_monitor_user.c
+++ b/samples/bpf/xdp_monitor_user.c
@@ -330,7 +330,7 @@ static void stats_print(struct stats_record *stats_rec,
 			pps = calc_pps_u64(r, p, t);
 			if (pps > 0)
 				printf(fmt1, "Exception", i,
-				       0.0, pps, err2str(rec_i));
+				       0.0, pps, action2str(rec_i));
 		}
 		pps = calc_pps_u64(&rec->total, &prev->total, t);
 		if (pps > 0)
Re: [PATCH bpf-next 01/10] [bpf]: adding bpf_xdp_adjust_tail helper
Hi Nikita,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Nikita-V-Shirokov/introduction-of-bpf_xdp_adjust_tail/20180417-211905
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-s1-201815 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   net/core/filter.c: In function 'bpf_xdp_adjust_tail':
>> net/core/filter.c:2726:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     void *data_end = xdp->data_end + offset;
     ^~~~

vim +2726 net/core/filter.c

  2719
  2720	BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
  2721	{
  2722		/* only shrinking is allowed for now. */
  2723		if (unlikely(offset > 0))
  2724			return -EINVAL;
  2725
> 2726		void *data_end = xdp->data_end + offset;
  2727
  2728		if (unlikely(data_end < xdp->data + ETH_HLEN))
  2729			return -EINVAL;
  2730
  2731		xdp->data_end = data_end;
  2732
  2733		return 0;
  2734	}
  2735

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
From: Simon Horman
Date: Tue, 17 Apr 2018 10:50:26 +0200

> From: Masaru Nagai
>
> [   58.490829] =
> [   58.495205] [ INFO: inconsistent lock state ]
> [   58.499583] 4.9.0-yocto-standard-7-g2ef7caf #57 Not tainted
 ...
> Fixes: f51bdc236b6c ("ravb: Add dma queue interrupt support")
> Signed-off-by: Masaru Nagai
> Signed-off-by: Kazuya Mizuguchi
> Signed-off-by: Simon Horman

This really needs more than the lockdep dump in the commit message,
explaining what the problem is and how it was corrected.

Are the wrong interrupt types enabled? Are they handled improperly?
It definitely isn't clear from just reading the patch.
Re: One question about __tcp_select_window()
On 04/17/2018 06:53 AM, Wang Jian wrote:
> I tested the fix with 4.17.0-rc1+ and it seems to work.
>
> 1. iperf -c IP -i 20 -t 60 -w 1K
>    with-fix vs without-fix: 1.15 Gbits/sec vs 1.05 Gbits/sec
>    I also tried other windows and have similar results.
>
> 2. Used tcp probe to trace snd_wnd.
>    with-fix vs without-fix: 1245568 vs 1042816
>
> 3. I don't see extra retransmits/drops.

Unfortunately I have no idea what exact problem you had to solve.

Setting small windows is not exactly the path we are taking.
And I do not know how many side effects your change will have for
'standard' flows using autotuning or sane windows.
Re: One question about __tcp_select_window()
I tested the fix with 4.17.0-rc1+ and it seems to work.

1. iperf -c IP -i 20 -t 60 -w 1K
   with-fix vs without-fix: 1.15 Gbits/sec vs 1.05 Gbits/sec
   I also tried other windows and have similar results.

2. Used tcp probe to trace snd_wnd.
   with-fix vs without-fix: 1245568 vs 1042816

3. I don't see extra retransmits/drops.

On Sun, Apr 15, 2018 at 8:50 PM, Wang Jian wrote:
> Hi all,
>
> While I was reading the __tcp_select_window() code, I found that it may
> return a smaller window.
> Below is one scenario I thought of; it may not be right:
> In the function __tcp_select_window(), assume:
> full_space is 6*mss, free_space is 2*mss, tp->rcv_wnd is 3*mss.
> Also assume window scaling is disabled. Then
> window = tp->rcv_wnd > free_space && window > free_space,
> so it will round down free_space and return it.
>
> Is this expected behavior? The comment also says
> "Get the largest window that is a nice multiple of mss."
>
> Should we do something like below? Or am I missing something?
> I don't know how to verify it now.
>
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2680,9 +2680,9 @@ u32 __tcp_select_window(struct sock *sk)
>  	 * We also don't do any window rounding when the free space
>  	 * is too small.
>  	 */
> -	if (window <= free_space - mss || window > free_space)
> +	if (window <= free_space - mss)
>  		window = rounddown(free_space, mss);
> -	else if (mss == full_space &&
> +	else if (window <= free_space && mss == full_space &&
>  		 free_space > window + (full_space >> 1))
>  		window = free_space;
>  	}
>
> Thanks.
Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: Jesper Dangaard Brouer
Date: Tue, 17 Apr 2018 14:58:52 +0200

> Resubmit V10 against net-next, as it contains NIC driver changes.

Series applied, thanks Jesper.
Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
From: Geert Uytterhoeven
Date: Tue, 17 Apr 2018 10:20:25 +0200

> BTW, I have a git alias for that:
>
> $ git help fixes
> `git fixes' is aliased to `show --format='Fixes: %h ("%s")' -s'
> $ git fixes 82533ad9a1c
> Fixes: 82533ad9a1c ("net: ethernet: ax88796: don't call free_irq
> without request_irq first")

Thanks for sharing :)
Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: Alexei Starovoitov
Date: Tue, 17 Apr 2018 06:53:33 -0700

> looks like you forgot to include extra patch to fixup xdp_adjust_head()
> helper. Otherwise reused xdp_frame in the top of that packet is leaking
> kernel pointers into bpf program.
> Could you please respin with that change included?

Just in time, I was about to push this back out. :)

Jesper, please respin with Alexei's requested changes.