Re: [PATCH 6/6] wl1251: Set generated MAC address back to NVS data
Pali Rohár writes:

> In case there is no valid MAC address kernel generates random one. This
> patch propagate this generated MAC address back to NVS data which will be
> uploaded to wl1251 chip. So HW would have same MAC address as linux kernel
> uses.
>
> Signed-off-by: Pali Rohár

Why? What issue does this fix?

--
Kalle Valo
Re: [PATCH 5/6] wl1251: Parse and use MAC address from supplied NVS data
Pali Rohár writes:

> This patch implements parsing MAC address from NVS data which are sent to
> wl1251 chip. Calibration NVS data could contain valid MAC address and it
> will be used instead randomly generated.
>
> This patch also move code for requesting NVS data from userspace to driver
> initialization code to make sure that NVS data will be there at time when
> permanent MAC address is needed.
>
> Calibration NVS data for wl1251 are model specific. Every one device with
> wl1251 chip should have been calibrated in factory and needs to provide own
> calibration data.
>
> Default example wl1251-nvs.bin data found in linux-firmware repository and
> contains MAC address 00:00:20:07:03:09. So this MAC address is marked as
> invalid as it is not real device specific address, just example one.
>
> Format of calibration NVS data can be found at:
> http://notaz.gp2x.de/misc/pnd/wl1251/nvs_map.txt
>
> Signed-off-by: Pali Rohár

[...]

> +static int wl1251_read_nvs_mac(struct wl1251 *wl)
> +{
> +	u8 mac[ETH_ALEN];
> +	int i;
> +
> +	if (wl->nvs_len < 0x24)
> +		return -ENODATA;
> +
> +	/* length is 2 and data address is 0x546c (mask is 0xfffe) */
> +	if (wl->nvs[0x19] != 2 || wl->nvs[0x1a] != 0x6d || wl->nvs[0x1b] != 0x54)
> +		return -EINVAL;
> +
> +	/* MAC is stored in reverse order */
> +	for (i = 0; i < ETH_ALEN; i++)
> +		mac[i] = wl->nvs[0x1c + ETH_ALEN - i - 1];

No magic numbers, please. Replace all nvs offsets with proper defines to
make the code more readable.

--
Kalle Valo
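For illustration only, the review comment above could be addressed roughly as
follows. This is a standalone userspace sketch, not the driver code: every
define name and the nvs_read_mac() helper are hypothetical, and only the
numeric offsets and expected values are taken from the quoted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Hypothetical names; the numeric values come from the quoted patch. */
#define NVS_OFF_MAC_LEN		0x19	/* length field of the MAC record */
#define NVS_OFF_MAC_ADDR_LO	0x1a	/* low byte of the record's data address */
#define NVS_OFF_MAC_ADDR_HI	0x1b	/* high byte of the record's data address */
#define NVS_OFF_MAC_DATA	0x1c	/* first byte of the stored MAC */
#define NVS_MIN_LEN_FOR_MAC	0x24	/* shortest NVS blob that can hold the MAC */

#define NVS_MAC_LEN_VALUE	2
#define NVS_MAC_ADDR_LO_VALUE	0x6d
#define NVS_MAC_ADDR_HI_VALUE	0x54

/* Copy the MAC out of an NVS blob; returns 0 or a negative errno. */
static int nvs_read_mac(const uint8_t *nvs, size_t nvs_len,
			uint8_t mac[ETH_ALEN])
{
	int i;

	if (nvs_len < NVS_MIN_LEN_FOR_MAC)
		return -ENODATA;

	if (nvs[NVS_OFF_MAC_LEN] != NVS_MAC_LEN_VALUE ||
	    nvs[NVS_OFF_MAC_ADDR_LO] != NVS_MAC_ADDR_LO_VALUE ||
	    nvs[NVS_OFF_MAC_ADDR_HI] != NVS_MAC_ADDR_HI_VALUE)
		return -EINVAL;

	/* MAC is stored in reverse order */
	for (i = 0; i < ETH_ALEN; i++)
		mac[i] = nvs[NVS_OFF_MAC_DATA + ETH_ALEN - i - 1];

	return 0;
}
```

The logic is unchanged from the quoted hunk; the only difference is that each
offset now carries a name that says what it points at.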
Re: netvsc NAPI patch process
On Thu, Jan 26, 2017 at 01:06:46PM -0500, David Miller wrote:
> From: Stephen Hemminger
> Date: Thu, 26 Jan 2017 10:04:05 -0800
>
> > I have a working set of patches to enable NAPI in the netvsc driver.
> > The problem is that it requires a set of patches to vmbus layer as well.
> > Since vmbus patches have been going through char-misc-next tree rather
> > than net-next, it is difficult to stage these.
> >
> > How about if I send the vmbus patches through normal driver-devel upstream
> > and during the 4.10 merge window send the last 3 patches for NAPI for
> > linux-net tree to get into 4.10?
>
> Another option is that the char-misc-next folks create a branch with just
> the commits you need for NAPI, I pull that into net-next, and then you
> can submit the NAPI changes to me.

I can easily do that, or I have no problem with the vmbus changes going
through the net-next tree, if they are sane (i.e. let me review them
please...)

Whichever is easier for the networking developers, their tree is much
crazier than the tiny char-misc tree is :)

thanks,

greg k-h
[PATCH v2] net: phy: micrel: add support for KSZ8795
This adds support for the PHYs in the KSZ8795 5-port managed switch.
It will allow to detect the link between the switch and the SoC
and uses the same read_status functions as the KSZ8873MLL switch.

Signed-off-by: Sean Nyekjaer
---
Changes in v2:
- Removed "switch" name

 drivers/net/phy/micrel.c   | 14 ++++++++++++++
 include/linux/micrel_phy.h |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index ea92d524d5a8..fab56c9350cf 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -1014,6 +1014,20 @@ static struct phy_driver ksphy_driver[] = {
 	.get_stats	= kszphy_get_stats,
 	.suspend	= genphy_suspend,
 	.resume		= genphy_resume,
+}, {
+	.phy_id		= PHY_ID_KSZ8795,
+	.phy_id_mask	= MICREL_PHY_ID_MASK,
+	.name		= "Micrel KSZ8795",
+	.features	= (SUPPORTED_Pause | SUPPORTED_Asym_Pause),
+	.flags		= PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
+	.config_init	= kszphy_config_init,
+	.config_aneg	= ksz8873mll_config_aneg,
+	.read_status	= ksz8873mll_read_status,
+	.get_sset_count = kszphy_get_sset_count,
+	.get_strings	= kszphy_get_strings,
+	.get_stats	= kszphy_get_stats,
+	.suspend	= genphy_suspend,
+	.resume		= genphy_resume,
 } };
 
 module_phy_driver(ksphy_driver);
diff --git a/include/linux/micrel_phy.h b/include/linux/micrel_phy.h
index 257173e0095e..f541da68d1e7 100644
--- a/include/linux/micrel_phy.h
+++ b/include/linux/micrel_phy.h
@@ -35,6 +35,8 @@
 #define PHY_ID_KSZ886X		0x00221430
 #define PHY_ID_KSZ8863		0x00221435
 
+#define PHY_ID_KSZ8795		0x00221550
+
 /* struct phy_device dev_flags definitions */
 #define MICREL_PHY_50MHZ_CLK	0x0001
 #define MICREL_PHY_FXEN		0x0002
--
2.11.0
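As a rough illustration of how the new table entry gets selected: a PHY driver
entry is matched by comparing the ID read from the hardware with .phy_id under
.phy_id_mask. The sketch below is standalone userspace C. PHY_ID_KSZ8795 is
taken from the patch; the mask value 0x00fffff0 and the phy_id_matches()
helper are assumptions made for the example and should be checked against
include/linux/micrel_phy.h and the phylib core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PHY_ID_KSZ8795		0x00221550u	/* from the patch above */
#define MICREL_PHY_ID_MASK	0x00fffff0u	/* assumed value, low nibble ignored */

/* Hypothetical helper: true if the hardware ID matches the driver entry. */
static bool phy_id_matches(uint32_t hw_id, uint32_t drv_id, uint32_t mask)
{
	return (hw_id & mask) == (drv_id & mask);
}
```

With this mask the low four bits (typically a revision number) do not take
part in the comparison, so 0x00221551 would still bind to the KSZ8795 entry,
while the KSZ886X ID 0x00221430 would not.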
Re: [PATCH v2 0/4] USB support for Broadcom NSP SoC
Hi,

On Thursday 26 January 2017 10:57 PM, Florian Fainelli wrote:
> On 01/26/2017 07:34 AM, Kishon Vijay Abraham I wrote:
>>
>>
>> On Tuesday 17 January 2017 09:44 PM, Yendapally Reddy Dhananjaya Reddy wrote:
>>> This patch set contains the usb support for Broadcom NSP SoC. The
>>> usb3 phy is internal to the SoC and is accessed through mdio interface.
>>> The mdio interface can be used to access either internal usb3 phy or
>>> external ethernet phy using a multiplexer.
>>>
>>> The first patch provides the documentation details for usb3 phy. The
>>> second patch provides the changes to the mdio bus driver. The third
>>> patch provides the phy driver and fourth patch provides the enable
>>> method for usb.
>>
>> merged the 1st and 4th patch to linux-phy.
>
> You mean 1st and 3rd here, right? 4th is a Device Tree change that I
> should take, and Patch 2 should be merged by David.

yeah, sorry!
>
> What branch in your tree should we be looking at?

git://git.kernel.org/pub/scm/linux/kernel/git/kishon/linux-phy.git next

Thanks
Kishon
Re: [PATCH 2/6] wl1251: Use request_firmware_prefer_user() for loading NVS calibration data
Pali Rohár writes:

> NVS calibration data for wl1251 are model specific. Every one device with
> wl1251 chip has different and calibrated in factory.
>
> Not all wl1251 chips have own EEPROM where are calibration data stored. And
> in that case there is no "standard" place. Every device has stored them on
> different place (some in rootfs file, some in dedicated nand partition,
> some in another proprietary structure).
>
> Kernel wl1251 driver cannot support every one different storage decided by
> device manufacture so it will use request_firmware_prefer_user() call for
> loading NVS calibration data and userspace helper will be responsible to
> prepare correct data.
>
> In case userspace helper fails request_firmware_prefer_user() still try to
> load data file directly from VFS as fallback mechanism.
>
> On Nokia N900 device which has wl1251 chip, NVS calibration data are stored
> in CAL nand partition. CAL is proprietary Nokia key/value format for nand
> devices.
>
> With this patch it is finally possible to load correct model specific NVS
> calibration data for Nokia N900.
>
> Signed-off-by: Pali Rohár
> ---
>  drivers/net/wireless/ti/wl1251/Kconfig |    1 +
>  drivers/net/wireless/ti/wl1251/main.c  |    2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/ti/wl1251/Kconfig b/drivers/net/wireless/ti/wl1251/Kconfig
> index 7142ccf..affe154 100644
> --- a/drivers/net/wireless/ti/wl1251/Kconfig
> +++ b/drivers/net/wireless/ti/wl1251/Kconfig
> @@ -2,6 +2,7 @@ config WL1251
>  	tristate "TI wl1251 driver support"
>  	depends on MAC80211
>  	select FW_LOADER
> +	select FW_LOADER_USER_HELPER
>  	select CRC7
>  	---help---
>  	  This will enable TI wl1251 driver support. The drivers make
> diff --git a/drivers/net/wireless/ti/wl1251/main.c b/drivers/net/wireless/ti/wl1251/main.c
> index 208f062..24f8866 100644
> --- a/drivers/net/wireless/ti/wl1251/main.c
> +++ b/drivers/net/wireless/ti/wl1251/main.c
> @@ -110,7 +110,7 @@ static int wl1251_fetch_nvs(struct wl1251 *wl)
>  	struct device *dev = wiphy_dev(wl->hw->wiphy);
>  	int ret;
>
> -	ret = request_firmware(&fw, WL1251_NVS_NAME, dev);
> +	ret = request_firmware_prefer_user(&fw, WL1251_NVS_NAME, dev);

I don't see the need for this. Just remove the default nvs file from
filesystem and the fallback user helper will be always used, right? Like
we discussed earlier, the default nvs file should not be used by normal
users.

--
Kalle Valo
Re: [PATCH net-next] net: Fix ndo_setup_tc comment
Thu, Jan 26, 2017 at 11:44:17PM CET, f.faine...@gmail.com wrote:
>Commit 16e5cc647173 ("net: rework setup_tc ndo op to consume
>general tc operand") changed the ndo_setup_tc() signature, but did not
>update the comments in netdevice.h, so do that now.
>
>Signed-off-by: Florian Fainelli

Reviewed-by: Jiri Pirko
[PATCH RFC ipsec-next 1/2] net: Drop secpath on free after gro merge.
With a followup patch, a gro merged skb can have a secpath.
So drop it before freeing or reusing the skb.

Signed-off-by: Steffen Klassert
---
 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 56818f7..ef3a969 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4623,6 +4623,7 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 	case GRO_MERGED_FREE:
 		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
 			skb_dst_drop(skb);
+			secpath_reset(skb);
 			kmem_cache_free(skbuff_head_cache, skb);
 		} else {
 			__kfree_skb(skb);
@@ -4663,6 +4664,7 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 	skb->encapsulation = 0;
 	skb_shinfo(skb)->gso_type = 0;
 	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
+	secpath_reset(skb);
 
 	napi->skb = skb;
 }
--
1.9.1
[PATCH RFC ipsec-next 2/2] xfrm: Add a dummy network device for napi.
This patch adds a dummy network device so that we can
use gro_cells for IPsec GRO. With this, we handle IPsec
GRO with no impact on the generic networking code.

Signed-off-by: Steffen Klassert
---
 net/xfrm/xfrm_input.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 6e3f025..3213fe8 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -21,6 +21,9 @@
 static DEFINE_SPINLOCK(xfrm_input_afinfo_lock);
 static struct xfrm_input_afinfo __rcu *xfrm_input_afinfo[NPROTO];
 
+static struct gro_cells gro_cells;
+static struct net_device xfrm_napi_dev;
+
 int xfrm_input_register_afinfo(struct xfrm_input_afinfo *afinfo)
 {
 	int err = 0;
@@ -371,7 +374,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
 	if (decaps) {
 		skb_dst_drop(skb);
-		netif_rx(skb);
+		gro_cells_receive(&gro_cells, skb);
 		return 0;
 	} else {
 		return x->inner_mode->afinfo->transport_finish(skb, async);
@@ -394,6 +397,13 @@ int xfrm_input_resume(struct sk_buff *skb, int nexthdr)
 
 void __init xfrm_input_init(void)
 {
+	int err;
+
+	init_dummy_netdev(&xfrm_napi_dev);
+	err = gro_cells_init(&gro_cells, &xfrm_napi_dev);
+	if (err)
+		gro_cells.cells = NULL;
+
 	secpath_cachep = kmem_cache_create("secpath_cache",
 					   sizeof(struct sec_path),
 					   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
--
1.9.1
[PATCH RFC ipsec-next v2] IPsec GRO
This adds a dummy network device so that we can use gro_cells
for IPsec GRO. We now may have a secpath at a GRO merged skb,
so we need to drop it. This is the only change to the generic
networking code.

The packet still travels two times through the stack, but might
be aggregated in the second round. We can avoid the second round
with implementing GRO callbacks for the IPsec protocols. This
will be a separate patchset as this needs some more generic
networking changes because of the asynchronous nature of IPsec.
Re: [PATCH RFC ipsec-next 2/2] xfrm: Add a device independent napi instance
On Thu, Jan 26, 2017 at 07:10:22AM -0800, Eric Dumazet wrote:
> On Thu, 2017-01-26 at 06:26 -0800, Eric Dumazet wrote:
>
> >
> > Alternative would be to use a
> >
> > static struct net_device xfrm_napi_anchor_device;
> >
> > and use gro_cell
>
> Also take a look at init_dummy_netdev()

I already thought about using some dummy net_device for this,
but was not sure what I need to initialize. So it seemed to be
safer to use a private napi instance.

init_dummy_netdev() is exactly what I need for that, thanks a lot!
Re: fs, net: deadlock between bind/splice on af_unix
On Thu, Jan 26, 2017 at 09:11:07PM -0800, Cong Wang wrote:
> On Thu, Jan 26, 2017 at 3:29 PM, Mateusz Guzik wrote:
> > On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
> >> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov wrote:
> >> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro wrote:
> >> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
> >> >>
> >> >>> > Why do we do autobind there, anyway, and why is it conditional on
> >> >>> > SOCK_PASSCRED? Note that e.g. for SOCK_STREAM we can bloody well get
> >> >>> > to sending stuff without autobind ever done - just use socketpair()
> >> >>> > to create that sucker and we won't be going through the connect()
> >> >>> > at all.
> >> >>>
> >> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls unix_autobind(),
> >> >>> not SOCK_STREAM.
> >> >>
> >> >> Yes, I've noticed. What I'm asking is what in there needs autobind triggered
> >> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
> >> >>
> >> >>> I guess some lock, perhaps the u->bindlock could be dropped before
> >> >>> acquiring the next one (sb_writer), but I need to double check.
> >> >>
> >> >> Bad idea, IMO - do you *want* autobind being able to come through while
> >> >> bind(2) is busy with mknod?
> >> >
> >> >
> >> > Ping. This is still happening on HEAD.
> >> >
> >>
> >> Thanks for your reminder. Mind to give the attached patch (compile only)
> >> a try? I take another approach to fix this deadlock, which moves the
> >> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
> >> impact with this way.
> >>
> >
> > I don't think this is the right approach.
> >
> > Currently the file creation is potponed until unix_bind can no longer
> > fail otherwise. With it reordered, it may be someone races you with a
> > different path and now you are left with a file to clean up. Except it
> > is quite unclear for me if you can unlink it.
>
> What races do you mean here? If you mean someone could get a
> refcount of that file, it could happen no matter we have bindlock or not
> since it is visible once created. The filesystem layer should take care of
> the file refcount so all we need to do here is calling path_put() as in my
> patch. Or if you mean two threads calling unix_bind() could race without
> binlock, only one of them should succeed the other one just fails out.

Two threads can race and one fails with EINVAL.

With your patch there is a new file created and it is unclear what to
do with it - leaving it as it is sounds like the last resort and
unlinking it sounds extremely fishy as it opens you to games played by
the user.
Re: [PATCH net-next 3/4] mlx5: Make building tc hardware offload configurable
Hi Tom,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Tom-Herbert/mlx5-Create-build-configuration-options/20170127-084348
config: x86_64-randconfig-s3-01271208 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

Note: the linux-review/Tom-Herbert/mlx5-Create-build-configuration-options/20170127-084348 HEAD fe8939265468a7204ffc5b1c6c878b39bae7e7d0 builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `mlx5e_configure_flower':
>> (.text+0x22a6a1): undefined reference to `mlx5e_vxlan_lookup_port'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
RE: [patch net-next v2 2/4] net/sched: Introduce sample tc action
>-----Original Message-----
>From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
>Sent: Thursday, January 26, 2017 1:30 AM
>To: Jiri Pirko
>Cc: Linux Kernel Network Developers; David Miller; Yotam Gigi; Ido Schimmel;
>Elad Raz; Nogah Frankel; Or Gerlitz; Jamal Hadi Salim;
>geert+rene...@glider.be; Stephen Hemminger; Guenter Roeck; Roopa Prabhu;
>John Fastabend; Simon Horman; Roman Mashak
>Subject: Re: [patch net-next v2 2/4] net/sched: Introduce sample tc action
>
>On Mon, Jan 23, 2017 at 2:07 AM, Jiri Pirko wrote:
>> +
>> +static int tcf_sample_init(struct net *net, struct nlattr *nla,
>> +			   struct nlattr *est, struct tc_action **a, int ovr,
>> +			   int bind)
>> +{
>> +	struct tc_action_net *tn = net_generic(net, sample_net_id);
>> +	struct nlattr *tb[TCA_SAMPLE_MAX + 1];
>> +	struct psample_group *psample_group;
>> +	struct tc_sample *parm;
>> +	struct tcf_sample *s;
>> +	bool exists = false;
>> +	int ret;
>> +
>> +	if (!nla)
>> +		return -EINVAL;
>> +	ret = nla_parse_nested(tb, TCA_SAMPLE_MAX, nla, sample_policy);
>> +	if (ret < 0)
>> +		return ret;
>> +	if (!tb[TCA_SAMPLE_PARMS] || !tb[TCA_SAMPLE_RATE] ||
>> +	    !tb[TCA_SAMPLE_PSAMPLE_GROUP])
>> +		return -EINVAL;
>> +
>> +	parm = nla_data(tb[TCA_SAMPLE_PARMS]);
>> +
>> +	exists = tcf_hash_check(tn, parm->index, a, bind);
>> +	if (exists && bind)
>> +		return 0;
>> +
>> +	if (!exists) {
>> +		ret = tcf_hash_create(tn, parm->index, est, a,
>> +				      &act_sample_ops, bind, false);
>> +		if (ret)
>> +			return ret;
>> +		ret = ACT_P_CREATED;
>> +	} else {
>> +		tcf_hash_release(*a, bind);
>> +		if (!ovr)
>> +			return -EEXIST;
>> +	}
>> +	s = to_sample(*a);
>> +
>> +	ASSERT_RTNL();
>
>Copy-n-paste from mirred aciton? This is not needed for you, mirred
>needs it because of target netdevice.

Ho, you are right. I will remove it.

>
>
>> +	s->tcf_action = parm->action;
>> +	s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
>> +	s->psample_group_num = nla_get_u32(tb[TCA_SAMPLE_PSAMPLE_GROUP]);
>> +	psample_group = psample_group_get(net, s->psample_group_num);
>> +	if (!psample_group)
>> +		return -ENOMEM;
>
>I don't think you can just return here, needs tcf_hash_cleanup() for create
>case, right?

Will fix.

>
>
>> +	RCU_INIT_POINTER(s->psample_group, psample_group);
>> +
>> +	if (tb[TCA_SAMPLE_TRUNC_SIZE]) {
>> +		s->truncate = true;
>> +		s->trunc_size = nla_get_u32(tb[TCA_SAMPLE_TRUNC_SIZE]);
>> +	}
>
>
>Do you need tcf_lock here if RCU only protects ->psample_group??
>

You are right. I need to protect in case of update.

I will send a fixup patch in the following days. Thanks!

>
>> +
>> +	if (ret == ACT_P_CREATED)
>> +		tcf_hash_insert(tn, *a);
>> +	return ret;
>> +}
>> +
>
>
>Thanks.
Re: [PATCH net-next v2] macb: Common code to enable ptp support for MACB/GEM
Hi Rafal,

On Thu, Jan 26, 2017 at 8:45 PM, Rafal Ozieblo wrote:
>> -----Original Message-----
>> From: Andrei Pistirica [mailto:andrei.pistir...@microchip.com]
>> Sent: 19 stycznia 2017 16:56
>> Subject: [PATCH net-next v2] macb: Common code to enable ptp support for MACB/GEM
>>
>>
>> +static inline bool gem_has_ptp(struct macb *bp)
>> +{
>> +	return !!(bp->caps & MACB_CAPS_GEM_HAS_PTP);
>> +}
> Why don't you use hardware capabilities here? Would it be better to read it
> from hardware instead adding it to many configuration?

If you are referring to TSU bit in DCFG5, then we will be relying on
Cadence IP's information irrespective of the SoC capability and whether
the PTP support was adequate. I think the capability approach gives
better control and it is not really much to add.

Regards,
Harini
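To make the two options in this exchange concrete, here is a small standalone
sketch contrasting a per-SoC capability flag with a bit read back from the IP
configuration register. All names and bit positions (struct macb_sketch,
SKETCH_CAPS_GEM_HAS_PTP, SKETCH_DCFG5_TSU) are hypothetical stand-ins, not the
driver's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this example. */
#define SKETCH_CAPS_GEM_HAS_PTP	(1u << 6)	/* set by the per-SoC config */
#define SKETCH_DCFG5_TSU	(1u << 8)	/* reported by the IP block */

struct macb_sketch {
	uint32_t caps;	/* capabilities chosen by the driver per SoC */
	uint32_t dcfg5;	/* snapshot of the design configuration register */
};

/* Capability approach: PTP is used only if the SoC entry opts in. */
static bool has_ptp_by_caps(const struct macb_sketch *bp)
{
	return !!(bp->caps & SKETCH_CAPS_GEM_HAS_PTP);
}

/* Hardware approach: trust whatever the IP block advertises. */
static bool has_ptp_by_hw(const struct macb_sketch *bp)
{
	return !!(bp->dcfg5 & SKETCH_DCFG5_TSU);
}
```

The case Harini describes is the one where the register advertises a
timestamp unit but the SoC integration is not adequate: has_ptp_by_hw() would
return true there, while has_ptp_by_caps() stays false until someone adds the
flag for that SoC.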
Re: [PATCH net-next 1/4] mlx5: Make building eswitch configurable
On Fri, Jan 27, 2017 at 1:32 AM, Tom Herbert wrote:
> Add a configuration option (CONFIG_MLX5_CORE_ESWITCH) for controlling
> whether the eswitch code is built. Change Kconfig and Makefile
> accordingly.

Tom, FWIW, please note that the basic e-switch functionality is needed
also when SRIOV isn't of use, this is for a multi host configuration.

Or.

My WW (and same for the rest of the IL team..) has ended so I will be
able to further look on this series and comment on Sunday.
Re: fs, net: deadlock between bind/splice on af_unix
On Thu, Jan 26, 2017 at 3:29 PM, Mateusz Guzik wrote:
> On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
>> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov wrote:
>> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro wrote:
>> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
>> >>
>> >>> > Why do we do autobind there, anyway, and why is it conditional on
>> >>> > SOCK_PASSCRED? Note that e.g. for SOCK_STREAM we can bloody well get
>> >>> > to sending stuff without autobind ever done - just use socketpair()
>> >>> > to create that sucker and we won't be going through the connect()
>> >>> > at all.
>> >>>
>> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls unix_autobind(),
>> >>> not SOCK_STREAM.
>> >>
>> >> Yes, I've noticed. What I'm asking is what in there needs autobind triggered
>> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
>> >>
>> >>> I guess some lock, perhaps the u->bindlock could be dropped before
>> >>> acquiring the next one (sb_writer), but I need to double check.
>> >>
>> >> Bad idea, IMO - do you *want* autobind being able to come through while
>> >> bind(2) is busy with mknod?
>> >
>> >
>> > Ping. This is still happening on HEAD.
>> >
>>
>> Thanks for your reminder. Mind to give the attached patch (compile only)
>> a try? I take another approach to fix this deadlock, which moves the
>> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
>> impact with this way.
>>
>
> I don't think this is the right approach.
>
> Currently the file creation is potponed until unix_bind can no longer
> fail otherwise. With it reordered, it may be someone races you with a
> different path and now you are left with a file to clean up. Except it
> is quite unclear for me if you can unlink it.

What races do you mean here? If you mean someone could get a
refcount of that file, it could happen no matter we have bindlock or not
since it is visible once created. The filesystem layer should take care of
the file refcount so all we need to do here is calling path_put() as in my
patch. Or if you mean two threads calling unix_bind() could race without
binlock, only one of them should succeed the other one just fails out.
[PATCHv5 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs
The Marvell 98DX3236, 98DX3336, 98DX4251 and variants are switch ASICs
with integrated CPUs. They are similar to the Armada XP SoCs but have
different I/O interfaces.

Signed-off-by: Chris Packham
Acked-by: Rob Herring
---

Notes:
    Changes in v2:
    - Update devicetree binding documentation to reflect that 98DX3336 and
      984251 are supersets of 98DX3236.
    - disable crypto block
    - disable sdio for 98DX3236, enable for 98DX4251
    Changes in v3:
    - fix typo 4521 -> 4251
    - document prestera bindings
    - rework corediv-clock binding
    - add label to packet processor node
    - add new compatible string for DFX server
    Changes in v4:
    - Collect ack from Rob
    Changes in v5:
    - Fixup license text. Add labels to nodes.

 .../devicetree/bindings/arm/marvell/98dx3236.txt | 23 ++
 .../devicetree/bindings/net/marvell,prestera.txt | 50
 arch/arm/boot/dts/armada-xp-98dx3236.dtsi | 254 +
 arch/arm/boot/dts/armada-xp-98dx3336.dtsi | 76 ++
 arch/arm/boot/dts/armada-xp-98dx4251.dtsi | 90
 5 files changed, 493 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/marvell/98dx3236.txt
 create mode 100644 Documentation/devicetree/bindings/net/marvell,prestera.txt
 create mode 100644 arch/arm/boot/dts/armada-xp-98dx3236.dtsi
 create mode 100644 arch/arm/boot/dts/armada-xp-98dx3336.dtsi
 create mode 100644 arch/arm/boot/dts/armada-xp-98dx4251.dtsi

diff --git a/Documentation/devicetree/bindings/arm/marvell/98dx3236.txt b/Documentation/devicetree/bindings/arm/marvell/98dx3236.txt
new file mode 100644
index ..64e8c73fc5ab
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/marvell/98dx3236.txt
@@ -0,0 +1,23 @@
+Marvell 98DX3236, 98DX3336 and 98DX4251 Platforms Device Tree Bindings
+--
+
+Boards with a SoC of the Marvell 98DX3236, 98DX3336 and 98DX4251 families
+shall have the following property:
+
+Required root node property:
+
+compatible: must contain "marvell,armadaxp-98dx3236"
+
+In addition, boards using the Marvell 98DX3336 SoC shall have the
+following property:
+
+Required root node property:
+
+compatible: must contain "marvell,armadaxp-98dx3336"
+
+In addition, boards using the Marvell 98DX4251 SoC shall have the
+following property:
+
+Required root node property:
+
+compatible: must contain "marvell,armadaxp-98dx4251"
diff --git a/Documentation/devicetree/bindings/net/marvell,prestera.txt b/Documentation/devicetree/bindings/net/marvell,prestera.txt
new file mode 100644
index ..5fbab29718e8
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/marvell,prestera.txt
@@ -0,0 +1,50 @@
+Marvell Prestera Switch Chip bindings
+-
+
+Required properties:
+- compatible: one of the following
+	"marvell,prestera-98dx3236",
+	"marvell,prestera-98dx3336",
+	"marvell,prestera-98dx4251",
+- reg: address and length of the register set for the device.
+- interrupts: interrupt for the device
+
+Optional properties:
+- dfx: phandle reference to the "DFX Server" node
+
+Example:
+
+switch {
+	compatible = "simple-bus";
+	#address-cells = <1>;
+	#size-cells = <1>;
+	ranges = <0 MBUS_ID(0x03, 0x00) 0 0x10>;
+
+	packet-processor@0 {
+		compatible = "marvell,prestera-98dx3236";
+		reg = <0 0x400>;
+		interrupts = <33>, <34>, <35>;
+		dfx = <&dfx>;
+	};
+};
+
+DFX Server bindings
+---
+
+Required properties:
+- compatible: must be "marvell,dfx-server"
+- reg: address and length of the register set for the device.
+
+Example:
+
+dfx-registers {
+	compatible = "simple-bus";
+	#address-cells = <1>;
+	#size-cells = <1>;
+	ranges = <0 MBUS_ID(0x08, 0x00) 0 0x10>;
+
+	dfx: dfx@0 {
+		compatible = "marvell,dfx-server";
+		reg = <0 0x10>;
+	};
+};
diff --git a/arch/arm/boot/dts/armada-xp-98dx3236.dtsi b/arch/arm/boot/dts/armada-xp-98dx3236.dtsi
new file mode 100644
index ..9461128fae24
--- /dev/null
+++ b/arch/arm/boot/dts/armada-xp-98dx3236.dtsi
@@ -0,0 +1,254 @@
+/*
+ * Device Tree Include file for Marvell 98dx3236 family SoC
+ *
+ * Copyright (C) 2016 Allied Telesis Labs
+ *
+ * This file is dual-licensed: you can use it either under the terms
+ * of the GPL or the X11 license, at your option. Note that this dual
+ * licensing only applies to this file, and not this project as a
+ * whole.
+ *
+ * a) This file is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This file is distributed in the hope that it will be
Re: loopback device reference count leakage
On Thu, Jan 26, 2017 at 05:01:38PM -0800, Cong Wang wrote:
> I'd suggest to you add some debugging printk's to the dst refcount functions,
> or maybe just inside dst_gc_task(). I think the last dst referring to the loopback
> dev is still being referred at that point, which prevents GC from destroying it.

Thanks for the suggestion! I will test it out.

> Meanwhile, if it would be also helpful if you can share how you managed to
> reproduce this reliably, I saw this bug in our data center before but never
> know how to reproduce it.

I used one of our applications to reproduce the issue. To be honest, I
haven't completely isolated which part of the code is triggering the
bug. However, the suspicion is that, since the application basically
acts as a web crawler, the bug is manifested after initiating a large
number of connections to a wide range of IP addresses in a short period
of time.

Hope it somewhat helps.

Thanks,
Kaiwen
cls_matchall and port mirroring questions
Hi,

As I am adding support for cls_matchall in the b53/bcm_sf2 drivers, I
was looking into several, yet unrelated things:

- mlxsw does not seem to specify whether the port used for capture
  remains usable, or blocks non-mirror traffic ingressing/egressing it,
  do we want a control knob for that? If not, what is a sensible
  default, block all non capture traffic?

- do we have an updated man page for tc-matchall.8 that features how to
  use the statistical sampler too? b53 switches have a divider that
  allows us to select how many frames we want to receive (10 bit value).

- b53 supports capture against a particular MAC SA or DA (or both), do
  we want to be able to control that somehow? What about Marvell
  switches, what can they do?

- a fair amount of code dealing with the cls_matchall mirroring entry is
  not switch driver specific, in fact, the only things that are switch
  driver specific are:
	- list pointer where to store this entry (typically in the
	  private network device context)
	- operation to check whether the device belongs to us (identical
	  netdev_ops)
	- retrieval of the destination port number (to_port) which is
	  also typically available in network device private context

Do we want to move a fair amount of code into switchdev, treat
cls_matchall entries as a specific switchdev object, and have drivers
take over at the same level that mlxsw_sp_port_add_cls_matchall_mirror()
currently starts?

Thanks!
--
Florian
Re: [PATCH net-next] net: adjust skb->truesize in pskb_expand_head()
On Fri, 2017-01-27 at 10:22 +0800, kbuild test robot wrote:
> Hi Eric,
>
> [auto build test WARNING on net-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/net-adjust-skb-truesize-in-pskb_expand_head/20170127-082517
> config: i386-randconfig-x0-01270914 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386

Hmm... I thought sock_edemux was safe, but apparently it uses a
parameter for no good reason.

I will add this to v2

diff --git a/include/net/sock.h b/include/net/sock.h
index 7144750d14e56b9d5392e43dc46cb40a87e3d397..94e65fd703548dd40e16c30207fd55c879ed0b60 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1534,7 +1534,7 @@ void sock_efree(struct sk_buff *skb);
 #ifdef CONFIG_INET
 void sock_edemux(struct sk_buff *skb);
 #else
-#define sock_edemux(skb) sock_efree(skb)
+#define sock_edemux sock_efree
 #endif
 
 int sock_setsockopt(struct socket *sock, int level, int op,
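A minimal standalone sketch of why the one-line change above matters. A
function-like macro is only expanded when its name is followed by an opening
parenthesis, so it cannot stand in for a function address in a comparison; an
object-like macro can. The names below (struct buf, buf_efree, buf_edemux,
owned_by_edemux) are made up for the example and only mirror the shape of
sock_efree/sock_edemux.

```c
#include <assert.h>
#include <stddef.h>

struct buf {
	void (*destructor)(struct buf *b);
	int freed;
};

static void buf_efree(struct buf *b)
{
	b->freed = 1;
}

/*
 * Object-like alias: the bare name expands wherever it appears, so it
 * can be called and also compared against a function pointer.
 *
 * With the function-like form
 *     #define buf_edemux(b) buf_efree(b)
 * the bare identifier below would not be expanded and the comparison
 * would fail to compile with "'buf_edemux' undeclared".
 */
#define buf_edemux buf_efree

static int owned_by_edemux(const struct buf *b)
{
	return b->destructor == buf_edemux;
}
```

This matches the build failure reported by the robot: the
`skb->destructor == sock_edemux` test only compiles when the name resolves to
something that has an address.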
Re: [PATCH net-next] net: adjust skb->truesize in pskb_expand_head()
Hi Eric,

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/net-adjust-skb-truesize-in-pskb_expand_head/20170127-082517
config: i386-randconfig-x0-01270914 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/linux/list.h:4,
                    from include/linux/module.h:9,
                    from net/core/skbuff.c:41:
   net/core/skbuff.c: In function 'pskb_expand_head':
   net/core/skbuff.c:1265:37: error: 'sock_edemux' undeclared (first use in this function)
     if (!skb->sk || skb->destructor == sock_edemux)
                                        ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) : \
                                 ^~~~
>> net/core/skbuff.c:1265:2: note: in expansion of macro 'if'
     if (!skb->sk || skb->destructor == sock_edemux)
     ^~
   net/core/skbuff.c:1265:37: note: each undeclared identifier is reported only once for each function it appears in
     if (!skb->sk || skb->destructor == sock_edemux)
                                        ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) : \
                                 ^~~~
>> net/core/skbuff.c:1265:2: note: in expansion of macro 'if'
     if (!skb->sk || skb->destructor == sock_edemux)
     ^~

vim +/if +1265 net/core/skbuff.c

  1249		skb->end = size;
  1250		off = nhead;
  1251	#else
  1252		skb->end = skb->head + size;
  1253	#endif
  1254		skb->tail += off;
  1255		skb_headers_offset_update(skb, nhead);
  1256		skb->cloned = 0;
  1257		skb->hdr_len = 0;
  1258		skb->nohdr = 0;
  1259		atomic_set(&skb_shinfo(skb)->dataref, 1);
  1260	
  1261		/* It is not generally safe to change skb->truesize.
  1262		 * For the moment, we really care of rx path, or
  1263		 * when skb is orphaned (not attached to a socket)
  1264		 */
> 1265		if (!skb->sk || skb->destructor == sock_edemux)
  1266			skb->truesize += size - osize;
  1267	
  1268		return 0;
  1269	
  1270	nofrags:
  1271		kfree(data);
  1272	nodata:
  1273		return -ENOMEM;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Re: [net-next] openvswitch: Simplify do_execute_actions().
Nice clean-up. Acked-by: Jarno Rajahalme> On Jan 25, 2017, at 9:24 PM, Andy Zhou wrote: > > do_execute_actions() implements a worthwhile optimization: in case > an output action is the last action in an action list, skb_clone() > can be avoided by outputing the current skb. However, the > implementation is more complicated than necessary. This patch > simplify this logic. > > Signed-off-by: Andy Zhou > --- > net/openvswitch/actions.c | 40 +++- > 1 file changed, 19 insertions(+), 21 deletions(-) > > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c > index 514f7bc..3866608 100644 > --- a/net/openvswitch/actions.c > +++ b/net/openvswitch/actions.c > @@ -830,6 +830,9 @@ static void do_output(struct datapath *dp, struct sk_buff > *skb, int out_port, > { > struct vport *vport = ovs_vport_rcu(dp, out_port); > > + if (unlikely(!skb)) > + return; > + > if (likely(vport)) { > u16 mru = OVS_CB(skb)->mru; > u32 cutlen = OVS_CB(skb)->cutlen; > @@ -1141,12 +1144,6 @@ static int do_execute_actions(struct datapath *dp, > struct sk_buff *skb, > struct sw_flow_key *key, > const struct nlattr *attr, int len) > { > - /* Every output action needs a separate clone of 'skb', but the common > - * case is just a single output action, so that doing a clone and > - * then freeing the original skbuff is wasteful. So the following code > - * is slightly obscure just to avoid that. > - */ > - int prev_port = -1; > const struct nlattr *a; > int rem; > > @@ -1154,20 +1151,25 @@ static int do_execute_actions(struct datapath *dp, > struct sk_buff *skb, >a = nla_next(a, )) { > int err = 0; > > - if (unlikely(prev_port != -1)) { > - struct sk_buff *out_skb = skb_clone(skb, GFP_ATOMIC); > + switch (nla_type(a)) { > + case OVS_ACTION_ATTR_OUTPUT: { > + int port = nla_get_u32(a); > > - if (out_skb) > - do_output(dp, out_skb, prev_port, key); > + /* Every output action needs a separate clone > + * of 'skb', In case the output action is the > + * last action, cloning can be avoided. 
> + */ > + if (nla_is_last(a, rem)) { > + do_output(dp, skb, port, key); > + /* 'skb' has been used for output. > + */ > + return 0; > + } > > + do_output(dp, skb_clone(skb, GFP_ATOMIC), port, key); > OVS_CB(skb)->cutlen = 0; > - prev_port = -1; > - } > - > - switch (nla_type(a)) { > - case OVS_ACTION_ATTR_OUTPUT: > - prev_port = nla_get_u32(a); > break; > + } > > case OVS_ACTION_ATTR_TRUNC: { > struct ovs_action_trunc *trunc = nla_data(a); > @@ -1257,11 +1259,7 @@ static int do_execute_actions(struct datapath *dp, > struct sk_buff *skb, > } > } > > - if (prev_port != -1) > - do_output(dp, skb, prev_port, key); > - else > - consume_skb(skb); > - > + consume_skb(skb); > return 0; > } > > -- > 1.8.3.1 >
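The ownership rule in the patch (clone for every output action except the last one, which gets the original) can be modelled in a few lines of userspace C. This is only a sketch under simplifying assumptions: the action list contains nothing but output actions, and `struct pkt`, `pkt_clone()`, `output()` and `execute_outputs()` are hypothetical names, not Open vSwitch code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy packet; stands in for struct sk_buff in this sketch. */
struct pkt {
	int id;
};

static int clones_made;		/* number of successful pkt_clone() calls */
static int outputs_done;	/* number of packets actually sent */

static struct pkt *pkt_clone(const struct pkt *p)
{
	struct pkt *c = malloc(sizeof(*c));

	if (c) {
		*c = *p;
		clones_made++;
	}
	return c;
}

/* Consumes 'p'. Tolerates NULL, like the NULL check added to do_output(). */
static void output(struct pkt *p, int port)
{
	(void)port;
	if (!p)
		return;
	outputs_done++;
	free(p);
}

/* Takes ownership of 'p'; 'ports' holds one entry per output action. */
static void execute_outputs(struct pkt *p, const int *ports, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (i == n - 1) {
			/* Last action: hand over the original, no clone. */
			output(p, ports[i]);
			return;
		}
		output(pkt_clone(p), ports[i]);
	}
	free(p);	/* no action consumed the original */
}
```

With three output actions this makes two clones and three outputs. Passing the clone result straight to `output()` is what makes the NULL check in the callee necessary.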
Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver
On (01/26/17 19:08), Willem de Bruijn wrote:
>
> Thanks for the context. ax25_addr_parse doesn't adjust length, it only
> verifies that the contents of the variable length header matches
> protocol spec. I don't think that it or the .validate callback have to
> be modified to return length.

Yes, I noticed that too, but my reading of ax25_addr_parse was
that it checks to see that a sane L2 header has been passed in,
and if that (sane-header) is the case, it returns a pointer to the
start of data. Thus the returned (non-null) pointer minus start
should tell you the "real" header length. Is my understanding correct?

> To ensure that skb_headlen(skb) is at least a valid header length even
> when CAP_SYS_RAWIO bypasses validation perhaps revise
> dev_validate_header to take an additional skb->len parameter and
> call skb_put directly from inside that branch.

But when I scanned the af_packet code (which appears to be the
only thing that uses dev_validate_header today), it already sets
skb->data and ->len up correctly (based on len, hard_header_len etc.)
*before* calling dev_validate_header, so the additional skb_put is
not needed?

I still haven't googled up the prior discussions that led to
dev_validate_header; will probably do that tomorrow AM.

--Sowmini
Re: loopback device reference count leakage
On Thu, Jan 26, 2017 at 2:51 PM, Kaiwen Xu wrote:
> Hi Cong,
>
> I tested out your patch, it does seem to be preventing the issue from
> happening. Here are the dev_put/dev_hold() calls with your patch
> applied.

Good. Now we have narrowed the bug down to those dst's referring to
loopback_dev.

> However, what confuses me is that when the issue didn't occur, there
> were always multiple dst_ifdown() calls at the end continuously holding
> and releasing the loopback device reference count (sometimes it will be
> looping for about a minute), until the last dst_destroy() happens. E.g.
>
> Jan 11 16:14:59 kernel: [ 2033.429563] lo: dev_hold 2 dst_ifdown
> Jan 11 16:14:59 kernel: [ 2033.429565] lo: dev_put 2 dst_ifdown
> Jan 11 16:15:00 kernel: [ 2034.453484] lo: dev_hold 2 dst_ifdown
> Jan 11 16:15:00 kernel: [ 2034.453487] lo: dev_put 2 dst_ifdown
>
> (this continues...)
>
> Jan 11 16:15:01 kernel: [ 2035.129452] lo: dev_put 1 dst_destroy
>
> And when the issue did occur, the last dst_destroy() call never occurs.

Yeah, I noticed that too. So we have two cases here:

1) If these dst's (referring to loopback_dev) really need to stay in GC
for a long time, then we should just release the loopback references,
as my patch does.

2) If they don't, that is, if they are supposed to be GC'ed soon
in this case, then we should investigate why they are still there.

2) seems more likely than 1), because at the point when the loopback
device is being unregistered, the whole network namespace is going
away, all other devices are already gone, no one should take a
reference to this netns, and therefore no one should take a reference
to any dst referring to any device in it either, even though the dst GC
is global.

I'd suggest you add some debugging printk's to the dst refcount
functions, or maybe just inside dst_gc_task(). I think the last dst
referring to the loopback dev is still being referenced at that point,
which prevents GC from destroying it.

Meanwhile, it would also be helpful if you could share how you managed
to reproduce this reliably. I saw this bug in our data center before but
never knew how to reproduce it.

Thanks!
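The kind of tracing discussed in this message, where every reference-count change is logged together with its call site, can be sketched in userspace C as follows. This is a minimal illustration, not kernel code: `struct dev_model`, `hold_dbg()`, `put_dbg()` and the `HOLD`/`PUT` macros are hypothetical names. The choice to print the count after a hold and before a put is an inference from the quoted log excerpt, not something stated in the thread.

```c
#include <stdio.h>

/* Toy refcounted device; stands in for struct net_device. */
struct dev_model {
	const char *name;
	int refcnt;
};

/* Logs the count after taking the reference (assumed log convention). */
static void hold_dbg(struct dev_model *d, const char *caller)
{
	d->refcnt++;
	fprintf(stderr, "%s: dev_hold %d %s\n", d->name, d->refcnt, caller);
}

/* Logs the count before dropping the reference (assumed log convention). */
static void put_dbg(struct dev_model *d, const char *caller)
{
	fprintf(stderr, "%s: dev_put %d %s\n", d->name, d->refcnt, caller);
	d->refcnt--;
}

/* __func__ records which function changed the count. */
#define HOLD(d) hold_dbg((d), __func__)
#define PUT(d)  put_dbg((d), __func__)

/* Models a path that briefly takes and drops a reference. */
static void ifdown_model(struct dev_model *d)
{
	HOLD(d);
	PUT(d);
}

/* Models the path that drops the final reference. */
static void destroy_model(struct dev_model *d)
{
	PUT(d);
}
```

A leak then shows up as a log that never contains the final put from the destroy path.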
Re: [1/3] powerpc: bpf: remove redundant check for non-null image
On Fri, 2017-01-13 at 17:10:00 UTC, "Naveen N. Rao" wrote:
> From: Daniel Borkmann
>
> We have a check earlier to ensure we don't proceed if image is NULL. As
> such, the redundant check can be removed.
>
> Signed-off-by: Daniel Borkmann
> [Added similar changes for classic BPF JIT]
> Signed-off-by: Naveen N. Rao
> Acked-by: Alexei Starovoitov

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/052de33ca4f840bf35587eacdf78b3

cheers
Re: [2/3] powerpc: bpf: flush the entire JIT buffer
On Fri, 2017-01-13 at 17:10:01 UTC, "Naveen N. Rao" wrote:
> With bpf_jit_binary_alloc(), we allocate at a page granularity and fill
> the rest of the space with illegal instructions to mitigate BPF spraying
> attacks, while having the actual JIT'ed BPF program at a random location
> within the allocated space. Under this scenario, it would be better to
> flush the entire allocated buffer rather than just the part containing
> the actual program. We already flush the buffer from start to the end of
> the BPF program. Extend this to include the illegal instructions after
> the BPF program.
>
> Signed-off-by: Naveen N. Rao
> Acked-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/10528b9c45cfb9e8f45217ef2f5ef8

cheers
[PATCH net-next 4/4] mlx5: Make building vxlan hardware offload configurable
Add a configuration option (CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD) for controlling whether the support for UDP encapsulation offlaod is supported. Note that only VXLAN offload is supported currently, however the config option is named to be generic for UDP offloads. Signed-off-by: Tom Herbert--- drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 8 +++ drivers/net/ethernet/mellanox/mlx5/core/Makefile | 4 +++- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 27 +-- 3 files changed, 31 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index b38c920..d8ed54a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -57,3 +57,11 @@ config MLX5_CORE_EN_TC Say Y here if you want to use TC hardware offload support. If unsure, set to Y + +config MLX5_CORE_EN_UDP_ENCAP_OFFLOAD + bool "UDP encapsulation offload" + default y + ---help--- + Say Y here if you want to use UDP encapsulation hardware offload. + Currently, VXLAN offload is uspported. + diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile index c308531..c08c9c8 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile @@ -7,7 +7,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \ en_main.o en_common.o en_fs.o en_ethtool.o en_tx.o \ - en_rx.o en_rx_am.o en_txrx.o en_clock.o vxlan.o \ + en_rx.o en_rx_am.o en_txrx.o en_clock.o \ en_arfs.o en_fs_ethtool.o en_selftest.o mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) += en_dcbnl.o @@ -17,3 +17,5 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o eswitch_offloads.o en_rep. 
mlx5_core-$(CONFIG_MLX5_CORE_EN_SRIOV) += sriov.o mlx5_core-$(CONFIG_MLX5_CORE_EN_TC) += en_tc.o + +mlx5_core-$(CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD) += vxlan.o diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 2d2c982..31a8d88 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -34,14 +34,16 @@ #include #include #include -#include #include #include "en.h" #include "en_tc.h" #ifdef CONFIG_MLX5_CORE_EN_ESWITCH #include "eswitch.h" #endif +#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD +#include #include "vxlan.h" +#endif struct mlx5e_rq_param { u32 rqc[MLX5_ST_SZ_DW(rqc)]; @@ -3111,6 +3113,7 @@ static int mlx5e_get_vf_stats(struct net_device *dev, } #endif /* CONFIG_MLX5_CORE_EN_ESWITCH */ +#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD void mlx5e_add_vxlan_port(struct net_device *netdev, struct udp_tunnel_info *ti) { @@ -3171,20 +3174,22 @@ static netdev_features_t mlx5e_vxlan_features_check(struct mlx5e_priv *priv, /* Disable CSUM and GSO if the udp dport is not offloaded by HW */ return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK); } +#endif static netdev_features_t mlx5e_features_check(struct sk_buff *skb, struct net_device *netdev, netdev_features_t features) { - struct mlx5e_priv *priv = netdev_priv(netdev); - features = vlan_features_check(skb, features); +#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD features = vxlan_features_check(skb, features); /* Validate if the tunneled packet is being offloaded by HW */ if (skb->encapsulation && (features & NETIF_F_CSUM_MASK || features & NETIF_F_GSO_MASK)) - return mlx5e_vxlan_features_check(priv, skb, features); + return mlx5e_vxlan_features_check(netdev_priv(netdev), + skb, features); +#endif return features; } @@ -3365,8 +3370,10 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = { .ndo_set_features= mlx5e_set_features, .ndo_change_mtu = mlx5e_change_mtu, 
.ndo_do_ioctl= mlx5e_ioctl, +#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD .ndo_udp_tunnel_add = mlx5e_add_vxlan_port, .ndo_udp_tunnel_del = mlx5e_del_vxlan_port, +#endif .ndo_set_tx_maxrate = mlx5e_set_tx_maxrate, .ndo_features_check = mlx5e_features_check, #ifdef CONFIG_RFS_ACCEL @@ -3643,6 +3650,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev) netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX; netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER; +#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD if
[PATCH net-next 1/4] mlx5: Make building eswitch configurable
Add a configuration option (CONFIG_MLX5_CORE_ESWITCH) for controlling whether the eswitch code is built. Change Kconfig and Makefile accordingly. Signed-off-by: Tom Herbert--- drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 11 +++ drivers/net/ethernet/mellanox/mlx5/core/Makefile | 6 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 92 +-- drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 39 +++--- drivers/net/ethernet/mellanox/mlx5/core/eq.c | 4 +- drivers/net/ethernet/mellanox/mlx5/core/main.c| 16 ++-- drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 6 +- 7 files changed, 125 insertions(+), 49 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index ddb4ca4..27aae58 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -30,3 +30,14 @@ config MLX5_CORE_EN_DCB This flag is depended on the kernel's DCB support. If unsure, set to Y + +config MLX5_CORE_EN_ESWITCH + bool "Ethernet switch" + default y + depends on MLX5_CORE_EN + ---help--- + Say Y here if you want to use Ethernet switch (E-switch). E-Switch + is the software entity that represents and manages ConnectX4 + inter-HCA ethernet l2 switching. 
+ + If unsure, set to Y diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile index 9f43beb..17025d8 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile @@ -5,9 +5,11 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \ mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \ fs_counters.o rl.o lag.o dev.o -mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o eswitch_offloads.o \ +mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \ en_main.o en_common.o en_fs.o en_ethtool.o en_tx.o \ en_rx.o en_rx_am.o en_txrx.o en_clock.o vxlan.o \ - en_tc.o en_arfs.o en_rep.o en_fs_ethtool.o en_selftest.o + en_tc.o en_arfs.o en_fs_ethtool.o en_selftest.o mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) += en_dcbnl.o + +mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o eswitch_offloads.o en_rep.o diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index e829143..1db4d98 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -38,7 +38,9 @@ #include #include "en.h" #include "en_tc.h" +#ifdef CONFIG_MLX5_CORE_EN_ESWITCH #include "eswitch.h" +#endif #include "vxlan.h" struct mlx5e_rq_param { @@ -585,10 +587,12 @@ static int mlx5e_create_rq(struct mlx5e_channel *c, switch (priv->params.rq_wq_type) { case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ: +#ifdef CONFIG_MLX5_CORE_EN_ESWITCH if (mlx5e_is_vf_vport_rep(priv)) { err = -EINVAL; goto err_rq_wq_destroy; } +#endif rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq; rq->alloc_wqe = mlx5e_alloc_rx_mpwqe; @@ -617,10 +621,14 @@ static int mlx5e_create_rq(struct mlx5e_channel *c, goto err_rq_wq_destroy; } - if (mlx5e_is_vf_vport_rep(priv)) +#ifdef CONFIG_MLX5_CORE_EN_ESWITCH + if (mlx5e_is_vf_vport_rep(priv)) { rq->handle_rx_cqe = mlx5e_handle_rx_cqe_rep; - else + } else +#endif + { rq->handle_rx_cqe = 
mlx5e_handle_rx_cqe; + } rq->alloc_wqe = mlx5e_alloc_rx_wqe; rq->dealloc_wqe = mlx5e_dealloc_rx_wqe; @@ -2198,7 +2206,6 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev) int mlx5e_open_locked(struct net_device *netdev) { struct mlx5e_priv *priv = netdev_priv(netdev); - struct mlx5_core_dev *mdev = priv->mdev; int num_txqs; int err; @@ -2233,11 +2240,13 @@ int mlx5e_open_locked(struct net_device *netdev) if (priv->profile->update_stats) queue_delayed_work(priv->wq, >update_stats_work, 0); - if (MLX5_CAP_GEN(mdev, vport_group_manager)) { +#ifdef CONFIG_MLX5_CORE_EN_ESWITCH + if (MLX5_CAP_GEN(priv->mdev, vport_group_manager)) { err = mlx5e_add_sqs_fwd_rules(priv); if (err) goto err_close_channels; } +#endif return 0; err_close_channels: @@ -2262,7 +2271,6 @@ int mlx5e_open(struct net_device *netdev) int mlx5e_close_locked(struct net_device *netdev) { struct mlx5e_priv *priv = netdev_priv(netdev); - struct mlx5_core_dev *mdev = priv->mdev;
[PATCH net-next] net: adjust skb->truesize in pskb_expand_head()
From: Eric Dumazet

Slava Shwartsman reported a warning in skb_try_coalesce(), when we
detect skb->truesize is completely wrong.

In his case, issue came from IPv6 reassembly coping with malicious
datagrams, that forced various pskb_may_pull() to reallocate a bigger
skb->head than the one allocated by NIC driver before entering GRO
layer.

Current code does not change skb->truesize, leaving this burden to
callers if they care enough.

Blindly changing skb->truesize in pskb_expand_head() is not
easy, as some producers might track skb->truesize, for example
in xmit path for back pressure feedback (sk->sk_wmem_alloc)

We can detect the cases where it should be safe to change
skb->truesize :

1) skb is not attached to a socket.
2) If it is attached to a socket, destructor is sock_edemux()

My audit gave only two callers doing their own skb->truesize
manipulation.

Signed-off-by: Eric Dumazet
Reported-by: Slava Shwartsman
---
 net/core/skbuff.c        | 14 +++---
 net/netlink/af_netlink.c |  8 +++-
 net/wireless/util.c      |  2 --
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f8dbe4a7ab46a9196c6683ce5c9c14d3d99d..6cd59da7ec583260748b9c45b99a824bcc61 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1192,10 +1192,10 @@ EXPORT_SYMBOL(__pskb_copy_fclone);
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		     gfp_t gfp_mask)
 {
-	int i;
-	u8 *data;
-	int size = nhead + skb_end_offset(skb) + ntail;
+	int i, osize = skb_end_offset(skb);
+	int size = osize + nhead + ntail;
 	long off;
+	u8 *data;
 
 	BUG_ON(nhead < 0);
 
@@ -1257,6 +1257,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	skb->hdr_len  = 0;
 	skb->nohdr    = 0;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
+
+	/* It is not generally safe to change skb->truesize.
+	 * For the moment, we really care of rx path, or
+	 * when skb is orphaned (not attached to a socket)
+	 */
+	if (!skb->sk || skb->destructor == sock_edemux)
+		skb->truesize += size - osize;
+
 	return 0;
 
 nofrags:
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index edcc1e19ad532641f51f6809b8c90d1e3770..7b73c7c161a9680b8691a712c31073b77896 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1210,11 +1210,9 @@ static struct sk_buff *netlink_trim(struct sk_buff *skb, gfp_t allocation)
 		skb = nskb;
 	}
 
-	if (!pskb_expand_head(skb, 0, -delta,
-			      (allocation & ~__GFP_DIRECT_RECLAIM) |
-			      __GFP_NOWARN | __GFP_NORETRY))
-		skb->truesize -= delta;
-
+	pskb_expand_head(skb, 0, -delta,
+			 (allocation & ~__GFP_DIRECT_RECLAIM) |
+			 __GFP_NOWARN | __GFP_NORETRY);
 	return skb;
 }
 
diff --git a/net/wireless/util.c b/net/wireless/util.c
index 1b9296882dcd6a0b585dfd604a30807e7f26..68e5f2ecee1aa22f17ab9a55eb566124e585 100644
--- a/net/wireless/util.c
+++ b/net/wireless/util.c
@@ -618,8 +618,6 @@ int ieee80211_data_from_8023(struct sk_buff *skb, const u8 *addr,
 
 		if (pskb_expand_head(skb, head_need, 0, GFP_ATOMIC))
 			return -ENOMEM;
-
-		skb->truesize += head_need;
 	}
 
 	if (encaps_data) {
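The accounting rule added to pskb_expand_head() can be checked in isolation with a small userspace model. This is a sketch, not kernel code: `struct buf_model`, `struct sock_model`, `edemux_model()` and `wfree_model()` are hypothetical stand-ins, with `wfree_model()` representing any transmit-path destructor whose owner tracks truesize itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder socket type; only whether a buffer has one matters here. */
struct sock_model {
	int unused;
};

/* Toy buffer carrying the three fields the rule looks at. */
struct buf_model {
	struct sock_model *sk;
	void (*destructor)(struct buf_model *);
	unsigned int truesize;
};

/* Stand-in for sock_edemux(): receive path, safe to adjust. */
static void edemux_model(struct buf_model *b)
{
	(void)b;
}

/* Stand-in for a transmit-path destructor that accounts truesize. */
static void wfree_model(struct buf_model *b)
{
	(void)b;
}

/* Safe when orphaned, or when owned via the early-demux destructor. */
static bool truesize_adjustable(const struct buf_model *b)
{
	return !b->sk || b->destructor == edemux_model;
}

/* Apply the head size change (new size minus old size) if allowed. */
static void account_expand(struct buf_model *b, int osize, int size)
{
	if (truesize_adjustable(b))
		b->truesize += size - osize;
}
```

The delta may be negative, which covers the shrinking case that netlink_trim() previously handled by hand.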
Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver
On Thu, Jan 26, 2017 at 4:37 PM, Sowmini Varadhan wrote:
> On (01/26/17 15:21), Willem de Bruijn wrote:
>> > If the application has provided fewer than hard_header_len bytes,
>> > dev_validate_header() will zero out the skb->data as needed. This is
>> > acceptable for SOCK_DGRAM/PF_PACKET sockets but in all other cases,
>>
>> This was added not for datagram sockets, but to be able to bypass
>> validation. See the message in commit 2793a23aacbd ("net: validate
>> variable length ll header") and discussion leading up to that patch.
>
> some context, I got inot this patch as a result of the comments in
> https://www.mail-archive.com/netdev@vger.kernel.org/msg149031.html
>
>> As David pointed out, this does not handle variable length headers
>> correctly. In link layers that support these, hard_header_len defines
>> the maximum header length. A hard failure on len < hard_header_len
>> would be incorrect.
>
> right, since DaveM's comments, I took a look at the drivers
> that have a ->validate - afaict (from cscope) ax25 is the only
> in-kernel driver that actually passes a ->validate pointer..
> I tried patching ax25 here:
>    http://marc.info/?l=linux-hams&m=148537926422828&w=2
> Still waiting to hear back from that list (which doesnt seem to have
> much traffic so maybe I should time out on it). Does that
> patch make better sense (I'll look up the comments leading up
> to 2793a23aacbd later tonight)

Thanks for the context. ax25_addr_parse doesn't adjust length, it only
verifies that the contents of the variable length header matches
protocol spec. I don't think that it or the .validate callback have to
be modified to return length.

To ensure that skb_headlen(skb) is at least a valid header length even
when CAP_SYS_RAWIO bypasses validation perhaps revise
dev_validate_header to take an additional skb->len parameter and
call skb_put directly from inside that branch.

>> The ->validate callback was added to allow specifying additional
>> constraints on a per protocol basis. This is where a min constraint
>> can be added, e.g., for ethernet.
>>
>> > -       if (!dev_validate_header(dev, skb->data, len)) {
>> > +       newlen = dev_validate_header(dev, skb->data, len);
>> > +       /* As comments above this function indicate, a full L2 header
>> > +        * must be passed to this function, so if newlen > len, bail.
>> > +        */
>> > +       if (newlen < 0 || newlen > len) {
>>
>> If callers only care whether the function returned failure or
>> increased len, which also indicates failure, it is cleaner to leave it
>> a boolean and fail in cases where len < the minimum for that link
>> layer type. No caller actually uses newlen.
>>
>> > +       /* Caller has allocated for copylen in non-paged part of
>> > +        * skb so we should never find newlen > hdrlen
>> > +        */
>> > +       WARN_ON(newlen > hdrlen);
>>
>> WARN_ON_ONCE is safer.
>
> Ok that's easy enough to do.
>
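One way to read the suggestion in this exchange is a boolean validator that rejects anything shorter than a per-link-layer minimum and treats hard_header_len as the maximum. The userspace sketch below models that reading only. It is not the kernel's dev_validate_header(); the `min_header_len` field, `struct link_model` and the `privileged` flag (standing in for a CAP_SYS_RAWIO check) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy link-layer description; stands in for struct net_device. */
struct link_model {
	int hard_header_len;	/* maximum link-layer header length */
	int min_header_len;	/* hypothetical: smallest valid header */
	bool (*validate)(const char *hdr, int len);	/* optional protocol check */
};

/*
 * 'hdr' must have room for hard_header_len bytes; 'len' is how many
 * of them the caller actually supplied.
 */
static bool validate_header(const struct link_model *dev, char *hdr,
			    int len, bool privileged)
{
	if (len < dev->min_header_len)
		return false;		/* too short for any valid header */
	if (len >= dev->hard_header_len)
		return true;		/* full-size header supplied */
	if (privileged) {
		/* Bypass path: zero-fill up to the maximum length. */
		memset(hdr + len, 0, dev->hard_header_len - len);
		return true;
	}
	if (dev->validate)
		return dev->validate(hdr, len);
	return false;
}
```

Keeping the return type boolean matches the review comment that no caller uses the adjusted length.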
[PATCH net-next 2/4] mlx5: Make building SR-IOV configurable
Add a configuration option (CONFIG_MLX5_CORE_EN_SRIOV) for controlling whether the eswitch code is built. Change Kconfig and Makefile accordingly. Signed-off-by: Tom Herbert--- drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 8 drivers/net/ethernet/mellanox/mlx5/core/Makefile | 4 +++- drivers/net/ethernet/mellanox/mlx5/core/lag.c| 2 ++ drivers/net/ethernet/mellanox/mlx5/core/main.c | 16 4 files changed, 29 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index 27aae58..7ade61a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -41,3 +41,11 @@ config MLX5_CORE_EN_ESWITCH inter-HCA ethernet l2 switching. If unsure, set to Y + +config MLX5_CORE_EN_SRIOV + bool "SR-IOV" + default y + ---help--- + Say Y here if you want to use SR-IOV. + + If unsure, set to Y diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile index 17025d8..6d38250 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile @@ -2,7 +2,7 @@ obj-$(CONFIG_MLX5_CORE) += mlx5_core.o mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \ health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o \ - mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \ + mad.o transobj.o vport.o fs_cmd.o fs_core.o \ fs_counters.o rl.o lag.o dev.o mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \ @@ -13,3 +13,5 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \ mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) += en_dcbnl.o mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o eswitch_offloads.o en_rep.o + +mlx5_core-$(CONFIG_MLX5_CORE_EN_SRIOV) += sriov.o diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag.c index 5595724..6dc3792 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lag.c +++ 
b/drivers/net/ethernet/mellanox/mlx5/core/lag.c @@ -223,11 +223,13 @@ static void mlx5_do_bond(struct mlx5_lag *ldev) mutex_unlock(_mutex); if (tracker.is_bonded && !mlx5_lag_is_bonded(ldev)) { +#ifdef CONFIG_MLX5_CORE_EN_SRIOV if (mlx5_sriov_is_enabled(dev0) || mlx5_sriov_is_enabled(dev1)) { mlx5_core_warn(dev0, "LAG is not supported with SRIOV"); return; } +#endif for (i = 0; i < MLX5_MAX_PORTS; i++) mlx5_remove_dev_by_protocol(ldev->pf[i].dev, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 224f499..cd6a9c7 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -949,15 +949,20 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, struct mlx5_priv *priv) } #endif +#ifdef CONFIG_MLX5_CORE_EN_SRIOV err = mlx5_sriov_init(dev); if (err) { dev_err(>dev, "Failed to init sriov %d\n", err); goto err_eswitch_cleanup; } +#endif return 0; +#ifdef CONFIG_MLX5_CORE_EN_SRIOV err_eswitch_cleanup: +#endif + #ifdef CONFIG_MLX5_CORE_EN_ESWITCH mlx5_eswitch_cleanup(dev->priv.eswitch); @@ -980,7 +985,9 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, struct mlx5_priv *priv) static void mlx5_cleanup_once(struct mlx5_core_dev *dev) { +#ifdef CONFIG_MLX5_CORE_EN_SRIOV mlx5_sriov_cleanup(dev); +#endif #ifdef CONFIG_MLX5_CORE_EN_ESWITCH mlx5_eswitch_cleanup(dev->priv.eswitch); #endif @@ -1135,11 +1142,13 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv, mlx5_eswitch_attach(dev->priv.eswitch); #endif +#ifdef CONFIG_MLX5_CORE_EN_SRIOV err = mlx5_sriov_attach(dev); if (err) { dev_err(>dev, "sriov init failed %d\n", err); goto err_sriov; } +#endif if (mlx5_device_registered(dev)) { mlx5_attach_device(dev); @@ -1159,9 +1168,12 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv, return 0; err_reg_dev: +#ifdef CONFIG_MLX5_CORE_EN_SRIOV mlx5_sriov_detach(dev); err_sriov: +#endif + #ifdef 
CONFIG_MLX5_CORE_EN_ESWITCH mlx5_eswitch_detach(dev->priv.eswitch); #endif @@ -1232,7 +1244,9 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv, if (mlx5_device_registered(dev)) mlx5_detach_device(dev); +#ifdef CONFIG_MLX5_CORE_EN_SRIOV mlx5_sriov_detach(dev); +#endif #ifdef CONFIG_MLX5_CORE_EN_ESWITCH mlx5_eswitch_detach(dev->priv.eswitch); #endif @@
[PATCH net-next 3/4] mlx5: Make building tc hardware offload configurable
Add a configuration option (CONFIG_MLX5_CORE_EN_TC) for controlling whether the eswitch code is built. Change Kconfig and Makefile accordingly. Signed-off-by: Tom Herbert--- drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 8 drivers/net/ethernet/mellanox/mlx5/core/Makefile | 4 +++- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 10 ++ 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index 7ade61a..b38c920 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -49,3 +49,11 @@ config MLX5_CORE_EN_SRIOV Say Y here if you want to use SR-IOV. If unsure, set to Y + +config MLX5_CORE_EN_TC + bool "TC offload" + default y + ---help--- + Say Y here if you want to use TC hardware offload support. + + If unsure, set to Y diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile index 6d38250..c308531 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile @@ -8,10 +8,12 @@ mlx5_core-y :=main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \ en_main.o en_common.o en_fs.o en_ethtool.o en_tx.o \ en_rx.o en_rx_am.o en_txrx.o en_clock.o vxlan.o \ - en_tc.o en_arfs.o en_fs_ethtool.o en_selftest.o + en_arfs.o en_fs_ethtool.o en_selftest.o mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) += en_dcbnl.o mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o eswitch_offloads.o en_rep.o mlx5_core-$(CONFIG_MLX5_CORE_EN_SRIOV) += sriov.o + +mlx5_core-$(CONFIG_MLX5_CORE_EN_TC) += en_tc.o diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 1db4d98..2d2c982 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -2690,6 +2690,7 @@ int 
mlx5e_modify_rqs_vsd(struct mlx5e_priv *priv, bool vsd) return 0; } +#ifdef CONFIG_MLX5_CORE_EN_TC static int mlx5e_setup_tc(struct net_device *netdev, u8 tc) { struct mlx5e_priv *priv = netdev_priv(netdev); @@ -2743,6 +2744,7 @@ static int mlx5e_ndo_setup_tc(struct net_device *dev, u32 handle, return mlx5e_setup_tc(dev, tc->tc); } +#endif static void mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats) @@ -3323,7 +3325,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = { .ndo_open= mlx5e_open, .ndo_stop= mlx5e_close, .ndo_start_xmit = mlx5e_xmit, +#ifdef CONFIG_MLX5_CORE_EN_TC .ndo_setup_tc= mlx5e_ndo_setup_tc, +#endif .ndo_select_queue= mlx5e_select_queue, .ndo_get_stats64 = mlx5e_get_stats, .ndo_set_rx_mode = mlx5e_set_rx_mode, @@ -3349,7 +3353,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = { .ndo_open= mlx5e_open, .ndo_stop= mlx5e_close, .ndo_start_xmit = mlx5e_xmit, +#ifdef CONFIG_MLX5_CORE_EN_TC .ndo_setup_tc= mlx5e_ndo_setup_tc, +#endif .ndo_select_queue= mlx5e_select_queue, .ndo_get_stats64 = mlx5e_get_stats, .ndo_set_rx_mode = mlx5e_set_rx_mode, @@ -3762,9 +3768,11 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv) goto err_destroy_direct_tirs; } +#ifdef CONFIG_MLX5_CORE_EN_TC err = mlx5e_tc_init(priv); if (err) goto err_destroy_flow_steering; +#endif return 0; @@ -3786,7 +3794,9 @@ static void mlx5e_cleanup_nic_rx(struct mlx5e_priv *priv) { int i; +#ifdef CONFIG_MLX5_CORE_EN_TC mlx5e_tc_cleanup(priv); +#endif mlx5e_destroy_flow_steering(priv); mlx5e_destroy_direct_tirs(priv); mlx5e_destroy_indirect_tirs(priv); -- 2.9.3
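The `.ndo_setup_tc` hunks follow a common compile-time gating pattern: both the handler and its entry in the ops table sit under the same `#ifdef`, so a disabled feature leaves the hook as NULL and its code out of the object file. The sketch below reproduces that pattern in standalone C. `MODEL_FEATURE_TC`, `struct ops_model` and the `model_*` functions are hypothetical names, with the macro standing in for a Kconfig symbol such as CONFIG_MLX5_CORE_EN_TC.

```c
#include <stddef.h>

/*
 * Hypothetical build switch; in the kernel this would come from Kconfig.
 * Uncomment to build the optional hook in.
 */
/* #define MODEL_FEATURE_TC */

/* Toy ops table; stands in for struct net_device_ops. */
struct ops_model {
	int (*open)(void);
	int (*setup_tc)(void);	/* optional feature hook */
};

static int model_open(void)
{
	return 0;
}

#ifdef MODEL_FEATURE_TC
/* Compiled only when the feature is enabled. */
static int model_setup_tc(void)
{
	return 0;
}
#endif

static const struct ops_model model_ops = {
	.open		= model_open,
#ifdef MODEL_FEATURE_TC
	.setup_tc	= model_setup_tc,
#endif
};

/* Returns nonzero when the optional hook was built in. */
static int feature_tc_built(void)
{
	return model_ops.setup_tc != NULL;
}
```

Members left out of a designated initializer are zero-initialized, which is why callers can test the hook against NULL.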
Re: [PATCH v5 net] ravb: unmap descriptors when freeing rings
From: Simon Horman
Date: Thu, 26 Jan 2017 14:29:27 +0100

> From: Kazuya Mizuguchi
>
> "swiotlb buffer is full" errors occur after repeated initialisation of a
> device - f.e. suspend/resume or ip link set up/down. This is because memory
> mapped using dma_map_single() in ravb_ring_format() and ravb_start_xmit()
> is not released. Resolve this problem by unmapping descriptors when
> freeing rings.
>
> Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
> Signed-off-by: Kazuya Mizuguchi
> [simon: reworked]
> Signed-off-by: Simon Horman

Applied, thanks Simon.
[PATCH net-next 0/4] mlx5: Create build configuration options
This patchset creates configuration options for sriov, vxlan, eswitch,
and tc features in the mlx5 driver. The purpose of this is to allow not
building these features. These features are optional advanced features
that are not required for a core Ethernet driver. A user can disable
these features, which reduces the amount of code in the driver. Disabling
these features (and DCB) reduces the size of mlx5_core.o by about 16%.
This can also reduce the complexity of backports and rebases, since a
user no longer needs to worry about dependencies on the rest of the
kernel brought in by features that may be of no interest to them.

Tested: Built and ran the driver with all features enabled (the default)
and with none enabled (including DCB). Did not see any issues. I did not
explicitly test operation of any of the features in the list.

Tom Herbert (4):
  mlx5: Make building eswitch configurable
  mlx5: Make building SR-IOV configurable
  mlx5: Make building tc hardware offload configurable
  mlx5: Make building vxlan hardware offload configurable

 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |  35 ++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |  16 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 129 --
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |  39 +--
 drivers/net/ethernet/mellanox/mlx5/core/eq.c      |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/lag.c     |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/main.c    |  32 --
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c   |   6 +-
 8 files changed, 205 insertions(+), 58 deletions(-)

--
2.9.3
Re: [PATCH net-next v2] net: ipv6: ignore null_entry on route dumps
From: David Ahern
Date: Thu, 26 Jan 2017 13:54:08 -0800

> Dave: per last email you suggested putting this in fib6_dump_node, but
> fib6_dump_node does not have knowledge of the network namespace
> because of the way the fib_walker works. I could put
> struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) w->args;
> into fib6_dump_node to get the net pointer from it but did not
> want to duplicate the typecast.

Ok, if you can't see the network namespace in fib6_dump_node() you can't do the test there.

I'll apply this, thanks.
Re: [PATCH net-next] net: ipv6: remove skb_reserve in getroute
From: David Ahern
Date: Thu, 26 Jan 2017 14:08:36 -0800

> Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
> allocated skb is not passed through the routing engine (like it is for
> IPv4) and has not since the beginning of git time.
>
> Signed-off-by: David Ahern

Good catch, applied, thanks David.
Re: fs, net: deadlock between bind/splice on af_unix
On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov wrote:
> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro wrote:
> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
> >>
> >>> > Why do we do autobind there, anyway, and why is it conditional on
> >>> > SOCK_PASSCRED? Note that e.g. for SOCK_STREAM we can bloody well get
> >>> > to sending stuff without autobind ever done - just use socketpair()
> >>> > to create that sucker and we won't be going through the connect()
> >>> > at all.
> >>>
> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls unix_autobind(),
> >>> not SOCK_STREAM.
> >>
> >> Yes, I've noticed. What I'm asking is what in there needs autobind triggered
> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
> >>
> >>> I guess some lock, perhaps the u->bindlock could be dropped before
> >>> acquiring the next one (sb_writer), but I need to double check.
> >>
> >> Bad idea, IMO - do you *want* autobind being able to come through while
> >> bind(2) is busy with mknod?
> >
> >
> > Ping. This is still happening on HEAD.
> >
>
> Thanks for your reminder. Mind to give the attached patch (compile only)
> a try? I take another approach to fix this deadlock, which moves the
> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
> impact with this way.
>

I don't think this is the right approach.

Currently the file creation is postponed until unix_bind can no longer fail otherwise. With it reordered, someone may race you with a different path, and now you are left with a file to clean up. Except it is quite unclear to me whether you can unlink it.

I don't have a good idea how to fix it. A somewhat typical approach would introduce an intermediate state ("under construction") and drop the lock around the call into unix_mknod.

In this particular case, perhaps you could repurpose gc_flags as a general flags carrier and add a 'binding in process' flag to test.
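To make the "under construction" idea above concrete, here is a minimal userspace sketch using C11 atomics. It is not kernel code and not a proposed patch: the sketch_* names and the three-state enum are hypothetical, and a real fix would also have to decide what a concurrent bind or autobind does when it finds the socket busy (wait, retry, or fail).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical states; struct unix_sock has no such field. */
enum sketch_bind_state {
	SKETCH_UNBOUND,
	SKETCH_BINDING,	/* "under construction" */
	SKETCH_BOUND,
};

struct sketch_sock {
	atomic_int state;
};

void sketch_sock_init(struct sketch_sock *sk)
{
	assert(sk);
	atomic_init(&sk->state, SKETCH_UNBOUND);
}

/*
 * Claim the socket for binding. Only one caller can move it from
 * UNBOUND to BINDING; a concurrent bind or autobind sees the claim
 * and backs off, so no lock has to be held across the slow step.
 */
bool sketch_bind_begin(struct sketch_sock *sk)
{
	int expected = SKETCH_UNBOUND;

	return atomic_compare_exchange_strong(&sk->state, &expected,
					      SKETCH_BINDING);
}

/* Publish the outcome once the slow step (e.g. file creation) is done. */
void sketch_bind_finish(struct sketch_sock *sk, bool ok)
{
	atomic_store(&sk->state, ok ? SKETCH_BOUND : SKETCH_UNBOUND);
}
```

The caller would run the file creation between sketch_bind_begin() and sketch_bind_finish(); on failure the socket returns to the unbound state so a later bind can try again.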
Re: [PATCH net-next] net: Fix ndo_setup_tc comment
On 17-01-26 02:44 PM, Florian Fainelli wrote:
> Commit 16e5cc647173 ("net: rework setup_tc ndo op to consume
> general tc operand") changed the ndo_setup_tc() signature, but did not
> update the comments in netdevice.h, so do that now.
>
> Signed-off-by: Florian Fainelli
> ---

Thanks.

Acked-by: John Fastabend
Re: loopback device reference count leakage
Hi Cong,

I tested out your patch, it does seem to be preventing the issue from happening. Here are the dev_put/dev_hold() calls with your patch applied.

Jan 26 00:29:08 kernel: [ 4385.940243] lo: dev_hold 1 rx_queue_add_kobject
Jan 26 00:29:08 kernel: [ 4385.940255] lo: dev_hold 2 netdev_queue_add_kobject
Jan 26 00:29:08 kernel: [ 4385.940257] lo: dev_hold 3 register_netdevice
Jan 26 00:29:08 kernel: [ 4385.940260] lo: dev_hold 4 neigh_parms_alloc
Jan 26 00:29:08 kernel: [ 4385.940262] lo: dev_hold 5 inetdev_init
Jan 26 00:29:08 kernel: [ 4386.017699] lo: dev_hold 6 qdisc_alloc
Jan 26 00:29:08 kernel: [ 4386.017741] lo: dev_hold 7 dev_get_by_index
Jan 26 00:29:08 kernel: [ 4386.017749] lo: dev_hold 8 dev_get_by_index
Jan 26 00:29:08 kernel: [ 4386.017756] lo: dev_hold 9 fib_check_nh
Jan 26 00:29:08 kernel: [ 4386.017760] lo: dev_hold 10 fib_check_nh
Jan 26 00:29:08 kernel: [ 4386.017767] lo: dev_hold 11 dev_get_by_index
Jan 26 00:29:08 kernel: [ 4386.017772] lo: dev_hold 12 dev_get_by_index
Jan 26 00:29:08 kernel: [ 4386.017775] lo: dev_hold 13 fib_check_nh
Jan 26 00:29:08 kernel: [ 4386.017778] lo: dev_hold 14 fib_check_nh
Jan 26 00:29:08 kernel: [ 4386.033548] lo: dev_put 14 free_fib_info_rcu
Jan 26 00:29:08 kernel: [ 4386.033553] lo: dev_put 13 free_fib_info_rcu
Jan 26 00:29:08 kernel: [ 4386.033556] lo: dev_put 12 free_fib_info_rcu
Jan 26 00:29:08 kernel: [ 4386.033558] lo: dev_put 11 free_fib_info_rcu
Jan 26 00:29:08 kernel: [ 4386.033560] lo: dev_put 10 free_fib_info_rcu
Jan 26 00:29:08 kernel: [ 4386.033563] lo: dev_put 9 free_fib_info_rcu
Jan 26 00:29:09 kernel: [ 4386.438175] lo: dev_hold 9 dst_init
Jan 26 00:29:09 kernel: [ 4386.442558] lo: dev_hold 10 dst_init
Jan 26 00:29:09 kernel: [ 4386.442564] lo: dev_hold 11 dst_init
Jan 26 00:29:09 kernel: [ 4386.477575] lo: dev_put 11 dst_destroy
Jan 26 00:29:09 kernel: [ 4386.641150] lo: dev_hold 11 dst_init
Jan 26 00:37:59 kernel: [ 4916.949380] lo: dev_hold 12 dst_init
Jan 26 00:37:59 kernel: [ 4916.949401] lo: dev_hold 13 __neigh_create
Jan 26 00:56:52 kernel: [ 6049.882993] lo: dev_hold 14 dst_init
Jan 26 00:57:54 kernel: [ 6111.782520] lo: dev_hold 15 dst_init
Jan 26 01:28:02 kernel: [ 7920.396248] lo: dev_hold 16 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396251] lo: dev_hold 17 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396254] lo: dev_hold 18 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396257] lo: dev_hold 19 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396260] lo: dev_hold 20 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396263] lo: dev_hold 21 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396266] lo: dev_hold 22 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396268] lo: dev_hold 23 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396271] lo: dev_hold 24 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396274] lo: dev_hold 25 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396277] lo: dev_hold 26 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396280] lo: dev_hold 27 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396283] lo: dev_hold 28 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396286] lo: dev_hold 29 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396288] lo: dev_hold 30 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396291] lo: dev_hold 31 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396294] lo: dev_hold 32 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396297] lo: dev_hold 33 dst_ifdown
Jan 26 01:28:02 kernel: [ 7920.396300] lo: dev_hold 34 dst_ifdown
Jan 26 01:28:03 kernel: [ 7920.584272] lo: dev_put 34 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584277] lo: dev_put 33 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584279] lo: dev_put 32 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584281] lo: dev_put 31 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584283] lo: dev_put 30 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584285] lo: dev_put 29 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584287] lo: dev_put 28 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584289] lo: dev_put 27 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584290] lo: dev_put 26 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584301] lo: dev_put 25 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584303] lo: dev_put 24 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584305] lo: dev_put 23 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584307] lo: dev_put 22 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584309] lo: dev_put 21 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584311] lo: dev_put 20 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584324] lo: dev_put 19 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584326] lo: dev_put 18 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584328] lo: dev_put 17 dst_destroy
Jan 26 01:28:03 kernel: [ 7920.584330] lo: dev_put 16 dst_destroy
Jan 26 01:30:32 kernel: [ 8069.750192] lo: dev_put 15 neigh_destroy
Jan 26 01:30:32 kernel: [ 8069.751961] lo: dev_put 14 qdisc_destroy
Jan 26 01:30:32 kernel: [ 8069.752014] lo: dev_put 13 neigh_parms_release
Jan 26 01:30:32 kernel: [
Re: [PATCHv3 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs
On 27/01/17 09:24, Chris Packham wrote:
> On 27/01/17 04:10, Gregory CLEMENT wrote:
>>> + internal-regs {
>
> [snip]
>
>>> +
>>> + dfx-registers {
>> node label
>>
>
> [snip]
>
>>> + switch {
>> node label
>>
>
> These are peers to the internal-regs, i.e. parts of the SoC with
> mappable windows in the address space. Do they really need a label?
> Their subnodes absolutely need (and have) labels.
>

Actually the pci-controller is in the same category and that has a label so I'll add one.
[PATCH net-next] net: Fix ndo_setup_tc comment
Commit 16e5cc647173 ("net: rework setup_tc ndo op to consume general tc operand") changed the ndo_setup_tc() signature, but did not update the comments in netdevice.h, so do that now.

Signed-off-by: Florian Fainelli
---
 include/linux/netdevice.h | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3868c32d98af..d63cacb67ea6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -964,11 +964,12 @@ struct netdev_xdp {
  *	with PF and querying it may introduce a theoretical security risk.
  * int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool setting);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
- * int (*ndo_setup_tc)(struct net_device *dev, u8 tc)
- *	Called to setup 'tc' number of traffic classes in the net device. This
- *	is always called from the stack with the rtnl lock held and netif tx
- *	queues stopped. This allows the netdevice to perform queue management
- *	safely.
+ * int (*ndo_setup_tc)(struct net_device *dev, u32 handle,
+ *		       __be16 protocol, struct tc_to_netdev *tc);
+ *	Called to setup any 'tc' scheduler, classifier or action on @dev.
+ *	This is always called from the stack with the rtnl lock held and netif
+ *	tx queues stopped. This allows the netdevice to perform queue
+ *	management safely.
  *
  *	Fiber Channel over Ethernet (FCoE) offload functions.
  * int (*ndo_fcoe_enable)(struct net_device *dev);
--
2.9.3
Re: [PATCH net-next] net: ipv6: remove skb_reserve in getroute
On Thu, 2017-01-26 at 14:08 -0800, David Ahern wrote:
> Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
> allocated skb is not passed through the routing engine (like it is for
> IPv4) and has not since the beginning of git time.
>
> Signed-off-by: David Ahern
> ---
>  net/ipv6/route.c | 6 --
>  1 file changed, 6 deletions(-)

Nice ;)

Acked-by: Eric Dumazet
[PATCH V3 net-next 01/14] net/ena: remove ntuple filter support from device feature list
Remove NETIF_F_NTUPLE from netdev->features. The ENA device driver does not support ntuple filtering.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index cc8b13e..7493ea3 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2722,7 +2722,6 @@ static void ena_set_dev_offloads(struct ena_com_dev_get_features_ctx *feat,
 	netdev->features =
 		dev_features |
 		NETIF_F_SG |
-		NETIF_F_NTUPLE |
 		NETIF_F_RXHASH |
 		NETIF_F_HIGHDMA;
--
2.7.4
[PATCH V3 net-next 02/14] net/ena: fix error handling when probe fails
When the driver fails in probe, it releases all resources, including the adapter. In case of probe failure, ena_remove should not try to free the adapter resources.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 7493ea3..cb60567 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -3046,6 +3046,7 @@ static int ena_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 err_free_region:
 	ena_release_bars(ena_dev, pdev);
 err_free_ena_dev:
+	pci_set_drvdata(pdev, NULL);
 	vfree(ena_dev);
 err_disable_device:
 	pci_disable_device(pdev);
--
2.7.4
[PATCH V3 net-next 03/14] net/ena: fix queues number calculation
The ENA driver tries to open a queue per vCPU. To determine how many vCPUs the instance has, it uses num_possible_cpus(), while it should use num_online_cpus() instead.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index cb60567..f409cfd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2660,7 +2660,7 @@ static int ena_calc_io_queue_num(struct pci_dev *pdev,
 		io_sq_num = get_feat_ctx->max_queues.max_sq_num;
 	}
-	io_queue_num = min_t(int, num_possible_cpus(), ENA_MAX_NUM_IO_QUEUES);
+	io_queue_num = min_t(int, num_online_cpus(), ENA_MAX_NUM_IO_QUEUES);
 	io_queue_num = min_t(int, io_queue_num, io_sq_num);
 	io_queue_num = min_t(int, io_queue_num,
 			     get_feat_ctx->max_queues.max_cq_num);
--
2.7.4
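As a rough illustration of the clamping chain in the hunk above, the following userspace sketch takes the CPU count as a plain parameter instead of calling the kernel helpers. The sketch_* names and the cap of 128 are assumptions made for the example; the driver's real ENA_MAX_NUM_IO_QUEUES value is not shown in this patch.

```c
#include <assert.h>

/* Hypothetical cap for illustration only. */
#define SKETCH_MAX_IO_QUEUES 128

static int sketch_min(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Clamp the queue count to the CPU count, the driver-wide cap, and the
 * submission/completion queue limits reported by the device.
 */
int sketch_calc_io_queue_num(int cpus, int max_sq, int max_cq)
{
	int n;

	assert(cpus > 0);
	n = sketch_min(cpus, SKETCH_MAX_IO_QUEUES);
	n = sketch_min(n, max_sq);
	n = sketch_min(n, max_cq);
	return n;
}
```

On a guest with 4 online CPUs but 64 possible (hot-pluggable) CPUs, passing 4 yields 4 queues while passing 64 yields 64 when the device limits allow it, which is the difference between the two kernel helpers that the patch is about.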
[PATCH V3 net-next 00/14] Bug Fixes in ENA driver.
Changes between V3 and V2:
* Fix typos and correct alignment in commit messages.
* Use napi_complete_done() return value to determine when the napi handler needs to unmask the interrupts, rather than implementing a non-standard solution.
* Remove new features from this patchset and leave bug fixes only.
* Give an example in the commit message for kernel crashes.
* Use BIT(x) instead of using the value explicitly.

Netanel Belgazal (14):
  net/ena: remove ntuple filter support from device feature list
  net/ena: fix error handling when probe fails
  net/ena: fix queues number calculation
  net/ena: fix ethtool RSS flow configuration
  net/ena: fix RSS default hash configuration
  net/ena: fix NULL dereference when removing the driver after device reset failed
  net/ena: refactor ena_get_stats64 to be atomic context safe
  net/ena: fix potential access to freed memory during device reset
  net/ena: use napi_complete_done() return value
  net/ena: use READ_ONCE to access completion descriptors
  net/ena: reduce the severity of ena printouts
  net/ena: change driver's default timeouts
  net/ena: change condition for host attribute configuration
  net/ena: update driver version to 1.1.2

 drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 20 ++-
 drivers/net/ethernet/amazon/ena/ena_com.c        | 41 ++---
 drivers/net/ethernet/amazon/ena/ena_com.h        | 1 +
 drivers/net/ethernet/amazon/ena/ena_eth_com.c    | 8 +-
 drivers/net/ethernet/amazon/ena/ena_netdev.c     | 186 ---
 drivers/net/ethernet/amazon/ena/ena_netdev.h     | 9 +-
 6 files changed, 182 insertions(+), 83 deletions(-)

--
2.7.4
[PATCH V3 net-next 04/14] net/ena: fix ethtool RSS flow configuration
ena_flow_data_to_flow_hash and ena_flow_hash_to_flow_type treat the ena_admin_flow_hash_fields enum as power-of-two values. Change the values of ena_admin_flow_hash_fields to be power-of-two values.

This bug affects ethtool set/get rxnfc: ethtool reports wrong hash fields on get and configures wrong hash fields on set.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
index a46e749..e1594d6 100644
--- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
@@ -631,22 +631,22 @@ enum ena_admin_flow_hash_proto {
 /* RSS flow hash fields */
 enum ena_admin_flow_hash_fields {
 	/* Ethernet Dest Addr */
-	ENA_ADMIN_RSS_L2_DA = 0,
+	ENA_ADMIN_RSS_L2_DA = BIT(0),
 	/* Ethernet Src Addr */
-	ENA_ADMIN_RSS_L2_SA = 1,
+	ENA_ADMIN_RSS_L2_SA = BIT(1),
 	/* ipv4/6 Dest Addr */
-	ENA_ADMIN_RSS_L3_DA = 2,
+	ENA_ADMIN_RSS_L3_DA = BIT(2),
 	/* ipv4/6 Src Addr */
-	ENA_ADMIN_RSS_L3_SA = 5,
+	ENA_ADMIN_RSS_L3_SA = BIT(3),
 	/* tcp/udp Dest Port */
-	ENA_ADMIN_RSS_L4_DP = 6,
+	ENA_ADMIN_RSS_L4_DP = BIT(4),
 	/* tcp/udp Src Port */
-	ENA_ADMIN_RSS_L4_SP = 7,
+	ENA_ADMIN_RSS_L4_SP = BIT(5),
 };
 struct ena_admin_proto_input {
--
2.7.4
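The effect of the old enum values can be reproduced outside the kernel. The sketch below copies the numeric values from the hunk above into two hypothetical enums (the SKETCH_* names and the sketch_has_field() helper are made up for the example) and tests them the way a bitmask is normally tested.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BIT(n) (1u << (n))

/* Values before the patch: plain indices, not bit flags. */
enum sketch_fields_old {
	SKETCH_OLD_L2_DA = 0,
	SKETCH_OLD_L2_SA = 1,
	SKETCH_OLD_L3_DA = 2,
	SKETCH_OLD_L3_SA = 5,
	SKETCH_OLD_L4_DP = 6,
	SKETCH_OLD_L4_SP = 7,
};

/* Values after the patch: one distinct bit per field. */
enum sketch_fields_new {
	SKETCH_NEW_L2_DA = SKETCH_BIT(0),
	SKETCH_NEW_L2_SA = SKETCH_BIT(1),
	SKETCH_NEW_L3_DA = SKETCH_BIT(2),
	SKETCH_NEW_L3_SA = SKETCH_BIT(3),
	SKETCH_NEW_L4_DP = SKETCH_BIT(4),
	SKETCH_NEW_L4_SP = SKETCH_BIT(5),
};

/* Bit flags may be OR-ed safely only if they never overlap. */
static_assert((SKETCH_NEW_L3_SA & SKETCH_NEW_L3_DA) == 0,
	      "bit flags must not overlap");

/* Returns 1 when 'field' is present in 'mask', 0 otherwise. */
int sketch_has_field(uint32_t mask, uint32_t field)
{
	return (mask & field) != 0;
}
```

With the old values, L2_DA is 0 and can never be detected in a mask, and L3_SA | L3_DA equals 7, so a test for L4_DP (6) wrongly succeeds. With the new values both checks behave as expected.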
[PATCH V3 net-next 09/14] net/ena: use napi_complete_done() return value
Do not unmask interrupts if we are in busy poll mode.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 44 ++--
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 606fb5c..d1e1d9d 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -1122,26 +1122,40 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
 	tx_work_done = ena_clean_tx_irq(tx_ring, tx_budget);
 	rx_work_done = ena_clean_rx_irq(rx_ring, napi, budget);
-	if ((budget > rx_work_done) && (tx_budget > tx_work_done)) {
-		napi_complete_done(napi, rx_work_done);
+	/* If the device is about to reset or down, avoid unmask
+	 * the interrupt and return 0 so NAPI won't reschedule
+	 */
+	if (unlikely(!test_bit(ENA_FLAG_DEV_UP, _ring->adapter->flags) ||
+		     test_bit(ENA_FLAG_TRIGGER_RESET, _ring->adapter->flags))) {
+		napi_complete_done(napi, 0);
+		ret = 0;
+	} else if ((budget > rx_work_done) && (tx_budget > tx_work_done)) {
 		napi_comp_call = 1;
-		/* Tx and Rx share the same interrupt vector */
-		if (ena_com_get_adaptive_moderation_enabled(rx_ring->ena_dev))
-			ena_adjust_intr_moderation(rx_ring, tx_ring);
-		/* Update intr register: rx intr delay, tx intr delay and
-		 * interrupt unmask
+		/* Update numa and unmask the interrupt only when schedule
+		 * from the interrupt context (vs from sk_busy_loop)
 		 */
-		ena_com_update_intr_reg(_reg,
-					rx_ring->smoothed_interval,
-					tx_ring->smoothed_interval,
-					true);
+		if (napi_complete_done(napi, rx_work_done)) {
+			/* Tx and Rx share the same interrupt vector */
+			if (ena_com_get_adaptive_moderation_enabled(rx_ring->ena_dev))
+				ena_adjust_intr_moderation(rx_ring, tx_ring);
+
+			/* Update intr register: rx intr delay,
+			 * tx intr delay and interrupt unmask
+			 */
+			ena_com_update_intr_reg(_reg,
+						rx_ring->smoothed_interval,
+						tx_ring->smoothed_interval,
+						true);
+
+			/* It is a shared MSI-X.
+			 * Tx and Rx CQ have pointer to it.
+			 * So we use one of them to reach the intr reg
+			 */
+			ena_com_unmask_intr(rx_ring->ena_com_io_cq, _reg);
+		}
-		/* It is a shared MSI-X. Tx and Rx CQ have pointer to it.
-		 * So we use one of them to reach the intr reg
-		 */
-		ena_com_unmask_intr(rx_ring->ena_com_io_cq, _reg);
 		ena_update_ring_numa_node(tx_ring, rx_ring);
--
2.7.4
[PATCH V3 net-next 06/14] net/ena: fix NULL dereference when removing the driver after device reset failed
If for some reason the device stops responding, and the device reset fails to recover the device, the mmio register read data structure will not be reinitialized. On driver removal, the driver will also try to reset the device, but this time the mmio data structure will be NULL. To solve this issue, perform the device reset in the remove function only if the device is running.

Crash log:
[ 54.240382] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 54.244186] IP: [] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv]
[ 54.244186] PGD 0
[ 54.244186] Oops: 0002 [#1] SMP
[ 54.244186] Modules linked in: ena_drv(OE-) snd_hda_codec_generic kvm_intel kvm crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_intel aes_x86_64 snd_hda_controller lrw gf128mul cirrus glue_helper ablk_helper ttm snd_hda_codec drm_kms_helper cryptd snd_hwdep drm snd_pcm pvpanic snd_timer syscopyarea sysfillrect snd parport_pc sysimgblt serio_raw soundcore i2c_piix4 mac_hid lp parport psmouse floppy
[ 54.244186] CPU: 5 PID: 1841 Comm: rmmod Tainted: G OE 3.16.0-031600-generic #201408031935
[ 54.244186] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 54.244186] task: 880135852880 ti: 8800bb64 task.ti: 8800bb64
[ 54.244186] RIP: 0010:[] [] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv]
[ 54.244186] RSP: 0018:8800bb643d50 EFLAGS: 00010083
[ 54.244186] RAX: deb0 RBX: 00030d40 RCX: 0003
[ 54.244186] RDX: 0202 RSI: 0058 RDI: c9775104
[ 54.244186] RBP: 8800bb643d88 R08: R09: cf00
[ 54.244186] R10: 000fffe0 R11: 0001 R12:
[ 54.244186] R13: c9765000 R14: c9775104 R15: 7fca1fa98090
[ 54.244186] FS: 7fca1f1bd740() GS:88013fd4() knlGS:
[ 54.244186] CS: 0010 DS: ES: CR0: 80050033
[ 54.244186] CR2: CR3: b9cf6000 CR4: 001406e0
[ 54.244186] Stack:
[ 54.244186] 0202 00580286 c9765000 c9765000
[ 54.244186] 880135f6b000 8800b936 7fca1fa98090 8800bb643db8
[ 54.244186] c0680b3d 8800b93608c0 c9765000 880135f6b000
[ 54.244186] Call Trace:
[ 54.244186] [] ena_com_dev_reset+0x1d/0x1b0 [ena_drv]
[ 54.244186] [] ena_remove+0xa7/0x130 [ena_drv]
[ 54.244186] [] pci_device_remove+0x46/0xc0
[ 54.244186] [] __device_release_driver+0x7f/0xf0
[ 54.244186] [] driver_detach+0xc8/0xd0
[ 54.244186] [] bus_remove_driver+0x59/0xd0
[ 54.244186] [] driver_unregister+0x2e/0x60
[ 54.244186] [] ? show_refcnt+0x40/0x40
[ 54.244186] [] pci_unregister_driver+0x23/0xa0
[ 54.244186] [] ena_cleanup+0x10/0xed1 [ena_drv]
[ 54.244186] [] SyS_delete_module+0x157/0x1e0
[ 54.244186] [] ? do_notify_resume+0xc7/0xd0
[ 54.244186] [] system_call_fastpath+0x1a/0x1f
[ 54.244186] Code: c3 4d 8d b5 04 01 01 00 4c 89 f7 e8 e1 5a 11 c1 48 89 45 c8 41 0f b7 85 00 01 01 00 8d 48 01 66 2d 52 21 66 41 89 8d 00 01 01 00 <66> 41 89 04 24 0f b7 45 d4 89 45 d0 89 c1 41 0f b7 85 00 01 01
[ 54.244186] RIP [] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv]
[ 54.244186] RSP
[ 54.244186] CR2:
[ 54.244186] ---[ end trace 18dd9889b6497810 ]---

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index f409cfd..639f0aa 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2509,6 +2509,8 @@ static void ena_fw_reset_device(struct work_struct *work)
 err:
 	rtnl_unlock();
+	clear_bit(ENA_FLAG_DEVICE_RUNNING, >flags);
+
 	dev_err(>dev, "Reset attempt failed. Can not reset the device\n");
 }
@@ -3118,7 +3120,9 @@ static void ena_remove(struct pci_dev *pdev)
 	cancel_work_sync(>resume_io_task);
-	ena_com_dev_reset(ena_dev);
+	/* Reset the device only if the device is running. */
+	if (test_bit(ENA_FLAG_DEVICE_RUNNING, >flags))
+		ena_com_dev_reset(ena_dev);
 	ena_free_mgmnt_irq(adapter);
--
2.7.4
[PATCH V3 net-next 05/14] net/ena: fix RSS default hash configuration
The ENA default hash configuration sets the IPv4_frag hash twice instead of configuring the hash for non-IP packets. The bug caused the hash for fragmented IPv4 packets to be calculated from the L2 source and destination addresses instead of the L3 source and destination addresses. As a result, IPv4 packets can reach the wrong Rx queue.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_com.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c
index 3066d9c..46aad3a 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -2184,7 +2184,7 @@ int ena_com_set_default_hash_ctrl(struct ena_com_dev *ena_dev)
 	hash_ctrl->selected_fields[ENA_ADMIN_RSS_IP4_FRAG].fields =
 		ENA_ADMIN_RSS_L3_SA | ENA_ADMIN_RSS_L3_DA;
-	hash_ctrl->selected_fields[ENA_ADMIN_RSS_IP4_FRAG].fields =
+	hash_ctrl->selected_fields[ENA_ADMIN_RSS_NOT_IP].fields =
 		ENA_ADMIN_RSS_L2_DA | ENA_ADMIN_RSS_L2_SA;
 	for (i = 0; i < ENA_ADMIN_RSS_PROTO_NUM; i++) {
--
2.7.4
[PATCH V3 net-next 08/14] net/ena: fix potential access to freed memory during device reset
If the ena driver detects that the device is not behaving as expected, it tries to reset the device. The reset flow calls ena_down, which frees all the resources the driver allocated, and then resets the device. This flow can cause memory corruption if the device is still writing to the driver's memory space. To overcome this potential race, move the reset before the device resources are freed.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 56 +---
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index ea3c801..606fb5c 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -80,14 +80,18 @@ static void ena_tx_timeout(struct net_device *dev)
 {
 	struct ena_adapter *adapter = netdev_priv(dev);
+	/* Change the state of the device to trigger reset
+	 * Check that we are not in the middle or a trigger already
+	 */
+
+	if (test_and_set_bit(ENA_FLAG_TRIGGER_RESET, >flags))
+		return;
+
 	u64_stats_update_begin(>syncp);
 	adapter->dev_stats.tx_timeout++;
 	u64_stats_update_end(>syncp);
 	netif_err(adapter, tx_err, dev, "Transmit time out\n");
-
-	/* Change the state of the device to trigger reset */
-	set_bit(ENA_FLAG_TRIGGER_RESET, >flags);
 }
 static void update_rx_ring_mtu(struct ena_adapter *adapter, int mtu)
@@ -1109,7 +1113,8 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
 	tx_budget = tx_ring->ring_size / ENA_TX_POLL_BUDGET_DIVIDER;
-	if (!test_bit(ENA_FLAG_DEV_UP, _ring->adapter->flags)) {
+	if (!test_bit(ENA_FLAG_DEV_UP, _ring->adapter->flags) ||
+	    test_bit(ENA_FLAG_TRIGGER_RESET, _ring->adapter->flags)) {
 		napi_complete_done(napi, 0);
 		return 0;
 	}
@@ -1698,12 +1703,22 @@ static void ena_down(struct ena_adapter *adapter)
 	adapter->dev_stats.interface_down++;
 	u64_stats_update_end(>syncp);
-	/* After this point the napi handler won't enable the tx queue */
-	ena_napi_disable_all(adapter);
 	netif_carrier_off(adapter->netdev);
 	netif_tx_disable(adapter->netdev);
+	/* After this point the napi handler won't enable the tx queue */
+	ena_napi_disable_all(adapter);
+
 	/* After destroy the queue there won't be any new interrupts */
+
+	if (test_bit(ENA_FLAG_TRIGGER_RESET, >flags)) {
+		int rc;
+
+		rc = ena_com_dev_reset(adapter->ena_dev);
+		if (rc)
+			dev_err(>pdev->dev, "Device reset failed\n");
+	}
+
 	ena_destroy_all_io_queues(adapter);
 	ena_disable_io_intr_sync(adapter);
@@ -2065,6 +2080,14 @@ static void ena_netpoll(struct net_device *netdev)
 	struct ena_adapter *adapter = netdev_priv(netdev);
 	int i;
+	/* Dont schedule NAPI if the driver is in the middle of reset
+	 * or netdev is down.
+	 */
+
+	if (!test_bit(ENA_FLAG_DEV_UP, >flags) ||
+	    test_bit(ENA_FLAG_TRIGGER_RESET, >flags))
+		return;
+
 	for (i = 0; i < adapter->num_queues; i++)
 		napi_schedule(>ena_napi[i].napi);
 }
@@ -2451,6 +2474,14 @@ static void ena_fw_reset_device(struct work_struct *work)
 	bool dev_up, wd_state;
 	int rc;
+	if (unlikely(!test_bit(ENA_FLAG_TRIGGER_RESET, >flags))) {
+		dev_err(>dev,
+			"device reset schedule while reset bit is off\n");
+		return;
+	}
+
+	netif_carrier_off(netdev);
+
 	del_timer_sync(>timer_service);
 	rtnl_lock();
@@ -2464,12 +2495,6 @@ static void ena_fw_reset_device(struct work_struct *work)
 	 */
 	ena_close(netdev);
-	rc = ena_com_dev_reset(ena_dev);
-	if (rc) {
-		dev_err(>dev, "Device reset failed\n");
-		goto err;
-	}
-
 	ena_free_mgmnt_irq(adapter);
 	ena_disable_msix(adapter);
@@ -2482,6 +2507,8 @@ static void ena_fw_reset_device(struct work_struct *work)
 	ena_com_mmio_reg_read_request_destroy(ena_dev);
+	clear_bit(ENA_FLAG_TRIGGER_RESET, >flags);
+
 	/* Finish with the destroy part. Start the init part */
 	rc = ena_device_init(ena_dev, adapter->pdev, _feat_ctx, _state);
@@ -2547,6 +2574,9 @@ static void check_for_missing_tx_completions(struct ena_adapter *adapter)
 	if (!test_bit(ENA_FLAG_DEV_UP, >flags))
 		return;
+	if (test_bit(ENA_FLAG_TRIGGER_RESET, >flags))
+		return;
+
 	budget = ENA_MONITORED_TX_QUEUES;
 	for (i = adapter->last_monitored_tx_qid; i < adapter->num_queues; i++) {
@@ -2646,7 +2676,7 @@ static void ena_timer_service(unsigned long data)
 	if (host_info)
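The guard that the patch above adds to ena_tx_timeout() relies on an atomic test-and-set: only the first caller that sets the reset flag continues, and later callers return early until the flag is cleared. A minimal userspace sketch of that pattern with C11 atomics follows; the sketch_* names are hypothetical, and atomic_fetch_or() stands in for the kernel's test_and_set_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_FLAG_TRIGGER_RESET (1u << 0)

struct sketch_adapter {
	atomic_uint flags;
	/* Plain counter for illustration; the driver uses u64_stats. */
	unsigned int tx_timeout_count;
};

void sketch_adapter_init(struct sketch_adapter *a)
{
	assert(a);
	atomic_init(&a->flags, 0);
	a->tx_timeout_count = 0;
}

/*
 * Request a reset. Returns true only for the caller that actually set
 * the flag; callers that find it already set do nothing.
 */
bool sketch_request_reset(struct sketch_adapter *a)
{
	unsigned int old;

	old = atomic_fetch_or(&a->flags, SKETCH_FLAG_TRIGGER_RESET);
	if (old & SKETCH_FLAG_TRIGGER_RESET)
		return false;	/* a reset is already pending */

	a->tx_timeout_count++;
	return true;
}

/* Clear the flag once the reset work has torn the device down. */
void sketch_reset_done(struct sketch_adapter *a)
{
	atomic_fetch_and(&a->flags, ~SKETCH_FLAG_TRIGGER_RESET);
}
```

Reading and setting the flag in one atomic step is what closes the window in which two timeouts could both decide to trigger a reset.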
[PATCH V3 net-next 07/14] net/ena: refactor ena_get_stats64 to be atomic context safe
ndo_get_stat64() can be called from atomic context, but the current implementation sends an admin command to retrieve the statistics from the device. This admin command can sleep. This patch re-factors the implementation of ena_get_stats64() to use the {rx,tx}bytes/count from the driver's inner counters, and to obtain the rx drop counter from the asynchronous keep alive (heart bit) event. Signed-off-by: Netanel Belgazal--- drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 8 drivers/net/ethernet/amazon/ena/ena_netdev.c | 57 +--- drivers/net/ethernet/amazon/ena/ena_netdev.h | 1 + 3 files changed, 51 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h index e1594d6..5b6509d 100644 --- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h +++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h @@ -873,6 +873,14 @@ struct ena_admin_aenq_link_change_desc { u32 flags; }; +struct ena_admin_aenq_keep_alive_desc { + struct ena_admin_aenq_common_desc aenq_common_desc; + + u32 rx_drops_low; + + u32 rx_drops_high; +}; + struct ena_admin_ena_mmio_req_read_less_resp { u16 req_id; diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 639f0aa..ea3c801 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -2169,28 +2169,46 @@ static struct rtnl_link_stats64 *ena_get_stats64(struct net_device *netdev, struct rtnl_link_stats64 *stats) { struct ena_adapter *adapter = netdev_priv(netdev); - struct ena_admin_basic_stats ena_stats; - int rc; + struct ena_ring *rx_ring, *tx_ring; + unsigned int start; + u64 rx_drops; + int i; if (!test_bit(ENA_FLAG_DEV_UP, >flags)) return NULL; - rc = ena_com_get_dev_basic_stats(adapter->ena_dev, _stats); - if (rc) - return NULL; + for (i = 0; i < adapter->num_queues; i++) { + u64 bytes, packets; + + tx_ring = >tx_ring[i]; + + do { + start = 
u64_stats_fetch_begin_irq(_ring->syncp); + packets = tx_ring->tx_stats.cnt; + bytes = tx_ring->tx_stats.bytes; + } while (u64_stats_fetch_retry_irq(_ring->syncp, start)); + + stats->tx_packets += packets; + stats->tx_bytes += bytes; - stats->tx_bytes = ((u64)ena_stats.tx_bytes_high << 32) | - ena_stats.tx_bytes_low; - stats->rx_bytes = ((u64)ena_stats.rx_bytes_high << 32) | - ena_stats.rx_bytes_low; + rx_ring = >rx_ring[i]; + + do { + start = u64_stats_fetch_begin_irq(_ring->syncp); + packets = rx_ring->rx_stats.cnt; + bytes = rx_ring->rx_stats.bytes; + } while (u64_stats_fetch_retry_irq(_ring->syncp, start)); - stats->rx_packets = ((u64)ena_stats.rx_pkts_high << 32) | - ena_stats.rx_pkts_low; - stats->tx_packets = ((u64)ena_stats.tx_pkts_high << 32) | - ena_stats.tx_pkts_low; + stats->rx_packets += packets; + stats->rx_bytes += bytes; + } + + do { + start = u64_stats_fetch_begin_irq(>syncp); + rx_drops = adapter->dev_stats.rx_drops; + } while (u64_stats_fetch_retry_irq(>syncp, start)); - stats->rx_dropped = ((u64)ena_stats.rx_drops_high << 32) | - ena_stats.rx_drops_low; + stats->rx_dropped = rx_drops; stats->multicast = 0; stats->collisions = 0; @@ -3213,8 +3231,17 @@ static void ena_keep_alive_wd(void *adapter_data, struct ena_admin_aenq_entry *aenq_e) { struct ena_adapter *adapter = (struct ena_adapter *)adapter_data; + struct ena_admin_aenq_keep_alive_desc *desc; + u64 rx_drops; + desc = (struct ena_admin_aenq_keep_alive_desc *)aenq_e; adapter->last_keep_alive_jiffies = jiffies; + + rx_drops = ((u64)desc->rx_drops_high << 32) | desc->rx_drops_low; + + u64_stats_update_begin(>syncp); + adapter->dev_stats.rx_drops = rx_drops; + u64_stats_update_end(>syncp); } static void ena_notification(void *adapter_data, diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h index 69d7e9e..f0ddc11 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h @@ -241,6 +241,7 @@ 
struct ena_stats_dev { u64 interface_up; u64 interface_down; u64 admin_q_pause; + u64 rx_drops; }; enum ena_flags_t { -- 2.7.4
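The two arithmetic steps this patch relies on can be sketched in plain userspace C: rebuilding a 64-bit drop counter from the two 32-bit halves in the keep-alive descriptor, and summing per-ring software counters into totals. The struct and function names below are hypothetical stand-ins, not the driver's own, and the sketch leaves out the u64_stats_fetch_begin_irq()/u64_stats_fetch_retry_irq() retry loop the patch uses to get a consistent snapshot.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two 32-bit halves carried by the keep-alive event. */
struct keep_alive_sample {
	uint32_t rx_drops_low;
	uint32_t rx_drops_high;
};

/* Hypothetical per-ring software counters, one entry per queue. */
struct ring_counters {
	uint64_t packets;
	uint64_t bytes;
};

/* Widen the high half first so the shift is performed on a 64-bit value. */
static uint64_t rx_drops_from_sample(const struct keep_alive_sample *s)
{
	return ((uint64_t)s->rx_drops_high << 32) | s->rx_drops_low;
}

/* Sum the per-ring counters into totals, similar to the per-queue loop in the patch. */
static void sum_ring_counters(const struct ring_counters *rings, size_t n,
			      uint64_t *packets, uint64_t *bytes)
{
	size_t i;

	*packets = 0;
	*bytes = 0;
	for (i = 0; i < n; i++) {
		*packets += rings[i].packets;
		*bytes += rings[i].bytes;
	}
}
```

Without the (uint64_t) cast the shift by 32 would be applied to a 32-bit operand, which is undefined in C, so the cast has to come before the shift.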
[PATCH V3 net-next 10/14] net/ena: use READ_ONCE to access completion descriptors
Completion descriptors are accessed from the driver and from the device.
To avoid reading a stale value, use the READ_ONCE macro.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_com.h     | 1 +
 drivers/net/ethernet/amazon/ena/ena_eth_com.c | 8
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.h b/drivers/net/ethernet/amazon/ena/ena_com.h
index 509d7b8..c9b33ee 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_com.h
@@ -33,6 +33,7 @@
 #ifndef ENA_COM
 #define ENA_COM

+#include
 #include
 #include
 #include
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.c b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
index 539c536..f999305 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
@@ -45,7 +45,7 @@ static inline struct ena_eth_io_rx_cdesc_base *ena_com_get_next_rx_cdesc(
 	cdesc = (struct ena_eth_io_rx_cdesc_base *)(io_cq->cdesc_addr.virt_addr +
 			(head_masked * io_cq->cdesc_entry_size_in_bytes));

-	desc_phase = (cdesc->status & ENA_ETH_IO_RX_CDESC_BASE_PHASE_MASK) >>
+	desc_phase = (READ_ONCE(cdesc->status) & ENA_ETH_IO_RX_CDESC_BASE_PHASE_MASK) >>
 			ENA_ETH_IO_RX_CDESC_BASE_PHASE_SHIFT;

 	if (desc_phase != expected_phase)
@@ -141,7 +141,7 @@ static inline u16 ena_com_cdesc_rx_pkt_get(struct ena_com_io_cq *io_cq,
 		ena_com_cq_inc_head(io_cq);
 		count++;
-		last = (cdesc->status & ENA_ETH_IO_RX_CDESC_BASE_LAST_MASK) >>
+		last = (READ_ONCE(cdesc->status) & ENA_ETH_IO_RX_CDESC_BASE_LAST_MASK) >>
 			ENA_ETH_IO_RX_CDESC_BASE_LAST_SHIFT;
 	} while (!last);

@@ -489,13 +489,13 @@ int ena_com_tx_comp_req_id_get(struct ena_com_io_cq *io_cq, u16 *req_id)
 	 * expected, it mean that the device still didn't update
 	 * this completion.
 	 */
-	cdesc_phase = cdesc->flags & ENA_ETH_IO_TX_CDESC_PHASE_MASK;
+	cdesc_phase = READ_ONCE(cdesc->flags) & ENA_ETH_IO_TX_CDESC_PHASE_MASK;
 	if (cdesc_phase != expected_phase)
 		return -EAGAIN;

 	ena_com_cq_inc_head(io_cq);

-	*req_id = cdesc->req_id;
+	*req_id = READ_ONCE(cdesc->req_id);

 	return 0;
 }
-- 
2.7.4
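As a rough userspace illustration of what the patch is after, the sketch below forces a single real load of a descriptor field through a volatile-qualified pointer, which is approximately what READ_ONCE does for scalar types. The descriptor layout, the mask value and the macro names are assumptions made for the example, not the ENA definitions. A volatile load only stops the compiler from caching or refetching the value; it does not order the phase read against later reads of the same descriptor, and that ordering question is outside this sketch.

```c
#include <stdint.h>

/* Userspace approximation of READ_ONCE for scalars: one real load, never cached. */
#define READ_ONCE_U8(x)  (*(const volatile uint8_t *)&(x))
#define READ_ONCE_U16(x) (*(const volatile uint16_t *)&(x))

/* Hypothetical phase mask; the real driver uses ENA_ETH_IO_TX_CDESC_PHASE_MASK. */
#define CDESC_PHASE_MASK 0x1u

/* Hypothetical completion descriptor written by the device. */
struct tx_cdesc {
	uint16_t req_id;
	uint8_t status;
	uint8_t flags;
};

/*
 * Returns 0 and stores the request id when the descriptor carries the
 * expected phase bit, or -1 when the device has not written it yet.
 */
static int tx_comp_req_id_get(const struct tx_cdesc *cdesc,
			      uint8_t expected_phase, uint16_t *req_id)
{
	uint8_t phase = READ_ONCE_U8(cdesc->flags) & CDESC_PHASE_MASK;

	if (phase != expected_phase)
		return -1;

	*req_id = READ_ONCE_U16(cdesc->req_id);
	return 0;
}
```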
[PATCH V3 net-next 12/14] net/ena: change driver's default timeouts
The timeouts were too aggressive and sometimes caused false alarms.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_com.c    | 4 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c
index 5518b1f..8029e7c 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -36,9 +36,9 @@
 /*/

 /* Timeout in micro-sec */
-#define ADMIN_CMD_TIMEOUT_US (100)
+#define ADMIN_CMD_TIMEOUT_US (300)

-#define ENA_ASYNC_QUEUE_DEPTH 4
+#define ENA_ASYNC_QUEUE_DEPTH 16
 #define ENA_ADMIN_QUEUE_DEPTH 32

 #define MIN_ENA_VER (((ENA_COMMON_SPEC_VERSION_MAJOR) << \
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index f0ddc11..efe0ea1 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -100,7 +100,7 @@
 /* Number of queues to check for missing queues per timer service */
 #define ENA_MONITORED_TX_QUEUES 4
 /* Max timeout packets before device reset */
-#define MAX_NUM_OF_TIMEOUTED_PACKETS 32
+#define MAX_NUM_OF_TIMEOUTED_PACKETS 128

 #define ENA_TX_RING_IDX_NEXT(idx, ring_size) (((idx) + 1) & ((ring_size) - 1))

@@ -116,9 +116,9 @@
 #define ENA_IO_IRQ_IDX(q) (ENA_IO_IRQ_FIRST_IDX + (q))

 /* ENA device should send keep alive msg every 1 sec.
- * We wait for 3 sec just to be on the safe side.
+ * We wait for 6 sec just to be on the safe side.
  */
-#define ENA_DEVICE_KALIVE_TIMEOUT (3 * HZ)
+#define ENA_DEVICE_KALIVE_TIMEOUT (6 * HZ)

 #define ENA_MMIO_DISABLE_REG_READ BIT(0)

-- 
2.7.4
[PATCH V3 net-next 14/14] net/ena: update driver version to 1.1.2
Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index efe0ea1..ed62d8e 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -44,7 +44,7 @@
 #include "ena_eth_com.h"

 #define DRV_MODULE_VER_MAJOR 1
-#define DRV_MODULE_VER_MINOR 0
+#define DRV_MODULE_VER_MINOR 1
 #define DRV_MODULE_VER_SUBMINOR 2

 #define DRV_MODULE_NAME "ena"
-- 
2.7.4
[PATCH V3 net-next 11/14] net/ena: reduce the severity of ena printouts
Signed-off-by: Netanel Belgazal--- drivers/net/ethernet/amazon/ena/ena_com.c| 27 +-- drivers/net/ethernet/amazon/ena/ena_netdev.c | 14 +++--- 2 files changed, 28 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c index 46aad3a..5518b1f 100644 --- a/drivers/net/ethernet/amazon/ena/ena_com.c +++ b/drivers/net/ethernet/amazon/ena/ena_com.c @@ -784,7 +784,7 @@ static int ena_com_get_feature_ex(struct ena_com_dev *ena_dev, int ret; if (!ena_com_check_supported_feature_id(ena_dev, feature_id)) { - pr_info("Feature %d isn't supported\n", feature_id); + pr_debug("Feature %d isn't supported\n", feature_id); return -EPERM; } @@ -1126,7 +1126,13 @@ int ena_com_execute_admin_command(struct ena_com_admin_queue *admin_queue, comp_ctx = ena_com_submit_admin_cmd(admin_queue, cmd, cmd_size, comp, comp_size); if (unlikely(IS_ERR(comp_ctx))) { - pr_err("Failed to submit command [%ld]\n", PTR_ERR(comp_ctx)); + if (comp_ctx == ERR_PTR(-ENODEV)) + pr_debug("Failed to submit command [%ld]\n", +PTR_ERR(comp_ctx)); + else + pr_err("Failed to submit command [%ld]\n", + PTR_ERR(comp_ctx)); + return PTR_ERR(comp_ctx); } @@ -1895,7 +1901,7 @@ int ena_com_set_dev_mtu(struct ena_com_dev *ena_dev, int mtu) int ret; if (!ena_com_check_supported_feature_id(ena_dev, ENA_ADMIN_MTU)) { - pr_info("Feature %d isn't supported\n", ENA_ADMIN_MTU); + pr_debug("Feature %d isn't supported\n", ENA_ADMIN_MTU); return -EPERM; } @@ -1948,8 +1954,8 @@ int ena_com_set_hash_function(struct ena_com_dev *ena_dev) if (!ena_com_check_supported_feature_id(ena_dev, ENA_ADMIN_RSS_HASH_FUNCTION)) { - pr_info("Feature %d isn't supported\n", - ENA_ADMIN_RSS_HASH_FUNCTION); + pr_debug("Feature %d isn't supported\n", +ENA_ADMIN_RSS_HASH_FUNCTION); return -EPERM; } @@ -2112,7 +2118,8 @@ int ena_com_set_hash_ctrl(struct ena_com_dev *ena_dev) if (!ena_com_check_supported_feature_id(ena_dev, ENA_ADMIN_RSS_HASH_INPUT)) { - pr_info("Feature %d isn't 
supported\n", ENA_ADMIN_RSS_HASH_INPUT); + pr_debug("Feature %d isn't supported\n", +ENA_ADMIN_RSS_HASH_INPUT); return -EPERM; } @@ -2270,8 +2277,8 @@ int ena_com_indirect_table_set(struct ena_com_dev *ena_dev) if (!ena_com_check_supported_feature_id( ena_dev, ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG)) { - pr_info("Feature %d isn't supported\n", - ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG); + pr_debug("Feature %d isn't supported\n", +ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG); return -EPERM; } @@ -2542,8 +2549,8 @@ int ena_com_init_interrupt_moderation(struct ena_com_dev *ena_dev) if (rc) { if (rc == -EPERM) { - pr_info("Feature %d isn't supported\n", - ENA_ADMIN_INTERRUPT_MODERATION); + pr_debug("Feature %d isn't supported\n", +ENA_ADMIN_INTERRUPT_MODERATION); rc = 0; } else { pr_err("Failed to get interrupt moderation admin cmd. rc: %d\n", diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index d1e1d9d..96048bd 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -563,6 +563,7 @@ static void ena_free_all_rx_bufs(struct ena_adapter *adapter) */ static void ena_free_tx_bufs(struct ena_ring *tx_ring) { + bool print_once = true; u32 i; for (i = 0; i < tx_ring->ring_size; i++) { @@ -574,9 +575,16 @@ static void ena_free_tx_bufs(struct ena_ring *tx_ring) if (!tx_info->skb) continue; - netdev_notice(tx_ring->netdev, - "free uncompleted tx skb qid %d idx 0x%x\n", - tx_ring->qid, i); + if (print_once) { + netdev_notice(tx_ring->netdev, + "free uncompleted tx skb qid %d idx 0x%x\n", + tx_ring->qid, i); + print_once = false; +
[PATCH V3 net-next 13/14] net/ena: change condition for host attribute configuration
Move the host info config to be the first admin command that is executed.
This change requires the driver to remove the 'feature check' from the
host info configuration flow. The check is removed since the supported
features bitmask field is retrieved only after calling the
ENA_ADMIN_DEVICE_ATTRIBUTES admin command. If set host info is not
supported, an error will be returned by the device.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/ena/ena_com.c    | 8 +++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 +++--
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c
index 8029e7c..08d11ce 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -2451,11 +2451,9 @@ int ena_com_set_host_attributes(struct ena_com_dev *ena_dev)

 	int ret;

-	if (!ena_com_check_supported_feature_id(ena_dev,
-						ENA_ADMIN_HOST_ATTR_CONFIG)) {
-		pr_warn("Set host attribute isn't supported\n");
-		return -EPERM;
-	}
+	/* Host attribute config is called before ena_com_get_dev_attr_feat
+	 * so ena_com can't check if the feature is supported.
+	 */

 	memset(, 0x0, sizeof(cmd));
 	admin_queue = _dev->admin_queue;
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 96048bd..7b9c80f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2416,6 +2416,8 @@ static int ena_device_init(struct ena_com_dev *ena_dev, struct pci_dev *pdev,
 	 */
 	ena_com_set_admin_polling_mode(ena_dev, true);

+	ena_config_host_info(ena_dev);
+
 	/* Get Device Attributes*/
 	rc = ena_com_get_dev_attr_feat(ena_dev, get_feat_ctx);
 	if (rc) {
@@ -2440,11 +2442,10 @@ static int ena_device_init(struct ena_com_dev *ena_dev, struct pci_dev *pdev,

 	*wd_state = !!(aenq_groups & BIT(ENA_ADMIN_KEEP_ALIVE));

-	ena_config_host_info(ena_dev);
-
 	return 0;

 err_admin_init:
+	ena_com_delete_host_info(ena_dev);
 	ena_com_admin_destroy(ena_dev);
 err_mmio_read_less:
 	ena_com_mmio_reg_read_request_destroy(ena_dev);
-- 
2.7.4
[PATCH net-next] net: ipv6: remove skb_reserve in getroute
Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
allocated skb is not passed through the routing engine (like it is for
IPv4) and has not been since the beginning of git time.

Signed-off-by: David Ahern
---
 net/ipv6/route.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 47499ed429da..5046d2b24004 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3420,12 +3420,6 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 		goto errout;
 	}

-	/* Reserve room for dummy headers, this skb can pass
-	   through good chunk of routing engine.
-	 */
-	skb_reset_mac_header(skb);
-	skb_reserve(skb, MAX_HEADER + sizeof(struct ipv6hdr));
-
 	skb_dst_set(skb, >dst);

 	err = rt6_fill_node(net, skb, rt, , , iif,
-- 
2.1.4
Re: TCP stops sending packets over loopback on 4.10-rc3?
On Wed, 2017-01-25 at 06:39 -0800, Eric Dumazet wrote:
> On Wed, 2017-01-25 at 09:26 -0500, Josef Bacik wrote:
> >
> > Nope ftrace isn't broken, I'm just dumb, the space is being reclaimed
> > by sk_wmem_free_skb(). So I guess I need to figure out why I stop
> > getting ACK's from the other side of the loopback. Thanks,
>
> ss -temoi dst 127.0.0.1
>
> Might give you some hints, like packets being dropped.
>
> ACK can be delayed if the reader is slow to consume bytes.
>

Yup looks like I'm getting packet loss for some reason, but the
application is sitting there in recvmsg, so it's not hung and definitely
available for receiving new packets.

ESTAB 0 4124232 ::1:34044 ::1:nbd timer:(on,1min38sec,9) ino:20067 sk:8 <->
	skmem:(r0,rb6291456,t0,tb4194304,f1720,w4204872,o0,bl0) ts sack cubic
	wscale:7,7 rto:102912 backoff:9 rtt:0.084/0.038 ato:40 mss:65464 cwnd:1
	ssthresh:18 bytes_acked:71964077253 bytes_received:68804409996
	segs_out:3882829 segs_in:4092731 send 6234.7Mbps lastsnd:4336
	lastrcv:111289 lastack:111299 unacked:28 retrans:1/4277 lost:28
	reordering:60 rcv_rtt:1.875 rcv_space:1315136
ESTAB 0 0 ::1:nbd ::1:34044 timer:(keepalive,109min,0) ino:19396 sk:2 <->
	skmem:(r0,rb6291456,t0,tb2626560,f0,w0,o0,bl0) ts sack cubic
	wscale:7,7 rto:201 rtt:0.279/0.16 ato:40 mss:65464 cwnd:16 ssthresh:9
	bytes_acked:68804409996 bytes_received:71964077252 segs_out:4092730
	segs_in:3882792 send 30033.7Mbps lastsnd:111286 lastrcv:111307
	lastack:111286 retrans:0/3113 reordering:26 rcv_rtt:1 rcv_space:4782816

I traced tcp_enter_loss() and once things stop moving that starts firing.
That's all I have so far, been busy with other things but I'm devoting my
full attention to this now. Thanks,

Josef
[PATCH net] tcp: don't annotate mark on control socket from tcp_v6_send_response()
Unlike ipv4, this control socket is shared by all cpus so we cannot use it as scratchpad area to annotate the mark that we pass to ip6_xmit(). Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket family caches the flowi6 structure in the sctp_transport structure, so we cannot use to carry the mark unless we later on reset it back, which I discarded since it looks ugly to me. Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled") Suggested-by: Eric DumazetSigned-off-by: Pablo Neira Ayuso --- Tested with nc6, works for me. Note: DCCP and SCTP don't seem to support fwmark_reflect yet, so leave these bits. Thanks! include/net/ipv6.h | 2 +- net/dccp/ipv6.c | 4 ++-- net/ipv6/inet6_connection_sock.c | 2 +- net/ipv6/ip6_output.c| 4 ++-- net/ipv6/tcp_ipv6.c | 5 ++--- net/sctp/ipv6.c | 3 ++- 6 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 487e57391664..7afe991e900e 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -871,7 +871,7 @@ int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb); * upper-layer output functions */ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6, -struct ipv6_txoptions *opt, int tclass); +__u32 mark, struct ipv6_txoptions *opt, int tclass); int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr); diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index adfc790f7193..c4e879c02186 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -227,7 +227,7 @@ static int dccp_v6_send_response(const struct sock *sk, struct request_sock *req opt = ireq->ipv6_opt; if (!opt) opt = rcu_dereference(np->opt); - err = ip6_xmit(sk, skb, , opt, np->tclass); + err = ip6_xmit(sk, skb, , sk->sk_mark, opt, np->tclass); rcu_read_unlock(); err = net_xmit_eval(err); } @@ -281,7 +281,7 @@ static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb) dst = ip6_dst_lookup_flow(ctl_sk, , NULL); if 
(!IS_ERR(dst)) { skb_dst_set(skb, dst); - ip6_xmit(ctl_sk, skb, , NULL, 0); + ip6_xmit(ctl_sk, skb, , 0, NULL, 0); DCCP_INC_STATS(DCCP_MIB_OUTSEGS); DCCP_INC_STATS(DCCP_MIB_OUTRSTS); return; diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c index 7396e75e161b..75c308239243 100644 --- a/net/ipv6/inet6_connection_sock.c +++ b/net/ipv6/inet6_connection_sock.c @@ -176,7 +176,7 @@ int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl_unused /* Restore final destination back after routing done */ fl6.daddr = sk->sk_v6_daddr; - res = ip6_xmit(sk, skb, , rcu_dereference(np->opt), + res = ip6_xmit(sk, skb, , sk->sk_mark, rcu_dereference(np->opt), np->tclass); rcu_read_unlock(); return res; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 38122d04fadc..2c0df09e9036 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -172,7 +172,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb) * which are using proper atomic operations or spinlocks. 
*/ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6, -struct ipv6_txoptions *opt, int tclass) +__u32 mark, struct ipv6_txoptions *opt, int tclass) { struct net *net = sock_net(sk); const struct ipv6_pinfo *np = inet6_sk(sk); @@ -240,7 +240,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6, skb->protocol = htons(ETH_P_IPV6); skb->priority = sk->sk_priority; - skb->mark = sk->sk_mark; + skb->mark = mark; mtu = dst_mtu(dst); if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) { diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 2b20622a5824..cb8929681dc7 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -469,7 +469,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, opt = ireq->ipv6_opt; if (!opt) opt = rcu_dereference(np->opt); - err = ip6_xmit(sk, skb, fl6, opt, np->tclass); + err = ip6_xmit(sk, skb, fl6, sk->sk_mark, opt, np->tclass); rcu_read_unlock(); err = net_xmit_eval(err); } @@ -840,8 +840,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32 dst = ip6_dst_lookup_flow(ctl_sk, , NULL); if (!IS_ERR(dst)) { skb_dst_set(buff, dst); - ctl_sk->sk_mark =
[PATCH net-next v2] net: ipv6: ignore null_entry on route dumps
lkp-robot reported a BUG: [ 10.151226] BUG: unable to handle kernel NULL pointer dereference at 0198 [ 10.152525] IP: rt6_fill_node+0x164/0x4b8 [ 10.153307] *pdpt = 12ee5001 *pde = [ 10.153309] [ 10.154492] Oops: [#1] [ 10.154987] CPU: 0 PID: 909 Comm: netifd Not tainted 4.10.0-rc4-00722-g41e8c70ee162-dirty #10 [ 10.156482] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 10.158254] task: d0deb000 task.stack: d0e0c000 [ 10.159059] EIP: rt6_fill_node+0x164/0x4b8 [ 10.159780] EFLAGS: 00010296 CPU: 0 [ 10.160404] EAX: EBX: d10c2358 ECX: c1f7c6cc EDX: c1f6ff44 [ 10.161469] ESI: EDI: c2059900 EBP: d0e0dc4c ESP: d0e0dbe4 [ 10.162534] DS: 007b ES: 007b FS: GS: 0033 SS: 0068 [ 10.163482] CR0: 80050033 CR2: 0198 CR3: 10d94660 CR4: 06b0 [ 10.164535] Call Trace: [ 10.164993] ? paravirt_sched_clock+0x9/0xd [ 10.165727] ? sched_clock+0x9/0xc [ 10.166329] ? sched_clock_cpu+0x19/0xe9 [ 10.166991] ? lock_release+0x13e/0x36c [ 10.167652] rt6_dump_route+0x4c/0x56 [ 10.168276] fib6_dump_node+0x1d/0x3d [ 10.168913] fib6_walk_continue+0xab/0x167 [ 10.169611] fib6_walk+0x2a/0x40 [ 10.170182] inet6_dump_fib+0xfb/0x1e0 [ 10.170855] netlink_dump+0xcd/0x21f This happens when the loopback device is set down and a ipv6 fib route dump is requested. ip6_null_entry is the root of all ipv6 fib tables making it integrated into the table and hence passed to the ipv6 route dump code. The null_entry route uses the loopback device for dst.dev but may not have rt6i_idev set because of the order in which initializations are done -- ip6_route_net_init is run before addrconf_init has initialized the loopback device. Fixing the initialization order is a much bigger problem with no obvious solution thus far. The BUG is triggered when the loopback is set down and the netif_running check added by a1a22c1206 fails. The fill_node descends to checking rt->rt6i_idev for ignore_routes_with_linkdown and since rt6i_idev is NULL it faults. 
The null_entry route should not be processed in a dump request. Catch and ignore. This check is done in rt6_dump_route as it is the highest place in the callchain with knowledge of both the route and the network namespace. Fixes: a1a22c1206("net: ipv6: Keep nexthop of multipath route on admin down") Signed-off-by: David Ahern--- v2 - updated commit message; no code change Dave: per last email you suggested putting this in fib6_dump_node, but fib6_dump_node does not have knowledge of the network namespace because of the way the fib_walker works. I could put struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) w->args; into fib6_dump_node to get the net pointer from it but did not want to duplicate the typecast. net/ipv6/route.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 4b1f0f98a0e9..47499ed429da 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3320,6 +3320,10 @@ static int rt6_fill_node(struct net *net, int rt6_dump_route(struct rt6_info *rt, void *p_arg) { struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg; + struct net *net = arg->net; + + if (rt == net->ipv6.ip6_null_entry) + return 0; if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) { struct rtmsg *rtm = nlmsg_data(arg->cb->nlh); @@ -3332,7 +3336,7 @@ int rt6_dump_route(struct rt6_info *rt, void *p_arg) } } - return rt6_fill_node(arg->net, + return rt6_fill_node(net, arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid, arg->cb->nlh->nlmsg_seq, NLM_F_MULTI); -- 2.1.4
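The fix amounts to a sentinel check at the top of the per-route dump callback. A minimal userspace sketch of that pattern follows; the types and names are hypothetical stand-ins for rt6_info, the rt6_rtnl_dump_arg context and net->ipv6.ip6_null_entry, and the counter only stands in for the work rt6_fill_node would do.

```c
/* Hypothetical route entry; only its address matters for the sentinel check. */
struct route {
	int id;
};

/* Hypothetical dump context that knows which entry is the per-namespace sentinel. */
struct dump_ctx {
	const struct route *null_entry;
	int emitted;
};

/*
 * Per-route dump callback. The sentinel root entry is part of the table
 * but is not a real route, so it is skipped before any of its fields
 * (some of which may never have been initialized) are touched.
 */
static int dump_route(const struct route *rt, struct dump_ctx *ctx)
{
	if (rt == ctx->null_entry)
		return 0;

	ctx->emitted++;	/* stands in for filling a netlink message */
	return 0;
}
```

Comparing pointers rather than inspecting fields is what keeps the check safe: the faulting path in the report dereferenced a field of the sentinel that had not been set up.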
Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver
On (01/26/17 15:21), Willem de Bruijn wrote:
> > If the application has provided fewer than hard_header_len bytes,
> > dev_validate_header() will zero out the skb->data as needed. This is
> > acceptable for SOCK_DGRAM/PF_PACKET sockets but in all other cases,
>
> This was added not for datagram sockets, but to be able to bypass
> validation. See the message in commit 2793a23aacbd ("net: validate
> variable length ll header") and discussion leading up to that patch.

Some context: I got into this patch as a result of the comments in
https://www.mail-archive.com/netdev@vger.kernel.org/msg149031.html

> As David pointed out, this does not handle variable length headers
> correctly. In link layers that support these, hard_header_len defines
> the maximum header length. A hard failure on len < hard_header_len
> would be incorrect.

Right, since DaveM's comments, I took a look at the drivers that have a
->validate. AFAICT (from cscope), ax25 is the only in-kernel driver that
actually passes a ->validate pointer. I tried patching ax25 here:
http://marc.info/?l=linux-hams=148537926422828=2

Still waiting to hear back from that list (which doesn't seem to have much
traffic, so maybe I should time out on it). Does that patch make better
sense? (I'll look up the comments leading up to 2793a23aacbd later
tonight.)

> The ->validate callback was added to allow specifying additional
> constraints on a per protocol basis. This is where a min constraint
> can be added, e.g., for ethernet.
>
> > - if (!dev_validate_header(dev, skb->data, len)) {
> > + newlen = dev_validate_header(dev, skb->data, len);
> > + /* As comments above this function indicate, a full L2 header
> > +  * must be passed to this function, so if newlen > len, bail.
> > +  */
> > + if (newlen < 0 || newlen > len) {
>
> If callers only care whether the function returned failure or
> increased len, which also indicates failure, it is cleaner to leave it
> a boolean and fail in cases where len < the minimum for that link
> layer type. No caller actually uses newlen.
>
> > + /* Caller has allocated for copylen in non-paged part of
> > +  * skb so we should never find newlen > hdrlen
> > +  */
> > + WARN_ON(newlen > hdrlen);
>
> WARN_ON_ONCE is safer.

Ok, that's easy enough to do.
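To make the reviewer's suggestion concrete, here is one way a boolean check with a per-link-layer minimum could look. This is only a sketch of the idea under discussion, not dev_validate_header() itself: the struct, the min_header_len field and the function name are all hypothetical, and the sketch ignores the zero-padding and ->validate callback paths mentioned in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a link layer's header length limits. */
struct link_layer {
	size_t hard_header_len;	/* maximum header length for this link layer */
	size_t min_header_len;	/* hypothetical per-protocol minimum */
};

/*
 * Boolean validation: a full-size header always passes, a shorter one
 * passes only if it still meets the link layer's own minimum. For a
 * fixed-size header the two limits are equal, so anything shorter fails.
 */
static bool header_len_ok(const struct link_layer *ll, size_t len)
{
	if (len >= ll->hard_header_len)
		return true;

	return len >= ll->min_header_len;
}
```

With this shape, callers only need to test the result; there is no new length to compare against, which is the simplification the review asks for.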
Re: netvsc NAPI patch process
From: KY Srinivasan
Date: Thu, 26 Jan 2017 20:46:40 +

> In the past, we have done this in two stages - get the supporting
> vmbus patches into Greg's tree first and in the next merge cycle get
> the netvsc patches in. Why not continue to do what we have done in
> the past to address cross-tree dependencies.

It takes too damn long.
[PATCHv3 perf/core 1/6] tools lib bpf: Add BPF program pinning APIs.
Add new APIs to pin a BPF program (or specific instances) to the
filesystem. The user can specify the full path within a BPF filesystem
to pin the program.

bpf_program__pin_instance(prog, path, n) will pin the nth instance of
'prog' to the specified path.
bpf_program__pin(prog, path) will create the directory 'path' (if it
does not exist) and pin each instance within that directory. For
instance, path/0, path/1, path/2.

Signed-off-by: Joe Stringer
---
v3: Add per-instance pinning.
    Use path for bpf_program__pin() as directory.
v2: Don't automount BPF filesystem
    Split program, map, object pinning into separate APIs and separate
    patches.
---
 tools/lib/bpf/libbpf.c | 112 +
 tools/lib/bpf/libbpf.h | 3 ++
 2 files changed, 115 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e6cd62b1264b..d1d7638b7c21 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4,6 +4,7 @@
  * Copyright (C) 2013-2015 Alexei Starovoitov
  * Copyright (C) 2015 Wang Nan
  * Copyright (C) 2015 Huawei Inc.
+ * Copyright (C) 2017 Nicira, Inc.
* * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -22,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -31,7 +33,10 @@ #include #include #include +#include #include +#include +#include #include #include @@ -1237,6 +1242,113 @@ int bpf_object__load(struct bpf_object *obj) return err; } +static int check_path(const char *path) +{ + struct statfs st_fs; + char *dname, *dir; + int err = 0; + + if (path == NULL) + return -EINVAL; + + dname = strdup(path); + dir = dirname(dname); + if (statfs(dir, _fs)) { + pr_warning("failed to statfs %s: %s\n", dir, strerror(errno)); + err = -errno; + } + free(dname); + + if (!err && st_fs.f_type != BPF_FS_MAGIC) { + pr_warning("specified path %s is not on BPF FS\n", path); + err = -EINVAL; + } + + return err; +} + +int bpf_program__pin_instance(struct bpf_program *prog, const char *path, + int instance) +{ + int err; + + err = check_path(path); + if (err) + return err; + + if (prog == NULL) { + pr_warning("invalid program pointer\n"); + return -EINVAL; + } + + if (instance < 0 || instance >= prog->instances.nr) { + pr_warning("invalid prog instance %d of prog %s (max %d)\n", + instance, prog->section_name, prog->instances.nr); + return -EINVAL; + } + + if (bpf_obj_pin(prog->instances.fds[instance], path)) { + pr_warning("failed to pin program: %s\n", strerror(errno)); + return -errno; + } + pr_debug("pinned program '%s'\n", path); + + return 0; +} + +static int make_dir(const char *path) +{ + int err = 0; + + if (mkdir(path, 0700) && errno != EEXIST) + err = -errno; + + if (err) + pr_warning("failed to mkdir %s: %s\n", path, strerror(-err)); + return err; +} + +int bpf_program__pin(struct bpf_program *prog, const char *path) +{ + int i, err; + + err = check_path(path); + if (err) + return err; + + if (prog == NULL) { + pr_warning("invalid program pointer\n"); + return -EINVAL; + } + + if (prog->instances.nr <= 0) { + 
pr_warning("no instances of prog %s to pin\n", + prog->section_name); + return -EINVAL; + } + + err = make_dir(path); + if (err) + return err; + + for (i = 0; i < prog->instances.nr; i++) { + char buf[PATH_MAX]; + int len; + + len = snprintf(buf, PATH_MAX, "%s/%d", path, i); + if (len < 0) + return -EINVAL; + else if (len > PATH_MAX) + return -ENAMETOOLONG; + + err = bpf_program__pin_instance(prog, buf, i); + if (err) + return err; + } + + return 0; +} + void bpf_object__close(struct bpf_object *obj) { size_t i; diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 4014d1ba5e3d..9f8aa63b95f4 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -106,6 +106,9 @@ void *bpf_program__priv(struct bpf_program *prog); const char *bpf_program__title(struct bpf_program *prog, bool needs_copy); int bpf_program__fd(struct bpf_program *prog); +int bpf_program__pin_instance(struct bpf_program *prog, const char *path, + int instance); +int
[PATCH] net: intel: e1000e: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated. We move this driver to new api {get|set}_link_ksettings. As I don't have the hardware, I'd be very pleased if someone may test this patch. Signed-off-by: Philippe Reynes--- drivers/net/ethernet/intel/e1000e/ethtool.c | 91 ++ 1 files changed, 49 insertions(+), 42 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c index 7aff68a..3768a5c 100644 --- a/drivers/net/ethernet/intel/e1000e/ethtool.c +++ b/drivers/net/ethernet/intel/e1000e/ethtool.c @@ -117,15 +117,15 @@ struct e1000_stats { #define E1000_TEST_LEN ARRAY_SIZE(e1000_gstrings_test) -static int e1000_get_settings(struct net_device *netdev, - struct ethtool_cmd *ecmd) +static int e1000_get_link_ksettings(struct net_device *netdev, + struct ethtool_link_ksettings *cmd) { struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_hw *hw = >hw; - u32 speed; + u32 speed, supported, advertising; if (hw->phy.media_type == e1000_media_type_copper) { - ecmd->supported = (SUPPORTED_10baseT_Half | + supported = (SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full | SUPPORTED_100baseT_Half | SUPPORTED_100baseT_Full | @@ -133,39 +133,36 @@ static int e1000_get_settings(struct net_device *netdev, SUPPORTED_Autoneg | SUPPORTED_TP); if (hw->phy.type == e1000_phy_ife) - ecmd->supported &= ~SUPPORTED_1000baseT_Full; - ecmd->advertising = ADVERTISED_TP; + supported &= ~SUPPORTED_1000baseT_Full; + advertising = ADVERTISED_TP; if (hw->mac.autoneg == 1) { - ecmd->advertising |= ADVERTISED_Autoneg; + advertising |= ADVERTISED_Autoneg; /* the e1000 autoneg seems to match ethtool nicely */ - ecmd->advertising |= hw->phy.autoneg_advertised; + advertising |= hw->phy.autoneg_advertised; } - ecmd->port = PORT_TP; - ecmd->phy_address = hw->phy.addr; - ecmd->transceiver = XCVR_INTERNAL; - + cmd->base.port = PORT_TP; + cmd->base.phy_address = hw->phy.addr; } else { - ecmd->supported = (SUPPORTED_1000baseT_Full | + supported = 
(SUPPORTED_1000baseT_Full | SUPPORTED_FIBRE | SUPPORTED_Autoneg); - ecmd->advertising = (ADVERTISED_1000baseT_Full | + advertising = (ADVERTISED_1000baseT_Full | ADVERTISED_FIBRE | ADVERTISED_Autoneg); - ecmd->port = PORT_FIBRE; - ecmd->transceiver = XCVR_EXTERNAL; + cmd->base.port = PORT_FIBRE; } speed = SPEED_UNKNOWN; - ecmd->duplex = DUPLEX_UNKNOWN; + cmd->base.duplex = DUPLEX_UNKNOWN; if (netif_running(netdev)) { if (netif_carrier_ok(netdev)) { speed = adapter->link_speed; - ecmd->duplex = adapter->link_duplex - 1; + cmd->base.duplex = adapter->link_duplex - 1; } } else if (!pm_runtime_suspended(netdev->dev.parent)) { u32 status = er32(STATUS); @@ -179,30 +176,36 @@ static int e1000_get_settings(struct net_device *netdev, speed = SPEED_10; if (status & E1000_STATUS_FD) - ecmd->duplex = DUPLEX_FULL; + cmd->base.duplex = DUPLEX_FULL; else - ecmd->duplex = DUPLEX_HALF; + cmd->base.duplex = DUPLEX_HALF; } } - ethtool_cmd_speed_set(ecmd, speed); - ecmd->autoneg = ((hw->phy.media_type == e1000_media_type_fiber) || + cmd->base.speed = speed; + cmd->base.autoneg = ((hw->phy.media_type == e1000_media_type_fiber) || hw->mac.autoneg) ? AUTONEG_ENABLE : AUTONEG_DISABLE; /* MDI-X => 2; MDI =>1; Invalid =>0 */ if ((hw->phy.media_type == e1000_media_type_copper) && netif_carrier_ok(netdev)) - ecmd->eth_tp_mdix = hw->phy.is_mdix ? ETH_TP_MDI_X : ETH_TP_MDI; + cmd->base.eth_tp_mdix = hw->phy.is_mdix ? + ETH_TP_MDI_X : ETH_TP_MDI; else - ecmd->eth_tp_mdix = ETH_TP_MDI_INVALID; + cmd->base.eth_tp_mdix =
[PATCHv3 perf/core 0/6] Libbpf object pinning
This series adds pinning functionality for maps, programs, and objects.
Library users may call bpf_map__pin(map, path) or
bpf_program__pin(prog, path) to pin maps and programs separately, or use
bpf_object__pin(obj, path) to pin all maps and programs from the BPF
object to the path. The map and program variations require a path where
the map or program will be pinned in the filesystem, and the object
variation will create named directories for each program with instances
within, and pin the maps by name under the path.

For example, with the directory '/sys/fs/bpf/foo' and a BPF object which
contains two instances of a program named 'bar', and a map named 'baz':

/sys/fs/bpf/foo/bar/0
/sys/fs/bpf/foo/bar/1
/sys/fs/bpf/foo/baz

---
v3: Split out bpf_program__pin_instance().
    Change the paths from PATH/{maps,progs}/foo to the above.
    Drop the patches that were applied.
    Add a perf test to check that pinning works.
v2: Wang Nan provided improvements to patch 1.
    Dropped patch 2 from v1.
    Added acks for acked patches.
    Split the bpf_obj__pin() to also provide map / program pinning APIs.
    Allow users to provide full filesystem path (don't autodetect/mount
    BPFFS).
v1: Initial post.

Joe Stringer (6):
  tools lib bpf: Add BPF program pinning APIs.
  tools lib bpf: Add bpf_map__pin()
  tools lib bpf: Add bpf_object__pin()
  tools perf util: Make rm_rf(path) argument const
  tools lib api fs: Add bpf_fs filesystem detector
  perf test: Add libbpf pinning test

 tools/lib/api/fs/fs.c  | 16 +
 tools/lib/api/fs/fs.h  | 1 +
 tools/lib/bpf/libbpf.c | 188 +
 tools/lib/bpf/libbpf.h | 5 ++
 tools/perf/tests/bpf.c | 42 ++-
 tools/perf/util/util.c | 2 +-
 tools/perf/util/util.h | 2 +-
 7 files changed, 253 insertions(+), 3 deletions(-)

-- 
2.11.0
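The directory layout described in the cover letter can be reproduced with a small path-building helper. The sketch below is an illustration only: pin_path_for_instance() is a hypothetical name, not part of libbpf, and it simply formats BASE/PROG/INSTANCE into a caller-supplied buffer. It treats a return value from snprintf() that is greater than or equal to the buffer size as truncation, since snprintf() reports the length the full string would have needed, excluding the terminating NUL.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build "<base>/<prog>/<instance>" into buf (hypothetical helper).
 * Returns 0 on success, -EINVAL on a formatting error, and
 * -ENAMETOOLONG when the result would not fit including the NUL.
 */
static int pin_path_for_instance(char *buf, size_t size, const char *base,
				 const char *prog, int instance)
{
	int len = snprintf(buf, size, "%s/%s/%d", base, prog, instance);

	if (len < 0)
		return -EINVAL;
	if ((size_t)len >= size)
		return -ENAMETOOLONG;

	return 0;
}
```

Called with base "/sys/fs/bpf/foo", program name "bar" and instance 1, the helper produces the second path from the example listing in the cover letter.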
[PATCHv3 perf/core 5/6] tools lib api fs: Add bpf_fs filesystem detector
Allow mounting of the BPF filesystem at /sys/fs/bpf. Signed-off-by: Joe Stringer--- v3: Initial post. --- tools/lib/api/fs/fs.c | 16 tools/lib/api/fs/fs.h | 1 + 2 files changed, 17 insertions(+) diff --git a/tools/lib/api/fs/fs.c b/tools/lib/api/fs/fs.c index f99f49e4a31e..4b6bfc43cccf 100644 --- a/tools/lib/api/fs/fs.c +++ b/tools/lib/api/fs/fs.c @@ -38,6 +38,10 @@ #define HUGETLBFS_MAGIC0x958458f6 #endif +#ifndef BPF_FS_MAGIC +#define BPF_FS_MAGIC 0xcafe4a11 +#endif + static const char * const sysfs__fs_known_mountpoints[] = { "/sys", 0, @@ -75,6 +79,11 @@ static const char * const hugetlbfs__known_mountpoints[] = { 0, }; +static const char * const bpf_fs__known_mountpoints[] = { + "/sys/fs/bpf", + 0, +}; + struct fs { const char *name; const char * const *mounts; @@ -89,6 +98,7 @@ enum { FS__DEBUGFS = 2, FS__TRACEFS = 3, FS__HUGETLBFS = 4, + FS__BPF_FS = 5, }; #ifndef TRACEFS_MAGIC @@ -121,6 +131,11 @@ static struct fs fs__entries[] = { .mounts = hugetlbfs__known_mountpoints, .magic = HUGETLBFS_MAGIC, }, + [FS__BPF_FS] = { + .name = "bpf", + .mounts = bpf_fs__known_mountpoints, + .magic = BPF_FS_MAGIC, + }, }; static bool fs__read_mounts(struct fs *fs) @@ -280,6 +295,7 @@ FS(procfs, FS__PROCFS); FS(debugfs, FS__DEBUGFS); FS(tracefs, FS__TRACEFS); FS(hugetlbfs, FS__HUGETLBFS); +FS(bpf_fs, FS__BPF_FS); int filename__read_int(const char *filename, int *value) { diff --git a/tools/lib/api/fs/fs.h b/tools/lib/api/fs/fs.h index a63269f5d20c..6b332dc74498 100644 --- a/tools/lib/api/fs/fs.h +++ b/tools/lib/api/fs/fs.h @@ -22,6 +22,7 @@ FS(procfs) FS(debugfs) FS(tracefs) FS(hugetlbfs) +FS(bpf_fs) #undef FS -- 2.11.0
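As a rough illustration of what a magic-number based detector boils down to, the check can be sketched in user space with statfs(2). This is only a sketch under assumptions: path_is_bpf_fs() is a hypothetical name that does not appear in the patch, and the code is Linux-only. The BPF_FS_MAGIC fallback value is the one from the hunk above.

```c
#include <stdbool.h>
#include <sys/vfs.h>

#ifndef BPF_FS_MAGIC
#define BPF_FS_MAGIC 0xcafe4a11
#endif

/*
 * Hypothetical helper (not from the patch), Linux-only: report whether
 * path sits on a filesystem whose statfs() magic matches BPF_FS_MAGIC.
 */
static bool path_is_bpf_fs(const char *path)
{
	struct statfs st;

	if (!path || statfs(path, &st) < 0)
		return false;

	/* f_type's width and signedness vary by architecture. */
	return (unsigned long)st.f_type == (unsigned long)BPF_FS_MAGIC;
}
```

A caller would typically run this against "/sys/fs/bpf" before trying to pin anything there.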
[PATCHv3 perf/core 2/6] tools lib bpf: Add bpf_map__pin()
Add a new API to pin a BPF map to the filesystem. The user can specify
the full path within a BPF filesystem to pin the map.

Signed-off-by: Joe Stringer
---
v3: No change.
v2: Don't automount BPF filesystem
    Split program, map, object pinning into separate APIs and separate
    patches.
---
 tools/lib/bpf/libbpf.c | 22 ++
 tools/lib/bpf/libbpf.h | 1 +
 2 files changed, 23 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index d1d7638b7c21..ce987c02363e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1349,6 +1349,28 @@ int bpf_program__pin(struct bpf_program *prog, const char *path)
 	return 0;
 }

+int bpf_map__pin(struct bpf_map *map, const char *path)
+{
+	int err;
+
+	err = check_path(path);
+	if (err)
+		return err;
+
+	if (map == NULL) {
+		pr_warning("invalid map pointer\n");
+		return -EINVAL;
+	}
+
+	if (bpf_obj_pin(map->fd, path)) {
+		pr_warning("failed to pin map: %s\n", strerror(errno));
+		return -errno;
+	}
+
+	pr_debug("pinned map '%s'\n", path);
+	return 0;
+}
+
 void bpf_object__close(struct bpf_object *obj)
 {
 	size_t i;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 9f8aa63b95f4..2addf9d5b13c 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -236,6 +236,7 @@ typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void *);
 int bpf_map__set_priv(struct bpf_map *map, void *priv,
 		      bpf_map_clear_priv_t clear_priv);
 void *bpf_map__priv(struct bpf_map *map);
+int bpf_map__pin(struct bpf_map *map, const char *path);

 long libbpf_get_error(const void *ptr);

--
2.11.0
[PATCHv3 perf/core 6/6] perf test: Add libbpf pinning test
Add a test for the newly added BPF object pinning functionality. For example: # tools/perf/perf test 37 37: BPF filter : 37.1: Basic BPF filtering : Ok 37.2: BPF pinning : Ok 37.3: BPF prologue generation : Ok 37.4: BPF relocation checker : Ok # tools/perf/perf test 37 -v 2>&1 | grep pinned libbpf: pinned map '/sys/fs/bpf/perf_test/flip_table' libbpf: pinned program '/sys/fs/bpf/perf_test/func=SyS_epoll_wait/0' Signed-off-by: Joe StringerCc: Wang Nan Cc: Arnaldo Carvalho de Melo --- v3: Initial post. --- tools/perf/tests/bpf.c | 42 +- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c index 92343f43e44a..1a04fe77487d 100644 --- a/tools/perf/tests/bpf.c +++ b/tools/perf/tests/bpf.c @@ -5,11 +5,13 @@ #include #include #include +#include #include #include "tests.h" #include "llvm.h" #include "debug.h" #define NR_ITERS 111 +#define PERF_TEST_BPF_PATH "/sys/fs/bpf/perf_test" #ifdef HAVE_LIBBPF_SUPPORT @@ -54,6 +56,7 @@ static struct { const char *msg_load_fail; int (*target_func)(void); int expect_result; + boolpin; } bpf_testcase_table[] = { { LLVM_TESTCASE_BASE, @@ -63,6 +66,17 @@ static struct { "load bpf object failed", _wait_loop, (NR_ITERS + 1) / 2, + false, + }, + { + LLVM_TESTCASE_BASE, + "BPF pinning", + "[bpf_pinning]", + "fix kbuild first", + "check your vmlinux setting?", + _wait_loop, + (NR_ITERS + 1) / 2, + true, }, #ifdef HAVE_BPF_PROLOGUE { @@ -73,6 +87,7 @@ static struct { "check your vmlinux setting?", _loop, (NR_ITERS + 1) / 4, + false, }, #endif { @@ -83,6 +98,7 @@ static struct { "libbpf error when dealing with relocation", NULL, 0, + false, }, }; @@ -226,10 +242,34 @@ static int __test__bpf(int idx) goto out; } - if (obj) + if (obj) { ret = do_test(obj, bpf_testcase_table[idx].target_func, bpf_testcase_table[idx].expect_result); + if (ret != TEST_OK) + goto out; + if (bpf_testcase_table[idx].pin) { + int err; + + if (!bpf_fs__mount()) { + pr_debug("BPF filesystem not mounted\n"); + ret = 
TEST_FAIL; + goto out; + } + err = mkdir(PERF_TEST_BPF_PATH, 0777); + if (err && errno != EEXIST) { + pr_debug("Failed to make perf_test dir: %s\n", +strerror(errno)); + ret = TEST_FAIL; + goto out; + } + if (bpf_object__pin(obj, PERF_TEST_BPF_PATH)) + ret = TEST_FAIL; + if (rm_rf(PERF_TEST_BPF_PATH)) + ret = TEST_FAIL; + } + } + out: bpf__clear(); return ret; -- 2.11.0
[PATCHv3 perf/core 3/6] tools lib bpf: Add bpf_object__pin()
Add a new API to pin a BPF object to the filesystem. The user can specify the path within a BPF filesystem to pin the object. Programs will be pinned under a subdirectory named the same as the program, with each instance appearing as a numbered file under that directory, and maps will be pinned under the path using the name of the map as the file basename. For example, with the directory '/sys/fs/bpf/foo' and a BPF object which contains two instances of a program named 'bar', and a map named 'baz': /sys/fs/bpf/foo/bar/0 /sys/fs/bpf/foo/bar/1 /sys/fs/bpf/foo/baz Signed-off-by: Joe Stringer--- v3: Mount to PATH/MAPNAME, PATH/PROGNAME/N (N = instance number) v2: Don't automount BPF filesystem Split program, map, object pinning into separate APIs and separate patches. --- tools/lib/bpf/libbpf.c | 54 ++ tools/lib/bpf/libbpf.h | 1 + 2 files changed, 55 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index ce987c02363e..703cfa986b34 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include #include @@ -1371,6 +1372,59 @@ int bpf_map__pin(struct bpf_map *map, const char *path) return 0; } +int bpf_object__pin(struct bpf_object *obj, const char *path) +{ + struct bpf_program *prog; + struct bpf_map *map; + int err; + + if (!obj) + return -ENOENT; + + if (!obj->loaded) { + pr_warning("object not yet loaded; load it first\n"); + return -ENOENT; + } + + err = make_dir(path); + if (err) + return err; + + bpf_map__for_each(map, obj) { + char buf[PATH_MAX]; + int len; + + len = snprintf(buf, PATH_MAX, "%s/%s", path, + bpf_map__name(map)); + if (len < 0) + return -EINVAL; + else if (len > PATH_MAX) + return -ENAMETOOLONG; + + err = bpf_map__pin(map, buf); + if (err) + return err; + } + + bpf_object__for_each_program(prog, obj) { + char buf[PATH_MAX]; + int len; + + len = snprintf(buf, PATH_MAX, "%s/%s", path, + prog->section_name); + if (len < 0) + return -EINVAL; + else if 
(len > PATH_MAX) + return -ENAMETOOLONG; + + err = bpf_program__pin(prog, buf); + if (err) + return err; + } + + return 0; +} + void bpf_object__close(struct bpf_object *obj) { size_t i; diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 2addf9d5b13c..b30394f9947a 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -65,6 +65,7 @@ struct bpf_object *bpf_object__open(const char *path); struct bpf_object *bpf_object__open_buffer(void *obj_buf, size_t obj_buf_sz, const char *name); +int bpf_object__pin(struct bpf_object *object, const char *path); void bpf_object__close(struct bpf_object *object); /* Load/unload object into/from kernel */ -- 2.11.0
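For reference, the PATH/NAME joining used for maps and programs can be sketched outside libbpf as a small helper. This is an illustration only: build_pin_path() is a hypothetical name, not part of the posted patch. One detail worth noting is that snprintf() returns the length it would have written, so truncation corresponds to len >= size. The sketch checks for that, which is slightly stricter than the len > PATH_MAX comparison in the hunk.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the patch): join a pin directory and an
 * object name as "dir/name", mirroring the PATH/MAPNAME layout.
 * Returns 0 on success or a negative errno value.
 */
static int build_pin_path(char *buf, size_t size, const char *dir,
			  const char *name)
{
	int len;

	if (!buf || !dir || !name || size == 0)
		return -EINVAL;

	len = snprintf(buf, size, "%s/%s", dir, name);
	if (len < 0)
		return -EINVAL;
	/* snprintf() reports the untruncated length, so >= means truncation. */
	if ((size_t)len >= size)
		return -ENAMETOOLONG;

	return 0;
}
```

With dir "/sys/fs/bpf/foo" and name "baz" this yields "/sys/fs/bpf/foo/baz", matching the example layout in the commit message.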
[PATCHv3 perf/core 4/6] tools perf util: Make rm_rf(path) argument const
rm_rf() doesn't modify its path argument, and a future caller will pass a string constant into it to delete. Signed-off-by: Joe Stringer--- v3: Initial post. --- tools/perf/util/util.c | 2 +- tools/perf/util/util.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c index bf29aed16bd6..d8b45cea54d0 100644 --- a/tools/perf/util/util.c +++ b/tools/perf/util/util.c @@ -85,7 +85,7 @@ int mkdir_p(char *path, mode_t mode) return (stat(path, ) && mkdir(path, mode)) ? -1 : 0; } -int rm_rf(char *path) +int rm_rf(const char *path) { DIR *dir; int ret = 0; diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h index 6e8be174ec0b..c74708da8571 100644 --- a/tools/perf/util/util.h +++ b/tools/perf/util/util.h @@ -209,7 +209,7 @@ static inline int sane_case(int x, int high) } int mkdir_p(char *path, mode_t mode); -int rm_rf(char *path); +int rm_rf(const char *path); struct strlist *lsdir(const char *name, bool (*filter)(const char *, struct dirent *)); bool lsdir_no_dot_filter(const char *name, struct dirent *d); int copyfile(const char *from, const char *to); -- 2.11.0
Re: [PATCH v2 net] net: free ip_vs_dest structs when refcnt=0
	Hello,

On Mon, 23 Jan 2017, David Windsor wrote:

> Currently, the ip_vs_dest cache frees ip_vs_dest objects when their
> reference count becomes < 0. Aside from not being semantically sound,
> this is problematic for the new type refcount_t, which will be introduced
> shortly in a separate patch. refcount_t is the new kernel type for
> holding reference counts, and provides overflow protection and a
> constrained interface relative to atomic_t (the type currently being
> used for kernel reference counts).
>
> Per Julian Anastasov: "The problem is that dest_trash currently holds
> deleted dests (unlinked from RCU lists) with refcnt=0." Changing
> dest_trash to hold dest with refcnt=1 will allow us to free ip_vs_dest
> structs when their refcnt=0, in ip_vs_dest_put_and_free().
>
> Signed-off-by: David Windsor

	Thanks! I tested the first version and this one
just adds the needed changes in comments, so

Signed-off-by: Julian Anastasov

	Simon and Pablo, this is more appropriate for
ipvs-next/nf-next. Please apply!

> ---
>  include/net/ip_vs.h            | 2 +-
>  net/netfilter/ipvs/ip_vs_ctl.c | 8 +++-
>  2 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index cd6018a..a3e78ad 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -1421,7 +1421,7 @@ static inline void ip_vs_dest_put(struct ip_vs_dest *dest)
>
>  static inline void ip_vs_dest_put_and_free(struct ip_vs_dest *dest)
>  {
> -	if (atomic_dec_return(&dest->refcnt) < 0)
> +	if (atomic_dec_and_test(&dest->refcnt))
>  		kfree(dest);
>  }
>
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 55e0169..5fc4836 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -711,7 +711,6 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, int dest_af,
>  		     dest->vport == svc->port))) {
>  			/* HIT */
>  			list_del(&dest->t_list);
> -			ip_vs_dest_hold(dest);
>  			goto out;
>  		}
>  	}
> @@ -741,7 +740,7 @@ static void ip_vs_dest_free(struct ip_vs_dest *dest)
>   * When the ip_vs_control_clearup is activated by ipvs module exit,
>   * the service tables must have been flushed and all the connections
>   * are expired, and the refcnt of each destination in the trash must
> - * be 0, so we simply release them here.
> + * be 1, so we simply release them here.
>   */
>  static void ip_vs_trash_cleanup(struct netns_ipvs *ipvs)
>  {
> @@ -1080,11 +1079,10 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>  	if (list_empty(&ipvs->dest_trash) && !cleanup)
>  		mod_timer(&ipvs->dest_trash_timer,
>  			  jiffies + (IP_VS_DEST_TRASH_PERIOD >> 1));
> -	/* dest lives in trash without reference */
> +	/* dest lives in trash with reference */
>  	list_add(&dest->t_list, &ipvs->dest_trash);
>  	dest->idle_start = 0;
>  	spin_unlock_bh(&ipvs->dest_trash_lock);
> -	ip_vs_dest_put(dest);
>  }
>
>
> @@ -1160,7 +1158,7 @@ static void ip_vs_dest_trash_expire(unsigned long data)
>
>  	spin_lock(&ipvs->dest_trash_lock);
>  	list_for_each_entry_safe(dest, next, &ipvs->dest_trash, t_list) {
> -		if (atomic_read(&dest->refcnt) > 0)
> +		if (atomic_read(&dest->refcnt) > 1)
>  			continue;
>  		if (dest->idle_start) {
>  			if (time_before(now, dest->idle_start +
> --
> 2.7.4

Regards

--
Julian Anastasov
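The convention the patch moves to can be sketched in plain C11, outside the kernel. This is a simplified illustration under assumed names (struct dest, dest_alloc(), dest_hold(), dest_put_and_free(), dest_is_idle() are all hypothetical), not the IPVS code: the trash list keeps one reference of its own, so an object is freed exactly when the count drops to 0 rather than below it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ip_vs_dest; only the refcount matters. */
struct dest {
	atomic_int refcnt;
};

/* Allocate an object that starts with a single reference (the trash's). */
static struct dest *dest_alloc(void)
{
	struct dest *d = malloc(sizeof(*d));

	if (d)
		atomic_init(&d->refcnt, 1);
	return d;
}

static void dest_hold(struct dest *d)
{
	atomic_fetch_add(&d->refcnt, 1);
}

/*
 * Drop one reference and free on the 1 -> 0 transition.
 * Returns true when the object was freed.
 */
static bool dest_put_and_free(struct dest *d)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&d->refcnt, 1) == 1) {
		free(d);
		return true;
	}
	return false;
}

/* Mirrors the expiry check: only the trash's own reference remains. */
static bool dest_is_idle(struct dest *d)
{
	return atomic_load(&d->refcnt) <= 1;
}
```

Freeing on the transition to 0 is also the shape that a saturating, overflow-checked counter type such as refcount_t expects, which is the motivation given in the commit message.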
RE: netvsc NAPI patch process
> -Original Message- > From: Stephen Hemminger [mailto:step...@networkplumber.org] > Sent: Thursday, January 26, 2017 10:04 AM > To: da...@davemloft.net; KY Srinivasan; Greg KH > > Cc: netdev@vger.kernel.org > Subject: netvsc NAPI patch process > > I have a working set of patches to enable NAPI in the netvsc driver. > The problem is that it requires a set of patches to vmbus layer as well. > Since vmbus patches have been going through char-misc-next tree rather > than net-next, it is difficult to stage these. In the past, we have done this in two stages - get the supporting vmbus patches into Greg's tree first and in the next merge cycle get the netvsc patches in. Why not continue to do what we have done in the past to address cross-tree dependencies. Regards, K. Y > > How about if I send the vmbus patches through normal driver-devel > upstream > and during the 4.10 merge window send the last 3 patches for NAPI for linux- > net > tree to get into 4.10?
Re: [PATCH 0/6 v3] kvmalloc
On 01/26/2017 02:40 PM, Michal Hocko wrote: On Thu 26-01-17 14:10:06, Daniel Borkmann wrote: On 01/26/2017 12:58 PM, Michal Hocko wrote: On Thu 26-01-17 12:33:55, Daniel Borkmann wrote: On 01/26/2017 11:08 AM, Michal Hocko wrote: [...] If you disagree I can drop the bpf part of course... If we could consolidate these spots with kvmalloc() eventually, I'm all for it. But even if __GFP_NORETRY is not covered down to all possible paths, it kind of does have an effect already of saying 'don't try too hard', so would it be harmful to still keep that for now? If it's not, I'd personally prefer to just leave it as is until there's some form of support by kvmalloc() and friends. Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not disallowed. It is not _supported_ which means that if it doesn't work as you expect you are on your own. Which is actually the situation right now as well. But I still think that this is just not the right thing to do. Even though it might happen to work in some cases it gives a false impression of a solution. So I would rather go with Hmm. 'On my own' means, we could potentially BUG somewhere down the vmalloc implementation, etc, presumably? So it might in-fact be harmful to pass that, right? No it would mean that it might eventually hit the behavior which you are trying to avoid - in other words it may invoke OOM killer even though __GFP_NORETRY means giving up before any system wide disruptive actions are taken. Ok, thanks for clarifying, more on that further below. diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 8697f43cf93c..a6dc4d596f14 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl) void *bpf_map_area_alloc(size_t size) { + /* +* FIXME: we would really like to not trigger the OOM killer and rather +* fail instead. This is not supported right now. Please nag MM people +* if these OOM start bothering people. 
+*/ Ok, I know this is out of scope for this series, but since i) this is _not_ the _only_ spot right now which has such a construct and ii) I am already kind of nagging a bit ;), my question would be, what would it take to start supporting it? propagate gfp mask all the way down from vmalloc to all places which might allocate down the path and especially page table allocation functions are a PITA because they are really deep. This is a lot of work... But realistically, how big is this problem really? Is it really worth it? You said this is an admin only interface and admin can kill the machine by OOM and other means already. Moreover and I should probably mention it explicitly, your d407bd25a204b reduced the likelihood of oom for another reason. kmalloc used GFP_USER previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this could indeed hit the OOM e.g. due to memory fragmentation. It would be much harder to hit the OOM killer from vmalloc which doesn't issue higher order allocation requests. Or have you ever seen the OOM killer pointing to the vmalloc fallback path? The case I was concerned about was from vmalloc() path, not kmalloc(). That was where the stack trace indicating OOM pointed to. As an example, there could be really large allocation requests for maps where the map has pre-allocated memory for its elements. Thus, if we get to the point where we need to kill others due to shortage of mem for satisfying this, I'd much much rather prefer to just not let vmalloc() work really hard and fail early on instead. In my (crafted) test case, I was connected via ssh and it each time reliably killed my connection, which is really suboptimal. F.e., I could also imagine a buggy or miscalculated map definition for a prog that is provisioned to multiple places, which then accidentally triggers this. Or if large on purpose, but we crossed the line, it could be handled more gracefully, f.e. 
I could imagine an option of falling back to a non-pre-allocated map flavor from the application loading the program. Trade-off for sure, but still allowing it to operate up to a certain extent. Granted, if vmalloc() succeeded without trying hard and we then OOM elsewhere, too bad, but we don't have much control over that one anyway, only about our own request. Reason I asked above was whether having __GFP_NORETRY in would be fatal somewhere down the path, but seems not as you say. So to answer your second email with the bpf and netfilter hunks, why not replacing them with kvmalloc() and __GFP_NORETRY flag and add that big fat FIXME comment above there, saying explicitly that __GFP_NORETRY is not harmful though has only /partial/ effect right now and that full support needs to be implemented in future. That would still be better than not having it, imo, and the FIXME would make expectations clear to anyone reading that code. Thanks, Daniel
Re: [PATCH net-next v2 0/4] net: dsa: Preparatory patches
From: Florian Fainelli
Date: Thu, 26 Jan 2017 10:45:50 -0800

> This patch series extracts the 4 patches of the larger: net: dsa: Support for
> pdata in dsa2 while we wait for feedback from Greg KH on the device
> references.
>
> Changes in v2:
>
> - rebased properly after the multi-MDIO bus support added to mv88e6xxx

Series applied, thanks Florian.
Re: [PATCH net-next] liquidio: Avoid accessing skb after submitting to input queue
From: Felix Manlunas
Date: Thu, 26 Jan 2017 11:52:35 -0800

> From: Satanand Burla
>
> Accessing skb after submitting to input queue can cause
> access to stale pointers if the skb ends up being transmitted
> and freed by that time.
>
> Signed-off-by: Satanand Burla
> Signed-off-by: Derek Chickles
> Signed-off-by: Raghu Vatsavayi
> Signed-off-by: Felix Manlunas

Applied, thanks.
Re: [PATCH net-next 1/3] trace: add variant without spacing in trace_print_hex_seq
On 01/26/2017 08:53 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jan 25, 2017 at 02:28:16AM +0100, Daniel Borkmann escreveu:
>> For upcoming tracepoint support for BPF, we want to dump the program's
>> tag. Format should be similar to __print_hex(), but without spacing.
>> Add a __print_hex_str() variant for exactly that purpose that reuses
>> trace_print_hex_seq().
>
> Steven should be back to his side of the wall soon, will wait for his
> Ack, ok?

Ok, seems this set got applied already to net-next in the meantime, so
if there are any objections on this, I will follow up with a patch of
course.

Thanks,
Daniel

> - Arnaldo
Re: [PATCHv3 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs
On 27/01/17 04:10, Gregory CLEMENT wrote: >> +internal-regs { [snip] >> + >> +dfx-registers { > node label > [snip] >> +switch { > node label > These are peers to the internal-regs, i.e. parts of the SoC with mappable windows in the address space. Do they really need a label? Their subnodes absolutely need (and have) labels.
Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver
> If the application has provided fewer than hard_header_len bytes,
> dev_validate_header() will zero out the skb->data as needed. This is
> acceptable for SOCK_DGRAM/PF_PACKET sockets but in all other cases,

This was added not for datagram sockets, but to be able to bypass
validation. See the message in commit 2793a23aacbd ("net: validate
variable length ll header") and discussion leading up to that patch.

> the application must provide a full L2 header, and the PF_PACKET Tx
> paths must fail with an error when fewer than hard_header_len bytes
> are detected.

As David pointed out, this does not handle variable length headers
correctly. In link layers that support these, hard_header_len defines
the maximum header length. A hard failure on len < hard_header_len would
be incorrect.

The ->validate callback was added to allow specifying additional
constraints on a per protocol basis. This is where a min constraint can
be added, e.g., for ethernet.

> All invocations to dev_validate_header() already adjusts the
> skb's data, len, tail etc pointers based on hard_header_len before
> invoking dev_validate_header(), so additional skb pointers should
> not be needed after dev_validate_header().
>
> Signed-off-by: Sowmini Varadhan
> ---
> -	if (!dev_validate_header(dev, skb->data, len)) {
> +	newlen = dev_validate_header(dev, skb->data, len);
> +	/* As comments above this function indicate, a full L2 header
> +	 * must be passed to this function, so if newlen > len, bail.
> +	 */
> +	if (newlen < 0 || newlen > len) {

If callers only care whether the function returned failure or increased
len, which also indicates failure, it is cleaner to leave it a boolean
and fail in cases where len < the minimum for that link layer type. No
caller actually uses newlen.

> +	/* Caller has allocated for copylen in non-paged part of
> +	 * skb so we should never find newlen > hdrlen
> +	 */
> +	WARN_ON(newlen > hdrlen);

WARN_ON_ONCE is safer.
Re: [PATCHv3 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs
On 27/01/17 04:10, Gregory CLEMENT wrote:
> Hi Chris,
>
> On ven., janv. 06 2017, Chris Packham wrote:
>
>> The Marvell 98DX3236, 98DX3336, 98DX4521 and variants are switch ASICs
>> with integrated CPUs. They are similar to the Armada XP SoCs but have
>> different I/O interfaces.
>
> Before sending a new version I have a few remarks:
>
> [snip]

I'll update the dtsi files to use the node labels and correct the
comments as requested.

>
> Why the following node is not part of the dtsi?
>
> Gregory
>
>> +			resume@20980 {
>> +				compatible = "marvell,98dx3336-resume-ctrl";
>> +				reg = <0x20980 0x10>;
>> +			};
>> +		};
>> +};

The 98DX3236 has a single ARMv7 core. As such this resume control isn't
present on it. The 98DX3336 and 98DX4521 have dual ARMv7 cores and this
is used to boot the second core (SMP support is a little different
compared to Armada-XP).

In other words {98DX3336, 98DX4521} = 98DX3236 + an additional core. At
the switch packet processor level there are more differences but as far
as the kernel is concerned the only real difference is the number of
cores.
Re: [PATCH net 1/5] ibmvnic: harden interrupt handler
On Wed, 25 Jan 2017 15:02:19 -0600
Thomas Falcon wrote:

>  static irqreturn_t ibmvnic_interrupt(int irq, void *instance)
>  {
>  	struct ibmvnic_adapter *adapter = instance;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&adapter->crq.lock, flags);
> +	vio_disable_interrupts(adapter->vdev);
> +	tasklet_schedule(&adapter->tasklet);
> +	spin_unlock_irqrestore(&adapter->crq.lock, flags);
> +	return IRQ_HANDLED;
> +}
> +

Why not use NAPI rather than a tasklet?
Re: [PATCH net] bpf: expose netns inode to bpf programs
On 1/26/17 11:07 AM, Andy Lutomirski wrote: On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov wrote: On 1/26/17 10:12 AM, Andy Lutomirski wrote: On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov wrote: On 1/26/17 8:37 AM, Andy Lutomirski wrote: Think of bpf programs as safe kernel modules. They don't have confined boundaries and program authors, if not careful, can shoot themselves in the foot. We're not trying to prevent that because it's impossible to check that the program is sane. Just like it's impossible to check that kernel module is sane. But in case of bpf we check that bpf program is _safe_ from the kernel point of view. If it's doing some garbage, it's program's business. Does it make more sense now? With all due respect, I think this is not an acceptable way to think about BPF at all. If you think of BPF this way, I think there needs to be a real discussion at KS or similar as to whether this is okay. The reason is simple: the kernel promises a stable ABI to userspace but not to kernel modules. By thinking of BPF as more like a module, you're taking a big shortcut that will either result in ABI breakage down the road or in committing to a problematic stable ABI. you misunderstood the analogy. bpf abi is certainly stable. that's why we were careful of not exposing anything to it that is not already stable. In that case I don't understand what you're trying to say. Eric thinks your patch exposes a bad interface. A bad interface for userspace is a very different thing from a bad interface available to kernel modules. Are you saying that BPF is kernel-module-like in that the ABI exposed to BPF programs doesn't need to meet the same quality standards as userspace ABIs? of course not. ns.inum is already exposed to user space as a value. This patch exposes it to bpf program in a convenient and stable way, Here's what I'm imagining Eric is thinking: ns.inum is currently exposed to userspace via procfs. 
In principle, the value could be local to a namespace, though, which would enable CRIU to be able to preserve namespace inode numbers across a checkpoint+restore operation. If this happened, the contained and restored procfs would see a different inode number than the outermost procfs. sure. there are many different ways for the program to see inode that either was already reused or disappeared. What I'm saying that it is expected. We cannot prevent that from bpf side. Just like ifindex value read by the program can be bogus as in the example I just provided. If you start exposing the raw ns.inum field to BPF programs and those programs are not themselves scoped to a namespace, then this could create a problem for CRIU. criu doesn't support ebpf because maps are not snapshot-able and programs are detached from the control plane. I cannot see how one can criu of xdp or cls program. The ssh connection to the box might die in the middle while criu is messing with unknown. Hence the analogy to the kernel modules. Imagine a set of mini-kernel modules and a set of apps that depend on them. What kind of criu can we even talk about? But you told Eric that his nack doesn't matter, and maybe it would be nice to ask him to clarify instead. Fair enough. Eric, thoughts?
Re: [PATCH net-next 1/3] trace: add variant without spacing in trace_print_hex_seq
Em Wed, Jan 25, 2017 at 02:28:16AM +0100, Daniel Borkmann escreveu: > For upcoming tracepoint support for BPF, we want to dump the program's > tag. Format should be similar to __print_hex(), but without spacing. > Add a __print_hex_str() variant for exactly that purpose that reuses > trace_print_hex_seq(). Steven should be back to his side of the wall soon, will wait for his Ack, ok? - Arnaldo > Signed-off-by: Daniel Borkmann> Cc: Steven Rostedt > Cc: Arnaldo Carvalho de Melo > --- > include/linux/trace_events.h | 3 ++- > include/trace/trace_events.h | 8 +++- > kernel/trace/trace_output.c | 7 --- > 3 files changed, 13 insertions(+), 5 deletions(-) > > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h > index be00761..cfa475a 100644 > --- a/include/linux/trace_events.h > +++ b/include/linux/trace_events.h > @@ -33,7 +33,8 @@ const char *trace_print_bitmask_seq(struct trace_seq *p, > void *bitmask_ptr, > unsigned int bitmask_size); > > const char *trace_print_hex_seq(struct trace_seq *p, > - const unsigned char *buf, int len); > + const unsigned char *buf, int len, > + bool spacing); > > const char *trace_print_array_seq(struct trace_seq *p, > const void *buf, int count, > diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h > index 467e12f..9f68462 100644 > --- a/include/trace/trace_events.h > +++ b/include/trace/trace_events.h > @@ -297,7 +297,12 @@ > #endif > > #undef __print_hex > -#define __print_hex(buf, buf_len) trace_print_hex_seq(p, buf, buf_len) > +#define __print_hex(buf, buf_len)\ > + trace_print_hex_seq(p, buf, buf_len, true) > + > +#undef __print_hex_str > +#define __print_hex_str(buf, buf_len) > \ > + trace_print_hex_seq(p, buf, buf_len, false) > > #undef __print_array > #define __print_array(array, count, el_size) \ > @@ -711,6 +716,7 @@ > #undef __print_flags > #undef __print_symbolic > #undef __print_hex > +#undef __print_hex_str > #undef __get_dynamic_array > #undef __get_dynamic_array_len > #undef 
__get_str > diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c > index 5d33a73..30a144b1 100644 > --- a/kernel/trace/trace_output.c > +++ b/kernel/trace/trace_output.c > @@ -163,14 +163,15 @@ enum print_line_t trace_print_printk_msg_only(struct > trace_iterator *iter) > EXPORT_SYMBOL_GPL(trace_print_bitmask_seq); > > const char * > -trace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int > buf_len) > +trace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int > buf_len, > + bool spacing) > { > int i; > const char *ret = trace_seq_buffer_ptr(p); > > for (i = 0; i < buf_len; i++) > - trace_seq_printf(p, "%s%2.2x", i == 0 ? "" : " ", buf[i]); > - > + trace_seq_printf(p, "%s%2.2x", !spacing || i == 0 ? "" : " ", > + buf[i]); > trace_seq_putc(p, 0); > > return ret; > -- > 1.9.3
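To make the effect of the new spacing argument concrete, here is a small user-space sketch of the same formatting rule. It is an illustration under assumptions, not the kernel function: hex_seq() is a hypothetical name and writes into a caller-supplied buffer instead of a struct trace_seq.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * User-space illustration only; hex_seq() is a hypothetical name.
 * Writes buf[0..len) as lowercase hex into out, separated by single
 * spaces when spacing is true. Returns 0, or -1 if out is too small.
 */
static int hex_seq(char *out, size_t outsz, const unsigned char *buf,
		   int len, bool spacing)
{
	size_t used = 0;
	int i;

	if (!out || outsz == 0)
		return -1;
	out[0] = '\0';

	for (i = 0; i < len; i++) {
		/* Same separator rule as the patched trace_seq_printf() call. */
		int n = snprintf(out + used, outsz - used, "%s%2.2x",
				 !spacing || i == 0 ? "" : " ", buf[i]);

		if (n < 0 || (size_t)n >= outsz - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

For the bytes de ad 0b ef, spacing=true produces "de ad 0b ef" (the __print_hex() style) and spacing=false produces "dead0bef" (the __print_hex_str() style).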
[PATCH net-next] liquidio: Avoid accessing skb after submitting to input queue
From: Satanand Burla

Accessing skb after submitting to input queue can cause
access to stale pointers if the skb ends up being transmitted
and freed by that time.

Signed-off-by: Satanand Burla
Signed-off-by: Derek Chickles
Signed-off-by: Raghu Vatsavayi
Signed-off-by: Felix Manlunas
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c    | 6 +++---
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 5ee3f00..9261ddc 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3316,11 +3316,11 @@ static int liquidio_xmit(struct sk_buff *skb, struct net_device *netdev)

 	netif_trans_update(netdev);

-	if (skb_shinfo(skb)->gso_size)
-		stats->tx_done += skb_shinfo(skb)->gso_segs;
+	if (tx_info->s.gso_segs)
+		stats->tx_done += tx_info->s.gso_segs;
 	else
 		stats->tx_done++;
-	stats->tx_tot_bytes += skb->len;
+	stats->tx_tot_bytes += ndata.datasize;

 	return NETDEV_TX_OK;

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index e96cf6c..a6587d7 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -2433,11 +2433,11 @@ static int liquidio_xmit(struct sk_buff *skb, struct net_device *netdev)

 	netif_trans_update(netdev);

-	if (skb_shinfo(skb)->gso_size)
-		stats->tx_done += skb_shinfo(skb)->gso_segs;
+	if (tx_info->s.gso_segs)
+		stats->tx_done += tx_info->s.gso_segs;
 	else
 		stats->tx_done++;
-	stats->tx_tot_bytes += skb->len;
+	stats->tx_tot_bytes += ndata.datasize;

 	return NETDEV_TX_OK;

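The pattern behind this fix is general: once a buffer has been handed to a queue that may complete and free it, the submitter should only use values it copied beforehand. A minimal user-space sketch follows, with hypothetical names (struct pkt, struct tx_stats, submit(), xmit()) standing in for the driver's types. It is not the liquidio code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the skb and the driver statistics. */
struct pkt {
	size_t len;
	unsigned int gso_segs;
};

struct tx_stats {
	unsigned long tx_done;
	unsigned long tx_tot_bytes;
};

/*
 * Pretend the hardware queue transmits and frees the packet immediately,
 * which is the worst case the commit message describes.
 */
static int submit(struct pkt *p)
{
	free(p);
	return 0;
}

/* Copy what the stats need before submit(); never touch p afterwards. */
static int xmit(struct pkt *p, struct tx_stats *stats)
{
	size_t len = p->len;
	unsigned int segs = p->gso_segs;

	if (submit(p))
		return -1;

	stats->tx_done += segs ? segs : 1;
	stats->tx_tot_bytes += len;
	return 0;
}
```

The patch reaches the same result by reading the segment count and data size from structures the driver still owns (tx_info and ndata) instead of from the skb.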
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On 26 January 2017 at 11:32, Arnaldo Carvalho de Melo wrote:
> On Wed, Jan 25, 2017 at 10:18:22AM +0800, Wangnan (F) wrote:
>> On 2017/1/25 9:16, Joe Stringer wrote:
>> > On 24 January 2017 at 17:06, Wangnan (F) wrote:
>> > > On 2017/1/25 9:04, Wangnan (F) wrote:
>> > > Is it possible to use directory tree instead?
>
>> > > %s/object/mapname
>> > > %s/object/prog/instance
>> > I don't think objects have names, so let's assume an object with two
>> > program instances named foo, and one map named bar.
>
>> > A call of bpf_object__pin(obj, "/sys/fs/bpf/myobj") would mount with
>> > the following files and directories:
>> > /sys/fs/bpf/myobj/foo/1
>> > /sys/fs/bpf/myobj/foo/2
>> > /sys/fs/bpf/myobj/bar
>
>> > Alternatively, if you want to control exactly where you want the
>> > progs/maps to be pinned, you can call eg
>> > bpf_program__pin_instance(prog, "/sys/fs/bpf/wherever", 0) and that
>> > instance will be mounted to /sys/fs/bpf/wherever, or alternatively
>> > bpf_program__pin(prog, "/sys/fs/bpf/foo"), and you will end up with
>> > /sys/fs/bpf/foo/{0,1}.
>
>> > This looks pretty reasonable to me.
>
>> It looks good to me.
>
> Ok, please continue from perf/core, Ingo merged the first patch of this
> patchset today,

Ok thanks, I'll continue from there.
Re: [PATCH net-next 2/3] sfc: refactor debug-or-warnings printks
From: Edward Cree
Date: Thu, 26 Jan 2017 17:53:48 +

> diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
> index 5927c20..c640955 100644
> --- a/drivers/net/ethernet/sfc/net_driver.h
> +++ b/drivers/net/ethernet/sfc/net_driver.h
> @@ -51,6 +51,15 @@
>  #define EFX_WARN_ON_PARANOID(x) do {} while (0)
>  #endif
>
> +/* if @cond then downgrade to debug, else print at @level */
> +#define netif_cond_dbg(priv, type, netdev, cond, level, fmt, args...)     \
> +	do {                                                              \
> +		if (cond)                                                 \
> +			netif_dbg(priv, type, netdev, fmt, ##args);       \
> +		else                                                      \
> +			netif_ ## level(priv, type, netdev, fmt, ##args); \
> +	} while (0)
> +
>  /**
>   *
>   * Efx data structures
>

Please do not define locally in a driver an interface that looks like
a generic one and might be useful to code outside of your driver.

Thanks.
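For readers unfamiliar with the token-pasting trick in the quoted macro, here is a minimal userspace sketch of the same idea. Everything in it is an assumption made for illustration: `log_at()`, the `log_*` macros and the per-level counters are hypothetical, and it uses standard `__VA_ARGS__` rather than the kernel's GNU-style `args...`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

enum log_level { LOG_DBG, LOG_WARN, LOG_ERR, LOG_LEVELS };

/* Per-level message counters, only here so the effect is observable. */
static unsigned int log_counts[LOG_LEVELS];

static void log_at(enum log_level level, const char *fmt, ...)
{
	va_list ap;

	assert(level < LOG_LEVELS);
	log_counts[level]++;
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
}

#define log_dbg(...)	log_at(LOG_DBG, __VA_ARGS__)
#define log_warn(...)	log_at(LOG_WARN, __VA_ARGS__)
#define log_err(...)	log_at(LOG_ERR, __VA_ARGS__)

/* If cond is true, downgrade to debug; otherwise log at the named level. */
#define log_cond_dbg(cond, level, ...)			\
	do {						\
		if (cond)				\
			log_dbg(__VA_ARGS__);		\
		else					\
			log_ ## level(__VA_ARGS__);	\
	} while (0)
```

A call such as `log_cond_dbg(quiet, warn, "link down\n")` expands to `log_dbg(...)` or `log_warn(...)` depending on `quiet`. Nothing in the macro is specific to one driver, which is the point of the review comment.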
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On Wed, Jan 25, 2017 at 10:18:22AM +0800, Wangnan (F) wrote:
> On 2017/1/25 9:16, Joe Stringer wrote:
> > On 24 January 2017 at 17:06, Wangnan (F) wrote:
> > > On 2017/1/25 9:04, Wangnan (F) wrote:
> > > Is it possible to use directory tree instead?
>
> > > %s/object/mapname
> > > %s/object/prog/instance
> > I don't think objects have names, so let's assume an object with two
> > program instances named foo, and one map named bar.
>
> > A call of bpf_object__pin(obj, "/sys/fs/bpf/myobj") would mount with
> > the following files and directories:
> > /sys/fs/bpf/myobj/foo/1
> > /sys/fs/bpf/myobj/foo/2
> > /sys/fs/bpf/myobj/bar
>
> > Alternatively, if you want to control exactly where you want the
> > progs/maps to be pinned, you can call eg
> > bpf_program__pin_instance(prog, "/sys/fs/bpf/wherever", 0) and that
> > instance will be mounted to /sys/fs/bpf/wherever, or alternatively
> > bpf_program__pin(prog, "/sys/fs/bpf/foo"), and you will end up with
> > /sys/fs/bpf/foo/{0,1}.
>
> > This looks pretty reasonable to me.
>
> It looks good to me.

Ok, please continue from perf/core, Ingo merged the first patch of this
patchset today,

- Arnaldo
Re: [PATCH 0/7] pull request for net-next: batman-adv 2017-01-26
From: Simon Wunderlich
Date: Thu, 26 Jan 2017 17:43:57 +0100

> this is the updated version of yesterday's feature/cleanup pull request for
> batman-adv which should go into net-next. We've followed your suggestion
> regarding the NET_XMIT_CN handling and modified Gao's patch accordingly.
>
> The remaining patches are untouched.
>
> Please pull or let me know of any problem!

Pulled, thanks.
Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled
On Thu, 2017-01-26 at 20:19 +0100, Pablo Neira Ayuso wrote:
> Right. This is not percpu as in IPv4.
>
> I can send a follow up patch to get this in sync with the way we do it
> in IPv4, ie. add percpu socket.
>
> Fine with this approach? Thanks!

Not really.

percpu sockets are going to slow down network namespace creation /
deletion and increase memory footprint.

IPv6 is cleaner because it does not really have to use different
sockets.

Ultimately we would like to have the same for IPv4.

I would rather carry the mark either in an additional parameter, or in
the flow that is already passed as a parameter.
Re: [PATCH net v2 3/3] gtp: fix cross netns recv on gtp socket
From: Andreas Schultz
Date: Thu, 26 Jan 2017 16:11:34 +0100

> The use of the passed through netlink src_net to check for a
> cross netns operation was wrong. Using the GTP socket and the
> GTP netdevice is always correct (even if the netdev has been
> moved to new netns after link creation).
>
> Remove the now obsolete net field from gtp_dev.
>
> Signed-off-by: Andreas Schultz

Please at least compile test your submissions:

drivers/net/gtp.c: In function ‘gtp_newlink’:
drivers/net/gtp.c:677:8: error: too many arguments to function ‘gtp_encap_enable’
  err = gtp_encap_enable(dev, gtp, fd0, fd1, src_net);
        ^
drivers/net/gtp.c:659:12: note: declared here
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
            ^
drivers/net/gtp.c: At top level:
drivers/net/gtp.c:822:12: error: conflicting types for ‘gtp_encap_enable’
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
            ^
drivers/net/gtp.c:659:12: note: previous declaration of ‘gtp_encap_enable’ was here
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
            ^
drivers/net/gtp.c:659:12: warning: ‘gtp_encap_enable’ used but never defined
drivers/net/gtp.c:822:12: warning: ‘gtp_encap_enable’ defined but not used [-Wunused-function]
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
            ^
scripts/Makefile.build:299: recipe for target 'drivers/net/gtp.o' failed
Re: [BUG/RFC] vhost: net: big endian viring access despite virtio 1
On Thu, Jan 26, 2017 at 06:39:14PM +0100, Halil Pasic wrote:
>
> Hi!
>
> Recently I have been investigating some strange migration problems on
> s390x.
>
> It turned out under certain circumstances vhost_net corrupts avail.idx by
> using wrong endianness.
>
> I managed to track the problem down (I'm pretty sure). It boils down to
> the following.
>
> When stopping vhost userspace (QEMU) calls vhost_net_set_backend with
> the fd argument set to -1, this leads to is_le being reset in
> vhost_vq_init_access. On a BE system resetting to legacy means resetting
> to BE. Usually this is not a problem, but in the case when oldubufs is not
> zero (which is not likely if no network stress applied) it is a problem.
> That code path needs to write avail.idx, and ends up using wrong
> endianness when doing that (but only on a BE system).
>
> That is the story in prose, now let's see the corresponding code annotated
> with some comments.
>
> from drivers/vhost/net.c:
> static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> {
> /* [..] some not too interesting stuff */
> 	sock = get_socket(fd);
> /* fd == -1 --> sock == NULL */
> 	if (IS_ERR(sock)) {
> 		r = PTR_ERR(sock);
> 		goto err_vq;
> 	}
>
> 	/* start polling new socket */
> 	oldsock = vq->private_data;
> 	if (sock != oldsock) {
> 		ubufs = vhost_net_ubuf_alloc(vq,
> 					     sock && vhost_sock_zcopy(sock));
> 		if (IS_ERR(ubufs)) {
> 			r = PTR_ERR(ubufs);
> 			goto err_ubufs;
> 		}
>
> 		vhost_net_disable_vq(n, vq);
> ==>		vq->private_data = sock;
> /* now vq->private_data is NULL */
> ==>		r = vhost_vq_init_access(vq);
> 		if (r)
> 			goto err_used;
> /* vq endianness has been reset to BE on s390 */
> 		r = vhost_net_enable_vq(n, vq);
> 		if (r)
> 			goto err_used;
>
> ==>		oldubufs = nvq->ubufs;
> /* here oldubufs might become != 0 */
> 		nvq->ubufs = ubufs;
>
> 		n->tx_packets = 0;
> 		n->tx_zcopy_err = 0;
> 		n->tx_flush = false;
> 	}
> 	mutex_unlock(&vq->mutex);
>
> 	if (oldubufs) {
> 		vhost_net_ubuf_put_wait_and_free(oldubufs);
> 		mutex_lock(&vq->mutex);
> ==>		vhost_zerocopy_signal_used(n, vq);
> /* tries to update virtqueue structures; endianness is BE on s390
>  * now (but should be LE for virtio-1) */
> 		mutex_unlock(&vq->mutex);
> 	}
> /*[..] rest of the function */
> }
>
> from drivers/vhost/vhost.c:
>
> int vhost_vq_init_access(struct vhost_virtqueue *vq)
> {
> 	__virtio16 last_used_idx;
> 	int r;
> 	bool is_le = vq->is_le;
>
> 	if (!vq->private_data) {
> ==>		vhost_reset_is_le(vq);
> /* resets to native endianness and returns */
> 		return 0;
> 	}
>
> ==>	vhost_init_is_le(vq);
> /* here we init is_le */
>
> 	r = vhost_update_used_flags(vq);
> 	if (r)
> 		goto err;
> 	vq->signalled_used_valid = false;
> 	if (!vq->iotlb &&
> 	    !access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx)) {
> 		r = -EFAULT;
> 		goto err;
> 	}
> 	r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
> 	if (r) {
> 		vq_err(vq, "Can't access used idx at %p\n",
> 		       &vq->used->idx);
> 		goto err;
> 	}
> 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
> 	return 0;
>
> err:
> 	vq->is_le = is_le;
> 	return r;
> }
>
> AFAIU this can be fixed very simply by omitting the reset. Below the
> patch. I'm not sure though, the endianness handling ain't simple in
> vhost. Am I going in the right direction?
>
> We have a test (on s390x only) running for a couple of hours now and so
> far so good but it's a bit early to announce it is tested for s390x.
> If the patch is reasonable, I intend to do a version with proper
> attribution and stuff.
>
> By the way the reset was first introduced by
> https://lkml.org/lkml/2015/4/10/226 (dug it up in the hope that reasons
> for the reset were discussed -- but no enlightenment for me).
>
> Finally I would like to credit Dave Gilbert for hinting that the
> unreasonable avail.idx may be due to an endianness problem and Michael A.
> Tebolt for reporting the bug and testing.
>
> -8<--
> From b26e2bbdc03832a0204ee2b42967a1b49a277dc8 Mon Sep 17 00:00:00 2001
> From: Halil Pasic
> Date: Thu, 26 Jan 2017 00:06:15 +0100
> Subject: [PATCH] vhost: remove useless/dangerous reset of is_le
>
> The reset of is_le does no good, but it
Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled
On Thu, Jan 26, 2017 at 10:02:40AM -0800, Eric Dumazet wrote:
> On Thu, 2017-01-26 at 17:37 +0100, Pablo Neira Ayuso wrote:
> > From: Pau Espin Pedrol
> >
> > Otherwise, RST packets generated by the TCP stack for non-existing
> > sockets always have mark 0.
> > The mark from the original packet is assigned to the netns_ipv4/6
> > socket used to send the response so that it can get copied into the
> > response skb when the socket sends it.
> >
> > Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies")
> > Cc: Lorenzo Colitti
> > Signed-off-by: Pau Espin Pedrol
> > Signed-off-by: Pablo Neira Ayuso
> > ---
> >  net/ipv4/ip_output.c | 1 +
> >  net/ipv6/tcp_ipv6.c  | 1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index fac275c48108..b67719f45953 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1629,6 +1629,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
> >  	sk->sk_protocol = ip_hdr(skb)->protocol;
> >  	sk->sk_bound_dev_if = arg->bound_dev_if;
> >  	sk->sk_sndbuf = sysctl_wmem_default;
> > +	sk->sk_mark = fl4.flowi4_mark;
> >  	err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base,
> >  			     len, 0, &ipc, &rt, MSG_DONTWAIT);
> >  	if (unlikely(err)) {
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 73bc8fc68acd..2b20622a5824 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -840,6 +840,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
> >  	dst = ip6_dst_lookup_flow(ctl_sk, &fl6, NULL);
> >  	if (!IS_ERR(dst)) {
> >  		skb_dst_set(buff, dst);
> > +		ctl_sk->sk_mark = fl6.flowi6_mark;
> >  		ip6_xmit(ctl_sk, buff, &fl6, NULL, tclass);
> >  		TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
> >  		if (rst)
>
>
> This patch is wrong.
>
> ctl_sk is a shared socket, and tcp_v6_send_response() can be called from
> many different cpus at the same time.

Right. This is not percpu as in IPv4.

I can send a follow up patch to get this in sync with the way we do it
in IPv4, ie. add percpu socket.

Fine with this approach? Thanks!
Re: [PATCH] [net-next] ISDN: eicon: reduce stack size of sig_ind function
From: Arnd Bergmann
Date: Wed, 25 Jan 2017 23:15:53 +0100

> I noticed that this function uses a lot of kernel stack when the
> "latent entropy" plugin is enabled:
>
> drivers/isdn/hardware/eicon/message.c: In function 'sig_ind':
> drivers/isdn/hardware/eicon/message.c:6113:1: error: the frame size of 1168 bytes is larger than 1152 bytes [-Werror=frame-larger-than=]
>
> We currently don't warn about this, as we raise the warning limit
> to 2048 bytes in mainline, but I'd like to lower that limit again
> in the future, and this function can easily be changed to be more
> efficient and avoid that warning, by making some of its local
> variables 'const'.
>
> Signed-off-by: Arnd Bergmann

Applied, thanks Arnd.
Re: [PATCH net] bpf: expose netns inode to bpf programs
On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov wrote:
> On 1/26/17 10:12 AM, Andy Lutomirski wrote:
>>
>> On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov wrote:
>>>
>>> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>>>
>>>>> Think of bpf programs as safe kernel modules. They don't have
>>>>> confined boundaries and program authors, if not careful, can shoot
>>>>> themselves in the foot. We're not trying to prevent that because
>>>>> it's impossible to check that the program is sane. Just like
>>>>> it's impossible to check that kernel module is sane.
>>>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>>>> point of view. If it's doing some garbage, it's program's business.
>>>>> Does it make more sense now?
>>>>>
>>>>
>>>> With all due respect, I think this is not an acceptable way to think
>>>> about BPF at all. If you think of BPF this way, I think there needs
>>>> to be a real discussion at KS or similar as to whether this is okay.
>>>> The reason is simple: the kernel promises a stable ABI to userspace
>>>> but not to kernel modules. By thinking of BPF as more like a module,
>>>> you're taking a big shortcut that will either result in ABI breakage
>>>> down the road or in committing to a problematic stable ABI.
>>>
>>>
>>>
>>> you misunderstood the analogy.
>>> bpf abi is certainly stable. that's why we were careful of not
>>> exposing anything to it that is not already stable.
>>>
>>
>> In that case I don't understand what you're trying to say. Eric
>> thinks your patch exposes a bad interface. A bad interface for
>> userspace is a very different thing from a bad interface available to
>> kernel modules. Are you saying that BPF is kernel-module-like in that
>> the ABI exposed to BPF programs doesn't need to meet the same quality
>> standards as userspace ABIs?
>
>
> of course not.
> ns.inum is already exposed to user space as a value.
> This patch exposes it to bpf program in a convenient and stable way,

Here's what I'm imagining Eric is thinking:

ns.inum is currently exposed to userspace via procfs. In principle,
the value could be local to a namespace, though, which would enable
CRIU to be able to preserve namespace inode numbers across a
checkpoint+restore operation. If this happened, the contained and
restored procfs would see a different inode number than the outermost
procfs.

If you start exposing the raw ns.inum field to BPF programs and those
programs are not themselves scoped to a namespace, then this could
create a problem for CRIU.

But you told Eric that his nack doesn't matter, and maybe it would be
nice to ask him to clarify instead.
Re: [PATCH net 1/5] ibmvnic: harden interrupt handler
On 01/26/2017 12:28 PM, Stephen Hemminger wrote:
> On Wed, 25 Jan 2017 15:02:19 -0600
> Thomas Falcon wrote:
>
>>  static irqreturn_t ibmvnic_interrupt(int irq, void *instance)
>>  {
>>  	struct ibmvnic_adapter *adapter = instance;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&adapter->crq.lock, flags);
>> +	vio_disable_interrupts(adapter->vdev);
>> +	tasklet_schedule(&adapter->tasklet);
>> +	spin_unlock_irqrestore(&adapter->crq.lock, flags);
>> +	return IRQ_HANDLED;
>> +}
>> +
> Why not use NAPI? rather than a tasklet
>
This interrupt function doesn't process packets; it handles message
passing between firmware and driver to determine device capabilities
and available resources, such as the number of TX and RX queues.
[PATCH] macvtap: Use kmalloc_array() in macvtap_queue_resize()
From: Markus Elfring
Date: Thu, 26 Jan 2017 19:47:38 +0100

A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring
---
 drivers/net/macvtap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 5c26653eceb5..1796d8f1ef47 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1243,7 +1243,7 @@ static int macvtap_queue_resize(struct macvlan_dev *vlan)
 	int n = vlan->numqueues;
 	int ret, i = 0;

-	arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL);
+	arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL);
 	if (!arrays)
 		return -ENOMEM;

-- 
2.11.0
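The practical difference between the two forms is that the array variant can reject a count whose byte size would overflow. A minimal userspace sketch of that check follows; `alloc_array()` is a hypothetical analogue built on `malloc()`, not the kernel implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of kmalloc_array(): fail instead of
 * letting n * size wrap around.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would overflow size_t */
	return malloc(n * size);
}
```

With a plain `malloc(n * size)`, an oversized `n` can wrap to a small allocation that later code then overruns; the check above turns that case into an ordinary allocation failure.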
Re: NAPI on USB network drivers
On 01/25/2017 09:57 PM, Alexander Duyck wrote:
> On Wed, Jan 25, 2017 at 5:33 AM, Oliver Hartkopp wrote:
> You could probably get around the o-o-o problem by enabling RPS for the
> interface. I have found that it works for me to do that in order to
> resolve o-o-o frames generated by VMs on virtual interfaces that can't
> use NAPI.

Sounds promising!

In fact I tried to fix this o-o-o stuff here:

http://marc.info/?l=linux-can&m=143637774606287&w=2

Do you have an example how to do it right?

Best regards,
Oliver
Re: [PATCH net 1/5] ibmvnic: harden interrupt handler
On 01/26/2017 11:56 AM, David Miller wrote:
> From: Thomas Falcon
> Date: Thu, 26 Jan 2017 10:44:22 -0600
>
>> On 01/25/2017 10:04 PM, David Miller wrote:
>>> From: Thomas Falcon
>>> Date: Wed, 25 Jan 2017 15:02:19 -0600
>>>
>>>> Move most interrupt handler processing into a tasklet, and introduce
>>>> a delay after re-enabling interrupts to fix timing issues encountered
>>>> in hardware testing.
>>>>
>>>> Signed-off-by: Thomas Falcon
>>> I don't think you have any idea what the real problem is. This looks
>>> like a hack, at best. Next patch you'll increase the delay to "20",
>>> right? And if that doesn't work you'll try "40".
>>>
>>> Or if you do know the reason, you need to explain it in detail in this
>>> commit message so that we can properly evaluate this patch.
>> You're right, I should have given more explanation in the commit message
>> about the bug encountered and our reasons for this sort of fix. The issue
>> is that there are some scenarios during the device init where multiple
>> messages are sent by firmware in one interrupt request.
>>
>> We have observed behavior where messages are delayed, resulting in the
>> interrupt handler completing before the delayed messages can be processed,
>> fouling up the device bring-up in the device probing and elsewhere. The
>> goal of the delay is to buy some time for the hypervisor to forward all the
>> CRQ messages from the firmware.
> Then isn't there an event you can sleep and wait for, or a piece of state for
> you to poll and test for in a timeout based loop?
>
> This delay is a bad solution for the problem of waiting for A to happen
> before you do B.
>
Understood. We will come up with a better fix. Thanks for your attention.
[PATCH net-next v2 0/4] net: dsa: Preparatory patches
Hi David,

This patch series extracts the 4 patches of the larger: net: dsa: Support
for pdata in dsa2 while we wait for feedback from Greg KH on the device
references.

Changes in v2:

- rebased properly after the multi-MDIO bus support added to mv88e6xxx

Thanks!

Florian Fainelli (4):
  net: dsa: Pass device pointer to dsa_register_switch
  net: dsa: Make most functions take a dsa_port argument
  net: dsa: Suffix function manipulating device_node with _dn
  net: dsa: Move ports assignment closer to error checking

 drivers/net/dsa/b53/b53_common.c |  2 +-
 drivers/net/dsa/mv88e6xxx/chip.c | 11 ++---
 drivers/net/dsa/qca8k.c          |  2 +-
 include/net/dsa.h                |  2 +-
 net/dsa/dsa.c                    | 15 ---
 net/dsa/dsa2.c                   | 87 ++--
 net/dsa/dsa_priv.h               |  4 +-
 7 files changed, 67 insertions(+), 56 deletions(-)

-- 
2.9.3