date:20170126

Re: [PATCH 6/6] wl1251: Set generated MAC address back to NVS data

2017-01-26 Thread Kalle Valo

Pali Rohár  writes:

> In case there is no valid MAC address, the kernel generates a random one. This
> patch propagates the generated MAC address back to the NVS data which will be
> uploaded to the wl1251 chip, so the HW has the same MAC address as the Linux
> kernel uses.
>
> Signed-off-by: Pali Rohár 

Why? What issue does this fix?

-- 
Kalle Valo

Re: [PATCH 5/6] wl1251: Parse and use MAC address from supplied NVS data

2017-01-26 Thread Kalle Valo

Pali Rohár  writes:

> This patch implements parsing the MAC address from the NVS data which are sent
> to the wl1251 chip. Calibration NVS data could contain a valid MAC address, and
> it will be used instead of a randomly generated one.
>
> This patch also moves the code for requesting NVS data from userspace to the
> driver initialization code to make sure that the NVS data will be there at the
> time when the permanent MAC address is needed.
>
> Calibration NVS data for wl1251 are model specific. Every device with a
> wl1251 chip should have been calibrated in the factory and needs to provide its
> own calibration data.
>
> The default example wl1251-nvs.bin data is found in the linux-firmware
> repository and contains the MAC address 00:00:20:07:03:09. So this MAC address
> is marked as invalid, as it is not a real device-specific address, just an
> example one.
>
> Format of calibration NVS data can be found at:
> http://notaz.gp2x.de/misc/pnd/wl1251/nvs_map.txt
>
> Signed-off-by: Pali Rohár 

[...]

> +static int wl1251_read_nvs_mac(struct wl1251 *wl)
> +{
> + u8 mac[ETH_ALEN];
> + int i;
> +
> + if (wl->nvs_len < 0x24)
> + return -ENODATA;
> +
> + /* length is 2 and data address is 0x546c (mask is 0xfffe) */
> + if (wl->nvs[0x19] != 2 || wl->nvs[0x1a] != 0x6d || wl->nvs[0x1b] != 0x54)
> + return -EINVAL;
> +
> + /* MAC is stored in reverse order */
> + for (i = 0; i < ETH_ALEN; i++)
> + mac[i] = wl->nvs[0x1c + ETH_ALEN - i - 1];

No magic numbers, please. Replace all nvs offsets with proper defines to
make the code more readable.

-- 
Kalle Valo
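
As an illustration of the review comment above, here is a minimal userspace
sketch of the same parsing logic with the literals replaced by named
constants. All WL1251_NVS_* names and the nvs_read_mac() helper are
hypothetical and do not come from the driver; only the numeric values are
taken from the quoted patch. Rejecting the linux-firmware example address
with -EINVAL is also an assumption about how "marked as invalid" might be
expressed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical names; the values are the literals from the quoted patch. */
#define WL1251_NVS_MIN_LEN         0x24
#define WL1251_NVS_OFF_MAC_LEN     0x19
#define WL1251_NVS_OFF_MAC_ADDR_L  0x1a
#define WL1251_NVS_OFF_MAC_ADDR_H  0x1b
#define WL1251_NVS_OFF_MAC_DATA    0x1c
#define WL1251_NVS_MAC_LEN_VAL     2
#define WL1251_NVS_MAC_ADDR_L_VAL  0x6d
#define WL1251_NVS_MAC_ADDR_H_VAL  0x54

/* Example address shipped in linux-firmware, treated as invalid here. */
static const uint8_t nvs_example_mac[ETH_ALEN] = {
	0x00, 0x00, 0x20, 0x07, 0x03, 0x09
};

/* Returns 0 and fills mac[] on success, a negative errno value otherwise. */
static int nvs_read_mac(const uint8_t *nvs, size_t nvs_len,
			uint8_t mac[ETH_ALEN])
{
	int i;

	assert(nvs != NULL && mac != NULL);

	if (nvs_len < WL1251_NVS_MIN_LEN)
		return -ENODATA;

	/* The record header must describe the MAC address entry. */
	if (nvs[WL1251_NVS_OFF_MAC_LEN] != WL1251_NVS_MAC_LEN_VAL ||
	    nvs[WL1251_NVS_OFF_MAC_ADDR_L] != WL1251_NVS_MAC_ADDR_L_VAL ||
	    nvs[WL1251_NVS_OFF_MAC_ADDR_H] != WL1251_NVS_MAC_ADDR_H_VAL)
		return -EINVAL;

	/* MAC is stored in reverse order. */
	for (i = 0; i < ETH_ALEN; i++)
		mac[i] = nvs[WL1251_NVS_OFF_MAC_DATA + ETH_ALEN - i - 1];

	/* Assumption: the example address is reported as invalid. */
	if (memcmp(mac, nvs_example_mac, ETH_ALEN) == 0)
		return -EINVAL;

	return 0;
}
```

With names like these, the length check, the header check and the data
offset each document themselves at the point of use.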

Re: netvsc NAPI patch process

2017-01-26 Thread Greg KH

On Thu, Jan 26, 2017 at 01:06:46PM -0500, David Miller wrote:
> From: Stephen Hemminger 
> Date: Thu, 26 Jan 2017 10:04:05 -0800
> 
> > I have a working set of patches to enable NAPI in the netvsc driver.
> > The problem is that it requires a set of patches to vmbus layer as well.
> > Since vmbus patches have been going through char-misc-next tree rather
> > than net-next, it is difficult to stage these.
> > 
> > How about if I send the vmbus patches through normal driver-devel upstream
> > > and during the 4.10 merge window send the last 3 patches for NAPI for
> > > linux-net tree to get into 4.10?
> 
> Another option is that the char-misc-next folks create a branch with just
> the commits you need for NAPI, I pull that into net-next, and then you
> can submit the NAPI changes to me.

I can easily do that, or I have no problem with the vmbus changes going
through the net-next tree, if they are sane (i.e. let me review them
please...)  Which ever is easier for the networking developers, their
tree is much crazier than the tiny char-misc tree is :)

thanks,

greg k-h

[PATCH v2] net: phy: micrel: add support for KSZ8795

2017-01-26 Thread Sean Nyekjaer

This adds support for the PHYs in the KSZ8795 5-port managed switch.

It allows detecting the link between the switch and the SoC
and uses the same read_status functions as the KSZ8873MLL switch.

Signed-off-by: Sean Nyekjaer 
---
Changes in v2:
 - Removed "switch" name

 drivers/net/phy/micrel.c   | 14 ++
 include/linux/micrel_phy.h |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index ea92d524d5a8..fab56c9350cf 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -1014,6 +1014,20 @@ static struct phy_driver ksphy_driver[] = {
.get_stats  = kszphy_get_stats,
.suspend= genphy_suspend,
.resume = genphy_resume,
+}, {
+   .phy_id = PHY_ID_KSZ8795,
+   .phy_id_mask= MICREL_PHY_ID_MASK,
+   .name   = "Micrel KSZ8795",
+   .features   = (SUPPORTED_Pause | SUPPORTED_Asym_Pause),
+   .flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
+   .config_init= kszphy_config_init,
+   .config_aneg= ksz8873mll_config_aneg,
+   .read_status= ksz8873mll_read_status,
+   .get_sset_count = kszphy_get_sset_count,
+   .get_strings= kszphy_get_strings,
+   .get_stats  = kszphy_get_stats,
+   .suspend= genphy_suspend,
+   .resume = genphy_resume,
 } };
 
 module_phy_driver(ksphy_driver);
diff --git a/include/linux/micrel_phy.h b/include/linux/micrel_phy.h
index 257173e0095e..f541da68d1e7 100644
--- a/include/linux/micrel_phy.h
+++ b/include/linux/micrel_phy.h
@@ -35,6 +35,8 @@
 #define PHY_ID_KSZ886X 0x00221430
 #define PHY_ID_KSZ8863 0x00221435
 
+#define PHY_ID_KSZ8795 0x00221550
+
 /* struct phy_device dev_flags definitions */
 #define MICREL_PHY_50MHZ_CLK   0x0001
 #define MICREL_PHY_FXEN0x0002
-- 
2.11.0
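
For readers unfamiliar with how a phy_driver entry is selected, the
following is a rough userspace sketch of masked PHY ID matching.
PHY_ID_KSZ8795 is the value from the patch above; the MICREL_PHY_ID_MASK
value and the phy_id_entry/phy_id_matches names are assumptions made for
illustration, not code from micrel.c or the PHY core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PHY_ID_KSZ8795      0x00221550u  /* value from the patch */
#define MICREL_PHY_ID_MASK  0x00fffff0u  /* assumed: low nibble is the revision */

/* Hypothetical, reduced stand-in for the ID fields of struct phy_driver. */
struct phy_id_entry {
	uint32_t phy_id;
	uint32_t phy_id_mask;
	const char *name;
};

/* A driver entry matches when the IDs agree on every bit set in the mask. */
static bool phy_id_matches(const struct phy_id_entry *drv, uint32_t hw_id)
{
	assert(drv != NULL);
	return (hw_id & drv->phy_id_mask) == (drv->phy_id & drv->phy_id_mask);
}
```

Under that assumed mask, a device reporting 0x00221551 would still match the
KSZ8795 entry, while 0x00221435 (KSZ8863) would not.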

Re: [PATCH v2 0/4] USB support for Broadcom NSP SoC

2017-01-26 Thread Kishon Vijay Abraham I

Hi,

On Thursday 26 January 2017 10:57 PM, Florian Fainelli wrote:
> On 01/26/2017 07:34 AM, Kishon Vijay Abraham I wrote:
>>
>>
>> On Tuesday 17 January 2017 09:44 PM, Yendapally Reddy Dhananjaya Reddy wrote:
>>> This patch set contains the usb support for Broadcom NSP SoC. The
>>> usb3 phy is internal to the SoC and is accessed through mdio interface.
>>> The mdio interface can be used to access either internal usb3 phy or
>>> external ethernet phy using a multiplexer.
>>>
>>> The first patch provides the documentation details for usb3 phy. The
>>> second patch provides the changes to the mdio bus driver. The third
>>> patch provides the phy driver and fourth patch provides the enable
>>> method for usb.
>>
>> merged the 1st and 4th patch to linux-phy.
> 
> You mean 1st and 3rd here, right? 4th is a Device Tree change that I
> should take, and Patch 2 should be merged by David.

yeah, sorry!
> 
> What branch in your tree should we be looking at?

git://git.kernel.org/pub/scm/linux/kernel/git/kishon/linux-phy.git next

Thanks
Kishon

Re: [PATCH 2/6] wl1251: Use request_firmware_prefer_user() for loading NVS calibration data

2017-01-26 Thread Kalle Valo

Pali Rohár  writes:

> NVS calibration data for wl1251 are model specific. Every device with a
> wl1251 chip has different data, calibrated in the factory.
>
> Not all wl1251 chips have their own EEPROM where the calibration data are
> stored, and in that case there is no "standard" place. Every device stores them
> in a different place (some in a rootfs file, some in a dedicated NAND
> partition, some in another proprietary structure).
>
> The kernel wl1251 driver cannot support every different storage chosen by the
> device manufacturer, so it will use the request_firmware_prefer_user() call for
> loading NVS calibration data, and a userspace helper will be responsible for
> preparing the correct data.
>
> In case the userspace helper fails, request_firmware_prefer_user() still tries
> to load the data file directly from the VFS as a fallback mechanism.
>
> On Nokia N900 device which has wl1251 chip, NVS calibration data are stored
> in CAL nand partition. CAL is proprietary Nokia key/value format for nand
> devices.
>
> With this patch it is finally possible to load correct model specific NVS
> calibration data for Nokia N900.
>
> Signed-off-by: Pali Rohár 
> ---
>  drivers/net/wireless/ti/wl1251/Kconfig |1 +
>  drivers/net/wireless/ti/wl1251/main.c  |2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/ti/wl1251/Kconfig b/drivers/net/wireless/ti/wl1251/Kconfig
> index 7142ccf..affe154 100644
> --- a/drivers/net/wireless/ti/wl1251/Kconfig
> +++ b/drivers/net/wireless/ti/wl1251/Kconfig
> @@ -2,6 +2,7 @@ config WL1251
>   tristate "TI wl1251 driver support"
>   depends on MAC80211
>   select FW_LOADER
> + select FW_LOADER_USER_HELPER
>   select CRC7
>   ---help---
> This will enable TI wl1251 driver support. The drivers make
> diff --git a/drivers/net/wireless/ti/wl1251/main.c b/drivers/net/wireless/ti/wl1251/main.c
> index 208f062..24f8866 100644
> --- a/drivers/net/wireless/ti/wl1251/main.c
> +++ b/drivers/net/wireless/ti/wl1251/main.c
> @@ -110,7 +110,7 @@ static int wl1251_fetch_nvs(struct wl1251 *wl)
>   struct device *dev = wiphy_dev(wl->hw->wiphy);
>   int ret;
>  
> - ret = request_firmware(&fw, WL1251_NVS_NAME, dev);
> + ret = request_firmware_prefer_user(&fw, WL1251_NVS_NAME, dev);

I don't see the need for this. Just remove the default nvs file from
filesystem and the fallback user helper will be always used, right?

Like we discussed earlier, the default nvs file should not be used by
normal users.

-- 
Kalle Valo

Re: [PATCH net-next] net: Fix ndo_setup_tc comment

2017-01-26 Thread Jiri Pirko

Thu, Jan 26, 2017 at 11:44:17PM CET, f.faine...@gmail.com wrote:
>Commit 16e5cc647173 ("net: rework setup_tc ndo op to consume
>general tc operand") changed the ndo_setup_tc() signature, but did not
>update the comments in netdevice.h, so do that now.
>
>Signed-off-by: Florian Fainelli 

Reviewed-by: Jiri Pirko

[PATCH RFC ipsec-next 1/2] net: Drop secpath on free after gro merge.

2017-01-26 Thread Steffen Klassert

With a followup patch, a gro merged skb can have a secpath.
So drop it before freeing or reusing the skb.

Signed-off-by: Steffen Klassert 
---
 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 56818f7..ef3a969 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4623,6 +4623,7 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
case GRO_MERGED_FREE:
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
skb_dst_drop(skb);
+   secpath_reset(skb);
kmem_cache_free(skbuff_head_cache, skb);
} else {
__kfree_skb(skb);
@@ -4663,6 +4664,7 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
skb->encapsulation = 0;
skb_shinfo(skb)->gso_type = 0;
skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
+   secpath_reset(skb);
 
napi->skb = skb;
 }
-- 
1.9.1

[PATCH RFC ipsec-next 2/2] xfrm: Add a dummy network device for napi.

2017-01-26 Thread Steffen Klassert

This patch adds a dummy network device so that we can
use gro_cells for IPsec GRO. With this, we handle IPsec
GRO with no impact on the generic networking code.

Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_input.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 6e3f025..3213fe8 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -21,6 +21,9 @@
 static DEFINE_SPINLOCK(xfrm_input_afinfo_lock);
 static struct xfrm_input_afinfo __rcu *xfrm_input_afinfo[NPROTO];
 
+static struct gro_cells gro_cells;
+static struct net_device xfrm_napi_dev;
+
 int xfrm_input_register_afinfo(struct xfrm_input_afinfo *afinfo)
 {
int err = 0;
@@ -371,7 +374,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
if (decaps) {
skb_dst_drop(skb);
-   netif_rx(skb);
+   gro_cells_receive(&gro_cells, skb);
return 0;
} else {
return x->inner_mode->afinfo->transport_finish(skb, async);
@@ -394,6 +397,13 @@ int xfrm_input_resume(struct sk_buff *skb, int nexthdr)
 
 void __init xfrm_input_init(void)
 {
+   int err;
+
+   init_dummy_netdev(&xfrm_napi_dev);
+   err = gro_cells_init(&gro_cells, &xfrm_napi_dev);
+   if (err)
+   gro_cells.cells = NULL;
+
secpath_cachep = kmem_cache_create("secpath_cache",
   sizeof(struct sec_path),
   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
-- 
1.9.1

[PATCH RFC ipsec-next v2] IPsec GRO

2017-01-26 Thread Steffen Klassert


This adds a dummy network device so that we can use gro_cells
for IPsec GRO. We now may have a secpath at a GRO merged skb,
so we need to drop it. This is the only change to the generic
networking code.

The packet still travels two times through the stack,
but might be aggregated in the second round. We can
avoid the second round by implementing GRO callbacks
for the IPsec protocols. This will be a separate patchset
as this needs some more generic networking changes because
of the asynchronous nature of IPsec.

Re: [PATCH RFC ipsec-next 2/2] xfrm: Add a device independent napi instance

2017-01-26 Thread Steffen Klassert

On Thu, Jan 26, 2017 at 07:10:22AM -0800, Eric Dumazet wrote:
> On Thu, 2017-01-26 at 06:26 -0800, Eric Dumazet wrote:
> 
> > 
> > Alternative would be to use a 
> > 
> > static struct net_device xfrm_napi_anchor_device;
> > 
> > and use gro_cell
> 
> Also take a look at init_dummy_netdev()

I already thought about using some dummy net_device for this,
but was not sure what I need to initialize. So it seemed
to be safer to use a private napi instance.

init_dummy_netdev() is exactly what I need for that, thanks a lot!

Re: fs, net: deadlock between bind/splice on af_unix

2017-01-26 Thread Mateusz Guzik

On Thu, Jan 26, 2017 at 09:11:07PM -0800, Cong Wang wrote:
> On Thu, Jan 26, 2017 at 3:29 PM, Mateusz Guzik  wrote:
> > On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
> >> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov  wrote:
> >> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro  wrote:
> >> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
> >> >>
> >> >>> > Why do we do autobind there, anyway, and why is it conditional on
> >> >>> > SOCK_PASSCRED?  Note that e.g. for SOCK_STREAM we can bloody well get
> >> >>> > to sending stuff without autobind ever done - just use socketpair()
> >> >>> > to create that sucker and we won't be going through the connect()
> >> >>> > at all.
> >> >>>
> >> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls unix_autobind(),
> >> >>> not SOCK_STREAM.
> >> >>
> >> >> Yes, I've noticed.  What I'm asking is what in there needs autobind triggered
> >> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
> >> >>
> >> >>> I guess some lock, perhaps the u->bindlock could be dropped before
> >> >>> acquiring the next one (sb_writer), but I need to double check.
> >> >>
> >> >> Bad idea, IMO - do you *want* autobind being able to come through while
> >> >> bind(2) is busy with mknod?
> >> >
> >> >
> >> > Ping. This is still happening on HEAD.
> >> >
> >>
> >> Thanks for your reminder. Mind to give the attached patch (compile only)
> >> a try? I take another approach to fix this deadlock, which moves the
> >> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
> >> impact with this way.
> >>
> >
> > I don't think this is the right approach.
> >
> > Currently the file creation is postponed until unix_bind can no longer
> > fail otherwise. With it reordered, it may be someone races you with a
> > different path and now you are left with a file to clean up. Except it
> > is quite unclear for me if you can unlink it.
> 
> What races do you mean here? If you mean someone could get a
> refcount of that file, it could happen no matter we have bindlock or not
> since it is visible once created. The filesystem layer should take care of
> the file refcount so all we need to do here is calling path_put() as in my
> patch. Or if you mean two threads calling unix_bind() could race without
> bindlock, only one of them should succeed, the other one just fails out.

Two threads can race and one fails with EINVAL.

With your patch there is a new file created and it is unclear what to
do with it - leaving it as it is sounds like the last resort and
unlinking it sounds extremely fishy as it opens you to games played by
the user.
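
To make the EINVAL case concrete, here is a small userspace sketch. It does
not reproduce the race itself; it only shows, sequentially, the error a
caller gets from bind(2) on an AF_UNIX socket that already has an address,
which is the error the losing thread would see. The helper names
bind_unix_path() and second_bind_errno() are made up for this example, and
the paths are chosen by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind fd to a filesystem path; returns 0 or a positive errno value. */
static int bind_unix_path(int fd, const char *path)
{
	struct sockaddr_un addr;

	assert(path != NULL && strlen(path) < sizeof(addr.sun_path));
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return errno;
	return 0;
}

/*
 * Bind one datagram socket to 'first', then try to bind it again to
 * 'second'. Returns the errno of the second bind, 0 if it unexpectedly
 * succeeded, or -1 if the setup itself failed.
 */
static int second_bind_errno(const char *first, const char *second)
{
	int fd, err;

	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	/* Start from a clean state in case the paths are left over. */
	unlink(first);
	unlink(second);

	if (bind_unix_path(fd, first) != 0) {
		close(fd);
		return -1;
	}

	err = bind_unix_path(fd, second);

	close(fd);
	unlink(first);
	unlink(second);	/* harmless if the kernel never created it */
	return err;
}
```

Whether a socket file for the second path is left behind after the failed
call is exactly the point under discussion, so the sketch removes both paths
unconditionally and makes no claim about it.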

Re: [PATCH net-next 3/4] mlx5: Make building tc hardware offload configurable

2017-01-26 Thread kbuild test robot

Hi Tom,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Tom-Herbert/mlx5-Create-build-configuration-options/20170127-084348
config: x86_64-randconfig-s3-01271208 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

Note: the linux-review/Tom-Herbert/mlx5-Create-build-configuration-options/20170127-084348
HEAD fe8939265468a7204ffc5b1c6c878b39bae7e7d0 builds fine.
It only hurts bisectability.

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `mlx5e_configure_flower':
>> (.text+0x22a6a1): undefined reference to `mlx5e_vxlan_lookup_port'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



RE: [patch net-next v2 2/4] net/sched: Introduce sample tc action

2017-01-26 Thread Yotam Gigi

>-----Original Message-----
>From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
>Sent: Thursday, January 26, 2017 1:30 AM
>To: Jiri Pirko 
>Cc: Linux Kernel Network Developers ; David Miller
>; Yotam Gigi ; Ido Schimmel
>; Elad Raz ; Nogah Frankel
>; Or Gerlitz ; Jamal Hadi Salim
>; geert+rene...@glider.be; Stephen Hemminger
>; Guenter Roeck ; Roopa
>Prabhu ; John Fastabend
>; Simon Horman ;
>Roman Mashak 
>Subject: Re: [patch net-next v2 2/4] net/sched: Introduce sample tc action
>
>On Mon, Jan 23, 2017 at 2:07 AM, Jiri Pirko  wrote:
>> +
>> +static int tcf_sample_init(struct net *net, struct nlattr *nla,
>> +  struct nlattr *est, struct tc_action **a, int ovr,
>> +  int bind)
>> +{
>> +   struct tc_action_net *tn = net_generic(net, sample_net_id);
>> +   struct nlattr *tb[TCA_SAMPLE_MAX + 1];
>> +   struct psample_group *psample_group;
>> +   struct tc_sample *parm;
>> +   struct tcf_sample *s;
>> +   bool exists = false;
>> +   int ret;
>> +
>> +   if (!nla)
>> +   return -EINVAL;
>> +   ret = nla_parse_nested(tb, TCA_SAMPLE_MAX, nla, sample_policy);
>> +   if (ret < 0)
>> +   return ret;
>> +   if (!tb[TCA_SAMPLE_PARMS] || !tb[TCA_SAMPLE_RATE] ||
>> +   !tb[TCA_SAMPLE_PSAMPLE_GROUP])
>> +   return -EINVAL;
>> +
>> +   parm = nla_data(tb[TCA_SAMPLE_PARMS]);
>> +
>> +   exists = tcf_hash_check(tn, parm->index, a, bind);
>> +   if (exists && bind)
>> +   return 0;
>> +
>> +   if (!exists) {
>> +   ret = tcf_hash_create(tn, parm->index, est, a,
>> + &act_sample_ops, bind, false);
>> +   if (ret)
>> +   return ret;
>> +   ret = ACT_P_CREATED;
>> +   } else {
>> +   tcf_hash_release(*a, bind);
>> +   if (!ovr)
>> +   return -EEXIST;
>> +   }
>> +   s = to_sample(*a);
>> +
>> +   ASSERT_RTNL();
>
>Copy-n-paste from mirred action? This is not needed for you, mirred
>needs it because of target netdevice.

Oh, you are right. I will remove it.

>
>
>> +   s->tcf_action = parm->action;
>> +   s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
>> +   s->psample_group_num = nla_get_u32(tb[TCA_SAMPLE_PSAMPLE_GROUP]);
>> +   psample_group = psample_group_get(net, s->psample_group_num);
>> +   if (!psample_group)
>> +   return -ENOMEM;
>
>I don't think you can just return here, needs tcf_hash_cleanup() for create
>case, right?

Will fix.

>
>
>> +   RCU_INIT_POINTER(s->psample_group, psample_group);
>> +
>> +   if (tb[TCA_SAMPLE_TRUNC_SIZE]) {
>> +   s->truncate = true;
>> +   s->trunc_size = nla_get_u32(tb[TCA_SAMPLE_TRUNC_SIZE]);
>> +   }
>
>
>Do you need tcf_lock here if RCU only protects ->psample_group??
>

You are right. I need to protect in case of update.

I will send a fixup patch in the following days. Thanks!

>
>> +
>> +   if (ret == ACT_P_CREATED)
>> +   tcf_hash_insert(tn, *a);
>> +   return ret;
>> +}
>> +
>
>
>Thanks.

Re: [PATCH net-next v2] macb: Common code to enable ptp support for MACB/GEM

2017-01-26 Thread Harini Katakam

Hi Rafal,

On Thu, Jan 26, 2017 at 8:45 PM, Rafal Ozieblo  wrote:
>> -----Original Message-----
>> From: Andrei Pistirica [mailto:andrei.pistir...@microchip.com]
>> Sent: 19 stycznia 2017 16:56
>> Subject: [PATCH net-next v2] macb: Common code to enable ptp support for MACB/GEM
>>
>>
>> +static inline bool gem_has_ptp(struct macb *bp)
>> +{
>> + return !!(bp->caps & MACB_CAPS_GEM_HAS_PTP);
>> +}
> Why don't you use hardware capabilities here? Would it be better to read it
> from hardware instead of adding it to many configurations?

If you are referring to TSU bit in DCFG5, then we will be relying on
Cadence IP's information irrespective of the SoC capability
and whether the PTP support was adequate.
I think the capability approach gives better control and
it is not really much to add.

Regards,
Harini
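
The trade-off discussed above can be sketched in userspace as follows. This
is only a model: struct macb_model, the DCFG5 bit position and the
capability bit value are assumptions for illustration, and only the body of
gem_has_ptp() mirrors the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MACB_CAPS_GEM_HAS_PTP  0x00000040u  /* assumed bit value */
#define DCFG5_TSU_BIT          (1u << 8)    /* assumed position of the TSU bit */

/* Hypothetical, reduced model of the driver state. */
struct macb_model {
	uint32_t caps;   /* per-SoC capabilities chosen by the driver config */
	uint32_t dcfg5;  /* design configuration register as reported by the IP */
};

/* Capability approach, as in the quoted patch. */
static bool gem_has_ptp(const struct macb_model *bp)
{
	assert(bp != NULL);
	return !!(bp->caps & MACB_CAPS_GEM_HAS_PTP);
}

/* Alternative raised in review: trust what the IP block reports. */
static bool gem_hw_reports_tsu(const struct macb_model *bp)
{
	assert(bp != NULL);
	return !!(bp->dcfg5 & DCFG5_TSU_BIT);
}
```

In this model a SoC whose IP reports a TSU but whose PTP support is not
adequate simply leaves the capability bit clear, so gem_has_ptp() stays
false even though gem_hw_reports_tsu() is true.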

Re: [PATCH net-next 1/4] mlx5: Make building eswitch configurable

2017-01-26 Thread Or Gerlitz

On Fri, Jan 27, 2017 at 1:32 AM, Tom Herbert  wrote:
> Add a configuration option (CONFIG_MLX5_CORE_ESWITCH) for controlling
> whether the eswitch code is built. Change Kconfig and Makefile
> accordingly.

Tom, FWIW, please note that the basic e-switch functionality is needed
also when SRIOV isn't of use, this is for a multi host configuration.

Or.

My WW (and same for the rest of the IL team..) has ended so I will be
able to further look on this series and comment on Sunday.

Re: fs, net: deadlock between bind/splice on af_unix

2017-01-26 Thread Cong Wang

On Thu, Jan 26, 2017 at 3:29 PM, Mateusz Guzik  wrote:
> On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
>> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov  wrote:
>> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro  wrote:
>> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
>> >>
>> >>> > Why do we do autobind there, anyway, and why is it conditional on
>> >>> > SOCK_PASSCRED?  Note that e.g. for SOCK_STREAM we can bloody well get
>> >>> > to sending stuff without autobind ever done - just use socketpair()
>> >>> > to create that sucker and we won't be going through the connect()
>> >>> > at all.
>> >>>
>> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls unix_autobind(),
>> >>> not SOCK_STREAM.
>> >>
>> >> Yes, I've noticed.  What I'm asking is what in there needs autobind triggered
>> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
>> >>
>> >>> I guess some lock, perhaps the u->bindlock could be dropped before
>> >>> acquiring the next one (sb_writer), but I need to double check.
>> >>
>> >> Bad idea, IMO - do you *want* autobind being able to come through while
>> >> bind(2) is busy with mknod?
>> >
>> >
>> > Ping. This is still happening on HEAD.
>> >
>>
>> Thanks for your reminder. Mind to give the attached patch (compile only)
>> a try? I take another approach to fix this deadlock, which moves the
>> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
>> impact with this way.
>>
>
> I don't think this is the right approach.
>
> Currently the file creation is postponed until unix_bind can no longer
> fail otherwise. With it reordered, it may be someone races you with a
> different path and now you are left with a file to clean up. Except it
> is quite unclear for me if you can unlink it.

What races do you mean here? If you mean someone could get a
refcount of that file, it could happen no matter we have bindlock or not
since it is visible once created. The filesystem layer should take care of
the file refcount so all we need to do here is calling path_put() as in my
patch. Or if you mean two threads calling unix_bind() could race without
bindlock, only one of them should succeed, the other one just fails out.

[PATCHv5 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs

2017-01-26 Thread Chris Packham

The Marvell 98DX3236, 98DX3336, 98DX4251 and variants are switch ASICs
with integrated CPUs. They are similar to the Armada XP SoCs but have
different I/O interfaces.

Signed-off-by: Chris Packham 
Acked-by: Rob Herring 
---

Notes:
Changes in v2:
- Update devicetree binding documentation to reflect that 98DX3336 and
  98DX4251 are supersets of 98DX3236.
- disable crypto block
- disable sdio for 98DX3236, enable for 98DX4251
Changes in v3:
- fix typo 4521 -> 4251
- document prestera bindings
- rework corediv-clock binding
- add label to packet processor node
- add new compatible string for DFX server
Changes in v4:
- Collect ack from Rob
Changes in v5:
- Fixup license text. Add labels to nodes.

 .../devicetree/bindings/arm/marvell/98dx3236.txt   |  23 ++
 .../devicetree/bindings/net/marvell,prestera.txt   |  50 
 arch/arm/boot/dts/armada-xp-98dx3236.dtsi  | 254 +
 arch/arm/boot/dts/armada-xp-98dx3336.dtsi  |  76 ++
 arch/arm/boot/dts/armada-xp-98dx4251.dtsi  |  90 
 5 files changed, 493 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/marvell/98dx3236.txt
 create mode 100644 Documentation/devicetree/bindings/net/marvell,prestera.txt
 create mode 100644 arch/arm/boot/dts/armada-xp-98dx3236.dtsi
 create mode 100644 arch/arm/boot/dts/armada-xp-98dx3336.dtsi
 create mode 100644 arch/arm/boot/dts/armada-xp-98dx4251.dtsi

diff --git a/Documentation/devicetree/bindings/arm/marvell/98dx3236.txt b/Documentation/devicetree/bindings/arm/marvell/98dx3236.txt
new file mode 100644
index ..64e8c73fc5ab
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/marvell/98dx3236.txt
@@ -0,0 +1,23 @@
+Marvell 98DX3236, 98DX3336 and 98DX4251 Platforms Device Tree Bindings
+--
+
+Boards with a SoC of the Marvell 98DX3236, 98DX3336 and 98DX4251 families
+shall have the following property:
+
+Required root node property:
+
+compatible: must contain "marvell,armadaxp-98dx3236"
+
+In addition, boards using the Marvell 98DX3336 SoC shall have the
+following property:
+
+Required root node property:
+
+compatible: must contain "marvell,armadaxp-98dx3336"
+
+In addition, boards using the Marvell 98DX4251 SoC shall have the
+following property:
+
+Required root node property:
+
+compatible: must contain "marvell,armadaxp-98dx4251"
diff --git a/Documentation/devicetree/bindings/net/marvell,prestera.txt b/Documentation/devicetree/bindings/net/marvell,prestera.txt
new file mode 100644
index ..5fbab29718e8
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/marvell,prestera.txt
@@ -0,0 +1,50 @@
+Marvell Prestera Switch Chip bindings
+-
+
+Required properties:
+- compatible: one of the following
+   "marvell,prestera-98dx3236",
+   "marvell,prestera-98dx3336",
+   "marvell,prestera-98dx4251",
+- reg: address and length of the register set for the device.
+- interrupts: interrupt for the device
+
+Optional properties:
+- dfx: phandle reference to the "DFX Server" node
+
+Example:
+
+switch {
+   compatible = "simple-bus";
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0 MBUS_ID(0x03, 0x00) 0 0x10>;
+
+   packet-processor@0 {
+   compatible = "marvell,prestera-98dx3236";
+   reg = <0 0x400>;
+   interrupts = <33>, <34>, <35>;
+   dfx = <&dfx>;
+   };
+};
+
+DFX Server bindings
+---
+
+Required properties:
+- compatible: must be "marvell,dfx-server"
+- reg: address and length of the register set for the device.
+
+Example:
+
+dfx-registers {
+   compatible = "simple-bus";
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0 MBUS_ID(0x08, 0x00) 0 0x10>;
+
+   dfx: dfx@0 {
+   compatible = "marvell,dfx-server";
+   reg = <0 0x10>;
+   };
+};
diff --git a/arch/arm/boot/dts/armada-xp-98dx3236.dtsi b/arch/arm/boot/dts/armada-xp-98dx3236.dtsi
new file mode 100644
index ..9461128fae24
--- /dev/null
+++ b/arch/arm/boot/dts/armada-xp-98dx3236.dtsi
@@ -0,0 +1,254 @@
+/*
+ * Device Tree Include file for Marvell 98dx3236 family SoC
+ *
+ * Copyright (C) 2016 Allied Telesis Labs
+ *
+ * This file is dual-licensed: you can use it either under the terms
+ * of the GPL or the X11 license, at your option. Note that this dual
+ * licensing only applies to this file, and not this project as a
+ * whole.
+ *
+ *  a) This file is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This file is distributed in the hope that it will be

Re: loopback device reference count leakage

2017-01-26 Thread Kaiwen Xu

On Thu, Jan 26, 2017 at 05:01:38PM -0800, Cong Wang wrote:
> I'd suggest you add some debugging printk's to the dst refcount functions,
> or maybe just inside dst_gc_task(). I think the last dst referring to the
> loopback dev is still being referred at that point, which prevents GC from
> destroying it.

Thanks for the suggestion! I will test it out.

> Meanwhile, it would also be helpful if you could share how you managed to
> reproduce this reliably, I saw this bug in our data center before but never
> knew how to reproduce it.

I used one of our applications to reproduce the issue. To be honest, I
haven't completely isolated which part of the code is triggering the
bug. However, the suspicion is that, since the application basically
acts as a web crawler, the bug manifests after initiating a large
number of connections to a wide range of IP addresses in a short period
of time. Hope it helps somewhat.

Thanks,
Kaiwen

cls_matchall and port mirroring questions

2017-01-26 Thread Florian Fainelli

Hi,

As I am adding support for cls_matchall in the b53/bcm_sf2 drivers, I
was looking into several, yet unrelated things:

- mlxsw does not seem to specify whether the port used for capture
remains usable, or blocks non-mirror traffic ingressing/egressing it, do
we want a control knob for that? If not, what is a sensible default,
block all non capture traffic?

- do we have an updated man page for tc-matchall.8 that features how to
use the statistical sampler too? b53 switches have a divider that allows
us to select how many frames we want to receive (10 bit value).

- b53 supports capture against a particular MAC SA or DA (or both), do
we want to be able to control that somehow? What about Marvell switches,
what can they do?

-  a fair amount of code dealing with the cls_matchall mirroring entry
is not switch driver specific, in fact, the only things that are switch
driver specific are:
- list pointer where to store this entry (typically in the private
network device context)
- operation to check whether the device belongs to us (identical
netdev_ops)
- retrieval of the destination port number (to_port) which is also
typically available in network device private context

Do we want to move a fair amount of code into switchdev, treat
cls_matchall entries as a specific switchdev object, and have drivers
take over at the same level that mlxsw_sp_port_add_cls_matchall_mirror()
currently starts?

Thanks!
-- 
Florian

Re: [PATCH net-next] net: adjust skb->truesize in pskb_expand_head()

2017-01-26 Thread Eric Dumazet

On Fri, 2017-01-27 at 10:22 +0800, kbuild test robot wrote:
> Hi Eric,
> 
> [auto build test WARNING on net-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Eric-Dumazet/net-adjust-skb-truesize-in-pskb_expand_head/20170127-082517
> config: i386-randconfig-x0-01270914 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 

Hmm... I thought sock_edemux was safe, but apparently it uses a
parameter for no good reason.

I will add this to v2

diff --git a/include/net/sock.h b/include/net/sock.h
index 7144750d14e56b9d5392e43dc46cb40a87e3d397..94e65fd703548dd40e16c30207fd55c879ed0b60 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1534,7 +1534,7 @@ void sock_efree(struct sk_buff *skb);
 #ifdef CONFIG_INET
 void sock_edemux(struct sk_buff *skb);
 #else
-#define sock_edemux(skb) sock_efree(skb)
+#define sock_edemux sock_efree
 #endif
 
 int sock_setsockopt(struct socket *sock, int level, int op,
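
The reason the one-line change above matters can be shown with a small
userspace sketch. The *_model names are invented for this example and are
not kernel symbols. A function-like macro only expands when its name is
followed by an opening parenthesis, so comparing a function pointer against
the bare name leaves an undeclared identifier, whereas an object-like macro
expands in both uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct sk_buff. */
struct sk_buff_model {
	void (*destructor)(struct sk_buff_model *skb);
	int freed;
};

static void sock_efree_model(struct sk_buff_model *skb)
{
	assert(skb != NULL);
	skb->freed = 1;
}

/*
 * Object-like macro: the name is replaced wherever it appears, both in a
 * call and in a pointer comparison. The function-like form,
 *   #define sock_edemux_model(skb) sock_efree_model(skb)
 * would leave the bare name below unexpanded and therefore undeclared.
 */
#define sock_edemux_model sock_efree_model

static bool destructor_is_edemux(const struct sk_buff_model *skb)
{
	assert(skb != NULL);
	return skb->destructor == sock_edemux_model;
}
```

Calls such as sock_edemux_model(&skb) keep working with the object-like
form, since the expansion is followed by the same argument list.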

Re: [PATCH net-next] net: adjust skb->truesize in pskb_expand_head()

2017-01-26 Thread kbuild test robot

Hi Eric,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Eric-Dumazet/net-adjust-skb-truesize-in-pskb_expand_head/20170127-082517
config: i386-randconfig-x0-01270914 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/linux/list.h:4,
from include/linux/module.h:9,
from net/core/skbuff.c:41:
   net/core/skbuff.c: In function 'pskb_expand_head':
   net/core/skbuff.c:1265:37: error: 'sock_edemux' undeclared (first use in this function)
 if (!skb->sk || skb->destructor == sock_edemux)
^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
 if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
 ^~~~
>> net/core/skbuff.c:1265:2: note: in expansion of macro 'if'
 if (!skb->sk || skb->destructor == sock_edemux)
 ^~
   net/core/skbuff.c:1265:37: note: each undeclared identifier is reported only once for each function it appears in
 if (!skb->sk || skb->destructor == sock_edemux)
^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
 if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
 ^~~~
>> net/core/skbuff.c:1265:2: note: in expansion of macro 'if'
 if (!skb->sk || skb->destructor == sock_edemux)
 ^~

vim +/if +1265 net/core/skbuff.c

  1249  skb->end  = size;
  1250  off   = nhead;
  1251  #else
  1252  skb->end  = skb->head + size;
  1253  #endif
  1254  skb->tail += off;
  1255  skb_headers_offset_update(skb, nhead);
  1256  skb->cloned   = 0;
  1257  skb->hdr_len  = 0;
  1258  skb->nohdr= 0;
  1259  atomic_set(&skb_shinfo(skb)->dataref, 1);
  1260  
  1261  /* It is not generally safe to change skb->truesize.
  1262   * For the moment, we really care of rx path, or
  1263   * when skb is orphaned (not attached to a socket)
  1264   */
> 1265  if (!skb->sk || skb->destructor == sock_edemux)
  1266  skb->truesize += size - osize;
  1267  
  1268  return 0;
  1269  
  1270  nofrags:
  1271  kfree(data);
  1272  nodata:
  1273  return -ENOMEM;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Re: [net-next] openvswitch: Simplify do_execute_actions().

2017-01-26 Thread Jarno Rajahalme

Nice clean-up.

Acked-by: Jarno Rajahalme 

> On Jan 25, 2017, at 9:24 PM, Andy Zhou  wrote:
> 
> do_execute_actions() implements a worthwhile optimization: in case
> an output action is the last action in an action list, skb_clone()
> can be avoided by outputting the current skb. However, the
> implementation is more complicated than necessary.  This patch
> simplifies this logic.
> 
> Signed-off-by: Andy Zhou 
> ---
> net/openvswitch/actions.c | 40 +++-
> 1 file changed, 19 insertions(+), 21 deletions(-)
> 
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 514f7bc..3866608 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -830,6 +830,9 @@ static void do_output(struct datapath *dp, struct sk_buff 
> *skb, int out_port,
> {
>   struct vport *vport = ovs_vport_rcu(dp, out_port);
> 
> + if (unlikely(!skb))
> + return;
> +
>   if (likely(vport)) {
>   u16 mru = OVS_CB(skb)->mru;
>   u32 cutlen = OVS_CB(skb)->cutlen;
> @@ -1141,12 +1144,6 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
> struct sw_flow_key *key,
> const struct nlattr *attr, int len)
> {
> - /* Every output action needs a separate clone of 'skb', but the common
> -  * case is just a single output action, so that doing a clone and
> -  * then freeing the original skbuff is wasteful.  So the following code
> -  * is slightly obscure just to avoid that.
> -  */
> - int prev_port = -1;
>   const struct nlattr *a;
>   int rem;
> 
> @@ -1154,20 +1151,25 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
>a = nla_next(a, &rem)) {
>   int err = 0;
> 
> - if (unlikely(prev_port != -1)) {
> - struct sk_buff *out_skb = skb_clone(skb, GFP_ATOMIC);
> + switch (nla_type(a)) {
> + case OVS_ACTION_ATTR_OUTPUT: {
> + int port = nla_get_u32(a);
> 
> - if (out_skb)
> - do_output(dp, out_skb, prev_port, key);
> + /* Every output action needs a separate clone
> +  * of 'skb', In case the output action is the
> +  * last action, cloning can be avoided.
> +  */
> + if (nla_is_last(a, rem)) {
> + do_output(dp, skb, port, key);
> + /* 'skb' has been used for output.
> +  */
> + return 0;
> + }
> 
> + do_output(dp, skb_clone(skb, GFP_ATOMIC), port, key);
>   OVS_CB(skb)->cutlen = 0;
> - prev_port = -1;
> - }
> -
> - switch (nla_type(a)) {
> - case OVS_ACTION_ATTR_OUTPUT:
> - prev_port = nla_get_u32(a);
>   break;
> + }
> 
>   case OVS_ACTION_ATTR_TRUNC: {
>   struct ovs_action_trunc *trunc = nla_data(a);
> @@ -1257,11 +1259,7 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
>   }
>   }
> 
> - if (prev_port != -1)
> - do_output(dp, skb, prev_port, key);
> - else
> - consume_skb(skb);
> -
> + consume_skb(skb);
>   return 0;
> }
> 
> -- 
> 1.8.3.1
>
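
To make the ownership rule in the patch easier to follow, here is a rough userspace sketch of "clone for every output except the last". It only models a list of output actions; `struct toy_buf`, `toy_output()` and `toy_execute_outputs()` are hypothetical names, and malloc/free stand in for skb allocation and consumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_buf {
	char data[32];
};

static int toy_clones_made;   /* how many copies were needed */
static int toy_outputs_done;  /* how many buffers were sent  */

static struct toy_buf *toy_clone(const struct toy_buf *b)
{
	struct toy_buf *c = malloc(sizeof(*c));

	if (c) {
		memcpy(c, b, sizeof(*c));
		toy_clones_made++;
	}
	return c;
}

/* Consumes 'b'. Tolerates NULL, like do_output() after the patch. */
static void toy_output(struct toy_buf *b, int port)
{
	(void)port;
	if (!b)
		return;
	toy_outputs_done++;
	free(b);
}

/* Takes ownership of 'b'; every path below consumes it exactly once. */
static int toy_execute_outputs(struct toy_buf *b, const int *ports, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (i + 1 == n) {
			/* Last action: hand over the original, no clone. */
			toy_output(b, ports[i]);
			return 0;
		}
		toy_output(toy_clone(b), ports[i]);
	}
	free(b); /* no action consumed the original */
	return 0;
}
```

With three output actions this makes two clones and three outputs; with a single output action it makes none, which is the common case the optimization targets.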

Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver

2017-01-26 Thread Sowmini Varadhan

On (01/26/17 19:08), Willem de Bruijn wrote:
> 
> Thanks for the context. ax25_addr_parse doesn't adjust length, it only
> verifies that the contents of the variable length header matches
> protocol spec. I don't think that it or the .validate callback have to
> be modified to return length.

Yes, I noticed that too, but my reading of ax25_addr_parse
was that it checks to see that a sane L2 header has been
passed in, and if that (sane-header) is the case, it
returns a pointer to the start of data. Thus the returned
(non-null) pointer minus start should tell you the "real"
header length. Is my understanding correct?

> To ensure that skb_headlen(skb) is at least a valid header length even
> when CAP_SYS_RAWIO bypasses validation perhaps revise
> dev_validate_header to take an additional skb->len parameter and
> call skb_put directly from inside that branch.

But when I scanned the af_packet code (which appears to
be the only thing that uses dev_validate_header today),
it already sets the skb->data and ->len pointers up
correctly (based on len, hard_header_len etc.) *before*
calling dev_validate_header, so the additional skb_put
is not needed?

I still haven't googled up the prior discussions that led
to dev_validate_header; I will probably do that tomorrow AM.

--Sowmini
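
To illustrate the "returned pointer minus start" idea, here is a small userspace sketch. It does not reproduce ax25_addr_parse() or the real AX.25 layout; `toy_hdr_parse()` and its header format (a list of 7-byte address blocks, the last one flagged by the low bit of its final byte) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ADDR_LEN  7u   /* bytes per address block (assumed) */
#define TOY_MAX_ADDRS 10   /* upper bound on blocks (assumed)   */

/*
 * Hypothetical parser: returns a pointer to the first payload byte if
 * the variable-length header is well formed, or NULL otherwise.
 */
static const unsigned char *toy_hdr_parse(const unsigned char *buf, size_t len)
{
	size_t off = 0;

	for (int n = 0; n < TOY_MAX_ADDRS; n++) {
		if (len - off < TOY_ADDR_LEN)
			return NULL;            /* truncated header */
		off += TOY_ADDR_LEN;
		if (buf[off - 1] & 0x01)
			return buf + off;       /* last block seen  */
	}
	return NULL;                            /* no end marker    */
}

/* Real header length is the distance from start to payload, or -1. */
static long toy_hdr_len(const unsigned char *buf, size_t len)
{
	const unsigned char *payload = toy_hdr_parse(buf, len);

	return payload ? (long)(payload - buf) : -1;
}
```

A caller could compare this length against what the application supplied, instead of assuming hard_header_len, which is only the maximum for such link layers.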

Re: loopback device reference count leakage

2017-01-26 Thread Cong Wang

On Thu, Jan 26, 2017 at 2:51 PM, Kaiwen Xu  wrote:
> Hi Cong,
>
> I tested out your patch, it does seem to be preventing the issue from
> happening. Here are the dev_put/dev_hold() calls with your patch
> applied.

Good. Now we have narrowed the bug down to those dsts referring to loopback_dev.

> However, what confuses me is that when the issue didn't occur, there
> were always multiple dst_ifdown() calls at the end continuously holding
> and releasing the loopback device reference count (sometimes it will be
> looping for about a minute), until the last dst_destroy() happens. E.g.
>
> Jan 11 16:14:59  kernel: [ 2033.429563] lo: dev_hold 2 dst_ifdown
> Jan 11 16:14:59  kernel: [ 2033.429565] lo: dev_put 2 dst_ifdown
> Jan 11 16:15:00  kernel: [ 2034.453484] lo: dev_hold 2 dst_ifdown
> Jan 11 16:15:00  kernel: [ 2034.453487] lo: dev_put 2 dst_ifdown
>
> (this continues...)
>
> Jan 11 16:15:01  kernel: [ 2035.129452] lo: dev_put 1 dst_destroy
>
> And when the issue did occur, the last dst_destroy() call never occurs.

Yeah, I noticed that too. So we have two cases here:

1) If these dsts (referring to loopback_dev) really need to stay in GC for
a long time, then we should simply release the loopback references, as my
patch does.

2) If they do not, that is, if they are supposed to be GC'ed soon in this
case, then we should investigate why they are still there.

2) seems more likely than 1), because at the point when the loopback device
is being unregistered, the whole network namespace is going away and all
other devices are already gone. No one should take a reference to this netns,
and therefore no one should take a reference to any dst referring to any
device in it either, even though the dst GC is global.

I'd suggest you add some debugging printk's to the dst refcount functions,
or maybe just inside dst_gc_task(). I think the last dst referring to the
loopback dev is still being referenced at that point, which prevents GC from
destroying it.

Meanwhile, it would also be helpful if you could share how you managed to
reproduce this reliably. I saw this bug in our data center before but never
knew how to reproduce it.

Thanks!
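
As a rough illustration of the kind of instrumentation meant here, the sketch below wraps hold/put in helpers that log the device name, the count and the caller, in the same shape as the log lines quoted above. It is a userspace model, not a kernel patch: `struct toy_dev`, `toy_hold()` and `toy_put()` are hypothetical, and printing the count after the operation is an assumption about what the number in those logs means.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a refcounted net_device or dst_entry. */
struct toy_dev {
	const char *name;
	int refcnt;
};

/* Take a reference and log who took it. */
static void toy_hold(struct toy_dev *d, const char *caller)
{
	d->refcnt++;
	fprintf(stderr, "%s: dev_hold %d %s\n", d->name, d->refcnt, caller);
}

/* Drop a reference and log who dropped it. */
static void toy_put(struct toy_dev *d, const char *caller)
{
	d->refcnt--;
	fprintf(stderr, "%s: dev_put %d %s\n", d->name, d->refcnt, caller);
}
```

Pairing each hold with its put by caller name in the resulting log should show which path still owns the last reference.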

Re: [1/3] powerpc: bpf: remove redundant check for non-null image

2017-01-26 Thread Michael Ellerman

On Fri, 2017-01-13 at 17:10:00 UTC, "Naveen N. Rao" wrote:
> From: Daniel Borkmann 
> 
> We have a check earlier to ensure we don't proceed if image is NULL. As
> such, the redundant check can be removed.
> 
> Signed-off-by: Daniel Borkmann 
> [Added similar changes for classic BPF JIT]
> Signed-off-by: Naveen N. Rao 
> Acked-by: Alexei Starovoitov 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/052de33ca4f840bf35587eacdf78b3

cheers

Re: [2/3] powerpc: bpf: flush the entire JIT buffer

2017-01-26 Thread Michael Ellerman

On Fri, 2017-01-13 at 17:10:01 UTC, "Naveen N. Rao" wrote:
> With bpf_jit_binary_alloc(), we allocate at a page granularity and fill
> the rest of the space with illegal instructions to mitigate BPF spraying
> attacks, while having the actual JIT'ed BPF program at a random location
> within the allocated space. Under this scenario, it would be better to
> flush the entire allocated buffer rather than just the part containing
> the actual program. We already flush the buffer from start to the end of
> the BPF program. Extend this to include the illegal instructions after
> the BPF program.
> 
> Signed-off-by: Naveen N. Rao 
> Acked-by: Alexei Starovoitov 
> Acked-by: Daniel Borkmann 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/10528b9c45cfb9e8f45217ef2f5ef8

cheers

[PATCH net-next 4/4] mlx5: Make building vxlan hardware offload configurable

2017-01-26 Thread Tom Herbert

Add a configuration option (CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD)
for controlling whether UDP encapsulation offload support is built.
Note that only VXLAN offload is supported currently; however, the
config option is named to be generic for UDP offloads.

Signed-off-by: Tom Herbert 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |  8 +++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |  4 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 27 +--
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index b38c920..d8ed54a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -57,3 +57,11 @@ config MLX5_CORE_EN_TC
  Say Y here if you want to use TC hardware offload support.
 
  If unsure, set to Y
+
+config MLX5_CORE_EN_UDP_ENCAP_OFFLOAD
+   bool "UDP encapsulation offload"
+   default y
+   ---help---
+ Say Y here if you want to use UDP encapsulation hardware offload.
+ Currently, VXLAN offload is supported.
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index c308531..c08c9c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -7,7 +7,7 @@ mlx5_core-y :=  main.o cmd.o debugfs.o fw.o eq.o uar.o 
pagealloc.o \
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \
en_main.o en_common.o en_fs.o en_ethtool.o en_tx.o \
-   en_rx.o en_rx_am.o en_txrx.o en_clock.o vxlan.o \
+   en_rx.o en_rx_am.o en_txrx.o en_clock.o \
en_arfs.o en_fs_ethtool.o en_selftest.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
@@ -17,3 +17,5 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o 
eswitch_offloads.o en_rep.
 mlx5_core-$(CONFIG_MLX5_CORE_EN_SRIOV) += sriov.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_TC) += en_tc.o
+
+mlx5_core-$(CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD) += vxlan.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2d2c982..31a8d88 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -34,14 +34,16 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include "en.h"
 #include "en_tc.h"
 #ifdef CONFIG_MLX5_CORE_EN_ESWITCH
 #include "eswitch.h"
 #endif
+#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD
+#include 
 #include "vxlan.h"
+#endif
 
 struct mlx5e_rq_param {
u32 rqc[MLX5_ST_SZ_DW(rqc)];
@@ -3111,6 +3113,7 @@ static int mlx5e_get_vf_stats(struct net_device *dev,
 }
 #endif /* CONFIG_MLX5_CORE_EN_ESWITCH */
 
+#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD
 void mlx5e_add_vxlan_port(struct net_device *netdev,
  struct udp_tunnel_info *ti)
 {
@@ -3171,20 +3174,22 @@ static netdev_features_t 
mlx5e_vxlan_features_check(struct mlx5e_priv *priv,
/* Disable CSUM and GSO if the udp dport is not offloaded by HW */
return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
 }
+#endif
 
 static netdev_features_t mlx5e_features_check(struct sk_buff *skb,
  struct net_device *netdev,
  netdev_features_t features)
 {
-   struct mlx5e_priv *priv = netdev_priv(netdev);
-
features = vlan_features_check(skb, features);
+#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD
features = vxlan_features_check(skb, features);
 
/* Validate if the tunneled packet is being offloaded by HW */
if (skb->encapsulation &&
(features & NETIF_F_CSUM_MASK || features & NETIF_F_GSO_MASK))
-   return mlx5e_vxlan_features_check(priv, skb, features);
+   return mlx5e_vxlan_features_check(netdev_priv(netdev),
+ skb, features);
+#endif
 
return features;
 }
@@ -3365,8 +3370,10 @@ static const struct net_device_ops 
mlx5e_netdev_ops_sriov = {
.ndo_set_features= mlx5e_set_features,
.ndo_change_mtu  = mlx5e_change_mtu,
.ndo_do_ioctl= mlx5e_ioctl,
+#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD
.ndo_udp_tunnel_add  = mlx5e_add_vxlan_port,
.ndo_udp_tunnel_del  = mlx5e_del_vxlan_port,
+#endif
.ndo_set_tx_maxrate  = mlx5e_set_tx_maxrate,
.ndo_features_check  = mlx5e_features_check,
 #ifdef CONFIG_RFS_ACCEL
@@ -3643,6 +3650,7 @@ static void mlx5e_build_nic_netdev(struct net_device 
*netdev)
netdev->hw_features  |= NETIF_F_HW_VLAN_CTAG_RX;
netdev->hw_features  |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
+#ifdef CONFIG_MLX5_CORE_EN_UDP_ENCAP_OFFLOAD
if

[PATCH net-next 1/4] mlx5: Make building eswitch configurable

2017-01-26 Thread Tom Herbert

Add a configuration option (CONFIG_MLX5_CORE_EN_ESWITCH) for controlling
whether the eswitch code is built. Change Kconfig and Makefile
accordingly.

Signed-off-by: Tom Herbert 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   | 11 +++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |  6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 92 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   | 39 +++---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  4 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c| 16 ++--
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c   |  6 +-
 7 files changed, 125 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index ddb4ca4..27aae58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -30,3 +30,14 @@ config MLX5_CORE_EN_DCB
  This flag is depended on the kernel's DCB support.
 
  If unsure, set to Y
+
+config MLX5_CORE_EN_ESWITCH
+   bool "Ethernet switch"
+   default y
+   depends on MLX5_CORE_EN
+   ---help---
+ Say Y here if you want to use Ethernet switch (E-switch). E-Switch
+ is the software entity that represents and manages ConnectX4
+ inter-HCA ethernet l2 switching.
+
+ If unsure, set to Y
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9f43beb..17025d8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -5,9 +5,11 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o 
pagealloc.o \
mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
fs_counters.o rl.o lag.o dev.o
 
-mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o eswitch_offloads.o \
+mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \
en_main.o en_common.o en_fs.o en_ethtool.o en_tx.o \
en_rx.o en_rx_am.o en_txrx.o en_clock.o vxlan.o \
-   en_tc.o en_arfs.o en_rep.o en_fs_ethtool.o en_selftest.o
+   en_tc.o en_arfs.o en_fs_ethtool.o en_selftest.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
+
+mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o eswitch_offloads.o 
en_rep.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e829143..1db4d98 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -38,7 +38,9 @@
 #include 
 #include "en.h"
 #include "en_tc.h"
+#ifdef CONFIG_MLX5_CORE_EN_ESWITCH
 #include "eswitch.h"
+#endif
 #include "vxlan.h"
 
 struct mlx5e_rq_param {
@@ -585,10 +587,12 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 
switch (priv->params.rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+#ifdef CONFIG_MLX5_CORE_EN_ESWITCH
if (mlx5e_is_vf_vport_rep(priv)) {
err = -EINVAL;
goto err_rq_wq_destroy;
}
+#endif
 
rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
@@ -617,10 +621,14 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
goto err_rq_wq_destroy;
}
 
-   if (mlx5e_is_vf_vport_rep(priv))
+#ifdef CONFIG_MLX5_CORE_EN_ESWITCH
+   if (mlx5e_is_vf_vport_rep(priv)) {
rq->handle_rx_cqe = mlx5e_handle_rx_cqe_rep;
-   else
+   } else
+#endif
+   {
rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
+   }
 
rq->alloc_wqe = mlx5e_alloc_rx_wqe;
rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
@@ -2198,7 +2206,6 @@ static void mlx5e_netdev_set_tcs(struct net_device 
*netdev)
 int mlx5e_open_locked(struct net_device *netdev)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
-   struct mlx5_core_dev *mdev = priv->mdev;
int num_txqs;
int err;
 
@@ -2233,11 +2240,13 @@ int mlx5e_open_locked(struct net_device *netdev)
if (priv->profile->update_stats)
        queue_delayed_work(priv->wq, &priv->update_stats_work, 0);
 
-   if (MLX5_CAP_GEN(mdev, vport_group_manager)) {
+#ifdef CONFIG_MLX5_CORE_EN_ESWITCH
+   if (MLX5_CAP_GEN(priv->mdev, vport_group_manager)) {
err = mlx5e_add_sqs_fwd_rules(priv);
if (err)
goto err_close_channels;
}
+#endif
return 0;
 
 err_close_channels:
@@ -2262,7 +2271,6 @@ int mlx5e_open(struct net_device *netdev)
 int mlx5e_close_locked(struct net_device *netdev)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
-   struct mlx5_core_dev *mdev = priv->mdev;

[PATCH net-next] net: adjust skb->truesize in pskb_expand_head()

2017-01-26 Thread Eric Dumazet

From: Eric Dumazet 

Slava Shwartsman reported a warning in skb_try_coalesce(), when we
detect skb->truesize is completely wrong.

In his case, issue came from IPv6 reassembly coping with malicious
datagrams, that forced various pskb_may_pull() to reallocate a bigger
skb->head than the one allocated by NIC driver before entering GRO
layer.

Current code does not change skb->truesize, leaving this burden to
callers if they care enough.

Blindly changing skb->truesize in pskb_expand_head() is not
easy, as some producers might track skb->truesize, for example
in xmit path for back pressure feedback (sk->sk_wmem_alloc)

We can detect the cases where it should be safe to change
skb->truesize :

1) skb is not attached to a socket.
2) If it is attached to a socket, destructor is sock_edemux()

My audit gave only two callers doing their own skb->truesize
manipulation.

Signed-off-by: Eric Dumazet 
Reported-by: Slava Shwartsman 
---
 net/core/skbuff.c|   14 +++---
 net/netlink/af_netlink.c |8 +++-
 net/wireless/util.c  |2 --
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 
f8dbe4a7ab46a9196c6683ce5c9c14d3d99d..6cd59da7ec583260748b9c45b99a824bcc61 
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1192,10 +1192,10 @@ EXPORT_SYMBOL(__pskb_copy_fclone);
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 gfp_t gfp_mask)
 {
-   int i;
-   u8 *data;
-   int size = nhead + skb_end_offset(skb) + ntail;
+   int i, osize = skb_end_offset(skb);
+   int size = osize + nhead + ntail;
long off;
+   u8 *data;
 
BUG_ON(nhead < 0);
 
@@ -1257,6 +1257,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int 
ntail,
skb->hdr_len  = 0;
skb->nohdr= 0;
    atomic_set(&skb_shinfo(skb)->dataref, 1);
+
+   /* It is not generally safe to change skb->truesize.
+* For the moment, we really care of rx path, or
+* when skb is orphaned (not attached to a socket)
+*/
+   if (!skb->sk || skb->destructor == sock_edemux)
+   skb->truesize += size - osize;
+
return 0;
 
 nofrags:
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 
edcc1e19ad532641f51f6809b8c90d1e3770..7b73c7c161a9680b8691a712c31073b77896 
100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1210,11 +1210,9 @@ static struct sk_buff *netlink_trim(struct sk_buff *skb, 
gfp_t allocation)
skb = nskb;
}
 
-   if (!pskb_expand_head(skb, 0, -delta,
- (allocation & ~__GFP_DIRECT_RECLAIM) |
- __GFP_NOWARN | __GFP_NORETRY))
-   skb->truesize -= delta;
-
+   pskb_expand_head(skb, 0, -delta,
+(allocation & ~__GFP_DIRECT_RECLAIM) |
+__GFP_NOWARN | __GFP_NORETRY);
return skb;
 }
 
diff --git a/net/wireless/util.c b/net/wireless/util.c
index 
1b9296882dcd6a0b585dfd604a30807e7f26..68e5f2ecee1aa22f17ab9a55eb566124e585 
100644
--- a/net/wireless/util.c
+++ b/net/wireless/util.c
@@ -618,8 +618,6 @@ int ieee80211_data_from_8023(struct sk_buff *skb, const u8 
*addr,
 
if (pskb_expand_head(skb, head_need, 0, GFP_ATOMIC))
return -ENOMEM;
-
-   skb->truesize += head_need;
}
 
if (encaps_data) {

Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver

2017-01-26 Thread Willem de Bruijn

On Thu, Jan 26, 2017 at 4:37 PM, Sowmini Varadhan
 wrote:
> On (01/26/17 15:21), Willem de Bruijn wrote:
>> > If the application has provided fewer than hard_header_len bytes,
>> > dev_validate_header() will zero out the skb->data as needed. This is
>> > acceptable for SOCK_DGRAM/PF_PACKET sockets but in all other cases,
>>
>> This was added not for datagram sockets, but to be able to bypass
>> validation. See the message in commit 2793a23aacbd ("net: validate
>> variable length ll header") and discussion leading up to that patch.
>
> some context, I got into this patch as a result of the comments in
>  https://www.mail-archive.com/netdev@vger.kernel.org/msg149031.html
>
>> As David pointed out, this does not handle variable length headers
>> correctly. In link layers that support these, hard_header_len defines
>> the maximum header length. A hard failure on len < hard_header_len
>> would be incorrect.
>
> right, since DaveM's comments, I took a look at the drivers
> that have a ->validate - afaict (from cscope) ax25 is the only
> in-kernel driver that actually passes a ->validate pointer..
> I tried patching ax25 here:
>   http://marc.info/?l=linux-hams&m=148537926422828&w=2
> Still waiting to hear back from that list (which doesn't seem to have
> much traffic so maybe I should time out on it). Does that
> patch make better sense (I'll look up the comments leading up
> to 2793a23aacbd later tonight)

Thanks for the context. ax25_addr_parse doesn't adjust length, it only
verifies that the contents of the variable length header matches
protocol spec. I don't think that it or the .validate callback have to
be modified to return length.

To ensure that skb_headlen(skb) is at least a valid header length even
when CAP_SYS_RAWIO bypasses validation perhaps revise
dev_validate_header to take an additional skb->len parameter and
call skb_put directly from inside that branch.

>> The ->validate callback was added to allow specifying additional
>> constraints on a per protocol basis. This is where a min constraint
>> can be added, e.g., for ethernet.
>>
>> > -   if (!dev_validate_header(dev, skb->data, len)) {
>> > +   newlen = dev_validate_header(dev, skb->data, len);
>> > +   /* As comments above this function indicate, a full L2 header
>> > +* must be passed to this function, so if newlen > len, bail.
>> > +*/
>> > +   if (newlen < 0 || newlen > len) {
>>
>> If callers only care whether the function returned failure or
>> increased len, which also indicates failure, it is cleaner to leave it
>> a boolean and fail in cases where len < the minimum for that link
>> layer type. No caller actually uses newlen.
>>
>> > +   /* Caller has allocated for copylen in non-paged part of
>> > +* skb so we should never find newlen > hdrlen
>> > +*/
>> > +   WARN_ON(newlen > hdrlen);
>>
>> WARN_ON_ONCE is safer.
>
> Ok that's easy enough to do.
>

[PATCH net-next 2/4] mlx5: Make building SR-IOV configurable

2017-01-26 Thread Tom Herbert

Add a configuration option (CONFIG_MLX5_CORE_EN_SRIOV) for controlling
whether the SR-IOV code is built. Change Kconfig and Makefile
accordingly.

Signed-off-by: Tom Herbert 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig  |  8 
 drivers/net/ethernet/mellanox/mlx5/core/Makefile |  4 +++-
 drivers/net/ethernet/mellanox/mlx5/core/lag.c|  2 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c   | 16 
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 27aae58..7ade61a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -41,3 +41,11 @@ config MLX5_CORE_EN_ESWITCH
  inter-HCA ethernet l2 switching.
 
  If unsure, set to Y
+
+config MLX5_CORE_EN_SRIOV
+   bool "SR-IOV"
+   default y
+   ---help---
+ Say Y here if you want to use SR-IOV.
+
+ If unsure, set to Y
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 17025d8..6d38250 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -2,7 +2,7 @@ obj-$(CONFIG_MLX5_CORE) += mlx5_core.o
 
 mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o \
-   mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
+   mad.o transobj.o vport.o fs_cmd.o fs_core.o \
fs_counters.o rl.o lag.o dev.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \
@@ -13,3 +13,5 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o eswitch_offloads.o 
en_rep.o
+
+mlx5_core-$(CONFIG_MLX5_CORE_EN_SRIOV) += sriov.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag.c 
b/drivers/net/ethernet/mellanox/mlx5/core/lag.c
index 5595724..6dc3792 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag.c
@@ -223,11 +223,13 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
    mutex_unlock(&lag_mutex);
 
if (tracker.is_bonded && !mlx5_lag_is_bonded(ldev)) {
+#ifdef CONFIG_MLX5_CORE_EN_SRIOV
if (mlx5_sriov_is_enabled(dev0) ||
mlx5_sriov_is_enabled(dev1)) {
mlx5_core_warn(dev0, "LAG is not supported with SRIOV");
return;
}
+#endif
 
for (i = 0; i < MLX5_MAX_PORTS; i++)
mlx5_remove_dev_by_protocol(ldev->pf[i].dev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 224f499..cd6a9c7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -949,15 +949,20 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv)
}
 #endif
 
+#ifdef CONFIG_MLX5_CORE_EN_SRIOV
err = mlx5_sriov_init(dev);
if (err) {
        dev_err(&pdev->dev, "Failed to init sriov %d\n", err);
goto err_eswitch_cleanup;
}
+#endif
 
return 0;
 
+#ifdef CONFIG_MLX5_CORE_EN_SRIOV
 err_eswitch_cleanup:
+#endif
+
 #ifdef CONFIG_MLX5_CORE_EN_ESWITCH
mlx5_eswitch_cleanup(dev->priv.eswitch);
 
@@ -980,7 +985,9 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, struct 
mlx5_priv *priv)
 
 static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 {
+#ifdef CONFIG_MLX5_CORE_EN_SRIOV
mlx5_sriov_cleanup(dev);
+#endif
 #ifdef CONFIG_MLX5_CORE_EN_ESWITCH
mlx5_eswitch_cleanup(dev->priv.eswitch);
 #endif
@@ -1135,11 +1142,13 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
mlx5_eswitch_attach(dev->priv.eswitch);
 #endif
 
+#ifdef CONFIG_MLX5_CORE_EN_SRIOV
err = mlx5_sriov_attach(dev);
if (err) {
        dev_err(&pdev->dev, "sriov init failed %d\n", err);
goto err_sriov;
}
+#endif
 
if (mlx5_device_registered(dev)) {
mlx5_attach_device(dev);
@@ -1159,9 +1168,12 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
return 0;
 
 err_reg_dev:
+#ifdef CONFIG_MLX5_CORE_EN_SRIOV
mlx5_sriov_detach(dev);
 
 err_sriov:
+#endif
+
 #ifdef CONFIG_MLX5_CORE_EN_ESWITCH
mlx5_eswitch_detach(dev->priv.eswitch);
 #endif
@@ -1232,7 +1244,9 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
if (mlx5_device_registered(dev))
mlx5_detach_device(dev);
 
+#ifdef CONFIG_MLX5_CORE_EN_SRIOV
mlx5_sriov_detach(dev);
+#endif
 #ifdef CONFIG_MLX5_CORE_EN_ESWITCH
mlx5_eswitch_detach(dev->priv.eswitch);
 #endif
@@

[PATCH net-next 3/4] mlx5: Make building tc hardware offload configurable

2017-01-26 Thread Tom Herbert

Add a configuration option (CONFIG_MLX5_CORE_EN_TC) for controlling
whether the TC hardware offload code is built. Change Kconfig and Makefile
accordingly.

Signed-off-by: Tom Herbert 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |  8 
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |  4 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 10 ++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 7ade61a..b38c920 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -49,3 +49,11 @@ config MLX5_CORE_EN_SRIOV
  Say Y here if you want to use SR-IOV.
 
  If unsure, set to Y
+
+config MLX5_CORE_EN_TC
+   bool "TC offload"
+   default y
+   ---help---
+ Say Y here if you want to use TC hardware offload support.
+
+ If unsure, set to Y
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 6d38250..c308531 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -8,10 +8,12 @@ mlx5_core-y :=main.o cmd.o debugfs.o fw.o eq.o uar.o 
pagealloc.o \
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o \
en_main.o en_common.o en_fs.o en_ethtool.o en_tx.o \
en_rx.o en_rx_am.o en_txrx.o en_clock.o vxlan.o \
-   en_tc.o en_arfs.o en_fs_ethtool.o en_selftest.o
+   en_arfs.o en_fs_ethtool.o en_selftest.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_ESWITCH) += eswitch.o eswitch_offloads.o 
en_rep.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_SRIOV) += sriov.o
+
+mlx5_core-$(CONFIG_MLX5_CORE_EN_TC) += en_tc.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1db4d98..2d2c982 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2690,6 +2690,7 @@ int mlx5e_modify_rqs_vsd(struct mlx5e_priv *priv, bool 
vsd)
return 0;
 }
 
+#ifdef CONFIG_MLX5_CORE_EN_TC
 static int mlx5e_setup_tc(struct net_device *netdev, u8 tc)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
@@ -2743,6 +2744,7 @@ static int mlx5e_ndo_setup_tc(struct net_device *dev, u32 
handle,
 
return mlx5e_setup_tc(dev, tc->tc);
 }
+#endif
 
 static void
 mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
@@ -3323,7 +3325,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic 
= {
.ndo_open= mlx5e_open,
.ndo_stop= mlx5e_close,
.ndo_start_xmit  = mlx5e_xmit,
+#ifdef CONFIG_MLX5_CORE_EN_TC
.ndo_setup_tc= mlx5e_ndo_setup_tc,
+#endif
.ndo_select_queue= mlx5e_select_queue,
.ndo_get_stats64 = mlx5e_get_stats,
.ndo_set_rx_mode = mlx5e_set_rx_mode,
@@ -3349,7 +3353,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov 
= {
.ndo_open= mlx5e_open,
.ndo_stop= mlx5e_close,
.ndo_start_xmit  = mlx5e_xmit,
+#ifdef CONFIG_MLX5_CORE_EN_TC
.ndo_setup_tc= mlx5e_ndo_setup_tc,
+#endif
.ndo_select_queue= mlx5e_select_queue,
.ndo_get_stats64 = mlx5e_get_stats,
.ndo_set_rx_mode = mlx5e_set_rx_mode,
@@ -3762,9 +3768,11 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
goto err_destroy_direct_tirs;
}
 
+#ifdef CONFIG_MLX5_CORE_EN_TC
err = mlx5e_tc_init(priv);
if (err)
goto err_destroy_flow_steering;
+#endif
 
return 0;
 
@@ -3786,7 +3794,9 @@ static void mlx5e_cleanup_nic_rx(struct mlx5e_priv *priv)
 {
int i;
 
+#ifdef CONFIG_MLX5_CORE_EN_TC
mlx5e_tc_cleanup(priv);
+#endif
mlx5e_destroy_flow_steering(priv);
mlx5e_destroy_direct_tirs(priv);
mlx5e_destroy_indirect_tirs(priv);
-- 
2.9.3

Re: [PATCH v5 net] ravb: unmap descriptors when freeing rings

2017-01-26 Thread David Miller

From: Simon Horman 
Date: Thu, 26 Jan 2017 14:29:27 +0100

> From: Kazuya Mizuguchi 
> 
> "swiotlb buffer is full" errors occur after repeated initialisation of a
> device - f.e. suspend/resume or ip link set up/down. This is because memory
> mapped using dma_map_single() in ravb_ring_format() and ravb_start_xmit()
> is not released.  Resolve this problem by unmapping descriptors when
> freeing rings.
> 
> Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
> Signed-off-by: Kazuya Mizuguchi 
> [simon: reworked]
> Signed-off-by: Simon Horman 

Applied, thanks Simon.
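
The leak described in the commit message can be modelled in userspace. The following is only a sketch under stated assumptions: the fixed-size pool stands in for the swiotlb bounce buffers, and every demo_* name and both sizes are hypothetical, not taken from the ravb driver. It shows that mapping on every ring setup without unmapping on free exhausts the pool after a few up/down cycles, while pairing each map with an unmap does not.

```c
#include <stdbool.h>

#define DEMO_POOL_SLOTS 4	/* hypothetical bounce-buffer pool size */
#define DEMO_RING_DESCS 2	/* hypothetical descriptors per ring */

struct demo_pool {
	bool in_use[DEMO_POOL_SLOTS];
};

struct demo_ring {
	int slot[DEMO_RING_DESCS];
};

/* Stand-in for dma_map_single(): returns a slot, or -1 when the pool is full. */
static int demo_map(struct demo_pool *p)
{
	for (int i = 0; i < DEMO_POOL_SLOTS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/* Stand-in for dma_unmap_single(). */
static void demo_unmap(struct demo_pool *p, int slot)
{
	if (slot >= 0 && slot < DEMO_POOL_SLOTS)
		p->in_use[slot] = false;
}

/* Map every descriptor of the ring; undo partial work on failure. */
static int demo_ring_format(struct demo_pool *p, struct demo_ring *r)
{
	for (int i = 0; i < DEMO_RING_DESCS; i++) {
		r->slot[i] = demo_map(p);
		if (r->slot[i] < 0) {
			while (--i >= 0)
				demo_unmap(p, r->slot[i]);
			return -1;
		}
	}
	return 0;
}

/* Free the ring; with unmap == false this models the leak. */
static void demo_ring_free(struct demo_pool *p, struct demo_ring *r, bool unmap)
{
	if (!unmap)
		return;
	for (int i = 0; i < DEMO_RING_DESCS; i++)
		demo_unmap(p, r->slot[i]);
}

/* Count how many up/down cycles succeed before mapping fails. */
static int demo_cycles_until_full(bool unmap, int max_cycles)
{
	struct demo_pool pool = { { false } };
	struct demo_ring ring;
	int n;

	for (n = 0; n < max_cycles; n++) {
		if (demo_ring_format(&pool, &ring) < 0)
			break;
		demo_ring_free(&pool, &ring, unmap);
	}
	return n;
}
```

Calling demo_cycles_until_full(false, 100) stops after the pool is used up, whereas demo_cycles_until_full(true, 100) completes all 100 cycles.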

[PATCH net-next 0/4] mlx5: Create build configuration options

2017-01-26 Thread Tom Herbert

This patchset creates configuration options for the SR-IOV, vxlan,
eswitch, and tc features in the mlx5 driver. The purpose is to allow
these features to be left out of the build. They are optional advanced
features that are not required for a core Ethernet driver. A user can
disable them, which reduces the amount of code in the driver. Disabling
these features (and DCB) reduces the size of mlx5_core.o by about 16%.
This can also reduce the complexity of backports and rebases, since a
user no longer needs to worry about dependencies on the rest of the
kernel brought in by features that may be of no interest to that user.

Tested: Built and ran the driver with all features enabled (the default)
and with none enabled (including DCB). Did not see any issues. I did
not explicitly test operation of any of the features in the list.


Tom Herbert (4):
  mlx5: Make building eswitch configurable
  mlx5: Make building SR-IOV configurable
  mlx5: Make building tc hardware offload configurable
  mlx5: Make building vxlan hardware offload configurable

 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |  35 ++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |  16 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 129 --
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |  39 +--
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/lag.c |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/main.c|  32 --
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c   |   6 +-
 8 files changed, 205 insertions(+), 58 deletions(-)

-- 
2.9.3
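
For readers comparing approaches: the tc patch in this series wraps the call sites in #ifdef CONFIG_MLX5_CORE_EN_TC. A different technique, often seen in kernel headers, is to provide empty inline stubs when a feature is compiled out so that callers need no #ifdef. The sketch below shows that stub variant in plain C. It is an illustration under assumptions, not what the posted patches do; CONFIG_DEMO_TC and all demo_* names are hypothetical.

```c
/*
 * Hypothetical feature switch. In the kernel this would come from Kconfig;
 * here it can be enabled with -DCONFIG_DEMO_TC on the compiler command line.
 */
struct demo_priv {
	int tc_ready;
};

#ifdef CONFIG_DEMO_TC
static int demo_tc_init(struct demo_priv *priv)
{
	priv->tc_ready = 1;
	return 0;
}

static void demo_tc_cleanup(struct demo_priv *priv)
{
	priv->tc_ready = 0;
}
#else
/* Feature compiled out: stubs let callers stay free of #ifdef. */
static inline int demo_tc_init(struct demo_priv *priv)
{
	(void)priv;
	return 0;
}

static inline void demo_tc_cleanup(struct demo_priv *priv)
{
	(void)priv;
}
#endif

/* Callers look the same whether or not the feature is built. */
static int demo_init_rx(struct demo_priv *priv)
{
	return demo_tc_init(priv);
}

static void demo_cleanup_rx(struct demo_priv *priv)
{
	demo_tc_cleanup(priv);
}
```

The trade-off is that stubs keep the call sites readable, while explicit #ifdef blocks make it obvious at each call site that the code is optional.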

Re: [PATCH net-next v2] net: ipv6: ignore null_entry on route dumps

2017-01-26 Thread David Miller

From: David Ahern 
Date: Thu, 26 Jan 2017 13:54:08 -0800

> Dave: per last email you suggested putting this in fib6_dump_node, but
>   fib6_dump_node does not have knowledge of the network namespace
>   because of the way the fib_walker works. I could put
>   struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) w->args;
>   into fib6_dump_node to get the net pointer from it but did not
>   want to duplicate the typecast.

Ok, if you can't see the network namespace in fib6_dump_node() you
can't do the test there.

I'll apply this, thanks.

Re: [PATCH net-next] net: ipv6: remove skb_reserve in getroute

2017-01-26 Thread David Miller

From: David Ahern 
Date: Thu, 26 Jan 2017 14:08:36 -0800

> Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
> allocated skb is not passed through the routing engine (like it is for
> IPv4) and has not since the beginning of git time.
> 
> Signed-off-by: David Ahern 

Good catch, applied, thanks David.

Re: fs, net: deadlock between bind/splice on af_unix

2017-01-26 Thread Mateusz Guzik

On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov  wrote:
> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro  wrote:
> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
> >>
> >>> > Why do we do autobind there, anyway, and why is it conditional on
> >>> > SOCK_PASSCRED?  Note that e.g. for SOCK_STREAM we can bloody well get
> >>> > to sending stuff without autobind ever done - just use socketpair()
> >>> > to create that sucker and we won't be going through the connect()
> >>> > at all.
> >>>
> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls unix_autobind(),
> >>> not SOCK_STREAM.
> >>
> >> Yes, I've noticed.  What I'm asking is what in there needs autobind 
> >> triggered
> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
> >>
> >>> I guess some lock, perhaps the u->bindlock could be dropped before
> >>> acquiring the next one (sb_writer), but I need to double check.
> >>
> >> Bad idea, IMO - do you *want* autobind being able to come through while
> >> bind(2) is busy with mknod?
> >
> >
> > Ping. This is still happening on HEAD.
> >
> 
> Thanks for your reminder. Mind to give the attached patch (compile only)
> a try? I take another approach to fix this deadlock, which moves the
> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
> impact with this way.
> 

I don't think this is the right approach.

Currently the file creation is postponed until unix_bind can no longer
fail otherwise. With it reordered, someone may race you with a
different path, and now you are left with a file to clean up. Except it
is quite unclear to me whether you can unlink it.

I don't have a good idea how to fix it. A somewhat typical approach
would introduce an intermediate state ("under construction") and drop
the lock around the call into unix_mknod.

In this particular case, perhaps you could repurpose gc_flags as a
general flags carrier and add a 'binding in process' flag to test.
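
The "under construction" idea can be sketched in userspace with a mutex and a flag. This is a minimal illustration under assumptions, not a proposed af_unix patch: struct demo_sock, demo_bind() and the create callbacks are hypothetical stand-ins, and the real locking rules around unix_mknod() are more involved than shown here.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct demo_sock {
	pthread_mutex_t lock;	/* stand-in for the bind lock */
	bool bound;
	bool binding;		/* "under construction" marker */
};

/* Stand-in for the slow file creation step; returns 0 or a negative errno. */
typedef int (*demo_create_fn)(void);

static int demo_create_ok(void)
{
	return 0;
}

static int demo_create_fail(void)
{
	return -EEXIST;
}

static int demo_bind(struct demo_sock *s, demo_create_fn create)
{
	int err;

	pthread_mutex_lock(&s->lock);
	if (s->bound || s->binding) {
		/* Already bound, or another bind is in flight. */
		pthread_mutex_unlock(&s->lock);
		return -EINVAL;
	}
	s->binding = true;
	pthread_mutex_unlock(&s->lock);

	/* The lock is not held while the file is created. */
	err = create();

	pthread_mutex_lock(&s->lock);
	s->binding = false;
	if (!err)
		s->bound = true;
	pthread_mutex_unlock(&s->lock);

	return err;
}
```

Any other path that takes the lock while the flag is set can see that a bind is in flight and back off, which is the property the intermediate state is meant to provide.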

Re: [PATCH net-next] net: Fix ndo_setup_tc comment

2017-01-26 Thread John Fastabend

On 17-01-26 02:44 PM, Florian Fainelli wrote:
> Commit 16e5cc647173 ("net: rework setup_tc ndo op to consume
> general tc operand") changed the ndo_setup_tc() signature, but did not
> update the comments in netdevice.h, so do that now.
> 
> Signed-off-by: Florian Fainelli 
> ---

Thanks.

Acked-by: John Fastabend

Re: loopback device reference count leakage

2017-01-26 Thread Kaiwen Xu

Hi Cong,

I tested your patch, and it does seem to prevent the issue from
happening. Here are the dev_hold()/dev_put() calls with your patch
applied.

Jan 26 00:29:08  kernel: [ 4385.940243] lo: dev_hold 1 
rx_queue_add_kobject
Jan 26 00:29:08  kernel: [ 4385.940255] lo: dev_hold 2 
netdev_queue_add_kobject
Jan 26 00:29:08  kernel: [ 4385.940257] lo: dev_hold 3 
register_netdevice
Jan 26 00:29:08  kernel: [ 4385.940260] lo: dev_hold 4 
neigh_parms_alloc
Jan 26 00:29:08  kernel: [ 4385.940262] lo: dev_hold 5 inetdev_init
Jan 26 00:29:08  kernel: [ 4386.017699] lo: dev_hold 6 qdisc_alloc
Jan 26 00:29:08  kernel: [ 4386.017741] lo: dev_hold 7 
dev_get_by_index
Jan 26 00:29:08  kernel: [ 4386.017749] lo: dev_hold 8 
dev_get_by_index
Jan 26 00:29:08  kernel: [ 4386.017756] lo: dev_hold 9 fib_check_nh
Jan 26 00:29:08  kernel: [ 4386.017760] lo: dev_hold 10 fib_check_nh
Jan 26 00:29:08  kernel: [ 4386.017767] lo: dev_hold 11 
dev_get_by_index
Jan 26 00:29:08  kernel: [ 4386.017772] lo: dev_hold 12 
dev_get_by_index
Jan 26 00:29:08  kernel: [ 4386.017775] lo: dev_hold 13 fib_check_nh
Jan 26 00:29:08  kernel: [ 4386.017778] lo: dev_hold 14 fib_check_nh
Jan 26 00:29:08  kernel: [ 4386.033548] lo: dev_put 14 
free_fib_info_rcu
Jan 26 00:29:08  kernel: [ 4386.033553] lo: dev_put 13 
free_fib_info_rcu
Jan 26 00:29:08  kernel: [ 4386.033556] lo: dev_put 12 
free_fib_info_rcu
Jan 26 00:29:08  kernel: [ 4386.033558] lo: dev_put 11 
free_fib_info_rcu
Jan 26 00:29:08  kernel: [ 4386.033560] lo: dev_put 10 
free_fib_info_rcu
Jan 26 00:29:08  kernel: [ 4386.033563] lo: dev_put 9 
free_fib_info_rcu
Jan 26 00:29:09  kernel: [ 4386.438175] lo: dev_hold 9 dst_init
Jan 26 00:29:09  kernel: [ 4386.442558] lo: dev_hold 10 dst_init
Jan 26 00:29:09  kernel: [ 4386.442564] lo: dev_hold 11 dst_init
Jan 26 00:29:09  kernel: [ 4386.477575] lo: dev_put 11 dst_destroy
Jan 26 00:29:09  kernel: [ 4386.641150] lo: dev_hold 11 dst_init
Jan 26 00:37:59  kernel: [ 4916.949380] lo: dev_hold 12 dst_init
Jan 26 00:37:59  kernel: [ 4916.949401] lo: dev_hold 13 __neigh_create
Jan 26 00:56:52  kernel: [ 6049.882993] lo: dev_hold 14 dst_init
Jan 26 00:57:54  kernel: [ 6111.782520] lo: dev_hold 15 dst_init
Jan 26 01:28:02  kernel: [ 7920.396248] lo: dev_hold 16 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396251] lo: dev_hold 17 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396254] lo: dev_hold 18 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396257] lo: dev_hold 19 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396260] lo: dev_hold 20 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396263] lo: dev_hold 21 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396266] lo: dev_hold 22 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396268] lo: dev_hold 23 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396271] lo: dev_hold 24 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396274] lo: dev_hold 25 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396277] lo: dev_hold 26 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396280] lo: dev_hold 27 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396283] lo: dev_hold 28 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396286] lo: dev_hold 29 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396288] lo: dev_hold 30 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396291] lo: dev_hold 31 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396294] lo: dev_hold 32 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396297] lo: dev_hold 33 dst_ifdown
Jan 26 01:28:02  kernel: [ 7920.396300] lo: dev_hold 34 dst_ifdown
Jan 26 01:28:03  kernel: [ 7920.584272] lo: dev_put 34 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584277] lo: dev_put 33 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584279] lo: dev_put 32 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584281] lo: dev_put 31 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584283] lo: dev_put 30 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584285] lo: dev_put 29 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584287] lo: dev_put 28 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584289] lo: dev_put 27 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584290] lo: dev_put 26 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584301] lo: dev_put 25 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584303] lo: dev_put 24 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584305] lo: dev_put 23 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584307] lo: dev_put 22 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584309] lo: dev_put 21 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584311] lo: dev_put 20 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584324] lo: dev_put 19 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584326] lo: dev_put 18 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584328] lo: dev_put 17 dst_destroy
Jan 26 01:28:03  kernel: [ 7920.584330] lo: dev_put 16 dst_destroy
Jan 26 01:30:32  kernel: [ 8069.750192] lo: dev_put 15 neigh_destroy
Jan 26 01:30:32  kernel: [ 8069.751961] lo: dev_put 14 qdisc_destroy
Jan 26 01:30:32  kernel: [ 8069.752014] lo: dev_put 13 
neigh_parms_release
Jan 26 01:30:32  kernel: [
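
The debug patch that produced this trace is not shown in the thread. As a rough sketch of how such a trace can be generated, the userspace code below logs every hold and put together with the caller's name. Judging from the log (a "dev_put 9" is followed by a "dev_hold 9"), a hold seems to print the count after the increment and a put the count before the decrement; the sketch assumes that convention. struct demo_dev and the demo_* helpers are hypothetical.

```c
#include <stdio.h>

struct demo_dev {
	const char *name;
	int refcnt;
};

/* Take a reference and log the new count with the caller's name. */
static void demo_hold(struct demo_dev *d, const char *caller)
{
	d->refcnt++;
	printf("%s: dev_hold %d %s\n", d->name, d->refcnt, caller);
}

/* Log the current count with the caller's name, then drop the reference. */
static void demo_put(struct demo_dev *d, const char *caller)
{
	printf("%s: dev_put %d %s\n", d->name, d->refcnt, caller);
	d->refcnt--;
}
```

A device can only be released once every hold in such a trace has a matching put, so an unpaired caller name points at the leak.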

Re: [PATCHv3 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs

2017-01-26 Thread Chris Packham

On 27/01/17 09:24, Chris Packham wrote:
> On 27/01/17 04:10, Gregory CLEMENT wrote:
>>> +   internal-regs {
>
> [snip]
>
>>> +
>>> +   dfx-registers {
>> node label
>>
>
> [snip]
>
>>> +   switch {
>> node label
>>
>
> These are peers to the internal-regs, i.e. parts of the SoC with
> mappable windows in the address space. Do they really need a label?
> Their subnodes absolutely need (and have) labels.
>

Actually the pci-controller is in the same category and that has a label 
so I'll add one.

[PATCH net-next] net: Fix ndo_setup_tc comment

2017-01-26 Thread Florian Fainelli

Commit 16e5cc647173 ("net: rework setup_tc ndo op to consume
general tc operand") changed the ndo_setup_tc() signature, but did not
update the comments in netdevice.h, so do that now.

Signed-off-by: Florian Fainelli 
---
 include/linux/netdevice.h | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3868c32d98af..d63cacb67ea6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -964,11 +964,12 @@ struct netdev_xdp {
  *  with PF and querying it may introduce a theoretical security risk.
  * int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool 
setting);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
- * int (*ndo_setup_tc)(struct net_device *dev, u8 tc)
- * Called to setup 'tc' number of traffic classes in the net device. This
- * is always called from the stack with the rtnl lock held and netif tx
- * queues stopped. This allows the netdevice to perform queue management
- * safely.
+ * int (*ndo_setup_tc)(struct net_device *dev, u32 handle,
+ *__be16 protocol, struct tc_to_netdev *tc);
+ * Called to setup any 'tc' scheduler, classifier or action on @dev.
+ * This is always called from the stack with the rtnl lock held and netif
+ * tx queues stopped. This allows the netdevice to perform queue
+ * management safely.
  *
  * Fiber Channel over Ethernet (FCoE) offload functions.
  * int (*ndo_fcoe_enable)(struct net_device *dev);
-- 
2.9.3

Re: [PATCH net-next] net: ipv6: remove skb_reserve in getroute

2017-01-26 Thread Eric Dumazet

On Thu, 2017-01-26 at 14:08 -0800, David Ahern wrote:
> Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
> allocated skb is not passed through the routing engine (like it is for
> IPv4) and has not since the beginning of git time.
> 
> Signed-off-by: David Ahern 
> ---
>  net/ipv6/route.c | 6 --
>  1 file changed, 6 deletions(-)

Nice ;)

Acked-by: Eric Dumazet

[PATCH V3 net-next 01/14] net/ena: remove ntuple filter support from device feature list

2017-01-26 Thread Netanel Belgazal

Remove NETIF_F_NTUPLE from netdev->features.
The ENA device driver does not support ntuple filtering.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index cc8b13e..7493ea3 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2722,7 +2722,6 @@ static void ena_set_dev_offloads(struct 
ena_com_dev_get_features_ctx *feat,
netdev->features =
dev_features |
NETIF_F_SG |
-   NETIF_F_NTUPLE |
NETIF_F_RXHASH |
NETIF_F_HIGHDMA;
 
-- 
2.7.4

[PATCH V3 net-next 02/14] net/ena: fix error handling when probe fails

2017-01-26 Thread Netanel Belgazal

When the driver fails in probe, it releases all resources,
including the adapter.
In case of probe failure, ena_remove should not try to
free the adapter resources.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 7493ea3..cb60567 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -3046,6 +3046,7 @@ static int ena_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 err_free_region:
ena_release_bars(ena_dev, pdev);
 err_free_ena_dev:
+   pci_set_drvdata(pdev, NULL);
vfree(ena_dev);
 err_disable_device:
pci_disable_device(pdev);
-- 
2.7.4

[PATCH V3 net-next 03/14] net/ena: fix queues number calculation

2017-01-26 Thread Netanel Belgazal

The ENA driver tries to open a queue per vCPU.
To determine how many vCPUs the instance has, it uses num_possible_cpus()
while it should use num_online_cpus() instead.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index cb60567..f409cfd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2660,7 +2660,7 @@ static int ena_calc_io_queue_num(struct pci_dev *pdev,
io_sq_num = get_feat_ctx->max_queues.max_sq_num;
}
 
-   io_queue_num = min_t(int, num_possible_cpus(), ENA_MAX_NUM_IO_QUEUES);
+   io_queue_num = min_t(int, num_online_cpus(), ENA_MAX_NUM_IO_QUEUES);
io_queue_num = min_t(int, io_queue_num, io_sq_num);
io_queue_num = min_t(int, io_queue_num,
 get_feat_ctx->max_queues.max_cq_num);
-- 
2.7.4
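
The queue count calculation in the diff above is a chain of minimums. The sketch below reproduces that arithmetic in userspace so the effect of the CPU count is easy to see. It is an illustration under assumptions: DEMO_MAX_IO_QUEUES is a made-up limit, and sysconf(_SC_NPROCESSORS_ONLN) is only a userspace analogue of num_online_cpus() (a common extension on Linux and the BSDs, not part of the kernel API).

```c
#include <unistd.h>

#define DEMO_MAX_IO_QUEUES 128	/* hypothetical driver limit */

static int demo_min(int a, int b)
{
	return a < b ? a : b;
}

/* Clamp the CPU count by the driver limit and by what the device offers. */
static int demo_calc_io_queue_num(int cpus, int max_sq_num, int max_cq_num)
{
	int n = demo_min(cpus, DEMO_MAX_IO_QUEUES);

	n = demo_min(n, max_sq_num);
	n = demo_min(n, max_cq_num);
	return n;
}

/* Userspace analogue of num_online_cpus(); falls back to 1 on error. */
static int demo_online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? (int)n : 1;
}
```

With 4 online CPUs and a device offering 8 submission and 8 completion queues, demo_calc_io_queue_num(4, 8, 8) yields 4 queues. Passing a larger "possible" CPU count instead would request more queues than there are CPUs to service them.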

[PATCH V3 net-next 00/14] Bug Fixes in ENA driver.

2017-01-26 Thread Netanel Belgazal

Changes between V3 and V2:
* Fix typos and correct alignment in commit messages.
* Use the napi_complete_done() return value to determine when the napi
handler needs to unmask the interrupts, rather than implementing a
non-standard solution.
* Remove new features from this patchset and leave bug fixes only.
* Give an example of the kernel crashes in the commit message.
* Use BIT(x) instead of using the value explicitly.


Netanel Belgazal (14):
  net/ena: remove ntuple filter support from device feature list
  net/ena: fix error handling when probe fails
  net/ena: fix queues number calculation
  net/ena: fix ethtool RSS flow configuration
  net/ena: fix RSS default hash configuration
  net/ena: fix NULL dereference when removing the driver after device
reset failed
  net/ena: refactor ena_get_stats64 to be atomic context safe
  net/ena: fix potential access to freed memory during device reset
  net/ena: use napi_complete_done() return value
  net/ena: use READ_ONCE to access completion descriptors
  net/ena: reduce the severity of ena printouts
  net/ena: change driver's default timeouts
  net/ena: change condition for host attribute configuration
  net/ena: update driver version to 1.1.2

 drivers/net/ethernet/amazon/ena/ena_admin_defs.h |  20 ++-
 drivers/net/ethernet/amazon/ena/ena_com.c|  41 ++---
 drivers/net/ethernet/amazon/ena/ena_com.h|   1 +
 drivers/net/ethernet/amazon/ena/ena_eth_com.c|   8 +-
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 186 ---
 drivers/net/ethernet/amazon/ena/ena_netdev.h |   9 +-
 6 files changed, 182 insertions(+), 83 deletions(-)

-- 
2.7.4

[PATCH V3 net-next 04/14] net/ena: fix ethtool RSS flow configuration

2017-01-26 Thread Netanel Belgazal

ena_flow_data_to_flow_hash and ena_flow_hash_to_flow_type
treat the ena_admin_flow_hash_fields enum as power of two values.

Change the values of ena_admin_flow_hash_fields to be power of two values.

This bug affects the ethtool set/get rxnfc operations.
ethtool will report wrong hash field values for get and will
configure wrong hash fields in set.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
index a46e749..e1594d6 100644
--- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
@@ -631,22 +631,22 @@ enum ena_admin_flow_hash_proto {
 /* RSS flow hash fields */
 enum ena_admin_flow_hash_fields {
/* Ethernet Dest Addr */
-   ENA_ADMIN_RSS_L2_DA = 0,
+   ENA_ADMIN_RSS_L2_DA = BIT(0),
 
/* Ethernet Src Addr */
-   ENA_ADMIN_RSS_L2_SA = 1,
+   ENA_ADMIN_RSS_L2_SA = BIT(1),
 
/* ipv4/6 Dest Addr */
-   ENA_ADMIN_RSS_L3_DA = 2,
+   ENA_ADMIN_RSS_L3_DA = BIT(2),
 
/* ipv4/6 Src Addr */
-   ENA_ADMIN_RSS_L3_SA = 5,
+   ENA_ADMIN_RSS_L3_SA = BIT(3),
 
/* tcp/udp Dest Port */
-   ENA_ADMIN_RSS_L4_DP = 6,
+   ENA_ADMIN_RSS_L4_DP = BIT(4),
 
/* tcp/udp Src Port */
-   ENA_ADMIN_RSS_L4_SP = 7,
+   ENA_ADMIN_RSS_L4_SP = BIT(5),
 };
 
 struct ena_admin_proto_input {
-- 
2.7.4
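
To see why the old enum values break when used as a bitmask, the old and new numbering from the diff above can be compared directly. This is a standalone userspace sketch; DEMO_BIT() and the OLD_*/NEW_* names are local to the example, and only the numeric values are taken from the patch.

```c
#define DEMO_BIT(n) (1u << (n))

/* Values before the patch: plain indices, not usable as flags. */
enum demo_fields_old {
	OLD_L2_DA = 0,
	OLD_L2_SA = 1,
	OLD_L3_DA = 2,
	OLD_L3_SA = 5,
	OLD_L4_DP = 6,
	OLD_L4_SP = 7,
};

/* Values after the patch: one distinct bit per field. */
enum demo_fields_new {
	NEW_L2_DA = DEMO_BIT(0),
	NEW_L2_SA = DEMO_BIT(1),
	NEW_L3_DA = DEMO_BIT(2),
	NEW_L3_SA = DEMO_BIT(3),
	NEW_L4_DP = DEMO_BIT(4),
	NEW_L4_SP = DEMO_BIT(5),
};

/* Returns non-zero when the field is present in the mask. */
static int demo_has_field(unsigned int mask, unsigned int field)
{
	return (mask & field) != 0;
}
```

With the old values, OLD_L3_SA | OLD_L3_DA equals 7, which is indistinguishable from OLD_L4_SP and also tests positive for OLD_L2_SA, while OLD_L2_DA (0) can never be detected. With the new values each field occupies its own bit, so masks combine and test cleanly.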

[PATCH V3 net-next 09/14] net/ena: use napi_complete_done() return value

2017-01-26 Thread Netanel Belgazal

Do not unmask interrupts if we are in busy poll mode.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 44 ++--
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 606fb5c..d1e1d9d 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -1122,26 +1122,40 @@ static int ena_io_poll(struct napi_struct *napi, int 
budget)
tx_work_done = ena_clean_tx_irq(tx_ring, tx_budget);
rx_work_done = ena_clean_rx_irq(rx_ring, napi, budget);
 
-   if ((budget > rx_work_done) && (tx_budget > tx_work_done)) {
-   napi_complete_done(napi, rx_work_done);
+   /* If the device is about to reset or down, avoid unmask
+* the interrupt and return 0 so NAPI won't reschedule
+*/
+   if (unlikely(!test_bit(ENA_FLAG_DEV_UP, _ring->adapter->flags) ||
+test_bit(ENA_FLAG_TRIGGER_RESET, 
_ring->adapter->flags))) {
+   napi_complete_done(napi, 0);
+   ret = 0;
 
+   } else if ((budget > rx_work_done) && (tx_budget > tx_work_done)) {
napi_comp_call = 1;
-   /* Tx and Rx share the same interrupt vector */
-   if (ena_com_get_adaptive_moderation_enabled(rx_ring->ena_dev))
-   ena_adjust_intr_moderation(rx_ring, tx_ring);
 
-   /* Update intr register: rx intr delay, tx intr delay and
-* interrupt unmask
+   /* Update numa and unmask the interrupt only when schedule
+* from the interrupt context (vs from sk_busy_loop)
 */
-   ena_com_update_intr_reg(_reg,
-   rx_ring->smoothed_interval,
-   tx_ring->smoothed_interval,
-   true);
+   if (napi_complete_done(napi, rx_work_done)) {
+   /* Tx and Rx share the same interrupt vector */
+   if 
(ena_com_get_adaptive_moderation_enabled(rx_ring->ena_dev))
+   ena_adjust_intr_moderation(rx_ring, tx_ring);
+
+   /* Update intr register: rx intr delay,
+* tx intr delay and interrupt unmask
+*/
+   ena_com_update_intr_reg(_reg,
+   rx_ring->smoothed_interval,
+   tx_ring->smoothed_interval,
+   true);
+
+   /* It is a shared MSI-X.
+* Tx and Rx CQ have pointer to it.
+* So we use one of them to reach the intr reg
+*/
+   ena_com_unmask_intr(rx_ring->ena_com_io_cq, _reg);
+   }
 
-   /* It is a shared MSI-X. Tx and Rx CQ have pointer to it.
-* So we use one of them to reach the intr reg
-*/
-   ena_com_unmask_intr(rx_ring->ena_com_io_cq, _reg);
 
ena_update_ring_numa_node(tx_ring, rx_ring);
 
-- 
2.7.4
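
The control flow this patch introduces can be modelled without the kernel. The sketch below is a simplified model under assumptions: demo_complete_done() only imitates the one aspect of napi_complete_done() that the commit message relies on (it reports false while busy polling, meaning the caller should not re-arm the interrupt), and all demo_* names are hypothetical.

```c
#include <stdbool.h>

struct demo_napi {
	bool busy_polling;	/* set while a busy-poll loop owns the queue */
	bool scheduled;
	int unmask_count;	/* how many times the interrupt was re-armed */
};

/*
 * Simplified stand-in for napi_complete_done(): returns false when the
 * caller must not re-arm the interrupt.
 */
static bool demo_complete_done(struct demo_napi *n, int work_done)
{
	(void)work_done;
	if (n->busy_polling)
		return false;
	n->scheduled = false;
	return true;
}

/* Poll handler: unmask only when all work is done and completion allows it. */
static int demo_poll(struct demo_napi *n, int budget, int work_done)
{
	if (work_done < budget && demo_complete_done(n, work_done))
		n->unmask_count++;	/* stand-in for unmasking the interrupt */

	return work_done;
}
```

In this model the interrupt is re-armed only from the normal interrupt-driven path; a busy-polling caller keeps it masked, which matches the intent stated in the commit message.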

[PATCH V3 net-next 06/14] net/ena: fix NULL dereference when removing the driver after device reset failed

2017-01-26 Thread Netanel Belgazal

If for some reason the device stops responding, and the device reset
fails to recover the device, the mmio register read data structure
will not be reinitialized.

On driver removal, the driver will also try to reset the device, but
this time the mmio data structure will be NULL.

To solve this issue, perform the device reset in the remove function
only if the device is running.

Crash log
[   54.240382] BUG: unable to handle kernel NULL pointer dereference at  
 (null)
[   54.244186] IP: [] ena_com_reg_bar_read32+0x8a/0x180 
[ena_drv]
[   54.244186] PGD 0
[   54.244186] Oops: 0002 [#1] SMP
[   54.244186] Modules linked in: ena_drv(OE-) snd_hda_codec_generic kvm_intel 
kvm crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel aesni_intel 
snd_hda_intel aes_x86_64 snd_hda_controller lrw gf128mul cirrus glue_helper 
ablk_helper ttm snd_hda_codec drm_kms_helper cryptd snd_hwdep drm snd_pcm 
pvpanic snd_timer syscopyarea sysfillrect snd parport_pc sysimgblt serio_raw 
soundcore i2c_piix4 mac_hid lp parport psmouse floppy
[   54.244186] CPU: 5 PID: 1841 Comm: rmmod Tainted: G   OE 
3.16.0-031600-generic #201408031935
[   54.244186] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Bochs 01/01/2011
[   54.244186] task: 880135852880 ti: 8800bb64 task.ti: 
8800bb64
[   54.244186] RIP: 0010:[]  [] 
ena_com_reg_bar_read32+0x8a/0x180 [ena_drv]
[   54.244186] RSP: 0018:8800bb643d50  EFLAGS: 00010083
[   54.244186] RAX: deb0 RBX: 00030d40 RCX: 0003
[   54.244186] RDX: 0202 RSI: 0058 RDI: c9775104
[   54.244186] RBP: 8800bb643d88 R08:  R09: cf00
[   54.244186] R10: 000fffe0 R11: 0001 R12: 
[   54.244186] R13: c9765000 R14: c9775104 R15: 7fca1fa98090
[   54.244186] FS:  7fca1f1bd740() GS:88013fd4() 
knlGS:
[   54.244186] CS:  0010 DS:  ES:  CR0: 80050033
[   54.244186] CR2:  CR3: b9cf6000 CR4: 001406e0
[   54.244186] Stack:
[   54.244186]  0202 00580286 c9765000 
c9765000
[   54.244186]  880135f6b000 8800b936 7fca1fa98090 
8800bb643db8
[   54.244186]  c0680b3d 8800b93608c0 c9765000 
880135f6b000
[   54.244186] Call Trace:
[   54.244186]  [] ena_com_dev_reset+0x1d/0x1b0 [ena_drv]
[   54.244186]  [] ena_remove+0xa7/0x130 [ena_drv]
[   54.244186]  [] pci_device_remove+0x46/0xc0
[   54.244186]  [] __device_release_driver+0x7f/0xf0
[   54.244186]  [] driver_detach+0xc8/0xd0
[   54.244186]  [] bus_remove_driver+0x59/0xd0
[   54.244186]  [] driver_unregister+0x2e/0x60
[   54.244186]  [] ? show_refcnt+0x40/0x40
[   54.244186]  [] pci_unregister_driver+0x23/0xa0
[   54.244186]  [] ena_cleanup+0x10/0xed1 [ena_drv]
[   54.244186]  [] SyS_delete_module+0x157/0x1e0
[   54.244186]  [] ? do_notify_resume+0xc7/0xd0
[   54.244186]  [] system_call_fastpath+0x1a/0x1f
[   54.244186] Code: c3 4d 8d b5 04 01 01 00 4c 89 f7 e8 e1 5a 11 c1 48 89 45 
c8 41 0f b7 85 00 01 01 00 8d 48 01 66 2d 52 21 66 41 89 8d 00 01 01 00 <66> 41 
89 04 24 0f b7 45 d4 89 45 d0 89 c1 41 0f b7 85 00 01 01
[   54.244186] RIP  [] ena_com_reg_bar_read32+0x8a/0x180 
[ena_drv]
[   54.244186]  RSP 
[   54.244186] CR2: 
[   54.244186] ---[ end trace 18dd9889b6497810 ]---

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index f409cfd..639f0aa 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2509,6 +2509,8 @@ static void ena_fw_reset_device(struct work_struct *work)
 err:
rtnl_unlock();
 
+   clear_bit(ENA_FLAG_DEVICE_RUNNING, >flags);
+
dev_err(>dev,
"Reset attempt failed. Can not reset the device\n");
 }
@@ -3118,7 +3120,9 @@ static void ena_remove(struct pci_dev *pdev)
 
cancel_work_sync(>resume_io_task);
 
-   ena_com_dev_reset(ena_dev);
+   /* Reset the device only if the device is running. */
+   if (test_bit(ENA_FLAG_DEVICE_RUNNING, >flags))
+   ena_com_dev_reset(ena_dev);
 
ena_free_mgmnt_irq(adapter);
 
-- 
2.7.4

[PATCH V3 net-next 05/14] net/ena: fix RSS default hash configuration

2017-01-26 Thread Netanel Belgazal

ENA default hash configures IPv4_frag hash twice instead of
configuring non-IP packets.

The bug caused the hash for IPv4 fragmented packets to be calculated
based on L2 source and destination address instead of L3 source and
destination. IPv4 packets can reach the wrong Rx queue.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_com.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 3066d9c..46aad3a 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -2184,7 +2184,7 @@ int ena_com_set_default_hash_ctrl(struct ena_com_dev 
*ena_dev)
hash_ctrl->selected_fields[ENA_ADMIN_RSS_IP4_FRAG].fields =
ENA_ADMIN_RSS_L3_SA | ENA_ADMIN_RSS_L3_DA;
 
-   hash_ctrl->selected_fields[ENA_ADMIN_RSS_IP4_FRAG].fields =
+   hash_ctrl->selected_fields[ENA_ADMIN_RSS_NOT_IP].fields =
ENA_ADMIN_RSS_L2_DA | ENA_ADMIN_RSS_L2_SA;
 
for (i = 0; i < ENA_ADMIN_RSS_PROTO_NUM; i++) {
-- 
2.7.4

[PATCH V3 net-next 08/14] net/ena: fix potential access to freed memory during device reset

2017-01-26 Thread Netanel Belgazal

If the ena driver detects that the device is not behaving as expected,
it tries to reset the device.
The reset flow calls ena_down, which frees all the resources
the driver allocates, and then it resets the device.

This flow can cause memory corruption if the device is still writing
to the driver's memory space.
To overcome this potential race, move the reset before the device
resources are freed.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 56 +---
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index ea3c801..606fb5c 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -80,14 +80,18 @@ static void ena_tx_timeout(struct net_device *dev)
 {
struct ena_adapter *adapter = netdev_priv(dev);
 
+   /* Change the state of the device to trigger reset
+* Check that we are not in the middle or a trigger already
+*/
+
+   if (test_and_set_bit(ENA_FLAG_TRIGGER_RESET, >flags))
+   return;
+
u64_stats_update_begin(>syncp);
adapter->dev_stats.tx_timeout++;
u64_stats_update_end(>syncp);
 
netif_err(adapter, tx_err, dev, "Transmit time out\n");
-
-   /* Change the state of the device to trigger reset */
-   set_bit(ENA_FLAG_TRIGGER_RESET, >flags);
 }
 
 static void update_rx_ring_mtu(struct ena_adapter *adapter, int mtu)
@@ -1109,7 +1113,8 @@ static int ena_io_poll(struct napi_struct *napi, int 
budget)
 
tx_budget = tx_ring->ring_size / ENA_TX_POLL_BUDGET_DIVIDER;
 
-   if (!test_bit(ENA_FLAG_DEV_UP, _ring->adapter->flags)) {
+   if (!test_bit(ENA_FLAG_DEV_UP, _ring->adapter->flags) ||
+   test_bit(ENA_FLAG_TRIGGER_RESET, _ring->adapter->flags)) {
napi_complete_done(napi, 0);
return 0;
}
@@ -1698,12 +1703,22 @@ static void ena_down(struct ena_adapter *adapter)
adapter->dev_stats.interface_down++;
u64_stats_update_end(>syncp);
 
-   /* After this point the napi handler won't enable the tx queue */
-   ena_napi_disable_all(adapter);
netif_carrier_off(adapter->netdev);
netif_tx_disable(adapter->netdev);
 
+   /* After this point the napi handler won't enable the tx queue */
+   ena_napi_disable_all(adapter);
+
/* After destroy the queue there won't be any new interrupts */
+
+   if (test_bit(ENA_FLAG_TRIGGER_RESET, >flags)) {
+   int rc;
+
+   rc = ena_com_dev_reset(adapter->ena_dev);
+   if (rc)
+   dev_err(>pdev->dev, "Device reset failed\n");
+   }
+
ena_destroy_all_io_queues(adapter);
 
ena_disable_io_intr_sync(adapter);
@@ -2065,6 +2080,14 @@ static void ena_netpoll(struct net_device *netdev)
struct ena_adapter *adapter = netdev_priv(netdev);
int i;
 
+   /* Dont schedule NAPI if the driver is in the middle of reset
+* or netdev is down.
+*/
+
+   if (!test_bit(ENA_FLAG_DEV_UP, &adapter->flags) ||
+   test_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))
+   return;
+
for (i = 0; i < adapter->num_queues; i++)
napi_schedule(&adapter->ena_napi[i].napi);
 }
@@ -2451,6 +2474,14 @@ static void ena_fw_reset_device(struct work_struct *work)
bool dev_up, wd_state;
int rc;
 
+   if (unlikely(!test_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))) {
+   dev_err(&pdev->dev,
+   "device reset schedule while reset bit is off\n");
+   return;
+   }
+
+   netif_carrier_off(netdev);
+
del_timer_sync(&adapter->timer_service);
 
rtnl_lock();
@@ -2464,12 +2495,6 @@ static void ena_fw_reset_device(struct work_struct *work)
 */
ena_close(netdev);
 
-   rc = ena_com_dev_reset(ena_dev);
-   if (rc) {
-   dev_err(&pdev->dev, "Device reset failed\n");
-   goto err;
-   }
-
ena_free_mgmnt_irq(adapter);
 
ena_disable_msix(adapter);
@@ -2482,6 +2507,8 @@ static void ena_fw_reset_device(struct work_struct *work)
 
ena_com_mmio_reg_read_request_destroy(ena_dev);
 
+   clear_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags);
+
/* Finish with the destroy part. Start the init part */
 
rc = ena_device_init(ena_dev, adapter->pdev, &get_feat_ctx, &wd_state);
@@ -2547,6 +2574,9 @@ static void check_for_missing_tx_completions(struct 
ena_adapter *adapter)
if (!test_bit(ENA_FLAG_DEV_UP, &adapter->flags))
return;
 
+   if (test_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))
+   return;
+
budget = ENA_MONITORED_TX_QUEUES;
 
for (i = adapter->last_monitored_tx_qid; i < adapter->num_queues; i++) {
@@ -2646,7 +2676,7 @@ static void ena_timer_service(unsigned long data)
if (host_info)

[PATCH V3 net-next 07/14] net/ena: refactor ena_get_stats64 to be atomic context safe

2017-01-26 Thread Netanel Belgazal

ndo_get_stats64() can be called from atomic context, but the current
implementation sends an admin command to retrieve the statistics from
the device. This admin command can sleep.

This patch refactors the implementation of ena_get_stats64() to use
the {rx,tx} bytes/count from the driver's internal counters, and to obtain
the rx drop counter from the asynchronous keep-alive (heartbeat)
event.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_admin_defs.h |  8 
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 57 +---
 drivers/net/ethernet/amazon/ena/ena_netdev.h |  1 +
 3 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
index e1594d6..5b6509d 100644
--- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
@@ -873,6 +873,14 @@ struct ena_admin_aenq_link_change_desc {
u32 flags;
 };
 
+struct ena_admin_aenq_keep_alive_desc {
+   struct ena_admin_aenq_common_desc aenq_common_desc;
+
+   u32 rx_drops_low;
+
+   u32 rx_drops_high;
+};
+
 struct ena_admin_ena_mmio_req_read_less_resp {
u16 req_id;
 
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 639f0aa..ea3c801 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2169,28 +2169,46 @@ static struct rtnl_link_stats64 *ena_get_stats64(struct 
net_device *netdev,
 struct rtnl_link_stats64 
*stats)
 {
struct ena_adapter *adapter = netdev_priv(netdev);
-   struct ena_admin_basic_stats ena_stats;
-   int rc;
+   struct ena_ring *rx_ring, *tx_ring;
+   unsigned int start;
+   u64 rx_drops;
+   int i;
 
if (!test_bit(ENA_FLAG_DEV_UP, &adapter->flags))
return NULL;
 
-   rc = ena_com_get_dev_basic_stats(adapter->ena_dev, &ena_stats);
-   if (rc)
-   return NULL;
+   for (i = 0; i < adapter->num_queues; i++) {
+   u64 bytes, packets;
+
+   tx_ring = &adapter->tx_ring[i];
+
+   do {
+   start = u64_stats_fetch_begin_irq(&tx_ring->syncp);
+   packets = tx_ring->tx_stats.cnt;
+   bytes = tx_ring->tx_stats.bytes;
+   } while (u64_stats_fetch_retry_irq(&tx_ring->syncp, start));
+
+   stats->tx_packets += packets;
+   stats->tx_bytes += bytes;
 
-   stats->tx_bytes = ((u64)ena_stats.tx_bytes_high << 32) |
-   ena_stats.tx_bytes_low;
-   stats->rx_bytes = ((u64)ena_stats.rx_bytes_high << 32) |
-   ena_stats.rx_bytes_low;
+   rx_ring = &adapter->rx_ring[i];
+
+   do {
+   start = u64_stats_fetch_begin_irq(&rx_ring->syncp);
+   packets = rx_ring->rx_stats.cnt;
+   bytes = rx_ring->rx_stats.bytes;
+   } while (u64_stats_fetch_retry_irq(&rx_ring->syncp, start));
 
-   stats->rx_packets = ((u64)ena_stats.rx_pkts_high << 32) |
-   ena_stats.rx_pkts_low;
-   stats->tx_packets = ((u64)ena_stats.tx_pkts_high << 32) |
-   ena_stats.tx_pkts_low;
+   stats->rx_packets += packets;
+   stats->rx_bytes += bytes;
+   }
+
+   do {
+   start = u64_stats_fetch_begin_irq(&adapter->syncp);
+   rx_drops = adapter->dev_stats.rx_drops;
+   } while (u64_stats_fetch_retry_irq(&adapter->syncp, start));
 
-   stats->rx_dropped = ((u64)ena_stats.rx_drops_high << 32) |
-   ena_stats.rx_drops_low;
+   stats->rx_dropped = rx_drops;
 
stats->multicast = 0;
stats->collisions = 0;
@@ -3213,8 +3231,17 @@ static void ena_keep_alive_wd(void *adapter_data,
  struct ena_admin_aenq_entry *aenq_e)
 {
struct ena_adapter *adapter = (struct ena_adapter *)adapter_data;
+   struct ena_admin_aenq_keep_alive_desc *desc;
+   u64 rx_drops;
 
+   desc = (struct ena_admin_aenq_keep_alive_desc *)aenq_e;
adapter->last_keep_alive_jiffies = jiffies;
+
+   rx_drops = ((u64)desc->rx_drops_high << 32) | desc->rx_drops_low;
+
+   u64_stats_update_begin(&adapter->syncp);
+   adapter->dev_stats.rx_drops = rx_drops;
+   u64_stats_update_end(&adapter->syncp);
 }
 
 static void ena_notification(void *adapter_data,
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 69d7e9e..f0ddc11 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -241,6 +241,7 @@ struct ena_stats_dev {
u64 interface_up;
u64 interface_down;
u64 admin_q_pause;
+   u64 rx_drops;
 };
 
 enum ena_flags_t {
-- 
2.7.4

[PATCH V3 net-next 10/14] net/ena: use READ_ONCE to access completion descriptors

2017-01-26 Thread Netanel Belgazal

Completion descriptors are accessed from the driver and from the device.
To avoid reading a stale value, use the READ_ONCE macro.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_com.h | 1 +
 drivers/net/ethernet/amazon/ena/ena_eth_com.c | 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.h 
b/drivers/net/ethernet/amazon/ena/ena_com.h
index 509d7b8..c9b33ee 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_com.h
@@ -33,6 +33,7 @@
 #ifndef ENA_COM
 #define ENA_COM
 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.c 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
index 539c536..f999305 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
@@ -45,7 +45,7 @@ static inline struct ena_eth_io_rx_cdesc_base 
*ena_com_get_next_rx_cdesc(
cdesc = (struct ena_eth_io_rx_cdesc_base *)(io_cq->cdesc_addr.virt_addr
+ (head_masked * io_cq->cdesc_entry_size_in_bytes));
 
-   desc_phase = (cdesc->status & ENA_ETH_IO_RX_CDESC_BASE_PHASE_MASK) >>
+   desc_phase = (READ_ONCE(cdesc->status) & 
ENA_ETH_IO_RX_CDESC_BASE_PHASE_MASK) >>
ENA_ETH_IO_RX_CDESC_BASE_PHASE_SHIFT;
 
if (desc_phase != expected_phase)
@@ -141,7 +141,7 @@ static inline u16 ena_com_cdesc_rx_pkt_get(struct 
ena_com_io_cq *io_cq,
 
ena_com_cq_inc_head(io_cq);
count++;
-   last = (cdesc->status & ENA_ETH_IO_RX_CDESC_BASE_LAST_MASK) >>
+   last = (READ_ONCE(cdesc->status) & 
ENA_ETH_IO_RX_CDESC_BASE_LAST_MASK) >>
ENA_ETH_IO_RX_CDESC_BASE_LAST_SHIFT;
} while (!last);
 
@@ -489,13 +489,13 @@ int ena_com_tx_comp_req_id_get(struct ena_com_io_cq 
*io_cq, u16 *req_id)
 * expected, it mean that the device still didn't update
 * this completion.
 */
-   cdesc_phase = cdesc->flags & ENA_ETH_IO_TX_CDESC_PHASE_MASK;
+   cdesc_phase = READ_ONCE(cdesc->flags) & ENA_ETH_IO_TX_CDESC_PHASE_MASK;
if (cdesc_phase != expected_phase)
return -EAGAIN;
 
ena_com_cq_inc_head(io_cq);
 
-   *req_id = cdesc->req_id;
+   *req_id = READ_ONCE(cdesc->req_id);
 
return 0;
 }
-- 
2.7.4

[PATCH V3 net-next 12/14] net/ena: change driver's default timeouts

2017-01-26 Thread Netanel Belgazal

The timeouts were too aggressive and sometimes caused false alarms.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_com.c| 4 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 5518b1f..8029e7c 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -36,9 +36,9 @@
 /*/
 
 /* Timeout in micro-sec */
-#define ADMIN_CMD_TIMEOUT_US (100)
+#define ADMIN_CMD_TIMEOUT_US (300)
 
-#define ENA_ASYNC_QUEUE_DEPTH 4
+#define ENA_ASYNC_QUEUE_DEPTH 16
 #define ENA_ADMIN_QUEUE_DEPTH 32
 
 #define MIN_ENA_VER (((ENA_COMMON_SPEC_VERSION_MAJOR) << \
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index f0ddc11..efe0ea1 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -100,7 +100,7 @@
 /* Number of queues to check for missing queues per timer service */
 #define ENA_MONITORED_TX_QUEUES4
 /* Max timeout packets before device reset */
-#define MAX_NUM_OF_TIMEOUTED_PACKETS 32
+#define MAX_NUM_OF_TIMEOUTED_PACKETS 128
 
 #define ENA_TX_RING_IDX_NEXT(idx, ring_size) (((idx) + 1) & ((ring_size) - 1))
 
@@ -116,9 +116,9 @@
 #define ENA_IO_IRQ_IDX(q)  (ENA_IO_IRQ_FIRST_IDX + (q))
 
 /* ENA device should send keep alive msg every 1 sec.
- * We wait for 3 sec just to be on the safe side.
+ * We wait for 6 sec just to be on the safe side.
  */
-#define ENA_DEVICE_KALIVE_TIMEOUT  (3 * HZ)
+#define ENA_DEVICE_KALIVE_TIMEOUT  (6 * HZ)
 
 #define ENA_MMIO_DISABLE_REG_READ  BIT(0)
 
-- 
2.7.4
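
The diff above only changes constants; how the driver compares them against
the clock is not shown here. As a sketch under that caveat, a keep-alive
deadline on a free-running tick counter could be checked as below.
TICKS_PER_SEC, KALIVE_TIMEOUT and keep_alive_expired() are hypothetical
names; the signed-difference comparison is the same trick the kernel's
time_after() uses.

```c
#include <assert.h>
#include <stdbool.h>

#define TICKS_PER_SEC	250UL			/* hypothetical stand-in for HZ */
#define KALIVE_TIMEOUT	(6 * TICKS_PER_SEC)	/* mirrors (6 * HZ) in the patch */

/* Returns true once 'now' is strictly past last_keep_alive + timeout.
 * Taking the difference as a signed value keeps the result correct
 * when the tick counter wraps around.
 */
static bool keep_alive_expired(unsigned long now, unsigned long last_keep_alive)
{
	unsigned long deadline = last_keep_alive + KALIVE_TIMEOUT;

	return (long)(deadline - now) < 0;
}
```

With a device that sends one keep-alive per second, a 6 second window
tolerates several missed events before the watchdog reports a failure.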

[PATCH V3 net-next 14/14] net/ena: update driver version to 1.1.2

2017-01-26 Thread Netanel Belgazal

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index efe0ea1..ed62d8e 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -44,7 +44,7 @@
 #include "ena_eth_com.h"
 
 #define DRV_MODULE_VER_MAJOR   1
-#define DRV_MODULE_VER_MINOR   0
+#define DRV_MODULE_VER_MINOR   1
 #define DRV_MODULE_VER_SUBMINOR 2
 
 #define DRV_MODULE_NAME"ena"
-- 
2.7.4

[PATCH V3 net-next 11/14] net/ena: reduce the severity of ena printouts

2017-01-26 Thread Netanel Belgazal

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_com.c| 27 +--
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 14 +++---
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 46aad3a..5518b1f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -784,7 +784,7 @@ static int ena_com_get_feature_ex(struct ena_com_dev 
*ena_dev,
int ret;
 
if (!ena_com_check_supported_feature_id(ena_dev, feature_id)) {
-   pr_info("Feature %d isn't supported\n", feature_id);
+   pr_debug("Feature %d isn't supported\n", feature_id);
return -EPERM;
}
 
@@ -1126,7 +1126,13 @@ int ena_com_execute_admin_command(struct 
ena_com_admin_queue *admin_queue,
comp_ctx = ena_com_submit_admin_cmd(admin_queue, cmd, cmd_size,
comp, comp_size);
if (unlikely(IS_ERR(comp_ctx))) {
-   pr_err("Failed to submit command [%ld]\n", PTR_ERR(comp_ctx));
+   if (comp_ctx == ERR_PTR(-ENODEV))
+   pr_debug("Failed to submit command [%ld]\n",
+PTR_ERR(comp_ctx));
+   else
+   pr_err("Failed to submit command [%ld]\n",
+  PTR_ERR(comp_ctx));
+
return PTR_ERR(comp_ctx);
}
 
@@ -1895,7 +1901,7 @@ int ena_com_set_dev_mtu(struct ena_com_dev *ena_dev, int 
mtu)
int ret;
 
if (!ena_com_check_supported_feature_id(ena_dev, ENA_ADMIN_MTU)) {
-   pr_info("Feature %d isn't supported\n", ENA_ADMIN_MTU);
+   pr_debug("Feature %d isn't supported\n", ENA_ADMIN_MTU);
return -EPERM;
}
 
@@ -1948,8 +1954,8 @@ int ena_com_set_hash_function(struct ena_com_dev *ena_dev)
 
if (!ena_com_check_supported_feature_id(ena_dev,
ENA_ADMIN_RSS_HASH_FUNCTION)) {
-   pr_info("Feature %d isn't supported\n",
-   ENA_ADMIN_RSS_HASH_FUNCTION);
+   pr_debug("Feature %d isn't supported\n",
+ENA_ADMIN_RSS_HASH_FUNCTION);
return -EPERM;
}
 
@@ -2112,7 +2118,8 @@ int ena_com_set_hash_ctrl(struct ena_com_dev *ena_dev)
 
if (!ena_com_check_supported_feature_id(ena_dev,
ENA_ADMIN_RSS_HASH_INPUT)) {
-   pr_info("Feature %d isn't supported\n", 
ENA_ADMIN_RSS_HASH_INPUT);
+   pr_debug("Feature %d isn't supported\n",
+ENA_ADMIN_RSS_HASH_INPUT);
return -EPERM;
}
 
@@ -2270,8 +2277,8 @@ int ena_com_indirect_table_set(struct ena_com_dev 
*ena_dev)
 
if (!ena_com_check_supported_feature_id(
ena_dev, ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG)) {
-   pr_info("Feature %d isn't supported\n",
-   ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG);
+   pr_debug("Feature %d isn't supported\n",
+ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG);
return -EPERM;
}
 
@@ -2542,8 +2549,8 @@ int ena_com_init_interrupt_moderation(struct ena_com_dev 
*ena_dev)
 
if (rc) {
if (rc == -EPERM) {
-   pr_info("Feature %d isn't supported\n",
-   ENA_ADMIN_INTERRUPT_MODERATION);
+   pr_debug("Feature %d isn't supported\n",
+ENA_ADMIN_INTERRUPT_MODERATION);
rc = 0;
} else {
pr_err("Failed to get interrupt moderation admin cmd. 
rc: %d\n",
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index d1e1d9d..96048bd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -563,6 +563,7 @@ static void ena_free_all_rx_bufs(struct ena_adapter 
*adapter)
  */
 static void ena_free_tx_bufs(struct ena_ring *tx_ring)
 {
+   bool print_once = true;
u32 i;
 
for (i = 0; i < tx_ring->ring_size; i++) {
@@ -574,9 +575,16 @@ static void ena_free_tx_bufs(struct ena_ring *tx_ring)
if (!tx_info->skb)
continue;
 
-   netdev_notice(tx_ring->netdev,
- "free uncompleted tx skb qid %d idx 0x%x\n",
- tx_ring->qid, i);
+   if (print_once) {
+   netdev_notice(tx_ring->netdev,
+ "free uncompleted tx skb qid %d idx 
0x%x\n",
+ tx_ring->qid, i);
+   print_once = false;
+

[PATCH V3 net-next 13/14] net/ena: change condition for host attribute configuration

2017-01-26 Thread Netanel Belgazal

Move the host info config to be the first admin command that is executed.
This change requires the driver to remove the 'feature check'
from the host info configuration flow.
The check is removed since the supported features bitmask field
is retrieved only after calling the ENA_ADMIN_DEVICE_ATTRIBUTES admin command.

If set host info is not supported, an error will be returned by the device.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/ena/ena_com.c| 8 +++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 +++--
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 8029e7c..08d11ce 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -2451,11 +2451,9 @@ int ena_com_set_host_attributes(struct ena_com_dev 
*ena_dev)
 
int ret;
 
-   if (!ena_com_check_supported_feature_id(ena_dev,
-   ENA_ADMIN_HOST_ATTR_CONFIG)) {
-   pr_warn("Set host attribute isn't supported\n");
-   return -EPERM;
-   }
+   /* Host attribute config is called before ena_com_get_dev_attr_feat
+* so ena_com can't check if the feature is supported.
+*/
 
memset(&cmd, 0x0, sizeof(cmd));
admin_queue = &ena_dev->admin_queue;
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 96048bd..7b9c80f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2416,6 +2416,8 @@ static int ena_device_init(struct ena_com_dev *ena_dev, 
struct pci_dev *pdev,
 */
ena_com_set_admin_polling_mode(ena_dev, true);
 
+   ena_config_host_info(ena_dev);
+
/* Get Device Attributes*/
rc = ena_com_get_dev_attr_feat(ena_dev, get_feat_ctx);
if (rc) {
@@ -2440,11 +2442,10 @@ static int ena_device_init(struct ena_com_dev *ena_dev, 
struct pci_dev *pdev,
 
*wd_state = !!(aenq_groups & BIT(ENA_ADMIN_KEEP_ALIVE));
 
-   ena_config_host_info(ena_dev);
-
return 0;
 
 err_admin_init:
+   ena_com_delete_host_info(ena_dev);
ena_com_admin_destroy(ena_dev);
 err_mmio_read_less:
ena_com_mmio_reg_read_request_destroy(ena_dev);
-- 
2.7.4

[PATCH net-next] net: ipv6: remove skb_reserve in getroute

2017-01-26 Thread David Ahern

Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
allocated skb is not passed through the routing engine (like it is for
IPv4) and has not been since the beginning of git time.

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 47499ed429da..5046d2b24004 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3420,12 +3420,6 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh)
goto errout;
}
 
-   /* Reserve room for dummy headers, this skb can pass
-  through good chunk of routing engine.
-*/
-   skb_reset_mac_header(skb);
-   skb_reserve(skb, MAX_HEADER + sizeof(struct ipv6hdr));
-
skb_dst_set(skb, &rt->dst);
 
err = rt6_fill_node(net, skb, rt, &fl6.daddr, &fl6.saddr, iif,
-- 
2.1.4

Re: TCP stops sending packets over loopback on 4.10-rc3?

2017-01-26 Thread Josef Bacik

On Wed, 2017-01-25 at 06:39 -0800, Eric Dumazet wrote:
> On Wed, 2017-01-25 at 09:26 -0500, Josef Bacik wrote:
> 
> > 
> > Nope ftrace isn't broken, I'm just dumb, the space is being
> > reclaimed 
> > by sk_wmem_free_skb().  So I guess I need to figure out why I stop 
> > getting ACK's from the other side of the loopback.  Thanks,
> ss -temoi dst 127.0.0.1
> 
> Might give you some hints, like packets being dropped.
> 
> ACK can be delayed if the reader is slow to consume bytes.
> 

Yup looks like I'm getting packet loss for some reason, but the
application is sitting there in recvmsg, so it's not hung and
definitely available for receiving new packets.

ESTAB  0  4124232  
  ::1:34044
   
::1:nbd   t
imer:(on,1min38sec,9) ino:20067 sk:8 <->
 skmem:(r0,rb6291456,t0,tb4194304,f1720,w4204872,o0,bl0) ts
sack cubic wscale:7,7 rto:102912 backoff:9 rtt:0.084/0.038 ato:40
mss:65464 cwnd:1 ssthresh:18 bytes_acked:71964077253
bytes_received:68804409996 segs_out:3882829 segs_in:4092731 send
6234.7Mbps lastsnd:4336 lastrcv:111289 lastack:111299 unacked:28
retrans:1/4277 lost:28 reordering:60 rcv_rtt:1.875 rcv_space:1315136

ESTAB  0  0
 ::1:nbd   
   
   ::1:34044 ti
mer:(keepalive,109min,0) ino:19396 sk:2 <->
 skmem:(r0,rb6291456,t0,tb2626560,f0,w0,o0,bl0) ts sack cubic
wscale:7,7 rto:201 rtt:0.279/0.16 ato:40 mss:65464 cwnd:16 ssthresh:9
bytes_acked:68804409996 bytes_received:71964077252 segs_out:4092730
segs_in:3882792 send 30033.7Mbps lastsnd:111286 lastrcv:111307
lastack:111286 retrans:0/3113 reordering:26 rcv_rtt:1 rcv_space:4782816

I traced tcp_enter_loss() and once things stop moving that starts
firing.  That's all I have so far, been busy with other things but I'm
devoting my full attention to this now.  Thanks,

Josef

[PATCH net] tcp: don't annotate mark on control socket from tcp_v6_send_response()

2017-01-26 Thread Pablo Neira Ayuso

Unlike ipv4, this control socket is shared by all cpus so we cannot use
it as scratchpad area to annotate the mark that we pass to ip6_xmit().

Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
family caches the flowi6 structure in the sctp_transport structure, so
we cannot use it to carry the mark unless we later on reset it back, which
I discarded since it looks ugly to me.

Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled")
Suggested-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
Tested with nc6, works for me.

Note: DCCP and SCTP don't seem to support fwmark_reflect yet, so leave
  these bits.

Thanks!

 include/net/ipv6.h   | 2 +-
 net/dccp/ipv6.c  | 4 ++--
 net/ipv6/inet6_connection_sock.c | 2 +-
 net/ipv6/ip6_output.c| 4 ++--
 net/ipv6/tcp_ipv6.c  | 5 ++---
 net/sctp/ipv6.c  | 3 ++-
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 487e57391664..7afe991e900e 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -871,7 +871,7 @@ int ip6_rcv_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb);
  * upper-layer output functions
  */
 int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
-struct ipv6_txoptions *opt, int tclass);
+__u32 mark, struct ipv6_txoptions *opt, int tclass);
 
 int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr);
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index adfc790f7193..c4e879c02186 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -227,7 +227,7 @@ static int dccp_v6_send_response(const struct sock *sk, 
struct request_sock *req
opt = ireq->ipv6_opt;
if (!opt)
opt = rcu_dereference(np->opt);
-   err = ip6_xmit(sk, skb, &fl6, opt, np->tclass);
+   err = ip6_xmit(sk, skb, &fl6, sk->sk_mark, opt, np->tclass);
rcu_read_unlock();
err = net_xmit_eval(err);
}
@@ -281,7 +281,7 @@ static void dccp_v6_ctl_send_reset(const struct sock *sk, 
struct sk_buff *rxskb)
dst = ip6_dst_lookup_flow(ctl_sk, &fl6, NULL);
if (!IS_ERR(dst)) {
skb_dst_set(skb, dst);
-   ip6_xmit(ctl_sk, skb, &fl6, NULL, 0);
+   ip6_xmit(ctl_sk, skb, &fl6, 0, NULL, 0);
DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
return;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 7396e75e161b..75c308239243 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -176,7 +176,7 @@ int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, 
struct flowi *fl_unused
/* Restore final destination back after routing done */
fl6.daddr = sk->sk_v6_daddr;
 
-   res = ip6_xmit(sk, skb, &fl6, rcu_dereference(np->opt),
+   res = ip6_xmit(sk, skb, &fl6, sk->sk_mark, rcu_dereference(np->opt),
   np->tclass);
rcu_read_unlock();
return res;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 38122d04fadc..2c0df09e9036 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -172,7 +172,7 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
  * which are using proper atomic operations or spinlocks.
  */
 int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
-struct ipv6_txoptions *opt, int tclass)
+__u32 mark, struct ipv6_txoptions *opt, int tclass)
 {
struct net *net = sock_net(sk);
const struct ipv6_pinfo *np = inet6_sk(sk);
@@ -240,7 +240,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, 
struct flowi6 *fl6,
 
skb->protocol = htons(ETH_P_IPV6);
skb->priority = sk->sk_priority;
-   skb->mark = sk->sk_mark;
+   skb->mark = mark;
 
mtu = dst_mtu(dst);
if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2b20622a5824..cb8929681dc7 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -469,7 +469,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct 
dst_entry *dst,
opt = ireq->ipv6_opt;
if (!opt)
opt = rcu_dereference(np->opt);
-   err = ip6_xmit(sk, skb, fl6, opt, np->tclass);
+   err = ip6_xmit(sk, skb, fl6, sk->sk_mark, opt, np->tclass);
rcu_read_unlock();
err = net_xmit_eval(err);
}
@@ -840,8 +840,7 @@ static void tcp_v6_send_response(const struct sock *sk, 
struct sk_buff *skb, u32
dst = ip6_dst_lookup_flow(ctl_sk, &fl6, NULL);
if (!IS_ERR(dst)) {
skb_dst_set(buff, dst);
-   ctl_sk->sk_mark =

[PATCH net-next v2] net: ipv6: ignore null_entry on route dumps

2017-01-26 Thread David Ahern

lkp-robot reported a BUG:
[   10.151226] BUG: unable to handle kernel NULL pointer dereference at 0198
[   10.152525] IP: rt6_fill_node+0x164/0x4b8
[   10.153307] *pdpt = 12ee5001 *pde = 
[   10.153309]
[   10.154492] Oops:  [#1]
[   10.154987] CPU: 0 PID: 909 Comm: netifd Not tainted 
4.10.0-rc4-00722-g41e8c70ee162-dirty #10
[   10.156482] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[   10.158254] task: d0deb000 task.stack: d0e0c000
[   10.159059] EIP: rt6_fill_node+0x164/0x4b8
[   10.159780] EFLAGS: 00010296 CPU: 0
[   10.160404] EAX:  EBX: d10c2358 ECX: c1f7c6cc EDX: c1f6ff44
[   10.161469] ESI:  EDI: c2059900 EBP: d0e0dc4c ESP: d0e0dbe4
[   10.162534]  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
[   10.163482] CR0: 80050033 CR2: 0198 CR3: 10d94660 CR4: 06b0
[   10.164535] Call Trace:
[   10.164993]  ? paravirt_sched_clock+0x9/0xd
[   10.165727]  ? sched_clock+0x9/0xc
[   10.166329]  ? sched_clock_cpu+0x19/0xe9
[   10.166991]  ? lock_release+0x13e/0x36c
[   10.167652]  rt6_dump_route+0x4c/0x56
[   10.168276]  fib6_dump_node+0x1d/0x3d
[   10.168913]  fib6_walk_continue+0xab/0x167
[   10.169611]  fib6_walk+0x2a/0x40
[   10.170182]  inet6_dump_fib+0xfb/0x1e0
[   10.170855]  netlink_dump+0xcd/0x21f

This happens when the loopback device is set down and an ipv6 fib route
dump is requested.

ip6_null_entry is the root of all ipv6 fib tables making it integrated
into the table and hence passed to the ipv6 route dump code. The
null_entry route uses the loopback device for dst.dev but may not have
rt6i_idev set because of the order in which initializations are done --
ip6_route_net_init is run before addrconf_init has initialized the
loopback device. Fixing the initialization order is a much bigger problem
with no obvious solution thus far.

The BUG is triggered when the loopback is set down and the netif_running
check added by a1a22c1206 fails. The fill_node descends to checking
rt->rt6i_idev for ignore_routes_with_linkdown and since rt6i_idev is
NULL it faults.

The null_entry route should not be processed in a dump request. Catch
and ignore. This check is done in rt6_dump_route as it is the highest
place in the callchain with knowledge of both the route and the network
namespace.

Fixes: a1a22c1206("net: ipv6: Keep nexthop of multipath route on admin down")
Signed-off-by: David Ahern 
---
v2
- updated commit message; no code change

Dave: per last email you suggested putting this in fib6_dump_node, but
  fib6_dump_node does not have knowledge of the network namespace
  because of the way the fib_walker works. I could put
  struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) w->args;
  into fib6_dump_node to get the net pointer from it but did not
  want to duplicate the typecast.

 net/ipv6/route.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4b1f0f98a0e9..47499ed429da 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3320,6 +3320,10 @@ static int rt6_fill_node(struct net *net,
 int rt6_dump_route(struct rt6_info *rt, void *p_arg)
 {
struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
+   struct net *net = arg->net;
+
+   if (rt == net->ipv6.ip6_null_entry)
+   return 0;
 
if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) {
struct rtmsg *rtm = nlmsg_data(arg->cb->nlh);
@@ -3332,7 +3336,7 @@ int rt6_dump_route(struct rt6_info *rt, void *p_arg)
}
}
 
-   return rt6_fill_node(arg->net,
+   return rt6_fill_node(net,
 arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE,
 NETLINK_CB(arg->cb->skb).portid, arg->cb->nlh->nlmsg_seq,
 NLM_F_MULTI);
-- 
2.1.4
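
The fix above amounts to filtering a sentinel out of a walker callback
before any of its fields are touched. A small userspace sketch of that
pattern follows; struct route, struct dump_ctx and dump_route() are
hypothetical and only mimic the shape of rt6_dump_route().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical route entry; the real one is struct rt6_info. */
struct route {
	int id;
};

/* Hypothetical per-dump state, playing the role of the dump argument. */
struct dump_ctx {
	const struct route *null_entry;	/* sentinel root, never reported */
	int emitted;			/* number of routes written out */
};

/* Walker callback: skip the sentinel by pointer identity and report
 * success, so the walk continues with the next entry.
 */
static int dump_route(const struct route *rt, struct dump_ctx *ctx)
{
	assert(rt != NULL && ctx != NULL);

	if (rt == ctx->null_entry)
		return 0;

	ctx->emitted++;	/* stands in for filling the netlink message */
	return 0;
}
```

Comparing pointers rather than inspecting the entry is what avoids the
fault: the sentinel is rejected before any possibly NULL member is read.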

Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver

2017-01-26 Thread Sowmini Varadhan

On (01/26/17 15:21), Willem de Bruijn wrote:
> > If the application has provided fewer than hard_header_len bytes,
> > dev_validate_header() will zero out the skb->data as needed. This is
> > acceptable for SOCK_DGRAM/PF_PACKET sockets but in all other cases,
> 
> This was added not for datagram sockets, but to be able to bypass
> validation. See the message in commit 2793a23aacbd ("net: validate
> variable length ll header") and discussion leading up to that patch.

Some context: I got into this patch as a result of the comments in
 https://www.mail-archive.com/netdev@vger.kernel.org/msg149031.html

> As David pointed out, this does not handle variable length headers
> correctly. In link layers that support these, hard_header_len defines
> the maximum header length. A hard failure on len < hard_header_len
> would be incorrect.

right, since DaveM's comments, I took a look at the drivers
that have a ->validate - afaict (from cscope) ax25 is the only
in-kernel driver that actually passes a ->validate pointer.
I tried patching ax25 here:
  http://marc.info/?l=linux-hams&m=148537926422828&w=2
Still waiting to hear back from that list (which doesn't seem to have
much traffic so maybe I should time out on it). Does that
patch make better sense? (I'll look up the comments leading up
to 2793a23aacbd later tonight)

> The ->validate callback was added to allow specifying additional
> constraints on a per protocol basis. This is where a min constraint
> can be added, e.g., for ethernet.
> 
> > -   if (!dev_validate_header(dev, skb->data, len)) {
> > +   newlen = dev_validate_header(dev, skb->data, len);
> > +   /* As comments above this function indicate, a full L2 header
> > +* must be passed to this function, so if newlen > len, bail.
> > +*/
> > +   if (newlen < 0 || newlen > len) {
> 
> If callers only care whether the function returned failure or
> increased len, which also indicates failure, it is cleaner to leave it
> a boolean and fail in cases where len < the minimum for that link
> layer type. No caller actually uses newlen.
> 
> > +   /* Caller has allocated for copylen in non-paged part of
> > +* skb so we should never find newlen > hdrlen
> > +*/
> > +   WARN_ON(newlen > hdrlen);
> 
> WARN_ON_ONCE is safer.

Ok that's easy enough to do.

Re: netvsc NAPI patch process

2017-01-26 Thread David Miller

From: KY Srinivasan 
Date: Thu, 26 Jan 2017 20:46:40 +

> In the past, we have done this in two stages - get the supporting
> vmbus patches into Greg's tree first and in the next merge cycle get
> the netvsc patches in. Why not continue to do what we have done in
> the past to address cross-tree dependencies.

It takes too damn long.

[PATCHv3 perf/core 1/6] tools lib bpf: Add BPF program pinning APIs.

2017-01-26 Thread Joe Stringer

Add new APIs to pin a BPF program (or specific instances) to the filesystem.
The user can specify the full path within a BPF filesystem at which to pin
the program.

bpf_program__pin_instance(prog, path, n) will pin the nth instance of
'prog' to the specified path.
bpf_program__pin(prog, path) will create the directory 'path' (if it
does not exist) and pin each instance within that directory. For
instance, path/0, path/1, path/2.

Signed-off-by: Joe Stringer 
---
v3: Add per-instance pinning.
Use path for bpf_program__pin() as directory.
v2: Don't automount BPF filesystem
Split program, map, object pinning into separate APIs and separate
patches.
---
 tools/lib/bpf/libbpf.c | 112 +
 tools/lib/bpf/libbpf.h |   3 ++
 2 files changed, 115 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e6cd62b1264b..d1d7638b7c21 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4,6 +4,7 @@
  * Copyright (C) 2013-2015 Alexei Starovoitov 
  * Copyright (C) 2015 Wang Nan 
  * Copyright (C) 2015 Huawei Inc.
+ * Copyright (C) 2017 Nicira, Inc.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -31,7 +33,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -1237,6 +1242,113 @@ int bpf_object__load(struct bpf_object *obj)
return err;
 }
 
+static int check_path(const char *path)
+{
+   struct statfs st_fs;
+   char *dname, *dir;
+   int err = 0;
+
+   if (path == NULL)
+   return -EINVAL;
+
+   dname = strdup(path);
+   dir = dirname(dname);
+   if (statfs(dir, &st_fs)) {
+   pr_warning("failed to statfs %s: %s\n", dir, strerror(errno));
+   err = -errno;
+   }
+   free(dname);
+
+   if (!err && st_fs.f_type != BPF_FS_MAGIC) {
+   pr_warning("specified path %s is not on BPF FS\n", path);
+   err = -EINVAL;
+   }
+
+   return err;
+}
+
+int bpf_program__pin_instance(struct bpf_program *prog, const char *path,
+ int instance)
+{
+   int err;
+
+   err = check_path(path);
+   if (err)
+   return err;
+
+   if (prog == NULL) {
+   pr_warning("invalid program pointer\n");
+   return -EINVAL;
+   }
+
+   if (instance < 0 || instance >= prog->instances.nr) {
+   pr_warning("invalid prog instance %d of prog %s (max %d)\n",
+  instance, prog->section_name, prog->instances.nr);
+   return -EINVAL;
+   }
+
+   if (bpf_obj_pin(prog->instances.fds[instance], path)) {
+   pr_warning("failed to pin program: %s\n", strerror(errno));
+   return -errno;
+   }
+   pr_debug("pinned program '%s'\n", path);
+
+   return 0;
+}
+
+static int make_dir(const char *path)
+{
+   int err = 0;
+
+   if (mkdir(path, 0700) && errno != EEXIST)
+   err = -errno;
+
+   if (err)
+   pr_warning("failed to mkdir %s: %s\n", path, strerror(-err));
+   return err;
+}
+
+int bpf_program__pin(struct bpf_program *prog, const char *path)
+{
+   int i, err;
+
+   err = check_path(path);
+   if (err)
+   return err;
+
+   if (prog == NULL) {
+   pr_warning("invalid program pointer\n");
+   return -EINVAL;
+   }
+
+   if (prog->instances.nr <= 0) {
+   pr_warning("no instances of prog %s to pin\n",
+  prog->section_name);
+   return -EINVAL;
+   }
+
+   err = make_dir(path);
+   if (err)
+   return err;
+
+   for (i = 0; i < prog->instances.nr; i++) {
+   char buf[PATH_MAX];
+   int len;
+
+   len = snprintf(buf, PATH_MAX, "%s/%d", path, i);
+   if (len < 0)
+   return -EINVAL;
+   else if (len > PATH_MAX)
+   return -ENAMETOOLONG;
+
+   err = bpf_program__pin_instance(prog, buf, i);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
 void bpf_object__close(struct bpf_object *obj)
 {
size_t i;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 4014d1ba5e3d..9f8aa63b95f4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -106,6 +106,9 @@ void *bpf_program__priv(struct bpf_program *prog);
 const char *bpf_program__title(struct bpf_program *prog, bool needs_copy);
 
 int bpf_program__fd(struct bpf_program *prog);
+int bpf_program__pin_instance(struct bpf_program *prog, const char *path,
+ int instance);
+int

[PATCH] net: intel: e1000e: use new api ethtool_{get|set}_link_ksettings

2017-01-26 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/intel/e1000e/ethtool.c |   91 ++
 1 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 7aff68a..3768a5c 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -117,15 +117,15 @@ struct e1000_stats {
 
 #define E1000_TEST_LEN ARRAY_SIZE(e1000_gstrings_test)
 
-static int e1000_get_settings(struct net_device *netdev,
- struct ethtool_cmd *ecmd)
+static int e1000_get_link_ksettings(struct net_device *netdev,
+   struct ethtool_link_ksettings *cmd)
 {
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
-   u32 speed;
+   u32 speed, supported, advertising;
 
if (hw->phy.media_type == e1000_media_type_copper) {
-   ecmd->supported = (SUPPORTED_10baseT_Half |
+   supported = (SUPPORTED_10baseT_Half |
   SUPPORTED_10baseT_Full |
   SUPPORTED_100baseT_Half |
   SUPPORTED_100baseT_Full |
@@ -133,39 +133,36 @@ static int e1000_get_settings(struct net_device *netdev,
   SUPPORTED_Autoneg |
   SUPPORTED_TP);
if (hw->phy.type == e1000_phy_ife)
-   ecmd->supported &= ~SUPPORTED_1000baseT_Full;
-   ecmd->advertising = ADVERTISED_TP;
+   supported &= ~SUPPORTED_1000baseT_Full;
+   advertising = ADVERTISED_TP;
 
if (hw->mac.autoneg == 1) {
-   ecmd->advertising |= ADVERTISED_Autoneg;
+   advertising |= ADVERTISED_Autoneg;
/* the e1000 autoneg seems to match ethtool nicely */
-   ecmd->advertising |= hw->phy.autoneg_advertised;
+   advertising |= hw->phy.autoneg_advertised;
}
 
-   ecmd->port = PORT_TP;
-   ecmd->phy_address = hw->phy.addr;
-   ecmd->transceiver = XCVR_INTERNAL;
-
+   cmd->base.port = PORT_TP;
+   cmd->base.phy_address = hw->phy.addr;
} else {
-   ecmd->supported   = (SUPPORTED_1000baseT_Full |
+   supported   = (SUPPORTED_1000baseT_Full |
 SUPPORTED_FIBRE |
 SUPPORTED_Autoneg);
 
-   ecmd->advertising = (ADVERTISED_1000baseT_Full |
+   advertising = (ADVERTISED_1000baseT_Full |
 ADVERTISED_FIBRE |
 ADVERTISED_Autoneg);
 
-   ecmd->port = PORT_FIBRE;
-   ecmd->transceiver = XCVR_EXTERNAL;
+   cmd->base.port = PORT_FIBRE;
}
 
speed = SPEED_UNKNOWN;
-   ecmd->duplex = DUPLEX_UNKNOWN;
+   cmd->base.duplex = DUPLEX_UNKNOWN;
 
if (netif_running(netdev)) {
if (netif_carrier_ok(netdev)) {
speed = adapter->link_speed;
-   ecmd->duplex = adapter->link_duplex - 1;
+   cmd->base.duplex = adapter->link_duplex - 1;
}
} else if (!pm_runtime_suspended(netdev->dev.parent)) {
u32 status = er32(STATUS);
@@ -179,30 +176,36 @@ static int e1000_get_settings(struct net_device *netdev,
speed = SPEED_10;
 
if (status & E1000_STATUS_FD)
-   ecmd->duplex = DUPLEX_FULL;
+   cmd->base.duplex = DUPLEX_FULL;
else
-   ecmd->duplex = DUPLEX_HALF;
+   cmd->base.duplex = DUPLEX_HALF;
}
}
 
-   ethtool_cmd_speed_set(ecmd, speed);
-   ecmd->autoneg = ((hw->phy.media_type == e1000_media_type_fiber) ||
+   cmd->base.speed = speed;
+   cmd->base.autoneg = ((hw->phy.media_type == e1000_media_type_fiber) ||
 hw->mac.autoneg) ? AUTONEG_ENABLE : AUTONEG_DISABLE;
 
/* MDI-X => 2; MDI =>1; Invalid =>0 */
if ((hw->phy.media_type == e1000_media_type_copper) &&
netif_carrier_ok(netdev))
-   ecmd->eth_tp_mdix = hw->phy.is_mdix ? ETH_TP_MDI_X : ETH_TP_MDI;
+   cmd->base.eth_tp_mdix = hw->phy.is_mdix ?
+   ETH_TP_MDI_X : ETH_TP_MDI;
else
-   ecmd->eth_tp_mdix = ETH_TP_MDI_INVALID;
+   cmd->base.eth_tp_mdix =

[PATCHv3 perf/core 0/6] Libbpf object pinning

2017-01-26 Thread Joe Stringer

This series adds pinning functionality for maps, programs, and objects.
Library users may call bpf_map__pin(map, path) or bpf_program__pin(prog, path)
to pin maps and programs separately, or use bpf_object__pin(obj, path) to
pin all maps and programs from the BPF object to the path. The map and program
variations require a path where it will be pinned in the filesystem,
and the object variation will create named directories for each program with
instances within, and mount the maps by name under the path.

For example, with the directory '/sys/fs/bpf/foo' and a BPF object which
contains two instances of a program named 'bar', and a map named 'baz':
/sys/fs/bpf/foo/bar/0
/sys/fs/bpf/foo/bar/1
/sys/fs/bpf/foo/baz

---
v3: Split out bpf_program__pin_instance().
Change the paths from PATH/{maps,progs}/foo to the above.
Drop the patches that were applied.
Add a perf test to check that pinning works.
v2: Wang Nan provided improvements to patch 1.
Dropped patch 2 from v1.
Added acks for acked patches.
Split the bpf_obj__pin() to also provide map / program pinning APIs.
Allow users to provide full filesystem path (don't autodetect/mount BPFFS).
v1: Initial post.

Joe Stringer (6):
  tools lib bpf: Add BPF program pinning APIs.
  tools lib bpf: Add bpf_map__pin()
  tools lib bpf: Add bpf_object__pin()
  tools perf util: Make rm_rf(path) argument const
  tools lib api fs: Add bpf_fs filesystem detector
  perf test: Add libbpf pinning test

 tools/lib/api/fs/fs.c  |  16 +
 tools/lib/api/fs/fs.h  |   1 +
 tools/lib/bpf/libbpf.c | 188 +
 tools/lib/bpf/libbpf.h |   5 ++
 tools/perf/tests/bpf.c |  42 ++-
 tools/perf/util/util.c |   2 +-
 tools/perf/util/util.h |   2 +-
 7 files changed, 253 insertions(+), 3 deletions(-)

-- 
2.11.0

[PATCHv3 perf/core 5/6] tools lib api fs: Add bpf_fs filesystem detector

2017-01-26 Thread Joe Stringer

Allow mounting of the BPF filesystem at /sys/fs/bpf.
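
For readers unfamiliar with how such a detector recognises a filesystem,
the magic-number check can be sketched as below. This is a simplified
illustration under stated assumptions: path_is_on_bpf_fs() is a
hypothetical helper, not something this patch adds. The patch itself
registers the mount point and magic in the fs__entries table instead of
exposing a function like this.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/vfs.h>

#ifndef BPF_FS_MAGIC
#define BPF_FS_MAGIC 0xcafe4a11
#endif

/*
 * Hypothetical helper: returns 1 if path is on a BPF filesystem, 0 if it
 * is on some other filesystem, or a negative errno value if statfs() fails.
 */
static int path_is_on_bpf_fs(const char *path)
{
	struct statfs st_fs;

	if (!path)
		return -EINVAL;
	if (statfs(path, &st_fs))
		return -errno;

	/* f_type is a signed type on Linux; compare as unsigned */
	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC ? 1 : 0;
}
```

On a typical system path_is_on_bpf_fs("/sys/fs/bpf") would return 1 only
once the BPF filesystem has been mounted there.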

Signed-off-by: Joe Stringer 
---
v3: Initial post.
---
 tools/lib/api/fs/fs.c | 16 
 tools/lib/api/fs/fs.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/tools/lib/api/fs/fs.c b/tools/lib/api/fs/fs.c
index f99f49e4a31e..4b6bfc43cccf 100644
--- a/tools/lib/api/fs/fs.c
+++ b/tools/lib/api/fs/fs.c
@@ -38,6 +38,10 @@
 #define HUGETLBFS_MAGIC        0x958458f6
 #endif
 
+#ifndef BPF_FS_MAGIC
+#define BPF_FS_MAGIC   0xcafe4a11
+#endif
+
 static const char * const sysfs__fs_known_mountpoints[] = {
"/sys",
0,
@@ -75,6 +79,11 @@ static const char * const hugetlbfs__known_mountpoints[] = {
0,
 };
 
+static const char * const bpf_fs__known_mountpoints[] = {
+   "/sys/fs/bpf",
+   0,
+};
+
 struct fs {
const char  *name;
const char * const  *mounts;
@@ -89,6 +98,7 @@ enum {
FS__DEBUGFS = 2,
FS__TRACEFS = 3,
FS__HUGETLBFS = 4,
+   FS__BPF_FS = 5,
 };
 
 #ifndef TRACEFS_MAGIC
@@ -121,6 +131,11 @@ static struct fs fs__entries[] = {
.mounts = hugetlbfs__known_mountpoints,
.magic  = HUGETLBFS_MAGIC,
},
+   [FS__BPF_FS] = {
+   .name   = "bpf",
+   .mounts = bpf_fs__known_mountpoints,
+   .magic  = BPF_FS_MAGIC,
+   },
 };
 
 static bool fs__read_mounts(struct fs *fs)
@@ -280,6 +295,7 @@ FS(procfs,  FS__PROCFS);
 FS(debugfs, FS__DEBUGFS);
 FS(tracefs, FS__TRACEFS);
 FS(hugetlbfs, FS__HUGETLBFS);
+FS(bpf_fs, FS__BPF_FS);
 
 int filename__read_int(const char *filename, int *value)
 {
diff --git a/tools/lib/api/fs/fs.h b/tools/lib/api/fs/fs.h
index a63269f5d20c..6b332dc74498 100644
--- a/tools/lib/api/fs/fs.h
+++ b/tools/lib/api/fs/fs.h
@@ -22,6 +22,7 @@ FS(procfs)
 FS(debugfs)
 FS(tracefs)
 FS(hugetlbfs)
+FS(bpf_fs)
 
 #undef FS
 
-- 
2.11.0

[PATCHv3 perf/core 2/6] tools lib bpf: Add bpf_map__pin()

2017-01-26 Thread Joe Stringer

Add a new API to pin a BPF map to the filesystem. The user can
specify the full path within a BPF filesystem to pin the map.

Signed-off-by: Joe Stringer 
---
v3: No change.
v2: Don't automount BPF filesystem
Split program, map, object pinning into separate APIs and separate
patches.
---
 tools/lib/bpf/libbpf.c | 22 ++
 tools/lib/bpf/libbpf.h |  1 +
 2 files changed, 23 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index d1d7638b7c21..ce987c02363e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1349,6 +1349,28 @@ int bpf_program__pin(struct bpf_program *prog, const char *path)
return 0;
 }
 
+int bpf_map__pin(struct bpf_map *map, const char *path)
+{
+   int err;
+
+   err = check_path(path);
+   if (err)
+   return err;
+
+   if (map == NULL) {
+   pr_warning("invalid map pointer\n");
+   return -EINVAL;
+   }
+
+   if (bpf_obj_pin(map->fd, path)) {
+   pr_warning("failed to pin map: %s\n", strerror(errno));
+   return -errno;
+   }
+
+   pr_debug("pinned map '%s'\n", path);
+   return 0;
+}
+
 void bpf_object__close(struct bpf_object *obj)
 {
size_t i;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 9f8aa63b95f4..2addf9d5b13c 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -236,6 +236,7 @@ typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void *);
 int bpf_map__set_priv(struct bpf_map *map, void *priv,
  bpf_map_clear_priv_t clear_priv);
 void *bpf_map__priv(struct bpf_map *map);
+int bpf_map__pin(struct bpf_map *map, const char *path);
 
 long libbpf_get_error(const void *ptr);
 
-- 
2.11.0

[PATCHv3 perf/core 6/6] perf test: Add libbpf pinning test

2017-01-26 Thread Joe Stringer

Add a test for the newly added BPF object pinning functionality.

For example:
  # tools/perf/perf test 37
37: BPF filter :
37.1: Basic BPF filtering  : Ok
37.2: BPF pinning  : Ok
37.3: BPF prologue generation  : Ok
37.4: BPF relocation checker   : Ok

  # tools/perf/perf test 37 -v 2>&1 | grep pinned
libbpf: pinned map '/sys/fs/bpf/perf_test/flip_table'
libbpf: pinned program '/sys/fs/bpf/perf_test/func=SyS_epoll_wait/0'

Signed-off-by: Joe Stringer 
Cc: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
---
v3: Initial post.
---
 tools/perf/tests/bpf.c | 42 +-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c
index 92343f43e44a..1a04fe77487d 100644
--- a/tools/perf/tests/bpf.c
+++ b/tools/perf/tests/bpf.c
@@ -5,11 +5,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "tests.h"
 #include "llvm.h"
 #include "debug.h"
 #define NR_ITERS   111
+#define PERF_TEST_BPF_PATH "/sys/fs/bpf/perf_test"
 
 #ifdef HAVE_LIBBPF_SUPPORT
 
@@ -54,6 +56,7 @@ static struct {
const char *msg_load_fail;
int (*target_func)(void);
int expect_result;
+   bool pin;
 } bpf_testcase_table[] = {
{
LLVM_TESTCASE_BASE,
@@ -63,6 +66,17 @@ static struct {
"load bpf object failed",
&epoll_wait_loop,
(NR_ITERS + 1) / 2,
+   false,
+   },
+   {
+   LLVM_TESTCASE_BASE,
+   "BPF pinning",
+   "[bpf_pinning]",
+   "fix kbuild first",
+   "check your vmlinux setting?",
+   &epoll_wait_loop,
+   (NR_ITERS + 1) / 2,
+   true,
},
 #ifdef HAVE_BPF_PROLOGUE
{
@@ -73,6 +87,7 @@ static struct {
"check your vmlinux setting?",
&llseek_loop,
(NR_ITERS + 1) / 4,
+   false,
},
 #endif
{
@@ -83,6 +98,7 @@ static struct {
"libbpf error when dealing with relocation",
NULL,
0,
+   false,
},
 };
 
@@ -226,10 +242,34 @@ static int __test__bpf(int idx)
goto out;
}
 
-   if (obj)
+   if (obj) {
ret = do_test(obj,
  bpf_testcase_table[idx].target_func,
  bpf_testcase_table[idx].expect_result);
+   if (ret != TEST_OK)
+   goto out;
+   if (bpf_testcase_table[idx].pin) {
+   int err;
+
+   if (!bpf_fs__mount()) {
+   pr_debug("BPF filesystem not mounted\n");
+   ret = TEST_FAIL;
+   goto out;
+   }
+   err = mkdir(PERF_TEST_BPF_PATH, 0777);
+   if (err && errno != EEXIST) {
+   pr_debug("Failed to make perf_test dir: %s\n",
+strerror(errno));
+   ret = TEST_FAIL;
+   goto out;
+   }
+   if (bpf_object__pin(obj, PERF_TEST_BPF_PATH))
+   ret = TEST_FAIL;
+   if (rm_rf(PERF_TEST_BPF_PATH))
+   ret = TEST_FAIL;
+   }
+   }
+
 out:
bpf__clear();
return ret;
-- 
2.11.0

[PATCHv3 perf/core 3/6] tools lib bpf: Add bpf_object__pin()

2017-01-26 Thread Joe Stringer

Add a new API to pin a BPF object to the filesystem. The user can
specify the path within a BPF filesystem to pin the object.
Programs will be pinned under a subdirectory named the same as the
program, with each instance appearing as a numbered file under that
directory, and maps will be pinned under the path using the name of
the map as the file basename.

For example, with the directory '/sys/fs/bpf/foo' and a BPF object which
contains two instances of a program named 'bar', and a map named 'baz':
/sys/fs/bpf/foo/bar/0
/sys/fs/bpf/foo/bar/1
/sys/fs/bpf/foo/baz

Signed-off-by: Joe Stringer 
---
v3: Mount to PATH/MAPNAME, PATH/PROGNAME/N (N = instance number)
v2: Don't automount BPF filesystem
Split program, map, object pinning into separate APIs and separate
patches.
---
 tools/lib/bpf/libbpf.c | 54 ++
 tools/lib/bpf/libbpf.h |  1 +
 2 files changed, 55 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ce987c02363e..703cfa986b34 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1371,6 +1372,59 @@ int bpf_map__pin(struct bpf_map *map, const char *path)
return 0;
 }
 
+int bpf_object__pin(struct bpf_object *obj, const char *path)
+{
+   struct bpf_program *prog;
+   struct bpf_map *map;
+   int err;
+
+   if (!obj)
+   return -ENOENT;
+
+   if (!obj->loaded) {
+   pr_warning("object not yet loaded; load it first\n");
+   return -ENOENT;
+   }
+
+   err = make_dir(path);
+   if (err)
+   return err;
+
+   bpf_map__for_each(map, obj) {
+   char buf[PATH_MAX];
+   int len;
+
+   len = snprintf(buf, PATH_MAX, "%s/%s", path,
+  bpf_map__name(map));
+   if (len < 0)
+   return -EINVAL;
+   else if (len > PATH_MAX)
+   return -ENAMETOOLONG;
+
+   err = bpf_map__pin(map, buf);
+   if (err)
+   return err;
+   }
+
+   bpf_object__for_each_program(prog, obj) {
+   char buf[PATH_MAX];
+   int len;
+
+   len = snprintf(buf, PATH_MAX, "%s/%s", path,
+  prog->section_name);
+   if (len < 0)
+   return -EINVAL;
+   else if (len > PATH_MAX)
+   return -ENAMETOOLONG;
+
+   err = bpf_program__pin(prog, buf);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
 void bpf_object__close(struct bpf_object *obj)
 {
size_t i;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 2addf9d5b13c..b30394f9947a 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -65,6 +65,7 @@ struct bpf_object *bpf_object__open(const char *path);
 struct bpf_object *bpf_object__open_buffer(void *obj_buf,
   size_t obj_buf_sz,
   const char *name);
+int bpf_object__pin(struct bpf_object *object, const char *path);
 void bpf_object__close(struct bpf_object *object);
 
 /* Load/unload object into/from kernel */
-- 
2.11.0

[PATCHv3 perf/core 4/6] tools perf util: Make rm_rf(path) argument const

2017-01-26 Thread Joe Stringer

rm_rf() doesn't modify its path argument, and a future caller will pass
a string constant into it to delete.
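
A small illustration of why the qualifier matters, assuming a build with
warnings such as GCC's -Wwrite-strings enabled: a function that only
reads its argument can accept string constants and const-qualified
pointers without casts once the parameter is declared const.
count_separators() and EXAMPLE_PATH are hypothetical names used only for
this sketch.

```c
#include <stddef.h>

/* A string constant, much like the one a test might pass to rm_rf() */
#define EXAMPLE_PATH "/sys/fs/bpf/perf_test"

/* Hypothetical read-only helper: counts '/' characters, never writes */
static size_t count_separators(const char *path)
{
	size_t n = 0;

	for (; path && *path; path++)
		if (*path == '/')
			n++;

	return n;
}
```

Calling count_separators(EXAMPLE_PATH) compiles cleanly; with a plain
"char *" parameter the same call would draw a discarded-qualifier warning
under -Wwrite-strings.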

Signed-off-by: Joe Stringer 
---
v3: Initial post.
---
 tools/perf/util/util.c | 2 +-
 tools/perf/util/util.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index bf29aed16bd6..d8b45cea54d0 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -85,7 +85,7 @@ int mkdir_p(char *path, mode_t mode)
return (stat(path, &st) && mkdir(path, mode)) ? -1 : 0;
 }
 
-int rm_rf(char *path)
+int rm_rf(const char *path)
 {
DIR *dir;
int ret = 0;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 6e8be174ec0b..c74708da8571 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -209,7 +209,7 @@ static inline int sane_case(int x, int high)
 }
 
 int mkdir_p(char *path, mode_t mode);
-int rm_rf(char *path);
+int rm_rf(const char *path);
 struct strlist *lsdir(const char *name, bool (*filter)(const char *, struct dirent *));
 bool lsdir_no_dot_filter(const char *name, struct dirent *d);
 int copyfile(const char *from, const char *to);
-- 
2.11.0

Re: [PATCH v2 net] net: free ip_vs_dest structs when refcnt=0

2017-01-26 Thread Julian Anastasov


Hello,

On Mon, 23 Jan 2017, David Windsor wrote:

> Currently, the ip_vs_dest cache frees ip_vs_dest objects when their
> reference count becomes < 0.  Aside from not being semantically sound,
> this is problematic for the new type refcount_t, which will be introduced
> shortly in a separate patch. refcount_t is the new kernel type for
> holding reference counts, and provides overflow protection and a
> constrained interface relative to atomic_t (the type currently being
> used for kernel reference counts).
> 
> Per Julian Anastasov: "The problem is that dest_trash currently holds
> deleted dests (unlinked from RCU lists) with refcnt=0."  Changing
> dest_trash to hold dest with refcnt=1 will allow us to free ip_vs_dest
> structs when their refcnt=0, in ip_vs_dest_put_and_free().
> 
> Signed-off-by: David Windsor 

Thanks! I tested the first version and this one
just adds the needed changes in comments, so

Signed-off-by: Julian Anastasov 

Simon and Pablo, this is more appropriate for
ipvs-next/nf-next. Please apply!

> ---
>  include/net/ip_vs.h| 2 +-
>  net/netfilter/ipvs/ip_vs_ctl.c | 8 +++-
>  2 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index cd6018a..a3e78ad 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -1421,7 +1421,7 @@ static inline void ip_vs_dest_put(struct ip_vs_dest *dest)
>  
>  static inline void ip_vs_dest_put_and_free(struct ip_vs_dest *dest)
>  {
> - if (atomic_dec_return(&dest->refcnt) < 0)
> + if (atomic_dec_and_test(&dest->refcnt))
>   kfree(dest);
>  }
>  
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 55e0169..5fc4836 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -711,7 +711,6 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, int dest_af,
> dest->vport == svc->port))) {
>   /* HIT */
>   list_del(&dest->t_list);
> - ip_vs_dest_hold(dest);
>   goto out;
>   }
>   }
> @@ -741,7 +740,7 @@ static void ip_vs_dest_free(struct ip_vs_dest *dest)
>   *  When the ip_vs_control_clearup is activated by ipvs module exit,
>   *  the service tables must have been flushed and all the connections
>   *  are expired, and the refcnt of each destination in the trash must
> - *  be 0, so we simply release them here.
> + *  be 1, so we simply release them here.
>   */
>  static void ip_vs_trash_cleanup(struct netns_ipvs *ipvs)
>  {
> @@ -1080,11 +1079,10 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>   if (list_empty(&ipvs->dest_trash) && !cleanup)
>   mod_timer(&ipvs->dest_trash_timer,
> jiffies + (IP_VS_DEST_TRASH_PERIOD >> 1));
> - /* dest lives in trash without reference */
> + /* dest lives in trash with reference */
>   list_add(&dest->t_list, &ipvs->dest_trash);
>   dest->idle_start = 0;
>   spin_unlock_bh(&ipvs->dest_trash_lock);
> - ip_vs_dest_put(dest);
>  }
>  
>  
> @@ -1160,7 +1158,7 @@ static void ip_vs_dest_trash_expire(unsigned long data)
>  
>   spin_lock(&ipvs->dest_trash_lock);
>   list_for_each_entry_safe(dest, next, &ipvs->dest_trash, t_list) {
> - if (atomic_read(&dest->refcnt) > 0)
> + if (atomic_read(&dest->refcnt) > 1)
>   continue;
>   if (dest->idle_start) {
>   if (time_before(now, dest->idle_start +
> -- 
> 2.7.4

Regards

--
Julian Anastasov
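
The reference-counting convention discussed in this thread (the trash
list holds one reference, and the object is freed when the count reaches
zero) can be modelled in userspace with C11 atomics. This is only a
sketch under stated assumptions: the example_dest type and its helpers
are hypothetical stand-ins, not the kernel's ip_vs_dest, atomic_t, or
refcount_t API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted destination */
struct example_dest {
	atomic_int refcnt;
};

/* New objects start with one reference, owned by the holding list */
static struct example_dest *example_dest_new(void)
{
	struct example_dest *d = malloc(sizeof(*d));

	if (d)
		atomic_init(&d->refcnt, 1);
	return d;
}

static void example_dest_hold(struct example_dest *d)
{
	atomic_fetch_add(&d->refcnt, 1);
}

/* Drop one reference; free the object and return true if it was the last */
static bool example_dest_put_and_free(struct example_dest *d)
{
	/* atomic_fetch_sub() returns the value held before the decrement */
	if (atomic_fetch_sub(&d->refcnt, 1) == 1) {
		free(d);
		return true;
	}
	return false;
}

/* Mirrors the "> 1" test: someone besides the list still holds it */
static bool example_dest_in_use(struct example_dest *d)
{
	return atomic_load(&d->refcnt) > 1;
}
```

Under this convention a count below zero is never needed to signal "free
now", which is the property an overflow-checked counter type relies on.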

RE: netvsc NAPI patch process

2017-01-26 Thread KY Srinivasan



> -----Original Message-----
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Thursday, January 26, 2017 10:04 AM
> To: da...@davemloft.net; KY Srinivasan ; Greg KH
> 
> Cc: netdev@vger.kernel.org
> Subject: netvsc NAPI patch process
> 
> I have a working set of patches to enable NAPI in the netvsc driver.
> The problem is that it requires a set of patches to vmbus layer as well.
> Since vmbus patches have been going through char-misc-next tree rather
> than net-next, it is difficult to stage these.

In the past, we have done this in two stages - get the supporting vmbus patches
into Greg's tree first and in the next merge cycle get the netvsc patches in.
Why not continue to do what we have done in the past to address cross-tree
dependencies.

Regards,

K. Y 
> 
> How about if I send the vmbus patches through normal driver-devel upstream
> and during the 4.10 merge window send the last 3 patches for NAPI for linux-net
> tree to get into 4.10?

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Daniel Borkmann


On 01/26/2017 02:40 PM, Michal Hocko wrote:

On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:

On 01/26/2017 12:58 PM, Michal Hocko wrote:

On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:

On 01/26/2017 11:08 AM, Michal Hocko wrote:

[...]

If you disagree I can drop the bpf part of course...


If we could consolidate these spots with kvmalloc() eventually, I'm
all for it. But even if __GFP_NORETRY is not covered down to all
possible paths, it kind of does have an effect already of saying
'don't try too hard', so would it be harmful to still keep that for
now? If it's not, I'd personally prefer to just leave it as is until
there's some form of support by kvmalloc() and friends.


Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
disallowed. It is not _supported_ which means that if it doesn't work as
you expect you are on your own. Which is actually the situation right
now as well. But I still think that this is just not the right thing to do.
Even though it might happen to work in some cases it gives a false
impression of a solution. So I would rather go with


Hmm. 'On my own' means, we could potentially BUG somewhere down the
vmalloc implementation, etc, presumably? So it might in-fact be
harmful to pass that, right?


No it would mean that it might eventually hit the behavior which you are
trying to avoid - in other words it may invoke OOM killer even though
__GFP_NORETRY means giving up before any system wide disruptive actions
are taken.


Ok, thanks for clarifying, more on that further below.


diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8697f43cf93c..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)

   void *bpf_map_area_alloc(size_t size)
   {
+   /*
+* FIXME: we would really like to not trigger the OOM killer and rather
+* fail instead. This is not supported right now. Please nag MM people
+* if these OOM start bothering people.
+*/


Ok, I know this is out of scope for this series, but since i) this
is _not_ the _only_ spot right now which has such a construct and ii)
I am already kind of nagging a bit ;), my question would be, what
would it take to start supporting it?


propagate gfp mask all the way down from vmalloc to all places which
might allocate down the path and especially page table allocation
function are PITA because they are really deep. This is a lot of work...

But realistically, how big is this problem really? Is it really worth
it? You said this is an admin only interface and admin can kill the
machine by OOM and other means already.

Moreover and I should probably mention it explicitly, your d407bd25a204b
reduced the likelihood of OOM for another reason. kmalloc used GFP_USER
previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
could indeed hit the OOM e.g. due to memory fragmentation. It would be
much harder to hit the OOM killer from vmalloc which doesn't issue
higher order allocation requests. Or have you ever seen the OOM killer
pointing to the vmalloc fallback path?


The case I was concerned about was from vmalloc() path, not kmalloc().
That was where the stack trace indicating OOM pointed to. As an example,
there could be really large allocation requests for maps where the map
has pre-allocated memory for its elements. Thus, if we get to the point
where we need to kill others due to shortage of mem for satisfying this,
I'd much much rather prefer to just not let vmalloc() work really hard
and fail early on instead. In my (crafted) test case, I was connected
via ssh and it each time reliably killed my connection, which is really
suboptimal.

F.e., I could also imagine a buggy or miscalculated map definition for
a prog that is provisioned to multiple places, which then accidentally
triggers this. Or if large on purpose, but we crossed the line, it
could be handled more gracefully, f.e. I could imagine an option of
falling back to a non-pre-allocated map flavor from the application
loading the program. Trade-off for sure, but still allowing it to
operate up to a certain extent. Granted, if vmalloc() succeeded without
trying hard and we then OOM elsewhere, too bad, but we don't have much
control over that one anyway, only about our own request. Reason I
asked above was whether having __GFP_NORETRY in would be fatal
somewhere down the path, but seems not as you say.

So to answer your second email with the bpf and netfilter hunks, why
not replace them with kvmalloc() and __GFP_NORETRY flag and add that
big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
is not harmful though has only /partial/ effect right now and that full
support needs to be implemented in future. That would still be better
than not having it, imo, and the FIXME would make expectations clear
to anyone reading that code.

Thanks,
Daniel
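
The graceful fallback described in this mail (fail the large up-front
allocation early, then continue with a non-pre-allocated map) might look
roughly like the following userspace model. Everything here is
hypothetical: example_map, alloc_fail_early() and the byte budget are
illustrative stand-ins, not kernel or libbpf interfaces. The budget check
merely imitates an allocator that gives up instead of trying hard.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of a map; not a kernel or libbpf type */
struct example_map {
	void *storage;		/* pre-allocated element storage, or NULL */
	size_t max_entries;
	size_t elem_size;
	bool preallocated;
};

/* Refuse requests above 'budget' bytes rather than trying hard */
static void *alloc_fail_early(size_t nmemb, size_t size, size_t budget)
{
	if (size == 0 || nmemb > SIZE_MAX / size)
		return NULL;	/* empty request or multiplication overflow */
	if (nmemb * size > budget)
		return NULL;
	return calloc(nmemb, size);
}

/* Try to pre-allocate; fall back to on-demand mode if that fails */
static int example_map_init(struct example_map *map, size_t max_entries,
			    size_t elem_size, size_t budget)
{
	if (!map || max_entries == 0 || elem_size == 0)
		return -1;

	map->max_entries = max_entries;
	map->elem_size = elem_size;
	map->storage = alloc_fail_early(max_entries, elem_size, budget);
	map->preallocated = map->storage != NULL;
	return 0;
}

static void example_map_destroy(struct example_map *map)
{
	if (!map)
		return;
	free(map->storage);
	map->storage = NULL;
	map->preallocated = false;
}
```

In the real BPF API the on-demand flavor corresponds to creating the map
with the BPF_F_NO_PREALLOC flag; the model above only shows the control
flow a loader could follow when the first attempt fails.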

Re: [PATCH net-next v2 0/4] net: dsa: Preparatory patches

2017-01-26 Thread David Miller

From: Florian Fainelli 
Date: Thu, 26 Jan 2017 10:45:50 -0800

> This patch series extracts the 4 patches of the larger: net: dsa: Support for
> pdata in dsa2 while we wait for feedback from Greg KH on the device references.
> 
> Changes in v2:
> 
> - rebased properly after the multi-MDIO bus support added to mv88e6xxx

Series applied, thanks Florian.

Re: [PATCH net-next] liquidio: Avoid accessing skb after submitting to input queue

2017-01-26 Thread David Miller

From: Felix Manlunas 
Date: Thu, 26 Jan 2017 11:52:35 -0800

> From: Satanand Burla 
> 
> Accessing skb after submitting to input queue can cause
> access to stale pointers if the skb ends up being transmitted
> and freed by that time.
> 
> Signed-off-by: Satanand Burla 
> Signed-off-by: Derek Chickles 
> Signed-off-by: Raghu Vatsavayi 
> Signed-off-by: Felix Manlunas 

Applied, thanks.

Re: [PATCH net-next 1/3] trace: add variant without spacing in trace_print_hex_seq

2017-01-26 Thread Daniel Borkmann


On 01/26/2017 08:53 PM, Arnaldo Carvalho de Melo wrote:

On Wed, Jan 25, 2017 at 02:28:16AM +0100, Daniel Borkmann wrote:

For upcoming tracepoint support for BPF, we want to dump the program's
tag. Format should be similar to __print_hex(), but without spacing.
Add a __print_hex_str() variant for exactly that purpose that reuses
trace_print_hex_seq().


Steven should be back to his side of the wall soon, will wait for his
Ack, ok?


Ok, seems this set got applied already to net-next in the meantime, so
if there are any objections on this, I will follow up with a patch of
course.

Thanks,
Daniel


- Arnaldo

Re: [PATCHv3 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs

2017-01-26 Thread Chris Packham

On 27/01/17 04:10, Gregory CLEMENT wrote:
>> +internal-regs {

[snip]

>> +
>> +dfx-registers {
> node label
>

[snip]

>> +switch {
> node label
>

These are peers to the internal-regs, i.e. parts of the SoC with 
mappable windows in the address space. Do they really need a label? 
Their subnodes absolutely need (and have) labels.

Re: [PATCH RFC net-next] packet: always ensure that we pass hard_header_len bytes in skb_headlen() to the driver

2017-01-26 Thread Willem de Bruijn

> If the application has provided fewer than hard_header_len bytes,
> dev_validate_header() will zero out the skb->data as needed. This is
> acceptable for SOCK_DGRAM/PF_PACKET sockets but in all other cases,

This was added not for datagram sockets, but to be able to bypass
validation. See the message in commit 2793a23aacbd ("net: validate
variable length ll header") and discussion leading up to that patch.

> the application must provide a full L2 header, and the PF_PACKET Tx
> paths must fail with an error when fewer than hard_header_len bytes
> are detected.

As David pointed out, this does not handle variable length headers
correctly. In link layers that support these, hard_header_len defines
the maximum header length. A hard failure on len < hard_header_len
would be incorrect.

The ->validate callback was added to allow specifying additional
constraints on a per protocol basis. This is where a min constraint
can be added, e.g., for ethernet.

> All invocations to dev_validate_header() already adjusts the
> skb's data, len, tail etc pointers based on hard_header_len before
> invoking dev_validate_header(), so additional skb pointers should
> not be needed after dev_validate_header().
>
> Signed-off-by: Sowmini Varadhan 
> ---

> -   if (!dev_validate_header(dev, skb->data, len)) {
> +   newlen = dev_validate_header(dev, skb->data, len);
> +   /* As comments above this function indicate, a full L2 header
> +* must be passed to this function, so if newlen > len, bail.
> +*/
> +   if (newlen < 0 || newlen > len) {

If callers only care whether the function returned failure or
increased len, which also indicates failure, it is cleaner to leave it
a boolean and fail in cases where len < the minimum for that link
layer type. No caller actually uses newlen.

> +   /* Caller has allocated for copylen in non-paged part of
> +* skb so we should never find newlen > hdrlen
> +*/
> +   WARN_ON(newlen > hdrlen);

WARN_ON_ONCE is safer.

Re: [PATCHv3 4/5] arm: mvebu: Add device tree for 98DX3236 SoCs

2017-01-26 Thread Chris Packham

On 27/01/17 04:10, Gregory CLEMENT wrote:
> Hi Chris,
>
> On Fri., Jan. 06 2017, Chris Packham wrote:
>
>> The Marvell 98DX3236, 98DX3336, 98DX4521 and variants are switch ASICs
>> with integrated CPUs. They are similar to the Armada XP SoCs but have
>> different I/O interfaces.
>
> Before sending a new version I have a few remarks:
>
>

[snip]

I'll update the dtsi files to use the node labels and correct the 
comments as requested.

>
> Why the following node is not part of the dtsi?
>
> Gregory
>
>> +resume@20980 {
>> +compatible = "marvell,98dx3336-resume-ctrl";
>> +reg = <0x20980 0x10>;
>> +};
>> +};
>> +};

The 98DX3236 has a single ARMv7 core. As such this resume control isn't 
present on it. The 98DX3336 and 98DX4521 have dual ARMv7 cores and this 
is used to boot the second core (SMP support is a little different 
compared to Armada-XP).

In other words {98DX3336, 98DX4521} = 98DX3236 + an additional core. At 
the switch packet processor level there are more differences but as far 
as the kernel is concerned the only real difference is the number of cores.

Re: [PATCH net 1/5] ibmvnic: harden interrupt handler

2017-01-26 Thread Stephen Hemminger

On Wed, 25 Jan 2017 15:02:19 -0600
Thomas Falcon  wrote:

>  static irqreturn_t ibmvnic_interrupt(int irq, void *instance)
>  {
>   struct ibmvnic_adapter *adapter = instance;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&adapter->crq.lock, flags);
> + vio_disable_interrupts(adapter->vdev);
> + tasklet_schedule(&adapter->tasklet);
> + spin_unlock_irqrestore(&adapter->crq.lock, flags);
> + return IRQ_HANDLED;
> +}
> +

Why not use NAPI rather than a tasklet?

Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Alexei Starovoitov


On 1/26/17 11:07 AM, Andy Lutomirski wrote:
> On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov wrote:
>> On 1/26/17 10:12 AM, Andy Lutomirski wrote:
>>> On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov wrote:
>>>> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>>>> Think of bpf programs as safe kernel modules. They don't have
>>>>>> confined boundaries and program authors, if not careful, can shoot
>>>>>> themselves in the foot. We're not trying to prevent that because
>>>>>> it's impossible to check that the program is sane. Just like
>>>>>> it's impossible to check that kernel module is sane.
>>>>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>>>>> point of view. If it's doing some garbage, it's program's business.
>>>>>> Does it make more sense now?
>>>>>
>>>>> With all due respect, I think this is not an acceptable way to think
>>>>> about BPF at all.  If you think of BPF this way, I think there needs
>>>>> to be a real discussion at KS or similar as to whether this is okay.
>>>>> The reason is simple: the kernel promises a stable ABI to userspace
>>>>> but not to kernel modules.  By thinking of BPF as more like a module,
>>>>> you're taking a big shortcut that will either result in ABI breakage
>>>>> down the road or in committing to a problematic stable ABI.
>>>>
>>>> you misunderstood the analogy.
>>>> bpf abi is certainly stable. that's why we were careful of not
>>>> exposing anything to it that is not already stable.
>>>
>>> In that case I don't understand what you're trying to say.  Eric
>>> thinks your patch exposes a bad interface.  A bad interface for
>>> userspace is a very different thing from a bad interface available to
>>> kernel modules.  Are you saying that BPF is kernel-module-like in that
>>> the ABI exposed to BPF programs doesn't need to meet the same quality
>>> standards as userspace ABIs?
>>
>> of course not.
>> ns.inum is already exposed to user space as a value.
>> This patch exposes it to bpf program in a convenient and stable way,
>
> Here's what I'm imagining Eric is thinking:
>
> ns.inum is currently exposed to userspace via procfs.  In principle,
> the value could be local to a namespace, though, which would enable
> CRIU to be able to preserve namespace inode numbers across a
> checkpoint+restore operation.  If this happened, the contained and
> restored procfs would see a different inode number than the outermost
> procfs.

sure. there are many different ways for the program to see inode
that either was already reused or disappeared.
What I'm saying is that it is expected. We cannot prevent that from
bpf side. Just like ifindex value read by the program can be bogus
as in the example I just provided.

> If you start exposing the raw ns.inum field to BPF programs and those
> programs are not themselves scoped to a namespace, then this could
> create a problem for CRIU.

criu doesn't support ebpf because maps are not snapshot-able and
programs are detached from the control plane. I cannot see how one can
criu an xdp or cls program. The ssh connection to the box might die in
the middle while criu is messing with unknown. Hence the analogy to
the kernel modules. Imagine a set of mini-kernel modules and a set
of apps that depend on them. What kind of criu can we even talk about?

> But you told Eric that his nack doesn't matter, and maybe it would be
> nice to ask him to clarify instead.

Fair enough. Eric, thoughts?

Re: [PATCH net-next 1/3] trace: add variant without spacing in trace_print_hex_seq

2017-01-26 Thread Arnaldo Carvalho de Melo

On Wed, Jan 25, 2017 at 02:28:16AM +0100, Daniel Borkmann wrote:
> For upcoming tracepoint support for BPF, we want to dump the program's
> tag. Format should be similar to __print_hex(), but without spacing.
> Add a __print_hex_str() variant for exactly that purpose that reuses
> trace_print_hex_seq().

Steven should be back to his side of the wall soon, will wait for his
Ack, ok?

- Arnaldo
 
> Signed-off-by: Daniel Borkmann 
> Cc: Steven Rostedt 
> Cc: Arnaldo Carvalho de Melo 
> ---
>  include/linux/trace_events.h | 3 ++-
>  include/trace/trace_events.h | 8 +++-
>  kernel/trace/trace_output.c  | 7 ---
>  3 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index be00761..cfa475a 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -33,7 +33,8 @@ const char *trace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
>   unsigned int bitmask_size);
>  
>  const char *trace_print_hex_seq(struct trace_seq *p,
> - const unsigned char *buf, int len);
> + const unsigned char *buf, int len,
> + bool spacing);
>  
>  const char *trace_print_array_seq(struct trace_seq *p,
>  const void *buf, int count,
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 467e12f..9f68462 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -297,7 +297,12 @@
>  #endif
>  
>  #undef __print_hex
> -#define __print_hex(buf, buf_len) trace_print_hex_seq(p, buf, buf_len)
> +#define __print_hex(buf, buf_len)\
> + trace_print_hex_seq(p, buf, buf_len, true)
> +
> +#undef __print_hex_str
> +#define __print_hex_str(buf, buf_len)	\
> + trace_print_hex_seq(p, buf, buf_len, false)
>  
>  #undef __print_array
>  #define __print_array(array, count, el_size) \
> @@ -711,6 +716,7 @@
>  #undef __print_flags
>  #undef __print_symbolic
>  #undef __print_hex
> +#undef __print_hex_str
>  #undef __get_dynamic_array
>  #undef __get_dynamic_array_len
>  #undef __get_str
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 5d33a73..30a144b1 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -163,14 +163,15 @@ enum print_line_t trace_print_printk_msg_only(struct trace_iterator *iter)
>  EXPORT_SYMBOL_GPL(trace_print_bitmask_seq);
>  
>  const char *
> -trace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
> +trace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len,
> + bool spacing)
>  {
>   int i;
>   const char *ret = trace_seq_buffer_ptr(p);
>  
>   for (i = 0; i < buf_len; i++)
> - trace_seq_printf(p, "%s%2.2x", i == 0 ? "" : " ", buf[i]);
> -
> + trace_seq_printf(p, "%s%2.2x", !spacing || i == 0 ? "" : " ",
> +  buf[i]);
>   trace_seq_putc(p, 0);
>  
>   return ret;
> -- 
> 1.9.3
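
To make the effect of the new spacing argument concrete, the following is a userspace sketch that mirrors the loop in the patched trace_print_hex_seq(). It is only a model: sketch_hex_seq() is a hypothetical name, and it writes into a caller-supplied buffer with snprintf() instead of a struct trace_seq.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Formats buf as two hex digits per byte, separated by single spaces when
 * spacing is true and packed together when it is false. Returns the number
 * of characters written (excluding the terminating NUL), or -1 if the
 * output buffer is too small.
 */
static int sketch_hex_seq(char *out, size_t out_len,
			  const unsigned char *buf, int buf_len, bool spacing)
{
	size_t pos = 0;
	int i;

	if (out_len == 0)
		return -1;
	out[0] = '\0';

	for (i = 0; i < buf_len; i++) {
		/* Same separator rule as the patch: none before byte 0. */
		int n = snprintf(out + pos, out_len - pos, "%s%2.2x",
				 !spacing || i == 0 ? "" : " ",
				 (unsigned int)buf[i]);

		if (n < 0 || (size_t)n >= out_len - pos)
			return -1;
		pos += (size_t)n;
	}
	return (int)pos;
}
```

With the bytes de ad 0b ef as input, spacing set to true gives "de ad 0b ef" and spacing set to false gives "dead0bef", which is the form wanted for a program tag.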

[PATCH net-next] liquidio: Avoid accessing skb after submitting to input queue

2017-01-26 Thread Felix Manlunas

From: Satanand Burla 

Accessing skb after submitting to input queue can cause
access to stale pointers if the skb ends up being transmitted
and freed by that time.

Signed-off-by: Satanand Burla 
Signed-off-by: Derek Chickles 
Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 6 +++---
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 5ee3f00..9261ddc 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3316,11 +3316,11 @@ static int liquidio_xmit(struct sk_buff *skb, struct net_device *netdev)
 
netif_trans_update(netdev);
 
-   if (skb_shinfo(skb)->gso_size)
-   stats->tx_done += skb_shinfo(skb)->gso_segs;
+   if (tx_info->s.gso_segs)
+   stats->tx_done += tx_info->s.gso_segs;
else
stats->tx_done++;
-   stats->tx_tot_bytes += skb->len;
+   stats->tx_tot_bytes += ndata.datasize;
 
return NETDEV_TX_OK;
 
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index e96cf6c..a6587d7 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -2433,11 +2433,11 @@ static int liquidio_xmit(struct sk_buff *skb, struct net_device *netdev)
 
netif_trans_update(netdev);
 
-   if (skb_shinfo(skb)->gso_size)
-   stats->tx_done += skb_shinfo(skb)->gso_segs;
+   if (tx_info->s.gso_segs)
+   stats->tx_done += tx_info->s.gso_segs;
else
stats->tx_done++;
-   stats->tx_tot_bytes += skb->len;
+   stats->tx_tot_bytes += ndata.datasize;
 
return NETDEV_TX_OK;
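
The general pattern behind this change can be sketched in userspace as follows. This is only a model under stated assumptions: struct sketch_pkt, sketch_submit() and sketch_xmit() are hypothetical names, and the "queue" frees the packet immediately to stand in for a transmit completion that may run before the submitter continues.

```c
#include <stdlib.h>

/* Hypothetical packet: only the fields the statistics need. */
struct sketch_pkt {
	unsigned int len;
	unsigned int gso_segs;
};

struct sketch_stats {
	unsigned long tx_done;
	unsigned long tx_tot_bytes;
};

/*
 * Stand-in for handing the packet to the input queue. Ownership passes to
 * the queue, which here frees the packet right away.
 */
static void sketch_submit(struct sketch_pkt *pkt)
{
	free(pkt);
}

static void sketch_xmit(struct sketch_pkt *pkt, struct sketch_stats *stats)
{
	/* Copy everything the accounting needs while pkt is still ours. */
	unsigned int len = pkt->len;
	unsigned int segs = pkt->gso_segs;

	sketch_submit(pkt);	/* pkt must not be dereferenced below */

	stats->tx_done += segs ? segs : 1;
	stats->tx_tot_bytes += len;
}
```

The patch above achieves the same thing by reading tx_info->s.gso_segs and ndata.datasize, which the driver filled in before submission, instead of touching the skb afterwards.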

Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()

2017-01-26 Thread Joe Stringer

On 26 January 2017 at 11:32, Arnaldo Carvalho de Melo  wrote:
> On Wed, Jan 25, 2017 at 10:18:22AM +0800, Wangnan (F) wrote:
>> On 2017/1/25 9:16, Joe Stringer wrote:
>> > On 24 January 2017 at 17:06, Wangnan (F)  wrote:
>> > > On 2017/1/25 9:04, Wangnan (F) wrote:
>> > > Is it possible to use directory tree instead?
>
>> > > %s/object/mapname
>> > > %s/object/prog/instance
>> > I don't think objects have names, so let's assume an object with two
>> > program instances named foo, and one map named bar.
>
>> > A call of bpf_object__pin(obj, "/sys/fs/bpf/myobj") would mount with
>> > the following files and directories:
>> > /sys/fs/bpf/myobj/foo/1
>> > /sys/fs/bpf/myobj/foo/2
>> > /sys/fs/bpf/myobj/bar
>
>> > Alternatively, if you want to control exactly where you want the
>> > progs/maps to be pinned, you can call eg
>> > bpf_program__pin_instance(prog, "/sys/fs/bpf/wherever", 0) and that
>> > instance will be mounted to /sys/fs/bpf/wherever, or alternatively
>> > bpf_program__pin(prog, "/sys/fs/bpf/foo"), and you will end up with
>> > /sys/fs/bpf/foo/{0,1}.
>
>> > This looks pretty reasonable to me.
>
>> It looks good to me.
>
> Ok, please continue from perf/core, Ingo merged the first patch of this
> patchset today,

Ok thanks, I'll continue from there.
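
For reference, the directory layout discussed above can be sketched as two small path builders. These are hypothetical helpers written for illustration, not libbpf functions; the instance number is taken as a parameter, so the sketch makes no claim about whether numbering starts at 0 or 1.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Builds "<root>/<prog>/<instance>", e.g. /sys/fs/bpf/myobj/foo/1.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int sketch_instance_path(char *out, size_t out_len, const char *root,
				const char *prog, int instance)
{
	int n = snprintf(out, out_len, "%s/%s/%d", root, prog, instance);

	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/*
 * Builds "<root>/<map>", e.g. /sys/fs/bpf/myobj/bar.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int sketch_map_path(char *out, size_t out_len, const char *root,
			   const char *map)
{
	int n = snprintf(out, out_len, "%s/%s", root, map);

	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```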

Re: [PATCH net-next 2/3] sfc: refactor debug-or-warnings printks

2017-01-26 Thread David Miller

From: Edward Cree 
Date: Thu, 26 Jan 2017 17:53:48 +

> diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
> index 5927c20..c640955 100644
> --- a/drivers/net/ethernet/sfc/net_driver.h
> +++ b/drivers/net/ethernet/sfc/net_driver.h
> @@ -51,6 +51,15 @@
>  #define EFX_WARN_ON_PARANOID(x) do {} while (0)
>  #endif
>  
> +/* if @cond then downgrade to debug, else print at @level */
> +#define netif_cond_dbg(priv, type, netdev, cond, level, fmt, args...) \
> + do {  \
> + if (cond) \
> + netif_dbg(priv, type, netdev, fmt, ##args);   \
> + else  \
> + netif_ ## level(priv, type, netdev, fmt, ##args); \
> + } while (0)
> +
>  /**
>   *
>   * Efx data structures
> 

Please do not define locally in a driver an interface that looks like a generic
one and might be useful to code outside of your driver.

Thanks.
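
For readers unfamiliar with the pattern in the quoted macro, here is a userspace sketch of a level-selecting macro built with token pasting. The sketch_ names are hypothetical, and the helpers format into a caller buffer instead of calling netif_dbg() and friends; it only shows the mechanism, not where such a macro should live.

```c
#include <stddef.h>
#include <stdio.h>

/* One helper per level; the macro below picks one by pasting the name. */
static void sketch_log_dbg(char *out, size_t len, const char *msg)
{
	snprintf(out, len, "dbg: %s", msg);
}

static void sketch_log_warn(char *out, size_t len, const char *msg)
{
	snprintf(out, len, "warn: %s", msg);
}

static void sketch_log_err(char *out, size_t len, const char *msg)
{
	snprintf(out, len, "err: %s", msg);
}

/* If cond is true, downgrade to debug; otherwise log at the given level. */
#define sketch_cond_dbg(cond, level, out, len, msg)		\
	do {							\
		if (cond)					\
			sketch_log_dbg(out, len, msg);		\
		else						\
			sketch_log_##level(out, len, msg);	\
	} while (0)
```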

Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()

2017-01-26 Thread Arnaldo Carvalho de Melo

On Wed, Jan 25, 2017 at 10:18:22AM +0800, Wangnan (F) wrote:
> On 2017/1/25 9:16, Joe Stringer wrote:
> > On 24 January 2017 at 17:06, Wangnan (F)  wrote:
> > > On 2017/1/25 9:04, Wangnan (F) wrote:
> > > Is it possible to use directory tree instead?

> > > %s/object/mapname
> > > %s/object/prog/instance
> > I don't think objects have names, so let's assume an object with two
> > program instances named foo, and one map named bar.

> > A call of bpf_object__pin(obj, "/sys/fs/bpf/myobj") would mount with
> > the following files and directories:
> > /sys/fs/bpf/myobj/foo/1
> > /sys/fs/bpf/myobj/foo/2
> > /sys/fs/bpf/myobj/bar

> > Alternatively, if you want to control exactly where you want the
> > progs/maps to be pinned, you can call eg
> > bpf_program__pin_instance(prog, "/sys/fs/bpf/wherever", 0) and that
> > instance will be mounted to /sys/fs/bpf/wherever, or alternatively
> > bpf_program__pin(prog, "/sys/fs/bpf/foo"), and you will end up with
> > /sys/fs/bpf/foo/{0,1}.

> > This looks pretty reasonable to me.

> It looks good to me.

Ok, please continue from perf/core, Ingo merged the first patch of this
patchset today,

- Arnaldo

Re: [PATCH 0/7] pull request for net-next: batman-adv 2017-01-26

2017-01-26 Thread David Miller

From: Simon Wunderlich 
Date: Thu, 26 Jan 2017 17:43:57 +0100

> this is the updated version of yesterday's feature/cleanup pull request for
> batman-adv which should go into net-next. We've followed your suggestion
> regarding the NET_XMIT_CN handling and modified Gao's patch accordingly.
> 
> The remaining patches are untouched.
> 
> Please pull or let me know of any problem!

Pulled, thanks.

Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled

2017-01-26 Thread Eric Dumazet

On Thu, 2017-01-26 at 20:19 +0100, Pablo Neira Ayuso wrote:

> Right. This is not percpu as in IPv4.
> 
> I can send a follow up patch to get this in sync with the way we do it
> in IPv4, ie. add percpu socket.
> 
> Fine with this approach? Thanks!

Not really.

percpu sockets are going to slow down network namespace creation /
deletion and increase memory footprint.

IPv6 is cleaner because it does not really have to use different
sockets.

Ultimately we would like to have the same for IPv4.

I would rather carry the mark either in an additional parameter,
or in the flow that is already passed as a parameter.
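
A minimal userspace sketch of that alternative, assuming simplified stand-in types: the mark travels in the per-call flow description, and the shared control socket is only read, never written. The sketch_ names are hypothetical and do not correspond to the real ip6_xmit() signature.

```c
#include <stdint.h>

/* Per-call flow description; each caller owns its own copy. */
struct sketch_flow {
	uint32_t mark;
};

/* Control socket shared by all CPUs; treated as read-only on this path. */
struct sketch_sock {
	uint32_t sk_mark;
};

struct sketch_skb {
	uint32_t mark;
};

/*
 * Takes the mark from the flow argument, so concurrent callers using the
 * same socket cannot overwrite each other's value.
 */
static void sketch_xmit(const struct sketch_sock *sk, struct sketch_skb *skb,
			const struct sketch_flow *fl)
{
	(void)sk;	/* would supply other, unchanging settings */
	skb->mark = fl->mark;
}
```

Because nothing is stored in the shared socket, two replies sent at the same time from different CPUs each keep the mark of their own original packet.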

Re: [PATCH net v2 3/3] gtp: fix cross netns recv on gtp socket

2017-01-26 Thread David Miller

From: Andreas Schultz 
Date: Thu, 26 Jan 2017 16:11:34 +0100

> The use of the passed through netlink src_net to check for a
> cross netns operation was wrong. Using the GTP socket and the
> GTP netdevice is always correct (even if the netdev has been
> moved to new netns after link creation).
> 
> Remove the now obsolete net field from gtp_dev.
> 
> Signed-off-by: Andreas Schultz 

Please at least compile test your submissions:

drivers/net/gtp.c: In function ‘gtp_newlink’:
drivers/net/gtp.c:677:8: error: too many arguments to function ‘gtp_encap_enable’
  err = gtp_encap_enable(dev, gtp, fd0, fd1, src_net);
^
drivers/net/gtp.c:659:12: note: declared here
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
^
drivers/net/gtp.c: At top level:
drivers/net/gtp.c:822:12: error: conflicting types for ‘gtp_encap_enable’
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
^
drivers/net/gtp.c:659:12: note: previous declaration of ‘gtp_encap_enable’ was here
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
^
drivers/net/gtp.c:659:12: warning: ‘gtp_encap_enable’ used but never defined
drivers/net/gtp.c:822:12: warning: ‘gtp_encap_enable’ defined but not used [-Wunused-function]
 static int gtp_encap_enable(struct net_device *dev, struct gtp_dev *gtp,
^
scripts/Makefile.build:299: recipe for target 'drivers/net/gtp.o' failed

Re: [BUG/RFC] vhost: net: big endian viring access despite virtio 1

2017-01-26 Thread Michael S. Tsirkin

On Thu, Jan 26, 2017 at 06:39:14PM +0100, Halil Pasic wrote:
> 
> Hi!
> 
> Recently I have been investigating some strange migration problems on
> s390x.
> 
> It turned out under certain circumstances vhost_net corrupts avail.idx by
> using wrong endianness.
> 
> I managed to track the problem down (I'm pretty sure). It boils down to
> the following.
> 
> When stopping vhost userspace (QEMU) calls vhost_net_set_backend with
> the fd argument set to -1, this leads to is_le being reset in
> vhost_vq_init_access.  On a BE system resetting to legacy means resetting
> to BE. Usually this is not a problem, but in the case when oldubufs is not
> zero (which is not likely if no network stress applied) it is a problem.
> That code path needs to write avail.idx, and ends up using wrong
> endianness when doing that (but only on a BE system).
> 
> That is the story in prose, now let's see the corresponding code annotated
> with some comments.
> 
> from drivers/vhost/net.c:
> static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> {
> /* [..] some not too interesting stuff  */
> sock = get_socket(fd);
> /* fd == -1 --> sock == NULL */
> if (IS_ERR(sock)) {
> r = PTR_ERR(sock);
> goto err_vq;
> }
> 
> /* start polling new socket */
> oldsock = vq->private_data;
> if (sock != oldsock) {
> ubufs = vhost_net_ubuf_alloc(vq,
>  sock && vhost_sock_zcopy(sock));
> if (IS_ERR(ubufs)) {
> r = PTR_ERR(ubufs);
> goto err_ubufs;
> }
> 
> vhost_net_disable_vq(n, vq);
> ==> vq->private_data = sock; 
> /* now vq->private_data is NULL */
> ==> r = vhost_vq_init_access(vq);
> if (r)
> goto err_used;
> /* vq endianness has been reset to BE on s390 */
> r = vhost_net_enable_vq(n, vq);
> if (r)
> goto err_used;
> 
> ==> oldubufs = nvq->ubufs;
> /* here oldubufs might become != 0 */   
> nvq->ubufs = ubufs;
> 
> n->tx_packets = 0;
> n->tx_zcopy_err = 0;
> n->tx_flush = false;
> }
> mutex_unlock(&vq->mutex);
> 
> if (oldubufs) {
> vhost_net_ubuf_put_wait_and_free(oldubufs);
> mutex_lock(&vq->mutex);
> ==> vhost_zerocopy_signal_used(n, vq);
> /* tries to update virtqueue structures; endianness is BE on s390
>  * now (but should be LE for virtio-1) */
> mutex_unlock(&vq->mutex);
> }
> /*[..] rest of the function */
> }

> 
> from drivers/vhost/vhost.c:
> 
> int vhost_vq_init_access(struct vhost_virtqueue *vq)
> {
> __virtio16 last_used_idx;
> int r;
> bool is_le = vq->is_le;
> 
> if (!vq->private_data) {
> ==> vhost_reset_is_le(vq);
> /* resets to native endianness and returns */
> return 0;
> }
> 
> ==>  vhost_init_is_le(vq);
> /* here we init is_le */
> 
> r = vhost_update_used_flags(vq);
> if (r)
> goto err;
> vq->signalled_used_valid = false;
> if (!vq->iotlb &&
> !access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx)) {
> r = -EFAULT;
> goto err;
> }
> r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
> if (r) {
> vq_err(vq, "Can't access used idx at %p\n",
>        &vq->used->idx);
> goto err;
> }
> vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
> return 0;
> 
> err:
> vq->is_le = is_le;
> return r;
> }
> 
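
To illustrate why a wrong is_le value corrupts a 16-bit ring index, here is a small userspace sketch. It works on raw bytes so the result does not depend on the host; sketch_ring16_encode() and sketch_ring16_decode() are hypothetical helpers that only model what the endianness-aware accessors do, not the vhost code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stores val into the ring as little endian (is_le) or big endian. */
static void sketch_ring16_encode(uint8_t bytes[2], uint16_t val, bool is_le)
{
	if (is_le) {
		bytes[0] = (uint8_t)(val & 0xff);
		bytes[1] = (uint8_t)(val >> 8);
	} else {
		bytes[0] = (uint8_t)(val >> 8);
		bytes[1] = (uint8_t)(val & 0xff);
	}
}

/* Reads a ring value as little endian (is_le) or big endian. */
static uint16_t sketch_ring16_decode(const uint8_t bytes[2], bool is_le)
{
	return is_le ? (uint16_t)(bytes[0] | (bytes[1] << 8))
		     : (uint16_t)((bytes[0] << 8) | bytes[1]);
}
```

An index of 1 written by one side as little endian reads back as 256 if the other side has been reset to big endian, which matches the kind of unreasonable avail.idx value described in this report.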
> AFAIU this can be fixed very simply by omitting the reset. Below the
> patch. I'm not sure though, the endianness handling ain't simple in
> vhost. Am I going in the right direction?
> 
> We have a test (on s390x only) running for a couple of hours now and so
> far so good but it's a bit early to announce it is tested for s390x.
> If the patch is reasonable, I intend to do a version with proper
> attribution and stuff.
> 
> By the way the reset was first introduced by 
> https://lkml.org/lkml/2015/4/10/226 (dug it up in the hope that reasons
> for the reset were discussed -- but no enlightenment for me).
> 
> Finally I would like to credit Dave Gilbert for hinting that the
> unreasonable avail.idx may be due to an endianness problem and Michael A.
> Tebolt for reporting the bug and testing.
> 
> -8<--
> From b26e2bbdc03832a0204ee2b42967a1b49a277dc8 Mon Sep 17 00:00:00 2001
> From: Halil Pasic 
> Date: Thu, 26 Jan 2017 00:06:15 +0100
> Subject: [PATCH] vhost: remove useless/dangerous reset of is_le
> 
> The reset of is_le does no good, but it

Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled

2017-01-26 Thread Pablo Neira Ayuso

On Thu, Jan 26, 2017 at 10:02:40AM -0800, Eric Dumazet wrote:
> On Thu, 2017-01-26 at 17:37 +0100, Pablo Neira Ayuso wrote:
> > From: Pau Espin Pedrol 
> > 
> > Otherwise, RST packets generated by the TCP stack for non-existing
> > sockets always have mark 0.
> > The mark from the original packet is assigned to the netns_ipv4/6
> > socket used to send the response so that it can get copied into the
> > response skb when the socket sends it.
> > 
> > Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies")
> > Cc: Lorenzo Colitti 
> > Signed-off-by: Pau Espin Pedrol 
> > Signed-off-by: Pablo Neira Ayuso 
> > ---
> >  net/ipv4/ip_output.c | 1 +
> >  net/ipv6/tcp_ipv6.c  | 1 +
> >  2 files changed, 2 insertions(+)
> > 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index fac275c48108..b67719f45953 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1629,6 +1629,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
> > sk->sk_protocol = ip_hdr(skb)->protocol;
> > sk->sk_bound_dev_if = arg->bound_dev_if;
> > sk->sk_sndbuf = sysctl_wmem_default;
> > +   sk->sk_mark = fl4.flowi4_mark;
> > err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base,
> >  len, 0, &ipc, &rt, MSG_DONTWAIT);
> > if (unlikely(err)) {
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 73bc8fc68acd..2b20622a5824 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -840,6 +840,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
> > dst = ip6_dst_lookup_flow(ctl_sk, &fl6, NULL);
> > if (!IS_ERR(dst)) {
> > skb_dst_set(buff, dst);
> > +   ctl_sk->sk_mark = fl6.flowi6_mark;
> > ip6_xmit(ctl_sk, buff, &fl6, NULL, tclass);
> > TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
> > if (rst)
> 
> 
> This patch is wrong.
> 
> ctl_sk is a shared socket, and tcp_v6_send_response() can be called from
> many different cpus at the same time.

Right. This is not percpu as in IPv4.

I can send a follow up patch to get this in sync with the way we do it
in IPv4, ie. add percpu socket.

Fine with this approach? Thanks!

Re: [PATCH] [net-next] ISDN: eicon: reduce stack size of sig_ind function

2017-01-26 Thread David Miller

From: Arnd Bergmann 
Date: Wed, 25 Jan 2017 23:15:53 +0100

> I noticed that this function uses a lot of kernel stack when the
> "latent entropy" plugin is enabled:
> 
> drivers/isdn/hardware/eicon/message.c: In function 'sig_ind':
> drivers/isdn/hardware/eicon/message.c:6113:1: error: the frame size of 1168 bytes is larger than 1152 bytes [-Werror=frame-larger-than=]
> 
> We currently don't warn about this, as we raise the warning limit
> to 2048 bytes in mainline, but I'd like to lower that limit again
> in the future, and this function can easily be changed to be more
> efficient and avoid that warning, by making some of its local
> variables 'const'.
> 
> Signed-off-by: Arnd Bergmann 

Applied, thanks Arnd.

Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Andy Lutomirski

On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov  wrote:
> On 1/26/17 10:12 AM, Andy Lutomirski wrote:
>>
>> On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov  wrote:
>>>
>>> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>>>
>>>>> Think of bpf programs as safe kernel modules. They don't have
>>>>> confined boundaries and program authors, if not careful, can shoot
>>>>> themselves in the foot. We're not trying to prevent that because
>>>>> it's impossible to check that the program is sane. Just like
>>>>> it's impossible to check that kernel module is sane.
>>>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>>>> point of view. If it's doing some garbage, it's program's business.
>>>>> Does it make more sense now?
>>>>
>>>> With all due respect, I think this is not an acceptable way to think
>>>> about BPF at all.  If you think of BPF this way, I think there needs
>>>> to be a real discussion at KS or similar as to whether this is okay.
>>>> The reason is simple: the kernel promises a stable ABI to userspace
>>>> but not to kernel modules.  By thinking of BPF as more like a module,
>>>> you're taking a big shortcut that will either result in ABI breakage
>>>> down the road or in committing to a problematic stable ABI.
>>>
>>>
>>>
>>> you misunderstood the analogy.
>>> bpf abi is certainly stable. that's why we were careful of not
>>> exposing anything to it that is not already stable.
>>>
>>
>> In that case I don't understand what you're trying to say.  Eric
>> thinks your patch exposes a bad interface.  A bad interface for
>> userspace is a very different thing from a bad interface available to
>> kernel modules.  Are you saying that BPF is kernel-module-like in that
>> the ABI exposed to BPF programs doesn't need to meet the same quality
>> standards as userspace ABIs?
>
>
> of course not.
> ns.inum is already exposed to user space as a value.
> This patch exposes it to bpf program in a convenient and stable way,

Here's what I'm imagining Eric is thinking:

ns.inum is currently exposed to userspace via procfs.  In principle,
the value could be local to a namespace, though, which would enable
CRIU to be able to preserve namespace inode numbers across a
checkpoint+restore operation.  If this happened, the contained and
restored procfs would see a different inode number than the outermost
procfs.

If you start exposing the raw ns.inum field to BPF programs and those
programs are not themselves scoped to a namespace, then this could
create a problem for CRIU.

But you told Eric that his nack doesn't matter, and maybe it would be
nice to ask him to clarify instead.

Re: [PATCH net 1/5] ibmvnic: harden interrupt handler

2017-01-26 Thread Thomas Falcon

On 01/26/2017 12:28 PM, Stephen Hemminger wrote:
> On Wed, 25 Jan 2017 15:02:19 -0600
> Thomas Falcon  wrote:
>
>>  static irqreturn_t ibmvnic_interrupt(int irq, void *instance)
>>  {
>>  struct ibmvnic_adapter *adapter = instance;
>> +unsigned long flags;
>> +
>> +spin_lock_irqsave(&adapter->crq.lock, flags);
>> +vio_disable_interrupts(adapter->vdev);
>> +tasklet_schedule(&adapter->tasklet);
>> +spin_unlock_irqrestore(&adapter->crq.lock, flags);
>> +return IRQ_HANDLED;
>> +}
>> +
> Why not use NAPI rather than a tasklet?
>
This interrupt function doesn't process packets; it handles message passing between 
firmware and driver for determining device capabilities and available 
resources, such as the number of TX and RX queues.

[PATCH] macvtap: Use kmalloc_array() in macvtap_queue_resize()

2017-01-26 Thread SF Markus Elfring

From: Markus Elfring 
Date: Thu, 26 Jan 2017 19:47:38 +0100

A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/macvtap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 5c26653eceb5..1796d8f1ef47 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1243,7 +1243,7 @@ static int macvtap_queue_resize(struct macvlan_dev *vlan)
int n = vlan->numqueues;
int ret, i = 0;
 
-   arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL);
+   arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL);
if (!arrays)
return -ENOMEM;
 
-- 
2.11.0
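
For context, the practical difference between the two allocation calls is the overflow check on the multiplication. A userspace sketch of that check, assuming malloc() in place of the kernel allocator and using the hypothetical name sketch_malloc_array():

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocates room for n elements of the given size, but refuses when
 * n * size would wrap around instead of silently allocating too little.
 */
static void *sketch_malloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```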

Re: NAPI on USB network drivers

2017-01-26 Thread Oliver Hartkopp


On 01/25/2017 09:57 PM, Alexander Duyck wrote:
> On Wed, Jan 25, 2017 at 5:33 AM, Oliver Hartkopp wrote:
>
> You could probably get around the o-o-o problem by enabling RPS for
> the interface.  I have found that it works for me to do that in order
> to resolve o-o-o frames generated by VMs on virtual interfaces that
> can't use NAPI.


Sounds promising!

In fact I tried to fix this o-o-o stuff here:

http://marc.info/?l=linux-can&m=143637774606287&w=2

Do you have an example how to do it right?

Best regards,
Oliver

Re: [PATCH net 1/5] ibmvnic: harden interrupt handler

2017-01-26 Thread Thomas Falcon

On 01/26/2017 11:56 AM, David Miller wrote:
> From: Thomas Falcon 
> Date: Thu, 26 Jan 2017 10:44:22 -0600
>
>> On 01/25/2017 10:04 PM, David Miller wrote:
>>> From: Thomas Falcon 
>>> Date: Wed, 25 Jan 2017 15:02:19 -0600
>>>
>>>> Move most interrupt handler processing into a tasklet, and
>>>> introduce a delay after re-enabling interrupts to fix timing
>>>> issues encountered in hardware testing.
>>>>
>>>> Signed-off-by: Thomas Falcon 
>>> I don't think you have any idea what the real problem is.  This looks
>>> like a hack, at best.  Next patch you'll increase the delay to "20",
>>> right?  And if that doesn't work you'll try "40".
>>>
>>> Or if you do know the reason, you need to explain it in detail in this
>>> commit message so that we can properly evaluate this patch.
>> You're right, I should have given more explanation in the commit message 
>> about the bug encountered and our reasons for this sort of fix.  The issue 
>> is that there are some scenarios during the device init where multiple 
>> messages are sent by firmware in one interrupt request. 
>>
>> We have observed behavior where messages are delayed, resulting in the 
>> interrupt handler completing before the delayed messages can be processed, 
>> fouling up the device bring-up in the device probing and elsewhere.  The 
>> goal of the delay is to buy some time for the hypervisor to forward all the 
>> CRQ messages from the firmware.
> Then isn't there an event you can sleep and wait for, or a piece of state for
> you to poll and test for in a timeout based loop?
>
> This delay is a bad solution for the problem of waiting for A to happen
> before you do B.
>
Understood.  We will come up with a better fix.  Thanks for your attention.
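
One possible shape for the suggestion above, sketched in userspace: poll a piece of state in a bounded loop instead of sleeping for a fixed time. All names are hypothetical, the bound is a poll count instead of a real timeout so the example stays deterministic, and the condition function simulates one firmware message arriving per poll. A real driver would sleep between polls or wait on a completion.

```c
#include <stdbool.h>

/* Predicate that reports whether the awaited event has happened. */
typedef bool (*sketch_cond_fn)(void *ctx);

/*
 * Polls cond up to max_polls times. Returns 0 as soon as it reports true,
 * or -1 if the bound is reached first.
 */
static int sketch_poll_until(sketch_cond_fn cond, void *ctx,
			     unsigned int max_polls)
{
	unsigned int i;

	for (i = 0; i < max_polls; i++) {
		if (cond(ctx))
			return 0;
	}
	return -1;
}

/* Hypothetical state: how many messages have arrived so far. */
struct sketch_crq_state {
	unsigned int received;
	unsigned int expected;
};

/* Simulates one more message arriving on each poll, then checks the count. */
static bool sketch_all_msgs_arrived(void *ctx)
{
	struct sketch_crq_state *st = ctx;

	if (st->received < st->expected)
		st->received++;
	return st->received >= st->expected;
}
```

The caller proceeds only once the state it depends on is actually present, and it gets an explicit error when that never happens, instead of guessing a delay.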

[PATCH net-next v2 0/4] net: dsa: Preparatory patches

2017-01-26 Thread Florian Fainelli

Hi David,

This patch series extracts the 4 patches of the larger: net: dsa: Support for
pdata in dsa2 while we wait for feedback from Greg KH on the device references.

Changes in v2:

- rebased properly after the multi-MDIO bus support added to mv88e6xxx

Thanks!

Florian Fainelli (4):
  net: dsa: Pass device pointer to dsa_register_switch
  net: dsa: Make most functions take a dsa_port argument
  net: dsa: Suffix function manipulating device_node with _dn
  net: dsa: Move ports assignment closer to error checking

 drivers/net/dsa/b53/b53_common.c |  2 +-
 drivers/net/dsa/mv88e6xxx/chip.c | 11 ++---
 drivers/net/dsa/qca8k.c  |  2 +-
 include/net/dsa.h|  2 +-
 net/dsa/dsa.c| 15 ---
 net/dsa/dsa2.c   | 87 ++--
 net/dsa/dsa_priv.h   |  4 +-
 7 files changed, 67 insertions(+), 56 deletions(-)

-- 
2.9.3
