Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 08:28, Johannes Berg  wrote:
> On Sat, 2016-10-15 at 18:16 +0100, Ard Biesheuvel wrote:
>> The CCM code goes out of its way to perform the CTR encryption of the
>> MAC using the subordinate CTR driver. To this end, it tweaks the
>> input and output scatterlists so the aead_req 'odata' and/or
>> 'auth_tag' fields [which may live on the stack] are prepended to the
>> CTR payload. This involves calling sg_set_buf() on addresses which
>> are not direct mapped, which is not supported.
>
>> Since the calculation of the MAC keystream involves a single call
>> into the cipher, to which we have a handle already given that the
>> CBC-MAC calculation uses it as well, just calculate the MAC keystream
>> directly, and record it in the aead_req private context so we can
>> apply it to the MAC in cypto_ccm_auth_mac(). This greatly simplifies
>> the scatterlist manipulation, and no longer requires scatterlists to
>> refer to buffers that may live on the stack.
>
> No objection from me, Herbert?
>
> I'm getting a bit nervous though - I'd rather have any fix first so
> people get things working again - so maybe I'll apply your other patch
> and mine first, and then we can replace yours by this later.
>

Could we get a statement first whether it is supported to allocate
aead_req (and other crypto req structures) on the stack? If not, then
we have our work cut out for us. But if it is, I'd rather we didn't
apply the kzalloc/kfree patch, since it is just a workaround for the
broken generic CCM driver, for which a fix is already available.

Also, regarding your __percpu patch: those are located in the vmalloc
area as well, at least on arm64, and likely other architectures too.


Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Johannes Berg
On Mon, 2016-10-17 at 08:37 +0100, Ard Biesheuvel wrote:
> 
> Could we get a statement first whether it is supported to allocate
> aead_req (and other crypto req structures) on the stack? 

Well, we haven't heard from Herbert :)

> If not, then
> we have our work cut out for us. But if it is, I'd rather we didn't
> apply the kzalloc/kfree patch, since it is just a workaround for the
> broken generic CCM driver, for which a fix is already available.

Yeah but I can't apply it. I just fixed up your kzalloc patch to also
handle GCM and GMAC, and to have error checking. Will send it in a
minute.

> Also, regarding your __percpu patch: those are located in the vmalloc
> area as well, at least on arm64, and likely other architectures too.

Crap. Any other bright ideas?

johannes


Re: hello

2016-10-17 Thread Alexander Alemayhu
On Fri, Oct 14, 2016 at 05:18:10PM +0800, maowenan wrote:
> i want to subscribe this mail, thank you very much.
> 

Try sending subscribe netdev to majord...@vger.kernel.org


Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Johannes Berg
On Mon, 2016-10-17 at 08:50 +0100, Ard Biesheuvel wrote:

> I just realised that patch should probably use
> aead_request_alloc/aead_request_free [and drop the memset]. That also
> fixes the latent bug where the alignment of the req ctx is not take
> into account.

Good point, I'll fix that up.

> > 
> > > 
> > > Also, regarding your __percpu patch: those are located in the
> > > vmalloc
> > > area as well, at least on arm64, and likely other architectures
> > > too.
> > 
> > Crap. Any other bright ideas?
> > 
> 
> kmem_cache_create() and kmem_cache_alloc()

for the aad/b0/j0/etc? Hmm. Yeah I guess we should do those dynamically
as well then.

johannes


Re: [PATCH v3] mac80211: move struct aead_req off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 09:03, Johannes Berg  wrote:
> From: Johannes Berg 
>
> Some crypto implementations (such as the generic CCM wrapper in crypto/)
> use scatterlists to map fields of private data in their struct aead_req.
> This means these data structures cannot live in the vmalloc area, which
> means that they cannot live on the stack (with CONFIG_VMAP_STACK.)
>
> This currently occurs only with the generic software implementation, but
> the private data and usage is implementation specific, so move the whole
> data structures off the stack into heap by allocating every time we need
> to use them.
>
> This pattern already exists in the IPsec ESP driver, but in the future,
> we may want/need to improve upon this, e.g. by using a slab cache.
>
> (Based on Ard's patch, but that was missing error handling and GCM/GMAC)
>
> Signed-off-by: Ard Biesheuvel 
> Signed-off-by: Johannes Berg 
> ---
> v3: remove superfluous aead_request_set_tfm() calls

Reviewed-by: Ard Biesheuvel 

> ---
>  net/mac80211/aes_ccm.c  | 36 +++-
>  net/mac80211/aes_ccm.h  |  6 +++---
>  net/mac80211/aes_gcm.c  | 33 +
>  net/mac80211/aes_gcm.h  |  4 ++--
>  net/mac80211/aes_gmac.c | 11 +--
>  net/mac80211/wpa.c  | 12 
>  6 files changed, 50 insertions(+), 52 deletions(-)
>
> diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
> index 7663c28ba353..8e898a6e8de8 100644
> --- a/net/mac80211/aes_ccm.c
> +++ b/net/mac80211/aes_ccm.c
> @@ -18,29 +18,29 @@
>  #include "key.h"
>  #include "aes_ccm.h"
>
> -void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> -  u8 *data, size_t data_len, u8 *mic,
> -  size_t mic_len)
> +int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> + u8 *data, size_t data_len, u8 *mic,
> + size_t mic_len)
>  {
> struct scatterlist sg[3];
> +   struct aead_request *aead_req;
>
> -   char aead_req_data[sizeof(struct aead_request) +
> -  crypto_aead_reqsize(tfm)]
> -   __aligned(__alignof__(struct aead_request));
> -   struct aead_request *aead_req = (void *) aead_req_data;
> -
> -   memset(aead_req, 0, sizeof(aead_req_data));
> +   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
> +   if (!aead_req)
> +   return -ENOMEM;
>
> sg_init_table(sg, 3);
> sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
> sg_set_buf([1], data, data_len);
> sg_set_buf([2], mic, mic_len);
>
> -   aead_request_set_tfm(aead_req, tfm);
> aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
> aead_request_set_ad(aead_req, sg[0].length);
>
> crypto_aead_encrypt(aead_req);
> +   aead_request_free(aead_req);
> +
> +   return 0;
>  }
>
>  int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> @@ -48,26 +48,28 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
> *b_0, u8 *aad,
>   size_t mic_len)
>  {
> struct scatterlist sg[3];
> -   char aead_req_data[sizeof(struct aead_request) +
> -  crypto_aead_reqsize(tfm)]
> -   __aligned(__alignof__(struct aead_request));
> -   struct aead_request *aead_req = (void *) aead_req_data;
> +   struct aead_request *aead_req;
> +   int err;
>
> if (data_len == 0)
> return -EINVAL;
>
> -   memset(aead_req, 0, sizeof(aead_req_data));
> +   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
> +   if (!aead_req)
> +   return -ENOMEM;
>
> sg_init_table(sg, 3);
> sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
> sg_set_buf([1], data, data_len);
> sg_set_buf([2], mic, mic_len);
>
> -   aead_request_set_tfm(aead_req, tfm);
> aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
> aead_request_set_ad(aead_req, sg[0].length);
>
> -   return crypto_aead_decrypt(aead_req);
> +   err = crypto_aead_decrypt(aead_req);
> +   aead_request_free(aead_req);
> +
> +   return err;
>  }
>
>  struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
> diff --git a/net/mac80211/aes_ccm.h b/net/mac80211/aes_ccm.h
> index 6a73d1e4d186..03e21b0995e3 100644
> --- a/net/mac80211/aes_ccm.h
> +++ b/net/mac80211/aes_ccm.h
> @@ -15,9 +15,9 @@
>  struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
> size_t key_len,
> size_t mic_len);
> -void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> -  u8 *data, 

Re: net/ipv6: potential deadlock in do_ipv6_setsockopt

2016-10-17 Thread Baozeng Ding
Applied the patch to my test tree. I will tell you the result a few days later. 
Thank you.

On 2016/10/17 2:50, Cong Wang wrote:
> On Sun, Oct 16, 2016 at 6:34 AM, Baozeng Ding  wrote:
>>  Possible unsafe locking scenario:
>>
>>CPU0CPU1
>>
>>   lock([  165.136033] sk_lock-AF_INET6
>> );
>>lock([  165.136033] rtnl_mutex
>> );
>>lock([  165.136033] sk_lock-AF_INET6
>> );
>>   lock([  165.136033] rtnl_mutex
>> );
>>
>>  *** DEADLOCK ***
> 
> This is caused by the conditional rtnl locking in do_ipv6_setsockopt().
> It looks like we miss the case of IPV6_ADDRFORM.
> 
> Please try the attached patch.
> 


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Johannes Berg

> > Well, if your other patch to make it OK to be on-stack would be
> > applied instead, this wouldn't make much sense either :)
> > 
> 
> Yes but that one only fixes ccm not gcm

Yes, but we can do the same for GCM, no?

> > In this particular patch, we could reduce the size of the struct,
> > but I
> > don't actually think it'll make a difference to go from 48 to 36
> > bytes,
> > given alignment etc., so I think I'll just leave it as is.
> > 
> 
> I understand you are in a hurry, but this is unlikely to be the only
> such issue. I will propose an 'auxdata' feature for the crypto api
> that can be used here, but also for any other occurrence where client
> data assoiciated with the request can no longer be allocated on the
> stack

No objections. I'll merge this anyway today I think, reverting is easy
later.

johannes


[PATCH] net: hip04: Remove superfluous ether_setup after alloc_etherdev

2016-10-17 Thread Tobias Klauser
There is no need to call ether_setup after alloc_ethdev since it was
already called there.

Signed-off-by: Tobias Klauser 
---
 drivers/net/ethernet/hisilicon/hip04_eth.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c 
b/drivers/net/ethernet/hisilicon/hip04_eth.c
index 3c99ca420f57..854befde0a08 100644
--- a/drivers/net/ethernet/hisilicon/hip04_eth.c
+++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
@@ -897,7 +897,6 @@ static int hip04_mac_probe(struct platform_device *pdev)
 
INIT_WORK(>tx_timeout_task, hip04_tx_timeout_task);
 
-   ether_setup(ndev);
ndev->netdev_ops = _netdev_ops;
ndev->ethtool_ops = _ethtool_ops;
ndev->watchdog_timeo = TX_TIMEOUT;
-- 
2.9.0




Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 09:33, Johannes Berg  wrote:
> From: Johannes Berg 
>
> As the stack can (on x86-64) now be virtually mapped rather than
> using "normal" kernel memory, Sergey noticed mac80211 isn't using
> the SG APIs correctly by putting on-stack buffers into SG tables.
> This leads to kernel crashes.
>
> Fix this by allocating the extra fields dynamically on the fly as
> needed, using a kmem cache.
>
> I used per-CPU memory in a previous iteration of this patch, but
> Ard Biesheuvel pointed out that was also vmalloc'ed on some
> architectures.
>
> Reported-by: Sergey Senozhatsky 
> Signed-off-by: Johannes Berg 

Apologies for going back and forth on this, but it appears there may
be another way to deal with this.

First of all, we only need this handling for the authenticated data,
and only for CCM and GCM, not CMAC (which does not use scatterlists at
all, it simply calls the AES cipher directly)

So that leaves a fixed 20 bytes for GCM and fixed 32 bytes for CCM,
which we could allocate along with the AEAD request, e..g.,

"""
diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 8e898a6e8de8..c0c33e6ad94e 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -24,13 +24,17 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead
*tfm, u8 *b_0, u8 *aad,
 {
struct scatterlist sg[3];
struct aead_request *aead_req;
+   u8 *__aad;

aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
if (!aead_req)
return -ENOMEM;

+   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
+   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
+
sg_init_table(sg, 3);
-   sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
+   sg_set_buf([0], &__aad[2], be16_to_cpup((__be16 *)__aad));
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, mic_len);

@@ -49,6 +53,7 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead
*tfm, u8 *b_0, u8 *aad,
 {
struct scatterlist sg[3];
struct aead_request *aead_req;
+   u8 *__aad;
int err;

if (data_len == 0)
@@ -58,8 +63,11 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead
*tfm, u8 *b_0, u8 *aad,
if (!aead_req)
return -ENOMEM;

+   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
+   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
+
sg_init_table(sg, 3);
-   sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
+   sg_set_buf([0], &__aad[2], be16_to_cpup((__be16 *)__aad));
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, mic_len);

@@ -90,6 +98,8 @@ struct crypto_aead
*ieee80211_aes_key_setup_encrypt(const u8 key[],
if (err)
goto free_aead;

+   crypto_aead_set_reqsize(tfm,
+   crypto_aead_reqsize(tfm) + 2 * AES_BLOCK_SIZE));
return tfm;

 free_aead:
"""


Re: Need help with mdiobus_register and phy

2016-10-17 Thread Zefir Kurtisi
On 10/15/2016 08:28 PM, Timur Tabi wrote:
> Andrew Lunn wrote:
>> 1) Take the SerDes power down out of the suspend code for the at803x.
>>
>> 2) Assume MII_PHYID1/2 registers are not guaranteed to be available
>> when the PHY is powered down. So get_phy_id should first read
>> MII_BMCR. If it gets 0x, assume there is no PHY there. If the
>> PDOWN bit is set, power up the PHY. Then reading the ID registers.
> 
> Before we take approach #1, I'd like to hear from the developer of that patch,
> Zefir.  According to him, that patch is necessary to fix a bug. I don't know 
> if
> that bug exists only on his system, though.
> 

Hi,

the problem we observed was that in SGMII mode after a suspend/resume cycle the
PHY link states get out of sync (e.g. gianfar reports link up, at803x down) and
power cycling the SerDes was a reliable way to get them re-synchronized again.

This obviously worked for some time after March 2016 and must have stopped only
recently (maybe additional PHY suspends were added).

Anyway, since the SGMII reset is required, instead of reverting the patch in 
full
I suggest to move the SGMII power down from at803x_suspend() and do a SerDes 
power
cycle in at803x_resume(). Could you please test if the patch below fixes the 
problem?


Thanks,
Zefir
---



Subject: [PATCH] at803x: don't power down SerDes on suspend

Powering down SerDes renders registers inaccessible
and with that re-initializing a suspended PHY fails.

Instead of powering down SerDes at suspend(), leave
it as is and power-cycle it on resume().

Signed-off-by: Zefir Kurtisi 
---
 drivers/net/phy/at803x.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index f279a89..282ffbb 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -225,16 +225,6 @@ static int at803x_suspend(struct phy_device *phydev)

phy_write(phydev, MII_BMCR, value);

-   if (phydev->interface != PHY_INTERFACE_MODE_SGMII)
-   goto done;
-
-   /* also power-down SGMII interface */
-   ccr = phy_read(phydev, AT803X_REG_CHIP_CONFIG);
-   phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr & ~AT803X_BT_BX_REG_SEL);
-   phy_write(phydev, MII_BMCR, phy_read(phydev, MII_BMCR) | BMCR_PDOWN);
-   phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr | AT803X_BT_BX_REG_SEL);
-
-done:
mutex_unlock(>lock);

return 0;
@@ -254,9 +244,11 @@ static int at803x_resume(struct phy_device *phydev)
if (phydev->interface != PHY_INTERFACE_MODE_SGMII)
goto done;

-   /* also power-up SGMII interface */
+   /* reset SGMII interface for re-synchronization */
ccr = phy_read(phydev, AT803X_REG_CHIP_CONFIG);
phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr & ~AT803X_BT_BX_REG_SEL);
+   phy_write(phydev, MII_BMCR, phy_read(phydev, MII_BMCR) | BMCR_PDOWN);
+
value = phy_read(phydev, MII_BMCR) & ~(BMCR_PDOWN | BMCR_ISOLATE);
phy_write(phydev, MII_BMCR, value);
phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr | AT803X_BT_BX_REG_SEL);
-- 
2.7.4



Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 10:23, Johannes Berg  wrote:
>
>> Apologies for going back and forth on this, but it appears there may
>> be another way to deal with this.
>>
>> First of all, we only need this handling for the authenticated data,
>
> Are you sure b_0/j_0 aren't needed? We pass those
> to aead_request_set_crypt(), and I wasn't sure what that really did
> internally, perhaps like the internal data.
>

They are the IV[], which is a fixed length parameter of the algorithm.
In contrast, the AAD[] could be of arbitrary length (from the POV of
the crypto API) so it uses scatterlists.

> Testing with that on the stack does seem to work, in fact.
>
> Surely we need zero for GMAC though, since we also put that into the sg
> list. Thus for GMAC we definitely need 20+16 bytes, and since I round
> up to a cacheline (at least on SMP) it doesn't really matter that we
> could get 36 instead of the 48 I have now.
>
>> and only for CCM and GCM, not CMAC (which does not use scatterlists
>> at all, it simply calls the AES cipher directly)
>
> I didn't modify CMAC, I think, only GMAC, which also uses scatterlists.
>

Ah ok, I misread the patch.

>> So that leaves a fixed 20 bytes for GCM and fixed 32 bytes for CCM,
>
> and 36 for GMAC :)

Yes. But as I replied, setting the req size is not supported atm,
although it is reasonable to demand a way to allocate additional data
in the request specifically for this issue. So let's proceed with the
aead_request_alloc/free patch, but I would like to propose something
on the API side to address this particular issue


Re: [PATCH net-next 1/2] net: phy: Add Speed downshift set driver for Microsemi PHYs.

2016-10-17 Thread Raju Lakkaraju
Hi Andrew,

Thank you for code review and comments.

On Fri, Oct 14, 2016 at 02:12:32PM +0200, Andrew Lunn wrote:
> EXTERNAL EMAIL
> 
> 
> On Fri, Oct 14, 2016 at 05:10:32PM +0530, Raju Lakkaraju wrote:
> > From: Raju Lakkaraju 
> >
> > For operation in cabling environments that are incompatible with
> > 1000BAST-T, VSC8531 device provides an automatic link speed
> > downshift operation. When enabled, the device automatically changes
> > its 1000BAST-T auto-negotiation to the next slower speed after
> > a configured number of failed attempts at 1000BAST-T.
> > This feature is useful in setting up in networks using older cable
> > installations that include only pairs A and B, and not pairs C and D.
> 
> Any reason not to just turn this on by default when auto-neg is
> enabled?
> 
Downshift can enable by default when auto-neg enabled. This is good idea.
But we would like to provide option to customer can choose whether this
feature need to enable or disable and also configure failure attempts.

Do you have any other suggestion how to configure failure attempts?

> Andrew

---

Thanks,
Raju.


Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 08:47, Johannes Berg  wrote:
> On Mon, 2016-10-17 at 08:37 +0100, Ard Biesheuvel wrote:
>>
>> Could we get a statement first whether it is supported to allocate
>> aead_req (and other crypto req structures) on the stack?
>
> Well, we haven't heard from Herbert :)
>
>> If not, then
>> we have our work cut out for us. But if it is, I'd rather we didn't
>> apply the kzalloc/kfree patch, since it is just a workaround for the
>> broken generic CCM driver, for which a fix is already available.
>
> Yeah but I can't apply it. I just fixed up your kzalloc patch to also
> handle GCM and GMAC, and to have error checking. Will send it in a
> minute.
>

I just realised that patch should probably use
aead_request_alloc/aead_request_free [and drop the memset]. That also
fixes the latent bug where the alignment of the req ctx is not take
into account.

>> Also, regarding your __percpu patch: those are located in the vmalloc
>> area as well, at least on arm64, and likely other architectures too.
>
> Crap. Any other bright ideas?
>

kmem_cache_create() and kmem_cache_alloc()


[PATCH] mac80211: move struct aead_req off the stack

2016-10-17 Thread Johannes Berg
From: Johannes Berg 

Some crypto implementations (such as the generic CCM wrapper in crypto/)
use scatterlists to map fields of private data in their struct aead_req.
This means these data structures cannot live in the vmalloc area, which
means that they cannot live on the stack (with CONFIG_VMAP_STACK.)

This currently occurs only with the generic software implementation, but
the private data and usage is implementation specific, so move the whole
data structures off the stack into heap by allocating every time we need
to use them.

This pattern already exists in the IPsec ESP driver, but in the future,
we may want/need to improve upon this, e.g. by using a slab cache.

(Based on Ard's patch, but that was missing error handling and GCM/GMAC)

Signed-off-by: Ard Biesheuvel 
Signed-off-by: Johannes Berg 
---
 net/mac80211/aes_ccm.c  | 34 +++---
 net/mac80211/aes_ccm.h  |  6 +++---
 net/mac80211/aes_gcm.c  | 33 +++--
 net/mac80211/aes_gcm.h  |  4 ++--
 net/mac80211/aes_gmac.c | 12 +++-
 net/mac80211/wpa.c  | 12 
 6 files changed, 54 insertions(+), 47 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 7663c28ba353..cd2f10744238 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -18,18 +18,15 @@
 #include "key.h"
 #include "aes_ccm.h"
 
-void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
-  u8 *data, size_t data_len, u8 *mic,
-  size_t mic_len)
+int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
+ u8 *data, size_t data_len, u8 *mic,
+ size_t mic_len)
 {
struct scatterlist sg[3];
+   struct aead_request *aead_req;
 
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
-
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = kzalloc(sizeof(struct aead_request) +
+  crypto_aead_reqsize(tfm), GFP_ATOMIC);
 
sg_init_table(sg, 3);
sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
@@ -41,6 +38,8 @@ void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
aead_request_set_ad(aead_req, sg[0].length);
 
crypto_aead_encrypt(aead_req);
+   kfree(aead_req);
+   return 0;
 }
 
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
@@ -48,15 +47,16 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
  size_t mic_len)
 {
struct scatterlist sg[3];
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
+   struct aead_request *aead_req;
+   int err;
 
if (data_len == 0)
return -EINVAL;
 
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = kzalloc(sizeof(struct aead_request) +
+  crypto_aead_reqsize(tfm), GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
 
sg_init_table(sg, 3);
sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
@@ -67,7 +67,11 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
-   return crypto_aead_decrypt(aead_req);
+   err = crypto_aead_decrypt(aead_req);
+
+   kfree(aead_req);
+
+   return err;
 }
 
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
diff --git a/net/mac80211/aes_ccm.h b/net/mac80211/aes_ccm.h
index 6a73d1e4d186..03e21b0995e3 100644
--- a/net/mac80211/aes_ccm.h
+++ b/net/mac80211/aes_ccm.h
@@ -15,9 +15,9 @@
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
size_t key_len,
size_t mic_len);
-void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
-  u8 *data, size_t data_len, u8 *mic,
-  size_t mic_len);
+int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
+ u8 *data, size_t data_len, u8 *mic,
+ size_t mic_len);
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
  u8 *data, size_t data_len, u8 *mic,
  size_t mic_len);
diff --git a/net/mac80211/aes_gcm.c 

Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Johannes Berg

> Apologies for going back and forth on this, but it appears there may
> be another way to deal with this.
> 
> First of all, we only need this handling for the authenticated data,

Are you sure b_0/j_0 aren't needed? We pass those
to aead_request_set_crypt(), and I wasn't sure what that really did
internally, perhaps like the internal data.

Testing with that on the stack does seem to work, in fact.

Surely we need zero for GMAC though, since we also put that into the sg
list. Thus for GMAC we definitely need 20+16 bytes, and since I round
up to a cacheline (at least on SMP) it doesn't really matter that we
could get 36 instead of the 48 I have now.

> and only for CCM and GCM, not CMAC (which does not use scatterlists
> at all, it simply calls the AES cipher directly)

I didn't modify CMAC, I think, only GMAC, which also uses scatterlists.

> So that leaves a fixed 20 bytes for GCM and fixed 32 bytes for CCM,

and 36 for GMAC :)

johannes


Re: [PATCH 10/10] mm: replace access_process_vm() write parameter with gup_flags

2016-10-17 Thread Jesper Nilsson
On Thu, Oct 13, 2016 at 01:20:20AM +0100, Lorenzo Stoakes wrote:
> This patch removes the write parameter from access_process_vm() and replaces 
> it
> with a gup_flags parameter as use of this function previously _implied_
> FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
> 
> We make this explicit as use of FOLL_FORCE can result in surprising behaviour
> (and hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes 
> ---
>  arch/cris/arch-v32/kernel/ptrace.c |  4 ++--

For the CRIS part:

Acked-by: Jesper Nilsson 

/^JN - Jesper Nilsson
-- 
   Jesper Nilsson -- jesper.nils...@axis.com


Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Johannes Berg
On Sat, 2016-10-15 at 18:16 +0100, Ard Biesheuvel wrote:
> The CCM code goes out of its way to perform the CTR encryption of the
> MAC using the subordinate CTR driver. To this end, it tweaks the
> input and output scatterlists so the aead_req 'odata' and/or
> 'auth_tag' fields [which may live on the stack] are prepended to the
> CTR payload. This involves calling sg_set_buf() on addresses which
> are not direct mapped, which is not supported.

> Since the calculation of the MAC keystream involves a single call
> into the cipher, to which we have a handle already given that the
> CBC-MAC calculation uses it as well, just calculate the MAC keystream
> directly, and record it in the aead_req private context so we can
> apply it to the MAC in cypto_ccm_auth_mac(). This greatly simplifies
> the scatterlist manipulation, and no longer requires scatterlists to
> refer to buffers that may live on the stack.

No objection from me, Herbert?

I'm getting a bit nervous though - I'd rather have any fix first so
people get things working again - so maybe I'll apply your other patch
and mine first, and then we can replace yours by this later.

johannes


[PATCH v3] mac80211: move extra crypto data off the stack

2016-10-17 Thread Johannes Berg
From: Johannes Berg 

As the stack can (on x86-64) now be virtually mapped rather than
using "normal" kernel memory, Sergey noticed mac80211 isn't using
the SG APIs correctly by putting on-stack buffers into SG tables.
This leads to kernel crashes.

Fix this by allocating a bit of per-CPU memory for the extra data
that encryption/decryption/verification needs, instead of having
it stored on the stack.

Signed-off-by: Johannes Berg 
---
 net/mac80211/aes_ccm.c |  2 --
 net/mac80211/aes_cmac.c|  5 ++-
 net/mac80211/aes_cmac.h|  2 ++
 net/mac80211/aes_gcm.c |  2 --
 net/mac80211/aes_gmac.c| 10 +++---
 net/mac80211/aes_gmac.h|  5 ++-
 net/mac80211/ieee80211_i.h |  7 
 net/mac80211/main.c|  8 +
 net/mac80211/wpa.c | 86 +-
 9 files changed, 97 insertions(+), 30 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 5c9fe00be6cc..8e898a6e8de8 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -34,7 +34,6 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
@@ -64,7 +63,6 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
index bdf0790d89cc..ebb8c2dc9928 100644
--- a/net/mac80211/aes_cmac.c
+++ b/net/mac80211/aes_cmac.c
@@ -20,7 +20,6 @@
 
 #define CMAC_TLEN 8 /* CMAC TLen = 64 bits (8 octets) */
 #define CMAC_TLEN_256 16 /* CMAC TLen = 128 bits (16 octets) */
-#define AAD_LEN 20
 
 
 static void gf_mulx(u8 *pad)
@@ -101,7 +100,7 @@ void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 
*aad,
 
memset(zero, 0, CMAC_TLEN);
addr[0] = aad;
-   len[0] = AAD_LEN;
+   len[0] = CMAC_AAD_LEN;
addr[1] = data;
len[1] = data_len - CMAC_TLEN;
addr[2] = zero;
@@ -119,7 +118,7 @@ void ieee80211_aes_cmac_256(struct crypto_cipher *tfm, 
const u8 *aad,
 
memset(zero, 0, CMAC_TLEN_256);
addr[0] = aad;
-   len[0] = AAD_LEN;
+   len[0] = CMAC_AAD_LEN;
addr[1] = data;
len[1] = data_len - CMAC_TLEN_256;
addr[2] = zero;
diff --git a/net/mac80211/aes_cmac.h b/net/mac80211/aes_cmac.h
index 3702041f44fd..6645f8963278 100644
--- a/net/mac80211/aes_cmac.h
+++ b/net/mac80211/aes_cmac.h
@@ -11,6 +11,8 @@
 
 #include 
 
+#define CMAC_AAD_LEN 20
+
 struct crypto_cipher *ieee80211_aes_cmac_key_setup(const u8 key[],
   size_t key_len);
 void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
diff --git a/net/mac80211/aes_gcm.c b/net/mac80211/aes_gcm.c
index aa8eb2718bc1..831054c6c756 100644
--- a/net/mac80211/aes_gcm.c
+++ b/net/mac80211/aes_gcm.c
@@ -30,7 +30,6 @@ int ieee80211_aes_gcm_encrypt(struct crypto_aead *tfm, u8 
*j_0, u8 *aad,
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, IEEE80211_GCMP_MIC_LEN);
 
-   aead_request_set_tfm(aead_req, tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len, j_0);
aead_request_set_ad(aead_req, sg[0].length);
 
@@ -58,7 +57,6 @@ int ieee80211_aes_gcm_decrypt(struct crypto_aead *tfm, u8 
*j_0, u8 *aad,
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, IEEE80211_GCMP_MIC_LEN);
 
-   aead_request_set_tfm(aead_req, tfm);
aead_request_set_crypt(aead_req, sg, sg,
   data_len + IEEE80211_GCMP_MIC_LEN, j_0);
aead_request_set_ad(aead_req, sg[0].length);
diff --git a/net/mac80211/aes_gmac.c b/net/mac80211/aes_gmac.c
index 15cfcebf0884..86892e2e3c8c 100644
--- a/net/mac80211/aes_gmac.c
+++ b/net/mac80211/aes_gmac.c
@@ -19,13 +19,12 @@
 
 #define GMAC_MIC_LEN 16
 #define GMAC_NONCE_LEN 12
-#define AAD_LEN 20
 
 int ieee80211_aes_gmac(struct crypto_aead *tfm, const u8 *aad, u8 *nonce,
-  const u8 *data, size_t data_len, u8 *mic)
+  const u8 *data, size_t data_len, u8 *mic, u8 *zero)
 {
struct scatterlist sg[4];
-   u8 zero[GMAC_MIC_LEN], iv[AES_BLOCK_SIZE];
+   u8 iv[AES_BLOCK_SIZE];
struct aead_request *aead_req;
 
if (data_len < GMAC_MIC_LEN)
@@ -37,7 +36,7 @@ int ieee80211_aes_gmac(struct crypto_aead *tfm, const u8 
*aad, u8 *nonce,
 
memset(zero, 0, GMAC_MIC_LEN);
sg_init_table(sg, 4);
-   sg_set_buf([0], aad, AAD_LEN);
+   sg_set_buf([0], aad, GMAC_AAD_LEN);

[PATCH v3] mac80211: move struct aead_req off the stack

2016-10-17 Thread Johannes Berg
From: Johannes Berg 

Some crypto implementations (such as the generic CCM wrapper in crypto/)
use scatterlists to map fields of private data in their struct aead_req.
This means these data structures cannot live in the vmalloc area, which
means that they cannot live on the stack (with CONFIG_VMAP_STACK.)

This currently occurs only with the generic software implementation, but
the private data and usage is implementation specific, so move the whole
data structures off the stack into heap by allocating every time we need
to use them.

This pattern already exists in the IPsec ESP driver, but in the future,
we may want/need to improve upon this, e.g. by using a slab cache.

(Based on Ard's patch, but that was missing error handling and GCM/GMAC)

Signed-off-by: Ard Biesheuvel 
Signed-off-by: Johannes Berg 
---
v3: remove superfluous aead_request_set_tfm() calls
---
 net/mac80211/aes_ccm.c  | 36 +++-
 net/mac80211/aes_ccm.h  |  6 +++---
 net/mac80211/aes_gcm.c  | 33 +
 net/mac80211/aes_gcm.h  |  4 ++--
 net/mac80211/aes_gmac.c | 11 +--
 net/mac80211/wpa.c  | 12 
 6 files changed, 50 insertions(+), 52 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 7663c28ba353..8e898a6e8de8 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -18,29 +18,29 @@
 #include "key.h"
 #include "aes_ccm.h"
 
-void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
-  u8 *data, size_t data_len, u8 *mic,
-  size_t mic_len)
+int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
+ u8 *data, size_t data_len, u8 *mic,
+ size_t mic_len)
 {
struct scatterlist sg[3];
+   struct aead_request *aead_req;
 
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
-
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
 
sg_init_table(sg, 3);
sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
crypto_aead_encrypt(aead_req);
+   aead_request_free(aead_req);
+
+   return 0;
 }
 
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
@@ -48,26 +48,28 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
  size_t mic_len)
 {
struct scatterlist sg[3];
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
+   struct aead_request *aead_req;
+   int err;
 
if (data_len == 0)
return -EINVAL;
 
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
 
sg_init_table(sg, 3);
sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
-   return crypto_aead_decrypt(aead_req);
+   err = crypto_aead_decrypt(aead_req);
+   aead_request_free(aead_req);
+
+   return err;
 }
 
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
diff --git a/net/mac80211/aes_ccm.h b/net/mac80211/aes_ccm.h
index 6a73d1e4d186..03e21b0995e3 100644
--- a/net/mac80211/aes_ccm.h
+++ b/net/mac80211/aes_ccm.h
@@ -15,9 +15,9 @@
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
size_t key_len,
size_t mic_len);
-void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
-  u8 *data, size_t data_len, u8 *mic,
-  size_t mic_len);
+int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
+ u8 *data, size_t data_len, u8 *mic,
+ size_t mic_len);
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
  

Re: [PATCH v3] mac80211: move extra crypto data off the stack

2016-10-17 Thread Johannes Berg
[snip]

Sorry, this went out by mistake.

johannes


Re: [PATCH 06/10] mm: replace get_user_pages() write/force parameters with gup_flags

2016-10-17 Thread Jesper Nilsson
On Thu, Oct 13, 2016 at 01:20:16AM +0100, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages() and
> replaces them with a gup_flags parameter to make the use of FOLL_FORCE 
> explicit
> in callers as use of this flag can result in surprising behaviour (and hence
> bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes 
> ---
>  arch/cris/arch-v32/drivers/cryptocop.c |  4 +---

For the CRIS part:

Acked-by: Jesper Nilsson 

/^JN - Jesper Nilsson
-- 
   Jesper Nilsson -- jesper.nils...@axis.com


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel


> On 17 Oct 2016, at 10:35, Johannes Berg  wrote:
> 
>> On Mon, 2016-10-17 at 10:30 +0100, Ard Biesheuvel wrote:
>> 
>> Yes. But as I replied, setting the req size is not supported atm,
>> although it is reasonable to demand a way to allocate additional data
>> in the request specifically for this issue. So let's proceed with the
>> aead_request_alloc/free patch, but I would like to propose something
>> on the API side to address this particular issue
> 
> Well, if your other patch to make it OK to be on-stack would be applied
> instead, this wouldn't make much sense either :)
> 

Yes but that one only fixes ccm not gcm

> In this particular patch, we could reduce the size of the struct, but I
> don't actually think it'll make a difference to go from 48 to 36 bytes,
> given alignment etc., so I think I'll just leave it as is.
> 

I understand you are in a hurry, but this is unlikely to be the only such 
issue. I will propose an 'auxdata' feature for the crypto api that can be used 
here, but also for any other occurrence where client data assoiciated with the 
request can no longer be allocated on the stack

Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel


> On 17 Oct 2016, at 10:54, Johannes Berg  wrote:
> 
> 
>>> Well, if your other patch to make it OK to be on-stack would be
>>> applied instead, this wouldn't make much sense either :)
>>> 
>> 
>> Yes but that one only fixes ccm not gcm
> 
> Yes, but we can do the same for GCM, no?
> 

No, not really. ccm happens to use aes with the same key for the mac and the 
encryption. gcm uses an another algo entirely for the mac

>>> In this particular patch, we could reduce the size of the struct,
>>> but I
>>> don't actually think it'll make a difference to go from 48 to 36
>>> bytes,
>>> given alignment etc., so I think I'll just leave it as is.
>>> 
>> 
>> I understand you are in a hurry, but this is unlikely to be the only
>> such issue. I will propose an 'auxdata' feature for the crypto api
>> that can be used here, but also for any other occurrence where client
>> data assoiciated with the request can no longer be allocated on the
>> stack
> 
> No objections. I'll merge this anyway today I think, reverting is easy
> later.
> 

ok fair enough

Re: Hello

2016-10-17 Thread Alexander Alemayhu
On Fri, Oct 14, 2016 at 05:17:36PM +0800, yuehaibing wrote:
> subscribe linux-kernel
> 

Try sending subscribe linux-kernel to majord...@vger.kernel.org


Re: linux-4.9-rc1/drivers/net/wireless/intel/iwlwifi/mvm/debugfs.c:1561: poor error checking ?

2016-10-17 Thread Luca Coelho
Hi David,
On Mon, 2016-10-17 at 07:40 +, David Binderman wrote:
> Hello there,
> 
> linux-4.9-rc1/drivers/net/wireless/intel/iwlwifi/mvm/debugfs.c:1561]: (style) 
> Checking if unsigned variable 'len' is less than zero.
> 
> Source code is
> 
> len = min((size_t)le32_to_cpu(rsp->len) << 2,
>   iwl_rx_packet_payload_len(hcmd.resp_pkt) - sizeof(*rsp));
> len = min(len - delta, count);
> if (len < 0) {
> ret = -EFAULT;
> goto out;
> }
> 
> Suggest improve error checking.

Thanks for reporting! A fix for this is already queued in our internal
tree and will be sent upstream soon.

--
Cheers,
Luca.


Re: [PATCH net-next 2/2] net: phy: Add Fast Link Failure - 2 set driver for Microsemi PHYs.

2016-10-17 Thread Raju Lakkaraju
Hi Andrew,

Thank you for code review and comments.

On Fri, Oct 14, 2016 at 02:02:28PM +0200, Andrew Lunn wrote:
> EXTERNAL EMAIL
> 
> 
> > On Fri, Oct 14, 2016 at 05:10:33PM +0530, Raju Lakkaraju wrote:
> > From: Raju Lakkaraju 
> >
> > VSC8531 Fast Link Failure 2 feature enables the PHY to indicate the
> > onset of a potential link failure in < 100 usec for 100BASE-TX
> > operation. FLF2 is supported through the MDINT (active low) pin.
> 
> Is the MDINT pin specific to this feature, or a general interrupt pin?
> 

MDINT pin is general interrupt. MDINT pin share the interrupt with
FLF2 along with another 13 interrupts.

> Device tree is used to describe the hardware. It should not really
> describe software or configuration. But the borders are a bit
> fluffly. Signal edge rates is near to hardware. This is a lot more
> towards configuration. So i'm not sure a device tree property is the
> correct way to describe this.
> 
> This is also a feature i know other PHYs support. The Marvell PHY has
> a "Metro Ethernet" extension which allows it to report link failures
> for 1000BASE-T in 10, 20 or 40ms, instead of the usual 750ms. So we
> need a generic solution other PHYs can implement.
> 
> As with cable testing, i think it should be an ethtool option.

I agree with you.
I thought this is one time initialization either enable or disable.
if customer need this feature, they can enable in DT.
Do you want me to implement through IOCTL instead of Device tree?
Do you have any other suggestions?

> 
>Andrew

---
Thanks,
Raju.


[PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Johannes Berg
From: Johannes Berg 

As the stack can (on x86-64) now be virtually mapped rather than
using "normal" kernel memory, Sergey noticed mac80211 isn't using
the SG APIs correctly by putting on-stack buffers into SG tables.
This leads to kernel crashes.

Fix this by allocating the extra fields dynamically on the fly as
needed, using a kmem cache.

I used per-CPU memory in a previous iteration of this patch, but
Ard Biesheuvel pointed out that was also vmalloc'ed on some
architectures.

Reported-by: Sergey Senozhatsky 
Signed-off-by: Johannes Berg 
---
 net/mac80211/aes_cmac.c|   5 +-
 net/mac80211/aes_cmac.h|   2 +
 net/mac80211/aes_gmac.c|   9 ++-
 net/mac80211/aes_gmac.h|   5 +-
 net/mac80211/ieee80211_i.h |   7 ++
 net/mac80211/main.c|   8 +++
 net/mac80211/wpa.c | 173 -
 7 files changed, 166 insertions(+), 43 deletions(-)

diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
index bdf0790d89cc..ebb8c2dc9928 100644
--- a/net/mac80211/aes_cmac.c
+++ b/net/mac80211/aes_cmac.c
@@ -20,7 +20,6 @@
 
 #define CMAC_TLEN 8 /* CMAC TLen = 64 bits (8 octets) */
 #define CMAC_TLEN_256 16 /* CMAC TLen = 128 bits (16 octets) */
-#define AAD_LEN 20
 
 
 static void gf_mulx(u8 *pad)
@@ -101,7 +100,7 @@ void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 
*aad,
 
memset(zero, 0, CMAC_TLEN);
addr[0] = aad;
-   len[0] = AAD_LEN;
+   len[0] = CMAC_AAD_LEN;
addr[1] = data;
len[1] = data_len - CMAC_TLEN;
addr[2] = zero;
@@ -119,7 +118,7 @@ void ieee80211_aes_cmac_256(struct crypto_cipher *tfm, 
const u8 *aad,
 
memset(zero, 0, CMAC_TLEN_256);
addr[0] = aad;
-   len[0] = AAD_LEN;
+   len[0] = CMAC_AAD_LEN;
addr[1] = data;
len[1] = data_len - CMAC_TLEN_256;
addr[2] = zero;
diff --git a/net/mac80211/aes_cmac.h b/net/mac80211/aes_cmac.h
index 3702041f44fd..6645f8963278 100644
--- a/net/mac80211/aes_cmac.h
+++ b/net/mac80211/aes_cmac.h
@@ -11,6 +11,8 @@
 
 #include 
 
+#define CMAC_AAD_LEN 20
+
 struct crypto_cipher *ieee80211_aes_cmac_key_setup(const u8 key[],
   size_t key_len);
 void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
diff --git a/net/mac80211/aes_gmac.c b/net/mac80211/aes_gmac.c
index 6951af9715c0..86892e2e3c8c 100644
--- a/net/mac80211/aes_gmac.c
+++ b/net/mac80211/aes_gmac.c
@@ -19,13 +19,12 @@
 
 #define GMAC_MIC_LEN 16
 #define GMAC_NONCE_LEN 12
-#define AAD_LEN 20
 
 int ieee80211_aes_gmac(struct crypto_aead *tfm, const u8 *aad, u8 *nonce,
-  const u8 *data, size_t data_len, u8 *mic)
+  const u8 *data, size_t data_len, u8 *mic, u8 *zero)
 {
struct scatterlist sg[4];
-   u8 zero[GMAC_MIC_LEN], iv[AES_BLOCK_SIZE];
+   u8 iv[AES_BLOCK_SIZE];
struct aead_request *aead_req;
 
if (data_len < GMAC_MIC_LEN)
@@ -37,7 +36,7 @@ int ieee80211_aes_gmac(struct crypto_aead *tfm, const u8 
*aad, u8 *nonce,
 
memset(zero, 0, GMAC_MIC_LEN);
sg_init_table(sg, 4);
-   sg_set_buf([0], aad, AAD_LEN);
+   sg_set_buf([0], aad, GMAC_AAD_LEN);
sg_set_buf([1], data, data_len - GMAC_MIC_LEN);
sg_set_buf([2], zero, GMAC_MIC_LEN);
sg_set_buf([3], mic, GMAC_MIC_LEN);
@@ -47,7 +46,7 @@ int ieee80211_aes_gmac(struct crypto_aead *tfm, const u8 
*aad, u8 *nonce,
iv[AES_BLOCK_SIZE - 1] = 0x01;
 
aead_request_set_crypt(aead_req, sg, sg, 0, iv);
-   aead_request_set_ad(aead_req, AAD_LEN + data_len);
+   aead_request_set_ad(aead_req, GMAC_AAD_LEN + data_len);
 
crypto_aead_encrypt(aead_req);
aead_request_free(aead_req);
diff --git a/net/mac80211/aes_gmac.h b/net/mac80211/aes_gmac.h
index d328204d73a8..f06833c9095f 100644
--- a/net/mac80211/aes_gmac.h
+++ b/net/mac80211/aes_gmac.h
@@ -11,10 +11,13 @@
 
 #include 
 
+#define GMAC_MIC_LEN 16
+#define GMAC_AAD_LEN 20
+
 struct crypto_aead *ieee80211_aes_gmac_key_setup(const u8 key[],
 size_t key_len);
 int ieee80211_aes_gmac(struct crypto_aead *tfm, const u8 *aad, u8 *nonce,
-  const u8 *data, size_t data_len, u8 *mic);
+  const u8 *data, size_t data_len, u8 *mic, u8 *zero);
 void ieee80211_aes_gmac_key_free(struct crypto_aead *tfm);
 
 #endif /* AES_GMAC_H */
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 34c2add2c455..a63593f6b645 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1128,6 +1128,13 @@ enum mac80211_scan_state {
SCAN_ABORT,
 };
 
+struct ieee80211_crypto_bufs {
+   u8 buf1[32];
+   u8 buf2[16];
+} cacheline_aligned_in_smp;
+
+extern struct kmem_cache *ieee80211_crypto_bufs_cache;
+
 struct ieee80211_local {
/* embed the driver visible 

introduce IPV6_PKTINFO_L2 socket-option?

2016-10-17 Thread Alexander Aring

Hi all,

I am currently writing some userspace UDP Software for 6LoWPAN
subsystem. It will use UDP IPv6 sockets, the 6LoWPAN adaptation will be
done at kernelspace.

My problem is currently the following:

I need per recvmsg(2) the used source link-layer address for the
incoming UDP packet which used a multicast destination address.

There is currently no solution to get such information. Michael
Richardson told about IPV6_RECVPKTINFO and then we thought about to
introduce an IPV6_PKTINFO_L2 to get such information.

If somebody know already a solution how to get L2 information from
userspace IPv6 SOCK_DGRAM socket, I would be happy if somebody told me
the solution. :-)

The idea about IPV6_PKTINFO_L2 is to put L2 information in there which
depends on netdev type (on 6LoWPAN it even depends on sub netdev type of
6LoWPAN netdev type - sub netdev type will represent the used link-layer).

Now, here comes some design question...

What we put into the IPV6_PKTINFO_L2 which is an attribute of the control
messages. The socket-option IPV6_PKTINFO_L2 will turn it on/off only
like all others.

If we only need it for recvmsg(2) then maybe we could simple offer the
mac-header - which maybe works on all skb's by accessing skb_mac_header
and skb->mac_len. (btw: on 6LoWPAN this is currently broken and I need
to fix it). It works only if all IPv6 capable subsystem sets these
attributes and don't overwrite the mac space with other data (that's the
bug we have in 6LoWPAN).

The point is... it's useful also for sendmsg. Example: 802.15.4 6LoWPAN
can use two type of mac address: short or extended. On sendmsg you could
say which one you prefer to use for the L2 source/destination address
information. I heard from Michael, this is necessary on some protocols
to prefer extended than short. Currently we use short if short is
available, because it will produce less payload - which is very fine as
default handling. But sometimes you must have control over that.

Another option to implement this is a flag for a neighbour in ndisc
cache, but this works for destination L2 address then and per-neighbour
and not per sendmsg(2), so all running processes for the interface will
be manipulated by that. I think this is not an good option.

What I am thinking of is to have the control message attribute defined
per netdev type (in case of 6LoWPAN even link-layer 6LoWPAN netdev type)
and each link-layer can add what they want to have there. E.g.

struct in6_pktinfo_l2_802154 {
/* sendmsg and recvmsg */
__u8src[MAX_ADDR_LEN];
__u8dst[MAX_ADDR_LEN];

/* recvmsg */
__u8lqi;
};

The lqi is here some 802.15.4 L2 information about link quality. Some
another idea to add there, makes only sense for recvmsg and need to be
some average if fragmentation was done there. If the link-layer doesn't
support such handling -EOPNOTSUPP for setting the socket-option on will
be returned.

The src and dst address could maybe the same on all link-layers and
that's why we could split this attribute maybe in two attributes for
control message:

struct in6_pktinfo_l2 {
/* sendmsg and recvmsg */
__u8src[MAX_ADDR_LEN];
__u8dst[MAX_ADDR_LEN];
};

struct in6_pktinfo_l2_802154 {
/* recvmsg */
__u8lqi;
};

and introduce some IPV6_PKTINFO_L2 and IPV6_PKTINFO_L2_802154_LQI socket
option to enable/disable these attributes. The addresses which will be
placed into src and dst depends on used link-layer which need to be
evaluated somehow before.

I currently must implement the getting of source l2 address information
and I will try to implement it. For sendmsg side I need to figure out
how to get the information from socket interface through 6LoWPAN
implementation - maybe adding new attribute in sk_buff? Don't know yet.

I also want some mainline solution and I don't know if this is the right
way to implement it to have an acceptable solution for mainline. That's
why I start this discussion and hope I get some new inputs here.

- Alex


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Johannes Berg
On Mon, 2016-10-17 at 10:30 +0100, Ard Biesheuvel wrote:

> Yes. But as I replied, setting the req size is not supported atm,
> although it is reasonable to demand a way to allocate additional data
> in the request specifically for this issue. So let's proceed with the
> aead_request_alloc/free patch, but I would like to propose something
> on the API side to address this particular issue

Well, if your other patch to make it OK to be on-stack would be applied
instead, this wouldn't make much sense either :)

In this particular patch, we could reduce the size of the struct, but I
don't actually think it'll make a difference to go from 48 to 36 bytes,
given alignment etc., so I think I'll just leave it as is.

johannes


[PATCH iproute2 1/2] tc filters: display handle in events when a filter is deleted

2016-10-17 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

An event being displayed via "tc mon" should display the filter handle.
The filter handle is a required parameter when deleting a filter and
therefore the delete event should mimic/mirror the command sent.
A simple test, run tc monitor on one window, on another try adding
a filter:
..
sudo $TC filter add dev $ETH parent : protocol ip pref 1 \
u32 match ip protocol 1 0xff \
flowid 1:1 \
action ok
..

.. get its handle by dumping ...
sudo $TC -s filter ls dev $ETH parent : protocol ip
... find out the handle (say it was 800::800)
... go delete it..
sudo $TC filter del dev $ETH handle 800::800 parent : \
protocol ip prio 1 u32

now see tc monitor reporting it without handle. After this patch
with a handle.

Signed-off-by: Jamal Hadi Salim 
---
 tc/tc_filter.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index 2413cef..4efc44f 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -253,12 +253,13 @@ int print_filter(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
}
fprintf(fp, "%s ", rta_getattr_str(tb[TCA_KIND]));
q = get_filter_kind(RTA_DATA(tb[TCA_KIND]));
-   if (tb[TCA_OPTIONS]) {
+   if (tb[TCA_OPTIONS] || n->nlmsg_type == RTM_DELTFILTER) {
if (q)
q->print_fopt(q, fp, tb[TCA_OPTIONS], t->tcm_handle);
else
fprintf(fp, "[cannot parse parameters]");
}
+
fprintf(fp, "\n");
 
if (show_stats && (tb[TCA_STATS] || tb[TCA_STATS2])) {
-- 
1.9.1



[PATCH iproute2 2/2] tc filters: fix filters to display handle when deleted even when no option

2016-10-17 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Fix a few stylistic things that hurt my eyes while at it.

Signed-off-by: Jamal Hadi Salim 
---
 tc/f_basic.c|  6 +++---
 tc/f_bpf.c  |  6 +++---
 tc/f_cgroup.c   |  7 +++
 tc/f_flow.c |  4 ++--
 tc/f_flower.c   |  5 +++--
 tc/f_fw.c   | 38 ++
 tc/f_matchall.c |  6 +++---
 tc/f_route.c| 29 ++---
 tc/f_rsvp.c | 28 ++--
 tc/f_tcindex.c  |  4 +++-
 tc/f_u32.c  |  9 -
 11 files changed, 82 insertions(+), 60 deletions(-)

diff --git a/tc/f_basic.c b/tc/f_basic.c
index d663668..317dca1 100644
--- a/tc/f_basic.c
+++ b/tc/f_basic.c
@@ -112,14 +112,14 @@ static int basic_print_opt(struct filter_util *qu, FILE 
*f,
 {
struct rtattr *tb[TCA_BASIC_MAX+1];
 
+   if (handle)
+   fprintf(f, "handle 0x%x ", handle);
+
if (opt == NULL)
return 0;
 
parse_rtattr_nested(tb, TCA_BASIC_MAX, opt);
 
-   if (handle)
-   fprintf(f, "handle 0x%x ", handle);
-
if (tb[TCA_BASIC_CLASSID]) {
SPRINT_BUF(b1);
fprintf(f, "flowid %s ",
diff --git a/tc/f_bpf.c b/tc/f_bpf.c
index 5c97c86..a7dd2f2 100644
--- a/tc/f_bpf.c
+++ b/tc/f_bpf.c
@@ -152,14 +152,14 @@ static int bpf_print_opt(struct filter_util *qu, FILE *f,
 {
struct rtattr *tb[TCA_BPF_MAX + 1];
 
+   if (handle)
+   fprintf(f, "handle 0x%x ", handle);
+
if (opt == NULL)
return 0;
 
parse_rtattr_nested(tb, TCA_BPF_MAX, opt);
 
-   if (handle)
-   fprintf(f, "handle 0x%x ", handle);
-
if (tb[TCA_BPF_CLASSID]) {
SPRINT_BUF(b1);
fprintf(f, "flowid %s ",
diff --git a/tc/f_cgroup.c b/tc/f_cgroup.c
index ecf9909..ae798db 100644
--- a/tc/f_cgroup.c
+++ b/tc/f_cgroup.c
@@ -85,14 +85,13 @@ static int cgroup_print_opt(struct filter_util *qu, FILE *f,
 {
struct rtattr *tb[TCA_CGROUP_MAX+1];
 
+   if (handle)
+   fprintf(f, "handle 0x%x ", handle);
+
if (opt == NULL)
return 0;
 
parse_rtattr_nested(tb, TCA_CGROUP_MAX, opt);
-
-   if (handle)
-   fprintf(f, "handle 0x%x ", handle);
-
if (tb[TCA_CGROUP_EMATCHES])
print_ematch(f, tb[TCA_CGROUP_EMATCHES]);
 
diff --git a/tc/f_flow.c b/tc/f_flow.c
index 09ddcaa..5a9948f 100644
--- a/tc/f_flow.c
+++ b/tc/f_flow.c
@@ -272,13 +272,13 @@ static int flow_print_opt(struct filter_util *fu, FILE 
*f, struct rtattr *opt,
unsigned int i;
__u32 mask = ~0, val = 0;
 
+   fprintf(f, "handle 0x%x ", handle);
+
if (opt == NULL)
return -EINVAL;
 
parse_rtattr_nested(tb, TCA_FLOW_MAX, opt);
 
-   fprintf(f, "handle 0x%x ", handle);
-
if (tb[TCA_FLOW_MODE]) {
__u32 mode = rta_getattr_u32(tb[TCA_FLOW_MODE]);
 
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 791ade7..2f566ec 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -456,13 +456,14 @@ static int flower_print_opt(struct filter_util *qu, FILE 
*f,
__be16 eth_type = 0;
__u8 ip_proto = 0xff;
 
+   if (handle)
+   fprintf(f, "handle 0x%x ", handle);
+
if (!opt)
return 0;
 
parse_rtattr_nested(tb, TCA_FLOWER_MAX, opt);
 
-   if (handle)
-   fprintf(f, "handle 0x%x ", handle);
 
if (tb[TCA_FLOWER_CLASSID]) {
SPRINT_BUF(b1);
diff --git a/tc/f_fw.c b/tc/f_fw.c
index 29c9854..f779c16 100644
--- a/tc/f_fw.c
+++ b/tc/f_fw.c
@@ -115,37 +115,42 @@ static int fw_parse_opt(struct filter_util *qu, char 
*handle, int argc, char **a
return 0;
 }
 
-static int fw_print_opt(struct filter_util *qu, FILE *f, struct rtattr *opt, 
__u32 handle)
+static int fw_print_opt(struct filter_util *qu, FILE *f, struct rtattr *opt,
+   __u32 handle)
 {
struct rtattr *tb[TCA_FW_MAX+1];
+   __u32 mark = 0, mask = 0;
 
-   if (opt == NULL)
-   return 0;
+   if (opt) {
+   parse_rtattr_nested(tb, TCA_FW_MAX, opt);
+   if (tb[TCA_FW_MASK]) {
+  __u32 tmask = rta_getattr_u32(tb[TCA_FW_MASK]);
 
-   parse_rtattr_nested(tb, TCA_FW_MAX, opt);
+  if (tmask != 0x)
+  mask = tmask;
+   }
+   }
 
-   if (handle || tb[TCA_FW_MASK]) {
-   __u32 mark = 0, mask = 0;
+   if (mask)
+   fprintf(f, "handle 0x%x/0x%x ", mark, mask);
+   else
+   fprintf(f, "handle 0x%x ", handle);
 
-   if (handle)
-   mark = handle;
-   if (tb[TCA_FW_MASK] &&
-   (mask = rta_getattr_u32(tb[TCA_FW_MASK])) != 0x)
-   fprintf(f, "handle 0x%x/0x%x ", mark, mask);
-   else
-   

Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel


> On 17 Oct 2016, at 10:35, Johannes Berg  wrote:
> 
>> On Mon, 2016-10-17 at 10:30 +0100, Ard Biesheuvel wrote:
>> 
>> Yes. But as I replied, setting the req size is not supported atm,
>> although it is reasonable to demand a way to allocate additional data
>> in the request specifically for this issue. So let's proceed with the
>> aead_request_alloc/free patch, but I would like to propose something
>> on the API side to address this particular issue
> 
> Well, if your other patch to make it OK to be on-stack would be applied
> instead, this wouldn't make much sense either :)
> 
> In this particular patch, we could reduce the size of the struct, but I
> don't actually think it'll make a difference to go from 48 to 36 bytes,
> given alignment etc., so I think I'll just leave it as is.
> 
> johannes


[PATCH] ethtool: add register dump support for fjes driver

2016-10-17 Thread Taku Izumi
From: Taku 

This patch adds the register dump format for FUJITSU Extended
Network device like the following:

   # ethtool -d es0

0x: OWNER_EPID(Owner EPID)   0x0001
0x0004: MAX_EP(Maximum EP)   0x0008
0x0010: DCTL  (Device Control)   0x
0x0020: CR(Command request)  0x8002
0x0024: CS(Command status)   0x8002
0x0028: SHSTSAL   (Share status address Low) 0xE8215304
0x002C: SHSTSAH   (Share status address High)0x0007
0x0034: REQBL (Request Buffer length)0x8028
0x0038: REQBAL(Request Buffer Address Low)   0xEB0A
0x003C: REQBAH(Request Buffer Address High)  0x0007
0x0044: RESPBL(Response Buffer Length)   0x0018
0x0048: RESPBAL   (Response Buffer Address Low)  0xE41E1220
0x004C: RESPBAH   (Response Buffer Address High) 0x0007
0x0080: IS(Interrupt status) 0x
0x0084: IMS   (Interrupt mask set)   0x7FE0
0x0088: IMC   (Interrupt mask clear) 0x001F
0x008C: IG(Interrupt generator)  0x0001
0x0090: ICTL  (Interrupt control)0x

Signed-off-by: Taku Izumi 
---
 Makefile.am |  2 +-
 ethtool.c   |  1 +
 fjes.c  | 89 +
 internal.h  |  2 ++
 4 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 fjes.c

diff --git a/Makefile.am b/Makefile.am
index de2db2e..edbda57 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -14,7 +14,7 @@ ethtool_SOURCES += \
  pcnet32.c realtek.c tg3.c marvell.c vioc.c\
  smsc911x.c at76c50x-usb.c sfc.c stmmac.c  \
  sff-common.c sff-common.h sfpid.c sfpdiag.c   \
- ixgbevf.c tse.c vmxnet3.c qsfp.c qsfp.h
+ ixgbevf.c tse.c vmxnet3.c qsfp.c qsfp.h fjes.c
 endif
 
 TESTS = test-cmdline test-features
diff --git a/ethtool.c b/ethtool.c
index 49ac94e..75299c6 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -1136,6 +1136,7 @@ static const struct {
{ "et131x", et131x_dump_regs },
{ "altera_tse", altera_tse_dump_regs },
{ "vmxnet3", vmxnet3_dump_regs },
+   { "fjes", fjes_dump_regs },
 #endif
 };
 
diff --git a/fjes.c b/fjes.c
new file mode 100644
index 000..52f7c28
--- /dev/null
+++ b/fjes.c
@@ -0,0 +1,89 @@
+/* Copyright (c) 2016 FUJITSU LIMITED */
+#include 
+#include "internal.h"
+
+int fjes_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
+{
+   u32 *regs_buff = (u32 *)regs->data;
+
+   if (regs->version != 1)
+   return -1;
+
+   /* Information registers */
+   fprintf(stdout,
+   "0x: OWNER_EPID(Owner EPID)   
0x%08X\n",
+   regs_buff[0]);
+
+   fprintf(stdout,
+   "0x0004: MAX_EP(Maximum EP)   
0x%08X\n",
+   regs_buff[1]);
+
+   /* Device Control registers */
+   fprintf(stdout,
+   "0x0010: DCTL  (Device Control)   
0x%08X\n",
+   regs_buff[4]);
+
+   /* Command Control registers */
+   fprintf(stdout,
+   "0x0020: CR(Command request)  
0x%08X\n",
+   regs_buff[8]);
+
+   fprintf(stdout,
+   "0x0024: CS(Command status)   
0x%08X\n",
+   regs_buff[9]);
+
+   fprintf(stdout,
+   "0x0028: SHSTSAL   (Share status address Low) 
0x%08X\n",
+   regs_buff[10]);
+
+   fprintf(stdout,
+   "0x002C: SHSTSAH   (Share status address High)
0x%08X\n",
+   regs_buff[11]);
+
+   fprintf(stdout,
+   "0x0034: REQBL (Request Buffer length)
0x%08X\n",
+   regs_buff[13]);
+
+   fprintf(stdout,
+   "0x0038: REQBAL(Request Buffer Address Low)   
0x%08X\n",
+   regs_buff[14]);
+
+   fprintf(stdout,
+   "0x003C: REQBAH(Request Buffer Address High)  
0x%08X\n",
+   regs_buff[15]);
+
+   fprintf(stdout,
+   "0x0044: RESPBL(Response Buffer Length)   
0x%08X\n",
+   regs_buff[17]);
+
+   fprintf(stdout,
+   "0x0048: RESPBAL   (Response Buffer Address Low)  
0x%08X\n",
+   regs_buff[18]);
+
+   fprintf(stdout,
+   "0x004C: RESPBAH   (Response Buffer Address High) 
0x%08X\n",
+   regs_buff[19]);
+
+   /* Interrupt Control registers */
+   fprintf(stdout,
+   "0x0080: IS(Interrupt status) 

[PATCH v2] mac80211: move struct aead_req off the stack

2016-10-17 Thread Johannes Berg
From: Johannes Berg 

Some crypto implementations (such as the generic CCM wrapper in crypto/)
use scatterlists to map fields of private data in their struct aead_req.
This means these data structures cannot live in the vmalloc area, which
means that they cannot live on the stack (with CONFIG_VMAP_STACK.)

This currently occurs only with the generic software implementation, but
the private data and usage is implementation specific, so move the whole
data structures off the stack into heap by allocating every time we need
to use them.

This pattern already exists in the IPsec ESP driver, but in the future,
we may want/need to improve upon this, e.g. by using a slab cache.

(Based on Ard's patch, but that was missing error handling and GCM/GMAC)

Signed-off-by: Ard Biesheuvel 
Signed-off-by: Johannes Berg 
---
 net/mac80211/aes_ccm.c  | 34 +++---
 net/mac80211/aes_ccm.h  |  6 +++---
 net/mac80211/aes_gcm.c  | 31 +--
 net/mac80211/aes_gcm.h  |  4 ++--
 net/mac80211/aes_gmac.c | 10 +-
 net/mac80211/wpa.c  | 12 
 6 files changed, 50 insertions(+), 47 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 7663c28ba353..5c9fe00be6cc 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -18,18 +18,16 @@
 #include "key.h"
 #include "aes_ccm.h"
 
-void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
-  u8 *data, size_t data_len, u8 *mic,
-  size_t mic_len)
+int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
+ u8 *data, size_t data_len, u8 *mic,
+ size_t mic_len)
 {
struct scatterlist sg[3];
+   struct aead_request *aead_req;
 
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
-
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
 
sg_init_table(sg, 3);
sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
@@ -41,6 +39,9 @@ void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
aead_request_set_ad(aead_req, sg[0].length);
 
crypto_aead_encrypt(aead_req);
+   aead_request_free(aead_req);
+
+   return 0;
 }
 
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
@@ -48,15 +49,15 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
  size_t mic_len)
 {
struct scatterlist sg[3];
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
+   struct aead_request *aead_req;
+   int err;
 
if (data_len == 0)
return -EINVAL;
 
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
 
sg_init_table(sg, 3);
sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
@@ -67,7 +68,10 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
-   return crypto_aead_decrypt(aead_req);
+   err = crypto_aead_decrypt(aead_req);
+   aead_request_free(aead_req);
+
+   return err;
 }
 
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
diff --git a/net/mac80211/aes_ccm.h b/net/mac80211/aes_ccm.h
index 6a73d1e4d186..03e21b0995e3 100644
--- a/net/mac80211/aes_ccm.h
+++ b/net/mac80211/aes_ccm.h
@@ -15,9 +15,9 @@
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
size_t key_len,
size_t mic_len);
-void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
-  u8 *data, size_t data_len, u8 *mic,
-  size_t mic_len);
+int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
+ u8 *data, size_t data_len, u8 *mic,
+ size_t mic_len);
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
  u8 *data, size_t data_len, u8 *mic,
  size_t mic_len);
diff --git a/net/mac80211/aes_gcm.c b/net/mac80211/aes_gcm.c
index 3afe361fd27c..aa8eb2718bc1 

Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 10:14, Ard Biesheuvel  wrote:
> On 17 October 2016 at 09:33, Johannes Berg  wrote:
>> From: Johannes Berg 
>>
>> As the stack can (on x86-64) now be virtually mapped rather than
>> using "normal" kernel memory, Sergey noticed mac80211 isn't using
>> the SG APIs correctly by putting on-stack buffers into SG tables.
>> This leads to kernel crashes.
>>
>> Fix this by allocating the extra fields dynamically on the fly as
>> needed, using a kmem cache.
>>
>> I used per-CPU memory in a previous iteration of this patch, but
>> Ard Biesheuvel pointed out that was also vmalloc'ed on some
>> architectures.
>>
>> Reported-by: Sergey Senozhatsky 
>> Signed-off-by: Johannes Berg 
>
> Apologies for going back and forth on this, but it appears there may
> be another way to deal with this.
>
> First of all, we only need this handling for the authenticated data,
> and only for CCM and GCM, not CMAC (which does not use scatterlists at
> all, it simply calls the AES cipher directly)
>
> So that leaves a fixed 20 bytes for GCM and fixed 32 bytes for CCM,
> which we could allocate along with the AEAD request, e..g.,
>
> """
> diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
> index 8e898a6e8de8..c0c33e6ad94e 100644
> --- a/net/mac80211/aes_ccm.c
> +++ b/net/mac80211/aes_ccm.c
> @@ -24,13 +24,17 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead
> *tfm, u8 *b_0, u8 *aad,
>  {
> struct scatterlist sg[3];
> struct aead_request *aead_req;
> +   u8 *__aad;
>
> aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
> if (!aead_req)
> return -ENOMEM;
>
> +   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
> +   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
> +
> sg_init_table(sg, 3);
> -   sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
> +   sg_set_buf([0], &__aad[2], be16_to_cpup((__be16 *)__aad));
> sg_set_buf([1], data, data_len);
> sg_set_buf([2], mic, mic_len);
>
> @@ -49,6 +53,7 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead
> *tfm, u8 *b_0, u8 *aad,
>  {
> struct scatterlist sg[3];
> struct aead_request *aead_req;
> +   u8 *__aad;
> int err;
>
> if (data_len == 0)
> @@ -58,8 +63,11 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead
> *tfm, u8 *b_0, u8 *aad,
> if (!aead_req)
> return -ENOMEM;
>
> +   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
> +   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
> +
> sg_init_table(sg, 3);
> -   sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
> +   sg_set_buf([0], &__aad[2], be16_to_cpup((__be16 *)__aad));
> sg_set_buf([1], data, data_len);
> sg_set_buf([2], mic, mic_len);
>
> @@ -90,6 +98,8 @@ struct crypto_aead
> *ieee80211_aes_key_setup_encrypt(const u8 key[],
> if (err)
> goto free_aead;
>
> +   crypto_aead_set_reqsize(tfm,
> +   crypto_aead_reqsize(tfm) + 2 * 
> AES_BLOCK_SIZE));
> return tfm;
>

Darn, it seems crypto_aead_set_reqsize() is internal to the crypto API ... :-(


Re: [PATCH] mac80211_hwsim: suggest nl80211 instead of wext driver in documentation

2016-10-17 Thread Johannes Berg
On Mon, 2016-10-17 at 00:39 +0200, Linus Lüssing wrote:
> For mac80211_hwsim interfaces, suggest to use wpa_supplicant with the
> more modern, netlink based driver instead of wext.

Makes sense, applied.

> Actually, I wasn't even able to make a connection with the
> configuration
> files and information provided in
> Documentation/networking/mac80211_hwsim/{README,hostapd.conf/wpa_supp
> licant.conf}
> 
This less so, we even have a few test cases we run regularly, but I
don't know. Maybe there's something special in those configuration
files that we don't test for otherwise.

johannes


Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-17 Thread Jamal Hadi Salim


Some comments:
IIUC, the main struggle seems to be whether the redirect to dummy0
is useful or not? i.e instead of just letting the packets go up the
stack on eth1?
It seems like sflowd needs to read off eth1 via packet socket?
To be backward compatible - supporting that approach seems sensible.

Note:
There is a clear efficiency benefit of both using IFE encoding and
redirecting to dummy0.
1) Redirecting to dummy0 implies you dont need to exercise a bpf
filter around every packet that comes off eth1.
I understand there are probably not millions of pps for this case;
but in a non-offloaded cases it could be millions pps.
And in case of sampling over many ethx devices, you can redirect
samples from many other ethx devices.
So making dummy0 the sflow device is a win.
2) Encaping an IFE header implies a much more efficient bpf filter
(IFE ethertype is an excellent discriminator for bpf).

Additional benefit is as mentioned before - redirecting to a device
means you can send it remotely over ethernet to a more powerful
machine without having to cross kernel-userspace. Redirecting instead
of mirroring to tuntap is also an interesting option.

More comments below (on the sflow person's comment - dont seem him
on the Cc):

On 16-10-15 12:34 PM, Roopa Prabhu wrote:

On 10/12/16, 5:41 AM, Jiri Pirko wrote:

From: Yotam Gigi 




+
+struct sample_packet_metadata {
+   int sample_size;
+   int orig_size;
+   int ifindex;
+};
+

This metadata does not look extensible.. can it be made to ?



Sure it can...


With sflow in context, you need a pair of ifindex numbers to encode ingress and 
egress ports.


What is the use case for both?

Ideally you would also include a sequence number and a count of the total 
number of packets

> that were candidates for sampling.

Sequence number may make sense (they will help show a gap if something
gets dropped). But i am not sure about the stats consuming such space.
Stats are something that can be queried (tc stats should have a record
of how many bytes/packets )


The OVS implementation is a good example, the metadata includes all the actions 
applied
to the packet in the kernel data path.



Again not sure what the use case would be (and why waste such space
especially when you are sending over the wire with such details).


+   rcu_read_lock();
+   retval = READ_ONCE(s->tcf_action);
+
+   if (++s->packet_counter % s->rate == 0) {


The sampling function isn’t random

if (++s->packet_counter % s->rate == 0) {

This is unsuitable for sFlow, which is specific about the random sampling 
function required.
BPF, OVS, and the
ULOG statistics module include efficient kernel based random sampling functions 
that could be used instead.



If i understood correctly, the above is a fallback sampling algorithm.
In the case of the spectrum it already does the sampling in the ASIC
so there is no need to repeat it in software.
Agreed that in that case the sampling approach is not sufficiently
random.

cheers,
jamal


Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-17 Thread Roopa Prabhu
On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
>
> Some comments:
> IIUC, the main struggle seems to be whether the redirect to dummy0
> is useful or not? i.e instead of just letting the packets go up the
> stack on eth1?

yep, correct...given existing workflow for the non-offloaded case is
to receive sample packets via bpf filter on socket or
use netlink as a sample delivery mechanism (NFLOG eg)


> It seems like sflowd needs to read off eth1 via packet socket?
> To be backward compatible - supporting that approach seems sensible.
>
> Note:
> There is a clear efficiency benefit of both using IFE encoding and
> redirecting to dummy0.
> 1) Redirecting to dummy0 implies you dont need to exercise a bpf
> filter around every packet that comes off eth1.
> I understand there are probably not millions of pps for this case;
> but in a non-offloaded cases it could be millions pps.
> And in case of sampling over many ethx devices, you can redirect
> samples from many other ethx devices.
> So making dummy0 the sflow device is a win.
> 2) Encaping an IFE header implies a much more efficient bpf filter
> (IFE ethertype is an excellent discriminator for bpf).
>
> Additional benefit is as mentioned before - redirecting to a device
> means you can send it remotely over ethernet to a more powerful
> machine without having to cross kernel-userspace. Redirecting instead
> of mirroring to tuntap is also an interesting option.

sure, this seems like a good option to have.
generally you have one instance of the sampling agent on a hyper visor or 
switch.
But, if you have use-cases where monitoring agents run external, sure.
would have preferred if it was optional or an addon and not the default.

Regarding the device, yeah, agree there are pros and cons.
An additional device just to sample packets seems like an overkill.
But, if there is no other other option, and there are benefits to it, no 
objections.
Hopefully we can add another option on the existing api to skip the device in 
the future.


>
>
> On 16-10-15 12:34 PM, Roopa Prabhu wrote:
>> On 10/12/16, 5:41 AM, Jiri Pirko wrote:
>>> From: Yotam Gigi 
>
>
>>> +
>>> +struct sample_packet_metadata {
>>> +int sample_size;
>>> +int orig_size;
>>> +int ifindex;
>>> +};
>>> +
>> This metadata does not look extensible.. can it be made to ?
>>
>
> Sure it can...
>
>> With sflow in context, you need a pair of ifindex numbers to encode ingress 
>> and egress ports.
>
> What is the use case for both?

I have heard that most monitoring tools have moved to ingress only sampling 
because of operational
complexity (use case is sflow). I think hardware also supports ingress and 
egress only sampling.
better to have an option to reflect that in the api.

>> Ideally you would also include a sequence number and a count of the total 
>> number of packets
> > that were candidates for sampling.
>
> Sequence number may make sense (they will help show a gap if something
> gets dropped). But i am not sure about the stats consuming such space.
> Stats are something that can be queried (tc stats should have a record
> of how many bytes/packets )

sure, thats fine.
>
>> The OVS implementation is a good example, the metadata includes all the 
>> actions applied
>> to the packet in the kernel data path.
>>
>
> Again not sure what the use case would be (and why waste such space
> especially when you are sending over the wire with such details).

All this is being used currently.., But, this can be other api's sflow uses
for monitoring.
http://openvswitch.org/support/ovscon2014/17/1400-ovs-sflow.pdf

Does not have to be part of the main/basic sampling api...
it was just an example.

>
>>> +rcu_read_lock();
>>> +retval = READ_ONCE(s->tcf_action);
>>> +
>>> +if (++s->packet_counter % s->rate == 0) {
>>
>> The sampling function isn’t random
>>
>> if (++s->packet_counter % s->rate == 0) {
>>
>> This is unsuitable for sFlow, which is specific about the random sampling 
>> function required.
>> BPF, OVS, and the
>> ULOG statistics module include efficient kernel based random sampling 
>> functions that could be used instead.
>>
>
> If i understood correctly, the above is a fallback sampling algorithm.
> In the case of the spectrum it already does the sampling in the ASIC
> so there is no need to repeat it in software.
> Agreed that in that case the sampling approach is not sufficiently
> random.

yes. and since the same sampling api will be used for offloaded and 
non-offloaded case,
the sampling algo here for the non-offloaded case...can do better .. atleast 
match the existing
api efficiency. We would want people to use the same api for the offload and 
non-offloaded case.

thanks,
Roopa



[PATCH 00/28] Reenable maybe-uninitialized warnings

2016-10-17 Thread Arnd Bergmann
This is a set of patches that I hope to get into v4.9 in some form
in order to turn on the -Wmaybe-uninitialized warnings again.

After talking to Linus in person at Linaro Connect about this, I
spent some time on finding all the remaining warnings, and this
is the resulting patch series. More details are in the description
of the last patch that actually enables the warning.

Let me know if there are other warnings that I missed, and whether
you think these are still appropriate for v4.9 or not.
A couple of patches are non-obvious, and could use some more
detailed review.

Arnd

Arnd Bergmann (28):
  [v2] netfilter: nf_tables: avoid uninitialized variable warning
  [v2] mtd: mtk: avoid warning in mtk_ecc_encode
  [v2] infiniband: shut up a maybe-uninitialized warning
  f2fs: replace a build-time warning with runtime WARN_ON
  ext2: avoid bogus -Wmaybe-uninitialized warning
  NFSv4.1: work around -Wmaybe-uninitialized warning
  ceph: avoid false positive maybe-uninitialized warning
  staging: lustre: restore initialization of return code
  staging: lustre: remove broken dead code in
cfs_cpt_table_create_pattern
  UBI: fix uninitialized access of vid_hdr pointer
  block: rdb: false-postive gcc-4.9 -Wmaybe-uninitialized
  [media] rc: print correct variable for z8f0811
  [media] dib0700: fix uninitialized data on 'repeat' event
  iio: accel: sca3000_core: avoid potentially uninitialized variable
  crypto: aesni: avoid -Wmaybe-uninitialized warning
  pcmcia: fix return value of soc_pcmcia_regulator_set
  spi: fsl-espi: avoid processing uninitalized data on error
  drm: avoid uninitialized timestamp use in wait_vblank
  brcmfmac: avoid maybe-uninitialized warning in brcmf_cfg80211_start_ap
  net: bcm63xx: avoid referencing uninitialized variable
  net/hyperv: avoid uninitialized variable
  x86: apm: avoid uninitialized data
  x86: mark target address as output in 'insb' asm
  x86: math-emu: possible uninitialized variable use
  s390: pci: don't print uninitialized data for debugging
  nios2: fix timer initcall return value
  rocker: fix maybe-uninitialized warning
  Kbuild: bring back -Wmaybe-uninitialized warning

 Makefile   |  10 +-
 arch/arc/Makefile  |   4 +-
 arch/nios2/kernel/time.c   |   1 +
 arch/s390/pci/pci_dma.c|   2 +-
 arch/x86/crypto/aesni-intel_glue.c | 121 +
 arch/x86/include/asm/io.h  |   4 +-
 arch/x86/kernel/apm_32.c   |   5 +-
 arch/x86/math-emu/Makefile |   4 +-
 arch/x86/math-emu/reg_compare.c|  16 +--
 drivers/block/rbd.c|   1 +
 drivers/gpu/drm/drm_irq.c  |   4 +-
 drivers/infiniband/core/cma.c  |  56 +-
 drivers/media/i2c/ir-kbd-i2c.c |   2 +-
 drivers/media/usb/dvb-usb/dib0700_core.c   |  10 +-
 drivers/mtd/nand/mtk_ecc.c |  19 ++--
 drivers/mtd/ubi/eba.c  |   2 +-
 drivers/net/ethernet/broadcom/bcm63xx_enet.c   |   3 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c |   4 +-
 drivers/net/hyperv/netvsc_drv.c|   2 +-
 .../broadcom/brcm80211/brcmfmac/cfg80211.c |   2 +-
 drivers/pcmcia/soc_common.c|   2 +-
 drivers/spi/spi-fsl-espi.c |   2 +-
 drivers/staging/iio/accel/sca3000_core.c   |   2 +
 .../staging/lustre/lnet/libcfs/linux/linux-cpu.c   |   7 --
 drivers/staging/lustre/lustre/lov/lov_pack.c   |   2 +
 fs/ceph/super.c|   3 +-
 fs/ext2/inode.c|   7 +-
 fs/f2fs/data.c |   7 ++
 fs/nfs/nfs4session.c   |  10 +-
 net/netfilter/nft_range.c  |  10 +-
 scripts/Makefile.ubsan |   4 +
 31 files changed, 187 insertions(+), 141 deletions(-)

-- 
Cc: x...@kernel.org
Cc: linux-me...@vger.kernel.org
Cc: Mauro Carvalho Chehab 
Cc: Martin Schwidefsky 
Cc: linux-s...@vger.kernel.org
Cc: Ilya Dryomov 
Cc: dri-de...@lists.freedesktop.org
Cc: linux-...@lists.infradead.org
Cc: Herbert Xu 
Cc: linux-cry...@vger.kernel.org
Cc: "David S. Miller" 
Cc: netdev@vger.kernel.org
Cc: Greg Kroah-Hartman 
Cc: ceph-de...@vger.kernel.org
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: linux-e...@vger.kernel.org
Cc: netfilter-de...@vger.kernel.org
2.9.0



[PATCH 01/28] [v2] netfilter: nf_tables: avoid uninitialized variable warning

2016-10-17 Thread Arnd Bergmann
The newly added nft_range_eval() function handles the two possible
nft range operations, but as the compiler warning points out,
any unexpected value would lead to the 'mismatch' variable being
used without being initialized:

net/netfilter/nft_range.c: In function 'nft_range_eval':
net/netfilter/nft_range.c:45:5: error: 'mismatch' may be used uninitialized in 
this function [-Werror=maybe-uninitialized]

This removes the variable in question and instead moves the
condition into the switch itself, which is potentially more
efficient than adding a bogus 'default' clause as in my
first approach, and is nicer than using the 'uninitialized_var'
macro.

Fixes: 0f3cd9b36977 ("netfilter: nf_tables: add range expression")
Link: http://patchwork.ozlabs.org/patch/677114/
Signed-off-by: Arnd Bergmann 
---
 net/netfilter/nft_range.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

Cc: Pablo Neira Ayuso 

diff --git a/net/netfilter/nft_range.c b/net/netfilter/nft_range.c
index c6d5358..2dd80f4 100644
--- a/net/netfilter/nft_range.c
+++ b/net/netfilter/nft_range.c
@@ -28,22 +28,20 @@ static void nft_range_eval(const struct nft_expr *expr,
 const struct nft_pktinfo *pkt)
 {
const struct nft_range_expr *priv = nft_expr_priv(expr);
-   bool mismatch;
int d1, d2;
 
d1 = memcmp(>data[priv->sreg], >data_from, priv->len);
d2 = memcmp(>data[priv->sreg], >data_to, priv->len);
switch (priv->op) {
case NFT_RANGE_EQ:
-   mismatch = (d1 < 0 || d2 > 0);
+   if (d1 < 0 || d2 > 0)
+   regs->verdict.code = NFT_BREAK;
break;
case NFT_RANGE_NEQ:
-   mismatch = (d1 >= 0 && d2 <= 0);
+   if (d1 >= 0 && d2 <= 0)
+   regs->verdict.code = NFT_BREAK;
break;
}
-
-   if (mismatch)
-   regs->verdict.code = NFT_BREAK;
 }
 
 static const struct nla_policy nft_range_policy[NFTA_RANGE_MAX + 1] = {
-- 
2.9.0



[PATCH 19/28] brcmfmac: avoid maybe-uninitialized warning in brcmf_cfg80211_start_ap

2016-10-17 Thread Arnd Bergmann
A bugfix added a sanity check around the assignment and use of the
'is_11d' variable, which looks correct to me, but as the function is
rather complex already, this confuses the compiler to the point where
it can no longer figure out if the variable is always initialized
correctly:

brcm80211/brcmfmac/cfg80211.c: In function ‘brcmf_cfg80211_start_ap’:
brcm80211/brcmfmac/cfg80211.c:4586:10: error: ‘is_11d’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

This adds an initialization for the newly introduced case in which
the variable should not really be used, in order to make the warning
go away.

Fixes: b3589dfe0212 ("brcmfmac: ignore 11d configuration errors")
Cc: Hante Meuleman 
Cc: Arend van Spriel 
Cc: Kalle Valo 
Signed-off-by: Arnd Bergmann 
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index b777e1b..78d9966 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -4516,7 +4516,7 @@ brcmf_cfg80211_start_ap(struct wiphy *wiphy, struct 
net_device *ndev,
/* store current 11d setting */
if (brcmf_fil_cmd_int_get(ifp, BRCMF_C_GET_REGULATORY,
  >vif->is_11d)) {
-   supports_11d = false;
+   is_11d = supports_11d = false;
} else {
country_ie = brcmf_parse_tlvs((u8 *)settings->beacon.tail,
  settings->beacon.tail_len,
-- 
2.9.0



[PATCH 28/28] Kbuild: bring back -Wmaybe-uninitialized warning

2016-10-17 Thread Arnd Bergmann
Traditionally, we have always had warnings about uninitialized variables
enabled, as this is part of -Wall, and generally a good idea [1], but it
also always produced false positives, mainly because this is a variation
of the halting problem and provably impossible to get right in all cases
[2].

Various people have identified cases that are particularly bad for false
positives, and in commit e74fc973b6e5 ("Turn off -Wmaybe-uninitialized
when building with -Os"), I turned off the warning for any build that
was done with CC_OPTIMIZE_FOR_SIZE.  This drastically reduced the number
of false positive warnings in the default build but unfortunately had
the side effect of turning the warning off completely in 'allmodconfig'
builds, which in turn led to a lot of warnings (both actual bugs, and
remaining false positives) to go in unnoticed.

With commit 877417e6ffb9 ("Kbuild: change CC_OPTIMIZE_FOR_SIZE
definition") enabled the warning again for allmodconfig builds in v4.7
and in v4.8-rc1, I had finally managed to address all warnings I get in
an ARM allmodconfig build and most other maybe-uninitialized warnings
for ARM randconfig builds.

However, commit 6e8d666e9253 ("Disable "maybe-uninitialized" warning
globally") was merged at the same time and disabled it completely for
all configurations, because of false-positive warnings on x86 that
I had not addressed until then. This caused a lot of actual bugs to
get merged into mainline, and I sent several dozen patches for these
during the v4.9 development cycle. Most of these are actual bugs,
some are for correct code that is safe because it is only called
under external constraints that make it impossible to run into
the case that gcc sees, and in a few cases gcc is just stupid and
finds something that can obviously never happen.

I have now done a few thousand randconfig builds on x86 and collected
all patches that I needed to address every single warning I got
(I can provide the combined patch for the other warnings if anyone
is interested), so I hope we can get the warning back and let people
catch the actual bugs earlier.

Note that the majority of the patches I created are for the third kind
of problem (stupid false-positives), for one of two reasons:
- some of them only get triggered in certain combinations of config
  options, so we don't always run into them, and
- the actual bugs tend to get addressed much quicker as they also
  lead to incorrect runtime behavior.

These 27 patches address the warnings that either occur in one of the more
common configurations (defconfig, allmodconfig, or something built by the
kbuild robot or kernelci.org), or they are about a real bug. It would be
good to get these all into v4.9 if we want to turn on the warning again.
I have tested these extensively with gcc-4.9 and gcc-6 and done a bit
of testing with gcc-5, and all of these should now be fine. gcc-4.8
is much worse about the false-positive warnings and is also fairly old
now, so I'm leaving the warning disabled with that version. gcc-4.7 and
older don't understand the -Wno-maybe-uninitialized option and are not
affected by this patch either way.

I have another (smaller) series of patches for warnings that are both
harmless and not as easy to trigger, and I will send them for inclusion
in v4.10.

Link: https://rusty.ozlabs.org/?p=232 [1]
Link: https://gcc.gnu.org/wiki/Better_Uninitialized_Warnings [2]
Signed-off-by: Arnd Bergmann 
---
 Makefile   | 10 ++
 arch/arc/Makefile  |  4 +++-
 scripts/Makefile.ubsan |  4 
 3 files changed, 13 insertions(+), 5 deletions(-)

Cc: x...@kernel.org
Cc: linux-me...@vger.kernel.org
Cc: Mauro Carvalho Chehab 
Cc: Martin Schwidefsky 
Cc: linux-s...@vger.kernel.org
Cc: Ilya Dryomov 
Cc: dri-de...@lists.freedesktop.org
Cc: linux-...@lists.infradead.org
Cc: Herbert Xu 
Cc: linux-cry...@vger.kernel.org
Cc: "David S. Miller" 
Cc: netdev@vger.kernel.org
Cc: Greg Kroah-Hartman 
Cc: ceph-de...@vger.kernel.org
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: linux-e...@vger.kernel.org
Cc: netfilter-de...@vger.kernel.org

diff --git a/Makefile b/Makefile
index 512e47a..43cd3d9 100644
--- a/Makefile
+++ b/Makefile
@@ -370,7 +370,7 @@ LDFLAGS_MODULE  =
 CFLAGS_KERNEL  =
 AFLAGS_KERNEL  =
 LDFLAGS_vmlinux =
-CFLAGS_GCOV= -fprofile-arcs -ftest-coverage -fno-tree-loop-im
+CFLAGS_GCOV= -fprofile-arcs -ftest-coverage -fno-tree-loop-im  
-Wno-maybe-uninitialized
 CFLAGS_KCOV:= $(call cc-option,-fsanitize-coverage=trace-pc,)
 
 
@@ -620,7 +620,6 @@ ARCH_CFLAGS :=
 include arch/$(SRCARCH)/Makefile
 
 KBUILD_CFLAGS  += $(call cc-option,-fno-delete-null-pointer-checks,)
-KBUILD_CFLAGS  += $(call cc-disable-warning,maybe-uninitialized,)
 KBUILD_CFLAGS  += $(call cc-disable-warning,frame-address,)
 
 ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
@@ -629,15 +628,18 @@ 

Re: [Intel-wired-lan] [PATCH V3 RFC 2/2] ixgbe: ixgbe_atr() compute l4_proto only if non-paged data has network/transport headers

2016-10-17 Thread Jeff Kirsher
On Mon, 2016-10-17 at 18:47 -0400, Sowmini Varadhan wrote:
> On (10/17/16 15:37), Jeff Kirsher wrote:
> > > Reviewed-by: Alexander Duyck 
> > 
> > Sowmini, can you re-submit this to intel-wired-lan but without the RFC
> in
> > the title?
> 
> V4 resubmitted.. I think I just inadvertently forgot to add Alex as the
> reviewed-by.. could you please fix that (or I can resubmit v5 if needed).

No need to resubmit, I can make sure Alex's reviewed-by gets added.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH net-next 02/11] net: Introduce new api for walking upper and lower devices

2016-10-17 Thread David Ahern
On 10/17/16 6:21 AM, Stephen Hemminger wrote:
> 
> No if/else needed. No cast of void * ptr need. Use const if possible?
> 

so much of the stack does not use const and trying to add it for this API does 
not work -- the upper or lower device is passed to the callbacks and those 
callbacks invoke other apis. e.g., the bond patch calls vlan_get_encap_level, 
bond_verify_device_path and bond_confirm_addr and none of those accept a const 
dev.

v3 coming up with the more succinct versions, but const is not possible.


[PATCH 1/1] net: vlan: Use sizeof instead of literal number

2016-10-17 Thread fgao
From: Gao Feng 

Use sizeof variable instead of literal number to enhance the readability.

Signed-off-by: Gao Feng 
---
 net/8021q/vlan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8de138d..5a3903b 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -515,8 +515,8 @@ static int vlan_ioctl_handler(struct net *net, void __user 
*arg)
return -EFAULT;
 
/* Null terminate this sucker, just in case. */
-   args.device1[23] = 0;
-   args.u.device2[23] = 0;
+   args.device1[sizeof(args.device1) - 1] = 0;
+   args.u.device2[sizeof(args.u.device2) - 1] = 0;
 
rtnl_lock();
 
-- 
1.9.1




[PATCH net-next 1/1] net: vlan: Use sizeof instead of literal number

2016-10-17 Thread fgao
From: Gao Feng 

Use sizeof variable instead of literal number to enhance the readability.

Signed-off-by: Gao Feng 
---
 net/8021q/vlan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8de138d..5a3903b 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -515,8 +515,8 @@ static int vlan_ioctl_handler(struct net *net, void __user 
*arg)
return -EFAULT;
 
/* Null terminate this sucker, just in case. */
-   args.device1[23] = 0;
-   args.u.device2[23] = 0;
+   args.device1[sizeof(args.device1) - 1] = 0;
+   args.u.device2[sizeof(args.u.device2) - 1] = 0;
 
rtnl_lock();
 
-- 
1.9.1




Re: [PATCH 1/1] net: vlan: Use sizeof instead of literal number

2016-10-17 Thread Feng Gao
On Tue, Oct 18, 2016 at 8:44 AM,   wrote:
> From: Gao Feng 
>
> Use sizeof variable instead of literal number to enhance the readability.
>
> Signed-off-by: Gao Feng 
> ---
>  net/8021q/vlan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
> index 8de138d..5a3903b 100644
> --- a/net/8021q/vlan.c
> +++ b/net/8021q/vlan.c
> @@ -515,8 +515,8 @@ static int vlan_ioctl_handler(struct net *net, void 
> __user *arg)
> return -EFAULT;
>
> /* Null terminate this sucker, just in case. */
> -   args.device1[23] = 0;
> -   args.u.device2[23] = 0;
> +   args.device1[sizeof(args.device1) - 1] = 0;
> +   args.u.device2[sizeof(args.u.device2) - 1] = 0;
>
> rtnl_lock();
>
> --
> 1.9.1
>
>

Sorry, I forget add the "net-next" in the title.
Now I have sent another new patch, please ignore this conversation.

Regards
Feng


Re: [PATCH net-next 1/1] net: vlan: Use sizeof instead of literal number

2016-10-17 Thread David Miller

It never makes sense to send the same patch for both net and net-next.

If it's a bug fix, it goes to 'net'.  And it will be eventually
be naturally merged into 'net-next'.

Otherwise, if it's a new feature, cleanup, or optimization it goes to
'net-next'.


Re: [PATCH net-next 1/1] net: vlan: Use sizeof instead of literal number

2016-10-17 Thread Gao Feng
Hi David,

On Tue, Oct 18, 2016 at 9:36 AM, David Miller  wrote:
>
> It never makes sense to send the same patch for both net and net-next.
>
> If it's a bug fix, it goes to 'net'.  And it will be eventually
> be naturally merged into 'net-next'.
>
> Otherwise, if it's a new feature, cleanup, or optimization it goes to
> 'net-next'.

Because I forget add the "net-next" in the title of first patch, so I
send the second patch with right title.
And I have replied the first patch and said the reason.

Regards
Feng




[PATCH net-next 09/11] net: Remove all_adj_list and its references

2016-10-17 Thread David Ahern
Only direct adjacencies are maintained. All upper or lower devices can
be learned via the new walk API which recursively walks the adj_list for
upper devices or lower devices.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h |  25 --
 net/core/dev.c| 223 --
 2 files changed, 18 insertions(+), 230 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a5902d995907..458c87631e7f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1456,7 +1456,6 @@ enum netdev_priv_flags {
  * @ptype_specific: Device-specific, protocol-specific packet handlers
  *
  * @adj_list:  Directly linked devices, like slaves for bonding
- * @all_adj_list:  All linked devices, *including* neighbours
  * @features:  Currently active device features
  * @hw_features:   User-changeable features
  *
@@ -1675,11 +1674,6 @@ struct net_device {
struct list_head lower;
} adj_list;
 
-   struct {
-   struct list_head upper;
-   struct list_head lower;
-   } all_adj_list;
-
netdev_features_t   features;
netdev_features_t   hw_features;
netdev_features_t   wanted_features;
@@ -3771,13 +3765,6 @@ struct net_device 
*netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
 updev; \
 updev = netdev_upper_get_next_dev_rcu(dev, &(iter)))
 
-/* iterate through upper list, must be called under RCU read lock */
-#define netdev_for_each_all_upper_dev_rcu(dev, updev, iter) \
-   for (iter = &(dev)->all_adj_list.upper, \
-updev = netdev_all_upper_get_next_dev_rcu(dev, &(iter)); \
-updev; \
-updev = netdev_all_upper_get_next_dev_rcu(dev, &(iter)))
-
 int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
  int (*fn)(struct net_device *upper_dev,
void *data),
@@ -3817,18 +3804,6 @@ struct net_device *netdev_all_lower_get_next(struct 
net_device *dev,
 struct net_device *netdev_all_lower_get_next_rcu(struct net_device *dev,
 struct list_head **iter);
 
-#define netdev_for_each_all_lower_dev(dev, ldev, iter) \
-   for (iter = (dev)->all_adj_list.lower.next, \
-ldev = netdev_all_lower_get_next(dev, &(iter)); \
-ldev; \
-ldev = netdev_all_lower_get_next(dev, &(iter)))
-
-#define netdev_for_each_all_lower_dev_rcu(dev, ldev, iter) \
-   for (iter = (dev)->all_adj_list.lower.next, \
-ldev = netdev_all_lower_get_next_rcu(dev, &(iter)); \
-ldev; \
-ldev = netdev_all_lower_get_next_rcu(dev, &(iter)))
-
 int netdev_walk_all_lower_dev(struct net_device *dev,
  int (*fn)(struct net_device *lower_dev,
void *data),
diff --git a/net/core/dev.c b/net/core/dev.c
index fc48337cfab8..a9fe14908b44 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5137,6 +5137,13 @@ static struct netdev_adjacent *__netdev_find_adj(struct 
net_device *adj_dev,
return NULL;
 }
 
+static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
+{
+   struct net_device *dev = data;
+
+   return upper_dev == dev;
+}
+
 /**
  * netdev_has_upper_dev - Check if device is linked to an upper device
  * @dev: device
@@ -5151,7 +5158,8 @@ bool netdev_has_upper_dev(struct net_device *dev,
 {
ASSERT_RTNL();
 
-   return __netdev_find_adj(upper_dev, >all_adj_list.upper);
+   return netdev_walk_all_upper_dev_rcu(dev, __netdev_has_upper_dev,
+upper_dev);
 }
 EXPORT_SYMBOL(netdev_has_upper_dev);
 
@@ -5165,13 +5173,6 @@ EXPORT_SYMBOL(netdev_has_upper_dev);
  * The caller must hold rcu lock.
  */
 
-static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
-{
-   struct net_device *dev = data;
-
-   return upper_dev == dev;
-}
-
 bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
  struct net_device *upper_dev)
 {
@@ -5191,7 +5192,7 @@ static bool netdev_has_any_upper_dev(struct net_device 
*dev)
 {
ASSERT_RTNL();
 
-   return !list_empty(>all_adj_list.upper);
+   return !list_empty(>adj_list.upper);
 }
 
 /**
@@ -5254,32 +5255,6 @@ struct net_device *netdev_upper_get_next_dev_rcu(struct 
net_device *dev,
 }
 EXPORT_SYMBOL(netdev_upper_get_next_dev_rcu);
 
-/**
- * netdev_all_upper_get_next_dev_rcu - Get the next dev from upper list
- * @dev: device
- * @iter: list_head ** of the current position
- *
- * Gets the next device from the dev's upper list, starting from iter
- * position. The caller must hold RCU read lock.
- */
-struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
- 

[PATCH v3 net-next 00/11] net: Fix netdev adjacency tracking

2016-10-17 Thread David Ahern
The netdev adjacency tracking is failing to create proper dependencies
for some topologies. For example this topology

++
|  myvrf |
++
  ||
  |  +-+
  |  | macvlan |
  |  +-+
  ||
  +--+
  |  bridge  |
  +--+
  |
  ++
  | bond1  |
  ++
  |
  ++
  |  eth3  |
  ++

hits 1 of 2 problems depending on the order of enslavement. The base set of
commands for both cases:

ip link add bond1 type bond
ip link set bond1 up
ip link set eth3 down
ip link set eth3 master bond1
ip link set eth3 up

ip link add bridge type bridge
ip link set bridge up
ip link add macvlan link bridge type macvlan
ip link set macvlan up

ip link add myvrf type vrf table 1234
ip link set myvrf up

ip link set bridge master myvrf

Case 1 enslave macvlan to the vrf before enslaving the bond to the bridge:

ip link set macvlan master myvrf
ip link set bond1 master bridge

Attempts to delete the VRF:
ip link delete myvrf

trigger the BUG in __netdev_adjacent_dev_remove:

[  587.405260] tried to remove device eth3 from myvrf
[  587.407269] [ cut here ]
[  587.408918] kernel BUG at /home/dsa/kernel.git/net/core/dev.c:5661!
[  587.43] invalid opcode:  [#1] SMP
[  587.412454] Modules linked in: macvlan bridge stp llc bonding vrf
[  587.414765] CPU: 0 PID: 726 Comm: ip Not tainted 4.8.0+ #109
[  587.416766] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[  587.420241] task: 88013ab6eec0 task.stack: c9628000
[  587.422163] RIP: 0010:[]  [] 
__netdev_adjacent_dev_remove+0x40/0x12c
...
[  587.446053] Call Trace:
[  587.446424]  [] __netdev_adjacent_dev_unlink+0x20/0x3c
[  587.447390]  [] netdev_upper_dev_unlink+0xfa/0x15e
[  587.448297]  [] vrf_del_slave+0x13/0x2a [vrf]
[  587.449153]  [] vrf_dev_uninit+0xea/0x114 [vrf]
[  587.450036]  [] rollback_registered_many+0x22b/0x2da
[  587.450974]  [] unregister_netdevice_many+0x17/0x48
[  587.451903]  [] rtnl_delete_link+0x3c/0x43
[  587.452719]  [] rtnl_dellink+0x180/0x194

When the BUG is converted to a WARN_ON it shows 4 missing adjacencies:
  eth3 - myvrf, mvrf - eth3, bond1 - myvrf and myvrf - bond1

All of those are because the __netdev_upper_dev_link function does not
properly link macvlan lower devices to myvrf when it is enslaved.

The second case just flips the ordering of the enslavements:
ip link set bond1 master bridge
ip link set macvlan master myvrf

Then run:
ip link delete bond1
ip link delete myvrf

The vrf delete command hangs because myvrf has a reference that has not
been released. In this case the removal code does not account for 2 paths 
between eth3 and myvrf - one from bridge to vrf and the other through the
macvlan.

Rather than try to maintain a linked list of all upper and lower devices
per netdevice, only track the direct neighbors. The remaining stack can
be determined by recursively walking the neighbors.

The existing netdev_for_each_all_upper_dev_rcu,
netdev_for_each_all_lower_dev and netdev_for_each_all_lower_dev_rcu macros
are replaced with APIs that walk the upper and lower device lists. The
new APIs take a callback function and a data arg that is passed to the
callback for each device in the list. Drivers using the old macros are
converted in separate patches to make it easier on reviewers. It is an
API conversion only; no functional change is intended.

v3
- address Stephen's comment to simplify logic and remove typecasts

v2
- fixed bond0 references in cover-letter
- fixed definition of netdev_next_lower_dev_rcu to mirror the upper_dev
  version.

David Ahern (11):
  net: Remove refnr arg when inserting link adjacencies
  net: Introduce new api for walking upper and lower devices
  net: bonding: Flip to the new dev walk API
  IB/core: Flip to the new dev walk API
  IB/ipoib: Flip to new dev walk API
  ixgbe: Flip to the new dev walk API
  mlxsw: Flip to the new dev walk API
  rocker: Flip to the new dev walk API
  net: Remove all_adj_list and its references
  net: Add warning if any lower device is still in adjacency list
  net: dev: Improve debug statements for adjacency tracking

 drivers/infiniband/core/core_priv.h|   9 +-
 drivers/infiniband/core/roce_gid_mgmt.c|  42 +--
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  37 ++-
 drivers/net/bonding/bond_alb.c |  82 +++---
 drivers/net/bonding/bond_main.c|  17 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 132 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  37 ++-
 drivers/net/ethernet/rocker/rocker_main.c  |  31 ++-
 include/linux/netdevice.h  |  38 ++-
 net/core/dev.c | 350 -
 10 files changed, 

[PATCH net-next 05/11] IB/ipoib: Flip to new dev walk API

2016-10-17 Thread David Ahern
Convert ipoib_get_net_dev_match_addr to the new upper device walk API.
This is just a code conversion; no functional change is intended.

v2
- removed typecast of data

Signed-off-by: David Ahern 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 37 +--
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 5636fc3da6b8..cc059218c962 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -292,6 +292,25 @@ static struct net_device *ipoib_get_master_net_dev(struct 
net_device *dev)
return dev;
 }
 
+struct ipoib_walk_data {
+   const struct sockaddr *addr;
+   struct net_device *result;
+};
+
+static int ipoib_upper_walk(struct net_device *upper, void *_data)
+{
+   struct ipoib_walk_data *data = _data;
+   int ret = 0;
+
+   if (ipoib_is_dev_match_addr_rcu(data->addr, upper)) {
+   dev_hold(upper);
+   data->result = upper;
+   ret = 1;
+   }
+
+   return ret;
+}
+
 /**
  * Find a net_device matching the given address, which is an upper device of
  * the given net_device.
@@ -304,27 +323,21 @@ static struct net_device *ipoib_get_master_net_dev(struct 
net_device *dev)
 static struct net_device *ipoib_get_net_dev_match_addr(
const struct sockaddr *addr, struct net_device *dev)
 {
-   struct net_device *upper,
- *result = NULL;
-   struct list_head *iter;
+   struct ipoib_walk_data data = {
+   .addr = addr,
+   };
 
rcu_read_lock();
if (ipoib_is_dev_match_addr_rcu(addr, dev)) {
dev_hold(dev);
-   result = dev;
+   data.result = dev;
goto out;
}
 
-   netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
-   if (ipoib_is_dev_match_addr_rcu(addr, upper)) {
-   dev_hold(upper);
-   result = upper;
-   break;
-   }
-   }
+   netdev_walk_all_upper_dev_rcu(dev, ipoib_upper_walk, );
 out:
rcu_read_unlock();
-   return result;
+   return data.result;
 }
 
 /* returns the number of IPoIB netdevs on top a given ipoib device matching a
-- 
2.1.4



Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-17 Thread Lawrence Brakmo
Yuchung and Eric, thank you for your comments.

It looks like I need to think more about this patch. I was trying
to reduce the likelihood of reordering (which seems even more
important based on Eric¹s comment on pacing), but it seems like
the only way to prevent reordering is to only re-hash after an RTO
or when there are no packets in flight (which may not occur).


On 10/11/16, 8:56 PM, "Yuchung Cheng"  wrote:

>On Tue, Oct 11, 2016 at 6:01 PM, Yuchung Cheng  wrote:
>> On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo  wrote:
>>> Yuchung, thank you for your comments. Responses inline.
>>>
>>> On 10/11/16, 12:49 PM, "Yuchung Cheng"  wrote:
>>>
On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo  wrote:
>
> The purpose of this patch is to help balance flows across paths. A
>new
> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100)
>that
> the txhash (IPv6 flowlabel) will be changed after a non-RTO
>retransmit.
> A probability is used in order to control how many flows are moved
> during a congestion event and prevent the congested path from
>becoming
> under utilized (which could occur if too many flows leave the current
> path). Txhash changes may be delayed in order to decrease the
>likelihood
> that it will trigger retransmists due to too much reordering.
>
> Another sysctl "tcp_retrans_txhash_mode" determines the behavior
>after
> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
> txhash changes. The idea is to decrease the likelihood of going back
> to a broken path. That is, we don't want flow balancing to trigger
> changes to broken paths. The drawback is that flow balancing does
> not work as well. If the sysctl is greater than 1, then we always
> do flow balancing, even after RTOs.
>
> Tested with packedrill tests (for correctness) and performance
> experiments with 2 and 3 paths. Performance experiments looked at
> aggregate goodput and fairness. For each run, we looked at the ratio
>of
> the goodputs for the fastest and slowest flows. These were averaged
>for
> all the runs. A fairness of 1 means all flows had the same goodput, a
> fairness of 2 means the fastest flow was twice as fast as the slowest
> flow.
>
> The setup for the performance experiments was 4 or 5 serves in a
>rack,
> 10G links. I tested various probabilities, but 20 seemed to have the
> best tradeoff for my setup (small RTTs).
>
>   --- node1 -
> sender --- switch --- node2 - switch  receiver
>   --- node3 -
>
> Scenario 1: One sender sends to one receiver through 2 routes (node1
>or
> node 2). The output from node1 and node2 is 1G (1gbit/sec). With
>only 2
> flows, without flow balancing (prob=0) the average goodput is 1.6G
>vs.
> 1.9G with flow balancing due to 2 flows ending up in one link and
>either
> not moving and taking some time to move. Fairness was 1 in all cases.
> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or
>1.2
> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
> flow balancing increased fairness.
>
> Scenario 2: One sender to one receiver, through 3 routes (node1,...
> node2). With 6 or 16 flows the goodput was the same for all, but
> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
> case fairness out of 10 runs were 2.2, 1.8 and 1.4 repectively. That
>is,
> prob=20,mode=1 improved average and worst case fairness.
I am wondering if we can build better API with routing layer to
implement this type of feature, instead of creeping the tx_rehashing
logic scatter in TCP. For example, we call dst_negative_advice on TCP
write timeouts.
>>>
>>> Not sure. The route is not necessarily bad, may be temporarily
>>>congested
>>> or they may all be congested. If all we want to do is change the txhash
>>> (unlike dst_negative_advice), then calling a tx_rehashing function may
>>> be the appropriate call.
>>>

On the patch itself, it seems aggressive to (attempt to) rehash every
post-RTO retranmission. Also you can just use ca_state (==CA_Loss) to
identify post-RTO retransmission directly.
>>>
>>> Thanks, I will add the test.
>>>

is this an implementation of the Flow Bender ?
https://urldefense.proofpoint.com/v2/url?u=http-3A__dl.acm.org_citation
.cf
m-3Fid-3D2674985=DQIBaQ=5VD0RTtNlTh3ycd41b3MUw=pq_Mqvzfy-C8ltkgyx
1u_
g=Q4nONH7kQ5AvQguw9UxpcHd79jfdDdrXj1YSJs7Ezhk=MA4fWBLMTGgRS0eGvBjxf
7BJ
Ol3-oxAzZDEYUG4cE-s=
>>>
>>> Part of flow bender, although there are also some similarities to
>>>flowlet
>>> switching.
>>>

>
> Scenario 3: One sender to one receiver, 2 

[PATCH net-next] r8152: add new products of Lenovo

2016-10-17 Thread Hayes Wang
Add the following four products of Lenovo and sort the order of the list.

VID PID
0x17ef  0x3062
0x17ef  0x3069
0x17ef  0x720c
0x17ef  0x7214

Signed-off-by: Hayes Wang 
---
 drivers/net/usb/cdc_ether.c | 28 
 drivers/net/usb/r8152.c |  6 +-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index c47ec0a..45e5e43 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -687,6 +687,20 @@ static const struct usb_device_id  products[] = {
.driver_info = 0,
 },
 
+/* ThinkPad USB-C Dock (based on Realtek RTL8153) */
+{
+   USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x3062, USB_CLASS_COMM,
+   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+   .driver_info = 0,
+},
+
+/* ThinkPad Thunderbolt 3 Dock (based on Realtek RTL8153) */
+{
+   USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x3069, USB_CLASS_COMM,
+   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+   .driver_info = 0,
+},
+
 /* Lenovo Thinkpad USB 3.0 Ethernet Adapters (based on Realtek RTL8153) */
 {
USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x7205, USB_CLASS_COMM,
@@ -694,6 +708,20 @@ static const struct usb_device_id  products[] = {
.driver_info = 0,
 },
 
+/* Lenovo USB C to Ethernet Adapter (based on Realtek RTL8153) */
+{
+   USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x720c, USB_CLASS_COMM,
+   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+   .driver_info = 0,
+},
+
+/* Lenovo USB-C Travel Hub (based on Realtek RTL8153) */
+{
+   USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x7214, USB_CLASS_COMM,
+   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+   .driver_info = 0,
+},
+
 /* NVIDIA Tegra USB 3.0 Ethernet Adapters (based on Realtek RTL8153) */
 {
USB_DEVICE_AND_INTERFACE_INFO(NVIDIA_VENDOR_ID, 0x09ff, USB_CLASS_COMM,
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 2886946..8d6e13c 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -4411,8 +4411,12 @@ static struct usb_device_id rtl8152_table[] = {
{REALTEK_USB_DEVICE(VENDOR_ID_REALTEK, 0x8152)},
{REALTEK_USB_DEVICE(VENDOR_ID_REALTEK, 0x8153)},
{REALTEK_USB_DEVICE(VENDOR_ID_SAMSUNG, 0xa101)},
-   {REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7205)},
{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x304f)},
+   {REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x3062)},
+   {REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x3069)},
+   {REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7205)},
+   {REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x720c)},
+   {REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7214)},
{REALTEK_USB_DEVICE(VENDOR_ID_NVIDIA,  0x09ff)},
{}
 };
-- 
2.7.4



[PATCH net-next 07/11] mlxsw: Flip to the new dev walk API

2016-10-17 Thread David Ahern
Convert mlxsw users to new dev walk API. This is just a code conversion;
no functional change is intended.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 37 --
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 43a5eddc2c11..99805fd3d110 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3092,19 +3092,30 @@ static bool mlxsw_sp_port_dev_check(const struct 
net_device *dev)
return dev->netdev_ops == _sp_port_netdev_ops;
 }
 
+static int mlxsw_lower_dev_walk(struct net_device *lower_dev, void *data)
+{
+   struct mlxsw_sp_port **port = data;
+   int ret = 0;
+
+   if (mlxsw_sp_port_dev_check(lower_dev)) {
+   *port = netdev_priv(lower_dev);
+   ret = 1;
+   }
+
+   return ret;
+}
+
 static struct mlxsw_sp_port *mlxsw_sp_port_dev_lower_find(struct net_device 
*dev)
 {
-   struct net_device *lower_dev;
-   struct list_head *iter;
+   struct mlxsw_sp_port *port;
 
if (mlxsw_sp_port_dev_check(dev))
return netdev_priv(dev);
 
-   netdev_for_each_all_lower_dev(dev, lower_dev, iter) {
-   if (mlxsw_sp_port_dev_check(lower_dev))
-   return netdev_priv(lower_dev);
-   }
-   return NULL;
+   port = NULL;
+   netdev_walk_all_lower_dev(dev, mlxsw_lower_dev_walk, );
+
+   return port;
 }
 
 static struct mlxsw_sp *mlxsw_sp_lower_get(struct net_device *dev)
@@ -3117,17 +3128,15 @@ static struct mlxsw_sp *mlxsw_sp_lower_get(struct 
net_device *dev)
 
 static struct mlxsw_sp_port *mlxsw_sp_port_dev_lower_find_rcu(struct 
net_device *dev)
 {
-   struct net_device *lower_dev;
-   struct list_head *iter;
+   struct mlxsw_sp_port *port;
 
if (mlxsw_sp_port_dev_check(dev))
return netdev_priv(dev);
 
-   netdev_for_each_all_lower_dev_rcu(dev, lower_dev, iter) {
-   if (mlxsw_sp_port_dev_check(lower_dev))
-   return netdev_priv(lower_dev);
-   }
-   return NULL;
+   port = NULL;
+   netdev_walk_all_lower_dev_rcu(dev, mlxsw_lower_dev_walk, );
+
+   return port;
 }
 
 struct mlxsw_sp_port *mlxsw_sp_port_lower_dev_hold(struct net_device *dev)
-- 
2.1.4



[PATCH net-next 08/11] rocker: Flip to the new dev walk API

2016-10-17 Thread David Ahern
Convert rocker to the new dev walk API. This is just a code conversion;
no functional change is intended.

v2
- removed typecast of data

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/rocker/rocker_main.c | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c 
b/drivers/net/ethernet/rocker/rocker_main.c
index 5424fb341613..5deb25f26e5f 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2839,20 +2839,37 @@ static bool rocker_port_dev_check_under(const struct 
net_device *dev,
return true;
 }
 
+struct rocker_walk_data {
+   struct rocker *rocker;
+   struct rocker_port *port;
+};
+
+static int rocker_lower_dev_walk(struct net_device *lower_dev, void *_data)
+{
+   struct rocker_walk_data *data = _data;
+   int ret = 0;
+
+   if (rocker_port_dev_check_under(lower_dev, data->rocker)) {
+   data->port = netdev_priv(lower_dev);
+   ret = 1;
+   }
+
+   return ret;
+}
+
 struct rocker_port *rocker_port_dev_lower_find(struct net_device *dev,
   struct rocker *rocker)
 {
-   struct net_device *lower_dev;
-   struct list_head *iter;
+   struct rocker_walk_data data;
 
if (rocker_port_dev_check_under(dev, rocker))
return netdev_priv(dev);
 
-   netdev_for_each_all_lower_dev(dev, lower_dev, iter) {
-   if (rocker_port_dev_check_under(lower_dev, rocker))
-   return netdev_priv(lower_dev);
-   }
-   return NULL;
+   data.rocker = rocker;
+   data.port = NULL;
+   netdev_walk_all_lower_dev(dev, rocker_lower_dev_walk, );
+
+   return data.port;
 }
 
 static int rocker_netdevice_event(struct notifier_block *unused,
-- 
2.1.4



[PATCH net-next 11/11] net: dev: Improve debug statements for adjacency tracking

2016-10-17 Thread David Ahern
Adjacency code only has debugs for the insert case. Add debugs for
the remove path and make both consistently worded to make it easier
to follow the insert and removal with reference counts.

In addition, change the BUG to a WARN_ON. A missing adjacency at
removal time is not cause for a panic.

Signed-off-by: David Ahern 
---
 net/core/dev.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c6bbf310d407..f55fb4536016 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5561,6 +5561,9 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
 
if (adj) {
adj->ref_nr += 1;
+   pr_debug("Insert adjacency: dev %s adj_dev %s adj->ref_nr %d\n",
+dev->name, adj_dev->name, adj->ref_nr);
+
return 0;
}
 
@@ -5574,8 +5577,8 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
adj->private = private;
dev_hold(adj_dev);
 
-   pr_debug("dev_hold for %s, because of link added from %s to %s\n",
-adj_dev->name, dev->name, adj_dev->name);
+   pr_debug("Insert adjacency: dev %s adj_dev %s adj->ref_nr %d; dev_hold 
on %s\n",
+dev->name, adj_dev->name, adj->ref_nr, adj_dev->name);
 
if (netdev_adjacent_is_neigh_list(dev, adj_dev, dev_list)) {
ret = netdev_adjacent_sysfs_add(dev, adj_dev, dev_list);
@@ -5614,17 +5617,22 @@ static void __netdev_adjacent_dev_remove(struct 
net_device *dev,
 {
struct netdev_adjacent *adj;
 
+   pr_debug("Remove adjacency: dev %s adj_dev %s ref_nr %d\n",
+dev->name, adj_dev->name, ref_nr);
+
adj = __netdev_find_adj(adj_dev, dev_list);
 
if (!adj) {
-   pr_err("tried to remove device %s from %s\n",
+   pr_err("Adjacency does not exist for device %s from %s\n",
   dev->name, adj_dev->name);
-   BUG();
+   WARN_ON(1);
+   return;
}
 
if (adj->ref_nr > ref_nr) {
-   pr_debug("%s to %s ref_nr-%d = %d\n", dev->name, adj_dev->name,
-ref_nr, adj->ref_nr-ref_nr);
+   pr_debug("adjacency: %s to %s ref_nr - %d = %d\n",
+dev->name, adj_dev->name, ref_nr,
+adj->ref_nr - ref_nr);
adj->ref_nr -= ref_nr;
return;
}
@@ -5636,7 +5644,7 @@ static void __netdev_adjacent_dev_remove(struct 
net_device *dev,
netdev_adjacent_sysfs_del(dev, adj_dev->name, dev_list);
 
list_del_rcu(>list);
-   pr_debug("dev_put for %s, because link removed from %s to %s\n",
+   pr_debug("adjacency: dev_put for %s, because link removed from %s to 
%s\n",
 adj_dev->name, dev->name, adj_dev->name);
dev_put(adj_dev);
kfree_rcu(adj, rcu);
-- 
2.1.4



[PATCH net-next 04/11] IB/core: Flip to the new dev walk API

2016-10-17 Thread David Ahern
Convert rdma_is_upper_dev_rcu, handle_netdev_upper and
ipoib_get_net_dev_match_addr to the new upper device walk API.
This is just a code conversion; no functional change is intended.

v2
- removed typecast of data

Signed-off-by: David Ahern 
---
 drivers/infiniband/core/core_priv.h |  9 +--
 drivers/infiniband/core/roce_gid_mgmt.c | 42 ++---
 2 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index 19d499dcab76..0c0bea091de8 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -127,14 +127,7 @@ void ib_cache_release_one(struct ib_device *device);
 static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
 struct net_device *upper)
 {
-   struct net_device *_upper = NULL;
-   struct list_head *iter;
-
-   netdev_for_each_all_upper_dev_rcu(dev, _upper, iter)
-   if (_upper == upper)
-   break;
-
-   return _upper == upper;
+   return netdev_has_upper_dev_all_rcu(dev, upper);
 }
 
 int addr_init(void);
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 06556c34606d..3a64a0881882 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -437,6 +437,28 @@ static void callback_for_addr_gid_device_scan(struct 
ib_device *device,
  >gid_attr);
 }
 
+struct upper_list {
+   struct list_head list;
+   struct net_device *upper;
+};
+
+static int netdev_upper_walk(struct net_device *upper, void *data)
+{
+   struct upper_list *entry = kmalloc(sizeof(*entry), GFP_ATOMIC);
+   struct list_head *upper_list = data;
+
+   if (!entry) {
+   pr_info("roce_gid_mgmt: couldn't allocate entry to delete 
ndev\n");
+   return 0;
+   }
+
+   list_add_tail(>list, upper_list);
+   dev_hold(upper);
+   entry->upper = upper;
+
+   return 0;
+}
+
 static void handle_netdev_upper(struct ib_device *ib_dev, u8 port,
void *cookie,
void (*handle_netdev)(struct ib_device *ib_dev,
@@ -444,30 +466,12 @@ static void handle_netdev_upper(struct ib_device *ib_dev, 
u8 port,
  struct net_device *ndev))
 {
struct net_device *ndev = (struct net_device *)cookie;
-   struct upper_list {
-   struct list_head list;
-   struct net_device *upper;
-   };
-   struct net_device *upper;
-   struct list_head *iter;
struct upper_list *upper_iter;
struct upper_list *upper_temp;
LIST_HEAD(upper_list);
 
rcu_read_lock();
-   netdev_for_each_all_upper_dev_rcu(ndev, upper, iter) {
-   struct upper_list *entry = kmalloc(sizeof(*entry),
-  GFP_ATOMIC);
-
-   if (!entry) {
-   pr_info("roce_gid_mgmt: couldn't allocate entry to 
delete ndev\n");
-   continue;
-   }
-
-   list_add_tail(>list, _list);
-   dev_hold(upper);
-   entry->upper = upper;
-   }
+   netdev_walk_all_upper_dev_rcu(ndev, netdev_upper_walk, _list);
rcu_read_unlock();
 
handle_netdev(ib_dev, port, ndev);
-- 
2.1.4



[PATCH net-next 02/11] net: Introduce new api for walking upper and lower devices

2016-10-17 Thread David Ahern
This patch introduces netdev_walk_all_upper_dev_rcu,
netdev_walk_all_lower_dev and netdev_walk_all_lower_dev_rcu. These
functions recursively walk the adj_list of devices to determine all upper
and lower devices.

The functions take a callback function that is invoked for each device
in the list. If the callback returns non-0, the walk is terminated and
the functions return that code back to callers.

v3
- simplified netdev_has_upper_dev_all_rcu and __netdev_has_upper_dev and
  removed typecast as suggested by Stephen

v2
- fixed definition of netdev_next_lower_dev_rcu to mirror the upper_dev
  version.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h |  17 +
 net/core/dev.c| 155 ++
 2 files changed, 172 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bf341b65ca5e..a5902d995907 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3778,6 +3778,14 @@ struct net_device 
*netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
 updev; \
 updev = netdev_all_upper_get_next_dev_rcu(dev, &(iter)))
 
+int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
+ int (*fn)(struct net_device *upper_dev,
+   void *data),
+ void *data);
+
+bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
+ struct net_device *upper_dev);
+
 void *netdev_lower_get_next_private(struct net_device *dev,
struct list_head **iter);
 void *netdev_lower_get_next_private_rcu(struct net_device *dev,
@@ -3821,6 +3829,15 @@ struct net_device *netdev_all_lower_get_next_rcu(struct 
net_device *dev,
 ldev; \
 ldev = netdev_all_lower_get_next_rcu(dev, &(iter)))
 
+int netdev_walk_all_lower_dev(struct net_device *dev,
+ int (*fn)(struct net_device *lower_dev,
+   void *data),
+ void *data);
+int netdev_walk_all_lower_dev_rcu(struct net_device *dev,
+ int (*fn)(struct net_device *lower_dev,
+   void *data),
+ void *data);
+
 void *netdev_adjacent_get_private(struct list_head *adj_list);
 void *netdev_lower_get_first_private_rcu(struct net_device *dev);
 struct net_device *netdev_master_upper_dev_get(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index f67fd16615bb..fc48337cfab8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5156,6 +5156,31 @@ bool netdev_has_upper_dev(struct net_device *dev,
 EXPORT_SYMBOL(netdev_has_upper_dev);
 
 /**
+ * netdev_has_upper_dev_all - Check if device is linked to an upper device
+ * @dev: device
+ * @upper_dev: upper device to check
+ *
+ * Find out if a device is linked to specified upper device and return true
+ * in case it is. Note that this checks the entire upper device chain.
+ * The caller must hold rcu lock.
+ */
+
+static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
+{
+   struct net_device *dev = data;
+
+   return upper_dev == dev;
+}
+
+bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
+ struct net_device *upper_dev)
+{
+   return !!netdev_walk_all_upper_dev_rcu(dev, __netdev_has_upper_dev,
+  upper_dev);
+}
+EXPORT_SYMBOL(netdev_has_upper_dev_all_rcu);
+
+/**
  * netdev_has_any_upper_dev - Check if device is linked to some device
  * @dev: device
  *
@@ -5255,6 +5280,51 @@ struct net_device 
*netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
 }
 EXPORT_SYMBOL(netdev_all_upper_get_next_dev_rcu);
 
+static struct net_device *netdev_next_upper_dev_rcu(struct net_device *dev,
+   struct list_head **iter)
+{
+   struct netdev_adjacent *upper;
+
+   WARN_ON_ONCE(!rcu_read_lock_held() && !lockdep_rtnl_is_held());
+
+   upper = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
+
+   if (>list == >adj_list.upper)
+   return NULL;
+
+   *iter = >list;
+
+   return upper->dev;
+}
+
+int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
+ int (*fn)(struct net_device *dev,
+   void *data),
+ void *data)
+{
+   struct net_device *udev;
+   struct list_head *iter;
+   int ret;
+
+   for (iter = >adj_list.upper,
+udev = netdev_next_upper_dev_rcu(dev, );
+udev;
+udev = netdev_next_upper_dev_rcu(dev, )) {
+   /* first is the upper device itself */
+   ret = fn(udev, data);
+   if (ret)
+   

[PATCH net-next 03/11] net: bonding: Flip to the new dev walk API

2016-10-17 Thread David Ahern
Convert alb_send_learning_packets and bond_has_this_ip to use the new
netdev_walk_all_upper_dev_rcu API. In both cases this is just a code
conversion; no functional change is intended.

v2
- removed typecast of data and simplified bond_upper_dev_walk

Signed-off-by: David Ahern 
---
 drivers/net/bonding/bond_alb.c  | 82 ++---
 drivers/net/bonding/bond_main.c | 17 +
 2 files changed, 61 insertions(+), 38 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 551f0f8dead3..c80b023092dd 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -950,13 +950,61 @@ static void alb_send_lp_vid(struct slave *slave, u8 
mac_addr[],
dev_queue_xmit(skb);
 }
 
+struct alb_walk_data {
+   struct bonding *bond;
+   struct slave *slave;
+   u8 *mac_addr;
+   bool strict_match;
+};
+
+static int alb_upper_dev_walk(struct net_device *upper, void *_data)
+{
+   struct alb_walk_data *data = _data;
+   bool strict_match = data->strict_match;
+   struct bonding *bond = data->bond;
+   struct slave *slave = data->slave;
+   u8 *mac_addr = data->mac_addr;
+   struct bond_vlan_tag *tags;
+
+   if (is_vlan_dev(upper) && vlan_get_encap_level(upper) == 0) {
+   if (strict_match &&
+   ether_addr_equal_64bits(mac_addr,
+   upper->dev_addr)) {
+   alb_send_lp_vid(slave, mac_addr,
+   vlan_dev_vlan_proto(upper),
+   vlan_dev_vlan_id(upper));
+   } else if (!strict_match) {
+   alb_send_lp_vid(slave, upper->dev_addr,
+   vlan_dev_vlan_proto(upper),
+   vlan_dev_vlan_id(upper));
+   }
+   }
+
+   /* If this is a macvlan device, then only send updates
+* when strict_match is turned off.
+*/
+   if (netif_is_macvlan(upper) && !strict_match) {
+   tags = bond_verify_device_path(bond->dev, upper, 0);
+   if (IS_ERR_OR_NULL(tags))
+   BUG();
+   alb_send_lp_vid(slave, upper->dev_addr,
+   tags[0].vlan_proto, tags[0].vlan_id);
+   kfree(tags);
+   }
+
+   return 0;
+}
+
 static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[],
  bool strict_match)
 {
struct bonding *bond = bond_get_bond_by_slave(slave);
-   struct net_device *upper;
-   struct list_head *iter;
-   struct bond_vlan_tag *tags;
+   struct alb_walk_data data = {
+   .strict_match = strict_match,
+   .mac_addr = mac_addr,
+   .slave = slave,
+   .bond = bond,
+   };
 
/* send untagged */
alb_send_lp_vid(slave, mac_addr, 0, 0);
@@ -965,33 +1013,7 @@ static void alb_send_learning_packets(struct slave 
*slave, u8 mac_addr[],
 * for that device.
 */
rcu_read_lock();
-   netdev_for_each_all_upper_dev_rcu(bond->dev, upper, iter) {
-   if (is_vlan_dev(upper) && vlan_get_encap_level(upper) == 0) {
-   if (strict_match &&
-   ether_addr_equal_64bits(mac_addr,
-   upper->dev_addr)) {
-   alb_send_lp_vid(slave, mac_addr,
-   vlan_dev_vlan_proto(upper),
-   vlan_dev_vlan_id(upper));
-   } else if (!strict_match) {
-   alb_send_lp_vid(slave, upper->dev_addr,
-   vlan_dev_vlan_proto(upper),
-   vlan_dev_vlan_id(upper));
-   }
-   }
-
-   /* If this is a macvlan device, then only send updates
-* when strict_match is turned off.
-*/
-   if (netif_is_macvlan(upper) && !strict_match) {
-   tags = bond_verify_device_path(bond->dev, upper, 0);
-   if (IS_ERR_OR_NULL(tags))
-   BUG();
-   alb_send_lp_vid(slave, upper->dev_addr,
-   tags[0].vlan_proto, tags[0].vlan_id);
-   kfree(tags);
-   }
-   }
+   netdev_walk_all_upper_dev_rcu(bond->dev, alb_upper_dev_walk, );
rcu_read_unlock();
 }
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5fa36ebc0640..c9944d86d045 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2270,22 +2270,23 @@ static void bond_mii_monitor(struct 

[PATCH net-next 01/11] net: Remove refnr arg when inserting link adjacencies

2016-10-17 Thread David Ahern
Commit 93409033ae65 ("net: Add netdev all_adj_list refcnt propagation to
fix panic") propagated the refnr to insert and remove functions tracking
the netdev adjacency graph. However, for the insert path the refnr can
only be 1. Accordingly, remove the refnr argument to make that clear.
ie., the refnr arg in 93409033ae65 was only needed for the remove path.

Signed-off-by: David Ahern 
---
 net/core/dev.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 352e98129601..f67fd16615bb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5453,7 +5453,6 @@ static inline bool netdev_adjacent_is_neigh_list(struct 
net_device *dev,
 
 static int __netdev_adjacent_dev_insert(struct net_device *dev,
struct net_device *adj_dev,
-   u16 ref_nr,
struct list_head *dev_list,
void *private, bool master)
 {
@@ -5463,7 +5462,7 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
adj = __netdev_find_adj(adj_dev, dev_list);
 
if (adj) {
-   adj->ref_nr += ref_nr;
+   adj->ref_nr += 1;
return 0;
}
 
@@ -5473,7 +5472,7 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
 
adj->dev = adj_dev;
adj->master = master;
-   adj->ref_nr = ref_nr;
+   adj->ref_nr = 1;
adj->private = private;
dev_hold(adj_dev);
 
@@ -5547,22 +5546,21 @@ static void __netdev_adjacent_dev_remove(struct 
net_device *dev,
 
 static int __netdev_adjacent_dev_link_lists(struct net_device *dev,
struct net_device *upper_dev,
-   u16 ref_nr,
struct list_head *up_list,
struct list_head *down_list,
void *private, bool master)
 {
int ret;
 
-   ret = __netdev_adjacent_dev_insert(dev, upper_dev, ref_nr, up_list,
+   ret = __netdev_adjacent_dev_insert(dev, upper_dev, up_list,
   private, master);
if (ret)
return ret;
 
-   ret = __netdev_adjacent_dev_insert(upper_dev, dev, ref_nr, down_list,
+   ret = __netdev_adjacent_dev_insert(upper_dev, dev, down_list,
   private, false);
if (ret) {
-   __netdev_adjacent_dev_remove(dev, upper_dev, ref_nr, up_list);
+   __netdev_adjacent_dev_remove(dev, upper_dev, 1, up_list);
return ret;
}
 
@@ -5570,10 +5568,9 @@ static int __netdev_adjacent_dev_link_lists(struct 
net_device *dev,
 }
 
 static int __netdev_adjacent_dev_link(struct net_device *dev,
- struct net_device *upper_dev,
- u16 ref_nr)
+ struct net_device *upper_dev)
 {
-   return __netdev_adjacent_dev_link_lists(dev, upper_dev, ref_nr,
+   return __netdev_adjacent_dev_link_lists(dev, upper_dev,
>all_adj_list.upper,
_dev->all_adj_list.lower,
NULL, false);
@@ -5602,12 +5599,12 @@ static int __netdev_adjacent_dev_link_neighbour(struct 
net_device *dev,
struct net_device *upper_dev,
void *private, bool master)
 {
-   int ret = __netdev_adjacent_dev_link(dev, upper_dev, 1);
+   int ret = __netdev_adjacent_dev_link(dev, upper_dev);
 
if (ret)
return ret;
 
-   ret = __netdev_adjacent_dev_link_lists(dev, upper_dev, 1,
+   ret = __netdev_adjacent_dev_link_lists(dev, upper_dev,
   >adj_list.upper,
   _dev->adj_list.lower,
   private, master);
@@ -5676,7 +5673,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
list_for_each_entry(j, _dev->all_adj_list.upper, list) {
pr_debug("Interlinking %s with %s, non-neighbour\n",
 i->dev->name, j->dev->name);
-   ret = __netdev_adjacent_dev_link(i->dev, j->dev, 
i->ref_nr);
+   ret = __netdev_adjacent_dev_link(i->dev, j->dev);
if (ret)
goto rollback_mesh;
}
@@ -5686,7 +5683,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
list_for_each_entry(i, _dev->all_adj_list.upper, list) {
pr_debug("linking %s's upper 

[PATCH net-next 06/11] ixgbe: Flip to the new dev walk API

2016-10-17 Thread David Ahern
Convert ixgbe users to new dev walk API. This is just a code conversion;
no functional change is intended.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 132 --
 1 file changed, 82 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 784b0b98ab2f..f380fda11eb6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5012,24 +5012,23 @@ static int ixgbe_fwd_ring_up(struct net_device *vdev,
return err;
 }
 
-static void ixgbe_configure_dfwd(struct ixgbe_adapter *adapter)
+static int ixgbe_upper_dev_walk(struct net_device *upper, void *data)
 {
-   struct net_device *upper;
-   struct list_head *iter;
-   int err;
-
-   netdev_for_each_all_upper_dev_rcu(adapter->netdev, upper, iter) {
-   if (netif_is_macvlan(upper)) {
-   struct macvlan_dev *dfwd = netdev_priv(upper);
-   struct ixgbe_fwd_adapter *vadapter = dfwd->fwd_priv;
+   if (netif_is_macvlan(upper)) {
+   struct macvlan_dev *dfwd = netdev_priv(upper);
+   struct ixgbe_fwd_adapter *vadapter = dfwd->fwd_priv;
 
-   if (dfwd->fwd_priv) {
-   err = ixgbe_fwd_ring_up(upper, vadapter);
-   if (err)
-   continue;
-   }
-   }
+   if (dfwd->fwd_priv)
+   ixgbe_fwd_ring_up(upper, vadapter);
}
+
+   return 0;
+}
+
+static void ixgbe_configure_dfwd(struct ixgbe_adapter *adapter)
+{
+   netdev_walk_all_upper_dev_rcu(adapter->netdev,
+ ixgbe_upper_dev_walk, NULL);
 }
 
 static void ixgbe_configure(struct ixgbe_adapter *adapter)
@@ -5448,12 +5447,25 @@ static void ixgbe_fdir_filter_exit(struct ixgbe_adapter 
*adapter)
spin_unlock(>fdir_perfect_lock);
 }
 
+static int ixgbe_disable_macvlan(struct net_device *upper, void *data)
+{
+   if (netif_is_macvlan(upper)) {
+   struct macvlan_dev *vlan = netdev_priv(upper);
+
+   if (vlan->fwd_priv) {
+   netif_tx_stop_all_queues(upper);
+   netif_carrier_off(upper);
+   netif_tx_disable(upper);
+   }
+   }
+
+   return 0;
+}
+
 void ixgbe_down(struct ixgbe_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
struct ixgbe_hw *hw = >hw;
-   struct net_device *upper;
-   struct list_head *iter;
int i;
 
/* signal that we are down to the interrupt handler */
@@ -5477,17 +5489,8 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
netif_tx_disable(netdev);
 
/* disable any upper devices */
-   netdev_for_each_all_upper_dev_rcu(adapter->netdev, upper, iter) {
-   if (netif_is_macvlan(upper)) {
-   struct macvlan_dev *vlan = netdev_priv(upper);
-
-   if (vlan->fwd_priv) {
-   netif_tx_stop_all_queues(upper);
-   netif_carrier_off(upper);
-   netif_tx_disable(upper);
-   }
-   }
-   }
+   netdev_walk_all_upper_dev_rcu(adapter->netdev,
+ ixgbe_disable_macvlan, NULL);
 
ixgbe_irq_disable(adapter);
 
@@ -6728,6 +6731,18 @@ static void ixgbe_update_default_up(struct ixgbe_adapter 
*adapter)
 #endif
 }
 
+static int ixgbe_enable_macvlan(struct net_device *upper, void *data)
+{
+   if (netif_is_macvlan(upper)) {
+   struct macvlan_dev *vlan = netdev_priv(upper);
+
+   if (vlan->fwd_priv)
+   netif_tx_wake_all_queues(upper);
+   }
+
+   return 0;
+}
+
 /**
  * ixgbe_watchdog_link_is_up - update netif_carrier status and
  * print link up message
@@ -6737,8 +6752,6 @@ static void ixgbe_watchdog_link_is_up(struct 
ixgbe_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
struct ixgbe_hw *hw = >hw;
-   struct net_device *upper;
-   struct list_head *iter;
u32 link_speed = adapter->link_speed;
const char *speed_str;
bool flow_rx, flow_tx;
@@ -6809,14 +6822,8 @@ static void ixgbe_watchdog_link_is_up(struct 
ixgbe_adapter *adapter)
 
/* enable any upper devices */
rtnl_lock();
-   netdev_for_each_all_upper_dev_rcu(adapter->netdev, upper, iter) {
-   if (netif_is_macvlan(upper)) {
-   struct macvlan_dev *vlan = netdev_priv(upper);
-
-   if (vlan->fwd_priv)
-   netif_tx_wake_all_queues(upper);
-   }
-   }
+   

[PATCH net-next 10/11] net: Add warning if any lower device is still in adjacency list

2016-10-17 Thread David Ahern
Lower list should be empty just like upper.

Signed-off-by: David Ahern 
---
 net/core/dev.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index a9fe14908b44..c6bbf310d407 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5219,6 +5219,20 @@ struct net_device *netdev_master_upper_dev_get(struct 
net_device *dev)
 }
 EXPORT_SYMBOL(netdev_master_upper_dev_get);
 
+/**
+ * netdev_has_any_lower_dev - Check if device is linked to some device
+ * @dev: device
+ *
+ * Find out if a device is linked to a lower device and return true in case
+ * it is. The caller must hold the RTNL lock.
+ */
+static bool netdev_has_any_lower_dev(struct net_device *dev)
+{
+   ASSERT_RTNL();
+
+   return !list_empty(>adj_list.lower);
+}
+
 void *netdev_adjacent_get_private(struct list_head *adj_list)
 {
struct netdev_adjacent *adj;
@@ -6616,6 +6630,7 @@ static void rollback_registered_many(struct list_head 
*head)
 
/* Notifier chain MUST detach us all upper devices. */
WARN_ON(netdev_has_any_upper_dev(dev));
+   WARN_ON(netdev_has_any_lower_dev(dev));
 
/* Remove entries from kobject tree */
netdev_unregister_kobject(dev);
-- 
2.1.4



Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-17 Thread Tom Herbert
On Mon, Oct 17, 2016 at 8:35 PM, Lawrence Brakmo  wrote:
> Yuchung and Eric, thank you for your comments.
>
> It looks like I need to think more about this patch. I was trying
> to reduce the likelihood of reordering (which seems even more
> important based on Eric¹s comment on pacing), but it seems like
> the only way to prevent reordering is to only re-hash after an RTO
> or when there are no packets in flight (which may not occur).
>
Sounds like that should be the same condition as when we set ooo_okay?

>
> On 10/11/16, 8:56 PM, "Yuchung Cheng"  wrote:
>
>>On Tue, Oct 11, 2016 at 6:01 PM, Yuchung Cheng  wrote:
>>> On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo  wrote:
 Yuchung, thank you for your comments. Responses inline.

 On 10/11/16, 12:49 PM, "Yuchung Cheng"  wrote:

>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo  wrote:
>>
>> The purpose of this patch is to help balance flows across paths. A
>>new
>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100)
>>that
>> the txhash (IPv6 flowlabel) will be changed after a non-RTO
>>retransmit.
>> A probability is used in order to control how many flows are moved
>> during a congestion event and prevent the congested path from
>>becoming
>> under utilized (which could occur if too many flows leave the current
>> path). Txhash changes may be delayed in order to decrease the
>>likelihood
>> that it will trigger retransmists due to too much reordering.
>>
>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior
>>after
>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
>> txhash changes. The idea is to decrease the likelihood of going back
>> to a broken path. That is, we don't want flow balancing to trigger
>> changes to broken paths. The drawback is that flow balancing does
>> not work as well. If the sysctl is greater than 1, then we always
>> do flow balancing, even after RTOs.
>>
>> Tested with packedrill tests (for correctness) and performance
>> experiments with 2 and 3 paths. Performance experiments looked at
>> aggregate goodput and fairness. For each run, we looked at the ratio
>>of
>> the goodputs for the fastest and slowest flows. These were averaged
>>for
>> all the runs. A fairness of 1 means all flows had the same goodput, a
>> fairness of 2 means the fastest flow was twice as fast as the slowest
>> flow.
>>
>> The setup for the performance experiments was 4 or 5 serves in a
>>rack,
>> 10G links. I tested various probabilities, but 20 seemed to have the
>> best tradeoff for my setup (small RTTs).
>>
>>   --- node1 -
>> sender --- switch --- node2 - switch  receiver
>>   --- node3 -
>>
>> Scenario 1: One sender sends to one receiver through 2 routes (node1
>>or
>> node 2). The output from node1 and node2 is 1G (1gbit/sec). With
>>only 2
>> flows, without flow balancing (prob=0) the average goodput is 1.6G
>>vs.
>> 1.9G with flow balancing due to 2 flows ending up in one link and
>>either
>> not moving and taking some time to move. Fairness was 1 in all cases.
>> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or
>>1.2
>> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
>> flow balancing increased fairness.
>>
>> Scenario 2: One sender to one receiver, through 3 routes (node1,...
>> node2). With 6 or 16 flows the goodput was the same for all, but
>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
>> case fairness out of 10 runs were 2.2, 1.8 and 1.4 repectively. That
>>is,
>> prob=20,mode=1 improved average and worst case fairness.
>I am wondering if we can build better API with routing layer to
>implement this type of feature, instead of creeping the tx_rehashing
>logic scatter in TCP. For example, we call dst_negative_advice on TCP
>write timeouts.

 Not sure. The route is not necessarily bad, may be temporarily
congested
 or they may all be congested. If all we want to do is change the txhash
 (unlike dst_negative_advice), then calling a tx_rehashing function may
 be the appropriate call.

>
>On the patch itself, it seems aggressive to (attempt to) rehash every
>post-RTO retranmission. Also you can just use ca_state (==CA_Loss) to
>identify post-RTO retransmission directly.

 Thanks, I will add the test.

>
>is this an implementation of the Flow Bender ?
>https://urldefense.proofpoint.com/v2/url?u=http-3A__dl.acm.org_citation
>.cf
>m-3Fid-3D2674985=DQIBaQ=5VD0RTtNlTh3ycd41b3MUw=pq_Mqvzfy-C8ltkgyx
>1u_

Re: [patch net-next RFC 4/6] Introduce sample tc action

2016-10-17 Thread Roopa Prabhu
On 10/17/16, 5:17 PM, Roopa Prabhu wrote:
> On 10/17/16, 3:10 AM, Jamal Hadi Salim wrote:
[snip]

inline below more data/context..

 +
 +struct sample_packet_metadata {
 +int sample_size;
 +int orig_size;
 +int ifindex;
 +};
 +
>>> This metadata does not look extensible.. can it be made to ?
>>>
>> Sure it can...
more sflow context here... [1]

An extensible metadata scheme is  highly desirable when passing data from the 
dataplane to 
the sampling agent in userspace. Looking forward, advanced instrumentation is 
being 
added to data planes and keeping the api future proof will help.



>>
>>> With sflow in context, you need a pair of ifindex numbers to encode ingress 
>>> and egress ports.
>> What is the use case for both?
> I have heard that most monitoring tools have moved to ingress only sampling 
> because of operational
> complexity (use case is sflow). I think hardware also supports ingress and 
> egress only sampling.
> better to have an option to reflect that in the api.

The reason for having two ifindex numbers is to record the ingress and egress 
ports (i.e. the path that the packet takes through the datapath/ASIC). You may 
actually have three ifindex numbers associated with a sample:
1. The data source that made the measurement (on a linux system each bridge has 
its own ifindex)
2. The ifindex associated with the ingress switch port
3. The ifindex associated with the egress switch port.

All three apply irrespective of sampling direction.


thanks,
Roopa

[1] Additional extended flow attributes have been defined to further extend 
sFlow packet samples:
http://sflow.org/sflow_tunnels.txt 
http://sflow.org/sflow_openflow.txt 




Useless debug warning "netlink: 16 bytes leftover after parsing attributes"

2016-10-17 Thread Marcel Holtmann
Hi,

so lately I am seeing a bunch of these warnings:

netlink: 16 bytes leftover after parsing attributes..

While they give you the process name, they are still useless to track down the 
message that causes them. I find them even more useless since an updated 
userspace on an older kernel can trigger the nla_policy warning here. And that 
updated userspace program is doing nothing wrong by including extra attributes. 
So what purpose is this warning serving?

Regards

Marcel



Re: [PATCH 00/28] Reenable maybe-uninitialized warnings

2016-10-17 Thread Christoph Hellwig
On Tue, Oct 18, 2016 at 12:03:28AM +0200, Arnd Bergmann wrote:
> This is a set of patches that I hope to get into v4.9 in some form
> in order to turn on the -Wmaybe-uninitialized warnings again.

Hi Arnd,

I jsut complained to Geert that I was introducing way to many
bugs or pointless warnings for some compilers lately, but gcc didn't
warn me about them.  From a little research the lack of
-Wmaybe-uninitialized seems to be the reason for it, so I'm all
for re-enabling it.


Re: Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

2016-10-17 Thread Julia Cartwright
+linux-pci

On Mon, Oct 17, 2016 at 08:39:40AM -0700, Alexander Duyck wrote:
> On Mon, Oct 17, 2016 at 8:00 AM, Koehrer Mathias (ETAS/ESW5)
>  wrote:
> > Hi Julia!
> >> > > Have you tested on a vanilla (non-RT) kernel?  I doubt there is
> >> > > anything RT specific about what you are seeing, but it might be nice
> >> > > to get confirmation.  Also, bisection would probably be easier if you 
> >> > > confirm on a
> >> vanilla kernel.
> >> > >
> >> > > I find it unlikely that it's a kernel config option that changed
> >> > > which regressed you, but instead was a code change to a driver.
> >> > > Which driver is now the question, and the surface area is still big
> >> > > (processor mapping attributes for this region, PCI root complex 
> >> > > configuration,
> >> PCI brige configuration, igb driver itself, etc.).
> >> > >
> >> > > Big enough that I'd recommend a bisection.  It looks like a
> >> > > bisection between 3.18 and 4.8 would take you about 18 tries to narrow 
> >> > > down,
> >> assuming all goes well.
> >> > >
> >> >
> >> > I have now repeated my tests using the vanilla kernel.
> >> > There I got the very same issue.
> >> > Using kernel 4.0 is fine, however starting with kernel 4.1, the issue 
> >> > appears.
> >>
> >> Great, thanks for confirming!  That helps narrow things down quite a bit.
> >>
> >> > Here is my exact (reproducible) test description:
> >> > I applied the following patch to the kernel to get the igb trace.
> >> > This patch instruments the igb_rd32() function to measure the call to
> >> > readl() which is used to access registers of the igb NIC.
> >>
> >> I took your test setup and ran it between 4.0 and 4.1 on the hardware on 
> >> my desk,
> >> which is an Atom-based board with dual I210s, however I didn't see much
> >> difference.
> >>
> >> However, it's a fairly simple board, with a much simpler PCI topology than 
> >> your
> >> workstation.  I'll see if I can find some other hardware to test on.
> >>
> >> [..]
> >> > This means, that I think that some other stuff in kernel 4.1 has
> >> > changed, which has impact on the igb accesses.
> >> >
> >> > Any idea what component could cause this kind of issue?
> >>
> >> Can you continue your bisection using 'git bisect'?  You've already 
> >> narrowed it down
> >> between 4.0 and 4.1, so you're well on your way.
> >>
> >
> > OK - done.
> > And finally I was successful!
> > The following git commit is the one that is causing the trouble!
> > (The full commit is in the attachment).
> > + BEGIN +++
> > commit 387d37577fdd05e9472c20885464c2a53b3c945f
> > Author: Matthew Garrett 
> > Date:   Tue Apr 7 11:07:00 2015 -0700
> >
> > PCI: Don't clear ASPM bits when the FADT declares it's unsupported
> >
> > Communications with a hardware vendor confirm that the expected 
> > behaviour
> > on systems that set the FADT ASPM disable bit but which still grant full
> > PCIe control is for the OS to leave any BIOS configuration intact and
> > refuse to touch the ASPM bits.  This mimics the behaviour of Windows.
> >
> > Signed-off-by: Matthew Garrett 
> > Signed-off-by: Bjorn Helgaas 
> > + HEADER +++
> >
> > The only files that are modified by this commit are
> > drivers/acpi/pci_root.c
> > drivers/pci/pcie/aspm.c
> > include/linux/pci-aspm.h
> >
> > This is all generic PCIe stuff - however I do not really understand what
> > the changes of the commit are...
> >
> > In my setup I am using a dual port igb Ethernet adapter.
> > This has an onboard PCIe switch and it might be that the configuration of 
> > this
> > PCIe switch on the Intel board is causing the trouble.
> >
> > Please see also the output of "lspci -v" in the attachment.
> > The relevant PCI address of the NIC is 04:00.0 / 04:00.1
> >
> > Any feedback on this is welcome!
> >
> > Thanks
> >
> > Mathias
>
> Hi Mathias,
>
> If you could set the output of lspci -vvv it might be more useful as
> most of the configuration data isn't present in the lspci dump you had
> attached.  Specifically if you could do this for the working case and
> the non-working case we could verify if this issue is actually due to
> the ASPM configuration on the device.
>
> Also one thing you might try is booting your kernel with the kernel
> parameter "pcie_aspm=off".  It sounds like the extra latency is likely
> due to your platform enabling ASPM on the device and this in turn will
> add latency if the PCIe link is disabled when you attempt to perform a
> read as it takes some time to bring the PCIe link up when in L1 state.

So if we assume it's this commit causing the regression, then it's safe
to assume that this system's BIOS is claiming to not support ASPM in the
FADT, but the BIOS is leaving ASPM configured in some way on the
relevant devices.

Also, unfortunately, taking a look at the code which handles

Re: [PATCH 1/9] ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)

2016-10-17 Thread Tom Herbert
On Mon, Oct 17, 2016 at 7:42 AM, David Lebrun  wrote:
> Implement minimal support for processing of SR-enabled packets
> as described in
> https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02.
>
> This patch implements the following operations:
> - Intermediate segment endpoint: incrementation of active segment and 
> rerouting.
> - Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH
>   and routing of inner packet.
> - Cleanup flag support for SR-inlined packets: removal of SRH if we are the
>   penultimate segment endpoint.
>
> A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled
> packets. Default is deny.
>
> This patch does not provide support for HMAC-signed packets.
>
> Signed-off-by: David Lebrun 
> ---
>  include/linux/ipv6.h  |   3 +
>  include/linux/seg6.h  |   6 ++
>  include/uapi/linux/ipv6.h |   2 +
>  include/uapi/linux/seg6.h |  46 +++
>  net/ipv6/Kconfig  |  13 +
>  net/ipv6/addrconf.c   |  18 ++
>  net/ipv6/exthdrs.c| 140 
> ++
>  7 files changed, 228 insertions(+)
>  create mode 100644 include/linux/seg6.h
>  create mode 100644 include/uapi/linux/seg6.h
>
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index 7e9a789..75395ad 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -64,6 +64,9 @@ struct ipv6_devconf {
> } stable_secret;
> __s32   use_oif_addrs_only;
> __s32   keep_addr_on_down;
> +#ifdef CONFIG_IPV6_SEG6
> +   __s32   seg6_enabled;
> +#endif
>
> struct ctl_table_header *sysctl_header;
>  };
> diff --git a/include/linux/seg6.h b/include/linux/seg6.h
> new file mode 100644
> index 000..7a66d2b
> --- /dev/null
> +++ b/include/linux/seg6.h
> @@ -0,0 +1,6 @@
> +#ifndef _LINUX_SEG6_H
> +#define _LINUX_SEG6_H
> +
> +#include 
> +
> +#endif
> diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
> index 8c27723..7ff1d65 100644
> --- a/include/uapi/linux/ipv6.h
> +++ b/include/uapi/linux/ipv6.h
> @@ -39,6 +39,7 @@ struct in6_ifreq {
>  #define IPV6_SRCRT_STRICT  0x01/* Deprecated; will be removed */
>  #define IPV6_SRCRT_TYPE_0  0   /* Deprecated; will be removed */
>  #define IPV6_SRCRT_TYPE_2  2   /* IPv6 type 2 Routing Header   */
> +#define IPV6_SRCRT_TYPE_4  4   /* Segment Routing with IPv6 */
>
>  /*
>   * routing header
> @@ -178,6 +179,7 @@ enum {
> DEVCONF_DROP_UNSOLICITED_NA,
> DEVCONF_KEEP_ADDR_ON_DOWN,
> DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
> +   DEVCONF_SEG6_ENABLED,
> DEVCONF_MAX
>  };
>
> diff --git a/include/uapi/linux/seg6.h b/include/uapi/linux/seg6.h
> new file mode 100644
> index 000..9f9e157
> --- /dev/null
> +++ b/include/uapi/linux/seg6.h
> @@ -0,0 +1,46 @@
> +/*
> + *  SR-IPv6 implementation
> + *
> + *  Author:
> + *  David Lebrun 
> + *
> + *
> + *  This program is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU General Public License
> + *  as published by the Free Software Foundation; either version
> + *  2 of the License, or (at your option) any later version.
> + */
> +
> +#ifndef _UAPI_LINUX_SEG6_H
> +#define _UAPI_LINUX_SEG6_H
> +
> +/*
> + * SRH
> + */
> +struct ipv6_sr_hdr {
> +   __u8nexthdr;
> +   __u8hdrlen;
> +   __u8type;
> +   __u8segments_left;
> +   __u8first_segment;
> +   __be16  flags;
> +   __u8reserved;
> +
> +   struct in6_addr segments[0];
> +} __attribute__((packed));
> +
> +#define SR6_FLAG_CLEANUP   (1 << 15)
> +#define SR6_FLAG_PROTECTED (1 << 14)
> +#define SR6_FLAG_OAM   (1 << 13)
> +#define SR6_FLAG_ALERT (1 << 12)
> +#define SR6_FLAG_HMAC  (1 << 11)
> +
> +#define SR6_TLV_INGRESS1
> +#define SR6_TLV_EGRESS 2
> +#define SR6_TLV_OPAQUE 3
> +#define SR6_TLV_PADDING4
> +#define SR6_TLV_HMAC   5
> +
> +#define sr_get_flags(srh) (be16_to_cpu((srh)->flags))
> +
> +#endif
> diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
> index 2343e4f..691c318 100644
> --- a/net/ipv6/Kconfig
> +++ b/net/ipv6/Kconfig
> @@ -289,4 +289,17 @@ config IPV6_PIMSM_V2
>   Support for IPv6 PIM multicast routing protocol PIM-SMv2.
>   If unsure, say N.
>
> +config IPV6_SEG6
> +   bool "IPv6: Segment Routing support"
> +   depends on IPV6
> +   select CRYPTO_HMAC
> +   select CRYPTO_SHA1
> +   select CRYPTO_SHA256
> +   ---help---
> + Experimental support for IPv6 Segment Routing dataplane as defined
> + in IETF draft-ietf-6man-segment-routing-header-02. This option
> + enables the processing of SR-enabled packets allowing the kernel
> + to act as a segment endpoint (intermediate or 

RE: [ANNOUNCE] bridge-utils 1.6 release

2016-10-17 Thread Rosen, Rami
Hi, Alexander, 

>This link seems to be broken.

Seems that it should be:
https://www.kernel.org/pub/linux/utils/net/bridge-utils/bridge-utils-1.6.tar.gz
Instead of:
>   
> http://www.kernel.org/pub/linux/utils/net/bridge-utils/bridge-utils.1.6.tar.gz

Regards,
Rami Rosen
Intel Corporation



Re: Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

2016-10-17 Thread Julia Cartwright
On Sat, Oct 15, 2016 at 12:06:33AM +0200, Richard Cochran wrote:
> On Fri, Oct 14, 2016 at 08:58:22AM +, Koehrer Mathias (ETAS/ESW5) wrote:
> > @@ -753,7 +756,9 @@ u32 igb_rd32(struct e1000_hw *hw, u32 re
> > if (E1000_REMOVED(hw_addr))
> > return ~value;
> >  
> > +trace_igb(801);
> > value = readl(_addr[reg]);
> > +trace_igb(802);
> 
> Nothing prevents this code from being preempted between the two trace
> points, and so you can't be sure whether the time delta in the trace
> is caused by the PCIe read stalling or not.

While that is certainly the case, and would explain the most egregious
of measured latency spikes, it doesn't invalidate the test if you
consider the valuable data point(s) to be the minimum and/or median
latencies.

   Julia


Re: [Intel-wired-lan] [PATCH V2 RFC 2/2] ixgbe: ixgbe_atr() compute l4_proto only if non-paged data has network/transport headers

2016-10-17 Thread Alexander Duyck
On Mon, Oct 17, 2016 at 10:25 AM, Sowmini Varadhan
 wrote:
> For some Tx paths (e.g., tpacket_snd()), ixgbe_atr may be
> passed down an sk_buff that has the network and transport
> header in the paged data, so it needs to make sure these
> headers are available in the headlen bytes to calculate the
> l4_proto.
>
> This patch expect that network and transport headers are
> already available in the non-paged header dat.  The assumption
> is that the caller has set this up if l4_proto based Tx
> steering is desired.
>
> Signed-off-by: Sowmini Varadhan 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   18 ++
>  1 files changed, 18 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index eceb47b..2cc1dae 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -54,6 +54,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "ixgbe.h"
>  #include "ixgbe_common.h"
> @@ -7651,11 +7652,16 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
> /* snag network header to get L4 type and address */
> skb = first->skb;
> hdr.network = skb_network_header(skb);
> +   if (hdr.network <= skb->data || hdr.network >= skb_tail_pointer(skb))
> +   return;

I would say you probably only need the first check here for skb->data
and could probably skip the second part.  You will be testing for
skb_tail_pointer in all the other tests you added so this check is
redundant anyway.

Also you might want to go through and wrap these with unlikely() since
most of these are exception cases.

> if (skb->encapsulation &&
> first->protocol == htons(ETH_P_IP) &&
> hdr.ipv4->protocol == IPPROTO_UDP) {
> struct ixgbe_adapter *adapter = q_vector->adapter;
>
> +   if (skb_tail_pointer(skb) < hdr.network + VXLAN_HEADROOM)
> +   return;
> +
> /* verify the port is recognized as VXLAN */
> if (adapter->vxlan_port &&
> udp_hdr(skb)->dest == adapter->vxlan_port)
> @@ -7666,15 +7672,27 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
> hdr.network = skb_inner_network_header(skb);
> }
>
> +   /* Make sure we have at least [minimum IPv4 header + TCP]
> +* or [IPv6 header] bytes
> +*/
> +   if (skb_tail_pointer(skb) < hdr.network + 40)
> +   return;
> +
> /* Currently only IPv4/IPv6 with TCP is supported */
> switch (hdr.ipv4->version) {
> case IPVERSION:
> /* access ihl as u8 to avoid unaligned access on ia64 */
> hlen = (hdr.network[0] & 0x0F) << 2;
> +   if (skb_tail_pointer(skb) < hdr.network + hlen +
> +   sizeof(struct tcphdr))
> +   return;
> l4_proto = hdr.ipv4->protocol;
> break;
> case 6:
> hlen = hdr.network - skb->data;
> +   if (skb_tail_pointer(skb) < hdr.network + hlen +
> +   sizeof(struct tcphdr))
> +   return;
> l4_proto = ipv6_find_hdr(skb, , IPPROTO_TCP, NULL, NULL);
> hlen -= hdr.network - skb->data;
> break;

I believe one more check is needed after this block to verify the TCP
header fields are present.

So you probably need to add a check for "skb_tail_pointer(skb) <
(hdr.network + hlen + 20)".

Thanks.

- Alex


[PATCH net] soreuseport: do not export reuseport_add_sock()

2016-10-17 Thread Eric Dumazet
From: Eric Dumazet 

reuseport_add_sock() is not used from a module,
no need to export it.

Signed-off-by: Eric Dumazet 
---
 net/core/sock_reuseport.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index e92b759d906c..9a1a352fd1eb 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -129,7 +129,6 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2)
 
return 0;
 }
-EXPORT_SYMBOL(reuseport_add_sock);
 
 static void reuseport_free_rcu(struct rcu_head *head)
 {




[PATCH 27/28] rocker: fix maybe-uninitialized warning

2016-10-17 Thread Arnd Bergmann
In some rare configurations, we get a warning about the 'index' variable
being used without an initialization:

drivers/net/ethernet/rocker/rocker_ofdpa.c: In function 
‘ofdpa_port_fib_ipv4.isra.16.constprop’:
drivers/net/ethernet/rocker/rocker_ofdpa.c:2425:92: warning: ‘index’ may be 
used uninitialized in this function [-Wmaybe-uninitialized]

This is a false positive, the logic is just a bit too complex for gcc
to follow here. Moving the intialization of 'index' a little further
down makes it clear to gcc that the function always returns an error
if it is not initialized.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/rocker/rocker_ofdpa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c 
b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 431a608..4ca4613 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -1493,8 +1493,6 @@ static int ofdpa_port_ipv4_nh(struct ofdpa_port 
*ofdpa_port,
spin_lock_irqsave(>neigh_tbl_lock, lock_flags);
 
found = ofdpa_neigh_tbl_find(ofdpa, ip_addr);
-   if (found)
-   *index = found->index;
 
updating = found && adding;
removing = found && !adding;
@@ -1508,9 +1506,11 @@ static int ofdpa_port_ipv4_nh(struct ofdpa_port 
*ofdpa_port,
resolved = false;
} else if (removing) {
ofdpa_neigh_del(trans, found);
+   *index = found->index;
} else if (updating) {
ofdpa_neigh_update(found, trans, NULL, false);
resolved = !is_zero_ether_addr(found->eth_dst);
+   *index = found->index;
} else {
err = -ENOENT;
}
-- 
2.9.0



Re: [Intel-wired-lan] [PATCH V3 RFC 2/2] ixgbe: ixgbe_atr() compute l4_proto only if non-paged data has network/transport headers

2016-10-17 Thread Jeff Kirsher
On Mon, 2016-10-17 at 15:29 -0700, Alexander Duyck wrote:
> On Mon, Oct 17, 2016 at 2:12 PM, Sowmini Varadhan
>  wrote:
> > 
> > For some Tx paths (e.g., tpacket_snd()), ixgbe_atr may be
> > passed down an sk_buff that has the network and transport
> > header in the paged data, so it needs to make sure these
> > headers are available in the headlen bytes to calculate the
> > l4_proto.
> > 
> > This patch expect that network and transport headers are
> > already available in the non-paged header dat.  The assumption
> > is that the caller has set this up if l4_proto based Tx
> > steering is desired.
> > 
> > Signed-off-by: Sowmini Varadhan 
> 
> This all looks correct to me.  I would recommend having Jeff pull it
> in to be submitted to the net queue.
> 
> Reviewed-by: Alexander Duyck 

Sowmini, can you re-submit this to intel-wired-lan but without the RFC in
the title?

signature.asc
Description: This is a digitally signed message part


Re: [Intel-wired-lan] [PATCH V3 RFC 2/2] ixgbe: ixgbe_atr() compute l4_proto only if non-paged data has network/transport headers

2016-10-17 Thread Alexander Duyck
On Mon, Oct 17, 2016 at 2:12 PM, Sowmini Varadhan
 wrote:
> For some Tx paths (e.g., tpacket_snd()), ixgbe_atr may be
> passed down an sk_buff that has the network and transport
> header in the paged data, so it needs to make sure these
> headers are available in the headlen bytes to calculate the
> l4_proto.
>
> This patch expect that network and transport headers are
> already available in the non-paged header dat.  The assumption
> is that the caller has set this up if l4_proto based Tx
> steering is desired.
>
> Signed-off-by: Sowmini Varadhan 

This all looks correct to me.  I would recommend having Jeff pull it
in to be submitted to the net queue.

Reviewed-by: Alexander Duyck 


[PATCH 21/28] net/hyperv: avoid uninitialized variable

2016-10-17 Thread Arnd Bergmann
The hdr_offset variable is only if we deal with a TCP or UDP packet,
but as the check surrounding its usage tests for skb_is_gso()
instead, the compiler has no idea if the variable is initialized
or not at that point:

drivers/net/hyperv/netvsc_drv.c: In function ‘netvsc_start_xmit’:
drivers/net/hyperv/netvsc_drv.c:494:42: error: ‘hdr_offset’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

This adds an additional check for the transport type, which
tells the compiler that this path cannot happen. Since the
get_net_transport_info() function should always be inlined
here, I don't expect this to result in additional runtime
checks.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/hyperv/netvsc_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index f0919bd..5d6e75a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -447,7 +447,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct 
net_device *net)
 * Setup the sendside checksum offload only if this is not a
 * GSO packet.
 */
-   if (skb_is_gso(skb)) {
+   if ((net_trans_info & (INFO_TCP | INFO_UDP)) && skb_is_gso(skb)) {
struct ndis_tcp_lso_info *lso_info;
 
rndis_msg_size += NDIS_LSO_PPI_SIZE;
-- 
2.9.0



[PATCH 20/28] net: bcm63xx: avoid referencing uninitialized variable

2016-10-17 Thread Arnd Bergmann
gcc found a reference to an uninitialized variable in the error handling
of bcm_enet_open, introduced by a recent cleanup:

drivers/net/ethernet/broadcom/bcm63xx_enet.c: In function 'bcm_enet_open'
drivers/net/ethernet/broadcom/bcm63xx_enet.c:1129:2: warning: 'phydev' may be 
used uninitialized in this function [-Wmaybe-uninitialized]

This makes the use of that variable conditional, so we only reference it
here after it has been used before. Unlike my normal patches, I have not
build-tested this one, as I don't currently have mips test in my
randconfig setup.

Fixes: 625eb8667d6f ("net: ethernet: broadcom: bcm63xx: use phydev from struct 
net_device")
Cc: Philippe Reynes 
Reported-by: kbuild test robot 
Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c 
b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index ae364c7..5370909 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -1126,7 +1126,8 @@ static int bcm_enet_open(struct net_device *dev)
free_irq(dev->irq, dev);
 
 out_phy_disconnect:
-   phy_disconnect(phydev);
+   if (priv->has_phy)
+   phy_disconnect(phydev);
 
return ret;
 }
-- 
2.9.0



Re: [Intel-wired-lan] [PATCH V3 RFC 2/2] ixgbe: ixgbe_atr() compute l4_proto only if non-paged data has network/transport headers

2016-10-17 Thread Sowmini Varadhan
On (10/17/16 15:37), Jeff Kirsher wrote:
> > Reviewed-by: Alexander Duyck 
> 
> Sowmini, can you re-submit this to intel-wired-lan but without the RFC in
> the title?

V4 resubmitted.. I think I just inadvertently forgot to add Alex as the
reviewed-by.. could you please fix that (or I can resubmit v5 if needed).

--Sowmini


[PATCH 1/2] dwc_eth_qos: do not clear pause flags from phy_device->supported

2016-10-17 Thread Niklas Cassel
From: Niklas Cassel 

phy_device->supported is originally set by the PHY driver.
The ethernet driver should filter phy_device->supported to only contain
flags supported by the IP.
The IP supports setting rx and tx flow control independently,
therefore SUPPORTED_Pause and SUPPORTED_Asym_Pause should not be cleared.
If the flags are cleared, pause frames cannot be enabled (even if they
are supported by the PHY).

Signed-off-by: Niklas Cassel 
Signed-off-by: Lars Persson 
---
 drivers/net/ethernet/synopsys/dwc_eth_qos.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/synopsys/dwc_eth_qos.c 
b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
index c9891eac9775..93a3a919f723 100644
--- a/drivers/net/ethernet/synopsys/dwc_eth_qos.c
+++ b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
@@ -985,7 +985,8 @@ static int dwceqos_mii_probe(struct net_device *ndev)
   "phydev %p, phydev->phy_id 0xa%x, phydev->addr 
0x%x\n",
   phydev, phydev->phy_id, phydev->addr);
 
-   phydev->supported &= PHY_GBIT_FEATURES;
+   phydev->supported &= PHY_GBIT_FEATURES | SUPPORTED_Pause |
+SUPPORTED_Asym_Pause;
 
lp->link= 0;
lp->speed   = 0;
-- 
2.1.4



[PATCH 2/2] dwc_eth_qos: enable flow control by default

2016-10-17 Thread Niklas Cassel
From: Niklas Cassel 

Allow autoneg to enable flow control by default.
The behavior when autoneg is off has not changed.

Signed-off-by: Niklas Cassel 
Signed-off-by: Lars Persson 
---
 drivers/net/ethernet/synopsys/dwc_eth_qos.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/synopsys/dwc_eth_qos.c 
b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
index 93a3a919f723..8e39295427c0 100644
--- a/drivers/net/ethernet/synopsys/dwc_eth_qos.c
+++ b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
@@ -992,6 +992,7 @@ static int dwceqos_mii_probe(struct net_device *ndev)
lp->speed   = 0;
lp->duplex  = DUPLEX_UNKNOWN;
lp->phy_dev = phydev;
+   lp->flowcontrol.autoneg = AUTONEG_ENABLE;
 
if (netif_msg_probe(lp)) {
netdev_dbg(lp->ndev, "phy_addr 0x%x, phy_id 0x%08x\n",
-- 
2.1.4



Re: [PATCH net-next 2/2] net: phy: Add PHY Auto/Mdi/Mdix set driver for Microsemi PHYs.

2016-10-17 Thread Raju Lakkaraju
Hi Andrew,

Thank you for code review and valuable comments.

On Wed, Sep 28, 2016 at 10:24:51PM +0200, Andrew Lunn wrote:
> EXTERNAL EMAIL
> 
> 
> > + reg_val = phy_read(phydev, MSCC_PHY_BYPASS_CONTROL);
> > + if ((mdix == ETH_TP_MDI) || (mdix == ETH_TP_MDI_X)) {
> > + reg_val |= (DISABLE_PAIR_SWAP_CORR_MASK |
> > + DISABLE_POLARITY_CORR_MASK  |
> > + DISABLE_HP_AUTO_MDIX_MASK);
> > + } else {
> > + reg_val &= ~(DISABLE_PAIR_SWAP_CORR_MASK |
> > +  DISABLE_POLARITY_CORR_MASK  |
> > +  DISABLE_HP_AUTO_MDIX_MASK);
> > + }
> > + rc = phy_write(phydev, MSCC_PHY_BYPASS_CONTROL, reg_val);
> > + if (rc != 0)
> > + goto out_unlock;
> > +
> > + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
> > + if (rc != 0)
> > + goto out_unlock;
> > +
> > + reg_val = phy_read(phydev, MSCC_PHY_EXT_MODE_CNTL);
> > + reg_val &= ~(FORCE_MDI_CROSSOVER_MASK);
> > + if (mdix == ETH_TP_MDI)
> > + reg_val |= FORCE_MDI_CROSSOVER_MDI;
> > + else if (mdix == ETH_TP_MDI_X)
> > + reg_val |= FORCE_MDI_CROSSOVER_MDIX;
> > + rc = phy_write(phydev, MSCC_PHY_EXT_MODE_CNTL, reg_val);
> > + if (rc != 0)
> > + goto out_unlock;
> > +
> > + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
> > +
> > +out_unlock:
> 
> out_unlock seems a bit of an odd name, since you are not unlocking
> anything.
> 

Accepted. I will change.

> I also wonder if you should try to reset to MSCC_PHY_PAGE_STANDARD in
> the error condition?
> 

In case of PHY write error, then PHY page set will also return error.
I would like to return error condition, so return after write failure.

> > +
> > + return rc;
> > +}
> > +
> >  static int vsc85xx_wol_set(struct phy_device *phydev,
> >  struct ethtool_wolinfo *wol)
> >  {
> > @@ -227,6 +281,7 @@ static int vsc85xx_default_config(struct phy_device 
> > *phydev)
> >   int rc;
> >   u16 reg_val;
> >
> > + phydev->mdix = ETH_TP_MDI_AUTO;
> 
> Humm, interesting. The only other driver supporting mdix is the
> Marvell one. It does not do this, it leaves it to its default value of
> ETH_TP_MDI_INVALID. It does however interpret ETH_TP_MDI_INVALID as
> meaning as ETH_TP_MDI_AUTO.
> 
> There needs to be consistency here. You either need to do the same as
> the Marvell driver, or you need to modify the Marvell driver to also
> set phydev->mdix to ETH_TP_MDI_AUTO.
> 
> I don't yet know which of these two is the right thing to do.
> 
> Florian?
> 
> Andrew

For this comment, I responded in another thread.

---
Thanks,
Raju.


Re: [PATCH net-next 2/2] net: phy: Add Fast Link Failure - 2 set driver for Microsemi PHYs.

2016-10-17 Thread Florian Fainelli
On October 17, 2016 1:13:14 AM PDT, Raju Lakkaraju 
 wrote:
>Hi Andrew,
>
>Thank you for code review and comments.
>
>On Fri, Oct 14, 2016 at 02:02:28PM +0200, Andrew Lunn wrote:
>> EXTERNAL EMAIL
>> 
>> 
>> > On Fri, Oct 14, 2016 at 05:10:33PM +0530, Raju Lakkaraju wrote:
>> > From: Raju Lakkaraju 
>> >
>> > VSC8531 Fast Link Failure 2 feature enables the PHY to indicate the
>> > onset of a potential link failure in < 100 usec for 100BASE-TX
>> > operation. FLF2 is supported through the MDINT (active low) pin.
>> 
>> Is the MDINT pin specific to this feature, or a general interrupt
>pin?
>> 
>
>MDINT pin is general interrupt. MDINT pin share the interrupt with
>FLF2 along with another 13 interrupts.
>
>> Device tree is used to describe the hardware. It should not really
>> describe software or configuration. But the borders are a bit
>> fluffly. Signal edge rates is near to hardware. This is a lot more
>> towards configuration. So i'm not sure a device tree property is the
>> correct way to describe this.
>> 
>> This is also a feature i know other PHYs support. The Marvell PHY has
>> a "Metro Ethernet" extension which allows it to report link failures
>> for 1000BASE-T in 10, 20 or 40ms, instead of the usual 750ms. So we
>> need a generic solution other PHYs can implement.
>> 
>> As with cable testing, i think it should be an ethtool option.
>
>I agree with you.
>I thought this is one time initialization either enable or disable.
>if customer need this feature, they can enable in DT.
>Do you want me to implement through IOCTL instead of Device tree?
>Do you have any other suggestions?

As indicated in the other email about speed downshift, we may want to utilize 
ethtool's ability to modify tunable parameters (small integer, boolean, values) 
and extend it to cover features offered by PHYs in a way that an user can 
dynamically turn these features on or off.

In fact, this looks a lot like netdev features (e.g: checksum offload), and 
there seems to be some commonality here between at least Marvell and Microsemi 
(for the faster link down reporting), so maybe we should start adding PHY 
features similar to netdev features?

-- 
Florian


Re: [PATCH net-next 02/11] net: Introduce new api for walking upper and lower devices

2016-10-17 Thread Stephen Hemminger
On Fri, 14 Oct 2016 18:28:42 -0700
David Ahern  wrote:

>  
>  /**
> + * netdev_has_upper_dev_all - Check if device is linked to an upper device
> + * @dev: device
> + * @upper_dev: upper device to check
> + *
> + * Find out if a device is linked to specified upper device and return true
> + * in case it is. Note that this checks the entire upper device chain.
> + * The caller must hold rcu lock.
> + */
> +
> +static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
> +{
> + struct net_device *dev = (struct net_device *)data;
> +
> + if (upper_dev == dev)
> + return 1;
> +
> + return 0;
> +}
> +
> +bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
> +   struct net_device *upper_dev)
> +{
> + if (netdev_walk_all_upper_dev_rcu(dev, __netdev_has_upper_dev,
> +   upper_dev))
> + return true;
> +
> + return false;
> +}
> +EXPORT_SYMBOL(netdev_has_upper_dev_all_rcu);

You should write this more succinctly as:

static bool __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
{
struct net_device *dev = data;

return upper_dev == dev;
}

bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
  struct net_device *upper_dev)
{
return netdev_walk_all_upper_dev_rcu(dev,
__netdev_has_upper_dev, upper_dev);
}

No if/else needed. No cast of void * ptr need. Use const if possible?


[PATCH net-next 1/2] bpf: add various tests around spill/fill of regs

2016-10-17 Thread Daniel Borkmann
Add several spill/fill tests. Besides others, one that performs xadd
on the spilled register, one ldx/stx test where different types are
spilled from two branches and read out from common path. Verfier does
handle all correctly.

Signed-off-by: Daniel Borkmann 
---
 samples/bpf/test_verifier.c | 116 
 1 file changed, 116 insertions(+)

diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 369ffaa..45cf740 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1128,6 +1129,108 @@ struct other_val {
.result = REJECT,
},
{
+   "unpriv: spill/fill of ctx",
+   .insns = {
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   },
+   {
+   "unpriv: spill/fill of ctx 2",
+   .insns = {
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_get_hash_recalc),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "unpriv: spill/fill of ctx 3",
+   .insns = {
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+   BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_10, 0),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_get_hash_recalc),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "R1 type=fp expected=ctx",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "unpriv: spill/fill of ctx 4",
+   .insns = {
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_10, 
BPF_REG_0, -8, 0),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_get_hash_recalc),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "R1 type=inv expected=ctx",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "unpriv: spill/fill of different pointers stx",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_3, 42),
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 3),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -16),
+   BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_2, 0),
+   BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+   BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0),
+   BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3,
+   offsetof(struct __sk_buff, mark)),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "same insn cannot be used with different pointers",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "unpriv: spill/fill of different pointers ldx",
+   .insns = {
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 3),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   

[PATCH net-next v12 2/9] openvswitch: add mac_proto field to the flow key

2016-10-17 Thread Jiri Benc
Use a hole in the structure. We support only Ethernet so far and will add
a support for L2-less packets shortly. We could use a bool to indicate
whether the Ethernet header is present or not but the approach with the
mac_proto field is more generic and occupies the same number of bytes in the
struct, while allowing later extensibility. It also makes the code in the
next patches more self explaining.

It would be nice to use ARPHRD_ constants but those are u16 which would be
waste. Thus define our own constants.

Another upside of this is that we can overload this new field to also denote
whether the flow key is valid. This has the advantage that on
refragmentation, we don't have to reparse the packet but can rely on the
stored eth.type. This is especially important for the next patches in this
series - instead of adding another branch for L2-less packets before calling
ovs_fragment, we can just remove all those branches completely.

Signed-off-by: Jiri Benc 
---
There are three possible approaches:

(1) The one in this patch.

(2) Use just a one bit flag indicating whether the packet is L3 or Ethernet
(similar to the "is_layer3" bool in v11). The code would stay very
similar to this patchset, the memory consumption would be the same.

(3) Use value of 14 for MAC_PROTO_ETHERNET. It would simplify things nicely,
as ovs_mac_header_len would be identical to ovs_key_mac_proto, saving
one comparison. Of course, this would mean that if other L2 protocols
are added in the future, they can only have L2 header length different
than 14. Sounds hacky, although I kind of like this.

After thinking about pros and cons, I implemented (1). Seems to be most
clear of the three options. But I'm happy to implement (2) or (3) if it's
deemed better.
---
 net/openvswitch/actions.c  | 14 +++---
 net/openvswitch/flow.c |  1 +
 net/openvswitch/flow.h | 22 ++
 net/openvswitch/flow_netlink.c |  5 +
 4 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 49af167105d3..44144f914920 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -137,12 +137,12 @@ static struct deferred_action 
*add_deferred_actions(struct sk_buff *skb,
 
 static void invalidate_flow_key(struct sw_flow_key *key)
 {
-   key->eth.type = htons(0);
+   key->mac_proto |= SW_FLOW_KEY_INVALID;
 }
 
 static bool is_flow_key_valid(const struct sw_flow_key *key)
 {
-   return !!key->eth.type;
+   return !(key->mac_proto & SW_FLOW_KEY_INVALID);
 }
 
 static void update_ethertype(struct sk_buff *skb, struct ethhdr *hdr,
@@ -796,16 +796,8 @@ static void do_output(struct datapath *dp, struct sk_buff 
*skb, int out_port,
ovs_vport_send(vport, skb);
} else if (mru <= vport->dev->mtu) {
struct net *net = read_pnet(>net);
-   __be16 ethertype = key->eth.type;
 
-   if (!is_flow_key_valid(key)) {
-   if (eth_p_mpls(skb->protocol))
-   ethertype = skb->inner_protocol;
-   else
-   ethertype = vlan_get_protocol(skb);
-   }
-
-   ovs_fragment(net, vport, skb, mru, ethertype);
+   ovs_fragment(net, vport, skb, mru, key->eth.type);
} else {
kfree_skb(skb);
}
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 22087062bd10..96c8c4716603 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -751,6 +751,7 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
*tun_info,
key->phy.skb_mark = skb->mark;
ovs_ct_fill_key(skb, key);
key->ovs_flow_hash = 0;
+   key->mac_proto = MAC_PROTO_ETHERNET;
key->recirc_id = 0;
 
return key_extract(skb, key);
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index ae783f5c6695..f61cae7f9030 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -37,6 +37,12 @@
 
 struct sk_buff;
 
+enum sw_flow_mac_proto {
+   MAC_PROTO_NONE = 0,
+   MAC_PROTO_ETHERNET,
+};
+#define SW_FLOW_KEY_INVALID0x80
+
 /* Store options at the end of the array if they are less than the
  * maximum size. This allows us to get the benefits of variable length
  * matching for small options.
@@ -68,6 +74,7 @@ struct sw_flow_key {
u32 skb_mark;   /* SKB mark. */
u16 in_port;/* Input switch port (or DP_MAX_PORTS). 
*/
} __packed phy; /* Safe when right after 'tun_key'. */
+   u8 mac_proto;   /* MAC layer protocol (e.g. Ethernet). 
*/
u8 tun_proto;   /* Protocol of encapsulating tunnel. */
u32 ovs_flow_hash;  /* Datapath 

[PATCH net-next v12 3/9] openvswitch: pass mac_proto to ovs_vport_send

2016-10-17 Thread Jiri Benc
We'll need it to alter packets sent to ARPHRD_NONE interfaces.

Change do_output() to use the actual L2 header size of the packet when
deciding on the minimum cutlen. The assumption here is that what matters is
not the output interface hard_header_len but rather the L2 header of the
particular packet. For example, ARPHRD_NONE tunnels that encapsulate
Ethernet should get at least the Ethernet header.

Signed-off-by: Jiri Benc 
---
 net/openvswitch/actions.c | 29 +
 net/openvswitch/vport.c   |  2 +-
 net/openvswitch/vport.h   |  2 +-
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 44144f914920..dc8bb97e2258 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -66,6 +66,7 @@ struct ovs_frag_data {
u16 vlan_tci;
__be16 vlan_proto;
unsigned int l2_len;
+   u8 mac_proto;
u8 l2_data[MAX_L2_LEN];
 };
 
@@ -673,7 +674,7 @@ static int ovs_vport_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
skb_reset_mac_len(skb);
}
 
-   ovs_vport_send(vport, skb);
+   ovs_vport_send(vport, skb, data->mac_proto);
return 0;
 }
 
@@ -692,7 +693,7 @@ static int ovs_vport_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
  * ovs_vport_output(), which is called once per fragmented packet.
  */
 static void prepare_frag(struct vport *vport, struct sk_buff *skb,
-u16 orig_network_offset)
+u16 orig_network_offset, u8 mac_proto)
 {
unsigned int hlen = skb_network_offset(skb);
struct ovs_frag_data *data;
@@ -705,6 +706,7 @@ static void prepare_frag(struct vport *vport, struct 
sk_buff *skb,
data->network_offset = orig_network_offset;
data->vlan_tci = skb->vlan_tci;
data->vlan_proto = skb->vlan_proto;
+   data->mac_proto = mac_proto;
data->l2_len = hlen;
memcpy(>l2_data, skb->data, hlen);
 
@@ -713,7 +715,8 @@ static void prepare_frag(struct vport *vport, struct 
sk_buff *skb,
 }
 
 static void ovs_fragment(struct net *net, struct vport *vport,
-struct sk_buff *skb, u16 mru, __be16 ethertype)
+struct sk_buff *skb, u16 mru,
+struct sw_flow_key *key)
 {
u16 orig_network_offset = 0;
 
@@ -727,11 +730,12 @@ static void ovs_fragment(struct net *net, struct vport 
*vport,
goto err;
}
 
-   if (ethertype == htons(ETH_P_IP)) {
+   if (key->eth.type == htons(ETH_P_IP)) {
struct dst_entry ovs_dst;
unsigned long orig_dst;
 
-   prepare_frag(vport, skb, orig_network_offset);
+   prepare_frag(vport, skb, orig_network_offset,
+ovs_key_mac_proto(key));
dst_init(_dst, _dst_ops, NULL, 1,
 DST_OBSOLETE_NONE, DST_NOCOUNT);
ovs_dst.dev = vport->dev;
@@ -742,7 +746,7 @@ static void ovs_fragment(struct net *net, struct vport 
*vport,
 
ip_do_fragment(net, skb->sk, skb, ovs_vport_output);
refdst_drop(orig_dst);
-   } else if (ethertype == htons(ETH_P_IPV6)) {
+   } else if (key->eth.type == htons(ETH_P_IPV6)) {
const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
unsigned long orig_dst;
struct rt6_info ovs_rt;
@@ -751,7 +755,8 @@ static void ovs_fragment(struct net *net, struct vport 
*vport,
goto err;
}
 
-   prepare_frag(vport, skb, orig_network_offset);
+   prepare_frag(vport, skb, orig_network_offset,
+ovs_key_mac_proto(key));
memset(_rt, 0, sizeof(ovs_rt));
dst_init(_rt.dst, _dst_ops, NULL, 1,
 DST_OBSOLETE_NONE, DST_NOCOUNT);
@@ -765,7 +770,7 @@ static void ovs_fragment(struct net *net, struct vport 
*vport,
refdst_drop(orig_dst);
} else {
WARN_ONCE(1, "Failed fragment ->%s: eth=%04x, MRU=%d, MTU=%d.",
- ovs_vport_name(vport), ntohs(ethertype), mru,
+ ovs_vport_name(vport), ntohs(key->eth.type), mru,
  vport->dev->mtu);
goto err;
}
@@ -785,19 +790,19 @@ static void do_output(struct datapath *dp, struct sk_buff 
*skb, int out_port,
u32 cutlen = OVS_CB(skb)->cutlen;
 
if (unlikely(cutlen > 0)) {
-   if (skb->len - cutlen > ETH_HLEN)
+   if (skb->len - cutlen > ovs_mac_header_len(key))
pskb_trim(skb, skb->len - cutlen);
else
-   pskb_trim(skb, ETH_HLEN);
+   pskb_trim(skb, ovs_mac_header_len(key));
}
 

[PATCH net-next v12 9/9] openvswitch: use ipgre tunnel rather than gretap tunnel

2016-10-17 Thread Jiri Benc
From: Simon Horman 

This allows GRE tunnels to send and receive both
layer 2 packets (packets with an ethernet header) and
layer 3 packets (packets without an ethernet header).

Signed-off-by: Simon Horman 
Signed-off-by: Jiri Benc 
---
v12: removed the non-gre hunks (now part of previous patches in this
 patchset)
---
 include/net/gre.h   | 4 ++--
 net/ipv4/ip_gre.c   | 9 +
 net/openvswitch/vport-gre.c | 2 +-
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/net/gre.h b/include/net/gre.h
index d25d836c129b..1a0bb1cefa60 100644
--- a/include/net/gre.h
+++ b/include/net/gre.h
@@ -31,8 +31,8 @@ struct gre_protocol {
 int gre_add_protocol(const struct gre_protocol *proto, u8 version);
 int gre_del_protocol(const struct gre_protocol *proto, u8 version);
 
-struct net_device *gretap_fb_dev_create(struct net *net, const char *name,
-  u8 name_assign_type);
+struct net_device *gre_fb_dev_create(struct net *net, const char *name,
+u8 name_assign_type);
 int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 bool *csum_err, __be16 proto, int nhs);
 
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 576f705d8180..18caea5c6d09 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1125,8 +1125,8 @@ static int ipgre_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
.get_link_net   = ip_tunnel_get_link_net,
 };
 
-struct net_device *gretap_fb_dev_create(struct net *net, const char *name,
-   u8 name_assign_type)
+struct net_device *gre_fb_dev_create(struct net *net, const char *name,
+u8 name_assign_type)
 {
struct nlattr *tb[IFLA_MAX + 1];
struct net_device *dev;
@@ -1137,13 +1137,14 @@ struct net_device *gretap_fb_dev_create(struct net 
*net, const char *name,
memset(, 0, sizeof(tb));
 
dev = rtnl_create_link(net, name, name_assign_type,
-  _tap_ops, tb);
+  _link_ops, tb);
if (IS_ERR(dev))
return dev;
 
/* Configure flow based GRE device. */
t = netdev_priv(dev);
t->collect_md = true;
+   dev->type = ARPHRD_NONE;
 
err = ipgre_newlink(net, dev, tb, NULL);
if (err < 0) {
@@ -1168,7 +1169,7 @@ struct net_device *gretap_fb_dev_create(struct net *net, 
const char *name,
unregister_netdevice_many(_kill);
return ERR_PTR(err);
 }
-EXPORT_SYMBOL_GPL(gretap_fb_dev_create);
+EXPORT_SYMBOL_GPL(gre_fb_dev_create);
 
 static int __net_init ipgre_tap_init_net(struct net *net)
 {
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index 0e72d95b0e8f..3fc3014bf924 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -61,7 +61,7 @@ static struct vport *gre_tnl_create(const struct vport_parms 
*parms)
return vport;
 
rtnl_lock();
-   dev = gretap_fb_dev_create(net, parms->name, NET_NAME_USER);
+   dev = gre_fb_dev_create(net, parms->name, NET_NAME_USER);
if (IS_ERR(dev)) {
rtnl_unlock();
ovs_vport_free(vport);
-- 
1.8.3.1



Re: [PATCH iproute2] devlink: Convert conditional in dl_argv_handle_port() to switch()

2016-10-17 Thread Stephen Hemminger
On Sun,  9 Oct 2016 10:14:18 +0800
Hangbin Liu  wrote:

> Discovered by Phil's covscan. The final return statement is never reached.
> This is not inherently clear from looking at the code, so change the
> conditional to a switch() statement which should clarify this.
> 
> CC: Phil Sutter 
> Signed-off-by: Hangbin Liu 

Applied thanks


[PATCH net-next v12 1/9] openvswitch: use hard_header_len instead of hardcoded ETH_HLEN

2016-10-17 Thread Jiri Benc
On tx, use hard_header_len while deciding whether to refragment or drop the
packet. That way, all combinations are calculated correctly:

* L2 packet going to L2 interface (the L2 header len is subtracted),
* L2 packet going to L3 interface (the L2 header is included in the packet
  lenght),
* L3 packet going to L3 interface.

Signed-off-by: Jiri Benc 
---
 net/openvswitch/actions.c |  3 ++-
 net/openvswitch/vport.c   | 10 ++
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1105c4e29c62..49af167105d3 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -791,7 +791,8 @@ static void do_output(struct datapath *dp, struct sk_buff 
*skb, int out_port,
pskb_trim(skb, ETH_HLEN);
}
 
-   if (likely(!mru || (skb->len <= mru + ETH_HLEN))) {
+   if (likely(!mru ||
+  (skb->len <= mru + vport->dev->hard_header_len))) {
ovs_vport_send(vport, skb);
} else if (mru <= vport->dev->mtu) {
struct net *net = read_pnet(>net);
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 7387418ac514..d5550fd6a013 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -481,9 +481,10 @@ void ovs_vport_deferred_free(struct vport *vport)
 }
 EXPORT_SYMBOL_GPL(ovs_vport_deferred_free);
 
-static unsigned int packet_length(const struct sk_buff *skb)
+static unsigned int packet_length(const struct sk_buff *skb,
+ struct net_device *dev)
 {
-   unsigned int length = skb->len - ETH_HLEN;
+   unsigned int length = skb->len - dev->hard_header_len;
 
if (!skb_vlan_tag_present(skb) &&
eth_type_vlan(skb->protocol))
@@ -501,10 +502,11 @@ void ovs_vport_send(struct vport *vport, struct sk_buff 
*skb)
 {
int mtu = vport->dev->mtu;
 
-   if (unlikely(packet_length(skb) > mtu && !skb_is_gso(skb))) {
+   if (unlikely(packet_length(skb, vport->dev) > mtu &&
+!skb_is_gso(skb))) {
net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
 vport->dev->name,
-packet_length(skb), mtu);
+packet_length(skb, vport->dev), mtu);
vport->dev->stats.tx_errors++;
goto drop;
}
-- 
1.8.3.1



[PATCH net-next v12 7/9] openvswitch: add Ethernet push and pop actions

2016-10-17 Thread Jiri Benc
It's not allowed to push Ethernet header in front of another Ethernet
header.

It's not allowed to pop Ethernet header if there's a vlan tag. This
preserves the invariant that L3 packet never has a vlan tag.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab 
Signed-off-by: Simon Horman 
Signed-off-by: Jiri Benc 
---
 include/uapi/linux/openvswitch.h | 15 
 net/openvswitch/actions.c| 49 
 net/openvswitch/flow_netlink.c   | 18 +++
 3 files changed, 82 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 59ed3992c760..375d812fea36 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -705,6 +705,15 @@ enum ovs_nat_attr {
 
 #define OVS_NAT_ATTR_MAX (__OVS_NAT_ATTR_MAX - 1)
 
+/*
+ * struct ovs_action_push_eth - %OVS_ACTION_ATTR_PUSH_ETH action argument.
+ * @addresses: Source and destination MAC addresses.
+ * @eth_type: Ethernet type
+ */
+struct ovs_action_push_eth {
+   struct ovs_key_ethernet addresses;
+};
+
 /**
  * enum ovs_action_attr - Action types.
  *
@@ -738,6 +747,10 @@ enum ovs_nat_attr {
  * is no MPLS label stack, as determined by ethertype, no action is taken.
  * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
  * entries in the flow key.
+ * @OVS_ACTION_ATTR_PUSH_ETH: Push a new outermost Ethernet header onto the
+ * packet.
+ * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
+ * packet.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -765,6 +778,8 @@ enum ovs_action_attr {
   * bits. */
OVS_ACTION_ATTR_CT,   /* Nested OVS_CT_ATTR_* . */
OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
+   OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
+   OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
 
__OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted
   * from userspace. */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 064cbcb7b0c5..a63572fb878e 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -317,6 +317,47 @@ static int set_eth_addr(struct sk_buff *skb, struct 
sw_flow_key *flow_key,
return 0;
 }
 
+/* pop_eth does not support VLAN packets as this action is never called
+ * for them.
+ */
+static int pop_eth(struct sk_buff *skb, struct sw_flow_key *key)
+{
+   skb_pull_rcsum(skb, ETH_HLEN);
+   skb_reset_mac_header(skb);
+   skb->mac_len -= ETH_HLEN;
+
+   /* safe right before invalidate_flow_key */
+   key->mac_proto = MAC_PROTO_NONE;
+   invalidate_flow_key(key);
+   return 0;
+}
+
+static int push_eth(struct sk_buff *skb, struct sw_flow_key *key,
+   const struct ovs_action_push_eth *ethh)
+{
+   struct ethhdr *hdr;
+
+   /* Add the new Ethernet header */
+   if (skb_cow_head(skb, ETH_HLEN) < 0)
+   return -ENOMEM;
+
+   skb_push(skb, ETH_HLEN);
+   skb_reset_mac_header(skb);
+   skb->mac_len = ETH_HLEN;
+
+   hdr = eth_hdr(skb);
+   ether_addr_copy(hdr->h_source, ethh->addresses.eth_src);
+   ether_addr_copy(hdr->h_dest, ethh->addresses.eth_dst);
+   hdr->h_proto = skb->protocol;
+
+   skb_postpush_rcsum(skb, hdr, ETH_HLEN);
+
+   /* safe right before invalidate_flow_key */
+   key->mac_proto = MAC_PROTO_ETHERNET;
+   invalidate_flow_key(key);
+   return 0;
+}
+
 static void update_ip_l4_checksum(struct sk_buff *skb, struct iphdr *nh,
  __be32 addr, __be32 new_addr)
 {
@@ -1200,6 +1241,14 @@ static int do_execute_actions(struct datapath *dp, 
struct sk_buff *skb,
if (err)
return err == -EINPROGRESS ? 0 : err;
break;
+
+   case OVS_ACTION_ATTR_PUSH_ETH:
+   err = push_eth(skb, key, nla_data(a));
+   break;
+
+   case OVS_ACTION_ATTR_POP_ETH:
+   err = pop_eth(skb, key);
+   break;
}
 
if (unlikely(err)) {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index c3d0cc4321c3..d19044f2b1f4 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2383,6 +2383,8 @@ static int __ovs_nla_copy_actions(struct net *net, const 
struct nlattr *attr,
[OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash),
[OVS_ACTION_ATTR_CT] = (u32)-1,
[OVS_ACTION_ATTR_TRUNC] = sizeof(struct 

[PATCH net-next v12 8/9] openvswitch: allow L3 netdev ports

2016-10-17 Thread Jiri Benc
Allow ARPHRD_NONE interfaces to be added to ovs bridge.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab 
Signed-off-by: Simon Horman 
Signed-off-by: Jiri Benc 
---
 net/openvswitch/vport-netdev.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 4e3972344aa6..4309796bc870 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -57,8 +57,10 @@ static void netdev_port_receive(struct sk_buff *skb)
if (unlikely(!skb))
return;
 
-   skb_push(skb, ETH_HLEN);
-   skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
+   if (skb->dev->type == ARPHRD_ETHER) {
+   skb_push(skb, ETH_HLEN);
+   skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
+   }
ovs_vport_receive(vport, skb, skb_tunnel_info(skb));
return;
 error:
@@ -97,7 +99,8 @@ struct vport *ovs_netdev_link(struct vport *vport, const char 
*name)
}
 
if (vport->dev->flags & IFF_LOOPBACK ||
-   vport->dev->type != ARPHRD_ETHER ||
+   (vport->dev->type != ARPHRD_ETHER &&
+vport->dev->type != ARPHRD_NONE) ||
ovs_is_internal_dev(vport->dev)) {
err = -EINVAL;
goto error_put;
-- 
1.8.3.1



[PATCH net-next v12 6/9] openvswitch: netlink: support L3 packets

2016-10-17 Thread Jiri Benc
Extend the ovs flow netlink protocol to support L3 packets. Packets without
OVS_KEY_ATTR_ETHERNET attribute specify L3 packets; for those, the
OVS_KEY_ATTR_ETHERTYPE attribute is mandatory.

Push/pop vlan actions are only supported for Ethernet packets.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab 
Signed-off-by: Simon Horman 
Signed-off-by: Jiri Benc 
---
 net/openvswitch/flow_netlink.c | 160 +
 1 file changed, 99 insertions(+), 61 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index ccb9900c5230..c3d0cc4321c3 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -123,7 +123,7 @@ static void update_range(struct sw_flow_match *match,
 static bool match_validate(const struct sw_flow_match *match,
   u64 key_attrs, u64 mask_attrs, bool log)
 {
-   u64 key_expected = 1 << OVS_KEY_ATTR_ETHERNET;
+   u64 key_expected = 0;
u64 mask_allowed = key_attrs;  /* At most allow all key attributes */
 
/* The following mask attributes allowed only if they
@@ -969,10 +969,33 @@ static int parse_vlan_from_nlattrs(struct sw_flow_match 
*match,
return 0;
 }
 
+static int parse_eth_type_from_nlattrs(struct sw_flow_match *match,
+  u64 *attrs, const struct nlattr **a,
+  bool is_mask, bool log)
+{
+   __be16 eth_type;
+
+   eth_type = nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]);
+   if (is_mask) {
+   /* Always exact match EtherType. */
+   eth_type = htons(0x);
+   } else if (!eth_proto_is_802_3(eth_type)) {
+   OVS_NLERR(log, "EtherType %x is less than min %x",
+   ntohs(eth_type), ETH_P_802_3_MIN);
+   return -EINVAL;
+   }
+
+   SW_FLOW_KEY_PUT(match, eth.type, eth_type, is_mask);
+   *attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE);
+   return 0;
+}
+
 static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 u64 *attrs, const struct nlattr **a,
 bool is_mask, bool log)
 {
+   u8 mac_proto = MAC_PROTO_ETHERNET;
+
if (*attrs & (1 << OVS_KEY_ATTR_DP_HASH)) {
u32 hash_val = nla_get_u32(a[OVS_KEY_ATTR_DP_HASH]);
 
@@ -1060,9 +1083,19 @@ static int metadata_from_nlattrs(struct net *net, struct 
sw_flow_match *match,
*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_LABELS);
}
 
+   /* For layer 3 packets the Ethernet type is provided
+* and treated as metadata but no MAC addresses are provided.
+*/
+   if (!(*attrs & (1ULL << OVS_KEY_ATTR_ETHERNET)) &&
+   (*attrs & (1ULL << OVS_KEY_ATTR_ETHERTYPE)))
+   mac_proto = MAC_PROTO_NONE;
+
/* Always exact match mac_proto */
-   SW_FLOW_KEY_PUT(match, mac_proto, is_mask ? 0xff : MAC_PROTO_ETHERNET,
-   is_mask);
+   SW_FLOW_KEY_PUT(match, mac_proto, is_mask ? 0xff : mac_proto, is_mask);
+
+   if (mac_proto == MAC_PROTO_NONE)
+   return parse_eth_type_from_nlattrs(match, attrs, a, is_mask,
+  log);
 
return 0;
 }
@@ -1086,33 +1119,26 @@ static int ovs_key_from_nlattrs(struct net *net, struct 
sw_flow_match *match,
SW_FLOW_KEY_MEMCPY(match, eth.dst,
eth_key->eth_dst, ETH_ALEN, is_mask);
attrs &= ~(1 << OVS_KEY_ATTR_ETHERNET);
-   }
-
-   if (attrs & (1 << OVS_KEY_ATTR_VLAN)) {
-   /* VLAN attribute is always parsed before getting here since it
-* may occur multiple times.
-*/
-   OVS_NLERR(log, "VLAN attribute unexpected.");
-   return -EINVAL;
-   }
-
-   if (attrs & (1 << OVS_KEY_ATTR_ETHERTYPE)) {
-   __be16 eth_type;
 
-   eth_type = nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]);
-   if (is_mask) {
-   /* Always exact match EtherType. */
-   eth_type = htons(0x);
-   } else if (!eth_proto_is_802_3(eth_type)) {
-   OVS_NLERR(log, "EtherType %x is less than min %x",
- ntohs(eth_type), ETH_P_802_3_MIN);
+   if (attrs & (1 << OVS_KEY_ATTR_VLAN)) {
+   /* VLAN attribute is always parsed before getting here 
since it
+* may occur multiple times.
+*/
+   OVS_NLERR(log, "VLAN attribute unexpected.");
return -EINVAL;
}
 
-   SW_FLOW_KEY_PUT(match, eth.type, eth_type, is_mask);
-   attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE);
-   } 

Re: [PATCH net-next 2/2] net: phy: Add PHY Auto/Mdi/Mdix set driver for Microsemi PHYs.

2016-10-17 Thread Raju Lakkaraju
Hi Florian/Andrew,

Thank you for review comments.

On Thu, Oct 06, 2016 at 03:57:32AM -0700, Florian Fainelli wrote:
> EXTERNAL EMAIL
> 
> 
> On 09/28/2016 01:24 PM, Andrew Lunn wrote:
> >>  static int vsc85xx_wol_set(struct phy_device *phydev,
> >> struct ethtool_wolinfo *wol)
> >>  {
> >> @@ -227,6 +281,7 @@ static int vsc85xx_default_config(struct phy_device 
> >> *phydev)
> >>  int rc;
> >>  u16 reg_val;
> >>
> >> +phydev->mdix = ETH_TP_MDI_AUTO;
> >
> > Humm, interesting. The only other driver supporting mdix is the
> > Marvell one. It does not do this, it leaves it to its default value of
> > ETH_TP_MDI_INVALID. It does however interpret ETH_TP_MDI_INVALID as
> > meaning as ETH_TP_MDI_AUTO.
> >
> > There needs to be consistency here. You either need to do the same as
> > the Marvell driver, or you need to modify the Marvell driver to also
> > set phydev->mdix to ETH_TP_MDI_AUTO.
> >
> > I don't yet know which of these two is the right thing to do.
> >
> > Florian?
> 
> It's really hard to tell because the other drivers I looked at do not
> necessarily seem to be consistent either. Here, if the MDI status is
> really auto, then this is the correct value to return, if it is unknown,
> it should be ETH_TP_MDI_INVALID.
> 

In mdix get status function, This value will be update as per 
PHY mdix current status.

Shall i configure "phydev->mdix = ETH_TP_MDI_AUTO" as default ?

Andrew, do you have any comments?

> For the Marvell PHY, it sounds like we should be able to determine what
> was configured and return the correct MDI status value
> --
> Florian

---
Thanks,
Raju.


Re: [PATCH iproute2] bridge: add support for the multicast flood flag

2016-10-17 Thread Stephen Hemminger
On Thu, 13 Oct 2016 17:54:20 +0200
Nikolay Aleksandrov  wrote:

> Recently a new per-port flag was added which controls the flooding of
> unknown multicast, this patch adds support for controlling it via iproute2.
> It also updates the man pages with information about the new flag.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---

Applied, thanks.




Re: [PATCH net-next 2/2] net: phy: Add PHY Auto/Mdi/Mdix set driver for Microsemi PHYs.

2016-10-17 Thread Raju Lakkaraju
Hi Florian/Andrew,

Thank you for review comments.

On Thu, Oct 06, 2016 at 04:09:56AM -0700, Florian Fainelli wrote:
> EXTERNAL EMAIL
> 
> 
> On 10/05/2016 12:18 AM, Andrew Lunn wrote:
>  + phydev->mdix = ETH_TP_MDI_AUTO;
> >>>
> >>> Humm, interesting. The only other driver supporting mdix is the
> >>> Marvell one. It does not do this, it leaves it to its default value of
> >>> ETH_TP_MDI_INVALID. It does however interpret ETH_TP_MDI_INVALID as
> >>> meaning as ETH_TP_MDI_AUTO.
> >>>
> >>> There needs to be consistency here. You either need to do the same as
> >>> the Marvell driver, or you need to modify the Marvell driver to also
> >>> set phydev->mdix to ETH_TP_MDI_AUTO.
> >>>
> >> In Ethtool two variable i.e. eth_tp_mdix_ctrl, eth_tp_mdix use to update
> >> the status. But, driver header is having one variable i.e. mdix.
> >> Driver header should also have another variabl like mdix_ctrl.
> >> Then, Ethtool can get/set the Auto MDIX/MDI.
> >> In case, mdix is not configure with ETH_TP_MDI_AUTO, Ethtool shows error as
> >> "setting MDI not supported"
> 
> Agreed, we currently report eth_tp_mdi and eth_tp_mdi_ctrl using
> phydev->mdix, but this is too limiting.
> 
> >>
> >> Please suggest me if you have any better method to fix this issue.
> >
> > Maybe we should add a new flag for the .flags member of the
> > phy_driver. If PHY_HAS_MDIX is set, the phy core will set phydev->mdix
> > to ETH_TP_MDI_AUTO?
> 
> I agree with Raju here, like most other Ethernet drivers, we should
> allow PHY drivers to have an eth_tp_mdix_ctrl to indicate what is the
> configured MDI setting, and read eth_tp_mdi to indicate what is the
> current status, then ethtool can properly differentiate what is going on.
> 

Andrew, Do you want me to add new flag (mdix_ctrl) or keep it as it is?
These changes are more relevant for mdix get status function.
Do you want to me implement along with mdix get status function? 

> Raju, Andrew, does that work for you?
> --
> Florian

---
Thanks,
Raju.


[PATCH net-next v12 0/9] openvswitch: support for layer 3 encapsulated packets

2016-10-17 Thread Jiri Benc
At the core of this patch set is removing the assumption in Open vSwitch
datapath that all packets have Ethernet header. Support for layer 3 GRE
tunnels is also added by this patchset.

The implementation relies on the presence of pop_eth and push_eth actions
in datapath flows to facilitate adding and removing Ethernet headers as
appropriate. The construction of such flows is left up to user-space.

This series is based on work by Simon Horman, Lorand Jakab, Thomas Morin and
others. I kept Lorand's and Simon's s-o-b in the patches that are derived
from v11 to record their authorship of parts of the code. Please let me know
if you disagree with this.

v12 differs from v11 a lot. The main changes are:

* The patches were restructured and split differently for easier review.
* They were rebased and adjusted to the current net-next. Especially MPLS
  handling is different (and easier) thanks to the recent MPLS GSO rework.
* Several bugs were discovered and fixed. The most notable is fragment
  handling: header adjustment for ARPHRD_NONE devices on tx needs to be done
  after refragmentation, not before it. This required significant changes in
  the patchset. Another one is stricter checking of attributes (match on L2
  vs. L3 packet) at the kernel level.
* Instead of is_layer3 bool, a mac_proto field is used. See patch 2. This is
  a matter of taste and alternate approaches are offered in patch 2
  description.

There is no change to uAPI since v11. The previously posted patchset for
Open vSwitch user space works with this submission unmodified.

Jiri Benc (8):
  openvswitch: use hard_header_len instead of hardcoded ETH_HLEN
  openvswitch: add mac_proto field to the flow key
  openvswitch: pass mac_proto to ovs_vport_send
  openvswitch: support MPLS push and pop for L3 packets
  openvswitch: add processing of L3 packets
  openvswitch: netlink: support L3 packets
  openvswitch: add Ethernet push and pop actions
  openvswitch: allow L3 netdev ports

Simon Horman (1):
  openvswitch: use ipgre tunnel rather than gretap tunnel

 include/net/gre.h|   4 +-
 include/uapi/linux/openvswitch.h |  15 
 net/ipv4/ip_gre.c|   9 +-
 net/openvswitch/actions.c| 111 +---
 net/openvswitch/datapath.c   |  17 +---
 net/openvswitch/flow.c   | 100 --
 net/openvswitch/flow.h   |  22 +
 net/openvswitch/flow_netlink.c   | 179 ++-
 net/openvswitch/vport-gre.c  |   2 +-
 net/openvswitch/vport-netdev.c   |   9 +-
 net/openvswitch/vport.c  |  28 --
 net/openvswitch/vport.h  |   2 +-
 12 files changed, 356 insertions(+), 142 deletions(-)

-- 
1.8.3.1



[PATCH net-next v12 4/9] openvswitch: support MPLS push and pop for L3 packets

2016-10-17 Thread Jiri Benc
Update Ethernet header only if there is one.

Signed-off-by: Jiri Benc 
---
 net/openvswitch/actions.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index dc8bb97e2258..064cbcb7b0c5 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -187,7 +187,8 @@ static int push_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
 
skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN);
 
-   update_ethertype(skb, eth_hdr(skb), mpls->mpls_ethertype);
+   if (ovs_key_mac_proto(key) == MAC_PROTO_ETHERNET)
+   update_ethertype(skb, eth_hdr(skb), mpls->mpls_ethertype);
skb->protocol = mpls->mpls_ethertype;
 
invalidate_flow_key(key);
@@ -197,7 +198,6 @@ static int push_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
 static int pop_mpls(struct sk_buff *skb, struct sw_flow_key *key,
const __be16 ethertype)
 {
-   struct ethhdr *hdr;
int err;
 
err = skb_ensure_writable(skb, skb->mac_len + MPLS_HLEN);
@@ -213,11 +213,15 @@ static int pop_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
skb_reset_mac_header(skb);
skb_set_network_header(skb, skb->mac_len);
 
-   /* mpls_hdr() is used to locate the ethertype field correctly in the
-* presence of VLAN tags.
-*/
-   hdr = (struct ethhdr *)((void *)mpls_hdr(skb) - ETH_HLEN);
-   update_ethertype(skb, hdr, ethertype);
+   if (ovs_key_mac_proto(key) == MAC_PROTO_ETHERNET) {
+   struct ethhdr *hdr;
+
+   /* mpls_hdr() is used to locate the ethertype field correctly 
in the
+* presence of VLAN tags.
+*/
+   hdr = (struct ethhdr *)((void *)mpls_hdr(skb) - ETH_HLEN);
+   update_ethertype(skb, hdr, ethertype);
+   }
if (eth_p_mpls(skb->protocol))
skb->protocol = ethertype;
 
-- 
1.8.3.1



[PATCH net-next v12 5/9] openvswitch: add processing of L3 packets

2016-10-17 Thread Jiri Benc
Support receiving, extracting flow key and sending of L3 packets (packets
without an Ethernet header).

Note that even after this patch, non-Ethernet interfaces are still not
allowed to be added to bridges. Similarly, netlink interface for sending and
receiving L3 packets to/from user space is not in place yet.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab 
Signed-off-by: Simon Horman 
Signed-off-by: Jiri Benc 
---
 net/openvswitch/datapath.c |  17 ++--
 net/openvswitch/flow.c | 101 ++---
 net/openvswitch/vport.c|  16 +++
 3 files changed, 96 insertions(+), 38 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 4d67ea856067..c5b719fca8d4 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -562,7 +562,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
struct sw_flow *flow;
struct sw_flow_actions *sf_acts;
struct datapath *dp;
-   struct ethhdr *eth;
struct vport *input_vport;
u16 mru = 0;
int len;
@@ -583,17 +582,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
 
nla_memcpy(__skb_put(packet, len), a[OVS_PACKET_ATTR_PACKET], len);
 
-   skb_reset_mac_header(packet);
-   eth = eth_hdr(packet);
-
-   /* Normally, setting the skb 'protocol' field would be handled by a
-* call to eth_type_trans(), but it assumes there's a sending
-* device, which we may not have. */
-   if (eth_proto_is_802_3(eth->h_proto))
-   packet->protocol = eth->h_proto;
-   else
-   packet->protocol = htons(ETH_P_802_2);
-
/* Set packet's mru */
if (a[OVS_PACKET_ATTR_MRU]) {
mru = nla_get_u16(a[OVS_PACKET_ATTR_MRU]);
@@ -609,8 +597,10 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
 
err = ovs_flow_key_extract_userspace(net, a[OVS_PACKET_ATTR_KEY],
 packet, >key, log);
-   if (err)
+   if (err) {
+   packet = NULL;
goto err_flow_free;
+   }
 
err = ovs_nla_copy_actions(net, a[OVS_PACKET_ATTR_ACTIONS],
   >key, , log);
@@ -620,6 +610,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
rcu_assign_pointer(flow->sf_acts, acts);
packet->priority = flow->key.phy.priority;
packet->mark = flow->key.phy.skb_mark;
+   packet->protocol = flow->key.eth.type;
 
rcu_read_lock();
dp = get_dp_rcu(net, ovs_header->dp_ifindex);
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 96c8c4716603..7aee6e273765 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -334,14 +334,17 @@ static int parse_vlan_tag(struct sk_buff *skb, struct 
vlan_head *key_vh)
return 1;
 }
 
-static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
+static void clear_vlan(struct sw_flow_key *key)
 {
-   int res;
-
key->eth.vlan.tci = 0;
key->eth.vlan.tpid = 0;
key->eth.cvlan.tci = 0;
key->eth.cvlan.tpid = 0;
+}
+
+static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
+{
+   int res;
 
if (skb_vlan_tag_present(skb)) {
key->eth.vlan.tci = htons(skb->vlan_tci);
@@ -483,17 +486,20 @@ static int parse_icmpv6(struct sk_buff *skb, struct 
sw_flow_key *key,
  *
  * Returns 0 if successful, otherwise a negative errno value.
  *
- * Initializes @skb header pointers as follows:
+ * Initializes @skb header fields as follows:
  *
- *- skb->mac_header: the Ethernet header.
+ *- skb->mac_header: the L2 header.
  *
- *- skb->network_header: just past the Ethernet header, or just past the
- *  VLAN header, to the first byte of the Ethernet payload.
+ *- skb->network_header: just past the L2 header, or just past the
+ *  VLAN header, to the first byte of the L2 payload.
  *
  *- skb->transport_header: If key->eth.type is ETH_P_IP or ETH_P_IPV6
  *  on output, then just past the IP header, if one is present and
  *  of a correct length, otherwise the same as skb->network_header.
  *  For other key->eth.type values it is left untouched.
+ *
+ *- skb->protocol: the type of the data starting at skb->network_header.
+ *  Equals to key->eth.type.
  */
 static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 {
@@ -505,28 +511,35 @@ static int key_extract(struct sk_buff *skb, struct 
sw_flow_key *key)
 
skb_reset_mac_header(skb);
 
-   /* Link layer.  We are guaranteed to have at least the 14 byte Ethernet
-* header in the linear data area.
-*/
-   eth = eth_hdr(skb);
-   ether_addr_copy(key->eth.src, 

Re: [PATCH 4/7] net: qcom/emac: Fix module autoload for OF registration

2016-10-17 Thread Timur Tabi

Javier Martinez Canillas wrote:

If the driver is built as a module, autoload won't work because the module
alias information is not filled. So user-space can't match the registered
device with the corresponding module.

Export the module alias information using the MODULE_DEVICE_TABLE() macro.

Before this patch:

$ modinfo drivers/net/ethernet/qualcomm/emac/qcom-emac.ko | grep alias
alias:  platform:qcom-emac

After this patch:

$ modinfo drivers/net/ethernet/qualcomm/emac/qcom-emac.ko | grep alias
alias:  platform:qcom-emac
alias:  of:N*T*Cqcom,fsm9900-emacC*
alias:  of:N*T*Cqcom,fsm9900-emac

Signed-off-by: Javier Martinez Canillas


Acked-by: Timur Tabi 

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.


  1   2   3   >