date:20160325

Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR

2016-03-25 Thread David Miller

From: Francois Romieu 
Date: Fri, 25 Mar 2016 23:53:25 +0100

> Been there. Such requests are usually left unanswered. :o(

Due to this patch submitters continued anti-social and anti-community
behavior, I have been completely ignoring their patches.

I will continue to mark all of their patch submissions as "REJECTED"
in patchwork until they start to behave like proper community members
and actually respond properly and act upon to the feedback they are
given.

You should also feel free to ignore their work as well, your time is
valuable, don't waste it on someone who ignores everyone's feedback.

Re: [PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}

2016-03-25 Thread Guenter Roeck


On 03/25/2016 06:58 PM, Patrick Uiterwijk wrote:

On Sat, Mar 26, 2016 at 2:45 AM, Guenter Roeck  wrote:

On 03/25/2016 05:10 PM, Patrick Uiterwijk wrote:


Add versions of the phy_page_read and _write functions to
be used in a context where the SMI mutex is held.

Signed-off-by: Patrick Uiterwijk 
---
   drivers/net/dsa/mv88e6xxx.c | 42
--
   1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index fa086e0..13db5d8 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds,
bool ppu_active)
 return 0;
   }

-int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
int reg)
+static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int
page,
+   int reg)
   {
-   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
 int ret;

-   mutex_lock(>smi_mutex);
 ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
 if (ret < 0)
-   goto error;
+   goto clear;
 ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
-error:
+clear:



Is there some good reason for changing the name of those labels ?


Vivien suggested to rename this since it makes more clear that this write is
meant to return to page 0 to make sure that phylib doesn't get confused
about the currently active page.



And "clear:" accomplishes that ? I would not have guessed.
Wonder if anyone else does. I would have used a comment.
/* Try to return to page 0 even after an error */
or something like that.

Guenter


Patrick



Guenter



 _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-   mutex_unlock(>smi_mutex);
+
 return ret;
   }

-int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
-int reg, int val)
+int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
int reg)
   {
 struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
 int ret;

 mutex_lock(>smi_mutex);
+   ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
+   mutex_unlock(>smi_mutex);
+
+   return ret;
+}
+
+static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int
page,
+int reg, int val)
+{
+   int ret;
+
 ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
 if (ret < 0)
-   goto error;
+   goto clear;

 ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
-error:
+clear:
 _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+   return ret;
+}
+
+int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+int reg, int val)
+{
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   int ret;
+
+   mutex_lock(>smi_mutex);
+   ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
 mutex_unlock(>smi_mutex);
+
 return ret;
   }

Re: [PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}

2016-03-25 Thread Patrick Uiterwijk

On Sat, Mar 26, 2016 at 2:45 AM, Guenter Roeck  wrote:
> On 03/25/2016 05:10 PM, Patrick Uiterwijk wrote:
>>
>> Add versions of the phy_page_read and _write functions to
>> be used in a context where the SMI mutex is held.
>>
>> Signed-off-by: Patrick Uiterwijk 
>> ---
>>   drivers/net/dsa/mv88e6xxx.c | 42
>> --
>>   1 file changed, 32 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
>> index fa086e0..13db5d8 100644
>> --- a/drivers/net/dsa/mv88e6xxx.c
>> +++ b/drivers/net/dsa/mv88e6xxx.c
>> @@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds,
>> bool ppu_active)
>> return 0;
>>   }
>>
>> -int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
>> int reg)
>> +static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int
>> page,
>> +   int reg)
>>   {
>> -   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>> int ret;
>>
>> -   mutex_lock(>smi_mutex);
>> ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>> if (ret < 0)
>> -   goto error;
>> +   goto clear;
>> ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
>> -error:
>> +clear:
>
>
> Is there some good reason for changing the name of those labels ?

Vivien suggested to rename this since it makes more clear that this write is
meant to return to page 0 to make sure that phylib doesn't get confused
about the currently active page.

Patrick

>
> Guenter
>
>
>> _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
>> -   mutex_unlock(>smi_mutex);
>> +
>> return ret;
>>   }
>>
>> -int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>> -int reg, int val)
>> +int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
>> int reg)
>>   {
>> struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>> int ret;
>>
>> mutex_lock(>smi_mutex);
>> +   ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
>> +   mutex_unlock(>smi_mutex);
>> +
>> +   return ret;
>> +}
>> +
>> +static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int
>> page,
>> +int reg, int val)
>> +{
>> +   int ret;
>> +
>> ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>> if (ret < 0)
>> -   goto error;
>> +   goto clear;
>>
>> ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
>> -error:
>> +clear:
>> _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
>> +
>> +   return ret;
>> +}
>> +
>> +int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>> +int reg, int val)
>> +{
>> +   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>> +   int ret;
>> +
>> +   mutex_lock(>smi_mutex);
>> +   ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
>> mutex_unlock(>smi_mutex);
>> +
>> return ret;
>>   }
>>
>>
>

Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU

2016-03-25 Thread Alexei Starovoitov

On Fri, Mar 25, 2016 at 03:29:24PM -0700, Eric Dumazet wrote:
> Tom Herbert would like to avoid touching UDP socket refcnt for encapsulated
> traffic. For this to happen, we need to use normal RCU rules, with a grace
> period before freeing a socket. UDP sockets are not short lived in the
> high usage case, so the added cost of call_rcu() should not be a concern.
> 
> This actually removes a lot of complexity in UDP stack
> 
> Multicast receives no longer need to hold a bucket lock.
> 
> Note that ip early demux still needs to take a reference on the socket.
> 
> Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
> but this might be changed later.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> ---
>  include/linux/udp.h |   8 +-
>  include/net/sock.h  |  12 +--
>  include/net/udp.h   |   2 +-
>  net/ipv4/udp.c  | 290 
> +++-
>  net/ipv4/udp_diag.c |  18 ++--
>  net/ipv6/udp.c  | 194 +++
>  6 files changed, 162 insertions(+), 362 deletions(-)

great stuff!

Re: [PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}

2016-03-25 Thread Guenter Roeck


On 03/25/2016 05:10 PM, Patrick Uiterwijk wrote:

Add versions of the phy_page_read and _write functions to
be used in a context where the SMI mutex is held.

Signed-off-by: Patrick Uiterwijk 
---
  drivers/net/dsa/mv88e6xxx.c | 42 --
  1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index fa086e0..13db5d8 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool 
ppu_active)
return 0;
  }

-int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
+static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
+   int reg)
  {
-   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int ret;

-   mutex_lock(>smi_mutex);
ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
if (ret < 0)
-   goto error;
+   goto clear;
ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
-error:
+clear:


Is there some good reason for changing the name of those labels ?

Guenter


_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-   mutex_unlock(>smi_mutex);
+
return ret;
  }

-int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
-int reg, int val)
+int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
  {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int ret;

mutex_lock(>smi_mutex);
+   ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
+   mutex_unlock(>smi_mutex);
+
+   return ret;
+}
+
+static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+int reg, int val)
+{
+   int ret;
+
ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
if (ret < 0)
-   goto error;
+   goto clear;

ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
-error:
+clear:
_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+   return ret;
+}
+
+int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+int reg, int val)
+{
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   int ret;
+
+   mutex_lock(>smi_mutex);
+   ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
mutex_unlock(>smi_mutex);
+
return ret;
  }

[PATCH] Drivers: isdn: hisax: isac.c: Fix assignment and check into one expression.

2016-03-25 Thread Cosmin-Gabriel Samoila

Fix variable assignment inside if statement. It is error-prone and hard to read.

Signed-off-by: Cosmin-Gabriel Samoila 
---
 drivers/isdn/hisax/isac.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/isdn/hisax/isac.c b/drivers/isdn/hisax/isac.c
index 7fdf78f..df7e05c 100644
--- a/drivers/isdn/hisax/isac.c
+++ b/drivers/isdn/hisax/isac.c
@@ -215,9 +215,11 @@ isac_interrupt(struct IsdnCardState *cs, u_char val)
if (count == 0)
count = 32;
isac_empty_fifo(cs, count);
-   if ((count = cs->rcvidx) > 0) {
+   count = cs->rcvidx;
+   if (count > 0) {
cs->rcvidx = 0;
-   if (!(skb = alloc_skb(count, GFP_ATOMIC)))
+   skb = alloc_skb(count, GFP_ATOMIC);
+   if (!skb)
printk(KERN_WARNING "HiSax: D receive 
out of memory\n");
else {
memcpy(skb_put(skb, count), cs->rcvbuf, 
count);
@@ -251,7 +253,8 @@ isac_interrupt(struct IsdnCardState *cs, u_char val)
cs->tx_skb = NULL;
}
}
-   if ((cs->tx_skb = skb_dequeue(>sq))) {
+   cs->tx_skb = skb_dequeue(>sq);
+   if (cs->tx_skb) {
cs->tx_cnt = 0;
isac_fill_fifo(cs);
} else
@@ -313,7 +316,8 @@ afterXPR:
 #if ARCOFI_USE
if (v1 & 0x08) {
if (!cs->dc.isac.mon_rx) {
-   if (!(cs->dc.isac.mon_rx = 
kmalloc(MAX_MON_FRAME, GFP_ATOMIC))) {
+   cs->dc.isac.mon_rx = 
kmalloc(MAX_MON_FRAME, GFP_ATOMIC);
+   if (!cs->dc.isac.mon_rx) {
if (cs->debug & L1_DEB_WARN)
debugl1(cs, "ISAC MON 
RX out of memory!");
cs->dc.isac.mocr &= 0xf0;
@@ -343,7 +347,8 @@ afterXPR:
afterMONR0:
if (v1 & 0x80) {
if (!cs->dc.isac.mon_rx) {
-   if (!(cs->dc.isac.mon_rx = 
kmalloc(MAX_MON_FRAME, GFP_ATOMIC))) {
+   cs->dc.isac.mon_rx = 
kmalloc(MAX_MON_FRAME, GFP_ATOMIC);
+   if (!cs->dc.isac.mon_rx) {
if (cs->debug & L1_DEB_WARN)
debugl1(cs, "ISAC MON 
RX out of memory!");
cs->dc.isac.mocr &= 0x0f;
-- 
1.9.1

Re: [PATCH] ipv6: Fix the pmtu path for connected UDP socket

2016-03-25 Thread Martin KaFai Lau

On Fri, Mar 25, 2016 at 04:55:27PM -0700, Martin KaFai Lau wrote:
>  void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>  {
> + struct dst_entry *odst;
> +
> + odst = sk_dst_get(sk);
> +
>   ip6_update_pmtu(skb, sock_net(sk), mtu,
>   sk->sk_bound_dev_if, sk->sk_mark);
> +
> + if (odst && !odst->error &&
> + !ip6_dst_check(odst, inet6_sk(sk)->dst_cookie)) {
> + struct dst_entry *ndst;
> + struct flowi6 fl6;
> +
> + build_skb_flow_key(, skb, sock_net(sk),
> +sk->sk_bound_dev_if, sk->sk_mark);
> + ndst = ip6_route_output(sock_net(sk), NULL, );
> + if (!ndst->error)
> + ip6_dst_store(sk, ndst, NULL, NULL);
oops...missed:
else
dst_release(ndst);
> + }
> +
> + dst_release(odst);
>  }
>  EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
>
> --
> 2.5.1

[PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}

2016-03-25 Thread Patrick Uiterwijk

Add versions of the phy_page_read and _write functions to
be used in a context where the SMI mutex is held.

Signed-off-by: Patrick Uiterwijk 
---
 drivers/net/dsa/mv88e6xxx.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index fa086e0..13db5d8 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool 
ppu_active)
return 0;
 }
 
-int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
+static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
+   int reg)
 {
-   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int ret;
 
-   mutex_lock(>smi_mutex);
ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
if (ret < 0)
-   goto error;
+   goto clear;
ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
-error:
+clear:
_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-   mutex_unlock(>smi_mutex);
+
return ret;
 }
 
-int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
-int reg, int val)
+int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int ret;
 
mutex_lock(>smi_mutex);
+   ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
+   mutex_unlock(>smi_mutex);
+
+   return ret;
+}
+
+static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+int reg, int val)
+{
+   int ret;
+
ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
if (ret < 0)
-   goto error;
+   goto clear;
 
ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
-error:
+clear:
_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+   return ret;
+}
+
+int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+int reg, int val)
+{
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   int ret;
+
+   mutex_lock(>smi_mutex);
+   ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
mutex_unlock(>smi_mutex);
+
return ret;
 }
 
-- 
2.7.4

[PATCH 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup

2016-03-25 Thread Patrick Uiterwijk

Some of the vendor-specific bootloaders set up this part
of the initialization for us, so this was never added.
However, since upstream bootloaders don't initialize the
chip specifically, they leave the fiber MII's PDOWN flag
set, which means that the CPU port doesn't connect.

This patch checks whether this flag has been clear prior
by something else, and if not make us clear it.

Signed-off-by: Patrick Uiterwijk 
---
 drivers/net/dsa/mv88e6xxx.c | 99 +++--
 drivers/net/dsa/mv88e6xxx.h |  8 
 2 files changed, 76 insertions(+), 31 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 13db5d8..05b2efb 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2264,6 +2264,57 @@ static void mv88e6xxx_bridge_work(struct work_struct 
*work)
mutex_unlock(>smi_mutex);
 }
 
+static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+int reg, int val)
+{
+   int ret;
+
+   ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
+   if (ret < 0)
+   goto clear;
+
+   ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
+clear:
+   _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+   return ret;
+}
+
+static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
+   int reg)
+{
+   int ret;
+
+   ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
+   if (ret < 0)
+   goto clear;
+
+   ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
+clear:
+   _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+   return ret;
+}
+
+static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds)
+{
+   int ret;
+
+   ret = _mv88e6xxx_phy_page_read(ds, REG_FIBER_SERDES, PAGE_FIBER_SERDES,
+  MII_BMCR);
+   if (ret < 0)
+   return ret;
+
+   if (ret & BMCR_PDOWN) {
+   ret = ret & ~BMCR_PDOWN;
+   ret = _mv88e6xxx_phy_page_write(ds, REG_FIBER_SERDES,
+   PAGE_FIBER_SERDES, MII_BMCR,
+   ret);
+   }
+
+   return ret;
+}
+
 static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
@@ -2367,6 +2418,23 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, 
int port)
goto abort;
}
 
+   /* If this port is connected to a SerDes, make sure the SerDes is not
+* powered down.
+*/
+   if (mv88e6xxx_6352_family(ds)) {
+   ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_STATUS);
+   if (ret < 0)
+   goto abort;
+   ret &= PORT_STATUS_CMODE_MASK;
+   if ((ret == PORT_STATUS_CMODE_100BASE_X) ||
+   (ret == PORT_STATUS_CMODE_1000BASE_X) ||
+   (ret == PORT_STATUS_CMODE_SGMII)) {
+   ret = mv88e6xxx_power_on_serdes(ds);
+   if (ret < 0)
+   goto abort;
+   }
+   }
+
/* Port Control 2: don't force a good FCS, set the maximum frame size to
 * 10240 bytes, disable 802.1q tags checking, don't discard tagged or
 * untagged frames on this port, do a destination address lookup on all
@@ -2708,21 +2776,6 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool 
ppu_active)
return 0;
 }
 
-static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
-   int reg)
-{
-   int ret;
-
-   ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
-   if (ret < 0)
-   goto clear;
-   ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
-clear:
-   _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-
-   return ret;
-}
-
 int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
@@ -2735,22 +2788,6 @@ int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int 
port, int page, int reg)
return ret;
 }
 
-static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
-int reg, int val)
-{
-   int ret;
-
-   ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
-   if (ret < 0)
-   goto clear;
-
-   ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
-clear:
-   _mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-
-   return ret;
-}
-
 int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
 int reg, int val)
 {
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index 9a038ab..26a424a 100644

Re: [RFC net-next 1/2] net: add SOCK_RCU_FREE socket flag

2016-03-25 Thread Tom Herbert

On Fri, Mar 25, 2016 at 3:29 PM, Eric Dumazet  wrote:
> We want a generic way to insert an RCU grace period before socket
> freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
> much overhead.
>
> SLAB_DESTROY_BY_RCU strict rules force us to take a reference
> on the socket sk_refcnt, and it is a performance problem for UDP
> encapsulation, or TCP synflood behavior, as many CPUs might
> attempt the atomic operations on a shared sk_refcnt
>
> UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
> lookup can use traditional RCU rules, without refcount changes.
> They can set the flag only once hashed and visible by other cpus.
>
> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> ---
>  include/net/sock.h |  2 ++
>  net/core/sock.c| 14 +-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 255d3e03727b..c88785a3e76c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -438,6 +438,7 @@ struct sock {
>   struct sk_buff *skb);
> void(*sk_destruct)(struct sock *sk);
> struct sock_reuseport __rcu *sk_reuseport_cb;
> +   struct rcu_head sk_rcu;
>  };
>
>  #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
> @@ -720,6 +721,7 @@ enum sock_flags {
>  */
> SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
> SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
> +   SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
> SOCK_TIMESTAMPING_RX_SOFTWARE))
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b67b9aedb230..238a94f879ca 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, 
> gfp_t priority,
>  }
>  EXPORT_SYMBOL(sk_alloc);
>
> -void sk_destruct(struct sock *sk)
> +/* Sockets having SOCK_RCU_FREE will call this function after one RCU
> + * grace period. This is the case for UDP sockets and TCP listeners.
> + */
> +static void __sk_destruct(struct rcu_head *head)
>  {
> +   struct sock *sk = container_of(head, struct sock, sk_rcu);
> struct sk_filter *filter;
>
> if (sk->sk_destruct)
> @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk)
> sk_prot_free(sk->sk_prot_creator, sk);
>  }
>
> +void sk_destruct(struct sock *sk)
> +{
> +   if (sock_flag(sk, SOCK_RCU_FREE))
> +   call_rcu(>sk_rcu, __sk_destruct);
> +   else
> +   __sk_destruct(>sk_rcu);
> +}
> +

Very nice!

>  static void __sk_free(struct sock *sk)
>  {
> if (unlikely(sock_diag_has_destroy_listeners(sk) && 
> sk->sk_net_refcnt))
> --
> 2.8.0.rc3.226.g39d4020
>

Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU

2016-03-25 Thread Tom Herbert

On Fri, Mar 25, 2016 at 3:29 PM, Eric Dumazet  wrote:
> Tom Herbert would like to avoid touching UDP socket refcnt for encapsulated
> traffic. For this to happen, we need to use normal RCU rules, with a grace
> period before freeing a socket. UDP sockets are not short lived in the
> high usage case, so the added cost of call_rcu() should not be a concern.
>
> This actually removes a lot of complexity in UDP stack
>
> Multicast receives no longer need to hold a bucket lock.
>
> Note that ip early demux still needs to take a reference on the socket.
>
> Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
> but this might be changed later.
>
> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> ---
>  include/linux/udp.h |   8 +-
>  include/net/sock.h  |  12 +--
>  include/net/udp.h   |   2 +-
>  net/ipv4/udp.c  | 290 
> +++-
>  net/ipv4/udp_diag.c |  18 ++--
>  net/ipv6/udp.c  | 194 +++
>  6 files changed, 162 insertions(+), 362 deletions(-)
>
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 87c094961bd5..32342754643a 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
> return udp_sk(sk)->no_check6_rx;
>  }
>
> -#define udp_portaddr_for_each_entry(__sk, node, list) \
> -   hlist_nulls_for_each_entry(__sk, node, list, 
> __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry(__sk, list) \
> +   hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
>
> -#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \
> -   hlist_nulls_for_each_entry_rcu(__sk, node, list, 
> __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry_rcu(__sk, list) \
> +   hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
>
>  #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index c88785a3e76c..5b9562bc478e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -178,7 +178,7 @@ struct sock_common {
> int skc_bound_dev_if;
> union {
> struct hlist_node   skc_bind_node;
> -   struct hlist_nulls_node skc_portaddr_node;
> +   struct hlist_node   skc_portaddr_node;
> };
> struct proto*skc_prot;
> possible_net_t  skc_net;
> @@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk,
> hlist_for_each_entry(__sk, list, sk_bind_node)
>
>  /**
> - * sk_nulls_for_each_entry_offset - iterate over a list at a given struct 
> offset
> + * sk_for_each_entry_offset - iterate over a list at a given struct offset
>   * @tpos:  the type * to use as a loop cursor.
>   * @pos:   the  hlist_node to use as a loop cursor.
>   * @head:  the head for your list.
>   * @offset:offset of hlist_node within the struct.
>   *
>   */
> -#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)  
>  \
> -   for (pos = (head)->first; 
>  \
> -(!is_a_nulls(pos)) &&
>  \
> +#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset)
>  \
> +   for (pos = rcu_dereference((head)->first);
>  \
> +pos != NULL &&   
>  \
> ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});  
>  \
> -pos = pos->next)
> +pos = rcu_dereference(pos->next))
>
>  static inline struct user_namespace *sk_user_ns(struct sock *sk)
>  {
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 92927f729ac8..d870ec1611c4 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -59,7 +59,7 @@ struct udp_skb_cb {
>   * @lock:  spinlock protecting changes to head/count
>   */
>  struct udp_hslot {
> -   struct hlist_nulls_head head;
> +   struct hlist_head   head;
> int count;
> spinlock_t  lock;
>  } __attribute__((aligned(2 * sizeof(long;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 08eed5e16df0..3ebca8445d35 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 
> num,
>unsigned int log)
>  {
> struct sock *sk2;
> -   struct hlist_nulls_node *node;
> kuid_t uid = sock_i_uid(sk);
>
> -   sk_nulls_for_each(sk2, node, >head) {
> +   sk_for_each(sk2, >head) {
> if (net_eq(sock_net(sk2), net) &&
> sk2 != sk &&
> (bitmap || udp_sk(sk2)->udp_port_hash == num) &&
>

Re: [PATCH] ipv6: Fix the pmtu path for connected UDP socket

2016-03-25 Thread Martin KaFai Lau

On Wed, Mar 23, 2016 at 04:57:22PM -0700, Wei Wang wrote:
> What about something like this:
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index ed44663..21b4102 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1394,6 +1394,19 @@ static void ip6_rt_update_pmtu(struct dst_entry
> *dst, struct sock *sk,
> __ip6_rt_update_pmtu(dst, sk, skb ? ipv6_hdr(skb) : NULL, mtu);
>  }
>
> +static void ip6_fill_in_flow(struct flowi6 *fl6,  struct net *net,
> +  struct sk_buff *skb, int oif, u32 mark)
> +{
> +   const struct ipv6hdr *iph = (struct ipv6hdr *) skb->data;
> +
> +   memset(fl6, 0, sizeof(fl6));
> +   fl6->flowi6_oif = oif;
> +   fl6->flowi6_mark = mark ? mark : IP6_REPLY_MARK(net, skb->mark);
> +   fl6->daddr = iph->daddr;
> +   fl6->saddr = iph->saddr;
> +   fl6->flowlabel = ip6_flowinfo(iph);
> +}
> +
>  void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
>  int oif, u32 mark)
>  {
> @@ -1401,13 +1414,7 @@ void ip6_update_pmtu(struct sk_buff *skb,
> struct net *net, __be32 mtu,
> struct dst_entry *dst;
> struct flowi6 fl6;
>
> -   memset(, 0, sizeof(fl6));
> -   fl6.flowi6_oif = oif;
> -   fl6.flowi6_mark = mark ? mark : IP6_REPLY_MARK(net, skb->mark);
> -   fl6.daddr = iph->daddr;
> -   fl6.saddr = iph->saddr;
> -   fl6.flowlabel = ip6_flowinfo(iph);
> -
> +   ip6_fill_in_flow(, net, skb, oif, mark);
> dst = ip6_route_output(net, NULL, );
> if (!dst->error)
> __ip6_rt_update_pmtu(dst, NULL, iph, ntohl(mtu));
> @@ -1417,8 +1424,22 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu);
>
>  void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>  {
> -   ip6_update_pmtu(skb, sock_net(sk), mtu,
> +   struct ipv6_pinfo *np = inet6_sk(sk);
> +   struct dst_entry *dst_new;
> +   struct flowi6 fl6;
> +   struct net *net = sock_net(sk);
> +
> +   ip6_update_pmtu(skb, net, mtu,
> +   sk->sk_bound_dev_if, sk->sk_mark);
> +
> +   if (sk->sk_state == TCP_ESTABLISHED &&
> +   !sk_dst_check(sk, np->dst_cookie)) {
> +   ip6_fill_in_flow(, net, skb,
> sk->sk_bound_dev_if, sk->sk_mark);
> +   dst_new = ip6_route_output(net, NULL, );
> +   if (!IS_ERR(dst_new))
> +   ip6_dst_store(sk, dst_new, NULL, NULL);
> +   }
>  }
>  EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);

Here is the change I come up with (based on your flow key buildup function
with one fix), take a little different approach in ip6_sk_update_pmtu()
and some explanation on ip6_dst_store() after looking at ip6_rt_check()

I have not included the sk lock yet.  There is still a few things I don't fully
understand in another discussion thread.

[PATCH] ipv6: Update sk->sk_dst_cache in ip6_sk_update_pmtu()

There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
   logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone which
   contains the MTU value instructed by the ICMPV6_PKT_TOOBIG message.

One observation is that, the __udp6_lib_err() code path does not do a
sk_dst_check() and route relookup (if needed) after doing the pmtu
update (while TCP does).

This patch does a ip6_dst_check() in ip6_sk_update_pmtu() which is used
by UDP to do the pmtu update and update sk->sk_dst_cache if it is needed.

ip6_dst_store(sk, ndst, NULL, NULL) is used.  NULL(s) are passed to
the daddr and saddr parameters for simplicity reason.  The current
use case is only for the UDP lookup.  The details is in ip6_rt_check().

At the point where ip6_dst_store(sk, ndst, NULL, NULL) is called in
this patch, ndst should be a RTF_CACHE clone (which implies
((struct rt6i_info *)ndst)->rt6i_dst.plen == 128) and ip6_rt_check()
will pass without even considering np->d/saddr_cache.

Even it has to use np->d/saddr_cache, the worst case is ip6_rt_check()
fails and causes one more route lookup.

Test:

Server (Connected UDP Socket):
~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0  proto kernel  metric 256  pref medium
2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium

A crappy python code to create a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()

Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2

Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR

2016-03-25 Thread Phil Sutter

On Fri, Mar 25, 2016 at 11:53:25PM +0100, Francois Romieu wrote:
> Phil Sutter  :
> [...]
> > Your patch submissions are getting better, also good to see you're
> > finally using git-send-email. A few things need to be corrected though:
> > 
> 
> #define BMCR_RESV   0x003f  /* Unused...   */
> #define BMCR_SPEED1000  0x0040  /* MSB of Speed (1000) */
> #define BMCR_CTST   0x0080  /* Collision test  */
> #define BMCR_FULLDPLX   0x0100  /* Full duplex */
> #define BMCR_ANRESTART  0x0200  /* Auto negotiation restart*/
> #define BMCR_ISOLATE0x0400  /* Isolate data paths from MII */
> #define BMCR_PDOWN  0x0800  /* Enable low power state  */
> #define BMCR_ANENABLE   0x1000  /* Enable auto negotiation */
> #define BMCR_SPEED100   0x2000  /* Select 100Mbps  */
> #define BMCR_LOOPBACK   0x4000  /* TXD loopback bits   */
> 
> BMCR_SPEED100 apart, *all* these bits are now set.
> 
> It does not make much sense.

I presumed that already, but didn't care to check myself so instead
ignored the actual code change. Thanks for pointing out the futility of
the whole thing. :)

> > Also detailed instructions on how to trigger the problem you are fixing
> > for would be good. In detail: Which specific hardware was used, in which
> > situation did the problem occur, how did it behave in that situation and
> > what was the expected behaviour?
> 
> Been there. Such requests are usually left unanswered. :o(

That's my impression from following the (somewhat amusing) former
threads, too. I was merely impressed by the sheer quality of this patch
in comparison to previous ones. Speaking of which, the presented
tenacity certainly earns some respect.

Cheers, Phil

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Ben Greear


On 03/25/2016 04:03 PM, Vijay Pandurangan wrote:

On Fri, Mar 25, 2016 at 6:23 PM, Ben Greear  wrote:

On 03/25/2016 02:59 PM, Vijay Pandurangan wrote:


consider two scenarios, where process a sends raw ethernet frames
containing UDP packets to b

I) process a --> veth --> process b

II) process a -> eth -> wire -> eth -> process b

I believe (I) is the simplest setup we can create that will replicate this
bug.

If process a sends frames that contain UDP packets to process b, what
is the behaviour we want if the UDP packet *has an incorrect
checksum*?

It seems to me that I and II should have identical behaviour, and I
would think that (II) would not deliver the packets to the
application.

In (I) with Cong's patch would we be delivering corrupt UDP packets to
process b despite an incorrect checksum in (I)?

If so, I would argue that this patch isn't right.



Checksums are normally used to deal with flaky transport mechanisms,
and once a machine receives the frame, we do not keep re-calculating
checksums
as we move it through various drivers and subsystems.

In particular, checksums are NOT a security mechanism and can be easily
faked.

Since packets sent on one veth never actually hit any unreliable transport
before they are received on the peer veth, then there should be no need to
checksum packets whose origin is known to be on the local machine.


That's a good argument.  I'm trying to figure out how to reconcile
your thoughts with the argument that virtual ethernet devices are an
abstraction that should behave identically to perfectly-functional
physical ethernet devices when connected with a wire.

In my view, the invariant must be identical functionality, and if I
were writing a regression test for this system, that's what I would
test. I think optimizations for eliding checksums should be
implemented only if they don't alter this functionality.

There must be a way to structure / write this code so that we can
optimize veths without causing different behaviour ...


A real NIC can either do hardware checksums, or it cannot.  If it
cannot, then the host must do it on the CPU for both transmit and
receive.

Veth is not a real NIC, and it cannot do hardware checksum offloading.

So, we either lie and pretend it does, or we eat massive amounts
of CPU usage to calculate and check checksums when sending across
a veth pair.


Any frame sent from a socket can be considered to be a local packet in my
opinion.


I'm not sure that's totally right. Your bridge is adding a delay to
your packets; it could just as easily be simulating corruption by
corrupting 5% of packets going through it. If this change allows
corrupt packets to be delivered to an application when they could not
be delivered if the packets were routed via physical eths, I think
that is a bug.


I actually do support corrupting the frame, but what I normally do is corrupt 
the contents
of the packet, and then recalculate the IP checksum (and TCP if it applies)
and send it on its way.  The receiving NIC and stack will pass the frame up to
the application since the checksums match, and it would be up the application
to deal with it.  So, I can easily cause an application to receive corrupted
frames over physical eths.

I can also corrupt without updating the checksums in case you want to
test another systems NIC and/or stack.

But, if I am purposely corrupting a frame destined for veth, then the only 
reason
I would want the stack to check the checksums is if I were testing my own
stack's checksum logic, and that seems to be a pretty limited use.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: [RFC PATCH] tcp: Add SOF_TIMESTAMPING_TX_EOR and allow MSG_EOR in tcp_sendmsg

2016-03-25 Thread Yuchung Cheng

On Thu, Mar 24, 2016 at 6:39 PM, Willem de Bruijn
 wrote:
>
> > This patch allows the user process to use MSG_EOR during
> > tcp_sendmsg to tell the kernel that it is the last byte
> > of an application response message.
> >
> > The user process can use the new SOF_TIMESTAMPING_TX_EOR to
> > ask the kernel to only track timestamp of the MSG_EOR byte.
>
> Selective timestamp requests is a useful addition. Soheil (cc-ed) also
> happens to be looking at this. That approach uses cmsg to selectively
> tag send calls, avoiding the need to define a new SOF_ flag.
>
> > The current SOF_TIMESTAMPING_TX_ACK is tracking the last
> > byte appended to a skb during the tcp_sendmsg.  It may track
> > multiple bytes if the response spans across multiple skbs.
>
> It only tracks the last byte of the buffer passed in sendmsg. If a
> sendmsg results in multiple skbuffs, only the last skb is tracked. It
> is, however, possible that that skbuff is appended to in a follow-on
> sendmsg call. If multiple calls enable recording on an skbuff, only
> the last record request on an skb is kept.
>
> > it is enough to measure the response latency for application
> > protocol with a single request/response at a time (like HTTP 1.1
> > without pipeline), it does not work well for application protocol
> > with >1 pipeline responses (like HTTP2).
Looks like an interesting and useful patch. Since HTTP2 allows
multiplexing data stream frames from multiple logical streams on a
single socket,
how would you instrument to measure the latency of each stream? e.g.,

sendmsg of data_frame_1_of_stream_a
sendmsg of data_frame_1_of_stream_b
sendmsg of data_frame_2_of_stream_a
sendmsg of data_frame_1_of_stream_c
sendmsg of data_frame_2_of_stream_b


> >
> > Each skb can only track one tskey (which is the seq number of
> > the last byte of the message).   To allow tracking the
> > last byte of multiple response messages, this patch takes an approach
> > by not appending to the last skb during tcp_sendmsg if the last skb's
> > tskey will be overwritten.  A similar case also happens during retransmit.
> >
> > This approach avoids introducing another list to track the tskey.  The
> > downside is that it will have less gso benefit and/or more outgoing
> > packets.  Practically, due to the amount of measurement data generated,
> > sampling is usually used in production. (i.e. not every connection is
> > tracked).
>
> Agreed. This is the simplest approach to avoiding timestamp request
> overwrites. A list-based approach quickly becomes complex as skbuffs
> can be split and merged at various points.
>
> Since this use is rare, I would suggest making the code even simpler
> by just jumping to new_segment on a call with this MSG option (or
> cmsg) set, avoiding tcp_tx_ts_noappend_skb() on each new segment.
+1

>
> > One of our use case is at the webserver.  The webserver tracks
> > the HTTP2 response latency by measuring when the webserver
> > sends the first byte to the socket till the TCP ACK of the last byte
> > is received.  In the cases where we don't have client side
> > measurement, measuring from the server side is the only option.
> > In the cases we have the client side measurement, the server side
> > data can also be used to justify/cross-check-with the client
> > side data (e.g. is there slowness at the layer above the client's
> > TCP stack).
> >
> > The TCP PRR paper [1] also measures a similar metrics:
> > "The TCP latency of a HTTP response when the server sends the first
> >  byte until it receives the acknowledgment (ACK) for the last byte."
> >
> > [1] Proportional Rate Reduction for TCP:
> > http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37486.pdf
> >
> > Signed-off-by: Martin KaFai Lau 
> > Cc: Eric Dumazet 
> > Cc: Neal Cardwell 
> > Cc: Willem de Bruijn 
> > Cc: Yuchung Cheng 
> > ---
> >  include/uapi/linux/net_tstamp.h |  3 ++-
> >  net/ipv4/tcp.c  | 23 ++-
> >  net/ipv4/tcp_output.c   |  9 +++--
> >  3 files changed, 27 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/uapi/linux/net_tstamp.h 
> > b/include/uapi/linux/net_tstamp.h
> > index 6d1abea..5376569 100644
> > --- a/include/uapi/linux/net_tstamp.h
> > +++ b/include/uapi/linux/net_tstamp.h
> > @@ -25,8 +25,9 @@ enum {
> > SOF_TIMESTAMPING_TX_ACK = (1<<9),
> > SOF_TIMESTAMPING_OPT_CMSG = (1<<10),
> > SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
> > +   SOF_TIMESTAMPING_TX_EOR = (1<<12),
> >
> > -   SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY,
> > +   SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_TX_EOR,
> > SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
> >  SOF_TIMESTAMPING_LAST
> >  };
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 08b8b96..7de96eb 100644
> > ---

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Vijay Pandurangan

On Fri, Mar 25, 2016 at 6:23 PM, Ben Greear  wrote:
> On 03/25/2016 02:59 PM, Vijay Pandurangan wrote:
>>
>> consider two scenarios, where process a sends raw ethernet frames
>> containing UDP packets to b
>>
>> I) process a --> veth --> process b
>>
>> II) process a -> eth -> wire -> eth -> process b
>>
>> I believe (I) is the simplest setup we can create that will replicate this
>> bug.
>>
>> If process a sends frames that contain UDP packets to process b, what
>> is the behaviour we want if the UDP packet *has an incorrect
>> checksum*?
>>
>> It seems to me that I and II should have identical behaviour, and I
>> would think that (II) would not deliver the packets to the
>> application.
>>
>> In (I) with Cong's patch would we be delivering corrupt UDP packets to
>> process b despite an incorrect checksum in (I)?
>>
>> If so, I would argue that this patch isn't right.
>
>
> Checksums are normally used to deal with flaky transport mechanisms,
> and once a machine receives the frame, we do not keep re-calculating
> checksums
> as we move it through various drivers and subsystems.
>
> In particular, checksums are NOT a security mechanism and can be easily
> faked.
>
> Since packets sent on one veth never actually hit any unreliable transport
> before they are received on the peer veth, then there should be no need to
> checksum packets whose origin is known to be on the local machine.

That's a good argument.  I'm trying to figure out how to reconcile
your thoughts with the argument that virtual ethernet devices are an
abstraction that should behave identically to perfectly-functional
physical ethernet devices when connected with a wire.

In my view, the invariant must be identical functionality, and if I
were writing a regression test for this system, that's what I would
test. I think optimizations for eliding checksums should be
implemented only if they don't alter this functionality.

There must be a way to structure / write this code so that we can
optimize veths without causing different behaviour ...

>
> Any frame sent from a socket can be considered to be a local packet in my
> opinion.

I'm not sure that's totally right. Your bridge is adding a delay to
your packets; it could just as easily be simulating corruption by
corrupting 5% of packets going through it. If this change allows
corrupt packets to be delivered to an application when they could not
be delivered if the packets were routed via physical eths, I think
that is a bug.

>
> That is what Cong's patch does as far as I can tell.
>
>
> Thanks,
> Ben
>
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>

Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR

2016-03-25 Thread Francois Romieu

Phil Sutter  :
[...]
> Your patch submissions are getting better, also good to see you're
> finally using git-send-email. A few things need to be corrected though:
> 

#define BMCR_RESV   0x003f  /* Unused...   */
#define BMCR_SPEED1000  0x0040  /* MSB of Speed (1000) */
#define BMCR_CTST   0x0080  /* Collision test  */
#define BMCR_FULLDPLX   0x0100  /* Full duplex */
#define BMCR_ANRESTART  0x0200  /* Auto negotiation restart*/
#define BMCR_ISOLATE0x0400  /* Isolate data paths from MII */
#define BMCR_PDOWN  0x0800  /* Enable low power state  */
#define BMCR_ANENABLE   0x1000  /* Enable auto negotiation */
#define BMCR_SPEED100   0x2000  /* Select 100Mbps  */
#define BMCR_LOOPBACK   0x4000  /* TXD loopback bits   */

BMCR_SPEED100 apart, *all* these bits are now set.

It does not make much sense.

> Also detailed instructions on how to trigger the problem you are fixing
> for would be good. In detail: Which specific hardware was used, in which
> situation did the problem occur, how did it behave in that situation and
> what was the expected behaviour?

Been there. Such requests are usually left unanswered. :o(

Btw, this stuff targets 3.16 (...) and net-next is still closed.

-- 
Ueimor

Re: [PATCH RFC 3/9] net: Add fast receive encapsulation

2016-03-25 Thread Joe Perches

On Fri, 2016-03-25 at 16:40 -0400, David Miller wrote:
> From: Tom Herbert 
> Date: Wed, 23 Mar 2016 15:36:52 -0700
> 
> > +{
> > + struct udp_sock *up = udp_sk(sk);
> > + int is_udplite = IS_UDPLITE(sk);
> > +
> > + int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
> > +
> 
> Small nit, please put this encap_rcv function pointer declaration at
> the top of the local variable list.

It might also be nice to remove the equivalent typedef and
use the same form in udp_tunnel.h
---
 include/net/udp_tunnel.h | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index b831140..71885b1 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -62,15 +62,12 @@ static inline int udp_sock_create(struct net *net,
    return -EPFNOSUPPORT;
 }
 
-typedef int (*udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
-typedef void (*udp_tunnel_encap_destroy_t)(struct sock *sk);
-
 struct udp_tunnel_sock_cfg {
    void *sk_user_data; /* user data used by encap_rcv call back */
    /* Used for setting up udp_sock fields, see udp.h for details */
    __u8  encap_type;
-   udp_tunnel_encap_rcv_t encap_rcv;
-   udp_tunnel_encap_destroy_t encap_destroy;
+   int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
+   void (*encap_destroy)(struct sock *sk);
 };
 
 /* Setup the given (UDP) sock to receive UDP encapsulated packets */

[RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU

2016-03-25 Thread Eric Dumazet

Tom Herbert would like to avoid touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a grace
period before freeing a socket. UDP sockets are not short lived in the
high usage case, so the added cost of call_rcu() should not be a concern.

This actually removes a lot of complexity in UDP stack

Multicast receives no longer need to hold a bucket lock.

Note that ip early demux still needs to take a reference on the socket.

Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
but this might be changed later.

Signed-off-by: Eric Dumazet 
Cc: Tom Herbert 
---
 include/linux/udp.h |   8 +-
 include/net/sock.h  |  12 +--
 include/net/udp.h   |   2 +-
 net/ipv4/udp.c  | 290 +++-
 net/ipv4/udp_diag.c |  18 ++--
 net/ipv6/udp.c  | 194 +++
 6 files changed, 162 insertions(+), 362 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 87c094961bd5..32342754643a 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
return udp_sk(sk)->no_check6_rx;
 }
 
-#define udp_portaddr_for_each_entry(__sk, node, list) \
-   hlist_nulls_for_each_entry(__sk, node, list, 
__sk_common.skc_portaddr_node)
+#define udp_portaddr_for_each_entry(__sk, list) \
+   hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
-#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \
-   hlist_nulls_for_each_entry_rcu(__sk, node, list, 
__sk_common.skc_portaddr_node)
+#define udp_portaddr_for_each_entry_rcu(__sk, list) \
+   hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
 
 #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag)
 
diff --git a/include/net/sock.h b/include/net/sock.h
index c88785a3e76c..5b9562bc478e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -178,7 +178,7 @@ struct sock_common {
int skc_bound_dev_if;
union {
struct hlist_node   skc_bind_node;
-   struct hlist_nulls_node skc_portaddr_node;
+   struct hlist_node   skc_portaddr_node;
};
struct proto*skc_prot;
possible_net_t  skc_net;
@@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk,
hlist_for_each_entry(__sk, list, sk_bind_node)
 
 /**
- * sk_nulls_for_each_entry_offset - iterate over a list at a given struct 
offset
+ * sk_for_each_entry_offset - iterate over a list at a given struct offset
  * @tpos:  the type * to use as a loop cursor.
  * @pos:   the  hlist_node to use as a loop cursor.
  * @head:  the head for your list.
  * @offset:offset of hlist_node within the struct.
  *
  */
-#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)
   \
-   for (pos = (head)->first;  \
-(!is_a_nulls(pos)) && \
+#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset) \
+   for (pos = rcu_dereference((head)->first); \
+pos != NULL &&\
({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});   \
-pos = pos->next)
+pos = rcu_dereference(pos->next))
 
 static inline struct user_namespace *sk_user_ns(struct sock *sk)
 {
diff --git a/include/net/udp.h b/include/net/udp.h
index 92927f729ac8..d870ec1611c4 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -59,7 +59,7 @@ struct udp_skb_cb {
  * @lock:  spinlock protecting changes to head/count
  */
 struct udp_hslot {
-   struct hlist_nulls_head head;
+   struct hlist_head   head;
int count;
spinlock_t  lock;
 } __attribute__((aligned(2 * sizeof(long;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e16df0..3ebca8445d35 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
   unsigned int log)
 {
struct sock *sk2;
-   struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
 
-   sk_nulls_for_each(sk2, node, >head) {
+   sk_for_each(sk2, >head) {
if (net_eq(sock_net(sk2), net) &&
sk2 != sk &&
(bitmap || udp_sk(sk2)->udp_port_hash == num) &&
@@ -177,12 +176,11 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 
num,
  bool match_wildcard))
 {
struct sock *sk2;
-   struct hlist_nulls_node *node;
kuid_t uid = sock_i_uid(sk);
int res = 0;
 
spin_lock(>lock);
-

[RFC net-next 1/2] net: add SOCK_RCU_FREE socket flag

2016-03-25 Thread Eric Dumazet

We want a generic way to insert an RCU grace period before socket
freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
much overhead.

SLAB_DESTROY_BY_RCU strict rules force us to take a reference
on the socket sk_refcnt, and it is a performance problem for UDP
encapsulation, or TCP synflood behavior, as many CPUs might
attempt the atomic operations on a shared sk_refcnt

UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
lookup can use traditional RCU rules, without refcount changes.
They can set the flag only once hashed and visible by other cpus.

Signed-off-by: Eric Dumazet 
Cc: Tom Herbert 
---
 include/net/sock.h |  2 ++
 net/core/sock.c| 14 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b..c88785a3e76c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -438,6 +438,7 @@ struct sock {
  struct sk_buff *skb);
void(*sk_destruct)(struct sock *sk);
struct sock_reuseport __rcu *sk_reuseport_cb;
+   struct rcu_head sk_rcu;
 };
 
 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
@@ -720,6 +721,7 @@ enum sock_flags {
 */
SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
+   SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230..238a94f879ca 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t 
priority,
 }
 EXPORT_SYMBOL(sk_alloc);
 
-void sk_destruct(struct sock *sk)
+/* Sockets having SOCK_RCU_FREE will call this function after one RCU
+ * grace period. This is the case for UDP sockets and TCP listeners.
+ */
+static void __sk_destruct(struct rcu_head *head)
 {
+   struct sock *sk = container_of(head, struct sock, sk_rcu);
struct sk_filter *filter;
 
if (sk->sk_destruct)
@@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk)
sk_prot_free(sk->sk_prot_creator, sk);
 }
 
+void sk_destruct(struct sock *sk)
+{
+   if (sock_flag(sk, SOCK_RCU_FREE))
+   call_rcu(>sk_rcu, __sk_destruct);
+   else
+   __sk_destruct(>sk_rcu);
+}
+
 static void __sk_free(struct sock *sk)
 {
if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt))
-- 
2.8.0.rc3.226.g39d4020

[RFC net-next 0/2] udp: use standard RCU rules

2016-03-25 Thread Eric Dumazet

Add a generic facility for sockets to be freed afer an RCU grace period.

Then UDP is changed to no longer use SLAB_DESTROY_BY_RCU,
in order to speedup rx processing for traffic encapsulated in UDP.

I prepared a patch to convert TCP listeners to this infrastructure,
but will post it later, since Tom was mostly interested in UDP.

Eric Dumazet (2):
  net: add SOCK_RCU_FREE socket flag
  udp: No longer use SLAB_DESTROY_BY_RCU

 include/linux/udp.h |   8 +-
 include/net/sock.h  |  14 +--
 include/net/udp.h   |   2 +-
 net/core/sock.c |  14 ++-
 net/ipv4/udp.c  | 290 +++-
 net/ipv4/udp_diag.c |  18 ++--
 net/ipv6/udp.c  | 194 +++
 7 files changed, 177 insertions(+), 363 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

Re: [net-next RFC 0/4] SO_BINDTOSUBNET

2016-03-25 Thread Gilberto

On 03/25/2016 12:25 AM, Tom Herbert wrote:
> On Wed, Mar 16, 2016 at 6:19 AM, Gilberto Bertin
>  wrote:
>> This is my second attempt to submit an RFC for this patch.
>>
>> Some arguments for and against it since the first submission:
>> * SO_BINDTOSUBNET is an arbitrary option and can be seens as nother use
>> * case of the SO_REUSEPORT BPF patch
>> * but at the same time using BPF requires more work/code on the server
>>   and since the bind to subnet use case could potentially become a
>>   common one maybe there is some value in having it as an option instead
>>   of having to code (either manually or with clang) an eBPF program that
>>   would do the same
> 
> Gilberto, I'm not sure I understand this argument. Have you
> implemented the BPF bind solution?
> 
> Thanks,
> Tom

Yes, I wrote up a very basic draft for this feature (I didn't know there
was already some work going on with SO_ATTACH_REUSEPORT_[CE]BPF).

Thanks,
Gilberto

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Ben Greear


On 03/25/2016 02:59 PM, Vijay Pandurangan wrote:

consider two scenarios, where process a sends raw ethernet frames
containing UDP packets to b

I) process a --> veth --> process b

II) process a -> eth -> wire -> eth -> process b

I believe (I) is the simplest setup we can create that will replicate this bug.

If process a sends frames that contain UDP packets to process b, what
is the behaviour we want if the UDP packet *has an incorrect
checksum*?

It seems to me that I and II should have identical behaviour, and I
would think that (II) would not deliver the packets to the
application.

In (I) with Cong's patch would we be delivering corrupt UDP packets to
process b despite an incorrect checksum in (I)?

If so, I would argue that this patch isn't right.


Checksums are normally used to deal with flaky transport mechanisms,
and once a machine receives the frame, we do not keep re-calculating checksums
as we move it through various drivers and subsystems.

In particular, checksums are NOT a security mechanism and can be easily faked.

Since packets sent on one veth never actually hit any unreliable transport
before they are received on the peer veth, then there should be no need to
checksum packets whose origin is known to be on the local machine.

Any frame sent from a socket can be considered to be a local packet in my
opinion.

That is what Cong's patch does as far as I can tell.

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Cong Wang

On Fri, Mar 25, 2016 at 2:59 PM, Vijay Pandurangan  wrote:
> consider two scenarios, where process a sends raw ethernet frames
> containing UDP packets to b
>
> I) process a --> veth --> process b
>
> II) process a -> eth -> wire -> eth -> process b
>
> I believe (I) is the simplest setup we can create that will replicate this 
> bug.
>
> If process a sends frames that contain UDP packets to process b, what
> is the behaviour we want if the UDP packet *has an incorrect
> checksum*?
>
> It seems to me that I and II should have identical behaviour, and I
> would think that (II) would not deliver the packets to the
> application.
>
> In (I) with Cong's patch would we be delivering corrupt UDP packets to
> process b despite an incorrect checksum in (I)?
>

Right, I thought packet socket does the checksum by itself, so the
problem is: if user-space does the checksum like packet socket, its
checksum could be wrong therefore we can not trust it on RX path
once it loops back.

Let me think about it again.

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Cong Wang

On Fri, Mar 25, 2016 at 1:56 PM, Ben Greear  wrote:
> On 03/24/2016 10:33 PM, Cong Wang wrote:
>
>> Here we go:
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 1ecfa71..ab66080 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket
>> *sock, struct msghdr *msg,
>>  goto out_unlock;
>>  }
>>
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>  skb->protocol = proto;
>>  skb->dev = dev;
>>  skb->priority = sk->sk_priority;
>> @@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock
>> *po, struct sk_buff *skb,
>>
>>  ph.raw = frame;
>>
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>  skb->protocol = proto;
>>  skb->dev = dev;
>>  skb->priority = po->sk.sk_priority;
>> @@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct
>> sock *sk, size_t prepad,
>>  skb_put(skb, linear);
>>  skb->data_len = len - linear;
>>  skb->len += len - linear;
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>
>>  return skb;
>>   }
>
>
> I have tested UDP, TCP, TCPv6 and custom Ethernet frames across a veth pair.
>
> And, UDP, TCP, and pktgen across a pair of veth pairs
> bridged by my user-space packet filter.
>
> All of these tests work fine with your patch as far as I can tell.
>
> So, you can add:
>
> Tested-by: Ben Greear 

Thanks for testing! I will send it out formally after I audit more drivers,
also let's give Tom some time to response.

>
> That said, it could easily break some drivers and/or other scenarios that I
> have not tested, so at the least it should cook a while upstream before
> going into the
> stable tree

Yeah.

Thanks.

[PATCH net] inet: add proper locking in __inet{6}_lookup()

2016-03-25 Thread Eric Dumazet

From: Eric Dumazet 

Blocking BH in __inet{6}_lookup() is not correct, as the lookups
are done using RCU protection.

It matters only starting from Lorenzo Colitti patches to destroy
a TCP socket, since rcu_read_lock() is already held by other users
of these functions.

This can be backported to all known stable versions, since TCP got RCU
lookups back in 2.6.29 : Even if iproute2 contained no code to trigger
the bug, some user programs could have used the API.

Signed-off-by: Eric Dumazet 
Cc: Lorenzo Colitti 
---
 include/net/inet_hashtables.h |5 ++---
 net/ipv6/inet6_hashtables.c   |5 ++---
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 50f635c2c536..4575d4f9509a 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -284,7 +284,6 @@ static inline struct sock *inet_lookup_listener(struct net 
*net,
  * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need
  * not check it for lookups anymore, thanks Alexey. -DaveM
  *
- * Local BH must be disabled here.
  */
 struct sock *__inet_lookup_established(struct net *net,
   struct inet_hashinfo *hashinfo,
@@ -326,10 +325,10 @@ static inline struct sock *inet_lookup(struct net *net,
 {
struct sock *sk;
 
-   local_bh_disable();
+   rcu_read_lock();
sk = __inet_lookup(net, hashinfo, skb, doff, saddr, sport, daddr,
   dport, dif);
-   local_bh_enable();
+   rcu_read_unlock();
 
return sk;
 }
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 70f2628be6fa..9dcd0481fae2 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -200,11 +200,10 @@ struct sock *inet6_lookup(struct net *net, struct 
inet_hashinfo *hashinfo,
 {
struct sock *sk;
 
-   local_bh_disable();
+   rcu_read_lock();
sk = __inet6_lookup(net, hashinfo, skb, doff, saddr, sport, daddr,
ntohs(dport), dif);
-   local_bh_enable();
-
+   rcu_read_unlock();
return sk;
 }
 EXPORT_SYMBOL_GPL(inet6_lookup);

Re: RESEND: Easily reproducible kernel panic due to netdev all_adj_list refcnt handling

2016-03-25 Thread Andrew Collins


On 03/25/2016 02:43 PM, Matthias Schiffer wrote:

We've tried your patch, and it changes the symptoms a bit, but doesn't fix
the panic. I've attached kernel logs of the crash both before and after
applying the patch.

One note: I did not reproduce this issue myself, it was first reported in
[1], and then forwarded to the batman-adv issue tracker [2] by me.

Regards,
Matthias


[1] https://github.com/freifunk-gluon/gluon/issues/680
[2] https://www.open-mesh.org/issues/247


On the off chance it helps, the version of the patch I integrated locally takes 
a somewhat different approach
than the one I sent to the mailing list (propagates adj_list refcnts).  I've 
attached it in case it's useful.

I haven't submitted this upstream yet as it's still rather ugly.  I'm of the opinion 
that the whole "every device
knows every upperdev and lowerdev in its tree" model is rather broken, and the 
patch is just working around
a design that needs some rework.

Thanks,
Andrew Collins
commit df318544e282c6ab5bdc4595658fc1cf8739d091
Author: Andrew Collins 
Date:   Fri Mar 25 16:04:59 2016 -0600

This fixes a relatively easily reproducible kernel panic related to the
all_adj_list handling for netdevs in recent kernels.

The following sequence of commands will reproduce the issue:

ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200
ip link add name testbr type bridge
ip link set eth0.100 master testbr
ip link set eth0.200 master testbr
ip link add link testbr mac0 type macvlan
ip link delete dev testbr

This creates an upper/lower tree of (excuse the poor ASCII art):

/---eth0.100-eth0
mac0-testbr-
\---eth0.200-eth0

When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
reference to eth0 is added, so this results in a panic.

This change adds reference count propagation so things are handled properly.

diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..4b4ef6b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5471,6 +5471,7 @@ static inline bool netdev_adjacent_is_neigh_list(struct net_device *dev,
 
 static int __netdev_adjacent_dev_insert(struct net_device *dev,
 	struct net_device *adj_dev,
+	u16 ref_nr,
 	struct list_head *dev_list,
 	void *private, bool master)
 {
@@ -5480,7 +5481,7 @@ static int __netdev_adjacent_dev_insert(struct net_device *dev,
 	adj = __netdev_find_adj(adj_dev, dev_list);
 
 	if (adj) {
-		adj->ref_nr++;
+		adj->ref_nr += ref_nr;
 		return 0;
 	}
 
@@ -5490,7 +5491,7 @@ static int __netdev_adjacent_dev_insert(struct net_device *dev,
 
 	adj->dev = adj_dev;
 	adj->master = master;
-	adj->ref_nr = 1;
+	adj->ref_nr = ref_nr;
 	adj->private = private;
 	dev_hold(adj_dev);
 
@@ -5529,6 +5530,7 @@ free_adj:
 
 static void __netdev_adjacent_dev_remove(struct net_device *dev,
 	 struct net_device *adj_dev,
+	 u16 ref_nr,
 	 struct list_head *dev_list)
 {
 	struct netdev_adjacent *adj;
@@ -5541,10 +5543,10 @@ static void __netdev_adjacent_dev_remove(struct net_device *dev,
 		BUG();
 	}
 
-	if (adj->ref_nr > 1) {
-		pr_debug("%s to %s ref_nr-- = %d\n", dev->name, adj_dev->name,
-			 adj->ref_nr-1);
-		adj->ref_nr--;
+	if (adj->ref_nr > ref_nr) {
+		pr_debug("%s to %s ref_nr-%d = %d\n", dev->name, adj_dev->name,
+			 ref_nr, adj->ref_nr-ref_nr);
+		adj->ref_nr -= ref_nr;
 		return;
 	}
 
@@ -5563,21 +5565,22 @@ static void __netdev_adjacent_dev_remove(struct net_device *dev,
 
 static int __netdev_adjacent_dev_link_lists(struct net_device *dev,
 	struct net_device *upper_dev,
+	u16 ref_nr,
 	struct list_head *up_list,
 	struct list_head *down_list,
 	void *private, bool master)
 {
 	int ret;
 
-	ret = __netdev_adjacent_dev_insert(dev, upper_dev, up_list, private,
-	   master);
+	ret = __netdev_adjacent_dev_insert(dev, upper_dev, ref_nr, up_list,
+	   private, master);
 	if (ret)
 		return ret;
 
-	ret = __netdev_adjacent_dev_insert(upper_dev, dev, down_list, private,
-	   false);
+	ret = __netdev_adjacent_dev_insert(upper_dev, dev, ref_nr, down_list,
+	   private, false);
 	if (ret) {
-		__netdev_adjacent_dev_remove(dev, upper_dev, up_list);
+		__netdev_adjacent_dev_remove(dev, upper_dev, ref_nr, up_list);
 		return ret;
 	}
 
@@ -5585,9 +5588,10 @@ static int __netdev_adjacent_dev_link_lists(struct net_device *dev,
 }
 
 static int __netdev_adjacent_dev_link(struct net_device *dev,
-  struct net_device *upper_dev)
+  struct net_device *upper_dev,
+  u16 ref_nr)
 {
-	return __netdev_adjacent_dev_link_lists(dev, upper_dev,
+	return __netdev_adjacent_dev_link_lists(dev, upper_dev, ref_nr,
 		>all_adj_list.upper,
 		_dev->all_adj_list.lower,
 		NULL, false);
@@ -5595,17

[PATCH-RFC] drivers: net: ethernet: intel: e1000e: fix ethtool autoneg off for fiber

2016-03-25 Thread Daniel Walker

From: Steve Shih 

This patch fixes the issues for disabling auto-negotiation and forcing
speed and duplex settings for the fiber media.

For fiber media, e1000_get_settings should return ETH_TP_MDI_INVALID for
eth_tp_mdix_ctrl instead of ETH_TP_MDI_AUTO so subsequent e1000_set_settings
call would not fail with -EOPNOTSUPP.

e1000_set_spd_dplx should not automatically turn autoneg back on for forced
1000 Mbps full duplex settings.

Cc: danie...@fifo99.com
Cc: xe-ker...@external.cisco.com
Signed-off-by: Steve Shih 
---
 drivers/net/ethernet/intel/e1000e/ethtool.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c 
b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 6cab1f3..cd03dcd 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -201,6 +201,9 @@ static int e1000_get_settings(struct net_device *netdev,
else
ecmd->eth_tp_mdix_ctrl = hw->phy.mdix;
 
+   if (hw->phy.media_type != e1000_media_type_copper)
+   ecmd->eth_tp_mdix_ctrl = ETH_TP_MDI_INVALID;
+
return 0;
 }
 
@@ -236,8 +239,7 @@ static int e1000_set_spd_dplx(struct e1000_adapter 
*adapter, u32 spd, u8 dplx)
mac->forced_speed_duplex = ADVERTISE_100_FULL;
break;
case SPEED_1000 + DUPLEX_FULL:
-   mac->autoneg = 1;
-   adapter->hw.phy.autoneg_advertised = ADVERTISE_1000_FULL;
+   mac->forced_speed_duplex = ADVERTISE_1000_FULL;
break;
case SPEED_1000 + DUPLEX_HALF:  /* not supported */
default:
-- 
2.5.0

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Vijay Pandurangan

consider two scenarios, where process a sends raw ethernet frames
containing UDP packets to b

I) process a --> veth --> process b

II) process a -> eth -> wire -> eth -> process b

I believe (I) is the simplest setup we can create that will replicate this bug.

If process a sends frames that contain UDP packets to process b, what
is the behaviour we want if the UDP packet *has an incorrect
checksum*?

It seems to me that I and II should have identical behaviour, and I
would think that (II) would not deliver the packets to the
application.

In (I) with Cong's patch would we be delivering corrupt UDP packets to
process b despite an incorrect checksum in (I)?

If so, I would argue that this patch isn't right.


On Fri, Mar 25, 2016 at 4:56 PM, Ben Greear  wrote:
> On 03/24/2016 10:33 PM, Cong Wang wrote:
>
>> Here we go:
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 1ecfa71..ab66080 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket
>> *sock, struct msghdr *msg,
>>  goto out_unlock;
>>  }
>>
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>  skb->protocol = proto;
>>  skb->dev = dev;
>>  skb->priority = sk->sk_priority;
>> @@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock
>> *po, struct sk_buff *skb,
>>
>>  ph.raw = frame;
>>
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>  skb->protocol = proto;
>>  skb->dev = dev;
>>  skb->priority = po->sk.sk_priority;
>> @@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct
>> sock *sk, size_t prepad,
>>  skb_put(skb, linear);
>>  skb->data_len = len - linear;
>>  skb->len += len - linear;
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>
>>  return skb;
>>   }
>
>
> I have tested UDP, TCP, TCPv6 and custom Ethernet frames across a veth pair.
>
> And, UDP, TCP, and pktgen across a pair of veth pairs
> bridged by my user-space packet filter.
>
> All of these tests work fine with your patch as far as I can tell.
>
> So, you can add:
>
> Tested-by: Ben Greear 
>
> That said, it could easily break some drivers and/or other scenarios that I
> have not tested, so at the least it should cook a while upstream before
> going into the
> stable tree
>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Vijay Pandurangan

On Fri, Mar 25, 2016 at 10:35 AM, Ben Greear  wrote:
>
>
> On 03/24/2016 10:24 PM, Vijay Pandurangan wrote:
>>
>> On Fri, Mar 25, 2016 at 1:07 AM, Ben Greear 
>> wrote:
>>>
>>> On 03/24/2016 09:45 PM, Vijay Pandurangan wrote:


 Actually, maybe they should be set to CHECKSUM_PARTIAL if we want veth
 to drop the packets if they have bad checksums before they hit the
 application level.
>>>
>>>
>>>
>>> VETH is pretty special in that when you transmit a frame on one
>>> device, it's pair receives it, and unless there is RAM corruption
>>> or bugs in the kernel, then it cannot be corrupted.
>>
>>
>> Yeah, you're right that that's an optimization. However, I think that
>> we should first ensure that
>>
>> a->veth->b
>>
>> operates exactly like:
>>
>> a->physical eth 1 -> physical eth 2->b
>>
>> in all cases.  Once we have that working everywhere we could think
>> about optimizations.
>>
>>
>> If we're willing to refactor, we could implement the optimization by
>> allowing veth devices to know whether their immediate peer is. If a
>> veth knows it's talking to another veth, it could under some
>> circumstances elide checksum calculation and verification.  I'm not
>> sure what abstractions that would break, though. What do you guys
>> think?
>
>
> veth ALWAYS transmits to another VETH.  The problem is that when veth is
> given a packet to transmit, it is difficult to know where that packet
> came from.

Yeah you're totally right – I guess what I was trying to express (but
failed at) was that we might need to be able to track the original
source of the packet for optimizations.
>
> And, adding software checksumming to veth for every frame would be a huge
> performance hit.
>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Ben Greear


On 03/24/2016 10:33 PM, Cong Wang wrote:


Here we go:

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1ecfa71..ab66080 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket
*sock, struct msghdr *msg,
 goto out_unlock;
 }

+   skb->ip_summed = CHECKSUM_UNNECESSARY;
 skb->protocol = proto;
 skb->dev = dev;
 skb->priority = sk->sk_priority;
@@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock
*po, struct sk_buff *skb,

 ph.raw = frame;

+   skb->ip_summed = CHECKSUM_UNNECESSARY;
 skb->protocol = proto;
 skb->dev = dev;
 skb->priority = po->sk.sk_priority;
@@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct
sock *sk, size_t prepad,
 skb_put(skb, linear);
 skb->data_len = len - linear;
 skb->len += len - linear;
+   skb->ip_summed = CHECKSUM_UNNECESSARY;

 return skb;
  }


I have tested UDP, TCP, TCPv6 and custom Ethernet frames across a veth pair.

And, UDP, TCP, and pktgen across a pair of veth pairs
bridged by my user-space packet filter.

All of these tests work fine with your patch as far as I can tell.

So, you can add:

Tested-by: Ben Greear 

That said, it could easily break some drivers and/or other scenarios that I
have not tested, so at the least it should cook a while upstream before going 
into the
stable tree

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: RESEND: Easily reproducible kernel panic due to netdev all_adj_list refcnt handling

2016-03-25 Thread Matthias Schiffer

On 02/23/2016 11:29 PM, Andrew Collins wrote:
> I'm running into a relatively easily reproducible kernel panic related to
> the all_adj_list handling for netdevs
> in recent kernels.
> 
> The following sequence of commands will reproduce the issue:
> 
> ip link add link eth0 name eth0.100 type vlan id 100
> ip link add link eth0 name eth0.200 type vlan id 200
> ip link add name testbr type bridge
> ip link set eth0.100 master testbr
> ip link set eth0.200 master testbr
> ip link add link testbr mac0 type macvlan
> ip link delete dev testbr
> 
> This creates an upper/lower tree of (excuse the poor ASCII art):
> 
> /---eth0.100-eth0
> mac0-testbr-
> \---eth0.200-eth0
> 
> When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted
> twice from the mac0 list.
> Unfortunately, during setup in __netdev_upper_dev_link, only one reference
> to eth0 is added,
> so this results in the following panic trace:
> 
> [68235.234564] tried to remove device eth0 from mac0

Hi,
I got a similar report which looks like the same issue. Our setup is a bit
more complicated, it also involves batman-adv:

* 5 VLANs (eth0.2, eth0.3, eth0.100, eth0.101, eth0.102)
* batman-adv device bat0 is master of eth0.100, eth0.101, eth0.102
* Bridge br-wan is master of eth0.2
* Bridge br-client is master of bat0 and eth0.3
* macvlan device local-node on top of br-client

The setup is OpenWrt-based, which has a network config daemon which will in
some cases remove bridge ports/slaves when the corresponding devices lose
carrier (and I think br-client and bat0 get deleted in this case, not
completely sure about this). The crash occurs when eth0 goes down.

We've tried your patch, and it changes the symptoms a bit, but doesn't fix
the panic. I've attached kernel logs of the crash both before and after
applying the patch.

One note: I did not reproduce this issue myself, it was first reported in
[1], and then forwarded to the batman-adv issue tracker [2] by me.

Regards,
Matthias

[1] https://github.com/freifunk-gluon/gluon/issues/680
[2] https://www.open-mesh.org/issues/247
U-Boot SPL 2015.01 (Mar 18 2016 - 19:30:08)
DRAM: 1024 MiB
CPU: 96000Hz, AXI/AHB/APB: 3/2/2

U-Boot 2015.01 (Mar 18 2016 - 19:30:08) Allwinner Technology

CPU:   Allwinner A20 (SUN7I)
I2C:   ready
DRAM:  1016 MiB
MMC:   SUNXI SD/MMC: 0
*** Warning - bad CRC, using default environment

Reserved 8192kB of RAM for Framebuffer.
In:serial
Out:   serial
Err:   serial
SCSI:  SUNXI SCSI INIT
SATA link 0 timeout.
AHCI 0001.0100 32 slots 1 ports 3 Gbps 0x1 impl SATA mode
flags: ncq stag pm led clo only pmp pio slum part ccc apst 
Net:   dwmac.1c5
(Re)start USB...
USB0:   USB EHCI 1.00
scanning bus 0 for devices... 1 USB Device(s) found
USB1:   USB EHCI 1.00
scanning bus 1 for devices... 1 USB Device(s) found
   scanning usb for storage devices... 0 Storage Device(s) found
Hit any key to stop autoboot:  0 
switch to partitions #0, OK
mmc0 is current device
Scanning mmc 0...
Found U-Boot script /boot.scr
reading /boot.scr
377 bytes read in 18 ms (19.5 KiB/s)
## Executing script at 4310
reading uImage
1981560 bytes read in 117 ms (16.2 MiB/s)
reading dtb
23549 bytes read in 30 ms (765.6 KiB/s)
## Booting kernel from Legacy Image at 4200 ...
   Image Name:   ARM OpenWrt Linux-3.18.27
   Image Type:   ARM Linux Kernel Image (uncompressed)
   Data Size:1981496 Bytes = 1.9 MiB
   Load Address: 40008000
   Entry Point:  40008000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 4300
   Booting using the fdt blob at 0x4300
   Loading Kernel Image ... OK
   Using Device Tree in place at 4300, end 43008bfc

Starting kernel ...

[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 3.18.27 (julian@eclipse) (gcc version 4.8.3 
(OpenWrt/Linaro GCC 4.8-2014.04 r48900) ) #1 SMP PREEMPT Fri Mar 18 19:27:10 
CET 2016
[0.00] CPU: ARMv7 Processor [410fc074] revision 4 (ARMv7), cr=10c5387d
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache
[0.00] Machine model: LeMaker Banana Pi
[0.00] Memory policy: Data cache writealloc
[0.00] psci: probing for conduit method from DT.
[0.00] psci: Using PSCI v0.1 Function IDs from DT
[0.00] PERCPU: Embedded 9 pages/cpu @eefe1000 s6272 r8192 d22400 u36864
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 258576
[0.00] Kernel command line: console=ttyS0,115200 earlyprintk rootwait 
root=/dev/mmcblk0p2
[0.00] PID hash table entries: 4096 (order: 2, 16384 bytes)
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Memory: 1025188K/1040384K available (3931K kernel code, 174K 
rwdata, 1132K rodata, 208K init, 610K bss, 15196K reserved, 262144K highmem)
[0.00] Virtual kernel memory layout:
[0.00]

Re: [PATCH RFC 3/9] net: Add fast receive encapsulation

2016-03-25 Thread David Miller

From: Tom Herbert 
Date: Wed, 23 Mar 2016 15:36:52 -0700

> +{
> + struct udp_sock *up = udp_sk(sk);
> + int is_udplite = IS_UDPLITE(sk);
> +
> + int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
> +

Small nit, please put this encap_rcv function pointer declaration at
the top of the local variable list.

Re: [PATCH] ip.7: Fix incorrect sockopt name

2016-03-25 Thread Michael Kerrisk (man-pages)

Hello Benjamin,

On 03/22/2016 09:28 AM, Benjamin Poirier wrote:
> "IP_LEAVE_GROUP" does not exist. It was perhaps a confusion with
> MCAST_LEAVE_GROUP. Change the text to IP_DROP_MEMBERSHIP which has the same
> function as MCAST_LEAVE_GROUP and is documented in the ip.7 man page.
> 
> Reference:
> Linux kernel net/ipv4/ip_sockglue.c do_ip_setsockopt()

Thanks! Applied.

Cheers,

Michael


> Cc: Radek Pazdera 
> Signed-off-by: Benjamin Poirier 
> ---
>  man7/ip.7 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man7/ip.7 b/man7/ip.7
> index 3905573..37e2c86 100644
> --- a/man7/ip.7
> +++ b/man7/ip.7
> @@ -376,7 +376,7 @@ a given multicast group that come from a given source.
>  If the application has subscribed to multiple sources within
>  the same group, data from the remaining sources will still be delivered.
>  To stop receiving data from all sources at once, use
> -.BR IP_LEAVE_GROUP .
> +.BR IP_DROP_MEMBERSHIP .
>  .IP
>  Argument is an
>  .I ip_mreq_source
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()

2016-03-25 Thread David Miller

From: Bjorn Helgaas 
Date: Fri, 25 Mar 2016 11:46:39 -0500

> You're right, there is an issue here.  I reproduced a problem with a
> bond device.  bond_netpoll_setup() calls __netpoll_setup() directly
> (not netpoll_setup()).  I'll debug it more; just wanted to let you
> know there *is* a problem with this patch.

I bet that's why the assignment to np->dev and the reference counting
were separated in the first place :-/

Indeed, commit 30fdd8a082a00126a6feec994e43e8dc12f5bccb:

commit 30fdd8a082a00126a6feec994e43e8dc12f5bccb
Author: Jiri Pirko 
Date:   Tue Jul 17 05:22:35 2012 +

netpoll: move np->dev and np->dev_name init into __netpoll_setup()

Signed-off-by: Jiri Pirko 
Signed-off-by: David S. Miller

Re: [PATCH net-next] net: dsa: mv88e6xxx: fix 6185 hardware bridging

2016-03-25 Thread David Miller

From: Vivien Didelot 
Date: Fri, 25 Mar 2016 13:20:30 -0400

> David, please ignore this patch for the moment.

Ok.

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread David Miller

From: Ben Greear 
Date: Fri, 25 Mar 2016 10:14:46 -0700

> Anyway, you know the stack and drivers better than me, so if you
> think Cong's patch is valid, then I'll test it and make sure it
> works in my setups at least.

It probably is, I'm just waiting to see if Tom Herbert will give
some feedback or not as this is an area he understands well.

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Willem de Bruijn

On Fri, Mar 25, 2016 at 1:00 PM, Eric Dumazet  wrote:
> On Fri, 2016-03-25 at 12:31 -0400, Craig Gallek wrote:
>
>> I believe the issue here is that closing the listen sockets will drop
>> any connections that are in the listen queue but have not been
>> accepted yet.  In the case of reuseport, you could in theory drain
>> those queues into the non-closed sockets, but that probably has some
>> interesting consequences...
>
> It is more complicated than this.
>
> Ideally, no TCP connection should be dropped during a server change.
>
> The idea is to let old program running as long as :
> 1) It has established TCP sessions
> 2) Some SYN_RECV pseudo requests are still around
>
> Once 3WHS completes for these SYN_RECV, children are queued into
> listener accept queues.
>
> But the idea is to direct all new SYN packets to the 'new' process and
> its listeners. (New SYN_RECV should be created on behalf on the new
> listeners only)
>
>
> In some environments, the listeners are simply transfered via FD
> passing, from the 'old process' to the new one.

Right. Comparatively, one of the nice features of the BPF variant is
that the sockets in the old process can passively enter listen_off
state solely with changes initiated by the new process (change the bpf
filter for the group).

By the way, if I read correctly, the listen_off feature was already
possible without kernel changes prior to fast reuseport by changing
SO_BINDTODEVICE on the old process's sockets to effectively segment
them into a separate reuseport group. With fast reuseport,
sk_bound_dev_if state equivalence is checked on joining a group, but
the socket is not removed from the array when that syscall is made, so
this does not work.

Re: netpoll rtnl_dereference() usage

2016-03-25 Thread Neil Horman

On Fri, Mar 25, 2016 at 12:30:32PM -0500, Bjorn Helgaas wrote:
> Hi Neil,
> 
> Since we're looking at netpoll, here's another question (or two).
> 0790bbb68f9d ("netpoll: cleanup sparse warnings") adds this:
> 
> @@ -1236,7 +1236,11 @@ void __netpoll_cleanup(struct netpoll *np)
> struct netpoll_info *npinfo;
> unsigned long flags;
>  
> -   npinfo = np->dev->npinfo;
> +   /* rtnl_dereference would be preferable here but
> +* rcu_cleanup_netpoll path can put us in here safely without
> +* holding the rtnl, so plain rcu_dereference it is
> +*/
> +   npinfo = rtnl_dereference(np->dev->npinfo);
> if (!npinfo)
> return;
> 
> The comment seems to contradict the code: the comment says "we would
> like to use rtnl_dereference(), but we have to use rcu_dereference()."
> But the code in fact *does* use rtnl_dereference().
> 
its the comment that went awry.  I remember writing that patch, and I initially
thought we had to use rcu_derefence there, but I would up finding a way to keep
the rntl lock held, so rtnl_deref should be ok.  I must have just forgotten to
fixup the comment.

> Also, "rcu_cleanup_netpoll" doesn't exist; maybe it's a typo for
> rcu_cleanup_netpoll_info()?  I don't see the path that leads from
> rcu_cleanup_netpoll_info() to __netpoll_cleanup(), but I don't claim
> to understand the netpoll async subtleties.
> 
Correct again, its the rcu callback rcu_cleanup_netpoll_info that I'm referring
to there, and the comment was written initially when rcu_cleanup_netpoll info
was called cleanup_netpoll_info and called forward into __netpoll_cleanup (in my
development patch versions).  That comment should really just be re-written.
I'm happy to do so if you like

Best
Neil

> Bjorn

Re: [RFT Patch net 1/2] ipv6: invalidate the socket cached route on pmtu events if possible

2016-03-25 Thread Eric Dumazet

On Fri, 2016-03-25 at 10:17 -0700, Cong Wang wrote:

> 1) sock lock protects the whole update: the whole check, update, recheck,
> set logic, to make sure another CPU will not do the same to the same socket
> at the same time.
> 
> 2) the dst itself is safe, because it is always refcounted, and we use xchg()
> to change the pointer in sk_dst_cache.
> 
> Or am I still missing anything here?

As TCP always lock the socket before doing its heavy stuff,
it can use a variant of sk_dst_cache manipulations that do not use extra
atomic operations.

But UDP gets xchg() to safely exchange sk_dst_cache, because we do not
feel locking the socket is needed for UDP for typical uses (! cork)

If you hold the socket lock in ICMP handler, then it would be
inconsistent with udp sendmsg() where we do not hold the socket lock.

Since I believe udp sendmsg() is fine, I do believe you do not need to
lock the socket, and then care about socket being owned by the user.

netpoll rtnl_dereference() usage

2016-03-25 Thread Bjorn Helgaas

Hi Neil,

Since we're looking at netpoll, here's another question (or two).
0790bbb68f9d ("netpoll: cleanup sparse warnings") adds this:

@@ -1236,7 +1236,11 @@ void __netpoll_cleanup(struct netpoll *np)
struct netpoll_info *npinfo;
unsigned long flags;
 
-   npinfo = np->dev->npinfo;
+   /* rtnl_dereference would be preferable here but
+* rcu_cleanup_netpoll path can put us in here safely without
+* holding the rtnl, so plain rcu_dereference it is
+*/
+   npinfo = rtnl_dereference(np->dev->npinfo);
if (!npinfo)
return;

The comment seems to contradict the code: the comment says "we would
like to use rtnl_dereference(), but we have to use rcu_dereference()."
But the code in fact *does* use rtnl_dereference().

Also, "rcu_cleanup_netpoll" doesn't exist; maybe it's a typo for
rcu_cleanup_netpoll_info()?  I don't see the path that leads from
rcu_cleanup_netpoll_info() to __netpoll_cleanup(), but I don't claim
to understand the netpoll async subtleties.

Bjorn

Re: [PATCH net-next] net: dsa: mv88e6xxx: fix 6185 hardware bridging

2016-03-25 Thread Vivien Didelot

Hi,

Vivien Didelot  writes:

> The 88E6185 switch also has a MapDA bit in its Port Control 2 register.
> When this bit is cleared, all frames are sent out to the CPU port.
>
> Set this bit to rely on ATU hits and direct the frame out of the correct
> ports, and thus enable hardware bridging.
>
> Signed-off-by: Vivien Didelot 

Andrew, I sent this patch too fast. It requires a proper support of the
ATU in order to populate the FDB correctly. I send that soon.

David, please ignore this patch for the moment.

Thanks,
Vivien

Re: [RFT Patch net 1/2] ipv6: invalidate the socket cached route on pmtu events if possible

2016-03-25 Thread Cong Wang

On Thu, Mar 24, 2016 at 6:51 PM, Eric Dumazet  wrote:
> On Thu, 2016-03-24 at 17:15 -0700, Cong Wang wrote:
>
>> My understanding is that bh_lock_sock() prevents concurrent
>> access to sock struct. Since this is in softirq context, multiple
>> CPU's could call this function concurrently, the whole pmtu
>> update needs to be done atomically.
>>
>> UDP, on the other hand, doesn't do this logic, it just looks up
>> for dst and save it in sk_dst_cache.
>
> Two ICMP messages processed on two different cpus can still get two
> different sockets pointing to the same dst.
>

Yeah, and dst is refcounted so a shared dst will not be freed by both.

> I do not see how dst pmtu update could be protected by a lock on each
> socket. It would require a dst lock , or some simple memory writes that
> do not require special synchro.

I am lost here because you removed the dst lock in ipv4_sk_update_pmtu():

commit 7f502361531e9eecb396cf99bdc9e9a59f7ebd7f
Author: Eric Dumazet 
Date:   Mon Jun 30 01:26:23 2014 -0700

ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix

We have two different ways to handle changes to sk->sk_dst

First way (used by TCP) assumes socket lock is owned by caller, and use
no extra lock : __sk_dst_set() & __sk_dst_reset()

Another way (used by UDP) uses sk_dst_lock because socket lock is not
always taken. Note that sk_dst_lock is not softirq safe.

These ways are not inter changeable for a given socket type.

ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used
the socket lock as synchronization, but users might be UDP sockets.

Instead of converting sk_dst_lock to a softirq safe version, use xchg()
as we did for sk_rx_dst in commit e47eb5dfb296b ("udp: ipv4: do not use
sk_dst_lock from softirq context")

In a follow up patch, we probably can remove sk_dst_lock, as it is
only used in IPv6.

My understand is that:

1) sock lock protects the whole update: the whole check, update, recheck,
set logic, to make sure another CPU will not do the same to the same socket
at the same time.

2) the dst itself is safe, because it is always refcounted, and we use xchg()
to change the pointer in sk_dst_cache.

Or am I still missing anything here?

Thanks.

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Ben Greear

On 03/25/2016 09:44 AM, David Miller wrote:

From: Ben Greear 
Date: Fri, 25 Mar 2016 09:10:58 -0700

I am suspicious that this will break at least some drivers.  I
grepped around for ip_summed, and found this, for instance:

davicom/dm9000.c

 /* The DM9000 is not smart enough to leave fragmented packets

An really old (circa 1997), not-oft-used, driver such as this is not
the place to be looking for correct usage of skb->ip_summed semantics.

I would never use whatever a driver like this does influence whether I
apply a bug fix or not.

Point is, it took me 5 minutes to find that, and I did not look hard at many
other drivers.  e1000e and igb appear to be fine, and maybe the rest of them
are too.  Lord knows what other strange setups might be effected by the
ip_summed change.

Anyway, you know the stack and drivers better than me, so if you think Cong's
patch is valid, then I'll test it and make sure it works in my setups at least.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

[PATCH net-next] net: dsa: mv88e6xxx: fix 6185 hardware bridging

2016-03-25 Thread Vivien Didelot

The 88E6185 switch also has a MapDA bit in its Port Control 2 register.
When this bit is cleared, all frames are sent out to the CPU port.

Set this bit to rely on ATU hits and direct the frame out of the correct
ports, and thus enable hardware bridging.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 4332b15..85ed3e2 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2586,7 +2586,8 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, 
int port)
reg = 0;
if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
mv88e6xxx_6165_family(ds) || mv88e6xxx_6097_family(ds) ||
-   mv88e6xxx_6095_family(ds) || mv88e6xxx_6320_family(ds))
+   mv88e6xxx_6095_family(ds) || mv88e6xxx_6320_family(ds) ||
+   mv88e6xxx_6185_family(ds))
reg = PORT_CONTROL_2_MAP_DA;
 
if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
-- 
2.7.4

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Eric Dumazet

On Fri, 2016-03-25 at 12:31 -0400, Craig Gallek wrote:

> I believe the issue here is that closing the listen sockets will drop
> any connections that are in the listen queue but have not been
> accepted yet.  In the case of reuseport, you could in theory drain
> those queues into the non-closed sockets, but that probably has some
> interesting consequences...

It is more complicated than this.

Ideally, no TCP connection should be dropped during a server change.

The idea is to let old program running as long as :
1) It has established TCP sessions
2) Some SYN_RECV pseudo requests are still around

Once 3WHS completes for these SYN_RECV, children are queued into
listener accept queues.

But the idea is to direct all new SYN packets to the 'new' process and
its listeners. (New SYN_RECV should be created on behalf on the new
listeners only)

In some environments, the listeners are simply transfered via FD
passing, from the 'old process' to the new one.

Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()

2016-03-25 Thread Bjorn Helgaas

On Fri, Mar 25, 2016 at 07:33:42AM -0400, Neil Horman wrote:
> On Thu, Mar 24, 2016 at 09:56:21PM -0500, Bjorn Helgaas wrote:
> > netpoll_setup() does a dev_hold() on np->dev, the netpoll device.  If it
> > fails, it correctly does a dev_put() but leaves np->dev set.  If we call
> > netpoll_cleanup() after the failure, np->dev is still set so we do another
> > dev_put(), which decrements the refcount an extra time.
> > 
> > It's questionable to call netpoll_cleanup() after netpoll_setup() fails,
> > but it can be difficult to find the problem, and we can easily avoid it in
> > this case.  The extra decrements can lead to hangs like this:
> > 
> >   unregister_netdevice: waiting for bond0 to become free. Usage count = -3
> > 
> > Set and clear np->dev at the points where we dev_hold() and dev_put() the
> > device.
> > 
> > Signed-off-by: Bjorn Helgaas 
> > ---
> >  net/core/netpoll.c |3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> > index 94acfc8..a57bd17 100644
> > --- a/net/core/netpoll.c
> > +++ b/net/core/netpoll.c
> > @@ -603,7 +603,6 @@ int __netpoll_setup(struct netpoll *np, struct 
> > net_device *ndev)
> > const struct net_device_ops *ops;
> > int err;
> >  
> > -   np->dev = ndev;
> > strlcpy(np->dev_name, ndev->name, IFNAMSIZ);
> > INIT_WORK(>cleanup_work, netpoll_async_cleanup);
> >  
> > @@ -670,6 +669,7 @@ int netpoll_setup(struct netpoll *np)
> > goto unlock;
> > }
> > dev_hold(ndev);
> > +   np->dev = ndev;
> >  
> > if (netdev_master_upper_dev_get(ndev)) {
> > np_err(np, "%s is a slave device, aborting\n", np->dev_name);
> > @@ -770,6 +770,7 @@ int netpoll_setup(struct netpoll *np)
> > return 0;
> >  
> >  put:
> > +   np->dev = NULL;
> > dev_put(ndev);
> >  unlock:
> > rtnl_unlock();
> > 
> 
> Is this safe for stacked devices?  It makes good sense for the typical case, 
> but
> if you attempt to setup a netpoll client on a bridge/bond/vlan, etc, the lower
> device will get its own netpoll struct registered and have no associated 
> np->dev
> pointer.  It not be a real problem in practice, But you probably want to check
> to make sure that stacked  devices which recursively call the netpoll api 
> don't
> do anyting with the np->dev pointer.

You're right, there is an issue here.  I reproduced a problem with a
bond device.  bond_netpoll_setup() calls __netpoll_setup() directly
(not netpoll_setup()).  I'll debug it more; just wanted to let you
know there *is* a problem with this patch.

Bjorn

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread David Miller

From: Cong Wang 
Date: Fri, 25 Mar 2016 09:32:23 -0700

> So I believe dm9000 needs to fix.

+1

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread David Miller

From: Ben Greear 
Date: Fri, 25 Mar 2016 09:10:58 -0700

> I am suspicious that this will break at least some drivers.  I
> grepped around for ip_summed, and found this, for instance:
> 
> davicom/dm9000.c
> 
> /* The DM9000 is not smart enough to leave fragmented packets

An really old (circa 1997), not-oft-used, driver such as this is not
the place to be looking for correct usage of skb->ip_summed semantics.

I would never use whatever a driver like this does influence whether I
apply a bug fix or not.

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Cong Wang

On Fri, Mar 25, 2016 at 9:10 AM, Ben Greear  wrote:
> I am suspicious that this will break at least some drivers.  I grepped
> around
> for ip_summed, and found this, for instance:
>
> davicom/dm9000.c
>
> /* The DM9000 is not smart enough to leave fragmented packets alone.
> */
> if (dm->ip_summed != ip_summed) {
> if (ip_summed == CHECKSUM_NONE)
> iow(dm, DM9000_TCCR, 0);
> else
> iow(dm, DM9000_TCCR, TCCR_IP | TCCR_UDP | TCCR_TCP);
> dm->ip_summed = ip_summed;
> }
>
>
> It is taking action based on ip_summed == CHECKSUM_NONE, and your change
> will probably break that.
>
> I would suggest that we try to make any fix specific only to veth,
> at least for now.  A tree-wide audit of drivers is probably required
> to safely make the kind of change you propose above.
>
> So, unless you can explain why your change is safe, then I do not plan
> to test it.

I just blindly trust the comments there:

 * CHECKSUM_UNNECESSARY:
 *
 *   This has the same meaning on as CHECKSUM_NONE for checksum offload on
 *   output.

Let's Cc Tom who wrote this comment.

On the other hand, hyperv got this correctly:

if ((skb->ip_summed == CHECKSUM_NONE) ||
(skb->ip_summed == CHECKSUM_UNNECESSARY))
goto do_send;

So I believe dm9000 needs to fix.

Thanks.

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Craig Gallek

On Fri, Mar 25, 2016 at 12:21 PM, Alexei Starovoitov
 wrote:
> On Fri, Mar 25, 2016 at 11:29:10AM -0400, Craig Gallek wrote:
>> On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreau  wrote:
>> > The pattern is :
>> >
>> >   t0 : unprivileged processes 1 and 2 are listening to the same port
>> >(sock1@pid1) (sock2@pid2)
>> ><-- listening -->
>> >
>> >   t1 : new processes are started to replace the old ones
>> >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>> ><-- listening --> <-- listening -->
>> >
>> >   t2 : new processes signal the old ones they must stop
>> >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>> ><--- draining --> <-- listening -->
>> >
>> >   t3 : pids 1 and 2 have finished, they go away
>> >  (sock3@pid3) (sock4@pid4)
>> > <-- gone ->  <-- listening -->
> ...
>> t3: Close the first two sockets and only use the last two.  This is
>> the tricky step.  Before this point, the sockets are numbered 0
>> through 3 from the perspective of the BPF program (in the order
>> listen() was called).  As soon as socket 0 is closed, the last socket
>> in the list replaces it (what was 3 becomes 0).  When socket 1 is
>> closed, socket 2 moves into that position.  The assumptions about the
>> socket indexes in the BPF program need to change as the indexes change
>> as a result of closing them.
>
> yeah, the way reuseport_detach_sock() was done makes it hard to manage
> such transitions from bpf program, but I don't see yet what stops
> pid1 an pid2 at stage t2 to just close their sockets.
> If these 'draining' pids don't want to receive packets, they should
> close their sockets. Complicating bpf side to redistribute spraying
> to sock3 and sock4 only (while sock1 and sock2 are still open) is possible,
> but looks unnecessary complex to me.
> Just close sock1 and sock2 at t2 time and then exit pid1, pid2 later.
> If they are tcp sockets with rpc protocol on top and have a problem of
> partial messages, then kcm can solve that and it will simplify
> the user space side as well.
I believe the issue here is that closing the listen sockets will drop
any connections that are in the listen queue but have not been
accepted yet.  In the case of reuseport, you could in theory drain
those queues into the non-closed sockets, but that probably has some
interesting consequences...

[ethtool PATCH v7 0/2] add support for new ETHTOOL_xLINKSETTINGS ioctls

2016-03-25 Thread David Decotigny

From: David Decotigny 

History:
  v7
added ref to related kernel commit in netlink ioctl patch description
  v6
re-added last patch, to use AF_NETLINK when AF_INET not available
  v5
rebased main patch, removed last patch "use AF_LOCAL when AF_INET
not available"
  v4
review Ben Hutchings:
  using AF_UNIX instead of INET6 in the absence of v4 sockets
  use stdbool.h
  do_seeprom always fails when offset/length out of bounds
  sync to latest ethtool.h + kernel.h from net-next
  __SANE_USERSPACE_TYPES__ always defined
  cosmetic updates for var == const tests
  cosmetic updates for associativity in tests
  v3
TRUE/FALSE obvious-ification
  v2
added do_seeprom patch
added netdev   as recipient
  v1
initial submission


# Patch Set Summary:

David Decotigny (2):
  ethtool.c: add support for ETHTOOL_xLINKSETTINGS ioctls
  ethtool: use netlink socket when AF_INET not available

 configure.ac   |   2 +-
 ethtool.c  | 688 -
 internal.h |  67 ++
 test-cmdline.c |  13 ++
 4 files changed, 611 insertions(+), 159 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

[ethtool PATCH v7 2/2] ethtool: use netlink socket when AF_INET not available

2016-03-25 Thread David Decotigny

From: David Decotigny 

To benefit from this, kernel commit 025c68186e07 ("netlink: add support
for NIC driver ioctls") is needed.


Signed-off-by: David Decotigny 
---
 configure.ac | 2 +-
 ethtool.c| 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index 3105415..47d2a0f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -15,7 +15,7 @@ AM_PROG_CC_C_O
 dnl Checks for libraries.
 
 dnl Checks for header files.
-AC_CHECK_HEADERS(sys/ioctl.h)
+AC_CHECK_HEADERS(sys/ioctl.h linux/netlink.h)
 
 dnl Checks for typedefs, structures, and compiler characteristics.
 AC_MSG_CHECKING([whether  defines big-endian types])
diff --git a/ethtool.c b/ethtool.c
index cb3d971..314b1b8 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -42,6 +42,9 @@
 #include 
 
 #include 
+#ifdef HAVE_LINUX_NETLINK_H
+# include 
+#endif
 
 #ifndef MAX_ADDR_LEN
 #define MAX_ADDR_LEN   32
@@ -4645,6 +4648,10 @@ opt_found:
 
/* Open control socket. */
ctx.fd = socket(AF_INET, SOCK_DGRAM, 0);
+#ifdef HAVE_LINUX_NETLINK_H
+   if (ctx.fd < 0)
+   ctx.fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
+#endif
if (ctx.fd < 0) {
perror("Cannot get control socket");
return 70;
-- 
2.8.0.rc3.226.g39d4020

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Alexei Starovoitov

On Fri, Mar 25, 2016 at 11:29:10AM -0400, Craig Gallek wrote:
> On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreau  wrote:
> > The pattern is :
> >
> >   t0 : unprivileged processes 1 and 2 are listening to the same port
> >(sock1@pid1) (sock2@pid2)
> ><-- listening -->
> >
> >   t1 : new processes are started to replace the old ones
> >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
> ><-- listening --> <-- listening -->
> >
> >   t2 : new processes signal the old ones they must stop
> >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
> ><--- draining --> <-- listening -->
> >
> >   t3 : pids 1 and 2 have finished, they go away
> >  (sock3@pid3) (sock4@pid4)
> > <-- gone ->  <-- listening -->
...
> t3: Close the first two sockets and only use the last two.  This is
> the tricky step.  Before this point, the sockets are numbered 0
> through 3 from the perspective of the BPF program (in the order
> listen() was called).  As soon as socket 0 is closed, the last socket
> in the list replaces it (what was 3 becomes 0).  When socket 1 is
> closed, socket 2 moves into that position.  The assumptions about the
> socket indexes in the BPF program need to change as the indexes change
> as a result of closing them.

yeah, the way reuseport_detach_sock() was done makes it hard to manage
such transitions from bpf program, but I don't see yet what stops
pid1 an pid2 at stage t2 to just close their sockets.
If these 'draining' pids don't want to receive packets, they should
close their sockets. Complicating bpf side to redistribute spraying
to sock3 and sock4 only (while sock1 and sock2 are still open) is possible,
but looks unnecessary complex to me.
Just close sock1 and sock2 at t2 time and then exit pid1, pid2 later.
If they are tcp sockets with rpc protocol on top and have a problem of
partial messages, then kcm can solve that and it will simplify
the user space side as well.

[ethtool PATCH v7 1/2] ethtool.c: add support for ETHTOOL_xLINKSETTINGS ioctls

2016-03-25 Thread David Decotigny

From: David Decotigny 

More info with kernel commit 8d3f2806f8fb ("Merge branch
'ethtool-ksettings'").

Note: The new features implemented in this patch depend on kernel
commit 793cf87de9d1 ("Set cmd field in ETHTOOL_GLINKSETTINGS response to
wrong nwords").


Signed-off-by: David Decotigny 
---
 ethtool.c  | 681 -
 internal.h |  67 ++
 test-cmdline.c |  13 ++
 3 files changed, 603 insertions(+), 158 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 0cd0d4f..cb3d971 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -47,42 +47,6 @@
 #define MAX_ADDR_LEN   32
 #endif
 
-#define ALL_ADVERTISED_MODES   \
-   (ADVERTISED_10baseT_Half |  \
-ADVERTISED_10baseT_Full |  \
-ADVERTISED_100baseT_Half | \
-ADVERTISED_100baseT_Full | \
-ADVERTISED_1000baseT_Half |\
-ADVERTISED_1000baseT_Full |\
-ADVERTISED_1000baseKX_Full|\
-ADVERTISED_2500baseX_Full |\
-ADVERTISED_1baseT_Full |   \
-ADVERTISED_1baseKX4_Full | \
-ADVERTISED_1baseKR_Full |  \
-ADVERTISED_1baseR_FEC |\
-ADVERTISED_2baseMLD2_Full |\
-ADVERTISED_2baseKR2_Full | \
-ADVERTISED_4baseKR4_Full | \
-ADVERTISED_4baseCR4_Full | \
-ADVERTISED_4baseSR4_Full | \
-ADVERTISED_4baseLR4_Full | \
-ADVERTISED_56000baseKR4_Full | \
-ADVERTISED_56000baseCR4_Full | \
-ADVERTISED_56000baseSR4_Full | \
-ADVERTISED_56000baseLR4_Full)
-
-#define ALL_ADVERTISED_FLAGS   \
-   (ADVERTISED_Autoneg |   \
-ADVERTISED_TP |\
-ADVERTISED_AUI |   \
-ADVERTISED_MII |   \
-ADVERTISED_FIBRE | \
-ADVERTISED_BNC |   \
-ADVERTISED_Pause | \
-ADVERTISED_Asym_Pause |\
-ADVERTISED_Backplane | \
-ALL_ADVERTISED_MODES)
-
 #ifndef HAVE_NETIF_MSG
 enum {
NETIF_MSG_DRV   = 0x0001,
@@ -293,6 +257,43 @@ static void get_mac_addr(char *src, unsigned char *dest)
}
 }
 
+static int parse_hex_u32_bitmap(const char *s,
+   unsigned int nbits, u32 *result)
+{
+   const unsigned int nwords = __KERNEL_DIV_ROUND_UP(nbits, 32);
+   size_t slen = strlen(s);
+   size_t i;
+
+   /* ignore optional '0x' prefix */
+   if ((slen > 2) && (strncasecmp(s, "0x", 2) == 0)) {
+   slen -= 2;
+   s += 2;
+   }
+
+   if (slen > 8 * nwords)  /* up to 2 digits per byte */
+   return -1;
+
+   memset(result, 0, 4 * nwords);
+   for (i = 0 ; i < slen ; ++i) {
+   const unsigned int shift = (slen - 1 - i) * 4;
+   u32 *dest = [shift / 32];
+   u32 nibble;
+
+   if ('a' <= s[i] && s[i] <= 'f')
+   nibble = 0xa + (s[i] - 'a');
+   else if ('A' <= s[i] && s[i] <= 'F')
+   nibble = 0xa + (s[i] - 'A');
+   else if ('0' <= s[i] && s[i] <= '9')
+   nibble = (s[i] - '0');
+   else
+   return -1;
+
+   *dest |= (nibble << (shift % 32));
+   }
+
+   return 0;
+}
+
 static void parse_generic_cmdline(struct cmd_context *ctx,
  int *changed,
  struct cmdline_info *info,
@@ -472,64 +473,157 @@ static int do_version(struct cmd_context *ctx)
return 0;
 }
 
-static void dump_link_caps(const char *prefix, const char *an_prefix, u32 mask,
-  int link_mode_only);
+/* link mode routines */
 
-static void dump_supported(struct ethtool_cmd *ep)
+static __ETHTOOL_DECLARE_LINK_MODE_MASK(all_advertised_modes);
+static __ETHTOOL_DECLARE_LINK_MODE_MASK(all_advertised_flags);
+
+static void init_global_link_mode_masks(void)
 {
-   u32 mask = ep->supported;
+   static const enum ethtool_link_mode_bit_indices
+   all_advertised_modes_bits[] = {
+   ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+   ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+   ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+   ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+   ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+   ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+   ETHTOOL_LINK_MODE_1000baseKX_Full_BIT,
+   ETHTOOL_LINK_MODE_2500baseX_Full_BIT,
+   ETHTOOL_LINK_MODE_1baseT_Full_BIT,
+

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Ben Greear




On 03/24/2016 10:33 PM, Cong Wang wrote:

On Thu, Mar 24, 2016 at 10:13 PM, Ben Greear  wrote:



On 03/24/2016 10:06 PM, Cong Wang wrote:


On Thu, Mar 24, 2016 at 9:34 PM, Ben Greear 
wrote:




On 03/24/2016 06:44 PM, Vijay Pandurangan wrote:



Oops, I think my last email didn't go through due to an inadvertent
html attachment from my phone mail client.

Can you send us a copy of a packet you're sending and/or confirm that
the IP and UDP4 checksums are set correctly in the packet?

If those are set right, I think we need to read through the networking
code again to see why this is broken...




Wireshark decodes the packet as having no checksum errors.

I think the contents of the packet is correct, but the 'ip_summed'
field is set incorrectly to 'NONE' when transmitting on a raw packet
socket.



Yeah, these bugs are all due to the different interpretations of
ip_summed on TX path and RX path. I think the following patch
should work, if the comments don't mislead me. Could you give
it a try?

For the long term, we need to unify the meaning of ip_summed
on TX path and RX path, or at least translate it in skb_scrub_packet().



I can test this tomorrow, but I think it will not work.  I'm not sending raw
IP frames, I'm sending full ethernet frames.  Socket is PF_PACKET, SOCK_RAW.

Your patch may still be useful for others though?


Here we go:

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1ecfa71..ab66080 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket
*sock, struct msghdr *msg,
 goto out_unlock;
 }

+   skb->ip_summed = CHECKSUM_UNNECESSARY;
 skb->protocol = proto;
 skb->dev = dev;
 skb->priority = sk->sk_priority;
@@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock
*po, struct sk_buff *skb,

 ph.raw = frame;

+   skb->ip_summed = CHECKSUM_UNNECESSARY;
 skb->protocol = proto;
 skb->dev = dev;
 skb->priority = po->sk.sk_priority;
@@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct
sock *sk, size_t prepad,
 skb_put(skb, linear);
 skb->data_len = len - linear;
 skb->len += len - linear;
+   skb->ip_summed = CHECKSUM_UNNECESSARY;

 return skb;
  }


I am suspicious that this will break at least some drivers.  I grepped around
for ip_summed, and found this, for instance:

davicom/dm9000.c

/* The DM9000 is not smart enough to leave fragmented packets alone. */
if (dm->ip_summed != ip_summed) {
if (ip_summed == CHECKSUM_NONE)
iow(dm, DM9000_TCCR, 0);
else
iow(dm, DM9000_TCCR, TCCR_IP | TCCR_UDP | TCCR_TCP);
dm->ip_summed = ip_summed;
}


It is taking action based on ip_summed == CHECKSUM_NONE, and your change
will probably break that.

I would suggest that we try to make any fix specific only to veth,
at least for now.  A tree-wide audit of drivers is probably required
to safely make the kind of change you propose above.

So, unless you can explain why your change is safe, then I do not plan
to test it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: [PATCH net-next v3.16]r8169: Add support for half duplex on MDI

2016-03-25 Thread Andrew Lunn

On Fri, Mar 25, 2016 at 04:29:52PM +0200, Corcodel Marian wrote:
>  This patch add support for half duplex on MDI chips.
> 
> Signed-off-by: Corcodel Marian 
> ---
>  drivers/net/ethernet/realtek/r8169.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c 
> b/drivers/net/ethernet/realtek/r8169.c
> index 77c5efb..7f6fb1f 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -1713,8 +1713,13 @@ static int rtl8169_set_speed_xmii(struct net_device 
> *dev,
>   else
>   goto out;
>  
> - if (duplex == DUPLEX_FULL)
> + if (duplex == DUPLEX_FULL) {
>   bmcr |= BMCR_FULLDPLX;
> + } else if (duplex == DUPLEX_HALF) {
> + bmcr |= ~BMCR_FULLDPLX;
> + } else {
> + goto out;
> +  }

Hi Corcodel

What PHY is actually being used? One that is supported by
drivers/net/phy/realtek.c?

It looks like the bugs you are finding are all handling stuff which is
generic, and well tested code in drivers/net/phy/phy_device.c would do
for you, with less bugs.

Maybe you want to consider throwing away the PHY code in r8169 and
just implement an standard mdio bus, use phylib and the realtek.c phy
driver?

Andrew

Re: [PATCH v2 1/1] net: macb: remove BUG_ON() and reset the queue to handle RX errors

2016-03-25 Thread David Miller

From: Cyrille Pitchen 
Date: Fri, 25 Mar 2016 10:37:34 +0100

> This patch removes two BUG_ON() used to notify about RX queue corruptions
> on macb (not gem) hardware without actually handling the error.
> 
> The new code skips corrupted frames but still processes faultless frames.
> Then it resets the RX queue before restarting the reception from a clean
> state.
> 
> This patch is a rework of an older patch proposed by Neil Armstrong:
> http://patchwork.ozlabs.org/patch/371525/
> 
> Signed-off-by: Cyrille Pitchen 
> Acked-by: Neil Armstrong 
> Acked-by: Nicolas Ferre 

Applied, thank you.

Re: [PATCH net-next 1/1] qlge: Update version to 1.00.00.35

2016-03-25 Thread David Miller

From: Manish Chopra 
Date: Fri, 25 Mar 2016 07:14:09 -0400

> Just updating version as many fixes got
> accumulated over 1.00.00.34
> 
> Signed-off-by: Manish Chopra 

Applied, but it is the 'net' tree that fixes et al. like this
should be targetting as 'net-next' is closed.

Re: [PATCH] net: phy: bcm7xxx: Add entries for Broadcom BCM7346 and BCM7362

2016-03-25 Thread David Miller

From: Jaedon Shin 
Date: Fri, 25 Mar 2016 12:46:54 +0900

> Add PHY entries for the Broadcom BCM7346 and BCM7362 chips, these are
> 40nm generation Ethernet PHY.
> 
> Fixes: 815717d1473e ("net: phy: bcm7xxx: Remove wildcard entries")
> Signed-off-by: Jaedon Shin 

Applied, thank you.

Re: [PATCH net] bpf: add missing map_flags to bpf_map_show_fdinfo

2016-03-25 Thread David Miller

From: Daniel Borkmann 
Date: Fri, 25 Mar 2016 00:30:25 +0100

> Add map_flags attribute to bpf_map_show_fdinfo(), so that tools like
> tc can check for them when loading objects from a pinned entry, e.g.
> if user intent wrt allocation (BPF_F_NO_PREALLOC) is different to the
> pinned object, it can bail out. Follow-up to 6c9059817432 ("bpf:
> pre-allocate hash map elements"), so that tc can still support this
> with v4.6.
> 
> Signed-off-by: Daniel Borkmann 
> Acked-by: Alexei Starovoitov 

Applied, thanks Daniel.

Re: dynamically set number of queues for 82598 devices in -stable

2016-03-25 Thread William Dauchy

Hi Jeff,

Any chance to get an opinion on that?

Thanks,

On Nov27 11:05, William Dauchy wrote:
> Hello Jeff,
> 
> I faced the problem described in commit 7e3f5c8:
> ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
> 
> This patch resolves an issue where users were not able to dynamically
> set number of queues for 82598 via ethtool -L
> 
> 
> I backported it for my v4.1.x build but I was wondering if this could be
> a candidate for a backport in stable tree at least for v4.1.x.

-- 
William


signature.asc
Description: PGP signature

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Craig Gallek

On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreau  wrote:
> The pattern is :
>
>   t0 : unprivileged processes 1 and 2 are listening to the same port
>(sock1@pid1) (sock2@pid2)
><-- listening -->
>
>   t1 : new processes are started to replace the old ones
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><-- listening --> <-- listening -->
>
>   t2 : new processes signal the old ones they must stop
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><--- draining --> <-- listening -->
>
>   t3 : pids 1 and 2 have finished, they go away
>  (sock3@pid3) (sock4@pid4)
> <-- gone ->  <-- listening -->

To address the documentation issues, I'd like to reference the following:
- The filter.txt document in the kernel tree:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/filter.txt
- It uses (and extends) the BPF instruction set defined in the
original BSD BPF paper: http://www.tcpdump.org/papers/bpf-usenix93.pdf
- The kernel headers define all of the user-space structures used:
  * 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/filter.h
  * 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h

I've been trying to come up with an example BPF program for use in the
example Willy gave earlier in this thread (using 4 points in time and
describing one process with two listening sockets replacing another
with two listening sockets).  Everything except the last step is
pretty straight forward using what is currently available in the
kernel.  I'm using random distribution for simplicity, but you could
easily do something smarter using more information about the specific
hardware:

t0: Evenly distrubute load to two SO_REUSEPORT sockets in a single process:
  ld rand
  mod #2
  ret a

t1: Fork a new process, create two new listening sockets in the same
group. Even after calling listen(), but before updating the BPF
program, only the first two sockets will see new connections.  The
program is trivially modified to use all 4.
  ld rand
  mod #4
  ret a

t2: Stop sending new connections to the first two sockets (draining)
  ld rand
  mod #2
  add #2
  ret a

t3: Close the first two sockets and only use the last two.  This is
the tricky step.  Before this point, the sockets are numbered 0
through 3 from the perspective of the BPF program (in the order
listen() was called).  As soon as socket 0 is closed, the last socket
in the list replaces it (what was 3 becomes 0).  When socket 1 is
closed, socket 2 moves into that position.  The assumptions about the
socket indexes in the BPF program need to change as the indexes change
as a result of closing them.

Even if you use an EBPF map as a level of indirection here, you still
have the issue that the socket indexes change as a result of some of
them leaving the group.  I'm not sure yet how to properly fix this,
but it will probably mean changing the way the socket indexing
works...  The current scheme is really an implementation detail
optimized for efficiency.  It may be worth modifying or creating a
mode which results in a stable mapping.  This will probably be
necessary for any scheme which expects sockets to regularly enter or
leave the group.

Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()

2016-03-25 Thread David Miller

From: Bjorn Helgaas 
Date: Thu, 24 Mar 2016 21:56:21 -0500

> netpoll_setup() does a dev_hold() on np->dev, the netpoll device.  If it
> fails, it correctly does a dev_put() but leaves np->dev set.  If we call
> netpoll_cleanup() after the failure, np->dev is still set so we do another
> dev_put(), which decrements the refcount an extra time.
> 
> It's questionable to call netpoll_cleanup() after netpoll_setup() fails,
> but it can be difficult to find the problem, and we can easily avoid it in
> this case.  The extra decrements can lead to hangs like this:
> 
>   unregister_netdevice: waiting for bond0 to become free. Usage count = -3
> 
> Set and clear np->dev at the points where we dev_hold() and dev_put() the
> device.
> 
> Signed-off-by: Bjorn Helgaas 

Looks good, applied and queued up for -stable.

But you probably do want to look into the stacked device issue
Neil pointed out.

Thanks!

Re: [PATCH 00/13] Enhance stmmac driver to support GMAC4.x IP

2016-03-25 Thread Alexandre Torgue


Hi,

On 03/25/2016 04:11 PM, David Miller wrote:


It is absolutely not appropriate to submit new feature patches
at this time.

Please resubmit this after the net-next tree opens back out.



No pb, I will wait and resend.

Regards

Alex



Thank you.

Re: [PATCH 00/13] Enhance stmmac driver to support GMAC4.x IP

2016-03-25 Thread David Miller


It is absolutely not appropriate to submit new feature patches
at this time.

Please resubmit this after the net-next tree opens back out.

Thank you.

Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Ben Greear




On 03/24/2016 10:24 PM, Vijay Pandurangan wrote:

On Fri, Mar 25, 2016 at 1:07 AM, Ben Greear  wrote:

On 03/24/2016 09:45 PM, Vijay Pandurangan wrote:


Actually, maybe they should be set to CHECKSUM_PARTIAL if we want veth
to drop the packets if they have bad checksums before they hit the
application level.



VETH is pretty special in that when you transmit a frame on one
device, it's pair receives it, and unless there is RAM corruption
or bugs in the kernel, then it cannot be corrupted.


Yeah, you're right that that's an optimization. However, I think that
we should first ensure that

a->veth->b

operates exactly like:

a->physical eth 1 -> physical eth 2->b

in all cases.  Once we have that working everywhere we could think
about optimizations.


If we're willing to refactor, we could implement the optimization by
allowing veth devices to know whether their immediate peer is. If a
veth knows it's talking to another veth, it could under some
circumstances elide checksum calculation and verification.  I'm not
sure what abstractions that would break, though. What do you guys
think?


veth ALWAYS transmits to another VETH.  The problem is that when veth is
given a packet to transmit, it is difficult to know where that packet
came from.

And, adding software checksumming to veth for every frame would be a huge
performance hit.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

[PATCH net-next v3.16]r8169: Add support for half duplex on MDI

2016-03-25 Thread Corcodel Marian

 This patch add support for half duplex on MDI chips.

Signed-off-by: Corcodel Marian 
---
 drivers/net/ethernet/realtek/r8169.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 77c5efb..7f6fb1f 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1713,8 +1713,13 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
else
goto out;
 
-   if (duplex == DUPLEX_FULL)
+   if (duplex == DUPLEX_FULL) {
bmcr |= BMCR_FULLDPLX;
+   } else if (duplex == DUPLEX_HALF) {
+   bmcr |= ~BMCR_FULLDPLX;
+   } else {
+   goto out;
+}
}
 
rtl_writephy(tp, MII_BMCR, bmcr);
-- 
2.1.4

Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR

2016-03-25 Thread Corcodel Marian

On Fri, 25 Mar 2016 14:33:02 +0100
Phil Sutter  wrote:

> Hi,
> 
> Your patch submissions are getting better, also good to see you're
> finally using git-send-email. A few things need to be corrected
> though:
> 
> Subject line:
> - The 'vNN' part in brackets is supposed to be the reroll-count (see
>   'git format-patch -h' for details), not kernel version. You're
>   supposed to submit your patches for either net or net-next, not a
>   specific kernel version like 3.16.
> - Missing space after the closing bracket.
> - Extra space after 'r8169:'.
> 
> On Fri, Mar 25, 2016 at 03:01:06PM +0200, Corcodel Marian wrote:
> >  This patch correct value on MII_BMCR register ald value 0
> >  have target on reserved register first 2 bytes from MII_BMCR
> >  speed 10 is flipped value on BMCR_SPEED100
> 
> This is very hard to understand. I *guess* you want to say one should
> not write 0 to MII_BMCR since it overwrites reserved bits, but it's
> really not clear. Don't you know someone who can properly translate
> from Romanian to English?
> 
> Also detailed instructions on how to trigger the problem you are
> fixing for would be good. In detail: Which specific hardware was
> used, in which situation did the problem occur, how did it behave in
> that situation and what was the expected behaviour?
> 
> Cheers, Phil

> This is very hard to understand. I *guess* you want to say one should
> not write 0 to MII_BMCR since it overwrites reserved bits, but it's
> really not clear. Don't you know someone who can properly translate
> from Romanian to English?
>Yes this it.

Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR

2016-03-25 Thread Phil Sutter

Hi,

Your patch submissions are getting better, also good to see you're
finally using git-send-email. A few things need to be corrected though:

Subject line:
- The 'vNN' part in brackets is supposed to be the reroll-count (see
  'git format-patch -h' for details), not kernel version. You're
  supposed to submit your patches for either net or net-next, not a
  specific kernel version like 3.16.
- Missing space after the closing bracket.
- Extra space after 'r8169:'.

On Fri, Mar 25, 2016 at 03:01:06PM +0200, Corcodel Marian wrote:
>  This patch correct value on MII_BMCR register ald value 0
>  have target on reserved register first 2 bytes from MII_BMCR
>  speed 10 is flipped value on BMCR_SPEED100

This is very hard to understand. I *guess* you want to say one should
not write 0 to MII_BMCR since it overwrites reserved bits, but it's
really not clear. Don't you know someone who can properly translate from
Romanian to English?

Also detailed instructions on how to trigger the problem you are fixing
for would be good. In detail: Which specific hardware was used, in which
situation did the problem occur, how did it behave in that situation and
what was the expected behaviour?

Cheers, Phil

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Eric Dumazet

On Fri, 2016-03-25 at 12:21 +0100, Yann Ylavic wrote:

> Not my intention, you guys know what's the better for the kernel and its APIs.
> My concern is (was) admittedly due to my own lack of knowledge of
> (e)BPF, hence how much of kernel internals I'd need to know to make
> the SO_REUSEPORT work in my case.
> 
> But sure, any of the help suggested above would make it go away, I'll
> stay tuned.

Most of the time, the BPF filter can be trivial.

If your server has a multi queue NIC with 16 queues on a 16 cpus host,
it would be natural to use 16 listeners, to properly use the NIC
features (RSS).

BPF program would be

A = QUEUE_NUMBER(skb);
RETURN A;

If you chose to have 4 listeners instead, for whatever reasons.

A = QUEUE_NUMBER(skb);
A = A & 3;
RETURN A;

You also can use 

A = CURRENT_CPU;
RETURN A;

[PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR

2016-03-25 Thread Corcodel Marian

 This patch correct value on MII_BMCR register ald value 0
 have target on reserved register first 2 bytes from MII_BMCR
 speed 10 is flipped value on BMCR_SPEED100

Signed-off-by: Corcodel Marian 
---
 drivers/net/ethernet/realtek/r8169.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index a450656..77c5efb 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1707,7 +1707,7 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
giga_ctrl = 0;
 
if (speed == SPEED_10)
-   bmcr = 0;
+   bmcr = ~BMCR_SPEED100;
else if (speed == SPEED_100)
bmcr = BMCR_SPEED100;
else
-- 
2.1.4

[PATCH net-next 1/1] qlge: Update version to 1.00.00.35

2016-03-25 Thread Manish Chopra

Just updating version as many fixes got
accumulated over 1.00.00.34

Signed-off-by: Manish Chopra 
---
 drivers/net/ethernet/qlogic/qlge/qlge.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h 
b/drivers/net/ethernet/qlogic/qlge/qlge.h
index ef33270..6d31f92 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -18,7 +18,7 @@
  */
 #define DRV_NAME   "qlge"
 #define DRV_STRING "QLogic 10 Gigabit PCI-E Ethernet Driver "
-#define DRV_VERSION"1.00.00.34"
+#define DRV_VERSION"1.00.00.35"
 
 #define WQ_ADDR_ALIGN  0x3 /* 4 byte alignment */
 
-- 
2.7.2

Re: [PATCH net-next v3.16]r8169: Implement definitions from linux/mii.h header

2016-03-25 Thread Sergei Shtylyov


Hello.

On 3/25/2016 1:25 PM, Corcodel Marian wrote:


  Add definitions from mii and inhibit 0 to advertise.

Signed-off-by: Corcodel Marian 
---
  drivers/net/ethernet/realtek/r8169.c | 22 ++
  1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index a57b650..a450656 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1679,29 +1679,19 @@ static int rtl8169_set_speed_xmii(struct net_device 
*dev,
int auto_nego;

auto_nego = rtl_readphy(tp, MII_ADVERTISE);
-   auto_nego &= ~(ADVERTISE_10HALF | ADVERTISE_10FULL |
-   ADVERTISE_100HALF | ADVERTISE_100FULL);
-
-   if (adv & ADVERTISED_10baseT_Half)
-   auto_nego |= ADVERTISE_10HALF;
-   if (adv & ADVERTISED_10baseT_Full)
-   auto_nego |= ADVERTISE_10FULL;
-   if (adv & ADVERTISED_100baseT_Half)
-   auto_nego |= ADVERTISE_100HALF;
-   if (adv & ADVERTISED_100baseT_Full)
-   auto_nego |= ADVERTISE_100FULL;
+   auto_nego &= ~(0 | ADVERTISE_10HALF | ADVERTISE_10FULL |


   ORing with 0 doesn't change the result.


+   ADVERTISE_100HALF | ADVERTISE_100FULL |
+   ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM);

+   auto_nego |= ethtool_adv_to_mii_adv_t(adv);
auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;

giga_ctrl = rtl_readphy(tp, MII_CTRL1000);
-   giga_ctrl &= ~(ADVERTISE_1000FULL | ADVERTISE_1000HALF);
+   giga_ctrl &= ~(0 | ADVERTISE_1000FULL | ADVERTISE_1000HALF);


   Likewise.

[...]

MBR, Sergei

tg3 link autoneg on one port results in packets loss on other port

2016-03-25 Thread vincent Richard

Hi all,
I am using a HP DL360p server with broadcom BCM5719 ethernet card
handled by tg3 driver.
When the link is renegotiated on one ethernet port, out packets get
lost on other ethernet ports. The problem is easily reproducible by
looping on switch port up/down.
Instrumenting the driver, I observed that tg3_setup_phy() may last
more than 40 msec while holding the spin lock.
Is it possible to modify the tg3 driver so as to minimize link
renegotiation  impact ?
Thanks in advance for your help.

Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()

2016-03-25 Thread Neil Horman

On Thu, Mar 24, 2016 at 09:56:21PM -0500, Bjorn Helgaas wrote:
> netpoll_setup() does a dev_hold() on np->dev, the netpoll device.  If it
> fails, it correctly does a dev_put() but leaves np->dev set.  If we call
> netpoll_cleanup() after the failure, np->dev is still set so we do another
> dev_put(), which decrements the refcount an extra time.
> 
> It's questionable to call netpoll_cleanup() after netpoll_setup() fails,
> but it can be difficult to find the problem, and we can easily avoid it in
> this case.  The extra decrements can lead to hangs like this:
> 
>   unregister_netdevice: waiting for bond0 to become free. Usage count = -3
> 
> Set and clear np->dev at the points where we dev_hold() and dev_put() the
> device.
> 
> Signed-off-by: Bjorn Helgaas 
> ---
>  net/core/netpoll.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 94acfc8..a57bd17 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -603,7 +603,6 @@ int __netpoll_setup(struct netpoll *np, struct net_device 
> *ndev)
>   const struct net_device_ops *ops;
>   int err;
>  
> - np->dev = ndev;
>   strlcpy(np->dev_name, ndev->name, IFNAMSIZ);
>   INIT_WORK(>cleanup_work, netpoll_async_cleanup);
>  
> @@ -670,6 +669,7 @@ int netpoll_setup(struct netpoll *np)
>   goto unlock;
>   }
>   dev_hold(ndev);
> + np->dev = ndev;
>  
>   if (netdev_master_upper_dev_get(ndev)) {
>   np_err(np, "%s is a slave device, aborting\n", np->dev_name);
> @@ -770,6 +770,7 @@ int netpoll_setup(struct netpoll *np)
>   return 0;
>  
>  put:
> + np->dev = NULL;
>   dev_put(ndev);
>  unlock:
>   rtnl_unlock();
> 

Is this safe for stacked devices?  It makes good sense for the typical case, but
if you attempt to setup a netpoll client on a bridge/bond/vlan, etc, the lower
device will get its own netpoll struct registered and have no associated np->dev
pointer.  It not be a real problem in practice, But you probably want to check
to make sure that stacked  devices which recursively call the netpoll api don't
do anyting with the np->dev pointer.

Regards
Neil

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Yann Ylavic

On Fri, Mar 25, 2016 at 9:53 AM, Willy Tarreau wrote:
>
> On Thu, Mar 24, 2016 at 11:49:41PM -0700, Eric Dumazet wrote:
>> Everything is possible, but do not complain because BPF went in the
>> kernel before your changes.
>
> Don't get me wrong, I'm not complaining, I'm more asking for help to
> try to elaborate the alternate solution. I understood well what my
> proposal did because it was pretty straightforward, and the new one
> I'll have to do is of an immense complexity for me by now, since it
> will require learning a new language and finding doc on how all this
> works together. But as I said I'm totally sold to the benefits it can
> provide for large scale deployments and I'm well aware of the ugly
> socket scan there was in the previous one.

+1 (all the work done on the listeners, including SO_REUSEPORT, and
until lately very much appreciated in any case!)

>
> If you could share a pointer to some good starter documentation for this,
> I would appreciate it. I really have no idea where to start from and the
> only elements I found on the net didn't give a single hint regarding all
> this :-/

+1

On Fri, Mar 25, 2016 at 1:24 AM, David Miller wrote:
>
> If we encapsulate this into libraries and helper wrappers, there is
> no reason web server developers should be looking at these details
> anyways.

++1

>
> Please don't make a mountain out of a mole-hill.

Not my intention, you guys know what's the better for the kernel and its APIs.
My concern is (was) admittedly due to my own lack of knowledge of
(e)BPF, hence how much of kernel internals I'd need to know to make
the SO_REUSEPORT work in my case.

But sure, any of the help suggested above would make it go away, I'll
stay tuned.

[PATCH net-next v3.16]r8169: Implement definitions from linux/mii.h header

2016-03-25 Thread Corcodel Marian

 Add definitions from mii and inhibit 0 to advertise.

Signed-off-by: Corcodel Marian 
---
 drivers/net/ethernet/realtek/r8169.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index a57b650..a450656 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1679,29 +1679,19 @@ static int rtl8169_set_speed_xmii(struct net_device 
*dev,
int auto_nego;
 
auto_nego = rtl_readphy(tp, MII_ADVERTISE);
-   auto_nego &= ~(ADVERTISE_10HALF | ADVERTISE_10FULL |
-   ADVERTISE_100HALF | ADVERTISE_100FULL);
-
-   if (adv & ADVERTISED_10baseT_Half)
-   auto_nego |= ADVERTISE_10HALF;
-   if (adv & ADVERTISED_10baseT_Full)
-   auto_nego |= ADVERTISE_10FULL;
-   if (adv & ADVERTISED_100baseT_Half)
-   auto_nego |= ADVERTISE_100HALF;
-   if (adv & ADVERTISED_100baseT_Full)
-   auto_nego |= ADVERTISE_100FULL;
+   auto_nego &= ~(0 | ADVERTISE_10HALF | ADVERTISE_10FULL |
+   ADVERTISE_100HALF | ADVERTISE_100FULL |
+   ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM);
 
+   auto_nego |= ethtool_adv_to_mii_adv_t(adv);
auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
 
giga_ctrl = rtl_readphy(tp, MII_CTRL1000);
-   giga_ctrl &= ~(ADVERTISE_1000FULL | ADVERTISE_1000HALF);
+   giga_ctrl &= ~(0 | ADVERTISE_1000FULL | ADVERTISE_1000HALF);
 
/* The 8100e/8101e/8102e do Fast Ethernet only. */
if (tp->mii.supports_gmii) {
-   if (adv & ADVERTISED_1000baseT_Half)
-   giga_ctrl |= ADVERTISE_1000HALF;
-   if (adv & ADVERTISED_1000baseT_Full)
-   giga_ctrl |= ADVERTISE_1000FULL;
+   giga_ctrl |= ethtool_adv_to_mii_ctrl1000_t(adv);
} else if (adv & (ADVERTISED_1000baseT_Half |
  ADVERTISED_1000baseT_Full)) {
netif_info(tp, link, dev,
-- 
2.1.4

[PATCH v2 1/1] net: macb: remove BUG_ON() and reset the queue to handle RX errors

2016-03-25 Thread Cyrille Pitchen

This patch removes two BUG_ON() used to notify about RX queue corruptions
on macb (not gem) hardware without actually handling the error.

The new code skips corrupted frames but still processes faultless frames.
Then it resets the RX queue before restarting the reception from a clean
state.

This patch is a rework of an older patch proposed by Neil Armstrong:
http://patchwork.ozlabs.org/patch/371525/

Signed-off-by: Cyrille Pitchen 
Acked-by: Neil Armstrong 
Acked-by: Nicolas Ferre 
---

ChangeLog

v1 -> v2
- add Acked-by Neil Armstrong and Nicolas Ferre
- use bool type instead of int for the 'reset_rx_queue' variable
- move longer declarations before shorter declarations

 drivers/net/ethernet/cadence/macb.c | 59 ++---
 1 file changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index f715352a9b85..6c3dc27cb100 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -917,7 +917,10 @@ static int macb_rx_frame(struct macb *bp, unsigned int 
first_frag,
unsigned int frag_len = bp->rx_buffer_size;
 
if (offset + frag_len > len) {
-   BUG_ON(frag != last_frag);
+   if (unlikely(frag != last_frag)) {
+   dev_kfree_skb_any(skb);
+   return -1;
+   }
frag_len = len - offset;
}
skb_copy_to_linear_data_offset(skb, offset,
@@ -945,8 +948,23 @@ static int macb_rx_frame(struct macb *bp, unsigned int 
first_frag,
return 0;
 }
 
+static inline void macb_init_rx_ring(struct macb *bp)
+{
+   dma_addr_t addr;
+   int i;
+
+   addr = bp->rx_buffers_dma;
+   for (i = 0; i < RX_RING_SIZE; i++) {
+   bp->rx_ring[i].addr = addr;
+   bp->rx_ring[i].ctrl = 0;
+   addr += bp->rx_buffer_size;
+   }
+   bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP);
+}
+
 static int macb_rx(struct macb *bp, int budget)
 {
+   bool reset_rx_queue = false;
int received = 0;
unsigned int tail;
int first_frag = -1;
@@ -972,10 +990,18 @@ static int macb_rx(struct macb *bp, int budget)
 
if (ctrl & MACB_BIT(RX_EOF)) {
int dropped;
-   BUG_ON(first_frag == -1);
+
+   if (unlikely(first_frag == -1)) {
+   reset_rx_queue = true;
+   continue;
+   }
 
dropped = macb_rx_frame(bp, first_frag, tail);
first_frag = -1;
+   if (unlikely(dropped < 0)) {
+   reset_rx_queue = true;
+   continue;
+   }
if (!dropped) {
received++;
budget--;
@@ -983,6 +1009,26 @@ static int macb_rx(struct macb *bp, int budget)
}
}
 
+   if (unlikely(reset_rx_queue)) {
+   unsigned long flags;
+   u32 ctrl;
+
+   netdev_err(bp->dev, "RX queue corruption: reset it\n");
+
+   spin_lock_irqsave(>lock, flags);
+
+   ctrl = macb_readl(bp, NCR);
+   macb_writel(bp, NCR, ctrl & ~MACB_BIT(RE));
+
+   macb_init_rx_ring(bp);
+   macb_writel(bp, RBQP, bp->rx_ring_dma);
+
+   macb_writel(bp, NCR, ctrl | MACB_BIT(RE));
+
+   spin_unlock_irqrestore(>lock, flags);
+   return received;
+   }
+
if (first_frag != -1)
bp->rx_tail = first_frag;
else
@@ -1523,15 +1569,8 @@ static void gem_init_rings(struct macb *bp)
 static void macb_init_rings(struct macb *bp)
 {
int i;
-   dma_addr_t addr;
 
-   addr = bp->rx_buffers_dma;
-   for (i = 0; i < RX_RING_SIZE; i++) {
-   bp->rx_ring[i].addr = addr;
-   bp->rx_ring[i].ctrl = 0;
-   addr += bp->rx_buffer_size;
-   }
-   bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP);
+   macb_init_rx_ring(bp);
 
for (i = 0; i < TX_RING_SIZE; i++) {
bp->queues[0].tx_ring[i].addr = 0;
-- 
1.8.2.2

[PATCH 13/13] stmmac: update MAINTAINERS

2016-03-25 Thread Alexandre TORGUE

Signed-off-by: Alexandre TORGUE 

diff --git a/MAINTAINERS b/MAINTAINERS
index b70294e..394e233 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3293,6 +3293,7 @@ F:Documentation/powerpc/cxlflash.txt
 
 STMMAC ETHERNET DRIVER
 M: Giuseppe Cavallaro 
+M: Alexandre Torgue 
 L: netdev@vger.kernel.org
 W: http://www.stlinux.com
 S: Supported
-- 
1.9.1

[PATCH 02/13] stmmac: rework the routines to show the ring status

2016-03-25 Thread Alexandre TORGUE

To avoid lot of check in stmmac_main for display ring management
and support the GMAC4 chip, the display_ring function is moved
into dedicated descriptor file.

Signed-off-by: Alexandre TORGUE 
Signed-off-by: Giuseppe Cavallaro 

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 797a913..6cea61b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -380,6 +380,8 @@ struct stmmac_desc_ops {
 u64(*get_timestamp) (void *desc, u32 ats);
/* get rx timestamp status */
int (*get_rx_timestamp_status) (void *desc, u32 ats);
+   /* Display ring */
+   void (*display_ring)(void *head, unsigned int size, bool rx);
 };
 
 extern const struct stmmac_desc_ops enh_desc_ops;
diff --git a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c 
b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
index cfb018c..3e1b249 100644
--- a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
@@ -411,6 +411,26 @@ static int enh_desc_get_rx_timestamp_status(void *desc, 
u32 ats)
}
 }
 
+static void enh_desc_display_ring(void *head, unsigned size, bool rx)
+{
+   struct dma_extended_desc *ep = (struct dma_extended_desc *)head;
+   int i;
+
+   pr_info("Extended %s descriptor ring:\n", rx ? "RX" : "TX");
+
+   for (i = 0; i < size; i++) {
+   u64 x;
+
+   x = *(u64 *)ep;
+   pr_info("%d [0x%x]: 0x%x 0x%x 0x%x 0x%x\n",
+   i, (unsigned int)virt_to_phys(ep),
+   (unsigned int)x, (unsigned int)(x >> 32),
+   ep->basic.des2, ep->basic.des3);
+   ep++;
+   }
+   pr_info("\n");
+}
+
 const struct stmmac_desc_ops enh_desc_ops = {
.tx_status = enh_desc_get_tx_status,
.rx_status = enh_desc_get_rx_status,
@@ -430,4 +450,5 @@ const struct stmmac_desc_ops enh_desc_ops = {
.get_tx_timestamp_status = enh_desc_get_tx_timestamp_status,
.get_timestamp = enh_desc_get_timestamp,
.get_rx_timestamp_status = enh_desc_get_rx_timestamp_status,
+   .display_ring = enh_desc_display_ring,
 };
diff --git a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c 
b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
index e13228f..d93323d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
@@ -279,6 +279,26 @@ static int ndesc_get_rx_timestamp_status(void *desc, u32 
ats)
return 1;
 }
 
+static void ndesc_display_ring(void *head, unsigned size, bool rx)
+{
+   struct dma_desc *p = (struct dma_desc *)head;
+   int i;
+
+   pr_info("%s descriptor ring:\n", rx ? "RX" : "TX");
+
+   for (i = 0; i < size; i++) {
+   u64 x;
+
+   x = *(u64 *)p;
+   pr_info("%d [0x%x]: 0x%x 0x%x 0x%x 0x%x",
+   i, (unsigned int)virt_to_phys(p),
+   (unsigned int)x, (unsigned int)(x >> 32),
+   p->des2, p->des3);
+   p++;
+   }
+   pr_info("\n");
+}
+
 const struct stmmac_desc_ops ndesc_ops = {
.tx_status = ndesc_get_tx_status,
.rx_status = ndesc_get_rx_status,
@@ -297,4 +317,5 @@ const struct stmmac_desc_ops ndesc_ops = {
.get_tx_timestamp_status = ndesc_get_tx_timestamp_status,
.get_timestamp = ndesc_get_timestamp,
.get_rx_timestamp_status = ndesc_get_rx_timestamp_status,
+   .display_ring = ndesc_display_ring,
 };
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index d3ebfea..8103527 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -883,53 +883,22 @@ static int stmmac_init_phy(struct net_device *dev)
return 0;
 }
 
-/**
- * stmmac_display_ring - display ring
- * @head: pointer to the head of the ring passed.
- * @size: size of the ring.
- * @extend_desc: to verify if extended descriptors are used.
- * Description: display the control/status and buffer descriptors.
- */
-static void stmmac_display_ring(void *head, int size, int extend_desc)
-{
-   int i;
-   struct dma_extended_desc *ep = (struct dma_extended_desc *)head;
-   struct dma_desc *p = (struct dma_desc *)head;
-
-   for (i = 0; i < size; i++) {
-   u64 x;
-   if (extend_desc) {
-   x = *(u64 *) ep;
-   pr_info("%d [0x%x]: 0x%x 0x%x 0x%x 0x%x\n",
-   i, (unsigned int)virt_to_phys(ep),
-   (unsigned int)x, (unsigned int)(x >> 32),
-   ep->basic.des2, ep->basic.des3);
-   ep++;
-   } else {
-   x = *(u64 *) p;
-

[PATCH 06/13] stmmac: add DMA support for GMAC 4.xx

2016-03-25 Thread Alexandre TORGUE

DMA behavior is linked to descriptor management:

-descriptor mechanism (Tx for example, but it is exactly the same for RX):
-useful registers:
-DMA_CH#_TxDesc_Ring_Len: length of transmit descriptor ring
-DMA_CH#_TxDesc_List_Address: start address of the ring
-DMA_CH#_TxDesc_Tail_Pointer: address of the last
  descriptor to send + 1.
-DMA_CH#_TxDesc_Current_App_TxDesc: address of the current
descriptor

-The descriptor Tail Pointer register contains the pointer to the
 descriptor address (N). The base address and the current
 descriptor decide the address of the current descriptor that the
 DMA can process. The descriptors up to one location less than the
 one indicated by the descriptor tail pointer (N-1) are owned by
 the DMA. The DMA continues to process the descriptors until the
 following condition occurs:
 "current descriptor pointer == Descriptor Tail pointer"

Then the DMA goes into suspend mode. The application must perform
a write to descriptor tail pointer register and update the tail
pointer to have the following condition and to start a new transfer:
"current descriptor pointer < Descriptor tail pointer"

The DMA automatically wraps around the base address when the end
of ring is reached.

Up to 8 DMA could be use but currently we only use one (channel0)

Signed-off-by: Alexandre TORGUE 
Signed-off-by: Giuseppe Cavallaro 

diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile 
b/drivers/net/ethernet/stmicro/stmmac/Makefile
index fa000fd..9398ace 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -3,7 +3,7 @@ stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o 
ring_mode.o  \
  chain_mode.o dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \
  dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \
  mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o dwmac4_descs.o  \
- $(stmmac-y)
+ dwmac4_dma.o dwmac4_lib.o $(stmmac-y)
 
 # Ordering matters. Generic driver must be last.
 obj-$(CONFIG_STMMAC_PLATFORM)  += stmmac-platform.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index ea7eb0d..2a5126e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -41,6 +41,8 @@
 /* Synopsys Core versions */
 #defineDWMAC_CORE_3_40 0x34
 #defineDWMAC_CORE_3_50 0x35
+#defineDWMAC_CORE_4_00 0x40
+#define STMMAC_CHAN0   0   /* Always supported and default for all chips */
 
 #define DMA_TX_SIZE 512
 #define DMA_RX_SIZE 512
@@ -270,6 +272,7 @@ enum dma_irq_status {
 #defineCORE_PCS_ANE_COMPLETE   (1 << 5)
 #defineCORE_PCS_LINK_STATUS(1 << 6)
 #defineCORE_RGMII_IRQ  (1 << 7)
+#define CORE_IRQ_MTL_RX_OVERFLOW   BIT(8)
 
 /* Physical Coding Sublayer */
 struct rgmii_adv {
@@ -301,8 +304,10 @@ struct dma_features {
/* 802.3az - Energy-Efficient Ethernet (EEE) */
unsigned int eee;
unsigned int av;
+   unsigned int tsoen;
/* TX and RX csum */
unsigned int tx_coe;
+   unsigned int rx_coe;
unsigned int rx_coe_type1;
unsigned int rx_coe_type2;
unsigned int rxfifo_over_2048;
@@ -425,6 +430,11 @@ struct stmmac_dma_ops {
   struct dma_features *dma_cap);
/* Program the HW RX Watchdog */
void (*rx_watchdog) (void __iomem *ioaddr, u32 riwt);
+   void (*set_tx_ring_len)(void __iomem *ioaddr, u32 len);
+   void (*set_rx_ring_len)(void __iomem *ioaddr, u32 len);
+   void (*set_rx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
+   void (*set_tx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
+   void (*enable_tso)(void __iomem *ioaddr, bool en, u32 chan);
 };
 
 struct mac_device_info;
@@ -473,6 +483,7 @@ struct stmmac_hwtimestamp {
 };
 
 extern const struct stmmac_hwtimestamp stmmac_ptp;
+extern const struct stmmac_mode_ops dwmac4_ring_mode_ops;
 
 struct mac_link {
int port;
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
new file mode 100644
index 000..116151c
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -0,0 +1,354 @@
+/*
+ * This is the driver for the GMAC on-chip Ethernet controller for ST SoCs.
+ * DWC Ether MAC version 4.xx  has been used for  developing this code.
+ *
+ * This contains the functions to handle the dma.
+ *
+ * Copyright (C) 2015  STMicroelectronics Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ *

[PATCH 08/13] stmmac: enhance mmc counter management

2016-03-25 Thread Alexandre TORGUE

For gmac3, the MMC addr map is: 0x100 - 0x2fc
For gmac4, the MMC addr map is: 0x700 - 0x8fc

So instead of adding 0x600 to the IO address when setup the mmc,
the RMON base address is saved inside the private structure and
then used to manage the counters.

Signed-off-by: Giuseppe Cavallaro 
Signed-off-by: Alexandre TORGUE 

diff --git a/drivers/net/ethernet/stmicro/stmmac/mmc.h 
b/drivers/net/ethernet/stmicro/stmmac/mmc.h
index 192c249..38a1a56 100644
--- a/drivers/net/ethernet/stmicro/stmmac/mmc.h
+++ b/drivers/net/ethernet/stmicro/stmmac/mmc.h
@@ -35,6 +35,10 @@
 * current value.*/
 #define MMC_CNTRL_PRESET   0x10
 #define MMC_CNTRL_FULL_HALF_PRESET 0x20
+
+#define MMC_GMAC4_OFFSET   0x700
+#define MMC_GMAC3_X_OFFSET 0x100
+
 struct stmmac_counters {
unsigned int mmc_tx_octetcount_gb;
unsigned int mmc_tx_framecount_gb;
diff --git a/drivers/net/ethernet/stmicro/stmmac/mmc_core.c 
b/drivers/net/ethernet/stmicro/stmmac/mmc_core.c
index 3f20bb1..ce9aa79 100644
--- a/drivers/net/ethernet/stmicro/stmmac/mmc_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/mmc_core.c
@@ -28,12 +28,12 @@
 
 /* MAC Management Counters register offset */
 
-#define MMC_CNTRL  0x0100  /* MMC Control */
-#define MMC_RX_INTR0x0104  /* MMC RX Interrupt */
-#define MMC_TX_INTR0x0108  /* MMC TX Interrupt */
-#define MMC_RX_INTR_MASK   0x010c  /* MMC Interrupt Mask */
-#define MMC_TX_INTR_MASK   0x0110  /* MMC Interrupt Mask */
-#define MMC_DEFAULT_MASK   0x
+#define MMC_CNTRL  0x00/* MMC Control */
+#define MMC_RX_INTR0x04/* MMC RX Interrupt */
+#define MMC_TX_INTR0x08/* MMC TX Interrupt */
+#define MMC_RX_INTR_MASK   0x0c/* MMC Interrupt Mask */
+#define MMC_TX_INTR_MASK   0x10/* MMC Interrupt Mask */
+#define MMC_DEFAULT_MASK   0x
 
 /* MMC TX counter registers */
 
@@ -41,115 +41,115 @@
  * _GB register stands for good and bad frames
  * _G is for good only.
  */
-#define MMC_TX_OCTETCOUNT_GB   0x0114
-#define MMC_TX_FRAMECOUNT_GB   0x0118
-#define MMC_TX_BROADCASTFRAME_G0x011c
-#define MMC_TX_MULTICASTFRAME_G0x0120
-#define MMC_TX_64_OCTETS_GB0x0124
-#define MMC_TX_65_TO_127_OCTETS_GB 0x0128
-#define MMC_TX_128_TO_255_OCTETS_GB0x012c
-#define MMC_TX_256_TO_511_OCTETS_GB0x0130
-#define MMC_TX_512_TO_1023_OCTETS_GB   0x0134
-#define MMC_TX_1024_TO_MAX_OCTETS_GB   0x0138
-#define MMC_TX_UNICAST_GB  0x013c
-#define MMC_TX_MULTICAST_GB0x0140
-#define MMC_TX_BROADCAST_GB0x0144
-#define MMC_TX_UNDERFLOW_ERROR 0x0148
-#define MMC_TX_SINGLECOL_G 0x014c
-#define MMC_TX_MULTICOL_G  0x0150
-#define MMC_TX_DEFERRED0x0154
-#define MMC_TX_LATECOL 0x0158
-#define MMC_TX_EXESSCOL0x015c
-#define MMC_TX_CARRIER_ERROR   0x0160
-#define MMC_TX_OCTETCOUNT_G0x0164
-#define MMC_TX_FRAMECOUNT_G0x0168
-#define MMC_TX_EXCESSDEF   0x016c
-#define MMC_TX_PAUSE_FRAME 0x0170
-#define MMC_TX_VLAN_FRAME_G0x0174
+#define MMC_TX_OCTETCOUNT_GB   0x14
+#define MMC_TX_FRAMECOUNT_GB   0x18
+#define MMC_TX_BROADCASTFRAME_G0x1c
+#define MMC_TX_MULTICASTFRAME_G0x20
+#define MMC_TX_64_OCTETS_GB0x24
+#define MMC_TX_65_TO_127_OCTETS_GB 0x28
+#define MMC_TX_128_TO_255_OCTETS_GB0x2c
+#define MMC_TX_256_TO_511_OCTETS_GB0x30
+#define MMC_TX_512_TO_1023_OCTETS_GB   0x34
+#define MMC_TX_1024_TO_MAX_OCTETS_GB   0x38
+#define MMC_TX_UNICAST_GB  0x3c
+#define MMC_TX_MULTICAST_GB0x40
+#define MMC_TX_BROADCAST_GB0x44
+#define MMC_TX_UNDERFLOW_ERROR 0x48
+#define MMC_TX_SINGLECOL_G 0x4c
+#define MMC_TX_MULTICOL_G  0x50
+#define MMC_TX_DEFERRED0x54
+#define MMC_TX_LATECOL 0x58
+#define MMC_TX_EXESSCOL0x5c
+#define MMC_TX_CARRIER_ERROR   0x60
+#define MMC_TX_OCTETCOUNT_G0x64
+#define MMC_TX_FRAMECOUNT_G0x68
+#define MMC_TX_EXCESSDEF   0x6c
+#define MMC_TX_PAUSE_FRAME 0x70
+#define MMC_TX_VLAN_FRAME_G0x74
 
 /* MMC RX counter registers */
-#define MMC_RX_FRAMECOUNT_GB   0x0180
-#define MMC_RX_OCTETCOUNT_GB   0x0184
-#define MMC_RX_OCTETCOUNT_G0x0188
-#define MMC_RX_BROADCASTFRAME_G0x018c
-#define MMC_RX_MULTICASTFRAME_G0x0190
-#define MMC_RX_CRC_ERROR

[PATCH 10/13] stmmac: support new GMAC4

2016-03-25 Thread Alexandre TORGUE

This patch adds the whole GMAC4 support inside the
stmmac d.d. now able to use the new HW and some new features
i.e.: TSO.
It is missing the multi-queue and split Header support at this
stage.
This patch also updates the driver version and the stmmac.txt.

Signed-off-by: Alexandre TORGUE 
Signed-off-by: Giuseppe Cavallaro 

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index eabe86b..fc60368 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -169,6 +169,9 @@ struct stmmac_extra_stats {
unsigned long mtl_rx_fifo_ctrl_active;
unsigned long mac_rx_frame_ctrl_fifo;
unsigned long mac_gmii_rx_proto_engine;
+   /* TSO */
+   unsigned long tx_tso_frames;
+   unsigned long tx_tso_nfrags;
 };
 
 /* CSR Frequency Access Defines*/
@@ -545,6 +548,7 @@ void stmmac_dwmac4_set_mac(void __iomem *ioaddr, bool 
enable);
 void dwmac_dma_flush_tx_fifo(void __iomem *ioaddr);
 extern const struct stmmac_mode_ops ring_mode_ops;
 extern const struct stmmac_mode_ops chain_mode_ops;
+extern const struct stmmac_desc_ops dwmac4_desc_ops;
 
 /**
  * stmmac_get_synopsys_id - return the SYINID.
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h 
b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 26fb855..317ce35 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -24,7 +24,7 @@
 #define __STMMAC_H__
 
 #define STMMAC_RESOURCE_NAME   "stmmaceth"
-#define DRV_MODULE_VERSION "Oct_2015"
+#define DRV_MODULE_VERSION "Dec_2015"
 
 #include 
 #include 
@@ -67,6 +67,7 @@ struct stmmac_priv {
spinlock_t tx_lock;
bool tx_path_in_lpi_mode;
struct timer_list txtimer;
+   bool tso;
 
struct dma_desc *dma_rx cacheline_aligned_in_smp;
struct dma_extended_desc *dma_erx;
@@ -129,6 +130,9 @@ struct stmmac_priv {
int irq_wake;
spinlock_t ptp_lock;
void __iomem *mmcaddr;
+   u32 rx_tail_addr;
+   u32 tx_tail_addr;
+   u32 mss;
 
 #ifdef CONFIG_DEBUG_FS
struct dentry *dbgfs_dir;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c
index fb2e7fc85..e2b98b0 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c
@@ -161,6 +161,9 @@ static const struct stmmac_stats stmmac_gstrings_stats[] = {
STMMAC_STAT(mtl_rx_fifo_ctrl_active),
STMMAC_STAT(mac_rx_frame_ctrl_fifo),
STMMAC_STAT(mac_gmii_rx_proto_engine),
+   /* TSO */
+   STMMAC_STAT(tx_tso_frames),
+   STMMAC_STAT(tx_tso_nfrags),
 };
 #define STMMAC_STATS_LEN ARRAY_SIZE(stmmac_gstrings_stats)
 
@@ -499,7 +502,7 @@ static void stmmac_get_ethtool_stats(struct net_device *dev,
int i, j = 0;
 
/* Update the DMA HW counters for dwmac10/100 */
-   if (!priv->plat->has_gmac)
+   if (priv->hw->dma->dma_diagnostic_fr)
priv->hw->dma->dma_diagnostic_fr(>stats,
 (void *) >xstats,
 priv->ioaddr);
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 977487a..cb21884 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -56,6 +56,7 @@
 #include "dwmac1000.h"
 
 #define STMMAC_ALIGN(x)L1_CACHE_ALIGN(x)
+#defineTSO_MAX_BUFF_SIZE   (SZ_16K - 1)
 
 /* Module parameters */
 #define TX_TIMEO   5000
@@ -726,13 +727,15 @@ static void stmmac_adjust_link(struct net_device *dev)
new_state = 1;
switch (phydev->speed) {
case 1000:
-   if (likely(priv->plat->has_gmac))
+   if (likely((priv->plat->has_gmac) ||
+  (priv->plat->has_gmac4)))
ctrl &= ~priv->hw->link.port;
stmmac_hw_fix_mac_speed(priv);
break;
case 100:
case 10:
-   if (priv->plat->has_gmac) {
+   if (likely((priv->plat->has_gmac) ||
+  (priv->plat->has_gmac4))) {
ctrl |= priv->hw->link.port;
if (phydev->speed == SPEED_100) {
ctrl |= priv->hw->link.speed;
@@ -977,7 +980,10 @@ static int stmmac_init_rx_buffers(struct stmmac_priv 
*priv, struct dma_desc *p,
return -EINVAL;
}
 
-   p->des2

[PATCH 09/13] stmmac: add new DT platform entries for GMAC4

2016-03-25 Thread Alexandre TORGUE

This is to support the snps,dwmac-4.00 and snps,dwmac-4.10a
and related features on the platform driver.
See binding doc for further details.

Signed-off-by: Giuseppe Cavallaro 
Signed-off-by: Alexandre TORGUE 

diff --git a/Documentation/devicetree/bindings/net/stmmac.txt 
b/Documentation/devicetree/bindings/net/stmmac.txt
index 6605d19..4d302db 100644
--- a/Documentation/devicetree/bindings/net/stmmac.txt
+++ b/Documentation/devicetree/bindings/net/stmmac.txt
@@ -59,6 +59,8 @@ Optional properties:
- snps,fb: fixed-burst
- snps,mb: mixed-burst
- snps,rb: rebuild INCRx Burst
+   - snps,tso: this enables the TSO feature otherwise it will be managed by
+   MAC HW capability register.
 - mdio: with compatible = "snps,dwmac-mdio", create and register mdio bus.
 
 Examples:
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index dcbd2a1..6ca32f7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -243,6 +243,13 @@ stmmac_probe_config_dt(struct platform_device *pdev, const 
char **mac)
plat->pmt = 1;
}
 
+   if (of_device_is_compatible(np, "snps,dwmac-4.00") ||
+   of_device_is_compatible(np, "snps,dwmac-4.10a")) {
+   plat->has_gmac4 = 1;
+   plat->pmt = 1;
+   plat->tso_en = of_property_read_bool(np, "snps,tso");
+   }
+
if (of_device_is_compatible(np, "snps,dwmac-3.610") ||
of_device_is_compatible(np, "snps,dwmac-3.710")) {
plat->enh_desc = 1;
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 4bcf5a6..3aa1870 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -138,5 +138,7 @@ struct plat_stmmacenet_data {
void (*exit)(struct platform_device *pdev, void *priv);
void *bsp_priv;
struct stmmac_axi *axi;
+   int has_gmac4;
+   bool tso_en;
 };
 #endif
-- 
1.9.1

[PATCH 04/13] stmmac: add descriptors function for GMAC 4.xx

2016-03-25 Thread Alexandre TORGUE

One of main changes of GMAC 4.xx IP is descriptors management.
-descriptors are only used in ring mode.
-A descriptor is composed of 4 32bits registers (no more extended
 descriptors)
-descriptor mechanism (Tx for example, but it is exactly the same for RX):
-useful registers:
-DMA_CH#_TxDesc_Ring_Len: length of transmit descriptor
   ring
-DMA_CH#_TxDesc_List_Address: start address of the ring
-DMA_CH#_TxDesc_Tail_Pointer: address of the last
  descriptor to send + 1.
-DMA_CH#_TxDesc_Current_App_TxDesc: address of the current
descriptor

-The descriptor Tail Pointer register contains the pointer to the
 descriptor address (N). The base address and the current
 descriptor decide the address of the current descriptor that the
 DMA can process. The descriptors up to one location less than the
 one indicated by the descriptor tail pointer (N-1) are owned by
 the DMA. The DMA continues to process the descriptors until the
 following condition occurs:
 "current descriptor pointer == Descriptor Tail pointer"

  Then the DMA goes into suspend mode. The application must perform
  a write to descriptor tail pointer register and update the tail
  pointer to have the following condition and to start a new
  transfer:
  "current descriptor pointer < Descriptor tail pointer"

  The DMA automatically wraps around the base address when the end
  of ring is reached.

-New features are available on IP:
-TSO (TCP Segmentation Offload) for TX only
-Split header: to have header and payload in 2 different buffers

Signed-off-by: Alexandre TORGUE 
Signed-off-by: Giuseppe Cavallaro 

diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile 
b/drivers/net/ethernet/stmicro/stmmac/Makefile
index b390161..fa000fd 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -2,7 +2,8 @@ obj-$(CONFIG_STMMAC_ETH) += stmmac.o
 stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o ring_mode.o \
  chain_mode.o dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \
  dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \
- mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o $(stmmac-y)
+ mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o dwmac4_descs.o  \
+ $(stmmac-y)
 
 # Ordering matters. Generic driver must be last.
 obj-$(CONFIG_STMMAC_PLATFORM)  += stmmac-platform.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 66e132f..ea7eb0d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -243,6 +243,7 @@ enum rx_frame_status {
csum_none = 0x2,
llc_snap = 0x4,
dma_own = 0x8,
+   rx_not_ls = 0x10,
 };
 
 /* Tx status */
@@ -348,6 +349,10 @@ struct stmmac_desc_ops {
void (*prepare_tx_desc) (struct dma_desc *p, int is_fs, int len,
 bool csum_flag, int mode, bool tx_own,
 bool ls);
+   void (*prepare_tso_tx_desc)(struct dma_desc *p, int is_fs, int len1,
+   int len2, bool tx_own, bool ls,
+   unsigned int tcphdrlen,
+   unsigned int tcppayloadlen);
/* Set/get the owner of the descriptor */
void (*set_tx_owner) (struct dma_desc *p);
int (*get_tx_owner) (struct dma_desc *p);
@@ -382,6 +387,8 @@ struct stmmac_desc_ops {
int (*get_rx_timestamp_status) (void *desc, u32 ats);
/* Display ring */
void (*display_ring)(void *head, unsigned int size, bool rx);
+   /* set MSS via context descriptor */
+   void (*set_mss)(struct dma_desc *p, unsigned int mss);
 };
 
 extern const struct stmmac_desc_ops enh_desc_ops;
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
new file mode 100644
index 000..33cbec3
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
@@ -0,0 +1,396 @@
+/*
+ * This contains the functions to handle the descriptors for DesignWare 
databook
+ * 4.xx.
+ *
+ * Copyright (C) 2015  STMicroelectronics Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * Author: Alexandre Torgue 
+ */
+
+#include 
+#include "common.h"
+#include "dwmac4_descs.h"
+
+static int dwmac4_wrback_get_tx_status(void *data, struct stmmac_extra_stats 
*x,
+  struct dma_desc *p,
+  void __iomem *ioaddr)
+{
+   struct net_device_stats *stats = (struct

[PATCH 11/13] Documentation: networking: update stmmac

2016-03-25 Thread Alexandre TORGUE

Update stmmac driver documentation according to new GMAC 4.x family.

Signed-off-by: Alexandre TORGUE 

diff --git a/Documentation/networking/stmmac.txt 
b/Documentation/networking/stmmac.txt
index d64a147..671fe3d 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -1,6 +1,6 @@
STMicroelectronics 10/100/1000 Synopsys Ethernet driver
 
-Copyright (C) 2007-2014  STMicroelectronics Ltd
+Copyright (C) 2007-2015  STMicroelectronics Ltd
 Author: Giuseppe Cavallaro 
 
 This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
@@ -138,6 +138,8 @@ struct plat_stmmacenet_data {
int (*init)(struct platform_device *pdev, void *priv);
void (*exit)(struct platform_device *pdev, void *priv);
void *bsp_priv;
+   int has_gmac4;
+   bool tso_en;
 };
 
 Where:
@@ -181,6 +183,8 @@ Where:
 registers.  init/exit callbacks should not use or modify
 platform data.
  o bsp_priv: another private pointer.
+ o has_gmac4: uses GMAC4 core.
+ o tso_en: Enables TSO (TCP Segmentation Offload) feature.
 
 For MDIO bus The we have:
 
@@ -278,6 +282,13 @@ Please see the following document:
  o stmmac_ethtool.c: to implement the ethtool support;
  o stmmac.h: private driver structure;
  o common.h: common definitions and VFTs;
+ o mmc_core.c/mmc.h: Management MAC Counters;
+ o stmmac_hwtstamp.c: HW timestamp support for PTP;
+ o stmmac_ptp.c: PTP 1588 clock;
+ o dwmac-.c: these are for the platform glue-logic file; e.g. dwmac-sti.c
+   for STMicroelectronics SoCs.
+
+- GMAC 3.x
  o descs.h: descriptor structure definitions;
  o dwmac1000_core.c: dwmac GiGa core functions;
  o dwmac1000_dma.c: dma functions for the GMAC chip;
@@ -289,11 +300,32 @@ Please see the following document:
  o enh_desc.c: functions for handling enhanced descriptors;
  o norm_desc.c: functions for handling normal descriptors;
  o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes;
- o mmc_core.c/mmc.h: Management MAC Counters;
- o stmmac_hwtstamp.c: HW timestamp support for PTP;
- o stmmac_ptp.c: PTP 1588 clock;
- o dwmac-.c: these are for the platform glue-logic file; e.g. dwmac-sti.c
-   for STMicroelectronics SoCs.
+
+- GMAC4.x generation
+ o dwmac4_core.c: dwmac GMAC4.x core functions;
+ o dwmac4_desc.c: functions for handling GMAC4.x descriptors;
+ o dwmac4_descs.h: descriptor definitions;
+ o dwmac4_dma.c: dma functions for the GMAC4.x chip;
+ o dwmac4_dma.h: dma definitions for the GMAC4.x chip;
+ o dwmac4.h: core definitions for the GMAC4.x chip;
+ o dwmac4_lib.c: generic GMAC4.x functions;
+
+4.12) TSO support (GMAC4.x)
+
+TSO (Tcp Segmentation Offload) feature is supported by GMAC 4.x chip family.
+When a packet is sent through TCP protocol, the TCP stack ensures that
+the SKB provided to the low level driver (stmmac in our case) matches with
+the maximum frame len (IP header + TCP header + payload <= 1500 bytes (for
+MTU set to 1500)). It means that if an application using TCP want to send a
+packet which will have a length (after adding headers) > 1514 the packet
+will be split in several TCP packets: The data payload is split and headers
+(TCP/IP ..) are added. It is done by software.
+
+When TSO is enabled, the TCP stack doesn't care about the maximum frame
+length and provide SKB packet to stmmac as it is. The GMAC IP will have to
+perform the segmentation by it self to match with maximum frame length.
+
+This feature can be enabled in device tree through "snps,tso" entry.
 
 5) Debug Information
 
-- 
1.9.1

[PATCH 12/13] stmmac: update version to Jan_2016

2016-03-25 Thread Alexandre TORGUE

This patch just updates the driver to the version fully
tested on STi platforms. This version is Jan_2016.

Signed-off-by: Alexandre TORGUE 

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h 
b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 317ce35..ff67506 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -24,7 +24,7 @@
 #define __STMMAC_H__
 
 #define STMMAC_RESOURCE_NAME   "stmmaceth"
-#define DRV_MODULE_VERSION "Dec_2015"
+#define DRV_MODULE_VERSION "Jan_2016"
 
 #include 
 #include 
-- 
1.9.1

[PATCH 05/13] stmmac: add GMAC4 DMA/CORE Header File

2016-03-25 Thread Alexandre TORGUE

This is the main header file to define all the
macro used for GMAC4 DMA and CORE parts.

Signed-off-by: Alexandre TORGUE 
Signed-off-by: Giuseppe Cavallaro 

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
new file mode 100644
index 000..c12f15c
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
@@ -0,0 +1,224 @@
+/*
+ * DWMAC4 Header file.
+ *
+ * Copyright (C) 2015  STMicroelectronics Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * Author: Alexandre Torgue 
+ */
+
+#ifndef __DWMAC4_H__
+#define __DWMAC4_H__
+
+#include "common.h"
+
+/*  MAC registers */
+#define GMAC_CONFIG0x
+#define GMAC_PACKET_FILTER 0x0008
+#define GMAC_HASH_TAB_0_31 0x0010
+#define GMAC_HASH_TAB_32_630x0014
+#define GMAC_RX_FLOW_CTRL  0x0090
+#define GMAC_QX_TX_FLOW_CTRL(x)(0x70 + x * 4)
+#define GMAC_INT_STATUS0x00b0
+#define GMAC_INT_EN0x00b4
+#define GMAC_AN_CTRL   0x00e0
+#define GMAC_AN_STATUS 0x00e4
+#define GMAC_AN_ADV0x00e8
+#define GMAC_AN_LPA0x00ec
+#define GMAC_PMT   0x00c0
+#define GMAC_VERSION   0x0110
+#define GMAC_DEBUG 0x0114
+#define GMAC_HW_FEATURE0   0x011c
+#define GMAC_HW_FEATURE1   0x0120
+#define GMAC_HW_FEATURE2   0x0124
+#define GMAC_MDIO_ADDR 0x0200
+#define GMAC_MDIO_DATA 0x0204
+#define GMAC_ADDR_HIGH(reg)(0x300 + reg * 8)
+#define GMAC_ADDR_LOW(reg) (0x304 + reg * 8)
+
+/* MAC Packet Filtering */
+#define GMAC_PACKET_FILTER_PR  BIT(0)
+#define GMAC_PACKET_FILTER_HMC BIT(2)
+#define GMAC_PACKET_FILTER_PM  BIT(4)
+
+#define GMAC_MAX_PERFECT_ADDRESSES 128
+
+/* MAC Flow Control RX */
+#define GMAC_RX_FLOW_CTRL_RFE  BIT(0)
+
+/* MAC Flow Control TX */
+#define GMAC_TX_FLOW_CTRL_TFE  BIT(1)
+#define GMAC_TX_FLOW_CTRL_PT_SHIFT 16
+
+/*  MAC Interrupt bitmap*/
+#define GMAC_INT_PMT_ENBIT(4)
+#define GMAC_INT_LPI_ENBIT(5)
+
+enum dwmac4_irq_status {
+   time_stamp_irq = 0x1000,
+   mmc_rx_csum_offload_irq = 0x0800,
+   mmc_tx_irq = 0x0400,
+   mmc_rx_irq = 0x0200,
+   mmc_irq = 0x0100,
+   pmt_irq = 0x0010,
+   pcs_ane_irq = 0x0004,
+   pcs_link_irq = 0x0002,
+};
+
+/* MAC Auto-Neg bitmap*/
+#defineGMAC_AN_CTRL_RANBIT(9)
+#defineGMAC_AN_CTRL_ANEBIT(12)
+#define GMAC_AN_CTRL_ELE   BIT(14)
+#define GMAC_AN_FD BIT(5)
+#define GMAC_AN_HD BIT(6)
+#define GMAC_AN_PSE_MASK   GENMASK(8, 7)
+#define GMAC_AN_PSE_SHIFT  7
+
+/* MAC PMT bitmap */
+enum power_event {
+   pointer_reset = 0x8000,
+   global_unicast = 0x0200,
+   wake_up_rx_frame = 0x0040,
+   magic_frame = 0x0020,
+   wake_up_frame_en = 0x0004,
+   magic_pkt_en = 0x0002,
+   power_down = 0x0001,
+};
+
+/* MAC Debug bitmap */
+#define GMAC_DEBUG_TFCSTS_MASK GENMASK(18, 17)
+#define GMAC_DEBUG_TFCSTS_SHIFT17
+#define GMAC_DEBUG_TFCSTS_IDLE 0
+#define GMAC_DEBUG_TFCSTS_WAIT 1
+#define GMAC_DEBUG_TFCSTS_GEN_PAUSE2
+#define GMAC_DEBUG_TFCSTS_XFER 3
+#define GMAC_DEBUG_TPESTS  BIT(16)
+#define GMAC_DEBUG_RFCFCSTS_MASK   GENMASK(2, 1)
+#define GMAC_DEBUG_RFCFCSTS_SHIFT  1
+#define GMAC_DEBUG_RPESTS  BIT(0)
+
+/* MAC config */
+#define GMAC_CONFIG_IPCBIT(27)
+#define GMAC_CONFIG_2K BIT(22)
+#define GMAC_CONFIG_ACSBIT(20)
+#define GMAC_CONFIG_BE BIT(18)
+#define GMAC_CONFIG_JD BIT(17)
+#define GMAC_CONFIG_JE BIT(16)
+#define GMAC_CONFIG_PS BIT(15)
+#define GMAC_CONFIG_FESBIT(14)
+#define GMAC_CONFIG_DM BIT(13)
+#define GMAC_CONFIG_DCRS   BIT(9)
+#define GMAC_CONFIG_TE BIT(1)
+#define GMAC_CONFIG_RE BIT(0)
+
+/* MAC HW features0 bitmap */
+#define GMAC_HW_FEAT_ADDMACBIT(18)
+#define GMAC_HW_FEAT_RXCOESEL  BIT(16)
+#define GMAC_HW_FEAT_TXCOSEL   BIT(14)
+#define GMAC_HW_FEAT_EEESELBIT(13)
+#define GMAC_HW_FEAT_TSSEL BIT(12)
+#define GMAC_HW_FEAT_MMCSEL

[PATCH 07/13] stmmac: add GMAC4 core support

2016-03-25 Thread Alexandre TORGUE

This is the initial support for GMAC4 that includes
the main callbacks to setup the core module: including
Csum, basic filtering, mac address and interrupt (MMC,
MTL, PMT) No LPI added.

Signed-off-by: Alexandre TORGUE 
Signed-off-by: Giuseppe Cavallaro 

diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile 
b/drivers/net/ethernet/stmicro/stmmac/Makefile
index 9398ace..0fb362d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -3,7 +3,7 @@ stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o 
ring_mode.o  \
  chain_mode.o dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \
  dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \
  mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o dwmac4_descs.o  \
- dwmac4_dma.o dwmac4_lib.o $(stmmac-y)
+ dwmac4_dma.o dwmac4_lib.o dwmac4_core.o $(stmmac-y)
 
 # Ordering matters. Generic driver must be last.
 obj-$(CONFIG_STMMAC_PLATFORM)  += stmmac-platform.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 2a5126e..eabe86b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -527,15 +527,21 @@ struct mac_device_info *dwmac1000_setup(void __iomem 
*ioaddr, int mcbins,
int perfect_uc_entries,
int *synopsys_id);
 struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id);
-
+struct mac_device_info *dwmac4_setup(void __iomem *ioaddr, int mcbins,
+int perfect_uc_entries, int *synopsys_id);
 
 void stmmac_set_mac_addr(void __iomem *ioaddr, u8 addr[6],
 unsigned int high, unsigned int low);
 void stmmac_get_mac_addr(void __iomem *ioaddr, unsigned char *addr,
 unsigned int high, unsigned int low);
-
 void stmmac_set_mac(void __iomem *ioaddr, bool enable);
 
+void stmmac_dwmac4_set_mac_addr(void __iomem *ioaddr, u8 addr[6],
+   unsigned int high, unsigned int low);
+void stmmac_dwmac4_get_mac_addr(void __iomem *ioaddr, unsigned char *addr,
+   unsigned int high, unsigned int low);
+void stmmac_dwmac4_set_mac(void __iomem *ioaddr, bool enable);
+
 void dwmac_dma_flush_tx_fifo(void __iomem *ioaddr);
 extern const struct stmmac_mode_ops ring_mode_ops;
 extern const struct stmmac_mode_ops chain_mode_ops;
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
index c12f15c..bc50952 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
@@ -221,4 +221,35 @@ enum power_event {
 /* To dump the core regs excluding  the Address Registers */
 #defineGMAC_REG_NUM132
 
+/*  MTL debug */
+#define MTL_DEBUG_TXSTSFSTSBIT(5)
+#define MTL_DEBUG_TXFSTS   BIT(4)
+#define MTL_DEBUG_TWCSTS   BIT(3)
+
+/* MTL debug: Tx FIFO Read Controller Status */
+#define MTL_DEBUG_TRCSTS_MASK  GENMASK(2, 1)
+#define MTL_DEBUG_TRCSTS_SHIFT 1
+#define MTL_DEBUG_TRCSTS_IDLE  0
+#define MTL_DEBUG_TRCSTS_READ  1
+#define MTL_DEBUG_TRCSTS_TXW   2
+#define MTL_DEBUG_TRCSTS_WRITE 3
+#define MTL_DEBUG_TXPAUSED BIT(0)
+
+/* MAC debug: GMII or MII Transmit Protocol Engine Status */
+#define MTL_DEBUG_RXFSTS_MASK  GENMASK(5, 4)
+#define MTL_DEBUG_RXFSTS_SHIFT 4
+#define MTL_DEBUG_RXFSTS_EMPTY 0
+#define MTL_DEBUG_RXFSTS_BT1
+#define MTL_DEBUG_RXFSTS_AT2
+#define MTL_DEBUG_RXFSTS_FULL  3
+#define MTL_DEBUG_RRCSTS_MASK  GENMASK(2, 1)
+#define MTL_DEBUG_RRCSTS_SHIFT 1
+#define MTL_DEBUG_RRCSTS_IDLE  0
+#define MTL_DEBUG_RRCSTS_RDATA 1
+#define MTL_DEBUG_RRCSTS_RSTAT 2
+#define MTL_DEBUG_RRCSTS_FLUSH 3
+#define MTL_DEBUG_RWCSTS   BIT(0)
+
+extern const struct stmmac_dma_ops dwmac4_dma_ops;
+extern const struct stmmac_dma_ops dwmac410_dma_ops;
 #endif /* __DWMAC4_H__ */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
new file mode 100644
index 000..4f7283d
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
@@ -0,0 +1,407 @@
+/*
+ * This is the driver for the GMAC on-chip Ethernet controller for ST SoCs.
+ * DWC Ether MAC version 4.00  has been used for developing this code.
+ *
+ * This only implements the mac core functions for this chip.
+ *
+ * Copyright (C) 2015  STMicroelectronics Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software

[PATCH 03/13] stmmac: rework synopsys id read, moved to dwmac setup

2016-03-25 Thread Alexandre TORGUE

synopsys_uid is only used once after setup, to get synopsys_id
by using shitf/mask operation. It's no longer used then.
So, remove this temporary variable and directly compute
synopsys_id from setup routine.

Acked-by: Giuseppe Cavallaro 
Signed-off-by: Fabrice Gasnier 
Signed-off-by: Alexandre TORGUE 

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 6cea61b..66e132f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -498,7 +498,6 @@ struct mac_device_info {
const struct stmmac_hwtimestamp *ptp;
struct mii_regs mii;/* MII register Addresses */
struct mac_link link;
-   unsigned int synopsys_uid;
void __iomem *pcsr; /* vpointer to device CSRs */
int multicast_filter_bins;
int unicast_filter_entries;
@@ -507,8 +506,10 @@ struct mac_device_info {
 };
 
 struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins,
-   int perfect_uc_entries);
-struct mac_device_info *dwmac100_setup(void __iomem *ioaddr);
+   int perfect_uc_entries,
+   int *synopsys_id);
+struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id);
+
 
 void stmmac_set_mac_addr(void __iomem *ioaddr, u8 addr[6],
 unsigned int high, unsigned int low);
@@ -521,4 +522,24 @@ void dwmac_dma_flush_tx_fifo(void __iomem *ioaddr);
 extern const struct stmmac_mode_ops ring_mode_ops;
 extern const struct stmmac_mode_ops chain_mode_ops;
 
+/**
+ * stmmac_get_synopsys_id - return the SYINID.
+ * @priv: driver private structure
+ * Description: this simple function is to decode and return the SYINID
+ * starting from the HW core register.
+ */
+static inline u32 stmmac_get_synopsys_id(u32 hwid)
+{
+   /* Check Synopsys Id (not available on old chips) */
+   if (likely(hwid)) {
+   u32 uid = ((hwid & 0xff00) >> 8);
+   u32 synid = (hwid & 0x00ff);
+
+   pr_info("stmmac - user ID: 0x%x, Synopsys ID: 0x%x\n",
+   uid, synid);
+
+   return synid;
+   }
+   return 0;
+}
 #endif /* __COMMON_H__ */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
index c294117..fb1eb57 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -491,7 +491,8 @@ static const struct stmmac_ops dwmac1000_ops = {
 };
 
 struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins,
-   int perfect_uc_entries)
+   int perfect_uc_entries,
+   int *synopsys_id)
 {
struct mac_device_info *mac;
u32 hwid = readl(ioaddr + GMAC_VERSION);
@@ -516,7 +517,9 @@ struct mac_device_info *dwmac1000_setup(void __iomem 
*ioaddr, int mcbins,
mac->link.speed = GMAC_CONTROL_FES;
mac->mii.addr = GMAC_MII_ADDR;
mac->mii.data = GMAC_MII_DATA;
-   mac->synopsys_uid = hwid;
+
+   /* Get and dump the chip ID */
+   *synopsys_id = stmmac_get_synopsys_id(hwid);
 
return mac;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
index f8dd773..6418b2e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
@@ -173,7 +173,7 @@ static const struct stmmac_ops dwmac100_ops = {
.get_umac_addr = dwmac100_get_umac_addr,
 };
 
-struct mac_device_info *dwmac100_setup(void __iomem *ioaddr)
+struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id)
 {
struct mac_device_info *mac;
 
@@ -192,7 +192,8 @@ struct mac_device_info *dwmac100_setup(void __iomem *ioaddr)
mac->link.speed = 0;
mac->mii.addr = MAC_MII_ADDR;
mac->mii.data = MAC_MII_DATA;
-   mac->synopsys_uid = 0;
+   /* Synopsys Id is not available on old chips */
+   *synopsys_id = 0;
 
return mac;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 8103527..2ebee81 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1468,29 +1468,6 @@ static void stmmac_mmc_setup(struct stmmac_priv *priv)
 }
 
 /**
- * stmmac_get_synopsys_id - return the SYINID.
- * @priv: driver private structure
- * Description: this simple function is to decode and return the SYINID
- * starting from the HW core register.
- */
-static u32 stmmac_get_synopsys_id(struct stmmac_priv *priv)
-{
-   u32 hwid =

[PATCH 00/13] Enhance stmmac driver to support GMAC4.x IP

2016-03-25 Thread Alexandre TORGUE

This is a subset of patch to enhance current stmmac driver to support
new GMAC4.x chips. New set of callbacks is defined to support this new
family: descriptors, dma, core.

One of main changes of GMAC 4.xx IP is descriptors management.
 -descriptors are only used in ring mode.
 -A descriptor is composed of 4 32bits registers (no more extended
  descriptors)
 -descriptor mechanism (Tx for example, but it is exactly the same for RX):
 -useful registers:
  -DMA_CH#_TxDesc_Ring_Len: length of transmit descriptor ring
  -DMA_CH#_TxDesc_List_Address: start address of the ring
  -DMA_CH#_TxDesc_Tail_Pointer: address of the last descriptor to send + 1.
  -DMA_CH#_TxDesc_Current_App_TxDesc: address of the current descriptor

 -The descriptor Tail Pointer register contains the pointer to the
  descriptor address (N). The base address and the current
  descriptor decide the address of the current descriptor that the
  DMA can process. The descriptors up to one location less than the
  one indicated by the descriptor tail pointer (N-1) are owned by
  the DMA. The DMA continues to process the descriptors until the
  following condition occurs:
  "current descriptor pointer == Descriptor Tail pointer"

  Then the DMA goes into suspend mode. The application must perform
  a write to descriptor tail pointer register and update the tail
  pointer to have the following condition and to start a new transfer:
  "current descriptor pointer < Descriptor tail pointer"

  The DMA automatically wraps around the base address when the end
  of ring is reached.
  
New features are available on IP:
 -TSO (TCP Segmentation Offload) for TX only
 -Split header: to have header and payload in 2 different buffers (not yet 
implemented)

Below some throughput figures obtained on some boxes:

iperf (mbps) 
--
   tcp udp
tx   rx   tx  rx  
 -
GMAC4.x 935  930  750 800 

Note: There is a change in 4.10a databook on bitfield mapping of 
DMA_CHANx_INTR_ENA register. 
This requires to have � diffrent set of callbacks between IP 4.00a and 4.10a.

Best regards

Alex

Alexandre TORGUE (13):
  stmmac: rework get_hw_feature function
  stmmac: rework the routines to show the ring status
  stmmac: rework synopsys id read, moved to dwmac setup
  stmmac: add descriptors function for GMAC 4.xx
  stmmac: add GMAC4 DMA/CORE Header File
  stmmac: add DMA support for GMAC 4.xx
  stmmac: add GMAC4 core support
  stmmac: enhance mmc counter management
  stmmac: add new DT platform entries for GMAC4
  stmmac: support new GMAC4
  Documentation: networking: update stmmac
  stmmac: update version to Jan_2016
  stmmac: update MAINTAINERS

 Documentation/devicetree/bindings/net/stmmac.txt   |   2 +
 Documentation/networking/stmmac.txt|  44 +-
 MAINTAINERS|   1 +
 drivers/net/ethernet/stmicro/stmmac/Makefile   |   3 +-
 drivers/net/ethernet/stmicro/stmmac/common.h   |  64 +-
 .../net/ethernet/stmicro/stmmac/dwmac1000_core.c   |   7 +-
 .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c|  35 +-
 .../net/ethernet/stmicro/stmmac/dwmac100_core.c|   5 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac4.h   | 255 
 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c  | 407 +
 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c | 396 +
 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.h | 129 +
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c   | 354 
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h   | 202 +++
 drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c   | 225 +++
 drivers/net/ethernet/stmicro/stmmac/enh_desc.c |  21 +
 drivers/net/ethernet/stmicro/stmmac/mmc.h  |   4 +
 drivers/net/ethernet/stmicro/stmmac/mmc_core.c | 349 +--
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c|  21 +
 drivers/net/ethernet/stmicro/stmmac/stmmac.h   |   7 +-
 .../net/ethernet/stmicro/stmmac/stmmac_ethtool.c   |   7 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 643 +++--
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |   7 +
 include/linux/stmmac.h |   2 +
 24 files changed, 2821 insertions(+), 369 deletions(-)
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4.h
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.h
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c

-- 
1.9.1

[PATCH 01/13] stmmac: rework get_hw_feature function

2016-03-25 Thread Alexandre TORGUE

On next GMAC IP generation (4.xx), the way to get hw feature
is not the same than on previous 3.xx. As it is hardware
dependent, the way to get hw capabilities should be defined in dma ops of
each MAC IP. It will avoid also a huge computation of hw capabilities in
stmmac_main.

Signed-off-by: Alexandre TORGUE 
Signed-off-by: Giuseppe Cavallaro 

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index f96d257..797a913 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -412,7 +412,8 @@ struct stmmac_dma_ops {
int (*dma_interrupt) (void __iomem *ioaddr,
  struct stmmac_extra_stats *x);
/* If supported then get the optional core features */
-   unsigned int (*get_hw_feature) (void __iomem *ioaddr);
+   void (*get_hw_feature)(void __iomem *ioaddr,
+  struct dma_features *dma_cap);
/* Program the HW RX Watchdog */
void (*rx_watchdog) (void __iomem *ioaddr, u32 riwt);
 };
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
index da32d60..99074695 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
@@ -215,9 +215,40 @@ static void dwmac1000_dump_dma_regs(void __iomem *ioaddr)
}
 }
 
-static unsigned int dwmac1000_get_hw_feature(void __iomem *ioaddr)
+static void dwmac1000_get_hw_feature(void __iomem *ioaddr,
+struct dma_features *dma_cap)
 {
-   return readl(ioaddr + DMA_HW_FEATURE);
+   u32 hw_cap = readl(ioaddr + DMA_HW_FEATURE);
+
+   dma_cap->mbps_10_100 = (hw_cap & DMA_HW_FEAT_MIISEL);
+   dma_cap->mbps_1000 = (hw_cap & DMA_HW_FEAT_GMIISEL) >> 1;
+   dma_cap->half_duplex = (hw_cap & DMA_HW_FEAT_HDSEL) >> 2;
+   dma_cap->hash_filter = (hw_cap & DMA_HW_FEAT_HASHSEL) >> 4;
+   dma_cap->multi_addr = (hw_cap & DMA_HW_FEAT_ADDMAC) >> 5;
+   dma_cap->pcs = (hw_cap & DMA_HW_FEAT_PCSSEL) >> 6;
+   dma_cap->sma_mdio = (hw_cap & DMA_HW_FEAT_SMASEL) >> 8;
+   dma_cap->pmt_remote_wake_up = (hw_cap & DMA_HW_FEAT_RWKSEL) >> 9;
+   dma_cap->pmt_magic_frame = (hw_cap & DMA_HW_FEAT_MGKSEL) >> 10;
+   /* MMC */
+   dma_cap->rmon = (hw_cap & DMA_HW_FEAT_MMCSEL) >> 11;
+   /* IEEE 1588-2002 */
+   dma_cap->time_stamp =
+   (hw_cap & DMA_HW_FEAT_TSVER1SEL) >> 12;
+   /* IEEE 1588-2008 */
+   dma_cap->atime_stamp = (hw_cap & DMA_HW_FEAT_TSVER2SEL) >> 13;
+   /* 802.3az - Energy-Efficient Ethernet (EEE) */
+   dma_cap->eee = (hw_cap & DMA_HW_FEAT_EEESEL) >> 14;
+   dma_cap->av = (hw_cap & DMA_HW_FEAT_AVSEL) >> 15;
+   /* TX and RX csum */
+   dma_cap->tx_coe = (hw_cap & DMA_HW_FEAT_TXCOESEL) >> 16;
+   dma_cap->rx_coe_type1 = (hw_cap & DMA_HW_FEAT_RXTYP1COE) >> 17;
+   dma_cap->rx_coe_type2 = (hw_cap & DMA_HW_FEAT_RXTYP2COE) >> 18;
+   dma_cap->rxfifo_over_2048 = (hw_cap & DMA_HW_FEAT_RXFIFOSIZE) >> 19;
+   /* TX and RX number of channels */
+   dma_cap->number_rx_channel = (hw_cap & DMA_HW_FEAT_RXCHCNT) >> 20;
+   dma_cap->number_tx_channel = (hw_cap & DMA_HW_FEAT_TXCHCNT) >> 22;
+   /* Alternate (enhanced) DESC mode */
+   dma_cap->enh_desc = (hw_cap & DMA_HW_FEAT_ENHDESSEL) >> 24;
 }
 
 static void dwmac1000_rx_watchdog(void __iomem *ioaddr, u32 riwt)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 4c5ce98..d3ebfea 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1558,51 +1558,15 @@ static void stmmac_selec_desc_mode(struct stmmac_priv 
*priv)
  */
 static int stmmac_get_hw_features(struct stmmac_priv *priv)
 {
-   u32 hw_cap = 0;
+   u32 ret = 0;
 
if (priv->hw->dma->get_hw_feature) {
-   hw_cap = priv->hw->dma->get_hw_feature(priv->ioaddr);
-
-   priv->dma_cap.mbps_10_100 = (hw_cap & DMA_HW_FEAT_MIISEL);
-   priv->dma_cap.mbps_1000 = (hw_cap & DMA_HW_FEAT_GMIISEL) >> 1;
-   priv->dma_cap.half_duplex = (hw_cap & DMA_HW_FEAT_HDSEL) >> 2;
-   priv->dma_cap.hash_filter = (hw_cap & DMA_HW_FEAT_HASHSEL) >> 4;
-   priv->dma_cap.multi_addr = (hw_cap & DMA_HW_FEAT_ADDMAC) >> 5;
-   priv->dma_cap.pcs = (hw_cap & DMA_HW_FEAT_PCSSEL) >> 6;
-   priv->dma_cap.sma_mdio = (hw_cap & DMA_HW_FEAT_SMASEL) >> 8;
-   priv->dma_cap.pmt_remote_wake_up =
-   (hw_cap & DMA_HW_FEAT_RWKSEL) >> 9;
-   priv->dma_cap.pmt_magic_frame =
-   (hw_cap & DMA_HW_FEAT_MGKSEL) >> 10;
-   /* MMC */
-   priv->dma_cap.rmon = (hw_cap & DMA_HW_FEAT_MMCSEL) >> 11;
-

Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops

2016-03-25 Thread Julian Anastasov

Hello,

On Thu, 24 Mar 2016, David Ahern wrote:

> On 3/24/16 4:33 PM, Julian Anastasov wrote:
> > But for multipath routes we can also consider the
> > nexthops as "alternatives", so it depends on how one uses
> > the multipath mechanism. The ability to fallback to
> > another nexthop assumes one connection is allowed to
> > move from one ISP to another. What if the second ISP
> > decides to reject the connection? What we have is a
> > broken connection just because the retransmits
> > were diverted to wrong place in the hurry. So, the
> > nexthops can be compatible or incompatible. For your
> > setup they are, for others they are not.
> 
> I am not sure I completely understand your point. Are you saying that within a
> single multipath route some connections to nexthops are allowed and others are
> not?
> 
> So to put that paragraph into an example
> 
> 15.0.0.0/16
>   nexthop via 12.0.0.2  dev swp1 weight 1
>   nexthop via 12.0.0.3  dev swp1 weight 1
> 
> Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 12.0.0.3
> because 12.0.0.3 could be a different ISP and not allow TCP connections from
> this address space?

Yes. Two cases are possible:

1. ISP2 filters saddr, traffic with saddr from ISP1 is dropped.

2. ISP2 allows any saddr. But how the responses from
world with daddr=IP_from_ISP1 will come from ISP2 link?
If the nexthops are for different ISP the connection
can survive only if sticks to its ISP. An ISP will
not work as a backup link for another ISP.

The hash-based algorithm does not move connections
from one nexthop to another. If you rearrange the nexthops
on failure, the binding is lost and connections can break.
A fallback from fragile wifi link can upset users and any
redirects will lead to bad experience with broken conns.
So only CONNMARK can help to stick connection to some path.
If this path has multiple simultaneous alternatives you can
again use multipath route reached from 'ip rule from PUBIP2 table ISP2'
but then we again are restricted from the hash-based alg
which is suitable only for default routes hit by saddr=0.0.0.0
lookups. In the common case when connection is created
there are two lookups:

1. TCP selects nexthop with a saddr=0.0.0.0 lookup.
ISP is selected based on the mp alg.

2. If one is lucky to reach ip_route_me_harder the
hash-based lookup is defeated because now lookup
uses saddr=iph->saddr, so it selects different nexthop.
It works while all packets for the connection reach
ip_route_me_harder.

> > So, if the kernel used a random selection
> > your fallback algorithm should help. But it is fragile
> > for the simple setup with single default multipath route.
> > May be what we miss is the ability to choose between
> > random and hash-based selection. Then your patch may be
> > useful but only for setup 2 (multipath route hit only by
> > first packet). So, your patch may come with a sysctl var
> > that explains your current patch logic: "avoid failed nexthops,
> > never probe them, wait their failed entry to be expired by
> > neigh_periodic_work and just then we can use the nexthop
> > by creating new entry to probe the GW". Who will trigger
> > probes often enough to maintain fresh state?
> 
> First packet out does the probe -- neigh lookup fails, kernel has no knowledge
> so can't reject the nexthop based on neighbor information.
> 
> After that if it has information that says that a nexthop is dead, why would
> it continue to try to probe? Any traffic that selects that nh is dead. That to

If entry becomes FAILED this state is preserved
if we do not direct traffic to this entry. If there was a
single connection that was rejected after 3 failed probes
the next connection (with your patch) will fallback to
another neigh and the first entry will remain in FAILED
state until expiration. If one wants to refresh the state
often, a script/tool that pings all GWs is needed, so that
you can notice the available or failed paths faster.

> me defies the basis of having multiple paths.

We do not know how long is the outage. Long living
connections may prefer to survive with retransmits.
Say you are using SSH via wifi link doing important work.
Do you want your connection to break just because link was
down for a while?

Fallback to other ISP can be unwanted. If we do not
know if multipath route is used per-packet ot per-connection
we can not just apply a fallback to other nexthops.
We can do it only if user can configure the different
usages per multipath route:

1. hash-based or random

2. allow fallback or not. This config is not a MUST if users
can select random mode, it can imply that fallback is allowed.

Regards

--
Julian Anastasov

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Willy Tarreau

Hi Eric,

On Thu, Mar 24, 2016 at 11:49:41PM -0700, Eric Dumazet wrote:
> Everything is possible, but do not complain because BPF went in the
> kernel before your changes.

Don't get me wrong, I'm not complaining, I'm more asking for help to
try to elaborate the alternate solution. I understood well what my
proposal did because it was pretty straightforward, and the new one
I'll have to do is of an immense complexity for me by now, since it
will require learning a new language and finding doc on how all this
works together. But as I said I'm totally sold to the benefits it can
provide for large scale deployments and I'm well aware of the ugly
socket scan there was in the previous one.

> Just rework your patch.
> 
> Supporting multiple SO_REUSEPORT groups on the same port should not be
> too hard really. Making sure BPF still work for them is feasible.
> 
> But the semantic of the socket option would be really different.

I don't care much about the socket options themselves as long as I can
continue to support seamless reloads. I could even get rid of SO_REUSEPORT
on Linux to use something else instead if I have a reliable way to detect
that the alternative will work.

> You need to not control an individual listener, but a group of listener.
> 
> Your dying haproxy would issue a single system call to tell the kernel :
> My SO_REUSEPORT group should not accept new SYN packets, so that the new
> haproxy can setup a working new SO_REUSEPORT group.

Normally it's the other way around :-) The new process first grabs the
socket, there's a tiny window where both are attached, and only then the
old process leaves. That's the only way to ensure there's no loss nor
added latency in the processing.

If you could share a pointer to some good starter documentation for this,
I would appreciate it. I really have no idea where to start from and the
only elements I found on the net didn't give a single hint regarding all
this :-/

Thanks,
Willy

Re: [PATCH] stmmac: Fix phy without MDIO subnode

2016-03-25 Thread Giuseppe CAVALLARO


On 3/24/2016 6:01 PM, John Keeping wrote:

On Thu, 24 Mar 2016 13:56:02 +0100, Giuseppe CAVALLARO wrote:


This should be fixed by some work done some days
ago and not yet committed.

Pls see "stmmac: MDIO fixes" patch-set and let me know
if ok on your side.


Yes, that works for me.


thx John for having tested it on your side.

peppe




Thanks,
John


On 3/24/2016 11:56 AM, John Keeping wrote:

Since commit 88f8b1bb41c6 ("stmmac: Fix 'eth0: No PHY found'
regression") we no longer allocate mdio_bus_data unless there is a MDIO
subnode.  This breaks the ethernet on the Radxa Rock2 (using
rk3288-rock2-square.dts) which does not have an MDIO subnode.

That commit was correct that the phy_bus_name test is unhelpful since we
allocate "plat" in the same function and never set phy_bus_name so let's
just drop the test which restores the previous behaviour.

Fixes: 88f8b1bb41c6 ("stmmac: Fix 'eth0: No PHY found' regression")
Signed-off-by: John Keeping 
---
   drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index dcbd2a1..e0fa060 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -189,7 +189,7 @@ stmmac_probe_config_dt(struct platform_device *pdev, const 
char **mac)
if (of_property_read_u32(np, "snps,phy-addr", >phy_addr) == 0)
dev_warn(>dev, "snps,phy-addr property is deprecated\n");

-   if ((plat->phy_node && !of_phy_is_fixed_link(np)) || !plat->mdio_node)
+   if ((plat->phy_node && !of_phy_is_fixed_link(np)))
plat->mdio_bus_data = NULL;
else
plat->mdio_bus_data =

RE: [v6, 2/5] soc: fsl: add GUTS driver for QorIQ platforms

2016-03-25 Thread Yangbo Lu

> -Original Message-
> From: Scott Wood
> Sent: Saturday, March 19, 2016 6:55 AM
> To: Yangbo Lu; devicet...@vger.kernel.org; linux-arm-
> ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linuxppc-
> d...@lists.ozlabs.org; linux-...@vger.kernel.org; linux-
> i...@vger.kernel.org; io...@lists.linux-foundation.org;
> netdev@vger.kernel.org; linux-...@vger.kernel.org
> Cc: Rob Herring; Russell King; Jochen Friedrich; Joerg Roedel; Claudiu
> Manoil; ulf.hans...@linaro.org; Bhupesh Sharma; Zhao Qiang; Kumar Gala;
> Santosh Shilimkar; Yang-Leo Li; Xiaobo Xie
> Subject: Re: [v6, 2/5] soc: fsl: add GUTS driver for QorIQ platforms
> 
> On 03/09/2016 04:18 AM, Yangbo Lu wrote:
> > +#ifdef CONFIG_FSL_GUTS
> > +u32 fsl_guts_get_svr(void);
> > +int fsl_guts_init(void);
> > +#endif
> 
> Don't ifdef prototypes (when not providing a stub alternative).
> 
> -Scott

[Lu Yangbo-B47093] Ok, will remove the ifdef.
Thank you very much.

Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Eric Dumazet

On Fri, 2016-03-25 at 06:28 +0100, Willy Tarreau wrote:
> On Thu, Mar 24, 2016 at 04:54:03PM -0700, Tom Herbert wrote:
> > On Thu, Mar 24, 2016 at 4:40 PM, Yann Ylavic  wrote:
> > > I'll learn how to do this to get the best performances from the
> > > server, but having to do so to work around what looks like a defect
> > > (for simple/default SMP configurations at least, no NUMA or clever
> > > CPU-affinity or queuing policy involved) seems odd in the first place.
> > >
> > I disagree with your assessment that there is a defect. SO_REUSEPORT
> > is designed to spread packets amongst _equivalent_ connections. In the
> > server draining case sockets are no longer equivalent, but that is a
> > special case.
> 
> I partially disagree with you here Tom. Initially SO_REUSEPORT was not
> used to spread packets but to allow soft restart in some applications.
> I've been using it since 2001 in haproxy on *BSD and linux 2.2. It was
> removed during 2.3 and I used to keep a patch to reimplement it in 2.4
> (basically 2 or 3 lines, the infrastructure was still present), but the
> patch was not accepted. The same patch worked for 2.6 and 3.x, allowing
> me to continue to perform soft-restarts on Linux just like I used to do
> on *BSD. When SO_REUSEPORT was reimplemented in 3.9 with load balancing,
> I was happy because it at last allowed me to drop my patch and I got
> the extra benefit of better load balancing of incoming connections.
> 
> But the main use we have for it (at least historically) is for soft
> restarts, where one process replaces another one. Very few people use
> more than one process in our case.
> 
> However given the benefits of the load spreading for extreme loads,
> I'm willing to find how to achieve the same with BPF, but it's pretty
> clear that at this point I have no idea where to start from and that
> for a single process replacing a single one, it looks quite complicated.
> 
> For me quite frankly the special case is the load balancing which is
> a side benefit (and a nice one, don't get me wrong).
> 
> That's why I would have found it nice to "fix" the process replacement
> to avoid dropping incoming connections, though I don't want it to become
> a problem for future improvements on BPF. I don't think the two lines I
> proposed could become an issue but I'll live without them (or continue
> to apply this patch).
> 
> BTW, I have no problem with having to write a little bit of assembly for
> fast interfaces if it remains untouched for years, we do already have a
> bit in haproxy. It's just a longterm investment.

Everything is possible, but do not complain because BPF went in the
kernel before your changes.

Just rework your patch.

Supporting multiple SO_REUSEPORT groups on the same port should not be
too hard really. Making sure BPF still work for them is feasible.

But the semantic of the socket option would be really different.

You need to not control an individual listener, but a group of listener.

Your dying haproxy would issue a single system call to tell the kernel :
My SO_REUSEPORT group should not accept new SYN packets, so that the new
haproxy can setup a working new SO_REUSEPORT group.

RE: [v6, 5/5] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

2016-03-25 Thread Yangbo Lu

> -Original Message-
> From: Scott Wood
> Sent: Saturday, March 19, 2016 2:28 AM
> To: Arnd Bergmann; Rob Herring
> Cc: Yangbo Lu; linuxppc-...@lists.ozlabs.org; devicet...@vger.kernel.org;
> ulf.hans...@linaro.org; Zhao Qiang; Russell King; Bhupesh Sharma;
> netdev@vger.kernel.org; Joerg Roedel; Kumar Gala; linux-
> m...@vger.kernel.org; linux-ker...@vger.kernel.org; Yang-Leo Li;
> io...@lists.linux-foundation.org; linux-...@vger.kernel.org; Claudiu
> Manoil; Santosh Shilimkar; Xiaobo Xie; linux-...@vger.kernel.org; linux-
> arm-ker...@lists.infradead.org
> Subject: Re: [v6, 5/5] mmc: sdhci-of-esdhc: fix host version for T4240-
> R1.0-R2.0
> 
> On 03/17/2016 12:06 PM, Arnd Bergmann wrote:
> > On Thursday 17 March 2016 12:01:01 Rob Herring wrote:
> >> On Mon, Mar 14, 2016 at 05:45:43PM +, Scott Wood wrote:
> >
> > This makes the driver non-portable. Better identify the specific
> > workarounds based on the compatible string for this device, or add
> > a boolean DT property for the quirk.
> >
> >Arnd
> 
>  [Lu Yangbo-B47093] Hi Arnd, we did have a discussion about using DTS
> in v1 before.
>  https://patchwork.kernel.org/patch/6834221/
> 
>  We don't have a separate DTS file for each revision of an SOC and if
> we did, we'd constantly have people using the wrong one.
>  In addition, the device tree is stable ABI and errata are often
> discovered after device tree are deployed.
>  See the link for details.
> 
>  So we decide to read SVR from the device-config/guts MMIO block
> other than using DTS.
>  Thanks.
> >>>
> >>> Also note that this driver is already only for fsl-specific
> >>> hardware, and it will still work even if fsl_guts doesn't find
> >>> anything to bind to
> >>> -- it just wouldn't be able to detect errata based on SVR in that
> case.
> >>
> >> IIRC, it is the same IP block as i.MX and Arnd's point is this won't
> >> even compile on !PPC. It is things like this that prevent sharing the
> >> driver.
> 
> The whole point of using the MMIO SVR instead of the PPC SPR is so that
> it will work on ARM...  The guts driver should build on any platform as
> long as OF is enabled, and if it doesn't find a node to bind to it will
> return 0 for SVR, and the eSDHC driver will continue (after printing an
> error that should be removed) without the ability to test for errata
> based on SVR.
> 
> > I think the first four patches take care of building for ARM, but the
> > problem remains if you want to enable COMPILE_TEST as we need for
> > certain automated checking.
> 
> What specific problem is there with COMPILE_TEST?
> 
> >> Dealing with Si revs is a common problem. We should have a common
> >> solution. There is soc_device for this purpose.
> >
> > Exactly. The last time this came up, I think we agreed to implement a
> > helper using glob_match() on the soc_device strings. Unfortunately
> > this hasn't happened then, but I'd still prefer that over yet another
> > vendor-specific way of dealing with the generic issue.
> 
> soc_device would require encoding the SVR as a string and then decoding
> the string, which is more complicated and error prone than having
> platform-specific code test a platform-specific number.  And when would
> it get registered on arm64, which doesn't have platform code?
> 
> -Scott

[Lu Yangbo-B47093] Hi Arnd, could you answer Scott's questions?
If you don't oppose this patch, I'd like to rework a new version for merging.
Thanks.

:)

RE: [v6, 3/5] dt: move guts devicetree doc out of powerpc directory

2016-03-25 Thread Yangbo Lu

> -Original Message-
> From: Scott Wood
> Sent: Saturday, March 19, 2016 2:16 AM
> To: Rob Herring; Yangbo Lu
> Cc: devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; linuxppc-...@lists.ozlabs.org; linux-
> c...@vger.kernel.org; linux-...@vger.kernel.org; iommu@lists.linux-
> foundation.org; netdev@vger.kernel.org; linux-...@vger.kernel.org;
> Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil;
> ulf.hans...@linaro.org; Bhupesh Sharma; Zhao Qiang; Kumar Gala; Santosh
> Shilimkar; Yang-Leo Li; Xiaobo Xie
> Subject: Re: [v6, 3/5] dt: move guts devicetree doc out of powerpc
> directory
> 
> On 03/17/2016 12:06 PM, Rob Herring wrote:
> > On Wed, Mar 09, 2016 at 06:08:49PM +0800, Yangbo Lu wrote:
> >> Move guts devicetree doc to
> >> Documentation/devicetree/bindings/soc/fsl/
> >> since it's used by not only PowerPC but also ARM. And add a
> >> specification for 'little-endian' property.
> >>
> >> Signed-off-by: Yangbo Lu 
> >> ---
> >> Changes for v2:
> >>- None
> >> Changes for v3:
> >>- None
> >> Changes for v4:
> >>- Added this patch
> >> Changes for v5:
> >>- Modified the description for little-endian property Changes for
> >> v6:
> >>- None
> >> ---
> >>  Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt | 3
> >> +++
> >>  1 file changed, 3 insertions(+)
> >>  rename Documentation/devicetree/bindings/{powerpc =>
> >> soc}/fsl/guts.txt (91%)
> >>
> >> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt
> >> b/Documentation/devicetree/bindings/soc/fsl/guts.txt
> >> similarity index 91%
> >> rename from Documentation/devicetree/bindings/powerpc/fsl/guts.txt
> >> rename to Documentation/devicetree/bindings/soc/fsl/guts.txt
> >> index b71b203..07adca9 100644
> >> --- a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt
> >> +++ b/Documentation/devicetree/bindings/soc/fsl/guts.txt
> >> @@ -25,6 +25,9 @@ Recommended properties:
> >>   - fsl,liodn-bits : Indicates the number of defined bits in the LIODN
> >> registers, for those SOCs that have a PAMU device.
> >>
> >> + - little-endian : Indicates that the global utilities block is
> little
> >> +   endian. The default is big endian.
> >
> > The default is "the native endianness of the system". So absence on an
> > ARM system would be LE.
> 
> No.  For this binding, the default is big-endian, because that's what
> existed for this device before an endian property was added.
> 
> "endianness of the system" is not a well-defined concept.
> 
> > This property is valid for any simple-bus device,
> 
> Since when does simple-bus mean anything more than that the nodes
> underneath it can be used without bus-specific knowledge?
> 
> > so it isn't really required to document per device. You can, but your
> > description had better match the documented behaviour.
> 
> Documented where?
> 
> In fact, Documentation/devicetree/bindings/common-properties.txt
> explicitly says of the endian properties, "If a binding supports these
> properties, then the binding should also specify the default behavior if
> none of these properties are present."
> 
> -Scott

[Lu Yangbo-B47093] So, Rob, could you accept this patch after so much 
discussion?
:)

1 2 >

100 matches

Mail list logo