[PATCH v4 2/2] RDS: fix congestion map corruption for PAGE_SIZE > 4k

2016-04-07 Thread Shamir Rabinovitch
When PAGE_SIZE > 4k, a single page can contain two RDS fragments. If
'rds_ib_cong_recv' ignores the RDS fragment offset into the page, it
reads the wrong fragment data as the far congestion map update, which
corrupts the RDS connection's far congestion map.

Signed-off-by: Shamir Rabinovitch 
---
 net/rds/ib_recv.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 977fb86..abc8cc8 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -796,7 +796,7 @@ static void rds_ib_cong_recv(struct rds_connection *conn,
 
	addr = kmap_atomic(sg_page(&frag->f_sg));
 
-   src = addr + frag_off;
+   src = addr + frag->f_sg.offset + frag_off;
dst = (void *)map->m_page_addrs[map_page] + map_off;
for (k = 0; k < to_copy; k += 8) {
/* Record ports that became uncongested, ie
-- 
1.7.1



[PATCH v4 1/2] RDS: memory allocated must be align to 8

2016-04-07 Thread Shamir Rabinovitch
Fix an issue in 'rds_ib_cong_recv' when accessing unaligned memory,
allocated by 'rds_page_remainder_alloc', through a uint64_t pointer.

Signed-off-by: Shamir Rabinovitch 
---
 net/rds/page.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/page.c b/net/rds/page.c
index 616f21f..e2b5a58 100644
--- a/net/rds/page.c
+++ b/net/rds/page.c
@@ -135,8 +135,8 @@ int rds_page_remainder_alloc(struct scatterlist *scat, 
unsigned long bytes,
if (rem->r_offset != 0)
rds_stats_inc(s_page_remainder_hit);
 
-   rem->r_offset += bytes;
-   if (rem->r_offset == PAGE_SIZE) {
+   rem->r_offset += ALIGN(bytes, 8);
+   if (rem->r_offset >= PAGE_SIZE) {
__free_page(rem->r_page);
rem->r_page = NULL;
}
-- 
1.7.1



Re: [PATCH] net: mark DECnet as broken

2016-04-07 Thread One Thousand Gnomes
On Thu,  7 Apr 2016 09:22:43 +0200
Vegard Nossum  wrote:

> There are NULL pointer dereference bugs in DECnet which can be triggered
> by unprivileged users and have been reported multiple times to LKML,
> however nobody seems confident enough in the proposed fixes to merge them
> and the consensus seems to be that nobody cares enough about DECnet to
> see it fixed anyway.
> 
> To shield unsuspecting users from the possible DOS, we should mark this
> BROKEN until somebody who actually uses this code can fix it.

How about consigning it to staging at this point ?

Alan


Re: optimizations to sk_buff handling in rds_tcp_data_ready

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 09:11 -0400, Sowmini Varadhan wrote:
> I was going back to Alexei's comment here:
>   http://permalink.gmane.org/gmane.linux.network/387806
> In rds-stress profiling, we are indeed seeing the pskb_pull 
> (from rds_tcp_data_recv) show up in the perf profile.
> 
> At least for rds-tcp, we cannot re-use the skb even if
> it is not shared, because what we need to do is to carve out
> the interesting bits (start at offset, and trim it to to_copy)
> and queue up those interesting bits on the PF_RDS socket,
> while allowing tcp_data_read to go back and read the next 
> tcp segment (which may be part of the next rds dgram).
> 
> But, when  pskb_expand_head is invoked in the call-stack
>   pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta) 
> it will memcpy the offset bytes to the start of data. At least 
> for the rds_tcp_data_recv, we are not interested in being able 
> to do a *skb_push after the *skb_pull, so we don't really care 
> about the intactness of these bytes in offset.
> Thus what I am finding is that when delta > 0, if we skip the 
> following in pskb_expand_head (for the rds-tcp recv case only!)
> 
> /* Copy only real data... and, alas, header. This should be
>  * optimized for the cases when header is void.
>  */
> memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
> 
> and also (only for this case!) this one in __pskb_pull_tail
> 
>  if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
> BUG();
> 
> I am able to get a 40% improvement in the measured IOPS (req/response 
> transactions per second, using 8K byte requests, 256 byte responses,
> 16 concurrent threads), so this optimization seems worth doing.
> 
> Does my analysis above make sense? If yes, the question is, how to
> achieve this bypass in a neat way.  Clearly there are many callers of
> pskb_expand_head who will expect to find the skb_header_len bytes at
> the start of data, but we also don't want to duplicate code in these
> functions. One thought I had is to pass a flag around saying "caller
> doesn't care about retaining offset bytes", and use this flag
> - in __pskb_pull_tail, to avoid skb_copy_bits() above,  and to
>   pass delta to pskb_expand_head,
> - in pskb_expand_head, only do the memcpy listed above 
>   if delta <= 0
> Any other ideas for how to achieve this?

Use skb split like TCP in output path ?

Really, pskb_expand_head() is not supposed to copy payload ;)




[PATCH 1/2] drivers: net: cpsw: fix port_mask parameters in ale calls

2016-04-07 Thread Grygorii Strashko
ALE APIs expect to receive port masks as input values for arguments
port_mask, untag, reg_mcast, unreg_mcast. But there are a few places in
code where port masks are passed left-shifted by cpsw_priv->host_port,
like below:

 cpsw_ale_add_vlan(priv->ale, priv->data.default_vlan,
  ALE_ALL_PORTS << priv->host_port,
  ALE_ALL_PORTS << priv->host_port, 0, 0);

and cpsw still works only because priv->host_port == 0
and has never been changed.

Hence, fix port_mask parameters in ALE APIs calls and drop
"<< priv->host_port" from all places where it's used to
shift valid port mask.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c | 22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 42fdfd4..5292e70 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -535,7 +535,7 @@ static const struct cpsw_stats cpsw_gstrings_stats[] = {
ALE_VLAN, slave->port_vlan, 0); \
} else {\
cpsw_ale_add_mcast(priv->ale, addr, \
-   ALE_ALL_PORTS << priv->host_port,   \
+   ALE_ALL_PORTS,  \
0, 0, 0);   \
}   \
} while (0)
@@ -602,8 +602,7 @@ static void cpsw_set_promiscious(struct net_device *ndev, 
bool enable)
cpsw_ale_control_set(ale, 0, ALE_AGEOUT, 1);
 
/* Clear all mcast from ALE */
-   cpsw_ale_flush_multicast(ale, ALE_ALL_PORTS <<
-priv->host_port, -1);
+   cpsw_ale_flush_multicast(ale, ALE_ALL_PORTS, -1);
 
/* Flood All Unicast Packets to Host port */
cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
@@ -648,8 +647,7 @@ static void cpsw_ndo_set_rx_mode(struct net_device *ndev)
cpsw_ale_set_allmulti(priv->ale, priv->ndev->flags & IFF_ALLMULTI);
 
/* Clear all mcast from ALE */
-   cpsw_ale_flush_multicast(priv->ale, ALE_ALL_PORTS << priv->host_port,
-vid);
+   cpsw_ale_flush_multicast(priv->ale, ALE_ALL_PORTS, vid);
 
if (!netdev_mc_empty(ndev)) {
struct netdev_hw_addr *ha;
@@ -1172,7 +1170,6 @@ static void cpsw_slave_open(struct cpsw_slave *slave, 
struct cpsw_priv *priv)
 static inline void cpsw_add_default_vlan(struct cpsw_priv *priv)
 {
const int vlan = priv->data.default_vlan;
-   const int port = priv->host_port;
u32 reg;
int i;
int unreg_mcast_mask;
@@ -1190,9 +1187,9 @@ static inline void cpsw_add_default_vlan(struct cpsw_priv 
*priv)
else
unreg_mcast_mask = ALE_PORT_1 | ALE_PORT_2;
 
-   cpsw_ale_add_vlan(priv->ale, vlan, ALE_ALL_PORTS << port,
- ALE_ALL_PORTS << port, ALE_ALL_PORTS << port,
- unreg_mcast_mask << port);
+   cpsw_ale_add_vlan(priv->ale, vlan, ALE_ALL_PORTS,
+ ALE_ALL_PORTS, ALE_ALL_PORTS,
+ unreg_mcast_mask);
 }
 
 static void cpsw_init_host_port(struct cpsw_priv *priv)
@@ -1273,8 +1270,7 @@ static int cpsw_ndo_open(struct net_device *ndev)
cpsw_add_default_vlan(priv);
else
cpsw_ale_add_vlan(priv->ale, priv->data.default_vlan,
- ALE_ALL_PORTS << priv->host_port,
- ALE_ALL_PORTS << priv->host_port, 0, 0);
+ ALE_ALL_PORTS, ALE_ALL_PORTS, 0, 0);
 
if (!cpsw_common_res_usage_state(priv)) {
struct cpsw_priv *priv_sl0 = cpsw_get_slave_priv(priv, 0);
@@ -1666,7 +1662,7 @@ static inline int cpsw_add_vlan_ale_entry(struct 
cpsw_priv *priv,
}
 
ret = cpsw_ale_add_vlan(priv->ale, vid, port_mask, 0, port_mask,
-   unreg_mcast_mask << priv->host_port);
+   unreg_mcast_mask);
if (ret != 0)
return ret;
 
@@ -1738,7 +1734,7 @@ static int cpsw_ndo_vlan_rx_kill_vid(struct net_device 
*ndev,
return ret;
 
ret = cpsw_ale_del_ucast(priv->ale, priv->mac_addr,
-priv->host_port, ALE_VLAN, vid);
+HOST_PORT_NUM, ALE_VLAN, vid);
if (ret != 0)
return ret;
 
-- 
2.8.0



[PATCH 2/2] drivers: net: cpsw: drop host_port field from struct cpsw_priv

2016-04-07 Thread Grygorii Strashko
The host_port field is constantly assigned 0, and this value has
never changed since the cpsw driver was introduced. Moreover,
assigning this field a non-zero value would break current
driver functionality.

Hence, there is no reason to continue maintaining this host_port
field; it can be removed, and the HOST_PORT_NUM and ALE_PORT_HOST
defines can be used instead.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c | 30 --
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 5292e70..54bcc38 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -381,7 +381,6 @@ struct cpsw_priv {
u32 coal_intvl;
u32 bus_freq_mhz;
int rx_packet_max;
-   int host_port;
struct clk  *clk;
u8  mac_addr[ETH_ALEN];
struct cpsw_slave   *slaves;
@@ -531,7 +530,7 @@ static const struct cpsw_stats cpsw_gstrings_stats[] = {
int slave_port = cpsw_get_slave_port(priv,  \
slave->slave_num);  \
cpsw_ale_add_mcast(priv->ale, addr, \
-   1 << slave_port | 1 << priv->host_port, \
+   1 << slave_port | ALE_PORT_HOST,\
ALE_VLAN, slave->port_vlan, 0); \
} else {\
cpsw_ale_add_mcast(priv->ale, addr, \
@@ -542,10 +541,7 @@ static const struct cpsw_stats cpsw_gstrings_stats[] = {
 
 static inline int cpsw_get_slave_port(struct cpsw_priv *priv, u32 slave_num)
 {
-   if (priv->host_port == 0)
-   return slave_num + 1;
-   else
-   return slave_num;
+   return slave_num + 1;
 }
 
 static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
@@ -1090,7 +1086,7 @@ static inline void cpsw_add_dual_emac_def_ale_entries(
struct cpsw_priv *priv, struct cpsw_slave *slave,
u32 slave_port)
 {
-   u32 port_mask = 1 << slave_port | 1 << priv->host_port;
+   u32 port_mask = 1 << slave_port | ALE_PORT_HOST;
 
if (priv->version == CPSW_VERSION_1)
slave_write(slave, slave->port_vlan, CPSW1_PORT_VLAN);
@@ -1101,7 +1097,7 @@ static inline void cpsw_add_dual_emac_def_ale_entries(
cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast,
   port_mask, ALE_VLAN, slave->port_vlan, 0);
cpsw_ale_add_ucast(priv->ale, priv->mac_addr,
-   priv->host_port, ALE_VLAN | ALE_SECURE, slave->port_vlan);
+   HOST_PORT_NUM, ALE_VLAN | ALE_SECURE, slave->port_vlan);
 }
 
 static void soft_reset_slave(struct cpsw_slave *slave)
@@ -1202,7 +1198,7 @@ static void cpsw_init_host_port(struct cpsw_priv *priv)
cpsw_ale_start(priv->ale);
 
/* switch to vlan unaware mode */
-   cpsw_ale_control_set(priv->ale, priv->host_port, ALE_VLAN_AWARE,
+   cpsw_ale_control_set(priv->ale, HOST_PORT_NUM, ALE_VLAN_AWARE,
 CPSW_ALE_VLAN_AWARE);
	control_reg = readl(&priv->regs->control);
control_reg |= CPSW_VLAN_AWARE;
@@ -1216,14 +1212,14 @@ static void cpsw_init_host_port(struct cpsw_priv *priv)
 &priv->host_port_regs->cpdma_tx_pri_map);
	__raw_writel(0, &priv->host_port_regs->cpdma_rx_chan_map);
 
-   cpsw_ale_control_set(priv->ale, priv->host_port,
+   cpsw_ale_control_set(priv->ale, HOST_PORT_NUM,
 ALE_PORT_STATE, ALE_PORT_STATE_FORWARD);
 
if (!priv->data.dual_emac) {
-   cpsw_ale_add_ucast(priv->ale, priv->mac_addr, priv->host_port,
+   cpsw_ale_add_ucast(priv->ale, priv->mac_addr, HOST_PORT_NUM,
   0, 0);
cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast,
-  1 << priv->host_port, 0, 0, ALE_MCAST_FWD_2);
+  ALE_PORT_HOST, 0, 0, ALE_MCAST_FWD_2);
}
 }
 
@@ -1616,9 +1612,9 @@ static int cpsw_ndo_set_mac_address(struct net_device 
*ndev, void *p)
flags = ALE_VLAN;
}
 
-   cpsw_ale_del_ucast(priv->ale, priv->mac_addr, priv->host_port,
+   cpsw_ale_del_ucast(priv->ale, priv->mac_addr, HOST_PORT_NUM,
   flags, vid);
-   cpsw_ale_add_ucast(priv->ale, addr->sa_data, priv->host_port,
+   cpsw_ale_add_ucast(priv->ale, addr->sa_data, HOST_PORT_NUM,
   flags, vid);
 
memcpy(priv->mac_addr, addr->sa_data, ETH_ALEN);
@@ -1667,7 +1663,7 @@ static inline 

[PATCH 0/2] drivers: net: cpsw: fix ale calls and drop host_port field from cpsw_priv

2016-04-07 Thread Grygorii Strashko
This clean up series intended to:
 - fix port_mask parameters in ale calls and drop unnecessary shifts
 - drop host_port field from struct cpsw_priv

Nothing critical. Tested on am437x-idk-evm in dual mac and switch modes.

Grygorii Strashko (2):
  drivers: net: cpsw: fix port_mask parameters in ale calls
  drivers: net: cpsw: drop host_port field from struct cpsw_priv

 drivers/net/ethernet/ti/cpsw.c | 52 +-
 1 file changed, 21 insertions(+), 31 deletions(-)

-- 
2.8.0



Re: [PATCH] wlcore: spi: add wl18xx support

2016-04-07 Thread Kalle Valo
"Reizer, Eyal"  writes:

> Ping on this patch
>
>> -Original Message-
>> From: Eyal Reizer [mailto:eyalrei...@gmail.com]
>> Sent: Wednesday, March 30, 2016 4:07 PM
>> To: kv...@codeaurora.org; linux-wirel...@vger.kernel.org;
>> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
>> Cc: Reizer, Eyal
>> Subject: [PATCH] wlcore: spi: add wl18xx support

Please edit your quotes and don't top-post. A one-liner followed by
almost 400 lines of unnecessary text, for example, makes it harder to
use patchwork:

https://patchwork.kernel.org/patch/8696181/

-- 
Kalle Valo


optimizations to sk_buff handling in rds_tcp_data_ready

2016-04-07 Thread Sowmini Varadhan

I was going back to Alexei's comment here:
  http://permalink.gmane.org/gmane.linux.network/387806
In rds-stress profiling, we are indeed seeing the pskb_pull 
(from rds_tcp_data_recv) show up in the perf profile.

At least for rds-tcp, we cannot re-use the skb even if
it is not shared, because what we need to do is to carve out
the interesting bits (start at offset, and trim it to to_copy)
and queue up those interesting bits on the PF_RDS socket,
while allowing tcp_data_read to go back and read the next 
tcp segment (which may be part of the next rds dgram).

But, when  pskb_expand_head is invoked in the call-stack
  pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta) 
it will memcpy the offset bytes to the start of data. At least 
for the rds_tcp_data_recv, we are not interested in being able 
to do a *skb_push after the *skb_pull, so we don't really care 
about the intactness of these bytes in offset.
Thus what I am finding is that when delta > 0, if we skip the 
following in pskb_expand_head (for the rds-tcp recv case only!)

/* Copy only real data... and, alas, header. This should be
 * optimized for the cases when header is void.
 */
memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);

and also (only for this case!) this one in __pskb_pull_tail

 if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
BUG();

I am able to get a 40% improvement in the measured IOPS (req/response 
transactions per second, using 8K byte requests, 256 byte responses,
16 concurrent threads), so this optimization seems worth doing.

Does my analysis above make sense? If yes, the question is, how to
achieve this bypass in a neat way.  Clearly there are many callers of
pskb_expand_head who will expect to find the skb_header_len bytes at
the start of data, but we also don't want to duplicate code in these
functions. One thought I had is to pass a flag around saying "caller
doesn't care about retaining offset bytes", and use this flag
- in __pskb_pull_tail, to avoid skb_copy_bits() above,  and to
  pass delta to pskb_expand_head,
- in pskb_expand_head, only do the memcpy listed above 
  if delta <= 0
Any other ideas for how to achieve this?

--Sowmini




Re: [PATCH] wlcore: spi: add wl18xx support

2016-04-07 Thread Kalle Valo
"Reizer, Eyal"  writes:

>> >  static const struct of_device_id wlcore_spi_of_match_table[] = {
>> > -  { .compatible = "ti,wl1271" },
>> > +  { .compatible = "ti,wl1271", .data = _data},
>> > +  { .compatible = "ti,wl1273", .data = _data},
>> > +  { .compatible = "ti,wl1281", .data = _data},
>> > +  { .compatible = "ti,wl1283", .data = _data},
>> > +  { .compatible = "ti,wl1801", .data = _data},
>> > +  { .compatible = "ti,wl1805", .data = _data},
>> > +  { .compatible = "ti,wl1807", .data = _data},
>> > +  { .compatible = "ti,wl1831", .data = _data},
>> > +  { .compatible = "ti,wl1835", .data = _data},
>> > +  { .compatible = "ti,wl1837", .data = _data},
>> >{ }
>> 
>> Shouldn't you also update bindings/net/wireless/ti,wlcore,spi.txt? Now it
>> only mentions ti,wl1271 and nothing about the rest.
>
> You are right! Will be fixed in v2

Thanks. Also remember to CC devicetree list.

-- 
Kalle Valo


Re: Boot failure when using NFS on OMAP based evms

2016-04-07 Thread Willem de Bruijn
On Thu, Apr 7, 2016 at 10:36 AM, Franklin S Cooper Jr.  wrote:
>
>
> On 04/06/2016 09:22 PM, Willem de Bruijn wrote:
>> On Wed, Apr 6, 2016 at 7:12 PM, Franklin S Cooper Jr.  wrote:
>>> Hi All,
>>>
>>> Currently linux-next is failing to boot via NFS on my AM335x GP evm,
>>> AM437x GP evm and Beagle X15. I bisected the problem down to the commit
>>> "udp: remove headers from UDP packets before queueing".
>>>
>>> I had to revert the following three commits to get things working again:
>>>
>>> e6afc8ace6dd5cef5e812f26c72579da8806f5ac
>>> udp: remove headers from UDP packets before queueing
>>>
>>> 627d2d6b550094d88f9e518e15967e7bf906ebbf
>>> udp: enable MSG_PEEK at non-zero offset
>>>
>>> b9bb53f3836f4eb2bdeb3447be11042bd29c2408
>>> sock: convert sk_peek_offset functions to WRITE_ONCE
>>>
>>
>> Thanks for the report, and apologies for breaking your configuration.
>> I had missed that sunrpc can dequeue skbs from a udp receive
>> queue and makes assumptions about the layout of those packets. rxrpc
>> does the same. From what I can tell so far, those are the only two
>> protocols that do this. I have verified that the following fixes rxrpc for me
>>
>> --- a/net/rxrpc/ar-input.c
>> +++ b/net/rxrpc/ar-input.c
>> @@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv
>> *sp, struct sk_buff *skb)
>> struct rxrpc_wire_header whdr;
>>
>> /* dig out the RxRPC connection details */
>> -   if (skb_copy_bits(skb, sizeof(struct udphdr), &whdr, sizeof(whdr)) < 
>> 0)
>> +   if (skb_copy_bits(skb, 0, , sizeof(whdr)) < 0)
>> return -EBADMSG;
>> -   if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr)))
>> +   if (!pskb_pull(skb, sizeof(whdr)))
>> BUG();
>>
>> I have not yet been able to reproduce the sunrpc/nfs issue, but I
>> suspect that the following might fix it. I will try to create an NFS
>> setup.
>>
>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>> index 2df87f7..8ab40ba 100644
>> --- a/net/sunrpc/socklib.c
>> +++ b/net/sunrpc/socklib.c
>> @@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr,
>> struct sk_buff *skb)
>> struct xdr_skb_reader   desc;
>>
>> desc.skb = skb;
>> -   desc.offset = sizeof(struct udphdr);
>> +   desc.offset = 0;
>> desc.count = skb->len - desc.offset;
>>
>> if (skb_csum_unnecessary(skb))
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index 1413cdc..71d6072 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>> svsk->sk_sk->sk_stamp = skb->tstamp;
>> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be
>> more data... */
>>
>> -   len  = skb->len - sizeof(struct udphdr);
>> +   len  = skb->len;
>> rqstp->rq_arg.len = len;
>>
>> rqstp->rq_prot = IPPROTO_UDP;
>> @@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>> skb_free_datagram_locked(svsk->sk_sk, skb);
>> } else {
>> /* we can use it in-place */
>> -   rqstp->rq_arg.head[0].iov_base = skb->data +
>> -   sizeof(struct udphdr);
>> +   rqstp->rq_arg.head[0].iov_base = skb->data;
>> rqstp->rq_arg.head[0].iov_len = len;
>> if (skb_checksum_complete(skb))
>> goto out_free;
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 65e7595..c1fc7b2 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
>> u32 _xid;
>> __be32 *xp;
>>
>> -   repsize = skb->len - sizeof(struct udphdr);
>> +   repsize = skb->len;
>> if (repsize < 4) {
>> dprintk("RPC:   impossible RPC reply size %d!\n", 
>> repsize);
>> return;
>> }
>>
>>
>>
>> /* Copy the XID from the skb... */
>> -   xp = skb_header_pointer(skb, sizeof(struct udphdr),
>> -   sizeof(_xid), &_xid);
>> +   xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid);
>> if (xp == NULL)
>> return;
>>
>
>
> Thank you for your quick response. I verified with all of the above
> suggested changes that NFS works again on my 3 evms.

Thanks a lot for testing, Franklin. I will send out the two patches.


Re: Backport Security Fix for CVE-2015-8787 to v4.1

2016-04-07 Thread Pablo Neira Ayuso
On Thu, Apr 07, 2016 at 03:40:30PM +0900, Yuki Machida wrote:
> Hi David,
> 
> I conformed that a patch of CVE-2015-8787 not applied at v4.1.21.
> Could you please apply a patch for 4.1-stable ?
> 
> CVE-2015-8787
> Upstream commit 94f9cd81436c85d8c3a318ba92e236ede73752fc

I'll request again, this time to Sasha Levin to include this in
-stable 4.1.

Thanks.


Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 03:21 -0700, Eric Dumazet wrote:

> Please do not send patches before really understanding the issue you
> have.
> 
> Having a backlog of 12506206 bytes is ridiculous. Dropping packets is
> absolutely fine if this ever happens.
> 
> Something is really wrong on your host, or the sender simply does not
> comply with TCP protocol (not caring of receiver window at all)
> 
> Since you added a trace of truesize, please also trace skb->len
> 

BTW, have you played with /proc/sys/net/ipv4/tcp_adv_win_scale ?





[PATCH net] vxlan: synchronously and race-free destruction of vxlan sockets

2016-04-07 Thread Hannes Frederic Sowa
Because the UDP socket is destructed asynchronously in a work queue,
we see nondeterministic behavior during shutdown of vxlan tunnels and
creation of new ones. Fix this by keeping the destruction process
synchronous with respect to the user space process, so IFF_UP can be
set reliably.

udp_tunnel_sock_release destroys vs->sock->sk if reference counter
indicates so. We expect to have the same lifetime of vxlan_sock and
vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
only destruct the whole socket after we can be sure it cannot be found
by searching vxlan_net->sock_list.

Cc: Jiri Benc 
Signed-off-by: Hannes Frederic Sowa 
---
 drivers/net/vxlan.c | 20 +++-
 include/net/vxlan.h |  2 --
 2 files changed, 3 insertions(+), 19 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 1c0fa364323e28..487e48b7a53090 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -98,7 +98,6 @@ struct vxlan_fdb {
 
 /* salt for hash table */
 static u32 vxlan_salt __read_mostly;
-static struct workqueue_struct *vxlan_wq;
 
 static inline bool vxlan_collect_metadata(struct vxlan_sock *vs)
 {
@@ -1065,7 +1064,9 @@ static void __vxlan_sock_release(struct vxlan_sock *vs)
vxlan_notify_del_rx_port(vs);
	spin_unlock(&vn->sock_lock);
 
-   queue_work(vxlan_wq, &vs->del_work);
+   synchronize_rcu();
+   udp_tunnel_sock_release(vs->sock);
+   kfree(vs);
 }
 
 static void vxlan_sock_release(struct vxlan_dev *vxlan)
@@ -2574,13 +2575,6 @@ static const struct ethtool_ops vxlan_ethtool_ops = {
.get_link   = ethtool_op_get_link,
 };
 
-static void vxlan_del_work(struct work_struct *work)
-{
-   struct vxlan_sock *vs = container_of(work, struct vxlan_sock, del_work);
-   udp_tunnel_sock_release(vs->sock);
-   kfree_rcu(vs, rcu);
-}
-
 static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
__be16 port, u32 flags)
 {
@@ -2626,8 +2620,6 @@ static struct vxlan_sock *vxlan_socket_create(struct net 
*net, bool ipv6,
for (h = 0; h < VNI_HASH_SIZE; ++h)
		INIT_HLIST_HEAD(&vs->vni_list[h]);
 
-   INIT_WORK(&vs->del_work, vxlan_del_work);
-
sock = vxlan_create_sock(net, ipv6, port, flags);
if (IS_ERR(sock)) {
pr_info("Cannot bind port %d, err=%ld\n", ntohs(port),
@@ -3218,10 +3210,6 @@ static int __init vxlan_init_module(void)
 {
int rc;
 
-   vxlan_wq = alloc_workqueue("vxlan", 0, 0);
-   if (!vxlan_wq)
-   return -ENOMEM;
-
	get_random_bytes(&vxlan_salt, sizeof(vxlan_salt));
 
	rc = register_pernet_subsys(&vxlan_net_ops);
@@ -3242,7 +3230,6 @@ out3:
 out2:
	unregister_pernet_subsys(&vxlan_net_ops);
 out1:
-   destroy_workqueue(vxlan_wq);
return rc;
 }
 late_initcall(vxlan_init_module);
@@ -3251,7 +3238,6 @@ static void __exit vxlan_cleanup_module(void)
 {
	rtnl_link_unregister(&vxlan_link_ops);
	unregister_netdevice_notifier(&vxlan_notifier_block);
-   destroy_workqueue(vxlan_wq);
	unregister_pernet_subsys(&vxlan_net_ops);
/* rcu_barrier() is called by netns */
 }
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 73ed2e951c020d..2113f808e905a4 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -126,9 +126,7 @@ struct vxlan_metadata {
 /* per UDP socket information */
 struct vxlan_sock {
struct hlist_node hlist;
-   struct work_struct del_work;
struct socket*sock;
-   struct rcu_head   rcu;
struct hlist_head vni_list[VNI_HASH_SIZE];
atomic_t  refcnt;
struct udp_offload udp_offloads;
-- 
2.5.5



Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-04-07 Thread Vijay Pandurangan
On Fri, Mar 25, 2016 at 7:46 PM, Ben Greear  wrote:
> A real NIC can either do hardware checksums, or it cannot.  If it
> cannot, then the host must do it on the CPU for both transmit and
> receive.
>
> Veth is not a real NIC, and it cannot do hardware checksum offloading.
>
> So, we either lie and pretend it does, or we eat massive amounts
> of CPU usage to calculate and check checksums when sending across
> a veth pair.
>

That's a good point. Does anyone know what the overhead actually is these days?

>>> Any frame sent from a socket can be considered to be a local packet in my
>>> opinion.
>>
>>
>> I'm not sure that's totally right. Your bridge is adding a delay to
>> your packets; it could just as easily be simulating corruption by
>> corrupting 5% of packets going through it. If this change allows
>> corrupt packets to be delivered to an application when they could not
>> be delivered if the packets were routed via physical eths, I think
>> that is a bug.
>
>
> I actually do support corrupting the frame, but what I normally do is
> corrupt the contents
> of the packet, and then recalculate the IP checksum (and TCP if it applies)
> and send it on its way.  The receiving NIC and stack will pass the frame up
> to
> the application since the checksums match, and it would be up the
> application
> to deal with it.  So, I can easily cause an application to receive corrupted
> frames over physical eths.
>
> I can also corrupt without updating the checksums in case you want to
> test another system's NIC and/or stack.
>
> But, if I am purposely corrupting a frame destined for veth, then the only
> reason
> I would want the stack to check the checksums is if I were testing my own
> stack's checksum logic, and that seems to be a pretty limited use.


In the common case you're 100% right.  OTOH, there's something
disconcerting about an abstraction layer lying and behaving
unexpectedly.  Most traffic that originates on a machine can have its
checksums safely ignored.  Whatever the reason is (maybe, as you say
you're testing checksums – on the other hand maybe there's a bug in
your code somewhere), I really feel like we should try to figure out a
way to ensure that this optimization is at the very least opt-in…

>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>


[PATCH net-next 0/2] fix udp pull header breakage

2016-04-07 Thread Willem de Bruijn
From: Willem de Bruijn 

Commit e6afc8ace6dd ("udp: remove headers from UDP packets before
queueing") modified udp receive processing to pull headers before
enqueue and to not expect them on dequeue.

The patch missed protocols on top of udp with in-kernel
implementations that have their own skb_recv_datagram calls and
dequeue logic. Modify these datapaths to also no longer expect
a udp header at skb->data.

Sunrpc and rxrpc are the only two protocols that call this
function and contain references to udphr (some others, like tipc,
are based on encap_rcv, which acts before enqueue, before the
the header pull).

Willem de Bruijn (2):
  sunrpc: do not pull udp headers on receive
  rxrpc: do not pull udp headers on receive

 net/rxrpc/ar-input.c  | 4 ++--
 net/sunrpc/socklib.c  | 2 +-
 net/sunrpc/svcsock.c  | 5 ++---
 net/sunrpc/xprtsock.c | 5 ++---
 4 files changed, 7 insertions(+), 9 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 1/2] sunrpc: do not pull udp headers on receive

2016-04-07 Thread Willem de Bruijn
From: Willem de Bruijn 

Commit e6afc8ace6dd modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.

Sunrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

Reported-by: Franklin S Cooper Jr. 
Signed-off-by: Willem de Bruijn 
---
 net/sunrpc/socklib.c  | 2 +-
 net/sunrpc/svcsock.c  | 5 ++---
 net/sunrpc/xprtsock.c | 5 ++---
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
index 2df87f7..8ab40ba 100644
--- a/net/sunrpc/socklib.c
+++ b/net/sunrpc/socklib.c
@@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct 
sk_buff *skb)
struct xdr_skb_reader   desc;
 
desc.skb = skb;
-   desc.offset = sizeof(struct udphdr);
+   desc.offset = 0;
desc.count = skb->len - desc.offset;
 
if (skb_csum_unnecessary(skb))
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1413cdc..71d6072 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
svsk->sk_sk->sk_stamp = skb->tstamp;
	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more 
data... */
 
-   len  = skb->len - sizeof(struct udphdr);
+   len  = skb->len;
rqstp->rq_arg.len = len;
 
rqstp->rq_prot = IPPROTO_UDP;
@@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
skb_free_datagram_locked(svsk->sk_sk, skb);
} else {
/* we can use it in-place */
-   rqstp->rq_arg.head[0].iov_base = skb->data +
-   sizeof(struct udphdr);
+   rqstp->rq_arg.head[0].iov_base = skb->data;
rqstp->rq_arg.head[0].iov_len = len;
if (skb_checksum_complete(skb))
goto out_free;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 65e7595..c1fc7b2 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
u32 _xid;
__be32 *xp;
 
-   repsize = skb->len - sizeof(struct udphdr);
+   repsize = skb->len;
if (repsize < 4) {
dprintk("RPC:   impossible RPC reply size %d!\n", repsize);
return;
}
 
/* Copy the XID from the skb... */
-   xp = skb_header_pointer(skb, sizeof(struct udphdr),
-   sizeof(_xid), &_xid);
+   xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid);
if (xp == NULL)
return;
 
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH net-next V3 05/16] net: fec: reduce interrupts

2016-04-07 Thread Troy Kisky
On 4/6/2016 8:57 PM, David Miller wrote:
> From: Troy Kisky 
> Date: Wed, 6 Apr 2016 17:42:47 -0700
> 
>> Sure, that's an easy change. But if a TX int is what caused the
>> interrupt and masks them, and then a RX packet happens before napi
>> runs, do you want the TX serviced 1st, or RX ?
> 
> If you properly split your driver up into separate interrupt/poll
> instances, one for TX one for RX, you wouldn't need to ask me
> that question now would you?
> 
> :-)
> 

I absolutely claim no ownership :-)


Re: Boot failure when using NFS on OMAP based evms

2016-04-07 Thread Willem de Bruijn
 Currently linux-next is failing to boot via NFS on my AM335x GP evm,
 AM437x GP evm and Beagle X15. I bisected the problem down to the commit
 "udp: remove headers from UDP packets before queueing".
>>>
>>> Thanks for the report, and apologies for breaking your configuration.
>>> I had missed that sunrpc can dequeue skbs from a udp receive
>>> queue and makes assumptions about the layout of those packets. rxrpc
>>> does the same. From what I can tell so far, those are the only two
>>> protocols that do this. I have verified that the following fixes rxrpc for 
>>> me
>>>
>>
>> Thank you for your quick response. I verified with all of the above
>> suggested changes that NFS works again on my 3 evms.
>
> Thanks a lot for testing, Franklin. I will send out the two patches.

Patches sent to netdev. I'll do another scan to verify that there are
no additional protocols that dequeue skbs from udp receive queues
and expect udp headers present.


Re: Boot failure when using NFS on OMAP based evms

2016-04-07 Thread Franklin S Cooper Jr.


On 04/06/2016 09:22 PM, Willem de Bruijn wrote:
> On Wed, Apr 6, 2016 at 7:12 PM, Franklin S Cooper Jr.  wrote:
>> Hi All,
>>
>> Currently linux-next is failing to boot via NFS on my AM335x GP evm,
>> AM437x GP evm and Beagle X15. I bisected the problem down to the commit
>> "udp: remove headers from UDP packets before queueing".
>>
>> I had to revert the following three commits to get things working again:
>>
>> e6afc8ace6dd5cef5e812f26c72579da8806f5ac
>> udp: remove headers from UDP packets before queueing
>>
>> 627d2d6b550094d88f9e518e15967e7bf906ebbf
>> udp: enable MSG_PEEK at non-zero offset
>>
>> b9bb53f3836f4eb2bdeb3447be11042bd29c2408
>> sock: convert sk_peek_offset functions to WRITE_ONCE
>>
> 
> Thanks for the report, and apologies for breaking your configuration.
> I had missed that sunrpc can dequeue skbs from a udp receive
> queue and makes assumptions about the layout of those packets. rxrpc
> does the same. From what I can tell so far, those are the only two
> protocols that do this. I have verified that the following fixes rxrpc for me
> 
> --- a/net/rxrpc/ar-input.c
> +++ b/net/rxrpc/ar-input.c
> @@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv *sp, struct sk_buff *skb)
> struct rxrpc_wire_header whdr;
> 
> /* dig out the RxRPC connection details */
> -   if (skb_copy_bits(skb, sizeof(struct udphdr), &whdr, sizeof(whdr)) < 0)
> +   if (skb_copy_bits(skb, 0, &whdr, sizeof(whdr)) < 0)
> return -EBADMSG;
> -   if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr)))
> +   if (!pskb_pull(skb, sizeof(whdr)))
> BUG();
> 
> I have not yet been able to reproduce the sunrpc/nfs issue, but I
> suspect that the following might fix it. I will try to create an NFS
> setup.
> 
> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
> index 2df87f7..8ab40ba 100644
> --- a/net/sunrpc/socklib.c
> +++ b/net/sunrpc/socklib.c
> @@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr,
> struct sk_buff *skb)
> struct xdr_skb_reader   desc;
> 
> desc.skb = skb;
> -   desc.offset = sizeof(struct udphdr);
> +   desc.offset = 0;
> desc.count = skb->len - desc.offset;
> 
> if (skb_csum_unnecessary(skb))
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 1413cdc..71d6072 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
> svsk->sk_sk->sk_stamp = skb->tstamp;
> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more data... */
> 
> -   len  = skb->len - sizeof(struct udphdr);
> +   len  = skb->len;
> rqstp->rq_arg.len = len;
> 
> rqstp->rq_prot = IPPROTO_UDP;
> @@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
> skb_free_datagram_locked(svsk->sk_sk, skb);
> } else {
> /* we can use it in-place */
> -   rqstp->rq_arg.head[0].iov_base = skb->data +
> -   sizeof(struct udphdr);
> +   rqstp->rq_arg.head[0].iov_base = skb->data;
> rqstp->rq_arg.head[0].iov_len = len;
> if (skb_checksum_complete(skb))
> goto out_free;
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 65e7595..c1fc7b2 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
> u32 _xid;
> __be32 *xp;
> 
> -   repsize = skb->len - sizeof(struct udphdr);
> +   repsize = skb->len;
> if (repsize < 4) {
> dprintk("RPC:   impossible RPC reply size %d!\n", repsize);
> return;
> }
> 
> 
> 
> /* Copy the XID from the skb... */
> -   xp = skb_header_pointer(skb, sizeof(struct udphdr),
> -   sizeof(_xid), &_xid);
> +   xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid);
> if (xp == NULL)
> return;
> 


Thank you for your quick response. I verified with all of the above
suggested changes that NFS works again on my 3 evms.


[PATCH net 0/2] tipc: name distributor pernet queue

2016-04-07 Thread Jon Maloy
Commit #1 fixes a potential issue with deferred binding table 
updates being pushed to the wrong namespace.

Commit #2 solves a problem with deferred binding table updates
remaining in the defer queue after the issuing node has gone
down.

Erik Hugne (2):
  tipc: make dist queue pernet
  tipc: purge deferred updates from dead nodes

 net/tipc/core.c   |  1 +
 net/tipc/core.h   |  3 +++
 net/tipc/name_distr.c | 35 ++-
 3 files changed, 30 insertions(+), 9 deletions(-)

-- 
1.9.1



[PATCH net 1/2] tipc: make dist queue pernet

2016-04-07 Thread Jon Maloy
From: Erik Hugne 

Nametable updates received from the network that cannot be applied
immediately are placed on a defer queue. This queue is global to the
TIPC module, which might cause problems when using TIPC in containers.
To prevent nametable updates from escaping into the wrong namespace,
we make the queue pernet instead.

Signed-off-by: Erik Hugne 
Signed-off-by: Jon Maloy 
---
 net/tipc/core.c   |  1 +
 net/tipc/core.h   |  3 +++
 net/tipc/name_distr.c | 16 +++-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 03a8428..e2bdb07a 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -69,6 +69,7 @@ static int __net_init tipc_init_net(struct net *net)
if (err)
goto out_nametbl;
 
+   INIT_LIST_HEAD(&tn->dist_queue);
err = tipc_topsrv_start(net);
if (err)
goto out_subscr;
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 5504d63..eff58dc 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -103,6 +103,9 @@ struct tipc_net {
spinlock_t nametbl_lock;
struct name_table *nametbl;
 
+   /* Name dist queue */
+   struct list_head dist_queue;
+
/* Topology subscription server */
struct tipc_server *topsrv;
atomic_t subscription_count;
diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index ebe9d0f..4f4f581 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -40,11 +40,6 @@
 
 int sysctl_tipc_named_timeout __read_mostly = 2000;
 
-/**
- * struct tipc_dist_queue - queue holding deferred name table updates
- */
-static struct list_head tipc_dist_queue = LIST_HEAD_INIT(tipc_dist_queue);
-
 struct distr_queue_item {
struct distr_item i;
u32 dtype;
@@ -279,9 +274,11 @@ static bool tipc_update_nametbl(struct net *net, struct distr_item *i,
  * tipc_named_add_backlog - add a failed name table update to the backlog
  *
  */
-static void tipc_named_add_backlog(struct distr_item *i, u32 type, u32 node)
+static void tipc_named_add_backlog(struct net *net, struct distr_item *i,
+  u32 type, u32 node)
 {
struct distr_queue_item *e;
+   struct tipc_net *tn = net_generic(net, tipc_net_id);
unsigned long now = get_jiffies_64();
 
e = kzalloc(sizeof(*e), GFP_ATOMIC);
@@ -291,7 +288,7 @@ static void tipc_named_add_backlog(struct distr_item *i, u32 type, u32 node)
e->node = node;
e->expires = now + msecs_to_jiffies(sysctl_tipc_named_timeout);
memcpy(e, i, sizeof(*i));
-   list_add_tail(&e->next, &tipc_dist_queue);
+   list_add_tail(&e->next, &tn->dist_queue);
 }
 
 /**
@@ -301,10 +298,11 @@ static void tipc_named_add_backlog(struct distr_item *i, u32 type, u32 node)
 void tipc_named_process_backlog(struct net *net)
 {
struct distr_queue_item *e, *tmp;
+   struct tipc_net *tn = net_generic(net, tipc_net_id);
char addr[16];
unsigned long now = get_jiffies_64();
 
-   list_for_each_entry_safe(e, tmp, &tipc_dist_queue, next) {
+   list_for_each_entry_safe(e, tmp, &tn->dist_queue, next) {
if (time_after(e->expires, now)) {
if (!tipc_update_nametbl(net, &e->i, e->node, e->dtype))
continue;
@@ -344,7 +342,7 @@ void tipc_named_rcv(struct net *net, struct sk_buff_head *inputq)
node = msg_orignode(msg);
while (count--) {
if (!tipc_update_nametbl(net, item, node, mtype))
-   tipc_named_add_backlog(item, mtype, node);
+   tipc_named_add_backlog(net, item, mtype, node);
item++;
}
kfree_skb(skb);
-- 
1.9.1



[PATCH net 2/2] tipc: purge deferred updates from dead nodes

2016-04-07 Thread Jon Maloy
From: Erik Hugne 

If a peer node becomes unavailable, in addition to removing the
nametable entries from this node we also need to purge all deferred
updates associated with this node.

Signed-off-by: Erik Hugne 
Signed-off-by: Jon Maloy 
---
 net/tipc/name_distr.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 4f4f581..6b626a6 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -224,12 +224,31 @@ static void tipc_publ_purge(struct net *net, struct publication *publ, u32 addr)
kfree_rcu(p, rcu);
 }
 
+/**
+ * tipc_dist_queue_purge - remove deferred updates from a node that went down
+ */
+static void tipc_dist_queue_purge(struct net *net, u32 addr)
+{
+   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct distr_queue_item *e, *tmp;
+
+   spin_lock_bh(&tn->nametbl_lock);
+   list_for_each_entry_safe(e, tmp, &tn->dist_queue, next) {
+   if (e->node != addr)
+   continue;
+   list_del(&e->next);
+   kfree(e);
+   }
+   spin_unlock_bh(&tn->nametbl_lock);
+}
+
 void tipc_publ_notify(struct net *net, struct list_head *nsub_list, u32 addr)
 {
struct publication *publ, *tmp;
 
list_for_each_entry_safe(publ, tmp, nsub_list, nodesub_list)
tipc_publ_purge(net, publ, addr);
+   tipc_dist_queue_purge(net, addr);
 }
 
 /**
-- 
1.9.1



Re: [PATCH net-next V3 00/16] net: fec: cleanup and fixes

2016-04-07 Thread Troy Kisky
On 4/6/2016 8:58 PM, David Miller wrote:
> From: Troy Kisky 
> Date: Wed, 6 Apr 2016 18:09:17 -0700
> 
>> On 4/6/2016 2:20 PM, David Miller wrote:
>>>
>>> This is a way too large patch series.
>>>
>>> Please split it up into smaller, more logical, pieces.
>>>
>>> Thanks.
>>>
>>
>> If you apply the 1st 3 that have been acked, I'll be at 13.
>>
>> Would I then send the next 5 for  V4, and when that is applied
>> send another V4 with the next 8 that have been already been acked?
> 
> What other reasonable option is there?  I can't think of any.
> 

A V1 for the next 8 would not be too unreasonable.


[PATCH net-next 2/2] rxrpc: do not pull udp headers on receive

2016-04-07 Thread Willem de Bruijn
From: Willem de Bruijn 

Commit e6afc8ace6dd modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.

Rxrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn 
---
 net/rxrpc/ar-input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/ar-input.c b/net/rxrpc/ar-input.c
index 63ed75c..4824a82 100644
--- a/net/rxrpc/ar-input.c
+++ b/net/rxrpc/ar-input.c
@@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv *sp, struct sk_buff *skb)
struct rxrpc_wire_header whdr;
 
/* dig out the RxRPC connection details */
-   if (skb_copy_bits(skb, sizeof(struct udphdr), &whdr, sizeof(whdr)) < 0)
+   if (skb_copy_bits(skb, 0, &whdr, sizeof(whdr)) < 0)
return -EBADMSG;
-   if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr)))
+   if (!pskb_pull(skb, sizeof(whdr)))
BUG();
 
memset(sp, 0, sizeof(*sp));
-- 
2.8.0.rc3.226.g39d4020



Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-07 Thread Chuck Lever

> On Apr 7, 2016, at 7:38 AM, Christoph Hellwig  wrote:
> 
> This is also very interesting for storage targets, which face the same
> issue.  SCST has a mode where it caches some fully constructed SGLs,
> which is probably very similar to what NICs want to do.

+1 for NFS server.


--
Chuck Lever





Re: [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> (Topic proposal for MM-summit)
> 
> Network Interface Cards (NIC) drivers, and increasing speeds stress
> the page-allocator (and DMA APIs).  A number of driver specific
> open-coded approaches exists that work-around these bottlenecks in the
> page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
> allocating larger pages and handing-out page "fragments".
> 
> I'm proposing a generic page-pool recycle facility, that can cover the
> driver use-cases, increase performance and open up for zero-copy RX.
> 
> 
> The basic performance problem is that pages (containing packets at RX)
> are cycled through the page allocator (freed at TX DMA completion
> time).  While a system in a steady state, could avoid calling the page
> allocator, when having a pool of pages equal to the size of the RX
> ring plus the number of outstanding frames in the TX ring (waiting for
> DMA completion).


We certainly used this at Google for quite a while.

The thing is: in steady state, the number of pages 'in tx queues'
is lower than the number of pages that were allocated for RX queues.

The page allocator is hardly hit, once you have big enough RX ring
buffers. (Nothing fancy, simply the default number of slots)

The 'hard coded' code is quite small actually:

if (page_count(page) != 1) {
	/* We are not the exclusive owner: free the page and
	 * allocate another one. Prefer __GFP_COLD pages btw.
	 */
	put_page(page);
	page = dev_alloc_page();
}
page_ref_inc(page);

The problem with a 'pool' is that it matches a router workload, not a host one.

With existing code, new pages are automatically allocated on demand if,
say, previous pages are still used by skbs stored in socket receive
queues and consumers are slow to react to the presence of this data.

But in most cases (steady state), the refcount on the page is released
by the application reading the data before the driver has cycled through
the RX ring buffer, so the driver only increments the page count.

I also played with grouping pages into the same 2MB pages, but got mixed
results.




Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-07 Thread Christoph Hellwig
This is also very interesting for storage targets, which face the same
issue.  SCST has a mode where it caches some fully constructed SGLs,
which is probably very similar to what NICs want to do.


Re: [net-next 0/8][pull request] 1GbE Intel Wired LAN Driver Updates 2016-04-06

2016-04-07 Thread David Miller
From: Jeff Kirsher 
Date: Wed,  6 Apr 2016 21:37:25 -0700

> This series contains updates to e1000, e1000e, igb and Kconfig.

All looks sane, pulled, thanks Jeff.


Re: [PATCH] ieee802154/adf7242: fix memory leak of firmware

2016-04-07 Thread Michael Hennerich

On 04/07/2016 01:16 PM, Sudip Mukherjee wrote:

If the firmware upload or the firmware verification fails then we
printed the error message and exited but we missed releasing the
firmware.

Signed-off-by: Sudip Mukherjee 


Acked-by: Michael Hennerich 


---
  drivers/net/ieee802154/adf7242.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/net/ieee802154/adf7242.c b/drivers/net/ieee802154/adf7242.c
index 89154c0..91d4531 100644
--- a/drivers/net/ieee802154/adf7242.c
+++ b/drivers/net/ieee802154/adf7242.c
@@ -1030,6 +1030,7 @@ static int adf7242_hw_init(struct adf7242_local *lp)
if (ret) {
dev_err(&lp->spi->dev,
"upload firmware failed with %d\n", ret);
+   release_firmware(fw);
return ret;
}

@@ -1037,6 +1038,7 @@ static int adf7242_hw_init(struct adf7242_local *lp)
if (ret) {
dev_err(&lp->spi->dev,
"verify firmware failed with %d\n", ret);
+   release_firmware(fw);
return ret;
}





--
Greetings,
Michael

--
Analog Devices GmbH  Wilhelm-Wagenfeld-Str. 6  80807 Muenchen
Sitz der Gesellschaft: Muenchen; Registergericht: Muenchen HRB 40368;
Geschaeftsfuehrer:Dr.Carsten Suckrow, Thomas Wessel, William A. Martin,
Margaret Seif


[PATCH iproute2 3/4] ip-link.8: document "external" flag for vxlan

2016-04-07 Thread Jiri Benc
Signed-off-by: Jiri Benc 
---
 man/man8/ip-link.8.in | 8 
 1 file changed, 8 insertions(+)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 805511423ef2..f677f8c55365 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -422,6 +422,8 @@ the following additional arguments are supported:
 ] [
 .BI maxaddress " NUMBER "
 ] [
+.RI "[no]external "
+] [
 .B gbp
 ]
 
@@ -516,6 +518,12 @@ are entered into the VXLAN device forwarding database.
 - specifies the maximum number of FDB entries.
 
 .sp
+.I [no]external
+- specifies whether an external control plane
+.RB "(e.g. " "ip route encap" )
+or the internal FDB should be used.
+
+.sp
 .B gbp
 - enables the Group Policy extension (VXLAN-GBP).
 
-- 
1.8.3.1



[PATCH iproute2 2/4] vxlan: 'external' implies 'nolearning'

2016-04-07 Thread Jiri Benc
It doesn't make sense to use an external control plane and fill the internal
FDB at the same time. It's even an illegal combination for VXLAN-GPE.

Just switch off learning when 'external' is specified.

Signed-off-by: Jiri Benc 
---
 ip/iplink_vxlan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index e3bbea0031df..e9d64b451677 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -234,6 +234,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, 
char **argv,
remcsumrx = 0;
} else if (!matches(*argv, "external")) {
metadata = 1;
+   learning = 0;
} else if (!matches(*argv, "noexternal")) {
metadata = 0;
} else if (!matches(*argv, "gbp")) {
-- 
1.8.3.1



[PATCH iproute2 4/4] vxlan: add support for VXLAN-GPE

2016-04-07 Thread Jiri Benc
Adds support to create a VXLAN-GPE interface.

Signed-off-by: Jiri Benc 
---
 ip/iplink_vxlan.c | 13 +++--
 man/man8/ip-link.8.in |  9 +
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index e9d64b451677..49a40befa5d5 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -31,7 +31,7 @@ static void print_explain(FILE *f)
fprintf(f, " [ ageing SECONDS ] [ maxaddress NUMBER ]\n");
fprintf(f, " [ [no]udpcsum ] [ [no]udp6zerocsumtx ] [ [no]udp6zerocsumrx ]\n");
fprintf(f, " [ [no]remcsumtx ] [ [no]remcsumrx ]\n");
-   fprintf(f, " [ [no]external ] [ gbp ]\n");
+   fprintf(f, " [ [no]external ] [ gbp ] [ gpe ]\n");
fprintf(f, "\n");
fprintf(f, "Where: VNI   := 0-16777215\n");
fprintf(f, "   ADDR  := { IP_ADDRESS | any }\n");
@@ -79,6 +79,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
__u8 remcsumrx = 0;
__u8 metadata = 0;
__u8 gbp = 0;
+   __u8 gpe = 0;
int dst_port_set = 0;
struct ifla_vxlan_port_range range = { 0, 0 };
 
@@ -239,6 +240,8 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
metadata = 0;
} else if (!matches(*argv, "gbp")) {
gbp = 1;
+   } else if (!matches(*argv, "gpe")) {
+   gpe = 1;
} else if (matches(*argv, "help") == 0) {
explain();
return -1;
@@ -267,7 +270,9 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
return -1;
}
 
-   if (!dst_port_set) {
+   if (!dst_port_set && gpe) {
+   dstport = 4790;
+   } else if (!dst_port_set) {
fprintf(stderr, "vxlan: destination port not specified\n"
"Will use Linux kernel default (non-standard value)\n");
fprintf(stderr,
@@ -324,6 +329,8 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 
if (gbp)
addattr_l(n, 1024, IFLA_VXLAN_GBP, NULL, 0);
+   if (gpe)
+   addattr_l(n, 1024, IFLA_VXLAN_GPE, NULL, 0);
 
 
return 0;
@@ -490,6 +497,8 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 
if (tb[IFLA_VXLAN_GBP])
fputs("gbp ", f);
+   if (tb[IFLA_VXLAN_GPE])
+   fputs("gpe ", f);
 }
 
 static void vxlan_print_help(struct link_util *lu, int argc, char **argv,
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index f677f8c55365..984fb2eb0d63 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -425,6 +425,8 @@ the following additional arguments are supported:
 .RI "[no]external "
 ] [
 .B gbp
+] [
+.B gpe
 ]
 
 .in +8
@@ -566,6 +568,13 @@ Example:
 
 .in -4
 
+.sp
+.B gpe
+- enables the Generic Protocol extension (VXLAN-GPE). Currently, this is
+only supported together with the
+.B external
+keyword.
+
 .in -8
 
 .TP
-- 
1.8.3.1



[PATCH iproute2 0/4] vxlan: add VXLAN-GPE support

2016-04-07 Thread Jiri Benc
This patchset adds a couple of refinements to the 'external' keyword and
adds support for configuring VXLAN-GPE interfaces.

The VXLAN-GPE support was recently merged to net-next.

The first patch is not intended to be applied directly but I still include
it in order for this patchset to be compilable and testable by those
interested.

Jiri Benc (4):
  if_link.h: rebase to the kernel
  vxlan: 'external' implies 'nolearning'
  ip-link.8: document "external" flag for vxlan
  vxlan: add support for VXLAN-GPE

 include/linux/if_link.h |  1 +
 ip/iplink_vxlan.c   | 14 --
 man/man8/ip-link.8.in   | 17 +
 3 files changed, 30 insertions(+), 2 deletions(-)

-- 
1.8.3.1



[PATCH iproute2 1/4] if_link.h: rebase to the kernel

2016-04-07 Thread Jiri Benc
Not intended to be applied directly. Instead, I assume Stephen will rebase
all headers to net-next as usual.

Signed-off-by: Jiri Benc 
---
 include/linux/if_link.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 9b7f3c8d64c4..7e6fe6798a41 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -484,6 +484,7 @@ enum {
IFLA_VXLAN_REMCSUM_NOPARTIAL,
IFLA_VXLAN_COLLECT_METADATA,
IFLA_VXLAN_LABEL,
+   IFLA_VXLAN_GPE,
__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
-- 
1.8.3.1



Re: [PATCH net-next] tcp/dccp: fix inet_reuseport_add_sock()

2016-04-07 Thread David Ahern

On 4/6/16 11:07 PM, Eric Dumazet wrote:

From: Eric Dumazet 

David Ahern reported panics in __inet_hash() caused by my recent commit.

The reason is inet_reuseport_add_sock() was still using
sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
SO_REUSEPORT enabled listeners were causing an instant crash.

While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
flag, as it is inherited from the parent at clone time.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet 
Reported-by: David Ahern 
---


Fixed the problem I was hitting.

Tested-by: David Ahern 



Re: [PATCH] wlcore: spi: add wl18xx support

2016-04-07 Thread Kalle Valo
Eyal Reizer  writes:

> Add support for use with both wl12xx and wl18xx.
>
> - the whole WiLink family needs a special init command for entering wspi mode.
>   extra clock cycles should be sent after the spi init command while the
>   cs pin is high.
> - switch to controlling the cs pin from the spi driver for achieving the
>   above.
> - the selected cs gpio is read from the spi device-tree node using the
>   cs-gpios field and setup as a gpio.
> - See the example below for specifying the cs gpio using the cs-gpios entry
>
> &spi0 {
>   status = "okay";
>   pinctrl-names = "default";
>   pinctrl-0 = <_pins>;
>   cs-gpios = < 5 0>;
>   #address-cells = <1>;
>   #size-cells = <0>;
>   wlcore: wlcore@0 {
>   compatible = "ti,wl1835";
>   vwlan-supply = <_en_reg>;
>   spi-max-frequency = <4800>;
>   reg = <0>;  /* chip select 0 on spi0, ie spi0.0 */
>   interrupt-parent = <>;
>   interrupts = <27 IRQ_TYPE_EDGE_RISING>;
>   };
> };
>
> Signed-off-by: Eyal Reizer 

[...]

>  static const struct of_device_id wlcore_spi_of_match_table[] = {
> - { .compatible = "ti,wl1271" },
> + { .compatible = "ti,wl1271", .data = _data},
> + { .compatible = "ti,wl1273", .data = _data},
> + { .compatible = "ti,wl1281", .data = _data},
> + { .compatible = "ti,wl1283", .data = _data},
> + { .compatible = "ti,wl1801", .data = _data},
> + { .compatible = "ti,wl1805", .data = _data},
> + { .compatible = "ti,wl1807", .data = _data},
> + { .compatible = "ti,wl1831", .data = _data},
> + { .compatible = "ti,wl1835", .data = _data},
> + { .compatible = "ti,wl1837", .data = _data},
>   { }

Shouldn't you also update bindings/net/wireless/ti,wlcore,spi.txt? Now
it only mentions about ti,wl1271 and not anything about the rest.

Adding devicetree list for further comments.

-- 
Kalle Valo


[LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-07 Thread Jesper Dangaard Brouer
(Topic proposal for MM-summit)

Network Interface Card (NIC) drivers and increasing link speeds stress
the page allocator (and DMA APIs).  A number of driver-specific
open-coded approaches exist that work around these bottlenecks in the
page allocator and DMA APIs, e.g. open-coded recycle mechanisms, and
allocating larger pages and handing out page "fragments".

I'm proposing a generic page-pool recycle facility, that can cover the
driver use-cases, increase performance and open up for zero-copy RX.


The basic performance problem is that pages (containing packets at RX)
are cycled through the page allocator (freed at TX DMA completion
time).  A system in a steady state could avoid calling the page
allocator by keeping a pool of pages equal to the size of the RX
ring plus the number of outstanding frames in the TX ring (waiting for
DMA completion).

The motivation for quick page recycling is primarily performance.
But returning pages to the same pool also benefits other
use-cases.  If a NIC HW RX ring is strictly bound (e.g. to a process
or guest/KVM) then pages can be shared/mmap'ed (RX zero-copy), as
information leaking does not occur.  (Obviously for this use-case,
pages added into the pool need to be zeroed out first.)


The motivation behind implementing this (extremely fast) page-pool is
that we need it as a building block in the network stack, but
hopefully other areas could also benefit from it.


[Resources/Links]: It is specifically related to:

What Facebook calls XDP (eXpress Data Path)
 * https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
 * RFC patchset thread: http://thread.gmane.org/gmane.linux.network/406288

And what I call the "packet-page" level:
 * BoF on kernel network performance: http://lwn.net/Articles/676806/
 * http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html


See you soon at LFS/MM-summit :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


RE: [PATCH v7 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-04-07 Thread Dexuan Cui
> From: Joe Perches [mailto:j...@perches.com]
> Sent: Thursday, April 7, 2016 19:30
> To: Dexuan Cui ; gre...@linuxfoundation.org;
> da...@davemloft.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> jasow...@redhat.com; KY Srinivasan ; Haiyang Zhang
> 
> Cc: vkuzn...@redhat.com
> Subject: Re: [PATCH v7 net-next 1/1] hv_sock: introduce Hyper-V Sockets
> 
> On Thu, 2016-04-07 at 05:50 -0700, Dexuan Cui wrote:
> > Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
> > mechanism between the host and the guest. It's somewhat like TCP over
> > VMBus, but the transportation layer (VMBus) is much simpler than IP.
> 
> style trivia:
> 
> > diff --git a/net/hv_sock/af_hvsock.c b/net/hv_sock/af_hvsock.c
> []
> > +static struct sock *__hvsock_find_bound_socket(const struct sockaddr_hv *addr)
> > +{
> > +   struct hvsock_sock *hvsk;
> > +
> > +   list_for_each_entry(hvsk, &hvsock_bound_list, bound_list)
> > +   if (uuid_equals(addr->shv_service_id,
> > +   hvsk->local_addr.shv_service_id))
> > +   return hvsock_to_sk(hvsk);
> 
> Because there's an if, it's generally nicer to use
> braces in the list_for_each

Thanks for the suggestion, Joe!
I'll add {}.

> > +static struct sock *__hvsock_find_connected_socket_by_channel(
> > +   const struct vmbus_channel *channel)
> > +{
> > +   struct hvsock_sock *hvsk;
> > +
> > +   list_for_each_entry(hvsk, &hvsock_connected_list, connected_list)
> > +   if (hvsk->channel == channel)
> > +   return hvsock_to_sk(hvsk);
> > +   return NULL;
> 
> here too
I'll fix this too.

> > +static int hvsock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
> > +{
> []
> > +   if (msg->msg_flags & ~MSG_DONTWAIT) {
> > +   pr_err("hvsock_sendmsg: unsupported flags=0x%x\n",
> > +      msg->msg_flags);
> 
> All the pr_ messages with embedded function
> names could use "%s:", __func__
I'll fix this.

Thanks,
-- Dexuan


RE: [PATCH] wlcore: spi: add wl18xx support

2016-04-07 Thread Reizer, Eyal


> -Original Message-
> From: Kalle Valo [mailto:kv...@codeaurora.org]
> Sent: Thursday, April 07, 2016 3:25 PM
> To: Eyal Reizer
> Cc: linux-wirel...@vger.kernel.org; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; Reizer, Eyal; devicet...@vger.kernel.org
> Subject: Re: [PATCH] wlcore: spi: add wl18xx support
> 
> Eyal Reizer  writes:
> 
> > Add support for use with both wl12xx and wl18xx.
> >
> > - the whole WiLink family needs a special init command for entering wspi mode.
> >   extra clock cycles should be sent after the spi init command while the
> >   cs pin is high.
> > - switch to controlling the cs pin from the spi driver for achieving the
> >   above.
> > - the selected cs gpio is read from the spi device-tree node using the
> >   cs-gpios field and setup as a gpio.
> > - See the example below for specifying the cs gpio using the cs-gpios
> > entry
> >
> >{
> > status = "okay";
> > pinctrl-names = "default";
> > pinctrl-0 = <_pins>;
> > cs-gpios = < 5 0>;
> > #address-cells = <1>;
> > #size-cells = <0>;
> > wlcore: wlcore@0 {
> > compatible = "ti,wl1835";
> > vwlan-supply = <_en_reg>;
> > spi-max-frequency = <4800>;
> > reg = <0>;  /* chip select 0 on spi0, ie spi0.0 */
> > interrupt-parent = <>;
> > interrupts = <27 IRQ_TYPE_EDGE_RISING>;
> > };
> > };
> >
> > Signed-off-by: Eyal Reizer 
> 
> [...]
> 
> >  static const struct of_device_id wlcore_spi_of_match_table[] = {
> > -   { .compatible = "ti,wl1271" },
> > +   { .compatible = "ti,wl1271", .data = _data},
> > +   { .compatible = "ti,wl1273", .data = _data},
> > +   { .compatible = "ti,wl1281", .data = _data},
> > +   { .compatible = "ti,wl1283", .data = _data},
> > +   { .compatible = "ti,wl1801", .data = _data},
> > +   { .compatible = "ti,wl1805", .data = _data},
> > +   { .compatible = "ti,wl1807", .data = _data},
> > +   { .compatible = "ti,wl1831", .data = _data},
> > +   { .compatible = "ti,wl1835", .data = _data},
> > +   { .compatible = "ti,wl1837", .data = _data},
> > { }
> 
> Shouldn't you also update bindings/net/wireless/ti,wlcore,spi.txt? Now it only
> mentions about ti,wl1271 and not anything about the rest.

You are right! 
Will be fixed in v2

> 
> Adding devicetree list for further comments.
> 
> --
> Kalle Valo

Best Regards,
Eyal


Re: [PATCH net-next 5/7] sctp: reuse the some transport traversal functions in proc

2016-04-07 Thread Neil Horman
On Tue, Apr 05, 2016 at 12:06:30PM +0800, Xin Long wrote:
> There are some transport traversal functions for sctp_diag; we can also
> use them for sctp_proc, since both need to traverse transports in a
> similar way.
> 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/proc.c | 80 
> +
>  1 file changed, 18 insertions(+), 62 deletions(-)
> 
> diff --git a/net/sctp/proc.c b/net/sctp/proc.c
> index 5cfac8d..dd8492f 100644
> --- a/net/sctp/proc.c
> +++ b/net/sctp/proc.c
> @@ -282,80 +282,31 @@ struct sctp_ht_iter {
>   struct rhashtable_iter hti;
>  };
>  
> -static struct sctp_transport *sctp_transport_get_next(struct seq_file *seq)
> -{
> - struct sctp_ht_iter *iter = seq->private;
> - struct sctp_transport *t;
> -
> - t = rhashtable_walk_next(&iter->hti);
> - for (; t; t = rhashtable_walk_next(&iter->hti)) {
> - if (IS_ERR(t)) {
> - if (PTR_ERR(t) == -EAGAIN)
> - continue;
> - break;
> - }
> -
> - if (net_eq(sock_net(t->asoc->base.sk), seq_file_net(seq)) &&
> - t->asoc->peer.primary_path == t)
> - break;
> - }
> -
> - return t;
> -}
> -

This may just be a nit, but you defined the new sctp_transport_get_next in patch
2 of this series and didn't remove this private version until here. Is that
going to cause a behavioral issue if someone builds a kernel between patches 2
and 7?  Seems like perhaps those two patches should be merged.

Neil



[PATCH net-next v3 2/2] tipc: stricter filtering of packets in bearer layer

2016-04-07 Thread Jon Maloy
Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.

To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending is still possible. For the latter,
we use the fact that the device's user pointer is now NULL to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only. This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.

Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/bearer.c | 50 +-
 net/tipc/msg.h|  5 +
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 20566e9..6f11c62 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -337,23 +337,16 @@ static int tipc_reset_bearer(struct net *net, struct 
tipc_bearer *b)
  */
 static void bearer_disable(struct net *net, struct tipc_bearer *b)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-   u32 i;
+   struct tipc_net *tn = tipc_net(net);
+   int bearer_id = b->identity;
 
pr_info("Disabling bearer <%s>\n", b->name);
b->media->disable_media(b);
-
-   tipc_node_delete_links(net, b->identity);
+   tipc_node_delete_links(net, bearer_id);
RCU_INIT_POINTER(b->media_ptr, NULL);
if (b->link_req)
tipc_disc_delete(b->link_req);
-
-   for (i = 0; i < MAX_BEARERS; i++) {
-   if (b == rtnl_dereference(tn->bearer_list[i])) {
-   RCU_INIT_POINTER(tn->bearer_list[i], NULL);
-   break;
-   }
-   }
+   RCU_INIT_POINTER(tn->bearer_list[bearer_id], NULL);
kfree_rcu(b, rcu);
 }
 
@@ -396,7 +389,7 @@ void tipc_disable_l2_media(struct tipc_bearer *b)
 
 /**
  * tipc_l2_send_msg - send a TIPC packet out over an L2 interface
- * @buf: the packet to be sent
+ * @skb: the packet to be sent
  * @b: the bearer through which the packet is to be sent
  * @dest: peer destination address
  */
@@ -405,17 +398,21 @@ int tipc_l2_send_msg(struct net *net, struct sk_buff *skb,
 {
struct net_device *dev;
int delta;
+   void *tipc_ptr;
 
dev = (struct net_device *)rcu_dereference_rtnl(b->media_ptr);
if (!dev)
return 0;
 
+   /* Send RESET message even if bearer is detached from device */
+   tipc_ptr = rtnl_dereference(dev->tipc_ptr);
+   if (unlikely(!tipc_ptr && !msg_is_reset(buf_msg(skb))))
+   goto drop;
+
delta = dev->hard_header_len - skb_headroom(skb);
if ((delta > 0) &&
-   pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-   kfree_skb(skb);
-   return 0;
-   }
+   pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
+   goto drop;
 
skb_reset_network_header(skb);
skb->dev = dev;
@@ -424,6 +421,9 @@ int tipc_l2_send_msg(struct net *net, struct sk_buff *skb,
dev->dev_addr, skb->len);
dev_queue_xmit(skb);
return 0;
+drop:
+   kfree_skb(skb);
+   return 0;
 }
 
 int tipc_bearer_mtu(struct net *net, u32 bearer_id)
@@ -549,9 +549,18 @@ static int tipc_l2_device_event(struct notifier_block *nb, 
unsigned long evt,
 {
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct net *net = dev_net(dev);
+   struct tipc_net *tn = tipc_net(net);
struct tipc_bearer *b;
+   int i;
 
b = rtnl_dereference(dev->tipc_ptr);
+   if (!b) {
+   for (i = 0; i < MAX_BEARERS; b = NULL, i++) {
+   b = rtnl_dereference(tn->bearer_list[i]);
+   if (b && (b->media_ptr == dev))
+   break;
+   }
+   }
if (!b)
return NOTIFY_DONE;
 
@@ -561,13 +570,20 @@ static int tipc_l2_device_event(struct notifier_block 
*nb, unsigned long evt,
case NETDEV_CHANGE:
if (netif_carrier_ok(dev))
break;
+   case NETDEV_UP:
+   rcu_assign_pointer(dev->tipc_ptr, b);
+   break;
case NETDEV_GOING_DOWN:
+   RCU_INIT_POINTER(dev->tipc_ptr, NULL);
+   synchronize_net();
+   tipc_reset_bearer(net, b);
+   break;
case NETDEV_CHANGEMTU:
tipc_reset_bearer(net, b);
break;
case NETDEV_CHANGEADDR:

[PATCH net-next v3 0/2] tipc: some small fixes

2016-04-07 Thread Jon Maloy
We fix a minor buffer leak, and ensure that bearers filter packets
correctly while they are being shut down.

v2: Corrected typos in commit #3, as per feedback from S. Shtylyov
v3: Removed commit #3 from the series. Improved version will be 
re-submitted later.

Jon Maloy (2):
  tipc: eliminate buffer leak in bearer layer
  tipc: stricter filtering of packets in bearer layer

 net/tipc/bearer.c   | 101 ++--
 net/tipc/discover.c |   7 ++--
 net/tipc/discover.h |   2 +-
 net/tipc/msg.h  |   5 +++
 4 files changed, 67 insertions(+), 48 deletions(-)

-- 
1.9.1



Re: [PATCH v2 2/2] sctp: delay calls to sk_data_ready() as much as possible

2016-04-07 Thread Marcelo Ricardo Leitner
On Thu, Apr 07, 2016 at 10:05:32AM +0200, Jakub Sitnicki wrote:
> On Wed, Apr 06, 2016 at 07:53 PM CEST, Marcelo Ricardo Leitner 
>  wrote:
> > Currently, the processing of multiple chunks in a single SCTP packet
> > leads to multiple calls to sk_data_ready, causing multiple wake up
> > signals which are costly and doesn't make it wake up any faster.
> >
> > With this patch it will notice that the wake up is pending and will do it
> > before leaving the state machine interpreter, the latest place possible to
> > do it reliably and cleanly.
> >
> > Note that sk_data_ready events are not dependent on asocs, unlike waking
> > up writers.
> >
> > Signed-off-by: Marcelo Ricardo Leitner 
> > ---
> 
> [...]
> 
> > diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> > index 
> > 7fe56d0acabf66cfd8fe29dfdb45f7620b470ac7..e7042f9ce63b0cfca50cae252f51b60b68cb5731
> >  100644
> > --- a/net/sctp/sm_sideeffect.c
> > +++ b/net/sctp/sm_sideeffect.c
> > @@ -1742,6 +1742,11 @@ out:
> > error = sctp_outq_uncork(&asoc->outqueue, gfp);
> > } else if (local_cork)
> > error = sctp_outq_uncork(&asoc->outqueue, gfp);
> > +
> > +   if (sctp_sk(ep->base.sk)->pending_data_ready) {
> > +   ep->base.sk->sk_data_ready(ep->base.sk);
> > +   sctp_sk(ep->base.sk)->pending_data_ready = 0;
> > +   }
> > return error;
> >  nomem:
> > error = -ENOMEM;
> 
> Would it make sense to introduce a local variable for ep->base.sk (and
> make this function 535+1 lines long ;-)
> 
>   struct sock *sk = ep->base.sk;
> 
> ... like sctp_ulpq_tail_event() does?

I guess so, yes. Same for sctp_sk() cast then. I'll post a new version
later, thanks.

  Marcelo


[PATCH net-next v3 1/2] tipc: eliminate buffer leak in bearer layer

2016-04-07 Thread Jon Maloy
When enabling a bearer we create a 'neighbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the send function used, tipc_bearer_xmit_skb(), does not free the given
buffer when it cannot find a bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.

This commit fixes this problem by introducing two changes:

1) Instead of attempting to send the discovery message directly, we let
   tipc_disc_create() return the discovery buffer to the calling
   function, tipc_enable_bearer(), so that the latter can send it
   when the enabling sequence is finished.

2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
   functions at the bearer layer, we now free the indicated buffer or
   buffer chain when a valid bearer cannot be found.

Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/bearer.c   | 51 ++-
 net/tipc/discover.c |  7 ++-
 net/tipc/discover.h |  2 +-
 3 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 27a5406..20566e9 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -205,6 +205,7 @@ static int tipc_enable_bearer(struct net *net, const char 
*name,
struct tipc_bearer *b;
struct tipc_media *m;
struct tipc_bearer_names b_names;
+   struct sk_buff *skb;
char addr_string[16];
u32 bearer_id;
u32 with_this_prio;
@@ -301,7 +302,7 @@ restart:
b->net_plane = bearer_id + 'A';
b->priority = priority;
 
-   res = tipc_disc_create(net, b, &b->bcast_addr);
+   res = tipc_disc_create(net, b, &b->bcast_addr, &skb);
if (res) {
bearer_disable(net, b);
pr_warn("Bearer <%s> rejected, discovery object creation 
failed\n",
@@ -310,7 +311,8 @@ restart:
}
 
rcu_assign_pointer(tn->bearer_list[bearer_id], b);
-
+   if (skb)
+   tipc_bearer_xmit_skb(net, bearer_id, skb, &b->bcast_addr);
pr_info("Enabled bearer <%s>, discovery domain %s, priority %u\n",
name,
tipc_addr_string_fill(addr_string, disc_domain), priority);
@@ -450,6 +452,8 @@ void tipc_bearer_xmit_skb(struct net *net, u32 bearer_id,
b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
if (likely(b))
b->media->send_msg(net, skb, b, dest);
+   else
+   kfree_skb(skb);
rcu_read_unlock();
 }
 
@@ -468,11 +472,11 @@ void tipc_bearer_xmit(struct net *net, u32 bearer_id,
 
rcu_read_lock();
b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
-   if (likely(b)) {
-   skb_queue_walk_safe(xmitq, skb, tmp) {
-   __skb_dequeue(xmitq);
-   b->media->send_msg(net, skb, b, dst);
-   }
+   if (unlikely(!b))
+   __skb_queue_purge(xmitq);
+   skb_queue_walk_safe(xmitq, skb, tmp) {
+   __skb_dequeue(xmitq);
+   b->media->send_msg(net, skb, b, dst);
}
rcu_read_unlock();
 }
@@ -490,14 +494,14 @@ void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
 
rcu_read_lock();
b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
-   if (likely(b)) {
-   skb_queue_walk_safe(xmitq, skb, tmp) {
-   hdr = buf_msg(skb);
-   msg_set_non_seq(hdr, 1);
-   msg_set_mc_netid(hdr, net_id);
-   __skb_dequeue(xmitq);
-   b->media->send_msg(net, skb, b, &b->bcast_addr);
-   }
+   if (unlikely(!b))
+   __skb_queue_purge(xmitq);
+   skb_queue_walk_safe(xmitq, skb, tmp) {
+   hdr = buf_msg(skb);
+   msg_set_non_seq(hdr, 1);
+   msg_set_mc_netid(hdr, net_id);
+   __skb_dequeue(xmitq);
+   b->media->send_msg(net, skb, b, &b->bcast_addr);
}
rcu_read_unlock();
 }
@@ -513,24 +517,21 @@ void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
  * ignores packets sent using interface multicast, and traffic sent to other
  * nodes (which can happen if interface is running in promiscuous mode).
  */
-static int tipc_l2_rcv_msg(struct sk_buff *buf, struct net_device *dev,
+static int tipc_l2_rcv_msg(struct sk_buff *skb, struct net_device *dev,
   struct packet_type *pt, struct net_device *orig_dev)
 {
struct tipc_bearer *b;
 
rcu_read_lock();
b = rcu_dereference_rtnl(dev->tipc_ptr);
-   if (likely(b)) {
-   if (likely(buf->pkt_type <= PACKET_BROADCAST)) {
-   buf->next = NULL;
- 

[PATCH v6 net-next] net: ipv4: Consider failed nexthops in multipath routes

2016-04-07 Thread David Ahern
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

 [h2]   [h3]   15.0.0.5
  |  |
 3| 3|
[SP1]  [SP2]--+
 1  2   1 2
 |  | /-+ |
 |   \   /|
 | X  |
 |/ \ |
 |   /   \---\|
 1  2 1   2
 12.0.0.2  [TOR1] 3-3 [TOR2] 12.0.0.3
 4 4
  \   /
\/
 \  /
  ---|   |-/
 1   2
[TOR3]
  3|
   |
  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2  dev swp1 weight 1
nexthop via 12.0.0.3  dev swp1 weight 1
...

If the link between tor3 and tor1 is down, and the link between tor1
and tor2 is down as well, then tor1 is effectively cut off from h1. Yet
the route lookups in h1 alternate between the 2 routes: ping 15.0.0.5
gets one and ssh 15.0.0.5 gets the other. Connections that attempt to
use the 12.0.0.2 nexthop fail since that neighbor is not reachable:

root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1  FAILED
...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern 
---
v6
- changed __neigh_lookup_noref to __ipv4_neigh_lookup_noref per Dave's
  comment

v5
- returned comma that got lost in the ether and removed resetting of
  nhsel at end of loop - again comments from Julian

v4
- remove NULL initializer and logic for fallback per Julian's comment

v3
- Julian comments: changed use of dead in documentation to failed,
  init state to NUD_REACHABLE which simplifies fib_good_nh, use of
  nh_dev for neighbor lookup, fallback to first entry which is what
  current logic does

v2
- use rcu locking to avoid refcnts per Eric's suggestion
- only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's
  comment
- drop the 'state == NUD_REACHABLE' from the state check since it is
  part of NUD_VALID (comment from Julian)
- wrapped the use of the neigh in a sysctl

 Documentation/networking/ip-sysctl.txt | 10 ++
 include/net/netns/ipv4.h   |  3 +++
 net/ipv4/fib_semantics.c   | 34 +-
 net/ipv4/sysctl_net_ipv4.c | 11 +++
 4 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index b183e2b606c8..6c7f365b1515 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN
fwmark of the packet they are replying to.
Default: 0
 
+fib_multipath_use_neigh - BOOLEAN
+   Use status of existing neighbor entry when determining nexthop for
+   multipath routes. If disabled, neighbor information is not used and
+   packets could be directed to a failed nexthop. Only valid for kernels
+   built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+   Default: 0 (disabled)
+   Possible values:
+   0 - disabled
+   1 - enabled
+
 route/max_size - INTEGER
Maximum number of routes allowed in the kernel.  Increase
this when using large numbers of interfaces and/or routes.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a69cde3ce460..d061ffeb1e71 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -133,6 +133,9 @@ struct netns_ipv4 {
struct fib_rules_ops*mr_rules_ops;
 #endif
 #endif
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+   int sysctl_fib_multipath_use_neigh;
+#endif
atomic_trt_genid;
 };
 #endif
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e8ff10..ab64d9f2eef9 100644
--- a/net/ipv4/fib_semantics.c
+++ 

[PATCH v7 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-04-07 Thread Dexuan Cui
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Vitaly Kuznetsov 
---
 MAINTAINERS |2 +
 include/linux/hyperv.h  |   16 +
 include/linux/socket.h  |5 +-
 include/net/af_hvsock.h |   51 ++
 include/uapi/linux/hyperv.h |   16 +
 net/Kconfig |1 +
 net/Makefile|1 +
 net/hv_sock/Kconfig |   10 +
 net/hv_sock/Makefile|3 +
 net/hv_sock/af_hvsock.c | 1481 +++
 10 files changed, 1584 insertions(+), 2 deletions(-)
 create mode 100644 include/net/af_hvsock.h
 create mode 100644 net/hv_sock/Kconfig
 create mode 100644 net/hv_sock/Makefile
 create mode 100644 net/hv_sock/af_hvsock.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 67d99dd..7b6f203 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5267,7 +5267,9 @@ F:drivers/pci/host/pci-hyperv.c
 F: drivers/net/hyperv/
 F: drivers/scsi/storvsc_drv.c
 F: drivers/video/fbdev/hyperv_fb.c
+F: net/hv_sock/
 F: include/linux/hyperv.h
+F: include/net/af_hvsock.h
 F: tools/hv/
 F: Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index aa0fadc..b92439d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1338,4 +1338,20 @@ extern __u32 vmbus_proto_version;
 
 int vmbus_send_tl_connect_request(const uuid_le *shv_guest_servie_id,
  const uuid_le *shv_host_servie_id);
+struct vmpipe_proto_header {
+   u32 pkt_type;
+   u32 data_size;
+} __packed;
+
+#define HVSOCK_HEADER_LEN  (sizeof(struct vmpacket_descriptor) + \
+sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN   (sizeof(u64))
+
+#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \
+   ALIGN((payload_len), 8) + \
+   PREV_INDICES_LEN)
+#define HVSOCK_MIN_PKT_LEN HVSOCK_PKT_LEN(1)
+
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 73bf6c6..88b1ccd 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -201,8 +201,8 @@ struct ucred {
 #define AF_NFC 39  /* NFC sockets  */
 #define AF_VSOCK   40  /* vSockets */
 #define AF_KCM 41  /* Kernel Connection Multiplexor*/
-
-#define AF_MAX 42  /* For now.. */
+#define AF_HYPERV  42  /* Hyper-V Sockets  */
+#define AF_MAX 43  /* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC  AF_UNSPEC
@@ -249,6 +249,7 @@ struct ucred {
 #define PF_NFC AF_NFC
 #define PF_VSOCK   AF_VSOCK
 #define PF_KCM AF_KCM
+#define PF_HYPERV  AF_HYPERV
 #define PF_MAX AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
diff --git a/include/net/af_hvsock.h b/include/net/af_hvsock.h
new file mode 100644
index 000..a5aa28d
--- /dev/null
+++ b/include/net/af_hvsock.h
@@ -0,0 +1,51 @@
+#ifndef __AF_HVSOCK_H__
+#define __AF_HVSOCK_H__
+
+#include 
+#include 
+#include 
+
+#define VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV (5 * PAGE_SIZE)
+#define VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND (5 * PAGE_SIZE)
+
+#define HVSOCK_RCV_BUF_SZ  VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV
+#define HVSOCK_SND_BUF_SZ  PAGE_SIZE
+
+#define sk_to_hvsock(__sk)((struct hvsock_sock *)(__sk))
+#define hvsock_to_sk(__hvsk)   ((struct sock *)(__hvsk))
+
+struct hvsock_sock {
+   /* sk must be the first member. */
+   struct sock sk;
+
+   struct sockaddr_hv local_addr;
+   struct sockaddr_hv remote_addr;
+
+   /* protected by the global hvsock_mutex */
+   struct list_head bound_list;
+   struct list_head connected_list;
+
+   struct list_head accept_queue;
+   /* used by enqueue and dequeue */
+   struct mutex accept_queue_mutex;
+
+   struct delayed_work dwork;
+
+   u32 peer_shutdown;
+
+   struct vmbus_channel *channel;
+
+   struct {
+   struct 

[PATCH] ieee802154/adf7242: fix memory leak of firmware

2016-04-07 Thread Sudip Mukherjee
If the firmware upload or the firmware verification fails, we print
the error message and return, but we missed releasing the firmware.

Signed-off-by: Sudip Mukherjee 
---
 drivers/net/ieee802154/adf7242.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ieee802154/adf7242.c b/drivers/net/ieee802154/adf7242.c
index 89154c0..91d4531 100644
--- a/drivers/net/ieee802154/adf7242.c
+++ b/drivers/net/ieee802154/adf7242.c
@@ -1030,6 +1030,7 @@ static int adf7242_hw_init(struct adf7242_local *lp)
if (ret) {
dev_err(&lp->spi->dev,
"upload firmware failed with %d\n", ret);
+   release_firmware(fw);
return ret;
}
 
@@ -1037,6 +1038,7 @@ static int adf7242_hw_init(struct adf7242_local *lp)
if (ret) {
dev_err(&lp->spi->dev,
"verify firmware failed with %d\n", ret);
+   release_firmware(fw);
return ret;
}
 
-- 
1.9.1



Re: [PATCH v7 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-04-07 Thread Joe Perches
On Thu, 2016-04-07 at 05:50 -0700, Dexuan Cui wrote:
> Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
> mechanism between the host and the guest. It's somewhat like TCP over
> VMBus, but the transportation layer (VMBus) is much simpler than IP.

style trivia:

> diff --git a/net/hv_sock/af_hvsock.c b/net/hv_sock/af_hvsock.c
[]
> +static struct sock *__hvsock_find_bound_socket(const struct sockaddr_hv 
> *addr)
> +{
> + struct hvsock_sock *hvsk;
> + 
> + list_for_each_entry(hvsk, &hvsock_bound_list, bound_list)
> + if (uuid_equals(addr->shv_service_id,
> + hvsk->local_addr.shv_service_id))
> + return hvsock_to_sk(hvsk);

Because there's an if, it's generally nicer to use
braces in the list_for_each
> +static struct sock *__hvsock_find_connected_socket_by_channel(
> + const struct vmbus_channel *channel)
> +{
> + struct hvsock_sock *hvsk;
> +
> + list_for_each_entry(hvsk, &hvsock_connected_list, connected_list)
> + if (hvsk->channel == channel)
> + return hvsock_to_sk(hvsk);
> + return NULL;

here too

> +static int hvsock_sendmsg(struct socket *sock, struct msghdr *msg, size_t 
> len)
> +{
[]
> + if (msg->msg_flags & ~MSG_DONTWAIT) {
> + pr_err("hvsock_sendmsg: unsupported flags=0x%x\n",
> +    msg->msg_flags);

All the pr_ messages with embedded function
names could use "%s:", __func__



Re: [PATCH V2 5/8] net: mediatek: fix mtk_pending_work

2016-04-07 Thread kbuild test robot
Hi John,

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.6-rc2 next-20160407]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/John-Crispin/net-mediatek-make-the-driver-pass-stress-tests/20160408-033430
config: arm-allyesconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

Note: the 
linux-review/John-Crispin/net-mediatek-make-the-driver-pass-stress-tests/20160408-033430
 HEAD e648090f60723da77108430208b4b957c481048b builds fine.
  It only hurts bisectability.

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/list.h:8:0,
from include/linux/kobject.h:20,
from include/linux/device.h:17,
from include/linux/node.h:17,
from include/linux/cpu.h:16,
from include/linux/of_device.h:4,
from drivers/net/ethernet/mediatek/mtk_eth_soc.c:15:
   drivers/net/ethernet/mediatek/mtk_eth_soc.c: In function 'mtk_pending_work':
>> include/linux/kernel.h:824:27: error: 'struct mtk_eth' has no member named 
>> 'pending_work'
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
  ^
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1433:24: note: in expansion of 
>> macro 'container_of'
 struct mtk_eth *eth = container_of(work, struct mtk_eth, pending_work);
   ^
   include/linux/kernel.h:824:48: warning: initialization from incompatible 
pointer type
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
   ^
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1433:24: note: in expansion of 
>> macro 'container_of'
 struct mtk_eth *eth = container_of(work, struct mtk_eth, pending_work);
   ^
   include/linux/kernel.h:824:48: warning: (near initialization for 'eth')
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
   ^
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1433:24: note: in expansion of 
>> macro 'container_of'
 struct mtk_eth *eth = container_of(work, struct mtk_eth, pending_work);
   ^
   In file included from include/linux/compiler.h:60:0,
from include/linux/ioport.h:12,
from include/linux/device.h:16,
from include/linux/node.h:17,
from include/linux/cpu.h:16,
from include/linux/of_device.h:4,
from drivers/net/ethernet/mediatek/mtk_eth_soc.c:15:
>> include/linux/compiler-gcc.h:158:2: error: 'struct mtk_eth' has no member 
>> named 'pending_work'
 __builtin_offsetof(a, b)
 ^
   include/linux/stddef.h:16:32: note: in expansion of macro 
'__compiler_offsetof'
#define offsetof(TYPE, MEMBER) __compiler_offsetof(TYPE, MEMBER)
   ^
   include/linux/kernel.h:825:29: note: in expansion of macro 'offsetof'
 (type *)( (char *)__mptr - offsetof(type,member) );})
^
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1433:24: note: in expansion of 
>> macro 'container_of'
 struct mtk_eth *eth = container_of(work, struct mtk_eth, pending_work);
   ^

vim +824 include/linux/kernel.h

^1da177e Linus Torvalds 2005-04-16  818   * @ptr:   the pointer to the 
member.
^1da177e Linus Torvalds 2005-04-16  819   * @type:  the type of the 
container struct this is embedded in.
^1da177e Linus Torvalds 2005-04-16  820   * @member:the name of the member 
within the struct.
^1da177e Linus Torvalds 2005-04-16  821   *
^1da177e Linus Torvalds 2005-04-16  822   */
^1da177e Linus Torvalds 2005-04-16  823  #define container_of(ptr, type, 
member) ({ \
^1da177e Linus Torvalds 2005-04-16 @824 const typeof( ((type 
*)0)->member ) *__mptr = (ptr);\
^1da177e Linus Torvalds 2005-04-16  825 (type *)( (char *)__mptr - 
offsetof(type,member) );})
^1da177e Linus Torvalds 2005-04-16  826  
b9d4f426 Arnaud Lacombe 2011-07-25  827  /* Rebuild everything on 
CONFIG_FTRACE_MCOUNT_RECORD */

:: The code at line 824 was first introduced by commit
:: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:: TO: Linus Torvalds <torva...@ppc970.osdl.org>
:: CC: Linus Torvalds <torva...@ppc970.osdl.org>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [net-next:master 194/196] include/net/sock.h:1367:9: error: implicit declaration of function 'lockdep_is_held'

2016-04-07 Thread Hannes Frederic Sowa


On Thu, Apr 7, 2016, at 23:09, David Miller wrote:
> From: kbuild test robot 
> Date: Fri, 8 Apr 2016 05:00:42 +0800
> 
> >include/net/sock.h: In function 'lockdep_sock_is_held':
> >>> include/net/sock.h:1367:9: error: implicit declaration of function 
> >>> 'lockdep_is_held' [-Werror=implicit-function-declaration]
> >  return lockdep_is_held(>sk_lock) ||
> ...
> >   1361  } while (0)
> >   1362  
> >   1363  static bool lockdep_sock_is_held(const struct sock *csk)
> >   1364  {
> >   1365  struct sock *sk = (struct sock *)csk;
> >   1366  
> >> 1367   return lockdep_is_held(>sk_lock) ||
> >   1368 lockdep_is_held(>sk_lock.slock);
> >   1369  }
> 
> Hmmm, Hannes to we need to make this a macro just like lockdep_is_held()
> is?

I think my newest patch should fix it. I simply forgot the inline
keyword; static inline functions are simply discarded by the compiler
when unused, so they don't trigger warnings.

Sorry,
Hannes


[PATCH net-next 1/2] udp: do not expect udp headers on ioctl SIOCINQ

2016-04-07 Thread Willem de Bruijn
From: Willem de Bruijn 

On udp sockets, ioctl SIOCINQ returns the payload size of the first
packet. Since commit e6afc8ace6dd pulled the headers, the result is
incorrect when subtracting header length. Remove that operation.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/udp.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3563788d..d2d294b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1282,8 +1282,6 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 * of this packet since that is all
 * that will be read.
 */
-   amount -= sizeof(struct udphdr);
-
return put_user(amount, (int __user *)arg);
}
 
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 0/2] fix two more udp pull header issues

2016-04-07 Thread Willem de Bruijn
From: Willem de Bruijn 

Follow up patches to the fixes to RxRPC and SunRPC. A scan of the
code showed two other interfaces that expect UDP packets to have
a udphdr when queued: read packet length with ioctl SIOCINQ and
receive payload checksum with socket option IP_CHECKSUM.

Willem de Bruijn (2):
  udp: do not expect udp headers on ioctl SIOCINQ
  udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM

 net/ipv4/ip_sockglue.c | 3 ++-
 net/ipv4/udp.c | 4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



[RFC PATCH 00/11] GSO partial and TSO FIXEDID support

2016-04-07 Thread Alexander Duyck
This patch series represents my respun version of the GSO partial work.
The major change from the first version is that we are no longer making
the decision to mangle IP IDs transparently at the device.  Instead it is
now pushed up to the tunnel layer itself, so that the tunnel is now
responsible for deciding if the IP IDs will be fixed or incrementing for a
given TSO.

I'm a bit jammed up at the moment as I am trying to determine the best spot
to make the same change I currently am for VXLAN and GENEVE with GRE and
IPIP tunnels.  I'm assuming the correct spot would be somewhere near
iptunnel_handle_offload as I did for the other two tunnel types I have
already updated, but the flow based tunnels for GRE seem to be making it a
bit more complicated as I am not sure if a tunnel dev actually exists for
those tunnels.

This patch series is meant to apply to the dev-queue branch Jeff Kirsher's
next-queue tree at:
https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git

I chose his tree as the Intel driver patches would not apply otherwise.

Patch 1 from the series is a copy of the patch recently accepted for the
net tree.  It is needed in this series to avoid merge conflicts later on as
we were making other changes in the GRO code.

---

Alexander Duyck (11):
  GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
  ethtool: Add support for toggling any of the GSO offloads
  GSO: Add GSO type for fixed IPv4 ID
  GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID 
values
  GSO: Support partial segmentation offload
  VXLAN: Add option to mangle IP IDs on inner headers when using TSO
  GENEVE: Add option to mangle IP IDs on inner headers when using TSO
  Documentation: Add documentation for TSO and GSO features
  i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM
  ixgbe/ixgbevf: Add support for GSO partial
  igb/igbvf: Add support for GSO partial


 Documentation/networking/segmentation-offloads.txt |  127 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c|6 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|7 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |7 +
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|6 +
 drivers/net/ethernet/intel/igb/igb_main.c  |  119 ++---
 drivers/net/ethernet/intel/igbvf/netdev.c  |  180 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  115 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  122 +++---
 drivers/net/geneve.c   |   24 ++-
 drivers/net/vxlan.c|   16 ++
 include/linux/netdev_features.h|8 +
 include/linux/netdevice.h  |   11 +
 include/linux/skbuff.h |   27 ++-
 include/net/udp_tunnel.h   |8 -
 include/net/vxlan.h|1 
 include/uapi/linux/if_link.h   |2 
 net/core/dev.c |   33 
 net/core/ethtool.c |4 
 net/core/skbuff.c  |   29 +++
 net/ipv4/af_inet.c |   70 ++--
 net/ipv4/fou.c |6 +
 net/ipv4/gre_offload.c |   35 +++-
 net/ipv4/ip_gre.c  |   13 +
 net/ipv4/tcp_offload.c |   30 +++
 net/ipv4/udp_offload.c |   27 ++-
 net/ipv6/ip6_offload.c |   21 ++
 27 files changed, 837 insertions(+), 217 deletions(-)
 create mode 100644 Documentation/networking/segmentation-offloads.txt

--


Re: [PATCH net-next] sock: make lockdep_sock_is_held static inline

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 23:53 +0200, Hannes Frederic Sowa wrote:
> I forgot to add inline to lockdep_sock_is_held, so it generated all
> kinds of build warnings if not build with lockdep support.
> 
> Reported-by: kbuild test robot 
> Signed-off-by: Hannes Frederic Sowa 
> ---
>  include/net/sock.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index eb2d7c3e120b25..46b29374df8ed7 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1360,7 +1360,7 @@ do {
> \
>   lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0); \
>  } while (0)
>  
> -static bool lockdep_sock_is_held(const struct sock *csk)
> +static inline bool lockdep_sock_is_held(const struct sock *csk)
>  {
>   struct sock *sk = (struct sock *)csk;
>  


But... this wont solve the compiler error ?

include/net/sock.h: In function 'lockdep_sock_is_held':
include/net/sock.h:1367:2: error: implicit declaration of function
'lockdep_is_held' [-Werror=implicit-function-declaration]
In file included from security/lsm_audit.c:20:0:
include/net/sock.h: In function 'lockdep_sock_is_held':
include/net/sock.h:1367:2: error: implicit declaration of function
'lockdep_is_held' [-Werror=implicit-function-declaration]

$ egrep "LOCKDEP|PROVE" .config
CONFIG_LOCKDEP_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_PROVE_RCU is not set




[RFC PATCH 05/11] GSO: Support partial segmentation offload

2016-04-07 Thread Alexander Duyck
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_FIXEDID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_FIXEDID if the
hardware can support updating inner IPv4 headers.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdev_features.h |5 +
 include/linux/netdevice.h   |2 ++
 include/linux/skbuff.h  |9 +++--
 net/core/dev.c  |   31 ++-
 net/core/ethtool.c  |1 +
 net/core/skbuff.c   |   29 -
 net/ipv4/af_inet.c  |   20 
 net/ipv4/gre_offload.c  |   26 +-
 net/ipv4/tcp_offload.c  |   10 --
 net/ipv4/udp_offload.c  |   27 +--
 net/ipv6/ip6_offload.c  |   10 +-
 11 files changed, 148 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 5d7da1ac6df5..6ef549ec5b13 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -48,6 +48,10 @@ enum {
NETIF_F_GSO_SIT_BIT,/* ... SIT tunnel with TSO */
NETIF_F_GSO_UDP_TUNNEL_BIT, /* ... UDP TUNNEL with TSO */
NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
+   NETIF_F_GSO_PARTIAL_BIT,/* ... Only segment inner-most L4
+* in hardware and all other
+* headers in software.
+*/
NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
/**/NETIF_F_GSO_LAST =  /* last bit, see GSO_MASK */
NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
@@ -122,6 +126,7 @@ enum {
 #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
 #define NETIF_F_TSO_FIXEDID__NETIF_F(TSO_FIXEDID)
+#define NETIF_F_GSO_PARTIAL __NETIF_F(GSO_PARTIAL)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX__NETIF_F(HW_VLAN_STAG_RX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index abf8cc2d9bfb..36a079598034 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1656,6 +1656,7 @@ struct net_device {
netdev_features_t   vlan_features;
netdev_features_t   hw_enc_features;
netdev_features_t   mpls_features;
+   netdev_features_t   gso_partial_features;
 
int ifindex;
int group;
@@ -4023,6 +4024,7 @@ static inline bool net_gso_ok(netdev_features_t features, 
int gso_type)
BUILD_BUG_ON(SKB_GSO_SIT != (NETIF_F_GSO_SIT >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> 
NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> 
NETIF_F_GSO_SHIFT));
+   BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> 
NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> 
NETIF_F_GSO_SHIFT));
 
return (features & feature) == feature;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5fba16658f9d..da0ace389fec 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -483,7 +483,9 @@ enum {
 
SKB_GSO_UDP_TUNNEL_CSUM = 1 << 12,
 
-   SKB_GSO_TUNNEL_REMCSUM = 1 << 13,
+   SKB_GSO_PARTIAL = 1 << 13,
+
+   SKB_GSO_TUNNEL_REMCSUM = 1 << 14,
 };
 
 #if BITS_PER_LONG > 32
@@ -3591,7 +3593,10 @@ static inline struct sec_path *skb_sec_path(struct 
sk_buff *skb)
  * Keeps track of level of encapsulation of network headers.
  */
 struct skb_gso_cb {
-   int mac_offset;
+   union {
+   int mac_offset;
+   int data_offset;
+   };
int encap_level;
__wsum  csum;
__u16   csum_start;
diff --git a/net/core/dev.c b/net/core/dev.c
index 4ed2852b3706..53b216b617c3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2711,6 +2711,19 @@ struct sk_buff 

[RFC PATCH 04/11] GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values

2016-04-07 Thread Alexander Duyck
This patch does two things.

First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4.  In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.

The second thing this addresses is that it places limitations on the outer
IPv4 ID header in the case of tunneled frames.  Specifically it forces the
IP ID to be incrementing by 1 unless the DF bit is set in the outer IPv4
header.  This way we can avoid creating overlapping series of IP IDs that
could possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |5 -
 net/core/dev.c|1 +
 net/ipv4/af_inet.c|   35 ---
 net/ipv4/tcp_offload.c|   16 +++-
 net/ipv6/ip6_offload.c|8 ++--
 5 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 38ccc01eb97d..abf8cc2d9bfb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2123,7 +2123,10 @@ struct napi_gro_cb {
/* Used in GRE, set in fou/gue_gro_receive */
u8  is_fou:1;
 
-   /* 6 bit hole */
+   /* Used to determine if flush_id can be ignored */
+   u8  is_atomic:1;
+
+   /* 5 bit hole */
 
/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum  csum;
diff --git a/net/core/dev.c b/net/core/dev.c
index d51343a821ed..4ed2852b3706 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4440,6 +4440,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
NAPI_GRO_CB(skb)->free = 0;
NAPI_GRO_CB(skb)->encap_mark = 0;
NAPI_GRO_CB(skb)->is_fou = 0;
+   NAPI_GRO_CB(skb)->is_atomic = 1;
NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
 
/* Setup for GRO checksum validation */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 19e9a2c45d71..98fe04b99e01 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1328,6 +1328,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff 
**head,
 
for (p = *head; p; p = p->next) {
struct iphdr *iph2;
+   u16 flush_id;
 
if (!NAPI_GRO_CB(p)->same_flow)
continue;
@@ -1351,16 +1352,36 @@ static struct sk_buff **inet_gro_receive(struct sk_buff 
**head,
(iph->tos ^ iph2->tos) |
((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
 
-   /* Save the IP ID check to be included later when we get to
-* the transport layer so only the inner most IP ID is checked.
-* This is because some GSO/TSO implementations do not
-* correctly increment the IP ID for the outer hdrs.
-*/
-   NAPI_GRO_CB(p)->flush_id =
-   ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ 
id);
NAPI_GRO_CB(p)->flush |= flush;
+
+   /* We need to store of the IP ID check to be included later
+* when we can verify that this packet does in fact belong
+* to a given flow.
+*/
+   flush_id = (u16)(id - ntohs(iph2->id));
+
+   /* This bit of code makes it much easier for us to identify
+* the cases where we are doing atomic vs non-atomic IP ID
+* checks.  Specifically an atomic check can return IP ID
+* values 0 - 0x, while a non-atomic check can only
+* return 0 or 0x.
+*/
+   if (!NAPI_GRO_CB(p)->is_atomic ||
+   !(iph->frag_off & htons(IP_DF))) {
+   flush_id ^= NAPI_GRO_CB(p)->count;
+   flush_id = flush_id ? 0x : 0;
+   }
+
+   /* If the previous IP ID value was based on an atomic
+* datagram we can overwrite the value and ignore it.
+*/
+   if (NAPI_GRO_CB(skb)->is_atomic)
+   NAPI_GRO_CB(p)->flush_id = flush_id;
+   else
+   NAPI_GRO_CB(p)->flush_id |= flush_id;
}
 
+   NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
NAPI_GRO_CB(skb)->flush |= flush;
skb_set_network_header(skb, off);
/* The above will be needed by the transport layer if there is one
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 08dd25d835af..d1ffd55289bd 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -239,7 +239,7 @@ struct sk_buff **tcp_gro_receive(struct sk_buff **head, 
struct sk_buff 

[RFC PATCH 08/11] Documentation: Add documentation for TSO and GSO features

2016-04-07 Thread Alexander Duyck
This document is a starting point for defining the TSO and GSO features.
The whole thing is starting to get a bit messy so I wanted to make sure we
have notes somewhere to start describing what does and doesn't work.

Signed-off-by: Alexander Duyck 
---
 Documentation/networking/segmentation-offloads.txt |  127 
 1 file changed, 127 insertions(+)
 create mode 100644 Documentation/networking/segmentation-offloads.txt

diff --git a/Documentation/networking/segmentation-offloads.txt 
b/Documentation/networking/segmentation-offloads.txt
new file mode 100644
index ..b06dd9b65ab3
--- /dev/null
+++ b/Documentation/networking/segmentation-offloads.txt
@@ -0,0 +1,127 @@
+Segmentation Offloads in the Linux Networking Stack
+
+Introduction
+
+
+This document describes a set of techniques in the Linux networking stack
+to take advantage of segmentation offload capabilities of various NICs.
+
+The following technologies are described:
+ * TCP Segmentation Offload - TSO
+ * UDP Fragmentation Offload - UFO
+ * IPIP, SIT, GRE, and UDP Tunnel Offloads
+ * Generic Segmentation Offload - GSO
+ * Generic Receive Offload - GRO
+ * Partial Generic Segmentation Offload - GSO_PARTIAL
+
+TCP Segmentation Offload
+
+
+TCP segmentation allows a device to segment a single frame into multiple
+frames with a data payload size specified in skb_shinfo()->gso_size.
+When TCP segmentation requested the bit for either SKB_GSO_TCP or
+SKB_GSO_TCP6 should be set in skb_shinfo()->gso_type and
+skb_shinfo()->gso_size should be set to a non-zero value.
+
+TCP segmentation is dependent on support for the use of partial checksum
+offload.  For this reason TSO is normally disabled if the Tx checksum
+offload for a given device is disabled.
+
+In order to support TCP segmentation offload it is necessary to populate
+the network and transport header offsets of the skbuff so that the device
+drivers will be able determine the offsets of the IP or IPv6 header and the
+TCP header.  In addition as CHECKSUM_PARTIAL is required csum_start should
+also point to the TCP header of the packet.
+
+For IPv4 segmentation we support one of two types in terms of the IP ID.
+The default behavior is to increment the IP ID with every segment.  If the
+GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP
+ID and all segments will use the same IP ID.
+
+UDP Fragmentation Offload
+=
+
+UDP fragmentation offload allows a device to fragment an oversized UDP
+datagram into multiple IPv4 fragments.  Many of the requirements for UDP
+fragmentation offload are the same as TSO.  However the IPv4 ID for
+fragments should not increment as a single IPv4 datagram is fragmented.
+
+IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads
+
+
+In addition to the offloads described above it is possible for a frame to
+contain additional headers such as an outer tunnel.  In order to account
+for such instances an additional set of segmentation offload types were
+introduced including SKB_GSO_IPIP, SKB_GSO_SIT, SKB_GSO_GRE, and
+SKB_GSO_UDP_TUNNEL.  These extra segmentation types are used to identify
+cases where there are more than just 1 set of headers.  For example in the
+case of IPIP and SIT we should have the network and transport headers moved
+from the standard list of headers to "inner" header offsets.
+
+Currently only two levels of headers are supported.  The convention is to
+refer to the tunnel headers as the outer headers, while the encapsulated
+data is normally referred to as the inner headers.  Below is the list of
+calls to access the given headers:
+
+IPIP/SIT Tunnel:
+   Outer   Inner
+MACskb_mac_header
+Networkskb_network_header  skb_inner_network_header
+Transport  skb_transport_header
+
+UDP/GRE Tunnel:
+   Outer   Inner
+MACskb_mac_header  skb_inner_mac_header
+Networkskb_network_header  skb_inner_network_header
+Transport  skb_transport_headerskb_inner_transport_header
+
+In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and
+SKB_GSO_UDP_TUNNEL_CSUM.  These two additional tunnel types reflect the
+fact that the outer header also requests to have a non-zero checksum
+included in the outer header.
+
+Finally there is SKB_GSO_REMCSUM which indicates that a given tunnel header
+has requested a remote checksum offload.  In this case the inner headers
+will be left with a partial checksum and only the outer header checksum
+will be computed.
+
+Generic Segmentation Offload
+
+
+Generic segmentation offload is a pure software offload that is meant to
+deal with cases where device drivers cannot perform the offloads described
+above.  What occurs in GSO is that a given skbuff will have its data 

[RFC PATCH 10/11] ixgbe/ixgbevf: Add support for GSO partial

2016-04-07 Thread Alexander Duyck
This patch adds support for partial GSO segmentation in the case of
encapsulated frames.  Specifically with this change the driver can perform
segmentation as long as the type is either SKB_GSO_TCP_FIXEDID or
SKB_GSO_TCPV6.  If neither of these gso types is specified then tunnel
segmentation is not supported and we will default back to GSO.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  115 +++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  122 -
 2 files changed, 180 insertions(+), 57 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c6bd3ae5f986..57e083f6c8a9 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7195,9 +7195,18 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 struct ixgbe_tx_buffer *first,
 u8 *hdr_len)
 {
+   u32 vlan_macip_lens, type_tucmd, mss_l4len_idx;
struct sk_buff *skb = first->skb;
-   u32 vlan_macip_lens, type_tucmd;
-   u32 mss_l4len_idx, l4len;
+   union {
+   struct iphdr *v4;
+   struct ipv6hdr *v6;
+   unsigned char *hdr;
+   } ip;
+   union {
+   struct tcphdr *tcp;
+   unsigned char *hdr;
+   } l4;
+   u32 paylen, l4_offset;
int err;
 
if (skb->ip_summed != CHECKSUM_PARTIAL)
@@ -7210,46 +7219,52 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
if (err < 0)
return err;
 
+   ip.hdr = skb_network_header(skb);
+   l4.hdr = skb_checksum_start(skb);
+
/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
 
-   if (first->protocol == htons(ETH_P_IP)) {
-   struct iphdr *iph = ip_hdr(skb);
-   iph->tot_len = 0;
-   iph->check = 0;
-   tcp_hdr(skb)->check = ~csum_tcpudp_magic(iph->saddr,
-iph->daddr, 0,
-IPPROTO_TCP,
-0);
+   /* initialize outer IP header fields */
+   if (ip.v4->version == 4) {
+   /* IP header will have to cancel out any data that
+* is not a part of the outer IP header
+*/
+   ip.v4->check = csum_fold(csum_add(lco_csum(skb),
+ csum_unfold(l4.tcp->check)));
type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
+
+   ip.v4->tot_len = 0;
first->tx_flags |= IXGBE_TX_FLAGS_TSO |
   IXGBE_TX_FLAGS_CSUM |
   IXGBE_TX_FLAGS_IPV4;
-   } else if (skb_is_gso_v6(skb)) {
-   ipv6_hdr(skb)->payload_len = 0;
-   tcp_hdr(skb)->check =
-   ~csum_ipv6_magic(&ipv6_hdr(skb)->saddr,
-&ipv6_hdr(skb)->daddr,
-0, IPPROTO_TCP, 0);
+   } else {
+   ip.v6->payload_len = 0;
first->tx_flags |= IXGBE_TX_FLAGS_TSO |
   IXGBE_TX_FLAGS_CSUM;
}
 
-   /* compute header lengths */
-   l4len = tcp_hdrlen(skb);
-   *hdr_len = skb_transport_offset(skb) + l4len;
+   /* determine offset of inner transport header */
+   l4_offset = l4.hdr - skb->data;
+
+   /* compute length of segmentation header */
+   *hdr_len = (l4.tcp->doff * 4) + l4_offset;
+
+   /* remove payload length from inner checksum */
+   paylen = skb->len - l4_offset;
+   csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
 
/* update gso size and bytecount with header size */
first->gso_segs = skb_shinfo(skb)->gso_segs;
first->bytecount += (first->gso_segs - 1) * *hdr_len;
 
/* mss_l4len_id: use 0 as index for TSO */
-   mss_l4len_idx = l4len << IXGBE_ADVTXD_L4LEN_SHIFT;
+   mss_l4len_idx = (*hdr_len - l4_offset) << IXGBE_ADVTXD_L4LEN_SHIFT;
mss_l4len_idx |= skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT;
 
/* vlan_macip_lens: HEADLEN, MACLEN, VLAN tag */
-   vlan_macip_lens = skb_network_header_len(skb);
-   vlan_macip_lens |= skb_network_offset(skb) << IXGBE_ADVTXD_MACLEN_SHIFT;
+   vlan_macip_lens = l4.hdr - ip.hdr;
+   vlan_macip_lens |= (ip.hdr - skb->data) << IXGBE_ADVTXD_MACLEN_SHIFT;
vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
 
ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0, type_tucmd,
@@ -8906,17 +8921,49 @@ static void ixgbe_fwd_del(struct net_device *pdev, void 
*priv)
kfree(fwd_adapter);
 }
 
-#define IXGBE_MAX_TUNNEL_HDR_LEN 80
+#define IXGBE_MAX_MAC_HDR_LEN  127
+#define 

[RFC PATCH 03/11] GSO: Add GSO type for fixed IPv4 ID

2016-04-07 Thread Alexander Duyck
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 headers to IPv4
headers.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdev_features.h |3 +++
 include/linux/netdevice.h   |1 +
 include/linux/skbuff.h  |   20 +++-
 net/core/ethtool.c  |1 +
 net/ipv4/af_inet.c  |   19 +++
 net/ipv4/gre_offload.c  |1 +
 net/ipv4/tcp_offload.c  |4 +++-
 net/ipv6/ip6_offload.c  |3 ++-
 8 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index a734bf43d190..5d7da1ac6df5 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -39,6 +39,7 @@ enum {
NETIF_F_UFO_BIT,/* ... UDPv4 fragmentation */
NETIF_F_GSO_ROBUST_BIT, /* ... ->SKB_GSO_DODGY */
NETIF_F_TSO_ECN_BIT,/* ... TCP ECN support */
+   NETIF_F_TSO_FIXEDID_BIT,/* ... IPV4 header has fixed IP ID */
NETIF_F_TSO6_BIT,   /* ... TCPv6 segmentation */
NETIF_F_FSO_BIT,/* ... FCoE segmentation */
NETIF_F_GSO_GRE_BIT,/* ... GRE with TSO */
@@ -120,6 +121,7 @@ enum {
 #define NETIF_F_GSO_SIT__NETIF_F(GSO_SIT)
 #define NETIF_F_GSO_UDP_TUNNEL __NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
+#define NETIF_F_TSO_FIXEDID__NETIF_F(TSO_FIXEDID)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX__NETIF_F(HW_VLAN_STAG_RX)
@@ -147,6 +149,7 @@ enum {
 
 /* List of features with software fallbacks. */
 #define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
+NETIF_F_TSO_FIXEDID | \
 NETIF_F_TSO6 | NETIF_F_UFO)
 
 /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8395308a2445..38ccc01eb97d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4011,6 +4011,7 @@ static inline bool net_gso_ok(netdev_features_t features, 
int gso_type)
BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_UFO >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_DODGY   != (NETIF_F_GSO_ROBUST >> 
NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_TCP_ECN != (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
+   BUILD_BUG_ON(SKB_GSO_TCP_FIXEDID != (NETIF_F_TSO_FIXEDID >> 
NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_TCPV6   != (NETIF_F_TSO6 >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_FCOE!= (NETIF_F_FSO >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_GRE != (NETIF_F_GSO_GRE >> NETIF_F_GSO_SHIFT));
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 007381270ff8..5fba16658f9d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -465,23 +465,25 @@ enum {
/* This indicates the tcp segment has CWR set. */
SKB_GSO_TCP_ECN = 1 << 3,
 
-   SKB_GSO_TCPV6 = 1 << 4,
+   SKB_GSO_TCP_FIXEDID = 1 << 4,
 
-   SKB_GSO_FCOE = 1 << 5,
+   SKB_GSO_TCPV6 = 1 << 5,
 
-   SKB_GSO_GRE = 1 << 6,
+   SKB_GSO_FCOE = 1 << 6,
 
-   SKB_GSO_GRE_CSUM = 1 << 7,
+   SKB_GSO_GRE = 1 << 7,
 
-   SKB_GSO_IPIP = 1 << 8,
+   SKB_GSO_GRE_CSUM = 1 << 8,
 
-   SKB_GSO_SIT = 1 << 9,
+   SKB_GSO_IPIP = 1 << 9,
 
-   SKB_GSO_UDP_TUNNEL = 1 << 10,
+   SKB_GSO_SIT = 1 << 10,
 
-   SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11,
+   SKB_GSO_UDP_TUNNEL = 1 << 11,
 
-   SKB_GSO_TUNNEL_REMCSUM = 1 << 12,
+   SKB_GSO_UDP_TUNNEL_CSUM = 1 << 12,
+
+   SKB_GSO_TUNNEL_REMCSUM = 1 << 13,
 };
 
 #if BITS_PER_LONG > 32
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6a7f99661c2f..5340c9dbc318 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -79,6 +79,7 @@ static const char 
netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_UFO_BIT] =  "tx-udp-fragmentation",
[NETIF_F_GSO_ROBUST_BIT] =   "tx-gso-robust",
[NETIF_F_TSO_ECN_BIT] =  "tx-tcp-ecn-segmentation",
+   [NETIF_F_TSO_FIXEDID_BIT] =  "tx-tcp-fixedid-segmentation",
[NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
[NETIF_F_FSO_BIT] =  "tx-fcoe-segmentation",
[NETIF_F_GSO_GRE_BIT] =  "tx-gre-segmentation",
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a38b9910af60..19e9a2c45d71 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1195,10 +1195,10 @@ EXPORT_SYMBOL(inet_sk_rebuild_header);
 static struct sk_buff 

[RFC PATCH 06/11] VXLAN: Add option to mangle IP IDs on inner headers when using TSO

2016-04-07 Thread Alexander Duyck
This patch adds support for a feature I am calling IP ID mangling.  It is
basically just another way of saying the IP IDs that are transmitted by the
tunnel may not match up with what would normally be expected.  Specifically
what will happen is in the case of TSO the IP IDs on the headers will be a
fixed value so a given TSO will repeat the same inner IP ID value gso_segs
number of times.

Signed-off-by: Alexander Duyck 
---
 drivers/net/vxlan.c  |   16 
 include/net/vxlan.h  |1 +
 include/uapi/linux/if_link.h |1 +
 3 files changed, 18 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 51cccddfe403..cc903ab832c2 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1783,6 +1783,10 @@ static int vxlan_build_skb(struct sk_buff *skb, struct 
dst_entry *dst,
type |= SKB_GSO_TUNNEL_REMCSUM;
}
 
+   if ((vxflags & VXLAN_F_TCP_FIXEDID) && skb_is_gso(skb) &&
+   (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4))
+   type |= SKB_GSO_TCP_FIXEDID;
+
min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len
+ VXLAN_HLEN + iphdr_len
+ (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
@@ -2635,6 +2639,7 @@ static const struct nla_policy 
vxlan_policy[IFLA_VXLAN_MAX + 1] = {
[IFLA_VXLAN_GBP]= { .type = NLA_FLAG, },
[IFLA_VXLAN_GPE]= { .type = NLA_FLAG, },
[IFLA_VXLAN_REMCSUM_NOPARTIAL]  = { .type = NLA_FLAG },
+   [IFLA_VXLAN_IPID_MANGLE]= { .type = NLA_FLAG },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -3092,6 +3097,9 @@ static int vxlan_newlink(struct net *src_net, struct 
net_device *dev,
if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL])
conf.flags |= VXLAN_F_REMCSUM_NOPARTIAL;
 
+   if (data[IFLA_VXLAN_IPID_MANGLE])
+   conf.flags |= VXLAN_F_TCP_FIXEDID;
+
	err = vxlan_dev_configure(src_net, dev, &conf);
switch (err) {
case -ENODEV:
@@ -3154,6 +3162,10 @@ static size_t vxlan_get_size(const struct net_device 
*dev)
nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_UDP_ZERO_CSUM6_RX 
*/
nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_REMCSUM_TX */
nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_REMCSUM_RX */
+   nla_total_size(0) +/* IFLA_VXLAN_GBP */
+   nla_total_size(0) +/* IFLA_VXLAN_GPE */
+   nla_total_size(0) +/* IFLA_VXLAN_REMCSUM_NOPARTIAL 
*/
+   nla_total_size(0) +/* IFLA_VXLAN_IPID_MANGLE */
0;
 }
 
@@ -3244,6 +3256,10 @@ static int vxlan_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
nla_put_flag(skb, IFLA_VXLAN_REMCSUM_NOPARTIAL))
goto nla_put_failure;
 
+   if (vxlan->flags & VXLAN_F_TCP_FIXEDID &&
+   nla_put_flag(skb, IFLA_VXLAN_IPID_MANGLE))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index dcc6f4057115..5c2dc9ecea59 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -265,6 +265,7 @@ struct vxlan_dev {
 #define VXLAN_F_REMCSUM_NOPARTIAL  0x1000
 #define VXLAN_F_COLLECT_METADATA   0x2000
 #define VXLAN_F_GPE0x4000
+#define VXLAN_F_TCP_FIXEDID0x8000
 
 /* Flags that are used in the receive path. These flags must match in
  * order for a socket to be shareable
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9427f17d06d6..a3bc3f2a63d3 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -489,6 +489,7 @@ enum {
IFLA_VXLAN_COLLECT_METADATA,
IFLA_VXLAN_LABEL,
IFLA_VXLAN_GPE,
+   IFLA_VXLAN_IPID_MANGLE,
__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)



Re: [PATCH net-next] sock: make lockdep_sock_is_held static inline

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 15:30 -0700, Eric Dumazet wrote:

> But... this wont solve the compiler error ?
> 
> include/net/sock.h: In function 'lockdep_sock_is_held':
> include/net/sock.h:1367:2: error: implicit declaration of function
> 'lockdep_is_held' [-Werror=implicit-function-declaration]
> In file included from security/lsm_audit.c:20:0:
> include/net/sock.h: In function 'lockdep_sock_is_held':
> include/net/sock.h:1367:2: error: implicit declaration of function
> 'lockdep_is_held' [-Werror=implicit-function-declaration]
> 
> $ egrep "LOCKDEP|PROVE" .config
> CONFIG_LOCKDEP_SUPPORT=y
> # CONFIG_PROVE_LOCKING is not set
> # CONFIG_PROVE_RCU is not set
> 

Better use something like :

diff --git a/include/net/sock.h b/include/net/sock.h
index eb2d7c3e120b..ab6b6b9469ad 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1360,13 +1360,15 @@ do {
\
lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0); \
 } while (0)
 
-static bool lockdep_sock_is_held(const struct sock *csk)
+#ifdef CONFIG_PROVE_RCU
+static inline bool lockdep_sock_is_held(const struct sock *csk)
 {
struct sock *sk = (struct sock *)csk;
 
	return lockdep_is_held(&sk->sk_lock) ||
	       lockdep_is_held(&sk->sk_lock.slock);
 }
+#endif
 
 void lock_sock_nested(struct sock *sk, int subclass);
 





[net-next:master 207/208] net/tipc/bearer.c:560:48: sparse: incompatible types in comparison expression (different base types)

2016-04-07 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   889750bd2e08a94d52a116056d462b3a8e5616a7
commit: 5b7066c3dd24c7d538e5ee402eb24bb182c16dab [207/208] tipc: stricter 
filtering of packets in bearer layer
reproduce:
# apt-get install sparse
git checkout 5b7066c3dd24c7d538e5ee402eb24bb182c16dab
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': 
unknown attribute
>> net/tipc/bearer.c:560:48: sparse: incompatible types in comparison 
>> expression (different base types)

vim +560 net/tipc/bearer.c

   544   * This function is called by the Ethernet driver in case of link
   545   * change event.
   546   */
   547  static int tipc_l2_device_event(struct notifier_block *nb, unsigned 
long evt,
   548  void *ptr)
   549  {
   550  struct net_device *dev = netdev_notifier_info_to_dev(ptr);
   551  struct net *net = dev_net(dev);
   552  struct tipc_net *tn = tipc_net(net);
   553  struct tipc_bearer *b;
   554  int i;
   555  
   556  b = rtnl_dereference(dev->tipc_ptr);
   557  if (!b) {
   558  for (i = 0; i < MAX_BEARERS; b = NULL, i++) {
   559  b = rtnl_dereference(tn->bearer_list[i]);
 > 560  if (b && (b->media_ptr == dev))
   561  break;
   562  }
   563  }
   564  if (!b)
   565  return NOTIFY_DONE;
   566  
   567  b->mtu = dev->mtu;
   568  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH net-next 2/2] udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM

2016-04-07 Thread Willem de Bruijn
From: Willem de Bruijn 

On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6dd pulled the headers,
taking skb->data as the start of the transport header is incorrect. Use
the transport header pointer.

Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checksum when reading truncated data.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/ip_sockglue.c | 3 ++-
 net/ipv4/udp.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 89b5f3b..279471c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -106,7 +106,8 @@ static void ip_cmsg_recv_checksum(struct msghdr *msg, 
struct sk_buff *skb,
return;
 
if (offset != 0)
-   csum = csum_sub(csum, csum_partial(skb->data, offset, 0));
+   csum = csum_sub(csum, csum_partial(skb_transport_header(skb),
+  offset, 0));
 
	put_cmsg(msg, SOL_IP, IP_CHECKSUM, sizeof(__wsum), &csum);
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d2d294b..f186313 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1375,7 +1375,7 @@ try_again:
*addr_len = sizeof(*sin);
}
if (inet->cmsg_flags)
-   ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
+   ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr) + off);
 
err = copied;
if (flags & MSG_TRUNC)
-- 
2.8.0.rc3.226.g39d4020



[RFC PATCH 02/11] ethtool: Add support for toggling any of the GSO offloads

2016-04-07 Thread Alexander Duyck
The strings were missing for several of the GSO offloads that are
available.  This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.

Signed-off-by: Alexander Duyck 
---
 net/core/ethtool.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f426c5ad6149..6a7f99661c2f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -82,9 +82,11 @@ static const char 
netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
[NETIF_F_FSO_BIT] =  "tx-fcoe-segmentation",
[NETIF_F_GSO_GRE_BIT] =  "tx-gre-segmentation",
+   [NETIF_F_GSO_GRE_CSUM_BIT] = "tx-gre-csum-segmentation",
[NETIF_F_GSO_IPIP_BIT] = "tx-ipip-segmentation",
[NETIF_F_GSO_SIT_BIT] =  "tx-sit-segmentation",
[NETIF_F_GSO_UDP_TUNNEL_BIT] =   "tx-udp_tnl-segmentation",
+   [NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
 
[NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc",
[NETIF_F_SCTP_CRC_BIT] ="tx-checksum-sctp",



[RFC PATCH 07/11] GENEVE: Add option to mangle IP IDs on inner headers when using TSO

2016-04-07 Thread Alexander Duyck
This patch adds support for a feature I am calling IP ID mangling.  It is
basically just another way of saying the IP IDs that are transmitted by the
tunnel may not match up with what would normally be expected.  Specifically,
in the case of TSO the IP ID in the inner headers will be a fixed value, so
a given TSO operation will repeat the same inner IP ID gso_segs times.

Signed-off-by: Alexander Duyck 
---
 drivers/net/geneve.c |   24 ++--
 include/net/udp_tunnel.h |8 
 include/uapi/linux/if_link.h |1 +
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index bc168894bda3..6352223d80c3 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -80,6 +80,7 @@ struct geneve_dev {
 #define GENEVE_F_UDP_ZERO_CSUM_TX  BIT(0)
 #define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1)
 #define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2)
+#define GENEVE_F_TCP_FIXEDID   BIT(3)
 
 struct geneve_sock {
boolcollect_md;
@@ -702,9 +703,14 @@ static int geneve_build_skb(struct rtable *rt, struct 
sk_buff *skb,
int min_headroom;
int err;
bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX);
+   int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
 
skb_scrub_packet(skb, xnet);
 
+   if ((flags & GENEVE_F_TCP_FIXEDID) && skb_is_gso(skb) &&
+   (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4))
+   type |= SKB_GSO_TCP_FIXEDID;
+
min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
+ GENEVE_BASE_HLEN + opt_len + sizeof(struct iphdr);
err = skb_cow_head(skb, min_headroom);
@@ -713,7 +719,7 @@ static int geneve_build_skb(struct rtable *rt, struct 
sk_buff *skb,
goto free_rt;
}
 
-   skb = udp_tunnel_handle_offloads(skb, udp_sum);
+   skb = iptunnel_handle_offloads(skb, type);
if (IS_ERR(skb)) {
err = PTR_ERR(skb);
goto free_rt;
@@ -739,9 +745,14 @@ static int geneve6_build_skb(struct dst_entry *dst, struct 
sk_buff *skb,
int min_headroom;
int err;
bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM6_TX);
+   int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
 
skb_scrub_packet(skb, xnet);
 
+   if ((flags & GENEVE_F_TCP_FIXEDID) && skb_is_gso(skb) &&
+   (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4))
+   type |= SKB_GSO_TCP_FIXEDID;
+
min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len
+ GENEVE_BASE_HLEN + opt_len + sizeof(struct ipv6hdr);
err = skb_cow_head(skb, min_headroom);
@@ -750,7 +761,7 @@ static int geneve6_build_skb(struct dst_entry *dst, struct 
sk_buff *skb,
goto free_dst;
}
 
-   skb = udp_tunnel_handle_offloads(skb, udp_sum);
+   skb = iptunnel_handle_offloads(skb, type);
if (IS_ERR(skb)) {
err = PTR_ERR(skb);
goto free_dst;
@@ -1249,6 +1260,7 @@ static const struct nla_policy 
geneve_policy[IFLA_GENEVE_MAX + 1] = {
[IFLA_GENEVE_UDP_CSUM]  = { .type = NLA_U8 },
[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
[IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
+   [IFLA_GENEVE_IPID_MANGLE]   = { .type = NLA_FLAG },
 };
 
 static int geneve_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -1436,6 +1448,9 @@ static int geneve_newlink(struct net *net, struct 
net_device *dev,
nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]))
flags |= GENEVE_F_UDP_ZERO_CSUM6_RX;
 
+   if (data[IFLA_GENEVE_IPID_MANGLE])
+   flags |= GENEVE_F_TCP_FIXEDID;
+
	return geneve_configure(net, dev, &remote, vni, ttl, tos, label,
dst_port, metadata, flags);
 }
@@ -1460,6 +1475,7 @@ static size_t geneve_get_size(const struct net_device 
*dev)
nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_CSUM */
nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_TX 
*/
nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_RX 
*/
+   nla_total_size(0) +  /* IFLA_GENEVE_IPID_MANGLE */
0;
 }
 
@@ -1505,6 +1521,10 @@ static int geneve_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
   !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_RX)))
goto nla_put_failure;
 
+   if ((geneve->flags & GENEVE_F_TCP_FIXEDID) &&
+   nla_put_flag(skb, IFLA_GENEVE_IPID_MANGLE))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index b83114077cee..c44d04259665 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -98,14 +98,6 @@ struct 

[RFC PATCH 11/11] igb/igbvf: Add support for GSO partial

2016-04-07 Thread Alexander Duyck
This patch adds support for partial GSO segmentation in the case of
encapsulated frames.  Specifically with this change the driver can perform
segmentation as long as the type is either SKB_GSO_TCP_FIXEDID or
SKB_GSO_TCPV6.  If neither of these GSO types is specified then tunnel
segmentation is not supported and we will default back to GSO.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb_main.c |  119 +++
 drivers/net/ethernet/intel/igbvf/netdev.c |  180 ++---
 2 files changed, 205 insertions(+), 94 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 8e96c35307fb..8204ebecd2a5 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2087,6 +2087,52 @@ static int igb_ndo_fdb_add(struct ndmsg *ndm, struct 
nlattr *tb[],
return ndo_dflt_fdb_add(ndm, tb, dev, addr, vid, flags);
 }
 
+#define IGB_MAX_MAC_HDR_LEN127
+#define IGB_MAX_NETWORK_HDR_LEN511
+#define IGB_GSO_PARTIAL_FEATURES (NETIF_F_TSO_FIXEDID | \
+ NETIF_F_GSO_GRE | \
+ NETIF_F_GSO_GRE_CSUM | \
+ NETIF_F_GSO_IPIP | \
+ NETIF_F_GSO_SIT | \
+ NETIF_F_GSO_UDP_TUNNEL | \
+ NETIF_F_GSO_UDP_TUNNEL_CSUM)
+static netdev_features_t
+igb_features_check(struct sk_buff *skb, struct net_device *dev,
+  netdev_features_t features)
+{
+   unsigned int network_hdr_len, mac_hdr_len;
+
+   /* Make certain the headers can be described by a context descriptor */
+   mac_hdr_len = skb_network_header(skb) - skb->data;
+   network_hdr_len = skb_checksum_start(skb) - skb_network_header(skb);
+   if (unlikely((mac_hdr_len > IGB_MAX_MAC_HDR_LEN) ||
+(network_hdr_len >  IGB_MAX_NETWORK_HDR_LEN)))
+   return features & ~(NETIF_F_HW_CSUM |
+   NETIF_F_SCTP_CRC |
+   NETIF_F_HW_VLAN_CTAG_TX |
+   NETIF_F_TSO |
+   NETIF_F_TSO_FIXEDID |
+   NETIF_F_TSO6);
+
+   /* We can only support a fixed IPv4 ID or IPv6 header for TSO
+* with tunnels.  So if we aren't using a tunnel, or we aren't
+* performing TSO with a fixed ID we must strip the partial
+* features.
+*/
+   if (!(skb_shinfo(skb)->gso_type & (SKB_GSO_GRE |
+  SKB_GSO_GRE_CSUM |
+  SKB_GSO_IPIP |
+  SKB_GSO_SIT |
+  SKB_GSO_UDP_TUNNEL |
+  SKB_GSO_UDP_TUNNEL_CSUM)) ||
+   !(skb_shinfo(skb)->gso_type & (SKB_GSO_TCP_FIXEDID |
+  SKB_GSO_TCPV6)))
+   return features & ~(NETIF_F_GSO_PARTIAL |
+   IGB_GSO_PARTIAL_FEATURES);
+
+   return features;
+}
+
 static const struct net_device_ops igb_netdev_ops = {
.ndo_open   = igb_open,
.ndo_stop   = igb_close,
@@ -2111,7 +2157,7 @@ static const struct net_device_ops igb_netdev_ops = {
.ndo_fix_features   = igb_fix_features,
.ndo_set_features   = igb_set_features,
.ndo_fdb_add= igb_ndo_fdb_add,
-   .ndo_features_check = passthru_features_check,
+   .ndo_features_check = igb_features_check,
 };
 
 /**
@@ -2384,6 +2430,9 @@ static int igb_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (hw->mac.type >= e1000_82576)
netdev->features |= NETIF_F_SCTP_CRC;
 
+   netdev->gso_partial_features = IGB_GSO_PARTIAL_FEATURES;
+   netdev->features |= NETIF_F_GSO_PARTIAL | IGB_GSO_PARTIAL_FEATURES;
+
/* copy netdev features into list of user selectable features */
netdev->hw_features |= netdev->features;
netdev->hw_features |= NETIF_F_RXALL;
@@ -2401,14 +2450,16 @@ static int igb_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 NETIF_F_SCTP_CRC;
 
netdev->mpls_features |= NETIF_F_HW_CSUM;
-   netdev->hw_enc_features |= NETIF_F_HW_CSUM;
+   netdev->hw_enc_features |= NETIF_F_HW_CSUM |
+  NETIF_F_TSO |
+  NETIF_F_TSO6 |
+  NETIF_F_GSO_PARTIAL |
+  IGB_GSO_PARTIAL_FEATURES;
 
netdev->priv_flags |= IFF_SUPP_NOFCS;
 
-   if (pci_using_dac) {
+   if (pci_using_dac)
netdev->features |= NETIF_F_HIGHDMA;
-   netdev->vlan_features |= NETIF_F_HIGHDMA;
- 

[RFC PATCH 01/11] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU

2016-04-07 Thread Alexander Duyck
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE.  Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.

The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header.  As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.

Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |5 -
 net/core/dev.c|1 +
 net/ipv4/fou.c|6 ++
 net/ipv4/gre_offload.c|8 
 net/ipv4/ip_gre.c |   13 ++---
 5 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d09c2e4..8395308a2445 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2120,7 +2120,10 @@ struct napi_gro_cb {
/* Used in foo-over-udp, set in udp[46]_gro_receive */
u8  is_ipv6:1;
 
-   /* 7 bit hole */
+   /* Used in GRE, set in fou/gue_gro_receive */
+   u8  is_fou:1;
+
+   /* 6 bit hole */
 
/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum  csum;
diff --git a/net/core/dev.c b/net/core/dev.c
index 273f10d1e306..d51343a821ed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4439,6 +4439,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
NAPI_GRO_CB(skb)->flush = 0;
NAPI_GRO_CB(skb)->free = 0;
NAPI_GRO_CB(skb)->encap_mark = 0;
+   NAPI_GRO_CB(skb)->is_fou = 0;
NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
 
/* Setup for GRO checksum validation */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 5a94aea280d3..a39068b4a4d9 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -203,6 +203,9 @@ static struct sk_buff **fou_gro_receive(struct sk_buff 
**head,
 */
NAPI_GRO_CB(skb)->encap_mark = 0;
 
+   /* Flag this frame as already having an outer encap header */
+   NAPI_GRO_CB(skb)->is_fou = 1;
+
rcu_read_lock();
offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
ops = rcu_dereference(offloads[proto]);
@@ -368,6 +371,9 @@ static struct sk_buff **gue_gro_receive(struct sk_buff 
**head,
 */
NAPI_GRO_CB(skb)->encap_mark = 0;
 
+   /* Flag this frame as already having an outer encap header */
+   NAPI_GRO_CB(skb)->is_fou = 1;
+
rcu_read_lock();
offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
ops = rcu_dereference(offloads[guehdr->proto_ctype]);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index c47539d04b88..6a5bd4317866 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -150,6 +150,14 @@ static struct sk_buff **gre_gro_receive(struct sk_buff 
**head,
if ((greh->flags & ~(GRE_KEY|GRE_CSUM)) != 0)
goto out;
 
+   /* We can only support GRE_CSUM if we can track the location of
+* the GRE header.  In the case of FOU/GUE we cannot because the
+* outer UDP header displaces the GRE header leaving us in a state
+* of limbo.
+*/
+   if ((greh->flags & GRE_CSUM) && NAPI_GRO_CB(skb)->is_fou)
+   goto out;
+
type = greh->protocol;
 
rcu_read_lock();
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 31936d387cfd..af5d1f38217f 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -862,9 +862,16 @@ static void __gre_tunnel_init(struct net_device *dev)
dev->hw_features|= GRE_FEATURES;
 
if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) {
-   /* TCP offload with GRE SEQ is not supported. */
-   dev->features|= NETIF_F_GSO_SOFTWARE;
-   dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+   /* TCP offload with GRE SEQ is not supported, nor
+* can we support 2 levels of outer headers requiring
+* an update.
+*/
+   if (!(tunnel->parms.o_flags & TUNNEL_CSUM) ||
+   (tunnel->encap.type == TUNNEL_ENCAP_NONE)) {
+   dev->features|= NETIF_F_GSO_SOFTWARE;
+   dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+   }
+
/* Can use a lockless transmit, unless we generate
 * output sequences
 */



[RFC PATCH 09/11] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM

2016-04-07 Thread Alexander Duyck
This patch makes it so that i40e and i40evf can use GSO_PARTIAL to support
segmentation for frames with checksums enabled in outer headers.  As a
result we can now send data over these types of tunnels at over 20Gb/s
versus the 12Gb/s that was previously possible on my system.

The advantage with the i40e parts is that this offload is mostly
transparent as the hardware still deals with the inner and/or outer IPv4
headers so the IP ID is still incrementing for both when this offload is
performed.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |6 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |7 ++-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c   |7 ++-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c |6 +-
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 07a70c4ac49f..6c095b07ce82 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9119,17 +9119,21 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
   NETIF_F_TSO_ECN  |
   NETIF_F_TSO6 |
   NETIF_F_GSO_GRE  |
+  NETIF_F_GSO_GRE_CSUM |
   NETIF_F_GSO_IPIP |
   NETIF_F_GSO_SIT  |
   NETIF_F_GSO_UDP_TUNNEL   |
   NETIF_F_GSO_UDP_TUNNEL_CSUM  |
+  NETIF_F_GSO_PARTIAL  |
   NETIF_F_SCTP_CRC |
   NETIF_F_RXHASH   |
   NETIF_F_RXCSUM   |
   0;
 
if (!(pf->flags & I40E_FLAG_OUTER_UDP_CSUM_CAPABLE))
-   netdev->hw_enc_features ^= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+   netdev->gso_partial_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+
+   netdev->gso_partial_features |= NETIF_F_GSO_GRE_CSUM;
 
/* record features VLANs can make use of */
netdev->vlan_features |= netdev->hw_enc_features;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6e44cf118843..ede4183468b9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2300,11 +2300,15 @@ static int i40e_tso(struct sk_buff *skb, u8 *hdr_len, 
u64 *cd_type_cmd_tso_mss)
}
 
if (skb_shinfo(skb)->gso_type & (SKB_GSO_GRE |
+SKB_GSO_GRE_CSUM |
 SKB_GSO_IPIP |
 SKB_GSO_SIT |
 SKB_GSO_UDP_TUNNEL |
 SKB_GSO_UDP_TUNNEL_CSUM)) {
-   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM) {
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
+   (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)) {
+   l4.udp->len = 0;
+
/* determine offset of outer transport header */
l4_offset = l4.hdr - skb->data;
 
@@ -2481,6 +2485,7 @@ static int i40e_tx_enable_csum(struct sk_buff *skb, u32 
*tx_flags,
 
/* indicate if we need to offload outer UDP header */
if ((*tx_flags & I40E_TX_FLAGS_TSO) &&
+   !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM))
tunnel |= I40E_TXD_CTX_QW0_L4T_CS_MASK;
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index f101895ecf4a..6ce00547c13e 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1565,11 +1565,15 @@ static int i40e_tso(struct sk_buff *skb, u8 *hdr_len, 
u64 *cd_type_cmd_tso_mss)
}
 
if (skb_shinfo(skb)->gso_type & (SKB_GSO_GRE |
+SKB_GSO_GRE_CSUM |
 SKB_GSO_IPIP |
 SKB_GSO_SIT |
 SKB_GSO_UDP_TUNNEL |
 SKB_GSO_UDP_TUNNEL_CSUM)) {
-   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM) {
+   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
+   (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)) {
+   l4.udp->len = 0;
+
/* determine offset of outer transport 

Re: [PATCH] net: mark DECnet as broken

2016-04-07 Thread David Miller
From: One Thousand Gnomes 
Date: Thu, 7 Apr 2016 15:01:20 +0100

> On Thu,  7 Apr 2016 09:22:43 +0200
> Vegard Nossum  wrote:
> 
>> There are NULL pointer dereference bugs in DECnet which can be triggered
>> by unprivileged users and have been reported multiple times to LKML,
>> however nobody seems confident enough in the proposed fixes to merge them
>> and the consensus seems to be that nobody cares enough about DECnet to
>> see it fixed anyway.
>> 
>> To shield unsuspecting users from the possible DOS, we should mark this
>> BROKEN until somebody who actually uses this code can fix it.
> 
> How about consigning it to staging at this point ?

Staging is a one way facility in my opinion.

I say we just fix the NULL dereference.


[PATCH] net: vrf: Fix dst reference counting

2016-04-07 Thread David Ahern
Vivek reported a kernel exception deleting a VRF with an active
connection through it. The root cause is that the socket has a cached
reference to a dst that is destroyed. Converting the dst_destroy to
dst_release and letting proper reference counting kick in does not
work as the dst has a reference to the device which needs to be released
as well.

I talked to Hannes about this at netdev and he pointed out the ipv4 and
ipv6 dst handling has dst_ifdown for just this scenario. Rather than
continuing with the reinvented dst wheel in VRF just remove it and
leverage the ipv4 and ipv6 versions.

Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver")
Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c   | 177 +---
 include/net/ip6_route.h |   3 +
 include/net/route.h |   3 +
 net/ipv4/route.c|   7 +-
 net/ipv6/route.c|   7 +-
 5 files changed, 30 insertions(+), 167 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 9a9fabb900c1..8a8f1e58b415 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -60,41 +60,6 @@ struct pcpu_dstats {
struct u64_stats_sync   syncp;
 };
 
-static struct dst_entry *vrf_ip_check(struct dst_entry *dst, u32 cookie)
-{
-   return dst;
-}
-
-static int vrf_ip_local_out(struct net *net, struct sock *sk, struct sk_buff 
*skb)
-{
-   return ip_local_out(net, sk, skb);
-}
-
-static unsigned int vrf_v4_mtu(const struct dst_entry *dst)
-{
-   /* TO-DO: return max ethernet size? */
-   return dst->dev->mtu;
-}
-
-static void vrf_dst_destroy(struct dst_entry *dst)
-{
-   /* our dst lives forever - or until the device is closed */
-}
-
-static unsigned int vrf_default_advmss(const struct dst_entry *dst)
-{
-   return 65535 - 40;
-}
-
-static struct dst_ops vrf_dst_ops = {
-   .family = AF_INET,
-   .local_out  = vrf_ip_local_out,
-   .check  = vrf_ip_check,
-   .mtu= vrf_v4_mtu,
-   .destroy= vrf_dst_destroy,
-   .default_advmss = vrf_default_advmss,
-};
-
 /* neighbor handling is done with actual device; do not want
  * to flip skb->dev for those ndisc packets. This really fails
  * for multiple next protocols (e.g., NEXTHDR_HOP). But it is
@@ -349,46 +314,6 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct 
net_device *dev)
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
-static struct dst_entry *vrf_ip6_check(struct dst_entry *dst, u32 cookie)
-{
-   return dst;
-}
-
-static struct dst_ops vrf_dst_ops6 = {
-   .family = AF_INET6,
-   .local_out  = ip6_local_out,
-   .check  = vrf_ip6_check,
-   .mtu= vrf_v4_mtu,
-   .destroy= vrf_dst_destroy,
-   .default_advmss = vrf_default_advmss,
-};
-
-static int init_dst_ops6_kmem_cachep(void)
-{
-   vrf_dst_ops6.kmem_cachep = kmem_cache_create("vrf_ip6_dst_cache",
-sizeof(struct rt6_info),
-0,
-SLAB_HWCACHE_ALIGN,
-NULL);
-
-   if (!vrf_dst_ops6.kmem_cachep)
-   return -ENOMEM;
-
-   return 0;
-}
-
-static void free_dst_ops6_kmem_cachep(void)
-{
-   kmem_cache_destroy(vrf_dst_ops6.kmem_cachep);
-}
-
-static int vrf_input6(struct sk_buff *skb)
-{
-   skb->dev->stats.rx_errors++;
-   kfree_skb(skb);
-   return 0;
-}
-
 /* modelled after ip6_finish_output2 */
 static int vrf_finish_output6(struct net *net, struct sock *sk,
  struct sk_buff *skb)
@@ -429,67 +354,34 @@ static int vrf_output6(struct net *net, struct sock *sk, 
struct sk_buff *skb)
!(IP6CB(skb)->flags & IP6SKB_REROUTED));
 }
 
-static void vrf_rt6_destroy(struct net_vrf *vrf)
+static void vrf_rt6_release(struct net_vrf *vrf)
 {
-   dst_destroy(&vrf->rt6->dst);
-   free_percpu(vrf->rt6->rt6i_pcpu);
+   dst_release(&vrf->rt6->dst);
vrf->rt6 = NULL;
 }
 
 static int vrf_rt6_create(struct net_device *dev)
 {
struct net_vrf *vrf = netdev_priv(dev);
-   struct dst_entry *dst;
+   struct net *net = dev_net(dev);
struct rt6_info *rt6;
-   int cpu;
int rc = -ENOMEM;
 
-   rt6 = dst_alloc(&vrf_dst_ops6, dev, 0,
-   DST_OBSOLETE_NONE,
-   (DST_HOST | DST_NOPOLICY | DST_NOXFRM));
+   rt6 = ip6_dst_alloc(net, dev,
+   DST_HOST | DST_NOPOLICY | DST_NOXFRM | DST_NOCACHE);
if (!rt6)
goto out;
 
-   dst = &rt6->dst;
-
-   rt6->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, GFP_KERNEL);
-   if (!rt6->rt6i_pcpu) {
-   dst_destroy(dst);
-   goto out;
-   }
-   for_each_possible_cpu(cpu) {
-   

[PATCH] net: vrf: Fix dev refcnt leak due to IPv6 prefix route

2016-04-07 Thread David Ahern
ifupdown2 found a kernel bug with IPv6 routes and movement from the main
table to the VRF table. Sequence of events:

Create the interface and add addresses:
ip link add dev eth4.105 link eth4 type vlan id 105
ip addr add dev eth4.105 8.105.105.10/24
ip -6 addr add dev eth4.105 2008:105:105::10/64

At this point IPv6 has inserted a prefix route in the main table even
though the interface is 'down'. From there the VRF device is created:
ip link add dev vrf105 type vrf table 105
ip addr add dev vrf105 9.9.105.10/32
ip -6 addr add dev vrf105 2000:9:105::10/128
ip link set vrf105 up

Then the interface is enslaved, while still in the 'down' state:
ip link set dev eth4.105 master vrf105

Since the device is down, the VRF driver's cycling of the device does not
send NETDEV_UP and NETDEV_DOWN events but rather NETDEV_CHANGE, which
does not flush the routes inserted earlier.

When the link is brought up
ip link set dev eth4.105 up

the prefix route is added in the VRF table, but the route in the main
table is not removed.

Fix by handling the NETDEV_CHANGEUPPER event similar to what was implemented
for IPv4 in 7f49e7a38b77 ("net: Flush local routes when device changes vrf
association")

Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 27aed1afcf81..2db2116d3e6b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3255,6 +3255,7 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
   void *ptr)
 {
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+   struct netdev_notifier_changeupper_info *info;
struct inet6_dev *idev = __in6_dev_get(dev);
int run_pending = 0;
int err;
@@ -3413,6 +3414,15 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
if (idev)
addrconf_type_change(dev, event);
break;
+
+   case NETDEV_CHANGEUPPER:
+   info = ptr;
+
+   /* flush all routes if dev is linked to or unlinked from
+* an L3 master device (e.g., VRF)
+*/
+   if (info->upper_dev && netif_is_l3_master(info->upper_dev))
+   addrconf_ifdown(dev, 0);
}
 
return NOTIFY_OK;
-- 
2.1.4



Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-04-07 Thread Ben Greear

On 04/07/2016 08:11 AM, Vijay Pandurangan wrote:

On Fri, Mar 25, 2016 at 7:46 PM, Ben Greear  wrote:

A real NIC can either do hardware checksums, or it cannot.  If it
cannot, then the host must do it on the CPU for both transmit and
receive.

Veth is not a real NIC, and it cannot do hardware checksum offloading.

So, we either lie and pretend it does, or we eat massive amounts
of CPU usage to calculate and check checksums when sending across
a veth pair.



That's a good point. Does anyone know what the overhead actually is these days?


You could try setting up a system with ixgbe or similar, and then manually
disable csum offload using ethtool, and see how that performs in comparison
to hardware offload?


But, if I am purposely corrupting a frame destined for veth, then the only
reason
I would want the stack to check the checksums is if I were testing my own
stack's checksum logic, and that seems to be a pretty limited use.



In the common case you're 100% right.  OTOH, there's something
disconcerting about an abstraction layer lying and behaving
unexpectedly.  Most traffic that originates on a machine can have its
checksums safely ignored.  Whatever the reason is (maybe, as you say
you're testing checksums – on the other hand maybe there's a bug in
your code somewhere), I really feel like we should try to figure out a
way to ensure that this optimization is at the very least opt-in…


I'm fine with allowing a user to force software-csum on veth devices
if someone wants to code that up, but forcing sw-csum for local frames
on veth devices should be disabled by default.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update

2016-04-07 Thread Cong Wang
On Wed, Apr 6, 2016 at 11:49 AM, Martin KaFai Lau  wrote:
> On Wed, Apr 06, 2016 at 10:58:23AM -0700, Cong Wang wrote:
>> On Tue, Apr 5, 2016 at 5:11 PM, Martin KaFai Lau  wrote:
>> > On Mon, Apr 04, 2016 at 01:45:02PM -0700, Cong Wang wrote:
>> >> I see your point, but calling __ip6_datagram_connect() seems overkill
>> >> here, we don't need to update so many things in the pmtu update context,
>> >> at least IPv4 doesn't do that either. I don't think you have to do that.
>> >>
>> >> So why just updating the dst cache (also some addr cache) here is not
>> >> enough?
>> > I am not sure I understand.  I could be missing something.
>> >
>> > This patch uses ip6_datagram_dst_update() to do the route lookup and
>> > sk->sk_dst_cache update.  ip6_datagram_dst_update() is
>> > created in the first two refactoring patches and is also used by
>> > __ip6_datagram_connect().
>> >
>> > Which operations in ip6_datagram_dst_update() could be saved
>> > during the pmtu update?
>>
>> I thought you call the same ip6_datagram_dst_update() for both
>> pmtu update and __ip6_datagram_connect(), but you actually skip
>> some sk operations for pmtu case, which means you don't need
>> to worry about parallel ip6_datagram_connect().
>>
>> IPv6 UDP sendmsg() path stores the dst without sock lock anyway,
>> we don't cope with a concurrent connect() on another cpu.
> A parallel sendmsg and connect could be an issue.  The user is connecting
> to a new dest while another parallel sendmsg is sending to (could be the old
> dest, new dest or somewhere between old and new dest?)
>
> However, it is the userland's doing, and it will be another patch if we want
> to protect this case too.

Yeah, it is a different problem, but no one complains about it yet.

>
> In pmtu update, the kernel is doing the lookup and update without the
> userland being aware of it.
>
>> But still, I don't see this is a problem here, because even if we store
>> an obsolete address in cache, it would be corrected later.
> The sendmsg() path will correct it (relookup and update sk_dst_cache) but not
> the getsockopt(IPV6_MTU) path which is what this patch is trying to fix: 
> Update
> a _valid_ dst to sk->sk_dst_cache.

You are lost in discussion, I never object to update sk_dst_cache, what
we disagree here is merely if we need to lock the sock in pmtu update
context.

I still think it is okay without the lock, because even if you take the lock,
the pmtu update could still happen after you release it, so there is no
essential difference here. The only reason I can think of for taking
the sock lock is protecting parallel pmtu update, but it looks safe for
this case too.

So which case do you want to protect by taking the sock lock?


Re: [PATCH 1/9] net: mediatek: update the IRQ part of the binding document

2016-04-07 Thread David Miller

Every patch series must begin with a posting labelled "[PATCH 0/9] ..."
which explains what the series is doing, how it is implementing that,
and why it is implemented that way.


Re: [RFC PATCH 0/2] selinux: avoid nf hooks overhead when not needed

2016-04-07 Thread Paul Moore
On Thursday, April 07, 2016 01:45:32 AM Florian Westphal wrote:
> Paul Moore  wrote:
> > On Wed, Apr 6, 2016 at 6:14 PM, Florian Westphal  wrote:
> > > netfilter hooks are per namespace -- so there is hook unregister when
> > > netns is destroyed.
> > 
> > Looking around, I see the global and per-namespace registration
> > functions (nf_register_hook and nf_register_net_hook, respectively),
> > but I'm looking to see if/how newly created namespace inherit
> > netfilter hooks from the init network namespace ... if you can create
> > a network namespace and dodge the SELinux hooks, that isn't a good
> > thing from a SELinux point of view, although it might be a plus
> > depending on where you view Paolo's original patches ;)
> 
> Heh :-)
> 
> If you use nf_register_net_hook, the hook is only registered in the
> namespace.
> 
> If you use nf_register_hook, the hook is put on a global list and
> registed in all existing namespaces.
> 
> New namespaces will have the hook added as well (see
> netfilter_net_init -> nf_register_hook_list in netfilter/core.c )
>
> Since nf_register_hook is used it should be impossible to get a netns
> that doesn't call these hooks.

Great, thanks.
 
> > > Do you think it makes sense to rework the patch to delay registering
> > > of the netfiler hooks until the system is in a state where they're
> > > needed, without the 'unregister' aspect?
> > 
> > I would need to see the patch to say for certain, but in principle
> > that seems perfectly reasonable and I think would satisfy both the
> > netdev and SELinux camps - good suggestion.  My main goal is to drop
> > the selinux_nf_ip_init() entirely so it can't be used as a ROP gadget.
> > 
> > We might even be able to trim the secmark_active and peerlbl_active
> > checks in the SELinux netfilter hooks (an earlier attempt at
> > optimization; contrary to popular belief, I do care about SELinux
> > performance), although that would mean that enabling the network
> > access controls would be one way ... I guess you can disregard that
> > last bit, I'm thinking aloud again.
> 
> One way is fine I think.

Yes, just disregard my second paragraph above.
 
> > > Ideally this would even be per netns -- in perfect world we would
> > > be able to make it so that a new netns are created with an empty
> > > hook list.
> > 
> > In general SELinux doesn't care about namespaces, for reasons that are
> > sorta beyond the scope of this conversation, so I would like to stick
> > to a all or nothing approach to enabling the SELinux netfilter hooks
> > across namespaces.  Perhaps we can revisit this at a later time, but
> > let's keep it simple right now.
> 
> Okay, I'd prefer to stick to your recommendation anyway wrt. to selinux
> (Casey, I read your comment regarding smack. Noted, we don't want to
> break smack either...)
> 
> I think that in this case the entire question is:
> 
> In your experience, how likely is a config where selinux is enabled BUT the
> hooks are not needed (i.e., where we hit the
> 
> if (!selinux_policycap_netpeer)
> return NF_ACCEPT;
> 
> if (!secmark_active && !peerlbl_active)
>return NF_ACCEPT;
> 
> tests inside the hooks)?  If such setups are uncommon we should just
> drop this idea or at least put it on the back burner until the more
> expensive netfilter hooks (conntrack, cough) are out of the way.

A few years ago I would have said that it is relatively uncommon for admins to 
enable the SELinux network access controls; it was typically just 
government/intelligence agencies who had very strict access control 
requirements and represented a small portion of SELinux users.  However, over 
the past few years I've been fielding more and more questions from admins/devs 
in the virtualization space who are interested in some of these capabilities; 
it isn't clear to me how many of these people are switching it on, but there 
is definitely more interest than I have seen in the past, and the interest is 
centered around some rather common use cases.

So, to summarize, I don't know ;)

If you've got bigger sources of overhead, my opinion would be to go tackle 
those first.  Perhaps I can even find the time to work on the 
SELinux/netfilter stuff while you are off slaying the bigger dragons, no 
promises at the moment.

-- 
paul moore
www.paul-moore.com



Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update

2016-04-07 Thread Martin KaFai Lau
On Thu, Apr 07, 2016 at 11:37:10AM -0700, Cong Wang wrote:
> You are lost in discussion
Indeed. :(

>
> I still think it is okay without the lock, because even if you take the lock,
> the pmtu update could still happen after you release it, so there is no
> essential difference here. The only reason I can think of for taking
> the sock lock is protecting parallel pmtu update, but it looks safe for
> this case too.
>
> So which case do you want to protect by taking the sock lock?
When the pmtu-update is doing route lookup and another connect is
happening, what sk->sk_v6_daddr will this route lookup use?
the old one, new one or neither of them?

Is it acceptable that getsockopt() is returning something that it
is not currently connected to? and potentially somewhere that it
is never connected to?


[PATCH 7/9] net: mediatek: fix TX locking

2016-04-07 Thread John Crispin
Inside the TX path there is a lock inside the tx_map function. This is
however too late. The patch moves the lock to the start of the xmit
function right before the free count check of the DMA ring happens.
If we do not do this, the code becomes racy leading to TX stalls and
dropped packets. This happens as there are 2 netdevs running on the
same physical DMA ring.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 60b66ab..8434355 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -536,7 +536,6 @@ static int mtk_tx_map(struct sk_buff *skb, struct 
net_device *dev,
struct mtk_eth *eth = mac->hw;
struct mtk_tx_dma *itxd, *txd;
struct mtk_tx_buf *tx_buf;
-   unsigned long flags;
dma_addr_t mapped_addr;
unsigned int nr_frags;
int i, n_desc = 1;
@@ -568,11 +567,6 @@ static int mtk_tx_map(struct sk_buff *skb, struct 
net_device *dev,
	if (unlikely(dma_mapping_error(&dev->dev, mapped_addr)))
return -ENOMEM;
 
-   /* normally we can rely on the stack not calling this more than once,
-* however we have 2 queues running ont he same ring so we need to lock
-* the ring access
-*/
-   spin_lock_irqsave(&eth->page_lock, flags);
WRITE_ONCE(itxd->txd1, mapped_addr);
tx_buf->flags |= MTK_TX_FLAGS_SINGLE0;
dma_unmap_addr_set(tx_buf, dma_addr0, mapped_addr);
@@ -632,8 +626,6 @@ static int mtk_tx_map(struct sk_buff *skb, struct 
net_device *dev,
WRITE_ONCE(itxd->txd3, (TX_DMA_SWC | TX_DMA_PLEN0(skb_headlen(skb)) |
(!nr_frags * TX_DMA_LS0)));
 
-   spin_unlock_irqrestore(&eth->page_lock, flags);
-
netdev_sent_queue(dev, skb->len);
skb_tx_timestamp(skb);
 
@@ -661,8 +653,6 @@ err_dma:
itxd = mtk_qdma_phys_to_virt(ring, itxd->txd2);
} while (itxd != txd);
 
-   spin_unlock_irqrestore(&eth->page_lock, flags);
-
return -ENOMEM;
 }
 
@@ -712,14 +702,22 @@ static int mtk_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct mtk_eth *eth = mac->hw;
	struct mtk_tx_ring *ring = &eth->tx_ring;
	struct net_device_stats *stats = &dev->stats;
+   unsigned long flags;
bool gso = false;
int tx_num;
 
+   /* normally we can rely on the stack not calling this more than once,
+* however we have 2 queues running on the same ring so we need to lock
+* the ring access
+*/
+   spin_lock_irqsave(&eth->page_lock, flags);
+
tx_num = mtk_cal_txd_req(skb);
	if (unlikely(atomic_read(&ring->free_count) <= tx_num)) {
mtk_stop_queue(eth);
netif_err(eth, tx_queued, dev,
  "Tx Ring full when queue awake!\n");
+   spin_unlock_irqrestore(&eth->page_lock, flags);
return NETDEV_TX_BUSY;
}
 
@@ -747,10 +745,12 @@ static int mtk_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
 ring->thresh))
mtk_wake_queue(eth);
}
+   spin_unlock_irqrestore(&eth->page_lock, flags);
 
return NETDEV_TX_OK;
 
 drop:
+   spin_unlock_irqrestore(&eth->page_lock, flags);
stats->tx_dropped++;
dev_kfree_skb(skb);
return NETDEV_TX_OK;
-- 
1.7.10.4


[PATCH 1/9] net: mediatek: update the IRQ part of the binding document

2016-04-07 Thread John Crispin
The current binding document only describes a single interrupt. Update the
document by adding the 2 other interrupts.

The driver currently only uses a single interrupt. The HW is, however, able
to use IRQ grouping to split TX and RX onto separate GIC IRQs.

Signed-off-by: John Crispin 
Cc: devicet...@vger.kernel.org
---
 Documentation/devicetree/bindings/net/mediatek-net.txt |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt 
b/Documentation/devicetree/bindings/net/mediatek-net.txt
index 5ca7929..2f142be 100644
--- a/Documentation/devicetree/bindings/net/mediatek-net.txt
+++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
@@ -9,7 +9,7 @@ have dual GMAC each represented by a child node..
 Required properties:
 - compatible: Should be "mediatek,mt7623-eth"
 - reg: Address and length of the register set for the device
-- interrupts: Should contain the frame engines interrupt
+- interrupts: Should contain the three frame engines interrupts
 - clocks: the clock used by the core
 - clock-names: the names of the clock listed in the clocks property. These are
"ethif", "esw", "gp2", "gp1"
@@ -42,7 +42,9 @@ eth: ethernet@1b10 {
 < CLK_ETHSYS_GP2>,
 < CLK_ETHSYS_GP1>;
clock-names = "ethif", "esw", "gp2", "gp1";
-   interrupts = ;
+   interrupts = ;
power-domains = < MT2701_POWER_DOMAIN_ETH>;
resets = < MT2701_ETHSYS_ETH_RST>;
reset-names = "eth";
-- 
1.7.10.4


[PATCH net-next] net: ipv6: Use passed in table for nexthop lookups

2016-04-07 Thread David Ahern
Similar to 3bfd847203c6 ("net: Use passed in table for nexthop lookups")
for IPv4: if the route spec contains a table id, use that to look up the
next hop first and fall back to a full lookup if it fails (per the fix
4c9bcd117918b ("net: Fix nexthop lookups")).

Example:

root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo  proto none  metric 0  pref medium
2100:1::/120 dev eth1  proto kernel  metric 256  pref medium
local 2100:2::1 dev lo  proto none  metric 0  pref medium
2100:2::/120 dev eth2  proto kernel  metric 256  pref medium
local fe80::e0:f9ff:fe09:3cac dev lo  proto none  metric 0  pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo  proto none  metric 0  pref medium
fe80::/64 dev eth1  proto kernel  metric 256  pref medium
fe80::/64 dev eth2  proto kernel  metric 256  pref medium
ff00::/8 dev red  metric 256  pref medium
ff00::/8 dev eth1  metric 256  pref medium
ff00::/8 dev eth2  metric 256  pref medium
unreachable default dev lo  metric 240  error -113 pref medium

root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
RTNETLINK answers: No route to host

Route add fails even though 2100:1::64 is a reachable next hop:
root@kenny:~# ping6 -I red  2100:1::64
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms

With this patch:
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo  proto none  metric 0  pref medium
2100:1::/120 dev eth1  proto kernel  metric 256  pref medium
local 2100:2::1 dev lo  proto none  metric 0  pref medium
2100:2::/120 dev eth2  proto kernel  metric 256  pref medium
2100:3::/64 via 2100:1::64 dev eth1  metric 1024  pref medium
local fe80::e0:f9ff:fe09:3cac dev lo  proto none  metric 0  pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo  proto none  metric 0  pref medium
fe80::/64 dev eth1  proto kernel  metric 256  pref medium
fe80::/64 dev eth2  proto kernel  metric 256  pref medium
ff00::/8 dev red  metric 256  pref medium
ff00::/8 dev eth1  metric 256  pref medium
ff00::/8 dev eth2  metric 256  pref medium
unreachable default dev lo  metric 240  error -113 pref medium

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1d8871a5ed20..3e699dc199f3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1928,7 +1928,7 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg)
rt->rt6i_gateway = *gw_addr;
 
if (gwa_type != (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST)) {
-   struct rt6_info *grt;
+   struct rt6_info *grt = NULL;
 
/* IPv6 strictly inhibits using not link-local
   addresses as nexthop address.
@@ -1940,7 +1940,38 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg)
if (!(gwa_type & IPV6_ADDR_UNICAST))
goto out;
 
-   grt = rt6_lookup(net, gw_addr, NULL, cfg->fc_ifindex, 
1);
+   if (cfg->fc_table) {
+   struct flowi6 fl6 = {
+   .flowi6_oif = cfg->fc_ifindex,
+   .daddr = *gw_addr,
+   .saddr = cfg->fc_prefsrc,
+   };
+   struct fib6_table *table;
+   int flags = 0;
+
+   err = -EHOSTUNREACH;
+   table = fib6_get_table(net, cfg->fc_table);
+   if (!table)
+   goto out;
+
+   if (!ipv6_addr_any(&cfg->fc_prefsrc))
+   flags |= RT6_LOOKUP_F_HAS_SADDR;
+
+   grt = ip6_pol_route(net, table, cfg->fc_ifindex,
+   &fl6, flags);
+
+   /* if table lookup failed, fall back
+* to full lookup
+*/
+   if (grt == net->ipv6.ip6_null_entry) {
+   ip6_rt_put(grt);
+   grt = NULL;
+   }
+   }
+
+   if (!grt)
+   grt = rt6_lookup(net, gw_addr, NULL,
+cfg->fc_ifindex, 1);
 
err = -EHOSTUNREACH;
   

Re: [PATCH net-next 5/7] sctp: reuse the some transport traversal functions in proc

2016-04-07 Thread Marcelo Ricardo Leitner
On Thu, Apr 07, 2016 at 09:09:30AM -0400, Neil Horman wrote:
> On Tue, Apr 05, 2016 at 12:06:30PM +0800, Xin Long wrote:
> > There are some transport traversal functions for sctp_diag; we can also
> > use them for sctp_proc, because they have a similar need to traverse
> > transports.
> > 
> > Signed-off-by: Xin Long 
> > ---
> >  net/sctp/proc.c | 80 
> > +
> >  1 file changed, 18 insertions(+), 62 deletions(-)
> > 
> > diff --git a/net/sctp/proc.c b/net/sctp/proc.c
> > index 5cfac8d..dd8492f 100644
> > --- a/net/sctp/proc.c
> > +++ b/net/sctp/proc.c
> > @@ -282,80 +282,31 @@ struct sctp_ht_iter {
> > struct rhashtable_iter hti;
> >  };
> >  
> > -static struct sctp_transport *sctp_transport_get_next(struct seq_file *seq)
> > -{
> > -   struct sctp_ht_iter *iter = seq->private;
> > -   struct sctp_transport *t;
> > -
> > -   t = rhashtable_walk_next(&iter->hti);
> > -   for (; t; t = rhashtable_walk_next(&iter->hti)) {
> > -   if (IS_ERR(t)) {
> > -   if (PTR_ERR(t) == -EAGAIN)
> > -   continue;
> > -   break;
> > -   }
> > -
> > -   if (net_eq(sock_net(t->asoc->base.sk), seq_file_net(seq)) &&
> > -   t->asoc->peer.primary_path == t)
> > -   break;
> > -   }
> > -
> > -   return t;
> > -}
> > -
> 
> this may just be a nit, but you defined the new sctp_transport_get_next in 
> patch
> 2 of this series, and didn't remove this private version until here.  Is that
> going to cause some behavioral issue, if someone builds a kernel between 
> patch 2

Yes, it causes issues:

...net/sctp/proc.c:285:31: error: conflicting types for 
‘sctp_transport_get_next’
 static struct sctp_transport *sctp_transport_get_next(struct seq_file *seq)
   ^

> and 7?  Seems like perhaps those two patches should be merged.

Agreed.

  Marcelo



[PATCH 3/9] net: mediatek: mtk_cal_txd_req() returns bad value

2016-04-07 Thread John Crispin
The code used to also support the PDMA engine, which had 2 packet pointers
per descriptor. Because of this we had to divide the result by 2 and round
it up. This is no longer needed as the code only supports QDMA.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index bb10d57..94cceb8 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -681,7 +681,7 @@ static inline int mtk_cal_txd_req(struct sk_buff *skb)
nfrags += skb_shinfo(skb)->nr_frags;
}
 
-   return DIV_ROUND_UP(nfrags, 2);
+   return nfrags;
 }
 
 static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
-- 
1.7.10.4


Re: [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY

2016-04-07 Thread Rob Herring
On Tue, Apr 05, 2016 at 12:22:30AM +0800, Chen-Yu Tsai wrote:
> The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
> configured through a memory mapped hardware register.
> 
> This same register also configures the MAC interface mode and TX clock
> source. Also covered by the register, but not supported in these bindings,
> are TX/RX clock delay chains and inverters, and an RMII module.
> 
> Signed-off-by: Chen-Yu Tsai 
> ---
>  .../bindings/net/allwinner,sun8i-h3-ephy.txt   | 44 
> ++
>  1 file changed, 44 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt

Acked-by: Rob Herring 


Re: [PATCH net-next 3/7] sctp: export some functions for sctp_diag in inet_diag

2016-04-07 Thread Marcelo Ricardo Leitner
On Tue, Apr 05, 2016 at 12:06:28PM +0800, Xin Long wrote:
> inet_diag_msg_common_fill is used to fill the diag msg common info,
> we need to use it in sctp_diag as well, so export it.
> 
> We also add inet_diag_get_handler() to access inet_diag_table in sctp
> diag.
> 
> Signed-off-by: Xin Long 
> ---
>  net/ipv4/inet_diag.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
> index bd591eb..29121a6 100644
> --- a/net/ipv4/inet_diag.c
> +++ b/net/ipv4/inet_diag.c
> @@ -66,7 +66,13 @@ static void inet_diag_unlock_handler(const struct 
> inet_diag_handler *handler)
> >   mutex_unlock(&inet_diag_table_mutex);
>  }
>  
> -static void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock 
> *sk)
> +struct inet_diag_handler *inet_diag_get_handler(int proto)

It needs to return it as const, as inet_diag_table is also declared as
const, so we don't loose the qualifier.

> +{
> + return inet_diag_table[proto];
> +}
> +EXPORT_SYMBOL_GPL(inet_diag_get_handler);
> +
> +void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock *sk)
>  {
>   r->idiag_family = sk->sk_family;
>  
> @@ -89,6 +95,7 @@ static void inet_diag_msg_common_fill(struct inet_diag_msg 
> *r, struct sock *sk)
>   r->id.idiag_dst[0] = sk->sk_daddr;
>   }
>  }
> +EXPORT_SYMBOL_GPL(inet_diag_msg_common_fill);
>  
>  static size_t inet_sk_attr_size(void)
>  {
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


[PATCH v5 net-next 02/15] nfp: move link state interrupt request/free calls

2016-04-07 Thread Jakub Kicinski
We need to be able to disable the link state interrupt when
the device is brought down.  We used to just free the IRQ
at the beginning of .ndo_stop().  As we now move towards
more ordered .ndo_open()/.ndo_stop() paths LSC allocation
should be placed in the "allocate resource" section.

Since the IRQ can't be freed early in .ndo_stop(), it is
disabled instead.

Signed-off-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 23 +++---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 0dae81454e77..5da1199e7afb 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1729,10 +1729,16 @@ static int nfp_net_netdev_open(struct net_device 
*netdev)
  NFP_NET_IRQ_EXN_IDX, nn->exn_handler);
if (err)
return err;
+   err = nfp_net_aux_irq_request(nn, NFP_NET_CFG_LSC, "%s-lsc",
+ nn->lsc_name, sizeof(nn->lsc_name),
+ NFP_NET_IRQ_LSC_IDX, nn->lsc_handler);
+   if (err)
+   goto err_free_exn;
+   disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector);
 
err = nfp_net_alloc_rings(nn);
if (err)
-   goto err_free_exn;
+   goto err_free_lsc;
 
err = netif_set_real_num_tx_queues(netdev, nn->num_tx_rings);
if (err)
@@ -1812,19 +1818,11 @@ static int nfp_net_netdev_open(struct net_device 
*netdev)
 
netif_tx_wake_all_queues(netdev);
 
-   err = nfp_net_aux_irq_request(nn, NFP_NET_CFG_LSC, "%s-lsc",
- nn->lsc_name, sizeof(nn->lsc_name),
- NFP_NET_IRQ_LSC_IDX, nn->lsc_handler);
-   if (err)
-   goto err_stop_tx;
+   enable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector);
nfp_net_read_link_status(nn);
 
return 0;
 
-err_stop_tx:
-   netif_tx_disable(netdev);
-   for (r = 0; r < nn->num_r_vecs; r++)
-   nfp_net_tx_flush(nn->r_vecs[r].tx_ring);
 err_disable_napi:
while (r--) {
	napi_disable(&nn->r_vecs[r].napi);
@@ -1834,6 +1832,8 @@ err_clear_config:
nfp_net_clear_config_and_disable(nn);
 err_free_rings:
nfp_net_free_rings(nn);
+err_free_lsc:
+   nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX);
 err_free_exn:
nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX);
return err;
@@ -1855,7 +1855,7 @@ static int nfp_net_netdev_close(struct net_device *netdev)
 
/* Step 1: Disable RX and TX rings from the Linux kernel perspective
 */
-   nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX);
+   disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector);
netif_carrier_off(netdev);
nn->link_up = false;
 
@@ -1876,6 +1876,7 @@ static int nfp_net_netdev_close(struct net_device *netdev)
}
 
nfp_net_free_rings(nn);
+   nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX);
nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX);
 
nn_dbg(nn, "%s down", netdev->name);
-- 
1.9.1



[PATCH v5 net-next 00/15] MTU/buffer reconfig changes

2016-04-07 Thread Jakub Kicinski
Hi!

I re-discussed MPLS/MTU internally, dropped it from the patch 1,
re-tested everything, found out I forgot about debugfs pointers,
fixed that as well.

v5:
 - don't reserve space in RX buffers for MPLS label stack
   (patch 1);
 - fix debugfs pointers to ring structures (patch 5).
v4:
 - cut down on unrelated patches;
 - don't "close" the device on error path.

--- v4 cover letter

Previous series included some not entirely related patches,
this one is cut down.  Main issue I'm trying to solve here
is that .ndo_change_mtu() in nfpvf driver is doing full
close/open to reallocate buffers - which if open fails
can result in device being basically closed even though
the interface is started.  As suggested by you I try to move
towards a paradigm where the resources are allocated first
and the MTU change is only done once I'm certain (almost)
nothing can fail.  Almost because I need to communicate 
with FW and that can always time out.

Patch 1 fixes small issue.  Next 10 patches reorganize things
so that I can easily allocate new rings and sets of buffers
while the device is running.  Patches 13 and 15 reshape the
.ndo_change_mtu() and ethtool's ring-resize operation into
desired form.


Jakub Kicinski (15):
  nfp: correct RX buffer length calculation
  nfp: move link state interrupt request/free calls
  nfp: break up nfp_net_{alloc|free}_rings
  nfp: make *x_ring_init do all the init
  nfp: allocate ring SW structs dynamically
  nfp: cleanup tx ring flush and rename to reset
  nfp: reorganize initial filling of RX rings
  nfp: preallocate RX buffers early in .ndo_open
  nfp: move filling ring information to FW config
  nfp: slice .ndo_open() and .ndo_stop() up
  nfp: sync ring state during FW reconfiguration
  nfp: propagate list buffer size in struct rx_ring
  nfp: convert .ndo_change_mtu() to prepare/commit paradigm
  nfp: pass ring count as function parameter
  nfp: allow ring size reconfiguration at runtime

 drivers/net/ethernet/netronome/nfp/nfp_net.h   |  10 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 903 ++---
 .../net/ethernet/netronome/nfp/nfp_net_debugfs.c   |  20 +-
 .../net/ethernet/netronome/nfp/nfp_net_ethtool.c   |  30 +-
 4 files changed, 627 insertions(+), 336 deletions(-)

-- 
1.9.1



[PATCH v5 net-next 04/15] nfp: make *x_ring_init do all the init

2016-04-07 Thread Jakub Kicinski
nfp_net_[rt]x_ring_init functions used to be called from probe
path only and some of their functionality was spilled to the
call site.  In order to reuse them for ring reconfiguration
we need them to do all the init.

Signed-off-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 28 ++
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 8692587904c5..7cd20fcd631a 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -347,12 +347,18 @@ static irqreturn_t nfp_net_irq_exn(int irq, void *data)
 /**
  * nfp_net_tx_ring_init() - Fill in the boilerplate for a TX ring
  * @tx_ring:  TX ring structure
+ * @r_vec:IRQ vector servicing this ring
+ * @idx:  Ring index
  */
-static void nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring)
+static void
+nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring,
+struct nfp_net_r_vector *r_vec, unsigned int idx)
 {
-   struct nfp_net_r_vector *r_vec = tx_ring->r_vec;
struct nfp_net *nn = r_vec->nfp_net;
 
+   tx_ring->idx = idx;
+   tx_ring->r_vec = r_vec;
+
tx_ring->qcidx = tx_ring->idx * nn->stride_tx;
tx_ring->qcp_q = nn->tx_bar + NFP_QCP_QUEUE_OFF(tx_ring->qcidx);
 }
@@ -360,12 +366,18 @@ static void nfp_net_tx_ring_init(struct nfp_net_tx_ring 
*tx_ring)
 /**
  * nfp_net_rx_ring_init() - Fill in the boilerplate for a RX ring
  * @rx_ring:  RX ring structure
+ * @r_vec:IRQ vector servicing this ring
+ * @idx:  Ring index
  */
-static void nfp_net_rx_ring_init(struct nfp_net_rx_ring *rx_ring)
+static void
+nfp_net_rx_ring_init(struct nfp_net_rx_ring *rx_ring,
+struct nfp_net_r_vector *r_vec, unsigned int idx)
 {
-   struct nfp_net_r_vector *r_vec = rx_ring->r_vec;
struct nfp_net *nn = r_vec->nfp_net;
 
+   rx_ring->idx = idx;
+   rx_ring->r_vec = r_vec;
+
rx_ring->fl_qcidx = rx_ring->idx * nn->stride_rx;
rx_ring->rx_qcidx = rx_ring->fl_qcidx + (nn->stride_rx - 1);
 
@@ -403,14 +415,10 @@ static void nfp_net_irqs_assign(struct net_device *netdev)
	cpumask_set_cpu(r, &r_vec->affinity_mask);
 
	r_vec->tx_ring = &nn->tx_rings[r];
-   nn->tx_rings[r].idx = r;
-   nn->tx_rings[r].r_vec = r_vec;
-   nfp_net_tx_ring_init(r_vec->tx_ring);
+   nfp_net_tx_ring_init(r_vec->tx_ring, r_vec, r);
 
	r_vec->rx_ring = &nn->rx_rings[r];
-   nn->rx_rings[r].idx = r;
-   nn->rx_rings[r].r_vec = r_vec;
-   nfp_net_rx_ring_init(r_vec->rx_ring);
+   nfp_net_rx_ring_init(r_vec->rx_ring, r_vec, r);
}
 }
 
-- 
1.9.1



[PATCH v5 net-next 08/15] nfp: preallocate RX buffers early in .ndo_open

2016-04-07 Thread Jakub Kicinski
We want the .ndo_open() to have following structure:
 - allocate resources;
 - configure HW/FW;
 - enable the device from stack perspective.
Therefore filling RX rings needs to be moved to the beginning
of .ndo_open().

Signed-off-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 34 +++---
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 0c3c37ad28a4..a6a917fe8e31 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1666,28 +1666,19 @@ static void nfp_net_clear_config_and_disable(struct 
nfp_net *nn)
  * @nn:  NFP Net device structure
  * @r_vec:   Ring vector to be started
  */
-static int nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector 
*r_vec)
+static void
+nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec)
 {
unsigned int irq_vec;
-   int err = 0;
 
irq_vec = nn->irq_entries[r_vec->irq_idx].vector;
 
disable_irq(irq_vec);
 
-   err = nfp_net_rx_ring_bufs_alloc(r_vec->nfp_net, r_vec->rx_ring);
-   if (err) {
-   nn_err(nn, "RV%02d: couldn't allocate enough buffers\n",
-  r_vec->irq_idx);
-   goto out;
-   }
nfp_net_rx_ring_fill_freelist(r_vec->rx_ring);
-
	napi_enable(&r_vec->napi);
-out:
-   enable_irq(irq_vec);
 
-   return err;
+   enable_irq(irq_vec);
 }
 
 static int nfp_net_netdev_open(struct net_device *netdev)
@@ -1742,6 +1733,10 @@ static int nfp_net_netdev_open(struct net_device *netdev)
err = nfp_net_rx_ring_alloc(nn->r_vecs[r].rx_ring);
if (err)
goto err_free_tx_ring_p;
+
+   err = nfp_net_rx_ring_bufs_alloc(nn, nn->r_vecs[r].rx_ring);
+   if (err)
+   goto err_flush_rx_ring_p;
}
 
err = netif_set_real_num_tx_queues(netdev, nn->num_tx_rings);
@@ -1814,11 +1809,8 @@ static int nfp_net_netdev_open(struct net_device *netdev)
 * - enable all TX queues
 * - set link state
 */
-   for (r = 0; r < nn->num_r_vecs; r++) {
-   err = nfp_net_start_vec(nn, &nn->r_vecs[r]);
-   if (err)
-   goto err_disable_napi;
-   }
+   for (r = 0; r < nn->num_r_vecs; r++)
+   nfp_net_start_vec(nn, &nn->r_vecs[r]);
 
netif_tx_wake_all_queues(netdev);
 
@@ -1827,18 +1819,14 @@ static int nfp_net_netdev_open(struct net_device 
*netdev)
 
return 0;
 
-err_disable_napi:
-   while (r--) {
-   napi_disable(&nn->r_vecs[r].napi);
-   nfp_net_rx_ring_reset(nn->r_vecs[r].rx_ring);
-   nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring);
-   }
 err_clear_config:
nfp_net_clear_config_and_disable(nn);
 err_free_rings:
r = nn->num_r_vecs;
 err_free_prev_vecs:
while (r--) {
+   nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring);
+err_flush_rx_ring_p:
nfp_net_rx_ring_free(nn->r_vecs[r].rx_ring);
 err_free_tx_ring_p:
nfp_net_tx_ring_free(nn->r_vecs[r].tx_ring);
-- 
1.9.1



[PATCH 2/9] net: mediatek: watchdog_timeo was not set

2016-04-07 Thread John Crispin
The original commit failed to set watchdog_timeo. This patch sets
watchdog_timeo to HZ.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index e0b68af..bb10d57 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1645,6 +1645,7 @@ static int mtk_add_mac(struct mtk_eth *eth, struct 
device_node *np)
mac->hw_stats->reg_offset = id * MTK_STAT_OFFSET;
 
SET_NETDEV_DEV(eth->netdev[id], eth->dev);
+   eth->netdev[id]->watchdog_timeo = HZ;
	eth->netdev[id]->netdev_ops = &mtk_netdev_ops;
eth->netdev[id]->base_addr = (unsigned long)eth->base;
eth->netdev[id]->vlan_features = MTK_HW_FEATURES &
-- 
1.7.10.4


[PATCH 6/9] net: mediatek: fix mtk_pending_work

2016-04-07 Thread John Crispin
The driver supports 2 MACs. Both run on the same DMA ring. If we hit a TX
timeout we need to stop both netdevs before restarting them again. If we
don't do this, mtk_stop() won't shut down DMA, and the subsequent call to
mtk_open() won't restart DMA and enable IRQs.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   31 ++-
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 4ebc42e..60b66ab 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1430,19 +1430,30 @@ static int mtk_do_ioctl(struct net_device *dev, struct 
ifreq *ifr, int cmd)
 
 static void mtk_pending_work(struct work_struct *work)
 {
-   struct mtk_mac *mac = container_of(work, struct mtk_mac, pending_work);
-   struct mtk_eth *eth = mac->hw;
-   struct net_device *dev = eth->netdev[mac->id];
-   int err;
+   struct mtk_eth *eth = container_of(work, struct mtk_eth, pending_work);
+   int err, i;
+   unsigned long restart = 0;
 
rtnl_lock();
-   mtk_stop(dev);
 
-   err = mtk_open(dev);
-   if (err) {
-   netif_alert(eth, ifup, dev,
-   "Driver up/down cycle failed, closing device.\n");
-   dev_close(dev);
+   /* stop all devices to make sure that dma is properly shut down */
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!netif_oper_up(eth->netdev[i]))
+   continue;
+   mtk_stop(eth->netdev[i]);
+   __set_bit(i, &restart);
+   }
+
+   /* restart DMA and enable IRQs */
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!test_bit(i, &restart))
+   continue;
+   err = mtk_open(eth->netdev[i]);
+   if (err) {
+   netif_alert(eth, ifup, eth->netdev[i],
+ "Driver up/down cycle failed, closing device.\n");
+   dev_close(eth->netdev[i]);
+   }
}
rtnl_unlock();
 }
-- 
1.7.10.4


[PATCH 4/9] net: mediatek: remove superfluous reset call

2016-04-07 Thread John Crispin
HW reset is triggered in the mtk_hw_init() function. There is no need to
also reset the core during probe.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 94cceb8..a4982e4 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1679,10 +1679,6 @@ static int mtk_probe(struct platform_device *pdev)
struct mtk_eth *eth;
int err;
 
-   err = device_reset(&pdev->dev);
-   if (err)
-   return err;
-
	match = of_match_device(of_mtk_match, &pdev->dev);
soc = (struct mtk_soc_data *)match->data;
 
-- 
1.7.10.4


[PATCH 8/9] net: mediatek: move the pending_work struct to the device generic struct

2016-04-07 Thread John Crispin
The worker always touches both netdevs. It belongs to the ethernet core
and is not MAC specific. We only need one worker, which belongs in the
ethernet core's struct.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   10 --
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |4 ++--
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 8434355..f9f8851 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1193,7 +1193,7 @@ static void mtk_tx_timeout(struct net_device *dev)
eth->netdev[mac->id]->stats.tx_errors++;
netif_err(eth, tx_err, dev,
  "transmit timed out\n");
-   schedule_work(&mac->pending_work);
+   schedule_work(&eth->pending_work);
 }
 
 static irqreturn_t mtk_handle_irq(int irq, void *_eth)
@@ -1438,7 +1438,7 @@ static void mtk_pending_work(struct work_struct *work)
 
/* stop all devices to make sure that dma is properly shut down */
for (i = 0; i < MTK_MAC_COUNT; i++) {
-   if (!netif_oper_up(eth->netdev[i]))
+   if (!eth->netdev[i])
continue;
mtk_stop(eth->netdev[i]);
__set_bit(i, );
@@ -1463,15 +1463,13 @@ static int mtk_cleanup(struct mtk_eth *eth)
int i;
 
for (i = 0; i < MTK_MAC_COUNT; i++) {
-   struct mtk_mac *mac = netdev_priv(eth->netdev[i]);
-
if (!eth->netdev[i])
continue;
 
unregister_netdev(eth->netdev[i]);
free_netdev(eth->netdev[i]);
-   cancel_work_sync(&mac->pending_work);
}
+   cancel_work_sync(&eth->pending_work);
 
return 0;
 }
@@ -1659,7 +1657,6 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
mac->id = id;
mac->hw = eth;
mac->of_node = np;
-   INIT_WORK(&mac->pending_work, mtk_pending_work);
 
mac->hw_stats = devm_kzalloc(eth->dev,
 sizeof(*mac->hw_stats),
@@ -1761,6 +1758,7 @@ static int mtk_probe(struct platform_device *pdev)
 
	eth->dev = &pdev->dev;
eth->msg_enable = netif_msg_init(mtk_msg_level, MTK_DEFAULT_MSG_ENABLE);
+   INIT_WORK(&eth->pending_work, mtk_pending_work);
 
err = mtk_hw_init(eth);
if (err)
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 48a5292..eed626d 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -363,6 +363,7 @@ struct mtk_rx_ring {
  * @clk_gp1:   The gmac1 clock
  * @clk_gp2:   The gmac2 clock
  * @mii_bus:   If there is a bus we need to create an instance for it
+ * @pending_work:  The workqueue used to reset the dma ring
  */
 
 struct mtk_eth {
@@ -389,6 +390,7 @@ struct mtk_eth {
struct clk  *clk_gp1;
struct clk  *clk_gp2;
struct mii_bus  *mii_bus;
+   struct work_struct  pending_work;
 };
 
 /* struct mtk_mac -the structure that holds the info about the MACs of the
@@ -398,7 +400,6 @@ struct mtk_eth {
  * @hw:Backpointer to our main datastruture
  * @hw_stats:  Packet statistics counter
  * @phy_dev:   The attached PHY if available
- * @pending_work:  The workqueue used to reset the dma ring
  */
 struct mtk_mac {
int id;
@@ -406,7 +407,6 @@ struct mtk_mac {
struct mtk_eth  *hw;
struct mtk_hw_stats *hw_stats;
struct phy_device   *phy_dev;
-   struct work_struct  pending_work;
 };
 
 /* the struct describing the SoC. these are declared in the soc_xyz.c files */
-- 
1.7.10.4


[PATCH 5/9] net: mediatek: fix stop and wakeup of queue

2016-04-07 Thread John Crispin
The driver supports 2 MACs. Both run on the same DMA ring. If we go
above/below the TX ring's threshold value, we always need to wake/stop
the queues of both devices. Not doing so can cause TX stalls and packet
drops on one of the devices.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   37 +++
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index a4982e4..4ebc42e 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -684,6 +684,28 @@ static inline int mtk_cal_txd_req(struct sk_buff *skb)
return nfrags;
 }
 
+static void mtk_wake_queue(struct mtk_eth *eth)
+{
+   int i;
+
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!eth->netdev[i])
+   continue;
+   netif_wake_queue(eth->netdev[i]);
+   }
+}
+
+static void mtk_stop_queue(struct mtk_eth *eth)
+{
+   int i;
+
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!eth->netdev[i])
+   continue;
+   netif_stop_queue(eth->netdev[i]);
+   }
+}
+
 static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct mtk_mac *mac = netdev_priv(dev);
@@ -695,7 +717,7 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
tx_num = mtk_cal_txd_req(skb);
	if (unlikely(atomic_read(&ring->free_count) <= tx_num)) {
-   netif_stop_queue(dev);
+   mtk_stop_queue(eth);
netif_err(eth, tx_queued, dev,
  "Tx Ring full when queue awake!\n");
return NETDEV_TX_BUSY;
@@ -720,10 +742,10 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
goto drop;
 
	if (unlikely(atomic_read(&ring->free_count) <= ring->thresh)) {
-   netif_stop_queue(dev);
+   mtk_stop_queue(eth);
		if (unlikely(atomic_read(&ring->free_count) >
 ring->thresh))
-   netif_wake_queue(dev);
+   mtk_wake_queue(eth);
}
 
return NETDEV_TX_OK;
@@ -897,13 +919,8 @@ static int mtk_poll_tx(struct mtk_eth *eth, int budget, bool *tx_again)
if (!total)
return 0;
 
-   for (i = 0; i < MTK_MAC_COUNT; i++) {
-   if (!eth->netdev[i] ||
-   unlikely(!netif_queue_stopped(eth->netdev[i])))
-   continue;
-   if (atomic_read(&ring->free_count) > ring->thresh)
-   netif_wake_queue(eth->netdev[i]);
-   }
+   if (atomic_read(&ring->free_count) > ring->thresh)
+   mtk_wake_queue(eth);
 
return total;
 }
-- 
1.7.10.4


[PATCH 9/9] net: mediatek: do not set the QID field in the TX DMA descriptors

2016-04-07 Thread John Crispin
The QID field gets set to the mac id. This made the DMA linked-list
logic queue each MAC's traffic on a different internal queue. However,
during long-term testing we found that this causes traffic stalls, as
the multi-queue setup requires a more complete initialisation which is
not yet part of the upstream driver.

This patch removes the code setting the QID field, resulting in all
traffic ending up in queue 0, which works without any special setup.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index f9f8851..8163047 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -603,8 +603,7 @@ static int mtk_tx_map(struct sk_buff *skb, struct net_device *dev,
WRITE_ONCE(txd->txd1, mapped_addr);
WRITE_ONCE(txd->txd3, (TX_DMA_SWC |
   TX_DMA_PLEN0(frag_map_size) |
-  last_frag * TX_DMA_LS0) |
-  mac->id);
+  last_frag * TX_DMA_LS0));
WRITE_ONCE(txd->txd4, 0);
 
tx_buf->skb = (struct sk_buff *)MTK_DMA_DUMMY_DESC;
-- 
1.7.10.4


[PATCH] net-next: mediatek: add support for IRQ grouping

2016-04-07 Thread John Crispin
The ethernet core has 3 IRQs. Using the IRQ grouping registers we are able
to separate TX and RX IRQs, which allows us to service them on separate
cores. This patch splits the IRQ handler into 2 separate functions, one
for TX and another for RX. The TX housekeeping is split out of the NAPI
handler. Instead we use a tasklet to handle housekeeping.

Signed-off-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |  115 +--
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |   12 ++-
 2 files changed, 86 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 8163047..6387516 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -756,7 +756,7 @@ drop:
 }
 
 static int mtk_poll_rx(struct napi_struct *napi, int budget,
-  struct mtk_eth *eth, u32 rx_intr)
+  struct mtk_eth *eth)
 {
	struct mtk_rx_ring *ring = &eth->rx_ring;
int idx = ring->calc_idx;
@@ -842,12 +842,12 @@ release_desc:
}
 
if (done < budget)
-   mtk_w32(eth, rx_intr, MTK_QMTK_INT_STATUS);
+   mtk_w32(eth, MTK_RX_DONE_INT, MTK_QMTK_INT_STATUS);
 
return done;
 }
 
-static int mtk_poll_tx(struct mtk_eth *eth, int budget, bool *tx_again)
+static int mtk_poll_tx(struct mtk_eth *eth, int budget)
 {
	struct mtk_tx_ring *ring = &eth->tx_ring;
struct mtk_tx_dma *desc;
@@ -910,9 +910,7 @@ static int mtk_poll_tx(struct mtk_eth *eth, int budget, bool *tx_again)
}
 
/* read hw index again make sure no new tx packet */
-   if (cpu != dma || cpu != mtk_r32(eth, MTK_QTX_DRX_PTR))
-   *tx_again = true;
-   else
+   if (cpu == dma && cpu == mtk_r32(eth, MTK_QTX_DRX_PTR))
mtk_w32(eth, MTK_TX_DONE_INT, MTK_QMTK_INT_STATUS);
 
if (!total)
@@ -924,27 +922,27 @@ static int mtk_poll_tx(struct mtk_eth *eth, int budget, bool *tx_again)
return total;
 }
 
+static void mtk_clean_tx_tasklet(unsigned long arg)
+{
+   struct mtk_eth *eth = (struct mtk_eth *)arg;
+
+   if (mtk_poll_tx(eth, MTK_NAPI_WEIGHT) > 0)
		tasklet_schedule(&eth->tx_clean_tasklet);
+   else
+   mtk_irq_enable(eth, MTK_TX_DONE_INT);
+}
+
 static int mtk_poll(struct napi_struct *napi, int budget)
 {
struct mtk_eth *eth = container_of(napi, struct mtk_eth, rx_napi);
-   u32 status, status2, mask, tx_intr, rx_intr, status_intr;
-   int tx_done, rx_done;
-   bool tx_again = false;
+   u32 status, status2, mask, status_intr;
+   int rx_done = 0;
 
status = mtk_r32(eth, MTK_QMTK_INT_STATUS);
status2 = mtk_r32(eth, MTK_INT_STATUS2);
-   tx_intr = MTK_TX_DONE_INT;
-   rx_intr = MTK_RX_DONE_INT;
status_intr = (MTK_GDM1_AF | MTK_GDM2_AF);
-   tx_done = 0;
-   rx_done = 0;
-   tx_again = 0;
 
-   if (status & tx_intr)
-   tx_done = mtk_poll_tx(eth, budget, _again);
-
-   if (status & rx_intr)
-   rx_done = mtk_poll_rx(napi, budget, eth, rx_intr);
+   rx_done = mtk_poll_rx(napi, budget, eth);
 
if (unlikely(status2 & status_intr)) {
mtk_stats_update(eth);
@@ -953,20 +951,20 @@ static int mtk_poll(struct napi_struct *napi, int budget)
 
if (unlikely(netif_msg_intr(eth))) {
mask = mtk_r32(eth, MTK_QDMA_INT_MASK);
-   netdev_info(eth->netdev[0],
-   "done tx %d, rx %d, intr 0x%08x/0x%x\n",
-   tx_done, rx_done, status, mask);
+   dev_info(eth->dev,
+"done rx %d, intr 0x%08x/0x%x\n",
+rx_done, status, mask);
}
 
-   if (tx_again || rx_done == budget)
+   if (rx_done == budget)
return budget;
 
status = mtk_r32(eth, MTK_QMTK_INT_STATUS);
-   if (status & (tx_intr | rx_intr))
+   if (status & MTK_RX_DONE_INT)
return budget;
 
napi_complete(napi);
-   mtk_irq_enable(eth, tx_intr | rx_intr);
+   mtk_irq_enable(eth, MTK_RX_DONE_INT);
 
return rx_done;
 }
@@ -1195,22 +1193,43 @@ static void mtk_tx_timeout(struct net_device *dev)
	schedule_work(&eth->pending_work);
 }
 
-static irqreturn_t mtk_handle_irq(int irq, void *_eth)
+static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
 {
struct mtk_eth *eth = _eth;
u32 status;
 
status = mtk_r32(eth, MTK_QMTK_INT_STATUS);
+   status &= ~MTK_TX_DONE_INT;
+
if (unlikely(!status))
return IRQ_NONE;
 
-   if (likely(status & (MTK_RX_DONE_INT | MTK_TX_DONE_INT))) {
+   if (status & MTK_RX_DONE_INT) {
		if (likely(napi_schedule_prep(&eth->rx_napi)))
			__napi_schedule(&eth->rx_napi);
-   } else {
-   mtk_w32(eth, status, 

Re: How to get creator PID information for the local tcp connection

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 23:01 +0530, Vishnu Pratap Singh wrote:
> Hi,
> 
> 
> Issue -  How to get PID information for the local tcp connection
> 
> 
> 
> i want to get the creator PID for each socket in user space for local
> tcp connection, i see in kernel there is support for returing PID with
> "SO_PEERCRED" ioctl to work across namespaces. it uses struct pid and
> struct cred to store the peer credentials on struct sock.
> cred_to_ucred(sk->sk_peer_pid, sk->sk_peer_cred, ); Above
> function stores the PID information in ucred->pid = pid_vnr(pid); and
> same is returned via "SO_PEERCRED" ioctl .
> 
> But for local tcp connection i get pid as 0, is there any way i can
> get the PID information. Any help or suggestion will be highly
> helpful.
> 
> 

man 7 socket

   SO_PEERCRED
          Return the credentials of the foreign process connected to this
          socket.  This is possible only for connected AF_UNIX stream
          sockets and AF_UNIX stream and datagram socket pairs created
          using socketpair(2); see unix(7).  The returned credentials are
          those that were in effect at the time of the call to connect(2)
          or socketpair(2).  The argument is a ucred structure; define the
          _GNU_SOURCE feature test macro to obtain the definition of that
          structure from <sys/socket.h>.  This socket option is read-only.




Re: How to get creator PID information for the local tcp connection

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 11:26 -0700, Eric Dumazet wrote:
> On Thu, 2016-04-07 at 23:01 +0530, Vishnu Pratap Singh wrote:
> > Hi,
> > 
> > 
> > Issue -  How to get PID information for the local tcp connection
> > 
> > 
> > 
> > i want to get the creator PID for each socket in user space for local
> > tcp connection, i see in kernel there is support for returing PID with
> > "SO_PEERCRED" ioctl to work across namespaces. it uses struct pid and
> > struct cred to store the peer credentials on struct sock.
> > cred_to_ucred(sk->sk_peer_pid, sk->sk_peer_cred, ); Above
> > function stores the PID information in ucred->pid = pid_vnr(pid); and
> > same is returned via "SO_PEERCRED" ioctl .
> > 
> > But for local tcp connection i get pid as 0, is there any way i can
> > get the PID information. Any help or suggestion will be highly
> > helpful.
> > 
> > 
> 
> man 7 socket
> 
>    SO_PEERCRED
>           Return the credentials of the foreign process connected to this
>           socket.  This is possible only for connected AF_UNIX stream
>           sockets and AF_UNIX stream and datagram socket pairs created
>           using socketpair(2); see unix(7).  The returned credentials are
>           those that were in effect at the time of the call to connect(2)
>           or socketpair(2).  The argument is a ucred structure; define the
>           _GNU_SOURCE feature test macro to obtain the definition of that
>           structure from <sys/socket.h>.  This socket option is read-only.
> 

Sorry, I hit "Send" too fast.

This is not implemented for TCP yet.

You'll have to take a look at the iproute2 package, since "ss -tp" is able
to find this information by looking at all /proc/{pid}/fd/* files and
the socket inode number the kernel gives through inet_diag.

Not scalable if you have millions of sockets...




Re: [PATCH net-next 0/7] sctp: support sctp_diag in kernel

2016-04-07 Thread Marcelo Ricardo Leitner
On Tue, Apr 05, 2016 at 12:06:25PM +0800, Xin Long wrote:
> This patchset will add sctp_diag module to implement diag interface on
> sctp in kernel.
...

Other than the const thing and the point noticed by Neil, which need
to be fixed, the patchset looks good to me.

  Marcelo


[PATCH v5 net-next 12/15] nfp: propagate list buffer size in struct rx_ring

2016-04-07 Thread Jakub Kicinski
Free list buffer size needs to be propagated to a few functions
as a parameter and added to struct nfp_net_rx_ring, since soon
some of the functions will be reused to manage rings with
buffers of a size different from nn->fl_bufsz.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |  3 +++
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 24 ++
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index fc005c982b7d..9ab8e3967dc9 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -298,6 +298,8 @@ struct nfp_net_rx_buf {
  * @rxds:   Virtual address of FL/RX ring in host memory
  * @dma:DMA address of the FL/RX ring
  * @size:   Size, in bytes, of the FL/RX ring (needed to free)
+ * @bufsz: Buffer allocation size for convenience of management routines
+ * (NOTE: this is in second cache line, do not use on fast path!)
  */
 struct nfp_net_rx_ring {
struct nfp_net_r_vector *r_vec;
@@ -319,6 +321,7 @@ struct nfp_net_rx_ring {
 
dma_addr_t dma;
unsigned int size;
+   unsigned int bufsz;
 } cacheline_aligned;
 
 /**
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index ed23b9d348c3..03c60f755de0 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -957,25 +957,27 @@ static inline int nfp_net_rx_space(struct nfp_net_rx_ring *rx_ring)
  * nfp_net_rx_alloc_one() - Allocate and map skb for RX
  * @rx_ring:   RX ring structure of the skb
  * @dma_addr:  Pointer to storage for DMA address (output param)
+ * @fl_bufsz:  size of freelist buffers
  *
 * This function will allocate a new skb and map it for DMA.
  *
  * Return: allocated skb or NULL on failure.
  */
 static struct sk_buff *
-nfp_net_rx_alloc_one(struct nfp_net_rx_ring *rx_ring, dma_addr_t *dma_addr)
+nfp_net_rx_alloc_one(struct nfp_net_rx_ring *rx_ring, dma_addr_t *dma_addr,
+unsigned int fl_bufsz)
 {
struct nfp_net *nn = rx_ring->r_vec->nfp_net;
struct sk_buff *skb;
 
-   skb = netdev_alloc_skb(nn->netdev, nn->fl_bufsz);
+   skb = netdev_alloc_skb(nn->netdev, fl_bufsz);
if (!skb) {
nn_warn_ratelimit(nn, "Failed to alloc receive SKB\n");
return NULL;
}
 
	*dma_addr = dma_map_single(&nn->pdev->dev, skb->data,
- nn->fl_bufsz, DMA_FROM_DEVICE);
+  fl_bufsz, DMA_FROM_DEVICE);
	if (dma_mapping_error(&nn->pdev->dev, *dma_addr)) {
dev_kfree_skb_any(skb);
nn_warn_ratelimit(nn, "Failed to map DMA RX buffer\n");
@@ -1068,7 +1070,7 @@ nfp_net_rx_ring_bufs_free(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring)
continue;
 
		dma_unmap_single(&pdev->dev, rx_ring->rxbufs[i].dma_addr,
-nn->fl_bufsz, DMA_FROM_DEVICE);
+rx_ring->bufsz, DMA_FROM_DEVICE);
dev_kfree_skb_any(rx_ring->rxbufs[i].skb);
rx_ring->rxbufs[i].dma_addr = 0;
rx_ring->rxbufs[i].skb = NULL;
@@ -1090,7 +1092,8 @@ nfp_net_rx_ring_bufs_alloc(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring)
 
for (i = 0; i < rx_ring->cnt - 1; i++) {
rxbufs[i].skb =
-   nfp_net_rx_alloc_one(rx_ring, &rxbufs[i].dma_addr);
+   nfp_net_rx_alloc_one(rx_ring, &rxbufs[i].dma_addr,
+rx_ring->bufsz);
if (!rxbufs[i].skb) {
nfp_net_rx_ring_bufs_free(nn, rx_ring);
return -ENOMEM;
@@ -1278,7 +1281,8 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 
skb = rx_ring->rxbufs[idx].skb;
 
-   new_skb = nfp_net_rx_alloc_one(rx_ring, &new_dma_addr);
+   new_skb = nfp_net_rx_alloc_one(rx_ring, &new_dma_addr,
+  nn->fl_bufsz);
if (!new_skb) {
nfp_net_rx_give_one(rx_ring, rx_ring->rxbufs[idx].skb,
rx_ring->rxbufs[idx].dma_addr);
@@ -1465,10 +1469,12 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring)
 /**
  * nfp_net_rx_ring_alloc() - Allocate resource for a RX ring
  * @rx_ring:  RX ring to allocate
+ * @fl_bufsz: Size of buffers to allocate
  *
  * Return: 0 on success, negative errno otherwise.
  */
-static int nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring)
+static int
+nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring, unsigned int fl_bufsz)
 {
struct nfp_net_r_vector *r_vec = 

[PATCH v5 net-next 01/15] nfp: correct RX buffer length calculation

2016-04-07 Thread Jakub Kicinski
When calculating the RX buffer length we need to account
for up to 2 VLAN tags.  Rounding up to 1k is a relic of
a distant past and can be removed.  While at it, also remove
a trivial print statement.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 43c618bafdb6..0dae81454e77 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1911,9 +1911,6 @@ static void nfp_net_set_rx_mode(struct net_device *netdev)
 static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu)
 {
struct nfp_net *nn = netdev_priv(netdev);
-   u32 tmp;
-
-   nn_dbg(nn, "New MTU = %d\n", new_mtu);
 
if (new_mtu < 68 || new_mtu > nn->max_mtu) {
nn_err(nn, "New MTU (%d) is not valid\n", new_mtu);
@@ -1921,10 +1918,7 @@ static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu)
}
 
netdev->mtu = new_mtu;
-
-   /* Freelist buffer size rounded up to the nearest 1K */
-   tmp = new_mtu + ETH_HLEN + VLAN_HLEN + NFP_NET_MAX_PREPEND;
-   nn->fl_bufsz = roundup(tmp, 1024);
+   nn->fl_bufsz = NFP_NET_MAX_PREPEND + ETH_HLEN + VLAN_HLEN * 2 + new_mtu;
 
/* restart if running */
if (netif_running(netdev)) {
-- 
1.9.1


