date:20160420

Re: qdisc spin lock

2016-04-20 Thread Michael Ma

2016-04-20 15:34 GMT-07:00 Eric Dumazet :
> On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>> 2016-04-08 7:19 GMT-07:00 Eric Dumazet :
>> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> really have multiple interfaces...
>> >>
>> >> With this MQ solution we'll still need to assign transmit queues to
>> >> different classes by doing some math on the bandwidth limit if I
>> >> understand correctly, which seems to be less convenient compared with
>> >> a solution purely within HTB.
>> >>
>> >> I assume that with this solution I can still share qdisc among
>> >> multiple transmit queues - please let me know if this is not the case.
>> >
>> > Note that this MQ + HTB thing works well, unless you use a bonding
>> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> > tokens between the slaves)
>>
>> Actually MQ+HTB works well for small packets - like flow of 512 byte
>> packets can be throttled by HTB using one txq without being affected
>> by other flows with small packets. However I found using this solution
>> large packets (10k for example) will only achieve very limited
>> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>> using 30 threads. But sending 10k packets using 10 threads has only 10
>> Mbit/s with the same TC configuration. If I increase burst and cburst
>> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>> hit.
>>
>> The strange thing is that I don't see this problem when using HTB as
>> the root. So txq number seems to be a factor here - however it's
>> really hard to understand why would it only affect larger packets. Is
>> this a known issue? Any suggestion on how to investigate the issue
>> further? Profiling shows that the cpu utilization is pretty low.
>
> You could try
>
> perf record -a -g -e skb:kfree_skb sleep 5
> perf report
>
> So that you see where the packets are dropped.
>
> Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
> dropped at qdisc enqueue time, instead of having backpressure.
>

Thanks for the hint - how should I read the perf report? Also, we're
using TCP sockets in this test - the TCP window size is set to 70kB.

-  35.88%  init     [kernel.kallsyms]  [k] intel_idle
     intel_idle
-  15.83%  strings  libc-2.5.so        [.] __GI___connect_internal
   - __GI___connect_internal
      - 50.00% get_mapping
           __nscd_get_map_ref
        50.00% __nscd_open_socket
-  13.19%  strings  libc-2.5.so        [.] __GI___libc_recvmsg
   - __GI___libc_recvmsg
      + 64.52% getifaddrs
      + 35.48% __check_pf
-  10.55%  strings  libc-2.5.so        [.] __sendto_nocancel
   - __sendto_nocancel
        100.00% 0

Re: [PATCH 2/4] net: thunderx: Add multiqset support for dataplane apps

2016-04-20 Thread kbuild test robot

Hi,

[auto build test ERROR on net/master]
[also build test ERROR on v4.6-rc4 next-20160420]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system]

url:
https://github.com/0day-ci/linux/commits/sunil-kovvuri-gmail-com/net-thunderx-Add-multiqset-support-for-DPDK/20160419-213640
config: x86_64-randconfig-s4-04211222 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/cavium/thunder/nic_main.c: In function 'nic_get_vf_pdev':
>> drivers/net/ethernet/cavium/thunder/nic_main.c:600:12: error: 'struct pci_dev' has no member named 'physfn'
      if (vfdev->physfn != pdev)
               ^

vim +600 drivers/net/ethernet/cavium/thunder/nic_main.c

   594  pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &devid);
   595  
   596  vfdev = pci_get_device(vid, devid, NULL);
   597  for (; vfdev; vfdev = pci_get_device(vid, devid, vfdev)) {
   598  if (!vfdev->is_virtfn)
   599  continue;
 > 600  if (vfdev->physfn != pdev)
   601  continue;
   602  if (vf >= vf_en)
   603  continue;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: Binary data
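
A note for readers of the build report above. One plausible reading of a
randconfig-only failure like this, which the report itself does not confirm,
is that the member is declared only when some kernel config option is enabled,
and this random config leaves that option off. The sketch below shows the
general pattern and the usual way around it: go through an accessor that
compiles either way (the kernel's pci.h offers a helper of this shape,
pci_physfn()). Everything here is hypothetical illustration: `dev_sketch`,
`physfn_of` and `SKETCH_HAVE_IOV` are made-up names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device structure; one member exists only when the
 * (made-up) SKETCH_HAVE_IOV option is defined at build time. */
struct dev_sketch {
	int is_virtfn;
#ifdef SKETCH_HAVE_IOV
	struct dev_sketch *physfn;	/* parent function, if this is a VF */
#endif
};

/* Accessor that builds with or without the option. Direct use of
 * dev->physfn would fail to compile when the option is off. */
static struct dev_sketch *physfn_of(struct dev_sketch *dev)
{
	assert(dev != NULL);
#ifdef SKETCH_HAVE_IOV
	if (dev->is_virtfn)
		return dev->physfn;
#endif
	return dev;
}
```

With the option off, `physfn_of()` simply returns the device it was given, so
a comparison against the parent still compiles and stays meaningful.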

Re: [PATCH net 4/4] net/mlx4_en: Split SW RX dropped counter per RX ring

2016-04-20 Thread Or Gerlitz

On Thu, Apr 21, 2016 at 4:02 AM, Eric Dumazet  wrote:
> On Wed, 2016-04-20 at 18:00 +0300, Or Gerlitz wrote:

>> Just to be sure, you'd like me to re-spin this and fix the reporter name?

> Absolutely not, I believe patchwork should handle this just fine.
> Patchwork does not understand the "Fixes:" tag yet, but Reported-by: is
> fine.

OK, Eric and Florian, thanks for clarifying this.

Or.

Re: [PATCH net v2 0/3] drivers: net: cpsw: phy-handle fixes

2016-04-20 Thread David Rivshin (Allworx)

Sorry all for the noise. Gmail seems to be deciding that this outgoing 
mail is spammy, and starts blocking it part-way through. I've tried 
cutting down the CC list, but still no luck. If anyone knows how to get 
around this (while still having a reasonable patch submission), please 
let me know. 

For tonight, I guess I have no choice but to give up. I'll try again
tomorrow in hopes gmail becomes sane again.


On Wed, 20 Apr 2016 23:24:39 -0400
"David Rivshin (Allworx)"  wrote:

> From: David Rivshin 
> 
> The first patch fixes a bug that makes dual_emac mode break if
> either slave uses the phy-handle property in the devicetree.
> 
> The second patch fixes some cosmetic problems with error messages,
> and also makes the binding documentation more explicit.
> 
> The third patch cleans up the fixed-link case to work like
> the now-fixed phy-handle case.
> 
> I have tested on the following hardware configurations:
>  - (EVMSK) dual emac, phy_id property in both slaves
>  - (EVMSK) dual emac, phy-handle property in both slaves
>  - (BeagleBoneBlack) single emac, phy_id property
>  - (custom) single emac, fixed-link subnode
> 
> Nicolas Chauvet reported testing on an HP t410 (dm8148).
> 
> Markus Brunner reported testing v1 on the following [1]:
>  - emac0 with phy_id and emac1 with fixed phy
>  - emac0 with phy-handle and emac1 with fixed phy
>  - emac0 with fixed phy and emac1 with fixed phy
> 
> 
> Changes since v1 [2]:
> - Rebased
> - Added Tested-by from Nicolas Chauvet on all patches
> - Added Acked-by from Rob Herring for the binding change in patch 2 [3]
> 
> [1] http://www.spinics.net/lists/netdev/msg357890.html
> [2] http://www.spinics.net/lists/netdev/msg357772.html
> [3] http://www.spinics.net/lists/netdev/msg358254.html
> 
> David Rivshin (3):
>   drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
>   drivers: net: cpsw: fix error messages when using phy-handle DT property
>   drivers: net: cpsw: use of_phy_connect() in fixed-link case
> 
>  Documentation/devicetree/bindings/net/cpsw.txt |  4 +--
>  drivers/net/ethernet/ti/cpsw.c                 | 41 +-
>  drivers/net/ethernet/ti/cpsw.h                 |  1 +
>  3 files changed, 23 insertions(+), 23 deletions(-)
>

[PATCH net v2 0/3] drivers: net: cpsw: phy-handle fixes

2016-04-20 Thread David Rivshin (Allworx)

From: David Rivshin 

The first patch fixes a bug that makes dual_emac mode break if
either slave uses the phy-handle property in the devicetree.

The second patch fixes some cosmetic problems with error messages,
and also makes the binding documentation more explicit.

The third patch cleans up the fixed-link case to work like
the now-fixed phy-handle case.

I have tested on the following hardware configurations:
 - (EVMSK) dual emac, phy_id property in both slaves
 - (EVMSK) dual emac, phy-handle property in both slaves
 - (BeagleBoneBlack) single emac, phy_id property
 - (custom) single emac, fixed-link subnode

Nicolas Chauvet reported testing on an HP t410 (dm8148).

Markus Brunner reported testing v1 on the following [1]:
 - emac0 with phy_id and emac1 with fixed phy
 - emac0 with phy-handle and emac1 with fixed phy
 - emac0 with fixed phy and emac1 with fixed phy


Changes since v1 [2]:
- Rebased
- Added Tested-by from Nicolas Chauvet on all patches
- Added Acked-by from Rob Herring for the binding change in patch 2 [3]

[1] http://www.spinics.net/lists/netdev/msg357890.html
[2] http://www.spinics.net/lists/netdev/msg357772.html
[3] http://www.spinics.net/lists/netdev/msg358254.html

David Rivshin (3):
  drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
  drivers: net: cpsw: fix error messages when using phy-handle DT property
  drivers: net: cpsw: use of_phy_connect() in fixed-link case

 Documentation/devicetree/bindings/net/cpsw.txt |  4 +--
 drivers/net/ethernet/ti/cpsw.c                 | 41 +-
 drivers/net/ethernet/ti/cpsw.h                 |  1 +
 3 files changed, 23 insertions(+), 23 deletions(-)

-- 
2.5.5

Re: [PATCH V2] net: stmmac: socfpga: Remove re-registration of reset controller

2016-04-20 Thread Dinh Nguyen



On 04/20/2016 05:27 PM, Marek Vasut wrote:
> On 04/20/2016 11:17 PM, Dinh Nguyen wrote:
>> On 04/19/2016 07:05 PM, Marek Vasut wrote:
>>> Both socfpga_dwmac_parse_data() in dwmac-socfpga.c and stmmac_dvr_probe()
>>> in stmmac_main.c functions call devm_reset_control_get() to register an
>>> reset controller for the stmmac. This results in an attempt to register
>>> two reset controllers for the same non-shared reset line.
>>>
>>> The first attempt to register the reset controller works fine. The second
>>> attempt fails with warning from the reset controller core, see below.
>>> The warning is produced because the reset line is non-shared and thus
>>> it is allowed to have only up-to one reset controller associated with
>>> that reset line, not two or more.
>>>
>>> The solution is not great. Since the hardware needs to toggle the reset
>>> before calling stmmac_dvr_probe() to perform mandatory preconfiguration,
>>> this patch splits socfpga_dwmac_init_probe() from socfpga_dwmac_init().
>>>
>>> The socfpga_dwmac_init_probe() temporarily registers the reset controller,
>>> performs the pre-configuration and unregisters the reset controller again.
>>> This function is only called from the socfpga_dwmac_probe().
>>>
>>> The original socfpga_dwmac_init() is tweaked to use reset controller
>>> pointer from the stmmac_priv (private data of the stmmac core) instead
>>> of the local instance, which was used before.
>>>
>>> Finally, plat_dat->exit and socfpga_dwmac_exit() is no longer necessary,
>>> since the functionality is already performed by the stmmac core.
>>>
>>> [ cut here ]
>>> WARNING: CPU: 0 PID: 1 at drivers/reset/core.c:187 __of_reset_control_get+0x218/0x270
>>> Modules linked in:
>>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc4-next-20160419-00015-gabb2477-dirty #4
>>> Hardware name: Altera SOCFPGA
>>> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
>>> [] (show_stack) from [] (dump_stack+0x94/0xa8)
>>> [] (dump_stack) from [] (__warn+0xec/0x104)
>>> [] (__warn) from [] (warn_slowpath_null+0x20/0x28)
>>> [] (warn_slowpath_null) from [] (__of_reset_control_get+0x218/0x270)
>>> [] (__of_reset_control_get) from [] (__devm_reset_control_get+0x54/0x90)
>>> [] (__devm_reset_control_get) from [] (stmmac_dvr_probe+0x1b4/0x8e8)
>>> [] (stmmac_dvr_probe) from [] (socfpga_dwmac_probe+0x1b8/0x28c)
>>> [] (socfpga_dwmac_probe) from [] (platform_drv_probe+0x4c/0xb0)
>>> [] (platform_drv_probe) from [] (driver_probe_device+0x224/0x2bc)
>>> [] (driver_probe_device) from [] (__driver_attach+0xac/0xb0)
>>> [] (__driver_attach) from [] (bus_for_each_dev+0x6c/0xa0)
>>> [] (bus_for_each_dev) from [] (bus_add_driver+0x1a4/0x21c)
>>> [] (bus_add_driver) from [] (driver_register+0x78/0xf8)
>>> [] (driver_register) from [] (do_one_initcall+0x40/0x170)
>>> [] (do_one_initcall) from [] (kernel_init_freeable+0x1dc/0x27c)
>>> [] (kernel_init_freeable) from [] (kernel_init+0x8/0x114)
>>> [] (kernel_init) from [] (ret_from_fork+0x14/0x3c)
>>> ---[ end trace 059d2fbe87608fa9 ]---
>>>
>>> Signed-off-by: Marek Vasut 
>>> Cc: Matthew Gerlach 
>>> Cc: Dinh Nguyen 
>>> Cc: David S. Miller 
>>> ---
>>> V2: Add missing stmmac_rst = NULL; into socfpga_dwmac_init_probe()
>>> ---
>>>  .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c| 70 
>>> --
>>>  1 file changed, 39 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c 
>>> b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
>>> index 76d671e..5885a2e 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
>>> @@ -49,7 +49,6 @@ struct socfpga_dwmac {
>>> u32 reg_shift;
>>> struct  device *dev;
>>> struct regmap *sys_mgr_base_addr;
>>> -   struct reset_control *stmmac_rst;
>>> void __iomem *splitter_base;
>>> bool f2h_ptp_ref_clk;
>>>  };
>>> @@ -92,15 +91,6 @@ static int socfpga_dwmac_parse_data(struct socfpga_dwmac 
>>> *dwmac, struct device *
>>> struct device_node *np_splitter;
>>> struct resource res_splitter;
>>>  
>>> -   dwmac->stmmac_rst = devm_reset_control_get(dev,
>>> - STMMAC_RESOURCE_NAME);
>>> -   if (IS_ERR(dwmac->stmmac_rst)) {
>>> -   dev_info(dev, "Could not get reset control!\n");
>>> -   if (PTR_ERR(dwmac->stmmac_rst) == -EPROBE_DEFER)
>>> -   return -EPROBE_DEFER;
>>> -   dwmac->stmmac_rst = NULL;
>>> -   }
>>> -
>>> dwmac->interface = of_get_phy_mode(np);
>>>  
>>> sys_mgr_base_addr = syscon_regmap_lookup_by_phandle(np, 
>>> "altr,sysmgr-syscon");
>>> @@ -194,30 +184,23 @@ static int socfpga_dwmac_setup(struct socfpga_dwmac 
>>> *dwmac)
>>> return 0;
>>>  }
>>>  
>>> -static void

IPv6 patch mysteriously breaks IPv4 VPN

2016-04-20 Thread Valdis Kletnieks

I'll say up front - no, I do *not* have a clue why this commit causes this
problem - it makes exactly zero fsking sense.

Scenario:  $WORK is blessed with a Juniper VPN system.  I've been
seeing for a while now (since Dec-ish) an issue where at startup,
the tun0 device will get wedged.  ifconfig reports this:

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1400
        inet 172.27.1.165  netmask 255.255.255.255  destination 172.27.1.165
        inet6 fe80::6802:d95c:f3f4:2a6f  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 48 (48.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

and no more packets cross - not even a ping.

Yes, the tunnel is ipv4 only, and only ipv4 routes get set by the VPN software.

bisect results confirmed - linux-next 20160327 is bad, but 20160420 with this
one commit reverted works.

% git bisect bad  
cc9da6cc4f56e05cc9e591459fe0192727ff58b3 is the first bad commit
commit cc9da6cc4f56e05cc9e591459fe0192727ff58b3
Author: Bjørn Mork <bj...@mork.no>
Date:   Wed Dec 16 16:44:38 2015 +0100

ipv6: addrconf: use stable address generator for ARPHRD_NONE

Add a new address generator mode, using the stable address generator
with an automatically generated secret. This is intended as a default
address generator mode for device types with no EUI64 implementation.
The new generator is used for ARPHRD_NONE interfaces initially, adding
default IPv6 autoconf support to e.g. tun interfaces.

If the addrgenmode is set to 'random', either by default or manually,
and no stable secret is available, then a random secret is used as
input for the stable-privacy address generator.  The secret can be
read and modified like manually configured secrets, using the proc
interface.  Modifying the secret will change the addrgen mode to
'stable-privacy' to indicate that it operates on a known secret.

Existing behaviour of the 'stable-privacy' mode is kept unchanged. If
a known secret is available when the device is created, then the mode
will default to 'stable-privacy' as before.  The mode can be manually
set to 'random' but it will behave exactly like 'stable-privacy' in
this case. The secret will not change.

Cc: Hannes Frederic Sowa <han...@stressinduktion.org>
Cc: 吉藤英明 <hideaki.yoshif...@miraclelinux.com>
Signed-off-by: Bjørn Mork <bj...@mork.no>
Acked-by: Hannes Frederic Sowa <han...@stressinduktion.org>
Signed-off-by: David S. Miller <da...@davemloft.net>

(Sorry for the delay in reporting this - bisecting this proved to be
a bear and a half, because this problematic commit landed only about
10 commits after this one:

git bisect start
# good: [1bd4978a88ac2589f3105f599b1d404a312fb7f6] tun: honor IFF_UP in tun_get_user()

which fixed a *different* issue that prevented the tun device from getting
created at all (or it was immediately taken back down by the VPN software).
End result was that unless I gave a "known good" start point in that dozen
commit range, there'd be a month's worth of 'git bisect skip' to wade through.
I got damned lucky and found a record on one of my servers of an ssh over VPN,
and correlated it to the one day that linux-next had the above fix for the
previous issue, and wasn't broken by this current issue.)



[PATCH net] openvswitch: use flow protocol when recalculating ipv6 checksums

2016-04-20 Thread Simon Horman

When using masked actions the ipv6_proto field of an action
to set IPv6 fields may be zero rather than the prevailing protocol
which will result in skipping checksum recalculation.

This patch resolves the problem by relying on the protocol
in the flow key rather than that in the set field action.

Fixes: 83d2b9ba1abc ("net: openvswitch: Support masked set actions.")
Cc: Jarno Rajahalme 
Signed-off-by: Simon Horman 
---
* Found using tcpdump to examine the checksums of packets.
* I believe a similar fix is required for the user-space implementation
  of the datapath. I plan to look into that unless someone else wishes to.
---
 net/openvswitch/actions.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index e9dd47b2a85b..879185fe183f 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -461,7 +461,7 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
 		mask_ipv6_addr(saddr, key->ipv6_src, mask->ipv6_src, masked);
 
 		if (unlikely(memcmp(saddr, masked, sizeof(masked)))) {
-			set_ipv6_addr(skb, key->ipv6_proto, saddr, masked,
+			set_ipv6_addr(skb, flow_key->ip.proto, saddr, masked,
 				      true);
 			memcpy(&flow_key->ipv6.addr.src, masked,
 			       sizeof(flow_key->ipv6.addr.src));
@@ -483,7 +483,7 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
 							     NULL, &flags)
 					       != NEXTHDR_ROUTING);
 
-			set_ipv6_addr(skb, key->ipv6_proto, daddr, masked,
+			set_ipv6_addr(skb, flow_key->ip.proto, daddr, masked,
 				      recalc_csum);
 			memcpy(&flow_key->ipv6.addr.dst, masked,
 			       sizeof(flow_key->ipv6.addr.dst));
-- 
2.7.0.rc3.207.g0ac5344
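
For readers who want the failure mode of this patch spelled out, here is a
minimal stand-alone model of it. It is a sketch under stated assumptions, not
the datapath code: `set_action_key`, `flow_key_sketch`, `l4_csum_recalculated`,
`recalc_before` and `recalc_after` are hypothetical names, and the rule "only
TCP, UDP and ICMPv6 get their checksum recalculated" is a simplification. The
point it illustrates is the one in the commit message: a masked set action can
carry protocol 0, which matches none of those cases, while the flow key holds
the protocol actually parsed from the packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IANA protocol numbers of the L4 protocols considered in this model. */
enum { PROTO_TCP = 6, PROTO_UDP = 17, PROTO_ICMPV6 = 58 };

/* Hypothetical stand-ins for the set-action key and the flow key. */
struct set_action_key  { uint8_t ipv6_proto; };	/* may be 0 when masked out */
struct flow_key_sketch { uint8_t ip_proto;   };	/* parsed from the packet */

/* Simplified rule: only these protocols have their checksum recalculated
 * when an IPv6 address is rewritten. */
static bool l4_csum_recalculated(uint8_t l4_proto)
{
	switch (l4_proto) {
	case PROTO_TCP:
	case PROTO_UDP:
	case PROTO_ICMPV6:
		return true;
	default:
		return false;
	}
}

/* Before the patch: the protocol comes from the set action itself. */
static bool recalc_before(const struct set_action_key *key)
{
	assert(key != 0);
	return l4_csum_recalculated(key->ipv6_proto);
}

/* After the patch: the protocol comes from the flow key. */
static bool recalc_after(const struct flow_key_sketch *fk)
{
	assert(fk != 0);
	return l4_csum_recalculated(fk->ip_proto);
}
```

For a TCP packet hit by a masked action whose protocol field is 0,
`recalc_before()` reports that no recalculation happens, while
`recalc_after()` reports that it does.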

[PATCH net] Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets

2016-04-20 Thread Shrikrishna Khare

For IPv6, if the device indicates that the checksum is correct, set
CHECKSUM_UNNECESSARY.

Reported-by: Subbarao Narahari 
Signed-off-by: Shrikrishna Khare 
Signed-off-by: Jin Heo 
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 12 
 drivers/net/vmxnet3/vmxnet3_int.h |  4 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index fc895d0..4a67e4f 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1152,12 +1152,16 @@ vmxnet3_rx_csum(struct vmxnet3_adapter *adapter,
 		union Vmxnet3_GenericDesc *gdesc)
 {
 	if (!gdesc->rcd.cnc && adapter->netdev->features & NETIF_F_RXCSUM) {
-		/* typical case: TCP/UDP over IP and both csums are correct */
-		if ((le32_to_cpu(gdesc->dword[3]) & VMXNET3_RCD_CSUM_OK) ==
-							VMXNET3_RCD_CSUM_OK) {
+		if (gdesc->rcd.v4 &&
+		    (le32_to_cpu(gdesc->dword[3]) &
+		     VMXNET3_RCD_CSUM_OK) == VMXNET3_RCD_CSUM_OK) {
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+			BUG_ON(!(gdesc->rcd.tcp || gdesc->rcd.udp));
+			BUG_ON(gdesc->rcd.frg);
+		} else if (gdesc->rcd.v6 && (le32_to_cpu(gdesc->dword[3]) &
+					     (1 << VMXNET3_RCD_TUC_SHIFT))) {
 			skb->ip_summed = CHECKSUM_UNNECESSARY;
 			BUG_ON(!(gdesc->rcd.tcp || gdesc->rcd.udp));
-			BUG_ON(!(gdesc->rcd.v4  || gdesc->rcd.v6));
 			BUG_ON(gdesc->rcd.frg);
 		} else {
 			if (gdesc->rcd.csum) {
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 729c344..c482539 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -69,10 +69,10 @@
 /*
  * Version numbers
  */
-#define VMXNET3_DRIVER_VERSION_STRING   "1.4.6.0-k"
+#define VMXNET3_DRIVER_VERSION_STRING   "1.4.7.0-k"
 
 /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
-#define VMXNET3_DRIVER_VERSION_NUM  0x01040600
+#define VMXNET3_DRIVER_VERSION_NUM  0x01040700
 
 #if defined(CONFIG_PCI_MSI)
 	/* RSS only makes sense if MSI-X is supported. */
-- 
1.9.1
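
The decision the patch above encodes can be read as a small truth table, and
the sketch below restates it outside the driver. IPv6 has no header checksum,
so it seems natural that only the TCP/UDP bit can be required for IPv6, while
IPv4 still needs both bits. The names `rx_desc_sketch`, `csum_unnecessary` and
the `SKETCH_*` constants are hypothetical, and the bit positions are chosen
for illustration only; the real values live in the driver's headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions, NOT the driver's real ones. */
#define SKETCH_TUC_BIT (1u << 0)	/* TCP/UDP checksum verified by device */
#define SKETCH_IPC_BIT (1u << 1)	/* IP header checksum verified by device */
#define SKETCH_CSUM_OK (SKETCH_TUC_BIT | SKETCH_IPC_BIT)

/* Hypothetical, flattened view of the fields the patch looks at. */
struct rx_desc_sketch {
	bool v4;
	bool v6;
	uint32_t status;
};

/* True when the stack may skip software checksum verification. */
static bool csum_unnecessary(const struct rx_desc_sketch *d)
{
	assert(!(d->v4 && d->v6));	/* a packet is one or the other */

	if (d->v4)			/* IPv4: both checksums must be OK */
		return (d->status & SKETCH_CSUM_OK) == SKETCH_CSUM_OK;
	if (d->v6)			/* IPv6: only the L4 checksum exists */
		return (d->status & SKETCH_TUC_BIT) != 0;
	return false;
}
```

Under the old single test against both bits, an IPv6 packet with only the
TCP/UDP bit set would have fallen through to the slower path; in this model
it is accepted.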

Re: [PATCH net 4/4] net/mlx4_en: Split SW RX dropped counter per RX ring

2016-04-20 Thread Eric Dumazet

On Wed, 2016-04-20 at 18:00 +0300, Or Gerlitz wrote:

> Just to be sure, you'd like me to re-spin this and fix the reporter name?

Absolutely not, I believe patchwork should handle this just fine.

Patchwork does not understand the "Fixes:" tag yet, but Reported-by: is
fine.

linux-next: zillions of lockdep whinges in include/net/sock.h:1408

2016-04-20 Thread Valdis Kletnieks

linux-next 20160420 is whining at an incredible rate - in 20 minutes of
uptime, I piled up some 41,000 hits from all over the place (cleaned up
to skip the CPU and PID so the list isn't quite so long):

% grep include/net/sock.h /var/log/messages | cut -f5- -d: | sed -e 's/PID: [0-9]* /PID: (elided) /' -e 's/CPU: [0-3]/CPU: +/' | sort | uniq -c | sort -nr
  13468  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_v6_rcv+0xc20/0xcb0
   9770  CPU: + PID: (elided) at include/net/sock.h:1408 udp_queue_rcv_skb+0x3ca/0x6d0
   7706  CPU: + PID: (elided) at include/net/sock.h:1408 sock_owned_by_user+0x91/0xa0
   2818  CPU: + PID: (elided) at include/net/sock.h:1408 udpv6_queue_rcv_skb+0x3b6/0x6d0
   1981  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_write_timer+0xf2/0x110
   1954  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_delack_timer+0x110/0x130
   1912  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_keepalive_timer+0x136/0x2c0
    882  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_close+0x226/0x4f0
    804  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_tasklet_func+0x192/0x1e0
     28  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_child_process+0x17a/0x350
      2  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_v6_err+0x401/0x660
      2  CPU: + PID: (elided) at include/net/sock.h:1408 tcp_v6_err+0x1fd/0x660

Seems to be from this commit, which is apparently over-stringent or
isn't handling some case correctly:

commit fafc4e1ea1a4c1eb13a30c9426fb799f5efacbc3
Author: Hannes Frederic Sowa <han...@stressinduktion.org>
Date:   Fri Apr 8 15:11:27 2016 +0200

sock: tigthen lockdep checks for sock_owned_by_user

sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.

Cc: linux-c...@vger.kernel.org
Cc: linux-blueto...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <han...@stressinduktion.org>
Signed-off-by: David S. Miller <da...@davemloft.net>




Re: qdisc spin lock

2016-04-20 Thread Eric Dumazet

On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
> 2016-04-08 7:19 GMT-07:00 Eric Dumazet :
> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
> >> I didn't really know that multiple qdiscs can be isolated using MQ so
> >> that each txq can be associated with a particular qdisc. Also we don't
> >> really have multiple interfaces...
> >>
> >> With this MQ solution we'll still need to assign transmit queues to
> >> different classes by doing some math on the bandwidth limit if I
> >> understand correctly, which seems to be less convenient compared with
> >> a solution purely within HTB.
> >>
> >> I assume that with this solution I can still share qdisc among
> >> multiple transmit queues - please let me know if this is not the case.
> >
> > Note that this MQ + HTB thing works well, unless you use a bonding
> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> > tokens between the slaves)
> 
> Actually MQ+HTB works well for small packets - like flow of 512 byte
> packets can be throttled by HTB using one txq without being affected
> by other flows with small packets. However I found using this solution
> large packets (10k for example) will only achieve very limited
> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
> using 30 threads. But sending 10k packets using 10 threads has only 10
> Mbit/s with the same TC configuration. If I increase burst and cburst
> of HTB to some extreme large value (like 50MB) the ceiling rate can be
> hit.
> 
> The strange thing is that I don't see this problem when using HTB as
> the root. So txq number seems to be a factor here - however it's
> really hard to understand why would it only affect larger packets. Is
> this a known issue? Any suggestion on how to investigate the issue
> further? Profiling shows that the cpu utilization is pretty low.

You could try 

perf record -a -g -e skb:kfree_skb sleep 5
perf report

So that you see where the packets are dropped.

Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
dropped at qdisc enqueue time, instead of having backpressure.
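
For anyone who wants to check the SO_SNDBUF side of this hint in a test
program, a minimal sketch follows, assuming a Linux host. Only `socket()`,
`setsockopt()`, `getsockopt()`, `close()` and `SO_SNDBUF` are real API here;
the helper names `get_sndbuf`, `set_sndbuf` and `probe_udp_sndbuf` are made up
for this illustration.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the current SO_SNDBUF of fd in bytes, or -1 on error. */
static int get_sndbuf(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) < 0)
		return -1;
	return val;
}

/* Ask for a send buffer of 'bytes' and return the size the kernel
 * reports afterwards, or -1 on error. */
static int set_sndbuf(int fd, int bytes)
{
	assert(bytes > 0);
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	return get_sndbuf(fd);
}

/* Open a UDP socket, apply the requested send buffer, report the
 * effective size, and close the socket again. */
static int probe_udp_sndbuf(int bytes)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int effective;

	if (fd < 0)
		return -1;
	effective = set_sndbuf(fd, bytes);
	close(fd);
	return effective;
}
```

The value read back is usually not the value requested: per socket(7), Linux
doubles the request to leave room for bookkeeping overhead and caps it at
net.core.wmem_max, so it is worth printing the effective size rather than
trusting the configured one.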

Re: [PATCH V2] net: stmmac: socfpga: Remove re-registration of reset controller

2016-04-20 Thread Marek Vasut

On 04/20/2016 11:17 PM, Dinh Nguyen wrote:
> On 04/19/2016 07:05 PM, Marek Vasut wrote:
>> Both socfpga_dwmac_parse_data() in dwmac-socfpga.c and stmmac_dvr_probe()
>> in stmmac_main.c functions call devm_reset_control_get() to register an
>> reset controller for the stmmac. This results in an attempt to register
>> two reset controllers for the same non-shared reset line.
>>
>> The first attempt to register the reset controller works fine. The second
>> attempt fails with warning from the reset controller core, see below.
>> The warning is produced because the reset line is non-shared and thus
>> it is allowed to have only up-to one reset controller associated with
>> that reset line, not two or more.
>>
>> The solution is not great. Since the hardware needs to toggle the reset
>> before calling stmmac_dvr_probe() to perform mandatory preconfiguration,
>> this patch splits socfpga_dwmac_init_probe() from socfpga_dwmac_init().
>>
>> The socfpga_dwmac_init_probe() temporarily registers the reset controller,
>> performs the pre-configuration and unregisters the reset controller again.
>> This function is only called from the socfpga_dwmac_probe().
>>
>> The original socfpga_dwmac_init() is tweaked to use reset controller
>> pointer from the stmmac_priv (private data of the stmmac core) instead
>> of the local instance, which was used before.
>>
>> Finally, plat_dat->exit and socfpga_dwmac_exit() is no longer necessary,
>> since the functionality is already performed by the stmmac core.
>>
>> [ cut here ]
>> WARNING: CPU: 0 PID: 1 at drivers/reset/core.c:187 __of_reset_control_get+0x218/0x270
>> Modules linked in:
>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc4-next-20160419-00015-gabb2477-dirty #4
>> Hardware name: Altera SOCFPGA
>> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
>> [] (show_stack) from [] (dump_stack+0x94/0xa8)
>> [] (dump_stack) from [] (__warn+0xec/0x104)
>> [] (__warn) from [] (warn_slowpath_null+0x20/0x28)
>> [] (warn_slowpath_null) from [] (__of_reset_control_get+0x218/0x270)
>> [] (__of_reset_control_get) from [] (__devm_reset_control_get+0x54/0x90)
>> [] (__devm_reset_control_get) from [] (stmmac_dvr_probe+0x1b4/0x8e8)
>> [] (stmmac_dvr_probe) from [] (socfpga_dwmac_probe+0x1b8/0x28c)
>> [] (socfpga_dwmac_probe) from [] (platform_drv_probe+0x4c/0xb0)
>> [] (platform_drv_probe) from [] (driver_probe_device+0x224/0x2bc)
>> [] (driver_probe_device) from [] (__driver_attach+0xac/0xb0)
>> [] (__driver_attach) from [] (bus_for_each_dev+0x6c/0xa0)
>> [] (bus_for_each_dev) from [] (bus_add_driver+0x1a4/0x21c)
>> [] (bus_add_driver) from [] (driver_register+0x78/0xf8)
>> [] (driver_register) from [] (do_one_initcall+0x40/0x170)
>> [] (do_one_initcall) from [] (kernel_init_freeable+0x1dc/0x27c)
>> [] (kernel_init_freeable) from [] (kernel_init+0x8/0x114)
>> [] (kernel_init) from [] (ret_from_fork+0x14/0x3c)
>> ---[ end trace 059d2fbe87608fa9 ]---
>>
>> Signed-off-by: Marek Vasut 
>> Cc: Matthew Gerlach 
>> Cc: Dinh Nguyen 
>> Cc: David S. Miller 
>> ---
>> V2: Add missing stmmac_rst = NULL; into socfpga_dwmac_init_probe()
>> ---
>>  .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c| 70 
>> --
>>  1 file changed, 39 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c 
>> b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
>> index 76d671e..5885a2e 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
>> @@ -49,7 +49,6 @@ struct socfpga_dwmac {
>>  u32 reg_shift;
>>  struct  device *dev;
>>  struct regmap *sys_mgr_base_addr;
>> -struct reset_control *stmmac_rst;
>>  void __iomem *splitter_base;
>>  bool f2h_ptp_ref_clk;
>>  };
>> @@ -92,15 +91,6 @@ static int socfpga_dwmac_parse_data(struct socfpga_dwmac 
>> *dwmac, struct device *
>>  struct device_node *np_splitter;
>>  struct resource res_splitter;
>>  
>> -dwmac->stmmac_rst = devm_reset_control_get(dev,
>> -  STMMAC_RESOURCE_NAME);
>> -if (IS_ERR(dwmac->stmmac_rst)) {
>> -dev_info(dev, "Could not get reset control!\n");
>> -if (PTR_ERR(dwmac->stmmac_rst) == -EPROBE_DEFER)
>> -return -EPROBE_DEFER;
>> -dwmac->stmmac_rst = NULL;
>> -}
>> -
>>  dwmac->interface = of_get_phy_mode(np);
>>  
>>  sys_mgr_base_addr = syscon_regmap_lookup_by_phandle(np, 
>> "altr,sysmgr-syscon");
>> @@ -194,30 +184,23 @@ static int socfpga_dwmac_setup(struct socfpga_dwmac 
>> *dwmac)
>>  return 0;
>>  }
>>  
>> -static void socfpga_dwmac_exit(struct platform_device *pdev, void *priv)
>> -{
>> -struct socfpga_dwmac*dwmac = priv;
>> -
>> -/* On socfpga platform exit, assert

[PATCH net] atl2: Disable unimplemented scatter/gather feature

2016-04-20 Thread Ben Hutchings

atl2 includes NETIF_F_SG in hw_features even though it has no support
for non-linear skbs.  This bug was originally harmless since the
driver does not claim to implement checksum offload and that used to
be a requirement for SG.

Now that SG and checksum offload are independent features, if you
explicitly enable SG *and* use one of the rare protocols that can use
SG without checksum offload, this potentially leaks sensitive
information (before you notice that it just isn't working).  Therefore
this obscure bug has been designated CVE-2016-2117.

Reported-by: Justin Yackoski 
Signed-off-by: Ben Hutchings 
Fixes: ec5f06156423 ("net: Kill link between CSUM and SG features.")
---
 drivers/net/ethernet/atheros/atlx/atl2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/atheros/atlx/atl2.c 
b/drivers/net/ethernet/atheros/atlx/atl2.c
index 8f76f4558a88..2ff465848b65 100644
--- a/drivers/net/ethernet/atheros/atlx/atl2.c
+++ b/drivers/net/ethernet/atheros/atlx/atl2.c
@@ -1412,7 +1412,7 @@ static int atl2_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 
err = -EIO;
 
-   netdev->hw_features = NETIF_F_SG | NETIF_F_HW_VLAN_CTAG_RX;
+   netdev->hw_features = NETIF_F_HW_VLAN_CTAG_RX;
netdev->features |= (NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX);
 
/* Init PHY as early as possible due to power saving issue  */



Re: [PATCH V2] net: stmmac: socfpga: Remove re-registration of reset controller

2016-04-20 Thread Dinh Nguyen

On 04/19/2016 07:05 PM, Marek Vasut wrote:
> Both socfpga_dwmac_parse_data() in dwmac-socfpga.c and stmmac_dvr_probe()
> in stmmac_main.c functions call devm_reset_control_get() to register a
> reset controller for the stmmac. This results in an attempt to register
> two reset controllers for the same non-shared reset line.
> 
> The first attempt to register the reset controller works fine. The second
> attempt fails with warning from the reset controller core, see below.
> The warning is produced because the reset line is non-shared and thus
> it is allowed to have only up-to one reset controller associated with
> that reset line, not two or more.
> 
> The solution is not great. Since the hardware needs to toggle the reset
> before calling stmmac_dvr_probe() to perform mandatory preconfiguration,
> this patch splits socfpga_dwmac_init_probe() from socfpga_dwmac_init().
> 
> The socfpga_dwmac_init_probe() temporarily registers the reset controller,
> performs the pre-configuration and unregisters the reset controller again.
> This function is only called from the socfpga_dwmac_probe().
> 
> The original socfpga_dwmac_init() is tweaked to use reset controller
> pointer from the stmmac_priv (private data of the stmmac core) instead
> of the local instance, which was used before.
> 
> Finally, plat_dat->exit and socfpga_dwmac_exit() are no longer necessary,
> since the functionality is already performed by the stmmac core.
> 
> [ cut here ]
> WARNING: CPU: 0 PID: 1 at drivers/reset/core.c:187 
> __of_reset_control_get+0x218/0x270
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.6.0-rc4-next-20160419-00015-gabb2477-dirty #4
> Hardware name: Altera SOCFPGA
> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> [] (show_stack) from [] (dump_stack+0x94/0xa8)
> [] (dump_stack) from [] (__warn+0xec/0x104)
> [] (__warn) from [] (warn_slowpath_null+0x20/0x28)
> [] (warn_slowpath_null) from [] 
> (__of_reset_control_get+0x218/0x270)
> [] (__of_reset_control_get) from [] 
> (__devm_reset_control_get+0x54/0x90)
> [] (__devm_reset_control_get) from [] 
> (stmmac_dvr_probe+0x1b4/0x8e8)
> [] (stmmac_dvr_probe) from [] 
> (socfpga_dwmac_probe+0x1b8/0x28c)
> [] (socfpga_dwmac_probe) from [] 
> (platform_drv_probe+0x4c/0xb0)
> [] (platform_drv_probe) from [] 
> (driver_probe_device+0x224/0x2bc)
> [] (driver_probe_device) from [] 
> (__driver_attach+0xac/0xb0)
> [] (__driver_attach) from [] (bus_for_each_dev+0x6c/0xa0)
> [] (bus_for_each_dev) from [] (bus_add_driver+0x1a4/0x21c)
> [] (bus_add_driver) from [] (driver_register+0x78/0xf8)
> [] (driver_register) from [] (do_one_initcall+0x40/0x170)
> [] (do_one_initcall) from [] 
> (kernel_init_freeable+0x1dc/0x27c)
> [] (kernel_init_freeable) from [] (kernel_init+0x8/0x114)
> [] (kernel_init) from [] (ret_from_fork+0x14/0x3c)
> ---[ end trace 059d2fbe87608fa9 ]---
> 
> Signed-off-by: Marek Vasut 
> Cc: Matthew Gerlach 
> Cc: Dinh Nguyen 
> Cc: David S. Miller 
> ---
> V2: Add missing stmmac_rst = NULL; into socfpga_dwmac_init_probe()
> ---
>  .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c| 70 
> --
>  1 file changed, 39 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c 
> b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
> index 76d671e..5885a2e 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
> @@ -49,7 +49,6 @@ struct socfpga_dwmac {
>   u32 reg_shift;
>   struct  device *dev;
>   struct regmap *sys_mgr_base_addr;
> - struct reset_control *stmmac_rst;
>   void __iomem *splitter_base;
>   bool f2h_ptp_ref_clk;
>  };
> @@ -92,15 +91,6 @@ static int socfpga_dwmac_parse_data(struct socfpga_dwmac 
> *dwmac, struct device *
>   struct device_node *np_splitter;
>   struct resource res_splitter;
>  
> - dwmac->stmmac_rst = devm_reset_control_get(dev,
> -   STMMAC_RESOURCE_NAME);
> - if (IS_ERR(dwmac->stmmac_rst)) {
> - dev_info(dev, "Could not get reset control!\n");
> - if (PTR_ERR(dwmac->stmmac_rst) == -EPROBE_DEFER)
> - return -EPROBE_DEFER;
> - dwmac->stmmac_rst = NULL;
> - }
> -
>   dwmac->interface = of_get_phy_mode(np);
>  
>   sys_mgr_base_addr = syscon_regmap_lookup_by_phandle(np, 
> "altr,sysmgr-syscon");
> @@ -194,30 +184,23 @@ static int socfpga_dwmac_setup(struct socfpga_dwmac 
> *dwmac)
>   return 0;
>  }
>  
> -static void socfpga_dwmac_exit(struct platform_device *pdev, void *priv)
> -{
> - struct socfpga_dwmac*dwmac = priv;
> -
> - /* On socfpga platform exit, assert and hold reset to the
> -  * enet controller - the default state after a hard reset.
> -  */
> - if (dwmac->stmmac_rst)
> -

[PATCH net-next] macvlan: fix failure during registration v2

2016-04-20 Thread Francesco Ruggeri

If macvlan_common_newlink fails in register_netdevice after macvlan_init
then it decrements port->count twice, first in macvlan_uninit (from
register_netdevice or rollback_registered) and then again in
macvlan_common_newlink.
A similar problem may exist in the ipvlan driver.
This patch consolidates modifications to port->count into macvlan_init
and macvlan_uninit (thanks to Eric Biederman for suggesting this approach).
In macvtap_device_event it also avoids cleaning up in NETDEV_UNREGISTER
if NETDEV_REGISTER had previously failed.

Signed-off-by: Francesco Ruggeri 
---
 drivers/net/macvlan.c | 10 --
 drivers/net/macvtap.c |  2 ++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 2bcf1f3..cb01023 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -795,6 +795,7 @@ static int macvlan_init(struct net_device *dev)
 {
struct macvlan_dev *vlan = netdev_priv(dev);
const struct net_device *lowerdev = vlan->lowerdev;
+   struct macvlan_port *port = vlan->port;
 
dev->state  = (dev->state & ~MACVLAN_STATE_MASK) |
  (lowerdev->state & MACVLAN_STATE_MASK);
@@ -812,6 +813,8 @@ static int macvlan_init(struct net_device *dev)
if (!vlan->pcpu_stats)
return -ENOMEM;
 
+   port->count += 1;
+
return 0;
 }
 
@@ -1312,10 +1315,9 @@ int macvlan_common_newlink(struct net *src_net, struct 
net_device *dev,
return err;
}
 
-   port->count += 1;
err = register_netdevice(dev);
if (err < 0)
-   goto destroy_port;
+   return err;
 
dev->priv_flags |= IFF_MACVLAN;
err = netdev_upper_dev_link(lowerdev, dev);
@@ -1330,10 +1332,6 @@ int macvlan_common_newlink(struct net *src_net, struct 
net_device *dev,
 
 unregister_netdev:
unregister_netdevice(dev);
-destroy_port:
-   port->count -= 1;
-   if (!port->count)
-   macvlan_port_destroy(lowerdev);
 
return err;
 }
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 95394ed..e770221 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1303,6 +1303,8 @@ static int macvtap_device_event(struct notifier_block 
*unused,
}
break;
case NETDEV_UNREGISTER:
+   if (vlan->minor == 0)
+   break;
devt = MKDEV(MAJOR(macvtap_major), vlan->minor);
device_destroy(macvtap_class, devt);
macvtap_free_minor(vlan);
-- 
1.8.1.4
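
The accounting rule described in the commit message can be tried outside the
kernel. What follows is a minimal user-space sketch, not the driver code: the
sketch_* names are hypothetical stand-ins for macvlan_port, macvlan_init,
macvlan_uninit, register_netdevice and macvlan_common_newlink, and it assumes
(as the description says) that a registration failure after init already runs
uninit once as part of the rollback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct macvlan_port: only the user count. */
struct sketch_port {
	int count;
	bool destroyed;
};

/* Mirrors the patched macvlan_init(): the only place count goes up. */
static int sketch_init(struct sketch_port *port)
{
	port->count += 1;
	return 0;
}

/* Mirrors macvlan_uninit(): the only place count goes down. */
static void sketch_uninit(struct sketch_port *port)
{
	assert(port->count > 0);
	port->count -= 1;
	if (!port->count)
		port->destroyed = true;
}

/*
 * Stand-in for register_netdevice(): runs init and, if a later step fails,
 * rolls back by calling uninit itself (the assumption stated above).
 */
static int sketch_register(struct sketch_port *port, bool fail_after_init)
{
	int err = sketch_init(port);

	if (err)
		return err;
	if (fail_after_init) {
		sketch_uninit(port);
		return -1;
	}
	return 0;
}

/* Mirrors the patched newlink path: on failure, just return the error. */
static int sketch_newlink(struct sketch_port *port, bool fail_after_init)
{
	return sketch_register(port, fail_after_init);
}
```

With one device already registered on the port, a failed second
sketch_newlink() leaves the count at 1. An extra decrement in the caller's
error path, as in the pre-patch code, would have taken it to 0 and destroyed
a port that is still in use.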

Re: [Intel-gfx] [PATCH 4/4] drm/i915: Move ioremap_wc tracking onto VMA

2016-04-20 Thread Luis R. Rodriguez

On Wed, Apr 20, 2016 at 01:17:30PM +0200, Daniel Vetter wrote:
> On Wed, Apr 20, 2016 at 11:10:54AM +0200, Luis R. Rodriguez wrote:
> > Reason I ask is since I noticed a while ago a lot of drivers
> > were using info->fix.smem_start and info->fix.smem_len consistently
> > for their ioremap'd areas it might make sense instead to let the
> > internal framebuffer (register_framebuffer()) optionally manage the
> > ioremap_wc() for drivers, given that this is pretty generic stuff.
> 
> All that legacy fbdev stuff is just for legacy support, and I prefer to
> have that as dumb as possible. There's been some discussion even around
> lifting the "kick out firmware fb driver" out of fbdev, since we'd need it
> to have a simple drm driver for e.g. uefi.
> 
> But I definitely don't want a legacy horror show like fbdev to
> automagically take care of device mappings for drivers.

Makes sense. It still raises the question of whether more modern APIs
could manage the ioremap for you. Evidence shows people get
sloppy, and if things were done internally with helpers it may
be easier to make adjustments later.

  Luis

Re: qdisc spin lock

2016-04-20 Thread Michael Ma

2016-04-08 7:19 GMT-07:00 Eric Dumazet :
> On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Note that this MQ + HTB thing works well, unless you use a bonding
> device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> tokens between the slaves)

Actually MQ+HTB works well for small packets - a flow of 512-byte
packets can be throttled by HTB using one txq without being affected
by other flows with small packets. However, I found that with this solution
large packets (10k for example) only achieve very limited
bandwidth. In my test I used MQ to assign one txq to an HTB which sets
the rate at 1Gbit/s; 512-byte packets can achieve the ceiling rate by
using 30 threads. But sending 10k packets using 10 threads achieves only 10
Mbit/s with the same TC configuration. If I increase burst and cburst
of HTB to some extremely large value (like 50MB) the ceiling rate can be
hit.

The strange thing is that I don't see this problem when using HTB as
the root. So the txq number seems to be a factor here - however it's
really hard to understand why it would only affect larger packets. Is
this a known issue? Any suggestion on how to investigate the issue
further? Profiling shows that the cpu utilization is pretty low.

>
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>
> bonding can not really be used as a true MQ device yet.
>
> I might send a patch to disable this 'bonding feature' if no slave sets
> a queue_id.
>
>

Re: [PATCH 02/19] io-mapping: Specify mapping size for io_mapping_map_wc()

2016-04-20 Thread Luis R. Rodriguez

On Wed, Apr 20, 2016 at 08:14:32PM +0100, Chris Wilson wrote:
> On Wed, Apr 20, 2016 at 08:58:44PM +0200, Luis R. Rodriguez wrote:
> > On Wed, Apr 20, 2016 at 07:42:13PM +0100, Chris Wilson wrote:
> > > The ioremap() hidden behind the io_mapping_map_wc() convenience helper
> > > can be used for remapping multiple pages. Extend the helper so that
> > > future callers can use it for larger ranges.
> > > 
> > > Signed-off-by: Chris Wilson 
> > > Cc: Tvrtko Ursulin 
> > > Cc: Daniel Vetter 
> > > Cc: Jani Nikula 
> > > Cc: David Airlie 
> > > Cc: Yishai Hadas 
> > > Cc: Dan Williams 
> > > Cc: Ingo Molnar 
> > > Cc: "Peter Zijlstra (Intel)" 
> > > Cc: David Hildenbrand 
> > > Cc: Luis R. Rodriguez 
> > > Cc: intel-...@lists.freedesktop.org
> > > Cc: dri-de...@lists.freedesktop.org
> > > Cc: netdev@vger.kernel.org
> > > Cc: linux-r...@vger.kernel.org
> > > Cc: linux-ker...@vger.kernel.org
> > 
> > We have 2 callers today, in the future, can you envision
> > this API getting more options? If so, in order to avoid the
> > pain of collateral evolutions I can suggest a descriptor
> > being passed with the required settings / options. This lets
> > you evolve the API without needing to go in and modify
> > old users. If you choose not to that's fine too, just
> > figured I'd chime in with that as I've seen the pain
> > with other APIs, and I'm putting an end to the needless
> > set of collateral evolutions this way.
> 
> Do you have a good example in mind? I've one more patch to try and take
> advantage of the io-mapping (that may or not be such a good idea in
> practice) but I may as well see if I can make io_mapping more useful
> when I do.

Sure, here's my current version of the revamp of the firmware API
to a more flexible API, which lets us compartmentalize the
usermode helper, and through the new API avoids the issues with further
future collateral evolutions. It is still being baked; I'm fine-tuning
the SmPL so folks can do the conversion automatically if they want:

https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux.git/log/?h=20160417-sysdata-api-v1

It also has a test driver (which I'd also recommend if you can pull it off).
It would be kind of hard to do something like a lib/io-mapping_test.c
given there is no real device to ioremap -- _but_ perhaps regular
RAM can be used to fake a device's MMIO. I am not sure if it's even
possible... but if so it would not only be useful for something
like your API but also for testing ioremap() and friends, and
any possible aliasing bombs we may want to vet for. It also hints at
how we may in the future be able to automatically write test drivers
for APIs through inference, but that needs a lot more love
to make it tangible.

  Luis
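
For readers unfamiliar with the descriptor idea suggested in the quoted text,
here is a rough user-space sketch. It is not the io_mapping API and not the
sysdata API either: struct sketch_map_desc, SKETCH_PAGE_SIZE and
sketch_map_len() are hypothetical names made up for illustration. The point
is only that options travel in a struct, so a new option becomes a new field
and existing callers do not need to be edited.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical descriptor: callers that use designated initializers leave
 * any field they do not know about at zero.
 */
struct sketch_map_desc {
	unsigned long offset;	/* start of the range to map */
	unsigned long size;	/* length of the range; 0 means one page */
	unsigned int flags;	/* room for later options */
};

#define SKETCH_PAGE_SIZE 4096UL

/* Returns the number of bytes that would be mapped, or 0 for a bad request. */
static unsigned long sketch_map_len(const struct sketch_map_desc *desc)
{
	if (desc == NULL)
		return 0;
	return desc->size ? desc->size : SKETCH_PAGE_SIZE;
}
```

An old caller that only sets .offset keeps its single-page behaviour, while a
newer caller can set .size as well, and neither call site changes when
another field is added later.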

[RESEND] Re: updating carl9170-1.fw in linux-firmware.git

2016-04-20 Thread Christian Lamparter

On Wednesday, April 20, 2016 10:59:44 AM Kalle Valo wrote:
> Christian Lamparter  writes:
> 
> > On Monday, April 18, 2016 07:42:05 PM Kalle Valo wrote:
> >> Christian Lamparter  writes:
> >> 
> >> > On Monday, April 18, 2016 06:45:09 PM Kalle Valo wrote:
> >> >
> >> >> Why even mention anything about a "special firmware" as the firmware is
> >> >> already available from linux-firmware.git? 
> >> >
> >> > Yes and no. 1.9.6 is in linux-firmware.git. I've tried to add 1.9.9 too
> >> > but that failed.
> >> > 
> >> 
> >> Rick's comment makes sense to me, better just to provide the latest
> >> version. No need to unnecessary confuse the users. And if someone really
> >> wants to use an older version that she can retrieve it from the git
> >> history.
> >
> > Part of the fun here is that firmware is GPLv2. The linux-firmware.git has
> > to point to or add the firmware source to their tree. They have added every
> > single source file to it instead of "packaging" it in a tar.bz2/gz/xz
> > like you normally do for release sources.
> >
> > If you want to read more about it:
> > 
> 
> Yeah, that's more work. I get that. But I'm still not understanding
> what's the actual problem which prevents us from updating carl9170
> firmware in linux-firmware.
I'm not sure, but why not ask? I've added the cc'ed Linux Firmware
Maintainers. So for those people reading the fw list:

What would it take to update the carl9170-1.fw firmware file in your
repository to the latest version?

Who has to send the firmware update? Does it have to be the person who
sent the first request (Xose)? The maintainer of the firmware (me)?
Someone from Qualcomm Atheros? Or someone else (specific)? (The
firmware is licensed as GPLv2 - in theory anyone should be able to
do that.)

How should the firmware source update be handled? Currently the latest
.tar.xz of the firmware has ~130kb. The formatted patches from 1.9.6 to
latest are about ~100kb (182 individual patches).

How does linux-firmware handle new binary firmware images and new 
sources? What if carl9170fw-2.bin is added. Do we need another
source directory for this in the current tree then? Because 
carl9170fw-1.bin will still be needed for backwards compatibility
so we basically need to duplicate parts of the source?

Also, how's the situation with ath9k_htc? The 1.4.0 image contains
some GPLv2 code as well? So, why is there no source in the tree, but 
just the link to it? Because, I would like to do basically the same
for carl9170fw and just add a link to the carl9170fw repository and
save everyone this source update "song and dance".

Regards,
Christian

Re: [PATCH net-next v6] rtnetlink: add new RTM_GETSTATS message to dump link stats

2016-04-20 Thread Elad Raz


> On 20 Apr 2016, at 6:43 PM, Roopa Prabhu  wrote:
> 
> From: Roopa Prabhu 
> 
> This patch adds a new RTM_GETSTATS message to query link stats via netlink
> from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
> returns a lot more than just stats and is expensive in some cases when
> frequent polling for stats from userspace is a common operation.
> 
> RTM_GETSTATS is an attempt to provide a light weight netlink message
> to explicitly query only link stats from the kernel on an interface.
> The idea is to also keep it extensible so that new kinds of stats can be
> added to it in the future.
> 
> This patch adds the following attribute for NETDEV stats:
> struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
>[IFLA_STATS_LINK_64]  = { .len = sizeof(struct rtnl_link_stats64) },
> };
> 
> Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
> a single interface or all interfaces with NLM_F_DUMP.
> 
> Future possible new types of stat attributes:
> link af stats:
>- IFLA_STATS_LINK_IPV6  (nested. for ipv6 stats)
>- IFLA_STATS_LINK_MPLS  (nested. for mpls/mdev stats)
> extended stats:
>- IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like 
> bridge,
>  vlan, vxlan etc)
>- IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are
>  available via ethtool today)

I think that it’s better to have an IFLA_STATS_LINK_CPU_ONLY attribute. The
default stat should be the aggregation of HW-only packets and packets that got
trapped to the CPU together.

> 
> This patch also declares a filter mask for all stat attributes.
> User has to provide a mask of stats attributes to query. filter mask
> can be specified in the new hdr 'struct if_stats_msg' for stats messages.
> Other important field in the header is the ifindex.
> 
> This api can also include attributes for global stats (eg tcp) in the future.
> When global stats are included in a stats msg, the ifindex in the header
> must be zero. A single stats message cannot contain both global and
> netdev specific stats. To easily distinguish them, netdev specific stat
> attributes name are prefixed with IFLA_STATS_LINK_
> 
> Without any attributes in the filter_mask, no stats will be returned.
> 
> This patch has been tested with modified iproute2 ifstat.
> 
> Suggested-by: Jamal Hadi Salim 
> Signed-off-by: Roopa Prabhu 

Nice work! Thank you Roopa!
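
As an illustration of the filter mask described in the quoted patch text, here
is a small user-space sketch. It does not use the kernel headers: the
sketch_* types are hypothetical, the numeric attribute values are assumptions,
and the "attribute N maps to bit N - 1" encoding is assumed for the example
rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Attribute ids. The patch description names IFLA_STATS_LINK_64 and lists
 * an extended-stats attribute as a possible future addition; the numeric
 * values here are assumptions for the sketch.
 */
enum sketch_stats_attr {
	SKETCH_STATS_UNSPEC,
	SKETCH_STATS_LINK_64,
	SKETCH_STATS_LINK_EXTENDED,
};

/* Hypothetical header carrying the two fields the description names. */
struct sketch_stats_hdr {
	uint32_t ifindex;	/* 0 would mean "global stats" */
	uint32_t filter_mask;	/* one bit per requested attribute */
};

/* Assumed encoding: attribute N maps to bit N - 1. */
static uint32_t sketch_filter_bit(enum sketch_stats_attr attr)
{
	return attr == SKETCH_STATS_UNSPEC ? 0 : UINT32_C(1) << (attr - 1);
}

/* True if the request asked for this attribute. */
static bool sketch_wants(const struct sketch_stats_hdr *hdr,
			 enum sketch_stats_attr attr)
{
	return (hdr->filter_mask & sketch_filter_bit(attr)) != 0;
}
```

A request with an empty filter_mask matches no attribute, which lines up with
the statement that no stats are returned in that case.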

[net-next resubmit PATCH v2 1/3] netdev_features: Fold NETIF_F_ALL_TSO into NETIF_F_GSO_SOFTWARE

2016-04-20 Thread Alexander Duyck

This patch folds NETIF_F_ALL_TSO into the bitmask for NETIF_F_GSO_SOFTWARE.
The idea is to avoid duplication of defines since the only difference
between the two was the GSO_UDP bit.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdev_features.h |8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 15eb0b12fff9..bc8736266749 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -152,11 +152,6 @@ enum {
 #define NETIF_F_GSO_MASK   (__NETIF_F_BIT(NETIF_F_GSO_LAST + 1) - \
__NETIF_F_BIT(NETIF_F_GSO_SHIFT))
 
-/* List of features with software fallbacks. */
-#define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
-NETIF_F_TSO_MANGLEID | \
-NETIF_F_TSO6 | NETIF_F_UFO)
-
 /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
  * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
  * this would be contradictory
@@ -170,6 +165,9 @@ enum {
 #define NETIF_F_ALL_FCOE   (NETIF_F_FCOE_CRC | NETIF_F_FCOE_MTU | \
 NETIF_F_FSO)
 
+/* List of features with software fallbacks. */
+#define NETIF_F_GSO_SOFTWARE   (NETIF_F_ALL_TSO | NETIF_F_UFO)
+
 /*
  * If one device supports one of these features, then enable them
  * for all in netdev_increment_features.

[net-next resubmit PATCH v2 2/3] veth: Update features to include all tunnel GSO types

2016-04-20 Thread Alexander Duyck

This patch adds support for the checksum enabled versions of UDP and GRE
tunnels.  With this change we should be able to send and receive GSO frames
of these types over the veth pair without needing to segment the packets.

Signed-off-by: Alexander Duyck 
---
 drivers/net/veth.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4f30a6ae50d0..f37a6e61d4ad 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -312,10 +312,9 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_set_rx_headroom= veth_set_rx_headroom,
 };
 
-#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |\
-  NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
-  NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL |   \
-  NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT | NETIF_F_UFO |   \
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
+  NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+  NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL | \
   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
   NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )

[net-next resubmit PATCH v2 3/3] net: Add support for IP ID mangling TSO in cases that require encapsulation

2016-04-20 Thread Alexander Duyck

This patch adds support for NETIF_F_TSO_MANGLEID if a given tunnel supports
NETIF_F_TSO.  This way if needed a device can then later enable the TSO
with IP ID mangling and the tunnels on top of that device can then also
make use of the IP ID mangling as well.

Signed-off-by: Alexander Duyck 
---
 net/core/dev.c |   11 +++
 1 file changed, 11 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 52d446b2cb99..6324bc9267f7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7029,8 +7029,19 @@ int register_netdevice(struct net_device *dev)
if (!(dev->flags & IFF_LOOPBACK))
dev->hw_features |= NETIF_F_NOCACHE_COPY;
 
+   /* If IPv4 TCP segmentation offload is supported we should also
+* allow the device to enable segmenting the frame with the option
+* of ignoring a static IP ID value.  This doesn't enable the
+* feature itself but allows the user to enable it later.
+*/
if (dev->hw_features & NETIF_F_TSO)
dev->hw_features |= NETIF_F_TSO_MANGLEID;
+   if (dev->vlan_features & NETIF_F_TSO)
+   dev->vlan_features |= NETIF_F_TSO_MANGLEID;
+   if (dev->mpls_features & NETIF_F_TSO)
+   dev->mpls_features |= NETIF_F_TSO_MANGLEID;
+   if (dev->hw_enc_features & NETIF_F_TSO)
+   dev->hw_enc_features |= NETIF_F_TSO_MANGLEID;
 
/* Make NETIF_F_HIGHDMA inheritable to VLAN devices.
 */
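
The rule this hunk applies can be sketched in user space as follows. This is
an illustration only: the SKETCH_F_* bit positions are assumptions and do not
match the kernel's NETIF_F_* values, and struct sketch_dev is a hypothetical
stand-in that carries just the four feature sets touched by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions are assumptions for the sketch, not the kernel's values. */
#define SKETCH_F_TSO		(UINT64_C(1) << 0)
#define SKETCH_F_TSO_MANGLEID	(UINT64_C(1) << 1)

/* Hypothetical device with the four feature sets the patch touches. */
struct sketch_dev {
	uint64_t hw_features;
	uint64_t vlan_features;
	uint64_t mpls_features;
	uint64_t hw_enc_features;
};

/* If a feature set offers TSO, also offer TSO with IP ID mangling. */
static void sketch_allow_mangleid(uint64_t *features)
{
	if (*features & SKETCH_F_TSO)
		*features |= SKETCH_F_TSO_MANGLEID;
}

/* Applies the rule to each feature set, as the hunk does at registration. */
static void sketch_register(struct sketch_dev *dev)
{
	sketch_allow_mangleid(&dev->hw_features);
	sketch_allow_mangleid(&dev->vlan_features);
	sketch_allow_mangleid(&dev->mpls_features);
	sketch_allow_mangleid(&dev->hw_enc_features);
}
```

A feature set without TSO is left untouched, so the mangling option is only
offered where TSO itself is offered.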

[net-next resubmit PATCH v2 0/3] Feature tweaks/fixes follow-up to GSO partial patches

2016-04-20 Thread Alexander Duyck

This patch series is a set of minor fix-ups and tweaks following the GSO
partial and TSO with IPv4 ID mangling patches.  It mostly is just meant to
make certain that if we have GSO partial support at the device we can make
use of it from the far end of the tunnel.

I submitted this earlier today but it was set as RFC in patchwork.  This is
a submission for net-next and not an RFC so I am resubmitting.

v2: Added cover page which was forgotten with first submission.
Added patch that enables TSOv4 IP ID mangling w/ tunnels and/or VLANs.

---

Alexander Duyck (3):
  netdev_features: Fold NETIF_F_ALL_TSO into NETIF_F_GSO_SOFTWARE
  veth: Update features to include all tunnel GSO types
  net: Add support for IP ID mangling TSO in cases that require 
encapsulation


 drivers/net/veth.c  |7 +++
 include/linux/netdev_features.h |8 +++-
 net/core/dev.c  |   11 +++
 3 files changed, 17 insertions(+), 9 deletions(-)

--

Re: drop all fragments inside tx queue if one gets dropped

2016-04-20 Thread Rick Jones

For the "everything old is new again" files, back in the 1990s, it was 
noticed that on the likes of a netperf UDP_STREAM test on HP-UX, with 
fragmentation taking place, it was possible to consume 100% of the link 
bandwidth and have 0% effective throughput because the transmit queue 
was kept full with IP datagram fragments which could not possibly be 
reassembled (*) because one or more of the fragments of a datagram were 
dropped because the transmit queue was full.


HP-UX implemented "packet trains" where all the fragments of a 
fragmented datagram were presented to the driver, which then either 
queued them all, or none of them.


I don't recall seeing similar poor behaviour in Linux; I would have 
assumed that the intra-stack flow-control "took care" of it.  Perhaps 
there is something specific to wpan which precludes that?


happy benchmarking,

rick jones
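
The all-or-nothing behaviour described above can be sketched in a few lines of
C. This is an illustration, not HP-UX or Linux code: struct sketch_txq and
sketch_enqueue_train() are hypothetical names, and the queue only tracks how
many slots are in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-capacity transmit queue; only occupancy is tracked. */
struct sketch_txq {
	size_t capacity;
	size_t used;
};

/* Queue every fragment of one datagram, or none of them. */
static bool sketch_enqueue_train(struct sketch_txq *q, size_t nr_frags)
{
	if (nr_frags > q->capacity - q->used)
		return false;	/* not enough room: drop the whole datagram */
	q->used += nr_frags;
	return true;
}
```

With 2 free slots, a 3-fragment datagram is refused outright instead of
leaving 2 fragments in the queue that could never be reassembled.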

[RFC 0/3] net: dsa: cross-chip operations

2016-04-20 Thread Vivien Didelot

This patchset aims to start a thread on cross-chip operations in DSA, no need
to spend time on reviewing the details of the code (especially for mv88e6xxx).

So when several switch chips are interconnected, we need to configure them all
to ensure correct hardware switching. We can think about this case:

  sw0 sw1 sw2
[ 0 1 2 3 4 5 ] [ 0 1 2 3 4 5 ] [ 0 1 2 3 4 5 ]
  |   ' ^ ^ ^ ^ '
  v   ' | | | | '
 CPU  ' `-DSA-' `-DSA-' '
  ' '
  + - - - - - - - br0 - - - - - - - +

Here sw1 needs to be aware of br0, to configure itself with MAC addresses,
VIDs, or whatever to ensure hardware frame bridging between sw0 and sw2.

Two cross-chip unbridged ports (e.g. sw0p3 and sw1p1) of mv88e6xxx-supported
devices can currently talk to each other, because the chips are configured to
allow frames to ingress from any external ports. This is not what we want, and
this patchset fixes that. The only important part for the thread is 1/3 though.

Some Marvell switches have a cross-chip port based VLAN table used to allow or
not external frames to egress its internal ports. So a new switch-level
operation needs to be added in order to inform the other switches that a port
joined or left a bridge group. This is what dsa_slave_broadcast_bridge() does.

But this is not enough. When a port joins a bridge group, its switch driver
needs to learn the existing cross-chip members, so that ingressing frames from
them can be allowed. This is what dsa_tree_broadcast_bridge() does.

But that is ugly. This adds yet another DSA function, and makes the DSA layer
code quite complex. Also, similar notifications need to be implemented to
configure cross-chip VLANs (for VLAN filtering aware systems where br0 is
implemented with a 802.1Q VLAN), FDB additions/deletions so that frames get
switched correctly by the hardware, etc.

Actually the DSA drivers functions are just switchdev ops with a bit of
syntactic sugar, but no real value added. The purpose of the DSA layer is to
scale the switchdev ops "horizontally" to every tree port. To avoid numerous
operations and keep it simple for drivers, I think we need 2 things:

  1) The scope of DSA switch driver ops should be the DSA tree, not the switch.
  This means having each dsa_switch_driver implement functions such as:

  int (*port_bridge_join)(struct dsa_switch *ds, int sw_index, int sw_port,
   struct net_device *bridge);

  instead of the current:

  int (*port_bridge_join)(struct dsa_switch *ds, int port,
   struct net_device *bridge);

  That way drivers can configure their in-chip or cross-chip settings and return 0,
  or -EOPNOTSUPP if ds->index != sw_index. This replaces dsa_slave_broadcast_bridge.

  2) To replace dsa_tree_broadcast_bridge, drivers need to access public info
  in the tree, such as bridge membership of every port. That can be achieved
  with a bit of refactoring like the following:

  /* include/net/dsa.h */
  struct dsa_port {
  struct list_headlist;
  struct dsa_switch   *ds;
  int port;
  struct net_device   *bridge_dev;
  }

  struct dsa_switch_tree {
  ...
  struct list_head ports;
  };

  /* net/dsa/dsa_priv.h */
  struct dsa_slave_priv {
  ...
  struct dsa_port dp;
  };

  Then DSA switch drivers can implement tree-level ops such as:

  int (*port_bridge_join)(struct dsa_switch *ds, struct dsa_port *dp,
   struct net_device *bridge);

I'm working on an RFC for the above. Let me know what you think and if this
seems correct to you.

Cheers,

Vivien Didelot (3):
  net: dsa: add cross-chip notification for bridge
  net: dsa: mv88e6xxx: initialize PVT
  net: dsa: mv88e6xxx: setup PVT

 drivers/net/dsa/mv88e6352.c |   1 +
 drivers/net/dsa/mv88e6xxx.c | 181 ++--
 drivers/net/dsa/mv88e6xxx.h |   7 ++
 include/net/dsa.h   |   6 ++
 net/dsa/slave.c |  60 ++-
 5 files changed, 246 insertions(+), 9 deletions(-)

-- 
2.8.0
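
To make proposal 1) above a little more concrete, here is a user-space sketch
of a tree-scoped port_bridge_join. It is an illustration under assumptions,
not the planned RFC: all sketch_* names are hypothetical, a counter stands in
for programming a cross-chip table, and the has_cross_chip_table flag is
invented to model a chip that cannot handle cross-chip events.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PORTS 6

/* Hypothetical stand-in for the bridge net_device. */
struct sketch_bridge {
	int id;
};

/* Hypothetical stand-in for struct dsa_switch. */
struct sketch_switch {
	int index;
	bool has_cross_chip_table;
	const struct sketch_bridge *member[SKETCH_MAX_PORTS];
	int cross_chip_updates;	/* stands in for writing a cross-chip table */
};

/* Tree-scoped op: called on every switch for every join in the tree. */
static int sketch_port_bridge_join(struct sketch_switch *ds, int sw_index,
				   int sw_port,
				   const struct sketch_bridge *bridge)
{
	if (sw_port < 0 || sw_port >= SKETCH_MAX_PORTS)
		return -EINVAL;

	/* The joining port is one of ours: in-chip configuration. */
	if (ds->index == sw_index) {
		ds->member[sw_port] = bridge;
		return 0;
	}

	/* The port lives on another chip: cross-chip configuration. */
	if (!ds->has_cross_chip_table)
		return -EOPNOTSUPP;

	ds->cross_chip_updates++;
	return 0;
}

/* Core-side loop: offer the event to each switch, tolerate -EOPNOTSUPP. */
static int sketch_tree_bridge_join(struct sketch_switch *tree, size_t nr,
				   int sw_index, int sw_port,
				   const struct sketch_bridge *bridge)
{
	for (size_t i = 0; i < nr; i++) {
		int err = sketch_port_bridge_join(&tree[i], sw_index,
						  sw_port, bridge);

		if (err && err != -EOPNOTSUPP)
			return err;
	}
	return 0;
}
```

In this form the core needs a single loop, and each driver decides from
ds->index versus sw_index whether the event is local or cross-chip.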

[RFC 1/3] net: dsa: add cross-chip notification for bridge

2016-04-20 Thread Vivien Didelot

When multiple switch chips are chained together, one needs to know about
the bridge membership of others. For instance, switches like Marvell
6352 have cross-chip port-based VLAN table to allow or forbid cross-chip
frames to egress.

Add a cross_chip_bridge DSA driver function, used to notify a switch
about bridge membership configured in other chips.

Signed-off-by: Vivien Didelot 
---
 include/net/dsa.h |  6 ++
 net/dsa/slave.c   | 60 +++
 2 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index c4bc42b..1994fa7 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -340,6 +340,12 @@ struct dsa_switch_driver {
int (*port_fdb_dump)(struct dsa_switch *ds, int port,
 struct switchdev_obj_port_fdb *fdb,
 int (*cb)(struct switchdev_obj *obj));
+
+   /*
+* Cross-chip notifications
+*/
+   void(*cross_chip_bridge)(struct dsa_switch *ds, int sw_index,
+int sw_port, struct net_device *bridge);
 };
 
 void register_switch_driver(struct dsa_switch_driver *type);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 3b6750f..bd8f4e2 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -431,19 +431,68 @@ static int dsa_slave_port_obj_dump(struct net_device *dev,
return err;
 }
 
+static void dsa_slave_broadcast_bridge(struct net_device *dev)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+   int chip;
+
+   for (chip = 0; chip < ds->dst->pd->nr_chips; ++chip) {
+   struct dsa_switch *sw = ds->dst->ds[chip];
+
+   if (sw->index == ds->index)
+   continue;
+
+   if (sw->drv->cross_chip_bridge)
+   sw->drv->cross_chip_bridge(sw, ds->index, p->port,
+  p->bridge_dev);
+   }
+}
+
+static void dsa_tree_broadcast_bridge(struct dsa_switch_tree *dst,
+ struct net_device *bridge)
+{
+   struct net_device *dev;
+   struct dsa_slave_priv *p;
+   struct dsa_switch *ds;
+   int chip, port;
+
+   for (chip = 0; chip < dst->pd->nr_chips; ++chip) {
+   ds = dst->ds[chip];
+
+   for (port = 0; port < DSA_MAX_PORTS; ++port) {
+   if (!ds->ports[port])
+   continue;
+
+   dev = ds->ports[port];
+   p = netdev_priv(dev);
+
+   if (p->bridge_dev == bridge)
+   dsa_slave_broadcast_bridge(dev);
+   }
+   }
+}
+
 static int dsa_slave_bridge_port_join(struct net_device *dev,
  struct net_device *br)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
struct dsa_switch *ds = p->parent;
-   int ret = -EOPNOTSUPP;
+   int err;
 
p->bridge_dev = br;
 
-   if (ds->drv->port_bridge_join)
-   ret = ds->drv->port_bridge_join(ds, p->port, br);
+   /* In-chip hardware bridging */
+   if (ds->drv->port_bridge_join) {
+   err = ds->drv->port_bridge_join(ds, p->port, br);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+   }
+
+   /* Broadcast bridge membership across chips */
+   dsa_tree_broadcast_bridge(ds->dst, br);
 
-   return ret == -EOPNOTSUPP ? 0 : ret;
+   return 0;
 }
 
 static void dsa_slave_bridge_port_leave(struct net_device *dev)
@@ -462,6 +511,9 @@ static void dsa_slave_bridge_port_leave(struct net_device *dev)
 */
if (ds->drv->port_stp_state_set)
ds->drv->port_stp_state_set(ds, p->port, BR_STATE_FORWARDING);
+
+   /* Notify the port leaving to other chips */
+   dsa_slave_broadcast_bridge(dev);
 }
 
 static int dsa_slave_port_attr_get(struct net_device *dev,
-- 
2.8.0

[RFC 2/3] net: dsa: mv88e6xxx: initialize PVT

2016-04-20 Thread Vivien Didelot

Expand the Cross-chip Port Based VLAN Table initialization code, and make
sure the "5 Bit Port" bit is cleared.

This commit doesn't make any functional change to the current code.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 48 -
 drivers/net/dsa/mv88e6xxx.h |  5 +
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 1dd525d..e35bc9f 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2203,6 +2203,47 @@ unlock:
return err;
 }
 
+static int _mv88e6xxx_pvt_wait(struct dsa_switch *ds)
+{
+   return _mv88e6xxx_wait(ds, REG_GLOBAL2, GLOBAL2_PVT_ADDR,
+  GLOBAL2_PVT_ADDR_BUSY);
+}
+
+static int _mv88e6xxx_pvt_cmd(struct dsa_switch *ds, int src_dev, int src_port,
+ u16 op)
+{
+   u16 reg = op;
+   int err;
+
+   /* 9-bit Cross-chip PVT pointer: with GLOBAL2_MISC_5_BIT_PORT cleared,
+* source device is 5-bit, source port is 4-bit.
+*/
+   reg |= (src_dev & 0x1f) << 4;
+   reg |= (src_port & 0xf);
+
+   err = _mv88e6xxx_reg_write(ds, REG_GLOBAL2, GLOBAL2_PVT_ADDR, reg);
+   if (err)
+   return err;
+
+   return _mv88e6xxx_pvt_wait(ds);
+}
+
+static int _mv88e6xxx_pvt_init(struct dsa_switch *ds)
+{
+   int err;
+
+   /* Clear 5 Bit Port for usage with Marvell Link Street devices:
+* use 4 bits for the Src_Port/Src_Trunk and 5 bits for the Src_Dev.
+*/
+   err = _mv88e6xxx_reg_write(ds, REG_GLOBAL2, GLOBAL2_MISC,
+  0 & ~GLOBAL2_MISC_5_BIT_PORT);
+   if (err)
+   return err;
+
+   /* Allow any external frame to egress any internal port */
+   return _mv88e6xxx_pvt_cmd(ds, 0, 0, GLOBAL2_PVT_ADDR_OP_INIT_ONES);
+}
+
 int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
   struct net_device *bridge)
 {
@@ -2747,11 +2788,8 @@ int mv88e6xxx_setup_global(struct dsa_switch *ds)
if (err)
goto unlock;
 
-   /* Initialise cross-chip port VLAN table to reset
-* defaults.
-*/
-   err = _mv88e6xxx_reg_write(ds, REG_GLOBAL2,
-  GLOBAL2_PVT_ADDR, 0x9000);
+   /* Initialize Cross-chip Port VLAN Table (PVT) */
+   err = _mv88e6xxx_pvt_init(ds);
if (err)
goto unlock;
 
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index 0dbe2d1..dd63377 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -298,6 +298,10 @@
 #define GLOBAL2_INGRESS_OP 0x09
 #define GLOBAL2_INGRESS_DATA   0x0a
 #define GLOBAL2_PVT_ADDR   0x0b
+#define GLOBAL2_PVT_ADDR_BUSY  BIT(15)
+#define GLOBAL2_PVT_ADDR_OP_INIT_ONES  ((0x01 << 12) | GLOBAL2_PVT_ADDR_BUSY)
+#define GLOBAL2_PVT_ADDR_OP_WRITE_PVLAN        ((0x03 << 12) | GLOBAL2_PVT_ADDR_BUSY)
+#define GLOBAL2_PVT_ADDR_OP_READ   ((0x04 << 12) | GLOBAL2_PVT_ADDR_BUSY)
 #define GLOBAL2_PVT_DATA   0x0c
 #define GLOBAL2_SWITCH_MAC 0x0d
 #define GLOBAL2_SWITCH_MAC_BUSY BIT(15)
@@ -335,6 +339,7 @@
 #define GLOBAL2_WDOG_CONTROL   0x1b
 #define GLOBAL2_QOS_WEIGHT 0x1c
 #define GLOBAL2_MISC   0x1d
+#define GLOBAL2_MISC_5_BIT_PORT        BIT(14)
 
 #define MV88E6XXX_N_FID                4096
 
-- 
2.8.0
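
As a quick cross-check of the register layout described in this patch, here is a standalone sketch (not driver code) that composes the value written to GLOBAL2_PVT_ADDR. The helper name pvt_addr() and the shortened macro names are assumptions made for illustration; the bit positions are copied from the mv88e6xxx.h hunk above. With src_dev = 0 and src_port = 0, the INIT_ONES operation should give 0x9000, the literal the old code wrote, which is consistent with the "no functional change" note for that register write.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout mirrored from the mv88e6xxx.h hunk in this patch. */
#define PVT_ADDR_BUSY            (1u << 15)
#define PVT_ADDR_OP_INIT_ONES    ((0x01u << 12) | PVT_ADDR_BUSY)
#define PVT_ADDR_OP_WRITE_PVLAN  ((0x03u << 12) | PVT_ADDR_BUSY)
#define PVT_ADDR_OP_READ         ((0x04u << 12) | PVT_ADDR_BUSY)

/* Hypothetical helper: build the 16-bit PVT address/command word.
 * With the "5 Bit Port" bit cleared, the pointer is 9 bits wide:
 * a 5-bit source device followed by a 4-bit source port.
 */
static uint16_t pvt_addr(uint16_t op, int src_dev, int src_port)
{
	uint16_t reg = op;

	reg |= (uint16_t)((src_dev & 0x1f) << 4);  /* bits 8:4, source device */
	reg |= (uint16_t)(src_port & 0xf);         /* bits 3:0, source port */

	return reg;
}
```

Out-of-range device or port numbers are silently masked by this encoding, which is why the callers in the patch bound their loops at 32 devices and 16 ports.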

[RFC 3/3] net: dsa: mv88e6xxx: setup PVT

2016-04-20 Thread Vivien Didelot

Instead of allowing any external frame to egress any internal port,
configure the Cross-chip Port VLAN Table (PVT) to forbid that.

When an external source port joins or leaves a bridge crossing this
switch, mask it in the PVT to allow or forbid frames to egress.

Add support for the cross-chip bridge notification to the 6352 family.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6352.c |   1 +
 drivers/net/dsa/mv88e6xxx.c | 137 +++-
 drivers/net/dsa/mv88e6xxx.h |   2 +
 3 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6352.c b/drivers/net/dsa/mv88e6352.c
index 4afc24d..03ab309 100644
--- a/drivers/net/dsa/mv88e6352.c
+++ b/drivers/net/dsa/mv88e6352.c
@@ -364,6 +364,7 @@ struct dsa_switch_driver mv88e6352_switch_driver = {
.port_fdb_add   = mv88e6xxx_port_fdb_add,
.port_fdb_del   = mv88e6xxx_port_fdb_del,
.port_fdb_dump  = mv88e6xxx_port_fdb_dump,
+   .cross_chip_bridge  = mv88e6xxx_cross_chip_bridge,
 };
 
 MODULE_ALIAS("platform:mv88e6172");
diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index e35bc9f..dccefdb 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -481,6 +481,14 @@ static bool mv88e6xxx_has_stu(struct dsa_switch *ds)
return false;
 }
 
+static bool mv88e6xxx_has_pvt(struct dsa_switch *ds)
+{
+   if (mv88e6xxx_6185_family(ds))
+   return false;
+
+   return true;
+}
+
 /* We expect the switch to perform auto negotiation if there is a real
  * phy. However, in the case of a fixed link phy, we force the port
  * settings from the fixed link settings.
@@ -2228,8 +2236,69 @@ static int _mv88e6xxx_pvt_cmd(struct dsa_switch *ds, int src_dev, int src_port,
return _mv88e6xxx_pvt_wait(ds);
 }
 
+static int _mv88e6xxx_pvt_read(struct dsa_switch *ds, int src_dev, int src_port,
+  u16 *data)
+{
+   int ret;
+
+   ret = _mv88e6xxx_pvt_wait(ds);
+   if (ret < 0)
+   return ret;
+
+   ret = _mv88e6xxx_pvt_cmd(ds, src_dev, src_port,
+   GLOBAL2_PVT_ADDR_OP_READ);
+   if (ret < 0)
+   return ret;
+
+   ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL2, GLOBAL2_PVT_DATA);
+   if (ret < 0)
+   return ret;
+
+   *data = ret;
+
+   return 0;
+}
+
+static int _mv88e6xxx_pvt_write(struct dsa_switch *ds, int src_dev,
+   int src_port, u16 data)
+{
+   int err;
+
+   err = _mv88e6xxx_pvt_wait(ds);
+   if (err)
+   return err;
+
+   err = _mv88e6xxx_reg_write(ds, REG_GLOBAL2, GLOBAL2_PVT_DATA, data);
+   if (err)
+   return err;
+
+   return _mv88e6xxx_pvt_cmd(ds, src_dev, src_port,
+   GLOBAL2_PVT_ADDR_OP_WRITE_PVLAN);
+}
+
+static int _mv88e6xxx_pvt_map(struct dsa_switch *ds, int src_dev, int src_port,
+ struct net_device *bridge)
+{
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   u16 pvlan = 0;
+   int port;
+
+   for (port = 0; port < ps->info->num_ports; ++port) {
+   /* Frames from external ports can egress DSA and CPU ports */
+   if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
+   pvlan |= BIT(port);
+
+   /* Frames can egress bridge group members */
+   if (bridge && ps->ports[port].bridge_dev == bridge)
+   pvlan |= BIT(port);
+   }
+
+   return _mv88e6xxx_pvt_write(ds, src_dev, src_port, pvlan);
+}
+
 static int _mv88e6xxx_pvt_init(struct dsa_switch *ds)
 {
+   int src_dev, src_port;
int err;
 
/* Clear 5 Bit Port for usage with Marvell Link Street devices:
@@ -2240,8 +2309,21 @@ static int _mv88e6xxx_pvt_init(struct dsa_switch *ds)
if (err)
return err;
 
-   /* Allow any external frame to egress any internal port */
-   return _mv88e6xxx_pvt_cmd(ds, 0, 0, GLOBAL2_PVT_ADDR_OP_INIT_ONES);
+   /* Forbid every port of potential neighbor switches to egress frames on
+* the normal ports of this switch.
+*/
+   for (src_dev = 0; src_dev < 32; ++src_dev) {
+   if (src_dev == ds->index)
+   continue;
+
+   for (src_port = 0; src_port < 16; ++src_port) {
+   err = _mv88e6xxx_pvt_map(ds, src_dev, src_port, NULL);
+   if (err)
+   return err;
+   }
+   }
+
+   return 0;
 }
 
 int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
@@ -2286,6 +2368,35 @@ unlock:
return err;
 }
 
+static int _mv88e6xxx_pvt_unmap_local(struct dsa_switch *ds, int port)
+{
+   u16 pvlan;
+   int src_dev, src_port, err;
+
+   for (src_dev = 0; src_dev < 32;

Re: [PATCH net-next v6] rtnetlink: add new RTM_GETSTATS message to dump link stats

2016-04-20 Thread Roopa Prabhu

On 4/20/16, 1:08 PM, David Miller wrote:
> From: Roopa Prabhu 
> Date: Wed, 20 Apr 2016 08:43:43 -0700
>
>> This patch adds a new RTM_GETSTATS message to query link stats via netlink
>> from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
>> returns a lot more than just stats and is expensive in some cases when
>> frequent polling for stats from userspace is a common operation.
> With nla_align_64bit() now working properly, I've applied this and it works
> on sparc64 too.
>
> Thanks!
Thank you.

Re: [PATCH iproute2 WIP] ifstat: use new RTM_GETSTATS api

2016-04-20 Thread Roopa Prabhu

On 4/20/16, 11:53 AM, Stephen Hemminger wrote:
> On Wed, 20 Apr 2016 09:16:15 -0700
> Roopa Prabhu  wrote:
>
>> +int rtnl_wilddump_stats_req_filter(struct rtnl_handle *rth, int family, int type,
>> +   __u32 filt_mask)
>> +{
>> +struct {
>> +struct nlmsghdr nlh;
>> +struct if_stats_msg ifsm;
>> +} req;
> Please use C99 initialization instead of memset in new code.

yes, ack.
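
For reference, a rough sketch of what the C99-initializer form could look like. This is only an illustration under stated assumptions: struct stats_msg_sketch is a hypothetical stand-in for struct if_stats_msg (only the two fields used in the quoted code are modelled), and build_stats_req() is a made-up helper, not the iproute2 function.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/netlink.h>

/* Hypothetical stand-in for struct if_stats_msg; only the fields
 * touched by the quoted code are modelled here.
 */
struct stats_msg_sketch {
	uint8_t  family;
	uint32_t filter_mask;
};

struct stats_req_sketch {
	struct nlmsghdr nlh;
	struct stats_msg_sketch ifsm;
};

/* Build the dump request with designated initializers instead of
 * memset() plus field-by-field assignment. Members that are not named
 * (nlmsg_pid here) are zero-initialized by the language rules.
 */
static struct stats_req_sketch build_stats_req(int type, int family,
					       uint32_t filt_mask, uint32_t seq)
{
	struct stats_req_sketch req = {
		.nlh.nlmsg_len   = sizeof(req),
		.nlh.nlmsg_type  = (uint16_t)type,
		.nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST,
		.nlh.nlmsg_seq   = seq,
		.ifsm.family      = (uint8_t)family,
		.ifsm.filter_mask = filt_mask,
	};

	return req;
}
```

The caller would then pass the address of the returned struct straight to send(), which also covers the "why not just return send(...)" remark further down.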
>
>> +int err;
>> +
>> +memset(&req, 0, sizeof(req));
>> +req.nlh.nlmsg_len = sizeof(req);
>> +req.nlh.nlmsg_type = type;
>> +req.nlh.nlmsg_flags = NLM_F_DUMP|NLM_F_REQUEST;
>> +req.nlh.nlmsg_pid = 0;
>> +req.nlh.nlmsg_seq = rth->dump = ++rth->seq;
>> +req.ifsm.family = family;
>> +req.ifsm.filter_mask = filt_mask;
>> +
>> +err = send(rth->fd, (void*)&req, sizeof(req), 0);
>> +
>> +return err;
> Why not just:
> return send(rth->fd, &req, sizeof(req), 0);

yes, i had that initially. and then changed it to add some debugs before 
returning.

this is all WIP. will clean it up.

thanks.

Re: drop all fragments inside tx queue if one gets dropped

2016-04-20 Thread Michael Richardson


{adding some more comments from the -wpan side of things}

Alexander Aring  wrote:
> On linux-wpan we had a discussion about setting the right tx_queue_len
> and came to some issues in 802.15.4 6LoWPAN networks.

...

> And then a lot of fragments laying inside the tx_queue and waits to
> transfer to the transceiver which has only one framebuffer to transmit
> one frame and waits for tx completion to transfer the next one.

> My question is, if qdisc drops some fragment because the queue is full
> or something else. Exists there some way to remove all fragments inside
> the queue? If one fragment will be dropped and all related are still
> inside the queue then we send mostly garbage.

The big concern is that if we make tx_queue_len too big, we are effectively
introducing bloat.
If we make it too small, then we might drop one fragment, when we would
prefer to drop the entire packet.

It seems that maybe we ought to have a queue in the upper interface and fill
the lower interface with at most two packets' worth of fragments.

> I want to add a behaviour which drops all related fragments for 6LoWPAN
> fragmentation at first, if the payload is above 1280 bytes, then we
> have also IPv6 fragmentation on it. In future I also like to remove all
> related 6LoWPAN fragments which are related according to the IPv6
> fragment.

It would still be useful to be able to do this in general: this kind of
operation would also benefit sending large UDP packets over ethernet when we
have to do IP-layer fragmentation.
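
To make the "drop the whole packet, not just one fragment" idea concrete, here is a toy sketch of purging a queue by datagram. It is not qdisc code: struct frag_sketch, the flat array queue, and purge_datagram() are all hypothetical, and it assumes fragments of one packet can be recognised by a shared tag (as with the 6LoWPAN datagram_tag).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queued fragment: the tag identifies the original packet. */
struct frag_sketch {
	uint16_t datagram_tag;
	uint16_t offset;
};

/* Remove every queued fragment that carries 'tag', keeping the
 * remaining fragments in their original order. Returns the number of
 * fragments dropped and updates *len to the new queue length.
 */
static size_t purge_datagram(struct frag_sketch *q, size_t *len, uint16_t tag)
{
	size_t kept = 0;
	size_t dropped;
	size_t i;

	for (i = 0; i < *len; i++) {
		if (q[i].datagram_tag != tag)
			q[kept++] = q[i];
	}

	dropped = *len - kept;
	*len = kept;

	return dropped;
}
```

A real implementation would have to do the equivalent walk over the qdisc's skb list under the qdisc lock, which is where the cost of a long tx_queue_len would show up.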

--
]   Never tell me the odds! | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works| network architect  [
] m...@sandelman.ca  http://www.sandelman.ca/|   ruby on rails[




Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats

2016-04-20 Thread Johannes Berg

On Wed, 2016-04-20 at 15:34 +0200, Jiri Benc wrote:
> On Wed, 20 Apr 2016 15:17:08 +0200, Johannes Berg wrote:
> > 
> > Looks like you have this on a per-message basis. I thought it was
> > better on an attribute basis because that's really where the issue
> > is.
> No problem. I'm not that happy with my patchset myself. Just wanted
> to point it out in case it's useful.

Yeah, I looked at it, but I think it ended up a bit too complicated
really.

It does have slightly more validation in some sense, but I don't really
think that justifies the complexity?

No matter what, we'll always have to deal with the problem of not
having this capability on older kernels. One way to work around it
would be to add a new NLM_F_REQUEST2 flag, since the kernel currently
requires having NLM_F_REQUEST set, NLM_F_REQUEST2 messages would be
rejected by existing kernels. Dunno if it's really worth it though, I
suspect that family/command-specific detection will work in practically
all cases.

johannes

Re: [PATCH net-next v6] rtnetlink: add new RTM_GETSTATS message to dump link stats

2016-04-20 Thread David Miller

From: Roopa Prabhu 
Date: Wed, 20 Apr 2016 08:43:43 -0700

> This patch adds a new RTM_GETSTATS message to query link stats via netlink
> from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
> returns a lot more than just stats and is expensive in some cases when
> frequent polling for stats from userspace is a common operation.

With nla_align_64bit() now working properly, I've applied this and it works
on sparc64 too.

Thanks!

Re: [RFC PATCH v3 net-next 2/3] tcp: Handle eor bit when coalescing skb

2016-04-20 Thread Soheil Hassas Yeganeh

On Wed, Apr 20, 2016 at 2:24 AM, Martin KaFai Lau  wrote:
> This patch:
> 1. Prevent next_skb from coalescing to the prev_skb if
>TCP_SKB_CB(prev_skb)->eor is set
> 2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
>allowed
>
> Signed-off-by: Martin KaFai Lau 
> Cc: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 
> Cc: Yuchung Cheng 
> ---
>  net/ipv4/tcp_input.c  | 4 
>  net/ipv4/tcp_output.c | 4 
>  2 files changed, 8 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 75e8336..68c55e5 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1303,6 +1303,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
> }
>
> TCP_SKB_CB(prev)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
> +   TCP_SKB_CB(prev)->eor = TCP_SKB_CB(skb)->eor;
> if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
> TCP_SKB_CB(prev)->end_seq++;
>
> @@ -1368,6 +1369,9 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
> if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
> goto fallback;
>
> +   if (TCP_SKB_CB(prev)->eor)
> +   goto fallback;
> +

nit: You might want to add unlikely around all checks for "tcp_skb_cb->eor"s.
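
Roughly what I mean, as a userspace sketch rather than a kernel hunk: unlikely_sketch() mirrors the usual __builtin_expect() wrapper (GCC/Clang only), while struct cb_sketch and can_shift_into() are made-up stand-ins for TCP_SKB_CB() and the fallback check.

```c
#include <assert.h>
#include <stdbool.h>

/* Branch hint in the style of the kernel's unlikely(): tells the
 * compiler the condition is expected to be false.
 */
#define unlikely_sketch(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the per-skb control block. */
struct cb_sketch {
	unsigned int eor : 1;
};

/* Refuse to coalesce into 'prev' when it is marked end-of-record.
 * The hint only affects code layout, never the result.
 */
static bool can_shift_into(const struct cb_sketch *prev)
{
	if (unlikely_sketch(prev->eor))
		return false;

	return true;
}
```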

> in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
>   !before(end_seq, TCP_SKB_CB(skb)->end_seq);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a6e4a83..96bdf98 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2494,6 +2494,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  * packet counting does not break.
>  */
> TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS;
> +   TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
>
> /* changed transmit queue under us so clear hints */
> tcp_clear_retrans_hints_partial(tp);
> @@ -2545,6 +2546,9 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
> if (!tcp_can_collapse(sk, skb))
> break;
>
> +   if (TCP_SKB_CB(to)->eor)
> +   break;
> +

nit: Perhaps a better place to check for eor is right after entering
the loop? That would skip a few instructions and the tcp_can_collapse()
call in the unlikely case that eor is set.

> space -= skb->len;
>
> if (first) {
> --
> 2.5.1
>

[PATCH] net: nla_align_64bit() needs to test the right pointer.

2016-04-20 Thread David Miller


Netlink messages are appended, one object at a time, to the end of
the SKB.  Therefore we need to test skb_tail_pointer(), not skb->data,
for alignment purposes.

Fixes: 35c5845957c7 ("net: Add helpers for 64-bit aligning netlink attributes.")
Signed-off-by: David S. Miller 
---

This is like a never ending story
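
For anyone following along, a small userspace model of why the pointer being tested matters. It is a sketch under assumptions: addresses are plain integers, NLA_HDRLEN_SKETCH stands for the 4-byte attribute header, and the helper names are invented for illustration. The pad attribute is header-only, so emitting it shifts the next attribute's payload by 4 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NLA_HDRLEN_SKETCH 4u	/* size of one attribute header */

static bool is_aligned8(uintptr_t addr)
{
	return (addr & 7u) == 0;
}

/* Address at which the next attribute's payload lands when the
 * attribute is appended at 'tail', optionally preceded by an empty
 * (header-only) pad attribute.
 */
static uintptr_t next_payload_addr(uintptr_t tail, bool pad)
{
	if (pad)
		tail += NLA_HDRLEN_SKETCH;

	return tail + NLA_HDRLEN_SKETCH;
}

/* Decide about padding by testing 'test_ptr', then append at 'tail'.
 * Passing the start of the buffer models the old check; passing the
 * tail models the fixed one.
 */
static uintptr_t payload_with_test(uintptr_t test_ptr, uintptr_t tail)
{
	return next_payload_addr(tail, is_aligned8(test_ptr));
}
```

With the buffer starting at 0x1000 and 20 bytes already appended (tail at 0x1014), testing the start of the buffer adds a pad and puts the payload at 0x101c, which is misaligned; testing the tail adds no pad and puts it at 0x1018.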

 include/net/netlink.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index cf95df1..3c1fd92 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1250,7 +1250,7 @@ static inline int nla_align_64bit(struct sk_buff *skb, int padattr)
 * nlattr header for next attribute, will make nla_data()
 * 8-byte aligned.
 */
-   if (IS_ALIGNED((unsigned long)skb->data, 8) &&
+   if (IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8) &&
!nla_reserve(skb, padattr, 0))
return -EMSGSIZE;
 #endif
-- 
2.4.1

Re: [PATCH net-next 2/2] tcp: Merge txstamp_ack in tcp_skb_collapse_tstamp

2016-04-20 Thread Soheil Hassas Yeganeh

On Wed, Apr 20, 2016 at 1:50 AM, Martin KaFai Lau  wrote:
> When collapsing skbs, txstamp_ack also needs to be merged.
>
> Retrans Collapse Test:
> ~~
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 write(4, ..., 730) = 730
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 730) = 730
> +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
> 0.200 write(4, ..., 11680) = 11680
>
> 0.200 > P. 1:731(730) ack 1
> 0.200 > P. 731:1461(730) ack 1
> 0.200 > . 1461:8761(7300) ack 1
> 0.200 > P. 8761:13141(4380) ack 1
>
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 > P. 1:1461(1460) ack 1
> 0.400 < . 1:1(0) ack 13141 win 257
>
> BPF Output Before:
> ~
> 
>
> BPF Output After:
> ~
> <...>-2027  [007] d.s.79.765921: : ee_data:1459
>
> Sacks Collapse Test:
> ~
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 write(4, ..., 1460) = 1460
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 13140) = 13140
> +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
>
> 0.200 > P. 1:1461(1460) ack 1
> 0.200 > . 1461:8761(7300) ack 1
> 0.200 > P. 8761:14601(5840) ack 1
>
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 > P. 1:1461(1460) ack 1
> 0.400 < . 1:1(0) ack 14601 win 257
>
> BPF Output Before:
> ~
> 
>
> BPF Output After:
> ~
> <...>-2049  [007] d.s.89.185538: : ee_data:14599
>
> Signed-off-by: Martin KaFai Lau 
> Cc: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 
> Cc: Yuchung Cheng 
Acked-by: Soheil Hassas Yeganeh 
Tested-by: Soheil Hassas Yeganeh 
> ---
>  net/ipv4/tcp_output.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f7c3bc0..a6e4a83 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2454,6 +2454,8 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
>
> shinfo->tx_flags |= tsflags;
> shinfo->tskey = next_shinfo->tskey;
> +   TCP_SKB_CB(skb)->txstamp_ack |=
> +   TCP_SKB_CB(next_skb)->txstamp_ack;
> }
>  }
>
> --
> 2.5.1
>

Re: [PATCH net-next 1/2] tcp: Carry txstamp_ack in tcp_fragment_tstamp

2016-04-20 Thread Soheil Hassas Yeganeh

On Wed, Apr 20, 2016 at 1:50 AM, Martin KaFai Lau  wrote:
> When a tcp skb is sliced into two smaller skbs (e.g. in
> tcp_fragment() and tso_fragment()),  it does not carry
> the txstamp_ack bit to the newly created skb if it is needed.
> The end result is a timestamping event (SCM_TSTAMP_ACK) will
> be missing from the sk->sk_error_queue.
>
> This patch carries this bit to the new skb2
> in tcp_fragment_tstamp().
>
> BPF Output Before:
> ~~
> 
>
> BPF Output After:
> ~~
> <...>-2050  [000] d.s.   100.928763: : ee_data:14599
>
> Packetdrill Script:
> ~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 
> 0.100 > S. 0:0(0) ack 1 
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 14600) = 14600
> +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
>
> 0.200 > . 1:7301(7300) ack 1
> 0.200 > P. 7301:14601(7300) ack 1
>
> 0.300 < . 1:1(0) ack 14601 win 257
>
> 0.300 close(4) = 0
> 0.300 > F. 14601:14601(0) ack 1
> 0.400 < F. 1:1(0) ack 16062 win 257
> 0.400 > . 14602:14602(0) ack 2
>
> Signed-off-by: Martin KaFai Lau 
> Cc: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 
> Cc: Yuchung Cheng 
> Acked-by: Soheil Hassas Yeganeh 
Tested-by: Soheil Hassas Yeganeh 
> ---
>  net/ipv4/tcp_output.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 96182a2..f7c3bc0 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1123,6 +1123,8 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
> shinfo->tx_flags &= ~tsflags;
> shinfo2->tx_flags |= tsflags;
> swap(shinfo->tskey, shinfo2->tskey);
> +   TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
> +   TCP_SKB_CB(skb)->txstamp_ack = 0;
> }
>  }
>
> --
> 2.5.1
>

Re: [PATCH net 2/2] tcp: Merge tx_flags and tskey in tcp_shifted_skb

2016-04-20 Thread Soheil Hassas Yeganeh

On Wed, Apr 20, 2016 at 1:39 AM, Martin KaFai Lau  wrote:
> After receiving sacks, tcp_shifted_skb() will collapse
> skbs if possible.  tx_flags and tskey also have to be
> merged.
>
> This patch reuses the tcp_skb_collapse_tstamp() to handle
> them.
>
> BPF Output Before:
> ~
> 
>
> BPF Output After:
> ~
> <...>-2024  [007] d.s.88.644374: : ee_data:14599
>
> Packetdrill Script:
> ~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 
> 0.100 > S. 0:0(0) ack 1 
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 write(4, ..., 1460) = 1460
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 13140) = 13140
>
> 0.200 > P. 1:1461(1460) ack 1
> 0.200 > . 1461:8761(7300) ack 1
> 0.200 > P. 8761:14601(5840) ack 1
>
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 > P. 1:1461(1460) ack 1
> 0.400 < . 1:1(0) ack 14601 win 257
>
> 0.400 close(4) = 0
> 0.400 > F. 14601:14601(0) ack 1
> 0.500 < F. 1:1(0) ack 14602 win 257
> 0.500 > . 14602:14602(0) ack 2
>
> Signed-off-by: Martin KaFai Lau 
> Cc: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 
> Cc: Yuchung Cheng 
Acked-by: Soheil Hassas Yeganeh 
Tested-by: Soheil Hassas Yeganeh 
> ---
>  include/net/tcp.h | 2 ++
>  net/ipv4/tcp_input.c  | 1 +
>  net/ipv4/tcp_output.c | 4 ++--
>  3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index b91370f..6db1022 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -552,6 +552,8 @@ void tcp_send_ack(struct sock *sk);
>  void tcp_send_delayed_ack(struct sock *sk);
>  void tcp_send_loss_probe(struct sock *sk);
>  bool tcp_schedule_loss_probe(struct sock *sk);
> +void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> +const struct sk_buff *next_skb);
>
>  /* tcp_input.c */
>  void tcp_resume_early_retransmit(struct sock *sk);
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 0edb071..c124c3c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1309,6 +1309,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
> if (skb == tcp_highest_sack(sk))
> tcp_advance_highest_sack(sk, skb);
>
> +   tcp_skb_collapse_tstamp(prev, skb);
> tcp_unlink_write_queue(skb, sk);
> sk_wmem_free_skb(sk, skb);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 5bc3c30..441ae9d 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2441,8 +2441,8 @@ u32 __tcp_select_window(struct sock *sk)
> return window;
>  }
>
> -static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> -   const struct sk_buff *next_skb)
> +void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> +const struct sk_buff *next_skb)
>  {
> const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
> u8 tsflags = next_shinfo->tx_flags & SKBTX_ANY_TSTAMP;
> --
> 2.5.1
>

Re: [PATCH 02/19] io-mapping: Specify mapping size for io_mapping_map_wc()

2016-04-20 Thread Chris Wilson

On Wed, Apr 20, 2016 at 08:58:44PM +0200, Luis R. Rodriguez wrote:
> On Wed, Apr 20, 2016 at 07:42:13PM +0100, Chris Wilson wrote:
> > The ioremap() hidden behind the io_mapping_map_wc() convenience helper
> > can be used for remapping multiple pages. Extend the helper so that
> > future callers can use it for larger ranges.
> > 
> > Signed-off-by: Chris Wilson 
> > Cc: Tvrtko Ursulin 
> > Cc: Daniel Vetter 
> > Cc: Jani Nikula 
> > Cc: David Airlie 
> > Cc: Yishai Hadas 
> > Cc: Dan Williams 
> > Cc: Ingo Molnar 
> > Cc: "Peter Zijlstra (Intel)" 
> > Cc: David Hildenbrand 
> > Cc: Luis R. Rodriguez 
> > Cc: intel-...@lists.freedesktop.org
> > Cc: dri-de...@lists.freedesktop.org
> > Cc: netdev@vger.kernel.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-ker...@vger.kernel.org
> 
> We have 2 callers today, in the future, can you envision
> this API getting more options? If so, in order to avoid the
> pain of collateral evolutions I can suggest a descriptor
> being passed with the required settings / options. This lets
> you evolve the API without needing to go in and modify
> old users. If you choose not to that's fine too, just
> figured I'd chime in with that as I've seen the pain
> with other APIs, and I'm putting an end to the needless
> set of collateral evolutions this way.

Do you have a good example in mind? I've one more patch to try and take
advantage of the io-mapping (that may or may not be such a good idea in
practice) but I may as well see if I can make io_mapping more useful
when I do.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

Re: [PATCH net 1/2] tcp: Merge tx_flags and tskey in tcp_collapse_retrans

2016-04-20 Thread Soheil Hassas Yeganeh

On Wed, Apr 20, 2016 at 1:39 AM, Martin KaFai Lau  wrote:
> If two skbs are merged/collapsed during retransmission, the current
> logic does not merge the tx_flags and tskey.  The end result is
> the SCM_TSTAMP_ACK timestamp could be missing for a packet.
>
> The patch:
> 1. Merge the tx_flags
> 2. Overwrite the prev_skb's tskey with the next_skb's tskey
>
> BPF Output Before:
> ~~
> 
>
> BPF Output After:
> ~~
> packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459
>
> Packetdrill Script:
> ~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 
> 0.100 > S. 0:0(0) ack 1 
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 write(4, ..., 730) = 730
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 730) = 730
> +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
> 0.200 write(4, ..., 11680) = 11680
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
>
> 0.200 > P. 1:731(730) ack 1
> 0.200 > P. 731:1461(730) ack 1
> 0.200 > . 1461:8761(7300) ack 1
> 0.200 > P. 8761:13141(4380) ack 1
>
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 < . 1:1(0) ack 1 win 257 
> 0.300 > P. 1:1461(1460) ack 1
> 0.400 < . 1:1(0) ack 13141 win 257
>
> 0.400 close(4) = 0
> 0.400 > F. 13141:13141(0) ack 1
> 0.500 < F. 1:1(0) ack 13142 win 257
> 0.500 > . 13142:13142(0) ack 2
>
> Signed-off-by: Martin KaFai Lau 
> Cc: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 
> Cc: Yuchung Cheng 
Acked-by: Soheil Hassas Yeganeh 
Tested-by: Soheil Hassas Yeganeh 
> ---
>  net/ipv4/tcp_output.c | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 7d2dc01..5bc3c30 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2441,6 +2441,20 @@ u32 __tcp_select_window(struct sock *sk)
> return window;
>  }
>
> +static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> +   const struct sk_buff *next_skb)
> +{
> +   const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
> +   u8 tsflags = next_shinfo->tx_flags & SKBTX_ANY_TSTAMP;
> +
> +   if (unlikely(tsflags)) {
> +   struct skb_shared_info *shinfo = skb_shinfo(skb);
> +
> +   shinfo->tx_flags |= tsflags;
> +   shinfo->tskey = next_shinfo->tskey;
> +   }
> +}
> +
>  /* Collapses two adjacent SKB's during retransmission. */
>  static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  {
> @@ -2484,6 +2498,8 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>
> tcp_adjust_pcount(sk, next_skb, tcp_skb_pcount(next_skb));
>
> +   tcp_skb_collapse_tstamp(skb, next_skb);
> +
> sk_wmem_free_skb(sk, next_skb);
>  }
>
> --
> 2.5.1
>

Re: [PATCH net] tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks

2016-04-20 Thread Soheil Hassas Yeganeh

On Tue, Apr 19, 2016 at 9:54 AM, Soheil Hassas Yeganeh
 wrote:
> On Mon, Apr 18, 2016 at 6:39 PM, Martin KaFai Lau  wrote:
>> Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
>> it could incorrectly think that a skb has already
>> been acked and queue a SCM_TSTAMP_ACK cmsg to the
>> sk->sk_error_queue.
>>
>> In tcp_ack_tstamp(), it checks
>> 'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
>> If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
>> script, between() returns true but the tskey is actually not acked.
>> e.g. try between(3, 2, 1).
>>
>> The fix is to replace between() with one before() and one !before().
>> By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
>> removed.
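
The between(3, 2, 1) case can be reproduced in userspace with the sketch below. before() and between() are written to mirror the wrap-safe helpers in include/net/tcp.h; tskey_acked_old() and tskey_acked_new() are hypothetical names for the check before and after the fix as described above, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "seq1 is before seq2" in 32-bit sequence space. */
static bool before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* True when seq2 <= seq1 <= seq3 in modular arithmetic. If seq3 is
 * just behind seq2, the window wraps and covers almost everything.
 */
static bool between(uint32_t seq1, uint32_t seq2, uint32_t seq3)
{
	return seq3 - seq2 >= seq1 - seq2;
}

/* Old check: breaks when prior_snd_una == snd_una (a dup ack), since
 * the upper bound snd_una - 1 then sits just below the lower bound.
 */
static bool tskey_acked_old(uint32_t tskey, uint32_t prior_snd_una,
			    uint32_t snd_una)
{
	return between(tskey, prior_snd_una, snd_una - 1);
}

/* New check: prior_snd_una <= tskey < snd_una, with no -1 offset. */
static bool tskey_acked_new(uint32_t tskey, uint32_t prior_snd_una,
			    uint32_t snd_una)
{
	return !before(tskey, prior_snd_una) && before(tskey, snd_una);
}
```

With tskey = 3 and prior_snd_una = snd_una = 2, the old form reports the key as acked and the new form does not; for a real ack that advances snd_una to 5, both agree.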
>>
>> A packetdrill script is used to reproduce the dup ack scenario.
>> Due to the lacking cmsg support in packetdrill (may be I
>> cannot find it),  a BPF prog is used to kprobe to
>> sock_queue_err_skb() and print out the value of
>> serr->ee.ee_data.
>>
>> Both the packetdrill and the bcc BPF script is attached at the end of
>> this commit message.
>>
>> BPF Output Before Fix:
>> ~~
>>   <...>-2056  [001] d.s.   433.927987: : ee_data:1459  #incorrect
>> packetdrill-2056  [001] d.s.   433.929563: : ee_data:1459  #incorrect
>> packetdrill-2056  [001] d.s.   433.930765: : ee_data:1459  #incorrect
>> packetdrill-2056  [001] d.s.   434.028177: : ee_data:1459
>> packetdrill-2056  [001] d.s.   434.029686: : ee_data:14599
>>
>> BPF Output After Fix:
>> ~~
>>   <...>-2049  [000] d.s.   113.517039: : ee_data:1459
>>   <...>-2049  [000] d.s.   113.517253: : ee_data:14599
>>
>> BCC BPF Script:
>> ~~
>> #!/usr/bin/env python
>>
>> from __future__ import print_function
>> from bcc import BPF
>>
>> bpf_text = """
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> #ifdef memset
>> #undef memset
>> #endif
>>
>> int trace_err_skb(struct pt_regs *ctx)
>> {
>> struct sk_buff *skb = (struct sk_buff *)ctx->si;
>> struct sock *sk = (struct sock *)ctx->di;
>> struct sock_exterr_skb *serr;
>> u32 ee_data = 0;
>>
>> if (!sk || !skb)
>> return 0;
>>
>> serr = SKB_EXT_ERR(skb);
>> bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
>> bpf_trace_printk("ee_data:%u\\n", ee_data);
>>
>> return 0;
>> };
>> """
>>
>> b = BPF(text=bpf_text)
>> b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
>> print("Attached to kprobe")
>> b.trace_print()
>>
>> Packetdrill Script:
>> ~~
>> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
>> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
>> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> +0 bind(3, ..., ...) = 0
>> +0 listen(3, 1) = 0
>>
>> 0.100 < S 0:0(0) win 32792 
>> 0.100 > S. 0:0(0) ack 1 
>> 0.200 < . 1:1(0) ack 1 win 257
>> 0.200 accept(3, ..., ...) = 4
>> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>>
>> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
>> 0.200 write(4, ..., 1460) = 1460
>> 0.200 write(4, ..., 13140) = 13140
>>
>> 0.200 > P. 1:1461(1460) ack 1
>> 0.200 > . 1461:8761(7300) ack 1
>> 0.200 > P. 8761:14601(5840) ack 1
>>
>> 0.300 < . 1:1(0) ack 1 win 257 
>> 0.300 < . 1:1(0) ack 1 win 257 
>> 0.300 < . 1:1(0) ack 1 win 257 
>> 0.300 > P. 1:1461(1460) ack 1
>> 0.400 < . 1:1(0) ack 14601 win 257
>>
>> 0.400 close(4) = 0
>> 0.400 > F. 14601:14601(0) ack 1
>> 0.500 < F. 1:1(0) ack 14602 win 257
>> 0.500 > . 14602:14602(0) ack 2
>>
>> Signed-off-by: Martin KaFai Lau 
>> Cc: Eric Dumazet 
>> Cc: Neal Cardwell 
>> Cc: Soheil Hassas Yeganeh 
>
> Acked-by: Soheil Hassas Yeganeh 
Tested-by: Soheil Hassas Yeganeh 
>
>> Cc: Willem de Bruijn 
>> Cc: Yuchung Cheng 
>> ---
>>  net/ipv4/tcp_input.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index e6e65f7..0edb071 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -3098,7 +3098,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct 
>> sk_buff *skb,
>>
>> shinfo = skb_shinfo(skb);
>> if ((shinfo->tx_flags & SKBTX_ACK_TSTAMP) &&
>> -   between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1))
>> +   !before(shinfo->tskey, prior_snd_una) &&
>> +   before(shinfo->tskey, tcp_sk(sk)->snd_una))
>> __skb_tstamp_tx(skb, NULL, sk, SCM_TSTAMP_ACK);
>>  }
>
> Nice catch! Thanks.
>
>> --
>> 2.5.1
>>
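
The between() pitfall described in the commit message can be reproduced outside the kernel. Below is a minimal Python sketch, assuming the usual wrapping 32-bit definitions of before() and between() from include/net/tcp.h. The helper names acked_old() and acked_new() and the sample values are illustrative only, not kernel code.

```python
MASK = 0xFFFFFFFF  # TCP sequence numbers are unsigned 32-bit values


def before(seq1, seq2):
    """True if seq1 precedes seq2 in wrapping 32-bit sequence space."""
    diff = (seq1 - seq2) & MASK
    return diff >= 0x80000000  # equivalent to (s32)(seq1 - seq2) < 0


def between(seq1, seq2, seq3):
    """True if seq2 <= seq1 <= seq3, computed modulo 2**32."""
    return ((seq3 - seq2) & MASK) >= ((seq1 - seq2) & MASK)


def acked_old(tskey, prior_snd_una, snd_una):
    # Check used before the fix (hypothetical helper name).
    return between(tskey, prior_snd_una, (snd_una - 1) & MASK)


def acked_new(tskey, prior_snd_una, snd_una):
    # Check used after the fix: prior_snd_una <= tskey < snd_una.
    return (not before(tskey, prior_snd_una)) and before(tskey, snd_una)


if __name__ == "__main__":
    # Dup ack: snd_una did not advance, so nothing new was acked.
    print("old check on dup ack:", acked_old(1459, 1, 1))
    print("new check on dup ack:", acked_new(1459, 1, 1))
    # Cumulative ack that really covers the key.
    print("new check on real ack:", acked_new(1459, 1, 14601))
```

When prior_snd_una equals snd_una, the old upper bound snd_una - 1 lies just below the lower bound, so the unsigned range wraps around the whole sequence space and every key appears to be inside it. The half-open interval in the new check is empty in that case.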

[PATCH net-next V3 06/11] net/mlx5e: Added ICO SQs

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

Added ICO (Internal Control Operations) SQ per channel to be used
for driver internal operations such as memory registration for
fragmented memory and nop requests upon ifconfig up.
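
To illustrate the bookkeeping this patch introduces, here is a small Python model. It assumes a cyclic work queue whose size is a power of two, so that the slot of the next WQE is the producer counter masked with size - 1 (as `pi = sq->pc & sq->wq.sz_m1` does in the diff). The class and method names and the opcode value are hypothetical; only the indexing scheme is taken from the patch.

```python
from dataclasses import dataclass

OPCODE_NOP = 0x00  # placeholder value, for illustration only


@dataclass
class IcoWqeInfo:
    opcode: int = 0
    num_wqebbs: int = 0


class IcoSqModel:
    """Toy model of a cyclic send queue with per-slot WQE info."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("queue size must be a power of two")
        self.sz_m1 = size - 1
        self.pc = 0  # producer counter (assumed to advance per WQEBB)
        self.ico_wqe_info = [IcoWqeInfo() for _ in range(size)]

    def post(self, opcode, num_wqebbs=1):
        """Record what is being posted at the current producer slot."""
        pi = self.pc & self.sz_m1  # producer index wraps via the mask
        self.ico_wqe_info[pi] = IcoWqeInfo(opcode, num_wqebbs)
        self.pc += num_wqebbs
        return pi
```

Keeping the opcode and size per slot lets a completion handler tell what kind of operation finished and how far to advance, without parsing the WQE itself.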

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |7 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  135 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |   55 +
 4 files changed, 174 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index f519148..a757fcf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -488,6 +488,11 @@ enum {
MLX5E_SQ_STATE_BF_ENABLE,
 };
 
+struct mlx5e_ico_wqe_info {
+   u8  opcode;
+   u8  num_wqebbs;
+};
+
 struct mlx5e_sq {
/* data path */
 
@@ -529,6 +534,7 @@ struct mlx5e_sq {
struct mlx5_uar uar;
struct mlx5e_channel  *channel;
int tc;
+   struct mlx5e_ico_wqe_info *ico_wqe_info;
 } cacheline_aligned_in_smp;
 
 static inline bool mlx5e_sq_has_room_for(struct mlx5e_sq *sq, u16 n)
@@ -545,6 +551,7 @@ struct mlx5e_channel {
/* data path */
struct mlx5e_rq rq;
struct mlx5e_sq sq[MLX5E_MAX_NUM_TC];
+   struct mlx5e_sq icosq;   /* internal control operations */
struct napi_struct napi;
struct device *pdev;
struct net_device *netdev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 871f3af..b25b429 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -48,6 +48,7 @@ struct mlx5e_sq_param {
u32 sqc[MLX5_ST_SZ_DW(sqc)];
struct mlx5_wq_param   wq;
u16 max_inline;
+   bool   icosq;
 };
 
 struct mlx5e_cq_param {
@@ -59,8 +60,10 @@ struct mlx5e_cq_param {
 struct mlx5e_channel_param {
struct mlx5e_rq_param  rq;
struct mlx5e_sq_param  sq;
+   struct mlx5e_sq_param  icosq;
struct mlx5e_cq_param  rx_cq;
struct mlx5e_cq_param  tx_cq;
+   struct mlx5e_cq_param  icosq_cq;
 };
 
 static void mlx5e_update_carrier(struct mlx5e_priv *priv)
@@ -502,6 +505,8 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
 struct mlx5e_rq_param *param,
 struct mlx5e_rq *rq)
 {
+   struct mlx5e_sq *sq = &c->icosq;
+   u16 pi = sq->pc & sq->wq.sz_m1;
int err;
 
err = mlx5e_create_rq(c, param, rq);
@@ -517,7 +522,10 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
goto err_disable_rq;
 
set_bit(MLX5E_RQ_STATE_POST_WQES_ENABLE, &rq->state);
-   mlx5e_send_nop(&c->sq[0], true); /* trigger mlx5e_post_rx_wqes() */
+
+   sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
+   sq->ico_wqe_info[pi].num_wqebbs = 1;
+   mlx5e_send_nop(sq, true); /* trigger mlx5e_post_rx_wqes() */
 
return 0;
 
@@ -583,7 +591,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 
void *sqc = param->sqc;
void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
-   int txq_ix;
int err;
 
err = mlx5_alloc_map_uar(mdev, &sq->uar, true);
@@ -611,8 +618,24 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
if (err)
goto err_sq_wq_destroy;
 
-   txq_ix = c->ix + tc * priv->params.num_channels;
-   sq->txq = netdev_get_tx_queue(priv->netdev, txq_ix);
+   if (param->icosq) {
+   u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+   sq->ico_wqe_info = kzalloc_node(sizeof(*sq->ico_wqe_info) *
+   wq_sz,
+   GFP_KERNEL,
+   cpu_to_node(c->cpu));
+   if (!sq->ico_wqe_info) {
+   err = -ENOMEM;
+   goto err_free_sq_db;
+   }
+   } else {
+   int txq_ix;
+
+   txq_ix = c->ix + tc * priv->params.num_channels;
+   sq->txq = netdev_get_tx_queue(priv->netdev, txq_ix);
+   priv->txq_to_sq_map[txq_ix] = sq;
+   }
 
sq->pdev  = c->pdev;
sq->tstamp = &priv->tstamp;
@@ -621,10 +644,12 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
sq->tc= tc;
sq->edge  = (sq->wq.sz_m1 + 1) - MLX5_SEND_WQE_MAX_WQEBBS;
sq->bf_budget = MLX5E_SQ_BF_BUDGET;
-   priv->txq_to_sq_map[txq_ix] = sq;
 
return 0;

[PATCH net-next V3 07/11] net/mlx5e: Add fragmented memory support for RX multi packet WQE

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

If the allocation of a linear (physically contiguous) MPWQE fails,
we allocate a fragmented MPWQE.

This is implemented via the device's UMR (User Memory Registration),
which allows registering multiple memory fragments with the ConnectX
hardware as a single contiguous buffer.
UMR registration is an asynchronous operation and is done via
ICO SQs.
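
The sizing macros this patch adds can be checked with a few lines of arithmetic. The sketch below assumes the MPWQE constants from patch 05 as posted in this series (2^11 strides of 2^6 bytes, maximum log RQ size of 6) and treats the page shift as a parameter. The Python function names are illustrative.

```python
LOG_NUM_STRIDES = 11       # MLX5_MPWRQ_LOG_NUM_STRIDES
LOG_STRIDE_SIZE = 6        # MLX5_MPWRQ_LOG_STRIDE_SIZE
MAX_LOG_RQ_SIZE_MPW = 6    # MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW


def align(x, a):
    """Round x up to a multiple of a (a must be a power of two)."""
    return (x + a - 1) & ~(a - 1)


def pages_per_wqe(page_shift):
    """Mirror of MLX5_MPWRQ_PAGES_PER_WQE for a given page size."""
    log_wqe_sz = LOG_NUM_STRIDES + LOG_STRIDE_SIZE
    order = max(log_wqe_sz - page_shift, 0)
    return 1 << order


def channel_max_num_mtts(page_shift):
    """Mirror of MLX5_CHANNEL_MAX_NUM_MTTS for a given page size."""
    return align(pages_per_wqe(page_shift), 8) * (1 << MAX_LOG_RQ_SIZE_MPW)


if __name__ == "__main__":
    for shift in (12, 16):  # 4K and 64K pages
        print(shift, pages_per_wqe(shift), channel_max_num_mtts(shift))
```

With 4K pages each WQE spans 32 pages, so the ALIGN() to 8 entries changes nothing. With 64K pages a WQE spans only 2 pages, and the alignment rounds the per-WQE MTT count up to 8.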

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   84 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   64 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |  427 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |3 +
 5 files changed, 514 insertions(+), 68 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index a757fcf..c99fdff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -72,6 +72,9 @@
 #define MLX5_MPWRQ_PAGES_PER_WQE   BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
 #define MLX5_MPWRQ_STRIDES_PER_PAGE (MLX5_MPWRQ_NUM_STRIDES >> \
 MLX5_MPWRQ_WQE_PAGE_ORDER)
+#define MLX5_CHANNEL_MAX_NUM_MTTS (ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8) * \
+  BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW))
+#define MLX5_UMR_ALIGN (2048)
 #define MLX5_MPWRQ_SMALL_PACKET_THRESHOLD  (128)
 
 #define MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ (64 * 1024)
@@ -134,6 +137,13 @@ struct mlx5e_rx_wqe {
struct mlx5_wqe_data_seg  data;
 };
 
+struct mlx5e_umr_wqe {
+   struct mlx5_wqe_ctrl_seg   ctrl;
+   struct mlx5_wqe_umr_ctrl_seg   uctrl;
+   struct mlx5_mkey_seg   mkc;
+   struct mlx5_wqe_data_seg   data;
+};
+
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
 #define MLX5E_MIN_BW_ALLOC 1   /* Min percentage of BW allocation */
@@ -179,6 +189,7 @@ static const char vport_strings[][ETH_GSTRING_LEN] = {
"tx_queue_dropped",
"rx_wqe_err",
"rx_mpwqe_filler",
+   "rx_mpwqe_frag",
 };
 
 struct mlx5e_vport_stats {
@@ -221,8 +232,9 @@ struct mlx5e_vport_stats {
u64 tx_queue_dropped;
u64 rx_wqe_err;
u64 rx_mpwqe_filler;
+   u64 rx_mpwqe_frag;
 
-#define NUM_VPORT_COUNTERS 36
+#define NUM_VPORT_COUNTERS 37
 };
 
 static const char pport_strings[][ETH_GSTRING_LEN] = {
@@ -317,6 +329,7 @@ static const char rq_stats_strings[][ETH_GSTRING_LEN] = {
"lro_bytes",
"wqe_err",
"mpwqe_filler",
+   "mpwqe_frag",
 };
 
 struct mlx5e_rq_stats {
@@ -328,7 +341,8 @@ struct mlx5e_rq_stats {
u64 lro_bytes;
u64 wqe_err;
u64 mpwqe_filler;
-#define NUM_RQ_STATS 8
+   u64 mpwqe_frag;
+#define NUM_RQ_STATS 9
 };
 
 static const char sq_stats_strings[][ETH_GSTRING_LEN] = {
@@ -407,6 +421,7 @@ struct mlx5e_tstamp {
 
 enum {
MLX5E_RQ_STATE_POST_WQES_ENABLE,
+   MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS,
 };
 
 struct mlx5e_cq {
@@ -434,18 +449,14 @@ struct mlx5e_dma_info {
dma_addr_t  addr;
 };
 
-struct mlx5e_mpw_info {
-   struct mlx5e_dma_info dma_info;
-   u16 consumed_strides;
-   u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
-};
-
 struct mlx5e_rq {
/* data path */
struct mlx5_wq_ll  wq;
u32 wqe_sz;
struct sk_buff   **skb;
struct mlx5e_mpw_info *wqe_info;
+   __be32 mkey_be;
+   __be32 umr_mkey_be;
 
struct device *pdev;
struct net_device *netdev;
@@ -466,6 +477,36 @@ struct mlx5e_rq {
struct mlx5e_priv *priv;
 } cacheline_aligned_in_smp;
 
+struct mlx5e_umr_dma_info {
+   __be64 *mtt;
+   __be64 *mtt_no_align;
+   dma_addr_t mtt_addr;
+   struct mlx5e_dma_info *dma_info;
+};
+
+struct mlx5e_mpw_info {
+   union {
+   struct mlx5e_dma_info dma_info;
+   struct mlx5e_umr_dma_info umr;
+   };
+   u16 consumed_strides;
+   u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
+
+   void (*dma_pre_sync)(struct device *pdev,
+struct mlx5e_mpw_info *wi,
+u32 wqe_offset, u32 len);
+   void (*add_skb_frag)(struct device *pdev,
+struct sk_buff *skb,
+struct mlx5e_mpw_info *wi,
+u32 page_idx, u32 frag_offset, u32 len);
+   void (*copy_skb_header)(struct device *pdev,
+   struct sk_buff *skb,
+   struct mlx5e_mpw_info *wi,
+   u32 page_idx,

[PATCH net-next V3 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

Introduce the multi-packet WQE (RX Work Queue Element) feature,
referred to as MPWQE or Striding RQ, in which WQEs are larger
and serve multiple packets each.

Every WQE consists of many strides of the same size. Every received
packet is aligned to the beginning of a stride and is written to
consecutive strides within a WQE.

In the regular approach, each WQE is big enough to serve one
received packet of any size up to the MTU, or up to 64K when device
LRO is enabled, which is very wasteful when dealing with small
packets or when device LRO is enabled.

Thanks to its flexibility, MPWQE allows better memory utilization
(implying improvements in CPU utilization and packet rate), as packets
consume strides according to their size, leaving the rest of
the WQE available for other packets.
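
As a rough illustration of the stride accounting described above, the sketch below assumes that a packet always starts on a stride boundary, occupies ceil(len / stride_size) consecutive strides, and never spans two WQEs. It ignores any headroom or per-packet overhead the hardware may add. The constants follow the defaults quoted in this commit message; the function names are illustrative.

```python
STRIDE_SIZE = 64         # bytes, default stride size
STRIDES_PER_WQE = 2048   # default strides per WQE
NUM_WQES = 16            # default number of WQEs per ring


def strides_for_packet(length):
    """Strides consumed by one packet that starts on a stride boundary."""
    return max(1, -(-length // STRIDE_SIZE))  # ceiling division


def packets_per_ring(length):
    """Upper bound on same-sized packets that one full ring can absorb."""
    per_wqe = STRIDES_PER_WQE // strides_for_packet(length)
    return per_wqe * NUM_WQES


if __name__ == "__main__":
    for size in (64, 128, 1500):
        print(size, strides_for_packet(size), packets_per_ring(size))
```

Under these assumptions, small packets are limited by the number of strides rather than by the number of WQEs, which is the effect the commit message relies on.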

MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte

The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
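
The footprint figures above follow from simple arithmetic. A minimal check, assuming an MTU of 1500 bytes and the default regular ring of 1024 WQEs (both are assumptions here, and a real regular WQE buffer is somewhat larger than the MTU):

```python
MTU = 1500                   # assumed MTU in bytes
REGULAR_RING_SIZE = 1024     # default number of regular WQEs

NUM_WQES = 16                # MPWQE defaults from this commit message
STRIDES_PER_WQE = 2048
STRIDE_SIZE = 64             # bytes


def regular_footprint():
    """Bytes of RX buffer for a regular ring, one MTU-sized buffer per WQE."""
    return REGULAR_RING_SIZE * MTU


def mpwqe_footprint():
    """Bytes of RX buffer for an MPWQE ring."""
    return NUM_WQES * STRIDES_PER_WQE * STRIDE_SIZE


if __name__ == "__main__":
    print(f"regular: {regular_footprint() / 1e6:.2f} MB")
    print(f"MPWQE:   {mpwqe_footprint() / 2**20:.2f} MiB")
```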

Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.

* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
  default, 64B, 1024B, 1478B, 65536B.

* Netperf multi TCP stream:
- No degradation, line rate reached.

* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.

* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
  | 2K  |   ~ 1K  |   0
  | 8K  |   ~ 6K  |   0
  | 16K |   ~13K  |   0
  | 32K |   ~28K  |   0
  | 64K |   ~57K  | ~24K

This is expected, as the driver can now receive as many small packets
(<=64B) as the total number of strides in the ring (default = 2048 * 16),
vs. 1024 (the default ring size, regardless of packet size) before this feature.

Signed-off-by: Tariq Toukan 
Signed-off-by: Achiad Shochat 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   77 ++-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   15 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  109 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|  153 ++--
 include/linux/mlx5/device.h|   39 +-
 5 files changed, 349 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 61e249d..f519148 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -57,12 +57,30 @@
 #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE 0xa
 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE 0xd
 
+#define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW 0x1
+#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x4
+#define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6
+
+#define MLX5_MPWRQ_LOG_NUM_STRIDES 11 /* >= 9, HW restriction */
+#define MLX5_MPWRQ_LOG_STRIDE_SIZE 6  /* >= 6, HW restriction */
+#define MLX5_MPWRQ_NUM_STRIDES BIT(MLX5_MPWRQ_LOG_NUM_STRIDES)
+#define MLX5_MPWRQ_STRIDE_SIZE BIT(MLX5_MPWRQ_LOG_STRIDE_SIZE)
+#define MLX5_MPWRQ_LOG_WQE_SZ  (MLX5_MPWRQ_LOG_NUM_STRIDES +\
+MLX5_MPWRQ_LOG_STRIDE_SIZE)
+#define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
+   MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
+#define MLX5_MPWRQ_PAGES_PER_WQE   BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
+#define MLX5_MPWRQ_STRIDES_PER_PAGE (MLX5_MPWRQ_NUM_STRIDES >> \
+MLX5_MPWRQ_WQE_PAGE_ORDER)
+#define MLX5_MPWRQ_SMALL_PACKET_THRESHOLD  (128)
+
 #define MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ (64 * 1024)
 #define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC  0x10
 #define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS  0x20
 #define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC  0x10
 #define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS  0x20
 #define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES 0x80
+#define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES_MPW 0x2
 
 #define MLX5E_LOG_INDIR_RQT_SIZE   0x7
 #define MLX5E_INDIR_RQT_SIZE   BIT(MLX5E_LOG_INDIR_RQT_SIZE)
@@ -74,6 +92,38 @@
 #define MLX5E_NUM_MAIN_GROUPS 9
 #define

[PATCH net-next V3 01/11] net/mlx5: Introduce device queue counters

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

A queue counter can collect several statistics for one or more
hardware queues (QPs, RQs, etc.) that the counter is attached to.

For Ethernet it will provide an "out of buffer" counter which
collects the number of all packets that are dropped due to lack
of software buffers.

Here we add device commands to alloc/query/dealloc queue counters.

Signed-off-by: Tariq Toukan 
Signed-off-by: Rana Shahout 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c |   68 ++
 include/linux/mlx5/qp.h  |6 ++
 2 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index def2893..b720a27 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -538,3 +538,71 @@ void mlx5_core_destroy_sq_tracked(struct mlx5_core_dev 
*dev,
mlx5_core_destroy_sq(dev, sq->qpn);
 }
 EXPORT_SYMBOL(mlx5_core_destroy_sq_tracked);
+
+int mlx5_core_alloc_q_counter(struct mlx5_core_dev *dev, u16 *counter_id)
+{
+   u32 in[MLX5_ST_SZ_DW(alloc_q_counter_in)];
+   u32 out[MLX5_ST_SZ_DW(alloc_q_counter_out)];
+   int err;
+
+   memset(in, 0, sizeof(in));
+   memset(out, 0, sizeof(out));
+
+   MLX5_SET(alloc_q_counter_in, in, opcode, MLX5_CMD_OP_ALLOC_Q_COUNTER);
+   err = mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, sizeof(out));
+   if (!err)
+   *counter_id = MLX5_GET(alloc_q_counter_out, out,
+  counter_set_id);
+   return err;
+}
+EXPORT_SYMBOL_GPL(mlx5_core_alloc_q_counter);
+
+int mlx5_core_dealloc_q_counter(struct mlx5_core_dev *dev, u16 counter_id)
+{
+   u32 in[MLX5_ST_SZ_DW(dealloc_q_counter_in)];
+   u32 out[MLX5_ST_SZ_DW(dealloc_q_counter_out)];
+
+   memset(in, 0, sizeof(in));
+   memset(out, 0, sizeof(out));
+
+   MLX5_SET(dealloc_q_counter_in, in, opcode,
+MLX5_CMD_OP_DEALLOC_Q_COUNTER);
+   MLX5_SET(dealloc_q_counter_in, in, counter_set_id, counter_id);
+   return mlx5_cmd_exec_check_status(dev, in, sizeof(in), out,
+ sizeof(out));
+}
+EXPORT_SYMBOL_GPL(mlx5_core_dealloc_q_counter);
+
+int mlx5_core_query_q_counter(struct mlx5_core_dev *dev, u16 counter_id,
+ int reset, void *out, int out_size)
+{
+   u32 in[MLX5_ST_SZ_DW(query_q_counter_in)];
+
+   memset(in, 0, sizeof(in));
+
+   MLX5_SET(query_q_counter_in, in, opcode, MLX5_CMD_OP_QUERY_Q_COUNTER);
+   MLX5_SET(query_q_counter_in, in, clear, reset);
+   MLX5_SET(query_q_counter_in, in, counter_set_id, counter_id);
+   return mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, out_size);
+}
+EXPORT_SYMBOL_GPL(mlx5_core_query_q_counter);
+
+int mlx5_core_query_out_of_buffer(struct mlx5_core_dev *dev, u16 counter_id,
+ u32 *out_of_buffer)
+{
+   int outlen = MLX5_ST_SZ_BYTES(query_q_counter_out);
+   void *out;
+   int err;
+
+   out = mlx5_vzalloc(outlen);
+   if (!out)
+   return -ENOMEM;
+
+   err = mlx5_core_query_q_counter(dev, counter_id, 0, out, outlen);
+   if (!err)
+   *out_of_buffer = MLX5_GET(query_q_counter_out, out,
+ out_of_buffer);
+
+   kvfree(out);
+   return err;
+}
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index cf031a3..6422102 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -668,6 +668,12 @@ int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, 
u32 *in, int inlen,
struct mlx5_core_qp *sq);
 void mlx5_core_destroy_sq_tracked(struct mlx5_core_dev *dev,
  struct mlx5_core_qp *sq);
+int mlx5_core_alloc_q_counter(struct mlx5_core_dev *dev, u16 *counter_id);
+int mlx5_core_dealloc_q_counter(struct mlx5_core_dev *dev, u16 counter_id);
+int mlx5_core_query_q_counter(struct mlx5_core_dev *dev, u16 counter_id,
+ int reset, void *out, int out_size);
+int mlx5_core_query_out_of_buffer(struct mlx5_core_dev *dev, u16 counter_id,
+ u32 *out_of_buffer);
 
 static inline const char *mlx5_qp_type_str(int type)
 {
-- 
1.7.1

[PATCH net-next V3 02/11] net/mlx5e: Allocate set of queue counters per netdev

2016-04-20 Thread Saeed Mahameed

From: Rana Shahout 

Connect all netdev RQs to this set of queue counters.
Also, add an "rx_out_of_buffer" counter to ethtool,
which indicates RX packet drops due to lack of receive
buffers.
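
The effect on the ethtool statistics layout can be sketched as follows. The formula mirrors mlx5e_get_sset_count() in the diff below. The counts for vport, pport, RQ and SQ statistics are passed in as parameters because this sketch does not depend on their values, and the Python function names are illustrative.

```python
NUM_Q_COUNTERS = 1  # only "rx_out_of_buffer" in this patch


def num_q_cntrs(q_counter):
    """Q counters are exposed only when a counter set id is non-zero."""
    return NUM_Q_COUNTERS * (1 if q_counter else 0)


def sset_count(q_counter, num_channels, num_tc,
               num_vport, num_pport, num_rq_stats, num_sq_stats):
    """Total number of ETH_SS_STATS entries reported to ethtool."""
    return (num_vport + num_pport + num_q_cntrs(q_counter)
            + num_channels * num_rq_stats
            + num_channels * num_tc * num_sq_stats)
```

Because the same expression guards the string table and the value table, a device without an allocated queue counter simply reports one statistic fewer, and names and values stay aligned.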

Signed-off-by: Rana Shahout 
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   11 +
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   11 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   42 +++-
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 879e627..c4ddbe8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -236,6 +236,15 @@ struct mlx5e_pport_stats {
__be64 RFC_2819_counters[NUM_RFC_2819_COUNTERS];
 };
 
+static const char qcounter_stats_strings[][ETH_GSTRING_LEN] = {
+   "rx_out_of_buffer",
+};
+
+struct mlx5e_qcounter_stats {
+   u32 rx_out_of_buffer;
+#define NUM_Q_COUNTERS 1
+};
+
 static const char rq_stats_strings[][ETH_GSTRING_LEN] = {
"packets",
"bytes",
@@ -293,6 +302,7 @@ struct mlx5e_sq_stats {
 struct mlx5e_stats {
struct mlx5e_vport_stats   vport;
struct mlx5e_pport_stats   pport;
+   struct mlx5e_qcounter_stats qcnt;
 };
 
 struct mlx5e_params {
@@ -575,6 +585,7 @@ struct mlx5e_priv {
struct net_device *netdev;
struct mlx5e_stats stats;
struct mlx5e_tstamp tstamp;
+   u16 q_counter;
 };
 
 #define MLX5E_NET_IP_ALIGN 2
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 68834b7..39c1902 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -165,6 +165,8 @@ static const struct {
},
 };
 
+#define MLX5E_NUM_Q_CNTRS(priv) (NUM_Q_COUNTERS * (!!priv->q_counter))
+
 static int mlx5e_get_sset_count(struct net_device *dev, int sset)
 {
struct mlx5e_priv *priv = netdev_priv(dev);
@@ -172,6 +174,7 @@ static int mlx5e_get_sset_count(struct net_device *dev, int 
sset)
switch (sset) {
case ETH_SS_STATS:
return NUM_VPORT_COUNTERS + NUM_PPORT_COUNTERS +
+  MLX5E_NUM_Q_CNTRS(priv) +
   priv->params.num_channels * NUM_RQ_STATS +
   priv->params.num_channels * priv->params.num_tc *
   NUM_SQ_STATS;
@@ -200,6 +203,11 @@ static void mlx5e_get_strings(struct net_device *dev,
strcpy(data + (idx++) * ETH_GSTRING_LEN,
   vport_strings[i]);
 
+   /* Q counters */
+   for (i = 0; i < MLX5E_NUM_Q_CNTRS(priv); i++)
+   strcpy(data + (idx++) * ETH_GSTRING_LEN,
+  qcounter_stats_strings[i]);
+
/* PPORT counters */
for (i = 0; i < NUM_PPORT_COUNTERS; i++)
strcpy(data + (idx++) * ETH_GSTRING_LEN,
@@ -240,6 +248,9 @@ static void mlx5e_get_ethtool_stats(struct net_device *dev,
for (i = 0; i < NUM_VPORT_COUNTERS; i++)
data[idx++] = ((u64 *)&priv->stats.vport)[i];
 
+   for (i = 0; i < MLX5E_NUM_Q_CNTRS(priv); i++)
+   data[idx++] = ((u32 *)&priv->stats.qcnt)[i];
+
for (i = 0; i < NUM_PPORT_COUNTERS; i++)
data[idx++] = be64_to_cpu(((__be64 *)&priv->stats.pport)[i]);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e0adb60..7fbe1ba 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -129,6 +129,17 @@ free_out:
kvfree(out);
 }
 
+static void mlx5e_update_q_counter(struct mlx5e_priv *priv)
+{
+   struct mlx5e_qcounter_stats *qcnt = &priv->stats.qcnt;
+
+   if (!priv->q_counter)
+   return;
+
+   mlx5_core_query_out_of_buffer(priv->mdev, priv->q_counter,
+ &qcnt->rx_out_of_buffer);
+}
+
 void mlx5e_update_stats(struct mlx5e_priv *priv)
 {
struct mlx5_core_dev *mdev = priv->mdev;
@@ -250,6 +261,8 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
   s->rx_csum_sw;
 
mlx5e_update_pport_counters(priv);
+   mlx5e_update_q_counter(priv);
+
 free_out:
kvfree(out);
 }
@@ -1055,6 +1068,7 @@ static void mlx5e_build_rq_param(struct mlx5e_priv *priv,
MLX5_SET(wq, wq, log_wq_stride,ilog2(sizeof(struct mlx5e_rx_wqe)));
MLX5_SET(wq, wq, log_wq_sz,priv->params.log_rq_size);
MLX5_SET(wq, wq, pd,   priv->pdn);
+   MLX5_SET(rqc, rqc, counter_set_id, priv->q_counter);

[PATCH net-next V3 03/11] net/mlx5e: Use only close NUMA node for default RSS

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

Distribute default RSS table uniformly over the rings of the
close NUMA node, instead of all available channels.
This way we enforce the preference of close rings over far ones.
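
A minimal Python sketch of the table construction in the diff below, assuming the core count of the close NUMA node is already known (the kernel code derives it from cpumask_of_node()). The function and parameter names are illustrative.

```python
def build_default_indir_rqt(length, num_channels, node_num_of_cores):
    """Spread RSS table entries round-robin over the usable channels.

    If the close NUMA node reports any cores, the number of channels
    used is capped at that core count; otherwise all channels are used.
    """
    if node_num_of_cores:
        num_channels = min(num_channels, node_num_of_cores)
    return [i % num_channels for i in range(length)]


if __name__ == "__main__":
    # 128-entry table, 16 channels, 8 cores on the close node.
    table = build_default_indir_rqt(128, 16, 8)
    print(sorted(set(table)))
```

Note that this only restricts the table to the lowest-numbered channels. Whether those channels actually run on cores of the close node depends on how channels are bound to CPUs, which is outside this patch.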

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |3 ++-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   15 +--
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c4ddbe8..7f19644 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -671,7 +671,8 @@ void mlx5e_build_tir_ctx_hash(void *tirc, struct mlx5e_priv 
*priv);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
-void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, int len,
+void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
+  u32 *indirection_rqt, int len,
   int num_channels);
 
 static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 39c1902..6f40ba4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -397,7 +397,7 @@ static int mlx5e_set_channels(struct net_device *dev,
mlx5e_close_locked(dev);
 
priv->params.num_channels = count;
-   mlx5e_build_default_indir_rqt(priv->params.indirection_rqt,
+   mlx5e_build_default_indir_rqt(priv->mdev, priv->params.indirection_rqt,
  MLX5E_INDIR_RQT_SIZE, count);
 
if (was_opened)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7fbe1ba..9b58ef6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2297,11 +2297,22 @@ static void mlx5e_ets_init(struct mlx5e_priv *priv)
 }
 #endif
 
-void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, int len,
+void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
+  u32 *indirection_rqt, int len,
   int num_channels)
 {
+   int node = mdev->priv.numa_node;
+   int node_num_of_cores;
int i;
 
+   if (node == -1)
+   node = first_online_node;
+
+   node_num_of_cores = cpumask_weight(cpumask_of_node(node));
+
+   if (node_num_of_cores)
+   num_channels = min_t(int, num_channels, node_num_of_cores);
+
for (i = 0; i < len; i++)
indirection_rqt[i] = i % num_channels;
 }
@@ -2333,7 +2344,7 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev 
*mdev,
netdev_rss_key_fill(priv->params.toeplitz_hash_key,
sizeof(priv->params.toeplitz_hash_key));
 
-   mlx5e_build_default_indir_rqt(priv->params.indirection_rqt,
+   mlx5e_build_default_indir_rqt(mdev, priv->params.indirection_rqt,
  MLX5E_INDIR_RQT_SIZE, num_channels);
 
priv->params.lro_wqe_sz=
-- 
1.7.1

[PATCH net-next V3 00/11] Mellanox 100G mlx5 driver receive path optimizations

2016-04-20 Thread Saeed Mahameed

Hello Dave,

Changes from V2:
- Rebased to 46e7b8d8d53b ("net: dsa: kill circular reference with slave priv")
- Updated: ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
    * Per Eric Dumazet's comment, we changed the driver memory handling
      scheme to work with order-0 pages rather than order-5, via split_page().
    * This means that an mlx5e RX skb can now hold one skb frag (or more,
      in case of HW LRO), each pointing to a 4K order-0 page, rather than
      one frag with an order-5 page.
- Updated: ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
    * Code refactoring and code reuse due to the split_page() mechanism;
      the MPWQE and fragmented MPWQE handling now look almost the same
      and share most of the code.
- In some cases we see 2%-3% packet rate degradation in comparison to
  the order-5 pages approach, due to split_page() CPU consumption, but we
  still see 3%-10% improvement in comparison to the current linear SKB approach.
- We believe that the driver memory scheme is now significantly less
  vulnerable to the memory DoS attack Eric pointed out.

Changes from V1:
- Rebased to efde611b0afa ("Merge branch 'nfp-next'")
- Dropped: ("net/mlx5: Refactor mlx5_core_mr to mkey")
    Already merged into 4.6 from rdma tree.
- Dropped: ("net/mlx5_core: Add ConnectX-5 to list of supported devices")
    Will be pushed to net as we want it in 4.6 release.
- Dropped: ("net/mlx5e: Change RX moderation period to be based on CQE")
    Will be pushed in a later series with full software based adaptive moderation.
- Added: ("net/mlx5e: Delay skb->data access")
    Small trivial optimization.
- Updated: ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
    Changed Striding RQ defaults to:
        > NUM WQEs = 16
        > Strides Per WQE = 1024
        > Stride Size = 128
- Updated: ("net/mlx5e: Use napi_alloc_skb for RX SKB allocations")
    Consider the IP packet alignment already done in napi_alloc_skb.

Changes from V0:
- Fixed a typo in commit message reported by Sergei
- Align SKB fragments truesize to stride size
- Use skb_add_rx_frag and remove the use of SKB_TRUESIZE
- Fix: # MTTs alignment on Power PC
- Fix: Free original (unaligned) pointer of MTT array
- Use dev_alloc_pages and dev_alloc_page
- Extend the stats.buff_alloc_err counter
- Reform the copying of packet header into skb linear data
- Add compiler hints for conditional statements
- Prefetch skb->data prior to copying packet header into it
- Rework: mlx5e_complete_rx_fragmented_mpwqe
- Handle SKB fragments before linear data
- Dropped ("net/mlx5e: Prefetch next RX CQE") for now 
- Added a small patch that adds ConnectX-5 devices to the list of supported devices
- Rebased to 1cdba550 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next")

This series includes some RX modifications and optimizations for
the mlx5 Ethernet driver.

From Rana, we have one patch that adds support for ConnectX-4
queue counters.

From Tariq, several patches centered on improving RX path message
rate and CPU and memory utilization. In each patch's commit message
you will find the performance improvement numbers related to that
specific patch.

In the 2nd patch we use a queue counter to report the "out of buffer"
dropped packet count, i.e. packets dropped due to lack of software
resources.

The 3rd patch changes the driver's default RSS table to be spread across
the cores of the close NUMA node only, for a better out-of-the-box experience.

In the 4th and 5th patches we use RX multi-packet WQE (Striding RQ)
for better memory utilization, especially when hardware LRO is
enabled, and for a better message rate with small packets.

In the 6th and 7th patches we add a fallback mechanism that uses fragmented
memory when allocating large WQE strides fails, using UMR
(User Memory Registration) and ICO (Internal Control Operations) SQs.

In the 8th to 11th patches we make some small modifications that show
some small extra improvements.

Thanks,
Saeed



Rana Shahout (1):
  net/mlx5e: Allocate set of queue counters per netdev

Saeed Mahameed (1):
  net/mlx5e: Delay skb->data access

Tariq Toukan (9):
  net/mlx5: Introduce device queue counters
  net/mlx5e: Use only close NUMA node for default RSS
  net/mlx5e: Use function pointers for RX data path handling
  net/mlx5e: Support RX multi-packet WQE (Striding RQ)
  net/mlx5e: Added ICO SQs
  net/mlx5e: Add fragmented memory support for RX multi packet WQE
  net/mlx5e: Use

[PATCH net-next V3 04/11] net/mlx5e: Use function pointers for RX data path handling

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

In preparation for the Striding RQ feature, which will need its own
RX handlers, call the RX data path handlers through function pointers.
This patch does not change any functionality.
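
As an illustration of the indirection described above, here is a minimal
user-space sketch. It is not the driver code: every demo_* type, field and
function is a hypothetical stand-in for struct mlx5e_rq and its
handle_rx_cqe / alloc_wqe pointers.

```c
/* Hypothetical stand-ins for the driver's RQ, CQE and WQE types. */
struct demo_rq;
struct demo_cqe { unsigned int byte_cnt; };
struct demo_wqe { unsigned long addr; };

typedef void (*demo_fp_handle_rx_cqe)(struct demo_rq *rq,
				      struct demo_cqe *cqe);
typedef int (*demo_fp_alloc_wqe)(struct demo_rq *rq, struct demo_wqe *wqe,
				 unsigned short ix);

struct demo_rq {
	demo_fp_handle_rx_cqe handle_rx_cqe;
	demo_fp_alloc_wqe     alloc_wqe;
	unsigned long         rx_bytes;
	unsigned int          posted;
};

/* Default completion handler: just accounts the received bytes. */
void demo_handle_rx_cqe(struct demo_rq *rq, struct demo_cqe *cqe)
{
	rq->rx_bytes += cqe->byte_cnt;
}

/* Default allocator: fills in a fake buffer address and counts the post. */
int demo_alloc_rx_wqe(struct demo_rq *rq, struct demo_wqe *wqe,
		      unsigned short ix)
{
	wqe->addr = 0x1000UL * (ix + 1UL);
	rq->posted++;
	return 0;
}

/* Handlers are chosen once, at RQ creation time. */
void demo_create_rq(struct demo_rq *rq)
{
	rq->handle_rx_cqe = demo_handle_rx_cqe;
	rq->alloc_wqe = demo_alloc_rx_wqe;
	rq->rx_bytes = 0;
	rq->posted = 0;
}

/* The data path only ever goes through the pointers. */
int demo_poll_once(struct demo_rq *rq, struct demo_cqe *cqe,
		   unsigned short ix)
{
	struct demo_wqe wqe;

	rq->handle_rx_cqe(rq, cqe);
	return rq->alloc_wqe(rq, &wqe, ix);
}
```

A different RQ type would only need to install other handlers in
demo_create_rq(); demo_poll_once() stays unchanged.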

Signed-off-by: Tariq Toukan 
Signed-off-by: Achiad Shochat 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   33 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   74 +++--
 3 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7f19644..61e249d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -72,6 +72,17 @@
 #define MLX5E_SQ_BF_BUDGET 16
 
 #define MLX5E_NUM_MAIN_GROUPS 9
+#define MLX5E_NET_IP_ALIGN 2
+
+struct mlx5e_tx_wqe {
+   struct mlx5_wqe_ctrl_seg ctrl;
+   struct mlx5_wqe_eth_seg  eth;
+};
+
+struct mlx5e_rx_wqe {
+   struct mlx5_wqe_srq_next_seg  next;
+   struct mlx5_wqe_data_seg  data;
+};
 
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
@@ -357,6 +368,12 @@ struct mlx5e_cq {
struct mlx5_wq_ctrl  wq_ctrl;
 } cacheline_aligned_in_smp;
 
+struct mlx5e_rq;
+typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
+  struct mlx5_cqe64 *cqe);
+typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe 
*wqe,
+ u16 ix);
+
 struct mlx5e_rq {
/* data path */
struct mlx5_wq_ll  wq;
@@ -368,6 +385,8 @@ struct mlx5e_rq {
struct mlx5e_tstamp   *tstamp;
struct mlx5e_rq_stats  stats;
struct mlx5e_cq  cq;
+   mlx5e_fp_handle_rx_cqe handle_rx_cqe;
+   mlx5e_fp_alloc_wqe alloc_wqe;
 
unsigned long  state;
int  ix;
@@ -588,18 +607,6 @@ struct mlx5e_priv {
u16 q_counter;
 };
 
-#define MLX5E_NET_IP_ALIGN 2
-
-struct mlx5e_tx_wqe {
-   struct mlx5_wqe_ctrl_seg ctrl;
-   struct mlx5_wqe_eth_seg  eth;
-};
-
-struct mlx5e_rx_wqe {
-   struct mlx5_wqe_srq_next_seg  next;
-   struct mlx5_wqe_data_seg  data;
-};
-
 enum mlx5e_link_mode {
MLX5E_1000BASE_CX_SGMII  = 0,
MLX5E_1000BASE_KX= 1,
@@ -642,7 +649,9 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum 
mlx5_event event);
 int mlx5e_napi_poll(struct napi_struct *napi, int budget);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
+int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_update_stats(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9b58ef6..23ba12c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -357,6 +357,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
cpu_to_be32(byte_count | MLX5_HW_START_PADDING);
}
 
+   rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
+   rq->alloc_wqe = mlx5e_alloc_rx_wqe;
rq->pdev= c->pdev;
rq->netdev  = c->netdev;
rq->tstamp  = &priv->tstamp;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 58d4e2f..d7ccced 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -42,8 +42,7 @@ static inline bool mlx5e_rx_hw_stamp(struct mlx5e_tstamp 
*tstamp)
return tstamp->hwtstamp_config.rx_filter == HWTSTAMP_FILTER_ALL;
 }
 
-static inline int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq,
-struct mlx5e_rx_wqe *wqe, u16 ix)
+int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 {
struct sk_buff *skb;
dma_addr_t dma_addr;
@@ -87,7 +86,7 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
while (!mlx5_wq_ll_is_full(wq)) {
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
 
-   if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, wq->head)))
+   if (unlikely(rq->alloc_wqe(rq, wqe, wq->head)))
break;
 
mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
@@ -229,50 +228,55 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 
*cqe,
skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK;
 }
 
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct

[PATCH net-next V3 10/11] net/mlx5e: Delay skb->data access

2016-04-20 Thread Saeed Mahameed

Move mlx5e_handle_csum and eth_type_trans to the end of
mlx5e_build_rx_skb to gain some more time before accessing
skb->data, to reduce cache misses.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |7 +++
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5bdcc0b..ee5fa16 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -573,10 +573,6 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 
*cqe,
if (unlikely(mlx5e_rx_hw_stamp(tstamp)))
mlx5e_fill_hwstamp(tstamp, get_cqe_ts(cqe), skb_hwtstamps(skb));
 
-   mlx5e_handle_csum(netdev, cqe, rq, skb, !!lro_num_seg);
-
-   skb->protocol = eth_type_trans(skb, netdev);
-
skb_record_rx_queue(skb, rq->ix);
 
if (likely(netdev->features & NETIF_F_RXHASH))
@@ -587,6 +583,9 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 
*cqe,
   be16_to_cpu(cqe->vlan_info));
 
skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK;
+
+   mlx5e_handle_csum(netdev, cqe, rq, skb, !!lro_num_seg);
+   skb->protocol = eth_type_trans(skb, netdev);
 }
 
 static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
-- 
1.7.1

[PATCH net-next V3 09/11] net/mlx5e: Remove redundant barrier

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

The bit-op operation one line before is an explicit barrier
by itself.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index a3fd0f5..c38781f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -147,7 +147,6 @@ void mlx5e_completion_event(struct mlx5_core_cq *mcq)
struct mlx5e_cq *cq = container_of(mcq, struct mlx5e_cq, mcq);
 
set_bit(MLX5E_CHANNEL_NAPI_SCHED, &cq->channel->flags);
-   barrier();
napi_schedule(cq->napi);
 }
 
-- 
1.7.1

[PATCH net-next V3 11/11] net/mlx5e: Add ethtool counter for RX buffer allocation failures

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

Counts the number of RX buffer allocation failures and shows it
in ethtool statistics.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |8 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   11 +--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 303e6cd..6e24e82 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -189,6 +189,7 @@ static const char vport_strings[][ETH_GSTRING_LEN] = {
"rx_wqe_err",
"rx_mpwqe_filler",
"rx_mpwqe_frag",
+   "rx_buff_alloc_err",
 };
 
 struct mlx5e_vport_stats {
@@ -232,8 +233,9 @@ struct mlx5e_vport_stats {
u64 rx_wqe_err;
u64 rx_mpwqe_filler;
u64 rx_mpwqe_frag;
+   u64 rx_buff_alloc_err;
 
-#define NUM_VPORT_COUNTERS 37
+#define NUM_VPORT_COUNTERS 38
 };
 
 static const char pport_strings[][ETH_GSTRING_LEN] = {
@@ -329,6 +331,7 @@ static const char rq_stats_strings[][ETH_GSTRING_LEN] = {
"wqe_err",
"mpwqe_filler",
"mpwqe_frag",
+   "buff_alloc_err",
 };
 
 struct mlx5e_rq_stats {
@@ -341,7 +344,8 @@ struct mlx5e_rq_stats {
u64 wqe_err;
u64 mpwqe_filler;
u64 mpwqe_frag;
-#define NUM_RQ_STATS 9
+   u64 buff_alloc_err;
+#define NUM_RQ_STATS 10
 };
 
 static const char sq_stats_strings[][ETH_GSTRING_LEN] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9b17bc0..d485d1e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -180,6 +180,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
s->rx_wqe_err   = 0;
s->rx_mpwqe_filler  = 0;
s->rx_mpwqe_frag= 0;
+   s->rx_buff_alloc_err= 0;
for (i = 0; i < priv->params.num_channels; i++) {
rq_stats = &priv->channel[i]->rq.stats;
 
@@ -192,6 +193,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
s->rx_wqe_err   += rq_stats->wqe_err;
s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
s->rx_mpwqe_frag   += rq_stats->mpwqe_frag;
+   s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
 
for (j = 0; j < priv->params.num_tc; j++) {
sq_stats = &priv->channel[i]->sq[j].stats;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index ee5fa16..918b7c7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -447,9 +447,14 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 
while (!mlx5_wq_ll_is_full(wq)) {
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
+   int err;
 
-   if (unlikely(rq->alloc_wqe(rq, wqe, wq->head)))
+   err = rq->alloc_wqe(rq, wqe, wq->head);
+   if (unlikely(err)) {
+   if (err != -EBUSY)
+   rq->stats.buff_alloc_err++;
break;
+   }
 
mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
}
@@ -701,8 +706,10 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct 
mlx5_cqe64 *cqe)
skb = napi_alloc_skb(rq->cq.napi,
 ALIGN(MLX5_MPWRQ_SMALL_PACKET_THRESHOLD,
   sizeof(long)));
-   if (unlikely(!skb))
+   if (unlikely(!skb)) {
+   rq->stats.buff_alloc_err++;
goto mpwrq_cqe_out;
+   }
 
prefetch(skb->data);
cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
-- 
1.7.1
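
The counting rule in the mlx5e_post_rx_wqes() hunk above can be sketched in
user space as follows. All demo_* names are hypothetical. The reading that
-EBUSY marks a temporary condition rather than an allocation failure is an
assumption; the patch itself only shows that -EBUSY is not counted.

```c
#include <errno.h>

/* Hypothetical per-RQ statistics, mirroring only the new counter. */
struct demo_rq_stats {
	unsigned long buff_alloc_err;
};

/* Hypothetical allocator: returns 0 on success or a negative errno. */
typedef int (*demo_alloc_fn)(unsigned int ix);

/*
 * Post up to 'room' buffers and stop at the first failure.
 * Every error except -EBUSY bumps buff_alloc_err.
 * Returns the number of buffers posted.
 */
unsigned int demo_post_rx_wqes(struct demo_rq_stats *stats,
			       demo_alloc_fn alloc, unsigned int room)
{
	unsigned int posted = 0;

	while (posted < room) {
		int err = alloc(posted);

		if (err) {
			if (err != -EBUSY)
				stats->buff_alloc_err++;
			break;
		}
		posted++;
	}
	return posted;
}

/* Two fake allocators for illustration. */
int demo_alloc_nomem_at_2(unsigned int ix)
{
	return ix == 2 ? -ENOMEM : 0;
}

int demo_alloc_busy_at_1(unsigned int ix)
{
	return ix == 1 ? -EBUSY : 0;
}
```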

[PATCH net-next V3 08/11] net/mlx5e: Use napi_alloc_skb for RX SKB allocations

2016-04-20 Thread Saeed Mahameed

From: Tariq Toukan 

Instead of netdev_alloc_skb, use the napi_alloc_skb function, which is
designed to allocate skbuffs for RX in a channel-specific NAPI instance
and implies the IP packet alignment.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   12 +---
 3 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c99fdff..303e6cd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -93,7 +93,6 @@
 #define MLX5E_SQ_BF_BUDGET 16
 
 #define MLX5E_NUM_MAIN_GROUPS 9
-#define MLX5E_NET_IP_ALIGN 2
 
 static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 942829e..9b17bc0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -373,8 +373,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->wqe_sz = (priv->params.lro_en) ?
priv->params.lro_wqe_sz :
MLX5E_SW2HW_MTU(priv->netdev->mtu);
-   rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz + MLX5E_NET_IP_ALIGN);
-   byte_count = rq->wqe_sz - MLX5E_NET_IP_ALIGN;
+   rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
+   byte_count = rq->wqe_sz;
byte_count |= MLX5_HW_START_PADDING;
}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d71919c..5bdcc0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -47,7 +47,7 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct 
mlx5e_rx_wqe *wqe, u16 ix)
struct sk_buff *skb;
dma_addr_t dma_addr;
 
-   skb = netdev_alloc_skb(rq->netdev, rq->wqe_sz);
+   skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
if (unlikely(!skb))
return -ENOMEM;
 
@@ -61,10 +61,8 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct 
mlx5e_rx_wqe *wqe, u16 ix)
if (unlikely(dma_mapping_error(rq->pdev, dma_addr)))
goto err_free_skb;
 
-   skb_reserve(skb, MLX5E_NET_IP_ALIGN);
-
*((dma_addr_t *)skb->cb) = dma_addr;
-   wqe->data.addr = cpu_to_be64(dma_addr + MLX5E_NET_IP_ALIGN);
+   wqe->data.addr = cpu_to_be64(dma_addr);
wqe->data.lkey = rq->mkey_be;
 
rq->skb[ix] = skb;
@@ -701,9 +699,9 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct 
mlx5_cqe64 *cqe)
goto mpwrq_cqe_out;
}
 
-   skb = netdev_alloc_skb(rq->netdev,
-  ALIGN(MLX5_MPWRQ_SMALL_PACKET_THRESHOLD,
-sizeof(long)));
+   skb = napi_alloc_skb(rq->cq.napi,
+ALIGN(MLX5_MPWRQ_SMALL_PACKET_THRESHOLD,
+  sizeof(long)));
if (unlikely(!skb))
goto mpwrq_cqe_out;
 
-- 
1.7.1

Re: [PATCH 02/19] io-mapping: Specify mapping size for io_mapping_map_wc()

2016-04-20 Thread Luis R. Rodriguez

On Wed, Apr 20, 2016 at 07:42:13PM +0100, Chris Wilson wrote:
> The ioremap() hidden behind the io_mapping_map_wc() convenience helper
> can be used for remapping multiple pages. Extend the helper so that
> future callers can use it for larger ranges.
> 
> Signed-off-by: Chris Wilson 
> Cc: Tvrtko Ursulin 
> Cc: Daniel Vetter 
> Cc: Jani Nikula 
> Cc: David Airlie 
> Cc: Yishai Hadas 
> Cc: Dan Williams 
> Cc: Ingo Molnar 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: David Hildenbrand 
> Cc: Luis R. Rodriguez 
> Cc: intel-...@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: netdev@vger.kernel.org
> Cc: linux-r...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org

We have 2 callers today. Can you envision this API getting more
options in the future? If so, to avoid the pain of collateral
evolutions, I suggest passing a descriptor with the required
settings / options. This lets you evolve the API without needing
to go in and modify old users. If you choose not to, that's fine too;
I just figured I'd chime in, as I've seen the pain with other APIs,
and I'm putting an end to the needless set of collateral evolutions
this way.

Other than that possible API optimization:

Reviewed-by: Luis R. Rodriguez 

  Luis
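
A rough sketch of the descriptor idea, in user-space C. Nothing here exists
in include/linux/io-mapping.h: struct demo_map_desc, demo_map_wc() and the
"size 0 means one page" default are all hypothetical, chosen only to show
how a new option can be added without touching existing callers.

```c
#define DEMO_PAGE_SIZE 4096UL

/* Hypothetical mapping object: a base address and a total size. */
struct demo_mapping {
	unsigned long base;
	unsigned long size;
};

/*
 * Hypothetical descriptor. New options become new fields; callers that
 * use designated initializers leave them zero and keep working.
 */
struct demo_map_desc {
	unsigned long offset;
	unsigned long size;	/* 0 selects the old one-page behaviour */
};

/*
 * Returns the start of the mapped range and stores its length in *len,
 * or returns 0 if the request does not fit inside the mapping.
 */
unsigned long demo_map_wc(const struct demo_mapping *m,
			  const struct demo_map_desc *d,
			  unsigned long *len)
{
	unsigned long size = d->size ? d->size : DEMO_PAGE_SIZE;

	if (d->offset >= m->size || size > m->size - d->offset)
		return 0;

	*len = size;
	return m->base + d->offset;
}
```

An old-style caller would pass { .offset = off } and get a single page; a
caller that wants a larger range sets .size as well.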

Re: [PATCH iproute2 WIP] ifstat: use new RTM_GETSTATS api

2016-04-20 Thread Stephen Hemminger

On Wed, 20 Apr 2016 09:16:15 -0700
Roopa Prabhu  wrote:

> +int rtnl_wilddump_stats_req_filter(struct rtnl_handle *rth, int family, int 
> type,
> +__u32 filt_mask)
> +{
> + struct {
> + struct nlmsghdr nlh;
> + struct if_stats_msg ifsm;
> + } req;

Please use C99 initialization instead of memset in new code.

> + int err;
> +
> + memset(&req, 0, sizeof(req));
> + req.nlh.nlmsg_len = sizeof(req);
> + req.nlh.nlmsg_type = type;
> + req.nlh.nlmsg_flags = NLM_F_DUMP|NLM_F_REQUEST;
> + req.nlh.nlmsg_pid = 0;
> + req.nlh.nlmsg_seq = rth->dump = ++rth->seq;
> + req.ifsm.family = family;
> + req.ifsm.filter_mask = filt_mask;
> +
> + err = send(rth->fd, (void*)&req, sizeof(req), 0);
> +
> + return err;

Why not just:
return send(rth->fd, &req, sizeof(req), 0);

> +}
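
For reference, a small stand-alone sketch of the C99 initialization style
requested above. The demo_* structs and DEMO_F_* flags are hypothetical
stand-ins, not the real struct nlmsghdr / struct if_stats_msg layouts.
Members that are not named in the initializer are zero-initialized, which is
what replaces the memset().

```c
/* Hypothetical flag values, for illustration only. */
#define DEMO_F_REQUEST 0x001
#define DEMO_F_DUMP    0x300

/* Hypothetical header and payload, loosely shaped like a netlink request. */
struct demo_hdr {
	unsigned int   len;
	unsigned short type;
	unsigned short flags;
	unsigned int   seq;
	unsigned int   pid;
};

struct demo_stats_msg {
	unsigned char  family;
	unsigned char  pad1;
	unsigned short pad2;
	unsigned int   ifindex;
	unsigned int   filter_mask;
};

struct demo_req {
	struct demo_hdr       nlh;
	struct demo_stats_msg ifsm;
};

/* Build the request with designated initializers instead of memset(). */
struct demo_req demo_build_req(unsigned short type, unsigned char family,
			       unsigned int filt_mask, unsigned int seq)
{
	struct demo_req req = {
		.nlh.len = sizeof(req),
		.nlh.type = type,
		.nlh.flags = DEMO_F_DUMP | DEMO_F_REQUEST,
		.nlh.seq = seq,
		.ifsm.family = family,
		.ifsm.filter_mask = filt_mask,
		/* pid, pad1, pad2 and ifindex are implicitly zero */
	};

	return req;
}
```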

Re: [PATCH V2] net: ethernet: mellanox: correct page conversion

2016-04-20 Thread Sinan Kaya

On 4/20/2016 2:40 PM, Eran Ben Elisha wrote:
>>
>> It has been 1.5 years since I reported the problem. We came up with three
>> different solutions this week. I'd like to see a version of the solution
>> to get merged until Mellanox comes up with a better solution with another
>> patch. My proposal is to use this one.
>>
> 
> We will post our suggestion here in the following days.
> 

Thanks, please keep me in CC. I'm not normally subscribed to this list.
I can post a tested-by after testing.

-- 
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project

[PATCH 02/19] io-mapping: Specify mapping size for io_mapping_map_wc()

2016-04-20 Thread Chris Wilson

The ioremap() hidden behind the io_mapping_map_wc() convenience helper
can be used for remapping multiple pages. Extend the helper so that
future callers can use it for larger ranges.

Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
Cc: Daniel Vetter 
Cc: Jani Nikula 
Cc: David Airlie 
Cc: Yishai Hadas 
Cc: Dan Williams 
Cc: Ingo Molnar 
Cc: "Peter Zijlstra (Intel)" 
Cc: David Hildenbrand 
Cc: Luis R. Rodriguez 
Cc: intel-...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
Cc: netdev@vger.kernel.org
Cc: linux-r...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 drivers/gpu/drm/i915/intel_overlay.c|  3 ++-
 drivers/net/ethernet/mellanox/mlx4/pd.c |  4 +++-
 include/linux/io-mapping.h  | 10 +++---
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_overlay.c 
b/drivers/gpu/drm/i915/intel_overlay.c
index 9746b9841c13..0d5a376878d3 100644
--- a/drivers/gpu/drm/i915/intel_overlay.c
+++ b/drivers/gpu/drm/i915/intel_overlay.c
@@ -198,7 +198,8 @@ intel_overlay_map_regs(struct intel_overlay *overlay)
regs = (struct overlay_registers __iomem 
*)overlay->reg_bo->phys_handle->vaddr;
else
regs = io_mapping_map_wc(ggtt->mappable,
-overlay->flip_addr);
+overlay->flip_addr,
+PAGE_SIZE);
 
return regs;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/pd.c 
b/drivers/net/ethernet/mellanox/mlx4/pd.c
index b3cc3ab63799..6fc156a3918d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/pd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/pd.c
@@ -205,7 +205,9 @@ int mlx4_bf_alloc(struct mlx4_dev *dev, struct mlx4_bf *bf, 
int node)
goto free_uar;
}
 
-   uar->bf_map = io_mapping_map_wc(priv->bf_mapping, uar->index << 
PAGE_SHIFT);
+   uar->bf_map = io_mapping_map_wc(priv->bf_mapping,
+   uar->index << PAGE_SHIFT,
+   PAGE_SIZE);
if (!uar->bf_map) {
err = -ENOMEM;
goto unamp_uar;
diff --git a/include/linux/io-mapping.h b/include/linux/io-mapping.h
index e399029b68c5..645ad06b5d52 100644
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -100,14 +100,16 @@ io_mapping_unmap_atomic(void __iomem *vaddr)
 }
 
 static inline void __iomem *
-io_mapping_map_wc(struct io_mapping *mapping, unsigned long offset)
+io_mapping_map_wc(struct io_mapping *mapping,
+ unsigned long offset,
+ unsigned long size)
 {
resource_size_t phys_addr;
 
BUG_ON(offset >= mapping->size);
phys_addr = mapping->base + offset;
 
-   return ioremap_wc(phys_addr, PAGE_SIZE);
+   return ioremap_wc(phys_addr, size);
 }
 
 static inline void
@@ -155,7 +157,9 @@ io_mapping_unmap_atomic(void __iomem *vaddr)
 
 /* Non-atomic map/unmap */
 static inline void __iomem *
-io_mapping_map_wc(struct io_mapping *mapping, unsigned long offset)
+io_mapping_map_wc(struct io_mapping *mapping,
+ unsigned long offset,
+ unsigned long size)
 {
return ((char __force __iomem *) mapping) + offset;
 }
-- 
2.8.1

Re: [PATCH V2] net: ethernet: mellanox: correct page conversion

2016-04-20 Thread Eran Ben Elisha

>
> It has been 1.5 years since I reported the problem. We came up with three
> different solutions this week. I'd like to see a version of the solution
> to get merged until Mellanox comes up with a better solution with another
> patch. My proposal is to use this one.
>

We will post our suggestion here in the following days.

[PATCH net-next 0/2] net: bcmsysport: utilize newer NAPI APIs

2016-04-20 Thread Florian Fainelli

Hi David, Eric, Petri,

These two patches are very analogous to what was already submitted for
BCMGENET and switch the SYSTEMPORT driver to using __napi_schedule_irqoff()
and napi_complete_done() for the RX NAPI context.

Florian Fainelli (2):
  net: bcmsysport: use __napi_schedule_irqoff()
  net: bcmsysport: use napi_complete_done()

 drivers/net/ethernet/broadcom/bcmsysport.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
2.1.0

[PATCH net-next 2/2] net: bcmsysport: use napi_complete_done()

2016-04-20 Thread Florian Fainelli

By using napi_complete_done(), we allow fine tuning of
/sys/class/net/ethX/gro_flush_timeout for higher GRO aggregation
efficiency for a Gbit NIC.

Check commit 24d2e4a50737 ("tg3: use napi_complete_done()") for details.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 9e3ec739d860..30b0c2895a56 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -831,7 +831,7 @@ static int bcm_sysport_poll(struct napi_struct *napi, int 
budget)
rdma_writel(priv, priv->rx_c_index, RDMA_CONS_INDEX);
 
if (work_done < budget) {
-   napi_complete(napi);
+   napi_complete_done(napi, work_done);
/* re-enable RX interrupts */
intrl2_0_mask_clear(priv, INTRL2_0_RDMA_MBDONE);
}
-- 
2.1.0

[PATCH net-next 1/2] net: bcmsysport: use __napi_schedule_irqoff()

2016-04-20 Thread Florian Fainelli

Both bcm_sysport_tx_isr() and bcm_sysport_rx_isr() run in hard IRQ
context, so we do not need to block IRQs again.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 993c780bdfab..9e3ec739d860 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -873,7 +873,7 @@ static irqreturn_t bcm_sysport_rx_isr(int irq, void *dev_id)
if (likely(napi_schedule_prep(&priv->napi))) {
/* disable RX interrupts */
intrl2_0_mask_set(priv, INTRL2_0_RDMA_MBDONE);
-   __napi_schedule(&priv->napi);
+   __napi_schedule_irqoff(&priv->napi);
}
}
 
@@ -916,7 +916,7 @@ static irqreturn_t bcm_sysport_tx_isr(int irq, void *dev_id)
 
if (likely(napi_schedule_prep(&txr->napi))) {
intrl2_1_mask_set(priv, BIT(ring));
-   __napi_schedule(&txr->napi);
+   __napi_schedule_irqoff(&txr->napi);
}
}
 
-- 
2.1.0

Re: [PATCH net 4/4] net/mlx4_en: Split SW RX dropped counter per RX ring

2016-04-20 Thread Florian Fainelli

On 20/04/16 08:00, Or Gerlitz wrote:
> On 4/20/2016 5:56 PM, Eric Dumazet wrote:
>>> >Fixes: ab35da16 ('net/mlx4_en: Moderate ethtool callback to
>>> [...] ')
>>> >Signed-off-by: Eran Ben Elisha
>>> >Reported-by: Brenden Blanco
>>> >Signed-off-by: Saeed Mahameed
>>> >Signed-off-by: Or Gerlitz
>>> >---
>> Reported-by: Eric Dumazet
>>
>> (http://www.spinics.net/lists/netdev/msg371318.html  )
> 
> Hi Eric,
> 
> Just to be sure, you'd like me to re-spin this and fix the reporter name?

There is no need for that, patchwork amends Reported-by (and a bunch of
other tags) automatically when somebody replies to the message, see the
resulting mbox for this patch:

http://patchwork.ozlabs.org/patch/612664/mbox/
-- 
Florian

Re: [RFC PATCH 2/5] mlx5: Add support for UDP tunnel segmentation with outer checksum offload

2016-04-20 Thread Alexander Duyck

On Wed, Apr 20, 2016 at 10:40 AM, Saeed Mahameed
 wrote:
> On Tue, Apr 19, 2016 at 10:06 PM, Alexander Duyck  wrote:
>> This patch assumes that the mlx5 hardware will ignore existing IPv4/v6
>> header fields for length and checksum as well as the length and checksum
>> fields for outer UDP headers.
>>
>> I have no means of testing this as I do not have any mlx5 hardware but
>> thought I would submit it as an RFC to see if anyone out there wants to
>> test this and see if this does in fact enable this functionality allowing
>> us to segment UDP tunneled frames that have an outer checksum.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
>> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index e0adb604f461..57d8da796d50 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -2390,13 +2390,18 @@ static void mlx5e_build_netdev(struct net_device 
>> *netdev)
>> netdev->hw_features  |= NETIF_F_HW_VLAN_CTAG_FILTER;
>>
>> if (mlx5e_vxlan_allowed(mdev)) {
>> -   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
>> +   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL |
>> +  NETIF_F_GSO_UDP_TUNNEL_CSUM |
>> +  NETIF_F_GSO_PARTIAL;
>> netdev->hw_enc_features |= NETIF_F_IP_CSUM;
>> netdev->hw_enc_features |= NETIF_F_RXCSUM;
>> netdev->hw_enc_features |= NETIF_F_TSO;
>> netdev->hw_enc_features |= NETIF_F_TSO6;
>> netdev->hw_enc_features |= NETIF_F_RXHASH;
>> netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
>> +   netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM |
>> +  NETIF_F_GSO_PARTIAL;
>> +   netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
>> }
>>
>> netdev->features  = netdev->hw_features;
>>
>
> Hi Alex,
>
> Adding Matt, VxLAN feature owner from Mellanox,
> Matt, please correct me if I am wrong, but we already tested GSO VxLAN
> and we saw the TCP/IP checksum offloads for both inner and outer
> headers handled by the hardware.
>
> And looking at mlx5e_sq_xmit:
>
> if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> if (skb->encapsulation) {
> eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> MLX5_ETH_WQE_L4_INNER_CSUM;
> sq->stats.csum_offload_inner++;
> } else {
> eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> }
>
> We enable inner/outer hardware checksumming unconditionally without
> looking at the features Alex is suggesting in this patch,
> Alex, can you elaborate more on the meaning of those features ? and
> why would it work for us without declaring them ?

Well right now the feature list exposed by the device indicates that
TSO is not used if a VxLAN tunnel has a checksum in an outer header.
Since that is not exposed currently that is completely offloaded in
software via GSO.

What GSO partial does is allow us to treat GSO for tunnels with
checksum like GSO for tunnels without checksum, by precomputing
the UDP checksum as though the frame had already been segmented. It
also restricts us to an even multiple of MSS bytes to be segmented
between all the frames.  One side effect though is that all of the IP
and UDP header fields are also precomputed, but from what I can tell
it looks like the values that would be changed by a change in length
are ignored or overwritten by the hardware and driver anyway.

- Alex
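
The "even multiple of MSS bytes" restriction mentioned above comes down to
simple arithmetic, sketched here. This is not the kernel's GSO code; the
demo_* helpers are hypothetical, and what happens to the bytes left over
after rounding is outside the sketch.

```c
/*
 * Largest payload length that is an exact multiple of mss.
 * Assumes mss > 0.
 */
unsigned int demo_gso_partial_len(unsigned int payload_len, unsigned int mss)
{
	return payload_len - (payload_len % mss);
}

/* Number of equal-sized segments that payload would be split into. */
unsigned int demo_gso_partial_segs(unsigned int payload_len, unsigned int mss)
{
	return payload_len / mss;
}
```

With a 65000 byte payload and an MSS of 1400, this gives 46 segments
covering 64400 bytes, with 600 bytes not covered. Because every segment then
has the same length, length-dependent header fields can be computed once for
all of them.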

Re: Davicom DM9162 PHY supported in the kernel?

2016-04-20 Thread Florian Fainelli

Hi,

On 20/04/16 08:21, Amr Bekhit wrote:
> Hello,
> 
> I'm using an embedded Linux board based on an AT91SAM9X25 that uses the
> Davicom DM9162IEP PHY chip. I'm struggling to get packets out on the
> wire and I'm suspecting that I might have an issue between the AT91 MAC
> and the PHY chip. I've looked through the kernel config options and the
> kernel already has compiled-in support for the Davicom PHYs, however I
> noticed that according to the help text, only the dm9161e and dm9131
> chips are supported, which may indicate why my ethernet isn't working. I
> was wondering whether the DM9162 is backwards compatible with the
> existing driver? I'm currently using the mainline kernel 4.3. (p.s. I
> know the hardware works fine since I have no problem transferring files
> using tftp via u-boot).

Well, u-boot is a very simplistic networking stack, there could be tons
of issues that get under the radar because it cannot report them
properly, but let's assume it works so you have something to compare
against.

The DM9162 should be very similar to the DM9161, so the first thing
might be trying to add the PHY ID (32-bits OUI) to the matching table in
drivers/net/phy/davicom.c, and make it configure the PHY through
dm9161_config_init() since that looks at the PHY interface (MII, RMII
etc.) and does a bit of configuration here.

Right now, chances are that you are running with the Generic PHY driver
which has no clue about Davicom specific programming (if any). There
could also be board-level fixups required (adjusting trace lengths, if
you are using a RGMII interface for instance), etc.
-- 
Florian
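
To illustrate the matching-table idea, here is a simplified user-space
sketch of comparing a PHY ID against table entries under a mask. The demo_*
names and all ID values are made-up placeholders, not real Davicom IDs; a
real ID would have to come from the datasheet or from reading the PHY's ID
registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: an ID, the mask of bits that must match. */
struct demo_phy_driver {
	uint32_t    phy_id;
	uint32_t    phy_id_mask;
	const char *name;
};

/* Placeholder IDs; the low nibble (revision) is ignored by the mask. */
const struct demo_phy_driver demo_table[] = {
	{ 0x11110010, 0xfffffff0, "demo-phy-a" },
	{ 0x11110020, 0xfffffff0, "demo-phy-b" },
};

/*
 * Return the first entry whose masked ID equals the masked device ID,
 * or NULL, in which case a generic driver would be the fallback.
 */
const struct demo_phy_driver *
demo_phy_match(const struct demo_phy_driver *tbl, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t mask = tbl[i].phy_id_mask;

		if ((tbl[i].phy_id & mask) == (id & mask))
			return &tbl[i];
	}
	return NULL;
}
```

A chip whose ID is missing from the table falls through to NULL, which
corresponds to the Generic PHY driver case described above.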

Re: [PATCH] MAINTAINERS: net: add entry for TI Ethernet Switch drivers

2016-04-20 Thread Tony Lindgren

* Grygorii Strashko  [160420 09:19]:
> On 04/20/2016 05:23 PM, Tony Lindgren wrote:
> > * Grygorii Strashko  [160420 04:26]:
> >> Add record for TI Ethernet Switch Driver CPSW/CPDMA/MDIO HW
> >> (am33/am43/am57/dr7/davinci) to ensure that related patches
> >> will go through dedicated linux-omap list.
> >>
> >> Also add Mugunthan as maintainer and myself as the reviewer.
> >>
> >> Cc: "David S. Miller" 
> >> Cc: Mugunthan V N 
> >> Cc: Richard Cochran 
> >> Signed-off-by: Grygorii Strashko 
> >> ---
> >>   MAINTAINERS | 8 
> >>   1 file changed, 8 insertions(+)
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 1d5b4be..aca864d 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -11071,6 +11071,14 @@ S:Maintained
> >>   F:   drivers/clk/ti/
> >>   F:   include/linux/clk/ti.h
> >>   
> >> +TI ETHERNET SWITCH DRIVER (CPSW)
> >> +M:Mugunthan V N 
> >> +R:Grygorii Strashko 
> >> +L:linux-o...@vger.kernel.org
> >> +S:Maintained
> >> +F:drivers/net/ethernet/ti/cpsw*
> >> +F:drivers/net/ethernet/ti/davinci*
> >> +
> >>   TI FLASH MEDIA INTERFACE DRIVER
> >>   M:   Alex Dubov 
> >>   S:   Maintained
> >> -- 
> > 
> > Please add netdev list also there as the primary list:
> > 
> > L:  netdev@vger.kernel.org
> > L:  linux-o...@vger.kernel.org
> > 
> > Then we can easily review and ack the patches for Dave to apply.
> > 
> 
> I can, but I want to clarify whether it is really necessary, because
> get_maintainer.pl automatically adds netdev@vger.kernel.org:

Well it may not be obvious from reading MAINTAINERS file though :)

Tony

Re: [PATCH net-next] net: dsa: remove tag_protocol from dsa_switch

2016-04-20 Thread Florian Fainelli

On 18/04/16 15:24, Vivien Didelot wrote:
> Having the tag protocol in dsa_switch_driver for setup time and in
> dsa_switch_tree for runtime is enough. Remove dsa_switch's one.
> 
> Signed-off-by: Vivien Didelot 

Acked-by: Florian Fainelli 
-- 
Florian

[PATCH RFC net-next] net: dsa: Provide CPU port statistics to master netdev

2016-04-20 Thread Florian Fainelli

This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).

We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.

We take this approach as opposed to providing a set of DSA helper
functions that would retrieve the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces; a statistics overlay in such drivers
would therefore simply not scale.
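
The overlay described above amounts to appending the switch's entries after
the master's. A user-space sketch of that layout follows; the demo_* names
and the 32-byte slot size are assumptions for illustration, not the ethtool
definitions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed slot size for one statistic name. */
#define DEMO_GSTRING_LEN 32

/* Combined count: master entries first, then the switch's CPU port. */
int demo_sset_count(int n_master, int n_sw)
{
	return n_master + n_sw;
}

/*
 * Fill 'data' (at least (n_master + n_sw) * DEMO_GSTRING_LEN bytes) with
 * the master's names, then the switch's names starting at slot n_master.
 */
void demo_fill_strings(char *data,
		       const char *const *master, int n_master,
		       const char *const *sw, int n_sw)
{
	int i;

	memset(data, 0, (size_t)(n_master + n_sw) * DEMO_GSTRING_LEN);

	for (i = 0; i < n_master; i++)
		snprintf(data + i * DEMO_GSTRING_LEN, DEMO_GSTRING_LEN,
			 "%s", master[i]);

	for (i = 0; i < n_sw; i++)
		snprintf(data + (n_master + i) * DEMO_GSTRING_LEN,
			 DEMO_GSTRING_LEN, "%s", sw[i]);
}
```

The values array follows the same layout: the switch's counters start at
index n_master, so names and values stay aligned.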

Signed-off-by: Florian Fainelli 
---
 include/net/dsa.h |  5 
 net/dsa/slave.c   | 69 +++
 2 files changed, 74 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index c4bc42bd3538..67f811f00339 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -111,6 +111,11 @@ struct dsa_switch_tree {
enum dsa_tag_protocol   tag_protocol;
 
/*
+* Original copy of the master netdev ethtool_ops
+*/
+   struct ethtool_ops  master_ethtool_ops;
+
+   /*
 * The switch and port to which the CPU is attached.
 */
s8  cpu_switch;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2dae0d064359..41283c6f725a 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -666,6 +666,59 @@ static void dsa_slave_get_strings(struct net_device *dev,
}
 }
 
+static void dsa_cpu_port_get_ethtool_stats(struct net_device *dev,
+  struct ethtool_stats *stats,
+  uint64_t *data)
+{
+   struct dsa_switch_tree *dst = dev->dsa_ptr;
+   struct dsa_switch *ds = dst->ds[0];
+   s8 cpu_port = dst->cpu_port;
+   int count = 0;
+
+   if (dst->master_ethtool_ops.get_sset_count) {
+   count = dst->master_ethtool_ops.get_sset_count(dev,
+  ETH_SS_STATS);
+   dst->master_ethtool_ops.get_ethtool_stats(dev, stats, data);
+   }
+
+   if (ds->drv->get_ethtool_stats)
+   ds->drv->get_ethtool_stats(ds, cpu_port, data + count);
+}
+
+static int dsa_cpu_port_get_sset_count(struct net_device *dev, int sset)
+{
+   struct dsa_switch_tree *dst = dev->dsa_ptr;
+   struct dsa_switch *ds = dst->ds[0];
+   int count = 0;
+
+   if (dst->master_ethtool_ops.get_sset_count)
+   count += dst->master_ethtool_ops.get_sset_count(dev, sset);
+
+   if (sset == ETH_SS_STATS && ds->drv->get_sset_count)
+   count += ds->drv->get_sset_count(ds);
+
+   return count;
+}
+
+static void dsa_cpu_port_get_strings(struct net_device *dev,
+uint32_t stringset, uint8_t *data)
+{
+   struct dsa_switch_tree *dst = dev->dsa_ptr;
+   struct dsa_switch *ds = dst->ds[0];
+   s8 cpu_port = dst->cpu_port;
+   int len = ETH_GSTRING_LEN;
+   int count = 0;
+
+   if (dst->master_ethtool_ops.get_sset_count) {
+   count = dst->master_ethtool_ops.get_sset_count(dev,
+  ETH_SS_STATS);
+   dst->master_ethtool_ops.get_strings(dev, stringset, data);
+   }
+
+   if (stringset == ETH_SS_STATS && ds->drv->get_strings)
+   ds->drv->get_strings(ds, cpu_port, data + count * len);
+}
+
 static void dsa_slave_get_ethtool_stats(struct net_device *dev,
struct ethtool_stats *stats,
uint64_t *data)
@@ -821,6 +874,8 @@ static const struct ethtool_ops dsa_slave_ethtool_ops = {
.get_eee= dsa_slave_get_eee,
 };
 
+static struct ethtool_ops dsa_cpu_port_ethtool_ops;
+
 static const struct net_device_ops dsa_slave_netdev_ops = {
.ndo_open   = dsa_slave_open,
.ndo_stop   = dsa_slave_close,
@@ -1038,6 +1093,7 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent,
 int port, char *name)
 {
struct net_device *master = ds->dst->master_netdev;
+   struct dsa_switch_tree *dst = ds->dst;
struct net_device *slave_dev;
struct dsa_slave_priv *p;
int ret;
@@ -1049,6 +1105,19 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent,
 
slave_dev->features = master->vlan_features;
slave_dev->ethtool_ops = &dsa_slave_ethtool_ops;
+   if (master->ethtool_ops != &dsa_cpu_port_ethtool_ops) {
+   memcpy(&dst->master_ethtool_ops, master->ethtool_ops,
+

Re: [RFC PATCH 2/5] mlx5: Add support for UDP tunnel segmentation with outer checksum offload

2016-04-20 Thread Saeed Mahameed

On Tue, Apr 19, 2016 at 10:06 PM, Alexander Duyck  wrote:
> This patch assumes that the mlx5 hardware will ignore existing IPv4/v6
> header fields for length and checksum as well as the length and checksum
> fields for outer UDP headers.
>
> I have no means of testing this as I do not have any mlx5 hardware but
> thought I would submit it as an RFC to see if anyone out there wants to
> test this and see if this does in fact enable this functionality allowing
> us to segment UDP tunneled frames that have an outer checksum.
>
> Signed-off-by: Alexander Duyck 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index e0adb604f461..57d8da796d50 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -2390,13 +2390,18 @@ static void mlx5e_build_netdev(struct net_device *netdev)
> netdev->hw_features  |= NETIF_F_HW_VLAN_CTAG_FILTER;
>
> if (mlx5e_vxlan_allowed(mdev)) {
> -   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
> +   netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL |
> +  NETIF_F_GSO_UDP_TUNNEL_CSUM |
> +  NETIF_F_GSO_PARTIAL;
> netdev->hw_enc_features |= NETIF_F_IP_CSUM;
> netdev->hw_enc_features |= NETIF_F_RXCSUM;
> netdev->hw_enc_features |= NETIF_F_TSO;
> netdev->hw_enc_features |= NETIF_F_TSO6;
> netdev->hw_enc_features |= NETIF_F_RXHASH;
> netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
> +   netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM |
> +  NETIF_F_GSO_PARTIAL;
> +   netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
> }
>
> netdev->features  = netdev->hw_features;
>

Hi Alex,

Adding Matt, the VxLAN feature owner from Mellanox.
Matt, please correct me if I am wrong, but we already tested GSO VxLAN
and we saw the TCP/IP checksum offloads for both inner and outer
headers handled by the hardware.

And looking at mlx5e_sq_xmit:

if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
if (skb->encapsulation) {
eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
MLX5_ETH_WQE_L4_INNER_CSUM;
sq->stats.csum_offload_inner++;
} else {
eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
}

We enable inner/outer hardware checksumming unconditionally, without
looking at the features Alex is suggesting in this patch.
Alex, can you elaborate on the meaning of those features, and on
why it would work for us without declaring them?

Re: skb_at_tc_ingress helper breaks compilation of oot modules

2016-04-20 Thread Alexei Starovoitov

On Wed, Apr 20, 2016 at 12:38:11PM +0200, Daniel Borkmann wrote:
> On 04/20/2016 12:21 PM, Ingo Saitz wrote:
> >In Linux 4.5, when CONFIG_NET_CLS_ACT is defined, compilation of out of
> >tree modules breaks with undeclared functions/constants. The culprit is:
> >
> >commit fdc5432a7b44ab7de17141beec19d946b9344e91
> >Author: Daniel Borkmann 
> >Date:   Thu Jan 7 15:50:22 2016 +0100
> >
> > net, sched: add skb_at_tc_ingress helper
> >
> >which uses G_TC_AT and AT_INGRESS but only includes linux/pkt_cls.h,
> >which does not include these #defines for oot builds. Unfortunately I'm
> >not sure what the correct fix is, maybe the uapi folks could help, but I
> >attached a simple testcase and build log (Makefile is straight from
> >kernelnewbies).
> 
> Hmm, your fail.c test case only contains '#include '?
> 
> Note, upstream kernel never cared about out-of-tree modules, only
> in-tree code. ;) Did you run into an issue with any in-tree code?

I'm glad it broke an out-of-tree module. We should do it more often.
LLVM constantly reshuffles its internal APIs to incentivize upstreaming
and working with the community.

Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)

2016-04-20 Thread Saeed Mahameed

On Tue, Apr 19, 2016 at 8:39 PM, Mel Gorman  wrote:
> On Tue, Apr 19, 2016 at 06:25:32PM +0200, Jesper Dangaard Brouer wrote:
>> On Mon, 18 Apr 2016 07:17:13 -0700
>> Eric Dumazet  wrote:
>>
>
> alloc_pages_exact()
>

We want to allocate 32 order-0 physically contiguous pages and to free
each one of them individually.
The documentation states "Memory allocated by this function must be
released by free_pages_exact()".

Also, it returns a pointer to the memory and we need pointers to pages.

>> > > allocates many physically contiguous pages with order0 ! so we assume
>> > > it is ok to use split_page.
>> >
>> > Note: I have no idea of split_page() performance :
>>
>> Maybe Mel knows?
>
> Irrelevant in comparison to the cost of allocating an order-5 page if
> one is not already available.
>

We still allocate order-5 pages, but now we split them into 32 order-0 pages.
The split adds a few extra CPU cycles, but it is lockless and
straightforward, and it does the job in terms of better memory
utilization.
Now, in scenarios where small packets can hold a ref on pages for too
long, they would hold a ref on order-0 pages rather than order-5.
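
The memory utilization argument above can be illustrated with a small userspace model. It is a simplified sketch and not kernel code: pinned_bytes() and SKETCH_PAGE_SIZE are hypothetical names, a 4096-byte page size is assumed, and the model simply treats an unsplit block as freeable only once no sub-page is referenced.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/*
 * refs[i] is non-zero while some packet still holds a reference to
 * sub-page i of a 2^order block. Returns how many bytes stay pinned.
 */
static unsigned long pinned_bytes(unsigned int order, int split,
				  const unsigned char *refs)
{
	unsigned long npages = 1UL << order;
	unsigned long held = 0;
	unsigned long i;

	for (i = 0; i < npages; i++)
		if (refs[i])
			held++;

	if (split)
		return held * SKETCH_PAGE_SIZE;	/* only the referenced order-0 pages */

	/* unsplit: one lingering reference keeps the whole block alive */
	return held ? npages * SKETCH_PAGE_SIZE : 0;
}
```

With order 5 and a single lingering reference, the model reports 128 KiB pinned for the unsplit block and 4 KiB once the block has been split into order-0 pages.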

Re: [PATCH] MAINTAINERS: net: add entry for TI Ethernet Switch drivers

2016-04-20 Thread Grygorii Strashko

On 04/20/2016 05:23 PM, Tony Lindgren wrote:
> * Grygorii Strashko  [160420 04:26]:
>> Add record for TI Ethernet Switch Driver CPSW/CPDMA/MDIO HW
>> (am33/am43/am57/dr7/davinci) to ensure that related patches
>> will go through dedicated linux-omap list.
>>
>> Also add Mugunthan as maintainer and myself as the reviewer.
>>
>> Cc: "David S. Miller" 
>> Cc: Mugunthan V N 
>> Cc: Richard Cochran 
>> Signed-off-by: Grygorii Strashko 
>> ---
>>   MAINTAINERS | 8 
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 1d5b4be..aca864d 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -11071,6 +11071,14 @@ S:  Maintained
>>   F: drivers/clk/ti/
>>   F: include/linux/clk/ti.h
>>   
>> +TI ETHERNET SWITCH DRIVER (CPSW)
>> +M:  Mugunthan V N 
>> +R:  Grygorii Strashko 
>> +L:  linux-o...@vger.kernel.org
>> +S:  Maintained
>> +F:  drivers/net/ethernet/ti/cpsw*
>> +F:  drivers/net/ethernet/ti/davinci*
>> +
>>   TI FLASH MEDIA INTERFACE DRIVER
>>   M: Alex Dubov 
>>   S: Maintained
>> -- 
> 
> Please add netdev list also there as the primary list:
> 
> L:netdev@vger.kernel.org
> L:linux-o...@vger.kernel.org
> 
> Then we can easily review and ack the patches for Dave to apply.
> 

I can, but I want to clarify whether it is really necessary, because
get_maintainer.pl automatically adds netdev@vger.kernel.org:

./scripts/get_maintainer.pl ~/.../0001-drivers-net-cpsw-fix-port_mask-parameters-in-ale-cal.patch
Mugunthan V N (maintainer:TI ETHERNET SWITCH DRIVER (CPSW))
Grygorii Strashko (reviewer:TI ETHERNET SWITCH DRIVER (CPSW))
linux-o...@vger.kernel.org (open list:TI ETHERNET SWITCH DRIVER (CPSW))
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-ker...@vger.kernel.org (open list)
 


-- 
regards,
-grygorii

[PATCH iproute2 WIP] ifstat: use new RTM_GETSTATS api

2016-04-20 Thread Roopa Prabhu

From: Roopa Prabhu 

Sample hacked-up patch currently used for testing.
Needs rework if ifstat moves to RTM_GETSTATS.

Signed-off-by: Roopa Prabhu 
---
 include/libnetlink.h  |  6 ++
 include/linux/if_link.h   | 22 ++
 include/linux/rtnetlink.h |  5 +
 lib/libnetlink.c  | 31 +++
 misc/ifstat.c | 37 -
 5 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 491263f..ccaab46 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -44,6 +44,12 @@ int rtnl_dump_request(struct rtnl_handle *rth, int type, void *req,
 int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n)
__attribute__((warn_unused_result));
 
+int rtnl_wilddump_stats_request(struct rtnl_handle *rth, int family, int type)
+   __attribute__((warn_unused_result));
+int rtnl_wilddump_stats_req_filter(struct rtnl_handle *rth, int family,
+  int type, __u32 filt_mask)
+  __attribute__((warn_unused_result));
+
 struct rtnl_ctrl_data {
int nsid;
 };
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 6a688e8..eb1064a 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -165,6 +165,8 @@ enum {
 #define IFLA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifinfomsg))))
 #define IFLA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifinfomsg))
 
+#define IFLA_RTA_STATS(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct if_stats_msg))))
+
 enum {
IFLA_INET_UNSPEC,
IFLA_INET_CONF,
@@ -777,4 +779,24 @@ enum {
 
 #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
 
+/* STATS section */
+
+struct if_stats_msg {
+   __u8  family;
+   __u8  pad1;
+   __u16 pad2;
+   __u32 ifindex;
+   __u32 filter_mask;
+};
+
+enum {
+   IFLA_STATS_UNSPEC,
+   IFLA_STATS_LINK_64,
+   __IFLA_STATS_MAX,
+};
+
+#define IFLA_STATS_MAX (__IFLA_STATS_MAX - 1)
+
+#define IFLA_STATS_FILTER_BIT(ATTR)  (1 << (ATTR - 1))
+
 #endif /* _LINUX_IF_LINK_H */
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 6aaa2a3..e8cdff5 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -139,6 +139,11 @@ enum {
RTM_GETNSID = 90,
 #define RTM_GETNSID RTM_GETNSID
 
+   RTM_NEWSTATS = 92,
+#define RTM_NEWSTATS RTM_NEWSTATS
+   RTM_GETSTATS = 94,
+#define RTM_GETSTATS RTM_GETSTATS
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index a90e52c..f7baf51 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -838,3 +838,34 @@ int __parse_rtattr_nested_compat(struct rtattr *tb[], int max, struct rtattr *rt
memset(tb, 0, sizeof(struct rtattr *) * (max + 1));
return 0;
 }
+
+int rtnl_wilddump_stats_req_filter(struct rtnl_handle *rth, int family, int type,
+  __u32 filt_mask)
+{
+   struct {
+   struct nlmsghdr nlh;
+   struct if_stats_msg ifsm;
+   } req;
+
+   int err;
+
+   memset(&req, 0, sizeof(req));
+   req.nlh.nlmsg_len = sizeof(req);
+   req.nlh.nlmsg_type = type;
+   req.nlh.nlmsg_flags = NLM_F_DUMP|NLM_F_REQUEST;
+   req.nlh.nlmsg_pid = 0;
+   req.nlh.nlmsg_seq = rth->dump = ++rth->seq;
+   req.ifsm.family = family;
+   req.ifsm.filter_mask = filt_mask;
+
+   err = send(rth->fd, (void*)&req, sizeof(req), 0);
+
+   return err;
+}
+
+int rtnl_wilddump_stats_request(struct rtnl_handle *rth, int family, int type)
+{
+   return rtnl_wilddump_stats_req_filter(rth, family, type,
+   IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64));
+}
+
diff --git a/misc/ifstat.c b/misc/ifstat.c
index abbb4e7..e517c9a 100644
--- a/misc/ifstat.c
+++ b/misc/ifstat.c
@@ -35,6 +35,8 @@
 
 #include 
 
+#include "utils.h"
+
 int dump_zeros;
 int reset_history;
 int ignore_history;
@@ -49,6 +51,8 @@ double W;
 char **patterns;
 int npatterns;
 
+struct rtnl_handle rth;
+
 char info_source[128];
 int source_mismatch;
 
@@ -58,9 +62,9 @@ struct ifstat_ent {
struct ifstat_ent   *next;
char*name;
int ifindex;
-   unsigned long long  val[MAXS];
+   __u64   val[MAXS];
double  rate[MAXS];
-   __u32   ival[MAXS];
+   __u64   ival[MAXS];
 };
 
 static const char *stats[MAXS] = {
@@ -109,32 +113,29 @@ static int match(const char *id)
 static int get_nlmsg(const struct sockaddr_nl *who,
 struct nlmsghdr *m, void *arg)
 {
-   struct ifinfomsg *ifi = NLMSG_DATA(m);
-   struct rtattr *tb[IFLA_MAX+1];
+

Re: [PATCH net-next v6] rtnetlink: add new RTM_GETSTATS message to dump link stats

2016-04-20 Thread David Miller

From: Roopa Prabhu 
Date: Wed, 20 Apr 2016 08:43:43 -0700

> This patch has been tested with modified iproute2 ifstat.

Can you please send me the patch you are using?  I want to do some quick testing
on sparc64 before I push this out.

Thanks.

Re: [PATCH v2 1/1] Revert "Prevent NUll pointer dereference with two PHYs on cpsw"

2016-04-20 Thread David Miller

From: Andrew Goodbody 
Date: Wed, 20 Apr 2016 16:14:51 +0100

> This reverts commit cfe255600154f0072d4a8695590dbd194dfd1aeb
> 
> This can result in a "Unable to handle kernel paging request"
> during boot. This was due to using an uninitialised struct member,
> data->slaves.
> 
> Signed-off-by: Andrew Goodbody 
> Tested-by: Tony Lindgren 
> ---
> 
> v2 No code change, added signoff and collected tested-by

Applied, thanks.

[PATCH net-next v6] rtnetlink: add new RTM_GETSTATS message to dump link stats

2016-04-20 Thread Roopa Prabhu

From: Roopa Prabhu 

This patch adds a new RTM_GETSTATS message to query link stats via netlink
from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
returns a lot more than just stats and is expensive in some cases when
frequent polling for stats from userspace is a common operation.

RTM_GETSTATS is an attempt to provide a lightweight netlink message
to explicitly query only link stats from the kernel on an interface.
The idea is to also keep it extensible so that new kinds of stats can be
added to it in the future.

This patch adds the following attribute for NETDEV stats:
struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
[IFLA_STATS_LINK_64]  = { .len = sizeof(struct rtnl_link_stats64) },
};

Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
a single interface or all interfaces with NLM_F_DUMP.

Future possible new types of stat attributes:
link af stats:
- IFLA_STATS_LINK_IPV6  (nested. for ipv6 stats)
- IFLA_STATS_LINK_MPLS  (nested. for mpls/mdev stats)
extended stats:
- IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like
  bridge, vlan, vxlan etc)
- IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are
  available via ethtool today)

This patch also declares a filter mask for all stat attributes.
The user has to provide a mask of the stats attributes to query. The filter
mask is specified in the new header 'struct if_stats_msg' for stats messages.
The other important field in the header is the ifindex.

This api can also include attributes for global stats (eg tcp) in the future.
When global stats are included in a stats msg, the ifindex in the header
must be zero. A single stats message cannot contain both global and
netdev specific stats. To easily distinguish them, netdev specific stat
attribute names are prefixed with IFLA_STATS_LINK_.

Without any attributes in the filter_mask, no stats will be returned.
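
The filter mask handling described above can be sketched in userspace C as follows. The structure, enum and IFLA_STATS_FILTER_BIT macro are copied from this patch so the example is self-contained; real code would take them from the uapi header instead. make_stats_hdr() and stats_attr_requested() are hypothetical helpers added only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Copied from the patch for illustration only. */
struct if_stats_msg {
	uint8_t  family;
	uint8_t  pad1;
	uint16_t pad2;
	uint32_t ifindex;
	uint32_t filter_mask;
};

enum {
	IFLA_STATS_UNSPEC,
	IFLA_STATS_LINK_64,
	__IFLA_STATS_MAX,
};

#define IFLA_STATS_MAX (__IFLA_STATS_MAX - 1)
#define IFLA_STATS_FILTER_BIT(ATTR)	(1 << (ATTR - 1))

/* Hypothetical helper: build a header asking for the given attributes. */
static struct if_stats_msg make_stats_hdr(uint32_t ifindex, uint32_t mask)
{
	struct if_stats_msg ifsm;

	memset(&ifsm, 0, sizeof(ifsm));	/* family and padding stay zero */
	ifsm.ifindex = ifindex;
	ifsm.filter_mask = mask;
	return ifsm;
}

/* Hypothetical helper: was this attribute requested by the filter mask? */
static int stats_attr_requested(uint32_t filter_mask, int attr)
{
	return (filter_mask & IFLA_STATS_FILTER_BIT(attr)) != 0;
}
```

Attribute numbering starts at 1 because IFLA_STATS_UNSPEC occupies 0, so IFLA_STATS_LINK_64 maps to bit 0 of the mask. An empty mask requests nothing, which matches the statement above that no stats are returned in that case.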

This patch has been tested with modified iproute2 ifstat.

Suggested-by: Jamal Hadi Salim 
Signed-off-by: Roopa Prabhu 
---
RFC to v1 (apologies for the delay in sending this version out. busy days):
- Addressed feedback from Dave
- removed rtnl_link_stats
- Added hdr struct if_stats_msg to carry ifindex and
  filter mask
- new macro IFLA_STATS_FILTER_BIT(ATTR) for filter mask
- split the ipv6 patch into a separate patch, need some more eyes on it
- prefix attributes with IFLA_STATS instead of IFLA_LINK_STATS for
  shorter attribute names

v2:
- move IFLA_STATS_INET6 declaration to the inet6 patch
- get rid of RTM_DELSTATS
- mark ipv6 patch RFC. It can be used as an example for
  other AF stats like stats

v3:
- add required padding to the if_stats_msg structure(suggested by jamal)
- rename netdev stat attributes with IFLA_STATS_LINK prefix
  so that they are easily distinguishable with global
  stats in the future (after global stats discussion with thomas)
- get rid of unnecessary copy when getting stats with dev_get_stats
  (suggested by dave)

v4:
- dropped calcit and af stats from this patch. Will add it
  back when it becomes necessary and with the first af stats
  patch
- add check for null filter in dump and return -EINVAL:
  this follows rtnl_fdb_dump in returning an error.
  But since netlink_dump does not propagate the error
  to the user, the user will not see an error, but
  will also not see any data. This is consistent with
  other kinds of dumps.

v5:
- fix selinux nlmsgtab to account for new RTM_*STATS messages

v6:
- fix alignment for 64bit stats attribute, using David's new
  clever trick of using a pad attribute and new helper apis
- change selinux RTM_NEWSTATS permissions to READ since this
  patch does not support writes yet.

 include/uapi/linux/if_link.h   |  23 ++
 include/uapi/linux/rtnetlink.h |   5 ++
 net/core/rtnetlink.c   | 158 +
 security/selinux/nlmsgtab.c|   4 +-
 4 files changed, 189 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 5ffdcb3..115ccc1 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -782,4 +782,27 @@ enum {
 
 #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
 
+/* STATS section */
+
+struct if_stats_msg {
+   __u8  family;
+   __u8  pad1;
+   __u16 pad2;
+   __u32 ifindex;
+   __u32 filter_mask;
+};
+
+/* A stats attribute can be netdev specific or a global stat.
+ * For netdev stats, lets use the prefix IFLA_STATS_LINK_*
+ */
+enum {
+   IFLA_STATS_UNSPEC, /* also used as 64bit pad attribute */
+   IFLA_STATS_LINK_64,
+

Re: [PATCH 1/3] e1000e: e1000e_cyclecounter_read(): incvalue is 32 bits, not 64

2016-04-20 Thread Denys Vlasenko

On 04/19/2016 10:57 PM, Jeff Kirsher wrote:
> On Tue, 2016-04-19 at 14:34 +0200, Denys Vlasenko wrote:
>> "incvalue" variable holds a result of "er32(TIMINCA) &
>> E1000_TIMINCA_INCVALUE_MASK"
>> and is used in "do_div(temp, incvalue)" as a divisor.
>>
>> Thus, "u64 incvalue" declaration is probably a mistake.
>> Even though it seems to be a harmless one, let's fix it.
>>
>> Signed-off-by: Denys Vlasenko 
>> CC: Jeff Kirsher 
>> CC: Jesse Brandeburg 
>> CC: Shannon Nelson 
>> CC: Carolyn Wyborny 
>> CC: Don Skidmore 
>> CC: Bruce Allan 
>> CC: John Ronciak 
>> CC: Mitch Williams 
>> CC: David S. Miller 
>> CC: LKML 
>> CC: netdev@vger.kernel.org
>> ---
>>  drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> First of all, trimmed down the recipient list since almost all of the
> reviewers you added have nothing to do with e1000e.

I took the list here, MAINTAINERS:

INTEL ETHERNET DRIVERS
M:  Jeff Kirsher 
R:  Jesse Brandeburg 
R:  Shannon Nelson 
R:  Carolyn Wyborny 
R:  Don Skidmore 
R:  Bruce Allan 
R:  John Ronciak 
R:  Mitch Williams 

There is no more specific information about who the e1000e reviewers are.
Sorry.
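
For context on the quoted patch description: do_div() takes a 64-bit dividend and a 32-bit divisor, which is why a u64 "incvalue" gains nothing. The following is a userspace model only, with the hypothetical name do_div_sketch(); it is not the kernel macro, and it takes the dividend by pointer because a plain C function cannot update its argument in place the way the macro does.

```c
#include <stdint.h>

/*
 * Userspace model of do_div(): divides the 64-bit value in *n by a
 * 32-bit base, stores the quotient back in *n and returns the remainder.
 */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}
```

Since the divisor parameter is 32 bits wide, any upper bits of a wider value passed in would be discarded at the call, so declaring the divisor as u32 states the actual range.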

Re: Regression in next for smsc911x with tigthen lockdep checks

2016-04-20 Thread Tony Lindgren

* Hannes Frederic Sowa  [160420 08:24]:
> Hi,
> 
> On 20.04.2016 17:01, Tony Lindgren wrote:
> > Looks like commit fafc4e1ea1a4 ("sock: tigthen lockdep checks for
> > sock_owned_by_user") in next causes a regression at least for
> > smsc911x with CONFIG_LOCKDEP. It keeps spamming with the following
> > message. Any ideas?
> 
> Not yet, can you quickly send me your config?

It's just the arch/arm/configs/omap2plus_defconfig I'm using.

Tony

Fwd: Davicom DM9162 PHY supported in the kernel?

2016-04-20 Thread Amr Bekhit

(Sorry, repeat message due to the previous one being HTML)

Hello,

I'm using an embedded Linux board based on an AT91SAM9X25 that uses
the Davicom DM9162IEP PHY chip. I'm struggling to get packets out on
the wire and I'm suspecting that I might have an issue between the
AT91 MAC and the PHY chip. I've looked through the kernel config
options and the kernel already has compiled-in support for the Davicom
PHYs, however I noticed that according to the help text, only the
dm9161e and dm9131 chips are supported, which may indicate why my
ethernet isn't working. I was wondering whether the DM9162 is
backwards compatible with the existing driver? I'm currently using the
mainline kernel 4.3. (p.s. I know the hardware works fine since I have
no problem transferring files using tftp via u-boot).

Thanks,

Amr Bekhit

Re: Regression in next for smsc911x with tigthen lockdep checks

2016-04-20 Thread Hannes Frederic Sowa

Hi,

On 20.04.2016 17:01, Tony Lindgren wrote:
> Looks like commit fafc4e1ea1a4 ("sock: tigthen lockdep checks for
> sock_owned_by_user") in next causes a regression at least for
> smsc911x with CONFIG_LOCKDEP. It keeps spamming with the following
> message. Any ideas?

Not yet, can you quickly send me your config?

Thanks,
Hannes

Re: [PATCH net 4/4] net/mlx4_en: Split SW RX dropped counter per RX ring

2016-04-20 Thread Or Gerlitz

On 4/20/2016 5:56 PM, Eric Dumazet wrote:

>> Fixes: ab35da16 ('net/mlx4_en: Moderate ethtool callback to [...] ')
>> Signed-off-by: Eran Ben Elisha
>> Reported-by: Brenden Blanco
>> Signed-off-by: Saeed Mahameed
>> Signed-off-by: Or Gerlitz
>> ---
>
> Reported-by: Eric Dumazet
>
> (http://www.spinics.net/lists/netdev/msg371318.html)

Hi Eric,

Just to be sure, you'd like me to re-spin this and fix the reporter name?

> Thanks for following up !

sure

Or.

[PATCH v2 1/1] Revert "Prevent NUll pointer dereference with two PHYs on cpsw"

2016-04-20 Thread Andrew Goodbody

This reverts commit cfe255600154f0072d4a8695590dbd194dfd1aeb

This can result in a "Unable to handle kernel paging request"
during boot. This was due to using an uninitialised struct member,
data->slaves.

Signed-off-by: Andrew Goodbody 
Tested-by: Tony Lindgren 
---

v2 No code change, added signoff and collected tested-by

 drivers/net/ethernet/ti/cpsw.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 2cd67a5..54bcc38 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -349,7 +349,6 @@ struct cpsw_slave {
struct cpsw_slave_data  *data;
struct phy_device   *phy;
struct net_device   *ndev;
-   struct device_node  *phy_node;
u32 port_vlan;
u32 open_stat;
 };
@@ -368,6 +367,7 @@ struct cpsw_priv {
spinlock_t  lock;
struct platform_device  *pdev;
struct net_device   *ndev;
+   struct device_node  *phy_node;
struct napi_struct  napi_rx;
struct napi_struct  napi_tx;
struct device   *dev;
@@ -1142,8 +1142,8 @@ static void cpsw_slave_open(struct cpsw_slave *slave, struct cpsw_priv *priv)
cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast,
   1 << slave_port, 0, 0, ALE_MCAST_FWD_2);
 
-   if (slave->phy_node)
-   slave->phy = of_phy_connect(priv->ndev, slave->phy_node,
+   if (priv->phy_node)
+   slave->phy = of_phy_connect(priv->ndev, priv->phy_node,
 &cpsw_adjust_link, 0, slave->data->phy_if);
else
slave->phy = phy_connect(priv->ndev, slave->data->phy_id,
@@ -2025,8 +2025,7 @@ static int cpsw_probe_dt(struct cpsw_priv *priv,
if (strcmp(slave_node->name, "slave"))
continue;
 
-   priv->slaves[i].phy_node =
-   of_parse_phandle(slave_node, "phy-handle", 0);
+   priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0);
parp = of_get_property(slave_node, "phy_id", &lenp);
if (of_phy_is_fixed_link(slave_node)) {
struct device_node *phy_node;
@@ -2267,22 +2266,12 @@ static int cpsw_probe(struct platform_device *pdev)
/* Select default pin state */
pinctrl_pm_select_default_state(&pdev->dev);
 
-   data = &priv->data;
-   priv->slaves = devm_kzalloc(&pdev->dev,
-   sizeof(struct cpsw_slave) * data->slaves,
-   GFP_KERNEL);
-   if (!priv->slaves) {
-   ret = -ENOMEM;
-   goto clean_runtime_disable_ret;
-   }
-   for (i = 0; i < data->slaves; i++)
-   priv->slaves[i].slave_num = i;
-
if (cpsw_probe_dt(priv, pdev)) {
dev_err(&pdev->dev, "cpsw: platform data missing\n");
ret = -ENODEV;
goto clean_runtime_disable_ret;
}
+   data = &priv->data;
 
if (is_valid_ether_addr(data->slave_data[0].mac_addr)) {
memcpy(priv->mac_addr, data->slave_data[0].mac_addr, ETH_ALEN);
@@ -2294,6 +2283,16 @@ static int cpsw_probe(struct platform_device *pdev)
 
memcpy(ndev->dev_addr, priv->mac_addr, ETH_ALEN);
 
+   priv->slaves = devm_kzalloc(&pdev->dev,
+   sizeof(struct cpsw_slave) * data->slaves,
+   GFP_KERNEL);
+   if (!priv->slaves) {
+   ret = -ENOMEM;
+   goto clean_runtime_disable_ret;
+   }
+   for (i = 0; i < data->slaves; i++)
+   priv->slaves[i].slave_num = i;
+
priv->slaves[0].ndev = ndev;
priv->emac_port = 0;
 
-- 
2.5.0

[PATCH v2 0/1] Revert "Prevent NUll pointer dereference with two PHYs"

2016-04-20 Thread Andrew Goodbody

Revert this patch as not only did it use an uninitialised member of a struct
but there is also a pre-existing patch that does it better.
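
The ordering problem behind this revert can be shown with a small userspace sketch. It is not the driver code: the struct and function names below are hypothetical stand-ins, and calloc() replaces devm_kzalloc(). The only point is that the slave count must be filled in by the parsing step before it is used as an allocation size.

```c
#include <stdlib.h>

struct plat_data { int slaves; };
struct slave { int slave_num; };
struct priv {
	struct plat_data data;
	struct slave *slaves;
};

/* Stand-in for device tree parsing: this is what fills in data.slaves. */
static int parse_config(struct priv *p, int slaves_from_dt)
{
	if (slaves_from_dt <= 0)
		return -1;
	p->data.slaves = slaves_from_dt;
	return 0;
}

static int probe(struct priv *p, int slaves_from_dt)
{
	int i;

	if (parse_config(p, slaves_from_dt))
		return -1;

	/* Only now is data.slaves valid as an allocation size. */
	p->slaves = calloc((size_t)p->data.slaves, sizeof(*p->slaves));
	if (!p->slaves)
		return -1;

	for (i = 0; i < p->data.slaves; i++)
		p->slaves[i].slave_num = i;
	return 0;
}
```

Allocating before parse_config() would size the array from whatever value data.slaves happened to hold, which is the kind of failure the revert description refers to.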

V2 add signoff

Andrew Goodbody (1):
  Revert "Prevent NUll pointer dereference with two PHYs on cpsw"

 drivers/net/ethernet/ti/cpsw.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

-- 
2.5.0

Re: [PATCH 1/1] Revert "Prevent NUll pointer dereference with two PHYs on cpsw"

2016-04-20 Thread Tony Lindgren

* Andrew Goodbody  [160420 07:51]:
> This reverts commit cfe255600154f0072d4a8695590dbd194dfd1aeb
> 
> This can result in a "Unable to handle kernel paging request"
> during boot. This was due to using an uninitialised struct member,
> data->slaves.

Missing Signed-off-by?

This gets cpsw boards working in next for me again:

Tested-by: Tony Lindgren

Re: Regression in next for smsc911x with tigthen lockdep checks

2016-04-20 Thread Tony Lindgren

* Tony Lindgren <t...@atomide.com> [160420 08:02]:
> Hi,
> 
> Looks like commit fafc4e1ea1a4 ("sock: tigthen lockdep checks for
> sock_owned_by_user") in next causes a regression at least for
> smsc911x with CONFIG_LOCKDEP. It keeps spamming with the following
> message. Any ideas?

Sorry forgot to add Steve to Cc, added now.

> Regards,
> 
> Tony
> 
> 8< 
> WARNING: CPU: 0 PID: 0 at include/net/sock.h:1408 udp_queue_rcv_skb+0x398/0x640
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-rc4-next-20160420 #1087
> Hardware name: Generic OMAP36xx (Flattened Device Tree)
> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> [] (show_stack) from [] (dump_stack+0xb0/0xe4)
> [] (dump_stack) from [] (__warn+0xd8/0x104)
> [] (__warn) from [] (warn_slowpath_null+0x20/0x28)
> [] (warn_slowpath_null) from [] (udp_queue_rcv_skb+0x398/0x640)
> [] (udp_queue_rcv_skb) from [] (__udp4_lib_rcv+0x4cc/0xc0c)
> [] (__udp4_lib_rcv) from [] (ip_local_deliver_finish+0xcc/0x4f4)
> [] (ip_local_deliver_finish) from [] (ip_local_deliver+0xcc/0xd8)
> [] (ip_local_deliver) from [] (ip_rcv_finish+0xbc/0x700)
> [] (ip_rcv_finish) from [] (ip_rcv+0x48c/0x6d4)
> [] (ip_rcv) from [] (__netif_receive_skb_core+0x380/0xa10)
> [] (__netif_receive_skb_core) from [] (netif_receive_skb_internal+0x74/0x1ec)
> [] (netif_receive_skb_internal) from [] (smsc911x_poll+0xd8/0x228)
> [] (smsc911x_poll) from [] (net_rx_action+0x124/0x470)
> [] (net_rx_action) from [] (__do_softirq+0xc8/0x54c)
> [] (__do_softirq) from [] (irq_exit+0xbc/0x130)
> [] (irq_exit) from [] (__handle_domain_irq+0x6c/0xdc)
> [] (__handle_domain_irq) from [] (__irq_svc+0x58/0x78)
> [] (__irq_svc) from [] (cpuidle_enter_state+0xc4/0x3d4)
> [] (cpuidle_enter_state) from [] (cpu_startup_entry+0x198/0x3a0)
> [] (cpu_startup_entry) from [] (start_kernel+0x350/0x3c8)
> [] (start_kernel) from [<8000807c>] (0x8000807c)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] net: fix HAVE_EFFICIENT_UNALIGNED_ACCESS typos

2016-04-20 Thread David Miller

From: Eric Dumazet 
Date: Wed, 20 Apr 2016 07:31:31 -0700

> From: Eric Dumazet 
> 
> HAVE_EFFICIENT_UNALIGNED_ACCESS needs CONFIG_ prefix.
> 
> Also add a comment in nla_align_64bit() explaining we have
> to add a padding if current skb->data is aligned, as it
> certainly can be confusing.
> 
> Fixes: 35c5845957c7 ("net: Add helpers for 64-bit aligning netlink attributes.")
> Signed-off-by: Eric Dumazet 

Applied, thanks.
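
The padding rule mentioned in the quoted commit message can look backwards at first, so here is a small userspace sketch of the arithmetic. The names needs_pad_for_64bit(), payload_offset_64bit() and SKETCH_NLA_HDRLEN are hypothetical; the sketch assumes a 4-byte netlink attribute header and a tail offset that is already 4-byte aligned, and it is not the kernel helper itself.

```c
#include <stdint.h>

#define SKETCH_NLA_HDRLEN 4	/* attribute header: 2-byte length + 2-byte type */

/*
 * If the next attribute header would start on an 8-byte boundary, its
 * payload would start at +4 and be misaligned, so padding is needed.
 */
static int needs_pad_for_64bit(uintptr_t tail_offset)
{
	return (tail_offset % 8) == 0;
}

/* Offset of the 64-bit payload after the optional empty pad attribute. */
static uintptr_t payload_offset_64bit(uintptr_t tail_offset)
{
	if (needs_pad_for_64bit(tail_offset))
		tail_offset += SKETCH_NLA_HDRLEN;	/* zero-length pad attribute */

	return tail_offset + SKETCH_NLA_HDRLEN;		/* skip the real header */
}
```

Starting from offset 16 (aligned) a pad attribute is inserted and the payload lands at 24; starting from offset 20 (not 8-byte aligned) no pad is needed and the payload also lands at 24.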

Regression in next for smsc911x with tigthen lockdep checks

2016-04-20 Thread Tony Lindgren

Hi,

Looks like commit fafc4e1ea1a4 ("sock: tigthen lockdep checks for
sock_owned_by_user") in next causes a regression at least for
smsc911x with CONFIG_LOCKDEP. It keeps spamming with the following
message. Any ideas?

Regards,

Tony

8< 
WARNING: CPU: 0 PID: 0 at include/net/sock.h:1408 udp_queue_rcv_skb+0x398/0x640
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-rc4-next-20160420 #1087
Hardware name: Generic OMAP36xx (Flattened Device Tree)
[] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[] (show_stack) from [] (dump_stack+0xb0/0xe4)
[] (dump_stack) from [] (__warn+0xd8/0x104)
[] (__warn) from [] (warn_slowpath_null+0x20/0x28)
[] (warn_slowpath_null) from [] (udp_queue_rcv_skb+0x398/0x640)
[] (udp_queue_rcv_skb) from [] (__udp4_lib_rcv+0x4cc/0xc0c)
[] (__udp4_lib_rcv) from [] (ip_local_deliver_finish+0xcc/0x4f4)
[] (ip_local_deliver_finish) from [] (ip_local_deliver+0xcc/0xd8)
[] (ip_local_deliver) from [] (ip_rcv_finish+0xbc/0x700)
[] (ip_rcv_finish) from [] (ip_rcv+0x48c/0x6d4)
[] (ip_rcv) from [] (__netif_receive_skb_core+0x380/0xa10)
[] (__netif_receive_skb_core) from [] (netif_receive_skb_internal+0x74/0x1ec)
[] (netif_receive_skb_internal) from [] (smsc911x_poll+0xd8/0x228)
[] (smsc911x_poll) from [] (net_rx_action+0x124/0x470)
[] (net_rx_action) from [] (__do_softirq+0xc8/0x54c)
[] (__do_softirq) from [] (irq_exit+0xbc/0x130)
[] (irq_exit) from [] (__handle_domain_irq+0x6c/0xdc)
[] (__handle_domain_irq) from [] (__irq_svc+0x58/0x78)
[] (__irq_svc) from [] (cpuidle_enter_state+0xc4/0x3d4)
[] (cpuidle_enter_state) from [] (cpu_startup_entry+0x198/0x3a0)
[] (cpu_startup_entry) from [] (start_kernel+0x350/0x3c8)
[] (start_kernel) from [<8000807c>] (0x8000807c)

Re: [PATCH v2 0/1] drivers: net: cpsw: Fix NULL pointer dereference with two slave PHYs

2016-04-20 Thread David Miller

From: Andrew Goodbody 
Date: Wed, 20 Apr 2016 08:49:34 +

> Sorry, I had no notification that this had happened. However I
> thought that the plan was to revert v1 and go with David Rivshin's
> patch instead. I'll see if I can create a revert in a little while.

Yes, that's fine.

Re: [PATCH net 4/4] net/mlx4_en: Split SW RX dropped counter per RX ring

2016-04-20 Thread Eric Dumazet

On Wed, 2016-04-20 at 16:01 +0300, Or Gerlitz wrote:
> From: Eran Ben Elisha 
> 
> Count SW packet drops per RX ring instead of a global counter. This
> will allow monitoring the number of rx drops per ring.
> 
> In addition, SW rx_dropped counter was overwritten by HW rx_dropped
> counter, sum both of them instead to show the accurate value.
> 
> Fixes: ab35da16 ('net/mlx4_en: Moderate ethtool callback to [...] ')
> Signed-off-by: Eran Ben Elisha 
> Reported-by: Brenden Blanco 
> Signed-off-by: Saeed Mahameed 
> Signed-off-by: Or Gerlitz 
> ---

Reported-by: Eric Dumazet 

( http://www.spinics.net/lists/netdev/msg371318.html )

Thanks for following up !
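
The accounting described in the quoted changelog (per-ring software drops, summed with the hardware counter instead of being overwritten by it) can be sketched as below. This is only an illustration: `rx_ring_stats`, `sw_dropped` and `total_rx_dropped()` are hypothetical names, not the mlx4_en driver's actual structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-ring state; the real driver's structures differ. */
struct rx_ring_stats {
    uint64_t sw_dropped;    /* packets dropped in software on this ring */
};

/*
 * Reported rx_dropped = hardware drops + sum of per-ring software drops.
 * Neither source overwrites the other.
 */
static uint64_t total_rx_dropped(const struct rx_ring_stats *rings,
                                 size_t nrings, uint64_t hw_dropped)
{
    uint64_t total = hw_dropped;
    size_t i;

    assert(rings != NULL || nrings == 0);
    for (i = 0; i < nrings; i++)
        total += rings[i].sw_dropped;
    return total;
}
```

Keeping one counter per ring also means each ring can be inspected on its own, which is the monitoring benefit the changelog mentions.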

[net-next PATCH v2 0/3] Feature tweaks/fixes follow-up to GSO partial patches

2016-04-20 Thread Alexander Duyck

This patch series is a set of minor fix-ups and tweaks following the GSO
partial and TSO with IPv4 ID mangling patches.  It mostly is just meant to
make certain that if we have GSO partial support at the device we can make
use of it from the far end of the tunnel.

v2: Added cover page which was forgotten with first submission.
Added patch that enables TSOv4 IP ID mangling w/ tunnels and/or VLANs.

---

Alexander Duyck (3):
  netdev_features: Fold NETIF_F_ALL_TSO into NETIF_F_GSO_SOFTWARE
  veth: Update features to include all tunnel GSO types
  net: Add support for IP ID mangling TSO in cases that require encapsulation


 drivers/net/veth.c  |7 +++
 include/linux/netdev_features.h |8 +++-
 net/core/dev.c  |   11 +++
 3 files changed, 17 insertions(+), 9 deletions(-)

--

[net-next PATCH v2 3/3] net: Add support for IP ID mangling TSO in cases that require encapsulation

2016-04-20 Thread Alexander Duyck

This patch adds support for NETIF_F_TSO_MANGLEID if a given tunnel supports
NETIF_F_TSO.  This way if needed a device can then later enable the TSO
with IP ID mangling and the tunnels on top of that device can then also
make use of the IP ID mangling as well.

Signed-off-by: Alexander Duyck 
---
 net/core/dev.c |   11 +++
 1 file changed, 11 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 52d446b2cb99..6324bc9267f7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7029,8 +7029,19 @@ int register_netdevice(struct net_device *dev)
if (!(dev->flags & IFF_LOOPBACK))
dev->hw_features |= NETIF_F_NOCACHE_COPY;
 
+   /* If IPv4 TCP segmentation offload is supported we should also
+* allow the device to enable segmenting the frame with the option
+* of ignoring a static IP ID value.  This doesn't enable the
+* feature itself but allows the user to enable it later.
+*/
if (dev->hw_features & NETIF_F_TSO)
dev->hw_features |= NETIF_F_TSO_MANGLEID;
+   if (dev->vlan_features & NETIF_F_TSO)
+   dev->vlan_features |= NETIF_F_TSO_MANGLEID;
+   if (dev->mpls_features & NETIF_F_TSO)
+   dev->mpls_features |= NETIF_F_TSO_MANGLEID;
+   if (dev->hw_enc_features & NETIF_F_TSO)
+   dev->hw_enc_features |= NETIF_F_TSO_MANGLEID;
 
/* Make NETIF_F_HIGHDMA inheritable to VLAN devices.
 */

[net-next PATCH v2 1/3] netdev_features: Fold NETIF_F_ALL_TSO into NETIF_F_GSO_SOFTWARE

2016-04-20 Thread Alexander Duyck

This patch folds NETIF_F_ALL_TSO into the bitmask for NETIF_F_GSO_SOFTWARE.
The idea is to avoid duplication of defines since the only difference
between the two was the GSO_UDP bit.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdev_features.h |8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 15eb0b12fff9..bc8736266749 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -152,11 +152,6 @@ enum {
 #define NETIF_F_GSO_MASK   (__NETIF_F_BIT(NETIF_F_GSO_LAST + 1) - \
__NETIF_F_BIT(NETIF_F_GSO_SHIFT))
 
-/* List of features with software fallbacks. */
-#define NETIF_F_GSO_SOFTWARE   (NETIF_F_TSO | NETIF_F_TSO_ECN | \
-NETIF_F_TSO_MANGLEID | \
-NETIF_F_TSO6 | NETIF_F_UFO)
-
 /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
  * set in features when NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM are set--
  * this would be contradictory
@@ -170,6 +165,9 @@ enum {
 #define NETIF_F_ALL_FCOE   (NETIF_F_FCOE_CRC | NETIF_F_FCOE_MTU | \
 NETIF_F_FSO)
 
+/* List of features with software fallbacks. */
+#define NETIF_F_GSO_SOFTWARE   (NETIF_F_ALL_TSO | NETIF_F_UFO)
+
 /*
  * If one device supports one of these features, then enable them
  * for all in netdev_increment_features.
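
That the fold leaves the mask unchanged can be checked with a small stand-alone model. The bit positions here are hypothetical, and `F_ALL_TSO` is assumed to cover the same four TSO bits that the removed definition listed, which is what the changelog's "only difference was the GSO_UDP bit" implies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for illustration. */
#define F_TSO           (UINT64_C(1) << 0)
#define F_TSO_ECN       (UINT64_C(1) << 1)
#define F_TSO_MANGLEID  (UINT64_C(1) << 2)
#define F_TSO6          (UINT64_C(1) << 3)
#define F_UFO           (UINT64_C(1) << 4)

/* Assumed to mirror NETIF_F_ALL_TSO: every TCP segmentation bit. */
#define F_ALL_TSO (F_TSO | F_TSO_ECN | F_TSO_MANGLEID | F_TSO6)

/* Shape of the definition removed by the patch. */
#define F_GSO_SOFTWARE_OLD \
    (F_TSO | F_TSO_ECN | F_TSO_MANGLEID | F_TSO6 | F_UFO)

/* Shape of the definition added by the patch. */
#define F_GSO_SOFTWARE_NEW (F_ALL_TSO | F_UFO)

/* Both spellings yield the same mask; UFO is the only non-TSO bit. */
static void check_fold(void)
{
    assert(F_GSO_SOFTWARE_OLD == F_GSO_SOFTWARE_NEW);
    assert((F_GSO_SOFTWARE_NEW & ~F_ALL_TSO) == F_UFO);
}
```

The new definition has to come after `NETIF_F_ALL_TSO` in the header, which is presumably why the patch moves it further down the file.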

[net-next PATCH v2 2/3] veth: Update features to include all tunnel GSO types

2016-04-20 Thread Alexander Duyck

This patch adds support for the checksum enabled versions of UDP and GRE
tunnels.  With this change we should be able to send and receive GSO frames
of these types over the veth pair without needing to segment the packets.

Signed-off-by: Alexander Duyck 
---
 drivers/net/veth.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4f30a6ae50d0..f37a6e61d4ad 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -312,10 +312,9 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_set_rx_headroom= veth_set_rx_headroom,
 };
 
-#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |\
-  NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
-  NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL |   \
-  NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT | NETIF_F_UFO |   \
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
+  NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+  NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL | \
   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
   NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )

Re: [PATCH net-next] net/hsr: Fixed version field in ENUM

2016-04-20 Thread David Miller

From: Peter Heise 
Date: Wed, 20 Apr 2016 09:08:29 +0200

> New field (IFLA_HSR_VERSION) was added in the middle of an existing
> ENUM and would break kernel ABI, therefore moved to the end.
> Reported by Stephen Hemminger.
> 
> Signed-off-by: Peter Heise 

Applied, thanks.
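
Why the position inside the enum matters for ABI can be seen in a small model. The enumerator names below are hypothetical placeholders rather than the real IFLA_HSR_* list.

```c
#include <assert.h>

/* Existing enum as already shipped to user space. */
enum attrs_before { B_UNSPEC, B_FIRST, B_SECOND };

/* New member inserted in the middle: the value of "SECOND" shifts. */
enum attrs_middle { M_UNSPEC, M_FIRST, M_NEW, M_SECOND };

/* New member appended at the end: existing values stay the same. */
enum attrs_end { E_UNSPEC, E_FIRST, E_SECOND, E_NEW };

/* Returns 1 when an old binary's number still means the same attribute. */
static int abi_preserved(int old_value, int new_value)
{
    return old_value == new_value;
}

static void check_enum_abi(void)
{
    assert(!abi_preserved(B_SECOND, M_SECOND)); /* 2 became 3 */
    assert(abi_preserved(B_SECOND, E_SECOND));  /* still 2 */
}
```

User-space binaries built against the old header keep sending the old numbers, so only appending keeps them working.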

Re: [PATCH net-next] net: dsa: remove tag_protocol from dsa_switch

2016-04-20 Thread Andrew Lunn

On Mon, Apr 18, 2016 at 06:24:04PM -0400, Vivien Didelot wrote:
> Having the tag protocol in dsa_switch_driver for setup time and in
> dsa_switch_tree for runtime is enough. Remove dsa_switch's one.
> 
> Signed-off-by: Vivien Didelot 

I had to think about this one for a minute. At the moment it is good,
however, sometime in the future, we might want to revert it. Some
Marvell switches support two tagging schemes. At the moment, we have
no way to express that, so the switch driver only offers one. If we
were to extend the API to express a list of supported tagging schemes,
we would then want dsa_switch to contain the scheme actually chosen.
However, until we actually implement something like this, let's remove
it.

Reviewed-by: Andrew Lunn 

 Andrew

> ---
>  include/net/dsa.h | 5 -
>  net/dsa/dsa.c | 5 ++---
>  2 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index c4bc42b..2d280ab 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -136,11 +136,6 @@ struct dsa_switch {
>   void *priv;
>  
>   /*
> -  * Tagging protocol understood by this switch
> -  */
> - enum dsa_tag_protocol   tag_protocol;
> -
> - /*
>* Configuration data for this switch.
>*/
>   struct dsa_chip_data*pd;
> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index efa612f..d61ceed 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -267,7 +267,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent)
>* switch.
>*/
>   if (dst->cpu_switch == index) {
> - switch (ds->tag_protocol) {
> + switch (drv->tag_protocol) {
>  #ifdef CONFIG_NET_DSA_TAG_DSA
>   case DSA_TAG_PROTO_DSA:
>   dst->rcv = dsa_netdev_ops.rcv;
> @@ -295,7 +295,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent)
>   goto out;
>   }
>  
> - dst->tag_protocol = ds->tag_protocol;
> + dst->tag_protocol = drv->tag_protocol;
>   }
>  
>   /*
> @@ -411,7 +411,6 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
>   ds->pd = pd;
>   ds->drv = drv;
>   ds->priv = priv;
> - ds->tag_protocol = drv->tag_protocol;
>   ds->master_dev = host_dev;
>  
>   ret = dsa_switch_setup_one(ds, parent);
> -- 
> 2.8.0
>

[PATCH 1/1] Revert "Prevent NUll pointer dereference with two PHYs on cpsw"

2016-04-20 Thread Andrew Goodbody

This reverts commit cfe255600154f0072d4a8695590dbd194dfd1aeb

This can result in an "Unable to handle kernel paging request"
during boot. This was due to using an uninitialised struct member,
data->slaves.
---
 drivers/net/ethernet/ti/cpsw.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 2cd67a5..54bcc38 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -349,7 +349,6 @@ struct cpsw_slave {
struct cpsw_slave_data  *data;
struct phy_device   *phy;
struct net_device   *ndev;
-   struct device_node  *phy_node;
u32 port_vlan;
u32 open_stat;
 };
@@ -368,6 +367,7 @@ struct cpsw_priv {
spinlock_t  lock;
struct platform_device  *pdev;
struct net_device   *ndev;
+   struct device_node  *phy_node;
struct napi_struct  napi_rx;
struct napi_struct  napi_tx;
struct device   *dev;
@@ -1142,8 +1142,8 @@ static void cpsw_slave_open(struct cpsw_slave *slave, struct cpsw_priv *priv)
cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast,
   1 << slave_port, 0, 0, ALE_MCAST_FWD_2);
 
-   if (slave->phy_node)
-   slave->phy = of_phy_connect(priv->ndev, slave->phy_node,
+   if (priv->phy_node)
+   slave->phy = of_phy_connect(priv->ndev, priv->phy_node,
 &cpsw_adjust_link, 0, slave->data->phy_if);
else
slave->phy = phy_connect(priv->ndev, slave->data->phy_id,
@@ -2025,8 +2025,7 @@ static int cpsw_probe_dt(struct cpsw_priv *priv,
if (strcmp(slave_node->name, "slave"))
continue;
 
-   priv->slaves[i].phy_node =
-   of_parse_phandle(slave_node, "phy-handle", 0);
+   priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0);
parp = of_get_property(slave_node, "phy_id", &lenp);
if (of_phy_is_fixed_link(slave_node)) {
struct device_node *phy_node;
@@ -2267,22 +2266,12 @@ static int cpsw_probe(struct platform_device *pdev)
/* Select default pin state */
pinctrl_pm_select_default_state(&pdev->dev);
 
-   data = &priv->data;
-   priv->slaves = devm_kzalloc(&pdev->dev,
-   sizeof(struct cpsw_slave) * data->slaves,
-   GFP_KERNEL);
-   if (!priv->slaves) {
-   ret = -ENOMEM;
-   goto clean_runtime_disable_ret;
-   }
-   for (i = 0; i < data->slaves; i++)
-   priv->slaves[i].slave_num = i;
-
if (cpsw_probe_dt(priv, pdev)) {
dev_err(&pdev->dev, "cpsw: platform data missing\n");
ret = -ENODEV;
goto clean_runtime_disable_ret;
}
+   data = &priv->data;
 
if (is_valid_ether_addr(data->slave_data[0].mac_addr)) {
memcpy(priv->mac_addr, data->slave_data[0].mac_addr, ETH_ALEN);
@@ -2294,6 +2283,16 @@ static int cpsw_probe(struct platform_device *pdev)
 
memcpy(ndev->dev_addr, priv->mac_addr, ETH_ALEN);
 
+   priv->slaves = devm_kzalloc(&pdev->dev,
+   sizeof(struct cpsw_slave) * data->slaves,
+   GFP_KERNEL);
+   if (!priv->slaves) {
+   ret = -ENOMEM;
+   goto clean_runtime_disable_ret;
+   }
+   for (i = 0; i < data->slaves; i++)
+   priv->slaves[i].slave_num = i;
+
priv->slaves[0].ndev = ndev;
priv->emac_port = 0;
 
-- 
2.5.0
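
The ordering problem the revert addresses can be sketched in user space. All names below are hypothetical stand-ins, and the sketch assumes, as the diff suggests, that the device-tree parse is what fills in `data->slaves`, so the slave array cannot be sized before it runs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's platform data and private state. */
struct pdata { int slaves; };
struct slave { int slave_num; };
struct priv {
    struct pdata data;
    struct slave *slaves;
};

/* Stand-in for the device-tree parse: this is what sets data.slaves. */
static int parse_dt(struct priv *p, int slaves_in_dt)
{
    if (slaves_in_dt <= 0)
        return -1;
    p->data.slaves = slaves_in_dt;
    return 0;
}

/* Size the slave array from data.slaves, which must already be parsed. */
static int alloc_slaves(struct priv *p)
{
    int i;

    assert(p->data.slaves > 0); /* parse_dt() has to run first */
    p->slaves = calloc((size_t)p->data.slaves, sizeof(*p->slaves));
    if (!p->slaves)
        return -1;
    for (i = 0; i < p->data.slaves; i++)
        p->slaves[i].slave_num = i;
    return 0;
}
```

Calling `alloc_slaves()` before `parse_dt()` would size the array from a count that has not been set yet, which is the class of bug the changelog describes.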

[PATCH 0/1] Revert "Prevent NUll pointer dereference with two PHYs"

2016-04-20 Thread Andrew Goodbody

Revert this patch as not only did it use an uninitialised member of a struct
but there is also a pre-existing patch that does it better.

Andrew Goodbody (1):
  Revert "Prevent NUll pointer dereference with two PHYs on cpsw"

 drivers/net/ethernet/ti/cpsw.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

-- 
2.5.0

[PATCH net-next] net: fix HAVE_EFFICIENT_UNALIGNED_ACCESS typos

2016-04-20 Thread Eric Dumazet

From: Eric Dumazet 

HAVE_EFFICIENT_UNALIGNED_ACCESS needs CONFIG_ prefix.

Also add a comment in nla_align_64bit() explaining we have
to add a padding if current skb->data is aligned, as it
certainly can be confusing.

Fixes: 35c5845957c7 ("net: Add helpers for 64-bit aligning netlink attributes.")
Signed-off-by: Eric Dumazet 
---
 include/net/netlink.h |   19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index e644b3489acf..cf95df1fa14b 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1238,18 +1238,21 @@ static inline int nla_validate_nested(const struct nlattr *start, int maxtype,
  * Conditionally emit a padding netlink attribute in order to make
  * the next attribute we emit have a 64-bit aligned nla_data() area.
  * This will only be done in architectures which do not have
- * HAVE_EFFICIENT_UNALIGNED_ACCESS defined.
+ * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS defined.
  *
  * Returns zero on success or a negative error code.
  */
 static inline int nla_align_64bit(struct sk_buff *skb, int padattr)
 {
-#ifndef HAVE_EFFICIENT_UNALIGNED_ACCESS
-   if (IS_ALIGNED((unsigned long)skb->data, 8)) {
-   struct nlattr *attr = nla_reserve(skb, padattr, 0);
-   if (!attr)
-   return -EMSGSIZE;
-   }
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   /* The nlattr header is 4 bytes in size, that's why we test
+* if the skb->data _is_ aligned.  This NOP attribute, plus
+* nlattr header for next attribute, will make nla_data()
+* 8-byte aligned.
+*/
+   if (IS_ALIGNED((unsigned long)skb->data, 8) &&
+   !nla_reserve(skb, padattr, 0))
+   return -EMSGSIZE;
 #endif
return 0;
 }
@@ -1261,7 +1264,7 @@ static inline int nla_align_64bit(struct sk_buff *skb, int padattr)
 static inline int nla_total_size_64bit(int payload)
 {
return NLA_ALIGN(nla_attr_size(payload))
-#ifndef HAVE_EFFICIENT_UNALIGNED_ACCESS
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ NLA_ALIGN(nla_attr_size(0))
 #endif
;
