Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Jesper Dangaard Brouer

On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert  wrote:

> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  
> wrote:
> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >>
> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed 
> >>  wrote:  
> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  
> >>> wrote:  
>  On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> > wrote:  
> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> [...]  
> 
>  Only if a qdisc is present and pressure is high enough.
> 
>  But in a forwarding setup, we likely receive at a lower rate than the
>  NIC can transmit.  
> >>
> >> Yes, I can confirm this happens in my experiments.
> >>  
>   
> >>>
> >>> Jesper has a similar idea to make the qdisc think it is under
> >>> pressure, when the device TX ring is idle most of the time. I think
> >>> his idea can come in handy here. I am not fully involved in the
> >>> details, maybe he can elaborate more.
> >>>
> >>> But if it works, it will be transparent to napi, and xmit_more will
> >>> happen by design.  
> >>
> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> indicating xmit_more should be possible.  This will be a topic at the
> >> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
> >> challenge people to come up with a good solution ;-)
> >>  
> >
> > One thing I've noticed but haven't yet actually analyzed much is that
> > if I shrink the NIC descriptor ring size to be only slightly larger
> > than the qdisc layer bulking size, I get more bulking and better perf
> > numbers, at least on microbenchmarks. The reason being the NIC pushes
> > back more on the qdisc. So maybe there's a case for making the ring
> > size in the NIC some factor of the expected number of queues feeding
> > the descriptor ring.
> >  

I've also played with shrinking the NIC descriptor ring size. It works,
but it is an ugly hack to get the NIC to push back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size, like cache usage, but that is unrelated to this.)

 
> BQL is not helping with that?

Exactly. But the BQL _byte_ limit is not what is needed; what we need
to know is the number of _packets_ currently "in-flight", which Tom
already has a patch for :-)  Once we have that, the algorithm is simple.

On dequeue, the qdisc looks at the BQL pkts-in-flight count: if the
driver has "enough" packets in-flight, the qdisc starts its bulk-dequeue
building phase before calling the driver. The allowed max qdisc bulk
size should likely be related to pkts-in-flight.
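
A minimal sketch of the idea (hypothetical names; BQL only tracks
bytes today, and none of these helpers exist yet):

/* Sketch: qdisc-side bulk dequeue gated on the driver's in-flight
 * packet count. netdev_txq_pkts_inflight() and qdisc_bulk_dequeue()
 * are assumed helpers, not existing kernel API.
 */
static struct sk_buff *qdisc_dequeue_adaptive(struct Qdisc *q,
					      struct netdev_queue *txq)
{
	unsigned int inflight = netdev_txq_pkts_inflight(txq);

	/* Enough packets already sit in the TX ring: it will not run
	 * dry while we spend cycles building a bulk, so dequeue several
	 * skbs and let the driver see xmit_more on all but the last.
	 */
	if (inflight >= QDISC_BULK_THRESHOLD)
		return qdisc_bulk_dequeue(q, min(inflight, QDISC_BULK_MAX));

	return q->dequeue(q);	/* ring nearly idle: send immediately */
}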

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH 1/2] macb: Add 1588 support in Cadence GEM.

2016-09-07 Thread Harini Katakam
Hi,

On Tue, Sep 6, 2016 at 9:18 PM, Richard Cochran
 wrote:
>

>> +#define GEM_TISUBN   0x01bc /* 1588 Timer Increment Sub-ns */
>
> This register does not exist.  Looking at
>
>Zynq-7000 AP SoC Technical Reference Manual
>UG585 (v1.10) February 23, 2015
>
> starting on page 1273 we see:
>
> udp_csum_errors 0x01B0 32 ro    0x UDP checksum error
> timer_strobe_s  0x01C8 32 rw    0x 1588 timer sync strobe seconds
> timer_strobe_ns 0x01CC 32 mixed 0x 1588 timer sync strobe nanoseconds
> timer_s         0x01D0 32 rw    0x 1588 timer seconds
> timer_ns        0x01D4 32 mixed 0x 1588 timer nanoseconds
> timer_adjust    0x01D8 32 mixed 0x 1588 timer adjust
> timer_incr      0x01DC 32 mixed 0x 1588 timer increment
>
> There is no register at 0x1BC.
>
>> +#define GEM_TSH  0x01c0 /* 1588 Timer Seconds High */
>
> This one doesn't exist either.  What is going on here?

I can't be sure of the version of Cadence GEM used in SAMA5D2,
but these registers (the sub-ns increment, at least) exist only in
the IP version used in Zynq UltraScale+ MPSoC.


>> + /* get GEM internal time */
>> + sech = gem_readl(bp, TSH);
>> + secl = gem_readl(bp, TSL);
>
> Does reading TSH latch the time?  The TRM is silent about that, and
> most other designs latch on reading the LSB.

No, it does not latch the time.
When doing a read + adjust + write, this
means there's room for some error.

When writing, though, the write to the MSB and LSB registers
was made atomic. That bug fix came only in the
most recent version of the IP, I'm afraid.
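
For what it's worth, a common way to tolerate a non-latching MSB/LSB
pair on reads is to re-read the MSB and retry on rollover. A sketch
only, reusing the gem_readl()/TSH/TSL accessors from this patch:

/* Sketch: read the 1588 seconds counter without latching support.
 * Retry if TSL wrapped into TSH between the two reads.
 */
static u64 gem_tsu_get_secs(struct macb *bp)
{
	u32 sech, secl, sech2;

	do {
		sech  = gem_readl(bp, TSH);
		secl  = gem_readl(bp, TSL);
		sech2 = gem_readl(bp, TSH);
	} while (sech != sech2);

	return ((u64)sech << 32) | secl;
}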

Regards,
Harini


[PATCH net-next] ipv4: accept u8 in IP_TOS ancillary data

2016-09-07 Thread Eric Dumazet
From: Eric Dumazet 

In commit f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as
ancillary data") Francesco added IP_TOS values specified as an integer.

However, the kernel sends to userspace (at recvmsg() time) an IP_TOS value
in a single byte, when IP_RECVTOS is set on the socket.

It can be very useful to reflect all ancillary options as given by the
kernel in a subsequent sendmsg(), instead of aborting the sendmsg() with
EINVAL as happens after Francesco's patch.

So this patch extends the IP_TOS ancillary data to also accept a u8, so
that a UDP server can simply reuse the same ancillary block without
having to mangle it.

Jesper can then augment
https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c
to add TOS reflection ;)

Fixes: f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
Signed-off-by: Eric Dumazet 
Cc: Francesco Fusco  
Cc: Jesper Dangaard Brouer 
---
 net/ipv4/ip_sockglue.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 71a52f4d4cffba2db9353f43dc817689bf4fab10..af4919792b6a812041dcb18ff30aa8b27482c7a2 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -284,9 +284,12 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 			ipc->ttl = val;
 			break;
 		case IP_TOS:
-			if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
+			if (cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+				val = *(int *)CMSG_DATA(cmsg);
+			else if (cmsg->cmsg_len == CMSG_LEN(sizeof(u8)))
+				val = *(u8 *)CMSG_DATA(cmsg);
+			else
 				return -EINVAL;
-			val = *(int *)CMSG_DATA(cmsg);
 			if (val < 0 || val > 255)
 				return -EINVAL;
 			ipc->tos = val;
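
For illustration, a userspace sender can then echo back the one-byte
TOS exactly as received. A sketch, not part of the patch (assumes the
usual <stdint.h>, <string.h>, <sys/socket.h> and <netinet/in.h>
includes; received_tos is whatever IP_RECVTOS delivered):

	/* Build a one-byte IP_TOS cmsg for sendmsg(). */
	union {
		char buf[CMSG_SPACE(sizeof(uint8_t))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {0};
	struct cmsghdr *cm;
	uint8_t tos = received_tos;

	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = IPPROTO_IP;
	cm->cmsg_type  = IP_TOS;
	cm->cmsg_len   = CMSG_LEN(sizeof(uint8_t));
	memcpy(CMSG_DATA(cm), &tos, sizeof(tos));
	/* then fill msg_name/msg_iov and call sendmsg() as usual */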




[PATCH] Bluetooth: fix kzalloc-simple.cocci warnings

2016-09-07 Thread kbuild test robot
net/bluetooth/mgmt.c:905:6-13: WARNING: kzalloc should be used for rp, instead of kmalloc/memset


 Use kzalloc rather than kmalloc followed by memset with 0.

 This considers some simple cases that are common and easy to validate.
 Note in particular that there are no ...s in the rule, so all of the
 matched code has to be contiguous.

Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

CC: Michał Narajowski 
Signed-off-by: Fengguang Wu 
---

 mgmt.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/net/bluetooth/mgmt.c
+++ b/net/bluetooth/mgmt.c
@@ -902,12 +902,10 @@ static int read_ext_controller_info(stru
eir_len = eir_append_data(buff, eir_len, EIR_NAME_SHORT,
  hdev->short_name, name_len);
 
-   rp = kmalloc(sizeof(*rp) + eir_len, GFP_KERNEL);
+   rp = kzalloc(sizeof(*rp) + eir_len, GFP_KERNEL);
if (!rp)
return -ENOMEM;
 
-   memset(rp, 0, sizeof(*rp) + eir_len);
-
rp->eir_len = cpu_to_le16(eir_len);
memcpy(rp->eir, buff, eir_len);
 


Re: [PATCH 2/2 v3] net: smsc911x: request and deassert optional RESET GPIO

2016-09-07 Thread David Miller
From: Linus Walleij 
Date: Wed,  7 Sep 2016 15:53:42 +0200

> On some systems (such as the Qualcomm APQ8060 Dragonboard) the
> RESET signal of the SMSC911x is not pulled up by a resistor (or
> the internal pull-up that will pull it up if the pin is not
> even connected) but instead connected to a GPIO line, so that
> the operating system must explicitly deassert RESET before use.
> 
> Support this in the SMSC911x driver so this ethernet connector
> can be used on such targets.
> 
> Notice that we request the line to go logical low (deassert)
> whilst the line on the actual component is active low. This
> is managed in the respective hardware description when
> specifying the GPIO line with e.g. device tree or ACPI. With
> device tree it looks like this in one case:
> 
>   reset-gpios = < 30 GPIO_ACTIVE_LOW>;
> 
> Which means that logically requesting the RESET line to be
> deasserted will result in the line being driven high, taking
> the device out of reset.
> 
> Cc: Jeremy Linton 
> Signed-off-by: Linus Walleij 

Applied.
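
For context, the driver-side pattern such a patch adds is typically
just the following. A sketch using the standard gpiod consumer API,
not the verbatim hunk from the patch:

	/* Obtain an optional RESET line, requested logically
	 * deasserted. With GPIO_ACTIVE_LOW in the hardware
	 * description, "deasserted" drives the pin electrically high.
	 */
	struct gpio_desc *reset_gpiod;

	reset_gpiod = devm_gpiod_get_optional(&pdev->dev, "reset",
					      GPIOD_OUT_LOW);
	if (IS_ERR(reset_gpiod))
		return PTR_ERR(reset_gpiod);

	if (reset_gpiod)
		usleep_range(1000, 2000); /* let the chip leave reset */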


Re: [PATCH 1/2 v3] net: smsc911x: augment device tree bindings

2016-09-07 Thread David Miller
From: Linus Walleij 
Date: Wed,  7 Sep 2016 15:53:31 +0200

> This adds device tree bindings for:
> 
> - An optional GPIO line for releasing the RESET signal to the
>   SMSC911x devices
> 
> - An optional PME (power management event) interrupt line that
>   can be utilized to wake up the system on network activity.
>   This signal exist on all the SMSC911x devices, it is just not
>   very often routed.
> 
> Both these lines are routed to the SoC on the Qualcomm APQ8060
> Dragonboard and thus needs to be bound in the device tree.
> 
> Cc: devicet...@vger.kernel.org
> Cc: Jeremy Linton 
> Signed-off-by: Linus Walleij 

Applied.


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Tom Herbert
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  wrote:
> On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>>
>> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed  
>> wrote:
>>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
 On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> wrote:
>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> [...]

 Only if a qdisc is present and pressure is high enough.

 But in a forwarding setup, we likely receive at a lower rate than the
 NIC can transmit.
>>
>> Yes, I can confirm this happens in my experiments.
>>

>>>
>>> Jesper has a similar idea to make the qdisc think it is under
>>> pressure, when the device TX ring is idle most of the time. I think
>>> his idea can come in handy here. I am not fully involved in the
>>> details, maybe he can elaborate more.
>>>
>>> But if it works, it will be transparent to napi, and xmit_more will
>>> happen by design.
>>
>> Yes. I have some ideas around getting more bulking going from the qdisc
>> layer, by having the drivers provide some feedback to the qdisc layer
>> indicating xmit_more should be possible.  This will be a topic at the
>> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
>> challenge people to come up with a good solution ;-)
>>
>
> One thing I've noticed but haven't yet actually analyzed much is that
> if I shrink the NIC descriptor ring size to be only slightly larger
> than the qdisc layer bulking size, I get more bulking and better perf
> numbers, at least on microbenchmarks. The reason being the NIC pushes
> back more on the qdisc. So maybe there's a case for making the ring
> size in the NIC some factor of the expected number of queues feeding
> the descriptor ring.
>

BQL is not helping with that?

Tom

> .John


RE: [PATCH net-next 0/3] r8152: configuration setting

2016-09-07 Thread Hayes Wang
David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, September 08, 2016 8:38 AM
[...]
> By forcing a certain mode via a Kconfig value, you are basically making it
> impossible for distributions to do something reasonable here.

The request always comes from manufacturers, not end users.
They always ask the driver to control everything, and I don't
think the end user knows or cares about how the device is set.
That is why I tried to satisfy them via Kconfig.

I understand you have rejected this approach. I will think about
whether there is any other method. Thanks.

Best Regards,
Hayes



Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread John Fastabend
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
> 
> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed  
> wrote:
>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
 On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
 wrote:  
> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> [...]
>>>
>>> Only if a qdisc is present and pressure is high enough.
>>>
>>> But in a forwarding setup, we likely receive at a lower rate than the
>>> NIC can transmit.
> 
> Yes, I can confirm this happens in my experiments.
> 
>>>  
>>
>> Jesper has a similar idea to make the qdisc think it is under
>> pressure, when the device TX ring is idle most of the time. I think
>> his idea can come in handy here. I am not fully involved in the
>> details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit_more will
>> happen by design.
> 
> Yes. I have some ideas around getting more bulking going from the qdisc
> layer, by having the drivers provide some feedback to the qdisc layer
> indicating xmit_more should be possible.  This will be a topic at the
> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
> challenge people to come up with a good solution ;-)
> 

One thing I've noticed but haven't yet actually analyzed much is that
if I shrink the NIC descriptor ring size to be only slightly larger
than the qdisc layer bulking size, I get more bulking and better perf
numbers, at least on microbenchmarks. The reason being the NIC pushes
back more on the qdisc. So maybe there's a case for making the ring
size in the NIC some factor of the expected number of queues feeding
the descriptor ring.

.John


RE: [PATCH net-next 0/3] r8152: configuration setting

2016-09-07 Thread Hayes Wang
Bjørn Mork [mailto:bj...@mork.no]
> Sent: Wednesday, September 07, 2016 9:51 PM
[...]
> So this adds a lot of code to work around the issues you introduced by
> unnecessarily blacklisting the CDC ECM configuration earlier, and still
> makes the r8152 driver handle the device even in ECM mode.

I suggested using vendor mode only, but some people asked me to
submit such patches. If these patches are rejected, I have
enough reason to tell them this approach is unacceptable, rather
than simply that I won't do it.

> Just remove the completely unnecessary blacklist, and let the cdc_ether
> driver handle the device if the user selects the ECM configuration.
> That't how the USB system works.  There is no need for any code in r8152
> to do that.

The pure cdc_ether driver can't change the speed of the
ethernet link, because it doesn't know how to access the PHY of
the device. Therefore, I added the relevant code to the r8152 driver.

Best Regards,
Hayes


[PATCHv2 iproute2] ip route: check ftell, fseek return value

2016-09-07 Thread Hangbin Liu
ftell() may return -1 in the error case, which is not handled and therefore
passes a negative offset to fseek(). The return code of fseek() is also not
checked.

Reported-by: Phil Sutter 
Signed-off-by: Hangbin Liu 
---
 ip/iproute.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 3da23af..c06a474 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1860,6 +1860,11 @@ static int iproute_restore(void)
exit(-1);
 
pos = ftell(stdin);
+   if (pos == -1) {
+   perror("Failed to restore: ftell");
+   exit(-1);
+   }
+
for (prio = 0; prio < 3; prio++) {
int err;
 
@@ -1867,7 +1872,10 @@ static int iproute_restore(void)
if (err)
exit(err);
 
-   fseek(stdin, pos, SEEK_SET);
+   if (fseek(stdin, pos, SEEK_SET) == -1) {
+   perror("Failed to restore: fseek");
+   exit(-1);
+   }
}
 
exit(0);
-- 
2.5.5



Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it

2016-09-07 Thread Toshiaki Makita
On 2016/09/08 3:22, Michal Soltys wrote:
...
> 4.7.2
> git describe on that commit suggests it's been available since 4.6.x
> 
> What I did, in detail:
> 
> ip li add name port1b type veth peer name port1e
> ip li add br0 type bridge
> ip li set dev br0 type bridge vlan_default_pvid 0
> ip li set dev br0 type bridge vlan_filtering 1
> bridge vlan add dev br0 vid 10 self
> bridge vlan add dev br0 vid 250 untagged pvid self
> ip li add link br0 name vlan10 type vlan id 10
> ip li set port1b master br0
> 
> At this point br0.vlan10 had outdated mac after br0 took port1b's one as
> its own.

If the mac address of the lower device is changed while the vlan device is
down, the address will be synchronized when the vlan device comes up.
Please try "ip li set vlan10 up".

-- 
Toshiaki Makita




[PATCH] net_namespace: fixed net_device reference leak

2016-09-07 Thread Jojy Varghese
During namespace cleanup, if the 'dst' subsystem is holding
a reference to the loopback interface in the namespace, it does not get
released. This is because in the case where the net_device held by 'dst'
is the same as the namespace's loopback net_device, the current code first
does a 'dev_hold' on that same device followed by a 'dev_put' on it,
resulting in a no-op.

This change fixes the leak by assigning the initial namespace's loopback
device to the 'dst' before releasing the reference to the network device
being unregistered.

Additional reference: https://github.com/docker/docker/issues/5618

Signed-off-by: Jojy G Varghese 
---
 net/core/dst.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index a1656e3..a18e8ea 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -433,7 +433,7 @@ static void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 		dst->input = dst_discard;
 		dst->output = dst_discard_out;
 	} else {
-		dst->dev = dev_net(dst->dev)->loopback_dev;
+		dst->dev = init_net.loopback_dev;
 		dev_hold(dst->dev);
 		dev_put(dev);
 	}
-- 
1.8.3.1


Re: Minimum MTU Mess

2016-09-07 Thread Andrew Lunn
> This is definitely going to require a few passes... (Working my way
> through every driver with an ndo_change_mtu wired up right now to
> see just how crazy this might get).

It might be something Coccinelle can help you with. Try describing the
transformation you want to do, to their mailing list, and they might
come up with a script for you.

 Andrew


Re: [PATCH net-next v22 3/3] openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes

2016-09-07 Thread pravin shelar
On Wed, Sep 7, 2016 at 9:56 AM, Eric Garver  wrote:
> Add support for 802.1ad including the ability to push and pop double
> tagged vlans. Add support for 802.1ad to netlink parsing and flow
> conversion. Uses doubly nested encap attributes to represent a double
> tagged vlan. The inner TPID is encoded along with the c-TCI in nested attributes.
>
> This is based on Thomas F Herbert's original v20 patch. I made some
> small clean ups and bug fixes.
>
> Signed-off-by: Thomas F Herbert 
> Signed-off-by: Eric Garver 
> ---

Thanks for finishing the work.

Acked-by: Pravin B Shelar 


Re: [ovs-dev] [PATCH net-next v22 1/3] openvswitch: 802.1ad uapi changes.

2016-09-07 Thread pravin shelar
On Wed, Sep 7, 2016 at 9:56 AM, Eric Garver  wrote:
> From: Thomas F Herbert 
>
> openvswitch: Add support for 802.1ad
>
> Change the description of the VLAN tpid field.
>
> Signed-off-by: Thomas F Herbert 

Acked-by: Pravin B Shelar 


Re: [ovs-dev] [PATCH net-next v22 2/3] vlan: Check for vlan ethernet types for 802.1q or 802.1ad

2016-09-07 Thread pravin shelar
On Wed, Sep 7, 2016 at 9:56 AM, Eric Garver  wrote:
> This is to simplify using double tagged vlans. This function allows all
> valid vlan ethertypes to be checked in a single function call.
> Also replace some instances that check for both ETH_P_8021Q and
> ETH_P_8021AD.
>
> Patch based on one originally by Thomas F Herbert.
>
> Signed-off-by: Thomas F Herbert 
> Signed-off-by: Eric Garver 

Acked-by: Pravin B Shelar 


Re: [PATCH v2 net-next 0/2] qed*: Debug data collection

2016-09-07 Thread David Miller
From: Tomer Tayar 
Date: Wed, 7 Sep 2016 16:36:23 +0300

> This patch series adds support for debug data collection in the qed driver,
> and the means to extract it in the qede driver via the get_regs operation.
 ...
> Please consider applying this to 'net-next'.

Series applied, thank you.


Re: [PATCH] kaweth: remove obsolete debugging statements

2016-09-07 Thread David Miller
From: Oliver Neukum 
Date: Wed,  7 Sep 2016 15:27:09 +0200

> Some statements in the driver only served to indicate
> which functions were entered. Ftrace can do that just as well without
> needing memory. Remove the statements.
> 
> Signed-off-by: Oliver Neukum 

Applied to net-next, thanks Oliver.


Re: [PATCH] qed: mark symbols static where possible

2016-09-07 Thread David Miller
From: Yuval Mintz 
Date: Wed, 7 Sep 2016 11:55:34 +

>> We get a few warnings when building kernel with W=1:
>> drivers/net/ethernet/qlogic/qed/qed_l2.c:112:5: warning: no previous
>> prototype for 'qed_sp_vport_start' [-Wmissing-prototypes]
>> 
>> 
>> In fact, these functions are only used in the file in which they are
>> declared and don't need a declaration, but can be made static.
>> so this patch marks these functions with 'static'.
>> 
>> Signed-off-by: Baoyou Xie 
> 
> Most of these changes are correct. I have 2 reservations, though:
> 
> 1. qed_get_vport_stats() is in use in net-next by qed_main.c starting with
> 6c75424612a7 ("qed: Add support for NCSI statistics."), so we shouldn't
> make it static.
> [This raises the question of which repository this patch was built against.]
> 
> 2. Regarding the changes in qed_cxt.c - some of the functions you're turning
> into static are going to be used by other files in our pending RoCE driver
> submissions [you could say those ARE really missing their prototypes].
> I don't have a real objection to this change - I'm just stating that if you
> change those now to static, we'll probably 'revert' the change in the near
> future.

Baoyou, please respin this against net-next, thanks.


Re: [PATCH] qed: add missing header dependencies

2016-09-07 Thread David Miller
From: Baoyou Xie 
Date: Wed,  7 Sep 2016 19:07:00 +0800

> We get 4 warnings when building kernel with W=1:
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:6:5: warning: no previous 
> prototype for 'qed_selftest_memory' [-Wmissing-prototypes]
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:19:5: warning: no previous 
> prototype for 'qed_selftest_interrupt' [-Wmissing-prototypes]
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:32:5: warning: no previous 
> prototype for 'qed_selftest_register' [-Wmissing-prototypes]
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:55:5: warning: no previous 
> prototype for 'qed_selftest_clock' [-Wmissing-prototypes]
> 
> In fact, these functions are declared in qed_selftest.h, so this patch
> adds the missing header dependencies.
> 
> Signed-off-by: Baoyou Xie 

Applied.


Re: [PATCH net-next 0/3] r8152: configuration setting

2016-09-07 Thread David Miller
From: Hayes Wang 
Date: Wed, 7 Sep 2016 16:12:19 +0800

> Some people prefer to use ECM mode rather than vendor mode. Therefore, I added
> CONFIG_RTL8152_CONFIG_VALUE in Kconfig. Then, users can choose the USB
> configuration value they want. The default is to support vendor mode
> only.

By forcing a certain mode via a Kconfig value, you are basically making it
impossible for distributions to do something reasonable here.

This needs to be somehow a runtime thing, and I'm pretty sure we've
had many other situations of USB devices which want to be run by
different "personality" drivers in the past.

Perhaps we can do something generically for USB devices for this kind
of situation and make r8152 use it.

Thanks.


Re: [PATCH net-next V6 0/4] net/sched: ip tunnel metadata set/release/classify by using TC

2016-09-07 Thread David Miller
From: Hadar Hen Zion 
Date: Wed,  7 Sep 2016 11:08:02 +0300

> This patchset introduces ip tunnel manipulation support using the TC 
> subsystem.

Please address the feedback given by Eric Dumazet for patch #4,
thank you.


Re: [PATCH net-next v2] netlink: don't forget to release a rhashtable_iter structure

2016-09-07 Thread David Miller
From: Andrei Vagin 
Date: Tue,  6 Sep 2016 21:31:17 -0700

> This bug was detected by kmemleak:
> unreferenced object 0x8804269cc3c0 (size 64):
>   comm "criu", pid 1042, jiffies 4294907360 (age 13.713s)
>   hex dump (first 32 bytes):
> a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00  .2.,
> 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  
>   backtrace:
> [] kmemleak_alloc+0x4a/0xa0
> [] kmem_cache_alloc_trace+0x10f/0x280
> [] __netlink_diag_dump+0x26c/0x290 [netlink_diag]
> 
> v2: don't remove a reference on a rhashtable_iter structure to
> release it from netlink_diag_dump_done
> 
> Cc: Herbert Xu 
> Fixes: ad202074320c ("netlink: Use rhashtable walk interface in diag dump")
> Signed-off-by: Andrei Vagin 

Applied, thanks.


Re: [PATCH net-next] MAINTAINERS: Update CPMAC email address

2016-09-07 Thread David Miller
From: Florian Fainelli 
Date: Tue,  6 Sep 2016 20:22:21 -0700

> Signed-off-by: Florian Fainelli 

Since MAINTAINERS entry accuracy is very important for bug reporting
and patch CC:'ing, I tend to apply these kind of patches always to
'net'.

And I have done so in this case too.

Thanks.


Re: [PATCH net-next 1/2] tcp: measure rwnd-limited time

2016-09-07 Thread David Miller
From: "Francis Y. Yan" 
Date: Tue,  6 Sep 2016 18:32:40 -0700

> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7be9b12..f5b588e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -176,6 +176,7 @@ struct tcp_sock {
>* were acked.
>*/
>   struct u64_stats_sync syncp; /* protects 64bit vars (cf tcp_get_info()) */
> + seqcount_t seqcnt;  /* protects rwnd-limited-related vars, etc. */
>  
>   u32 snd_una;/* First byte we want an ack for*/
>   u32 snd_sml;/* Last byte of the most recently transmitted small packet */
> @@ -204,6 +205,8 @@ struct tcp_sock {
>  
>   u32 window_clamp;   /* Maximal window to advertise  */
>   u32 rcv_ssthresh;   /* Current window clamp */
> + struct skb_mstamp rwnd_limited_ts; /* Last timestamp limited by rwnd */
> + u64 rwnd_limited;   /* Total time (us) limited by rwnd */
>  
>   /* Information of the most recently (s)acked skb */
>   struct tcp_rack {

I understand completely the usefulness of this change, but wow that is a lot of
new TCP socket space taken up just to export some time values in tcp_info for
debugging and statistics.


Re: [PATCH net-next] net: xfrm: Change u32 sysctl entries to use proc_douintvec

2016-09-07 Thread David Miller
From: Subash Abhinov Kasiviswanathan 
Date: Tue,  6 Sep 2016 18:09:31 -0600

> proc_dointvec limits the values to INT_MAX in u32 sysctl entries.
> proc_douintvec allows writing up to UINT_MAX.
> 
> Signed-off-by: Subash Abhinov Kasiviswanathan 

I am assuming Steffen will pick this up.


Re: [PATCH net-next 0/1] rxrpc: Local abort tracepoint

2016-09-07 Thread David Miller
From: David Howells 
Date: Wed, 07 Sep 2016 16:25:50 +0100

> 
> Here's a patch that adds a tracepoint that allows local aborts to be
> debugged.  This needs to be applied on top of the just-posted call refcount
> overhaul patch.
> 
> The patches can be found here also:
> 
>   
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
> 
> Tagged thusly:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-rewrite-20160904-2

Also pulled, thanks a lot David.


Re: [PATCH net-next 0/8] rxrpc: Overhaul call refcounting

2016-09-07 Thread David Miller
From: David Howells <dhowe...@redhat.com>
Date: Wed, 07 Sep 2016 16:22:26 +0100

> 
> Here's a set of mostly small patches leading up to one big one.
> 
> The big patch at the end of the series overhauls how rxrpc_call refcounting
> is handled, making it more sane so that calls bound to user IDs are _only_
> released from socket operations or kernel API functions.  Further, the
> patch stops calls from holding refs on their parent socket - which can
> prevent the socket from being cleaned up.
> 
> The second largest patch improves the call tracking tracepoint by providing
> extra information about the situation in which gets and puts occur.  This
> allows distinctions to be drawn between refs held by the socket user ID
> tree, refs held by the work queue (to be implemented by a future patch) and
> other refs.
> 
> The other patches include a couple of cleanups and some simple alterations
> to avoid NULL pointer dereferences in the big patch.
> 
> The patches can be found here also (non-terminally on the branch):
> 
>   
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
> 
> Tagged thusly:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-rewrite-20160907-1

Pulled.


Re: XPS configuration question (on tg3)

2016-09-07 Thread Eric Dumazet
On Thu, 2016-09-08 at 01:45 +0200, Michal Soltys wrote:
> > Keep in mind that TCP stack can send packets, responding to incoming
> > ACK.
> > 
> > So you might check that incoming ACK are handled by the 'right' cpu.
> > 
> > Without RFS, there is no such guarantee.

> Did some more testing today, indeed RFS helped with TCP flows, but 
> it got me wondering:
> 
> In scenario such as: XPS off, RFS/RPS off, irqs pinned, process
> transfering data pinned - one tx queue was chosen (through hash) and
> consistently persisted throughout the whole transfer. No exceptions, at
> least none in the tests I did.

It depends if at least one packet for each flow sits in a qdisc/NIC
queue. TCP has this Out Of Order transmit logic preventing a TX queue
change, even if the process doing the sendmsg() is migrated.

git grep -n ooo_okay

> 
> When XPS is getting enabled, the only thing that changes is that instead
> of using hash to select one of available queues, the cpu running process
> is specifically told which queue it can use (and eventually selects one
> through hash if more than one is available). Shouldn't the choice
> persist throughout the transfer as well then ?

Sure, if the process doing the sendmsg() sticks to one cpu, and this cpu
is the one handling incoming ACK packets as well.
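
The out-of-order transmit logic lives in __netdev_pick_tx(); roughly
(paraphrased from net/core/dev.c of that era, details may differ):

	/* A socket's cached TX queue sticks until TCP marks an skb
	 * ooo_okay (no data left in flight); only then may XPS/hash
	 * pick, and cache, a new queue.
	 */
	int queue_index = sk_tx_queue_get(sk);

	if (queue_index < 0 || skb->ooo_okay ||
	    queue_index >= dev->real_num_tx_queues) {
		int new_index = get_xps_queue(dev, skb); /* XPS mask lookup */

		if (new_index < 0)
			new_index = skb_tx_hash(dev, skb); /* fall back to hash */

		if (queue_index != new_index && sk &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

		queue_index = new_index;
	}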






Re: XPS configuration question (on tg3)

2016-09-07 Thread Michal Soltys
On 2016-09-07 02:19, Eric Dumazet wrote:
> On Tue, 2016-09-06 at 23:00 +0200, Michal Soltys wrote:
>> On 2016-09-06 22:21, Alexander Duyck wrote:
>> > On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltys  wrote:
>> >> Hi,
>> >>
>> >> I've been testing different configurations and I didn't manage to get XPS 
>> >> to "behave" correctly - so I'm probably misunderstanding or forgetting 
>> >> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 
>> >> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx 
>> >> and rx), 1 was unused (this got me scratching my head a bit) and the 
>> >> remaining one was for the last rx (though due to another bug recently 
>> >> fixed the 4th rx queue was inconfigurable on receive side). The names 
>> >> were: eth1b-0, eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4.
>> >>
>> >> The XPS was configured as:
>> >>
>> >> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus
>> >> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus
>> >> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus
>> >>
>> >> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue 
>> >> only, 4-7 tx-1 and 8-15 tx-2.
>> >>
>> >> Just in case rx side could get in the way as far as flows go, relevant 
>> >> irqs were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to 
>> >> 10 - falling into groups defined by the above masks.
>> >>
>> >> I tested both with mq and multiq scheduler, essentially either this:
>> >>
>> >> qdisc mq 2: root
>> >> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >>
>> >> or this (for the record, skbaction queue_mapping was behaving correctly 
>> >> with the one below):
>> >>
>> >> qdisc multiq 3: root refcnt 6 bands 3/5
>> >> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >>
>> >> Now, do I understand correctly, that under the above setup - commands 
>> >> such as
>> >>
>> >> taskset 400 nc -p $prt host_ip 12345 > >> or
>> >> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10
>> >>
>> >> ITOW - pinning simple nc command on cpu #10 (or using a tool that 
>> >> supports affinity by itself) and sending data to some other host on the 
>> >> net - should *always* use tx-2 queue ?
>> >> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 
>> >> > >>
>> >> In my case, what queue it used was basically random (on top of that it 
>> >> sometimes changed the used queue mid-transfer) what could be easily 
>> >> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a 
>> >> bit at loss now, as I though xps configuration should be absolute.
>> >>
>> >> Well, I'd be greatful for some pointers / hints.
>> > 
>> > So it sounds like you have everything configured correctly.  The one
>> > question I would have is if we are certain the CPU pinning is working
>> > for the application.  You might try using something like perf to
>> > verify what is running on CPU 10, and what is running on the CPUs that
>> > the queues are associated with.
>> > 
>> 
>> I did verify with 'top' in this case. I'll double check tomorrow just to
>> be sure. Other than testing, there was nothing else running on the machine.
>> 
>> > Also after you have configured things you may want to double check and
>> > verify the xps_cpus value is still set.  I know under some
>> > circumstances the value can be reset by a device driver if the number
>> > of queues changes, or if the interface toggles between being
>> > administratively up/down.
>> 
>> Hmm, none of this was happening during tests.
>> 
>> Are there any other circumstances where xps settings could be ignored or
>> changed during the test (that is during the actual transfer, not between
>> separate attempts) ?
>> 
>> One thing I'm a bit afraid of is that the kernel was not exactly the newest
>> (3.16); maybe I'm missing some crucial fixes, though XPS was added much
>> earlier than that. Either way, I'll try to redo the tests with a current
>> kernel tomorrow.
>> 
> 
> Keep in mind that TCP stack can send packets, responding to incoming
> ACK.
> 
> So you might check that incoming ACK are handled by the 'right' cpu.
> 
> Without RFS, there is no such guarantee.
> 
> echo 32768 >/proc/sys/net/core/rps_sock_flow_entries
> echo 8192 >/sys/class/net/eth1/queues/rx-0/rps_flow_cnt
> echo 8192 >/sys/class/net/eth1/queues/rx-1/rps_flow_cnt
> echo 8192 >/sys/class/net/eth1/queues/rx-2/rps_flow_cnt
> echo 8192 >/sys/class/net/eth1/queues/rx-3/rps_flow_cnt
> 

Did some more testing today, indeed RFS helped with TCP flows, but
it got me wondering:

In scenario such as: XPS off, RFS/RPS off, irqs pinned, process
transfering data pinned - one tx queue was chosen (through hash) and
consistently persisted throughout the whole transfer. No exceptions, at
least none in the tests I did.

When XPS is getting enabled, the only thing that changes is that instead
of using hash to select one of available queues, the cpu running process
is specifically told which queue it can use (and eventually selects one
through hash if more than one is available). Shouldn't the choice
persist throughout the transfer as well then ?

Re: Minimum MTU Mess

2016-09-07 Thread Jarod Wilson
On Wed, Sep 07, 2016 at 01:35:35PM -0700, Stephen Hemminger wrote:
> On Wed, 7 Sep 2016 15:53:56 -0400
> Jarod Wilson  wrote:
> 
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6466,9 +6466,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
> > if (new_mtu == dev->mtu)
> > return 0;
> >  
> > -   /*  MTU must be positive.*/
> > -   if (new_mtu < 0)
> > +   if (new_mtu < dev->min_mtu) {
> > +   netdev_err(dev, "Invalid MTU %d requested, hw min %d\n",
> > +  new_mtu, dev->min_mtu);
> > return -EINVAL;
> > +   }
> > +
> > +   if (new_mtu > dev->max_mtu) {
> > +   netdev_err(dev, "Invalid MTU %d requested, hw max %d\n",
> > +  new_mtu, dev->max_mtu);
> > +   return -EINVAL;
> > +   }
> >  
> 
> Maybe don't log something that can be triggered from a user program.
> Or at least rate limit it.

Yeah, I was a little bit on the fence on whether to log anything, make it
netdev_err, netdev_dbg, or what. Quite a few drivers have a netdev_err for
failed MTU changes, while others also have netdev_info spew for successful
MTU changes. Maybe a rate-limited netdev_err is the way to go here.
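
Something like this, sketched (net_ratelimit() is the stock global
rate limiter; an alternative would be netdev_dbg):

	if (new_mtu < dev->min_mtu) {
		if (net_ratelimit())
			netdev_err(dev, "Invalid MTU %d requested, hw min %d\n",
				   new_mtu, dev->min_mtu);
		return -EINVAL;
	}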

-- 
Jarod Wilson
ja...@redhat.com



Re: Minimum MTU Mess

2016-09-07 Thread Jarod Wilson
On Wed, Sep 07, 2016 at 10:31:12PM +0200, Andrew Lunn wrote:
> Hi Jarod
> 
> > -   /*  MTU must be positive.*/
> > -   if (new_mtu < 0)
> > +   if (new_mtu < dev->min_mtu) {
> > +   netdev_err(dev, "Invalid MTU %d requested, hw min %d\n",
> > +  new_mtu, dev->min_mtu);
> > return -EINVAL;
> > +   }
> > +
> > +   if (new_mtu > dev->max_mtu) {
> > +   netdev_err(dev, "Invalid MTU %d requested, hw max %d\n",
> > +  new_mtu, dev->max_mtu);
> > +   return -EINVAL;
> > +   }
> 
> I doubt you can make such a big change like this in one go. Can you
> really guarantee all interfaces, of what ever type, will have some
> value for dev->min_mtu and dev->max_mtu? What may fly is something
> more like:
> 
> > +   if (dev->max_mtu && new_mtu > dev->max_mtu) {
> > +   netdev_err(dev, "Invalid MTU %d requested, hw max %d\n",
> > +  new_mtu, dev->max_mtu);
> > +   return -EINVAL;
> > +   }
> 
> Maybe in a few cycles you can add a WARN_ON(!dev->max_mtu), and a few
> cycles after that go with (new_mtu > dev->max_mtu).

My local tree actually has if (dev->max_mtu > 0 && new_mtu > dev->max_mtu)
since just after I'd sent my mail for exactly that reason, though looking
at alloc_netdev_mqs(), it does seem we're at least guaranteed the value
will be 0 if not otherwise initialized, so your version looks perfectly
fine, and in the min_mtu case, without any additional handling, things
behave exactly as they did before. This is definitely going to require a
few passes... (Working my way through every driver with an ndo_change_mtu
wired up right now to see just how crazy this might get).

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH 2/2 v3] net: smsc911x: request and deassert optional RESET GPIO

2016-09-07 Thread Jeremy Linton

Hi,

On 09/07/2016 08:53 AM, Linus Walleij wrote:

On some systems (such as the Qualcomm APQ8060 Dragonboard) the
RESET signal of the SMSC911x is not pulled up by a resistor (or
the internal pull-up that will pull it up if the pin is not
even connected) but instead connected to a GPIO line, so that
the operating system must explicitly deassert RESET before use.

Support this in the SMSC911x driver so this ethernet connector
can be used on such targets.

Notice that we request the line to go logical low (deassert)
whilst the line on the actual component is active low. This
is managed in the respective hardware description when
specifying the GPIO line with e.g. device tree or ACPI. With
device tree it looks like this in one case:

  reset-gpios = < 30 GPIO_ACTIVE_LOW>;

Which means that logically requesting the RESET line to be
deasserted will result in the line being driven high, taking
the device out of reset.

Cc: Jeremy Linton 
Signed-off-by: Linus Walleij 


Thanks for clarifying the GPIO_ACTIVE_LOW state on this pin. I've
reviewed this patch, and tested that it doesn't have any effect on a
JunoR2 (ACPI or DT) system. I've got some personal reservations about
making changes for a single board that might better be handled in
firmware. But I don't think those should be an excuse to keep that
hardware from working, particularly since this is a fairly trivial change.


So FWIW,

Reviewed-by: Jeremy Linton 


Thanks,



Re: [Patch v6] net: ethernet: xilinx: Enable emaclite for MIPS

2016-09-07 Thread David Miller
From: Zubair Lutfullah Kakakhel 
Date: Mon, 5 Sep 2016 13:07:54 +0100

> The MIPS based xilfpga platform uses this driver.
> Enable it for MIPS
> 
> Signed-off-by: Zubair Lutfullah Kakakhel 
> 
> ---
> V1 -> V6 are from a series that has gotten too big.
> So I have split this patch and am sending it separately.

What tree do you expect this patch to be applied to?


[PATCH net-next] bpf: fix range propagation on direct packet access

2016-09-07 Thread Daniel Borkmann
LLVM can generate code that tests for direct packet access via
skb->data/data_end in a way that currently gets rejected by the
verifier, example:

  [...]
   7: (61) r3 = *(u32 *)(r6 +80)
   8: (61) r9 = *(u32 *)(r6 +76)
   9: (bf) r2 = r9
  10: (07) r2 += 54
  11: (3d) if r3 >= r2 goto pc+12
   R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
   R9=pkt(id=0,off=0,r=0) R10=fp
  12: (18) r4 = 0xff7a
  14: (05) goto pc+430
  [...]

  from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
 R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
  24: (7b) *(u64 *)(r10 -40) = r1
  25: (b7) r1 = 0
  26: (63) *(u32 *)(r6 +56) = r1
  27: (b7) r2 = 40
  28: (71) r8 = *(u8 *)(r9 +20)
  invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)

The reason why this gets rejected despite a proper test is that we
currently call find_good_pkt_pointers() only in case where we detect
tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
derived, for example, from a register of type pkt(id=Y,off=0,r=0)
pointing to skb->data. find_good_pkt_pointers() then fills the range
in the current branch to pkt(id=Y,off=0,r=Z) on success.

For above case, we need to extend that to recognize pkt_end >= rX
pattern and mark the other branch that is taken on success with the
appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
only two practical options to test for from what LLVM could have
generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=)
that we would need to take into account as well.

After the fix:

  [...]
   7: (61) r3 = *(u32 *)(r6 +80)
   8: (61) r9 = *(u32 *)(r6 +76)
   9: (bf) r2 = r9
  10: (07) r2 += 54
  11: (3d) if r3 >= r2 goto pc+12
   R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
   R9=pkt(id=0,off=0,r=0) R10=fp
  12: (18) r4 = 0xff7a
  14: (05) goto pc+430
  [...]

  from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
 R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
  24: (7b) *(u64 *)(r10 -40) = r1
  25: (b7) r1 = 0
  26: (63) *(u32 *)(r6 +56) = r1
  27: (b7) r2 = 40
  28: (71) r8 = *(u8 *)(r9 +20)
  29: (bf) r1 = r8
  30: (25) if r8 > 0x3c goto pc+47
   R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
   R9=pkt(id=0,off=0,r=54) R10=fp
  31: (b7) r1 = 1
  [...]

Verifier test cases are also added in this work, one that demonstrates
the mentioned example here and one that tries a bad packet access for
the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
with both test variants (>, >=).

Fixes: 969bf05eb3ce ("bpf: direct packet access")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 Targeting net-next as this should be just fine.

 kernel/bpf/verifier.c   |  55 +---
 samples/bpf/test_verifier.c | 102 
 2 files changed, 142 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 48c2705..90493a6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1637,21 +1637,42 @@ static int check_alu_op(struct verifier_env *env, 
struct bpf_insn *insn)
return 0;
 }
 
-static void find_good_pkt_pointers(struct verifier_env *env,
-  struct reg_state *dst_reg)
+static void find_good_pkt_pointers(struct verifier_state *state,
+  const struct reg_state *dst_reg)
 {
-	struct verifier_state *state = &env->cur_state;
struct reg_state *regs = state->regs, *reg;
int i;
-   /* r2 = r3;
-* r2 += 8
-* if (r2 > pkt_end) goto somewhere
-* r2 == dst_reg, pkt_end == src_reg,
-* r2=pkt(id=n,off=8,r=0)
-* r3=pkt(id=n,off=0,r=0)
-* find register r3 and mark its range as r3=pkt(id=n,off=0,r=8)
-* so that range of bytes [r3, r3 + 8) is safe to access
+
+   /* LLVM can generate two kind of checks:
+*
+* Type 1:
+*
+*   r2 = r3;
+*   r2 += 8;
+*   if (r2 > pkt_end) goto <handle exception>
+*   <access okay>
+*
+*   Where:
+* r2 == dst_reg, pkt_end == src_reg
+* r2=pkt(id=n,off=8,r=0)
+* r3=pkt(id=n,off=0,r=0)
+*
+* Type 2:
+*
+*   r2 = r3;
+*   r2 += 8;
+*   if (pkt_end >= r2) goto <access okay>
+*   <handle exception>
+*
+*   Where:
+* pkt_end == dst_reg, r2 == src_reg
+* r2=pkt(id=n,off=8,r=0)
+* r3=pkt(id=n,off=0,r=0)
+*
+* Find register r3 and mark its range as r3=pkt(id=n,off=0,r=8)
+* so that range of bytes [r3, r3 + 8) is safe to access.
 */
+
for (i = 0; i < MAX_BPF_REG; i++)
if (regs[i].type == PTR_TO_PACKET && regs[i].id == dst_reg->id)
   
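(The rest of the diff is truncated above; the Type 2 handling it adds
to check_cond_jmp_op() looks roughly like the following sketch,
reconstructed from the description rather than quoted verbatim:)

	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGE &&
		   dst_reg->type == PTR_TO_PACKET_END &&
		   regs[insn->src_reg].type == PTR_TO_PACKET) {
		/* pkt_end >= rX: the taken branch is the one where the
		 * access is known safe, so mark the range there.
		 */
		find_good_pkt_pointers(other_branch, &regs[insn->src_reg]);
	}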

Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support

2016-09-07 Thread David Miller
From: Or Gerlitz 
Date: Wed, 7 Sep 2016 23:51:56 +0300

> On Wed, Sep 7, 2016 at 11:27 PM, David Miller  wrote:
>> From: Saeed Mahameed 
>> Date: Wed, 7 Sep 2016 23:16:13 +0300
>>
>>> Hi Dave, Sorry to bother, but i would like to drop this series for now
>>> Huy is working to define and come up with a better mechanism to
>>> enable/disable the new DCBX hybrid mode he is adding.
>>
>> I can't "just drop" any change that's in my tree already.
> 
> Dave, I don't see it @ your kernel.org clone, and it's not in
> patchwork either, so I guess you picked it into your other clone.

Aha, I marked it "changes requested" in patchwork and did not apply
it.

My bad.


Re: [PATCH net-next] tcp: use an RB tree for ooo receive queue

2016-09-07 Thread Eric Dumazet
On Wed, 2016-09-07 at 15:26 -0700, Stephen Hemminger wrote:

> How much does this grow the size of tcp socket structure?

This actually shrinks it by 8 bytes, or more on debug kernels where
sizeof(spinlock_t) > 4

Before :

struct sk_buff_head out_of_order_queue;  // At least 24 bytes on 64bit 


After :

2 pointers ( 16 bytes on 64bit )




Re: [PATCH net-next] tcp: use an RB tree for ooo receive queue

2016-09-07 Thread Stephen Hemminger
On Wed, 07 Sep 2016 14:49:28 -0700
Eric Dumazet  wrote:

> From: Yaogong Wang 
> 
> Over the years, TCP BDP has increased by several orders of magnitude,
> and some people are considering to reach the 2 Gbytes limit.
> 
> Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
> MSS.
> 
> In presence of packet losses (or reorders), TCP stores incoming packets
> into an out of order queue, and number of skbs sitting there waiting for
> the missing packets to be received can be in the 10^5 range.
> 
> Most packets are appended to the tail of this queue, and when
> packets can finally be transferred to receive queue, we scan the queue
> from its head.
> 
> However, in presence of heavy losses, we might have to find an arbitrary
> point in this queue, involving a linear scan for every incoming packet,
> throwing away cpu caches.
> 
> This patch converts it to a RB tree, to get bounded latencies.
> 
> Yaogong wrote a preliminary patch about 2 years ago.
> Eric did the rebase, added ofo_last_skb cache, polishing and tests.
> 
> Tested with network dropping between 1 and 10 % packets, with good
> success (about 30 % increase of throughput in stress tests)
> 
> Next step would be to also use an RB tree for the write queue at sender
> side ;)
> 
> Signed-off-by: Yaogong Wang 
> Signed-off-by: Eric Dumazet 
> Cc: Yuchung Cheng 
> Cc: Neal Cardwell 
> Cc: Ilpo Järvinen 

How much does this grow the size of tcp socket structure?


Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 11:55 PM, Or Gerlitz via iovisor-dev
 wrote:
> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
>> From: Rana Shahout 
>>
>> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>>
>> When XDP is on we make sure to change channels RQs type to
>> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
>> ensure "page per packet".
>>
>> On XDP set, we fail if HW LRO is set and request the user to turn it
>> off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
>> annoying, but we prefer not to enforce LRO off from the XDP set function.
>>
>> Full channels reset (close/open) is required only when setting XDP
>> on/off.
>>
>> When XDP set is called just to exchange programs, we will update
>> each RQ xdp program on the fly and for synchronization with current
>> data path RX activity of that RQ, we temporally disable that RQ and
>> ensure RX path is not running, quickly update and re-enable that RQ,
>> for that we do:
>> - rq.state = disabled
>> - napi_synchronize
>> - xchg(rq->xdp_prg)
>> - rq.state = enabled
>> - napi_schedule // Just in case we've missed an IRQ
>>
>> Packet rate performance testing was done with pktgen 64B packets and on
>> TX side and, TC drop action on RX side compared to XDP fast drop.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Comparison is done between:
>> 1. Baseline, Before this patch with TC drop action
>> 2. This patch with TC drop action
>> 3. This patch with XDP RX fast drop
>>
>> Streams    Baseline(TC drop)    TC drop     XDP fast Drop
>> -------    -----------------    --------    -------------
>> 1          5.51Mpps             5.14Mpps    13.5Mpps
>
> This (13.5 M PPS) is less than 50% of the result we presented @ the
> XDP summit, which was obtained by Rana. Please see if/how much this
> grows if you use more sender threads, but have all of them xmit the
> same stream/flows, so we're on one ring. That (XDP with a single RX ring
> getting packets from N remote TX rings) would be your canonical
> base-line for any further numbers.
>

I used N TX senders sending 48Mpps to a single RX core.
The single RX core could handle only 13.5Mpps.

The implementation here is different from the one we presented at the
summit. Before, it was with a striding RQ; now it is a regular
linked-list RQ. (A striding RQ ring can handle 32K 64B packets, while a
regular RQ ring handles only 1K.)

In a striding RQ we register only 16 HW descriptors for every 32K
packets, i.e. for every 32K packets we access the HW only 16 times. On
the other hand, a regular RQ will access the HW (register descriptors)
once per packet, i.e. we write to the HW 1K times for 1K packets. I
think this explains the difference.

The catch here is that we can't use a striding RQ for XDP, bummer!

As I said, we will have the full and final performance results in V1.
This is just an RFC with only quick and dirty testing.




[PATCH net-next] tcp: use an RB tree for ooo receive queue

2016-09-07 Thread Eric Dumazet
From: Yaogong Wang 

Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang 
Signed-off-by: Eric Dumazet 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Ilpo Järvinen 
---
 include/linux/skbuff.h   |2 
 include/linux/tcp.h  |7 
 include/net/tcp.h|2 
 net/core/skbuff.c|   19 ++
 net/ipv4/tcp.c   |4 
 net/ipv4/tcp_input.c |  330 +
 net/ipv4/tcp_ipv4.c  |2 
 net/ipv4/tcp_minisocks.c |1 
 8 files changed, 218 insertions(+), 149 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cfb7219be665..4c5662f05bda 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2402,6 +2402,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b1242354..c723a465125d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -281,10 +281,9 @@ struct tcp_sock {
struct sk_buff* lost_skb_hint;
struct sk_buff *retransmit_skb_hint;
 
-   /* OOO segments go in this list. Note that socket lock must be held,
-* as we do not use sk_buff_head lock.
-*/
-   struct sk_buff_head out_of_order_queue;
+   /* OOO segments go in this rbtree. Socket lock must be held. */
+   struct rb_root  out_of_order_queue;
+   struct sk_buff  *ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
/* SACKs data, these 2 need to be together (see tcp_options_write) */
struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d6ae36512429..fdfbedd61c67 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -640,7 +640,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-	if (skb_queue_empty(&tp->out_of_order_queue) &&
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
 	    tp->rcv_wnd &&
 	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
 	    !tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3864b4b68fa1..1e329d411242 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2445,6 +2445,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ * skb_rbtree_purge - empty a skb rbtree
+ * @root: root of the rbtree to empty
+ *
+ * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ * the list and one reference dropped. This function does not take
+ * any lock. Synchronization should be handled by the caller (e.g., TCP
+ * out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+   struct sk_buff *skb, *next;
+
+   rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+   kfree_skb(skb);
+
+   *root = RB_ROOT;
+}
+
+/**
  * skb_queue_head - queue a buffer at the list head
  * @list: list to use
  * @newsk: buffer to queue
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 77311a92275c..a13fcb369f52 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -380,7 +380,7 @@ void tcp_init_sock(struct sock *sk)
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
-	__skb_queue_head_init(&tp->out_of_order_queue);
+   tp->out_of_order_queue = RB_ROOT;
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);
	INIT_LIST_HEAD(&tp->tsq_node);
@@ -2243,7 +2243,7 @@ int 

Re: [PATCH v5 2/6] clk: gxbb: expose MPLL2 clock for use by DT

2016-09-07 Thread Stephen Boyd
On 09/06, Martin Blumenstingl wrote:
> This exposes the MPLL2 clock as this is one of the input clocks of the
> ethernet controller's internal mux.
> 
> Signed-off-by: Martin Blumenstingl 
> ---

Acked-by: Stephen Boyd 

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: ptp

2016-09-07 Thread Richard Cochran
On Wed, Sep 07, 2016 at 01:55:19PM -0700, Daniel Walker wrote:
> So the code only allows second granularity negative updates,

No.

> or the seconds
> component is the only part which needs to actually be negative ?

Yes.

See the function, clockadj_step, in linuxptp/clockadj.c for a working
user space example.

Thanks,
Richard
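
The normalization clockadj_step performs looks roughly like this
(paraphrased from linuxptp; ns is the signed step in nanoseconds,
clkid the target clock, and tv_usec actually carries nanoseconds
because ADJ_NANO is set):

	struct timex tx;

	memset(&tx, 0, sizeof(tx));
	tx.modes = ADJ_SETOFFSET | ADJ_NANO;
	tx.time.tv_sec  = ns / NS_PER_SEC;	/* NS_PER_SEC = 1000000000 */
	tx.time.tv_usec = ns % NS_PER_SEC;
	/* The value of a timeval is the sum of its fields, but the
	 * field tv_usec must always be non-negative.
	 */
	if (tx.time.tv_usec < 0) {
		tx.time.tv_sec  -= 1;
		tx.time.tv_usec += 1000000000;
	}
	clock_adjtime(clkid, &tx);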



[PATCH net-next] macsec: set network devtype

2016-09-07 Thread Stephen Hemminger
The netdevice type structure for macsec was being defined but never used.
To set the network device type the macro SET_NETDEV_DEVTYPE must be called.
Compile tested only, I don't use macsec.

Signed-off-by: Stephen Hemminger 

--- a/drivers/net/macsec.c  2016-08-25 17:20:20.671031972 -0700
+++ b/drivers/net/macsec.c  2016-09-07 14:03:01.676238097 -0700
@@ -2973,6 +2973,7 @@ static void macsec_setup(struct net_devi
 	dev->priv_flags |= IFF_NO_QUEUE;
 	dev->netdev_ops = &macsec_netdev_ops;
 	dev->destructor = macsec_free_netdev;
+	SET_NETDEV_DEVTYPE(dev, &macsec_type);
 
 	eth_zero_addr(dev->broadcast);
 }


Re: ptp

2016-09-07 Thread Daniel Walker

On 09/07/2016 01:48 PM, Richard Cochran wrote:

On Wed, Sep 07, 2016 at 01:40:59PM -0700, Daniel Walker wrote:

There is a test (below), which prevents negative nanosecond updates. The
code below would force a negative update to always return more than
NSEC_PER_SEC. It should be using abs() instead, which would return the
value desired.

No.  This:

/*
 * The value of a timeval is the sum of its fields, but the
 * field tv_usec must always be non-negative.
 */

HTH,
Richard



So the code only allows second granularity negative updates, or the 
seconds component is the only part which needs to actually be negative ?



Daniel



[PATCH net-next] rtnetlink: remove unused ifla_stats_policy

2016-09-07 Thread Stephen Hemminger
This structure is defined but never used; flagged by a W=1 build.

Signed-off-by: Stephen Hemminger 

--- a/net/core/rtnetlink.c  2016-09-07 13:55:17.030107744 -0700
+++ b/net/core/rtnetlink.c  2016-09-07 13:55:45.190236855 -0700
@@ -3669,10 +3669,6 @@ nla_put_failure:
return -EMSGSIZE;
 }
 
-static const struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
-   [IFLA_STATS_LINK_64]= { .len = sizeof(struct rtnl_link_stats64) },
-};
-
 static size_t if_nlmsg_stats_size(const struct net_device *dev,
  u32 filter_mask)
 {


Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-07 Thread Or Gerlitz
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
> From: Rana Shahout 
>
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>
> When XDP is on we make sure to change channels RQs type to
> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
> ensure "page per packet".
>
> On XDP set, we fail if HW LRO is set and request from user to turn it
> off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
> annoying, but we prefer not to enforce LRO off from XDP set function.
>
> Full channels reset (close/open) is required only when setting XDP
> on/off.
>
> When XDP set is called just to exchange programs, we will update
> each RQ xdp program on the fly and for synchronization with current
> data path RX activity of that RQ, we temporarily disable that RQ and
> ensure RX path is not running, quickly update and re-enable that RQ,
> for that we do:
> - rq.state = disabled
> - napi_synchronize
> - xchg(rq->xdp_prg)
> - rq.state = enabled
> - napi_schedule // Just in case we've missed an IRQ
>
> Packet rate performance testing was done with pktgen 64B packets on the
> TX side, and TC drop action on the RX side, compared to XDP fast drop.
>
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Comparison is done between:
> 1. Baseline, Before this patch with TC drop action
> 2. This patch with TC drop action
> 3. This patch with XDP RX fast drop
>
> Streams    Baseline(TC drop)    TC drop     XDP fast Drop
> ----------------------------------------------------------
> 1          5.51Mpps             5.14Mpps    13.5Mpps

This (13.5M PPS) is less than 50% of the result we presented @ the
XDP summit, which was obtained by Rana. Please see if/how much this
grows if you use more sender threads, but have all of them xmit the
same stream/flows, so we're on one ring. That (XDP with a single RX
ring getting packets from N remote TX rings) would be your canonical
baseline for any further numbers.


Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support

2016-09-07 Thread Or Gerlitz
On Wed, Sep 7, 2016 at 11:27 PM, David Miller  wrote:
> From: Saeed Mahameed 
> Date: Wed, 7 Sep 2016 23:16:13 +0300
>
>> Hi Dave, Sorry to bother, but i would like to drop this series for now
>> Huy is working to define and come up with a better mechanism to
>> enable/disable the new DCBX hybrid mode he is adding.
>
> I can't "just drop" any change that's in my tree already.

Dave, I don't see it @ your kernel.org clone, and it's not in
patchwork either, so I guess you picked it into your other clone.
If indeed you took it (and I guess this is the case), please note that
this was done in spite of a pending reviewer comment
(by me) on the actual necessity of patch #3. If it's not possible to
revert them on the non-kernel.org clone before you push to
kernel.org, I guess we need to send you a revert.
[1] http://marc.info/?l=linux-netdev&m=147317917622462&w=2

> It is part of the permanent GIT record as I never rebase my trees for
> any reason whatsoever.

> So the best you can do is send me a revert patch.


Re: ptp

2016-09-07 Thread Richard Cochran
On Wed, Sep 07, 2016 at 01:40:59PM -0700, Daniel Walker wrote:
> There is a test (below) , which prevents negative nanosecond updates. The
> code below would force a negative update to always return more than
> NSEC_PER_SEC. It should be using abs() instead which would return the value
> desired.

No.  This:

/*
 * The value of a timeval is the sum of its fields, but the
 * field tv_usec must always be non-negative.
 */

HTH,
Richard


Re: Minimum MTU Mess

2016-09-07 Thread Stephen Hemminger
On Wed, 7 Sep 2016 15:53:56 -0400
Jarod Wilson  wrote:

> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6466,9 +6466,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
>   if (new_mtu == dev->mtu)
>   return 0;
>  
> - /*  MTU must be positive.*/
> - if (new_mtu < 0)
> + if (new_mtu < dev->min_mtu) {
> + netdev_err(dev, "Invalid MTU %d requested, hw min %d\n",
> +new_mtu, dev->min_mtu);
>   return -EINVAL;
> + }
> +
> + if (new_mtu > dev->max_mtu) {
> + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n",
> +new_mtu, dev->max_mtu);
> + return -EINVAL;
> + }
>  

Maybe don't log something that can be triggered from a user program.
Or at least rate limit it.
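
A hedged sketch of what that rate-limit suggestion could look like,
reusing the existing net_ratelimit() helper:

	if (new_mtu < dev->min_mtu) {
		if (net_ratelimit())
			netdev_err(dev, "Invalid MTU %d requested, hw min %d\n",
				   new_mtu, dev->min_mtu);
		return -EINVAL;
	}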


Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support

2016-09-07 Thread David Miller
From: Saeed Mahameed 
Date: Wed, 7 Sep 2016 23:16:13 +0300

> Hi Dave, Sorry to bother, but i would like to drop this series for now
> Huy is working to define and come up with a better mechanism to
> enable/disable the new DCBX hybrid mode he is adding.

I can't "just drop" any change that's in my tree already.

It is part of the permanent GIT record as I never rebase my trees for
any reason whatsoever.

So the best you can do is send me a revert patch.


Re: Minimum MTU Mess

2016-09-07 Thread Andrew Lunn
Hi Jarod

> - /*  MTU must be positive.*/
> - if (new_mtu < 0)
> + if (new_mtu < dev->min_mtu) {
> + netdev_err(dev, "Invalid MTU %d requested, hw min %d\n",
> +new_mtu, dev->min_mtu);
>   return -EINVAL;
> + }
> +
> + if (new_mtu > dev->max_mtu) {
> + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n",
> +new_mtu, dev->max_mtu);
> + return -EINVAL;
> + }

I doubt you can make such a big change like this in one go. Can you
really guarantee all interfaces, of what ever type, will have some
value for dev->min_mtu and dev->max_mtu? What may fly is something
more like:

> + if (dev->max_mtu && new_mtu > dev->max_mtu) {
> + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n",
> +new_mtu, dev->max_mtu);
> + return -EINVAL;
> + }

Maybe in a few cycles you can add a WARN_ON(!dev->max_mtu), and a few
cycles after that go with (new_mtu > dev->max_mtu).

  Andrew


Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support

2016-09-07 Thread Saeed Mahameed
On Tue, Sep 6, 2016 at 7:04 PM, Saeed Mahameed  wrote:
> Hi Dave,
>

Hi Dave, sorry to bother, but I would like to drop this series for now.
Huy is working to define and come up with a better mechanism to
enable/disable the new DCBX hybrid mode he is adding.

Thanks for understanding,
Saeed.

> This series from Huy provides mlx5 DCBX updates to support DCBX CEE
> API and DCBX firmware/host modes support.
>
> 1st patch ensures the dcbnl_rtnl_ops is published only when the qos 
> capability bits is on.
>
> 2nd patch adds the support for CEE interfaces into mlx5 dcbnl_rtnl_ops.
>
> 3rd patch refactors ETS query to read ETS configuration directly from 
> firmware rather
> than having a software shadow of it. The existing IEEE interfaces stay the 
> same.
>
> 4th patch adds the support for MLX5_REG_DCBX_PARAM and MLX5_REG_DCBX_APP 
> firmware
> commands to manipulate mlx5 DCBX mode.
>
> 5th patch adds the driver support for the new DCBX firmware.
> This ensures the backward compatibility versus the old and new firmware.
> With the new DCBX firmware, qos settings can be controlled by either firmware
> or software depending on the DCBX mode.
>
> 6th patch adds support for module events log.
>
> Changes since V1:
> 1. Add qos capability check
> 2. In port module events eqe structure, change rsvd_n to reserved_at_n to be 
> consistent with mlx5_ifc.h
> 3. Improve commit messages
> 4. Drop DCBX private flags patch
> 5. Add patch to check for qos capability bit check before exposing dcb 
> interfaces
> 6. Replace memset with static array initialization
>
> Thanks,
> Saeed.
>
> Huy Nguyen (6):
>   net/mlx5e: Add qos capability check
>   net/mlx5e: Support DCBX CEE API
>   net/mlx5e: Read ETS settings directly from firmware
>   net/mlx5: Add DCBX firmware commands support
>   net/mlx5e: ConnectX-4 firmware support for DCBX
>   net/mlx5: Add handling for port module event
>
>  drivers/net/ethernet/mellanox/mlx5/core/en.h   |  36 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 483 
> -
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  27 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c   |  12 +
>  .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|   1 +
>  drivers/net/ethernet/mellanox/mlx5/core/port.c | 148 +++
>  include/linux/mlx5/device.h|  11 +
>  include/linux/mlx5/driver.h|   7 +
>  include/linux/mlx5/mlx5_ifc.h  |   3 +-
>  include/linux/mlx5/port.h  |   6 +
>  10 files changed, 698 insertions(+), 36 deletions(-)
>
> --
> 2.7.4
>


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazet  wrote:
> On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:
>
>> Jesper has a similar Idea to make the qdisc think it is under
>> pressure, when the device
>> TX ring is idle most of the time, i think his idea can come in handy here.
>> I am not fully involved in the details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit more will
>> happen by design.
>
> I do not think qdisc is relevant here.
>
> Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
> because only this layer can know if more packets are to come.
>
>
> What I am saying is that regardless of skb->xmit_more being set or not,
> (for example if no qdisc is even used)
> a NAPI driver can arm a bit asking the doorbell being sent at the end of
> NAPI.
>
> I am not saying this must be done, only that the idea could be extended
> to non XDP world, if we care enough.
>

Yes, and I am just trying to suggest ideas that do not require
communication between RX (NAPI) and TX.

The problem here is the synchronization (TX doorbell from RX), which is
not as simple as an atomic operation for some drivers.

How about RX bulking? It can also help here: for the forwarding case,
the forwarding path will be able to process a bulk of RX SKBs and can
bulk-xmit the portion of SKBs that will be forwarded.

As Jesper suggested, Let's talk in Netdev1.2 at jesper's session. ( if
you are joining of course).

Thanks
Saeed.


Re: [PATCH net] tcp: fastopen: avoid negative sk_forward_alloc

2016-09-07 Thread Josh Hunt

On 09/07/2016 10:34 AM, Eric Dumazet wrote:

From: Eric Dumazet 

When DATA and/or FIN are carried in a SYN/ACK message or SYN message,
we append an skb in socket receive queue, but we forget to call
sk_forced_mem_schedule().

Effect is that the socket has a negative sk->sk_forward_alloc as long as
the message is not read by the application.

Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp
fin memory accounting")
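
For context, sk_forced_mem_schedule() charges the memory to the socket
unconditionally instead of failing under pressure. A sketch from memory
of net/core/sock.c around this time (check the tree for the exact body):

void sk_forced_mem_schedule(struct sock *sk, int size)
{
	int amt;

	if (size <= sk->sk_forward_alloc)
		return;
	amt = sk_mem_pages(size);
	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
	sk_memory_allocated_add(sk, amt);
}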

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet 
---
  net/ipv4/tcp_fastopen.c |1 +
  1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 62a5751d4fe1..4e777a3243f9 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -150,6 +150,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff 
*skb)
tp->segs_in = 0;
tcp_segs_in(tp, skb);
__skb_pull(skb, tcp_hdrlen(skb));
+   sk_forced_mem_schedule(sk, skb->truesize);
skb_set_owner_r(skb, sk);

TCP_SKB_CB(skb)->seq++;




The change makes sense to me. Thanks Eric!

Reviewed-by: Josh Hunt 


Re: Minimum MTU Mess

2016-09-07 Thread Jarod Wilson
On Tue, Sep 06, 2016 at 04:55:29PM -0700, David Miller wrote:
> From: Jarod Wilson 
> Date: Fri, 2 Sep 2016 13:07:42 -0400
> 
> > In any case, the number of "mtu < 68" and "#define FOO_MIN_MTU 68", or
> > variations thereof, under drivers/net/ is kind of crazy.
> 
> Agreed, we can have a default and let the different cases provide
> overrides.
> 
> Mostly what to do here is a function of the hardware though.

So I've been tinkering with this some, and it looks like having both
centralized min and max checking could be useful here. I'm hacking away at
drivers now, but the basis of all this would potentially look about like
the patch below, and each device would have to set dev->m{in,ax}_mtu one
way or another. Drivers using alloc_etherdev and/or ether_setup would get
the "default" values, and then they can be overridden. Probably need
something to make sure dev->max_mtu isn't set to 0 though...

Possibly on the right track here, or might there be a better way to
approach this?
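
To make the driver-side usage concrete, a hypothetical jumbo-capable
driver would override the ether_setup() defaults in its probe path,
something like (FOO_MAX_MTU is a made-up device limit):

	dev->min_mtu = ETH_MIN_MTU;	/* default from ether_setup() */
	dev->max_mtu = FOO_MAX_MTU;	/* e.g. 9000 for jumbo frames */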

diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 117d02e..864d6f2 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -35,6 +35,8 @@
 #define ETH_FRAME_LEN  1514/* Max. octets in frame sans FCS */
 #define ETH_FCS_LEN4   /* Octets in the FCS */
 
+#define ETH_MIN_MTU    68  /* Min IPv4 MTU per RFC791 */
+
 /*
  * These are the defined Ethernet Protocol ID's.
  */
diff --git a/net/core/dev.c b/net/core/dev.c
index dd6ce59..d20fd5d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6466,9 +6466,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
if (new_mtu == dev->mtu)
return 0;
 
-   /*  MTU must be positive.*/
-   if (new_mtu < 0)
+   if (new_mtu < dev->min_mtu) {
+   netdev_err(dev, "Invalid MTU %d requested, hw min %d\n",
+  new_mtu, dev->min_mtu);
return -EINVAL;
+   }
+
+   if (new_mtu > dev->max_mtu) {
+   netdev_err(dev, "Invalid MTU %d requested, hw max %d\n",
+  new_mtu, dev->max_mtu);
+   return -EINVAL;
+   }
 
if (!netif_device_present(dev))
return -ENODEV;
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 66dff5e..2be0203 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -322,8 +322,6 @@ EXPORT_SYMBOL(eth_mac_addr);
  */
 int eth_change_mtu(struct net_device *dev, int new_mtu)
 {
-   if (new_mtu < 68 || new_mtu > ETH_DATA_LEN)
-   return -EINVAL;
dev->mtu = new_mtu;
return 0;
 }
@@ -357,6 +355,8 @@ void ether_setup(struct net_device *dev)
dev->type   = ARPHRD_ETHER;
dev->hard_header_len= ETH_HLEN;
	dev->mtu	= ETH_DATA_LEN;
+	dev->min_mtu	= ETH_MIN_MTU;
+	dev->max_mtu	= ETH_DATA_LEN;
dev->addr_len   = ETH_ALEN;
dev->tx_queue_len   = 1000; /* Ethernet wants good queues */
dev->flags  = IFF_BROADCAST|IFF_MULTICAST;

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand

2016-09-07 Thread Jesper Dangaard Brouer

On Wed,  7 Sep 2016 15:42:25 +0300 Saeed Mahameed  wrote:

> For non-striding RQ configuration before this patch we had a ring
> with pre-allocated SKBs and mapped the SKB->data buffers for
> device.
> 
> For robustness and better RX data buffers management, we allocate a
> page per packet and build_skb around it.
> 
> This patch (which is a prerequisite for XDP) will actually reduce
> performance for normal stack usage, because we are now hitting a bottleneck
> in the page allocator. A later patch of page reuse mechanism will be
> needed to restore or even improve performance in comparison to the old
> RX scheme.

Yes, it is true that there is a performance reduction (for normal
stack, not XDP) caused by hitting a bottleneck in the page allocator.

I actually have a PoC implementation of my page_pool that shows we
regain the performance and then some.  It is based on an earlier version
of this patch, where I hooked it into the mlx5 driver (50Gbit/s version).


Your desc might be a bit outdated, as this patch and the patch before
do contain your own driver-local page-cache recycle facility.  And you
also show that you regain quite a lot of the lost performance.

Your driver-local page_cache does have its limitations (see comments on
the other patch), as it depends on timely refcnt decrease by the users of
the page.  If they hold onto pages (like TCP does) then your page-cache
will not be efficient.

 
> Packet rate performance testing was done with pktgen 64B packets on
> xmit side and TC drop action on RX side.

I assume this is TC _ingress_ dropping, like [1]

[1] 
https://github.com/netoptimizer/network-testing/blob/master/bin/tc_ingress_drop.sh

> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Comparison is done between:
>  1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
>  2.Build SKB with RX page cache (This patch)
> 
> Streams    Baseline    Build SKB+page-cache    Improvement
> -----------------------------------------------------------
> 1          4.33Mpps    5.51Mpps                27%
> 2          7.35Mpps    11.5Mpps                52%
> 4          14.0Mpps    16.3Mpps                16%
> 8          22.2Mpps    29.6Mpps                20%
> 16         24.8Mpps    34.0Mpps                17%

The improvements gained from using your page-cache is impressively high.

Thanks for working on this,
 --Jesper
 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  10 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  31 +++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 215 
> +++---
>  3 files changed, 133 insertions(+), 123 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index afbdf70..a346112 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -65,6 +65,8 @@
>  #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW    0x3
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW    0x6
>  
> +#define MLX5_RX_HEADROOM NET_SKB_PAD
> +
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE   6  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS  8  /* >= 6, HW 
> restriction */
>  #define MLX5_MPWRQ_LOG_WQE_SZ    18
> @@ -302,10 +304,14 @@ struct mlx5e_page_cache {
>  struct mlx5e_rq {
>   /* data path */
>   struct mlx5_wq_ll  wq;
> - u32 wqe_sz;
> - struct sk_buff   **skb;
> +
> + struct mlx5e_dma_info *dma_info;
>   struct mlx5e_mpw_info *wqe_info;
>   void  *mtt_no_align;
> + struct {
> + u8 page_order;
> + u32 wqe_sz;    /* wqe data buffer size */
> + } buff;
>   __be32 mkey_be;
>  
>   struct device *pdev;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index c84702c..c9f1dea 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>   void *rqc = param->rqc;
>   void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
>   u32 byte_count;
> + u32 frag_sz;
> + int npages;
>   int wq_sz;
>   int err;
>   int i;
> @@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  
>   rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
>   rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
> - rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> - byte_count = rq->wqe_sz;
> +
> + rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> + byte_count = rq->buff.wqe_sz;
>   rq->mkey_be = 

Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ

2016-09-07 Thread Jesper Dangaard Brouer

On Wed,  7 Sep 2016 15:42:22 +0300 Saeed Mahameed  wrote:

> From: Tariq Toukan 
> 
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.

Thank you for doing this! MM-list people thank you!

For others to understand what this means:  This driver was doing
split_page() on high-order pages (for Striding RQ).  This was really bad
because it fragments the page allocator and quickly depletes the
available high-order pages.

(I've left rest of patch intact below, if some MM people should be
interested in looking at the changes).

There is even a funny comment in split_page() relevant to this:

/* [...]
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */


> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
> 
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
> 
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction
> 

Well, the reduction does not really matter that much, because your
baseline benchmarks are from a freshly booted system, where you have
not fragmented and depleted the high-order pages yet... ;-)


> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi 
> packet WQE")
> Signed-off-by: Tariq Toukan 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h   |  54 ++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 136 --
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 292 
> -
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 -
>  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
>  5 files changed, 184 insertions(+), 304 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index bf722aa..075cdfc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -62,12 +62,12 @@
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE    0xd
>  
>  #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW    0x1
> -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW    0x4
> +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW    0x3
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW    0x6
>  
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE   6  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS  8  /* >= 6, HW 
> restriction */
> -#define MLX5_MPWRQ_LOG_WQE_SZ    17
> +#define MLX5_MPWRQ_LOG_WQE_SZ    18
>  #define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? 
> \
>   MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
>  #define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
> @@ -293,8 +293,8 @@ struct mlx5e_rq {
>   u32 wqe_sz;
>   struct sk_buff   **skb;
>   struct mlx5e_mpw_info *wqe_info;
> + void  *mtt_no_align;
>   __be32 mkey_be;
> - __be32 umr_mkey_be;
>  
>   struct device *pdev;
>   struct net_device *netdev;
> @@ -323,32 +323,15 @@ struct mlx5e_rq {
>  
>  struct mlx5e_umr_dma_info {
>   __be64*mtt;
> - __be64*mtt_no_align;
>   dma_addr_t mtt_addr;
> - struct mlx5e_dma_info *dma_info;
> + struct mlx5e_dma_info  dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
> + struct mlx5e_umr_wqe   wqe;
>  };
>  
>  struct mlx5e_mpw_info {
> - union {
> - struct mlx5e_dma_info dma_info;
> - struct mlx5e_umr_dma_info umr;
> - };
> + struct mlx5e_umr_dma_info umr;
>   u16 consumed_strides;
>   u16 

[PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz

2016-09-07 Thread Chris Brandt
Due to a mistake in the hardware manual, the FWSLC and POST1-4 registers
were not documented and were left out of the driver for RZ/A, making the
CAM feature non-operational.
Additionally, when the offset values for POST1-4 are left blank, the driver
attempts to set them using an offset of 0xffff, which can cause memory
corruption or a panic.

This patch fixes the panic and properly enables CAM.

Reported-by: Daniel Palmer 
Signed-off-by: Chris Brandt 
---
v2:
* POST registers really do exist, so just add them
---
 drivers/net/ethernet/renesas/sh_eth.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
b/drivers/net/ethernet/renesas/sh_eth.c
index 1f8240a..440ae27 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -201,9 +201,14 @@ static const u16 
sh_eth_offset_fast_rz[SH_ETH_MAX_REGISTER_OFFSET] = {
 
[ARSTR] = 0x0000,
[TSU_CTRST] = 0x0004,
+   [TSU_FWSLC] = 0x0038,
[TSU_VTAG0] = 0x0058,
[TSU_ADSBSY]= 0x0060,
[TSU_TEN]   = 0x0064,
+   [TSU_POST1] = 0x0070,
+   [TSU_POST2] = 0x0074,
+   [TSU_POST3] = 0x0078,
+   [TSU_POST4] = 0x007c,
[TSU_ADRH0] = 0x0100,
 
[TXNLCR0]   = 0x0080,
@@ -2781,6 +2786,8 @@ static void sh_eth_tsu_init(struct sh_eth_private *mdp)
 {
if (sh_eth_is_rz_fast_ether(mdp)) {
sh_eth_tsu_write(mdp, 0, TSU_TEN); /* Disable all CAM entry */
+   sh_eth_tsu_write(mdp, TSU_FWSLC_POSTENU | TSU_FWSLC_POSTENL,
+TSU_FWSLC);/* Enable POST registers */
return;
}
 
-- 
2.9.2




[PATCH net 3/3] nfp: don't pad frames on receive

2016-09-07 Thread Jakub Kicinski
There is no need to pad frames to ETH_ZLEN on RX.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Dinan Gunawardena 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index e4fe0f080f34..252e4924de0f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1440,10 +1440,6 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, 
int budget)
 
nfp_net_set_hash(nn->netdev, skb, rxd);
 
-   /* Pad small frames to minimum */
-   if (skb_put_padto(skb, 60))
-   break;
-
/* Stats update */
u64_stats_update_begin(&r_vec->rx_sync);
r_vec->rx_pkts++;
-- 
1.9.1



[PATCH net 1/3] nfp: remove linux/version.h includes

2016-09-07 Thread Jakub Kicinski
Remove unnecessary version.h includes.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Dinan Gunawardena 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c  | 1 -
 drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 1 -
 drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c  | 1 -
 3 files changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 88678c172b19..e4fe0f080f34 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -41,7 +41,6 @@
  *  Chris Telfer 
  */
 
-#include <linux/version.h>
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
index 7d7933d00b8f..4c9897220969 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
@@ -40,7 +40,6 @@
  *  Brad Petrus 
  */
 
-#include <linux/version.h>
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
index 37abef016a0a..6f22b0e12ac7 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
@@ -38,7 +38,6 @@
  * Rolf Neugebauer 
  */
 
-#include <linux/version.h>
 #include 
 #include 
 #include 
-- 
1.9.1



[PATCH net 2/3] nfp: drop support for old firmware ABIs

2016-09-07 Thread Jakub Kicinski
Be more strict about FW versions.  Drop support for old
transitional revisions which were never used in production.
Dropping support for FW ABI version 0.0.0.0 is particularly
useful because 0 could just be uninitialized memory.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dinan Gunawardena 
---
 drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
index 6f22b0e12ac7..f7062cb648e1 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
@@ -133,7 +133,7 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev,
}
 
nfp_net_get_fw_version(&fw_ver, ctrl_bar);
-   if (fw_ver.class != NFP_NET_CFG_VERSION_CLASS_GENERIC) {
+   if (fw_ver.resv || fw_ver.class != NFP_NET_CFG_VERSION_CLASS_GENERIC) {
dev_err(>dev, "Unknown Firmware ABI %d.%d.%d.%d\n",
fw_ver.resv, fw_ver.class, fw_ver.major, fw_ver.minor);
err = -EINVAL;
@@ -141,9 +141,7 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev,
}
 
/* Determine stride */
-   if (nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0, 0) ||
-   nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0, 1) ||
-   nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0x12, 0x48)) {
+   if (nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0, 1)) {
stride = 2;
tx_bar_no = NFP_NET_Q0_BAR;
rx_bar_no = NFP_NET_Q1_BAR;
-- 
1.9.1



[PATCH net 0/3] nfp: fixes and trivial cleanup

2016-09-07 Thread Jakub Kicinski
Hi!

The first patch drops unnecessary version.h includes.  The second
drops support for pre-release versions of the FW ABI.  Removing
FW ABI 0.0 from the supported set is particularly good since 0
could just be uninitialized memory.  Last but not least, I drop
unnecessary padding of frames on RX, which made us count bytes
incorrectly for VF2VF traffic.

Jakub Kicinski (3):
  nfp: remove linux/version.h includes
  nfp: drop support for old firmware ABIs
  nfp: don't pad frames on receive

 drivers/net/ethernet/netronome/nfp/nfp_net_common.c  | 5 -
 drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 1 -
 drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c  | 7 ++-
 3 files changed, 2 insertions(+), 11 deletions(-)

-- 
1.9.1



Re: [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle

2016-09-07 Thread Jesper Dangaard Brouer

On Wed,  7 Sep 2016 15:42:24 +0300 Saeed Mahameed  wrote:

> From: Tariq Toukan 
> 
> Instead of reallocating and mapping pages for RX data-path,
> recycle already used pages in a per ring cache.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - order0 no cache
> * 4,786,899 - order0 with cache
> 1% gain
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - order0 no cache
> * 4,127,852 - order0 with cache
> 3.7% gain
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - order0 no cache
> * 3,931,708 - order0 with cache
> 5.4% gain
> 
> Signed-off-by: Tariq Toukan 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h   | 16 ++
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 57 
> --
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++
>  4 files changed, 99 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 075cdfc..afbdf70 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
>   u8  tired;
>  };
>  
> +/* a single cache unit is capable to serve one napi call (for non-striding 
> rq)
> + * or a MPWQE (for striding rq).
> + */
> +#define MLX5E_CACHE_UNIT (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
> +  MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
> +#define MLX5E_CACHE_SIZE (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
> +struct mlx5e_page_cache {
> + u32 head;
> + u32 tail;
> + struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
> +};
> +
[...]
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index c1cb510..8e02af3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq 
> *rq, u16 ix)
>   mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  }
>  
> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> +   struct mlx5e_dma_info *dma_info)
> +{
> + struct mlx5e_page_cache *cache = &rq->page_cache;
> + u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> +
> + if (tail_next == cache->head) {
> + rq->stats.cache_full++;
> + return false;
> + }
> +
> + cache->page_cache[cache->tail] = *dma_info;
> + cache->tail = tail_next;
> + return true;
> +}
> +
> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> +   struct mlx5e_dma_info *dma_info)
> +{
> + struct mlx5e_page_cache *cache = &rq->page_cache;
> +
> + if (unlikely(cache->head == cache->tail)) {
> + rq->stats.cache_empty++;
> + return false;
> + }
> +
> + if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> + rq->stats.cache_busy++;
> + return false;
> + }

Hmmm... doesn't this cause "blocking" of the page_cache recycle
facility until the page at the head of the queue gets its (page) refcnt
decremented?  A real use-case could fairly easily cause this...
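
One hedged way around that head-of-line blocking (purely illustrative,
not part of the patch; MLX5E_CACHE_PROBE is a made-up constant) would be
to probe a few entries past a busy head and pop the first reusable one:

static bool mlx5e_rx_cache_get_deep(struct mlx5e_rq *rq,
				    struct mlx5e_dma_info *dma_info)
{
	struct mlx5e_page_cache *cache = &rq->page_cache;
	u32 i, idx;

	for (i = 0; i < MLX5E_CACHE_PROBE; i++) {
		idx = (cache->head + i) & (MLX5E_CACHE_SIZE - 1);
		if (idx == cache->tail)
			break;	/* scanned all occupied entries */
		if (page_ref_count(cache->page_cache[idx].page) != 1)
			continue;
		/* move the reusable entry into the head slot, then pop it */
		swap(cache->page_cache[idx], cache->page_cache[cache->head]);
		*dma_info = cache->page_cache[cache->head];
		cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
		return true;
	}
	return false;
}

Whether the extra scan pays for itself is exactly the kind of question
the stats counters in this patch should help answer.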

> +
> + *dma_info = cache->page_cache[cache->head];
> + cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> + rq->stats.cache_reuse++;
> +
> + dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
> +DMA_FROM_DEVICE);
> + return true;
> +}
> +
>  static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
> struct mlx5e_dma_info *dma_info)
>  {
> - struct page *page = dev_alloc_page();
> + struct page *page;
> +
> + if (mlx5e_rx_cache_get(rq, dma_info))
> + return 0;
>  
> + page = dev_alloc_page();
>   if (unlikely(!page))
>   return -ENOMEM;

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH v2] vti: use right inner_mode for inbound inter address family policy checks

2016-09-07 Thread thomas . zeitlhofer+lkml
In case of inter address family tunneling (IPv6 over vti4 or IPv4 over
vti6), the inbound policy checks in vti_rcv_cb() and vti6_rcv_cb() are
using the wrong address family. As a result, all inbound inter address
family traffic is dropped.

Use the xfrm_ip2inner_mode() helper, as done in xfrm_input() (i.e., also
increment LINUX_MIB_XFRMINSTATEMODEERROR in case of error), to select the
inner_mode that contains the right address family for the inbound policy
checks.

Signed-off-by: Thomas Zeitlhofer 
---

Notes:
v2: implement review comments from Steffen (thanks for the reply):

- return -EINVAL in case of error

- increment LINUX_MIB_XFRMINSTATEMODEERROR in case of error

  Just to point that out, in case there are arguments against it:
  this is done in the namespace of skb->dev and not in the
  t(unnel)?->net namespace.

and update the commit message accordingly.

 net/ipv4/ip_vti.c  | 15 ++-
 net/ipv6/ip6_vti.c | 15 ++-
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index a917903..c9957ed 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -88,6 +88,7 @@ static int vti_rcv_cb(struct sk_buff *skb, int err)
struct net_device *dev;
struct pcpu_sw_netstats *tstats;
struct xfrm_state *x;
+   struct xfrm_mode *inner_mode;
struct ip_tunnel *tunnel = XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4;
u32 orig_mark = skb->mark;
int ret;
@@ -105,7 +106,19 @@ static int vti_rcv_cb(struct sk_buff *skb, int err)
}
 
x = xfrm_input_state(skb);
-   family = x->inner_mode->afinfo->family;
+
+   inner_mode = x->inner_mode;
+
+   if (x->sel.family == AF_UNSPEC) {
+   inner_mode = xfrm_ip2inner_mode(x, 
XFRM_MODE_SKB_CB(skb)->protocol);
+   if (inner_mode == NULL) {
+   XFRM_INC_STATS(dev_net(skb->dev),
+  LINUX_MIB_XFRMINSTATEMODEERROR);
+   return -EINVAL;
+   }
+   }
+
+   family = inner_mode->afinfo->family;
 
skb->mark = be32_to_cpu(tunnel->parms.i_key);
ret = xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family);
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index d90a11f..52a2f73 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -340,6 +340,7 @@ static int vti6_rcv_cb(struct sk_buff *skb, int err)
struct net_device *dev;
struct pcpu_sw_netstats *tstats;
struct xfrm_state *x;
+   struct xfrm_mode *inner_mode;
struct ip6_tnl *t = XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6;
u32 orig_mark = skb->mark;
int ret;
@@ -357,7 +358,19 @@ static int vti6_rcv_cb(struct sk_buff *skb, int err)
}
 
x = xfrm_input_state(skb);
-   family = x->inner_mode->afinfo->family;
+
+   inner_mode = x->inner_mode;
+
+   if (x->sel.family == AF_UNSPEC) {
+   inner_mode = xfrm_ip2inner_mode(x, 
XFRM_MODE_SKB_CB(skb)->protocol);
+   if (inner_mode == NULL) {
+   XFRM_INC_STATS(dev_net(skb->dev),
+  LINUX_MIB_XFRMINSTATEMODEERROR);
+   return -EINVAL;
+   }
+   }
+
+   family = inner_mode->afinfo->family;
 
skb->mark = be32_to_cpu(t->parms.i_key);
ret = xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family);
-- 
2.1.4



Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Jesper Dangaard Brouer

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed  
wrote:
> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
> > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> >> wrote:  
> >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
[...]
> >
> > Only if a qdisc is present and pressure is high enough.
> >
> > But in a forwarding setup, we likely receive at a lower rate than the
> > NIC can transmit.

Yes, I can confirm this happens in my experiments.

> >  
> 
> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device TX ring is idle most of the time, i think
> his idea can come in handy here. I am not fully involved in the
> details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible.  This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
challenge people to come up with a good solution ;-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop


Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it

2016-09-07 Thread Michal Soltys
On 2016-09-07 02:44, Toshiaki Makita wrote:
> On 2016/09/07 6:59, Michal Soltys wrote:
>> Consider following scenario:
>> 
>> - create vlan aware bridge (say br0)
>> - setup br0's vlans, e.g.
>> 
>> bridge vlan add dev br0 vid 10 self
>> 
>> This will add necessary fdb entries directing appropriate traffic to the
>> bridge itself.
>> 
>> - create appropriate vlan interfaces on top of it, for example:
>> 
>> ip li add link br0 name br0.10 type vlan id 10
>> ip add add 10.0.0.1/8 dev br0.10 
>> 
>> This will add vlan devices on top of br0 and *inherit br0's mac address*.
>> 
>> - now after all of the above is done
>> 
>> ip li set eth0 master br0
>> 
>> This will attach interface eth0 to the bridge. With this being the first
>> interface attached, br0 will take it's mac address as its own. Any
>> further changes to br0's ports may cause the same, with the lowest mac
>> address of some port becoming br0's mac.
>> 
>> This will update fdb entries as well, but all vlan interfaces on top of
>> br0 (e.g. br0.10) will be using old mac address from the time when vlan
>> was created.
>> 
>> The side effect of it is that any traffic addressed to such interface
>> will be flooded to all ports (and br0 itself).
>> 
>> The only workaround I found is to either manually update mac addresses
>> with 'ip' or recreate vlans (bridge fdb refused to update relevant entries).
>> 
>> But if br0's mac changes due to some port changes - shouldn't it be
>> somehow propagated automatically to vlans created on top of it ?
> 
> This should have been addressed at least in kernel 4.7...
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=308453aa9156a3b8ee382c0949befb507a32b0c1
> 
> Which kernel version do you use?
> 

4.7.2
git describe on that commit suggests it's been available since 4.6.x

What I did in details:

ip li add name port1b type veth peer name port1e
ip li add br0 type bridge
ip li set dev br0 type bridge vlan_default_pvid 0
ip li set dev br0 type bridge vlan_filtering 1
bridge vlan add dev br0 vid 10 self
bridge vlan add dev br0 vid 250 untagged pvid self
ip li add link br0 name vlan10 type vlan id 10
ip li set port1b master br0

At this point vlan10 had an outdated mac after br0 took port1b's as
its own.




Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Eric Dumazet
On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:

> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device
> TX ring is idle most of the time, i think his idea can come in handy here.
> I am not fully involved in the details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

I do not think qdisc is relevant here.

Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
because only this layer can know if more packets are to come.

What I am saying is that regardless of skb->xmit_more being set or not
(for example if no qdisc is even used),
a NAPI driver can arm a bit asking for the doorbell to be sent at the
end of NAPI.

I am not saying this must be done, only that the idea could be extended
to non XDP world, if we care enough.







Re: [PATCH v4 6/6] arm64: dts: rockchip: enable the gmac for rk3399 evb board

2016-09-07 Thread Heiko Stuebner
Am Freitag, 2. September 2016, 01:50:04 CEST schrieb Caesar Wang:
> From: Roger Chen 
> 
> We add the required and optional properties for evb board.
> See the [0] to get the detail information.
> 
> [0]:
> Documentation/devicetree/bindings/net/rockchip-dwmac.txt
> 
> Signed-off-by: Roger Chen 
> Signed-off-by: Caesar Wang 

applied to my dts64 branch,

after moving clkin_gmac and sorting the gmac properties according to their 
alphabetical position.


Heiko


Re: [PATCH v4 5/6] arm64: dts: rockchip: add the gmac needed node for rk3399

2016-09-07 Thread Heiko Stuebner
Am Freitag, 2. September 2016, 01:50:03 CEST schrieb Caesar Wang:
> From: Roger Chen 
> 
> The RK3399 GMAC Ethernet Controller provides a complete Ethernet interface
> from processor to a Reduced Media Independent Interface (RMII) and Reduced
> Gigabit Media Independent Interface (RGMII) compliant Ethernet PHY.
> 
> This patch adds the related needed device information.
> e.g.: interrupts, grf, clocks, pinctrl and so on.
> 
> The full details are in [0].
> 
> [0]:
> Documentation/devicetree/bindings/net/rockchip-dwmac.txt
> 
> Signed-off-by: Roger Chen 
> Signed-off-by: Caesar Wang 
> ---
> 
> Changes in v4:
> - The Roger had posted patch on https://patchwork.kernel.org/patch/9274561/.
> - re-fixup to original author.
> 
> Changes in v3:
> - generate a patch from https://patchwork.kernel.org/patch/9306339/.
> 
> Changes in v2: None
> 
>  arch/arm64/boot/dts/rockchip/rk3399.dtsi | 80
>  1 file changed, 80 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> b/arch/arm64/boot/dts/rockchip/rk3399.dtsi index 2ab233f..092bb45 100644
> --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> @@ -200,6 +200,26 @@
>   };
>   };
> 
> + gmac: ethernet@fe300000 {
> + compatible = "rockchip,rk3399-gmac";
> + reg = <0x0 0xfe300000 0x0 0x10000>;
> + interrupts = ;

applied to my dts64 branch, after making this a
+   interrupts = ;

(due to the 4-cell interrupt change)


Heiko


[PATCH v2] tcp: cwnd does not increase in TCP YeAH

2016-09-07 Thread Artem Germanov
Commit 76174004a0f19785a328f40388e87e982bbf69b9
(tcp: do not slow start when cwnd equals ssthresh)
introduced a regression in TCP YeAH. Using a 100ms-delay, 1%-loss virtual
ethernet link, kernel 4.2 shows ~500KB/s bandwidth for a single TCP
connection, while kernel 4.3 and above (including 4.8-rc4) show
~100KB/s.
That is caused by a stalled cwnd when cwnd equals ssthresh. This patch
fixes it by properly increasing cwnd in this case.
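
For reference, tcp_in_slow_start() uses a strict comparison, so the
cwnd == ssthresh case now falls through to the congestion-avoidance
branch instead of stalling. A sketch of the helper as I recall it from
include/net/tcp.h:

static inline bool tcp_in_slow_start(const struct tcp_sock *tp)
{
	return tp->snd_cwnd < tp->snd_ssthresh;
}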

Signed-off-by: Artem Germanov 
Acked-by: Dmitry Adamushko 
---
diff --git a/net/ipv4/tcp_yeah.c b/net/ipv4/tcp_yeah.c
index 028eb04..9c5fc97 100644
--- a/net/ipv4/tcp_yeah.c
+++ b/net/ipv4/tcp_yeah.c
@@ -76,7 +76,7 @@ static void tcp_yeah_cong_avoid(struct sock *sk, u32 ack, u32 
acked)
if (!tcp_is_cwnd_limited(sk))
return;
 
-   if (tp->snd_cwnd <= tp->snd_ssthresh)
+   if (tcp_in_slow_start(tp))
tcp_slow_start(tp, acked);
 
else if (!yeah->doing_reno_now) {


Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand

2016-09-07 Thread Alexei Starovoitov
On Wed, Sep 07, 2016 at 03:42:25PM +0300, Saeed Mahameed wrote:
> For non-striding RQ configuration before this patch we had a ring
> with pre-allocated SKBs and mapped the SKB->data buffers for
> device.
> 
> For robustness and better RX data buffers management, we allocate a
> page per packet and build_skb around it.
> 
> This patch (which is a prerequisite for XDP) will actually reduce
> performance for normal stack usage, because we are now hitting a bottleneck
> in the page allocator. A later patch of page reuse mechanism will be
> needed to restore or even improve performance in comparison to the old
> RX scheme.
> 
> Packet rate performance testing was done with pktgen 64B packets on xmit
> side and TC drop action on RX side.
> 
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Comparison is done between:
>  1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
>  2.Build SKB with RX page cache (This patch)
> 
> StreamsBaselineBuild SKB+page-cacheImprovement
> ---
> 1  4.33Mpps  5.51Mpps27%
> 2  7.35Mpps  11.5Mpps52%
> 4  14.0Mpps  16.3Mpps16%
> 8  22.2Mpps  29.6Mpps20%
> 16 24.8Mpps  34.0Mpps17%

Impressive gains for build_skb. I think it should help ip forwarding too
and likely tcp_rr. tcp_stream shouldn't see any difference.
If you can benchmark that along with pktgen+tc_drop it would
help to better understand the impact of the changes.
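
For readers skimming the thread, the receive-path pattern under
discussion is roughly the following (hedged sketch, heavily simplified,
not the driver code):

static struct sk_buff *rx_build_skb(struct device *dev, struct page *page,
				    dma_addr_t addr, u32 frag_sz, u32 byte_cnt)
{
	struct sk_buff *skb;

	dma_sync_single_for_cpu(dev, addr, frag_sz, DMA_FROM_DEVICE);
	skb = build_skb(page_address(page), frag_sz);	/* no copy */
	if (unlikely(!skb))
		return NULL;
	skb_reserve(skb, NET_SKB_PAD);	/* MLX5_RX_HEADROOM in the patch */
	skb_put(skb, byte_cnt);
	return skb;
}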



Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ

2016-09-07 Thread Alexei Starovoitov
On Wed, Sep 07, 2016 at 03:42:22PM +0300, Saeed Mahameed wrote:
> From: Tariq Toukan 
> 
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.
> 
> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
> 
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
> 
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction

IMO it's not a realistic use case, but it would be good to mention that
patch 3 brings performance back for this use case anyway.



Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert  wrote:
> On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
>  wrote:
>> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz  wrote:
>>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
>>>
 Packet rate performance testing was done with pktgen 64B packets on the
 TX side, and TC drop action on the RX side, compared to XDP fast drop.

 CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

 Comparison is done between:
 1. Baseline, Before this patch with TC drop action
 2. This patch with TC drop action
 3. This patch with XDP RX fast drop

 Streams    Baseline(TC drop)    TC drop     XDP fast Drop
 ----------------------------------------------------------
 1          5.51Mpps             5.14Mpps    13.5Mpps
 2          11.5Mpps             10.0Mpps    25.1Mpps
 4          16.3Mpps             17.2Mpps    35.4Mpps
 8          29.6Mpps             28.2Mpps    45.8Mpps*
 16         34.0Mpps             30.1Mpps    45.8Mpps*
>>>
>>> Rana, Guys, congrat!!
>>>
>>> When you say X streams, does each stream mapped by RSS to different RX ring?
>>> or we're on the same RX ring for all rows of the above table?
>>
>> Yes, I will make this more clear in the actual submission,
>> Here we are talking about different RSS core rings.
>>
>>>
>>> In the CX3 work, we had X sender "streams" that all mapped to the same RX 
>>> ring,
>>> I don't think we went beyond one RX ring.
>>
>> Here we did, the first row is what you are describing the other rows
>> are the same test
>> with increasing the number of the RSS receiving cores, The xmit side is 
>> sending
>> as many streams as possible to be as much uniformly spread as possible
>> across the
>> different RSS cores on the receiver.
>>
> Hi Saeed,
>
> Please report CPU utilization also. The expectation is that
> performance should scale linearly with increasing number of CPUs (i.e.
> pps/CPU_utilization should be constant).
>

Hi Tom

That was my expectation too.

We didn't do the full analysis yet; it could be that RSS was not
spreading the workload evenly across all the cores.
Those numbers are from my humble machine with quick and dirty
testing; the idea of this submission is to let folks look at the code
while we continue testing and analyzing those patches.

Anyway, we will share more accurate results when we have them, with CPU
utilization statistics as well.

Thanks,
Saeed.

> Tom
>


Re: [PATCH net-next V6 4/4] net/sched: Introduce act_tunnel_key

2016-09-07 Thread Eric Dumazet
On Wed, 2016-09-07 at 09:27 -0700, Cong Wang wrote:

> Fix checkpatch.pl (or whatever tool you prefer).

I do not use checkpatch.pl.

But I know for sure David cares about this ordering.

https://lkml.org/lkml/2016/4/24/91

https://patchwork.ozlabs.org/patch/629958/


So feel free to fix checkpatch.pl 






Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  wrote:
>> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> >>
>> >> Here we introduce a xmit more like mechanism that will queue up more
>> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> >>
>> >> Once RX napi budget is consumed and we exit napi RX loop, we will
>> >> flush (doorbell) all XDP looped packets in case there are such.
>> >
>> > Why is this idea depends on XDP ?
>> >
>> > It looks like we could apply it to any driver having one IRQ servicing
>> > one RX and one TX, without XDP being involved.
>> >
>>
>> Yes but it is more complicated than XDP case, where the RX ring posts
>> the TX descriptors and once done
>> the RX ring hits the doorbell once for all the TX descriptors it
>> posted, and it is the only possible place to hit a doorbell
>> for XDP TX ring.
>>
>> For regular TX and RX ring sharing the same IRQ, there is no such
>> simple connection between them, and hitting a doorbell
>> from RX ring napi would race with xmit ndo function of the TX ring.
>>
>> How do you synchronize in such case ?
>> isn't the existing xmit more mechanism sufficient enough ?
>
> Only if a qdisc is present and pressure is high enough.
>
> But in a forwarding setup, we likely receive at a lower rate than the
> NIC can transmit.
>

Jesper has a similar Idea to make the qdisc think it is under
pressure, when the device
TX ring is idle most of the time, i think his idea can come in handy here.
I am not fully involved in the details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.

> A simple cmpxchg could be used to synchronize the thing, if we really
> cared about doorbell cost. (Ie if the cost of this cmpxchg() is way
> smaller than doorbell one)
>
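
A hedged sketch of the kind of synchronization Eric describes (names
are hypothetical, mlx5 specifics omitted), using the bit-op cousin of
the cmpxchg() he mentions: both the RX napi path and ndo_start_xmit
mark the doorbell as pending, and whoever clears the bit first rings it.

enum { FOO_SQ_DB_PENDING = 0 };

struct foo_sq {
	unsigned long	state;		/* FOO_SQ_DB_PENDING lives here */
};

static inline void foo_sq_defer_doorbell(struct foo_sq *sq)
{
	set_bit(FOO_SQ_DB_PENDING, &sq->state);
}

static inline void foo_sq_flush_doorbell(struct foo_sq *sq)
{
	/* atomic test-and-clear: exactly one context rings the bell */
	if (test_and_clear_bit(FOO_SQ_DB_PENDING, &sq->state))
		foo_ring_tx_doorbell(sq);	/* hypothetical hw write */
}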


[PATCH net-next v22 0/3] openvswitch: add 802.1ad support

2016-09-07 Thread Eric Garver
This series adds 802.1ad support to openvswitch. It is a continuation of the
work originally started by Thomas F Herbert - hence the large rev number.

The extra VLAN is implemented by using an additional level of the
OVS_KEY_ATTR_ENCAP netlink attribute.
In OVS flow speak, this looks like

   eth_type(0x88a8),vlan(vid=100),encap(eth_type(0x8100), vlan(vid=200),
encap(eth_type(0x0800), ...))
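
In netlink terms, serializing that doubly tagged key looks roughly like
the sketch below (attribute layout only, hedged; the series' actual
helper is ovs_nla_put_vlan(), see the v22 notes):

static int put_8021ad_key(struct sk_buff *skb, const struct sw_flow_key *key)
{
	struct nlattr *outer, *inner;

	if (nla_put_be16(skb, OVS_KEY_ATTR_ETHERTYPE, key->eth.vlan.tpid) ||
	    nla_put_be16(skb, OVS_KEY_ATTR_VLAN, key->eth.vlan.tci))
		return -EMSGSIZE;
	outer = nla_nest_start(skb, OVS_KEY_ATTR_ENCAP);
	if (!outer)
		return -EMSGSIZE;
	if (nla_put_be16(skb, OVS_KEY_ATTR_ETHERTYPE, key->eth.cvlan.tpid) ||
	    nla_put_be16(skb, OVS_KEY_ATTR_VLAN, key->eth.cvlan.tci))
		return -EMSGSIZE;
	inner = nla_nest_start(skb, OVS_KEY_ATTR_ENCAP);
	if (!inner)
		return -EMSGSIZE;
	/* inner eth_type and L3+ keys go here */
	nla_nest_end(skb, inner);
	nla_nest_end(skb, outer);
	return 0;
}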

The userspace counterpart has also seen recent activity on the ovs-dev mailing
lists. There are some new 802.1ad OVS tests being added - also on the ovs-dev
list. This patch series has been tested using the most recent version of
userspace (v3) and tests (v2).

v22 changes:
  - merge patch 4 into patch 3
  - fix checkpatch.pl errors
- Still some 80 char warnings for long string literals
  - refresh pointer after pskb_may_pull()
  - refactor vlan nlattr parsing to remove some double checks
  - introduce ovs_nla_put_vlan()
  - move triple VLAN check to after ethertype serialization
  - WARN_ON_ONCE() on triple VLAN and unexpected encap values

v21 changes:
  - Fix (and simplify) netlink attribute parsing
  - re-add handling of truncated VLAN tags
  - fix if/else dangling assignment in {push,pop}_vlan()
  - simplify parse_vlan()


Eric Garver (2):
  vlan: Check for vlan ethernet types for 8021.q or 802.1ad
  openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink
attributes

Thomas F Herbert (1):
  openvswitch: 802.1ad uapi changes.

 include/linux/if_vlan.h  |  33 +++--
 include/uapi/linux/openvswitch.h |  17 ++-
 net/openvswitch/actions.c|  16 +-
 net/openvswitch/flow.c   |  65 +---
 net/openvswitch/flow.h   |   8 +-
 net/openvswitch/flow_netlink.c   | 310 ++-
 net/openvswitch/vport.c  |   7 +-
 7 files changed, 314 insertions(+), 142 deletions(-)

-- 
2.5.5



[PATCH net-next v22 3/3] openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes

2016-09-07 Thread Eric Garver
Add support for 802.1ad including the ability to push and pop double
tagged vlans. Add support for 802.1ad to netlink parsing and flow
conversion. Uses double nested encap attributes to represent double
tagged vlan. Inner TPID encoded along with ctci in nested attributes.

This is based on Thomas F Herbert's original v20 patch. I made some
small clean ups and bug fixes.

Signed-off-by: Thomas F Herbert 
Signed-off-by: Eric Garver 
---
 net/openvswitch/actions.c  |  16 ++-
 net/openvswitch/flow.c |  65 ++---
 net/openvswitch/flow.h |   8 +-
 net/openvswitch/flow_netlink.c | 310 -
 net/openvswitch/vport.c|   7 +-
 5 files changed, 282 insertions(+), 124 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index ca91fc33f8a9..4fe9032b1160 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -246,20 +246,24 @@ static int pop_vlan(struct sk_buff *skb, struct 
sw_flow_key *key)
int err;
 
err = skb_vlan_pop(skb);
-   if (skb_vlan_tag_present(skb))
+   if (skb_vlan_tag_present(skb)) {
invalidate_flow_key(key);
-   else
-   key->eth.tci = 0;
+   } else {
+   key->eth.vlan.tci = 0;
+   key->eth.vlan.tpid = 0;
+   }
return err;
 }
 
 static int push_vlan(struct sk_buff *skb, struct sw_flow_key *key,
 const struct ovs_action_push_vlan *vlan)
 {
-   if (skb_vlan_tag_present(skb))
+   if (skb_vlan_tag_present(skb)) {
invalidate_flow_key(key);
-   else
-   key->eth.tci = vlan->vlan_tci;
+   } else {
+   key->eth.vlan.tci = vlan->vlan_tci;
+   key->eth.vlan.tpid = vlan->vlan_tpid;
+   }
return skb_vlan_push(skb, vlan->vlan_tpid,
 ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT);
 }
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 0ea128eeeab2..1240ae3b88d2 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -302,24 +302,57 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
  sizeof(struct icmp6hdr));
 }
 
-static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
+/**
+ * Parse vlan tag from vlan header.
+ * Returns ERROR on memory error.
+ * Returns 0 if it encounters a non-vlan or incomplete packet.
+ * Returns 1 after successfully parsing vlan tag.
+ */
+static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh)
 {
-   struct qtag_prefix {
-   __be16 eth_type; /* ETH_P_8021Q */
-   __be16 tci;
-   };
-   struct qtag_prefix *qp;
+   struct vlan_head *vh = (struct vlan_head *)skb->data;
 
-   if (unlikely(skb->len < sizeof(struct qtag_prefix) + sizeof(__be16)))
+   if (likely(!eth_type_vlan(vh->tpid)))
return 0;
 
-   if (unlikely(!pskb_may_pull(skb, sizeof(struct qtag_prefix) +
-sizeof(__be16))))
+   if (unlikely(skb->len < sizeof(struct vlan_head) + sizeof(__be16)))
+   return 0;
+
+   if (unlikely(!pskb_may_pull(skb, sizeof(struct vlan_head) +
+sizeof(__be16))))
return -ENOMEM;
 
-   qp = (struct qtag_prefix *) skb->data;
-   key->eth.tci = qp->tci | htons(VLAN_TAG_PRESENT);
-   __skb_pull(skb, sizeof(struct qtag_prefix));
+   vh = (struct vlan_head *)skb->data;
+   key_vh->tci = vh->tci | htons(VLAN_TAG_PRESENT);
+   key_vh->tpid = vh->tpid;
+
+   __skb_pull(skb, sizeof(struct vlan_head));
+   return 1;
+}
+
+static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
+{
+   int res;
+
+   key->eth.vlan.tci = 0;
+   key->eth.vlan.tpid = 0;
+   key->eth.cvlan.tci = 0;
+   key->eth.cvlan.tpid = 0;
+
+   if (likely(skb_vlan_tag_present(skb))) {
+   key->eth.vlan.tci = htons(skb->vlan_tci);
+   key->eth.vlan.tpid = skb->vlan_proto;
+   } else {
+   /* Parse outer vlan tag in the non-accelerated case. */
res = parse_vlan_tag(skb, &key->eth.vlan);
+   if (res <= 0)
+   return res;
+   }
+
+   /* Parse inner vlan tag. */
res = parse_vlan_tag(skb, &key->eth.cvlan);
+   if (res <= 0)
+   return res;
 
return 0;
 }
@@ -480,12 +513,8 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 * update skb->csum here.
 */
 
-   key->eth.tci = 0;
-   if (skb_vlan_tag_present(skb))
-   key->eth.tci = htons(skb->vlan_tci);
-   else if (eth->h_proto == htons(ETH_P_8021Q))
-   if (unlikely(parse_vlan(skb, key)))
-   return -ENOMEM;
+   if (unlikely(parse_vlan(skb, key)))
+   return -ENOMEM;
 
key->eth.type = parse_ethertype(skb);

[PATCH net-next v22 1/3] openvswitch: 802.1ad uapi changes.

2016-09-07 Thread Eric Garver
From: Thomas F Herbert 

openvswitch: Add support for 802.1ad

Change the description of the VLAN tpid field.

Signed-off-by: Thomas F Herbert 
---
 include/uapi/linux/openvswitch.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 54c3b4f4aceb..59ed3992c760 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -605,13 +605,13 @@ struct ovs_action_push_mpls {
  * @vlan_tci: Tag control identifier (TCI) to push.  The CFI bit must be set
  * (but it will not be set in the 802.1Q header that is pushed).
  *
- * The @vlan_tpid value is typically %ETH_P_8021Q.  The only acceptable TPID
- * values are those that the kernel module also parses as 802.1Q headers, to
- * prevent %OVS_ACTION_ATTR_PUSH_VLAN followed by %OVS_ACTION_ATTR_POP_VLAN
- * from having surprising results.
+ * The @vlan_tpid value is typically %ETH_P_8021Q or %ETH_P_8021AD.
+ * The only acceptable TPID values are those that the kernel module also parses
+ * as 802.1Q or 802.1AD headers, to prevent %OVS_ACTION_ATTR_PUSH_VLAN followed
+ * by %OVS_ACTION_ATTR_POP_VLAN from having surprising results.
  */
 struct ovs_action_push_vlan {
-   __be16 vlan_tpid;   /* 802.1Q TPID. */
+   __be16 vlan_tpid;   /* 802.1Q or 802.1ad TPID. */
__be16 vlan_tci;/* 802.1Q TCI (VLAN ID and priority). */
 };
 
@@ -721,9 +721,10 @@ enum ovs_nat_attr {
  * is copied from the value to the packet header field, rest of the bits are
  * left unchanged.  The non-masked value bits must be passed in as zeroes.
  * Masking is not supported for the %OVS_KEY_ATTR_TUNNEL attribute.
- * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q header onto the
- * packet.
- * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q header off the packet.
+ * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q or 802.1ad header
+ * onto the packet.
+ * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q or 802.1ad header
+ * from the packet.
  * @OVS_ACTION_ATTR_SAMPLE: Probabilitically executes actions, as specified in
  * the nested %OVS_SAMPLE_ATTR_* attributes.
  * @OVS_ACTION_ATTR_PUSH_MPLS: Push a new MPLS label stack entry onto the
-- 
2.5.5



[PATCH net-next v22 2/3] vlan: Check for vlan ethernet types for 802.1q or 802.1ad

2016-09-07 Thread Eric Garver
This is to simplify using double tagged vlans. This function allows all
valid vlan ethertypes to be checked in a single function call.
Also replace some instances that check for both ETH_P_8021Q and
ETH_P_8021AD.

Patch based on one originally by Thomas F Herbert.

Signed-off-by: Thomas F Herbert 
Signed-off-by: Eric Garver 
---
 include/linux/if_vlan.h | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 49d4aef1f789..3319d97d789d 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -272,6 +272,23 @@ static inline int vlan_get_encap_level(struct net_device *dev)
 }
 #endif
 
+/**
+ * eth_type_vlan - check for valid vlan ether type.
+ * @ethertype: ether type to check
+ *
+ * Returns true if the ether type is a vlan ether type.
+ */
+static inline bool eth_type_vlan(__be16 ethertype)
+{
+   switch (ethertype) {
+   case htons(ETH_P_8021Q):
+   case htons(ETH_P_8021AD):
+   return true;
+   default:
+   return false;
+   }
+}
+
 static inline bool vlan_hw_offload_capable(netdev_features_t features,
   __be16 proto)
 {
@@ -425,8 +442,7 @@ static inline int __vlan_get_tag(const struct sk_buff *skb, u16 *vlan_tci)
 {
struct vlan_ethhdr *veth = (struct vlan_ethhdr *)skb->data;
 
-   if (veth->h_vlan_proto != htons(ETH_P_8021Q) &&
-   veth->h_vlan_proto != htons(ETH_P_8021AD))
+   if (!eth_type_vlan(veth->h_vlan_proto))
return -EINVAL;
 
*vlan_tci = ntohs(veth->h_vlan_TCI);
@@ -488,7 +504,7 @@ static inline __be16 __vlan_get_protocol(struct sk_buff *skb, __be16 type,
 * present at mac_len - VLAN_HLEN (if mac_len > 0), or at
 * ETH_HLEN otherwise
 */
-   if (type == htons(ETH_P_8021Q) || type == htons(ETH_P_8021AD)) {
+   if (eth_type_vlan(type)) {
if (vlan_depth) {
if (WARN_ON(vlan_depth < VLAN_HLEN))
return 0;
@@ -506,8 +522,7 @@ static inline __be16 __vlan_get_protocol(struct sk_buff *skb, __be16 type,
vh = (struct vlan_hdr *)(skb->data + vlan_depth);
type = vh->h_vlan_encapsulated_proto;
vlan_depth += VLAN_HLEN;
-   } while (type == htons(ETH_P_8021Q) ||
-type == htons(ETH_P_8021AD));
+   } while (eth_type_vlan(type));
}
 
if (depth)
@@ -572,8 +587,7 @@ static inline void vlan_set_encap_proto(struct sk_buff *skb,
 static inline bool skb_vlan_tagged(const struct sk_buff *skb)
 {
if (!skb_vlan_tag_present(skb) &&
-   likely(skb->protocol != htons(ETH_P_8021Q) &&
-  skb->protocol != htons(ETH_P_8021AD)))
+   likely(!eth_type_vlan(skb->protocol)))
return false;
 
return true;
@@ -593,15 +607,14 @@ static inline bool skb_vlan_tagged_multi(const struct sk_buff *skb)
if (!skb_vlan_tag_present(skb)) {
struct vlan_ethhdr *veh;
 
-   if (likely(protocol != htons(ETH_P_8021Q) &&
-  protocol != htons(ETH_P_8021AD)))
+   if (likely(!eth_type_vlan(protocol)))
return false;
 
veh = (struct vlan_ethhdr *)skb->data;
protocol = veh->h_vlan_encapsulated_proto;
}
 
-   if (protocol != htons(ETH_P_8021Q) && protocol != htons(ETH_P_8021AD))
+   if (!eth_type_vlan(protocol))
return false;
 
return true;
-- 
2.5.5



Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-07 Thread Tom Herbert
On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
 wrote:
> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz  wrote:
>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
>>
>>> Packet rate performance testing was done with pktgen 64B packets on the
>>> TX side, and TC drop action on the RX side compared to XDP fast drop.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>>
>>> Comparison is done between:
>>> 1. Baseline, Before this patch with TC drop action
>>> 2. This patch with TC drop action
>>> 3. This patch with XDP RX fast drop
>>>
>>> Streams    Baseline(TC drop)    TC drop     XDP fast Drop
>>> ---------------------------------------------------------
>>> 1          5.51Mpps             5.14Mpps    13.5Mpps
>>> 2          11.5Mpps             10.0Mpps    25.1Mpps
>>> 4          16.3Mpps             17.2Mpps    35.4Mpps
>>> 8          29.6Mpps             28.2Mpps    45.8Mpps*
>>> 16         34.0Mpps             30.1Mpps    45.8Mpps*
>>
>> Rana, Guys, congrat!!
>>
>> When you say X streams, does each stream mapped by RSS to different RX ring?
>> or we're on the same RX ring for all rows of the above table?
>
> Yes, I will make this more clear in the actual submission,
> Here we are talking about different RSS core rings.
>
>>
>> In the CX3 work, we had X sender "streams" that all mapped to the
>> same RX ring. I don't think we went beyond one RX ring.
>
> Here we did. The first row is what you are describing; the other rows
> are the same test with an increasing number of RSS receiving cores.
> The xmit side is sending as many streams as possible, spread as
> uniformly as possible across the different RSS cores on the receiver.
>
>
Hi Saeed,

Please report CPU utilization also. The expectation is that
performance should scale linearly with increasing number of CPUs (i.e.
pps/CPU_utilization should be constant).
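
For instance, with hypothetical utilization figures: 8 streams at
45.8Mpps burning 800% CPU is ~5.7Mpps per fully busy core; if 16
streams still give 45.8Mpps at 1600% CPU, pps/CPU has halved and the
bottleneck is no longer the per-CPU software path (the asterisked rows
above hint at such a ceiling).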

Tom

>>
>> Here, I guess you want to 1st get an initial max for N pktgen TX
>> threads all sending the same stream so you land on a single RX ring,
>> and then move to M * N pktgen TX threads to max that further.
>>
>> I don't see how the current Linux stack would be able to happily
>> drive 34M PPS (== allocate SKB, etc, you know...) on a single CPU,
>> Jesper?
>>
>> Or.


Re: [PATCH net-next V6 4/4] net/sched: Introduce act_tunnel_key

2016-09-07 Thread Cong Wang
On Wed, Sep 7, 2016 at 7:14 AM, Eric Dumazet  wrote:
> On Wed, 2016-09-07 at 11:08 +0300, Hadar Hen Zion wrote:
>> From: Amir Vadai 
>>
>
>
>> +static int tunnel_key_init(struct net *net, struct nlattr *nla,
>> +struct nlattr *est, struct tc_action **a,
>> +int ovr, int bind)
>> +{
>> + struct tc_action_net *tn = net_generic(net, tunnel_key_net_id);
>> + struct nlattr *tb[TCA_TUNNEL_KEY_MAX + 1];
>> + struct metadata_dst *metadata = NULL;
>> + struct tc_tunnel_key *parm;
>> + struct tcf_tunnel_key *t;
>> + struct tcf_tunnel_key_params *params_old;
>> + struct tcf_tunnel_key_params *params_new;
>> + __be64 key_id;
>> + bool exists = false;
>> + int ret = 0;
>> + int err;
>
> We try for new code to avoid reverse Christmas tree ;)

Fix checkpatch.pl (or whatever tool you prefer).


Re: [PATCH net-next V6 4/4] net/sched: Introduce act_tunnel_key

2016-09-07 Thread Cong Wang
On Wed, Sep 7, 2016 at 1:08 AM, Hadar Hen Zion  wrote:
> +struct tcf_tunnel_key_params {
> +   struct rcu_head rcu;
> +   int tcft_action;
> +   int action;
> +   struct metadata_dst *tcft_enc_metadata;
> +};
> +
> +struct tcf_tunnel_key {
> +   struct tc_action  common;
> +   struct tcf_tunnel_key_params __rcu *params;
> +};
> +
...

This is unnecessary if we make the tc action API aware of RCU.

> +
> +static void tunnel_key_release(struct tc_action *a, int bind)
> +{
> +   struct tcf_tunnel_key *t = to_tunnel_key(a);
> +   struct tcf_tunnel_key_params *params;
> +
> +   rcu_read_lock();
> +   params = rcu_dereference(t->params);
> +
> +   if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET)
> +   dst_release(&params->tcft_enc_metadata->dst);
> +
> +   rcu_read_unlock();
> +

So you allocate memory for t->params in ->init() but don't
release it here?

Also, ->cleanup() should be called with RTNL, no need to
take read lock here.

BTW, again you do NOT need to make it RCU; the whole
tc action API should be, as my patchset does. I will take care
of this as part of my patchset. Eric is wasting your time on
this, with no benefit: the code will be replaced soon.


Re: [RFC Patch net-next 0/6] net_sched: really switch to RCU for tc actions

2016-09-07 Thread John Fastabend
On 16-09-01 10:57 PM, Cong Wang wrote:
> Currently there are only two lockless tc actions:
> gact and mirred. But they are questionable because
> we don't have anything to prevent a parallel update
> of an existing tc action in the hash table while reading
> it on the fast path; this could be a problem when a tc
> action becomes complex.

Hmm, I'm trying to see where the questionable part is in the current
code. What is it, exactly?

The calls to

 tcf_lastuse_update(&p->tcf_tm);

for example are possibly wrong, as that is a u64 that may be set from
multiple cores. So fixing that seems like a good idea.
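
A sketch of the kind of minimal fix I mean (purely illustrative, not
from the patchset; a plain u64 store can still tear on 32-bit, so that
case would need more care):

	static inline void tcf_lastuse_update(struct tcf_t *tm)
	{
		unsigned long now = jiffies;

		/* Make the cross-core store explicit, skip redundant writes. */
		if (READ_ONCE(tm->lastuse) != now)
			WRITE_ONCE(tm->lastuse, now);
	}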

Actions themselves don't have a path to be "updated" while live, do they?
IIRC, and a quick scan of the code this morning confirms, actions
have a refcnt and "bind"/"release" operations that increment/decrement
this counter. Both bind and release are protected via the rtnl lock in the
control path.

I need to follow all the code paths, but is there a way to remove an
action that still has a refcnt > 0? In other words, does it need to be
removed from all filters before it can be deleted? If yes, then by the
time it is removed (after an rcu grace period) it should not be in use.
If no, then I think there is a problem.

I'm looking at this code path here,

int __tcf_hash_release(struct tc_action *p, bool bind, bool strict)
{
int ret = 0;

if (p) {
if (bind)
p->tcfa_bindcnt--;
else if (strict && p->tcfa_bindcnt > 0)
return -EPERM;

p->tcfa_refcnt--;
if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) {
if (p->ops->cleanup)
p->ops->cleanup(p, bind);
tcf_hash_destroy(p->hinfo, p);
ret = ACT_P_DELETED;
}
}

return ret;
}

It looks to me that every call site that jumps here, where it's possible
an action is being used by a filter, is "strict". And further, filters
only release actions after an rcu grace period when being destroyed, once
the filter is no longer using the action.

Although the refcnt should be atomic now that it's being decremented from
outside the rtnl lock in an rcu callback? At least it looks racy to me
at a glance this morning.
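
i.e. something along these lines, purely illustrative (the rcu head and
tcf_action_free_rcu() are hypothetical):

	static void tcf_action_put(struct tc_action *p)
	{
		/* Safe outside RTNL, e.g. from an RCU callback: only the
		 * final put tears the action down. */
		if (atomic_dec_and_test(&p->tcfa_refcnt))
			call_rcu(&p->rcu, tcf_action_free_rcu);
	}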

If the refcnt'ing is atomic, then do we care about/need the hash rcu bits?
I'm not seeing how they help, because in the fast path we don't even touch
the hash table; we have a pointer to a refcnt'd action object.

What did I miss?

> 
> This patchset introduces a few new tc action API's
> based on RCU so that the fast path could now really
> be protected by RCU and we can update existing tc
> actions safely and race-freely.
> 
> Obviously this is still _not_ complete yet, I only
> modified mirred action to show the use case of
> the new API's, all the rest actions could switch to
> the new API's too. The new API's are a bit ugly too,
> any suggestion to improve them is welcome.
> 
> I tested mirred action with a few test cases, so far
> so good, at least no obvious bugs. ;)

Taking a quick survey of the actions, I didn't see any with global state.
But I didn't look at them all.

> 
> 
> Cong Wang (6):
>   net_sched: use RCU for action hash table
>   net_sched: introduce tcf_hash_replace()
>   net_sched: return NULL in tcf_hash_check()
>   net_sched: introduce tcf_hash_copy()
>   net_sched: use rcu in fast path
>   net_sched: switch to RCU API for act_mirred
> 
>  include/net/act_api.h  |  3 +++
>  net/sched/act_api.c    | 59 +++---
>  net/sched/act_mirred.c | 41 ---
>  3 files changed, 73 insertions(+), 30 deletions(-)
> 



[PATCH net 4/5] net/mlx5e: Fix global PFC counters replication

2016-09-07 Thread Saeed Mahameed
From: Gal Pressman 

Currently, when reading global PFC statistics, we leave the counter
iterator out of the equation and end up reading the same counter
over and over again.

Instead of reading the counter at index 0 on every iteration we now read
the counter at index (i).

Fixes: e989d5a532ce ('net/mlx5e: Expose flow control counters to ethtool')
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 98e1a4a..7a346bb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -331,7 +331,7 @@ static void mlx5e_get_ethtool_stats(struct net_device *dev,
if (mlx5e_query_global_pause_combined(priv)) {
for (i = 0; i < NUM_PPORT_PER_PRIO_PFC_COUNTERS; i++) {
data[idx++] = MLX5E_READ_CTR64_BE(&priv->stats.pport.per_prio_counters[0],
-  pport_per_prio_pfc_stats_desc, 0);
+  pport_per_prio_pfc_stats_desc, i);
}
}
 
-- 
2.7.4



[PATCH net 0/5] Mellanox 100G mlx5 fixes 2016-09-07

2016-09-07 Thread Saeed Mahameed
Hi Dave,

The following series contains bug fixes for the mlx5e driver.

From Gal,
- Static code checker cleanup (casting overflow)
- Fix global PFC counter statistics reading
- Fix HW LRO when vlan stripping is off

From Bodong,
- Deprecate old autoneg capability bit and use new one.

From Tariq,
- Fix xmit more counter race condition


For -stable:
('net/mlx5e: Fix parsing of vlan packets when updating lro header')

No conflicts are introduced with the mlx5 ongoing net-next submission.

Thanks,
Saeed.

Bodong Wang (1):
  net/mlx5e: Move an_disable_cap bit to a new position

Gal Pressman (3):
  net/mlx5e: Prevent casting overflow
  net/mlx5e: Fix global PFC counters replication
  net/mlx5e: Fix parsing of vlan packets when updating lro header

Tariq Toukan (1):
  net/mlx5e: Fix xmit_more counter race issue

 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  8 +---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 22 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c|  2 +-
 include/linux/mlx5/mlx5_ifc.h  |  5 +++--
 4 files changed, 24 insertions(+), 13 deletions(-)

-- 
2.7.4



[PATCH net 5/5] net/mlx5e: Fix parsing of vlan packets when updating lro header

2016-09-07 Thread Saeed Mahameed
From: Gal Pressman 

Currently vlan tagged packets are not parsed correctly
and are assumed to be regular IPv4/IPv6 packets.
We should check for 802.1Q/802.1ad tags and update the lro header
accordingly.
This fixes the use case where LRO is on and rxvlan is off
(vlan stripping is off).

Fixes: e586b3b0baee ('net/mlx5: Ethernet Datapath files')
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b6f8ebb..e7c969d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -637,24 +637,32 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
 u32 cqe_bcnt)
 {
-   struct ethhdr   *eth = (struct ethhdr *)(skb->data);
-   struct iphdr*ipv4   = (struct iphdr *)(skb->data + ETH_HLEN);
-   struct ipv6hdr  *ipv6   = (struct ipv6hdr *)(skb->data + ETH_HLEN);
+   struct ethhdr   *eth = (struct ethhdr *)(skb->data);
+   struct iphdr*ipv4;
+   struct ipv6hdr  *ipv6;
struct tcphdr   *tcp;
+   int network_depth = 0;
+   __be16 proto;
+   u16 tot_len;
 
u8 l4_hdr_type = get_cqe_l4_hdr_type(cqe);
int tcp_ack = ((CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA  == l4_hdr_type) ||
   (CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA == l4_hdr_type));
 
-   u16 tot_len = cqe_bcnt - ETH_HLEN;
+   skb->mac_len = ETH_HLEN;
+   proto = __vlan_get_protocol(skb, eth->h_proto, &network_depth);
 
-   if (eth->h_proto == htons(ETH_P_IP)) {
-   tcp = (struct tcphdr *)(skb->data + ETH_HLEN +
+   ipv4 = (struct iphdr *)(skb->data + network_depth);
+   ipv6 = (struct ipv6hdr *)(skb->data + network_depth);
+   tot_len = cqe_bcnt - network_depth;
+
+   if (proto == htons(ETH_P_IP)) {
+   tcp = (struct tcphdr *)(skb->data + network_depth +
sizeof(struct iphdr));
ipv6 = NULL;
skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
} else {
-   tcp = (struct tcphdr *)(skb->data + ETH_HLEN +
+   tcp = (struct tcphdr *)(skb->data + network_depth +
sizeof(struct ipv6hdr));
ipv4 = NULL;
skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
-- 
2.7.4



[PATCH net 2/5] net/mlx5e: Move an_disable_cap bit to a new position

2016-09-07 Thread Saeed Mahameed
From: Bodong Wang 

The previous an_disable_cap position, bit 31, is deprecated for driver
use with newer firmware.  New firmware will advertise the same capability
in bit 29.

Old capability didn't allow setting more than one protocol for a
specific speed when autoneg is off, while newer firmware will allow
this and it is indicated in the new capability location.
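
For reference, drivers read this field through the generated accessors,
so with the new layout the same code transparently picks up bit 29
(sketch; "out" is assumed to hold a PTYS query result):

	u8 an_disable_cap   = MLX5_GET(ptys_reg, out, an_disable_cap);
	u8 an_disable_admin = MLX5_GET(ptys_reg, out, an_disable_admin);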

Signed-off-by: Bodong Wang 
Signed-off-by: Saeed Mahameed 
---
 include/linux/mlx5/mlx5_ifc.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 21bc455..d1f9a58 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -6710,9 +6710,10 @@ struct mlx5_ifc_pude_reg_bits {
 };
 
 struct mlx5_ifc_ptys_reg_bits {
-   u8 an_disable_cap[0x1];
+   u8 reserved_at_0[0x1];
u8 an_disable_admin[0x1];
-   u8 reserved_at_2[0x6];
+   u8 an_disable_cap[0x1];
+   u8 reserved_at_3[0x5];
u8 local_port[0x8];
u8 reserved_at_10[0xd];
u8 proto_mask[0x3];
-- 
2.7.4



[PATCH net 3/5] net/mlx5e: Prevent casting overflow

2016-09-07 Thread Saeed Mahameed
From: Gal Pressman 

On 64-bit architectures unsigned long is longer than u32, so casting a
u32 pointer to unsigned long * reads beyond the variable.
We need to first declare an unsigned long variable, then assign it the
wanted value.

Fixes: 665bc53969d7 ('net/mlx5e: Use new ethtool get/set link ksettings API')
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index d0cf8fa..98e1a4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -659,9 +659,10 @@ out:
 static void ptys2ethtool_supported_link(unsigned long *supported_modes,
u32 eth_proto_cap)
 {
+   unsigned long proto_cap = eth_proto_cap;
int proto;
 
-   for_each_set_bit(proto, (unsigned long *)&eth_proto_cap, MLX5E_LINK_MODES_NUMBER)
+   for_each_set_bit(proto, &proto_cap, MLX5E_LINK_MODES_NUMBER)
bitmap_or(supported_modes, supported_modes,
  ptys2ethtool_table[proto].supported,
  __ETHTOOL_LINK_MODE_MASK_NBITS);
@@ -670,9 +671,10 @@ static void ptys2ethtool_supported_link(unsigned long *supported_modes,
 static void ptys2ethtool_adver_link(unsigned long *advertising_modes,
u32 eth_proto_cap)
 {
+   unsigned long proto_cap = eth_proto_cap;
int proto;
 
-   for_each_set_bit(proto, (unsigned long *)&eth_proto_cap, MLX5E_LINK_MODES_NUMBER)
+   for_each_set_bit(proto, &proto_cap, MLX5E_LINK_MODES_NUMBER)
bitmap_or(advertising_modes, advertising_modes,
  ptys2ethtool_table[proto].advertised,
  __ETHTOOL_LINK_MODE_MASK_NBITS);
-- 
2.7.4



[PATCH net 1/5] net/mlx5e: Fix xmit_more counter race issue

2016-09-07 Thread Saeed Mahameed
From: Tariq Toukan 

Update the xmit_more counter before notifying the HW, to prevent a
possible use-after-free of the skb: once the doorbell is rung, the skb
may already have been completed and freed when we read skb->xmit_more.
Fixes: c8cf78fe100b ("net/mlx5e: Add ethtool counter for TX xmit_more")
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 988eca9..eb0e725 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -356,6 +356,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
sq->stats.stopped++;
}
 
+   sq->stats.xmit_more += skb->xmit_more;
if (!skb->xmit_more || netif_xmit_stopped(sq->txq)) {
int bf_sz = 0;
 
@@ -375,7 +376,6 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 
sq->stats.packets++;
sq->stats.bytes += num_bytes;
-   sq->stats.xmit_more += skb->xmit_more;
return NETDEV_TX_OK;
 
 dma_unmap_wqe_err:
-- 
2.7.4



[PATCH net-next v3] net: inet: diag: expose the socket mark to privileged processes.

2016-09-07 Thread Lorenzo Colitti
This adds the capability for a process that has CAP_NET_ADMIN on
a socket to see the socket mark in socket dumps.

Commit a52e95abf772 ("net: diag: allow socket bytecode filters to
match socket marks") recently gave privileged processes the
ability to filter socket dumps based on mark. This patch is
complementary: it ensures that the mark is also passed to
userspace in the socket's netlink attributes.  It is useful for
tools like ss which display information about sockets.
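
On the userspace side, a dumper could pick the new attribute up roughly
like this (sketch using libmnl; only INET_DIAG_MARK comes from this
patch, the rest is illustrative):

	#include <stdio.h>
	#include <libmnl/libmnl.h>
	#include <linux/inet_diag.h>

	/* Walk the attributes that follow struct inet_diag_msg in one
	 * dump record and print the mark, if the kernel included it. */
	static void print_mark(const struct nlmsghdr *nlh)
	{
		const struct nlattr *attr;

		mnl_attr_for_each(attr, nlh, sizeof(struct inet_diag_msg)) {
			if (mnl_attr_get_type(attr) == INET_DIAG_MARK)
				printf("mark 0x%x\n", mnl_attr_get_u32(attr));
		}
	}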

Tested: https://android-review.googlesource.com/270210
Signed-off-by: Lorenzo Colitti 
---
 include/linux/inet_diag.h  |  4 ++--
 include/uapi/linux/inet_diag.h |  1 +
 net/ipv4/inet_diag.c   | 49 --
 net/ipv4/udp_diag.c| 10 +
 net/sctp/sctp_diag.c   | 20 +++--
 5 files changed, 56 insertions(+), 28 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index feb04ea..65da430 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -37,7 +37,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
  struct sk_buff *skb, const struct inet_diag_req_v2 *req,
  struct user_namespace *user_ns,
  u32 pid, u32 seq, u16 nlmsg_flags,
- const struct nlmsghdr *unlh);
+ const struct nlmsghdr *unlh, bool net_admin);
 void inet_diag_dump_icsk(struct inet_hashinfo *h, struct sk_buff *skb,
 struct netlink_callback *cb,
 const struct inet_diag_req_v2 *r,
@@ -56,7 +56,7 @@ void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock *sk);
 
 int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
 struct inet_diag_msg *r, int ext,
-struct user_namespace *user_ns);
+struct user_namespace *user_ns, bool net_admin);
 
 extern int  inet_diag_register(const struct inet_diag_handler *handler);
 extern void inet_diag_unregister(const struct inet_diag_handler *handler);
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 5581206..b5c366f 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -123,6 +123,7 @@ enum {
INET_DIAG_LOCALS,
INET_DIAG_PEERS,
INET_DIAG_PAD,
+   INET_DIAG_MARK,
__INET_DIAG_MAX,
 };
 
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index abfbe49..e4d16fc 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -99,6 +99,7 @@ static size_t inet_sk_attr_size(void)
+ nla_total_size(1) /* INET_DIAG_SHUTDOWN */
+ nla_total_size(1) /* INET_DIAG_TOS */
+ nla_total_size(1) /* INET_DIAG_TCLASS */
+   + nla_total_size(4) /* INET_DIAG_MARK */
+ nla_total_size(sizeof(struct inet_diag_meminfo))
+ nla_total_size(sizeof(struct inet_diag_msg))
+ nla_total_size(SK_MEMINFO_VARS * sizeof(u32))
@@ -109,7 +110,8 @@ static size_t inet_sk_attr_size(void)
 
 int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
 struct inet_diag_msg *r, int ext,
-struct user_namespace *user_ns)
+struct user_namespace *user_ns,
+bool net_admin)
 {
const struct inet_sock *inet = inet_sk(sk);
 
@@ -136,6 +138,9 @@ int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
}
 #endif
 
+   if (net_admin && nla_put_u32(skb, INET_DIAG_MARK, sk->sk_mark))
+   goto errout;
+
r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
r->idiag_inode = sock_i_ino(sk);
 
@@ -149,7 +154,8 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
  struct sk_buff *skb, const struct inet_diag_req_v2 *req,
  struct user_namespace *user_ns,
  u32 portid, u32 seq, u16 nlmsg_flags,
- const struct nlmsghdr *unlh)
+ const struct nlmsghdr *unlh,
+ bool net_admin)
 {
const struct tcp_congestion_ops *ca_ops;
const struct inet_diag_handler *handler;
@@ -175,7 +181,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
r->idiag_timer = 0;
r->idiag_retrans = 0;
 
-   if (inet_diag_msg_attrs_fill(sk, skb, r, ext, user_ns))
+   if (inet_diag_msg_attrs_fill(sk, skb, r, ext, user_ns, net_admin))
goto errout;
 
if (ext & (1 << (INET_DIAG_MEMINFO - 1))) {
@@ -274,10 +280,11 @@ static int inet_csk_diag_fill(struct sock *sk,
  const struct inet_diag_req_v2 *req,
  struct user_namespace *user_ns,
   

Re: [PATCH net-next] net: inet: diag: expose the socket mark to privileged processes.

2016-09-07 Thread Lorenzo Colitti
On Wed, Sep 7, 2016 at 8:20 PM, kbuild test robot  wrote:
> [auto build test ERROR on net-next/master]
>
>net/sctp/sctp_diag.c: In function 'inet_sctp_diag_fill':
> >> net/sctp/sctp_diag.c:136:6: error: too few arguments to function 
> >> 'inet_diag_msg_attrs_fill'
>  if (inet_diag_msg_attrs_fill(sk, skb, r, ext, user_ns))
>  ^~~~

Fixed in v3, and unit tests updated with basic SCTP tests as well.


[PATCH net-next 2/2] rxrpc: Add tracepoint for working out where aborts happen

2016-09-07 Thread David Howells
Add a tracepoint for working out where local aborts happen.  Each
tracepoint call is labelled with a 3-letter code so that they can be
distinguished - and the DATA sequence number is added too where available.

rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
indicate the circumstances when it aborts a call.

Signed-off-by: David Howells 
---

 fs/afs/rxrpc.c   |   17 ---
 include/net/af_rxrpc.h   |3 +
 include/trace/events/rxrpc.h |   29 +++
 net/rxrpc/ar-internal.h  |   14 -
 net/rxrpc/call_event.c   |7 ++-
 net/rxrpc/call_object.c  |2 -
 net/rxrpc/conn_event.c   |6 ++
 net/rxrpc/input.c|7 ++-
 net/rxrpc/insecure.c |   19 ---
 net/rxrpc/rxkad.c|  108 +++---
 net/rxrpc/sendmsg.c  |   18 ---
 11 files changed, 132 insertions(+), 98 deletions(-)

diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 37608be52abd..53750dece80e 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -377,7 +377,7 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
return wait_mode->wait(call);
 
 error_do_abort:
-   rxrpc_kernel_abort_call(afs_socket, rxcall, RX_USER_ABORT);
+   rxrpc_kernel_abort_call(afs_socket, rxcall, RX_USER_ABORT, -ret, "KSD");
 error_kill_call:
afs_end_call(call);
_leave(" = %d", ret);
@@ -425,12 +425,12 @@ static void afs_deliver_to_call(struct afs_call *call)
case -ENOTCONN:
abort_code = RX_CALL_DEAD;
rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-   abort_code);
+   abort_code, -ret, "KNC");
goto do_abort;
case -ENOTSUPP:
abort_code = RX_INVALID_OPERATION;
rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-   abort_code);
+   abort_code, -ret, "KIV");
goto do_abort;
case -ENODATA:
case -EBADMSG:
@@ -440,7 +440,7 @@ static void afs_deliver_to_call(struct afs_call *call)
if (call->state != AFS_CALL_AWAIT_REPLY)
abort_code = RXGEN_SS_UNMARSHAL;
rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-   abort_code);
+   abort_code, EBADMSG, "KUM");
goto do_abort;
}
}
@@ -463,6 +463,7 @@ do_abort:
  */
 static int afs_wait_for_call_to_complete(struct afs_call *call)
 {
+   const char *abort_why;
int ret;
 
DECLARE_WAITQUEUE(myself, current);
@@ -481,9 +482,11 @@ static int afs_wait_for_call_to_complete(struct afs_call *call)
continue;
}
 
+   abort_why = "KWC";
ret = call->error;
if (call->state == AFS_CALL_COMPLETE)
break;
+   abort_why = "KWI";
ret = -EINTR;
if (signal_pending(current))
break;
@@ -497,7 +500,7 @@ static int afs_wait_for_call_to_complete(struct afs_call *call)
if (call->state < AFS_CALL_COMPLETE) {
_debug("call incomplete");
rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-   RX_CALL_DEAD);
+   RX_CALL_DEAD, -ret, abort_why);
}
 
_debug("call complete");
@@ -695,7 +698,7 @@ void afs_send_empty_reply(struct afs_call *call)
case -ENOMEM:
_debug("oom");
rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-   RX_USER_ABORT);
+   RX_USER_ABORT, ENOMEM, "KOO");
default:
afs_end_call(call);
_leave(" [error]");
@@ -734,7 +737,7 @@ void afs_send_simple_reply(struct afs_call *call, const void *buf, size_t len)
if (n == -ENOMEM) {
_debug("oom");
rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-   RX_USER_ABORT);
+   RX_USER_ABORT, ENOMEM, "KOO");
}
afs_end_call(call);
_leave(" [error]");
diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h
index b4b6a3664dda..08ed8729126c 100644
--- a/include/net/af_rxrpc.h
+++ b/include/net/af_rxrpc.h
@@ -35,7 +35,8 @@ int rxrpc_kernel_send_data(struct socket *, struct rxrpc_call *,
   struct msghdr *, size_t);
 int rxrpc_kernel_recv_data(struct socket *, struct rxrpc_call *,
   

[PATCH net-next 0/2] rxrpc: Local abort tracepoint

2016-09-07 Thread David Howells

Here are two patches.  They need to be applied on top of the just-posted
call refcount overhaul patch:

 (1) Fix the return value of some call completion helpers.

 (2) Add a tracepoint that allows local aborts to be debugged.

The patches can be found here also:


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160904-2

David
---
David Howells (2):
  rxrpc: Fix returns of call completion helpers
  rxrpc: Add tracepoint for working out where aborts happen


 fs/afs/rxrpc.c   |   17 ---
 include/net/af_rxrpc.h   |3 +
 include/trace/events/rxrpc.h |   29 +++
 net/rxrpc/ar-internal.h  |   27 +++
 net/rxrpc/call_event.c   |7 ++-
 net/rxrpc/call_object.c  |2 -
 net/rxrpc/conn_event.c   |6 ++
 net/rxrpc/input.c|7 ++-
 net/rxrpc/insecure.c |   19 ---
 net/rxrpc/rxkad.c|  108 +++---
 net/rxrpc/sendmsg.c  |   18 ---
 11 files changed, 140 insertions(+), 103 deletions(-)



[PATCH net-next 1/2] rxrpc: Fix returns of call completion helpers

2016-09-07 Thread David Howells
rxrpc_set_call_completion() returns bool, not int, so the ret variable
should match this.

rxrpc_call_completed() and __rxrpc_call_completed() should return the value
of rxrpc_set_call_completion().

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |   13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 3addda4bfa6b..0353399792b6 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -608,7 +608,7 @@ static inline bool rxrpc_set_call_completion(struct rxrpc_call *call,
 u32 abort_code,
 int error)
 {
-   int ret;
+   bool ret;
 
write_lock_bh(&call->state_lock);
ret = __rxrpc_set_call_completion(call, compl, abort_code, error);
@@ -619,16 +619,19 @@ static inline bool rxrpc_set_call_completion(struct rxrpc_call *call,
 /*
  * Record that a call successfully completed.
  */
-static inline void __rxrpc_call_completed(struct rxrpc_call *call)
+static inline bool __rxrpc_call_completed(struct rxrpc_call *call)
 {
-   __rxrpc_set_call_completion(call, RXRPC_CALL_SUCCEEDED, 0, 0);
+   return __rxrpc_set_call_completion(call, RXRPC_CALL_SUCCEEDED, 0, 0);
 }
 
-static inline void rxrpc_call_completed(struct rxrpc_call *call)
+static inline bool rxrpc_call_completed(struct rxrpc_call *call)
 {
+   bool ret;
+
write_lock_bh(>state_lock);
-   __rxrpc_call_completed(call);
+   ret = __rxrpc_call_completed(call);
write_unlock_bh(>state_lock);
+   return ret;
 }
 
 /*



[PATCH net] tcp: fastopen: avoid negative sk_forward_alloc

2016-09-07 Thread Eric Dumazet
From: Eric Dumazet 

When DATA and/or FIN are carried in a SYN/ACK message or SYN message,
we append an skb to the socket receive queue, but we forget to call
sk_forced_mem_schedule().

The effect is that the socket has a negative sk->sk_forward_alloc as long
as the message is not read by the application.

Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp
fin memory accounting")
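
(skb_set_owner_r() charges skb->truesize to the socket, decreasing
sk->sk_forward_alloc; sk_forced_mem_schedule() provisions that amount
first, so the counter cannot go negative.)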
 
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_fastopen.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 62a5751d4fe1..4e777a3243f9 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -150,6 +150,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
tp->segs_in = 0;
tcp_segs_in(tp, skb);
__skb_pull(skb, tcp_hdrlen(skb));
+   sk_forced_mem_schedule(sk, skb->truesize);
skb_set_owner_r(skb, sk);
 
TCP_SKB_CB(skb)->seq++;




Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Eric Dumazet
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  wrote:
> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> >>
> >> Here we introduce a xmit more like mechanism that will queue up more
> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> >>
> >> Once RX napi budget is consumed and we exit napi RX loop, we will
> >> flush (doorbell) all XDP looped packets in case there are such.
> >
> > Why is this idea depends on XDP ?
> >
> > It looks like we could apply it to any driver having one IRQ servicing
> > one RX and one TX, without XDP being involved.
> >
> 
> Yes, but it is more complicated than the XDP case, where the RX ring
> posts the TX descriptors and, once done, hits the doorbell once for
> all the TX descriptors it posted - and that is the only possible
> place to hit a doorbell for the XDP TX ring.
> 
> For a regular TX and RX ring sharing the same IRQ, there is no such
> simple connection between them, and hitting a doorbell from RX ring
> napi would race with the xmit ndo function of the TX ring.
> 
> How do you synchronize in such a case?
> Isn't the existing xmit_more mechanism sufficient?

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.

A simple cmpxchg could be used to synchronize the thing, if we really
cared about doorbell cost. (Ie if the cost of this cmpxchg() is way
smaller than doorbell one)
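
Something in the atomic cmpxchg()/xchg() family would do; a rough
sketch with hypothetical names, not tied to any driver:

	struct txq_doorbell {
		atomic_t pending;	/* posted but not yet doorbelled */
	};

	/* Both the RX-napi TX path and ndo_start_xmit publish their
	 * descriptors here after posting them. */
	static void txq_post(struct txq_doorbell *db, int n)
	{
		atomic_add(n, &db->pending);
	}

	/* Whoever flushes atomically takes everything pending and rings
	 * the doorbell once for all of it. */
	static void txq_flush(struct txq_doorbell *db)
	{
		int n = atomic_xchg(&db->pending, 0);

		if (n)
			hw_ring_doorbell(n);	/* hypothetical HW hook */
	}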


[PATCH net-next 0/1] rxrpc: Local abort tracepoint

2016-09-07 Thread David Howells

Here's a patch that adds a tracepoint that allows local aborts to be
debugged.  This needs to be applied on top of the just-posted call refcount
overhaul patch.

The patches can be found here also:


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160904-2

David
---
David Howells (1):
  rxrpc: Add tracepoint for working out where aborts happen


 fs/afs/rxrpc.c   |   17 ---
 include/net/af_rxrpc.h   |3 +
 include/trace/events/rxrpc.h |   29 +++
 net/rxrpc/ar-internal.h  |   27 +++
 net/rxrpc/call_event.c   |7 ++-
 net/rxrpc/call_object.c  |2 -
 net/rxrpc/conn_event.c   |6 ++
 net/rxrpc/input.c|7 ++-
 net/rxrpc/insecure.c |   19 ---
 net/rxrpc/rxkad.c|  108 +++---
 net/rxrpc/sendmsg.c  |   18 ---
 11 files changed, 140 insertions(+), 103 deletions(-)



[PATCH net-next 4/8] rxrpc: Use call->peer rather than call->conn->params.peer

2016-09-07 Thread David Howells
Use call->peer rather than call->conn->params.peer to avoid the possibility
of call->conn being NULL and, whilst we're at it, check it for NULL before we
access it.

Signed-off-by: David Howells 
---

 net/rxrpc/call_object.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 3166b5222435..060ddc32a85e 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -514,9 +514,11 @@ void rxrpc_release_call(struct rxrpc_call *call)
 */
_debug("RELEASE CALL %p (%d CONN %p)", call, call->debug_id, conn);
 
-   spin_lock(&conn->params.peer->lock);
-   hlist_del_init(&call->error_link);
-   spin_unlock(&conn->params.peer->lock);
+   if (call->peer) {
+   spin_lock(&call->peer->lock);
+   hlist_del_init(&call->error_link);
+   spin_unlock(&call->peer->lock);
+   }
 
write_lock_bh(>call_lock);
if (!list_empty(>accept_link)) {



[PATCH net-next 1/8] rxrpc: Whitespace cleanup

2016-09-07 Thread David Howells
Remove some whitespace.

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index bb342f5fe7e4..ad702f9f8d1f 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -720,7 +720,6 @@ static inline void rxrpc_put_connection(struct rxrpc_connection *conn)
}
 }
 
-
 static inline bool rxrpc_queue_conn(struct rxrpc_connection *conn)
 {
if (!rxrpc_get_connection_maybe(conn))
@@ -879,7 +878,7 @@ int __init rxrpc_init_security(void);
 void rxrpc_exit_security(void);
 int rxrpc_init_client_conn_security(struct rxrpc_connection *);
 int rxrpc_init_server_conn_security(struct rxrpc_connection *);
- 
+
 /*
  * sendmsg.c
  */


