Re: [PATCH 3.2 085/115] veth: don't modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote:
> On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:
> > Consider:
> >
> > - App A sends out corrupt packets 50% of the time and discards inbound
> > data. (...)
>
> How can you make a generic app C know how to do this? The path could be,
> for instance:
>
> eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <->
> vethC <-> vethD <-> appC
>
> There are no sockets on vethB, but it does need to have special behaviour
> to elide csums. Even if appC is hacked to know how to twiddle some thing
> on its veth port, mucking with vethD will have no effect on vethB.
>
> With regard to your example above, why would A corrupt packets? My guess:
>
> 1) It has bugs (so, fix the bugs; it could equally create incorrect data
> with proper checksums, so just enabling checksumming adds no useful
> protection.)

I agree with Ben here; what he needs is the ability for userspace to be trusted when *forwarding* a packet. Ideally you'd only want to receive the csum status per packet on the packet socket and pass the same value on the vethA interface, with this status being kept when the packet reaches vethB. If A purposely corrupts packets, it's A's problem. It's similar to designing a NIC which intentionally corrupts packets and reports "checksum good".

The real issue is that in order to do things right, the userspace bridge (here, "A") would really need to pass this status. In Ben's case, as he says, bad-checksum packets are dropped before reaching A, which simplifies the process quite a bit and might be what causes some confusion, but ideally we'd rather have recvmsg() and sendmsg() with these flags.

I faced the exact same issue 3 years ago when playing with netmap: it was slow as hell because it would lose all checksum information when packets were passing through userland, resulting in GRO/GSO etc. being disabled, and I had to modify it to let userland preserve it.
That's especially important when you have to deal with possibly corrupted packets not yet detected in the chain because the NIC did not validate their checksums.

Willy
Re: sfc: Please add rss_cpus=hyperthreads support
On Tue, May 5, 2015 at 6:48 AM, Robert Stonehouse wrote:
> On 02/05/15 01:03, Andy Lutomirski wrote:
>> Can you please port the rss_cpus=hyperthreads feature from the
>> OpenOnload driver to the upstream driver? The fact that sticking:
>>
>> options sfc rss_cpus=hyperthreads
>>
>> into modprobe.conf causes sfc to fail to load when OpenOnload isn't
>> installed is incredibly annoying, and the option is genuinely useful.
>
> Sure; I will make sure that happens soon.
> I can see how that would be annoying.

Ping?

Thanks,
Andy
Re: [PATCH net-next 10/12] net/mlx5e: Create aRFS flow tables
On Fri, Apr 29, 2016 at 01:36:40AM +0300, Saeed Mahameed wrote:
> From: Maor Gottlieb
>
> Create the following four flow tables for aRFS usage:
> 1. IPv4 TCP - filtering 4-tuple of IPv4 TCP packets.
> 2. IPv6 TCP - filtering 4-tuple of IPv6 TCP packets.
> 3. IPv4 UDP - filtering 4-tuple of IPv4 UDP packets.
> 4. IPv6 UDP - filtering 4-tuple of IPv6 UDP packets.
>
> Each flow table has two flow groups: one for the 4-tuple
> filtering (full match) and the other for the miss rule.
>
> A full-match rule means a hit for aRFS, and the packet will be forwarded
> to the dedicated RQ/core; miss-rule packets will be forwarded to
> default RSS hashing.
>
> Signed-off-by: Maor Gottlieb
> Signed-off-by: Saeed Mahameed
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile  |   1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en.h      |  41
>  drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c | 251 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_fs.c   |  23 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   8 +-
>  drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |   3 +-
>  6 files changed, 313 insertions(+), 14 deletions(-)
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index 4fc45ee..679e18f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -9,3 +9,4 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o \
>  		en_txrx.o en_clock.o vxlan.o en_tc.o
>
>  mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) += en_dcbnl.o
> +mlx5_core-$(CONFIG_RFS_ACCEL) += en_arfs.o

This line breaks the build when CONFIG_RFS_ACCEL is on and CONFIG_MLX5_CORE_EN is off.
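One possible fix is sketched below. This is only an illustration of the dependency problem, not the patch the maintainers eventually applied: either guard the object on both symbols, or move en_arfs.o into the CONFIG_MLX5_CORE_EN object list and #ifdef the file contents on CONFIG_RFS_ACCEL.

```make
# Sketch: only build en_arfs.o when the mlx5 Ethernet part is enabled,
# so CONFIG_RFS_ACCEL=y with CONFIG_MLX5_CORE_EN=n no longer pulls in
# an object whose dependencies are not built.
ifeq ($(CONFIG_MLX5_CORE_EN),y)
mlx5_core-$(CONFIG_RFS_ACCEL) += en_arfs.o
endif
```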
[PATCH] macb: fix mdiobus_scan() error check
Now mdiobus_scan() returns ERR_PTR(-ENODEV) instead of NULL if the PHY device ID was read as all ones. As this was not an error before, this value should be filtered out now in this driver.

Fixes: b74766a0a0fe ("phylib: don't return NULL from get_phy_device()")
Signed-off-by: Sergei Shtylyov

---
The patch is against DaveM's 'net-next.git' repo.

 drivers/net/ethernet/cadence/macb.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: net-next/drivers/net/ethernet/cadence/macb.c
===================================================================
--- net-next.orig/drivers/net/ethernet/cadence/macb.c
+++ net-next/drivers/net/ethernet/cadence/macb.c
@@ -458,7 +458,8 @@ static int macb_mii_init(struct macb *bp
 		struct phy_device *phydev;
 
 		phydev = mdiobus_scan(bp->mii_bus, i);
-		if (IS_ERR(phydev)) {
+		if (IS_ERR(phydev) &&
+		    PTR_ERR(phydev) != -ENODEV) {
 			err = PTR_ERR(phydev);
 			break;
 		}
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:
> On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear wrote:
>>> Good point, so if you had:
>>>
>>> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <->
>>> userspace-stub <-> eth1
>>>
>>> and user-space hub enabled this elide flag, things would work, right?
>>> Then, it seems like what we need is a way to tell the kernel
>>> router/bridge logic to follow elide signals in packets coming from
>>> veth. I'm not sure what the best way to do this is because I'm less
>>> familiar with conventions in that part of the kernel, but assuming
>>> there's a way to do this, would it be acceptable?
>>
>> You cannot receive on one veth without transmitting on the other, so
>> I think the elide csum logic can go on the raw-socket, and apply to
>> packets in the transmit-from-user-space direction. Just allowing the
>> socket to make the veth behave like it used to before this patch in
>> question should be good enough, since that worked for us for years.
>> So, just an option to modify the ip_summed for pkts sent on a socket
>> is probably sufficient.
>
> I don't think this is right. Consider:
>
> - App A sends out corrupt packets 50% of the time and discards inbound
>   data.
> - App B doesn't care about corrupt packets and is happy to receive them
>   and has some way of dealing with them (special case)
> - App C is a regular app, say nc or something.
>
> In your world, where A decides what happens to data it transmits, then
> A <--veth--> B and A <--wire--> B will have the same behaviour, but
> A <--veth--> C and A <--wire--> C will have _different_ behaviour:
> C will behave incorrectly if it's connected over veth but correctly if
> connected with a wire. That is a bug. Since A cannot know what the app
> it's talking to will desire, I argue that both sides of a message must
> be opted in to this optimization.

How can you make a generic app C know how to do this? The path could be, for instance:

eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> vethD <-> appC

There are no sockets on vethB, but it does need to have special behaviour to elide csums. Even if appC is hacked to know how to twiddle some thing on its veth port, mucking with vethD will have no effect on vethB.

With regard to your example above, why would A corrupt packets? My guess:

1) It has bugs (so, fix the bugs; it could equally create incorrect data with proper checksums, so just enabling checksumming adds no useful protection.)

2) It means to corrupt frames. In that case, someone must expect that C should receive incorrect frames, otherwise why bother making App-A corrupt them in the first place?

3) You are explicitly trying to test the kernel checksum logic, so you want the kernel to detect the bad checksum and throw away the packet. In this case, just don't set the socket option in appA to elide checksums and the packet will be thrown away.

Any other cases you can think of?

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 1:59 PM, Ben Greear wrote:
> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>> We've put considerable effort into cleaning up the checksum interface
>> to make it as unambiguous as possible, please be very careful to
>> follow it. Broken checksum processing is really hard to detect and
>> debug.
>>
>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
>> (indicated by csum_level) have been verified to be correct in a
>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
>> checksum it would refer to has not been verified and is incorrect,
>> this is a major bug.
>
> Suppose I know that the packet received on a packet-socket has
> already been verified by a NIC that supports hardware checksumming.
>
> Then, I want to transmit it on a veth interface using a second
> packet socket. I do not want veth to recalculate the checksum on
> transmit, nor to validate it on the peer veth on receive, because I do
> not want to waste the CPU cycles. I am assuming that my app is not
> accidentally corrupting frames, so the checksum can never be bad.
>
> How should the checksumming be configured for the packets going into
> the packet-socket from user-space?

Checksum-unnecessary conversion to checksum-complete should be done, and then the computed value can be sent to user space. If you want to take advantage of the value on transmit then we would need to change the interface. But I'm not sure why recalculating the checksum in the host is even a problem. With all the advances in checksum offload, LCO, and RCO, it seems like that shouldn't be a common case anyway.

> Also, I might want to send raw frames that do have
> broken checksums (let's assume a real NIC, not veth), and I want them
> to hit the wire with those bad checksums.
>
> How do I configure the checksumming in this case?

Just send raw packets with bad checksums (no checksum offload).

Tom
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear wrote:
>> Good point, so if you had:
>>
>> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <->
>> userspace-stub <-> eth1
>>
>> and user-space hub enabled this elide flag, things would work, right?
>> Then, it seems like what we need is a way to tell the kernel
>> router/bridge logic to follow elide signals in packets coming from
>> veth. I'm not sure what the best way to do this is because I'm less
>> familiar with conventions in that part of the kernel, but assuming
>> there's a way to do this, would it be acceptable?
>
> You cannot receive on one veth without transmitting on the other, so
> I think the elide csum logic can go on the raw-socket, and apply to
> packets in the transmit-from-user-space direction. Just allowing the
> socket to make the veth behave like it used to before this patch in
> question should be good enough, since that worked for us for years.
> So, just an option to modify the ip_summed for pkts sent on a socket
> is probably sufficient.

I don't think this is right. Consider:

- App A sends out corrupt packets 50% of the time and discards inbound data.
- App B doesn't care about corrupt packets and is happy to receive them and has some way of dealing with them (special case)
- App C is a regular app, say nc or something.

In your world, where A decides what happens to data it transmits, then A <--veth--> B and A <--wire--> B will have the same behaviour, but A <--veth--> C and A <--wire--> C will have _different_ behaviour: C will behave incorrectly if it's connected over veth but correctly if connected with a wire. That is a bug. Since A cannot know what the app it's talking to will desire, I argue that both sides of a message must be opted in to this optimization.

>>> There may be no sockets on the vethB port. And reader/writer is not
>>> a good way to look at it since I am implementing a bi-directional
>>> bridge in user-space and each packet-socket is for both rx and tx.
>>
>> Sure, but we could model a bidirectional connection as two
>> unidirectional sockets for our discussions here, right?
>
> Best not to I think, you want to make sure that one socket can
> correctly handle tx and rx. As long as that works, then using
> uni-directional sockets should work too.
>
> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 02:36 PM, Vijay Pandurangan wrote:
> On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear wrote:
>> On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:
>>> [...]
>>>
>>> It seems like only the receiver should decide whether or not to
>>> checksum packets on the veth, not the sender.
>>>
>>> How about:
>>>
>>> We could add a receiving socket option for "don't checksum packets
>>> received from a veth when the other side has marked them as
>>> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
>>> socket option for "mark all data sent via this socket to a veth as
>>> elide-checksum-suggested".
>>>
>>> So the process would be:
>>>
>>> Writer:
>>> 1. open read socket
>>> 2. open write socket, with option elide-checksum-for-veth-suggested
>>> 3. write data
>>>
>>> Reader:
>>> 1. open read socket with "follow-elide-checksum-suggestions-on-veth"
>>> 2. read data
>>>
>>> The kernel / module would then need to persist the flag on all packets
>>> that traverse a veth, and drop these data when they leave the veth
>>> module.
>>
>> I'm not sure this works completely. In my app, the packet flow might be:
>>
>> eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <->
>> vethB <-> [kernel router/bridge logic ...] <-> eth1
>
> Good point, so if you had:
>
> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <->
> userspace-stub <-> eth1
>
> and user-space hub enabled this elide flag, things would work, right?
> Then, it seems like what we need is a way to tell the kernel
> router/bridge logic to follow elide signals in packets coming from
> veth. I'm not sure what the best way to do this is because I'm less
> familiar with conventions in that part of the kernel, but assuming
> there's a way to do this, would it be acceptable?

You cannot receive on one veth without transmitting on the other, so I think the elide csum logic can go on the raw-socket, and apply to packets in the transmit-from-user-space direction. Just allowing the socket to make the veth behave like it used to before this patch in question should be good enough, since that worked for us for years. So, just an option to modify the ip_summed for pkts sent on a socket is probably sufficient.

>> There may be no sockets on the vethB port. And reader/writer is not
>> a good way to look at it since I am implementing a bi-directional
>> bridge in user-space and each packet-socket is for both rx and tx.
>
> Sure, but we could model a bidirectional connection as two
> unidirectional sockets for our discussions here, right?

Best not to I think, you want to make sure that one socket can correctly handle tx and rx. As long as that works, then using uni-directional sockets should work too.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: New DSA design for cross-chip operations
On Fri, Apr 29, 2016 at 05:33:44PM -0400, Vivien Didelot wrote:
> Hi,
>
> Here's a proposal for a new design of DSA. It is meant to discuss the
> actual state and future implementation of the D (for distributed) in
> DSA, which consists of supporting several interconnected switch chips,
> exposed to the user as one big switch unit.
>
> There's still a lot of work to finish in DSA before running into
> cross-chip operations, but it'll help to start thinking about it now.

Hi Vivien

One thing we should keep in mind: very few switch chips actually support the D in DSA. In fact, at the moment, I don't know of any other than Marvell. The Broadcom SF2 does not. The Microchip LAN935X doesn't. The Micrel KS8995 does not. Do any of the switches supported by OpenWRT via swconfig?

We should try to limit adding complexity to the DSA core, and to the API of the drivers, which the majority of drivers will never need. I would make use of the layering we already have. For the Marvell devices, add a cross-chip layer between the Marvell drivers and the DSA core. And even then, we might want to keep this layer optional for Marvell devices. There are only a few niche boards with multiple switch chips. There are many more WiFi access points and cable modems, etc., with just a single switch.

Andrew
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear wrote:
> On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:
>> On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear wrote:
>>> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>>>> [...]
>>>
>>> Suppose I know that the packet received on a packet-socket has
>>> already been verified by a NIC that supports hardware checksumming.
>>>
>>> Then, I want to transmit it on a veth interface using a second
>>> packet socket. I do not want veth to recalculate the checksum on
>>> transmit, nor to validate it on the peer veth on receive, because I
>>> do not want to waste the CPU cycles. I am assuming that my app is not
>>> accidentally corrupting frames, so the checksum can never be bad.
>>>
>>> How should the checksumming be configured for the packets going into
>>> the packet-socket from user-space?
>>
>> It seems like only the receiver should decide whether or not to
>> checksum packets on the veth, not the sender.
>>
>> How about:
>>
>> We could add a receiving socket option for "don't checksum packets
>> received from a veth when the other side has marked them as
>> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
>> socket option for "mark all data sent via this socket to a veth as
>> elide-checksum-suggested".
>>
>> So the process would be:
>>
>> Writer:
>> 1. open read socket
>> 2. open write socket, with option elide-checksum-for-veth-suggested
>> 3. write data
>>
>> Reader:
>> 1. open read socket with "follow-elide-checksum-suggestions-on-veth"
>> 2. read data
>>
>> The kernel / module would then need to persist the flag on all packets
>> that traverse a veth, and drop these data when they leave the veth
>> module.
>
> I'm not sure this works completely. In my app, the packet flow might be:
>
> eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <->
> vethB <-> [kernel router/bridge logic ...] <-> eth1

Good point, so if you had:

eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <-> userspace-stub <-> eth1

and user-space hub enabled this elide flag, things would work, right? Then, it seems like what we need is a way to tell the kernel router/bridge logic to follow elide signals in packets coming from veth. I'm not sure what the best way to do this is because I'm less familiar with conventions in that part of the kernel, but assuming there's a way to do this, would it be acceptable?

> There may be no sockets on the vethB port. And reader/writer is not
> a good way to look at it since I am implementing a bi-directional
> bridge in user-space and each packet-socket is for both rx and tx.

Sure, but we could model a bidirectional connection as two unidirectional sockets for our discussions here, right?

>>> Also, I might want to send raw frames that do have
>>> broken checksums (let's assume a real NIC, not veth), and I want them
>>> to hit the wire with those bad checksums.
>>>
>>> How do I configure the checksumming in this case?
>>
>> Correct me if I'm wrong, but I think this is already possible now. You
>> can have packets with incorrect checksum hitting the wire as is. What
>> you cannot do is instruct the receiving end to ignore the checksum
>> from the sending end when using a physical device (and something I
>> think we should mimic on the sending device).
>
> Yes, it does work currently (or, last I checked)... I just want to
> make sure it keeps working.

Yeah, good point. It would be great if we could write a test, or at the very least, a list of invariants about what kinds of things should work somewhere.

> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:
> On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear wrote:
>> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>>> [...]
>>
>> Suppose I know that the packet received on a packet-socket has
>> already been verified by a NIC that supports hardware checksumming.
>>
>> Then, I want to transmit it on a veth interface using a second
>> packet socket. I do not want veth to recalculate the checksum on
>> transmit, nor to validate it on the peer veth on receive, because I
>> do not want to waste the CPU cycles. I am assuming that my app is not
>> accidentally corrupting frames, so the checksum can never be bad.
>>
>> How should the checksumming be configured for the packets going into
>> the packet-socket from user-space?
>
> It seems like only the receiver should decide whether or not to
> checksum packets on the veth, not the sender.
>
> How about:
>
> We could add a receiving socket option for "don't checksum packets
> received from a veth when the other side has marked them as
> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
> socket option for "mark all data sent via this socket to a veth as
> elide-checksum-suggested".
>
> So the process would be:
>
> Writer:
> 1. open read socket
> 2. open write socket, with option elide-checksum-for-veth-suggested
> 3. write data
>
> Reader:
> 1. open read socket with "follow-elide-checksum-suggestions-on-veth"
> 2. read data
>
> The kernel / module would then need to persist the flag on all packets
> that traverse a veth, and drop these data when they leave the veth
> module.

I'm not sure this works completely. In my app, the packet flow might be:

eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB <-> [kernel router/bridge logic ...] <-> eth1

There may be no sockets on the vethB port. And reader/writer is not a good way to look at it since I am implementing a bi-directional bridge in user-space and each packet-socket is for both rx and tx.

>> Also, I might want to send raw frames that do have
>> broken checksums (let's assume a real NIC, not veth), and I want them
>> to hit the wire with those bad checksums.
>>
>> How do I configure the checksumming in this case?
>
> Correct me if I'm wrong, but I think this is already possible now. You
> can have packets with incorrect checksum hitting the wire as is. What
> you cannot do is instruct the receiving end to ignore the checksum
> from the sending end when using a physical device (and something I
> think we should mimic on the sending device).

Yes, it does work currently (or, last I checked)... I just want to make sure it keeps working.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear wrote:
> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>> We've put considerable effort into cleaning up the checksum interface
>> to make it as unambiguous as possible, please be very careful to
>> follow it. Broken checksum processing is really hard to detect and
>> debug.
>>
>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
>> (indicated by csum_level) have been verified to be correct in a
>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
>> checksum it would refer to has not been verified and is incorrect,
>> this is a major bug.
>
> Suppose I know that the packet received on a packet-socket has
> already been verified by a NIC that supports hardware checksumming.
>
> Then, I want to transmit it on a veth interface using a second
> packet socket. I do not want veth to recalculate the checksum on
> transmit, nor to validate it on the peer veth on receive, because I do
> not want to waste the CPU cycles. I am assuming that my app is not
> accidentally corrupting frames, so the checksum can never be bad.
>
> How should the checksumming be configured for the packets going into
> the packet-socket from user-space?

It seems like only the receiver should decide whether or not to checksum packets on the veth, not the sender.

How about:

We could add a receiving socket option for "don't checksum packets received from a veth when the other side has marked them as elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending socket option for "mark all data sent via this socket to a veth as elide-checksum-suggested".

So the process would be:

Writer:
1. open read socket
2. open write socket, with option elide-checksum-for-veth-suggested
3. write data

Reader:
1. open read socket with "follow-elide-checksum-suggestions-on-veth"
2. read data

The kernel / module would then need to persist the flag on all packets that traverse a veth, and drop these data when they leave the veth module.

> Also, I might want to send raw frames that do have
> broken checksums (let's assume a real NIC, not veth), and I want them
> to hit the wire with those bad checksums.
>
> How do I configure the checksumming in this case?

Correct me if I'm wrong, but I think this is already possible now. You can have packets with incorrect checksum hitting the wire as is. What you cannot do is instruct the receiving end to ignore the checksum from the sending end when using a physical device (and something I think we should mimic on the sending device).
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 12:54 PM, Tom Herbert wrote:
> We've put considerable effort into cleaning up the checksum interface
> to make it as unambiguous as possible, please be very careful to
> follow it. Broken checksum processing is really hard to detect and
> debug.
>
> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
> (indicated by csum_level) have been verified to be correct in a
> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
> checksum it would refer to has not been verified and is incorrect,
> this is a major bug.

Suppose I know that the packet received on a packet-socket has already been verified by a NIC that supports hardware checksumming.

Then, I want to transmit it on a veth interface using a second packet socket. I do not want veth to recalculate the checksum on transmit, nor to validate it on the peer veth on receive, because I do not want to waste the CPU cycles. I am assuming that my app is not accidentally corrupting frames, so the checksum can never be bad.

How should the checksumming be configured for the packets going into the packet-socket from user-space?

Also, I might want to send raw frames that do have broken checksums (let's assume a real NIC, not veth), and I want them to hit the wire with those bad checksums.

How do I configure the checksumming in this case?

Thanks,
Ben

> Tom
>
> On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear wrote:
>> On 04/30/2016 11:33 AM, Ben Hutchings wrote:
>>> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
>>>> Hello,
>>>>
>>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>>>>
>>>> Actually, no, this is not really a regression. [...]
>>>
>>> It really is. Even though the old behaviour was a bug (raw packets
>>> should not be changed), if there are real applications that depend
>>> on that then we have to keep those applications working somehow.
>>
>> To be honest, I fail to see why the old behaviour is a bug when
>> sending raw packets from user-space. If raw packets should not be
>> changed, then we need some way to specify what the checksum setting
>> is to begin with; otherwise, user-space has not enough control.
>>
>> A socket option for new programs, and sysctl-configurable defaults
>> for raw sockets for old binary programs, would be sufficient I think.
>>
>> Thanks,
>> Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
[PATCH] pxa168_eth: fix mdiobus_scan() error check
Since mdiobus_scan() returns either an error code or NULL on error, the driver should check for both, not only for NULL, otherwise a crash is imminent...

Reported-by: Arnd Bergmann
Signed-off-by: Sergei Shtylyov

---
The patch is against DaveM's 'net.git' repo.

 drivers/net/ethernet/marvell/pxa168_eth.c | 2 ++
 1 file changed, 2 insertions(+)

Index: net/drivers/net/ethernet/marvell/pxa168_eth.c
===================================================================
--- net.orig/drivers/net/ethernet/marvell/pxa168_eth.c
+++ net/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -979,6 +979,8 @@ static int pxa168_init_phy(struct net_de
 		return 0;
 
 	pep->phy = mdiobus_scan(pep->smi_bus, pep->phy_addr);
+	if (IS_ERR(pep->phy))
+		return PTR_ERR(pep->phy);
 	if (!pep->phy)
 		return -ENODEV;
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
[oops – resending this because I was using gmail in HTML mode before by accident]

There was a discussion on a separate thread about this. I agree with Sabrina fully. I believe veth should provide an abstraction layer that correctly emulates a physical network in all ways.

Consider an environment where we have multiple physical computers. Each computer runs one or more containers, each of which has a publicly routable ip address. When adding a new app to the cluster, a scheduler might decide to run this container on any physical machine of its choice, assuming that apps have a way of routing traffic to their backends (we did something similar at Google >10 years ago). This is something we might imagine happening with docker and ipv6, for instance.

If you have an app, A, which sends raw ethernet frames (the simplest case I could imagine) with TCP data that has invalid checksums to app B, which is receiving them, the behaviour of the system _will be different_ depending upon whether app B is scheduled to run on the same machine as app A or not. This seems like a clear bug and a broken abstraction (especially as the default case), and something we should endeavour to avoid.

I do see Ben's point about enabling sw checksum verification as potentially incurring a huge performance penalty (I haven't had a chance to measure it) that is completely wasteful in the vast majority of cases. Unfortunately I just don't see how we can solve this problem in a way that preserves a correct abstraction layer while also avoiding excess work.

I guess the first piece of data that would be helpful is to determine just how big of a performance penalty this is. If it's small, then maybe it doesn't matter.
On Thu, Apr 28, 2016 at 6:29 AM, Sabrina Dubroca wrote:
> Hello,
>
> 2016-04-27, 17:14:44 -0700, Ben Greear wrote:
>> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote:
>>> Hi Ben,
>>>
>>> On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote:
>>>> On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote:
>>>>> On 04/26/2016 04:02 PM, Ben Hutchings wrote:
>>>>>>
>>>>>> 3.2.80-rc1 review patch. If anyone has any objections, please let
>>>>>> me know.
>>>>> I would be careful about this. It causes regressions when sending
>>>>> PACKET_SOCKET buffers from user-space to veth devices.
>>>>>
>>>>> There was a proposed upstream fix for the regression, but it has not
>>>>> gone into the tree as far as I know.
>>>>>
>>>>> http://www.spinics.net/lists/netdev/msg370436.html
>>>> [...]
>>>>
>>>> OK, I'll drop this for now.
>>>
>>> The fall out from not having this patch is in my opinion a bigger
>>> fallout than not having this patch. This patch fixes silent data
>>> corruption vs. the problem Ben Greear is talking about, which might not
>>> be that a common usage.
>>>
>>> What do others think?
>>>
>>> Bye,
>>> Hannes
>>
>> This patch from Cong Wang seems to fix the regression for me, I think it
>> should be added and tested in the main tree, and then apply them to
>> stable as a pair.
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>
> Actually, no, this is not really a regression.
>
> If you capture packets on a device with checksum offloading enabled,
> the TCP/UDP checksum isn't filled. veth also behaves that way. What
> the "veth: don't modify ip_summed" patch does is enable proper
> checksum validation on veth. This really was a bug in veth.
>
> Cong's patch would also break cases where we choose to inject packets
> with invalid checksums, and they would now be accepted as correct.
> Your use case is invalid, it just happened to work because of a
> bug. If you want the stack to fill checksums so that you can capture
> and reinject packets, you have to disable checksum offloading (or
> compute the checksum yourself in userspace).
>
> Thanks.
>
> --
> Sabrina
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
We've put considerable effort into cleaning up the checksum interface to make it as unambiguous as possible, please be very careful to follow it. Broken checksum processing is really hard to detect and debug.

CHECKSUM_UNNECESSARY means that some number of _specific_ checksums (indicated by csum_level) have been verified to be correct in a packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is never right. If CHECKSUM_UNNECESSARY is set in such a manner but the checksum it would refer to has not been verified and is incorrect this is a major bug.

Tom

On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear wrote:
> On 04/30/2016 11:33 AM, Ben Hutchings wrote:
>> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
>>> Hello,
>>>
>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>>>
>>> Actually, no, this is not really a regression.
>> [...]
>>
>> It really is. Even though the old behaviour was a bug (raw packets
>> should not be changed), if there are real applications that depend on
>> that then we have to keep those applications working somehow.
>
> To be honest, I fail to see why the old behaviour is a bug when sending
> raw packets from user-space. If raw packets should not be changed, then
> we need some way to specify what the checksum setting is to begin with,
> otherwise, user-space has not enough control.
>
> A socket option for new programs, and sysctl configurable defaults for
> raw sockets for old binary programs would be sufficient I think.
>
> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
Re: __napi_alloc_skb failures locking up the box
On Sat, 2016-04-30 at 22:24 +0300, Aaro Koskinen wrote:
> Hi,
>
> I have old NAS box (Thecus N2100) with 512 MB RAM, where rsync from
> NFS -> disk reliably results in temporary out-of-memory conditions.
>
> When this happens the dmesg gets flooded with below logs. If the serial
> console logging is enabled, this will lock up the box completely and
> the backup is not making any progress.
>
> Shouldn't these allocation failures be ratelimited somehow (or even made
> silent)? It doesn't sound right if I can lock up the system simply by
> copying files...

Agreed.

All napi_alloc_skb() callers handle failure just fine. If they did not, a NULL deref would produce a proper stack dump.

When memory gets this tight, other traces will be dumped anyway.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 15d0df943466..0652709fe81a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2423,7 +2423,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi,
 static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 					     unsigned int length)
 {
-	return __napi_alloc_skb(napi, length, GFP_ATOMIC);
+	return __napi_alloc_skb(napi, length, GFP_ATOMIC | __GFP_NOWARN);
 }
 
 void napi_consume_skb(struct sk_buff *skb, int budget);
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 11:33 AM, Ben Hutchings wrote:
> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
>> Hello,
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>>
>> Actually, no, this is not really a regression.
> [...]
>
> It really is. Even though the old behaviour was a bug (raw packets
> should not be changed), if there are real applications that depend on
> that then we have to keep those applications working somehow.

To be honest, I fail to see why the old behaviour is a bug when sending raw packets from user-space. If raw packets should not be changed, then we need some way to specify what the checksum setting is to begin with, otherwise, user-space has not enough control.

A socket option for new programs, and sysctl configurable defaults for raw sockets for old binary programs would be sufficient I think.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
__napi_alloc_skb failures locking up the box
Hi,

I have old NAS box (Thecus N2100) with 512 MB RAM, where rsync from NFS -> disk reliably results in temporary out-of-memory conditions.

When this happens the dmesg gets flooded with below logs. If the serial console logging is enabled, this will lock up the box completely and the backup is not making any progress.

Shouldn't these allocation failures be ratelimited somehow (or even made silent)? It doesn't sound right if I can lock up the system simply by copying files...

...
[ 1706.105842] kworker/0:1H: page allocation failure: order:0, mode:0x2200020(GFP_NOWAIT|__GFP_HIGH|__GFP_NOTRACK)
[ 1706.105904] CPU: 0 PID: 519 Comm: kworker/0:1H Not tainted 4.6.0-rc5-iop32x-los_a50bb #1
[ 1706.105917] Hardware name: Thecus N2100
[ 1706.105973] Workqueue: rpciod rpc_async_schedule
[ 1706.105993] Backtrace:
[ 1706.106037] [] (dump_backtrace) from [] (show_stack+0x18/0x1c)
[ 1706.106068] r7: r6:0060 r5: r4:
[ 1706.106142] [] (show_stack) from [] (dump_stack+0x20/0x28)
[ 1706.106185] [] (dump_stack) from [] (warn_alloc_failed+0xf0/0x134)
[ 1706.106216] [] (warn_alloc_failed) from [] (__alloc_pages_nodemask+0x284/0x8fc)
[ 1706.106233] r3:02200020 r2:
[ 1706.106269] r5: r4:c0524c40
[ 1706.106336] [] (__alloc_pages_nodemask) from [] (new_slab+0x3c4/0x430)
[ 1706.106352] r10: r9:c0321fdc r8: r7:df401d00 r6:0015 r5:0220
[ 1706.106422] r4:df401d00
[ 1706.106460] [] (new_slab) from [] (___slab_alloc.constprop.8+0x238/0x298)
[ 1706.106477] r10: r9:c0321fdc r8:02080020 r7:df401d00 r6:dfbef060 r5:
[ 1706.106547] r4:
[ 1706.106581] [] (___slab_alloc.constprop.8) from [] (kmem_cache_alloc+0xbc/0xf8)
[ 1706.106597] r10:00167e93 r9:340285ee r8:e11e1920 r7:6013 r6:02080020 r5:df401d00
[ 1706.106668] r4:
[ 1706.106707] [] (kmem_cache_alloc) from [] (__build_skb+0x2c/0x98)
[ 1706.106723] r7:cc592240 r6:06e0 r5:de3523f0 r4:06e0
[ 1706.106784] [] (__build_skb) from [] (__napi_alloc_skb+0xb0/0xfc)
[ 1706.106801] r9:340285ee r8:e11e1920 r7:cc592240 r6:c0520ab8 r5:de3523f0 r4:06e0
[ 1706.106903] [] (__napi_alloc_skb) from [] (rtl8169_poll+0x3a0/0x588)
[ 1706.106920] r7:de364000 r6:c000fd08 r5:05ea r4:de3523f0
[ 1706.106986] [] (rtl8169_poll) from [] (net_rx_action+0x1cc/0x2ec)
[ 1706.107002] r10:00022548 r9:df471ba8 r8:c05257e0 r7:012c r6:0040 r5:0001
[ 1706.107073] r4:de3523f0
[ 1706.107131] [] (net_rx_action) from [] (__do_softirq+0xf4/0x254)
[ 1706.107147] r10:0101 r9:c052618c r8:4001 r7:c0526188 r6:df47 r5:0003
[ 1706.107218] r4:
[ 1706.107255] [] (__do_softirq) from [] (do_softirq.part.2+0x34/0x40)
[ 1706.107271] r10:0001 r9:c0525a26 r8:c0c0 r7: r6:deae8a80 r5:
[ 1706.107341] r4:2013
[ 1706.107380] [] (do_softirq.part.2) from [] (__local_bh_enable_ip+0xac/0xcc)
[ 1706.107396] r5: r4:0200
[ 1706.107459] [] (__local_bh_enable_ip) from [] (release_sock+0x12c/0x158)
[ 1706.107477] r5: r4:
[ 1706.107538] [] (release_sock) from [] (tcp_sendmsg+0xf8/0xa90)
[ 1706.107557] r10:4040 r9:deae8a80 r8:df471d74 r7:88cd7146 r6:059c r5:de8ccb00
[ 1706.107629] r4:007c r3:0001
[ 1706.107678] [] (tcp_sendmsg) from [] (inet_sendmsg+0x3c/0x74)
[ 1706.107695] r10:df471e2c r9: r8:df7eb604 r7: r6: r5:df02c780
[ 1706.107766] r4:deae8a80
[ 1706.107800] [] (inet_sendmsg) from [] (sock_sendmsg+0x1c/0x30)
[ 1706.107816] r4:df471d74
[ 1706.107849] [] (sock_sendmsg) from [] (kernel_sendmsg+0x38/0x40)
[ 1706.107873] [] (kernel_sendmsg) from [] (xs_send_kvec+0x94/0x9c)
[ 1706.107891] r5: r4:df02c780
[ 1706.107934] [] (xs_send_kvec) from [] (xs_sendpages+0x6c/0x244)
[ 1706.107950] r9: r8:df02c780 r7:007c r6:df7eb604 r5:0001 r4:
[ 1706.108026] [] (xs_sendpages) from [] (xs_tcp_send_request+0x80/0x134)
[ 1706.108043] r10: r9: r8:de8e8000 r7:de0d7258 r6:df7eb604 r5:0001
[ 1706.108114] r4:df7eb600
[ 1706.108172] [] (xs_tcp_send_request) from [] (xprt_transmit+0x58/0x214)
[ 1706.108192] r10:de92cc60 r9: r8:d3697fdd r7:df7eb674 r6:de0d7258 r5:df7eb600
[ 1706.108265] r4:de8e8000
[ 1706.108303] [] (xprt_transmit) from [] (call_transmit+0x18c/0x230)
[ 1706.108321] r7:df7eb600 r6:0001 r5:df7eb600 r4:de0d7258
[ 1706.108388] [] (call_transmit) from [] (__rpc_execute+0x54/0x2c4)
[ 1706.108404] r8:c0508940 r7: r6:c03b1f08 r5:0001 r4:de0d7258
[ 1706.108475] [] (__rpc_execute) from [] (rpc_async_schedule+0x14/0x18)
[ 1706.108493] r7: r6:dfbf0800 r5:de92cc60 r4:de0d727c
[ 1706.108564] [] (rpc_async_schedule) from [] (process_one_work+0x130/0x3ec)
[ 1706.108587] [] (process_one_work) from [] (worker_thread+0x50/0x5ac)
[ 1706.108604] r10:de92cc60 r9:c0508940 r8:0008 r7:c050b220 r6:c0508954 r5:de92cc78
[ 1706.108675] r4:c0508940
[ 1706.108718] [] (worker_thread) from
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Thu, 2016-04-28 at 06:45 -0700, Ben Greear wrote:
> On 04/28/2016 03:29 AM, Sabrina Dubroca wrote:
[...]
>> Your use case is invalid, it just happened to work because of a
>> bug. If you want the stack to fill checksums so that you can capture
>> and reinject packets, you have to disable checksum offloading (or
>> compute the checksum yourself in userspace).
>
> Disabling checksum offloading or computing in user-space (and then
> recomputing in veth to verify the checksum?) is a huge performance loss.
>
> Maybe we could add a socket option to enable Cong's patch on a per-socket
> basis? That way my use-case can still work and you can have this new
> behaviour by default?

It does sound like a useful option to have. If there are other applications that depend on veth's checksum-fixing behaviour and are being distributed in binary form, then a per-device option might be necessary so users can keep those applications working before they're updated.

Ben.

--
Ben Hutchings
Tomorrow will be cancelled due to lack of interest.
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
> Hello,
>
> 2016-04-27, 17:14:44 -0700, Ben Greear wrote:
>> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote:
>>> Hi Ben,
>>>
>>> On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote:
>>>> On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote:
>>>>> On 04/26/2016 04:02 PM, Ben Hutchings wrote:
>>>>>>
>>>>>> 3.2.80-rc1 review patch. If anyone has any objections, please let
>>>>>> me know.
>>>>> I would be careful about this. It causes regressions when sending
>>>>> PACKET_SOCKET buffers from user-space to veth devices.
>>>>>
>>>>> There was a proposed upstream fix for the regression, but it has not
>>>>> gone into the tree as far as I know.
>>>>>
>>>>> http://www.spinics.net/lists/netdev/msg370436.html
>>>> [...]
>>>>
>>>> OK, I'll drop this for now.
>>> The fall out from not having this patch is in my opinion a bigger
>>> fallout than not having this patch. This patch fixes silent data
>>> corruption vs. the problem Ben Greear is talking about, which might not
>>> be that a common usage.
>>>
>>> What do others think?
>>>
>>> Bye,
>>> Hannes
>>
>> This patch from Cong Wang seems to fix the regression for me, I think it
>> should be added and tested in the main tree, and then apply them to
>> stable as a pair.
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>
> Actually, no, this is not really a regression.
[...]

It really is. Even though the old behaviour was a bug (raw packets should not be changed), if there are real applications that depend on that then we have to keep those applications working somehow.

Ben.

--
Ben Hutchings
Tomorrow will be cancelled due to lack of interest.
[PATCH net-next] net: relax expensive skb_unclone() in iptunnel_handle_offloads()
From: Eric Dumazet

Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work. Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet travels in the stack.

This adds skb_header_unclone() which is a variant of skb_unclone() using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.

Signed-off-by: Eric Dumazet
---
 include/linux/skbuff.h    | 10 ++++++++++
 net/ipv4/ip_tunnel_core.c |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c84a5a1078c5528bf6fc84573f63f3c6f470ce8f..c413c588a24f854be9e4df78d8a6872b6b1ff9f3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1325,6 +1325,16 @@ static inline int skb_header_cloned(const struct sk_buff *skb)
 	return dataref != 1;
 }
 
+static inline int skb_header_unclone(struct sk_buff *skb, gfp_t pri)
+{
+	might_sleep_if(gfpflags_allow_blocking(pri));
+
+	if (skb_header_cloned(skb))
+		return pskb_expand_head(skb, 0, 0, pri);
+
+	return 0;
+}
+
 /**
  *	skb_header_release - release reference to header
  *	@skb: buffer to operate on
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 786fa7ca28e0db00865cdf9fab472836866b7d6d..9118b0e640ba3a96850cf28aa8caab4e615254d9 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -157,7 +157,7 @@ int iptunnel_handle_offloads(struct sk_buff *skb,
 	}
 
 	if (skb_is_gso(skb)) {
-		err = skb_unclone(skb, GFP_ATOMIC);
+		err = skb_header_unclone(skb, GFP_ATOMIC);
 		if (unlikely(err))
 			return err;
 		skb_shinfo(skb)->gso_type |= gso_type_mask;
[PATCH 1/4] batman-adv: fix DAT candidate selection (must use vid)
Now that DAT is VLAN aware, it must use the VID when computing the DHT address of the candidate nodes where an entry is going to be stored/retrieved.

Fixes: be1db4f6615b ("batman-adv: make the Distributed ARP Table vlan aware")
Signed-off-by: Antonio Quartulli
[s...@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
---
 net/batman-adv/distributed-arp-table.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/batman-adv/distributed-arp-table.c b/net/batman-adv/distributed-arp-table.c
index e96d7c745b4a..3e6b2624f980 100644
--- a/net/batman-adv/distributed-arp-table.c
+++ b/net/batman-adv/distributed-arp-table.c
@@ -568,6 +568,7 @@ static void batadv_choose_next_candidate(struct batadv_priv *bat_priv,
  *  be sent to
  * @bat_priv: the bat priv with all the soft interface information
  * @ip_dst: ipv4 to look up in the DHT
+ * @vid: VLAN identifier
  *
  * An originator O is selected if and only if its DHT_ID value is one of three
  * closest values (from the LEFT, with wrap around if needed) then the hash
@@ -576,7 +577,8 @@ static void batadv_choose_next_candidate(struct batadv_priv *bat_priv,
  * Return: the candidate array of size BATADV_DAT_CANDIDATE_NUM.
  */
 static struct batadv_dat_candidate *
-batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
+batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst,
+			     unsigned short vid)
 {
 	int select;
 	batadv_dat_addr_t last_max = BATADV_DAT_ADDR_MAX, ip_key;
@@ -592,7 +594,7 @@ batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
 		return NULL;
 
 	dat.ip = ip_dst;
-	dat.vid = 0;
+	dat.vid = vid;
 	ip_key = (batadv_dat_addr_t)batadv_hash_dat(&dat, BATADV_DAT_ADDR_MAX);
@@ -612,6 +614,7 @@ batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
  * @bat_priv: the bat priv with all the soft interface information
  * @skb: payload to send
  * @ip: the DHT key
+ * @vid: VLAN identifier
  * @packet_subtype: unicast4addr packet subtype to use
  *
  * This function copies the skb with pskb_copy() and is sent as unicast packet
@@ -622,7 +625,7 @@ batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
  */
 static bool batadv_dat_send_data(struct batadv_priv *bat_priv,
 				 struct sk_buff *skb, __be32 ip,
-				 int packet_subtype)
+				 unsigned short vid, int packet_subtype)
 {
 	int i;
 	bool ret = false;
@@ -631,7 +634,7 @@ static bool batadv_dat_send_data(struct batadv_priv *bat_priv,
 	struct sk_buff *tmp_skb;
 	struct batadv_dat_candidate *cand;
 
-	cand = batadv_dat_select_candidates(bat_priv, ip);
+	cand = batadv_dat_select_candidates(bat_priv, ip, vid);
 	if (!cand)
 		goto out;
@@ -1022,7 +1025,7 @@ bool batadv_dat_snoop_outgoing_arp_request(struct batadv_priv *bat_priv,
 		ret = true;
 	} else {
 		/* Send the request to the DHT */
-		ret = batadv_dat_send_data(bat_priv, skb, ip_dst,
+		ret = batadv_dat_send_data(bat_priv, skb, ip_dst, vid,
 					   BATADV_P_DAT_DHT_GET);
 	}
 out:
@@ -1150,8 +1153,8 @@ void batadv_dat_snoop_outgoing_arp_reply(struct batadv_priv *bat_priv,
 	/* Send the ARP reply to the candidates for both the IP addresses that
 	 * the node obtained from the ARP reply
 	 */
-	batadv_dat_send_data(bat_priv, skb, ip_src, BATADV_P_DAT_DHT_PUT);
-	batadv_dat_send_data(bat_priv, skb, ip_dst, BATADV_P_DAT_DHT_PUT);
+	batadv_dat_send_data(bat_priv, skb, ip_src, vid, BATADV_P_DAT_DHT_PUT);
+	batadv_dat_send_data(bat_priv, skb, ip_dst, vid, BATADV_P_DAT_DHT_PUT);
 }
 
 /**
-- 
2.8.1
[PATCH 4/4] batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node
From: Sven Eckelmann

The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held an implicit reference to it. But this reference was never stored in form of a pointer in the batadv_neigh_node itself. Instead batadv_neigh_node_release depends on a consistent state of hard_iface->neigh_list and that batadv_hardif_neigh_get always returns the batadv_hardif_neigh_node object which it has a reference for. But batadv_hardif_neigh_get cannot guarantee that because it is working only with rcu_read_lock on this list. It can therefore happen that a neigh_addr is in this list twice or that batadv_hardif_neigh_get cannot find the batadv_hardif_neigh_node for a neigh_addr due to some other list operations taking place at the same time.

Instead add a batadv_hardif_neigh_node pointer directly in batadv_neigh_node which will be used for the reference counter decremented on release of batadv_neigh_node.

Fixes: cef63419f7db ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
Signed-off-by: Antonio Quartulli
---
 net/batman-adv/originator.c | 16 +++++-----------
 net/batman-adv/types.h      |  2 ++
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/net/batman-adv/originator.c b/net/batman-adv/originator.c
index d52f67a0c057..c355a824713c 100644
--- a/net/batman-adv/originator.c
+++ b/net/batman-adv/originator.c
@@ -250,7 +250,6 @@ static void batadv_neigh_node_release(struct kref *ref)
 {
 	struct hlist_node *node_tmp;
 	struct batadv_neigh_node *neigh_node;
-	struct batadv_hardif_neigh_node *hardif_neigh;
 	struct batadv_neigh_ifinfo *neigh_ifinfo;
 	struct batadv_algo_ops *bao;
 
@@ -262,13 +261,7 @@ static void batadv_neigh_node_release(struct kref *ref)
 		batadv_neigh_ifinfo_put(neigh_ifinfo);
 	}
 
-	hardif_neigh = batadv_hardif_neigh_get(neigh_node->if_incoming,
-					       neigh_node->addr);
-	if (hardif_neigh) {
-		/* batadv_hardif_neigh_get() increases refcount too */
-		batadv_hardif_neigh_put(hardif_neigh);
-		batadv_hardif_neigh_put(hardif_neigh);
-	}
+	batadv_hardif_neigh_put(neigh_node->hardif_neigh);
 
 	if (bao->bat_neigh_free)
 		bao->bat_neigh_free(neigh_node);
@@ -665,6 +658,10 @@ batadv_neigh_node_new(struct batadv_orig_node *orig_node,
 	neigh_node->orig_node = orig_node;
 	neigh_node->last_seen = jiffies;
 
+	/* increment unique neighbor refcount */
+	kref_get(&hardif_neigh->refcount);
+	neigh_node->hardif_neigh = hardif_neigh;
+
 	/* extra reference for return */
 	kref_init(&neigh_node->refcount);
 	kref_get(&neigh_node->refcount);
@@ -673,9 +670,6 @@ batadv_neigh_node_new(struct batadv_orig_node *orig_node,
 	hlist_add_head_rcu(&neigh_node->list, &orig_node->neigh_list);
 	spin_unlock_bh(&orig_node->neigh_list_lock);
 
-	/* increment unique neighbor refcount */
-	kref_get(&hardif_neigh->refcount);
-
 	batadv_dbg(BATADV_DBG_BATMAN, orig_node->bat_priv,
 		   "Creating new neighbor %pM for orig_node %pM on interface %s\n",
 		   neigh_addr, orig_node->orig, hard_iface->net_dev->name);
diff --git a/net/batman-adv/types.h b/net/batman-adv/types.h
index 65afd090ab3e..1e47fbe8bb7b 100644
--- a/net/batman-adv/types.h
+++ b/net/batman-adv/types.h
@@ -433,6 +433,7 @@ struct batadv_hardif_neigh_node {
  * @ifinfo_lock: lock protecting private ifinfo members and list
  * @if_incoming: pointer to incoming hard-interface
  * @last_seen: when last packet via this neighbor was received
+ * @hardif_neigh: hardif_neigh of this neighbor
  * @refcount: number of contexts the object is used
  * @rcu: struct used for freeing in an RCU-safe manner
  */
@@ -444,6 +445,7 @@ struct batadv_neigh_node {
 	spinlock_t ifinfo_lock;	/* protects ifinfo_list and its members */
 	struct batadv_hard_iface *if_incoming;
 	unsigned long last_seen;
+	struct batadv_hardif_neigh_node *hardif_neigh;
 	struct kref refcount;
 	struct rcu_head rcu;
 };
-- 
2.8.1
[PATCH 2/4] batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event
At the moment there is no explicit reactivation of a hard-interface upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is reactivated as soon as the next OGM is scheduled for sending, but this mechanism does not work with B.A.T.M.A.N. V. The latter does not rely on the same scheduling mechanism as its predecessor and for this reason the hard-interface remains deactivated forever after being brought down once.

This patch fixes the reactivation mechanism by adding a new routing API which explicitly allows each algorithm to perform any needed operation upon interface re-activation. Such API is optional and is implemented by B.A.T.M.A.N. V only, and it just takes care of setting the iface status to ACTIVE.

Signed-off-by: Antonio Quartulli
Signed-off-by: Marek Lindner
---
 net/batman-adv/bat_v.c          | 12 ++++++++++++
 net/batman-adv/hard-interface.c |  3 +++
 net/batman-adv/types.h          |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/net/batman-adv/bat_v.c b/net/batman-adv/bat_v.c
index 3315b9a598af..4026f198a734 100644
--- a/net/batman-adv/bat_v.c
+++ b/net/batman-adv/bat_v.c
@@ -32,10 +32,21 @@
 
 #include "bat_v_elp.h"
 #include "bat_v_ogm.h"
+#include "hard-interface.h"
 #include "hash.h"
 #include "originator.h"
 #include "packet.h"
 
+static void batadv_v_iface_activate(struct batadv_hard_iface *hard_iface)
+{
+	/* B.A.T.M.A.N. V does not use any queuing mechanism, therefore it can
+	 * set the interface as ACTIVE right away, without any risk of race
+	 * condition
+	 */
+	if (hard_iface->if_status == BATADV_IF_TO_BE_ACTIVATED)
+		hard_iface->if_status = BATADV_IF_ACTIVE;
+}
+
 static int batadv_v_iface_enable(struct batadv_hard_iface *hard_iface)
 {
 	int ret;
@@ -274,6 +285,7 @@ static bool batadv_v_neigh_is_sob(struct batadv_neigh_node *neigh1,
 
 static struct batadv_algo_ops batadv_batman_v __read_mostly = {
 	.name = "BATMAN_V",
+	.bat_iface_activate = batadv_v_iface_activate,
 	.bat_iface_enable = batadv_v_iface_enable,
 	.bat_iface_disable = batadv_v_iface_disable,
 	.bat_iface_update_mac = batadv_v_iface_update_mac,
diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index c61d5b0b24d2..0a7deaf2670a 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -407,6 +407,9 @@ batadv_hardif_activate_interface(struct batadv_hard_iface *hard_iface)
 
 	batadv_update_min_mtu(hard_iface->soft_iface);
 
+	if (bat_priv->bat_algo_ops->bat_iface_activate)
+		bat_priv->bat_algo_ops->bat_iface_activate(hard_iface);
+
 out:
 	if (primary_if)
 		batadv_hardif_put(primary_if);
diff --git a/net/batman-adv/types.h b/net/batman-adv/types.h
index 9abfb3e73c34..443e9b84e07d 100644
--- a/net/batman-adv/types.h
+++ b/net/batman-adv/types.h
@@ -1250,6 +1250,8 @@ struct batadv_forw_packet {
  * struct batadv_algo_ops - mesh algorithm callbacks
  * @list: list node for the batadv_algo_list
  * @name: name of the algorithm
+ * @bat_iface_activate: start routing mechanisms when hard-interface is brought
+ *  up
  * @bat_iface_enable: init routing info when hard-interface is enabled
  * @bat_iface_disable: de-init routing info when hard-interface is disabled
  * @bat_iface_update_mac: (re-)init mac addresses of the protocol information
@@ -1277,6 +1279,7 @@ struct batadv_forw_packet {
 struct batadv_algo_ops {
 	struct hlist_node list;
 	char *name;
+	void (*bat_iface_activate)(struct batadv_hard_iface *hard_iface);
 	int (*bat_iface_enable)(struct batadv_hard_iface *hard_iface);
 	void (*bat_iface_disable)(struct batadv_hard_iface *hard_iface);
 	void (*bat_iface_update_mac)(struct batadv_hard_iface *hard_iface);
-- 
2.8.1
pull request [net]: batman-adv 20160430
Hello David,

this is another pull request intended for net. I know that I sent another batch a few days ago, but after having gone through my queue I thought that merging these last 4 patches would still be worth it (there won't be any other pull request for linux-4.6 :)).

The description of the changes follows below. Please pull or let me know of any issue!

Thanks a lot,
Antonio

The following changes since commit 1dfcd832b1a9ed45edac15b31d079b805fa0ae2a:

  Merge branch 'bpf-fixes' (2016-04-28 17:29:46 -0400)

are available in the git repository at:

  git://git.open-mesh.org/linux-merge.git tags/batman-adv-fix-for-davem

for you to fetch changes up to abe59c65225ccd63a5964e2f2a73dd2995b948e7:

  batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node (2016-04-29 19:46:11 +0800)

In this small batch of patches you have:

- a fix for our Distributed ARP Table that makes sure that the input provided to the hash function during a query is the same as the one provided during an insert (so to prevent false negatives), by Antonio Quartulli

- a fix for our new protocol implementation B.A.T.M.A.N. V that ensures that a hard interface is properly re-activated when it is brought down and then up again, by Antonio Quartulli

- two fixes, respectively to the reference counting of the tt_local_entry and neigh_node objects, by Sven Eckelmann. Such bugs are rather severe as they would prevent the netdev objects referenced by batman-adv from being released after shutdown.
Antonio Quartulli (2):
      batman-adv: fix DAT candidate selection (must use vid)
      batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event

Sven Eckelmann (2):
      batman-adv: Fix reference counting of vlan object for tt_local_entry
      batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node

 net/batman-adv/bat_v.c                 | 12 ++
 net/batman-adv/distributed-arp-table.c | 17 --
 net/batman-adv/hard-interface.c        |  3 +++
 net/batman-adv/originator.c            | 16 -
 net/batman-adv/translation-table.c     | 42 --
 net/batman-adv/types.h                 |  7 ++
 6 files changed, 41 insertions(+), 56 deletions(-)
[PATCH 3/4] batman-adv: Fix reference counting of vlan object for tt_local_entry
From: Sven Eckelmann
The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an implicit reference to it. But this reference was never stored in the form of a pointer in the tt_local_entry itself. Instead batadv_tt_local_remove, batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend on a consistent state of bat_priv->softif_vlan_list and that batadv_softif_vlan_get always returns the batadv_softif_vlan object which it has a reference for. But batadv_softif_vlan_get cannot guarantee that because it is working only with rcu_read_lock on this list. It can therefore happen that a vid is in this list twice or that batadv_softif_vlan_get cannot find the batadv_softif_vlan for a vid due to some other list operations taking place at the same time. Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry which will be used for the reference counter decremented on release of batadv_tt_local_entry. Fixes: 35df3b298fc8 ("batman-adv: fix TT VLAN inconsistency on VLAN re-add") Signed-off-by: Sven Eckelmann Acked-by: Antonio Quartulli Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/translation-table.c | 42 -- net/batman-adv/types.h | 2 ++ 2 files changed, 6 insertions(+), 38 deletions(-) diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0b43e86328a5..9b4551a86535 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -215,6 +215,8 @@ static void batadv_tt_local_entry_release(struct kref *ref) tt_local_entry = container_of(ref, struct batadv_tt_local_entry, common.refcount); + batadv_softif_vlan_put(tt_local_entry->vlan); + kfree_rcu(tt_local_entry, common.rcu); } @@ -673,6 +675,7 @@ bool batadv_tt_local_add(struct net_device *soft_iface, const u8 *addr, kref_get(_local->common.refcount); tt_local->last_seen = jiffies; tt_local->common.added_at = tt_local->last_seen; + tt_local->vlan = vlan; /* the batman interface mac
and multicast addresses should never be * purged @@ -991,7 +994,6 @@ int batadv_tt_local_seq_print_text(struct seq_file *seq, void *offset) struct batadv_tt_common_entry *tt_common_entry; struct batadv_tt_local_entry *tt_local; struct batadv_hard_iface *primary_if; - struct batadv_softif_vlan *vlan; struct hlist_head *head; unsigned short vid; u32 i; @@ -1027,14 +1029,6 @@ int batadv_tt_local_seq_print_text(struct seq_file *seq, void *offset) last_seen_msecs = last_seen_msecs % 1000; no_purge = tt_common_entry->flags & np_flag; - - vlan = batadv_softif_vlan_get(bat_priv, vid); - if (!vlan) { - seq_printf(seq, "Cannot retrieve VLAN %d\n", - BATADV_PRINT_VID(vid)); - continue; - } - seq_printf(seq, " * %pM %4i [%c%c%c%c%c%c] %3u.%03u (%#.8x)\n", tt_common_entry->addr, @@ -1052,9 +1046,7 @@ int batadv_tt_local_seq_print_text(struct seq_file *seq, void *offset) BATADV_TT_CLIENT_ISOLA) ? 'I' : '.'), no_purge ? 0 : last_seen_secs, no_purge ? 0 : last_seen_msecs, - vlan->tt.crc); - - batadv_softif_vlan_put(vlan); + tt_local->vlan->tt.crc); } rcu_read_unlock(); } @@ -1099,7 +1091,6 @@ u16 batadv_tt_local_remove(struct batadv_priv *bat_priv, const u8 *addr, { struct batadv_tt_local_entry *tt_local_entry; u16 flags, curr_flags = BATADV_NO_FLAGS; - struct batadv_softif_vlan *vlan; void *tt_entry_exists; tt_local_entry = batadv_tt_local_hash_find(bat_priv, addr, vid); @@ -1139,14 +1130,6 @@ u16 batadv_tt_local_remove(struct batadv_priv *bat_priv, const u8 *addr, /* extra call to free the local tt entry */ batadv_tt_local_entry_put(tt_local_entry); - /* decrease the reference held for this vlan */ - vlan = batadv_softif_vlan_get(bat_priv, vid); - if (!vlan) - goto out; - - batadv_softif_vlan_put(vlan); - batadv_softif_vlan_put(vlan); - out: if (tt_local_entry) batadv_tt_local_entry_put(tt_local_entry); @@ -1219,7 +1202,6 @@ static void batadv_tt_local_table_free(struct batadv_priv *bat_priv) spinlock_t *list_lock; /* protects write access to the hash
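The core of Sven's fix is easy to model outside the kernel: the entry takes its vlan reference at insert time and remembers which object it took it from, so the release callback can drop exactly that reference instead of re-looking the vlan up by vid (which can race with concurrent list changes). A minimal userspace sketch of that pattern follows — the names are illustrative, not the real batman-adv API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative userspace model, not the batman-adv API: the entry keeps
 * an explicit pointer to the vlan it holds a reference on, so release
 * can drop exactly that reference without a racy lookup by vid. */
struct model_vlan {
	int refcount;
};

struct model_entry {
	struct model_vlan *vlan;	/* reference taken at insert time */
};

static void vlan_get(struct model_vlan *v) { v->refcount++; }
static void vlan_put(struct model_vlan *v) { v->refcount--; }

static void entry_init(struct model_entry *e, struct model_vlan *v)
{
	vlan_get(v);		/* the entry now owns one reference */
	e->vlan = v;
}

static void entry_release(struct model_entry *e)
{
	vlan_put(e->vlan);	/* drop the reference we actually hold */
	e->vlan = NULL;
}
```

In the real code the release path runs from kref_put() context; the invariant modelled here is simply that every reference taken in batadv_tt_local_add() is balanced by a put on the stored pointer.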
Re: [PATCH iproute2 net-next] ifstat: move to new RTM_GETSTATS api
On 4/30/16, 3:21 AM, Jamal Hadi Salim wrote: > On 16-04-30 02:41 AM, Roopa Prabhu wrote: >> From: Roopa Prabhu>> >> This patch modifies ifstat to use the new RTM_GETSTATS api >> to query stats from the kernel. In the process this also >> moves ifstat to use 64 bit stats. > > Breaks old kernels? May need to keep backward compat of > RTM_NEWLINK and even new main struct for GETSTATS. Yes, I was wondering about that. v2 coming. If GETSTATS fails, I will fall back to RTM_NEWLINK. Thanks!
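The fallback Roopa describes can be sketched as a simple try-new-then-legacy wrapper. This is a hypothetical userspace model: get_stats_new()/get_stats_legacy() stand in for the real netlink requests, and the exact errno an old kernel returns for an unknown message type may differ:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helpers standing in for the real netlink requests. */
static int get_stats_new(void);		/* RTM_GETSTATS query */
static int get_stats_legacy(void);	/* classic RTM_NEWLINK dump */

static int fetch_ifstats(void)
{
	int err = get_stats_new();

	/* Old kernels don't understand RTM_GETSTATS; fall back. */
	if (err == -EOPNOTSUPP || err == -EINVAL)
		err = get_stats_legacy();
	return err;
}

/* Stubs simulating an old kernel that rejects the new request. */
static int get_stats_new(void)    { return -EOPNOTSUPP; }
static int get_stats_legacy(void) { return 0; }
```

The point of structuring it this way is that callers of fetch_ifstats() never need to know which kernel they are talking to.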
Re: [PATCHv3] netem: Segment GSO packets on enqueue
On Fri, Apr 29, 2016 at 11:19:05AM -0700, Stephen Hemminger wrote: > On Fri, 29 Apr 2016 13:35:48 -0400 > Neil Hormanwrote: > > > This was recently reported to me, and reproduced on the latest net kernel, > > when > > attempting to run netperf from a host that had a netem qdisc attached to the > > egress interface: > > > > [ 788.073771] [ cut here ] > > [ 788.096716] WARNING: at net/core/dev.c:2253 > > skb_warn_bad_offload+0xcd/0xda() > > [ 788.129521] bnx2: caps=(0x0001801949b3, 0x) len=2962 > > data_len=0 gso_size=1448 gso_type=1 ip_summed=3 > > [ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul > > ipmi_ssif > > ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul > > glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo > > ipmi_si > > i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter > > pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs > > libcrc32c > > sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect > > sysimgblt > > i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci > > crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel > > ptp > > serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash > > dm_log > > dm_mod > > [ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: GW > > 3.10.0-327.el7.x86_64 #1 > > [ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012 > > [ 788.542260] 880437c036b8 f7afc56532a53db9 880437c03670 > > 816351f1 > > [ 788.576332] 880437c036a8 8107b200 880633e74200 > > 880231674000 > > [ 788.611943] 0001 0003 > > 880437c03710 > > [ 788.647241] Call Trace: > > [ 788.658817][] dump_stack+0x19/0x1b > > [ 788.686193] [] warn_slowpath_common+0x70/0xb0 > > [ 788.713803] [] warn_slowpath_fmt+0x5c/0x80 > > [ 788.741314] [] ? 
___ratelimit+0x93/0x100 > > [ 788.767018] [] skb_warn_bad_offload+0xcd/0xda > > [ 788.796117] [] skb_checksum_help+0x17c/0x190 > > [ 788.823392] [] netem_enqueue+0x741/0x7c0 [sch_netem] > > [ 788.854487] [] dev_queue_xmit+0x2a8/0x570 > > [ 788.880870] [] ip_finish_output+0x53d/0x7d0 > > ... > > > > The problem occurs because netem is not prepared to handle GSO packets (as > > it > > uses skb_checksum_help in its enqueue path, which cannot manipulate these > > frames). > > > > The solution I think is to simply segment the skb in a simmilar fashion to > > the > > way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor > > changes. > > When we decide to corrupt an skb, if the frame is GSO, we segment it, > > corrupt > > the first segment, and enqueue the remaining ones. > > > > tested successfully by myself on the latest net kernel, to whcih this > > applies > > > > Signed-off-by: Neil Horman > > CC: Jamal Hadi Salim > > CC: "David S. Miller" > > CC: ne...@lists.linux-foundation.org > > CC: eric.duma...@gmail.com > > > > This looks like a good idea. > > Please cleanup the formatting issues: > > This was recently reported to me, and reproduced on the latest net kernel, > when > > WARNING: 'whcih' may be misspelled - perhaps 'which'? > #93: > tested successfully by myself on the latest net kernel, to whcih this applies > > ERROR: "foo* bar" should be "foo *bar" > #130: FILE: net/sched/sch_netem.c:402: > +static struct sk_buff* netem_segment(struct sk_buff *skb, struct Qdisc *sch) > > > CHECK: braces {} should be used on all arms of this statement > #164: FILE: net/sched/sch_netem.c:479: > + if (skb_is_gso(skb)) { > [...] > + } else > [...] > > CHECK: braces {} should be used on all arms of this statement > #198: FILE: net/sched/sch_netem.c:562: > + if (rc != NET_XMIT_SUCCESS) { > [...] > + } else > [...] > > > > > > Will, do when I get back to the office monday Thanks! Neil
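Neil's approach — segment the GSO skb, corrupt only the first segment, and enqueue the remaining ones unchanged — can be modelled in userspace without any sk_buff machinery. This sketch only shows the control flow; the real patch uses skb_gso_segment() inside the netem enqueue path, and the types here are illustrative:

```c
#include <assert.h>

#define MAX_SEGS 8

/* Illustrative model, not sk_buff code. */
struct seg {
	unsigned char payload;	/* first byte of the segment, as a stand-in */
	int corrupted;
};

/* Split a "GSO" buffer into mss-sized segments. */
static int segment(const unsigned char *data, int len, int mss,
		   struct seg out[MAX_SEGS])
{
	int n = 0, off;

	for (off = 0; off < len && n < MAX_SEGS; off += mss) {
		out[n].payload = data[off];
		out[n].corrupted = 0;
		n++;
	}
	return n;
}

/* On a corruption decision: segment, corrupt only the first segment,
 * and hand the rest on unchanged -- mirroring the control flow of the
 * netem fix. */
static int enqueue_corrupting(const unsigned char *data, int len, int mss,
			      struct seg out[MAX_SEGS])
{
	int n = segment(data, len, mss, out);

	if (n > 0)
		out[0].corrupted = 1;
	return n;
}
```

Once the packet is segmented, each segment carries its own checksum-able header, which is why skb_checksum_help() no longer trips over it.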
Re: [iproute2 1/1] man: tc-ife.8: man page for ife action
On Sat, Apr 30, 2016 at 06:58:04AM -0400, Jamal Hadi Salim wrote: > From: Lucas Bates> > Signed-off-by: Lucas Bates > Signed-off-by: Jamal Hadi Salim Acked-by: Phil Sutter Nit-picking on two things below, but merely as a hint - I don't think they need to be fixed before applying this. [...] > +.SH NAME > +IFE - encapsulate/decapsulate metadata > +.SH SYNOPSIS > +.in +8 > +.ti -8 > +.BR tc " ... " "action" > +.B "ife" The quotes around 'action' and 'ife' are not necessary. Furthermore, the two lines can be combined: > +.I DIRECTION ACTION > +.RB "[ " dst > +.IR DMAC " ] " > +.RB "[ " src > +.IR SMAC " ] " > +.RB "[ " type > +.IR TYPE " ] " > +.R "[ " This adds a trailing whitespace. Checkpatch.pl finds these things. ;) Thanks, Phil
[iproute2 1/1] man: tc-ife.8: man page for ife action
From: Lucas BatesSigned-off-by: Lucas Bates Signed-off-by: Jamal Hadi Salim --- man/man8/tc-ife.8 | 118 ++ 1 file changed, 118 insertions(+) create mode 100644 man/man8/tc-ife.8 diff --git a/man/man8/tc-ife.8 b/man/man8/tc-ife.8 new file mode 100644 index 000..7a912b0 --- /dev/null +++ b/man/man8/tc-ife.8 @@ -0,0 +1,118 @@ +.TH "IFE action in tc" 8 "22 Apr 2016" "iproute2" "Linux" + +.SH NAME +IFE - encapsulate/decapsulate metadata +.SH SYNOPSIS +.in +8 +.ti -8 +.BR tc " ... " "action" +.B "ife" +.I DIRECTION ACTION +.RB "[ " dst +.IR DMAC " ] " +.RB "[ " src +.IR SMAC " ] " +.RB "[ " type +.IR TYPE " ] " +.R "[ " +.IR CONTROL " ] " +.RB "[ " index +.IR INDEX " ] " + +.ti -8 +.IR DIRECTION " := { " +.BR decode " | " encode " }" + +.ti -8 +.IR ACTION " := { " +.BR allow " | " use " }" + +.ti -8 +.IR CONTROL " := { " +.BR reclassify " | " use " | " pipe " | " drop " | " continue " | " ok " }" +.SH DESCRIPTION +The +.B ife +action allows for a sending side to encapsulate arbitrary metadata, which is +then decapsulated by the receiving end. The sender runs in encoding mode and +the receiver in decode mode. Both sender and receiver must specify the same +ethertype. In the future, a registered ethertype may be available as a default. +.SH OPTIONS +.TP +.B decode +For the receiving side; decode the metadata if the packet matches. +.TP +.B encode +For the sending side. Encode the specified metadata if the packet matches. +.TP +.B allow +Encode direction only. Allows encoding specified metadata. +.TP +.B use +Encode direction only. Enforce static encoding of specified metadata. +.TP +.BI dmac " DMAC" +.TQ +.BI smac " SMAC" +Optional six byte destination or source MAC address to encode. +.TP +.BI type " TYPE" +Optional 16-bit ethertype to encode. +.TP +.BI CONTROL +Action to take following an encode/decode. +.TP +.BI index " INDEX" +Assign a unique ID to this action instead of letting the kernel choose one +automatically. 
+.I INDEX +is a 32-bit unsigned integer greater than zero. +.SH EXAMPLES + +On the receiving side, match packets with ethertype 0xdead and restart +classification so that it will match ICMP on the next rule, at prio 3: +.RS +.EX +# tc qdisc add dev eth0 handle : ingress +# tc filter add dev eth0 parent : prio 2 protocol 0xdead \\ + u32 match u32 0 0 flowid 1:1 \\ + action ife decode reclassify +# tc filter add dev eth0 parent : prio 3 protocol ip \\ + u32 match ip protocol 0xff flowid 1:1 \\ + action continue .EE .RE + +Match with skb mark of 17: + +.RS +.EX +# tc filter add dev eth0 parent : prio 4 protocol ip \\ + handle 0x11 fw flowid 1:1 \\ + action ok .EE .RE + +Configure the sending side to encode for the filters above. Use a destination +IP address of 192.168.122.237/24, then tag with skb mark of decimal 17. Encode +the packet with ethertype 0xdead, add skb->mark to whitelist of metadata to +send, and rewrite the destination MAC address to 02:15:15:15:15:15. + +.RS +.EX +# tc qdisc add dev eth0 root handle 1: prio +# tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \\ + match ip dst 192.168.122.237/24 \\ + match ip protocol 1 0xff \\ + flowid 1:2 \\ + action skbedit mark 17 \\ + action ife encode \\ + type 0xDEAD \\ + allow mark \\ + dst 02:15:15:15:15:15 .EE .RE + +.SH SEE ALSO +.BR tc (8), +.BR tc-u32 (8) -- 1.9.1
[PATCH net] vlan: Propagate MAC address changes properly
The MAC address of the physical interface is only copied to the VLAN when it is first created, resulting in an inconsistency after MAC address changes of only newly created VLANs having an up-to-date MAC. Continuing to inherit the MAC address unless explicitly changed for the VLAN allows IPv6 EUI64 addresses for the VLAN to reflect the change and thus for DAD to behave as expected for the given MAC. Signed-off-by: Mike Manning--- net/8021q/vlan.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c index d2cd9de..2f57cf2 100644 --- a/net/8021q/vlan.c +++ b/net/8021q/vlan.c @@ -293,15 +293,15 @@ static void vlan_sync_address(struct net_device *dev, /* vlan address was different from the old address and is equal to * the new address */ - if (!ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && + if ((vlandev->flags & IFF_UP) && + !ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) dev_uc_del(dev, vlandev->dev_addr); - /* vlan address was equal to the old address and is different from + /* vlan address was equal to the old address so now also inherit * the new address */ - if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && - !ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) - dev_uc_add(dev, vlandev->dev_addr); + if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr)) + ether_addr_copy(vlandev->dev_addr, dev->dev_addr); ether_addr_copy(vlan->real_dev_addr, dev->dev_addr); } @@ -389,13 +389,8 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event, case NETDEV_CHANGEADDR: /* Adjust unicast filters on underlying device */ - vlan_group_for_each_dev(grp, i, vlandev) { - flgs = vlandev->flags; - if (!(flgs & IFF_UP)) - continue; - + vlan_group_for_each_dev(grp, i, vlandev) vlan_sync_address(dev, vlandev); - } break; case NETDEV_CHANGEMTU: -- 1.7.10.4
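The behavioural change in the patch boils down to one rule: if the VLAN's address still matches the real device's previous address (i.e. it was never explicitly overridden), it inherits the new one; a VLAN with its own address keeps it. A userspace model of just that rule, with illustrative types rather than struct net_device:

```c
#include <assert.h>
#include <string.h>

/* Illustrative types; the real code works on struct net_device. */
struct model_dev {
	unsigned char addr[6];
};

/* Called when the real device's MAC changes: if the vlan still carried
 * the real device's old address (never explicitly overridden), inherit
 * the new one; a vlan with its own address keeps it. */
static void model_sync_address(struct model_dev *vlandev,
			       const struct model_dev *real,
			       const unsigned char old_real_addr[6])
{
	if (memcmp(vlandev->addr, old_real_addr, 6) == 0)
		memcpy(vlandev->addr, real->addr, 6);
}
```

This is what lets IPv6 EUI-64 addresses on the VLAN track a MAC change on the physical interface without each VLAN being reconfigured by hand.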
Re: [PATCH net] Propagate MAC address changes to VLANs
On 03/03/2016 09:12 PM, David Miller wrote: > From: Mike Manning> Date: Mon, 29 Feb 2016 11:32:51 + > >> >> -/* vlan address was equal to the old address and is different from >> +/* vlan address was equal to the old address so now also inherit >> * the new address */ >> -if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && >> -!ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) >> -dev_uc_add(dev, vlandev->dev_addr); >> +if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr)) >> +ether_addr_copy(vlandev->dev_addr, dev->dev_addr); >> > > This dev_uc_add() call removal cannot be correct, if the device is up > we must programe it into the hardware unicast filters and if also > potentially put it into promiscuous mode via __dev_set_rx_mode(). > The call to dev_uc_add() to add a secondary address is only needed if the VLAN MAC is different from that for the physical interface. For the proposed changes, the VLAN MAC is tracking that of the physical interface and so is the same (as typically it does not make sense for these to be different), so dev_uc_add() should not be called. The easiest way to demonstrate equivalence with the original code, where the MAC address has to be set manually, is with some test debugs. 
Here, first the MAC of the interface itself is changed (so dev_uc_add() is called), then the MAC of the VLAN is changed (so dev_uc_del() is called): 1) ORIGINAL CODE: ip addr show dev dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8.40 addr 10:20:30:40:50:61 sudo ip link set dp0s8 addr 10:20:30:40:50:61 ip addr show dev dp0s8 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff [ 3990.332577] --- vlan_dev_set_mac_address: id 40, call dev_uc_add for 10:20:30:40:50:61 on dp 0s8 [ 3990.332579] device dp0s8 entered promiscuous mode [ 4002.425234] 8021q: --- vlan_sync_address: id 40, for 10:20:30:40:50:61 on dp0s8.40 (from dp0 s8) [ 4002.425472] --- vlan_sync_address: id 40, call dev_uc_del for 10:20:30:40:50:61 on dp0s8 [ 4002.425475] --- __hw_addr_del_entry: refcount 0 for 10:20:30:40:50:61 [ 4002.425477] device dp0s8 left promiscuous mode sudo ip link set dp0s8 addr 52:54:00:1f:06:2a ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8.40 addr 52:54:00:1f:06:2a ip addr show dev dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff [ 4121.606671] --- vlan_sync_address: id, 40, call dev_uc_add for 10:20:30:40:50:61 on dp0s8 [ 4121.606673] device dp0s8 entered promiscuous mode [ 4147.487780] --- vlan_dev_set_mac_address: id 40, for 52:54:00:1f:06:2a on dp0s8 [ 4147.487782] --- vlan_dev_set_mac_address: id 40, call dev_uc_del for 10:20:30:40:50:61 dp0s8 [ 4147.487784] --- __hw_addr_del_entry: refcount 0 for 10:20:30:40:50:61 [ 4147.487786] device dp0s8 left promiscuous mode 2) WITH IMPROVEMENT FOR VLAN MAC TO FOLLOW THAT OF PHYSICAL INTF, UNLESS EXPLICITLY SET: ip addr show dev 
dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8 addr 10:20:30:40:50:61 ip addr show dev dp0s8 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff [ 196.574789] 8021q: --- vlan_sync_address: id 40, for 10:20:30:40:50:61 on dp0s8.40 (from dp0 s8) [ 196.575004] --- vlan_sync_address: id 40, update to 10:20:30:40:50:61 on dp0s8.40 (from dp0s 8) sudo ip link set dp0s8 addr 52:54:00:1f:06:2a ip addr show dev dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff [ 265.683313] 8021q: --- vlan_sync_address: id 40, for 52:54:00:1f:06:2a on dp0s8.40 (from dp0 s8) [ 265.683534] --- vlan_sync_address: id 40, update to 52:54:00:1f:06:2a on dp0s8.40 (from dp0s 8) sudo ip link set dp0s8.40 addr 10:20:30:40:50:61 sudo ip link set dp0s8 addr 10:20:30:40:50:99 ip addr show dev dp0s8 | grep ether link/ether 10:20:30:40:50:99 brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8 addr 10:20:30:40:50:61 [ 5561.791222] --- vlan_dev_set_mac_address: id 40, for 10:20:30:40:50:61 on dp0s8 [ 5561.791225] --- vlan_dev_set_mac_address: id 40, call dev_uc_add for 10:20:30:40:50:61 on dp 0s8 [ 5561.791227] device dp0s8 entered promiscuous mode [
Re: [PATCH iproute2 net-next] ifstat: move to new RTM_GETSTATS api
On 16-04-30 02:41 AM, Roopa Prabhu wrote: From: Roopa PrabhuThis patch modifies ifstat to use the new RTM_GETSTATS api to query stats from the kernel. In the process this also moves ifstat to use 64 bit stats. Breaks old kernels? May need to keep backward compat of RTM_NEWLINK and even new main struct for GETSTATS. cheers, jamal
Re: [PATCH v2 net-next 0/7] net: make TCP preemptible
Hello, On Fri, 29 Apr 2016, Eric Dumazet wrote: > I had corruptions issues and a dying HDD one month ago. > > I have a brand new HDD, but maybe the SSD I use for my git trees is > dying as well :( > > But I've seen this strange patterns in the past, it might be the old > text editor I am using. You can also check for failing capacitors around the RAM slots... Memory tests can help too. Regards -- Julian Anastasov
[PATCH net-next v3 3/4] bridge: vlan: learn to count
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets allocated a per-cpu stats which is then set in each per-port vlan context for quick access. The br_allowed_ingress() common function is used to account for Rx packets and the br_handle_vlan() common function is used to account for Tx packets. Stats accounting is performed only if the bridge-wide vlan_stats_enabled option is set either via sysfs or netlink. A struct hole between vlan_enabled and vlan_proto is used for the new option so it is in the same cache line. Currently it is binary (on/off) but it is intentionally restricted to exactly 0 and 1 since other values will be used in the future for different purposes (e.g. per-port stats). Signed-off-by: Nikolay Aleksandrov--- v3: make stats accounting optional with default to off v2: no change include/uapi/linux/if_link.h | 1 + net/bridge/br_netlink.c | 13 ++- net/bridge/br_private.h | 13 ++- net/bridge/br_sysfs_br.c | 17 + net/bridge/br_vlan.c | 82 5 files changed, 110 insertions(+), 16 deletions(-) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 8577c0e4116f..cc50261baf59 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -272,6 +272,7 @@ enum { IFLA_BR_NF_CALL_ARPTABLES, IFLA_BR_VLAN_DEFAULT_PVID, IFLA_BR_PAD, + IFLA_BR_VLAN_STATS_ENABLED, __IFLA_BR_MAX, }; diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 6bae1125e36d..7fba1f018bc9 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -850,6 +850,7 @@ static const struct nla_policy br_policy[IFLA_BR_MAX + 1] = { [IFLA_BR_NF_CALL_IP6TABLES] = { .type = NLA_U8 }, [IFLA_BR_NF_CALL_ARPTABLES] = { .type = NLA_U8 }, [IFLA_BR_VLAN_DEFAULT_PVID] = { .type = NLA_U16 }, + [IFLA_BR_VLAN_STATS_ENABLED] = { .type = NLA_U8 }, }; static int br_changelink(struct net_device *brdev, struct nlattr *tb[], @@ -921,6 +922,14 @@ static int br_changelink(struct net_device *brdev, struct nlattr *tb[], if (err) return 
err; } + + if (data[IFLA_BR_VLAN_STATS_ENABLED]) { + __u8 vlan_stats = nla_get_u8(data[IFLA_BR_VLAN_STATS_ENABLED]); + + err = br_vlan_set_stats(br, vlan_stats); + if (err) + return err; + } #endif if (data[IFLA_BR_GROUP_FWD_MASK]) { @@ -1082,6 +1091,7 @@ static size_t br_get_size(const struct net_device *brdev) #ifdef CONFIG_BRIDGE_VLAN_FILTERING nla_total_size(sizeof(__be16)) + /* IFLA_BR_VLAN_PROTOCOL */ nla_total_size(sizeof(u16)) +/* IFLA_BR_VLAN_DEFAULT_PVID */ + nla_total_size(sizeof(u8)) + /* IFLA_BR_VLAN_STATS_ENABLED */ #endif nla_total_size(sizeof(u16)) +/* IFLA_BR_GROUP_FWD_MASK */ nla_total_size(sizeof(struct ifla_bridge_id)) + /* IFLA_BR_ROOT_ID */ @@ -1167,7 +1177,8 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev) #ifdef CONFIG_BRIDGE_VLAN_FILTERING if (nla_put_be16(skb, IFLA_BR_VLAN_PROTOCOL, br->vlan_proto) || - nla_put_u16(skb, IFLA_BR_VLAN_DEFAULT_PVID, br->default_pvid)) + nla_put_u16(skb, IFLA_BR_VLAN_DEFAULT_PVID, br->default_pvid) || + nla_put_u8(skb, IFLA_BR_VLAN_STATS_ENABLED, br->vlan_stats_enabled)) return -EMSGSIZE; #endif #ifdef CONFIG_BRIDGE_IGMP_SNOOPING diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 1b5d145dfcbf..8b644069a1a1 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -77,12 +77,21 @@ struct bridge_mcast_querier { }; #endif +struct br_vlan_stats { + u64 rx_bytes; + u64 rx_packets; + u64 tx_bytes; + u64 tx_packets; + struct u64_stats_sync syncp; +}; + /** * struct net_bridge_vlan - per-vlan entry * * @vnode: rhashtable member * @vid: VLAN id * @flags: bridge vlan flags + * @stats: per-cpu VLAN statistics * @br: if MASTER flag set, this points to a bridge struct * @port: if MASTER flag unset, this points to a port struct * @refcnt: if MASTER flag set, this is bumped for each port referencing it @@ -100,6 +109,7 @@ struct net_bridge_vlan { struct rhash_head vnode; u16 vid; u16 flags; + struct br_vlan_stats __percpu *stats; union { struct net_bridge *br; 
struct net_bridge_port *port; @@ -342,6 +352,7 @@ struct net_bridge #ifdef CONFIG_BRIDGE_VLAN_FILTERING struct net_bridge_vlan_group__rcu *vlgrp; u8 vlan_enabled; + u8 vlan_stats_enabled; __be16
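The stats layout this patch adds follows the usual per-cpu counter scheme: each CPU increments its own copy locklessly on the fast path, and a reader sums all copies on demand (the kernel additionally wraps the counters in u64_stats_sync so 32-bit readers never see a torn 64-bit value). A userspace model of the split-then-sum structure, with illustrative names:

```c
#include <assert.h>

#define NCPUS 4

struct model_vlan_stats {
	unsigned long rx_packets;
	unsigned long rx_bytes;
};

/* One counter set per "CPU"; writers touch only their own slot, so the
 * fast path needs no lock. */
static struct model_vlan_stats percpu_stats[NCPUS];

static void account_rx(int cpu, unsigned long bytes)
{
	percpu_stats[cpu].rx_packets++;
	percpu_stats[cpu].rx_bytes += bytes;
}

/* Readers (e.g. the netlink dump) fold all per-CPU copies together. */
static struct model_vlan_stats sum_stats(void)
{
	struct model_vlan_stats total = { 0, 0 };
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		total.rx_packets += percpu_stats[cpu].rx_packets;
		total.rx_bytes += percpu_stats[cpu].rx_bytes;
	}
	return total;
}
```

Keeping the summation on the reader side is what makes it cheap enough to account every packet once vlan_stats_enabled is set.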
[PATCH net-next v3 1/4] net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter
The new prividx argument allows the current dumping device to save a private state counter which would enable it to continue dumping from where it left off. And the idxattr is used to save the current idx user so multiple prividx using attributes can be requested at the same time as suggested by Roopa Prabhu. Signed-off-by: Nikolay Aleksandrov--- v3: no change v2: improve the error check in rtnl_fill_statsinfo, rename lidx to prividx, squash patch 2 into this one and save the current idx user instead of restricting only one net/core/rtnetlink.c | 44 +++- 1 file changed, 31 insertions(+), 13 deletions(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 5503dfe6a050..de529a20cd18 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3444,13 +3444,21 @@ out: return err; } +static bool stats_attr_valid(unsigned int mask, int attrid, int idxattr) +{ + return (mask & IFLA_STATS_FILTER_BIT(attrid)) && + (!idxattr || idxattr == attrid); +} + static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, int type, u32 pid, u32 seq, u32 change, - unsigned int flags, unsigned int filter_mask) + unsigned int flags, unsigned int filter_mask, + int *idxattr, int *prividx) { struct if_stats_msg *ifsm; struct nlmsghdr *nlh; struct nlattr *attr; + int s_prividx = *prividx; ASSERT_RTNL(); @@ -3462,7 +3470,7 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, ifsm->ifindex = dev->ifindex; ifsm->filter_mask = filter_mask; - if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) { + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_64, *idxattr)) { struct rtnl_link_stats64 *sp; attr = nla_reserve_64bit(skb, IFLA_STATS_LINK_64, @@ -3480,7 +3488,11 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, return 0; nla_put_failure: - nlmsg_cancel(skb, nlh); + /* not a multi message or no progress mean a real error */ + if (!(flags & NLM_F_MULTI) || s_prividx == *prividx) + nlmsg_cancel(skb, nlh); 
+ else + nlmsg_end(skb, nlh); return -EMSGSIZE; } @@ -3494,7 +3506,7 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, { size_t size = 0; - if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_64, 0)) size += nla_total_size_64bit(sizeof(struct rtnl_link_stats64)); return size; @@ -3503,8 +3515,9 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh) { struct net *net = sock_net(skb->sk); - struct if_stats_msg *ifsm; struct net_device *dev = NULL; + int idxattr = 0, prividx = 0; + struct if_stats_msg *ifsm; struct sk_buff *nskb; u32 filter_mask; int err; @@ -3528,7 +3541,7 @@ static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh) err = rtnl_fill_statsinfo(nskb, dev, RTM_NEWSTATS, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0, - 0, filter_mask); + 0, filter_mask, , ); if (err < 0) { /* -EMSGSIZE implies BUG in if_nlmsg_stats_size */ WARN_ON(err == -EMSGSIZE); @@ -3542,18 +3555,19 @@ static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh) static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb) { + int h, s_h, err, s_idx, s_idxattr, s_prividx; struct net *net = sock_net(skb->sk); + unsigned int flags = NLM_F_MULTI; struct if_stats_msg *ifsm; - int h, s_h; - int idx = 0, s_idx; - struct net_device *dev; struct hlist_head *head; - unsigned int flags = NLM_F_MULTI; + struct net_device *dev; u32 filter_mask = 0; - int err; + int idx = 0; s_h = cb->args[0]; s_idx = cb->args[1]; + s_idxattr = cb->args[2]; + s_prividx = cb->args[3]; cb->seq = net->dev_base_seq; @@ -3571,7 +3585,8 @@ static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb) err = rtnl_fill_statsinfo(skb, dev, RTM_NEWSTATS, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, 0, - flags, filter_mask); + flags, filter_mask, +
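The prividx mechanism is easiest to see stripped of the netlink details: a filler that runs out of buffer space records how far it got, and the next dump call resumes from that point, resetting the cursor to zero once the attribute completes. A minimal model (returning -1 where the real code returns -EMSGSIZE):

```c
#include <assert.h>

/* Model of a resumable filler: emit items until the budget (buffer
 * space) runs out; on failure record the resume point in *prividx so a
 * later call continues where this one stopped. */
static int fill_items(int nitems, int budget, int *prividx, int *emitted)
{
	int i;

	for (i = *prividx; i < nitems; i++) {
		if (*emitted >= budget) {
			*prividx = i;	/* resume here on the next dump */
			return -1;	/* -EMSGSIZE in the real code */
		}
		(*emitted)++;
	}
	*prividx = 0;	/* done: next attribute starts from scratch */
	return 0;
}
```

This is also why rtnl_fill_statsinfo ends a partially-filled multi message instead of cancelling it: the items emitted before the budget ran out must survive so the resumed dump doesn't repeat them.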
[PATCH net-next v3 4/4] bridge: netlink: export per-vlan stats
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and get_linkxstats_size) in order to export the per-vlan stats. The paddings were added because soon these fields will be needed for per-port per-vlan stats (or something else if someone beats me to it) so avoiding at least a few more netlink attributes. Signed-off-by: Nikolay Aleksandrov--- v3: no change v2: remove unused pvid pointer, fix the case where bridge has 0 vlans but there're global contexts and move to rtnl link type private attributes nested into a LINK_XSTATS_TYPE_ attribute. The paddings were added because soon these fields will be needed for per-port per-vlan stats (or something else if someone beats me to it) so avoiding at least a few more netlink attributes. include/uapi/linux/if_bridge.h | 18 include/uapi/linux/if_link.h | 1 + net/bridge/br_netlink.c| 65 ++ net/bridge/br_private.h| 7 + net/bridge/br_vlan.c | 27 ++ 5 files changed, 118 insertions(+) diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h index 0536eefff9bf..397d503fdedb 100644 --- a/include/uapi/linux/if_bridge.h +++ b/include/uapi/linux/if_bridge.h @@ -134,6 +134,16 @@ struct bridge_vlan_info { __u16 vid; }; +struct bridge_vlan_xstats { + __u64 rx_bytes; + __u64 rx_packets; + __u64 tx_bytes; + __u64 tx_packets; + __u16 vid; + __u16 pad1; + __u32 pad2; +}; + /* Bridge multicast database attributes * [MDBA_MDB] = { * [MDBA_MDB_ENTRY] = { @@ -233,4 +243,12 @@ enum { }; #define MDBA_SET_ENTRY_MAX (__MDBA_SET_ENTRY_MAX - 1) +/* Embedded inside LINK_XSTATS_TYPE_BRIDGE */ +enum { + BRIDGE_XSTATS_UNSPEC, + BRIDGE_XSTATS_VLAN, + __BRIDGE_XSTATS_MAX +}; +#define BRIDGE_XSTATS_MAX (__BRIDGE_XSTATS_MAX - 1) + #endif /* _UAPI_LINUX_IF_BRIDGE_H */ diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index cc50261baf59..ed59cbd6d129 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -818,6 +818,7 
@@ enum { */ enum { LINK_XSTATS_TYPE_UNSPEC, + LINK_XSTATS_TYPE_BRIDGE, __LINK_XSTATS_TYPE_MAX }; #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1) diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 7fba1f018bc9..a5343c7232bf 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -1234,6 +1234,69 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev) return 0; } +static size_t br_get_linkxstats_size(const struct net_device *dev) +{ + struct net_bridge *br = netdev_priv(dev); + struct net_bridge_vlan_group *vg; + struct net_bridge_vlan *v; + int numvls = 0; + + vg = br_vlan_group(br); + if (!vg) + return 0; + + /* we need to count all, even placeholder entries */ + list_for_each_entry(v, >vlan_list, vlist) + numvls++; + + /* account for the vlans and the link xstats type nest attribute */ + return numvls * nla_total_size(sizeof(struct bridge_vlan_xstats)) + + nla_total_size(0); +} + +static int br_fill_linkxstats(struct sk_buff *skb, const struct net_device *dev, + int *prividx) +{ + struct net_bridge *br = netdev_priv(dev); + struct net_bridge_vlan_group *vg; + struct net_bridge_vlan *v; + struct nlattr *nest; + int vl_idx = 0; + + vg = br_vlan_group(br); + if (!vg) + goto out; + nest = nla_nest_start(skb, LINK_XSTATS_TYPE_BRIDGE); + if (!nest) + return -EMSGSIZE; + list_for_each_entry(v, >vlan_list, vlist) { + struct bridge_vlan_xstats vxi; + struct br_vlan_stats stats; + + if (vl_idx++ < *prividx) + continue; + memset(, 0, sizeof(vxi)); + vxi.vid = v->vid; + br_vlan_get_stats(v, ); + vxi.rx_bytes = stats.rx_bytes; + vxi.rx_packets = stats.rx_packets; + vxi.tx_bytes = stats.tx_bytes; + vxi.tx_packets = stats.tx_packets; + + if (nla_put(skb, BRIDGE_XSTATS_VLAN, sizeof(vxi), )) + goto nla_put_failure; + } + nla_nest_end(skb, nest); + *prividx = 0; +out: + return 0; + +nla_put_failure: + nla_nest_end(skb, nest); + *prividx = vl_idx; + + return -EMSGSIZE; +} static struct rtnl_af_ops br_af_ops 
__read_mostly = { .family = AF_BRIDGE, @@ -1252,6 +1315,8 @@ struct rtnl_link_ops br_link_ops __read_mostly = { .dellink= br_dev_delete, .get_size = br_get_size, .fill_info = br_fill_info, +
[PATCH net-next v3 0/4] bridge: per-vlan stats
Hi, This set adds support for bridge per-vlan statistics. In order to be able to dump statistics for many vlans we need a way to continue dumping after reaching maximum size, thus patches 01 and 02 extend the new stats API with a per-device extended link stats attribute and callback which can save its local state and continue where it left off afterwards. I considered using the already existing "fill_xstats" callback but it gets confusing since we need to separate the linkinfo dump from the new stats api dump and adding a flag/argument to do that just looks messy. I don't think the rtnl_link_ops size is an issue, so adding these seemed like the cleaner approach. Patches 03 and 04 add the stats support and netlink dump support respectively. The stats accounting is controlled via a bridge option which is default off, thus the performance impact is kept minimal. I've tested this set with both old and modified iproute2, kmemleak on and some traffic stress tests while adding/removing vlans and ports. 
v3: - drop the RCU pvid patch and remove one pointer fetch as requested - make stats accounting optional with default to off, the option is in the same cache line as vlan_proto and vlan_enabled, so it is already fetched before the fast path check thus the performance impact is minimal, this also allows us to avoid one vlan lookup and return early when using pvid - rebased and retested v2: - Improve the error checking, rename lidx to prividx and save the current idx user instead of restricting it to one in patch 01 - squash patch 02 into 01 and remove the restriction - add callback descriptions, improve the size calculation and change the xstats message structure to have an embedding level per rtnl link type so we can avoid one call to get the link type (and thus filter on it) and also each link type can now have any number of private attributes inside - fix a problem where the vlan stats are not dumped if the bridge has 0 vlans on it but has vlans on the ports, add bridge link type private attributes and also add paddings for future extensions to avoid at least a few netlink attributes and improve struct alignment - drop the is_skb_forwardable argument constifying patch as it's not needed anymore, but it's a nice cleanup which I'll send separately Thank you, Nik Nikolay Aleksandrov (4): net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter net: rtnetlink: add linkxstats callbacks and attribute bridge: vlan: learn to count bridge: netlink: export per-vlan stats include/net/rtnetlink.h| 7 +++ include/uapi/linux/if_bridge.h | 18 +++ include/uapi/linux/if_link.h | 14 ++ net/bridge/br_netlink.c| 78 - net/bridge/br_private.h| 18 +++ net/bridge/br_sysfs_br.c | 17 +++ net/bridge/br_vlan.c | 109 +++-- net/core/rtnetlink.c | 74 +++- 8 files changed, 307 insertions(+), 28 deletions(-) -- 2.4.11
[PATCH net-next v3 2/4] net: rtnetlink: add linkxstats callbacks and attribute
Add callbacks to calculate the size and fill link extended statistics which can be split into multiple messages and are dumped via the new rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute. Also add that attribute to the idx mask check since it is expected to be able to save state and resume dumping (e.g. future bridge per-vlan stats will be dumped via this attribute and callbacks). Each link type should nest its private attributes under the per-link type attribute. This allows to have any number of separated private attributes and to avoid one call to get the dev link type. Signed-off-by: Nikolay Aleksandrov--- v3: no change v2: add callback descriptions and make the size calculation more accurate, change the netlink xstats message structure with one more level for each rtnl link type which allows for private link type attributes and also allows us to avoid 1 call to get the dev link type. include/net/rtnetlink.h | 7 +++ include/uapi/linux/if_link.h | 12 net/core/rtnetlink.c | 30 ++ 3 files changed, 49 insertions(+) diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h index 2f87c1ba13de..006a7b81d758 100644 --- a/include/net/rtnetlink.h +++ b/include/net/rtnetlink.h @@ -47,6 +47,9 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh) * @get_num_rx_queues: Function to determine number of receive queues * to create when creating a new device. 
* @get_link_net: Function to get the i/o netns of the device + * @get_linkxstats_size: Function to calculate the required room for + * dumping device-specific extended link stats + * @fill_linkxstats: Function to dump device-specific extended link stats */ struct rtnl_link_ops { struct list_headlist; @@ -95,6 +98,10 @@ struct rtnl_link_ops { const struct net_device *dev, const struct net_device *slave_dev); struct net *(*get_link_net)(const struct net_device *dev); + size_t (*get_linkxstats_size)(const struct net_device *dev); + int (*fill_linkxstats)(struct sk_buff *skb, + const struct net_device *dev, + int *prividx); }; int __rtnl_link_register(struct rtnl_link_ops *ops); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index d82de331bb6b..8577c0e4116f 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -802,6 +802,7 @@ struct if_stats_msg { enum { IFLA_STATS_UNSPEC, /* also used as 64bit pad attribute */ IFLA_STATS_LINK_64, + IFLA_STATS_LINK_XSTATS, __IFLA_STATS_MAX, }; @@ -809,4 +810,15 @@ enum { #define IFLA_STATS_FILTER_BIT(ATTR)(1 << (ATTR - 1)) +/* These are embedded into IFLA_STATS_LINK_XSTATS: + * [IFLA_STATS_LINK_XSTATS] + * -> [LINK_XSTATS_TYPE_xxx] + *-> [rtnl link type specific attributes] + */ +enum { + LINK_XSTATS_TYPE_UNSPEC, + __LINK_XSTATS_TYPE_MAX +}; +#define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1) + #endif /* _UAPI_LINUX_IF_LINK_H */ diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index de529a20cd18..d471f097c739 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3483,6 +3483,26 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, dev_get_stats(dev, sp); } + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_XSTATS, *idxattr)) { + const struct rtnl_link_ops *ops = dev->rtnl_link_ops; + + if (ops && ops->fill_linkxstats) { + int err; + + *idxattr = IFLA_STATS_LINK_XSTATS; + attr = nla_nest_start(skb, + IFLA_STATS_LINK_XSTATS); + if 
(!attr) + goto nla_put_failure; + + err = ops->fill_linkxstats(skb, dev, prividx); + nla_nest_end(skb, attr); + if (err) + goto nla_put_failure; + *idxattr = 0; + } + } + nlmsg_end(skb, nlh); return 0; @@ -3509,6 +3529,16 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_64, 0)) size += nla_total_size_64bit(sizeof(struct rtnl_link_stats64)); + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_XSTATS, 0)) { + const struct rtnl_link_ops *ops = dev->rtnl_link_ops; + + if (ops && ops->get_linkxstats_size) { + size +=
[PATCH iproute2 net-next] ifstat: move to new RTM_GETSTATS api
From: Roopa Prabhu
This patch modifies ifstat to use the new RTM_GETSTATS api to query stats from the kernel. In the process this also moves ifstat to use 64 bit stats. Signed-off-by: Roopa Prabhu --- include/libnetlink.h | 3 +++ include/linux/if_link.h | 22 ++ include/linux/rtnetlink.h | 5 + lib/libnetlink.c | 25 + misc/ifstat.c | 35 +++ 5 files changed, 74 insertions(+), 16 deletions(-) diff --git a/include/libnetlink.h b/include/libnetlink.h index 491263f..e623a3c 100644 --- a/include/libnetlink.h +++ b/include/libnetlink.h @@ -44,6 +44,9 @@ int rtnl_dump_request(struct rtnl_handle *rth, int type, void *req, int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n) __attribute__((warn_unused_result)); +int rtnl_stats_dump_request(struct rtnl_handle *rth, __u32 filt_mask) + __attribute__((warn_unused_result)); + struct rtnl_ctrl_data { int nsid; }; diff --git a/include/linux/if_link.h b/include/linux/if_link.h index 6a688e8..68f3270 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -165,6 +165,8 @@ enum { #define IFLA_RTA(r) ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifinfomsg)))) #define IFLA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifinfomsg)) +#define IFLA_RTA_STATS(r) ((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct if_stats_msg)))) + enum { IFLA_INET_UNSPEC, IFLA_INET_CONF, @@ -777,4 +779,24 @@ enum { #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1) +/* STATS section */ + +struct if_stats_msg { + __u8 family; + __u8 pad1; + __u16 pad2; + __u32 ifindex; + __u32 filter_mask; +}; + +enum { + IFLA_STATS_UNSPEC, + IFLA_STATS_LINK_64, + __IFLA_STATS_MAX, +}; + +#define IFLA_STATS_MAX (__IFLA_STATS_MAX - 1) + +#define IFLA_STATS_FILTER_BIT(ATTR) (1 << (ATTR - 1)) + #endif /* _LINUX_IF_LINK_H */ diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index 6aaa2a3..e8cdff5 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -139,6 +139,11 @@ enum { RTM_GETNSID = 90, #define RTM_GETNSID
RTM_GETNSID + RTM_NEWSTATS = 92, +#define RTM_NEWSTATS RTM_NEWSTATS + RTM_GETSTATS = 94, +#define RTM_GETSTATS RTM_GETSTATS + __RTM_MAX, #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1) }; diff --git a/lib/libnetlink.c b/lib/libnetlink.c index a90e52c..95f80fc 100644 --- a/lib/libnetlink.c +++ b/lib/libnetlink.c @@ -838,3 +838,28 @@ int __parse_rtattr_nested_compat(struct rtattr *tb[], int max, struct rtattr *rt memset(tb, 0, sizeof(struct rtattr *) * (max + 1)); return 0; } + +int rtnl_stats_dump_request(struct rtnl_handle *rth, __u32 filt_mask) +{ + struct { + struct nlmsghdr nlh; + struct if_stats_msg ifsm; + } req = { + .nlh.nlmsg_type = RTM_GETSTATS, + .nlh.nlmsg_flags = NLM_F_DUMP|NLM_F_REQUEST, + .nlh.nlmsg_pid = 0, + .ifsm.family = AF_UNSPEC, + .ifsm.ifindex = 0, + .ifsm.filter_mask = filt_mask, + }; + + if (!filt_mask) { + perror("invalid stats filter mask"); + return -1; + } + + req.nlh.nlmsg_seq = rth->dump = ++rth->seq; + req.nlh.nlmsg_len = sizeof(req); + + return send(rth->fd, (void *)&req, sizeof(req), 0); +} diff --git a/misc/ifstat.c b/misc/ifstat.c index abbb4e7..bf9b9fa 100644 --- a/misc/ifstat.c +++ b/misc/ifstat.c @@ -35,6 +35,8 @@ #include +#include "utils.h" + int dump_zeros; int reset_history; int ignore_history; @@ -49,6 +51,8 @@ double W; char **patterns; int npatterns; +struct rtnl_handle rth; + char info_source[128]; int source_mismatch; @@ -58,9 +62,9 @@ struct ifstat_ent { struct ifstat_ent *next; char *name; int ifindex; - unsigned long long val[MAXS]; + __u64 val[MAXS]; double rate[MAXS]; - __u32 ival[MAXS]; + __u64 ival[MAXS]; }; static const char *stats[MAXS] = { @@ -109,32 +113,29 @@ static int match(const char *id) static int get_nlmsg(const struct sockaddr_nl *who, struct nlmsghdr *m, void *arg) { - struct ifinfomsg *ifi = NLMSG_DATA(m); - struct rtattr *tb[IFLA_MAX+1]; + struct if_stats_msg *ifsm = NLMSG_DATA(m); + struct rtattr *tb[IFLA_STATS_MAX+1]; int len = m->nlmsg_len; struct ifstat_ent *n; int i; - if (m->nlmsg_type !=
RTM_NEWLINK) + if (m->nlmsg_type != RTM_NEWSTATS) return 0; - len -= NLMSG_LENGTH(sizeof(*ifi)); + len -=