Re: [PATCH 3.2 085/115] veth: don't modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote:
> On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:
> > Consider:
> >
> > - App A sends out corrupt packets 50% of the time and discards inbound
> > data. (...)
>
> How can you make a generic app C know how to do this? The path could be,
> for instance:
>
> eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <->
> vethC <-> vethD <-> appC
>
> There are no sockets on vethB, but it does need to have special behaviour
> to elide csums. Even if appC is hacked to know how to twiddle some thing
> on its veth port, mucking with vethD will have no effect on vethB.
>
> With regard to your example above, why would A corrupt packets? My guess:
>
> 1) It has bugs (so, fix the bugs; it could equally create incorrect data
> with proper checksums, so just enabling checksumming adds no useful
> protection.)

I agree with Ben here; what he needs is the ability for userspace to be trusted when *forwarding* a packet. Ideally you'd only want to receive the csum status per packet on the packet socket and pass the same value on the vethA interface, with this status being kept when the packet reaches vethB. If A purposely corrupts packets, it's A's problem. It's similar to designing a NIC which intentionally corrupts packets and reports "checksum good".

The real issue is that in order to do things right, the userspace bridge (here, "A") would really need to pass this status. In Ben's case, as he says, bad-checksum packets are dropped before reaching A, which simplifies the process quite a bit and might be what causes some confusion, but ideally we'd rather have recvmsg() and sendmsg() with these flags.

I faced the exact same issue 3 years ago when playing with netmap: it was slow as hell because it would lose all checksum information when packets were passing through userland, resulting in GRO/GSO etc. being disabled, and I had to modify it to let userland preserve it.
That's especially important when you have to deal with possibly corrupted packets not yet detected in the chain because the NIC did not validate their checksums.

Willy
Re: sfc: Please add rss_cpus=hyperthreads support
On Tue, May 5, 2015 at 6:48 AM, Robert Stonehouse wrote:
> On 02/05/15 01:03, Andy Lutomirski wrote:
>> Can you please port the rss_cpus=hyperthreads feature from the
>> OpenOnload driver to the upstream driver? The fact that sticking:
>>
>> options sfc rss_cpus=hyperthreads
>>
>> into modprobe.conf causes sfc to fail to load when OpenOnload isn't
>> installed is incredibly annoying, and the option is genuinely useful.
>
> Sure; I will make sure that happens soon.
> I can see how that would be annoying.

Ping?

Thanks,
Andy
Re: [PATCH net-next 10/12] net/mlx5e: Create aRFS flow tables
On Fri, Apr 29, 2016 at 01:36:40AM +0300, Saeed Mahameed wrote:
> From: Maor Gottlieb
>
> Create the following four flow tables for aRFS usage:
> 1. IPv4 TCP - filtering 4-tuple of IPv4 TCP packets.
> 2. IPv6 TCP - filtering 4-tuple of IPv6 TCP packets.
> 3. IPv4 UDP - filtering 4-tuple of IPv4 UDP packets.
> 4. IPv6 UDP - filtering 4-tuple of IPv6 UDP packets.
>
> Each flow table has two flow groups: one for the 4-tuple
> filtering (full match) and the other for the miss rule.
>
> A full-match rule means a hit for aRFS, and the packet will be forwarded
> to the dedicated RQ/core; miss-rule packets will be forwarded to
> default RSS hashing.
>
> Signed-off-by: Maor Gottlieb
> Signed-off-by: Saeed Mahameed
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile  |   1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en.h      |  41
>  drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c | 251 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_fs.c   |  23 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   8 +-
>  drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |   3 +-
>  6 files changed, 313 insertions(+), 14 deletions(-)
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index 4fc45ee..679e18f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -9,3 +9,4 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o \
>  		en_txrx.o en_clock.o vxlan.o en_tc.o
>
>  mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) += en_dcbnl.o
> +mlx5_core-$(CONFIG_RFS_ACCEL) += en_arfs.o

This line breaks the build when CONFIG_RFS_ACCEL is on and CONFIG_MLX5_CORE_EN is off.
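One possible fix is sketched below. This is only an illustration of the dependency problem, not the patch the maintainers eventually applied: either guard the object on both symbols, or move en_arfs.o into the CONFIG_MLX5_CORE_EN object list and #ifdef the file contents on CONFIG_RFS_ACCEL.

```make
# Sketch: only build en_arfs.o when the mlx5 Ethernet part is enabled,
# so CONFIG_RFS_ACCEL=y with CONFIG_MLX5_CORE_EN=n no longer pulls in
# an object whose dependencies are not built.
ifeq ($(CONFIG_MLX5_CORE_EN),y)
mlx5_core-$(CONFIG_RFS_ACCEL) += en_arfs.o
endif
```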
[PATCH] macb: fix mdiobus_scan() error check
Now mdiobus_scan() returns ERR_PTR(-ENODEV) instead of NULL if the PHY device ID was read as all ones. As this was not an error before, this value should be filtered out now in this driver.

Fixes: b74766a0a0fe ("phylib: don't return NULL from get_phy_device()")
Signed-off-by: Sergei Shtylyov

---
The patch is against DaveM's 'net-next.git' repo.

 drivers/net/ethernet/cadence/macb.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: net-next/drivers/net/ethernet/cadence/macb.c
===================================================================
--- net-next.orig/drivers/net/ethernet/cadence/macb.c
+++ net-next/drivers/net/ethernet/cadence/macb.c
@@ -458,7 +458,8 @@ static int macb_mii_init(struct macb *bp
 		struct phy_device *phydev;
 
 		phydev = mdiobus_scan(bp->mii_bus, i);
-		if (IS_ERR(phydev)) {
+		if (IS_ERR(phydev) &&
+		    PTR_ERR(phydev) != -ENODEV) {
 			err = PTR_ERR(phydev);
 			break;
 		}
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:
> On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear wrote:
>>> Good point, so if you had:
>>>
>>> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <->
>>> userspace-stub <-> eth1
>>>
>>> and user-space hub enabled this elide flag, things would work, right?
>>> Then, it seems like what we need is a way to tell the kernel
>>> router/bridge logic to follow elide signals in packets coming from
>>> veth. I'm not sure what the best way to do this is because I'm less
>>> familiar with conventions in that part of the kernel, but assuming
>>> there's a way to do this, would it be acceptable?
>>
>> You cannot receive on one veth without transmitting on the other, so
>> I think the elide csum logic can go on the raw-socket, and apply to
>> packets in the transmit-from-user-space direction. Just allowing the
>> socket to make the veth behave like it used to before this patch in
>> question should be good enough, since that worked for us for years.
>> So, just an option to modify the ip_summed for pkts sent on a socket
>> is probably sufficient.
>
> I don't think this is right. Consider:
>
> - App A sends out corrupt packets 50% of the time and discards inbound
>   data.
> - App B doesn't care about corrupt packets and is happy to receive them
>   and has some way of dealing with them (special case)
> - App C is a regular app, say nc or something.
>
> In your world, where A decides what happens to data it transmits, then
> A <--veth--> B and A <--wire--> B will have the same behaviour, but
> A <--veth--> C and A <--wire--> C will have _different_ behaviour:
> C will behave incorrectly if it's connected over veth but correctly if
> connected with a wire. That is a bug. Since A cannot know what the app
> it's talking to will desire, I argue that both sides of a message must
> be opted in to this optimization.

How can you make a generic app C know how to do this? The path could be, for instance:

eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> vethC <-> vethD <-> appC

There are no sockets on vethB, but it does need to have special behaviour to elide csums. Even if appC is hacked to know how to twiddle some thing on its veth port, mucking with vethD will have no effect on vethB.

With regard to your example above, why would A corrupt packets? My guess:

1) It has bugs (so, fix the bugs; it could equally create incorrect data with proper checksums, so just enabling checksumming adds no useful protection.)

2) It means to corrupt frames. In that case, someone must expect that C should receive incorrect frames, otherwise why bother making App-A corrupt them in the first place?

3) You are explicitly trying to test the kernel checksum logic, so you want the kernel to detect the bad checksum and throw away the packet. In this case, just don't set the socket option in appA to elide checksums and the packet will be thrown away.

Any other cases you can think of?

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 1:59 PM, Ben Greear wrote:
> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>> We've put considerable effort into cleaning up the checksum interface
>> to make it as unambiguous as possible, please be very careful to
>> follow it. Broken checksum processing is really hard to detect and
>> debug.
>>
>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
>> (indicated by csum_level) have been verified to be correct in a
>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
>> checksum it would refer to has not been verified and is incorrect,
>> this is a major bug.
>
> Suppose I know that the packet received on a packet-socket has
> already been verified by a NIC that supports hardware checksumming.
>
> Then, I want to transmit it on a veth interface using a second
> packet socket. I do not want veth to recalculate the checksum on
> transmit, nor to validate it on the peer veth on receive, because I do
> not want to waste the CPU cycles. I am assuming that my app is not
> accidentally corrupting frames, so the checksum can never be bad.
>
> How should the checksumming be configured for the packets going into
> the packet-socket from user-space?

Checksum-unnecessary conversion to checksum-complete should be done, and then the computed value can be sent to user space. If you want to take advantage of the value on transmit then we would need to change the interface. But I'm not sure why recalculating the checksum in the host is even a problem. With all the advances in checksum offload, LCO, and RCO, it seems like that shouldn't be a common case anyway.

> Also, I might want to send raw frames that do have
> broken checksums (let's assume a real NIC, not veth), and I want them
> to hit the wire with those bad checksums.
>
> How do I configure the checksumming in this case?

Just send raw packets with bad checksums (no checksum offload).

Tom
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear wrote:
>> Good point, so if you had:
>>
>> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <->
>> userspace-stub <-> eth1
>>
>> and user-space hub enabled this elide flag, things would work, right?
>> Then, it seems like what we need is a way to tell the kernel
>> router/bridge logic to follow elide signals in packets coming from
>> veth. I'm not sure what the best way to do this is because I'm less
>> familiar with conventions in that part of the kernel, but assuming
>> there's a way to do this, would it be acceptable?
>
> You cannot receive on one veth without transmitting on the other, so
> I think the elide csum logic can go on the raw-socket, and apply to
> packets in the transmit-from-user-space direction. Just allowing the
> socket to make the veth behave like it used to before this patch in
> question should be good enough, since that worked for us for years.
> So, just an option to modify the ip_summed for pkts sent on a socket
> is probably sufficient.

I don't think this is right. Consider:

- App A sends out corrupt packets 50% of the time and discards inbound data.
- App B doesn't care about corrupt packets and is happy to receive them and has some way of dealing with them (special case)
- App C is a regular app, say nc or something.

In your world, where A decides what happens to data it transmits, then A <--veth--> B and A <--wire--> B will have the same behaviour, but A <--veth--> C and A <--wire--> C will have _different_ behaviour: C will behave incorrectly if it's connected over veth but correctly if connected with a wire. That is a bug. Since A cannot know what the app it's talking to will desire, I argue that both sides of a message must be opted in to this optimization.

>>> There may be no sockets on the vethB port. And reader/writer is not
>>> a good way to look at it since I am implementing a bi-directional
>>> bridge in user-space and each packet-socket is for both rx and tx.
>>
>> Sure, but we could model a bidirectional connection as two
>> unidirectional sockets for our discussions here, right?
>
> Best not to I think, you want to make sure that one socket can
> correctly handle tx and rx. As long as that works, then using
> uni-directional sockets should work too.
>
> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 02:36 PM, Vijay Pandurangan wrote:
> On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear wrote:
>> On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:
>>> [...]
>>>
>>> It seems like only the receiver should decide whether or not to
>>> checksum packets on the veth, not the sender.
>>>
>>> How about:
>>>
>>> We could add a receiving socket option for "don't checksum packets
>>> received from a veth when the other side has marked them as
>>> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
>>> socket option for "mark all data sent via this socket to a veth as
>>> elide-checksum-suggested".
>>>
>>> So the process would be:
>>>
>>> Writer:
>>> 1. open read socket
>>> 2. open write socket, with option elide-checksum-for-veth-suggested
>>> 3. write data
>>>
>>> Reader:
>>> 1. open read socket with "follow-elide-checksum-suggestions-on-veth"
>>> 2. read data
>>>
>>> The kernel / module would then need to persist the flag on all packets
>>> that traverse a veth, and drop these data when they leave the veth
>>> module.
>>
>> I'm not sure this works completely. In my app, the packet flow might be:
>>
>> eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <->
>> vethB <-> [kernel router/bridge logic ...] <-> eth1
>
> Good point, so if you had:
>
> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <->
> userspace-stub <-> eth1
>
> and user-space hub enabled this elide flag, things would work, right?
> Then, it seems like what we need is a way to tell the kernel
> router/bridge logic to follow elide signals in packets coming from
> veth. I'm not sure what the best way to do this is because I'm less
> familiar with conventions in that part of the kernel, but assuming
> there's a way to do this, would it be acceptable?

You cannot receive on one veth without transmitting on the other, so I think the elide csum logic can go on the raw-socket, and apply to packets in the transmit-from-user-space direction. Just allowing the socket to make the veth behave like it used to before this patch in question should be good enough, since that worked for us for years. So, just an option to modify the ip_summed for pkts sent on a socket is probably sufficient.

>> There may be no sockets on the vethB port. And reader/writer is not
>> a good way to look at it since I am implementing a bi-directional
>> bridge in user-space and each packet-socket is for both rx and tx.
>
> Sure, but we could model a bidirectional connection as two
> unidirectional sockets for our discussions here, right?

Best not to I think, you want to make sure that one socket can correctly handle tx and rx. As long as that works, then using uni-directional sockets should work too.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: New DSA design for cross-chip operations
On Fri, Apr 29, 2016 at 05:33:44PM -0400, Vivien Didelot wrote:
> Hi,
>
> Here's a proposal for a new design of DSA. It is meant to discuss the
> actual state and future implementation of the D (for distributed) in
> DSA, which consists of supporting several interconnected switch chips,
> exposed to the user as one big switch unit.
>
> There's still a lot of work to finish in DSA before running into
> cross-chip operations, but it'll help to start thinking about it now.

Hi Vivien

One thing we should keep in mind: very few switch chips actually support the D in DSA. In fact, at the moment, I don't know of any other than Marvell. The Broadcom SF2 does not. The Microchip LAN935X doesn't. The Micrel KS8995 does not. Do any of the switches supported by OpenWRT via swconfig?

We should try to limit adding complexity to the DSA core, and to the API of the drivers, which the majority of drivers will never need. I would make use of the layering we already have. For the Marvell devices, add a cross-chip layer between the Marvell drivers and the DSA core. And even then, we might want to keep this layer optional for Marvell devices. There are only a few niche boards with multiple switch chips. There are many more WiFi access points and cable modems, etc., with just a single switch.

Andrew
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear wrote:
> On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:
>> On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear wrote:
>>> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>>>> [...]
>>>
>>> Suppose I know that the packet received on a packet-socket has
>>> already been verified by a NIC that supports hardware checksumming.
>>>
>>> Then, I want to transmit it on a veth interface using a second
>>> packet socket. I do not want veth to recalculate the checksum on
>>> transmit, nor to validate it on the peer veth on receive, because I
>>> do not want to waste the CPU cycles. I am assuming that my app is not
>>> accidentally corrupting frames, so the checksum can never be bad.
>>>
>>> How should the checksumming be configured for the packets going into
>>> the packet-socket from user-space?
>>
>> It seems like only the receiver should decide whether or not to
>> checksum packets on the veth, not the sender.
>>
>> How about:
>>
>> We could add a receiving socket option for "don't checksum packets
>> received from a veth when the other side has marked them as
>> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
>> socket option for "mark all data sent via this socket to a veth as
>> elide-checksum-suggested".
>>
>> So the process would be:
>>
>> Writer:
>> 1. open read socket
>> 2. open write socket, with option elide-checksum-for-veth-suggested
>> 3. write data
>>
>> Reader:
>> 1. open read socket with "follow-elide-checksum-suggestions-on-veth"
>> 2. read data
>>
>> The kernel / module would then need to persist the flag on all packets
>> that traverse a veth, and drop these data when they leave the veth
>> module.
>
> I'm not sure this works completely. In my app, the packet flow might be:
>
> eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <->
> vethB <-> [kernel router/bridge logic ...] <-> eth1

Good point, so if you had:

eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> vethB <-> userspace-stub <-> eth1

and user-space hub enabled this elide flag, things would work, right? Then, it seems like what we need is a way to tell the kernel router/bridge logic to follow elide signals in packets coming from veth. I'm not sure what the best way to do this is because I'm less familiar with conventions in that part of the kernel, but assuming there's a way to do this, would it be acceptable?

> There may be no sockets on the vethB port. And reader/writer is not
> a good way to look at it since I am implementing a bi-directional
> bridge in user-space and each packet-socket is for both rx and tx.

Sure, but we could model a bidirectional connection as two unidirectional sockets for our discussions here, right?

>>> Also, I might want to send raw frames that do have
>>> broken checksums (let's assume a real NIC, not veth), and I want them
>>> to hit the wire with those bad checksums.
>>>
>>> How do I configure the checksumming in this case?
>>
>> Correct me if I'm wrong, but I think this is already possible now. You
>> can have packets with incorrect checksum hitting the wire as is. What
>> you cannot do is instruct the receiving end to ignore the checksum
>> from the sending end when using a physical device (and something I
>> think we should mimic on the sending device).
>
> Yes, it does work currently (or, last I checked)... I just want to
> make sure it keeps working.

Yeah, good point. It would be great if we could write a test, or at the very least, a list of invariants about what kinds of things should work somewhere.

> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:
> On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear wrote:
>> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>>> [...]
>>
>> Suppose I know that the packet received on a packet-socket has
>> already been verified by a NIC that supports hardware checksumming.
>>
>> Then, I want to transmit it on a veth interface using a second
>> packet socket. I do not want veth to recalculate the checksum on
>> transmit, nor to validate it on the peer veth on receive, because I
>> do not want to waste the CPU cycles. I am assuming that my app is not
>> accidentally corrupting frames, so the checksum can never be bad.
>>
>> How should the checksumming be configured for the packets going into
>> the packet-socket from user-space?
>
> It seems like only the receiver should decide whether or not to
> checksum packets on the veth, not the sender.
>
> How about:
>
> We could add a receiving socket option for "don't checksum packets
> received from a veth when the other side has marked them as
> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
> socket option for "mark all data sent via this socket to a veth as
> elide-checksum-suggested".
>
> So the process would be:
>
> Writer:
> 1. open read socket
> 2. open write socket, with option elide-checksum-for-veth-suggested
> 3. write data
>
> Reader:
> 1. open read socket with "follow-elide-checksum-suggestions-on-veth"
> 2. read data
>
> The kernel / module would then need to persist the flag on all packets
> that traverse a veth, and drop these data when they leave the veth
> module.

I'm not sure this works completely. In my app, the packet flow might be:

eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB <-> [kernel router/bridge logic ...] <-> eth1

There may be no sockets on the vethB port. And reader/writer is not a good way to look at it since I am implementing a bi-directional bridge in user-space and each packet-socket is for both rx and tx.

>> Also, I might want to send raw frames that do have
>> broken checksums (let's assume a real NIC, not veth), and I want them
>> to hit the wire with those bad checksums.
>>
>> How do I configure the checksumming in this case?
>
> Correct me if I'm wrong, but I think this is already possible now. You
> can have packets with incorrect checksum hitting the wire as is. What
> you cannot do is instruct the receiving end to ignore the checksum
> from the sending end when using a physical device (and something I
> think we should mimic on the sending device).

Yes, it does work currently (or, last I checked)... I just want to make sure it keeps working.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear wrote:
> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>> We've put considerable effort into cleaning up the checksum interface
>> to make it as unambiguous as possible, please be very careful to
>> follow it. Broken checksum processing is really hard to detect and
>> debug.
>>
>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
>> (indicated by csum_level) have been verified to be correct in a
>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
>> checksum it would refer to has not been verified and is incorrect,
>> this is a major bug.
>
> Suppose I know that the packet received on a packet-socket has
> already been verified by a NIC that supports hardware checksumming.
>
> Then, I want to transmit it on a veth interface using a second
> packet socket. I do not want veth to recalculate the checksum on
> transmit, nor to validate it on the peer veth on receive, because I do
> not want to waste the CPU cycles. I am assuming that my app is not
> accidentally corrupting frames, so the checksum can never be bad.
>
> How should the checksumming be configured for the packets going into
> the packet-socket from user-space?

It seems like only the receiver should decide whether or not to checksum packets on the veth, not the sender.

How about:

We could add a receiving socket option for "don't checksum packets received from a veth when the other side has marked them as elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending socket option for "mark all data sent via this socket to a veth as elide-checksum-suggested".

So the process would be:

Writer:
1. open read socket
2. open write socket, with option elide-checksum-for-veth-suggested
3. write data

Reader:
1. open read socket with "follow-elide-checksum-suggestions-on-veth"
2. read data

The kernel / module would then need to persist the flag on all packets that traverse a veth, and drop these data when they leave the veth module.

> Also, I might want to send raw frames that do have
> broken checksums (let's assume a real NIC, not veth), and I want them
> to hit the wire with those bad checksums.
>
> How do I configure the checksumming in this case?

Correct me if I'm wrong, but I think this is already possible now. You can have packets with incorrect checksum hitting the wire as is. What you cannot do is instruct the receiving end to ignore the checksum from the sending end when using a physical device (and something I think we should mimic on the sending device).
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 12:54 PM, Tom Herbert wrote:
> We've put considerable effort into cleaning up the checksum interface
> to make it as unambiguous as possible, please be very careful to
> follow it. Broken checksum processing is really hard to detect and
> debug.
>
> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
> (indicated by csum_level) have been verified to be correct in a
> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
> checksum it would refer to has not been verified and is incorrect,
> this is a major bug.

Suppose I know that the packet received on a packet-socket has already been verified by a NIC that supports hardware checksumming.

Then, I want to transmit it on a veth interface using a second packet socket. I do not want veth to recalculate the checksum on transmit, nor to validate it on the peer veth on receive, because I do not want to waste the CPU cycles. I am assuming that my app is not accidentally corrupting frames, so the checksum can never be bad.

How should the checksumming be configured for the packets going into the packet-socket from user-space?

Also, I might want to send raw frames that do have broken checksums (let's assume a real NIC, not veth), and I want them to hit the wire with those bad checksums.

How do I configure the checksumming in this case?

Thanks,
Ben

> Tom
>
> On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear wrote:
>> On 04/30/2016 11:33 AM, Ben Hutchings wrote:
>>> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
>>>> Hello,
>>>>
>>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>>>>
>>>> Actually, no, this is not really a regression. [...]
>>>
>>> It really is. Even though the old behaviour was a bug (raw packets
>>> should not be changed), if there are real applications that depend
>>> on that then we have to keep those applications working somehow.
>>
>> To be honest, I fail to see why the old behaviour is a bug when
>> sending raw packets from user-space. If raw packets should not be
>> changed, then we need some way to specify what the checksum setting
>> is to begin with; otherwise, user-space has not enough control.
>>
>> A socket option for new programs, and sysctl-configurable defaults
>> for raw sockets for old binary programs, would be sufficient I think.
>>
>> Thanks,
>> Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
[PATCH] pxa168_eth: fix mdiobus_scan() error check
Since mdiobus_scan() returns either an error code or NULL on error, the driver should check for both, not only for NULL, otherwise a crash is imminent...

Reported-by: Arnd Bergmann
Signed-off-by: Sergei Shtylyov

---
The patch is against DaveM's 'net.git' repo.

 drivers/net/ethernet/marvell/pxa168_eth.c | 2 ++
 1 file changed, 2 insertions(+)

Index: net/drivers/net/ethernet/marvell/pxa168_eth.c
===================================================================
--- net.orig/drivers/net/ethernet/marvell/pxa168_eth.c
+++ net/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -979,6 +979,8 @@ static int pxa168_init_phy(struct net_de
 		return 0;
 
 	pep->phy = mdiobus_scan(pep->smi_bus, pep->phy_addr);
+	if (IS_ERR(pep->phy))
+		return PTR_ERR(pep->phy);
 	if (!pep->phy)
 		return -ENODEV;
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
[oops – resending this because I was using gmail in HTML mode before by accident]

There was a discussion on a separate thread about this. I agree with Sabrina fully. I believe veth should provide an abstraction layer that correctly emulates a physical network in all ways.

Consider an environment where we have multiple physical computers. Each computer runs one or more containers, each of which has a publicly routable ip address. When adding a new app to the cluster, a scheduler might decide to run this container on any physical machine of its choice, assuming that apps have a way of routing traffic to their backends (we did something similar at Google >10 years ago). This is something we might imagine happening with docker and ipv6, for instance.

If you have an app, A, which sends raw ethernet frames (the simplest case I could imagine) with TCP data that has invalid checksums to app B, which is receiving them, the behaviour of the system _will be different_ depending upon whether app B is scheduled to run on the same machine as app A or not. This seems like a clear bug and a broken abstraction (especially as the default case), and something we should endeavour to avoid.

I do see Ben's point about enabling sw checksum verification as potentially incurring a huge performance penalty (I haven't had a chance to measure it) that is completely wasteful in the vast majority of cases. Unfortunately I just don't see how we can solve this problem in a way that preserves a correct abstraction layer while also avoiding excess work.

I guess the first piece of data that would be helpful is to determine just how big of a performance penalty this is. If it's small, then maybe it doesn't matter.
On Thu, Apr 28, 2016 at 6:29 AM, Sabrina Dubroca wrote:
> Hello,
>
> 2016-04-27, 17:14:44 -0700, Ben Greear wrote:
>> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote:
>>> Hi Ben,
>>>
>>> On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote:
>>>> On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote:
>>>>> On 04/26/2016 04:02 PM, Ben Hutchings wrote:
>>>>>>
>>>>>> 3.2.80-rc1 review patch. If anyone has any objections, please let
>>>>>> me know.
>>>>> I would be careful about this. It causes regressions when sending
>>>>> PACKET_SOCKET buffers from user-space to veth devices.
>>>>>
>>>>> There was a proposed upstream fix for the regression, but it has not
>>>>> gone into the tree as far as I know.
>>>>>
>>>>> http://www.spinics.net/lists/netdev/msg370436.html
>>>> [...]
>>>>
>>>> OK, I'll drop this for now.
>>>
>>> The fall out from not having this patch is in my opinion a bigger
>>> fallout than not having this patch. This patch fixes silent data
>>> corruption vs. the problem Ben Greear is talking about, which might not
>>> be that a common usage.
>>>
>>> What do others think?
>>>
>>> Bye,
>>> Hannes
>>
>> This patch from Cong Wang seems to fix the regression for me, I think it
>> should be added and tested in the main tree, and then apply them to
>> stable as a pair.
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>
> Actually, no, this is not really a regression.
>
> If you capture packets on a device with checksum offloading enabled,
> the TCP/UDP checksum isn't filled. veth also behaves that way. What
> the "veth: don't modify ip_summed" patch does is enable proper
> checksum validation on veth. This really was a bug in veth.
>
> Cong's patch would also break cases where we choose to inject packets
> with invalid checksums, and they would now be accepted as correct.
> Your use case is invalid, it just happened to work because of a
> bug. If you want the stack to fill checksums so that you can capture
> and reinject packets, you have to disable checksum offloading (or
> compute the checksum yourself in userspace).
>
> Thanks.
>
> --
> Sabrina
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
We've put considerable effort into cleaning up the checksum interface to make it as unambiguous as possible, please be very careful to follow it. Broken checksum processing is really hard to detect and debug.

CHECKSUM_UNNECESSARY means that some number of _specific_ checksums (indicated by csum_level) have been verified to be correct in a packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is never right. If CHECKSUM_UNNECESSARY is set in such a manner but the checksum it would refer to has not been verified and is incorrect this is a major bug.

Tom

On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear wrote:
> On 04/30/2016 11:33 AM, Ben Hutchings wrote:
>> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
>>> Hello,
>>>
>>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>>>
>>> Actually, no, this is not really a regression.
>> [...]
>>
>> It really is. Even though the old behaviour was a bug (raw packets
>> should not be changed), if there are real applications that depend on
>> that then we have to keep those applications working somehow.
>
> To be honest, I fail to see why the old behaviour is a bug when sending
> raw packets from user-space. If raw packets should not be changed, then
> we need some way to specify what the checksum setting is to begin with,
> otherwise, user-space has not enough control.
>
> A socket option for new programs, and sysctl configurable defaults for
> raw sockets for old binary programs would be sufficient I think.
>
> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
Re: __napi_alloc_skb failures locking up the box
On Sat, 2016-04-30 at 22:24 +0300, Aaro Koskinen wrote:
> Hi,
>
> I have old NAS box (Thecus N2100) with 512 MB RAM, where rsync from
> NFS -> disk reliably results in temporary out-of-memory conditions.
>
> When this happens the dmesg gets flooded with below logs. If the serial
> console logging is enabled, this will lock up the box completely and
> the backup is not making any progress.
>
> Shouldn't these allocation failures be ratelimited somehow (or even made
> silent)? It doesn't sound right if I can lock up the system simply by
> copying files...

Agreed.

All napi_alloc_skb() callers handle failure just fine. If they did not, a NULL deref would produce a proper stack dump.

When memory gets this tight, other traces will be dumped anyway.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 15d0df943466..0652709fe81a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2423,7 +2423,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi,
 static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 					     unsigned int length)
 {
-	return __napi_alloc_skb(napi, length, GFP_ATOMIC);
+	return __napi_alloc_skb(napi, length, GFP_ATOMIC | __GFP_NOWARN);
 }
 
 void napi_consume_skb(struct sk_buff *skb, int budget);
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On 04/30/2016 11:33 AM, Ben Hutchings wrote:
> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
>> Hello,
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>>
>> Actually, no, this is not really a regression.
> [...]
>
> It really is. Even though the old behaviour was a bug (raw packets
> should not be changed), if there are real applications that depend on
> that then we have to keep those applications working somehow.

To be honest, I fail to see why the old behaviour is a bug when sending raw packets from user-space. If raw packets should not be changed, then we need some way to specify what the checksum setting is to begin with, otherwise, user-space has not enough control.

A socket option for new programs, and sysctl configurable defaults for raw sockets for old binary programs would be sufficient I think.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
__napi_alloc_skb failures locking up the box
Hi,

I have old NAS box (Thecus N2100) with 512 MB RAM, where rsync from NFS -> disk reliably results in temporary out-of-memory conditions.

When this happens the dmesg gets flooded with below logs. If the serial console logging is enabled, this will lock up the box completely and the backup is not making any progress.

Shouldn't these allocation failures be ratelimited somehow (or even made silent)? It doesn't sound right if I can lock up the system simply by copying files...

...
[ 1706.105842] kworker/0:1H: page allocation failure: order:0, mode:0x2200020(GFP_NOWAIT|__GFP_HIGH|__GFP_NOTRACK)
[ 1706.105904] CPU: 0 PID: 519 Comm: kworker/0:1H Not tainted 4.6.0-rc5-iop32x-los_a50bb #1
[ 1706.105917] Hardware name: Thecus N2100
[ 1706.105973] Workqueue: rpciod rpc_async_schedule
[ 1706.105993] Backtrace:
[ 1706.106037] [] (dump_backtrace) from [] (show_stack+0x18/0x1c)
[ 1706.106068] r7: r6:0060 r5: r4:
[ 1706.106142] [] (show_stack) from [] (dump_stack+0x20/0x28)
[ 1706.106185] [] (dump_stack) from [] (warn_alloc_failed+0xf0/0x134)
[ 1706.106216] [] (warn_alloc_failed) from [] (__alloc_pages_nodemask+0x284/0x8fc)
[ 1706.106233] r3:02200020 r2:
[ 1706.106269] r5: r4:c0524c40
[ 1706.106336] [] (__alloc_pages_nodemask) from [] (new_slab+0x3c4/0x430)
[ 1706.106352] r10: r9:c0321fdc r8: r7:df401d00 r6:0015 r5:0220
[ 1706.106422] r4:df401d00
[ 1706.106460] [] (new_slab) from [] (___slab_alloc.constprop.8+0x238/0x298)
[ 1706.106477] r10: r9:c0321fdc r8:02080020 r7:df401d00 r6:dfbef060 r5:
[ 1706.106547] r4:
[ 1706.106581] [] (___slab_alloc.constprop.8) from [] (kmem_cache_alloc+0xbc/0xf8)
[ 1706.106597] r10:00167e93 r9:340285ee r8:e11e1920 r7:6013 r6:02080020 r5:df401d00
[ 1706.106668] r4:
[ 1706.106707] [] (kmem_cache_alloc) from [] (__build_skb+0x2c/0x98)
[ 1706.106723] r7:cc592240 r6:06e0 r5:de3523f0 r4:06e0
[ 1706.106784] [] (__build_skb) from [] (__napi_alloc_skb+0xb0/0xfc)
[ 1706.106801] r9:340285ee r8:e11e1920 r7:cc592240 r6:c0520ab8 r5:de3523f0 r4:06e0
[ 1706.106903] [] (__napi_alloc_skb) from [] (rtl8169_poll+0x3a0/0x588)
[ 1706.106920] r7:de364000 r6:c000fd08 r5:05ea r4:de3523f0
[ 1706.106986] [] (rtl8169_poll) from [] (net_rx_action+0x1cc/0x2ec)
[ 1706.107002] r10:00022548 r9:df471ba8 r8:c05257e0 r7:012c r6:0040 r5:0001
[ 1706.107073] r4:de3523f0
[ 1706.107131] [] (net_rx_action) from [] (__do_softirq+0xf4/0x254)
[ 1706.107147] r10:0101 r9:c052618c r8:4001 r7:c0526188 r6:df47 r5:0003
[ 1706.107218] r4:
[ 1706.107255] [] (__do_softirq) from [] (do_softirq.part.2+0x34/0x40)
[ 1706.107271] r10:0001 r9:c0525a26 r8:c0c0 r7: r6:deae8a80 r5:
[ 1706.107341] r4:2013
[ 1706.107380] [] (do_softirq.part.2) from [] (__local_bh_enable_ip+0xac/0xcc)
[ 1706.107396] r5: r4:0200
[ 1706.107459] [] (__local_bh_enable_ip) from [] (release_sock+0x12c/0x158)
[ 1706.107477] r5: r4:
[ 1706.107538] [] (release_sock) from [] (tcp_sendmsg+0xf8/0xa90)
[ 1706.107557] r10:4040 r9:deae8a80 r8:df471d74 r7:88cd7146 r6:059c r5:de8ccb00
[ 1706.107629] r4:007c r3:0001
[ 1706.107678] [] (tcp_sendmsg) from [] (inet_sendmsg+0x3c/0x74)
[ 1706.107695] r10:df471e2c r9: r8:df7eb604 r7: r6: r5:df02c780
[ 1706.107766] r4:deae8a80
[ 1706.107800] [] (inet_sendmsg) from [] (sock_sendmsg+0x1c/0x30)
[ 1706.107816] r4:df471d74
[ 1706.107849] [] (sock_sendmsg) from [] (kernel_sendmsg+0x38/0x40)
[ 1706.107873] [] (kernel_sendmsg) from [] (xs_send_kvec+0x94/0x9c)
[ 1706.107891] r5: r4:df02c780
[ 1706.107934] [] (xs_send_kvec) from [] (xs_sendpages+0x6c/0x244)
[ 1706.107950] r9: r8:df02c780 r7:007c r6:df7eb604 r5:0001 r4:
[ 1706.108026] [] (xs_sendpages) from [] (xs_tcp_send_request+0x80/0x134)
[ 1706.108043] r10: r9: r8:de8e8000 r7:de0d7258 r6:df7eb604 r5:0001
[ 1706.108114] r4:df7eb600
[ 1706.108172] [] (xs_tcp_send_request) from [] (xprt_transmit+0x58/0x214)
[ 1706.108192] r10:de92cc60 r9: r8:d3697fdd r7:df7eb674 r6:de0d7258 r5:df7eb600
[ 1706.108265] r4:de8e8000
[ 1706.108303] [] (xprt_transmit) from [] (call_transmit+0x18c/0x230)
[ 1706.108321] r7:df7eb600 r6:0001 r5:df7eb600 r4:de0d7258
[ 1706.108388] [] (call_transmit) from [] (__rpc_execute+0x54/0x2c4)
[ 1706.108404] r8:c0508940 r7: r6:c03b1f08 r5:0001 r4:de0d7258
[ 1706.108475] [] (__rpc_execute) from [] (rpc_async_schedule+0x14/0x18)
[ 1706.108493] r7: r6:dfbf0800 r5:de92cc60 r4:de0d727c
[ 1706.108564] [] (rpc_async_schedule) from [] (process_one_work+0x130/0x3ec)
[ 1706.108587] [] (process_one_work) from [] (worker_thread+0x50/0x5ac)
[ 1706.108604] r10:de92cc60 r9:c0508940 r8:0008 r7:c050b220 r6:c0508954 r5:de92cc78
[ 1706.108675] r4:c0508940
[ 1706.108718] [] (worker_thread) from
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Thu, 2016-04-28 at 06:45 -0700, Ben Greear wrote:
> On 04/28/2016 03:29 AM, Sabrina Dubroca wrote:
[...]
>> Your use case is invalid, it just happened to work because of a
>> bug. If you want the stack to fill checksums so that you can capture
>> and reinject packets, you have to disable checksum offloading (or
>> compute the checksum yourself in userspace).
>
> Disabling checksum offloading or computing in user-space (and then
> recomputing in veth to verify the checksum?) is a huge performance loss.
>
> Maybe we could add a socket option to enable Cong's patch on a per-socket
> basis? That way my use-case can still work and you can have this new
> behaviour by default?

It does sound like a useful option to have. If there are other applications that depend on veth's checksum-fixing behaviour and are being distributed in binary form, then a per-device option might be necessary so users can keep those applications working before they're updated.

Ben.

--
Ben Hutchings
Tomorrow will be cancelled due to lack of interest.
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
> Hello,
>
> 2016-04-27, 17:14:44 -0700, Ben Greear wrote:
>> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote:
>>> Hi Ben,
>>>
>>> On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote:
>>>> On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote:
>>>>> On 04/26/2016 04:02 PM, Ben Hutchings wrote:
>>>>>>
>>>>>> 3.2.80-rc1 review patch. If anyone has any objections, please let
>>>>>> me know.
>>>>> I would be careful about this. It causes regressions when sending
>>>>> PACKET_SOCKET buffers from user-space to veth devices.
>>>>>
>>>>> There was a proposed upstream fix for the regression, but it has not
>>>>> gone into the tree as far as I know.
>>>>>
>>>>> http://www.spinics.net/lists/netdev/msg370436.html
>>>> [...]
>>>>
>>>> OK, I'll drop this for now.
>>> The fall out from not having this patch is in my opinion a bigger
>>> fallout than not having this patch. This patch fixes silent data
>>> corruption vs. the problem Ben Greear is talking about, which might not
>>> be that a common usage.
>>>
>>> What do others think?
>>>
>>> Bye,
>>> Hannes
>>
>> This patch from Cong Wang seems to fix the regression for me, I think it
>> should be added and tested in the main tree, and then apply them to
>> stable as a pair.
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>
> Actually, no, this is not really a regression.
[...]

It really is. Even though the old behaviour was a bug (raw packets should not be changed), if there are real applications that depend on that then we have to keep those applications working somehow.

Ben.

--
Ben Hutchings
Tomorrow will be cancelled due to lack of interest.
[PATCH net-next] net: relax expensive skb_unclone() in iptunnel_handle_offloads()
From: Eric Dumazet

Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work. Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet travels in the stack.

This adds skb_header_unclone() which is a variant of skb_unclone() using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.

Signed-off-by: Eric Dumazet
---
 include/linux/skbuff.h    | 10 ++++++++++
 net/ipv4/ip_tunnel_core.c |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c84a5a1078c5528bf6fc84573f63f3c6f470ce8f..c413c588a24f854be9e4df78d8a6872b6b1ff9f3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1325,6 +1325,16 @@ static inline int skb_header_cloned(const struct sk_buff *skb)
 	return dataref != 1;
 }
 
+static inline int skb_header_unclone(struct sk_buff *skb, gfp_t pri)
+{
+	might_sleep_if(gfpflags_allow_blocking(pri));
+
+	if (skb_header_cloned(skb))
+		return pskb_expand_head(skb, 0, 0, pri);
+
+	return 0;
+}
+
 /**
  *	skb_header_release - release reference to header
  *	@skb: buffer to operate on
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 786fa7ca28e0db00865cdf9fab472836866b7d6d..9118b0e640ba3a96850cf28aa8caab4e615254d9 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -157,7 +157,7 @@ int iptunnel_handle_offloads(struct sk_buff *skb,
 	}
 
 	if (skb_is_gso(skb)) {
-		err = skb_unclone(skb, GFP_ATOMIC);
+		err = skb_header_unclone(skb, GFP_ATOMIC);
 		if (unlikely(err))
 			return err;
 		skb_shinfo(skb)->gso_type |= gso_type_mask;
[PATCH 1/4] batman-adv: fix DAT candidate selection (must use vid)
Now that DAT is VLAN aware, it must use the VID when computing the DHT address of the candidate nodes where an entry is going to be stored/retrieved.

Fixes: be1db4f6615b ("batman-adv: make the Distributed ARP Table vlan aware")
Signed-off-by: Antonio Quartulli
[s...@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
---
 net/batman-adv/distributed-arp-table.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/batman-adv/distributed-arp-table.c b/net/batman-adv/distributed-arp-table.c
index e96d7c745b4a..3e6b2624f980 100644
--- a/net/batman-adv/distributed-arp-table.c
+++ b/net/batman-adv/distributed-arp-table.c
@@ -568,6 +568,7 @@ static void batadv_choose_next_candidate(struct batadv_priv *bat_priv,
  *  be sent to
  * @bat_priv: the bat priv with all the soft interface information
  * @ip_dst: ipv4 to look up in the DHT
+ * @vid: VLAN identifier
  *
  * An originator O is selected if and only if its DHT_ID value is one of three
  * closest values (from the LEFT, with wrap around if needed) then the hash
@@ -576,7 +577,8 @@ static void batadv_choose_next_candidate(struct batadv_priv *bat_priv,
  * Return: the candidate array of size BATADV_DAT_CANDIDATE_NUM.
  */
 static struct batadv_dat_candidate *
-batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
+batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst,
+			     unsigned short vid)
 {
 	int select;
 	batadv_dat_addr_t last_max = BATADV_DAT_ADDR_MAX, ip_key;
@@ -592,7 +594,7 @@ batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
 		return NULL;
 
 	dat.ip = ip_dst;
-	dat.vid = 0;
+	dat.vid = vid;
 	ip_key = (batadv_dat_addr_t)batadv_hash_dat(&dat, BATADV_DAT_ADDR_MAX);
@@ -612,6 +614,7 @@ batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
  * @bat_priv: the bat priv with all the soft interface information
  * @skb: payload to send
  * @ip: the DHT key
+ * @vid: VLAN identifier
  * @packet_subtype: unicast4addr packet subtype to use
  *
  * This function copies the skb with pskb_copy() and is sent as unicast packet
@@ -622,7 +625,7 @@ batadv_dat_select_candidates(struct batadv_priv *bat_priv, __be32 ip_dst)
  */
 static bool batadv_dat_send_data(struct batadv_priv *bat_priv,
 				 struct sk_buff *skb, __be32 ip,
-				 int packet_subtype)
+				 unsigned short vid, int packet_subtype)
 {
 	int i;
 	bool ret = false;
@@ -631,7 +634,7 @@ static bool batadv_dat_send_data(struct batadv_priv *bat_priv,
 	struct sk_buff *tmp_skb;
 	struct batadv_dat_candidate *cand;
 
-	cand = batadv_dat_select_candidates(bat_priv, ip);
+	cand = batadv_dat_select_candidates(bat_priv, ip, vid);
 	if (!cand)
 		goto out;
@@ -1022,7 +1025,7 @@ bool batadv_dat_snoop_outgoing_arp_request(struct batadv_priv *bat_priv,
 		ret = true;
 	} else {
 		/* Send the request to the DHT */
-		ret = batadv_dat_send_data(bat_priv, skb, ip_dst,
+		ret = batadv_dat_send_data(bat_priv, skb, ip_dst, vid,
 					   BATADV_P_DAT_DHT_GET);
 	}
 out:
@@ -1150,8 +1153,8 @@ void batadv_dat_snoop_outgoing_arp_reply(struct batadv_priv *bat_priv,
 	/* Send the ARP reply to the candidates for both the IP addresses that
 	 * the node obtained from the ARP reply
 	 */
-	batadv_dat_send_data(bat_priv, skb, ip_src, BATADV_P_DAT_DHT_PUT);
-	batadv_dat_send_data(bat_priv, skb, ip_dst, BATADV_P_DAT_DHT_PUT);
+	batadv_dat_send_data(bat_priv, skb, ip_src, vid, BATADV_P_DAT_DHT_PUT);
+	batadv_dat_send_data(bat_priv, skb, ip_dst, vid, BATADV_P_DAT_DHT_PUT);
 }
 
 /**
-- 
2.8.1
[PATCH 4/4] batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node
From: Sven Eckelmann

The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held an implicit reference to it. But this reference was never stored in form of a pointer in the batadv_neigh_node itself. Instead batadv_neigh_node_release depends on a consistent state of hard_iface->neigh_list and that batadv_hardif_neigh_get always returns the batadv_hardif_neigh_node object which it has a reference for. But batadv_hardif_neigh_get cannot guarantee that because it is working only with rcu_read_lock on this list. It can therefore happen that a neigh_addr is in this list twice or that batadv_hardif_neigh_get cannot find the batadv_hardif_neigh_node for a neigh_addr due to some other list operations taking place at the same time.

Instead add a batadv_hardif_neigh_node pointer directly in batadv_neigh_node which will be used for the reference counter decremented on release of batadv_neigh_node.

Fixes: cef63419f7db ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
Signed-off-by: Antonio Quartulli
---
 net/batman-adv/originator.c | 16 +++++-----------
 net/batman-adv/types.h      |  2 ++
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/net/batman-adv/originator.c b/net/batman-adv/originator.c
index d52f67a0c057..c355a824713c 100644
--- a/net/batman-adv/originator.c
+++ b/net/batman-adv/originator.c
@@ -250,7 +250,6 @@ static void batadv_neigh_node_release(struct kref *ref)
 {
 	struct hlist_node *node_tmp;
 	struct batadv_neigh_node *neigh_node;
-	struct batadv_hardif_neigh_node *hardif_neigh;
 	struct batadv_neigh_ifinfo *neigh_ifinfo;
 	struct batadv_algo_ops *bao;
 
@@ -262,13 +261,7 @@ static void batadv_neigh_node_release(struct kref *ref)
 		batadv_neigh_ifinfo_put(neigh_ifinfo);
 	}
 
-	hardif_neigh = batadv_hardif_neigh_get(neigh_node->if_incoming,
-					       neigh_node->addr);
-	if (hardif_neigh) {
-		/* batadv_hardif_neigh_get() increases refcount too */
-		batadv_hardif_neigh_put(hardif_neigh);
-		batadv_hardif_neigh_put(hardif_neigh);
-	}
+	batadv_hardif_neigh_put(neigh_node->hardif_neigh);
 
 	if (bao->bat_neigh_free)
 		bao->bat_neigh_free(neigh_node);
@@ -665,6 +658,10 @@ batadv_neigh_node_new(struct batadv_orig_node *orig_node,
 	neigh_node->orig_node = orig_node;
 	neigh_node->last_seen = jiffies;
 
+	/* increment unique neighbor refcount */
+	kref_get(&hardif_neigh->refcount);
+	neigh_node->hardif_neigh = hardif_neigh;
+
 	/* extra reference for return */
 	kref_init(&neigh_node->refcount);
 	kref_get(&neigh_node->refcount);
@@ -673,9 +670,6 @@ batadv_neigh_node_new(struct batadv_orig_node *orig_node,
 	hlist_add_head_rcu(&neigh_node->list, &orig_node->neigh_list);
 	spin_unlock_bh(&orig_node->neigh_list_lock);
 
-	/* increment unique neighbor refcount */
-	kref_get(&hardif_neigh->refcount);
-
 	batadv_dbg(BATADV_DBG_BATMAN, orig_node->bat_priv,
 		   "Creating new neighbor %pM for orig_node %pM on interface %s\n",
 		   neigh_addr, orig_node->orig, hard_iface->net_dev->name);
diff --git a/net/batman-adv/types.h b/net/batman-adv/types.h
index 65afd090ab3e..1e47fbe8bb7b 100644
--- a/net/batman-adv/types.h
+++ b/net/batman-adv/types.h
@@ -433,6 +433,7 @@ struct batadv_hardif_neigh_node {
  * @ifinfo_lock: lock protecting private ifinfo members and list
  * @if_incoming: pointer to incoming hard-interface
  * @last_seen: when last packet via this neighbor was received
+ * @hardif_neigh: hardif_neigh of this neighbor
  * @refcount: number of contexts the object is used
  * @rcu: struct used for freeing in an RCU-safe manner
  */
@@ -444,6 +445,7 @@ struct batadv_neigh_node {
 	spinlock_t ifinfo_lock;	/* protects ifinfo_list and its members */
 	struct batadv_hard_iface *if_incoming;
 	unsigned long last_seen;
+	struct batadv_hardif_neigh_node *hardif_neigh;
 	struct kref refcount;
 	struct rcu_head rcu;
 };
-- 
2.8.1
[PATCH 2/4] batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event
At the moment there is no explicit reactivation of a hard-interface upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is reactivated as soon as the next OGM is scheduled for sending, but this mechanism does not work with B.A.T.M.A.N. V. The latter does not rely on the same scheduling mechanism as its predecessor and for this reason the hard-interface remains deactivated forever after being brought down once.

This patch fixes the reactivation mechanism by adding a new routing API which explicitly allows each algorithm to perform any needed operation upon interface re-activation. Such API is optional and is implemented by B.A.T.M.A.N. V only, and it just takes care of setting the iface status to ACTIVE.

Signed-off-by: Antonio Quartulli
Signed-off-by: Marek Lindner
---
 net/batman-adv/bat_v.c          | 12 ++++++++++++
 net/batman-adv/hard-interface.c |  3 +++
 net/batman-adv/types.h          |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/net/batman-adv/bat_v.c b/net/batman-adv/bat_v.c
index 3315b9a598af..4026f198a734 100644
--- a/net/batman-adv/bat_v.c
+++ b/net/batman-adv/bat_v.c
@@ -32,10 +32,21 @@
 
 #include "bat_v_elp.h"
 #include "bat_v_ogm.h"
+#include "hard-interface.h"
 #include "hash.h"
 #include "originator.h"
 #include "packet.h"
 
+static void batadv_v_iface_activate(struct batadv_hard_iface *hard_iface)
+{
+	/* B.A.T.M.A.N. V does not use any queuing mechanism, therefore it can
+	 * set the interface as ACTIVE right away, without any risk of race
+	 * condition
+	 */
+	if (hard_iface->if_status == BATADV_IF_TO_BE_ACTIVATED)
+		hard_iface->if_status = BATADV_IF_ACTIVE;
+}
+
 static int batadv_v_iface_enable(struct batadv_hard_iface *hard_iface)
 {
 	int ret;
@@ -274,6 +285,7 @@ static bool batadv_v_neigh_is_sob(struct batadv_neigh_node *neigh1,
 
 static struct batadv_algo_ops batadv_batman_v __read_mostly = {
 	.name = "BATMAN_V",
+	.bat_iface_activate = batadv_v_iface_activate,
 	.bat_iface_enable = batadv_v_iface_enable,
 	.bat_iface_disable = batadv_v_iface_disable,
 	.bat_iface_update_mac = batadv_v_iface_update_mac,
diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index c61d5b0b24d2..0a7deaf2670a 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -407,6 +407,9 @@ batadv_hardif_activate_interface(struct batadv_hard_iface *hard_iface)
 
 	batadv_update_min_mtu(hard_iface->soft_iface);
 
+	if (bat_priv->bat_algo_ops->bat_iface_activate)
+		bat_priv->bat_algo_ops->bat_iface_activate(hard_iface);
+
 out:
 	if (primary_if)
 		batadv_hardif_put(primary_if);
diff --git a/net/batman-adv/types.h b/net/batman-adv/types.h
index 9abfb3e73c34..443e9b84e07d 100644
--- a/net/batman-adv/types.h
+++ b/net/batman-adv/types.h
@@ -1250,6 +1250,8 @@ struct batadv_forw_packet {
  * struct batadv_algo_ops - mesh algorithm callbacks
  * @list: list node for the batadv_algo_list
  * @name: name of the algorithm
+ * @bat_iface_activate: start routing mechanisms when hard-interface is brought
+ *  up
  * @bat_iface_enable: init routing info when hard-interface is enabled
  * @bat_iface_disable: de-init routing info when hard-interface is disabled
  * @bat_iface_update_mac: (re-)init mac addresses of the protocol information
@@ -1277,6 +1279,7 @@ struct batadv_forw_packet {
 struct batadv_algo_ops {
 	struct hlist_node list;
 	char *name;
+	void (*bat_iface_activate)(struct batadv_hard_iface *hard_iface);
 	int (*bat_iface_enable)(struct batadv_hard_iface *hard_iface);
 	void (*bat_iface_disable)(struct batadv_hard_iface *hard_iface);
 	void (*bat_iface_update_mac)(struct batadv_hard_iface *hard_iface);
-- 
2.8.1
pull request [net]: batman-adv 20160430
Hello David,

this is another pull request intended for net. I know that I sent another batch a few days ago, but after having gone through my queue I thought that merging these last 4 patches would still be worth it (there won't be any other pull request for linux-4.6 :)).

The description of the changes follows below. Please pull or let me know of any issue!

Thanks a lot,
Antonio

The following changes since commit 1dfcd832b1a9ed45edac15b31d079b805fa0ae2a:

  Merge branch 'bpf-fixes' (2016-04-28 17:29:46 -0400)

are available in the git repository at:

  git://git.open-mesh.org/linux-merge.git tags/batman-adv-fix-for-davem

for you to fetch changes up to abe59c65225ccd63a5964e2f2a73dd2995b948e7:

  batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node (2016-04-29 19:46:11 +0800)

In this small batch of patches you have:

- a fix for our Distributed ARP Table that makes sure that the input provided to the hash function during a query is the same as the one provided during an insert (so to prevent false negatives), by Antonio Quartulli

- a fix for our new protocol implementation B.A.T.M.A.N. V that ensures that a hard interface is properly re-activated when it is brought down and then up again, by Antonio Quartulli

- two fixes, respectively to the reference counting of the tt_local_entry and neigh_node objects, by Sven Eckelmann. Such bugs are rather severe as they would prevent the netdev objects referenced by batman-adv from being released after shutdown.
Antonio Quartulli (2):
      batman-adv: fix DAT candidate selection (must use vid)
      batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event

Sven Eckelmann (2):
      batman-adv: Fix reference counting of vlan object for tt_local_entry
      batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node

 net/batman-adv/bat_v.c                 | 12 ++
 net/batman-adv/distributed-arp-table.c | 17 --
 net/batman-adv/hard-interface.c        |  3 +++
 net/batman-adv/originator.c            | 16 -
 net/batman-adv/translation-table.c     | 42 --
 net/batman-adv/types.h                 |  7 ++
 6 files changed, 41 insertions(+), 56 deletions(-)
[PATCH 3/4] batman-adv: Fix reference counting of vlan object for tt_local_entry
From: Sven Eckelmann
The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an implicit reference to it. But this reference was never stored in the form of a pointer in the tt_local_entry itself. Instead batadv_tt_local_remove, batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend on a consistent state of bat_priv->softif_vlan_list and that batadv_softif_vlan_get always returns the batadv_softif_vlan object which it has a reference for. But batadv_softif_vlan_get cannot guarantee that because it is working only with rcu_read_lock on this list. It can therefore happen that a vid is in this list twice or that batadv_softif_vlan_get cannot find the batadv_softif_vlan for a vid due to some other list operations taking place at the same time. Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry which will be used for the reference counter decremented on release of batadv_tt_local_entry. Fixes: 35df3b298fc8 ("batman-adv: fix TT VLAN inconsistency on VLAN re-add") Signed-off-by: Sven Eckelmann Acked-by: Antonio Quartulli Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/translation-table.c | 42 -- net/batman-adv/types.h | 2 ++ 2 files changed, 6 insertions(+), 38 deletions(-) diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0b43e86328a5..9b4551a86535 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -215,6 +215,8 @@ static void batadv_tt_local_entry_release(struct kref *ref) tt_local_entry = container_of(ref, struct batadv_tt_local_entry, common.refcount); + batadv_softif_vlan_put(tt_local_entry->vlan); + kfree_rcu(tt_local_entry, common.rcu); } @@ -673,6 +675,7 @@ bool batadv_tt_local_add(struct net_device *soft_iface, const u8 *addr, kref_get(_local->common.refcount); tt_local->last_seen = jiffies; tt_local->common.added_at = tt_local->last_seen; + tt_local->vlan = vlan; /* the batman interface mac
and multicast addresses should never be * purged @@ -991,7 +994,6 @@ int batadv_tt_local_seq_print_text(struct seq_file *seq, void *offset) struct batadv_tt_common_entry *tt_common_entry; struct batadv_tt_local_entry *tt_local; struct batadv_hard_iface *primary_if; - struct batadv_softif_vlan *vlan; struct hlist_head *head; unsigned short vid; u32 i; @@ -1027,14 +1029,6 @@ int batadv_tt_local_seq_print_text(struct seq_file *seq, void *offset) last_seen_msecs = last_seen_msecs % 1000; no_purge = tt_common_entry->flags & np_flag; - - vlan = batadv_softif_vlan_get(bat_priv, vid); - if (!vlan) { - seq_printf(seq, "Cannot retrieve VLAN %d\n", - BATADV_PRINT_VID(vid)); - continue; - } - seq_printf(seq, " * %pM %4i [%c%c%c%c%c%c] %3u.%03u (%#.8x)\n", tt_common_entry->addr, @@ -1052,9 +1046,7 @@ int batadv_tt_local_seq_print_text(struct seq_file *seq, void *offset) BATADV_TT_CLIENT_ISOLA) ? 'I' : '.'), no_purge ? 0 : last_seen_secs, no_purge ? 0 : last_seen_msecs, - vlan->tt.crc); - - batadv_softif_vlan_put(vlan); + tt_local->vlan->tt.crc); } rcu_read_unlock(); } @@ -1099,7 +1091,6 @@ u16 batadv_tt_local_remove(struct batadv_priv *bat_priv, const u8 *addr, { struct batadv_tt_local_entry *tt_local_entry; u16 flags, curr_flags = BATADV_NO_FLAGS; - struct batadv_softif_vlan *vlan; void *tt_entry_exists; tt_local_entry = batadv_tt_local_hash_find(bat_priv, addr, vid); @@ -1139,14 +1130,6 @@ u16 batadv_tt_local_remove(struct batadv_priv *bat_priv, const u8 *addr, /* extra call to free the local tt entry */ batadv_tt_local_entry_put(tt_local_entry); - /* decrease the reference held for this vlan */ - vlan = batadv_softif_vlan_get(bat_priv, vid); - if (!vlan) - goto out; - - batadv_softif_vlan_put(vlan); - batadv_softif_vlan_put(vlan); - out: if (tt_local_entry) batadv_tt_local_entry_put(tt_local_entry); @@ -1219,7 +1202,6 @@ static void batadv_tt_local_table_free(struct batadv_priv *bat_priv) spinlock_t *list_lock; /* protects write access to the hash
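The core of Sven's fix is easy to model outside the kernel: the entry takes its vlan reference at insert time and remembers which object it took it from, so the release callback can drop exactly that reference instead of re-looking the vlan up by vid (which can race with concurrent list changes). A minimal userspace sketch of that pattern follows — the names are illustrative, not the real batman-adv API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative userspace model, not the batman-adv API: the entry keeps
 * an explicit pointer to the vlan it holds a reference on, so release
 * can drop exactly that reference without a racy lookup by vid. */
struct model_vlan {
	int refcount;
};

struct model_entry {
	struct model_vlan *vlan;	/* reference taken at insert time */
};

static void vlan_get(struct model_vlan *v) { v->refcount++; }
static void vlan_put(struct model_vlan *v) { v->refcount--; }

static void entry_init(struct model_entry *e, struct model_vlan *v)
{
	vlan_get(v);		/* the entry now owns one reference */
	e->vlan = v;
}

static void entry_release(struct model_entry *e)
{
	vlan_put(e->vlan);	/* drop the reference we actually hold */
	e->vlan = NULL;
}
```

In the real code the release path runs from kref_put() context; the invariant modelled here is simply that every reference taken in batadv_tt_local_add() is balanced by a put on the stored pointer.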
Re: [PATCH iproute2 net-next] ifstat: move to new RTM_GETSTATS api
On 4/30/16, 3:21 AM, Jamal Hadi Salim wrote: > On 16-04-30 02:41 AM, Roopa Prabhu wrote: >> From: Roopa Prabhu>> >> This patch modifies ifstat to use the new RTM_GETSTATS api >> to query stats from the kernel. In the process this also >> moves ifstat to use 64 bit stats. > > Breaks old kernels? May need to keep backward compat of > RTM_NEWLINK and even new main struct for GETSTATS. Yes, I was wondering about that. v2 coming. If GETSTATS fails, I will fall back to RTM_NEWLINK. Thanks!
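The fallback Roopa describes can be sketched as a simple try-new-then-legacy wrapper. This is a hypothetical userspace model: get_stats_new()/get_stats_legacy() stand in for the real netlink requests, and the exact errno an old kernel returns for an unknown message type may differ:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helpers standing in for the real netlink requests. */
static int get_stats_new(void);		/* RTM_GETSTATS query */
static int get_stats_legacy(void);	/* classic RTM_NEWLINK dump */

static int fetch_ifstats(void)
{
	int err = get_stats_new();

	/* Old kernels don't understand RTM_GETSTATS; fall back. */
	if (err == -EOPNOTSUPP || err == -EINVAL)
		err = get_stats_legacy();
	return err;
}

/* Stubs simulating an old kernel that rejects the new request. */
static int get_stats_new(void)    { return -EOPNOTSUPP; }
static int get_stats_legacy(void) { return 0; }
```

The point of structuring it this way is that callers of fetch_ifstats() never need to know which kernel they are talking to.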
Re: [PATCHv3] netem: Segment GSO packets on enqueue
On Fri, Apr 29, 2016 at 11:19:05AM -0700, Stephen Hemminger wrote: > On Fri, 29 Apr 2016 13:35:48 -0400 > Neil Hormanwrote: > > > This was recently reported to me, and reproduced on the latest net kernel, > > when > > attempting to run netperf from a host that had a netem qdisc attached to the > > egress interface: > > > > [ 788.073771] [ cut here ] > > [ 788.096716] WARNING: at net/core/dev.c:2253 > > skb_warn_bad_offload+0xcd/0xda() > > [ 788.129521] bnx2: caps=(0x0001801949b3, 0x) len=2962 > > data_len=0 gso_size=1448 gso_type=1 ip_summed=3 > > [ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul > > ipmi_ssif > > ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul > > glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo > > ipmi_si > > i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter > > pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs > > libcrc32c > > sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect > > sysimgblt > > i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci > > crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel > > ptp > > serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash > > dm_log > > dm_mod > > [ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: GW > > 3.10.0-327.el7.x86_64 #1 > > [ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012 > > [ 788.542260] 880437c036b8 f7afc56532a53db9 880437c03670 > > 816351f1 > > [ 788.576332] 880437c036a8 8107b200 880633e74200 > > 880231674000 > > [ 788.611943] 0001 0003 > > 880437c03710 > > [ 788.647241] Call Trace: > > [ 788.658817][] dump_stack+0x19/0x1b > > [ 788.686193] [] warn_slowpath_common+0x70/0xb0 > > [ 788.713803] [] warn_slowpath_fmt+0x5c/0x80 > > [ 788.741314] [] ? 
___ratelimit+0x93/0x100 > > [ 788.767018] [] skb_warn_bad_offload+0xcd/0xda > > [ 788.796117] [] skb_checksum_help+0x17c/0x190 > > [ 788.823392] [] netem_enqueue+0x741/0x7c0 [sch_netem] > > [ 788.854487] [] dev_queue_xmit+0x2a8/0x570 > > [ 788.880870] [] ip_finish_output+0x53d/0x7d0 > > ... > > > > The problem occurs because netem is not prepared to handle GSO packets (as > > it > > uses skb_checksum_help in its enqueue path, which cannot manipulate these > > frames). > > > > The solution I think is to simply segment the skb in a simmilar fashion to > > the > > way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor > > changes. > > When we decide to corrupt an skb, if the frame is GSO, we segment it, > > corrupt > > the first segment, and enqueue the remaining ones. > > > > tested successfully by myself on the latest net kernel, to whcih this > > applies > > > > Signed-off-by: Neil Horman > > CC: Jamal Hadi Salim > > CC: "David S. Miller" > > CC: ne...@lists.linux-foundation.org > > CC: eric.duma...@gmail.com > > > > This looks like a good idea. > > Please cleanup the formatting issues: > > This was recently reported to me, and reproduced on the latest net kernel, > when > > WARNING: 'whcih' may be misspelled - perhaps 'which'? > #93: > tested successfully by myself on the latest net kernel, to whcih this applies > > ERROR: "foo* bar" should be "foo *bar" > #130: FILE: net/sched/sch_netem.c:402: > +static struct sk_buff* netem_segment(struct sk_buff *skb, struct Qdisc *sch) > > > CHECK: braces {} should be used on all arms of this statement > #164: FILE: net/sched/sch_netem.c:479: > + if (skb_is_gso(skb)) { > [...] > + } else > [...] > > CHECK: braces {} should be used on all arms of this statement > #198: FILE: net/sched/sch_netem.c:562: > + if (rc != NET_XMIT_SUCCESS) { > [...] > + } else > [...] > > > > > > Will, do when I get back to the office monday Thanks! Neil
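Neil's approach — segment the GSO skb, corrupt only the first segment, and enqueue the remaining ones unchanged — can be modelled in userspace without any sk_buff machinery. This sketch only shows the control flow; the real patch uses skb_gso_segment() inside the netem enqueue path, and the types here are illustrative:

```c
#include <assert.h>

#define MAX_SEGS 8

/* Illustrative model, not sk_buff code. */
struct seg {
	unsigned char payload;	/* first byte of the segment, as a stand-in */
	int corrupted;
};

/* Split a "GSO" buffer into mss-sized segments. */
static int segment(const unsigned char *data, int len, int mss,
		   struct seg out[MAX_SEGS])
{
	int n = 0, off;

	for (off = 0; off < len && n < MAX_SEGS; off += mss) {
		out[n].payload = data[off];
		out[n].corrupted = 0;
		n++;
	}
	return n;
}

/* On a corruption decision: segment, corrupt only the first segment,
 * and hand the rest on unchanged -- mirroring the control flow of the
 * netem fix. */
static int enqueue_corrupting(const unsigned char *data, int len, int mss,
			      struct seg out[MAX_SEGS])
{
	int n = segment(data, len, mss, out);

	if (n > 0)
		out[0].corrupted = 1;
	return n;
}
```

Once the packet is segmented, each segment carries its own checksum-able header, which is why skb_checksum_help() no longer trips over it.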
Re: [iproute2 1/1] man: tc-ife.8: man page for ife action
On Sat, Apr 30, 2016 at 06:58:04AM -0400, Jamal Hadi Salim wrote: > From: Lucas Bates> > Signed-off-by: Lucas Bates > Signed-off-by: Jamal Hadi Salim Acked-by: Phil Sutter Nit-picking on two things below, but merely as a hint - I don't think they need to be fixed before applying this. [...] > +.SH NAME > +IFE - encapsulate/decapsulate metadata > +.SH SYNOPSIS > +.in +8 > +.ti -8 > +.BR tc " ... " "action" > +.B "ife" The quotes around 'action' and 'ife' are not necessary. Furthermore, the two lines can be combined: > +.I DIRECTION ACTION > +.RB "[ " dst > +.IR DMAC " ] " > +.RB "[ " src > +.IR SMAC " ] " > +.RB "[ " type > +.IR TYPE " ] " > +.R "[ " This adds a trailing whitespace. Checkpatch.pl finds these things. ;) Thanks, Phil
[iproute2 1/1] man: tc-ife.8: man page for ife action
From: Lucas BatesSigned-off-by: Lucas Bates Signed-off-by: Jamal Hadi Salim --- man/man8/tc-ife.8 | 118 ++ 1 file changed, 118 insertions(+) create mode 100644 man/man8/tc-ife.8 diff --git a/man/man8/tc-ife.8 b/man/man8/tc-ife.8 new file mode 100644 index 000..7a912b0 --- /dev/null +++ b/man/man8/tc-ife.8 @@ -0,0 +1,118 @@ +.TH "IFE action in tc" 8 "22 Apr 2016" "iproute2" "Linux" + +.SH NAME +IFE - encapsulate/decapsulate metadata +.SH SYNOPSIS +.in +8 +.ti -8 +.BR tc " ... " "action" +.B "ife" +.I DIRECTION ACTION +.RB "[ " dst +.IR DMAC " ] " +.RB "[ " src +.IR SMAC " ] " +.RB "[ " type +.IR TYPE " ] " +.R "[ " +.IR CONTROL " ] " +.RB "[ " index +.IR INDEX " ] " + +.ti -8 +.IR DIRECTION " := { " +.BR decode " | " encode " }" + +.ti -8 +.IR ACTION " := { " +.BR allow " | " use " }" + +.ti -8 +.IR CONTROL " := { " +.BR reclassify " | " use " | " pipe " | " drop " | " continue " | " ok " }" +.SH DESCRIPTION +The +.B ife +action allows for a sending side to encapsulate arbitrary metadata, which is +then decapsulated by the receiving end. The sender runs in encoding mode and +the receiver in decode mode. Both sender and receiver must specify the same +ethertype. In the future, a registered ethertype may be available as a default. +.SH OPTIONS +.TP +.B decode +For the receiving side; decode the metadata if the packet matches. +.TP +.B encode +For the sending side. Encode the specified metadata if the packet matches. +.TP +.B allow +Encode direction only. Allows encoding specified metadata. +.TP +.B use +Encode direction only. Enforce static encoding of specified metadata. +.TP +.BI dmac " DMAC" +.TQ +.BI smac " SMAC" +Optional six byte destination or source MAC address to encode. +.TP +.BI type " TYPE" +Optional 16-bit ethertype to encode. +.TP +.BI CONTROL +Action to take following an encode/decode. +.TP +.BI index " INDEX" +Assign a unique ID to this action instead of letting the kernel choose one +automatically. 
+.I INDEX +is a 32-bit unsigned integer greater than zero. +.SH EXAMPLES + +On the receiving side, match packets with ethertype 0xdead and restart +classification so that it will match ICMP on the next rule, at prio 3: +.RS +.EX +# tc qdisc add dev eth0 handle : ingress +# tc filter add dev eth0 parent : prio 2 protocol 0xdead \\ + u32 match u32 0 0 flowid 1:1 \\ + action ife decode reclassify +# tc filter add dev eth0 parent : prio 3 protocol ip \\ + u32 match ip protocol 0xff flowid 1:1 \\ + action continue .EE .RE + +Match with skb mark of 17: + +.RS +.EX +# tc filter add dev eth0 parent : prio 4 protocol ip \\ + handle 0x11 fw flowid 1:1 \\ + action ok .EE .RE + +Configure the sending side to encode for the filters above. Use a destination +IP address of 192.168.122.237/24, then tag with skb mark of decimal 17. Encode +the packet with ethertype 0xdead, add skb->mark to whitelist of metadata to +send, and rewrite the destination MAC address to 02:15:15:15:15:15. + +.RS +.EX +# tc qdisc add dev eth0 root handle 1: prio +# tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \\ + match ip dst 192.168.122.237/24 \\ + match ip protocol 1 0xff \\ + flowid 1:2 \\ + action skbedit mark 17 \\ + action ife encode \\ + type 0xDEAD \\ + allow mark \\ + dst 02:15:15:15:15:15 .EE .RE + +.SH SEE ALSO +.BR tc (8), +.BR tc-u32 (8) -- 1.9.1
[PATCH net] vlan: Propagate MAC address changes properly
The MAC address of the physical interface is only copied to the VLAN when it is first created, resulting in an inconsistency after MAC address changes of only newly created VLANs having an up-to-date MAC. Continuing to inherit the MAC address unless explicitly changed for the VLAN allows IPv6 EUI64 addresses for the VLAN to reflect the change and thus for DAD to behave as expected for the given MAC. Signed-off-by: Mike Manning--- net/8021q/vlan.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c index d2cd9de..2f57cf2 100644 --- a/net/8021q/vlan.c +++ b/net/8021q/vlan.c @@ -293,15 +293,15 @@ static void vlan_sync_address(struct net_device *dev, /* vlan address was different from the old address and is equal to * the new address */ - if (!ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && + if ((vlandev->flags & IFF_UP) && + !ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) dev_uc_del(dev, vlandev->dev_addr); - /* vlan address was equal to the old address and is different from + /* vlan address was equal to the old address so now also inherit * the new address */ - if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && - !ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) - dev_uc_add(dev, vlandev->dev_addr); + if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr)) + ether_addr_copy(vlandev->dev_addr, dev->dev_addr); ether_addr_copy(vlan->real_dev_addr, dev->dev_addr); } @@ -389,13 +389,8 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event, case NETDEV_CHANGEADDR: /* Adjust unicast filters on underlying device */ - vlan_group_for_each_dev(grp, i, vlandev) { - flgs = vlandev->flags; - if (!(flgs & IFF_UP)) - continue; - + vlan_group_for_each_dev(grp, i, vlandev) vlan_sync_address(dev, vlandev); - } break; case NETDEV_CHANGEMTU: -- 1.7.10.4
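The behavioural change in the patch boils down to one rule: if the VLAN's address still matches the real device's previous address (i.e. it was never explicitly overridden), it inherits the new one; a VLAN with its own address keeps it. A userspace model of just that rule, with illustrative types rather than struct net_device:

```c
#include <assert.h>
#include <string.h>

/* Illustrative types; the real code works on struct net_device. */
struct model_dev {
	unsigned char addr[6];
};

/* Called when the real device's MAC changes: if the vlan still carried
 * the real device's old address (never explicitly overridden), inherit
 * the new one; a vlan with its own address keeps it. */
static void model_sync_address(struct model_dev *vlandev,
			       const struct model_dev *real,
			       const unsigned char old_real_addr[6])
{
	if (memcmp(vlandev->addr, old_real_addr, 6) == 0)
		memcpy(vlandev->addr, real->addr, 6);
}
```

This is what lets IPv6 EUI-64 addresses on the VLAN track a MAC change on the physical interface without each VLAN being reconfigured by hand.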
Re: [PATCH net] Propagate MAC address changes to VLANs
On 03/03/2016 09:12 PM, David Miller wrote: > From: Mike Manning> Date: Mon, 29 Feb 2016 11:32:51 + > >> >> -/* vlan address was equal to the old address and is different from >> +/* vlan address was equal to the old address so now also inherit >> * the new address */ >> -if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && >> -!ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) >> -dev_uc_add(dev, vlandev->dev_addr); >> +if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr)) >> +ether_addr_copy(vlandev->dev_addr, dev->dev_addr); >> > > This dev_uc_add() call removal cannot be correct, if the device is up > we must programe it into the hardware unicast filters and if also > potentially put it into promiscuous mode via __dev_set_rx_mode(). > The call to dev_uc_add() to add a secondary address is only needed if the VLAN MAC is different from that for the physical interface. For the proposed changes, the VLAN MAC is tracking that of the physical interface and so is the same (as typically it does not make sense for these to be different), so dev_uc_add() should not be called. The easiest way to demonstrate equivalence with the original code, where the MAC address has to be set manually, is with some test debugs. 
Here, first the MAC of the interface itself is changed (so dev_uc_add() is called), then the MAC of the VLAN is changed (so dev_uc_del() is called): 1) ORIGINAL CODE: ip addr show dev dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8.40 addr 10:20:30:40:50:61 sudo ip link set dp0s8 addr 10:20:30:40:50:61 ip addr show dev dp0s8 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff [ 3990.332577] --- vlan_dev_set_mac_address: id 40, call dev_uc_add for 10:20:30:40:50:61 on dp 0s8 [ 3990.332579] device dp0s8 entered promiscuous mode [ 4002.425234] 8021q: --- vlan_sync_address: id 40, for 10:20:30:40:50:61 on dp0s8.40 (from dp0 s8) [ 4002.425472] --- vlan_sync_address: id 40, call dev_uc_del for 10:20:30:40:50:61 on dp0s8 [ 4002.425475] --- __hw_addr_del_entry: refcount 0 for 10:20:30:40:50:61 [ 4002.425477] device dp0s8 left promiscuous mode sudo ip link set dp0s8 addr 52:54:00:1f:06:2a ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8.40 addr 52:54:00:1f:06:2a ip addr show dev dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff [ 4121.606671] --- vlan_sync_address: id, 40, call dev_uc_add for 10:20:30:40:50:61 on dp0s8 [ 4121.606673] device dp0s8 entered promiscuous mode [ 4147.487780] --- vlan_dev_set_mac_address: id 40, for 52:54:00:1f:06:2a on dp0s8 [ 4147.487782] --- vlan_dev_set_mac_address: id 40, call dev_uc_del for 10:20:30:40:50:61 dp0s8 [ 4147.487784] --- __hw_addr_del_entry: refcount 0 for 10:20:30:40:50:61 [ 4147.487786] device dp0s8 left promiscuous mode 2) WITH IMPROVEMENT FOR VLAN MAC TO FOLLOW THAT OF PHYSICAL INTF, UNLESS EXPLICITLY SET: ip addr show dev 
dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8 addr 10:20:30:40:50:61 ip addr show dev dp0s8 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff [ 196.574789] 8021q: --- vlan_sync_address: id 40, for 10:20:30:40:50:61 on dp0s8.40 (from dp0 s8) [ 196.575004] --- vlan_sync_address: id 40, update to 10:20:30:40:50:61 on dp0s8.40 (from dp0s 8) sudo ip link set dp0s8 addr 52:54:00:1f:06:2a ip addr show dev dp0s8 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 52:54:00:1f:06:2a brd ff:ff:ff:ff:ff:ff [ 265.683313] 8021q: --- vlan_sync_address: id 40, for 52:54:00:1f:06:2a on dp0s8.40 (from dp0 s8) [ 265.683534] --- vlan_sync_address: id 40, update to 52:54:00:1f:06:2a on dp0s8.40 (from dp0s 8) sudo ip link set dp0s8.40 addr 10:20:30:40:50:61 sudo ip link set dp0s8 addr 10:20:30:40:50:99 ip addr show dev dp0s8 | grep ether link/ether 10:20:30:40:50:99 brd ff:ff:ff:ff:ff:ff ip addr show dev dp0s8.40 | grep ether link/ether 10:20:30:40:50:61 brd ff:ff:ff:ff:ff:ff sudo ip link set dp0s8 addr 10:20:30:40:50:61 [ 5561.791222] --- vlan_dev_set_mac_address: id 40, for 10:20:30:40:50:61 on dp0s8 [ 5561.791225] --- vlan_dev_set_mac_address: id 40, call dev_uc_add for 10:20:30:40:50:61 on dp 0s8 [ 5561.791227] device dp0s8 entered promiscuous mode [
Re: [PATCH iproute2 net-next] ifstat: move to new RTM_GETSTATS api
On 16-04-30 02:41 AM, Roopa Prabhu wrote: From: Roopa PrabhuThis patch modifies ifstat to use the new RTM_GETSTATS api to query stats from the kernel. In the process this also moves ifstat to use 64 bit stats. Breaks old kernels? May need to keep backward compat of RTM_NEWLINK and even new main struct for GETSTATS. cheers, jamal
Re: [PATCH v2 net-next 0/7] net: make TCP preemptible
Hello, On Fri, 29 Apr 2016, Eric Dumazet wrote: > I had corruptions issues and a dying HDD one month ago. > > I have a brand new HDD, but maybe the SSD I use for my git trees is > dying as well :( > > But I've seen this strange patterns in the past, it might be the old > text editor I am using. You can also check for failing capacitors around the RAM slots... Memory tests can help too. Regards -- Julian Anastasov
[PATCH net-next v3 3/4] bridge: vlan: learn to count
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets allocated a per-cpu stats which is then set in each per-port vlan context for quick access. The br_allowed_ingress() common function is used to account for Rx packets and the br_handle_vlan() common function is used to account for Tx packets. Stats accounting is performed only if the bridge-wide vlan_stats_enabled option is set either via sysfs or netlink. A struct hole between vlan_enabled and vlan_proto is used for the new option so it is in the same cache line. Currently it is binary (on/off) but it is intentionally restricted to exactly 0 and 1 since other values will be used in the future for different purposes (e.g. per-port stats). Signed-off-by: Nikolay Aleksandrov--- v3: make stats accounting optional with default to off v2: no change include/uapi/linux/if_link.h | 1 + net/bridge/br_netlink.c | 13 ++- net/bridge/br_private.h | 13 ++- net/bridge/br_sysfs_br.c | 17 + net/bridge/br_vlan.c | 82 5 files changed, 110 insertions(+), 16 deletions(-) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 8577c0e4116f..cc50261baf59 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -272,6 +272,7 @@ enum { IFLA_BR_NF_CALL_ARPTABLES, IFLA_BR_VLAN_DEFAULT_PVID, IFLA_BR_PAD, + IFLA_BR_VLAN_STATS_ENABLED, __IFLA_BR_MAX, }; diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 6bae1125e36d..7fba1f018bc9 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -850,6 +850,7 @@ static const struct nla_policy br_policy[IFLA_BR_MAX + 1] = { [IFLA_BR_NF_CALL_IP6TABLES] = { .type = NLA_U8 }, [IFLA_BR_NF_CALL_ARPTABLES] = { .type = NLA_U8 }, [IFLA_BR_VLAN_DEFAULT_PVID] = { .type = NLA_U16 }, + [IFLA_BR_VLAN_STATS_ENABLED] = { .type = NLA_U8 }, }; static int br_changelink(struct net_device *brdev, struct nlattr *tb[], @@ -921,6 +922,14 @@ static int br_changelink(struct net_device *brdev, struct nlattr *tb[], if (err) return 
err; } + + if (data[IFLA_BR_VLAN_STATS_ENABLED]) { + __u8 vlan_stats = nla_get_u8(data[IFLA_BR_VLAN_STATS_ENABLED]); + + err = br_vlan_set_stats(br, vlan_stats); + if (err) + return err; + } #endif if (data[IFLA_BR_GROUP_FWD_MASK]) { @@ -1082,6 +1091,7 @@ static size_t br_get_size(const struct net_device *brdev) #ifdef CONFIG_BRIDGE_VLAN_FILTERING nla_total_size(sizeof(__be16)) + /* IFLA_BR_VLAN_PROTOCOL */ nla_total_size(sizeof(u16)) +/* IFLA_BR_VLAN_DEFAULT_PVID */ + nla_total_size(sizeof(u8)) + /* IFLA_BR_VLAN_STATS_ENABLED */ #endif nla_total_size(sizeof(u16)) +/* IFLA_BR_GROUP_FWD_MASK */ nla_total_size(sizeof(struct ifla_bridge_id)) + /* IFLA_BR_ROOT_ID */ @@ -1167,7 +1177,8 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev) #ifdef CONFIG_BRIDGE_VLAN_FILTERING if (nla_put_be16(skb, IFLA_BR_VLAN_PROTOCOL, br->vlan_proto) || - nla_put_u16(skb, IFLA_BR_VLAN_DEFAULT_PVID, br->default_pvid)) + nla_put_u16(skb, IFLA_BR_VLAN_DEFAULT_PVID, br->default_pvid) || + nla_put_u8(skb, IFLA_BR_VLAN_STATS_ENABLED, br->vlan_stats_enabled)) return -EMSGSIZE; #endif #ifdef CONFIG_BRIDGE_IGMP_SNOOPING diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 1b5d145dfcbf..8b644069a1a1 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -77,12 +77,21 @@ struct bridge_mcast_querier { }; #endif +struct br_vlan_stats { + u64 rx_bytes; + u64 rx_packets; + u64 tx_bytes; + u64 tx_packets; + struct u64_stats_sync syncp; +}; + /** * struct net_bridge_vlan - per-vlan entry * * @vnode: rhashtable member * @vid: VLAN id * @flags: bridge vlan flags + * @stats: per-cpu VLAN statistics * @br: if MASTER flag set, this points to a bridge struct * @port: if MASTER flag unset, this points to a port struct * @refcnt: if MASTER flag set, this is bumped for each port referencing it @@ -100,6 +109,7 @@ struct net_bridge_vlan { struct rhash_head vnode; u16 vid; u16 flags; + struct br_vlan_stats __percpu *stats; union { struct net_bridge *br; 
struct net_bridge_port *port; @@ -342,6 +352,7 @@ struct net_bridge #ifdef CONFIG_BRIDGE_VLAN_FILTERING struct net_bridge_vlan_group__rcu *vlgrp; u8 vlan_enabled; + u8 vlan_stats_enabled; __be16
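The stats layout this patch adds follows the usual per-cpu counter scheme: each CPU increments its own copy locklessly on the fast path, and a reader sums all copies on demand (the kernel additionally wraps the counters in u64_stats_sync so 32-bit readers never see a torn 64-bit value). A userspace model of the split-then-sum structure, with illustrative names:

```c
#include <assert.h>

#define NCPUS 4

struct model_vlan_stats {
	unsigned long rx_packets;
	unsigned long rx_bytes;
};

/* One counter set per "CPU"; writers touch only their own slot, so the
 * fast path needs no lock. */
static struct model_vlan_stats percpu_stats[NCPUS];

static void account_rx(int cpu, unsigned long bytes)
{
	percpu_stats[cpu].rx_packets++;
	percpu_stats[cpu].rx_bytes += bytes;
}

/* Readers (e.g. the netlink dump) fold all per-CPU copies together. */
static struct model_vlan_stats sum_stats(void)
{
	struct model_vlan_stats total = { 0, 0 };
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		total.rx_packets += percpu_stats[cpu].rx_packets;
		total.rx_bytes += percpu_stats[cpu].rx_bytes;
	}
	return total;
}
```

Keeping the summation on the reader side is what makes it cheap enough to account every packet once vlan_stats_enabled is set.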
[PATCH net-next v3 1/4] net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter
The new prividx argument allows the current dumping device to save a private state counter which would enable it to continue dumping from where it left off. And the idxattr is used to save the current idx user so multiple prividx using attributes can be requested at the same time as suggested by Roopa Prabhu. Signed-off-by: Nikolay Aleksandrov--- v3: no change v2: improve the error check in rtnl_fill_statsinfo, rename lidx to prividx, squash patch 2 into this one and save the current idx user instead of restricting only one net/core/rtnetlink.c | 44 +++- 1 file changed, 31 insertions(+), 13 deletions(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 5503dfe6a050..de529a20cd18 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3444,13 +3444,21 @@ out: return err; } +static bool stats_attr_valid(unsigned int mask, int attrid, int idxattr) +{ + return (mask & IFLA_STATS_FILTER_BIT(attrid)) && + (!idxattr || idxattr == attrid); +} + static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, int type, u32 pid, u32 seq, u32 change, - unsigned int flags, unsigned int filter_mask) + unsigned int flags, unsigned int filter_mask, + int *idxattr, int *prividx) { struct if_stats_msg *ifsm; struct nlmsghdr *nlh; struct nlattr *attr; + int s_prividx = *prividx; ASSERT_RTNL(); @@ -3462,7 +3470,7 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, ifsm->ifindex = dev->ifindex; ifsm->filter_mask = filter_mask; - if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) { + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_64, *idxattr)) { struct rtnl_link_stats64 *sp; attr = nla_reserve_64bit(skb, IFLA_STATS_LINK_64, @@ -3480,7 +3488,11 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, return 0; nla_put_failure: - nlmsg_cancel(skb, nlh); + /* not a multi message or no progress mean a real error */ + if (!(flags & NLM_F_MULTI) || s_prividx == *prividx) + nlmsg_cancel(skb, nlh); 
+ else + nlmsg_end(skb, nlh); return -EMSGSIZE; } @@ -3494,7 +3506,7 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, { size_t size = 0; - if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_64, 0)) size += nla_total_size_64bit(sizeof(struct rtnl_link_stats64)); return size; @@ -3503,8 +3515,9 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh) { struct net *net = sock_net(skb->sk); - struct if_stats_msg *ifsm; struct net_device *dev = NULL; + int idxattr = 0, prividx = 0; + struct if_stats_msg *ifsm; struct sk_buff *nskb; u32 filter_mask; int err; @@ -3528,7 +3541,7 @@ static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh) err = rtnl_fill_statsinfo(nskb, dev, RTM_NEWSTATS, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0, - 0, filter_mask); + 0, filter_mask, , ); if (err < 0) { /* -EMSGSIZE implies BUG in if_nlmsg_stats_size */ WARN_ON(err == -EMSGSIZE); @@ -3542,18 +3555,19 @@ static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh) static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb) { + int h, s_h, err, s_idx, s_idxattr, s_prividx; struct net *net = sock_net(skb->sk); + unsigned int flags = NLM_F_MULTI; struct if_stats_msg *ifsm; - int h, s_h; - int idx = 0, s_idx; - struct net_device *dev; struct hlist_head *head; - unsigned int flags = NLM_F_MULTI; + struct net_device *dev; u32 filter_mask = 0; - int err; + int idx = 0; s_h = cb->args[0]; s_idx = cb->args[1]; + s_idxattr = cb->args[2]; + s_prividx = cb->args[3]; cb->seq = net->dev_base_seq; @@ -3571,7 +3585,8 @@ static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb) err = rtnl_fill_statsinfo(skb, dev, RTM_NEWSTATS, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, 0, - flags, filter_mask); + flags, filter_mask, +
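The prividx mechanism is easiest to see stripped of the netlink details: a filler that runs out of buffer space records how far it got, and the next dump call resumes from that point, resetting the cursor to zero once the attribute completes. A minimal model (returning -1 where the real code returns -EMSGSIZE):

```c
#include <assert.h>

/* Model of a resumable filler: emit items until the budget (buffer
 * space) runs out; on failure record the resume point in *prividx so a
 * later call continues where this one stopped. */
static int fill_items(int nitems, int budget, int *prividx, int *emitted)
{
	int i;

	for (i = *prividx; i < nitems; i++) {
		if (*emitted >= budget) {
			*prividx = i;	/* resume here on the next dump */
			return -1;	/* -EMSGSIZE in the real code */
		}
		(*emitted)++;
	}
	*prividx = 0;	/* done: next attribute starts from scratch */
	return 0;
}
```

This is also why rtnl_fill_statsinfo ends a partially-filled multi message instead of cancelling it: the items emitted before the budget ran out must survive so the resumed dump doesn't repeat them.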
[PATCH net-next v3 4/4] bridge: netlink: export per-vlan stats
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and get_linkxstats_size) in order to export the per-vlan stats. The paddings were added because soon these fields will be needed for per-port per-vlan stats (or something else if someone beats me to it) so avoiding at least a few more netlink attributes. Signed-off-by: Nikolay Aleksandrov--- v3: no change v2: remove unused pvid pointer, fix the case where bridge has 0 vlans but there're global contexts and move to rtnl link type private attributes nested into a LINK_XSTATS_TYPE_ attribute. The paddings were added because soon these fields will be needed for per-port per-vlan stats (or something else if someone beats me to it) so avoiding at least a few more netlink attributes. include/uapi/linux/if_bridge.h | 18 include/uapi/linux/if_link.h | 1 + net/bridge/br_netlink.c| 65 ++ net/bridge/br_private.h| 7 + net/bridge/br_vlan.c | 27 ++ 5 files changed, 118 insertions(+) diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h index 0536eefff9bf..397d503fdedb 100644 --- a/include/uapi/linux/if_bridge.h +++ b/include/uapi/linux/if_bridge.h @@ -134,6 +134,16 @@ struct bridge_vlan_info { __u16 vid; }; +struct bridge_vlan_xstats { + __u64 rx_bytes; + __u64 rx_packets; + __u64 tx_bytes; + __u64 tx_packets; + __u16 vid; + __u16 pad1; + __u32 pad2; +}; + /* Bridge multicast database attributes * [MDBA_MDB] = { * [MDBA_MDB_ENTRY] = { @@ -233,4 +243,12 @@ enum { }; #define MDBA_SET_ENTRY_MAX (__MDBA_SET_ENTRY_MAX - 1) +/* Embedded inside LINK_XSTATS_TYPE_BRIDGE */ +enum { + BRIDGE_XSTATS_UNSPEC, + BRIDGE_XSTATS_VLAN, + __BRIDGE_XSTATS_MAX +}; +#define BRIDGE_XSTATS_MAX (__BRIDGE_XSTATS_MAX - 1) + #endif /* _UAPI_LINUX_IF_BRIDGE_H */ diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index cc50261baf59..ed59cbd6d129 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -818,6 +818,7 
@@ enum { */ enum { LINK_XSTATS_TYPE_UNSPEC, + LINK_XSTATS_TYPE_BRIDGE, __LINK_XSTATS_TYPE_MAX }; #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1) diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 7fba1f018bc9..a5343c7232bf 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -1234,6 +1234,69 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev) return 0; } +static size_t br_get_linkxstats_size(const struct net_device *dev) +{ + struct net_bridge *br = netdev_priv(dev); + struct net_bridge_vlan_group *vg; + struct net_bridge_vlan *v; + int numvls = 0; + + vg = br_vlan_group(br); + if (!vg) + return 0; + + /* we need to count all, even placeholder entries */ + list_for_each_entry(v, >vlan_list, vlist) + numvls++; + + /* account for the vlans and the link xstats type nest attribute */ + return numvls * nla_total_size(sizeof(struct bridge_vlan_xstats)) + + nla_total_size(0); +} + +static int br_fill_linkxstats(struct sk_buff *skb, const struct net_device *dev, + int *prividx) +{ + struct net_bridge *br = netdev_priv(dev); + struct net_bridge_vlan_group *vg; + struct net_bridge_vlan *v; + struct nlattr *nest; + int vl_idx = 0; + + vg = br_vlan_group(br); + if (!vg) + goto out; + nest = nla_nest_start(skb, LINK_XSTATS_TYPE_BRIDGE); + if (!nest) + return -EMSGSIZE; + list_for_each_entry(v, >vlan_list, vlist) { + struct bridge_vlan_xstats vxi; + struct br_vlan_stats stats; + + if (vl_idx++ < *prividx) + continue; + memset(, 0, sizeof(vxi)); + vxi.vid = v->vid; + br_vlan_get_stats(v, ); + vxi.rx_bytes = stats.rx_bytes; + vxi.rx_packets = stats.rx_packets; + vxi.tx_bytes = stats.tx_bytes; + vxi.tx_packets = stats.tx_packets; + + if (nla_put(skb, BRIDGE_XSTATS_VLAN, sizeof(vxi), )) + goto nla_put_failure; + } + nla_nest_end(skb, nest); + *prividx = 0; +out: + return 0; + +nla_put_failure: + nla_nest_end(skb, nest); + *prividx = vl_idx; + + return -EMSGSIZE; +} static struct rtnl_af_ops br_af_ops 
__read_mostly = { .family = AF_BRIDGE, @@ -1252,6 +1315,8 @@ struct rtnl_link_ops br_link_ops __read_mostly = { .dellink= br_dev_delete, .get_size = br_get_size, .fill_info = br_fill_info, +
[PATCH net-next v3 0/4] bridge: per-vlan stats
Hi, This set adds support for bridge per-vlan statistics. In order to be able to dump statistics for many vlans we need a way to continue dumping after reaching maximum size, thus patches 01 and 02 extend the new stats API with a per-device extended link stats attribute and callback which can save its local state and continue where it left off afterwards. I considered using the already existing "fill_xstats" callback but it gets confusing since we need to separate the linkinfo dump from the new stats api dump and adding a flag/argument to do that just looks messy. I don't think the rtnl_link_ops size is an issue, so adding these seemed like the cleaner approach. Patches 03 and 04 add the stats support and netlink dump support respectively. The stats accounting is controlled via a bridge option which is default off, thus the performance impact is kept minimal. I've tested this set with both old and modified iproute2, kmemleak on and some traffic stress tests while adding/removing vlans and ports. 
v3: - drop the RCU pvid patch and remove one pointer fetch as requested - make stats accounting optional with default to off, the option is in the same cache line as vlan_proto and vlan_enabled, so it is already fetched before the fast path check thus the performance impact is minimal, this also allows us to avoid one vlan lookup and return early when using pvid - rebased and retested v2: - Improve the error checking, rename lidx to prividx and save the current idx user instead of restricting it to one in patch 01 - squash patch 02 into 01 and remove the restriction - add callback descriptions, improve the size calculation and change the xstats message structure to have an embedding level per rtnl link type so we can avoid one call to get the link type (and thus filter on it) and also each link type can now have any number of private attributes inside - fix a problem where the vlan stats are not dumped if the bridge has 0 vlans on it but has vlans on the ports, add bridge link type private attributes and also add paddings for future extensions to avoid at least a few netlink attributes and improve struct alignment - drop the is_skb_forwardable argument constifying patch as it's not needed anymore, but it's a nice cleanup which I'll send separately Thank you, Nik Nikolay Aleksandrov (4): net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter net: rtnetlink: add linkxstats callbacks and attribute bridge: vlan: learn to count bridge: netlink: export per-vlan stats include/net/rtnetlink.h| 7 +++ include/uapi/linux/if_bridge.h | 18 +++ include/uapi/linux/if_link.h | 14 ++ net/bridge/br_netlink.c| 78 - net/bridge/br_private.h| 18 +++ net/bridge/br_sysfs_br.c | 17 +++ net/bridge/br_vlan.c | 109 +++-- net/core/rtnetlink.c | 74 +++- 8 files changed, 307 insertions(+), 28 deletions(-) -- 2.4.11
[PATCH net-next v3 2/4] net: rtnetlink: add linkxstats callbacks and attribute
Add callbacks to calculate the size and fill link extended statistics which can be split into multiple messages and are dumped via the new rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute. Also add that attribute to the idx mask check since it is expected to be able to save state and resume dumping (e.g. future bridge per-vlan stats will be dumped via this attribute and callbacks). Each link type should nest its private attributes under the per-link type attribute. This allows to have any number of separated private attributes and to avoid one call to get the dev link type. Signed-off-by: Nikolay Aleksandrov--- v3: no change v2: add callback descriptions and make the size calculation more accurate, change the netlink xstats message structure with one more level for each rtnl link type which allows for private link type attributes and also allows us to avoid 1 call to get the dev link type. include/net/rtnetlink.h | 7 +++ include/uapi/linux/if_link.h | 12 net/core/rtnetlink.c | 30 ++ 3 files changed, 49 insertions(+) diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h index 2f87c1ba13de..006a7b81d758 100644 --- a/include/net/rtnetlink.h +++ b/include/net/rtnetlink.h @@ -47,6 +47,9 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh) * @get_num_rx_queues: Function to determine number of receive queues * to create when creating a new device. 
* @get_link_net: Function to get the i/o netns of the device + * @get_linkxstats_size: Function to calculate the required room for + * dumping device-specific extended link stats + * @fill_linkxstats: Function to dump device-specific extended link stats */ struct rtnl_link_ops { struct list_headlist; @@ -95,6 +98,10 @@ struct rtnl_link_ops { const struct net_device *dev, const struct net_device *slave_dev); struct net *(*get_link_net)(const struct net_device *dev); + size_t (*get_linkxstats_size)(const struct net_device *dev); + int (*fill_linkxstats)(struct sk_buff *skb, + const struct net_device *dev, + int *prividx); }; int __rtnl_link_register(struct rtnl_link_ops *ops); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index d82de331bb6b..8577c0e4116f 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -802,6 +802,7 @@ struct if_stats_msg { enum { IFLA_STATS_UNSPEC, /* also used as 64bit pad attribute */ IFLA_STATS_LINK_64, + IFLA_STATS_LINK_XSTATS, __IFLA_STATS_MAX, }; @@ -809,4 +810,15 @@ enum { #define IFLA_STATS_FILTER_BIT(ATTR)(1 << (ATTR - 1)) +/* These are embedded into IFLA_STATS_LINK_XSTATS: + * [IFLA_STATS_LINK_XSTATS] + * -> [LINK_XSTATS_TYPE_xxx] + *-> [rtnl link type specific attributes] + */ +enum { + LINK_XSTATS_TYPE_UNSPEC, + __LINK_XSTATS_TYPE_MAX +}; +#define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1) + #endif /* _UAPI_LINUX_IF_LINK_H */ diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index de529a20cd18..d471f097c739 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3483,6 +3483,26 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, dev_get_stats(dev, sp); } + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_XSTATS, *idxattr)) { + const struct rtnl_link_ops *ops = dev->rtnl_link_ops; + + if (ops && ops->fill_linkxstats) { + int err; + + *idxattr = IFLA_STATS_LINK_XSTATS; + attr = nla_nest_start(skb, + IFLA_STATS_LINK_XSTATS); + if 
(!attr) + goto nla_put_failure; + + err = ops->fill_linkxstats(skb, dev, prividx); + nla_nest_end(skb, attr); + if (err) + goto nla_put_failure; + *idxattr = 0; + } + } + nlmsg_end(skb, nlh); return 0; @@ -3509,6 +3529,16 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_64, 0)) size += nla_total_size_64bit(sizeof(struct rtnl_link_stats64)); + if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_XSTATS, 0)) { + const struct rtnl_link_ops *ops = dev->rtnl_link_ops; + + if (ops && ops->get_linkxstats_size) { + size +=
[PATCH iproute2 net-next] ifstat: move to new RTM_GETSTATS api
From: Roopa Prabhu
This patch modifies ifstat to use the new RTM_GETSTATS api to query stats from the kernel. In the process this also moves ifstat to use 64 bit stats. Signed-off-by: Roopa Prabhu --- include/libnetlink.h | 3 +++ include/linux/if_link.h | 22 ++ include/linux/rtnetlink.h | 5 + lib/libnetlink.c | 25 + misc/ifstat.c | 35 +++ 5 files changed, 74 insertions(+), 16 deletions(-) diff --git a/include/libnetlink.h b/include/libnetlink.h index 491263f..e623a3c 100644 --- a/include/libnetlink.h +++ b/include/libnetlink.h @@ -44,6 +44,9 @@ int rtnl_dump_request(struct rtnl_handle *rth, int type, void *req, int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n) __attribute__((warn_unused_result)); +int rtnl_stats_dump_request(struct rtnl_handle *rth, __u32 filt_mask) + __attribute__((warn_unused_result)); + struct rtnl_ctrl_data { int nsid; }; diff --git a/include/linux/if_link.h b/include/linux/if_link.h index 6a688e8..68f3270 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -165,6 +165,8 @@ enum { #define IFLA_RTA(r) ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifinfomsg)))) #define IFLA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifinfomsg)) +#define IFLA_RTA_STATS(r) ((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct if_stats_msg)))) + enum { IFLA_INET_UNSPEC, IFLA_INET_CONF, @@ -777,4 +779,24 @@ enum { #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1) +/* STATS section */ + +struct if_stats_msg { + __u8 family; + __u8 pad1; + __u16 pad2; + __u32 ifindex; + __u32 filter_mask; +}; + +enum { + IFLA_STATS_UNSPEC, + IFLA_STATS_LINK_64, + __IFLA_STATS_MAX, +}; + +#define IFLA_STATS_MAX (__IFLA_STATS_MAX - 1) + +#define IFLA_STATS_FILTER_BIT(ATTR) (1 << (ATTR - 1)) + #endif /* _LINUX_IF_LINK_H */ diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index 6aaa2a3..e8cdff5 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -139,6 +139,11 @@ enum { RTM_GETNSID = 90, #define RTM_GETNSID
RTM_GETNSID + RTM_NEWSTATS = 92, +#define RTM_NEWSTATS RTM_NEWSTATS + RTM_GETSTATS = 94, +#define RTM_GETSTATS RTM_GETSTATS + __RTM_MAX, #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1) }; diff --git a/lib/libnetlink.c b/lib/libnetlink.c index a90e52c..95f80fc 100644 --- a/lib/libnetlink.c +++ b/lib/libnetlink.c @@ -838,3 +838,28 @@ int __parse_rtattr_nested_compat(struct rtattr *tb[], int max, struct rtattr *rt memset(tb, 0, sizeof(struct rtattr *) * (max + 1)); return 0; } + +int rtnl_stats_dump_request(struct rtnl_handle *rth, __u32 filt_mask) +{ + struct { + struct nlmsghdr nlh; + struct if_stats_msg ifsm; + } req = { + .nlh.nlmsg_type = RTM_GETSTATS, + .nlh.nlmsg_flags = NLM_F_DUMP|NLM_F_REQUEST, + .nlh.nlmsg_pid = 0, + .ifsm.family = AF_UNSPEC, + .ifsm.ifindex = 0, + .ifsm.filter_mask = filt_mask, + }; + + if (!filt_mask) { + perror("invalid stats filter mask"); + return -1; + } + + req.nlh.nlmsg_seq = rth->dump = ++rth->seq; + req.nlh.nlmsg_len = sizeof(req); + + return send(rth->fd, (void *)&req, sizeof(req), 0); +} diff --git a/misc/ifstat.c b/misc/ifstat.c index abbb4e7..bf9b9fa 100644 --- a/misc/ifstat.c +++ b/misc/ifstat.c @@ -35,6 +35,8 @@ #include +#include "utils.h" + int dump_zeros; int reset_history; int ignore_history; @@ -49,6 +51,8 @@ double W; char **patterns; int npatterns; +struct rtnl_handle rth; + char info_source[128]; int source_mismatch; @@ -58,9 +62,9 @@ struct ifstat_ent { struct ifstat_ent *next; char *name; int ifindex; - unsigned long long val[MAXS]; + __u64 val[MAXS]; double rate[MAXS]; - __u32 ival[MAXS]; + __u64 ival[MAXS]; }; static const char *stats[MAXS] = { @@ -109,32 +113,29 @@ static int match(const char *id) static int get_nlmsg(const struct sockaddr_nl *who, struct nlmsghdr *m, void *arg) { - struct ifinfomsg *ifi = NLMSG_DATA(m); - struct rtattr *tb[IFLA_MAX+1]; + struct if_stats_msg *ifsm = NLMSG_DATA(m); + struct rtattr *tb[IFLA_STATS_MAX+1]; int len = m->nlmsg_len; struct ifstat_ent *n; int i; - if (m->nlmsg_type !=
RTM_NEWLINK) + if (m->nlmsg_type != RTM_NEWSTATS) return 0; - len -= NLMSG_LENGTH(sizeof(*ifi)); + len -=