Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Willy Tarreau
On Sat, Dec 08, 2018 at 10:14:47PM +0200, Ilias Apalodimas wrote:
> On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
> > >  
> > > I want to make sure you guys thought about splice() stuff, and
> > > skb_try_coalesce(), and GRO, and skb cloning, and ...
> > 
> > Thanks for the pointers. To Ilias, we need to double check 
> > skb_try_coalesce()
> > code path, as it does look like we don't handle this correctly.
> > 
> 
> Noted. We did check skb cloning, but not this one indeed

I'll try to run some tests on my clearfog later, but for now I'm still
100% busy until haproxy 1.9 gets released.

Cheers,
Willy


Re: bring back IPX and NCPFS, please!

2018-11-09 Thread Willy Tarreau
On Fri, Nov 09, 2018 at 06:30:14PM +0100, Johannes C. Schulz wrote:
> Hello Willy, hello Stephen
> 
> Thank you for your reply.
> But I'm not able to maintain or code these modules. I'm just a bloody
> user/webdev.

That's what we've all claimed before taking over something many years
ago, you know :-)  The most important thing is time and the willingness
to try to do it.

You could first look at the latest kernel supporting those, check
whether they still work fine in your environment (not everyone has
access to these ones anymore), and if so, then try to copy that code
over newer kernels. Sometimes it will not build with an obvious error
that you'll be able to fix by yourself, sometimes it will be harder
and you'll have to ask for help and/or figure API changes in "git log".
After working many hours on this you'll be much more at ease with this
code and you'll possibly be able to make it work on your kernel version.
This is already a huge step because even if you don't consider it as
being in a mergeable state (too hackish, dirty etc), you have the
option to run it as your own patch for a while.

After this you'll seek some more help about the process needed to get
these merged back and to maintain them as long as you estimate you can
(possibly mark it deprecated and keep it as long as you can). And who
knows, given nothing changes in this area these days, maybe it will be
trivial to maintain this FS for another decade and you'll have learned
something fun and useful.

> It would be really nice if these modules will find a good
> maintainer!

Just think again about the advantages you have over many other people :
  - access to the environment
  - real use case for the feature

There's nothing wrong with trying and failing multiple times, even giving
up if you find the task too hard. But giving up before trying is quite
sad in your situation.

Cheers,
Willy


Re: bring back IPX and NCPFS, please!

2018-11-09 Thread Willy Tarreau
On Fri, Nov 09, 2018 at 02:23:27PM +0100, Johannes C. Schulz wrote:
> Hello all!
> 
> I'd like to ask you to bring back the IPX and NCPFS modules to the kernel.
> For whatever reason my admins use Novell shares on our network, which I'm
> not able to use anymore - I'm forced to use cifs instead (and the admins
> will kill the cifs-shares in some time), because my kernel (4.18) does
> not have support for ncpfs anymore.
> Maybe we at my work are not enough people for these modules to come back
> just for us, but maybe there are other people out there.
> Thank you.

Well, like any code, it requires time and skills. If nobody with the
required skills is available for this anymore, there's no way you'll
get a feature back. However you could always step up to maintain it
yourself if you have the time and are willing to develop your own
skills at it. It's how maintainers change over time for certain parts
of the system, so you have an opportunity here.

Just my two cents,
Willy


Re: Kernel Panic on high bandwidth transfer over wifi

2018-08-29 Thread Willy Tarreau
On Wed, Aug 29, 2018 at 11:42:44AM +, Nathaniel Munk wrote:
> As you can see from the attached log

You apparently forgot to attach the log.

Willy


Re: how to (cross)connect two (physical) eth ports for ping test?

2018-08-18 Thread Willy Tarreau
On Sat, Aug 18, 2018 at 09:10:25PM +0200, Andrew Lunn wrote:
> On Sat, Aug 18, 2018 at 01:39:50PM -0400, Robert P. J. Day wrote:
> > 
> >   (i'm sure this has been explained many times before, so a link
> > covering this will almost certainly do just fine.)
> > 
> >   i want to loop one physical ethernet port into another, and just
> > ping the daylights from one to the other for stress testing. my fedora
> > laptop doesn't actually have two unused ethernet ports, so i just want
> > to emulate this by slapping a couple startech USB/net adapters into
> > two empty USB ports, setting this up, then doing it all over again
> > monday morning on the actual target system, which does have multiple
> > ethernet ports.
> > 
> >   so if someone can point me to the recipe, that would be great and
> > you can stop reading.
> > 
> >   as far as my tentative solution goes, i assume i need to put at
> > least one of the physical ports in a network namespace via "ip netns",
> > then ping from the netns to the root namespace. or, going one step
> > further, perhaps putting both interfaces into two new namespaces, and
> > setting up forwarding.
> 
> Namespaces is a good solution. Something like this should work:
> 
> ip netns add namespace1
> ip netns add namespace2
> 
> ip link set eth1 netns namespace1
> ip link set eth2 netns namespace2
> 
> ip netns exec namespace1 \
> ip addr add 10.42.42.42/24 dev eth1
> 
> ip netns exec namespace1 \
> ip link set eth1 up
> 
> ip netns exec namespace2 \
> ip addr add 10.42.42.24/24 dev eth2
> 
> ip netns exec namespace2 \
> ip link set eth2 up
> 
> ip netns exec namespace1 \
> ping 10.42.42.24
> 
> You might also want to consider iperf3 for stress testing, depending
> on the sort of stress you need.

FWIW I have a setup somewhere involving ip rule + ip route which achieves
the same without involving namespaces. It's a bit hackish but sometimes
convenient. I can dig it up if someone is interested.

Regards,
Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-05 Thread Willy Tarreau
On Tue, Jun 05, 2018 at 02:33:03PM +0200, Bjørn Mork wrote:
> > I do have IPv6 at home (a /48, waste of addressing space, I'd be fine
> > with less),
> 
> Any reason you would want less?  Any reason the ISP should give you
> less?

What I mean is that *if* the availability of /48 networks was an issue
for some ISPs, I'd be fine with less because I don't plan to deploy 64k
networks at home, though I already have ~9 around the firewall and don't
expect to go much further.

> > Maybe setting up a public list of ISPs where users don't have at least
> > a /60 by default could help, but I suspect that most of them will
> > consider that as long as their competitors are on the list there's no
> > emergency.
> 
> Exactly.  And the number of users using the list as the primary
> parameter for selecting an ISP would be close to 0.  The critical part
> is not the list, but making large enough groups of users consider IPv6
> an important parameter when selecting ISPs.

In fact the IoT trend could play a role here by letting users know that
they can remotely access their fridge and whatever stupid device they've
deployed. But the reality is the opposite : some gateway services are/will
be offered for a fee to make these devices remotely accessible, and
the claimed security provided by this gateway will be presented as a real
benefit compared to the risks of anyone directly accessing your private
life over IPv6. So I don't have much hope for the future in this area
either.

Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-03 Thread Willy Tarreau
On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote:
> 
> 
> On 06/03/2018 01:37 PM, Tom Herbert wrote:
> 
> > This is not an inconsequential mechanism that is being proposed. It's
> > a modification to IP protocol that is intended to work on the
> > Internet, but it looks like the draft hasn't been updated for two
> > years and it is not adopted by any IETF working group. I don't see how
> > this can go anywhere without IETF support. Also, I suggest that you
> > look at the IPv10 proposal since that was very similar in intent. One
> > of the reasons that IPv10 was shot down was because protocol transition
> > mechanisms were more interesting ten years ago than today. IPv6 has
> > good traction now. In fact, it's probably the case that it's now
> > easier to bring up IPv6 than to try to make IPv4 options work over the
> > Internet.
> 
> +1
> 
> Many hosts do not use IPv4 anymore.
> 
> We even have the project making IPv4 support in linux optional.

I agree on these points, but I'd like to figure out what can be done to put
a bit more pressure on ISPs to *always* provide IPv6. It's still very
hard to get decent connectivity at home, and without this it will continue
to be marginalized.

I do have IPv6 at home (a /48, waste of addressing space, I'd be fine
with less), there's none at work (I don't even know if the ISP supports
it, at least it was never ever mentioned so probably they don't know
about this), and some ISPs only provide a /64 which is as ridiculous
as providing a single address as it forces the end user to NAT thus
breaking the end-to-end principle. Ideally with IoT at the door, every
home connection should have at least a /60 and enterprises should have
a /56, and this by default, without having to request anything.

Maybe setting up a public list of ISPs where users don't have at least
a /60 by default could help, but I suspect that most of them will
consider that as long as their competitors are on the list there's no
emergency.

Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-02 Thread Willy Tarreau
On Sat, Jun 02, 2018 at 12:17:12PM -0400, Sam Patton wrote:
> As far as application examples, check out this simple netcat-like
> program I use for testing:
> 
> https://github.com/EnIP/enhancedip/blob/master/userspace/netcat/netcat.c
> 
> Lines 61-67 show how to connect directly via an EnIP address.  The
> netcat-like application uses a header file called eip.h.

OK so basically we need to use a new address structure (which makes sense),
however I'm not sure I understand how the system is supposed to know it has
to use EnIP instead of IPv4 if the sin_family remains the same :

...
   memset(&serv_addr, 0, sizeof(struct sockaddr_ein));
   serv_addr.sin_family= AF_INET;
   serv_addr.sin_port = htons(portno);
   serv_addr.sin_addr1=inet_addr(argv[1]);
   serv_addr.sin_addr2=inet_addr(argv[2]);

   if (connect(sockfd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0) 
error("ERROR connecting");
...

Does it solely rely on the 3rd argument to connect() to know that it
may read sin_addr2 ? If so, that sounds a bit fragile to me because I
*think* (but could be wrong) that right now any size at least as large
as sockaddr_in work for IPv4 (typically I think if you pass a struct
sockaddr_storage with its size it should work). So there would be a
risk of breaking applications if this is the case.

I suspect that using a different family would make things more robust
and more friendly to applications. But that's just a first opinion,
you have worked longer than me on this.

>  You can look at it here:
> 
> https://github.com/EnIP/enhancedip/blob/master/userspace/include/eip.h

Thank you. There's a problem with your #pragma pack there, it will
propagate to all struct declared after this file is included and
will force all subsequent structs to be packed. It would be preferable
to use __attribute__((packed)) on the struct or to use pragma pack()
safely (I've just seen it supports push/pop options).

> EnIP makes use of IPv6 AAAA records for DNS lookup.  We simply put
> 2001:0101 (which is an IPv6 experimental prefix) and then we put the
> 64-bit EnIP address into the next 8 bytes of the address.  The
> remaining bytes are set to zero.

Does this require a significant amount of changes to the libc ?

> In the kernel, if you want to see how we convert the IPv6 DNS lookup
> into something connect() can manage, check out the add_enhanced_ip()
> routine found here:
> 
> https://github.com/EnIP/enhancedip/blob/master/kernel/4.9.28/socket.c

Hmmm so both IPv4 and IPv6 addresses are supported and used together.
So that just makes me think that if in the end you need a little bit
of IPv6 to handle this, why not instead define a new way to map IPv6 to
Enhanced IPv4 ? I mean, you'd exclusively use AF_INET6 in applications,
within a dedicated address range, allowing you to naturally use all of
the IPv6 ecosystem in place in userland, but use IPv4 over the wire.
This way applications would possibly require even fewer modifications,
if any at all, because they'd accept/connect/send/receive data for IPv6
address families that they all deal with nowadays, but this would rely
on an omnipresent IPv4 infrastructure.

> The reason we had to do changes for openssh and not other applications
> (that use DNS) is that openssh has a check to see if the socket is using
> IP options.  If the socket does, sshd drops the connection.  I had to
> work around that to get openssh working

OK. They possibly do this to avoid source-routing.

> with EnIP.  The result: if you want to connect the netcat-like program
> with IP addresses you'll end up doing something like the example
> above.  If you're using DNS (getaddrinfo) to connect(), it
> should just work (except for sshd as noted).
> 
> Here's the draft experimental RFC:
> https://tools.ietf.org/html/draft-chimiak-enhanced-ipv4-03

Yep I've found it on the site.

> I'll also note that I am doing this code part time as a hobby for a long
> time so I appreciate your help and support.  It would be really great
> if the kernel community decided to pick this up, but if it's not a
> reality please let me know soonest so I can move on to a different
> hobby.  :)

I *personally* think there is some value in what you're experimenting
with, and doing it as a hobby can leave you with the time needed to
collect pros and cons from various people. I'm just thinking that in
order to get some adoption, you need to be extremely careful not to
break any of the IPv4 applications developed in the last 38 years,
and by this, AF_INET+sizeof() scares me a little bit.

Regards,
Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-01 Thread Willy Tarreau
Hello Sam,

On Fri, Jun 01, 2018 at 09:48:28PM -0400, Sam Patton wrote:
> Hello!
> 
> If you do not know what Enhanced IP is, read this post on netdev first:
> 
> https://www.spinics.net/lists/netdev/msg327242.html
> 
> 
> The Enhanced IP project presents:
> 
>  Enhanced IP v1.4
> 
> The Enhanced IP (EnIP) code has been updated.  It now builds with OpenWRT 
> barrier breaker (for 148 different devices). We've been testing with the 
> Western Digital N600 and N750 wireless home routers.
(...)

First note, please think about breaking your lines if you want your
mails to be read by the widest audience, as for some of us here, reading
lines wider than a terminal is really annoying, and often not considered
worth spending time on them considering there are so many easier ones
left to read.

> Interested in seeing Enhanced IP in the Linux kernel, read on.  Not
> interested in seeing Enhanced IP in the Linux kernel read on.
(...)

So I personally find the concept quite interesting. It reminds me of the
previous IPv5/IPv7/IPv8 initiatives, which in my opinion were a bit hopeless.
Here the fact that you decide to consider the IPv4 address as a network opens
new perspectives. For containerized environments it could be considered that
each server, with one IPv4, can host 2^32 guests and that NAT is not needed
anymore for example. It could also open the possibility that enthusiasts
can more easily host some services at home behind their ADSL line without
having to run on strange ports.

However I think your approach is not the most efficient way to encourage adoption.
It's important to understand that there will be little incentive for people
to patch their kernels to run some code if they don't have the applications
on top of it. The kernel is not the end goal for most users, the kernel is
just the lower layer needed to run applications on top. I looked at your site
and the github repo, and all I could find was a pre-patched openssh, no simple
explanation of what to change in an application.

What you need to do first is to *explain* how to modify userland applications
to support En-IP, provide an echo server and show the parts which have to be
changed. Write a simple client and do the same. Provide your changes to
existing programs as patches, not as pre-patched code. This way anyone can
use your patches on top of other versions, and can use these patches to
understand what has to be modified in their applications.

Once applications are easy to patch, the incentive to install patched kernels
everywhere will be higher. For many enthusiasts, knowing that they only have
to modify the ADSL router to automatically make their internal IoT stuff
accessible from outside indeed becomes appealing.

Then you'll need to provide patches for well known applications like curl,
wget, DNS servers (bind...), then browsers.

In my case I could be interested in adding support for En-ip into haproxy,
and only once I don't see any showstopper in doing this, I'd be willing to
patch my kernel to support it.

Last advice, provide links to your drafts in future e-mails, they are not
easy to find on your site, we have to navigate through various pages to
finally find them.

Regards,
Willy


Re: Request for -stable inclusion: time stamping fix for nfp

2018-05-17 Thread Willy Tarreau
Adding Greg here.

Greg, apparently a backport of 46f1c52e66db is needed in 4.9 according
to the thread below. It was merged in 4.13 so 4.14 already has it.

Willy

On Thu, May 17, 2018 at 02:09:03PM -0400, David Miller wrote:
> From: Guillaume Nault 
> Date: Thu, 17 May 2018 19:41:47 +0200
> 
> > On Thu, Nov 16, 2017 at 10:13:28AM +0900, David Miller wrote:
> >> From: Guillaume Nault 
> >> Date: Wed, 15 Nov 2017 17:20:46 +0100
> >> 
> >> > Can you please queue commit 46f1c52e66db
> >> > ("nfp: TX time stamp packets before HW doorbell is rung") for -stable?
> >> > We got hit by this bug in the late summer. We have been running this
> >> > fix internally for a couple of months, but it'd be better to have it
> >> > officially backported so everyone can benefit from it.
> >> 
> >> Queued up.
> > 
> > I guess this one got lost somewhere as it doesn't appear in linux-4.9.y
> > (other trees aren't relevant).
> > If that's unintentional, then can you please re-queue
> > 46f1c52e66db ("nfp: TX time stamp packets before HW doorbell is rung")
> > to -stable?
> 
> I only submit patches to -stable for the two most recent active branches
> which right now consist of 4.16 and 4.14 as per www.kernel.org



Re: [PATCH linux-stable-4.14] tcp: reset sk_send_head in tcp_write_queue_purge

2018-03-20 Thread Willy Tarreau
Hi David,

regarding the patch below, I'm not certain whether you planned to take
it since it's marked "not applicable" on patchwork, but I suspect it's
only because it doesn't apply to mainline.

However, please note that there are two typos in commit IDs referenced in
the commit message that might be nice to fix before merging :

> > For example, after 27fid7a8ed38 (tcp: purge write queue upon RST),

- here it's "a27fd7a8ed38" (missing 'a' and extra 'i' in the middle)

- and lower in the fixes tag there's the extra 'i' as well :

> > Fixes: a27fid7a8ed38 (tcp: purge write queue upon RST)

There was another report today on the stable list for a similar crash
on 4.14.28 and I suspect it's the one I saw this week-end on my firewall
after upgrading from 4.14.10 to 4.14.27 though I didn't have the symbols
to confirm.

Thanks,
Willy


Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()

2018-01-29 Thread Willy Tarreau
Hi Eric,

On Mon, Jan 29, 2018 at 06:04:30AM -0800, Eric Dumazet wrote:
> > If these 4 bytes matter, why not use
> > cmpq with an immediate value instead, which saves 2 extra bytes ? :
> >
> >   - the mov above is 11 bytes total :
> >
> >0:   48 8b 84 d6 78 56 34    mov    0x12345678(%rsi,%rdx,8),%rax
> >7:   12
> >8:   48 85 c0                test   %rax,%rax
> >
> >   - the equivalent cmp is only 9 bytes :
> >
> >0:   48 83 bc d6 78 56 34    cmpq   $0x0,0x12345678(%rsi,%rdx,8)
> >7:   12 00
> >
> > And as a bonus, it doesn't even clobber rax.
> >
> > Just my two cents,
> 
> 
> Hi Willy
> 
> Please look more closely at following instructions.
> 
> We need the value later, not only testing it being zero :)

Ah OK that makes total sense then ;-)

Thanks,
willy


Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()

2018-01-28 Thread Willy Tarreau
Hi,

[ replaced stable@ and greg@ by netdev@ as my question below is not
  relevant to stable ]

On Mon, Jan 29, 2018 at 02:48:54AM +0100, Daniel Borkmann wrote:
> From: Eric Dumazet 
> 
> [ upstream commit 84ccac6e7854ebbfb56d2fc6d5bef9be49bb304c ]
> 
> Saves 4 bytes replacing following instructions :
> 
> lea rax, [rsi + rdx * 8 + offsetof(...)]
> mov rax, qword ptr [rax]
> cmp rax, 0
> 
> by :
> 
> mov rax, [rsi + rdx * 8 + offsetof(...)]
> test rax, rax

I've just noticed this on stable@. If these 4 bytes matter, why not use
cmpq with an immediate value instead, which saves 2 extra bytes ? :

  - the mov above is 11 bytes total :

   0:   48 8b 84 d6 78 56 34    mov    0x12345678(%rsi,%rdx,8),%rax
   7:   12
   8:   48 85 c0                test   %rax,%rax

  - the equivalent cmp is only 9 bytes :

   0:   48 83 bc d6 78 56 34    cmpq   $0x0,0x12345678(%rsi,%rdx,8)
   7:   12 00

And as a bonus, it doesn't even clobber rax.

Just my two cents,
Willy


Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Willy Tarreau
Hi Eric,

On Mon, Jan 22, 2018 at 10:16:06AM -0800, Eric Dumazet wrote:
> On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:
> > My test case is to have 6 processes each create 5000 TCP IPv4 connections 
> > to each other
> > on a system with 16GB RAM and send slow-speed data.  This works fine on a 
> > 4.7 kernel, but
> > will not work at all on a 4.13.  The 4.13 first complains about running out 
> > of tcp memory,
> > but even after forcing those values higher, the max connections we can get 
> > is around 15k.
> > 
> > Both kernels have my out-of-tree patches applied, so it is possible it is 
> > my fault
> > at this point.
> > 
> > Any suggestions as to what this might be caused by, or if it is fixed in 
> > more recent kernels?
> > 
> > I will start bisecting in the meantime...
> > 
> 
> Hi Ben
> 
> Unfortunately I have no idea.
> 
> Are you using loopback flows, or have I misunderstood you ?
> 
> How loopback connections can be slow-speed ?

A few quick points : I have not noticed this on 4.9, which we use with
pretty satisfying performance (typically around 100k conn/s). However
during some recent tests I did around the meltdown fixes on 4.15, I
noticed a high connect() or bind() cost to find a local port when
running on the loopback, that I didn't have the time to compare to
older kernels. However, strace clearly showed that bind() (or connect()
if bind was not used) could take as much as 2-3 ms as source ports were
filling up.

To be clear, it was just a quick observation and anything could be wrong
there, including my tests. I'm just saying this in case it matches anything
Ben has observed. I can try to get more info if that helps, but it could be
a different case.

Cheers,
Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

OK OK. At least we should have security by default and let people trade
it against performance if they want and understand the risk. People
never know when they're secure enough (security can't be measured) but
they know fairly well when they're not performant enough to take action
(most often changing the machine).

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote:
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
> 
> Performance matters. A *LOT*.

Linus, no need to explain that to me, I'm precisely trying to see how
to disable PTI for a specific process because I face up to 45% loss in
certain circumstances, making it a no-go. But while a few of us have
very specific workloads emphasizing this impact, others have very
different ones and will not notice. For example my laptop did boot
pretty fine and I didn't notice anything until I fired a network
benchmark.

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 07:38:14PM -0800, Alexei Starovoitov wrote:
> yep. plenty of unknowns and what's happening now is an overreaction.

To be fair there's overreaction on both sides. The vast majority of
users need to get a 100% safe system and will never notice any
difference. A few of us are more concerned by the risk of performance
loss brought by these fixes. We do all need to run tests on the
patchsets to bring numbers on the table.

> What is the rush to push half baked patches into upstream
> that don't even address the variant 1 ?

You probably need to trust the CPU vendors a bit for having had all
the details about the risks for a few months now and accept that if
they're destroying their product's performance compared to their main
competitor, they probably studied a number of other alternatives first.
It doesn't mean they thought about everything of course, and maybe they
need to study your proposal as a better solution to reduce criticism.

> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz didn't share neither exploit nor the detailed description
>   of the POC with cpu vendors until now

Or it was considered less urgent to fix compared to the rest. It's
also possible that the scariest details were not written in the GPZ
article.

> Now the attack is well described, yet cpu vendors still pushing
> for lfence patches that don't make sense. Why?

Imagine if you were in their position and were pushing a solution which
would later be found to be inefficient and to be vulnerable again. Your
name would appear twice in the press in a few months, this would be
terrible. It makes sense in their position to find the safest fix first
given that those like you or me really concerned about the performance
impact know how to add an option to a boot loader or revert a patch that
causes trouble.

> What I think is important is to understand vulnerability first.
> I don't think it was done.

I suspect it was clearly done by those who had no other option but
working on this painful series over the last few months :-/

> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.
> 
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.

Please help here by testing the patch series and reporting numbers
before, with the fix, and with your proposal. That's the best way to
catch their attention and to get your proposal considered as a viable
alternative (or as a partial one for certain environments). I did the
same when I believed it could be sufficient to add noise to RDTSC and
found it totally inefficient after testing. But it's better for
everyone that research and testing is done rather than criticizing the
proposed fixes (which is not fair to the people who work hard on them
for everyone else).

Cheers,
Willy



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> Normally people who propose security fixes don't have to argue about the
> fact they added 30 clocks to avoid your box being 0wned.

In fact it depends, because if a fix makes the system unusable for its
initial purpose, this fix will simply not be deployed at all, which is
the worst that can happen. Especially when it cannot be disabled by
config and people stop updating their systems to stay on the last
"known good" version. Fortunately in Linux we often have the choice so
that users rarely have a valid reason for not upgrading!

Willy


Re: BUG warnings in 4.14.9

2017-12-26 Thread Willy Tarreau
Guys,

Chris reported the bug below and confirmed that reverting commit
9704f81 (ipv6: grab rt->rt6i_ref before allocating pcpu rt) seems to
have fixed the issue for him. This patch is a94b9367 in mainline.

I personally have no opinion on the patch, just found it because it
was the only one touching this area between 4.14.8 and 4.14.9 :-)

Should this be reverted or maybe fixed differently ?

thanks,
Willy

On Tue, Dec 26, 2017 at 01:49:59AM +, Chris Rankin wrote:
> Hi,
> 
> I've just raised https://bugzilla.kernel.org/show_bug.cgi?id=198271
> because the new 4.14.9 kernel is generating lots of BUG warnings, e.g.
> 
> [   35.069924] BUG: using smp_processor_id() in preemptible []
> code: avahi-daemon/761
> [   35.078187] caller is ip6_pol_route+0x46b/0x509 [ipv6]
> [   35.083340] CPU: 5 PID: 761 Comm: avahi-daemon Tainted: G
> I 4.14.9 #1
> [   35.091176] Hardware name: Gigabyte Technology Co., Ltd.
> EX58-UD3R/EX58-UD3R, BIOS FB  05/04/2009
> [   35.100181] Call Trace:
> [   35.102709]  dump_stack+0x46/0x59
> [   35.106095]  check_preemption_disabled+0xca/0xda
> [   35.110786]  ip6_pol_route+0x46b/0x509 [ipv6]
> [   35.115224]  fib6_rule_lookup+0x15/0x4b [ipv6]
> [   35.119745]  ip6_dst_lookup_tail+0x11d/0x18f [ipv6]
> [   35.124702]  ip6_dst_lookup_flow+0x30/0x6f [ipv6]
> [   35.129481]  udpv6_sendmsg+0x5c5/0xa10 [ipv6]
> [   35.133914]  ? ip_reply_glue_bits+0x48/0x48
> [   35.138187]  ? rw_copy_check_uvector+0x6d/0xf9
> [   35.142701]  ? sock_sendmsg+0x28/0x34
> [   35.146393]  sock_sendmsg+0x28/0x34
> [   35.149912]  ___sys_sendmsg+0x1b4/0x246
> [   35.153821]  ? poll_select_copy_remaining+0x104/0x104
> [   35.158995]  ? current_time+0x11/0x52
> [   35.162738]  ? pipe_write+0x353/0x365
> [   35.166483]  ? __sys_sendmsg+0x3c/0x5d
> [   35.170304]  __sys_sendmsg+0x3c/0x5d
> [   35.173909]  do_syscall_64+0x4a/0xe1
> [   35.177482]  entry_SYSCALL64_slow_path+0x25/0x25
> [   35.177484] RIP: 0033:0x7f963d7097b7
> [   35.177485] RSP: 002b:77ffbd18 EFLAGS: 0246 ORIG_RAX:
> 002e
> [   35.177486] RAX: ffda RBX: 000d RCX: 
> 7f963d7097b7
> [   35.177487] RDX:  RSI: 77ffbde0 RDI: 
> 000d
> [   35.177487] RBP:  R08: 0001 R09: 
> 77ffbd42
> [   35.177488] R10: 0002 R11: 0246 R12: 
> 77ffbde0
> [   35.177489] R13: 55e4e1af29cc R14: 000d R15: 
> 0002
> [   35.177565] IN=eth0 OUT= MAC=
> SRC=fe80::::0224:1dff:fecd:0f3a
> DST=ff02:::::::00fb LEN=245 TC=0 HOPLIMIT=255
> FLOWLBL=147873 PROTO=UDP SPT=5353 DPT=5353 LEN=205
> 
> Cheers,
> Chris


Re: [PATCH net 0/3] Few mvneta fixes

2017-12-19 Thread Willy Tarreau
Hi Arnd,

On Tue, Dec 19, 2017 at 09:18:35PM +0100, Arnd Bergmann wrote:
> On Tue, Dec 19, 2017 at 5:59 PM, Gregory CLEMENT
>  wrote:
> > Hello,
> >
> > here it is a small series of fixes found on the mvneta driver. They
> > had been already used in the vendor kernel and are now ported to
> > mainline.
> 
> Does one of the patches look like it addresses the rare Oops we discussed on
> #kernelci this morning?
> 
> https://storage.kernelci.org/stable/linux-4.9.y/v4.9.70/arm/mvebu_v7_defconfig/lab-free-electrons/boot-armada-375-db.html

I could be wrong but for me the 375 uses mvpp2, not mvneta, so this
should have no effect there.

Willy


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-17 Thread Willy Tarreau
Hi Andrew,

On Fri, Nov 17, 2017 at 03:06:23PM +0100, Andrew Lunn wrote:
> > Usually it's better to apply LRU or random here in my opinion, as the
> > new entry is much more likely to be needed than older ones by definition.
> 
> Hi Willy
> 
> I think this depends on why you need to discard. If it is normal
> operation and the limits are simply too low, i would agree.
> 
> If however it is a DoS, throwing away the new entries makes sense,
> leaving the old ones which are more likely to be useful.
> 
> Most of the talk in this thread has been about limits for DoS
> prevention...

Sure but my point is that it can kick in on regular traffic and in
this case it can be catastrophic. That's only what bothers me. If
we have an unlimited default value with this algorithm I'm fine
because nobody will get caught by accident with a bridge suddenly
replicating high traffic on all ports because an unknown limit was
reached. That's the principle of least surprise.

I know that when fighting DoSes there's never any universally good
solutions and one has to make tradeoffs. I'm perfectly fine with this.

Cheers,
Willy


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-16 Thread Willy Tarreau
Hi Stephen,

On Thu, Nov 16, 2017 at 04:27:18PM -0800, Stephen Hemminger wrote:
> On Thu, 16 Nov 2017 21:21:55 +0100
> Vincent Bernat  wrote:
> 
> > > On 16 November 2017 at 20:23 +0100, Andrew Lunn wrote:
> > 
> > > struct net_bridge_fdb_entry is 40 bytes.
> > >
> > > My WiFi access point which is also a 5 port bridge, currently has 97MB
> > > free RAM. That is space for about 2.5M FDB entries. So even Roopa's
> > > 128K is not really a problem, in terms of memory.  
> > 
> > I am also interested in Sarah's patch because we can now have bridge
> > with many ports through VXLAN. The FDB can be replicated to an external
> > daemon with BGP and the cost of each additional MAC address is therefore
> > higher than just a few bytes. It seems simpler to implement a limiting
> > policy early (at the port or bridge level).
> > 
> > Also, this is a pretty standard limit to have for a bridge (switchport
> > port-security maximum on Cisco, set interface X mac-limit on
> > Juniper). And it's not something easy to do with ebtables.
> 
> I want an optional limit per port, it makes a lot of sense.
> If for no other reason that huge hash tables are a performance problems.

Except it's not a limit in that it doesn't prevent new traffic from going
in, it only prevents new MACs from being learned, so suddenly you start
flooding all ports with new traffic once the limit is reached, which is
not trivial to detect or diagnose.

> There is a bigger question about which fdb to evict but just dropping the
> new one seems to be easiest and as good as any other solution.

Usually it's better to apply LRU or random here in my opinion, as the
new entry is much more likely to be needed than older ones by definition.
In terms of CPU usage it may even be better to kill an entire series in
the hash table (eg: all nodes in the same table entry for example), as
the operation can be almost as cheap and result in not being needed for
a while again.

Willy


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-16 Thread Willy Tarreau
Hi Sarah,

On Thu, Nov 16, 2017 at 01:20:18AM -0800, Sarah Newman wrote:
> I note that anyone who would run up against a too-low limit on the maximum
> number of fdb entries would also be savvy enough to fix it in a matter of
> minutes.

I disagree on this point. There's a huge difference between experiencing
sudden breakage under normal conditions due to arbitrary limits being set
and being down because of an attack. While the latter is not desirable,
it's much more easily accepted and most often requires operations anyway.
The former is never an option.

And I continue to think that the default behaviour once the limit is reached
must not be to prevent new entries from being learned but to purge older
ones. At least it preserves normal operations.

But given the high CPU impact you reported for a very low load, definitely
something needs to be done.

> They could also default the limit to U32_MAX in their particular
> distribution if it was a configuration option.

Well, I'd say that we don't have a default limit on the socket number either
and that it happens to be the expected behaviour. It's almost impossible to
find a suitable limit for everyone. People dealing with regular loads never
read docs and get caught. People working in hostile environments are always
more careful and will ensure that their limits are properly set.

> At the moment there is not even a single log message if the problem doesn't
> result in memory exhaustion.

This probably needs to be addressed as well!

Regards,
Willy


Re: [PATCH] net: recvmsg: Unconditionally zero struct sockaddr_storage

2017-11-01 Thread Willy Tarreau
On Tue, Oct 31, 2017 at 09:14:45AM -0700, Kees Cook wrote:
> diff --git a/net/socket.c b/net/socket.c
> index c729625eb5d3..34183f4fbdf8 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2188,6 +2188,7 @@ static int ___sys_recvmsg(struct socket *sock, struct 
> user_msghdr __user *msg,
>   struct sockaddr __user *uaddr;
>   int __user *uaddr_len = COMPAT_NAMELEN(msg);
>  
> + memset(, 0, sizeof(addr));
>   msg_sys->msg_name = 

Isn't this going to cause a performance hit in the fast path ? Just
checking, I have not read the whole code with the patch in its context.

Willy


Re: net.ipv4.tcp_max_syn_backlog implementation

2017-08-28 Thread Willy Tarreau
On Mon, Aug 28, 2017 at 11:47:41PM -0400, Harsha Chenji wrote:
> So I have ubuntu 12.04 x32 in a VM with syncookies turned off. I tried
> to do a syn flood (with netwox) on 3 different processes. Each of them
> returns a different value with netstat -na | grep -c RECV :
> 
> nc -l  returns 16 (netcat-traditional)
> apache2 port 80 returns 256
> vsftpd on 21 returns 64.
> net.ipv4.tcp_max_syn_backlog is 512.
> 
> Why do these different processes on different ports have different
> queue lengths for incomplete connections? Where exactly in the kernel
> is this decided?

The listening socket's backlog (second argument to the listen() syscall)
is considered as well. The code path to determine whether or not to start
to send SYN cookies is far from being trivial but makes sense once you
write it down completely. I never perfectly remember it, I regularly have
to recheck when I have a doubt.

Willy


Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

2017-08-06 Thread Willy Tarreau
On Sun, Aug 06, 2017 at 07:39:57AM +, maowenan wrote:
> 
> 
> > -Original Message-
> > From: Willy Tarreau [mailto:w...@1wt.eu]
> > Sent: Saturday, August 05, 2017 2:19 AM
> > To: Neal Cardwell
> > Cc: maowenan; David Miller; netdev@vger.kernel.org; Yuchung Cheng; Nandita
> > Dukkipati; Eric Dumazet
> > Subject: Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data
> > ACKed/SACKed
> > 
> > On Fri, Aug 04, 2017 at 02:01:34PM -0400, Neal Cardwell wrote:
> > > On Fri, Aug 4, 2017 at 1:10 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > > > Hi Neal,
> > > >
> > > > On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote:
> > > >> I have attached patches for this fix rebased on to v3.10.107, the
> > > >> latest stable release for 3.10. That's pretty far back in history,
> > > >> so there were substantial conflict resolutions and adjustments 
> > > >> required.
> > > >> :-) Hope that helps.
> > > >
> > > > At least it will help me :-)
> > > >
> > > > Do you suggest that I queue them for 3.10.108, that I wait for
> > > > Maowenan to test them more broadly first or anything else ?
> > >
> > > Let's wait for Maowenan to test them first.
> [Mao Wenan]It works well with these patches of v3.10, and the 
> retransmission packet is about 250ms after TLP probe.
> Thank you very much for these patches under 3.10.

Thanks, I'll take them then.

Willy


Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

2017-08-04 Thread Willy Tarreau
On Fri, Aug 04, 2017 at 02:01:34PM -0400, Neal Cardwell wrote:
> On Fri, Aug 4, 2017 at 1:10 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > Hi Neal,
> >
> > On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote:
> >> I have attached patches for this fix rebased on to v3.10.107, the
> >> latest stable release for 3.10. That's pretty far back in history, so
> >> there were substantial conflict resolutions and adjustments required.
> >> :-) Hope that helps.
> >
> > At least it will help me :-)
> >
> > Do you suggest that I queue them for 3.10.108, that I wait for Maowenan
> > to test them more broadly first or anything else ?
> 
> Let's wait for Maowenan to test them first.

Fine, thanks.
Willy


Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

2017-08-04 Thread Willy Tarreau
Hi Neal,

On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote:
> I have attached patches for this fix rebased on to v3.10.107, the
> latest stable release for 3.10. That's pretty far back in history, so
> there were substantial conflict resolutions and adjustments required.
> :-) Hope that helps.

At least it will help me :-)

Do you suggest that I queue them for 3.10.108, that I wait for Maowenan
to test them more broadly first or anything else ?

I'm fine with any option.
Thanks!
Willy


Re: STABLE: net: reduce skb_warn_bad_offload() noise

2017-07-29 Thread Willy Tarreau
On Fri, Jul 28, 2017 at 10:22:52PM -0700, Eric Dumazet wrote:
> On Fri, 2017-07-28 at 12:30 -0700, David Miller wrote:
> > From: Mark Salyzyn 
> > Date: Fri, 28 Jul 2017 10:29:57 -0700
> > 
> > > Please backport the upstream patch to the stable trees (including
> > > 3.10.y, 3.18.y, 4.4.y and 4.9.y):
> > > 
> > > b2504a5dbef3305ef41988ad270b0e8ec289331c net: reduce
> > > skb_warn_bad_offload() noise
> > > 
> > > Impacting performance or creating unnecessary alarm, and will result
> > > in kernel panic for panic_on_warn configuration.
> > 
> > Yeah this is fine.
> 
> If you do so, also backport 6e7bc478c9a006c701c14476ec9d389a484b4864
> ("net: skb_needs_check() accepts CHECKSUM_NONE for tx")
 
OK, noted for next 3.10 as well.

thanks!
Willy


Re: TCP fast retransmit issues

2017-07-28 Thread Willy Tarreau
On Fri, Jul 28, 2017 at 08:36:49AM +0200, Klavs Klavsen wrote:
> The network guys know what caused it.
> 
> Appearently on (atleast some) Cisco equipment the feature:
> 
> TCP Sequence Number Randomization
> 
> is enabled by default.

I didn't want to suggest names but since you did it first ;-) Indeed it's
mostly on the same device that I've been bothered a lot by their annoying
randomization. I used to know by memory the exact command to type to disable
it, but I don't anymore (something along "no randomization"). The other
trouble it causes is retransmits of the first SYN when your source ports
wrap too fast (ie when installed after a proxy). The SYNs reaching the
other end find a session in TIME_WAIT, but the SYN sometimes lands in
the previous window and leads to an ACK instead of a SYN-ACK, which the
firewall blocks. This was easily worked around using timestamps on both
sides thanks to PAWS. But disabling the broken feature is better. And no,
"more secure" is not an excuse for "broken".

> It would most definetely be beneficial if Linux handled SACK "not working"
> better than it does - but then I might never have found the culprit who
> destroyed SACK :)

Yep ;-)

Willy


Re: TCP fast retransmit issues

2017-07-26 Thread Willy Tarreau
On Wed, Jul 26, 2017 at 07:32:12AM -0700, Eric Dumazet wrote:
> On Wed, 2017-07-26 at 15:42 +0200, Willy Tarreau wrote:
> > On Wed, Jul 26, 2017 at 06:31:21AM -0700, Eric Dumazet wrote:
> > > On Wed, 2017-07-26 at 14:18 +0200, Klavs Klavsen wrote:
> > > > the 192.168.32.44 is a Centos 7 box.
> > > 
> > > Could you grab a capture on this box, to see if the bogus packets are
> > > sent by it, or later mangled by a middle box ?
> > 
> > Given the huge difference between the window and the ranges of the
> > values in the SACK field, I'm pretty sure there's a firewall doing
> > some sequence numbers randomization in the middle, not aware of SACK
> > and not converting these ones. I've had to disable such broken
> > features more than once in field after similar observations! Probably
> > that the Mac doesn't advertise SACK support and doesn't experience the
> > problem.
> 
> We need to check RFC if such invalid SACK blocks should be ignored (DUP
> ACK would be processed and trigger fast retransmit anyway), or strongly
> validated (as I suspect we currently do), leading to a total freeze.

RFC2883 #4.3 talks about interaction with PAWS and only suggests that
since the sequence numbers can wrap the sender should be aware that a
reported segment can in fact relate to a value within the prior seq
number space before cycling, but that they don't expect any side effect.
So that more or less means to me "you should consider that some of these
segments might be old, meaningless and should be ignored". But as you
can see the recommendation lacks a bit of strength given that no issue
was expected in such a situation.

Willy


Re: TCP fast retransmit issues

2017-07-26 Thread Willy Tarreau
On Wed, Jul 26, 2017 at 04:25:29PM +0200, Klavs Klavsen wrote:
> Thank you very much guys for your insight.. its highly appreciated.
> 
> Next up for me, is waiting till the network guys come back from summer
> vacation, and convince them to sniff on the devices in between to pinpoint
> the culprit :)

That said, Eric, I'm a bit surprized that it completely stalls. Shouldn't
the sender end up retransmitting unacked segments after seeing a certain
number of ACKs not making progress ? Or maybe this is disabled when SACKs
are in use but it seems to me that once invalid SACKs are ignored we should
ideally fall back to the normal way to deal with losses. Here the server
ACKed 3903858556 for the first time at 15:59:54.292743 and repeated this
one 850 times till 16:01:17.296407 but the client kept sending past this
point probably due to a huge window, so this looks suboptimal to me.

Willy


Re: TCP fast retransmit issues

2017-07-26 Thread Willy Tarreau
On Wed, Jul 26, 2017 at 04:08:19PM +0200, Klavs Klavsen wrote:
> Grabbed on both ends.
> 
> http://blog.klavsen.info/fast-retransmit-problem-junos-linux (updated to new
> dump - from client scp'ing)
> http://blog.klavsen.info/fast-retransmit-problem-junos-linux-receiving-side
> (receiving host)

So bingo, Eric guessed right, the client's sequence numbers are translated
on their way to/from the server, but the SACK fields are not :

Server :
15:59:54.292867 IP (tos 0x8, ttl 64, id 15878, offset 0, flags [DF], proto TCP 
(6), length 64)
192.168.32.44.22 > 62.242.222.50.35002: Flags [.], cksum 0xfe2b (incorrect 
-> 0xce0e), seq 1568063538, ack 3903858556,
win 10965, options [nop,nop,TS val 529899820 ecr 774272020,nop,nop,sack 1 
{3903859904:3903861252}], length 0

Client :
15:59:54.297388 IP (tos 0x8, ttl 56, id 15878, offset 0, flags [DF], proto TCP 
(6), length 64)
192.168.32.44.22 > 62.242.222.50.35002: Flags [.], cksum 0xbb2c (correct), 
seq 1568063538, ack 2684453645,
win 10965, options [nop,nop,TS val 529899820 ecr 774272020,nop,nop,sack 1 
{3903859904:3903861252}], length 0

So there's very likely a broken firewall in the middle that is waiting for
a bug fix, or to have its feature disabled. Sometimes it can also happen
on firewalls performing some SYN proxying except that it would mangle the
server's sequence numbers instead of the client ones.

Willy


Re: TCP fast retransmit issues

2017-07-26 Thread Willy Tarreau
On Wed, Jul 26, 2017 at 06:31:21AM -0700, Eric Dumazet wrote:
> On Wed, 2017-07-26 at 14:18 +0200, Klavs Klavsen wrote:
> > the 192.168.32.44 is a Centos 7 box.
> 
> Could you grab a capture on this box, to see if the bogus packets are
> sent by it, or later mangled by a middle box ?

Given the huge difference between the window and the ranges of the
values in the SACK field, I'm pretty sure there's a firewall doing
some sequence numbers randomization in the middle, not aware of SACK
and not converting these ones. I've had to disable such broken
features more than once in field after similar observations! Probably
that the Mac doesn't advertise SACK support and doesn't experience the
problem.

Willy


Re: [PATCH RFC 0/2] kproxy: Kernel Proxy

2017-06-29 Thread Willy Tarreau
On Thu, Jun 29, 2017 at 04:43:28PM -0700, Tom Herbert wrote:
> On Thu, Jun 29, 2017 at 1:58 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> >> > In fact that's not much what I observe in field. In practice, large
> >> > data streams are cheaply relayed using splice(), I could achieve
> >> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
> >> > And when you use SSL, the cost of the copy to/from kernel is small
> >> > compared to all the crypto operations surrounding this.
> >> >
> >> Right, getting rid of the extra crypto operations and so called "SSL
> >> inspection" is the ultimate goal this is going towards.
> >
> > Yep but in order to take decisions at L7 you need to decapsulate SSL.
> >
> Decapsulate or decrypt? There's a big difference... :-) I'm am aiming
> to just have to decapsulate.

Sorry, but what distinction are you making? For me "decapsulate" means
"extract the next level layer", and for SSL it means you need to decrypt.

> >
> >> Performance is relevant because we
> >> potentially want security applied to every message in every
> >> communication in a containerized data center. Putting the userspace
> >> hop in the datapath of every packet is known to be problematic, not
> >> just for the performance hit but also because it increases the attack
> >> surface on users' privacy.
> >
> > While I totally agree on the performance hit when inspecting each packet,
> > I fail to see the relation with users' privacy. In fact under some
> > circumstances it can even be the opposite. For example, using something
> > like kTLS for a TCP/HTTP proxy can result in cleartext being observable
> > in strace while it's not visible when TLS is terminated in userland because
> > all you see are openssl's read()/write() operations. Maybe you have specific
> > attacks in mind ?
> >
> No, just the normal problem of making yet one more tool systematically
> have access to user data.

OK.

> >> > Regarding kernel-side protocol parsing, there's an unfortunate trend
> >> > at moving more and more protocols to userland due to these protocols
> >> > evolving very quickly. At least you'll want to find a way to provide
> >> > these parsers from userspace, which will inevitably come with its set
> >> > of problems or limitations :-/
> >> >
> >> That's why everything is going BPF now ;-)
> >
> > Yes, I knew you were going to suggest this :-)  I'm still prudent on it
> > to be honest. I don't think it would be that easy to implement an HPACK
> > encoder/decoder using BPF. And even regarding just plain HTTP parsing,
> > certain very small operations in haproxy's parser can quickly result in
> > a 10% performance degradation when improperly optimized (ie: changing a
> > "likely", altering branch prediction, or cache walk patterns when using
> > arrays to evaluate character classes faster). But for general usage I
> > indeed think it should be OK.
> >
> HTTP might qualify as a special case, and I believe there's already
> been some work to put http in kernel by Alexander Krizhanovsky and
> others. In this case maybe http parse could be front end before BPF.

It could indeed be an option. We've seen this with Tux in the past.

> Although, pretty clear we'll need regex in BPF if we want use it with
> http.

I think so as well. And some loop-like operations (foreach or stuff like
this so that they remain bounded).

Willy


Re: [PATCH RFC 0/2] kproxy: Kernel Proxy

2017-06-29 Thread Willy Tarreau
On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> > In fact that's not much what I observe in field. In practice, large
> > data streams are cheaply relayed using splice(), I could achieve
> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
> > And when you use SSL, the cost of the copy to/from kernel is small
> > compared to all the crypto operations surrounding this.
> >
> Right, getting rid of the extra crypto operations and so called "SSL
> inspection" is the ultimate goal this is going towards.

Yep but in order to take decisions at L7 you need to decapsulate SSL.

> HTTP is only one use case. The are other interesting use cases such as
> those in container security where the application protocol might be
> something like simple RPC.

OK that indeed makes sense in such environments.

> Performance is relevant because we
> potentially want security applied to every message in every
> communication in a containerized data center. Putting the userspace
> hop in the datapath of every packet is known to be problematic, not
> just for the performance hit but also because it increases the attack
> surface on users' privacy.

While I totally agree on the performance hit when inspecting each packet,
I fail to see the relation with users' privacy. In fact under some
circumstances it can even be the opposite. For example, using something
like kTLS for a TCP/HTTP proxy can result in cleartext being observable
in strace while it's not visible when TLS is terminated in userland because
all you see are openssl's read()/write() operations. Maybe you have specific
attacks in mind ?

> > Regarding kernel-side protocol parsing, there's an unfortunate trend
> > at moving more and more protocols to userland due to these protocols
> > evolving very quickly. At least you'll want to find a way to provide
> > these parsers from userspace, which will inevitably come with its set
> > of problems or limitations :-/
> >
> That's why everything is going BPF now ;-)

Yes, I knew you were going to suggest this :-)  I'm still prudent on it
to be honest. I don't think it would be that easy to implement an HPACK
encoder/decoder using BPF. And even regarding just plain HTTP parsing,
certain very small operations in haproxy's parser can quickly result in
a 10% performance degradation when improperly optimized (ie: changing a
"likely", altering branch prediction, or cache walk patterns when using
arrays to evaluate character classes faster). But for general usage I
indeed think it should be OK.

> > All this to say that while I can definitely imagine the benefits of
> > having in-kernel sockets for in-kernel L7 processing or filtering,
> > I'm having strong doubts about the benefits that userland may receive
> > by using this (or maybe you already have any performance numbers
> > supporting this ?).
> >
> Nope, no numbers yet.

OK, no worries. Thanks for your explanations!

Willy


Re: [PATCH RFC 0/2] kproxy: Kernel Proxy

2017-06-29 Thread Willy Tarreau
Hi Tom,

On Thu, Jun 29, 2017 at 11:27:03AM -0700, Tom Herbert wrote:
> Sidecar proxies are becoming quite popular on server as a means to
> perform layer 7 processing on application data as it is sent. Such
> sidecars are used for SSL proxies, application firewalls, and L7
> load balancers. While these proxies provide nice functionality,
> their performance is obviously terrible since all the data needs
> to take an extra hop though userspace.
> 
> Consider transmitting data on a TCP socket that goes through a
> sidecar proxy. The application does a sendmsg in userspace, data
> goes into kernel, back to userspace, and back to kernel. That is two
> trips through TCP TX, one TCP RX, potentially three copies, three
> sockets are touched, and three context switches. Using a proxy in the
> receive path would have a similarly long path.
> 
>    +--------------+      +------------------+
>    |  Application |      |      Proxy       |
>    |              |      |                  |
>    |   sendmsg    |      | recvmsg  sendmsg |
>    +--------------+      +------------------+
>         |                    |         |
>         |                    ^         |
>   ------V--------------------|---------|------
>         |                    |         |
>         +---------->---------+         V
>       TCP TX              TCP RX     TCP TX
>   
> The "boomerang" model this employs is quite expensive. This is
> even much worse in the case that the proxy is an SSL proxy (e.g.
> performing SSL inspection to implement and application firewall).

In fact that's not much what I observe in field. In practice, large
data streams are cheaply relayed using splice(), I could achieve
60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
And when you use SSL, the cost of the copy to/from kernel is small
compared to all the crypto operations surrounding this.

Another point is that most HTTP requests are quite small (typically ~80%
are 20kB or less), and in this case the L7 processing and certain syscalls
significantly dominate the operations, data copies are comparatively
small. Simply parsing a HTTP header takes time (when you do it correctly).
You can hardly parse and index more than 800MB-1GB/s of HTTP headers
per core, which limits you to roughly 1-1.2 M req+resp per second for
a 400 byte request and a 400 byte response, and that's without any
processing at all. But when doing this, certain syscalls like connect(),
close() or epoll_ctl() start to be quite expensive. Even splice() is
expensive to forward small data chunks because you need two calls, and
recv+send is faster. In fact our TCP stack has been so much optimized
for realistic workloads over the years that it becomes hard to gain
more by cheating on it :-)

In the end in haproxy I'm seeing about 300k req+resp per second in
HTTP keep-alive and more like 100-130k with close, when disabling
TCP quick-ack during accept() and connect() to save one ACK on each
side (just doing this generally brings performance gains between 7
and 10%).

Regarding kernel-side protocol parsing, there's an unfortunate trend
at moving more and more protocols to userland due to these protocols
evolving very quickly. At least you'll want to find a way to provide
these parsers from userspace, which will inevitably come with its set
of problems or limitations :-/

All this to say that while I can definitely imagine the benefits of
having in-kernel sockets for in-kernel L7 processing or filtering,
I'm having strong doubts about the benefits that userland may receive
by using this (or maybe you already have any performance numbers
supporting this ?).

Just my two cents,
Willy


Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-16 Thread Willy Tarreau
Hi Neal,

On Thu, Mar 16, 2017 at 11:40:52AM -0400, Neal Cardwell wrote:
> On Thu, Mar 16, 2017 at 7:31 AM, Lutz Vieweg <l...@5t9.de> wrote:
> >
> > On 03/15/2017 11:55 PM, Willy Tarreau wrote:
> >>
> >> At least I can say I've seen many people enable it without understanding 
> >> its impact, confusing it
> >> with tcp_tw_reuse, and copy-pasting it from random blogs and complaining 
> >> about issues in
> >> production.
> >
> >
> > I currently wonder: What it the correct advise to an operator who needs
> > to run one server instance that is meant to accept thousands of new,
> > short-lived TCP connections per minute?
> 
> Note that for this to be a problem there would have to be thousands of
> new, short-lived TCP connections per minute from a single source IP
> address to a single destination IP address. Normal client software
> should not be doing this. AFAIK this is pretty rare, unless someone is
> running a load test or has an overly-aggressive monitoring system. NAT
> boxes or proxies with that kind of traffic should be running with
> multiple public source IPs.

In fact it's the regular stuff with reverse-proxies. You can scan the
whole source port range every second. But when enabling timestamps, you
benefit from PAWS and you don't have any problem anymore, everything
works pretty well.

> But if/when the problem occurs, then the feasible solutions I'm aware
> of, in approximate descending order of preference, are:
> 
> (1) use longer connections from the client side (browsers and RPC libraries 
> are
> usually pretty good about keeping connections open for a long time, so 
> this
> is usually sufficient)
> 
> (2) have the client do the close(), so the client is the side to carry the
> TIME_WAIT state

That's impossible for proxies, as you can't connect again from the same
source port, causing performance to be divided by more than 100. What
proxies have to do when they're forced to close an outgoing connection
first is to set SO_LINGER to (0,0) so that an RST is used and the source
port can be reused. But as you can guess, if that RST gets lost, then the
next opening is not that beautiful: either [SYN, ACK, RST, pause, SYN,
SYN-ACK, ACK] or [SYN, RST, pause, SYN, SYN-ACK, ACK] depending on
whether the SYN appears in the previous window or not.

> (3) have the server use SO_LINGER with a timeout of 0, so that
> the connection is closed with a RST and the server carries no
> TIME_WAIT state

The problem is that it also kills the tail data.

Quite frankly, the only issues I'm used to see are with clients closing
first and with reusing source connections. As soon as timestamps are
enabled on both sides and people don't blindly play with tcp_tw_recycle,
I really never face any connection issue.

Willy


Re: [PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread Willy Tarreau
Hi David,

On Wed, Mar 15, 2017 at 03:40:44PM -0700, David Miller wrote:
> From: Soheil Hassas Yeganeh 
> Date: Wed, 15 Mar 2017 16:30:45 -0400
> 
> > Note that this cache was already broken for caching timestamps of
> > multiple machines behind a NAT sharing the same address.
> 
> That's the documented, well established, limitation of time-wait
> recycling.
> 
> People who enable it, need to consider this issue.
> 
> This limitation of the feature does not give us a reason to break the
> feature even further as a matter of convenience, or to remove it
> altogether for the same reason.
> 
> Please, instead, fix the bug that was introduced.

At least I can say I've seen many people enable it without understanding
its impact, confusing it with tcp_tw_reuse, and copy-pasting it from
random blogs and complaining about issues in production.

I agree that it's hard to arbitrate between stupidity and flexibility
though :-/

Regards,
Willy


Re: net: BUG in unix_notinflight

2017-03-07 Thread Willy Tarreau
On Wed, Mar 08, 2017 at 12:23:56AM +0200, Nikolay Borisov wrote:
> 
> >>
> >>
> >> New report from linux-next/c0b7b2b33bd17f7155956d0338ce92615da686c9
> >>
> >> [ cut here ]
> >> kernel BUG at net/unix/garbage.c:149!
> >> invalid opcode:  [#1] SMP KASAN
> >> Dumping ftrace buffer:
> >>(ftrace buffer empty)
> >> Modules linked in:
> >> CPU: 0 PID: 1806 Comm: syz-executor7 Not tainted 4.10.0-next-20170303+ #6
> >> Hardware name: Google Google Compute Engine/Google Compute Engine,
> >> BIOS Google 01/01/2011
> >> task: 880121c64740 task.stack: 88012c9e8000
> >> RIP: 0010:unix_notinflight+0x417/0x5d0 net/unix/garbage.c:149
> >> RSP: 0018:88012c9ef0f8 EFLAGS: 00010297
> >> RAX: 880121c64740 RBX: 11002593de23 RCX: 8801c490c628
> >> RDX:  RSI: 11002593de27 RDI: 8557e504
> >> RBP: 88012c9ef220 R08: 0001 R09: 
> >> R10: dc00 R11: ed002593de55 R12: 8801c490c0c0
> >> R13: 88012c9ef1f8 R14: 85101620 R15: dc00
> >> FS:  013d3940() GS:8801dbe0() 
> >> knlGS:
> >> CS:  0010 DS:  ES:  CR0: 80050033
> >> CR2: 01fd8cd8 CR3: 0001cce69000 CR4: 001426f0
> >> Call Trace:
> >>  unix_detach_fds.isra.23+0xfa/0x170 net/unix/af_unix.c:1490
> >>  unix_destruct_scm+0xf4/0x200 net/unix/af_unix.c:1499
> > 
> > The problem here is there is no lock protecting concurrent unix_detach_fds()
> > even though unix_notinflight() is already serialized, if we call
> > unix_notinflight()
> > twice on the same file pointer, we trigger this bug...
> > 
> > I don't know what is the right lock here to serialize it.
> > 
> 
> 
> I reported something similar a while ago
> https://lists.gt.net/linux/kernel/2534612
> 
> And Miklos Szeredi then produced the following patch :
> 
> https://patchwork.kernel.org/patch/9305121/
> 
> However, this was never applied. I wonder if the patch makes sense?

I don't know but there's a hint at the bottom of the thread with
Davem's response to which there was no followup :

  "Why would I apply a patch that's an RFC, doesn't have a proper commit
   message, lacks a proper signoff, and also lacks ACK's and feedback
   from other knowledgable developers?"

So at least this point makes sense, maybe the patch is fine but was
not sufficiently reviewed or acked ? Maybe it was proposed as an RFC
to start a discussion and never went to the final status of a patch
waiting for being applied ?

Willy



Re: [4.9.10] ip_route_me_harder() reading off-slab

2017-02-16 Thread Willy Tarreau
On Fri, Feb 17, 2017 at 01:34:11PM +0800, Daniel J Blueman wrote:
> When booting a VM in libvirt/KVM attached to a local bridge and KASAN
> enabled on 4.9.10, we see a stream of KASAN warnings about off-slab
> access [1].

Did it start to appear with 4.9.10 or is 4.9.10 the first 4.9 kernel
you tried (ie is it a regression between earlier kernels and 4.9 or
a recent faulty stable backport into 4.9) ?

Willy


> Let me know if you'd like more debug.
> 
> Thanks,
>   Daniel
> 
> -- [1]
> 
> [  473.579567] BUG: KASAN: slab-out-of-bounds in
> ip_route_me_harder+0xbd5/0xf20 at addr 8801e1eb28a8
> [  473.579577] Read of size 1 by task vcselab/10339
> [  473.579590] CPU: 1 PID: 10339 Comm: vcselab Tainted: GB  4.9.10-debug+ 
> #2
> [  473.579596] Hardware name: Dell Inc. XPS 13 9360/0T3FTF, BIOS 1.3.2
> 01/18/2017
> [  473.579602]  880236086ed0 8aed83a1 8802324fe6c0
> 8801e1eb26f8
> [  473.579626]  880236086ef8 8a849521 880236086f90
> 8801e1eb26f0
> [  473.579645]  8802324fe6c0 880236086f80 8a8497ba
> 8a848b2d
> [  473.579662] Call Trace:
> [  473.579667]  
> [  473.579685]  [] dump_stack+0x85/0xc4
> [  473.579698]  [] kasan_object_err+0x21/0x70
> [  473.579709]  [] kasan_report_error+0x1fa/0x500
> [  473.579720]  [] ? kasan_kmalloc+0xad/0xe0
> [  473.579737]  [] __asan_report_load1_noabort+0x61/0x70
> [  473.579749]  [] ? ip_route_me_harder+0xbd5/0xf20
> [  473.579759]  [] ip_route_me_harder+0xbd5/0xf20
> [  473.579772]  [] ? nf_ip_saveroute+0x320/0x320
> [  473.579785]  [] ? entry_SYSCALL_64_fastpath+0x23/0xc6
> [  473.579801]  [] iptable_mangle_hook+0x3da/0x5f0
> [iptable_mangle]
> [  473.579814]  [] nf_iterate+0x110/0x2d0
> [  473.579826]  [] nf_hook_slow+0xf6/0x1b0
> [  473.579839]  [] ? nf_iterate+0x2d0/0x2d0
> [  473.579850]  [] ? __ip_local_out+0x28b/0x720
> [  473.579860]  [] __ip_local_out+0x366/0x720
> [  473.579869]  [] ? __ip_local_out+0x28b/0x720
> [  473.579879]  [] ? ip_finish_output+0x9b0/0x9b0
> [  473.579894]  [] ?
> __ip_flush_pending_frames.isra.43+0x2e0/0x2e0
> [  473.579905]  [] ip_local_out+0x1e/0x130
> [  473.579915]  [] ip_build_and_send_pkt+0x54d/0xad0
> [  473.579927]  [] tcp_v4_send_synack+0x184/0x290
> [  473.579937]  [] ? tcp_v4_send_check+0x90/0x90
> [  473.579950]  [] ? inet_ehash_insert+0x407/0x910
> [  473.579965]  [] tcp_conn_request+0x1f5b/0x2a20
> [  473.579977]  [] ? tcp_check_space+0x580/0x580
> [  473.579991]  [] ? default_wake_function+0x35/0x50
> [  473.580007]  [] ? debug_check_no_locks_freed+0x290/0x290
> [  473.580018]  [] ? debug_check_no_locks_freed+0x290/0x290
> [  473.580031]  [] ? ipt_do_table+0xb14/0x1ac0 [ip_tables]
> [  473.580041]  [] ? trace_hardirqs_on+0xd/0x10
> [  473.580055]  [] ? __local_bh_enable_ip+0x70/0xc0
> [  473.580067]  [] tcp_v4_conn_request+0x134/0x1e0
> [  473.580079]  [] tcp_v6_conn_request+0x1b8/0x230
> [  473.580089]  [] tcp_rcv_state_process+0x61d/0x41a0
> [  473.580101]  [] ? sk_filter_trim_cap+0x2a7/0x6a0
> [  473.580114]  [] ? tcp_finish_connect+0x600/0x600
> [  473.580125]  [] ? sk_filter_trim_cap+0x2c6/0x6a0
> [  473.580135]  [] ? sk_filter_trim_cap+0xf8/0x6a0
> [  473.580145]  [] ? tcp_md5_do_lookup+0x4a/0x190
> [  473.580157]  [] ? sk_filter_is_valid_access+0x60/0x60
> [  473.580170]  [] ? tcp_v4_inbound_md5_hash+0x139/0x3bb
> [  473.580180]  [] ? ncsi_start_dev+0x111/0x111
> [  473.580190]  [] tcp_v4_do_rcv+0x2c8/0x8c0
> [  473.580201]  [] tcp_v4_rcv+0x23a8/0x2fc0
> [  473.580214]  [] ? ipv4_confirm+0x117/0x3d0
> [nf_conntrack_ipv4]
> [  473.580228]  [] ip_local_deliver_finish+0x2b9/0x970
> [  473.580241]  [] ? ip_local_deliver_finish+0x12a/0x970
> [  473.580251]  [] ip_local_deliver+0x1b4/0x460
> [  473.580259]  [] ? ip_local_deliver+0x202/0x460
> [  473.580267]  [] ? ip_call_ra_chain+0x510/0x510
> [  473.580280]  [] ? iptable_nat_ipv4_in+0x15/0x20
> [iptable_nat]
> [  473.580290]  [] ? nf_iterate+0x92/0x2d0
> [  473.580302]  [] ? ip_rcv_finish+0x18e0/0x18e0
> [  473.580313]  [] ? nf_hook_slow+0xf6/0x1b0
> [  473.580323]  [] ip_rcv_finish+0x655/0x18e0
> [  473.580331]  [] ? ip_rcv+0x9db/0x1280
> [  473.580341]  [] ip_rcv+0x843/0x1280
> [  473.580352]  [] ? ip_rcv+0x8d3/0x1280
> [  473.580363]  [] ? __enqueue_entity+0x139/0x230
> [  473.580373]  [] ? ip_local_deliver+0x460/0x460
> [  473.580382]  [] ? inet_del_offload+0x40/0x40
> [  473.580393]  [] ? ip_local_deliver+0x460/0x460
> [  473.580407]  [] __netif_receive_skb_core+0x15d9/0x2c90
> [  473.580419]  [] ? enqueue_task_fair+0x261/0x2980
> [  473.580428]  [] ? debug_check_no_locks_freed+0x290/0x290
> [  473.580439]  [] ? netif_wake_subqueue+0x1c0/0x1c0
> [  473.580453]  [] __netif_receive_skb+0x24/0x150
> [  473.580468]  [] process_backlog+0xd7/0x610
> [  473.580477]  [] ? process_backlog+0x204/0x610
> [  473.580487]  [] ? swake_up+0x3b/0x60
> [  473.580498]  [] net_rx_action+0x731/0xe60
> [  473.580510]  [] ? sk_busy_loop+0xae0/0xae0
> [  473.580527]  [] ? clockevents_program_event+0x1cf/0x300
> [  

[PATCH 3.10 257/319] net: sctp, forbid negative length

2017-02-05 Thread Willy Tarreau
From: Jiri Slaby <jsl...@suse.cz>

commit a4b8e71b05c27bae6bad3bdecddbc6b68a3ad8cf upstream.

Most of getsockopt handlers in net/sctp/socket.c check len against
sizeof some structure like:
	if (len < sizeof(int))
		return -EINVAL;

On the first look, the check seems to be correct. But since len is int
and sizeof returns size_t, int gets promoted to unsigned size_t too. So
the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
false.

Fix this in sctp by explicitly checking len < 0 before any getsockopt
handler is called.

Note that sctp_getsockopt_events already handled the negative case.
Since we added the < 0 check elsewhere, this one can be removed.

If not checked, this is the result:
UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
shift exponent 52 is too large for 32-bit type 'int'
CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
  88006d99f2a8 b2f7bdea 41b58ab3
 b4363c14 b2f7bcde 88006d99f2d0 88006d99f270
   0034 b5096422
Call Trace:
 [] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
...
 [] ? kmalloc_order+0x24/0x90
 [] ? kmalloc_order_trace+0x24/0x220
 [] ? __kmalloc+0x330/0x540
 [] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
 [] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
 [] ? sock_common_getsockopt+0xb9/0x150
 [] ? SyS_getsockopt+0x1a5/0x270

Signed-off-by: Jiri Slaby <jsl...@suse.cz>
Cc: Vlad Yasevich <vyasev...@gmail.com>
Cc: Neil Horman <nhor...@tuxdriver.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: linux-s...@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Neil Horman <nhor...@tuxdriver.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/sctp/socket.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index bdc3fb6..86e7352 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4259,7 +4259,7 @@ static int sctp_getsockopt_disable_fragments(struct sock *sk, int len,
 static int sctp_getsockopt_events(struct sock *sk, int len, char __user *optval,
 				  int __user *optlen)
 {
-	if (len <= 0)
+	if (len == 0)
 		return -EINVAL;
 	if (len > sizeof(struct sctp_event_subscribe))
 		len = sizeof(struct sctp_event_subscribe);
@@ -5770,6 +5770,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 	if (get_user(len, optlen))
 		return -EFAULT;
 
+	if (len < 0)
+		return -EINVAL;
+
 	sctp_lock_sock(sk);
 
 	switch (optname) {
-- 
2.8.0.rc2.1.gbe9624a



Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-25 Thread Willy Tarreau
On Wed, Jan 25, 2017 at 12:22:05PM -0500, David Miller wrote:
> From: Wei Wang 
> Date: Wed, 25 Jan 2017 09:15:34 -0800
> 
> > Looks like you sent a separate patch on top of this patch series to
> > address double connect().  Then I think this patch series should be
> > good to go.
> 
> Indeed, Willy please give some kind of ACK.

Yes sorry David, for me it's OK. If Wei runs his whole series of tests
on it again, it should be perfect.

thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-25 Thread Willy Tarreau
Hi Wei,

On Wed, Jan 25, 2017 at 09:15:34AM -0800, Wei Wang wrote:
> Willy,
> 
> Looks like you sent a separate patch on top of this patch series to address
> double connect().

Yes, sorry, I wanted to reply to this thread after the git-send-email
and got caught immediately after :-)

So as suggested by Eric in order to make the review easier, it was done
on top of your series.

> Then I think this patch series should be good to go.
> I will get your patch tested with our TFO test cases.

I think so as well. Thanks for running the tests. On my side I could fix
the haproxy bug which triggered this, and could verify the the whole
series works fine both with and without the haproxy fix. So I think we're
good now.

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-24 Thread Willy Tarreau
On Tue, Jan 24, 2017 at 10:51:25AM -0800, Eric Dumazet wrote:
> We do not return -1 / EINPROGRESS but 0
> 
> Do not call connect() twice, it is clearly not supposed to work.

Yes it is, it normally returns -1 / EISCONN on a regular socket :

  EISCONN
  The socket is already connected.

> Fact that it happened to work is still kept for applications not using
> new features (like TCP_FASTOPEN_CONNECT), we wont break this.

Sure but as we saw, deeply burried silent bugs having no effect in
existing applications can suddenly become problematic once TFO is
enabled, and the semantics difference between the two are minimal
enough to warrant being closed.

> I would prefer you submit _if_ needed a patch on top of Wei patch, which
> was carefully tested with our ~500 packetdrill tests.

I totally understand and rest assured that I have a great respect for
this amount of test, which is also why I find the feature really exciting.
I'll probably propose something involving an extra argument then, this
will be much easier to review in the perspective of the existing tests.

> TCP_FASTOPEN_CONNECT + connect() returning 0 is already a violation of
> past behavior, in the sense that no connection really happened yet.

I agree but semantically it could be considered that it means "connect()
already called successfully, feel free to proceed with send() whenever
you want" and that's why it's appealing ;-)

> An application exploiting this return value and consider the server is
> reachable would be mistaken.

100% agree, I even had a private discussion regarding this, mentioning
that I already added a test in haproxy to only enable it if there are
data scheduled for leaving. In my case it's easy because I already have
the same test to decide whether or not to disable TCP_QUICKACK to save
one packet by sending the payload with the first ACK. So in short it will
be :

 if (data) {
  if (disable_quick_ack)
   setsockopt(fd, SOL_TCP, TCP_QUICKACK, , sizeof());
  if (enable_fastopen)
   setsockopt(fd, SOL_TCP, TCP_FASTOPEN_CONNECT, , 
sizeof());
 }
 connect(fd, ...);

But I certainly understand that in some implementations it could be
trickier. That just reminds me that I haven't tested it combined with
splicing. I'll have to try this.

> We do not support connect() + TCP_FASTOPEN_CONNECT + read(), because we
> do not want to add yet another conditional test in recvmsg() fast path
> for such feature, while existing sendmsg() can already be used to send a
> SYN with FastOpen option.

Yes, I think the mechanism is complex enough internally not to try to
make it even more complex :-)

> So people are not expected to blindly add TCP_FASTOPEN_CONNECT to every
> TCP socket they allocate/use. This would add too much bloat to the
> kernel.

I really think that the true benefit of TFO is for HTTP and SSL where
the client speaks first and already has something to say when the decision
to connect is made. It should be clear in implementors' minds that it
cannot be a default setting and that it doesn't make sense.

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-24 Thread Willy Tarreau
On Tue, Jan 24, 2017 at 09:42:07AM -0800, Eric Dumazet wrote:
> On Tue, 2017-01-24 at 09:26 -0800, Yuchung Cheng wrote:
> 
> > >
> > > Do you think there's a compelling reason for adding a new option or
> > > are you interested in a small patch to perform the change above ?
> > I like the proposal especially other stack also uses TCP_FASTOPEN
> > https://msdn.microsoft.com/en-us/library/windows/desktop/ms738596(v=vs.85).aspx
> 
> 
> Problem is that might break existing applications that were using
> TCP_FASTOPEN before a connect() (it was a NOP until now)
> 
> I prefer we use a separate new option to be 100% safe, not adding
> regressions.
> 
> Only new applications, tested, will use this new feature at their risk.

That's indeed a good point. Yuchung's comment above made me wonder
about applications' portability but very few OSes will use this and in
the end it might be that portable applications will just add :

   #define TCP_FASTOPEN_CONNECT TCP_FASTOPEN

For other OSes and use TCP_FASTOPEN_CONNECT only for the connect() case.

Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-24 Thread Willy Tarreau
Hi Eric,

On Tue, Jan 24, 2017 at 09:44:49AM -0800, Eric Dumazet wrote:
> I believe there is a bug in this application.
> 
> It does not check connect() return value.

Yes in fact it does but I noticed the same thing, there's something causing
the event not to be registered or something like this.

> When 0 is returned, it makes no sense to wait 200 ms :
> 
> > 06:29:24.048593 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, 
> > u64=9}}) = 0
> > 06:29:24.048651 epoll_wait(3, [], 200, 0) = 0
> > 06:29:24.048699 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> 
> And it makes no sense to call connect() again :
> 
> > 06:29:24.048751 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
> > sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

I totally agree.

> man connect
> 
> 
> Generally,  connection-based protocol sockets may successfully connect()
> only once;
> 
> 
> 
> I would prefer we do not add yet another bit in tcp kernel sockets, to
> work around some oddity in your program Willy.

I'm fine with chasing the bug on my side and fixing it, but there's a
semantic trouble anyway with returning -EINPROGRESS :
  - connect() = 0 indicates that the connection is established
  - then a further connect() should return -EISCONN, and does so when
not using TFO

man connect says this regarding EINPROGRESS :

The socket is nonblocking and the connection cannot be completed immediately.
It is possible to select(2) or poll(2) for completion by selecting the socket
for writing. After select(2) indicates writability, use getsockopt(2) to read
the SO_ERROR option at level SOL_SOCKET to determine whether connect()
completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one
of the usual error codes listed here, explaining the reason for the failure).

Here we clearly have an incompatibility between this EINPROGRESS saying
that we must poll, and poll returning POLLOUT suggesting that it's now
OK.

I'm totally fine with not using an extra bit in a scarce area, but then
we can either add an extra argument to __inet_stream_connect() to say
"this is sendmsg" or just add an extra flag in the last argument.

But in general I don't feel comfortable with a semantics that doesn't
completely match the current and documented one :-/

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 10:59:22AM -0800, Wei Wang wrote:
> This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
> alternative way to perform Fast Open on the active side (client).

Wei, I think that nothing prevents us from reusing the original TCP_FASTOPEN
sockopt instead of adding a new one. The original one does this :

case TCP_FASTOPEN:
if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
TCPF_LISTEN))) {
tcp_fastopen_init_key_once(true);

fastopen_queue_tune(sk, val);
} else {
err = -EINVAL;
}
break;

and your new option does this :

case TCP_FASTOPEN_CONNECT:
if (val > 1 || val < 0) {
err = -EINVAL;
} else if (sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) {
if (sk->sk_state == TCP_CLOSE)
tp->fastopen_connect = val;
else
err = -EINVAL;
} else {
err = -EOPNOTSUPP;
}
break;

Now if we compare :
  - the value ranges are the same (0,1)
  - tcp_fastopen_init_key_once() only performs an initialization once
  - fastopen_queue_tune() only sets sk->max_qlen based on the backlog,
this has no effect on an outgoing connection ;
  - tp->fastopen_connect can be applied to a listening socket without
side effect.

Thus I think we can merge them this way :

case TCP_FASTOPEN:
if (val >= 0) {
if ((sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) &&
    (sk->sk_state == TCP_CLOSE))
tp->fastopen_connect = val;

if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) {
tcp_fastopen_init_key_once(true);
fastopen_queue_tune(sk, val);
}
} else {
err = -EINVAL;
}
break;

And for the userland, the API is even simpler because we can use the
same TCP_FASTOPEN sockopt regardless of the socket direction. Also,
I don't know if TCP_FASTOPEN is supported on simultaneous connect,
but at least if it works it would be easier to understand this way.

Do you think there's a compelling reason for adding a new option or
are you interested in a small patch to perform the change above ?

Regards,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 02:57:31PM -0800, Wei Wang wrote:
> Yes. That seems to be a valid fix to it.
> Let me try it with my existing test cases as well to see if it works for
> all scenarios I have.

Perfect. Note that since the state 2 is transient I initially thought
about abusing the flags passed to __inet_stream_connect() to say "hey
I'm sendmsg() and not connect()" but that would have been an ugly hack
while here we really have the 3 socket states represented even though
one changes twice around a call.

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 11:01:21PM +0100, Willy Tarreau wrote:
> On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote:
> > On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> > > Hi Willy,
> > > 
> > > True. If you call connect() multiple times on a socket which already has
> > > cookie without a write(), the second and onward connect() call will return
> > > EINPROGRESS.
> > > It is basically because the following code block in 
> > > __inet_stream_connect()
> > > can't distinguish if it is the first time connect() is called or not:
> > > 
> > > case SS_CONNECTING:
> > > if (inet_sk(sk)->defer_connect)  <- defer_connect will
> > > be 0 only after a write() is called
> > > err = -EINPROGRESS;
> > > else
> > > err = -EALREADY;
> > > /* Fall out of switch with err, set for this state */
> > > break;
> > 
> > Ah OK that totally makes sense, thanks for the explanation!
> > 
> > > I guess we can add some extra logic here to address this issue. So the
> > > second connect() and onwards will return EALREADY.
> 
> Thinking about it a bit more, I really think it would make more
> sense to return -EISCONN here if we want to match the semantics
> of a connect() returning zero on the first call. This way the
> caller knows it can write whenever it wants and can disable
> write polling until needed.
> 
> I'm currently rebuilding a kernel with this change to see if it
> behaves any better :
> 
> -err = -EINPROGRESS;
> +err = -EISCONN;

OK so obviously it didn't work since sendmsg() goes there as well.

But that made me realize that there really are 3 states, not 2 :

  - after connect() and before sendmsg() :
 defer_connect = 1, we want to lie to userland and pretend we're
 connected so that userland can call send(). A connect() must
 return either zero or -EISCONN.

  - during first sendmsg(), still connecting :
 the connection is in progress, EINPROGRESS must be returned to
 the first sendmsg().

  - after the first sendmsg() :
 defer_connect = 0; connect() must return -EALREADY. We want to
 return real socket states from now on.

Thus I modified defer_connect to take two bits to represent the extra
state we need to indicate the transition. Now sendmsg() upgrades
defer_connect from 1 to 2 before calling __inet_stream_connect(), which
then knows it must return EINPROGRESS to sendmsg().

This way we correctly cover all these situations. Even if we call
connect() again after the first connect() attempt it still matches
the first result :

accept4(7, {sa_family=AF_INET, sin_port=htons(36860), 
sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 1
setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0
accept4(7, 0x7ffc2282fcb0, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(1, 0x7b53a4, 7006, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily 
unavailable)
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 2
fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
setsockopt(2, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(2, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
connect(2, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0
epoll_wait(0, [], 200, 0)   = 0
connect(2, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EISCONN (Transport endpoint is 
already connected)
epoll_wait(0, [], 200, 0)   = 0
recvfrom(2, 0x7b53a4, 8030, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily 
unavailable)
epoll_ctl(0, EPOLL_CTL_ADD, 2, {EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}) = 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1
recvfrom(1, "GET / HTTP/1.1\r\n", 8030, 0, NULL, NULL) = 16
sendto(2, "GET / HTTP/1.1\r\n", 16, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 16
epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1
recvfrom(1, "\r\n", 8030, 0, NULL, NULL) = 2
sendto(2, "\r\n", 2, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 2
epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}], 200, 1000) = 1
recvfrom(2, "HTTP/1.1 302 Found\r\nCache-Contro"..., 8030, 0, NULL, NULL) = 98
sendto(1, "HTTP/1.1 302 Found\r\nCache-Contro"..., 98, 
MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 98
shutdown(1, SHUT_WR)= 0
epoll_ctl(0, EPOLL_CTL_DEL, 2, 0x6ff55c) = 0
epoll_wait(0, [{EPOLLIN|EPOLLHUP|

Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote:
> On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> > Hi Willy,
> > 
> > True. If you call connect() multiple times on a socket which already has
> > cookie without a write(), the second and onward connect() call will return
> > EINPROGRESS.
> > It is basically because the following code block in __inet_stream_connect()
> > can't distinguish if it is the first time connect() is called or not:
> > 
> > case SS_CONNECTING:
> > if (inet_sk(sk)->defer_connect)  <- defer_connect will
> > be 0 only after a write() is called
> > err = -EINPROGRESS;
> > else
> > err = -EALREADY;
> > /* Fall out of switch with err, set for this state */
> > break;
> 
> Ah OK that totally makes sense, thanks for the explanation!
> 
> > I guess we can add some extra logic here to address this issue. So the
> > second connect() and onwards will return EALREADY.

Thinking about it a bit more, I really think it would make more
sense to return -EISCONN here if we want to match the semantics
of a connect() returning zero on the first call. This way the
caller knows it can write whenever it wants and can disable
write polling until needed.

I'm currently rebuilding a kernel with this change to see if it
behaves any better :

-err = -EINPROGRESS;
+err = -EISCONN;

I'll keep you updated.

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> Hi Willy,
> 
> True. If you call connect() multiple times on a socket which already has
> cookie without a write(), the second and onward connect() call will return
> EINPROGRESS.
> It is basically because the following code block in __inet_stream_connect()
> can't distinguish if it is the first time connect() is called or not:
> 
> case SS_CONNECTING:
> if (inet_sk(sk)->defer_connect)  <- defer_connect will
> be 0 only after a write() is called
> err = -EINPROGRESS;
> else
> err = -EALREADY;
> /* Fall out of switch with err, set for this state */
> break;

Ah OK that totally makes sense, thanks for the explanation!

> I guess we can add some extra logic here to address this issue. So the
> second connect() and onwards will return EALREADY.

If that's possible at little cost it would be nice, because your patch
makes it so easy to enable TFO on outgoing connections now that I
expect many people will blindly run the setsockopt() before connect().

Do not hesitate to ask me to run some tests. While 4 years ago it was
not easy, here it's very simple for me. By the way I'm seeing an ~10%
performance increase on haproxy by enabling this, it's really cool!

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
Hi Wei,

first, thanks a lot for doing this, it's really awesome!

I'm testing it on 4.9 on haproxy and I met a corner case : when I
perform a connect() to a server and I have nothing to send, upon
POLLOUT notification since I have nothing to send I simply probe the
connection using connect() again to see if it returns EISCONN or
anything else. But here now I'm seeing EINPROGRESS loops.

To illustrate this, here's what I'm doing :

:8000  :8001
  [ client ] ---> [ proxy ] ---> [ server ]

The proxy is configured to enable TFO to the server and the server
supports TFO as well. The proxy and the server are in fact two proxy
instances in haproxy running in the same process for convenience.

When I already have data to send here's what I'm seeing (so it works fine) :

06:29:16.861190 accept4(7, {sa_family=AF_INET, sin_port=htons(33986), 
sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 9
06:29:16.861277 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.861342 accept4(7, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.861417 recvfrom(9, "BLAH\n", 7006, 0, NULL, NULL) = 5
06:29:16.861509 recvfrom(9, 0x2619329, 7001, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.861657 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
06:29:16.861730 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
06:29:16.861779 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.861882 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
06:29:16.861942 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
06:29:16.862015 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, 
u64=9}}) = 0
06:29:16.862072 epoll_wait(3, [], 200, 0) = 0
06:29:16.862126 sendto(10, "BLAH\n", 5, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 5
06:29:16.862281 epoll_wait(3, [{EPOLLIN, {u32=8, u64=8}}], 200, 0) = 1
06:29:16.862334 recvfrom(10, 0x26173a4, 8030, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.862385 accept4(8, {sa_family=AF_INET, sin_port=htons(46760), 
sin_addr=inet_addr("127.0.0.1")}, [128->16], SOCK_NONBLOCK) = 11
06:29:16.862450 setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.862504 accept4(8, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.862564 recvfrom(11, "BLAH\n", 7006, 0, NULL, NULL) = 5


When I don't have data, here's what I'm seeing :

06:29:24.047801 accept4(7, {sa_family=AF_INET, sin_port=htons(33988), 
sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 9
06:29:24.047899 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:24.047966 accept4(7, 0x7ffdedb2c7f0, [128], SOCK_NONBLOCK) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:24.048043 recvfrom(9, 0xd31324, 7006, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:24.048281 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
06:29:24.048342 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
06:29:24.048392 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:24.048447 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
06:29:24.048508 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
06:29:24.048593 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, 
u64=9}}) = 0
06:29:24.048651 epoll_wait(3, [], 200, 0) = 0
06:29:24.048699 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.048751 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
06:29:24.048808 epoll_ctl(3, EPOLL_CTL_ADD, 10, {EPOLLOUT, {u32=10, u64=10}}) = 0
06:29:24.048860 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1
06:29:24.048912 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.048963 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
06:29:24.049018 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1
06:29:24.049072 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.049122 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)


I theorically understand why but I think we have something wrong here
and instead we should have -1 EISCONN (to pretend the connection is
established) or return EALREADY (to mention that a previous request was
already made and that we're waiting for the next step).

While I can instrument my connect() *not* to use TFO when connecting
without any pending data, I don't always know this (eg when I use
openssl and cross fingers so that it decides to quickly send something
on the next round).

I think it's easy to fall into this tricky corner case and am wondering
what can be done about it. Does the EINPROGRESS happen only because there
is no cookie yet ? If so, 

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-11 Thread Willy Tarreau
On Sun, Dec 11, 2016 at 03:50:31PM +0100, Jason A. Donenfeld wrote:
> 3. Add 3 bytes of padding, set to zero, to the encrypted section just
> before the IP header, marked for future use.
> Pros: satisfies IETF mantras, can use those extra bits in the future
> for interesting protocol extensions for authenticated peers.
> Cons: lowers MTU, marginally more difficult to implement but still
> probably just one or two lines of code.
> 
> Of these, I'm leaning toward (3).

Or 4) add one byte to the cleartext header for future use (mostly flags
maybe) and 2 bytes of padding to the encrypted header. This way you get
the following benefits :
  1) your encrypted text is at least 16-bit aligned, maybe it matters
 in your checksum computations on during decryption
  2) your MTU remains even, this is better for both ends
  3) you're free to add some bits either to the encrypted or the clear
 parts.

Just a suggestion :-)

Willy


Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-11 Thread Willy Tarreau
Hi Jason,

On Thu, Dec 08, 2016 at 11:20:04PM +0100, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Thu, Dec 8, 2016 at 1:37 AM, David Miller  wrote:
> > You really have to land the IP header on a proper 4 byte boundary.
> >
> > I would suggest pushing 3 dummy garbage bytes of padding at the front
> > or the end of your header.
> 
> Are you sure 3 bytes to get 4 byte alignment is really the best?

It's always the best. However there's another option which should be
considered : maybe it's difficult but not impossible to move some bits
from the current protocol to remove one byte. That's not always easy,
and sometimes you cannot do it just for one bit. However after you run
through this exercise, if you notice there's really no way to shave
this extra byte, you'll realize there's no room left for future
extensions and you'll more easily accept to add 3 empty bytes for
this, typically protocol version, tags, qos or flags that you'll be
happy to rely on for future versions of your protocol.

Also while it can feel like you're making your protocol less efficient,
keep in mind that these 3 bytes only matter for small packets, and
Ethernet already pads small frames to 64 bytes, so in practice any
IP packet smaller than 46 bytes will not bring any extra savings.

Best regards,
Willy


Re: [ANNOUNCE] ndiv: line-rate network traffic processing

2016-09-21 Thread Willy Tarreau
Hi Tom,

On Wed, Sep 21, 2016 at 10:16:45AM -0700, Tom Herbert wrote:
> This does seem interesting and indeed the driver datapath looks very
> much like XDP. It would be quite interesting if you could rebase and
> then maybe look at how this can work with XDP that would be helpful.

OK I'll assign some time to rebase it then.

> The return actions are identical,

I'm not surprised that the same needs lead to the same designs when
these designs are constrained by CPU cycle count :-)

> but processing descriptor meta data
> (like checksum, vlan) is not yet implemented in XDP-- maybe this is
> something we can leverage from ndiv?

Yes possibly. It's not a big work but it's absolutely mandatory if you
don't want to waste some smart devices' valuable performance improvements.
We changed our API when porting it to ixgbe to support what this NIC (and
many other ones) supports so that the application code doesn't have to
deal with checksums etc. By the way, VLAN is not yet implemented in the
mvneta driver. But this choice ensures that no application has to deal
with it, nor risks creating bugs there.

Cheers,
Willy


Re: [ANNOUNCE] ndiv: line-rate network traffic processing

2016-09-21 Thread Willy Tarreau
Hi Jesper!

On Wed, Sep 21, 2016 at 06:26:39PM +0200, Jesper Dangaard Brouer wrote:
> I definitely want to study it!

Great, at least I've not put this online for nothing :-)

> You mention XDP.  If you didn't notice, I've created some documentation
> on XDP (it is very "live" documentation at this point and it will
> hopefully "materialized" later in the process).  But it should be a good
> starting point for understanding XDP:
> 
>  https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html

Thanks, I'll read it. We'll need to educate ourselves to see how to port
our anti-ddos to XDP in the future I guess, so better ensure the design
is fit from the beginning!

> > I presented it in 2014 at kernel recipes :
> >   
> > http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/
> 
> Cool, and it even have a video!

Yep, with a horrible english accent :-)

> > It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
> > very light, and retrieves the packets in the NIC's driver before they
> > are converted to an skb, then submits them to a registered RX handler
> > running in softirq context so we have the best of all worlds by
> > benefitting from CPU scalability, delayed processing, and not paying
> > the cost of switching to userland. Also an rx_done() function allows
> > handlers to batch their processing. 
> 
> Wow - it does sound a lot like XDP!  I would say that is sort of
> validate the current direction of XDP, and that there are real
> use-cases for this stuff.

Absolutely! In fact what drove us to this architecture is that we first
wrote our anti-ddos in userland using netmap. While userland might be OK
for switches and routers, in our case we have haproxy listening on TCP
sockets and waiting for these packets. So the packets were bouncing from
kernel to user, then to kernel again, losing checksums, GRO, GSO, etc...
We modified it to support all of these but the performance was still poor,
capping at about 8 Gbps of forwarded traffic instead of ~40. Thus we thought
that the processing would definitely need to be placed in the kernel to avoid
this bouncing, and to avoid turning rings into newer rings all the time.
That's when I realized that it could possibly also cover my needs for a
sniffer and we redesigned the initial code to support both use cases. Now
we don't even see it in regular traffic, which is pretty nice.

> > The RX handler returns an action
> > among accepting the packet as-is, accepting it modified (eg: vlan or
> > tunnel decapsulation), dropping it, postponing the processing
> > (equivalent to EAGAIN), or building a new packet to send back.
> 
> I'll be very interested in studying in-details how you implemented and
> choose what actions to implement.

OK. The HTTP server is a good use case to study because it lets packets
pass through, being dropped, or being responded to, and the code is very
small, so easy to analyse.

> What was the need for postponing the processing (EAGAIN)?

Our SYN cookie generator. If the NIC's Tx queue is full and we cannot build
a SYN-ACK, we prefer to break out of the Rx loop because there's still room
in the Rx ring (statistically speaking).

> > This last function is the one requiring the most changes in existing
> > drivers, but offers the widest range of possibilities. We use it to
> > send SYN cookies, but I have also implemented a stateless HTTP server
> > supporting keep-alive using it, achieving line-rate traffic processing
> > on a single CPU core when the NIC supports it. It's very convenient to
> > test various stateful TCP components as it's easy to sustain millions
> > of connections per second on it.
> 
> Interesting, and controversial use-case.  One controversial use-case
> for XDP, that I imagine was implementing a DNS accelerator, what
> answers simple and frequent requests.

We thought about such a use case as well, just like of a ping responder
(rate limited to avoid serving as DDoS responders).

> You took it a step further with a HTTP server!

It's a fake HTTP server. You ask it to return 1kB of data and it sends you
1kB. It can even do multiple segments but then you're facing the risk of
losses that you'd preferably avoid. But in our case it's very useful to
test various things including netfilter, LVS and haproxy because it
consumes so little power while reaching performance levels that they cannot
even match, so you can set it up on a small machine (eg: a cheap USB-powered
ARM board saturates the GigE link with 340 kcps, 663 krps). However I found
that it *could* be fun to improve it to deliver favicons or small error
pages.

> > It does not support forwarding between NICs. It was my first goal
> > because I wanted to implement a TAP with it, bridging the traffic
> > between two ports, but figured it was adding some complexity to the
> > system back then. 
> 
> With all the XDP features at the moment, we have avoided going through
> the page allocator, by relying on different page 

[ANNOUNCE] ndiv: line-rate network traffic processing

2016-09-21 Thread Willy Tarreau
Hi,

Over the last 3 years I've been working a bit on high traffic processing
for various reasons. It started with the wish to capture line-rate GigE
traffic on very small fanless ARM machines and the framework has evolved
to be used at my company as a basis for our anti-DDoS engine capable of
dealing with multiple 10G links saturated with floods.

I know it comes a bit late now that there is XDP, but it's my first
vacation since then and I needed to have a bit of calm time to collect
the patches from the various branches and put them together. Anyway I'm
sending this here in case it can be of interest to anyone, for use or
just to study it.

I presented it in 2014 at kernel recipes :
  
http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/

It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
very light, and retrieves the packets in the NIC's driver before they
are converted to an skb, then submits them to a registered RX handler
running in softirq context so we have the best of all worlds by
benefitting from CPU scalability, delayed processing, and not paying
the cost of switching to userland. Also an rx_done() function allows
handlers to batch their processing. The RX handler returns an action
among accepting the packet as-is, accepting it modified (eg: vlan or
tunnel decapsulation), dropping it, postponing the processing
(equivalent to EAGAIN), or building a new packet to send back.

This last function is the one requiring the most changes in existing
drivers, but offers the widest range of possibilities. We use it to
send SYN cookies, but I have also implemented a stateless HTTP server
supporting keep-alive using it, achieving line-rate traffic processing
on a single CPU core when the NIC supports it. It's very convenient to
test various stateful TCP components as it's easy to sustain millions
of connections per second on it.

It does not support forwarding between NICs. It was my first goal
because I wanted to implement a TAP with it, bridging the traffic
between two ports, but figured it was adding some complexity to the
system back then. However since then we've implemented traffic
capture in our product, exploiting this framework to capture without
losses at 14 Mpps. I may find some time to try to extract it later.
It uses the /sys API so that you can simply plug tcpdump -r on a
file there, though there's also an mmap version which uses less CPU
(that's important at 10G).

In its current form since the initial code's intent was to limit
core changes, it happens not to modify anything in the kernel by
default and to reuse the net_device's ax25_ptr to attach devices
(idea borrowed from netmap), so it can be used on an existing
kernel just by loading the patched network drivers (yes, I know
it's not a valid solution for the long term).

The current code is available here :

  http://git.kernel.org/cgit/linux/kernel/git/wtarreau/ndiv.git/

Please let me know if there could be some interest in rebasing it
on more recent versions (currently 3.10, 3.14 and 4.4 are supported).
I don't have much time to assign to it since it works fine as-is,
but will be glad to do so if that can be useful.

Also the stateless HTTP server provided in it definitely is a nice
use case for testing such a framework.

Regards,
Willy


Re: [PATCH net] net: mvneta: set real interrupt per packet for tx_done

2016-07-05 Thread Willy Tarreau
Hi,

On Wed, Jul 06, 2016 at 04:18:58AM +0200, Marcin Wojtas wrote:
> From: Dmitri Epshtein <d...@marvell.com>
> 
> Commit aebea2ba0f74 ("net: mvneta: fix Tx interrupt delay") intended to
> set coalescing threshold to a value guaranteeing interrupt generation
> per each sent packet, so that buffers can be released with no delay.
> 
> In fact setting threshold to '1' was wrong, because it causes interrupt
> every two packets. According to the documentation a reason behind it is
> following - interrupt occurs once sent buffers counter reaches a value,
> which is higher than one specified in MVNETA_TXQ_SIZE_REG(q). This
> behavior was confirmed during tests. Also when testing the SoC working
> as a NAS device, better performance was observed with int-per-packet,
> as it strongly depends on the fact that all transmitted packets are
> released immediately.
> 
> This commit enables NETA controller work in interrupt per sent packet mode
> by setting coalescing threshold to 0.

We had a discussion about this in January 2015 and I thought I sent a patch
to change it but I can't find it, so I think it was only proposed to some
users for testing. I also remember that on more recent kernels by then
(>=3.13) we observed a slightly better performance with this value set to
zero.

Acked-by: Willy Tarreau <w...@1wt.eu>

Willy


[PATCH 3.10 128/143] VSOCK: do not disconnect socket when peer has shutdown SEND only

2016-06-05 Thread Willy Tarreau
From: Ian Campbell <ian.campb...@docker.com>

commit dedc58e067d8c379a15a8a183c5db318201295bb upstream.

The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.

Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.

I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.

Signed-off-by: Ian Campbell <ian.campb...@docker.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Cc: Claudio Imbrenda <imbre...@linux.vnet.ibm.com>
Cc: Andy King <ack...@vmware.com>
Cc: Dmitry Torokhov <d...@vmware.com>
Cc: Jorgen Hansen <jhan...@vmware.com>
Cc: Adit Ranadive <ad...@vmware.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/vmw_vsock/af_vsock.c | 21 +
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9b88693..66a9bf5 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1804,27 +1804,8 @@ vsock_stream_recvmsg(struct kiocb *kiocb,
else if (sk->sk_shutdown & RCV_SHUTDOWN)
err = 0;
 
-   if (copied > 0) {
-   /* We only do these additional bookkeeping/notification steps
-* if we actually copied something out of the queue pair
-* instead of just peeking ahead.
-*/
-
-   if (!(flags & MSG_PEEK)) {
-   /* If the other side has shutdown for sending and there
-* is nothing more to read, then modify the socket
-* state.
-*/
-   if (vsk->peer_shutdown & SEND_SHUTDOWN) {
-   if (vsock_stream_has_data(vsk) <= 0) {
-   sk->sk_state = SS_UNCONNECTED;
-   sock_set_flag(sk, SOCK_DONE);
-   sk->sk_state_change(sk);
-   }
-   }
-   }
+   if (copied > 0)
err = copied;
-   }
 
 out_wait:
finish_wait(sk_sleep(sk), &wait);
-- 
2.8.0.rc2.1.gbe9624a



Re: [PATCH] nf_conntrack: avoid kernel pointer value leak in slab name

2016-05-14 Thread Willy Tarreau
On Sat, May 14, 2016 at 03:21:31PM -0700, Linus Torvalds wrote:
> On Sat, May 14, 2016 at 2:33 PM, Willy Tarreau <w...@1wt.eu> wrote:
> >
> > Why simply not cast the atomic to (unsigned long long) instead of (u64)
> > so that %llu always matches ?
> 
> Yes, that fixes the problem. It's just more typing, and annoying. The
> fact that MS got it right while posix and gcc screwed it up is a bit
> embarrassing..

Well on the other hand, because of this MS still has problems porting
code from 32 to 64 bit. The real problem is that on both sides they
imagined that you needed only one way to specify your types. In practice
users generally want either the most optimal types for the architecture
because they don't care about the size (char, int, size_t, void *...)
or a specific size. This last one is annoying to use with printf format.

> If we ever start using __uint128_t, we'll have even more problems in
> this area. Oh well.

Definitely.

Willy



Re: [PATCH] nf_conntrack: avoid kernel pointer value leak in slab name

2016-05-14 Thread Willy Tarreau
On Sat, May 14, 2016 at 02:31:04PM -0700, Linus Torvalds wrote:
> On Sat, May 14, 2016 at 11:24 AM, Linus Torvalds
>  wrote:
> >
> >
> > -   net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%p", net);
> > +   net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%llu",
> > +   (u64)atomic64_inc_return(&unique_id));
> 
> Oh well. I suspect this is going to cause a new warning on alpha and
> ia64 and possibly others.
> 
> "u64" is indeed "unsigned long long" on x86 and many other
> architectures, but on alpga and ia64 it's just "unsigned long".
> 
> So that case should have been to "long long". I detest how there isn't
> a "64-bit size" printf specifier.

Why simply not cast the atomic to (unsigned long long) instead of (u64)
so that %llu always matches ?

Willy



Re: [PATCH 3.2 085/115] veth: don't modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Willy Tarreau
On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote:
> On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:
> > Consider:
> > 
> > - App A  sends out corrupt packets 50% of the time and discards inbound 
> > data.
(...)
> How can you make a generic app C know how to do this?  The path could be,
> for instance:
> 
> eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> 
> vethC <-> vethD <-> appC
> 
> There are no sockets on vethB, but it does need to have special behaviour to 
> elide
> csums.  Even if appC is hacked to know how to twiddle some thing on it's veth 
> port,
> mucking with vethD will have no effect on vethB.
> 
> With regard to your example above, why would A corrupt packets?  My guess:
> 
> 1)  It has bugs (so, fix the bugs, it could equally create incorrect data 
> with proper checksums,
> so just enabling checksumming adds no useful protection.)

I agree with Ben here, what he needs is the ability for userspace to be
trusted when *forwarding* a packet. Ideally you'd only want to receive
the csum status per packet on the packet socket and pass the same value
on the vethA interface, with this status being kept when the packet
reaches vethB.

If A purposely corrupts packet, it's A's problem. It's similar to designing
a NIC which intentionally corrupts packets and reports "checksum good".

The real issue is that in order to do things right, the userspace bridge
(here, "A") would really need to pass this status. In Ben's case as he
says, bad checksum packets are dropped before reaching A, so that
simplifies the process quite a bit and that might be what causes some
confusion, but ideally we'd rather have recvmsg() and sendmsg() with
these flags.

I faced the exact same issue 3 years ago when playing with netmap, it was
slow as hell because it would lose all checksum information when packets
were passing through userland, resulting in GRO/GSO etc being disabled,
and had to modify it to let userland preserve it. That's especially
important when you have to deal with possibly corrupted packets not yet
detected in the chain because the NIC did not validate their checksums.

Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Willy Tarreau
Hi Eric,

On Thu, Mar 24, 2016 at 11:49:41PM -0700, Eric Dumazet wrote:
> Everything is possible, but do not complain because BPF went in the
> kernel before your changes.

Don't get me wrong, I'm not complaining, I'm more asking for help to
try to elaborate the alternate solution. I understood well what my
proposal did because it was pretty straightforward, and the new one
I'll have to do is of an immense complexity for me by now, since it
will require learning a new language and finding doc on how all this
works together. But as I said I'm totally sold to the benefits it can
provide for large scale deployments and I'm well aware of the ugly
socket scan there was in the previous one.

> Just rework your patch.
> 
> Supporting multiple SO_REUSEPORT groups on the same port should not be
> too hard really. Making sure BPF still work for them is feasible.
> 
> But the semantic of the socket option would be really different.

I don't care much about the socket options themselves as long as I can
continue to support seamless reloads. I could even get rid of SO_REUSEPORT
on Linux to use something else instead if I have a reliable way to detect
that the alternative will work.

> You need to not control an individual listener, but a group of listener.
> 
> Your dying haproxy would issue a single system call to tell the kernel :
> My SO_REUSEPORT group should not accept new SYN packets, so that the new
> haproxy can setup a working new SO_REUSEPORT group.

Normally it's the other way around :-) The new process first grabs the
socket, there's a tiny window where both are attached, and only then the
old process leaves. That's the only way to ensure there's no loss nor
added latency in the processing.

If you could share a pointer to some good starter documentation for this,
I would appreciate it. I really have no idea where to start from and the
only elements I found on the net didn't give a single hint regarding all
this :-/

Thanks,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 04:54:03PM -0700, Tom Herbert wrote:
> On Thu, Mar 24, 2016 at 4:40 PM, Yann Ylavic  wrote:
> > I'll learn how to do this to get the best performances from the
> > server, but having to do so to work around what looks like a defect
> > (for simple/default SMP configurations at least, no NUMA or clever
> > CPU-affinity or queuing policy involved) seems odd in the first place.
> >
> I disagree with your assessment that there is a defect. SO_REUSEPORT
> is designed to spread packets amongst _equivalent_ connections. In the
> server draining case sockets are no longer equivalent, but that is a
> special case.

I partially disagree with you here Tom. Initially SO_REUSEPORT was not
used to spread packets but to allow soft restart in some applications.
I've been using it since 2001 in haproxy on *BSD and linux 2.2. It was
removed during 2.3 and I used to keep a patch to reimplement it in 2.4
(basically 2 or 3 lines, the infrastructure was still present), but the
patch was not accepted. The same patch worked for 2.6 and 3.x, allowing
me to continue to perform soft-restarts on Linux just like I used to do
on *BSD. When SO_REUSEPORT was reimplemented in 3.9 with load balancing,
I was happy because it at last allowed me to drop my patch and I got
the extra benefit of better load balancing of incoming connections.

But the main use we have for it (at least historically) is for soft
restarts, where one process replaces another one. Very few people use
more than one process in our case.

However given the benefits of the load spreading for extreme loads,
I'm willing to find how to achieve the same with BPF, but it's pretty
clear that at this point I have no idea where to start from and that
for a single process replacing a single one, it looks quite complicated.

For me quite frankly the special case is the load balancing which is
a side benefit (and a nice one, don't get me wrong).

That's why I would have found it nice to "fix" the process replacement
to avoid dropping incoming connections, though I don't want it to become
a problem for future improvements on BPF. I don't think the two lines I
proposed could become an issue but I'll live without them (or continue
to apply this patch).

BTW, I have no problem with having to write a little bit of assembly for
fast interfaces if it remains untouched for years, we do already have a
bit in haproxy. It's just a longterm investment.

Best regards,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 11:20:49AM -0700, Tolga Ceylan wrote:
> I would appreciate a conceptual description on how this would work
> especially for a common scenario
> as described by Willy. My initial impression was that a coordinator
> (master) process takes this
> responsibility to adjust BPF filters as children come and go.

Indeed that would help, I don't know where to start from for now.

> Two popular software has similar use cases: nginx and haproxy. Another
> concern is with the
> introduction of BPF itself, should we expect a performance drop in
> these applications?

Knowing how picky Eric is about performance in such areas, I'm not
worried a single second about adopting something he recommends :-)
I just need to ensure it covers our users' needs. And maybe the
solution I mentioned in the other e-mail could work.

Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 07:00:11PM +0100, Willy Tarreau wrote:
> Since it's not about
> load distribution and the processes are totally independent, I don't see
> well how to (ab)use BPF to achieve this.
> 
> The pattern is :
> 
>   t0 : unprivileged processes 1 and 2 are listening to the same port
>(sock1@pid1) (sock2@pid2)
><-- listening -->
> 
>   t1 : new processes are started to replace the old ones
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><-- listening --> <-- listening -->
> 
>   t2 : new processes signal the old ones they must stop
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><--- draining --> <-- listening -->
> 
>   t3 : pids 1 and 2 have finished, they go away
>  (sock3@pid3) (sock4@pid4)
> <-- gone ->  <-- listening -->
> 

Thinking a bit more about it, would it make sense to consider that in order
to address such a scenario, the only the new (still privileged) process
reconfigures the BPF to deliver traffic only to its own sockets and that
by doing so it will result in the old one not to receive any of it anymore ?
If so that could possibly be reasonably doable then. Ie: the old processes
don't have to do anything to stop receiving traffic.

Thanks,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 10:01:37AM -0700, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 17:50 +0100, Willy Tarreau wrote:
> > On Thu, Mar 24, 2016 at 09:33:11AM -0700, Eric Dumazet wrote:
> > > > --- a/net/ipv4/inet_hashtables.c
> > > > +++ b/net/ipv4/inet_hashtables.c
> > > > @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
> > > > return -1;
> > > > score += 4;
> > > > }
> > > > +   if (sk->sk_reuseport)
> > > > +   score++;
> > > 
> > > This wont work with BPF
> > > 
> > > > if (sk->sk_incoming_cpu == raw_smp_processor_id())
> > > > score++;
> > > 
> > > This one does not work either with BPF
> > 
> > But this *is* in 4.5. Does this mean that this part doesn't work anymore or
> > just that it's not usable in conjunction with BPF ? In this case I'm less
> > worried, because it would mean that we have a solution for non-BPF aware
> > applications and that BPF-aware applications can simply use BPF.
> > 
> 
> BPF can implement the CPU choice/pref itself. It has everything needed.

Well I don't need the CPU choice, it was already there, it's not my code,
I only need the ability for an independant process to stop receiving new
connections without altering the other processes nor dropping some of these
connections.

In fact initially I didn't even need anything related to incoming connection
load-balancing, just the ability to start a new process without stopping the
old one, as it used to work in 2.2 and for which I used to keep a patch in
2.4 and 2.6. When SO_REUSEPORT was reintroduced in 3.9, that solved the issue
and some users started to complain that between the old and the new processes,
some connections were lost. Hence the proposal above. Since it's not about
load distribution and that processes are totally independant, I don't see
well how to (ab)use BPF to achieve this.

The pattern is :

  t0 : unprivileged processes 1 and 2 are listening to the same port
   (sock1@pid1) (sock2@pid2)
   <-- listening -->

  t1 : new processes are started to replace the old ones
   (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
   <-- listening --> <-- listening -->

  t2 : new processes signal the old ones they must stop
   (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
   <--- draining --> <-- listening -->

  t3 : pids 1 and 2 have finished, they go away
 (sock3@pid3) (sock4@pid4)
<-- gone ->  <-- listening -->

> >   - it seems to me that for BPF to be usable on process shutting down, we'd
> > need to have some form of central knowledge if the goal is to redefine
> > how to distribute the load. In my case there are multiple independant
> > processes forked on startup, so it's unclear to me how each of them 
> > could
> > reconfigure BPF when shutting down without risking to break the other 
> > ones.
> >   - the doc makes me believe that BPF would require privileges to be unset, 
> > so
> > that would not be compatible with a process shutting down which has 
> > already
> > dropped its privileges after startup, but I could be wrong.
> > 
> > Thanks for your help on this,
> > Willy
> > 
> 
> The point is : BPF is the way to go, because it is expandable.

OK so this means we have to find a way to expand it to allow an individual
non-privileged process to change the distribution algorithm without impacting
other processes.

I need to discover it better to find what can be done, but unfortunately at
this point the sole principle makes me think of a level of complexity that
doesn't seem obvious to solve at all :-/

Regards,
Willy


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 09:33:11AM -0700, Eric Dumazet wrote:
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
> > return -1;
> > score += 4;
> > }
> > +   if (sk->sk_reuseport)
> > +   score++;
> 
> This wont work with BPF
> 
> > if (sk->sk_incoming_cpu == raw_smp_processor_id())
> > score++;
> 
> This one does not work either with BPF

But this *is* in 4.5. Does this mean that this part doesn't work anymore or
just that it's not usable in conjunction with BPF ? In this case I'm less
worried, because it would mean that we have a solution for non-BPF aware
applications and that BPF-aware applications can simply use BPF.

> Whole point of BPF was to avoid iterate through all sockets [1],
> and let user space use whatever selection logic it needs.
> 
> [1] This was okay with up to 16 sockets. But with 128 it does not scale.

Indeed.

> If you really look at how BPF works, implementing another 'per listener' flag
> would break the BPF selection.

OK.

> You can certainly implement the SO_REUSEPORT_LISTEN_OFF by loading an
> updated BPF, why should we add another way in the kernel to do the same,
> in a way that would not work in some cases ?

I don't try to reimplement something already available, but I'm confused
by a few points :
  - the code above already exists and you mention it cannot be used with BPF
  - for the vast majority of applications not using BPF, would the above *still*
work (it worked in 4.4-rc at least)
  - it seems to me that for BPF to be usable on process shutting down, we'd
need to have some form of central knowledge if the goal is to redefine
how to distribute the load. In my case there are multiple independent
processes forked on startup, so it's unclear to me how each of them could
reconfigure BPF when shutting down without risking to break the other ones.
  - the doc makes me believe that BPF would require privileges to be unset, so
that would not be compatible with a process shutting down which has already
dropped its privileges after startup, but I could be wrong.

Thanks for your help on this,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
Hi Eric,

(just lost my e-mail, trying not to forget some points)

On Thu, Mar 24, 2016 at 07:45:44AM -0700, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 15:22 +0100, Willy Tarreau wrote:
> > Hi Eric,
> 
> > But that means that any software making use of SO_REUSEPORT needs to
> > also implement BPF on Linux to achieve the same as what it does on
> > other OSes ? Also I found a case where a dying process would still
> > cause trouble in the accept queue, maybe it's not redistributed, I
> > don't remember, all I remember is that my traffic stopped after a
> > segfault of only one of them :-/ I'll have to dig a bit regarding
> > this.
> 
> Hi Willy
> 
> Problem is : If we add a SO_REUSEPORT_LISTEN_OFF, this wont work with
> BPF. 

I wasn't for adding SO_REUSEPORT_LISTEN_OFF either. Instead the idea was
just to modify the score in compute_score() so that a socket which disables
SO_REUSEPORT scores less than one which still has it. The application
wishing to terminate just has to clear the SO_REUSEPORT flag and wait for
accept() reporting EAGAIN. The patch simply looked like this (copy-pasted
hence space-mangled) :

--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
 			return -1;
 		score += 4;
 	}
+	if (sk->sk_reuseport)
+		score++;
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
 	}

> BPF makes a decision without knowing individual listeners states.

But is the decision taken without considering compute_score() ? The point
really was to be the least possibly intrusive and quite logical for the
application : "disable SO_REUSEPORT when you don't want to participate to
incoming load balancing anymore".

> Or we would need to extend BPF to access these kind of states.
> Doable certainly, but we need to be convinced it is necessary.

You know that I don't like complex designs to address simple issues if
possible :-)

> And yes, if a listener is closed while children are still in accept
> queue, we drop all the children, we do not care of redistributing them
> to another listeners. Really too complex to be worth it.

Forget this, I mixed two issues here. Yes I know that redistributing is
almost impossible, I've read that code a year ago or so and realized how
complex this would be, without providing even 100% success rate. I wasn't
speaking about redistribution of incoming queue but about an issue I've
met where when I have 4 processes bound to the same port, if one dies,
its share of incoming traffic is definitely lost. The only fix was to
restart the processes to create new listeners. But I don't remember the
conditions where I met this case, I don't even remember if it was on an
-rc kernel or a final one, so I'd prefer to discuss this only once I have
more elements.

Cheers,
Willy
From c060a5db92274402a0178d7c777a1e37c15eadb9 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 15 Dec 2015 16:40:00 +0100
Subject: net: make lingering sockets score less in compute_score()

When multiple processes use SO_REUSEPORT for a seamless restart
operation, there's a tiny window during which both the old and the new
process are bound to the same port, and there's no way for the old
process to gracefully stop receiving connections without dropping
the few that are left in the queue between the last poll() and the
shutdown() or close() operation.

Incoming connections are distributed between multiple listening sockets
in inet_lookup_listener() according to multiple criteria. The first
criterion is a score based on a number of attributes for each socket,
then a hash computation ensures that the connections are evenly
distributed between sockets of equal weight.

This patch provides a simple approach by which the old process can
simply decrease its score by disabling SO_REUSEPORT on its listening
sockets. Thus, the sockets from the new process always score higher
and are always preferred.

The old process can then safely drain incoming connections and stop
after meeting the -1 EAGAIN condition, as shown in the example below :

  process A (old one)          |  process B (new one)
                               |
  setsockopt(SO_REUSEPORT, 1)  |
  listen() >= 0                |
  ...                          |
  accept()                     |
  ...                          |  setsockopt(SO_REUSEPORT, 1)
  ...                          |  listen()

     From now on, both processes receive incoming connections

  ...                          |  kill(process A, go_away)
  setsockopt(SO_REUSEPORT, 0)  |  accept() >= 0

     Here process A stops receiving new connections


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
Hi Eric,

On Thu, Mar 24, 2016 at 07:13:33AM -0700, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 07:12 +0100, Willy Tarreau wrote:
> > Hi,
> > 
> > On Wed, Mar 23, 2016 at 10:10:06PM -0700, Tolga Ceylan wrote:
> > > I apologize for not properly following up on this. I had the
> > > impression that we did not want to merge my original patch and then I
> > > also noticed that it fails to keep the hash consistent. Recently, I
> > > read the follow ups on it as well as Willy's patch/proposals.
> > > 
> > > Is there any update on Willy's SO_REUSEPORT patch? IMHO, it solves the
> > > problem and it is simpler than adding new sock option.
> > 
> > no, Craig's changes were merged, and I haven't checked yet if my patch
> > needs to be rebased or still applies. Feel free to check it and resubmit
> > if you have time.
> 
> No need for a patch AFAIK.
> 
> BPF solution is generic enough.
> 
> All user space needs to do is to update the BPF filter so that the
> listener needing to be dismantled does not receive any new packet.

But that means that any software making use of SO_REUSEPORT needs to
also implement BPF on Linux to achieve the same as what it does on
other OSes ? Also I found a case where a dying process would still
cause trouble in the accept queue, maybe it's not redistributed, I
don't remember, all I remember is that my traffic stopped after a
segfault of only one of them :-/ I'll have to dig a bit regarding
this.

Thanks,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
Hi,

On Wed, Mar 23, 2016 at 10:10:06PM -0700, Tolga Ceylan wrote:
> I apologize for not properly following up on this. I had the
> impression that we did not want to merge my original patch and then I
> also noticed that it fails to keep the hash consistent. Recently, I
> read the follow ups on it as well as Willy's patch/proposals.
> 
> Is there any update on Willy's SO_REUSEPORT patch? IMHO, it solves the
> problem and it is simpler than adding new sock option.

no, Craig's changes were merged, and I haven't checked yet if my patch
needs to be rebased or still applies. Feel free to check it and resubmit
if you have time.

Best regards,
Willy



Re: [PATCH v2 net-next 0/8] API set for HW Buffer management

2016-02-17 Thread Willy Tarreau
Hi Gregory,

On Tue, Feb 16, 2016 at 04:33:35PM +0100, Gregory CLEMENT wrote:
> Hello,
> 
> A few weeks ago I sent a proposal for a API set for HW Buffer
> management, to have a better view of the motivation for this API see
> the cover letter of this proposal:
> http://thread.gmane.org/gmane.linux.kernel/2125152
> 
> Since this version I took into account the review from Florian:
> - The hardware buffer management helpers are no more built by default
>   and now depend on a hidden config symbol which has to be selected
>   by the driver if needed
> - The hwbm_pool_refill() and hwbm_pool_add() now receive a gfp_t as
>   argument allowing the caller to specify the flag it needs.
> - buf_num is now tested to ensure there is no wrapping
> - A spinlock has been added to protect the hwbm_pool_add() function in
>   SMP or irq context.
> 
> I also used pr_warn instead of pr_debug in case of errors.
> 
> I fixed the mvneta implementation by returning the buffer to the pool
> at various place instead of ignoring it.
> 
> About the series itself I tried to make this series easier to merge:
> - Squashed "bus: mvebu-mbus: Fix size test for
>   mvebu_mbus_get_dram_win_info" into bus: mvebu-mbus: provide api for
>   obtaining IO and DRAM window information.
> - Added my signed-off-by on all the patches as submitter of the series.
> - Renamed the dts patches with the pattern "ARM: dts: platform:"
> - Removed the patch "ARM: mvebu: enable SRAM support in
>   mvebu_v7_defconfig" of this series and already applied it
> - Modified the order of the patches.
> 
> In order to ease the test the branch mvneta-BM-framework-v2 is
> available at g...@github.com:MISL-EBU-System-SW/mainline-public.git.

Well, I tested this patch series on top of latest master (from today)
on my fresh new clearfog board. I compared carefully with and without
the patchset. My workload was haproxy receiving connections and forwarding
them to my PC via the same port. I tested both with short connections
(HTTP GET of an empty file) and long ones (1 MB or more). No trouble
was detected at all, which is pretty good. I noticed a very tiny
performance drop which is more noticeable on short connections (high
packet rates), my forwarded connection rate went down from 17500/s to
17300/s. But I have not checked yet what can be tuned when using the
BM, nor did I compare CPU usage. I remember having run some tests in
the past, I guess it was on the XP-GP board, and noticed that the BM
could save a significant amount of CPU and improve cache efficiency,
so if this is the case here, we don't really care about a possible 1%
performance drop.

I'll try to provide more results as time permits.

In the mean time if you want (or plan to submit a next batch), feel
free to add a Tested-by: Willy Tarreau <w...@1wt.eu>.

cheers,
Willy



Re: [PATCH v2] unix: properly account for FDs passed over unix sockets

2016-02-02 Thread Willy Tarreau
On Tue, Feb 02, 2016 at 09:32:56PM +0100, Hannes Frederic Sowa wrote:
> But "struct pid *" in unix_skb_parms should be enough to get us to 
> corresponding "struct cred *" so we can decrement the correct counter 
> during skb destruction.
> 
> So:
> 
> We increment current task's unix_inflight and also check the current 
> task's limit during attaching fds to skbs and decrement the inflight 
> counter via "struct pid *". This looks like it should work.

I like it as well, the principle sounds sane.

> >That way it's always the person who actually does the send (rather
> >than the opener of the socket _or_ the opener of the file that gets
> >passed around) that gets credited, and thanks to the cred pointer we
> >can then de-credit them properly.
> 
> Exactly, I try to implement that. Thanks a lot!

Thanks to you Hannes, I appreciate that you work on it, it would take
much more time to me to dig into this.

Willy



Re: [PATCH v2] unix: properly account for FDs passed over unix sockets

2016-02-02 Thread Willy Tarreau
On Tue, Feb 02, 2016 at 12:53:20PM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 12:49 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Tue, Feb 02, 2016 at 12:44:54PM -0800, Linus Torvalds wrote:
> >>
> >> Umm. I think the "struct cred" may change in between, can't it?
> >
> > You mean for example in case of setuid() or something like this ?
> 
> Yeah. I'd be worried about looking up the creds or user structure
> later, and possibly getting a different one.
> 
> I'd much rather look it up at attach time, and just carry an extra
> pointer around. That seems to be an inherently safer model where
> there's no worry about "what happens if the user does X in the
> meantime".

Normally we can only change from root to non-root, and we don't apply
the limits to root, so if we have the ability to only store one bit
indicating "not tracked" or to simply nullify one pointer to avoid
counting in flight FDs for root, we don't take the risk to recredit
them to the target user after a change.

I just don't know if we can do that though :-/

Willy



Re: [PATCH v2] unix: properly account for FDs passed over unix sockets

2016-02-02 Thread Willy Tarreau
On Tue, Feb 02, 2016 at 12:44:54PM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 12:32 PM, Hannes Frederic Sowa
>  wrote:
> > But "struct pid *" in unix_skb_parms should be enough to get us to
> > corresponding "struct cred *" so we can decrement the correct counter during
> > skb destruction.
> 
> Umm. I think the "struct cred" may change in between, can't it?

You mean for example in case of setuid() or something like this ?

willy



Re: Mis-backport in af_unix patch for Linux 3.10.95

2016-01-24 Thread Willy Tarreau
Hello,

On Sun, Jan 24, 2016 at 12:10:35AM -0500, Sultan Qasim wrote:
> Hello all,
> 
> I'm an outsider to the Linux kernel community, so I apologize if this
> is not the right channel to mention this.

The simple fact that you participate, inspect the code and report bugs
makes you part of this community :-)  It's indeed the right place.
Usually when reporting an issue with a commit, we also CC the whole
signed-off-by / CC chain of that commit (which I'm doing now). For
bugs related to networking, we usually CC the netdev list as well.

> I noticed that the
> backported version of the patch "af_unix: Revert 'lock_interruptible'
> in stream receive code" in Linux 3.10.95 seems to have removed the
> mutex_lock_interruptible from the wrong function.
>
> Here is the backported patch:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3a57e783016bf43ab9326172217f564941b85b17
> 
> Here is the original:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/net/unix/af_unix.c?id=3822b5c2fc62e3de8a0f33806ff279fb7df92432
> 
> Was it not meant to be removed from unix_stream_recvmsg instead of
> unix_dgram_recvmsg?

You're absolutely right, good catch! Similar controls were added to
both functions resulting in the same code appearing there, which
confused the patch process, causing the change to be applied to the
wrong location. This happens from time to time in such circumstances
when backporting to older kernels.

> Also, the variable called "noblock" needs to be
> removed from the function being changed to prevent unused variable
> warnings.

If you mean this variable in function unix_dgram_recvmsg(), it would
indeed report a warning but only due to the patch being mis-applied.
In unix_stream_recvmsg(), it's still used as well.

Does the attached patch seem better to you (not compile-tested) ?

Greg/Ben, both 3.2.76 and 3.14.59 are OK regarding this, it seems
like only 3.10.95 was affected.

Thanks,
Willy

From 77f6e82adf349cbccf7e2ec7601b25c994e0b483 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Sun, 24 Jan 2016 09:19:57 +0100
Subject: af_unix: fix incorrect revert of 'lock_interruptible' in stream
 receive code

As reported by Sultan Qasim, commit 3822b5c ("af_unix: Revert
'lock_interruptible' in stream receive code") was accidently applied
at the wrong place in the backport that appeared in 3.10.95, it
affected unix_dgram_recvmsg() instead of unix_stream_recvmsg() due
to now similar code sections there. The dgram part needs to remain
but the stream part needs to be removed.

Reported-By: Sultan Qasim <sultanqa...@gmail.com>
Fixes: 3a57e78 (3.10.95)
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/unix/af_unix.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f934e7b..0061d00 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1934,7 +1934,14 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 	if (flags & MSG_OOB)
 		goto out;
 
-	mutex_lock(&u->readlock);
+	err = mutex_lock_interruptible(&u->readlock);
+	if (unlikely(err)) {
+		/* recvmsg() in non blocking mode is supposed to return -EAGAIN
+		 * sk_rcvtimeo is not honored by mutex_lock_interruptible()
+		 */
+		err = noblock ? -EAGAIN : -ERESTARTSYS;
+		goto out;
+	}
 
 	skip = sk_peek_offset(sk, flags);
 
@@ -2083,14 +2090,7 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 		memset(&tmp_scm, 0, sizeof(tmp_scm));
 	}
 
-	err = mutex_lock_interruptible(&u->readlock);
-	if (unlikely(err)) {
-		/* recvmsg() in non blocking mode is supposed to return -EAGAIN
-		 * sk_rcvtimeo is not honored by mutex_lock_interruptible()
-		 */
-		err = noblock ? -EAGAIN : -ERESTARTSYS;
-		goto out;
-	}
+	mutex_lock(&u->readlock);
 
 	do {
 		int chunk;
-- 
1.7.12.2.21.g234cd45.dirty



Re: [PATCH net] af_unix: fix struct pid memory leak

2016-01-24 Thread Willy Tarreau
Hi Eric,

On Sun, Jan 24, 2016 at 01:53:50PM -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Dmitry reported a struct pid leak detected by a syzkaller program.
> 
> Bug happens in unix_stream_recvmsg() when we break the loop when a
> signal is pending, without properly releasing scm.
> 
> Fixes: b3ca9b02b007 ("net: fix multithreaded signal handling in unix recv 
> routines")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Eric Dumazet 
> Cc: Rainer Weikusat 
> ---
>  net/unix/af_unix.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index c5bf5ef2bf89..49d5093eb055 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2339,6 +2339,7 @@ again:
>  
>  		if (signal_pending(current)) {
>  			err = sock_intr_errno(timeo);
> +			scm_destroy(&scm);
>  			goto out;
>  		}

Good job on this one! FWIW, I managed to test it on 3.14 and I confirm it
completely fixes the leak there as well. I had to modify it a little bit
however since there's no scm local variable there :

-	scm_destroy(&scm);
+	scm_destroy(siocb->scm);

Cheers,
Willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
Hi Eric, Dmitry,

On Fri, Jan 22, 2016 at 08:50:01AM -0800, Eric Dumazet wrote:
> CC netdev, as it looks some af_unix issue ...
> 
> On Fri, 2016-01-22 at 16:08 +0100, Dmitry Vyukov wrote:
> > Hello,
> > 
> > The following program causes struct pid memory leak:
> > 
> > // autogenerated by syzkaller (http://github.com/google/syzkaller)
(...)
> > unreferenced object 0x8800324af200 (size 112):
> >   comm "syz-executor", pid 18413, jiffies 4295500287 (age 14.321s)
> >   hex dump (first 32 bytes):
> > 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >   backtrace:
> > [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:916
> > [< inline >] kmemleak_alloc_recursive 
> > include/linux/kmemleak.h:47
(...)
> > On commit 30f05309bde49295e02e45c7e615f73aa4e0ccc2 (Jan 20).

I can't reproduce this with the indicated commit. I'm unsure how/what
I'm supposed to see. Is a certain config needed ? I've enabled kmemleak
in my .config but there are too few information here to go further
unfortunately.

Regards,
Willy


Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sat, Jan 23, 2016 at 07:14:33PM +0100, Dmitry Vyukov wrote:
> I've attached my .config.
> Also run this program in a parallel loop. I think it's leaking not
> every time, probably some race is involved.

Thank you. Just in order to confirm, am I supposed to see the
messages you quoted in dmesg ?

Thanks,
Willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sat, Jan 23, 2016 at 07:46:45PM +0100, Dmitry Vyukov wrote:
> On Sat, Jan 23, 2016 at 7:40 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Sat, Jan 23, 2016 at 07:14:33PM +0100, Dmitry Vyukov wrote:
> >> I've attached my .config.
> >> Also run this program in a parallel loop. I think it's leaking not
> >> every time, probably some race is involved.
> >
> > Thank you. Just in order to confirm, am I supposed to see the
> > messages you quoted in dmesg ?
> 
> 
> I think the simplest way to confirm that you can reproduce it locally
> is to check /proc/slabinfo. When I run this program in a parallel
> loop, number of objects in pid cache was constantly growing:
> 
> # cat /proc/slabinfo | grep pid
> pid      297    532    576   28    4 : tunables    0    0    0 : slabdata     19     19      0
> ...
> pid      412    532    576   28    4 : tunables    0    0    0 : slabdata     19     19      0
> ...
> pid     1107   1176    576   28    4 : tunables    0    0    0 : slabdata     42     42      0
> ...
> pid     1545   1652    576   28    4 : tunables    0    0    0 : slabdata     59     59      0

OK got it and indeed I can see it grow. In fact, the active column grows and
once it reaches the num objects, this one grows in turn, which makes sense.

All I can say now is that it doesn't need to run over multiple processes
to leak, though that makes it easier. SMP is not needed either.

> If you want to use kmemleak, then you need to run this program in a
> parallel loop for some time, then stop it and then:
> 
> $ echo scan > /sys/kernel/debug/kmemleak
> $ cat /sys/kernel/debug/kmemleak
> 
> If kmemleak has detected any leaks, cat will show them. I noticed that
> kmemleak can delay leaks with significant delay, so usually I do scan
> at least 5 times.

Thank you for these information.

I've tested on an older (3.14) kernel and I can see the effect there as well.
I don't have "pid" in slabinfo, but launching 1000 processes at a time uses
a few tens to hundreds kB of RAM on each round. 3.10 doesn't seem affected,
I'm seeing the memory grow to a fixed point if I increase the number of
parallel processes but then even after a few tens of thousands of processes,
the reported used memory doesn't seem to increase (remember no "pid" entry
here).

kmemleak indeed reports me something on 3.14 which seems to match your
trace as I'm seeing bash as the process (instead of syz-executor in your
case) and alloc_pid() calls kmem_cache_alloc() :

Unreferenced object 0x88003facd000 (size 128):
  comm "bash", pid 1822, jiffies 4294951223 (age 15.280s)
  hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[] kmem_cache_alloc+0x92/0xe0
[] alloc_pid+0x24/0x4a0
[] cpumask_any_but+0x23/0x40
[] copy_process.part.66+0x1068/0x16e0
[] n_tty_write+0x37b/0x4f0
[] tty_write+0x1c1/0x2a0
[] do_fork+0xe0/0x340
[] __set_task_blocked+0x30/0x80
[] __set_current_blocked+0x38/0x60
[] stub_clone+0x69/0x90
[] system_call_fastpath+0x16/0x1b
[] 0x

It doesn't report this on 3.10.

Unfortunately I feel totally incompetent on the subject :-/

Willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sat, Jan 23, 2016 at 06:50:11PM -0800, Eric Dumazet wrote:
> On Sat, Jan 23, 2016 at 6:38 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Sun, Jan 24, 2016 at 03:11:45AM +0100, Willy Tarreau wrote:
> >> It doesn't report this on 3.10.
> >
> > To be more precise, kmemleak reports the issue on 3.13 and not on 3.12.
> > I'm not sure if it's reliable enough to run a bisect though.
> >
> > Willy
> >
> 
> I have the leak on linux-3.11.
> 
> I believe even linux-3.3 gets the leak, although I had to wait about
> one hour to be confident the leak was there.

OK so I'm stopping my bisect. It's possible it's affected with some option
which changed along the various "make oldconfig" at each step. Thanks for
letting me know.

Willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sun, Jan 24, 2016 at 03:11:45AM +0100, Willy Tarreau wrote:
> It doesn't report this on 3.10.

To be more precise, kmemleak reports the issue on 3.13 and not on 3.12.
I'm not sure if it's reliable enough to run a bisect though.

Willy



Re: [PATCH] unix: properly account for FDs passed over unix sockets

2015-12-30 Thread Willy Tarreau
On Thu, Dec 31, 2015 at 03:08:53PM +0900, Tetsuo Handa wrote:
> Willy Tarreau wrote:
> > On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote:
> > > The MSG_PEEK code should not be harmful and the patch is good as is. I 
> > > first understood from the published private thread, that it is possible 
> > > for a program to exceed the rlimit of fds. But the DoS is only by 
> > > keeping the fds in flight and not attaching them to any program.
> > 
> > Exactly. The real issue is when these FDs become very expensive such as
> > pipes full of data.
> > 
> 
> As you wrote how to abuse this vulnerability which exists in Linux 2.0
> and later kernel, I quote a short description from private thread.
> 
>   "an unprivileged user consumes all file descriptors so that other
>   unprivileged user cannot work" and "an unprivileged user consumes all
>   kernel memory so that the OOM killer kills almost all processes before
>   the culprit process is killed (CVE-2013-4312)".
> 
> Reported-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Mitigates: CVE-2013-4312 (Linux 2.0+)

Well I didn't reveal any secret as it was publicly reported first
in 2010, it's only that Mark sent us the proof of concept exploit
on the security list recently :-)

https://bugzilla.kernel.org/show_bug.cgi?id=20402

Anyway I'll resend the patch with your reported-by, the CVE and
Hannes' ACK.

Thanks!
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: properly account for FDs passed over unix sockets

2015-12-30 Thread Willy Tarreau
On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote:
> The MSG_PEEK code should not be harmful and the patch is good as is. I 
> first understood from the published private thread, that it is possible 
> for a program to exceed the rlimit of fds. But the DoS is only by 
> keeping the fds in flight and not attaching them to any program.

Exactly. The real issue is when these FDs become very expensive such as
pipes full of data.
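[A minimal sketch of what "in flight" means here, added for illustration (Linux-specific, not the reporter's exploit): the sender parks an fd in an SCM_RIGHTS message and closes its own copy, yet the kernel keeps the file alive for as long as the message sits unreceived.]

```python
import array
import os
import socket

# A connected AF_UNIX pair: fds attached to a message sent on 'a'
# stay "in flight" in b's receive queue until b actually receives them.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r, w = os.pipe()                       # the fd we will park in flight
a.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                    array.array("i", [r]).tobytes())])
os.close(r)                            # sender's fd table is clean again

# The pipe's read end we just closed is still held alive by the
# unreceived SCM_RIGHTS message, so this write does not raise EPIPE.
os.write(w, b"still alive")
```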

> __alloc_fd, called on the receiver side, does check for the rlimit 
> maximum anyway, so I don't see a loophole anymore:
> 
> Acked-by: Hannes Frederic Sowa 

Thanks!

> Another idea would be to add the amount of memory used to manage the fds 
> to sock_rmem/wmem but I don't see any advantages or disadvantages.

Compared to the impact of the pending data in pipes themselves in flight,
this would remain fairly minimal.

Thanks,
Willy



Re: [PATCH] unix: properly account for FDs passed over unix sockets

2015-12-29 Thread Willy Tarreau
On Tue, Dec 29, 2015 at 03:48:45PM +0100, Hannes Frederic Sowa wrote:
> On 28.12.2015 15:14, Willy Tarreau wrote:
> >It is possible for a process to allocate and accumulate far more FDs than
> >the process' limit by sending them over a unix socket then closing them
> >to keep the process' fd count low.
> >
> >This change addresses this problem by keeping track of the number of FDs
> >in flight per user and preventing non-privileged processes from having
> >more FDs in flight than their configured FD limit.
> >
> >Reported-by: socketp...@gmail.com
> >Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
> >Signed-off-by: Willy Tarreau <w...@1wt.eu>
> 
> Thanks for the patch!
> 
> I think this does not close the DoS attack completely as we duplicate 
> fds if the reader uses MSG_PEEK on the unix domain socket and thus 
> clones the fd. Have I overlooked something?

I didn't know this behaviour. However, then the fd remains in flight, right ?
So as long as it's not removed from the queue, the sender cannot add more
than its FD limit. I may be missing something obvious though :-/
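[The MSG_PEEK behaviour under discussion can be sketched as follows; this is an illustration only (`recv_fd` is a helper written for the example, not a kernel or libc API). Each peek installs a fresh duplicate of the passed fd in the receiver, while the original message, and its in-flight reference, stays queued.]

```python
import array
import os
import socket

def recv_fd(sock, flags=0):
    """Receive one fd from an SCM_RIGHTS message (optionally peeking)."""
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_SPACE(4), flags)
    fds = array.array("i")
    fds.frombytes(ancdata[0][2][:4])
    return fds[0]

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
a.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                    array.array("i", [r]).tobytes())])

# Each MSG_PEEK installs a fresh duplicate of the passed fd in the
# receiver, while the message itself stays queued (still in flight).
fd1 = recv_fd(b, socket.MSG_PEEK)
fd2 = recv_fd(b, socket.MSG_PEEK)
```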

Thanks,
Willy



[PATCH] unix: properly account for FDs passed over unix sockets

2015-12-28 Thread Willy Tarreau
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.

Reported-by: socketp...@gmail.com
Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
It would be nice if (once accepted) it could be backported to -stable as
the issue is currently exploitable.
---
 include/linux/sched.h |  1 +
 net/unix/af_unix.c    | 24 ++++++++++++++++++++----
 net/unix/garbage.c    | 13 ++++++++-----
 3 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a4..fbf25f1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -830,6 +830,7 @@ struct user_struct {
 	unsigned long mq_bytes;	/* How many bytes can be allocated to mqueue? */
 #endif
 	unsigned long locked_shm; /* How many pages of mlocked shm ? */
+	unsigned long unix_inflight;	/* How many files in flight in unix sockets */
 
 #ifdef CONFIG_KEYS
 	struct key *uid_keyring;	/* UID specific keyring */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 45aebd9..d6d7b43 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1499,6 +1499,21 @@ static void unix_destruct_scm(struct sk_buff *skb)
 	sock_wfree(skb);
 }
 
+/*
+ * The "user->unix_inflight" variable is protected by the garbage
+ * collection lock, and we just read it locklessly here. If you go
+ * over the limit, there might be a tiny race in actually noticing
+ * it across threads. Tough.
+ */
+static inline bool too_many_unix_fds(struct task_struct *p)
+{
+	struct user_struct *user = current_user();
+
+	if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
+		return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
+	return false;
+}
+
 #define MAX_RECURSION_LEVEL 4
 
 static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
@@ -1507,6 +1522,9 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	unsigned char max_level = 0;
 	int unix_sock_count = 0;
 
+	if (too_many_unix_fds(current))
+		return -ETOOMANYREFS;
+
 	for (i = scm->fp->count - 1; i >= 0; i--) {
 		struct sock *sk = unix_get_socket(scm->fp->fp[i]);
 
@@ -1528,10 +1546,8 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	if (!UNIXCB(skb).fp)
 		return -ENOMEM;
 
-	if (unix_sock_count) {
-		for (i = scm->fp->count - 1; i >= 0; i--)
-			unix_inflight(scm->fp->fp[i]);
-	}
+	for (i = scm->fp->count - 1; i >= 0; i--)
+		unix_inflight(scm->fp->fp[i]);
 	return max_level;
 }
 
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..8fcdc22 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -120,11 +120,11 @@ void unix_inflight(struct file *fp)
 {
 	struct sock *s = unix_get_socket(fp);
 
+	spin_lock(&unix_gc_lock);
+
 	if (s) {
 		struct unix_sock *u = unix_sk(s);
 
-		spin_lock(&unix_gc_lock);
-
 		if (atomic_long_inc_return(&u->inflight) == 1) {
 			BUG_ON(!list_empty(&u->link));
 			list_add_tail(&u->link, &gc_inflight_list);
@@ -132,25 +132,28 @@ void unix_inflight(struct file *fp)
 			BUG_ON(list_empty(&u->link));
 		}
 		unix_tot_inflight++;
-		spin_unlock(&unix_gc_lock);
 	}
+	fp->f_cred->user->unix_inflight++;
+	spin_unlock(&unix_gc_lock);
 }
 
 void unix_notinflight(struct file *fp)
 {
 	struct sock *s = unix_get_socket(fp);
 
+	spin_lock(&unix_gc_lock);
+
 	if (s) {
 		struct unix_sock *u = unix_sk(s);
 
-		spin_lock(&unix_gc_lock);
 		BUG_ON(list_empty(&u->link));
 
 		if (atomic_long_dec_and_test(&u->inflight))
 			list_del_init(&u->link);
 		unix_tot_inflight--;
-		spin_unlock(&unix_gc_lock);
 	}
+	fp->f_cred->user->unix_inflight--;
+	spin_unlock(&unix_gc_lock);
 }
 
 static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
-- 
1.7.12.1


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-21 Thread Willy Tarreau
On Mon, Dec 21, 2015 at 12:38:27PM -0800, Tom Herbert wrote:
> On Fri, Dec 18, 2015 at 11:00 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Fri, Dec 18, 2015 at 06:38:03PM -0800, Eric Dumazet wrote:
> >> On Fri, 2015-12-18 at 19:58 +0100, Willy Tarreau wrote:
> >> > Hi Josh,
> >> >
> >> > On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote:
> >> > > I was also puzzled that binding succeeded. Looking into the code paths
> >> > > involved, in inet_csk_get_port, we quickly goto have_snum. From there, 
> >> > > we end
> >> > > up dropping into tb_found. Since !hlist_empty(>owners), we end up 
> >> > > checking
> >> > > that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, 
> >> > > uid)).
> >> > > This test passes, so we goto success and bind.
> >> > >
> >> > > Crucially, we are checking the fastreuseport field on the 
> >> > > inet_bind_bucket, and
> >> > > not the sk_reuseport variable on the other sockets in the bucket. 
> >> > > Since this
> >> > > bit is set based on sk_reuseport at the time the first socket binds 
> >> > > (see
> >> > > tb_not_found), I can see no reason why sockets need to keep 
> >> > > SO_REUSEPORT set
> >> > > beyond initial binding.
> >> > >
> >> > > Given this, I believe Willy's patch elegantly solves the problem at 
> >> > > hand.
> >> >
> >> > Great, thanks for your in-depth explanation.
> >> >
> >> > Eric, do you think that this patch may be acceptable material for next
> >> > merge window (given that it's not a fix per-se) ? If so I'll resubmit
> >> > later.
> >>
> >> I need to check with Craig Gallek, because he was about to upstream a
> >> change to make SO_REUSEPORT more scalable & sexy (like having an [e]BPF
> >> filter to perform the selection in an array of sockets)
> >
> > OK fine. Please note that I also considered using a new value instead of
> > zero there but I preferred to avoid it since the man talked about zero/
> > non-zero so I wanted to limit any API change. If Craig adds new values
> > there then this is something we can reconsider.
> >
> Is there any reason why this turning off a soreuseport socket should
> not apply to UDP also? (seems like we have a need to turn off RX but
> not TX for a UDP socket).

I didn't know it was supported for UDP :-) I guess that's the only reason.

willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-18 Thread Willy Tarreau
Hi Josh,

On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote:
> I was also puzzled that binding succeeded. Looking into the code paths
> involved, in inet_csk_get_port, we quickly goto have_snum. From there, we end
> up dropping into tb_found. Since !hlist_empty(>owners), we end up checking
> that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, uid)).
> This test passes, so we goto success and bind.
> 
> Crucially, we are checking the fastreuseport field on the inet_bind_bucket, 
> and
> not the sk_reuseport variable on the other sockets in the bucket. Since this
> bit is set based on sk_reuseport at the time the first socket binds (see
> tb_not_found), I can see no reason why sockets need to keep SO_REUSEPORT set
> beyond initial binding.
> 
> Given this, I believe Willy's patch elegantly solves the problem at hand.

Great, thanks for your in-depth explanation.

Eric, do you think that this patch may be acceptable material for next
merge window (given that it's not a fix per-se) ? If so I'll resubmit
later.

Thanks,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-18 Thread Willy Tarreau
On Fri, Dec 18, 2015 at 06:38:03PM -0800, Eric Dumazet wrote:
> On Fri, 2015-12-18 at 19:58 +0100, Willy Tarreau wrote:
> > Hi Josh,
> > 
> > On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote:
> > > I was also puzzled that binding succeeded. Looking into the code paths
> > > involved, in inet_csk_get_port, we quickly goto have_snum. From there, we 
> > > end
> > > up dropping into tb_found. Since !hlist_empty(>owners), we end up 
> > > checking
> > > that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, 
> > > uid)).
> > > This test passes, so we goto success and bind.
> > > 
> > > Crucially, we are checking the fastreuseport field on the 
> > > inet_bind_bucket, and
> > > not the sk_reuseport variable on the other sockets in the bucket. Since 
> > > this
> > > bit is set based on sk_reuseport at the time the first socket binds (see
> > > tb_not_found), I can see no reason why sockets need to keep SO_REUSEPORT 
> > > set
> > > beyond initial binding.
> > > 
> > > Given this, I believe Willy's patch elegantly solves the problem at hand.
> > 
> > Great, thanks for your in-depth explanation.
> > 
> > Eric, do you think that this patch may be acceptable material for next
> > merge window (given that it's not a fix per-se) ? If so I'll resubmit
> > later.
> 
> I need to check with Craig Gallek, because he was about to upstream a
> change to make SO_REUSEPORT more scalable & sexy (like having an [e]BPF
> filter to perform the selection in an array of sockets)

OK fine. Please note that I also considered using a new value instead of
zero there but I preferred to avoid it since the man talked about zero/
non-zero so I wanted to limit any API change. If Craig adds new values
there then this is something we can reconsider.

Have a nice week-end,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-16 Thread Willy Tarreau
Hi Eric,

On Wed, Dec 16, 2015 at 08:38:14AM +0100, Willy Tarreau wrote:
> On Tue, Dec 15, 2015 at 01:21:15PM -0800, Eric Dumazet wrote:
> > On Tue, 2015-12-15 at 20:44 +0100, Willy Tarreau wrote:
> > 
> > > Thus do you think it's worth adding a new option as Tolga proposed ?
> > 
> > 
> > I thought we tried hard to avoid adding the option but determined
> > we could not avoid it ;)
> 
> Not yet, your other proposal of disabling SO_REUSEPORT makes sense if
> we combine it with the proposal to change the score in my patch. If
> we say that a socket which has SO_REUSEPORT scores higher, then the
> connections which don't want to accept new connections anymore will
> simply have to drop it an not be elected. I find this even cleaner
> since the sole purpose of the loop is to find the best socket in case
> of SO_REUSEPORT.

So I tried this and am pretty satisfied with the results, as I couldn't
see any single reset on 4.4-rc5 with it. On 4.1 I got a few very rare
resets at the exact moment the new process binds to the socket, because
I suspect some ACKs end up in the wrong queue exactly there. But
apparently the changes you did in 4.4 totally got rid of this, which is
great!

I suspected that I could enter a situation where a new process could
fail to bind if generations n-1 and n-2 were still present, because
n-2 would be running without SO_REUSEPORT and that should make this
test fail in inet_csk_bind_conflict(), but it never failed for me :

	if ((!reuse || !sk2->sk_reuse ||
	    sk2->sk_state == TCP_LISTEN) &&
	    (!reuseport || !sk2->sk_reuseport ||
	    (sk2->sk_state != TCP_TIME_WAIT &&
	     !uid_eq(uid, sock_i_uid(sk2))))) {
...

So I'm clearly missing something and can't spot what. I mean, I'd
prefer to see my patch occasionally fail than not understanding why
it always works! If anyone has an suggestion I'm interested.

Here's the updated patch.

Best regards,
Willy

From c060a5db92274402a0178d7c777a1e37c15eadb9 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 15 Dec 2015 16:40:00 +0100
Subject: net: make lingering sockets score less in compute_score()

When multiple processes use SO_REUSEPORT for a seamless restart
operation, there's a tiny window during which both the old and the new
process are bound to the same port, and there's no way for the old
process to gracefully stop receiving connections without dropping
the few that are left in the queue between the last poll() and the
shutdown() or close() operation.

Incoming connections are distributed between multiple listening sockets
in inet_lookup_listener() according to multiple criteria. The first
criterion is a score based on a number of attributes for each socket,
then a hash computation ensures that the connections are evenly
distributed between sockets of equal weight.

This patch provides a simple approach by which the old process can
simply decrease its score by disabling SO_REUSEPORT on its listening
sockets. Thus, the sockets from the new process always score higher
and are always preferred.

The old process can then safely drain incoming connections and stop
after meeting the -1 EAGAIN condition, as shown in the example below :

 process A (old one)  |  process B (new one)
  |
  setsockopt(SO_REUSEPORT, 1) |
  listen() >= 0   |
  ... |
  accept()|
  ... |  setsockopt(SO_REUSEPORT, 1)
  ... |  listen()

   From now on, both processes receive incoming connections

  ... |  kill(process A, go_away)
  setsockopt(SO_REUSEPORT, 0) |  accept() >= 0

   Here process A stops receiving new connections

  accept() >= 0   |  accept() >= 0
  ... |
  accept() = -1 EAGAIN    |  accept() >= 0
  close() |
  exit()  |

Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ccc5980..1c950ba 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				return -1;
 			score += 4;
 		}
+		if (sk->sk_reuseport)
+			score++;
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
 			score++;
 	}
-- 
1.7.12.1



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
Hi Eric,

On Wed, Nov 11, 2015 at 05:09:01PM -0800, Eric Dumazet wrote:
> On Wed, 2015-11-11 at 10:43 -0800, Eric Dumazet wrote:
> > On Wed, 2015-11-11 at 10:23 -0800, Tom Herbert wrote:
> > 
> > > How about doing this in shutdown called for a listener?
> > 
> > Seems a good idea, I will try it, thanks !
> > 
> 
> Arg, I forgot about this shutdown() discussion we had recently
> with Oracle guys.
> 
> It is currently used in linux to unblock potential threads in accept()
> system call.
> 
> This would prevent syn_recv sockets to be finally accepted.

I had a conversation with an haproxy user who's concerned with the
connection drops during the reload operation and we stumbled upon
this thread. I was considering improving shutdown() as well for this
as haproxy already performs a shutdown(RD) during a "pause" operation
(ie: workaround for kernels missing SO_REUSEPORT).

And I found that the code clearly doesn't make this possible since
shutdown(RD) flushes the queue and stops the listening.

However I found what I consider an elegant solution which works
pretty well : by simply adding a test in compute_score(), we can
ensure that a previous socket ranks lower than the current ones,
and is never considered as long as the new ones are present. Here I
achieved this using setsockopt(SO_LINGER). The old process just has
to do this with a non-zero value on the socket it wants to put into
lingering mode and that's all.

I find this elegant since it keeps the same semantics as for a
connected socket in that it avoids killing the queue, and that it
doesn't change the behaviour for existing applications. It just
turns out that listening sockets are set up without any lingering
by default so we don't need to add any new socket options nor
anything.

Please let me know what you think about it (patch attached), if
it's accepted it's trivial to adapt haproxy to this new behaviour.

Thanks!
Willy

From 7b79e362479fa7084798e6aa41da2a2045f0d6bb Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 15 Dec 2015 16:40:00 +0100
Subject: net: make lingering sockets score less in compute_score()

When multiple processes use SO_REUSEPORT for a seamless restart
operation, there's a tiny window during which both the old and the new
process are bound to the same port, and there's no way for the old
process to gracefully stop receiving connections without dropping
the few that are left in the queue between the last poll() and the
shutdown() or close() operation.

Incoming connections are distributed between multiple listening sockets
in inet_lookup_listener() according to multiple criteria. The first
criterion is a score based on a number of attributes for each socket,
then a hash computation ensures that the connections are evenly
distributed between sockets of equal weight.

This patch provides an elegant approach by which the old process can
simply decrease its score by setting the lingering time to non-zero
on its listening socket. Thus, the sockets from the new process
(which start without any lingering) always score higher and are always
preferred.

The old process can then safely drain incoming connections and stop
after meeting the -1 EAGAIN condition, as shown in the example below :

 process A (old one)|  process B (new one)
|
  listen() >= 0 |
  ...   |
  accept()  |
  ...   |
  ...   |  listen()

   From now on, both processes receive incoming connections

  ...   |  kill(process A, go_away)
  setsockopt(SO_LINGER) |  accept() >= 0

   Here process A stops receiving new connections

  accept() >= 0 |  accept() >= 0
  ...   |
  accept() = -1 EAGAIN  |  accept() >= 0
  close()   |
  exit()|

Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index c6fb80b..473b16f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -191,6 +191,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
 			score += 4;
 		}
 	}
+	if (!sock_flag(sk, SOCK_LINGER))
+		score++;
 	return score;
 }
 
-- 
1.7.12.1



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
On Tue, Dec 15, 2015 at 10:21:52AM -0800, Eric Dumazet wrote:
> On Tue, 2015-12-15 at 18:43 +0100, Willy Tarreau wrote:
> 
> > Ah ? but what does it bring in this case ? I'm not seeing it used
> > anywhere on a listening socket. The code took care of not breaking
> > them though (ie they still accept if no other socket shows up with
> > a higher score). Otherwise we'll have to switch to Tolga's patch,
> > unless we find another socket option that can safely be combined
> > and which makes sense (I often find it better not to make userland
> > depend on new updates of includes when possible).
> 
> Socket options set on the listener before the accept() are inherited.

I completely forgot about this use case, stupid me!

> Applications wanting SO_LINGER special settings on all their sockets can
> use a single system call right before listen().
> 
> Some servers having to deal with TIME_WAIT proliferation very often use
> SO_LINGER with timeout 0

Yes definitely, it's just that I was focused on the listening socket not
taking into account the fact that it would be inherited to the (rare) few
sockets that are accepted from the queue afterwards. And indeed it's a
perfectly legitimate usage to save a syscall per incoming connection.

Thus do you think it's worth adding a new option as Tolga proposed ?

Another solution I considered (but found a bit dirty) was to make use of
the unimplemented shutdown(WR) for this. While it's easy to do, I don't
like it simply because it looks like a hack and not logical at all from
the users perspective.

Thanks,
Willy


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
On Tue, Dec 15, 2015 at 09:10:24AM -0800, Eric Dumazet wrote:
> On Tue, 2015-12-15 at 17:14 +0100, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > On Wed, Nov 11, 2015 at 05:09:01PM -0800, Eric Dumazet wrote:
> > > On Wed, 2015-11-11 at 10:43 -0800, Eric Dumazet wrote:
> > > > On Wed, 2015-11-11 at 10:23 -0800, Tom Herbert wrote:
> > > > 
> > > > > How about doing this in shutdown called for a listener?
> > > > 
> > > > Seems a good idea, I will try it, thanks !
> > > > 
> > > 
> > > Arg, I forgot about this shutdown() discussion we had recently
> > > with Oracle guys.
> > > 
> > > It is currently used in linux to unblock potential threads in accept()
> > > system call.
> > > 
> > > This would prevent syn_recv sockets to be finally accepted.
> > 
> > I had a conversation with an haproxy user who's concerned with the
> > connection drops during the reload operation and we stumbled upon
> > this thread. I was considering improving shutdown() as well for this
> > as haproxy already performs a shutdown(RD) during a "pause" operation
> > (ie: workaround for kernels missing SO_REUSEPORT).
> > 
> > And I found that the code clearly doesn't make this possible since
> > shutdown(RD) flushes the queue and stops the listening.
> > 
> > However I found what I consider an elegant solution which works
> > pretty well : by simply adding a test in compute_score(), we can
> > ensure that a previous socket ranks lower than the current ones,
> > and is never considered as long as the new ones are present. Here I
> > achieved this using setsockopt(SO_LINGER). The old process just has
> > to do this with a non-zero value on the socket it wants to put into
> > lingering mode and that's all.
> > 
> > I find this elegant since it keeps the same semantics as for a
> > connected socket in that it avoids killing the queue, and that it
> > doesn't change the behaviour for existing applications. It just
> > turns out that listening sockets are set up without any lingering
> > by default so we don't need to add any new socket options nor
> > anything.
> > 
> > Please let me know what you think about it (patch attached), if
> > it's accepted it's trivial to adapt haproxy to this new behaviour.
> 
> Well, problem is : some applications use LINGER on the listener,
> you can not really hijack this flag.

Ah ? but what does it bring in this case ? I'm not seeing it used
anywhere on a listening socket. The code took care of not breaking
them though (ie they still accept if no other socket shows up with
a higher score). Otherwise we'll have to switch to Tolga's patch,
unless we find another socket option that can safely be combined
and which makes sense (I often find it better not to make userland
depend on new updates of includes when possible).

Cheers,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
On Tue, Dec 15, 2015 at 01:21:15PM -0800, Eric Dumazet wrote:
> On Tue, 2015-12-15 at 20:44 +0100, Willy Tarreau wrote:
> 
> > Thus do you think it's worth adding a new option as Tolga proposed ?
> 
> 
> I thought we tried hard to avoid adding the option but determined
> we could not avoid it ;)

Not yet, your other proposal of disabling SO_REUSEPORT makes sense if
we combine it with the proposal to change the score in my patch. If
we say that a socket which has SO_REUSEPORT scores higher, then the
connections which don't want to accept new connections anymore will
simply have to drop it and not be elected. I find this even cleaner
since the sole purpose of the loop is to find the best socket in case
of SO_REUSEPORT.

I'll give this a try and will submit such a proposal if that works
for me.

Cheers!
Willy



Please backport commit to 3.12+

2015-11-06 Thread Willy Tarreau
Hi,

We recently faced the issue described in the patch below on 3.14.56. This fix
was merged in 4.2-rc7. I checked Davem's queue and stable queue and it's not
there yet. Could we please have it in 3.12 and above ? (feature was introduced
in 3.11). I can confirm that it properly fixes the problem for us.

Mainline commit is :

  commit 3c16241c445303a90529565e7437e1f240acfef2
  Author: Phil Sutter 
  Date:   Tue Jul 28 00:53:26 2015 +0200

netfilter: SYNPROXY: fix sending window update to client

Upon receipt of SYNACK from the server, ipt_SYNPROXY first sends back an ACK
to finish the server handshake, then calls nf_ct_seqadj_init() to initiate
sequence number adjustment of forwarded packets to the client and finally
sends a window update to the client to unblock it's TX queue.

Since synproxy_send_client_ack() does not set synproxy_send_tcp()'s nfct
parameter, no sequence number adjustment happens and the client receives the
window update with incorrect sequence number. Depending on client TCP
implementation, this leads to a significant delay (until a window probe is
being sent).

Signed-off-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 

Thanks,
Willy



Re: "ss -p" segfaults (updated to 4.2)

2015-10-12 Thread Willy Tarreau
On Mon, Oct 12, 2015 at 09:50:19AM -0700, Stephen Hemminger wrote:
> Applied, and did some editing on commit msg

Thank you Stephen!

Willy



Re: "ss -p" segfaults (updated to 4.2)

2015-10-06 Thread Willy Tarreau
Hi guys,

I've updated Jose's patch to make it slightly simpler (eg: calloc instead
of malloc+memset), and ported it to 4.2.0 which requires it as well, and
attached it to this e-mail.

I can confirm that with this patch 4.1.1 doesn't segfault on me anymore.
The commit message should be reworked I guess though everything's in it
and I didn't want to modify his description.

Can it be merged as-is or should I reword the commit message and reference
Jose as the fix reporter ? We should not let this bug live forever.

Thanks,
Willy

From 618028d6c5bfa4fb9898f2ec1ab5483e6f5392d4 Mon Sep 17 00:00:00 2001
From: "j...@openmailbox.org" <j...@openmailbox.org>
Date: Tue, 21 Jul 2015 10:54:05 +0100
Subject: "ss -p" segfaults

Patch for 4.2.0

Essentially all that is needed to get rid of this issue is the
addition of:

memset(u, 0, sizeof(*u));

after:

if (!(u = malloc(sizeof(*u))))
break;

Also patched some other situations (strcpy and sprintf uses) that
potentially produce the same results.

Signed-off-by: Jose P Santos <j...@openmailbox.org>

[ wt: made Jose's patch slightly simpler, all credits to him for the diag ]
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 misc/ss.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 2f34962..8b0d606 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -457,7 +457,9 @@ static void user_ent_hash_build(void)
 
user_ent_hash_build_init = 1;
 
-   strcpy(name, root);
+   strncpy(name, root, sizeof(name)-1);
+   name[sizeof(name)-1] = 0;
+
if (strlen(name) == 0 || name[strlen(name)-1] != '/')
strcat(name, "/");
 
@@ -481,7 +483,7 @@ static void user_ent_hash_build(void)
if (getpidcon(pid, &pid_context) != 0)
pid_context = strdup(no_ctx);
 
-   sprintf(name + nameoff, "%d/fd/", pid);
+   snprintf(name + nameoff, sizeof(name) - nameoff, "%d/fd/", pid);
pos = strlen(name);
if ((dir1 = opendir(name)) == NULL) {
free(pid_context);
@@ -502,7 +504,7 @@ static void user_ent_hash_build(void)
if (sscanf(d1->d_name, "%d%c", &fd, &crap) != 1)
continue;
 
-   sprintf(name+pos, "%d", fd);
+   snprintf(name+pos, sizeof(name) - pos, "%d", fd);
 
link_len = readlink(name, lnk, sizeof(lnk)-1);
if (link_len == -1)
@@ -2736,7 +2738,7 @@ static int unix_show(struct filter *f)
struct sockstat *u, **insp;
int flags;
 
-   if (!(u = malloc(sizeof(*u))))
+   if (!(u = calloc(1, sizeof(*u))))
break;
u->name = NULL;
u->peer_name = NULL;
@@ -3086,11 +3088,13 @@ static int netlink_show_one(struct filter *f,
strncpy(procname, "kernel", 6);
} else if (pid > 0) {
FILE *fp;
-   sprintf(procname, "%s/%d/stat",
+   snprintf(procname, sizeof(procname), "%s/%d/stat",
getenv("PROC_ROOT") ? : "/proc", pid);
if ((fp = fopen(procname, "r")) != NULL) {
if (fscanf(fp, "%*d (%[^)])", procname) == 1) {
-				sprintf(procname+strlen(procname), "/%d", pid);
+				snprintf(procname+strlen(procname),
+					 sizeof(procname)-strlen(procname),
+					 "/%d", pid);
done = 1;
}
fclose(fp);
-- 
1.7.12.1



Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-16 Thread Willy Tarreau
Hi,

On Wed, Sep 16, 2015 at 06:53:57AM +, Damien Thébault wrote:
> On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote:
> > I have a recent Marvell Armada 388 board here which uses the mvneta
> > driver.  I'm seeing some weird effects with NFS with it acting as a
> > client.
> 
> Hello,
> 
> I'm upgrading a Marvelle Armada 370 board using the mvneta driver from
> 4.0 to 4.2 and noticed issues with NFS booting.
> Basically, most of the time init returns with an error code, or
> programs segfault or throw illegal instructions.
> 
> Since it worked fine on 4.0 I bisected until I found commit
> a84e32894191cfcbffa54180d78d7d4654d56c20 "net: mvneta: fix refilling
> for Rx DMA buffers".
> 
> If I revert this commit, everything seems to get back to normal.
> Could you try it ? The two issues look very similar.

I'm not sure but I'm seeing that the accounting was changed by this
patch without being certain of the implications; if the revert above
works, it would be nice to try to only apply this just to see if
that's indeed an accounting error or not :

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 62e48bc..4205867 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1463,6 +1463,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 {
struct net_device *dev = pp->dev;
int rx_done;
+   int missed = 0;
u32 rcvd_pkts = 0;
u32 rcvd_bytes = 0;
 
@@ -1527,6 +1528,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
if (err) {
netdev_err(dev, "Linux processing - Can't refill\n");
rxq->missed++;
+   missed++;
goto err_drop_frame;
}
 
@@ -1561,7 +1563,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
}
 
/* Update rxq management counters */
-   mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done);
+   mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done - missed);
 
return rx_done;
 }

Regards,
Willy



Re: [PATCH][kernel 2.6.32] Bond interface can't send gratuitous ARP

2015-07-26 Thread Willy Tarreau
Hi Qingjie,

On Mon, Jul 27, 2015 at 09:05:29AM +0800, ? wrote:
> Hi,
> 
> Bond interface worked as Active-Backup mode.
> If the bond interface was added in bridge, then it's just a port of bridge.
> It doesn't have an IP address.
> When the bond slave changes, the current code bond_send_gratuitous_arp
> didn't take effect. It couldn't send gratuitous ARP.
> So I made a patch to fix this issue.
> 
> It's my first patch for kernel.
> 
> Thanks a lot for your *understanding* for any *inconvenience* caused.

Patches may only be sent for mainline kernel. Since you're trying to fix
2.6.32, you first need to check if a more recent version works properly.
If it's the case, then instead you should ask for a backport of the patch
that did the related change (it's easier for tracking bug fixes and
regressions). If the problem is also present in mainline, then you should
propose your patch for mainline. Since it's a fix, it will be backported.

BTW, please keep in mind that 2.6.32 is reaching EOL in a few months, so
it would be a good time to try newer ones (3.2 or 3.4 are rock solid now).

Hoping this helps,
Willy


