Re: Lock contention in pat_match_str in threaded mode

2019-10-25 Thread Willy Tarreau
Hi Brian,

On Thu, Oct 24, 2019 at 05:10:40PM +, Brian Diekelman wrote:
> Thank you for turning that around so quickly, Willy.
> 
> We'll pull down the new release when it's available.

So just FYI, 1.8.22 was released with the fix.

Cheers,
Willy



RE: Lock contention in pat_match_str in threaded mode

2019-10-24 Thread Brian Diekelman
Thank you for turning that around so quickly, Willy.

We'll pull down the new release when it's available.



Re: Lock contention in pat_match_str in threaded mode

2019-10-22 Thread Willy Tarreau
Hi Brian,

On Tue, Oct 22, 2019 at 04:19:58PM +0200, Willy Tarreau wrote:
> At the moment I don't know what it requires to break it down per thread,
> so I'll add a github issue referencing your report so that we don't forget.
> Depending on the complexity, it may make sense to backport it once done,
> because it's still a regression.

So this morning I spent a little time on it and it was really trivial to
fix. I've now split the LRU cache per thread (just as it is when using
nbproc, in fact) so that we can get rid of the lock. As a result, I jump
from 67kH/s to 369kH/s on a test with 7 threads on 4 cores :-)
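
To give an idea of the shape of the change, here is a minimal sketch (not
the actual HAProxy code; the entry layout, table size, MAX_THREADS and the
'tid' parameter are all invented for the example). Splitting the cache per
thread removes the lock from the lookup path entirely:

    /* One private cache per thread, indexed by the calling thread's ID.
     * The lookup never touches a lock, so threads no longer serialize
     * on the cache the way they do with a single shared LRU. */
    #define MAX_THREADS 64

    struct entry { unsigned long long key; int result; };

    static struct entry per_thread_cache[MAX_THREADS][4096];

    /* return the cached result for 'key' on thread 'tid', or -1 on a miss
     * (the caller then runs the real pattern match and stores the result) */
    static int thread_cache_get(int tid, unsigned long long key)
    {
        struct entry *e = &per_thread_cache[tid][key % 4096];

        return e->key == key ? e->result : -1;
    }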

I could backport it as far as 1.8, so I did, as I really consider this a
regression and it can be a showstopper for some users migrating to threads.
The gains are smaller in 1.8 since the scalability is much less linear
there. I'm going to work on issuing some releases today or over this week;
I'd encourage you to migrate to 2.0.8 once it's out.

Thanks again for your detailed report!
Willy



Re: Lock contention in pat_match_str in threaded mode

2019-10-22 Thread Willy Tarreau
Hi Brian,

On Mon, Oct 14, 2019 at 11:28:17PM +, Brian Diekelman wrote:
> Just wanted to provide some information on what appears to be lock contention 
> around ACL lookups.
> 
> We recently upgraded from haproxy-1.6 to haproxy-1.8.20 and switched from
> 'nbproc 8' to 'nbproc 1, nbthread 12'.
> 
> We have quite a few ACL files to sift through for domain matching -- about
> 19MB total.
> 
> After switching to 1.8.20 we saw elevated CPU under our normal peak load, to 
> the point where in some cases all 12 threads were consuming 100% CPU.
> 
> Looking at the 'perf' numbers:
> 
>   90.71%  haproxy  [.] pat_match_str
>0.49%  libc-2.12.so [.] __strncasecmp_l_sse42
>0.41%  haproxy  [.] pat_match_beg
>0.34%  haproxy  [.] lru64_get
>0.26%  [kernel] [k] _spin_lock_bh
>0.17%  haproxy  [.] pat_match_sub
>0.16%  libc-2.12.so [.] _int_malloc
>0.13%  [kernel] [k] _spin_lock
>0.10%  haproxy  [.] process_runnable_tasks
>0.09%  libc-2.12.so [.] malloc
> 
> 
> We looked up that function, and it seemed like all requests were going
> through one spin lock [0].
> The LRU tree seemed to be conditionally initialized based on
> 'global.tune.pattern_cache', which seemed to be tied to the
> 'tune.pattern.cache-size' config value [1] [2].
> So, we specified 'tune.pattern.cache-size 0' to disable the cache code path
> and avoid the spinlock, and the CPU numbers dropped back down to where they
> had been previously (and lru64_get was no longer in the execution traces).
> 
> I realize our use case is probably an edge case (that's a lot of ACL data --
> we have plans to upgrade these to map files now that we're on haproxy-1.8).
> Just wanted to make the developers aware and help anyone who may run into 
> this.

Thank you for your report, it is very useful. It just makes me realize
that it's stupid to have a shared LRU cache between all threads; we should
split it to have one per thread, otherwise the benefit of the cache is
cancelled by the lock!
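
Roughly speaking, the problematic pattern looks like this (an illustrative
sketch only, not the HAProxy source; the entry layout and table size are
invented): every thread has to take the same spinlock just to consult the
cache, so under load they all serialize on it:

    #include <pthread.h>

    /* A single cache shared by every thread, protected by one spinlock
     * (initialized with pthread_spin_init() at startup, not shown). */
    struct entry { unsigned long long key; int result; };

    static pthread_spinlock_t cache_lock;
    static struct entry shared_cache[4096];

    /* return the cached result for 'key', or -1 on a miss */
    static int shared_cache_get(unsigned long long key)
    {
        struct entry *e = &shared_cache[key % 4096];
        int res = -1;

        pthread_spin_lock(&cache_lock);   /* every thread contends here */
        if (e->key == key)
            res = e->result;
        pthread_spin_unlock(&cache_lock);
        return res;
    }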

At the moment I don't know what it requires to break it down per thread,
so I'll add a github issue referencing your report so that we don't forget.
Depending on the complexity, it may make sense to backport it once done,
because it's still a regression.
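
In the meantime, for anyone else hitting this on 1.8, the workaround
described in the report boils down to a global section along these lines
(nbproc, nbthread and tune.pattern.cache-size are standard global
directives; the thread count is just the one from the report, and disabling
the cache trades its benefit for the removal of the shared lock):

    global
        nbproc 1
        nbthread 12                  # threaded mode, as in the report
        tune.pattern.cache-size 0    # disable the pattern LRU cache so ACL
                                     # lookups never touch the shared lock

Once a release carrying the per-thread split is deployed, this workaround
should no longer be necessary.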

Thanks,
Willy