Thanks for reading my long email. You are a saint for helping for free.
Discussion below:

On Tue, Sep 10, 2013 at 9:49 PM, Roberto De Ioris <[email protected]> wrote:
> > Thanks for the reply. Discussion below:
> >
> > On Tue, Sep 10, 2013 at 2:38 AM, Roberto De Ioris <[email protected]>
> > wrote:
> >
> >> > Hi,
> >> >
> >> > I'm investigating using uwsgi to run Python code in the
> >> > FrameworkBenchmarks
> >> > project <http://www.techempower.com/benchmarks/> which compares web
> >> > frameworks, languages, platforms, web servers and more. I tried
> >> running
> >> > another contributor's uwsgi command line, but I can't get uwsgi to
> >> fully
> >> > saturate all CPU cores when under load.
> >> >
> >> > uwsgi command line:
> >> >
> >> > --master -L --http :8080 --http-keepalive -p 2 -w hello --add-header
> >> >> "Connection: keep-alive"
> >>
> >> in this way you are benchmarking a proxied setup, with an http router in
> >> front managing all of the requests and forwarding them to the workers.
> >>
> >> While in terms of performance it could be successful, in terms of core
> >> usage it could be suboptimal (even if core usage is a bit 'strange' for a
> >> benchmark, as the operating system scheduler chooses which process to give
> >> CPU to using really complex algorithms).
> >
> > I ran htop and found that the http router process was ~100% (thus, using
> > most of one core). My guess is that the http router is CPU bound and thus
> > can't send enough work to the workers, so the worker processes are not
> > fully utilized. Basically, the http router is the bottleneck. On my
> > system,
> > this produces about 6,000-7,000 requests/sec, whereas gunicorn can do
> > about
> > 10,000 requests/sec, saturating all cores.
>
> Seems reasonable, as you have 1 process with the uwsgi httprouter and 2 with
> gunicorn (the meinheld parser is very close in performance to the uwsgi one)
>
> Just to be sure: are you using a 1.9.x release ?

Yes, 1.9.15.

> My latest numbers (especially those with pypy) are only for the 1.9 codebase
>
> (1.4 http parsing was not good [4 syscalls per request], and meinheld, last
> time I tried, was up to 8% faster in single-process mode)
>
> >> The "right" command line would be:
> >>
> >> --master -L --http-socket :8080 -p 2 -w hello
> >>
> >> (keep-alive is useless as this is an in-process non-persistent parser)
> >
> > I tried this (after increasing system somaxconn, using uwsgi -l, and
> > removing -H 'Connection: keep-alive' from the wrk args) and only 326
> > requests completed and there were 330 read socket errors and 1788 timeout
> > socket errors. I'm not sure what's going on, maybe it is a bug in wrk.
>
> -l with which value ?

I set somaxconn to 5000 and used uwsgi -l 5000.
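
As an aside, my understanding is that the kernel silently caps any listen()
backlog at net.core.somaxconn, which is why I raised somaxconn before using
-l 5000. A small Linux-only sketch (the helper names are mine, just for
illustration) to check what the kernel will actually grant:

```python
# Hedged sketch (Linux-only): listen() backlogs are silently capped at
# net.core.somaxconn, so `uwsgi -l 5000` only takes full effect if the
# sysctl is at least 5000.
def somaxconn():
    """Read the kernel's listen-backlog cap from /proc."""
    with open("/proc/sys/net/core/somaxconn") as f:
        return int(f.read().strip())

def effective_backlog(requested):
    """The backlog the kernel will actually use for listen(requested)."""
    return min(requested, somaxconn())
```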

> from the results it looks like something is wrong there (you should have
> errors in uwsgi logs)

I re-ran the scenario, omitting uwsgi -L to get logging on
stdout/stderr and it looked like this:

[pid: 6832|app: 0|req: 1/1] <ip-address> () {22 vars in 287 bytes}
[Wed Sep 11 00:04:58 2013] GET /json => generated 27 bytes in 0 msecs
(HTTP/1.1 200) 2 headers in 65 bytes (1 switches on core 0)
[######## many more lines like above with the same timestamp ##############]
[pid: 6831|app: 0|req: 164/368] <ip-address> () {22 vars in 287 bytes}
[Wed Sep 11 00:05:14 2013] GET /json => generated 27 bytes in 0 msecs
(HTTP/1.1 200) 2 headers in 65 bytes (1 switches on core 0)
[pid: 6832|app: 0|req: 205/369] <ip-address> () {22 vars in 287 bytes}
[Wed Sep 11 00:05:14 2013] GET /json => generated 27 bytes in 0 msecs
(HTTP/1.1 200) 2 headers in 65 bytes (1 switches on core 0)
[pid: 6832|app: 0|req: 206/370] <ip-address> () {22 vars in 287 bytes}
[Wed Sep 11 00:05:14 2013] GET /json => generated 27 bytes in 0 msecs
(HTTP/1.1 200) 2 headers in 65 bytes (1 switches on core 0)
[pid: 6832|app: 0|req: 207/371] <ip-address> () {22 vars in 287 bytes}
[Wed Sep 11 00:05:14 2013] GET /json => generated 27 bytes in 0 msecs
(HTTP/1.1 200) 2 headers in 65 bytes (1 switches on core 0)

Basically, many requests were initially processed, then everything
hung until 15 seconds elapsed (how long wrk is configured to run) and
then there were those last 4 requests. I don't know what is happening,
but it might just be a problem in wrk.
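
For context, the 27-byte responses to /json in the log above match the
benchmark's compact JSON payload, {"message":"Hello, World!"}. A minimal
sketch of what the hello module's WSGI callable might look like (this is my
hypothetical reconstruction, not the project's actual code):

```python
# Hypothetical reconstruction of the benchmark's "hello" WSGI module.
# The compact JSON body is exactly 27 bytes, matching the log lines above.
import json

def application(environ, start_response):
    # separators=(",", ":") produces the compact form with no spaces
    body = json.dumps({"message": "Hello, World!"},
                      separators=(",", ":")).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```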

> > But at any rate, my goal is to use HTTP Keep-Alive to get the most
> > requests/sec, so perhaps --http-socket isn't useful for this benchmark in
> > the first place.
>
> (yes) you can't, because keepalive is only for the frontend (the
> httprouter or nginx) while the --http-socket (the in-process one) is
> non-keepalive.
>
> But if you want to compare a non-proxied setup (gunicorn+meinheld) with a
> proxied one (nginx+uwsgi / httprouter+uwsgi) you will be "unfair"
> (especially because no one will place gunicorn+meinheld directly exposed
> without nginx or something similar, just as no one will expose uwsgi
> --http-socket to the public). Even if things like tcp offloading and dma
> engines greatly reduce the impact of the IPC, you always have a little
> overhead (especially in syscalls).
>
> Yes, we are talking about microseconds, but on this kind of benchmark
> they make a difference too

Yes, I agree that it is 'unfair', but surprisingly, when I eventually
did try nginx+uwsgi it was the fastest combination (more below).

> >> If you want to test the http router (something a lot of users use in
> >> production) you may want to use --http-processes 2 (this time keepalives
> >> work)
> >>
> >> With this setup the httprouter too will use 2 processes, but again 'cpu
> >> cores' usage could be irrelevant.
> >
> > I used `--master -L --http :8080 --http-processes 2 --http-keepalive -p 2
> > -w hello --add-header ...' and I was able to saturate all CPU cores. The
> > htop CPU usage was about ~65% for each httprouter process and ~35% for the
> > worker processes. The result was ~8,500 requests/sec, an improvement, but
> > still not close to gunicorn. These results seem to suggest that the
> > original problem was that the httprouter is CPU bound and the bottleneck.
>
> probably you still have problems with the listen queue (so the worker
> itself is the bottleneck, as the uwsgi routers are tuned for really high
> concurrency). Hello-world benchmarks are not realistic (or rather, they are
> close to a DoS), so the first step is tuning the listen queue, as the
> network will saturate fast.

To test your theory, I tried running with the default listen backlog
of 100 (by using no -l argument) and with -l 5000, and the results for
both were approximately the same: ~8,900 requests/sec. If there was a
listen queue problem, shouldn't these have different results?

> You may want to run uwsgitop (with the stats server) to see the status of
> the listen queue in real time.

Ok, while using the default listen queue of 100 (by not using the -l
argument), I ran wrk and uwsgitop showed 'idle' most of the time like
this:

uwsgi-1.9.15 - Wed Sep 11 00:32:08 2013 - req: 63709 - lq: 0 - tx: 7.0M
node: machinename - cwd: /home/ubuntu/FrameworkBenchmarks/uwsgi - uid: 1000 - gid: 1000 - masterpid: 6948
 WID  %     PID   REQ    EXC  SIG  STATUS  AVG  RSS  VSZ  TX    RunT
 2    51.7  6950  32947  0    0    idle    0ms  0    0    3.0M  1252
 1    48.3  6949  30762  0    0    idle    0ms  0    0    3.0M  1157
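
Incidentally, the lq: counter uwsgitop displays comes from the stats server's
JSON document, which can also be read directly. A hedged sketch (the
127.0.0.1:9191 address is my assumption and must match whatever --stats
option uwsgi was started with):

```python
# Hedged sketch: read the uWSGI stats server's JSON directly instead of
# going through uwsgitop. The 127.0.0.1:9191 address is an assumption;
# it must match the --stats option uwsgi was started with.
import json
import socket

def read_stats(host="127.0.0.1", port=9191):
    """Fetch the full JSON document the stats server emits on connect."""
    chunks = []
    with socket.create_connection((host, port)) as s:
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks))

def total_listen_queue(stats):
    """Sum the per-socket 'queue' counters (the 'lq' shown by uwsgitop)."""
    return sum(sock.get("queue", 0) for sock in stats.get("sockets", []))
```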

> > So far I haven't seen any data to suggest that this is an affinitization
> > problem or that affinity could help, so I haven't bothered with
> > --cpu-affinity. So far virtualization doesn't seem to be an issue since my
> > physical machine is otherwise idle and has two cores (with
> > hyperthreading).
>
> on virtualized systems cpu-affinity simply does not work, because of the way
> CPUs are abstracted by the hypervisor. Even if your kernel shows the right
> distribution, internally you do not know which CPU is effectively used.
>
> But this is not your problem. I have run some tests with a concurrency of
> 90 (so no need to tune the listen queue), and --http-socket was 1-2%
> faster, while httprouter + uwsgi was 3-4% slower (as expected, since you
> have the IPC overhead, something you will always have in production
> environments)
>
> > After doing this research (with your help), my analysis is that the
> > (single
> > process) uwsgi httprouter becomes CPU bound and becomes the limiting
> > factor.
>
> (Always supposing you are using a 1.9.x version)
>
> the httprouter becomes CPU bound only at higher levels of concurrency
> (unless you are using a pre-1.9 version, where there are blocking parts)
>
> workers are heavier in terms of "things to do"; the fact that they are low
> in CPU usage suggests a communication problem (again, it could be the listen
> queue). The httprouter (like nginx) does not need a tuned listen
> queue, as it constantly accept()s and waits again, reducing the need
> for a queue. Workers, instead, do the heavy part after the accept(), and
> connections arriving while they are in the "heavy part" are enqueued (and
> saturating a 100-entry listen queue with 256 concurrent connections and 2
> workers is pretty easy, especially because --http-socket expects a 4-second
> timeout on protocol traffic)

Your theory makes sense, but so far I don't think I've seen any data
suggesting that's what is going on. I'm open to ideas.

> > Thus, to increase the performance, one must distribute the load
> > amongst more than one httprouter (--http-processes 2), or perhaps use a
> > different 'router' such as nginx using the uwsgi protocol. What do you
> > think? Is my thinking/analysis/approach wrong? I'm open to suggestions.
>
> the httprouter passes requests to uWSGI workers via the uwsgi protocol. In
> terms of performance it should map 1:1 with nginx (and only because it
> is way simpler than nginx; the latter's parser is better for sure)

I tried nginx+uwsgi and I got ~12,300 requests/sec, the best result
I've gotten so far. The uwsgi command line:

--master -L -l 5000 --socket /tmp/uwsgi.sock --chmod-socket=666 -p 2
-w hello --pidfile /tmp/uwsgi.pid

The nginx.conf is here: https://gist.github.com/MalcolmEvershed/6520477

Isn't it odd that nginx+uwsgi is the best-performing combination,
beating both gunicorn+meinheld and the uwsgi httprouter? I'm really not
sure what to make of this. Maybe I'm doing something wrong?

> > Is there a way to use multiple worker processes without a router?
> > Basically, is there a way to do the accept()/epoll()/read() from the
> > network and then execute the Python code in the same process? That seems
> > like it might be the fastest approach, because it would eliminate the
> > dispatch from the router process to the worker process. I have a feeling
> > that gunicorn+meinheld might be doing this, but I haven't read the code to
> > verify.
>
> I do not follow you here; that is the standard way uWSGI works. Even with
> the httprouter the backend workers share the socket. It is the reason why
> --thunder-lock is needed in high-load scenarios.

Maybe I'm misunderstanding. I thought that when an httprouter is used
it works like this:

1. HTTP connection comes in from the browser to the httprouter process
which does accept().
2. httprouter makes a connection to a worker process via the uwsgi
protocol, which presumably is TCP or unix socket.
3. The worker process does accept() to accept the connection from the
httprouter process. The worker process doesn't actually accept() the
socket that is directly connected to the browser.

My question is whether there is a way to do this:

1. HTTP connection comes in from the browser directly to the worker
process which does accept() and parses the HTTP request and handles it
right there.

But maybe I'm misunderstanding.
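
To make my question concrete, here is a toy pre-fork sketch of the model I
mean (Unix-only, my own illustration, not uwsgi code): the master binds and
listens, and each worker accept()s directly on the shared socket, with no
router process in between.

```python
# Toy illustration (not uwsgi code) of the model I am asking about:
# the master binds and listens, then forks workers that accept()
# directly on the shared listening socket -- no router hop in between.
import os
import socket

RESPONSE = (b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
            b"Connection: close\r\n\r\nOK")

def handle(conn):
    """Read one request and answer it in-process (non-keepalive)."""
    conn.recv(65536)
    conn.sendall(RESPONSE)
    conn.close()

def prefork_server(port, workers=2, backlog=100):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(backlog)            # the listen queue discussed above
    for _ in range(workers):
        if os.fork() == 0:         # child: accept()s on the shared socket
            while True:
                conn, _addr = srv.accept()
                handle(conn)
    srv.close()                    # parent no longer needs the socket
```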


Thanks again for all your time on this investigation. I'm hoping that
the result is that we can get some really good uwsgi performance
numbers for the next run of the FrameworkBenchmarks project. Really,
that has been the point of my questions -- to make sure uwsgi shows
its best side. I hope my questions and data are making sense.
I'm hoping you'll have some ideas on what I can do next. Should we
just go with nginx+uwsgi, since that has shown the best numbers? Or are
there certain things I should investigate next, such as running a
profiler on the uwsgi httprouter? I'm open to ideas.

Thanks.
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
