>
>
> uwsgi-1.9.15 - Wed Sep 11 00:32:08 2013 - req: 63709 - lq: 0 - tx: 7.0M
> node: machinename - cwd: /home/ubuntu/FrameworkBenchmarks/uwsgi - uid:
> 1000 - gid: 1000 - masterpid: 6948
>  WID   %      PID    REQ    EXC  SIG  STATUS  AVG   RSS  VSZ  TX    RunT
>  2     51.7   6950   32947  0    0    idle    0ms   0    0    3.0M  1252
>  1     48.3   6949   30762  0    0    idle    0ms   0    0    3.0M  1157
>
>> > So far I haven't seen any data to suggest that this is an
>> > affinitization problem or that affinity could help, so I haven't
>> > bothered with --cpu-affinity. So far virtualization doesn't seem to
>> > be an issue since my physical machine is otherwise idle and has two
>> > cores (with hyperthreading).
>>
>> On virtualized systems cpu-affinity simply does not work, because of
>> the way CPUs are abstracted by the hypervisor. Even if your kernel
>> shows the right distribution, internally you do not know which CPU is
>> effectively used.
>>
>> But this is not your problem. I have run some tests with a concurrency
>> of 90 (so no need to tune the listen queue), and --http-socket was
>> 1-2% faster, while httprouter + uwsgi was 3-4% slower (as expected,
>> since you have the IPC overhead, something you will always have in
>> production environments).
>>
>> > After doing this research (with your help), my analysis is that the
>> > (single process) uwsgi httprouter becomes CPU bound and becomes the
>> > limiting factor.
>>
>> (Always supposing you are using a 1.9.x version)
>>
>> the httprouter becomes CPU bound only at higher levels of concurrency
>> (unless you are using a pre-1.9 version, where there are blocking
>> parts)
>>
>> workers are heavier in terms of "things to do"; the fact that they are
>> low in CPU usage suggests a communication problem (again, it could be
>> the listen queue). The httprouter (like nginx) does not need a tuned
>> listen queue, as it constantly accept()s and waits again, reducing the
>> need for a queue. Workers, instead, do the heavy part after the
>> accept(), and connections arriving while they are in the "heavy part"
>> get enqueued (and saturating a 100-entry listen queue with 256
>> concurrent connections and 2 workers is pretty easy, especially
>> because --http-socket expects a 4-second timeout on protocol traffic)
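
A hedged sketch of the listen-queue tuning implied above (the app name
`hello` matches the command lines later in the thread; port and queue
sizes are illustrative, not from the thread):

```shell
# Raise the uWSGI listen queue above the 100-entry default so that 256
# concurrent connections with only 2 workers do not saturate it.
# (-l / --listen sets the backlog passed to listen(2).)
uwsgi --master --http-socket :8080 -p 2 -w hello --listen 1024

# The kernel clamps the backlog to net.core.somaxconn, so raise that too:
sudo sysctl -w net.core.somaxconn=1024
```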
>
> Your theory makes sense, but so far I don't think I've seen any data
> suggesting that's what is going on. I'm open to ideas.
>
>> > Thus, to increase the performance, one must distribute the load
>> > amongst more than one httprouter (--http-processes 2), or perhaps
>> > use a different 'router' such as nginx using the uwsgi protocol.
>> > What do you think? Is my thinking/analysis/approach wrong? I'm open
>> > to suggestions.
>>
>> the httprouter passes requests to uWSGI workers via the uwsgi
>> protocol. In terms of performance it should map 1:1 with nginx (only
>> because it is way simpler than nginx; the latter's parser is better
>> for sure)
>
> I tried nginx+uwsgi and I got ~12,300 requests/sec, the best result
> I've gotten so far. The uwsgi command line:
>
> --master -L -l 5000 --socket /tmp/uwsgi.sock --chmod-socket=666 -p 2
> -w hello --pidfile /tmp/uwsgi.pid
>
> The nginx.conf is here: https://gist.github.com/MalcolmEvershed/6520477
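>
> For reference, a minimal nginx server block for that setup might look
> like the following (a sketch assuming the /tmp/uwsgi.sock path from the
> command line above; the port and server_name are illustrative, and the
> actual config is in the linked gist):
>
> ```nginx
> # Minimal sketch of an nginx -> uWSGI setup over a unix socket.
> server {
>     listen 8080;
>     server_name localhost;
>
>     location / {
>         include uwsgi_params;            # standard uwsgi protocol vars
>         uwsgi_pass unix:/tmp/uwsgi.sock; # matches --socket above
>     }
> }
> ```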
>
> Isn't it odd that nginx+uwsgi is the best performing combination,
> beating gunicorn+meinheld and uwsgi httprouter? I'm really not sure
> what to make of this. I must be doing something wrong?
>
>> > Is there a way to use multiple worker processes without a router?
>> > Basically, is there a way that does the accept()/epoll()/read() from
>> > the network and then in the same process executes the python code?
>> > That seems like it might be the fastest because it would eliminate
>> > the dispatch from the router process to the worker process. I have a
>> > feeling that gunicorn+meinheld might be doing this, but I haven't
>> > read the code to verify.
>>
>> I do not follow you here; that is the standard way uWSGI works. Even
>> with the httprouter, the backend workers share the socket. It is the
>> reason why --thunder-lock is needed in high-load scenarios.
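>>
>> The router-less mode asked about is just the plain multi-process
>> invocation; a sketch (port and app name are illustrative):
>>
>> ```shell
>> # No router process: the master binds the socket once, and both
>> # workers accept() on the shared file descriptor and run the Python
>> # code in the same process that did the accept().
>> uwsgi --master --http-socket :8080 --processes 2 -w hello
>> ```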
>
> Maybe I'm misunderstanding. I thought that when an httprouter is used
> it works like this:
>


It took me a bit to fully understand what is going on.

Finally I decided to invest a bit of time on 'wrk' to check how it works.

Well, while I am not a big fan of "hello world" benchmarks, the one you
made produced really interesting (and funny) numbers.

Regarding --http-socket:

add --add-header 'Connection: close' and you should be able to complete
the test (it seems wrk does not handle implicit non-keepalive
connections well). Results will be pretty close to the --http ones. So
nothing funny here.
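
A sketch of the suggested invocation (port and app name are
illustrative):

```shell
# Send an explicit Connection: close header so wrk does not assume
# keep-alive on the raw --http-socket mode.
uwsgi --master --http-socket :8080 -p 2 -w hello \
      --add-header 'Connection: close'
```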

Regarding meinheld:

for this kind of test keep-alive definitely helps. I would never have
bet a cent on it, but effectively, if you add -H 'Connection: close' to
wrk, uWSGI starts winning again (10% more requests compared to meinheld
and up to 40% compared to plain gunicorn). [Note: please do not blame
gunicorn; hello world tests tend to favour C implementations, and things
change heavily with real applications]
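
The wrk invocation with that header would look something like this
(thread count, connection count, and duration are illustrative):

```shell
# Disable keep-alive on the client side so all servers are compared on
# one-request-per-connection behaviour.
wrk -t 4 -c 256 -d 30s -H 'Connection: close' http://127.0.0.1:8080/
```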

The funny part:

I suppose you are using unix sockets for nginx. Again, this test is
based on micro-optimizations. Let's sum up (the numbers are relative to
my machine):

uwsgi http router + uwsgi tcp -> 66.000-67.000
uwsgi http-socket (no proxy) -> 67.000-68.000
uwsgi http router + uwsgi unix -> 110.000
nginx + uwsgi tcp -> 83.000-84.000
nginx + uwsgi unix -> 145.000-160.000 (!!!)
nginx + uwsgi --http-socket tcp -> 69.000-71.000
nginx + uwsgi --http-socket unix -> 108.000-110.000
gunicorn+meinheld (no proxy, keepalive) -> 125.000-127.000
gunicorn+meinheld (no proxy, connection close) -> 48.000-55.000
gunicorn (no proxy) -> 22.000-27.000


why does nginx + uwsgi win?

nginx has a better keepalive parser than uwsgi

nginx and meinheld http parsers are the same

the uwsgi protocol (under nginx) performs a lot better than the http one

the uWSGI WSGI plugin is way faster than the gunicorn one (but only
because of the 'hello world' test; real-world tests with more work on
the python side have different results)

unix sockets always win as a micro-optimization

so even with a proxy in the middle, nginx makes a difference with
keep-alive connections, and the use of the uwsgi protocol combined with
the WSGI plugin results in better numbers.

Side note:

adding --thunder-lock to uWSGI gives a boost of 5.000 to 8.000 requests;
there are other tunings available, but you will gain no more than a
couple hundred requests.
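
Enabling it is a single flag on the worker invocation (the other options
here are illustrative):

```shell
# Serialize accept() across workers with a shared lock, avoiding the
# thundering-herd wakeup on the shared socket under high load.
uwsgi --master --http-socket :8080 -p 2 -w hello --thunder-lock
```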

Again:

these values are for a hello world, where the "python part" is less than
10% of the whole uWSGI request time, so do not give them too much
emphasis.

I think the same situation will apply to Ruby and Perl plugins too.

-- 
Roberto De Ioris
http://unbit.it
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
