Thanks for directly investigating this and for explaining the results. Yes, synthetic microbenchmarks like this are contrived and unrealistic, but they help establish an upper bound for the performance of real apps (in other words, if a hello-world setup in some configuration can do X requests/sec, doing real work certainly won't exceed X requests/sec).
I will discuss these results with the FrameworkBenchmarks project and see if it makes sense to switch from gunicorn to nginx+uWSGI, or to reconfigure the existing uWSGI usage. Thanks!

On Wed, Sep 11, 2013 at 7:37 AM, Roberto De Ioris <[email protected]> wrote:
>>
>> uwsgi-1.9.15 - Wed Sep 11 00:32:08 2013 - req: 63709 - lq: 0 - tx: 7.0M
>> node: machinename - cwd: /home/ubuntu/FrameworkBenchmarks/uwsgi - uid: 1000 - gid: 1000 - masterpid: 6948
>>
>> WID  %     PID   REQ    EXC  SIG  STATUS  AVG  RSS  VSZ  TX    RunT
>> 2    51.7  6950  32947  0    0    idle    0ms  0    0    3.0M  1252
>> 1    48.3  6949  30762  0    0    idle    0ms  0    0    3.0M  1157
>>
>>> > So far I haven't seen any data to suggest that this is an affinitization problem or that affinity could help, so I haven't bothered with --cpu-affinity. So far virtualization doesn't seem to be an issue since my physical machine is otherwise idle and has two cores (with hyperthreading).
>>>
>>> On virtualized systems, cpu-affinity simply does not work because of the way CPUs are abstracted by the hypervisor. Even if your kernel shows the right distribution, internally you do not know which CPU is effectively being used.
>>>
>>> But this is not your problem. I have run some tests with a concurrency of 90 (so no need to tune the listen queue), and --http-socket was 1-2% faster, while httprouter + uwsgi was 3-4% slower (as expected, since you have the IPC overhead, something you will always have in production environments).
>>>
>>> > After doing this research (with your help), my analysis is that the (single-process) uwsgi httprouter becomes CPU bound and becomes the limiting factor.
>>>
>>> (Always supposing you are using a 1.9.x version.)
>>>
>>> The httprouter becomes CPU bound only at higher levels of concurrency (unless you are using a pre-1.9 version, where there are blocking parts).
>>>
>>> Workers are heavier in terms of "things to do"; the fact that they are low in CPU usage suggests a communication problem (again, it could be the listen queue). The httprouter (like nginx) does not need a tuned listen queue, as it constantly accept()s and waits again, reducing the need for a queue. Workers, instead, do the heavy part after the accept(), and connections arriving while they are in the "heavy part" get enqueued (and saturating a 100-entry listen queue with 256 concurrent connections and 2 workers is pretty easy, especially because --http-socket expects a 4-second timeout on protocol traffic).
>>
>> Your theory makes sense, but so far I don't think I've seen any data suggesting that's what is going on. I'm open to ideas.
>>
>>> > Thus, to increase the performance, one must distribute the load amongst more than one httprouter (--http-processes 2), or perhaps use a different 'router' such as nginx using the uwsgi protocol. What do you think? Is my thinking/analysis/approach wrong? I'm open to suggestions.
>>>
>>> The httprouter passes requests to uWSGI workers via the uwsgi protocol. In terms of performance it should map 1:1 with nginx (it is just way simpler than nginx; the latter's parser is better for sure).
>>
>> I tried nginx+uwsgi and I got ~12,300 requests/sec, the best result I've gotten so far. The uwsgi command line:
>>
>> --master -L -l 5000 --socket /tmp/uwsgi.sock --chmod-socket=666 -p 2 -w hello --pidfile /tmp/uwsgi.pid
>>
>> The nginx.conf is here: https://gist.github.com/MalcolmEvershed/6520477
>>
>> Isn't it odd that nginx+uwsgi is the best performing combination, beating gunicorn+meinheld and the uwsgi httprouter?
>> I'm really not sure what to make of this. Am I doing something wrong?
>>
>>> > Is there a way to use multiple worker processes without a router? Basically, is there a way that does the accept()/epoll()/read() from the network and then in the same process executes the Python code? That seems like it might be the fastest approach because it would eliminate the dispatch from the router process to the worker process. I have a feeling that gunicorn+meinheld might be doing this, but I haven't read the code to verify.
>>>
>>> I do not follow you here; that is the standard way uWSGI works. Even with the httprouter, the backend workers share the socket. It is the reason why --thunder-lock is needed in high-load scenarios.
>>
>> Maybe I'm misunderstanding. I thought that when an httprouter is used it works like this:
>>
>
> It took me a bit to fully understand what is going on.
>
> Finally I decided to invest a bit of time in 'wrk' to check how it works.
>
> Well, while I am not a big fan of "hello world" benchmarks, the one you made produced some really interesting (and funny) numbers.
>
> Regarding --http-socket:
>
> Add --add-header 'Connection: close' and you should be able to complete the test (it seems wrk does not manage implicit non-keepalive connections well). Results will be pretty near the --http ones. So nothing funny here.
>
> Regarding meinheld:
>
> For this kind of test keep-alive definitely helps. I would never have bet a cent on it, but effectively if you add -H 'Connection: close' to wrk, uWSGI starts winning again (10% more requests compared to meinheld, and up to 40% compared to plain gunicorn). [Note: please do not blame gunicorn; hello world tests tend to favour C implementations, and things change heavily with real applications.]
>
> The funny part:
>
> I suppose you are using UNIX sockets for nginx. Again, this test comes down to micro-optimizations.
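[Editor's note: for reference, the module loaded with `-w hello` in the command lines quoted above is a bare WSGI hello-world app. The exact source used in FrameworkBenchmarks may differ; this is an illustrative sketch, with the response body and headers assumed.]

```python
# hello.py -- minimal WSGI "hello world" of the kind loaded with
# `uwsgi ... -w hello` in the benchmarks discussed in this thread.
# (Sketch only; the actual FrameworkBenchmarks app may differ.)

def application(environ, start_response):
    # Keep the Python side trivial so the benchmark measures the
    # server and protocol stack, not the application code.
    body = b"Hello, World!"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

uWSGI, gunicorn, and meinheld all look up the `application` callable by default, which is why the same module works across the configurations being compared.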
> Let's sum up (the numbers are relative to my machine):
>
> uwsgi http router + uwsgi tcp                   -> 66.000-67.000
> uwsgi http-socket (no proxy)                    -> 67.000-68.000
> uwsgi http router + uwsgi unix                  -> 110.000
> nginx + uwsgi tcp                               -> 83.000-84.000
> nginx + uwsgi unix                              -> 145.000-160.000 (!!!)
> nginx + uwsgi --http-socket tcp                 -> 69.000-71.000
> nginx + uwsgi --http-socket unix                -> 108.000-110.000
> gunicorn+meinheld (no proxy, keepalive)         -> 125.000-127.000
> gunicorn+meinheld (no proxy, connection close)  -> 48.000-55.000
> gunicorn (no proxy)                             -> 22.000-27.000
>
> Why does nginx + uwsgi win?
>
> nginx has a better keepalive parser than uwsgi.
>
> The nginx and meinheld http parsers are the same.
>
> The uwsgi protocol (under nginx) performs a lot better than the http one.
>
> The uWSGI WSGI plugin is way faster than the gunicorn one (but only because of the 'hello world' test; real-world tests with more impact on the Python side give different results).
>
> UNIX sockets always win as a micro-optimization.
>
> So even with a proxy in the middle, nginx makes a difference with keep-alive connections, and the usage of the uwsgi protocol combined with the WSGI plugin results in better numbers.
>
> Side note:
>
> Adding --thunder-lock to uWSGI gives a boost of 5.000 to 8.000 requests; there is other tuning available, but you will gain no more than a couple hundred requests.
>
> Again:
>
> These values are for a hello world, where the "python part" is less than 10% of the whole uWSGI request time, so do not give them too much emphasis.
>
> I think the same situation will apply to the Ruby and Perl plugins too.
>
> --
> Roberto De Ioris
> http://unbit.it
> _______________________________________________
> uWSGI mailing list
> [email protected]
> http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
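[Editor's note: the winning "nginx + uwsgi unix" rows above rely on nginx speaking the uwsgi protocol over a UNIX socket. A minimal sketch of such a proxy configuration follows; the socket path matches the `--socket /tmp/uwsgi.sock` command line quoted earlier in the thread, but everything else is an assumption, not the contents of the linked gist.]

```nginx
worker_processes 2;

events {
    worker_connections 1024;
}

http {
    # Keep-alive between the benchmark client and nginx is one of the
    # reasons nginx wins here: its keepalive handling beats uwsgi's.
    keepalive_timeout 65;

    server {
        listen 8080;

        location / {
            # Standard uwsgi protocol variables shipped with nginx.
            include uwsgi_params;
            # uwsgi protocol over a UNIX socket: the fastest combination
            # in the numbers above.
            uwsgi_pass unix:/tmp/uwsgi.sock;
        }
    }
}
```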
