To follow up on this issue: I think it might be related to the db
and/or the DAL.
+ I don't think it is caused by exhausting RAM or by too many open files.
Using lsof to monitor open files during a stress test with 200
concurrent channels (ab), I saw up to 11,000 open files (most of them
held by apache2). There were several dozen failed requests, caused by
"premature end of script wsgi". What is interesting is that I was
able to cause these errors even with only 20 concurrent channels (with
only about 3000-4000 open files). This was under postgres.
+ I could not cause the error with sqlite or on a page (controller)
that has only 1 db call.
+ How was I able to cause this error with only 20 concurrent
channels? First, I observed that ab is quite simple in that it hits
the same page again and again. So I wanted a more realistic test
(with more complex behavior). Without any other tools, I decided to
do 2 things simultaneously: (1) run ab with 20 concurrent channels, and
(2) manually search for items (via ajax) using the search form on the
website; the search performs several db queries that the page hit by ab
does not. The result was several failed requests (all from this
error) even with only 20 concurrent channels (which is
ridiculous).
+ Another anecdote. I have hit this error myself a number of times
while using the app normally (not as a result of a stress test). Once the
error occurred, apache failed to serve the page, of course. What I
observed is that when I immediately reloaded the page, it loaded again
very quickly (as is normally the case). What this tells me is that the
cause of this wsgi error is probably NOT the exhaustion of
some type of resource (RAM, open files, etc.): if a
lack of resources were the cause, it would take some time for the
resources to recover before the page could be served quickly
again.
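On the open-file counts in the first point: a minimal sketch of how such numbers could be tallied per process, assuming lsof's default output where the first column is the command name. The function names (count_open_files, snapshot) are made up for illustration; this is not necessarily how the figures above were collected.

```python
import subprocess
from collections import Counter

def count_open_files(lsof_output):
    """Tally lsof rows by the COMMAND column (first field of each row)."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with COMMAND.
        if not fields or fields[0] == "COMMAND":
            continue
        counts[fields[0]] += 1
    return counts

def snapshot():
    """Run lsof once and return the per-command tally (needs lsof on PATH)."""
    out = subprocess.run(["lsof"], capture_output=True, text=True).stdout
    return count_open_files(out)
```

Calling snapshot() every second or so while ab is running, and printing sum(counts.values()) together with counts.most_common(5), would show whether the total keeps climbing and which process holds most of the entries.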
This issue is annoying. Crashing like this is not pleasant from a
user's point of view. It's clearly related to the scalability of web2py.
I hope there's an answer to this soon.
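In case anyone wants to reproduce the mixed load without clicking around by hand, here is a rough sketch of the idea: several threads cycling over more than one URL, so that the db-heavy page gets hit alongside the simple one. This is an assumption-laden stand-in for "ab plus manual searching", not the exact procedure I used; run_mixed_load and fetch_status are made-up names, and the URLs you pass in (for example the front page and whatever your search URL is) are up to you.

```python
import threading
import urllib.error
import urllib.request

def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except (urllib.error.URLError, OSError):
        return None

def run_mixed_load(urls, workers=20, requests_per_worker=10, fetch=fetch_status):
    """Hit urls round-robin from several threads; return (ok, failed) counts."""
    lock = threading.Lock()
    totals = {"ok": 0, "failed": 0}

    def worker(offset):
        for i in range(requests_per_worker):
            # Each worker starts at a different URL so the pages interleave.
            url = urls[(offset + i) % len(urls)]
            status = fetch(url)
            key = "ok" if status is not None and 200 <= status < 300 else "failed"
            with lock:
                totals[key] += 1

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals["ok"], totals["failed"]
```

Anything that is not a 2xx response is counted as failed, which is roughly what ab reports as "Non-2xx responses". The fetch parameter is only there so the function can be tried out with a fake fetcher before pointing it at a real server.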
Here's a typical output of ab with 20 concurrent connections showing
failed requests.
>>>
Finished 254 requests
Server Software: Apache/2.2.9
Server Port: 80
Document Path: /
Document Length: 10133 bytes
Concurrency Level: 20
Time taken for tests: 10.045 seconds
Complete requests: 254
Failed requests: 15
(Connect: 0, Receive: 0, Length: 15, Exceptions: 0)
Write errors: 0
Non-2xx responses: 15
Keep-Alive requests: 239
Total transferred: 2558415 bytes
HTML transferred: 2448532 bytes
Requests per second: 25.29 [#/sec] (mean)
Time per request: 790.914 [ms] (mean)
Time per request: 39.546 [ms] (mean, across all concurrent requests)
Transfer rate: 248.74 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    6   16.1      0     56
Processing:   121  748  592.9    631   4312
Waiting:      120  688  589.9    566   4243
Total:        121  755  603.8    631   4365
Percentage of the requests served within a certain time (ms)
50% 631
66% 710
75% 772
80% 804
90% 1670
95% 1893
98% 2788
99% 3929
100% 4365 (longest request)
>>>
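To put the numbers above in perspective, here is a small sanity check of how the headline figures relate to each other, using only values copied from the output. The variable names are mine; the results differ from ab's in the last digit or so because the elapsed time shown is already rounded.

```python
# Values copied from the ab output above.
requests = 254
failed = 15
keep_alive = 239
concurrency = 20
elapsed_s = 10.045

# Throughput: completed requests divided by wall-clock time.
rps = requests / elapsed_s

# Mean time per request across all concurrent requests, in ms.
per_request_ms = elapsed_s / requests * 1000

# Mean time per request as seen by one of the 20 channels, in ms.
per_request_concurrent_ms = per_request_ms * concurrency

# Share of requests that failed.
failure_pct = failed / requests * 100

# Requests that were not served over a kept-alive connection.
non_keep_alive = requests - keep_alive

print(f"{rps:.2f} req/s, {per_request_ms:.3f} ms, "
      f"{per_request_concurrent_ms:.3f} ms, {failure_pct:.1f}% failed, "
      f"{non_keep_alive} non-keep-alive")
```

So roughly 6% of requests fail at a concurrency of only 20. The count of non-keep-alive requests (254 - 239) also comes out to 15, the same as the number of failures, though I can't say from this output alone whether those are the same requests.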