Last week I spent a very enjoyable day (no kidding) debugging a very,
very slow cold-start situation (more than 15 minutes to return from the
first request). When making the first request to the app (Zope 3
based), the app server and storage server would show virtually no CPU
utilization, and there would be about a megabit of network traffic (on a
gigabit link). There was no obvious bottleneck.
After liberal application of strace, tcpdump, wireshark (aka ethereal),
and the Python profiler we discovered that while waiting for an
outstanding request for an object to load, ZEO calls a
threading.Connection instance's wait method with a timeout. When given
a timeout that method enters a wait loop with a time.sleep to sleep for
a while and then see if the condition has been met.
We found that time.sleep on that box had a minimum granularity of 10ms
(when passed a non-zero value), thus causing each object load to take
approximately that long. As you can imagine, that somewhat slowed down
the retrieval of the several thousand objects required to satisfy the
initial request(s) (until the ZEO cache was sufficiently warm).
The fix? Short-term: bump the operating system's timer interrupt on
that box to 1000Hz from 100Hz, increasing time.sleep's granularity to
1ms (this was on Linux, Window's time.sleep appears have a much higher
Long-term: Jim has found that the timeout call in the wait-for-result
code can be avoided, side-stepping the call to time.sleep altogether.
Senior Software Engineer
For more information about ZODB, see the ZODB Wiki:
ZODB-Dev mailing list - ZODB-Dev@zope.org