Last week I spent a very enjoyable day (no kidding) debugging a very, very slow cold-start situation (more than 15 minutes to return from the first request). When making the first request to the app (Zope 3 based), the app server and storage server would show virtually no CPU utilization, and there would be about a megabit of network traffic (on a gigabit link). There was no obvious bottleneck.

After liberal application of strace, tcpdump, wireshark (aka ethereal), and the Python profiler we discovered that while waiting for an outstanding request for an object to load, ZEO calls a threading.Connection instance's wait method with a timeout. When given a timeout that method enters a wait loop with a time.sleep to sleep for a while and then see if the condition has been met.

We found that time.sleep on that box had a minimum granularity of 10ms (when passed a non-zero value), thus causing each object load to take approximately that long. As you can imagine, that somewhat slowed down the retrieval of the several thousand objects required to satisfy the initial request(s) (until the ZEO cache was sufficiently warm).

The fix? Short-term: bump the operating system's timer interrupt on that box to 1000Hz from 100Hz, increasing time.sleep's granularity to 1ms (this was on Linux, Window's time.sleep appears have a much higher resolution).

Long-term: Jim has found that the timeout call in the wait-for-result code can be avoided, side-stepping the call to time.sleep altogether.
--
Benji York
Senior Software Engineer
Zope Corporation
_______________________________________________
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev

Reply via email to