See the forwarded message. I just added the following in the buildout section of my ~/.buildout/default.cfg:

index = http://download.zope.org/ppix

Without it, refreshing a small buildout of mine takes 2m44s. With it, it takes about 15 seconds.
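
For anyone who would rather script that change than edit the file by hand, here is a minimal sketch. It assumes the file is plain INI that Python's configparser can read; buildout has its own parser, so configs using buildout-specific syntax may not round-trip. The helper name with_index and the sample eggs-directory value are only illustrative.

```python
import configparser
import io


def with_index(cfg_text, index_url):
    """Return cfg_text with `index` set in its [buildout] section."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(cfg_text)
    if not parser.has_section("buildout"):
        parser.add_section("buildout")
    parser.set("buildout", "index", index_url)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Illustrative input only; a real default.cfg will contain other options.
print(with_index("[buildout]\neggs-directory = /tmp/eggs\n",
                 "http://download.zope.org/ppix"))
```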

Jim

Begin forwarded message:

From: Jim Fulton <[EMAIL PROTECTED]>
Date: July 19, 2007 7:06:34 AM EDT
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Prototype setuptools-specific PyPI index.

Over the past few months, we've struggled quite a bit with Python Package Index (PyPI) performance and stability. Thanks to the heroic efforts of Martin v. Löwis and others, performance and especially stability have improved quite a bit. Martin has demonstrated that, at least when running well, PyPI seems to answer most requests on the order of 7 milliseconds (around 150 requests per second) internally. That's not bad. Unfortunately for users, actual times can be quite a bit longer. For me at work, requests take around 300 milliseconds. For Martin, they seem to take somewhat longer. 300 milliseconds isn't so bad for a request or two. However, easy_install can easily make tens or even hundreds of requests to satisfy a user request for a package. zc.buildout, when verifying that a large system with many tens of packages has the most up-to-date versions of each package, can easily make thousands of requests.
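
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes requests are made one after another at a fixed latency, which is a simplification; the function name total_seconds is only illustrative.

```python
def total_seconds(n_requests, latency_ms):
    """Wall-clock time for n sequential requests at a fixed per-request latency."""
    return n_requests * latency_ms / 1000.0


# Client side: 300 ms per request, as measured from work.
for n in (2, 10, 100, 1000):
    print(n, "requests at 300 ms each:", total_seconds(n, 300), "seconds")

# Server side: 7 ms per request, handled one at a time.
print("requests per second at 7 ms each:", round(1000 / 7))
```

At 300 milliseconds each, a hundred requests cost half a minute and a thousand cost five minutes, even though the server itself is fast.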

Why do setuptools and buildout make so many requests? If a package exposes more than one release, then setuptools checks the package's main PyPI page and the pages for each release. We need to be able to easily use older releases, so we can't hide old releases. Typical projects of ours have many old releases exposed. Even if setuptools were more clever in the way it searched PyPI, it would still have to make a minimum of 2 requests per package for packages with multiple versions exposed.
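
The request count described above can be modeled in a few lines. This is a simplified sketch, not setuptools code: it assumes a package with a single exposed release needs only one page read, and it ignores off-site home page and download links. Both function names are made up for illustration.

```python
def index_page_requests(n_releases):
    """PyPI pages read for one package: the main page, plus one page per
    release when more than one release is exposed."""
    if n_releases <= 1:
        return 1  # assumption: a single release needs only one page
    return 1 + n_releases


def best_case_requests(n_releases):
    """Lower bound from the text: even a cleverer search needs two
    requests when multiple versions are exposed."""
    return 1 if n_releases <= 1 else 2


# A package with 13 exposed releases costs 14 PyPI page reads.
print(index_page_requests(13), "now,", best_case_requests(13), "at best")
```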

Another potential issue is that PyPI pages can be large. I've found it convenient to use PyPI package pages as the home page for many of my projects. I like to include package documentation in my project pages. Perhaps this is an abuse of PyPI, but it is very convenient for me and no one has complained. :) The zc.buildout pages are around 200K. That's a fair bit of data for setuptools to download and scan for download URLs.

In the course of this discussion, I've realized that it doesn't make sense for setuptools to use the same interface that humans use. setuptools doesn't need to see all of the data that is useful to humans. Similarly, humans generally don't need to see all of the historical releases for a project. I suggested a simple page format designed just for setuptools. An alternative would be an XML-RPC API. I prefer pages because I think that, over time, the number of requests from automated tools like easy_install and zc.buildout will increase substantially and will ultimately overwhelm dynamic servers, even ones like PyPI that are reasonably fast. I also think that a simple static collection of pages will be easier to mirror, and I think some number of geographic mirrors is likely to help some people. I promised to prototype the format I suggested.

I've created an experimental prototype setuptools-specific package index at

  http://download.zope.org/ppix

Going to that page gives brief instructions for using it with easy_install and zc.buildout. To see an individual package page, add the package name to the URL, as in:

  http://download.zope.org/ppix/setuptools/
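
As a rough illustration of what a client does with such a page, the sketch below builds a package page URL and collects the links from a page. It assumes the page is ordinary HTML with anchor tags; the exact page layout is not specified here, and the names package_page_url, LinkCollector and extract_links are made up. No network access is involved.

```python
from html.parser import HTMLParser
from urllib.parse import quote


def package_page_url(index_url, package_name):
    """Append the package name to the index URL, with a trailing slash."""
    return index_url.rstrip("/") + "/" + quote(package_name) + "/"


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page_html):
    collector = LinkCollector()
    collector.feed(page_html)
    return collector.links


print(package_page_url("http://download.zope.org/ppix", "setuptools"))
# Hypothetical page content, only to exercise the parser.
print(extract_links('<a href="http://example.org/pkg-1.0.tar.gz">pkg-1.0</a>'))
```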

A few things to note about this:

- I don't expose a long package list at http://download.zope.org/ppix/. The long package list would be expensive to download and supports a use case that I consider to be of negative value, which is installing packages with case-insensitive package names. I think it is important for humans to be able to search for packages using case-insensitive search terms, but I think that, after identifying a package, precise package names should be used. I think it is especially important that precise package names be used in package requirements.

- There is a single page per package. This can greatly reduce the number of requests. Packages that store all of their distributions in PyPI and that don't have off-site home pages or download URLs can be scanned with a single request. Note that I excluded home page and download URLs that pointed back to the package's PyPI page, as that wouldn't provide any new information to setuptools.

- Download URLs for *hidden* packages are included. Humans don't need to see old revisions, but setuptools-based tools do. If we used an index like this for setuptools, we could stop unhiding old releases when we created new releases in PyPI. This would make PyPI more useful to humans and less of a pain for developers.

- Download URLs are the same as they are in PyPI. Using this new index, distributions are still downloaded from PyPI, so the index doesn't affect PyPI download statistics.
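
The points above can be sketched as a small page generator. This is not the software behind the experimental index, just an illustration under assumptions: the HTML layout, the function name render_package_page and its parameters are all hypothetical. The caller is expected to pass download URLs for every release, hidden or not.

```python
from html import escape


def render_package_page(name, pypi_page_url, download_urls, extra_urls=()):
    """Render one static page of links for a package (hypothetical layout).

    download_urls: distribution URLs for all releases, including hidden ones.
    extra_urls: home page and download URLs taken from package metadata.
    """
    links = list(download_urls)
    for url in extra_urls:
        # A link back to the package's own PyPI page adds nothing new.
        if url.rstrip("/") == pypi_page_url.rstrip("/"):
            continue
        if url not in links:
            links.append(url)
    body = "\n".join(
        '<a href="%s">%s</a><br/>' % (escape(url, quote=True), escape(url))
        for url in links
    )
    return ("<html><head><title>%s</title></head><body>\n%s\n</body></html>\n"
            % (escape(name), body))


print(render_package_page(
    "zc.buildout",
    "http://cheeseshop.python.org/pypi/zc.buildout",
    ["http://cheeseshop.python.org/packages/2.5/z/zc.buildout/"
     "zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d"],
    ["http://cheeseshop.python.org/pypi/zc.buildout/",
     "http://svn.zope.org/zc.buildout"],
))
```

Because the output is a plain file per package, it can be served statically and copied to mirrors without any server-side logic.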

To see the impact of this, it's interesting to look at installing zc.buildout using easy_install from PyPI and from the experimental index.

Installing using PyPI looks like this:

  (env)[EMAIL PROTECTED]:~/tmp$ time easy_install zc.buildout
  Searching for zc.buildout
  Reading http://cheeseshop.python.org/pypi/zc.buildout/
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b19
  Reading http://svn.zope.org/zc.buildout
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b22
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b23
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b20
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b21
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b26
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b27
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b24
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b25
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b28
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b17
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b16
  Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b18
  Best match: zc.buildout 1.0.0b28
  Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
  Processing zc.buildout-1.0.0b28-py2.5.egg
  creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
  Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
  Adding zc.buildout 1.0.0b28 to easy-install.pth file
  Installing buildout script to /home/jim/tmp/env/bin/

  Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
  Processing dependencies for zc.buildout
  Searching for setuptools==0.6c6
  Best match: setuptools 0.6c6
  Processing setuptools-0.6c6-py2.5.egg
  Adding setuptools 0.6c6 to easy-install.pth file
  Installing easy_install script to /home/jim/tmp/env/bin/
  Installing easy_install-2.5 script to /home/jim/tmp/env/bin/

  Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
  Processing dependencies for setuptools==0.6c6
  Finished processing dependencies for setuptools==0.6c6
  Finished installing setuptools==0.6c6
  Finished processing dependencies for zc.buildout
  Finished installing zc.buildout

  real  0m31.360s
  user  0m1.136s
  sys   0m0.060s

Note the large number of pages read. Here I was installing a single package with one dependency, setuptools, that was already installed. Let's look at this again using the experimental index:

  (env)[EMAIL PROTECTED]:~/tmp$ time easy_install -i http://download.zope.org/ppix zc.buildout
  Searching for zc.buildout
  Reading http://download.zope.org/ppix/zc.buildout/
  Best match: zc.buildout 1.0.0b28
  Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
  Processing zc.buildout-1.0.0b28-py2.5.egg
  creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
  Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
  Adding zc.buildout 1.0.0b28 to easy-install.pth file
  Installing buildout script to /home/jim/tmp/env/bin/

  Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
  Processing dependencies for zc.buildout
  Searching for setuptools==0.6c6
  Best match: setuptools 0.6c6
  Processing setuptools-0.6c6-py2.5.egg
  Adding setuptools 0.6c6 to easy-install.pth file
  Installing easy_install script to /home/jim/tmp/env/bin/
  Installing easy_install-2.5 script to /home/jim/tmp/env/bin/

  Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
  Processing dependencies for setuptools==0.6c6
  Finished processing dependencies for setuptools==0.6c6
  Finished installing setuptools==0.6c6
  Finished processing dependencies for zc.buildout
  Finished installing zc.buildout

  real  0m7.006s
  user  0m0.244s
  sys   0m0.040s

Note:

- We made far fewer requests with the new index: one page read instead of fifteen.

- Most of the time in the second example was spent actually downloading the buildout distribution. Most of the time in the first example was spent reading the index.

- I used workingenv to create clean environments for each of the examples above.
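
The comparison can be checked mechanically by counting the "Reading" lines in each transcript and dividing the two "real" times. A small sketch, with made-up helper names:

```python
def count_reads(transcript):
    """Count the 'Reading <url>' lines in easy_install output."""
    return sum(
        1 for line in transcript.splitlines()
        if line.strip().startswith("Reading ")
    )


def speedup(before_seconds, after_seconds):
    """How many times faster the second run was."""
    return before_seconds / after_seconds


# Abbreviated sample; paste a full transcript to count its page reads.
sample = """\
  Searching for zc.buildout
  Reading http://download.zope.org/ppix/zc.buildout/
  Best match: zc.buildout 1.0.0b28
"""
print(count_reads(sample), "page read")

# 'real' times from the two runs above: 31.360 s and 7.006 s.
print("speedup:", round(speedup(31.360, 7.006), 1))
```

For the two runs above, that works out to roughly a 4.5-fold improvement in wall-clock time.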

WRT zc.buildout, refreshing a buildout with just ZODB installed in it takes about 45 seconds for me using PyPI and about 5 seconds using the experimental index.

Some of the speed improvement is due to the fact that the experimental index is much closer to me (on the net) than PyPI. ATM, requests to PyPI take *me* around 500 milliseconds, while requests to the experimental index are taking between 100 and 300 milliseconds. (I'm at home and this seems to be somewhat variable.) Most of the speed improvement comes from reducing the number of requests.

I'm polling PyPI once a minute to get and apply updates. Thanks to the new XML-RPC method that Martin added, this is very efficient to do.
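
The polling loop might look something like the sketch below. This is not the actual updater: fetch_changes is a hypothetical stand-in for whatever call returns (package name, timestamp) pairs for changes since a given time, and rebuild_page is a hypothetical callback that regenerates one package page. Neither is a real PyPI API name.

```python
import time


def poll_once(fetch_changes, since, rebuild_page):
    """Apply one round of updates and return the timestamp to poll from next.

    fetch_changes(since) yields (package_name, timestamp) pairs (assumed).
    rebuild_page(package_name) regenerates that package's page (assumed).
    """
    latest = since
    changed = set()
    for name, stamp in fetch_changes(since):
        changed.add(name)
        latest = max(latest, stamp)
    # Rebuild each changed package once, however many changes it had.
    for name in sorted(changed):
        rebuild_page(name)
    return latest


def poll_forever(fetch_changes, rebuild_page, interval_seconds=60, since=0):
    """Poll once a minute by default; runs until interrupted."""
    while True:
        since = poll_once(fetch_changes, since, rebuild_page)
        time.sleep(interval_seconds)
```

Asking only for changes since the last poll keeps each round cheap: most minutes there is little or nothing to rebuild.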

I encourage people to check this out and even try using it with easy_install and especially buildout. AFAIK, aside from being much faster and showing download files for hidden releases, it is completely equivalent to PyPI for setuptools use. My intention is to keep this experimental index going and up to date for the foreseeable future, and I plan to use it for all my work.

My primary goal is to prototype the new index format. If this seems useful, then I think that www.python.org should expose an index in this format to setuptools, either at a different URL or by satisfying setuptools requests from the index based on client information. I'd love to see this index populated via a baking mechanism that updates package pages when they change, rather than through polling as I'm doing.

There would be some benefit to having geographic mirrors. I suspect that having such mirrors available would improve performance further, at least for some folks. It might also be useful to have some mirrors for redundancy purposes. Note though that what I'm doing is mirroring only the index data. I'm not mirroring distributions. Of course, I'd be happy to make my software available. (It already is via our subversion repository.)

I hope this effort spurs useful discussion and progress.

Jim

--
Jim Fulton              mailto:[EMAIL PROTECTED]        Python Powered!
CTO                     (540) 361-1714                  http://www.python.org
Zope Corporation        http://www.zope.com             http://www.zope.org






