Hi,

I wrote two scripts, based on INADA-san's work, to (1) download the
source code of the PyPI top 5000 projects and (2) search for a regex
in these projects (compressed source archives).

You can use these tools if you work on an incompatible Python or C API
change to estimate how many projects are impacted.

The HPy project created a Git repository for a similar need (latest
update in June 2021):
https://github.com/hpyproject/top4000-pypi-packages

There are also online services for code search:

* GitHub: https://github.com/search
* https://grep.app/ (I haven't tried it yet)
* Debian: https://codesearch.debian.net/


(1) Download

Script:
https://github.com/vstinner/misc/blob/main/cpython/download_pypi_top.py

Usage: download_pypi_top.py PATH

It uses this JSON file:
https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json

From this service:
https://hugovk.github.io/top-pypi-packages/

As of December 1, out of 5000 projects, it only downloads 4760 tarball
and ZIP archives: I guess that 240 projects don't provide a source
archive. It takes around 5.2 GB of disk space.
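
To give an idea of what the download step involves, here is a minimal
sketch. It is not the code of download_pypi_top.py. It assumes that the
JSON file has a "rows" list whose entries carry a "project" key, and
that the source archive of a project is listed in the "urls" array of
the PyPI JSON API (https://pypi.org/pypi/NAME/json) with
packagetype "sdist". The function names are made up for illustration.

```python
import json
import os
import urllib.request

TOP_URL = ("https://hugovk.github.io/top-pypi-packages/"
           "top-pypi-packages-30-days.min.json")


def project_names(top_data):
    """Extract project names from the top-pypi-packages JSON document.

    Assumes the layout {"rows": [{"project": NAME, ...}, ...]}.
    """
    return [row["project"] for row in top_data["rows"]]


def pick_sdist(release_data):
    """Return (filename, url) of the source archive, or None if the
    latest release has no sdist (for example, wheel-only projects)."""
    for file_info in release_data.get("urls", []):
        if file_info.get("packagetype") == "sdist":
            return file_info["filename"], file_info["url"]
    return None


def fetch_json(url):
    """Download and decode a JSON document."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def download_top(dest_dir, limit=5000):
    """Download the sdist of each top project into dest_dir.

    Return the list of projects skipped because no sdist was found.
    """
    os.makedirs(dest_dir, exist_ok=True)
    skipped = []
    for name in project_names(fetch_json(TOP_URL))[:limit]:
        sdist = pick_sdist(fetch_json(f"https://pypi.org/pypi/{name}/json"))
        if sdist is None:
            skipped.append(name)
            continue
        filename, url = sdist
        urllib.request.urlretrieve(url, os.path.join(dest_dir, filename))
    return skipped
```

The list returned by download_top() would correspond to the projects
without a source archive, if the guess above is right.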


(2) Code search

First, I used the fast and nice "ripgrep" tool with the command "rg
-zl REGEX path/*.{zip,gz,bz2,tgz}" (-z searches in ZIP and tarball
archives). But it doesn't show the path inside the archive and it
searches in files generated by Cython whereas I wanted to ignore these
files.

So I wrote a short Python script which decompresses tarball and ZIP
archives in memory and looks for a regex:
https://github.com/vstinner/misc/blob/main/cpython/search_pypi_top.py

Usage: search_pypi_top.py "REGEX" output_filename

The command line parsing is hardcoded and pypi_dir =
"PYPI-2021-12-01-TOP-5000" is hardcoded too :-D

It ignores files generated by Cython and .so binary files (Linux
dynamic libraries).
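
The approach can be sketched as follows. This is only an illustration,
not the code of search_pypi_top.py: the function names are
hypothetical, and detecting Cython output by looking for "Generated by
Cython" near the start of a file is an assumption about how such files
can be recognized.

```python
import re
import tarfile
import zipfile

# Assumption: files generated by Cython carry this marker near the top.
CYTHON_MARKER = b"Generated by Cython"


def iter_members(archive_path):
    """Yield (member_name, content_bytes) for each regular file of a
    ZIP or tarball archive, decompressing in memory only."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    else:
        # Mode "r" detects gzip, bz2 and xz compression transparently.
        with tarfile.open(archive_path) as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()


def search_archive(archive_path, regex):
    """Yield (member_name, line_number, line) for each matching line,
    skipping .so binaries and files generated by Cython."""
    pattern = re.compile(regex.encode())
    for name, data in iter_members(archive_path):
        if name.endswith(".so"):
            continue
        if CYTHON_MARKER in data[:1000]:
            continue
        for lineno, line in enumerate(data.splitlines(), 1):
            if pattern.search(line):
                yield name, lineno, line.decode("utf-8", "replace")
```

Unlike "rg -zl", each result carries the path inside the archive, which
makes it easy to open the matching file afterwards.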

While "rg" is very fast, my script is very slow. But I don't care:
once the regex is written, I only need to run the search once, so I
can wait 10-15 min ;-) I prefer to wait longer and get a more
accurate result. Also, there is room for enhancement, like running
multiple jobs in different processes or threads.
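
One possible shape for that enhancement, as a sketch only: search each
archive in its own job with concurrent.futures. To stay short, the
worker below handles ZIP archives only and reports matching member
names; the names archive_matches() and search_parallel() are
hypothetical and nothing like this exists in the script today.

```python
import re
import zipfile
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def archive_matches(regex, archive_path):
    """Return (archive_path, names of the members matching regex)
    for one ZIP archive, read in memory."""
    pattern = re.compile(regex.encode())
    hits = []
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if pattern.search(zf.read(name)):
                hits.append(name)
    return archive_path, hits


def search_parallel(regex, archive_paths,
                    executor_cls=ProcessPoolExecutor, jobs=None):
    """Search every archive in a separate job.

    Return a dict mapping each archive path to its matching members.
    """
    worker = partial(archive_matches, regex)
    with executor_cls(max_workers=jobs) as executor:
        return dict(executor.map(worker, archive_paths))
```

Passing executor_cls=ThreadPoolExecutor switches to threads. With the
default process pool, call search_parallel() under an
if __name__ == "__main__": guard on platforms which spawn worker
processes.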

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/WQVEHLRIVISPFMWSSX5N4TQPIUN2XS22/