[Python-Dev] urllib2 EP + decr. startup time
Hello all.

For more than two years I have used urllib2 heavily in commercial applications (mostly to extract data from web sites into Excel sheets), and here are some proposed enhancements for it:

1) Add support for 'HEAD' requests (and maybe some other request types). This needs only small changes.
   a) Add a request_type = 'GET' parameter to the urllib2.Request class constructor.
   b) Pass the request_type value on as the HTTP request method, unless the Request has data, in which case it changes to 'POST'.
   The result of a 'HEAD' request is the same as for a 'GET' request, except that the body has zero size.

2) HTTP keep-alive opener. An almost complete implementation can be found in urlgrabber (http://linux.duke.edu/projects/urlgrabber), which is used by yum, so I think it is tested well enough. It uses the urllib2 opener protocol and integrates well into the urllib2 structure. It needs just a little change to properly support some headers.

3) Save the HTTP exchange history. At present there is no suitable way to obtain all sent and received headers. Received headers are saved only for the last response in a redirection chain, and sent headers are not saved at all. I use run-time patching of httplib to intercept the sent and received data (maybe I missed something?). The proposal is to add a 'history' property to the object returned from urllib2.urlopen: a list of objects which contain the sent and received headers for the whole redirect chain.

4) Add a way to obtain the underlying socket used to receive HTTP data. At present it is impossible to work with an HTTP connection asynchronously (or am I missing something again?). If the connection hangs, the whole program hangs too, and I don't know a way to fix this. Of course, if you obtain such a socket then you are responsible for compression etc. yourself. I currently use the following code:

    x = urllib2.urlopen(...)
    sock = x.fp._sock.fp._sock

   As far as I know there is only one problem: chunked encoding. In that case a socket-like object has to be returned which does all the work of assembling the chunks into the original stream. I have already used such an object for two years and it works fine.
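Proposal 1 in the message above can be sketched as a small Request subclass. This is only an illustration, not the submitted patch. It assumes Python 3, where urllib2 lives on as urllib.request, and the class name TypedRequest and the request_type keyword are invented here to mirror the wording of the proposal.

```python
import urllib.request


class TypedRequest(urllib.request.Request):
    """Request that carries an explicit HTTP method (hypothetical helper)."""

    def __init__(self, url, data=None, headers=None, request_type="GET"):
        super().__init__(url, data=data, headers=headers or {})
        self.request_type = request_type

    def get_method(self):
        # As in the proposal: a request that has data is always sent as POST.
        if self.data is not None:
            return "POST"
        return self.request_type


# No network traffic happens here; the objects are only constructed.
head = TypedRequest("http://example.com/", request_type="HEAD")
post = TypedRequest("http://example.com/", data=b"a=1", request_type="HEAD")
```

Handing `head` to urllib.request.urlopen() should then issue a HEAD request, because the opener asks the request object for its method through get_method().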
5) And now for something completely different ;)). This is just an initial proposal and it needs refinement. Maybe I should post it to the python-ideas list?

One of the problems offered at the last Google SoC was decreasing the interpreter's startup time. The 'strace' command shows the following: the interpreter spends most of its startup time trying to find imported modules, and most of those attempts end with a 'not found' error because sys.path is so long. In future this time will increase, because setuptools adds many dirs to the search path using pth files (to manage installed modules and eggs). I propose to add something like the .so caching which modern *nix systems use to load shared libraries.

   a) Add a --build-modules-index option to the Python interpreter. When Python finds this option it scans all dirs in the path and builds a dictionary {module_name: module_path}. The dict is saved in an external file (only the top dir is saved for packages, and the file path for one-file modules). The same file also stores the mtime of all pth files and dirs from the path, and the path variable itself.

   b) When the interpreter starts up it scans all path dirs for pth files as usual. It then reads the cache file and compares the saved mtimes and path dirs with the current ones, to check whether new modules or search dirs have been added or old ones modified. If nothing has been modified, the cached data is used for fast module loading. If an imported module is not found in the cache, the interpreter falls back to the standard scheme.

It is also necessary to add some options to control the use of the cache, like --disable-cache, --clear-cache, disabling caching for some dirs, etc.

---
K.Danilov aka KoDer
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
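The module index from proposal 5 a) and the freshness check from 5 b) could look roughly like the sketch below. It is a simplified illustration under several assumptions: the function names are invented, only .py files and packages with an __init__.py are indexed (extension modules, zip files and pth files are ignored), and the index is kept in memory instead of being written to a cache file.

```python
import os


def build_module_index(paths):
    """Map top-level module names to the first path entry that provides them."""
    index = {}
    for entry in paths:
        try:
            names = sorted(os.listdir(entry or "."))
        except OSError:
            continue  # missing directory, zip file, etc.
        for name in names:
            full = os.path.join(entry, name)
            if name.endswith(".py"):
                module = name[:-3]
            elif os.path.isfile(os.path.join(full, "__init__.py")):
                module = name  # package: remember its top directory only
            else:
                continue
            # Earlier path entries win, as they do in a normal import.
            index.setdefault(module, full)
    return index


def snapshot_mtimes(paths):
    """Record directory mtimes so that a stale index can be detected."""
    return {p: os.stat(p).st_mtime for p in paths if os.path.isdir(p)}


def index_is_fresh(saved_mtimes, paths):
    """True if no indexed directory has been modified since the snapshot."""
    return saved_mtimes == snapshot_mtimes(paths)
```

A lookup in the index replaces the walk over every path entry. When index_is_fresh() returns False, or a name is missing from the index, the caller would fall back to the normal search, as the proposal describes.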
Re: [Python-Dev] urllib2 EP + decr. startup time
2007/2/16, Phillip J. Eby <[EMAIL PROTECTED]>:
> At 04:38 PM 2/16/2007 +0200, KoDer wrote:
> ...
>
> Also, are you aware that putting a zipped version of the standard library
> on sys.path already speeds up startup considerably? Python since 2.3
> automatically includes an appropriate entry in sys.path:
>

The zipped version has one weakness: you can't put .so (or dll) files inside it. On my system 19 of the 25 installed eggs add directories, not archives (because they contain dlls). But even without the egg directories sys.path is long:

['', 'C:\\Python25\\Scripts', 'C:\\WINDOWS\\system32\\python25.zip', 'C:\\Python25\\DLLs', 'C:\\Python25\\lib', 'C:\\Python25\\lib\\plat-win', ... 'C:\\Python25\\lib\\site-packages\\wx-2.8-msw-unicode']

len(sys.path) == 18 (without eggs)

That means about 18 / 2 = 9 'file not found' errors on average for every first import of a module. So improving setuptools will help, but it will not solve this problem.

--
K.Danilov aka KoDer
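The "18 / 2 = 9" estimate above can be checked for a concrete module with a sketch like the following. It assumes Python 3 and its importlib.machinery.PathFinder, and failed_probes is a name invented for this illustration. It counts path entries, not system calls, so the number of failed open/stat calls seen in strace would be higher, since several file name variants are tried per entry.

```python
import importlib.machinery
import sys


def failed_probes(module_name, paths=None):
    """Return how many path entries are searched in vain before a hit.

    Returns None if no entry provides the module.
    """
    paths = sys.path if paths is None else paths
    for misses, entry in enumerate(paths):
        # Ask the standard path finder about this single entry only.
        spec = importlib.machinery.PathFinder.find_spec(module_name, [entry])
        if spec is not None:
            return misses
    return None
```

Calling failed_probes("some_module") with no second argument walks the real sys.path of the running interpreter.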
Re: [Python-Dev] urllib2 EP + decr. startup time
> Right -- most of your problem will be solved by creating
> 'C:\\WINDOWS\\system32\\python25.zip', containing the contents of
> C:\\Python25\\lib\\.

C:\\Python25\\lib\\ contains *many* packages with .dll files, so I can't just zip it: wxPython, pyOpenGL, PIL, tk and so on. On Fedora 6 more than 40% of the dirs in /usr/lib/site-packages contain .so files. Some of them add dirs to the path (wx, PIL, Gtk, ...). yum, apt and others will be very angry if I zip the site-packages directory. I don't know of any package manager which can work properly with packages installed in an archive. Can setuptools work properly with packages packed into one big zip archive (I really don't know)?

And finally: if it's so easy, why hasn't it been done already? Python is widely used in many Linux distros, and I don't know of any which installs even the standard library as a zip archive. Most users can't do it themselves (because they don't know about Python at all). Or is this just for lack of time?

Yesterday I tested some programs with strace and got the following results:

command                num of sys_calls   num of FILE_NOT_FOUND
python -c "pass"       2807               619     ~20%
yum                    20263              11282   >50%
pychecker              6181               2527    ~40%
meld                   16075              8024    50%
ipython < exit.txt     16448              8957    >50%

(meld is a nice GUI merge util; exit.txt contains "exit()\n". The counts are taken after filtering out some of the FILE_NOT_FOUND errors which are not produced by the Python module search.)

BTW, the trace results contain many call chains like this:

open("/usr/lib/python2.4/site-packages/Durus-3.6-py2.4-linux-i686.egg", O_RDONLY|O_LARGEFILE) = 6
...
_llseek(6, 98304, [98304], SEEK_SET) = 0
read(6, "\340\377\224\322\373C\200\177.\245\367\205\0\307x\207\r"..., 4096) = 4096
_llseek(6, 102400, [102400], SEEK_SET) = 0
_llseek(6, 102400, [102400], SEEK_SET) = 0
_llseek(6, 102400, [102400], SEEK_SET) = 0
...

and so on. As I understand it, all the _llseek(6, 102400, [102400], SEEK_SET) = 0 calls after the first one are just heating the air.
--
K.Danilov aka KoDer
Re: [Python-Dev] urllib2 EP + decr. startup time
2007/2/17, Phillip J. Eby <[EMAIL PROTECTED]>:
>
> I don't follow you; this has nothing to do with setuptools. It's a feature
> of Python since version 2.3,
>

I mean installing, updating or deleting a package in an existing zip archive which may contain many other packages (sometimes it's hard to understand what I write in a language that is not my native one, sorry).

> but as far as I know nobody's ever set up the
> build machinery to create the necessary zipfile when Python is
> installed. Perhaps that would be a nice place to begin your patch: a
> script to create a stdlib zipfile in the platform-appropriate location,
> that can run after the bytecode compilation of the stdlib modules, or that
> users can run on older versions of Python to do the same thing.

I am already working on this ).

--
K.Danilov aka KoDer
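A script of the kind suggested in the quoted text might start from something like the sketch below. It is not the script mentioned in the reply. It assumes Python 3, the name zip_pure_python is invented, and it stores only .py sources (no precompiled bytecode). Compiled extension files are left out and reported, since they cannot be imported from a zip archive.

```python
import os
import zipfile

# Suffixes of compiled extension modules, which cannot be imported from a zip.
BINARY_SUFFIXES = (".so", ".pyd", ".dll")


def zip_pure_python(src_dir, zip_path):
    """Pack the .py files under src_dir into zip_path.

    Returns the relative paths of binary files that were left out.
    """
    skipped = []
    with zipfile.ZipFile(zip_path, "w") as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir)
                if name.endswith(".py"):
                    archive.write(full, rel)  # keep the package layout
                elif name.endswith(BINARY_SUFFIXES):
                    skipped.append(rel)
    return skipped
```

Once the archive path is on sys.path, the packed modules can be imported from it directly. The returned list shows which files still have to stay in an ordinary directory.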
Re: [Python-Dev] [ 1673007 ] urllib2 requests history + HEAD
> From: Facundo Batista <[EMAIL PROTECTED]>
> This patch was posted by "koder_ua".
> I think that Request must have a "request type" parameters, so people
> can send "HEAD" requests easily.
> But it seems to me that keeping a request history in the module is bad,
> because it can easily grow up to thousands and explode (a.k.a. consume
> too much memory).
> For example, I have a web service, running 7x24, and opening another web
> service, with around 10 requests per second. This means, keeping the
> history (around 50bytes each request), 1.2 GB of RAM in only a month!
> So, I'll close this patch as "Rejected", for this reason, if anyone
> raises objections.
> Regards,
> --
> . Facundo

This is probably a misunderstanding. The request history is not stored in the "module". It is stored in two places:

1) In the Request object (for the current request, so it is destroyed together with it);
2) In the HTTPConnection object (while the request is being redirected).

The HTTPConnection stores history only for the Request currently being served. Even if you use the same HTTPConnection for many Requests, it clears the history every time a new Request starts.

    # from httplib, HTTPConnection.putrequest (patched)
    str = '%s %s %s' % (method, url, self._http_vsn_str)
    self._output(str)
    self.sended_hdrs = [str]  # <<< previous history dies here

The full history of all processed requests is not stored anywhere.

---
KDanilov aka koder (aka koder_ua)
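The point about the history being reset for every request can be illustrated with a small subclass. This is a sketch only, not the code of patch 1673007. It assumes Python 3, where httplib became http.client, and the names HistoryConnection and sent_hdrs are invented here. No network connection is needed to run it, since putrequest() and putheader() only buffer data.

```python
import http.client


def _as_text(value):
    """Header values may arrive as bytes, str or int."""
    return value.decode("latin-1") if isinstance(value, bytes) else str(value)


class HistoryConnection(http.client.HTTPConnection):
    """HTTPConnection that remembers the headers of the current request only."""

    def putrequest(self, method, url, *args, **kwargs):
        # A new request starts: the previous history is dropped here,
        # so the list cannot grow from one request to the next.
        self.sent_hdrs = ["%s %s %s" % (method, url, self._http_vsn_str)]
        super().putrequest(method, url, *args, **kwargs)

    def putheader(self, header, *values):
        super().putheader(header, *values)
        self.sent_hdrs.append(
            "%s: %s" % (header, ", ".join(_as_text(v) for v in values)))
```

Because the list is rebuilt in putrequest(), its size is bounded by the number of headers of one request, however long the connection object lives.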