[Python-Dev] urllib2 EP + decr. startup time

2007-02-16 Thread KoDer
Hello to all.

For more than two years I have used urllib2 extensively to write
commercial applications (mostly for extracting data from web sites
into Excel sheets), and here are some enhancement proposals for it:

1) Add support for 'HEAD' requests (and maybe some other methods).
This needs only small changes:
   a) Add a request_type = 'GET' argument to the urllib2.Request class
   constructor.
   b) Pass the request_type value through as the HTTP request method,
   unless the Request has data, in which case the method changes to
   'POST'.
The result of a 'HEAD' request is the same as for a 'GET' request,
except that the body is empty.
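
The change in point 1 can be sketched as a Request subclass. This is a
minimal illustration, not the proposed patch: it is written against
Python 3's urllib.request (the successor of urllib2), and the class name
MethodRequest is hypothetical; only request_type comes from the proposal.

```python
import urllib.request


class MethodRequest(urllib.request.Request):
    """Request with an explicit method (hypothetical name, illustration only)."""

    def __init__(self, url, data=None, headers=None, request_type="GET"):
        super().__init__(url, data=data, headers=headers or {})
        self.request_type = request_type

    def get_method(self):
        # As in the proposal: a request that carries data is always a POST.
        if self.data is not None:
            return "POST"
        return self.request_type


head = MethodRequest("http://example.com/", request_type="HEAD")
```

Passing `head` to an opener would then send a HEAD request line; no
network access happens until the request is actually opened.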

2) HTTP keep-alive opener. An almost complete implementation can be found
in urlgrabber (http://linux.duke.edu/projects/urlgrabber), which is used
by yum, so it is tested well enough, I think. It uses the urllib2 opener
protocol and integrates well into the urllib2 structure. It needs only a
small change to properly support some headers.

3) Save the HTTP exchange history. Currently there is no suitable way to
obtain all sent and received headers. Received headers are saved only
for the last response in a redirection chain, and sent headers are not
saved at all. I use run-time patching of httplib to intercept the sent
and received data (maybe I missed something?). The proposal is to add a
'history' property to the object returned from urllib2.urlopen: a list
of objects that contain the sent/received headers for the whole
redirect chain.
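
One way to approximate such a history without patching httplib is a
handler that records every request and response passing through the
opener. The sketch below assumes Python 3's urllib.request; the names
HistoryHandler and history are hypothetical, and it records headers as
the opener sees them, not the exact bytes written to the socket.

```python
import urllib.request


class HistoryHandler(urllib.request.BaseHandler):
    """Record headers for every hop of a redirect chain (illustrative sketch)."""

    # After HTTPHandler (500), so headers it adds such as Host are visible;
    # before HTTPErrorProcessor (1000), so 3xx responses are seen too.
    handler_order = 600

    def __init__(self):
        self.history = []

    def http_request(self, request):
        self.history.append(
            ("sent", request.get_method(), request.full_url,
             list(request.header_items())))
        return request

    def http_response(self, request, response):
        self.history.append(
            ("received", response.code, request.full_url,
             list(response.headers.items())))
        return response

    https_request = http_request
    https_response = http_response


handler = HistoryHandler()
opener = urllib.request.build_opener(handler)  # no network access yet
```

After `opener.open(url)`, `handler.history` would hold one "sent" and
one "received" entry per hop. The list lives on the handler, so its
lifetime is under the caller's control.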

4) Add a way to obtain the underlying socket used to receive HTTP data.
Currently it is impossible to work with an HTTP connection in an
asynchronous manner (or am I missing something again?).
If the connection hangs, the whole program hangs too, and I don't know
a way to fix this.
Of course, if you obtain such a socket, you become responsible for
decompression and so on.
Currently I use the following code:
x = urllib2.urlopen(...)
sock = x.fp._sock.fp._sock
There is only one problem, as far as I know: chunked encoding. With
chunked encoding, a socket-like object has to be returned that does all
the work of assembling the chunks into the original stream. I have
already used such an object for two years and it works fine.
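
To make the chunked-encoding point concrete, here is a minimal sketch of
a file-like wrapper that reassembles a chunked body. It is not the
object described above: ChunkedReader is a hypothetical name, and the
sketch assumes the raw stream offers blocking read() and readline(), as
the result of socket.makefile("rb") does. Trailer headers are skipped.

```python
import io


class ChunkedReader:
    """Reassemble an HTTP/1.1 chunked body from a raw byte stream (sketch)."""

    def __init__(self, raw):
        self._raw = raw
        self._left = 0      # bytes remaining in the current chunk
        self._eof = False

    def _next_chunk(self):
        line = self._raw.readline()
        if not line:                      # stream ended prematurely
            self._eof = True
            return
        # Chunk size is hexadecimal; ignore any ";extension" suffix.
        self._left = int(line.split(b";", 1)[0].strip().decode("ascii"), 16)
        if self._left == 0:
            # Last chunk: skip optional trailer lines up to the blank line.
            while self._raw.readline() not in (b"\r\n", b"\n", b""):
                pass
            self._eof = True

    def read(self, n=-1):
        out = []
        while not self._eof and n != 0:
            if self._left == 0:
                self._next_chunk()
                continue
            want = self._left if n < 0 else min(n, self._left)
            data = self._raw.read(want)
            if not data:
                self._eof = True
                break
            out.append(data)
            self._left -= len(data)
            if n > 0:
                n -= len(data)
            if self._left == 0:
                self._raw.readline()      # CRLF that terminates the chunk data
        return b"".join(out)


body = ChunkedReader(io.BytesIO(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")).read()
```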

5) And now for something completely different ;)).

   This is just an initial proposal and it needs refinement.  Maybe I
should post it to the python-ideas list?

   In the last Google SoC, one of the problems to solve was decreasing
the interpreter's startup time.
   The 'strace' command shows the following: the interpreter spends most
of its startup time trying to find imported modules.
   Most of those lookups end with a 'not found' error, because sys.path
is so long.
   In the future this time will increase: setuptools adds many
directories to the search path using .pth files (to manage installed
modules and eggs).

   I propose adding something like the shared-library (.so) cache that
modern *nix systems use to load shared libraries.

   a) Add a --build-modules-index option to the Python interpreter. When
   Python sees this option, it scans all directories in the path and
   builds a dictionary {module_name: module_path}.
   The dictionary is saved to an external file (only the top-level
   directory is saved for packages, and the file path for single-file
   modules).
   The file also stores the mtime of every .pth file and directory on
   the path, together with the path variable itself.

   b) When the interpreter starts up, it scans all path directories for
   .pth files as usual, and uses the data saved in the cache file to
   check whether modules or search directories have been added, or old
   ones modified.
   It reads the cache file and compares the mtimes and path
   directories. If nothing has been modified, the cached data is used
   for fast module loading. If an imported module is not found in the
   cache, the interpreter falls back to the standard scheme.

   It is also necessary to add some options to control use of the
   cache, such as --disable-cache, --clear-cache, disabling caching for
   some directories, etc.
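
Steps a) and b) can be sketched in pure Python as follows. This is only
an illustration under simplifying assumptions: the function names are
hypothetical, only .py files and packages with an __init__.py are
indexed, .pth files are not tracked, and the cache is kept as a plain
dict rather than written to a file.

```python
import os


def _mtime(path):
    """Return the mtime of path, or None if it does not exist."""
    try:
        return os.stat(path).st_mtime
    except OSError:
        return None


def build_modules_index(paths):
    """Step a): scan every directory and map module names to locations."""
    modules = {}
    for directory in paths:
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            full = os.path.join(directory, name)
            if os.path.isfile(os.path.join(full, "__init__.py")):
                modules.setdefault(name, full)       # package: keep top dir
            elif name.endswith(".py") and os.path.isfile(full):
                modules.setdefault(name[:-3], full)  # single-file module
    return {
        "paths": list(paths),
        "mtimes": {d: _mtime(d) for d in paths},
        "modules": modules,
    }


def index_is_fresh(cache, paths):
    """Step b): the cache is valid only if the path and all mtimes match."""
    if cache["paths"] != list(paths):
        return False
    return all(_mtime(d) == m for d, m in cache["mtimes"].items())


def lookup(cache, name, paths):
    """Return the cached location, or None to fall back to the normal search."""
    if index_is_fresh(cache, paths):
        return cache["modules"].get(name)
    return None
```

Because setdefault keeps the first match, an earlier path entry shadows
a later one, which mirrors the order in which the path is searched.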
---
K.Danilov aka KoDer
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] urllib2 EP + decr. startup time

2007-02-16 Thread KoDer
2007/2/16, Phillip J. Eby <[EMAIL PROTECTED]>:
> At 04:38 PM 2/16/2007 +0200, KoDer wrote:
.
>
>
> Also, are you aware that putting a zipped version of the standard library
> on sys.path already speeds up startup considerably?  Python since 2.3
> automatically includes an appropriate entry in sys.path:
>

The zipped version has one weakness: you can't put .so (or .dll) files
inside it. On my system, 19 of the 25 installed eggs add directories,
not archives (because they contain DLLs).
But even without the egg directories, sys.path is:
['',
'C:\\Python25\\Scripts',
'C:\\WINDOWS\\system32\\python25.zip',
'C:\\Python25\\DLLs',
'C:\\Python25\\lib',
'C:\\Python25\\lib\\plat-win',
.
'C:\\Python25\\lib\\site-packages\\wx-2.8-msw-unicode']
len(sys.path) == 18 (without eggs), so there are about 18 / 2 = 9
'file not found' errors for every first import of a module.
So improving setuptools will help, but will not solve this problem.
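
The "18 / 2 = 9" estimate assumes a module sits, on average, halfway
down sys.path. For a specific module the number of entries that miss
can be counted directly. The sketch below uses Python 3's
importlib.machinery.PathFinder; the function name misses_before_hit is
hypothetical, and it counts path entries, not individual system calls
(each missed entry can cost several failed lookups).

```python
import sys
from importlib.machinery import PathFinder


def misses_before_hit(name, search_path=None):
    """Count path entries probed without success before `name` is found.

    Returns None if the module is not found on the path at all.
    """
    entries = sys.path if search_path is None else search_path
    for misses, entry in enumerate(entries):
        if PathFinder.find_spec(name, [entry]) is not None:
            return misses
    return None
```

For example, `misses_before_hit("json")` gives the number of sys.path
entries searched in vain before the standard library directory is
reached.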
-- 
K.Danilov aka KoDer


Re: [Python-Dev] urllib2 EP + decr. startup time

2007-02-17 Thread KoDer
> Right -- most of your problem will be solved by creating
> 'C:\\WINDOWS\\system32\\python25.zip', containing the contents of
> C:\\Python25\\lib\\.

C:\\Python25\\lib\\ contains *many* packages with .dll files, so I
can't just zip it: wxPython, pyOpenGL, PIL, tk and so on. On Fedora 6,
more than 40% of the directories in /usr/lib/site-packages contain .so
files. Some of them add directories to the path (wx, PIL, Gtk, ...).

yum, apt and others will be very angry if I zip the site-packages
directory. I don't know of any package manager that works properly with
packages installed in an archive.

Can setuptools work properly with packages packed into one big zip
archive (I really don't know)?

And finally, if it's so easy, why hasn't this been done already? Python
is widely used in many Linux distros, and I don't know of any that
installs even the standard library as a zip archive. Most users can't
do it themselves (because they don't know about Python at all).
Or is this just because of a lack of time?

Yesterday I tested some programs with strace and got the following results:
command                       num of sys_calls  num of FILE_NOT_FOUND  share
python -c "pass"                          2807                    619   ~20%
yum                                      20263                  11282   >50%
pychecker                                 6181                   2527   ~40%
meld (nice GUI merge util)               16075                   8024    50%
ipython < exit.txt                       16448                   8957   >50%
(exit.txt contains "exit()\n")
(after filtering out some of the FILE_NOT_FOUND results that are not
produced by Python's module search)
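
Counts like these can be reproduced from a saved strace log with a few
lines of Python. The sketch below is an assumption about how one might
do it, not the procedure used for the table: the function name is
hypothetical, the sample lines are made up, and it does not filter out
ENOENT results unrelated to module search.

```python
def count_enoent(strace_lines):
    """Return (total system calls, calls that failed with ENOENT)."""
    total = missing = 0
    for line in strace_lines:
        line = line.strip()
        # A completed call looks like: name(args) = result
        if "(" not in line or " = " not in line:
            continue
        total += 1
        result = line.rsplit(" = ", 1)[1]
        if result.startswith("-1 ENOENT"):
            missing += 1
    return total, missing


# Made-up sample in strace's output format.
sample = [
    'open("/tmp/example/foo.so", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'open("/tmp/example/foo.py", O_RDONLY) = 3',
    'close(3) = 0',
]
total, missing = count_enoent(sample)
```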

BTW, the trace results contain many call chains like this:

open("/usr/lib/python2.4/site-packages/Durus-3.6-py2.4-linux-i686.egg",
O_RDONLY|O_LARGEFILE) = 6
..
_llseek(6, 98304, [98304], SEEK_SET)= 0
read(6, "\340\377\224\322\373C\200\177.\245\367\205\0\307x\207\r"...,
4096) = 4096
_llseek(6, 102400, [102400], SEEK_SET)  = 0
_llseek(6, 102400, [102400], SEEK_SET)  = 0
_llseek(6, 102400, [102400], SEEK_SET)  = 0
.
and so on. As I understand it, all
_llseek(6, 102400, [102400], SEEK_SET)  = 0
calls after the first one are just heating the air.

-- 
K.Danilov aka KoDer


Re: [Python-Dev] urllib2 EP + decr. startup time

2007-02-18 Thread KoDer
2007/2/17, Phillip J. Eby <[EMAIL PROTECTED]>:
>
> I don't follow you; this has nothing to do with setuptools.  It's a feature
> of Python since version 2.3,
>

I meant installing, updating or deleting a package in an existing zip
archive that may contain many other packages (sometimes it is hard to
make myself understood when I am not writing in my native language,
sorry).

> but as far as I know nobody's ever set up the
> build machinery to create the necessary zipfile when Python is
> installed.  Perhaps that would be a nice place to begin your patch: a
> script to create a stdlib zipfile in the platform-appropriate location,
> that can run after the bytecode compilation of the stdlib modules, or that
> users can run on older versions of Python to do the same thing.

I am already working on this ).
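
A script of the kind suggested in the quoted text might start from
something like the sketch below. It is not the patch mentioned above:
the function name zip_pure_python and the suffix list are assumptions,
it stores .py sources rather than compiled bytecode, and it only
reports binary extension files instead of deciding what to do with the
packages that contain them.

```python
import os
import zipfile

BINARY_SUFFIXES = (".so", ".pyd", ".dll")  # assumed list of extension suffixes


def zip_pure_python(lib_dir, zip_path):
    """Pack the .py files under lib_dir into zip_path.

    Returns the relative paths of binary files that were left out,
    since those cannot be imported from inside a zip archive.
    """
    skipped = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(lib_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, lib_dir)
                if name.endswith(".py"):
                    zf.write(full, rel)
                elif name.endswith(BINARY_SUFFIXES):
                    skipped.append(rel)
    return skipped
```

Putting the resulting archive on sys.path makes its modules importable
through zipimport. A package that mixes .py files with extension
modules would still have to stay on disk as a directory.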
-- 
K.Danilov aka KoDer


Re: [Python-Dev] [ 1673007 ] urllib2 requests history + HEAD

2007-03-13 Thread KoDer
> From: Facundo Batista <[EMAIL PROTECTED]>
> This patch was posted by "koder_ua".

> I think that Request must have a "request type" parameters, so people
> can send "HEAD" requests easily.

> But it seems to me that keeping a request history in the module is bad,
> because it can easily grow up to thousands and explode (a.k.a. consume
> too much memory).

> For example, I have a web service, running 7x24, and opening another web
> service, with around 10 requests per second. This means, keeping the
> history (around 50bytes each request), 1.2 GB of RAM in only a month!

> So, I'll close this patch as "Rejected", for this reason, if anyone
> raises objections.

> Regards,
> --
> .   Facundo

This is probably a misunderstanding.
The request history is not stored in the "module". It is stored in two
places:

1) In the Request object (for the current request, so it is destroyed
with it);
2) In the HTTPConnection object (while the request is being
redirected). The HTTPConnection stores history only for the Request
currently being served. Even if you use the same HTTPConnection for
many Requests, it clears the history every time a new Request starts.

# from httplib HTTPConnection.putrequest patched
str = '%s %s %s' % (method, url, self._http_vsn_str)
self._output(str)
self.sended_hdrs = [str]  # <<< previous history dies here

The full history of all processed requests is not stored anywhere.
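
The same reset-per-request behaviour can be shown without patching the
library, by subclassing the connection class. This is a sketch against
Python 3's http.client (the successor of httplib), not the submitted
patch: RecordingConnection and sent_headers are hypothetical names, and
it relies on the private attribute _http_vsn_str, as the excerpt above
does.

```python
import http.client


class RecordingConnection(http.client.HTTPConnection):
    """Keep the sent headers of the current request only (sketch)."""

    def putrequest(self, method, url, *args, **kwargs):
        # Start a fresh list: the previous request's history is dropped
        # here, so memory use does not grow with the number of requests.
        self.sent_headers = ["%s %s %s" % (method, url, self._http_vsn_str)]
        super().putrequest(method, url, *args, **kwargs)

    def putheader(self, header, *values):
        text = ", ".join(
            v.decode("latin-1") if isinstance(v, bytes) else str(v)
            for v in values)
        self.sent_headers.append("%s: %s" % (header, text))
        super().putheader(header, *values)


conn = RecordingConnection("example.com")  # constructing does not connect
conn.putrequest("GET", "/a")               # buffered only, nothing is sent
first = list(conn.sent_headers)
```

At any moment the connection holds the headers of a single request, so
a long-running service making many requests keeps a bounded amount of
history.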
---
KDanilov aka koder(aka koder_ua)


Re: [Python-Dev] [ 1673007 ] urllib2 requests history + HEAD

2007-03-16 Thread K.Danilov aka koder

P.S. This message may be a duplicate: I sent the first copy from
gmail.com and, for reasons unknown to me, it did not reach the mailing
list.
---
KDanilov aka koder(aka koder_ua)