Re: [Python-Dev] Frame zombies

2007-06-10 Thread Martin v. Löwis
 Note that _only_ recursions will have more than 1 frame attached.

That's not true; in the presence of threads, the same method
may also be invoked more than one time simultaneously.
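
The point about threads can be illustrated with a minimal sketch. It assumes CPython, since `sys._getframe` is an implementation detail, and `worker` and `frames` are names made up for this example. Two threads run the same function at the same time, so two distinct frames share one code object without any recursion:

```python
import sys
import threading


def worker(barrier, frames, index):
    # Record this invocation's frame, then wait until the other thread
    # is inside the same function as well.
    frames[index] = sys._getframe()
    barrier.wait()


barrier = threading.Barrier(2)
frames = [None, None]
threads = [
    threading.Thread(target=worker, args=(barrier, frames, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Two live frames existed at once, both created from the same code object.
same_code = frames[0].f_code is frames[1].f_code
distinct_frames = frames[0] is not frames[1]
```

The barrier only makes sure that both invocations are active simultaneously; it is not part of anything discussed in the thread.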

 But removing freelist altogether will not work well with any type of
 recursion.

How do you know that? Did you measure the time? On what system?
What were the results?

Performance optimization without measuring is just unacceptable.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Martin v. Löwis
 sys.setdefaultencoding()
 exists for a reason, wouldn't it be better if stdlib could cope with
 that at least with zipfile?

sys.setdefaultencoding just does not work. Many more things break when
you call it. It only exists because people like you insisted that it
exists.

 Also note that I'm trying to ask if zipfile should be improved, how it
 should be improved, and this possible improvement is not even for me
 (because now I know how zipfile behaves and I will work correctly with
 it, but someone else might stumble upon this very unexpectedly).

If you want to come up with a patch: sure. The zipfile module should
handle Unicode strings, encoding them in the encoding that the ZIP
specification defines (both the formal one, and the
informal-defined-by-pkwares-implementation).

The tricky question is what to do when reading in zipfiles with
non-ASCII characters (and yes, I understand that in your case
there were only ASCII characters in the file names).

 The problem was that sourcedir was unicode, and on my machine
 everything went ok multiple times. zipfile.ZipInfo.FileHeader would
 return unicode, but then when it writes it to a file it gets back to
 str (because mappings back and forth were identical). The problem
 happened when on a different machine header suddenly got byte 0x98 in
 position 10 (seems to be compress_size), which cp1251 codec couldn't
 decode. You see, arcname didn't even have unicode characters, but the
 mere fact that it was unicode made header upgrade to unicode in
 return header + self.filename + self.extra.

Ok, now I understand. If filename is a Unicode string, header is
converted using the system encoding; depending on the exact value
of header and depending on the system encoding, this may cause
a decoding error.

This bug has been reported as http://bugs.python.org/1170311
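
A minimal sketch of this failure mode, written for Python 3, where the implicit str-to-unicode promotion of Python 2 has to be spelled out. The two-field header and the `promote` helper are hypothetical simplifications, not the real ZIP layout or zipfile code:

```python
import struct

# Hypothetical, heavily simplified header: a signature plus one 16-bit field.
SIGNATURE = b"PK\x03\x04"


def promote(header_bytes, default_encoding):
    """Mimic Python 2 decoding a byte string before joining it to unicode."""
    return header_bytes.decode(default_encoding)


# A field value that happens to contain byte 0x98, which cp1251 leaves undefined.
bad_header = struct.pack("<4sH", SIGNATURE, 0x98)
try:
    promote(bad_header, "cp1251")
    failed = False
except UnicodeDecodeError:
    failed = True

# The same header with a different field value decodes without complaint,
# which is why the problem only shows up for some archives.
harmless = promote(struct.pack("<4sH", SIGNATURE, 0x41), "cp1251")
```

Whether the concatenation works therefore depends on the binary header contents and on the default encoding, not on the file name itself.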

 Because that's not supposed to work sanely when self.filename is
 unicode I'm asking if the right behavior would be to a) disallow
 unicode filenames in zipfile.ZipInfo, b) automatically convert
 filename to str in zipfile.ZipInfo, c) leave everything as it is.

The correct behavior would be b); the difficult details are what
encoding to use.

Regards,
Martin


Re: [Python-Dev] Fwd: Instance variable access and descriptors

2007-06-10 Thread Gustavo Carneiro

 I have to agree with you.  If removing support for
self.__dict__['propertyname'] (where propertyname is also the name of a
descriptor) is the price to pay for significant speedup, so be it.  People
doing that are asking for trouble anyway!

On 10/06/07, Eyal Lotem [EMAIL PROTECTED] wrote:


On 6/10/07, Phillip J. Eby [EMAIL PROTECTED] wrote:
 At 12:23 AM 6/10/2007 +0300, Eyal Lotem wrote:
 A. It will break code that uses instance.__dict__['var'] directly,
 when 'var' exists as a property with a __set__ in the class. I believe
 this is not significant.
 B. It will simplify getattr's semantics. Python should _always_ give
 precedence to instance attributes over class ones, rather than have
 very weird special-cases (such as a property with a __set__).

 Actually, these are features that are both used and desirable; I've
 been using them both since Python 2.2 (i.e., for many years
 now).  I'm -1 on removing these features from any version of Python,
even 3.0.

It is the same feature, actually, two sides of the same coin.
Why do you use self.__dict__['propertyname'] when you can use
self._propertyname?
Why even call the first form, which is both longer and causes
performance problems, a feature?


 C. It will greatly speed up instance variable access, especially when
 the class has a large mro.

 ...at the cost of slowing down access to properties and __slots__, by
 adding an *extra* dictionary lookup there.
It will slow down access to properties - but that slowdown is
insignificant:
A. The vast majority of lookups are *NOT* of properties. They are the
rare case and should not be the focus of optimization.
B. Property access involves calling Python functions - which is
heavier than a single dict lookup.
C. The dict lookup to find the property in the __mro__ can involve
many dicts (so in those cases adding a single dict lookup is not
heavy).

 Note, by the way, that if you want to change attribute lookup
 semantics, you can always override __getattribute__ and make it work
 whatever way you like, without forcing everybody else to change *their*
code.
If I write my own __getattribute__ I lose the performance benefit that
I am after.
I do agree that code shouldn't be broken, that's why a transitional
that requires using __fastlookup__ can be used (Unfortunately, from
__future__ cannot be used as it is not local to a module, but to a
class hierarchy - unless one imports a feature from __future__ into a
class).

--
Gustavo J. A. M. Carneiro
INESC Porto
The universe is always one step beyond logic. -- Frank Herbert


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Alexey Borzenkov
  Also note that I'm trying to ask if zipfile should be improved, how it
  should be improved, and this possible improvement is not even for me
  (because now I know how zipfile behaves and I will work correctly with
  it, but someone else might stumble upon this very unexpectedly).
 If you want to come up with a patch: sure. The zipfile module should
 handle Unicode strings, encoding them in the encoding that the ZIP
 specification defines (both the formal one, and the
 informal-defined-by-pkwares-implementation).

I don't think always encoding them to utf-8 (and using bit 11 of
flag_bits) is a good idea, since there's a chance to create archives
that won't be correctly readable by programs not supporting this bit
(it's no secret that currently some programs just assume that
filenames are encoded using one of system encodings). This is too
complex and hazy to implement. Even if I know what is the situation on
Windows (i.e. using OEM, also called DOS encoding, but I'm not sure
how to determine its codec name from within python apart from calling
GetConsoleCP), I'm totally unaware of the situation on other operating
systems.

 The tricky question is what to do when reading in zipfiles with
 non-ASCII characters (and yes, I understand that in your case
 there were only ASCII characters in the file names).

I don't think it should be changed.

 Ok, now I understand. If filename is a Unicode string, header is
 converted using the system encoding; depending on the exact value
 of header and depending on the system encoding, this may cause
 a decoding error.

 This bug has been reported as http://bugs.python.org/1170311

I see. Well, that's all easier now then, as I can just create a patch
for an already existing bug.

  Because that's not supposed to work sanely when self.filename is
  unicode I'm asking if the right behavior would be to a) disallow
  unicode filenames in zipfile.ZipInfo, b) automatically convert
  filename to str in zipfile.ZipInfo, c) leave everything as it is.
 The correct behavior would be b); the difficult details are what
 encoding to use.

Current zipfile seems to officially support ascii filenames only
anyway, so the patch can be as simple as this:

Index: Lib/zipfile.py
===================================================================
--- Lib/zipfile.py  (revision 55850)
+++ Lib/zipfile.py  (working copy)
@@ -252,12 +252,13 @@
             self.extract_version = max(45, self.extract_version)
             self.create_version = max(45, self.extract_version)

+        filename = str(self.filename)
         header = struct.pack(structFileHeader, stringFileHeader,
                  self.extract_version, self.reserved, self.flag_bits,
                  self.compress_type, dostime, dosdate, CRC,
                  compress_size, file_size,
-                 len(self.filename), len(extra))
-        return header + self.filename + extra
+                 len(filename), len(extra))
+        return header + filename + extra

     def _decodeExtra(self):
         # Try to decode the extra field.

This doesn't introduce new features; it just requires filenames to be
encodable as ascii (or whatever the default encoding is).


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Alexey Borzenkov
 Current zipfile seems to officially support ascii filenames only
 anyway, so the patch can be as simple as this:

Submitted patch and test case as http://python.org/sf/1734346


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Martin v. Löwis
 I don't think always encoding them to utf-8 (and using bit 11 of
 flag_bits) is a good idea, since there's a chance to create archives
 that won't be correctly readable by programs not supporting this bit
 (it's no secret that currently some programs just assume that
 filenames are encoded using one of system encodings).

I think it is also fairly uniformly agreed that these programs are
incorrect; the official encoding of file names in a zip file is
Windows/DOS code page 437.

 This is too
 complex and hazy to implement. Even if I know what is the situation on
 Windows (i.e. using OEM, also called DOS encoding, but I'm not sure
 how to determine its codec name from within python apart from calling
 GetConsoleCP), I'm totally unaware of the situation on other operating
 systems.

I don't think that the situation on Windows is that the OEM code page
should be used. Instead, CP 437 should be used, independent of the OEM
code page.

 The tricky question is what to do when reading in zipfiles with
 non-ASCII characters (and yes, I understand that in your case
 there were only ASCII characters in the file names).
 
 I don't think it should be changed.

In Python 3, it will certainly change, since the string type
will be unicode-based. It probably should not change for the
rest of 2.x.

 Current zipfile seems to officially support ascii filenames only
 anyway

That's not true. You can use any byte string as the file name
that you want, including non-ASCII strings encoded in CP437.

 +filename = str(self.filename)

That would be incorrect, as it relies on the system encoding,
which shouldn't be relied upon. Plus, it would allow arbitrary
non-string things as filenames. What it should do instead
(IMO) is to encode in CP437. Bonus points if it falls back
to the UTF-8 feature of zip files if encoding as CP437 fails.
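
A rough sketch of that suggestion, in Python 3 syntax. The function name `encode_zip_filename` and the constant `UTF8_FLAG` are made up for illustration, and this is not the code that ended up in zipfile:

```python
UTF8_FLAG = 0x800  # bit 11 of the general purpose flag bits


def encode_zip_filename(name, flag_bits=0):
    """Encode a text file name as CP437, falling back to UTF-8.

    Returns the encoded bytes together with the (possibly updated) flag bits.
    """
    try:
        return name.encode("cp437"), flag_bits
    except UnicodeEncodeError:
        # The name has characters outside CP437: use UTF-8 and say so.
        return name.encode("utf-8"), flag_bits | UTF8_FLAG


ascii_case = encode_zip_filename("foo.txt")
cp437_case = encode_zip_filename("caf\u00e9.txt")             # e-acute exists in CP437
utf8_case = encode_zip_filename("\u043f\u0440\u0438\u0432\u0435\u0442.txt")  # Cyrillic does not
```

Non-text arguments are not handled here; passing one simply raises `AttributeError`, which at least does not pass silently.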

Regards,
Martin


Re: [Python-Dev] Fwd: Instance variable access and descriptors

2007-06-10 Thread Phillip J. Eby
At 04:14 AM 6/10/2007 +0300, Eyal Lotem wrote:
On 6/10/07, Phillip J. Eby [EMAIL PROTECTED] wrote:
  At 12:23 AM 6/10/2007 +0300, Eyal Lotem wrote:
  A. It will break code that uses instance.__dict__['var'] directly,
  when 'var' exists as a property with a __set__ in the class. I believe
  this is not significant.
  B. It will simplify getattr's semantics. Python should _always_ give
  precedence to instance attributes over class ones, rather than have
  very weird special-cases (such as a property with a __set__).
 
  Actually, these are features that are both used and desirable; I've
  been using them both since Python 2.2 (i.e., for many years
  now).  I'm -1 on removing these features from any version of 
 Python, even 3.0.

It is the same feature, actually, two sides of the same coin.
Why do you use self.__dict__['propertyname'] when you can use
self._propertyname?

Because I'm *not writing this by hand*.  I'm using descriptors that 
know what attribute name they're responsible for, and do the access directly.
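
For readers unfamiliar with the pattern, a minimal sketch of such a descriptor follows. The class names `Typed` and `Point` are hypothetical and are not taken from any code mentioned in the thread:

```python
class Typed(object):
    """Data descriptor that stores its value in the instance __dict__
    under the very name it is bound to in the class."""

    def __init__(self, name, kind):
        self.name = name
        self.kind = kind

    def __get__(self, inst, owner=None):
        if inst is None:
            return self
        try:
            # Direct access to the instance dict, bypassing attribute lookup.
            return inst.__dict__[self.name]
        except KeyError:
            raise AttributeError(self.name)

    def __set__(self, inst, value):
        if not isinstance(value, self.kind):
            raise TypeError("%s must be %s" % (self.name, self.kind.__name__))
        inst.__dict__[self.name] = value


class Point(object):
    x = Typed("x", int)


p = Point()
p.x = 3
```

This only works because a descriptor with `__set__` takes precedence over the instance `__dict__` entry of the same name; without that rule, the stored value would shadow the descriptor after the first assignment.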


Why even call the first form, which is both longer and causes
performance problems, a feature?

If you don't understand that, IMO you don't yet understand enough 
about the descriptor architecture to be proposing changes to it.


  Note, by the way, that if you want to change attribute lookup
  semantics, you can always override __getattribute__ and make it work
  whatever way you like, without forcing everybody else to change 
 *their* code.
If I write my own __getattribute__ I lose the performance benefit that
I am after.

Not if you write it in C.


I do agree that code shouldn't be broken, that's why a transitional
that requires using __fastlookup__ can be used (Unfortunately, from
__future__ cannot be used as it is not local to a module, but to a
class hierarchy - unless one imports a feature from __future__ into a
class).

I have no idea what you're talking about here.



Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Alexey Borzenkov
On 6/10/07, Martin v. Löwis [EMAIL PROTECTED] wrote:
  I don't think always encoding them to utf-8 (and using bit 11 of
  flag_bits) is a good idea, since there's a chance to create archives
  that won't be correctly readable by programs not supporting this bit
  (it's no secret that currently some programs just assume that
  filenames are encoded using one of system encodings).
 I think it is also fairly uniformly agreed that these programs are
 incorrect; the official encoding of file names in a zip file is
 Windows/DOS code page 437.

Before replying to you I actually did some quick tests. I packed a
file with localized filename and then opened it using explorer and
also viewed it using the hexeditor:

   7-Zip: directory cp866, header cp866: explorer sees correct filename.
   zipfile: directory cp1251, header cp1251: explorer sees incorrect filename.
   pkzip25.exe: directory cp866, header cp1251: explorer sees correct
filenames, zipfile complains that filenames differ.
   zip.exe: directory cp1251, header cp1251: explorer sees incorrect filenames.

Also note that modifying the filename in the directory with a hex editor
to cp866 made explorer see correct filenames. Another experiment with
pkzip25 showed that modifying the filename in the directory makes it
extract files with that filename, i.e. it ignores the header filename.
The same behavior is shown by 7-Zip.

So the general idea is that at least directory filename has some sort
of convention of using oem (dos, console) encoding on Windows, cp866
in my case. Header filenames have different encodings, and seem to be
ignored.
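
The two places where a name is stored can be inspected without a hex editor. The following is a small sketch in Python 3; the helper `raw_names` is hypothetical, it assumes a single-member archive with no leading data, and the byte offsets follow the fixed-size record layouts of the ZIP format:

```python
import io
import struct
import zipfile


def raw_names(data):
    """Return (local header name, central directory name) as raw bytes."""
    # Local file header: 30 fixed bytes, name length at offset 26.
    (local_len,) = struct.unpack_from("<H", data, 26)
    local_name = data[30:30 + local_len]

    # Central directory record: 46 fixed bytes, name length at offset 28.
    cd = data.find(b"PK\x01\x02")
    (central_len,) = struct.unpack_from("<H", data, cd + 28)
    central_name = data[cd + 46:cd + 46 + central_len]
    return local_name, central_name


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("foo.txt", "hello")

local_name, central_name = raw_names(buf.getvalue())
```

Searching for the signature with `find` is a shortcut that is good enough for this tiny archive; a real reader would locate the central directory through the end-of-central-directory record.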

 I don't think that the situation on Windows is that the OEM code page
 should be used. Instead, CP 437 should be used, independent of the OEM
 code page.

And on the contrary, pkzip25 made by PKWARE Inc. themselves behaves otherwise.

  +filename = str(self.filename)
 That would be incorrect, as it relies on the system encoding,
 which shouldn't be relied upon.

Well, as I've seen in numerous examples above, system (or actually
dos) encoding is what is used by at least three major
programs: 7-zip, pkzip25 and explorer, at least on windows.

 Plus, it would allow arbitrary
 non-string things as filenames.

Hmm... why is that bad?

 What it should do instead
 (IMO) is to encode in CP437. Bonus points if it falls back
 to the UTF-8 feature of zip files if encoding as CP437 fails.

And encoding to cp437 would be incorrect, as no currently existing
program would work correctly with it on non-English Windows OSes. I
think that letting the user decide on the encoding is the right way to
go here, as you can't know what the user actually wants these days; it's
all too hazy to me. And in case unicode is passed, it just converts it
using the ascii (or default) codec. One can specify the ascii codec
there explicitly, if using the system encoding is really an issue.


Re: [Python-Dev] Fwd: Instance variable access and descriptors

2007-06-10 Thread Phillip J. Eby
At 11:27 AM 6/10/2007 +0100, Gustavo Carneiro wrote:
   I have to agree with you.  If removing support for 
 self.__dict__['propertyname'] (where propertyname is also the name 
 of a descriptor) is the price to pay for significant speedup, so be 
 it.  People doing that are asking for trouble anyway!

How so?  This order of lookup is explicitly defined by the precedence 
rules of PEP 252:

When a dynamic attribute (one defined in a regular object's
__dict__) has the same name as a static attribute (one defined
by a meta-object in the inheritance graph rooted at the regular
object's __class__), the static attribute has precedence if it
is a descriptor that defines a __set__ method (see below);
otherwise (if there is no __set__ method) the dynamic attribute
has precedence.  In other words, for data attributes (those
with a __set__ method), the static definition overrides the
dynamic definition, but for other attributes, dynamic overrides
static.

I fail to see how relying on explicitly-documented language behavior 
is asking for trouble. 
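
The quoted rule is easy to check with a short example. The class names below are made up for illustration:

```python
class DataDescr(object):
    """Descriptor with __set__: a 'data attribute' in PEP 252 terms."""

    def __get__(self, inst, owner=None):
        return "from data descriptor"

    def __set__(self, inst, value):
        raise AttributeError("read-only")


class NonDataDescr(object):
    """Descriptor without __set__."""

    def __get__(self, inst, owner=None):
        return "from non-data descriptor"


class C(object):
    data = DataDescr()
    nondata = NonDataDescr()


obj = C()
# Put same-named entries straight into the instance dict.
obj.__dict__["data"] = "from instance dict"
obj.__dict__["nondata"] = "from instance dict"

static_wins = obj.data       # the data descriptor overrides the instance dict
dynamic_wins = obj.nondata   # the instance dict overrides the non-data descriptor
```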



Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Martin v. Löwis
 So the general idea is that at least directory filename has some sort
 of convention of using oem (dos, console) encoding on Windows, cp866
 in my case. Header filenames have different encodings, and seem to be
 ignored.

Ok, then this is what the zipfile module should implement.

 That would be incorrect, as it relies on the system encoding,
 which shouldn't be relied upon.
 
 Well, as I've seen in numerous examples above, system (or actually
 dos) encoding is actually what is used by at least by three major
 programs: 7-zip, pkzip25 and explorer, at least on windows.

Please don't confuse Python's system encoding with the system's
(or user's) standard encoding - they are not related at all. Using
the OEM code page if everybody else does it is fine. Using the
encoding that somebody hand-coded into the Python installation
is not.
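
The two notions can be queried separately. A small sketch follows; the helper name `describe_encodings` is hypothetical. Note that in Python 3 the first value is always utf-8, whereas in the Python 2 discussed here it was ascii unless a site customization changed it:

```python
import locale
import sys


def describe_encodings():
    """Return Python's own default encoding and the user's preferred one."""
    return {
        # Used by Python for implicit str/unicode conversions (Python 2).
        "python_default": sys.getdefaultencoding(),
        # Derived from the user's locale settings; unrelated to the above.
        "user_preferred": locale.getpreferredencoding(False),
    }


encodings = describe_encodings()
```

Neither value is the Windows OEM code page, which has to be obtained from the operating system.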

 Plus, it would allow arbitrary
 non-string things as filenames.
 
 Hmm... why is that bad?

Errors should never pass silently.

 What it should do instead
 (IMO) is to encode in CP437. Bonus points if it falls back
 to the UTF-8 feature of zip files if encoding as CP437 fails.
 
 And encoding to cp437 would be incorrect, as no currently existing
 program would correctly work on non-english Windows OSes. I think that
 letting the user deciding on the encoding is the right way to go here,
 as you can't know what user actually wants these days, it's all too
 hazy to me. 

Asking the user is not practical. If the user was aware of the
problem, you would not have run into the problem in the first place -
you would have known to encode all file names before passing them
into the zipfile module.

The automatic mode should follow the standard or the conventions;
the user (in quotes, because the end user is rarely bothered
with that detail) can still override that explicitly.

Regards,
Martin


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Alexey Borzenkov
On 6/10/07, Martin v. Löwis [EMAIL PROTECTED] wrote:
  So the general idea is that at least directory filename has some sort
  of convention of using oem (dos, console) encoding on Windows, cp866
  in my case. Header filenames have different encodings, and seem to be
  ignored.
 Ok, then this is what the zipfile module should implement.

But this is only on Windows! I have no clue what the common
situation is on other OSes, and I don't even know how to sanely get the
OEM codepage on Windows (the obvious way with ctypes.kernel32.GetOEMCP()
doesn't seem good to me).

So I guess that's a bad idea anyway; maybe conforming to the language
bit is better (ascii will stay ascii anyway).

What about this?

Index: Lib/zipfile.py
===================================================================
--- Lib/zipfile.py  (revision 55850)
+++ Lib/zipfile.py  (working copy)
@@ -252,6 +252,7 @@
             self.extract_version = max(45, self.extract_version)
             self.create_version = max(45, self.extract_version)

+        self._encodeFilename()
         header = struct.pack(structFileHeader, stringFileHeader,
                  self.extract_version, self.reserved, self.flag_bits,
                  self.compress_type, dostime, dosdate, CRC,
@@ -259,6 +260,16 @@
                  len(self.filename), len(extra))
         return header + self.filename + extra

+    def _encodeFilename(self):
+        if isinstance(self.filename, unicode):
+            self.filename = self.filename.encode('utf-8')
+            self.flag_bits = self.flag_bits | 0x800
+
+    def _decodeFilename(self):
+        if self.flag_bits & 0x800:
+            self.filename = self.filename.decode('utf-8')
+            self.flag_bits = self.flag_bits & ~0x800
+
     def _decodeExtra(self):
         # Try to decode the extra field.
         extra = self.extra
@@ -683,6 +694,7 @@
                                      t>>11, (t>>5)&0x3F, (t&0x1F) * 2 )

             x._decodeExtra()
+            x._decodeFilename()
             x.header_offset = x.header_offset + concat
             self.filelist.append(x)
             self.NameToInfo[x.filename] = x
@@ -967,6 +979,7 @@
                     extract_version = zinfo.extract_version
                     create_version = zinfo.create_version

+                zinfo._encodeFilename()
                 centdir = struct.pack(structCentralDir,
                   stringCentralDir, create_version,
                   zinfo.create_system, extract_version, zinfo.reserved,
Index: Lib/test/test_zipfile.py
===================================================================
--- Lib/test/test_zipfile.py    (revision 55850)
+++ Lib/test/test_zipfile.py    (working copy)
@@ -515,6 +515,11 @@
         # and report that the first file in the archive was corrupt.
         self.assertRaises(RuntimeError, zipf.testzip)

+    def testUnicodeFilenames(self):
+        zf = zipfile.ZipFile(TESTFN, "w")
+        zf.writestr(u"foo.txt", "Test for unicode filename")
+        zf.close()
+
     def tearDown(self):
         support.unlink(TESTFN)
         support.unlink(TESTFN2)

The problem is that I don't know if anything actually supports bit 11
at the time and can't even tell if I did this correctly or not. :(


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Martin v. Löwis
 But this is only on Windows! I have no clue what's the common
 situation on other OSes and don't even know how to sanely get OEM
 codepage on Windows (the obvious way with ctypes.kernel32.GetOEMCP()
 doesn't seem good to me).
 
 So I guess that's bad idea anyway, maybe conforming to language bit is
 better (ascii will stay ascii anyway).
 
 What about this?

I haven't checked (*) whether you got the right value for flag_bits;
assuming you do, this looks good.

For compatibility, I would propose to use UTF-8 only if the file
name is not ASCII. Even though the OEM code pages vary, they
are (mostly) ASCII supersets. So if the string can be encoded
in ASCII, there is no need to set the UTF-8 flag bit.

OTOH, I now wonder whether it would *hurt* to have the flag bit:
if old zip software does not choke if the flag is set, then
it can just as well be set, as ASCII strings automatically
get encoded as ASCII in UTF-8.
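
A quick way to see why the ASCII case is safe either way, followed by a sketch of the "UTF-8 only when needed" rule. The code is Python 3 and the function name `encode_name` is made up for illustration:

```python
# An ASCII-only name yields the same bytes under ASCII, UTF-8 and the
# OEM/ANSI code pages mentioned in this thread.
codecs_to_try = ("ascii", "utf-8", "cp437", "cp866", "cp1251")
encodings_of_ascii_name = {"foo.txt".encode(codec) for codec in codecs_to_try}


def encode_name(name, flag_bits=0):
    """Keep ASCII names untouched; use UTF-8 plus bit 11 for anything else."""
    try:
        return name.encode("ascii"), flag_bits
    except UnicodeEncodeError:
        return name.encode("utf-8"), flag_bits | 0x800


plain = encode_name("foo.txt")
flagged = encode_name("\u0444\u0430\u0439\u043b.txt")
```

Archives that contain only ASCII names then stay byte-for-byte identical to what older tools expect, which is the compatibility argument made above.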

Regards,
Martin

(*) I just now read

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

and 0x800 seems to be the right value indeed. Notice, in
appendix D, that the specification says that the historical
encoding of file names is code page 437.
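
The value 0x800 is simply bit 11 (`1 << 11`). As an observation that goes beyond this thread, the zipfile module in current CPython 3 releases appears to follow the scheme discussed here: it sets that bit only for names that are not ASCII. A short check, using an in-memory archive:

```python
import io
import zipfile

UTF8_FLAG = 1 << 11  # 0x800, the "language encoding" bit

ascii_name = "foo.txt"
cyrillic_name = "\u043f\u0440\u0438\u0432\u0435\u0442.txt"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(ascii_name, "ascii name")
    zf.writestr(cyrillic_name, "non-ascii name")

# Re-open the archive and look at the flag bits stored for each member.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    utf8_flag_set = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in zf.infolist()
    }
```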


Re: [Python-Dev] Instance variable access and descriptors

2007-06-10 Thread Aahz
On Sun, Jun 10, 2007, Eyal Lotem wrote:
 
 Python, probably through the valid assumption that most attribute
 lookups go to the class, tries to look for the attribute in the class
 first, and in the instance, second.
 
 What Python currently does is quite peculiar!
 Here's a short description of PyObject_GenericGetAttr:
 
 A. Python looks for a descriptor in the _entire_ mro hierarchy
 (len(mro) class/type check and dict lookups).
 B. If Python found a descriptor and it has both get and set functions
 - it uses it to get the value and returns, skipping the next stage.
 C. If Python either did not find a descriptor, or found one that has
 no setter, it will try a lookup in the instance dict.
 D. If Python failed to find it in the instance, it will use the
 descriptor's getter, and if it has no getter it will use the
 descriptor itself.

Guido, Ping, and I tried working on this at the sprint for PyCon 2003.
We were unable to find any solution that did not affect critical-path
timing.  As other people have noted, the current semantics cannot be
changed.  I'll also echo other people and suggest that this discussion be
moved to python-ideas if you want to continue pushing for a change in
semantics.

I just did a Google for my notes from PyCon 2003 and it appears that I
never sent them out (probably because they aren't particularly
comprehensible).  Here they are for the record (from 3/25/2003):

'''
CACHE_ATTR is the name used to describe a speedup (for new-style classes
only) in attribute lookup by caching the location of attributes in the
MRO.  Some of the non-obvious bits of code:

* If a new-style class has any classic classes in its bases, we
can't do attribute caching (we need weakrefs to the derived
classes).

* If searching the MRO for an attribute discovers a data descriptor (has
tp_descr_set), that overrides any attribute that might be in the instance;
however, the existence of tp_descr_get still permits the instance to
override its bases (but tp_descr_get is called if there is no instance
attribute).

* We need to invalidate the cache for the updated attribute in all derived
classes in the following cases:

* an attribute is added or deleted to the class or its base classes

* an attribute has its status changed to or from being a data
descriptor

This file uses Python pseudocode to describe changes necessary to
implement CACHE_ATTR at the C level.  Except for class Meta, these are
all exact descriptions of the work being done.  Except for class Meta the
changes go into object.c (Meta goes into typeobject.c).  The pseudocode
looks somewhat C-like to ease the transformation.
'''

NULL = object()

def getattr(inst, name):
    isdata, where = lookup(inst.__class__, name)
    if isdata:
        descr = where[name]
        if hasattr(descr, "__get__"):
            return descr.__get__(inst)
        else:
            return descr
    value = inst.__dict__.get(name, NULL)
    if value != NULL:
        return value
    if where == NULL:
        raise AttributeError
    descr = where[name]
    if hasattr(descr, "__get__"):
        value = descr.__get__(inst)
    else:
        value = descr
    return value

def setattr(inst, name, value):
    isdata, where = lookup(inst.__class__, name)
    if isdata:
        descr = where[name]
        descr.__set__(inst, value)
        return
    inst.__dict__[name] = value

def lookup(cls, name):
    if cls.__cache__ != NULL:
        pair = cls.__cache__.get(name)
    else:
        pair = NULL
    if pair:
        return pair
    else:
        for c in cls.__mro__:
            where = c.__dict__
            if name in where:
                descr = where[name]
                isdata = hasattr(descr, "__set__")
                pair = isdata, where
                break
        else:
            pair = False, NULL
        if cls.__cache__ != NULL:
            cls.__cache__[name] = pair
        return pair


'''
These changes go into typeobject.c; they are not a complete
description of what happens during creation/updates, only the
changes necessary to implement CACHE_ATTR.
'''

from types import ClassType

class Meta(type):
    def _invalidate(cls, name):
        if name in cls.__cache__:
            del cls.__cache__[name]
        for c in cls.__subclasses__():
            if name not in c.__dict__:
                self._invalidate(c, name)
    def _build_cache(cls, bases):
        for base in bases:
            if type(base.__class__) is ClassType:
                cls.__cache__ = NULL
                break
        else:
            cls.__cache__ = {}
    def __new__ (cls, bases):
        self._build_cache(cls, bases)
    def __setbases__(cls, bases):
        self._build_cache(cls, bases)
    def __setattr__(cls, name, value):
        if cls.__cache__ != NULL:
            old = cls.__dict__.get(name, NULL)
            wasdata = old != NULL and hasattr(old, "__set__")
            isdata = value != NULL and hasattr(value, "__set__")
            if wasdata != isdata or (old == NULL) 

Re: [Python-Dev] Frame zombies

2007-06-10 Thread Eyal Lotem
On 6/10/07, Martin v. Löwis [EMAIL PROTECTED] wrote:
  Note that _only_ recursions will have more than 1 frame attached.

 That's not true; in the presence of threads, the same method
 may also be invoked more than one time simultaneously.

Yes, I have missed that, and realized that I missed that myself a bit later.
I guess I can rationalize that with the fact that I myself tend to
avoid threads.

  But removing freelist altogether will not work well with any type of
  recursion.

 How do you know that? Did you measure the time? On what system?
 What were the results?

 Performance optimization without measuring is just unacceptable.

I agree, I may have used the wrong tone above.
Removing the freelist will probably either not have a significant
effect (at worst, it's adding very little work of maintaining it), or
improve recursions and functions that tend to be running
simultaneously in multiple threads (as in those cases the realloc will
not actually resize the frame, and mallocs/free will indeed be saved).

But do note my corrected tone (I said probably :-) - and anyone is
welcome to try removing it and see if they get a performance benefit.

The fact threading also causes the same code object to be used in
multiple frames makes everything a little less predictable and may
mean that having a larger-than-1 number of frames associated with each
code object may indeed yield a performance benefit.

I am not sure how to benchmark such modifications. Is there any
benchmark that includes threaded use of the same functions in typical
use cases?

 Regards,
 Martin



Re: [Python-Dev] Frame zombies

2007-06-10 Thread Martin v. Löwis
 I am not sure how to benchmark such modifications. Is there any
 benchmark that includes threaded use of the same functions in typical
 use cases?

I don't think it's necessary to benchmark that specific case -
*any* kind of micro-benchmark would be better than none.
If you want to introduce free lists per code object, you
need to benchmark such code, and compare it to the status
quo. While doing so, I'd ask you to also measure the case
where the free list is dropped without a replacement.
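
As a starting point, a micro-benchmark along these lines could be run once against the status quo and once against each patched interpreter. This is only a sketch: the function names, depths, and iteration counts are made up for illustration, and it measures plain call overhead rather than the frame allocator in isolation.

```python
import threading
import timeit


def recurse(depth):
    """Call itself `depth` times so several frames of one code object are live."""
    if depth > 0:
        recurse(depth - 1)


def time_recursion(depth=50, number=2000, repeat=3):
    """Best-of-`repeat` wall time for `number` recursive call chains."""
    return min(timeit.repeat(lambda: recurse(depth), number=number, repeat=repeat))


def time_threaded(threads=4, depth=50, number=2000):
    """Wall time for several threads running the same function concurrently."""

    def work():
        for _ in range(number):
            recurse(depth)

    workers = [threading.Thread(target=work) for _ in range(threads)]
    start = timeit.default_timer()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return timeit.default_timer() - start


if __name__ == "__main__":
    print("recursion: %.4f s" % time_recursion())
    print("threaded:  %.4f s" % time_threaded())
```

Comparing the two numbers across builds (status quo, per-code-object free list, no free list at all) would at least show whether any difference is measurable.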

Regards,
Martin


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Alexey Borzenkov
On 6/11/07, Martin v. Löwis [EMAIL PROTECTED] wrote:
 For compatibility, I would propose to use UTF-8 only if the file
 name is not ASCII. Even though the OEM code pages vary, they
 are (mostly) ASCII supersets. So if the string can be encoded
 in ASCII, there is no need to set the UTF-8 flag bit.

Done:

Index: Lib/zipfile.py
===
--- Lib/zipfile.py  (revision 55850)
+++ Lib/zipfile.py  (working copy)
@@ -252,13 +252,29 @@
             self.extract_version = max(45, self.extract_version)
             self.create_version = max(45, self.extract_version)

+        filename, flag_bits = self._encodeFilenameFlags()
         header = struct.pack(structFileHeader, stringFileHeader,
-                 self.extract_version, self.reserved, self.flag_bits,
+                 self.extract_version, self.reserved, flag_bits,
                  self.compress_type, dostime, dosdate, CRC,
                  compress_size, file_size,
-                 len(self.filename), len(extra))
-        return header + self.filename + extra
+                 len(filename), len(extra))
+        return header + filename + extra

+    def _encodeFilenameFlags(self):
+        if isinstance(self.filename, unicode):
+            try:
+                return self.filename.encode('ascii'), self.flag_bits
+            except UnicodeEncodeError:
+                return self.filename.encode('utf-8'), self.flag_bits | 0x800
+        else:
+            return self.filename, self.flag_bits
+
+    def _decodeFilenameFlags(self):
+        if self.flag_bits & 0x800:
+            return self.filename.decode('utf-8'), self.flag_bits & ~0x800
+        else:
+            return self.filename, self.flag_bits
+
     def _decodeExtra(self):
         # Try to decode the extra field.
         extra = self.extra
@@ -684,6 +700,7 @@

             x._decodeExtra()
             x.header_offset = x.header_offset + concat
+            x.filename, x.flag_bits = x._decodeFilenameFlags()
             self.filelist.append(x)
             self.NameToInfo[x.filename] = x
             if self.debug > 2:
@@ -967,16 +984,17 @@
                     extract_version = zinfo.extract_version
                     create_version = zinfo.create_version

+                filename, flag_bits = zinfo._encodeFilenameFlags()
                 centdir = struct.pack(structCentralDir,
                   stringCentralDir, create_version,
                   zinfo.create_system, extract_version, zinfo.reserved,
-                  zinfo.flag_bits, zinfo.compress_type, dostime, dosdate,
+                  flag_bits, zinfo.compress_type, dostime, dosdate,
                   zinfo.CRC, compress_size, file_size,
-                  len(zinfo.filename), len(extra_data), len(zinfo.comment),
+                  len(filename), len(extra_data), len(zinfo.comment),
                   0, zinfo.internal_attr, zinfo.external_attr,
                   header_offset)
                 self.fp.write(centdir)
-                self.fp.write(zinfo.filename)
+                self.fp.write(filename)
                 self.fp.write(extra_data)
                 self.fp.write(zinfo.comment)

Index: Lib/test/test_zipfile.py
===
--- Lib/test/test_zipfile.py(revision 55850)
+++ Lib/test/test_zipfile.py(working copy)
@@ -515,6 +515,12 @@
         # and report that the first file in the archive was corrupt.
         self.assertRaises(RuntimeError, zipf.testzip)

+    def testUnicodeFilenames(self):
+        zf = zipfile.ZipFile(TESTFN, "w")
+        zf.writestr(u"foo.txt", "Test for unicode filename")
+        assert isinstance(zf.infolist()[0].filename, unicode)
+        zf.close()
+
     def tearDown(self):
         support.unlink(TESTFN)
         support.unlink(TESTFN2)

The other thing I changed is to encode filenames only when writing to
the target file, leaving ZipInfo untouched. The reason is that if a
user decides to enumerate infolist after she wrote files to ZipFile,
she would expect ZipInfo.filename to be what she passed to
ZipFile.write/ZipFile.writestr.
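
For readers who want to try the encoding rule outside of zipfile, here is a standalone sketch of the same logic. It is written for Python 3 (str/bytes instead of unicode/str), and the free-function names and the UTF8_FLAG constant are illustrative only, not part of the patch.

```python
UTF8_FLAG = 0x800  # general-purpose flag bit 11: the file name is UTF-8


def encode_filename_flags(filename, flag_bits=0):
    """Return (raw_bytes, flags): ASCII when possible, else UTF-8 plus bit 11."""
    try:
        return filename.encode("ascii"), flag_bits
    except UnicodeEncodeError:
        return filename.encode("utf-8"), flag_bits | UTF8_FLAG


def decode_filename_flags(raw, flag_bits):
    """Decode UTF-8 names and clear bit 11; leave other names as raw bytes."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8"), flag_bits & ~UTF8_FLAG
    return raw, flag_bits


if __name__ == "__main__":
    for name in ("foo.txt", "\u0444\u0430\u0439\u043b.txt"):
        raw, flags = encode_filename_flags(name)
        print(repr(name), "->", raw, hex(flags))
```

An ASCII-only name comes back with the flags unchanged, so archives containing only such names stay byte-for-byte compatible with readers that know nothing about bit 11.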


[Python-Dev] Question about dictobject.c:lookdict_string

2007-06-10 Thread Eyal Lotem
My question is specifically regarding the transition back from
lookdict_string (the initial value) to the general lookdict.

Currently, when a string-only dict is asked to look up any
non-string key, it reverts to the general lookdict.

Wouldn't it be better (especially in the more important case of a
string-key-only dict) to revert to the generic lookdict when a
non-string is inserted into the dict, rather than when one is
searched for?

It seems to me that this would shift the (admittedly very slight)
performance cost of a type ptr comparison from read access to
write access on all dicts. That means either insertions of new keys
into non-string-only dicts pay for another check, or the lookdict
funcptr is replaced by two funcptrs, so that string-only dicts use a
different insertion func too. (I was tempted to say vtable here, but
that would add another dereference to lookups.)

It would also have the slight benefit of speeding up non-string
lookups in string-only dicts.

This does not seem like a significant issue, but as I know a lot of
effort went into optimizing dicts, I was wondering if I am missing
something here.
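
To make the two policies concrete, here is a toy Python model. It is not how dictobject.c is written: ToyDict, demote_on_insert and on_fast_path are invented names, and a swapped bound method stands in for the C lookup function pointer.

```python
class ToyDict:
    """Toy model: self._lookup plays the role of the dict's lookup funcptr."""

    def __init__(self, demote_on_insert):
        self._items = {}
        self._demote_on_insert = demote_on_insert
        self._lookup = self._lookup_string  # start on the string-only path

    def _lookup_generic(self, key):
        return self._items.get(key)

    def _lookup_string(self, key):
        if not self._demote_on_insert and type(key) is not str:
            # Behaviour described above as current: a non-string *lookup*
            # switches the dict to the generic path for good.
            self._lookup = self._lookup_generic
            return self._lookup_generic(key)
        return self._items.get(key)

    def insert(self, key, value):
        if self._demote_on_insert and type(key) is not str:
            # Proposed behaviour: pay the type check on insertion instead.
            self._lookup = self._lookup_generic
        self._items[key] = value

    def get(self, key):
        return self._lookup(key)

    @property
    def on_fast_path(self):
        return self._lookup == self._lookup_string


if __name__ == "__main__":
    current, proposed = ToyDict(False), ToyDict(True)
    for d in (current, proposed):
        d.insert("a", 1)
        d.get(42)  # non-string lookup in a string-only dict
    print(current.on_fast_path, proposed.on_fast_path)
```

In this model the "current" dict leaves the string-only path after a single non-string lookup, while the "proposed" dict leaves it only once a non-string key is actually inserted.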


Re: [Python-Dev] zipfile and unicode filenames

2007-06-10 Thread Alexey Borzenkov
On 6/11/07, Alexey Borzenkov [EMAIL PROTECTED] wrote:
 The problem is that I don't know if anything actually supports bit 11
 at the time and can't even tell if I did this correctly or not. :(

I downloaded the latest WinZip and can confirm that it parses utf-8
filenames correctly: when bit 11 is set it understands the filename
to be utf-8, and when bit 11 is not set it treats the filename as
oem. (It seems to treat the presence of bit 11 more like enabling an
autodetection mode than strict utf-8, but that must be because it has
to cope with lots of incorrect zip files.) :)
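
The reader-side rule I observed can be sketched roughly as follows (Python 3). The function name is made up, and cp437 is only an assumed stand-in for "the OEM code page", which in practice varies between systems.

```python
def decode_zip_name(raw, flag_bits, oem_codec="cp437"):
    """Decode a stored file name: UTF-8 if bit 11 is set, otherwise OEM.

    oem_codec defaults to cp437 purely as an assumption for this sketch.
    """
    if flag_bits & 0x800:
        return raw.decode("utf-8")
    return raw.decode(oem_codec)


if __name__ == "__main__":
    raw = "caf\u00e9.txt".encode("utf-8")
    print(decode_zip_name(raw, 0x800))  # decoded as UTF-8
    print(decode_zip_name(raw, 0))      # same bytes read as OEM
```

The same bytes produce different names depending on the flag, which is why the writer must set bit 11 whenever it stores UTF-8.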