Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Ezio Melotti

On 24/08/2011 5.31, Nick Coghlan wrote:
> On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano wrote:
>> (Nor do we write filingsystem, governmentsystem, politicalsystem or
>> schoolsystem. This is English, not German.)
>
> Personally, I think 'filesystem' is a portmanteau in the process of
> coming into existence (as evidenced by usage like 'FHS' standing for
> 'Filesystem Hierarchy Standard'). However, the two word form is still
> useful at times, particularly for disambiguation of acronyms (as
> evidenced by usage like 'NFS' and 'GFS' for 'Network File System' and
> 'Google File System'). The Wikipedia article on the topic mixes and
> matches the two forms, but overall does favour the two word form.
>
> Since I tend to use the one word 'filesystem' form myself (ditto for
> 'filename'), I'm +1 for FilesystemError, but I'm only -0 for
> FileSystemError (so I expect that will be the option chosen, given
> other responses).


This pretty much summarizes my thoughts.  I saw the wiki article using 
both and since I consider 'filesystem' a single word I was wondering if 
anyone else preferred FilesystemError.  I'm totally fine with 
FileSystemError too though, if most people prefer it.


Best Regards,
Ezio Melotti



> Regards,
> Nick.



___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Scott Dial
On 8/23/2011 6:38 PM, Victor Stinner wrote:
> On Tuesday 23 August 2011 00:14:40, Antoine Pitrou wrote:
>> - You could try to run stringbench, which can be found at
>>   http://svn.python.org/projects/sandbox/trunk/stringbench (*)
>>   and there's iobench (the text mode benchmarks) in the Tools/iobench
>>   directory.
> 
> Some raw numbers.
> 
> stringbench:
> "147.07 203.07 72.4 TOTAL" for the PEP 393
> "146.81 140.39 104.6 TOTAL" for default
> => PEP is 45% slower

I ran the same benchmark and couldn't make a distinction in performance
between them:

pep-393.txt
182.17  175.47  103.8   TOTAL
cpython.txt
183.26  177.97  103.0   TOTAL

pep-393-wide-unicode.txt
181.61  198.69  91.4    TOTAL
cpython-wide-unicode.txt
181.27  195.58  92.7    TOTAL

I ran it a couple times and have seen either default or pep-393 being up
to +/- 10 sec slower on the unicode tests. The results of the 8-bit
string tests seem to have less variance on my test machine.

> run test_unicode 50 times:
> 0m19.487s for PEP
> 0m17.187s for default
> => PEP is 13% slower

$ time ./python -m test `python -c 'print "test_unicode " * 50'`

pep-393-wide-unicode.txt
real    0m33.409s
cpython-wide-unicode.txt
real    0m33.489s

Nothing in it for me, except that your system is obviously faster in general.

-- 
Scott Dial
sc...@scottdial.com


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Torsten Becker
On Tue, Aug 23, 2011 at 18:56, Victor Stinner
 wrote:
>> kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still
>> necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape().
>
> If it can be removed, it would be nice to have kind in [0; 2] instead of kind
> in [1; 3], to be able to have a list (of 3 items) => callback or label.

It is also used in PyUnicode_DecodeUTF8Stateful() and there might be
some cases which I missed converting checks for 0 when I introduced
the macro.  The question was more if this should be written as 0 or as
a named constant.  I preferred the named constant for readability.

An alternative would be to have kind values be the same as the number
of bytes for the string representation so it would be 0 (wstr), 1
(1-byte), 2 (2-byte), or 4 (4-byte).

I think the value for wstr/uninitialized/reserved should not be
removed.  The wstr representation is still used in the error case in
the utf8 decoder because these strings can be resized. Also having one
designated value for "uninitialized" limits comparisons in the
affected functions to the kind value, otherwise they would need to
check the str field for NULL to determine in which buffer to write a
character.
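
To make the byte-width alternative concrete, here is a minimal Python sketch (not the C code from the patch). It assumes the kind value equals the number of bytes per character, with 0 reserved for the wstr/uninitialized case; the constant names and the helpers kind_for and buffer_size are hypothetical.

```python
# Hypothetical constants: kind equals bytes per character, 0 is reserved.
KIND_WSTR = 0
KIND_1BYTE = 1
KIND_2BYTE = 2
KIND_4BYTE = 4


def kind_for(text):
    """Return the narrowest kind able to hold every code point in text."""
    max_code_point = max(map(ord, text), default=0)
    if max_code_point <= 0xFF:
        return KIND_1BYTE
    if max_code_point <= 0xFFFF:
        return KIND_2BYTE
    return KIND_4BYTE


def buffer_size(text):
    """With kind == width, the buffer size is a plain multiplication."""
    return kind_for(text) * len(text)
```

One consequence of this numbering is that the values are no longer contiguous (0, 1, 2, 4), which matters for the switch/jump-table question discussed next.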

> I suppose that compilers prefer a switch with all cases defined, 0 as the
> first item, and contiguous values. We may need an enum.

During the Summer of Code, Martin and I did an experiment with GCC and
it did not seem to produce a jump table as an optimization for three
cases but generated comparison instructions anyway.  I am not sure how
much we should optimize for potential compiler optimizations here.


Regards,
Torsten


Re: [Python-Dev] PEP 3151 from the BDFOP

2011-08-23 Thread Nick Coghlan
On Wed, Aug 24, 2011 at 9:57 AM, Antoine Pitrou  wrote:
> I don't have any personal preference. Previous discussions seemed to
> indicate people preferred IOError. But changing the implementation to
> OSError would be simple. I agree OSError feels slightly more right, as
> in more generic.

IIRC, the preference for IOError was formed when we were going to
deprecate the 'legacy' names. Now that using the old names won't
trigger any kind of warning, +1 for using OSError as the official name
of the base class with IOError as a legacy alias.

>> And that anything
>> raising an exception (e.g. via PyErr_SetFromErrno) other than the new ones
>> will raise IOError?
>
> I'm not sure I understand the question precisely. The errno mapping
> mechanism is implemented in IOError.__new__, but it gets called only if
> the class is exactly IOError, not a subclass:
>
> >>> IOError(errno.EPERM, "foo")
> PermissionError(1, 'foo')
> >>> class MyIOError(IOError): pass
> ...
> >>> MyIOError(errno.EPERM, "foo")
> MyIOError(1, 'foo')
>
> Using IOError.__new__ is the easiest way to ensure that all code
> raising IO errors takes advantage of the errno mapping. Otherwise you
> may get APIs raising the proper subclasses, and other APIs always
> raising base IOError (it doesn't happen often, but some Python
> library code raises an IOError with an explicit errno).

It's also the natural place to put the errno->exception type mapping
so that existing code will raise the new errors without requiring
modification. We could spell it as a new class method ("from_errno" or
similar), but there isn't any ambiguity in doing it directly in
__new__, so a class method seems pointlessly inconvenient.
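
As a rough illustration of why __new__ is a convenient place for the mapping, here is a toy pure-Python sketch. It is not the CPython implementation: ToyBaseError, ToyPermissionError and the _errno_map table are hypothetical names used only for this example.

```python
import errno


class ToyBaseError(Exception):
    """Toy stand-in for the base class; not CPython's implementation."""

    _errno_map = {}  # errno code -> subclass, filled in below

    def __new__(cls, *args):
        target = cls
        # Remap only when the base class itself is requested, so that
        # user-defined subclasses keep their own type.
        if cls is ToyBaseError and args:
            target = ToyBaseError._errno_map.get(args[0], cls)
        return super().__new__(target, *args)

    def __init__(self, code, message):
        super().__init__(code, message)
        self.errno = code


class ToyPermissionError(ToyBaseError):
    pass


ToyBaseError._errno_map.update({
    errno.EPERM: ToyPermissionError,
    errno.EACCES: ToyPermissionError,
})


class MyError(ToyBaseError):
    pass
```

Existing call sites that write ToyBaseError(errno.EPERM, "foo") pick up the subclass without modification, whereas a separate "from_errno" class method would only help code that had been changed to call it.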

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Torsten Becker
On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou  wrote:
> Macros are useful to shield the abstraction from the implementation. If
> you access the members directly, and the unicode object is represented
> differently in some future version of Python (say e.g. with tagged
> pointers), your code doesn't compile anymore.

I agree with Antoine, from the experience of porting C code from 3.2
to the PEP 393 unicode API, the additional encapsulation by macros
made it much easier to change the implementation of what is a field,
what is a field's actual name, and what needs to be calculated through
a function.

So, I would like to keep primary access as a macro but I see the point
that it would make the struct clearer to access and I would not mind
changing the struct to use a union.  But then most access currently is
through macros so I am not sure how much benefit the union would bring
as it mostly complicates the struct definition.

Also, common, now simple, checks for "unicode->str == NULL" would look
more ambiguous with a union ("unicode->str.latin1 == NULL").


Regards,
Torsten


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Torsten Becker
On Tue, Aug 23, 2011 at 18:27, Victor Stinner
 wrote:
> I posted a patch to re-add it:
> http://bugs.python.org/issue12819#msg142867

Thank you for the patch!  Note that this patch adds the fast path only
to the helper function which determines the length of the string and
the maximum character.  The decoding part is still without a fast path
for ASCII runs.


Regards,
Torsten


[Python-Dev] Planned PEP status changes

2011-08-23 Thread Nick Coghlan
Unless I hear any objections, I plan to adjust the current PEP
statuses as follows some time this weekend:

Move from Accepted to Finished:

 389   argparse - New Command Line Parsing Module    Bethard
 391   Dictionary-Based Configuration For Logging    Sajip
3108   Standard Library Reorganization               Cannon
3135   New Super                                     Spealman, Delaney, Ryan

Move from Accepted to Withdrawn (with a reference to Reid Kleckner's blog post):

3146   Merging Unladen Swallow into CPython          Winter, Yasskin, Kleckner


The PEP 3118 enhanced buffer protocol has some ongoing semantic and
implementation issues still to be worked out, so I plan to leave that
at Accepted. Ditto for PEP 3121 (extension module finalisation), since
that doesn't play nicely with the current 'set everything to None'
approach to breaking cycles during module finalisation.

The other Accepted PEPs are either packaging standards related or
genuinely not implemented yet.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Torsten Becker
On Tue, Aug 23, 2011 at 08:15, Antoine Pitrou  wrote:
> So why would you need three separate implementation of the unrolled
> loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.

The WRITE_FLEXIBLE_OR_WSTR macro does a check for kind and then
writes.  Using this macro for the fast path would be inefficient.  To
have a real fast path, you would need an outer if to check for kind and
then, in each condition body, the matching access to the string (1, 2,
or 4 bytes), and each body would also have to write 4 or 8 times
(guarded by #ifdef, depending on platform).

As all these cases bloated up the C code, we went for the simple
solution with the goal of profiling the code again afterwards to see
where the new performance bottlenecks would be.

> Even without taking into account the unrolled loop, I wonder how much
> slower UTF-8 decoding becomes with that approach, by the way. Instead of
> testing the "kind" variable at each loop iteration, using a
> stringlib-like approach may be a better deal IMO.

To me, it feels like this would complicate the C source code and
decrease readability.  For each function you would need a wrapper
which does the kind checking logic and then, in a separate file, the
implementation of the function which then gets included three times
for each character width.
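
The trade-off can be sketched in Python (a toy model only, not the C code): one writer tests the kind for every character, as a per-write macro would, while the other selects the width once and then runs a width-specific loop. The function names are hypothetical, and kind is assumed here to be the character width in bytes.

```python
import struct

# Fixed-size little-endian struct codes per character width in bytes.
FORMATS = {1: "<B", 2: "<H", 4: "<I"}


def write_per_char(buf, kind, code_points):
    """Check kind for every character, as a per-write macro would."""
    for i, cp in enumerate(code_points):
        if kind == 1:
            struct.pack_into("<B", buf, i, cp)
        elif kind == 2:
            struct.pack_into("<H", buf, 2 * i, cp)
        else:
            struct.pack_into("<I", buf, 4 * i, cp)


def write_specialized(buf, kind, code_points):
    """Check kind once, then run a loop specialized for that width."""
    fmt = FORMATS[kind]
    for i, cp in enumerate(code_points):
        struct.pack_into(fmt, buf, kind * i, cp)
```

In C the second form needs one copy of the loop body per width, which is where the wrapper plus three-times-included implementation file comes from.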


Regards,
Torsten


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Nick Coghlan
On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano  wrote:
> Antoine Pitrou wrote:
>>
>> Hello,
>>
>> When reviewing the PEP 3151 implementation (*), Ezio commented that
>> "FileSystemError" looks a bit strange and that "FilesystemError" would
>> be a better spelling. What is your opinion?
>
> It's a file system (two words), not filesystem (not in any dictionary or
> spell checker I've ever used).

I rarely find spell checkers to be useful sources of data on correct
spelling of technical jargon (and the computing usage of the term
'filesystem' definitely qualifies as jargon).

> (Nor do we write filingsystem, governmentsystem, politicalsystem or
> schoolsystem. This is English, not German.)

Personally, I think 'filesystem' is a portmanteau in the process of
coming into existence (as evidenced by usage like 'FHS' standing for
'Filesystem Hierarchy Standard'). However, the two word form is still
useful at times, particularly for disambiguation of acronyms (as
evidenced by usage like 'NFS' and 'GFS' for 'Network File System' and
'Google File System'). The Wikipedia article on the topic mixes and
matches the two forms, but overall does favour the two word form.

Since I tend to use the one word 'filesystem' form myself (ditto for
'filename'), I'm +1 for FilesystemError, but I'm only -0 for
FileSystemError (so I expect that will be the option chosen, given
other responses).

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Torsten Becker
On Mon, Aug 22, 2011 at 18:14, Antoine Pitrou  wrote:
> - You could trim the debug results from the benchmark results, this may
>  make them more readable.

Good point, I removed them from the wiki page.

On Tue, Aug 23, 2011 at 18:38, Victor Stinner
 wrote:
> On Tuesday 23 August 2011 00:14:40, Antoine Pitrou wrote:
>> - You could try to run stringbench, which can be found at
>>   http://svn.python.org/projects/sandbox/trunk/stringbench (*)
>>   and there's iobench (the text mode benchmarks) in the Tools/iobench
>>   directory.
>
> Some raw numbers.
> [...]

Thank you Victor for running stringbench, I did not get to it in time.


Regards,
Torsten


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Terry Reedy

On 8/23/2011 9:21 AM, Victor Stinner wrote:
> On 23/08/2011 15:06, "Martin v. Löwis" wrote:
>> Well, things have to be done in order:
>> 1. the PEP needs to be approved
>> 2. the performance bottlenecks need to be identified
>> 3. optimizations should be applied.
>
> I would not vote for the PEP if it slows down Python, especially if it's
> much slower. But Torsten says that it speeds up Python, which is
> surprising. I have to do my own benchmarks :-)


The current UCS2 Unicode string implementation, by design, quickly gives 
WRONG answers for len(), iteration, indexing, and slicing if a string 
contains any non-BMP (surrogate pair) Unicode characters. That may have 
been excusable when there essentially were no such extended chars, and 
the few there were were almost never used. But now there are many more, 
with more being added to each Unicode edition. They include cursive Math 
letters that are used in English documents today. The problem will 
slowly get worse and Python, at least on Windows, will become a language 
to avoid for dependable Unicode document processing. 3.x needs a proper 
Unicode implementation that works for all strings on all builds.
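
The discrepancy can be illustrated with a small sketch. It assumes Python 3.3 or later, where len() counts code points; the UTF-16 code-unit count computed below corresponds to what a narrow build reports. The helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(text):
    """Count UTF-16 code units, i.e. what a narrow build calls the length."""
    return len(text.encode("utf-16-le")) // 2


# U+1D49C is a mathematical script letter outside the BMP.
text = "x = \U0001D49C"

code_points = len(text)               # characters as a reader sees them
code_units = utf16_code_units(text)   # one extra unit for the surrogate pair
```

Every non-BMP character widens the gap by one, so indexes and slice bounds computed in code units drift further from the intended characters as such characters accumulate.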


utf16.py, attached to http://bugs.python.org/issue12729
prototypes a different solution than the PEP for the above problems for 
the 'mostly BMP' case. I will discuss it in a different post.


--
Terry Jan Reedy




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Terry Reedy

On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
> On 23.08.2011 11:46, Xavier Morel wrote:
>> Mostly ascii is pretty common for western-european languages (French, for
>> instance, is probably 90 to 95% ascii). It's also a risk in english, when
>> the writer "correctly" spells foreign words (résumé and the like).
>
> I know - I still question whether it is "extremely common" (so much as
> to justify a special case). I.e. on what application with what dataset
> would you gain what speedup, at the expense of what amount of extra
> lines, and potential slow-down for other datasets?

[snip]

> In the PEP 393 approach, if the string has a two-byte representation,
> each character needs to be widened to two bytes, and likewise for four
> bytes. So three separate copies of the unrolled loop would be needed,
> one for each target size.


I fully support the declared purpose of the PEP, which I understand to
be to have a full, correct Unicode implementation on all new Python
releases without paying unnecessary space (and consequent time)
penalties. I think the erroneous length, iteration, indexing, and
slicing for strings with non-BMP chars in narrow builds need to be
fixed for future versions. I think we should at least consider
alternatives to the PEP 393 solution of doubling or quadrupling space if
needed for at least one char.


In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of a different solution to the
'mostly BMP chars, few non-BMP chars' case. Rather than expand every
character from 2 bytes to 4, attach an array cpdex of the character (i.e.
code point, not code unit) indexes of the non-BMP chars. Then for indexing
and slicing, the correction is simple, simpler than I first expected:

  code-unit-index = char-index + bisect.bisect_left(cpdex, char-index)
where code-unit-index is the adjusted index into the full underlying 
double-byte array. This adds a time penalty of log2(len(cpdex)), but 
avoids most of the space penalty and the consequent time penalty of 
moving more bytes around and increasing cache misses.
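
A minimal runnable sketch of that correction follows. It is not the code from utf16.py: the helper names to_units, build_cpdex and char_at are hypothetical, and the input is assumed to contain no lone surrogates.

```python
from bisect import bisect_left


def to_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def build_cpdex(text):
    """Character (code point) indexes of the non-BMP characters."""
    return [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


def char_at(units, cpdex, char_index):
    """Fetch one character by code point index from the code unit array."""
    # Each non-BMP char before char_index occupies one extra code unit.
    unit_index = char_index + bisect_left(cpdex, char_index)
    first = units[unit_index]
    if 0xD800 <= first <= 0xDBFF:
        # High surrogate: combine it with the following low surrogate.
        second = units[unit_index + 1]
        return chr(0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00))
    return chr(first)
```

For a string with no non-BMP characters cpdex is empty and the bisect call returns 0, so the lookup degenerates to plain indexing.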


I believe the same idea would work for utf8 and the mostly-ascii case. 
The main difference is that non-ascii chars have various byte sizes 
rather than the 1 extra double-byte of non-BMP chars in UCS2 builds. So 
the offset correction would not simply be the bisect-left return but 
would require another lookup

  byte-index = char-index + offsets[bisect-left(cpdex, char-index)]
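
Sketched in Python, under the assumption that offsets[k] holds the total number of extra bytes contributed by the first k non-ASCII chars (so offsets[0] is 0); build_index and byte_index are hypothetical helper names:

```python
from bisect import bisect_left


def build_index(text):
    """Return (cpdex, offsets) for the UTF-8 encoding of text."""
    cpdex = []      # char indexes of the non-ASCII characters
    offsets = [0]   # cumulative extra bytes before each cpdex position
    for i, ch in enumerate(text):
        extra = len(ch.encode("utf-8")) - 1
        if extra:
            cpdex.append(i)
            offsets.append(offsets[-1] + extra)
    return cpdex, offsets


def byte_index(cpdex, offsets, char_index):
    """Map a character index to its byte index in the UTF-8 buffer."""
    return char_index + offsets[bisect_left(cpdex, char_index)]
```

The extra offsets lookup is needed because a non-ASCII char may add 1, 2 or 3 bytes, so the count of preceding multi-byte chars alone does not give the shift.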

If possible, I would have the with-index-array versions be separate 
subtypes, as in utf16.py. I believe either index-array implementation 
might benefit from a subtype for single multi-unit chars, as a single 
non-ASCII or non-BMP char does not need an auxiliary [0] array and a 
senseless lookup therein but does need its length fixed at 1 instead of 
the number of base array units.


--
Terry Jan Reedy




Re: [Python-Dev] PEP 3151 from the BDFOP

2011-08-23 Thread Antoine Pitrou

Hi,

> One guiding principle for me is that we should keep the abstraction as thin as
> possible.  In particular, I'm concerned about mapping multiple errnos into a
> single Error.  For example both EPIPE and ESHUTDOWN mapping to BrokenPipeError,
> or EACCES or EPERM to PermissionError.  I think we should resist this, so that
> one errno maps to exactly one Error.  Where grouping is desired, Python
> already has mechanisms to deal with that, e.g. superclasses and multiple
> inheritance.  Therefore, I think it would be better to have
> 
> + FileSystemPermissionError
>   + AccessError (EACCES)
>   + PermissionError (EPERM)

I'm not sure that's a good idea:

- EPERM is not only about filesystem permissions, see for example
  
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_timedwait.html

- EACCES and EPERM as a low-level distinction makes sense, but at the
  Python programmer's high-level point of view, the AccessError /
  PermissionError distinction does not seem to convey any useful
  meaning.
  (or perhaps that's just my bad understanding of English)

- the "errno" attribute is still there (and still displayed - see below)
  for people who know their system calls and want to inspect the
  original error code
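
A small sketch of the last point, assuming the PEP 3151 hierarchy as it behaves in Python 3.3 and later (the helper name handle is hypothetical): one except clause covers both codes, and the errno attribute still tells them apart for callers who care.

```python
import errno
import os


def handle(code):
    """Raise an OSError for code and report how a handler sees it."""
    try:
        raise OSError(code, os.strerror(code))
    except PermissionError as exc:
        # One clause covers EACCES and EPERM; exc.errno keeps the detail.
        return errno.errorcode[exc.errno]
    except OSError:
        return "other"


results = [handle(errno.EACCES), handle(errno.EPERM), handle(errno.ENOENT)]
```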

> Also, some of the artificial hierarchy introduced in the PEP may
> not be necessary (e.g. the faux superclass FileSystemPermissionError above).
> This might lead to the elimination of FileSystemError as some have suggested
> (I too question its utility).

Yes, FileSystemError might be removed. I thought that it would be
useful, in some library routines, to catch all filesystem-related
errors indistinctly, but it's not a complete catchall actually (for
example, AccessError is outside of the FileSystemError subtree).

> Similarly, I think it would be helpful to have the errno name (e.g. ENOENT) in
> the error message string.  That way, it won't get in the way for most code,
> but would be usefully printed out for uncaught exceptions.

Agreed, but I think that's a feature request quite orthogonal from the
PEP. The errno *number* is still printed as it was before:

>>> open("foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'foo'

(see e.g. http://bugs.python.org/issue12762)
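
Until the name appears in the message itself, it can be recovered from the number with the standard errno.errorcode table. A minimal sketch, assuming Python 3.3 or later; errno_name and the message format are hypothetical:

```python
import errno
import os
import tempfile


def errno_name(exc):
    """Return the symbolic errno name for an OSError, or '?' if unknown."""
    return errno.errorcode.get(exc.errno, "?")


# A fresh temporary directory guarantees that the file does not exist.
with tempfile.TemporaryDirectory() as tmp:
    try:
        open(os.path.join(tmp, "foo"))
    except FileNotFoundError as exc:
        name = errno_name(exc)
        message = "[{} {}] {}".format(name, exc.errno, exc.strerror)
```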

> A second guiding principle should be that careful code that works in Python
> 3.2 must continue to work in Python 3.3 once PEP 3151 is accepted, but also
> for Python 2 code ported straight to Python 3.3.

I don't think porting straight to 3.3 would make a difference, especially now
that the idea of deprecating old exception names has been abandoned.

> Do be prepared for
> complaints about compatibility for careless code though - there's a ton of
> that out in the wild, and people will always complain when their "working"
> code breaks due to an upgrade.  Be *very* explicit about this in the release
> notes and NEWS file, and put your asbestos underoos on.

I'll take care of that :)

> Have you considered the impact of this PEP on other Python implementations?
> My hazy memory of Jython tells me that errnos don't really leak into Java and
> thus Jython much, but what about PyPy and IronPython?  E.g. step 1's
> deprecation strategy seems pretty CPython-centric.

Alternative implementations already have to implement errno codes in one
way or another if they want to have a chance of running existing code.
So I don't think the PEP makes much of a difference for them.
But their implementors can give their opinion on this.

> As for step 1 (coalescing the errors).  This makes sense and I'm generally
> agreeable, but I'm wondering whether it's best to re-use IOError for this
> rather than introduce a new exception.  Not that I can think of a good name
> for that.  I'm just not totally convinced that existing code when upgrading to
> Python 3.3 won't introduce silent failures.  If an existing error is to be
> re-used for this, I'm torn on whether IOError or OSError is a better choice.
> Popularity aside, OSError *feels* more right.

I don't have any personal preference. Previous discussions seemed to
indicate people preferred IOError. But changing the implementation to
OSError would be simple. I agree OSError feels slightly more right, as
in more generic.

> What is the impact of the PEP on tools such as 2to3 and 3to2?

I'd say none for 2to3. For 3to2 I'm not sure. Obviously if you write
code taking advantage of new features, it will be difficult to
back-port to 2.x. But that's not specific to PEP 3151. Python 3.2
already has a lot of such stuff:
http://docs.python.org/py3k/whatsnew/3.2.html

> Just to be clear, am I right that (on POSIX systems at least) IOError and its
> subclasses will always have an errno attribute still?

Yes!

> And that anything
> raising an exception (e.g. via PyErr_SetFromErrno) other than the new ones
> will raise IOError?

I'm not sure I understand the question precisely. The errno mapping
mechanism is implemented in IOErro

Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Terry Reedy

On 8/23/2011 2:46 PM, Brian Curtin wrote:
> I don't care all that much but I'm reminded of the .NET
> FileSystemWatcher class, so put me down for +0.5 on FileSystemError.

For other reasons, I am at least +0.5 for FileSystemError also.


--
Terry Jan Reedy



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Victor Stinner
On Wednesday 24 August 2011 00:46:16, Victor Stinner wrote:
> On Monday 22 August 2011 20:58:51, Torsten Becker wrote:
> > [1]: http://www.python.org/dev/peps/pep-0393
> 
> state:
> lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
> next 2 bits (mask 0x0C) - form of str:
> 00 => reserved
> 01 => 1 byte (Latin-1)
> 10 => 2 byte (UCS-2)
> 11 => 4 byte (UCS-4);
> next bit (mask 0x10): 1 if str memory follows PyUnicodeObject
> 
> kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still
> necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape().

If it can be removed, it would be nice to have kind in [0; 2] instead of kind
in [1; 3], to be able to use a list (of 3 items) => callback or label. I
suppose that compilers prefer a switch with all cases defined, 0 as the first
item, and contiguous values. We may need an enum.

Victor



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Victor Stinner
On Monday 22 August 2011 20:58:51, Torsten Becker wrote:
> [1]: http://www.python.org/dev/peps/pep-0393

state:
lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
next 2 bits (mask 0x0C) - form of str:
00 => reserved
01 => 1 byte (Latin-1)
10 => 2 byte (UCS-2)
11 => 4 byte (UCS-4);
next bit (mask 0x10): 1 if str memory follows PyUnicodeObject

kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still necessary? 
It looks to be only used in PyUnicode_DecodeUnicodeEscape().
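
For reference, the bit layout quoted above can be decoded as follows. This is a Python sketch only; the constant and function names are hypothetical, and only the mask values come from the PEP text quoted here.

```python
SSTATE_MASK = 0x03   # lowest 2 bits: interned state
KIND_MASK = 0x0C     # next 2 bits: form of str
KIND_SHIFT = 2
COMPACT_MASK = 0x10  # next bit: str memory follows the object

KIND_NAMES = {
    0: "reserved",
    1: "1 byte (Latin-1)",
    2: "2 byte (UCS-2)",
    3: "4 byte (UCS-4)",
}


def unpack_state(state):
    """Split a state value into (interned, kind, memory_follows)."""
    interned = state & SSTATE_MASK
    kind = (state & KIND_MASK) >> KIND_SHIFT
    memory_follows = bool(state & COMPACT_MASK)
    return interned, kind, memory_follows
```

For example, 0x16 (binary 10110) decodes to interned state 2, kind 1 and memory following the object.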

Victor



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Victor Stinner
On Tuesday 23 August 2011 00:14:40, Antoine Pitrou wrote:
> - You could try to run stringbench, which can be found at
>   http://svn.python.org/projects/sandbox/trunk/stringbench (*)
>   and there's iobench (the text mode benchmarks) in the Tools/iobench
>   directory.

Some raw numbers.

stringbench:
"147.07 203.07 72.4 TOTAL" for the PEP 393
"146.81 140.39 104.6 TOTAL" for default
=> PEP is 45% slower

run test_unicode 50 times:
0m19.487s for PEP
0m17.187s for default
=> PEP is 13% slower

time ./python -m test -j4 ("real" time):
3m16.886s (334 tests) for the PEP
3m21.984s (335 tests) for default
... default has 1 more test!
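
The percentages can be reproduced with a few lines of Python. This assumes the 45% figure compares the second column of the stringbench totals; percent_slower is a hypothetical helper name.

```python
def percent_slower(candidate, baseline):
    """Relative slowdown of candidate versus baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


stringbench = percent_slower(203.07, 140.39)    # second TOTAL column
test_unicode = percent_slower(19.487, 17.187)   # seconds for 50 runs
# 3m16.886s and 3m21.984s converted to seconds.
full_suite = percent_slower(3 * 60 + 16.886, 3 * 60 + 21.984)
```

A negative result, as for the full test suite, means the PEP branch took less time than default in that run.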

Only 13% slower on test_unicode is *good*. There is still a lot of code using
the legacy API in unicode.c, so it can be much better.

stringbench only shows the overhead of the conversion from compact unicode to
Py_UNICODE* (wchar_t*). stringlib still uses the legacy API.

Victor



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Victor Stinner
On Tuesday 23 August 2011 00:14:40, Antoine Pitrou wrote:
> Hello,
> 
> On Mon, 22 Aug 2011 14:58:51 -0400
> 
> Torsten Becker  wrote:
> > I have implemented an initial version of PEP 393 -- "Flexible String
> > Representation" as part of my Google Summer of Code project.  My patch
> > is hosted as a repository on bitbucket [1] and I created a related
> > issue on the bug tracker [2].  I posted documentation for the current
> > state of the development in the wiki [3].
> 
> A couple of minor comments:
> 
> - “The UTF-8 decoding fast path for ASCII only characters was removed
>   and replaced with a memcpy if the entire string is ASCII.”
>   The fast path would still be useful for mostly-ASCII strings, which
>   are extremely common (unless UTF-8 has become a no-op?).

I posted a patch to re-add it:
http://bugs.python.org/issue12819#msg142867

Victor



[Python-Dev] PEP 3151 from the BDFOP

2011-08-23 Thread Barry Warsaw
I am sending this review as the BDFOP for PEP 3151.  I've read the PEP and
reviewed the python-dev discussion via Gmane.  I have not reviewed the hg
branch where Antoine has implemented it.

I'm not quite ready to pronounce, but I do have some questions and comments.
First off, thanks to Antoine for taking this issue on, and for his well
written and well reasoned PEP.  There's definitely a problem here and I think
Python will be better off for having addressed it.  I, for one, will be very
happy when I can eliminate the majority of `import errno`s from my code. ;)

One guiding principle for me is that we should keep the abstraction as thin as
possible.  In particular, I'm concerned about mapping multiple errnos into a
single Error.  For example both EPIPE and ESHUTDOWN mapping to BrokenPipeError,
or EACCES or EPERM to PermissionError.  I think we should resist this, so that
one errno maps to exactly one Error.  Where grouping is desired, Python
already has mechanisms to deal with that, e.g. superclasses and multiple
inheritance.  Therefore, I think it would be better to have

+ FileSystemPermissionError
  + AccessError (EACCES)
  + PermissionError (EPERM)

Yes, it makes the hierarchy deeper, and means you have to come up with a few
more names, but I think it will also make it easier for the programmer to use
and debug.  Also, some of the artificial hierarchy introduced in the PEP may
not be necessary (e.g. the faux superclass FileSystemPermissionError above).
This might lead to the elimination of FileSystemError as some have suggested
(I too question its utility).

Similarly, I think it would be helpful to have the errno name (e.g. ENOENT) in
the error message string.  That way, it won't get in the way for most code,
but would be usefully printed out for uncaught exceptions.

A second guiding principle should be that careful code that works in Python
3.2 must continue to work in Python 3.3 once PEP 3151 is accepted, but also
for Python 2 code ported straight to Python 3.3.  Given the PEP's emphasis on
"useful compatibility", I think this will be the case.  Do be prepared for
complaints about compatibility for careless code though - there's a ton of
that out in the wild, and people will always complain when their "working"
code breaks due to an upgrade.  Be *very* explicit about this in the release
notes and NEWS file, and put your asbestos underoos on.  On the plus side,
there's not so much Python 3 code to break :).  Also, do clearly explain any
required migration strategy for existing code, probably in this PEP.

Have you considered the impact of this PEP on other Python implementations?
My hazy memory of Jython tells me that errnos don't really leak into Java and
thus Jython much, but what about PyPy and IronPython?  E.g. step 1's
deprecation strategy seems pretty CPython-centric.

As for step 1 (coalescing the errors): this makes sense and I'm generally
agreeable, but I'm wondering whether it's best to re-use IOError for this
rather than introduce a new exception.  Not that I can think of a good name
for that.  I'm just not totally convinced that existing code when upgrading to
Python 3.3 won't introduce silent failures.  If an existing error is to be
re-used for this, I'm torn on whether IOError or OSError is a better choice.
Popularity aside, OSError *feels* more right.

What is the impact of the PEP on tools such as 2to3 and 3to2?

Just to be clear, am I right that (on POSIX systems at least) IOError and its
subclasses will always have an errno attribute still?  And that anything
raising an exception (e.g. via PyErr_SetFromErrno) other than the new ones
will raise IOError?

I also think that rather than transforming the exception when raised from Python,
i.e. via __new__ hackery, perhaps it should be a ValueError in its own right
to raise IOError with an error represented by one of the subclasses.  Chained
exceptions would mean that the original exception needn't get lost.

I surveyed some of my own code and observed (as others have) that EISDIR and
ENOTDIR are pretty rare.  I found more examples of ECHILD and ESRCH than the
former two.  How'd you like to add those two to make your BDFOP happy? :)

What follows are some crazier ideas.  I'm just throwing them out there, not
necessarily suggesting they should go into the PEP.

The new syntax (e.g. if clause on except) is certainly appealing at first
glance, and might be of more general use for Python, but I agree with the
problems as stated in the PEP.  However, there might be a few things that
*can* be done to make even the uncommon cases easier. E.g.

What if all the errno symbolic names were mapped as attributes on IOError?
The only advantage of that would be to eliminate the need to import errno, or
for the ugly `e.errno == errno.ENOENT` stuff.  That would then be rewritten as
`e.errno == IOError.ENOENT`.  A mild savings to be sure, but still.
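
For reference, the idiom being discussed looks roughly like this today. The
function name is_missing is made up for the example; the IOError.ENOENT
spelling floated above does not exist, so only the errno-module form is shown.

```python
import errno
import os
import tempfile


def is_missing(path):
    """Return True if opening path fails specifically with ENOENT."""
    try:
        with open(path):
            return False
    except IOError as e:
        # The comparison that requires importing errno.
        if e.errno == errno.ENOENT:
            return True
        raise  # any other errno is not ours to swallow


# A path inside a fresh temporary directory that was never created.
print(is_missing(os.path.join(tempfile.mkdtemp(), "no-such-file")))
```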

How dumb/useless/unworkable would it be to add an __future__ to switch from
the old hierarchy to the new one

Re: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-23 Thread Antoine Pitrou
On Tuesday 23 August 2011 at 22:07 +0200, Charles-François Natali wrote:
> 2011/8/23 Antoine Pitrou :
> > Well, I would consider the I/O locks the most glaring problem. Right
> > now, your program can freeze if you happen to do a fork() while e.g.
> > the stderr lock is taken by another thread (which is quite common when
> > debugging).
> 
> Indeed.
> To solve this, a similar mechanism could be used: after fork(), in the
> child process:
> - just reset each I/O lock (destroy/re-create the lock) if we can
> guarantee that the file object is in a consistent state (i.e. that all
> the invariants hold). That's the approach I used in my initial patch.

For I/O locks I think that would work.
There could also be a process-wide "fork lock" to serialize locks and
other operations, if we want 100% guaranteed consistency of I/O objects
across forks.

> - call a fileobject method which resets the I/O lock and sets the file
> object to a consistent state (in other words, an atfork handler)

I fear that the complication with atfork handlers is that you have to
manage their lifecycle as well (i.e., when an IO object is destroyed,
you have to unregister the handler).

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Laurens Van Houtven
On Tue, Aug 23, 2011 at 8:46 PM, Barry Warsaw  wrote:

> On Aug 23, 2011, at 08:39 PM, Ross Lagerwall wrote:
>
> >> When reviewing the PEP 3151 implementation (*), Ezio commented that
> >> "FileSystemError" looks a bit strange and that "FilesystemError" would
> >> be a better spelling. What is your opinion?
> >
> >I don't think it really matters since both "file system" and
> >"filesystem" appear to be in common usage.
> >
> >I would say +1 to "FileSystemError" -- i.e. take file system as two
> >words.
>
> My online dictionaries prefer "file system" to be two words, so for me,
> FileSystemError is preferred.
>
> -Barry
>
+1

-- 
cheers
lvh


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Vinay Sajip
Antoine Pitrou writes:

> When reviewing the PEP 3151 implementation (*), Ezio commented that
> "FileSystemError" looks a bit strange and that "FilesystemError" would
> be a better spelling. What is your opinion?

+1 for FileSystemError as I, like others, don't regard "filesystem" as a proper
word.

Regards,

Vinay Sajip



Re: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-23 Thread Charles-François Natali
2011/8/23 Antoine Pitrou :
> Well, I would consider the I/O locks the most glaring problem. Right
> now, your program can freeze if you happen to do a fork() while e.g.
> the stderr lock is taken by another thread (which is quite common when
> debugging).

Indeed.
To solve this, a similar mechanism could be used: after fork(), in the
child process:
- just reset each I/O lock (destroy/re-create the lock) if we can
guarantee that the file object is in a consistent state (i.e. that all
the invariants hold). That's the approach I used in my initial patch.
- call a fileobject method which resets the I/O lock and sets the file
object to a consistent state (in other words, an atfork handler)


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Steven D'Aprano

Antoine Pitrou wrote:

Hello,

When reviewing the PEP 3151 implementation (*), Ezio commented that
"FileSystemError" looks a bit strange and that "FilesystemError" would
be a better spelling. What is your opinion?


It's a file system (two words), not filesystem (not in any dictionary or 
spell checker I've ever used).


(Nor do we write filingsystem, governmentsystem, politicalsystem or 
schoolsystem. This is English, not German.)




--
Steven



Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Stefan Krah
Barry Warsaw  wrote:
> My online dictionaries prefer "file system" to be two words, so for me,
> FileSystemError is preferred.

+1


Stefan Krah





Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Ethan Furman

Antoine Pitrou wrote:

Hello,

When reviewing the PEP 3151 implementation (*), Ezio commented that
"FileSystemError" looks a bit strange and that "FilesystemError" would
be a better spelling. What is your opinion?


FileSystemError


Re: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-23 Thread Antoine Pitrou
On Tue, 23 Aug 2011 20:43:25 +0200
Charles-François Natali  wrote:
> > Please consider this invitation to stick your head into an interesting
> > problem:
> > http://bugs.python.org/issue6721
> 
> Just for the record, I'm now in favor of the atfork mechanism. It
> won't solve the problem for I/O locks, but it'll at least make room
> for a clean and cross-library way to set up atfork handlers. I just
> skimmed over it, but it seemed Gregory's atfork module could be a good
> starting point.

Well, I would consider the I/O locks the most glaring problem. Right
now, your program can freeze if you happen to do a fork() while e.g.
the stderr lock is taken by another thread (which is quite common when
debugging).

Regards

Antoine.




Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Barry Warsaw
On Aug 23, 2011, at 08:39 PM, Ross Lagerwall wrote:

>> When reviewing the PEP 3151 implementation (*), Ezio commented that
>> "FileSystemError" looks a bit strange and that "FilesystemError" would
>> be a better spelling. What is your opinion?
>
>I don't think it really matters since both "file system" and
>"filesystem" appear to be in common usage.
>
>I would say +1 to "FileSystemError" -- i.e. take file system as two
>words.

My online dictionaries prefer "file system" to be two words, so for me,
FileSystemError is preferred.

-Barry




Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Brian Curtin
On Tue, Aug 23, 2011 at 13:20, Antoine Pitrou  wrote:

>
> Hello,
>
> When reviewing the PEP 3151 implementation (*), Ezio commented that
> "FileSystemError" looks a bit strange and that "FilesystemError" would
> be a better spelling. What is your opinion?
>
> (*) http://bugs.python.org/issue12555
>
> Thank you
>
> Antoine.


I don't care all that much but I'm reminded of the .NET FileSystemWatcher
class, so put me down for +0.5 on FileSystemError.


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Nadeem Vawda
On Tue, Aug 23, 2011 at 8:39 PM, Ross Lagerwall  wrote:
>> When reviewing the PEP 3151 implementation (*), Ezio commented that
>> "FileSystemError" looks a bit strange and that "FilesystemError" would
>> be a better spelling. What is your opinion?

I think "FilesystemError" looks nicer, but it's not something I'd lose
sleep over either way.

Cheers,
Nadeem


Re: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-23 Thread Charles-François Natali
2011/8/23, Nir Aides :
> Hi all,

Hello Nir,

> Please consider this invitation to stick your head into an interesting
> problem:
> http://bugs.python.org/issue6721

Just for the record, I'm now in favor of the atfork mechanism. It
won't solve the problem for I/O locks, but it'll at least make room
for a clean and cross-library way to set up atfork handlers. I just
skimmed over it, but it seemed Gregory's atfork module could be a good
starting point.

cf


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Ross Lagerwall
> When reviewing the PEP 3151 implementation (*), Ezio commented that
> "FileSystemError" looks a bit strange and that "FilesystemError" would
> be a better spelling. What is your opinion?

I don't think it really matters since both "file system" and
"filesystem" appear to be in common usage.

I would say +1 to "FileSystemError" -- i.e. take file system as two
words.

Cheers
Ross



Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Sandro Tosi
On Tue, Aug 23, 2011 at 20:20, Antoine Pitrou  wrote:
> When reviewing the PEP 3151 implementation (*), Ezio commented that
> "FileSystemError" looks a bit strange and that "FilesystemError" would
> be a better spelling. What is your opinion?

FilesystemError.

Cheers,
-- 
Sandro Tosi (aka morph, morpheus, matrixhasu)
My website: http://matrixhasu.altervista.org/
Me at Debian: http://wiki.debian.org/SandroTosi


[Python-Dev] FileSystemError or FilesystemError?

2011-08-23 Thread Antoine Pitrou

Hello,

When reviewing the PEP 3151 implementation (*), Ezio commented that
"FileSystemError" looks a bit strange and that "FilesystemError" would
be a better spelling. What is your opinion?

(*) http://bugs.python.org/issue12555

Thank you

Antoine.




[Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-23 Thread Nir Aides
Hi all,

Please consider this invitation to stick your head into an interesting
problem:
http://bugs.python.org/issue6721

Nir


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Martin v. Löwis
> Even with tagged pointers, you could just provide a macro that unpacks
> the pointer to the buffer for a given string kind.

These macros are indeed available.

> I don't think there's
> much more to be done to keep up the abstraction. I don't see a reason to
> prevent users from accessing the memory buffer directly, especially not
> by (accidental, as I understand it) obfuscation through a void*.

It's not about preventing them from accessing the representation. It's
an "internal public" structure just as all other object layouts (i.e.
feel free to use them, but expect them to change with the next release).

However, I still think that people rarely will:
- most code treats strings as opaque, just as any other PyObject*
- code that is aware of strings typically wants them in an encoded
  form, often UTF-8, or whatever the underlying C library expects.
- code that does need to look at individual characters should be fine
  with the accessor macros.

That said, I can readily believe that Cython would have a use for direct
access to the structure. I just wouldn't want people to rewrite their
code in four versions (three for the different 3.3 representations,
plus one for 3.2 and earlier).

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Stefan Behnel

Antoine Pitrou, 23.08.2011 16:08:

On Tue, 23 Aug 2011 16:02:54 +0200
Stefan Behnel wrote:

"Martin v. Löwis", 23.08.2011 15:17:

Has this been considered before? Was there a reason to decide against it?


I think we simply didn't consider it. An early version of the PEP used
the lower bits for the pointer to encode the kind, in which case it even
stopped being a pointer. Modules are not expected to access this
pointer except through the macros, so it may not matter that much.


The difference is that you *could* access them directly in a safe way, if
it was a union.

So, for an efficient character loop, replicated for performance reasons or
for character range handling reasons or whatever, you could just check the
string kind and then jump to the loop implementation that handles that
type, without using any further macros.


Macros are useful to shield the abstraction from the implementation. If
you access the members directly, and the unicode object is represented
differently in some future version of Python (say e.g. with tagged
pointers), your code doesn't compile anymore.


Even with tagged pointers, you could just provide a macro that unpacks the 
pointer to the buffer for a given string kind. I don't think there's much 
more to be done to keep up the abstraction. I don't see a reason to prevent 
users from accessing the memory buffer directly, especially not by 
(accidental, as I understand it) obfuscation through a void*.


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Nick Coghlan
On Tue, Aug 23, 2011 at 11:21 PM, Victor Stinner
 wrote:
> On 23/08/2011 15:06, "Martin v. Löwis" wrote:
>>
>> Well, things have to be done in order:
>> 1. the PEP needs to be approved
>> 2. the performance bottlenecks need to be identified
>> 3. optimizations should be applied.
>
> I would not vote for the PEP if it slows down Python, especially if it's
> much slower. But Torsten says that it speeds up Python, which is surprising.
> I have to do my own benchmarks :-)

As Martin noted, cache misses hurt performance so much on modern
processors that making things use less memory overall can actually be
a speed optimisation as well. Guessing where the remaining bottlenecks
are is unlikely to be effective - profiling of the preliminary
implementation will be needed.

However, the idea that reducing the size of pure ASCII strings (which
include all the identifiers in most code) by a factor of 2 or 4 (or
so) results in a net speed increase definitely sounds plausible to me,
even for non-string processing code.
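
The size difference can be observed directly on an interpreter that implements
PEP 393 (CPython 3.3 or later is assumed here; the variable names are only
illustrative). The sketch prints the in-memory size of three strings of equal
length whose widest character needs one, two, and four bytes respectively.

```python
import sys

ascii_text = "a" * 1000            # every code point fits in one byte
bmp_text = "\u0100" * 1000         # needs the two-byte representation
astral_text = "\U00010000" * 1000  # needs the four-byte representation

for label, text in [("ASCII", ascii_text),
                    ("BMP", bmp_text),
                    ("astral", astral_text)]:
    # getsizeof includes the object header, so compare trends, not ratios.
    print(label, len(text), sys.getsizeof(text))
```

No sizes are quoted here because the exact byte counts depend on the
interpreter version and platform.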

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Antoine Pitrou
On Tue, 23 Aug 2011 16:02:54 +0200
Stefan Behnel  wrote:
> "Martin v. Löwis", 23.08.2011 15:17:
> >> Has this been considered before? Was there a reason to decide against it?
> >
> > I think we simply didn't consider it. An early version of the PEP used
> > the lower bits for the pointer to encode the kind, in which case it even
> > stopped being a pointer. Modules are not expected to access this
> > pointer except through the macros, so it may not matter that much.
> 
> The difference is that you *could* access them directly in a safe way, if 
> it was a union.
> 
> So, for an efficient character loop, replicated for performance reasons or 
> for character range handling reasons or whatever, you could just check the 
> string kind and then jump to the loop implementation that handles that 
> type, without using any further macros.

Macros are useful to shield the abstraction from the implementation. If
you access the members directly, and the unicode object is represented
differently in some future version of Python (say e.g. with tagged
pointers), your code doesn't compile anymore.

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Nick Coghlan
On Tue, Aug 23, 2011 at 11:17 PM, "Martin v. Löwis"  wrote:
>> Has this been considered before? Was there a reason to decide against it?
>
> I think we simply didn't consider it. An early version of the PEP used
> the lower bits for the pointer to encode the kind, in which case it even
> stopped being a pointer. Modules are not expected to access this
> pointer except through the macros, so it may not matter that much.
>
> OTOH, it's certainly not too late to change it.

It would make the macro implementations a bit clearer, so +1 for the
union approach from me.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Stefan Behnel

"Martin v. Löwis", 23.08.2011 15:17:

Has this been considered before? Was there a reason to decide against it?


I think we simply didn't consider it. An early version of the PEP used
the lower bits for the pointer to encode the kind, in which case it even
stopped being a pointer. Modules are not expected to access this
pointer except through the macros, so it may not matter that much.


The difference is that you *could* access them directly in a safe way, if 
it was a union.


So, for an efficient character loop, replicated for performance reasons or 
for character range handling reasons or whatever, you could just check the 
string kind and then jump to the loop implementation that handles that 
type, without using any further macros.


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Victor Stinner

On 23/08/2011 15:06, "Martin v. Löwis" wrote:

Well, things have to be done in order:
1. the PEP needs to be approved
2. the performance bottlenecks need to be identified
3. optimizations should be applied.


I would not vote for the PEP if it slows down Python, especially if it's 
much slower. But Torsten says that it speeds up Python, which is 
surprising. I have to do my own benchmarks :-)


Victor


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Antoine Pitrou

> Well, things have to be done in order:
> 1. the PEP needs to be approved
> 2. the performance bottlenecks need to be identified
> 3. optimizations should be applied.

Sure, but the whole point of the PEP is to improve performance (I am
dumping "memory consumption" in the "performance" bucket). That is, I
suppose it will get approval based on its demonstrated benefits.

> I'm not sure what you mean by "stringlib-like" approach - if you are
> talking about templating, I'd rather avoid this for maintainability
> reasons, unless significant improvements can be demonstrated. Torsten
> had a version that used macros for that, and it was a pain to debug.

The point of templating is precisely to avoid macros, so that the code
is natural to read and write and the compiler gives you the right line
number when it finds an error.

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Martin v. Löwis
> Has this been considered before? Was there a reason to decide against it?

I think we simply didn't consider it. An early version of the PEP used
the lower bits for the pointer to encode the kind, in which case it even
stopped being a pointer. Modules are not expected to access this
pointer except through the macros, so it may not matter that much.

OTOH, it's certainly not too late to change it.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Martin v. Löwis
> So why would you need three separate implementations of the unrolled
> loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.

Depending on where the speedup comes from in this optimization, it
may well be that the overhead of figuring out where to store the
result eats the gain from the fast test.

> Even without taking into account the unrolled loop, I wonder how much
> slower UTF-8 decoding becomes with that approach, by the way. 

In some cases, tests show that it gets faster, overall, compared to 3.2.
This is probably because strings take less memory, which means less
copying, more cache locality, etc.

Of course, it still may be possible to apply micro-optimizations to
the new implementation.

> Instead of
> testing the "kind" variable at each loop iteration, using a
> stringlib-like approach may be a better deal IMO.

Well, things have to be done in order:
1. the PEP needs to be approved
2. the performance bottlenecks need to be identified
3. optimizations should be applied.

I'm not sure what you mean by "stringlib-like" approach - if you are
talking about templating, I'd rather avoid this for maintainability
reasons, unless significant improvements can be demonstrated. Torsten
had a version that used macros for that, and it was a pain to debug.
So we put correctness and readability first.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Antoine Pitrou
On Tuesday 23 August 2011 at 13:51 +0200, "Martin v. Löwis" wrote:
> > This optimization was done when trying to improve the speed of text I/O.
> 
> So what speedup did it achieve, for the kind of data you talked about?

Since I don't have the number anymore, I've just saved the contents of
https://linuxfr.org/news/le-noyau-linux-est-disponible-en-version%C2%A030
as a "linuxfr.html" file and then did:

$ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()"
1000 loops, best of 3: 859 usec per loop

After disabling the fast path, I ran the micro-benchmark again:

$ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()"
1000 loops, best of 3: 1.09 msec per loop

so that's a 20% speedup.
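
Spelling out the arithmetic behind that figure, using only the two timings
quoted above (the variable names are just for this sketch):

```python
fast = 859e-6   # seconds per loop with the ASCII fast path
slow = 1.09e-3  # seconds per loop with the fast path disabled

time_saved = (slow - fast) / slow  # fraction of run time removed
speedup = slow / fast              # how many times faster

print("time saved: {:.1%}, speedup factor: {:.2f}x".format(time_saved, speedup))
# prints "time saved: 21.2%, speedup factor: 1.27x"
```

So "20%" is the reduction in run time; expressed as a ratio the fast path is
about 1.27 times faster on this one file.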

> > Do you have three copies of the UTF-8 decoder already, or do you use a
> > stringlib-like approach?
> 
> It's a single implementation - see for yourself.

So why would you need three separate implementations of the unrolled
loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.

Even without taking into account the unrolled loop, I wonder how much
slower UTF-8 decoding becomes with that approach, by the way. Instead of
testing the "kind" variable at each loop iteration, using a
stringlib-like approach may be a better deal IMO.

Of course we would first need to have various benchmark numbers once the
current PEP 393 implementation is complete.

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Stefan Behnel

Torsten Becker, 22.08.2011 20:58:

I have implemented an initial version of PEP 393 -- "Flexible String
Representation" as part of my Google Summer of Code project.  My patch
is hosted as a repository on bitbucket [1] and I created a related
issue on the bug tracker [2].  I posted documentation for the current
state of the development in the wiki [3].


One thing that occurred to me regarding the object struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;       /* Number of code points in the string */
    void *str;               /* Canonical, smallest-form Unicode buffer */
    Py_hash_t hash;          /* Hash value; -1 if not set */
    int state;               /* != 0 if interned. In this case the two
                              * references from the dictionary to this
                              * object are *not* counted in ob_refcnt.
                              * See SSTATE_KIND_* for other bits */
    Py_ssize_t utf8_length;  /* Number of bytes in utf8, excluding the
                              * terminating \0. */
    char *utf8;              /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;  /* Number of code points in wstr, possible
                              * surrogates count as two code points. */
    wchar_t *wstr;           /* wchar_t representation (null-terminated) */
} PyUnicodeObject;


Wouldn't the "normal" approach be to use a union for the str field? I.e.

union str {
    unsigned char* latin1;
    Py_UCS2* ucs2;
    Py_UCS4* ucs4;
}

Given that they're all pointers, all fields have the same size, but I find 
it more readable to write


u.str.latin1

than

((const unsigned char*)u.str)

Plus, the three types would be given by the struct, rather than by a 
per-usage cast.


Has this been considered before? Was there a reason to decide against it?

Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Martin v. Löwis
> This optimization was done when trying to improve the speed of text I/O.

So what speedup did it achieve, for the kind of data you talked about?

> Do you have three copies of the UTF-8 decoder already, or do you use a
> stringlib-like approach?

It's a single implementation - see for yourself.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Antoine Pitrou

> >> Is it really extremely common to have strings that are mostly-ASCII but
> >> not completely ASCII? I would agree that pure ASCII strings are
> >> extremely common.
> > Mostly ascii is pretty common for western-european languages (French, for
> > instance, is probably 90 to 95% ascii). It's also a risk in english, when
> > the writer "correctly" spells foreign words (résumé and the like).
> 
> I know - I still question whether it is "extremely common" (so much as
> to justify a special case).

Well, it's:
- all natural languages based on a variant of the latin alphabet
- but also, XML, JSON, HTML documents...
- and log files...
- in short, any kind of parsable format which is structurally ASCII but
can contain arbitrary unicode

So I would say *most* unicode data out there is mostly-ASCII, even when
it has Japanese characters in it. The rationale is that most unicode
data processed by computers is structured.
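
A quick way to check this claim against one's own data is sketched below. The
helper ascii_fraction and the sample JSON document are invented for the
example; they are not measurements from this thread.

```python
def ascii_fraction(text):
    """Return the share of code points in text that are plain ASCII."""
    if not text:
        return 1.0
    return sum(1 for ch in text if ord(ch) < 128) / len(text)


# A structured document: ASCII syntax wrapped around a Japanese value.
doc = '{"title": "日本語", "lang": "ja"}'
print(round(ascii_fraction(doc), 2))
```

Running the same function over real log files or HTML pages would show how
often a decoder sees long ASCII runs interrupted by a few wider characters.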

This optimization was done when trying to improve the speed of text I/O.

> In the PEP 393 approach, if the string has a two-byte representation,
> each character needs to be widened to two bytes, and likewise for four
> bytes. So three separate copies of the unrolled loop would be needed,
> one for each target size.

Do you have three copies of the UTF-8 decoder already, or do you use a
stringlib-like approach?

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Martin v. Löwis
Am 23.08.2011 11:46, schrieb Xavier Morel:
> On 2011-08-23, at 10:55 , Martin v. Löwis wrote:
>>> - “The UTF-8 decoding fast path for ASCII only characters was removed
>>>  and replaced with a memcpy if the entire string is ASCII.” 
>>>  The fast path would still be useful for mostly-ASCII strings, which
>>>  are extremely common (unless UTF-8 has become a no-op?).
>>
>> Is it really extremely common to have strings that are mostly-ASCII but
>> not completely ASCII? I would agree that pure ASCII strings are
>> extremely common.
> Mostly ASCII is pretty common for Western European languages (French, for
> instance, is probably 90 to 95% ASCII). It's also a risk in English, when
> the writer "correctly" spells foreign words (résumé and the like).

I know - I still question whether it is "extremely common" (so much as
to justify a special case). I.e. on what application with what dataset
would you gain what speedup, at the expense of what amount of extra
lines, and potential slow-down for other datasets?

For the record, the optimization in question is the one where it masks
a long word with 0x80808080L, to see whether it is completely
ASCII, and then copies four characters in an unrolled fashion. It stops
doing so when it sees a non-ASCII character, and returns to that mode
when it gets to the next aligned memory address that stores only ASCII
characters.
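
A minimal sketch of the masking idea, written in Python for readability
rather than taken from the C decoder. It only finds the length of the leading
ASCII run, four bytes at a time; the alignment handling and the return to the
fast path after a non-ASCII character are left out. The names ASCII_MASK and
ascii_prefix_length are made up for this illustration.

```python
ASCII_MASK = 0x80808080  # the high bit of each of four bytes


def ascii_prefix_length(data: bytes) -> int:
    """Return how many leading bytes of data are ASCII.

    Four bytes at a time are tested against ASCII_MASK; the remainder
    is checked byte by byte.
    """
    i = 0
    n = len(data)
    # Fast path: a word is pure ASCII if no byte has its high bit set.
    while i + 4 <= n:
        word = int.from_bytes(data[i:i + 4], "little")
        if word & ASCII_MASK:
            break  # at least one byte in this word is non-ASCII
        i += 4
    # Slow path: finish the current word (or the tail) one byte at a time.
    while i < n and data[i] < 0x80:
        i += 1
    return i
```

For example, ascii_prefix_length("abcdefé".encode("utf-8")) returns 6: the
first word passes the mask test, the second fails it, and the byte loop then
stops at the first byte of the encoded "é".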

In the PEP 393 approach, if the string has a two-byte representation,
each character needs to be widened to two bytes, and likewise for four
bytes. So three separate copies of the unrolled loop would be needed,
one for each target size.
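
To illustrate why the target size matters, here is a small Python sketch of
the width selection described in PEP 393 (one byte per character up to
U+00FF, two bytes up to U+FFFF, four bytes otherwise) and of widening each
character to that size. The helper names storage_width and widen are
hypothetical, and little-endian output is an arbitrary choice for the
example; the real implementation does this in C.

```python
def storage_width(text: str) -> int:
    """Return the bytes per character needed for text: 1, 2 or 4."""
    max_code_point = max(map(ord, text), default=0)
    if max_code_point < 0x100:
        return 1
    if max_code_point < 0x10000:
        return 2
    return 4


def widen(text: str) -> bytes:
    """Store every character of text using the same fixed width."""
    width = storage_width(text)
    return b"".join(ord(ch).to_bytes(width, "little") for ch in text)
```

A pure ASCII string stays at one byte per character, but a single character
above U+00FF forces every character in the string, including the ASCII ones,
to be written at the wider size. That is the copy step that would differ
between the three variants of the unrolled loop.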

Regards,
Martin



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Xavier Morel
On 2011-08-23, at 10:55 , Martin v. Löwis wrote:
>> - “The UTF-8 decoding fast path for ASCII only characters was removed
>>  and replaced with a memcpy if the entire string is ASCII.” 
>>  The fast path would still be useful for mostly-ASCII strings, which
>>  are extremely common (unless UTF-8 has become a no-op?).
> 
> Is it really extremely common to have strings that are mostly-ASCII but
> not completely ASCII? I would agree that pure ASCII strings are
> extremely common.
Mostly ASCII is pretty common for Western European languages (French, for
instance, is probably 90 to 95% ASCII). It's also a risk in English, when
the writer "correctly" spells foreign words (résumé and the like).


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Stefan Behnel

"Martin v. Löwis", 23.08.2011 10:55:

>> - “The UTF-8 decoding fast path for ASCII only characters was removed
>>    and replaced with a memcpy if the entire string is ASCII.”
>>    The fast path would still be useful for mostly-ASCII strings, which
>>    are extremely common (unless UTF-8 has become a no-op?).
>
> Is it really extremely common to have strings that are mostly-ASCII but
> not completely ASCII?


Maybe not as "extremely common" as pure ASCII strings, but at least for 
western European languages, "mostly ASCII" strings are very common indeed.


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Martin v. Löwis
> - “The UTF-8 decoding fast path for ASCII only characters was removed
>   and replaced with a memcpy if the entire string is ASCII.” 
>   The fast path would still be useful for mostly-ASCII strings, which
>   are extremely common (unless UTF-8 has become a no-op?).

Is it really extremely common to have strings that are mostly-ASCII but
not completely ASCII? I would agree that pure ASCII strings are
extremely common.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-23 Thread Stefan Behnel

Torsten Becker, 22.08.2011 20:58:

> I have implemented an initial version of PEP 393 -- "Flexible String
> Representation" as part of my Google Summer of Code project.  My patch
> is hosted as a repository on bitbucket [1] and I created a related
> issue on the bug tracker [2].  I posted documentation for the current
> state of the development in the wiki [3].


Very cool!

I've started fixing up Cython for it.

One thing I noticed: on platforms where wchar_t is signed, the comparison 
to "128U" in the Py_UNICODE_ISSPACE() macro may issue a warning when 
applied to a Py_UNICODE value (the type the macro was previously officially 
defined for). For the sake of portability of existing code, this may be 
worth a work-around.


Personally, I wouldn't really mind getting this warning, given that it's 
better to use Py_UCS4 instead of Py_UNICODE. But it may turn out to be an 
annoyance for users, because their code that does this isn't actually 
broken in the new world.


And one thing that I find unfortunate is that we need a new (unexpected) 
_GET_LENGTH() next to the existing (and obvious) _GET_SIZE(), but I guess 
that's a somewhat small price to pay for backwards compatibility...


Stefan
