Torsten Becker, 24.08.2011 04:41:
Also, common, now simple, checks for unicode->str == NULL would look
more ambiguous with a union (unicode->str.latin1 == NULL).
You could just add yet another field any, i.e.
union {
    void* any;
    unsigned char* latin1;
    Py_UCS2* ucs2;
    Py_UCS4* ucs4;
} str;
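For illustration only, the shape of such a union (with the suggested `any`
member) can be mimicked from Python with ctypes; the names here mirror the C
snippet above and are otherwise an assumption, not any real CPython API. A
zeroed union reads as NULL through `any`, which is what keeps the check simple:

```python
# Hypothetical mirror of the proposed union, for illustration only;
# the field names follow the C snippet above, not a real CPython API.
import ctypes

class StrData(ctypes.Union):
    _fields_ = [
        ("any", ctypes.c_void_p),                    # the extra "any" member
        ("latin1", ctypes.POINTER(ctypes.c_ubyte)),
        ("ucs2", ctypes.POINTER(ctypes.c_uint16)),
        ("ucs4", ctypes.POINTER(ctypes.c_uint32)),
    ]

d = StrData()          # zero-initialized, i.e. a NULL pointer
print(d.any is None)   # the NULL test needs no member-specific spelling
```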
Nick Coghlan writes:
Since I tend to use the one word 'filesystem' form myself (ditto for
'filename'), I'm +1 for FilesystemError, but I'm only -0 for
FileSystemError (so I expect that will be the option chosen, given
other responses).
I slightly prefer FilesystemError because it parses
On 8/23/2011 5:46 PM, Terry Reedy wrote:
On 8/23/2011 6:20 AM, Martin v. Löwis wrote:
Am 23.08.2011 11:46, schrieb Xavier Morel:
Mostly ASCII is pretty common for Western European languages (French, for
instance, is probably 90 to 95% ASCII). It's also a risk in English, when
the writer
Le 24/08/2011 04:41, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou solip...@pitrou.net wrote:
Macros are useful to shield the abstraction from the implementation. If
you access the members directly, and the unicode object is represented
differently in some future
Le 24/08/2011 06:59, Scott Dial a écrit :
On 8/23/2011 6:38 PM, Victor Stinner wrote:
Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
- You could try to run stringbench, which can be found at
http://svn.python.org/projects/sandbox/trunk/stringbench (*)
and there's iobench (the
Le 24/08/2011 04:41, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 18:27, Victor Stinner
victor.stin...@haypocalc.com wrote:
I posted a patch to re-add it:
http://bugs.python.org/issue12819#msg142867
Thank you for the patch! Note that this patch adds the fast path only
to the helper
So am I correctly reading between the lines when, after reading this
thread so far, and the complete issue discussion so far, that I see a
PEP 393 revision or replacement that has the following characteristics:
1) Narrow builds are dropped.
PEP 393 already drops narrow builds.
2) There
Terry Reedy writes:
The current UCS2 Unicode string implementation, by design, quickly gives
WRONG answers for len(), iteration, indexing, and slicing if a string
contains any non-BMP (surrogate pair) Unicode characters. That may have
been excusable when there essentially were no such
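The wrong answers come from UTF-16's surrogate mechanism: every code point
above U+FFFF is stored as two code units on a narrow build, so counting code
units miscounts characters. A sketch of the standard split, in plain Python
with no CPython internals:

```python
def utf16_surrogates(cp):
    """Split a non-BMP code point into its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 | (cp >> 10)    # leading (high) surrogate
    low = 0xDC00 | (cp & 0x3FF)   # trailing (low) surrogate
    return high, low

# U+10400 occupies two code units, which is why a narrow build
# reported len(chr(0x10400)) == 2 while a wide build reports 1.
print([hex(u) for u in utf16_surrogates(0x10400)])  # ['0xd801', '0xdc00']
```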
Le 24/08/2011 04:56, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 18:56, Victor Stinner
victor.stin...@haypocalc.com wrote:
kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still
necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape().
If it can be removed, it
On 24Aug2011 12:31, Nick Coghlan ncogh...@gmail.com wrote:
| On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano st...@pearwood.info wrote:
| Antoine Pitrou wrote:
| When reviewing the PEP 3151 implementation (*), Ezio commented that
| FileSystemError looks a bit strange and that FilesystemError
On 8/24/2011 1:18 AM, Martin v. Löwis wrote:
So am I correctly reading between the lines when, after reading this
thread so far, and the complete issue discussion so far, that I see a
PEP 393 revision or replacement that has the following characteristics:
1) Narrow builds are dropped.
PEP 393
On 8/24/2011 4:11 AM, Victor Stinner wrote:
Le 24/08/2011 06:59, Scott Dial a écrit :
On 8/23/2011 6:38 PM, Victor Stinner wrote:
Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
- You could try to run stringbench, which can be found at
Am 24.08.2011 10:17, schrieb Victor Stinner:
Le 24/08/2011 04:41, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 18:27, Victor Stinner
victor.stin...@haypocalc.com wrote:
I posted a patch to re-add it:
http://bugs.python.org/issue12819#msg142867
Thank you for the patch! Note that this
On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
Terry Reedy writes:
The current UCS2 Unicode string implementation, by design, quickly gives
WRONG answers for len(), iteration, indexing, and slicing if a string
contains any non-BMP (surrogate pair) Unicode characters. That may have
I think the value for wstr/uninitialized/reserved should not be
removed. The wstr representation is still used in the error case in
the utf8 decoder because these strings can be resized.
In Python, you can resize an object if it has only one reference. Why is
it not possible in your
When reviewing the PEP 3151 implementation (*), Ezio commented that
FileSystemError looks a bit strange and that FilesystemError would
be a better spelling. What is your opinion?
(*) http://bugs.python.org/issue12555
+1 for FileSystemError
Eli
The buildbots are complaining about some of tests for the new
socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
provide CMSG_LEN.
http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%202%203.x/builds/831/steps/test/logs/stdio
Before I start trying to figure
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy tjre...@udel.edu wrote:
In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of different solution to the 'mostly
BMP chars, few non-BMP chars' case. Rather than expand every character from
2 bytes to
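The actual utf16.py from issue 12729 is not reproduced here; the following is
only a guess at the general idea: keep the 2-byte code units as-is, record the
offsets where the few surrogate pairs start, and translate a character index
to a code-unit offset by counting the pairs that precede it.

```python
# Toy sketch of character-index -> code-unit-offset translation for a
# mostly-BMP UTF-16 buffer (an assumption, not Terry Reedy's utf16.py).
import bisect

def high_surrogate_offsets(units):
    """Code-unit offsets where surrogate pairs start."""
    return [i for i, u in enumerate(units) if 0xD800 <= u <= 0xDBFF]

def unit_offset(highs, char_index):
    """Map a character index to its UTF-16 code-unit offset."""
    # The k-th pair starts at character index highs[k] - k, since each
    # earlier pair consumed one extra code unit.
    char_starts = [h - k for k, h in enumerate(highs)]
    pairs_before = bisect.bisect_left(char_starts, char_index)
    return char_index + pairs_before

# "a", U+10400, "b" as UTF-16 code units:
units = [0x0061, 0xD801, 0xDC00, 0x0062]
highs = high_surrogate_offsets(units)   # [1]
# characters 0, 1, 2 live at code-unit offsets 0, 1, 3
```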
The buildbots are complaining about some of tests for the new
socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
provide CMSG_LEN.
Looks like kernel bugs:
http://developer.apple.com/library/mac/#qa/qa1541/_index.html
Yes. Mac OS X 10.5 fixes a number of kernel bugs related
Nick Coghlan, 24.08.2011 15:06:
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of different solution to the 'mostly
BMP chars, few non-BMP chars' case. Rather than expand every character
Terry Reedy writes:
Excuse me for believing the fine 3.2 manual that says
Strings contain Unicode characters.
The manual is wrong, then, subject to a pronouncement to the contrary,
of course. I was on your side of the fence when this was discussed,
pre-release. I was wrong then. My bet is
On Thu, 25 Aug 2011 01:34:17 +0900
Stephen J. Turnbull step...@xemacs.org wrote:
Martin has long claimed that the fact that I/O is done in terms of
UTF-16 means that the internal representation is UTF-16
Which I/O?
On Wed, 24 Aug 2011 15:31:50 +0200
Charles-François Natali neolo...@free.fr wrote:
The buildbots are complaining about some of tests for the new
socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
provide CMSG_LEN.
Looks like kernel bugs:
+1 for FileSystemError. I see myself misspelling it as FileSystemError if we
go with the alternate spelling. I probably won't be the only one.
Thank you,
Vlad
On Wed, Aug 24, 2011 at 4:09 AM, Eli Bendersky eli...@gmail.com wrote:
When reviewing the PEP 3151 implementation (*), Ezio commented
Antoine Pitrou writes:
On Thu, 25 Aug 2011 01:34:17 +0900
Stephen J. Turnbull step...@xemacs.org wrote:
Martin has long claimed that the fact that I/O is done in terms of
UTF-16 means that the internal representation is UTF-16
Which I/O?
Eg, display of characters in the
Le jeudi 25 août 2011 à 02:15 +0900, Stephen J. Turnbull a écrit :
Antoine Pitrou writes:
On Thu, 25 Aug 2011 01:34:17 +0900
Stephen J. Turnbull step...@xemacs.org wrote:
Martin has long claimed that the fact that I/O is done in terms of
UTF-16 means that the internal
Le 24/08/2011 02:46, Terry Reedy a écrit :
On 8/23/2011 9:21 AM, Victor Stinner wrote:
Le 23/08/2011 15:06, Martin v. Löwis a écrit :
Well, things have to be done in order:
1. the PEP needs to be approved
2. the performance bottlenecks need to be identified
3. optimizations should be applied.
PEP 393 abolishes narrow builds as we now know them and changes
semantics. I was answering a complaint about that change. If you do
not like the PEP, fine.
No, I do like the PEP. However, it is only a step, a rather
conservative one in some ways, toward conformance to the Unicode
Eg, display of characters in the interpreter.
I don't know why you say it's done in terms of UTF-16, then. Unicode
strings are simply encoded to whatever character set is detected as the
terminal's character set.
I think what he means (and what I meant when I said something similar):
I/O
Le 24/08/2011 11:22, Glenn Linderman a écrit :
c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient
I see neither a need nor a means to consider these.
The discussion about mostly ASCII strings seems convincing that there
Guido has agreed to eventually pronounce on PEP 393. Before that can
happen, I'd like to collect feedback on it. There have been a number
of voice supporting the PEP in principle, so I'm now interested in
comments in the following areas:
- principle objection. I'll list them in the PEP.
- issues
On Wed, 24 Aug 2011 20:15:24 +0200
Martin v. Löwis mar...@v.loewis.de wrote:
- issues to be considered (unclarities, bugs, limitations, ...)
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any
In article 20110824184927.2697b...@pitrou.net,
Antoine Pitrou solip...@pitrou.net wrote:
On Wed, 24 Aug 2011 15:31:50 +0200
Charles-François Natali neolo...@free.fr wrote:
The buildbots are complaining about some of tests for the new
socket.sendmsg/recvmsg added by issue #6560 for *nix
On 8/24/2011 1:50 PM, Martin v. Löwis wrote:
I'd like to point out that the improved compatibility is only a side
effect, not the primary objective of the PEP.
Then why does the Rationale start with on systems only supporting
UTF-16, users complain that non-BMP characters are not properly
On Wed, 24 Aug 2011 11:37:20 -0700
Ned Deily n...@acm.org wrote:
In article 20110824184927.2697b...@pitrou.net,
Antoine Pitrou solip...@pitrou.net wrote:
On Wed, 24 Aug 2011 15:31:50 +0200
Charles-François Natali neolo...@free.fr wrote:
The buildbots are complaining about some of tests
On 8/24/2011 9:00 AM, Stefan Behnel wrote:
Nick Coghlan, 24.08.2011 15:06:
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of different solution to the
'mostly
BMP chars, few non-BMP chars'
But Snow Leopard, where these failures occur, is OS X 10.6.
*sighs*
It still looks like a kernel/libc bug to me: AFAICT, both the code and
the tests are correct.
And apparently, there are still issues pertaining to FD passing on
10.5 (and maybe later, I couldn't find a public access to their bug
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote:
On 8/24/2011 9:00 AM, Stefan Behnel wrote:
Nick Coghlan, 24.08.2011 15:06:
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
In utf16.py, attached to http://bugs.python.org/issue12729
I propose for
On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
Terry Reedy writes:
Excuse me for believing the fine 3.2 manual that says
Strings contain Unicode characters.
The manual is wrong, then, subject to a pronouncement to the contrary,
Please suggest a re-wording then, as it is a bug for
In article
cah_1em30t-8g9ubdprumksl_yisclpuiffz32z4w0y1pcjj...@mail.gmail.com,
Charles-Francois Natali cf.nat...@gmail.com wrote:
But Snow Leopard, where these failures occur, is OS X 10.6.
*sighs*
It still looks like a kernel/libc bug to me: AFAICT, both the code and
the tests are
In article 20110824205047.6be49...@pitrou.net,
Antoine Pitrou solip...@pitrou.net wrote:
On Wed, 24 Aug 2011 11:37:20 -0700
Ned Deily n...@acm.org wrote:
In article 20110824184927.2697b...@pitrou.net,
Antoine Pitrou solip...@pitrou.net wrote:
On Wed, 24 Aug 2011 15:31:50 +0200
On 8/24/2011 1:45 PM, Victor Stinner wrote:
Le 24/08/2011 02:46, Terry Reedy a écrit :
I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example,
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.
Terry Reedy wrote:
PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF)
on all platforms with a small memory footprint and only O(1) functions.
For Windows users, I believe it will nearly double the memory footprint
if there are any non-BMP chars. On my new machine, I should
Le mercredi 24 août 2011 20:52:51, Glenn Linderman a écrit :
Given the required variability of character size in all presently
Unicode defined encodings, I tend to agree with Tom that UTF-8, together
with some technique of translating character index to code unit offset,
may provide the best
For Windows users, I believe it will nearly double the memory footprint
if there are any non-BMP chars. On my new machine, I should not mind
that in exchange for correct behavior.
In addition, strings with non-BMP chars are much more rare than strings
with all Latin-1, for which memory usage
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
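The truncated struct aside, the space/width trade-off is observable from
Python on any CPython that ended up implementing PEP 393 (3.3 and later):
sys.getsizeof grows with the widest code point a string contains.

```python
# Observing PEP 393's per-width storage from the outside (CPython 3.3+).
import sys

ascii_s  = "a" * 100             # ASCII: 1 byte per character, smallest header
latin1_s = "\u00fc" * 100        # Latin-1: still 1 byte per character
ucs2_s   = "\u0394" * 100        # BMP beyond Latin-1: 2 bytes per character
ucs4_s   = "\U00010400" * 100    # non-BMP: 4 bytes per character

sizes = [sys.getsizeof(s) for s in (ascii_s, latin1_s, ucs2_s, ucs4_s)]
print(sizes == sorted(sizes))    # wider characters -> larger strings
```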
On 8/24/2011 12:34 PM, Guido van Rossum wrote:
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote:
On 8/24/2011 9:00 AM, Stefan Behnel wrote:
Nick Coghlan, 24.08.2011 15:06:
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
In utf16.py, attached to
On 25 August 2011 07:10, Victor Stinner victor.stin...@haypocalc.com wrote:
I used stringbench and ./python -m test test_unicode. I plan to try
iobench.
Which other benchmark tool should be used? Should we write a new one?
I think that the PyPy benchmarks (or at least selected tests such as
On Wed, Aug 24, 2011 at 3:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote:
It would seem helpful if the stdlib could have some support for efficient
handling of Unicode characters in some representation. It would help
address the class of applications that does care.
I claim that we have
Antoine Pitrou writes:
Le jeudi 25 août 2011 à 02:15 +0900, Stephen J. Turnbull a écrit :
Antoine Pitrou writes:
On Thu, 25 Aug 2011 01:34:17 +0900
Stephen J. Turnbull step...@xemacs.org wrote:
Martin has long claimed that the fact that I/O is done in terms of
Terry Reedy writes:
Please suggest a re-wording then, as it is a bug for doc and behavior to
disagree.
Strings contain Unicode code units, which for most purposes can be
treated as Unicode characters. However, even as simple an
operation as s1[0] == s2[0] cannot be relied upon
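That caveat is easy to demonstrate with canonically equivalent strings, using
only the stdlib unicodedata module:

```python
import unicodedata

s1 = "\u00e9"     # 'é' as one precomposed code point (NFC form)
s2 = "e\u0301"    # 'e' plus combining acute accent (NFD form)

# The strings render identically but compare unequal code point by
# code point; only after normalization do they compare equal.
print(s1 == s2)                                  # False
print(s1[0] == s2[0])                            # False
print(unicodedata.normalize("NFC", s2) == s1)    # True
```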
Guido van Rossum writes:
I see nothing wrong with having the language's fundamental data types
(i.e., the unicode object, and even the re module) to be defined in
terms of codepoints, not characters, and I see nothing wrong with
len() returning the number of codepoints (as long as it is
On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
turnb...@sk.tsukuba.ac.jp wrote:
Terry Reedy writes:
Please suggest a re-wording then, as it is a bug for doc and behavior to
disagree.
Strings contain Unicode code units, which for most purposes can be
treated as Unicode
On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull step...@xemacs.org wrote:
Guido van Rossum writes:
I see nothing wrong with having the language's fundamental data types
(i.e., the unicode object, and even the re module) to be defined in
terms of codepoints, not characters, and I
On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum gu...@python.org wrote:
Now I am happy to admit that for many Unicode issues the level at
which we have currently defined things (code units, I think -- the
thingies that encodings are made of) is confusing, and it would be
better to switch to
On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan ncogh...@gmail.com wrote:
On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum gu...@python.org wrote:
Now I am happy to admit that for many Unicode issues the level at
which we have currently defined things (code units, I think -- the
thingies that
On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum gu...@python.org wrote:
With narrow builds, code units can currently come into play
internally, but with PEP 393 everything internal will be working
directly with code points. Normalisation, combining characters and
bidi issues may still affect
Guido van Rossum writes:
On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
turnb...@sk.tsukuba.ac.jp wrote:
Strings contain Unicode code units, which for most purposes can be
treated as Unicode characters. However, even as simple an
operation as s1[0] == s2[0] cannot be
Victor Stinner, 25.08.2011 00:29:
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
Martin v. Löwis, 24.08.2011 20:15:
Guido has agreed to eventually pronounce on PEP 393. Before that can
happen, I'd like to collect feedback on it. There have been a number
of voice supporting the PEP in principle
Absolutely.
- conditions you would like to pose on the implementation before
On 8/24/2011 7:29 PM, Guido van Rossum wrote:
(Hey, I feel a QOTW coming. Standards? We don't need no stinkin'
standards. http://en.wikipedia.org/wiki/Stinking_badges :-)
Which deserves an appropriate, follow-on, misquote:
Guido says the Unicode standard stinks.
˚͜˚ - and a Unicode smiley to
Nick Coghlan writes:
GvR writes:
Let's just define a Unicode string to be a sequence of code points and
let libraries deal with the rest. Ok, methods like lower() should
consider characters, but indexing/slicing should refer to code points.
Same for '=='; we can have a library that
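On a build where strings are sequences of code points (wide builds, or
CPython once PEP 393 landed), that division of responsibilities already
matches observable behavior:

```python
# len(), indexing and slicing all count code points, not UTF-16 code units.
s = "a\U00010400b"   # one astral (non-BMP) character in the middle

print(len(s))                  # 3
print(s[1] == "\U00010400")    # True: indexing returns the whole code point
print(len(s[1:2]))             # 1: slicing never splits a surrogate pair
```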