Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-31 Thread Steve Holden
Adam Olsen wrote:
 On 10/30/05, François Pinard [EMAIL PROTECTED] wrote:
 
All development is done in house by French people.  All documentation,
external or internal, comments, identifier and function names,
everything is in French.  Some of the developers here have had a long
programming life, while they only barely read English.  It is surely
a constant frustration, for some of us, having to mangle identifiers by
ravelling out their necessary diacritics.  It does not look good, it
does not smell good, and in many cases, mangling identifiers
significantly decreases program legibility.
 
 
 Hear, hear!  Not all the world uses English, and restricting them to
 Latin characters simply means it's not readable in any language.  It
 doesn't make it any more readable for those of us who only understand
 English.
 
 +1 on internationalized identifiers.
 
While I agree with the sentiments expressed, I think we should not 
underestimate the practical problems that moving away fr

Therefore, if such steps are really going to be considered, I would 
really like to see them introduced in such a way that no breakage occurs 
for existing users, even the parochial ones who feel they (and their 
programs) don't need to understand anything but ASCII.

If this means starting out with the features conditionally compiled,
despite the added cost of the #ifdefs that would thereby be engendered,
I think that would be a good idea.

We can fix their programs by making Unicode the default string type, but 
it will take much longer to fix their thinking.

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-31 Thread Greg Ewing
François Pinard wrote:

 All development is done in house by French people.  All documentation, 
 external or internal, comments, identifier and function names, 
 everything is in French.

There's nothing stopping you from creating your own
Frenchified version of Python that lets you use all
the characters you want, for your own in-house use.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | A citizen of NewZealandCorp, a   |
Christchurch, New Zealand  | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-31 Thread François Pinard
[Greg Ewing]

 All development is done in house by French people.  All documentation, 
 external or internal, comments, identifier and function names, 
 everything is in French.

 There's nothing stopping you from creating your own Frenchified 
 version of Python that lets you use all the characters you want, for 
 your own in-house use.

No doubt that we, you and me and everybody, could all have our own 
little version of Python.  :-)

To tell the whole truth, the very topic of your suggestion has already
been discussed in-house, and the decision was to stick to mainstream
Python.  We could not justify to our administration modifying the Python
sources in a way that would oblige us to invest in maintenance each
time a new Python version appears, forever.

On the other hand, we may reasonably guess that many people in this 
world would love being as comfortable as possible using Python, while 
naming identifiers naturally.  It is not so unreasonable that we keep 
some _hope_ that Guido will soon choose to help us all, not only me.

-- 
François Pinard   http://pinard.progiciels-bpi.ca


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-30 Thread François Pinard
[Martin von Löwis]

 My canonical example is François Pinard, who keeps requesting it, 
 saying that local people were surprised they couldn't use accented
 characters in Python.  Perhaps that's because he actually is Quebecian 
 :-)

I presume I should comment a bit on this.

People here are not surprised they couldn't use accented characters,
they are rather saddened, and some hoped that Python would offer that
possibility, one of these days.  Bear in mind that here, every production
program or system has been progressively rewritten in Python, slowly at
first, and more aggressively as confidence built up, to the point that
not much non-Python code remains by now.  So, all our hopes are
concentrated into a single language.

All development is done in house by French people.  All documentation, 
external or internal, comments, identifier and function names, 
everything is in French.  Some of the developers here have had a long 
programming life, while they only barely read English.  It is surely 
a constant frustration, for some of us, having to mangle identifiers by 
ravelling out their necessary diacritics.  It does not look good, it 
does not smell good, and in many cases, mangling identifiers 
significantly decreases program legibility.
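
A minimal sketch of what this "mangling" costs, written for Python 3 (an assumption: this thread predates it, and Python 3 does accept such identifiers). The helper `strip_diacritics` and the sample words are hypothetical, chosen only to show two distinct French words collapsing into one ASCII name:

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Return `name` with combining marks removed (the "mangling" step)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two distinct French words ("peach"/"fishing" and "sin") ...
original = ["pêche", "péché"]
# ... become indistinguishable once the diacritics are removed.
mangled = [strip_diacritics(word) for word in original]

# Python 3 accepts the accented forms as identifiers; str.isidentifier() checks this.
accepted = [word.isidentifier() for word in original]

print(mangled, accepted)
```

The collision is the legibility loss being described: the reader has to recover the intended word from context.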

Now, I keep reading strange arguments from people who oppose our using
national letters in identifiers, disturbed by the fact they would have
a hard time reading our code or publishing it.  Even worse, some want to
protect us (and the world) against ourselves, using made-up, irrational
arguments, producing false logic out of their own emotions and feelings.
They would like us to think, write, and publish in English.  Is it some
anachronistic colonialism?  Quite possibly.  It surely has some success,
as you may find some French people who will only swear in English! :-)

For one, in my programming life, I surely chose to write a lot of
English code, and I still think English is a good vehicle for planetary
communication.  However, I like it to be my choice.  I have always felt
very open and collaborative with like-minded people, and for them,
happily rewrote my things from French to English in view of sharing,
whenever I saw some mutual advantage to it.

I resent when people want to force me into English when I have no real 
reason to do so.  Let me choose to use my own language, as nicely as 
I can, when working in-shop with people sharing this language with me, 
for programs that will likely never be published outside anyway.  
Internationalisation is already granted in our overall view of today's
programming, as a way for letting people be comfortable with computers, 
each in his/her own language.  This comfort should extend widely to 
naming main programming objects (functions, classes, variables, modules) 
as legibly as possible.  Here, I mean legible in an ideal way for the 
team or the local community, and not necessarily legible to the whole 
planet.  It does not always have to be planetary, you know.

For keywords, the need is less stringent, as syntactical constructs are 
part of a language.  When English is opaque to a programmer, he/she can 
easily learn that small set of words making the syntax, understanding 
their effect, even while not necessarily understanding the real English 
meaning of those keywords.  This is not a real obstacle in practice.

It is true that many Python tools are not prepared to handle 
internationalised identifiers, and it is very unlikely that these tools 
will get ready before Python opens itself to internationalised 
identifiers.  Let's open Python first, tools will undoubtedly follow.
There will be some adaptation period, but after some while, everything 
will fall in place, things will become smooth again and just natural to 
everybody, to the point many of us might remember the current times, and 
wonder what was all that fuss about.  :-)

-- 
François Pinard   http://pinard.progiciels-bpi.ca


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-30 Thread Adam Olsen
On 10/30/05, François Pinard [EMAIL PROTECTED] wrote:
 All development is done in house by French people.  All documentation,
 external or internal, comments, identifier and function names,
 everything is in French.  Some of the developers here have had a long
 programming life, while they only barely read English.  It is surely
 a constant frustration, for some of us, having to mangle identifiers by
 ravelling out their necessary diacritics.  It does not look good, it
 does not smell good, and in many cases, mangling identifiers
 significantly decreases program legibility.

Hear, hear!  Not all the world uses English, and restricting them to
Latin characters simply means it's not readable in any language.  It
doesn't make it any more readable for those of us who only understand
English.

+1 on internationalized identifiers.

--
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Bengt Richter
At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote:
 Bengt Richter wrote:
  Please bear with me for a few paragraphs ;-)

 Please note that source code encoding doesn't really have
 anything to do with the way the interpreter executes the
 program - it's merely a way to tell the parser how to
 convert string literals (currently only the Unicode ones)
 into constant Unicode objects within the program text.
 It's also a nice way to let other people know what kind of
 encoding you used to write your comments ;-)

 Nothing more.
I think somehow I didn't make things clear, sorry ;-)
As I tried to show in the example of module_a.cs vs module_b.cs,
the source encoding currently results in two different str-type
strings representing the source _character_ sequence, which is the
_same_ in both cases. To make it more clear, try the following little
program (untested except on NT4 with
Python 2.4b1 (#56, Nov  3 2004, 01:47:27)
[GCC 3.2.3 (mingw special 20030504-1)] on win32 ;-):

--- t_srcenc.py ---
import os
def test():
    # write two modules whose string literal holds the same characters,
    # stored as latin-1 bytes in one and as utf-8 bytes in the other
    open('module_a.py','wb').write(
        "# -*- coding: latin-1 -*-" + os.linesep +
        "cs = '\xfcber-cool'" + os.linesep)
    open('module_b.py','wb').write(
        "# -*- coding: utf-8 -*-" + os.linesep +
        "cs = '\xc3\xbcber-cool'" + os.linesep)
    # show that we have two modules differing only in encoding:
    print ''.join(line.decode('latin-1') for line in open('module_a.py'))
    print ''.join(line.decode('utf-8') for line in open('module_b.py'))
    # see how results are affected:
    import module_a, module_b
    print module_a.cs + ' =?= ' + module_b.cs
    print module_a.cs.decode('latin-1') + ' =?= ' + module_b.cs.decode('utf-8')

if __name__ == '__main__':
    test()
---
The result copied from NT4 console to clipboard and pasted into eudora:
__

[17:39] C:\pywk\python-dev>py24 t_srcenc.py
# -*- coding: latin-1 -*-
cs = 'über-cool'

# -*- coding: utf-8 -*-
cs = 'über-cool'

nber-cool =?= ++ber-cool
über-cool =?= über-cool
__
(I'd say NT did the best it could, rendering the copied cp437
superscript n as the 'n' above, and the '++' coming from the
cp437 box characters corresponding to the '\xc3\xbc'. Not sure
how it will show on your screen, but try the program to see ;-)
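
The same effect can be sketched without writing module files. This is a Python 3 illustration (an assumption: the program above targets Python 2.4, where `str` holds raw bytes), and the variable names are hypothetical. It shows one character sequence producing two different byte strings, and what a cp437 console makes of each:

```python
text = "über-cool"

# The same characters, stored under the two declared source encodings.
latin1_bytes = text.encode("latin-1")  # 0xFC followed by "ber-cool"
utf8_bytes = text.encode("utf-8")      # 0xC3 0xBC followed by "ber-cool"

# Decoded with the matching codec, both give back the same text ...
same_text = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")
# ... but the byte strings themselves differ.
same_bytes = latin1_bytes == utf8_bytes

# A cp437 console interprets the raw bytes on its own terms.
console_a = latin1_bytes.decode("cp437")
console_b = utf8_bytes.decode("cp437")

print(same_text, same_bytes, console_a, console_b)
```

The superscript n and the two box-drawing characters are what the clipboard paste flattened to 'n' and '++' in the output above.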

 Once a module is compiled, there's no distinction between
 a module using the latin-1 source code encoding or one using
 the utf-8 encoding.
ISTM module_a.cs and module_b.cs can readily be distinguished after
compilation, whereas the sources displayed according to their declared
encodings as above (or as e.g. different editors using different native
encoding might) cannot (other than the encoding cookie itself) ;-)
Perhaps you meant something else?

 Thanks,
You're welcome.

Regards,
Bengt Richter



Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread M.-A. Lemburg
Neil Hodgson wrote:
 M.-A. Lemburg:
 
 
Unicode has the concept of combining code points, e.g. you can
store an é (e with an accent) as e + '. Now if you slice
off the accent, you'll break the character that you encoded
using combining code points.
...
next_<indextype>(u, index) -> integer

Returns the Unicode object index for the start of the next
indextype found after u[index] or -1 in case no next element
of this type exists.
 
 
Should entity breakage be further discouraged by returning a slice
 here rather than an object index?

You mean a slice that slices out the next indextype ?

Something like:
 
 i = first_grapheme(u)
 x = 0
 while x < width and u[i] != "\n":
     x, _ = draw(u[i], (x, y))
     i = next_grapheme(u, i)

This sounds a lot like you'd want iterators for the various
index types. Should be possible to implement on top of the
proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc.
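
A rough sketch of what such an iterator could look like in Python 3. Only the name `itergraphemes` comes from the text above; the implementation is an assumption and a deliberate simplification: it treats a base code point plus any following combining marks as one grapheme and ignores the fuller segmentation rules of the Unicode standard.

```python
import unicodedata
from typing import Iterator


def itergraphemes(u: str) -> Iterator[str]:
    """Yield approximate graphemes as slices of `u`.

    Simplification: a grapheme is a code point with combining class 0
    followed by zero or more combining marks.
    """
    start = 0
    for i in range(1, len(u)):
        if unicodedata.combining(u[i]) == 0:
            # A new base code point begins, so the previous cluster is complete.
            yield u[start:i]
            start = i
    if u:
        yield u[start:]


# "café olé" written with combining acute accents (U+0301).
clusters = list(itergraphemes("cafe\u0301 ole\u0301"))
print(len(clusters), clusters)
```

Because the iterator yields slices rather than indexes, a caller never holds an index that points into the middle of a cluster.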

Note that what most people refer to as a character is a
grapheme in Unicode speak. Given that interpretation,
breaking Unicode characters is something you won't
ever work around by using larger code units such
as UCS4 compatible ones.

Furthermore, you should also note that surrogates (two
code units encoding one code point) are part of Unicode
life. While you don't need them when storing Unicode
in UCS4 code units, they can still be part of the
Unicode data and the programmer has to be aware of
these.
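
For reference, the surrogate arithmetic can be sketched as follows (Python 3; the helper name `to_surrogate_pair` is hypothetical). It splits a code point outside the Basic Multilingual Plane into the two 16-bit code units that UTF-16 stores:

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D670)

# Cross-check against Python's own UTF-16 encoder: two 16-bit units.
encoded = "\U0001D670".encode("utf-16-be")
utf16_units = len(encoded) // 2

print(hex(high), hex(low), utf16_units)
```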

I personally don't think that slicing Unicode is
such a big issue. If you know what you are doing,
things tend not to break - which is true for pretty
much everything you do in programming ;-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 25 2005)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread M.-A. Lemburg
Bengt Richter wrote:
 At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote:
 
Bengt Richter wrote:

Please bear with me for a few paragraphs ;-)

Please note that source code encoding doesn't really have
anything to do with the way the interpreter executes the
program - it's merely a way to tell the parser how to
convert string literals (currently only the Unicode ones)
into constant Unicode objects within the program text.
It's also a nice way to let other people know what kind of
encoding you used to write your comments ;-)

Nothing more.
 
 I think somehow I didn't make things clear, sorry ;-)
 As I tried to show in the example of module_a.cs vs module_b.cs,
 the source encoding currently results in two different str-type
 strings representing the source _character_ sequence, which is the
 _same_ in both cases. 

I don't follow you here. The source code encoding
is only applied to Unicode literals (you are using string
literals in your example). String literals are passed
through as-is.

Whether or not your editor will use the source
code encoding marker is really up to your editor
and not within the scope of Python.

If you open the two module files in Emacs, you'll
see identical renderings of the string literals.
With other editors, you may have to explicitly tell
the editor which encoding to assume. Ditto for shell
printouts.

 To make it more clear, try the following little
 program (untested except on NT4 with
 Python 2.4b1 (#56, Nov  3 2004, 01:47:27)
 [GCC 3.2.3 (mingw special 20030504-1)] on win32 ;-):
 
 --- t_srcenc.py ---
 import os
 def test():
     # write two modules whose string literal holds the same characters,
     # stored as latin-1 bytes in one and as utf-8 bytes in the other
     open('module_a.py','wb').write(
         "# -*- coding: latin-1 -*-" + os.linesep +
         "cs = '\xfcber-cool'" + os.linesep)
     open('module_b.py','wb').write(
         "# -*- coding: utf-8 -*-" + os.linesep +
         "cs = '\xc3\xbcber-cool'" + os.linesep)
     # show that we have two modules differing only in encoding:
     print ''.join(line.decode('latin-1') for line in open('module_a.py'))
     print ''.join(line.decode('utf-8') for line in open('module_b.py'))
     # see how results are affected:
     import module_a, module_b
     print module_a.cs + ' =?= ' + module_b.cs
     print module_a.cs.decode('latin-1') + ' =?= ' + module_b.cs.decode('utf-8')

 if __name__ == '__main__':
     test()
 ---
 The result copied from NT4 console to clipboard and pasted into eudora:
 __
 
 [17:39] C:\pywk\python-dev>py24 t_srcenc.py
 # -*- coding: latin-1 -*-
 cs = 'über-cool'
 
 # -*- coding: utf-8 -*-
 cs = 'über-cool'
 
 nber-cool =?= ++ber-cool
 über-cool =?= über-cool
 __
 (I'd say NT did the best it could, rendering the copied cp437
 superscript n as the 'n' above, and the '++' coming from the
 cp437 box characters corresponding to the '\xc3\xbc'. Not sure
 how it will show on your screen, but try the program to see ;-)

Once a module is compiled, there's no distinction between
a module using the latin-1 source code encoding or one using
the utf-8 encoding.
 
 ISTM module_a.cs and module_b.cs can readily be distinguished after
 compilation, whereas the sources displayed according to their declared
 encodings as above (or as e.g. different editors using different native
 encoding might) cannot (other than the encoding cookie itself) ;-)
 Perhaps you meant something else?

What your editor displays to you is not within the scope
of Python, e.g. if you open the files in Emacs you'll see
something different than in Notepad.

I guess that's the price you have to pay for being able to write
programs that can include Unicode literals using the complete range
of possible Unicode characters without having to revert to
escapes.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Martin v. Löwis
Bill Janssen wrote:
 I just got mail this morning from a researcher who wants exactly what
 Martin described, and wondered why the default MacPython 2.4.2 didn't
 provide it by default. :-)

If all he wants is to represent Deseret, he can do so in a 16-bit
Unicode type, too: Python supports UTF-16.

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Bill Janssen
I think he was more interested in the invariant Martin proposed, that

 len(\U0001)

should always be the same and should always be 1.

Bill


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Guido van Rossum
On 10/25/05, Bill Janssen [EMAIL PROTECTED] wrote:
 I think he was more interested in the invariant Martin proposed, that

  len(\U0001)

 should always be the same and should always be 1.

Yes but why? What does this invariant do for him?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Martin v. Löwis
Guido van Rossum wrote:
 Yes but why? What does this invariant do for him?

I don't know about this person, but there are a few things that
don't work properly in UTF-16 mode:

- the Unicode character database fails to look things up.
   u"\U0001D670".isupper() gives false, but should give true
   (since it denotes MATHEMATICAL MONOSPACE CAPITAL A).
   It gives true in UCS-4 mode.
- As a result, normalization on these doesn't work, either.
   It should normalize to LATIN CAPITAL LETTER A under
   NFKC, but doesn't.
- regular expressions only have limited support. In
   particular, adding non-BMP characters to character classes
   is not possible. u"[\U0001D670]" will match any character
   that is either \uD835 or \uDE70, whereas it only matches
   MATHEMATICAL MONOSPACE CAPITAL A in UCS-4 mode.

There might be more limitations, but those are the ones that
come to mind easily. While I could imagine fixing the first
two with some effort, the third one is really tricky (unless
you would accept a wide representation of a character
class even if the Unicode representation is only narrow).
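
The three behaviours listed above can be checked with a short script. This sketch assumes Python 3.3 or later, where strings always behave like the UCS-4 case described here; on a narrow Python 2 build the results would be the failing ones from the list. The variable names are hypothetical.

```python
import re
import unicodedata

ch = "\U0001D670"  # MATHEMATICAL MONOSPACE CAPITAL A

# 1. Character database lookups.
name = unicodedata.name(ch)
is_upper = ch.isupper()

# 2. Compatibility normalization folds it to the plain Latin letter.
nfkc = unicodedata.normalize("NFKC", ch)

# 3. A character class holding a non-BMP character matches that character
#    as a whole, not its individual surrogates.
pattern = re.compile("[\U0001D670]")
matches_itself = pattern.fullmatch(ch) is not None
matches_plain_a = pattern.fullmatch("A") is not None

print(name, is_upper, nfkc, matches_itself, matches_plain_a, len(ch))
```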

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Neil Hodgson
M.-A. Lemburg:

 You mean a slice that slices out the next indextype ?

   Yes.

 This sounds a lot like you'd want iterators for the various
 index types. Should be possible to implement on top of the
 proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc.

   Iterators may be helpful, but can also be too restrictive when the
processing is not completely iterative, such as peeking ahead or
looking behind to wrap at a word boundary in the display example.
There should be

  It was more that there may be less scope for error if there were a
move away from indexes to slices. The PEP provides ways to specify
what you want to examine or modify, but it looks to me like returning
indexes will lead to code repetition or additional variables, with an
increase in fragility.

 Note that what most people refer to as character is a
 grapheme in Unicode speak.

   A grapheme-oriented string type may be worthwhile although you'd
probably have to choose a particular normalisation form to ease
processing.

 Given that interpretation,
 breaking Unicode characters is something you won't
 ever work around by using larger code units such
 as UCS4 compatible ones.

   I still think we can reduce the scope for errors.

 Furthermore, you should also note that surrogates (two
 code units encoding one code point) are part of Unicode
 life. While you don't need them when storing Unicode
 in UCS4 code units, they can still be part of the
 Unicode data and the programmer has to be aware of
 these.

   Many programmers can and will ignore surrogates. One day that may
bite them but we can't close off text processing to those who have no
idea of what surrogates are, or directional marks, or that sorting is
locale dependent, or have no understanding of the difference between
NFC and NFKD normalization forms.

   Neil


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Martin v. Löwis
Neil Hodgson wrote:
I'd like to more tightly define Unicode strings for Python 3000.
 Currently, Unicode strings may be implemented with either 2 byte
 (UCS-2) or 4 byte (UTF-32) elements. Python should allow strings to
 contain any Unicode character and should be indexable yielding
 characters rather than half characters. Therefore Python strings
 should appear to be UTF-32. There could still be multiple
 implementations (using UTF-16 or UTF-8) to preserve space but all
 implementations should appear to be the same apart from speed and
 memory use.

That's very tricky. If you have multiple implementations, you make
usage at the C API difficult. If you make it either UTF-8 or UTF-32,
you make PythonWin difficult. If you make it UTF-16, you make indexing
difficult.
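
To make the space side of this trade-off concrete, here is a small Python 3 sketch (the sample string and variable names are hypothetical) comparing how many bytes the three encodings need for the same text:

```python
# Six BMP characters (one of them accented) plus one non-BMP character.
sample = "na\u00efve \U0001D670"

code_points = len(sample)

# Encoded size in bytes; the -le variants avoid counting a byte order mark.
sizes = {
    encoding: len(sample.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}

print(code_points, sizes)
```

Only UTF-32 gives a fixed four bytes per code point, which is what makes indexing trivial; the other two save space at the cost of variable-width elements.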

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Martin v. Löwis
Phillip J. Eby wrote:
 I'm tempted to say it would be even better if there was a command line 
 option that could be used to force all binary opens to result in bytes, and 
 require all text opens to specify an encoding.

For Python 3000? -1. There shouldn't be command line switches that have
that much importance.

For Python 2.x? Well, we are not supposed to discuss this.
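
For comparison, Python 3 as later released separates the two kinds of open without any command line switch: binary mode yields bytes, text mode yields decoded strings. It does not require an explicit encoding, although passing one, as in this sketch, avoids depending on the platform default. The file name and contents are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Write known bytes so that the result does not depend on the platform.
    with open(path, "wb") as f:
        f.write("über-cool\n".encode("utf-8"))

    # Binary open: the result is bytes, with no decoding applied.
    with open(path, "rb") as f:
        raw = f.read()

    # Text open with an explicit encoding: the result is str.
    with open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()

print(type(raw).__name__, type(text).__name__)
```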

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Neil Hodgson
Martin v. Löwis:

 That's very tricky. If you have multiple implementations, you make
 usage at the C API difficult. If you make it either UTF-8 or UTF-32,
 you make PythonWin difficult. If you make it UTF-16, you make indexing
 difficult.

   For Windows, the code will get a little uglier, needing to perform
an allocation/encoding and deallocation more often than at present but
I don't think there will be a speed degradation as Windows is
currently performing a conversion from 8 bit to UTF-16 inside many
system calls. To minimize the cost of allocation, Python could copy
Windows in keeping a small number of commonly sized preallocated
buffers handy.

   For indexing UTF-16, a flag could be set to show if the string is
all in the base plane and if not, an index could be constructed when
and if needed. It'd be good to get some feel for what proportion of
string operations performed require indexing. Many, such as
startswith, split, and concatenation don't require indexing. The
proportion of operations that use indexing to scan strings would also
be interesting as adding a (currentIndex, currentOffset) cursor to
string objects would be another approach.
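
A minimal sketch of the flag-plus-lazy-index idea in Python 3. The class `Utf16Text` and its attribute names are hypothetical, it supports only non-negative integer indexes, and it assumes well-formed input; it is meant to show the shape of the approach, not a proposed implementation.

```python
import struct


class Utf16Text:
    """Text stored as UTF-16 code units but indexed by code point."""

    def __init__(self, text: str):
        raw = text.encode("utf-16-le")
        self._units = struct.unpack("<%dH" % (len(raw) // 2), raw)
        # Flag: no surrogates means unit index == code point index.
        self.all_bmp = not any(0xD800 <= u <= 0xDFFF for u in self._units)
        self._offsets = None  # code point index -> unit index, built lazily

    def _index(self):
        if self._offsets is None:
            offsets, i = [], 0
            while i < len(self._units):
                offsets.append(i)
                # A high surrogate starts a two-unit code point.
                i += 2 if 0xD800 <= self._units[i] <= 0xDBFF else 1
            self._offsets = offsets
        return self._offsets

    def __len__(self):
        return len(self._units) if self.all_bmp else len(self._index())

    def __getitem__(self, n: int) -> str:
        if self.all_bmp:
            start, end = n, n + 1  # fast path, no index needed
        else:
            offsets = self._index()
            start = offsets[n]  # raises IndexError when out of range
            end = offsets[n + 1] if n + 1 < len(offsets) else len(self._units)
        units = self._units[start:end]
        if not units:
            raise IndexError("index out of range")
        return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")


mixed = Utf16Text("a\U0001D670b")
plain = Utf16Text("abc")
print(len(mixed), mixed[1] == "\U0001D670", plain.all_bmp)
```

Operations such as startswith or concatenation would never touch `_offsets`, which is the reason for building it only on demand.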

   Neil


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread M.-A. Lemburg
Neil Hodgson wrote:
 Guido van Rossum:
 
 
Folks, please focus on what Python 3000 should do.

I'm thinking about making all character strings Unicode (possibly with
different internal representations a la NSString in Apple's Objective
C) and introduce a separate mutable bytes array data type. But I could
use some validation or feedback on this idea from actual
practitioners.
 
 
I'd like to more tightly define Unicode strings for Python 3000.
 Currently, Unicode strings may be implemented with either 2 byte
 (UCS-2) or 4 byte (UTF-32) elements. Python should allow strings to
 contain any Unicode character and should be indexable yielding
 characters rather than half characters. Therefore Python strings
 should appear to be UTF-32. There could still be multiple
 implementations (using UTF-16 or UTF-8) to preserve space but all
 implementations should appear to be the same apart from speed and
 memory use.

There seems to be a general misunderstanding here: even if you
have UCS4 storage, it is still possible to slice a Unicode
string in a way which makes rendering it correctly impossible.

Unicode has the concept of combining code points, e.g. you can
store an é (e with an accent) as e + '. Now if you slice
off the accent, you'll break the character that you encoded
using combining code points.

Note that combining code points are rather common in encodings
of Asian scripts, so this is not an artificial example.
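
The breakage is easy to reproduce. The sketch below uses Python 3 (an assumption; the same holds for a UCS4 build of Python 2 with u"" literals) and hypothetical variable names:

```python
import unicodedata

# "é" stored as a base letter plus COMBINING ACUTE ACCENT (U+0301).
decomposed = "e\u0301"

# Slicing by code point cuts the accent off, whatever the storage width.
sliced = decomposed[:1]

# Normalizing to NFC helps here because a precomposed é exists ...
composed = unicodedata.normalize("NFC", decomposed)

# ... but not every combination has a precomposed form, so NFC is no general fix.
still_two = unicodedata.normalize("NFC", "q\u0301")

print(len(decomposed), sliced, len(composed), len(still_two))
```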

Some time ago I proposed a new module called unicodeindex
to help with indexing. It would solve most of the indexing
issues you run into when dealing with Unicode. I've attached
it to this email for reference.

More on the used terms:

http://www.egenix.com/files/python/EuroPython2002-Python-and-Unicode.pdf
http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf

-- 
Marc-Andre Lemburg
eGenix.com

PEP: 0XXX
Title: Unicode Indexing Helper Module
Version: $Revision: 1.0 $
Author: [EMAIL PROTECTED] (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History: 

Abstract

This PEP proposes a new module unicodeindex which provides 
means to index Unicode objects in various higher level abstractions
of characters.

Problem and Terminology

Unicode objects can be indexed just like string objects using what
in Unicode terms is called a code unit as index basis.  

Code units are the storage entities used by the Unicode
implementation to store a single Unicode information unit and do
not necessarily map 1-1 to code points which are the smallest
entities encoded by the Unicode standard. Python exposes code
units to the programmer via the Unicode object indexing and slicing
API, e.g. u[10] or u[12:15] refer to the code units at index 10
and indices 12 to 14.

These code points can sometimes be composed to form graphemes
which are then displayed by the Unicode output device as one
character. A word is then a sequence of characters separated by
space characters or punctuation, a line is a sequence of code
points separated by line breaking code point sequences.

For addressing Unicode, there are basically five different methods
by which you can reference the data:

1. per code unit   (codeunit)
2. per code point  (codepoint)
3. per grapheme    (grapheme)
4. per word        (word)
5. per line        (line)

The indexing type name is given in parentheses and used in the
module interface.

Proposed Solution

I propose to add a new module to the standard Python library which
provides interfaces implementing the above indexing methods.

Module Interface

The module should provide the following interfaces for all four
indexing styles:

next_<indextype>(u, index) -> integer

Returns the Unicode object index for the start of the next
indextype found after u[index] or -1 in case no next element
of this type exists.

prev_<indextype>(u, index) -> integer

Returns the Unicode object index for the start of the previous
indextype found before u[index] or -1 in case no previous
element of this type exists.

indextype_index(u, n) -> integer

Returns the Unicode object index for the start of the n-th
indextype element in u. Raises an IndexError in case no n-th
element can be found.

indextype_count(u, index) -> integer

Counts the number of complete indextype elements found in
u[:index] and returns the count 
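
As a rough illustration of the interface above, the grapheme variants
might look like the sketch below. This is not the PEP's implementation.
It assumes Python 3 (str indexed by code point) and approximates a
grapheme as a base character plus any following combining marks, which
is much cruder than the Unicode text segmentation rules.

```python
import unicodedata

def next_grapheme(u, index):
    """Index of the start of the next grapheme after u[index], or -1."""
    i = index + 1
    while i < len(u) and unicodedata.combining(u[i]):
        i += 1
    return i if i < len(u) else -1

def prev_grapheme(u, index):
    """Index of the start of the previous grapheme before u[index], or -1."""
    i = index
    # First back up to the start of the grapheme that contains u[index].
    while i > 0 and unicodedata.combining(u[i]):
        i -= 1
    # Then step over the preceding grapheme, if there is one.
    i -= 1
    while i > 0 and unicodedata.combining(u[i]):
        i -= 1
    return i if i >= 0 else -1

def grapheme_count(u, index):
    """Number of complete graphemes found in u[:index]."""
    count = sum(1 for ch in u[:index] if not unicodedata.combining(ch))
    # If the slice ends in the middle of a grapheme, the last one is incomplete.
    if count and index < len(u) and unicodedata.combining(u[index]):
        count -= 1
    return count

sample = "e\u0301a"  # "e" + combining acute accent, then "a"
```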

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread M.-A. Lemburg
Bengt Richter wrote:
 Please bear with me for a few paragraphs ;-)

Please note that source code encoding doesn't really have
anything to do with the way the interpreter executes the
program - it's merely a way to tell the parser how to
convert string literals (currently only the Unicode ones)
into constant Unicode objects within the program text.
It's also a nice way to let other people know what kind of
encoding you used to write your comments ;-)

Nothing more.

Once a module is compiled, there's no distinction between
a module using the latin-1 source code encoding or one using
the utf-8 encoding.
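
A small sketch of that point, assuming Python 3 (where every str literal
is decoded by the parser, not only the u'' ones): the same program text
stored under two different source encodings compiles to equal string
objects. The file names passed to compile() are placeholders.

```python
# The same source text, stored as bytes under two different encodings.
latin1_source = "# -*- coding: latin-1 -*-\ns = u'über-cool'\n".encode("latin-1")
utf8_source = "# -*- coding: utf-8 -*-\ns = u'über-cool'\n".encode("utf-8")

namespace_a, namespace_b = {}, {}
# compile() reads the coding cookie from the bytes and decodes accordingly.
exec(compile(latin1_source, "module_a.py", "exec"), namespace_a)
exec(compile(utf8_source, "module_b.py", "exec"), namespace_b)

# The bytes on disk differ, but the compiled string constants do not.
print(latin1_source != utf8_source, namespace_a["s"] == namespace_b["s"])
```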

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
 I'm thinking about making all character strings Unicode (possibly with
 different internal representations a la NSString in Apple's Objective
 C) and introduce a separate mutable bytes array data type. But I could
 use some validation or feedback on this idea from actual
 practitioners.

+1 from me, too.

 I'm tempted to say it would be even better if there was a command line 
 option that could be used to force all binary opens to result in bytes, and 
 require all text opens to specify an encoding.

I like this idea, too.  Presumably plain open(FILENAME, MODE) would
then result in a binary open (no encoding specified), which I've
wanted for a long time (and which makes sense).  But it is a change.
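
For comparison, a sketch of the binary/text split using Python 3's
open(). One difference from the suggestion above: in Python 3 a plain
open(FILENAME) is a text open using a default encoding, so the binary
open must be requested with "b". The file name is a placeholder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write("über\n".encode("utf-8"))

    # Binary open: no encoding involved, the result is a byte sequence.
    with open(path, "rb") as f:
        raw = f.read()

    # Text open: the encoding is stated explicitly, the result is text.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(type(raw).__name__, type(text).__name__)
```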

Bill


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
 Python should allow strings to
 contain any Unicode character and should be indexable yielding
 characters rather than half characters. Therefore Python strings
 should appear to be UTF-32.

+1.

Bill


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Martin v. Löwis
Neil Hodgson wrote:
For Windows, the code will get a little uglier, needing to perform
 an allocation/encoding and deallocation more often than at present but
 I don't think there will be a speed degradation as Windows is
 currently performing a conversion from 8 bit to UTF-16 inside many
 system calls.
[...]
 
For indexing UTF-16, a flag could be set to show if the string is
 all in the base plane and if not, an index could be constructed when
 and if needed.

There are many design alternatives: one option would be to support
*three* internal representations in a single type, generating the
others from the one operation existing as needed. The default, initial
representation might be UTF-8, with UCS-4 only being generated when
indexing occurs, and UCS-2 only being generated when the API requires
it. On concatenation, always concatenate just one representation: either
one that is already present in both operands, else UTF-8.

  It'd be good to get some feel for what proportion of
 string operations performed require indexing. Many, such as
 startswith, split, and concatenation don't require indexing. The
 proportion of operations that use indexing to scan strings would also
 be interesting as adding a (currentIndex, currentOffset) cursor to
 string objects would be another approach.

Indeed. My guess is that indexing is more common than you think,
especially when iterating over the string. Of course, iteration
could also operate on UTF-8, if you introduced string iterator
objects.

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Antoine Pitrou

 There are many design alternatives: one option would be to support
 *three* internal representations in a single type, generating the
 others from the one operation existing as needed. The default, initial
 representation might be UTF-8, with UCS-4 only being generated when
 indexing occurs, and UCS-2 only being generated when the API requires
 it. On concatenation, always concatenate just one representation: either
 one that is already present in both operands, else UTF-8.

Wouldn't it be simpler to use:
- one-byte representation if every character <= 0xFF
- two-byte representation if every character <= 0xFFFF
- four-byte representation otherwise

Then combining several strings means using the larger representation as
a result (*). In practice, most use cases will not involve the four-byte
representation.

(*) a heuristic can be invented so that, when producing a smaller string
(by stripping/slicing/etc.), it will sometimes check whether a
narrower representation is possible.
For example: store the length of the string when the last check
occurred, and do a new check when the length falls below half that
value.
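
The width selection described above can be sketched as follows. The
function names are hypothetical, the thresholds 0xFF and 0xFFFF are the
natural limits of one and two bytes, and the shrinking heuristic from
the footnote is left out.

```python
def storage_width(s):
    """Bytes per character needed to store every code point in s."""
    largest = max(map(ord, s), default=0)
    if largest <= 0xFF:
        return 1
    if largest <= 0xFFFF:
        return 2
    return 4

def concat_width(a, b):
    """Combining two strings uses the larger of the two representations."""
    return max(storage_width(a), storage_width(b))

print(storage_width("abc"), storage_width("\u20ac"), storage_width("\U00010400"))
```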

Regards

Antoine.




Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Guido van Rossum
On 10/24/05, Martin v. Löwis [EMAIL PROTECTED] wrote:
 Indeed. My guess is that indexing is more common than you think,
 especially when iterating over the string. Of course, iteration
 could also operate on UTF-8, if you introduced string iterator
 objects.

Python's slice-and-dice model pretty much ensures that indexing is
common. Almost everything is ultimately represented as indices: regex
search results have the index in the API, find()/index() return
indices, many operations take a start and/or end index. As long as
that's the case, indexing better be fast.

Changing the APIs would be much work, although perhaps not impossible
for Python 3000. For example, Raymond Hettinger's partition() API
doesn't refer to indices at all, and can replace many uses of find()
or index().
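
A short comparison of the two styles, using str.partition() as it later
shipped in Python. The header line is just sample data.

```python
line = "Content-Type: text/plain"

# Index-based: find() returns a position that then drives two slices.
i = line.find(": ")
name_by_index, value_by_index = line[:i], line[i + 2:]

# Index-free: partition() hands back the pieces directly.
name, sep, value = line.partition(": ")

# When the separator is missing, partition() reports it with an empty sep
# instead of a -1 that silently produces wrong slices.
missing = "no separator here".partition(": ")
```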

Still, the mere existence of __getitem__ and __getslice__ on strings
makes it necessary to implement them efficiently. How realistic would
it be to drop them? What should replace them? Some kind of abstract
pointers-into-strings perhaps, but that seems much more complex.

The trick seems to be to support both simple programs manipulating
short strings (where indexing is probably the easiest API to
understand, and the additional copying is unlikely to cause
performance problems), as well as programs manipulating very large
buffers containing text and doing sophisticated string processing on
them. Perhaps we could provide a different kind of API to support the
latter, perhaps based on a mutable character buffer data type without
direct indexing?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Guido van Rossum
On 10/24/05, Martin v. Löwis [EMAIL PROTECTED] wrote:
 Guido van Rossum wrote:
  Changing the APIs would be much work, although perhaps not impossible
  for Python 3000. For example, Raymond Hettinger's partition() API
  doesn't refer to indices at all, and can replace many uses of find()
  or index().

 I think Neil's proposal is not to make them go away, but to implement
 them less efficiently. For example, if the internal representation
 is UTF-8, indexing requires linear time, as opposed to constant time.
 If the internal representation is UTF-16, and you have a flag to
 indicate whether there are any surrogates on the string, indexing
 is constant if the flag is false, else linear.

I understand all that. My point is that it's a bad idea to offer an
indexing operation that isn't O(1).
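
To see why indexing into UTF-8 is linear, here is a minimal sketch of
fetching the n-th code point from a UTF-8 byte string. The function name
is hypothetical and well-formed UTF-8 is assumed. Every lookup scans
from the start, because the byte offset of a code point depends on the
widths of all the code points before it.

```python
def utf8_codepoint_at(data, n):
    """Return the n-th code point of UTF-8 encoded bytes by linear scan."""
    count = -1
    start = None
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # not a continuation byte: a code point starts here
            count += 1
            if count == n:
                start = i
            elif count == n + 1:     # reached the start of the following code point
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # the n-th code point was the last one

encoded = "a\u00e9\U00010400z".encode("utf-8")  # widths: 1, 2, 4 and 1 bytes
```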

  Perhaps we could provide a different kind of API to support the
  latter, perhaps based on a mutable character buffer data type without
  direct indexing?

 There are different design goals conflicting here:
 - some think: all my data is ASCII, so I want to only use one
byte per character.
 - others think: all my data goes to the Windows API, so I want
to use 2 bytes per character.
 - yet others think: I want all of Unicode, with proper, efficient
indexing, so I want four bytes per char.

I doubt the last one though. Probably they really don't want efficient
indexing, they want to perform higher-level operations that currently
are only possible using efficient indexing or slicing. With the right
API, perhaps they could work just as efficiently with an internal
representation of UTF-8.

 It's not so much a matter of API as a matter of internal
 representation. The API doesn't have to change (except for the
 very low-level C API that directly exposes Py_UNICODE*, perhaps).

I think the API should reflect the representation *to some extent*,
namely it shouldn't claim to have operations that are typically
thought of as O(1) that can only be implemented as O(n). An internal
representation of UTF-8 might make everyone happy except heavy Windows
users; but it requires changes to the API so people won't be writing
Python 2.x-style string slinging code.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Neil Hodgson
M.-A. Lemburg:

 Unicode has the concept of combining code points, e.g. you can
 store an é (e with an accent) as e + '. Now if you slice
 off the accent, you'll break the character that you encoded
 using combining code points.
 ...
 next_indextype(u, index) -> integer

 Returns the Unicode object index for the start of the next
 indextype found after u[index] or -1 in case no next element
 of this type exists.

   Should entity breakage be further discouraged by returning a slice
here rather than an object index?

   Something like:

i = first_grapheme(u)
x = 0
while x < width and u[i] != "\n":
    x, _ = draw(u[i], (x, y))
    i = next_grapheme(u, i)
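
One way to read the suggestion is as an iterator over slice objects, so
that callers never hold an index that points into the middle of an
entity. The sketch below is an assumption about what that could look
like, not an agreed API. It uses Python 3 and treats a grapheme as a
base character plus following combining marks.

```python
import unicodedata

def grapheme_slices(u):
    """Yield one slice per grapheme (base character plus combining marks)."""
    start = 0
    for i in range(1, len(u) + 1):
        # A grapheme ends at the end of the string or before the next base character.
        if i == len(u) or not unicodedata.combining(u[i]):
            yield slice(start, i)
            start = i

text = "e\u0301x"
pieces = [text[s] for s in grapheme_slices(text)]
```

Each u[s] is then a whole grapheme, so a drawing loop like the one above
cannot split a base character from its accent.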

   Neil


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
  - yet others think: I want all of Unicode, with proper, efficient
 indexing, so I want four bytes per char.
 
 I doubt the last one though. Probably they really don't want efficient
 indexing, they want to perform higher-level operations that currently
 are only possible using efficient indexing or slicing. With the right
 API, perhaps they could work just as efficiently with an internal
 representation of UTF-8.

I just got mail this morning from a researcher who wants exactly what
Martin described, and wondered why the default MacPython 2.4.2 didn't
provide it by default. :-)

Bill


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Guido van Rossum
On 10/24/05, Bill Janssen [EMAIL PROTECTED] wrote:
   - yet others think: I want all of Unicode, with proper, efficient
  indexing, so I want four bytes per char.
 
  I doubt the last one though. Probably they really don't want efficient
  indexing, they want to perform higher-level operations that currently
  are only possible using efficient indexing or slicing. With the right
  API, perhaps they could work just as efficiently with an internal
  representation of UTF-8.

 I just got mail this morning from a researcher who wants exactly what
 Martin described, and wondered why the default MacPython 2.4.2 didn't
 provide it by default. :-)

Oh, I don't doubt that they want it. But often they don't *need* it,
and the higher-level goal they are trying to accomplish can be dealt
with better in a different way. (Sort of my response to people asking
for static typing in Python as well. :-)

Did they tell you what they were trying to do that MacPython 2.4.2
wouldn't let them, beyond represent a large Unicode string as an
array of 4-byte integers?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Greg Ewing
Guido van Rossum wrote:

 I think the API should reflect the representation *to some extent*,
 namely it shouldn't claim to have operations that are typically
 thought of as O(1) that can only be implemented as O(n).

Maybe a compromise could be reached by using a
btree of chunks or something, so indexing is
O(log n). Not as good as O(1) but a lot better
than O(n).
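
A simplified sketch of the chunking idea. It is not a btree: it swaps in
a sorted list of chunk start offsets searched with bisect, which is
enough to show the O(log n) lookup followed by a scan bounded by the
chunk size. The class name and the chunk_chars parameter are
hypothetical.

```python
import bisect

class ChunkedText:
    """Hypothetical container: UTF-8 chunks plus a sorted index of chunk starts."""

    def __init__(self, text, chunk_chars=64):
        self._length = len(text)
        # Code point offset at which each chunk begins (sorted ascending).
        self._starts = list(range(0, len(text), chunk_chars))
        self._chunks = [text[s:s + chunk_chars].encode("utf-8") for s in self._starts]

    def __len__(self):
        return self._length

    def __getitem__(self, n):
        if not 0 <= n < self._length:
            raise IndexError(n)
        # Binary search for the chunk, then decode only that small chunk.
        k = bisect.bisect_right(self._starts, n) - 1
        return self._chunks[k].decode("utf-8")[n - self._starts[k]]

sample = "a\u00e9\U00010400z" * 50
chunked = ChunkedText(sample, chunk_chars=7)
```

With fixed-size chunks the start offsets could be computed directly. The
search only matters once edits make the chunks uneven.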

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | A citizen of NewZealandCorp, a   |
Christchurch, New Zealand  | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Greg Ewing
Guido van Rossum wrote:

 Python's slice-and-dice model pretty much ensures that indexing is
 common. Almost everything is ultimately represented as indices: regex
 search results have the index in the API, find()/index() return
 indices, many operations take a start and/or end index.

Maybe the idea of string views should be reconsidered in
light of this. It's been criticised on the grounds that
its use could keep large strings alive longer than needed,
but if operations that currently return indices instead
returned string views, this wouldn't be any more of a
concern than it is now, especially if there is an easy
way to explicitly materialise the view as an independent
string when wanted.
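
A minimal sketch of what such a view could look like. StrView and
find_view are hypothetical names, not an existing API. The view keeps a
reference to the base string plus two offsets, and str() materialises
it as an independent string.

```python
class StrView:
    """Hypothetical read-only window onto a base string; no characters are copied."""

    def __init__(self, base, start=0, stop=None):
        self.base = base
        self.start = start
        self.stop = len(base) if stop is None else stop

    def __len__(self):
        return self.stop - self.start

    def __str__(self):
        # Explicitly materialise the view as an independent string.
        return self.base[self.start:self.stop]

def find_view(base, sub):
    """Like str.find(), but returns a StrView (or None) instead of an index."""
    i = base.find(sub)
    return None if i < 0 else StrView(base, i, i + len(sub))

text = "key = value"
hit = find_view(text, "value")
```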

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | A citizen of NewZealandCorp, a   |
Christchurch, New Zealand  | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
Guido writes:
 Oh, I don't doubt that they want it. But often they don't *need* it,
 and the higher-level goal they are trying to accomplish can be dealt
 with better in a different way. (Sort of my response to people asking
 for static typing in Python as well. :-)

I suppose that's true.  But what if they're not smart enough to figure
out that better, different, way?  I doubt you intend Python to be sort
of the Rubik's cube of programming...

And no, he didn't say why he wanted the ability to represent a
Unicode string as an array of 4-byte integers.  Though I know he's
doing something with the Deseret Alphabet, translating some early work
on American Indian culture that was transcribed in that character set.

Bill


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Jason Orendorff
-1 on keeping the source encoding of string literals.  Python should
definitely decode them at compile time.

-1 on decoding implicitly as needed.  This causes decoding to happen
late, in unpredictable places.  Decodes can fail; they should happen
as early and as close to the data source as possible.

-j


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Bob Ippolito

On Oct 23, 2005, at 3:10 PM, Jason Orendorff wrote:

 -1 on decoding implicitly as needed.  This causes decoding to happen
 late, in unpredictable places.  Decodes can fail; they should happen
 as early and as close to the data source as possible.

That's not necessarily true... Some codecs can't fail, like latin1.   
I think the main use case for this is to speed up usage of text in  
these sorts of formats anyway.
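
The difference between the two kinds of codec can be shown in a few
lines (Python 3 syntax). Latin-1 maps every byte value to a code point,
so decoding cannot fail, while UTF-8 rejects byte sequences that are not
well formed.

```python
# Every possible byte value decodes under latin-1.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("latin-1")

# The latin-1 bytes for "über" are not valid UTF-8, so this decode fails.
failure = None
try:
    b"\xfcber".decode("utf-8")
except UnicodeDecodeError as exc:
    failure = exc
```

Note that a decode that cannot fail can still produce the wrong text if
the bytes were not latin-1 to begin with.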

-bob



Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Stephan Richter
On Sunday 23 October 2005 18:10, Jason Orendorff wrote:
 -1 on keeping the source encoding of string literals.  Python should
 definitely decode them at compile time.

 -1 on decoding implicitly as needed.  This causes decoding to happen
 late, in unpredictable places.  Decodes can fail; they should happen
 as early and as close to the data source as possible.

+1. We have followed this last practice throughout Zope 3 successfully. In our 
case, the publisher framework (in other words the output-protocol-specific 
layer) is responsible for the decoding and encoding of input and output 
streams, respectively. We have been pretty much free of any encoding/decoding 
troubles since. Having our application only use unicode internally was one of 
the best decisions we have made.
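
The pattern described here, decoding at the input boundary and encoding
at the output boundary, can be sketched as below. This is not Zope's
publisher code: handle_request and shout are hypothetical names, and
Python 3 syntax is assumed.

```python
def shout(text):
    """Application logic: works on text only, never on encoded bytes."""
    return text.upper()

def handle_request(raw_body, charset="utf-8"):
    text = raw_body.decode(charset)   # decode once, at the input boundary
    reply = shout(text)               # everything in between is Unicode
    return reply.encode(charset)      # encode once, at the output boundary

response = handle_request("über".encode("latin-1"), charset="latin-1")
```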

Regards,
Stephan
-- 
Stephan Richter
CBU Physics & Chemistry (B.S.) / Tufts Physics (Ph.D. student)
Web2k - Web Software Design, Development and Training


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Guido van Rossum
Folks, please focus on what Python 3000 should do.

I'm thinking about making all character strings Unicode (possibly with
different internal representations a la NSString in Apple's Objective
C) and introduce a separate mutable bytes array data type. But I could
use some validation or feedback on this idea from actual
practitioners.

I don't want to see proposals to mess with the str/unicode semantics
in Python 2.x. Let's leave the Python 2.x str/unicode semantics alone
until Python 3000 -- we don't need multiple transitions. (Although we
could add the mutable bytes array type sooner.)
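
For readers coming to this later: Python 2.6 and 3.x did gain a mutable
byte array type, bytearray. The sketch below uses it only to illustrate
the idea of a mutable byte sequence that is kept separate from text. It
does not claim to be the design under discussion in this thread.

```python
buf = bytearray(b"GET / HTTP/1.0")

# Unlike a string, the buffer can be modified in place.
buf[0:3] = b"PUT"
buf.extend(b"\r\n")

# Turning bytes into text is an explicit step with a named encoding.
text = buf.decode("ascii")
```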

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Bob Ippolito
On Oct 23, 2005, at 6:06 PM, Guido van Rossum wrote:

 Folks, please focus on what Python 3000 should do.

 I'm thinking about making all character strings Unicode (possibly with
 different internal representations a la NSString in Apple's Objective
 C) and introduce a separate mutable bytes array data type. But I could
 use some validation or feedback on this idea from actual
 practitioners.

 I don't want to see proposals to mess with the str/unicode semantics
 in Python 2.x. Let's leave the Python 2.x str/unicode semantics alone
 until Python 3000 -- we don't need multiple transitions. (Although we
 could add the mutable bytes array type sooner.)

+1, this is precisely what I'd like to see.

-bob



Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Phillip J. Eby
At 06:06 PM 10/23/2005 -0700, Guido van Rossum wrote:
Folks, please focus on what Python 3000 should do.

I'm thinking about making all character strings Unicode (possibly with
different internal representations a la NSString in Apple's Objective
C) and introduce a separate mutable bytes array data type. But I could
use some validation or feedback on this idea from actual
practitioners.

+1.  Chandler has been going through quite an upheaval to get its unicode 
handling together.  Having a bytes type would be great, as long as there 
was support for files and sockets to produce bytes instead of strings 
(unless an encoding was specified).

I'm tempted to say it would be even better if there was a command line 
option that could be used to force all binary opens to result in bytes, and 
require all text opens to specify an encoding.  The Chandler i18n project 
lead would jump for joy if we had a way to keep legacy strings out of the 
system, apart from ASCII string constants found in code.

It would then be okay not to drop support for the implicit conversions; if 
you can't get strings on input, then conversion's not really an issue.

Anyway, I think all of the things I'd like to see can be done without 
breakage in 2.5.  For Chandler at least, we'd be willing to go with a 
command-line option that's more strict, in order to be able to ensure that 
plugin developers can't accidentally put 8-bit strings in somewhere, just 
by opening a file.



[Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-22 Thread Bengt Richter
Please bear with me for a few paragraphs ;-)

One aspect of str-type strings is the efficiency afforded when all the
encoding really is ascii. If the internal encoding were e.g. fixed
utf-16le for strings, maybe with today's computers it would still be
efficient enough for most actual string purposes (excluding the current
use of str-strings as byte sequences).

I.e., you'd still have to identify what was strings (of characters) and
what was really byte sequences with no implied or explicit encoding or
character semantics.

Ok, let's make that distinction explicit: Call one kind of string a byte
sequence and the other a character sequence (representation being a
separate issue).

A unicode object is of course the prime _general_ representation of a
character sequence in Python, but all the names in python source code
(that become NAME tokens) are UIAM also character sequences, and
representable by a byte sequence interpreted according to ascii encoding.

For the sake of discussion, suppose we had another _character_ sequence
type that was the moral equivalent of unicode except for internal
representation, namely a str subclass with an encoding attribute
specifying the encoding that you _could_ use to decode the str bytes part
to get unicode (which you wouldn't do except when necessary). We could
call it class charstr(str): ... and have charstr().bytes be the str part
and charstr().encoding specify the encoding part.

In all the contexts where we have obvious encoding information, we can
then generate a charstr instead of a str. E.g., if the source of
module_a has

    # -*- coding: latin1 -*-
    cs = 'über-cool'

then

    type(cs)    # => <type 'charstr'>
    cs.bytes    # => '\xfcber-cool'
    cs.encoding # => 'latin-1'

and print cs would act like print cs.bytes.decode(cs.encoding) -- or I guess

    sys.stdout.write(cs.bytes.decode(cs.encoding).encode(sys.stdout.encoding))

followed by

    sys.stdout.write('\n'.decode('ascii').encode(sys.stdout.encoding))

for the newline of the print.

Now if module_b has

    # -*- coding: utf8 -*-
    cs = 'über-cool'

and we interactively

    import module_a, module_b

and then

    print module_a.cs + ' =?= ' + module_b.cs

what could happen ideally vs. what we have currently?
UIAM, currently we would just get the three str byte sequences
concatenated to make

    '\xfcber-cool =?= \xc3\xbcber-cool'

and that would be printed as whatever that comes out as
without conversion when seen by the output according to
sys.stdout.encoding.
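
The current behaviour described above can be reproduced with explicit
byte strings. The sketch uses Python 3 syntax, where bytes plays the
role of the 2.x str. The concatenation mixes two encodings in one byte
sequence, so no single decode gives the intended text.

```python
from_module_a = "über-cool".encode("latin-1")   # b'\xfcber-cool'
from_module_b = "über-cool".encode("utf-8")     # b'\xc3\xbcber-cool'

mixed = from_module_a + b" =?= " + from_module_b

# Decoding as UTF-8 fails on the latin-1 half...
utf8_error = None
try:
    mixed.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc

# ...and decoding as latin-1 garbles the UTF-8 half.
as_latin1 = mixed.decode("latin-1")
```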

But if those cs instances had been charstr instances, the coding cookie
encoding information would have been preserved, and the interactive print
could have evaluated the string expression -- given cs.decode() as sugar for

    (cs.bytes.decode(cs.encoding or globals().get('__encoding__') or
     __import__('sys').getdefaultencoding()))

-- as

    module_a.cs.decode() + ' =?= '.decode() + module_b.cs.decode()

if pairwise terms differ in encoding as they might all here. If the
interactive session source were e.g. latin-1, like module_a, then

    module_a.cs + ' =?= '

would not require an encoding change, because the ' =?= ' would be a
charstr instance with encoding == 'latin-1', and so the result would still
be latin-1 that far. But with module_b.cs being utf8, the next addition
would cause the .decode() promotions to unicode. In a console window, the
' =?= '.encoding might be 'cp437' or such, and the first addition would
then cause promotion (since module_a.cs.encoding != 'cp437').
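
A rough sketch of the promotion rule, written for Python 3. It departs
from the proposal in two labeled ways: charstr here is a plain class
wrapping bytes rather than a str subclass, and the encoding is always
given explicitly instead of falling back to a module-level __encoding__.
Encoding names are normalised with codecs.lookup() so that 'latin1' and
'latin-1' compare equal.

```python
import codecs

class charstr:
    """Hypothetical sketch: encoded bytes tagged with the encoding needed to decode them."""

    def __init__(self, data, encoding):
        self.bytes = data
        self.encoding = codecs.lookup(encoding).name  # normalise spellings like 'latin1'

    def decode(self):
        return self.bytes.decode(self.encoding)

    def __add__(self, other):
        if not isinstance(other, charstr):
            return NotImplemented
        if other.encoding == self.encoding:
            # Same encoding: concatenate the bytes, no decoding needed.
            return charstr(self.bytes + other.bytes, self.encoding)
        # Different encodings: promote both operands to Unicode text.
        return self.decode() + other.decode()

a = charstr("über-cool".encode("latin-1"), "latin1")
sep = charstr(b" =?= ", "latin-1")
b = charstr("über-cool".encode("utf-8"), "utf8")

same = a + sep        # both latin-1: still a charstr
promoted = same + b   # latin-1 + utf-8: promoted to text
```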

I have sneaked in run-time access to individual modules' encodings by
assuming that the encoding cookie could be compiled in as an explicit
global __encoding__ variable for any given module (what to have as
__encoding__ for built-in modules could vary for various purposes).

ISTM this could have use in situations where an encoding assumption is
necessary and currently 'ascii' is not as good a guess as one could make,
though I suspect if string literals became charstr strings instead of str
strings, many if not most of those situations would disappear (I'm saying
this because ATM I can't think of an 'ascii'-guess situation that wouldn't
go away ;-) If there were a charchr() version of chr() that would result
in a charstr instead of a str, IWT one would want an easy-sugar default
encoding assumption, probably based on the same as one would assume for
'%c' % num in a given module source -- which presumably would be
'%c'.encoding, where '%c' assumes the encoding of the module source,
normally recorded in __encoding__. So charchr(n) would act like
chr(n).decode().encode(''.encoding) -- or more reasonably
charstr(chr(n)), which would be short for

    charstr(chr(n), globals().get('__encoding__') or
            __import__('sys').getdefaultencoding())

Or some efficient equivalent ;-)

Using strings in dicts requires hashing to find key comparison candidates
and comparison to check for key equivalence. This would seem to point to
some kind of

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-17 Thread Martin v. Löwis
Martin Blais wrote:
Yes. setdefaultencoding() is removed from sys by site.py. To get it
again you must reload sys.
 
 
 Thanks.

Actually, I should take the opportunity to advise people that
setdefaultencoding doesn't really work. With the default default
encoding, strings and Unicode objects hash equal when they are
equal. If you change the default encoding, this property
goes away (perhaps unless you change it to Latin-1). As a result,
dictionaries where you mix string and Unicode keys won't work:
you might not find a value for a string key when looking up
with a Unicode object, and vice versa.
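
The underlying rule is that objects which compare equal must also hash
equal, otherwise dictionary lookups miss. The Python 2 behaviour cannot
be reproduced directly in Python 3, so the sketch below uses a
hypothetical stand-in class, Utf8Key, that compares equal to text but
hashes its raw bytes. It only illustrates the invariant and is not how
str and unicode are implemented.

```python
class Utf8Key:
    """Hypothetical stand-in for a byte string that compares equal to text."""

    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        if isinstance(other, str):
            return self.data.decode("utf-8") == other
        return isinstance(other, Utf8Key) and self.data == other.data

    def __hash__(self):
        # Hashes the raw bytes rather than the decoded text, so two
        # "equal" keys can hash differently: the dict invariant is broken.
        return hash(("raw", self.data))

key = Utf8Key("über".encode("utf-8"))
table = {key: "found"}

equal = (key == "über")        # the two keys compare equal
lookup = table.get("über")     # but the text key's hash points elsewhere
```

Here equal is True while the lookup with the text key misses, which is
the failure mode described above.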

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-16 Thread Martin Blais
On 10/15/05, Reinhold Birkenfeld [EMAIL PROTECTED] wrote:
 Martin Blais wrote:
  On 10/3/05, Michael Hudson [EMAIL PROTECTED] wrote:
  Martin Blais [EMAIL PROTECTED] writes:
 
   How hard would that be to implement?
 
  import sys
  reload(sys)
  sys.setdefaultencoding('undefined')
 
  Hmmm any particular reason for the call to reload() here?

 Yes. setdefaultencoding() is removed from sys by site.py. To get it
 again you must reload sys.

Thanks.

cheers,


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-15 Thread Martin Blais
On 10/3/05, Michael Hudson [EMAIL PROTECTED] wrote:
 Martin Blais [EMAIL PROTECTED] writes:

  How hard would that be to implement?

 import sys
 reload(sys)
 sys.setdefaultencoding('undefined')

Hmmm any particular reason for the call to reload() here?


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-15 Thread Reinhold Birkenfeld
Martin Blais wrote:
 On 10/3/05, Michael Hudson [EMAIL PROTECTED] wrote:
 Martin Blais [EMAIL PROTECTED] writes:

  How hard would that be to implement?

 import sys
 reload(sys)
 sys.setdefaultencoding('undefined')
 
 Hmmm any particular reason for the call to reload() here?

Yes. setdefaultencoding() is removed from sys by site.py. To get it
again you must reload sys.

Reinhold

-- 
Mail address is perfectly valid!



[Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Martin Blais
Hi.

Like a lot of people (or so I hear in the blogosphere...), I've been
experiencing some friction in my code with unicode conversion
problems.  Even when being super extra careful with the types of str's
or unicode objects that my variables can contain, there is always some
case or oversight where something unexpected happens which results in
a conversion which triggers a decode error.  str.join() of a list of
strs, where one unicode object appears unexpectedly, and voila!
exception galore.  Sometimes the problem shows up late because your
test code doesn't always contain accented characters.  I'm sure many
of you experienced that or some variant at some point.
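
The failure mode described above can be sketched in a few lines.  This is
only an emulation written for Python 3, where bytes stands in for the 2005
str and str stands in for unicode.  implicit_join is a hypothetical helper,
not a real API; it imitates the implicit ASCII decoding that str.join()
performed on a mixed list.

```python
def implicit_join(sep, items):
    """Emulate the 2005 behaviour of str.join() on a mixed list.

    Hypothetical helper: byte strings are silently decoded with the
    'ascii' default codec as soon as one text string is present.
    """
    if not any(isinstance(item, str) for item in items):
        # Only byte strings: nothing is coerced, whatever they contain.
        return bytes(sep, "ascii").join(items)
    return sep.join(
        item.decode("ascii") if isinstance(item, bytes) else item
        for item in items
    )


# Test data without accented characters passes, so the bug stays hidden.
print(implicit_join(", ", [b"cafe", "tea"]))

# The same call with accented (UTF-8 encoded) bytes fails at join time,
# far away from the place where the text string entered the list.
try:
    implicit_join(", ", [b"caf\xc3\xa9", "tea"])
except UnicodeDecodeError as exc:
    print("late failure:", exc.reason)
```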

I came to realize recently that this problem shares strong similarity
with the problem of implicit type conversions in C++, or at least it
feels the same:  Stuff just happens implicitly, and it's hard to track
down where and when it happens by just looking at the code.  Part of
the problem is that the unicode object acts a lot like a str, which is
convenient, but...

What if we could completely disable the implicit conversions between
unicode and str?  In other words, if you would ALWAYS be forced to
call either .encode() or .decode() to convert between one and the
other... wouldn't that help a lot in dealing with that issue?
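
For comparison, Python 3 later adopted essentially this rule for its bytes
and str types.  A minimal sketch, assuming Python 3 (the variable names are
illustrative only):

```python
raw = "caf\u00e9".encode("utf-8")   # bytes: the role str played in 2005
text = "th\u00e9"                   # str: the role unicode played in 2005

# Mixing the two types is refused instead of being converted silently.
try:
    raw + text
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Joining a mixed list is refused for the same reason.
try:
    ", ".join([raw, text])
    join_allowed = True
except TypeError:
    join_allowed = False

# The conversion and its charset have to be spelled out explicitly.
combined = ", ".join([raw.decode("utf-8"), text])
print(mixing_allowed, join_allowed)
```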

How hard would that be to implement?  Would it break a lot of code? 
Would some people want that?  (I know I would, at least for some of my
code.)  It seems to me that this would make the code more explicit and
force the programmer to become more aware of those conversions.  Any
opinions welcome.

cheers,


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Michael Hudson
Martin Blais [EMAIL PROTECTED] writes:

 What if we could completely disable the implicit conversions between
 unicode and str?  In other words, if you would ALWAYS be forced to
 call either .encode() or .decode() to convert between one and the
 other... wouldn't that help a lot in dealing with that issue?

I don't know.  I've made one or two apps safe against this and it's
mostly just annoying.

 How hard would that be to implement?

import sys
reload(sys)
sys.setdefaultencoding('undefined')
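
The trick works because 'undefined' names a standard codec that refuses
every conversion.  reload() and sys.setdefaultencoding() are Python 2 only,
but the codec itself can be tried directly.  A minimal sketch, assuming
Python 3; refuses is a hypothetical helper introduced for illustration:

```python
import codecs

# The codec is registered like any other and can be looked up by name.
info = codecs.lookup("undefined")


def refuses(operation):
    """Return True if calling operation() raises UnicodeError."""
    try:
        operation()
    except UnicodeError:
        return True
    return False


# Both directions fail, which is what turns every implicit conversion
# into an error once this codec is the default.
print(refuses(lambda: "abc".encode("undefined")))
print(refuses(lambda: b"abc".decode("undefined")))
```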

 Would it break a lot of code?  Would some people want that?  (I know
 I would, at least for some of my code.)  It seems to me that this
 would make the code more explicit and force the programmer to become
 more aware of those conversions.  Any opinions welcome.

I'm not sure it's a sensible default.

Cheers,
mwh

-- 
  It is never worth a first class man's time to express a majority
  opinion.  By definition, there are plenty of others to do that.
-- G. H. Hardy


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Antoine Pitrou
On Monday 03 October 2005 at 02:09 -0400, Martin Blais wrote:
 
 What if we could completely disable the implicit conversions between
 unicode and str?

This would be very annoying when dealing with some modules or libraries
where the type (str / unicode) returned by a function depends on the
context, build, or platform.

A good rule of thumb is to convert to unicode everything that is
semantically textual, and to only use str for what is to be semantically
treated as a string of bytes (network packets, identifiers...). This is
also, AFAIU, the semantic model which is favoured for a hypothetical
future version of Python.

This is what I'm using to do safe conversion to a given type without
worrying about the type of the argument:


DEFAULT_CHARSET = 'utf-8'

def safe_unicode(s, charset=None):
    """
    Forced conversion of a string to unicode; does nothing
    if the argument is already a unicode object.
    This function is useful because the .decode method
    on a unicode object, instead of being a no-op, tries to
    do a double conversion back and forth (which often fails
    because 'ascii' is the default codec).
    """
    if isinstance(s, str):
        return s.decode(charset or DEFAULT_CHARSET)
    else:
        return s

def safe_str(s, charset=None):
    """
    Forced conversion of a unicode object to str; does nothing
    if the argument is already a plain str object.
    This function is useful because the .encode method
    on a str object, instead of being a no-op, tries to
    do a double conversion back and forth (which often fails
    because 'ascii' is the default codec).
    """
    if isinstance(s, unicode):
        return s.encode(charset or DEFAULT_CHARSET)
    else:
        return s
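
For readers on Python 3, the same pair of helpers might look roughly like
this.  It is a sketch under the assumption that bytes replaces the old str
and str replaces unicode; to_text and to_bytes are hypothetical names, not
part of the original code.

```python
DEFAULT_CHARSET = "utf-8"


def to_text(value, charset=None):
    """Decode bytes to str; return a str argument unchanged."""
    if isinstance(value, bytes):
        return value.decode(charset or DEFAULT_CHARSET)
    return value


def to_bytes(value, charset=None):
    """Encode str to bytes; return a bytes argument unchanged."""
    if isinstance(value, str):
        return value.encode(charset or DEFAULT_CHARSET)
    return value


name = to_text(b"Fran\xc3\xa7ois")   # decoded from UTF-8 bytes
same = to_text(name)                 # already text, returned as-is
wire = to_bytes(name, "latin-1")     # explicit charset on the way out
print(len(name), same is name, len(wire))
```

Calling either helper twice is harmless, which is the point of the
originals: the caller does not need to know which type it was handed.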


Good luck

Antoine.





Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Fredrik Lundh
Antoine Pitrou wrote:

 A good rule of thumb is to convert to unicode everything that is
 semantically textual

and isn't pure ASCII.

(anyone who is tempted to argue otherwise should benchmark their
applications, both speed- and memorywise, and be prepared to come
up with very strong arguments for why python programs shouldn't be
allowed to be fast and memory-efficient whenever they can...)

/F 





Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Antoine Pitrou

On Monday 03 October 2005 at 14:59 +0200, Fredrik Lundh wrote:
 Antoine Pitrou wrote:
 
  A good rule of thumb is to convert to unicode everything that is
  semantically textual
 
 and isn't pure ASCII.

How can you be sure that something that is /semantically textual/ will
always remain pure ASCII ? That's contradictory, unless your software
never goes out of the anglo-saxon world (and even...).

 (anyone who is tempted to argue otherwise should benchmark their
 applications, both speed- and memorywise, and be prepared to come
 up with very strong arguments for why python programs shouldn't be
 allowed to be fast and memory-efficient whenever they can...)

I think most applications don't critically depend on text processing
performance. OTOH, international adaptability is the kind of thing
that /will/ bite you one day if you don't prepare for it at the
beginning.

Also, if necessary, the distinction could be an implementation detail
and the conversion be transparent (like int vs. long): the text would be
coded in an 8-bit charset as long as possible and converted to a wide
encoding only when necessary. The important thing is that these
optimisations, if they are necessary, should be transparently handled by
the Python runtime.
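
The transparent widening described here resembles what CPython 3.3 and
later do for str objects (PEP 393): each string is stored with the
narrowest fixed width that fits its characters.  A minimal sketch, assuming
CPython 3.3 or newer; the exact sizes are implementation details and other
interpreters may differ:

```python
import sys

ascii_text = "a" * 1000
wide_text = "\u20ac" * 1000        # EURO SIGN, does not fit in one byte

ascii_size = sys.getsizeof(ascii_text)
wide_size = sys.getsizeof(wide_text)

# Same number of characters, different storage cost.
print(len(ascii_text), ascii_size)
print(len(wide_text), wide_size)

# Appending one wide character widens the whole string transparently;
# the program only ever sees a str.
mixed = ascii_text + "\u20ac"
print(type(mixed).__name__, sys.getsizeof(mixed))
```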

(it seems to me - I may be mistaken - that modern Windows versions treat
every string as 16-bit unicode internally. Why are they doing it if it
is that inefficient?)

Regards

Antoine.




Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Martin Blais
On 10/3/05, M.-A. Lemburg [EMAIL PROTECTED] wrote:
 
  I'm not sure it's a sensible default.

 Me neither, especially since this would make it impossible
 to write polymorphic code - e.g. ', '.join(list) wouldn't
 work anymore if list contains Unicode; ditto for u', '.join(list)
 with list containing a string.

Sounds like what you want is exactly what I want to avoid (for those
two types anyway).

cheers,


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Jim Fulton
Martin Blais wrote:
 Hi.
 
 Like a lot of people (or so I hear in the blogosphere...), I've been
 experiencing some friction in my code with unicode conversion
 problems.  Even when being super extra careful with the types of str's
 or unicode objects that my variables can contain, there is always some
 case or oversight where something unexpected happens which results in
 a conversion which triggers a decode error.  str.join() of a list of
 strs, where one unicode object appears unexpectedly, and voila!
 exception galore.  Sometimes the problem shows up late because your
 test code doesn't always contain accented characters.  I'm sure many
 of you experienced that or some variant at some point.
 
 I came to realize recently that this problem shares strong similarity
 with the problem of implicit type conversions in C++, or at least it
 feels the same:  Stuff just happens implicitly, and it's hard to track
 down where and when it happens by just looking at the code.  Part of
 the problem is that the unicode object acts a lot like a str, which is
 convenient, but...

I agree.  I think it was a mistake to implicitly convert mixed string
expressions to unicode.


 What if we could completely disable the implicit conversions between
 unicode and str?  In other words, if you would ALWAYS be forced to
 call either .encode() or .decode() to convert between one and the
 other... wouldn't that help a lot in dealing with that issue?

Perhaps.

 How hard would that be to implement? 

Not hard. We considered doing it for Zope 3, but ...

  Would it break a lot of code?

Yes.

 Would some people want that? 

No, I wouldn't want lots of code to break. ;)

  (I know I would, at least for some of my
 code.)  It seems to me that this would make the code more explicit and
 force the programmer to become more aware of those conversions.  Any
 opinions welcome.

I think it's too late to change this.  I wish it had been done
differently.  (OTOH, I'm very happy we have Unicode support, so
I'm not really complaining. :)

I'll note that this hasn't been that much of a problem for us in Zope.
We follow the strategy:

Antoine Pitrou wrote:
...
  A good rule of thumb is to convert to unicode everything that is
  semantically textual, and to only use str for what is to be semantically
  treated as a string of bytes (network packets, identifiers...). This is
  also, AFAIU, the semantic model which is favoured for a hypothetical
  future version of Python.

This approach has worked pretty well for us.  Still, when there is a problem,
it's a real pain to debug because the error occurs too late, as you point
out.

Jim

-- 
Jim Fulton   mailto:[EMAIL PROTECTED]   Python Powered!
CTO  (540) 361-1714   http://www.python.org
Zope Corporation http://www.zope.com   http://www.zope.org


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Jim Fulton
M.-A. Lemburg wrote:
 Michael Hudson wrote:
 
Martin Blais [EMAIL PROTECTED] writes:



What if we could completely disable the implicit conversions between
unicode and str?  In other words, if you would ALWAYS be forced to
call either .encode() or .decode() to convert between one and the
other... wouldn't that help a lot in dealing with that issue?


I don't know.  I've made one or two apps safe against this and it's
mostly just annoying.


How hard would that be to implement?

import sys
reload(sys)
sys.setdefaultencoding('undefined')
 
 
 You shouldn't post tricks like these :-)
 
 The correct way to change the default encoding is by
 providing a sitecustomize.py module which then calls
 sys.setdefaultencoding('undefined').

This is a much more evil trick IMO, as it affects all Python code,
rather than a single program.

I would argue that it's evil to change the default encoding
in the first place, except in this case to disable implicit
encoding or decoding.

Jim

-- 
Jim Fulton   mailto:[EMAIL PROTECTED]   Python Powered!
CTO  (540) 361-1714   http://www.python.org
Zope Corporation http://www.zope.com   http://www.zope.org


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Fredrik Lundh
Jim Fulton wrote:

 I would argue that it's evil to change the default encoding
 in the first place, except in this case to disable implicit
 encoding or decoding.

absolutely.  unfortunately, all attempts to add such information to the
sys module documentation seem to have failed...

(last time I tried, I seem to remember that someone argued that it's
there, so it should be documented in a neutral fashion)

/F 





Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Josiah Carlson

Antoine Pitrou [EMAIL PROTECTED] wrote:
 
 On Monday 03 October 2005 at 14:59 +0200, Fredrik Lundh wrote:
  Antoine Pitrou wrote:
  
   A good rule of thumb is to convert to unicode everything that is
   semantically textual
  
  and isn't pure ASCII.
 
 How can you be sure that something that is /semantically textual/ will
 always remain pure ASCII ? That's contradictory, unless your software
 never goes out of the anglo-saxon world (and even...).

Non-unicode text input widgets.  Works great.  Can be had with the ANSI
wxPython installation.

 (it seems to me - I may be mistaken - that modern Windows versions treat
 every string as 16-bit unicode internally. Why are they doing it if it
 is that inefficient?)

Because modern Windows supports all sorts of symbols which are necessary
for certain special English uses (Greek symbols for math, etc.), and
trying to support all of them without simply using the Unicode backend
that is already used for all of the international builds (isn't it just a
language definition?) would be a waste of time and effort.

 - Josiah



Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Fredrik Lundh
Josiah Carlson wrote:

   and isn't pure ASCII.
 
  How can you be sure that something that is /semantically textual/ will
  always remain pure ASCII ? That's contradictory, unless your software
  never goes out of the anglo-saxon world (and even...).

 Non-unicode text input widgets.  Works great.  Can be had with the ANSI
 wxPython installation.

You're both missing that Python is dynamically typed.  A single string source
doesn't have to return the same type of strings, as long as the objects it
returns are compatible with Python's string model and with each other.

Under the default encoding (and quite a few other encodings), that's true for
plain ascii strings and Unicode strings.  This is a good thing.

/F 


