Re: [Python-Dev] Generalised String Coercion

2005-08-09 Thread Nick Coghlan
James Y Knight wrote:
 Hum, actually, it somewhat makes sense for the open builtin to  
 become what is now codecs.open, for convenience's sake, although it  
 does blur the distinction between a byte stream and a character  
 stream somewhat. If that happens, I suppose it does actually make  
 sense to give makefile the same signature.

We could always give the text mode/binary mode distinction in open a real 
meaning - text mode deals with character sequences, binary mode deals with 
byte sequences.
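
A minimal sketch of that distinction, assuming Python 3 semantics (which postdate this thread); the scratch file name is purely illustrative:

```python
import os
import tempfile

# Write a few UTF-8 encoded bytes to a scratch file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write("naïve\n".encode("utf-8"))

# Text mode: read() returns a character sequence (str).
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns a byte sequence (bytes).
with open(path, "rb") as f:
    raw = f.read()
```

The same file yields six characters in text mode and seven bytes in binary mode, because the "ï" occupies two bytes in UTF-8.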

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://boredomandlaziness.blogspot.com
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Martin v. Löwis
Bob Ippolito wrote:
 It's UTF-8 by default, I highly doubt many people bother to change it.

I think your doubts are unfounded. Many Japanese people change it to
EUC-JP (I believe), as UTF-8 support doesn't work well for them (or
at least didn't use to).

Regards,
Martin


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Martin v. Löwis
Guido van Rossum wrote:
 We might be able to get there halfway in Python 2.x: we could
 introduce the bytes type now, and provide separate APIs to read and
 write them. (In fact, the array module and the f.readinto()  method
 make this possible today, but it's too klunky so nobody uses it.
 Perhaps a better API would be a new file-open mode (B?) to indicate
 that a file's read* operations should return bytes instead of strings.
 The bytes type could just be a very thin wrapper around array('b').

That answers an important question: so you want the bytes type to be
mutable (and, consequently, unsuitable as a dictionary key).

Regards,
Martin


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Martin v. Löwis
Phillip J. Eby wrote:
Hm. What would be the use case for using %s with binary, non-text data?
 
 
 Well, I could see using it to write things like netstrings, 
 i.e.  sock.send("%d:%s," % (len(data),data)) seems like the One Obvious Way 
 to write a netstring in today's Python at least.  But perhaps there's a 
 subtlety I've missed here.

As written, this would stop working when strings become Unicode. It's
pretty clear what '%d' means (format the number in decimal numbers,
using \N{DIGIT ZERO} .. \N{DIGIT NINE} as the digits). It's not
all that clear what %s means: how do you get a sequence of characters
out of data, when data is a byte string?

Perhaps there could be byte string literals, so that you would write

  sock.send(b"%d:%s," % (len(data),data))

but this would raise different questions:
- what does %d mean for a byte string formatting? str(len(data))
  returns a character string, how do you get a byte string?
  In the specific case of %d, encoding as ASCII would work, though.
- if byte strings are mutable, what about byte string literals?
  I.e. if I do

  x = b"%d:%s,"
  x[1] = b'f'

  and run through the code the second time, will the literal have
  changed? Perhaps these would be displays, not literals (although
  I never understood why Guido calls these displays)
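
For reference, Python 3.5 and later do accept %-formatting on bytes objects (PEP 461), with %d rendered as ASCII digits, which is the resolution hinted at above. A minimal sketch under that assumption; the helper name netstring is hypothetical:

```python
def netstring(data: bytes) -> bytes:
    # %d formats the length as ASCII digits; %s accepts a bytes-like object.
    return b"%d:%s," % (len(data), data)


encoded = netstring(b"hello world!")
binary = netstring(b"\xff\x00")
```

No character decoding is involved at any point, so arbitrary binary payloads pass through unchanged.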

Regards,
Martin


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Stephen J. Turnbull
 Martin == Martin v Löwis [EMAIL PROTECTED] writes:

Martin I think your doubts are unfounded. Many Japanese people
Martin change it to EUC-JP (I believe), as UTF-8 support doesn't
Martin work well for them (or at least didn't use to).

If you mean the UTF-8 support in Terminal, it's no better or worse
than the EUC-JP support.  The problem is that most Japanese Unix
systems continue to default to EUC-JP, and many Windows hosts
(including Samba file systems) default to Shift JIS.  So people using
Terminal tend to set it to match the default remote environment (few
of them use shells on the Mac).

All that is certainly true of my organization, for one example.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
 If you mean the UTF-8 support in Terminal, it's no better or worse
 than the EUC-JP support.  The problem is that most Japanese Unix
 systems continue to default to EUC-JP, and many Windows hosts
 (including Samba file systems) default to Shift JIS.  So people using
 Terminal tend to set it to match the default remote environment (few
 of them use shells on the Mac).

Right: that might be the biggest problem. ls(1) would not display
the file names of the remote servers in any readable way.

Thanks for the confirmation.

Regards,
Martin


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Michael Hudson
M.-A. Lemburg [EMAIL PROTECTED] writes:

 Set the external encoding for stdin, stdout, stderr:
 
 (also an example for adding encoding support to an
 existing file object):

 def set_sys_std_encoding(encoding):
     # Load encoding support
     (encode, decode, streamreader, streamwriter) = codecs.lookup(encoding)
     # Wrap using stream writers and readers
     sys.stdin = streamreader(sys.stdin)
     sys.stdout = streamwriter(sys.stdout)
     sys.stderr = streamwriter(sys.stderr)
     # Add .encoding attribute for introspection
     sys.stdin.encoding = encoding
     sys.stdout.encoding = encoding
     sys.stderr.encoding = encoding

 set_sys_std_encoding('rot-13')

 Example session:
 >>> print 'hello'
 uryyb
 >>> raw_input()
 hello
 h'hello'
 >>> 1/0
 Genpronpx (zbfg erprag pnyy ynfg):
   Svyr "<fgqva>", yvar 1, va ?
 MrebQvivfvbaReebe: vagrtre qvivfvba be zbqhyb ol mreb

 Note that the interactive session bypasses the sys.stdin
 redirection, which is why you can still enter Python
 commands in ASCII - not sure whether there's a reason
 for this, or whether it's just a missing feature.

Um, I'm not quite sure how this would be implemented.  Interactive
input comes via PyOS_Readline which deals in FILE*s... this area of
the code always confuses me :(

Cheers,
mwh

-- 
 As it seems to me, in Perl you have to be an expert to correctly make
 a nested data structure like, say, a list of hashes of instances.  In
 Python, you have to be an idiot not  to be able to do it, because you
 just write it down. -- Peter Norvig, comp.lang.functional


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Nick Coghlan
Martin v. Löwis wrote:
 Guido van Rossum wrote:
The bytes type could just be a very thin wrapper around array('b').
 
 That answers an important question: so you want the bytes type to be
 mutable (and, consequently, unsuitable as a dictionary key).

I would suggest a bytes/frozenbytes pair, similar to set/frozenset and 
list/tuple.
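
Python 3 did end up with such a pair, although it is spelled bytes (immutable) and bytearray (mutable) rather than frozenbytes and bytes. A minimal sketch of the difference, assuming Python 3:

```python
frozen = bytes(b"abc")
mutable = bytearray(b"abc")

mutable[0] = ord("x")      # in-place mutation is allowed on bytearray
lookup = {frozen: "ok"}    # bytes is immutable, hence usable as a dict key

try:
    hash(mutable)
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

try:
    frozen[0] = ord("x")
    bytes_mutable = True
except TypeError:
    bytes_mutable = False
```

This mirrors set/frozenset: the mutable member of the pair cannot be hashed, and the hashable member cannot be changed in place.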

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://boredomandlaziness.blogspot.com


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread M.-A. Lemburg
Guido van Rossum wrote:
 [Guido]
 
My first response to the PEP, however, is that instead of a new
built-in function, I'd rather relax the requirement that str() return
an 8-bit string -- after all, int() is allowed to return a long, so
why couldn't str() be allowed to return a Unicode string?
 
 
 [MAL]
 
The problem here is that strings and Unicode are used in different
ways, whereas integers and longs are very similar. Strings are used
for both arbitrary data and text data, Unicode can only be used
for text data.
 
 Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a
 separate bytes array type for non-text and for encoded text data,
 just like Java; strings should always be considered text data.

 We might be able to get there halfway in Python 2.x: we could
 introduce the bytes type now, and provide separate APIs to read and
 write them.

 (In fact, the array module and the f.readinto()  method
 make this possible today, but it's too klunky so nobody uses it.
 Perhaps a better API would be a new file-open mode (B?) to indicate
 that a file's read* operations should return bytes instead of strings.
 The bytes type could just be a very thin wrapper around array('b').

I'd prefer to keep such bytes type immutable (arrays are mutable),
otherwise, as Martin already mentioned, they wouldn't be usable
as dictionary keys and the transition from the current string
implementation would be made more difficult than necessary.

Since we won't have any use for the string type in Py3k,
why not simply strip it down to a plain bytes type ?

(I wouldn't want to lose or have to reinvent all the
optimizations that went into its implementation and which
are missing in the array implementation.)

About the file-type idea:

We already have text mode and binary mode - with their implementation
being platform dependent. I don't think that this is a particularly
good area to add new functionality.

If you use codecs.open() to open a file, you could easily
write a codec which implements what you have in mind.

The new text() built-in would help make a clear distinction
between convert this object to a string of bytes and
please convert this to a text representation. We need to
start making the separation somewhere and I think this is
a good non-invasive start.
 
 
 I agree with the latter, but I would prefer that any new APIs we use
 use a 'bytes' data type to represent non-text data, rather than having
 two different sets of APIs to differentiate between the use of 8-bit
 strings as text vs. data -- while we *currently* use 8-bit strings for
 both text and data, in Python 3.0 we won't, so then the interim APIs
 would have to change again. I'd rather introduce a new data type and
 new APIs that work with it.

Well, let's put it this way: it all really depends on
what str() should mean in Py3k.

Given that str() is used for mixed content data strings,
simply aliasing str() to unicode() in Py3k would cause a
lot of breakage, due to changed semantics.

Aliasing str() to bytes() would also cause breakage, due
to the fact that bytes types wouldn't have string methods
such as .lower(), .upper(), etc.

Perhaps str() in Py3k should become a helper that
converts bytes() to Unicode, provided the content is
ASCII-only.

In any case, Py3k would only have unicode() for text
and bytes() for data, so there's no real need to continue
using str().

If we add the text() API in Py2k and with the above
meaning, then we could rename unicode() to text()
in Py3k - only a cosmetic change, but one that I would
find useful: text() and bytes() are more intuitive to
understand than unicode() and bytes().

Furthermore, the text() built-in could be used to only
allow 8-bit strings with ASCII content to pass through
and require that all non-ASCII content be returned as
Unicode.

We wouldn't be able to enforce this in str().

I'm +1 on adding text().
 
 
 I'm still -1.
 
 
I would also like to suggest a new formatting marker '%t'
to have the same semantics as text() - instead of changing
the semantics of %s as Neil suggests in the PEP. Again,
the reason is to make the difference between text and
arbitrary data explicit and visible in the code.
 
 
 Hm. What would be the use case for using %s with binary, non-text data?

I guess we'd only keep it for backwards compatibility and
map it to the str() helper.

The main problem for a smooth Unicode transition remains I/O, in my
opinion; I'd like to see a PEP describing a way to attach an encoding
to text files, and a way to decide on a default encoding for stdin,
stdout, stderr.

Hmm, not sure why you need PEPs for this:
 
 
 I'd forgotten how far we've come. I'm still unsure how the default
 encoding on stdin/stdout works.

Codecs in general work like this: they take an existing file-like
object and wrap it with new versions of .read(), .write(),
.readline(), etc. which filter the data through encoding and/or
decoding functions.
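
A minimal sketch of that wrapping, assuming Python 3 and an in-memory byte stream standing in for a real file (the sample text is illustrative only):

```python
import codecs
import io

# A byte stream holding UTF-8 encoded text.
raw = io.BytesIO("grüß\nдва\n".encode("utf-8"))

# codecs.getreader() returns the codec's StreamReader class; instantiating
# it wraps the byte stream so that reads come back decoded.
reader = codecs.getreader("utf-8")(raw)
first = reader.readline()
rest = reader.read()
```

The underlying stream still deals in bytes; only the wrapper's .read() and .readline() return text.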

Once a file is wrapped with a codec StreamWriter/Reader,
you 

Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread M.-A. Lemburg
Michael Hudson wrote:
 M.-A. Lemburg [EMAIL PROTECTED] writes:
 
 
Set the external encoding for stdin, stdout, stderr:

(also an example for adding encoding support to an
existing file object):

def set_sys_std_encoding(encoding):
    # Load encoding support
    (encode, decode, streamreader, streamwriter) = codecs.lookup(encoding)
    # Wrap using stream writers and readers
    sys.stdin = streamreader(sys.stdin)
    sys.stdout = streamwriter(sys.stdout)
    sys.stderr = streamwriter(sys.stderr)
    # Add .encoding attribute for introspection
    sys.stdin.encoding = encoding
    sys.stdout.encoding = encoding
    sys.stderr.encoding = encoding

set_sys_std_encoding('rot-13')

Example session:
>>> print 'hello'
uryyb
>>> raw_input()
hello
h'hello'
>>> 1/0
Genpronpx (zbfg erprag pnyy ynfg):
  Svyr "<fgqva>", yvar 1, va ?
MrebQvivfvbaReebe: vagrtre qvivfvba be zbqhyb ol mreb

Note that the interactive session bypasses the sys.stdin
redirection, which is why you can still enter Python
commands in ASCII - not sure whether there's a reason
for this, or whether it's just a missing feature.
 
 
 Um, I'm not quite sure how this would be implemented.  Interactive
 input comes via PyOS_Readline which deals in FILE*s... this area of
 the code always confuses me :(

Me too.

It appears that this part of the Python code
has undergone so many iterations and patches, that the
structure has suffered a lot, e.g. the main() function calls
PyRun_AnyFileFlags(stdin, stdin, cf),
but the fp argument stdin is then subsequently
ignored if the tok_nextc() function finds that
a prompt is set.

Anyway, hacking along the same lines, I think
the above can be had by changing tok_stdin_decode()
to use a possibly available sys.stdin.decode()
method for the decoding of the data read by
PyOS_Readline(). This would then return Unicode
which tok_stdin_decode() could then encode to
UTF-8 which is the encoding that the tokenizer
can work on.
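
The transcoding step described here can be sketched in Python rather than C. This is only an illustration of the decode-then-encode idea, written for Python 3; the function name recode_for_tokenizer is hypothetical and is not part of CPython:

```python
def recode_for_tokenizer(raw_line: bytes, stdin_encoding: str) -> bytes:
    # Decode the bytes read from the terminal using its encoding,
    # then re-encode as UTF-8, the encoding the tokenizer works on.
    return raw_line.decode(stdin_encoding).encode("utf-8")


euc_jp_line = "print('日本語')\n".encode("euc-jp")
utf8_line = recode_for_tokenizer(euc_jp_line, "euc-jp")
```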

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 08 2005)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Phillip J. Eby
At 10:07 AM 8/8/2005 +0200, Martin v. Löwis wrote:
Phillip J. Eby wrote:
 Hm. What would be the use case for using %s with binary, non-text data?
 
 
  Well, I could see using it to write things like netstrings,
  i.e.  sock.send("%d:%s," % (len(data),data)) seems like the One Obvious Way
  to write a netstring in today's Python at least.  But perhaps there's a
  subtlety I've missed here.

As written, this would stop working when strings become Unicode. It's
pretty clear what '%d' means (format the number in decimal numbers,
using \N{DIGIT ZERO} .. \N{DIGIT NINE} as the digits). It's not
all that clear what %s means: how do you get a sequence of characters
out of data, when data is a byte string?

Perhaps there could be byte string literals, so that you would write

   sock.send(b"%d:%s," % (len(data),data))

Actually, thinking about it some more, it seems to me it's actually more 
like this:

    sock.send( ("%d:%s," %
        (len(data),data.decode('latin1'))).encode('latin1') )

That is, if all we have is unicode and bytes, and 'data' is bytes, then 
encoding and decoding from latin1 is the right way to do a netstring.  It's 
a bit more painful, but still doable.
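
The reason latin1 works for this is that it maps byte value N to code point N for all 256 values, so decoding and re-encoding is lossless. A minimal sketch, assuming Python 3 (where str is Unicode); the helper name netstring_via_latin1 is hypothetical:

```python
def netstring_via_latin1(data: bytes) -> bytes:
    # Format as text, treating each byte as the code point of equal value,
    # then map every code point back to the same byte value.
    text = "%d:%s," % (len(data), data.decode("latin1"))
    return text.encode("latin1")


all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin1").encode("latin1")
framed = netstring_via_latin1(b"\xff\x00abc")
```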


but this would raise different questions:
- what does %d mean for a byte string formatting? str(len(data))
   returns a character string, how do you get a byte string?
   In the specific case of %d, encoding as ASCII would work, though.
- if byte strings are mutable, what about byte string literals?
   I.e. if I do

   x = b"%d:%s,"
   x[1] = b'f'

   and run through the code the second time, will the literal have
   changed? Perhaps these would be displays, not literals (although
   I never understood why Guido calls these displays)

I'm thinking that bytes.decode and unicode.encode are the correct way to 
convert between the two, and there's no such thing as a bytes literal.  We 
can always optimize "constant".encode("constant") to a bytes display 
internally if necessary, although it will be a pain for programs that have 
lots of bytestring constants.  OTOH, we've previously discussed having a 
'bytes()' constructor, and perhaps it should use latin1 as its default 
encoding.



Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Aahz
On Sun, Aug 07, 2005, Neil Schemenauer wrote:
 On Sat, Aug 06, 2005 at 06:56:39PM -0700, Guido van Rossum wrote:

 My first response to the PEP, however, is that instead of a new
 built-in function, I'd rather relax the requirement that str() return
 an 8-bit string
 
 Do you have any thoughts on what the C API would be?  It seems to me
 that PyObject_Str cannot start returning a unicode object without a
 lot of code breakage.  I suppose we could introduce a function
 called something like PyObject_String.

OTOH, should Guido change his -1 on text(), that leads to the obvious
PyObject_Text.
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

The way to build large Python applications is to componentize and
loosely-couple the hell out of everything.


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Guido van Rossum
Ouch. Too much discussion to respond to it all. Please remember that
in Jython and IronPython, str and unicode are already synonyms. That's
how Python 3.0 will do it, except unicode will disappear as being
redundant. I like the bytes/frozenbytes pair idea. Streams could grow
a getpos()/setpos() API pair that can be used for stateful encodings
(although it sounds like seek()/tell() would be okay to use in most
cases as long as you read in units of whole lines). For sockets,
send()/recv() would deal in bytes, and makefile() would get an
encoding parameter. I'm not going to change my mind on text() unless
someone explains what's so attractive about it.
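
The socket part of this can be sketched with the API Python 3 later shipped, where send()/recv() take and return bytes and socket.makefile() accepts an encoding argument. A minimal example, assuming Python 3 and a local socket pair in place of a real network connection:

```python
import socket

left, right = socket.socketpair()
try:
    # send()/recv() deal in bytes, so text is encoded explicitly.
    left.sendall("héllo\n".encode("utf-8"))

    # makefile() in text mode decodes using the given encoding.
    text_reader = right.makefile("r", encoding="utf-8")
    line = text_reader.readline()
    text_reader.close()
finally:
    left.close()
    right.close()
```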

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Phillip J. Eby
At 09:14 AM 8/8/2005 -0700, Guido van Rossum wrote:
I'm not going to change my mind on text() unless
someone explains what's so attractive about it.

1. It's obvious to non-programmers what it's for (str and unicode aren't)

2. It's more obvious to programmers that it's a *text* string rather than a 
string of bytes

3. It's easier to type than unicode, but less opaque than str

4. Switching to 'text' and 'bytes' allows for a clean break from any mental 
baggage now associated with 'unicode' and 'str'.

Of course, the flip side to these arguments is that in today's Python, one 
rarely has use for the string type names, except for coercion and some 
occasional type checking.  On the other hand, if we end up with type 
declarations, then these issues become a bit more important.



Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Martin v. Löwis
Phillip J. Eby wrote:
 Actually, thinking about it some more, it seems to me it's actually more
 like this:
 
    sock.send( ("%d:%s," %
        (len(data),data.decode('latin1'))).encode('latin1') )

While this would work, it would still feel wrong: the binary data
are *not* latin1 (most likely), so declaring them to be latin1 would
be confusing. Perhaps a synonym '8bit' for latin1 could be introduced.

Regards,
Martin



Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Neil Schemenauer
On Sat, Aug 06, 2005 at 06:56:39PM -0700, Guido van Rossum wrote:
 My first response to the PEP, however, is that instead of a new
 built-in function, I'd rather relax the requirement that str() return
 an 8-bit string -- after all, int() is allowed to return a long, so
 why couldn't str() be allowed to return a Unicode string?

I've played with this idea a bit and it seems viable.  I modified my
original patch to have string_new call PyObject_Text instead of
PyObject_Str.  That change breaks only two tests, both in
test_email.  The tracebacks are attached.  Both problems seem
relatively shallow.  Do you think such a change could go into 2.5?

  Neil



Traceback (most recent call last):
  File "/home/nas/Python/py_cvs/Lib/email/test/test_email.py", line 2844, in test_encoded_adjacent_nonencoded
    h = make_header(decode_header(s))
  File "/home/nas/Python/py_cvs/Lib/email/Header.py", line 123, in make_header
    charset = Charset(charset)
  File "/home/nas/Python/py_cvs/Lib/email/Charset.py", line 190, in __init__
    input_charset = unicode(input_charset, 'ascii').lower()
TypeError: decoding Unicode is not supported

Traceback (most recent call last):
  File "/home/nas/Python/py_cvs/Lib/email/test/test_email.py", line 2750, in test_multilingual
    eq(decode_header(enc),
  File "/home/nas/Python/py_cvs/Lib/email/Header.py", line 85, in decode_header
    dec = email.quopriMIME.header_decode(encoded)
  File "/home/nas/Python/py_cvs/Lib/email/quopriMIME.py", line 319, in header_decode
    return re.sub(r'=\w{2}', _unquote_match, s)
  File "/home/nas/Python/py_cvs/Lib/sre.py", line 142, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread François Pinard
[Phillip J. Eby]

 At 09:14 AM 8/8/2005 -0700, Guido van Rossum wrote:

  I'm not going to change my mind on text() unless someone explains
  what's so attractive about it.

 2. It's more obvious to programmers that it's a *text* string rather
 than a string of bytes

I've no opinion on the proposal itself, except maybe that text,
that precise word or name, is a pretty bad choice.  It is far too likely
that people already use or want to use that precise identifier.

There once was a suggestion for naming text the module now known
as textwrap, under the premise that it could be later extended for
holding many other various text-related functions.  Happily enough, this
idea was not retained. textwrap is much more reasonable as a name.

I found Python 1.5.2's string to be especially prone to clashing.  I
still find socket obtrusive in that respect.  Consider len as an
example of a clever choice, while length would not have been. str is
also a good choice. object is a bit more annoying theoretically, yet
we almost never need it in practice. type is annoying as a name (yet
very nice as a concept), as if it was free to use, it would often serve
to label our own things.  The fact is we often need the built-in.

Python should not choose common English words for its built-ins, without
very careful thought, and be reluctant to any compulsion in this area.

-- 
François Pinard   http://pinard.progiciels-bpi.ca


Re: [Python-Dev] Generalised String Coercion

2005-08-08 Thread Stephen J. Turnbull
 Martin == Martin v Löwis [EMAIL PROTECTED] writes:

Martin While this would work, it would still feel wrong: the
Martin binary data are *not* latin1 (most likely), so declaring
Martin them to be latin1 would be confusing. Perhaps a synonym
Martin '8bit' for latin1 could be introduced.

Be careful.  This alias has caused Emacs some amount of pain, as
binary data escapes into contexts (such as Universal Newline
processing) where it gets interpreted as character data.  We've also
had some problems in codec implementation, because latin1 and (eg)
latin9 have some differences in semantics other than changing the
coded character set for the GR register---controls are treated
differently, for example, because they _are_ binary (alias latin1)
octets, but not in the range of the latin9 code.

I won't go so far as to say it won't work, but it will require careful
design.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Reinhold Birkenfeld
Guido van Rossum wrote:

 The main problem for a smooth Unicode transition remains I/O, in my
 opinion; I'd like to see a PEP describing a way to attach an encoding
 to text files, and a way to decide on a default encoding for stdin,
 stdout, stderr.

FWIW, I've already drafted a patch for the former. It lets you write to
file.encoding and honors this when writing Unicode strings to it.

http://www.python.org/sf/1214889

Reinhold

-- 
Mail address is perfectly valid!



Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread M.-A. Lemburg
Guido van Rossum wrote:
 My first response to the PEP, however, is that instead of a new
 built-in function, I'd rather relax the requirement that str() return
 an 8-bit string -- after all, int() is allowed to return a long, so
 why couldn't str() be allowed to return a Unicode string?

The problem here is that strings and Unicode are used in different
ways, whereas integers and longs are very similar. Strings are used
for both arbitrary data and text data, Unicode can only be used
for text data.

The new text() built-in would help make a clear distinction
between convert this object to a string of bytes and
please convert this to a text representation. We need to
start making the separation somewhere and I think this is
a good non-invasive start.

Furthermore, the text() built-in could be used to only
allow 8-bit strings with ASCII content to pass through
and require that all non-ASCII content be returned as
Unicode.

We wouldn't be able to enforce this in str().

I'm +1 on adding text().

I would also like to suggest a new formatting marker '%t'
to have the same semantics as text() - instead of changing
the semantics of %s as Neil suggests in the PEP. Again,
the reason is to make the difference between text and
arbitrary data explicit and visible in the code.

 The main problem for a smooth Unicode transition remains I/O, in my
 opinion; I'd like to see a PEP describing a way to attach an encoding
 to text files, and a way to decide on a default encoding for stdin,
 stdout, stderr.

Hmm, not sure why you need PEPs for this:

Open an encoded file:
-
Use codecs.open() instead of open() or file().

Set the external encoding for stdin, stdout, stderr:

(also an example for adding encoding support to an
existing file object):

import codecs
import sys

def set_sys_std_encoding(encoding):
    # Load encoding support
    (encode, decode, streamreader, streamwriter) = codecs.lookup(encoding)
    # Wrap using stream writers and readers
    sys.stdin = streamreader(sys.stdin)
    sys.stdout = streamwriter(sys.stdout)
    sys.stderr = streamwriter(sys.stderr)
    # Add .encoding attribute for introspection
    sys.stdin.encoding = encoding
    sys.stdout.encoding = encoding
    sys.stderr.encoding = encoding

set_sys_std_encoding('rot-13')

Example session:
>>> print 'hello'
uryyb
>>> raw_input()
hello
h'hello'
>>> 1/0
Genpronpx (zbfg erprag pnyy ynfg):
  Svyr "<fgqva>", yvar 1, va ?
MrebQvivfvbaReebe: vagrtre qvivfvba be zbqhyb ol mreb

Note that the interactive session bypasses the sys.stdin
redirection, which is why you can still enter Python
commands in ASCII - not sure whether there's a reason
for this, or whether it's just a missing feature.
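
The wrapping trick can be tried without touching the real sys.stdout. The sketch below is not the code above; it assumes Python 3, where rot_13 is a text-to-text codec, so the writer wraps an in-memory text stream (io.StringIO) instead of a byte stream:

```python
import codecs
import io

sink = io.StringIO()

# Look up the codec's StreamWriter class and wrap the target stream with it.
writer = codecs.getwriter("rot_13")(sink)
writer.write("hello\n")

scrambled = sink.getvalue()
```

Everything written through the wrapper is encoded before it reaches the underlying stream, which is why 'hello' shows up as 'uryyb' in the session above.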

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Guido van Rossum
[me]
  a way to decide on a default encoding for stdin,
  stdout, stderr.

[Martin]
 If stdin, stdout and stderr go to a terminal, there already is a
 default encoding (actually, there always is a default encoding on
 these, as it falls back to the system encoding if it's not a terminal,
 or if the terminal's encoding is not supported or cannot be determined).

So there is. Wow! I never knew this. How does it work? Can we use this
for writing to files too?
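
These defaults can be inspected directly. A minimal sketch, assuming Python 3; the values are platform- and environment-dependent, so no particular output should be expected:

```python
import locale
import sys

# Encoding attached to the standard streams (None if a stream was replaced
# by an object that has no .encoding attribute).
stdout_encoding = getattr(sys.stdout, "encoding", None)
stdin_encoding = getattr(sys.stdin, "encoding", None)

# The locale's preferred encoding, used as the system-level fallback.
preferred = locale.getpreferredencoding(False)
```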

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Guido van Rossum
[Guido]
  My first response to the PEP, however, is that instead of a new
  built-in function, I'd rather relax the requirement that str() return
  an 8-bit string -- after all, int() is allowed to return a long, so
  why couldn't str() be allowed to return a Unicode string?

[MAL]
 The problem here is that strings and Unicode are used in different
 ways, whereas integers and longs are very similar. Strings are used
 for both arbitrary data and text data, Unicode can only be used
 for text data.

Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a
separate bytes array type for non-text and for encoded text data,
just like Java; strings should always be considered text data.

We might be able to get there halfway in Python 2.x: we could
introduce the bytes type now, and provide separate APIs to read and
write them. (In fact, the array module and the f.readinto() method
make this possible today, but it's too klunky so nobody uses it.)
Perhaps a better API would be a new file-open mode (B?) to indicate
that a file's read* operations should return bytes instead of strings.
The bytes type could just be a very thin wrapper around array('b').
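
For reference, the array/readinto() route mentioned above looks roughly
like this. It is only a sketch, written for a current Python 3
interpreter; the helper name read_into_array and the temporary file are
made up for illustration.

```python
import array
import os
import tempfile


def read_into_array(path):
    """Read a whole file into an array('b') without creating a string."""
    size = os.path.getsize(path)
    buf = array.array("b", bytes(size))   # pre-sized, zero-filled buffer
    with open(path, "rb") as f:
        count = f.readinto(buf)           # fills buf in place
    return buf[:count]


# Illustrative usage with a throwaway file.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"spam\x00eggs")
    data = read_into_array(path)
finally:
    os.remove(path)
```

Having to size the buffer by hand before reading is a fair example of
why this route feels klunky next to a read() that simply returns bytes.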

 The new text() built-in would help make a clear distinction
 between "convert this object to a string of bytes" and
 "please convert this to a text representation". We need to
 start making the separation somewhere and I think this is
 a good non-invasive start.

I agree with the latter, but I would prefer that any new APIs we use
use a 'bytes' data type to represent non-text data, rather than having
two different sets of APIs to differentiate between the use of 8-bit
strings as text vs. data -- while we *currently* use 8-bit strings for
both text and data, in Python 3.0 we won't, so then the interim APIs
would have to change again. I'd rather introduce a new data type and
new APIs that work with it.

 Furthermore, the text() built-in could be used to only
 allow 8-bit strings with ASCII content to pass through
 and require that all non-ASCII content be returned as
 Unicode.
 
 We wouldn't be able to enforce this in str().
 
 I'm +1 on adding text().

I'm still -1.

 I would also like to suggest a new formatting marker '%t'
 to have the same semantics as text() - instead of changing
 the semantics of %s as Neil suggests in the PEP. Again,
 the reason is to make the difference between text and
 arbitrary data explicit and visible in the code.

Hm. What would be the use case for using %s with binary, non-text data?

  The main problem for a smooth Unicode transition remains I/O, in my
  opinion; I'd like to see a PEP describing a way to attach an encoding
  to text files, and a way to decide on a default encoding for stdin,
  stdout, stderr.
 
 Hmm, not sure why you need PEPs for this:

I'd forgotten how far we've come. I'm still unsure how the default
encoding on stdin/stdout works.

But it still needs to be simpler; IMO the built-in open() function
should have an encoding keyword. (But it could return something whose
type is not 'file' -- once again making a distinction between open and
file.) Do these files support universal newlines? IMO they should.
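
One possible shape for such an open(), sketched with the io module of
current Python 3 rather than anything available in 2.x: io.open() takes
an encoding keyword, returns an object that is not the 2.x 'file' type,
and translates newlines by default. The file name and contents are
invented for the example.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    # Write raw bytes with mixed line endings, encoded as Latin-1.
    with os.fdopen(fd, "wb") as f:
        f.write("caf\u00e9\r\nbar\rbaz\n".encode("latin-1"))

    # newline=None (the default) enables universal newline translation
    # on the decoded character stream.
    with io.open(path, "r", encoding="latin-1", newline=None) as f:
        lines = f.readlines()
        kind = type(f).__name__
finally:
    os.remove(path)
```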

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Neil Schemenauer
On Sat, Aug 06, 2005 at 06:56:39PM -0700, Guido van Rossum wrote:
 My first response to the PEP, however, is that instead of a new
 built-in function, I'd rather relax the requirement that str() return
 an 8-bit string

Do you have any thoughts on what the C API would be?  It seems to me
that PyObject_Str cannot start returning a unicode object without a
lot of code breakage.  I suppose we could introduce a function
called something like PyObject_String.

  Neil


Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Phillip J. Eby
At 05:24 PM 8/7/2005 -0700, Guido van Rossum wrote:
 Hm. What would be the use case for using %s with binary, non-text data?

Well, I could see using it to write things like netstrings, 
i.e. sock.send("%d:%s," % (len(data), data)) seems like the One Obvious Way 
to write a netstring in today's Python at least.  But perhaps there's a 
subtlety I've missed here.
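
To make the netstring case concrete, here is a small sketch of the
framing on pure byte data. It assumes Python 3.5 or later, where
%-formatting is defined for bytes; the function names are hypothetical.

```python
def encode_netstring(data):
    """Frame a bytes payload as <decimal length>:<payload>,"""
    return b"%d:%s," % (len(data), data)


def decode_netstring(frame):
    """Return the payload of a single netstring, validating the framing."""
    length, sep, rest = frame.partition(b":")
    if not sep or not length.isdigit():
        raise ValueError("missing or malformed length prefix")
    size = int(length)
    if len(rest) != size + 1 or rest[size:] != b",":
        raise ValueError("payload length does not match prefix")
    return rest[:size]


frame = encode_netstring(b"hello \xff world")
```

The payload here contains a byte that is not valid text in most
encodings, which is exactly the situation where %s is being used for
data rather than text.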



Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Martin v. Löwis
Guido van Rossum wrote:
  If stdin, stdout and stderr go to a terminal, there already is a
  default encoding (actually, there always is a default encoding on
  these, as it falls back to the system encoding if it's not a terminal,
  or if the terminal's encoding is not supported or cannot be determined).

 So there is. Wow! I never knew this. How does it work? Can we use this
 for writing to files too?

On Unix, it uses nl_langinfo(CHARSET), which in turn looks at the
environment variables.

On Windows, it uses GetConsoleCP()/GetConsoleOutputCP().

On Mac, I'm still searching for a way to determine the encoding of
Terminal.app.

In IDLE, it uses locale.getpreferredencoding().

So no, this cannot easily be used for file output. Most likely, people
would use locale.getpreferredencoding() for file output. For socket
output, there should not be a standard way to encode Unicode.
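
A short sketch of that suggestion, assuming a current Python 3
interpreter: look up the locale's preferred encoding once and pass it
explicitly when opening a file for text output. The helper name and the
temporary file are illustrative only.

```python
import locale
import os
import tempfile


def write_text_in_locale_encoding(path, text):
    """Write text using the user's preferred locale encoding; return its name."""
    encoding = locale.getpreferredencoding()
    with open(path, "w", encoding=encoding) as f:
        f.write(text)
    return encoding


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    used = write_text_in_locale_encoding(path, "plain ASCII text\n")
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)
```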

Regards,
Martin



Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Bob Ippolito
On Aug 7, 2005, at 7:37 PM, Martin v. Löwis wrote:

 Guido van Rossum wrote:

   If stdin, stdout and stderr go to a terminal, there already is a
   default encoding (actually, there always is a default encoding on
   these, as it falls back to the system encoding if it's not a terminal,
   or if the terminal's encoding is not supported or cannot be determined).

  So there is. Wow! I never knew this. How does it work? Can we use this
  for writing to files too?


 On Unix, it uses nl_langinfo(CHARSET), which in turn looks at the
 environment variables.

 On Windows, it uses GetConsoleCP()/GetConsoleOutputCP().

 On Mac, I'm still searching for a way to determine the encoding of
 Terminal.app.

It's UTF-8 by default, I highly doubt many people bother to change it.

-bob




Re: [Python-Dev] Generalised String Coercion

2005-08-07 Thread Martin v. Löwis
Guido van Rossum wrote:
 I'm not sure if it works for all encodings, but if possible I'd like
 to extend the seeking semantics on text files: seek positions are byte
 counts, and the application should consider them as magic cookies.

If the seek position is merely a number, it won't work for all
encodings. For the ISO 2022 ones (iso-2022-jp etc), you need to know
the shift state: you can switch to a different encoding in the stream
using standard escape codes, and then the same bytes are interpreted
differently. For example, iso-2022-jp supports these escape codes:

ESC ( B     ASCII
ESC $ @     JIS X 0208-1978
ESC $ B     JIS X 0208-1983
ESC ( J     JIS X 0201-Roman
ESC $ A     GB2312-1980
ESC $ ( C   KSC5601-1987
ESC $ ( D   JIS X 0212-1990
ESC . A     ISO8859-1
ESC . F     ISO8859-7

So at a certain position in the stream, the same bytes could mean
different characters, depending on which shift state you are in.
That's why ISO C introduced fgetpos/fsetpos in addition to
ftell/fseek: an fpos_t is a truly opaque structure that can also
incorporate codec state.
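
The effect of the shift state can be seen with Python's own iso2022_jp
codec. This is only an illustration, written for a current Python 3
interpreter: the same payload bytes are decoded once with and once
without the surrounding escape codes.

```python
text = "\u65e5\u672c"                  # the two kanji for "Japan"
encoded = text.encode("iso2022_jp")

# Strip the 3-byte escape sequences that switch into JIS X 0208 and
# back to ASCII, leaving only the payload bytes between them.
payload = encoded[3:-3]

in_shifted_state = encoded.decode("iso2022_jp")
in_ascii_state = payload.decode("iso2022_jp")
```

Starting to decode at the payload, as a plain numeric seek position
would, yields ordinary ASCII characters instead of the kanji, because
the decoder never saw the escape code that selects JIS X 0208.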

If you follow this approach, you can get back most of seek. You will
lose the whence parameter, i.e. you cannot seek relative to the current
position, and you cannot position at the end of the file. (Actually,
iso-2022-jp still supports appending to a file, since it requires that
all data shift back to ASCII at the end of each line and at the end of
the file, so correct ISO 2022 files can still be concatenated.)


 Is there any reason not to do Universal Newline processing on *all*
 text files?

Correct. However, this still might result in a full rewrite of the
universal newlines code: the code currently operates on byte streams,
when it should operate on character streams. In some encodings,
CRLF simply isn't represented by \x0d\x0a
(e.g. UTF-16-LE: \x0d\0\x0a\0).
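
A quick check of this point, assuming a current Python 3 interpreter: a
byte-level search for \x0d\x0a misses a real CRLF in UTF-16-LE data, and
can even report one where the text contains no line break at all.

```python
crlf_bytes = "\r\n".encode("utf-16-le")

# A byte-oriented scanner looks for the two-byte sequence 0D 0A.
data = "one\r\ntwo".encode("utf-16-le")
found_in_real_text = b"\r\n" in data

# The single code unit U+0A0D is stored little-endian as 0D 0A, so a
# byte-level scanner would see a CRLF that is not there.
no_newline = "a\u0a0db".encode("utf-16-le")
false_positive = b"\r\n" in no_newline
```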

Regards,
Martin


Re: [Python-Dev] Generalised String Coercion

2005-08-06 Thread Terry Reedy
 PEP: 349
 Title: Generalised String Coercion
...
 Rationale
Python has had a Unicode string type for some time now but use of
it is not yet widespread.  There is a large amount of Python code
that assumes that string data is represented as str instances.
The long term plan for Python is to phase out the str type and use
unicode for all string data.

This PEP strikes me as premature, as putting the toy wagon before the 
horse, since it is premised on a major change to Python, possibly the most 
disruptive and controversial ever, being a done deal.  However, there is, as 
far as I could find, no PEP on "Making Strings be Unicode", let alone a 
discussed, debated, and finalized PEP on the subject.

   Clearly, a smooth migration path must be provided.

Of course.  But the path depends on the detailed final target, which has 
not, as far as I know, been finalized, and certainly not in the needed 
PEP.  Your proposal might be part of the transition section of such a PEP 
or of a separate migration path PEP.

Terry J. Reedy





Re: [Python-Dev] Generalised String Coercion

2005-08-06 Thread Guido van Rossum
[Removed python-list CC]

On 8/6/05, Terry Reedy [EMAIL PROTECTED] wrote:
  PEP: 349
  Title: Generalised String Coercion
 ...
  Rationale
 Python has had a Unicode string type for some time now but use of
 it is not yet widespread.  There is a large amount of Python code
 that assumes that string data is represented as str instances.
 The long term plan for Python is to phase out the str type and use
 unicode for all string data.
 
 This PEP strikes me as premature, as putting the toy wagon before the
 horse, since it is premised on a major change to Python, possibly the most
 disruptive and controversial ever, being a done deal.  However, there is, as
 far as I could find, no PEP on "Making Strings be Unicode", let alone a
 discussed, debated, and finalized PEP on the subject.

True. OTOH, Jython and IronPython already have this, and it is my
definite plan to make all strings Unicode in Python 3000. The rest
(such as a bytes datatype) is details, as they say. :-)

My first response to the PEP, however, is that instead of a new
built-in function, I'd rather relax the requirement that str() return
an 8-bit string -- after all, int() is allowed to return a long, so
why couldn't str() be allowed to return a Unicode string?

The main problem for a smooth Unicode transition remains I/O, in my
opinion; I'd like to see a PEP describing a way to attach an encoding
to text files, and a way to decide on a default encoding for stdin,
stdout, stderr.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)