Re: [Python-Dev] Generalised String Coercion
James Y Knight wrote:
> Hum, actually, it somewhat makes sense for the open builtin to become what is now codecs.open, for convenience's sake, although it does blur the distinction between a byte stream and a character stream somewhat. If that happens, I suppose it does actually make sense to give makefile the same signature.

We could always give the text mode/binary mode distinction in open a real meaning - text mode deals with character sequences, binary mode deals with byte sequences.

Cheers,
Nick.

--
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
http://boredomandlaziness.blogspot.com

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
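[Editor's illustration] The distinction Nick proposes can be sketched in present-day Python 3, where open() behaves this way. This is not the 2005 interpreter under discussion; the scratch file and variable names are assumptions made for the example.

```python
import os
import tempfile

# Write a few UTF-8 encoded bytes to a scratch file.
payload = "naïve\n".encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Binary mode: reads return byte sequences.
with open(path, "rb") as f:
    raw = f.read()

# Text mode: reads return character sequences, decoded with the given encoding.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)

print(type(raw).__name__, len(raw))    # bytes 7
print(type(text).__name__, len(text))  # str 6
```

The byte count is larger than the character count because the "ï" occupies two bytes in UTF-8.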
Re: [Python-Dev] Generalised String Coercion
Bob Ippolito wrote:
> It's UTF-8 by default, I highly doubt many people bother to change it.

I think your doubts are unfounded. Many Japanese people change it to EUC-JP (I believe), as UTF-8 support doesn't work well for them (or at least didn't use to).

Regards,
Martin
Re: [Python-Dev] Generalised String Coercion
Guido van Rossum wrote:
> We might be able to get there halfway in Python 2.x: we could introduce the bytes type now, and provide separate APIs to read and write them. (In fact, the array module and the f.readinto() method make this possible today, but it's too klunky so nobody uses it.) Perhaps a better API would be a new file-open mode (B?) to indicate that a file's read* operations should return bytes instead of strings. The bytes type could just be a very thin wrapper around array('b').

That answers an important question: so you want the bytes type to be mutable (and, consequently, unsuitable as a dictionary key).

Regards,
Martin
Re: [Python-Dev] Generalised String Coercion
Phillip J. Eby wrote:
>> Hm. What would be the use case for using %s with binary, non-text data?
>
> Well, I could see using it to write things like netstrings, i.e.
>
>     sock.send("%d:%s," % (len(data),data))
>
> seems like the One Obvious Way to write a netstring in today's Python at least. But perhaps there's a subtlety I've missed here.

As written, this would stop working when strings become Unicode. It's pretty clear what '%d' means (format the number in decimal numbers, using \N{DIGIT ZERO} .. \N{DIGIT NINE} as the digits). It's not all that clear what %s means: how do you get a sequence of characters out of data, when data is a byte string?

Perhaps there could be byte string literals, so that you would write

    sock.send(b"%d:%s," % (len(data),data))

but this would raise different questions:

- what does %d mean for a byte string formatting? str(len(data)) returns a character string, how do you get a byte string? In the specific case of %d, encoding as ASCII would work, though.
- if byte strings are mutable, what about byte string literals? I.e. if I do

      x = b"%d:%s,"
      x[1] = b'f'

  and run through the code the second time, will the literal have changed? Perhaps these would be displays, not literals (although I never understood why Guido calls these displays).

Regards,
Martin
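[Editor's illustration] For readers following the netstring example, here is a minimal sketch in present-day Python 3, where the questions Martin raises were eventually settled: bytes literals exist and are immutable, and the length is rendered as ASCII digits. The helper name netstring is an assumption made for the example and does not come from the thread.

```python
def netstring(data: bytes) -> bytes:
    """Frame data as a netstring: <decimal length>:<payload>,"""
    # The length is formatted as ASCII decimal digits, as suggested for %d.
    return str(len(data)).encode("ascii") + b":" + data + b","


framed = netstring(b"hello world!")
print(framed)  # b'12:hello world!,'

# Since Python 3.5 (PEP 461), %-formatting also works on bytes literals,
# which is close to the b"%d:%s," spelling floated in the message above.
same = b"%d:%s," % (12, b"hello world!")
print(same == framed)  # True
```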
Re: [Python-Dev] Generalised String Coercion
>>>>> "Martin" == Martin v Löwis [EMAIL PROTECTED] writes:

    Martin> I think your doubts are unfounded. Many Japanese people
    Martin> change it to EUC-JP (I believe), as UTF-8 support doesn't
    Martin> work well for them (or atleast didn't use to).

If you mean the UTF-8 support in Terminal, it's no better or worse than the EUC-JP support. The problem is that most Japanese Unix systems continue to default to EUC-JP, and many Windows hosts (including Samba file systems) default to Shift JIS. So people using Terminal tend to set it to match the default remote environment (few of them use shells on the Mac).

All that is certainly true of my organization, for one example.

--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can do free software business; ask what your business can do for free software.
Re: [Python-Dev] Generalised String Coercion
Stephen J. Turnbull wrote:
> If you mean the UTF-8 support in Terminal, it's no better or worse than the EUC-JP support. The problem is that most Japanese Unix systems continue to default to EUC-JP, and many Windows hosts (including Samba file systems) default to Shift JIS. So people using Terminal tend to set it to match the default remote environment (few of them use shells on the Mac).

Right: that might be the biggest problem. ls(1) would not display the file names of the remote servers in any readable way.

Thanks for the confirmation.

Regards,
Martin
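[Editor's illustration] The ls(1) problem described above is an encoding mismatch between the bytes a remote host stores and the encoding the terminal assumes. A small sketch in present-day Python 3, using a made-up file name:

```python
name = "日本語.txt"

# Bytes as a remote EUC-JP host would store the file name.
raw = name.encode("euc_jp")

# A terminal set to EUC-JP recovers the name.
as_euc = raw.decode("euc_jp")

# A terminal left on UTF-8 cannot; undecodable bytes become U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_euc == name)   # True
print(as_utf8 == name)  # False
```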
Re: [Python-Dev] Generalised String Coercion
M.-A. Lemburg [EMAIL PROTECTED] writes:

> Set the external encoding for stdin, stdout, stderr:
> (also an example for adding encoding support to an existing file object):
>
>     def set_sys_std_encoding(encoding):
>         # Load encoding support
>         (encode, decode, streamreader, streamwriter) = codecs.lookup(encoding)
>         # Wrap using stream writers and readers
>         sys.stdin = streamreader(sys.stdin)
>         sys.stdout = streamwriter(sys.stdout)
>         sys.stderr = streamwriter(sys.stderr)
>         # Add .encoding attribute for introspection
>         sys.stdin.encoding = encoding
>         sys.stdout.encoding = encoding
>         sys.stderr.encoding = encoding
>
>     set_sys_std_encoding('rot-13')
>
> Example session:
>
>     >>> print 'hello'
>     uryyb
>     >>> raw_input()
>     hello
>     h'hello'
>     >>> 1/0
>     Genpronpx (zbfg erprag pnyy ynfg):
>       Svyr fgqva, yvar 1, va ?
>     MrebQvivfvbaReebe: vagrtre qvivfvba be zbqhyb ol mreb
>
> Note that the interactive session bypasses the sys.stdin redirection, which is why you can still enter Python commands in ASCII - not sure whether there's a reason for this, or whether it's just a missing feature.

Um, I'm not quite sure how this would be implemented. Interactive input comes via PyOS_Readline which deals in FILE*s... this area of the code always confuses me :(

Cheers,
mwh

--
As it seems to me, in Perl you have to be an expert to correctly make a nested data structure like, say, a list of hashes of instances. In Python, you have to be an idiot not to be able to do it, because you just write it down.
    -- Peter Norvig, comp.lang.functional
Re: [Python-Dev] Generalised String Coercion
Martin v. Löwis wrote:
> Guido van Rossum wrote:
>> The bytes type could just be a very thin wrapper around array('b').
>
> That answers an important question: so you want the bytes type to be mutable (and, consequently, unsuitable as a dictionary key).

I would suggest a bytes/frozenbytes pair, similar to set/frozenset and list/tuple.

Cheers,
Nick.

--
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
http://boredomandlaziness.blogspot.com
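[Editor's illustration] A mutable/immutable pair of this kind exists in present-day Python 3 under the names bytearray and bytes (not bytes/frozenbytes as suggested here). The sketch below shows the dictionary-key consequence Martin points out; the variable names are assumptions made for the example.

```python
frozen = bytes(b"key")        # immutable, hashable
mutable = bytearray(b"key")   # mutable, not hashable

table = {frozen: 1}           # fine: bytes can be a dictionary key

mutable[0] = ord("K")         # in-place modification is allowed

try:
    table[mutable] = 2
    bytearray_is_hashable = True
except TypeError:
    # Mutable byte sequences are rejected as dictionary keys.
    bytearray_is_hashable = False

print(bytes(mutable))          # b'Key'
print(bytearray_is_hashable)   # False
```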
Re: [Python-Dev] Generalised String Coercion
Guido van Rossum wrote:
> [Guido]
>>> My first response to the PEP, however, is that instead of a new built-in function, I'd rather relax the requirement that str() return an 8-bit string -- after all, int() is allowed to return a long, so why couldn't str() be allowed to return a Unicode string?
>
> [MAL]
>> The problem here is that strings and Unicode are used in different ways, whereas integers and longs are very similar. Strings are used for both arbitrary data and text data, Unicode can only be used for text data.
>
> Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a separate bytes array type for non-text and for encoded text data, just like Java; strings should always be considered text data.
>
> We might be able to get there halfway in Python 2.x: we could introduce the bytes type now, and provide separate APIs to read and write them. (In fact, the array module and the f.readinto() method make this possible today, but it's too klunky so nobody uses it.) Perhaps a better API would be a new file-open mode (B?) to indicate that a file's read* operations should return bytes instead of strings. The bytes type could just be a very thin wrapper around array('b').

I'd prefer to keep such a bytes type immutable (arrays are mutable), otherwise, as Martin already mentioned, they wouldn't be usable as dictionary keys and the transition from the current string implementation would be made more difficult than necessary.

Since we won't have any use for the string type in Py3k, why not simply strip it down to a plain bytes type? (I wouldn't want to lose or have to reinvent all the optimizations that went into its implementation and which are missing in the array implementation.)

About the file-type idea: We already have text mode and binary mode - with their implementation being platform dependent. I don't think that this is a particularly good area to add new functionality.
If you use codecs.open() to open a file, you could easily write a codec which implements what you have in mind.

>> The new text() built-in would help make a clear distinction between "convert this object to a string of bytes" and "please convert this to a text representation". We need to start making the separation somewhere and I think this is a good non-invasive start.
>
> I agree with the latter, but I would prefer that any new APIs we use use a 'bytes' data type to represent non-text data, rather than having two different sets of APIs to differentiate between the use of 8-bit strings as text vs. data -- while we *currently* use 8-bit strings for both text and data, in Python 3.0 we won't, so then the interim APIs would have to change again. I'd rather introduce a new data type and new APIs that work with it.

Well, let's put it this way: it all really depends on what str() should mean in Py3k.

Given that str() is used for mixed content data strings, simply aliasing str() to unicode() in Py3k would cause a lot of breakage, due to changed semantics.

Aliasing str() to bytes() would also cause breakage, due to the fact that bytes types wouldn't have string methods like e.g. .lower(), .upper(), etc.

Perhaps str() in Py3k should become a helper that converts bytes() to Unicode, provided the content is ASCII-only.

In any case, Py3k would only have unicode() for text and bytes() for data, so there's no real need to continue using str(). If we add the text() API in Py2k and with the above meaning, then we could rename unicode() to text() in Py3k - only a cosmetic change, but one that I would find useful: text() and bytes() are more intuitive to understand than unicode() and bytes().

>> Furthermore, the text() built-in could be used to only allow 8-bit strings with ASCII content to pass through and require that all non-ASCII content be returned as Unicode.
>>
>> We wouldn't be able to enforce this in str().
>>
>> I'm +1 on adding text().
>
> I'm still -1.
>> I would also like to suggest a new formatting marker '%t' to have the same semantics as text() - instead of changing the semantics of %s as Neil suggests in the PEP. Again, the reason is to make the difference between text and arbitrary data explicit and visible in the code.
>
> Hm. What would be the use case for using %s with binary, non-text data?

I guess we'd only keep it for backwards compatibility and map it to the str() helper.

>>> The main problem for a smooth Unicode transition remains I/O, in my opinion; I'd like to see a PEP describing a way to attach an encoding to text files, and a way to decide on a default encoding for stdin, stdout, stderr.
>>
>> Hmm, not sure why you need PEPs for this:
>
> I'd forgotten how far we've come. I'm still unsure how the default encoding on stdin/stdout works.

Codecs in general work like this: they take an existing file-like object and wrap it with new versions of .read(), .write(), .readline(), etc. which filter the data through encoding and/or decoding functions. Once a file is wrapped with a codec StreamWriter/Reader, you
Re: [Python-Dev] Generalised String Coercion
Michael Hudson wrote:
> M.-A. Lemburg [EMAIL PROTECTED] writes:
>
>> Set the external encoding for stdin, stdout, stderr:
>> (also an example for adding encoding support to an existing file object):
>>
>>     def set_sys_std_encoding(encoding):
>>         # Load encoding support
>>         (encode, decode, streamreader, streamwriter) = codecs.lookup(encoding)
>>         # Wrap using stream writers and readers
>>         sys.stdin = streamreader(sys.stdin)
>>         sys.stdout = streamwriter(sys.stdout)
>>         sys.stderr = streamwriter(sys.stderr)
>>         # Add .encoding attribute for introspection
>>         sys.stdin.encoding = encoding
>>         sys.stdout.encoding = encoding
>>         sys.stderr.encoding = encoding
>>
>>     set_sys_std_encoding('rot-13')
>>
>> Example session:
>>
>>     >>> print 'hello'
>>     uryyb
>>     >>> raw_input()
>>     hello
>>     h'hello'
>>     >>> 1/0
>>     Genpronpx (zbfg erprag pnyy ynfg):
>>       Svyr fgqva, yvar 1, va ?
>>     MrebQvivfvbaReebe: vagrtre qvivfvba be zbqhyb ol mreb
>>
>> Note that the interactive session bypasses the sys.stdin redirection, which is why you can still enter Python commands in ASCII - not sure whether there's a reason for this, or whether it's just a missing feature.
>
> Um, I'm not quite sure how this would be implemented. Interactive input comes via PyOS_Readline which deals in FILE*s... this area of the code always confuses me :(

Me too. It appears that this part of the Python code has undergone so many iterations and patches that the structure has suffered a lot, e.g. the main() function calls PyRun_AnyFileFlags(stdin, stdin, cf), but the fp argument stdin is then subsequently ignored if the tok_nextc() function finds that a prompt is set.

Anyway, hacking along the same lines, I think the above can be had by changing tok_stdin_decode() to use a possibly available sys.stdin.decode() method for the decoding of the data read by PyOS_Readline(). This would then return Unicode which tok_stdin_decode() could then encode to UTF-8, which is the encoding that the tokenizer can work on.
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Aug 08 2005)
Python/Zope Consulting and Support ... http://www.egenix.com/
mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! :::
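[Editor's illustration] The decode-then-re-encode step proposed for tok_stdin_decode() can be sketched at the Python level. This is only a model of the idea, not the C implementation; the function name recode_for_tokenizer and its parameters are hypothetical.

```python
def recode_for_tokenizer(raw_line: bytes, stdin_encoding: str) -> bytes:
    """Decode a line read from the terminal, then re-encode it as UTF-8.

    Hypothetical helper: raw_line stands in for the data returned by
    PyOS_Readline(), stdin_encoding for whatever sys.stdin would report.
    """
    text = raw_line.decode(stdin_encoding)  # terminal bytes -> Unicode
    return text.encode("utf-8")             # Unicode -> tokenizer input


# The hiragana letter "あ" is the two bytes A4 A2 in EUC-JP.
utf8_line = recode_for_tokenizer(b"x = '\xa4\xa2'\n", "euc_jp")
print(utf8_line.decode("utf-8"))  # x = 'あ'
```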
Re: [Python-Dev] Generalised String Coercion
At 10:07 AM 8/8/2005 +0200, Martin v. Löwis wrote:
> Phillip J. Eby wrote:
>>> Hm. What would be the use case for using %s with binary, non-text data?
>>
>> Well, I could see using it to write things like netstrings, i.e.
>>
>>     sock.send("%d:%s," % (len(data),data))
>>
>> seems like the One Obvious Way to write a netstring in today's Python at least. But perhaps there's a subtlety I've missed here.
>
> As written, this would stop working when strings become Unicode. It's pretty clear what '%d' means (format the number in decimal numbers, using \N{DIGIT ZERO} .. \N{DIGIT NINE} as the digits). It's not all that clear what %s means: how do you get a sequence of characters out of data, when data is a byte string?
>
> Perhaps there could be byte string literals, so that you would write
>
>     sock.send(b"%d:%s," % (len(data),data))

Actually, thinking about it some more, it seems to me it's actually more like this:

    sock.send( ("%d:%s," % (len(data),data.decode('latin1'))).encode('latin1') )

That is, if all we have is unicode and bytes, and 'data' is bytes, then encoding and decoding from latin1 is the right way to do a netstring. It's a bit more painful, but still doable.

> but this would raise different questions:
>
> - what does %d mean for a byte string formatting? str(len(data)) returns a character string, how do you get a byte string? In the specific case of %d, encoding as ASCII would work, though.
> - if byte strings are mutable, what about byte string literals? I.e. if I do
>
>       x = b"%d:%s,"
>       x[1] = b'f'
>
>   and run through the code the second time, will the literal have changed? Perhaps these would be displays, not literals (although I never understood why Guido calls these displays)

I'm thinking that bytes.decode and unicode.encode are the correct way to convert between the two, and there's no such thing as a bytes literal. We can always optimize "constant".encode("constant") to a bytes display internally if necessary, although it will be a pain for programs that have lots of bytestring constants.
OTOH, we've previously discussed having a 'bytes()' constructor, and perhaps it should use latin1 as its default encoding.
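[Editor's illustration] The latin1 round trip works because Latin-1 maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding loses nothing. A sketch in present-day Python 3 syntax; the sample payload is made up.

```python
# Every byte value survives a latin-1 decode/encode round trip.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
print(round_tripped == all_bytes)  # True

# The netstring spelled with text formatting, as in the message above.
data = b"\x00\xff binary"
framed = ("%d:%s," % (len(data), data.decode("latin-1"))).encode("latin-1")
print(framed)  # b'9:\x00\xff binary,'
```

Any other 8-bit codec that is not a one-to-one mapping over all 256 values would not be safe for this purpose.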
Re: [Python-Dev] Generalised String Coercion
On Sun, Aug 07, 2005, Neil Schemenauer wrote:
> On Sat, Aug 06, 2005 at 06:56:39PM -0700, Guido van Rossum wrote:
>> My first response to the PEP, however, is that instead of a new built-in function, I'd rather relax the requirement that str() return an 8-bit string
>
> Do you have any thoughts on what the C API would be? It seems to me that PyObject_Str cannot start returning a unicode object without a lot of code breakage. I suppose we could introduce a function called something like PyObject_String.

OTOH, should Guido change his -1 on text(), that leads to the obvious PyObject_Text.

--
Aahz ([EMAIL PROTECTED]) * http://www.pythoncraft.com/
The way to build large Python applications is to componentize and loosely-couple the hell out of everything.
Re: [Python-Dev] Generalised String Coercion
Ouch. Too much discussion to respond to it all.

Please remember that in Jython and IronPython, str and unicode are already synonyms. That's how Python 3.0 will do it, except unicode will disappear as being redundant.

I like the bytes/frozenbytes pair idea.

Streams could grow a getpos()/setpos() API pair that can be used for stateful encodings (although it sounds like seek()/tell() would be okay to use in most cases as long as you read in units of whole lines).

For sockets, send()/recv() would deal in bytes, and makefile() would get an encoding parameter.

I'm not going to change my mind on text() unless someone explains what's so attractive about it.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
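[Editor's illustration] Present-day Python 3 sockets work the way this message sketches: send()/recv() deal in bytes and makefile() accepts an encoding. A minimal example using a local socket pair; the variable names are assumptions made for the example.

```python
import socket

# A connected pair of sockets, so no network access is needed.
left, right = socket.socketpair()

# makefile() returns a text-mode wrapper that encodes on write.
writer = left.makefile("w", encoding="utf-8", newline="\n")
writer.write("héllo\n")
writer.flush()

# recv() on the other end sees the raw encoded bytes.
received = right.recv(64)

writer.close()
left.close()
right.close()

print(received)  # b'h\xc3\xa9llo\n'
```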
Re: [Python-Dev] Generalised String Coercion
At 09:14 AM 8/8/2005 -0700, Guido van Rossum wrote:
> I'm not going to change my mind on text() unless someone explains what's so attractive about it.

1. It's obvious to non-programmers what it's for (str and unicode aren't)
2. It's more obvious to programmers that it's a *text* string rather than a string of bytes
3. It's easier to type than unicode, but less opaque than str
4. Switching to 'text' and 'bytes' allows for a clean break from any mental baggage now associated with 'unicode' and 'str'.

Of course, the flip side to these arguments is that in today's Python, one rarely has use for the string type names, except for coercion and some occasional type checking. On the other hand, if we end up with type declarations, then these issues become a bit more important.
Re: [Python-Dev] Generalised String Coercion
Phillip J. Eby wrote:
> Actually, thinking about it some more, it seems to me it's actually more like this:
>
>     sock.send( ("%d:%s," % (len(data),data.decode('latin1'))).encode('latin1') )

While this would work, it would still feel wrong: the binary data are *not* latin1 (most likely), so declaring them to be latin1 would be confusing. Perhaps a synonym '8bit' for latin1 could be introduced.

Regards,
Martin
Re: [Python-Dev] Generalised String Coercion
On Sat, Aug 06, 2005 at 06:56:39PM -0700, Guido van Rossum wrote:
> My first response to the PEP, however, is that instead of a new built-in function, I'd rather relax the requirement that str() return an 8-bit string -- after all, int() is allowed to return a long, so why couldn't str() be allowed to return a Unicode string?

I've played with this idea a bit and it seems viable. I modified my original patch to have string_new call PyObject_Text instead of PyObject_Str. That change breaks only two tests, both in test_email. The tracebacks are attached. Both problems seem relatively shallow. Do you think such a change could go into 2.5?

  Neil

Traceback (most recent call last):
  File "/home/nas/Python/py_cvs/Lib/email/test/test_email.py", line 2844, in test_encoded_adjacent_nonencoded
    h = make_header(decode_header(s))
  File "/home/nas/Python/py_cvs/Lib/email/Header.py", line 123, in make_header
    charset = Charset(charset)
  File "/home/nas/Python/py_cvs/Lib/email/Charset.py", line 190, in __init__
    input_charset = unicode(input_charset, 'ascii').lower()
TypeError: decoding Unicode is not supported

Traceback (most recent call last):
  File "/home/nas/Python/py_cvs/Lib/email/test/test_email.py", line 2750, in test_multilingual
    eq(decode_header(enc),
  File "/home/nas/Python/py_cvs/Lib/email/Header.py", line 85, in decode_header
    dec = email.quopriMIME.header_decode(encoded)
  File "/home/nas/Python/py_cvs/Lib/email/quopriMIME.py", line 319, in header_decode
    return re.sub(r'=\w{2}', _unquote_match, s)
  File "/home/nas/Python/py_cvs/Lib/sre.py", line 142, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
Re: [Python-Dev] Generalised String Coercion
[Phillip J. Eby]
> At 09:14 AM 8/8/2005 -0700, Guido van Rossum wrote:
>> I'm not going to change my mind on text() unless someone explains what's so attractive about it.
>
> 2. It's more obvious to programmers that it's a *text* string rather than a string of bytes

I've no opinion on the proposal itself, except maybe that 'text', that precise word or name, is a pretty bad choice. It is far too likely that people already use or want to use that precise identifier.

There once was a suggestion for naming 'text' the module now known as 'textwrap', under the premise that it could be later extended for holding many other various text-related functions. Happily enough, this idea was not retained. 'textwrap' is much more reasonable as a name.

I found Python 1.5.2's 'string' to be especially prone to clashing. I still find 'socket' obtrusive in that respect. Consider 'len' as an example of a clever choice, while 'length' would not have been. 'str' is also a good choice. 'object' is a bit more annoying theoretically, yet we almost never need it in practice. 'type' is annoying as a name (yet very nice as a concept), as if it was free to use, it would often serve to label our own things. The fact is we often need the built-in.

Python should not choose common English words for its built-ins, without very careful thought, and be reluctant to any compulsion in this area.

--
François Pinard http://pinard.progiciels-bpi.ca
Re: [Python-Dev] Generalised String Coercion
>>>>> "Martin" == Martin v Löwis [EMAIL PROTECTED] writes:

    Martin> While this would work, it would still feel wrong: the
    Martin> binary data are *not* latin1 (most likely), so declaring
    Martin> them to be latin1 would be confusing. Perhaps a synonym
    Martin> '8bit' for latin1 could be introduced.

Be careful. This alias has caused Emacs some amount of pain, as binary data escapes into contexts (such as Universal Newline processing) where it gets interpreted as character data. We've also had some problems in codec implementation, because latin1 and (eg) latin9 have some differences in semantics other than changing the coded character set for the GR register---controls are treated differently, for example, because they _are_ binary (alias latin1) octets, but not in the range of the latin9 code.

I won't go so far as to say it won't work, but it will require careful design.

--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can do free software business; ask what your business can do for free software.
Re: [Python-Dev] Generalised String Coercion
Guido van Rossum wrote:
> The main problem for a smooth Unicode transition remains I/O, in my opinion; I'd like to see a PEP describing a way to attach an encoding to text files, and a way to decide on a default encoding for stdin, stdout, stderr.

FWIW, I've already drafted a patch for the former. It lets you write to file.encoding and honors this when writing Unicode strings to it.

http://www.python.org/sf/1214889

Reinhold

--
Mail address is perfectly valid!
Re: [Python-Dev] Generalised String Coercion
Guido van Rossum wrote:
> My first response to the PEP, however, is that instead of a new built-in function, I'd rather relax the requirement that str() return an 8-bit string -- after all, int() is allowed to return a long, so why couldn't str() be allowed to return a Unicode string?

The problem here is that strings and Unicode are used in different ways, whereas integers and longs are very similar. Strings are used for both arbitrary data and text data, Unicode can only be used for text data.

The new text() built-in would help make a clear distinction between "convert this object to a string of bytes" and "please convert this to a text representation". We need to start making the separation somewhere and I think this is a good non-invasive start.

Furthermore, the text() built-in could be used to only allow 8-bit strings with ASCII content to pass through and require that all non-ASCII content be returned as Unicode.

We wouldn't be able to enforce this in str().

I'm +1 on adding text().

I would also like to suggest a new formatting marker '%t' to have the same semantics as text() - instead of changing the semantics of %s as Neil suggests in the PEP. Again, the reason is to make the difference between text and arbitrary data explicit and visible in the code.

> The main problem for a smooth Unicode transition remains I/O, in my opinion; I'd like to see a PEP describing a way to attach an encoding to text files, and a way to decide on a default encoding for stdin, stdout, stderr.

Hmm, not sure why you need PEPs for this:

Open an encoded file:

- Use codecs.open() instead of open() or file().
Set the external encoding for stdin, stdout, stderr:
(also an example for adding encoding support to an existing file object):

    def set_sys_std_encoding(encoding):
        # Load encoding support
        (encode, decode, streamreader, streamwriter) = codecs.lookup(encoding)
        # Wrap using stream writers and readers
        sys.stdin = streamreader(sys.stdin)
        sys.stdout = streamwriter(sys.stdout)
        sys.stderr = streamwriter(sys.stderr)
        # Add .encoding attribute for introspection
        sys.stdin.encoding = encoding
        sys.stdout.encoding = encoding
        sys.stderr.encoding = encoding

    set_sys_std_encoding('rot-13')

Example session:

    >>> print 'hello'
    uryyb
    >>> raw_input()
    hello
    h'hello'
    >>> 1/0
    Genpronpx (zbfg erprag pnyy ynfg):
      Svyr fgqva, yvar 1, va ?
    MrebQvivfvbaReebe: vagrtre qvivfvba be zbqhyb ol mreb

Note that the interactive session bypasses the sys.stdin redirection, which is why you can still enter Python commands in ASCII - not sure whether there's a reason for this, or whether it's just a missing feature.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Aug 07 2005)
Python/Zope Consulting and Support ... http://www.egenix.com/
mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! :::
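[Editor's illustration] The wrapping technique above can be tried in present-day Python 3 without touching the real standard streams, by wrapping an in-memory byte buffer instead. Note two differences from the 2005 code that are easy to trip over: codecs.getwriter() is used to fetch the stream writer, and rot-13 is now a text-to-text transform reached through codecs.encode() rather than a stream codec. The buffer and variable names are assumptions made for the example.

```python
import codecs
import io

# Wrap a byte-oriented file-like object with a codec StreamWriter.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)

# Text goes in; encoded bytes come out of the underlying object.
writer.write("Löwis")
encoded = raw.getvalue()
print(encoded)  # b'L\xc3\xb6wis'

# The rot-13 effect from the example session, as a str-to-str transform.
scrambled = codecs.encode("hello", "rot_13")
print(scrambled)  # uryyb
```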
Re: [Python-Dev] Generalised String Coercion
[me]
> a way to decide on a default encoding for stdin, stdout, stderr.

[Martin]
> If stdin, stdout and stderr go to a terminal, there already is a default encoding (actually, there always is a default encoding on these, as it falls back to the system encoding if it's not a terminal, or if the terminal's encoding is not supported or cannot be determined).

So there is. Wow! I never knew this. How does it work? Can we use this for writing to files too?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
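[Editor's illustration] The fallback Martin describes can be inspected from Python itself. The sketch below uses present-day Python 3 and only approximates the lookup order described in the message: the stream's own encoding if it reports one, otherwise the locale's preferred encoding. The variable names are assumptions made for the example, and the printed value depends on the environment.

```python
import locale
import sys

# Encoding attached to the stream, if the stream object reports one.
stream_encoding = getattr(sys.stdout, "encoding", None)

# System-level fallback, taken from the locale settings.
fallback_encoding = locale.getpreferredencoding(False)

effective = stream_encoding or fallback_encoding
print(effective)
```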
Re: [Python-Dev] Generalised String Coercion
[Guido]
> My first response to the PEP, however, is that instead of a new built-in function, I'd rather relax the requirement that str() return an 8-bit string -- after all, int() is allowed to return a long, so why couldn't str() be allowed to return a Unicode string?

[MAL]
> The problem here is that strings and Unicode are used in different ways, whereas integers and longs are very similar. Strings are used for both arbitrary data and text data, Unicode can only be used for text data.

Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a separate bytes array type for non-text and for encoded text data, just like Java; strings should always be considered text data.

We might be able to get there halfway in Python 2.x: we could introduce the bytes type now, and provide separate APIs to read and write them. (In fact, the array module and the f.readinto() method make this possible today, but it's too klunky so nobody uses it.) Perhaps a better API would be a new file-open mode (B?) to indicate that a file's read* operations should return bytes instead of strings. The bytes type could just be a very thin wrapper around array('b').

> The new text() built-in would help make a clear distinction between "convert this object to a string of bytes" and "please convert this to a text representation". We need to start making the separation somewhere and I think this is a good non-invasive start.

I agree with the latter, but I would prefer that any new APIs we use use a 'bytes' data type to represent non-text data, rather than having two different sets of APIs to differentiate between the use of 8-bit strings as text vs. data -- while we *currently* use 8-bit strings for both text and data, in Python 3.0 we won't, so then the interim APIs would have to change again. I'd rather introduce a new data type and new APIs that work with it.
Furthermore, the text() built-in could be used to only allow 8-bit strings with ASCII content to pass through and require that all non-ASCII content be returned as Unicode. We wouldn't be able to enforce this in str(). I'm +1 on adding text(). I'm still -1. I would also like to suggest a new formatting marker '%t' to have the same semantics as text() - instead of changing the semantics of %s as the Neil suggests in the PEP. Again, the reason is to make the difference between text and arbitrary data explicit and visible in the code. Hm. What would be the use case for using %s with binary, non-text data? The main problem for a smooth Unicode transition remains I/O, in my opinion; I'd like to see a PEP describing a way to attach an encoding to text files, and a way to decide on a default encoding for stdin, stdout, stderr. Hmm, not sure why you need PEPs for this: I'd forgotten how far we've come. I'm still unsure how the default encoding on stdin/stdout works. But it still needs to be simpler; IMO the built-in open() function should have an encoding keyword. (But it could return something whose type is not 'file' -- once again making a distinction between open and file.) Do these files support universal newlines? IMO they should. -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
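[Editor's illustration] A minimal sketch of the bytes/text split discussed in this message, written against modern Python 3, where open() did eventually gain an encoding keyword. It is not something that ran on the 2.x interpreters under discussion, and the file name and sample text are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample: two lines of non-ASCII text with CRLF line endings.
text = "caf\u00e9\r\nna\u00efve\r\n"
raw = text.encode("utf-16-le")

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(raw)

# Binary mode: readinto() fills a caller-supplied mutable buffer with bytes.
buf = bytearray(len(raw))
with open(path, "rb") as f:
    n = f.readinto(buf)

# Text mode: the encoding keyword decodes the bytes, and the default
# newline=None applies universal-newline translation to the decoded text.
with open(path, "r", encoding="utf-16-le") as f:
    decoded = f.read()
```

Here buf holds the undecoded bytes exactly as stored, while decoded is a text string whose CRLF pairs have been translated to a single newline character.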
Re: [Python-Dev] Generalised String Coercion
On Sat, Aug 06, 2005 at 06:56:39PM -0700, Guido van Rossum wrote:
> My first response to the PEP, however, is that instead of a new built-in function, I'd rather relax the requirement that str() return an 8-bit string

Do you have any thoughts on what the C API would be? It seems to me that PyObject_Str cannot start returning a unicode object without a lot of code breakage. I suppose we could introduce a function called something like PyObject_String.

Neil
Re: [Python-Dev] Generalised String Coercion
At 05:24 PM 8/7/2005 -0700, Guido van Rossum wrote:
> Hm. What would be the use case for using %s with binary, non-text data?

Well, I could see using it to write things like netstrings, i.e. sock.send("%d:%s," % (len(data), data)) seems like the One Obvious Way to write a netstring in today's Python at least. But perhaps there's a subtlety I've missed here.
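[Editor's illustration] For comparison, a small sketch of the same netstring framing in modern Python 3, where the payload has to be bytes and %-formatting on bytes is available from 3.5 onward. The helper name netstring is hypothetical, not an existing library function.

```python
def netstring(data: bytes) -> bytes:
    """Frame data as a netstring: <decimal length>:<payload>,"""
    # The length counts bytes, not characters, so text must be
    # encoded before it is framed.
    return b"%d:%s," % (len(data), data)


frame = netstring(b"hello world!")
encoded_frame = netstring("h\u00e9llo".encode("utf-8"))
```

The second call shows the subtlety: the five-character string occupies six bytes in UTF-8, and the length prefix has to reflect the encoded size.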
Re: [Python-Dev] Generalised String Coercion
Guido van Rossum wrote:
>> If stdin, stdout and stderr go to a terminal, there already is a default encoding (actually, there always is a default encoding on these, as it falls back to the system encoding if it's not a terminal, or if the terminal's encoding is not supported or cannot be determined).
>
> So there is. Wow! I never knew this. How does it work? Can we use this for writing to files too?

On Unix, it uses nl_langinfo(CHARSET), which in turn looks at the environment variables. On Windows, it uses GetConsoleCP()/GetConsoleOutputCP(). On Mac, I'm still searching for a way to determine the encoding of Terminal.app. In IDLE, it uses locale.getpreferredencoding().

So no, this cannot easily be used for file output. Most likely, people would use locale.getpreferredencoding() for file output. For socket output, there should not be a standard way to encode Unicode.

Regards, Martin
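[Editor's illustration] A short sketch of how these values can be inspected from Python. It assumes a modern Python 3 interpreter; the variable names are made up, and the values printed depend entirely on the platform and environment, so no particular output is claimed.

```python
import locale
import sys

# Encoding chosen for the standard output stream; None if the stream
# object does not expose one (for example, when it has been replaced).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding the locale suggests for text data such as file contents.
preferred = locale.getpreferredencoding(False)

# nl_langinfo() is only available on Unix-like platforms.
if hasattr(locale, "nl_langinfo"):
    codeset = locale.nl_langinfo(locale.CODESET)
else:
    codeset = None

print("stdout:", stdout_encoding)
print("preferred:", preferred)
print("nl_langinfo(CODESET):", codeset)
```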
Re: [Python-Dev] Generalised String Coercion
On Aug 7, 2005, at 7:37 PM, Martin v. Löwis wrote:
> Guido van Rossum wrote:
>>> If stdin, stdout and stderr go to a terminal, there already is a default encoding (actually, there always is a default encoding on these, as it falls back to the system encoding if it's not a terminal, or if the terminal's encoding is not supported or cannot be determined).
>>
>> So there is. Wow! I never knew this. How does it work? Can we use this for writing to files too?
>
> On Unix, it uses nl_langinfo(CHARSET), which in turn looks at the environment variables. On Windows, it uses GetConsoleCP()/GetConsoleOutputCP(). On Mac, I'm still searching for a way to determine the encoding of Terminal.app.

It's UTF-8 by default; I highly doubt many people bother to change it.

-bob
Re: [Python-Dev] Generalised String Coercion
Guido van Rossum wrote:
> I'm not sure if it works for all encodings, but if possible I'd like to extend the seeking semantics on text files: seek positions are byte counts, and the application should consider them as magic cookies.

If the seek position is merely a number, it won't work for all encodings. For the ISO 2022 ones (iso-2022-jp etc), you need to know the shift state: you can switch to a different encoding in the stream using standard escape codes, and then the same bytes are interpreted differently. For example, iso-2022-jp supports these escape codes:

ESC ( B     ASCII
ESC $ @     JIS X 0208-1978
ESC $ B     JIS X 0208-1983
ESC ( J     JIS X 0201-Roman
ESC $ A     GB2312-1980
ESC $ ( C   KSC5601-1987
ESC $ ( D   JIS X 0212-1990
ESC . A     ISO8859-1
ESC . F     ISO8859-7

So at a certain position in the stream, the same bytes could mean different characters, depending on which shift state you are in.

That's why ISO C introduced fgetpos/fsetpos in addition to ftell/fseek: an fpos_t is a truly opaque structure that can also incorporate codec state.

If you follow this approach, you can get back most of seek; you will lose the whence parameter, i.e. you cannot seek forth and back, and you cannot position at the end of the file (actually, iso-2022-jp still supports appending to a file, since it requires that all data shift out back to ASCII at the end of each line, and at the end of the file. So correct ISO 2022 files can still be concatenated).

> Is there any reason not to do Universal Newline processing on *all* text files?

Correct. However, this still might result in a full rewrite of the universal newlines code: the code currently operates on byte streams, when it should operate on character streams. In some encodings, CRLF simply isn't represented by \x0d\x0a (e.g. UTF-16-LE: \x0d\0\x0a\0).

Regards, Martin
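[Editor's illustration] The two points in this message can be checked with a small sketch using the iso-2022-jp and utf-16-le codecs shipped with modern Python 3. The variable names are made up for illustration.

```python
ESC_JIS = b"\x1b$B"    # ESC $ B: shift to JIS X 0208-1983
ESC_ASCII = b"\x1b(B"  # ESC ( B: shift back to ASCII

payload = b'$"'  # the same two bytes, decoded in two shift states

# In the initial (ASCII) state the bytes are just a dollar sign and a quote.
in_ascii = payload.decode("iso-2022-jp")

# After ESC $ B the same two bytes form one double-byte JIS X 0208 character.
in_jis = (ESC_JIS + payload + ESC_ASCII).decode("iso-2022-jp")

# CRLF in UTF-16-LE is four bytes and never contains the pair \x0d\x0a.
crlf_utf16 = "\r\n".encode("utf-16-le")
```

A byte offset alone therefore cannot tell a decoder how to interpret what follows; it also needs the shift state in effect at that offset.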
Re: [Python-Dev] Generalised String Coercion
> PEP: 349
> Title: Generalised String Coercion
> ...
> Rationale
>
> Python has had a Unicode string type for some time now but use of it is not yet widespread. There is a large amount of Python code that assumes that string data is represented as str instances. The long term plan for Python is to phase out the str type and use unicode for all string data.

This PEP strikes me as premature, as putting the toy wagon before the horse, since it is premised on a major change to Python, possibly the most disruptive and controversial ever, being a done deal. However there is, as far as I could find, no PEP on Making Strings be Unicode, let alone a discussed, debated, and finalized PEP on the subject.

> Clearly, a smooth migration path must be provided.

Of course. But the path depends on the detailed final target, which has not, as far as I know, been finalized, and certainly not in the needed PEP. Your proposal might be part of the transition section of such a PEP or of a separate migration path PEP.

Terry J. Reedy
Re: [Python-Dev] Generalised String Coercion
[Removed python-list CC]

On 8/6/05, Terry Reedy [EMAIL PROTECTED] wrote:
>> PEP: 349
>> Title: Generalised String Coercion
>> ...
>> Rationale
>>
>> Python has had a Unicode string type for some time now but use of it is not yet widespread. There is a large amount of Python code that assumes that string data is represented as str instances. The long term plan for Python is to phase out the str type and use unicode for all string data.
>
> This PEP strikes me as premature, as putting the toy wagon before the horse, since it is premised on a major change to Python, possibly the most disruptive and controversial ever, being a done deal. However there is, as far as I could find, no PEP on Making Strings be Unicode, let alone a discussed, debated, and finalized PEP on the subject.

True. OTOH, Jython and IronPython already have this, and it is my definite plan to make all strings Unicode in Python 3000. The rest (such as a bytes datatype) is details, as they say. :-)

My first response to the PEP, however, is that instead of a new built-in function, I'd rather relax the requirement that str() return an 8-bit string -- after all, int() is allowed to return a long, so why couldn't str() be allowed to return a Unicode string?

The main problem for a smooth Unicode transition remains I/O, in my opinion; I'd like to see a PEP describing a way to attach an encoding to text files, and a way to decide on a default encoding for stdin, stdout, stderr.

-- --Guido van Rossum (home page: http://www.python.org/~guido/)