Re: [Python-3000] locale-aware strings ?

2006-09-13 Thread Brian Quinlan
Martin v. Löwis wrote: >> I can assure you >> that most of the documents that I work with are not in CP436 - they are >> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that >> this is true of many Windows XP (US-English) users. So, for me and users >> like me, Python is going t

Re: [Python-3000] iostack, second revision

2006-09-13 Thread Anders J. Munch
Josiah Carlson wrote: > "Anders J. Munch" <[EMAIL PROTECTED]> wrote: > > I don't expect file methods and system calls to map one to one, but > > you're right, the first time the length is needed, that's an extra > > system call. > > Every time the length is needed, a system call is required > (y
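
To illustrate the point about the length costing a system call, here is a minimal sketch in present-day Python. It is not taken from the iostack proposal; the helper name `file_length` is hypothetical. Each call asks the operating system for the size through `os.fstat`, so the answer stays correct when the file grows between calls.

```python
import os
import tempfile


def file_length(f):
    """Return the current size in bytes of an open file.

    Hypothetical helper for illustration: every call issues one fstat().
    """
    return os.fstat(f.fileno()).st_size


with tempfile.TemporaryFile() as f:
    f.write(b"hello")
    f.flush()  # push buffered bytes to the OS so fstat() can see them
    size_after_first_write = file_length(f)

    f.write(b" world")
    f.flush()
    size_after_second_write = file_length(f)
```

Caching the first result would save the second call, but the cached value would be stale as soon as anything extended the file.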

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-13 Thread John S. Yates, Jr.
On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote: > UTF-8 with BOM is the Microsoft preferred format. I believe this is a gloss. Microsoft uses UTF-16. Because the basic character unit is larger than one byte it is crucial for interoperability to prefix a string of UTF-16 text with an i
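
The role of the BOM as a byte-order indicator for UTF-16 can be seen with the codecs in today's Python standard library. This is only a sketch of the general mechanism, assuming a modern Python 3 interpreter rather than anything available when the thread was written.

```python
import codecs

text = "BOM"

# The generic 'utf-16' codec prepends a BOM in the platform's byte order.
encoded = text.encode("utf-16")
has_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The endianness-specific codecs write no BOM, so the bytes alone
# do not say which order they are in.
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")

# With the matching BOM prefixed, the generic decoder picks the right order.
from_le = (codecs.BOM_UTF16_LE + le_bytes).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be_bytes).decode("utf-16")
```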

Re: [Python-3000] string C API

2006-09-13 Thread Jim Jewett
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Fredrik Lundh schrieb: > > just noticed that PEP 3100 says that PyString_AsEncodedString and > > PyString_AsDecodedString is to be removed, but it doesn't mention > > any other PyString (or PyUnicode) functions. > > how large changes can w

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-13 Thread Marcin 'Qrczak' Kowalczyk
"John S. Yates, Jr." <[EMAIL PROTECTED]> writes: > It is a mistake on Microsoft's part to fail to strip the BOM > during conversion to UTF-8. There is no MEANINGFUL definition > of BOM in a UTF-8 string. But instead of stripping the wrapper > and converting only the text payload Microsoft lazily

Re: [Python-3000] string C API

2006-09-13 Thread Martin v. Löwis
Jim Jewett schrieb: >> For example, PyString_From{String[AndSize]|Format} would either: >> - have to grow an encoding argument >> - assume a default encoding (either ASCII or UTF-8) >> - change its signature to operate on Py_UNICODE* (although >> we don't have literals for these) or >> - be remov

Re: [Python-3000] iostack, second revision

2006-09-13 Thread Josiah Carlson
"Anders J. Munch" <[EMAIL PROTECTED]> wrote: > Josiah Carlson wrote: > > "Anders J. Munch" <[EMAIL PROTECTED]> wrote: > > > I don't expect file methods and system calls to map one to one, but > > > you're right, the first time the length is needed, that's an extra > > > system call. > > > > Ever

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-13 Thread Josiah Carlson
"John S. Yates, Jr." <[EMAIL PROTECTED]> wrote: > > On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote: > > > UTF-8 with BOM is the Microsoft preferred format. > > I believe this is a gloss. Microsoft uses UTF-16. Because > the basic character unit is larger than one byte it is crucial

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-13 Thread Paul Prescod
On 9/13/06, John S. Yates, Jr. <[EMAIL PROTECTED]> wrote: On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote: > UTF-8 with BOM is the Microsoft preferred format. It is a mistake on Microsoft's part to fail to strip the BOM during conversion to UTF-8. There is no MEANINGFUL definition of BOM in

Re: [Python-3000] string C API

2006-09-13 Thread Jim Jewett
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > Should encoding be an attribute of the string? > No. A Python string is a sequence of Unicode characters. > Even if it was created by converting from some other encoding, > that original encoding gets lost when doing the conversion > (ju

Re: [Python-3000] string C API

2006-09-13 Thread Martin v. Löwis
Jim Jewett schrieb: > Simply not encoding/decoding until required would save quite a bit of > time and space -- but then the object would need some way of > indicating which encoding it is in. Try implementing that some time. You'll find it will be incredibly complex and unmaintainable. Start with

Re: [Python-3000] string C API

2006-09-13 Thread Jim Jewett
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Jim Jewett schrieb: > > Simply not encoding/decoding until required would save quite a bit of > > time and space -- but then the object would need some way of > > indicating which encoding it is in. > Try implementing that some time. You'l

Re: [Python-3000] sys.stdin and sys.stdout with textfile

2006-09-13 Thread Guido van Rossum
On 9/11/06, Greg Ewing <[EMAIL PROTECTED]> wrote: > Guido van Rossum wrote: > > > All sorts of things are different when reading stdin vs. opening a > > filename. e.g. stdin may be a pipe. > > Which suggests that if anything is going to try > to guess the encoding, it would be better for it > to st

Re: [Python-3000] string C API

2006-09-13 Thread Martin v. Löwis
Jim Jewett schrieb: > Simply delegate such methods to a hidden per-encoding subclass. > > The UTF-8 methods will indeed be complex, unless the solution is > simply "someone called indexing/slicing/len, so I have to recode after > all." > > The Latin-1 encoding will have no such problem. I'm not

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-13 Thread Jason Orendorff
On 9/13/06, John S. Yates, Jr. <[EMAIL PROTECTED]> wrote: > It is a mistake on Microsoft's part to fail to strip the BOM > during conversion to UTF-8. John, you're mistaken about the reason this BOM is here. In Notepad at least, the BOM is intentionally generated when writing the file. It's not

Re: [Python-3000] educational aspects of Python 3000

2006-09-13 Thread Giovanni Bajo
BJörn Lindqvist <[EMAIL PROTECTED]> wrote: >>> The idea of a standard edu library though is a GREAT one. >>> [...] >> I disagree for two reasons: >> >> 1) Even a single line of boilerplate is too much >> when you're trying to pare things down to the >> bare minimum for a beginner. >> >> 2) It tea

Re: [Python-3000] BOM handling

2006-09-13 Thread Antoine Pitrou
Le mercredi 13 septembre 2006 à 09:41 -0700, Josiah Carlson a écrit : > And is generally ignored, as per unicode spec; it's a "zero width > non-breaking space" - an invisible character with no effect on wrapping > or otherwise. Well it would be better if Py3K (with all strings unicode) makes thin

Re: [Python-3000] BOM handling

2006-09-13 Thread Georg Brandl
Antoine Pitrou wrote: > Le mercredi 13 septembre 2006 à 09:41 -0700, Josiah Carlson a écrit : >> And is generally ignored, as per unicode spec; it's a "zero width >> non-breaking space" - an invisible character with no effect on wrapping >> or otherwise. > > Well it would be better if Py3K (with a

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-13 Thread Walter Dörwald
Jason Orendorff wrote: > On 9/13/06, John S. Yates, Jr. <[EMAIL PROTECTED]> wrote: >> It is a mistake on Microsoft's part to fail to strip the BOM >> during conversion to UTF-8. > > John, you're mistaken about the reason this BOM is here. > > In Notepad at least, the BOM is intentionally generate

Re: [Python-3000] BOM handling

2006-09-13 Thread Josiah Carlson
Antoine Pitrou <[EMAIL PROTECTED]> wrote: > > > Le mercredi 13 septembre 2006 à 09:41 -0700, Josiah Carlson a écrit : > > And is generally ignored, as per unicode spec; it's a "zero width > > non-breaking space" - an invisible character with no effect on wrapping > > or otherwise. > > Well it w

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-13 Thread David Hopwood
Jason Orendorff wrote: > On 9/13/06, John S. Yates, Jr. <[EMAIL PROTECTED]> wrote: > >>It is a mistake on Microsoft's part to fail to strip the BOM >>during conversion to UTF-8. > > John, you're mistaken about the reason this BOM is here. > > In Notepad at least, the BOM is intentionally generat

Re: [Python-3000] BOM handling

2006-09-13 Thread Antoine Pitrou
Hi, Le mercredi 13 septembre 2006 à 16:14 -0700, Josiah Carlson a écrit : > In any case, I believe that the above behavior is correct for the > context. Why? Because utf-8 has no endianness, its 'generic' decoding > spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and > 'utf
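
The decoding behaviour debated in this thread can be checked against the codecs in the standard library. The sketch below assumes a modern Python 3 interpreter and shows only how the named codecs treat a leading BOM; it takes no side on which behaviour is preferable.

```python
import codecs

payload = "text"

# UTF-8 input that starts with a BOM, as some Windows tools write it.
utf8_with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")
plain = utf8_with_bom.decode("utf-8")         # BOM survives as U+FEFF
stripped = utf8_with_bom.decode("utf-8-sig")  # BOM is removed

# UTF-16 little-endian input that starts with a BOM.
utf16_le_with_bom = codecs.BOM_UTF16_LE + payload.encode("utf-16-le")
generic = utf16_le_with_bom.decode("utf-16")      # BOM is consumed
explicit = utf16_le_with_bom.decode("utf-16-le")  # BOM kept as U+FEFF
```

In this respect `utf-8` behaves like the endianness-specific UTF-16 codecs, which pass the BOM through as a character, while `utf-8-sig` behaves like the generic `utf-16` codec, which treats it as a signature and drops it.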