Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11.01.2014 03:04, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 20:53:09 -0500
> "Eric V. Smith" wrote:
>>
>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
>> 3892. See for example http://bugs.python.org/issue3982#msg180432 .

I agree.

> Then we might as well not do anything, since any attempt to advance
> things is met by stubborn opposition in the name of "not far enough".
>
> (I don't care much personally, I think the issue is quite overblown
> anyway)

So you wouldn't mind another overhaul of the PEP including a bit more
functionality again? :)

I really think that practicality beats purity here.

(I'm not advocating free mixing of bytes and str, mind you!)

Georg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python3 "complexity" - 2 use cases
"Jim J. Jewett" writes: > > > Steven D'Aprano wrote: > >> I think that heuristics to guess the encoding have their role to play, > >> if the caller understands the risks. > > Ben Finney wrote: > > In my opinion, content-type guessing heuristics certainly don't belong > > in the standard library. > > It would be great if there were never any need to guess. But in the > real world, there is -- and often the user won't know any more than > python does. That's why I think it's great to have heuristic guessing code available as a third-party library. > So when it is time to guess, a source of good guesses is an important > battery to include. Why is it important enough to deserve that privilege, over the thousands of other candidates for the standard library? The barrier for entry to the standard library is higher than mere usefulness. > We should explicitly treat autodetection like time zone data -- > there is no promise that the "right answer" (or at least the "best > guess") won't change, even within a release. But there is exactly one set of authoritative time zones at any particular point in time. That's why it makes sense to have that set of authoritative values available in the standard library. Heuristic guesses about content types do not have the property of exactly one authoritative source, so your analogy is not compelling. -- \ “Unix is an operating system, OS/2 is half an operating system, | `\Windows is a shell, and DOS is a boot partition virus.” —Peter | _o__)H. Coffin | Ben Finney ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:

> As you may know, PDF operates over bytes and an integer or floating-point
> number is written down as-is, for example "100" or "1.23".

I'm sorry, I don't understand what you mean here. I'm honestly not
trying to be difficult: you sound confident that you understand what
you are doing, but your description doesn't make sense to me. To me, it
looks like you are conflating bytes and ASCII characters, that is,
assuming that characters "are" in some sense identical to their ASCII
representation. Let me explain:

The integer that in English is written as 100 is represented in memory
as bytes 0x0064 (assuming a big-endian C short), so when you say "an
integer is written down AS-IS" (emphasis added), to me that says that
the PDF file includes the bytes 0x0064. But then you go on to write the
three character string "100", which (assuming ASCII) is the bytes
0x313030. Going from the C short to the ASCII representation 0x313030
is nothing like inserting the int "as-is". To put it another way, the
Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary
file which includes some ASCII-formatted text fields. So when writing
an integer 100, rather than writing it "as is", which would be byte
0x64 (with however many leading null bytes are needed for padding), it
is converted to the ASCII representation 0x313030 first, and that's
what needs to be inserted.

If you consider PDF as binary with occasional pieces of ASCII text,
then working with bytes makes sense. But I wonder whether it might be
better to consider PDF as mostly text with some binary bytes. Even
though the bulk of the PDF will be binary, the interesting bits are
text. E.g. your example:

> In the case of PDF, the embedding of an image into PDF looks like:
>
>     10 0 obj
>       << /Type /XObject
>          /Width 100
>          /Height 100
>          /Alternates 15 0 R
>          /Length 2167
>       >>
>     stream
>     ...binary image data...
>     endstream
>     endobj

Even though the binary image data is probably much, much larger in
length than the text shown above, it's (probably) trivial to deal with:
convert your image data into bytes, decode those bytes into Latin-1,
then concatenate the Latin-1 string into the text above.

Latin-1 has the nice property that every byte decodes into the
character with the same code point, and vice versa. So:

    for i in range(256):
        assert bytes([i]).decode('latin-1') == chr(i)
        assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode
text with embedded binary data, rather than binary data with embedded
ASCII text. Then when writing the file to disk, of course you encode it
to Latin-1, either explicitly:

    pdf = ...  # Unicode string containing the PDF contents
    with open("outfile.pdf", "wb") as f:
        f.write(pdf.encode("latin-1"))

or implicitly:

    with open("outfile.pdf", "w", encoding="latin-1") as f:
        f.write(pdf)

There may be a few wrinkles I haven't thought of; I don't claim to be
an expert on PDF. But I see no reason why PDF files ought to be an
exception to the rule:

    * work internally with Unicode text;
    * convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and later, the internal
representation of Unicode strings containing only code points up to 255
(i.e. pure ASCII or pure Latin-1) is very efficient, using only one
byte per character.

Another advantage is that using text rather than bytes means that your
example:

[...]
> dropping the bytes-formatting of numbers makes it more complicated
> than it was. I would appreciate any explanation on how:
>
>     b'%.1f %.1f %.1f RG' % (r, g, b)

becomes simply

    '%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can be written as:

    u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2.
That's *much* nicer than your suggestion:

> is more confusing than:
>
>     b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'),
>                                (r, g, b)))

--
Steven
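A minimal sketch of the text-first approach described in the message above, assuming the colour components are ordinary Python floats. The helper name rg_operator is made up for illustration and is not part of any PDF library. The second half shows one of the possible "wrinkles": a character whose code point is above 255 cannot be encoded as Latin-1, so the text has to stay within that range.

```python
def rg_operator(r, g, b):
    """Format a PDF 'RG' colour operator as text, then encode it once."""
    text = '%.1f %.1f %.1f RG' % (r, g, b)
    return text.encode('latin-1')


line = rg_operator(1.0, 0.5, 0.0)

# Wrinkle: a code point above 255 (here U+20AC) has no Latin-1 encoding.
try:
    '\u20ac'.encode('latin-1')
    euro_encodable = True
except UnicodeEncodeError:
    euro_encodable = False
```

The encode step happens once, at the boundary, which is the rule the message argues for; whether that is convenient for a whole PDF writer is exactly what the thread disputes.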
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11Jan2014 00:43, Juraj Sukop wrote:
> On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner wrote:
> > Why not build "10 0 obj ... stream" and "endstream endobj" in
> > Unicode and then encode to ASCII? Example:
> >
> >     data = b''.join((
> >         ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
> >         binary_image_data,
> >         ("endstream endobj").encode('ascii'),
> >     ))
>
> The key is "encode to ASCII" which means that the result is bytes. Then,
> there is this "11 0 obj" which should also be bytes. But it has no
> "binary_image_data" - only lots of numbers waiting to be somehow converted
> to bytes. I already mentioned the problems with ".encode('ascii')" but it
> does not stop here. Numbers may appear not only inside "streams" but almost
> anywhere: in the header there is PDF version, an image has to have "width"
> and "height", at the end of PDF there is a structure containing offsets to
> all of the objects in file. Basically, to ".encode('ascii')" every possible
> number is not exactly simple or pretty.

Hi Juraj,

Might I suggest a helper function (outside the PEP scope) instead of
arguing for support for %f et al?

Thus:

    def bytify(things, encoding='ascii'):
        for thing in things:
            if isinstance(thing, bytes):
                yield thing
            else:
                yield str(thing).encode(encoding)

Then one's embedding in PDF might become, more readably:

    data = b' '.join( bytify( [ 10, 0, 'obj', binary_image_data, ... ] ) )

Of course, bytify might be augmented with whatever encoding facilities
might suit your needs.

Cheers,
--
Cameron Simpson

We tend to overestimate the short-term impact of technological change and
underestimate its long-term impact. - Amara's Law
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
To avoid implicit conversion between str and bytes, I propose adding
only a limited %-format, not .format() or .format_map().

"Limited %-format" means:

%c accepts an integer or a bytes object of length one.
%r is not supported.
%s accepts only bytes.
%a is the only format that accepts an arbitrary object.
All other format codes behave the same as they do for str.

On Sat, Jan 11, 2014 at 8:24 AM, Antoine Pitrou wrote:

> On Fri, 10 Jan 2014 18:14:45 -0500
> "Eric V. Smith" wrote:
> >
> > >> Because embedding the ASCII equivalent of ints and floats in byte
> > >> streams is a common operation?
> > >
> > > Again, if you're representing "ASCII", you're representing text and
> > > should use a str object.
> >
> > Yes, but is there existing 2.x code that uses %s for int and float
> > (perhaps unwittingly), and do we want to "help" that code out?
> > Or do we
> > want to make porters first change to using %d or %f instead of %s?
>
> I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and
> %f on bytes objects.
>
> > I think what you're getting at is that in addition to not calling
> > __format__, we don't want to call __str__, either, for the same reason.
>
> Not only. We don't want to do anything that actually asks for a
> *textual* representation of something. %d and %f ask for a textual
> representation of a number, so they're right out.
>
> Regards
>
> Antoine.

--
INADA Naoki
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 06:39 PM, Antoine Pitrou wrote:
>
> I know what a network protocol with ill-defined encodings looks like.

For the record, I've been (and I suspect Eric and some others have also
been) talking about well-defined encodings. For the DBF files that I
work with, there is binary, ASCII, and a third encoding that is
specified in the file header.

--
~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 06:39 PM, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 18:28:41 -0800
> Ethan Furman wrote:
>>
>> Is it safe to assume you don't use Python for the use-cases under
>> discussion?
>
> You know, I've done quite a bit of network programming.

No, I didn't, that's why I asked.

> I've also done an experimental port of Twisted to Python 3. I know what
> a network protocol with ill-defined encodings looks like.

Can you give a code sample of what you think, for example, the PDF
generation code should look like? (If you already have, I apologize --
I missed it in all the ruckus.)

--
~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 18:28:41 -0800 Ethan Furman wrote: > > Is it safe to assume you don't use Python for the use-cases under discussion? You know, I've done quite a bit of network programming. I've also done an experimental port of Twisted to Python 3. I know what a network protocol with ill-defined encodings looks like. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 20:53:09 -0500
> "Eric V. Smith" wrote:
>>
>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
>> 3892. See for example http://bugs.python.org/issue3982#msg180432 .
>
> Then we might as well not do anything, since any attempt to advance
> things is met by stubborn opposition in the name of "not far enough".

Heh, and here I thought it was stubborn opposition in the name of
purity. ;)

> (I don't care much personally, I think the issue is quite overblown
> anyway)

Is it safe to assume you don't use Python for the use-cases under
discussion? Specifically, mixed ASCII, binary, and encoded-text byte
streams?

--
~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" wrote: > > So, I'm -1 on the PEP. It doesn't address the cases laid out in issue > 3892. See for example http://bugs.python.org/issue3982#msg180432 . Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough". (I don't care much personally, I think the issue is quite overblown anyway) Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 8:12 PM, Antoine Pitrou wrote: > On Fri, 10 Jan 2014 16:23:53 -0800 > Ethan Furman wrote: >> On 01/08/2014 02:42 PM, Antoine Pitrou wrote: >>> >>> With Victor's consent, I overhauled PEP 460 and made the feature set >>> more restricted and consistent with the bytes/str separation. >> >> From the PEP: >> = >>> Python 3 generally mandates that text be stored and manipulated as >>> unicode (i.e. str objects, not bytes). In some cases, though, it >>> makes sense to manipulate bytes objects directly. Typical usage is >>> binary network protocols, where you can want to interpolate and >>> assemble several bytes object (some of them literals, some of them >>> compute) to produce complete protocol messages. For example, >>> protocols such as HTTP or SIP have headers with ASCII names and >>> opaque "textual" values using a varying and/or sometimes ill-defined >>> encoding. Moreover, those headers can be followed by a binary >>> body... which can be chunked and decorated with ASCII headers and >>> trailers! >> >> As it stands now, the PEP talks about ASCII, about how it makes sense >> sometimes to work directly with bytes objects, and >> then refuses to allow % to embed ASCII text in the byte stream. > > Indeed I refuse for %-formatting to allow the mixing of bytes and str > objects, in the same way that it is forbidden to concatenate "a" and > b"b" together, or to write b"".join(["abc"]). I think: 'a' + b'b' is different from: b'Content-Length: %d' % 42 The former we want to prevent, but I see nothing wrong with the latter. So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3892. See for example http://bugs.python.org/issue3982#msg180432 . Eric. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 16:23:53 -0800
Ethan Furman wrote:
> On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
> >
> > With Victor's consent, I overhauled PEP 460 and made the feature set
> > more restricted and consistent with the bytes/str separation.
>
> From the PEP:
>
> > Python 3 generally mandates that text be stored and manipulated as
> > unicode (i.e. str objects, not bytes). In some cases, though, it
> > makes sense to manipulate bytes objects directly. Typical usage is
> > binary network protocols, where you can want to interpolate and
> > assemble several bytes object (some of them literals, some of them
> > compute) to produce complete protocol messages. For example,
> > protocols such as HTTP or SIP have headers with ASCII names and
> > opaque "textual" values using a varying and/or sometimes ill-defined
> > encoding. Moreover, those headers can be followed by a binary
> > body... which can be chunked and decorated with ASCII headers and
> > trailers!
>
> As it stands now, the PEP talks about ASCII, about how it makes sense
> sometimes to work directly with bytes objects, and
> then refuses to allow % to embed ASCII text in the byte stream.

Indeed I refuse for %-formatting to allow the mixing of bytes and str
objects, in the same way that it is forbidden to concatenate "a" and
b"b" together, or to write b"".join(["abc"]).

Python 3 was made *precisely* because the implicit conversion between
ASCII unicode and bytes is deemed harmful. It's completely
counter-productive and misleading for our users to start muddying the
message by introducing exceptions to that rule.

Regards

Antoine.
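The two operations named in the message above can be checked directly. The following sketch, run on Python 3, confirms that both raise TypeError and shows the explicit alternative of encoding the text first. The helper name raises_type_error is introduced here only for illustration.

```python
def raises_type_error(func):
    """Return True if calling func() raises TypeError, else False."""
    try:
        func()
    except TypeError:
        return True
    return False


# Implicit mixing of str and bytes is rejected in both spellings.
concat_rejected = raises_type_error(lambda: "a" + b"b")
join_rejected = raises_type_error(lambda: b"".join(["abc"]))

# The explicit route: encode the text, then combine bytes with bytes.
explicit = "a".encode("ascii") + b"b"
```

The disagreement in the thread is whether formatting a number into a bytes object belongs in the first category (implicit mixing) or is closer to the explicit route.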
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
>
> With Victor's consent, I overhauled PEP 460 and made the feature set
> more restricted and consistent with the bytes/str separation.

From the PEP:

> Python 3 generally mandates that text be stored and manipulated as
> unicode (i.e. str objects, not bytes). In some cases, though, it
> makes sense to manipulate bytes objects directly. Typical usage is
> binary network protocols, where you can want to interpolate and
> assemble several bytes object (some of them literals, some of them
> compute) to produce complete protocol messages. For example,
> protocols such as HTTP or SIP have headers with ASCII names and
> opaque "textual" values using a varying and/or sometimes ill-defined
> encoding. Moreover, those headers can be followed by a binary
> body... which can be chunked and decorated with ASCII headers and
> trailers!

As it stands now, the PEP talks about ASCII, about how it makes sense
sometimes to work directly with bytes objects, and then refuses to
allow % to embed ASCII text in the byte stream.

> All other features present in formatting of str objects (either
> through the percent operator or the str.format() method) are
> unsupported. Those features imply treating the recipient of the
> operator or method as text, which goes counter to the text / bytes
> separation (for example, accepting %d as a format code would imply
> that the bytes object really is a ASCII-compatible text string).

No, it implies that portion of the byte stream is ASCII compatible. And
we have several examples: PDF, HTML, DBF, just about every network
protocol (not counting M$), and, I'm sure, plenty I haven't heard of.

-1 on the PEP as it stands now.

--
~Ethan~
Re: [Python-Dev] Python3 "complexity"
On 01/10/2014 03:22 PM, Mark Lawrence wrote:
> On 10/01/2014 22:06, Chris Barker wrote:
>>
>> I'm not so sure -- it could be used (abused?) for that, but I'm
>> suggesting it be used for mixed ascii-binary data. I don't know that
>> there IS a "right" way to do that -- at least not an efficient or easy
>> to read and write one.
>
> The correct way is to read the interface specification which tells you
> what should be in the data.

Of course. The debate is about how to generate the data to the specs in
an elegant manner.

--
~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou wrote: > Also, when you say you've never encountered UTF-16 text in PDFs, it > sounds like those people who've never encountered any non-ASCII data in > their programs. Let me clarify: one does not think in "writing text in Unicode"-terms in PDF. Instead, one records the sequence of "character codes" which correspond to "glyphs" or the glyph IDs directly. That's because one Unicode character may have more than one glyph and more characters can be shown as one glyph. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python3 "complexity"
On Fri, Jan 10, 2014 at 3:22 PM, Mark Lawrence wrote: > The correct way is to read the interface specification which tells you > what should be in the data. Or do people not use interface specifications > these days, preferring to guess what they've got instead? > No one is suggesting guessing (OK, sometimes for what encoding text is in -- but that's when you already know it's text). But while some specs for mixed ascii and binary may specify which bytes are which, not all do -- there may be a read the file 'till you find this text, then the next n bytes are binary, or maybe the next bytes are binary until you get to this ascii text, etc... This is not guessing, but it does require working with an object which has both ascii text and binary in it -- and why shouldn't Python provide a reasonable way to work with that? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 3:40 PM, Juraj Sukop wrote:

> What this all means is that the PDF objects are expressed in ASCII,
> "stream" objects like images and fonts may have a binary part and I never
> saw those UTF-16 strings.

hmm -- I wonder if they are out there in the wild, though

>> u"stream\n%s\nendstream\nendobj"%binary_data.decode('latin-1')
>
> The argument for dropping "%f" et al. has been that if something is a
> text, then it should be Unicode. Conversely, if it is not text, then it
> should not be Unicode.

What I'm trying to demonstrate / test is that you can use unicode
objects for mixed binary + ascii, if you make sure to encode/decode
using latin-1 (any others?). The idea is that ascii can be seen/used as
text, and other bytes are preserved, and you can ignore whatever
meaning latin-1 gives them.

Using unicode objects means that you can use the existing string
formatting (%s), and if you want to pass in binary blobs, you need to
decode them as latin-1, creating a unicode object, which will get
interpolated into your unicode object. When that unicode object is then
encoded back to latin-1, the original bytes are preserved.

I think this is confusing, as we are calling it latin-1, but not really
using it that way, but it seems it should work.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
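The round trip proposed in the message above can be tested with every possible byte value. This is a small sketch under the assumption that the whole document is kept as a str and encoded exactly once at the end; the function name wrap_stream is hypothetical.

```python
def wrap_stream(binary_data):
    """Interpolate a binary blob into text via Latin-1, then encode back."""
    text = "stream\n%s\nendstream\nendobj" % binary_data.decode("latin-1")
    return text.encode("latin-1")


# All 256 byte values, so no byte is left untested.
blob = bytes(range(256))
wrapped = wrap_stream(blob)
```

Latin-1 is used here purely as a lossless byte-to-code-point mapping, not as a statement about what the bytes mean, which is the source of the confusion the message itself points out.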
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, 11 Jan 2014 00:43:39 +0100 Juraj Sukop wrote: > Basically, to ".encode('ascii')" every possible > number is not exactly simple or pretty. Well it strikes me that the PDF format itself is not exactly simple or pretty. It might be convenient that Python 2 allows you, in certain cases, to "ignore" encoding issues because the main text type is actually a bytestring, but under the Python 3 model there's no reason to allow the same shortcuts. Also, when you say you've never encountered UTF-16 text in PDFs, it sounds like those people who've never encountered any non-ASCII data in their programs. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner wrote:

> Why not build "10 0 obj ... stream" and "endstream endobj" in
> Unicode and then encode to ASCII? Example:
>
>     data = b''.join((
>         ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
>         binary_image_data,
>         ("endstream endobj").encode('ascii'),
>     ))

The key is "encode to ASCII" which means that the result is bytes. Then,
there is this "11 0 obj" which should also be bytes. But it has no
"binary_image_data" - only lots of numbers waiting to be somehow
converted to bytes. I already mentioned the problems with
".encode('ascii')" but it does not stop here. Numbers may appear not
only inside "streams" but almost anywhere: in the header there is PDF
version, an image has to have "width" and "height", at the end of PDF
there is a structure containing offsets to all of the objects in file.
Basically, to ".encode('ascii')" every possible number is not exactly
simple or pretty.
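One way to keep the number of .encode('ascii') calls down, sketched here as an assumption rather than as anyone's proposal in the thread, is to format each object's textual part as a single str and encode it once per object instead of once per number. The function name pdf_object and the exact line layout are illustrative only and are not taken from the PDF specification.

```python
def pdf_object(number, generation, body, stream=None):
    """Assemble one PDF-style object as bytes.

    The textual part (object header and dictionary) is formatted as str
    and encoded once; an optional binary stream is spliced in untouched.
    """
    header = ("%d %d obj\n%s\n" % (number, generation, body)).encode("ascii")
    if stream is None:
        return header + b"endobj\n"
    return b"".join((header, b"stream\n", stream, b"\nendstream\nendobj\n"))


numbers_only = pdf_object(11, 0, "<< /Width %d /Height %d >>" % (100, 100))
with_stream = pdf_object(10, 0, "<< /Length %d >>" % 3, stream=b"\x00\xff\x10")
```

This does not remove the objection in the message above: every place that produces bytes still has to go through a text step and an explicit encode.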
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 10:52 PM, Chris Barker wrote:

> On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop wrote:
>
>> As you may know, PDF operates over bytes and an integer or floating-point
>> number is written down as-is, for example "100" or "1.23".
>
> Just to be clear here -- is PDF specifically bytes+ascii?
>
> Or could there be some-other-encoding unicode in there?

From the specs: "At the most fundamental level, a PDF file is a
sequence of 8-bit bytes." But it is also possible to represent a PDF
using printable ASCII + whitespace by using escapes and "filters".
Then, there are also "text strings" which might be encoded in UTF-16.

What this all means is that the PDF objects are expressed in ASCII,
"stream" objects like images and fonts may have a binary part and I
never saw those UTF-16 strings.

> u"stream\n%s\nendstream\nendobj"%binary_data.decode('latin-1')

The argument for dropping "%f" et al. has been that if something is a
text, then it should be Unicode. Conversely, if it is not text, then it
should not be Unicode.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 18:14:45 -0500 "Eric V. Smith" wrote: > > >> Because embedding the ASCII equivalent of ints and floats in byte streams > >> is a common operation? > > > > Again, if you're representing "ASCII", you're representing text and > > should use a str object. > > Yes, but is there existing 2.x code that uses %s for int and float > (perhaps unwittingly), and do we want to "help" that code out? > Or do we > want to make porters first change to using %d or %f instead of %s? I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and %f on bytes objects. > I think what you're getting at is that in addition to not calling > __format__, we don't want to call __str__, either, for the same reason. Not only. We don't want to do anything that actually asks for a *textual* representation of something. %d and %f ask for a textual representation of a number, so they're right out. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python3 "complexity"
On 10/01/2014 22:06, Chris Barker wrote:
> On Fri, Jan 10, 2014 at 6:05 AM, Paul Moore <p.f.mo...@gmail.com> wrote:
>
>>> Using the 'latin-1' to mean unknown encoding can easily result
>>> in Mojibake (unreadable text) entering your application with
>>> dangerous effects on your other text data.
>>
>> Agreed. The latin-1 suggestion is purely for people who object to
>> learning how to handle the encodings in their data more accurately.
>
> I'm not so sure -- it could be used (abused?) for that, but I'm
> suggesting it be used for mixed ascii-binary data. I don't know that
> there IS a "right" way to do that -- at least not an efficient or easy
> to read and write one.
>
> -Chris

The correct way is to read the interface specification which tells you
what should be in the data. Or do people not use interface
specifications these days, preferring to guess what they've got
instead?

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 6:02 PM, Antoine Pitrou wrote: > On Fri, 10 Jan 2014 14:58:15 -0800 > Ethan Furman wrote: >> On 01/10/2014 02:42 PM, Antoine Pitrou wrote: >>> On Fri, 10 Jan 2014 17:33:57 -0500 >>> "Eric V. Smith" wrote: On 1/10/2014 5:29 PM, Antoine Pitrou wrote: > On Fri, 10 Jan 2014 12:56:19 -0500 > "Eric V. Smith" wrote: >> >> I agree. I don't see any reason to exclude int and float. See Guido's >> messages http://bugs.python.org/issue3982#msg180423 and >> http://bugs.python.org/issue3982#msg180430 for some justification and >> discussion. > > If you are representing int and float, you're really formatting a text > message, not bytes. Basically if you allow the formatting of int and > float instances, there's no reason not to allow the formatting of > arbitrary objects through __str__. It doesn't make sense to > special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. >>> >>> That's not what I'm saying. str.__mod__ is able to represent all kinds >>> of types through %s and calling __str__. It doesn't make sense for >>> bytes.__mod__ to only support int and float. Why only them? Ah, I see. This is about the types that %s supports, not about support for %d and %f. >> Because embedding the ASCII equivalent of ints and floats in byte streams >> is a common operation? > > Again, if you're representing "ASCII", you're representing text and > should use a str object. Yes, but is there existing 2.x code that uses %s for int and float (perhaps unwittingly), and do we want to "help" that code out? Or do we want to make porters first change to using %d or %f instead of %s? I'll grant you that we might be doing more harm than help by special-casing these types. I'm just asking. I think what you're getting at is that in addition to not calling __format__, we don't want to call __str__, either, for the same reason. Correct me if I'm off base, please. 
I'm not trying to put words in anyone's mouth. In any event, I think supporting %d and %f (and %i, %u, %x, %g, etc.) inside format strings would be useful. Eric. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python3 "complexity"
On Fri, Jan 10, 2014 at 6:05 AM, Paul Moore wrote: > > Using the 'latin-1' to mean unknown encoding can easily result > > in Mojibake (unreadable text) entering your application with > > dangerous effects on your other text data. > > Agreed. The latin-1 suggestion is purely for people who object to > learning how to handle the encodings in their data more accurately. > I'm not so sure -- it could be used (abused?) for that, but I'm suggesting it be used for mixed ascii-binary data. I don't know that there IS a "right" way to do that -- at least not an efficient or easy to read and write one. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 14:58:15 -0800 Ethan Furman wrote: > On 01/10/2014 02:42 PM, Antoine Pitrou wrote: > > On Fri, 10 Jan 2014 17:33:57 -0500 > > "Eric V. Smith" wrote: > >> On 1/10/2014 5:29 PM, Antoine Pitrou wrote: > >>> On Fri, 10 Jan 2014 12:56:19 -0500 > >>> "Eric V. Smith" wrote: > > I agree. I don't see any reason to exclude int and float. See Guido's > messages http://bugs.python.org/issue3982#msg180423 and > http://bugs.python.org/issue3982#msg180430 for some justification and > discussion. > >>> > >>> If you are representing int and float, you're really formatting a text > >>> message, not bytes. Basically if you allow the formatting of int and > >>> float instances, there's no reason not to allow the formatting of > >>> arbitrary objects through __str__. It doesn't make sense to > >>> special-case those two types and nothing else. > >> > >> It might not for .format(), but I'm not convinced. But for %-formatting, > >> str is already special-cased for these types. > > > > That's not what I'm saying. str.__mod__ is able to represent all kinds > > of types through %s and calling __str__. It doesn't make sense for > > bytes.__mod__ to only support int and float. Why only them? > > Because embedding the ASCII equivalent of ints and floats in byte streams > is a common operation? Again, if you're representing "ASCII", you're representing text and should use a str object. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 02:42 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 17:33:57 -0500 "Eric V. Smith" wrote: On 1/10/2014 5:29 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" wrote: I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them? Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation? -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 17:33:57 -0500 "Eric V. Smith" wrote: > On 1/10/2014 5:29 PM, Antoine Pitrou wrote: > > On Fri, 10 Jan 2014 12:56:19 -0500 > > "Eric V. Smith" wrote: > >> > >> I agree. I don't see any reason to exclude int and float. See Guido's > >> messages http://bugs.python.org/issue3982#msg180423 and > >> http://bugs.python.org/issue3982#msg180430 for some justification and > >> discussion. > > > > If you are representing int and float, you're really formatting a text > > message, not bytes. Basically if you allow the formatting of int and > > float instances, there's no reason not to allow the formatting of > > arbitrary objects through __str__. It doesn't make sense to > > special-case those two types and nothing else. > > It might not for .format(), but I'm not convinced. But for %-formatting, > str is already special-cased for these types. That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them? Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 17:20:32 -0500 "Eric V. Smith" wrote: > > Isn't the point of the PEP to make it easier to port 2.x code to 3.5? > Is > there really existing code like this in 2.x? No, but so what? The point of the PEP is not to allow arbitrary Python 2 code to run without modification under Python 3. There's a reason we broke compatibility, and there's no way we're gonna undo that. > I think what we're trying to do is to make code that looks like: >b'%d %d obj ... stream' % (10, 0) > work in both 2.x and 3.5. That's not what *I* am trying to do. As far as I'm concerned the aim of the PEP is to ease bytes interpolation, not to provide some kind of magical construct that will solve everyone's porting problems. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 5:29 PM, Antoine Pitrou wrote: > On Fri, 10 Jan 2014 12:56:19 -0500 > "Eric V. Smith" wrote: >> >> I agree. I don't see any reason to exclude int and float. See Guido's >> messages http://bugs.python.org/issue3982#msg180423 and >> http://bugs.python.org/issue3982#msg180430 for some justification and >> discussion. > > If you are representing int and float, you're really formatting a text > message, not bytes. Basically if you allow the formatting of int and > float instances, there's no reason not to allow the formatting of > arbitrary objects through __str__. It doesn't make sense to > special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. Eric. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" wrote: > > I agree. I don't see any reason to exclude int and float. See Guido's > messages http://bugs.python.org/issue3982#msg180423 and > http://bugs.python.org/issue3982#msg180430 for some justification and > discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 5:12 PM, Victor Stinner wrote: > 2014/1/10 Juraj Sukop : >> In the case of PDF, the embedding of an image into PDF looks like: >> >> 10 0 obj >> << /Type /XObject >> /Width 100 >> /Height 100 >> /Alternates 15 0 R >> /Length 2167 >> >> >> stream >> ...binary image data... >> endstream >> endobj > > What not building "10 0 obj ... stream" and "endstream endobj" in > Unicode and then encode to ASCII? Example: > > data = b''.join(( > ("%d %d obj ... stream" % (10, 0)).encode('ascii'), > binary_image_data, > ("endstream endobj").encode('ascii'), > )) Isn't the point of the PEP to make it easier to port 2.x code to 3.5? Is there really existing code like this in 2.x? I think what we're trying to do is to make code that looks like: b'%d %d obj ... stream' % (10, 0) work in both 2.x and 3.5. But correct me if I'm wrong. I'll admit to not following 100% of these emails. Eric. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
2014/1/10 Juraj Sukop:
> In the case of PDF, the embedding of an image into PDF looks like:
>
> 10 0 obj
> << /Type /XObject
> /Width 100
> /Height 100
> /Alternates 15 0 R
> /Length 2167
> >>
> stream
> ...binary image data...
> endstream
> endobj

Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:

    data = b''.join((
        ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
        binary_image_data,
        ("endstream endobj").encode('ascii'),
    ))

Victor
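The approach Victor suggests can be sketched as a small runnable function, assuming Python 3. The name build_stream_object and the sample payload are hypothetical, and the object dictionary is left out, so the result is only an illustration of the str-then-encode pattern, not a complete PDF object:

```python
def build_stream_object(obj_num, gen_num, binary_image_data):
    """Wrap binary data between ASCII markers built from str formatting."""
    # Textual parts are formatted as str and then encoded to ASCII bytes.
    header = ("%d %d obj\nstream\n" % (obj_num, gen_num)).encode("ascii")
    trailer = "\nendstream\nendobj\n".encode("ascii")
    # The binary payload is never decoded; it is passed through untouched.
    return b"".join((header, binary_image_data, trailer))


# Hypothetical payload standing in for real image data.
data = build_stream_object(10, 0, b"\x00\xff\x10")
```

Each formatted fragment costs one temporary str object plus one encode call, which is the overhead Juraj objects to elsewhere in the thread.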
Re: [Python-Dev] Python3 "complexity"
10.01.14 18:27, Baptiste Carvello wrote:
> would it make sense to be more general, and allow a "lenient mode",
> where all files implicitly opened with the default encoding would also
> use the surrogateescape error handler ?

The surrogateescape error handler is compatible only with ASCII-compatible encodings (i.e. no ShiftJIS, no UTF-16). It can't be used by default. But you can set PYTHONIOENCODING=:surrogateescape and get your default locale encoding with surrogateescape.
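For readers unfamiliar with the handler, here is a minimal sketch of what surrogateescape does with an ASCII-compatible encoding such as UTF-8, assuming Python 3. The sample bytes are made up for illustration:

```python
# Bytes that are not valid UTF-8: 0xE9 is followed by a space, and 0xFF
# can never appear in UTF-8 at all.
raw = b"caf\xe9 \xff"

# Undecodable bytes are mapped to lone surrogates U+DC80..U+DCFF
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The valid ASCII part decodes normally, which is why the handler only makes sense for encodings that are supersets of ASCII.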
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop wrote:
> As you may know, PDF operates over bytes and an integer or floating-point
> number is written down as-is, for example "100" or "1.23".

Just to be clear here -- is PDF specifically bytes+ascii? Or could there be some-other-encoding unicode in there? If so, then you really have a mess!

If it is bytes+ascii, then it seems you could use a unicode object and encode/decode to latin-1. Perhaps still a bit klunkier than formatting directly into a bytes object, but workable.

> b'%.1f %.1f %.1f RG' % (r, g, b)
>
> is more confusing than:
>
> b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))

Let's see, I think that would be:

    u'%.1f %.1f %.1f RG' % (r, g, b)

then when you want to write it out:

    .encode('latin-1')

Dumping the binary data in would be a bit uglier. For the image example:

> stream
> ...binary image data...
> endstream
> endobj

    u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')

I think. Not too bad, though if nothing else an alias for latin-1 that made it clear it worked for this would be nice. Maybe ascii_plus_binary or something?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
chris.bar...@noaa.gov
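The latin-1 round trip Chris describes can be sketched as follows, assuming Python 3. latin-1 maps each of the 256 byte values to the code point with the same number, so the binary payload survives decode and encode unchanged. The colour values and payload are hypothetical sample data:

```python
r, g, b = 0.5, 0.1, 0.2
binary_data = bytes(range(256))  # every possible byte value

# Build everything as str; the payload is smuggled in via latin-1.
text = "%.1f %.1f %.1f RG\n" % (r, g, b)
text += "stream\n%s\nendstream\nendobj" % binary_data.decode("latin-1")

# Encoding back with latin-1 restores the payload byte for byte.
out = text.encode("latin-1")
```

The cost is that the intermediate str holds "characters" that are not really text, which is the mojibake risk raised in the parallel "complexity" thread.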
[Python-Dev] Python3 "complexity" - 2 use cases
> Steven D'Aprano wrote: >> I think that heuristics to guess the encoding have their role to play, >> if the caller understands the risks. Ben Finney wrote: > In my opinion, content-type guessing heuristics certainly don't belong > in the standard library. It would be great if there were never any need to guess. But in the real world, there is -- and often the user won't know any more than python does. So when it is time to guess, a source of good guesses is an important battery to include. The HTML5 specifications go through some fairly extreme contortions to document what browsers actually do, as opposed to what previous standards have mandated. They don't currently specify how to guess (though I think a draft once tried, since the major browsers all do it, and at the time did it similarly), but the specs do explicitly support such a step, and do provide an implementation note encouraging user-agents to do at least minimal auto-detection. http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding My own opinion is therefore that Python SHOULD provide better support for both of the following use cases: (1) Treat this file like it came from the web -- including autodetection and even overriding explicit charset declarations for certain charsets. We should explicitly treat autodetection like time zone data -- there is no promise that the "right answer" (or at least the "best guess") won't change, even within a release. I offer no opinion on whether chardet in particular is still too volatile, but the docs should warn that the API is driven by possibly changing external data. (2) Treat this file as "ASCII+", where anything non-ASCII will (at most) be written back out unchanged; it doesn't even need to be converted to text. 
At this time, I don't know whether the right answer is making it easy to default to surrogate-escape for all error-handling, adding more bytes methods, encouraging use of python's latin-1 variant, offering a dedicated (new?) codec, or some new suggestion. I do know that this use case is important, and that python 3 currently looks clumsy compared to python 2. -jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Am 10.01.2014 18:56, schrieb Eric V. Smith: > On 1/10/2014 12:17 PM, Juraj Sukop wrote: >> (Sorry if this messes-up the thread order, it is meant as a reply to the >> original RFC.) >> >> Dear list, >> >> newbie here. After much hesitation I decided to put forward a use case >> which bothers me about the current proposal. Disclaimer: I happen to >> write a library which is directly influenced by this. >> >> As you may know, PDF operates over bytes and an integer or >> floating-point number is written down as-is, for example "100" or "1.23". >> >> However, the proposal drops "%d", "%f" and "%x" formats and the >> suggested workaround for writing down a number is to use >> ".encode('ascii')", which I think has two problems: >> >> One is that it needs to construct one additional object per formatting >> as opposed to Python 2; it is not uncommon for a PDF file to contain >> millions of numbers. >> >> The second problem is that, in my eyes, it is very counter-intuitive to >> require the use of str only to get formatting on bytes. Consider the >> case where a large bytes object is created out of many smaller bytes >> objects. If I wanted to format a part I had to use str instead. For example: >> >> content = b''.join([ >> b'header', >> b'some dictionary structure', >> b'part 1 abc', >> ('part 2 %.3f' % number).encode('ascii'), >> b'trailer']) > > I agree. I don't see any reason to exclude int and float. See Guido's > messages http://bugs.python.org/issue3982#msg180423 and > http://bugs.python.org/issue3982#msg180430 for some justification and > discussion. Since converting int and float to strings generates a very > small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase > versions), what problem is introduced by allowing int and float? The > original str.format() work relied on this fact in its stringlib > implementation. I agree. I would have needed bytes-formatting (with numbers) recently writing .rtf files. 
Georg ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python3 "complexity"
INADA Naoki wrote: latin1 is OK but is it Pythonic? Latin is most certainly a Pythonic subject: http://www.youtube.com/watch?v=IIAdHEwiAy8 -- Greg ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python3 "complexity"
On Jan 10, 2014, at 7:35 AM, Nick Coghlan wrote: > Putting this here because I found out today it's not in any of the > PEPs and folks have to go digging in mailing list archives to find it. > I'll add it to my Python 3 Q&A at some point. > > The reason Python 3 currently tries to rely on the POSIX locale > encoding is that during the Python 3 development process it was > pointed out that ShiftJIS, ISO-2022 and various CJK codec are in > widespread use in Asia, since Asian users needed solutions to the > problem of representing kana, ideographs and other non-Latin > characters long before the Unicode Consortium existed. Really? Because PEP 383 doesn't support and discourages the use of some of these codecs as a locale. -- Philip Jenvey ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 06/01/2014 13:24, Victor Stinner wrote: Hi, bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP. Apologies if this has already been said, but Terry Reedy attached a proof of concept to issue 3982 which might be worth taking a look at if you haven't yet done so. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 12:17 PM, Juraj Sukop wrote: > (Sorry if this messes-up the thread order, it is meant as a reply to the > original RFC.) > > Dear list, > > newbie here. After much hesitation I decided to put forward a use case > which bothers me about the current proposal. Disclaimer: I happen to > write a library which is directly influenced by this. > > As you may know, PDF operates over bytes and an integer or > floating-point number is written down as-is, for example "100" or "1.23". > > However, the proposal drops "%d", "%f" and "%x" formats and the > suggested workaround for writing down a number is to use > ".encode('ascii')", which I think has two problems: > > One is that it needs to construct one additional object per formatting > as opposed to Python 2; it is not uncommon for a PDF file to contain > millions of numbers. > > The second problem is that, in my eyes, it is very counter-intuitive to > require the use of str only to get formatting on bytes. Consider the > case where a large bytes object is created out of many smaller bytes > objects. If I wanted to format a part I had to use str instead. For example: > > content = b''.join([ > b'header', > b'some dictionary structure', > b'part 1 abc', > ('part 2 %.3f' % number).encode('ascii'), > b'trailer']) I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. Since converting int and float to strings generates a very small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase versions), what problem is introduced by allowing int and float? The original str.format() work relied on this fact in its stringlib implementation. Eric. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python3 "complexity"
10.01.14 14:19, M.-A. Lemburg wrote:
> BTW: Perhaps it would be a good idea to backport the surrogateescape
> error handler to Python 2.7 to simplify writing code which works
> in both Python 2 and 3.

You should also change the UTF-8 codec so that it rejects surrogates (i.e. u'\ud880'.encode('utf-8') and '\xed\xa2\x80'.decode('utf-8') should raise exceptions). And this would break a lot of code.
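The strictness this message refers to can be observed in Python 3, whose UTF-8 codec rejects lone surrogates in both directions. A small sketch, assuming Python 3; the helper names are hypothetical:

```python
def encode_is_rejected(text):
    """Return True if strict UTF-8 encoding refuses the given str."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def decode_is_rejected(data):
    """Return True if strict UTF-8 decoding refuses the given bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


lone_surrogate = "\ud880"
surrogate_bytes = b"\xed\xa2\x80"  # what a lenient codec emits for U+D880

# The 'surrogatepass' handler opts back in to the lenient behaviour.
lenient = lone_surrogate.encode("utf-8", errors="surrogatepass")
```

Without this strictness, the escaped surrogates produced by surrogateescape could not be told apart from surrogates that were encoded directly.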
Re: [Python-Dev] Python3 "complexity"
On Fri, Jan 10, 2014 at 4:35 PM, Nick Coghlan wrote:
> On 10 January 2014 13:32, Lennart Regebro wrote:
>> No, because your environment have a default language. And Python has a
>> default encoding. You only get problems when some file doesn't use the
>> default encoding.
>
> The reason Python 3 currently tries to rely on the POSIX locale
> encoding is that during the Python 3 development process it was
> pointed out that ShiftJIS, ISO-2022 and various CJK codec are in
> widespread use in Asia, since Asian users needed solutions to the
> problem of representing kana, ideographs and other non-Latin
> characters long before the Unicode Consortium existed.
>
> This creates a problem for Python 3, as assuming utf-8 means we have a
> high risk of corrupting user's data at least in Asian locales, as well
> as anywhere else where non-UTF-8 encodings are common (especially when
> encodings that aren't ASCII compatible are involved).

From my experience, the concept of a default locale is deeply flawed. What if I log into a (Linux) machine using an old latin-1 putty from the Windows XP era, have most file names and contents in UTF-8 encoding, except for one directory where people from eastern Europe upload files via FTP in whatever encoding they choose. What should the "default" encoding be now?

That's why I make it a principle to always unset all LC_* and LANG variables, except when working locally, which happens rather rarely.
[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
(Sorry if this messes-up the thread order, it is meant as a reply to the original RFC.)

Dear list,

newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this.

As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".

However, the proposal drops "%d", "%f" and "%x" formats and the suggested workaround for writing down a number is to use ".encode('ascii')", which I think has two problems:

One is that it needs to construct one additional object per formatting as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers.

The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I wanted to format a part, I would have to use str instead. For example:

    content = b''.join([
        b'header',
        b'some dictionary structure',
        b'part 1 abc',
        ('part 2 %.3f' % number).encode('ascii'),
        b'trailer'])

In the case of PDF, the embedding of an image into PDF looks like:

    10 0 obj
    << /Type /XObject
    /Width 100
    /Height 100
    /Alternates 15 0 R
    /Length 2167
    >>
    stream
    ...binary image data...
    endstream
    endobj

Because of the image it makes sense to store such structure inside bytes. On the other hand, there may well be another "obj" which contains the coordinates of Bezier paths:

    11 0 obj
    ...
    stream
    0.5 0.1 0.2 RG
    300 300 m
    300 400 400 400 400 300 c
    b
    endstream
    endobj

To summarize, there are cases which mix "binary" and "text" and, in my opinion, dropping the bytes-formatting of numbers makes it more complicated than it was. I would appreciate any explanation on how:

    b'%.1f %.1f %.1f RG' % (r, g, b)

is more confusing than:

    b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))

A similar situation exists for HTTP ("Content-Length: 123") and ASCII STL ("vertex 1.0 0.0 0.0").

Thanks and have a nice day,

Juraj Sukop

PS: In case the proposal does not include number formatting, it would be nice for it to list a set of guidelines or examples on how to proceed with porting Python 2 formats to Python 3.
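To make the comparison concrete, here is a runnable sketch of the str-based workaround, assuming Python 3. The function names are hypothetical, and b" ".join is used instead of b'%s' so that the workaround itself does not depend on bytes interpolation. The final check relies on a fact outside this thread: PEP 461 later added % formatting for bytes in Python 3.5, so the direct form is only exercised on interpreters that have it:

```python
import sys


def rg_operator(r, g, b):
    """Build a PDF-style 'RG' colour operator without bytes formatting."""
    # Each number is formatted as str and encoded separately, which is
    # the extra object per value that the message complains about.
    parts = [("%.1f" % x).encode("ascii") for x in (r, g, b)]
    return b" ".join(parts) + b" RG"


def content_length_header(length):
    """The same pattern applied to the HTTP example from the message."""
    return ("Content-Length: %d\r\n" % length).encode("ascii")


workaround = rg_operator(0.5, 0.1, 0.2)

# Direct bytes interpolation, guarded because it needs Python 3.5+.
if sys.version_info >= (3, 5):
    direct = b"%.1f %.1f %.1f RG" % (0.5, 0.1, 0.2)
else:
    direct = workaround
```

Both forms produce the same bytes; the difference is in readability and in the number of temporary objects created.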
[Python-Dev] Summary of Python tracker Issues
ACTIVITY SUMMARY (2014-01-03 - 2014-01-10)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4409 (+61)
  closed 27580 (+42)
  total  31989 (+103)

Open issues with patches: 1993

Issues opened (87)
==================

#15027: Faster UTF-32 encoding
http://bugs.python.org/issue15027 reopened by serhiy.storchaka

#20115: NUL bytes in commented lines
http://bugs.python.org/issue20115 opened by arigo

#20116: urlparse.parse_qs should take argument for query separator
http://bugs.python.org/issue20116 opened by ruben.orduz

#20117: subprocess on Windows: wrong return code with shell=True
http://bugs.python.org/issue20117 opened by gvanrossum

#20118: test_imaplib test_linetoolong fails on 2.7 in SSL test on some
http://bugs.python.org/issue20118 opened by r.david.murray

#20119: pdb c(ont(inue)) optional one-time-only breakpoint (like perl
http://bugs.python.org/issue20119 opened by nlev...@gmail.com

#20120: Percent-signs (%) in .pypirc should not be interpolated
http://bugs.python.org/issue20120 opened by tlevine

#20121: quopri_codec newline handling
http://bugs.python.org/issue20121 opened by fredstober

#20122: Move CallTips tests to idle_tests
http://bugs.python.org/issue20122 opened by serhiy.storchaka

#20123: pydoc.synopsis fails to load binary modules
http://bugs.python.org/issue20123 opened by eric.snow

#20124: The documentation for the atTime parameter of TimedRotatimeFil
http://bugs.python.org/issue20124 opened by r.david.murray

#20125: We need a good replacement for direct use of load_module(), po
http://bugs.python.org/issue20125 opened by eric.snow

#20126: sched doesn't handle events added after scheduler starts
http://bugs.python.org/issue20126 opened by lo...@blossomhillranch.com

#20127: Race condition in test_threaded_import.task()?
http://bugs.python.org/issue20127 opened by eric.snow

#20128: Re-enable test_modules_search_builtin() in test_pydoc
http://bugs.python.org/issue20128 opened by eric.snow

#20131: warnings module offers no documented, programmatic way to rese
http://bugs.python.org/issue20131 opened by inducer

#20132: Many incremental codecs don't handle fragmented data
http://bugs.python.org/issue20132 opened by vadmium

#20133: Derby: Convert the audioop module to use Argument Clinic
http://bugs.python.org/issue20133 opened by serhiy.storchaka

#20135: mutate list
http://bugs.python.org/issue20135 opened by m123orning

#20136: Logging: StreamHandler does not use OS line separator.
http://bugs.python.org/issue20136 opened by alibotean

#20137: Logging: RotatingFileHandler computes string length instead of
http://bugs.python.org/issue20137 opened by alibotean

#20138: wsgiref on Python 3.x incorrectly implements URL handling caus
http://bugs.python.org/issue20138 opened by aronacher

#20139: Python installer does not install a "pip" command (just "pip3"
http://bugs.python.org/issue20139 opened by pmoore

#20140: UnicodeDecodeError in ntpath.py when home dir contains non-asc
http://bugs.python.org/issue20140 opened by Jarek.Śmiejczak

#20145: unittest.assert*Regex functions should verify that expected_re
http://bugs.python.org/issue20145 opened by the.mulhern

#20146: UserDict module docs link is obsolete
http://bugs.python.org/issue20146 opened by drunax

#20147: multiprocessing.Queue.get() raises queue.Empty exception if ev
http://bugs.python.org/issue20147 opened by torsten

#20148: Derby: Convert the _sre module to use Argument Clinic
http://bugs.python.org/issue20148 opened by serhiy.storchaka

#20150: API change in string formatting with :s option should be docum
http://bugs.python.org/issue20150 opened by Thomas.Robitaille

#20151: Derby: Convert the binascii module to use Argument Clinic
http://bugs.python.org/issue20151 opened by serhiy.storchaka

#20152: Derby #15: Convert 50 sites to Argument Clinic across 9 files
http://bugs.python.org/issue20152 opened by brett.cannon

#20153: New-in-3.4 weakref finalizer doc section is already out of dat
http://bugs.python.org/issue20153 opened by r.david.murray

#20154: Deadlock in asyncio.StreamReader.readexactly()
http://bugs.python.org/issue20154 opened by gvanrossum

#20155: Regression test test_httpservers fails, hangs on Windows
http://bugs.python.org/issue20155 opened by jeff.allen

#20156: bz2.BZ2File.read() does not treat growing input file properly
http://bugs.python.org/issue20156 opened by Joshua.Chia

#20159: Derby #7: Convert 51 sites to Argument Clinic across 3 files -
http://bugs.python.org/issue20159 opened by serhiy.storchaka

#20160: broken ctypes calling convention on MSVC / 64-bit Windows (lar
http://bugs.python.org/issue20160 opened by mark.dickinson

#20162: Test test_hash_distribution fails on RHEL 6.5 / ppc64
http://bugs.python.org/issue20162 opened by zaytsev

#20163: ValueError: time data does not match format
http://bugs.p
Re: [Python-Dev] Python3 "complexity"
On 10/01/2014 16:35, Nick Coghlan wrote:
> One idea we're considering for Python 3.5 is to have a report of
> "ascii" on a POSIX OS imply the surrogateescape error handler (at
> least for the standard streams, and perhaps in other contexts), since
> the OS reporting the POSIX/C locale almost certainly indicates a
> configuration error rather than intentional behaviour.

Would it make sense to be more general and allow a "lenient mode", where all files implicitly opened with the default encoding would also use the surrogateescape error handler? That way, applications designed to process text mostly written in the default encoding would just call sys.set_lenient_mode() and be done.

Of course, libraries would need to be strongly discouraged from ever using this, and encouraged to explicitly set the error handler on the appropriate files instead.

Cheers,
Baptiste
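For reference, the explicit per-file alternative that libraries are asked to prefer already exists: the built-in open() accepts an errors argument. The sketch below is only an illustration; the helper names copy_leniently and demo are made up for this example, and sys.set_lenient_mode() remains a hypothetical proposal from this thread, not an existing function.

```python
import os
import tempfile


def copy_leniently(src_path, dst_path, encoding="utf-8"):
    """Copy a mostly-text file line by line, keeping undecodable bytes.

    errors="surrogateescape" maps each undecodable byte to a lone
    surrogate on input and restores it on output; newline="" turns off
    newline translation so line endings are left untouched.
    """
    with open(src_path, encoding=encoding, errors="surrogateescape",
              newline="") as src, \
         open(dst_path, "w", encoding=encoding, errors="surrogateescape",
              newline="") as dst:
        for line in src:
            dst.write(line)


def demo():
    """Round-trip a file that contains bytes which are not valid UTF-8."""
    payload = b"caf\xc3\xa9 is fine\n\xff\xfe is not valid UTF-8\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "wb") as f:
            f.write(payload)
        copy_leniently(src, dst)
        with open(dst, "rb") as f:
            return f.read() == payload
```

Setting the handler per file keeps the leniency local to the code that asked for it, which is the property a process-wide switch would give up.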
Re: [Python-Dev] Python3 "complexity"
I now feel that it is a bad thing to encourage using unicode (str) for binary data by way of the latin-1 encoding or the surrogateescape error handler.

Handling binary data in the str type using latin-1 is just a hack. Surrogateescape is just a workaround to keep undecodable bytes in text.

Encouraging binary data in the str type with latin-1 or surrogateescape means encouraging the mixing of binary and text data. That is worse than Python 2.

So Python should encourage handling binary data in the bytes type.

On Fri, Jan 10, 2014 at 11:28 PM, Matěj Cepl wrote:
> On 2014-01-10, 12:19 GMT, you wrote:
> > Using the 'latin-1' to mean unknown encoding can easily result
> > in Mojibake (unreadable text) entering your application with
> > dangerous effects on your other text data.
> >
> > E.g. "Marc-André" read using 'latin-1' if the string itself
> > is encoded as UTF-8 will give you "Marc-AndrÃ©" in your
> > application. (Yes, I see that a lot in applications
> > and websites I use ;-))
>
> I am afraid that for most 'latin-1' is just another attempt to
> make Unicode complexity go away and the way how to ignore it.
>
> Matěj

--
INADA Naoki
Re: [Python-Dev] Python3 "complexity"
Nick Coghlan wrote:
> One idea we're considering for Python 3.5 is to have a report of
> "ascii" on a POSIX OS imply the surrogateescape error handler (at
> least for the standard streams, and perhaps in other contexts), since
> the OS reporting the POSIX/C locale almost certainly indicates a
> configuration error rather than intentional behaviour.

On FreeBSD users apparently get the C locale by default. I don't think I've configured anything special during the install:

    freebsd-amd64# adduser
    Username: testuser
    Full name:
    Uid (Leave empty for default):
    Login group [testuser]:
    Login group is testuser. Invite testuser into other groups? []:
    Login class [default]:
    Shell (sh csh tcsh bash rbash nologin) [sh]:
    Home directory [/home/testuser]:
    Home directory permissions (Leave empty for default):
    Use password-based authentication? [yes]: no
    Lock out the account after creation? [no]:
    Username : testuser
    Password :
    Full Name :
    Uid: 1003
    Class :
    Groups : testuser
    Home : /home/testuser
    Home Mode :
    Shell : /bin/sh
    Locked : no
    OK? (yes/no): yes
    adduser: INFO: Successfully added (testuser) to the user database.
    Add another user? (yes/no): no
    Goodbye!
    freebsd-amd64# su - testuser
    $ locale
    LANG=
    LC_CTYPE="C"
    LC_COLLATE="C"
    LC_TIME="C"
    LC_NUMERIC="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_ALL=

Stefan Krah
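To see what Python makes of such a locale, one can ask the locale module for the preferred encoding and normalise the answer through the codec registry. This is a minimal sketch using standard-library calls only; the helper names is_ascii_codec and locale_reports_ascii are invented for the example, and later Python versions may report a different encoding for the C locale than the ones discussed in this thread.

```python
import codecs
import locale


def is_ascii_codec(name):
    """Return True if an encoding name is an alias of the ASCII codec.

    The C locale is often reported as "ANSI_X3.4-1968" or "US-ASCII";
    codecs.lookup() resolves such aliases to the canonical name "ascii".
    """
    try:
        return codecs.lookup(name).name == "ascii"
    except LookupError:
        # Unknown encoding name: certainly not something we know as ASCII.
        return False


def locale_reports_ascii():
    """Return True if the current locale's preferred encoding is ASCII."""
    # False means: query the current setting without calling setlocale().
    return is_ascii_codec(locale.getpreferredencoding(False))


if __name__ == "__main__":
    print("preferred encoding:", locale.getpreferredencoding(False))
    print("looks like the POSIX/C locale:", locale_reports_ascii())
```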
Re: [Python-Dev] [Python-checkins] peps: PEP 460: add .format_map()
On 1/10/2014 10:20 AM, Nick Coghlan wrote:
> On 10 January 2014 07:41, Eric V. Smith wrote:
>> I'm not sure how format_map helps in porting from 2 to 3, since it
>> doesn't exist in any version of 2.
>>
>> Although that said, it's no doubt a useful feature, just not useful in
>> code that supports both 2 and 3 with a single code base or when porting
>> to 3.
>
> It's purely a matter of consistency with str - if we're adding binary
> interpolation back to Python 3 (which I have been persuaded is a good
> idea), then we should provide the same three typical spellings of the
> operation that str provides.
>
> Cheers,
> Nick.

I'm perfectly okay with that, and it was on my list of things to suggest. I just think that the PEP should be focused on porting code from 2 to 3 and on code that runs on both 2 and 3. I think the Rationale should state this clearly.

Eric.
Re: [Python-Dev] Python3 "complexity"
On 10 January 2014 13:32, Lennart Regebro wrote:
> On Thu, Jan 9, 2014 at 10:06 AM, Kristján Valur Jónsson wrote:
>> Do I speak Chinese to my grocer because china is a growing force in the
>> world? Or start every discussion with my children with a negotiation on
>> what language to use?
>
> No, because your environment have a default language. And Python has a
> default encoding. You only get problems when some file doesn't use the
> default encoding.

Putting this here because I found out today it's not in any of the PEPs and folks have to go digging in mailing list archives to find it. I'll add it to my Python 3 Q&A at some point.

The reason Python 3 currently tries to rely on the POSIX locale encoding is that during the Python 3 development process it was pointed out that ShiftJIS, ISO-2022 and various CJK codecs are in widespread use in Asia, since Asian users needed solutions to the problem of representing kana, ideographs and other non-Latin characters long before the Unicode Consortium existed.

This creates a problem for Python 3, as assuming utf-8 means we have a high risk of corrupting users' data at least in Asian locales, as well as anywhere else where non-UTF-8 encodings are common (especially when encodings that aren't ASCII compatible are involved).

While the Python 3 status quo on POSIX systems certainly isn't ideal, it at least means our most likely failure mode is an exception rather than silent data corruption. One of the major culprits for that is the antiquated POSIX/C locale, which reports ASCII as the system encoding.

One idea we're considering for Python 3.5 is to have a report of "ascii" on a POSIX OS imply the surrogateescape error handler (at least for the standard streams, and perhaps in other contexts), since the OS reporting the POSIX/C locale almost certainly indicates a configuration error rather than intentional behaviour.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
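The difference between "an exception" and "silent data corruption" can be made concrete with a few bytes of Shift JIS. The sketch below is illustrative only: the sample string and the helper try_decode are assumptions chosen for the example, not anything taken from the thread.

```python
# Three kanji (U+65E5 U+672C U+8A9E), encoded the way a Shift JIS system
# would store them on disk.
original = "\u65e5\u672c\u8a9e"
raw = original.encode("shift_jis")


def try_decode(data, encoding, errors="strict"):
    """Decode data, returning the exception instead of raising it."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError as exc:
        return exc


# Wrong guess, strict codec: the failure is loud.
as_utf8 = try_decode(raw, "utf-8")

# Wrong guess, latin-1: every byte decodes, so the text is silently wrong.
as_latin1 = try_decode(raw, "latin-1")

# ASCII plus surrogateescape: non-ASCII bytes become lone surrogates that
# can be written back out unchanged with the same codec and handler.
as_escaped = try_decode(raw, "ascii", "surrogateescape")
restored = as_escaped.encode("ascii", "surrogateescape")

if __name__ == "__main__":
    print("utf-8:  ", repr(as_utf8))
    print("latin-1:", repr(as_latin1))
    print("escaped:", repr(as_escaped), "round trip ok:", restored == raw)
```

Only the strict decode reports the mismatch; the surrogateescape variant does not recover the text either, it merely guarantees that the original bytes can be reproduced.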
Re: [Python-Dev] [Python-checkins] peps: PEP 460: add .format_map()
On 10 January 2014 07:41, Eric V. Smith wrote:
> I'm not sure how format_map helps in porting from 2 to 3, since it
> doesn't exist in any version of 2.
>
> Although that said, it's no doubt a useful feature, just not useful in
> code that supports both 2 and 3 with a single code base or when porting
> to 3.

It's purely a matter of consistency with str - if we're adding binary interpolation back to Python 3 (which I have been persuaded is a good idea), then we should provide the same three typical spellings of the operation that str provides.

Cheers,
Nick.

> Eric.
>
> On 1/9/2014 4:02 PM, antoine.pitrou wrote:
>> http://hg.python.org/peps/rev/8947cdc6b22e
>> changeset: 5341:8947cdc6b22e
>> user: Antoine Pitrou
>> date: Thu Jan 09 22:02:01 2014 +0100
>> summary:
>>   PEP 460: add .format_map()
>>
>> files:
>>   pep-0460.txt | 6 +-
>>   1 files changed, 5 insertions(+), 1 deletions(-)
>>
>>
>> diff --git a/pep-0460.txt b/pep-0460.txt
>> --- a/pep-0460.txt
>> +++ b/pep-0460.txt
>> @@ -24,12 +24,16 @@
>>    similar in syntax to ``str.format()`` (accepting positional as well as
>>    keyword arguments).
>>
>> +* ``bytes.format_map(...)`` and ``bytearray.format_map(...)`` for an
>> +  API similar to ``str.format_map(...)``, with the same formatting
>> +  syntax and semantics as ``bytes.format()`` and ``bytearray.format()``.
>> +
>>
>>  Rationale
>>  =========
>>
>>  In Python 2, ``str % args`` and ``str.format(args)`` allow the formatting
>> -and interpolation of bytes strings. This feature has commonly been used
>> +and interpolation of bytestrings. This feature has commonly been used
>>  for the assembling of protocol messages when protocols are known to use
>>  a fixed encoding.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
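For readers who have not used all of them, the "three typical spellings" on str look like this. The sketch deliberately uses str only, because the bytes and bytearray methods named in the diff are what the PEP proposes, not something this example assumes to exist; the field names are invented for illustration.

```python
# Hypothetical protocol fields, chosen only for this example.
fields = {"method": "GET", "path": "/index.html"}

# 1. printf-style interpolation with the % operator.
via_percent = "%(method)s %(path)s" % fields

# 2. str.format() with keyword arguments.
via_format = "{method} {path}".format(**fields)

# 3. str.format_map(), which takes the mapping directly instead of
#    unpacking it into keyword arguments.
via_format_map = "{method} {path}".format_map(fields)

if __name__ == "__main__":
    print(via_percent)
    print(via_format)
    print(via_format_map)
```

The second and third forms share one formatting syntax, which is why the PEP text describes format_map() in terms of format().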
Re: [Python-Dev] Python3 "complexity"
On 2014-01-10, 12:19 GMT, you wrote:
> Using the 'latin-1' to mean unknown encoding can easily result
> in Mojibake (unreadable text) entering your application with
> dangerous effects on your other text data.
>
> E.g. "Marc-André" read using 'latin-1' if the string itself
> is encoded as UTF-8 will give you "Marc-AndrÃ©" in your
> application. (Yes, I see that a lot in applications
> and websites I use ;-))

I am afraid that for most people 'latin-1' is just another attempt to make Unicode complexity go away, and a way to ignore it.

Matěj
Re: [Python-Dev] Python3 "complexity"
On 10 January 2014 12:19, M.-A. Lemburg wrote:
> Just a word of caution:
>
> Using the 'latin-1' to mean unknown encoding can easily result
> in Mojibake (unreadable text) entering your application with
> dangerous effects on your other text data.

Agreed. The latin-1 suggestion is purely for people who object to learning how to handle the encodings in their data more accurately. That's not a criticism: wanting to avoid getting sidetracked into understanding encodings when porting a personal script is a classic "practicality vs purity" situation.

Current responses to people with encoding issues tend towards an idealistic "you should understand your data better" position, which, while true in the abstract, is not always what the requester wants to hear.

Paul.
Re: [Python-Dev] Python3 "complexity"
On 09.01.2014 22:45, Antoine Pitrou wrote:
> On Thu, 9 Jan 2014 13:36:05 -0800
> Chris Barker wrote:
>>
>> Some folks have suggested using latin-1 (or other 8-bit encoding) -- is
>> that guaranteed to work with any binary data, and round-trip accurately?
>
> Yes, it is.

Just a word of caution:

Using the 'latin-1' to mean unknown encoding can easily result in Mojibake (unreadable text) entering your application with dangerous effects on your other text data.

E.g. "Marc-André" read using 'latin-1' if the string itself is encoded as UTF-8 will give you "Marc-AndrÃ©" in your application. (Yes, I see that a lot in applications and websites I use ;-))

Also note that indexing based on code points will likely break that way as well, i.e. if you pass an index to an application based on what you see in your editor or shell, those indexes can be wrong when used on the encoded data. UTF-8 is an example of a popular variable length encoding for Unicode, so you'll hit this problem whenever dealing with non-ASCII UTF-8 data.

>> and will surrogateescape work for arbitrary binary data?
>
> Yes, it will.

The surrogateescape trick only works if you are encoding your work using the same encoding that you used for decoding it. Otherwise, you'll get a mix of the input encoding and the output encoding as output.

Note that the error handler trick has an advantage over the latin-1 trick: if you try to encode a Unicode string with escape surrogates without using the error handler, it will fail, so you at least know that there are "funny" code points in your output string that need some extra care.

BTW: Perhaps it would be a good idea to backport the surrogateescape error handler to Python 2.7 to simplify writing code which works in both Python 2 and 3.

--
Marc-Andre Lemburg
eGenix.com
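The points above are easy to reproduce in a few lines. This is a small sketch using only built-in str and bytes methods; the variable names are chosen for the example.

```python
name = "Marc-Andr\u00e9"          # 10 code points
raw = name.encode("utf-8")         # 11 bytes: the accented letter takes two

# The latin-1 trick: decoding never fails and round-trips the bytes,
# but the text seen by the application is mojibake.
mojibake = raw.decode("latin-1")
latin1_round_trip = mojibake.encode("latin-1") == raw

# Index drift: position 9 is the accented letter in the real text, but in
# the latin-1 view of the bytes it is only the first half of that letter.
index_matches = name[9] == mojibake[9]

# The surrogateescape trick: undecodable bytes become lone surrogates.
escaped = raw.decode("ascii", "surrogateescape")
escape_round_trip = escaped.encode("ascii", "surrogateescape") == raw

# Encoding without the error handler fails, which is the advantage noted
# above: the "funny" code points cannot leak out unnoticed.
try:
    escaped.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

if __name__ == "__main__":
    print(repr(mojibake), latin1_round_trip, index_matches)
    print(repr(escaped), escape_round_trip, strict_encode_failed)
```

Both tricks preserve the bytes; only surrogateescape makes a strict encode fail loudly when the escaped data is treated as ordinary text.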
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 11:32:05 +1000 Nick Coghlan wrote:
> > It's consistent with bytearray.join's behaviour:
> >
> > >>> x = bytearray()
> > >>> x.join([b"abc"])
> > bytearray(b'abc')
> > >>> x
> > bytearray(b'')
>
> Yeah, I guess I'm OK with us being consistent on that one. It's still
> weird, but also clearly useful :)
>
> Will the new binary format ever call __format__? I assume not, but it's
> probably best to make that absolutely explicit in the PEP.

Indeed not. I'll add that to the PEP, thanks.

cheers
Antoine.