date:20140110

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Georg Brandl

On 11.01.2014 03:04, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 20:53:09 -0500
> "Eric V. Smith"  wrote:
>> 
>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
>> 3982. See for example http://bugs.python.org/issue3982#msg180432 .

I agree.

> Then we might as well not do anything, since any attempt to advance
> things is met by stubborn opposition in the name of "not far enough".
> 
> (I don't care much personally, I think the issue is quite overblown
> anyway)

So you wouldn't mind another overhaul of the PEP including a bit more
functionality again? :)  I really think that practicality beats purity
here.  (I'm not advocating free mixing bytes and str, mind you!)

Georg

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Python3 "complexity" - 2 use cases

2014-01-10 Thread Ben Finney

"Jim J. Jewett"  writes:

>  
> > Steven D'Aprano wrote:
> >> I think that heuristics to guess the encoding have their role to play,
> >> if the caller understands the risks.
>
> Ben Finney wrote:
> > In my opinion, content-type guessing heuristics certainly don't belong
> > in the standard library.
>
> It would be great if there were never any need to guess.  But in the
> real world, there is -- and often the user won't know any more than
> python does.

That's why I think it's great to have heuristic guessing code available
as a third-party library.

> So when it is time to guess, a source of good guesses is an important
> battery to include.

Why is it important enough to deserve that privilege, over the thousands
of other candidates for the standard library? The barrier for entry to
the standard library is higher than mere usefulness.

> We should explicitly treat autodetection like time zone data --
> there is no promise that the "right answer" (or at least the "best
> guess") won't change, even within a release.

But there is exactly one set of authoritative time zones at any
particular point in time. That's why it makes sense to have that set of
authoritative values available in the standard library.

Heuristic guesses about content types do not have the property of
exactly one authoritative source, so your analogy is not compelling.

-- 
 \ “Unix is an operating system, OS/2 is half an operating system, |
  `\Windows is a shell, and DOS is a boot partition virus.” —Peter |
_o__)H. Coffin |
Ben Finney


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Steven D'Aprano

On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:

> As you may know, PDF operates over bytes and an integer or floating-point
> number is written down as-is, for example "100" or "1.23".

I'm sorry, I don't understand what you mean here. I'm honestly not 
trying to be difficult, but you sound confident that you understand what 
you are doing, but your description doesn't make sense to me. To me, it 
looks like you are conflating bytes and ASCII characters, that is, 
assuming that characters "are" in some sense identical to their ASCII 
representation. Let me explain:

The integer that in English is written as 100 is represented in memory 
as bytes 0x0064 (assuming a big-endian C short), so when you say "an 
integer is written down AS-IS" (emphasis added), to me that says that 
the PDF file includes the bytes 0x0064. But then you go on to write the 
three character string "100", which (assuming ASCII) is the bytes 
0x313030. Going from the C short to the ASCII representation 0x313030 is 
nothing like inserting the int "as-is". To put it another way, the 
Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary 
file which includes some ASCII-formatted text fields. So when writing an 
integer 100, rather than writing it "as is" which would be byte 0x64 
(with however many leading null bytes needed for padding), it is 
converted to ASCII representation 0x313030 first, and that's what needs 
to be inserted.
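
The distinction drawn above can be checked directly. This is only a minimal sketch: the `struct` format `">h"` is an assumption standing in for the "big-endian C short" mentioned in the text.

```python
import struct

n = 100

# Raw binary form: a big-endian 16-bit integer (the "C short" in the text).
raw = struct.pack(">h", n)

# ASCII-formatted form: the decimal digits "1", "0", "0" as bytes.
text = str(n).encode("ascii")

print(raw.hex())   # 0064
print(text.hex())  # 313030
```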

If you consider PDF as binary with occasional pieces of ASCII text, then 
working with bytes makes sense. But I wonder whether it might be better 
to consider PDF as mostly text with some binary bytes. Even though the 
bulk of the PDF will be binary, the interesting bits are text. E.g. your 
example:

> In the case of PDF, the embedding of an image into PDF looks like:
> 
> 10 0 obj
>   << /Type /XObject
>  /Width 100
>  /Height 100
>  /Alternates 15 0 R
>  /Length 2167
>   >>
> stream
> ...binary image data...
> endstream
> endobj

Even though the binary image data is probably much, much larger in 
length than the text shown above, it's (probably) trivial to deal with: 
convert your image data into bytes, decode those bytes into Latin-1, 
then concatenate the Latin-1 string into the text above.

Latin-1 has the nice property that every byte decodes into the character 
with the same code point, and vice versa. So:

for i in range(256):
    assert bytes([i]).decode('latin-1') == chr(i)
    assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode 
text with embedded binary data, rather than binary data with embedded 
ASCII text. Then when writing the file to disk, of course you encode it 
to Latin-1, either explicitly:

pdf = ... # Unicode string containing the PDF contents
with open("outfile.pdf", "wb") as f:
    f.write(pdf.encode("latin-1"))

or implicitly:

with open("outfile.pdf", "w", encoding="latin-1") as f:
    f.write(pdf)
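
As a rough illustration of the approach described here, the image object from the quoted example could be assembled as text and encoded once at the end. The payload `binary_image_data` and the trimmed-down dictionary are hypothetical stand-ins, not real PDF content.

```python
# Hypothetical stand-in for real image bytes; any byte values 0-255 work.
binary_image_data = bytes(range(256))

header = (
    "10 0 obj\n"
    "<< /Type /XObject /Width %d /Height %d /Length %d >>\n"
    "stream\n"
) % (100, 100, len(binary_image_data))

# Decode the binary payload as Latin-1 so it can be concatenated with text.
pdf_object = header + binary_image_data.decode("latin-1") + "\nendstream\nendobj\n"

# Encode back to bytes on output; the payload round-trips byte-for-byte.
encoded = pdf_object.encode("latin-1")
assert binary_image_data in encoded
```

One wrinkle with the implicit (text-mode) variant: on platforms that translate newlines, a "\n" inside the payload would be rewritten on output, so passing newline="" to open() is probably needed there.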

There may be a few wrinkles I haven't thought of, I don't claim to be an 
expert on PDF. But I see no reason why PDF files ought to be an 
exception to the rule:

* work internally with Unicode text;

* convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and better, the internal 
representation of Unicode strings containing only code points up to 255 
(i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte 
per character.

Another advantage is that using text rather than bytes means that your 
example:

[...]
> dropping the bytes-formatting of numbers makes it more complicated
> than it was. I would appreciate any explanation on how:
> 
> b'%.1f %.1f %.1f RG' % (r, g, b)

becomes simply

'%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can be written as:

u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2. 
That's *much* nicer than your suggestion:

> is more confusing than:
> 
> b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), 
>  (r, g, b)))
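
If bytes are still needed at the output boundary, the text-first version only requires a single encode step. A small sketch, with hypothetical colour values:

```python
r, g, b = 1.0, 0.5, 0.0  # hypothetical colour components

# Format as text, then encode once when bytes are actually required.
command = ("%.1f %.1f %.1f RG" % (r, g, b)).encode("ascii")
print(command)  # b'1.0 0.5 0.0 RG'
```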

-- 
Steven

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Cameron Simpson

On 11Jan2014 00:43, Juraj Sukop  wrote:
> On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner
> wrote:
> > Why not build "10 0 obj ... stream" and "endstream endobj" in
> > Unicode and then encode to ASCII? Example:
> >
> > data = b''.join((
> >   ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
> >   binary_image_data,
> >   ("endstream endobj").encode('ascii'),
> > ))
> 
> The key is "encode to ASCII" which means that the result is bytes. Then,
> there is this "11 0 obj" which should also be bytes. But it has no
> "binary_image_data" - only lots of numbers waiting to be somehow converted
> to bytes. I already mentioned the problems with ".encode('ascii')" but it
> does not stop here. Numbers may appear not only inside "streams" but almost
> anywhere: in the header there is PDF version, an image has to have "width"
> and "height", at the end of PDF there is a structure containing offsets to
> all of the objects in file. Basically, to ".encode('ascii')" every possible
> number is not exactly simple or pretty.

Hi Juraj,

Might I suggest a helper function (outside the PEP scope) instead
of arguing for support for %f et al?

Thus:

  def bytify(things, encoding='ascii'):
      # Yield each item as bytes; non-bytes items are str()-ed and encoded.
      for thing in things:
          if isinstance(thing, bytes):
              yield thing
          else:
              yield str(thing).encode(encoding)

Then one's embedding in PDF might become, more readably:

  data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )

Of course, bytify might be augmented with whatever encoding facilities
might suit your needs.
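
A self-contained run of the helper suggested above might look like this. The payload bytes are hypothetical, and the string "obj" stands in for the `obj` name used in the one-liner.

```python
def bytify(things, encoding="ascii"):
    """Yield each item as bytes, encoding the str() of non-bytes items."""
    for thing in things:
        if isinstance(thing, bytes):
            yield thing
        else:
            yield str(thing).encode(encoding)

binary_image_data = b"\x89PNG\x00\xff"  # hypothetical payload
data = b" ".join(bytify([10, 0, "obj", binary_image_data]))
print(data)  # b'10 0 obj \x89PNG\x00\xff'
```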

Cheers,
-- 
Cameron Simpson 

We tend to overestimate the short-term impact of technological change and
underestimate its long-term impact. - Amara's Law

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread INADA Naoki

To avoid implicit conversion between str and bytes, I propose adding only a
limited %-format, not .format() or .format_map().

"limited %-format" means:

%c accepts an integer or a bytes object of length one.
%r is not supported.
%s accepts only bytes.
%a is the only format that accepts an arbitrary object.

All other formats behave the same as for str.
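
For comparison, bytes %-formatting did eventually land in Python 3.5 through PEP 461, with rules close to (but not identical to) the list above. The sketch below only exercises the codes mentioned here and assumes Python 3.5 or later; it does not reflect the state of the PEP 460 draft under discussion.

```python
# %c takes an integer in range(256) or a single byte.
assert b"%c" % 65 == b"A"
assert b"%c" % b"A" == b"A"

# %s takes bytes; a str argument is rejected rather than implicitly encoded.
assert b"%s" % b"abc" == b"abc"
try:
    b"%s" % "abc"
except TypeError:
    rejected = True
else:
    rejected = False

# %a takes any object and inserts its ascii() representation.
quoted = b"%a" % "caf\xe9"
print(rejected, quoted)
```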



On Sat, Jan 11, 2014 at 8:24 AM, Antoine Pitrou  wrote:

> On Fri, 10 Jan 2014 18:14:45 -0500
> "Eric V. Smith"  wrote:
> >
> > >> Because embedding the ASCII equivalent of ints and floats in byte
> > >> streams is a common operation?
> > >
> > > Again, if you're representing "ASCII", you're representing text and
> > > should use a str object.
> >
> > Yes, but is there existing 2.x code that uses %s for int and float
> > (perhaps unwittingly), and do we want to "help" that code out?
> > Or do we want to make porters first change to using %d or %f
> > instead of %s?
>
> I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and
> %f on bytes objects.
>
> > I think what you're getting at is that in addition to not calling
> > __format__, we don't want to call __str__, either, for the same reason.
>
> Not only. We don't want to do anything that actually asks for a
> *textual* representation of something. %d and %f ask for a textual
> representation of a number, so they're right out.
>
> Regards
>
> Antoine.



-- 
INADA Naoki  

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman


On 01/10/2014 06:39 PM, Antoine Pitrou wrote:
>
> I know what a network protocol with ill-defined encodings
> looks like.

For the record, I've been (and I suspect Eric and some others have also been) talking about well-defined encodings.  For 
the DBF files that I work with, there is binary, ASCII, and a third that is specified in the file header.


--
~Ethan~

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman


On 01/10/2014 06:39 PM, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 18:28:41 -0800
> Ethan Furman wrote:
>>
>> Is it safe to assume you don't use Python for the use-cases under discussion?
>
> You know, I've done quite a bit of network programming.

No, I didn't, that's why I asked.

> I've also done an experimental port of Twisted to Python 3.
> I know what a network protocol with ill-defined encodings
> looks like.

Can you give a code sample of what you think, for example, the PDF generation code should look like?  (If you already 
have, I apologize -- I missed it in all the ruckus.)


--
~Ethan~

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou

On Fri, 10 Jan 2014 18:28:41 -0800
Ethan Furman  wrote:
> 
> Is it safe to assume you don't use Python for the use-cases under discussion?

You know, I've done quite a bit of network programming. I've also done
an experimental port of Twisted to Python 3. I know what a network
protocol with ill-defined encodings looks like.

Regards

Antoine.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman


On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 20:53:09 -0500
> "Eric V. Smith"  wrote:
>>
>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
>> 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>
> Then we might as well not do anything, since any attempt to advance
> things is met by stubborn opposition in the name of "not far enough".

Heh, and here I thought it was stubborn opposition in the name of purity.  ;)

> (I don't care much personally, I think the issue is quite overblown
> anyway)

Is it safe to assume you don't use Python for the use-cases under discussion?  Specifically, mixed ASCII, binary, and 
encoded-text byte streams?


--
~Ethan~

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou

On Fri, 10 Jan 2014 20:53:09 -0500
"Eric V. Smith"  wrote:
> 
> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
> 3982. See for example http://bugs.python.org/issue3982#msg180432 .

Then we might as well not do anything, since any attempt to advance
things is met by stubborn opposition in the name of "not far enough".

(I don't care much personally, I think the issue is quite overblown
anyway)

Regards

Antoine.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith

On 1/10/2014 8:12 PM, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 16:23:53 -0800
> Ethan Furman  wrote:
>> On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
>>>
>>> With Victor's consent, I overhauled PEP 460 and made the feature set
>>> more restricted and consistent with the bytes/str separation.
>>
>>  From the PEP:
>> =
>>> Python 3 generally mandates that text be stored and manipulated as
>>>  unicode (i.e. str objects, not bytes). In some cases, though, it
>>>  makes sense to manipulate bytes objects directly. Typical usage is
>>>  binary network protocols, where you may want to interpolate and
>>>  assemble several bytes objects (some of them literals, some of them
>>>  computed) to produce complete protocol messages. For example,
>>>  protocols such as HTTP or SIP have headers with ASCII names and
>>>  opaque "textual" values using a varying and/or sometimes ill-defined
>>>  encoding. Moreover, those headers can be followed by a binary
>>>  body... which can be chunked and decorated with ASCII headers and
>>>  trailers!
>>
>> As it stands now, the PEP talks about ASCII, about how it makes sense
>> sometimes to work directly with bytes objects, and 
>> then refuses to allow % to embed ASCII text in the byte stream.
> 
> Indeed I refuse for %-formatting to allow the mixing of bytes and str
> objects, in the same way that it is forbidden to concatenate "a" and
> b"b" together, or to write b"".join(["abc"]).

I think:
'a' + b'b'
is different from:
b'Content-Length: %d' % 42

The former we want to prevent, but I see nothing wrong with the latter.
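
The two cases contrasted above can be tried side by side. This is a sketch assuming Python 3.5 or later, where numeric interpolation into bytes is available; it was not at the time of this thread.

```python
# Mixing str and bytes by concatenation raises TypeError in Python 3.
try:
    "a" + b"b"
except TypeError:
    concat_failed = True
else:
    concat_failed = False

# Interpolating a number into a bytes template (Python 3.5+).
header = b"Content-Length: %d" % 42
print(concat_failed, header)  # True b'Content-Length: 42'
```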

So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
3982. See for example http://bugs.python.org/issue3982#msg180432 .

Eric.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou

On Fri, 10 Jan 2014 16:23:53 -0800
Ethan Furman  wrote:
> On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
> >
> > With Victor's consent, I overhauled PEP 460 and made the feature set
> > more restricted and consistent with the bytes/str separation.
> 
>  From the PEP:
> =
> > Python 3 generally mandates that text be stored and manipulated as
> >  unicode (i.e. str objects, not bytes). In some cases, though, it
> >  makes sense to manipulate bytes objects directly. Typical usage is
> >  binary network protocols, where you may want to interpolate and
> >  assemble several bytes objects (some of them literals, some of them
> >  computed) to produce complete protocol messages. For example,
> >  protocols such as HTTP or SIP have headers with ASCII names and
> >  opaque "textual" values using a varying and/or sometimes ill-defined
> >  encoding. Moreover, those headers can be followed by a binary
> >  body... which can be chunked and decorated with ASCII headers and
> >  trailers!
> 
> As it stands now, the PEP talks about ASCII, about how it makes sense
> sometimes to work directly with bytes objects, and 
> then refuses to allow % to embed ASCII text in the byte stream.

Indeed I refuse for %-formatting to allow the mixing of bytes and str
objects, in the same way that it is forbidden to concatenate "a" and
b"b" together, or to write b"".join(["abc"]).

Python 3 was made *precisely* because the implicit conversion between
ASCII unicode and bytes is deemed harmful. It's completely
counter-productive and misleading for our users to start muddying the
message by introducing exceptions to that rule.

Regards

Antoine.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
>
> With Victor's consent, I overhauled PEP 460 and made the feature set
> more restricted and consistent with the bytes/str separation.

From the PEP:
=
> Python 3 generally mandates that text be stored and manipulated as
> unicode (i.e. str objects, not bytes). In some cases, though, it
> makes sense to manipulate bytes objects directly. Typical usage is
> binary network protocols, where you may want to interpolate and
> assemble several bytes objects (some of them literals, some of them
> computed) to produce complete protocol messages. For example,
> protocols such as HTTP or SIP have headers with ASCII names and
> opaque "textual" values using a varying and/or sometimes ill-defined
> encoding. Moreover, those headers can be followed by a binary
> body... which can be chunked and decorated with ASCII headers and
> trailers!

As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and
then refuses to allow % to embed ASCII text in the byte stream.

> All other features present in formatting of str objects (either
> through the percent operator or the str.format() method) are
> unsupported. Those features imply treating the recipient of the
> operator or method as text, which goes counter to the text / bytes
> separation (for example, accepting %d as a format code would imply
> that the bytes object really is an ASCII-compatible text string).

No, it implies that portion of the byte stream is ASCII compatible. And we have several examples: PDF, HTML, DBF, just
about every network protocol (not counting M$), and, I'm sure, plenty I haven't heard of.

-1 on the PEP as it stands now.

--
~Ethan~