Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread R. David Murray
On Sun, 12 Jan 2014 17:51:41 +1000, Nick Coghlan ncogh...@gmail.com wrote:
 On 12 January 2014 04:38, R. David Murray rdmur...@bitdance.com wrote:
  But!  Our goal should be to help people convert to Python3.  So how can
  we find out what the specific problems are that real-world programs are
  facing, look at the *actual code*, and help that project figure out the
  best way to make that code work in both python2 and python3?
 
  That seems like the best way to find out what needs to be added to
  python3 or pypi:  help port the actual code of the developers who are
  running into problems.
 
  Yes, I'm volunteering to help with this, though of course I can't promise
  exactly how much time I'll have available.
 
 And, as has been the case for a long time, the PSF stands ready to
 help with funding credible grant proposals for Python 3 porting
 efforts. I believe some of the core devs (including David?) do
 freelance and contract work, so that's an option definitely worth
 considering if a project would like to support Python 3 but is having
 difficulty getting there with purely volunteer effort.

Yes, I do contract programming, as part of Murray and Walker, Inc (web
site coming soon but not there yet).  And yes I currently have time
available in my schedule.

--David
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info wrote:

 On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:

  AFAIK (and just for the record), there could be both Latin1 text and
  UTF-16 in a PDF (and other encodings too), depending on the font used:
 [...]
  In Python2, txt is just a str, but in Python3 handling everything as
  latin1 string obviously doesn't work for TTF in this case.

 Nobody is suggesting that you use Latin-1 for *everything*. We're
 suggesting that you use it for blobs of binary data that represent
 arbitrary bytes. First you have to get your binary data in the first
 place, using whatever technique is necessary.


Just to check I understood what you are saying. Instead of writing:

content = b'\n'.join([
b'header',
b'part 2 %.3f' % number,
binary_image_data,
utf16_string.encode('utf-16be'),
b'trailer'])

it should now look like:

content = '\n'.join([
'header',
'part 2 %.3f' % number,
binary_image_data.decode('latin-1'),
utf16_string.encode('utf-16be').decode('latin-1'),
'trailer']).encode('latin-1')

Correct?


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Nick Coghlan
On 12 Jan 2014 21:53, Juraj Sukop juraj.su...@gmail.com wrote:

 On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info wrote:

 On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:

  AFAIK (and just for the record), there could be both Latin1 text and
  UTF-16 in a PDF (and other encodings too), depending on the font used:
 [...]
  In Python2, txt is just a str, but in Python3 handling everything as
  latin1 string obviously doesn't work for TTF in this case.

 Nobody is suggesting that you use Latin-1 for *everything*. We're
 suggesting that you use it for blobs of binary data that represent
 arbitrary bytes. First you have to get your binary data in the first
 place, using whatever technique is necessary.


 Just to check I understood what you are saying. Instead of writing:

 content = b'\n'.join([
 b'header',
 b'part 2 %.3f' % number,
 binary_image_data,
 utf16_string.encode('utf-16be'),
 b'trailer'])

 it should now look like:

 content = '\n'.join([
 'header',
 'part 2 %.3f' % number,
 binary_image_data.decode('latin-1'),
 utf16_string.encode('utf-16be').decode('latin-1'),
 'trailer']).encode('latin-1')

Why are you proposing to do the *join* in text space? Encode all the parts
separately, concatenate them with b'\n'.join() (or whatever separator is
appropriate). It's only the *text formatting operation* that needs to be
done in text space and then explicitly encoded (and this example doesn't
even need latin-1; ASCII is sufficient):

content = b'\n'.join([
b'header',
 ('part 2 %.3f' % number).encode('ascii'),
 binary_image_data,
 utf16_string.encode('utf-16be'),
b'trailer'])

 Correct?

My updated version above is the reasonable way to do it in Python 3, and
the one I consider clearly superior to reintroducing implicit encoding to
ASCII as part of the core text model.
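
A runnable version of the approach described above might look like the sketch below. The values given to number, binary_image_data and utf16_string are placeholders invented for illustration; only the structure of the join comes from the thread.

```python
# Placeholder inputs (assumptions for illustration only).
number = 3.14159
binary_image_data = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])
utf16_string = 'Gr\u00fc\u00dfe'  # a str, despite the name used in the thread

content = b'\n'.join([
    b'header',
    # Format in the text domain, then encode explicitly; ASCII suffices here.
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,                # already bytes, passed through untouched
    utf16_string.encode('utf-16be'),  # text, encoded explicitly for the output
    b'trailer',
])

print(content)
```

Every switch from text to bytes is spelled out with an explicit encode() call, which is the point being made about the Python 3 text model.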

This is why I *don't* have a problem with PEP 460 as it stands - it's just
syntactic sugar for something you can already do with b''.join(), and thus
not particularly controversial.

It's only proposals that add any form of implicit encoding
that silently switches from the text domain to the binary domain that
conflict with the core Python 3 text model (although third party types
remain largely free to do whatever they want).

Cheers,
Nick.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
On Sun, Jan 12, 2014 at 2:16 PM, Nick Coghlan ncogh...@gmail.com wrote:

 Why are you proposing to do the *join* in text space? Encode all the parts
 separately, concatenate them with b'\n'.join() (or whatever separator is
 appropriate). It's only the *text formatting operation* that needs to be
 done in text space and then explicitly encoded (and this example doesn't
 even need latin-1; ASCII is sufficient):

I apparently misunderstood what Steven was suggesting, thanks for the
clarification.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Steven D'Aprano
On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
 On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info wrote:
 
  On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
 
   AFAIK (and just for the record), there could be both Latin1 text and
   UTF-16 in a PDF (and other encodings too), depending on the font used:
  [...]
   In Python2, txt is just a str, but in Python3 handling everything as
   latin1 string obviously doesn't work for TTF in this case.
 
  Nobody is suggesting that you use Latin-1 for *everything*. We're
  suggesting that you use it for blobs of binary data that represent
  arbitrary bytes. First you have to get your binary data in the first
  place, using whatever technique is necessary.
 
 
 Just to check I understood what you are saying. Instead of writing:
 
 content = b'\n'.join([
 b'header',
 b'part 2 %.3f' % number,
 binary_image_data,
 utf16_string.encode('utf-16be'),
 b'trailer'])


Which doesn't work, since bytes don't support %f in Python 3.

 
 it should now look like:
 
 content = '\n'.join([
 'header',
 'part 2 %.3f' % number,
 binary_image_data.decode('latin-1'),
 utf16_string.encode('utf-16be').decode('latin-1'),
 'trailer']).encode('latin-1')
 
 Correct?

Not quite as you show.

First, utf16_string confuses me. What is it? If it is a Unicode 
string, i.e.:

# Python 3 semantics
type(utf16_string)
=> returns str

then the name is horribly misleading, and it is best handled like this:

content = '\n'.join([
'header',
'part 2 %.3f' % number,
binary_image_data.decode('latin-1'),
utf16_string,  # Misleading name, actually Unicode string
'trailer'])


Note that since it's text, and content is text, there is no need to 
encode then decode.

UTF-16 is not another name for Unicode. Unicode is a character set. 
UTF-16 is just one of a number of different encodings which map the 
0x110000 distinct Unicode characters (actually code points, U+0000
through U+10FFFF) to bytes.
UTF-16 is one possible way to implement Unicode strings in memory, but
not the only way. Python has used, or does use, four distinct implementations:

1) UTF-16 in narrow builds
2) UTF-32 in wide builds
3) a hybrid approach starting in Python 3.3, where strings are
   stored as either:

   3a) Latin-1
   3b) UCS-2
   3c) UTF-32

   depending on the content of the string.

So calling an arbitrary string utf16_string is misleading or wrong.
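
To make the distinction between a code point and its encodings concrete, here is a small sketch. The choice of the euro sign is arbitrary; the codec names are the standard ones shipped with Python.

```python
euro = '\u20ac'  # EURO SIGN: one code point, so a str of length 1

utf8 = euro.encode('utf-8')       # three bytes
utf16 = euro.encode('utf-16-be')  # two bytes
utf32 = euro.encode('utf-32-be')  # four bytes

for name, data in [('utf-8', utf8), ('utf-16-be', utf16), ('utf-32-be', utf32)]:
    print(name, len(data), data)
```

The str itself is the same in all three cases; only the byte sequence produced by encode() differs.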


On the other hand, if it is actually a bytes object which is the product 
of UTF-16 encoding, i.e.:

type(utf16_string)
=> returns bytes

and those bytes were generated by some text.encode('utf-16'), then it
is already binary data and needs to be smuggled into the text string. 
Latin-1 is good for that:

content = '\n'.join([
'header',
'part 2 %.3f' % number,
binary_image_data.decode('latin-1'),
utf16_string.decode('latin-1'),
'trailer'])
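
The reason Latin-1 works for smuggling is that it maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding is lossless. A minimal check of that property, and of its limit for text outside U+00FF:

```python
all_bytes = bytes(range(256))

# Each byte value N becomes the code point U+00NN.
as_text = all_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')

# Genuine text outside U+0000..U+00FF cannot be encoded as Latin-1.
try:
    '\u20ac'.encode('latin-1')
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(round_tripped == all_bytes, euro_fits)
```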


Both examples assume that you intend to do further processing of content 
before sending it, and will encode just before sending:

content.encode('utf-8')

(Don't use Latin-1, since it cannot handle the full range of text 
characters.)

If that's not the case, then perhaps this is better suited to what you 
are doing:

content = b'\n'.join([
b'header',
('part 2 %.3f' % number).encode('ascii'),
binary_image_data,  # already bytes
utf16_string,  # already bytes
b'trailer'])



-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Steven D'Aprano
On Sun, Jan 12, 2014 at 11:16:37PM +1000, Nick Coghlan wrote:

  content = '\n'.join([
  'header',
  'part 2 %.3f' % number,
  binary_image_data.decode('latin-1'),
  utf16_string.encode('utf-16be').decode('latin-1'),
  'trailer']).encode('latin-1')
 
 Why are you proposing to do the *join* in text space? 

In defence of that, doing the join as text may be useful if you have 
additional text processing that you want to do after assembling the 
whole string, but before calling encode.

Even if you intend to encode to bytes at the end, you might prefer to 
work in the text domain right until just before the end:

- no need for b' prefixes;
- indexing a string returns a 1-char string, not an int;
- can use the full range of % formatting, etc.
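
The indexing difference in the list above is easy to see side by side. The sample values are arbitrary.

```python
text = 'GET'
data = b'GET'

first_char = text[0]     # a 1-character str
first_byte = data[0]     # an int in range(256)
first_slice = data[0:1]  # slicing, unlike indexing, still gives bytes

print(repr(first_char), repr(first_byte), repr(first_slice))
```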


-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
Wait a second, this is how I understood it but what Nick said made me think
otherwise...

On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano st...@pearwood.info wrote:

 On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
  On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info
 wrote:
 
  Just to check I understood what you are saying. Instead of writing:
 
  content = b'\n'.join([
  b'header',
  b'part 2 %.3f' % number,
  binary_image_data,
  utf16_string.encode('utf-16be'),
  b'trailer'])

 Which doesn't work, since bytes don't support %f in Python 3.


I know and this was an example of the ideal (for me, anyway) way of
formatting bytes.


 First, utf16_string confuses me. What is it? If it is a Unicode
 string, i.e.:


It is a Unicode string which happens to contain code points outside U+00FF
(as with the TTF example above), so that it triggers the (at least) 2-byte
memory representation in CPython 3.3+. I agree, I chose the variable name
poorly, my bad.



 content = '\n'.join([
 'header',
 'part 2 %.3f' % number,
 binary_image_data.decode('latin-1'),
 utf16_string,  # Misleading name, actually Unicode string
 'trailer'])


Which, because of that horribly-named-variable, prevents the use of simple
memcpy and makes the image data occupy way more memory than when it was
in simple bytes.
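
A rough way to observe the memory effect on CPython 3.3 or later is sketched below. The sizes reported by sys.getsizeof are an implementation detail and will differ between versions and platforms, so the absolute numbers should not be relied on; the sample data is made up.

```python
import sys

image_bytes = bytes(range(256)) * 4              # 1024 bytes of "image data"
as_latin1_text = image_bytes.decode('latin-1')   # fits the 1-byte-per-char form
combined = as_latin1_text + '\u20ac'             # one char above U+00FF widens it all

print('bytes object:      ', sys.getsizeof(image_bytes))
print('latin-1 only str:  ', sys.getsizeof(as_latin1_text))
print('str with U+20AC:   ', sys.getsizeof(combined))
```

On CPython the last figure comes out at roughly twice the second, since a single character outside Latin-1 forces every character in the string into the wider representation.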


 Both examples assume that you intend to do further processing of content
 before sending it, and will encode just before sending:


Not really, I was interested to compare it to bytes formatting, hence it
included the encode() as well.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Daniel Holth writes:

  -1 on adding more surrogateescapes by default. It's a pain to track
  down where the encoding errors came from.

What do you mean by "default"?  It was quite explicit in the code I
posted, and it's the only reasonable thing to do with text data
without known (but ASCII compatible) encoding or multiple different
encodings in a single data chunk.  If you leave it as bytes, it will
barf as soon as you try to mix it with text even if it is pure ASCII!
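
For readers who have not used it, an explicit surrogateescape round trip looks roughly like this. The sample header line is invented; errors='surrogateescape' is the standard error handler from PEP 383.

```python
raw = b'Subject: caf\xe9\r\n'  # ASCII-compatible, but the 8-bit encoding is unknown

# Non-ASCII bytes become lone surrogates (0xE9 -> U+DCE9); ASCII stays readable.
text = raw.decode('ascii', errors='surrogateescape')

# The same handler on output restores the original bytes exactly.
round_tripped = text.encode('ascii', errors='surrogateescape')

# Left as bytes, the data cannot be combined with str at all, even if pure ASCII.
try:
    'Subject: ' + b'pure ascii'
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(round_tripped == raw, mixing_allowed)
```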



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 12:39 PM, Stephen J. Turnbull wrote:

Daniel Holth writes:

   -1 on adding more surrogateescapes by default. It's a pain to track
   down where the encoding errors came from.

What do you mean by "default"?  It was quite explicit in the code I
posted, and it's the only reasonable thing to do with text data
without known (but ASCII compatible) encoding or multiple different
encodings in a single data chunk.  If you leave it as bytes, it will
barf as soon as you try to mix it with text even if it is pure ASCII!


Which is why some (including myself) are asking to be able to stay in bytes land and do any necessary interpolation 
there.  No resulting unicode, no barfing.  ;)


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Mark Shannon

Why not just use six.byte_format(fmt, *args)?
It works on both Python2 and Python3 and accepts the numerical format 
specifiers, plus '%b' for inserting bytes and '%a' for converting text 
to ascii.


Admittedly it doesn't exist yet,
but it could and it would save a lot of arguing :)

(Apologies to anyone who doesn't appreciate my mischievous sense of humour)
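
Purely as an illustration of what such a helper could look like, here is a small Python 3 only sketch. It is hypothetical: no byte_format function exists in six or in the standard library, and the exact meaning given to '%b' and '%a' below is an assumption based on the description in this message.

```python
import re

# Hypothetical helper: nothing named byte_format exists in six or the stdlib.
_SPEC = re.compile(rb'%(?:%|([-#0 +]*\d*(?:\.\d+)?)([bdiouxXeEfFgGa]))')


def byte_format(fmt, *args):
    """Interpolate args into the bytes template fmt (sketch only)."""
    remaining = iter(args)

    def replace(match):
        if match.group(0) == b'%%':
            return b'%'                      # literal percent sign
        flags = match.group(1).decode('ascii')
        conv = match.group(2).decode('ascii')
        value = next(remaining)
        if conv == 'b':                      # insert bytes unchanged
            if not isinstance(value, (bytes, bytearray)):
                raise TypeError('%b requires a bytes-like value')
            return bytes(value)
        if conv == 'a':                      # text that must be pure ASCII
            return value.encode('ascii')
        # Numeric codes: format in the text domain, then encode as ASCII.
        return (('%' + flags + conv) % value).encode('ascii')

    return _SPEC.sub(replace, fmt)


print(byte_format(b'part 2 %.3f', 3.14159))
print(byte_format(b'%b|%a|%d%%', b'\xff\x00', 'abc', 42))
```

The sketch does no checking for surplus or missing arguments, and it does not attempt the Python 2 compatibility mentioned above.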

Cheers,
Mark.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 01:59 PM, Mark Shannon wrote:


Why not just use six.byte_format(fmt, *args)?
It works on both Python2 and Python3 and accepts the numerical format 
specifiers, plus '%b' for inserting bytes and '%a'
for converting text to ascii.


Sounds like the second best option!



Admittedly it doesn't exist yet,
but it could and it would save a lot of arguing :)


:)

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Chris Angelico
On Mon, Jan 13, 2014 at 4:57 AM, Juraj Sukop juraj.su...@gmail.com wrote:
 On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano st...@pearwood.info
 wrote:
 First, utf16_string confuses me. What is it? If it is a Unicode
 string, i.e.:

 It is a Unicode string which happens to contain code points outside U+00FF
 (as with the TTF example above), so that it triggers the (at least) 2-bytes
 memory representation in CPython 3.3+. I agree, I chose the variable name
 poorly, my bad.

When I'm talking about Unicode strings based on their maximum
codepoint, I usually call them something like "ASCII string", "Latin-1
string", "BMP string", and "SMP string". Still not wholly accurate,
but less confusing than naming an encoding... oh wait, two of those
_are_ encodings :| But you could use "narrow string" for the first
two. Or "string(0..127)" for ASCII, "string(0..255)" for Latin-1, and
then for consistency "string(0..65535)" and "string(0..1114111)" for
the others, except that I doubt that'd be helpful :) At any rate,
"BMP" as a term for "includes characters outside of Latin-1 but all on
the Basic Multilingual Plane" would probably be close enough to get
away with.
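
The naming scheme above amounts to bucketing a str by its highest code point. A minimal sketch, where the function name classify is made up for illustration and the labels are the ones suggested in this message:

```python
def classify(text):
    """Label a str by its highest code point."""
    highest = max(map(ord, text), default=0)  # empty string counts as ASCII
    if highest < 0x80:
        return 'ASCII string'
    if highest < 0x100:
        return 'Latin-1 string'
    if highest < 0x10000:
        return 'BMP string'
    return 'SMP string'  # anything beyond the Basic Multilingual Plane


for sample in ['header', 'caf\u00e9', '\u20ac 5', '\U0001F600']:
    print(repr(sample), classify(sample))
```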

ChrisA


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Steven D'Aprano writes:

  then the name is horribly misleading, and it is best handled like this:
  
  content = '\n'.join([
  'header',
  'part 2 %.3f' % number,
  binary_image_data.decode('latin-1'),
  utf16_string,  # Misleading name, actually Unicode string
  'trailer'])

This loses bigtime, as any encoding that can handle non-latin1 in
utf16_string will corrupt binary_image_data.  OTOH, latin1 will raise
on non-latin1 characters.  utf16_string must be encoded appropriately
then decoded by latin1 to be reencoded by latin1 on output.

  On the other hand, if it is actually a bytes object which is the product 
  of UTF-16 encoding, i.e.:
  
  type(utf16_string)
  => returns bytes
  
  and those bytes were generated by some text.encode('utf-16'), then it 
  is already binary data and needs to be smuggled into the text string. 
  Latin-1 is good for that:
  
  content = '\n'.join([
  'header',
  'part 2 %.3f' % number,
  binary_image_data.decode('latin-1'),
  utf16_string.decode('latin-1'),
  'trailer'])
  
  
  Both examples assume that you intend to do further processing of content 
  before sending it, and will encode just before sending:
  
  content.encode('utf-8')
  
  (Don't use Latin-1, since it cannot handle the full range of text 
  characters.)

This corrupts binary_image_data.  Each byte > 127 will be replaced by
two bytes.  In the second case, you can use latin1 to encode, if it
gives you what you want.

This kind of subtlety is precisely why MAL warned about use of latin1
to smuggle bytes.
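
The corruption is easy to reproduce. In this sketch the four sample bytes stand in for binary_image_data:

```python
binary_image_data = bytes([0x00, 0x7F, 0x80, 0xFF])

# Smuggle the bytes into a str via Latin-1, as in the quoted example.
smuggled = binary_image_data.decode('latin-1')

as_utf8 = smuggled.encode('utf-8')      # bytes above 0x7F turn into two bytes each
as_latin1 = smuggled.encode('latin-1')  # the original bytes come back unchanged

print(as_utf8)
print(as_latin1)
```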



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 02:31 PM, Stephen J. Turnbull wrote:


This corrupts binary_image_data.  Each byte > 127 will be replaced by
two bytes.  In the second case, you can use latin1 to encode, if it
gives you what you want.

This kind of subtlety is precisely why MAL warned about use of latin1
to smuggle bytes.


And why I've been fighting Steven D'Aprano on it.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Steven D'Aprano
On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:
 Steven D'Aprano writes:
 
   then the name is horribly misleading, and it is best handled like this:
   
   content = '\n'.join([
   'header',
   'part 2 %.3f' % number,
   binary_image_data.decode('latin-1'),
   utf16_string,  # Misleading name, actually Unicode string
   'trailer'])
 
 This loses bigtime, as any encoding that can handle non-latin1 in
 utf16_string will corrupt binary_image_data.  OTOH, latin1 will raise
 on non-latin1 characters.  utf16_string must be encoded appropriately
 then decoded by latin1 to be reencoded by latin1 on output.

Of course you're right, but I have understood the above as being a 
sketch and not real code. (E.g. does "header" really mean the literal 
string "header", or does it stand in for something which is a header?) 
In real code, one would need to have some way of telling where the 
binary image data ends and the Unicode string begins.

If I have misunderstood the situation, then my apologies for compounding 
the error.


[...]
   Both examples assume that you intend to do further processing of content 
   before sending it, and will encode just before sending:
   
   content.encode('utf-8')
   
   (Don't use Latin-1, since it cannot handle the full range of text 
   characters.)
 
 This corrupts binary_image_data.  Each byte > 127 will be replaced by
 two bytes.

And reading it back using decode('utf-8') will replace those two bytes 
with a single byte, round-tripping exactly.

Of course if you encode to UTF-8 and then try to read the binary data as 
raw bytes, you'll get corrupted data. But do people expect to do this? 
That's a genuine question -- again, I assumed (apparently wrongly) that 
the idea was to write the content out as *text* containing smuggled 
bytes, and read it back the same way.
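
Under that assumption, where the reader decodes as UTF-8 text and then undoes the Latin-1 smuggling, the round trip does hold. A small sketch with an invented payload:

```python
payload = bytes([0x89, 0xFF, 0x00, 0x41])

# Writer: smuggle via Latin-1, then encode the whole text as UTF-8.
written = payload.decode('latin-1').encode('utf-8')

# Reader: decode as UTF-8 text, then recover the smuggled bytes via Latin-1.
recovered = written.decode('utf-8').encode('latin-1')

print(written != payload, recovered == payload)
```

A consumer that reads `written` as raw bytes sees different data, which is the corruption described in the quoted message.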


 In the second case, you can use latin1 to encode, if it
 gives you what you want.
 
 This kind of subtlety is precisely why MAL warned about use of latin1
 to smuggle bytes.

How would you smuggle a chunk of arbitrary bytes into a text string? 
Short of doing something like uuencoding it into ASCII, or equivalent.


-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Ethan Furman writes:

   This kind of subtlety is precisely why MAL warned about use of latin1
   to smuggle bytes.
  
  And why I've been fighting Steven D'Aprano on it.

No, I think you haven't been fighting Steven d'A on it.  You're
talking about parsing and generating structured binary files, he's
talking about techniques for parsing and generating streams with no
real structure above the byte or encoded character level.

Of course you can implement the former with the latter using Python 3
str, but it's ugly, maybe even painful if you need to encode binary
blobs back to binary to process them.  (More discussion in my other
post, although I suspect you're not going to be terribly happy with
that, either. ;-)

This generally *is not* the case for the wire protocol guys.  AFAICT
they really do want to process things as streams of ASCII-compatible
text, with the non-ASCII stuff treated as runs of uninterpreted bytes
that are just passed through.

So when you talk about "we", I suspect you are not the "we" everybody
else is arguing with.  In particular, AIUI your use case is not
included in the use cases most of us -- including Steven -- are
thinking about.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 04:02 PM, Stephen J. Turnbull wrote:


So when you talk about "we", I suspect you are not the "we" everybody
else is arguing with.  In particular, AIUI your use case is not
included in the use cases most of us -- including Steven -- are
thinking about.


Ah, so even in the minority I'm in the minority.  :/  The "we" I am usually referring to are those of us who have to 
deal with the mixed ASCII/binary/encoded text files (a couple have spoken up about PDFs, and I have DBF).


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Steven D'Aprano writes:

  Of course you're right, but I have understood the above as being a 
  sketch and not real code. (E.g. does "header" really mean the literal 
  string "header", or does it stand in for something which is a header?) 
  In real code, one would need to have some way of telling where the 
  binary image data ends and the Unicode string begins.

Sure, but I think in Ethan's case it's probably out of band.  I have
been assuming out of band.

   This corrupts binary_image_data.  Each byte > 127 will be replaced by
   two bytes.
  
  And reading it back using decode('utf-8') will replace those two bytes 
  with a single byte, round-tripping exactly.

True, but I'm assuming Ethan himself didn't choose DBF format.

  Of course if you encode to UTF-8 and then try to read the binary data as 
  raw bytes, you'll get corrupted data. But do people expect to do this? 

People?  Real People use Python, they wouldn't do that. :-)  But the
app that forced Ethan to deal with DBF might.

   This kind of subtlety is precisely why MAL warned about use of latin1
   to smuggle bytes.
  
  How would you smuggle a chunk of arbitrary bytes into a text string? 
  Short of doing something like uuencoding it into ASCII, or
  equivalent.

Arbitrary bytes as a chunk?  I wouldn't do that, probably (see below),
and it's not possible in Python 3 at present (in str ASCII codes
always represent the corresponding ASCII character, they are never
uninterpreted bytes).

But if I know where the bytes are going to be in the str, I'd use
latin1 or (encoding='ascii', errors='surrogateescape') depending on
how well-controlled the processing is.  If I really own those bytes,
I might use latin1, and just forget all of the string-processing
functions that care about character identity (eg, case manipulation).
If the bytes might somehow end up leaking into the rest of the
program, I'd use surrogateescape and live with the doubled space usage.
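
The trade-off between the two smuggling techniques can be sketched as follows. The single sample byte is arbitrary; the behaviour shown is that of the standard codecs and str.upper().

```python
blob = b'\xe9'

via_latin1 = blob.decode('latin-1')                          # looks like real text
via_escape = blob.decode('ascii', errors='surrogateescape')  # a lone surrogate

# Latin-1: string functions that care about character identity silently
# change the smuggled byte (0xE9 becomes 0xC9 after upper()).
latin1_after_upper = via_latin1.upper().encode('latin-1')

# surrogateescape: case manipulation leaves the escaped byte alone...
escape_after_upper = via_escape.upper().encode('ascii', errors='surrogateescape')

# ...and a leak into ordinary text output is caught instead of passing silently.
try:
    via_escape.encode('utf-8')
    leak_detected = False
except UnicodeEncodeError:
    leak_detected = True

print(latin1_after_upper, escape_after_upper, leak_detected)
```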

But really, if it's not a wire-to-wire protocol kind of thing, I'd go
ahead and create a proper model for the data, and text would be text,
and chunks of arbitrary bytes would be bytes and integers would be
integers



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 11 January 2014 08:58, Ethan Furman et...@stoneleaf.us wrote:
 On 01/10/2014 02:42 PM, Antoine Pitrou wrote:

 On Fri, 10 Jan 2014 17:33:57 -0500
 Eric V. Smith e...@trueblade.com wrote:

 On 1/10/2014 5:29 PM, Antoine Pitrou wrote:

 On Fri, 10 Jan 2014 12:56:19 -0500
 Eric V. Smith e...@trueblade.com wrote:


 I agree. I don't see any reason to exclude int and float. See Guido's
 messages http://bugs.python.org/issue3982#msg180423 and
 http://bugs.python.org/issue3982#msg180430 for some justification and
 discussion.


 If you are representing int and float, you're really formatting a text
 message, not bytes. Basically if you allow the formatting of int and
 float instances, there's no reason not to allow the formatting of
 arbitrary objects through __str__. It doesn't make sense to
 special-case those two types and nothing else.


 It might not for .format(), but I'm not convinced. But for %-formatting,
 str is already special-cased for these types.


 That's not what I'm saying. str.__mod__ is able to represent all kinds
 of types through %s and calling __str__. It doesn't make sense for
 bytes.__mod__ to only support int and float. Why only them?


 Because embedding the ASCII equivalent of ints and floats in byte streams is
 a common operation?

It's emphatically *NOT* a binary interpolation operation though - the
binary representation of the integer 1 is the byte value 1, not the
byte value 49. If you want the byte value 49 to appear in the stream,
then you need to interpolate the *ASCII encoding* of the string "1",
not the integer 1.
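
The difference between the two representations can be checked directly. The use of struct.pack with a little-endian 32-bit format is just one example of a binary representation and is an assumption, not something from the thread.

```python
import struct

binary_one = bytes([1])             # the byte value 1
ascii_one = str(1).encode('ascii')  # the ASCII encoding of the string "1"
packed_one = struct.pack('<I', 1)   # a 4-byte little-endian binary integer

print(binary_one, ascii_one, ascii_one[0], packed_one)
```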

If you want to manipulate text representations, do it in the text
domain. If you want to manipulate binary representations, do it in the
binary domain. The *whole point* of the text model change in Python 3
is to force programmers to *decide* which domain they're operating in
at any given point in time - while the approach of blurring the
boundaries between the two can be convenient for wire protocol and
file format manipulation, it is a horrendous bug magnet everywhere
else.

PEP 460 is just about adding back some missing functionality in the
binary domain (interpolating binary sequences together), not about
bringing back the problematic text model that allows particular text
representations to be interpreted as if they were also binary data.

That said, I actually think there's a valid use case for a Python 3
type that allows the bytes/text boundary to be blurred in making it
easier to port certain kinds of Python 2 code to Python 3
(specifically, working with wire protocols and file formats that
contain a mixture of encodings, but all encodings are *known* to at
least be ASCII compatible). It is highly unlikely that such a type
will *ever* be part of the standard library, though - idiomatic Python
3 code shouldn't need it, affected Python 2 code *can* be ported
without it (but may look more complicated due to the use of explicit
decoding and encoding operations, rather than relying on implicit
ones), and it should be entirely possible to implement it as an
extension module (modulo one bug in CPython that may impact the
approach, but we won't know for sure until people actually try it
out).

Fortunately, after years of my suggesting the idea to almost everyone
that complained about the move away from the broken POSIX text model
in Python 3, Benno Rice has started experimenting with such a type
based on a preliminary test case I wrote at linux.conf.au last week:
https://github.com/jeamland/asciicompat/blob/master/tests/ncoghlan.py

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 11 January 2014 12:28, Ethan Furman et...@stoneleaf.us wrote:
 On 01/10/2014 06:04 PM, Antoine Pitrou wrote:

 On Fri, 10 Jan 2014 20:53:09 -0500
 Eric V. Smith e...@trueblade.com wrote:


 So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
 3982. See for example http://bugs.python.org/issue3982#msg180432 .


 Then we might as well not do anything, since any attempt to advance
 things is met by stubborn opposition in the name of not far enough.


 Heh, and here I thought it was stubborn opposition in the name of purity.
 ;)

No, it's "the POSIX text model is completely broken and we're not
letting people bring it back by stealth because they want to stuff
their esoteric use case back into the builtin data types instead of
writing their own dedicated type now that the builtin types don't
handle it any more".

Yes, we know we changed the text model and knocked wire protocols off
their favoured perch, and we're (thoroughly) aware of the fact that
wire protocol developers don't like the fact that the default model
now strongly favours the vastly more common case of application
development.

However, until Benno volunteered to start experimenting with
implementing an asciistr type yesterday, there have been *zero*
meaningful attempts at trying to solve the issues with wire protocol
manipulation outside the Python 3 core - instead there has just been a
litany of whining that Python 3 is different from Python 2, and a
complete and total refusal to attempt to understand *why* we changed
the text model.

The answer *should* be obvious: the POSIX based text model in Python 2
makes web frameworks easier to write at the expense of making web
applications *harder* to write, and the same is true for every other
domain where the wire protocol and file format handling is isolated to
widely used frameworks and support libraries, with the application
code itself operating mostly on text and structured data. With the
Python 3 text model, we decided that was a terrible trade-off, so the
core text model now *strongly* favours application code.

This means that it is now *boundary* code that may need additional helper
types, because the core types aren't necessarily going to cover all
those use cases any more. In particular, the bytes type is, and always
will be, designed for pure binary manipulation, while the str type is
designed for text manipulation. The weird kinda-text-kinda-binary
8-bit builtin type is gone, and *deliberately* so.

I've been saying for years that people should experiment with creating
a Python 3 extension type that behaves more like the Python 2 str type.
For the standard library, we've never hit a case where the explicit
encoding and decoding was so complicated that creating such a type
seemed simpler, so *we're* not
going to do it. After discussing it with me at LCA, Benno Rice offered
to try out the idea, just to determine whether or not it was actually
possible. If there are any CPython bugs that mean the idea *doesn't*
currently work (such as interoperability issues in the core types),
then I'm certainly happy for us to fix *those*. But we're never ever
going to change the core text model back to the broken POSIX one, or
even take steps in that direction.

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Stephen Hansen
For not caring much, your own stubbornness is quite notable throughout this
discussion. Stones and glass houses. :)

That said:

Twisted and Mercurial aren't the only ones who are hurt by this, at all.
I'm aware of at least two other projects who are actively hindered in their
support or migration to Python 3 by the bytes type not having some basic
functionality that strings had in 2.0.

The purity crowd in here has brought up that it was an important and
serious decision to split Text from Bytes in Py3, and I actually agree with
that. However, it is missing some very real and very concrete use-cases --
there are multiple situations where there are byte streams which have a
known text-subset which they really, really do need to operate on.

There's been a number of examples given: PDF, HTTP, network streams that
switch inline from text-ish to binary and back-again.. But, we can focus
that down to a very narrow and not at all uncommon situation in the latter.

Look at the HTTP Content-Length header. HTTP headers are fuzzy. My
understanding is, per the RFCs, their body can be arbitrary octets to the
exclusion of line feeds and DELs-- my understanding may be a bit off here,
and please feel free to correct me -- but the relevant specifications are a
bit fuzzy to begin with.

To my understanding of the spec, the header field name is essentially an
ASCII text field (sans separator), and the body is... anything, or nearly
anything. This is HTTP, which is surely one of the most used protocols in
the world.

The need to be able to assemble and disassemble such streams is a
real, valid use-case.

But looking at it, now look to the Content-Length header I mentioned. It
seems those who are declaring a purity priority in bytes/string separation
think it reasonable to do things like:

  headers.append(b"Content-Length: " + ("%d" % len(content)).encode("ascii"))

Or something. In the middle of processing a stream, you need to convert
this number into a string then encode it into bytes to just represent the
number as the extremely common, widely-accessible 7-bit ascii subset of its
numerical value. This isn't some rare, grandiose or fiendish undertaking,
or trying to merge Strings and Bytes back together: this is the simple
practical recognition that representing a number as its ascii-numerical
value is actually not at all uncommon.
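
For illustration, the explicit Python 3 spelling of that header line can be sketched as follows. The names `headers` and `content` come from the snippet above; the payload value is made up for the example, and this is only one possible way to write it.

```python
# Hypothetical payload; any bytes object would do.
content = b"hello, world"

headers = []
# Format the length as text first, then encode it explicitly as ASCII.
length_field = ("%d" % len(content)).encode("ascii")
headers.append(b"Content-Length: " + length_field)

print(headers)
```

The two-step "format as str, then encode" dance is the part the message objects to; the result itself is plain ASCII bytes.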

This position seems utterly astonishing in its ridiculousness to me. The
recognition that the number 123 may be represented as b"123" surprises me
as a controversial thing, considering how often I see it in real life.

There is a LOT of code out there which needs a little bit of a middle
ground between bytes and strings; it doesn't mean you are giving way and
allowing strings and bytes to merge and giving up on the Edict of
Separation. But there are real world use-cases where you simply need to be
able to do many basic String like operations on byte-streams.

The removal of the ability to use interpolation to construct such byte
strings was a major regression in Python 3 and is a big hurdle for more
than a few projects to upgrade.

I mean, it's not like the bytes type lacks knowledge of the subset of
bytes that happen to be 7-bit ascii-compatible and can't perform text-ish
operations on them--

  Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32
  bit (Intel)] on win32
  Type "help", "copyright", "credits" or "license" for more information.
  >>> b"stephen hansen".title()
  b'Stephen Hansen'

How is this not a practical recognition that yes, while bytes are byte
streams and not text, a huge subset of bytes are text-y, and as long as we
maintain the barrier between higher characters and implicit conversion
therein, we're fine?
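
A small sketch of the behaviour being described: the text-like methods on bytes only act on the ASCII range and leave every other byte value alone. The sample values are made up for illustration.

```python
name = b"stephen hansen"
titled = name.title()        # ASCII letters are case-mapped

mixed = b"caf\xe9".upper()   # 0xE9 is outside ASCII and is left untouched

print(titled)
print(mixed)
```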

I don't see the difference here. There is a very real, practical need to
interpolate bytes. This very real, practical need includes the very real
recognition that converting 12345 to b'12345' is not something weird,
unusual, and subject to the thorny issues of Encodings. It is not violating
the doctrine of separation of powers between Text and Bytes.

Personally, I won't be converting my day job's codebase to Python 3 anytime
soon (where 'soon' is defined as 'within five years', assuming a best-case
scenario in which a number of third-party issues are resolved). But! I'm aware
of and involved with other projects, and this has bitten two of them
specifically. I'm sure there are others who are not aware of this list or
don't feel comfortable talking on it (as it is, I encouraged one of the
projects' coders to speak up, but they thought the question was a lost one
due to previous responses on the original issue ticket and gave up).

On Fri, Jan 10, 2014 at 6:04 PM, Antoine Pitrou solip...@pitrou.net wrote:

 On Fri, 10 Jan 2014 20:53:09 -0500
 Eric V. Smith e...@trueblade.com wrote:
 
  So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
  3982. See for example http://bugs.python.org/issue3982#msg180432 .

 Then we might as well not do anything, since any 

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Glenn Linderman

On 1/11/2014 1:44 AM, Stephen Hansen wrote:
There's been a number of examples given: PDF, HTTP, network streams 
that switch inline from text-ish to binary and back-again.. But, we 
can focus that down to a very narrow and not at all uncommon situation 
in the latter.


PDF has been mentioned a few times.  ReportLAB recently decided to 
convert to Python 3, and fairly quickly (from my perspective, it took 
them a _long_ time to decide to port, but once they decided to, then it 
seemed quick) produced an alpha version that passes many of their tests. 
I've not tried it yet, although it interests me, as I have some Python 2 
code written only because ReportLAB didn't support Python 3, and I 
wanted to generate some PDF files. I'll be glad to get rid of the Python 
2 code, once they are released.


But I guess they figured out a solution that wasn't onerous, I'd have to 
go re-read the threads to be sure, but it seems they are running one 
code base for both... not sure of the details of what techniques they 
used, or if they ever used the % operator :)


But I'm wondering, since they did what they did so quickly, if the 
mixed bytes and str use case is mostly, in fact, a mind-set issue... 
yes, likely some code has to change, but maybe the changes really aren't 
all that significant.


I wouldn't want to drag them into this discussion, I'd rather they get 
the port complete, but it would be interesting to know what they did, 
and how they did it, and what problems they had, etc. If anyone here 
knows that code a bit, perhaps the diffs could be examined in their 
repository to figure out what they did, and how much it impacted their 
code. I do know they switched XML parsers along the way, as well as 
dealing with string handling differences.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Kristján Valur Jónsson
I don't know what the fuss is about.  This isn't about breaking the text model.
It's about a convenient way to turn text into bytes in a default, lenient
way.  Not the other way round.
Here's my proposal:

b'foo%sbar' % (a)

would implicitly apply the equivalent of the following function to every
object in the tuple:

def coerce_ascii(o):
    if has_bytes_interface(o):
        return o
    return o.encode('ascii', 'strict')

There's no need for special %d or %f formatting.  If more fanciful formatting
is required, e.g. exponents or precision, then by all means, do it in the
str domain:

b'foo%sbar' % ("%.15f" % (42.2,))

Basically, let's just support simple bytes interpolation that will support
coercing into bytes by means of strict ascii.
It's a one way convenience, explicitly requested, and for consenting adults.
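
A minimal sketch of the coercion rule proposed above, written as ordinary Python rather than as a change to the `%` operator. `has_bytes_interface` is not a real built-in; here it is assumed to mean "is one of the built-in binary types", and `interpolate` is a hypothetical helper that stands in for the implicit behaviour.

```python
def has_bytes_interface(o):
    # Assumption: approximate "bytes interface" by the built-in binary types.
    return isinstance(o, (bytes, bytearray, memoryview))


def coerce_ascii(o):
    # Binary objects pass through (copied to bytes); text must be pure ASCII.
    if has_bytes_interface(o):
        return bytes(o)
    return o.encode("ascii", "strict")


def interpolate(parts):
    # Hypothetical stand-in for b'...%s...' % args: coerce, then join.
    return b"".join(coerce_ascii(p) for p in parts)


# Fancy formatting is done in the str domain, then coerced.
result = interpolate([b"foo", "%.15f" % 42.2, b"bar"])
print(result)

try:
    coerce_ascii("\xe9")          # non-ASCII text is rejected
except UnicodeEncodeError:
    rejected = True
else:
    rejected = False
```

Because the encode step is strict, non-ASCII text fails loudly instead of being silently mangled.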


-Original Message-
From: Python-Dev [mailto:python-dev-bounces+kristjan=ccpgames@python.org] 
On Behalf Of Nick Coghlan
Sent: 11. janúar 2014 08:43
To: Ethan Furman
Cc: python-dev@python.org
Subject: Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) 
to Python 3.5

No, it's "the POSIX text model is completely broken and we're not letting
people bring it back by stealth because they want to stuff their esoteric use
case back into the builtin data types instead of writing their own dedicated
type now that the builtin types don't handle it any more".




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson c...@zip.com.au wrote:


 Hi Juraj,


Hello Cameron.


   data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )


Thanks for the suggestion! The problem with bytify is that some items
might require different formatting than other items. For example, in the
Cross-Reference Table there are three different formats: a non-padded
integer ("1"), and 10- and 5-digit integers ("0000000003", "65535").
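
To make the formatting point concrete, here is a hedged sketch of building one cross-reference entry with today's tools: format in the str domain, then encode as ASCII. It assumes the usual 20-byte entry layout (10-digit offset, 5-digit generation number, a one-letter keyword, and a two-byte line ending); `xref_entry` is a hypothetical helper name.

```python
def xref_entry(offset, generation, keyword):
    # Zero-padded offset and generation number, then 'n' or 'f', then EOL.
    text = "%010d %05d %s \n" % (offset, generation, keyword)
    return text.encode("ascii")


entry = xref_entry(3, 65535, "f")
object_number = ("%d" % 1).encode("ascii")   # the non-padded case

print(entry)
print(object_number)
```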


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano st...@pearwood.info wrote:


 I'm sorry, I don't understand what you mean here. I'm honestly not
 trying to be difficult, but you sound confident that you understand what
 you are doing, but your description doesn't make sense to me. To me, it
 looks like you are conflating bytes and ASCII characters, that is,
 assuming that characters are in some sense identical to their ASCII
 representation. Let me explain:

 The integer that in English is written as "100" is represented in memory
 as bytes 0x0064 (assuming a big-endian C short), so when you say "an
 integer is written down AS-IS" (emphasis added), to me that says that
 the PDF file includes the bytes 0x0064. But then you go on to write the
 three character string "100", which (assuming ASCII) is the bytes
 0x313030. Going from the C short to the ASCII representation 0x313030 is
 nothing like inserting the int "as-is". To put it another way, the
 Python 2 '%d' format code does not just copy bytes.


Sorry, I should've included an example: when I said "as-is" I meant "1",
"0", "0", so that would be your 0x313030.
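
The two representations being contrasted can be sketched side by side. This is only an illustration; the big-endian short format `">h"` is chosen to match the "big-endian C short" assumption in the quoted text.

```python
import struct

n = 100

packed = struct.pack(">h", n)          # binary form: big-endian C short
digits = str(n).encode("ascii")        # textual form: the characters 1, 0, 0

print(packed.hex())
print(digits.hex())
```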


 If you consider PDF as binary with occasional pieces of ASCII text, then
 working with bytes makes sense. But I wonder whether it might be better
 to consider PDF as mostly text with some binary bytes. Even though the
 bulk of the PDF will be binary, the interesting bits are text. E.g. your
 example:

 Even though the binary image data is probably much, much larger in
 length than the text shown above, it's (probably) trivial to deal with:
 convert your image data into bytes, decode those bytes into Latin-1,
 then concatenate the Latin-1 string into the text above.


This is similar to what Chris Barker suggested. I'm not trying to be
difficult here either, but please explain one thing to me. Treating bytes as
if they were Latin-1 is a bad idea, that's why %f got dropped in the first
place, right? How is it then all right to put an image inside a Unicode
string?

Also, apart from the in/out conversions, do any other difficulties come to
your mind?

Please also take note that in Python 3.3 and better, the internal
 representation of Unicode strings containing only code points up to 255
 (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte
 per character.


I guess you meant [C]Python...

In any case, thanks for the detailed reply.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Georg Brandl
On 11.01.2014 09:43, Nick Coghlan wrote:
 On 11 January 2014 12:28, Ethan Furman et...@stoneleaf.us wrote:
 On 01/10/2014 06:04 PM, Antoine Pitrou wrote:

 On Fri, 10 Jan 2014 20:53:09 -0500
 Eric V. Smith e...@trueblade.com wrote:


 So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
 3982. See for example http://bugs.python.org/issue3982#msg180432 .


 Then we might as well not do anything, since any attempt to advance
 things is met by stubborn opposition in the name of "not far enough".


 Heh, and here I thought it was stubborn opposition in the name of purity.
 ;)
 
 No, it's "the POSIX text model is completely broken and we're not
 letting people bring it back by stealth because they want to stuff
 their esoteric use case back into the builtin data types instead of
 writing their own dedicated type now that the builtin types don't
 handle it any more".
 
 Yes, we know we changed the text model and knocked wire protocols off
 their favoured perch, and we're (thoroughly) aware of the fact that
 wire protocol developers don't like the fact that the default model
 now strongly favours the vastly more common case of application
 development.
 
 However, until Benno volunteered to start experimenting with
 implementing an asciistr type yesterday, there have been *zero*
 meaningful attempts at trying to solve the issues with wire protocol
 manipulation outside the Python 3 core

Can we please also include pseudo-binary file formats?  It's not just
wire protocols.

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Georg Brandl
On 11.01.2014 10:44, Stephen Hansen wrote:

 I mean, its not like the bytes type lacks knowledge of the subset of bytes
 that happen to be 7-bit ascii-compatible and can't perform text-ish operations
 on them--
 
   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit
   (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> b"stephen hansen".title()
   b'Stephen Hansen'
 
 How is this not a practical recognition that yes, while bytes are byte streams
 and not text, a huge subset of bytes are text-y, and as long as we maintain 
 the
 barrier between higher characters and implicit conversion therein, we're fine?
 
 I don't see the difference here. There is a very real, practical need to
 interpolate bytes. This very real, practical need includes the very real
 recognition that converting 12345 to b'12345' is not something weird, unusual,
 and subject to the thorny issues of Encodings. It is not violating the 
 doctrine
 of separation of powers between Text and Bytes.

This. Exactly. Thanks for putting it so nicely, Stephen.

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Georg Brandl
On 11.01.2014 14:49, Georg Brandl wrote:
 On 11.01.2014 10:44, Stephen Hansen wrote:
 
 I mean, its not like the bytes type lacks knowledge of the subset of bytes
 that happen to be 7-bit ascii-compatible and can't perform text-ish 
 operations
 on them--
 
   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32
   bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> b"stephen hansen".title()
   b'Stephen Hansen'
 
 How is this not a practical recognition that yes, while bytes are byte 
 streams
 and not text, a huge subset of bytes are text-y, and as long as we maintain 
 the
 barrier between higher characters and implicit conversion therein, we're 
 fine?
 
 I don't see the difference here. There is a very real, practical need to
 interpolate bytes. This very real, practical need includes the very real
 recognition that converting 12345 to b'12345' is not something weird, 
 unusual,
 and subject to the thorny issues of Encodings. It is not violating the 
 doctrine
 of separation of powers between Text and Bytes.
 
 This. Exactly. Thanks for putting it so nicely, Stephen.

To elaborate: if the bytes type didn't have all this ASCII-aware functionality
already, I think we would have (and be using) a dedicated asciistr type right
now.  But it has the functionality, and it's way too late to remove it.

Georg




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread M.-A. Lemburg
On 11.01.2014 14:54, Georg Brandl wrote:
 On 11.01.2014 14:49, Georg Brandl wrote:
 On 11.01.2014 10:44, Stephen Hansen wrote:

 I mean, its not like the bytes type lacks knowledge of the subset of bytes
 that happen to be 7-bit ascii-compatible and can't perform text-ish 
 operations
 on them--

   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32
   bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> b"stephen hansen".title()
   b'Stephen Hansen'

 How is this not a practical recognition that yes, while bytes are byte 
 streams
 and not text, a huge subset of bytes are text-y, and as long as we maintain 
 the
 barrier between higher characters and implicit conversion therein, we're 
 fine?

 I don't see the difference here. There is a very real, practical need to
 interpolate bytes. This very real, practical need includes the very real
 recognition that converting 12345 to b'12345' is not something weird, 
 unusual,
 and subject to the thorny issues of Encodings. It is not violating the 
 doctrine
 of separation of powers between Text and Bytes.

 This. Exactly. Thanks for putting it so nicely, Stephen.
 
 To elaborate: if the bytes type didn't have all this ASCII-aware functionality
 already, I think we would have (and be using) a dedicated asciistr type 
 right
 now.  But it has the functionality, and it's way too late to remove it.

I think we need to step back a little from the purist view
of things and give more emphasis on the "practicality beats
purity" Zen.

I completely agree with Stephen, that bytes are in fact often
an encoding of text. If that text is ASCII compatible, I don't
see any reason why we should not continue to expose the C lib
standard string APIs available for text manipulations on bytes.

We don't have to be pedantic about the bytes/text separation.
It doesn't help in real life.

If you give programmers the choice they will - most of the time -
do the right thing. If you don't give them the tools, they'll
work around the missing features in a gazillion different
ways of which many will probably miss a few edge cases.

bytes already have most of the 8-bit string methods from Python 2,
so it doesn't hurt adding some more of the missing features
from Python 2 on top to make life easier for people dealing
with multiple/unknown encoding data.

BTW: I don't know why so many people keep asking for use cases.
Isn't it obvious that text data without known (but ASCII compatible)
encoding or multiple different encodings in a single data chunk
is part of life ? Most HTTP packets fall into this category,
many email messages as well. And let's not forget that we don't
live in a perfect world. Broken encodings are everywhere around
you - just have a look at your spam folder for a decent chunk
of example data :-)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Antoine Pitrou
On Sat, 11 Jan 2014 08:26:57 +0100
Georg Brandl g.bra...@gmx.net wrote:
 On 11.01.2014 03:04, Antoine Pitrou wrote:
  On Fri, 10 Jan 2014 20:53:09 -0500
  Eric V. Smith e...@trueblade.com wrote:
  
  So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
  3982. See for example http://bugs.python.org/issue3982#msg180432 .
 
 I agree.
 
  Then we might as well not do anything, since any attempt to advance
  things is met by stubborn opposition in the name of "not far enough".
  
  (I don't care much personally, I think the issue is quite overblown
  anyway)
 
 So you wouldn't mind another overhaul of the PEP including a bit more
 functionality again? :)  I really think that practicality beats purity
 here.  (I'm not advocating free mixing bytes and str, mind you!)

The PEP already proposes a certain amount of practicality. I personally
*would* mind adding %d and friends to it. But of course someone can
fork the PEP or write another one.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 January 2014 01:15, M.-A. Lemburg m...@egenix.com wrote:
 On 11.01.2014 14:54, Georg Brandl wrote:
 On 11.01.2014 14:49, Georg Brandl wrote:
 On 11.01.2014 10:44, Stephen Hansen wrote:

 I mean, its not like the bytes type lacks knowledge of the subset of 
 bytes
 that happen to be 7-bit ascii-compatible and can't perform text-ish 
 operations
 on them--

   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32
   bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> b"stephen hansen".title()
   b'Stephen Hansen'

 How is this not a practical recognition that yes, while bytes are byte 
 streams
 and not text, a huge subset of bytes are text-y, and as long as we 
 maintain the
 barrier between higher characters and implicit conversion therein, we're 
 fine?

 I don't see the difference here. There is a very real, practical need to
 interpolate bytes. This very real, practical need includes the very real
 recognition that converting 12345 to b'12345' is not something weird, 
 unusual,
 and subject to the thorny issues of Encodings. It is not violating the 
 doctrine
 of separation of powers between Text and Bytes.

 This. Exactly. Thanks for putting it so nicely, Stephen.

 To elaborate: if the bytes type didn't have all this ASCII-aware 
 functionality
 already, I think we would have (and be using) a dedicated asciistr type 
 right
 now.  But it has the functionality, and it's way too late to remove it.

 I think we need to step back a little from the purist view
 of things and give more emphasis on the practicality beats
 purity Zen.

 I complete agree with Stephen, that bytes are in fact often
 an encoding of text. If that text is ASCII compatible, I don't
 see any reason why we should not continue to expose the C lib
 standard string APIs available for text manipulations on bytes.

 We don't have to be pedantic about the bytes/text separation.
 It doesn't help in real life.

Yes, it bloody well does. The number of people who have told me that
using Python 3 is what allowed them to finally understand how Unicode
works vastly exceeds the number of wire protocol and file format devs
that have complained about working with binary formats being
significantly less tolerant of the "it's really like ASCII text"
mindset.

We are NOT going back to the confusing incoherent mess that is the
Python 2 model of bolting Unicode onto the side of POSIX:
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

While that was an *expedient* (and, in fact, necessary) solution at
the time, the fact it is still thoroughly confusing people 13 years
later shows it is not a *comprehensible* solution.

 If you give programmers the choice they will - most of the time -
 do the right thing. If you don't give them the tools, they'll
 work around the missing features in a gazillion different
 ways of which many will probably miss a few edge cases.

 bytes already have most of the 8-bit string methods from Python 2,
 so it doesn't hurt adding some more of the missing features
 from Python 2 on top to make life easier for people dealing
 with multiple/unknown encoding data.

Because people that aren't happy with the current bytes type
persistently refuse to experiment with writing their own extension
type to figure out what the API should look like. Jamming speculative
API design into the core text model without experimenting in a third
party extension first is a straight up stupid idea.

Anyone that is pushing for this should be checking out Benno's first
draft experimental prototype for asciistr and be working on getting it
passing the test suite I created:
https://github.com/jeamland/asciicompat

The "Wah, you broke it and now I have completely forgotten how to
create custom types, so I'm just going to piss and moan until somebody
else fixes it" infantilism of the past five years in this regard has
frankly pissed me off.

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:
 On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano st...@pearwood.info wrote:

  If you consider PDF as binary with occasional pieces of ASCII text, then
  working with bytes makes sense. But I wonder whether it might be better
  to consider PDF as mostly text with some binary bytes. Even though the
  bulk of the PDF will be binary, the interesting bits are text. E.g. your
  example:

10 0 obj
   << /Type /XObject
      /Width 100
      /Height 100
      /Alternates 15 0 R
      /Length 2167
   >>
stream
...binary image data...
endstream
endobj


  Even though the binary image data is probably much, much larger in
  length than the text shown above, it's (probably) trivial to deal with:
  convert your image data into bytes, decode those bytes into Latin-1,
  then concatenate the Latin-1 string into the text above.
 
 This is similar to what Chris Barker suggested. I also don't try to be
 difficult here but please explain to me one thing. To treat bytes as if
 they were Latin-1 is bad idea, 

Correct. Bytes are not Latin-1. Here are some bytes which represent a 
word I extracted from a text file on my computer: 

b'\x8a\x75\xa7\x65\x72\x73\x74'

If you imagine that they are Latin-1, you might think that the word 
is a C1 control character (VTS, or Vertical Tabulation Set) followed 
by u§erst, but it is not. It is actually the German word äußerst 
(extremely), and the text file was generated on a 1990s vintage 
Macintosh using the MacRoman extended ASCII code page.
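
The point can be checked directly: the same seven bytes give different text depending on which codec is assumed. A short sketch using Python's `mac_roman` and `latin-1` codecs:

```python
raw = b'\x8a\x75\xa7\x65\x72\x73\x74'

as_mac_roman = raw.decode('mac_roman')   # the encoding the file was written in
as_latin1 = raw.decode('latin-1')        # a wrong guess that still "succeeds"

print(repr(as_mac_roman))
print(repr(as_latin1))
```

Decoding as Latin-1 never raises, which is exactly why it cannot tell you whether the guess was right.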


 that's why %f got dropped in the first
 place, right? How is it then alright to put an image inside an Unicode
 string?

The point that I am making is that many people want to add formatting 
operations to bytes so they can put ASCII strings inside bytes. But (as 
far as I can tell) they don't need to do this, because they can treat 
Unicode strings containing code points U+0000 through U+00FF (i.e. the 
same range as handled by Latin-1) as if they were bytes. This gives you:

- convenient syntax, no need to prefix strings with b;

- mostly avoid needing to decode and encode strings, except at a 
  few points in your code;

- the full set of string methods;

- can easily include arbitrary octal or hex byte values, using \ooo and
  \xhh escapes;

- error checking: when you finally encode the text to bytes before 
  writing to a file, or sending over a wire, any code-point greater 
  than U+00FF will give you an exception unless explicitly silenced.

No need to wait for Python 3.5 to come out, you can do this *right now*.

Of course, this is a little bit unclean, it breaks the separation of 
text and bytes by treating bytes *as if* they were Unicode code points, 
which they are not, but I believe that this is a practical technique 
which is not too hard to deal with. For instance, suppose I have a 
mixed format which consists of an ASCII tag, a number written in ASCII, 
a NULL separator, and some binary data:

# Using bytes
import struct
values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)
data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

=> gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'


That's a bit ugly, but not too ugly. I could write code like that. But 
if bytes had % formatting, I might write this instead:

data = b"Tag:%d\0%s" % (len(values), blob)


This is a small improvement, but I can't use it until Python 3.5 comes 
out. Or I could do this right now:


# Using text
values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)
data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

=> gives data = 'Tag:4\x00s\x14qÙyzió'

When I'm ready to transmit this over the wire, or write to disk, then I 
encode, and get:

data.encode('latin-1')
=> b'Tag:4\x00s\x14q\xd9yzi\xf3'


which is exactly the same as I got in the first place. In this case, I'm 
not using Latin-1 for the semantics of bytes to characters (e.g. byte 
\xf3 = char ó), but for the useful property that all 256 distinct bytes 
are valid in Latin-1. Any other encoding with the same property will do.
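
Putting the two routes side by side, the following sketch checks that the text-based approach produces the same bytes as the bytes-based one. It assumes Python 3 and big-endian 16-bit packing (`">h"`), which is what the output shown above implies; `as_bytes` and `as_text` are illustrative names.

```python
import struct

values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)

# Route 1: stay in bytes the whole time.
as_bytes = b"Tag:" + str(len(values)).encode("ascii") + b"\0" + blob

# Route 2: build a str, carrying the binary blob as Latin-1 code points,
# and encode once at the very end.
as_text = "Tag:%d\0%s" % (len(values), blob.decode("latin-1"))

assert as_text.encode("latin-1") == as_bytes
print(as_bytes)
```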

It is a little unfortunate that struct gives bytes rather than a str, 
but you can hide that with a simple helper function:

def b2s(bytes):
    return bytes.decode('latin1')

data = "Tag:%d\0%s" % (len(values), b2s(blob))



 Also, apart from the in/out conversions, do any other difficulties come to
 your mind?

No. If you accidentally introduce a non-Latin1 code point, when you 
encode you'll get an exception. 
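
The safety net is triggered at the final encode-to-Latin-1 step. A minimal sketch, assuming Python 3 (the euro sign is just an arbitrary example of a code point above U+00FF):

```python
# U+20AC EURO SIGN is outside the Latin-1 range U+0000..U+00FF.
data = "Tag:%d\0%s" % (1, "\u20ac")

try:
    data.encode("latin-1")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```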


-- 
Steven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Antoine Pitrou
On Sun, 12 Jan 2014 01:34:26 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
 Yes, it bloody well does. The number of people who have told me that
 using Python 3 is what allowed them to finally understand how Unicode
 works vastly exceeds the number of wire protocol and file format devs
 that have complained about working with binary formats being
 significantly less tolerant of the "it's really like ASCII text"
 mindset.

+1 to what Nick says. Forcing some constructs to be explicit leads
people to know about the issue and understand it, rather than sweep it
under the carpet as Python 2 encouraged them to do.

Yes, if you're dealing with a file format or network protocol, you'd
better know in which charset its textual information is being expressed.
It's a very sane question to ask yourself!

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 07:38 AM, Steven D'Aprano wrote:


The point that I am making is that many people want to add formatting
operations to bytes so they can put ASCII strings inside bytes. But (as
far as I can tell) they don't need to do this, because they can treat
Unicode strings containing code points U+0000 through U+00FF (i.e. the
same range as handled by Latin-1) as if they were bytes.


So instead of blurring the line between bytes and text, you're blurring the line between text and bytes (with a few 
extra seat belts thrown in).  Besides being a bit awkward, this also means that any encoded text (even the plain ASCII 
stuff) is now being transformed three times instead of one:


  unicode to bytes
  bytes to unicode using latin1
  unicode to bytes

Even if the cost of moving those bytes around is cheap, it's not free.  When you're creating hundreds of PDFs at a time 
that's going to make a difference.


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread M.-A. Lemburg
On 11.01.2014 16:34, Nick Coghlan wrote:
 On 12 January 2014 01:15, M.-A. Lemburg m...@egenix.com wrote:
 On 11.01.2014 14:54, Georg Brandl wrote:
 Am 11.01.2014 14:49, schrieb Georg Brandl:
 Am 11.01.2014 10:44, schrieb Stephen Hansen:

 I mean, its not like the bytes type lacks knowledge of the subset of 
 bytes
 that happen to be 7-bit ascii-compatible and can't perform text-ish 
 operations
 on them--

   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> b"stephen hansen".title()
   b'Stephen Hansen'

 How is this not a practical recognition that yes, while bytes are byte 
 streams
 and not text, a huge subset of bytes are text-y, and as long as we 
 maintain the
 barrier between higher characters and implicit conversion therein, we're 
 fine?

 I don't see the difference here. There is a very real, practical need to
 interpolate bytes. This very real, practical need includes the very real
 recognition that converting 12345 to b'12345' is not something weird, 
 unusual,
 and subject to the thorny issues of Encodings. It is not violating the 
 doctrine
 of separation of powers between Text and Bytes.

 This. Exactly. Thanks for putting it so nicely, Stephen.

 To elaborate: if the bytes type didn't have all this ASCII-aware 
 functionality
 already, I think we would have (and be using) a dedicated asciistr type 
 right
 now.  But it has the functionality, and it's way too late to remove it.

 I think we need to step back a little from the purist view
 of things and give more emphasis on the "practicality beats
 purity" Zen.

 I completely agree with Stephen, that bytes are in fact often
 an encoding of text. If that text is ASCII compatible, I don't
 see any reason why we should not continue to expose the C lib
 standard string APIs available for text manipulations on bytes.

 We don't have to be pedantic about the bytes/text separation.
 It doesn't help in real life.
 
 Yes, it bloody well does. The number of people who have told me that
 using Python 3 is what allowed them to finally understand how Unicode
 works vastly exceeds the number of wire protocol and file format devs
 that have complained about working with binary formats being
 significantly less tolerant of the "it's really like ASCII text"
 mindset.
 
 We are NOT going back to the confusing incoherent mess that is the
 Python 2 model of bolting Unicode onto the side of POSIX:
 http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3
 
 While that was an *expedient* (and, in fact, necessary) solution at
 the time, the fact it is still thoroughly confusing people 13 years
 later shows it is not a *comprehensible* solution.

FWIW: I quite liked the Python 2 model, but perhaps that's because
I already knew how Unicode works, so could use it to make my
life easier ;-)

Seriously, Unicode has always caused heated discussions and
I don't expect this to change in the next 5-10 years.

The point is: there is no 100% perfect solution either way and
when you acknowledge this, things don't look black and white anymore,
but instead full of colors :-)

Python 3 forces people to actually use Unicode; in Python 2 they
could easily avoid it. It's good to educate people on how it's
used and the issues you can run into, but let's not forget
that people are trying to get work done and we all love readable
code.

PEP 460 just adds two more methods to the bytes object which come
in handy when formatting binary data; I don't think it has potential
to muddy the Python 3 text model, given that the bytes
object already exposes a dozen of other ASCII text methods :-)

 If you give programmers the choice they will - most of the time -
 do the right thing. If you don't give them the tools, they'll
 work around the missing features in a gazillion different
 ways of which many will probably miss a few edge cases.

 bytes already have most of the 8-bit string methods from Python 2,
 so it doesn't hurt adding some more of the missing features
 from Python 2 on top to make life easier for people dealing
 with multiple/unknown encoding data.
 
 Because people that aren't happy with the current bytes type
 persistently refuse to experiment with writing their own extension
 type to figure out what the API should look like. Jamming speculative
 API design into the core text model without experimenting in a third
 party extension first is a straight up stupid idea.
 
 Anyone that is pushing for this should be checking out Benno's first
 draft experimental prototype for asciistr and be working on getting it
 passing the test suite I created:
 https://github.com/jeamland/asciicompat
 
 The "Wah, you broke it and now I have completely forgotten how to
 create custom types, so I'm just going to piss and moan until somebody
 else fixes it" infantilism of the past five years in 

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Matěj Cepl
On 2014-01-11, 10:56 GMT, you wrote:
 I don't know what the fuss is about.

I just cannot resist:

When you are calm while everybody else is in the state of 
panic, you haven’t understood the problem.

-- one of many collections of Murphy’s Laws

Matěj



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 12:43 AM, Nick Coghlan wrote:


In particular, the bytes type is, and always will be, designed for
pure binary manipulation [...]


I apologize for being blunt, but this is a lie.

Let's take a look at the methods defined by bytes:


dir(b'')
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', 
'__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', 
'__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', 
'__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 
'find', 'fromhex', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 
'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 
'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']


Are you really going to insist that expandtabs, isalnum, isalpha, isdigit, islower, isspace, istitle, isupper, ljust, 
lower, lstrip, rjust, splitlines, swapcase, title, upper, and zfill are pure binary manipulation methods?


Let's take a look at the repr of bytes:


bytes([48, 49, 50, 51])

b'0123'

Wow, that sure doesn't look like binary data!

Py3 did not go from three text models to two, it went to one good one (unicode strings) and one broken one (bytes).  If 
the aim was indeed for pure binary manipulation, we failed.  We left in bunches of methods which can *only* be 
interpreted as supporting ASCII manipulation.
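
A short sketch of what "ASCII manipulation" means in practice, assuming Python 3: the text-like methods on bytes only recognize the ASCII range and pass every other byte through untouched, unlike their str counterparts.

```python
word = "Stra\u00dfe".encode("latin-1")    # b'Stra\xdfe'

# bytes.upper() only maps the ASCII letters a-z; byte 0xDF is left alone.
upper_bytes = word.upper()

# str.upper() knows about the character and applies the Unicode rule.
upper_text = "Stra\u00dfe".upper()

# The predicates are ASCII-only as well.
digits_ok = b"0123".isdigit()
alpha_ok = b"caf\xe9".isalpha()    # 0xE9 is not an ASCII letter

print(upper_bytes, upper_text, digits_ok, alpha_ok)
```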


Due to backwards compatibility we cannot now finish yanking those out, so either we live with a half-dead class 
screaming "I want to be ASCII!  I want to be ASCII!" or add back the missing functionality.


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 07:34 AM, Nick Coghlan wrote:

On 12 January 2014 01:15, M.-A. Lemburg wrote:


We don't have to be pedantic about the bytes/text separation.
It doesn't help in real life.


Yes, it bloody well does. The number of people who have told me that
using Python 3 is what allowed them to finally understand how Unicode
works . . .


We are not proposing a change to the unicode string type in any way.



We are NOT going back to the confusing incoherent mess that is the
Python 2 model of bolting Unicode onto the side of POSIX . . .


We are not asking for that.



bytes already have most of the 8-bit string methods from Python 2,
so it doesn't hurt adding some more of the missing features
from Python 2 on top to make life easier for people dealing
with multiple/unknown encoding data.


Because people that aren't happy with the current bytes type
persistently refuse to experiment with writing their own extension
type to figure out what the API should look like. Jamming speculative
API design into the core text model without experimenting in a third
party extension first is a straight up stupid idea.


True, if this were a new API; but it isn't, it's the Py2 str API that was stripped out.  The one big difference being 
that if the result of %s (or %d or any other % code) is not in the 0-127 range, it errors out.


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
 On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
 
 The point that I am making is that many people want to add formatting
 operations to bytes so they can put ASCII strings inside bytes. But (as
 far as I can tell) they don't need to do this, because they can treat
 Unicode strings containing code points U+0000 through U+00FF (i.e. the
 same range as handled by Latin-1) as if they were bytes.
 
 So instead of blurring the line between bytes and text, you're blurring the 
 line between text and bytes (with a few extra seat belts thrown in).  

I'm not blurring anything. The people who designed the file format that 
mixes textual data and binary data did the blurring. Given that such 
formats exist, it is inevitable that we need to put text into bytes, or 
bytes into text. The situation is already blurred, we just have to 
decide how to handle it. There are three broad strategies:

1) Make bytes more string-like, so that we can process our data as 
bytes, but still do string operations on the bits that are ASCII.

2) Make strings more byte-like, so that we can process our data as 
strings, but do byte operations (like bit mask operations) on the parts 
that are binary data.

3) Don't do either. Keep the text parts of your data as text, and the 
binary parts of your data as bytes. Do your text operations on text, and 
your byte operations on bytes.

At some point, of course, they need to be combined. We have a choice:

* Right now, we can use text as the base, and combine bytes into the 
  text using Latin-1, and it Just Works.

* Or we can wait until (maybe) Python 3.5, when (perhaps) bytes objects 
  will be more text-like, and then use bytes as the base, and (with 
  luck) it Should Just Work.


There's another disadvantage with the second: treating bytes as if they 
were ASCII by default reinforces the same old harmful paradigm that text 
is ASCII that we're trying to get away from. That's a bad, painful idea 
that causes a lot of problems and buggy code, and should be resisted.

On the other hand, embedding arbitrary binary data in Unicode text 
doesn't reinforce any common or harmful paradigms. It just requires the 
programmer to forget about characters and concentrate on code points, 
since Latin-1 maps bytes to code points in a very convenient way:

Byte 0x00 maps to code point U+0000
Byte 0x01 maps to code point U+0001
Byte 0x02 maps to code point U+0002
...
Byte 0xFF maps to code point U+00FF


So to embed the binary data 0xDEADBEEF in your string, you can just use 
'\xDE\xAD\xBE\xEF' regardless of what character those code points happen 
to be.
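
As a quick check of that mapping, here is a sketch (assuming Python 3) showing that the four code points in the string literal line up one-to-one with the four bytes, and that the Latin-1 round trip is lossless:

```python
marker = '\xDE\xAD\xBE\xEF'

# Each character's code point equals the byte value it stands for.
code_points = [ord(c) for c in marker]

# Encoding as Latin-1 turns each code point back into a single byte.
encoded = marker.encode('latin-1')

print(code_points, encoded)
```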

If we are manipulating data *as if it were text*, then we ought to treat 
it as text, not add methods to bytes that makes bytes text-like. If we 
are manipulating data *as if it were bytes*, doing byte-manipulation 
operations like bit-masking, then we ought to treat it as numeric bytes, 
not add numeric methods to text. Is that really a controversial opinion?


 Besides being a bit awkward, this also means that any encoded text (even 
 the plain ASCII stuff) is now being transformed three times instead of one:
 
   unicode to bytes
   bytes to unicode using latin1
   unicode to bytes

Where do you get this from? I don't follow your logic. Start with a text 
template:

template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""

data = template % ("George", 42, blob.decode('latin-1'))

Only the binary blobs need to be decoded. We don't need to encode the 
template to bytes, and the textual data doesn't get encoded until we're 
ready to send it across the wire or write it to disk. And when we do, 
since all the code points are in the range U+0000 to U+00FF, encoding it 
to Latin-1 ought to be a fast, efficient operation, possibly even just a 
mem copy.

It's true that the individual binary data fields will need to be decoded 
from bytes, but unless you want Python to guess an encoding (which is 
the old broken Python 2 model), you're going to have to do that 
regardless.


 Even if the cost of moving those bytes around is cheap, it's not free.  
 When you're creating hundreds of PDFs at a time that's going to make a 
 difference.

You've profiled it? Unless you've measured it, it doesn't exist. I'm not 
going to debate performance penalties of code you haven't written yet.
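
For anyone who does want numbers, a measurement could be sketched along these lines. This assumes Python 3 and the standard `timeit` module; the 1 MiB payload and the repeat count are arbitrary choices, and the results will vary by machine and interpreter version, so no figures are claimed here.

```python
import timeit

payload = bytes(range(256)) * 4096       # 1 MiB covering every byte value
text = payload.decode("latin-1")

# Cost of the two conversions the text-based approach adds.
decode_time = timeit.timeit(lambda: payload.decode("latin-1"), number=20)
encode_time = timeit.timeit(lambda: text.encode("latin-1"), number=20)

# Baseline: a plain copy of the same amount of data.
copy_time = timeit.timeit(lambda: bytearray(payload), number=20)

print("decode: %.6fs  encode: %.6fs  copy: %.6fs"
      % (decode_time, encode_time, copy_time))
```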



-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread R. David Murray
tl;dr: At the end I'm volunteering to look at real code that is having
porting problems.

On Sat, 11 Jan 2014 17:33:17 +0100, M.-A. Lemburg m...@egenix.com wrote:
 asciistr is interesting in that it coerces to bytes instead
 of to Unicode (as is the case in Python 2).
 
 At the moment it doesn't cover the more common case bytes + str,
 just str + bytes, but let's assume it would, then you'd write
 
 ...
 headers += asciistr('Length: %i bytes\n' % 123)
 headers += b'\n\n'
 body = b'...'
 socket.send(headers + body)
 ...
 
 With PEP 460, you could write the above as:
 
 ...
 headers += b'Length: %i bytes\n' % 123
 headers += b'\n\n'
 body = b'...'
 socket.send(headers + body)
 ...
 
 IMO, that's more readable.
 
 Both variants essentially do the same thing: they implicitly
 coerce ASCII text strings to bytes, so conceptually, there's
 little difference.

And if we are explicit:

headers = u'Length: %i bytes\n' % 123
headers += u'\n\n'
body = b'...'
socket.send(headers.encode('ascii') + body)

(I included the 'u' prefix only because we are talking about
shared-codebase python2/python3 code.)

That looks pretty readable to me, and it is explicit about what
parts are text and what parts are binary.

But of course we'd never do exactly that in any but the simplest of
protocols and scripts.

Instead we'd write a library that had one or more objects that modeled
our wire/file protocol.  For the text parts, the API would accept input
as text strings.  For the binary parts, it would accept input as bytes.  Then,
when reading or writing the data stream, we perform the appropriate
conversions on the appropriate parts.  Our library does a more complex
analog of 'socket.send(headers.encode('ascii') + body)', one that
understands the various parts and glues them together, encoding the
text parts to the appropriate encoding (often-but-not-always ascii)
as it does so.
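
A minimal sketch of such a model object, assuming Python 3. The `Message` class, its `to_wire`/`from_wire` methods, and the header-plus-blank-line framing are all hypothetical, invented here for illustration; they are not taken from the email package or any real library.

```python
class Message:
    """Hypothetical model of a header-plus-body wire format."""

    def __init__(self, headers, body, header_encoding="ascii"):
        self.headers = dict(headers)    # text parts: str keys and values
        self.body = body                # binary part: bytes
        self.header_encoding = header_encoding

    def to_wire(self):
        # Text is encoded in exactly one place, at the boundary.
        lines = ["%s: %s\r\n" % (k, v) for k, v in self.headers.items()]
        head = "".join(lines) + "\r\n"
        return head.encode(self.header_encoding) + self.body

    @classmethod
    def from_wire(cls, data, header_encoding="ascii"):
        # Split at the first blank line; everything after it stays bytes.
        head, _, body = data.partition(b"\r\n\r\n")
        headers = {}
        for line in head.decode(header_encoding).split("\r\n"):
            name, _, value = line.partition(": ")
            headers[name] = value
        return cls(headers, body, header_encoding)


msg = Message({"Length": "3"}, b"\x00\xff\x10")
wire = msg.to_wire()
parsed = Message.from_wire(wire)
print(wire)
```

Callers of such a class never format bytes themselves: they hand over str for headers and bytes for the body, and the conversions live in one place.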

And yes, I have written code that does this in Python3.

What I haven't done is written that code to run in both Python3 and
Python2.  I *think* the only missing thing I would need to back-port
it is the surrogateescape error handler, but I haven't tried it.  And I
could probably conditionalize the code to use latin1 on python2 instead
and get away with it.

And please note that email is probably the messiest of messy binary
wire protocols.  Not only do you have bytes and text mixed in the same
data stream, with internal markers (in the text parts) that specify
how to interpret the binary, including what encodings each part of that
binary data is in for cases where that matters, you *also* have to deal
with the possibility of there being *invalid* binary data mixed in with
the ostensibly text parts, that you nevertheless are expected to both
preserve and parse around.

When I started adding back binary support to the email package, I was
really annoyed by the lack of certain string features in the bytes
type.  But in the end, it turned out to be really simple to instead
think of the text-with-invalid-bytes parts as *text*-with-invalid-bytes
(surrogateescaped bytes).
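
A small sketch of the surrogateescape technique, assuming Python 3; the header line is a made-up example, not real email package code. The invalid byte survives a trip through str, where ordinary string methods can be used, and is restored exactly on the way out.

```python
raw = b"Subject: caf\xe9 menu\r\n"    # 0xE9 is not valid ASCII

# The undecodable byte 0xE9 is smuggled through as lone surrogate U+DCE9.
text = raw.decode("ascii", errors="surrogateescape")

# Normal str operations work on the "text with invalid bytes".
subject = text.split(": ", 1)[1].strip()

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("ascii", errors="surrogateescape")

print(repr(subject))
```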

Now, if I was designing from the ground up I'd store the stuff that
was really binary as bytes in the model object instead of storing it as
surrogateescaped text, but that problem is a consequence of how we got from
there to here (python2-email to python3-email-that-didn't-handle-8bit-data
to python3-email-that-works) rather than a problem with the python3 core
data model.

So it seems like I'm with Nick and Antoine and company here.  The
byte-interpolation proposed by Antoine seems reasonable, but I don't
see the *need* for the other stuff.  I think that programs will
be cleaner if the text parts of the protocol are handled *as text*.

On the other hand, Ethan's point that bytes *does* have text methods
is true.  However, other than the perfectly-sensible-for-bytes split,
strip, and ends/startswith, I don't think I actually use any of them.


But!  Our goal should be to help people convert to Python3.  So how can
we find out what the specific problems are that real-world programs are
facing, look at the *actual code*, and help that project figure out the
best way to make that code work in both python2 and python3?

That seems like the best way to find out what needs to be added to
python3 or pypi:  help port the actual code of the developers who are
running into problems.

Yes, I'm volunteering to help with this, though of course I can't promise
exactly how much time I'll have available.

--David


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Stephen J. Turnbull
M.-A. Lemburg writes:

  I completely agree with Stephen, that bytes are in fact often
  an encoding of text. If that text is ASCII compatible, I don't
  see any reason why we should not continue to expose the C lib
  standard string APIs available for text manipulations on bytes.

We already *have* a type in Python 3.3 that provides text
manipulations on arrays of 8-bit objects: str (per PEP 393).

  BTW: I don't know why so many people keep asking for use cases.
  Isn't it obvious that text data without known (but ASCII compatible)
  encoding or multiple different encodings in a single data chunk
  is part of life ?

Isn't it equally obvious that if you create or read all such ASCII-
compatible chunks as (encoding='ascii', errors='surrogateescape') that
you *don't need* string APIs for bytes?

Why do these text chunks need to be bytes in the first place?
That's why we ask for use cases.  AFAICS, reading and writing ASCII-
compatible text data as 'latin1' is just as fast as bytes I/O.  So
it's not I/O efficiency, and (since in this model we don't do any
en/decoding on bytes/str), it's not redundant en/decoding of bytes to
str and back.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 04:15:35PM +0100, M.-A. Lemburg wrote:

 I think we need to step back a little from the purist view
 of things and give more emphasis on the "practicality beats
 purity" Zen.
 
 I completely agree with Stephen, that bytes are in fact often
 an encoding of text. If that text is ASCII compatible, I don't
 see any reason why we should not continue to expose the C lib
 standard string APIs available for text manipulations on bytes.

Later in your post, you talk about the masses of broken encodings found 
everywhere (not just in your spam folder). How do the C lib standard 
string APIs help programmers to avoid broken encodings?


 We don't have to be pedantic about the bytes/text separation.
 It doesn't help in real life.

On the contrary, it helps a lot. To the extent that people keep that 
clean bytes/text separation, it helps avoid bugs. It prevents problems 
like this Python 2 nonsense:

s = "Straße"
assert len(s) == 6  # fails
assert s[5] == 'e'  # fails

Most problematic, printing s may (depending on your terminal settings) 
actually look like Straße.
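
For contrast, a sketch of the same checks under Python 3, where the text and its UTF-8 encoding are separate objects. The escape `\u00df` is used for ß only to keep the snippet independent of source-file encoding.

```python
s = "Stra\u00dfe"              # the str "Straße": six characters

text_len = len(s)
last_char = s[5]

# The UTF-8 encoding is a different object with a different length,
# because ß takes two bytes (0xC3 0x9F).
encoded = s.encode("utf-8")
byte_len = len(encoded)
byte_at_5 = encoded[5]

print(text_len, last_char, byte_len, hex(byte_at_5))
```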

Not only is having a clean bytes/text separation the pedantic thing to 
do, it's also the right thing to do nearly always (notwithstanding a 
few exceptions, allegedly).


 If you give programmers the choice they will - most of the time -
 do the right thing. 

Unicode has been available in Python since version 2.2, more than a 
decade ago. And yet here we are, five point releases later (2.7), and 
the majority of text processing code is still using bytes. I'm not just 
pointing the finger at others. My 2.x only code almost always uses byte 
strings for text processing, and not always because it was old code I 
wrote before I knew better. The coders I work with do the same, only you 
can remove the word "almost". The code I see posted on comp.lang.python 
and Reddit and the tutor mailing list invariably uses byte strings. The 
beginners on the tutor list at least have an excuse that they are 
beginners.

A quarter of a century after Unicode was first published, nearly 
28 years since IBM first introduced the concept of code pages 
to PC users, and we still have programmers writing ASCII only 
string-handling code that, if it works with extended character sets, 
only works by accident. The majority of programmers still have *no idea* 
of even the most basic parts of Unicode. They've had the right tools 
for a decade, and ignored them.

Python 3 forces the issue, and my code is better for it.


 bytes already have most of the 8-bit string methods from Python 2,
 so it doesn't hurt adding some more of the missing features
 from Python 2 on top to make life easier for people dealing
 with multiple/unknown encoding data.

I personally think it was a mistake to keep text operations like upper() 
and lower() on bytes. I think it will compound the mistake to add even 
more text operations.


-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 05:33:17PM +0100, M.-A. Lemburg wrote:

 FWIW: I quite liked the Python 2 model, but perhaps that's because
 I already knew how Unicode works, so could use it to make my
 life easier ;-)

/incredulous

I would really love to see you justify that claim. How do you use the 
Python 2 string type to make processing Unicode text easier?



-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread MRAB

On 2014-01-11 05:36, Steven D'Aprano wrote:
[snip]

Latin-1 has the nice property that every byte decodes into the character
with the same code point, and vice versa. So:

for i in range(256):
 assert bytes([i]).decode('latin-1') == chr(i)
 assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode
text with embedded binary data, rather than binary data with embedded
ASCII text. Then when writing the file to disk, of course you encode it
to Latin-1, either explicitly:

pdf = ... # Unicode string containing the PDF contents
with open("outfile.pdf", "wb") as f:
    f.write(pdf.encode("latin-1"))

or implicitly:

with open("outfile.pdf", "w", encoding="latin-1") as f:
    f.write(pdf)


[snip]
The second example won't work because you're forgetting about the
handling of line endings in text mode.

Suppose you have some binary data bytes([10]).

You convert it into a Unicode string using Latin-1, giving '\n'.

You write it out to a file opened in text mode.

On Windows, that string '\n' will be written to the file as b'\r\n'.
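
The effect can be reproduced on any platform with a sketch like the following. It assumes Python 3; `newline="\r\n"` is passed explicitly to imitate the text-mode default on Windows, and an in-memory `io.BytesIO` stands in for the file.

```python
import io

payload = bytes([10])               # one byte of binary data: 0x0A
text = payload.decode("latin-1")    # becomes the string '\n'

raw = io.BytesIO()
# newline="\r\n" imitates what text mode does by default on Windows.
writer = io.TextIOWrapper(raw, encoding="latin-1", newline="\r\n")
writer.write(text)
writer.flush()

written = raw.getvalue()
print(written)
```

One byte went in and two came out, so binary data written this way is silently corrupted.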



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 10:36 AM, Steven D'Aprano wrote:

On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:


   unicode to bytes
   bytes to unicode using latin1
   unicode to bytes


Where do you get this from? I don't follow your logic. Start with a text
template:

template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""

data = template % ("George", 42, blob.decode('latin-1'))

Only the binary blobs need to be decoded. We don't need to encode the
template to bytes, and the textual data doesn't get encoded until we're
ready to send it across the wire or write it to disk.


And what if your name field has data not representable in latin-1?

--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
u'\u0441\u0440\u0403'

--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

So really your example should be:

data = template % 
("George".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 
42, blob.decode('latin-1'))

Which is a mess.
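
For readers on Python 3, the same round trip can be sketched as follows. This is only an illustration of the objection above: the codec cp1251 stands in for whatever codec the file's metadata would name, and the shortened template is hypothetical.

```python
name = "срЃ"  # text that Latin-1 cannot represent

try:
    name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False  # this is the branch actually taken

# Assumed file codec; in practice it would come from the file's metadata.
file_codec = "cp1251"

# Encode with the real codec, then decode as Latin-1 so the bytes can ride
# inside a text template as code points below U+0100.
smuggled = name.encode(file_codec).decode("latin-1")

template = "Name:\0\0\0%s\nAge:\0\0\0\0%d\n"
record = template % (smuggled, 42)

# Only at the very end is the whole record turned back into bytes.
wire = record.encode("latin-1")
```

The name ends up on the wire in cp1251 as intended, but only because it was encoded, decoded and encoded again along the way.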

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Stephen J. Turnbull
MRAB writes:

   with open("outfile.pdf", "w", encoding="latin-1") as f:
       f.write(pdf)
  
  [snip]
  The second example won't work because you're forgetting about the
  handling of line endings in text mode.

Not so fast!  Forgot, yes (me too!), but not work?  Not quite:

with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
    f.write(pdf)

should do the trick.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:

MRAB writes:

with open("outfile.pdf", "w", encoding="latin-1") as f:
    f.write(pdf)
   
   [snip]
   The second example won't work because you're forgetting about the
   handling of line endings in text mode.

Not so fast!  Forgot, yes (me too!), but not work?  Not quite:

 with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
     f.write(pdf)

should do the trick.


Well, it's good that there is a workaround.  Are we going to have a document listing all the workarounds needed to 
program a bytes-oriented style using unicode?


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread R. David Murray
On Sat, 11 Jan 2014 11:54:26 -0800, Ethan Furman et...@stoneleaf.us wrote:
 On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:
  MRAB writes:
 
  with open("outfile.pdf", "w", encoding="latin-1") as f:
      f.write(pdf)
 
 [snip]
 The second example won't work because you're forgetting about the
 handling of line endings in text mode.
 
  Not so fast!  Forgot, yes (me too!), but not work?  Not quite:
 
   with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
       f.write(pdf)
 
  should do the trick.
 
 Well, it's good that there is a workaround.  Are we going to have a
 document listing all the workarounds needed to program a bytes-oriented
 style using unicode?

That's not a work-around (if you are talking specifically about the
newline='').  That's just the way the python3 IO library works.  If you
want to preserve the newlines in your data, but still have the text-io
machinery count them for deciding when to trigger io/buffering behavior,
you use newline=''.

It's not the most intuitive API, so I won't be surprised if a lot of
people don't know about it or get confused by it when they see it.
I first learned about it in the context of csv files, another one of
those legacy file protocols that are mostly-text-but-not-entirely.
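
The csv case can be sketched like this. The helper csv_bytes is hypothetical, and newline="\r\n" on an in-memory TextIOWrapper is used to imitate what a default text-mode file would do on Windows.

```python
import csv
import io

def csv_bytes(rows, newline):
    """Write rows with csv.writer through a text layer; return the raw bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    csv.writer(text).writerows(rows)
    text.flush()
    data = raw.getvalue()
    text.detach()  # keep the wrapper from closing the buffer on cleanup
    return data

rows = [["a", "b"]]

# newline="" leaves the writer's own "\r\n" line terminator alone.
clean = csv_bytes(rows, newline="")

# With translation to "\r\n" switched on, the "\n" inside the writer's
# "\r\n" is expanded again, giving "\r\r\n".
doubled = csv_bytes(rows, newline="\r\n")
```

The csv module already emits its own line terminator, so the text layer must be told to leave newlines alone; otherwise the terminator is expanded twice.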

--David


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Donald Stufft

On Jan 11, 2014, at 10:34 AM, Nick Coghlan ncogh...@gmail.com wrote:

 Yes, it bloody well does. The number of people who have told me that
 using Python 3 is what allowed them to finally understand how Unicode
 works vastly exceeds the number of wire protocol and file format devs
 that have complained about working with binary formats being
 significantly less tolerant of the "it's really like ASCII text"
 mindset.

FWIW as one of the people who it took Python3 to finally figure out how to
actually use unicode, it was the absence of encode on bytes and decode on
str that actually did it. Giving bytes a format method would not have affected
that either way I don’t believe.

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA





Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Terry Reedy

On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote:


We already *have* a type in Python 3.3 that provides text
manipulations on arrays of 8-bit objects: str (per PEP 393).

   BTW: I don't know why so many people keep asking for use cases.
   Isn't it obvious that text data without known (but ASCII compatible)
   encoding or multiple different encodings in a single data chunk
   is part of life ?

Isn't it equally obvious that if you create or read all such ASCII-
compatible chunks as (encoding='ascii', errors='surrogateescape') that
you *don't need* string APIs for bytes?

Why do these text chunks need to be bytes in the first place?
That's why we ask for use cases.  AFAICS, reading and writing ASCII-
compatible text data as 'latin1' is just as fast as bytes I/O.  So
it's not I/O efficiency, and (since in this model we don't do any
en/decoding on bytes/str), it's not redundant en/decoding of bytes to
str and back.


The problem with some criticisms of using 'unicode in Python 3' is that 
there really is no such thing. Unicode in 3.0 to 3.2 used the old 
internal model inherited from 2.x. Unicode in 3.3+ uses a different 
internal model that is a game changer with respect to certain issues of 
space and time efficiency (and cross-platform correctness and 
portability). So at least some of the valid criticisms based on the old 
model are out of date and no longer valid.
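
A rough way to see the PEP 393 storage model from Python code, assuming CPython 3.3 or later (sys.getsizeof reports an implementation detail, so other implementations may give different numbers). The helper bytes_per_char is made up for this sketch.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per code point by comparing two string lengths."""
    # The fixed object header cancels out in the subtraction.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

latin1_width = bytes_per_char("\xe9")        # code point below U+0100
bmp_width = bytes_per_char("\u0441")         # Cyrillic, needs 16 bits
astral_width = bytes_per_char("\U0001F600")  # outside the BMP, needs 32 bits
```

On CPython 3.3+ a string whose code points all fit in Latin-1 takes one byte per character, which is why str can be described as providing text manipulation on arrays of 8-bit objects.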


--
Terry Jan Reedy



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Daniel Holth
On Sat, Jan 11, 2014 at 4:28 PM, Terry Reedy tjre...@udel.edu wrote:
 On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote:

 We already *have* a type in Python 3.3 that provides text
 manipulations on arrays of 8-bit objects: str (per PEP 393).

BTW: I don't know why so many people keep asking for use cases.
Isn't it obvious that text data without known (but ASCII compatible)
encoding or multiple different encodings in a single data chunk
is part of life ?

 Isn't it equally obvious that if you create or read all such ASCII-
 compatible chunks as (encoding='ascii', errors='surrogateescape') that
 you *don't need* string APIs for bytes?

 Why do these text chunks need to be bytes in the first place?
 That's why we ask for use cases.  AFAICS, reading and writing ASCII-
 compatible text data as 'latin1' is just as fast as bytes I/O.  So
 it's not I/O efficiency, and (since in this model we don't do any
 en/decoding on bytes/str), it's not redundant en/decoding of bytes to
 str and back.


 The problem with some criticisms of using 'unicode in Python 3' is that
 there really is no such thing. Unicode in 3.0 to 3.2 used the old internal
 model inherited from 2.x. Unicode in 3.3+ uses a different internal model
 that is a game changer with respect to certain issues of space and time
 efficiency (and cross-platform correctness and portability). So at least
 some of the valid criticisms based on the old model are out of date and no
 longer valid.

-1 on adding more surrogateescapes by default. It's a pain to track
down where the encoding errors came from.
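
A small sketch of both sides of this point: surrogateescape round-trips unknown bytes losslessly, but a stray byte only shows up as an error later, at whichever strict encode happens to touch it.

```python
raw = b"caf\xe9"  # ASCII-compatible data with one byte of unknown encoding

# The undecodable byte 0xE9 is smuggled in as the lone surrogate U+DCE9.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")

# A strict encode somewhere else in the program is where the failure appears,
# possibly far from the place the data was read.
try:
    text.encode("utf-8")
    failed_late = False
except UnicodeEncodeError:
    failed_late = True
```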


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 12:45 PM, Donald Stufft wrote:


FWIW as one of the people who it took Python3 to finally figure out how to
actually use unicode, it was the absence of encode on bytes and decode on
str that actually did it. Giving bytes a format method would not have affected
that either way I don’t believe.


My biggest hurdle was realizing that ASCII was an encoding.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Mariano Reingart
On Fri, Jan 10, 2014 at 9:13 PM, Juraj Sukop juraj.su...@gmail.com wrote:




 On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou solip...@pitrou.net wrote:

 Also, when you say you've never encountered UTF-16 text in PDFs, it
  sounds like those people who've never encountered any non-ASCII data in
 their programs.


 Let me clarify: one does not think in writing text in Unicode-terms in
 PDF. Instead, one records the sequence of character codes which
 correspond to glyphs or the glyph IDs directly. That's because one
 Unicode character may have more than one glyph and more characters can be
 shown as one glyph.



AFAIK (and just for the record), there could be both Latin1 text and UTF-16
in a PDF (and other encodings too), depending on the font used:

/Encoding /WinAnsiEncoding (mostly latin1 standard fonts)
/Encoding /Identity-H (generally for unicode UTF-16 True Type embedded
fonts)

For example, in PyFPDF (a PHP library ported to python), the following code
writes out text that could be encoded in two different encodings:

s = sprintf("BT %.2f %.2f Td (%s) Tj ET", x*self.k, (self.h-y)*self.k, txt)

https://code.google.com/p/pyfpdf/source/browse/fpdf/fpdf.py#602

In Python2, txt is just a str, but in Python3 handling everything as latin1
string obviously doesn't work for TTF in this case.
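
A sketch of what choosing the encoding per font might look like when the operator is built as bytes. This is not PyFPDF's code: both helpers are hypothetical, the mapping of /Identity-H to UTF-16-BE simply follows the description above, and escaping of parentheses and backslashes in the string is left out.

```python
def encode_show_text(txt, font_encoding):
    """Hypothetical helper: pick the byte encoding from the font's /Encoding."""
    if font_encoding == "/WinAnsiEncoding":
        return txt.encode("cp1252")     # standard fonts, mostly Latin-1
    if font_encoding == "/Identity-H":
        return txt.encode("utf-16-be")  # assumed, per the message above
    raise ValueError("unsupported font encoding: %r" % (font_encoding,))

def text_operator(x, y, txt, font_encoding):
    """Build a 'BT ... Td (...) Tj ET' operator as bytes."""
    head = ("BT %.2f %.2f Td (" % (x, y)).encode("ascii")
    return head + encode_show_text(txt, font_encoding) + b") Tj ET"

latin_op = text_operator(10, 20, "\xe9", "/WinAnsiEncoding")
ttf_op = text_operator(10, 20, "\u0441", "/Identity-H")
```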

Best regards

Mariano Reingart
http://www.sistemasagiles.com.ar
http://reingart.blogspot.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 07:22:30PM +, MRAB wrote:

 with open("outfile.pdf", "w", encoding="latin-1") as f:
     f.write(pdf)
 
 [snip]
 The second example won't work because you're forgetting about the
 handling of line endings in text mode.

So I did! Thank you for the correction.



-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Matěj Cepl

On 2014-01-11, 18:09 GMT, you wrote:
 We are NOT going back to the confusing incoherent mess that 
 is the Python 2 model of bolting Unicode onto the side of 
 POSIX . . .

 We are not asking for that.

Yes, you do. Maybe not you personally, but the number of people here 
on this list (for F...k sake, this is for DEVELOPERS of the 
language, not some bloody users!) for whom the current 
suggestion is just a way to avoid Unicode and keep alive all 
those broken scripts which barf at me all the time is quite 
non-zero, I am afraid.

Best,

Matěj



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 04:28:34PM -0500, Terry Reedy wrote:

 The problem with some criticisms of using 'unicode in Python 3' is that 
 there really is no such thing. Unicode in 3.0 to 3.2 used the old 
 internal model inherited from 2.x. Unicode in 3.3+ uses a different 
 internal model that is a game changer with respect to certain issues of 
 space and time efficiency (and cross-platform correctness and 
 portability). So at least some of the valid criticisms based on the old 
 model are out of date and no longer valid.

While there are definitely performance savings (particularly of memory) 
regarding the FSR in Python 3.3, for the use-case we're talking about, 
Python 3.1 and 3.2 (and for that matter, 2.2 through 2.7) Unicode 
strings should be perfectly adequate. The textual data being used is 
ASCII, and the binary blobs are encoded to Latin-1, so everything is a 
subset of Unicode, namely U+0000 to U+00FF. That means there are no 
astral characters, and no behavioural differences between wide and 
narrow builds (apart from memory use).


-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:

 AFAIK (and just for the record), there could be both Latin1 text and UTF-16
 in a PDF (and other encodings too), depending on the font used:
[...]
 In Python2, txt is just a str, but in Python3 handling everything as latin1
 string obviously doesn't work for TTF in this case.

Nobody is suggesting that you use Latin-1 for *everything*. We're 
suggesting that you use it for blobs of binary data that represent 
arbitrary bytes. First you have to get your binary data in the first 
place, using whatever technique is necessary. Here's one way to get a 
blob of binary data:


# encode four C shorts into a fixed-width struct
struct.pack("hhhh", 23, 42, 17, 99)

Here's another way:

# encode a text string into UTF-16
"My name is Steven".encode("utf-16be")

Both examples return a bytes object containing arbitrary bytes. How do 
you combine those arbitrary bytes with a string template while still 
keeping all code-points under U+0100? By decoding to Latin-1.
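
Put together, the approach might look like the sketch below. The template and its field markers are invented for the illustration, and the little-endian "<hhhh" layout is an assumption.

```python
import struct

# Two blobs of arbitrary bytes, obtained in two different ways.
numbers = struct.pack("<hhhh", 23, 42, 17, 99)  # byte order assumed
name = "My name is Steven".encode("utf-16be")

# Hypothetical text template; everything in it is ASCII.
template = "HDR:%s|NAME:%s|END"

# Decoding each blob as Latin-1 maps byte N to code point N.
record = template % (numbers.decode("latin-1"), name.decode("latin-1"))

# Every code point stays below U+0100, so the final encode is lossless.
all_below_0x100 = all(ord(ch) < 0x100 for ch in record)
payload = record.encode("latin-1")
```

The payload contains the two blobs byte for byte, with the ASCII markers around them.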



-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Cameron Simpson
On 11Jan2014 13:15, Juraj Sukop juraj.su...@gmail.com wrote:
 On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson c...@zip.com.au wrote:
data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )
 
 Thanks for the suggestion! The problem with bytify is that some items
 might require different formatting than other items. For example, in
 Cross-Reference Table there are three different formats: non-padded
 integer (1), 10- and 5-digit integers (0000000003, 65535).

Well, this is partly my point: you probably want to exert more
control than is reasonable for the PEP to offer, and you're better
off with a helper function of your own. In particular, aside from
passing in a default char->bytes encoding, you can provide your own
format hooks.

In particular, str already provides a completish % suite and you
have no issue with encodings in that phase because it is all Unicode.

So the points where you're treating PDF as text are probably best
tackled as text and then encoded with a helper like bytify when you
have to glom bytes and textish stuff together.

Crude example, hacked up from yours:

  data = b''.join( bytify( (
    ("%d %d obj ... stream" % (10, 0)),
    binary_image_data,
    "endstream endobj",
  )))

where bytify swallows your encoding decisions.

Since encoding anything-not-bytes into a bytes sequence inherently
involves an encoding decision, I think I'm +1 on the PEP's aim of
never mixing bytes with non-bytes, keeping all the encoding decisions
in the caller's hands.

I quite understand not wanting to belabour the code with
.encode('ascii') but that should be said somewhere, so best to
do so yourself in as compact and ergonomic fashion as possible.
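
One possible shape for such a helper, as a sketch only: bytify is not an existing function, and the choice of ASCII as the default encoding and of str() for non-string items are assumptions made here.

```python
def bytify(items, encoding="ascii"):
    """Yield each item as bytes: bytes pass through, everything else is encoded."""
    for item in items:
        if isinstance(item, (bytes, bytearray)):
            yield bytes(item)
        elif isinstance(item, str):
            yield item.encode(encoding)
        else:
            # Numbers and other objects go through their text form first.
            yield str(item).encode(encoding)

binary_image_data = b"\x89\x00\xff"

# Formatting is done on str, where the full % suite is available.
data = b"".join(bytify((
    "%d %d obj\nstream\n" % (10, 0),
    binary_image_data,
    "\nendstream endobj\n",
)))

# A padded entry, formatted as text and encoded once at the end.
xref_entry = b"".join(bytify(["%010d %05d n" % (3, 65535)]))
```

The encoding decision lives in exactly one place, and items that need special padding are formatted as text before they reach the helper.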

Cheers,
-- 
Cameron Simpson c...@zip.com.au

Serious error.
All shortcuts have disappeared.
Screen. Mind. Both are blank.
- Haiku Error Messages 
http://www.salonmagazine.com/21st/chal/1998/02/10chal2.html


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 Jan 2014 03:29, Ethan Furman et...@stoneleaf.us wrote:

 On 01/11/2014 12:43 AM, Nick Coghlan wrote:


 In particular, the bytes type is, and always will be, designed for
 pure binary manipulation [...]


 I apologize for being blunt, but this is a lie.

 Lets take a look at the methods defined by bytes:

 dir(b'')

 ['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
'__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
'__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__',
'__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__',
'__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center',
'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'index',
'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition',
'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']

 Are you really going to insist that expandtabs, isalnum, isalpha,
isdigit, islower, isspace, istitle, isupper, ljust, lower, lstrip, rjust,
splitlines, swapcase, title, upper, and zfill are pure binary manipulation
methods?

Do you think I don't know that? However, those are all *in-place*
modifications. Yes, they assume ASCII compatible formats, but they're a far
cry from encouraging combination of data from potentially different sources.

I'm also on record as considering this a design decision I regret,
precisely because it has resulted in experienced Python 2 developers
failing to understand that the Python 3 text model is *different* and they
may need to  create a new type.


 Let's take a look at the repr of bytes:

 bytes([48, 49, 50, 51])

 b'0123'

 Wow, that sure doesn't look like binary data!

 Py3 did not go from three text models to two, it went to one good one
(unicode strings) and one broken one (bytes).  If the aim was indeed for
pure binary manipulation, we failed.  We left in bunches of methods which
can *only* be interpreted as supporting ASCII manipulation.
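
For reference, a short sketch of how those bytes methods behave in Python 3: they treat values in the ASCII range as text and leave every other byte alone.

```python
data = bytes([48, 49, 50, 51])
shown = repr(data)          # the repr shows ASCII bytes as characters

mixed = b"caf\xe9 au lait"
upper = mixed.upper()       # only the ASCII letters change

digits_ok = data.isdigit()  # ASCII digits count as digits
high_alpha = b"\xe9".isalpha()  # a non-ASCII byte is never alphabetic
```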

No, no, no. We made some concessions in the design of the bytes type to
*ease* development and debugging of ASCII compatible protocols *where we
believed we could do so without compromising the underlying text model
changes*.

Many experienced Python 2 developers are now suffering one of the worst
cases of paradigm lock I have ever seen as they keep trying to make the
Python 3 text model the same as the Python 2 one instead of actually
learning how Python 3 works and recognising that they may actually need to
create a new type for their use case and then potentially seek core dev
assistance if that type reveals new interoperability bugs in the core types
(or encounters old ones).


 Due to backwards compatibility we cannot now finish yanking those out, so
 either we live with a half-dead class screaming "I want to be ASCII!  I want
to be ASCII!" or add back the missing functionality.

No, we don't - we treat the core bytes type as PEP 460 does, by adding a
*new* feature proposed by a couple people writing native Python 3 libraries
like asyncio that makes binary formats easier to deal with without carrying
forward even *more* broken assumptions from the Python 2 text model.
(Remember, I'm in favour of Antoine's updated PEP, because it's a real spec
for a new feature, rather than yet another proposal to bolt on even more
text specific formatting features from someone that has never bothered to
understand the reasons for the differences between the two versions).

People that want a full hybrid type back can then pursue the custom
extension type approach.

Cheers,
Nick.



 --
 ~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 06:29 PM, Steven D'Aprano wrote:

On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:

On 01/11/2014 10:36 AM, Steven D'Aprano wrote:

On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:


   unicode to bytes
   bytes to unicode using latin1
   unicode to bytes


Where do you get this from? I don't follow your logic. Start with a text
template:

template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""

data = template % ("George", 42, blob.decode('latin-1'))


Since the use-cases people have been speaking about include only ASCII
(or at most, Latin-1) text and arbitrary binary bytes, my example is
limited to showing only ASCII text. But it will work with any text data,
so long as you have a well-defined format that lets you tell which parts
are interpreted as text and which parts as binary data.


Since you're talking to me, it would be nice if you addressed the same use-case I was addressing, which is mixed: 
ascii-encoded text, ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and misc-encoded text.


And no, your example will not work with any text, it would completely moji-bake 
my dbf files.



Only the binary blobs need to be decoded. We don't need to encode the
template to bytes, and the textual data doesn't get encoded until we're
ready to send it across the wire or write it to disk.


No!  When I have text, part of which gets ascii-encoded and part of which gets, say, cp1251 encoded, I cannot wait till 
the end!




And what if your name field has data not representable in latin-1?

--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
u'\u0441\u0440\u0403'


Where did you get those bytes from? You got them from somewhere.


For the sake of argument, pretend a user entered them in.


Who knows? Who cares? Once you have bytes, you can treat them as a blob of
arbitrary bytes and write them to the record using the Latin-1 trick.


No, I can't.  See above.


 If
you're reading those bytes from some stream that gives you bytes, you
don't have to care where they came from.


You're kidding, right?  If I don't know where they came from (a graphics field?  a note field?) how am I going to know 
how to treat them?




But what if you don't start with bytes? If you start with a bunch of
floats, you'll probably convert them to bytes using the struct module.


Yup, and I do.


If you start with non-ASCII text, you have to convert them to bytes too.
No difference here.


Really?  You just said above that it will work with any text data -- you 
can't have it both ways.



You ask the user for their name, they answer срЃ which is given to you
as a Unicode string, and you want to include it in your data record. The
specifications of your file format aren't clear, so I'm going to assume
that:

1) ASCII text is allowed as-is (that is, the name George will be
in the final data file as b'George');


User data is not (typically) where the ASCII data is, but some of the metadata is definitely and always ASCII.  The user 
text data needs to be encoded using whichever codec is specified by the file, which is only occasionally ASCII.




2) any other non-ASCII text will be encoded as some fixed encoding
which we can choose to suit ourselves;


Well, the user chooses it, we have to abide by their choice.  (It's kept in the 
file metadata.)



3) arbitrary binary data is allowed as-is (i.e. byte N has to end up
being written as byte N, for any value of N between 0 and 255).


In a couple field types, yes.  Usually the binary data is numeric or date related and there is conversion going on 
there, too, to give me the bytes I need.
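
A sketch of the mixed use case when each piece is turned into bytes as soon as its encoding is known. The record layout here is invented for the illustration and is not the dbf format; build_record and its parameters are hypothetical.

```python
import struct

def build_record(name, age, codepage):
    """Assemble one record from ASCII metadata, user text and a binary number."""
    label = "Name:".encode("ascii")    # metadata is always ASCII
    name_bytes = name.encode(codepage)  # user text uses the file's codec
    age_bytes = struct.pack("<H", age)  # binary-encoded number (layout assumed)
    return label + name_bytes + b"\x00" + age_bytes

# The codec would normally be read from the file's metadata.
record = build_record("срЃ", 42, "cp1251")
```

Each field is encoded once, with its own codec, at the point where it is added to the record.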



[snip]


--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position
0-2: ordinal not in range(256)


That is backwards to what I've shown. Look at my earlier example again:


And you are not paying attention:

'\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
\--------------------------------------/ \--------------/
 a non-ascii compatible unicode string   to latin1 bytes

("срЃ".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
 \---------------------------------------------------/ \----------------/
  getting the actual bytes I need                       and back into unicode until I write them later

You did say to use a *text* template to manipulate my data, and then write it later, no?  Well, this is what it would 
look like.




Bytes get DECODED to latin-1, not encoded.

Bytes - text is *decoding*
Text - bytes is *encoding*


Pretend for a moment I know that, and look at my examples again.

I am demonstrating the contortions needed when my TEXTual data is not ASCII-compatible:  It must be ENcoded using the 
appropriate codec to BYTES, then DEcoded back to unicode using latin1, all so later I can ENcode the 

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 January 2014 02:33, M.-A. Lemburg m...@egenix.com wrote:
 On 11.01.2014 16:34, Nick Coghlan wrote:
 While that was an *expedient* (and, in fact, necessary) solution at
 the time, the fact it is still thoroughly confusing people 13 years
 later shows it is not a *comprehensible* solution.

 FWIW: I quite liked the Python 2 model, but perhaps that's because
 I already knew how Unicode works, so could use it to make my
 life easier ;-)

Right, I tried to capture that in
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3
by pointing out that there are two *very* different kinds of code to
consider when discussing text modelling.

Application code lives in a nice clean world of structured data, text
data and binary data, with clean conversion functions for switching
between them.

Boundary code, by contrast, has to deal with the messy task of
translating between them all.

The Python 2 text model is a convenient model for boundary code,
because it implicitly allows switching between binary and text
interpretations of a data stream, and that's often useful due to the
way protocols and file formats are designed.

However, that kind of implicit switching is thoroughly inappropriate
for *application* code. So Python 3 switches the core text model to
one where implicitly switching between the binary domain and the text
domain is considered a *bad* thing, and we object strongly to any
proposals which suggest blurring the boundaries again, since that is
going back to a boundary code model rather than an application code
one.

I've been saying for years that we may need a third type, but it has
been nigh on impossible to get boundary code developers to say
anything more useful than "I preferred the Python 2 model, that was
more convenient for me". Yes, we know it was (we do maintain both of
them, after all, and did the update for the standard library's own
boundary code), but application developers are vastly more common, so
boundary code developers lost out on that one and we need to come up
with solutions that *respect* the Python 3 text model, rather than
trying to change it back to the Python 2 one.

 Seriously, Unicode has always caused heated discussions and
 I don't expect this to change in the next 5-10 years.

 The point is: there is no 100% perfect solution either way and
 when you acknowledge this, things don't look black and white anymore,
 but instead full of colors :-)

It would be nice if more boundary code developers actually did that
rather than coming out with accusatory hyperbole and pining for the
halcyon days of Python 2 where the text model favoured their use case
over that of normal application developers.

 Python 3 forces people to actually use Unicode; in Python 2 they
 could easily avoid it. It's good to educate people on how it's
 used and the issues you can run into, but let's not forget
 that people are trying to get work done and we all love readable
 code.

 PEP 460 just adds two more methods to the bytes object which come
 in handy when formatting binary data; I don't think it has potential
 to muddy the Python 3 text model, given that the bytes
 object already exposes a dozen of other ASCII text methods :-)

I dropped my objections to PEP 460 once Antoine fixed it to respect
the boundaries between binary and text data. It's now a pure binary
interpolation proposal, and one I think is a fine idea - there's no
implicit encoding or decoding involved, it's just a tool for
manipulating binary data.
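
For illustration, this is roughly what pure binary interpolation looks like with the bytes % operator that shipped in Python 3.5 (specified, to my knowledge, by the later PEP 461 rather than by PEP 460 itself, so the details may differ from the draft under discussion here).

```python
# Numbers are formatted as ASCII digits; no text encoding is involved.
header = b"%d %d obj" % (10, 0)

# Bytes are interpolated as-is, whatever values they contain.
blob = b"\x00\xff"
stream = b"stream:%b" % (blob,)

# A str argument is rejected rather than implicitly encoded.
try:
    b"%s" % ("text",)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```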

That leaves the implicit encoding and decoding to the third party
asciistr type, as it should be.

 asciistr is interesting in that it coerces to bytes instead
 of to Unicode (as is the case in Python 2).

Not quite - the idea of asciistr is that it is designed to be a
*hybrid* type, like str was in Python 2. If it interacts with binary
objects, it will give a binary result, if it interacts with text
objects, it will give a text result. This makes it potentially
suitable for use for constants in hybrid binary/text APIs like
urllib.parse, allowing them to be implemented using a shared code path
once again.

The initial experimental implementation only works with 7 bit ASCII,
but the UTF-8 caching in the PEP 393 implementation opens up the
possibility of offering a non-strict mode in the future, as does the
option of allowing arbitrary 8-bit data and disallowing interoperation
with text strings in that case.

 At the moment it doesn't cover the more common case bytes + str,
 just str + bytes, but let's assume it would,

Right, I suspect we have some overbroad PyUnicode_Check() calls in
CPython that will need to be addressed before this substitution works
seamlessly - that's one of the reasons I've been asking people to
experiment with the idea since at least 2010 and let us know what
doesn't work (nobody did though, until Benno agreed to try it out
because it sounded like an interesting puzzle 

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 January 2014 04:38, R. David Murray rdmur...@bitdance.com wrote:
 But!  Our goal should be to help people convert to Python3.  So how can
 we find out what the specific problems are that real-world programs are
 facing, look at the *actual code*, and help that project figure out the
 best way to make that code work in both python2 and python3?

 That seems like the best way to find out what needs to be added to
 python3 or pypi:  help port the actual code of the developers who are
 running into problems.

 Yes, I'm volunteering to help with this, though of course I can't promise
 exactly how much time I'll have available.

And, as has been the case for a long time, the PSF stands ready to
help with funding credible grant proposals for Python 3 porting
efforts. I believe some of the core devs (including David?) do
freelance and contract work, so that's an option definitely worth
considering if a project would like to support Python 3, but is having
difficulty getting there with purely volunteer effort.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 11:32:05 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
  It's consistent with bytearray.join's behaviour:
 
  >>> x = bytearray()
  >>> x.join([b"abc"])
  bytearray(b'abc')
  >>> x
  bytearray(b'')
 
 Yeah, I guess I'm OK with us being consistent on that one. It's still
 weird, but also clearly useful :)
 
 Will the new binary format ever call __format__? I assume not, but it's
 probably best to make that absolutely explicit in the PEP.

Not indeed. I'll add that to the PEP, thanks.

cheers

Antoine.


[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
(Sorry if this messes-up the thread order, it is meant as a reply to the
original RFC.)

Dear list,

newbie here. After much hesitation I decided to put forward a use case
which bothers me about the current proposal. Disclaimer: I happen to write
a library which is directly influenced by this.

As you may know, PDF operates over bytes and an integer or floating-point
number is written down as-is, for example 100 or 1.23.

However, the proposal drops %d, %f and %x formats and the suggested
workaround for writing down a number is to use .encode('ascii'), which I
think has two problems:

One is that it needs to construct one additional object per formatting as
opposed to Python 2; it is not uncommon for a PDF file to contain millions
of numbers.

The second problem is that, in my eyes, it is very counter-intuitive to
require the use of str only to get formatting on bytes. Consider the case
where a large bytes object is created out of many smaller bytes objects. If
I wanted to format a part I had to use str instead. For example:

content = b''.join([
    b'header',
    b'some dictionary structure',
    b'part 1 abc',
    ('part 2 %.3f' % number).encode('ascii'),
    b'trailer'])

In the case of PDF, the embedding of an image into PDF looks like:

10 0 obj
   << /Type /XObject
      /Width 100
      /Height 100
      /Alternates 15 0 R
      /Length 2167
   >>
stream
...binary image data...
endstream
endobj

Because of the image it makes sense to store such structure inside bytes.
On the other hand, there may well be another obj which contains the
coordinates of Bezier paths:

11 0 obj
...
stream
0.5 0.1 0.2 RG
300 300 m
300 400 400 400 400 300 c
b
endstream
endobj

To summarize, there are cases which mix binary and text and, in my
opinion, dropping the bytes-formatting of numbers makes it more complicated
than it was. I would appreciate any explanation on how:

b'%.1f %.1f %.1f RG' % (r, g, b)

is more confusing than:

b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r,
g, b)))

Similar situation exists for HTTP (Content-Length: 123) and ASCII STL
(vertex 1.0 0.0 0.0).
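
The suggested workaround can be wrapped in a small helper so the
str-to-bytes step lives in one place. This is only a sketch: the name
rg_operator is made up for illustration, and it relies on %f formatting
of a float producing ASCII-only text.

```python
def rg_operator(r, g, b):
    """Return a PDF 'RG' operator line as bytes.

    Hypothetical helper: the numbers are formatted as text first and
    the finished string is then encoded to ASCII in a single step.
    """
    text = '%.1f %.1f %.1f RG' % (r, g, b)
    return text.encode('ascii')


print(rg_operator(0.5, 0.1, 0.2))  # b'0.5 0.1 0.2 RG'
```

Formatting the whole line once and encoding once still creates one
intermediate str object per line, which is the overhead described above.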

Thanks and have a nice day,

Juraj Sukop

PS: In the case the proposal will not include the number formatting, it
would be nice to list there a set of guidelines or examples on how to
proceed with porting Python 2 formats to Python 3.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 12:17 PM, Juraj Sukop wrote:
 (Sorry if this messes-up the thread order, it is meant as a reply to the
 original RFC.)
 
 Dear list,
 
 newbie here. After much hesitation I decided to put forward a use case
 which bothers me about the current proposal. Disclaimer: I happen to
 write a library which is directly influenced by this.
 
 As you may know, PDF operates over bytes and an integer or
 floating-point number is written down as-is, for example 100 or 1.23.
 
 However, the proposal drops %d, %f and %x formats and the
 suggested workaround for writing down a number is to use
 .encode('ascii'), which I think has two problems:
 
 One is that it needs to construct one additional object per formatting
 as opposed to Python 2; it is not uncommon for a PDF file to contain
 millions of numbers.
 
 The second problem is that, in my eyes, it is very counter-intuitive to
 require the use of str only to get formatting on bytes. Consider the
 case where a large bytes object is created out of many smaller bytes
 objects. If I wanted to format a part I had to use str instead. For example:
 
 content = b''.join([
 b'header',
 b'some dictionary structure',
 b'part 1 abc',
 ('part 2 %.3f' % number).encode('ascii'),
 b'trailer'])

I agree. I don't see any reason to exclude int and float. See Guido's
messages http://bugs.python.org/issue3982#msg180423 and
http://bugs.python.org/issue3982#msg180430 for some justification and
discussion. Since converting int and float to strings generates a very
small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase
versions), what problem is introduced by allowing int and float? The
original str.format() work relied on this fact in its stringlib
implementation.

Eric.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Mark Lawrence

On 06/01/2014 13:24, Victor Stinner wrote:

Hi,

bytes % args and bytes.format(args) are requested by Mercurial and
Twisted projects. The issue #3982 was stuck because nobody proposed a
complete definition of the new features. Here is a try as a PEP.



Apologies if this has already been said, but Terry Reedy attached a 
proof of concept to issue 3982 which might be worth taking a look at if 
you haven't yet done so.


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Georg Brandl
Am 10.01.2014 18:56, schrieb Eric V. Smith:
 On 1/10/2014 12:17 PM, Juraj Sukop wrote:
 (Sorry if this messes-up the thread order, it is meant as a reply to the
 original RFC.)
 
 Dear list,
 
 newbie here. After much hesitation I decided to put forward a use case
 which bothers me about the current proposal. Disclaimer: I happen to
 write a library which is directly influenced by this.
 
 As you may know, PDF operates over bytes and an integer or
 floating-point number is written down as-is, for example 100 or 1.23.
 
 However, the proposal drops %d, %f and %x formats and the
 suggested workaround for writing down a number is to use
 .encode('ascii'), which I think has two problems:
 
 One is that it needs to construct one additional object per formatting
 as opposed to Python 2; it is not uncommon for a PDF file to contain
 millions of numbers.
 
 The second problem is that, in my eyes, it is very counter-intuitive to
 require the use of str only to get formatting on bytes. Consider the
 case where a large bytes object is created out of many smaller bytes
 objects. If I wanted to format a part I had to use str instead. For example:
 
 content = b''.join([
 b'header',
 b'some dictionary structure',
 b'part 1 abc',
 ('part 2 %.3f' % number).encode('ascii'),
 b'trailer'])
 
 I agree. I don't see any reason to exclude int and float. See Guido's
 messages http://bugs.python.org/issue3982#msg180423 and
 http://bugs.python.org/issue3982#msg180430 for some justification and
 discussion. Since converting int and float to strings generates a very
 small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase
 versions), what problem is introduced by allowing int and float? The
 original str.format() work relied on this fact in its stringlib
 implementation.

I agree.

I would have needed bytes-formatting (with numbers) recently writing .rtf files.

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Chris Barker
On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop juraj.su...@gmail.com wrote:

 As you may know, PDF operates over bytes and an integer or floating-point
 number is written down as-is, for example 100 or 1.23.


Just to be clear here -- is PDF specifically bytes+ascii?

Or could there be some-other-encoding unicode in there?

If so, then you really have a mess!

if it is bytes+ascii, then it seems you could use a unicode object and
encode/decode to latin-1

Perhaps still a bit klunkier than formatting directly into a bytes object,
but workable.

b'%.1f %.1f %.1f RG' % (r, g, b)

 is more confusing than:

 b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'),
 (r, g, b)))


Let's see, I think that would be:

u'%.1f %.1f %.1f RG' % (r, g, b)

then when you want to write it out:

.encode('latin-1')

dumping the binary data in would be a bit uglier, for the image example:

stream
...binary image data...
endstream
endobj

u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')

I think.

not too bad, though if nothing else an alias for latin-1 that made it clear
it worked for this would be nice.

maybe ascii_plus_binary or something?

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Victor Stinner
2014/1/10 Juraj Sukop juraj.su...@gmail.com:
 In the case of PDF, the embedding of an image into PDF looks like:

 10 0 obj
    << /Type /XObject
       /Width 100
       /Height 100
       /Alternates 15 0 R
       /Length 2167
    >>
 stream
 ...binary image data...
 endstream
 endobj

Why not build "10 0 obj ... stream" and "endstream endobj" in
Unicode and then encode to ASCII? Example:

data = b''.join((
  ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
  binary_image_data,
  ("endstream endobj").encode('ascii'),
))

Victor


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 5:12 PM, Victor Stinner wrote:
 2014/1/10 Juraj Sukop juraj.su...@gmail.com:
 In the case of PDF, the embedding of an image into PDF looks like:

 10 0 obj
    << /Type /XObject
       /Width 100
       /Height 100
       /Alternates 15 0 R
       /Length 2167
    >>
 stream
 ...binary image data...
 endstream
 endobj
 
 Why not build "10 0 obj ... stream" and "endstream endobj" in
 Unicode and then encode to ASCII? Example:
 
 data = b''.join((
   ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
   binary_image_data,
   ("endstream endobj").encode('ascii'),
 ))

Isn't the point of the PEP to make it easier to port 2.x code to 3.5? Is
there really existing code like this in 2.x?

I think what we're trying to do is to make code that looks like:
   b'%d %d obj ... stream' % (10, 0)
work in both 2.x and 3.5.
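
One way to write such code so that it runs whether or not the
interpreter supports %-formatting on bytes is to try the direct form
and fall back to the str-then-encode workaround. This is a sketch
only; obj_header is a hypothetical name, and the fallback assumes an
interpreter without the feature raises TypeError for bytes % tuple.

```python
def obj_header(obj_num, gen_num):
    """Return b'<obj_num> <gen_num> obj' on any interpreter.

    Hypothetical helper: tries bytes %-formatting first and falls
    back to formatting a str and encoding it to ASCII.
    """
    try:
        return b'%d %d obj' % (obj_num, gen_num)
    except TypeError:
        # bytes objects without a % operator end up here.
        return ('%d %d obj' % (obj_num, gen_num)).encode('ascii')


print(obj_header(10, 0))
```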

But correct me if I'm wrong. I'll admit to not following 100% of these
emails.

Eric.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 12:56:19 -0500
Eric V. Smith e...@trueblade.com wrote:
 
 I agree. I don't see any reason to exclude int and float. See Guido's
 messages http://bugs.python.org/issue3982#msg180423 and
 http://bugs.python.org/issue3982#msg180430 for some justification and
 discussion.

If you are representing int and float, you're really formatting a text
message, not bytes. Basically if you allow the formatting of int and
float instances, there's no reason not to allow the formatting of
arbitrary objects through __str__. It doesn't make sense to
special-case those two types and nothing else.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
 On Fri, 10 Jan 2014 12:56:19 -0500
 Eric V. Smith e...@trueblade.com wrote:

 I agree. I don't see any reason to exclude int and float. See Guido's
 messages http://bugs.python.org/issue3982#msg180423 and
 http://bugs.python.org/issue3982#msg180430 for some justification and
 discussion.
 
 If you are representing int and float, you're really formatting a text
 message, not bytes. Basically if you allow the formatting of int and
 float instances, there's no reason not to allow the formatting of
 arbitrary objects through __str__. It doesn't make sense to
 special-case those two types and nothing else.

It might not for .format(), but I'm not convinced. But for %-formatting,
str is already special-cased for these types.

Eric.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 17:20:32 -0500
Eric V. Smith e...@trueblade.com wrote:
 
 Isn't the point of the PEP to make it easier to port 2.x code to 3.5?
 Is there really existing code like this in 2.x?

No, but so what? The point of the PEP is not to allow arbitrary
Python 2 code to run without modification under Python 3. There's a
reason we broke compatibility, and there's no way we're gonna undo that.

 I think what we're trying to do is to make code that looks like:
b'%d %d obj ... stream' % (10, 0)
 work in both 2.x and 3.5.

That's not what *I* am trying to do. As far as I'm concerned the aim of
the PEP is to ease bytes interpolation, not to provide some kind of
magical construct that will solve everyone's porting problems.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 17:33:57 -0500
Eric V. Smith e...@trueblade.com wrote:
 On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
  On Fri, 10 Jan 2014 12:56:19 -0500
  Eric V. Smith e...@trueblade.com wrote:
 
  I agree. I don't see any reason to exclude int and float. See Guido's
  messages http://bugs.python.org/issue3982#msg180423 and
  http://bugs.python.org/issue3982#msg180430 for some justification and
  discussion.
  
  If you are representing int and float, you're really formatting a text
  message, not bytes. Basically if you allow the formatting of int and
  float instances, there's no reason not to allow the formatting of
  arbitrary objects through __str__. It doesn't make sense to
  special-case those two types and nothing else.
 
 It might not for .format(), but I'm not convinced. But for %-formatting,
 str is already special-cased for these types.

That's not what I'm saying. str.__mod__ is able to represent all kinds
of types through %s and calling __str__. It doesn't make sense for
bytes.__mod__ to only support int and float. Why only them?

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 02:42 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 17:33:57 -0500
Eric V. Smith e...@trueblade.com wrote:

On 1/10/2014 5:29 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 12:56:19 -0500
Eric V. Smith e...@trueblade.com wrote:


I agree. I don't see any reason to exclude int and float. See Guido's
messages http://bugs.python.org/issue3982#msg180423 and
http://bugs.python.org/issue3982#msg180430 for some justification and
discussion.


If you are representing int and float, you're really formatting a text
message, not bytes. Basically if you allow the formatting of int and
float instances, there's no reason not to allow the formatting of
arbitrary objects through __str__. It doesn't make sense to
special-case those two types and nothing else.


It might not for .format(), but I'm not convinced. But for %-formatting,
str is already special-cased for these types.


That's not what I'm saying. str.__mod__ is able to represent all kinds
of types through %s and calling __str__. It doesn't make sense for
bytes.__mod__ to only support int and float. Why only them?


Because embedding the ASCII equivalent of ints and floats in byte streams is a 
common operation?

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 14:58:15 -0800
Ethan Furman et...@stoneleaf.us wrote:
 On 01/10/2014 02:42 PM, Antoine Pitrou wrote:
  On Fri, 10 Jan 2014 17:33:57 -0500
  Eric V. Smith e...@trueblade.com wrote:
  On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
  On Fri, 10 Jan 2014 12:56:19 -0500
  Eric V. Smith e...@trueblade.com wrote:
 
  I agree. I don't see any reason to exclude int and float. See Guido's
  messages http://bugs.python.org/issue3982#msg180423 and
  http://bugs.python.org/issue3982#msg180430 for some justification and
  discussion.
 
  If you are representing int and float, you're really formatting a text
  message, not bytes. Basically if you allow the formatting of int and
  float instances, there's no reason not to allow the formatting of
  arbitrary objects through __str__. It doesn't make sense to
  special-case those two types and nothing else.
 
  It might not for .format(), but I'm not convinced. But for %-formatting,
  str is already special-cased for these types.
 
  That's not what I'm saying. str.__mod__ is able to represent all kinds
  of types through %s and calling __str__. It doesn't make sense for
  bytes.__mod__ to only support int and float. Why only them?
 
 Because embedding the ASCII equivalent of ints and floats in byte streams
 is a common operation?

Again, if you're representing ASCII, you're representing text and
should use a str object.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 18:14:45 -0500
Eric V. Smith e...@trueblade.com wrote:
 
  Because embedding the ASCII equivalent of ints and floats in byte streams
  is a common operation?
  
  Again, if you're representing ASCII, you're representing text and
  should use a str object.
 
 Yes, but is there existing 2.x code that uses %s for int and float
 (perhaps unwittingly), and do we want to help that code out? Or do we
 want to make porters first change to using %d or %f instead of %s?

I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and
%f on bytes objects.

 I think what you're getting at is that in addition to not calling
 __format__, we don't want to call __str__, either, for the same reason.

Not only. We don't want to do anything that actually asks for a
*textual* representation of something. %d and %f ask for a textual
representation of a number, so they're right out.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Fri, Jan 10, 2014 at 10:52 PM, Chris Barker chris.bar...@noaa.gov wrote:

 On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop juraj.su...@gmail.com wrote:

 As you may know, PDF operates over bytes and an integer or floating-point
 number is written down as-is, for example 100 or 1.23.


 Just to be clear here -- is PDF specifically bytes+ascii?

 Or could there be some-other-encoding unicode in there?


From the specs: "At the most fundamental level, a PDF file is a sequence of
8-bit bytes." But it is also possible to represent a PDF using printable
ASCII + whitespace by using escapes and filters. Then, there are also
"text strings" which might be encoded in UTF-16.

What this all means is that the PDF objects are expressed in ASCII,
"stream objects" like images and fonts may have a binary part and I never
saw those UTF-16 strings.


u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')


The argument for dropping %f et al. has been that if something is a text,
then it should be Unicode. Conversely, if it is not text, then it should
not be Unicode.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner
victor.stin...@gmail.com wrote:


 Why not build "10 0 obj ... stream" and "endstream endobj" in
 Unicode and then encode to ASCII? Example:

 data = b''.join((
   ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
   binary_image_data,
   ("endstream endobj").encode('ascii'),
 ))


The key is "encode to ASCII", which means that the result is bytes. Then,
there is this "11 0 obj" which should also be bytes. But it has no
"binary_image_data" - only lots of numbers waiting to be somehow converted
to bytes. I already mentioned the problems with .encode('ascii') but it
does not stop here. Numbers may appear not only inside streams but almost
anywhere: in the header there is PDF version, an image has to have width
and height, at the end of PDF there is a structure containing offsets to
all of the objects in file. Basically, to .encode('ascii') every possible
number is not exactly simple or pretty.
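
To make the point concrete, here is a rough sketch of writing a
numbers-only stream object with the str-then-encode workaround. The
helper name content_stream_object and the minimal dictionary (only
/Length) are assumptions for illustration, not a complete PDF writer.

```python
def content_stream_object(obj_num, operators):
    """Build '<obj_num> 0 obj ... endobj' as bytes.

    Hypothetical helper: 'operators' is a list of str lines such as
    '300 300 m'. Every number passes through str formatting and an
    ASCII encode before it can be placed in the output.
    """
    body = '\n'.join(operators).encode('ascii')
    header = '%d 0 obj\n<< /Length %d >>\nstream\n' % (obj_num, len(body))
    return header.encode('ascii') + body + b'\nendstream\nendobj\n'


path = [
    '%.1f %.1f %.1f RG' % (0.5, 0.1, 0.2),
    '%d %d m' % (300, 300),
    '%d %d %d %d %d %d c' % (300, 400, 400, 400, 400, 300),
    'b',
]
print(content_stream_object(11, path).decode('ascii'))
```

The object number and the stream length are numbers too, so the encode
step shows up in the header as well as in the body.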


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Sat, 11 Jan 2014 00:43:39 +0100
Juraj Sukop juraj.su...@gmail.com wrote:
 Basically, to .encode('ascii') every possible
 number is not exactly simple or pretty.

Well it strikes me that the PDF format itself is not exactly simple or
pretty. It might be convenient that Python 2 allows you, in certain
cases, to ignore encoding issues because the main text type is
actually a bytestring, but under the Python 3 model there's no reason
to allow the same shortcuts.

Also, when you say you've never encountered UTF-16 text in PDFs, it
sounds like those people who've never encountered any non-ASCII data in
their programs.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Chris Barker
On Fri, Jan 10, 2014 at 3:40 PM, Juraj Sukop juraj.su...@gmail.com wrote:

 What this all means is that the PDF objects are expressed in ASCII,
 "stream objects" like images and fonts may have a binary part and I never
 saw those UTF-16 strings.


hmm -- I wonder if they are out there in the wild, though


  u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')


 The argument for dropping %f et al. has been that if something is a
 text, then it should be Unicode. Conversely, if it is not text, then it
 should not be Unicode.





What I'm trying to demonstrate / test is that you can use unicode objects
for mixed binary + ascii, if you make sure to encode/decode using latin-1
(any others?). The idea is that ascii can be seen/used as text, and other
bytes are preserved, and you can ignore whatever meaning latin-1 gives them.

Using unicode objects means that you can use the existing string formatting
(%s), and if you want to pass in binary blobs, you need to decode them as
latin-1, creating a unicode object, which will get interpolated into your
unicode object. When that unicode object gets encoded back to latin-1, the
original bytes are preserved.

I think this is confusing, as we are calling it latin-1, but not really
using it that way, but it seems it should work.
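
A quick check of the round trip described above, as a sketch: the
function name embed_binary is made up, and the test data is simply
every possible byte value, standing in for real image data.

```python
def embed_binary(binary_data):
    """Interpolate raw bytes into a text template via latin-1.

    Hypothetical helper: latin-1 maps each byte 0-255 to the code
    point with the same number, so decode followed by encode
    returns the original bytes unchanged.
    """
    text = u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')
    return text.encode('latin-1')


all_bytes = bytes(bytearray(range(256)))
result = embed_binary(all_bytes)
print(result == b"stream\n" + all_bytes + b"\nendstream\nendobj")  # True
```

This only holds if nothing non-latin-1 is interpolated into the
template; a character above U+00FF would make the final encode fail.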

-Chris





-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou solip...@pitrou.net wrote:

 Also, when you say you've never encountered UTF-16 text in PDFs, it
 sounds like those people who've never encountered any non-ASCII data in
 their programs.


Let me clarify: one does not think of "writing text" in Unicode terms in
PDF. Instead, one records the sequence of "character codes" which
correspond to "glyphs" or the glyph IDs directly. That's because one
Unicode character may have more than one glyph and more characters can be
shown as one glyph.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/08/2014 02:42 PM, Antoine Pitrou wrote:


With Victor's consent, I overhauled PEP 460 and made the feature set
more restricted and consistent with the bytes/str separation.


From the PEP:
=

Python 3 generally mandates that text be stored and manipulated as
 unicode (i.e. str objects, not bytes). In some cases, though, it
 makes sense to manipulate bytes objects directly. Typical usage is
 binary network protocols, where you can want to interpolate and
 assemble several bytes object (some of them literals, some of them
 compute) to produce complete protocol messages. For example,
 protocols such as HTTP or SIP have headers with ASCII names and
 opaque textual values using a varying and/or sometimes ill-defined
 encoding. Moreover, those headers can be followed by a binary
 body... which can be chunked and decorated with ASCII headers and
 trailers!


As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and 
then refuses to allow % to embed ASCII text in the byte stream.



All other features present in formatting of str objects (either
 through the percent operator or the str.format() method) are
 unsupported. Those features imply treating the recipient of the
 operator or method as text, which goes counter to the text / bytes
 separation (for example, accepting %d as a format code would imply
 that the bytes object really is a ASCII-compatible text string).


No, it implies that portion of the byte stream is ASCII compatible.  And we have several examples: PDF, HTML, DBF, just 
about every network protocol (not counting M$), and, I'm sure, plenty I haven't heard of.



-1 on the PEP as it stands now.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 16:23:53 -0800
Ethan Furman et...@stoneleaf.us wrote:
 On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
 
  With Victor's consent, I overhauled PEP 460 and made the feature set
  more restricted and consistent with the bytes/str separation.
 
  From the PEP:
 =
  Python 3 generally mandates that text be stored and manipulated as
   unicode (i.e. str objects, not bytes). In some cases, though, it
   makes sense to manipulate bytes objects directly. Typical usage is
   binary network protocols, where you can want to interpolate and
   assemble several bytes object (some of them literals, some of them
   compute) to produce complete protocol messages. For example,
   protocols such as HTTP or SIP have headers with ASCII names and
   opaque textual values using a varying and/or sometimes ill-defined
   encoding. Moreover, those headers can be followed by a binary
   body... which can be chunked and decorated with ASCII headers and
   trailers!
 
 As it stands now, the PEP talks about ASCII, about how it makes sense
 sometimes to work directly with bytes objects, and 
 then refuses to allow % to embed ASCII text in the byte stream.

Indeed I refuse for %-formatting to allow the mixing of bytes and str
objects, in the same way that it is forbidden to concatenate "a" and
b"b" together, or to write b"".join(["abc"]).

Python 3 was made *precisely* because the implicit conversion between
ASCII unicode and bytes is deemed harmful. It's completely
counter-productive and misleading for our users to start muddying the
message by introducing exceptions to that rule.
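
A minimal sketch of the two operations mentioned above, assuming any Python 3 interpreter. The helper name raises_type_error is made up for illustration. Both operations are rejected with TypeError rather than converted implicitly:

```python
def raises_type_error(func):
    """Return True if calling func() raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False


# Both operations mix str and bytes, so Python 3 refuses them.
concat_rejected = raises_type_error(lambda: "a" + b"b")
join_rejected = raises_type_error(lambda: b"".join(["abc"]))

# The explicit spelling encodes the text first.
explicit = "a".encode("ascii") + b"b"
```

The explicit encode step is what the bytes/str separation asks for: the caller states the encoding instead of relying on a default.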

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 8:12 PM, Antoine Pitrou wrote:
 On Fri, 10 Jan 2014 16:23:53 -0800
 Ethan Furman et...@stoneleaf.us wrote:
 On 01/08/2014 02:42 PM, Antoine Pitrou wrote:

 With Victor's consent, I overhauled PEP 460 and made the feature set
 more restricted and consistent with the bytes/str separation.

  From the PEP:
 =
 Python 3 generally mandates that text be stored and manipulated as
  unicode (i.e. str objects, not bytes). In some cases, though, it
  makes sense to manipulate bytes objects directly. Typical usage is
  binary network protocols, where you may want to interpolate and
  assemble several bytes objects (some of them literals, some of them
  computed) to produce complete protocol messages. For example,
  protocols such as HTTP or SIP have headers with ASCII names and
  opaque textual values using a varying and/or sometimes ill-defined
  encoding. Moreover, those headers can be followed by a binary
  body... which can be chunked and decorated with ASCII headers and
  trailers!

 As it stands now, the PEP talks about ASCII, about how it makes sense
 sometimes to work directly with bytes objects, and 
 then refuses to allow % to embed ASCII text in the byte stream.
 
 Indeed I refuse for %-formatting to allow the mixing of bytes and str
 objects, in the same way that it is forbidden to concatenate "a" and
 b"b" together, or to write b"".join(["abc"]).

I think:
'a' + b'b'
is different from:
b'Content-Length: %d' % 42

The former we want to prevent, but I see nothing wrong with the latter.
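
For comparison, a small sketch of the second form. It assumes Python 3.5 or later, where %-formatting on bytes objects is available; on the Python 3 releases current at the time of this thread the first line raises TypeError, and only the encode-based spelling works:

```python
# Assumes Python 3.5 or later, where bytes objects accept %-formatting.
header = b"Content-Length: %d" % 42

# The same bytes built by formatting text first and encoding explicitly,
# which also works on earlier Python 3 releases.
via_text = ("Content-Length: %d" % 42).encode("ascii")
```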

So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
3982. See for example http://bugs.python.org/issue3982#msg180432 .

Eric.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 20:53:09 -0500
Eric V. Smith e...@trueblade.com wrote:
 
 So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
 3982. See for example http://bugs.python.org/issue3982#msg180432 .

Then we might as well not do anything, since any attempt to advance
things is met by stubborn opposition in the name of "not far enough".

(I don't care much personally, I think the issue is quite overblown
anyway)

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 06:04 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 20:53:09 -0500
Eric V. Smith e...@trueblade.com wrote:


So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
3982. See for example http://bugs.python.org/issue3982#msg180432 .


Then we might as well not do anything, since any attempt to advance
things is met by stubborn opposition in the name of "not far enough".


Heh, and here I thought it was stubborn opposition in the name of purity.  ;)



(I don't care much personally, I think the issue is quite overblown
anyway)


Is it safe to assume you don't use Python for the use-cases under discussion?  Specifically, mixed ASCII, binary, and 
encoded-text byte streams?


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 18:28:41 -0800
Ethan Furman et...@stoneleaf.us wrote:
 
 Is it safe to assume you don't use Python for the use-cases under discussion?

You know, I've done quite a bit of network programming. I've also done
an experimental port of Twisted to Python 3. I know what a network
protocol with ill-defined encodings looks like.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread INADA Naoki
To avoid implicit conversion between str and bytes, I propose adding only
a limited %-format, not .format() or .format_map().

Limited %-format means:

%c accepts an integer or a bytes object of length one.
%r is not supported.
%s accepts only bytes.
%a is the only format code that accepts an arbitrary object.

All other format codes behave the same as for str.
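
A rough sketch of how the codes listed above behave, assuming Python 3.5 or later, where bytes %-formatting exists. It is not the proposal verbatim: in those releases %r is accepted as an alias for %a rather than being unsupported.

```python
# Assumes Python 3.5 or later.
one_from_int = b"%c" % 65        # an integer in range(256)
one_from_bytes = b"%c" % b"A"    # a bytes object of length one
only_bytes = b"%s" % b"abc"      # bytes are interpolated unchanged
any_object = b"%a" % "caf\xe9"   # ascii() of an arbitrary object

# %s refuses str, so no implicit encoding takes place.
try:
    b"%s" % "abc"
    str_rejected = False
except TypeError:
    str_rejected = True
```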



On Sat, Jan 11, 2014 at 8:24 AM, Antoine Pitrou solip...@pitrou.net wrote:

 On Fri, 10 Jan 2014 18:14:45 -0500
 Eric V. Smith e...@trueblade.com wrote:
 
   Because embedding the ASCII equivalent of ints and floats in byte
   streams is a common operation?
  
   Again, if you're representing ASCII, you're representing text and
   should use a str object.
 
  Yes, but is there existing 2.x code that uses %s for int and float
  (perhaps unwittingly), and do we want to help that code out? Or do we
  want to make porters first change to using %d or %f instead of %s?

 I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and
 %f on bytes objects.

  I think what you're getting at is that in addition to not calling
  __format__, we don't want to call __str__, either, for the same reason.

 Not only. We don't want to do anything that actually asks for a
 *textual* representation of something. %d and %f ask for a textual
 representation of a number, so they're right out.

 Regards

 Antoine.






-- 
INADA Naoki  songofaca...@gmail.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 06:39 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 18:28:41 -0800
Ethan Furman wrote:


Is it safe to assume you don't use Python for the use-cases under discussion?


You know, I've done quite a bit of network programming.


No, I didn't, that's why I asked.


I've also done an experimental port of Twisted to Python 3.
I know what a network protocol with ill-defined encodings
 looks like.


Can you give a code sample of what you think, for example, the PDF generation code should look like?  (If you already 
have, I apologize -- I missed it in all the ruckus.)


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 06:39 PM, Antoine Pitrou wrote:


I know what a network protocol with ill-defined encodings
 looks like.


For the record, I've been (and I suspect Eric and some others have also been) talking about well-defined encodings.  For 
the DBF files that I work with, there is binary, ASCII, and a third encoding that is specified in the file header.


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Cameron Simpson
On 11Jan2014 00:43, Juraj Sukop juraj.su...@gmail.com wrote:
 On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner
 victor.stin...@gmail.com wrote:
  Why not build "10 0 obj ... stream" and "endstream endobj" in
  Unicode and then encode to ASCII? Example:
 
  data = b''.join((
    ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
    binary_image_data,
    ("endstream endobj").encode('ascii'),
  ))
 
 The key is "encode to ASCII", which means that the result is bytes. Then,
 there is this "11 0 obj" which should also be bytes. But it has no
 binary_image_data - only lots of numbers waiting to be somehow converted
 to bytes. I already mentioned the problems with .encode('ascii') but it
 does not stop here. Numbers may appear not only inside streams but almost
 anywhere: in the header there is PDF version, an image has to have width
 and height, at the end of PDF there is a structure containing offsets to
 all of the objects in file. Basically, to .encode('ascii') every possible
 number is not exactly simple or pretty.

Hi Juraj,

Might I suggest a helper function (outside the PEP scope) instead
of arguing for support for %f et al?

Thus:

  def bytify(things, encoding='ascii'):
    # Pass bytes through untouched; turn everything else into text first.
    for thing in things:
      if isinstance(thing, bytes):
        yield thing
      else:
        yield str(thing).encode(encoding)

Then one's embedding in PDF might become, more readably:

  data = b' '.join( bytify( [ 10, 0, 'obj', binary_image_data, ... ] ) )

Of course, bytify might be augmented with whatever encoding facilities
might suit your needs.

Cheers,
-- 
Cameron Simpson c...@zip.com.au

We tend to overestimate the short-term impact of technological change and
underestimate its long-term impact. - Amara's Law
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Steven D'Aprano
On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:

 As you may know, PDF operates over bytes and an integer or floating-point
 number is written down as-is, for example 100 or 1.23.

I'm sorry, I don't understand what you mean here. I'm honestly not 
trying to be difficult, but you sound confident that you understand what 
you are doing, but your description doesn't make sense to me. To me, it 
looks like you are conflating bytes and ASCII characters, that is, 
assuming that characters are in some sense identical to their ASCII 
representation. Let me explain:

The integer that in English is written as 100 is represented in memory
as bytes 0x0064 (assuming a big-endian C short), so when you say "an
integer is written down AS-IS" (emphasis added), to me that says that
the PDF file includes the bytes 0x0064. But then you go on to write the
three character string "100", which (assuming ASCII) is the bytes
0x313030. Going from the C short to the ASCII representation 0x313030 is
nothing like inserting the int "as-is". To put it another way, the
Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary
file which includes some ASCII-formatted text fields. So when writing an
integer 100, rather than writing it "as is", which would be byte 0x64
(with however many leading null bytes needed for padding), it is
converted to ASCII representation 0x313030 first, and that's what needs
to be inserted.
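
The two representations can be compared directly. This is a small sketch using the struct module and bytes.hex() (the latter assumes Python 3.5 or later):

```python
import struct

# The integer 100 stored "as-is" in a big-endian C short: two raw bytes.
as_c_short = struct.pack(">h", 100)

# The same integer as ASCII text, which is what the PDF file contains.
as_ascii = str(100).encode("ascii")

print(as_c_short.hex())  # 0064
print(as_ascii.hex())    # 313030
```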

If you consider PDF as binary with occasional pieces of ASCII text, then 
working with bytes makes sense. But I wonder whether it might be better 
to consider PDF as mostly text with some binary bytes. Even though the 
bulk of the PDF will be binary, the interesting bits are text. E.g. your 
example:

 In the case of PDF, the embedding of an image into PDF looks like:
 
 10 0 obj
   << /Type /XObject
      /Width 100
      /Height 100
      /Alternates 15 0 R
      /Length 2167
   >>
 stream
 ...binary image data...
 endstream
 endobj


Even though the binary image data is probably much, much larger in 
length than the text shown above, it's (probably) trivial to deal with: 
convert your image data into bytes, decode those bytes into Latin-1, 
then concatenate the Latin-1 string into the text above.

Latin-1 has the nice property that every byte decodes into the character
with the same code point, and vice versa. So:

for i in range(256):
    assert bytes([i]).decode('latin-1') == chr(i)
    assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode 
text with embedded binary data, rather than binary data with embedded 
ASCII text. Then when writing the file to disk, of course you encode it 
to Latin-1, either explicitly:

pdf = ... # Unicode string containing the PDF contents
with open("outfile.pdf", "wb") as f:
    f.write(pdf.encode("latin-1"))

or implicitly:

with open("outfile.pdf", "w", encoding="latin-1") as f:
    f.write(pdf)
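
Put together, the Latin-1 approach might look like the sketch below. The image data is a made-up stand-in (every possible byte value once), and the dictionary is abbreviated from the PDF object quoted earlier:

```python
# Hypothetical stand-in for real image data: every possible byte value.
binary_image_data = bytes(range(256))

# Format the textual part as str, where %d is available.
header = (
    "10 0 obj\n"
    "<< /Type /XObject /Width %d /Height %d /Length %d >>\n"
    "stream\n"
) % (100, 100, len(binary_image_data))

# Latin-1 maps each byte to the code point with the same value.
pdf_object = header + binary_image_data.decode("latin-1") + "\nendstream\nendobj\n"

# Encoding back to Latin-1 restores the original bytes exactly.
raw = pdf_object.encode("latin-1")
```

One possible wrinkle with the implicit form: a file opened in text mode with default settings translates "\n" on write to the platform line separator, which would alter embedded binary data on platforms where that separator is not "\n". Writing the explicitly encoded bytes in "wb" mode avoids this.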


There may be a few wrinkles I haven't thought of, I don't claim to be an 
expert on PDF. But I see no reason why PDF files ought to be an 
exception to the rule:

* work internally with Unicode text;

* convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and better, the internal 
representation of Unicode strings containing only code points up to 255 
(i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte 
per character.
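
A quick way to observe this, assuming CPython 3.3 or later (the exact sizes are an implementation detail and vary between versions, so only the comparison is meaningful):

```python
import sys

# Two strings of equal length: one within Latin-1, one outside it.
latin1_text = "\xe9" * 1000    # code point 233, fits in one byte
wider_text = "\u20ac" * 1000   # code point 8364, needs a wider storage unit

latin1_size = sys.getsizeof(latin1_text)
wider_size = sys.getsizeof(wider_text)
```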

Another advantage is that using text rather than bytes means that your 
example:

[...]
 dropping the bytes-formatting of numbers makes it more complicated
 than it was. I would appreciate any explanation on how:
 
 b'%.1f %.1f %.1f RG' % (r, g, b)

becomes simply

'%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can be written as:

u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2. 
That's *much* nicer than your suggestion:


 is more confusing than:
 
 b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), 
  (r, g, b)))




-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Georg Brandl
On 11.01.2014 03:04, Antoine Pitrou wrote:
 On Fri, 10 Jan 2014 20:53:09 -0500
 Eric V. Smith e...@trueblade.com wrote:
 
 So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
 3982. See for example http://bugs.python.org/issue3982#msg180432 .

I agree.

 Then we might as well not do anything, since any attempt to advance
 things is met by stubborn opposition in the name of "not far enough".
 
 (I don't care much personally, I think the issue is quite overblown
 anyway)

So you wouldn't mind another overhaul of the PEP including a bit more
functionality again? :)  I really think that practicality beats purity
here.  (I'm not advocating freely mixing bytes and str, mind you!)

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Nick Coghlan
On 9 Jan 2014 11:29, INADA Naoki songofaca...@gmail.com wrote:


 And I think everyone was well intentioned - and python3 covers most of
 the bases, but working with binary data is not only a wire-protocol
 programmer's problem.

If you're working with binary data, use the binary API offered by bytes,
bytearray and memoryview.
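
As a rough illustration of that binary API, with a made-up header and payload:

```python
# Assemble a message in a mutable buffer.
message = bytearray(b"HDR:")
message += bytes([0, 1, 2, 3])   # raw payload bytes
message.extend(b"\r\n")

# Read part of it through a memoryview without copying.
view = memoryview(message)
payload = view[4:8]              # zero-copy slice of the payload
payload_copy = bytes(payload)    # copy only when a bytes object is needed
```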

 Needing a library to wrap bytesthing.format('ascii', 'surrogateescape')
 or some such thing makes python3 less approachable for those who haven't
 learned that yet - which was almost all of us at some point when we
 started programming.

 Totally agree with you.

If you're on a relatively modern OS, everything should be UTF-8 and you
should be fine as a beginner.

When you start encountering malformed data, Python 3 should throw an error,
and provide an opportunity to learn more (by looking up the error message),
where Python 2 would silently corrupt the data stream.
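
A small sketch of that behaviour, using a made-up byte string that is valid Latin-1 but not valid UTF-8. The surrogateescape error handler mentioned earlier in the thread is shown as one way to carry such bytes through str unchanged:

```python
malformed = b"caf\xe9"   # Latin-1 bytes, not valid UTF-8

# Strict decoding fails loudly instead of guessing.
try:
    malformed.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# surrogateescape smuggles undecodable bytes through str and back.
text = malformed.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")
```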

Python 2 enshrined a data model eminently suitable for boundary code that
dealt with ASCII compatible binary protocols (like web frameworks) as the
default text model. Application code then needed to take special steps to
get correct behaviour for the full Unicode range. In essence, the Python 2
text model is the POSIX text model with Unicode support bolted on to the
side to make it at least *possible* to write correct application code.

This is completely backwards. Web applications vastly outnumber web
frameworks, and the same goes for every other domain: applications are
vastly more common than the libraries and frameworks that handle data
transformations at system boundaries on their behalf, so making the latter
easier to write at the expense of the former is a deeply flawed design
choice.

So Python 3 reverses the situation: the core text model is now more
appropriate for the central application code, *after* the boundary code has
cleaned up the murky details of wire protocols and file formats.

This is pretty easy to deal with for *new* Python 3 code, since you just
write things to deal with either bytes or text as appropriate.

However, there is some code written for Python 2 that relies more heavily
on the ability to treat ascii compatible binary data as both binary data
*and* as text. This is the use case that Python 3 treats as a more
specialised use case (perhaps benefitting from a specialised third party
type), whereas Python 2 supports it by default.

This is also the use case that relied most heavily on implicit encoding and
decoding, since that's the mechanism that allows the 8-bit and Unicode
paths to share string literals.

Cheers,
Nick.



 --
 INADA Naoki  songofaca...@gmail.com



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Antoine Pitrou
On Thu, 09 Jan 2014 03:54:13 +
MRAB pyt...@mrabarnett.plus.com wrote:
 I'm thinking that the i format could be used for signed integers and
 the u for unsigned integers. The width would be the number of bytes.
 You would also need to have a way of specifying the endianness.
 
 For example:
 
   b'{:2i}'.format(256)
 b'\x01\x00'
   b'{:2i}'.format(256)
 b'\x00\x01'

The goal is not to add an alternative to the struct module. If you need
binary packing/unpacking, just use struct.
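
For instance, a fixed-width integer in either byte order can be produced with struct as sketched below; the format codes ">h" and "<h" select a big-endian and a little-endian signed 16-bit integer respectively:

```python
import struct

# A signed 16-bit integer in both byte orders.
big_endian = struct.pack(">h", 256)
little_endian = struct.pack("<h", 256)

# Unpacking returns a tuple, one item per format code.
(value,) = struct.unpack(">h", big_endian)
```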

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Barry Warsaw
On Jan 08, 2014, at 01:51 PM, Stephen J. Turnbull wrote:

Benjamin Peterson writes:

  I agree. This is a very important, much-requested feature for low-level
  networking code.

I hear it's much-requested, but is there any description of typical
use cases?

The two unported libraries that are preventing me from switching Mailman 3 to
Python 3 are restish and storm.  For storm, there's a viable alternative in
SQLAlchemy though I haven't looked at how difficult it will be to port the
model layer (even though we once did use SA).

restish is tougher.  I've investigated flask, pecan, wsme, and a few others
that already have Python 3 support and none of them provide an API that I
consider as nice a fit as restish for our standalone WSGI-based REST admin
server.  That's not to denigrate those other projects, it's just that I think
restish hit the sweet spot, and porting Mailman 3 to some other framework so
far has proven unworkable (I've tried with each of them).

restish is plumbing so I think it's a good test case for Nick's observations
of a wire-protocol layer library, and it's obvious that it Just Works in
Python 2 but doesn't work at all in Python 3.  There have been at least 3
attempts to port restish to Python 3 and all of them get stuck in various
places where you actually *can't* decide whether some data structure should be
a bytes or str.  Make one choice and you get stuck over here, make the other
choice and you get stuck over there.

I've got two abandoned branches on github with (rather old) porting attempts,
and I know other developers have some branches as well.  Having given up on
trying to switch to a different framework, I'm starting over again with
restish (really, it's wonderful :).

I plan on keeping more detailed notes this time specifically so that I can
help contribute to this discussion.  If anybody wants to pitch in, both for
the specific purpose of porting the library, and for the more general insights
it could provide for this thread, please get in touch.

Cheers,
-Barry




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Nick Coghlan
On 9 Jan 2014 06:43, Antoine Pitrou solip...@pitrou.net wrote:


 Hi,

 With Victor's consent, I overhauled PEP 460 and made the feature set
 more restricted and consistent with the bytes/str separation.

+1

I was initially dubious about the idea, but the proposed semantics look
good to me.

We should probably include format_map for consistency with the str API.

 However, I also added bytearray into the mix, as bytearray objects should
 generally support the same operations as bytes (and they can be useful
 *especially* for network programming).

So we'd define the *format* string as mutable to get a mutable result out
of the formatting operations? This seems a little weird to me.

It also seems weird for a format method on a mutable type to *not* perform
in-place mutation.

On the other hand, I don't see another obvious way to control the output
type.

Cheers,
Nick.


 Regards

 Antoine.



 On Mon, 6 Jan 2014 14:24:50 +0100
 Victor Stinner victor.stin...@gmail.com wrote:
  Hi,
 
  bytes % args and bytes.format(args) are requested by Mercurial and
  Twisted projects. The issue #3982 was stuck because nobody proposed a
  complete definition of the new features. Here is a try as a PEP.
 
  The PEP is a draft with open questions. First, I'm not sure that both
  bytes%args and bytes.format(args) are needed. The implementation of
  .format() is more complex, so why not only adding bytes%args? Then,
  the following points must be decided to define the complete list of
  supported features (formatters):




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Antoine Pitrou
On Fri, 10 Jan 2014 05:26:04 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
 We should probably include format_map for consistency with the str API.

Yes, you're right.

  However, I also added bytearray into the mix, as bytearray objects should
  generally support the same operations as bytes (and they can be useful
  *especially* for network programming).
 
 So we'd define the *format* string as mutable to get a mutable result out
 of the formatting operations? This seems a little weird to me.
 
 It also seems weird for a format method on a mutable type to *not* perform
 in-place mutation.

It's consistent with bytearray.join's behaviour:

>>> x = bytearray()
>>> x.join([b"abc"])
bytearray(b'abc')
>>> x
bytearray(b'')
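
A sketch of the same pattern applied to formatting. It assumes Python 3.5 or later, where bytearray supports the % operator: the result is a new bytearray and the template is left untouched.

```python
# Assumes Python 3.5 or later, where bytearray supports %-formatting.
template = bytearray(b"Content-Length: %d")
result = template % 42
```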


Regards

Antoine.