Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.

2014-01-12 Thread INADA Naoki
What about using venv and pip instead of svn?


On Sun, Jan 12, 2014 at 4:12 PM, Georg Brandl  wrote:

> On 11.01.2014 21:11, Terry Reedy wrote:
> > On 1/11/2014 2:04 PM, georg.brandl wrote:
> >> http://hg.python.org/cpython/rev/87bdee4d633a
> >> changeset:   88413:87bdee4d633a
> >> branch:  3.3
> >> parent:  88410:05e84d3ecd1e
> >> user:Georg Brandl 
> >> date:Sat Jan 11 20:04:19 2014 +0100
> >> summary:
> >>Update Sphinx toolchain.
> >>
> >> files:
> >>Doc/Makefile |  8 
> >>1 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >>
> >> diff --git a/Doc/Makefile b/Doc/Makefile
> >> --- a/Doc/Makefile
> >> +++ b/Doc/Makefile
> >> @@ -41,19 +41,19 @@
> >>   checkout:
> >>  @if [ ! -d tools/sphinx ]; then \
> >>echo "Checking out Sphinx..."; \
> >> -  svn checkout $(SVNROOT)/external/Sphinx-1.0.7/sphinx tools/sphinx; \
> >> +  svn checkout $(SVNROOT)/external/Sphinx-1.2/sphinx tools/sphinx; \
> >>  fi
> >
> > Doc/make.bat needs to be similarly updated.
>
> Indeed, thanks for the reminder.
>
> Georg
>
> ___
> Python-Dev mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com
>



-- 
INADA Naoki  
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.

2014-01-12 Thread Georg Brandl
Planned :)

Georg

On 12.01.2014 09:12, INADA Naoki wrote:
> What about using venv and pip instead of svn?
> 
> 
> On Sun, Jan 12, 2014 at 4:12 PM, Georg Brandl wrote:
> 
> On 11.01.2014 21:11, Terry Reedy wrote:
> > On 1/11/2014 2:04 PM, georg.brandl wrote:
> >> http://hg.python.org/cpython/rev/87bdee4d633a
> >> changeset:   88413:87bdee4d633a
> >> branch:  3.3
> >> parent:  88410:05e84d3ecd1e
> >> user:Georg Brandl
> >> date:Sat Jan 11 20:04:19 2014 +0100
> >> summary:
> >>Update Sphinx toolchain.
> >>
> >> files:
> >>Doc/Makefile |  8 
> >>1 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >>
> >> diff --git a/Doc/Makefile b/Doc/Makefile
> >> --- a/Doc/Makefile
> >> +++ b/Doc/Makefile
> >> @@ -41,19 +41,19 @@
> >>   checkout:
> >>  @if [ ! -d tools/sphinx ]; then \
> >>echo "Checking out Sphinx..."; \
> >> -  svn checkout $(SVNROOT)/external/Sphinx-1.0.7/sphinx tools/sphinx; \
> >> +  svn checkout $(SVNROOT)/external/Sphinx-1.2/sphinx tools/sphinx; \
> >>  fi
> >
> > Doc/make.bat needs to be similarly updated.
> 
> Indeed, thanks for the reminder.
> 
> Georg


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Paul Moore
On 12 January 2014 01:01, Victor Stinner wrote:
> Supporting formatting integers would allow writing b"Content-Length:
> %s\r\n" % 123, which would work on Python 2 and Python 3.

I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" %
str(123) which works on Python 2 and 3, is explicit, and needs no
special-casing of int in the format code.

Paul


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Georg Brandl
On 12.01.2014 09:57, Paul Moore wrote:
> On 12 January 2014 01:01, Victor Stinner wrote:
>> Supporting formatting integers would allow writing b"Content-Length:
>> %s\r\n" % 123, which would work on Python 2 and Python 3.
> 
> I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" %
> str(123) which works on Python 2 and 3, is explicit, and needs no
> special-casing of int in the format code.

Certainly doesn't work on Python 3 right now, and never should :)

Georg
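
[Editorial sketch] Georg's objection can be checked directly. The snippet below is only an illustration, and it assumes an interpreter in which the str/bytes split is enforced (any Python 3). The function name is made up for the example. On Python 3.5 and later, where bytes gained %-formatting through PEP 461 after this thread, the failure comes from the %s conversion refusing a str argument; on earlier 3.x releases the % operator itself is rejected. Both cases surface as TypeError.

```python
def format_content_length(value):
    """Try Paul's first spelling: a text string interpolated into bytes."""
    try:
        return b"Content-Length: %s\r\n" % str(value)
    except TypeError as exc:
        # Python 3 refuses to mix str into a bytes template implicitly.
        return exc


result = format_content_length(123)
print(type(result).__name__, result)
```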



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread R. David Murray
On Sun, 12 Jan 2014 17:51:41 +1000, Nick Coghlan  wrote:
> On 12 January 2014 04:38, R. David Murray  wrote:
> > But!  Our goal should be to help people convert to Python3.  So how can
> > we find out what the specific problems are that real-world programs are
> > facing, look at the *actual code*, and help that project figure out the
> > best way to make that code work in both python2 and python3?
> >
> > That seems like the best way to find out what needs to be added to
> > python3 or pypi:  help port the actual code of the developers who are
> > running into problems.
> >
> > Yes, I'm volunteering to help with this, though of course I can't promise
> > exactly how much time I'll have available.
> 
> And, as has been the case for a long time, the PSF stands ready to
> help with funding credible grant proposals for Python 3 porting
> efforts. I believe some of the core devs (including David?) do
> freelance and contract work, so that's an option definitely worth
> considering if a project would like to support Python 3, but is having
> difficulty getting there with purely volunteer effort.

Yes, I do contract programming, as part of Murray and Walker, Inc (web
site coming soon but not there yet).  And yes I currently have time
available in my schedule.

--David


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:

> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>
> > AFAIK (and just for the record), there could be both Latin1 text and UTF-16
> > in a PDF (and other encodings too), depending on the font used:
> [...]
> > In Python2, txt is just a str, but in Python3 handling everything as latin1
> > string obviously doesn't work for TTF in this case.
>
> Nobody is suggesting that you use Latin-1 for *everything*. We're
> suggesting that you use it for blobs of binary data that represent
> arbitrary bytes. First you have to get your binary data in the first
> place, using whatever technique is necessary.


Just to check I understood what you are saying. Instead of writing:

content = b'\n'.join([
    b'header',
    b'part 2 %.3f' % number,
    binary_image_data,
    utf16_string.encode('utf-16be'),
    b'trailer'])

it should now look like:

content = '\n'.join([
    'header',
    'part 2 %.3f' % number,
    binary_image_data.decode('latin-1'),
    utf16_string.encode('utf-16be').decode('latin-1'),
    'trailer']).encode('latin-1')

Correct?


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Paul Moore
On 12 January 2014 09:23, Georg Brandl  wrote:
>> On 12 January 2014 01:01, Victor Stinner wrote:
>>> Supporting formatting integers would allow writing b"Content-Length:
>>> %s\r\n" % 123, which would work on Python 2 and Python 3.
>>
>> I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" %
>> str(123) which works on Python 2 and 3, is explicit, and needs no
>> special-casing of int in the format code.
>
> Certainly doesn't work on Python 3 right now, and never should :)

Sorry, I meant str(123).encode("ascii"), and I'd probably use a helper
function for it.

I could easily argue at this point that this is the type of bug that
having %-formatting operations on bytes would encourage - %s means
"format a string" (from years of C and Python (text) experience) so I
automatically supply a string argument when using %s in a bytes
formatting context.

The reality is that I was probably just being sloppy, though :-)
Paul
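
[Editorial sketch] The helper function Paul mentions could look something like the following. The name ascii_bytes is hypothetical and not taken from the thread. It sidesteps bytes %-formatting entirely by using concatenation, so it does not depend on the outcome of the PEP 460 discussion.

```python
def ascii_bytes(value):
    """Render any object as text, then encode it strictly as ASCII."""
    return str(value).encode("ascii")


# Build the header by concatenating bytes; no formatting operator is needed.
header = b"Content-Length: " + ascii_bytes(123) + b"\r\n"
print(header)
```

Because the encoding step is spelled out, a value whose text form is not pure ASCII fails with UnicodeEncodeError at the call site instead of producing mojibake later.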


Re: [Python-Dev] test.support.check_warnings

2014-01-12 Thread Antoine Pitrou
On Sat, 11 Jan 2014 23:10:43 -0800
Ethan Furman  wrote:
> On 01/11/2014 05:37 PM, Brett Cannon wrote:
> >
> > You're assuming the context manager is doing something magical to verify
> > that all calls in the block raise the expected
> > exception. What you want to do is execute it in a loop::
> >
> >    for test in (...):
> >        with support.check_warnings(("automatic int conversions have been deprecated",
> >                                     DeprecationWarning), quiet=False):
> >            exec(test)
> 
> Well, this is test.support!  I expect magic!  ;)
> 
> Thanks for setting me straight, got it working.

Or you could, you know, use the new assertWarns():
http://docs.python.org/dev/library/unittest.html#unittest.TestCase.assertWarns

Regards

Antoine.
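
[Editorial sketch] For comparison, here is roughly what the assertWarns() version of Brett's loop might look like. The legacy_call function is a hypothetical stand-in for the code under test; only unittest.TestCase.assertWarns comes from the message above. As with check_warnings, the context manager goes inside the loop so that each call is verified on its own.

```python
import unittest
import warnings


def legacy_call(value):
    """Hypothetical code under test: warns, then converts."""
    warnings.warn("automatic int conversions have been deprecated",
                  DeprecationWarning, stacklevel=2)
    return int(value)


class DeprecationTests(unittest.TestCase):
    def test_each_call_warns(self):
        # One context manager per call, so every call is checked individually.
        for value in ("1", "2", "3"):
            with self.assertWarns(DeprecationWarning):
                legacy_call(value)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DeprecationTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```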




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Nick Coghlan
On 12 Jan 2014 21:53, "Juraj Sukop"  wrote:
>
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:
>>
>> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>>
>> > AFAIK (and just for the record), there could be both Latin1 text and UTF-16
>> > in a PDF (and other encodings too), depending on the font used:
>> [...]
>> > In Python2, txt is just a str, but in Python3 handling everything as latin1
>> > string obviously doesn't work for TTF in this case.
>>
>> Nobody is suggesting that you use Latin-1 for *everything*. We're
>> suggesting that you use it for blobs of binary data that represent
>> arbitrary bytes. First you have to get your binary data in the first
>> place, using whatever technique is necessary.
>
>
> Just to check I understood what you are saying. Instead of writing:
>
> content = b'\n'.join([
>     b'header',
>     b'part 2 %.3f' % number,
>     binary_image_data,
>     utf16_string.encode('utf-16be'),
>     b'trailer'])
>
> it should now look like:
>
> content = '\n'.join([
>     'header',
>     'part 2 %.3f' % number,
>     binary_image_data.decode('latin-1'),
>     utf16_string.encode('utf-16be').decode('latin-1'),
>     'trailer']).encode('latin-1')

Why are you proposing to do the *join* in text space? Encode all the parts
separately, concatenate them with b'\n'.join() (or whatever separator is
appropriate). It's only the *text formatting operation* that needs to be
done in text space and then explicitly encoded (and this example doesn't
even need latin-1; ASCII is sufficient):

content = b'\n'.join([
    b'header',
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,
    utf16_string.encode('utf-16be'),
    b'trailer'])

> Correct?

My updated version above is the reasonable way to do it in Python 3, and
the one I consider clearly superior to reintroducing implicit encoding to
ASCII as part of the core text model.

This is why I *don't* have a problem with PEP 460 as it stands - it's just
syntactic sugar for something you can already do with b''.join(), and thus
not particularly controversial.

It's only proposals that add any form of implicit encoding
that silently switches from the text domain to the binary domain that
conflict with the core Python 3 text model (although third party types
remain largely free to do whatever they want).

Cheers,
Nick.
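
[Editorial sketch] To make Nick's version runnable end to end, the three inputs need values. The ones below are placeholders invented for illustration (the image bytes are not a real image); the structure of the join is exactly as in his message.

```python
number = 3.14159                     # placeholder numeric value
binary_image_data = b"\x89PNG\r\n"   # placeholder bytes, not a real image
utf16_string = "caf\u00e9"           # placeholder text containing a non-ASCII char

content = b"\n".join([
    b"header",
    # Only the formatting step happens in text space, then it is encoded.
    ("part 2 %.3f" % number).encode("ascii"),
    binary_image_data,                # already bytes, passed through untouched
    utf16_string.encode("utf-16be"),  # text encoded with its own codec
    b"trailer",
])
print(content)
```

Each part is converted to bytes by the rule that suits it, and the join itself never leaves the binary domain.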



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Nick Coghlan
On 12 Jan 2014 22:10, "Paul Moore"  wrote:
>
> On 12 January 2014 09:23, Georg Brandl  wrote:
> >> On 12 January 2014 01:01, Victor Stinner wrote:
> >>> Supporting formatting integers would allow writing b"Content-Length:
> >>> %s\r\n" % 123, which would work on Python 2 and Python 3.
> >>
> >> I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" %
> >> str(123) which works on Python 2 and 3, is explicit, and needs no
> >> special-casing of int in the format code.
> >
> > Certainly doesn't work on Python 3 right now, and never should :)
>
> Sorry, I meant str(123).encode("ascii"), and I'd probably use a helper
> function for it.
>
> I could easily argue at this point that this is the type of bug that
> having %-formatting operations on bytes would encourage - %s means
> "format a string" (from years of C and Python (text) experience) so I
> automatically supply a string argument when using %s in a bytes
> formatting context.
>
> The reality is that I was probably just being sloppy, though :-)

It's also something asciistr will help with once it is working -
asciistr(123) on the RHS will work in both versions.

Cheers,
Nick.

> Paul


Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.

2014-01-12 Thread anatoly techtonik
And cross-platform automation tools in Python instead of make
https://bitbucket.org/birkenfeld/sphinx/issue/456/makepy-command-script
--
anatoly t.


On Sun, Jan 12, 2014 at 11:12 AM, INADA Naoki  wrote:
> What about using venv and pip instead of svn?
>
>
> On Sun, Jan 12, 2014 at 4:12 PM, Georg Brandl  wrote:
>>
>> On 11.01.2014 21:11, Terry Reedy wrote:
>> > On 1/11/2014 2:04 PM, georg.brandl wrote:
>> >> http://hg.python.org/cpython/rev/87bdee4d633a
>> >> changeset:   88413:87bdee4d633a
>> >> branch:  3.3
>> >> parent:  88410:05e84d3ecd1e
>> >> user:Georg Brandl 
>> >> date:Sat Jan 11 20:04:19 2014 +0100
>> >> summary:
>> >>Update Sphinx toolchain.
>> >>
>> >> files:
>> >>Doc/Makefile |  8 
>> >>1 files changed, 4 insertions(+), 4 deletions(-)
>> >>
>> >>
>> >> diff --git a/Doc/Makefile b/Doc/Makefile
>> >> --- a/Doc/Makefile
>> >> +++ b/Doc/Makefile
>> >> @@ -41,19 +41,19 @@
>> >>   checkout:
>> >>  @if [ ! -d tools/sphinx ]; then \
>> >>echo "Checking out Sphinx..."; \
>> >> -  svn checkout $(SVNROOT)/external/Sphinx-1.0.7/sphinx tools/sphinx; \
>> >> +  svn checkout $(SVNROOT)/external/Sphinx-1.2/sphinx tools/sphinx; \
>> >>  fi
>> >
>> > Doc/make.bat needs to be similarly updated.
>>
>> Indeed, thanks for the reminder.
>>
>> Georg


[Python-Dev] Common subset of python 2 and python 3

2014-01-12 Thread Nachshon David Armon
Hi,
I am Nachshon and this is my first post to the python mailing list.

I have been porting some libraries from python 2 to python 3 recently with
the goal of a common codebase that will run on both versions. I was
thinking it would make my life, and the lives of a lot of other developers,
a lot easier if there were a version of python that supported ONLY the
features found in both python 2 and python 3. It should be a developer-only
version of python.

It should use unicode strings and require that people use the from
__future__ syntax so that anything written in it will work in python 2.7
and in python 3.3+.

Regarding name changes of standard library modules, it should support the
new names and have helper functions and guides that make the old modules
look like the new ones. It should encourage using backports of the new
standard library modules, like enum, so that developers are not stuck for
features.

I propose that this new version of python use the python 3 unicode model.
It would be fully compatible with both python 2 and python 3, but NOT
necessarily with all existing code in either. It is designed as a porting
tool only.

I suggest that this new python version should be called python 2 and 9
tenths. Is it worth it for me to write a pep that suggests this?


Re: [Python-Dev] Common subset of python 2 and python 3

2014-01-12 Thread Nick Coghlan
Hi Nachshon,

Python 2.7 with the -3 warning flag covers most of this, while using tox to
run automated tests under both 2.x and 3.x should cover the rest (tox is
also useful for checking code runs under Python 2.6, even if you normally
use a newer version).

Is there anything in particular you feel isn't covered by the combination
of those two approaches?

Regards,
Nick.


Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.

2014-01-12 Thread Georg Brandl
That's also planned, see https://bitbucket.org/birkenfeld/sphinx-new-make-mode/.

Georg

On 12.01.2014 09:49, anatoly techtonik wrote:
> And cross-platform automation tools in Python instead of make
> https://bitbucket.org/birkenfeld/sphinx/issue/456/makepy-command-script
> --
> anatoly t.
> 
> 
> On Sun, Jan 12, 2014 at 11:12 AM, INADA Naoki  wrote:
>> What about using venv and pip instead of svn?




Re: [Python-Dev] Common subset of python 2 and python 3

2014-01-12 Thread Nick Coghlan
On 12 Jan 2014 23:39, "Nachshon David Armon" wrote:
>
> Hi,
> I am Nachshon and this is my first post to the python mailing list.
>
> I have been porting some libraries from python 2 to python 3 recently
> with the goal of a common codebase that will run on both versions. I was
> thinking it would make my life, and the lives of a lot of other developers,
> a lot easier if there were a version of python that supported ONLY the
> features found in both python 2 and python 3. It should be a developer-only
> version of python.
>
> It should use unicode strings and require that people use the from
> __future__ syntax so that anything written in it will work in python 2.7
> and in python 3.3+.
>
> Regarding name changes of standard library modules, it should support the
> new names and have helper functions and guides that make the old modules
> look like the new ones. It should encourage using backports of the new
> standard library modules, like enum, so that developers are not stuck for
> features.
>
> I propose that this new version of python use the python 3 unicode model.
> It would be fully compatible with both python 2 and python 3, but NOT
> necessarily with all existing code in either. It is designed as a porting
> tool only.

Ah, I missed this on the first read through - that combination of
requirements doesn't quite make sense (the text models are fundamentally
incompatible in a way that forces developers to resolve ambiguities that
Python 2 would silently tolerate until it hit a bad combination of input
data).

You may want to take a look at the "python-future" project - that comes as
close as anything else I am aware of to allowing you to write Python 2 code
that reads like idiomatic Python 3 code.

Cheers,
Nick.

> I suggest that this new python version should be called python 2 and 9
> tenths. Is it worth it for me to write a pep that suggests this?


Re: [Python-Dev] [Python-checkins] cpython (3.3): Issue #19092 - Raise a correct exception when cgi.FieldStorage is given an

2014-01-12 Thread Senthil Kumaran
On Sat, Jan 11, 2014 at 11:53 PM, Nick Coghlan  wrote:

> You may want to tweak the tracker so the comment ends up on the
> appropriate issue (#19092 is something else entirely)
>

Yes. This was supposed to be #19097.  My bad.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Kristján Valur Jónsson

Well, my suggestion would be that we _should_ make it work, by having the %s
format specifier on bytes objects mean: str(arg).encode('ascii', 'strict')
It would be an explicit encoding operator with a known, fixed, and well
specified encoder.
This would cover most of the use cases seen in this threadnought.  Others could
be handled with explicit str formatting and encoding.

Imho, this is not equivalent to re-introducing automatic type conversion 
between binary/unicode, it is adding a specific convenience function for 
explicitly asking for ASCII encoding.

K

From: Python-Dev [[email protected]] on 
behalf of Georg Brandl [[email protected]]
Sent: Sunday, January 12, 2014 09:23
To: [email protected]
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

On 12.01.2014 09:57, Paul Moore wrote:
> On 12 January 2014 01:01, Victor Stinner wrote:
>> Supporting formatting integers would allow writing b"Content-Length:
>> %s\r\n" % 123, which would work on Python 2 and Python 3.
>
> I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" %
> str(123) which works on Python 2 and 3, is explicit, and needs no
> special-casing of int in the format code.

Certainly doesn't work on Python 3 right now, and never should :)

Georg


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Lennart Regebro
On Sat, Jan 11, 2014 at 8:40 PM, Kristján Valur Jónsson
 wrote:
> Hi there.
> How about a compromise?
> Personally, I think adding the full complement of integer/float formatting to 
> bytes is a bit over the top.
> How about just supporting two format specifiers?
> %b : interpolate a bytes object.  If it doesn't have the buffer interface, 
> error.
> %s : interpolate a str object, encoded to ASCII using 'strict' conversion.
>
> This should cover the most common use cases.
> In particular, you could do this:
>
> Headers.append('Content-Length: %s'%(len(data),))
>
> And then subsequently:
> Packet = b'%b%b'%(b"".join(headers), data)
>
> For more complex formatting, you delegate to the more capable string class, 
> but benefit from automatic ASCII conversion:
>
> Data = b"percentage = %s" % ("%4.2f" % (value,))

Although nice and clean as principle, I think it makes for somewhat
messy code. I'm in favor of having float and integer specifiers as
well.

I'm also for including %s, because it makes moving from Python 2
easier. But it should definitely error out if you try to feed it a
non-ascii string.

//Lennart


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Nick Coghlan
On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:
>
> Well, my suggestion would be that we _should_ make it work, by having the %s
> format specifier on bytes objects mean: str(arg).encode('ascii', 'strict')
> It would be an explicit encoding operator with a known, fixed, and well
> specified encoder.
> This would cover most of the use cases seen in this threadnought.  Others
> could be handled with explicit str formatting and encoding.
>
> Imho, this is not equivalent to re-introducing automatic type conversion
> between binary/unicode, it is adding a specific convenience function for
> explicitly asking for ASCII encoding.

It is not explicit, it is implicit - whether or not the resulting string
assumes ASCII compatibility or not depends on whether you pass a binary
value (no assumption) or a string value (assumes ASCII compatibility). This
kind of data driven change in assumptions about correctness is utterly
unacceptable in the core text and binary types in Python 3.

It's also completely unnecessary - asciistr will be a third party extension
type that allows those users pining for the halcyon days of the Python 2
str type to stop harassing the core devs with requests to compromise the
core Python 3 text model with implicit encoding operations. I'll ensure any
interoperability bugs between asciistr and the core types that can't be
worked around get fixed.

A separate type is genuinely explicit (since the ASCII assumption is no
longer hidden from the type system), and allows much simpler
interoperability for code that wants it (indexing asciistr will eventually
produce length 1 asciistr instances instead of str instances, it will avoid
the bytes(intval) discrepancy, it will avoid the str(bytesval) problem,
etc).

I've been suggesting for years that Python 3 might need a third type (not
required to be a builtin, since it's so specialised), but folks migrating
from Python 2 have been so focused on making the core binary type a hybrid
type again, the notion of taking advantage of PEP 393 to create a dedicated
extension type specifically for working with ASCII compatible binary
protocols has failed to compute.

I'm hoping a test suite and preliminary implementation will help more
people to finally get the point.

Regards,
Nick.
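
[Editorial sketch] The three core-type behaviours Nick lists as motivation for a separate type can be seen in a few lines. This only demonstrates the built-in bytes type on Python 3; asciistr itself is not shown, since its API is not described in this thread.

```python
data = b"abc"

first_item = data[0]     # indexing bytes yields an int, not a length-1 bytes
as_text = str(data)      # str() of bytes gives its repr, not decoded text
zero_filled = bytes(3)   # bytes(int) builds a zero-filled buffer of that length

print(first_item, as_text, zero_filled)
```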

>
> K
> 
> From: Python-Dev [[email protected]] on behalf of Georg Brandl [[email protected]]
> Sent: Sunday, January 12, 2014 09:23
> To: [email protected]
> Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake
>
> On 12.01.2014 09:57, Paul Moore wrote:
> > On 12 January 2014 01:01, Victor Stinner wrote:
> >> Supporting formatting integers would allow writing b"Content-Length:
> >> %s\r\n" % 123, which would work on Python 2 and Python 3.
> >
> > I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" %
> > str(123) which works on Python 2 and 3, is explicit, and needs no
> > special-casing of int in the format code.
>
> Certainly doesn't work on Python 3 right now, and never should :)
>
> Georg


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
On Sun, Jan 12, 2014 at 2:16 PM, Nick Coghlan  wrote:

> Why are you proposing to do the *join* in text space? Encode all the parts
> separately, concatenate them with b'\n'.join() (or whatever separator is
> appropriate). It's only the *text formatting operation* that needs to be
> done in text space and then explicitly encoded (and this example doesn't
> even need latin-1,ASCII is sufficient):
>

I apparently misunderstood what Steven was suggesting, thanks for the
clarification.


Re: [Python-Dev] test.support.check_warnings

2014-01-12 Thread Ethan Furman

On 01/12/2014 04:24 AM, Antoine Pitrou wrote:
> On Sat, 11 Jan 2014 23:10:43 -0800
> Ethan Furman wrote:
>> On 01/11/2014 05:37 PM, Brett Cannon wrote:
>>> You're assuming the context manager is doing something magical to verify
>>> that all calls in the block raise the expected
>>> exception. What you want to do is execute it in a loop::
>>>
>>>    for test in (...):
>>>        with support.check_warnings(("automatic int conversions have been deprecated",
>>>                                     DeprecationWarning), quiet=False):
>>>            exec(test)
>>
>> Well, this is test.support!  I expect magic!  ;)
>>
>> Thanks for setting me straight, got it working.
>
> Or you could, you know, use the new assertWarns():
> http://docs.python.org/dev/library/unittest.html#unittest.TestCase.assertWarns

That's also cool.  If I have to touch that code again I'll switch to it.

--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 08:09 AM, Nick Coghlan wrote:
> On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:
>> Imho, this is not equivalent to re-introducing automatic type conversion
>> between binary/unicode, it is adding a specific convenience function for
>> explicitly asking for ASCII encoding.
>
> It is not explicit, it is implicit - whether or not the resulting string
> assumes ASCII compatibility or not depends on
> whether you pass a binary value (no assumption) or a string value (assumes
> ASCII compatibility).

Nick, I don't understand what you are saying here.  Are you saying that the
result of b'%s' % var may be either a bytes object or a str object?  Because
that would be wrong -- it would always be a bytes object.


--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Kristján Valur Jónsson
Now you're just splitting hairs, Nick.

An explicit operator, %s, _defined_ to be "encode a string object using strict
ascii": how is that any less explicit than the .encode('ascii', 'strict') spelt
out in full?  The language is full of constructs that are shorthands for other,
more lengthy but equivalent things.

I mean, basically what I am suggesting is that in addition to %b with

    def helper(o):
        return str(o).encode('ascii', 'strict')

    b'foo%bbar'%(helper(myobj), )

you have

    b'foo%sbar'%(myobj, )

There is no "data driven change in assumptions." Just an interpolation operator
with a clearly defined meaning.

I don't think anyone is trying to compromise the text model.  All people are
asking for is that the _boundary_ is made a little easier to deal with.

K




From: Nick Coghlan [[email protected]]
Sent: Sunday, January 12, 2014 16:09
To: Kristján Valur Jónsson
Cc: [email protected]; Georg Brandl
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

It is not explicit, it is implicit - whether or not the resulting string 
assumes ASCII compatibility or not depends on whether you pass a binary value 
(no assumption) or a string value (assumes ASCII compatibility). This kind of 
data driven change in assumptions about correctness is utterly unacceptable in 
the core text and binary types in Python 3.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 08:21 AM, Ethan Furman wrote:

On 01/12/2014 08:09 AM, Nick Coghlan wrote:

On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:


Imho, this is not equivalent to re-introducing automatic type conversion 
between binary/unicode, it is adding a
specific convenience function for explicitly asking for ASCII encoding.


It is not explicit, it is implicit - whether or not the resulting string 
assumes ASCII compatibility or not depends on
whether you pass a binary value (no assumption) or a string value (assumes 
ASCII compatibility).


Nick, I don't understand what you are saying here.  Are you saying that the 
result of b'%s' % var may be either a bytes
object or a str object?  Because that would be wrong -- it would always be a 
bytes object.


Okay, I just went and took a closer look at the asciistr type [1].  For what it's worth I don't think this is Antoine's 
understanding of what we [2] are asking for, nor is it what we are asking for (I'm sure Antoine will correct me if I'm 
wrong. ;)


We know full well the difference between unicode and bytes, and we know full well that numbers and much of the text we 
need has an ASCII (bytes!) representation.  When we do a b'Content Length: %d' % len(binary_data) we are expecting to 
get back a bytes object, /not/ a unicode object.


Your asciistr, which sometimes returns bytes and sometimes returns text, is 
absolutely *not* what we want.
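
For reference, a small sketch of the behaviour being asked for. It assumes Python 3.5 or later, where % formatting on bytes was eventually added (PEP 461); it does not run on the 3.3/3.4 interpreters under discussion here. The sample data is made up.

```python
# Assumes Python 3.5+; binary_data is an arbitrary illustrative payload.
binary_data = b'\x89PNG\r\n'

# bytes on the left, an int on the right, bytes out.
header = b'Content Length: %d' % len(binary_data)
```

The result is a bytes object, never a str, which is the property the request hinges on.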

--
~Ethan~


[1] https://github.com/jeamland/asciicompat
[2] the dbf and pdf folks, at least


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Paul Moore
On 12 January 2014 16:52, Kristján Valur Jónsson  wrote:
> I mean, basically what I am suggesting is that in addition to %b with
>
> def helper(o):
>     return str(o).encode('ascii', 'strict')
>
> b'foo%bbar'%(helper(myobj), )
>
> you have
>
> b'foo%sbar'%(myobj, )

But that's not what the current PEP says. It uses %s for interpolating
bytes values. It looks like you're saying that

b'abc %s' % (b'def')

will *not* produce b'abc def', but rather will produce b'abc b\'def\''
(because str(b'def') is "b'def'").

If that's what you're saying, then fine, but it's a different PEP and
I for one am -1 specifically because of the behaviour I show above.
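
A minimal sketch of that behaviour, assuming Python 3 run without the -b/-bb flags (which turn str() of a bytes object into a warning or an error). The helper is the one quoted above.

```python
def helper(o):
    return str(o).encode('ascii', 'strict')

# str() of a bytes object gives its repr, not its contents.
as_text = str(b'def')

# So routing bytes through the helper embeds the repr, quotes and all.
interpolated = b'abc ' + helper(b'def')
```

Here as_text is the six-character string b'def' (prefix and quotes included), so the interpolated result carries the prefix and quotes along with it.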
Paul


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Mark Shannon

On 12/01/14 16:52, Kristján Valur Jónsson wrote:

Now you're just splitting hairs, Nick.

An explicit operator, %s, _defined_ to be "encode a string object using
strict ascii",


I don't like this because '%s' reads to me as "insert *string* here".
I think '%a' which reads as "encode as ASCII and insert here" would be 
better.




how is that any less explicit than the .encode('ascii', 'strict') spelt
out in full?  The language is full of constructs that are shorthands for
others, more lengthy but equivalent things.

I mean, basically what I am suggesting is that in addition to %b with

    def helper(o):
        return str(o).encode('ascii', 'strict')

    b'foo%bbar' % (helper(myobj),)

you have

    b'foo%sbar' % (myobj,)

There is no "data driven change in assumptions." Just an interpolation
operator with a clearly defined meaning.

I don't think anyone is trying to compromise the text model.  All people
are asking for is that the _boundary_ is made a little easier to deal with.

K


*From:* Nick Coghlan [[email protected]]
*Sent:* Sunday, January 12, 2014 16:09
*To:* Kristján Valur Jónsson
*Cc:* [email protected]; Georg Brandl
*Subject:* Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

It is not explicit, it is implicit - whether or not the resulting string
assumes ASCII compatibility or not depends on whether you pass a binary
value (no assumption) or a string value (assumes ASCII compatibility).
This kind of data driven change in assumptions about correctness is
utterly unacceptable in the core text and binary types in Python 3.





Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Steven D'Aprano
On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:
> 
> > On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
> >
> > > AFAIK (and just for the record), there could be both Latin1 text and
> > UTF-16
> > > in a PDF (and other encodings too), depending on the font used:
> > [...]
> > > In Python2, txt is just a str, but in Python3 handling everything as
> > latin1
> > > string obviously doesn't work for TTF in this case.
> >
> > Nobody is suggesting that you use Latin-1 for *everything*. We're
> > suggesting that you use it for blobs of binary data that represent
> > arbitrary bytes. First you have to get your binary data in the first
> > place, using whatever technique is necessary.
> 
> 
> Just to check I understood what you are saying. Instead of writing:
> 
> content = b'\n'.join([
>     b'header',
>     b'part 2 %.3f' % number,
>     binary_image_data,
>     utf16_string.encode('utf-16be'),
>     b'trailer'])


Which doesn't work, since bytes don't support %f in Python 3.

 
> it should now look like:
> 
> content = '\n'.join([
>     'header',
>     'part 2 %.3f' % number,
>     binary_image_data.decode('latin-1'),
>     utf16_string.encode('utf-16be').decode('latin-1'),
>     'trailer']).encode('latin-1')
> 
> Correct?

Not quite as you show.

First, "utf16_string" confuses me. What is it? If it is a Unicode 
string, i.e.:

# Python 3 semantics
type(utf16_string)
=> returns str

then the name is horribly misleading, and it is best handled like this:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string,  # Misleading name, actually Unicode string
        'trailer'])


Note that since it's text, and content is text, there is no need to 
encode then decode.

"UTF-16" is not another name for "Unicode". Unicode is a character set. 
UTF-16 is just one of a number of different encodings which map the 
0x110000 distinct Unicode characters (actually "code points", U+0000 
through U+10FFFF) to bytes. UTF-16 is one possible way to implement 
Unicode strings in memory, but not the only way. Python has used, or 
still uses, four distinct implementations:

1) UTF-16 in "narrow builds"
2) UTF-32 in "wide builds"
3) a hybrid approach starting in Python 3.3, where strings are
   stored as either:

   3a) Latin-1
   3b) UCS-2
   3c) UTF-32

   depending on the content of the string.

So calling an arbitrary string "utf16_string" is misleading or wrong.
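
To make the distinction concrete, a small sketch (the sample text is arbitrary): one Unicode string, three different byte encodings of it, all decoding back to the same text.

```python
text = 'caf\u00e9'  # a str: four code points, no encoding attached

utf8 = text.encode('utf-8')       # 1 or 2 bytes per code point here
utf16 = text.encode('utf-16-be')  # 2 bytes per code point here
utf32 = text.encode('utf-32-be')  # 4 bytes per code point

# Each byte string decodes back to the same text with its own codec.
round_trips = [
    utf8.decode('utf-8'),
    utf16.decode('utf-16-be'),
    utf32.decode('utf-32-be'),
]
```

The str itself is none of these; an encoding only enters the picture when you ask for bytes.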


On the other hand, if it is actually a bytes object which is the product 
of UTF-16 encoding, i.e.:

type(utf16_string)
=> returns bytes

and those bytes were generated by "some text".encode("utf-16"), then it 
is already binary data and needs to be smuggled into the text string. 
Latin-1 is good for that:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.decode('latin-1'),
        'trailer'])
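
Why Latin-1 works for smuggling: it maps every byte value 0..255 to the code point with the same number, so decoding and re-encoding is lossless. A minimal sketch, with a made-up blob covering every byte value:

```python
blob = bytes(range(256))           # every possible byte value, once

as_text = blob.decode('latin-1')   # byte N becomes code point U+00NN
code_points = [ord(ch) for ch in as_text]

restored = as_text.encode('latin-1')  # and back again, unchanged
```

Note the round trip relies on using Latin-1 in both directions.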


Both examples assume that you intend to do further processing of content 
before sending it, and will encode just before sending:

content.encode('utf-8')

(Don't use Latin-1, since it cannot handle the full range of text 
characters.)

If that's not the case, then perhaps this is better suited to what you 
are doing:

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,  # already bytes
        utf16_string,  # already bytes
        b'trailer'])
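
A runnable version of the same idea, with hypothetical stand-in values for number, the image data and the UTF-16 payload (none of these values come from the original example):

```python
# Stand-in values, made up for illustration.
number = 3.14159
binary_image_data = b'\xff\xd8\xff\xe0'
utf16_bytes = 'text'.encode('utf-16-be')

content = b'\n'.join([
    b'header',
    # Format in text space, then encode the (pure ASCII) result.
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,  # already bytes
    utf16_bytes,        # already bytes
    b'trailer'])
```

Only the formatted number passes through text; the binary parts are never decoded.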



-- 
Steven


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Mark Lawrence

On 12/01/2014 17:06, Mark Shannon wrote:

On 12/01/14 16:52, Kristján Valur Jónsson wrote:

Now you're just splitting hairs, Nick.

An explicit operator, %s, _defined_ to be "encode a string object using
strict ascii",


I don't like this because '%s' reads to me as "insert *string* here".
I think '%a' which reads as "encode as ASCII and insert here" would be
better.



I entirely agree.  This would also parallel the conversion flags given 
here http://docs.python.org/3/library/string.html#format-string-syntax, 
I quote "Three conversion flags are currently supported: '!s' which 
calls str() on the value, '!r' which calls repr() and '!a' which calls 
ascii()".
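
For illustration, a short sketch of those three conversion flags on a str format string (the sample value is arbitrary):

```python
value = 'caf\u00e9'

with_s = '{!s}'.format(value)  # str(value)
with_r = '{!r}'.format(value)  # repr(value)
with_a = '{!a}'.format(value)  # ascii(value): non-ASCII is escaped
```

Only the '!a' result is guaranteed to be pure ASCII, which is the parallel with a hypothetical '%a' for bytes.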


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Paul Moore
On 12 January 2014 17:03, Ethan Furman  wrote:
> We know full well the difference between unicode and bytes, and we know full
> well that numbers and much of the text we need has an ASCII (bytes!)
> representation.  When we do a b'Content Length: %d' % len(binary_data) we
> are expecting to get back a bytes object, /not/ a unicode object.

What I am struggling to understand here is what room for compromise
there is. Clearly, for whatever reason,

    b'Content Length: ' + str(len(binary_data)).encode('ascii')

is not acceptable for you. OK, fair enough. Also, apparently, writing a helper

    def int_to_bytes(n):
        return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is unacceptable. But I'm not clear why it's unacceptable. Maybe I
missed the explanation - God knows, the thread is long enough :-)
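
Spelled out as a self-contained sketch (binary_data is a made-up payload), the helper approach looks like this:

```python
def int_to_bytes(n):
    # Two explicit steps: number -> text, then text -> ASCII bytes.
    return str(n).encode('ascii')

binary_data = b'\x00\x01\xfe\xff'  # illustrative payload
header = b'Content Length: ' + int_to_bytes(len(binary_data))
```

The encoding step is visible at the call site rather than implied by a format code.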

On the other hand, Nick has explained why b'Content Length: %d' %
len(binary_data) is unacceptable to him (you don't have to agree with
his opinion, just concede that he has explained his position in a way
that you understand).

I'm not trying to argue you're wrong - I don't know your codebase, nor
do I know your application area. But surely somewhere between "we must
have % formatting including %d for bytes" and the above, there's a
middle ground that you *are* willing to accept? Can you give any
indications of what that might be? What, specifically, about the
helper function is the problem? I don't think it is any less space
efficient, it doesn't double-encode, and I don't think it's more
difficult to understand (although it is a little longer, it trades
that off against being a bit more explicit as to what's going on).
Surely you're not arguing that your code must work unchanged (not
"there's a way of writing the code so it works on Python 2 and 3", but
"the code you currently have for Python 2 must work with no changes at
all")?

Can you give an example of code that is *nearly* acceptable to you,
which works in Python 2 and 3 today, and explain what improvements you
would like to see to it in order to use it instead of waiting for a
core change?

Paul


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Steven D'Aprano
On Sun, Jan 12, 2014 at 11:16:37PM +1000, Nick Coghlan wrote:

> > content = '\n'.join([
> >     'header',
> >     'part 2 %.3f' % number,
> >     binary_image_data.decode('latin-1'),
> >     utf16_string.encode('utf-16be').decode('latin-1'),
> >     'trailer']).encode('latin-1')
> 
> Why are you proposing to do the *join* in text space? 

In defence of that, doing the join as text may be useful if you have 
additional text processing that you want to do after assembling the 
whole string, but before calling encode.

Even if you intend to encode to bytes at the end, you might prefer to 
work in the text domain right until just before the end:

- no need for b' prefixes;
- indexing a string returns a 1-char string, not an int;
- can use the full range of % formatting, etc.
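
The indexing difference in particular trips people up, so here is a tiny sketch of it under Python 3 semantics:

```python
data = b'abc'
text = 'abc'

first_byte = data[0]        # an int, the value of the byte
first_char = text[0]        # a 1-character str
one_byte_slice = data[0:1]  # slicing, unlike indexing, keeps bytes
```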


-- 
Steven


Re: [Python-Dev] Common subset of python 2 and python 3

2014-01-12 Thread Nachshon David Armon
On Sun, Jan 12, 2014 at 3:58 PM, Nick Coghlan  wrote:
>
> On 12 Jan 2014 23:39, "Nachshon David Armon" 
> wrote:
>>

>> I propose that this new version of python use the python 3 unicode model.
>> As the version of python will be fully compatible with both python 2 and
>> with python 3 but NOT necessarily with all existing code in either. It is
>> designed as a porting tool only.
>
> Ah, I missed this on the first read through - that combination of
> requirements doesn't quite make sense (the text models are fundamentally
> incompatible in a way that forces developers to resolve ambiguities that
> Python 2 would silently tolerate until it hit a bad combination of input
> data).

While that is true, it is possible to handle unicode correctly in
Python 2 while remaining compatible with Python 3 (a combination of
"from __future__ import unicode_literals" and proper use of the encode
and decode methods). I would prefer a stripped-down version of Python 3
that does not support anything that really conflicts with Python 2, for
porting purposes only of course. My employer still uses Python 2, so
the idea is to have other developers use something that forces code to
work on both during the transition, without every single one having
to be extra careful to support both versions.
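
As a rough sketch of the style I mean: text inside, explicit encode/decode at the edges. It uses explicit u'' prefixes (accepted by Python 2 and by 3.3+) instead of the __future__ import, purely so the snippet can be pasted anywhere in a file. The function names are made up for illustration.

```python
def make_header(name, length):
    # Work in text, then encode explicitly at the boundary.
    line = u'%s: %d' % (name, length)
    return line.encode('ascii')


def read_header(raw):
    # Decode explicitly on the way in; everything after this is text.
    text = raw.decode('ascii')
    name, _, value = text.partition(u': ')
    return name, int(value)


wire = make_header(u'Content-Length', 42)
parsed = read_header(wire)
```

Nothing here depends on the implicit str/unicode coercion that Python 2 tolerates and Python 3 rejects.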


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
Wait a second, this is how I understood it but what Nick said made me think
otherwise...

On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano wrote:

> On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> > On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:
> >
> > Just to check I understood what you are saying. Instead of writing:
> >
> > content = b'\n'.join([
> >     b'header',
> >     b'part 2 %.3f' % number,
> >     binary_image_data,
> >     utf16_string.encode('utf-16be'),
> >     b'trailer'])
>
> Which doesn't work, since bytes don't support %f in Python 3.
>

I know and this was an example of the ideal (for me, anyway) way of
formatting bytes.


> First, "utf16_string" confuses me. What is it? If it is a Unicode
> string, i.e.:
>

It is a Unicode string which happens to contain code points outside U+00FF
(as with the TTF example above), so that it triggers the (at least) 2-bytes
memory representation in CPython 3.3+. I agree, I chose the variable name
poorly, my bad.


>
> content = '\n'.join([
>     'header',
>     'part 2 %.3f' % number,
>     binary_image_data.decode('latin-1'),
>     utf16_string,  # Misleading name, actually Unicode string
>     'trailer'])
>

Which, because of that horribly-named-variable, prevents the use of simple
memcpy and makes the image data occupy way more memory than when it was
in simple bytes.


> Both examples assume that you intend to do further processing of content
> before sending it, and will encode just before sending:
>

Not really, I was interested to compare it to bytes formatting, hence it
included the "encode()" as well.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 09:26 AM, Paul Moore wrote:

On 12 January 2014 17:03, Ethan Furman  wrote:

We know full well the difference between unicode and bytes, and we know full
well that numbers and much of the text we need has an ASCII (bytes!)
representation.  When we do a b'Content Length: %d' % len(binary_data) we
are expecting to get back a bytes object, /not/ a unicode object.


What I am struggling to understand here is what room for compromise
there is. Clearly, for whatever reason,

    b'Content Length: ' + str(len(binary_data)).encode('ascii')

is not acceptable for you. OK, fair enough. Also, apparently, writing a helper

    def int_to_bytes(n):
        return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is unacceptable. But I'm not clear why it's unacceptable. Maybe I
missed the explanation - God knows, the thread is long enough :-)


True enough!  ;)  It's unacceptable in the sense that the bytes type is /almost/ there, it's /almost/ what is needed to 
handle the boundary conditions.  We have a __bytes__ method (how is it supposed to be used?) that could be made to fit 
the interpolation bill.


It seems to me the core of Nick's refusal is the rejection (and I agree!) of bytes interpolation returning unicode -- 
but that's not what I'm asking for!  I'm asking for it to return bytes, with the interpolated data (in the case of %d, 
%s, etc) being strictly-ASCII encoded.
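
On the __bytes__ question: as far as I can tell its one documented use is that bytes(obj) calls it. A minimal sketch, with a hypothetical Field class made up for illustration:

```python
class Field:
    # Hypothetical record field, used only for illustration.
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __bytes__(self):
        # The object decides its own byte representation, and does
        # the ASCII encoding explicitly.
        return ('%s=%d' % (self.name, self.value)).encode('ascii')


raw = bytes(Field('length', 12))  # bytes() calls Field.__bytes__
```

An interpolation code that deferred to __bytes__ would reuse exactly this hook.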




On the other hand, Nick has explained why b'Content Length: %d' %
len(binary_data) is unacceptable to him (you don't have to agree with
his opinion, just concede that he has explained his position in a way
that you understand).


Only because he (or Benno) finally wrote some tests and I was able to see what he thought I was wanting.  Which does 
seem to leave a *tiny* bit of wiggle room if bytes interpolation always returns bytes, and never unicode (yeah, I know, 
snowball's chance and all that).




I'm not trying to argue you're wrong - I don't know your codebase, nor
do I know your application area. But surely somewhere between "we must
have % formatting including %d for bytes" and the above, there's a
middle ground that you *are* willing to accept? Can you give any
indications of what that might be? What, specifically, about the
helper function is the problem? I don't think it is any less space
efficient, it doesn't double-encode, and I don't think it's more
difficult to understand (although it is a little longer, it trades
that off against being a bit more explicit as to what's going on).
Surely you're not arguing that your code must work unchanged (not
"there's a way of writing the code so it works on Python 2 and 3", but
"the code you currently have for Python 2 must work with no changes at
all")?


I'm arguing from three PoVs:

1) 2 & 3 compatible code base

2) having the bytes type /be/ the boundary type

3) readable code



Can you give an example of code that is *nearly* acceptable to you,
which works in Python 2 and 3 today, and explain what improvements you
would like to see to it in order to use it instead of waiting for a
core change?


I'm not trying to be difficult (just naturally good at it, I guess ;) , but I don't see a lot of room for compromises -- I 
would like % interpolation, I'm told I have to use a helper function.  I will if I have to, but first I have to try and 
make myself understood, and I'm not sure that has happened yet.  Following Nick's example I'm writing up some tests that 
clearly show what I would like to see.  Then at least we can debate what I'm actually asking for, and not what the 
(understandably) unicode-what-a-mess-we-had-in-py2k-don't-want-again that some think I am asking for.


--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Paul Moore
On 12 January 2014 18:26, Ethan Furman  wrote:
> True enough!  ;)  It's unacceptable in the sense that the bytes type is
> /almost/ there, it's /almost/ what is needed to handle the boundary
> conditions.  We have a __bytes__ method (how is it supposed to be used?)
> that could be made to fit the interpolation bill.

And yet I still don't follow what you *want*. Unless it's that b'%d' %
(12,) must work and give b'12', and nothing else is acceptable. Maybe
more accurately, I don't see what you want to do that can't be done in
another way. All I'm seeing in your rejection of alternative
suggestions is "it's not %-interpolation using %d".

> I'm arguing from three PoVs:
> 1) 2 & 3 compatible code base
> 2) having the bytes type /be/ the boundary type
> 3) readable code

The only one of these that I can see being in any way an argument against

    def int_to_bytes(n):
        return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is (3), and that's largely subjective. Personally, I see very little
difference between the above and %d-interpolation in terms of
*readability*. Brevity, certainly %d wins. But that's not important on
its own, and I'd argue that my version is more clear in terms of
describing the intent (and would be even better if I wasn't rubbish at
thinking of function names, or if this wasn't in isolation, and more
application-focused functions were used).

> It seems to me the core of Nick's refusal is the (and I agree!) rejection of
> bytes interpolation returning unicode -- but that's not what I'm asking for!
> I'm asking for it to return bytes, with the interpolated data (in the case
> if %d, %s, etc) being strictly-ASCII encoded.

My reading of Nick's refusal is that %d takes a value which is
semantically a number, converts it into a base-10 representation
(which is semantically a *string*, not a sequence of bytes[1]) and
then *encodes* that string into a series of bytes using the ASCII
encoding. That is *two* semantic transformations, and one (the ASCII
encoding) is *implicit*. Specifically, it's implicit because (a) the
normal reading of %d is "produce the base-10 representation of a
number", and a base-10 representation is a *string*, and (b) because
nowhere has ASCII been mentioned (why not UTF16? that would be
entirely plausible for a wchar-based environment like Windows). And a
core principle of the bytes/text separation in Python 3 is that
encoding should never happen implicitly.
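
To illustrate the two steps, a small sketch: the same base-10 string turns into different bytes depending on which encoding is chosen. cp500 is one of the EBCDIC codecs shipped with Python; the choice of codecs here is mine, purely for illustration.

```python
n = 123

as_text = str(n)                          # step 1: number -> string

as_ascii = as_text.encode('ascii')        # step 2, one possible choice
as_utf16 = as_text.encode('utf-16-le')    # another plausible choice
as_ebcdic = as_text.encode('cp500')       # and an EBCDIC one
```

All three are legitimate byte representations of the string '123'; nothing in %d itself says which one is meant.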

By the way, I should point out that I would never have understood
*any* of the ideas involved in this thread before Python 3 forced me
to think about Unicode and the distinction between text and bytes. And
yet, I now find myself, in my (non-Python) work environment, being the
local expert whenever applications screw up text encodings. So I, for
one, am very grateful for Python 3's clear separation of bytes and
text. (And if I sometimes come across as over-dogmatic, I apologise -
put it down to the enthusiasm of the recent convert :-))

Paul

[1] If you cannot see that there's no essential reason why the base-10
representation '123' should correspond to the bytes b'\x31\x32\x33'
then you are probably not old enough to have started programming on
EBCDIC-based computers :-)


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread INADA Naoki
I want to add one more PoV: a small performance regression matters,
especially on Python 2, because programs that need bytes formatting may be
low level and heavily used by applications.

Many programs use a single-source approach to support Python 3,
and supporting Python 3 should not mean a large performance regression on
Python 2.


In Python 2:

In [1]: def int_to_bytes(n):
   ...:     return unicode(n).encode('ascii')
   ...:

In [2]: %timeit int_to_bytes(42)
100 loops, best of 3: 691 ns per loop

In [3]: %timeit b'Content-Type: ' + int_to_bytes(42)
100 loops, best of 3: 737 ns per loop

In [4]: %timeit b'Content-Type: %d' % 42
1000 loops, best of 3: 20.2 ns per loop

In [5]: %timeit (u'Content-Type: %d' % 42).encode('ascii')
100 loops, best of 3: 381 ns per loop


In Python 3:

In [1]: def int_to_bytes(n):
   ...:     return str(n).encode('ascii')
   ...:

In [2]: %timeit int_to_bytes(42)
100 loops, best of 3: 612 ns per loop

In [3]: %timeit b'Content-Type: ' + int_to_bytes(42)
100 loops, best of 3: 668 ns per loop

In [4]: %timeit ('Content-Type: %d' % 42).encode('ascii')
100 loops, best of 3: 326 ns per loop
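
For anyone who wants to reproduce this without IPython, a rough sketch using the standard timeit module on Python 3. The loop counts are arbitrary and the numbers will differ by machine and version, so no expected timings are given.

```python
import timeit


def int_to_bytes(n):
    return str(n).encode('ascii')


def concat():
    return b'Content-Type: ' + int_to_bytes(42)


def format_then_encode():
    return ('Content-Type: %d' % 42).encode('ascii')


NUMBER = 20000  # calls per measurement; arbitrary choice

# Best of 3 runs, converted to seconds per call.
timings = {
    func.__name__: min(timeit.repeat(func, number=NUMBER, repeat=3)) / NUMBER
    for func in (concat, format_then_encode)
}
```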


> I'm arguing from three PoVs:
> > 1) 2 & 3 compatible code base
> > 2) having the bytes type /be/ the boundary type
> > 3) readable code
>
> The only one of these that I can see being in any way an argument against
>
> def int_to_bytes(n):
>     return str(n).encode('ascii')
>
> b'Content Length: ' + int_to_bytes(len(binary_data))
>
> is (3), and that's largely subjective. Personally, I see very little
> difference between the above and %d-interpolation in terms of
> *readability*. Brevity, certainly %d wins. But that's not important on
> its own, and I'd argue that my version is more clear in terms of
> describing the intent (and would be even better if I wasn't rubbish at
> thinking of function names, or if this wasn't in isolation, and more
> application-focused functions were used).
>
> > It seems to me the core of Nick's refusal is the (and I agree!)
> rejection of
> > bytes interpolation returning unicode -- but that's not what I'm asking
> for!
> > I'm asking for it to return bytes, with the interpolated data (in the
> case
> > if %d, %s, etc) being strictly-ASCII encoded.
>
> My reading of Nick's refusal is that %d takes a value which is
> semantically a number, converts it into a base-10 representation
> (which is semantically a *string*, not a sequence of bytes[1]) and
> then *encodes* that string into a series of bytes using the ASCII
> encoding. That is *two* semantic transformations, and one (the ASCII
> encoding) is *implicit*. Specifically, it's implicit because (a) the
> normal reading of %d is "produce the base-10 representation of a
> number, and a base-10 representation is a *string*, and (b) because
> nowhere has ASCII been mentioned (why not UTF16? that would be
> entirely plausible for a wchar-based environment like Windows). And a
> core principle of the bytes/text separation in Python 3 is that
> encoding should never happen implicitly.
>
> By the way, I should point out that I would never have understood
> *any* of the ideas involved in this thread before Python 3 forced me
> to think about Unicode and the distinction between text and bytes. And
> yet, I now find myself, in my (non-Python) work environment, being the
> local expert whenever applications screw up text encodings. So I, for
> one, am very grateful for Python 3's clear separation of bytes and
> text. (And if I sometimes come across as over-dogmatic, I apologise -
> put it down to the enthusiasm of the recent convert :-))
>
> Paul
>
> [1] If you cannot see that there's no essential reason why the base-10
> representation '123' should correspond to the bytes b'\x31\x32\x33'
> then you are probably not old enough to have started programming on
> EBCDIC-based computers :-)



-- 
INADA Naoki  


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/11/2014 07:09 PM, Nick Coghlan wrote:


Folks that want implicit serialisation (and I agree it has its uses) should go 
help Benno get asciistr up to speed.


asciistr is not what I'm looking for in the way of a boundary type.

I have created a 'bytestring'[1] repository which has the tests for what I am looking for.  Hopefully that will get rid 
of some confusion, at least.


--
~Ethan~


[1] https://bitbucket.org/stoneleaf/bytestring


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Emile van Sebille

On 01/12/2014 09:26 AM, Paul Moore wrote:

Can you give an example of code that is *nearly* acceptable to you,
which works in Python 2 and 3 today, and explain what improvements you
would like to see to it in order to use it instead of waiting for a
core change?



I'm not a developer, but I'm trying to understand how in v3 I accomplish 
what in v2 is easy:


len(open('chars','wb').write("".join(map (chr,range(256)))).read())

What's the v3 equivalent?
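
One possible Python 3 spelling, assuming the intent is to write a file containing each of the 256 byte values once and then read it back. The temporary directory is my addition so the sketch does not leave a 'chars' file behind.

```python
import os
import tempfile

data = bytes(range(256))  # the 256 byte values, built directly as bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'chars')
    with open(path, 'wb') as f:
        written = f.write(data)  # in Python 3, write() returns a count
    with open(path, 'rb') as f:
        read_back = f.read()
```

There is no chr()/join step because the data never needs to be text in the first place.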

Emile





Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 11:00 AM, Paul Moore wrote:


And yet I still don't follow what you *want*. Unless it's that b'%d' %
(12,) must work and give b'12', and nothing else is acceptable.


Nothing else is ideal.  I'll go that route if I have to.  I understand that in the real world you go with what works, 
but in the development stage you fight for the ideal.  :)




My reading of Nick's refusal is that %d takes a value which is
semantically a number, converts it into a base-10 representation
(which is semantically a *string*, not a sequence of bytes[1]) and
then *encodes* that string into a series of bytes using the ASCII
encoding. That is *two* semantic transformations, and one (the ASCII
encoding) is *implicit*. Specifically, it's implicit because (a) the
normal reading of %d is "produce the base-10 representation of a
number, and a base-10 representation is a *string*, and (b) because
nowhere has ASCII been mentioned (why not UTF16? that would be
entirely plausible for a wchar-based environment like Windows). And a
core principle of the bytes/text separation in Python 3 is that
encoding should never happen implicitly.


That could be.  And yet the bytes type already has several concessions to ASCII 
encoding.



By the way, I should point out that I would never have understood
*any* of the ideas involved in this thread before Python 3 forced me
to think about Unicode and the distinction between text and bytes. And
yet, I now find myself, in my (non-Python) work environment, being the
local expert whenever applications screw up text encodings. So I, for
one, am very grateful for Python 3's clear separation of bytes and
text. (And if I sometimes come across as over-dogmatic, I apologise -
put it down to the enthusiasm of the recent convert :-))


No worries.  I was forced to learn the difference when I wrote my dbf module for 2.5.  Took longer than I'd like to 
admit to realize that ASCII was an encoding.  :/




[1] If you cannot see that there's no essential reason why the base-10
representation '123' should correspond to the bytes b'\x31\x32\x33'
then you are probably not old enough to have started programming on
EBCDIC-based computers :-)


I can see it.  :)  But bytes already acknowledges an ASCII bias.  ;)  And even EBCDIC machines speak ASCII when talking 
telnet.


--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Paul Moore
On 12 January 2014 19:30, Emile van Sebille  wrote:
> len(open('chars','wb').write("".join(map (chr,range(256)))).read())

Python 2:

>>> len(open('chars','wb').write("".join(map (chr,range(256)))).read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'read'

I could be facetious and say "None.read", but more seriously, what are
you trying to say here? How do I write a 256-byte file with one byte
for each value? bytes(range(256)) gives you the bytestring you want. I
simply don't see your point here.

>> And yet I still don't follow what you *want*. Unless it's that b'%d' %
>> (12,) must work and give b'12', and nothing else is acceptable.
>
> Nothing else is ideal.  I'll go that route if I have to.  I understand that 
> in the real world you go with what works, but in the development stage you 
> fight for the ideal.  :)

OK, but can you fight by giving arguments as to why it's better than
the plethora of alternatives that have been suggested? Or
counter-arguments to the objections that have been raised to the
proposal?

Paul


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Emile van Sebille

On 01/12/2014 11:30 AM, Emile van Sebille wrote:

On 01/12/2014 09:26 AM, Paul Moore wrote:

Can you give an example of code that is *nearly* acceptable to you,
which works in Python 2 and 3 today, and explain what improvements you
would like to see to it in order to use it instead of waiting for a
core change?



I'm not a developer, but I'm trying to understand how in v3 I accomplish
what in v2 is easy:

len(open('chars','wb').write("".join(map (chr,range(256)))).read())


my bad :

>>> open('chars','wb').write("".join(map (chr,range(256))))
>>> len(open('chars','rb').read())
256




What's the v3 equivalent?

Emile








Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Stephen J. Turnbull
Georg Brandl writes:

 > > if it weren't for your stupid maximalist opposition).
 > 
 > Can you please stop throwing personal insults around?  You don't have to
 > resort to that level.

Ethan's posts (as an example of one general trend in this thread) are
pretty frustrating, you have to admit.

MAL posted straight out that the Python 2 model of text makes it easier for
him to write some programs, so he's all for reintroducing it.  And
that is the whole truth of the matter.  Although I disagree with him,
I appreciate his honesty.

But people keep posting "we don't want Python 2's confounding of text
and binary, we just want bytes with (nearly) all the functionality of
strings [because they are (partially|really) encoded text]".  Some of
them actually use the literal word "text" in their justification!

That's, well, what would you call it?  Either they know what they're
saying, in which case it's disingenuous at best, or they don't know
what they're saying, in which case it's a proposal based on a clear
misunderstanding of the situation.  The problem is not going to go
away just because they *say* they don't want to reintroduce Python 2
text processing.  That is precisely what this proposal is *intended*
to do, whether in the limited form proposed by Antoine or in the much
more extensive form that folks like Ethan want.

What "maximalists" mean is that they promise not to abuse Python 2
text processing when writing Python 3 programs.  This promise is
highly unlikely to be kept for two reasons.  First, they can't make
that promise on behalf of third parties, who for various reasons
certainly will abuse these features to avoid the encoded-text-to-
Unicode-text and vice-versa conversions.  Second, I doubt they
themselves will keep the promise to my satisfaction because their
definition of "text" is ambiguous.  When it's convenient for them to
use text-processing operations on bytes, they'll say "oh, yes, these
are conventionally considered text-processing features, but that's
just an accident of the particular configuration of bytes -- yup,
bytes -- I'm processing."

You could argue that this "abuse" isn't *abuse*.  That it's covered by
"consenting adults".  By the same token, so is smoking in a crowded
elevator -- if you don't like it, don't use the elevator!  Of course
in applications used only by the author, there's no abuse (at least
not of others! :-/ )

But Nick's important example of web frameworks demonstrates the
problem: unless they convert to text where appropriate, they're just
pushing the problem off on application writers.  Sometimes passing on
data as bytes is appropriate, of course, but the framework authors are
likely to be biased in favor of doing that, and it's not hard to
imagine frameworks ported from Python 2 passing on the problem
wholesale on the grounds that "we returned str in Python 2 which is
bytes in Python 3, and since we were processing bytes the whole time,
we see no reason to change the 'ABI'."  Of course the application
writers thought they were receiving text "in an inconvenient and
ambiguous form".  IMO, with the proposed changes, that is likely to
continue indefinitely, negating some of the gains I expected to
receive from Python 3. :-(

Note: there are a lot of high-level frameworks like Django that even
in Python 2 basically went to Unicode everywhere internally.  I don't
deny that.  I think that Python 3 as currently constituted makes it a
lot easier to make an appropriate decision of where to convert, and
should take some of the burden off the high-level frameworks.
Approving this PEP, especially in a maximalist form, will blur the
lines.



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Georg Brandl
Am 12.01.2014 20:30, schrieb Emile van Sebille:
> On 01/12/2014 09:26 AM, Paul Moore wrote:
>> Can you give an example of code that is *nearly* acceptable to you,
>> which works in Python 2 and 3 today, and explain what improvements you
>> would like to see to it in order to use it instead of waiting for a
>> core change?
> 
> 
> I'm not a developer, but I'm trying to understand how in v3 I accomplish 
> what in v2 is easy:
> 
> len(open('chars','wb').write("".join(map (chr,range(256)))).read())
> 
> What's the v3 equivalent?

That's actually very easy and shows a strength of the bytes type,
since there's no text involved:

open('chars', 'wb').write(bytes(range(256)))
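
A minimal sketch of the full round trip on Python 3, assuming a throwaway
temporary directory instead of a file named 'chars' in the current directory
(the tempfile usage and the variable names are illustrative, not from the
thread):

```python
import os
import tempfile

# bytes(range(256)) builds one byte for each value 0..255; no text is involved.
payload = bytes(range(256))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chars")
    with open(path, "wb") as f:
        written = f.write(payload)  # write() returns the number of bytes written
    with open(path, "rb") as f:
        data = f.read()

print(written, len(data))
```

Reading the file back in binary mode gives a bytes object of the same length
as what was written, which is the check the Python 2 snippet was after.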

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Daniel Holth writes:

 > -1 on adding more surrogateescapes by default. It's a pain to track
 > down where the encoding errors came from.

What do you mean "by default"?  It was quite explicit in the code I
posted, and it's the only reasonable thing to do with "text data
without known (but ASCII compatible) encoding or multiple different
encodings in a single data chunk".  If you leave it as bytes, it will
barf as soon as you try to mix it with text even if it is pure ASCII!
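
A minimal sketch of the explicit approach described here, assuming a sample
byte string whose non-ASCII byte is in an unknown encoding (the sample data
and variable names are illustrative):

```python
# Bytes in an unknown but ASCII-compatible encoding (0xE9 is not valid ASCII).
raw = b"caf\xe9 menu"

# With surrogateescape, undecodable bytes become lone surrogates instead of
# raising UnicodeDecodeError.
text = raw.decode("ascii", errors="surrogateescape")

# The result is a str, so it mixes with other text without complaint.
label = "Item: " + text

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

The error handler is named at both the decode and the encode call, which is
the sense in which it is explicit rather than a default.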



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Greg Ewing

Paul Moore wrote:

I could easily argue at this point that this is the type of bug that
having %-formatting operations on bytes would encourage - %s means
"format a string" (from years of C and Python (text) experience) so I
automatically supply a string argument when using %s in a bytes
formatting context.


So don't call it %s -- call it something else
such as %b.

--
Greg


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Mark Lawrence

On 12/01/2014 21:06, Greg Ewing wrote:

Paul Moore wrote:

I could easily argue at this point that this is the type of bug that
having %-formatting operations on bytes would encourage - %s means
"format a string" (from years of C and Python (text) experience) so I
automatically supply a string argument when using %s in a bytes
formatting context.


So don't call it %s -- call it something else
such as %b.



Sorry but you can't use %b as that'll confuse people who're used to it 
meaning "Month as locale’s abbreviated name." :)


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Greg Ewing

Nick Coghlan wrote:


On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:


 > Well, my suggestion would be that we _should_ make it work, by having
the %s format specifier on bytes objects mean: str(arg).encode('ascii',
'strict')

>
It is not explicit, it is implicit - whether or not the resulting string 
assumes ASCII compatibility or not depends on whether you pass a binary 
value (no assumption) or a string value (assumes ASCII compatibility).


How do you make that out? As far as I can see, Kristjan's
proposal will *always* call str() on the argument of a
%s format, regardless of its type. The *result*  of that
str() is then *required* (not assumed) to be encodable
as ascii. I don't see any type-dependent changes in
behaviour here.

Interpolating a bytes object as-is, without a conversion
to text, should be done by a different format specifier,
such as %b. All text/bytes conversions are then explicit:
if you write %s, then you're encoding something as ascii,
but if you write %b, you're just inserting something
that's already binary.

--
Greg


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 12:02 PM, Stephen J. Turnbull wrote:

Georg Brandl writes:

Antoine writes:


. . . if it weren't for your stupid maximalist opposition. . .


Can you please stop throwing personal insults around?  You don't have to
resort to that level.


Ethan's posts (as an example of one general trend in this thread) are
pretty frustrating, you have to admit.


Two points:

1) Are you saying it's okay to be insulting when frustrated?  I also find this mega-thread frustrating, but I'm trying 
very hard not to be insulting.


2) If you are going to use my name, please be certain of the facts [1].  More 
below.


MAL posted straight out that the Python 2 model of text makes it easier for
him to write some programs, so he's all for reintroducing it.  And
that is the whole truth of the matter.  Although I disagree with him,
I appreciate his honesty.


If you have an example of me lying (even if it's just a possibility), please refer to it directly so I can either try to 
explain the misunderstanding or apologize.




But people keep posting "we don't want Python 2's confounding of text
and binary, we just want bytes with (nearly) all the functionality of
strings [because they are (partially|really) encoded text]".  Some of
them actually use the literal word "text" in their justification!


In only one case did I use the word "text" loosely, and that was when I claimed that Py2 had three text types, and Py3 
had two.  I was wrong, I apologize.  Py3 has one definite text type, str, and, I claim, one half text type in bytes, 
because bytes itself provides ASCII text processing methods.  If you have a better term for the notion of 
b'ethan'.title() --> b'Ethan' than ASCII-text processing, I'll use that instead.  If there are good reasons to not allow 
further concessions to the ASCII-ness of bytes (and you provide a good one below) then that makes living with the 
handicap easier.  But don't lie to me (as Nick tried to) and say that "In particular, the bytes type is, and always will 
be, designed for pure binary manipulation" when it has methods like .center().
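
A short sketch of the ASCII-oriented bytes methods referred to here; the
sample values are illustrative, and the methods are the standard ones on the
Python 3 bytes type:

```python
# These bytes methods treat byte values in the ASCII letter ranges as letters.
name = b"ethan"

titled = name.title()      # first ASCII letter of each word is uppercased
centered = name.center(9)  # padded on both sides with ASCII spaces
shouted = name.upper()     # ASCII lowercase letters are uppercased

# Bytes outside the ASCII range pass through the case methods unchanged.
mixed = b"caf\xe9".upper()
```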


If I am wrong, and that was not a lie, please explain it to me.



That's, well, what would you call it?  Either they know what they're
saying, in which case it's disingenuous at best, or they don't know
what they're saying, in which case it's a proposal based on a clear
misunderstanding of the situation.


I think some of the misunderstanding (which you also seem to suffer from) is that we (or at least I) /ever/ want a 
unicode string back from bytes interpolation.  I don't!  If I start with bytes, I want bytes back!  And I have a very 
clear grasp on the difference between str and bytes and what ASCII encoding means, it was a hard and painful lesson for 
me and I'm not likely to forget it.


To summarize, I used the term text when referring to unicode text (str), ASCII or ASCII-encoded text to refer to bytes 
that are to be used in a place that requires ASCII bytes for communication (such as content length or field type).  I do 
/not/ use ASCII to refer to any ol' collection of bytes that happens to look like it might be ASCII-encoded text.



The problem is not going to go
away just because they *say* they don't want to reintroduce Python 2
text processing.  That is precisely what this proposal is *intended*
to do, whether in the limited form proposed by Antoine or in the much
more extensive form that folks like Ethan want.

What "maximalists" mean is that they promise not to abuse Python 2
text processing when writing Python 3 programs.  This promise is
highly unlikely to be kept for two reasons.  First, they can't make
that promise on behalf of third parties, who for various reasons
certainly will abuse these features to avoid the encoded-text-to-
Unicode-text and vice-versa conversions.


I concede that this is a good reason to not allow % interpolation.  Kinda like 
not allowing sum on strings.

And I don't make promises for other people, and abusing this feature would be a 
bug.


Second, I doubt they
themselves will keep the promise to my satisfaction because their
definition of "text" is ambiguous.


*My* definition is not ambiguous at all.  If this particular part of the byte stream is defined to contain ASCII-encoded 
text, then I can use the bytes text methods to work with it.  The only time I would return a bytes object is if it was 
supposed to be bytes (an image, for example); otherwise I return a bool, an int, a float, a date, or, even, a str.



When it's convenient for them to
use text-processing operations on bytes, they'll say "oh, yes, these
are conventionally considered text-processing features, but that's
just an accident of the particular configuration of bytes -- yup,
bytes -- I'm processing."


If that particular configuration of bytes is because it's ASCII-encoded text, then sure.  To use, for example, 
bytes.upper on data that wasn't ASCII-encoded text (even if it happened to look like it was) would be the height of 
stupidity

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 12:39 PM, Stephen J. Turnbull wrote:

Daniel Holth writes:

  > -1 on adding more surrogateescapes by default. It's a pain to track
  > down where the encoding errors came from.

What do you mean "by default"?  It was quite explicit in the code I
posted, and it's the only reasonable thing to do with "text data
without known (but ASCII compatible) encoding or multiple different
encodings in a single data chunk".  If you leave it as bytes, it will
barf as soon as you try to mix it with text even if it is pure ASCII!


Which is why some (including myself) are asking to be able to stay in bytes land and do any necessary interpolation 
there.  No resulting unicode, no barfing.  ;)


--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Kristján Valur Jónsson
Right. 
I'm saying, let's support two interpolators only:
%b interpolates a bytes object (or one supporting the charbuffer interface) 
into a bytes object.
%s interpolates a str object by first converting to a bytes object using strict 
ascii conversion.

This makes it very explicit what we are trying to do. I think that using %s to 
interpolate a bytes object like the current PEP does is a bad idea, because %s 
already means 'str' elsewhere in the language, both in 2.7 and 3.x

As for the case you mention:
b"abc %s" % (b"def",) -> b"abc def"
b"abc %s" % (b"def",) -> b"abc b\"def\""  # because str(bytesobject) == repr(bytesobject)

This is perfectly fine, imho.  Let's not overload %s to mean "bytes" in format 
strings if those format strings are in fact not strings but bytes. That way 
madness lies.

K

From: Paul Moore [[email protected]]
Sent: Sunday, January 12, 2014 17:04
To: Kristján Valur Jónsson
Cc: Nick Coghlan; Georg Brandl; [email protected]
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

On 12 January 2014 16:52, Kristján Valur Jónsson  wrote:


But that's not what the current PEP says. It uses %s for interpolating
bytes values. It looks like you're saying that

b'abc %s' % (b'def')

will *not* produce b'abc def', but rather will produce b'abc b\'def\''
(because str(b'def') is "b'def'").
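
A minimal sketch of the two outcomes being contrasted, written with plain
concatenation so it does not depend on any bytes formatting support; it
assumes the interpreter's default settings, since the -b and -bb command-line
options make str() on a bytes object warn or raise:

```python
value = b"def"

# str() on a bytes object gives its repr, including the b prefix and quotes.
as_text = str(value)

# Encoding that text and appending it gives the result Paul describes.
via_str = b"abc " + as_text.encode("ascii")

# Appending the bytes unchanged gives the result the PEP describes.
as_is = b"abc " + value
```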


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Kristján Valur Jónsson

+1, even better.


From: Python-Dev [[email protected]] on 
behalf of Mark Shannon [[email protected]]
Sent: Sunday, January 12, 2014 17:06
To: [email protected]
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

On 12/01/14 16:52, Kristján Valur Jónsson wrote:
> Now you're just splitting hairs, Nick.
>
> An explicit operator, %s, _defined_ to be "encode a string object using
> strict ascii",

I don't like this because '%s' reads to me as "insert *string* here".
I think '%a' which reads as "encode as ASCII and insert here" would be
better.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 01:06 PM, Greg Ewing wrote:

Paul Moore wrote:


I could easily argue at this point that this is the type of bug that
having %-formatting operations on bytes would encourage - %s means
"format a string" (from years of C and Python (text) experience) so I
automatically supply a string argument when using %s in a bytes
formatting context.


So don't call it %s -- call it something else
such as %b.


Which is fine for 3.5+ code, but not at all helpful for a 2/3 code base.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Mark Shannon

Why not just use six.byte_format(fmt, *args)?
It works on both Python2 and Python3 and accepts the numerical format 
specifiers, plus '%b' for inserting bytes and '%a' for converting text 
to ascii.


Admittedly it doesn't exist yet,
but it could and it would save a lot of arguing :)

(Apologies to anyone who doesn't appreciate my mischievous sense of humour)

Cheers,
Mark.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 01:37 PM, Kristján Valur Jónsson wrote:

Right.
I'm saying, let's support two interpolators only:
%b interpolates a bytes object (or one supporting the charbuffer interface) 
into a bytes object.
%s interpolates a str object by first converting to a bytes object using strict 
ascii conversion.

This makes it very explicit what we are trying to do. I think that using %s to 
interpolate a bytes object like the current PEP does is a bad idea, because %s 
already means 'str' elsewhere in the language, both in 2.7 and 3.x

As for the case you mention:
b"abc %s" % (b"def",) -> b"abc def"
b"abc %s" % (b"def",) -> b"abc b\"def\""  # because str(bytesobject) == repr(bytesobject)

This is perfectly fine, imho.  Let's not overload %s to mean "bytes" in format 
strings if those format strings are in fact not strings but bytes. That way madness lies.


You didn't say, but I'm guessing you mean the second one is fine?  If 2/3 compatible code is the goal, the first should 
be what we get.


--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Glenn Linderman

On 1/12/2014 11:14 AM, Ethan Furman wrote:

And a core principle of the bytes/text separation in Python 3 is that
encoding should never happen implicitly.


That could be.  And yet the bytes type already has several concessions 
to ASCII encoding.


"%d" % 26 => an explicit request to convert binary integer to a base-10 
Unicode/text representation of the integer


b"%d" % 26 => an explicit request to convert binary integer to a base-10 
ASCII bytes representation of the integer


The leading "b" seems to be a very explicit request for bytes rather 
than characters to me, and seems much more attractive than the proposals 
to embed binary in Unicode by abusing Latin-1 encoding.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Greg Ewing

Paul Moore wrote:

On 12 January 2014 18:26, Ethan Furman  wrote:


I'm arguing from three PoVs:
1) 2 & 3 compatible code base
2) having the bytes type /be/ the boundary type
3) readable code


The only one of these that I can see being in any way an argument against

def int_to_bytes(n):
    return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is (3),


I think the readability argument becomes a bit sharper when
you consider more complex examples, e.g. if I have a tuple
of 3 floats that I want to put into a PDF file, then

   b"%f %f %f" % my_floats

is considerably clearer than

   b" ".join((float_to_bytes(f) for f in my_floats))
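
For comparison, a sketch of what the join-based version needs in order to run
today. float_to_bytes is a hypothetical helper modelled on int_to_bytes above,
and the sample tuple is illustrative:

```python
def float_to_bytes(f):
    # Hypothetical helper: format as text first, then encode explicitly as ASCII.
    return ("%f" % f).encode("ascii")


my_floats = (1.0, 2.5, 72.0)

# One explicit conversion per value, joined with an ASCII space.
line = b" ".join(float_to_bytes(f) for f in my_floats)
```

The formatting and the encoding are both spelled out, at the cost of a helper
and a generator expression for what is conceptually one template.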


My reading of Nick's refusal is that %d takes a value which is
semantically a number, converts it into a base-10 representation
(which is semantically a *string*, not a sequence of bytes[1]) and
then *encodes* that string into a series of bytes using the ASCII
encoding. That is *two* semantic transformations, and one (the ASCII
encoding) is *implicit*. Specifically, it's implicit because (a) the
normal reading of %d is "produce the base-10 representation of a
number", and a base-10 representation is a *string*, and (b) because
nowhere has ASCII been mentioned


It's indicated (I won't say "implied", see below) by the
fact that we're interpolating it into a bytes object rather
than a string.

This is no more or less implicit than the fact that when
we write

   b"ABC"

then we're saying that those characters are to be encoded
in ASCII, and not EBCDIC or UTF-16 or...

BTW, there's a problem with bandying around the words
"implicit" and "explicit", because they depend on your frame
of reference. For example, one person might say that the
fact that b"%s" encodes into ASCII is implicit, because
ASCII isn't written down in the code anywhere. But another
person might say it's explicit, because the manual explicitly
says that stuff interpolated into a bytes object is encoded
as ASCII.

So arguments of the form "X is bad because it's not
explicit" are prone to getting people talking past each
other.

--
Greg


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Greg Ewing

Ethan Furman wrote:
Your asciistr, which sometimes returns bytes and sometimes returns text, 
is absolutely *not* what we want.


The kind of third-party thing that *might* fill the bill
would be a *function*:

   bytesformat(b"Content-Length: %d", length)

that implements all the %-specifiers we're asking for.
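
A rough sketch of what such a function might look like. Everything here is an
assumption for illustration: the name bytesformat comes from the message
above, but the supported codes (%d, %f, %x, %b and %%) and the error
behaviour are one possible choice, not an agreed design:

```python
import re

# Matches %% or a specifier with optional flags, width and precision,
# for example %d, %5.2f or %b. Other codes are left in the output untouched.
_SPEC = re.compile(rb"%(?:%|([-+ #0]*\d*(?:\.\d+)?)([dfxb]))")


def bytesformat(fmt, *args):
    """Hypothetical helper: interpolate args into the bytes template fmt."""
    values = iter(args)

    def replace(match):
        if match.group(0) == b"%%":
            return b"%"
        options, code = match.group(1), match.group(2)
        try:
            arg = next(values)
        except StopIteration:
            raise TypeError("not enough arguments for format") from None
        if code == b"b":
            # Bytes are inserted as-is; str is rejected, never encoded silently.
            if not isinstance(arg, (bytes, bytearray)):
                raise TypeError("%b requires a bytes-like argument")
            return bytes(arg)
        # Numeric codes: format as text, then encode explicitly as ASCII.
        spec = "%" + options.decode("ascii") + code.decode("ascii")
        return (spec % arg).encode("ascii")

    return _SPEC.sub(replace, fmt)


header = bytesformat(b"Content-Length: %d", 42)
```

Surplus arguments are ignored in this sketch; a real implementation would
probably want to reject them, as str % does.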

--
Greg


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Greg Ewing

Mark Lawrence wrote:
I entirely agree.  This would also parallel the conversion flags given 
here http://docs.python.org/3/library/string.html#format-string-syntax, 
I quote "Three conversion flags are currently supported: '!s' which 
calls str() on the value, '!r' which calls repr() and '!a' which calls 
ascii()".


Except that ascii() does something rather different --
it's a variation on repr() rather than str(), and it
doesn't imply any encoding operation.

I think this parallel would be more confusing than
helpful.
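
A small sketch of the difference, using an illustrative sample word:

```python
word = "caf\xe9"  # the text 'café'

as_str = str(word)      # the text itself, unchanged
as_repr = repr(word)    # quoted
as_ascii = ascii(word)  # quoted, with non-ASCII characters backslash-escaped

# All three results are str objects; none of them is bytes.
kinds = {type(as_str), type(as_repr), type(as_ascii)}
```

So ascii() produces escaped text for display, not an ASCII-encoded byte
string.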

--
Greg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 01:59 PM, Mark Shannon wrote:


Why not just use six.byte_format(fmt, *args)?
It works on both Python2 and Python3 and accepts the numerical format 
specifiers, plus '%b' for inserting bytes and '%a'
for converting text to ascii.


Sounds like the second best option!



Admittedly it doesn't exist yet,
but it could and it would save a lot of arguing :)


:)

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Chris Angelico
On Mon, Jan 13, 2014 at 4:57 AM, Juraj Sukop  wrote:
> On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano 
> wrote:
>> First, "utf16_string" confuses me. What is it? If it is a Unicode
>> string, i.e.:
>
> It is a Unicode string which happens to contain code points outside U+00FF
> (as with the TTF example above), so that it triggers the (at least) 2-bytes
> memory representation in CPython 3.3+. I agree, I chose the variable name
> poorly, my bad.

When I'm talking about Unicode strings based on their maximum
codepoint, I usually call them something like "ASCII string", "Latin-1
string", "BMP string", and "SMP string". Still not wholly accurate,
but less confusing than naming an encoding... oh wait, two of those
_are_ encodings :| But you could use "narrow string" for the first
two. Or "string(0..127)" for ASCII, "string(0..255)" for Latin-1, and
then for consistency "string(0..65535)" and "string(0..1114111)" for
the others, except that I doubt that'd be helpful :) At any rate,
"BMP" as a term for "includes characters outside of Latin-1 but all on
the Basic Multilingual Plane" would probably be close enough to get
away with.
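
A minimal sketch of that naming scheme as a function. width_class is a
hypothetical helper, and the labels are the informal ones suggested above
rather than official terms:

```python
def width_class(s):
    # Hypothetical helper: name a str by its highest code point.
    top = max(map(ord, s), default=0)
    if top < 0x80:
        return "ASCII string"
    if top < 0x100:
        return "Latin-1 string"
    if top < 0x10000:
        return "BMP string"
    return "SMP string"


labels = [width_class(s) for s in ("abc", "caf\xe9", "\u20ac", "\U0001F600")]
```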

ChrisA


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Paul Moore
On 12 January 2014 22:10, Greg Ewing  wrote:
> I think the readability argument becomes a bit sharper when
> you consider more complex examples, e.g. if I have a tuple
> of 3 floats that I want to put into a PDF file, then
>
>    b"%f %f %f" % my_floats
>
> is considerably clearer than
>
>    b" ".join((float_to_bytes(f) for f in my_floats))

Hmm, I'm not sure I'd agree. I'd quote "explicit is better than
implicit", but given comments below, that would be a mistake :-) Let's
just leave it that I'd probably wrap the whole thing in a
float_list(floats) function in my application, and not *care* how it
was implemented.

One thing that this does bring up, though, is that all the talk is
about %-formatting. Do the people who are arguing for numeric
formatting have views on what (if any) features will be included in
bytes.format()? It seems to me that recasting many of the discussions
using format() make it much less "obvious" that adding the features to
bytes formatting is a reasonable thing to do. I won't give specific
examples, because I would be putting words into people's mouths. But I
*would* say that any genuine proposal for numeric formatting in bytes
should be cast as a formal PEP and explicitly document both % and
format() behaviours.

> It's indicated (I won't say "implied", see below) by the
> fact that we're interpolating it into a bytes object rather
> than a string.
>
> This is no more or less implicit than the fact that when
> we write
>
>b"ABC"
>
> then we're saying that those characters are to be encoded
> in ASCII, and not EBCDIC or UTF-16 or...

That's a fair point, and one I had not taken into consideration.

> BTW, there's a problem with bandying around the words
> "implicit" and "explicit", because they depend on your frame
> of reference. For example, one person might say that the
> fact that b"%s" encodes into ASCII is implicit, because
> ASCII isn't written down in the code anywhere. But another
> person might say it's explicit, because the manual explicitly
> says that stuff interpolated into a bytes object is encoded
> as ASCII.

In my defense, I would say that I was trying to clarify Nick's
objections, and it's entirely possible I misrepresented this aspect of
them.

Personally, I agree that it's not as black and white as simply saying
"numeric formatting is wrong", but I think that the fact that %d et al
represent a "double transformation" (from number to string
representation to encoded bytes) is the differentiating factor here.
Proposals that do nothing but interpolation are essentially
convenience wrappers for various combinations of concatenation and
join. Adding "double transformation" formatting codes is a step
change, and needs to be explicitly acknowledged and justified. (If you
*do* manage to justify such codes, there's a secondary question of
precisely what codes should be supported, but we can start by getting
agreement that the *class* of codes is allowed). PEP 460 explicitly
excludes anything but pure interpolation.
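
The "double transformation" can be spelled out step by step. This is only an illustration of the distinction drawn above, not anything PEP 460 specifies; the variable names are made up:

```python
length = 1234

# Pure interpolation: bytes go in, bytes come out, nothing is converted.
joined = b"".join([b"Length: ", b"1234"])

# "Double transformation": number -> str (formatting), then str -> bytes (encoding).
as_text = "%d" % length              # first step, yields a str
as_bytes = as_text.encode("ascii")   # second step, yields bytes

result = b"Length: " + as_bytes
```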

> So arguments of the form "X is bad because it's not
> explicit" are prone to getting people talking past each
> other.

Fair point. I hope my above paragraph clarifies my position somewhat better.

Paul


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Steven D'Aprano writes:

 > then the name is horribly misleading, and it is best handled like this:
 > 
 > content = '\n'.join([
 > 'header',
 > 'part 2 %.3f' % number,
 > binary_image_data.decode('latin-1'),
 > utf16_string,  # Misleading name, actually Unicode string
 > 'trailer'])

This loses bigtime, as any encoding that can handle non-latin1 in
utf16_string will corrupt binary_image_data.  OTOH, latin1 will raise
on non-latin1 characters.  utf16_string must be encoded appropriately
then decoded by latin1 to be reencoded by latin1 on output.
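
A small sketch of the sequence described here, with made-up sample data. UTF-16 (little-endian, no BOM) is assumed as the "appropriate" encoding purely for illustration:

```python
binary_image_data = bytes([0x89, 0x50, 0xFF, 0x00])  # stand-in for real image bytes
text = "caf\u00e9 \u20ac"  # the euro sign is outside Latin-1

# Encoding the text directly as Latin-1 raises on the euro sign.
try:
    text.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# Encode appropriately first, then decode as Latin-1 to smuggle the bytes.
content = "\n".join([
    "header",
    binary_image_data.decode("latin-1"),
    text.encode("utf-16-le").decode("latin-1"),
    "trailer",
])

# Re-encoding with Latin-1 on output restores every smuggled byte unchanged.
wire = content.encode("latin-1")
```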

 > On the other hand, if it is actually a bytes object which is the product 
 > of UTF-16 encoding, i.e.:
 > 
 > type(utf16_string)
 > => returns bytes
 > 
 > and those bytes were generated by "some text".encode("utf-16"), then it 
 > is already binary data and needs to be smuggled into the text string. 
 > Latin-1 is good for that:
 > 
 > content = '\n'.join([
 > 'header',
 > 'part 2 %.3f' % number,
 > binary_image_data.decode('latin-1'),
 > utf16_string.decode('latin-1'),
 > 'trailer'])
 > 
 > 
 > Both examples assume that you intend to do further processing of content 
 > before sending it, and will encode just before sending:
 > 
 > content.encode('utf-8')
 > 
 > (Don't use Latin-1, since it cannot handle the full range of text 
 > characters.)

This corrupts binary_image_data.  Each byte > 127 will be replaced by
two bytes.  In the second case, you can use latin1 to encode, if it
gives you what you want.
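
The size change is easy to check with three sample bytes (an illustration, not code from the thread):

```python
raw = bytes([0x41, 0x80, 0xFF])          # one ASCII byte, two bytes > 127
smuggled = raw.decode("latin-1")         # three code points: U+0041, U+0080, U+00FF

as_utf8 = smuggled.encode("utf-8")       # each code point > 127 becomes two bytes
as_latin1 = smuggled.encode("latin-1")   # one byte per code point, same as raw

print(len(raw), len(as_utf8), as_latin1 == raw)
```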

This kind of subtlety is precisely why MAL warned about use of latin1
to smuggle bytes.



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Mark Lawrence

On 12/01/2014 17:03, Ethan Furman wrote:
> On 01/12/2014 08:21 AM, Ethan Furman wrote:
>> On 01/12/2014 08:09 AM, Nick Coghlan wrote:
>>> On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:
>>>> Imho, this is not equivalent to re-introducing automatic type
>>>> conversion between binary/unicode, it is adding a
>>>> specific convenience function for explicitly asking for ASCII encoding.
>>>
>>> It is not explicit, it is implicit - whether or not the resulting
>>> string assumes ASCII compatibility or not depends on
>>> whether you pass a binary value (no assumption) or a string value
>>> (assumes ASCII compatibility).
>>
>> Nick, I don't understand what you are saying here.  Are you saying
>> that the result of b'%s' % var may be either a bytes
>> object or a str object?  Because that would be wrong -- it would
>> always be a bytes object.
>
> Okay, I just went and took a closer look at the asciistr type [1].  For
> what it's worth I don't think this is Antoine's understanding of what we
> [2] are asking for, nor is it what we are asking for (I'm sure Antoine
> will correct me if I'm wrong. ;)
>
> We know full well the difference between unicode and bytes, and we know
> full well that numbers and much of the text we need has an ASCII
> (bytes!) representation.  When we do a b'Content Length: %d' %
> len(binary_data) we are expecting to get back a bytes object, /not/ a
> unicode object.
>
> Your asciistr, which sometimes returns bytes and sometimes returns text,
> is absolutely *not* what we want.

I've just tried asciistr using your test code (having corrected the
typo, it's assertIsInstance, not assertIsinstance :) and it looks like a
very good starting point.  Have you, or anyone else for that matter,
actually tried asciistr out?

> --
> ~Ethan~
>
> [1] https://github.com/jeamland/asciicompat
> [2] the dbf and pdf folks, at least


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 02:31 PM, Stephen J. Turnbull wrote:
> This corrupts binary_image_data.  Each byte > 127 will be replaced by
> two bytes.  In the second case, you can use latin1 to encode, if it
> gives you what you want.
>
> This kind of subtlety is precisely why MAL warned about use of latin1
> to smuggle bytes.

And why I've been fighting Steven D'Aprano on it.

--
~Ethan~


[Python-Dev] Trying to focus the whole bytes/str formatting discussion

2014-01-12 Thread Brett Cannon
I don't know about the rest of you but I feel like the discussion is
heading off the rails (if it hasn't already jumped the tracks). Let's try
to bring this back around to something actionable which people can focus
their energy on as the amount of developer time spent arguing could have
led to several coded-up solutions.

I see it as a practicality-beats-purity (PBP) vs.
explicit-is-better-than-implicit (EIBTI) split. The PBP group want bytes.format() (just
assume I include interpolation support if you want that) to work as close
to a drop-in replacement for current str.format() use in Python 2 to ease
porting. The argument is that code looks cleaner and the amount of changes
in Python 2 code being ported to Python 3 is much smaller.

The EIBTI group are willing to support PEP 460 but beyond that don't want
to have in Python itself anything for bytes.format() which takes in a
string and spits out bytes. It's bytes in->bytes out and not bytes & str
in->bytes out as the PBP group is after. The EIBTI group are arguing that
letting str into bytes.format() and then automatically be converted to
strict ASCII leads to conflating the text/bytes divide as well as being too
magical, e.g. what if you actually wanted UTF-16 for your number string
instead of ASCII; the EIBTI group **wants** to force people to make a
decision. They are also less concerned with making users update Python 2
code to handle this as it already needs to be updated for other Python 3
things anyway.

From where I'm sitting, the EIBTI group and their PEP 460 proposal from
Antoine (and no longer Victor) are not controversial. Everyone seems to
agree that PEP 460 **at minimum** is acceptable and should happen for
Python 3.5. The people with the uphill battle and something to prove are
those arguing for str in->bytes out support in bytes.format(). The added
features that the PBP group want are the ones being argued over.

As the onus is on the PBP group to convince the EIBTI group (or Guido), I
think the PBP group should code up a solution that does what they want and
put it on PyPI to see what the community thinks. If the PBP group wants to
convince the EIBTI group that str in->bytes out for bytes.format() is
critical in getting a key group of users to start using Python 3 then I
think that needs to be demonstrated through real-world usage by some people.

If there is serious pickup of the solution from PyPI by projects then we
can discuss integrating it into Python 3.5. That gives at least **one
year** to come up with a solution which gets picked up by the community
(standard requirement for stdlib inclusion). At worst some projects use the
PyPI project and find it useful but it doesn't go into Python 3.5. At best
lots of people find it useful enough that we add it to Python 3.5. But
regardless, a PyPI project helps people **no matter what** the EIBTI group
thinks. That's more forward momentum than this conversation currently has.

This has split down philosophical lines and does not look to be tilting one
way or the other by simply using words. I think it has reached the point
that showing code is going to be the only way to tilt the favour towards
the PBP group at this point. Guido has not spoken up so either he is
ignoring it because he's busy, he doesn't care, or he's mulling things over
still. Assuming he doesn't speak up then it comes down to getting a clear
majority on the side of the PBP group and that is not going to happen the
way this discussion is going.

So, action items are:

* Get PEP 460 pronounced on **as is**
* A PyPI project containing PBP ideas and see if the community seizes on it
or not (benefit to people regardless)
* Do a separate PEP that builds on PEP 460 if people really want to
continue down that road at this time

Don't forget, we are talking about Python 3.5; we have not even hit Python
3.4rc1 yet so this level of arguing seems a bit premature and going nowhere.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Antoine Pitrou

Hi Ethan,

On Sun, 12 Jan 2014 13:28:15 -0800
Ethan Furman  wrote:
> On 01/12/2014 12:02 PM, Stephen J. Turnbull wrote:
> > Georg Brandl writes:
> >> Antoine writes:
> >>>
> >>> . . . if it weren't for your stupid maximalist opposition. . .
> >>
> >> Can you please stop throwing personal insults around?  You don't have to
> >> resort to that level.
> >
> > Ethan's posts (as an example of one general trend in this thread) are
> > pretty frustrating, you have to admit.
> 
> Two points:
> 
> 1) Are you saying it's okay to be insulting when frustrated?
> I also find this mega-thread frustrating, but I'm trying 
> very hard not to be insulting.

You are right, it is not ok. The wording wasn't constructive or
controlled at all. I'd like to apologize for that.

At the same point, I was expressing a fair amount of frustration. I
think the last discussion rounds have largely failed to produce any
new meaningful insight (to the point that I've stopped reading several
subthreads). IMO the best thing *for now* would be to "agree to
disagree", let things bake in everyone's mind for some time, and
revisit the subject in some weeks.

Regards

Antoine.




Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Stephen J. Turnbull
Ethan Furman writes:

 > Nothing else is ideal.  I'll go that route if I have to.  I
 > understand that in the real world you go with what works, but in
 > the development stage you fight for the ideal.  :)

You're going to lose, because Python 3 chose a different ideal that
conflicts with yours.

 > > My reading of Nick's refusal is that %d takes a value which is
 > > semantically a number, converts it into a base-10 representation
 > > (which is semantically a *string*, not a sequence of bytes[1]) and
 > > then *encodes* that string into a series of bytes using the ASCII
 > > encoding.
 > 
 > That could be.  And yet the bytes type already has several
 > concessions to ASCII encoding.

No, Nick's point is that there's no encoding needed there at all,
just a bunch of methods that handle numbers in the range 0-255.  You
can rationalize the particular choice of numbers by referring to the
ASCII coded character set, and that's very useful to users.  But
knowledge of ASCII isn't necessary to specify these methods; they can
be defined in an encoding/decoding-free way.
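
To illustrate what "encoding/decoding-free" can mean, here is a sketch that defines an upper-casing operation purely on integers in the range 0-255. The function name is hypothetical, and this is not how CPython implements bytes.upper():

```python
def upper_bytes(data):
    """Map the values 97..122 to 65..90 and leave every other value alone."""
    return bytes(b - 32 if 97 <= b <= 122 else b for b in data)


sample = b"abc\xe9xyz"
print(upper_bytes(sample) == sample.upper())
```

No str object and no codec is involved; the choice of 97..122 is the only place where ASCII shows through.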

 > But bytes already acknowledges an ASCII bias.

True, but that bias is implemented without use of encoding or
decoding.   b'%d' % (123,) -> b'123' does require encoding, at the
very least in the sense of type change and serialization.


[Python-Dev] Python advanced debug support (update frame code)

2014-01-12 Thread Fabio Zadrozny
Hi Python-dev.

I'm playing a bit with the concept of live-coding during a debug session and
one of the most annoying things is that although I can reload the code for
a function (using something close to xreload), it seems it's not possible
to change the code for the current frame (i.e.: I need to get out of the
function call and then back in to a call to the method from that frame to
see the changes).

I took a look at the frame object and it seems it would be possible to set
frame.f_code to another code object -- and set the line number to the start
of the new object, which would cover the most common situation, which would
be restarting the current frame -- provided the arguments remain the same
(which is close to what the Java debugger in Eclipse does when it drops the
current frame -- in Python, provided I'm not in a try..except block, I can
do even better by setting frame.f_lineno, but without being able to
change the frame's f_code it loses a lot of its usefulness).

So, I'd like to ask for feedback from people with more knowledge on whether
it'd be actually feasible to change the frame.f_code and possible
implications on doing that.
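
For context, a sketch of the current behaviour being described, assuming CPython 3 (sys._getframe is an implementation detail, and the function names are invented): a function's __code__ can be swapped, while assigning to a running frame's f_code is rejected.

```python
import sys


def old():
    return "old"


def new():
    return "new"


def try_swap_running_frame():
    """Attempt to replace the code object of the currently executing frame."""
    frame = sys._getframe()
    try:
        frame.f_code = new.__code__
    except (AttributeError, TypeError) as exc:
        # CPython exposes f_code as a read-only attribute.
        return type(exc).__name__
    return "swapped"


# Reloading at function level works: later calls run the new code.
old.__code__ = new.__code__
print(old())
print(try_swap_running_frame())
```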

Thanks,

Fabio


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Glenn Linderman

On 1/12/2014 2:57 PM, Stephen J. Turnbull wrote:
>  > But bytes already acknowledges an ASCII bias.
>
> True, but that bias is implemented without use of encoding or
> decoding.   b'%d' % (123,) -> b'123' does require encoding, at the
> very least in the sense of type change and serialization.

b'%d' all by itself, even before using the % operator, does require
encoding, at the very least in the sense of type change and serialization.


Re: [Python-Dev] Trying to focus the whole bytes/str formatting discussion

2014-01-12 Thread Cameron Simpson
On 12Jan2014 17:46, Brett Cannon  wrote:
> The EIBTI group are willing to support PEP 460 but beyond that don't want
> to have in Python itself anything for bytes.format() which takes in a
> string and spits out bytes. It's bytes in->bytes out and not bytes & str
> in->bytes out as the PBP group is after. The EIBTI group are arguing that
> letting str into bytes.format() and then automatically be converted to
> strict ASCII leads to conflating the text/bytes divide as well as being too
> magical, e.g. what if you actually wanted UTF-16 for your number string
> instead of ASCII; the EIBTI group **wants** to force people to make a
> decision. They are also less concerned with making users update Python 2
> code to handle this as it already needs to be updated for other Python 3
> things anyway.
[...]

I'm in the EIBTI on the whole, but I would also be happy for the
bytes.format() function to accept strings (and floats or whatever
the str.format supports) _provided_ it required an explicit encoding=
parameter to enable it.

i.e. make it easy to use, _but_ require an overt specification of
the str->bytes encoding.

You don't even need a special mode, but have it raise a ValueError
if the (default) encoding is None when an encoding became needed.
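
One way to picture this suggestion is a free function rather than a real bytes method. Everything here (the name format_bytes, the bare b"{}" placeholder syntax) is hypothetical and only sketches the "explicit encoding or ValueError" rule:

```python
def format_bytes(template, *args, encoding=None):
    """Fill b"{}" placeholders; non-bytes values need an explicit encoding."""
    pieces = template.split(b"{}")
    if len(pieces) != len(args) + 1:
        raise ValueError("placeholder/argument count mismatch")

    out = [pieces[0]]
    for arg, tail in zip(args, pieces[1:]):
        if isinstance(arg, (bytes, bytearray)):
            out.append(bytes(arg))  # bytes in, bytes out: no encoding needed
        else:
            if encoding is None:
                raise ValueError("encoding= is required to format non-bytes values")
            out.append(str(arg).encode(encoding))
        out.append(tail)
    return b"".join(out)


print(format_bytes(b"Content-Length: {}\r\n", 42, encoding="ascii"))
```

With only bytes arguments the call behaves like pure interpolation; the encoding is consulted only at the moment a str or number actually has to be converted.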

Just my 2c on Brett's EIBTI vs PBP divide. I'll try to stay off
this thread now and bikeshed only in the others...
-- 
Cameron Simpson 

You can blip it twice to clear the bore,
But blip it thrice, and you've sinned once more.
- Tom Warner 


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Steven D'Aprano
On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
> 
>  > then the name is horribly misleading, and it is best handled like this:
>  > 
>  > content = '\n'.join([
>  > 'header',
>  > 'part 2 %.3f' % number,
>  > binary_image_data.decode('latin-1'),
>  > utf16_string,  # Misleading name, actually Unicode string
>  > 'trailer'])
> 
> This loses bigtime, as any encoding that can handle non-latin1 in
> utf16_string will corrupt binary_image_data.  OTOH, latin1 will raise
> on non-latin1 characters.  utf16_string must be encoded appropriately
> then decoded by latin1 to be reencoded by latin1 on output.

Of course you're right, but I have understood the above as being a 
sketch and not real code. (E.g. does "header" really mean the literal 
string "header", or does it stand in for something which is a header?) 
In real code, one would need to have some way of telling where the 
binary image data ends and the Unicode string begins.

If I have misunderstood the situation, then my apologies for compounding 
the error.


[...]
>  > Both examples assume that you intend to do further processing of content 
>  > before sending it, and will encode just before sending:
>  > 
>  > content.encode('utf-8')
>  > 
>  > (Don't use Latin-1, since it cannot handle the full range of text 
>  > characters.)
> 
> This corrupts binary_image_data.  Each byte > 127 will be replaced by
> two bytes.

And reading it back using decode('utf-8') will replace those two bytes 
with a single byte, round-tripping exactly.

Of course if you encode to UTF-8 and then try to read the binary data as 
raw bytes, you'll get corrupted data. But do people expect to do this? 
That's a genuine question -- again, I assumed (apparently wrongly) that 
the idea was to write the content out as *text* containing smuggled 
bytes, and read it back the same way.
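
Both halves of this point can be checked on sample data (a sketch, not code from the thread): reading the UTF-8 output back as text round-trips, while treating it as raw bytes does not.

```python
raw = bytes([0x00, 0x7F, 0x80, 0xFF])
content = raw.decode("latin-1")        # smuggle the bytes into a str
written = content.encode("utf-8")      # what ends up in the file or on the wire

# Read back as text: decode UTF-8, then undo the Latin-1 smuggling.
recovered = written.decode("utf-8").encode("latin-1")

# Read back as raw bytes: the payload no longer matches the original.
same_as_raw = (written == raw)

print(recovered == raw, same_as_raw)
```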


> In the second case, you can use latin1 to encode, if it
> gives you what you want.
> 
> This kind of subtlety is precisely why MAL warned about use of latin1
> to smuggle bytes.

How would you smuggle a chunk of arbitrary bytes into a text string? 
Short of doing something like uuencoding it into ASCII, or equivalent.


-- 
Steven


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 02:57 PM, Stephen J. Turnbull wrote:
> Ethan Furman writes:
>> Nothing else is ideal.  I'll go that route if I have to.  I
>> understand that in the real world you go with what works, but in
>> the development stage you fight for the ideal.  :)
>
> You're going to lose, because Python 3 chose a different ideal that
> conflicts with yours.

Entirely possible.  I didn't set out to waste anyone's time, but I wasn't around for the initial discussions so don't
know the reasons behind the result, only that the result is not an appropriate boundary type despite it being what is
handed around at the boundaries.

>>> My reading of Nick's refusal is that %d takes a value which is
>>> semantically a number, converts it into a base-10 representation
>>> (which is semantically a *string*, not a sequence of bytes[1]) and
>>> then *encodes* that string into a series of bytes using the ASCII
>>> encoding.
>>
>> That could be.  And yet the bytes type already has several
>> concessions to ASCII encoding.
>
> No, Nick's point is that there's no encoding needed there at all,
> just a bunch of methods that handle numbers in the range 0-255.  You
> can rationalize the particular choice of numbers by referring to the
> ASCII coded character set, and that's very useful to users.  But
> knowledge of ASCII isn't necessary to specify these methods; they can
> be defined in an encoding/decoding-free way.

How can you say that with a straight face? [1]  Do you really think that .title, .isalnum, and .center (to name only a
few) would work the same if the assumed encoding was EBCDIC?  Do you think they would do the proper transformations, or
return the proper result, if the bytes they were used on were encoded Japanese?

>> But bytes already acknowledges an ASCII bias.
>
> True, but that bias is implemented without use of encoding or
> decoding.   b'%d' % (123,) -> b'123' does require encoding, at the
> very least in the sense of type change and serialization.

You mean like changing a number into text does?  Really, this is no different.

--
~Ethan~

[1] I'm sorry to be offensive, but I have no idea how to respond to that that acknowledges my complete astonishment that
you would say such a thing.



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 02:52 PM, Antoine Pitrou wrote:
> You are right, it is not ok. The wording wasn't constructive or
> controlled at all. I'd like to apologize for that.

Thank you.  Apology accepted.

> At the same point, I was expressing a fair amount of frustration. I
> think the last discussion rounds have largely failed to produce any
> new meaningful insight (to the point that I've stopped reading several
> subthreads). IMO the best thing *for now* would be to "agree to
> disagree", let things bake in everyone's mind for some time, and
> revisit the subject in some weeks.

For the most part I agree.  I did, though, finally figure out what Nick thought I wanted, so there was at least a little
progress.

But yes, I think tabling the discussion for now, and working on Brett's ideas,
is entirely appropriate.

--
~Ethan~

P.S.  Direct reply so you don't miss my response.  :)


[Python-Dev] PEP 460 reboot

2014-01-12 Thread Guido van Rossum
There's a lot of discussion about PEP 460 and I haven't read it all.
Maybe you all have already reached the same conclusion that I have. In
that case I apologize (but the PEP should be updated). Here's my
contribution:

PEP 460 itself currently rejects support for %d, AFAIK on the basis
that bytes aren't necessarily ASCII. I think that's a misunderstanding
of the intention of the bytes type.

The key reason for introducing a separate bytes type in Python 3 is to
avoid *mixing* bytes and text. This aims to avoid the classic Python 2
Unicode failure, where str+unicode fails or succeeds based on whether
str contains non-ASCII characters or not, which means it is easy to
miss in testing. Properly written code in Python 3 will fail based on
the *type* of the objects, not based on their contents. Content-based
failures are still possible, but they occur in typical "boundary"
operations such as encode/decode.

But this does not mean the bytes type isn't allowed to have a
noticeable bias in favor of encodings that are ASCII supersets, even
if not all bytes objects contain such data (e.g. image data,
compressed data, binary network packets, and so on).

IMO it's totally fine and consistent if b'%d' % 42 returns b'42' and
also for b'{}'.format(42) to return b'42'. There are numerous places
where bytes are already assumed to use an ASCII superset:

- byte literals: b'abc' (it's a syntax error to have a non-ASCII character here)
- the upper() and lower() methods modify the ASCII letter positions
- int(b'42') == 42, float(b'3.14') == 3.14
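
These existing behaviours can be checked directly in Python 3 (the sample values are arbitrary):

```python
# upper() only touches the ASCII letter positions; other byte values pass through.
print(b"abc".upper())
print(bytes([0xE9]).upper() == bytes([0xE9]))

# int() and float() parse ASCII digits straight from bytes.
print(int(b"42") == 42, float(b"3.14") == 3.14)

# A non-ASCII character inside a bytes literal is rejected at compile time.
try:
    compile("b'caf\u00e9'", "<example>", "eval")
    literal_ok = True
except SyntaxError:
    literal_ok = False
print(literal_ok)
```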

I looked through the example code I recently wrote for asyncio (which
uses bytes for all data read or written). There are several places
where I have to make a clumsy detour via text strings because I need
to include an ASCII-encoded decimal integer (e.g. the Content-Length
header) or a hex-encoded one (e.g. for Transfer-Encoding: chunked).
Those detours aren't needed for parsing because int() accepts bytes
just fine.
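
The kind of detour meant here probably looks something like the following sketch (not the actual asyncio example code; the variable names are invented):

```python
body = b"hello world"

# Building: the number has to go through str before it can be joined to bytes.
content_length = b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
chunk_header = format(len(body), "x").encode("ascii") + b"\r\n"  # hex size for chunked encoding

# Parsing: no detour, int() accepts bytes directly.
parsed_length = int(content_length.split(b":")[1].strip())
parsed_chunk = int(chunk_header.strip(), 16)

print(content_length, chunk_header, parsed_length, parsed_chunk)
```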

I also note that the behavior of the re module is perfect: if the
pattern is bytes, it can only match bytes and the extracted data is
bytes, and ditto for text -- so it supports both types but doesn't
allow mixing them. The urllib module does this too -- at considerable
cost in its implementation, but it's the right thing, because there
really are good cases to be made for treating URLs as text as well as
for treating them as bytes (as with filenames, command line arguments,
and environment variables).
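
A short check of the re behaviour described above (sample patterns and data are arbitrary):

```python
import re

# bytes pattern on bytes data: the match groups are bytes.
m_bytes = re.match(rb"(\d+)", b"42 bytes")

# str pattern on str data: the match groups are str.
m_text = re.match(r"(\d+)", "42 chars")

# Mixing the two types is a TypeError, whatever the contents.
try:
    re.match(rb"\d+", "42")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(m_bytes.group(1), m_text.group(1), mixed_ok)
```

The failure depends on the types alone, which is the property the message contrasts with Python 2's content-dependent str+unicode errors.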

I'm sad that the json module in Python 3 doesn't support bytes at all,
but at least it is consistent -- it always produces text in ASCII
encoding (by default). The same applies to the http module, which IIUC
adheres to the standard by treating headers as Latin-1.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Ethan Furman writes:

 > > This kind of subtlety is precisely why MAL warned about use of latin1
 > > to smuggle bytes.
 > 
 > And why I've been fighting Steven D'Aprano on it.

No, I think you haven't been fighting Steven d'A on "it".  You're
talking about parsing and generating structured binary files, he's
talking about techniques for parsing and generating streams with no
real structure above the byte or encoded character level.

Of course you can implement the former with the latter using Python 3
"str", but it's ugly, maybe even painful if you need to encode binary
blobs back to binary to process them.  (More discussion in my other
post, although I suspect you're not going to be terribly happy with
that, either. ;-)

This generally *is not* the case for the wire protocol guys.  AFAICT
they really do want to process things as streams of ASCII-compatible
text, with the non-ASCII stuff treated as runs of uninterpreted bytes
that are just passed through.

So when you talk about "we", I suspect you are not the "we" everybody
else is arguing with.  In particular, AIUI your use case is not
included in the use cases most of us -- including Steven -- are
thinking about.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Stephen J. Turnbull
Glenn Linderman writes:

 > the proposals to embed binary in Unicode by abusing Latin-1
 > encoding.

Those aren't "proposals", they are currently feasible techniques in
Python 3 for *some* use cases.

The question is why infecting Python 3 with the byte/character
confoundance virus is preferable to such techniques, especially if
their (serious!) deficiencies are removed by creating a new type such
as asciistr.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Ethan Furman

On 01/12/2014 04:02 PM, Stephen J. Turnbull wrote:
> So when you talk about "we", I suspect you are not the "we" everybody
> else is arguing with.  In particular, AIUI your use case is not
> included in the use cases most of us -- including Steven -- are
> thinking about.

Ah, so even in the minority I'm in the minority.  :/  The "we" I am usually referring to are those of us who have to
deal with the mixed ASCII/binary/encoded text files (a couple have spoken up about PDFs, and I have DBF).


--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Donald Stufft

On Jan 12, 2014, at 6:55 PM, Guido van Rossum  wrote:

> The key reason for introducing a separate bytes type in Python 3 is to
> avoid *mixing* bytes and text. This aims to avoid the classic Python 2
> Unicode failure, where str+unicode fails or succeeds based on whether
> str contains non-ASCII characters or not, which means it is easy to
> miss in testing. 

+1

> 
> But this does not mean the bytes type isn't allowed to have a
> noticeable bias in favor of encodings that are ASCII supersets, even
> if not all bytes objects contain such data (e.g. image data,
> compressed data, binary network packets, and so on).

+1

> 
> IMO it's totally fine and consistent if b'%d' % 42 returns b'42' and
> also for b'{}'.format(42) to return b'42'. There are numerous places
> where bytes are already assumed to use an ASCII superset:
> 
> - byte literals: b'abc' (it's a syntax error to have a non-ASCII character here)
> - the upper() and lower() methods modify the ASCII letter positions
> - int(b'42') == 42, float(b'3.14') == 3.14

Completely Agree.

> 
> I looked through the example code I recently wrote for asyncio (which
> uses bytes for all data read or written). There are several places
> where I have to make a clumsy detour via text strings because I need
> to include an ASCII-encoded decimal integer (e.g. the Content-Length
> header) or a hex-encoded one (e.g. for Transfer-Encoding: chunked).
> Those detours aren't needed for parsing because int() accepts bytes
> just fine.
> 
> I also note that the behavior of the re module is perfect: if the
> pattern is bytes, it can only match bytes and the extracted data is
> bytes, and ditto for text -- so it supports both types but doesn't
> allow mixing them. The urllib module does this too -- at considerable
> cost in its implementation, but it's the right thing, because there
> really are good cases to be made for treating URLs as text as well as
> for treating them as bytes (as with filenames, command line arguments,
> and environment variables).
> 
> I'm sad that the json module in Python 3 doesn't support bytes at all,
> but at least it is consistent -- it always produces text in ASCII
> encoding (by default). The same applies to the http module, which IIUC
> adheres to the standard by treating headers as Latin-1.
> 
> -- 
> --Guido van Rossum (python.org/~guido)


-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA





Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Guido van Rossum
On Sun, Jan 12, 2014 at 4:28 PM, Ethan Furman  wrote:
> On 01/12/2014 03:55 PM, Guido van Rossum wrote:
>>
>> There's a lot of discussion about PEP 460 and I haven't read it all.
>> Maybe you all have already reached the same conclusion that I have.
>
>
> No, no agreement has been reached.  Your contribution is timely.
>
>
>
>> PEP 460 itself currently rejects support for %d, AFAIK on the basis
>> that bytes aren't necessarily ASCII. I think that's a misunderstanding
>> of the intention of the bytes type.
>
>
>> [...] this does not mean the bytes type isn't allowed to have a
>>
>> noticeable bias in favor of encodings that are ASCII supersets, even
>> if not all bytes objects contain such data [...]
>
>
>> IMO it's totally fine and consistent if b'%d' % 42 returns b'42' and
>> also for b'{}'.format(42) to return b'42' [...]
>>
>>
>> - byte literals: b'abc' (it's a syntax error to have a non-ASCII character
>> here)
>> - the upper() and lower() methods modify the ASCII letter positions
>> - int(b'42') == 42, float(b'3.14') == 3.14
>
>
> So if we allow the numeric modifiers [1], the only remaining question is do
> we allow %c and %s, and if so how do they behave?
>
> Guido?

Yes, all the numeric formatting codes such as %x, %o, %e, %f, %g
should work, as should the padding, justification and related
modifiers. E.g. b'%4x' % 10 should return b'   a'.
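
A minimal sketch of what this looks like in practice, assuming an interpreter whose bytes type supports the % operator (Python 3.5 or later; the 3.3 and 3.4 releases under discussion here do not). The header example is an assumption about the kind of asyncio code mentioned earlier in the thread, not a copy of it.

```python
# Numeric codes with padding and justification modifiers.
plain = b'%d' % 42
hex_padded = b'%4x' % 10
zero_padded = b'%04d' % 42
left_justified = b'%-4d|' % 7
fixed_point = b'%.2f' % 3.14159

# The "clumsy detour via text strings" versus formatting bytes directly.
body = b'hello'
via_text = ('Content-Length: %d\r\n' % len(body)).encode('ascii')
direct = b'Content-Length: %d\r\n' % len(body)

# A hex-encoded chunk-size line, as used by Transfer-Encoding: chunked.
chunk_size_line = b'%x\r\n' % 255
```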

%c looks simple enough too: With an int it should insert one byte,
insisting that the value is in range(256). With a bytes argument the
length should be 1. (I note that I can't remember ever using %c --
it's just there because it's in C.)
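
The %c rules described above can be sketched as follows, again assuming Python 3.5 or later. The helper `rejects` is hypothetical and catches both exception types rather than relying on which one a given interpreter raises.

```python
from_int = b'%c' % 65        # one byte taken from an int
from_bytes = b'%c' % b'x'    # a length-1 bytes argument is copied as-is


def rejects(arg):
    """Hypothetical helper: True if b'%c' refuses `arg`."""
    try:
        b'%c' % arg
    except (OverflowError, TypeError):
        return True
    return False


out_of_range = rejects(256)     # not in range(256)
too_long = rejects(b'xy')       # bytes argument of length 2
```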

%s seems the trickiest: I think with a bytes argument it should just
insert those bytes (and the padding modifiers should work too), and
for other types it should probably work like %a, so that it works as
expected for numeric values, and with a string argument it will return
the ascii()-variant of its repr(). Examples:

b'%s' % 42 == b'42'
b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
enclosed in single quotes)

I have to admit I didn't know about ascii(). It's nifty. :-)
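
The %s rule proposed in this message can be modelled with a few lines of Python 3. The function name `proposed_percent_s` is invented for illustration; it mirrors the proposal in this message only, and released Python versions may behave differently.

```python
def proposed_percent_s(arg):
    """Model of the %s behaviour proposed above (hypothetical helper).

    Bytes-like arguments are inserted unchanged; anything else goes
    through ascii(), so a str argument keeps its quotes.
    """
    if isinstance(arg, (bytes, bytearray)):
        return bytes(arg)
    return ascii(arg).encode('ascii')


number = proposed_percent_s(42)
quoted = proposed_percent_s('x')
escaped = proposed_percent_s('\u1234')   # non-ASCII text becomes an escape
passthrough = proposed_percent_s(b'\xff\x00')
```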

> --
> ~Ethan~
>
>
> [1] modifiers is not the right word for %i, %x, etc, is it?  What is the
> correct term?

I'd interpret "modifiers" as the stuff that can go between the % and
the format letter, e.g. %04d or %-.3s. The term I'd use for %i, %x etc
would be numeric formatting codes.
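
For reference, the two modifier examples named above behave like this with ordinary str formatting (a small sketch; the variable names are arbitrary):

```python
# %04d: the "04" modifier zero-pads to width 4; "d" is the formatting code.
zero_padded = '%04d' % 42

# %-.3s: ".3" truncates to three characters; "-" asks for left
# justification, which has no visible effect without a width.
truncated = '%-.3s' % 'abcdef'

# With a width of 6 the left justification becomes visible.
justified = '%-6.3s|' % 'abcdef'
```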

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Ethan Furman

On 01/12/2014 03:55 PM, Guido van Rossum wrote:

There's a lot of discussion about PEP 460 and I haven't read it all.
Maybe you all have already reached the same conclusion that I have.


No, no agreement has been reached.  Your contribution is timely.



PEP 460 itself currently rejects support for %d, AFAIK on the basis
that bytes aren't necessarily ASCII. I think that's a misunderstanding
of the intention of the bytes type.



[...] this does not mean the bytes type isn't allowed to have a
noticeable bias in favor of encodings that are ASCII supersets, even
if not all bytes objects contain such data [...]



IMO it's totally fine and consistent if b'%d' % 42 returns b'42' and
also for b'{}'.format(42) to return b'42' [...]

- byte literals: b'abc' (it's a syntax error to have a non-ASCII character here)
- the upper() and lower() methods modify the ASCII letter positions
- int(b'42') == 42, float(b'3.14') == 3.14


So if we allow the numeric modifiers [1], the only remaining question is do we allow %c and %s, and if so how do they 
behave?


Guido?

--
~Ethan~


[1] modifiers is not the right word for %i, %x, etc, is it?  What is the 
correct term?


Re: [Python-Dev] Trying to focus the whole bytes/str formatting discussion

2014-01-12 Thread Guido van Rossum
Sorry, I started my own "PEP 460 reboot" thread -- I wrote that
message before yours arrived, even if maybe I posted after you. I'm in
the PBP camp myself for this. I won't pronounce on PEP 460 as-is.
Please follow up in the other thread if you need clarifications.

On Sun, Jan 12, 2014 at 2:46 PM, Brett Cannon  wrote:
> I don't know about the rest of you but I feel like the discussion is heading
> off the rails (if it hasn't already jumped the tracks). Let's try to bring
> this back around to something actionable which people can focus their energy
> on as the amount of developer time spent arguing could have led to several
> coded-up solutions.
>
> I see it as a practicality-beats-purity vs.
> explicit-is-better-than-implicit. The PBP group want bytes.format() (just
> assume I include interpolation support if you want that) to work as close to
> a drop-in replacement for current str.format() use in Python 2 to ease
> porting. The argument is that code looks cleaner and the amount of changes
> in Python 2 code being ported to Python 3 is much smaller.
>
> The EIBTI group are willing to support PEP 460 but beyond that don't want to
> have in Python itself anything for bytes.format() which takes in a string
> and spits out bytes. It's bytes in->bytes out and not bytes & str in->bytes
> out as the PBP group is after. The EIBTI group are arguing that letting str
> into bytes.format() and then automatically be converted to strict ASCII
> leads to conflating the text/bytes divide as well as being too magical, e.g.
> what if you actually wanted UTF-16 for your number string instead of ASCII;
> the EIBTI group **wants** to force people to make a decision. They are also
> less concerned with making users update Python 2 code to handle this as it
> already needs to be updated for other Python 3 things anyway.
>
> From where I'm sitting, the EIBTI group and their PEP 460 proposal from
> Antoine (and no longer Victor) are not controversial. Everyone seems to
> agree that PEP 460 **at minimum** is acceptable and should happen for Python
> 3.5. The people with the uphill battle and something to prove are those
> arguing for str in->bytes out support in bytes.format(). The added features
> that the PBP group want are the ones being argued over.
>
> As the onus is on the PBP group to convince the EIBTI group (or Guido), I
> think the PBP group should code up a solution that does what they want and
> put it on PyPI to see what the community thinks. If the PBP group wants to
> convince the EIBTI group that str in->bytes out for bytes.format() is
> critical in getting a key group of users to start using Python 3 then I
> think that needs to be demonstrated through real-world usage by some people.
>
> If there is serious pickup of the solution from PyPI by projects then we can
> discuss integrating it into Python 3.5. That gives at least **one year** to
> come up with a solution which gets picked up by the community (standard
> requirement for stdlib inclusion). At worst some projects use the PyPI
> project and find it useful but it doesn't go into Python 3.5. At best lots
> of people find it useful enough that we add it to Python 3.5. But
> regardless, a PyPI project helps people **no matter what** the EIBTI group
> thinks. That's more forward momentum than this conversation currently has.
>
> This has split down philosophical lines and does not look to be tilting one
> way or the other by simply using words. I think it has reached the point
> that showing code is going to be the only way to tilt the favour towards the
> PBP group at this point. Guido has not spoken up so either he is ignoring it
> because he's busy, he doesn't care, or he's mulling things over still.
> Assuming he doesn't speak up then it comes down to getting a clear majority
> on the side of the PBP group and that is not going to happen the way this
> discussion is going.
>
> So, action items are:
>
> * Get PEP 460 pronounced on **as is**
> * A PyPI project containing PBP ideas and see if the community seizes on it
> or not (benefit to people regardless)
> * Do a separate PEP that builds on PEP 460 if people really want to continue
> down that road at this time
>
> Don't forget, we are talking about Python 3.5; we have not even hit Python
> 3.4rc1 yet so this level of arguing seems a bit premature and going nowhere.
>
>



-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Ethan Furman

On 01/12/2014 04:47 PM, Guido van Rossum wrote:


%s seems the trickiest: I think with a bytes argument it should just
insert those bytes (and the padding modifiers should work too), and
for other types it should probably work like %a, so that it works as
expected for numeric values, and with a string argument it will return
the ascii()-variant of its repr(). Examples:

b'%s' % 42 == b'42'
b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
enclosed in single quotes)


I'm not sure about the quotes.  Would anyone ever actually want those in the 
byte stream?

--
~Ethan~


[Python-Dev] Smuggling bytes into text (was Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)

2014-01-12 Thread Steven D'Aprano
Changing the subject line to better describe what we're talking about. I 
hope it is of interest to others apart from Ethan and me -- mixed bytes 
and text is hard to get right. (And if I've got something wrong, I'd 
like to know about it.)


On Sat, Jan 11, 2014 at 08:38:49PM -0800, Ethan Furman wrote:
> On 01/11/2014 06:29 PM, Steven D'Aprano wrote:
[...]
> Since you're talking to me, it would be nice if you addressed the same 
> use-case I was addressing, which is mixed: ascii-encoded text, 
> ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and 
> misc-encoded text.

I thought I had addressed it. But since your use-case is underspecified, 
please excuse me if I get some of it wrong.


> And no, your example will not work with any text, it would completely 
> moji-bake my dbf files.

I don't think it will. Admittedly, I don't know all the ins and outs of 
your files, but as far as I can tell, nothing you have said so far 
suggests that my plan will fail.

Code speaks louder than words: http://www.pearwood.info/ethan_demo.py

This code produces a string containing smuggled bytes. There is:

- a header containing raw bytes;

- metadata consisting of the name of some encoding in ASCII;

- A series of tagged fields. Each field has a name, which is always 
  ASCII, and terminated with a colon. It is then followed by a 
  single ASCII character and some data:

  * T for some arbitrary chunk of text, encoded in the metadata 
    encoding, with a length byte prefix (that is, like a Pascal
    string);
  * F for a boolean flag "true" or "false" in ASCII;
  * N for an integer, a C long;
  * D for an integer, in ASCII, terminated at the first non-digit;
  * B for a chunk of arbitrary bytes, with a two-byte length prefix.

And the whole thing is written out to a file, then read back in, 
without data corruption or mojibake. I wrote this about 1am this 
morning, so it may or may not be a shining example of idiomatic Python 
code, but it works and is readable.

I understand that this won't match your actual use-case precisely, but I 
hope it contains the same sorts of mixed binary data and ASCII text that 
you're talking about. There are fixed width fields, variable length 
fields, binary fields, ASCII fields, non-ASCII text, and multiple 
encodings, all living in perfect harmony :-)

And it runs unchanged under both Python 2.7 and 3.3.

As so often happens, what seems good in principle is less useful in 
practice. Once I actually started writing code, I quickly moved beyond 
the simple model:

template = "some text"
data = template % ("text", 42, b'\x16foo'.decode('latin-1'))

that I thought would be easy to a more structured approach. So I wrote 
reader and writer classes and abstracted away the messy bits, although 
in truth none of it is very messy. The worst is dealing with the 2 
versus 3 differences, and even that requires only a handful of small 
helper functions.

I don't claim that the code I tossed together is the optimal design, or 
bug-free, or even that the exact same approach will work for your 
specific case. But it is enough to demonstrate that the basic idea is 
sound, you can process mixed text and bytes in a clean way, it doesn't 
generate mojibake, and can operate in both 2.7 and 3.3 without even 
using a __future__ directive.
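
The field layout listed earlier in this message can be sketched roughly as follows. This is not the linked ethan_demo.py: every function name here is hypothetical, the byte orders and the 4-byte size chosen for the "C long" are assumptions, and the sketch targets Python 3 only. It shows three of the field kinds (T, N and B) being built in the text domain with Latin-1 smuggling, then encoded once at the end.

```python
import struct


def smuggle(raw):
    """Map each byte 0..255 to the code point with the same number."""
    return raw.decode('latin-1')


def unsmuggle(text):
    """Inverse of smuggle(); fails if text holds code points above 255."""
    return text.encode('latin-1')


def text_field(name, value, encoding):
    # T: text in the metadata encoding with a one-byte length prefix.
    payload = value.encode(encoding)
    return name + ':T' + smuggle(struct.pack('B', len(payload)) + payload)


def long_field(name, number):
    # N: a binary integer; '<l' (4 bytes, little-endian) is an assumption.
    return name + ':N' + smuggle(struct.pack('<l', number))


def blob_field(name, raw):
    # B: arbitrary bytes with a two-byte (big-endian here) length prefix.
    return name + ':B' + smuggle(struct.pack('>H', len(raw)) + raw)


def read_text_field(raw, offset, encoding):
    """Parse one T field from bytes; return (name, text, next_offset)."""
    colon = raw.index(b':', offset)
    if raw[colon + 1:colon + 2] != b'T':
        raise ValueError('not a T field')
    length = raw[colon + 2]
    start = colon + 3
    text = raw[start:start + length].decode(encoding)
    return raw[offset:colon].decode('ascii'), text, start + length


record = (text_field('name', 'срЃ', 'cp1251')
          + long_field('count', 42)
          + blob_field('data', b'\x82\xe1\xc2\x00\x00\x7b\x00\xff\xa8'))

# The record stays a str until it is ready to be written out.
on_disk = unsmuggle(record)
```

Because every smuggled character is below U+0100, the final Latin-1 encode cannot fail, and decoding `on_disk` with Latin-1 gives back the same str.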



> >>>Only the binary blobs need to be decoded. We don't need to encode the
> >>>template to bytes, and the textual data doesn't get encoded until we're
> >>>ready to send it across the wire or write it to disk.
> 
> No!  When I have text, part of which gets ascii-encoded and part of which 
> gets, say, cp1251 encoded, I cannot wait till the end!

I think we are talking about different textual data. It's a bit 
ambiguous, my apologies. You're talking about taking individual fields 
and deciding how to process them. I'm talking about doing your 
processing in the text domain, which means at the end of the process I 
have a Unicode string object rather than a bytes object. Before that str 
can be written to disk, it needs to be encoded.


> >>And what if your name field has data not representable in latin-1?
> >>
> >>--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
> >>u'\u0441\u0440\u0403'
> >
> >Where did you get those bytes from? You got them from somewhere.
> 
> For the sake of argument, pretend a user entered them in.
> 
> >Who knows? Who cares? Once you have bytes, you can treat them as a blob of
> >arbitrary bytes and write them to the record using the Latin-1 trick.
> 
> No, I can't.  See above.
>
> > If
> >you're reading those bytes from some stream that gives you bytes, you
> >don't have to care where they came from.
> 
> You're kidding, right?  If I don't know where they came from (a graphics 
> field?  a note field?) how am I going to know how to treat them?

As I understand it, you want the ability to store *arbitrary bytes* in 
the file, right? Here are nine arbitrary bytes:

b'\x82\xE1\xC2\0\0\x7B\0\xFF\xA8'

You don't need to know how I ge

Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Daniel Holth
On Sun, Jan 12, 2014 at 8:27 PM, Ethan Furman  wrote:
> On 01/12/2014 04:47 PM, Guido van Rossum wrote:
>>
>>
>> %s seems the trickiest: I think with a bytes argument it should just
>> insert those bytes (and the padding modifiers should work too), and
>> for other types it should probably work like %a, so that it works as
>> expected for numeric values, and with a string argument it will return
>> the ascii()-variant of its repr(). Examples:
>>
>> b'%s' % 42 == b'42'
>> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
>> enclosed in single quotes)
>
>
> I'm not sure about the quotes.  Would anyone ever actually want those in the
> byte stream?
>
> --
> ~Ethan~

Is there a formatting character that means "anything except a unicode
string" to prevent accidentally interpolating a Unicode string into a
bytes string without [a sane] encoding?


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Guido van Rossum
On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman  wrote:
> On 01/12/2014 04:47 PM, Guido van Rossum wrote:
>> %s seems the trickiest: I think with a bytes argument it should just
>> insert those bytes (and the padding modifiers should work too), and
>> for other types it should probably work like %a, so that it works as
>> expected for numeric values, and with a string argument it will return
>> the ascii()-variant of its repr(). Examples:
>>
>> b'%s' % 42 == b'42'
>> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
>> enclosed in single quotes)
>
> I'm not sure about the quotes.  Would anyone ever actually want those in the
> byte stream?

Perhaps not, but it's a hint that you should probably think about an
encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
it as payback time. :-)
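
The existing behaviour referred to here is easy to confirm on Python 3 (a two-line sketch; variable names are arbitrary):

```python
# Interpolating bytes into text yields the repr, b prefix and quotes included.
text_result = '%s' % b'x'
same_as_str = str(b'x')
```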

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Guido van Rossum
On Sun, Jan 12, 2014 at 6:07 PM, Daniel Holth  wrote:
> Is there a formatting character that means "anything except a unicode
> string" to prevent accidentally interpolating a Unicode string into a
> bytes string without [a sane] encoding?

No, and we shouldn't introduce one. An operation should either work
for no type, one type, a few specific types, or all types. Something
that works for all but one type will *appear* to work for all types to
a casually experimenting user and may pass extensive unittests,
leaving a bomb that can detonate when you least expect it.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Smuggling bytes into text (was Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)

2014-01-12 Thread Steven D'Aprano
On Mon, Jan 13, 2014 at 01:03:15PM +1100, Steven D'Aprano wrote:

> code speaks louder than words: http://www.pearwood.info/ethan_demo.py

[...]

Ethan refers to code like:

template % ("срЃ".encode('cp1251').decode('latin-1'), 42, 
blob.decode('latin-1'))

> > You did say to use a *text* template to manipulate my data, and then write 
> > it later, no?  Well, this is what it would look like.
> 
> If the text strings the user gives you are compatible with the 
> encoding they specify, you don't need that. Just use:
> 
> ("срЃ", 42, blob.decode('latin-1'))
> 
> It's the user's responsibility if they choose to specify an encoding 
> which is more restrictive than the contents of some field. If they do 
> that, they have to encode that field somehow, so they can treat it as a 
> binary blob. *You* don't have to do this, and you certainly don't have 
> to take perfectly good text and turn it into bytes then back to text 
> just so you can insert it back into text. That would be silly.

It occurs to me that I do exactly that in my demo code :-)

In my defence, it was 1am when I wrote it, and I am a little unclear 
about Ethan's use-case whether the entire file is supposed to be 
compatible with the cp1251 encoding (the example that he gives), or just 
individual fields in it. If I understood the requirements better, my 
code would probably be able to avoid some of those encodes/decodes, or I 
might even decide that working in the text domain is a mistake and 
instead we should look to smuggle text into bytes rather than the other 
way around.

Regardless of which way you go, I'm not seeing that mixed bytes and text 
should be a reason to hold off migrating from 2 to 3. Which is where 
this discussion started days and days ago.

-- 
Steven


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Daniel Holth
On Sun, Jan 12, 2014 at 9:18 PM, Guido van Rossum  wrote:
> On Sun, Jan 12, 2014 at 6:07 PM, Daniel Holth  wrote:
>> Is there a formatting character that means "anything except a unicode
>> string" to prevent accidentally interpolating a Unicode string into a
>> bytes string without [a sane] encoding?
>
> No, and we shouldn't introduce one. An operation should either work
> for no type, one type, a few specific types, or all types. Something
> that works for all but one type will *appear* to work for all types to
> a casually experimenting user and may pass extensive unittests,
> leaving a bomb that can detonate when you least expect it.

That pretty much describes how I feel about str(bytes). I would accept
"only a bytes" or "only a string" as consolation formatting characters
:-)


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Ethan Furman

On 01/12/2014 06:07 PM, Daniel Holth wrote:

On Sun, Jan 12, 2014 at 8:27 PM, Ethan Furman  wrote:

On 01/12/2014 04:47 PM, Guido van Rossum wrote:



%s seems the trickiest: I think with a bytes argument it should just
insert those bytes (and the padding modifiers should work too), and
for other types it should probably work like %a, so that it works as
expected for numeric values, and with a string argument it will return
the ascii()-variant of its repr(). Examples:

b'%s' % 42 == b'42'
b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
enclosed in single quotes)


I'm not sure about the quotes.  Would anyone ever actually want those in the
byte stream?


Is there a formatting character that means "anything except a unicode
string" to prevent accidentally interpolating a Unicode string into a
bytes string without [a sane] encoding?


In reference to a byte stream, if you do:

--> b'%s' % 'some text'.encode('cp1241')

it's really just bytes into bytes.

If you do :

--> b'%s' % 'some text'

then the encoding is ASCII with strict error checking.  So if it's not representable as clean ASCII either encode it 
manually, or prepare for it to blow up with a UnicodeEncodeError.


--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Ethan Furman

On 01/12/2014 06:16 PM, Ethan Furman wrote:


If you do :

--> b'%s' % 'some text'


Ignore what I previously said.  With no encoding the result would be:

b"'some text'"

So an encoding should definitely be specified.

--
~Ethan~


Re: [Python-Dev] Smuggling bytes into text (was Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)

2014-01-12 Thread Ethan Furman

On 01/12/2014 06:03 PM, Steven D'Aprano wrote:


The above all sounds reasonable. But the following does not -- I think
it shows some fundamental confusion on your part.


My apologies.  The '\xd1.' was a bytestring, I forgot to type the b.  (I know, I know, I should've copied and pasted 
:( )


--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Stephen J. Turnbull
Ethan Furman writes:

 > 1) Are you saying it's okay to be insulting when frustrated?  I
 >also find this mega-thread frustrating, but I'm trying
 >very hard not to be insulting.

OK, no.  Understandable, yes.

 > 2) If you are going to use my name, please be certain of the facts
 >[1].  More below.
 > 
 > > MAL posted straight out the Python 2 model of text makes it easier for
 > > him to write some programs, so he's all for reintroducing it.  And
 > > that is the whole truth of the matter.  Although I disagree with him,
 > > I appreciate his honesty.
 > 
 > If you have an example of me lying (even if it's just a
 > possibility), please refer to it directly so I can either try to
 > explain the misunderstanding or apologize.

Praising one person for honesty doesn't imply anybody else is lying.

As for the Artist Currently Posting as Ethan Furman, he's not in the
"disingenuous" group.  I don't think you understand the issues at stake
(among other things, as I've discussed elsewhere, I think your use
case is different from the use cases of most of those who are asking
for bytes formatting).  And there's a crucial terminology difference:

 > In only one case did I use the word "text" loosely,

From my point of view, you consistently do so.  Bytes are *never*
Python 3 text in my terminology, and I think that is generally
accepted on these channels.  "ASCII-encoded text" as you call it (and
repeatedly do so), and want to manipulate using str-like methods on
bytes, is *exactly* the Python 2 model of text.  But you deny that the
effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python
2's bytes/character confusion, don't you?

Yes, I've used "ASCII-compatible text" in some of my posts, but I
recognize that as "loose usage", too, and would stop if requested.
Note I'm not asking you to stop -- I think we all understand what you
mean, even though for some of us it's loose terminology.  What I do
hope you will recognize is that adding str-like methods to bytes is
precisely the Python 2 model of text processing[1], and that like MAL
you will say, "OK, I don't see a problem with reintroducing Python 2's
byte/character confusion."  (Well, I *really* want you to see the
light, and retract your proposal for b'%d' format.  But that hardly
seems likely. :-)

 > But don't lie to me (as Nick tried to) and say that "In particular,
 > the bytes type is, and always will be, designed for pure binary
 > manipulation" when it has methods like .center().

I hardly think Nick is *lying*, any more than you are.  AFAICT, you're
*both* wrong.  According to PEP 3137[2] by Guido van Rossum, the idea
of the immutable bytes type was suggested (in various aspects which
combined to overcome Guido's initial opposition) by Gregory P. Smith,
Jeffrey Yasskin, and Talin.  Guido then chose to implement it by
grabbing the Python 2 code, and removing .encode, and removing
locale-dependent definitions of character classes.  This was with a
view to supporting ports of code that implements wire protocols or
uses bytes as encoded text:

    It also makes it possible to efficiently create hash tables using
    bytes for keys; this may be useful when parsing protocols like
    HTTP or SMTP which are based on bytes representing text.

    Porting code that manipulates binary data (or encoded text) in
    Python 2.x will be easier using the new design than using the
    original 3.0 design with mutable bytes; simply replace str with
    bytes and change '...' literals into b'...' literals.

IIRC, only later was regex support added to bytes (by Nick himself,
again IIRC).  And despite the quote above, I don't think Guido meant
to encourage use of bytes as text in wire protocol development, at
least not at that time.  

Note that Nick has already admitted that permitting even methods that
can be implemented purely as numerical manipulations:

    def is_uppercase(b):
        # Note all comparisons are between integers:
        return ord('A') <= b[0] and b[0] <= ord('Z')

was in retrospect a mistake (in his opinion).  So I don't think it was
a lie, merely a difference in your definitions of "pure binary
manipulation".  (Which isn't surprising, given that ultimately
everything in computers as we know them today eventually reduces to
"pure binary manipulations".[3]  Drawing the line is going to involve
personal taste to some extent.)  I think his interpretation that bytes
were *designed* that way is a bit strained given PEP 3137.  I also
don't know what was discussed at language summits, and don't recall
the python-dev conversations about it at all.

A final remark: Be very careful in interpreting Guido's words in these
"practical vs. pure" matters.  I've discovered his offhand comments on
these matters are often both subtle and deep (that probably doesn't
surprise you), and that the idea behind them is usually extremely
precise though his expression may be informal or even casual (and here be
dragons -- taking the expression too literally 

Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Scott Dial
On 2014-01-11 22:09, Nick Coghlan wrote:
> For Python 2 folks trying to grok where the "bright line" is in terms of
> the Python 3 text model: if your proposal includes *any* kind of
> implicit serialisation of non binary data to binary, it is going to be
> rejected as an addition to the core bytes type. If it avoids crossing
> that line (as the buffer-API-only version of PEP 460 does), then we can
> talk.

To take such a hard-line stance, I would expect you to author a PEP to
strip the ASCII conveniences from the bytes and bytearray types.
Otherwise, I find it a bit schizophrenic to argue that methods like
lower, upper, title, etc. don't implicitly assume encoding:

>>> a = "scott".encode('utf-16')
>>> b = a.title()
>>> b.decode('utf-16')
'SCOTT'

So, clearly title() not only depends on the bytes characters encoded in
a superset of ASCII characters, it depends on the bytes being a sequence
of ASCII characters, which looks an awful lot like an operation on an
implicit encoded string.

>>> b"文字化け"
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

There is an implicit serialization right there. My terminal is utf8 (or
even if my source encoding is utf8), so why would that not be:

b'\xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91'
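
For comparison, the explicit spelling that produces exactly those bytes on Python 3, assuming UTF-8 is the encoding that was wanted:

```python
# The encoding is named by the caller instead of being taken from the
# terminal or the source file.
encoded = '文字化け'.encode('utf-8')

# Decoding with the same encoding restores the original text.
round_tripped = encoded.decode('utf-8')
```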

I sympathize with Ethan that the bytes and bytearray types already seem
to concede that bytes is the type you want to use for 7-bit ASCII
manipulations. If that is not what we want, then we are not doing a good
job communicating that to developers with the API. At the outset, the
bytes literal itself seems to be an attractive nuisance as it gives a
nod to using bytes for ASCII character sequences (a.k.a. ASCII strings).

Regards,
-Scott

-- 
Scott Dial
[email protected]


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Guido van Rossum
On Sun, Jan 12, 2014 at 6:16 PM, Ethan Furman  wrote:
> In reference to a byte stream, if you do:
>
> --> b'%s' % 'some text'.encode('cp1241')
>
> it's really just bytes into bytes.

That's a confusing example -- it would be clearer to just show

b'%s' % b'some text'

> If you do :
>
> --> b'%s' % 'some text'
>
> then the encoding is ASCII with strict error checking.  So if it's not
> representable as clean ASCII either encode it manually, or prepare for it to
> blow up with an UnicodeEncodeError.

You don't say what outcome you want, but if you wanted b'%s' % 'some
text' to return b'some text' while b'%s' % '\u1234' should blow up,
you're back at the Python 2 approach and that is the last thing I
want.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Guido van Rossum
On Sun, Jan 12, 2014 at 6:24 PM, Ethan Furman  wrote:
> On 01/12/2014 06:16 PM, Ethan Furman wrote:
>>
>>
>> If you do :
>>
>> --> b'%s' % 'some text'
>
>
> Ignore what I previously said.  With no encoding the result would be:
>
> b"'some text'"
>
> So an encoding should definitely be specified.

Yes, but the encoding is no business of %s or %. As far as the
formatting operation cares, if the argument is bytes they will be
copied literally, and if the argument is a str (or anything else) it
will call ascii() on it.
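
A minimal sketch of the rule as stated in this message, assuming Python 3. The helper interpolate_s is hypothetical and only models the proposal under discussion; it is not an existing API and not necessarily what any released Python does for %s:

```python
def interpolate_s(arg):
    """Hypothetical model of the %s rule described above (not a real API)."""
    if isinstance(arg, (bytes, bytearray)):
        return bytes(arg)               # bytes are copied literally
    return ascii(arg).encode("ascii")   # anything else goes through ascii()


print(interpolate_s(b"some text"))
print(interpolate_s("some text"))
print(interpolate_s("\u1234"))
print(interpolate_s(42))
```

Under this rule a str argument never raises; it shows up with its quotes and escapes, which makes the missing explicit encode easy to spot.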

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Guido van Rossum
Those still arguing on this thread might want to look at the thread
"PEP 460 reboot".

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-12 Thread Ethan Furman

On 01/12/2014 07:45 PM, Guido van Rossum wrote:
> On Sun, Jan 12, 2014 at 6:16 PM, Ethan Furman wrote:
>> In reference to a byte stream, if you do:
>>
>> --> b'%s' % 'some text'.encode('cp1241')
>>
>> it's really just bytes into bytes.
>
> That's a confusing example -- it would be clearer to just show
>
> b'%s' % b'some text'
>
>> If you do :
>>
>> --> b'%s' % 'some text'
>>
>> then the encoding is ASCII with strict error checking.  So if it's not
>> representable as clean ASCII either encode it manually, or prepare for it to
>> blow up with an UnicodeEncodeError.
>
> You don't say what outcome you want, but if you wanted b'%s' % 'some
> text' to return b'some text' while b'%s' % '\u1234' should blow up,
> you're back at the Python 2 approach and that is the last thing I
> want.


Fair enough.  I'm cool with getting back b"'some text'" and not ever blowing up.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Stephen J. Turnbull
Steven D'Aprano writes:

 > Of course you're right, but I have understood the above as being a 
 > sketch and not real code. (E.g. does "header" really mean the literal 
 > string "header", or does it stand in for something which is a header?) 
 > In real code, one would need to have some way of telling where the 
 > binary image data ends and the Unicode string begins.

Sure, but I think in Ethan's case it's probably out of band.  I have
been assuming out of band.

 > > This corrupts binary_image_data.  Each byte > 127 will be replaced by
 > > two bytes.
 > 
 > And reading it back using decode('utf-8') will replace those two bytes 
 > with a single byte, round-tripping exactly.

True, but I'm assuming Ethan himself didn't choose DBF format.

 > Of course if you encode to UTF-8 and then try to read the binary data as 
 > raw bytes, you'll get corrupted data. But do people expect to do this? 

People?  Real People use Python, they wouldn't do that. :-)  But the
app that forced Ethan to deal with DBF might.
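
A short sketch of the corruption and the round trip being discussed, assuming Python 3. The four sample bytes stand in for real image data and are purely illustrative:

```python
# Stand-in for binary image data: two bytes <= 127 and two bytes > 127.
binary_image_data = bytes([0x00, 0x7F, 0x80, 0xFF])

# Smuggle the bytes into a str via latin-1, then write the str out as UTF-8.
as_text = binary_image_data.decode("latin-1")
on_disk = as_text.encode("utf-8")

# Each byte > 127 has become two bytes in the UTF-8 output.
print(len(binary_image_data), len(on_disk))

# A reader that decodes UTF-8 and re-encodes latin-1 recovers the original.
round_tripped = on_disk.decode("utf-8").encode("latin-1")
print(round_tripped == binary_image_data)
```

A consumer that reads on_disk as raw bytes, without the UTF-8 step, sees six bytes where the format expects four.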

 > > This kind of subtlety is precisely why MAL warned about use of latin1
 > > to smuggle bytes.
 > 
 > How would you smuggle a chunk of arbitrary bytes into a text string? 
 > Short of doing something like uuencoding it into ASCII, or
 > equivalent.

Arbitrary bytes as a chunk?  I wouldn't do that, probably (see below),
and it's not possible in Python 3 at present (in str ASCII codes
always represent the corresponding ASCII character, they are never
uninterpreted bytes).

But if I know where the bytes are going to be in the str, I'd use
latin1 or (encoding='ascii', errors='surrogateescape') depending on
how well-controlled the processing is.  If I really "own" those bytes,
I might use latin1, and just "forget" all of the string-processing
functions that care about character identity (eg, case manipulation).
If the bytes might somehow end up leaking into the rest of the
program, I'd use surrogateescape and live with the doubled space usage.
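
The trade-off between the two approaches can be sketched as follows, assuming Python 3. The helper leaks_silently is hypothetical and uses a strict UTF-8 encode as a stand-in for "the rest of the program":

```python
raw = bytes(range(256))

# Both approaches round-trip arbitrary bytes exactly.
as_latin1 = raw.decode("latin-1")
as_escaped = raw.decode("ascii", errors="surrogateescape")
latin1_ok = as_latin1.encode("latin-1") == raw
escaped_ok = as_escaped.encode("ascii", errors="surrogateescape") == raw


def leaks_silently(text):
    """Hypothetical check: does a strict UTF-8 encode accept the smuggled data?"""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# latin-1 turns byte 0xE9 into an ordinary character, so it passes unnoticed;
# surrogateescape turns it into a lone surrogate that strict codecs reject.
print(leaks_silently(as_latin1), leaks_silently(as_escaped))
```

With latin-1 the smuggled bytes are indistinguishable from real text, whereas surrogateescape makes an accidental leak fail loudly.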

But really, if it's not a wire-to-wire protocol kind of thing, I'd go
ahead and create a proper model for the data, and text would be text,
and chunks of arbitrary bytes would be bytes and integers would be
integers



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Stephen J. Turnbull
Ethan Furman writes:
 > On 01/12/2014 02:57 PM, Stephen J. Turnbull wrote:

 > > No, Nick's point is that there's no encoding needed there at all,
 > > just a bunch of methods that handle numbers in the range 0-255.  You
 > > can rationalize the particular choice of numbers by referring to the
 > > ASCII coded character set, and that's very useful to users.  But
 > > knowledge of ASCII isn't necessary to specify these methods; they can
 > > be defined in an encoding/decoding-free way.
 > 
 > How can you say that with a straight face? [1]

Because I showed you code that does it.  Did you see an .encode or a
.decode in there?

 > Do you really think that .title, .isalnum, and .center (to name
 > only a few) would work the same if the assumed encoding was EBCDIC?

Yes, yes, and yes.  The numbers involved would change, and the test
for finding letters would be different (and more complicated IIRC).
The only one to worry about is .title, but neither ASCII nor EBCDIC
has confused or multiple letter titlecase.
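
The code referred to above is not reproduced in this part of the thread. As a stand-in, here is a minimal sketch, assuming Python 3, of how such a method can be specified purely on numbers in the range 0-255; upper_by_numbers is a hypothetical name, not the actual implementation:

```python
def upper_by_numbers(data: bytes) -> bytes:
    """Hypothetical re-statement of bytes.upper() using only integer tests."""
    # Values 97..122 are shifted down by 32; every other value is untouched.
    return bytes(b - 32 if 97 <= b <= 122 else b for b in data)


sample = bytes(range(256))
print(upper_by_numbers(sample) == sample.upper())
print(upper_by_numbers(b"ethan"))
```

No .encode or .decode call appears, although the choice of 97..122 and 32 is, of course, the ASCII table.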

 > Do you think they would do the proper transformations, or return
 > the proper result, if the bytes they were used on were encoded
 > Japanese?

That depends on which Japanese encoding.  It would work correctly on
UTF-8 and on EUC-JP (packed), and not on any of the others.  But you
wouldn't consider that "ASCII-encoded text", would you?
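
A quick check of this claim for three encodings, assuming Python 3 and its standard "utf-8", "euc_jp" and "shift_jis" codecs. The helper name and the choice of the katakana block as test data are illustrative assumptions:

```python
# Katakana from U+30A1 to U+30F6, used here as sample Japanese text.
katakana = "".join(chr(cp) for cp in range(0x30A1, 0x30F7))


def survives_upper(text, encoding):
    """Hypothetical helper: is the encoded text left untouched by bytes.upper()?"""
    encoded = text.encode(encoding)
    return encoded.upper() == encoded


results = {enc: survives_upper(katakana, enc)
           for enc in ("utf-8", "euc_jp", "shift_jis")}
print(results)
```

UTF-8 and EUC-JP keep every byte of a multi-byte character at 0x80 or above, so the ASCII-only methods skip them; Shift_JIS trail bytes can fall in the ASCII letter range and get rewritten.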

 > >> But bytes already acknowledges an ASCII bias.
 > >
 > > True, but that bias is implemented without use of encoding or
 > > decoding.   b'%d' % (123,) -> b'123' does require encoding, at the
 > > very least in the sense of type change and serialization.
 > 
 > You mean like changing a number into text does?  Really, this is no
 > different.

Precisely.  "There should be one- and preferably only one -way to do
it."  The one way uses text, so preferably bytes shouldn't.
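
The two spellings being compared can be sketched as follows. The direct bytes form assumes Python 3.5 or later, where bytes %-formatting was added by PEP 461:

```python
value = 123

# The way that goes through text: format a str, then encode explicitly.
via_text = ("%d" % value).encode("ascii")

# The direct form under debate (available from Python 3.5).
via_bytes = b"%d" % (value,)

print(via_text, via_bytes)
```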



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 07:02 PM, Stephen J. Turnbull wrote:

[snip most of very eloquent reply]

Thank you, Stephen, for remaining calm despite my somewhat heated response.

A few comments in-line.

I now better understand your viewpoint about text always being unicode strings;
I just happen to disagree.

Hopefully as some consolation I will be very vocal about using str unless bytes is necessary.  Any application that uses
text should be using str for it, and only using bytes, if necessary, on the back-end.

> Ethan Furman writes:
>
>> In only one case did I use the word "text" loosely,
>
> [...] Bytes are *never* Python 3 text in my terminology [...] "ASCII-encoded text"
> as you call it [...] and want to manipulate using str-like methods on bytes

The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes.
While the actual implementation of isupper (your example from below) may be done using integer methods, it only makes
semantic sense if interpreted as ASCII-encoded text.

> is *exactly* the Python 2 model of text.  But you deny that the
> effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python
> 2's bytes/character confusion, don't you?

Given that the default (and only) text type in Py3 is str, which is unicode, I don't think any confusion will be as
severe, but I acknowledge that there could be some.

> I hardly think Nick is *lying*, any more than you are.  AFAICT, you're
> *both* wrong.

LOL, well, at least I'm in good company, then!  :)

>> I think some of the misunderstanding (which you also seem to suffer
>> from) is that we (or at least I) /ever/ want a unicode string back
>> from bytes interpolation.  I don't!
>
> Please tell me why you think I suffer from that misunderstanding.

I no longer recall, but whatever misapprehension I was suffering from you have alleviated.  (That sentence would make my
daughter proud!  English major. ;)

> But did you get that I'm worried that programmers in Omaha will use
> that same functionality to communicate American English (for which it
> is basically sufficient, and which also requires ASCII when bytes are
> used for communication)?

Yes, I get that.  Hopefully their friends and neighbors will slap them with
fishes if they do.

>> *My* definition is not ambiguous at all.  If this particular part
>> of the byte stream is defined to contain ASCII-encoded text, then I
>> can use the bytes text methods to work with it.
>
> But how is Python supposed to know that?

Python doesn't need to.  bytes is a low-level object -- it could contain music, movies, dbf data, pdf data, or my
mother's cheesecake recipe (properly encoded, of course).  Python can't protect me from treating a music file as if it
were a movie file, or even just writing proper music info at the wrong place in the music file;  all that is up to me,
as the programmer, to get right, and to understand what is needed.

> But under your definition, you need to make the decision, or
> explicitly code the decision, on the basis of context.

Exactly so.  I even have to do that in Py2.

>> If that particular configuration of bytes is because it's
>> ASCII-encoded text, then sure.
>
> Once again, you are advocating precisely the Python 2 model of text.

Not exactly, because what I get back is bytes, which cannot directly be mixed with unicode (str) as it was in Py2.  I
think this is a key difference.

>> To use, for example, bytes.upper on data that wasn't
>> ASCII-encoded text (even if it happened to look like it was) would
>> be the height of stupidity.  Please don't include me in such
>> accusations.
>
> I have no idea why you think I think anybody would be that stupid.
> That never occurred to me.  It's precisely "magic numbers" that happen
> to look like English words when interpreted as ASCII coded characters
> that I don't want manipulated by str-like methods that interpret text
> (such as full-featured format or %).

This confuses me somewhat.  It's okay to use b'ethan'.upper(), which only makes semantic sense as ASCII-encoded text,
but b'age: %d' % 43 isn't?  (Aside, I'm perfectly comfortable with "ASCII-encoded text" because if you took
u'ethan'.encode('ascii') you would get b'ethan'.  If it was some other encoding, such as cp1251, I would call that
particular byte stream "cp1251-encoded text".  And if there were methods that worked directly on a cp1251-encoded byte
stream I would not have any problem using them on cp1251-encoded text.)
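
A small sketch of the distinction drawn in the aside above, assuming Python 3. The Cyrillic sample string is an arbitrary illustration, not taken from the thread:

```python
# ASCII-encoded text: the existing bytes methods give the expected result.
ascii_name = "ethan".encode("ascii")
print(ascii_name, ascii_name.upper())

# cp1251-encoded text: bytes.upper() only knows ASCII, so nothing changes.
cyrillic = "этан".encode("cp1251")
unchanged = cyrillic.upper() == cyrillic

# What a cp1251-aware method would have to produce, computed via str.
cp1251_upper = cyrillic.decode("cp1251").upper().encode("cp1251")
print(unchanged, cp1251_upper != cyrillic)
```

The existing methods are therefore only meaningful for the ASCII case; no equivalent operates directly on cp1251 bytes.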




> What Nick
> means by a "boundary type" is a type that works seamlessly with the
> types on each side of the boundary as a helper in the conversion.  So
> when you use a struct to pack a bool, an int, and a date into a bytes,
> the struct is the boundary type.  And if there's a helper type to work
> with bytes and/or str simultaneously, that's a boundary type, eg,
> asciistr.  But bytes itself is not a boundary type, it's just a type
> with no internal structure, not even characters.
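
The struct example above can be sketched like this, assuming Python 3. The record layout (little-endian bool, signed int, and the date stored as its proleptic ordinal) and the function names are illustrative assumptions:

```python
import struct
from datetime import date

# Hypothetical layout: bool (1 byte), signed int (4), date ordinal (4).
RECORD = struct.Struct("<?iI")


def pack_record(flag, number, day):
    """Convert typed values into a flat bytes object."""
    return RECORD.pack(flag, number, day.toordinal())


def unpack_record(raw):
    """Convert the flat bytes back into typed values."""
    flag, number, ordinal = RECORD.unpack(raw)
    return flag, number, date.fromordinal(ordinal)


packed = pack_record(True, 43, date(2014, 1, 12))
print(type(packed).__name__, len(packed))
print(unpack_record(packed))
```

All knowledge of the structure lives in RECORD; the packed bytes object itself carries none.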


Hmmm.  I'll have to think about this.

Okay, I've thought some

Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-12 Thread Ethan Furman

On 01/12/2014 08:27 PM, Stephen J. Turnbull wrote:
> Ethan Furman writes:
>> On 01/12/2014 02:57 PM, Stephen J. Turnbull wrote:

I didn't trim enough to make my point clear.  My apologies.

>>> But
>>> knowledge of ASCII isn't necessary to specify these methods; they can
>>> be defined in an encoding/decoding-free way.

Perhaps you meant "use the methods".  I meant "write the methods".

You cannot write .upper for the bytes type without knowing what encoding has been used / is represented by those bytes.
And quite frankly, if you use those methods on bytes without knowing (1) which encoding is represented by the bytes
and (2) that the function you are calling is meant to work with that encoding... well, you deserve what you get.

>> How can you say that with a straight face?
>
> Because I showed you code that does it.  Did you see an .encode or a
> .decode in there?

No, I didn't.  I saw numbers representing bytes representing text that has been encoded in the ASCII codec.  If you
didn't know it was ASCII, you couldn't write that function.  Even though you don't have to call encode or decode if
working directly with encoded bytes, you still have to know what the encoding is to do it correctly.

>> Do you really think that .title, .isalnum, and .center (to name
>> only a few) would work the same if the assumed encoding was EBCDIC?

I phrased that poorly.  If the byte stream was EBCDIC-encoded, and we called the current .method_which_assumes_ASCII on
it, would we get the proper results?

> The numbers involved would change, and the test
> for finding letters would be different (and more complicated IIRC).

And you have actually just made my point.  If the bytes in question were EBCDIC-encoded, we could write a function for
it because we know what it looks like as encoded bytes.  Then we could be debating the merits of working directly with
EBCDIC-encoded text instead of ASCII-encoded text.  ;)
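
Both halves of this exchange can be illustrated with Python's "cp500" codec, one of the EBCDIC code pages in the standard library. This is a sketch under that assumption; ebcdic_upper is a hypothetical function, not part of any library:

```python
def ebcdic_upper(data: bytes) -> bytes:
    """Hypothetical upper() for cp500 bytes, written with integer tests only."""
    def is_lower(b):
        # Lowercase letters occupy three separate runs in EBCDIC.
        return 0x81 <= b <= 0x89 or 0x91 <= b <= 0x99 or 0xA2 <= b <= 0xA9
    # The uppercase letter sits 0x40 above its lowercase counterpart.
    return bytes(b + 0x40 if is_lower(b) else b for b in data)


ebcdic_name = "ethan".encode("cp500")

# The existing ASCII-assuming method leaves EBCDIC letters untouched.
ascii_style = ebcdic_name.upper().decode("cp500")

# A method written with knowledge of the encoding gives the proper result.
ebcdic_style = ebcdic_upper(ebcdic_name).decode("cp500")

print(ascii_style, ebcdic_style)
```

The function needs no encode or decode step of its own, yet its three ranges and the 0x40 offset could not have been written without knowing the bytes are EBCDIC.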

> "There should be one- and preferably only one -way to do
> it."  The one way uses text, so preferably bytes shouldn't.


You forgot the word "obvious".

--
~Ethan~

