Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 11:32:05 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
  It's consistent with bytearray.join's behaviour:
 
   >>> x = bytearray()
   >>> x.join([b"abc"])
   bytearray(b'abc')
   >>> x
   bytearray(b'')
 
 Yeah, I guess I'm OK with us being consistent on that one. It's still
 weird, but also clearly useful :)
 
 Will the new binary format ever call __format__? I assume not, but it's
 probably best to make that absolutely explicit in the PEP.

Indeed not. I'll add that to the PEP, thanks.
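
For anyone who wants to reproduce the bytearray.join behaviour quoted
above, here is a minimal sketch (the variable names are only
illustrative). The receiver of join() acts purely as the separator: a
new object is returned and the receiver is left untouched.

```python
# The receiver of join() is only the separator; it is never modified.
x = bytearray()
joined = x.join([b"abc"])

print(joined)  # the joined result, a new bytearray
print(x)       # the separator itself, still empty
```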

cheers

Antoine.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread M.-A. Lemburg
On 09.01.2014 22:45, Antoine Pitrou wrote:
 On Thu, 9 Jan 2014 13:36:05 -0800
 Chris Barker chris.bar...@noaa.gov wrote:

 Some folks have suggested using latin-1 (or other 8-bit encoding) -- is
 that guaranteed to work with any binary data, and round-trip accurately?
 
 Yes, it is.

Just a word of caution:

Using 'latin-1' to mean "unknown encoding" can easily result
in Mojibake (unreadable text) entering your application, with
dangerous effects on your other text data.

E.g. "Marc-André" read using 'latin-1' when the string itself
is encoded as UTF-8 will give you "Marc-AndrÃ©" in your
application. (Yes, I see that a lot in applications
and websites I use ;-))
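
The effect is easy to reproduce. The following is only a sketch of the
failure mode, not a recommendation: it shows that the latin-1 trick
keeps the bytes intact (so it does round-trip), while the text the
application sees is wrong.

```python
original = "Marc-André"
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as latin-1 never fails, but the text is wrong.
misread = utf8_bytes.decode("latin-1")
print(misread)

# The bytes still round-trip exactly, which is why the trick "works".
assert misread.encode("latin-1") == utf8_bytes

# latin-1 maps every byte value 0..255 to a code point, so arbitrary
# binary data survives decode + encode unchanged.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```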

Also note that indexing based on code points will likely
break that way as well, i.e. if you pass an index to an
application based on what you see in your editor or
shell, those indexes can be wrong when used on the
encoded data. UTF-8 is an example of a popular variable
length encoding for Unicode, so you'll hit this problem
whenever dealing with non-ASCII UTF-8 data.
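
A small sketch of the indexing problem, using the same name as above
(illustrative only): the position of a character in the text as
displayed is not its position in the UTF-8 bytes, nor in those bytes
misread as latin-1.

```python
name = "Marc-André"
encoded = name.encode("utf-8")

# The text and its UTF-8 encoding have different lengths, because
# "é" takes two bytes in UTF-8.
print(len(name), len(encoded))

# An index taken from the text as displayed ...
index = name.index("é")
# ... selects only the first byte of the two-byte character.
print(encoded[index:index + 1])

# The same mismatch appears if the bytes were decoded as latin-1.
misread = encoded.decode("latin-1")
print(len(misread), repr(misread[index]))
```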

 and will surrogateescape work for arbitrary binary data?
 
 Yes, it will.

The surrogateescape trick only works if you are encoding
your work using the same encoding that you used for decoding
it. Otherwise, you'll get a mix of the input encoding and the
output encoding as output.

Note that the error handler trick has an advantage over the
latin-1 trick: if you try to encode a Unicode string
with escape surrogates without using the error handler,
it will fail, so you at least know that there are funny
code points in your output string that need some extra
care.
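
To make the two points above concrete, here is a hedged sketch (the
sample bytes are made up for illustration): the round trip is exact
only with the same encoding on both sides, and encoding without the
error handler raises instead of passing the escaped bytes through
silently.

```python
raw = "é".encode("utf-8") + b"\xff"  # valid UTF-8 followed by a stray byte

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF).
text = raw.decode("utf-8", "surrogateescape")

# Same encoding and handler on the way out: exact round trip.
assert text.encode("utf-8", "surrogateescape") == raw

# A different output encoding gives a mix: "é" is re-encoded as
# latin-1 while the escaped byte passes through unchanged.
mixed = text.encode("latin-1", "surrogateescape")
assert mixed != raw

# Without the error handler, encoding fails loudly.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    failure = exc
else:
    failure = None
assert failure is not None
```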

BTW: Perhaps it would be a good idea to backport the
surrogateescape error handler to Python 2.7 to simplify
writing code which works in both Python 2 and 3.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Paul Moore
On 10 January 2014 12:19, M.-A. Lemburg m...@egenix.com wrote:
 Just a word of caution:

 Using the 'latin-1' to mean unknown encoding can easily result
 in Mojibake (unreadable text) entering your application with
 dangerous effects on your other text data.

Agreed. The latin-1 suggestion is purely for people who object to
learning how to handle the encodings in their data more accurately.
That's not a criticism: wanting to avoid getting sidetracked into
understanding encodings when porting a personal script is a classic
practicality vs purity situation. Current responses to people with
encoding issues tend towards an idealistic "you should understand your
data better" position, which, while true in the abstract, is not always
what the requester wants to hear.

Paul.


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Matěj Cepl

On 2014-01-10, 12:19 GMT, you wrote:
 Using the 'latin-1' to mean unknown encoding can easily result
 in Mojibake (unreadable text) entering your application with
 dangerous effects on your other text data.

 E.g. Marc-André read using 'latin-1' if the string itself
 is encoded as UTF-8 will give you Marc-AndrÃ© in your
 application. (Yes, I see that a lot in applications
 and websites I use ;-))

I am afraid that for most people 'latin-1' is just another attempt to
make Unicode complexity go away, and a way to ignore it.

Matěj



Re: [Python-Dev] [Python-checkins] peps: PEP 460: add .format_map()

2014-01-10 Thread Nick Coghlan
On 10 January 2014 07:41, Eric V. Smith e...@trueblade.com wrote:
 I'm not sure how format_map helps in porting from 2 to 3, since it
 doesn't exist in any version of 2.

 Although that said, it's no doubt a useful feature, just not useful in
 code that supports both 2 and 3 with a single code base or when porting
 to 3.

It's purely a matter of consistency with str - if we're adding binary
interpolation back to Python 3 (which I have been persuaded is a good
idea), then we should provide the same three typical spellings of the
operation that str provides.
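
For reference, these are the three spellings as they exist on str
today; the bytes counterparts are what the PEP proposes, so this sketch
deliberately sticks to str. The dictionary contents are made up for
illustration.

```python
values = {"name": "spam", "count": 3}

# 1. printf-style interpolation with the % operator
with_percent = "%(name)s x%(count)d" % values

# 2. str.format(), here fed with keyword arguments
with_format = "{name} x{count}".format(**values)

# 3. str.format_map(), which takes the mapping directly
with_format_map = "{name} x{count}".format_map(values)

print(with_percent, with_format, with_format_map)
```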

Cheers,
Nick.


 Eric.

 On 1/9/2014 4:02 PM, antoine.pitrou wrote:
 http://hg.python.org/peps/rev/8947cdc6b22e
 changeset:   5341:8947cdc6b22e
 user:        Antoine Pitrou solip...@pitrou.net
 date:        Thu Jan 09 22:02:01 2014 +0100
 summary:
   PEP 460: add .format_map()

 files:
   pep-0460.txt |  6 +-
   1 files changed, 5 insertions(+), 1 deletions(-)


 diff --git a/pep-0460.txt b/pep-0460.txt
 --- a/pep-0460.txt
 +++ b/pep-0460.txt
 @@ -24,12 +24,16 @@
similar in syntax to ``str.format()`` (accepting positional as well as
keyword arguments).

 +* ``bytes.format_map(...)`` and ``bytearray.format_map(...)`` for an
 +  API similar to ``str.format_map(...)``, with the same formatting
 +  syntax and semantics as ``bytes.format()`` and ``bytearray.format()``.
 +

  Rationale
  =

  In Python 2, ``str % args`` and ``str.format(args)`` allow the formatting
 -and interpolation of bytes strings.  This feature has commonly been used
 +and interpolation of bytestrings.  This feature has commonly been used
  for the assembling of protocol messages when protocols are known to use
  a fixed encoding.







-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Nick Coghlan
On 10 January 2014 13:32, Lennart Regebro rege...@gmail.com wrote:
 On Thu, Jan 9, 2014 at 10:06 AM, Kristján Valur Jónsson
 krist...@ccpgames.com wrote:
 Do I speak Chinese to my grocer because china is a growing force in the 
 world?  Or start every discussion with my children with a negotiation on 
 what language to use?

 No, because your environment has a default language. And Python has a
 default encoding. You only get problems when some file doesn't use the
 default encoding.

Putting this here because I found out today it's not in any of the
PEPs and folks have to go digging in mailing list archives to find it.
I'll add it to my Python 3 Q&A at some point.

The reason Python 3 currently tries to rely on the POSIX locale
encoding is that during the Python 3 development process it was
pointed out that ShiftJIS, ISO-2022 and various CJK codecs are in
widespread use in Asia, since Asian users needed solutions to the
problem of representing kana, ideographs and other non-Latin
characters long before the Unicode Consortium existed.

This creates a problem for Python 3, as assuming utf-8 means we have a
high risk of corrupting users' data, at least in Asian locales, as well
as anywhere else where non-UTF-8 encodings are common (especially when
encodings that aren't ASCII compatible are involved).

While the Python 3 status quo on POSIX systems certainly isn't ideal,
it at least means our most likely failure mode is an exception rather
than silent data corruption. One of the major culprits for those
exceptions is the antiquated POSIX/C locale, which reports ASCII as the
system encoding. One idea we're considering for Python 3.5 is to have a
report of "ascii" on a POSIX OS imply the surrogateescape error handler
(at least for the standard streams, and perhaps in other contexts),
since the OS reporting the POSIX/C locale almost certainly indicates a
configuration error rather than intentional behaviour.
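
To check what a given environment reports, something along these lines
can be used. It is only a sketch: describe_text_defaults is a
hypothetical helper name, and the second half merely illustrates what
pairing "ascii" with surrogateescape would mean for a non-ASCII byte.

```python
import locale
import sys


def describe_text_defaults():
    """Return the locale encoding and the stdout encoding/error handler."""
    return {
        "locale_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "stdout_errors": getattr(sys.stdout, "errors", None),
    }


print(describe_text_defaults())

# With the ascii codec, a non-ASCII byte normally raises; with
# surrogateescape it is smuggled through and restored on output.
raw = b"caf\xe9"
text = raw.decode("ascii", "surrogateescape")
assert text.encode("ascii", "surrogateescape") == raw
```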

Cheers,
Nick.


 //Lennart



-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] [Python-checkins] peps: PEP 460: add .format_map()

2014-01-10 Thread Eric V. Smith
On 1/10/2014 10:20 AM, Nick Coghlan wrote:
 On 10 January 2014 07:41, Eric V. Smith e...@trueblade.com wrote:
 I'm not sure how format_map helps in porting from 2 to 3, since it
 doesn't exist in any version of 2.

 Although that said, it's no doubt a useful feature, just not useful in
 code that supports both 2 and 3 with a single code base or when porting
 to 3.
 
 It's purely a matter of consistency with str - if we're adding binary
 interpolation back to Python 3 (which I have been persuaded is a good
 idea), then we should provide the same three typical spellings of the
 operation that str provides.
 
 Cheers,
 Nick.

I'm perfectly okay with that, and it was on my list of things to
suggest. I just think that the PEP should be focused on porting code
from 2 to 3 and on code that runs on both 2 and 3. I think the Rationale
should state this clearly.

Eric.



Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Stefan Krah
Nick Coghlan ncogh...@gmail.com wrote:
 One idea we're considering for Python 3.5 is to have a report of
 ascii on a POSIX OS imply the surrogateescape error handler (at
 least for the standard streams, and perhaps in other contexts), since
 the OS reporting the POSIX/C locale almost certainly indicates a
 configuration error rather than intentional behaviour.

On FreeBSD users apparently get the C locale by default. I don't think I've
configured anything special during the install:


freebsd-amd64# adduser
Username: testuser
Full name: 
Uid (Leave empty for default): 
Login group [testuser]: 
Login group is testuser. Invite testuser into other groups? []: 
Login class [default]: 
Shell (sh csh tcsh bash rbash nologin) [sh]: 
Home directory [/home/testuser]: 
Home directory permissions (Leave empty for default): 
Use password-based authentication? [yes]: no
Lock out the account after creation? [no]: 
Username   : testuser
Password   : disabled
Full Name  : 
Uid        : 1003
Class      : 
Groups     : testuser 
Home       : /home/testuser
Home Mode  : 
Shell      : /bin/sh
Locked     : no
OK? (yes/no): yes
adduser: INFO: Successfully added (testuser) to the user database.
Add another user? (yes/no): no
Goodbye!
freebsd-amd64# su - testuser
$ locale
LANG=
LC_CTYPE=C
LC_COLLATE=C
LC_TIME=C
LC_NUMERIC=C
LC_MONETARY=C
LC_MESSAGES=C
LC_ALL=


Stefan Krah




Re: [Python-Dev] Python3 complexity

2014-01-10 Thread INADA Naoki
Now I feel it is a bad thing to encourage using unicode for binary data
with the latin-1 encoding or the surrogateescape error handler.

Handling binary data in the str type using latin-1 is just a hack.
Surrogateescape is just a workaround to keep undecodable bytes in text.

Encouraging binary data in the str type with latin-1 or surrogateescape
means encouraging the mixing of binary and text data.
That is worse than Python 2.

So Python should encourage handling binary data in the bytes type.


On Fri, Jan 10, 2014 at 11:28 PM, Matěj Cepl ma...@ceplovi.cz wrote:


 On 2014-01-10, 12:19 GMT, you wrote:
  Using the 'latin-1' to mean unknown encoding can easily result
  in Mojibake (unreadable text) entering your application with
  dangerous effects on your other text data.
 
  E.g. Marc-André read using 'latin-1' if the string itself
  is encoded as UTF-8 will give you Marc-AndrÃ© in your
  application. (Yes, I see that a lot in applications
  and websites I use ;-))

 I am afraid that for most 'latin-1' is just another attempt to
 make Unicode complexity go away and the way how to ignore it.

 Matěj





-- 
INADA Naoki  songofaca...@gmail.com


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Baptiste Carvello
On 10/01/2014 16:35, Nick Coghlan wrote:

 One idea we're considering for Python 3.5 is to have a report of
 ascii on a POSIX OS imply the surrogateescape error handler (at
 least for the standard streams, and perhaps in other contexts), since
 the OS reporting the POSIX/C locale almost certainly indicates a
 configuration error rather than intentional behaviour.

Would it make sense to be more general, and allow a "lenient mode",
where all files implicitly opened with the default encoding would also
use the surrogateescape error handler?

That way, applications designed to process text mostly written in the
default encoding would just call sys.set_lenient_mode() and be done.

Of course, libraries would need to be strongly discouraged from ever
using this and encouraged to explicitly set the error handler on
appropriate files instead.
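
For comparison, the explicit per-file form that already exists looks
roughly like this. It is a sketch: the file names and sample bytes are
invented, and sys.set_lenient_mode() is only the proposal above, so it
is not called here.

```python
import os
import tempfile

# Mostly UTF-8 text with one stray latin-1 byte in it.
raw = b"valid: caf\xc3\xa9, stray byte: \xe9\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(raw)

    # Explicit encoding and error handler, for this one file only.
    # newline="" switches off newline translation so bytes are preserved.
    with open(src, encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        text = f.read()

    with open(dst, "w", encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        f.write(text)

    with open(dst, "rb") as f:
        copied = f.read()

# The undecodable byte was carried through unchanged.
assert copied == raw
```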

Cheers,

Baptiste



[Python-Dev] Summary of Python tracker Issues

2014-01-10 Thread Python tracker

ACTIVITY SUMMARY (2014-01-03 - 2014-01-10)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4409 (+61)
  closed 27580 (+42)
  total  31989 (+103)

Open issues with patches: 1993 


Issues opened (87)
==================

#15027: Faster UTF-32 encoding
http://bugs.python.org/issue15027  reopened by serhiy.storchaka

#20115: NUL bytes in commented lines
http://bugs.python.org/issue20115  opened by arigo

#20116: urlparse.parse_qs should take argument for query separator
http://bugs.python.org/issue20116  opened by ruben.orduz

#20117: subprocess on Windows: wrong return code with shell=True
http://bugs.python.org/issue20117  opened by gvanrossum

#20118: test_imaplib test_linetoolong fails on 2.7 in SSL test on some
http://bugs.python.org/issue20118  opened by r.david.murray

#20119: pdb c(ont(inue)) optional one-time-only breakpoint (like perl 
http://bugs.python.org/issue20119  opened by nlev...@gmail.com

#20120: Percent-signs (%) in .pypirc should not be interpolated
http://bugs.python.org/issue20120  opened by tlevine

#20121: quopri_codec newline handling
http://bugs.python.org/issue20121  opened by fredstober

#20122: Move CallTips tests to idle_tests
http://bugs.python.org/issue20122  opened by serhiy.storchaka

#20123: pydoc.synopsis fails to load binary modules
http://bugs.python.org/issue20123  opened by eric.snow

#20124: The documentation for the atTime parameter of TimedRotatimeFil
http://bugs.python.org/issue20124  opened by r.david.murray

#20125: We need a good replacement for direct use of load_module(), po
http://bugs.python.org/issue20125  opened by eric.snow

#20126: sched doesn't handle events added after scheduler starts
http://bugs.python.org/issue20126  opened by lo...@blossomhillranch.com

#20127: Race condition in test_threaded_import.task()?
http://bugs.python.org/issue20127  opened by eric.snow

#20128: Re-enable test_modules_search_builtin() in test_pydoc
http://bugs.python.org/issue20128  opened by eric.snow

#20131: warnings module offers no documented, programmatic way to rese
http://bugs.python.org/issue20131  opened by inducer

#20132: Many incremental codecs don’t handle fragmented data
http://bugs.python.org/issue20132  opened by vadmium

#20133: Derby: Convert the audioop module to use Argument Clinic
http://bugs.python.org/issue20133  opened by serhiy.storchaka

#20135: mutate list
http://bugs.python.org/issue20135  opened by m123orning

#20136: Logging: StreamHandler does not use OS line separator.
http://bugs.python.org/issue20136  opened by alibotean

#20137: Logging: RotatingFileHandler computes string length instead of
http://bugs.python.org/issue20137  opened by alibotean

#20138: wsgiref on Python 3.x incorrectly implements URL handling caus
http://bugs.python.org/issue20138  opened by aronacher

#20139: Python installer does not install a pip command (just pip3
http://bugs.python.org/issue20139  opened by pmoore

#20140: UnicodeDecodeError in ntpath.py when home dir contains non-asc
http://bugs.python.org/issue20140  opened by Jarek.Śmiejczak

#20145: unittest.assert*Regex functions should verify that expected_re
http://bugs.python.org/issue20145  opened by the.mulhern

#20146: UserDict module docs link is obsolete
http://bugs.python.org/issue20146  opened by drunax

#20147: multiprocessing.Queue.get() raises queue.Empty exception if ev
http://bugs.python.org/issue20147  opened by torsten

#20148: Derby: Convert the _sre module to use Argument Clinic
http://bugs.python.org/issue20148  opened by serhiy.storchaka

#20150: API change in string formatting with :s option should be docum
http://bugs.python.org/issue20150  opened by Thomas.Robitaille

#20151: Derby: Convert the binascii module to use Argument Clinic
http://bugs.python.org/issue20151  opened by serhiy.storchaka

#20152: Derby #15: Convert 50 sites to Argument Clinic across 9 files
http://bugs.python.org/issue20152  opened by brett.cannon

#20153: New-in-3.4 weakref finalizer doc section is already out of dat
http://bugs.python.org/issue20153  opened by r.david.murray

#20154: Deadlock in asyncio.StreamReader.readexactly()
http://bugs.python.org/issue20154  opened by gvanrossum

#20155: Regression test test_httpservers fails, hangs on Windows
http://bugs.python.org/issue20155  opened by jeff.allen

#20156: bz2.BZ2File.read() does not treat growing input file properly
http://bugs.python.org/issue20156  opened by Joshua.Chia

#20159: Derby #7: Convert 51 sites to Argument Clinic across 3 files -
http://bugs.python.org/issue20159  opened by serhiy.storchaka

#20160: broken ctypes calling convention on MSVC / 64-bit Windows (lar
http://bugs.python.org/issue20160  opened by mark.dickinson

#20162: Test test_hash_distribution fails on RHEL 6.5 / ppc64
http://bugs.python.org/issue20162  opened by zaytsev

#20163: ValueError: time data does not match format

[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
(Sorry if this messes-up the thread order, it is meant as a reply to the
original RFC.)

Dear list,

newbie here. After much hesitation I decided to put forward a use case
which bothers me about the current proposal. Disclaimer: I happen to write
a library which is directly influenced by this.

As you may know, PDF operates over bytes and an integer or floating-point
number is written down as-is, for example 100 or 1.23.

However, the proposal drops %d, %f and %x formats and the suggested
workaround for writing down a number is to use .encode('ascii'), which I
think has two problems:

One is that it needs to construct one additional object per formatting as
opposed to Python 2; it is not uncommon for a PDF file to contain millions
of numbers.

The second problem is that, in my eyes, it is very counter-intuitive to
require the use of str only to get formatting on bytes. Consider the case
where a large bytes object is created out of many smaller bytes objects. If
I wanted to format one part, I would have to use str instead. For example:

content = b''.join([
    b'header',
    b'some dictionary structure',
    b'part 1 abc',
    ('part 2 %.3f' % number).encode('ascii'),
    b'trailer'])

In the case of PDF, the embedding of an image into PDF looks like:

10 0 obj
  << /Type /XObject
     /Width 100
     /Height 100
     /Alternates 15 0 R
     /Length 2167
  >>
stream
...binary image data...
endstream
endobj

Because of the image it makes sense to store such structure inside bytes.
On the other hand, there may well be another obj which contains the
coordinates of Bezier paths:

11 0 obj
...
stream
0.5 0.1 0.2 RG
300 300 m
300 400 400 400 400 300 c
b
endstream
endobj

To summarize, there are cases which mix binary and text and, in my
opinion, dropping the bytes-formatting of numbers makes it more complicated
than it was. I would appreciate any explanation on how:

b'%.1f %.1f %.1f RG' % (r, g, b)

is more confusing than:

b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'),
                           (r, g, b)))

A similar situation exists for HTTP ("Content-Length: 123") and ASCII STL
("vertex 1.0 0.0 0.0").
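
Under the proposal as it stands, the workaround could at least be
wrapped once. The sketch below assumes a hypothetical helper,
fmt_ascii, which is not part of any library; it still builds the
intermediate str object described above, so it only tidies the call
sites.

```python
def fmt_ascii(template, *values):
    """Format with str, then encode; formatted numbers are pure ASCII."""
    return (template % values).encode("ascii")


r, g, b = 0.5, 0.1, 0.2
color_op = fmt_ascii("%.1f %.1f %.1f RG", r, g, b)
header = fmt_ascii("Content-Length: %d", 123)

# Mixing the formatted part with other bytes objects, as in a PDF stream.
content = b"\n".join([b"stream", color_op, b"endstream"])
print(content)
```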

Thanks and have a nice day,

Juraj Sukop

PS: In case the proposal does not include number formatting, it
would be nice to list a set of guidelines or examples on how to
proceed with porting Python 2 formats to Python 3.


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Stefan Ring
On Fri, Jan 10, 2014 at 4:35 PM, Nick Coghlan ncogh...@gmail.com wrote:
 On 10 January 2014 13:32, Lennart Regebro rege...@gmail.com wrote:
 No, because your environment has a default language. And Python has a
 default encoding. You only get problems when some file doesn't use the
 default encoding.

 The reason Python 3 currently tries to rely on the POSIX locale
 encoding is that during the Python 3 development process it was
 pointed out that ShiftJIS, ISO-2022 and various CJK codecs are in
 widespread use in Asia, since Asian users needed solutions to the
 problem of representing kana, ideographs and other non-Latin
 characters long before the Unicode Consortium existed.

 This creates a problem for Python 3, as assuming utf-8 means we have a
 high risk of corrupting users' data, at least in Asian locales, as well
 as anywhere else where non-UTF-8 encodings are common (especially when
 encodings that aren't ASCII compatible are involved).

From my experience, the concept of a default locale is deeply flawed.
What if I log into a (Linux) machine using an old latin-1 putty from
the Windows XP era, have most file names and contents in UTF-8
encoding, except for one directory where people from eastern Europe
upload files via FTP in whatever encoding they choose. What should the
default encoding be now?

That's why I make it a principle to always unset all LC_* and LANG
variables, except when working locally, which happens rather rarely.


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Serhiy Storchaka

On 10.01.14 14:19, M.-A. Lemburg wrote:

BTW: Perhaps it would be a good idea to backport the
surrogateescape error handler to Python 2.7 to simplify
writing code which works in both Python 2 and 3.


You also should change the UTF-8 codec so that it will reject surrogates 
(i.e. u'\ud880'.encode('utf-8') and '\xed\xa2\x80'.decode('utf-8') 
should raise exceptions). And this will break much code.
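
For comparison, this is how the Python 3 UTF-8 codec treats the two
cases mentioned above. The sketch only demonstrates Python 3 behaviour
and says nothing about what a Python 2.7 backport would look like.

```python
lone = "\ud880"  # a lone surrogate code point

# Strict UTF-8 encoding of a lone surrogate is rejected.
try:
    lone.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# So is decoding the byte sequence that would represent it.
try:
    b"\xed\xa2\x80".decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# The explicit opt-in is the surrogatepass error handler.
assert lone.encode("utf-8", "surrogatepass") == b"\xed\xa2\x80"
assert b"\xed\xa2\x80".decode("utf-8", "surrogatepass") == lone
```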





Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 12:17 PM, Juraj Sukop wrote:
 (Sorry if this messes-up the thread order, it is meant as a reply to the
 original RFC.)
 
 Dear list,
 
 newbie here. After much hesitation I decided to put forward a use case
 which bothers me about the current proposal. Disclaimer: I happen to
 write a library which is directly influenced by this.
 
 As you may know, PDF operates over bytes and an integer or
 floating-point number is written down as-is, for example 100 or 1.23.
 
 However, the proposal drops %d, %f and %x formats and the
 suggested workaround for writing down a number is to use
 .encode('ascii'), which I think has two problems:
 
 One is that it needs to construct one additional object per formatting
 as opposed to Python 2; it is not uncommon for a PDF file to contain
 millions of numbers.
 
 The second problem is that, in my eyes, it is very counter-intuitive to
 require the use of str only to get formatting on bytes. Consider the
 case where a large bytes object is created out of many smaller bytes
 objects. If I wanted to format a part I had to use str instead. For example:
 
 content = b''.join([
 b'header',
 b'some dictionary structure',
 b'part 1 abc',
 ('part 2 %.3f' % number).encode('ascii'),
 b'trailer'])

I agree. I don't see any reason to exclude int and float. See Guido's
messages http://bugs.python.org/issue3982#msg180423 and
http://bugs.python.org/issue3982#msg180430 for some justification and
discussion. Since converting int and float to strings generates a very
small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase
versions), what problem is introduced by allowing int and float? The
original str.format() work relied on this fact in its stringlib
implementation.
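
A quick, non-exhaustive way to convince oneself of the ASCII-only
claim: render a handful of numbers with the usual printf-style codes
and check that every result encodes as ASCII. The sample values and
format codes are arbitrary choices for illustration.

```python
samples = [0, -17, 2**64, 1.23, -1e-300, 6.02e23,
           float("inf"), float("nan")]
formats = ["%d", "%x", "%X", "%o", "%f", "%.3e", "%g", "%r"]

rendered = []
for value in samples:
    for spec in formats:
        try:
            rendered.append(spec % value)
        except (TypeError, ValueError, OverflowError):
            # e.g. %x on a float, or %d on inf/nan: not a valid pairing.
            continue

# .encode("ascii") would raise if any non-ASCII character appeared.
encoded = [s.encode("ascii") for s in rendered]
print(len(encoded), "formatted numbers, all ASCII")
```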

Eric.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Mark Lawrence

On 06/01/2014 13:24, Victor Stinner wrote:

Hi,

bytes % args and bytes.format(args) are requested by Mercurial and
Twisted projects. The issue #3982 was stuck because nobody proposed a
complete definition of the new features. Here is a try as a PEP.



Apologies if this has already been said, but Terry Reedy attached a 
proof of concept to issue 3982 which might be worth taking a look at if 
you haven't yet done so.


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Philip Jenvey

On Jan 10, 2014, at 7:35 AM, Nick Coghlan wrote:

 Putting this here because I found out today it's not in any of the
 PEPs and folks have to go digging in mailing list archives to find it.
 I'll add it to my Python 3 QA at some point.
 
 The reason Python 3 currently tries to rely on the POSIX locale
 encoding is that during the Python 3 development process it was
 pointed out that ShiftJIS, ISO-2022 and various CJK codecs are in
 widespread use in Asia, since Asian users needed solutions to the
 problem of representing kana, ideographs and other non-Latin
 characters long before the Unicode Consortium existed.

Really? Because PEP 383 doesn't support and discourages the use of some of 
these codecs as a locale.

--
Philip Jenvey



Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Greg Ewing

INADA Naoki wrote:

latin1 is OK but is it Pythonic?


Latin is most certainly a Pythonic subject:

http://www.youtube.com/watch?v=IIAdHEwiAy8

--
Greg


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Georg Brandl
On 10.01.2014 18:56, Eric V. Smith wrote:
 On 1/10/2014 12:17 PM, Juraj Sukop wrote:
 (Sorry if this messes-up the thread order, it is meant as a reply to the
 original RFC.)
 
 Dear list,
 
 newbie here. After much hesitation I decided to put forward a use case
 which bothers me about the current proposal. Disclaimer: I happen to
 write a library which is directly influenced by this.
 
 As you may know, PDF operates over bytes and an integer or
 floating-point number is written down as-is, for example 100 or 1.23.
 
 However, the proposal drops %d, %f and %x formats and the
 suggested workaround for writing down a number is to use
 .encode('ascii'), which I think has two problems:
 
 One is that it needs to construct one additional object per formatting
 as opposed to Python 2; it is not uncommon for a PDF file to contain
 millions of numbers.
 
 The second problem is that, in my eyes, it is very counter-intuitive to
 require the use of str only to get formatting on bytes. Consider the
 case where a large bytes object is created out of many smaller bytes
 objects. If I wanted to format a part I had to use str instead. For example:
 
 content = b''.join([
 b'header',
 b'some dictionary structure',
 b'part 1 abc',
 ('part 2 %.3f' % number).encode('ascii'),
 b'trailer'])
 
 I agree. I don't see any reason to exclude int and float. See Guido's
 messages http://bugs.python.org/issue3982#msg180423 and
 http://bugs.python.org/issue3982#msg180430 for some justification and
 discussion. Since converting int and float to strings generates a very
 small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase
 versions), what problem is introduced by allowing int and float? The
 original str.format() work relied on this fact in its stringlib
 implementation.

I agree.

I would have needed bytes-formatting (with numbers) recently writing .rtf files.
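Eric's observation above, that formatting int and float only ever yields a small set of ASCII characters, can be checked with a rough sketch like the following. The sample values and the list of format codes are arbitrary choices for illustration, not an exhaustive proof.

```python
# Sample values chosen for illustration only.
ints = [0, -42, 255, 2**64]
floats = [1.23, -1e300, 1e-7, float("inf"), float("nan")]

rendered = []
for value in ints:
    rendered += [fmt % value for fmt in ("%d", "%x", "%X", "%o")]
for value in floats:
    rendered += [fmt % value for fmt in ("%f", "%e", "%g", "%.3f")]

# Collect every distinct character that appeared in any rendering.
used = set("".join(rendered))
all_ascii = all(ord(ch) < 128 for ch in used)
print(sorted(used), all_ascii)
```

Note that inf and nan add a few letters beyond digits and hex characters, but they are still plain ASCII.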

Georg



[Python-Dev] Python3 complexity - 2 use cases

2014-01-10 Thread Jim J. Jewett

 
 Steven D'Aprano wrote:
 I think that heuristics to guess the encoding have their role to play,
 if the caller understands the risks.

Ben Finney wrote:
 In my opinion, content-type guessing heuristics certainly don't belong
 in the standard library.

It would be great if there were never any need to guess.  But in the
real world, there is -- and often the user won't know any more than
python does.  So when it is time to guess, a source of good guesses
is an important battery to include.

The HTML5 specifications go through some fairly extreme contortions
to document what browsers actually do, as opposed to what previous
standards have mandated.  They don't currently specify how to guess
(though I think a draft once tried, since the major browsers all do
it, and at the time did it similarly), but the specs do explicitly
support such a step, and do provide an implementation note
encouraging user-agents to do at least minimal auto-detection.  

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

My own opinion is therefore that Python SHOULD provide better support
for both of the following use cases:

(1)  Treat this file like it came from the web -- including
 autodetection and even overriding explicit charset
 declarations for certain charsets.

We should explicitly treat autodetection like time zone data --
there is no promise that the right answer (or at least the
best guess) won't change, even within a release.

I offer no opinion on whether chardet in particular is still
too volatile, but the docs should warn that the API is driven
by possibly changing external data.

(2)  Treat this file as ASCII+, where anything non-ASCII
 will (at most) be written back out unchanged; it doesn't
 even need to be converted to text.

At this time, I don't know whether the right answer is making it
easy to default to surrogate-escape for all error-handling, 
adding more bytes methods, encouraging use of python's latin-1
variant, offering a dedicated (new?) codec, or some new suggestion.

I do know that this use case is important, and that python 3
currently looks clumsy compared to python 2.
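One of the options named above for use case (2), decoding with the surrogateescape error handler, might look like this minimal sketch. The sample bytes are made up; the point is only that the ASCII part can be edited as text while the other bytes are written back unchanged.

```python
# Hypothetical input: ASCII text with a few non-ASCII bytes mixed in.
raw = b"Subject: caf\xe9 \xff\xfe\n"

# Decode as ASCII; each non-ASCII byte becomes a lone surrogate
# in the range U+DC80..U+DCFF instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")

# The ASCII portion can be manipulated as ordinary text.
edited = text.replace("Subject:", "Subject: Re:")

# Encoding with the same handler restores the original bytes.
out = edited.encode("ascii", errors="surrogateescape")
print(out)
```

The escaped characters are not valid text, so they should not be printed or passed to code that encodes without the same error handler.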


-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Chris Barker
On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop juraj.su...@gmail.com wrote:

 As you may know, PDF operates over bytes and an integer or floating-point
 number is written down as-is, for example 100 or 1.23.


Just to be clear here -- is PDF specifically bytes+ascii?

Or could there be some-other-encoding unicode in there?

If so, then you really have a mess!

if it is bytes+ascii, then it seems you could use a unicode object and
encode/decode to latin-1

Perhaps still a bit klunkier than formatting directly into a bytes object,
but workable.

b'%.1f %.1f %.1f RG' % (r, g, b)

 is more confusing than:

 b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'),
 (r, g, b)))


Let's see, I think that would be:

u'%.1f %.1f %.1f RG' % (r, g, b)

then when you want to write it out:

.encode('latin-1')

dumping the binary data in would be a bit uglier, for the image example:

stream
...binary image data...
endstream
endobj

u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')

I think.

not too bad, though if nothing else an alias for latin-1 that made it clear
it worked for this would be nice.

maybe ascii_plus_binary or something?

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Serhiy Storchaka

10.01.14 18:27, Baptiste Carvello wrote:

would it make sense to be more general, and allow a lenient mode,
where all files implicitly opened with the default encoding would also
use the surrogateescape error handler ?


The surrogateescape error handler is compatible only with 
ASCII-compatible encodings (i.e. no ShiftJIS, no UTF-16). It can't be 
used by default. But you can set PYTHONIOENCODING=:surrogateescape and 
get your default locale encoding with surrogateescape.
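For a single file, the handler can also be requested explicitly in open(). A minimal sketch, assuming an ASCII-compatible encoding (UTF-8 here) and a made-up temporary file containing one stray Latin-1 byte:

```python
import os
import tempfile

# Hypothetical file content: valid UTF-8 followed by a stray 0xE9 byte.
payload = "naïve ".encode("utf-8") + b"caf\xe9\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # newline="" disables newline translation so bytes survive untouched.
    with open(path, encoding="utf-8", errors="surrogateescape", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
        f.write(text)
    with open(path, "rb") as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(round_tripped == payload)
```

The undecodable byte travels through the str object as a lone surrogate and is written back as the original byte.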





Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Victor Stinner
2014/1/10 Juraj Sukop juraj.su...@gmail.com:
 In the case of PDF, the embedding of an image into PDF looks like:

 10 0 obj
   << /Type /XObject
      /Width 100
      /Height 100
      /Alternates 15 0 R
      /Length 2167
   >>
 stream
 ...binary image data...
 endstream
 endobj

Why not build "10 0 obj ... stream" and "endstream endobj" in
Unicode and then encode to ASCII? Example:

data = b''.join((
  ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
  binary_image_data,
  ("endstream endobj").encode('ascii'),
))

Victor


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 5:12 PM, Victor Stinner wrote:
 2014/1/10 Juraj Sukop juraj.su...@gmail.com:
 In the case of PDF, the embedding of an image into PDF looks like:

 10 0 obj
/Type /XObject
  /Width 100
  /Height 100
  /Alternates 15 0 R
  /Length 2167
   
 stream
 ...binary image data...
 endstream
 endobj
 
 Why not build "10 0 obj ... stream" and "endstream endobj" in
 Unicode and then encode to ASCII? Example:
 
 data = b''.join((
   ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
   binary_image_data,
   ("endstream endobj").encode('ascii'),
 ))

Isn't the point of the PEP to make it easier to port 2.x code to 3.5? Is
there really existing code like this in 2.x?

I think what we're trying to do is to make code that looks like:
   b'%d %d obj ... stream' % (10, 0)
work in both 2.x and 3.5.

But correct me if I'm wrong. I'll admit to not following 100% of these
emails.

Eric.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 12:56:19 -0500
Eric V. Smith e...@trueblade.com wrote:
 
 I agree. I don't see any reason to exclude int and float. See Guido's
 messages http://bugs.python.org/issue3982#msg180423 and
 http://bugs.python.org/issue3982#msg180430 for some justification and
 discussion.

If you are representing int and float, you're really formatting a text
message, not bytes. Basically if you allow the formatting of int and
float instances, there's no reason not to allow the formatting of
arbitrary objects through __str__. It doesn't make sense to
special-case those two types and nothing else.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
 On Fri, 10 Jan 2014 12:56:19 -0500
 Eric V. Smith e...@trueblade.com wrote:

 I agree. I don't see any reason to exclude int and float. See Guido's
 messages http://bugs.python.org/issue3982#msg180423 and
 http://bugs.python.org/issue3982#msg180430 for some justification and
 discussion.
 
 If you are representing int and float, you're really formatting a text
 message, not bytes. Basically if you allow the formatting of int and
 float instances, there's no reason not to allow the formatting of
 arbitrary objects through __str__. It doesn't make sense to
 special-case those two types and nothing else.

It might not for .format(), but I'm not convinced. But for %-formatting,
str is already special-cased for these types.

Eric.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 17:20:32 -0500
Eric V. Smith e...@trueblade.com wrote:
 
 Isn't the point of the PEP to make it easier to port 2.x code to 3.5?
 Is
 there really existing code like this in 2.x?

No, but so what? The point of the PEP is not to allow arbitrary
Python 2 code to run without modification under Python 3. There's a
reason we broke compatibility, and there's no way we're gonna undo that.

 I think what we're trying to do is to make code that looks like:
b'%d %d obj ... stream' % (10, 0)
 work in both 2.x and 3.5.

That's not what *I* am trying to do. As far as I'm concerned the aim of
the PEP is to ease bytes interpolation, not to provide some kind of
magical construct that will solve everyone's porting problems.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 17:33:57 -0500
Eric V. Smith e...@trueblade.com wrote:
 On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
  On Fri, 10 Jan 2014 12:56:19 -0500
  Eric V. Smith e...@trueblade.com wrote:
 
  I agree. I don't see any reason to exclude int and float. See Guido's
  messages http://bugs.python.org/issue3982#msg180423 and
  http://bugs.python.org/issue3982#msg180430 for some justification and
  discussion.
  
  If you are representing int and float, you're really formatting a text
  message, not bytes. Basically if you allow the formatting of int and
  float instances, there's no reason not to allow the formatting of
  arbitrary objects through __str__. It doesn't make sense to
  special-case those two types and nothing else.
 
 It might not for .format(), but I'm not convinced. But for %-formatting,
 str is already special-cased for these types.

That's not what I'm saying. str.__mod__ is able to represent all kinds
of types through %s and calling __str__. It doesn't make sense for
bytes.__mod__ to only support int and float. Why only them?

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 02:42 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 17:33:57 -0500
Eric V. Smith e...@trueblade.com wrote:

On 1/10/2014 5:29 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 12:56:19 -0500
Eric V. Smith e...@trueblade.com wrote:


I agree. I don't see any reason to exclude int and float. See Guido's
messages http://bugs.python.org/issue3982#msg180423 and
http://bugs.python.org/issue3982#msg180430 for some justification and
discussion.


If you are representing int and float, you're really formatting a text
message, not bytes. Basically if you allow the formatting of int and
float instances, there's no reason not to allow the formatting of
arbitrary objects through __str__. It doesn't make sense to
special-case those two types and nothing else.


It might not for .format(), but I'm not convinced. But for %-formatting,
str is already special-cased for these types.


That's not what I'm saying. str.__mod__ is able to represent all kinds
of types through %s and calling __str__. It doesn't make sense for
bytes.__mod__ to only support int and float. Why only them?


Because embedding the ASCII equivalent of ints and floats in byte streams is a 
common operation?

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 14:58:15 -0800
Ethan Furman et...@stoneleaf.us wrote:
 On 01/10/2014 02:42 PM, Antoine Pitrou wrote:
  On Fri, 10 Jan 2014 17:33:57 -0500
  Eric V. Smith e...@trueblade.com wrote:
  On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
  On Fri, 10 Jan 2014 12:56:19 -0500
  Eric V. Smith e...@trueblade.com wrote:
 
  I agree. I don't see any reason to exclude int and float. See Guido's
  messages http://bugs.python.org/issue3982#msg180423 and
  http://bugs.python.org/issue3982#msg180430 for some justification and
  discussion.
 
  If you are representing int and float, you're really formatting a text
  message, not bytes. Basically if you allow the formatting of int and
  float instances, there's no reason not to allow the formatting of
  arbitrary objects through __str__. It doesn't make sense to
  special-case those two types and nothing else.
 
  It might not for .format(), but I'm not convinced. But for %-formatting,
  str is already special-cased for these types.
 
  That's not what I'm saying. str.__mod__ is able to represent all kinds
  of types through %s and calling __str__. It doesn't make sense for
  bytes.__mod__ to only support int and float. Why only them?
 
 Because embedding the ASCII equivalent of ints and floats in byte streams
 is a common operation?

Again, if you're representing ASCII, you're representing text and
should use a str object.

Regards

Antoine.




Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Chris Barker
On Fri, Jan 10, 2014 at 6:05 AM, Paul Moore p.f.mo...@gmail.com wrote:

  Using the 'latin-1' to mean unknown encoding can easily result
  in Mojibake (unreadable text) entering your application with
  dangerous effects on your other text data.

 Agreed. The latin-1 suggestion is purely for people who object to
 learning how to handle the encodings in their data more accurately.


I'm not so sure -- it could be used (abused?) for that, but I'm suggesting
it be used for mixed ascii-binary data. I don't know that there IS a
right way to do that -- at least not an efficient or easy to read and
write one.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Mark Lawrence

On 10/01/2014 22:06, Chris Barker wrote:

On Fri, Jan 10, 2014 at 6:05 AM, Paul Moore p.f.mo...@gmail.com wrote:

  Using the 'latin-1' to mean unknown encoding can easily result
  in Mojibake (unreadable text) entering your application with
  dangerous effects on your other text data.

Agreed. The latin-1 suggestion is purely for people who object to
learning how to handle the encodings in their data more accurately.


I'm not so sure -- it could be used (abused?) for that, but I'm
suggesting it be used for mixed ascii-binary data. I don't know that
there IS a right way to do that -- at least not an efficient or easy
to read and write one.

-Chris



The correct way is to read the interface specification which tells you 
what should be in the data.  Or do people not use interface 
specifications these days, preferring to guess what they've got instead?


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 18:14:45 -0500
Eric V. Smith e...@trueblade.com wrote:
 
  Because embedding the ASCII equivalent of ints and floats in byte streams
  is a common operation?
  
  Again, if you're representing ASCII, you're representing text and
  should use a str object.
 
 Yes, but is there existing 2.x code that uses %s for int and float
 (perhaps unwittingly), and do we want to help that code out?
 Or do we
 want to make porters first change to using %d or %f instead of %s?

I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and
%f on bytes objects.

 I think what you're getting at is that in addition to not calling
 __format__, we don't want to call __str__, either, for the same reason.

Not only. We don't want to do anything that actually asks for a
*textual* representation of something. %d and %f ask for a textual
representation of a number, so they're right out.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Fri, Jan 10, 2014 at 10:52 PM, Chris Barker chris.bar...@noaa.gov wrote:

 On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop juraj.su...@gmail.com wrote:

 As you may know, PDF operates over bytes and an integer or floating-point
 number is written down as-is, for example 100 or 1.23.


 Just to be clear here -- is PDF specifically bytes+ascii?

 Or could there be some-other-encoding unicode in there?


From the specs: "At the most fundamental level, a PDF file is a sequence of
8-bit bytes." But it is also possible to represent a PDF using printable
ASCII + whitespace by using escapes and filters. Then, there are also
"text strings" which might be encoded in UTF-16.

What this all means is that the PDF objects are expressed in ASCII,
"stream" objects like images and fonts may have a binary part, and I have
never seen those UTF-16 strings.


u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')


The argument for dropping %f et al. has been that if something is a text,
then it should be Unicode. Conversely, if it is not text, then it should
not be Unicode.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner
victor.stin...@gmail.com wrote:


 Why not build "10 0 obj ... stream" and "endstream endobj" in
 Unicode and then encode to ASCII? Example:

 data = b''.join((
   ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
   binary_image_data,
   ("endstream endobj").encode('ascii'),
 ))


The key is "encode to ASCII", which means that the result is bytes. Then,
there is this "11 0 obj" which should also be bytes. But it has no
binary_image_data - only lots of numbers waiting to be somehow converted
to bytes. I already mentioned the problems with .encode('ascii') but it
does not stop here. Numbers may appear not only inside streams but almost
anywhere: in the header there is the PDF version, an image has to have a
width and height, at the end of the PDF there is a structure containing
offsets to all of the objects in the file. Basically, to .encode('ascii')
every possible number is not exactly simple or pretty.
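To make the cost concrete, here is a rough sketch of what the .encode('ascii') workaround looks like when every number has to go through str first. The helper names num and image_header are hypothetical, and the dictionary layout is a simplified version of the image object shown earlier in the thread, not a complete PDF writer.

```python
def num(value, precision=3):
    """Return ASCII bytes for an int or float (hypothetical helper)."""
    if isinstance(value, int):
        return str(value).encode("ascii")
    # Each float needs a temporary str object before becoming bytes.
    return ("%.*f" % (precision, value)).encode("ascii")


def image_header(obj_id, width, height, length):
    """Assemble a simplified image object header from bytes pieces."""
    return b"".join([
        num(obj_id), b" 0 obj\n<< /Type /XObject /Width ", num(width),
        b" /Height ", num(height), b" /Length ", num(length),
        b" >>\nstream\n",
    ])


header = image_header(10, 100, 100, 2167)
print(header)
```

Every numeric field needs its own call and its own intermediate str, which is the overhead and clutter being described.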


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Sat, 11 Jan 2014 00:43:39 +0100
Juraj Sukop juraj.su...@gmail.com wrote:
 Basically, to .encode('ascii') every possible
 number is not exactly simple or pretty.

Well it strikes me that the PDF format itself is not exactly simple or
pretty. It might be convenient that Python 2 allows you, in certain
cases, to ignore encoding issues because the main text type is
actually a bytestring, but under the Python 3 model there's no reason
to allow the same shortcuts.

Also, when you say you've never encountered UTF-16 text in PDFs, it
sounds like those people who've never encountered any non-ASCII data in
their programs.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Chris Barker
On Fri, Jan 10, 2014 at 3:40 PM, Juraj Sukop juraj.su...@gmail.com wrote:

 What this all means is that the PDF objects are expressed in ASCII,
 "stream" objects like images and fonts may have a binary part, and I have
 never seen those UTF-16 strings.


hmm -- I wonder if they are out there in the wild, though


  u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')


 The argument for dropping %f et al. has been that if something is a
 text, then it should be Unicode. Conversely, if it is not text, then it
 should not be Unicode.





What I'm trying to demonstrate / test is that you can use unicode objects
for mixed binary + ascii, if you make sure to encode/decode using latin-1 (
any others?). The idea is that ascii can be seen/used as text, and other
bytes are preserved, and you can ignore whatever meaning latin-1 gives them.

Using unicode objects means that you can use the existing string formatting
(%s), and if you want to pass in binary blobs, you need to decode them as
latin-1, creating a unicode object, which will get interpolated into your
unicode object. When that unicode object gets encoded back to latin-1, the
original bytes are preserved.

I think this is confusing, as we are calling it latin-1, but not really
using it that way, but it seems it should work.
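A quick sketch of that round trip, using made-up colour values and a blob containing every possible byte value. It also shows one way the trick can go wrong: text operations such as upper() may change characters in the binary part so that it can no longer be encoded back.

```python
binary_data = bytes(range(256))  # every possible byte value
r, g, b = 0.1, 0.5, 1.0          # arbitrary example numbers

# Format as text, smuggling the blob through as latin-1 characters.
text = u"%.1f %.1f %.1f RG\nstream\n%s\nendstream\nendobj" % (
    r, g, b, binary_data.decode("latin-1"))

# Encoding back to latin-1 restores the blob byte for byte.
out = text.encode("latin-1")
print(binary_data in out)

# Caveat: a text operation can corrupt the binary part.
# U+00FF uppercases to U+0178, which latin-1 cannot represent.
try:
    u"\xff".upper().encode("latin-1")
    upper_survives = True
except UnicodeEncodeError:
    upper_survives = False
print(upper_survives)
```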

-Chris





-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Chris Barker
On Fri, Jan 10, 2014 at 3:22 PM, Mark Lawrence breamore...@yahoo.co.uk wrote:

 The correct way is to read the interface specification which tells you
 what should be in the data.  Or do people not use interface specifications
 these days, preferring to guess what they've got instead?


No one is suggesting guessing (OK, sometimes for what encoding text is in
-- but that's when you already know it's text). But while some specs for
mixed ascii and binary may specify which bytes are which, not all do --
there may be a "read the file 'till you find this text, then the next n
bytes are binary", or maybe "the next bytes are binary until you get to
this ascii text", etc...

This is not guessing, but it does require working with an object which has
both ascii text and binary in it -- and why shouldn't Python provide a
reasonable way to work with that?
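For the reading side, bytes objects already support searching and slicing, so a "find this ASCII marker, then take the next n bytes" format can be handled without decoding the binary part. A minimal sketch with a made-up record layout (the /Length key and stream marker mimic the PDF example, but this is not a real PDF parser):

```python
# Hypothetical record: ASCII header, then a 4-byte binary payload.
data = b"10 0 obj\n<< /Length 4 >>\nstream\n\x00\xff\x10\x80\nendstream\nendobj\n"

# Find the ASCII marker that introduces the binary part.
marker = b"stream\n"
start = data.index(marker) + len(marker)

# Read the payload length from the ASCII header that precedes it.
header = data[:start]
key = b"/Length "
token = header[header.index(key) + len(key):].split()[0]
length = int(token.decode("ascii"))

# Slice out exactly that many raw bytes; they are never decoded.
body = data[start:start + length]
print(length, body)
```

Only the small numeric token is ever turned into text; the payload stays as bytes throughout.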

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou solip...@pitrou.net wrote:

 Also, when you say you've never encountered UTF-16 text in PDFs, it
 sounds like those people who've never encountered any non-ASCII data in
 their programs.


Let me clarify: one does not think of writing text in PDF in terms of
Unicode. Instead, one records the sequence of character codes which
correspond to glyphs, or the glyph IDs directly. That's because one
Unicode character may have more than one glyph, and several characters
can be shown as one glyph.


Re: [Python-Dev] Python3 complexity

2014-01-10 Thread Ethan Furman

On 01/10/2014 03:22 PM, Mark Lawrence wrote:

On 10/01/2014 22:06, Chris Barker wrote:


I'm not so sure -- it could be used (abused?) for that, but I'm
suggesting it be used for mixed ascii-binary data. I don't know that
there IS a right way to do that -- at least not an efficient or easy
to read and write one.


The correct way is to read the interface specification which tells you what 
should be in the data.


Of course.  The debate is about how to generate the data to the specs in an 
elegant manner.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/08/2014 02:42 PM, Antoine Pitrou wrote:


With Victor's consent, I overhauled PEP 460 and made the feature set
more restricted and consistent with the bytes/str separation.


From the PEP:
=

Python 3 generally mandates that text be stored and manipulated as
 unicode (i.e. str objects, not bytes). In some cases, though, it
 makes sense to manipulate bytes objects directly. Typical usage is
 binary network protocols, where you can want to interpolate and
 assemble several bytes object (some of them literals, some of them
 compute) to produce complete protocol messages. For example,
 protocols such as HTTP or SIP have headers with ASCII names and
 opaque textual values using a varying and/or sometimes ill-defined
 encoding. Moreover, those headers can be followed by a binary
 body... which can be chunked and decorated with ASCII headers and
 trailers!


As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and 
then refuses to allow % to embed ASCII text in the byte stream.



All other features present in formatting of str objects (either
 through the percent operator or the str.format() method) are
 unsupported. Those features imply treating the recipient of the
 operator or method as text, which goes counter to the text / bytes
 separation (for example, accepting %d as a format code would imply
 that the bytes object really is a ASCII-compatible text string).


No, it implies that portion of the byte stream is ASCII compatible.  And we have several examples: PDF, HTML, DBF, just 
about every network protocol (not counting M$), and, I'm sure, plenty I haven't heard of.



-1 on the PEP as it stands now.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 16:23:53 -0800
Ethan Furman et...@stoneleaf.us wrote:
 On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
 
  With Victor's consent, I overhauled PEP 460 and made the feature set
  more restricted and consistent with the bytes/str separation.
 
  From the PEP:
 =
  Python 3 generally mandates that text be stored and manipulated as
   unicode (i.e. str objects, not bytes). In some cases, though, it
   makes sense to manipulate bytes objects directly. Typical usage is
   binary network protocols, where you can want to interpolate and
   assemble several bytes object (some of them literals, some of them
   compute) to produce complete protocol messages. For example,
   protocols such as HTTP or SIP have headers with ASCII names and
   opaque textual values using a varying and/or sometimes ill-defined
   encoding. Moreover, those headers can be followed by a binary
   body... which can be chunked and decorated with ASCII headers and
   trailers!
 
 As it stands now, the PEP talks about ASCII, about how it makes sense
 sometimes to work directly with bytes objects, and 
 then refuses to allow % to embed ASCII text in the byte stream.

Indeed I refuse for %-formatting to allow the mixing of bytes and str
objects, in the same way that it is forbidden to concatenate "a" and
b"b" together, or to write b"".join(["abc"]).

Python 3 was made *precisely* because the implicit conversion between
ASCII unicode and bytes is deemed harmful. It's completely
counter-productive and misleading for our users to start muddying the
message by introducing exceptions to that rule.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Eric V. Smith
On 1/10/2014 8:12 PM, Antoine Pitrou wrote:
 On Fri, 10 Jan 2014 16:23:53 -0800
 Ethan Furman et...@stoneleaf.us wrote:
 On 01/08/2014 02:42 PM, Antoine Pitrou wrote:

 With Victor's consent, I overhauled PEP 460 and made the feature set
 more restricted and consistent with the bytes/str separation.

  From the PEP:
 =
 Python 3 generally mandates that text be stored and manipulated as
  unicode (i.e. str objects, not bytes). In some cases, though, it
  makes sense to manipulate bytes objects directly. Typical usage is
  binary network protocols, where you may want to interpolate and
  assemble several bytes objects (some of them literals, some of them
  computed) to produce complete protocol messages. For example,
  protocols such as HTTP or SIP have headers with ASCII names and
  opaque textual values using a varying and/or sometimes ill-defined
  encoding. Moreover, those headers can be followed by a binary
  body... which can be chunked and decorated with ASCII headers and
  trailers!

 As it stands now, the PEP talks about ASCII, about how it makes sense
 sometimes to work directly with bytes objects, and 
 then refuses to allow % to embed ASCII text in the byte stream.
 
 Indeed I refuse for %-formatting to allow the mixing of bytes and str
 objects, in the same way that it is forbidden to concatenate "a" and
 b"b" together, or to write b"".join(["abc"]).

I think:
'a' + b'b'
is different from:
b'Content-Length: %d' % 42

The former we want to prevent, but I see nothing wrong with the latter.
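
For comparison, here is a sketch of how the second case can be spelled without any bytes formatting support, by formatting as text and encoding explicitly. The helper name content_length_header is hypothetical and only serves the illustration:

```python
def content_length_header(length):
    """Build an ASCII header line from an int via explicit text formatting."""
    # Format as text, then encode explicitly: no str/bytes mixing takes place.
    return ("Content-Length: %d" % length).encode("ascii")

header = content_length_header(42)
```

The result is a bytes object that can be concatenated with other bytes; the cost is one explicit .encode() call per formatted fragment.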

So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
3982. See for example http://bugs.python.org/issue3982#msg180432 .

Eric.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 20:53:09 -0500
Eric V. Smith e...@trueblade.com wrote:
 
 So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
 3982. See for example http://bugs.python.org/issue3982#msg180432 .

Then we might as well not do anything, since any attempt to advance
things is met by stubborn opposition in the name of "not far enough".

(I don't care much personally, I think the issue is quite overblown
anyway)

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 06:04 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 20:53:09 -0500
Eric V. Smith e...@trueblade.com wrote:


So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
3982. See for example http://bugs.python.org/issue3982#msg180432 .


Then we might as well not do anything, since any attempt to advance
things is met by stubborn opposition in the name of "not far enough".


Heh, and here I thought it was stubborn opposition in the name of purity.  ;)



(I don't care much personally, I think the issue is quite overblown
anyway)


Is it safe to assume you don't use Python for the use-cases under discussion?  Specifically, mixed ASCII, binary, and 
encoded-text byte streams?


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Antoine Pitrou
On Fri, 10 Jan 2014 18:28:41 -0800
Ethan Furman et...@stoneleaf.us wrote:
 
 Is it safe to assume you don't use Python for the use-cases under discussion?

You know, I've done quite a bit of network programming. I've also done
an experimental port of Twisted to Python 3. I know what a network
protocol with ill-defined encodings looks like.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread INADA Naoki
To avoid implicit conversion between str and bytes, I propose adding only
a limited %-format, not .format() or .format_map().

Limited %-format means:

%c accepts an integer or a bytes object of length one.
%r is not supported.
%s accepts only bytes.
%a is the only format that accepts arbitrary objects.

All other formats behave the same as for str.
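
The four rules listed above can be sketched as a per-argument conversion function. This is only an illustration under stated assumptions: convert_arg is a hypothetical helper, not an existing or proposed API, and it covers just the four codes named in the list (the numeric codes are left out).

```python
def convert_arg(code, value):
    """Convert one argument under the limited rules (hypothetical helper)."""
    if code == "c":
        # %c: an integer in range(256) or a bytes object of length one
        if isinstance(value, int):
            return bytes([value])
        if isinstance(value, bytes) and len(value) == 1:
            return value
        raise TypeError("%c requires an integer or a bytes object of length 1")
    if code == "s":
        # %s: bytes only, never str
        if isinstance(value, bytes):
            return value
        raise TypeError("%s requires bytes")
    if code == "a":
        # %a: any object, via its ASCII-only repr
        return ascii(value).encode("ascii")
    # %r and anything else not handled here is rejected
    raise ValueError("unsupported format code: %" + code)

char = convert_arg("c", 65)
raw = convert_arg("s", b"abc")
escaped = convert_arg("a", "caf\u00e9")
```

Under these rules a str can only enter the result through %a, where non-ASCII characters appear as escapes rather than as encoded text.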



On Sat, Jan 11, 2014 at 8:24 AM, Antoine Pitrou solip...@pitrou.net wrote:

 On Fri, 10 Jan 2014 18:14:45 -0500
 Eric V. Smith e...@trueblade.com wrote:
 
   Because embedding the ASCII equivalent of ints and floats in byte
 streams
   is a common operation?
  
   Again, if you're representing ASCII, you're representing text and
   should use a str object.
 
  Yes, but is there existing 2.x code that uses %s for int and float
  (perhaps unwittingly), and do we want to help that code out?
  Or do we
  want to make porters first change to using %d or %f instead of %s?

 I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and
 %f on bytes objects.

  I think what you're getting at is that in addition to not calling
  __format__, we don't want to call __str__, either, for the same reason.

 Not only. We don't want to do anything that actually asks for a
 *textual* representation of something. %d and %f ask for a textual
 representation of a number, so they're right out.

 Regards

 Antoine.






-- 
INADA Naoki  songofaca...@gmail.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 06:39 PM, Antoine Pitrou wrote:

On Fri, 10 Jan 2014 18:28:41 -0800
Ethan Furman wrote:


Is it safe to assume you don't use Python for the use-cases under discussion?


You know, I've done quite a bit of network programming.


No, I didn't, that's why I asked.


I've also done an experimental port of Twisted to Python 3.
I know what a network protocol with ill-defined encodings
 looks like.


Can you give a code sample of what you think, for example, the PDF generation code should look like?  (If you already 
have, I apologize -- I missed it in all the ruckus.)


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Ethan Furman

On 01/10/2014 06:39 PM, Antoine Pitrou wrote:


I know what a network protocol with ill-defined encodings
 looks like.


For the record, I've been (and I suspect Eric and some others have also been) talking about well-defined encodings.  For 
the DBF files that I work with, there is binary, ASCII, and a third encoding that is specified in the file header.


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Cameron Simpson
On 11Jan2014 00:43, Juraj Sukop juraj.su...@gmail.com wrote:
 On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner
 victor.stin...@gmail.com wrote:
  Why not build "10 0 obj ... stream" and "endstream endobj" in
  Unicode and then encode to ASCII? Example:
 
  data = b''.join((
    ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
    binary_image_data,
    ("endstream endobj").encode('ascii'),
  ))
 
 The key is "encode to ASCII" which means that the result is bytes. Then,
 there is this "11 0 obj" which should also be bytes. But it has no
 binary_image_data - only lots of numbers waiting to be somehow converted
 to bytes. I already mentioned the problems with .encode('ascii') but it
 does not stop here. Numbers may appear not only inside streams but almost
 anywhere: in the header there is PDF version, an image has to have width
 and height, at the end of PDF there is a structure containing offsets to
 all of the objects in file. Basically, to .encode('ascii') every possible
 number is not exactly simple or pretty.

Hi Juraj,

Might I suggest a helper function (outside the PEP scope) instead
of arguing for support for %f et al?

Thus:

  def bytify(things, encoding='ascii'):
    # Pass bytes through unchanged; encode everything else via its str() form.
    for thing in things:
      if isinstance(thing, bytes):
        yield thing
      else:
        yield str(thing).encode(encoding)

Then one's embedding in PDF might become, more readably:

  data = b' '.join( bytify( [ 10, 0, 'obj', binary_image_data, ... ] ) )

Of course, bytify might be augmented with whatever encoding facilities
might suit your needs.
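
A self-contained run of the helper idea might look like the following sketch. The helper is restated so the snippet runs on its own, and the payload bytes and the trailing 1.5 are placeholders assumed purely for illustration:

```python
def bytify(things, encoding="ascii"):
    """Yield each item as bytes; non-bytes items go through str() and encode()."""
    for thing in things:
        if isinstance(thing, bytes):
            yield thing
        else:
            yield str(thing).encode(encoding)

# Placeholder payload, assumed for illustration only.
binary_image_data = b"\x89PNG\x00\xff"

# Ints, floats and str are encoded; the bytes payload passes through untouched.
data = b" ".join(bytify([10, 0, "obj", binary_image_data, 1.5]))
```

Note that numbers are rendered with str(), so any specific precision (as with %.1f) would have to be applied before the value is handed to the helper.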

Cheers,
-- 
Cameron Simpson c...@zip.com.au

We tend to overestimate the short-term impact of technological change and
underestimate its long-term impact. - Amara's Law


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Steven D'Aprano
On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:

 As you may know, PDF operates over bytes and an integer or floating-point
 number is written down as-is, for example "100" or "1.23".

I'm sorry, I don't understand what you mean here. I'm honestly not 
trying to be difficult, but you sound confident that you understand what 
you are doing, but your description doesn't make sense to me. To me, it 
looks like you are conflating bytes and ASCII characters, that is, 
assuming that characters are in some sense identical to their ASCII 
representation. Let me explain:

The integer that in English is written as 100 is represented in memory
as bytes 0x0064 (assuming a big-endian C short), so when you say "an
integer is written down AS-IS" (emphasis added), to me that says that
the PDF file includes the bytes 0x0064. But then you go on to write the
three character string "100", which (assuming ASCII) is the bytes
0x313030. Going from the C short to the ASCII representation 0x313030 is
nothing like inserting the int "as-is". To put it another way, the
Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary 
file which includes some ASCII-formatted text fields. So when writing an 
integer 100, rather than writing it "as is" which would be byte 0x64
(with however many leading null bytes needed for padding), it is 
converted to ASCII representation 0x313030 first, and that's what needs 
to be inserted.
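
The two representations contrasted above can be checked directly. This is a small sketch using the standard struct module; the variable names are arbitrary:

```python
import struct

n = 100

# Big-endian C short: the two-byte binary form 0x0064.
raw = struct.pack(">h", n)

# ASCII digits: the three-byte textual form 0x313030.
text = str(n).encode("ascii")

raw_hex = raw.hex()
text_hex = text.hex()
```

Here raw_hex is "0064" and text_hex is "313030", so the two byte strings share neither length nor content.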

If you consider PDF as binary with occasional pieces of ASCII text, then 
working with bytes makes sense. But I wonder whether it might be better 
to consider PDF as mostly text with some binary bytes. Even though the 
bulk of the PDF will be binary, the interesting bits are text. E.g. your 
example:

 In the case of PDF, the embedding of an image into PDF looks like:
 
 10 0 obj
   << /Type /XObject
      /Width 100
      /Height 100
      /Alternates 15 0 R
      /Length 2167
   >>
 stream
 ...binary image data...
 endstream
 endobj


Even though the binary image data is probably much, much larger in 
length than the text shown above, it's (probably) trivial to deal with: 
convert your image data into bytes, decode those bytes into Latin-1, 
then concatenate the Latin-1 string into the text above.

Latin-1 has the nice property that every byte decodes into the character 
with the same code point, and vice versa. So:

for i in range(256):
    assert bytes([i]).decode('latin-1') == chr(i)
    assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode 
text with embedded binary data, rather than binary data with embedded 
ASCII text. Then when writing the file to disk, of course you encode it 
to Latin-1, either explicitly:

pdf = ... # Unicode string containing the PDF contents
with open("outfile.pdf", "wb") as f:
    f.write(pdf.encode("latin-1"))

or implicitly:

with open("outfile.pdf", "w", encoding="latin-1") as f:
    f.write(pdf)
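
The approach described above can be tried end to end with a stand-in payload. This is a sketch only: the object text is a simplified PDF-like fragment, the payload is every byte value once, and an in-memory buffer replaces the output file.

```python
import io

# Every possible byte value, as a stand-in for real image data.
binary_image_data = bytes(range(256))

# Build the whole object as text; the payload is decoded as Latin-1 so that
# it can be embedded in a str without losing any byte.
pdf = "10 0 obj\n<< /Length %d >>\nstream\n%s\nendstream\nendobj\n" % (
    len(binary_image_data),
    binary_image_data.decode("latin-1"),
)

# Encode exactly once on output; the buffer stands in for the file.
buf = io.BytesIO()
buf.write(pdf.encode("latin-1"))
encoded = buf.getvalue()
```

The payload survives unchanged inside encoded because Latin-1 maps each byte to the code point of the same value, which is the property the assertion loop above verifies.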


There may be a few wrinkles I haven't thought of, I don't claim to be an 
expert on PDF. But I see no reason why PDF files ought to be an 
exception to the rule:

* work internally with Unicode text;

* convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and better, the internal 
representation of Unicode strings containing only code points up to 255 
(i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte 
per character.

Another advantage is that using text rather than bytes means that your 
example:

[...]
 dropping the bytes-formatting of numbers makes it more complicated
 than it was. I would appreciate any explanation on how:
 
 b'%.1f %.1f %.1f RG' % (r, g, b)

becomes simply

'%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can be written as:

u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2. 
That's *much* nicer than your suggestion:


 is more confusing than:
 
 b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), 
  (r, g, b)))




-- 
Steven


Re: [Python-Dev] Python3 complexity - 2 use cases

2014-01-10 Thread Ben Finney
Jim J. Jewett jimjjew...@gmail.com writes:

  
  Steven D'Aprano wrote:
  I think that heuristics to guess the encoding have their role to play,
  if the caller understands the risks.

 Ben Finney wrote:
  In my opinion, content-type guessing heuristics certainly don't belong
  in the standard library.

 It would be great if there were never any need to guess.  But in the
 real world, there is -- and often the user won't know any more than
 python does.

That's why I think it's great to have heuristic guessing code available
as a third-party library.

 So when it is time to guess, a source of good guesses is an important
 battery to include.

Why is it important enough to deserve that privilege, over the thousands
of other candidates for the standard library? The barrier for entry to
the standard library is higher than mere usefulness.

 We should explicitly treat autodetection like time zone data --
 there is no promise that the right answer (or at least the best
 guess) won't change, even within a release.

But there is exactly one set of authoritative time zones at any
particular point in time. That's why it makes sense to have that set of
authoritative values available in the standard library.

Heuristic guesses about content types do not have the property of
exactly one authoritative source, so your analogy is not compelling.

-- 
 \ “Unix is an operating system, OS/2 is half an operating system, |
  `\Windows is a shell, and DOS is a boot partition virus.” —Peter |
_o__)H. Coffin |
Ben Finney



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Georg Brandl
Am 11.01.2014 03:04, schrieb Antoine Pitrou:
 On Fri, 10 Jan 2014 20:53:09 -0500
 Eric V. Smith e...@trueblade.com wrote:
 
 So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
 3982. See for example http://bugs.python.org/issue3982#msg180432 .

I agree.

 Then we might as well not do anything, since any attempt to advance
 things is met by stubborn opposition in the name of "not far enough".
 
 (I don't care much personally, I think the issue is quite overblown
 anyway)

So you wouldn't mind another overhaul of the PEP including a bit more
functionality again? :)  I really think that practicality beats purity
here.  (I'm not advocating free mixing of bytes and str, mind you!)

Georg
