[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-30 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

From a function *user* perspective, the latter API (bytes-bytes, str-str) is 
exactly what I'm doing.

Antoine's point is that there are two ways to achieve that:

Option 1 (what my patch currently does):
- provide bytes and str variants of all constants
- choose which set to use at the start of each function
- be careful never to index, only slice (even for single characters)
- a few other traps that I don't remember off the top of my head
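
To illustrate the indexing trap in option 1: in Python 3, indexing bytes returns an int, while slicing preserves the type. A minimal sketch (the helper name first_char is made up for illustration and is not part of the patch):

```python
def first_char(value):
    """Return the first character of value, preserving its type.

    Slicing is used instead of indexing because b"abc"[0] is the
    int 97, while b"abc"[:1] is b"a".
    """
    return value[:1]


# Indexing behaves differently for the two types...
assert "abc"[0] == "a"
assert b"abc"[0] == 97
# ...while slicing behaves the same way for both.
assert first_char("abc") == "a"
assert first_char(b"abc") == b"a"
```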

Option 2 (the alternative Antoine suggested and I'm considering):
- decode the ASCII compatible bytes to str objects by treating them as 
nominally latin-1
- use the same str-based constants as are used to handle actual str inputs
- be able to index to your heart's content inside the algorithm
- *ensure* that any bytes-as-pseudo-str objects are encoded back to actual 
bytes before they are returned

From outside the function, a user shouldn't be able to tell which approach 
we're using internally.

The nice thing about option 2 is that, to make sure you're doing it correctly, 
you only need to check three kinds of location:
- the initial parameter handling in each function
- any return or raise statements that allow a value to leave the function
- any yield expressions (both input and output)

The effects of option 1 are scattered all over your algorithms, so it's hard to 
be sure you've caught everything.

The downside of option 2 is that if you make a mistake and let your 
bytes-as-pseudo-str objects escape from the confines of your function, you're 
going to see some very strange behaviour.
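
Both sides of option 2 can be shown in a few lines: the latin-1 round trip is lossless for every byte value, but a pseudo-str that escapes without being encoded back looks like mojibake. This is only a sketch, and the helper name roundtrip is hypothetical:

```python
def roundtrip(data):
    """Decode bytes as latin-1 and encode them straight back."""
    return data.decode("latin-1").encode("latin-1")


# latin-1 maps each of the 256 byte values to one code point,
# so no input can fail to decode or change on the way back.
all_bytes = bytes(range(256))
assert roundtrip(all_bytes) == all_bytes

# If the pseudo-str leaks out, UTF-8 data shows up as garbage:
utf8_data = "é".encode("utf-8")        # b'\xc3\xa9'
leaked = utf8_data.decode("latin-1")   # two characters, not one
assert leaked == "\xc3\xa9"
assert leaked != "é"
```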

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9873
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-30 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> Option 2 (the alternative Antoine suggested and I'm considering):
> - decode ... to str ...
> - ... objects are encoded back to actual bytes before
>   they are returned

In this case, you have to be very careful to not mix str and bytes decoded to 
str using a pseudo-encoding. Dummy example: urljoin('unicode', b'bytes') should 
raise an error.

I don't care about the internals if you write tests to ensure that it is not 
possible to mix str and bytes with the public API.
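
A sketch of such a test, assuming a Python version in which urllib.parse.urljoin accepts bytes and rejects mixed arguments with TypeError. The helper name rejects_mixing is hypothetical:

```python
from urllib.parse import urljoin


def rejects_mixing(func, *args):
    """Return True if func(*args) raises TypeError."""
    try:
        func(*args)
    except TypeError:
        return True
    return False


# Same-type calls work and return the input type.
assert urljoin("http://a/b", "c") == "http://a/c"
assert urljoin(b"http://a/b", b"c") == b"http://a/c"
# Mixed calls must fail in either order.
assert rejects_mixing(urljoin, "http://a/b", b"c")
assert rejects_mixing(urljoin, b"http://a/b", "c")
```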

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-30 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

Yeah, the general implementation concept I'm thinking of going with for option 
2 will use a few helper functions:

url, coerced_to_str = _coerce_to_str(url)
if coerced_to_str:
    param = _force_to_str(param)  # as appropriate
...
return _undo_coercion(result, coerced_to_str)

The first helper function standardises the typecheck, the second one complains 
if it is given something that is already a string.

The last one just standardises the check to see if the coercion needs to be 
undone, and actually undoing the coercion.

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-30 Thread Senthil Kumaran

Changes by Senthil Kumaran orsent...@gmail.com:


--
nosy: +orsenthil




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-29 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> A possible duck-typing approach here would be to replace the
> isinstance(x, str) tests with hasattr(x, 'encode') checks instead.

Looks more ugly than useful to me. People wanting to emulate str had better 
subclass it anyway...

--
nosy: +pitrou




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-29 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

Agreed - I think there's a non-zero chance of triggering the str-path by 
mistake if we try to duck-type it (I just added a similar comment to #9969 
regarding a possible convenience API for tokenisation)
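
A small sketch of how the duck-typed check could take the str path by mistake. The Payload class is entirely hypothetical: a bytes-carrying wrapper that happens to expose an encode() method.

```python
class Payload:
    """Hypothetical wrapper around raw bytes with an encode() method."""

    def __init__(self, raw):
        self.raw = raw

    def encode(self, encoding="ascii"):
        return self.raw


def is_str_duck_typed(value):
    """The hasattr-based check from the duck-typing proposal."""
    return hasattr(value, "encode")


def is_str_strict(value):
    """The isinstance-based check used in the patch."""
    return isinstance(value, str)


# Both checks agree on the builtin types...
assert is_str_duck_typed("text") and is_str_strict("text")
assert not is_str_duck_typed(b"data") and not is_str_strict(b"data")
# ...but only the duck-typed one treats the wrapper as text.
assert is_str_duck_typed(Payload(b"data"))
assert not is_str_strict(Payload(b"data"))
```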

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-29 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

Added to Rietveld: http://codereview.appspot.com/2318041/

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-29 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

One of Antoine's review comments made me realise I hadn't explicitly noted the 
"why not decode with latin-1?" rationale for the bytes handling. (It did come 
up in one or more of the myriad python-dev discussions on the topic; I just 
hadn't noted it here.)

The primary reason for supporting ASCII compatible bytes directly is 
specifically to avoid the encoding and decoding overhead associated with the 
translation to unicode text. Making that overhead implicit rather than avoiding 
it altogether would be to quite miss the point of the API change.

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-29 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> The primary reason for supporting ASCII compatible bytes directly is
> specifically to avoid the encoding and decoding overhead associated
> with the translation to unicode text.

I think it's quite misguided. latin1 encoding and decoding is blindingly
fast (on the order of 1 GB/s here). Unless you have multi-megabyte URLs,
you won't notice any overhead.
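
For anyone who wants to check the throughput on their own machine, a rough timing sketch follows. The function name time_latin1_roundtrip and the 10 MiB test size are arbitrary choices, and the numbers will vary by machine and Python version:

```python
import timeit


def time_latin1_roundtrip(size, repeats=5):
    """Return the best time in seconds for one decode+encode of size bytes."""
    # Use every byte value so the data is not pure ASCII.
    data = (bytes(range(256)) * (size // 256 + 1))[:size]
    timings = timeit.repeat(
        lambda: data.decode("latin-1").encode("latin-1"),
        number=1,
        repeat=repeats,
    )
    return min(timings)


if __name__ == "__main__":
    size = 10 * 1024 * 1024  # 10 MiB, far larger than any realistic URL
    seconds = time_latin1_roundtrip(size)
    print(f"{size / seconds / 1e9:.2f} GB/s for decode plus encode")
```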

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-29 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

> I think it's quite misguided. latin1 encoding and decoding is blindingly
> fast (on the order of 1 GB/s here). Unless you have multi-megabyte URLs,
> you won't notice any overhead.

Ah, I didn't know that (although it makes sense now I think about it).
I'll start exploring ideas along those lines then. Having to name all
the literals as I do in the patch is really quite ugly.

A general sketch of such a strategy would be to stick the following
near the start of affected functions:

# "url" stands for whatever the main parameter is called
encode_result = not isinstance(url, str)
if encode_result:
    url = url.decode('latin-1')
    # decode any other arguments that need it
    # Select the bytes versions of any relevant globals
else:
    # Select the str versions of any relevant globals
    pass

Then, at the end, do an encoding step. However, the encoding step may
get a little messy when it comes to the structured data types. For
that, I'll probably take a leaf out of the email6 book and create a
parallel bytes API, with appropriate encode/decode methods to
transform one into the other.
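
A sketch of what the parallel bytes API might look like for a structured result. The Parts and PartsBytes types and the split_scheme function are hypothetical stand-ins, not names from the patch:

```python
from collections import namedtuple


class Parts(namedtuple("Parts", "scheme rest")):
    """Hypothetical str result type."""
    __slots__ = ()

    def encode(self, encoding="latin-1"):
        return PartsBytes(*(field.encode(encoding) for field in self))


class PartsBytes(namedtuple("PartsBytes", "scheme rest")):
    """Hypothetical bytes counterpart of Parts."""
    __slots__ = ()

    def decode(self, encoding="latin-1"):
        return Parts(*(field.decode(encoding) for field in self))


def split_scheme(url):
    """Split url at the first ':' and return the input's flavour of result."""
    encode_result = not isinstance(url, str)
    if encode_result:
        url = url.decode("latin-1")
    scheme, sep, rest = url.partition(":")
    if not sep:
        scheme, rest = "", url
    result = Parts(scheme, rest)
    # The single encoding step at the end covers every field at once.
    return result.encode("latin-1") if encode_result else result


assert split_scheme("http://x") == Parts("http", "//x")
assert split_scheme(b"http://x") == PartsBytes(b"http", b"//x")
```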

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

I don't understand why you would like to implicitly convert bytes to str (which 
is one of the worst design choices of Python 2). If you don't want to care about 
encodings, using bytes is fine. Decoding bytes using an arbitrary encoding is the 
fastest way to mojibake.

So you have two choices: create new functions with bytes as input and output 
(like os.getcwd() and os.getcwdb()), or make the output type depend on the 
input type (the solution chosen by os.path). Example of the latter:

>>> os.path.expanduser('~')
'/home/haypo'
>>> os.path.expanduser(b'~')
b'/home/haypo'
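
Both choices can be seen side by side in the os module. This short sketch uses only existing os and os.path functions; the variable name mixed_rejected is just for illustration:

```python
import os

# Choice 1: separate functions for str and bytes.
assert isinstance(os.getcwd(), str)
assert isinstance(os.getcwdb(), bytes)

# Choice 2: one function whose output type follows the input type.
assert os.path.basename("/home/haypo") == "haypo"
assert os.path.basename(b"/home/haypo") == b"haypo"

# Mixing the two types in one call is rejected.
try:
    os.path.join("a", b"b")
except TypeError:
    mixed_rejected = True
else:
    mixed_rejected = False
assert mixed_rejected
```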

--
nosy: +haypo




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-28 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

Attached patch is a very rough first cut at this. I've gone with the basic 
approach of simply assigning the literals to local variables in each function 
that uses them. My rationale for that is:
1. Every function has to have some kind of boilerplate to switch based on the 
type of the argument
2. Some functions need to switch other values (such as the list of scheme_chars 
or the unparse function), so a separate object with literal attributes won't be 
enough
3. Given 1 and 2, the overhead of a separate namespace for the literal 
references isn't justified

I've also gone with a philosophy that only str objects are treated as strings 
and everything else is implicitly assumed to be bytes. This is in keeping with 
the general interpreter behaviour where we *really* don't support duck-typing 
when it comes to strings.
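
A stripped-down sketch of that approach. The function split_query is a made-up example rather than code from the patch; it shows the per-function boilerplate that picks the literals:

```python
def split_query(url):
    """Split url at the first '?' into (path, query), preserving type."""
    # Only real str objects take the str path; anything else is
    # implicitly assumed to be bytes.
    if isinstance(url, str):
        question_mark = "?"
    else:
        question_mark = b"?"
    path, _, query = url.partition(question_mark)
    return path, query


assert split_query("/a?b=1") == ("/a", "b=1")
assert split_query(b"/a?b=1") == (b"/a", b"b=1")
```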

The test updates aren't comprehensive yet, but they were enough to uncover 
quite a few things I had missed.

Quoting is still a bit ugly, so I may still add a bytes-bytes/str-str variant 
of those APIs.

--
keywords: +patch
Added file: http://bugs.python.org/file19043/issue9873_initial_attempt.diff




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-28 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

A possible duck-typing approach here would be to replace the isinstance(x, str) 
tests with hasattr(x, 'encode') checks instead.

Thoughts?

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-20 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

The design approach (at least for urllib.parse) is to add separate *b APIs that 
operate on bytes rather than implicitly allowing bytes in the ordinary versions 
of the function.

Allowing programmers to manipulate correctly encoded (and hence ASCII 
compatible) bytes to avoid decode/encode overhead when manipulating URLs is 
fine (and the whole point of adding the new functions). Allowing them to *not 
know* whether they have encoded data or text suitable for display to the user 
isn't necessary (and is easy to add later if we decide we want it, while taking 
it away is far more difficult).

More detail at 
http://mail.python.org/pipermail/python-dev/2010-September/103828.html

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-17 Thread Nick Coghlan

Nick Coghlan ncogh...@gmail.com added the comment:

From the python-dev thread 
(http://mail.python.org/pipermail/python-dev/2010-September/103780.html):
==
So the domain of any polymorphic text manipulation functions we define would be:
 - Unicode strings
 - byte sequences where the encoding is either:
   - a single byte ASCII superset (e.g. iso-8859-*, cp1252, koi8*, mac*)
   - an ASCII compatible multibyte encoding (e.g. UTF-8, EUC-JP)

Passing in byte sequences that are encoded using an ASCII incompatible
multibyte encoding (e.g. CP932, UTF-7, UTF-16, UTF-32, shift-JIS,
big5, iso-2022-*, EUC-CN/KR/TW) or a single byte encoding that is not
an ASCII superset (e.g. EBCDIC) will have undefined results.
==
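
To make the "ASCII compatible" distinction concrete, the sketch below checks whether an ASCII delimiter survives encoding byte for byte. The helper name delimiter_survives and the sample URL are illustrative only:

```python
def delimiter_survives(text, encoding, delimiter=b"://"):
    """Return True if the ASCII delimiter appears verbatim in the encoded text."""
    return delimiter in text.encode(encoding)


url = "http://example.com/caf\u00e9"

# ASCII supersets keep the delimiter bytes intact, so byte-level
# parsing finds them where it expects to.
assert delimiter_survives(url, "utf-8")
assert delimiter_survives(url, "latin-1")
# UTF-16 interleaves NUL bytes, so the delimiter is never found.
assert not delimiter_survives(url, "utf-16-le")
```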

--




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-16 Thread Nick Coghlan

New submission from Nick Coghlan ncogh...@gmail.com:

As per python-dev discussion in June, many Py3k APIs currently gratuitously 
prevent the use of bytes and bytearray objects as arguments due to their use of 
string literals internally.

Examples:
urllib.parse.urlparse
urllib.parse.urlunparse
urllib.parse.urljoin
urllib.parse.urlsplit
(and plenty of other examples in urllib.parse)

While a strict reading of the relevant RFCs suggests that strings are the more 
appropriate type for these APIs, as a practical matter, protocol developers 
want to be able to operate on ASCII supersets as raw bytes.

The proposal is to modify the implementation of these functions such that 
string literals are used only with string arguments, and bytes literals 
otherwise. If a function accepts multiple values and a mixture of strings and 
bytes is passed in then the operation will still fail (as it should).
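
The underlying problem is easy to reproduce with a toy function. Here scheme_of is a hypothetical stand-in for the affected APIs, and accepts is a small helper written for this example:

```python
def scheme_of(url):
    """Return the part of url before the first ':'."""
    # The str literal below is what rules out bytes input.
    return url.split(":", 1)[0]


def accepts(func, value):
    """Return True if func(value) completes without TypeError."""
    try:
        func(value)
    except TypeError:
        return False
    return True


assert accepts(scheme_of, "http://example.com")
assert not accepts(scheme_of, b"http://example.com")
```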

--
assignee: ncoghlan
components: Library (Lib)
messages: 116543
nosy: ncoghlan
priority: high
severity: normal
stage: needs patch
status: open
title: Allow bytes in some APIs that use string literals internally
type: behavior
versions: Python 3.2




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-16 Thread R. David Murray

Changes by R. David Murray rdmur...@bitdance.com:


--
nosy: +r.david.murray




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-16 Thread Eric Smith

Changes by Eric Smith e...@trueblade.com:


--
nosy: +eric.smith




[issue9873] Allow bytes in some APIs that use string literals internally

2010-09-16 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
nosy: +eric.araujo
