Nick Coghlan added the comment:
Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing.
It would work for the default C locale (since that's ASCII), but not in the
general case.
Enhancing backslashreplace to also work on input is an interesting idea, but
worth making
Nick Coghlan added the comment:
Ideally we'd have string modification support for all the translations we offer
as codec error handlers:
* Unicode replacement character ('replace' on input)
* ASCII question mark ('replace' on output)
* Dropping them entirely ('ignore')
* XML character
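To make the translations listed above concrete, here is a minimal sketch of what the existing codec error handlers do with a single undecodable byte. The sample byte string and variable names are illustrative assumptions, and the last handler is assumed to be 'xmlcharrefreplace', since the list item is cut off in this listing.

```python
# Illustrative input: one byte that is not valid ASCII.
raw = b"caf\xe9"

# surrogateescape smuggles the byte through as a lone surrogate (U+DCE9).
escaped = raw.decode("ascii", "surrogateescape")

# 'replace' on input: Unicode replacement character.
decoded_replace = raw.decode("ascii", "replace")

# 'replace' on output: ASCII question mark.
encoded_replace = escaped.encode("ascii", "replace")

# 'ignore': drop the offending code point entirely.
encoded_ignore = escaped.encode("ascii", "ignore")

# XML character reference (assumed to be the truncated fourth item).
encoded_xml = escaped.encode("ascii", "xmlcharrefreplace")

print(ascii(escaped), ascii(decoded_replace))
print(encoded_replace, encoded_ignore, encoded_xml)
```

None of these handlers can be applied directly to an already-decoded string without an encode/decode round trip, which is the gap the comment describes.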
R. David Murray added the comment:
Right now having has_escaped_bytes isn't too important, since I've done nothing
to profile and improve the performance of the new email code. But eventually
I'll need it, because detecting the existence of escaped bytes is inside some
of the inner loops in
Antoine Pitrou added the comment:
data.encode('utf-8', 'replace').decode('utf-8')
data.encode('utf-8', 'ignore').decode('utf-8')
Why not the reverse:
os.fsencode(data).decode('utf-8', 'replace')
os.fsencode(data).decode('utf-8', 'ignore')
Note that backslashreplace needs to be
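The two directions discussed in this comment can be compared side by side. This is only a sketch: to keep the result independent of the local filesystem encoding, it uses an explicit `encode('utf-8', 'surrogateescape')` as a stand-in for `os.fsencode`, which matches `os.fsencode` only when the filesystem encoding is UTF-8. The sample data is an assumption.

```python
# A surrogate-escaped string, as produced when decoding undecodable bytes.
data = b"caf\xe9".decode("utf-8", "surrogateescape")  # 'caf\udce9'

# Direction 1: apply the error handler while encoding, then decode strictly.
enc_replace = data.encode("utf-8", "replace").decode("utf-8")
enc_ignore = data.encode("utf-8", "ignore").decode("utf-8")

# Direction 2 (the reverse): recover the original bytes first, then apply
# the error handler while decoding. Stand-in for os.fsencode(data) on a
# system whose filesystem encoding is UTF-8.
raw = data.encode("utf-8", "surrogateescape")
dec_replace = raw.decode("utf-8", "replace")
dec_ignore = raw.decode("utf-8", "ignore")

print(ascii(enc_replace), ascii(dec_replace))
print(ascii(enc_ignore), ascii(dec_ignore))
```

The visible difference is in the 'replace' case: the encode-side handler yields an ASCII question mark, while the decode-side handler yields U+FFFD.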
Ezio Melotti added the comment:
I think similar functions should be added in the unicodedata module rather than
the string module or as str methods. If I'm not mistaken this was already
proposed in another issue.
In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others
Nick Coghlan added the comment:
The purpose of these changes is to provide tools specifically for working with
surrogate escaped data, not for working with arbitrary lone Unicode surrogates.
escaped_surrogates is not defined by the Unicode spec, it's defined by the
behaviour of the
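As a rough illustration of the distinction drawn here, the code points in question can be derived from the error handler itself rather than from the Unicode surrogate range. The sketch below assumes the handler meant is 'surrogateescape' (the sentence is cut off in this listing), and the variable name is only illustrative.

```python
# Every byte that ASCII cannot decode (0x80-0xFF) maps to exactly one
# lone surrogate under the surrogateescape error handler.
escaped_surrogates = bytes(range(128, 256)).decode("ascii", "surrogateescape")

print(len(escaped_surrogates))
print(ascii(escaped_surrogates[0]), ascii(escaped_surrogates[-1]))

# Other lone surrogates (for example U+D800) are outside this set.
print("\ud800" in escaped_surrogates)
```

So the set covers U+DC80 through U+DCFF only, a subset of the low surrogates.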
Serhiy Storchaka added the comment:
I agree with Ezio on all points. escaped_surrogates is inefficient for any
purpose and incomplete. _match_surrogates can be created in a more efficient
way. clean() has too general and misleading a name. redecode() looks just
ridiculous, this cumbersome
Nick Coghlan added the comment:
Guys, you're Python 3 unicode experts, already thoroughly familiar with how
surrogateescape works. These features are not for you.
The exact implementations don't matter. These need to exist, and they need to
be documented with detailed explanations. People
Ezio Melotti added the comment:
That's why I think a function like redecode is a bad idea.
With Python 2 I've seen a lot of people blindly trying .decode when .encode
failed (and the other way around) whenever they were getting a UnicodeError
(and the fact that decoding Unicode results in an
Nick Coghlan added the comment:
The redecode thing is a distraction from my core concern here, so I've split
that out to issue #22264, a separate RFE for a wsgiref.fix_encoding function.
For this issue, my main concern is the function to *clean* a string of escaped
binary data, so it can be
Serhiy Storchaka added the comment:
What problem is clean_surrogate_escapes() supposed to solve? Could you please
provide a user scenario or two?
A possible alternative implementation is:
def clean_surrogate_escapes(s):
    return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
It
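For reference, this is how the one-line implementation quoted above behaves on a surrogate-escaped filename. The sample filename is an assumption, and the exact number of replacement characters produced per escaped byte is deliberately not relied on here.

```python
def clean_surrogate_escapes(s):
    # Encode lone surrogates as-is, then let the decoder replace the
    # resulting invalid UTF-8 sequences with U+FFFD.
    return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

# A filename containing one undecodable byte, decoded with surrogateescape.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    # The escaped byte cannot be encoded strictly.
    strict_ok = False

cleaned = clean_surrogate_escapes(name)
cleaned_bytes = cleaned.encode("utf-8")  # succeeds: no surrogates remain

print(strict_ok, ascii(cleaned))
```

The cleaned string is safe to hand to strict encoders, at the cost of losing the original byte value.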
Nick Coghlan added the comment:
My main use case is for passing data to other applications that *don't* have
their Unicode handling in order - I want to be able to use Python to do the
data scrubbing, but at the moment it requires intimate knowledge of the codec
error handling system to do
STINNER Victor added the comment:
Your clean() function loses information. If a filename contains almost only
undecodable characters, it will look like .txt. It's not very useful. I
would prefer to escape the byte. Mac OS X (HFS+ filesystem) uses for example
%HH format: \udc80 would be
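A minimal sketch of the escaping alternative suggested here, assuming each escaped byte is rendered as a percent sign followed by two uppercase hex digits. The function name `percent_escape` is hypothetical and the exact output format is an assumption, since the example in the comment is cut off.

```python
import re

# Matches only the code points produced by surrogateescape (U+DC80-U+DCFF).
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")

def percent_escape(s):
    """Replace each surrogate-escaped byte with %HH (hypothetical helper)."""
    def _repl(match):
        # Recover the original byte value from the low surrogate.
        byte_value = ord(match.group()) - 0xDC00
        return "%%%02X" % byte_value
    return _ESCAPED_BYTE.sub(_repl, s)

# A filename made almost entirely of undecodable bytes stays distinguishable.
name = b"\x80\x81.txt".decode("ascii", "surrogateescape")
print(percent_escape(name))
```

Unlike dropping or replacing the bytes, this keeps enough information to tell two such filenames apart.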
Nick Coghlan added the comment:
Based on the latest round of bytes handling discussions on python-dev, I came
up with this updated proposal:
# Constant in the string module (akin to string.ascii_letters et al)
escaped_surrogates = bytes(range(128, 256)).decode('ascii',
Nick Coghlan added the comment:
Note: I dropped the has_escaped_bytes idea, as s != string.clean(s) is
functionally equivalent (albeit a bit more expensive if you don't actually want
the cleaned string).
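The equivalence mentioned in this note can be sketched as follows. `clean` here is a hypothetical stand-in for the proposed `string.clean`, whose real signature and behaviour are not shown in this listing; the replacement character is an assumption.

```python
import re

# Code points produced by the surrogateescape error handler.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")

def clean(s, repl="?"):
    """Hypothetical stand-in for the proposed string.clean()."""
    # A callable replacement avoids backslash processing in repl.
    return _ESCAPED_BYTE.sub(lambda match: repl, s)

dirty = b"caf\xe9".decode("ascii", "surrogateescape")
tidy = "cafe"

# The comparison doubles as a has_escaped_bytes() check.
print(dirty != clean(dirty))
print(tidy != clean(tidy))
```

The extra cost comes from building the cleaned copy even when only the yes/no answer is needed.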
--
___
Python tracker rep...@bugs.python.org
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +ezio.melotti
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18814
Antoine Pitrou added the comment:
Can you sum up the use cases for these?
(don't want to read a blog post, sorry :-))
--
nosy: +pitrou
STINNER Victor added the comment:
The email package needs has_escaped_bytes. Currently it tries to encode to
ascii to find out if there are any, which we proved by microbenchmark is the
fastest way to do it as things stand.
In which function do you need to check this? What do you do if there
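The detection approach described above (try to encode to ASCII and see whether it fails) might look roughly like this. Both function names are hypothetical. The first check flags any non-ASCII character, which is only equivalent to "has escaped bytes" when the text is otherwise expected to be pure ASCII; the second is a narrower variant added here for comparison.

```python
def fails_ascii_encode(s):
    """True if s cannot be encoded as strict ASCII (hypothetical helper)."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False

def has_escaped_bytes(s):
    """True only for surrogateescape code points, U+DC80-U+DCFF."""
    return any("\udc80" <= ch <= "\udcff" for ch in s)

header = b"Subject: caf\xe9".decode("ascii", "surrogateescape")
print(fails_ascii_encode(header), has_escaped_bytes(header))

# A properly decoded non-ASCII string trips the first check only.
print(fails_ascii_encode("caf\xe9"), has_escaped_bytes("caf\xe9"))
```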
R. David Murray added the comment:
The email package uses surrogateescape to store unknown bytes in unicode
strings, just as with the handle-bad-data-from-os API surrogateescape was
introduced for. (For the same reason: the source data may have improperly
encoded bytes that we must
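The storage technique described here relies on surrogateescape being lossless in both directions. A minimal sketch, using an invented header line as sample data:

```python
# Raw input containing a byte that is not valid in the declared charset.
raw = b"Subject: caf\xe9"

# Decoding keeps the unknown byte as a lone surrogate inside the str.
text = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("ascii", "surrogateescape")

print(ascii(text))
print(restored == raw)
```

Because the round trip is exact, the decision about how to handle the bad byte can be deferred until the data is written out again.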
Nick Coghlan added the comment:
The use case is to take data from a surrogate escaped interface and either
filter it out entirely or convert it to a valid Unicode string at the point
of *input*, before letting it make its way into the rest of the
application. For example, this approach permits
New submission from Nick Coghlan:
Prompted by issue 18713 and
http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/, here are some
possible utilities we could add to the codecs module to help deal with/debug
issues related to surrogate escaped strings:
def has_escaped_bytes(s):
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:
--
nosy: +Arfrever