[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-27 Thread Nick Coghlan
Nick Coghlan added the comment: Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case. Enhancing backslashreplace to also work on input is an interesting idea, but worth making
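
A minimal sketch of the point above, assuming for illustration a hypothetical system whose filesystem encoding is cp1252. The explicit codec names stand in for what os.fsdecode and os.fsencode would do on such a system. The escaped string round-trips through its own encoding, but reinterpreting those bytes as UTF-8 also garbles the text that had decoded correctly.

```python
# Assumption for illustration: the filesystem encoding is cp1252, in which
# byte 0x81 is undefined and therefore gets surrogate-escaped on decode.
FS_ENCODING = "cp1252"

raw = b"\xe9t\xe9\x81"  # "ete" with acute accents in cp1252, plus one undecodable byte
text = raw.decode(FS_ENCODING, "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
round_trip = text.encode(FS_ENCODING, "surrogateescape")

# Decoding those bytes as UTF-8 damages the valid accented characters too.
wrong = round_trip.decode("utf-8", "replace")

# Matching the codec keeps the valid text and replaces only the bad byte.
right = round_trip.decode(FS_ENCODING, "replace")
```

With an ASCII filesystem encoding the mismatch is harmless, since every non-escaped character is ASCII and so is also valid UTF-8.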

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-25 Thread Nick Coghlan
Nick Coghlan added the comment: Ideally we'd have string modification support for all the translations we offer as codec error handlers:
* Unicode replacement character ('replace' on input)
* ASCII question mark ('replace' on output)
* Dropping them entirely ('ignore')
* XML character
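
The first three translations in the list above can already be spelled with encode/decode round trips. The following is only a sketch of that idiom, not the API proposed in the issue; the sample bytes and variable names are made up for illustration.

```python
# Sample input (hypothetical): a filename whose bytes are not valid UTF-8.
raw = b"caf\xe9.txt"
escaped = raw.decode("utf-8", "surrogateescape")  # holds U+DCE9 for byte 0xE9

# Unicode replacement character: recover the bytes, then 'replace' on input.
as_replacement_char = escaped.encode("utf-8", "surrogateescape").decode(
    "utf-8", "replace"
)

# ASCII question mark: 'replace' on output, then decode the clean bytes.
as_question_mark = escaped.encode("utf-8", "replace").decode("utf-8")

# Dropping the escaped bytes entirely: 'ignore' on output.
dropped = escaped.encode("utf-8", "ignore").decode("utf-8")
```

Each variant requires knowing which direction the handler applies in, which is the knowledge the proposed helpers are meant to hide.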

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-25 Thread R. David Murray
R. David Murray added the comment: Right now having has_escaped_bytes isn't too important, since I've done nothing to profile and improve the performance of the new email code. But eventually I'll need it, because detecting the existence of escaped bytes is inside some of the inner loops in

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-25 Thread Antoine Pitrou
Antoine Pitrou added the comment:
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')
Why not the reverse:
    os.fsencode(data).decode('utf-8', 'replace')
    os.fsencode(data).decode('utf-8', 'ignore')
Note that backslashreplace needs to be

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Ezio Melotti
Ezio Melotti added the comment: I think similar functions should be added in the unicodedata module rather than the string module or as str methods. If I'm not mistaken this was already proposed in another issue. In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Nick Coghlan
Nick Coghlan added the comment: The purpose of these changes is to provide tools specifically for working with surrogate escaped data, not for working with arbitrary lone Unicode surrogates. escaped_surrogates is not defined by the Unicode spec, it's defined by the behaviour of the

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I agree with Ezio on all points. escaped_surrogates is inefficient for any purpose and incomplete. _match_surrogates can be created in a more efficient way. clean() has too general and misleading a name. redecode() looks just ridiculous, this cumbersome

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Nick Coghlan
Nick Coghlan added the comment: Guys, you're Python 3 unicode experts, already thoroughly familiar with how surrogateescape works. These features are not for you. The exact implementations don't matter. These need to exist, and they need to be documented with detailed explanations. People

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Ezio Melotti
Ezio Melotti added the comment: That's why I think a function like redecode is a bad idea. With Python 2 I've seen a lot of people blindly trying .decode when .encode failed (and the other way around) whenever they were getting a UnicodeError (and the fact that decoding Unicode results in an

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Nick Coghlan
Nick Coghlan added the comment: The redecode thing is a distraction from my core concern here, so I've split that out to issue #22264, a separate RFE for a wsgiref.fix_encoding function. For this issue, my main concern is the function to *clean* a string of escaped binary data, so it can be

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: What problem is clean_surrogate_escapes() supposed to solve? Could you please provide a user scenario or two? A possible alternative implementation is:
    def clean_surrogate_escapes(s):
        return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
It
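
A short sketch exercising the alternative implementation quoted above. The function body is taken from the comment; contains_surrogates and the sample input are hypothetical additions used only to check the result.

```python
def clean_surrogate_escapes(s):
    # As given in the comment: let surrogates through on encode, then
    # replace whatever is not valid UTF-8 on decode.
    return s.encode("utf-8", "surrogatepass").decode("utf-8", "replace")


def contains_surrogates(s):
    # Hypothetical helper: True if any code point lies in U+D800..U+DFFF.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# Sample input (hypothetical): byte 0xE9 escaped while decoding as UTF-8.
escaped = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
cleaned = clean_surrogate_escapes(escaped)

# The cleaned string can now be encoded strictly without raising.
encoded = cleaned.encode("utf-8")
```

Because 'surrogatepass' lets every surrogate through, this version cleans any lone surrogate, not only those produced by surrogateescape.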

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread Nick Coghlan
Nick Coghlan added the comment: My main use case is for passing data to other applications that *don't* have their Unicode handling in order - I want to be able to use Python to do the data scrubbing, but at the moment it requires intimate knowledge of the codec error handling system to do

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-24 Thread STINNER Victor
STINNER Victor added the comment: Your clean() function loses information. If a filename contains almost only undecodable characters, it will look like .txt. It's not very useful. I would prefer to escape the byte. Mac OS X (HFS+ filesystem), for example, uses the %HH format: \udc80 would be
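
One way the escaping suggested above might look, as a sketch only. The function name percent_escape and the use of uppercase hex digits are assumptions; the comment names only the %HH format.

```python
import re

# Surrogate escapes occupy U+DC80..U+DCFF, one code point per original byte.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def percent_escape(s):
    # Hypothetical helper: rewrite each escaped byte as %HH instead of
    # dropping it, so the information stays visible in the result.
    return _ESCAPED_BYTE.sub(
        lambda m: "%%%02X" % (ord(m.group()) - 0xDC00), s
    )


escaped = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
visible = percent_escape(escaped)
```

A literal "%" already present in the input is left untouched here, so the result is readable but not unambiguously reversible.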

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-23 Thread Nick Coghlan
Nick Coghlan added the comment: Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:
    # Constant in the string module (akin to string.ascii_letters et al)
    escaped_surrogates = bytes(range(128, 256)).decode('ascii',

[issue18814] Add tools for cleaning surrogate escaped strings

2014-08-23 Thread Nick Coghlan
Nick Coghlan added the comment: Note: I dropped the has_escaped_bytes idea, as s != string.clean(s) is functionally equivalent (albeit a bit more expensive if you don't actually want the cleaned string).

[issue18814] Add tools for cleaning surrogate escaped strings

2013-08-23 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: nosy: +ezio.melotti
Python tracker rep...@bugs.python.org http://bugs.python.org/issue18814

[issue18814] Add tools for cleaning surrogate escaped strings

2013-08-23 Thread Antoine Pitrou
Antoine Pitrou added the comment: Can you sum up the use cases for these? (don't want to read a blog post, sorry :-)) nosy: +pitrou

[issue18814] Add tools for cleaning surrogate escaped strings

2013-08-23 Thread STINNER Victor
STINNER Victor added the comment: The email package needs has_escaped_bytes. Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand. In which function do you need to check this? What do you do if there
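
A sketch of the detect-by-encoding idea discussed above. The comment mentions encoding to ASCII; this version uses strict UTF-8 instead, which is a substitution made here so that ordinary non-ASCII text is not flagged. Both function names are illustrative, and the regex variant is an assumed alternative rather than what the email package does.

```python
import re


def has_surrogates(s):
    # Strict UTF-8 encoding fails only when the string contains surrogates,
    # so a failed encode signals their presence.
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# Narrower check: only the range that surrogateescape can produce.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def has_escaped_bytes(s):
    return _ESCAPED_BYTE.search(s) is not None
```

The two checks differ on lone surrogates outside U+DC80..U+DCFF: has_surrogates reports them, has_escaped_bytes does not.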

[issue18814] Add tools for cleaning surrogate escaped strings

2013-08-23 Thread R. David Murray
R. David Murray added the comment: The email package uses surrogateescape to store unknown bytes in unicode strings, just as with the handle-bad-data-from-os API surrogateescape was introduced for. (For the same reason: the source data may have improperly encoded bytes that we must

[issue18814] Add tools for cleaning surrogate escaped strings

2013-08-23 Thread Nick Coghlan
Nick Coghlan added the comment: The use case is to take data from a surrogate escaped interface and either filter it out entirely or convert it to a valid Unicode string at the point of *input*, before letting it make its way into the rest of the application. For example, this approach permits

[issue18814] Add tools for cleaning surrogate escaped strings

2013-08-22 Thread Nick Coghlan
New submission from Nick Coghlan: Prompted by issue 18713 and http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/, here are some possible utilities we could add to the codecs module to help deal with/debug issues related to surrogate escaped strings: def has_escaped_bytes(s):

[issue18814] Add tools for cleaning surrogate escaped strings

2013-08-22 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com: nosy: +Arfrever