[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-05-09 Thread Stephen J. Turnbull
Stephen J. Turnbull added the comment: Please do not add the rehandle functions to codecs. They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation. I suggest a validation submodule of the

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-05-09 Thread Nick Coghlan
Nick Coghlan added the comment: surrogateescape and surrogatepass data *already* can't be inverted back to bytes reliably without knowing the original encoding - if you encode them as something else when they contain surrogates, you'll either get an exception (the default) or mojibake (if you

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Nick Coghlan
Nick Coghlan added the comment: Oh, and yes, I agree a python-dev discussion would be a good idea. From my perspective, rehandle_surrogateescape is the key function for making it easier to check for malformed input data from operating system interfaces. The other items I don't personally have
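A minimal sketch of the kind of malformed-input check mentioned above. It relies only on the documented behaviour of the surrogateescape error handler, which maps each undecodable byte 0x80-0xFF to a code point in U+DC80-U+DCFF. The helper name has_escaped_bytes is hypothetical and is not part of any patch on this issue.

```python
def has_escaped_bytes(text):
    """Return True if text contains code points produced by surrogateescape."""
    # surrogateescape smuggles byte 0xNN (0x80-0xFF) through as U+DCNN.
    return any('\udc80' <= ch <= '\udcff' for ch in text)


# b'\xe9' is not valid UTF-8 here, so it is smuggled through as U+DCE9.
suspect = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')
clean = b'caf\xc3\xa9.txt'.decode('utf-8', 'surrogateescape')

assert has_escaped_bytes(suspect)
assert not has_escaped_bytes(clean)
```

A check like this only detects the smuggled bytes; deciding what to replace them with is the job of the rehandle/convert function under discussion.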

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I uploaded the patch just before your comment, Nick. Here is the updated patch. The functions are renamed as Nick suggested, and two more functions are added: decompose_astrals() and compose_surrogate_pairs(). They are mainly here as examples; they can be committed in other
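The snippet above names decompose_astrals() and compose_surrogate_pairs() without showing them. The sketch below is a guess at what such helpers could do, based only on their names: split non-BMP ("astral") characters into UTF-16 style surrogate pairs, and join well-formed pairs back together. It is an assumption for illustration, not the code from the attached patch.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_PAIR = re.compile(r'[\ud800-\udbff][\udc00-\udfff]')


def decompose_astrals(text):
    """Replace each non-BMP character with a high/low surrogate pair."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            out.append(chr(0xD800 + (cp >> 10)))    # high surrogate
            out.append(chr(0xDC00 + (cp & 0x3FF)))  # low surrogate
        else:
            out.append(ch)
    return ''.join(out)


def compose_surrogate_pairs(text):
    """Join each well-formed surrogate pair into one non-BMP character."""
    def join(match):
        hi, lo = match.group()
        return chr(0x10000 + ((ord(hi) - 0xD800) << 10) + (ord(lo) - 0xDC00))
    return _PAIR.sub(join, text)


assert decompose_astrals('a\U0001F600') == 'a\ud83d\ude00'
assert compose_surrogate_pairs('a\ud83d\ude00') == 'a\U0001F600'
```

Lone surrogates are left untouched by both helpers in this sketch, so they remain available for a separate error-handler pass.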

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file38520/codecs_convert_escapes_2.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue18814 ___

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Nick Coghlan
Nick Coghlan added the comment: I'd wondered about that with respect to rehandle_surrogatepass. The current implementation looks like it processes *all* surrogates (even valid surrogate pairs), so handle_surrogates might be a suitable name. If the intent is for it to be

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Note that the provided Python implementations are more of a proof of concept. After discussion I'll provide more efficient C implementations, which should be 1-2 orders of magnitude faster (and infinitely fast for the common case of ASCII strings).

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-16 Thread Nick Coghlan
Nick Coghlan added the comment: (Serhiy, did you miss uploading the new patch?) Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-16 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The proposed preliminary patch adds three functions to the codecs module: convert_surrogates(data, errors) -- handle lone surrogates with the specified error handler.
    >>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
    'a€\\udca4'
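codecs.convert_surrogates() is a proposal and does not exist in the standard library. As a rough approximation using only existing codec calls, the same result can be obtained by round-tripping through UTF-8, which can encode every code point except surrogates, so the error handler only ever fires on them. The function name convert_via_utf8 is hypothetical.

```python
def convert_via_utf8(data, errors='replace'):
    """Approximate the proposed convert_surrogates() with existing codecs.

    The UTF-8 encoder rejects surrogate code points, so `errors` is applied
    to exactly those characters; everything else round-trips unchanged.
    Only handlers that work on encoding errors can be used here.
    """
    return data.encode('utf-8', errors).decode('utf-8')


assert convert_via_utf8('a\u20ac\udca4', 'backslashreplace') == 'a\u20ac\\udca4'
assert convert_via_utf8('a\u20ac\udca4', 'ignore') == 'a\u20ac'
```

Note that with 'replace' the encoder substitutes '?' rather than U+FFFD, which is one reason a dedicated function might still be preferable.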

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-16 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: -- keywords: +patch Added file: http://bugs.python.org/file38506/codecs_convert_escapes.patch

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-15 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: -- dependencies: +Add support of UnicodeTranslateError in standard error handlers

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Updated issue title to reflect current proposal. -- title: Add tools for cleaning surrogate escaped strings -> Add codecs.convert_surrogateescape to clean surrogate escaped strings

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Don't like the function name :-) How about codecs.filter_non_utf8_data(), since that's closer to what the function is really doing and doesn't require knowledge about what surrogateescape is. -- nosy: +lemburg

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: The error handler is called surrogateescape. That means convert_surrogateescape is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the user

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: The function definition again, this time with a draft docstring: def convert_surrogateescape(data, errors='replace'): Convert escaped raw bytes by applying a different error handler Uses the replace error handler by default, but any input

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Note I would also be OK with convert_surrogates, as that's the term that appears in the relevant error message:
    >>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8'

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Antoine Pitrou
Antoine Pitrou added the comment: On 23/09/2014 12:57, Nick Coghlan wrote: The function definition again, this time with a draft docstring: def convert_surrogateescape(data, errors='replace'): Convert escaped raw bytes by applying a different error handler Uses the

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Draft docstring for that version def convert_surrogates(data, errors='replace'): Convert escaped surrogates by applying a different error handler Uses the replace error handler by default, but any input error handler may be specified.

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Antoine: what would be the use case for using a different encoding for the temporary bytes object? It's discarded anyway, so the encoding used isn't externally visible.

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Antoine Pitrou
Antoine Pitrou added the comment: The encoding used impacts the result:
    >>> s = 'abc\udcc3\udca9'
    >>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace')
    'abc��'
    >>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
    'abcé'
The original string ('abc\udcc3\udca9') was obtained
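The point above can be made concrete with a small wrapper around the encode/decode round trip. This is a sketch only: the function name convert_with_encoding and its encoding parameter are hypothetical and are not taken from any patch on this issue.

```python
def convert_with_encoding(data, encoding, errors='replace'):
    """Recover the smuggled bytes, then re-decode them with `errors`.

    `encoding` is used for both steps, so it decides whether the
    recovered bytes count as valid text or as errors.
    """
    raw = data.encode(encoding, 'surrogateescape')
    return raw.decode(encoding, errors)


s = 'abc\udcc3\udca9'  # holds the smuggled bytes 0xC3 0xA9

# As ASCII, both bytes are invalid and each becomes U+FFFD.
assert convert_with_encoding(s, 'ascii') == 'abc\ufffd\ufffd'
# As UTF-8, the same two bytes form a valid sequence for U+00E9.
assert convert_with_encoding(s, 'utf-8') == 'abc\u00e9'
```

The same input therefore yields different text depending on the encoding chosen for the intermediate bytes, which is why that choice is externally visible.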

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 23.09.2014 13:12, Nick Coghlan wrote: Nick Coghlan added the comment: Draft docstring for that version def convert_surrogates(data, errors='replace'): Convert escaped surrogates by applying a different error handler Uses

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread R. David Murray
R. David Murray added the comment: And indeed my use case for this has instances of both cases: originally decoded using ASCII and the non-ascii bytes must end up as replaced characters, and originally decoded using utf-8. I'm also not sure that it is worth adding this. If you know what you

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread R. David Murray
R. David Murray added the comment: Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio. So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module.
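A sketch of what a filesystem-encoding-based variant might look like, using only os.fsencode() and sys.getfilesystemencoding(), both of which exist today. The function name clean_filename is hypothetical; nothing with that name is proposed for the os module in this thread.

```python
import os
import sys


def clean_filename(name, errors='replace'):
    """Re-decode a filename so that smuggled bytes are handled by `errors`.

    os.fsencode() applies the filesystem encoding and its error handler
    (surrogateescape on POSIX), recovering the original bytes.
    """
    raw = os.fsencode(name)
    return raw.decode(sys.getfilesystemencoding(), errors)


# 'caf\udce9.txt' is what a name containing the raw byte 0xE9 typically
# looks like after decoding on a UTF-8 POSIX system.
cleaned = clean_filename('caf\udce9.txt')
assert not any('\ud800' <= ch <= '\udfff' for ch in cleaned)
```

The exact replacement text depends on the platform's filesystem encoding, so the example only checks that no surrogates remain.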

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Good catch, Antoine! Here is a sample of a more complicated implementation. -- title: Add a convert_surrogates function to clean surrogate escaped strings -> Add codecs.convert_surrogateescape to clean surrogate escaped strings Added file:

[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan
Nick Coghlan added the comment: Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding. --
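A rough illustration of the direct-replacement idea, written against existing APIs only (codecs.lookup_error and the UnicodeEncodeError constructor). It is not Serhiy's implementation, which is not shown in this listing; the function name replace_surrogates is hypothetical, and passing a UnicodeEncodeError to the handler is a choice made here because the standard handlers all accept it.

```python
import codecs
import re

# One or more consecutive surrogate code points.
_SURROGATES = re.compile(r'[\ud800-\udfff]+')


def replace_surrogates(data, errors='replace'):
    """Replace runs of surrogates with the output of an error handler.

    No bytes object is ever created, so no encoding has to be guessed.
    Handlers that return bytes (for example surrogateescape itself) are
    not supported by this sketch.
    """
    handler = codecs.lookup_error(errors)

    def fix(match):
        # The encoding name is only a label carried by the exception.
        exc = UnicodeEncodeError('utf-8', data, match.start(), match.end(),
                                 'surrogates not allowed')
        replacement, _newpos = handler(exc)
        return replacement

    return _SURROGATES.sub(fix, data)


assert replace_surrogates('abc\udcc3\udca9') == 'abc??'
assert replace_surrogates('abc\udcc3\udca9', 'ignore') == 'abc'
assert replace_surrogates('a\u20ac\udca4', 'backslashreplace') == 'a\u20ac\\udca4'
```

With errors='strict' the handler re-raises the exception, which turns the function into a validity check.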