Stephen J. Turnbull added the comment:
Please do not add the rehandle functions to codecs. They do not change the
(duck-typed) representation of data while maintaining the semantics, they
change the semantics of data while retaining the representation.
I suggest a validation submodule of the
Nick Coghlan added the comment:
surrogateescape and surrogatepass data *already* can't be inverted back to
bytes reliably without knowing the original encoding - if you encode them
as something else when they contain surrogates, you'll either get an
exception (the default) or mojibake (if you
Nick Coghlan added the comment:
Oh, and yes, I agree a python-dev discussion would be a good idea.
From my perspective, rehandle_surrogateescape is the key function for making
it easier to check for malformed input data from operating system interfaces.
The other items I don't personally have
Serhiy Storchaka added the comment:
I uploaded the patch just before your comment, Nick.
Here is an updated patch. The functions are renamed as Nick suggested, and two more
functions are added: decompose_astrals() and compose_surrogate_pairs(). They are mainly
here as examples; they can be committed in other
Changes by Serhiy Storchaka storch...@gmail.com:
Added file: http://bugs.python.org/file38520/codecs_convert_escapes_2.patch
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18814
___
Nick Coghlan added the comment:
I'd wondered about that with respect to rehandle_surrogatepass.
The current implementation looks like it processes *all* surrogates (even valid
surrogate pairs), so handle_surrogates might be a suitable name.
If the intent is for it to be
Serhiy Storchaka added the comment:
Note that the provided Python implementations are more of a proof of concept. After
discussion I'll provide more efficient C implementations, which should be 1-2
orders of magnitude faster (and infinitely fast for the common case of ASCII strings).
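One reason the ASCII case can be nearly free: a pure-ASCII string cannot contain any surrogate code point, so it can be returned untouched. The following is only a minimal sketch of that idea, assuming Python 3.7+ for `str.isascii()`; the function name is hypothetical and does not come from the patch, and the slow path simply stands in for whatever real processing would follow.

```python
def replace_surrogates_sketch(data):
    """Hypothetical helper: replace every surrogate code point with U+FFFD."""
    # Fast path: ASCII-only text cannot contain surrogates (U+D800-U+DFFF),
    # so there is nothing to do.
    if data.isascii():
        return data
    # Slow path: examine each code point individually.
    return ''.join(
        '\ufffd' if '\ud800' <= ch <= '\udfff' else ch
        for ch in data
    )
```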
--
Nick Coghlan added the comment:
(Serhiy, did you miss uploading the new patch?)
Regarding the names, we may need to think about the use cases a bit more
explicitly to clarify that in terms of the Python codecs API rather than
expecting folks to understand the underlying representation. In the
Serhiy Storchaka added the comment:
The proposed preliminary patch adds three functions to the codecs module:
convert_surrogates(data, errors) -- handle lone surrogates with the specified
error handler.
>>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
'a€\\udca4'
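For comparison, a similar result can be approximated today with the existing str/bytes API, because the UTF-8 encoder passes each lone surrogate to the chosen error handler. This is a rough sketch and not the patch's implementation; the function name is made up for illustration.

```python
def convert_via_utf8_sketch(data, errors):
    """Hypothetical stand-in: apply an error handler to lone surrogates."""
    # Lone surrogates are not encodable as UTF-8, so the encoder hands
    # each of them to the error handler named by `errors`.
    raw = data.encode('utf-8', errors)
    # Everything else round-trips through UTF-8 unchanged.
    return raw.decode('utf-8')
```

This only works for handlers whose output is itself valid UTF-8 (for example 'backslashreplace', 'replace' or 'ignore'); with 'surrogateescape' the final decode step would fail on the raw bytes it produces.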
Changes by Serhiy Storchaka storch...@gmail.com:
--
keywords: +patch
Added file: http://bugs.python.org/file38506/codecs_convert_escapes.patch
Changes by Serhiy Storchaka storch...@gmail.com:
--
dependencies: +Add support of UnicodeTranslateError in standard error handlers
Nick Coghlan added the comment:
Updated issue title to reflect current proposal.
--
title: Add tools for cleaning surrogate escaped strings -> Add
codecs.convert_surrogateescape to clean surrogate escaped strings
Marc-Andre Lemburg added the comment:
Don't like the function name :-)
How about codecs.filter_non_utf8_data(), since that's closer
to what the function is really doing and doesn't require
knowledge about what surrogateescape is.
--
nosy: +lemburg
Nick Coghlan added the comment:
The error handler is called surrogateescape. That means
convert_surrogateescape is always only a single step away from thinking "I
want to remove the smuggled bytes from a surrogateescape'd string", without
needing to assume any knowledge on the part of the user
Nick Coghlan added the comment:
The function definition again, this time with a draft docstring:
def convert_surrogateescape(data, errors='replace'):
    """Convert escaped raw bytes by applying a different error handler

    Uses the replace error handler by default, but any input
Nick Coghlan added the comment:
Note I would also be OK with convert_surrogates, as that's the term that
appears in the relevant error message:
>>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8'
Antoine Pitrou added the comment:
On 23/09/2014 12:57, Nick Coghlan wrote:
> The function definition again, this time with a draft docstring:
> def convert_surrogateescape(data, errors='replace'):
>     """Convert escaped raw bytes by applying a different error handler
>     Uses the
Nick Coghlan added the comment:
Draft docstring for that version
def convert_surrogates(data, errors='replace'):
    """Convert escaped surrogates by applying a different error handler

    Uses the replace error handler by default, but any input
    error handler may be specified.
Nick Coghlan added the comment:
Antoine: what would be the use case for using a different encoding for the
temporary bytes object? It's discarded anyway, so the encoding used isn't
externally visible.
--
Antoine Pitrou added the comment:
The encoding used impacts the result:
>>> s = 'abc\udcc3\udca9'
>>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace')
'abc��'
>>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
'abcé'
The original string ('abc\udcc3\udca9') was obtained
Marc-Andre Lemburg added the comment:
On 23.09.2014 13:12, Nick Coghlan wrote:
> Nick Coghlan added the comment:
> Draft docstring for that version
> def convert_surrogates(data, errors='replace'):
>     """Convert escaped surrogates by applying a different error handler
>     Uses
R. David Murray added the comment:
And indeed my use case for this has instances of both cases: originally decoded
using ASCII and the non-ascii bytes must end up as replaced characters, and
originally decoded using utf-8.
I'm also not sure that it is worth adding this. If you know what you
R. David Murray added the comment:
Oh, wait, I forgot that the context for this was dealing with unix filenames
and/or stdio. So, a function that just uses the fsencoding to do the replace
might indeed be appropriate, but in that case should probably live in the os
module.
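A function along those lines might look roughly like the sketch below. It is only an illustration built on `os.fsencode()` and `sys.getfilesystemencoding()`; the function name is hypothetical and nothing like it is claimed to exist in the os module.

```python
import os
import sys


def clean_fs_string_sketch(name, errors='replace'):
    """Hypothetical helper: re-decode an OS-provided string with `errors`."""
    # os.fsencode() recovers the original bytes, using the filesystem
    # encoding and its error handler (surrogateescape on POSIX).
    raw = os.fsencode(name)
    # Decode again with the same encoding but a different error handler,
    # so undecodable bytes no longer survive as lone surrogates.
    return raw.decode(sys.getfilesystemencoding(), errors)
```

Because the filesystem encoding is fixed for the process, the caller does not have to say which encoding the string was originally decoded with.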
Serhiy Storchaka added the comment:
Good catch, Antoine!
Here is a sample of a more complicated implementation.
--
title: Add a convert_surrogates function to clean surrogate escaped strings
-> Add codecs.convert_surrogateescape to clean surrogate escaped strings
Added file:
Nick Coghlan added the comment:
Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even
better idea - replacing the lone surrogates directly with the output of the
alternative error handler avoids any need to worry about the original encoding.
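The direct-replacement idea can be sketched in pure Python as follows. This is not Serhiy's patch: the function name, the regular expression and the choice to hand the handler a UnicodeEncodeError are all assumptions made for illustration.

```python
import codecs
import re

# Code points that surrogateescape uses to smuggle the bytes 0x80-0xFF.
_ESCAPED = re.compile('[\udc80-\udcff]+')


def convert_surrogateescape_sketch(data, errors='replace'):
    """Hypothetical: replace escaped bytes with another handler's output."""
    handler = codecs.lookup_error(errors)

    def substitute(match):
        # Describe the run of lone surrogates as an encoding error and let
        # the chosen handler decide what should replace it.
        exc = UnicodeEncodeError('utf-8', data, match.start(), match.end(),
                                 'surrogates not allowed')
        replacement, _resume_at = handler(exc)
        if not isinstance(replacement, str):
            raise TypeError('this sketch needs a handler that returns str')
        return replacement

    # No bytes object is ever created, so no encoding has to be guessed.
    return _ESCAPED.sub(substitute, data)
```

With the standard handlers, 'replace' yields one '?' per escaped byte when given an encode error, 'backslashreplace' yields the escaped form such as \udca4, 'ignore' drops the escaped bytes, and 'strict' raises the error.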
--