Running the script I attached earlier on Edgy and a new machine, I still
get a segfault with the iso-2022-jp codec, with both Python 2.4 and 2.5.
The version strings printed are:
Python 2.4.4c0 (#2, Sep 29 2006, 20:19:45)
[GCC 4.1.2 20060920 (prerelease) (Ubuntu 4.1.1-13ubuntu3)] on linux2
and
Python 2.5 (r25:51908, Oct 6 2006, 15:22:41)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu4)] on linux2
I'm not so sure anymore that "x.encode(e).decode(e) == x roundtrip or
failure" behavior is or should be guaranteed, as a best-effort solution
may reasonably substitute characters for others (e.g. dropping accent
marks when converting to ASCII). This isn't what Python usually does,
though: e.g. u'á'.encode('ascii', 'ignore') results in '', not 'a'. I
don't know if this is specified in detail anywhere, but the pydocs for
the codecs module say this about errors:
"Possible values for errors are 'strict' (raise an exception in case of
an encoding error), 'replace' (replace malformed data with a suitable
replacement marker, such as "?"), 'ignore' (ignore malformed data and
continue without further notice), 'xmlcharrefreplace' (replace with the
appropriate XML character reference (for encoding only)) and
'backslashreplace' (replace with backslashed escape sequences (for
encoding only)) as well as any other error handling name defined via
register_error()."
I took this to mean that what can't survive a round-trip is "malformed
data" as far as the codec is concerned and what is done to it is
specified by the errors argument, and that the standard codecs would
stick to the suggested behavior. Looking at it now, it does leave much
room for interpretation. The "suitable replacement marker" part could
even be construed as a blanket permission to produce anything that
doesn't cause an error when decoding (akin to old gcc starting nethack
when encountering #pragma), and "malformed data" isn't defined either.
I'm not sure where between these extremes the intended meaning lies, so
I guess someone else has to decide whether it's a bug or not.
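To make the quoted handler names concrete, here is a minimal sketch of what
each one does to u'á' when encoding to ASCII. It is written in Python 3
syntax (an assumption on my part; the versions in this report are 2.4 and
2.5), and it only illustrates the ascii codec, not the CJK codecs.

```python
# Python 3 syntax; the report itself concerns Python 2.4/2.5.
text = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE, not representable in ASCII

# Collect what each non-strict error handler produces for the same input.
results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = text.encode("ascii", handler)

# 'strict' (the default) raises instead of producing output.
try:
    text.encode("ascii", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

for handler, encoded in results.items():
    print(handler, encoded)
print("strict raised:", strict_raised)
```

None of these handlers yields a plain 'a', which matches the observation
that the ascii codec never substitutes a "close enough" character.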
The original reason for filing a bug report was that I wanted the set of
Unicode characters that can survive a round-trip to certain encodings
with Python codecs, wrote a simple script to find those and was
surprised to get a segfault. It was easy to work around, but still:
while you can get segfaults in Python if you muck carelessly with, say,
ctypes, I did not expect to get one purely with stock string
manipulation methods. Regardless of the other parts of this report, I
think the segfault at least is a bug.
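For reference, the kind of round-trip survey described above can be
sketched roughly as follows. This is not the attached script, only a
hypothetical reconstruction in Python 3 syntax; the function name
roundtrip_survivors and the choice of code point range are mine.

```python
def roundtrip_survivors(encoding, code_points):
    """Return the code points whose character survives encode + decode."""
    survivors = []
    for cp in code_points:
        ch = chr(cp)
        try:
            # Strict error handling: anything unrepresentable raises.
            if ch.encode(encoding).decode(encoding) == ch:
                survivors.append(cp)
        except (UnicodeEncodeError, UnicodeDecodeError):
            # "Failure" half of the "round-trip or failure" expectation.
            pass
    return survivors


# Example run over the first 256 code points with two simple codecs.
ascii_ok = roundtrip_survivors("ascii", range(0x100))
latin1_ok = roundtrip_survivors("latin-1", range(0x100))
print(len(ascii_ok), len(latin1_ok))
```

Passing "iso-2022-jp" as the encoding exercises the codec this report is
about; the expectation is that every code point either survives or raises,
never that the interpreter crashes.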
Thinking about it, I probably should've reported this to the Python
project directly, but I already had an account here, didn't have an
official (non-Ubuntu) version of Python, and thought that "this is an
obvious bug", which in retrospect it may not be.
--
CJK codec bugs (including segfault)
https://launchpad.net/bugs/29289