[issue22746] cgitb html: wrong encoding for utf-8
Serhiy Storchaka added the comment:

We can convert cgitb.hook to produce ASCII-compatible output with charrefs in 3.x. But there is a problem with str in 2.7: an 8-bit string can contain non-ASCII data, and its encoding is not known in the general case.

-- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22746 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22746] cgitb html: wrong encoding for utf-8
Ezio Melotti added the comment:

> In normal HTML utf-8 works fine, doesn't it?

It does. In fact, as long as the encoding used by the browser matches the one used in the file, no charrefs need to be used (except &gt;, &lt;, and &quot;).
Of course, if non-Unicode encodings are used, the range of characters that can go directly in the HTML will be more limited, but this can be solved by using charrefs -- the browser will display the corresponding character no matter what the encoding is.
This also means that if charrefs are used for all non-ASCII characters, the browser will be able to display the page no matter what encoding is being used (as long as it's ASCII-compatible, and most encodings are). The downside is that it makes the source less readable and possibly longer, especially if there are a lot of non-ASCII characters, but if most of the characters are expected to be ASCII, using charrefs might be OK.
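The point about charrefs being independent of the page encoding can be checked with a minimal sketch. The sample string is made up for illustration, and html.unescape is used here only as a stand-in for what a browser does when it resolves character references; none of this is taken from cgitb itself.

```python
import html

# Hypothetical sample text: "Grüße 与 naïve", written with escapes.
text = "Gr\u00fc\u00dfe \u4e0e na\u00efve"

# Replace every non-ASCII character with a numeric character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# The result is pure ASCII, so every ASCII-compatible encoding decodes it
# to the same string.
decoded = {enc: ascii_bytes.decode(enc) for enc in ("utf-8", "latin-1", "cp1252")}

# Resolve the charrefs again, roughly as a browser would when rendering.
restored = html.unescape(ascii_bytes.decode("ascii"))

print(ascii_bytes)
print(restored == text)
```

The cost mentioned above is visible in the output: each non-ASCII character becomes a multi-character sequence such as `&#252;`, so the source is longer and harder to read.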
[issue22746] cgitb html: wrong encoding for utf-8
Amaury Forgeot d'Arc added the comment:

What about open(..., encoding='latin-1', errors='xmlcharrefreplace')?

--
nosy: +amaury.forgeotdarc
stage: resolved -> needs patch
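A minimal sketch of this suggestion in Python 3 follows. The file name and the HTML fragment are hypothetical, and the sketch writes a plain string rather than real cgitb output. Characters that exist in Latin-1 are written as single bytes, while everything else falls back to a numeric character reference.

```python
import os
import tempfile

# Hypothetical HTML fragment: "ä" is in Latin-1, "与" is not.
report = "<p>\u00e4 and \u4e0e</p>"

# Hypothetical output path in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "traceback.html")

# Characters that cannot be encoded as Latin-1 are replaced with charrefs
# by the error handler instead of raising UnicodeEncodeError.
with open(path, "w", encoding="latin-1", errors="xmlcharrefreplace") as f:
    f.write(report)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```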
[issue22746] cgitb html: wrong encoding for utf-8
Wolfgang Rohdewald added the comment:

> What about open(..., encoding='latin-1', errors='xmlcharrefreplace')

That works fine. I tested with a Chinese character, 与.

But I do not think the application should work around something that cgitb is supposed to handle, all the more so since the documentation is dead silent about this. You need to use codecs.open instead of open and add those keyword arguments.

As long as this is not explained in the documentation, I guess it is a bug for everyone not using latin-1.
[issue22746] cgitb html: wrong encoding for utf-8
Wolfgang Rohdewald added the comment:

Correction: a bug for everyone using non-ASCII characters.
[issue22746] cgitb html: wrong encoding for utf-8
Changes by Serhiy Storchaka storch...@gmail.com:

--
nosy: +ezio.melotti, serhiy.storchaka
[issue22746] cgitb html: wrong encoding for utf-8
Amaury Forgeot d'Arc added the comment:

> You need to use codecs.open instead of open

No, why? In Python 3, open() supports the errors handler.
[issue22746] cgitb html: wrong encoding for utf-8
Changes by STINNER Victor victor.stin...@gmail.com:

--
components: +Unicode
nosy: +haypo
[issue22746] cgitb html: wrong encoding for utf-8
R. David Murray added the comment:

In normal HTML utf-8 works fine, doesn't it? It's only when reading from a file (where the browser doesn't know the encoding) that it fails.

Do you have a use case for xmlcharrefreplace in the HTML context (which is what cgitb is primarily targeted at)? Some place where the web page can't be declared as utf-8, perhaps?

I suppose it might be a not-unreasonable enhancement request to have a parameter to Hook that says "do xmlcharrefreplace", but since the workaround is actually simpler than that, I don't know whether it is worthwhile. Or do people feel that doing the replacement all the time (it's only in tracebacks, after all) would be the right thing to do?

--
resolution: remind ->
versions: +Python 3.4, Python 3.5
[issue22746] cgitb html: wrong encoding for utf-8
Wolfgang Rohdewald added the comment:

>> You need to use codecs.open instead of open
> No, why? in python3 open() supports the errors handler.

Right, but not in Python 2, which has the same problem. I need my code to run with both.

> Do you have a use case for xmlcharrefreplace in the HTML context?

No, my only use case is the local file.
[issue22746] cgitb html: wrong encoding for utf-8
New submission from Wolfgang Rohdewald:

With the attached script, non-ASCII characters are displayed incorrectly wherever they occur, including the exception message and the comment in the source code.

Looking at the produced .html, I can say that cgitb simply passes the individual UTF-8 bytes through without encoding them as needed.

The same happens with Python 3.4 (after applying some quick and dirty changes to cgitb.py, see bug #22745).

--
components: Library (Lib)
files: cgibug.py
messages: 230085
nosy: wrohdewald
priority: normal
severity: normal
status: open
title: cgitb html: wrong encoding for utf-8
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file37044/cgibug.py
[issue22746] cgitb html: wrong encoding for utf-8
R. David Murray added the comment:

If you look at the file, you'll find that the data is in utf-8 (at least if your locale is a utf-8 locale). However, HTML is by default interpreted as latin-1, so that's what the web browser displays when you pass it the file on disk.

If you add encoding='latin-1' to your open call, your script will work. What you do if you need to display non-latin1 characters, I don't know. (See https://bugzil.la/760050, for example.)

Note: the above is for Python 3. I don't remember how you do the equivalent in Python 2; a naive codecs.open call just got me a UnicodeDecodeError.

--
nosy: +r.david.murray
resolution: -> not a bug
stage: -> resolved
status: open -> closed
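The symptom described here, UTF-8 bytes being read as Latin-1, can be reproduced without a browser. The following is a minimal sketch with a single made-up character; it only imitates the misreading and makes no claim about how any particular browser picks its default encoding.

```python
# "ä" (U+00E4), written with an escape.
original = "\u00e4"

# UTF-8 stores this character as two bytes: 0xC3 0xA4.
utf8_bytes = original.encode("utf-8")

# A reader that assumes Latin-1 turns each byte into its own character.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)
print(len(original), len(misread))
```

One character in the source turns into two on screen, which matches the report that every non-ASCII character in the traceback is shown wrong.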
[issue22746] cgitb html: wrong encoding for utf-8
Wolfgang Rohdewald added the comment:

If you cannot offer a solution for arbitrary Unicode, you have no solution at all. After all, that is what Unicode is about: supporting ALL languages, not only your own. I do not quite understand why you think this is not a bug.

If cgitb encodes Unicode like & # x e 4 ; (remove spaces), the browser does not have to guess the encoding; it will always show the correct character. This works for all of Unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references

So this bug is fixable, and I am reopening it.

For Python 3, the fix is actually very simple: do not write doc but str(doc.encode('ascii', 'xmlcharrefreplace')), as in the attached patch. This patch works for me, but there might be code paths it does not yet cover. Also, my source file is encoded in utf-8; other source file encodings should be tested too. I do not know whether cgitb correctly honors a source file header like
# -*- coding: utf-8 -*-

Fixing this for Python 2 is certainly doable too, but perhaps more difficult, because a Python 2 str() may have an unknown encoding.

--
keywords: +patch
resolution: not a bug ->
status: closed -> open
Added file: http://bugs.python.org/file37047/22746.patch
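A minimal sketch of the proposed Python 3 approach, under stated assumptions: doc below is a hypothetical stand-in for the HTML string cgitb builds, and the attached patch itself is not reproduced here. One detail matters when trying this out: in Python 3, str() called on a bytes object without an encoding returns its repr, including the b'...' wrapper, so this sketch decodes the bytes as ASCII instead.

```python
# Hypothetical stand-in for the HTML text produced for a traceback.
doc = "<p>ValueError: ung\u00fcltig \u4e0e</p>"

# Replace all non-ASCII characters with numeric character references.
encoded = doc.encode("ascii", "xmlcharrefreplace")

# str() on bytes without an encoding yields the repr, e.g. "b'<p>...'".
as_repr = str(encoded)

# Decoding as ASCII yields text that is safe to write to a text-mode file.
as_text = encoded.decode("ascii")

print(as_repr)
print(as_text)
```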
[issue22746] cgitb html: wrong encoding for utf-8
Changes by Wolfgang Rohdewald wolfg...@rohdewald.de:

--
resolution: -> remind