Manuzhai wrote:
I think we should always know, or make clear, what the encoding is.
If we really don't know, then we should not assume UTF-8 (which
may or may not work, who knows), but rather use some "catch-all"
conversion (namely, trac.util.to_unicode).
One option would be the Universal Encoding Detector, but that may be
too heavy a dependency for Trac. It's pretty much guaranteed to get
the encoding right, though.
http://chardet.feedparser.org/download/
There was some discussion a while ago between me and mgood
(see #2105) about adding an extension point for mime-type detection.
Such an interface could also provide the charset information,
which means you could write a plugin wrapping the U.E.D.
{{{
class IMimeTypeDetector(Interface):

    def detect(filename, detail, content):
        """Try to infer the mimetype from the filename or the file content.

        Return a `(mimetype, charset)` pair corresponding to the
        autodetected mime-type and charset.
        Either of `mimetype` or `charset` could be `None`.
        """
}}}
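To make the idea concrete, here is a hypothetical sketch of a detector
implementing that interface. The `Interface` placeholder stands in for
`trac.core.Interface`, and the BOM table is purely my own illustration,
not part of the proposal (a real plugin would more likely wrap the
Universal Encoding Detector):

{{{
# Hypothetical sketch of a plugin implementing the proposed interface.
# `Interface` is a stand-in for trac.core.Interface; the BOM-sniffing
# logic is only an illustration.

class Interface(object):
    """Placeholder for trac.core.Interface."""


BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]


class BomCharsetDetector(Interface):

    def detect(self, filename, detail, content):
        """Guess the charset from a byte-order mark, if present.

        Returns a `(mimetype, charset)` pair; either may be `None`.
        """
        for bom, charset in BOMS:
            if content.startswith(bom):
                return (None, charset)
        return (None, None)
}}}

A detector like this would return `(None, None)` for most input, leaving
other detectors (or the catch-all conversion) to handle the rest.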
We could eventually add a "degree of confidence" to the result,
in order to be able to rank results from different detectors,
but simply returning `None` when unsure would probably be enough.
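The ranking idea could look something like this sketch (entirely my own
illustration: detectors here return `(mimetype, charset, confidence)`
triples, which is a hypothetical extension of the interface above):

{{{
# Sketch of ranking results from several detectors by a "degree of
# confidence". The triple-returning detectors are a hypothetical
# extension of the proposed interface, not actual Trac API.

def best_detection(detectors, filename, detail, content):
    """Run each detector and keep the highest-confidence result."""
    best = (None, None)
    best_confidence = 0.0
    for detect in detectors:
        mimetype, charset, confidence = detect(filename, detail, content)
        if confidence > best_confidence:
            best = (mimetype, charset)
            best_confidence = confidence
    return best
}}}

With plain `None` results instead of confidences, the equivalent would
simply be "first detector that returns something wins".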
Of course, that would not be applied to every text used in the system,
only to file content. The above `to_unicode` is also used for that,
so I think I'll rename it `data_to_unicode` (which preserves content),
to contrast it with `text_to_unicode` (which might be "lossy").
An alternative to having two `*_to_unicode` variants would be to add
a third optional argument: `to_unicode(text, charset=None, lossy=False)`.
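A rough sketch of what that combined signature might behave like (the
fallback chain here is my own guess at the intended behaviour, not
Trac's actual implementation, and it's written in modern Python):

{{{
def to_unicode(text, charset=None, lossy=False):
    """Decode `text` to unicode.

    Try the given `charset` first, then UTF-8; if both fail, either
    give up (when not `lossy`) or fall back to a "catch-all" decode
    that never raises but may mangle characters. This fallback chain
    is only a guess at the intended behaviour.
    """
    if isinstance(text, str):
        return text
    for cs in filter(None, [charset, 'utf-8']):
        try:
            return text.decode(cs)
        except (LookupError, UnicodeDecodeError):
            pass
    if lossy:
        # iso-8859-1 maps every byte to a character, so this cannot fail.
        return text.decode('iso-8859-1', 'replace')
    # Content-preserving mode: let the failure propagate.
    return text.decode(charset or 'utf-8')
}}}

So `lossy=True` would give the old catch-all behaviour for display text,
while the default would refuse to silently corrupt file content.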
-- Christian
PS: Hm, I just realized I begin to wiki format my e-mails... Damn :)
_______________________________________________
Trac-dev mailing list
[email protected]
http://lists.edgewall.com/mailman/listinfo/trac-dev