Manuzhai wrote:
I think we should always know, or make clear, what the encoding is.
If we really don't know, then we should not assume UTF-8 (which
may or may not work, who knows), but rather use some "catch all"
conversion (namely trac.util.to_unicode).
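To make the idea concrete, here is a minimal sketch (in modern Python 3 terms) of what such a "catch all" conversion might look like. This is a hypothetical illustration, not the actual trac.util.to_unicode implementation:

{{{
def to_unicode(data, charset=None):
    """Decode a byte string to unicode without ever raising.

    Hypothetical "catch all" sketch: try the suggested charset,
    then UTF-8, then fall back to Latin-1, which maps every
    possible byte to some code point.
    """
    if isinstance(data, str):
        return data
    for encoding in filter(None, (charset, 'utf-8')):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Latin-1 never fails, so nothing is ever lost or raised
    return data.decode('iso-8859-1')
}}}

The point of the Latin-1 fallback is that it is total: the conversion always succeeds, at the cost of possibly misinterpreting the text.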

One option would be the Universal Encoding Detector, though that may be
too heavy a dependency for Trac. It is pretty much guaranteed to get it
right, though.

http://chardet.feedparser.org/download/
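For reference, using the detector (distributed as the `chardet` package) is a two-step affair: ask it for a guess, then decode with that guess. A small sketch, assuming `chardet` is installed:

{{{
import chardet  # the Universal Encoding Detector

raw = b'The quick brown fox jumps over the lazy dog'
guess = chardet.detect(raw)
# guess is a dict with 'encoding' and 'confidence' keys,
# e.g. {'encoding': 'ascii', 'confidence': 1.0, ...}
text = raw.decode(guess['encoding'] or 'utf-8')
}}}

Note that `detect()` can return an encoding of `None` for input it cannot classify, hence the fallback in the last line.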

There was some discussion a while ago between mgood and me
(see #2105) about adding an extension point for mime-type detection.
Such an interface could also provide the charset information,
which means you could write a plugin wrapping the Universal Encoding Detector.

{{{
class IMimeTypeDetector(Interface):
    def detect(filename, detail, content):
        """Try to infer the mimetype from the filename or the file content.

        Return a `(mimetype, charset)` pair corresponding to the
        autodetected mime-type and charset.
        Either `mimetype` or `charset` may be `None`.
        """
}}}
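A trivial implementation of that interface could lean on the standard library for the mimetype and a simple heuristic for the charset. A standalone sketch (the real thing would be a Trac Component, and the class name here is made up):

{{{
import mimetypes

class SimpleMimeTypeDetector(object):
    """Hypothetical detector following the IMimeTypeDetector shape."""

    def detect(self, filename, detail, content):
        # guess the mimetype from the filename extension
        mimetype, _ = mimetypes.guess_type(filename)
        charset = None
        if content:
            if content.startswith(b'\xef\xbb\xbf'):
                # a UTF-8 BOM is a strong charset hint
                charset = 'utf-8'
            else:
                try:
                    content.decode('ascii')
                    charset = 'ascii'
                except UnicodeDecodeError:
                    pass  # unsure: leave charset as None
        return (mimetype, charset)
}}}

A chardet-backed plugin would simply replace the heuristic with a `chardet.detect()` call.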

We could eventually add a "degree of confidence" to the result,
in order to rank the results from different detectors,
but simply returning `None` when unsure would probably be enough.
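If we did go the confidence route, ranking would amount to picking the highest-scoring answer. A sketch, assuming detectors returned hypothetical `(mimetype, charset, confidence)` triples rather than the pairs above:

{{{
def best_detection(results):
    """Pick the highest-confidence (mimetype, charset, confidence)
    triple, skipping detectors that returned None when unsure."""
    candidates = [r for r in results if r is not None]
    if not candidates:
        return (None, None)
    mimetype, charset, _ = max(candidates, key=lambda r: r[2])
    return (mimetype, charset)
}}}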

Of course, that would not be applied to every text used in the system,
only to file content. The above `to_unicode` is also used for that,
so I think I'll rename it `data_to_unicode` (content-preserving),
to contrast it with `text_to_unicode` (which might be "lossy").

An alternative to having two `*_to_unicode` variants would be to add
a third optional argument: `to_unicode(text, charset=None, lossy=False)`.
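The combined API could look like this; again a hypothetical sketch of the proposal, not existing Trac code:

{{{
def to_unicode(text, charset=None, lossy=False):
    """One function instead of data_to_unicode/text_to_unicode,
    with an explicit `lossy` switch."""
    if isinstance(text, str):
        return text
    encoding = charset or 'utf-8'
    if lossy:
        # lossy text conversion: undecodable bytes become U+FFFD
        return text.decode(encoding, errors='replace')
    try:
        return text.decode(encoding)
    except UnicodeDecodeError:
        # content-preserving fallback: Latin-1 accepts every byte
        return text.decode('iso-8859-1')
}}}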

-- Christian


PS: Hm, I just realized I begin to wiki format my e-mails... Damn :)
_______________________________________________
Trac-dev mailing list
[email protected]
http://lists.edgewall.com/mailman/listinfo/trac-dev
