Alec Thomas wrote:
On Mon, Apr 17, 2006 at 07:39:28AM +0200, solo turn wrote:
If I search for code on trac-hacks, I get this error:
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 308, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 186, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.4/site-packages/trac/Search.py", line 181, in process_request
    results += list(source.get_search_results(req, terms, filters))
  File "build/bdist.linux-i686/egg/tracreposearch/search.py", line 115, in get_search_results
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 39: ordinal not in range(128)
Is this a known error, or a new one?
Yes, this is a known problem. I'm eagerly awaiting Christian's
TracDev/UnicodeGuidelines document before launching into fixing unicode
problems. I'd prefer to do it right rather than just hack around wildly
hoping for the best :)
Ok, so that you don't hold your breath for too long, I've looked at that
code ...
In
http://trac-hacks.org/browser/reposearchplugin/0.9/tracreposearch/search.py#L112,
you use node.get_content().read(), which is documented as:
Warning: `SubversionNode.get_content` returns an object from which one
can read a stream of bytes.
NO guarantees can be given about what that stream of bytes
represents.
It might be some text, encoded in some way or another.
SVN properties __might__ give some hints about the content,
but they actually only reflect the beliefs of whoever set
those properties...
I should probably move this warning into api.py, as documentation for
Node.get_content, as it's exactly the same for the other backends.
So, in short, when you access the content of a file from the repository,
you access it as raw content (because it can be a binary object).
If you decide it's some text, you should be prepared to handle /any/ kind
of encoding: to that end, use `trac.util.to_unicode`:

    to_unicode(node.get_content().read())
(btw, I'm thinking of moving all the text-related utilities into
trac.util.text, what do you think?)
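To make this concrete for the reposearch plugin, the change could be as small
as decoding the raw bytes before matching the search terms against them.
Something like this (just a sketch; `_node_text` is a name I made up, it's
not in the plugin):

    from trac.util import to_unicode

    def _node_text(node):
        # raw bytes straight from the repository, possibly binary
        raw_content = node.get_content().read()
        # decode to unicode; to_unicode() falls back gracefully when the
        # charset can't be determined, so no UnicodeDecodeError here
        return to_unicode(raw_content)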
There's an additional twist, as Subversion has some conventions for
conveying charset information in node properties.
That way, you can *maybe* (as the charset is not necessarily set
in the svn:mime-type property) get a *hint* (as the charset which
might be set is not necessarily the right one...) about the encoding
actually used for that file.
So one way to get a hint about the charset is to get it from the
MIME type information. The second possibility is to try to detect
the charset actually used from the content itself.
Mimeview.get_charset does both of the above (*).
You can also tell `trac.util.to_unicode` to try to use this information:
    raw_content = node.get_content().read()
    charset = Mimeview(self.env).get_charset(raw_content,
                                             node.get_content_type())
    to_unicode(raw_content, charset)
and `to_unicode` will do the right thing: decode the raw_content
using the charset information if it's available and valid, but also
gracefully fall back otherwise.
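To give an idea of what "gracefully fall back" means, the logic is roughly
the following (an illustrative sketch only, not the actual
`trac.util.to_unicode` code, and the fallback charsets are merely examples):

    def to_unicode_sketch(text, charset=None):
        # already unicode: nothing to do
        if isinstance(text, unicode):
            return text
        # try the hinted charset first, if any
        if charset:
            try:
                return text.decode(charset)
            except (LookupError, UnicodeDecodeError):
                pass
        # then try utf-8, and finally decode lossily rather than raise
        try:
            return text.decode('utf-8')
        except UnicodeDecodeError:
            return text.decode('iso-8859-15', 'replace')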
There are a few shortcuts to achieve the above:

    Mimeview(self.env).get_unicode(node.get_content().read(),
                                   node.get_content_type())

or even, using `trac.mimeview.api.content_to_unicode`:

    content_to_unicode(self.env, node.get_content(), node.get_content_type())

which copes with "readable" objects.
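So, applied to search.py around line 112, something like the following should
do the trick (untested, and written against how I read the plugin, so adapt
the names as needed):

    from trac.mimeview.api import content_to_unicode

    # in get_search_results(), instead of node.get_content().read():
    content = content_to_unicode(self.env, node.get_content(),
                                 node.get_content_type())
    # `content` is now a unicode object, so matching the search terms
    # against it can't raise UnicodeDecodeError anymore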
-- Christian
P.S: hey, that's more and more material for this UnicodeGuidelines page :)
(*) In the future, Mimeview.get_charset will also be the way to
benefit from pluggable charset detectors, just as Mimeview.get_mimetype
will be for pluggable mimetype detectors.
_______________________________________________
Trac-dev mailing list
[email protected]
http://lists.edgewall.com/mailman/listinfo/trac-dev