On Mon, Apr 17, 2006 at 11:18:01AM +0200, Christian Boos wrote:
> Ok, so that you don't hold your breath for too long, I've looked at that 
> code ...
> 
> In 
> http://trac-hacks.org/browser/reposearchplugin/0.9/tracreposearch/search.py#L112,
> you use node.get_content().read(), which is documented as:
> 
>  Warning: `SubversionNode.get_content` returns an object from which one
>           can read a stream of bytes.
>           NO guarantees can be given about what that stream of bytes
>           represents.
>           It might be some text, encoded in some way or another.
>           SVN properties __might__ give some hints about the content,
>           but they actually only reflect the beliefs of whomever set
>           those properties...
> 
> I should probably move this warning inside api.py, as docu for 
> Node.get_content,
> as it's exactly the same for other backends.
> 
> So, in short, when you access the content of a file from the repository,
> you access it as raw content (because it can be a binary object).
> If you decide it's some text, you should be prepared to handle /any/ kind
> of encoding: to that end, use `trac.util.to_unicode`:
> 
> to_unicode(node.get_content().read())
> 
> (btw, I'm thinking of moving all text-related utilities into trac.util.text,
> what do you think?)
> 
> There's an additional twist, as Subversion has some conventions for
> conveying charset information, in node properties.
> That way, you can *maybe* (as the charset is not necessarily set
> in the svn:mime-type property) get a *hint* (as the charset which
> might be set is not necessarily the right one...) about the encoding
> actually used for that file.
> 
> So one way to get a hint about the charset is to get it from the
> MIME type information. The second way to determine which charset is
> actually used is to try to detect it from the content.
> Mimeview.get_charset does both of the above (*).
> 
> You can also tell `trac.util.to_unicode` to try to use this information:
> 
>  raw_content = node.get_content().read()
>  charset = Mimeview(self.env).get_charset(raw_content,
>                                           node.get_content_type())
>  to_unicode(raw_content, charset)
> 
> and `to_unicode` will do the right thing: decode the raw_content
> using the charset information if available and valid, and fall back
> gracefully otherwise.
> 
> There are a few shortcuts to achieve the above:
> 
>  Mimeview(self.env).get_unicode(node.get_content().read(),
>                                 node.get_content_type())
> 
> or even, using `trac.mimeview.api.content_to_unicode`:
> 
>  content_to_unicode(self.env, node.get_content(), node.get_content_type())
> 
> which copes with "readable" objects.
> 
> -- Christian
> 
> P.S: hey, that's more and more material for this UnicodeGuidelines page :)

Hehe :)

This is all very useful stuff Christian, thanks. I'll update repo search
soon.
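
For reference, here is a rough sketch of the "use the charset hint if it
works, otherwise fall back" idea described above. It is not Trac's actual
implementation: `charset_from_mimetype` and `decode_with_hint` are
hypothetical names, and the fallback order (hint, then UTF-8, then
ISO-8859-1) is an assumption made purely for illustration.

```python
import re


def charset_from_mimetype(mimetype):
    """Hypothetical helper: pull a charset hint out of a MIME type string.

    Example input: "text/plain; charset=iso-8859-1". Returns None when
    no charset parameter is present.
    """
    match = re.search(r"charset=([-\w.:]+)", mimetype or "", re.I)
    return match.group(1) if match else None


def decode_with_hint(raw, charset=None):
    """Hypothetical stand-in for a tolerant decoder (not Trac's code).

    Tries the hinted charset first, then UTF-8, and finally ISO-8859-1,
    which maps every byte to a code point and therefore cannot fail.
    """
    candidates = []
    if charset:
        candidates.append(charset)
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # LookupError: the hint names an unknown codec.
            # UnicodeDecodeError: the hint (or UTF-8) does not fit the bytes.
            continue
    return raw.decode("latin-1")


# The hint is only a hint: a wrong or missing one must not raise.
hint = charset_from_mimetype("text/plain; charset=iso-8859-1")
text = decode_with_hint(b"caf\xe9", hint)
```

In real plugin code the `content_to_unicode` shortcut quoted above is
presumably the better choice, since it also takes care of charset
detection.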

-- 
Evolution: Taking care of those too stupid to take care of themselves.