Re: Problems searching HTML binary UTF-8 encoded

Marcel Reutegger Tue, 26 Aug 2008 00:31:24 -0700

Hi Danilo,

This indicates that the default encoding of your platform is ISO-8859-1.
String.getBytes() [1] encodes with that platform default, so you should use
getBytes(String) [2] instead and specify "UTF-8".
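
A minimal sketch of the difference, not taken from the attached sample: the
class name EncodingSketch and the helpers encode() and decode() are made up
for illustration. It assumes only the String methods from [2] and the
matching String(byte[], String) constructor. The character is written as a
\u escape so the encoding of the source file plays no role.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class EncodingSketch {

    // Encodes with an explicitly named charset, as in [2]. The result does
    // not depend on the platform default or on -Dfile.encoding.
    static byte[] encode(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decodes bytes with an explicitly named charset.
    static String decode(byte[] bytes, String charsetName) {
        try {
            return new String(bytes, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String acute = "\u00e1"; // a with acute accent, U+00E1

        // What the no-argument getBytes() [1] would use on this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] utf8 = encode(acute, "UTF-8");        // two bytes: 0xC3 0xA1
        byte[] latin1 = encode(acute, "ISO-8859-1"); // one byte: 0xE1
        System.out.println("UTF-8 length: " + utf8.length);
        System.out.println("ISO-8859-1 length: " + latin1.length);

        // Reading UTF-8 bytes as ISO-8859-1 yields the two characters
        // U+00C3 U+00A1, the string used in the query that matched.
        String misread = decode(utf8, "ISO-8859-1");
        System.out.println("misread equals U+00C3 U+00A1: "
                + misread.equals("\u00c3\u00a1"));

        // Reading them back as UTF-8 restores the original character.
        System.out.println("round trip ok: "
                + decode(utf8, "UTF-8").equals(acute));
    }
}
```

Printing the default charset first shows which encoding the no-argument
getBytes() would pick on a given machine. Whenever the encoding and decoding
sides disagree, the text that ends up in the index differs from the text
being searched for.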


regards
 marcel


[1] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes()
[2] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes(java.lang.String)

Danilo Barboza wrote:
> Hail!!
> 
> I am having problems searching HTML content in a jcr:content node with
> these properties:
> 
> jcr:mimeType = "text/html"
> jcr:encoding = "UTF-8"
> jcr:data = "<html><head></head><body> Some content with acute á á á
> </body></html>"
> 
> When I try to search using
> 
> //element(*, nt:resource)[jcr:contains(., "á")]
> 
> I get no results. All my Strings are UTF-8 encoded, which is the JVM
> default.
> 
> When I try to search using
> 
> //element(*, nt:resource)[jcr:contains(., "Ã¡")]
> 
> I receive the expected result, but with this latin-converted string in place
> of my "á" UTF-8 string.
> 
> I've written a simple sample demonstrating the problem (see attachment).
> 
> When you run the sample, you must set the default JVM encoding to UTF-8 by
> passing the -Dfile.encoding=UTF-8 argument to the JVM.
> 
> I have also tested with other binary content (such as MS Word DOC) and
> everything works fine.
> 
> The sample code says more than I can explain.
> 
> Does anyone know why this occurs only with HTML binary content? Maybe the
> HTMLTextExtractor?
> 
> Thanks,
> 
> Danilo Barboza
>
