Hi Danilo,

this indicates that the default encoding of your platform is ISO-8859-1. The no-argument getBytes() [1] encodes the string with that platform default. You should use getBytes(String) [2] instead and specify "UTF-8".
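A minimal sketch of the difference, assuming your sample stores jcr:data via String.getBytes() (I am guessing at that from the symptoms). The class and method names below are made up for illustration; only getBytes(String) and the String(byte[], String) constructor are real API.

```java
import java.io.UnsupportedEncodingException;

public class EncodingDemo {

    // Encode with an explicit charset name, as in [2], instead of relying
    // on the platform default used by the no-argument getBytes().
    static byte[] utf8Bytes(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Shows what happens when UTF-8 bytes are decoded as ISO-8859-1:
    // each byte becomes one character, so a two-byte "á" turns into "Ã¡".
    static String misreadAsLatin1(String s) {
        try {
            return new String(utf8Bytes(s), "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is also a required charset.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00e1"; // "á", written as an escape to avoid source-encoding issues

        // Length depends on the platform default encoding.
        System.out.println("default getBytes(): " + text.getBytes().length + " byte(s)");

        // Always two bytes (0xC3 0xA1), regardless of platform.
        System.out.println("getBytes(\"UTF-8\"): " + utf8Bytes(text).length + " byte(s)");

        // The mis-decoded form; how it is displayed depends on the console encoding.
        System.out.println("UTF-8 bytes read as ISO-8859-1: " + misreadAsLatin1(text));
    }
}
```

If the bytes written to jcr:data and the declared jcr:encoding do not agree, the text extractor will index a different string than the one you search for, which would match what you are seeing.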
regards
 marcel

[1] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes()
[2] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes(java.lang.String)

Danilo Barboza wrote:
> Hail!!
>
> I am having some problems while trying to search over HTML content in a
> jcr:content node with these properties:
>
> jcr:mimeType = "text/html"
> jcr:encoding = "UTF-8"
> jcr:data = "<html><head></head><body> Some content with acute á á á
> </body></html>"
>
> When I try to search using
>
> //element(*, nt:resource)[jcr:contains(., "á")]
>
> I receive no results... All my Strings are UTF-8 encoded, which is the JVM
> default.
>
> When I try to search using
>
> //element(*, nt:resource)[jcr:contains(., "Ã¡")]
>
> I receive the expected result, but with this latin-converted string in place
> of my "á" UTF-8 string.
>
> I've written a simple sample demonstrating the problem (see attachment).
>
> When you run the sample you must set the default JVM encoding to UTF-8 by
> passing the -Dfile.encoding=UTF-8 argument to the JVM.
>
> I have also tested with other binary content (like MS Word DOC) and
> everything works fine...
>
> The sample code says more than I can explain.
>
> Does anyone know why this occurs only with HTML binary content? Maybe the
> HTMLTextExtractor?
>
> Thanks,
>
> Danilo Barboza
>
