Hi, we're experiencing some encoding problems with Unicode characters coming out of Solr. Let me explain our flow:
- Fetch a web page.
- Decode entities and Unicode characters (such as &#149;) using the Neko library.
- Get a Unicode String in Java.
- Send it to Solr as XML built with SAX, with the right encoding (UTF-8) specified everywhere (writer, header, etc.).
- It apparently arrives clean on the Solr side (verified in our logs).
- In the query output from Solr (an XML message), the character is not encoded as an entity (not &#149;); the raw character itself is used (character 149 = 95 hexadecimal). In Firefox and in our logs we see a "code" instead of the character (code 149, or 0x95), even though the original XML message sent to Solr rendered properly in Firefox, in the shell, etc.

We might have missed something somewhere, as we easily get lost in the whole encoding/Unicode nightmare ;)

We've seen the escape method in Solr's XML class; it escapes only a few characters as entities. Could that be the source of our problem? What would be the right approach to encode our input properly without having to tweak Solr's code? Or is it a bug?

Thanks,
Jeremy
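In case it helps to reproduce, here is a minimal, self-contained sketch of the suspect step in our flow (the class name EntityRepro is just for illustration). Note that decoding the numeric reference 149 literally yields U+0095, which is a C1 control character rather than a printable glyph, which would match the "code" we see in Firefox:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class EntityRepro {
    public static void main(String[] args) throws Exception {
        // Decoding the numeric reference &#149; literally gives U+0095,
        // a C1 control character (the Windows-1252 bullet lives at U+2022).
        String decoded = new String(Character.toChars(149));
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // 95

        // Write it out as UTF-8 with the encoding declared, as in our flow.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8);
        w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?><field>"
                + decoded + "</field>");
        w.flush();
        // U+0095 encodes to the valid UTF-8 byte pair 0xC2 0x95, so the XML
        // is well-formed; the character is simply a non-printable control.
        byte[] bytes = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", bytes[0], bytes[1]);
    }
}
```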