Hi, we're experiencing some encoding problems with Unicode characters coming out of Solr. Let me explain our flow:
- Fetch a web page.
- Decode entities and Unicode characters (such as &#149;) using the Neko library.
- Get a Unicode String in Java.
- Send it to Solr as XML built with SAX, with the right encoding (UTF-8) specified everywhere (writer, header, etc.).
- It apparently arrives clean on the Solr side (verified in our logs).
- In the query output from Solr (an XML message), the character is not encoded as an entity (not &#149;); the raw character itself is used (character 149 = 95 hexadecimal). In Firefox and in our logs we see a "code" instead of the character (code 149, or 0x95), even though the original XML message sent to Solr rendered properly in Firefox, in the shell, etc.

We might have missed something somewhere, as we easily get lost in the whole encoding/Unicode nightmare ;)

We've seen the escape method in Solr's XML class; it escapes only a few characters as entities. Could that be the source of our problem? What would be the right approach to encode our input properly without having to tweak Solr's code? Or is it a bug?

Thanks,
Jeremy
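In case it helps to reproduce, here is a minimal, self-contained sketch of the suspect step in our flow (the class name EntityRepro is just for illustration). Note that decoding the numeric reference 149 literally yields U+0095, which is a C1 control character rather than a printable glyph, which would match the "code" we see in Firefox:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class EntityRepro {
    public static void main(String[] args) throws Exception {
        // Decoding the numeric reference &#149; literally gives U+0095,
        // a C1 control character (the Windows-1252 bullet lives at U+2022).
        String decoded = new String(Character.toChars(149));
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // 95

        // Write it out as UTF-8 with the encoding declared, as in our flow.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8);
        w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?><field>"
                + decoded + "</field>");
        w.flush();
        // U+0095 encodes to the valid UTF-8 byte pair 0xC2 0x95, so the XML
        // is well-formed; the character is simply a non-printable control.
        byte[] bytes = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", bytes[0], bytes[1]);
    }
}
```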