: - fetch a web page
: - decode entities and Unicode characters (such as &#149;) using the Neko
: library
: - get a Unicode String in Java
: - send it to Solr through XML created by SAX, with the right encoding
: (UTF-8) specified everywhere (writer, header, etc.)
: - it apparently arrives clean on the Solr side (verified in our logs).
: - In the query output from Solr (an XML message), the character is not
: encoded as an entity (not &#149;) but the character itself is used
: (character 149 = hex 95).
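A minimal sketch (plain Java, no Neko required) of why that particular character is trouble: a numeric character reference like &#149; names a Unicode code point, and U+0095 is the C1 control character MESSAGE WAITING, not a printable bullet. The bullet the page author intended is byte 0x95 in the Windows-1252 charset:

```java
import java.nio.charset.Charset;

public class EntityDemo {
    public static void main(String[] args) {
        // What &#149; decodes to as a Unicode code point: a C1 control char.
        char decoded = (char) 149;
        System.out.println(Character.isISOControl(decoded));   // true

        // What the page author almost certainly meant: Windows-1252 byte
        // 0x95, which is U+2022 (BULLET) when decoded correctly.
        String bullet = new String(new byte[]{(byte) 0x95},
                Charset.forName("windows-1252"));
        System.out.println(bullet.codePointAt(0) == 0x2022);   // true
    }
}
```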

Just because someone uses an HTML entity to display a character in a web
page doesn't mean it needs to be "escaped" in XML ... I think that in
theory we could use numeric entities to escape *every* character, but that
would make the XML responses a lot bigger ... so in general Solr only
escapes the characters that need to be escaped to produce a valid UTF-8
XML response.
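To illustrate that point (a hypothetical helper, not Solr's actual writer code): a valid UTF-8 XML response only needs to escape the XML-special characters; everything else, bullets and accents included, can appear literally in the output.

```java
public class XmlEscape {
    // Escape only what XML requires in element content; pass the rest
    // through as literal UTF-8 characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;");  break;
                case '>': sb.append("&gt;");  break;
                default:  sb.append(c);       // e.g. a literal bullet survives
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c \u2022 d"));
        // a &lt; b &amp; c • d
    }
}
```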

You may also be having some additional problems, since 149 (hex 95) is not
a printable character in Unicode; it's a control character (MESSAGE
WAITING) ... it sounds like you're dealing with HTML where people were
using the numeric value from the "Windows-1252" charset.

You may want to modify your parsing code to map "control" characters that
you know aren't meant to be control characters before you ever send them
to Solr.  A quick search for "Neko windows-1252" indicates that enough
people have had problems with this that there is a built-in feature for
it...
    http://people.apache.org/~andyc/neko/doc/html/settings.html
    "http://cyberneko.org/html/features/scanner/fix-mswindows-refs
     Specifies whether to fix character entity references for Microsoft
     Windows characters as described at
     http://www.cs.tut.fi/~jkorpela/www/windows-chars.html."
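If you'd rather do the mapping yourself instead of relying on Neko's fix-mswindows-refs feature, the idea is simple (a sketch of an assumed approach, not library code): reinterpret each C1 "control" code point (U+0080 through U+009F) as a Windows-1252 byte and re-decode it.

```java
import java.nio.charset.Charset;

public class FixWin1252 {
    // Replace C1 control characters with the Windows-1252 characters the
    // page author almost certainly intended (e.g. U+0095 becomes U+2022).
    static String fixControls(String s) {
        Charset cp1252 = Charset.forName("windows-1252");
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                sb.append(new String(new byte[]{(byte) c}, cp1252));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fixControls("item\u0095 one"));
        // item• one
    }
}
```

Running a pass like this before handing the text to Solr means the index never sees the bogus control characters in the first place.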

(I've run into this a number of times over the years when dealing with
content created by Windows users, as you can see from my one and only
thread on "JavaJunkies" ...
  http://www.javajunkies.org/index.pl?node_id=3436
)


-Hoss
