Hi,
I got a SolrException when submitting XML for indexing (using Solr 3.6.1):
////
Jan 15, 2013 10:22:42 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 31))
at [row,col {unknown-source}]: [2,1169]
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:81)
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 31))
...
at [row,col {unknown-source}]: [2,1169]
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:660)
at com.ctc.wstx.sr.BasicStreamReader.readCDataPrimary(BasicStreamReader.java:4240)
at com.ctc.wstx.sr.BasicStreamReader.nextFromTreeCommentOrCData(BasicStreamReader.java:3280)
at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2824)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:309)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
////
Looking into the details, the data causing the trouble is
word1chr(31)word2
where word1 and word2 are both normal English text and "chr(31)" stands for
the value returned by the PHP function chr(31), i.e. the control character
with code 31 (0x1F). Our XML is otherwise well constructed, and the
encoding/charset is declared correctly.
The problem is caused by chr(31): if I replace it with another UTF-8
character, indexing works.
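To illustrate, the behaviour does not look specific to Solr. Below is a minimal sketch in Python (not our actual PHP code) that feeds the same kind of payload to a standard XML parser. The field name "text" and the wrapper document are made up for the example; only the word1chr(31)word2 payload mirrors our data.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload and field name, modelled on the failing data.
payload = "word1" + chr(31) + "word2"
doc = ('<add><doc><field name="text"><![CDATA['
       + payload
       + ']]></field></doc></add>')


def parses(xml_text):
    """Return True if the parser accepts xml_text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The document with the raw control character, then the same document
# with that character replaced by a space.
print(parses(doc))
print(parses(doc.replace(chr(31), " ")))
```

So wrapping the text in CDATA does not seem to make the character acceptable to the parser.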
I checked the source of com.ctc.wstx.sr.BasicStreamReader, and it seems that,
by design, control characters are not allowed inside CDATA text. What puzzles
me is how we can avoid control characters in text in general (it is certainly
not a common occurrence, but it can still happen).
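One workaround I can imagine is filtering the text before building the XML. The sketch below is in Python for brevity (our client is PHP) and assumes the rule to follow is the "Char" production of XML 1.0, which allows tab, LF, CR and everything from 0x20 upward except surrogates, 0xFFFE and 0xFFFF. The function name and the choice of a space as the replacement are my own assumptions, not anything taken from Solr.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text, replacement=" "):
    """Replace characters that XML 1.0 does not permit (hypothetical helper)."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


cleaned = strip_illegal_xml_chars("word1" + chr(31) + "word2")
print(repr(cleaned))
```

Replacing with a space rather than deleting keeps word1 and word2 as separate tokens, but that choice depends on what the control character meant in the source data.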
Thanks very much for any help, Lisheng