Hi Markus,
the result of my investigation is that Lucene can currently only handle
UTF-8 code points within the BMP [Basic Multilingual Plane] (plane 0) = 0x.
Any code point above the BMP may lead to unpredictable results, which is bad.
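
To find out whether a field value contains anything above the BMP at all, a small check like the following might help. It is only a sketch using plain JDK calls; the class and method names (SupplementaryScan, aboveBmp, hasUnpairedSurrogate) are made up for illustration and are not part of Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Solr.
class SupplementaryScan {

    /** Returns all code points in s that lie above the BMP (U+10000 and up). */
    static List<Integer> aboveBmp(String s) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                found.add(cp);
            }
            // Advance by 2 chars for a surrogate pair, otherwise by 1.
            i += Character.charCount(cp);
        }
        return found;
    }

    /** Returns true if s contains a surrogate that is not part of a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return true; // high surrogate without a low partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate without a preceding high one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Prints the code points above the BMP found in each argument.
        for (String arg : args) {
            System.out.println(aboveBmp(arg) + " unpaired=" + hasUnpairedSurrogate(arg));
        }
    }
}
```

Running this over the stored field values of a suspicious document should at least tell whether code points above the BMP or broken surrogate pairs are involved.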
If you get invalid UTF-8 from the index and use wt=xml it gives the error
Dear list,
after loading some documents via DIH which also include urls
I get this yellow XML error page as search result from solr admin GUI
after a search.
It says "XML processing error: not well-formed".
The markup it complains about is:
<arr name="dcurls">
<str>http://eprints.soton.ac.uk/43350/</str>
Results so far.
I could locate and isolate the document causing trouble.
I've checked the document with xmllint again. It is valid, well-formed UTF-8.
I've loaded the single document and get the XML error when displaying the
search result.
This happens through the Solr admin search and also the JSON interface,
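
Since the source document is well-formed but the response is not, one thing worth checking is whether the value coming back from the index contains a character that XML 1.0 does not allow. The sketch below tests a string against the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class and method names are hypothetical, and whether such a character is really the cause here is an assumption, not something confirmed in this thread.

```java
// Hypothetical helper, not part of Lucene or Solr.
class XmlCharCheck {

    /** Returns true if cp is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char index of the first illegal character, or -1 if none. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // An unpaired surrogate is returned as-is and fails the check.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Prints the offending index for each argument, or -1 if it is clean.
        for (String arg : args) {
            System.out.println(firstIllegalIndex(arg));
        }
    }
}
```

A result other than -1 for a stored field value would point at the character that makes the XML response not well-formed.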
It looks like you hit the same issue as I did a while ago:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg46510.html
On Friday 11 February 2011 08:59:27 Bernd Fehling wrote:
Dear list,
after loading some documents via DIH which also include urls
I get this yellow XML error page
Hi Markus,
yes, it looks like the same issue. There is also a \u UTF-8 code in your dump.
So far I have traced it into XMLResponseWriter.
A few steps earlier the result in a buffer looks good and the UTF-8 code is
correct.
Really hard to debug this freaky problem.
Have you looked deeper into this