I've finally revisited this issue. I switched to Tomcat (5.5.16) and
all is well, so it certainly appears as if this is a Jetty issue.
*sigh*
I'll look into whether there is a newer version of Jetty and whether
it fixes this.
I did make some improvements to text encoding in my indexing process,
which is actually quite involved: RDF files are parsed in Java, some
of them contain pointers to URLs that get fetched via HttpClient, and
the results are packaged into a JDOM XML DOM, serialized to a String,
and then sent to Solr via HttpClient. Even if I had the wrong encoding
somewhere along the way, I'm assuming that a valid String retrieved
from a Lucene Field should be serializable again. So it is at least
reassuring that the bug is in Jetty and not in my complicated process.
Erik
On Apr 12, 2006, at 10:24 AM, Yonik Seeley wrote:
On 4/12/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: > The weird thing is that the last Solr line in the trace is
: > org.apache.solr.util.XML.escapeCharData(XML.java:100)
: >
: > 99 if (start==0) {
: > 100 out.write(str);
Thanks, I had missed that. I just verified that line 100 is the same
in both versions of the file, so the most likely explanation is a
corrupt string (the string might end in the first char of a multi-char
character) that triggers the exception in Sun's UTF-8 encoder.
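
The failure mode described above can be reproduced in isolation. What
follows is only a rough sketch, not code from Solr or Jetty: the class
and method names (SurrogateCheck, hasUnpairedSurrogate, encodesAsUtf8)
are made up for illustration, and it assumes a JDK recent enough to
have java.nio.charset.StandardCharsets. It cuts a String in the middle
of a surrogate pair and shows that a strict UTF-8 encoder rejects the
result.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class SurrogateCheck {

    // Returns true if s contains a surrogate char without its partner.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair, skip the low surrogate
                } else {
                    return true; // high surrogate with nothing valid after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    // Returns true if a strict (REPORT instead of replace) UTF-8 encoder
    // accepts the whole string.
    static boolean encodesAsUtf8(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            enc.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so Java stores it as two chars.
        String ok = "caf\u00e9 \uD83D\uDE00";
        // Dropping the last char leaves a dangling high surrogate.
        String truncated = ok.substring(0, ok.length() - 1);

        System.out.println(hasUnpairedSurrogate(ok) + " " + encodesAsUtf8(ok));
        System.out.println(hasUnpairedSurrogate(truncated) + " " + encodesAsUtf8(truncated));
    }
}
```

A check along these lines, run on the String just before it is written
out, would at least tell whether the value itself is already corrupt
or whether the problem only shows up in the container's writer.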
So the question then is, how did this bad string come about?
Chris' guess about a bad charset somewhere is probably right.
-Yonik