I've finally revisited this issue. I switched to Tomcat (5.5.16) and all is well, so it certainly appears as if this is a Jetty issue. *sigh*

I'll look into whether there is a newer version of Jetty and if that fixes this.

I did make some improvements to text encoding in my indexing process, which is actually quite involved: RDF files are parsed by Java, some pointers to URLs get fetched via HttpClient, and the results are packaged into a JDOM XML document, serialized to a String, and sent to Solr via HttpClient. Even if I had the wrong encoding somewhere along the way, I'm assuming that a valid String retrieved from a Lucene Field should be serializable again. So it is at least reassuring that the bug is in Jetty and not in my complicated process.
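
For anyone who wants to test the corrupt-string theory quoted below against their own data, here is a minimal sketch of a check that could be run on a String before it is sent to Solr. The class and method names (SurrogateCheck, firstUnpairedSurrogate, encodesAsUtf8) are made up for illustration and are not part of Solr, Lucene, or Jetty. It only assumes that "the first char of a multi char character" means an unpaired UTF-16 surrogate.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of Solr: detects Strings that a strict
// UTF-8 encoder would reject because they contain an unpaired surrogate.
final class SurrogateCheck {

    // Returns the index of the first unpaired surrogate char, or -1 if none.
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (!paired) {
                    return i; // high surrogate with no low surrogate after it
                }
                i++; // skip the low half of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    // Returns true if a UTF-8 encoder configured to report errors accepts s.
    static boolean encodesAsUtf8(String s) {
        CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String truncated = "abc\uD834"; // ends in the first half of a pair
        System.out.println(firstUnpairedSurrogate(truncated));
        System.out.println(encodesAsUtf8(truncated));
    }
}
```

If a check like this flags a field value, the String was already damaged before it reached the servlet container. If it never does, that points back at the container's output handling rather than at the indexing process.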

        Erik


On Apr 12, 2006, at 10:24 AM, Yonik Seeley wrote:

On 4/12/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: > The weird thing is that the last Solr line in the trace is
: > org.apache.solr.util.XML.escapeCharData(XML.java:100)
: >
: > 99    if (start==0) {
: > 100      out.write(str);

Thanks, I had missed that.  I just verified that line 100 is the same
in both versions of the file, so the most likely explanation is a
corrupt string (the string might end in the first char of a multi-char
character) that triggers the exception in Sun's UTF-8 encoder.

So the question then is, how did this bad string come about?
Chris' guess about a bad charset somewhere is probably right.

-Yonik
