Hi all, thanks for your comments. I seem to have fixed it by simply stripping all noncharacter code points [1]: I iterate over the individual chars and keep only those that pass this check:
// keep ch only if it is not a noncharacter: not U+nFFFE/U+nFFFF and not in U+FDD0..U+FDEF
if (ch % 0x10000 != 0xffff && ch % 0x10000 != 0xfffe && (ch < 0xfdd0 || ch > 0xfdef)) { pass; }

Comments?

[1]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

On Monday 27 June 2011 12:40:16 Markus Jelsma wrote:
> Hi,
>
> I came across the indexing error below. It happened in a huge batch update
> from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to
> trace the error back to a specific document. So I try my luck here: anyone
> seen this before with SolrJ 3.1? Anything else on the Nutch part I should
> have taken care of?
>
> Thanks!
>
>
> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>     at org.mortbay.jetty.Server.handle(Server.java:326)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>     at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
>     at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
>     at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
>     at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
>     at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
>     at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
>     at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
>     at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
>     at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
>     ... 26 more

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
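
P.S. For anyone who wants to try the same workaround, the check described at the top of this message could be sketched in plain Java roughly as follows. This is only an illustration under my own assumptions, not the code that actually runs in Nutch or SolrJ: the class name NoncharacterStripper and both method names are made up for the example, and it works on code points rather than UTF-16 chars so that the plane-end noncharacters above U+FFFF (U+1FFFE, U+1FFFF, ...) are caught too.

```java
public final class NoncharacterStripper {

    private NoncharacterStripper() {}

    /**
     * Returns true for the 66 Unicode noncharacters: U+FDD0..U+FDEF plus the
     * last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
     */
    public static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE; // low 16 bits are FFFE or FFFF
    }

    /** Copies the input, dropping every noncharacter code point. */
    public static String stripNoncharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);   // handles surrogate pairs
            i += Character.charCount(cp);    // advance by 1 or 2 chars
            if (!isNoncharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "foo\uFFFFbar\uFDD0baz";
        System.out.println(stripNoncharacters(dirty)); // prints "foobarbaz"
    }
}
```

Something like this would be applied to each field value before the document is handed to SolrJ. Note that it deliberately leaves everything else alone, so control characters that XML 1.0 forbids and unpaired surrogates would still need separate handling if they show up in the crawl.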