Hey everybody,
I've been running into some issues indexing a very large set of documents.
There are about 4,000 PDF files, ranging in size from 10KB to 160MB, so this
is obviously a big task for Solr. I have a PHP script that iterates over the
directory and uses PHP cURL to send each file to Solr for indexing. For now,
commit is set to false to speed up the indexing, on the assumption that Solr
will auto-commit as necessary. I'm using the default solrconfig.xml file
included in apache-solr-1.4.1\example\solr\conf. Once all the documents have
been processed, the PHP script sends Solr a final commit.
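For reference, the loop is essentially equivalent to the following (a Python sketch of what the PHP script does; the port is the example-config default, and using the filename as literal.id and a "pdfs" directory are my own choices, not anything special):

```python
import os
import urllib.parse
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # default port from the Solr example config

def extract_url(path: str, commit: bool = False) -> str:
    # Build the ExtractingRequestHandler URL for one file.
    # Using the filename as the unique literal.id is an assumption of this sketch.
    params = urllib.parse.urlencode({
        "literal.id": os.path.basename(path),
        "commit": "true" if commit else "false",
    })
    return f"{SOLR_BASE}/update/extract?{params}"

def index_pdf(path: str) -> None:
    # Stream the raw PDF bytes to the extract handler, as the PHP/cURL loop does.
    with open(path, "rb") as f:
        req = urllib.request.Request(
            extract_url(path),
            data=f.read(),
            headers={"Content-Type": "application/pdf"},
        )
        urllib.request.urlopen(req).read()

if __name__ == "__main__":
    for name in sorted(os.listdir("pdfs")):  # assumed directory name
        if name.lower().endswith(".pdf"):
            index_pdf(os.path.join("pdfs", name))
    # One explicit commit at the end, mirroring the script described above.
    urllib.request.urlopen(f"{SOLR_BASE}/update?commit=true").read()
```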
The main problem is that after a few thousand documents (around 2,000 last
time I tried), nearly every subsequent document triggers a Java exception in Solr:
Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    ... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
    at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
    at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
    at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
    at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    ... 25 more
As far as I know there's nothing special about these documents, so I'm
wondering whether Solr is failing to auto-commit properly. What would be
appropriate settings in solrconfig.xml for this particular application? I'd
like it to auto-commit as soon as it needs to, but no more often than that,
for the sake of efficiency.
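For concreteness, I assume this means uncommenting the autoCommit block inside the updateHandler section of the default solrconfig.xml, with values along these lines (the specific numbers here are just my guesses, not anything I've tested):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>1000</maxDocs>   <!-- commit after this many pending documents -->
    <maxTime>60000</maxTime>  <!-- or after 60 seconds, whichever comes first -->
  </autoCommit>
</updateHandler>
```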
Indexing 4,000 documents takes long enough as it is, and there's no reason to
make it take longer. Thanks for your help!
~Brandon Waterloo