Guys, I think this is an issue with PDFBox, specifically the version that Tika 0.6 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may include a fix for the problem you're seeing.
See this discussion [2] on how to patch Tika to use the new PDFBox if you can't wait for the 0.7 release which should happen soon (hopefully next few weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html

On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" <gfernandez-kinc...@capitaliq.com> wrote:

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's not looking so hot.

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

> I've been trying to bulk index about 11 million PDFs, and while profiling our
> Solr instance, I noticed that all of the threads that are processing indexing
> requests are constantly blocking each other during this call:
>
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
> Has anyone run into this before? Any ideas on how to reduce the contention?
>
> Thanks,
> Gio.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
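
For anyone reading this thread later: the trace above shows every indexing thread blocked in java.util.Collections$SynchronizedMap.get(Object), called from PDFont.getAFM(). A map returned by Collections.synchronizedMap takes a single lock on every call, reads included, so a read-mostly cache shared by all parser threads serializes them. The sketch below only illustrates that general pattern against a ConcurrentHashMap, whose reads do not take that shared lock. It is not PDFBox's actual code, and nothing here says how (or whether) PDFBox 1.0.0 changed it. The class name AfmCacheSketch, the "Helvetica" key, and the value 1000 are all made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a shared, read-mostly font-metrics cache.
// NOT PDFBox code; names and values are placeholders.
class AfmCacheSketch {

    // Every get() on this map acquires the same monitor, so readers queue up.
    static Map<String, Integer> synchronizedCache() {
        Map<String, Integer> m = new HashMap<>();
        m.put("Helvetica", 1000); // placeholder metric value
        return Collections.synchronizedMap(m);
    }

    // get() on a ConcurrentHashMap does not take a map-wide lock.
    static Map<String, Integer> concurrentCache() {
        Map<String, Integer> m = new ConcurrentHashMap<>();
        m.put("Helvetica", 1000); // placeholder metric value
        return m;
    }

    // Runs `threads` workers that each perform `lookupsPerThread` reads
    // and returns the total number of successful lookups.
    static long hammer(Map<String, Integer> cache, int threads, int lookupsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong hits = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                long local = 0;
                for (int i = 0; i < lookupsPerThread; i++) {
                    if (cache.get("Helvetica") != null) {
                        local++;
                    }
                }
                hits.addAndGet(local);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return hits.get();
    }

    public static void main(String[] args) {
        int threads = 16;        // the thread count Gio started with
        int lookups = 1_000_000; // arbitrary load for the comparison

        long start = System.nanoTime();
        hammer(synchronizedCache(), threads, lookups);
        long syncMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        hammer(concurrentCache(), threads, lookups);
        long concMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("synchronizedMap: " + syncMs + " ms");
        System.out.println("ConcurrentHashMap: " + concMs + " ms");
    }
}
```

Timings depend on the JVM and hardware, so treat the two printed numbers as a rough comparison rather than a benchmark. If the library itself cannot be patched or upgraded, lowering the number of indexing threads or extracting text on the client side, as Grant suggests, are the options raised in this thread.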