RE: PDFBox/Tika Performance Issues

Giovanni Fernandez-Kincade Tue, 16 Mar 2010 14:32:05 -0700

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.


-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

> I've been trying to bulk index about 11 million PDFs, and while profiling our 
> Solr instance, I noticed that all of the threads that are processing indexing 
> requests are constantly blocking each other during this call:
> 
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
> 
> Has anyone run into this before? Any ideas on how to reduce the contention?
> 
> Thanks,
> Gio.
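
For readers following the trace above: the blocked frame is a get() on a Collections.synchronizedMap reached from PDFont.getAFM(). Every call on such a map takes the same monitor, so concurrent readers queue behind one another. The sketch below is not PDFBox code. The class name AfmCacheSketch, the "Helvetica" key, and the string standing in for parsed font metrics are all hypothetical, and it assumes the map is a cache shared by all indexing threads, which the trace suggests but does not prove. It only shows the general pattern and one common alternative, ConcurrentHashMap, whose reads do not take a shared lock.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AfmCacheSketch {

    // Hypothetical cache in the style seen in the trace: one monitor guards every get().
    static Map<String, String> newSynchronizedCache() {
        Map<String, String> cache = Collections.synchronizedMap(new HashMap<>());
        cache.put("Helvetica", "metrics"); // placeholder for parsed font metrics
        return cache;
    }

    // Alternative under the same assumption: reads do not contend on a single lock.
    static Map<String, String> newConcurrentCache() {
        Map<String, String> cache = new ConcurrentHashMap<>();
        cache.put("Helvetica", "metrics");
        return cache;
    }

    // Runs the given number of threads, each doing lookupsPerThread reads of the
    // shared cache, and returns how many lookups found an entry.
    static long runLookups(Map<String, String> cache, int threads, int lookupsPerThread)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong hits = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                long local = 0;
                for (int i = 0; i < lookupsPerThread; i++) {
                    if (cache.get("Helvetica") != null) {
                        local++;
                    }
                }
                hits.addAndGet(local);
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("lookups did not finish in time");
        }
        return hits.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 16;          // matches the CPU count mentioned in the thread
        int lookups = 1_000_000;
        for (Map<String, String> cache :
                new Map[] {newSynchronizedCache(), newConcurrentCache()}) {
            long start = System.nanoTime();
            long hits = runLookups(cache, threads, lookups);
            long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println(cache.getClass().getSimpleName()
                    + ": " + hits + " hits in " + millis + " ms");
        }
    }
}
```

Both variants return the same results; only the locking behavior differs. Timings will vary by machine, so the sketch prints them rather than asserting any particular speedup.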

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
