RE: PDFBox/Tika Performance Issues

Giovanni Fernandez-Kincade Tue, 16 Mar 2010 15:55:34 -0700

I'm not sure how to patch our Solr instance to use Tika 0.7-trunk. 
This is what I've tried so far (mostly guesswork):




1.     Got the latest version of the trunk code from 
http://svn.apache.org/repos/asf/lucene/tika/trunk

2.     Built this using Maven (mvn install)

3.     I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
(tika-0.3.jar).

4.     Then I bounced my servlet server and tried indexing a document. The 
document was indexed successfully and no errors were logged, but the PDF 
content does not appear to have been extracted (the field I used for 
map.content had an empty string as its value).



What's the right way to apply this patch?
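
One way to narrow down step 4 is to check whether the jar that was copied into the Solr lib folder actually contains the parser classes that appear in the stack trace quoted further down (org.apache.tika.parser.AutoDetectParser and org.apache.tika.parser.pdf.PDFParser). The following is only a diagnostic sketch, not the patch procedure itself: the class name JarClassCheck and the helper containsClass are made up for illustration, and it relies on nothing beyond java.util.jar from the JDK. Finding the classes does not prove that extraction will work; it only rules out one possible cause.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {

    /** Returns true if the jar at jarPath has an entry for the given fully qualified class name. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        // Class files are stored in a jar under their package path, e.g. org/apache/.../Foo.class
        String entry = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entry) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarClassCheck <path-to-jar>");
            return;
        }
        // Class names taken from the stack trace quoted in this thread.
        String[] expected = {
            "org.apache.tika.parser.AutoDetectParser",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : expected) {
            System.out.println(name + " present: " + containsClass(args[0], name));
        }
    }
}
```

Run it against the jar exactly as it sits in the lib folder (that is, under its renamed file name), so the check covers the file Solr actually loads.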





-----Original Message-----
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, March 16, 2010 5:41 PM
To: solr-user@lucene.apache.org
Subject: RE: PDFBox/Tika Performance Issues



Thanks Chris!



I'll try the patch.



-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues



Guys, I think this is an issue with PDFBox and the version that Tika 0.6 
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may 
include a fix for the problem you're seeing.



See this discussion [2] on how to patch Tika to use the new PDFBox if you can't 
wait for the 0.7 release, which should happen soon (hopefully in the next few 
weeks).



Cheers,

Chris



[1] http://issues.apache.org/jira/browse/TIKA-380

[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html





On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" 
<gfernandez-kinc...@capitaliq.com> wrote:



Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.



-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues



Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?



FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.



-Grant



On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:



> I've been trying to bulk index about 11 million PDFs, and while profiling our 
> Solr instance, I noticed that all of the threads that are processing indexing 
> requests are constantly blocking each other during this call:
>
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
> Has anyone run into this before? Any ideas on how to reduce the contention?
>
> Thanks,
> Gio.
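
The blocked frame at the top of the trace above is Collections$SynchronizedMap.get(Object), which takes a single lock on every lookup, so threads that all read the same shared map queue up behind one another. The sketch below is not PDFBox code and is not the change made in PDFBox 1.0.0. It is a standalone illustration, with hypothetical names (SharedMapContention, countReads, the "Helvetica" key), of the same access pattern run against a synchronized map and against a ConcurrentHashMap, whose reads generally do not block each other. Timings depend on hardware and JVM, so none are quoted here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class SharedMapContention {

    // Hypothetical stand-in for a shared, read-mostly lookup table.
    static Map<String, Integer> sampleTable() {
        Map<String, Integer> table = new HashMap<>();
        table.put("Helvetica", 1);
        return table;
    }

    // Every get() on this wrapper acquires the same monitor.
    static Map<String, Integer> synchronizedTable() {
        return Collections.synchronizedMap(sampleTable());
    }

    // Lookups on a ConcurrentHashMap do not go through one shared lock.
    static Map<String, Integer> concurrentTable() {
        return new ConcurrentHashMap<>(sampleTable());
    }

    /** Starts `threads` workers that each look up the same key `readsPerThread` times; returns the summed values. */
    static long countReads(Map<String, Integer> shared, int threads, int readsPerThread)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);   // releases all workers at once
        CountDownLatch done = new CountDownLatch(threads);
        AtomicLong hits = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await();
                    long local = 0;
                    for (int i = 0; i < readsPerThread; i++) {
                        local += shared.get("Helvetica");
                    }
                    hits.addAndGet(local);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        start.countDown();
        done.await();
        pool.shutdown();
        return hits.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 16;        // the indexing thread count mentioned in this thread
        int reads = 1_000_000;   // arbitrary number of lookups per worker
        long t0 = System.nanoTime();
        countReads(synchronizedTable(), threads, reads);
        long t1 = System.nanoTime();
        countReads(concurrentTable(), threads, reads);
        long t2 = System.nanoTime();
        System.out.println("synchronizedMap:   " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("ConcurrentHashMap: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Taking a thread dump while the first phase runs should show workers BLOCKED in the same frame as the trace above, which is a quick way to confirm the pattern on a given machine.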



--------------------------

Grant Ingersoll

http://www.lucidimagination.com/



Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search









++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Chris Mattmann, Ph.D.

Senior Computer Scientist

NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA

Office: 171-266B, Mailstop: 171-246

Email: chris.mattm...@jpl.nasa.gov

WWW:   http://sunset.usc.edu/~mattmann/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Adjunct Assistant Professor, Computer Science Department

University of Southern California, Los Angeles, CA 90089 USA

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
