RE: PDFBox/Tika Performance Issues

Giovanni Fernandez-Kincade Wed, 17 Mar 2010 08:07:13 -0700

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an 
error, but the data doesn't get extracted. Using the same PDF with my previous 
/Lib contents works fine.


Any other ideas? 

These are the jar files I have in my /Lib:

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
> 
> 
> 
> 1.     Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
> 
> 2.     Built this using Maven (mvn install)
> 

On track so far.

> 3.     I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also make sure to take the updated
PDFBox 1.0.0 jar (you can get this by running
mvn dependency:copy-dependencies in the tika-parsers project, see
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers, and drop them in there
as well. Then make sure to remove the existing tika-0.3.jar, as well as any
of the existing parser lib jar files, and replace them with the new deps.
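
One way to double-check the cleanup step is to scan the lib folder for artifacts that are present in more than one version, for example an older pdfbox jar sitting next to pdfbox-1.0.0.jar. The sketch below is only an illustration: the class name LibJarAudit and the assumed <artifact>-<version>.jar naming convention are not part of Solr or Tika, and it will not catch renamed artifacts (tika-0.3.jar and tika-core-0.7-SNAPSHOT.jar have different artifact names, so that one still needs a manual check).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Solr or Tika): flags artifacts present in more than one version. */
final class LibJarAudit {

    // Assumed naming convention: <artifact>-<version>.jar, where the version starts with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    /** Groups jar file names by artifact and keeps only artifacts seen with two or more versions. */
    static Map<String, List<String>> findMultiVersionArtifacts(List<String> jarNames) {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                byArtifact.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
            }
        }
        // Drop every artifact that appears only once; what remains is suspicious.
        byArtifact.values().removeIf(versions -> versions.size() < 2);
        return byArtifact;
    }

    public static void main(String[] args) {
        // Pass the path of the core's lib folder as the first argument (defaults to "lib").
        File libDir = new File(args.length > 0 ? args[0] : "lib");
        String[] names = libDir.list((dir, n) -> n.endsWith(".jar"));
        List<String> jarNames = names == null ? new ArrayList<String>() : Arrays.asList(names);
        findMultiVersionArtifacts(jarNames).forEach((artifact, versions) ->
                System.out.println(artifact + " has multiple versions: " + versions));
    }
}
```

If this prints nothing, no artifact name appears twice. That does not prove the classpath is right, only that no same-named jar was left behind in two versions.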

A bunch of manual labor, yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

> 
> 4.     Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above
and let's go from there.

Cheers,
Chris

> -----Original Message-----
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
> 
> 
> 
> Thanks Chris!
> 
> 
> 
> I'll try the patch.
> 
> 
> 
> -----Original Message-----
> 
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> 
> Sent: Tuesday, March 16, 2010 5:37 PM
> 
> To: solr-user@lucene.apache.org
> 
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> 
> Guys, I think this is an issue with PDFBox and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
> 
> 
> 
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release, which should happen soon (hopefully in the
> next few weeks).
> 
> 
> 
> Cheers,
> 
> Chris
> 
> 
> 
> [1] http://issues.apache.org/jira/browse/TIKA-380
> 
> [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
> 
> 
> 
> 
> 
> On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade"
> <gfernandez-kinc...@capitaliq.com> wrote:
> 
> 
> 
> Originally 16 (the number of CPUs on the machine), but even with 5 threads
> it's not looking so hot.
> 
> 
> 
> -----Original Message-----
> 
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> 
> Sent: Tuesday, March 16, 2010 5:15 PM
> 
> To: solr-user@lucene.apache.org
> 
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> 
> Hmm, that is an ugly thing in PDFBox.  We should probably take this over to
> the PDFBox project.  How many threads are you indexing with?
> 
> 
> 
> FWIW, for that many documents, I might consider using Tika on the client side
> to save on a lot of network traffic.
> 
> 
> 
> -Grant
> 
> 
> 
> On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:
> 
> 
> 
>> I've been trying to bulk index about 11 million PDFs, and while profiling our
>> Solr instance, I noticed that all of the threads that are processing indexing
>> requests are constantly blocking each other during this call:
>>
>> http-8080-Processor39 [BLOCKED] CPU time: 9:35
>> java.util.Collections$SynchronizedMap.get(Object)
>> org.pdfbox.pdmodel.font.PDFont.getAFM()
>> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
>> org.pdfbox.util.PDFStreamEngine.showString(byte[])
>> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
>> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
>> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
>> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
>> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
>> org.pdfbox.util.PDFTextStripper.processPages(List)
>> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
>> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
>> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
>> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
>> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
>> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
>> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
>> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
>> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
>> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
>> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
>> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
>> java.lang.Thread.run()
>>
>> Has anyone run into this before? Any ideas on how to reduce the contention?
>>
>> Thanks,
>> Gio.
> 
> 
> 
> --------------------------
> 
> Grant Ingersoll
> 
> http://www.lucidimagination.com/
> 
> 
> 
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> Chris Mattmann, Ph.D.
> 
> Senior Computer Scientist
> 
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> 
> Office: 171-266B, Mailstop: 171-246
> 
> Email: chris.mattm...@jpl.nasa.gov
> 
> WWW:   http://sunset.usc.edu/~mattmann/
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> Adjunct Assistant Professor, Computer Science Department
> 
> University of Southern California, Los Angeles, CA 90089 USA
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
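
For readers unfamiliar with the blocked frame in the quoted trace (Collections$SynchronizedMap.get called from PDFont.getAFM()): every get() on a Collections.synchronizedMap wrapper takes the same monitor, so many indexing threads reading one shared map queue up behind each other. The sketch below only illustrates that pattern with the JDK. The class name, the "Helvetica" key and the thread counts are made up for the example, it is not PDFBox code, and it makes no claim about how any PDFBox release handles its font metrics cache. Timings vary by machine, so none are shown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: hypothetical names, not PDFBox code. */
final class SharedCacheContention {

    /** Runs `threads` workers that each read one key `lookups` times; returns the total hit count. */
    static long hammer(Map<String, Integer> cache, int threads, int lookups) {
        AtomicLong hits = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                long local = 0;
                for (int i = 0; i < lookups; i++) {
                    if (cache.get("Helvetica") != null) {
                        local++;
                    }
                }
                hits.addAndGet(local);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return hits.get();
    }

    public static void main(String[] args) {
        Map<String, Integer> plain = new HashMap<>();
        plain.put("Helvetica", 1);
        // Every get() on this wrapper synchronizes on one shared monitor.
        Map<String, Integer> synchronizedCache = Collections.synchronizedMap(plain);
        // Reads on ConcurrentHashMap generally do not block each other.
        Map<String, Integer> concurrentCache = new ConcurrentHashMap<>(plain);

        for (Map<String, Integer> cache : Arrays.asList(synchronizedCache, concurrentCache)) {
            long start = System.nanoTime();
            long hits = hammer(cache, 16, 1_000_000);
            long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println(cache.getClass().getSimpleName() + ": " + hits + " hits in " + ms + " ms");
        }
    }
}
```

Both maps return the same results. The difference is only in how much the reader threads wait on each other, which is what a profiler reports as BLOCKED time.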


