Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an error, but the data doesn't get extracted. Using the same PDF with my previous /Lib contents works fine.
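When extraction comes back empty without any logged error after a jar swap, one thing worth checking is which jar each parser class is actually being loaded from. Below is a minimal sketch of such a probe, not something from Solr or Tika. `ClasspathProbe` and `locate` are hypothetical names, and the `org.apache.pdfbox` class name assumes that PDFBox 1.0.0 uses that package prefix rather than the older `org.pdfbox` one. It would need to run with the same jars on the classpath as the Solr core's /Lib folder.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports which jar (if any) a class is loaded from. */
class ClasspathProbe {

    /**
     * Returns the jar or directory a class was loaded from, "(bootstrap)" for
     * classes that have no code source (typically JDK classes), or null if the
     * class cannot be found or linked.
     */
    static String locate(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class<?> c = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.tika.parser.AutoDetectParser",
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.pdfbox.pdmodel.font.PDFont", // assumed package for PDFBox 1.0.0
            "org.pdfbox.pdmodel.font.PDFont"         // older package, as seen in the stack trace
        };
        for (String name : names) {
            String where = locate(name);
            System.out.println(name + " -> " + (where == null ? "NOT FOUND" : where));
        }
    }
}
```

If the Tika parser classes resolve but neither PDFBox class does, or only the old `org.pdfbox` one does, that would point at a mismatch between the Tika jars and their parser dependencies.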
Any other ideas? These are the jar files I have in my /Lib:

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder.
Also you should make sure to take the updated PDFBox 1.0.0 jar (you can get this by running mvn dependency:copy-dependencies in the tika-parsers project, see here: http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html), along with the rest of the jar deps for tika-parsers and drop them in there as well. Then, make sure to remove the existing tika-0.3.jar, as well as any of the existing parser lib jar files and replace them with the new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la vie, right? :) The alternative is to wait for Tika 0.7 to be released and then for Solr to upgrade to it.

> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above and let's go from there.

Cheers,
Chris

> -----Original Message-----
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
>
> Thanks Chris!
>
> I'll try the patch.
>
> -----Original Message-----
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Tuesday, March 16, 2010 5:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
>
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
>
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release which should happen soon (hopefully next few
> weeks).
>
> Cheers,
> Chris
>
> [1] http://issues.apache.org/jira/browse/TIKA-380
> [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
>
> On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade"
> <gfernandez-kinc...@capitaliq.com> wrote:
>
> Originally 16 (the number of CPUs on the machine), but even with 5 threads
> it's not looking so hot.
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> Sent: Tuesday, March 16, 2010 5:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
>
> Hmm, that is an ugly thing in PDFBox. We should probably take this over to
> the PDFBox project. How many threads are you indexing with?
>
> FWIW, for that many documents, I might consider using Tika on the client side
> to save on a lot of network traffic.
>
> -Grant
>
> On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:
>
>> I've been trying to bulk index about 11 million PDFs, and while profiling our
>> Solr instance, I noticed that all of the threads that are processing indexing
>> requests are constantly blocking each other during this call:
>>
>> http-8080-Processor39 [BLOCKED] CPU time: 9:35
>> java.util.Collections$SynchronizedMap.get(Object)
>> org.pdfbox.pdmodel.font.PDFont.getAFM()
>> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
>> org.pdfbox.util.PDFStreamEngine.showString(byte[])
>> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
>> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
>> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
>> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
>> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
>> org.pdfbox.util.PDFTextStripper.processPages(List)
>> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
>> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
>> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
>> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
>> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
>> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
>> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
>> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
>> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
>> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
>> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
>> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
>> java.lang.Thread.run()
>>
>> Has anyone run into this before? Any ideas on how to reduce the contention?
>>
>> Thanks,
>> Gio.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
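
The BLOCKED frame at the top of the profile is `java.util.Collections$SynchronizedMap.get(Object)`. That wrapper guards every call, reads included, with a single monitor, so many indexing threads looking up the same shared map end up queueing behind each other. The sketch below is not PDFBox code. It is a small, self-contained illustration of that pattern, with hypothetical names (`SharedMapContention`, `countHits`, the font-name keys), that runs the same read-only workload against a synchronized `HashMap` and a `ConcurrentHashMap`. Timings depend heavily on the machine, so none are claimed here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Illustration only: many threads reading one shared lookup table. */
class SharedMapContention {

    // Hypothetical keys standing in for entries of a shared font-metrics cache.
    static final String[] KEYS = {"Helvetica", "Times-Roman", "Courier", "Symbol"};

    static Map<String, Integer> fill(Map<String, Integer> map) {
        for (int i = 0; i < KEYS.length; i++) {
            map.put(KEYS[i], i);
        }
        return map;
    }

    /** Starts `threads` workers that each perform `lookups` get() calls; returns the total number of hits. */
    static long countHits(final Map<String, Integer> map, int threads, final int lookups) {
        final CountDownLatch go = new CountDownLatch(1);
        final AtomicLong hits = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        go.await(); // release all workers at the same moment
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    long local = 0;
                    for (int i = 0; i < lookups; i++) {
                        // With synchronizedMap, every get() acquires the same monitor.
                        if (map.get(KEYS[i % KEYS.length]) != null) {
                            local++;
                        }
                    }
                    hits.addAndGet(local);
                }
            });
            workers[t].start();
        }
        go.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return hits.get();
    }

    static void report(String label, Map<String, Integer> map) {
        long t0 = System.nanoTime();
        long hits = countHits(map, 16, 1_000_000);
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println(label + ": " + hits + " hits in " + ms + " ms");
    }

    public static void main(String[] args) {
        report("synchronizedMap", Collections.synchronizedMap(fill(new HashMap<String, Integer>())));
        report("ConcurrentHashMap", fill(new ConcurrentHashMap<String, Integer>()));
    }
}
```

Both variants return the same hit count; only how long the threads spend waiting on each other differs. Taking a thread dump while the synchronized variant runs should show workers BLOCKED in the same `SynchronizedMap.get` frame as in the profile above.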