RE: PDFBox/Tika Performance Issues
I don't think so. I'm using Tomcat on my servers, but I set up my local machine with the Eclipse-Jetty plugin from that Lucid article and I'm getting the same error.

These are the libraries referenced in my Eclipse project:

apache-solr-core-1.5-dev.jar
apache-solr-dataimporthandler-1.5-dev.jar
apache-solr-solrj-1.5-dev.jar
commons-codec-1.3.jar
commons-csv-1.0-SNAPSHOT-r609327.jar
commons-fileupload-1.2.1.jar
commons-httpclient-3.1.jar
commons-io-1.4.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
google-collect-1.0.jar
jcl-over-slf4j-1.5.5.jar
lucene-analyzers-2.9.2.jar
lucene-collation-2.9.2.jar
lucene-core-2.9.2.jar
lucene-fast-vector-highlighter-2.9.2.jar
lucene-highlighter-2.9.2.jar
lucene-memory-2.9.2.jar
lucene-misc-2.9.2.jar
lucene-queries-2.9.2.jar
lucene-snowball-2.9.2.jar
lucene-spatial-2.9.2.jar
lucene-spellchecker-2.9.2.jar
slf4j-api-1.5.5.jar
slf4j-jdk14-1.5.5.jar
wstx-asl-3.2.7.jar
apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dir.txt
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-app-0.7-SNAPSHOT.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 23, 2010 11:03 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

The error that you're showing in your logs below indicates that this method signature:

org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)

doesn't match what was expected. Are you sure you don't have another Solr jar on the classpath somewhere, or in your web server? Are you using Jetty, or Tomcat?

Thanks,
Chris
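
The jar list in the message above can be checked mechanically for artifacts that appear in more than one version. The sketch below is only an illustration: the class name `JarVersionConflicts` is hypothetical, and it assumes the common `<artifact>-<version>.jar` naming where the version starts with a digit. On the list above it would flag `lucene-core` and `lucene-misc`, which appear as both 2.9.2 and 2.9.1-dev. It would not flag the pairing of `apache-solr-core-1.5-dev.jar` with `apache-solr-cell-1.4-dev.jar`, since those are different artifacts, though that mix may also be worth a look. Whether either mismatch causes the error is not confirmed anywhere in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups jar file names by artifact and keeps those seen in 2+ versions. */
final class JarVersionConflicts {

    // Assumption: names look like <artifact>-<version>.jar and the version starts with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    static Map<String, Set<String>> conflicts(List<String> jarNames) {
        Map<String, Set<String>> versionsByArtifact = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                // Exact duplicates (same artifact, same version) collapse into one set entry.
                versionsByArtifact.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
            // Names that do not match the pattern (for example dir.txt) are ignored.
        }
        // Keep only artifacts that occur in at least two distinct versions.
        versionsByArtifact.values().removeIf(versions -> versions.size() < 2);
        return versionsByArtifact;
    }

    public static void main(String[] args) {
        // A few names taken from the list in the message above.
        List<String> jars = List.of(
                "lucene-core-2.9.2.jar",
                "lucene-misc-2.9.2.jar",
                "lucene-core-2.9.1-dev.jar",
                "lucene-misc-2.9.1-dev.jar",
                "commons-io-1.4.jar",
                "commons-io-1.4.jar",
                "pdfbox-1.0.0.jar",
                "dir.txt");
        conflicts(jars).forEach((artifact, versions) ->
                System.out.println(artifact + " -> " + versions));
    }
}
```

The check works on file names only, so it cannot see classes bundled under a different jar name (tika-app, for instance, is a bundled jar).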
Re: PDFBox/Tika Performance Issues
Hi Giovanni,

The error that you're showing in your logs below indicates that this method signature:

org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)

doesn't match what was expected. Are you sure you don't have another Solr jar on the classpath somewhere, or in your web server? Are you using Jetty, or Tomcat?

Thanks,
Chris

On 3/23/10 7:59 AM, "Giovanni Fernandez-Kincade" wrote:

Sorry for the late reply - been out of town for a couple of days.

From my solrconfig:

ignored_ text
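
The string in the error is a JVM method descriptor: `L...;` marks a class type and the trailing `V` means a `void` return. `java.lang.AbstractMethodError` is thrown when an abstract method is invoked, which normally only happens at run time when a class definition has changed incompatibly since the calling code was compiled. A small sketch for turning such a descriptor into a readable signature follows; the class name `DescriptorDecoder` is hypothetical and not part of Solr or Tika.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: renders a JVM method descriptor as a Java-style signature. */
final class DescriptorDecoder {

    /** Turns a method name plus "(Lpkg/A;I)V" into "void name(pkg.A, int)". */
    static String decode(String methodName, String descriptor) {
        int close = descriptor.indexOf(')');
        if (!descriptor.startsWith("(") || close < 0) {
            throw new IllegalArgumentException("Not a method descriptor: " + descriptor);
        }
        List<String> params = new ArrayList<>();
        int[] pos = {1};                       // cursor just after '('
        while (pos[0] < close) {
            params.add(readType(descriptor, pos));
        }
        int[] ret = {close + 1};               // return type follows ')'
        String returnType = readType(descriptor, ret);
        return returnType + " " + methodName + "(" + String.join(", ", params) + ")";
    }

    /** Reads one field type starting at pos[0] and advances the cursor past it. */
    private static String readType(String d, int[] pos) {
        int dims = 0;
        while (d.charAt(pos[0]) == '[') {      // each '[' adds one array dimension
            dims++;
            pos[0]++;
        }
        char code = d.charAt(pos[0]++);
        String base;
        switch (code) {
            case 'B': base = "byte"; break;
            case 'C': base = "char"; break;
            case 'D': base = "double"; break;
            case 'F': base = "float"; break;
            case 'I': base = "int"; break;
            case 'J': base = "long"; break;
            case 'S': base = "short"; break;
            case 'Z': base = "boolean"; break;
            case 'V': base = "void"; break;
            case 'L': {
                int end = d.indexOf(';', pos[0]);
                base = d.substring(pos[0], end).replace('/', '.');
                pos[0] = end + 1;
                break;
            }
            default:
                throw new IllegalArgumentException("Unknown type code: " + code);
        }
        StringBuilder out = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            out.append("[]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Descriptor copied from the AbstractMethodError in this thread.
        System.out.println(decode("load",
                "(Lorg/apache/solr/request/SolrQueryRequest;"
                + "Lorg/apache/solr/response/SolrQueryResponse;"
                + "Lorg/apache/solr/common/util/ContentStream;)V"));
    }
}
```

Read this way, the error says that the running code tried to call `load` with a `SolrQueryRequest`, a `SolrQueryResponse` from the `org.apache.solr.response` package, and a `ContentStream`, and that the loaded class does not implement a method with exactly those parameter types.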
RE: PDFBox/Tika Performance Issues
Sorry for the late reply - been out of town for a couple of days.

From my solrconfig:

ignored_ text

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Saturday, March 20, 2010 8:43 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

What's your configuration look like for the ExtractReqHandler?

On Mar 19, 2010, at 2:42 PM, Giovanni Fernandez-Kincade wrote:

> Yeah I've been trying that - I keep getting this error when indexing a PDF
> with a trunk-build:
>
> Apache Tomcat/5.5.27 - Error report
>
> HTTP Status 500 - org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>
> java.lang.AbstractMethodError: org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
>     at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>     at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>     at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>     at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>     at java.lang.Thread.run(Unknown Source)
>
> type Status report
>
> message org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>
> java.lang.AbstractMethodError: org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>     at org.apache.catalina.conn
Re: PDFBox/Tika Performance Issues
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) >at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) >at java.lang.Thread.run(Unknown Source) description The server > encountered an internal error > (org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V > java.lang.AbstractMethodError: > org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V >at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) >at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) >at > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233) >at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321) at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341) >at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244) >at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) >at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) >at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) >at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172) >at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) > at > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) >at > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174) > at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) > at > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) >at > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) >at > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) >at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) >at java.lang.Thread.run(Unknown Source) ) that prevented it from > fulfilling this request.Apache Tomcat/5.5.27 > > > I'm trying to get a development environment going following these steps so I > can debug: > http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse > > > -Original Message- > From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll > Sent: Friday, March 19, 2010 1:46 PM > To: solr-user@lucene.apache.org > Subject: Re: PDFBox/Tika Performance Issues > > Can you try trunk? > > On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote: > >> Solr Specification Version: 1.4.0.2009.10.14.08.05.59 >> Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59 >> Lucene Specification Version: 2.9.1-dev >> Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13 >> Current Time: Fri Mar 19 13:11:31 EDT 2010 >> Server Start Time:Wed Mar 17 17:05:19 EDT 2010 >> >> -Original Message- >> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll >> Sent: Friday, March 19, 2010 1:02 PM >> To: solr-user@lucene.apache.org >> Subject: Re: PDFBox/Tika Performance Issues >> >> >> On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote: >>> >>> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the >>> /Lib folder for my Solr Core, and renamed it to the name of the existing >>> Tika Jar (tika-0.3.jar). >> >> What version are you on of Solr? 
It's been a while since Solr Cell was on
>> Tika 0.3,
>
>

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
RE: PDFBox/Tika Performance Issues
/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) at java.lang.Thread.run(Unknown Source) ) that prevented it from fulfilling this request.Apache Tomcat/5.5.27 I'm trying to get a development environment going following these steps so I can 
debug: http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Friday, March 19, 2010 1:46 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Can you try trunk? On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote: > Solr Specification Version: 1.4.0.2009.10.14.08.05.59 > Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59 > Lucene Specification Version: 2.9.1-dev > Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13 > Current Time: Fri Mar 19 13:11:31 EDT 2010 > Server Start Time:Wed Mar 17 17:05:19 EDT 2010 > > -Original Message- > From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll > Sent: Friday, March 19, 2010 1:02 PM > To: solr-user@lucene.apache.org > Subject: Re: PDFBox/Tika Performance Issues > > > On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote: >> >> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib >> folder for my Solr Core, and renamed it to the name of the existing Tika Jar >> (tika-0.3.jar). > > What version are you on of Solr? It's been a while since Solr Cell was on > Tika 0.3,
Re: PDFBox/Tika Performance Issues
Can you try trunk?

On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote:

> Solr Specification Version: 1.4.0.2009.10.14.08.05.59
> Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59
> Lucene Specification Version: 2.9.1-dev
> Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13
> Current Time: Fri Mar 19 13:11:31 EDT 2010
> Server Start Time: Wed Mar 17 17:05:19 EDT 2010
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> Sent: Friday, March 19, 2010 1:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
>
> On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
>>
>> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
>> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
>> (tika-0.3.jar).
>
> What version are you on of Solr? It's been a while since Solr Cell was on
> Tika 0.3,
RE: PDFBox/Tika Performance Issues
Solr Specification Version: 1.4.0.2009.10.14.08.05.59
Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59
Lucene Specification Version: 2.9.1-dev
Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13
Current Time: Fri Mar 19 13:11:31 EDT 2010
Server Start Time: Wed Mar 17 17:05:19 EDT 2010

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Friday, March 19, 2010 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:

>
> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

What version are you on of Solr? It's been a while since Solr Cell was on Tika 0.3,
Re: PDFBox/Tika Performance Issues
On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:

>
> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

What version are you on of Solr? It's been a while since Solr Cell was on Tika 0.3,
Re: PDFBox/Tika Performance Issues
Ah, OK. Let me try and stand up a SolrCell instance and perform the same test you are and see if I can duplicate it. Hopefully I can get back to you today on this...

Cheers,
Chris

On 3/19/10 7:43 AM, "Giovanni Fernandez-Kincade" wrote:

Yeah I had tested it previously and that works...
RE: PDFBox/Tika Performance Issues
Yeah I had tested it previously and that works...

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Friday, March 19, 2010 12:04 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Let's try and isolate the problem. Can you try parsing the PDF file with tika-app as a standalone? Take your tika-app jar file then run

java -jar tika-app-0.7-SNAPSHOT.jar -m /path/to/pdf/file

That should give you something like:

Content-Type: application/pdf
created: Thu Sep 06 00:41:55 PDT 2007
creator: TeX
producer: pdfeTeX-1.21a
resourceName: Dissertation.pdf

(e.g., this is what I got when I ran it on my Dissertation PDF file).

Let's start there - if that works, then there is something up with the integration into SolrCell, and we can start to figure that out...

Cheers,
Chris

On 3/17/10 8:06 AM, "Giovanni Fernandez-Kincade" wrote:

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an error, but the data doesn't get extracted. Using the same PDF with my previous /Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)
>

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also you should make sure to take the updated PDFBox 1.0.0 jar (you can get this by typing mvn dependency:copy-dependencies in the tika-parsers project, see here: http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html), along with the rest of the jar deps for tika-parsers and drop them in there as well. Then, make sure to remove the existing tika-0.3.jar, as well as any of the existing parser lib jar files and replace them with the new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la vie, right? :) The alternative is to wait for Tika 0.7 to be released and then for Solr to upgrade to it.

> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above and let's go from there.

Cheers,
Chris

> -----Original Message-----
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
>
> Thanks Chris!
>
> I'll try the patch.
>
> -----Original Message-----
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Tuesday, March 16, 2010 5:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
>
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
>
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release wh
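
The jar swap Chris describes in this thread (copy the tika-parsers dependencies in, remove the old Tika and parser jars) is easy to get partly wrong by hand. The sketch below compares two folders by jar file name and only reports differences; it deletes nothing. The class name `LibFolderDiff` and both folder arguments are assumptions for illustration: the first is the Solr core lib folder, the second is wherever the copied dependencies ended up (the Maven dependency plugin writes to `target/dependency` by default).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: compares the jar names in a lib folder with a dependency folder. */
final class LibFolderDiff {

    /** Returns the names of all *.jar files directly inside dir, sorted. */
    static Set<String> jarNames(Path dir) throws IOException {
        Set<String> names = new TreeSet<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        }
        return names;
    }

    /** Jars present in libDir but absent from depsDir: candidates to review for removal. */
    static Set<String> staleJars(Path libDir, Path depsDir) throws IOException {
        Set<String> stale = jarNames(libDir);
        stale.removeAll(jarNames(depsDir));
        return stale;
    }

    /** Jars present in depsDir that have not been copied into libDir yet. */
    static Set<String> missingJars(Path libDir, Path depsDir) throws IOException {
        Set<String> missing = jarNames(depsDir);
        missing.removeAll(jarNames(libDir));
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java LibFolderDiff <solr-core-lib-dir> <copied-deps-dir>");
            return;
        }
        Path lib = Paths.get(args[0]);
        Path deps = Paths.get(args[1]);
        System.out.println("In lib only (review before removing): " + staleJars(lib, deps));
        System.out.println("Not yet in lib: " + missingJars(lib, deps));
    }
}
```

The "in lib only" set will also contain jars that belong there for other reasons, such as `apache-solr-cell-1.4-dev.jar` from the list above, so it is a review list rather than a delete list.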
Re: PDFBox/Tika Performance Issues
Hi Giovanni,

Let's try and isolate the problem. Can you try parsing the PDF file with tika-app as a standalone? Take your tika-app jar file then run

    java -jar tika-app-0.7-SNAPSHOT.jar -m /path/to/pdf/file

That should give you something like:

    Content-Type: application/pdf
    created: Thu Sep 06 00:41:55 PDT 2007
    creator: TeX
    producer: pdfeTeX-1.21a
    resourceName: Dissertation.pdf

(e.g., this is what I got when I ran it on my Dissertation PDF file).

Let's start there - if that works, then there is something up with the integration into SolrCell, and we can start to figure that out...

Cheers,
Chris

On 3/17/10 8:06 AM, "Giovanni Fernandez-Kincade" wrote:

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an error, but the data doesn't get extracted. Using the same PDF with my previous /Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar
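For anyone scripting the standalone check described in this message, here is a minimal sketch that turns "name: value" lines like the sample output into a map, so a script can confirm that the PDF was detected. The class and method names are made up for illustration, and the assumption that every line has the "name: value" shape comes only from the sample output quoted in the message, not from any Tika documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses "name: value" lines (as in the sample output) into a map. */
class TikaMetadataOutput {

    static Map<String, String> parse(String output) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            // Split on the first ": " only, because values such as timestamps contain colons.
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // skip blank or malformed lines
            }
            metadata.put(line.substring(0, sep).trim(), line.substring(sep + 2).trim());
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Sample lines copied from the message; a real run would read the tool's stdout instead.
        String sample = String.join("\n",
                "Content-Type: application/pdf",
                "created: Thu Sep 06 00:41:55 PDT 2007",
                "creator: TeX",
                "producer: pdfeTeX-1.21a",
                "resourceName: Dissertation.pdf");
        Map<String, String> metadata = parse(sample);
        boolean looksLikePdf = "application/pdf".equals(metadata.get("Content-Type"));
        System.out.println("fields=" + metadata.size() + " pdfDetected=" + looksLikePdf);
    }
}
```

If the standalone run yields a sensible Content-Type but Solr still stores an empty string, that points at the SolrCell integration rather than at Tika itself, which is the distinction the message is trying to draw.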
RE: PDFBox/Tika Performance Issues
Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an error, but the data doesn't get extracted. Using the same PDF with my previous /Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you need to do is drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also, you should make sure to take the updated PDFBox 1.0.0 jar (you can get this by running mvn dependency:copy-dependencies in the tika-parsers project, see here: http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html), along with the rest of the jar deps for tika-parsers, and drop them in there as well. Then, make sure to remove the existing tika-0.3.jar, as well as any of the existing parser lib jar files, and replace them with the new deps.

A bunch of manual labor, yes, but you're on the bleeding edge, so c'est la vie, right? :) The alternative is to wait for Tika 0.7 to be released and then for Solr to upgrade to it.

> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think that probably has to do with the lib deps. Try what I mentioned above and let's go from there.

Cheers,
Chris
Re: PDFBox/Tika Performance Issues
Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you need to do is drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also, you should make sure to take the updated PDFBox 1.0.0 jar (you can get this by running mvn dependency:copy-dependencies in the tika-parsers project, see here: http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html), along with the rest of the jar deps for tika-parsers, and drop them in there as well. Then, make sure to remove the existing tika-0.3.jar, as well as any of the existing parser lib jar files, and replace them with the new deps.

A bunch of manual labor, yes, but you're on the bleeding edge, so c'est la vie, right? :) The alternative is to wait for Tika 0.7 to be released and then for Solr to upgrade to it.

> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think that probably has to do with the lib deps. Try what I mentioned above and let's go from there.

Cheers,
Chris
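Since the reply above comes down to "remove the old jars and replace them with the new deps", one way to sanity-check the result is to look for artifacts that are present in more than one version. The following is only a sketch under assumptions: the class name is invented, the "<artifact>-<version>.jar" naming rule is a convention rather than a guarantee, and the pdfbox-0.7.3.jar entry in the demo list is a hypothetical leftover, not something reported in this thread. The two lucene-core entries do both appear in the Eclipse project list posted earlier in the thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reports artifacts that appear in a lib folder in several versions. */
class LibVersionCheck {

    // Assumed naming convention: "<artifact>-<version>.jar", where the version starts with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    /** Returns artifact -> versions for every artifact present in more than one version. */
    static Map<String, TreeSet<String>> conflictingVersions(List<String> jarNames) {
        Map<String, TreeSet<String>> versionsByArtifact = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                versionsByArtifact.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
            // Names that do not match the convention (for example dir.txt) are ignored.
        }
        // Keep only artifacts that have at least two distinct versions.
        versionsByArtifact.values().removeIf(versions -> versions.size() < 2);
        return versionsByArtifact;
    }

    public static void main(String[] args) {
        List<String> lib = Arrays.asList(
                "lucene-core-2.9.2.jar",
                "lucene-core-2.9.1-dev.jar",
                "pdfbox-1.0.0.jar",
                "pdfbox-0.7.3.jar", // hypothetical leftover from an older install
                "tika-core-0.7-SNAPSHOT.jar",
                "dir.txt");
        System.out.println(conflictingVersions(lib));
    }
}
```

Note that this would not catch the tika-0.3.jar case, because the old single jar and the new tika-core/tika-parsers jars have different artifact names; that one still has to be removed by hand as described above.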
RE: PDFBox/Tika Performance Issues
I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance. This is what I've tried so far (which was really just me guessing):

1. Got the latest version of the trunk code from http://svn.apache.org/repos/asf/lucene/tika/trunk

2. Built this using Maven (mvn install)

3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib folder for my Solr Core, and renamed it to the name of the existing Tika Jar (tika-0.3.jar).

4. Then I bounced my servlet server and tried indexing a document. The document was successfully indexed, and there were no errors logged as a result, but the PDF data does not appear to have been extracted (the field I used for map.content had an empty-string as a value).

What's the right approach to perform this patch?

-----Original Message-----
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, March 16, 2010 5:41 PM
To: solr-user@lucene.apache.org
Subject: RE: PDFBox/Tika Performance Issues

Thanks Chris!

I'll try the patch.
RE: PDFBox/Tika Performance Issues
Thanks Chris!

I'll try the patch.

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may include a fix for the problem you're seeing.

See this discussion [2] on how to patch Tika to use the new PDFBox if you can't wait for the 0.7 release which should happen soon (hopefully next few weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
Re: PDFBox/Tika Performance Issues
Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may include a fix for the problem you're seeing.

See this discussion [2] on how to patch Tika to use the new PDFBox if you can't wait for the 0.7 release which should happen soon (hopefully next few weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html

On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" wrote:

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's not looking so hot.
RE: PDFBox/Tika Performance Issues
Originally 16 (the number of CPUs on the machine), but even with 5 threads it's not looking so hot.

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic.

-Grant
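To compare thread counts such as the 16 and 5 mentioned above in a repeatable way, a client-side driver can cap its concurrency explicitly. The sketch below is an illustration only: the class name and the extractor function are placeholders, and a real client would plug its own text-extraction call in where the fake extractor is. It uses nothing beyond java.util.concurrent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical driver: runs an extraction function over many files with a fixed thread cap. */
class BoundedExtraction {

    /** Applies extractor to every path using at most `threads` workers; results keep input order. */
    static List<String> extractAll(List<String> paths, int threads, Function<String, String> extractor) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String path : paths) {
                tasks.add(() -> extractor.apply(path));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and returns futures in task order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder extractor; it does no real PDF parsing.
        Function<String, String> fakeExtractor = path -> "text of " + path;
        List<String> paths = Arrays.asList("a.pdf", "b.pdf", "c.pdf");
        for (int threads : new int[] {16, 5, 1}) {
            long start = System.nanoTime();
            List<String> texts = extractAll(paths, threads, fakeExtractor);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(threads + " threads: " + texts.size() + " documents in " + micros + " us");
        }
    }
}
```

If throughput barely changes between thread counts once real extraction is plugged in, that would be consistent with the workers serializing on a shared lock, as the profile in the original post suggests.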
Re: PDFBox/Tika Performance Issues
Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

> I've been trying to bulk index about 11 million PDFs, and while profiling our
> Solr instance, I noticed that all of the threads that are processing indexing
> requests are constantly blocking each other during this call:
>
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
> Has anyone run into this before? Any ideas on how to reduce the contention?
>
> Thanks,
> Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
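The top frame of the trace above is java.util.Collections$SynchronizedMap.get, so every thread that looks something up in that shared map has to take the same monitor first. The sketch below illustrates that general effect with plain JDK classes; it is not PDFBox's code, the "Helvetica" key is just a stand-in for whatever is cached there, and the class name is invented. It times the same read-only workload against a synchronized map and a ConcurrentHashMap, whose retrieval operations do not block.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustration only: many threads reading one shared map, as in the blocked frame above. */
class SharedCacheContention {

    /** Has `threads` threads each perform `lookupsPerThread` reads; returns elapsed milliseconds. */
    static long timeLookups(Map<String, Integer> cache, int threads, int lookupsPerThread) {
        cache.put("Helvetica", 1); // hypothetical cached entry
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < lookupsPerThread; i++) {
                    cache.get("Helvetica"); // with a synchronized map, each call locks the whole map
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int threads = 16;          // the thread count mentioned in the thread
        int lookups = 2_000_000;   // arbitrary workload size
        long locked = timeLookups(Collections.synchronizedMap(new HashMap<>()), threads, lookups);
        long concurrent = timeLookups(new ConcurrentHashMap<>(), threads, lookups);
        System.out.println("synchronizedMap: " + locked + " ms, ConcurrentHashMap: " + concurrent + " ms");
    }
}
```

Timings vary by machine and JVM, so the numbers are only indicative. Whether the newer PDFBox actually changes this code path is something the thread leaves open.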