RE: PDFBox/Tika Performance Issues

2010-03-23 Thread Giovanni Fernandez-Kincade
I don't think so. 

I'm using Tomcat on my servers, but I set up my local machine with the 
Eclipse-Jetty plugin from that Lucid article and I'm getting the same error. 

These are the libraries referenced in my Eclipse project:
apache-solr-core-1.5-dev.jar
apache-solr-dataimporthandler-1.5-dev.jar
apache-solr-solrj-1.5-dev.jar
commons-codec-1.3.jar
commons-csv-1.0-SNAPSHOT-r609327.jar
commons-fileupload-1.2.1.jar
commons-httpclient-3.1.jar
commons-io-1.4.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
google-collect-1.0.jar
jcl-over-slf4j-1.5.5.jar
lucene-analyzers-2.9.2.jar
lucene-collation-2.9.2.jar
lucene-core-2.9.2.jar
lucene-fast-vector-highlighter-2.9.2.jar
lucene-highlighter-2.9.2.jar
lucene-memory-2.9.2.jar
lucene-misc-2.9.2.jar
lucene-queries-2.9.2.jar
lucene-snowball-2.9.2.jar
lucene-spatial-2.9.2.jar
lucene-spellchecker-2.9.2.jar
slf4j-api-1.5.5.jar
slf4j-jdk14-1.5.5.jar
wstx-asl-3.2.7.jar
apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dir.txt
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-app-0.7-SNAPSHOT.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar
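
A quick way to spot mixed versions in a list like the one above is to group the jar names by artifact and flag any artifact that shows up under more than one version. The sketch below is only an illustration: the class and method names are made up for this example, the regex assumes file names of the form artifact-version.jar, and nothing in this thread confirms that a version mismatch is what triggers the error.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr or Tika.
class JarVersionConflicts {
    // Assumption: the version starts at the first '-' that is followed by a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    /** Returns artifact -> versions, keeping only artifacts seen with 2+ distinct versions. */
    static Map<String, Set<String>> conflicts(List<String> fileNames) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = JAR.matcher(name);
            if (!m.matches()) {
                continue; // not a versioned jar name (e.g. dir.txt)
            }
            versions.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
        }
        // The same jar listed twice with the same version is not a conflict.
        versions.values().removeIf(found -> found.size() < 2);
        return versions;
    }

    public static void main(String[] args) {
        // A few names copied from the list above.
        List<String> jars = Arrays.asList(
                "lucene-core-2.9.2.jar", "lucene-misc-2.9.2.jar",
                "commons-io-1.4.jar", "commons-io-1.4.jar", "dir.txt",
                "lucene-core-2.9.1-dev.jar", "lucene-misc-2.9.1-dev.jar",
                "tika-core-0.7-SNAPSHOT.jar");
        conflicts(jars).forEach((artifact, found) ->
                System.out.println(artifact + " -> " + found));
    }
}
```

Because it groups by exact artifact name, this would not flag apache-solr-cell-1.4-dev.jar sitting next to apache-solr-core-1.5-dev.jar; a pairing like that still has to be checked by eye.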

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, March 23, 2010 11:03 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

The error that you're showing in your logs below indicates that this method 
signature:

org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)

doesn't match what was expected. Are you sure you don't have another Solr jar 
on the classpath somewhere, or in your web server? Are you using Jetty, or 
Tomcat?
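
For readers not used to JVM method descriptors, the string in the error can be decoded mechanically: "L...;" is a class name with slashes in place of dots, "[" marks an array, single letters are primitives, and the type after the closing parenthesis is the return type (the full descriptor in the stack trace ends in V, for void). The decoder below is a minimal sketch written for this note; the class name is hypothetical and it does no validation beyond the basics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading JVM method descriptors; not a Solr class.
class DescriptorDecoder {

    /** Turns "(Ljava/lang/String;I)V" into "void (java.lang.String, int)". */
    static String decode(String descriptor) {
        if (descriptor.isEmpty() || descriptor.charAt(0) != '(') {
            throw new IllegalArgumentException("descriptor must start with '('");
        }
        int[] pos = {1}; // current read position, shared with readType
        List<String> params = new ArrayList<>();
        while (descriptor.charAt(pos[0]) != ')') {
            params.add(readType(descriptor, pos));
        }
        pos[0]++; // skip ')'
        String returnType = readType(descriptor, pos);
        return returnType + " (" + String.join(", ", params) + ")";
    }

    // Reads one type starting at pos[0] and advances pos[0] past it.
    private static String readType(String d, int[] pos) {
        char c = d.charAt(pos[0]++);
        switch (c) {
            case 'V': return "void";
            case 'Z': return "boolean";
            case 'B': return "byte";
            case 'C': return "char";
            case 'S': return "short";
            case 'I': return "int";
            case 'J': return "long";
            case 'F': return "float";
            case 'D': return "double";
            case '[': return readType(d, pos) + "[]";
            case 'L': {
                int end = d.indexOf(';', pos[0]);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated class name");
                }
                String name = d.substring(pos[0], end).replace('/', '.');
                pos[0] = end + 1;
                return name;
            }
            default:
                throw new IllegalArgumentException("unexpected character '" + c + "'");
        }
    }

    public static void main(String[] args) {
        // Descriptor copied from the AbstractMethodError in the stack trace.
        System.out.println(decode(
                "(Lorg/apache/solr/request/SolrQueryRequest;"
                + "Lorg/apache/solr/response/SolrQueryResponse;"
                + "Lorg/apache/solr/common/util/ContentStream;)V"));
    }
}
```

Read this way, the error says that a load method taking a SolrQueryRequest, a SolrQueryResponse and a ContentStream and returning void was called but had no implementation in the class that was actually loaded.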

Thanks,
Chris



On 3/23/10 7:59 AM, "Giovanni Fernandez-Kincade" wrote:

Sorry for the late reply - been out of town for a couple of days.

From my solrconfig:



  ignored_
  text

  


-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Saturday, March 20, 2010 8:43 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

What's your configuration look like for the ExtractReqHandler?

On Mar 19, 2010, at 2:42 PM, Giovanni Fernandez-Kincade wrote:

> Yeah I've been trying that - I keep getting this error when indexing a PDF 
> with a trunk-build:
>
>   Apache Tomcat/5.5.27 - Error report
>   HTTP Status 500 - org.apache.solr.handler.
>   
> ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)
>   V  java.lang.AbstractMethodError: 
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) 
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) 
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)   
> at org.apa

Re: PDFBox/Tika Performance Issues

2010-03-23 Thread Mattmann, Chris A (388J)
Hi Giovanni,

The error that you're showing in your logs below indicates that this method 
signature:

org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)

doesn't match what was expected. Are you sure you don't have another Solr jar 
on the classpath somewhere, or in your web server? Are you using Jetty, or 
Tomcat?
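
One way to answer the "another Solr jar on the classpath" question is to ask the JVM where a given class was actually loaded from. The snippet below is a sketch using only the standard library; the class name WhereLoadedFrom is made up for this note, and to be useful it has to run inside the same webapp class loader as Solr (for instance from a debugger expression or a throwaway JSP), which is an assumption about the setup rather than something described in this thread.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Solr.
class WhereLoadedFrom {

    /** Returns the jar or directory a class came from, or a placeholder if the JVM does not report one. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)"; // typical for core JDK classes
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Inside the webapp one would pass e.g. org.apache.solr.handler.ContentStreamLoader,
        // the class named in the stack trace.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " <- " + locate(Class.forName(name)));
    }
}
```

If the interface or base class and its implementation resolve to jars from different Solr builds, that would be consistent with an AbstractMethodError, though it does not prove it.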

Thanks,
Chris



On 3/23/10 7:59 AM, "Giovanni Fernandez-Kincade" wrote:

Sorry for the late reply - been out of town for a couple of days.

From my solrconfig:



  ignored_
  text

  


-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Saturday, March 20, 2010 8:43 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

What's your configuration look like for the ExtractReqHandler?

On Mar 19, 2010, at 2:42 PM, Giovanni Fernandez-Kincade wrote:

> Yeah I've been trying that - I keep getting this error when indexing a PDF 
> with a trunk-build:
>
>   Apache Tomcat/5.5.27 - Error report
>   HTTP Status 500 - org.apache.solr.handler.
>   
> ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)
>   V  java.lang.AbstractMethodError: 
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) 
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) 
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)   
> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) 
>   at 
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>at 
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>at 
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>at java.lang.Thread.run(Unknown Source)  type  Status report   message 
>   
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>   java.lang.AbstractMethodError: 
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>at 
> org.apache.catalina.core.StandardWrappe

RE: PDFBox/Tika Performance Issues

2010-03-23 Thread Giovanni Fernandez-Kincade
Sorry for the late reply - been out of town for a couple of days. 

From my solrconfig:



  ignored_
  text

  


-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Saturday, March 20, 2010 8:43 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

What's your configuration look like for the ExtractReqHandler?

On Mar 19, 2010, at 2:42 PM, Giovanni Fernandez-Kincade wrote:

> Yeah I've been trying that - I keep getting this error when indexing a PDF 
> with a trunk-build:
> 
>   Apache Tomcat/5.5.27 - Error report
>   HTTP Status 500 - org.apache.solr.handler.
>   
> ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)
>   V  java.lang.AbstractMethodError: 
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>  
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>
>   at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)   
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) 
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) 
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)   
> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) 
>   at 
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>at 
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>at 
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>at java.lang.Thread.run(Unknown Source)  type  Status report   message 
>   
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>   java.lang.AbstractMethodError: 
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) 
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) 
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>at 
> org.apache.catalina.conn

Re: PDFBox/Tika Performance Issues

2010-03-20 Thread Grant Ingersoll
 
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>at java.lang.Thread.run(Unknown Source)  description   The server 
> encountered an internal error 
> (org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>   java.lang.AbstractMethodError: 
> org.apache.solr.handler.ContentStreamLoader.load(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
>at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
>at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
>at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) 
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) 
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)   
> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) 
>   at 
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>at 
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>at 
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>at java.lang.Thread.run(Unknown Source)  ) that prevented it from 
> fulfilling this request.Apache Tomcat/5.5.27   
> 
> 
> I'm trying to get a development environment going following these steps so I 
> can debug:
> http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse
> 
> 
> -Original Message-
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> Sent: Friday, March 19, 2010 1:46 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
> 
> Can you try trunk?
> 
> On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote:
> 
>> Solr Specification Version: 1.4.0.2009.10.14.08.05.59
>> Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59
>> Lucene Specification Version: 2.9.1-dev
>> Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13
>> Current Time: Fri Mar 19 13:11:31 EDT 2010
>> Server Start Time:Wed Mar 17 17:05:19 EDT 2010
>> 
>> -Original Message-
>> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
>> Sent: Friday, March 19, 2010 1:02 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: PDFBox/Tika Performance Issues
>> 
>> 
>> On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
>>> 
>>> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the 
>>> /Lib folder for my Solr Core, and renamed it to the name of the existing 
>>> Tika Jar (tika-0.3.jar).
>> 
>> What version are you on of Solr?  It's been a while since Solr Cell was on 
>> Tika 0.3,
> 
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



RE: PDFBox/Tika Performance Issues

2010-03-19 Thread Giovanni Fernandez-Kincade
/solr/request/SolrQueryRequest;Lorg/apache/solr/response/SolrQueryResponse;Lorg/apache/solr/common/util/ContentStream;)V
   at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341) 
  at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
   at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
   at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
   at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
   at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
   at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)   
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)   
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
   at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)   
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)   
at 
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
   at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
   at 
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
   at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
   at java.lang.Thread.run(Unknown Source)  ) that prevented it from fulfilling 
this request.Apache Tomcat/5.5.27   
  
  
I'm trying to get a development environment going following these steps so I 
can debug:
http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse


-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Friday, March 19, 2010 1:46 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Can you try trunk?

On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote:

> Solr Specification Version: 1.4.0.2009.10.14.08.05.59
> Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59
> Lucene Specification Version: 2.9.1-dev
> Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13
> Current Time: Fri Mar 19 13:11:31 EDT 2010
> Server Start Time:Wed Mar 17 17:05:19 EDT 2010
> 
> -Original Message-
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> Sent: Friday, March 19, 2010 1:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
>> 
>> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
>> folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
>> (tika-0.3.jar).
> 
> What version are you on of Solr?  It's been a while since Solr Cell was on 
> Tika 0.3,




Re: PDFBox/Tika Performance Issues

2010-03-19 Thread Grant Ingersoll
Can you try trunk?

On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote:

> Solr Specification Version: 1.4.0.2009.10.14.08.05.59
> Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59
> Lucene Specification Version: 2.9.1-dev
> Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13
> Current Time: Fri Mar 19 13:11:31 EDT 2010
> Server Start Time:Wed Mar 17 17:05:19 EDT 2010
> 
> -Original Message-
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> Sent: Friday, March 19, 2010 1:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
>> 
>> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
>> folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
>> (tika-0.3.jar).
> 
> What version are you on of Solr?  It's been a while since Solr Cell was on 
> Tika 0.3,




RE: PDFBox/Tika Performance Issues

2010-03-19 Thread Giovanni Fernandez-Kincade
Solr Specification Version: 1.4.0.2009.10.14.08.05.59
Solr Implementation Version: nightly exported - yonik - 2009-10-14 08:05:59
Lucene Specification Version: 2.9.1-dev
Lucene Implementation Version: 2.9.1-dev 824988 - 2009-10-13 21:47:13
Current Time: Fri Mar 19 13:11:31 EDT 2010
Server Start Time:Wed Mar 17 17:05:19 EDT 2010

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Friday, March 19, 2010 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues


On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
> 
> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
> (tika-0.3.jar).

What version are you on of Solr?  It's been a while since Solr Cell was on Tika 
0.3,


Re: PDFBox/Tika Performance Issues

2010-03-19 Thread Grant Ingersoll

On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
> 
> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
> (tika-0.3.jar).

What version are you on of Solr?  It's been a while since Solr Cell was on Tika 
0.3,

Re: PDFBox/Tika Performance Issues

2010-03-19 Thread Mattmann, Chris A (388J)
Ah, OK. Let me try and stand up a SolrCell instance and perform the same test 
you are and see if I can duplicate it.

Hopefully I can get back to you today on this...

Cheers,
Chris


On 3/19/10 7:43 AM, "Giovanni Fernandez-Kincade" wrote:

Yeah I had tested it previously and that works...

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Friday, March 19, 2010 12:04 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Let's try and isolate the problem. Can you try parsing the PDF file with 
tika-app as a standalone? Take your tika-app jar file then run java -jar 
tika-app-0.7-SNAPSHOT.jar -m /path/to/pdf/file

That should give you something like:

Content-Type: application/pdf
created: Thu Sep 06 00:41:55 PDT 2007
creator: TeX
producer: pdfeTeX-1.21a
resourceName: Dissertation.pdf

(e.g., this is what I got when I ran it on my Dissertation PDF file).

Let's start there - if that works, then there is something up with the 
integration into SolrCell, and we can start to figure that out...

Cheers,
Chris



On 3/17/10 8:06 AM, "Giovanni Fernandez-Kincade" wrote:

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an 
error, but the data doesn't get extracted. Using the same PDF with my previous 
/Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
>
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)
>

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

Into your Solr core /lib folder. Also you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by running mvn dependency:copy-dependencies
in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

>
> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above and
let's go from there.

Cheers,
Chris

> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
>
>
>
> Thanks Chris!
>
>
>
> I'll try the patch.
>
>
>
> -Original Message-
>
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
>
> Sent: Tuesday, March 16, 2010 5:37 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: PDFBox/Tika Performance Issues
>
>
>
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on

RE: PDFBox/Tika Performance Issues

2010-03-19 Thread Giovanni Fernandez-Kincade
Yeah I had tested it previously and that works...

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Friday, March 19, 2010 12:04 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Let's try and isolate the problem. Can you try parsing the PDF file with 
tika-app as a standalone? Take your tika-app jar file then run java -jar 
tika-app-0.7-SNAPSHOT.jar -m /path/to/pdf/file

That should give you something like:

Content-Type: application/pdf
created: Thu Sep 06 00:41:55 PDT 2007
creator: TeX
producer: pdfeTeX-1.21a
resourceName: Dissertation.pdf

(e.g., this is what I got when I ran it on my Dissertation PDF file).

Let's start there - if that works, then there is something up with the 
integration into SolrCell, and we can start to figure that out...

Cheers,
Chris



On 3/17/10 8:06 AM, "Giovanni Fernandez-Kincade" wrote:

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an 
error, but the data doesn't get extracted. Using the same PDF with my previous 
/Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
>
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)
>

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

Into your Solr core /lib folder. Also you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by typing mvn:copy-dependencies
in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mo
jo.html), along with the rest of the jar deps for tika-parsers and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

>
> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above
and let's go from there.

Cheers,
Chris

> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
>
>
>
> Thanks Chris!
>
>
>
> I'll try the patch.
>
>
>
> -Original Message-
>
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
>
> Sent: Tuesday, March 16, 2010 5:37 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: PDFBox/Tika Performance Issues
>
>
>
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
>
>
>
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release wh

Re: PDFBox/Tika Performance Issues

2010-03-18 Thread Mattmann, Chris A (388J)
Hi Giovanni,

Let's try and isolate the problem. Can you try parsing the PDF file with
tika-app as a standalone? Take your tika-app jar file, then run:

java -jar tika-app-0.7-SNAPSHOT.jar -m /path/to/pdf/file

That should give you something like:

Content-Type: application/pdf
created: Thu Sep 06 00:41:55 PDT 2007
creator: TeX
producer: pdfeTeX-1.21a
resourceName: Dissertation.pdf

(e.g., this is what I got when I ran it on my Dissertation PDF file).

Let's start there - if that works, then there is something up with the 
integration into SolrCell, and we can start to figure that out...

Cheers,
Chris



On 3/17/10 8:06 AM, "Giovanni Fernandez-Kincade" wrote:

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an 
error, but the data doesn't get extracted. Using the same PDF with my previous 
/Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
>
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)
>

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also, you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by running
mvn dependency:copy-dependencies in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers, and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files, and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

>
> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above
and let's go from there.

Cheers,
Chris

> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
>
>
>
> Thanks Chris!
>
>
>
> I'll try the patch.
>
>
>
> -Original Message-
>
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
>
> Sent: Tuesday, March 16, 2010 5:37 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: PDFBox/Tika Performance Issues
>
>
>
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
>
>
>
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release which should happen soon (hopefully next few
> weeks).
>
>
>
> Cheers,
>
> Chris
>
>
>
> [1] http://issues.apache.org/jira/browse/TIKA-380
>
> [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
>

RE: PDFBox/Tika Performance Issues

2010-03-17 Thread Giovanni Fernandez-Kincade
Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an 
error, but the data doesn't get extracted. Using the same PDF with my previous 
/Lib contents works fine.

Any other ideas? 

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar
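
When swapping jars by hand like this, one thing that may be worth ruling out is the same library sitting in the folder in two different versions. Below is a minimal sketch of such a check, assuming jar files follow the usual artifact-version.jar naming; the class name JarVersionCheck and the name-splitting heuristic are illustrative assumptions, not part of Solr or Tika. It would not catch a renamed artifact (for example tika-0.3.jar next to tika-core-0.7-SNAPSHOT.jar), since those have different artifact names.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups jar file names by artifact and reports
// artifacts that appear with more than one version.
final class JarVersionCheck {
    // Heuristic: the version starts at the first '-' followed by a digit,
    // e.g. "lucene-core-2.9.1-dev.jar" -> artifact "lucene-core", version "2.9.1-dev".
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    static Map<String, Set<String>> conflicts(List<String> fileNames) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                versions.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
        }
        // Keep only artifacts seen with two or more distinct versions.
        versions.values().removeIf(found -> found.size() < 2);
        return versions;
    }

    public static void main(String[] args) {
        // First argument: the lib directory to scan (defaults to the current directory).
        String[] names = new File(args.length > 0 ? args[0] : ".").list();
        List<String> jars = Arrays.asList(names == null ? new String[0] : names);
        conflicts(jars).forEach((artifact, found) ->
                System.out.println(artifact + " has multiple versions: " + found));
    }
}
```

Pointing it at the core's /Lib folder prints one line per artifact that is present in more than one version; no output means the heuristic found no such clash.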

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
> 
> 
> 
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
> 
> 2. Built this using Maven (mvn install)
> 

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also, you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by running
mvn dependency:copy-dependencies in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers, and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files, and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

> 
> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above
and let's go from there.

Cheers,
Chris

> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
> 
> 
> 
> Thanks Chris!
> 
> 
> 
> I'll try the patch.
> 
> 
> 
> -Original Message-
> 
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> 
> Sent: Tuesday, March 16, 2010 5:37 PM
> 
> To: solr-user@lucene.apache.org
> 
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> 
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
> 
> 
> 
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release which should happen soon (hopefully next few
> weeks).
> 
> 
> 
> Cheers,
> 
> Chris
> 
> 
> 
> [1] http://issues.apache.org/jira/browse/TIKA-380
> 
> [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
> 
> 
> 
> 
> 
> On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" wrote:
> 
> 
> 
> Originally 16 (the number of CPUs on the machine), but even with 5 threads
> it's not looking so hot.
> 
> 
> 
> -Original Message-
> 
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> 
> Sent: Tuesday, March 16, 2010 5:15 PM
> 
> To: solr-user@lucene.apache.org
> 
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> 
> Hmm, that is an ugly thing in PDFBox.  We should probably take this over to
> the PDFBox project.  How many threa

Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
> 
> 
> 
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
> 
> 2. Built this using Maven (mvn install)
> 

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also, you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by running
mvn dependency:copy-dependencies in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers, and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files, and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

> 
> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above
and let's go from there.

Cheers,
Chris

> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
> 
> 
> 
> Thanks Chris!
> 
> 
> 
> I'll try the patch.
> 
> 
> 
> -Original Message-
> 
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> 
> Sent: Tuesday, March 16, 2010 5:37 PM
> 
> To: solr-user@lucene.apache.org
> 
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> 
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
> 
> 
> 
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release which should happen soon (hopefully next few
> weeks).
> 
> 
> 
> Cheers,
> 
> Chris
> 
> 
> 
> [1] http://issues.apache.org/jira/browse/TIKA-380
> 
> [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
> 
> 
> 
> 
> 
> On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" wrote:
> 
> 
> 
> Originally 16 (the number of CPUs on the machine), but even with 5 threads
> it's not looking so hot.
> 
> 
> 
> -Original Message-
> 
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
> 
> Sent: Tuesday, March 16, 2010 5:15 PM
> 
> To: solr-user@lucene.apache.org
> 
> Subject: Re: PDFBox/Tika Performance Issues
> 
> 
> 
> Hmm, that is an ugly thing in PDFBox.  We should probably take this over to
> the PDFBox project.  How many threads are you indexing with?
> 
> 
> 
> FWIW, for that many documents, I might consider using Tika on the client side
> to save on a lot of network traffic.
> 
> 
> 
> -Grant
> 
> 
> 
> On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:
> 
> 
> 
>> I've been trying to bulk index about 11 million PDFs, and while profiling our
>> Solr instance, I noticed that all of the threads that are processing indexing
>> requests are constantly blocking each other during this call:
> 
>> 
> 
>> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> 
>> java.util.Collections$SynchronizedMap.get(Object)
> 
>> org.pdfbox.pdmodel.font.PDFont.getAFM()
> 
>> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> 
>> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> 
>> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> 
>> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> 
>> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources,
>> COSStream)
> 
>&

RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance. 
This is what I've tried so far (which was really just me guessing):



1. Got the latest version of the trunk code from 
http://svn.apache.org/repos/asf/lucene/tika/trunk

2. Built this using Maven (mvn install)

3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
(tika-0.3.jar).

4. Then I bounced my servlet server and tried indexing a document. The 
document was successfully indexed, and there were no errors logged as a result, 
but the PDF data does not appear to have been extracted (the field I used for 
map.content had an empty-string as a value).



What's the right approach to perform this patch?





-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, March 16, 2010 5:41 PM
To: solr-user@lucene.apache.org
Subject: RE: PDFBox/Tika Performance Issues



Thanks Chris!



I'll try the patch.



-Original Message-

From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]

Sent: Tuesday, March 16, 2010 5:37 PM

To: solr-user@lucene.apache.org

Subject: Re: PDFBox/Tika Performance Issues



Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may 
include a fix for the problem you're seeing.



See this discussion [2] on how to patch Tika to use the new PDFBox if you can't 
wait for the 0.7 release which should happen soon (hopefully next few weeks).



Cheers,

Chris



[1] http://issues.apache.org/jira/browse/TIKA-380

[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html





On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" wrote:



Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.



-Original Message-

From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll

Sent: Tuesday, March 16, 2010 5:15 PM

To: solr-user@lucene.apache.org

Subject: Re: PDFBox/Tika Performance Issues



Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?



FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.



-Grant



On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:



> I've been trying to bulk index about 11 million PDFs, and while profiling our 
> Solr instance, I noticed that all of the threads that are processing indexing 
> requests are constantly blocking each other during this call:

>

> http-8080-Processor39 [BLOCKED] CPU time: 9:35

> java.util.Collections$SynchronizedMap.get(Object)

> org.pdfbox.pdmodel.font.PDFont.getAFM()

> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)

> org.pdfbox.util.PDFStreamEngine.showString(byte[])

> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)

> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)

> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
> COSStream)

> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)

> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)

> org.pdfbox.util.PDFTextStripper.processPages(List)

> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)

> org.pdfbox.util.PDFTextStripper.getText(PDDocument)

> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
> Metadata)

> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
> Metadata)

> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
> Metadata)

> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
> Metadata)

> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
>  SolrQueryResponse, ContentStream)

> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>  SolrQueryResponse)

> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
> SolrQueryResponse)

> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
>  SolrQueryResponse)

> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
> SolrQueryResponse)

> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)

> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
> ServletResponse, FilterChain)

> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
>  ServletRes

RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
Thanks Chris! 

I'll try the patch. 

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, March 16, 2010 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may 
include a fix for the problem you're seeing.

See this discussion [2] on how to patch Tika to use the new PDFBox if you can't 
wait for the 0.7 release which should happen soon (hopefully next few weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html


On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" wrote:

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

> I've been trying to bulk index about 11 million PDFs, and while profiling our 
> Solr instance, I noticed that all of the threads that are processing indexing 
> requests are constantly blocking each other during this call:
>
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
> Has anyone run into this before? Any ideas o

Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
Guys, I think this is an issue with PDFBox and the version that Tika 0.6
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
include a fix for the problem you're seeing.

See this discussion [2] on how to patch Tika to use the new PDFBox if you can't
wait for the 0.7 release, which should happen soon (hopefully in the next few
weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html


On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade" wrote:

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

> I've been trying to bulk index about 11 million PDFs, and while profiling our 
> Solr instance, I noticed that all of the threads that are processing indexing 
> requests are constantly blocking each other during this call:
>
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
>
> Has anyone run into this before? Any ideas on how to reduce the contention?
>
> Thanks,
> Gio.
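
For readers following the trace quoted above: every blocked thread is waiting in Collections$SynchronizedMap.get(Object), reached from PDFont.getAFM(). A map wrapped with Collections.synchronizedMap takes a single lock on every call, reads included, so concurrent lookups against one shared map queue up behind each other. The sketch below illustrates the general alternative of a ConcurrentHashMap-backed cache. It is not PDFBox's code and not necessarily how PDFBox 1.0.0 addresses the issue; MetricsCache, its loader, and the string values are hypothetical names made up for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for a shared, read-mostly cache such as a
// font-metrics lookup. This is NOT PDFBox code; names are invented.
final class MetricsCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();
    private final Function<String, String> loader;

    MetricsCache(Function<String, String> loader) {
        this.loader = loader;
    }

    String get(String fontName) {
        // Fast path: ConcurrentHashMap.get does not lock, so readers of
        // already-cached keys never wait on each other.
        String cached = cache.get(fontName);
        if (cached != null) {
            return cached;
        }
        // Slow path: computeIfAbsent applies the loader at most once per key.
        return cache.computeIfAbsent(fontName, name -> {
            loads.incrementAndGet();
            return loader.apply(name);
        });
    }

    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) throws InterruptedException {
        MetricsCache cache = new MetricsCache(name -> "metrics for " + name);
        Thread[] workers = new Thread[5];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    cache.get("Helvetica");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("loader calls: " + cache.loadCount());
    }
}
```

The design choice here is only about the data structure: lookups of existing entries stay lock-free, and the expensive load is still done once per key. Whether that is applicable inside PDFBox itself is a question for the PDFBox project.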

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search





RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot. 

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

> I've been trying to bulk index about 11 million PDFs, and while profiling our 
> Solr instance, I noticed that all of the threads that are processing indexing 
> requests are constantly blocking each other during this call:
> 
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
>  Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
> TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
> 
> Has anyone run into this before? Any ideas on how to reduce the contention?
> 
> Thanks,
> Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Grant Ingersoll
Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
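
Note on the trace quoted in this thread: the [BLOCKED] frame is
java.util.Collections$SynchronizedMap.get(Object), called from
PDFont.getAFM(). A map wrapped with Collections.synchronizedMap takes a
single lock for every call, reads included, so many indexing threads
looking up the same font metrics will queue on that one lock. The sketch
below shows one general way to avoid that in a read-mostly cache. It is
not PDFBox code and not a patch for it: MetricsCache and its loader are
hypothetical names made up for illustration, and it assumes Java 8 or
later for computeIfAbsent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical read-mostly cache (illustration only, not PDFBox code).
 * Lookups go through ConcurrentHashMap, whose retrieval operations do
 * not block, instead of a Collections.synchronizedMap wrapper that
 * locks on every get().
 */
final class MetricsCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // assumed to be expensive, e.g. parsing a metrics file

    MetricsCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // Fast path: a plain read that does not contend with other readers.
        V value = cache.get(key);
        if (value != null) {
            return value;
        }
        // Slow path: the loader runs at most once per missing key.
        return cache.computeIfAbsent(key, loader);
    }
}
```

With a cache shaped like this, only the first lookup of a given key pays
for loading; later lookups from any number of threads are plain reads.
Whether this would remove the contention seen here depends on how the
PDFBox cache is actually used, which the thread does not show.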