Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread Tommaso Teofili
I attached a patch for Solr 1.4.1 release on
https://issues.apache.org/jira/browse/SOLR-1902 that made things work for
me.
In my case the strange behaviour came from copying the patched jars and war
into the dist directory but forgetting to update the war inside the
example/webapps directory (the one inside Jetty).
Hope this helps.
Tommaso
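
A stale copy like the one described above can be spotted by listing every jar and war that contains the class named in the stack trace and comparing checksums. The following is only a rough sketch: the function names are made up for illustration, and it assumes the class sits at the usual package path, either directly in a jar, under WEB-INF/classes/, or in a jar under WEB-INF/lib/ of a war.

```python
import hashlib
import io
import zipfile
from pathlib import Path

# Class named in the NoSuchMethodError; the entry path is assumed from its package name.
CLASS_ENTRY = "org/apache/solr/core/SolrResourceLoader.class"


def _scan(archive, label, entry, hits):
    """Record a checksum for each copy of `entry` found in one open archive."""
    names = set(archive.namelist())
    for candidate in (entry, "WEB-INF/classes/" + entry):
        if candidate in names:
            digest = hashlib.sha1(archive.read(candidate)).hexdigest()
            hits.append((label, digest))
    # A war keeps its jars under WEB-INF/lib/, so look one level deeper.
    for name in sorted(names):
        if name.startswith("WEB-INF/lib/") and name.endswith(".jar"):
            with zipfile.ZipFile(io.BytesIO(archive.read(name))) as inner:
                _scan(inner, label + "!" + name, entry, hits)


def find_class_copies(root, entry=CLASS_ENTRY):
    """Return (location, sha1) for every copy of the class under `root`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in (".jar", ".war") or not path.is_file():
            continue
        try:
            with zipfile.ZipFile(path) as outer:
                _scan(outer, str(path), entry, hits)
        except zipfile.BadZipFile:
            continue  # skip files that are not valid archives
    return hits
```

Running find_class_copies() over the servlet container's directory and over SOLR_HOME, then printing the results, shows at a glance whether the deployed war still carries a class file whose checksum differs from the freshly built one.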

2010/7/27 David Thibault dthiba...@esperion.com

 Alessandro & all,

 I was having the same issue with Tika crashing on certain PDFs.  I also
 noticed the bug where no content was extracted after upgrading Tika.

 When I went to the SOLR issue you link to below, I applied all the patches,
 downloaded the Tika 0.8 jars, restarted tomcat, posted a file via curl, and
 got the following error:
 SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
     at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
     at java.lang.Thread.run(Thread.java:619)

 This is really weird because I DID apply the SolrResourceLoader patch that
 adds the getClassLoader method.  I even verified it by opening up the
 JARs and looking at the class file in Eclipse...I can see the
 SolrResourceLoader.getClassLoader() method.

 Does anyone know why it can't find the method?  After patching the source I
 ran ant clean dist in the base directory of the Solr source tree and
 everything looked like it compiled (BUILD SUCCESSFUL).  Then I copied all
 the jars from dist/ and all the library dependencies from
 contrib/extraction/lib/ into my SOLR_HOME.  After restarting tomcat,
 everything in the logs looked good.

 I'm stumped.  It would be very nice to have a Solr implementation using the
 newest versions of PDFBox & Tika and actually have content being
 extracted...=)

 Best,
 Dave



RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread David Thibault
Yesterday I did get this working with version 4.0 from trunk.  I haven't fully 
tested it yet, but the content doesn't come through blank anymore, so that's 
good.  Would it be more stable to stick with 1.4.1 and your patch to get to 
Tika 0.8, or to stick with the 4.0 trunk version?

Best,
Dave


Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread Alessandro Benedetti
In my opinion, the patched 1.4.1 version is the more stable choice, at least
until 4.0 is released.


RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread David Thibault
Thanks, I'll try that then. I kind of figured that'd be the answer, but after
fighting with Solr & ExtractingRequestHandler for 2 days I also just wanted to
be done with it once it started working with 4.0...=)  However, stability would
be better in the long run.

Best,
Dave


Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread Tommaso Teofili
That was my feeling too :-) so I went for the trunk to get things working
quickly. But I also have to consider which version is best, since I am going
to deploy it in an enterprise environment in the near future, and choosing
the right version is an important step.
I am quite new to Solr, but I agree with Alessandro that a slightly patched
release should in theory be more stable than the trunk, which gets many
updates weekly (and daily).
Cheers,
Tommaso


RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread David Thibault
Tommaso,

I used your patch and tried it with the 1.4.1 solr.war from a fresh 1.4.1
distribution, and it still gave me that NoSuchMethodError.  However, when I
tried it with the newly patched and compiled apache-solr-1.4.2-dev.war file, it
worked.  I think I tried that before and it didn't work then.

In any case, thanks for the patch and the advice.  Looks like now it's working 
for me.

Best,
Dave





Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-27 Thread Alessandro Benedetti
Hi Jon,
Over the last few days we ran into the same problem.
With stock Solr 1.4.1 (Tika 0.4), no content can be extracted from some PDF
files, and for others Solr throws an exception during indexing.
You must:
Update the Tika libraries (in /contrib/extraction/lib) to the tika-core 0.8
snapshot and tika-parsers 0.8.
Update PDFBox and all related libraries.
After that, patch Solr 1.4.1 following this issue:
https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
This is the first way to solve the problem.

With Solr 1.4.1 plus the Tika 0.8 snapshot and updated PDFBox, no exception is
thrown during indexing, but no content is extracted.
With the latest Solr trunk (plus the Tika 0.8 snapshot and updated PDFBox)
everything seems to work, but we don't know how stable it is!
I hope this gives you a clear picture of the issue.
Best Regards
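
To tell "no exception but no content" apart from a working setup, one option is to ask the extracting handler to return the extracted text without indexing it. The sketch below is only an illustration: the base URL and the /update/extract handler path are assumptions taken from a default example configuration and must match your own solrconfig.xml, and the function names are hypothetical. The extractOnly and wt request parameters belong to Solr CELL and Solr respectively.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(pdf_bytes, solr_url="http://localhost:8983/solr"):
    """Build a POST that asks Solr CELL to return extracted text, not index it."""
    # extractOnly=true makes the handler send the extracted content back.
    params = urlencode({"extractOnly": "true", "wt": "json"})
    return Request(
        solr_url.rstrip("/") + "/update/extract?" + params,  # assumed handler path
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def extract_text(pdf_path, solr_url="http://localhost:8983/solr"):
    """Send one PDF to a running Solr instance and return the raw response body."""
    with open(pdf_path, "rb") as fh:
        request = build_extract_request(fh.read(), solr_url)
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

If the response for a known-good PDF comes back with an empty body section, the Tika/PDFBox upgrade is in place but the extraction problem described above is still present.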



2010/7/26 Sharp, Jonathan jsh...@coh.org


 Every so often I need to index new batches of scanned PDFs and occasionally
 Adobe's OCR can't recognize the text in a couple of these documents. In
 these situations I would like to type in a small amount of text onto the
 document and have it be extracted by Solr CELL.

 Adobe Pro 9 has a number of different ways to add text directly to a PDF
 file:

 *Typewriter
 *Sticky Note
 *Callout boxes
 *Text boxes

 I tried indexing documents with each of these text additions with Solr
 1.4.1 + Solr CELL but can't extract the text in any of these boxes.

 If someone has modified their Solr CELL installation to use more recent
 versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can comment
 on whether newer versions can pull the text out of any of these various text
 boxes I'd appreciate that very much.

 -Jon








-- 
--

Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-27 Thread David Thibault
Alessandro & all,

I was having the same issue with Tika crashing on certain PDFs.  I also noticed 
the bug where no content was extracted after upgrading Tika.  

When I went to the SOLR issue you link to below, I applied all the patches, 
downloaded the Tika 0.8 jars, restarted tomcat, posted a file via curl, and got 
the following error:
SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
    at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
    at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
    at java.lang.Thread.run(Thread.java:619)

This is really weird because I DID apply the SolrResourceLoader patch that adds
the getClassLoader method.  I even verified this by opening up the JARs and
looking at the class file in Eclipse...I can see the
SolrResourceLoader.getClassLoader() method.
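
A class file that looks right in Eclipse does not prove that the running webapp loaded that same file. As a minimal sketch (it is not part of the SOLR-1902 patch), the code below asks the JVM where a class was actually loaded from and whether the loaded version exposes a given public no-argument method. The class name ClassOrigin and its helper methods are made up for illustration, and the defaults use java.lang.String only so the sketch runs without Solr on the class path. It assumes the patched getClassLoader() is public.

```java
import java.security.CodeSource;

class ClassOrigin {
    /** Returns the jar or directory the class was loaded from, or a note if the JVM reports none. */
    public static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Core JDK classes typically have no code source.
        return src == null ? "(bootstrap class path)" : String.valueOf(src.getLocation());
    }

    /** True if the loaded class declares or inherits a public method with this name and no parameters. */
    public static boolean hasPublicNoArgMethod(Class<?> c, String name) {
        try {
            c.getMethod(name);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults are placeholders so the sketch runs anywhere.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "length";
        Class<?> c = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(c));
        System.out.println(methodName + "() present: " + hasPublicNoArgMethod(c, methodName));
    }
}
```

To say anything about this error, the two helpers would have to be called from code running inside the same webapp (so that Tomcat's webapp class loader resolves the class), with org.apache.solr.core.SolrResourceLoader and getClassLoader as the class and method names.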

Does anyone know why it can't find the method?  After patching the source I ran
"ant clean dist" in the base directory of the Solr source tree and everything
appeared to compile (BUILD SUCCESSFUL).  Then I copied all the jars from
dist/ and all the library dependencies from contrib/extraction/lib/ into my
SOLR_HOME.  After restarting Tomcat, everything in the logs looked good.
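
One possible cause of a NoSuchMethodError after a successful rebuild is that the servlet container still loads an older copy of the class from another jar or war. The following is a rough sketch, not a Solr tool, that walks a directory tree and lists every jar or war containing a given class, including jars nested inside a war (for example under WEB-INF/lib). The class name FindClassCopies and its methods are hypothetical, and the default class name is simply the one from the stack trace above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class FindClassCopies {
    public static boolean isArchive(String name) {
        String n = name.toLowerCase();
        return n.endsWith(".jar") || n.endsWith(".war");
    }

    /** Reads one archive from a stream, recording matching entries and recursing into nested archives. */
    public static void scan(InputStream in, String label, String entryName, List<String> hits)
            throws IOException {
        // Deliberately not closed here: closing would also close the enclosing stream.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry e;
        while ((e = zip.getNextEntry()) != null) {
            String name = e.getName();
            if (name.equals(entryName) || name.endsWith("/" + entryName)) {
                hits.add(label + "!/" + name);
            } else if (isArchive(name)) {
                // The current entry is itself an archive; read it as a nested zip stream.
                scan(zip, label + "!/" + name, entryName, hits);
            }
        }
    }

    /** Lists every location under root where the compiled class appears inside a jar or war. */
    public static List<String> find(Path root, String className) throws IOException {
        String entryName = className.replace('.', '/') + ".class";
        List<Path> archives;
        try (Stream<Path> walk = Files.walk(root)) {
            archives = walk.filter(Files::isRegularFile)
                           .filter(p -> isArchive(p.getFileName().toString()))
                           .sorted()
                           .collect(Collectors.toList());
        }
        List<String> hits = new ArrayList<>();
        for (Path p : archives) {
            try (InputStream in = Files.newInputStream(p)) {
                scan(in, p.toString(), entryName, hits);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String className = args.length > 1 ? args[1] : "org.apache.solr.core.SolrResourceLoader";
        for (String hit : find(root, className)) {
            System.out.println(hit);
        }
    }
}
```

Running it against both the Tomcat installation directory and SOLR_HOME would show whether more than one archive ships the class. If so, the copy inside the deployed war is worth comparing against the freshly built one.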

I'm stumped.  It would be very nice to have a Solr implementation using the
newest versions of PDFBox & Tika and actually have content being extracted...=)

Best,
Dave

