This could be due to the fetch size limit. The URL you gave as an example is
163395 bytes long, whereas the default limit in Nutch is 65536 bytes:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
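
If truncation is indeed the cause, one way to rule it out would be to
override the limit in conf/nutch-site.xml and refetch. This is only a
sketch: it assumes the usual setup where nutch-site.xml overrides
nutch-default.xml, and the description text is my own wording. Per the
description above, a negative value switches truncation off; a larger
positive value such as 1048576 would also cover this particular file.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- negative value = no truncation, as stated in the default description -->
  <value>-1</value>
  <description>Do not truncate content downloaded over http.
  </description>
</property>
```

The property has to sit inside the <configuration> element of that file,
and the page needs to be fetched again for the change to have any effect.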

Calling Tika directly on the URL is a good way of checking where the problem
comes from, e.g.

java -jar /usr/local/bin/tika-0.7/tika-app/target/tika-app-0.7.jar http://www.egamaster.com/datos/politica_fr.pdf

works fine here, which suggests that Tika itself can handle the file.

J.

On 26 November 2010 10:57, Saphira <[email protected]> wrote:

>
> That's what it says; I think it is always the same error.
>
>
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:79)
>        at
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:1)
>        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>        at java.util.concurrent.FutureTask.run(Unknown Source)
>        at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException:
> OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be
> instantiated
>        at
> org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>        at
> org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>        at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>        ... 7 more
> Caused by: java.lang.ClassCastException:
> org.pdfbox.util.operator.ShowTextGlyph cannot be cast to
> org.apache.pdfbox.util.operator.OperatorProcessor
>        at
> org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>        ... 10 more
> 2010-11-26 08:27:42,113 WARN  fetcher.Fetcher - Error parsing:
> http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): Unable to
> extract PDF content
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Unable-to-extract-PDF-content-tp1971600p1972145.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
