Could be due to the fetch size limit. The URL you gave as an example is 163395 bytes long, whereas the default value in Nutch is
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

Calling Tika directly on a URL is a good way of checking where the problem comes from, e.g.

/usr/local/bin/tika-0.7/tika-app/target/tika-app-0.7.jar http://www.egamaster.com/datos/politica_fr.pdf

works fine.

J.

On 26 November 2010 10:57, Saphira <[email protected]> wrote:
>
> that's what it says, I think is always the same error
>
>
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:79)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:1)
>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>     at java.util.concurrent.FutureTask.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
>     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>     at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>     at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>     ... 7 more
> Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
>     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>     ... 10 more
> 2010-11-26 08:27:42,113 WARN fetcher.Fetcher - Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): Unable to extract PDF content
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-extract-PDF-content-tp1971600p1972145.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
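
If the fetch size limit mentioned above does turn out to be the cause, one way to test it is to override http.content.limit in conf/nutch-site.xml, which takes precedence over the defaults. This is only a sketch: the value 200000 is an assumption, picked because it is larger than the 163395 bytes of the example PDF, and per the property description a negative value should disable truncation altogether.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of the Nutch default settings -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Assumed value for illustration: anything above 163395 bytes,
         the size of the example PDF. A negative value (e.g. -1)
         means no truncation at all, per the property description. -->
    <value>200000</value>
  </property>
</configuration>
```

Truncation should happen when the content is downloaded, so the document most likely needs to be fetched again after changing the setting before the parse result can change.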

