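I think this error can happen when you've got multiple versions of fontbox (or PDFBox itself) on your classpath. The trace below shows a class from the old org.pdfbox package being cast to a type from the newer org.apache.pdfbox package.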

Perhaps you're including both Tika-based parsing and the older Nutch PDF parser?

See http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/ for more information.
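
One way to check for this is to search your jars for the two classes named in the stack trace. Below is a minimal sketch, not a tested Nutch tool: the class name FindClassInJars and the idea of pointing it at your Nutch install directory are my own assumptions, and the only names taken from the trace are the two class entries. It walks a directory recursively and prints every jar that contains each entry.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper: lists the jars under a directory that contain a given class file.
public class FindClassInJars {

    /** Returns the jars under dir (searched recursively) that contain the given entry. */
    public static List<File> jarsContaining(File dir, String entryName) throws IOException {
        List<File> hits = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return hits; // not a directory, or not readable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                hits.addAll(jarsContaining(child, entryName));
            } else if (child.getName().endsWith(".jar")) {
                try (JarFile jar = new JarFile(child)) {
                    if (jar.getJarEntry(entryName) != null) {
                        hits.add(child);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        File root = new File(args.length > 0 ? args[0] : ".");
        // The two classes named in the ClassCastException.
        String[] entries = {
            "org/pdfbox/util/operator/ShowTextGlyph.class",
            "org/apache/pdfbox/util/operator/OperatorProcessor.class"
        };
        for (String entry : entries) {
            System.out.println(entry);
            for (File jar : jarsContaining(root, entry)) {
                System.out.println("    " + jar.getPath());
            }
        }
    }
}
```

If the first entry shows up in one jar and the second in a different jar, then both the old and the new PDFBox are installed, and removing or disabling whichever one you don't need would be the thing to try.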

-- Ken

On Nov 26, 2010, at 2:57am, Saphira wrote:


That's what it says. I think it is always the same error:


org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:79)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:1)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
        at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
        at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
        at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
        ... 7 more
Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
        at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
        ... 10 more
2010-11-26 08:27:42,113 WARN  fetcher.Fetcher - Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): Unable to extract PDF content

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-extract-PDF-content-tp1971600p1972145.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




