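I think this error can happen when you've got multiple versions of
fontbox/PDFBox on your classpath. The ClassCastException at the bottom
of your trace shows a class from the old org.pdfbox package being cast
to a type in the newer org.apache.pdfbox package, which suggests that
jars from both are being loaded.
Perhaps you're including both the Tika-based parser and the older Nutch
PDF parser?
See http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/
for more information.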
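
If it helps to confirm this, here is a minimal sketch (not something
from Nutch or Tika) that prints which jar, if any, each of the two
classes named in your trace is loaded from. The class name
ClasspathCheck and the helper locate() are hypothetical names I made up
for illustration. It assumes you run it with the same classpath your
parse job uses; Nutch plugins get their own classloaders, so treat the
output as a hint rather than proof.

```java
import java.security.CodeSource;

final class ClasspathCheck {

    /**
     * Returns the location a class is loaded from, or null if the class
     * cannot be found or linked on the current classpath.
     */
    static String locate(String className) {
        try {
            // 'false' means: load the class but do not run its static initializers.
            Class<?> c = Class.forName(className, false,
                    ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes typically have no code source.
            return src == null ? "(bootstrap/unknown)"
                               : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The two class names are taken from the ClassCastException in the trace.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.operator.OperatorProcessor"
        };
        for (String name : names) {
            String where = locate(name);
            System.out.println(name + " -> "
                    + (where == null ? "not found" : where));
        }
    }
}
```

If both lines report a jar, then the old and the new PDFBox are both
visible, and removing the plugin or jar that drags in the old one would
be the first thing I'd try.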
-- Ken
On Nov 26, 2010, at 2:57am, Saphira wrote:
That's what it says; I think it is always the same error:
org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:79)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:1)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
    at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    ... 7 more
Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
    ... 10 more
2010-11-26 08:27:42,113 WARN fetcher.Fetcher - Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): Unable to extract PDF content
--
View this message in context:
http://lucene.472066.n3.nabble.com/Unable-to-extract-PDF-content-tp1971600p1972145.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g