Could be this... https://issues.apache.org/jira/browse/PDFBOX-455
... but the stack trace differs a bit.

The above issue is flagged with a fix version of 0.8.0-incubator.
It looks like tika-parsers v0.5 referenced a prior version (0.7.3).

On Mon, Jun 18, 2012 at 6:12 PM, Mark Kerzner <[email protected]> wrote:
> Hi,
>
> I get this exception (see stack trace below), even though I am seemingly
> catching it in my code, which says this
>
> public void parse(String fileName, Metadata metadata) {
>     TikaInputStream inputStream = null;
>     try {
>         // the given input stream is closed by the parseToString method (see Tika documentation)
>         // we will close it just in case :)
>         inputStream = TikaInputStream.get(new File(fileName));
>         String text = tika.parseToString(inputStream, metadata); // --------_ exception happens here
>         metadata.set(DocumentMetadataKeys.DOCUMENT_TEXT, text);
>     } catch (Exception e) {
>         // the show must still go on
>         History.appendToHistory("Exception: " + e.getMessage());
>         metadata.set(DocumentMetadataKeys.PROCESSING_EXCEPTION, e.getMessage());
>     } catch (OutOfMemoryError m) {
>         History.appendToHistory("Memory Exception: " + m.getMessage());
>         metadata.set(DocumentMetadataKeys.PROCESSING_EXCEPTION, m.getMessage());
>     } finally {
>         if (inputStream != null) {
>             try {
>                 inputStream.close();
>             } catch (Exception e) {
>                 e.printStackTrace(System.out);
>             }
>         }
>     }
> }
>
>
> 2012-06-19 00:47:06,425 WARN org.apache.pdfbox.util.PDFStreamEngine:
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSFloat cannot be cast to org.apache.pdfbox.cos.COSName
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSFloat cannot be cast to org.apache.pdfbox.cos.COSName
>     at org.apache.pdfbox.util.operator.SetGraphicsStateParameters.process(SetGraphicsStateParameters.java:48)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:96)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
>     at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:82)
>     at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>     at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:380)
>     at org.freeeed.main.DocumentParser.parse(DocumentParser.java:33)
>
>
> That's testing on Enron data set

-- 
Jon Gorrono
PGP Key: 0x5434509D - http{pgp.mit.edu:11371/pks/lookup?search=0x5434509D&op=index}
GSWoT Introducer - {GSWoT:US75 5434509D Jon P. Gorrono <jpgorrono - www.gswot.org>}
http{middleware.ucdavis.edu}
