Hi,

I get this exception (see the stack trace below) even though I appear to be
catching it in my code, which looks like this:

    public void parse(String fileName, Metadata metadata) {
        TikaInputStream inputStream = null;
        try {
            // the given input stream is closed by the parseToString method
            // (see Tika documentation); we will close it just in case :)
            inputStream = TikaInputStream.get(new File(fileName));
            String text = tika.parseToString(inputStream, metadata); // <-- exception happens here
            metadata.set(DocumentMetadataKeys.DOCUMENT_TEXT, text);
        } catch (Exception e) {
            // the show must still go on
            History.appendToHistory("Exception: " + e.getMessage());
            metadata.set(DocumentMetadataKeys.PROCESSING_EXCEPTION, e.getMessage());
        } catch (OutOfMemoryError m) {
            History.appendToHistory("Memory Exception: " + m.getMessage());
            metadata.set(DocumentMetadataKeys.PROCESSING_EXCEPTION, m.getMessage());
        } finally {
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (Exception e) {
                    e.printStackTrace(System.out);
                }
            }
        }
    }


2012-06-19 00:47:06,425 WARN org.apache.pdfbox.util.PDFStreamEngine: java.lang.ClassCastException: org.apache.pdfbox.cos.COSFloat cannot be cast to org.apache.pdfbox.cos.COSName
java.lang.ClassCastException: org.apache.pdfbox.cos.COSFloat cannot be cast to org.apache.pdfbox.cos.COSName
    at org.apache.pdfbox.util.operator.SetGraphicsStateParameters.process(SetGraphicsStateParameters.java:48)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:96)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:82)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    at org.freeeed.main.DocumentParser.parse(DocumentParser.java:33)



This happens while testing on the Enron data set.
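
One possible reading, which I have not confirmed: the log line starts with
"WARN org.apache.pdfbox.util.PDFStreamEngine", which suggests the library
logged the exception itself instead of throwing it, in which case my catch
block would never see it. Here is a minimal sketch of that situation. All of
the names in it (SwallowedExceptionDemo, libraryParse, callerSawException,
LOG) are hypothetical stand-ins and not PDFBox or Tika code.

```java
import java.util.ArrayList;
import java.util.List;

public class SwallowedExceptionDemo {
    // Stand-in for a library's log output; hypothetical, not a real logger.
    static final List<String> LOG = new ArrayList<>();

    // Hypothetical library method: it catches its own failure, records a
    // warning, and returns normally, so nothing propagates to the caller.
    static String libraryParse(Object operand) {
        try {
            // Throws ClassCastException when operand is not a String.
            return (String) operand;
        } catch (Exception e) {
            LOG.add("WARN " + e.getClass().getName());
            return "";
        }
    }

    // Caller with its own catch block, similar in shape to parse() above.
    // Returns true only if an exception actually reached this method.
    static boolean callerSawException(Object operand) {
        try {
            libraryParse(operand);
            return false;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) {
        boolean seen = callerSawException(Float.valueOf(1.0f));
        System.out.println("caller saw exception: " + seen);
        System.out.println("library log: " + LOG);
    }
}
```

If that is what is going on here, the stack trace is only a logged warning and
parseToString still returns, so the text and metadata would be set as usual.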
