Hi

I just tried Tika 1.3 (I can see that it upgrades PDFBox to 1.7.1), but I am 
still getting the same error.

In both cases (Tika 1.2 and Tika 1.3), when I replace tika-parsers.jar with 
the one from 1.0, parsing works fine.

I am not sure whether the problem lies in Tika or in PDFBox. Any ideas?


org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2a15cd
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at com.arisglobal.agcommon.agsolr.util.TikaIndexTest.main(TikaIndexTest.java:37)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
        at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
        at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:178)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:72)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 3 more
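
Reading the trace: the ClassCastException is raised inside 
PDAnnotationLink.getAction, called from PDF2XHTML.endPage, which suggests that 
a link annotation in the failing PDFs stores a string where a dictionary is 
expected, and that the value is cast without a type check. The following is 
only a minimal sketch of that failure mode, not PDFBox code. The types 
CosValue, CosText and CosDict, the "A" key and both helper methods are 
hypothetical stand-ins introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class AnnotationActionSketch {

    // Hypothetical stand-ins for a PDF object model (not PDFBox classes).
    interface CosValue {}

    static final class CosText implements CosValue {
        final String text;
        CosText(String text) { this.text = text; }
    }

    static final class CosDict implements CosValue {
        final Map<String, CosValue> entries = new HashMap<>();
    }

    // Blind cast: throws ClassCastException when the entry is not a dictionary.
    static CosDict actionBlindCast(CosDict annotation) {
        return (CosDict) annotation.entries.get("A");
    }

    // Defensive variant: returns null when the entry is missing or has another type.
    static CosDict actionChecked(CosDict annotation) {
        CosValue value = annotation.entries.get("A");
        return (value instanceof CosDict) ? (CosDict) value : null;
    }

    public static void main(String[] args) {
        // Assumed shape of the malformed annotation: a string under the action key.
        CosDict malformed = new CosDict();
        malformed.entries.put("A", new CosText("http://example.org"));

        System.out.println("checked lookup: " + actionChecked(malformed));
        try {
            actionBlindCast(malformed);
        } catch (ClassCastException e) {
            System.out.println("blind cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

If that assumption holds, the checked variant is the kind of guard that would 
let such a document parse with the malformed link skipped, instead of the 
whole parse being aborted.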


-Phani
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, February 12, 2013 4:03 PM
To: [email protected]; Phani Kumar Samudrala
Subject: RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot 
be cast to org.apache.pdfbox.cos.COSDictionary

Hi

Can you try Tika 1.3? It upgraded PDFBox from 1.7.0 to 1.7.1 and that fixed 
many issues with PDF parsing.

Cheers,


-----Original message-----
> From: Phani Kumar Samudrala <[email protected]>
> Sent: Tue 12-Feb-2013 11:30
> To: [email protected]
> Subject: Tika 1.2 PDF parse error  -  org.apache.pdfbox.cos.COSString cannot 
> be cast to org.apache.pdfbox.cos.COSDictionary
>
>
> I am using the Tika 1.2 Java API to extract text from a PDF, and I am getting 
> the following exception. The error occurs only for some PDF documents; others 
> parse fine. I could not figure out the reason. When I tried Tika 1.1, it 
> worked fine. Please let me know if any of you have seen this error and how to 
> fix it.
>
> Here is the exception:
>
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
> Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
>       at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
>       at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 3 more
>
>
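> Here is the code snippet in Java:
>
>
> import java.io.ByteArrayOutputStream;
> import java.io.File;
> import java.io.InputStream;
> import java.net.URL;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
>
> String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
> File file = new File(fileString);
> URL url = file.toURI().toURL();
>
> ParseContext context = new ParseContext();
> Detector detector = new DefaultDetector();
> Parser parser = new AutoDetectParser(detector);
> Metadata metadata = new Metadata();
> // Register the parser in the context so embedded documents are parsed too
> context.set(Parser.class, parser); // PPT, Word, XLSX -- PDF, HTML
>
> // Extracted body text is written to this in-memory stream
> ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
> InputStream input = TikaInputStream.get(url, metadata);
> ContentHandler handler = new BodyContentHandler(outputstream);
> parser.parse(input, handler, metadata, context);
>
> input.close();
> outputstream.close();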
>
>
> Thanks
>
> ________________________________
>
>
> Disclaimer: This transmission, including attachments, is confidential, 
> proprietary, and may be privileged. It is intended solely for the intended 
> recipient. If you are not the intended recipient, you have received this 
> transmission in error and you are hereby advised that any review, disclosure, 
> copying, distribution, or use of this transmission, or any of the information 
> included therein, is unauthorized and strictly prohibited. If you have 
> received this transmission in error, please immediately notify the sender by 
> reply and permanently delete all copies of this transmission and its 
> attachments.
>
>

