Hi

Can you try Tika 1.3? It upgraded PDFBox from 1.7.0 to 1.7.1 and that fixed 
many issues with PDF parsing.
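
If your project is built with Maven (an assumption on my part, your message 
does not say how you pull in Tika), the upgrade could be a version bump along 
these lines. This is only a sketch: adjust it to whichever Tika artifacts you 
actually depend on, and PDFBox 1.7.1 should then come in transitively.

```xml
<!-- Sketch only: assumes a Maven build that depends on tika-parsers -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.3</version>
</dependency>
```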

Cheers,
 
 
-----Original message-----
> From:Phani Kumar Samudrala <[email protected]>
> Sent: Tue 12-Feb-2013 11:30
> To: [email protected]
> Subject: Tika 1.2 PDF parse error  -  org.apache.pdfbox.cos.COSString cannot 
> be cast to org.apache.pdfbox.cos.COSDictionary
> 
> 
> I am using the Tika 1.2 Java API to extract text from PDFs, and I am getting 
> the following exception. The error occurs only for some PDF documents; others 
> parse fine. I couldn't figure out the reason for this. With Tika 1.1 it works 
> fine. Please let me know if any of you have seen this error and how to fix 
> it.
> 
> Here is the exception:
> 
> 
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
> Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
>       at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
>       at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 3 more
> 
> 
> Here is the Java code snippet:
> 
> 
> String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
> File file = new File(fileString);
> URL url = file.toURI().toURL();
> 
> ParseContext context = new ParseContext();
> Detector detector = new DefaultDetector();
> Parser parser = new AutoDetectParser(detector);
> Metadata metadata = new Metadata();
> context.set(Parser.class, parser); // PPT, Word, XLSX, PDF, HTML
> ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
> InputStream input = TikaInputStream.get(url, metadata);
> ContentHandler handler = new BodyContentHandler(outputstream);
> parser.parse(input, handler, metadata, context);
> 
> input.close();
> outputstream.close();
> 
> 
> Thanks
> 
> ________________________________
> 
> 
> Disclaimer: This transmission, including attachments, is confidential, 
> proprietary, and may be privileged. It is intended solely for the intended 
> recipient. If you are not the intended recipient, you have received this 
> transmission in error and you are hereby advised that any review, disclosure, 
> copying, distribution, or use of this transmission, or any of the information 
> included therein, is unauthorized and strictly prohibited. If you have 
> received this transmission in error, please immediately notify the sender by 
> reply and permanently delete all copies of this transmission and its 
> attachments.
> 
> 
