Hi,

Can you try Tika 1.3? It upgraded PDFBox from 1.7.0 to 1.7.1 and that fixed many issues with PDF parsing.
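While you test the upgrade, it may help to tell this particular failure apart from other parse errors, since the TikaException in your trace only wraps the real problem (the ClassCastException raised inside PDFBox). Below is a minimal sketch that walks the cause chain using only the JDK. The class name RootCause and the method isPdfBoxCastBug are hypothetical helpers made up for illustration, not part of Tika or PDFBox, and the demo uses a plain Exception as a stand-in for TikaException so that it runs without Tika on the classpath.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class RootCause {
    private RootCause() {}

    /** Follows getCause() from a non-null throwable to the innermost one, stopping if the chain loops. */
    static Throwable of(Throwable t) {
        Map<Throwable, Boolean> seen = new IdentityHashMap<>();
        Throwable current = t;
        while (current.getCause() != null && seen.put(current, Boolean.TRUE) == null) {
            current = current.getCause();
        }
        return current;
    }

    /** Assumption: the failure is recognised by its root ClassCastException mentioning COSDictionary. */
    static boolean isPdfBoxCastBug(Throwable t) {
        Throwable root = of(t);
        return root instanceof ClassCastException
                && root.getMessage() != null
                && root.getMessage().contains("COSDictionary");
    }

    public static void main(String[] args) {
        // Stand-in for the TikaException thrown by parser.parse(...) in the reported trace.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from PDFParser",
                new ClassCastException("org.apache.pdfbox.cos.COSString cannot be cast to "
                        + "org.apache.pdfbox.cos.COSDictionary"));
        System.out.println("Known PDFBox cast failure: " + isPdfBoxCastBug(wrapped));
    }
}
```

In your own code you would call RootCause.isPdfBoxCastBug(e) from a catch block around parser.parse(...), for example to log which PDFs hit this error and retry them after the upgrade.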
Cheers,

-----Original message-----
> From: Phani Kumar Samudrala <[email protected]>
> Sent: Tue 12-Feb-2013 11:30
> To: [email protected]
> Subject: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot
> be cast to org.apache.pdfbox.cos.COSDictionary
>
> I am using the Tika 1.2 Java API to extract text from a PDF, and I am getting
> the following exception. I get this error for some PDF documents only; for
> other PDFs it works fine. I couldn't figure out a reason for this. When I
> tried using Tika 1.1 it worked fine. Please let me know if any of you have
> seen this error and how to fix it.
>
> Here is the exception:
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
> Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
>     at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 3 more
>
> Here is the code snippet in Java:
>
> String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
> File file = new File(fileString);
> URL url = file.toURI().toURL();
>
> ParseContext context = new ParseContext();
> Detector detector = new DefaultDetector();
> Parser parser = new AutoDetectParser(detector);
> Metadata metadata = new Metadata();
> context.set(Parser.class, parser);
> //PPt,word,xlsx-- pdf,html
> ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
> InputStream input = TikaInputStream.get(url, metadata);
> ContentHandler handler = new BodyContentHandler(outputstream);
> parser.parse(input, handler, metadata, context);
>
> input.close();
> outputstream.close();
>
> Thanks
>
> ________________________________
>
> Disclaimer: This transmission, including attachments, is confidential,
> proprietary, and may be privileged. It is intended solely for the intended
> recipient. If you are not the intended recipient, you have received this
> transmission in error and you are hereby advised that any review, disclosure,
> copying, distribution, or use of this transmission, or any of the information
> included therein, is unauthorized and strictly prohibited. If you have
> received this transmission in error, please immediately notify the sender by
> reply and permanently delete all copies of this transmission and its
> attachments.
