When I open the file in Acrobat Reader, it says "You are viewing this document in PDF/A mode".
I am not sure what PDF/A mode is. Could it have anything to do with the issue?

-----Original Message-----
From: Phani Kumar Samudrala [mailto:[email protected]]
Sent: Tuesday, February 12, 2013 4:59 PM
To: Markus Jelsma; [email protected]
Subject: RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Hi,

I just tried Tika 1.3 (and I see that it upgraded PDFBox to 1.7.1), but I am getting the same error. With either Tika 1.2 or Tika 1.3, when I replace tika-parsers.jar with the one from 1.0, parsing works fine. I am not sure whether the problem lies in Tika or in PDFBox. Any idea?

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2a15cd
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.arisglobal.agcommon.agsolr.util.TikaIndexTest.main(TikaIndexTest.java:37)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
    at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:178)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:72)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 3 more

-Phani

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, February 12, 2013 4:03 PM
To: [email protected]; Phani Kumar Samudrala
Subject: RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Hi,

Can you try Tika 1.3? It upgraded PDFBox from 1.7.0 to 1.7.1, and that fixed many issues with PDF parsing.

Cheers,

-----Original message-----
> From: Phani Kumar Samudrala <[email protected]>
> Sent: Tue 12-Feb-2013 11:30
> To: [email protected]
> Subject: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
>
> I am using the Tika 1.2 Java API to extract text from a PDF, and I am getting the
> following exception. The error occurs only for some PDF documents; other PDFs
> parse fine. I couldn't figure out a reason for this. When I tried Tika 1.1, it
> worked fine. Please let me know if any of you have seen this error and how to
> fix it.
>
> Here is the exception:
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
> Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
>     at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 3 more
>
> Here is the code snippet in Java:
>
>     import java.io.ByteArrayOutputStream;
>     import java.io.File;
>     import java.io.InputStream;
>     import java.net.URL;
>     import org.apache.tika.detect.DefaultDetector;
>     import org.apache.tika.detect.Detector;
>     import org.apache.tika.io.TikaInputStream;
>     import org.apache.tika.metadata.Metadata;
>     import org.apache.tika.parser.AutoDetectParser;
>     import org.apache.tika.parser.ParseContext;
>     import org.apache.tika.parser.Parser;
>     import org.apache.tika.sax.BodyContentHandler;
>     import org.xml.sax.ContentHandler;
>
>     String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
>     File file = new File(fileString);
>     URL url = file.toURI().toURL();
>
>     ParseContext context = new ParseContext();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     Metadata metadata = new Metadata();
>     // Register the parser so that embedded documents are parsed with it too.
>     context.set(Parser.class, parser);
>     // PPT, Word, XLSX, PDF, HTML
>     ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
>     InputStream input = TikaInputStream.get(url, metadata);
>     // The extracted body text is written to outputstream.
>     ContentHandler handler = new BodyContentHandler(outputstream);
>     parser.parse(input, handler, metadata, context);
>
>     input.close();
>     outputstream.close();
>
> Thanks
>
> ________________________________
>
> Disclaimer: This transmission, including attachments, is confidential,
> proprietary, and may be privileged. It is intended solely for the intended
> recipient. If you are not the intended recipient, you have received this
> transmission in error and you are hereby advised that any review, disclosure,
> copying, distribution, or use of this transmission, or any of the information
> included therein, is unauthorized and strictly prohibited. If you have
> received this transmission in error, please immediately notify the sender by
> reply and permanently delete all copies of this transmission and its
> attachments.
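
Both traces in this thread fail inside PDAnnotationLink.getAction with a ClassCastException, which suggests that in the affected PDFs a link annotation stores a string where the library expects an action dictionary. The following is a minimal sketch of that failure mode and of a type-checked lookup that tolerates it. It makes several assumptions: PdfObject, PdfString, PdfDictionary, actionUnchecked, and actionChecked are hypothetical stand-ins written for this illustration, not PDFBox or Tika classes, and the "A" key is assumed to be the entry that holds the action.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins for a PDF object model; these are NOT the PDFBox classes.
public class LinkActionSketch {

    interface PdfObject {}

    static final class PdfString implements PdfObject {
        final String value;
        PdfString(String value) { this.value = value; }
    }

    static final class PdfDictionary implements PdfObject {
        private final Map<String, PdfObject> entries = new HashMap<>();
        void put(String key, PdfObject value) { entries.put(key, value); }
        PdfObject get(String key) { return entries.get(key); }
    }

    // Unchecked lookup: throws ClassCastException when the "A" entry
    // (assumed here to hold the action) is not a dictionary.
    static PdfDictionary actionUnchecked(PdfDictionary annotation) {
        return (PdfDictionary) annotation.get("A");
    }

    // Type-checked lookup: a missing or wrongly typed "A" entry yields an empty result.
    static Optional<PdfDictionary> actionChecked(PdfDictionary annotation) {
        PdfObject entry = annotation.get("A");
        if (entry instanceof PdfDictionary) {
            return Optional.of((PdfDictionary) entry);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A malformed annotation whose "A" entry is a string rather than a dictionary.
        PdfDictionary malformed = new PdfDictionary();
        malformed.put("A", new PdfString("http://example.com"));

        System.out.println("checked lookup found an action: "
                + actionChecked(malformed).isPresent());
        try {
            actionUnchecked(malformed);
        } catch (ClassCastException e) {
            System.out.println("unchecked lookup failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Whether the library should skip such an annotation or report it is a design choice for its maintainers; the sketch only shows why a single malformed entry is enough to abort text extraction for the whole document.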
