[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison reassigned TIKA-1226: --------------------------------- Assignee: Tim Allison > PDFTextStripper fails while getting data of PDF form fields of type > PDSignatureField > ------------------------------------------------------------------------------------ > > Key: TIKA-1226 > URL: https://issues.apache.org/jira/browse/TIKA-1226 > Project: Tika > Issue Type: Bug > Affects Versions: 1.5 > Reporter: Eric Knauel > Assignee: Tim Allison > > I have a PDF document that contains a filled in form. Among the various > fields of type text and radio button there are multiple fields for digital > signatures. When I load this document into tika-app I get the following > exception: > {noformat} > Caused by: java.lang.RuntimeException: Can't get signature as String, use > getSignature() instead. > at > org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131) > at > org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467) > at > org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425) > at > org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411) > at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184) > at > org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330) > at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) > ... 43 more > {noformat} > The problem seems to be that PDF2XHTML seems to expect that it can call > getValue() on all PDField objects. According to the PDFBox 1.8.3 java doc > this is not true for the sub class PDSignatureField: > http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html > The java doc says that getSignature() should be called instead. > Assuming that the information inside the signature is not relevant for the > extraction process and can be discarded the following patch helps: > {noformat} > Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java > IDEA additional info: > Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP > <+>UTF-8 > =================================================================== > --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java > (revision 1560617) > +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java > (revision ) > @@ -40,6 +40,7 @@ > import > org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode; > import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm; > import org.apache.pdfbox.pdmodel.interactive.form.PDField; > +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField; > import org.apache.pdfbox.util.PDFTextStripper; > import org.apache.pdfbox.util.TextPosition; > import org.apache.tika.exception.TikaException; > @@ -464,7 +465,9 @@ > } > String value = ""; > try { > + if (!(field instanceof PDSignatureField)) { > - value = field.getValue(); > + value = field.getValue(); > + } > } catch (IOException e) { > //swallow > } > {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)