[ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-1226:
---------------------------------

    Assignee: Tim Allison

> PDFTextStripper fails while getting data of PDF form fields of type 
> PDSignatureField
> ------------------------------------------------------------------------------------
>
>                 Key: TIKA-1226
>                 URL: https://issues.apache.org/jira/browse/TIKA-1226
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.5
>            Reporter: Eric Knauel
>            Assignee: Tim Allison
>
> I have a PDF document that contains a filled-in form. Among the various 
> text and radio-button fields there are multiple digital signature fields. 
> When I load this document into tika-app, I get the following exception:
> {noformat}
> Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead.
>       at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
>       at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
>       at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
>       at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
>       at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 43 more
> {noformat}
> The problem seems to be that PDF2XHTML expects that it can call getValue() 
> on all PDField objects. According to the PDFBox 1.8.3 Javadoc, this is not 
> true for the subclass PDSignatureField:
> http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html
> The Javadoc says that getSignature() should be called instead. 
> Assuming that the information inside the signature is not relevant to the 
> extraction process and can be discarded, the following patch helps:
> {noformat}
> Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
> IDEA additional info:
> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> <+>UTF-8
> ===================================================================
> --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java      (revision 1560617)
> +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java      (revision )
> @@ -40,6 +40,7 @@
>  import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
>  import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
>  import org.apache.pdfbox.pdmodel.interactive.form.PDField;
> +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
>  import org.apache.pdfbox.util.PDFTextStripper;
>  import org.apache.pdfbox.util.TextPosition;
>  import org.apache.tika.exception.TikaException;
> @@ -464,7 +465,9 @@
>            }
>            String value = "";
>            try {
> +              if (!(field instanceof PDSignatureField)) {
> -              value = field.getValue();
> +                  value = field.getValue();
> +              }
>            } catch (IOException e) {
>                 //swallow
>            }
> {noformat}
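
The guard in the patch above can be tried in isolation with the sketch
below. It is only an illustration under stated assumptions: Field and
SignatureField are hypothetical stand-ins for PDFBox's PDField and
PDSignatureField (they are not the real classes), and extractValue is a
made-up name that mirrors the patched part of PDF2XHTML.addFieldString.

```java
import java.io.IOException;

// Illustrative sketch only. Field and SignatureField are hypothetical
// stand-ins for PDFBox's PDField and PDSignatureField, not the real classes.
class SignatureFieldGuardSketch {

    // Stand-in for a form field whose value can be read as a String.
    static class Field {
        private final String value;

        Field(String value) {
            this.value = value;
        }

        String getValue() throws IOException {
            return value;
        }
    }

    // Stand-in that mimics the reported behaviour: getValue() throws an
    // unchecked exception instead of returning a String.
    static class SignatureField extends Field {
        SignatureField() {
            super(null);
        }

        @Override
        String getValue() throws IOException {
            throw new RuntimeException(
                "Can't get signature as String, use getSignature() instead.");
        }
    }

    // Hypothetical helper mirroring the patched logic: skip getValue()
    // for signature fields and fall back to an empty string.
    static String extractValue(Field field) {
        String value = "";
        try {
            if (!(field instanceof SignatureField)) {
                value = field.getValue();
            }
        } catch (IOException e) {
            // swallow, as the original code does
        }
        return value;
    }
}
```

With these stand-ins, a text field yields its value and a signature field
yields an empty string rather than propagating the RuntimeException. The
catch block alone would not help here, because it only handles IOException.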



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
