I'm not sure whether this is an issue for PDFBox or for Tika, but I noticed that
PDFBox's PDFTextStripper is not extracting the values of form fields in a batch of
PDF documents I'm processing. Is anyone else seeing this?
Unfortunately, I'm unable to send an example document.
An inelegant workaround, with error handling omitted:
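import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

StringBuilder sb = new StringBuilder();
// Get the page text with the text stripper first, then append the form field values.
// pdDoc is the already-loaded PDDocument.
PDDocumentCatalog catalog = pdDoc.getDocumentCatalog();
if (catalog != null) {
    PDAcroForm form = catalog.getAcroForm();
    // getAcroForm() returns null when the document has no form.
    if (form != null) {
        List<PDField> fields = form.getFields();
        for (PDField field : fields) {
            sb.append(field.getFullyQualifiedName())
              .append(": ")
              .append(field.getValue())
              .append("\r\n");
        }
    }
}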
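
One caveat with the loop above: as far as I can tell, getFields() only hands back the top-level fields, so values that sit in child fields of a hierarchical form could be missed. Below is a rough sketch of the recursive walk that would be needed. It deliberately does not use PDFBox: FormField is a hypothetical stand-in class I made up for illustration, and the mapping to the real PDFBox field API is an assumption you would need to check against the version you are running.

```java
import java.util.ArrayList;
import java.util.List;

class FormFieldFlattener {

    /** Hypothetical stand-in for a PDF form field; NOT a PDFBox class. */
    static final class FormField {
        final String name;                               // partial field name
        final String value;                              // null if the field holds no value
        final List<FormField> kids = new ArrayList<>();  // child fields, if any

        FormField(String name, String value) {
            this.name = name;
            this.value = value;
        }

        FormField add(FormField kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Appends "qualified.name: value" for a field and, recursively, its descendants. */
    static void appendFields(FormField field, String prefix, StringBuilder sb) {
        // Fully qualified names join the partial names with periods.
        String qualified = prefix.isEmpty() ? field.name : prefix + "." + field.name;
        if (field.value != null) {
            sb.append(qualified).append(": ").append(field.value).append("\r\n");
        }
        for (FormField kid : field.kids) {
            appendFields(kid, qualified, sb);
        }
    }

    /** Flattens a list of top-level fields into one block of text. */
    static String flatten(List<FormField> topLevel) {
        StringBuilder sb = new StringBuilder();
        for (FormField field : topLevel) {
            appendFields(field, "", sb);
        }
        return sb.toString();
    }
}
```

Calling FormFieldFlattener.flatten(topLevelFields) and appending the result to the text stripper output would then pick up nested values as well as the top-level ones.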