Re: Form fields and other issues with PDF files

Tilman Hausherr Sat, 28 Aug 2021 02:39:26 -0700

Field texts: the PDF specification has no formal way to link a form field to the label text printed next to it.
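
Since nothing formal exists, any association has to be a heuristic based on page geometry. Below is a minimal sketch of one, not an established method: it pairs each filled-in field with the nearest label to its left on the same line, or failing that, the nearest label directly above. The class `FieldLabelSketch`, its `Rect` type and the scoring rule are all made up for illustration. It assumes you have already collected label text positions and field rectangles (in PDFBox the field value and widget rectangle are available from the AcroForm); that extraction step is not shown.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: pair form field values with nearby label text by geometry. */
final class FieldLabelSketch {

    /** Box in page coordinates; (x, y) is the lower-left corner, y grows upward. */
    static final class Rect {
        final String text; // label text, or the field's filled-in value
        final double x, y, width, height;

        Rect(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Smaller is better; infinity means the label is not a candidate. */
    static double score(Rect label, Rect field) {
        double labelRight = label.x + label.width;
        double fieldTop = field.y + field.height;

        // Candidate 1: label on the same line, entirely to the left of the field.
        boolean sameLine = label.y < fieldTop && label.y + label.height > field.y;
        if (sameLine && labelRight <= field.x) {
            return field.x - labelRight;
        }
        // Candidate 2: label above the field, overlapping it horizontally.
        boolean sameColumn = label.x < field.x + field.width && labelRight > field.x;
        if (sameColumn && label.y >= fieldTop) {
            return label.y - fieldTop;
        }
        return Double.POSITIVE_INFINITY;
    }

    /** Maps label text to field value; a later field overwrites an earlier one. */
    static Map<String, String> pair(List<Rect> labels, List<Rect> fields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Rect field : fields) {
            Rect best = null;
            double bestScore = Double.POSITIVE_INFINITY;
            for (Rect label : labels) {
                double s = score(label, field);
                if (s < bestScore) {
                    bestScore = s;
                    best = label;
                }
            }
            if (best != null) {
                result.put(best.text, field.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates, in points.
        List<Rect> labels = List.of(
                new Rect("First Name", 50, 700, 60, 10),
                new Rect("Last Name", 50, 680, 60, 10));
        List<Rect> fields = List.of(
                new Rect("Peter", 120, 698, 150, 14),
                new Rect("Kronenberg", 120, 678, 150, 14));
        System.out.println(pair(labels, fields));
    }
}
```

This will fail on multi-column forms, labels placed to the right of checkboxes, and similar layouts, so treat the result as a guess to be verified.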


Tables: try Tabula; it uses heuristics to detect table structure.

Strike-out text: the text is drawn with a font, while the strike-out line is a separate vector graphic (or an annotation), so the two are not connected in the file. One would have to write an algorithm to match them.
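
A rough sketch of what such an algorithm could look like, under stated assumptions: word bounding boxes and drawn horizontal lines have already been collected from the page (that part is not shown), and a word counts as struck out when a thin line crosses the middle band of its box and covers most of its width. The class `StrikeoutSketch`, the `Box` type and both thresholds are hypothetical and would need tuning on real files.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: flag words that a thin horizontal line crosses near mid-height. */
final class StrikeoutSketch {

    /** Box in page coordinates; (x, y) is the lower-left corner, y grows upward. */
    static final class Box {
        final String text; // word text, or null for a drawn line
        final double x, y, width, height;

        Box(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Hypothetical thresholds, not taken from any specification.
    static final double MAX_LINE_THICKNESS = 2.0;   // points
    static final double MIN_HORIZONTAL_COVER = 0.8; // fraction of word width

    static boolean isStruck(Box word, Box line) {
        if (line.height > MAX_LINE_THICKNESS) {
            return false; // too thick to be a strike-out rule
        }
        // The line's vertical centre must lie in the middle half of the word box;
        // this is what separates a strike-out from an underline.
        double lineCentreY = line.y + line.height / 2;
        double bandLow = word.y + 0.25 * word.height;
        double bandHigh = word.y + 0.75 * word.height;
        if (lineCentreY < bandLow || lineCentreY > bandHigh) {
            return false;
        }
        // The line must cover most of the word horizontally.
        double overlap = Math.min(word.x + word.width, line.x + line.width)
                - Math.max(word.x, line.x);
        return overlap >= MIN_HORIZONTAL_COVER * word.width;
    }

    /** Returns the text of every word crossed by at least one line. */
    static List<String> struckWords(List<Box> words, List<Box> lines) {
        List<String> result = new ArrayList<>();
        for (Box word : words) {
            for (Box line : lines) {
                if (isStruck(word, line)) {
                    result.add(word.text);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates, in points.
        List<Box> words = List.of(
                new Box("old", 100, 700, 20, 10),
                new Box("new", 125, 700, 22, 10));
        List<Box> lines = List.of(new Box(null, 99, 704.5, 22, 0.5));
        System.out.println(struckWords(words, lines)); // prints [old]
    }
}
```

Rotated text, strike-outs drawn as filled rectangles, and StrikeOut annotations would each need separate handling.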


Tilman


On 27.08.2021 at 16:39, Peter Kronenberg wrote:


  * When extracting text from PDF files (no OCR), there doesn’t seem
    to be any way to link the text that was filled in with the name of
    the form field.   For example, if there is a field marked ‘First
    Name’ and the user fills that in, they likely appear on different
    lines and different places, with no way to associate them.  Is
    there any way to do this?

  * It’s also sometimes difficult to figure out how tables are
    extracted.  If I have a 2 column table, it seems to ignore the
    tabular format and just extract text line by line.  In this
    example (ignoring the hand-written text), it gets extracted as
    ‘Comprehensive General Liability (including, if $2.0 million’

  * Deleted, or strike-out text, is extracted with no indication

Peter Kronenberg | Senior AI Analytic Engineer

*C: 703.887.5623*

Torch AI <http://www.torch.ai/>

4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI <http://www.torch.ai/>
