On 30.08.2021 at 18:47, Peter Kronenberg wrote:
Is there a way to extract just the form field data? That way, if it
was a known form, it might be easier to match up the responses with
the fields they belong to.
I looked at PDFParserConfig and didn't find such an option. Even if
there were one, I doubt it would help with the matching.
Tilman
I’ll take a look at Tabula for the tables
Peter Kronenberg | Senior AI Analytic Engineer
C: 703.887.5623
Torch AI <http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI <http://www.torch.ai/>
*From:* Tilman Hausherr <[email protected]>
*Sent:* Saturday, August 28, 2021 5:38 AM
*To:* [email protected]
*Subject:* Re: Form fields and other issues with PDF files
Field texts: there is no formal way to do this in the PDF specification.
Tables: try Tabula, they use heuristics
Strike-out text: the text is drawn with a font, while the strike-out
line is a vector graphic (or an annotation), so the two are not
connected in the file. One would have to write an algorithm to relate them.
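
As a rough illustration of what such an algorithm might look like, here is a minimal sketch in plain Java (standard library only). It assumes the text bounding boxes and the line segments drawn on the page have already been collected by some other means and share one coordinate space; how to obtain them is not shown. The class name StrikeoutDetector, its methods, and the two thresholds are hypothetical choices for illustration, not part of Tika or PDFBox.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags text boxes that a roughly horizontal line crosses.
class StrikeoutDetector {

    // Assumed threshold: largest |dy/dx| still treated as "horizontal".
    static final double MAX_SLOPE = 0.05;
    // Assumed threshold: fraction of the text width the line must cover.
    static final double MIN_COVERAGE = 0.8;

    // True if the line runs through the middle half of the box's height
    // and covers at least MIN_COVERAGE of its width.
    static boolean isStruckOut(Rectangle2D textBox, Line2D line) {
        double dx = Math.abs(line.getX2() - line.getX1());
        double dy = Math.abs(line.getY2() - line.getY1());
        if (dx == 0 || dy / dx > MAX_SLOPE) {
            return false; // vertical or too steep to be a strike-out
        }

        // Reject underlines and overlines: the line must sit in the central band.
        double midY = (line.getY1() + line.getY2()) / 2.0;
        double bandLow = textBox.getMinY() + 0.25 * textBox.getHeight();
        double bandHigh = textBox.getMaxY() - 0.25 * textBox.getHeight();
        if (midY < bandLow || midY > bandHigh) {
            return false;
        }

        // Horizontal overlap between the line and the box.
        double left = Math.max(textBox.getMinX(), Math.min(line.getX1(), line.getX2()));
        double right = Math.min(textBox.getMaxX(), Math.max(line.getX1(), line.getX2()));
        return (right - left) >= MIN_COVERAGE * textBox.getWidth();
    }

    // Returns the indices of all boxes struck out by at least one line.
    static List<Integer> findStruckOut(List<Rectangle2D> boxes, List<Line2D> lines) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < boxes.size(); i++) {
            for (Line2D line : lines) {
                if (isStruckOut(boxes.get(i), line)) {
                    result.add(i);
                    break; // one matching line is enough
                }
            }
        }
        return result;
    }
}
```

The central-band test is what separates a strike-out from an underline, which sits at or below the bottom edge of the box. Both thresholds would likely need tuning per document, and thin filled rectangles used as lines would have to be converted to segments first.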
Tilman
On 27.08.2021 at 16:39, Peter Kronenberg wrote:
1. When extracting text from PDF files (no OCR), there doesn’t
seem to be any way to link the text that was filled in with
the name of the form field. For example, if there is a field
marked ‘First Name’ and the user fills that in, the label and
the value likely appear on different lines and in different
places, with no way to associate them. Is there any way to do this?
2. It’s also sometimes difficult to figure out how tables are
extracted. If I have a two-column table, it seems to ignore the
tabular format and just extract text line by line. In this
example (ignoring the hand-written text), it gets extracted as
‘Comprehensive General Liability (including, if $2.0 million’
3. Deleted, or strike-out, text is extracted with no indication
that it was struck out.