Sorry for not responding sooner.

> When extracting text from PDF files (no OCR), there doesn’t seem to be any
> way to link the text that was filled in with the name of the form field.
> For example, if there is a field marked ‘First Name’ and the user fills
> that in, they likely appear on different lines and different places, with
> no way to associate them.  Is there any way to do this?

Can you share an example file?  I thought we were marking field names and
contents with <div> elements for AcroForms.  If you're processing XFA, I'm
pretty sure we try to associate form keys and values.

If the form is not well put together or if it is a scan of a form, there's
not much we can do.
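
For a well-formed AcroForm, pulling the name/value pairs back out of the XHTML output could look roughly like the sketch below. It uses only the JDK's XML parser. The markup it expects (a <div class="acroform"> wrapping <li> items whose text reads "fieldName: value") is an assumption that should be checked against your own output, and the class and method names are hypothetical, not part of Tika or PDFBox.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class AcroFormFields {

    // Returns field name -> value for every <li> inside a <div class="acroform">.
    // ASSUMPTION: each <li> holds text shaped like "fieldName: value".
    static Map<String, String> extract(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"acroform".equals(div.getAttribute("class"))) {
                continue; // ordinary page text, not form data
            }
            NodeList items = div.getElementsByTagName("li");
            for (int j = 0; j < items.getLength(); j++) {
                String text = items.item(j).getTextContent().trim();
                int sep = text.indexOf(':'); // split on the first colon only
                if (sep < 0) {
                    continue; // not in "name: value" shape, skip it
                }
                fields.put(text.substring(0, sep).trim(),
                           text.substring(sep + 1).trim());
            }
        }
        return fields;
    }
}
```

Because the split happens at the first colon, a value such as "10:30" survives intact, but a field name that itself contains a colon would be cut short. Text outside the acroform div (the page content where labels and answers land on separate lines) is ignored.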


On Mon, Aug 30, 2021 at 2:40 PM Peter Kronenberg <[email protected]>
wrote:

> Is this capability of associating form fields with their data something
> that PDFBox doesn’t even support?  Just want to understand if it’s just
> a limitation of Tika or if PDFBox doesn’t even have a way to do it.
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *C: 703.887.5623 *
>
> [image: Torch AI] <http://www.torch.ai/>
>
> 4303 W. 119th St., Leawood, KS 66209
> WWW.TORCH.AI <http://www.torch.ai/>
>
>
>
>
>
> *From:* Tilman Hausherr <[email protected]>
> *Sent:* Monday, August 30, 2021 2:34 PM
> *To:* [email protected]
> *Subject:* Re: Form fields and other issues with PDF files
>
>
>
> On 30.08.2021 at 18:47, Peter Kronenberg wrote:
>
> Is there a way to extract just the form field data?  That way, if it was a
> known form, it might be easier to match up the responses with the fields
> they belong to
>
> I looked at PDFParserConfig and didn't find such an option. Even if there
> was, I doubt it would help match.
>
> Tilman
>
>
>
>
>
> I’ll take a look at Tabula for the tables
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *From:* Tilman Hausherr <[email protected]>
> *Sent:* Saturday, August 28, 2021 5:38 AM
> *To:* [email protected]
> *Subject:* Re: Form fields and other issues with PDF files
>
>
>
> Field texts: there is no formal way to do this in the PDF specification.
>
> Tables: try Tabula, they use heuristics
>
> Strike-out text: the text is drawn with a font, while the strike-out line
> is a vector graphic (or an annotation), so the two are not connected. One
> would have to write an algorithm to match them up.
>
> Tilman
>
>
>
> On 27.08.2021 at 16:39, Peter Kronenberg wrote:
>
>
>    1. When extracting text from PDF files (no OCR), there doesn’t seem to
>    be any way to link the text that was filled in with the name of the form
>    field.  For example, if there is a field marked ‘First Name’ and the user
>    fills that in, they likely appear on different lines and different places,
>    with no way to associate them.  Is there any way to do this?
>
>    2. It’s also sometimes difficult to figure out how tables are
>    extracted.  If I have a 2 column table, it seems to ignore the tabular
>    format and just extract text line by line.  In this example (ignoring the
>    hand-written text), it gets extracted as ‘Comprehensive General Liability
>    (including, if $2.0 million’
>
>    3. Deleted, or strike-out text, is extracted with no indication
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *