RE: Form fields and other issues with PDF files

Peter Kronenberg Mon, 30 Aug 2021 12:36:45 -0700

For the file that doesn’t work, I see this in the metadata.  Looks like a big 
clue.  But what does this mean in practical terms?  Is there a way to convert?  
Are there certain tools that should be used to create the PDF?

<meta name="pdf:PDFVersion" content="1.6" />
<meta name="xmp:CreatorTool" content="Acrobat PDFMaker 15 for Word" />
<meta name="pdf:docinfo:title" content="I" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
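
To make the clue above easier to check across many files, here is a minimal Python sketch (standard library only) that collects `<meta>` tags like these from the XHTML output and reads the `pdf:hasXFA` flag. The names `MetaCollector` and `read_meta` and the sample string are illustrative assumptions, not part of Tika. As far as I understand it, a `false` value only says that no XFA was found; whether the file carries AcroForm fields would have to be checked separately.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are routed here by the default
        # handle_startendtag implementation.
        if tag == "meta":
            pairs = dict(attrs)
            if "name" in pairs and "content" in pairs:
                self.meta[pairs["name"]] = pairs["content"]


def read_meta(xhtml):
    """Return a dict of metadata keys to values found in the XHTML text."""
    parser = MetaCollector()
    parser.feed(xhtml)
    parser.close()
    return parser.meta


# Sample copied from the metadata quoted in this message.
SAMPLE = """<meta name="pdf:PDFVersion" content="1.6" />
<meta name="xmp:CreatorTool" content="Acrobat PDFMaker 15 for Word" />
<meta name="pdf:hasXFA" content="false" />"""

meta = read_meta(SAMPLE)
has_xfa = meta.get("pdf:hasXFA", "false").lower() == "true"
print("PDF version:", meta["pdf:PDFVersion"])
print("Has XFA:", has_xfa)
```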

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

From: Peter Kronenberg
Sent: Monday, August 30, 2021 3:27 PM
To: [email protected]; [email protected]
Subject: RE: Form fields and other issues with PDF files

Hmm, you’re right.  I tried it on another form (the downloadable 1040 from 
irs.gov) and it does list the values of the form fields (of course, you’d have 
to do the mapping yourself, so you know that field f1_02 is first name).
But it didn’t work on the sample file I had, which, unfortunately, I can’t share.
It’s definitely not a scanned file.  What are the requirements for allowing 
this to happen?  Is there a way to convert a PDF to XFA?

<div class="xfa_form"><ol>    <li fieldName="c1_01">c1_01: 0</li>
            <li fieldName="f1_01">f1_01: </li>
            <li fieldName="f1_02">f1_02: John</li>
            <li fieldName="f1_03">f1_03: Smith</li>
            <li fieldName="f1_04">f1_04: </li>
            <li fieldName="f1_05">f1_05: </li>
            <li fieldName="f1_06">f1_06: </li>
            <li fieldName="f1_07">f1_07: </li>
            <li fieldName="f1_08">f1_08: 123 Main St</li>
            <li fieldName="f1_09">f1_09: </li>
            <li fieldName="f1_10">f1_10: </li>
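
The "do the mapping yourself" step could look roughly like the following Python sketch (standard library only). It parses `<li fieldName="...">` items in the shape shown above and renames the keys through a hand-written table. The class name `XfaFieldCollector` and the `LABELS` table are hypothetical; only `f1_02` as first name comes from the form itself, the other two labels are guesses based on the sample values.

```python
from html.parser import HTMLParser


class XfaFieldCollector(HTMLParser):
    """Collects field name -> value from <li fieldName="..."> items."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # HTMLParser lowercases attribute names, so fieldName -> fieldname.
            self._current = dict(attrs).get("fieldname")
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            text = "".join(self._text).strip()
            prefix = self._current + ":"
            # The item text repeats the field name, e.g. "f1_02: John".
            if text.startswith(prefix):
                text = text[len(prefix):].strip()
            self.fields[self._current] = text
            self._current = None


SAMPLE = """<div class="xfa_form"><ol>    <li fieldName="c1_01">c1_01: 0</li>
            <li fieldName="f1_01">f1_01: </li>
            <li fieldName="f1_02">f1_02: John</li>
            <li fieldName="f1_03">f1_03: Smith</li>
            <li fieldName="f1_08">f1_08: 123 Main St</li>
</ol></div>"""

# Hypothetical, hand-maintained mapping for one known form.
LABELS = {"f1_02": "first_name", "f1_03": "last_name", "f1_08": "address"}

collector = XfaFieldCollector()
collector.feed(SAMPLE)
collector.close()
fields = collector.fields

# Keep only the fields we have labels for and that were actually filled in.
labelled = {LABELS[k]: v for k, v in fields.items() if k in LABELS and v}
print(labelled)
```

A mapping table like this has to be built once per known form, since the field keys carry no human-readable meaning on their own.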

From: Tim Allison <[email protected]>
Sent: Monday, August 30, 2021 3:01 PM
To: [email protected]
Subject: Re: Form fields and other issues with PDF files

Sorry for not responding sooner.

>When extracting text from PDF files (no OCR), there doesn’t seem to be any way 
>to link the text that was filled in with the name of the form field.   For 
>example, if there is a field marked ‘First Name’ and the user fills that in, 
>they likely appear on different lines and different places, with no way to 
>associate them.  Is there any way to do this?

Can you share an example file?  I thought we were marking field names and 
contents with <div> elements for AcroForms.  If you're processing XFA, I'm 
pretty sure we try to associate form keys and values.

If the form is not well put together or if it is a scan of a form, there's not 
much we can do.

On Mon, Aug 30, 2021 at 2:40 PM Peter Kronenberg <[email protected]> wrote:
Is this capability of associating form fields with their data something that 
PDFBox doesn’t even support?  Just want to understand whether it’s a limitation 
of Tika or whether PDFBox itself has no way to do it.

From: Tilman Hausherr <[email protected]>
Sent: Monday, August 30, 2021 2:34 PM
To: [email protected]
Subject: Re: Form fields and other issues with PDF files

On 30.08.2021 at 18:47, Peter Kronenberg wrote:
Is there a way to extract just the form field data?  That way, if it was a 
known form, it might be easier to match up the responses with the fields they 
belong to.

I looked at PDFParserConfig and didn't find such an option. Even if there was, 
I doubt it would help match.

Tilman

I’ll take a look at Tabula for the tables.

From: Tilman Hausherr <[email protected]>
Sent: Saturday, August 28, 2021 5:38 AM
To: [email protected]
Subject: Re: Form fields and other issues with PDF files

Field texts: there is no formal way to do this in the PDF specification.

Tables: try Tabula; it uses heuristics.

Strike-out text: the text is drawn with a font, while the strike-out line is a 
vector graphic (or an annotation), so the two are not connected. One would have 
to write an algorithm to relate them.

Tilman

On 27.08.2021 at 16:39, Peter Kronenberg wrote:

  1.  When extracting text from PDF files (no OCR), there doesn’t seem to be 
any way to link the text that was filled in with the name of the form field.   
For example, if there is a field marked ‘First Name’ and the user fills that 
in, they likely appear on different lines and different places, with no way to 
associate them.  Is there any way to do this?

  2.  It’s also sometimes difficult to figure out how tables are extracted.  If 
I have a 2 column table, it seems to ignore the tabular format and just extract 
text line by line.  In this example (ignoring the hand-written text), it gets 
extracted as ‘Comprehensive General Liability (including, if $2.0 million’

[cid:[email protected]]

  3.  Deleted, or strike-out text, is extracted with no indication
[cid:[email protected]]
