Re: Form fields and other issues with PDF files

Tilman Hausherr Wed, 01 Sep 2021 11:37:16 -0700

On 30.08.2021 at 21:26, Peter Kronenberg wrote:

Hmm, you’re right. I tried it on another form (the downloadable 1040 from irs.gov) and it does list the values of the form fields (of course, you’d have to do the mapping yourself, so you know that field f1_02 is first name).
But it didn’t work on the sample file I had, which, unfortunately, I can’t share.
It’s definitely not a scanned file. What are the requirements for allowing this to happen? Is there a way to convert a PDF to XFA?

Not in PDFBox nor in Tika. XFA is a deprecated format that isn't really part of PDF.


Tilman

<div class="xfa_form"><ol>    <li fieldName="c1_01">c1_01: 0</li>

            <li fieldName="f1_01">f1_01: </li>

            <li fieldName="f1_02">f1_02: John</li>

            <li fieldName="f1_03">f1_03: Smith</li>

            <li fieldName="f1_04">f1_04: </li>

            <li fieldName="f1_05">f1_05: </li>

            <li fieldName="f1_06">f1_06: </li>

            <li fieldName="f1_07">f1_07: </li>

            <li fieldName="f1_08">f1_08: 123 Main St</li>

            <li fieldName="f1_09">f1_09: </li>

            <li fieldName="f1_10">f1_10: </li>
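The manual mapping mentioned above (knowing that f1_02 is the first name) could be sketched roughly as follows. This is only an illustration that post-processes XHTML shaped like the excerpt above, using Python's standard library. The `FIELD_LABELS` table and the helper names are hypothetical; only the f1_02 entry comes from this thread, the other labels are guesses.

```python
from html.parser import HTMLParser

# Hypothetical lookup table: only "f1_02 is first name" is stated in the
# thread; the other entries are assumptions made for illustration.
FIELD_LABELS = {
    "f1_02": "first_name",
    "f1_03": "last_name",       # assumption
    "f1_08": "street_address",  # assumption
}


class FieldExtractor(HTMLParser):
    """Collects the text of every <li fieldName="..."> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so fieldName arrives
        # here as "fieldname".
        name = dict(attrs).get("fieldname")
        if tag == "li" and name:
            self._current = name
            self.fields[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self._current = None


def extract_fields(xhtml):
    """Return {label_or_field_name: value} for the form items in xhtml."""
    parser = FieldExtractor()
    parser.feed(xhtml)
    parser.close()
    result = {}
    for name, text in parser.fields.items():
        # Items look like "f1_02: John"; drop the repeated "name:" prefix.
        value = text.strip()
        prefix = name + ":"
        if value.startswith(prefix):
            value = value[len(prefix):].strip()
        result[FIELD_LABELS.get(name, name)] = value
    return result


sample = """<div class="xfa_form"><ol>
<li fieldName="c1_01">c1_01: 0</li>
<li fieldName="f1_01">f1_01: </li>
<li fieldName="f1_02">f1_02: John</li>
<li fieldName="f1_03">f1_03: Smith</li>
<li fieldName="f1_08">f1_08: 123 Main St</li>
</ol></div>"""

fields = extract_fields(sample)
print(fields)
```

Fields without an entry in the table keep their raw name, so unmapped items are still visible in the result.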

Peter Kronenberg | Senior AI Analytic Engineer

C: 703.887.5623

Torch AI <http://www.torch.ai/>

4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI <http://www.torch.ai/>

*From:* Tim Allison <[email protected]>
*Sent:* Monday, August 30, 2021 3:01 PM
*To:* [email protected]
*Subject:* Re: Form fields and other issues with PDF files

Sorry for not responding sooner.

>When extracting text from PDF files (no OCR), there doesn’t seem to be any way to link the text that was filled in with the name of the form field. For example, if there is a field marked ‘First Name’ and the user fills that in, they likely appear on different lines and different places, with no way to associate them. Is there any way to do this?

Can you share an example file? I thought we were marking field names and contents with <div> elements for AcroForms. If you're processing XFA, I'm pretty sure we try to associate form keys and values.

If the form is not well put together or if it is a scan of a form, there's not much we can do.

On Mon, Aug 30, 2021 at 2:40 PM Peter Kronenberg <[email protected]> wrote:


    Is this capability of associating form fields with their data
    something that PDFBox doesn’t even support? I just want to
    understand whether it’s a limitation of Tika or whether PDFBox
    doesn’t even have a way to do it.


    *From:* Tilman Hausherr <[email protected]>
    *Sent:* Monday, August 30, 2021 2:34 PM
    *To:* [email protected]
    *Subject:* Re: Form fields and other issues with PDF files

    On 30.08.2021 at 18:47, Peter Kronenberg wrote:

        Is there a way to extract just the form field data?  That way,
        if it was a known form, it might be easier to match up the
        responses with the fields they belong to

    I looked at PDFParserConfig and didn't find such an option. Even
    if there was, I doubt it would help match.

    Tilman

        I’ll take a look at Tabula for the tables


        *From:* Tilman Hausherr <[email protected]>
        *Sent:* Saturday, August 28, 2021 5:38 AM
        *To:* [email protected]
        *Subject:* Re: Form fields and other issues with PDF files

        Field texts: there is no formal way to do this in the PDF
        specification.

        Tables: try Tabula, they use heuristics

        Strike-out text: one is a font, the other a vector graphic (or
        an annotation). So it's not connected. One would have to write
        an algorithm.
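To give an idea of what such an algorithm might look like, here is a minimal sketch for the vector-graphic case only (strike-out annotations are not covered). It assumes the word bounding boxes and the drawn line segments have already been pulled out of the page by some other means, and that both use the same coordinate system with y increasing upwards. The `Box` and `Segment` types, the function names, and the thresholds are all hypothetical, not part of PDFBox or Tika.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of one extracted word (hypothetical input type)."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


@dataclass
class Segment:
    """One straight line drawn on the page (hypothetical input type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def is_struck_out(box, seg, flat_tol=1.0, min_cover=0.8):
    """Heuristic: a nearly horizontal line through the middle of the word."""
    # Reject lines that are not close to horizontal.
    if abs(seg.y1 - seg.y0) > flat_tol:
        return False
    height = box.y1 - box.y0
    width = box.x1 - box.x0
    if height <= 0 or width <= 0:
        return False
    # The line must sit in the middle half of the box, which excludes
    # underlines (near the bottom) and overlines (near the top).
    line_y = (seg.y0 + seg.y1) / 2
    if not (box.y0 + 0.25 * height <= line_y <= box.y0 + 0.75 * height):
        return False
    # The line must cover most of the word horizontally.
    left, right = sorted((seg.x0, seg.x1))
    overlap = min(right, box.x1) - max(left, box.x0)
    return overlap >= min_cover * width


def mark_strikeouts(boxes, segments):
    """Return (text, struck_out) pairs for every word box."""
    return [(b.text, any(is_struck_out(b, s) for s in segments)) for b in boxes]


boxes = [Box("old", 10, 100, 40, 112), Box("new", 50, 100, 80, 112)]
segments = [
    Segment(9, 106, 41, 106),   # line through the middle of "old"
    Segment(50, 99, 80, 99),    # underline just below "new"
]
marked = mark_strikeouts(boxes, segments)
print(marked)
```

The thresholds would need tuning per document, and thin filled rectangles used as strike-through lines would have to be converted to segments first.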

        Tilman

        On 27.08.2021 at 16:39, Peter Kronenberg wrote:

             1. When extracting text from PDF files (no OCR), there
                doesn’t seem to be any way to link the text that was
                filled in with the name of the form field.   For
                example, if there is a field marked ‘First Name’ and
                the user fills that in, they likely appear on
                different lines and different places, with no way to
                associate them.  Is there any way to do this?

             2. It’s also sometimes difficult to figure out how tables
                are extracted.  If I have a 2 column table, it seems
                to ignore the tabular format and just extract text
                line by line.  In this example (ignoring the
                hand-written text), it gets extracted as
                ‘Comprehensive General Liability (including, if $2.0
                million’

             3. Deleted, or strike-out text, is extracted with no
                indication

