Follow-up question: For PDF, is there a way to not extract comments?

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>


From: Tilman Hausherr <[email protected]>
Sent: Saturday, August 28, 2021 5:38 AM
To: [email protected]
Subject: Re: Form fields and other issues with PDF files



Field texts: there is no formal way to do this in the PDF specification.

Tables: try Tabula; it uses heuristics.

Strike-out text: the text is drawn with a font, while the strike-out line is a
vector graphic (or an annotation), so the two are not connected. One would have
to write an algorithm to match them.
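
To illustrate what such an algorithm might look like, here is a minimal sketch in Python. It assumes the text bounding boxes and horizontal line segments have already been obtained from the page in one coordinate system with y increasing upward. The TextBox and HLine types, the function names, and the thresholds are all hypothetical and not part of PDFBox, Tika, or any other library. Strike-outs drawn as thin filled rectangles or stored as annotations would first have to be converted to segments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TextBox:
    """Hypothetical bounding box of one extracted word (y grows upward)."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


@dataclass(frozen=True)
class HLine:
    """Hypothetical horizontal line segment taken from the page graphics."""
    x0: float
    x1: float
    y: float


def is_struck(box: TextBox, line: HLine,
              min_cover: float = 0.8,
              band: Tuple[float, float] = (0.25, 0.75)) -> bool:
    """Heuristic: the line crosses the middle band of the box and covers
    at least min_cover of its width. Thresholds are guesses to be tuned."""
    height = box.y1 - box.y0
    width = box.x1 - box.x0
    if height <= 0 or width <= 0:
        return False
    # Relative vertical position of the line inside the box (0 = bottom, 1 = top).
    rel_y = (line.y - box.y0) / height
    if not (band[0] <= rel_y <= band[1]):
        return False  # e.g. an underline or a table rule, not a strike-out
    # Horizontal overlap between the line and the box; negative if disjoint.
    overlap = min(box.x1, line.x1) - max(box.x0, line.x0)
    return overlap / width >= min_cover


def mark_strikeouts(boxes: List[TextBox],
                    lines: List[HLine]) -> List[Tuple[str, bool]]:
    """Pair each word with a flag saying whether any line strikes it out."""
    return [(b.text, any(is_struck(b, l) for l in lines)) for b in boxes]


if __name__ == "__main__":
    boxes = [TextBox("deleted", 10, 100, 50, 110),
             TextBox("kept", 60, 100, 90, 110)]
    lines = [HLine(9, 51, 105),   # runs through the middle of "deleted"
             HLine(60, 90, 99)]   # sits just below "kept", like an underline
    print(mark_strikeouts(boxes, lines))
```

The vertical band is what separates a strike-out from an underline or a table border: only lines crossing the middle of the glyph box count.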

Tilman

On 27.08.2021 at 16:39, Peter Kronenberg wrote:

  1.  When extracting text from PDF files (no OCR), there doesn't seem to be
any way to link the text that was filled in with the name of the form field.
For example, if there is a field marked 'First Name' and the user fills it in,
the label and the value likely appear on different lines and in different
places, with no way to associate them.  Is there any way to do this?



  2.  It's also sometimes difficult to figure out how tables are extracted.  If
I have a two-column table, it seems to ignore the tabular format and just
extract text line by line.  In this example (ignoring the hand-written text),
it gets extracted as 'Comprehensive General Liability (including, if $2.0 million'

[inline image: two-column table example]




  3.  Deleted (strike-out) text is extracted with no indication that it was
struck out.
[inline image: strike-out text example]




