RE: Parsing PDF file

Peter Kronenberg Mon, 05 Apr 2021 11:47:28 -0700

Can you please explain ‘out of vocabulary measurement’?

From: Tim Allison <[email protected]>
Sent: Monday, April 5, 2021 1:49 PM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]
Subject: Re: Parsing PDF file


Yes. You understand perfectly!

I want "auto" to be the best it can be and the most generally applicable across 
use cases.  Users who want higher performance or better control might parse the 
PDF first with NO_OCR, and then decide which pages to run OCR on based on the 
statistics pulled out in that first parse.  Another key statistic in the 
decision would be the out of vocabulary measurement that you can get with an 
integration with tika-eval.
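
The two-pass idea above could look roughly like the sketch below. It assumes 
you have already run the first parse with NO_OCR and read the multi-valued 
pdf:charsPerPage and pdf:unmappedUnicodeCharsPerPage metadata into String 
arrays, one value per page and in the same page order.  The class name, the 
thresholds and the sample values are hypothetical, and this is not Tika's own 
logic; the out of vocabulary statistic is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class OcrPageSelector {
    // Hypothetical thresholds for illustration; tune them against your own corpus.
    static final int MIN_CHARS_PER_PAGE = 10;
    static final double MAX_UNMAPPED_FRACTION = 0.2;

    /** Returns the zero-based indices of the pages that look worth sending to OCR. */
    static List<Integer> pagesToOcr(String[] charsPerPage, String[] unmappedPerPage) {
        if (charsPerPage.length != unmappedPerPage.length) {
            throw new IllegalArgumentException("expected one value per page in both arrays");
        }
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < charsPerPage.length; i++) {
            int total = Integer.parseInt(charsPerPage[i].trim());
            int unmapped = Integer.parseInt(unmappedPerPage[i].trim());
            // Almost no extracted text: the page is probably an image.
            boolean nearlyEmpty = total < MIN_CHARS_PER_PAGE;
            // Assumption: unmapped chars are compared against the page's total chars.
            boolean mostlyUnmapped = total > 0
                    && (double) unmapped / total > MAX_UNMAPPED_FRACTION;
            if (nearlyEmpty || mostlyUnmapped) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Illustrative values only, not a page-by-page copy of the file in this thread.
        String[] chars = {"1579", "3", "282"};
        String[] unmapped = {"109", "0", "239"};
        System.out.println(pagesToOcr(chars, unmapped)); // prints [1, 2]
    }
}
```

Only the pages returned here would then be rendered and sent to OCR in a 
second step, so the OCR cost is paid per page rather than per document.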

So, in short, if there are clear, provable, general improvements to AUTO, we 
should make them.  If you want more refined control, let us know if the current 
metadata can be improved to help you develop your application for your use 
cases.

On Mon, Apr 5, 2021 at 1:06 PM Peter Kronenberg 
<[email protected]<mailto:[email protected]>> wrote:
You’re right that OCRing would result in slightly more accurate results in this 
case.  But the performance penalty is high.  Wondering if there is some 
intermediate option.

I think I understand now why you are separately looking for unmapped characters 
as well as total characters.  If total characters is low, we assume the page is 
an image and run OCR.  But if unmapped characters is high, the page might still 
be straight text, but the unmapped characters will essentially come out as 
unreadable characters.

From: Tim Allison <[email protected]<mailto:[email protected]>>
Sent: Monday, April 5, 2021 11:39 AM
To: Peter Kronenberg 
<[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]>
Subject: Re: Parsing PDF file

As for the metadata, we should make the entries unique.  Given that multiple 
parsers can hit the same file, we need to record all of them (in this case: 
default, pdf, tesseract), but each only once.

As for tweaking the settings... I'm less sure now that I've looked at the 
extracted text more.  There are quite a few bad ligatures/unmapped Unicode 
chars, which would make searches for, e.g., "efficient" or "affairs" useless.

On Mon, Apr 5, 2021 at 10:40 AM Peter Kronenberg 
<[email protected]<mailto:[email protected]>> wrote:
Yes, I think tweaking the criteria for Auto is a good idea.
And if the parser list were a Set, that would automatically eliminate dups.
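
A minimal sketch of the Set idea, assuming the parser names are recorded as 
plain class-name strings.  ParsedByList and its methods are hypothetical names 
for illustration and are not part of Tika.  A LinkedHashSet drops repeats while 
keeping the order in which each parser was first seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ParsedByList {
    // LinkedHashSet ignores duplicates and preserves first-insertion order.
    private final Set<String> parsers = new LinkedHashSet<>();

    /** Records a parser; returns false if it was already in the list. */
    boolean record(String parserClassName) {
        return parsers.add(parserClassName);
    }

    /** Returns the unique parser names in the order they were first recorded. */
    List<String> values() {
        return new ArrayList<>(parsers);
    }

    public static void main(String[] args) {
        ParsedByList parsedBy = new ParsedByList();
        parsedBy.record("org.apache.tika.parser.CompositeParser");
        parsedBy.record("org.apache.tika.parser.pdf.PDFParser");
        parsedBy.record("org.apache.tika.parser.CompositeParser");
        parsedBy.record("org.apache.tika.parser.ocr.TesseractOCRParser");
        parsedBy.record("org.apache.tika.parser.CompositeParser");
        parsedBy.record("org.apache.tika.parser.ocr.TesseractOCRParser");
        // Three unique entries remain: CompositeParser, PDFParser, TesseractOCRParser.
        System.out.println(parsedBy.values());
    }
}
```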

From: Tim Allison <[email protected]<mailto:[email protected]>>
Sent: Monday, April 5, 2021 10:15 AM
To: [email protected]<mailto:[email protected]>
Subject: Fwd: Parsing PDF file

It looks like the ligatures don't have unicode mappings:

"Division of Monetary A�airs"


if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)

The issue is that this file has > 10 unmapped unicode chars per page.

We could change the heuristic to unmappedUnicodeCharsPerPage > 10 && 
percentUnmappedUnicodeChars > 0.2 or something?
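
To make the difference concrete, here is a small sketch that puts the current 
check next to the suggested one.  It assumes percentUnmappedUnicodeChars means 
unmapped chars divided by total chars on the page (a fraction, so 0.2 is 20%), 
and that the totalCharsPerPage < 10 test is kept as before.  The class and 
method names are hypothetical; this is not the actual PDFParser code.

```java
final class AutoOcrHeuristic {
    /** The check as quoted above: OCR if the page is nearly empty or has > 10 unmapped chars. */
    static boolean current(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    /** The suggested variant: also require that unmapped chars are a large share of the page. */
    static boolean proposed(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        if (totalCharsPerPage < 10) {
            return true; // nearly empty page, probably an image
        }
        // Assumption: the "percent" is the fraction unmapped / total for this page.
        double fractionUnmapped = (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return unmappedUnicodeCharsPerPage > 10 && fractionUnmapped > 0.2;
    }

    public static void main(String[] args) {
        // 1579 chars with 109 unmapped (about 7%), assuming these two values
        // from the metadata in this message describe the same page.
        System.out.println(current(1579, 109));  // prints true
        System.out.println(proposed(1579, 109)); // prints false
    }
}
```

Under these assumptions a mostly readable page like that one would no longer 
trigger OCR, while a page where most characters are unmapped still would.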

We should also probably check to see if a parser is in the parsed by list 
before re-adding it?



0: pdf:charsPerPage : 1579
0: pdf:charsPerPage : 1891
0: pdf:charsPerPage : 2283
0: pdf:charsPerPage : 2224
0: pdf:charsPerPage : 1619
0: pdf:charsPerPage : 2177
0: pdf:charsPerPage : 1626
0: pdf:charsPerPage : 1313
0: pdf:charsPerPage : 1652
0: pdf:charsPerPage : 1493
0: pdf:charsPerPage : 1136
0: pdf:charsPerPage : 1477
0: pdf:charsPerPage : 1264
0: pdf:charsPerPage : 1994
0: pdf:charsPerPage : 2062
0: pdf:charsPerPage : 1756
0: pdf:charsPerPage : 2007
0: pdf:charsPerPage : 2202
0: pdf:charsPerPage : 2105
0: pdf:charsPerPage : 2106
0: pdf:charsPerPage : 1895
0: pdf:charsPerPage : 1978
0: pdf:charsPerPage : 1826
0: pdf:charsPerPage : 1742
0: pdf:charsPerPage : 2073
0: pdf:charsPerPage : 1882
0: pdf:charsPerPage : 1497
0: pdf:charsPerPage : 282
0: pdf:charsPerPage : 606
0: pdf:charsPerPage : 948
0: pdf:charsPerPage : 418
0: pdf:charsPerPage : 266
0: pdf:charsPerPage : 830
0: pdf:charsPerPage : 259
0: pdf:charsPerPage : 716
0: pdf:charsPerPage : 961
0: pdf:charsPerPage : 1325
0: pdf:charsPerPage : 1478
0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 Radical 
Eye Software
0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
0: pdf:docinfo:title : Inel4shannon.dvi
0: pdf:encrypted : false
0: pdf:hasMarkedContent : false
0: pdf:hasXFA : false
0: pdf:hasXMP : false
0: pdf:producer : Acrobat Distiller 3.01 for Windows
0: pdf:unmappedUnicodeCharsPerPage : 109
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 113
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 94
0: pdf:unmappedUnicodeCharsPerPage : 112
0: pdf:unmappedUnicodeCharsPerPage : 178
0: pdf:unmappedUnicodeCharsPerPage : 74
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 189
0: pdf:unmappedUnicodeCharsPerPage : 165
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 186
0: pdf:unmappedUnicodeCharsPerPage : 162
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 119
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 115
0: pdf:unmappedUnicodeCharsPerPage : 99
0: pdf:unmappedUnicodeCharsPerPage : 107
0: pdf:unmappedUnicodeCharsPerPage : 108
0: pdf:unmappedUnicodeCharsPerPage : 116
0: pdf:unmappedUnicodeCharsPerPage : 174
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 61
0: pdf:unmappedUnicodeCharsPerPage : 90
0: pdf:unmappedUnicodeCharsPerPage : 239
0: pdf:unmappedUnicodeCharsPerPage : 614
0: pdf:unmappedUnicodeCharsPerPage : 216
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 502
0: pdf:unmappedUnicodeCharsPerPage : 103
0: pdf:unmappedUnicodeCharsPerPage : 427
0: pdf:unmappedUnicodeCharsPerPage : 629
0: pdf:unmappedUnicodeCharsPerPage : 347
0: pdf:unmappedUnicodeCharsPerPage : 327

On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg 
<[email protected]<mailto:[email protected]>> wrote:
Yes, 2.x

From: Tim Allison <[email protected]<mailto:[email protected]>>
Sent: Monday, April 5, 2021 9:54 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Parsing PDF file

Tika 2.x? Looking now.

On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg 
<[email protected]<mailto:[email protected]>> wrote:
If I use OCRStrategy=no_ocr, processing is orders of magnitude faster and I 
don’t see the calls to OCRParser (obviously).  Why is it taking so long with 
auto?  If the page does not meet the criteria for OCR, then it shouldn’t be 
calling OCR at all, right?

 "X-TIKA:Parsed-By": "[org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.pdf.PDFParser]",


From: Peter Kronenberg 
<[email protected]<mailto:[email protected]>>
Sent: Monday, April 5, 2021 8:48 AM
To: [email protected]<mailto:[email protected]>
Subject: RE: {EXTERNAL}Parsing PDF file

Correction: I see one instance of PDFParser at the beginning, but why does it 
then alternate between OCRParser and CompositeParser?

From: Peter Kronenberg 
<[email protected]<mailto:[email protected]>>
Sent: Monday, April 5, 2021 8:41 AM
To: [email protected]<mailto:[email protected]>
Subject: {EXTERNAL}Parsing PDF file

I’m parsing the attached PDF file.  It is a text PDF, not a scan.  I’m using 
OCR_Strategy=Auto, extractInlineImages=false.

The output contains the following in the metadata.  I’m wondering two things.  
First, why don’t I see PDFParser?
Second, why does it keep calling the TesseractOCRParser?  Once it determines 
that it is a PDF file, wouldn’t it stick with that?
I’m asking because it seems to take longer to parse than I would expect, and 
I’m wondering if the OCRParser is adding extra overhead.


"X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.pdf.PDFParser, org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, 
org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser]

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
Torch AI
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI
