Fwd: Parsing PDF file

Tim Allison Mon, 05 Apr 2021 07:15:44 -0700

It looks like the ligatures don't have Unicode mappings. For example,
the "ff" ligature in "Affairs" is extracted as:

"Division of Monetary A�airs"


if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)

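The issue is that every page of this file has more than 10 unmapped
Unicode chars (the lowest count in the metadata below is 61), so the
second condition is true on every page.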

We could change the heuristic to unmappedUnicodeCharsPerPage > 10 &&
percentUnmappedUnicodeChars > 0.2 or something?
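
A minimal sketch of the two checks side by side, to make the suggestion concrete. This is not the actual Tika code: the class and method names are hypothetical, it assumes the totalCharsPerPage < 10 clause is kept as is, and it assumes percentUnmappedUnicodeChars means unmapped chars divided by total chars for the page. The sample pairs assume the two metadata lists below are in page order.

```java
/** Hypothetical sketch of the current and proposed per-page checks. */
final class OcrHeuristicSketch {

    // Thresholds from the message above; 0.2 is only the suggested value.
    static final int MIN_TOTAL_CHARS = 10;
    static final int MAX_UNMAPPED_CHARS = 10;
    static final double MAX_UNMAPPED_RATIO = 0.2;

    /** The check as quoted: too little text OR too many unmapped chars. */
    static boolean currentCheck(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < MIN_TOTAL_CHARS
                || unmappedUnicodeCharsPerPage > MAX_UNMAPPED_CHARS;
    }

    /** Proposed: unmapped chars must also be a large share of the page. */
    static boolean proposedCheck(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        // Assumption: the percentage is unmapped / total; guard against empty pages.
        double percentUnmapped = totalCharsPerPage == 0
                ? 0.0
                : (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return totalCharsPerPage < MIN_TOTAL_CHARS
                || (unmappedUnicodeCharsPerPage > MAX_UNMAPPED_CHARS
                        && percentUnmapped > MAX_UNMAPPED_RATIO);
    }

    public static void main(String[] args) {
        // {charsPerPage, unmappedUnicodeCharsPerPage} pairs taken from the dump below.
        int[][] pages = { {1579, 109}, {1497, 61}, {282, 90} };
        for (int[] p : pages) {
            System.out.printf("total=%d unmapped=%d current=%b proposed=%b%n",
                    p[0], p[1], currentCheck(p[0], p[1]), proposedCheck(p[0], p[1]));
        }
    }
}
```

Under these assumptions a page like 1579/109 (about 7% unmapped) would no longer trip the check, while a page like 282/90 (about 32% unmapped) still would.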

We should also probably check to see if a parser is in the parsed by
list before re-adding it?
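
Roughly what that check could look like, as a sketch only. A plain List<String> stands in for the multi-valued parsed-by metadata field here; the class and method names are hypothetical and this does not use Tika's Metadata API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: record a parser name only once in a parsed-by list. */
final class ParsedBySketch {

    /** Adds parserName unless it is already present; returns true if it was added. */
    static boolean addIfAbsent(List<String> parsedBy, String parserName) {
        if (parsedBy.contains(parserName)) {
            return false; // already recorded, do not re-add
        }
        parsedBy.add(parserName);
        return true;
    }

    public static void main(String[] args) {
        List<String> parsedBy = new ArrayList<>();
        String[] calls = {
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.ocr.TesseractOCRParser",
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.ocr.TesseractOCRParser"
        };
        for (String name : calls) {
            addIfAbsent(parsedBy, name);
        }
        // Each parser appears once, in first-seen order.
        System.out.println(parsedBy);
    }
}
```

The trade-off is that the list no longer shows how many times each parser ran, only which parsers were involved.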


0: pdf:charsPerPage : 1579
0: pdf:charsPerPage : 1891
0: pdf:charsPerPage : 2283
0: pdf:charsPerPage : 2224
0: pdf:charsPerPage : 1619
0: pdf:charsPerPage : 2177
0: pdf:charsPerPage : 1626
0: pdf:charsPerPage : 1313
0: pdf:charsPerPage : 1652
0: pdf:charsPerPage : 1493
0: pdf:charsPerPage : 1136
0: pdf:charsPerPage : 1477
0: pdf:charsPerPage : 1264
0: pdf:charsPerPage : 1994
0: pdf:charsPerPage : 2062
0: pdf:charsPerPage : 1756
0: pdf:charsPerPage : 2007
0: pdf:charsPerPage : 2202
0: pdf:charsPerPage : 2105
0: pdf:charsPerPage : 2106
0: pdf:charsPerPage : 1895
0: pdf:charsPerPage : 1978
0: pdf:charsPerPage : 1826
0: pdf:charsPerPage : 1742
0: pdf:charsPerPage : 2073
0: pdf:charsPerPage : 1882
0: pdf:charsPerPage : 1497
0: pdf:charsPerPage : 282
0: pdf:charsPerPage : 606
0: pdf:charsPerPage : 948
0: pdf:charsPerPage : 418
0: pdf:charsPerPage : 266
0: pdf:charsPerPage : 830
0: pdf:charsPerPage : 259
0: pdf:charsPerPage : 716
0: pdf:charsPerPage : 961
0: pdf:charsPerPage : 1325
0: pdf:charsPerPage : 1478
0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 Radical Eye Software
0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
0: pdf:docinfo:title : Inel4shannon.dvi
0: pdf:encrypted : false
0: pdf:hasMarkedContent : false
0: pdf:hasXFA : false
0: pdf:hasXMP : false
0: pdf:producer : Acrobat Distiller 3.01 for Windows
0: pdf:unmappedUnicodeCharsPerPage : 109
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 113
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 94
0: pdf:unmappedUnicodeCharsPerPage : 112
0: pdf:unmappedUnicodeCharsPerPage : 178
0: pdf:unmappedUnicodeCharsPerPage : 74
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 189
0: pdf:unmappedUnicodeCharsPerPage : 165
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 186
0: pdf:unmappedUnicodeCharsPerPage : 162
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 119
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 115
0: pdf:unmappedUnicodeCharsPerPage : 99
0: pdf:unmappedUnicodeCharsPerPage : 107
0: pdf:unmappedUnicodeCharsPerPage : 108
0: pdf:unmappedUnicodeCharsPerPage : 116
0: pdf:unmappedUnicodeCharsPerPage : 174
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 61
0: pdf:unmappedUnicodeCharsPerPage : 90
0: pdf:unmappedUnicodeCharsPerPage : 239
0: pdf:unmappedUnicodeCharsPerPage : 614
0: pdf:unmappedUnicodeCharsPerPage : 216
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 502
0: pdf:unmappedUnicodeCharsPerPage : 103
0: pdf:unmappedUnicodeCharsPerPage : 427
0: pdf:unmappedUnicodeCharsPerPage : 629
0: pdf:unmappedUnicodeCharsPerPage : 347
0: pdf:unmappedUnicodeCharsPerPage : 327


On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg <[email protected]>
wrote:

> Yes, 2.x
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 9:54 AM
> *To:* [email protected]
> *Subject:* Re: Parsing PDF file
>
>
>
> Tika 2.x? Looking now.
>
>
>
> On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <[email protected]>
> wrote:
>
> If I use OCRStrategy=no_ocr, processing is orders of magnitude faster
> and I don’t see the calls to OCRParser (obviously). Why is it taking so
> long with auto?  If the page does not meet the criteria for OCR, then it
> shouldn’t be calling OCR at all, right?
>
>
>
>  "X-TIKA:Parsed-By":
> "[org.apache.tika.parser.CompositeParser, org.apache.tika.parser.pdf.PDFParser]",
>
>
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:48 AM
> *To:* [email protected]
> *Subject:* RE: {EXTERNAL}Parsing PDF file
>
>
>
> Correction: I see one instance of PDFParser at the beginning, but why does
> it then alternate between OCRParser and CompositeParser?
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:41 AM
> *To:* [email protected]
> *Subject:* {EXTERNAL}Parsing PDF file
>
>
>
> Parsing the attached PDF file.   It is a text file, not scanned.  I’m
> using OCR_Strategy=Auto, extractInlineImages=false
>
>
>
> The output contains the following in the metadata.  I’m wondering 2
> things.  First, why don’t I see PDFParser?
>
> And 2nd, why does it keep calling the TesseractOCRParser?  Once it
> determines that it is a PDF file, wouldn’t it stick with that?
>
> I’m asking because it seems to take longer to parse than I would expect,
> and I’m wondering if the OCRParser is adding extra overhead.
>
>
>
>
>
> "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser]
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
