Re: Parsing PDF file

Tim Allison Mon, 05 Apr 2021 07:30:10 -0700

And, yes, it looks like OCR takes ~2-3 seconds per page (on my laptop):


page index: 0 2319 ms
page index: 1 2568 ms
page index: 2 2841 ms
page index: 3 2918 ms
page index: 4 2408 ms
page index: 5 3061 ms
page index: 6 2578 ms
page index: 7 2530 ms
page index: 8 3162 ms
page index: 9 2346 ms
page index: 10 2221 ms
page index: 11 2796 ms
page index: 12 2003 ms
page index: 13 2746 ms
page index: 14 2823 ms
page index: 15 2817 ms
page index: 16 2773 ms
page index: 17 2883 ms
page index: 18 2846 ms
page index: 19 3011 ms
page index: 20 2602 ms
page index: 21 2866 ms
page index: 22 2510 ms
page index: 23 2496 ms
page index: 24 2607 ms
page index: 25 2570 ms
page index: 26 2460 ms
page index: 27 958 ms
page index: 28 1388 ms
page index: 29 1515 ms
page index: 30 1078 ms
page index: 31 929 ms
page index: 32 1521 ms
page index: 33 899 ms
page index: 34 1422 ms
page index: 35 1553 ms
page index: 36 1954 ms
page index: 37 2480 ms
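
To turn a per-page timing log like the one above into a total and an average, a small helper along these lines could be used. This is only a sketch: the class and method names are made up for illustration, and it assumes each line has exactly the "page index: N M ms" shape shown in this message.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: summarizes "page index: N M ms" lines.
class OcrTimingSummary {
    // Matches lines such as "page index: 12 2003 ms"
    private static final Pattern LINE = Pattern.compile("page index: (\\d+) (\\d+) ms");

    // Sum of the millisecond values on all matching lines.
    static long totalMillis(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                total += Long.parseLong(m.group(2));
            }
        }
        return total;
    }

    // Number of lines that look like a per-page timing entry.
    static long pageCount(List<String> lines) {
        return lines.stream().filter(l -> LINE.matcher(l.trim()).matches()).count();
    }

    public static void main(String[] args) {
        // First three entries from the log above.
        List<String> sample = List.of(
                "page index: 0 2319 ms",
                "page index: 1 2568 ms",
                "page index: 2 2841 ms");
        long total = totalMillis(sample);
        long pages = pageCount(sample);
        System.out.printf("%d pages, %d ms total, %.1f ms/page%n",
                pages, total, (double) total / pages);
    }
}
```

Feeding it the full log gives the overall OCR cost for the document, which is the overhead that disappears when OCR is not triggered.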

On Mon, Apr 5, 2021 at 10:15 AM Tim Allison wrote:

> It looks like the ligatures don't have unicode mappings:
>
> "Division of Monetary A�airs"
>
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)
>
> The issue is that this file has > 10 unmapped unicode chars per page.
>
> We could change the heuristic to unmappedUnicodeCharsPerPage > 10 && 
> percentUnmappedUnicodeChars > 0.2 or something?
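
The difference between the current check and the suggested one can be sketched as below. This is not the PDFParser code: the class and method names are hypothetical, and it assumes that percentUnmappedUnicodeChars means unmapped characters divided by total characters on the page, and that the totalCharsPerPage < 10 clause stays as it is.

```java
// Hypothetical sketch of the per-page "should we OCR?" decision; not Tika's code.
class OcrHeuristic {

    // The check quoted in this message.
    static boolean currentCheck(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    // One reading of the suggested change: also require a high share of unmapped chars.
    static boolean proposedCheck(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        if (totalCharsPerPage < 10) {
            return true; // almost no extractable text: likely a scanned page
        }
        // Assumption: ratio of unmapped chars to all chars on the page.
        double percentUnmappedUnicodeChars =
                (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2;
    }

    public static void main(String[] args) {
        // First page of this file: 1579 chars, 109 unmapped.
        System.out.println("current:  " + currentCheck(1579, 109));
        System.out.println("proposed: " + proposedCheck(1579, 109));
    }
}
```

With the first page's values (1579 characters, 109 unmapped), the current check triggers OCR while this version of the proposed check does not, since 109/1579 is well under 0.2.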
>
> We should also probably check to see if a parser is in the parsed by list 
> before re-adding it?
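
The "check before re-adding" idea could look roughly like this. It is a sketch only: a plain list stands in for the Parsed-By metadata values, and the class and method names are hypothetical rather than Tika's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a plain list stands in for the X-TIKA:Parsed-By values.
class ParsedByList {

    // Append the parser class name only if it is not already recorded.
    static void addIfAbsent(List<String> parsedBy, String parserClassName) {
        if (!parsedBy.contains(parserClassName)) {
            parsedBy.add(parserClassName);
        }
    }

    public static void main(String[] args) {
        List<String> parsedBy = new ArrayList<>();
        String[] calls = {
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.ocr.TesseractOCRParser",
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.ocr.TesseractOCRParser"
        };
        for (String call : calls) {
            addIfAbsent(parsedBy, call);
        }
        // Each parser appears once, in first-seen order.
        System.out.println(parsedBy);
    }
}
```

This keeps first-seen order, so the list would still show which parser ran first without repeating the CompositeParser/TesseractOCRParser pair for every page.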
>
>
> 0: pdf:charsPerPage : 1579
> 0: pdf:charsPerPage : 1891
> 0: pdf:charsPerPage : 2283
> 0: pdf:charsPerPage : 2224
> 0: pdf:charsPerPage : 1619
> 0: pdf:charsPerPage : 2177
> 0: pdf:charsPerPage : 1626
> 0: pdf:charsPerPage : 1313
> 0: pdf:charsPerPage : 1652
> 0: pdf:charsPerPage : 1493
> 0: pdf:charsPerPage : 1136
> 0: pdf:charsPerPage : 1477
> 0: pdf:charsPerPage : 1264
> 0: pdf:charsPerPage : 1994
> 0: pdf:charsPerPage : 2062
> 0: pdf:charsPerPage : 1756
> 0: pdf:charsPerPage : 2007
> 0: pdf:charsPerPage : 2202
> 0: pdf:charsPerPage : 2105
> 0: pdf:charsPerPage : 2106
> 0: pdf:charsPerPage : 1895
> 0: pdf:charsPerPage : 1978
> 0: pdf:charsPerPage : 1826
> 0: pdf:charsPerPage : 1742
> 0: pdf:charsPerPage : 2073
> 0: pdf:charsPerPage : 1882
> 0: pdf:charsPerPage : 1497
> 0: pdf:charsPerPage : 282
> 0: pdf:charsPerPage : 606
> 0: pdf:charsPerPage : 948
> 0: pdf:charsPerPage : 418
> 0: pdf:charsPerPage : 266
> 0: pdf:charsPerPage : 830
> 0: pdf:charsPerPage : 259
> 0: pdf:charsPerPage : 716
> 0: pdf:charsPerPage : 961
> 0: pdf:charsPerPage : 1325
> 0: pdf:charsPerPage : 1478
> 0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 
> Radical Eye Software
> 0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:docinfo:title : Inel4shannon.dvi
> 0: pdf:encrypted : false
> 0: pdf:hasMarkedContent : false
> 0: pdf:hasXFA : false
> 0: pdf:hasXMP : false
> 0: pdf:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:unmappedUnicodeCharsPerPage : 109
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 113
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 94
> 0: pdf:unmappedUnicodeCharsPerPage : 112
> 0: pdf:unmappedUnicodeCharsPerPage : 178
> 0: pdf:unmappedUnicodeCharsPerPage : 74
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 189
> 0: pdf:unmappedUnicodeCharsPerPage : 165
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 186
> 0: pdf:unmappedUnicodeCharsPerPage : 162
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 119
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 115
> 0: pdf:unmappedUnicodeCharsPerPage : 99
> 0: pdf:unmappedUnicodeCharsPerPage : 107
> 0: pdf:unmappedUnicodeCharsPerPage : 108
> 0: pdf:unmappedUnicodeCharsPerPage : 116
> 0: pdf:unmappedUnicodeCharsPerPage : 174
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 61
> 0: pdf:unmappedUnicodeCharsPerPage : 90
> 0: pdf:unmappedUnicodeCharsPerPage : 239
> 0: pdf:unmappedUnicodeCharsPerPage : 614
> 0: pdf:unmappedUnicodeCharsPerPage : 216
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 502
> 0: pdf:unmappedUnicodeCharsPerPage : 103
> 0: pdf:unmappedUnicodeCharsPerPage : 427
> 0: pdf:unmappedUnicodeCharsPerPage : 629
> 0: pdf:unmappedUnicodeCharsPerPage : 347
> 0: pdf:unmappedUnicodeCharsPerPage : 327
>
>
> On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg wrote:
>
>> Yes, 2.x
>>
>>
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Monday, April 5, 2021 9:54 AM
>> *To:* [email protected]
>> *Subject:* Re: Parsing PDF file
>>
>>
>>
>> Tika 2.x? Looking now.
>>
>>
>>
>> On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg wrote:
>>
>> If I use OCRStrategy=no_ocr, processing is orders of magnitude faster and
>> I don’t see the calls to OCRParser (obviously). Why is it taking so long
>> with auto? If the page does not meet the criteria for OCR, then it
>> shouldn’t be calling OCR at all, right?
>>
>>
>>
>> "X-TIKA:Parsed-By": "[org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.pdf.PDFParser]",
>>
>>
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Monday, April 5, 2021 8:48 AM
>> *To:* [email protected]
>> *Subject:* RE: {EXTERNAL}Parsing PDF file
>>
>>
>>
>> Correction: I see one instance of PDFParser at the beginning, but why
>> does it then alternate between OCRParser and CompositeParser?
>>
>>
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Monday, April 5, 2021 8:41 AM
>> *To:* [email protected]
>> *Subject:* {EXTERNAL}Parsing PDF file
>>
>>
>>
>> I’m parsing the attached PDF file. It is a text PDF, not a scanned one. I’m
>> using OCR_Strategy=Auto, extractInlineImages=false.
>>
>>
>>
>> The output contains the following in the metadata. I’m wondering two
>> things. First, why don’t I see PDFParser?
>>
>> Second, why does it keep calling the TesseractOCRParser? Once it
>> determines that it is a PDF file, wouldn’t it stick with that?
>>
>> I’m asking because it seems to take longer to parse than I would expect,
>> and I’m wondering if the OCRParser is adding extra overhead.
>>
>>
>>
>>
>>
>> "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.pdf.PDFParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser,
>> org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.ocr.TesseractOCRParser]
>>
>>
>>
>> Peter Kronenberg | Senior AI Analytic Engineer
