And, yes, it looks like OCR takes ~2-3 seconds per page (on my laptop):
page index: 0 2319 ms
page index: 1 2568 ms
page index: 2 2841 ms
page index: 3 2918 ms
page index: 4 2408 ms
page index: 5 3061 ms
page index: 6 2578 ms
page index: 7 2530 ms
page index: 8 3162 ms
page index: 9 2346 ms
page index: 10 2221 ms
page index: 11 2796 ms
page index: 12 2003 ms
page index: 13 2746 ms
page index: 14 2823 ms
page index: 15 2817 ms
page index: 16 2773 ms
page index: 17 2883 ms
page index: 18 2846 ms
page index: 19 3011 ms
page index: 20 2602 ms
page index: 21 2866 ms
page index: 22 2510 ms
page index: 23 2496 ms
page index: 24 2607 ms
page index: 25 2570 ms
page index: 26 2460 ms
page index: 27 958 ms
page index: 28 1388 ms
page index: 29 1515 ms
page index: 30 1078 ms
page index: 31 929 ms
page index: 32 1521 ms
page index: 33 899 ms
page index: 34 1422 ms
page index: 35 1553 ms
page index: 36 1954 ms
page index: 37 2480 ms

On Mon, Apr 5, 2021 at 10:15 AM Tim Allison <[email protected]> wrote:

> It looks like the ligatures don't have unicode mappings:
>
> "Division of Monetary A�airs"
>
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)
>
> The issue is that this file has > 10 unmapped unicode chars per page.
>
> We could change the heuristic to unmappedUnicodeCharsPerPage > 10 &&
> percentUnmappedUnicodeChars > 0.2 or something?
>
> We should also probably check to see if a parser is in the parsed by list
> before re-adding it?
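For what it's worth, here is a minimal Java sketch of the two rules discussed above. It is not Tika's actual code: the class and method names are made up for illustration. It assumes that percentUnmappedUnicodeChars means unmappedUnicodeCharsPerPage / totalCharsPerPage, that the existing totalCharsPerPage < 10 check stays in place, and that the multi-valued metadata entries quoted in this thread are in page order (so 1579 and 109 both describe the first page).

```java
/** Hypothetical sketch of the per-page OCR trigger; not the Tika implementation. */
final class OcrHeuristicSketch {

    /** The rule as quoted in the thread. */
    static boolean currentRule(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    /**
     * The suggested change. Assumption: the "percent" is the fraction
     * unmapped / total, and nearly empty pages still go to OCR.
     */
    static boolean proposedRule(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        if (totalCharsPerPage < 10) {
            return true; // also avoids dividing by zero below
        }
        double percentUnmapped = (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return unmappedUnicodeCharsPerPage > 10 && percentUnmapped > 0.2;
    }

    public static void main(String[] args) {
        // First entries in the metadata: 1579 chars, 109 unmapped (about 7%).
        System.out.println(currentRule(1579, 109));  // true: OCR is triggered today
        System.out.println(proposedRule(1579, 109)); // false: OCR would be skipped

        // A later pair in the metadata: 948 chars, 614 unmapped (about 65%).
        System.out.println(proposedRule(948, 614));  // true: OCR still triggered
    }
}
```

Under those assumptions the ligature-only pages would fall below the 0.2 cutoff and skip OCR, while pages where most characters are unmapped would still be OCR'd. The 0.2 value is just the number floated above, not a tuned threshold.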
>
> 0: pdf:charsPerPage : 1579
> 0: pdf:charsPerPage : 1891
> 0: pdf:charsPerPage : 2283
> 0: pdf:charsPerPage : 2224
> 0: pdf:charsPerPage : 1619
> 0: pdf:charsPerPage : 2177
> 0: pdf:charsPerPage : 1626
> 0: pdf:charsPerPage : 1313
> 0: pdf:charsPerPage : 1652
> 0: pdf:charsPerPage : 1493
> 0: pdf:charsPerPage : 1136
> 0: pdf:charsPerPage : 1477
> 0: pdf:charsPerPage : 1264
> 0: pdf:charsPerPage : 1994
> 0: pdf:charsPerPage : 2062
> 0: pdf:charsPerPage : 1756
> 0: pdf:charsPerPage : 2007
> 0: pdf:charsPerPage : 2202
> 0: pdf:charsPerPage : 2105
> 0: pdf:charsPerPage : 2106
> 0: pdf:charsPerPage : 1895
> 0: pdf:charsPerPage : 1978
> 0: pdf:charsPerPage : 1826
> 0: pdf:charsPerPage : 1742
> 0: pdf:charsPerPage : 2073
> 0: pdf:charsPerPage : 1882
> 0: pdf:charsPerPage : 1497
> 0: pdf:charsPerPage : 282
> 0: pdf:charsPerPage : 606
> 0: pdf:charsPerPage : 948
> 0: pdf:charsPerPage : 418
> 0: pdf:charsPerPage : 266
> 0: pdf:charsPerPage : 830
> 0: pdf:charsPerPage : 259
> 0: pdf:charsPerPage : 716
> 0: pdf:charsPerPage : 961
> 0: pdf:charsPerPage : 1325
> 0: pdf:charsPerPage : 1478
> 0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 Radical Eye Software
> 0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:docinfo:title : Inel4shannon.dvi
> 0: pdf:encrypted : false
> 0: pdf:hasMarkedContent : false
> 0: pdf:hasXFA : false
> 0: pdf:hasXMP : false
> 0: pdf:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:unmappedUnicodeCharsPerPage : 109
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 113
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 94
> 0: pdf:unmappedUnicodeCharsPerPage : 112
> 0: pdf:unmappedUnicodeCharsPerPage : 178
> 0: pdf:unmappedUnicodeCharsPerPage : 74
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 189
> 0: pdf:unmappedUnicodeCharsPerPage : 165
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 186
> 0: pdf:unmappedUnicodeCharsPerPage : 162
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 119
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 115
> 0: pdf:unmappedUnicodeCharsPerPage : 99
> 0: pdf:unmappedUnicodeCharsPerPage : 107
> 0: pdf:unmappedUnicodeCharsPerPage : 108
> 0: pdf:unmappedUnicodeCharsPerPage : 116
> 0: pdf:unmappedUnicodeCharsPerPage : 174
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 61
> 0: pdf:unmappedUnicodeCharsPerPage : 90
> 0: pdf:unmappedUnicodeCharsPerPage : 239
> 0: pdf:unmappedUnicodeCharsPerPage : 614
> 0: pdf:unmappedUnicodeCharsPerPage : 216
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 502
> 0: pdf:unmappedUnicodeCharsPerPage : 103
> 0: pdf:unmappedUnicodeCharsPerPage : 427
> 0: pdf:unmappedUnicodeCharsPerPage : 629
> 0: pdf:unmappedUnicodeCharsPerPage : 347
> 0: pdf:unmappedUnicodeCharsPerPage : 327
>
> On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg <[email protected]> wrote:
>
>> Yes, 2.x
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Monday, April 5, 2021 9:54 AM
>> *To:* [email protected]
>> *Subject:* Re: Parsing PDF file
>>
>> Tika 2.x? Looking now.
>>
>> On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <[email protected]> wrote:
>>
>> If I use OCRStrategy=no_ocr, the time it takes to process is orders of
>> magnitude faster and I don’t see the calls to OCRParser (obviously) Why is
>> it taking so long with auto? If the page does not meet the criteria for
>> OCR, then it shouldn’t be calling OCR at all, right?
>>
>> "X-TIKA:Parsed-By":
>> "[org.apache.tika.parser.CompositeParser,
>> org.apache.tika.parser.pdf.PDFParser]"
>> ,
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Monday, April 5, 2021 8:48 AM
>> *To:* [email protected]
>> *Subject:* RE: {EXTERNAL}Parsing PDF file
>>
>> This email was sent from outside your organisation, yet is displaying the
>> name of someone from your organisation. This often happens in phishing
>> attempts. Please only interact with this email if you know its source and
>> that the content is safe.
>>
>> Correction: I see one instance of PDFParser at the beginning, but why
>> does it then alternate between OCRParser and CompositeParser?
>>
>> *From:* Peter Kronenberg <[email protected]>
>> *Sent:* Monday, April 5, 2021 8:41 AM
>> *To:* [email protected]
>> *Subject:* {EXTERNAL}Parsing PDF file
>>
>> CAUTION: This email originated from outside of the organization. DO NOT
>> click links or open attachments unless you recognize the sender and know
>> the content is safe.
>>
>> Parsing the attached PDF file. It is a text file, not scanned. I’m
>> using OCR_Strategy=Auto, extractInlineImages=false
>>
>> The output contains the following in the metadata. I’m wondering 2
>> things. First, why don’t I see PDFParser?
>>
>> And 2nd, why does it keep calling the TesseractOCRParser? Once it
>> determines that it is a PDF file, wouldn’t it stick with that?
>> >> I’m asking because it seems to take longer to parse than I would expect >> and I’m wondering if the OCRParser is adding extra overhead >> >> >> >> >> >> "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.pdf.PDFParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> 
org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser, >> org.apache.tika.parser.CompositeParser, >> org.apache.tika.parser.ocr.TesseractOCRParser] >> >> >> >> *Peter Kronenberg* *| * *Senior AI Analytic ENGINEER * >> >> *C: 703.887.5623 * >> >> [image: Torch AI] >> 
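On the idea of checking whether a parser is already in the parsed-by list before re-adding it, a rough Java sketch of what that could look like follows. The class and method names here are hypothetical and only illustrate the check; this is not how Tika records X-TIKA:Parsed-By internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the X-TIKA:Parsed-By values; not a Tika class. */
final class ParsedByListSketch {

    private final List<String> parsedBy = new ArrayList<>();

    /** Records a parser class name only if it is not already in the list. */
    void record(String parserClassName) {
        if (!parsedBy.contains(parserClassName)) {
            parsedBy.add(parserClassName);
        }
    }

    /** Returns a copy of the recorded names in first-seen order. */
    List<String> values() {
        return new ArrayList<>(parsedBy);
    }

    public static void main(String[] args) {
        ParsedByListSketch sketch = new ParsedByListSketch();
        sketch.record("org.apache.tika.parser.CompositeParser");
        sketch.record("org.apache.tika.parser.pdf.PDFParser");
        // One CompositeParser/TesseractOCRParser pair per OCR'd page.
        for (int page = 0; page < 38; page++) {
            sketch.record("org.apache.tika.parser.CompositeParser");
            sketch.record("org.apache.tika.parser.ocr.TesseractOCRParser");
        }
        System.out.println(sketch.values());
    }
}
```

With a check like this, the long alternating list quoted in this thread would collapse to three entries: CompositeParser, PDFParser, and TesseractOCRParser. The trade-off is that the list no longer shows how many times each parser was invoked.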
