It looks like the ligatures don't have Unicode mappings; for example, "Division of Monetary Affairs" is extracted as "Division of Monetary A�airs".
The check is:

    if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)

The issue is that this file has more than 10 unmapped Unicode characters per page. We could change the heuristic to something like:

    unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2

We should probably also check whether a parser is already in the Parsed-By list before re-adding it.

0: pdf:charsPerPage : 1579
0: pdf:charsPerPage : 1891
0: pdf:charsPerPage : 2283
0: pdf:charsPerPage : 2224
0: pdf:charsPerPage : 1619
0: pdf:charsPerPage : 2177
0: pdf:charsPerPage : 1626
0: pdf:charsPerPage : 1313
0: pdf:charsPerPage : 1652
0: pdf:charsPerPage : 1493
0: pdf:charsPerPage : 1136
0: pdf:charsPerPage : 1477
0: pdf:charsPerPage : 1264
0: pdf:charsPerPage : 1994
0: pdf:charsPerPage : 2062
0: pdf:charsPerPage : 1756
0: pdf:charsPerPage : 2007
0: pdf:charsPerPage : 2202
0: pdf:charsPerPage : 2105
0: pdf:charsPerPage : 2106
0: pdf:charsPerPage : 1895
0: pdf:charsPerPage : 1978
0: pdf:charsPerPage : 1826
0: pdf:charsPerPage : 1742
0: pdf:charsPerPage : 2073
0: pdf:charsPerPage : 1882
0: pdf:charsPerPage : 1497
0: pdf:charsPerPage : 282
0: pdf:charsPerPage : 606
0: pdf:charsPerPage : 948
0: pdf:charsPerPage : 418
0: pdf:charsPerPage : 266
0: pdf:charsPerPage : 830
0: pdf:charsPerPage : 259
0: pdf:charsPerPage : 716
0: pdf:charsPerPage : 961
0: pdf:charsPerPage : 1325
0: pdf:charsPerPage : 1478
0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 Radical Eye Software
0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
0: pdf:docinfo:title : Inel4shannon.dvi
0: pdf:encrypted : false
0: pdf:hasMarkedContent : false
0: pdf:hasXFA : false
0: pdf:hasXMP : false
0: pdf:producer : Acrobat Distiller 3.01 for Windows
0: pdf:unmappedUnicodeCharsPerPage : 109
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 113
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 94
0: pdf:unmappedUnicodeCharsPerPage : 112
0: pdf:unmappedUnicodeCharsPerPage : 178
0: pdf:unmappedUnicodeCharsPerPage : 74
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 189
0: pdf:unmappedUnicodeCharsPerPage : 165
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 186
0: pdf:unmappedUnicodeCharsPerPage : 162
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 119
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 115
0: pdf:unmappedUnicodeCharsPerPage : 99
0: pdf:unmappedUnicodeCharsPerPage : 107
0: pdf:unmappedUnicodeCharsPerPage : 108
0: pdf:unmappedUnicodeCharsPerPage : 116
0: pdf:unmappedUnicodeCharsPerPage : 174
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 61
0: pdf:unmappedUnicodeCharsPerPage : 90
0: pdf:unmappedUnicodeCharsPerPage : 239
0: pdf:unmappedUnicodeCharsPerPage : 614
0: pdf:unmappedUnicodeCharsPerPage : 216
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 502
0: pdf:unmappedUnicodeCharsPerPage : 103
0: pdf:unmappedUnicodeCharsPerPage : 427
0: pdf:unmappedUnicodeCharsPerPage : 629
0: pdf:unmappedUnicodeCharsPerPage : 347
0: pdf:unmappedUnicodeCharsPerPage : 327

On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg <[email protected]> wrote:

> Yes, 2.x
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 9:54 AM
> *To:* [email protected]
> *Subject:* Re: Parsing PDF file
>
> Tika 2.x? Looking now.
>
> On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <[email protected]> wrote:
>
> If I use OCRStrategy=no_ocr, the time it takes to process is orders of magnitude faster and I don’t see the calls to OCRParser (obviously). Why is it taking so long with auto? If the page does not meet the criteria for OCR, then it shouldn’t be calling OCR at all, right?
>
> "X-TIKA:Parsed-By": "[org.apache.tika.parser.CompositeParser, org.apache.tika.parser.pdf.PDFParser]",
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:48 AM
> *To:* [email protected]
> *Subject:* RE: {EXTERNAL}Parsing PDF file
>
> Correction: I see one instance of PDFParser at the beginning, but why does it then alternate between OCRParser and CompositeParser?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:41 AM
> *To:* [email protected]
> *Subject:* {EXTERNAL}Parsing PDF file
>
> Parsing the attached PDF file. It is a text file, not scanned. I’m using OCR_Strategy=Auto, extractInlineImages=false
>
> The output contains the following in the metadata. I’m wondering 2 things. First, why don’t I see PDFParser?
>
> And 2nd, why does it keep calling the TesseractOCRParser? Once it determines that it is a PDF file, wouldn’t it stick with that?
>
> I’m asking because it seems to take longer to parse than I would expect and I’m wondering if the OCRParser is adding extra overhead
>
> "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser]
>
> *Peter Kronenberg* | *Senior AI Analytic Engineer*
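For reference, here is a minimal sketch of the two changes suggested at the top of this thread. It is not Tika's actual code: the class and method names are hypothetical, it assumes percentUnmappedUnicodeChars means unmapped characters divided by total characters on the page, and it assumes the "too little text" clause (totalCharsPerPage < 10) stays in place.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these names come from Tika's API. */
class OcrHeuristicSketch {

    /** The check as quoted at the top of the thread. */
    static boolean shouldOcrCurrent(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    /**
     * The proposed variant: require both an absolute count and a ratio.
     * Assumption: the ratio is unmapped / total characters on the page.
     */
    static boolean shouldOcrProposed(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        if (totalCharsPerPage < 10) {
            return true; // almost no extractable text, so OCR regardless
        }
        double percentUnmappedUnicodeChars =
                (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2;
    }

    /** Adds a parser class name to a Parsed-By style list only if it is not already there. */
    static boolean addParserOnce(List<String> parsedBy, String parserClassName) {
        if (parsedBy.contains(parserClassName)) {
            return false; // already recorded, do not re-add
        }
        parsedBy.add(parserClassName);
        return true;
    }

    /** Small demo: the same parser name offered twice is stored once. */
    static List<String> demoParsedBy() {
        List<String> parsedBy = new ArrayList<>();
        addParserOnce(parsedBy, "org.apache.tika.parser.ocr.TesseractOCRParser");
        addParserOnce(parsedBy, "org.apache.tika.parser.ocr.TesseractOCRParser");
        return parsedBy;
    }
}
```

Taking the first values from each list in the dump (1579 characters, 109 unmapped) and assuming they belong to the same page, the current check triggers OCR, while the proposed one does not, since 109 / 1579 is roughly 0.07. A page like the one with 948 characters and 614 unmapped (again assuming the lists line up by page) would still go to OCR under the proposed check.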
