I just updated the section on OOV here: https://cwiki.apache.org/confluence/display/TIKA/TikaEvalMetrics
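
As a rough illustration of what an out-of-vocabulary (OOV) measurement captures, here is a minimal sketch. It assumes OOV is the fraction of alphabetic tokens that are missing from a list of common words. The class name `OovSketch`, the tiny word list, and the tokenization are made up for this example and are not tika-eval's implementation; the wiki page above describes how tika-eval actually defines and computes the metric.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OovSketch {
    // Hypothetical, tiny stand-in for a per-language list of common words.
    static final Set<String> COMMON_WORDS =
            Set.of("division", "of", "monetary", "affairs", "the", "efficient");

    // A token here is simply a run of Unicode letters (an assumption of this sketch).
    private static final Pattern ALPHABETIC_TOKEN = Pattern.compile("\\p{L}+");

    /** Fraction of alphabetic tokens not found in the vocabulary; 0.0 if there are no tokens. */
    static double oov(String text, Set<String> vocabulary) {
        Matcher matcher = ALPHABETIC_TOKEN.matcher(text);
        int total = 0;
        int unknown = 0;
        while (matcher.find()) {
            total++;
            if (!vocabulary.contains(matcher.group().toLowerCase(Locale.ROOT))) {
                unknown++;
            }
        }
        return total == 0 ? 0.0 : (double) unknown / total;
    }

    public static void main(String[] args) {
        // Clean extraction versus the same phrase with an unmapped ligature (U+FFFD).
        System.out.println(oov("Division of Monetary Affairs", COMMON_WORDS));
        System.out.println(oov("Division of Monetary A\uFFFDairs", COMMON_WORDS));
    }
}
```

With the unmapped ligature, "Affairs" breaks into "A" and "airs", neither of which is in the word list, so the score in this sketch rises from 0.0 to 0.4. A high score on a page is a hint that the extracted text may be garbled.
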
To get Tika to calculate that for you automatically, drop the tika-eval-core jar in your tika-server's classpath.

See also: https://github.com/tballison/share/blob/main/slides/activate19/Activate2019_tika_tallison_20190911.pptx

On Mon, Apr 5, 2021 at 2:47 PM Peter Kronenberg <[email protected]> wrote:

> Can you please explain ‘out of vocabulary measurement’?
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 1:49 PM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Parsing PDF file
>
> Y. You understand perfectly!
>
> I want "auto" to be the best it can be and most generally applicable
> across use cases. For users who want high performance/better control, you
> might parse the PDF first with NO_OCR, and then make the determination on
> which pages to run OCR based on those statistics pulled out in the first
> parse. Another key statistic in the decision would be the out of
> vocabulary measurement that you can get with an integration with tika-eval.
>
> So, in short, if there are clear, provable, general improvements to AUTO,
> we should make them. If you want more refined control, let us know if the
> current metadata can be improved to help you develop your application for
> your use cases.
>
> On Mon, Apr 5, 2021 at 1:06 PM Peter Kronenberg <[email protected]> wrote:
>
> You’re right that OCRing would result in slightly more accurate results in
> this case. But the performance penalty is high. Wondering if there is
> some intermediate option.
>
> I think I understand now why you are separately looking for unmapped
> characters as well as total characters. If total characters is low, we
> assume the page is an image and OCR.
> But if unmapped characters is high,
> it might still be straight text, but the unmapped characters will
> essentially result in unreadable characters
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 11:39 AM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Parsing PDF file
>
> As for the metadata, we should add unique. Given that multiple parsers
> can hit the same file, we need to record all of them (in this case:
> default, pdf, tesseract).
>
> As for tweaking the settings...I'm not sure as I look at the extracted
> text more. There are quite a few bad ligatures /unmapped unicode chars
> which would render search for, e.g. "efficient", "affairs" useless.
>
> On Mon, Apr 5, 2021 at 10:40 AM Peter Kronenberg <[email protected]> wrote:
>
> Yes, I think tweaking the criteria for Auto is a good idea.
>
> And if the parser list was a Set, that would automatically eliminate dups
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 10:15 AM
> *To:* [email protected]
> *Subject:* Fwd: Parsing PDF file
>
> It looks like the ligatures don't have unicode mappings:
>
> "Division of Monetary A�airs"
>
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)
>
> The issue is that this file has > 10 unmapped unicode chars per page.
>
> We could change the heuristic to unmappedUnicodeCharsPerPage > 10 &&
> percentUnmappedUnicodeChars > 0.2 or something?
>
> We should also probably check to see if a parser is in the parsed by list
> before re-adding it?
>
> 0: pdf:charsPerPage : 1579
> 0: pdf:charsPerPage : 1891
> 0: pdf:charsPerPage : 2283
> 0: pdf:charsPerPage : 2224
> 0: pdf:charsPerPage : 1619
> 0: pdf:charsPerPage : 2177
> 0: pdf:charsPerPage : 1626
> 0: pdf:charsPerPage : 1313
> 0: pdf:charsPerPage : 1652
> 0: pdf:charsPerPage : 1493
> 0: pdf:charsPerPage : 1136
> 0: pdf:charsPerPage : 1477
> 0: pdf:charsPerPage : 1264
> 0: pdf:charsPerPage : 1994
> 0: pdf:charsPerPage : 2062
> 0: pdf:charsPerPage : 1756
> 0: pdf:charsPerPage : 2007
> 0: pdf:charsPerPage : 2202
> 0: pdf:charsPerPage : 2105
> 0: pdf:charsPerPage : 2106
> 0: pdf:charsPerPage : 1895
> 0: pdf:charsPerPage : 1978
> 0: pdf:charsPerPage : 1826
> 0: pdf:charsPerPage : 1742
> 0: pdf:charsPerPage : 2073
> 0: pdf:charsPerPage : 1882
> 0: pdf:charsPerPage : 1497
> 0: pdf:charsPerPage : 282
> 0: pdf:charsPerPage : 606
> 0: pdf:charsPerPage : 948
> 0: pdf:charsPerPage : 418
> 0: pdf:charsPerPage : 266
> 0: pdf:charsPerPage : 830
> 0: pdf:charsPerPage : 259
> 0: pdf:charsPerPage : 716
> 0: pdf:charsPerPage : 961
> 0: pdf:charsPerPage : 1325
> 0: pdf:charsPerPage : 1478
> 0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 Radical Eye Software
> 0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:docinfo:title : Inel4shannon.dvi
> 0: pdf:encrypted : false
> 0: pdf:hasMarkedContent : false
> 0: pdf:hasXFA : false
> 0: pdf:hasXMP : false
> 0: pdf:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:unmappedUnicodeCharsPerPage : 109
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 113
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 94
> 0: pdf:unmappedUnicodeCharsPerPage : 112
> 0: pdf:unmappedUnicodeCharsPerPage : 178
> 0: pdf:unmappedUnicodeCharsPerPage : 74
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 189
> 0: pdf:unmappedUnicodeCharsPerPage : 165
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 186
> 0: pdf:unmappedUnicodeCharsPerPage : 162
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 119
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 115
> 0: pdf:unmappedUnicodeCharsPerPage : 99
> 0: pdf:unmappedUnicodeCharsPerPage : 107
> 0: pdf:unmappedUnicodeCharsPerPage : 108
> 0: pdf:unmappedUnicodeCharsPerPage : 116
> 0: pdf:unmappedUnicodeCharsPerPage : 174
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 61
> 0: pdf:unmappedUnicodeCharsPerPage : 90
> 0: pdf:unmappedUnicodeCharsPerPage : 239
> 0: pdf:unmappedUnicodeCharsPerPage : 614
> 0: pdf:unmappedUnicodeCharsPerPage : 216
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 502
> 0: pdf:unmappedUnicodeCharsPerPage : 103
> 0: pdf:unmappedUnicodeCharsPerPage : 427
> 0: pdf:unmappedUnicodeCharsPerPage : 629
> 0: pdf:unmappedUnicodeCharsPerPage : 347
> 0: pdf:unmappedUnicodeCharsPerPage : 327
>
> On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg <[email protected]> wrote:
>
> Yes, 2.x
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 9:54 AM
> *To:* [email protected]
> *Subject:* Re: Parsing PDF file
>
> Tika 2.x? Looking now.
>
> On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <[email protected]> wrote:
>
> If I use OCRStrategy=no_ocr, the time it takes to process is orders of
> magnitude faster and I don’t see the calls to OCRParser (obviously) Why is
> it taking so long with auto? If the page does not meet the criteria for
> OCR, then it shouldn’t be calling OCR at all, right?
>
> "X-TIKA:Parsed-By":
> "[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser]",
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:48 AM
> *To:* [email protected]
> *Subject:* RE: {EXTERNAL}Parsing PDF file
>
> This email was sent from outside your organisation, yet is displaying the
> name of someone from your organisation. This often happens in phishing
> attempts. Please only interact with this email if you know its source and
> that the content is safe.
>
> Correction: I see one instance of PDFParser at the beginning, but why does
> it then alternate between OCRParser and CompositeParser?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:41 AM
> *To:* [email protected]
> *Subject:* {EXTERNAL}Parsing PDF file
>
> CAUTION: This email originated from outside of the organization. DO NOT
> click links or open attachments unless you recognize the sender and know
> the content is safe.
>
> Parsing the attached PDF file. It is a text file, not scanned. I’m
> using OCR_Strategy=Auto, extractInlineImages=false
>
> The output contains the following in the metadata. I’m wondering 2
> things. First, why don’t I see PDFParser?
>
> And 2nd, why does it keep calling the TesseractOCRParser? Once it
> determines that it is a PDF file, wouldn’t it stick with that?
>
> I’m asking because it seems to take longer to parse than I would expect
> and I’m wondering if the OCRParser is adding extra overhead
>
> "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser]
>
> *Peter Kronenberg* *|* *Senior AI Analytic ENGINEER*
>
> *C: 703.887.5623*
>
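
For readers following the AUTO discussion in the thread above, here is a minimal sketch that compares the quoted check with the suggested change. The class and method names are hypothetical and this is not Tika's code. It assumes that percentUnmappedUnicodeChars means unmapped characters divided by total characters on the page, and that the totalCharsPerPage < 10 clause stays in place; the thread does not settle either point.

```java
final class OcrAutoHeuristicSketch {

    // Check quoted in the thread: OCR if the page has very little text
    // or more than 10 unmapped unicode characters.
    static boolean currentRule(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    // Suggested variant: also require that unmapped characters make up
    // more than 20% of the page (the ratio definition is an assumption).
    static boolean proposedRule(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        if (totalCharsPerPage < 10) {
            return true; // assumed unchanged: near-empty pages are still treated as images
        }
        double fractionUnmapped = (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return unmappedUnicodeCharsPerPage > 10 && fractionUnmapped > 0.2;
    }

    public static void main(String[] args) {
        // First values reported in the thread: 1579 chars, 109 unmapped (about 7%).
        System.out.println(currentRule(1579, 109));
        System.out.println(proposedRule(1579, 109));
        // Hypothetical page where 30% of the characters are unmapped.
        System.out.println(proposedRule(300, 90));
    }
}
```

Under these assumptions, a page with 1579 characters and 109 unmapped ones triggers OCR with the quoted check but not with the suggested one, since only about 7% of its characters are unmapped.
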
