Fwd: Parsing PDF file

Tim Allison Mon, 05 Apr 2021 12:10:50 -0700

I just updated the section on OOV here:
https://cwiki.apache.org/confluence/display/TIKA/TikaEvalMetrics


To get Tika to calculate that for you automatically, drop the
tika-eval-core jar in your tika-server's classpath.

See also:
https://github.com/tballison/share/blob/main/slides/activate19/Activate2019_tika_tallison_20190911.pptx

On Mon, Apr 5, 2021 at 2:47 PM Peter Kronenberg <[email protected]>
wrote:

> Can you please explain ‘out of vocabulary measurement’?
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 1:49 PM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Parsing PDF file
>
>
>
> Yes. You understand perfectly!
>
>
>
> I want "auto" to be the best it can be and most generally applicable
> across use cases.  For users who want high performance/better control, you
> might parse the PDF first with NO_OCR, and then make the determination on
> which pages to run OCR based on those statistics pulled out in the first
> parse.  Another key statistic in the decision would be the out of
> vocabulary measurement that you can get with an integration with tika-eval.
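The two-pass idea above (parse with NO_OCR first, then pick pages for OCR from the per-page statistics) could be sketched roughly as follows. This is only an illustration, not Tika code: the class `OcrPageSelectorSketch`, the method `pagesToOcr`, and both thresholds are hypothetical. It assumes the repeated `pdf:charsPerPage` and `pdf:unmappedUnicodeCharsPerPage` values from the first parse are available as string arrays in the same page order, and the sample values are taken from the metadata listing quoted further down the thread under that same assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which pages to OCR after a NO_OCR first pass.
final class OcrPageSelectorSketch {

    // Returns zero-based page indices that look like OCR candidates.
    // minChars and maxUnmappedRatio are illustrative thresholds, not Tika defaults.
    static List<Integer> pagesToOcr(String[] charsPerPage, String[] unmappedPerPage,
                                    int minChars, double maxUnmappedRatio) {
        if (charsPerPage.length != unmappedPerPage.length) {
            throw new IllegalArgumentException("per-page arrays must have the same length");
        }
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < charsPerPage.length; i++) {
            int total = Integer.parseInt(charsPerPage[i].trim());
            int unmapped = Integer.parseInt(unmappedPerPage[i].trim());
            // Very little text: the page is probably an image.
            boolean tooLittleText = total < minChars;
            // Mostly unmapped characters: the extracted text is probably unusable.
            boolean tooManyUnmapped = total > 0 && (double) unmapped / total > maxUnmappedRatio;
            if (tooLittleText || tooManyUnmapped) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        String[] chars = {"1579", "282", "948"};
        String[] unmapped = {"109", "90", "614"};
        System.out.println(pagesToOcr(chars, unmapped, 10, 0.5)); // prints [2]
    }
}
```

An out-of-vocabulary score from tika-eval could be added as a third per-page or per-document input in the same way, once it is available.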
>
>
>
> So, in short, if there are clear, provable, general improvements to AUTO,
> we should make them.  If you want more refined control, let us know if the
> current metadata can be improved to help you develop your application for
> your use cases.
>
>
>
> On Mon, Apr 5, 2021 at 1:06 PM Peter Kronenberg <[email protected]>
> wrote:
>
> You’re right that OCRing would result in slightly more accurate results in
> this case.  But the performance penalty is high.  Wondering if there is
> some intermediate option.
>
>
>
> I think I understand now why you are looking at unmapped characters
> separately from total characters.  If total characters is low, we
> assume the page is an image and OCR it.  But if unmapped characters is high,
> the page might still be straight text, yet the unmapped characters will
> come out as unreadable characters in the extracted text.
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 11:39 AM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Parsing PDF file
>
>
>
> As for the metadata, we should make the entries unique.  Given that multiple
> parsers can hit the same file, we need to record all of them (in this case:
> default, pdf, tesseract).
>
>
>
> As for tweaking the settings... I'm less sure now that I've looked at the
> extracted text more.  There are quite a few bad ligatures/unmapped unicode
> chars, which would render search for, e.g., "efficient" or "affairs" useless.
>
>
>
> On Mon, Apr 5, 2021 at 10:40 AM Peter Kronenberg <
> [email protected]> wrote:
>
> Yes, I think tweaking the criteria for Auto is a good idea.
>
> And if the parser list were a Set, that would automatically eliminate dups.
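A minimal sketch of the Set idea, assuming the order in which parsers were first seen should be preserved. The class `ParsedBySketch` and its methods are hypothetical and are not how Tika stores `X-TIKA:Parsed-By`. A `LinkedHashSet` is used here because a plain `HashSet` would drop duplicates but lose the order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder for the "parsed by" list that ignores repeated parsers.
final class ParsedBySketch {

    // Insertion-ordered set: keeps the first occurrence of each parser name.
    private final Set<String> parsedBy = new LinkedHashSet<>();

    void record(String parserClassName) {
        parsedBy.add(parserClassName); // no effect if the name is already present
    }

    List<String> values() {
        return new ArrayList<>(parsedBy);
    }

    public static void main(String[] args) {
        ParsedBySketch sketch = new ParsedBySketch();
        sketch.record("org.apache.tika.parser.CompositeParser");
        sketch.record("org.apache.tika.parser.pdf.PDFParser");
        sketch.record("org.apache.tika.parser.CompositeParser");
        sketch.record("org.apache.tika.parser.ocr.TesseractOCRParser");
        sketch.record("org.apache.tika.parser.CompositeParser");
        sketch.record("org.apache.tika.parser.ocr.TesseractOCRParser");
        // Three distinct entries remain, in first-seen order.
        System.out.println(sketch.values());
    }
}
```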
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 10:15 AM
> *To:* [email protected]
> *Subject:* Fwd: Parsing PDF file
>
>
>
> It looks like the ligatures don't have unicode mappings:
>
>
>
> "Division of Monetary A�airs"
>
>
>
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)
>
> The issue is that this file has > 10 unmapped unicode chars per page.
>
> We could change the heuristic to unmappedUnicodeCharsPerPage > 10 && 
> percentUnmappedUnicodeChars > 0.2 or something?
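To make the difference between the two rules concrete, here is a small sketch comparing them. It is not the actual Tika source. The class and method names are hypothetical, and it assumes that `percentUnmappedUnicodeChars` means unmapped chars divided by total chars on the page, and that the low-text check is kept in the proposed rule. The sample numbers come from the metadata listing in this message, assuming both lists are in the same page order.

```java
// Hypothetical comparison of the quoted rule and the proposed rule.
final class OcrHeuristicSketch {

    // Rule as quoted in the thread: OCR when there is almost no text,
    // or when more than 10 characters on the page have no unicode mapping.
    static boolean currentRule(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    // Proposed rule: additionally require that unmapped chars make up more
    // than 20% of the page (assumed definition of percentUnmappedUnicodeChars).
    static boolean proposedRule(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        if (totalCharsPerPage < 10) {
            return true; // too little text, treat the page as an image
        }
        double percentUnmapped = (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return unmappedUnicodeCharsPerPage > 10 && percentUnmapped > 0.2;
    }

    public static void main(String[] args) {
        // First page: 1579 chars, 109 unmapped (about 7%).
        System.out.println(currentRule(1579, 109));  // true
        System.out.println(proposedRule(1579, 109)); // false
        // Thirtieth page: 948 chars, 614 unmapped (about 65%).
        System.out.println(proposedRule(948, 614));  // true
    }
}
```

Under these assumptions the proposed rule would skip OCR on the early pages of this file while still sending the pages with mostly unmapped characters to OCR.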
>
> We should also probably check to see if a parser is in the parsed by list 
> before re-adding it?
>
>
>
> 0: pdf:charsPerPage : 1579
> 0: pdf:charsPerPage : 1891
> 0: pdf:charsPerPage : 2283
> 0: pdf:charsPerPage : 2224
> 0: pdf:charsPerPage : 1619
> 0: pdf:charsPerPage : 2177
> 0: pdf:charsPerPage : 1626
> 0: pdf:charsPerPage : 1313
> 0: pdf:charsPerPage : 1652
> 0: pdf:charsPerPage : 1493
> 0: pdf:charsPerPage : 1136
> 0: pdf:charsPerPage : 1477
> 0: pdf:charsPerPage : 1264
> 0: pdf:charsPerPage : 1994
> 0: pdf:charsPerPage : 2062
> 0: pdf:charsPerPage : 1756
> 0: pdf:charsPerPage : 2007
> 0: pdf:charsPerPage : 2202
> 0: pdf:charsPerPage : 2105
> 0: pdf:charsPerPage : 2106
> 0: pdf:charsPerPage : 1895
> 0: pdf:charsPerPage : 1978
> 0: pdf:charsPerPage : 1826
> 0: pdf:charsPerPage : 1742
> 0: pdf:charsPerPage : 2073
> 0: pdf:charsPerPage : 1882
> 0: pdf:charsPerPage : 1497
> 0: pdf:charsPerPage : 282
> 0: pdf:charsPerPage : 606
> 0: pdf:charsPerPage : 948
> 0: pdf:charsPerPage : 418
> 0: pdf:charsPerPage : 266
> 0: pdf:charsPerPage : 830
> 0: pdf:charsPerPage : 259
> 0: pdf:charsPerPage : 716
> 0: pdf:charsPerPage : 961
> 0: pdf:charsPerPage : 1325
> 0: pdf:charsPerPage : 1478
> 0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 
> Radical Eye Software
> 0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:docinfo:title : Inel4shannon.dvi
> 0: pdf:encrypted : false
> 0: pdf:hasMarkedContent : false
> 0: pdf:hasXFA : false
> 0: pdf:hasXMP : false
> 0: pdf:producer : Acrobat Distiller 3.01 for Windows
> 0: pdf:unmappedUnicodeCharsPerPage : 109
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 113
> 0: pdf:unmappedUnicodeCharsPerPage : 120
> 0: pdf:unmappedUnicodeCharsPerPage : 94
> 0: pdf:unmappedUnicodeCharsPerPage : 112
> 0: pdf:unmappedUnicodeCharsPerPage : 178
> 0: pdf:unmappedUnicodeCharsPerPage : 74
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 189
> 0: pdf:unmappedUnicodeCharsPerPage : 165
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 132
> 0: pdf:unmappedUnicodeCharsPerPage : 186
> 0: pdf:unmappedUnicodeCharsPerPage : 162
> 0: pdf:unmappedUnicodeCharsPerPage : 145
> 0: pdf:unmappedUnicodeCharsPerPage : 119
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 115
> 0: pdf:unmappedUnicodeCharsPerPage : 99
> 0: pdf:unmappedUnicodeCharsPerPage : 107
> 0: pdf:unmappedUnicodeCharsPerPage : 108
> 0: pdf:unmappedUnicodeCharsPerPage : 116
> 0: pdf:unmappedUnicodeCharsPerPage : 174
> 0: pdf:unmappedUnicodeCharsPerPage : 138
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 61
> 0: pdf:unmappedUnicodeCharsPerPage : 90
> 0: pdf:unmappedUnicodeCharsPerPage : 239
> 0: pdf:unmappedUnicodeCharsPerPage : 614
> 0: pdf:unmappedUnicodeCharsPerPage : 216
> 0: pdf:unmappedUnicodeCharsPerPage : 101
> 0: pdf:unmappedUnicodeCharsPerPage : 502
> 0: pdf:unmappedUnicodeCharsPerPage : 103
> 0: pdf:unmappedUnicodeCharsPerPage : 427
> 0: pdf:unmappedUnicodeCharsPerPage : 629
> 0: pdf:unmappedUnicodeCharsPerPage : 347
> 0: pdf:unmappedUnicodeCharsPerPage : 327
>
>
>
> On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg <
> [email protected]> wrote:
>
> Yes, 2.x
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, April 5, 2021 9:54 AM
> *To:* [email protected]
> *Subject:* Re: Parsing PDF file
>
>
>
> Tika 2.x? Looking now.
>
>
>
> On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <[email protected]>
> wrote:
>
> If I use OCRStrategy=no_ocr, processing is orders of magnitude faster
> and I don’t see the calls to OCRParser (obviously).  Why is
> it taking so long with auto?  If the page does not meet the criteria for
> OCR, then it shouldn’t be calling OCR at all, right?
>
>
>
> "X-TIKA:Parsed-By":
> "[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser]",
>
>
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:48 AM
> *To:* [email protected]
> *Subject:* RE: {EXTERNAL}Parsing PDF file
>
>
>
> Correction: I see one instance of PDFParser at the beginning, but why does
> it then alternate between OCRParser and CompositeParser?
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:41 AM
> *To:* [email protected]
> *Subject:* {EXTERNAL}Parsing PDF file
>
>
>
> Parsing the attached PDF file.   It is a text file, not scanned.  I’m
> using OCR_Strategy=Auto, extractInlineImages=false
>
>
>
> The output contains the following in the metadata.  I’m wondering 2
> things.  First, why don’t I see PDFParser?
>
> And 2nd, why does it keep calling the TesseractOCRParser?  Once it
> determines that it is a PDF file, wouldn’t it stick with that?
>
> I’m asking because it seems to take longer to parse than I would expect,
> and I’m wondering if the OCRParser is adding extra overhead.
>
>
>
>
>
> "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser]
>
>
>
