Tika 2.x? Looking now.

On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <[email protected]> wrote:
> If I use OCRStrategy=no_ocr, the time it takes to process is orders of
> magnitude faster and I don’t see the calls to OCRParser (obviously). Why is
> it taking so long with auto? If the page does not meet the criteria for
> OCR, then it shouldn’t be calling OCR at all, right?
>
> "X-TIKA:Parsed-By":
> "[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser]"
> ,
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:48 AM
> *To:* [email protected]
> *Subject:* RE: {EXTERNAL}Parsing PDF file
>
> Correction: I see one instance of PDFParser at the beginning, but why does
> it then alternate between OCRParser and CompositeParser?
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:41 AM
> *To:* [email protected]
> *Subject:* {EXTERNAL}Parsing PDF file
>
> Parsing the attached PDF file. It is a text file, not scanned. I’m
> using OCR_Strategy=Auto, extractInlineImages=false
>
> The output contains the following in the metadata. I’m wondering 2
> things. First, why don’t I see PDFParser?
>
> And 2nd, why does it keep calling the TesseractOCRParser? Once it
> determines that it is a PDF file, wouldn’t it stick with that?
> > I’m asking because it seems to take longer to parse than I would expect > and I’m wondering if the OCRParser is adding extra overhead > > > > > > "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.pdf.PDFParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > 
org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser, > org.apache.tika.parser.CompositeParser, > org.apache.tika.parser.ocr.TesseractOCRParser] > > > > *Peter Kronenberg* *| * *Senior AI Analytic ENGINEER * > > *C: 703.887.5623 * > > [image: Torch AI] > 
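A quick way to see how often each parser shows up in a long "X-TIKA:Parsed-By" value like the one quoted above is to tally the class names. The following is only a minimal Python sketch, assuming the metadata value has already been loaded as a list of strings. The helper name count_parsers and the shortened sample list are hypothetical and are not part of Tika. Each TesseractOCRParser entry presumably corresponds to one hand-off to the OCR parser, but that reading is an assumption as well.

```python
from collections import Counter


def count_parsers(parsed_by):
    """Tally the simple class names found in an X-TIKA:Parsed-By list."""
    # Keep only the part after the last dot, e.g. "TesseractOCRParser".
    return Counter(name.rsplit(".", 1)[-1] for name in parsed_by)


# Shortened, illustrative sample in the same shape as the quoted metadata.
sample = [
    "org.apache.tika.parser.CompositeParser",
    "org.apache.tika.parser.pdf.PDFParser",
    "org.apache.tika.parser.CompositeParser",
    "org.apache.tika.parser.ocr.TesseractOCRParser",
    "org.apache.tika.parser.CompositeParser",
    "org.apache.tika.parser.ocr.TesseractOCRParser",
]

counts = count_parsers(sample)
for parser_name, hits in counts.most_common():
    print(parser_name, hits)
```

Running the same tally on the output of an auto run and a no_ocr run should make the difference in TesseractOCRParser entries easy to compare.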
