Tika 2.x? Looking now.

On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <[email protected]>
wrote:

> If I use OCRStrategy=no_ocr, processing is orders of magnitude faster
> and I don’t see the calls to OCRParser (obviously). Why is it taking so
> long with auto? If the page does not meet the criteria for OCR, then it
> shouldn’t be calling OCR at all, right?
>
>
>
> "X-TIKA:Parsed-By":
> "[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser]",
>
>
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:48 AM
> *To:* [email protected]
> *Subject:* RE: {EXTERNAL}Parsing PDF file
>
>
>
> Correction: I see one instance of PDFParser at the beginning, but why does
> it then alternate between OCRParser and CompositeParser?
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, April 5, 2021 8:41 AM
> *To:* [email protected]
> *Subject:* {EXTERNAL}Parsing PDF file
>
>
>
> I’m parsing the attached PDF file. It is a text-based PDF, not a scan.
> I’m using OCR_Strategy=Auto, extractInlineImages=false.
>
>
>
> The output contains the following in the metadata. I’m wondering two
> things. First, why don’t I see PDFParser?
>
> Second, why does it keep calling the TesseractOCRParser? Once it
> determines that it is a PDF file, wouldn’t it stick with that?
>
> I’m asking because it seems to take longer to parse than I would expect,
> and I’m wondering if the OCRParser is adding extra overhead.
>
>
>
>
>
> "X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.pdf.PDFParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser,
> org.apache.tika.parser.CompositeParser,
> org.apache.tika.parser.ocr.TesseractOCRParser]
>
>
>
> Peter Kronenberg | Senior AI Analytic Engineer

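A minimal sketch for summarizing a long X-TIKA:Parsed-By list like the one quoted above, so the alternation between parsers is easier to see at a glance. It assumes the metadata value has already been loaded as a list of class-name strings; `tally_parsers` and the short `sample` list are hypothetical names and data for illustration, not part of Tika.

```python
from collections import Counter


def tally_parsers(parsed_by):
    """Count how often each parser class appears in a Parsed-By list.

    Only the simple class name (the text after the last dot) is kept.
    """
    return Counter(name.rsplit(".", 1)[-1] for name in parsed_by)


# Hypothetical shortened sample in the same shape as the quoted metadata.
sample = [
    "org.apache.tika.parser.CompositeParser",
    "org.apache.tika.parser.pdf.PDFParser",
    "org.apache.tika.parser.CompositeParser",
    "org.apache.tika.parser.ocr.TesseractOCRParser",
    "org.apache.tika.parser.CompositeParser",
    "org.apache.tika.parser.ocr.TesseractOCRParser",
]

counts = tally_parsers(sample)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Running the same tally with no_ocr and with auto would show whether TesseractOCRParser entries appear only in the auto run.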