If you turn off all the configurations, does it work for you?

On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg <[email protected]> wrote:
> I was afraid it would work for you 😊
>
> From: Tim Allison <[email protected]>
> Sent: Friday, September 24, 2021 10:09 AM
> To: Peter Kronenberg <[email protected]>
> Cc: [email protected]
> Subject: Re: Problem running OCR
>
> I'm having luck with 2.1.0's app. How are you calling Tika? What
> configurations do you have? Is tesseract on your command line, etc?
>
> java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf
>
> INFO [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser
> Tesseract is installed and is being invoked. This can add greatly to
> processing time. If you do not want tesseract to be applied to your
> files see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="pdf:PDFVersion" content="1.7"/>
> <meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>
> <meta name="pdf:hasXFA" content="false"/>
> <meta name="access_permission:modify_annotations" content="true"/>
> <meta name="access_permission:can_print_degraded" content="true"/>
> <meta name="dc:creator" content="Michele Stutz"/>
> <meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>
> <meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>
> <meta name="dc:format" content="application/pdf; version=1.7"/>
> <meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>
> <meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>
> <meta name="access_permission:fill_in_form" content="true"/>
> <meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>
> <meta name="pdf:encrypted" content="false"/>
> <meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>
> <meta name="Content-Length" content="38927"/>
> <meta name="pdf:hasMarkedContent" content="true"/>
> <meta name="Content-Type" content="application/pdf"/>
> <meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>
> <meta name="pdf:docinfo:creator" content="Michele Stutz"/>
> <meta name="dc:language" content="en-US"/>
> <meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>
> <meta name="access_permission:extract_for_accessibility" content="true"/>
> <meta name="access_permission:assemble_document" content="true"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="sample german image.pdf"/>
> <meta name="pdf:hasXMP" content="true"/>
> <meta name="access_permission:extract_content" content="true"/>
> <meta name="access_permission:can_print" content="true"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
> <meta name="access_permission:can_modify" content="true"/>
> <meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>
> <meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>
> <title/>
> </head>
> <body><div class="page"><p/>
> <p> </p>
> <p/>
> <div class="ocr">Armin Laschet will an die Spitze und kampft
>
> Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck.
> Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun
> kampferisch und warnt vor einem Linksruck.
> </div>
>
> </div>
>
> On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <[email protected]> wrote:
>
> Ok this is one of those situations where I must be doing something stupid,
> but I can’t get Tika to properly process the attached file. It’s an
> image-based PDF. It’s just not getting any text out of it. Even if I run
> with OCRStrategy = ONLY_OCR.
>
> It’s definitely getting to the call to doOCROnCurrentPage(AUTO) in
> AbstractPDF2XHTML, so it’s not a matter of the character counts preventing
> the OCR.
>
> Don’t think it has anything to do with the fact that it is in German.
> Tried setting the language to DEU, but same results.
>
> What is going on?
>
> Peter Kronenberg | Senior AI Analytic Engineer
> Torch AI
> WWW.TORCH.AI
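
On the question quoted above of whether tesseract is reachable from the command line: a minimal Python sketch for checking the environment outside of Tika. It is not part of Tika; the function name tesseract_status is hypothetical, and it assumes the tesseract binary accepts the --list-langs flag. Tesseract language data is normally named in lowercase (deu rather than DEU), so the sketch looks for that spelling; whether that matters for this particular problem is only a guess.

```python
import shutil
import subprocess


def tesseract_status(binary="tesseract"):
    """Report where a Tesseract binary lives on PATH and which languages it lists.

    Hypothetical helper for illustration; returns a dict with keys
    "path" (str or None) and "languages" (list of str).
    """
    path = shutil.which(binary)
    if path is None:
        # Nothing on PATH under that name, so there is nothing to query.
        return {"path": None, "languages": []}
    try:
        proc = subprocess.run(
            [path, "--list-langs"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return {"path": path, "languages": []}
    # Read both streams in case the listing is not written to stdout.
    lines = (proc.stdout + "\n" + proc.stderr).splitlines()
    # Keep single-token lines only; this drops the multi-word header line.
    langs = [ln.strip() for ln in lines if len(ln.split()) == 1]
    return {"path": path, "languages": langs}


if __name__ == "__main__":
    status = tesseract_status()
    print("tesseract path:", status["path"])
    print("deu listed:", "deu" in status["languages"])
```

If the path comes back as None, or deu is not listed, the problem is likely in the Tesseract installation rather than in the Tika configuration.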
