I even broke out my Windows laptop, and the basic command line with tika-app works there, too, in both 2.0.0 and 2.1.0.
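
For anyone trying to reproduce this, here is a rough shell sketch of the baseline checks discussed in this thread: confirm that java and tesseract are reachable from the command line, list the installed Tesseract languages, then run tika-app with no custom configuration. The helper name `check_on_path` is made up for illustration, and the jar and PDF paths in the final comment are placeholders to adjust for your setup.

```shell
# Hypothetical helper: report whether a command is reachable on PATH.
# Tika can only invoke Tesseract if the executable can be found.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found on PATH"
        return 1
    fi
}

check_on_path java || true

if check_on_path tesseract; then
    # Lists the installed language packs; look for "deu" if you need German.
    tesseract --list-langs || true
fi

# Baseline run with no custom config (paths are placeholders):
# java -jar tika-app-2.1.0.jar "sample german image.pdf"
```

If the baseline run produces OCR text but your own setup does not, the difference is most likely in the configuration rather than in the file.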

On Fri, Sep 24, 2021 at 10:31 AM Tim Allison <[email protected]> wrote:

> If you turn off all the configurations, does it work for you?
>
> On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg <[email protected]> wrote:
>
>> I was afraid it would work for you 😊
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Friday, September 24, 2021 10:09 AM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]
>> *Subject:* Re: Problem running OCR
>>
>> I'm having luck with 2.1.0's app. How are you calling Tika? What
>> configurations do you have? Is tesseract on your command line, etc?
>>
>> java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf
>>
>> INFO [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser
>> Tesseract is installed and is being invoked. This can add greatly to
>> processing time. If you do not want tesseract to be applied to your
>> files see:
>> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>>
>> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta name="pdf:PDFVersion" content="1.7"/>
>> <meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>
>> <meta name="pdf:hasXFA" content="false"/>
>> <meta name="access_permission:modify_annotations" content="true"/>
>> <meta name="access_permission:can_print_degraded" content="true"/>
>> <meta name="dc:creator" content="Michele Stutz"/>
>> <meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>
>> <meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>
>> <meta name="dc:format" content="application/pdf; version=1.7"/>
>> <meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>
>> <meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>
>> <meta name="access_permission:fill_in_form" content="true"/>
>> <meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>
>> <meta name="pdf:encrypted" content="false"/>
>> <meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>
>> <meta name="Content-Length" content="38927"/>
>> <meta name="pdf:hasMarkedContent" content="true"/>
>> <meta name="Content-Type" content="application/pdf"/>
>> <meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>
>> <meta name="pdf:docinfo:creator" content="Michele Stutz"/>
>> <meta name="dc:language" content="en-US"/>
>> <meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>
>> <meta name="access_permission:extract_for_accessibility" content="true"/>
>> <meta name="access_permission:assemble_document" content="true"/>
>> <meta name="xmpTPg:NPages" content="1"/>
>> <meta name="resourceName" content="sample german image.pdf"/>
>> <meta name="pdf:hasXMP" content="true"/>
>> <meta name="access_permission:extract_content" content="true"/>
>> <meta name="access_permission:can_print" content="true"/>
>> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
>> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
>> <meta name="access_permission:can_modify" content="true"/>
>> <meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>
>> <meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>
>> <title/>
>> </head>
>> <body><div class="page"><p/>
>> <p> </p>
>> <p/>
>> <div class="ocr">Armin Laschet will an die Spitze und kampft
>>
>> Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck.
>> Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun
>> kampferisch und warnt vor einem Linksruck.
>> </div>
>>
>> </div>
>>
>> On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <[email protected]> wrote:
>>
>> Ok this is one of those situations where I must be doing something stupid,
>> but I can’t get Tika to properly process the attached file. It’s an image
>> based PDF. It’s just not getting any text out of it. Even if I run with
>> OCRStrategy = ONLY_OCR.
>>
>> It’s definitely getting to the call to doOCROnCurrentPage(AUTO) in
>> AbstractPDF2XHTML, so it’s not a matter of the character counts preventing
>> the OCR.
>>
>> Don’t think it has anything to do with the fact that it is in German.
>> Tried setting the language to DEU, but same results
>>
>> What is going on?
>>
>> Peter Kronenberg | Senior AI Analytic ENGINEER
>> C: 703.887.5623
>> 4303 W. 119th St., Leawood, KS 66209
>> WWW.TORCH.AI
