Duh, thanks! It worked. Now I have to figure out which config option was messing it up.
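One way to narrow down which option is responsible is to bisect the configuration: enable half of the options, rerun the parse, and keep whichever half still reproduces the failure. The sketch below is only an illustration of that idea. It assumes that the empty configuration works, that the full one fails, and that a single option is to blame. The class name, the option names, and the `works` check are hypothetical placeholders, not part of Tika; in practice `works` would rerun Tika with just the given options and report whether text came back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Tika: bisects a list of config options.
class ConfigBisect {

    /**
     * Returns the option that makes the check fail, assuming exactly one
     * option is responsible and the empty configuration passes.
     */
    static <T> T findCulprit(List<T> options, Predicate<List<T>> works) {
        if (options.isEmpty()) {
            throw new IllegalArgumentException("no options to test");
        }
        if (works.test(options)) {
            throw new IllegalStateException("the full configuration already works");
        }
        List<T> suspects = new ArrayList<>(options);
        while (suspects.size() > 1) {
            int mid = suspects.size() / 2;
            List<T> firstHalf = new ArrayList<>(suspects.subList(0, mid));
            List<T> secondHalf = new ArrayList<>(suspects.subList(mid, suspects.size()));
            // If enabling only the first half works, the culprit is in the second half.
            suspects = works.test(firstHalf) ? secondHalf : firstHalf;
        }
        return suspects.get(0);
    }

    public static void main(String[] args) {
        // Placeholder option names; substitute the options from your own config.
        List<String> options = List.of("optionA", "optionB", "optionC", "optionD");
        // Stand-in for "run Tika with only these options and see whether OCR text comes back".
        Predicate<List<String>> works = enabled -> !enabled.contains("optionC");
        System.out.println(findCulprit(options, works));
    }
}
```

Each round halves the number of suspects, so even a long config needs only a handful of reruns. If two options only fail in combination, the single-culprit assumption breaks and the result is not reliable.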
Peter Kronenberg | Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI] <http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI <http://www.torch.ai/>

From: Tim Allison <[email protected]>
Sent: Friday, September 24, 2021 10:32 AM
To: Peter Kronenberg <[email protected]>; [email protected]
Subject: Re: Problem running OCR

If you turn off all the configurations, does it work for you?

On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg <[email protected]> wrote:

I was afraid it would work for you 😊

From: Tim Allison <[email protected]>
Sent: Friday, September 24, 2021 10:09 AM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]
Subject: Re: Problem running OCR

I'm having luck with 2.1.0's app. How are you calling Tika? What configurations do you have? Is tesseract on your command line, etc?

java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf

INFO [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.7"/>
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dc:creator" content="Michele Stutz"/>
<meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>
<meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>
<meta name="dc:format" content="application/pdf; version=1.7"/>
<meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>
<meta name="Content-Length" content="38927"/>
<meta name="pdf:hasMarkedContent" content="true"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>
<meta name="pdf:docinfo:creator" content="Michele Stutz"/>
<meta name="dc:language" content="en-US"/>
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="sample german image.pdf"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>
<meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>
<title/>
</head>
<body><div class="page"><p/>
<p> </p>
<p/>
<div class="ocr">Armin Laschet will an die Spitze und kampft Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck. Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun kampferisch und warnt vor einem Linksruck.
</div>
</div>

On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <[email protected]> wrote:

Ok this is one of those situations where I must be doing something stupid, but I can’t get Tika to properly process the attached file. It’s an image based PDF. It’s just not getting any text out of it. Even if I run with OCRStrategy = ONLY_OCR. It’s definitely getting to the call to doOCROnCurrentPage(AUTO) in AbstractPDF2XHTML, so it’s not a matter of the character counts preventing the OCR.

Don’t think it has anything to do with the fact that it is in German. Tried setting the language to DEU, but same results.

What is going on?
