I'm having luck with 2.1.0's app. How are you calling Tika? What configurations do you have? Is tesseract available on your command line (i.e. on your PATH), etc.?
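As a quick way to answer the "is tesseract on your command line" question, here is a minimal sketch in Python. The function name `tesseract_status` and its return shape are made up for illustration and are not part of Tika. It only assumes that the `tesseract` binary supports `--list-langs`, and it treats any output line without spaces as a language code, which is a heuristic.

```python
import shutil
import subprocess


def tesseract_status(lang="deu", exe="tesseract"):
    """Report whether `exe` is on PATH and whether `lang` appears installed.

    Hypothetical helper for illustration only; not part of Tika.
    """
    path = shutil.which(exe)
    if path is None:
        # Binary not found on PATH, so nothing can invoke it by name.
        return {"path": None, "languages": [], "has_lang": False}

    # `tesseract --list-langs` prints a header line followed by one
    # language code per line. Some versions write to stderr, so both
    # streams are combined here.
    result = subprocess.run(
        [path, "--list-langs"], capture_output=True, text=True, check=False
    )
    lines = (result.stdout + result.stderr).splitlines()

    # Heuristic: the header contains spaces, language codes do not.
    languages = [ln.strip() for ln in lines if ln.strip() and " " not in ln.strip()]
    return {"path": path, "languages": languages, "has_lang": lang in languages}


if __name__ == "__main__":
    print(tesseract_status())
```

If `path` comes back as `None`, the process running Tika probably cannot see tesseract either, at least when it inherits the same PATH. If `has_lang` is false for `deu`, the German language data is likely not installed.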
java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf

INFO [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.7"/>
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dc:creator" content="Michele Stutz"/>
<meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>
<meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>
<meta name="dc:format" content="application/pdf; version=1.7"/>
<meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>
<meta name="Content-Length" content="38927"/>
<meta name="pdf:hasMarkedContent" content="true"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>
<meta name="pdf:docinfo:creator" content="Michele Stutz"/>
<meta name="dc:language" content="en-US"/>
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="sample german image.pdf"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>
<meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>
<title/>
</head>
<body><div class="page"><p/>
<p> </p>
<p/>
<div class="ocr">Armin Laschet will an die Spitze und kampft Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck. Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun kampferisch und warnt vor einem Linksruck. </div>
</div>

On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <[email protected]> wrote:

> OK, this is one of those situations where I must be doing something stupid,
> but I can’t get Tika to properly process the attached file. It’s an
> image-based PDF. It’s just not getting any text out of it, even if I run
> with OCRStrategy = ONLY_OCR.
>
> It’s definitely getting to the call to doOCROnCurrentPage(*AUTO*) in
> AbstractPDF2XHTML, so it’s not a matter of the character counts preventing
> the OCR.
>
> I don’t think it has anything to do with the fact that it is in German.
> I tried setting the language to DEU, but got the same results.
>
> What is going on?
>
> Peter Kronenberg | Senior AI Analytic Engineer
> C: 703.887.5623
> 4303 W. 119th St., Leawood, KS 66209
> WWW.TORCH.AI <http://www.torch.ai/>
