I even broke out my Windows laptop, and the basic command line with tika-app
works there too, in both 2.0.0 and 2.1.0.
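
In case it helps with the "is tesseract on your command line" question further down, here is a minimal sketch for checking whether a `tesseract` executable is visible on the PATH that the JVM sees. It uses only the JDK and is not part of Tika; the class name `TesseractPathCheck` and the helper `findOnPath` are made up for illustration. It also assumes Tika is left to locate tesseract through PATH rather than through an explicitly configured tesseract path.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class TesseractPathCheck {

    // Hypothetical helper (not a Tika API): returns the first executable
    // file called `name` or `name`.exe found in a PATH-style string.
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {name, name + ".exe"}) {
                try {
                    Path p = Paths.get(dir, candidate);
                    if (Files.isRegularFile(p) && Files.isExecutable(p)) {
                        return Optional.of(p);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this OS.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Inspect the PATH of the environment this JVM was started in.
        Optional<Path> hit = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(hit.map(p -> "tesseract found at " + p)
                .orElse("tesseract not found on PATH"));
    }
}
```

Running it from the same shell, service, or IDE that launches Tika matters, since each of those can see a different PATH.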

On Fri, Sep 24, 2021 at 10:31 AM Tim Allison <[email protected]> wrote:

> If you turn off all the configurations, does it work for you?
>
> On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg <
> [email protected]> wrote:
>
>> I was afraid it would work for you 😊
>>
>>
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* Friday, September 24, 2021 10:09 AM
>> *To:* Peter Kronenberg <[email protected]>
>> *Cc:* [email protected]
>> *Subject:* Re: Problem running OCR
>>
>>
>>
>> I'm having luck with 2.1.0's app.  How are you calling Tika?  What
>> configurations do you have?  Is tesseract on your command line, etc?
>>
>>
>>
>> java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf
>>
>> INFO  [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser
>> Tesseract is installed and is being invoked. This can add greatly to
>> processing time.  If you do not want tesseract to be applied to your
>> files see:
>> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>>
>> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
>>
>> <head>
>>
>> <meta name="pdf:PDFVersion" content="1.7"/>
>>
>> <meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="pdf:hasXFA" content="false"/>
>>
>> <meta name="access_permission:modify_annotations" content="true"/>
>>
>> <meta name="access_permission:can_print_degraded" content="true"/>
>>
>> <meta name="dc:creator" content="Michele Stutz"/>
>>
>> <meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>
>>
>> <meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>
>>
>> <meta name="dc:format" content="application/pdf; version=1.7"/>
>>
>> <meta name="xmpMM:DocumentID"
>> content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>
>>
>> <meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="access_permission:fill_in_form" content="true"/>
>>
>> <meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>
>>
>> <meta name="pdf:encrypted" content="false"/>
>>
>> <meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>
>>
>> <meta name="Content-Length" content="38927"/>
>>
>> <meta name="pdf:hasMarkedContent" content="true"/>
>>
>> <meta name="Content-Type" content="application/pdf"/>
>>
>> <meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>
>>
>> <meta name="pdf:docinfo:creator" content="Michele Stutz"/>
>>
>> <meta name="dc:language" content="en-US"/>
>>
>> <meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="access_permission:extract_for_accessibility" content="true"/>
>>
>> <meta name="access_permission:assemble_document" content="true"/>
>>
>> <meta name="xmpTPg:NPages" content="1"/>
>>
>> <meta name="resourceName" content="sample german image.pdf"/>
>>
>> <meta name="pdf:hasXMP" content="true"/>
>>
>> <meta name="access_permission:extract_content" content="true"/>
>>
>> <meta name="access_permission:can_print" content="true"/>
>>
>> <meta name="X-TIKA:Parsed-By"
>> content="org.apache.tika.parser.DefaultParser"/>
>>
>> <meta name="X-TIKA:Parsed-By"
>> content="org.apache.tika.parser.pdf.PDFParser"/>
>>
>> <meta name="access_permission:can_modify" content="true"/>
>>
>> <meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>
>>
>> <title/>
>>
>> </head>
>>
>> <body><div class="page"><p/>
>>
>> <p> </p>
>>
>> <p/>
>>
>> <div class="ocr">Armin Laschet will an die Spitze und kampft
>>
>>
>>
>> Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck.
>>
>> Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun
>>
>> kampferisch und warnt vor einem Linksruck.
>>
>> </div>
>>
>>
>>
>> </div>
>>
>>
>>
>> On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <
>> [email protected]> wrote:
>>
>> OK, this is one of those situations where I must be doing something stupid,
>> but I can’t get Tika to properly process the attached file.  It’s an
>> image-based PDF.  It’s just not getting any text out of it, even if I run
>> with OCRStrategy = ONLY_OCR.
>>
>> It’s definitely getting to the call to doOCROnCurrentPage(*AUTO*) in
>> AbstractPDF2XHTML, so it’s not a matter of the character counts preventing
>> the OCR.
>>
>>
>>
>> I don’t think it has anything to do with the fact that it is in German.
>> I tried setting the language to DEU, but got the same results.
>>
>>
>>
>> What is going on?
>>
>>
>>
>> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>>
>> *C: 703.887.5623*
>>
>>
>> 4303 W. 119th St., Leawood, KS 66209
>> WWW.TORCH.AI
>>
>>
>>
>>
>>
>>
