If you turn off all the configurations, does it work for you?
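
For anyone following along, one quick way to rule out the environment before touching any Tika configuration is to confirm that tesseract is reachable and that the German language pack is installed. The following is only a minimal sketch, assuming Python 3 is available; the helper names `parse_langs` and `installed_langs` are made up for illustration and are not part of Tika or Tesseract. It relies on Tesseract's `--list-langs` option and on the German data file being named `deu`.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_langs(exe: str = "tesseract") -> Optional[List[str]]:
    """Return installed language codes, or None if the binary is not on PATH."""
    path = shutil.which(exe)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some older Tesseract builds print the list to stderr, so read both
    # streams. Any warning lines on stderr would also end up in the result.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    if langs is None:
        print("tesseract not found on PATH")
    else:
        print("deu installed:", "deu" in langs)
```

If this reports that tesseract is missing, or that `deu` is not installed, the problem is in the environment rather than in the Tika configuration.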

On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg <[email protected]>
wrote:

> I was afraid it would work for you 😊
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Friday, September 24, 2021 10:09 AM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Problem running OCR
>
>
>
> I'm having luck with 2.1.0's app.  How are you calling Tika?  What
> configurations do you have?  Is tesseract available on your PATH?
>
>
>
> java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf
>
> INFO  [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser
> Tesseract is installed and is being invoked. This can add greatly to
> processing time.  If you do not want tesseract to be applied to your
> files see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="pdf:PDFVersion" content="1.7"/>
> <meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>
> <meta name="pdf:hasXFA" content="false"/>
> <meta name="access_permission:modify_annotations" content="true"/>
> <meta name="access_permission:can_print_degraded" content="true"/>
> <meta name="dc:creator" content="Michele Stutz"/>
> <meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>
> <meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>
> <meta name="dc:format" content="application/pdf; version=1.7"/>
> <meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>
> <meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>
> <meta name="access_permission:fill_in_form" content="true"/>
> <meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>
> <meta name="pdf:encrypted" content="false"/>
> <meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>
> <meta name="Content-Length" content="38927"/>
> <meta name="pdf:hasMarkedContent" content="true"/>
> <meta name="Content-Type" content="application/pdf"/>
> <meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>
> <meta name="pdf:docinfo:creator" content="Michele Stutz"/>
> <meta name="dc:language" content="en-US"/>
> <meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>
> <meta name="access_permission:extract_for_accessibility" content="true"/>
> <meta name="access_permission:assemble_document" content="true"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="sample german image.pdf"/>
> <meta name="pdf:hasXMP" content="true"/>
> <meta name="access_permission:extract_content" content="true"/>
> <meta name="access_permission:can_print" content="true"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
> <meta name="access_permission:can_modify" content="true"/>
> <meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>
> <meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>
> <title/>
> </head>
> <body><div class="page"><p/>
> <p> </p>
> <p/>
> <div class="ocr">Armin Laschet will an die Spitze und kampft
>
> Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck.
> Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun
> kampferisch und warnt vor einem Linksruck.
> </div>
>
> </div>
>
> On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <
> [email protected]> wrote:
>
> OK, this is one of those situations where I must be doing something stupid,
> but I can’t get Tika to properly process the attached file.  It’s an
> image-based PDF, and Tika is just not getting any text out of it, even if I
> run with OCRStrategy = OCR_ONLY.
>
> It’s definitely getting to the call to doOCROnCurrentPage(AUTO) in
> AbstractPDF2XHTML, so it’s not a matter of the character counts preventing
> the OCR.
>
>
>
> I don’t think it has anything to do with the fact that it is in German.
> I tried setting the language to DEU, but got the same results.
>
>
>
> What is going on?
>
>
>
> Peter Kronenberg | Senior AI Analytic Engineer
