RE: Problem running OCR

Peter Kronenberg Fri, 24 Sep 2021 07:36:25 -0700

Duh, thanks!  It worked.  Now I have to figure out which config option was 
messing it up.


Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>


From: Tim Allison <[email protected]>
Sent: Friday, September 24, 2021 10:32 AM
To: Peter Kronenberg <[email protected]>; [email protected]
Subject: Re: Problem running OCR


If you turn off all the configurations, does it work for you?

On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg <[email protected]> wrote:
I was afraid it would work for you 😊

From: Tim Allison <[email protected]>
Sent: Friday, September 24, 2021 10:09 AM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]
Subject: Re: Problem running OCR

I'm having luck with 2.1.0's app.  How are you calling Tika?  What 
configurations do you have?  Is tesseract on your command line, etc?
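
One way to answer the tesseract question is to check whether an executable with that name is visible on the PATH of the process that runs Tika. The sketch below uses only the Java standard library; the class name TesseractPathCheck and the helper findOnPath are made up for illustration and are not part of Tika, and Tika's own detection logic may differ from this simple lookup.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika: looks for an executable on a PATH-style string.
public class TesseractPathCheck {

    // Returns the first executable regular file named `executable` (or
    // `executable`.exe) found in the directories listed in `pathValue`.
    static Optional<Path> findOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // On Windows the binary would typically be tesseract.exe, so try both names.
        String[] names = {executable, executable + ".exe"};
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Check the PATH inherited by this JVM, which is what a child process would see.
        Optional<Path> found = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(found
                .map(p -> "tesseract found at " + p)
                .orElse("tesseract not found on PATH"));
    }
}
```

Running this from the same shell or service account that launches Tika matters, since a different environment may have a different PATH.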


java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf

INFO  [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser 
Tesseract is installed and is being invoked. This can add greatly to processing 
time.  If you do not want tesseract to be applied to your files see: 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta name="pdf:PDFVersion" content="1.7"/>

<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>

<meta name="pdf:hasXFA" content="false"/>

<meta name="access_permission:modify_annotations" content="true"/>

<meta name="access_permission:can_print_degraded" content="true"/>

<meta name="dc:creator" content="Michele Stutz"/>

<meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>

<meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>

<meta name="dc:format" content="application/pdf; version=1.7"/>

<meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>

<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>

<meta name="access_permission:fill_in_form" content="true"/>

<meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>

<meta name="pdf:encrypted" content="false"/>

<meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>

<meta name="Content-Length" content="38927"/>

<meta name="pdf:hasMarkedContent" content="true"/>

<meta name="Content-Type" content="application/pdf"/>

<meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>

<meta name="pdf:docinfo:creator" content="Michele Stutz"/>

<meta name="dc:language" content="en-US"/>

<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>

<meta name="access_permission:extract_for_accessibility" content="true"/>

<meta name="access_permission:assemble_document" content="true"/>

<meta name="xmpTPg:NPages" content="1"/>

<meta name="resourceName" content="sample german image.pdf"/>

<meta name="pdf:hasXMP" content="true"/>

<meta name="access_permission:extract_content" content="true"/>

<meta name="access_permission:can_print" content="true"/>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>

<meta name="access_permission:can_modify" content="true"/>

<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>

<meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>

<title/>

</head>

<body><div class="page"><p/>

<p> </p>

<p/>

<div class="ocr">Armin Laschet will an die Spitze und kampft



Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter 
Druck.

Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun

kampferisch und warnt vor einem Linksruck.

</div>



</div>

On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <[email protected]> wrote:

OK, this is one of those situations where I must be doing something stupid, 
but I can’t get Tika to properly process the attached file.  It’s an 
image-based PDF, and Tika is just not getting any text out of it, even if I 
run with OCRStrategy = ONLY_OCR.



It’s definitely getting to the call to doOCROnCurrentPage(AUTO) in 
AbstractPDF2XHTML, so it’s not a matter of the character counts preventing the 
OCR.



I don’t think it has anything to do with the fact that it is in German.  I 
tried setting the language to DEU, but got the same results.

What is going on?

