RE: Problem running OCR

Peter Kronenberg Fri, 24 Sep 2021 13:15:42 -0700

So the option that was throwing me off was extractInlineImages.  Has something 
changed recently?  My code hasn’t changed and I’m not sure how I wouldn’t have 
noticed this before.
If extractInlineImages=false, does that mean that OCR won’t work at all for the 
PDF?  Even if it is a non-searchable PDF where each page is a scanned image?


And I can’t figure out why it’s set to FALSE. In tika-config.xml, I have
<parser class="org.apache.tika.parser.pdf.PDFParser">
    <params>
        <param name="extractInlineImages" type="bool">true</param>
    </params>
</parser>
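
One sanity check that does not depend on Tika at all is to confirm that the tika-config.xml actually found on the classpath contains the parameter as expected. The following is a minimal sketch using only the JDK's XML parser. The names ConfigParamCheck, readParam and SAMPLE are made up for illustration, and the <properties><parsers> wrapper in the sample string is an assumption, since only the <parser> element is quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: reads one <param> value from a config file.
final class ConfigParamCheck {

    // Assumed sample config; the outer <properties><parsers> wrapper is a guess.
    static final String SAMPLE =
            "<properties><parsers>"
            + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\"><params>"
            + "<param name=\"extractInlineImages\" type=\"bool\">true</param>"
            + "</params></parser></parsers></properties>";

    /**
     * Returns the trimmed text of the named <param> under the <parser> element
     * whose class attribute equals parserClass, or null if there is no match.
     */
    static String readParam(InputStream xml, String parserClass, String paramName)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                if (paramName.equals(param.getAttribute("name"))) {
                    return param.getTextContent().trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = SAMPLE.getBytes(StandardCharsets.UTF_8);
        try (InputStream is = new ByteArrayInputStream(bytes)) {
            String value = readParam(is,
                    "org.apache.tika.parser.pdf.PDFParser", "extractInlineImages");
            System.out.println("extractInlineImages in file = " + value);
        }
    }
}
```

In real use the ByteArrayInputStream would be replaced by the same getResourceAsStream("tika-config.xml") call used to build the TikaConfig, so the check looks at the same file Tika loads.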


In my code, I have

TikaConfig tikaConfig;
try (InputStream is =
        TikaOCRParser.class.getClassLoader().getResourceAsStream("tika-config.xml")) {
    tikaConfig = new TikaConfig(is);
}

final PDFParserConfig pdfConfig = new PDFParserConfig();
final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
final ParseContext parseContext = new ParseContext();

parseContext.set(AutoDetectParser.class, parser);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(TesseractOCRConfig.class, tessConfig);


I know I probably talked to you about this at the time, and thought I had it 
right.  Is it correct that I’m passing the tikaConfig to the 
AutoDetectParser()?
When I print the value of isExtractInlineImages right after instantiating 
PDFParserConfig, it comes up as FALSE.  What is the Tika default for this?
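
On the last point, a config object built with a no-argument constructor has no way of knowing about a tika-config.xml that was loaded separately, so printing the flag right after construction would show the class default rather than the value in the file, which matches the FALSE observed above. The sketch below models only that general Java behavior with stand-in classes. ToyPdfConfig and ToyParser are hypothetical and are not Tika classes; how Tika's PDFParser combines a PDFParserConfig from the ParseContext with the settings from tika-config.xml may depend on the Tika version and is not modeled here.

```java
import java.util.Map;

// Stand-in for a parser config class; the default of false is an assumption of this toy.
final class ToyPdfConfig {
    private boolean extractInlineImages = false;

    boolean isExtractInlineImages() {
        return extractInlineImages;
    }

    void setExtractInlineImages(boolean value) {
        this.extractInlineImages = value;
    }
}

// Stand-in for a parser that was configured from a file such as tika-config.xml.
final class ToyParser {
    private final ToyPdfConfig config = new ToyPdfConfig();

    ToyParser(Map<String, String> fileParams) {
        String raw = fileParams.getOrDefault("extractInlineImages", "false");
        config.setExtractInlineImages(Boolean.parseBoolean(raw));
    }

    ToyPdfConfig getConfig() {
        return config;
    }
}

final class FreshConfigDemo {
    public static void main(String[] args) {
        // The parser's own config reflects what the file said.
        ToyParser parser = new ToyParser(Map.of("extractInlineImages", "true"));
        System.out.println("from file: " + parser.getConfig().isExtractInlineImages());

        // A separately constructed config only has the class default.
        ToyPdfConfig fresh = new ToyPdfConfig();
        System.out.println("fresh:     " + fresh.isExtractInlineImages());

        // Setting the flag explicitly removes any dependence on defaults.
        fresh.setExtractInlineImages(true);
        System.out.println("after set: " + fresh.isExtractInlineImages());
    }
}
```

If the same pattern applies to the real classes, calling the setter on the PDFParserConfig before putting it in the ParseContext would make the intent explicit regardless of what the default is.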


Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>


From: Peter Kronenberg
Sent: Friday, September 24, 2021 10:36 AM
To: [email protected]; [email protected]
Subject: RE: Problem running OCR

Duh, thanks!  It worked.  Now I have to figure out which config option was 
messing it up.


From: Tim Allison <[email protected]<mailto:[email protected]>>
Sent: Friday, September 24, 2021 10:32 AM
To: Peter Kronenberg 
<[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]>
Subject: Re: Problem running OCR

If you turn off all the configurations, does it work for you?

On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg 
<[email protected]<mailto:[email protected]>> wrote:
I was afraid it would work for you 😊

From: Tim Allison <[email protected]<mailto:[email protected]>>
Sent: Friday, September 24, 2021 10:09 AM
To: Peter Kronenberg 
<[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]>
Subject: Re: Problem running OCR

I'm having luck with 2.1.0's app.  How are you calling Tika?  What 
configurations do you have?  Is tesseract on your command line, etc?


java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf

INFO  [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser 
Tesseract is installed and is being invoked. This can add greatly to processing 
time.  If you do not want tesseract to be applied to your files see: 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta name="pdf:PDFVersion" content="1.7"/>

<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>

<meta name="pdf:hasXFA" content="false"/>

<meta name="access_permission:modify_annotations" content="true"/>

<meta name="access_permission:can_print_degraded" content="true"/>

<meta name="dc:creator" content="Michele Stutz"/>

<meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>

<meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>

<meta name="dc:format" content="application/pdf; version=1.7"/>

<meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>

<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>

<meta name="access_permission:fill_in_form" content="true"/>

<meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>

<meta name="pdf:encrypted" content="false"/>

<meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>

<meta name="Content-Length" content="38927"/>

<meta name="pdf:hasMarkedContent" content="true"/>

<meta name="Content-Type" content="application/pdf"/>

<meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>

<meta name="pdf:docinfo:creator" content="Michele Stutz"/>

<meta name="dc:language" content="en-US"/>

<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>

<meta name="access_permission:extract_for_accessibility" content="true"/>

<meta name="access_permission:assemble_document" content="true"/>

<meta name="xmpTPg:NPages" content="1"/>

<meta name="resourceName" content="sample german image.pdf"/>

<meta name="pdf:hasXMP" content="true"/>

<meta name="access_permission:extract_content" content="true"/>

<meta name="access_permission:can_print" content="true"/>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>

<meta name="access_permission:can_modify" content="true"/>

<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>

<meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>

<title/>

</head>

<body><div class="page"><p/>

<p> </p>

<p/>

<div class="ocr">Armin Laschet will an die Spitze und kampft



Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter 
Druck.

Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun

kampferisch und warnt vor einem Linksruck.

</div>



</div>

On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg 
<[email protected]<mailto:[email protected]>> wrote:

Ok, this is one of those situations where I must be doing something stupid, but 
I can’t get Tika to properly process the attached file.  It’s an image-based 
PDF.  It’s just not getting any text out of it, even if I run with OCRStrategy 
= ONLY_OCR.



It’s definitely getting to the call to doOCROnCurrentPage(AUTO) in 
AbstractPDF2XHTML, so it’s not a matter of the character counts preventing the 
OCR.



I don’t think it has anything to do with the fact that it is in German.  I tried 
setting the language to DEU, but got the same results.

What is going on?
