Even if Tika doesn't parse it, I expected to see the Tesseract output 
somewhere.  Ideally, it would end up in the metadata.  As things stand, using 
PSM=0 with Tika produces no output at all.

I'm still learning about Tesseract, but I think the use case is determining 
the script of a document.  For example, I have an Arabic document, and if I 
don't specify the language as "ara", OCR doesn't work properly.  If you have a 
document whose script you don't know, PSM 0 could be handy for working out 
which language to specify.
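
As a stopgap, one option might be to call Tesseract directly and parse the 
report yourself.  Below is a minimal Python sketch of that idea, not anything 
Tika provides: parse_osd and detect_script are hypothetical helper names, and 
the code assumes the tesseract binary is on the PATH and prints the 
"Key: value" lines shown in the quoted message below.  As far as I know, 
PSM 0 also needs the osd traineddata file to be installed.

```python
import subprocess


def parse_osd(text):
    """Turn Tesseract's `--psm 0` report into a dict of key -> value strings."""
    result = {}
    for line in text.splitlines():
        # Each useful line looks like "Script: Latin"; skip anything else.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            result[key.strip()] = value.strip()
    return result


def detect_script(image_path, tesseract="tesseract"):
    """Run Tesseract in orientation/script-detection mode and parse stdout.

    Assumes the same command line as in the quoted message:
    tesseract --psm 0 <image> stdout
    """
    proc = subprocess.run(
        [tesseract, "--psm", "0", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_osd(proc.stdout)


# The report from the original question, used here as sample input.
SAMPLE = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 8.75
Script: Latin
Script confidence: 2.86
"""
```

With the sample report, parse_osd(SAMPLE)["Script"] gives "Latin".  All values 
stay as strings, so the confidences need float() before you compare them.  
Mapping the detected script to a Tesseract language code (e.g. Arabic to 
"ara") is left to the caller, since one script can cover several languages.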

-----Original Message-----
From: Tim Allison <[email protected]> 
Sent: Friday, January 22, 2021 10:07 AM
To: [email protected]
Subject: Re: Tesseract PSM=0

I don't think the TesseractOCRParser is set up to parse this type of output.  
PRs welcomed...if there's a generalizable use case for this(?).

On Fri, Jan 22, 2021 at 9:31 AM Peter Kronenberg <[email protected]> 
wrote:
>
> What is the expected behavior of Tika when using PSM 0?   When using 
> Tesseract directly from the command line, I get this
>
> c:\TestFiles>tesseract --psm 0 Dickens.png stdout
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 8.75
> Script: Latin
> Script confidence: 2.86
>
> But from Tika, I’m not getting any output.  There’s obviously no OCR output, 
> since PSM 0 doesn’t do OCR.  It just does Orientation and Script detection. 
> So where is that Tesseract output going?
