Even if it's not parsed, I thought I'd see the output somewhere. Ideally, it should be in the metadata. So essentially, using PSM=0 with Tika just doesn't work at all.
I'm still learning more about Tesseract, but I think the use case is to determine the Script of a document. For example, I have an Arabic document and if I don't specify the language of "ara", it doesn't work properly. But if you have a document where you don't know the script, this might be handy.

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Friday, January 22, 2021 10:07 AM
To: [email protected]
Subject: Re: Tesseract PSM=0

I don't think the TesseractOCRParser is set up to parse this type of output. PRs welcomed...if there's a generalizable use case for this(?).

On Fri, Jan 22, 2021 at 9:31 AM Peter Kronenberg <[email protected]> wrote:
>
> What is the expected behavior of Tika when using PSM 0? When using
> Tesseract directly from the command line, I get this
>
> c:\TestFiles>tesseract --psm 0 Dickens.png stdout
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 8.75
> Script: Latin
> Script confidence: 2.86
>
> But from Tika, I’m not getting any output. There’s obviously no OCR output,
> since PSM 0 doesn’t do OCR. It just does Orientation and Script detection.
> So where is that Tesseract output going?
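To make the "put it in the metadata" idea a bit more concrete, here is a minimal Python sketch of turning the PSM 0 output quoted above into key/value pairs. This is not Tika code and not how TesseractOCRParser works; `parse_osd` and `run_osd` are hypothetical names I made up for illustration, and the sketch assumes the `tesseract` executable is on the PATH and prints the same "Key: value" lines shown in the command-line run.

```python
import subprocess


def parse_osd(text):
    """Turn the 'Key: value' lines printed by `tesseract --psm 0` into a dict.

    Hypothetical helper for illustration; values are kept as strings.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a colon
        fields[key.strip()] = value.strip()
    return fields


def run_osd(image_path):
    """Run the same command as in the quoted message and parse its stdout.

    Assumes `tesseract` is installed and on the PATH.
    """
    result = subprocess.run(
        ["tesseract", "--psm", "0", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_osd(result.stdout)
```

Something like `run_osd("Dickens.png")["Script"]` could then be used to pick the language for a second, real OCR pass. A parser on the Tika side would presumably want to do the equivalent and copy these fields into the metadata, but that part is only a guess at what a PR might look like.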
