PRs welcome. Committers are standing by.

On Fri, Jan 22, 2021 at 11:48 AM Peter Kronenberg <[email protected]> wrote:
> Even if it's not parsed, I thought I'd see the output somewhere. Ideally,
> it should be in the metadata. So essentially, using PSM=0 with Tika just
> doesn't work at all.
>
> I'm still learning more about Tesseract, but I think the use case is to
> determine the Script of a document. For example, I have an Arabic document
> and if I don't specify the language of "ara", it doesn't work properly.
> But if you have a document where you don't know the script, this might be
> handy
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Friday, January 22, 2021 10:07 AM
> To: [email protected]
> Subject: Re: Tesseract PSM=0
>
> I don't think the TesseractOCRParser is set up to parse this type of
> output. PRs welcomed...if there's a generalizable use case for this(?).
>
> On Fri, Jan 22, 2021 at 9:31 AM Peter Kronenberg <
> [email protected]> wrote:
> >
> > What is the expected behavior of Tika when using PSM 0? When using
> > Tesseract directly from the command line, I get this
> >
> > c:\TestFiles>tesseract --psm 0 Dickens.png stdout
> > Page number: 0
> > Orientation in degrees: 0
> > Rotate: 0
> > Orientation confidence: 8.75
> > Script: Latin
> > Script confidence: 2.86
> >
> > But from Tika, I’m not getting any output. There’s obviously no OCR
> > output, since PSM 0 doesn’t do OCR. It just does Orientation and Script
> > detection. So where is that Tesseract output going?
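
For anyone picking this up: the PSM 0 report quoted above is plain "key: value" lines, so turning it into metadata-style fields is not much work. Below is a minimal sketch in Python, not Tika code. The function name `parse_osd` is hypothetical, the sample text is copied from the command-line output in this thread, and it assumes only the stdout report is passed in (no warnings or prompt lines).

```python
def parse_osd(report: str) -> dict:
    """Parse a Tesseract `--psm 0` report into a dict of string fields.

    Hypothetical helper for illustration; values are kept as strings so the
    caller decides which ones to convert to numbers.
    """
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything that is not key: value
        fields[key.strip()] = value.strip()
    return fields


# Sample report copied from the command-line run quoted in this thread.
SAMPLE = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 8.75
Script: Latin
Script confidence: 2.86
"""

osd = parse_osd(SAMPLE)
print(osd["Script"], float(osd["Script confidence"]))
```

A parser change along these lines could then copy each field into the document metadata, as Peter suggests. The metadata key names would still need to be chosen.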
