Re: Tesseract PSM=0

Tim Allison Fri, 22 Jan 2021 09:27:50 -0800

PRs welcome. Committers are standing by.

On Fri, Jan 22, 2021 at 11:48 AM Peter Kronenberg <[email protected]>
wrote:


> Even if it's not parsed, I thought I'd see the output somewhere.  Ideally,
> it should be in the metadata.  So essentially, using PSM=0 with Tika just
> doesn't work at all.
>
> I'm still learning more about Tesseract, but I think the use case is to
> determine the Script of a document.  For example, I have an Arabic document
> and if I don't specify the language of "ara", it doesn't work properly.
> But if you have a document where you don't know the script, this might be
> handy.
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Friday, January 22, 2021 10:07 AM
> To: [email protected]
> Subject: Re: Tesseract PSM=0
>
> I don't think the TesseractOCRParser is set up to parse this type of
> output.  PRs welcomed...if there's a generalizable use case for this(?).
>
> On Fri, Jan 22, 2021 at 9:31 AM Peter Kronenberg <
> [email protected]> wrote:
> >
> > What is the expected behavior of Tika when using PSM 0? When using
> > Tesseract directly from the command line, I get this
> >
> > c:\TestFiles>tesseract --psm 0 Dickens.png stdout
> > Page number: 0
> > Orientation in degrees: 0
> > Rotate: 0
> > Orientation confidence: 8.75
> > Script: Latin
> > Script confidence: 2.86
> >
> > But from Tika, I’m not getting any output.  There’s obviously no OCR
> > output, since PSM 0 doesn’t do OCR.  It just does Orientation and Script
> > detection. So where is that Tesseract output going?
>
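
For anyone picking this up, here is a rough sketch of how the PSM 0 output
quoted above could be turned into key/value pairs suitable for metadata. It is
written in Python for brevity, not Tika's Java, and it is not how
TesseractOCRParser works today. The names parse_osd and _coerce are
hypothetical, and the sample text is simply the command-line output from this
thread.

```python
# Sample text copied from the command-line run quoted in this thread.
SAMPLE = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 8.75
Script: Latin
Script confidence: 2.86
"""


def _coerce(value):
    """Return value as int or float when it parses as one, else as str."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_osd(text):
    """Parse 'Key: value' lines (hypothetical helper) into a dict.

    Lines without a colon, or with an empty key or value, are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key and value:
            fields[key] = _coerce(value)
    return fields


osd = parse_osd(SAMPLE)
print(osd["Script"])  # Latin
```

The same function could presumably be fed the captured stdout of
"tesseract --psm 0 Dickens.png stdout"; whether the detected script should
then select a language such as "ara" automatically is a separate design
question for the PR.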
