Re: Page Segmentation Mode

Tim Allison Mon, 04 Jan 2021 10:17:28 -0800

This is more of a Tesseract question. Script detection means determining
what the dominant script in an image is, e.g. Latin, Han, Korean, Greek,
Tamil, etc.  See: https://research.google/pubs/pub35506/


This is somewhat useful, though it doesn't say much about OSD:
https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d

My guess is that the hope was this: if a user doesn't specify a language
and the document is in, say, Russian, then OSD would identify Cyrillic
script and Tesseract would use the Russian language model.  If this isn't
the case and we're not getting any benefit from OSD, then we should switch
to Tesseract's default, 3.
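
To make that guess concrete, here is a minimal sketch in Python (not
Tika's code) of the two-step flow I have in mind: run OSD only (--psm 0),
read the detected script, then run OCR with a matching language. The
helper names and the script-to-language table are hypothetical, and the
"Script:" line format is what I'd expect from the OSD report, so treat
that as an assumption too. OSD also needs osd.traineddata installed.

```python
# Hypothetical mapping for illustration only; a real one would need an
# entry for every script/language pair you care about.
SCRIPT_TO_LANG = {"Cyrillic": "rus", "Latin": "eng"}


def build_osd_command(image_path):
    """Command line for orientation and script detection only (mode 0)."""
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_osd_script(osd_output):
    """Return the value of the 'Script:' line in an OSD report, or None."""
    for line in osd_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Script":
            return value.strip()
    return None


def build_ocr_command(image_path, script=None, psm=3):
    """Command line for OCR; adds -l only when the script maps to a language."""
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    lang = SCRIPT_TO_LANG.get(script)
    if lang:
        cmd += ["-l", lang]
    return cmd
```

The lists can be passed to subprocess.run as-is. If the script is unknown
or unmapped, no -l is added and Tesseract falls back to its own default
language.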

The Tika calls to ImageMagick (if it is installed) are meant to normalize
the image (rotation, etc.) to improve the chances of successful OCR.

This looks like a pretty good resource on using Tesseract with languages
other than English:
https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/

On Mon, Jan 4, 2021 at 12:26 PM Peter Kronenberg <[email protected]>
wrote:

> It appears that Tika’s default for Page Segmentation Mode in
> TesseractOCRConfig is 1, whereas for Tesseract itself, it is 3.  Any
> particular reason for this?
>
> I know this is primarily a Tesseract question, but I confess that I’m a
> little confused about the Page Segmentation Modes in general.  Maybe you
> can shed a little light
>
> [image: Page segmentation modes:
>    0  Orientation and script detection (OSD) only.
>    1  Automatic page segmentation with OSD.
>    2  Automatic page segmentation, but no OSD, or OCR.
>    3  Fully automatic page segmentation, but no OSD. (Default)
>    4  Assume a single column of text of variable sizes.
>    5  Assume a single uniform block of vertically aligned text.
>    6  Assume a single uniform block of text.
>    7  Treat the image as a single text line.
>    8  Treat the image as a single word.
>    9  Treat the image as a single word in a circle.
>   10  Treat the image as a single character.
>   11  Sparse text. Find as much text as possible in no particular order.
>   12  Sparse text with OSD.
>   13  Raw line. Treat the image as a single text line, bypassing hacks
>       that are Tesseract-specific.]
>
> What exactly does OSD mean, i.e., what is script detection?  Is that just
> detecting text?  What does option 3 mean when it says it doesn’t do OSD?
>
> Does any of this have to do with dealing with skewed images?
>
> Is there any place where there is a more detailed explanation of these
> different modes?
>
> Thanks
>
> Peter
