Re: Page Segmentation Mode

Luís Filipe Nassif Wed, 06 Jan 2021 03:48:36 -0800

IMHO the original goal was to OCR rotated images. ImageMagick deskewing
code was added later, but ImageMagick must be installed for it to work.


Luis


On Mon, Jan 4, 2021 at 15:42, Peter Kronenberg <[email protected]>
wrote:

> Wait, I take that back.  I was looking at 0, not 1.
>
> The default of 1 makes sense, and it makes me wonder even more why
> Tesseract defaults to 3.
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, January 4, 2021 1:39 PM
> *To:* [email protected]; [email protected]
> *Subject:* RE: Page Segmentation Mode
>
>
>
> I know it was more a Tesseract question, but I appreciate you taking the
> time to answer 😊.
>
>
>
> I think it probably makes sense to go with Tesseract’s default of 3 and
> have the user specify the language if it’s not Latin script.  But not if
> the orientation part of OSD includes deskewing.  I think that’s important
> to do by default.
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, January 4, 2021 1:17 PM
> *To:* [email protected]
> *Subject:* Re: Page Segmentation Mode
>
>
>
> This is more of a Tesseract question... Script detection determines
> what the dominant script is in an image, e.g. Latin, Han, Korean, Greek,
> Tamil, etc.  See: https://research.google/pubs/pub35506/
>
>
>
> This is somewhat useful (though not so much on OSD):
> https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d
>
>
>
> My guess is that the hope was that if a user doesn't specify a language
> and the document is in, say, Russian, then the OSD would identify Cyrillic
> script and use the Russian language model.  If this isn't the case and
> we're not getting any benefit from OSD, then we should default to
> tesseract's default: 3.
>
>
>
> The Tika calls to ImageMagick (if it is installed) are meant to normalize
> the image (rotate, etc.) to improve the chances of successful OCR.
>
>
>
> This looks like a pretty good resource on using Tesseract with languages
> other than English:
> https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
>
>
>
> On Mon, Jan 4, 2021 at 12:26 PM Peter Kronenberg <
> [email protected]> wrote:
>
> It appears that Tika’s default for Page Segmentation Mode in
> TesseractOCRConfig is 1, whereas for Tesseract itself, it is 3.  Any
> particular reason for this?
>
>
>
> I know this is primarily a Tesseract question, but I confess that I’m a
> little confused about the Page Segmentation Modes in general.  Maybe you
> can shed a little light.
>
>
>
>
>
> [image: Page segmentation modes:
>    0  Orientation and script detection (OSD) only.
>    1  Automatic page segmentation with OSD.
>    2  Automatic page segmentation, but no OSD, or OCR.
>    3  Fully automatic page segmentation, but no OSD. (Default)
>    4  Assume a single column of text of variable sizes.
>    5  Assume a single uniform block of vertically aligned text.
>    6  Assume a single uniform block of text.
>    7  Treat the image as a single text line.
>    8  Treat the image as a single word.
>    9  Treat the image as a single word in a circle.
>   10  Treat the image as a single character.
>   11  Sparse text. Find as much text as possible in no particular order.
>   12  Sparse text with OSD.
>   13  Raw line. Treat the image as a single text line, bypassing hacks
>       that are Tesseract-specific.]
>
>
>
> What exactly does OSD mean, i.e., what is script detection?  Is that just
> detecting text?  What does option 3 mean when it says it doesn’t do OSD?
>
> Does any of this have to do with dealing with skewed images?
>
> Is there any place where there is a more detailed explanation of these
> different modes?
>
>
>
> Thanks
>
> Peter
>
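
For anyone comparing the two defaults discussed above (1 in Tika, 3 in Tesseract), here is a minimal Python sketch that names the modes from the quoted list and builds a Tesseract command line with an explicit `--psm` value. It only assembles the argument list and does not run Tesseract or touch Tika. The names `PageSegMode`, `uses_osd`, and `build_tesseract_command` are hypothetical helpers made up for illustration, and the `--psm` spelling assumes Tesseract 4 or later.

```python
from enum import IntEnum
from typing import List, Optional


class PageSegMode(IntEnum):
    """Page segmentation modes, numbered as in the list quoted in the thread."""
    OSD_ONLY = 0
    AUTO_OSD = 1          # the Tika default mentioned in the thread
    AUTO_ONLY = 2
    AUTO = 3              # the Tesseract default
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERT_TEXT = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10
    SPARSE_TEXT = 11
    SPARSE_TEXT_OSD = 12
    RAW_LINE = 13


# Modes whose description in the list includes OSD.
OSD_MODES = {
    PageSegMode.OSD_ONLY,
    PageSegMode.AUTO_OSD,
    PageSegMode.SPARSE_TEXT_OSD,
}


def uses_osd(mode: PageSegMode) -> bool:
    """Return True if the mode runs orientation and script detection."""
    return mode in OSD_MODES


def build_tesseract_command(
    image_path: str,
    output_base: str,
    psm: PageSegMode = PageSegMode.AUTO,
    language: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not run) a tesseract invocation (hypothetical helper)."""
    # int() first, so the numeric value is emitted on every Python version.
    cmd = ["tesseract", image_path, output_base, "--psm", str(int(psm))]
    if language:
        # Assumption: the matching traineddata file (e.g. "rus") is installed.
        cmd += ["-l", language]
    return cmd


if __name__ == "__main__":
    print(build_tesseract_command("page.png", "out", PageSegMode.AUTO_OSD, "rus"))
```

The returned list can be handed to `subprocess.run` as-is. Keeping the mode as an enum rather than a bare integer makes it harder to confuse 0 with 1, which is the mix-up mentioned earlier in the thread.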
