Wait, I take that back.  I was looking at 0, not 1.
The default of 1 makes sense, and it makes me wonder even more why Tesseract 
defaults to 3.

From: Peter Kronenberg <[email protected]>
Sent: Monday, January 4, 2021 1:39 PM
To: [email protected]; [email protected]
Subject: RE: Page Segmentation Mode

I know it was more a Tesseract question, but I appreciate you taking the time 
to answer 😊.

I think it probably makes sense to go with Tesseract’s default of 3 and have 
the user specify the language if it’s not Latin script.  But not if the 
orientation part of OSD includes deskewing; I think that’s important to do by 
default.

From: Tim Allison <[email protected]>
Sent: Monday, January 4, 2021 1:17 PM
To: [email protected]
Subject: Re: Page Segmentation Mode

This is more of a Tesseract question. Script detection determines what the 
dominant script in an image is, e.g. Latin, Han, Korean, Greek, Tamil, etc. 
See: https://research.google/pubs/pub35506/

This is somewhat useful (though it does not say much about OSD): 
https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d

My guess is that the hope was that if a user doesn't specify a language and the 
document is in, say, Russian, then the OSD would identify Cyrillic script and 
use the Russian language model.  If this isn't the case and we're not getting 
any benefit from OSD, then we should default to tesseract's default: 3.
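
A minimal sketch of that idea, assuming the OSD report printed by something like `tesseract image.png stdout --psm 0` consists of `Key: value` lines that include a `Script:` entry. The script-to-language table and the function names are hypothetical, for illustration only; they are not something Tika or Tesseract provides, and a single script can of course cover many languages:

```python
# Hypothetical mapping from an OSD script name to a Tesseract language code.
# A script can cover many languages, so each entry is only a guessed default.
SCRIPT_TO_LANG = {
    "Latin": "eng",
    "Cyrillic": "rus",
    "Greek": "ell",
    "Korean": "kor",
    "Tamil": "tam",
}


def parse_osd(report: str) -> dict:
    """Parse 'Key: value' lines from an OSD report into a dict."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pick_language(report: str, fallback: str = "eng") -> str:
    """Choose a language model from the detected script, else the fallback."""
    script = parse_osd(report).get("Script")
    return SCRIPT_TO_LANG.get(script, fallback)


# Sample report in the assumed format (the values are illustrative).
sample = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.33
Script: Cyrillic
Script confidence: 2.00
"""
print(pick_language(sample))
```

If the detected script is not in the table, the sketch falls back to the caller's language, which matches the "default to 3 and let the user specify the language" option.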

The Tika calls to ImageMagick (if it is installed) are meant to normalize the 
image (rotate, etc.) to improve the chances of successful OCR.

This looks like a pretty good resource on using Tesseract for languages beyond 
English: 
https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/

On Mon, Jan 4, 2021 at 12:26 PM Peter Kronenberg <[email protected]> wrote:
It appears that Tika’s default for Page Segmentation Mode in TesseractOCRConfig 
is 1, whereas for Tesseract itself, it is 3.  Any particular reason for this?

I know this is primarily a Tesseract question, but I confess that I’m a little 
confused about the Page Segmentation Modes in general.  Maybe you can shed a 
little light.


Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
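
To make the difference between the two defaults concrete, here is a rough sketch that builds a Tesseract command line with an explicit mode. It assumes the `--psm` and `-l` options of a recent Tesseract CLI (older 3.x releases spelled the first one `-psm`); the helper names are hypothetical, and the sketch only builds the argument list without running anything:

```python
# Page segmentation modes that include orientation and script detection,
# according to the list above.
OSD_MODES = {0, 1, 12}


def uses_osd(psm: int) -> bool:
    """Return True if the given page segmentation mode performs OSD."""
    return psm in OSD_MODES


def build_tesseract_command(image: str, psm: int = 3, lang: str = None) -> list:
    """Build an argument list for the tesseract CLI (does not execute it)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # "stdout" as the output base asks tesseract to print the result.
    cmd = ["tesseract", image, "stdout", "--psm", str(psm)]
    if lang:
        cmd += ["-l", lang]
    return cmd


# Tika-style default (1, with OSD) versus Tesseract's own default (3, no OSD).
print(build_tesseract_command("page.png", psm=1), uses_osd(1))
print(build_tesseract_command("page.png", psm=3, lang="rus"), uses_osd(3))
```

The resulting list could be handed to `subprocess.run` on a machine where Tesseract is installed.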

What exactly does OSD mean, i.e., what is script detection?  Is that just 
detecting text?  What does option 3 mean when it says it doesn’t do OSD?
Does any of this have to do with dealing with skewed images?
Is there any place where there is a more detailed explanation of these 
different modes?

Thanks
Peter