It appears that Tika's default for Page Segmentation Mode in TesseractOCRConfig 
is 1, whereas for Tesseract itself, it is 3.  Any particular reason for this?

I know this is primarily a Tesseract question, but I confess that I'm a little 
confused about the Page Segmentation Modes in general.  Maybe you can shed a 
little light.


[Page segmentation modes:
   0    Orientation and script detection (OSD) only.
   1    Automatic page segmentation with OSD.
   2    Automatic page segmentation, but no OSD, or OCR.
   3    Fully automatic page segmentation, but no OSD. (Default)
   4    Assume a single column of text of variable sizes.
   5    Assume a single uniform block of vertically aligned text.
   6    Assume a single uniform block of text.
   7    Treat the image as a single text line.
   8    Treat the image as a single word.
   9    Treat the image as a single word in a circle.
  10    Treat the image as a single character.
  11    Sparse text. Find as much text as possible in no particular order.
  12    Sparse text with OSD.
  13    Raw line. Treat the image as a single text line,
        bypassing hacks that are Tesseract-specific.]
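
To make the question concrete, here is a rough sketch of how I read the list above. The descriptions are transcribed from the help text. The names PSM_MODES, OSD_MODES, uses_osd and build_tesseract_cmd are my own and hypothetical, not part of Tika or Tesseract. I am also assuming the mode is passed on the command line as "--psm N", which I believe is the Tesseract 4.x spelling (older 3.x builds take "-psm"), so adjust for your version.

```python
# Page segmentation modes, transcribed from the Tesseract help text quoted above.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only.",
    1: "Automatic page segmentation with OSD.",
    2: "Automatic page segmentation, but no OSD, or OCR.",
    3: "Fully automatic page segmentation, but no OSD. (Default)",
    4: "Assume a single column of text of variable sizes.",
    5: "Assume a single uniform block of vertically aligned text.",
    6: "Assume a single uniform block of text.",
    7: "Treat the image as a single text line.",
    8: "Treat the image as a single word.",
    9: "Treat the image as a single word in a circle.",
    10: "Treat the image as a single character.",
    11: "Sparse text. Find as much text as possible in no particular order.",
    12: "Sparse text with OSD.",
    13: "Raw line. Treat the image as a single text line, "
        "bypassing hacks that are Tesseract-specific.",
}

# Modes whose description says OSD is performed (my reading of the list).
OSD_MODES = frozenset({0, 1, 12})


def uses_osd(psm):
    """Return True if the given mode's description mentions running OSD."""
    if psm not in PSM_MODES:
        raise ValueError(f"unknown page segmentation mode: {psm}")
    return psm in OSD_MODES


def build_tesseract_cmd(image_path, output_base, psm=3):
    """Build (but do not run) a tesseract command line for the given mode.

    Assumption: the flag is spelled --psm, as in Tesseract 4.x.
    """
    if psm not in PSM_MODES:
        raise ValueError(f"unknown page segmentation mode: {psm}")
    return ["tesseract", image_path, output_base, "--psm", str(psm)]
```

So, if I understand correctly, the only difference between Tika's default (1) and Tesseract's default (3) is whether uses_osd() would be true. The list returned by build_tesseract_cmd("page.png", "out", psm=1) could be handed to subprocess.run to compare the two modes on the same image.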

What exactly does OSD mean, i.e., what is script detection?  Is that just 
detecting text?  What does option 3 mean when it says it doesn't do OSD?
Does any of this have to do with dealing with skewed images?
Is there any place where there is a more detailed explanation of these 
different modes?

Thanks
Peter





