It appears that Tika's default for Page Segmentation Mode in TesseractOCRConfig is 1, whereas for Tesseract itself, it is 3. Any particular reason for this?
I know this is primarily a Tesseract question, but I confess that I'm a little
confused about the Page Segmentation Modes in general. Maybe you can shed a
little light.
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
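To make the question concrete, here is a minimal Python sketch of how I understand a mode from the list above gets selected when calling the tesseract command line directly. The helper names (`build_tesseract_args`, `uses_osd`) and the file names are hypothetical, and the flag spelling is an assumption: as far as I know, newer Tesseract releases take `--psm` while older 3.x releases took `-psm`. The set of OSD modes is read straight off the help text, not from Tesseract's source.

```python
# Modes whose help text says OSD is performed (0, 1 and 12 in the list above).
OSD_MODES = {0, 1, 12}

# The help text lists modes 0 through 13.
VALID_MODES = set(range(14))


def uses_osd(psm):
    """Return True if the help text says this mode performs OSD."""
    return psm in OSD_MODES


def build_tesseract_args(image_path, output_base, psm=3, flag="--psm"):
    """Build an argument list for the tesseract CLI with an explicit PSM.

    psm defaults to 3, the Tesseract default per the help text.
    flag is an assumption: "--psm" on newer releases, "-psm" on older ones.
    """
    if psm not in VALID_MODES:
        raise ValueError("page segmentation mode must be 0..13, got %r" % (psm,))
    return ["tesseract", image_path, output_base, flag, str(psm)]


if __name__ == "__main__":
    # Tika's default (1) versus Tesseract's own default (3).
    print(build_tesseract_args("page.png", "out", psm=1))
    print(build_tesseract_args("page.png", "out", psm=3))
```

The resulting list could be handed to `subprocess.run` on a machine with Tesseract installed. The only difference between the two defaults, as far as the help text goes, is that mode 1 adds OSD and mode 3 does not.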
What exactly does OSD mean, i.e., what is script detection? Is that just
detecting text? What does option 3 mean when it says it doesn't do OSD?
Does any of this have to do with dealing with skewed images?
Is there any place where there is a more detailed explanation of these
different modes?
Thanks
Peter
