This is more of a Tesseract question. Script detection means determining the dominant script in an image, e.g. Latin, Han, Korean, Greek, Tamil, etc. See: https://research.google/pubs/pub35506/
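To make "script detection" a bit more concrete, here is a minimal Python sketch of how a caller might read the detected script from an OSD report and choose a language model from it. The sample report text, the `SCRIPT_TO_LANG` table, and the function names are illustrative assumptions on my part; this is not what Tika or Tesseract does internally.

```python
# Illustrative sample of an OSD report (the kind of key/value text produced
# when Tesseract runs in OSD-only mode). The values here are made up.
SAMPLE_OSD_REPORT = """\
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 12.50
Script: Cyrillic
Script confidence: 3.10
"""

# Hypothetical script-to-language table, for illustration only. A script
# does not determine a language (Cyrillic is also used for Ukrainian,
# Bulgarian, and others), so this is at best a fallback guess.
SCRIPT_TO_LANG = {
    "Latin": "eng",
    "Cyrillic": "rus",
    "Greek": "ell",
    "Tamil": "tam",
}


def parse_osd(report: str) -> dict:
    """Split 'Key: value' lines of an OSD report into a dictionary."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def pick_language(report: str, default: str = "eng") -> str:
    """Return a language code for the detected script, or the default."""
    script = parse_osd(report).get("Script")
    return SCRIPT_TO_LANG.get(script, default)


if __name__ == "__main__":
    print(parse_osd(SAMPLE_OSD_REPORT)["Script"])
    print(pick_language(SAMPLE_OSD_REPORT))
```

The point of the sketch is only that OSD tells you which script dominates the page; turning that into a language choice is a separate decision that the caller would have to make.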
This is somewhat useful (though not so much on OSD): https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d

My guess is that the hope was that if a user doesn't specify a language and the document is in, say, Russian, then OSD would identify Cyrillic script and use the Russian language model. If this isn't the case and we're not getting any benefit from OSD, then we should use Tesseract's default: 3.

The Tika calls to ImageMagick (if it is installed) are meant to normalize the image (rotate, etc.) to improve the chances of successful OCR.

This looks like a pretty good resource on Tesseract for languages beyond English: https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/

On Mon, Jan 4, 2021 at 12:26 PM Peter Kronenberg <[email protected]> wrote:

> It appears that Tika’s default for Page Segmentation Mode in
> TesseractOCRConfig is 1, whereas for Tesseract itself, it is 3. Any
> particular reason for this?
>
> I know this is primarily a Tesseract question, but I confess that I’m a
> little confused about the Page Segmentation Modes in general. Maybe you
> can shed a little light
>
> [image: Page segmentation modes:
>    0  Orientation and script detection (OSD) only.
>    1  Automatic page segmentation with OSD.
>    2  Automatic page segmentation, but no OSD, or OCR.
>    3  Fully automatic page segmentation, but no OSD. (Default)
>    4  Assume a single column of text of variable sizes.
>    5  Assume a single uniform block of vertically aligned text.
>    6  Assume a single uniform block of text.
>    7  Treat the image as a single text line.
>    8  Treat the image as a single word.
>    9  Treat the image as a single word in a circle.
>   10  Treat the image as a single character.
>   11  Sparse text. Find as much text as possible in no particular order.
>   12  Sparse text with OSD.
>   13  Raw line. Treat the image as a single text line, bypassing hacks
>       that are Tesseract-specific.]
>
> What exactly does OSD mean, i.e., what is script detection? Is that just
> detecting text? What does option 3 mean when it says it doesn’t do OSD?
>
> Does any of this have to do with dealing with skewed images?
>
> Is there any place where there is a more detailed explanation of these
> different modes?
>
> Thanks
>
> Peter
