Wait, I take that back. I was looking at 0, not 1. The default of 1 makes sense, and it makes me wonder even more why Tesseract defaults to 3.

From: Peter Kronenberg <[email protected]>
Sent: Monday, January 4, 2021 1:39 PM
To: [email protected]; [email protected]
Subject: RE: Page Segmentation Mode

I know it was more a Tesseract question, but I appreciate you taking the time to answer 😊. I think it probably makes sense to go with Tesseract's default of 3, and have the user specify the language if it's not Latin script. But not if the orientation part of OSD includes deskewing. I think that's important to do by default.

From: Tim Allison <[email protected]>
Sent: Monday, January 4, 2021 1:17 PM
To: [email protected]
Subject: Re: Page Segmentation Mode

This is more of a tesseract question. Script detection determines what the dominant script is in an image, e.g. Latin, Han, Korean, Greek, Tamil, etc. See: https://research.google/pubs/pub35506/

This is somewhat useful (not so much on osd etc): https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d

My guess is that the hope was that if a user doesn't specify a language and the document is in, say, Russian, then the OSD would identify Cyrillic script and use the Russian language model. If this isn't the case and we're not getting any benefit from OSD, then we should default to tesseract's default: 3.

The Tika calls to imagemagick (if it is installed) are meant to normalize the image (rotate, etc) to improve chances of successful OCR.
This looks like a pretty good resource on tesseract on languages beyond English: https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/

On Mon, Jan 4, 2021 at 12:26 PM Peter Kronenberg <[email protected]> wrote:

It appears that Tika’s default for Page Segmentation Mode in TesseractOCRConfig is 1, whereas for Tesseract itself, it is 3. Any particular reason for this? I know this is primarily a Tesseract question, but I confess that I’m a little confused about the Page Segmentation Modes in general. Maybe you can shed a little light.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

What exactly does OSD mean, i.e., what is script detection? Is that just detecting text? What does option 3 mean when it says it doesn’t do OSD? Does any of this have to do with dealing with skewed images? Is there any place where there is a more detailed explanation of these different modes?

Thanks
Peter
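
For anyone following the thread, here is a minimal Python sketch of the modes listed above and of how a mode and language might be passed to the tesseract command line. The names PSM_DESCRIPTIONS, OSD_MODES, uses_osd and build_tesseract_command are hypothetical helpers made up for illustration; they are not part of Tika or Tesseract. The sketch assumes a Tesseract 4 or later CLI, where the flag is spelled --psm, and it only builds the argument list without running anything.

```python
from typing import List, Optional

# Page segmentation modes, copied from the list quoted in this thread.
PSM_DESCRIPTIONS = {
    0: "Orientation and script detection (OSD) only.",
    1: "Automatic page segmentation with OSD.",
    2: "Automatic page segmentation, but no OSD, or OCR.",
    3: "Fully automatic page segmentation, but no OSD. (Default)",
    4: "Assume a single column of text of variable sizes.",
    5: "Assume a single uniform block of vertically aligned text.",
    6: "Assume a single uniform block of text.",
    7: "Treat the image as a single text line.",
    8: "Treat the image as a single word.",
    9: "Treat the image as a single word in a circle.",
    10: "Treat the image as a single character.",
    11: "Sparse text. Find as much text as possible in no particular order.",
    12: "Sparse text with OSD.",
    13: "Raw line. Treat the image as a single text line, "
        "bypassing hacks that are Tesseract-specific.",
}

# Modes whose description says OSD is performed (hypothetical helper set).
OSD_MODES = {0, 1, 12}


def uses_osd(psm: int) -> bool:
    """Return True if the given mode performs orientation and script detection."""
    if psm not in PSM_DESCRIPTIONS:
        raise ValueError(f"unknown page segmentation mode: {psm}")
    return psm in OSD_MODES


def build_tesseract_command(image_path: str, psm: int = 3,
                            lang: Optional[str] = None) -> List[str]:
    """Build (but do not run) a tesseract argument list.

    Assumes the Tesseract 4+ flag spelling "--psm"; "stdout" as the output
    base tells tesseract to write the recognized text to standard output.
    """
    if psm not in PSM_DESCRIPTIONS:
        raise ValueError(f"unknown page segmentation mode: {psm}")
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    if lang:
        # Explicit language, e.g. "rus", instead of relying on script detection.
        cmd += ["-l", lang]
    return cmd


if __name__ == "__main__":
    # Tesseract's own default (3) with the language given explicitly.
    print(build_tesseract_command("page.png", psm=3, lang="rus"))
```

With mode 3 the language is supplied explicitly via -l, which matches the suggestion above to have the user specify the language when the text is not in Latin script.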
