IMHO the original goal was to OCR rotated images. The ImageMagick deskewing code was added later, but ImageMagick must be installed for it to work.
Luis

On Mon, Jan 4, 2021 at 15:42, Peter Kronenberg <[email protected]> wrote:

> Wait, I take that back. I was looking at 0, not 1.
>
> The default of 1 makes sense and it makes me wonder even more why
> Tesseract defaults to 3
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, January 4, 2021 1:39 PM
> *To:* [email protected]; [email protected]
> *Subject:* RE: Page Segmentation Mode
>
> I know it was more a Tesseract question, but I appreciate you taking the
> time to answer 😊.
>
> I think it probably makes sense to go with Tesseract’s default of 3, and
> have the user specify the language, if it’s not Latin script. But not if
> the Orientation of OSD includes deskewing. I think that’s important to do
> by default.
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, January 4, 2021 1:17 PM
> *To:* [email protected]
> *Subject:* Re: Page Segmentation Mode
>
> This is more of a tesseract question....script detection is determining
> what the dominant script is in an image, e.g. Latin, Han, Korean, Greek,
> Tamil, etc. See: https://research.google/pubs/pub35506/
>
> This is somewhat useful (not so much on osd etc):
> https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d
>
> My guess is that the hope was that if a user doesn't specify a language
> and the document is in, say, Russian, then the OSD would identify Cyrillic
> script and use the Russian language model. If this isn't the case and
> we're not getting any benefit from OSD, then we should default to
> tesseract's default: 3.
>
> The Tika calls to imagemagick (if it is installed) are meant to normalize
> the image (rotate, etc) to improve chances of successful OCR.
>
> This looks like a pretty good resource on tesseract on languages beyond
> English:
> https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
>
> On Mon, Jan 4, 2021 at 12:26 PM Peter Kronenberg <[email protected]> wrote:
>
> It appears that Tika’s default for Page Segmentation Mode in
> TesseractOCRConfig is 1, whereas for Tesseract itself, it is 3. Any
> particular reason for this?
>
> I know this is primarily a Tesseract question, but I confess that I’m a
> little confused about the Page Segmentation Modes in general. Maybe you
> can shed a little light
>
> [image: Page segmentation modes:
>    0  Orientation and script detection (OSD) only.
>    1  Automatic page segmentation with OSD.
>    2  Automatic page segmentation, but no OSD, or OCR.
>    3  Fully automatic page segmentation, but no OSD. (Default)
>    4  Assume a single column of text of variable sizes.
>    5  Assume a single uniform block of vertically aligned text.
>    6  Assume a single uniform block of text.
>    7  Treat the image as a single text line.
>    8  Treat the image as a single word.
>    9  Treat the image as a single word in a circle.
>   10  Treat the image as a single character.
>   11  Sparse text. Find as much text as possible in no particular order.
>   12  Sparse text with OSD.
>   13  Raw line. Treat the image as a single text line, bypassing hacks
>       that are Tesseract-specific.]
>
> What exactly does OSD mean, i.e., what is script detection? Is that just
> detecting text? What does option 3 mean when it says it doesn’t do OSD?
>
> Does any of this have to do with dealing with skewed images?
>
> Is there any place where there is a more detailed explanation of these
> different modes?
>
> Thanks
>
> Peter
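
For anyone who wants to compare the modes discussed above by hand, here is a minimal sketch that builds the tesseract command line for a given page segmentation mode. It is only an illustration under some assumptions: the helper names (tesseract_command, uses_osd) and the file name are made up for this example, and it assumes a Tesseract 4+ command line, where the flag is spelled --psm. It does not show how Tika itself invokes tesseract.

```python
# Modes that the list quoted above describes as including OSD.
OSD_MODES = {0, 1, 12}


def uses_osd(psm):
    """Return True if the given mode includes orientation and script detection."""
    return psm in OSD_MODES


def tesseract_command(image_path, psm=3, lang=None):
    """Build an argv list for the tesseract CLI (hypothetical helper).

    psm defaults to 3, Tesseract's own default according to the list above.
    """
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    # "stdout" as the output base makes tesseract print the result.
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    if lang:
        # -l selects the language model, e.g. "rus" for Russian.
        cmd += ["-l", lang]
    return cmd
```

If tesseract is installed, the resulting list can be passed to subprocess.run, for example once with psm=1 and once with psm=3 on the same image, to see whether OSD makes any difference to the output.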
