0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
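As a rough illustration of how mode 0 from the list above could feed the choice of traineddata, here is a minimal Python sketch. It assumes a recent Tesseract (4.x or later) with osd.traineddata installed, and that the OSD report is printed as "Key: value" lines; older builds use `-psm` and may format the report differently. The helper names and the script-to-language table are hypothetical and only for illustration.

```python
import subprocess

# Hypothetical mapping from OSD script names to traineddata codes;
# extend it to match the language packs that are actually installed.
SCRIPT_TO_LANG = {
    "Latin": "eng",
    "Cyrillic": "rus",
    "Arabic": "ara",
    "Devanagari": "hin",
}


def parse_osd(text):
    """Turn the 'Key: value' lines of an OSD report into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            info[key.strip()] = value.strip()
    return info


def choose_lang(osd_text, default="eng"):
    """Pick a traineddata code from the detected script, else fall back."""
    script = parse_osd(osd_text).get("Script")
    return SCRIPT_TO_LANG.get(script, default)


def run_osd(image_path):
    """Run OSD only (page segmentation mode 0) and return the raw report.

    Assumes the `tesseract` binary is on PATH and accepts `--psm`.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example report in the assumed format (illustrative values, not real output).
SAMPLE = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.33
Script: Cyrillic
Script confidence: 2.00
"""
```

With something like this, `choose_lang(run_osd("page.png"))` would give a code to pass to `tesseract page.png out -l <code>`. It is probably worth checking the script confidence value before trusting the result on short or sparse pages.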
See whether using OSD to detect the script helps you choose the correct traineddata.

On Wednesday, November 19, 2014 12:12:07 AM UTC+5:30, Ryan Dev wrote:
>
> Thanks again.
>
>> you may get better results using appropriate language data rather than
>> just the ascii range. Are the client documents sorted by language?
>
> I'm not sure how they have them organised, I just know they want an
> "automatic" solution...
>
>> I am attaching files used - I had just copied some tables of ascii range
>> - you can delete symbols, add multiple copies of letters that are needed.
>
> I'm still getting up and running with training (I'm doing it on Linux as
> there appear to be more tools available that way). But I saw this comment
> from zdenop
>
> https://groups.google.com/forum/#!searchin/tesseract-ocr/train$20hall$20of$20fame/tesseract-ocr/tq2aHxxndpM/u5ldKIwUANIJ
> and it leads me to believe that trained data built from the
> common fonts (arial, georgia, segoe, garamond) will not be any better than
> what is already available?
>
> I have complete control over the image data I send to tesseract, so I
> don't care about skewing, exposure, etc, as my glyphs will always be
> straight, clear, and separated.
>
> For instance, I want to train for the ligatures ff, ffi, and ffl, which
> are not in the english or ascii ones, and are missing from even the common
> fonts like arial, but that my client files may contain.
>
> Should I train new eng or asc traineddata, or just create a new one for a
> smaller set of glyphs like these?
>
> Thanks again for your help.
>

