> I can't really use pre defined patterns since the code pattern and font
> can change over time.

Consider using somewhat more flexible patterns, by means of '*'. Second, you
can use more than one pattern in "user-patterns". And fonts have nothing to do
with patterns.

Implementing your own char-by-char segmentation is relatively easy even with
ImageMagick and shell scripts, provided you receive nicely binarized and
cleaned source images. As far as I can see, that is indeed the case. I suggest
CC (connected-component) labeling. For one possible implementation, see my
reply here:
https://groups.google.com/d/msg/tesseract-ocr/STHaLGYsiCo/pYZyAG2AuMAJ

From my experience, a problem like your #3 cannot be solved reliably by
parameter tweaking alone. You defeat one issue, and eventually another arises.
Then you waste your time investigating whether it was caused by a recent
parameter change or is independent of it. Change back, tweak another, fight a
new issue. Repeat.

A better way is to *force* the conditions for reliable OCR: preprocessing,
white-/blacklists, your own segmentation using layout priors, etc. Or, at
least, *postprocess* the OCR output. E.g. at some positions your O's are
definitely zeros. I know people who ended up with *thousands* of such rules
for Tess output in an app that allows much more diverse input than yours.

-Dmitri

On Wed, May 20, 2015 at 2:52 PM, Yoann Nicod <[email protected]> wrote:
> Thanks for your reply,
>
> I can't really use pre defined patterns since the code pattern and font
> can change over time.
> I like the idea to segment the characters myself before giving it to
> tesseract one by one, but it looks time consuming (coding it I mean).
> Isn't there any other suitable method ? In particular to solve the 3rd
> issue, which I think must be easy to solve.
>
> On Wednesday, May 20, 2015 at 12:29:08 PM UTC+2, Dmitri Silaev wrote:
>>
>> One no-brainer method to try out would be turning off all dictionaries
>> and using your own custom "user-patterns" file. Since you said about "your
>> application" I suppose you can program.
>> So you can take a look at the comment preceding the read_pattern_list()
>> declaration in "dict/trie.h" for more details.
>>
>> It seems all your strings are of the same format:
>> \A\A\d\d\d\d\d\d\d\d\d\d
>> (Tess understands very limited pattern syntax).
>>
>> But if accuracy is critical in your app, in the long run I would
>> absolutely avoid using any parts of Tesseract except the char classifier.
>> I.e. crop every single char out of your source image and run Tess in the
>> single char PSM. I think it should be easy as long as the location of every
>> character is quite stable among your source images. ImageMagick/shell
>> scripts would suffice.
>>
>> Best regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAKzLxFN9VHjx%2B-FPaG6i0Xbp%2BSF9pnZkKaKDBmDVyO9kG6K2tQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
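
To make the postprocessing suggestion in the reply concrete (positions where an O must be a zero), here is a minimal sketch in Python. It assumes the format from the quoted message, two upper-case letters followed by ten digits. The function name `fix_code` and both confusion tables are hypothetical and only illustrative; nothing here is part of Tesseract, and the tables would need to be built from the errors actually seen in the output.

```python
# Hypothetical position-based cleanup for OCR output of the form \A\A\d{10}.
# The confusion tables below are illustrative assumptions, not Tesseract data.
import re

# Letter-like glyphs that can only be digits in the numeric part of the code.
TO_DIGIT = {"O": "0", "o": "0", "Q": "0", "D": "0",
            "I": "1", "l": "1", "Z": "2", "S": "5", "B": "8"}
# Digit-like glyphs that can only be letters in the two-letter prefix.
TO_LETTER = {"0": "O", "1": "I", "2": "Z", "5": "S", "8": "B"}

EXPECTED = re.compile(r"[A-Z]{2}[0-9]{10}")


def fix_code(raw, n_letters=2, n_digits=10):
    """Return the corrected code, or None if it cannot be repaired safely."""
    text = "".join(raw.split())          # drop spaces and newlines from OCR output
    if len(text) != n_letters + n_digits:
        return None                      # wrong length: do not guess
    head = "".join(TO_LETTER.get(c, c) for c in text[:n_letters]).upper()
    tail = "".join(TO_DIGIT.get(c, c) for c in text[n_letters:])
    fixed = head + tail
    return fixed if EXPECTED.fullmatch(fixed) else None


if __name__ == "__main__":
    for sample in ("ABO123456789", "A8 12345678O0", "AB12345678X9"):
        print(repr(sample), "->", fix_code(sample))
```

Returning None instead of a best guess keeps unrepairable strings visible, so they can be re-read or checked by hand rather than silently accepted.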

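
The CC labeling suggested in the reply can also be sketched in a few lines. This is a generic connected-component pass in plain Python, not the implementation from the linked post and not ImageMagick's. It assumes the image is already binarized and available as a list of rows with 1 for ink and 0 for background; the function name `char_boxes` and the `min_pixels` noise threshold are hypothetical.

```python
# Generic 8-connected component labeling on a binarized image (1 = ink).
# A sketch under the assumptions stated above; real code would load pixels
# from an image file first.
from collections import deque


def char_boxes(img, min_pixels=1):
    """Return (left, top, right, bottom) boxes of ink blobs, left to right."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill one component, tracking its bounding box and size.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left, top, right, bottom, count = x0, y0, x0, y0, 0
            while queue:
                x, y = queue.popleft()
                count += 1
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if count >= min_pixels:      # skip specks smaller than the threshold
                boxes.append((left, top, right, bottom))
    return sorted(boxes)                 # tuples sort by left edge first


if __name__ == "__main__":
    demo = [
        [1, 1, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    print(char_boxes(demo))
```

Each box can then be cropped and passed to Tesseract in single char PSM (mode 10). Glyphs made of several separate blobs would come out as several boxes, which should not matter for codes made of upper-case letters and digits.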
