Thank you for your time. I think I am going to implement my own char-by-char segmentation, which seems to be the more robust strategy.
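
For reference, here is a minimal sketch of the connected-component (CC) labeling idea suggested in the thread below. It assumes the image has already been binarized into a 2D grid of 0/1 values. The function name `char_boxes` and the `min_area` speck threshold are placeholders of my own, not part of Tesseract or ImageMagick. Each returned box would then be cropped and given to Tesseract in its single-character page segmentation mode.

```python
from collections import deque


def char_boxes(grid, min_area=2):
    """Label 8-connected foreground blobs and return their bounding boxes.

    grid: list of rows, 1 = ink, 0 = background (assumed already binarized).
    Returns inclusive (left, top, right, bottom) tuples sorted left to right.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not grid[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            left, top, right, bottom, area = x0, y0, x0, y0, 0
            while queue:
                x, y = queue.popleft()
                area += 1
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if area >= min_area:  # drop isolated specks
                boxes.append((left, top, right, bottom))
    return sorted(boxes)


if __name__ == "__main__":
    demo = [
        [1, 1, 0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0, 1, 0, 1],
    ]
    for box in char_boxes(demo):
        print(box)
```

This only works as-is when every character is a single blob, which should hold for upper-case letters and digits on clean images. Touching or broken glyphs would need an extra merge/split step based on the expected character width.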

On Wednesday, May 20, 2015 at 2:48:03 PM UTC+2, Dmitri Silaev wrote:
>
>> I can't really use pre defined patterns since the code pattern and font
>> can change over time.
>
> Think of using a bit more flexible patterns - by means of '*'. Second, you
> can use more than one pattern in "user-patterns". And fonts have nothing to
> do with patterns.
>
> Implementing your own char-by-char segmentation is relatively easy even
> with ImageMagick and shell scripts, given you receive nicely binarized and
> cleaned source images. As far as I can see, this indeed is the case. I
> suggest CC labeling. For one possible implementation you can see my reply
> here:
> https://groups.google.com/d/msg/tesseract-ocr/STHaLGYsiCo/pYZyAG2AuMAJ
>
> From my experience, solely by parameter tweaking a problem like your #3
> cannot be solved reliably. You defeat one issue, eventually another rises.
> Then you're wasting your time to investigate if it's caused by a recent
> parameter change or it's independent. Change back, tweak another, fight a
> new issue. Repeat.
>
> A better way is to *force* conditions for reliable OCR. Preprocessing,
> white-/blacklists, own segmentation using layout priors, etc.
>
> Or, at least OCR output *postprocessing*. E.g. at some positions your O's
> are definitely zeros. I know people who ended up with *thousands* of such
> rules for Tess output in an app that allows much more diverse input than
> yours.
>
> -Dmitri
>
> On Wed, May 20, 2015 at 2:52 PM, Yoann Nicod <[email protected]> wrote:
>
>> Thanks for your reply,
>>
>> I can't really use pre defined patterns since the code pattern and font
>> can change over time.
>> I like the idea to segment the characters myself before giving it to
>> tesseract one by one, but it looks time consuming (coding it I mean).
>> Isn't there any other suitable method ? In particular to solve the 3rd
>> issue, which I think must be easy to solve.
>>
>> On Wednesday, May 20, 2015 at 12:29:08 PM UTC+2, Dmitri Silaev wrote:
>>>
>>> One no-brainer method to try out would be turning off all dictionaries
>>> and using your own custom "user-patterns" file. Since you said about "your
>>> application" I suppose you can program. So you can take a look at the
>>> comment preceding read_pattern_list() declaration in "dict/trie.h" for more
>>> details.
>>>
>>> It seems all your strings are of the same format:
>>> \A\A\d\d\d\d\d\d\d\d\d\d
>>> (Tess understands very limited pattern syntax).
>>>
>>> But if accuracy is critical in your app, in the long run I would
>>> absolutely avoid using any parts of Tesseract except char classifier. I.e.
>>> crop every single char out of your source image and run Tess in the single
>>> char PSM. I think it should be easy as long as location of every
>>> character is quite stable among your source images. ImageMagick/shell
>>> scripts would suffice.
>>>
>>> Best regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
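
The postprocessing suggestion in the thread (at some positions an O is definitely a zero) could be sketched roughly as follows. This is only an illustration, assuming the layout quoted above: two upper-case letters followed by ten digits. The function name `fix_code` and both confusion tables are hypothetical and would have to be built from errors actually observed in the OCR output. The letter and digit counts are parameters because the code pattern may change over time.

```python
import string

# Hypothetical confusion tables; extend them from errors seen in real output.
TO_DIGIT = {"O": "0", "Q": "0", "D": "0", "I": "1", "l": "1",
            "Z": "2", "S": "5", "B": "8"}
TO_LETTER = {"0": "O", "1": "I", "2": "Z", "5": "S", "8": "B"}


def fix_code(raw, n_letters=2, n_digits=10):
    """Apply position-based corrections to one OCR'd code string.

    Returns the corrected code, or None if the text cannot be made to fit
    the expected layout (so the caller can flag it for manual review).
    """
    text = "".join(raw.split())  # drop spaces and newlines added by OCR
    if len(text) != n_letters + n_digits:
        return None
    fixed = []
    for i, ch in enumerate(text):
        if i < n_letters:
            # Letter position: map look-alike digits to letters.
            ch = TO_LETTER.get(ch, ch)
            ok = ch in string.ascii_uppercase
        else:
            # Digit position: map look-alike letters to digits.
            ch = TO_DIGIT.get(ch, ch)
            ok = ch in string.digits
        if not ok:
            return None
        fixed.append(ch)
    return "".join(fixed)


if __name__ == "__main__":
    print(fix_code("AB12345O7890"))
    print(fix_code("AB12345X7890"))
```

Returning None instead of guessing keeps unrecoverable strings visible, which matters when accuracy is critical.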

