Re: [tesseract-ocr] OCR failing on simple and clear text codes

Yoann Nicod Thu, 21 May 2015 00:15:09 -0700

Thank you for your time. I think I am going to implement my own 
char-by-char segmentation, which seems to be the most robust strategy.


On Wednesday, May 20, 2015 at 2:48:03 PM UTC+2, Dmitri Silaev wrote:
>
>> I can't really use predefined patterns since the code pattern and font 
>> can change over time.
>
> Consider using somewhat more flexible patterns, by means of '*'. Also, you 
> can list more than one pattern in "user-patterns". And fonts have nothing 
> to do with patterns.
>
> Implementing your own char-by-char segmentation is relatively easy even 
> with ImageMagick and shell scripts, given you receive nicely binarized and 
> cleaned source images. As far as I can see, this is indeed the case. I 
> suggest connected-component (CC) labeling. For one possible implementation, 
> see my reply here: 
> https://groups.google.com/d/msg/tesseract-ocr/STHaLGYsiCo/pYZyAG2AuMAJ
>
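
The CC labeling idea quoted above can be sketched in pure Python. This is only an illustration under assumptions that are not from the thread: the image is already binarized into rows of 0/1 with 1 meaning ink, each character is a single 8-connected blob, and the function name character_boxes is made up here. Touching or broken characters would need extra merge/split logic.

```python
from collections import deque


def character_boxes(image):
    """Return inclusive (left, top, right, bottom) boxes of 8-connected
    foreground blobs, sorted left to right.

    `image` is a list of equal-length rows containing 0 (background)
    or 1 (ink). This representation is an assumption of the sketch.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill of one blob, tracking its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        inside = 0 <= nx < width and 0 <= ny < height
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    # Tuples sort by their first element, i.e. by the left edge.
    return sorted(boxes)
```

Each returned box can then be cropped out and handed to the classifier on its own.
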
> In my experience, a problem like your #3 cannot be solved reliably by 
> parameter tweaking alone. You defeat one issue, and eventually another 
> arises. Then you waste time investigating whether it was caused by a recent 
> parameter change or is independent of it. Change back, tweak another, fight 
> a new issue. Repeat.
>
> A better way is to *force* conditions for reliable OCR: preprocessing, 
> white-/blacklists, your own segmentation using layout priors, etc.
>
> Or, at the very least, *postprocess* the OCR output. E.g. at some positions 
> your O's are definitely zeros. I know people who ended up with *thousands* 
> of such rules for Tess output in an app that allows much more diverse input 
> than yours.
>
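
A positional postprocessing rule of the kind described above might look like the sketch below. It assumes the code format given further down in the thread (two uppercase letters followed by ten digits). The confusion tables and the name fix_code are hypothetical and should be replaced with the substitutions actually observed in the OCR output.

```python
# Hypothetical confusion tables: extend them from real OCR errors.
TO_DIGIT = {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}


def fix_code(text, letters=2, digits=10):
    """Repair look-alike characters by position.

    Returns the corrected string, or None when the input cannot be
    made to fit the assumed layout of `letters` uppercase letters
    followed by `digits` digits.
    """
    text = text.strip()
    if len(text) != letters + digits:
        return None
    fixed = []
    for index, ch in enumerate(text):
        if index < letters:
            # Letter positions: map digit look-alikes back to letters.
            ch = TO_LETTER.get(ch, ch)
            valid = "A" <= ch <= "Z"
        else:
            # Digit positions: map letter look-alikes to digits.
            ch = TO_DIGIT.get(ch, ch)
            valid = ch in "0123456789"
        if not valid:
            return None
        fixed.append(ch)
    return "".join(fixed)
```

Returning None instead of guessing keeps unrepairable results visible, so they can be sent for manual review.
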
> -Dmitri
>
> On Wed, May 20, 2015 at 2:52 PM, Yoann Nicod <[email protected]> wrote:
>
>> Thanks for your reply,
>>
>> I can't really use predefined patterns since the code pattern and font 
>> can change over time.
>> I like the idea of segmenting the characters myself before giving them to 
>> tesseract one by one, but it looks time-consuming (coding it, I mean).
>> Isn't there any other suitable method? In particular to solve the 3rd 
>> issue, which I think must be easy to solve.
>>
>> On Wednesday, May 20, 2015 at 12:29:08 PM UTC+2, Dmitri Silaev wrote:
>>>
>>> One no-brainer method to try out would be turning off all dictionaries 
>>> and using your own custom "user-patterns" file. Since you mentioned "your 
>>> application" I suppose you can program. So you can take a look at the 
>>> comment preceding the read_pattern_list() declaration in "dict/trie.h" 
>>> for more details.
>>>
>>> It seems all your strings are of the same format:
>>> \A\A\d\d\d\d\d\d\d\d\d\d
>>> (Tess understands only a very limited pattern syntax).
>>>
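
To make the user-patterns suggestion concrete, here is a small sketch that produces the patterns file text and a matching command line. The first pattern is the one quoted above. The second, using `\*` as a repetition marker, is my reading of the trie.h comment and should be checked against your Tesseract version. The `--user-patterns` option and the `load_system_dawg` / `load_freq_dawg` variables exist in the Tesseract builds I know of, but whether they take effect depends on version and engine, so treat the command as an assumption. The function and file names are made up.

```python
# Patterns, one per line of the user-patterns file.
PATTERNS = [
    r"\A\A\d\d\d\d\d\d\d\d\d\d",  # format quoted in the thread
    r"\A\A\d\*",                  # assumed looser variant, see trie.h
]


def patterns_file_text(patterns):
    """Return the text to store in the user-patterns file."""
    return "\n".join(patterns) + "\n"


def tesseract_args(image_path, patterns_path):
    """Build an argument list that disables the main dictionaries and
    points Tesseract at a custom patterns file. Nothing is executed."""
    return [
        "tesseract", str(image_path), "stdout",
        "--user-patterns", str(patterns_path),
        "-c", "load_system_dawg=0",
        "-c", "load_freq_dawg=0",
    ]
```

The text would be written to the path passed as patterns_path, and the list handed to something like subprocess.run once Tesseract is installed.
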
>>> But if accuracy is critical in your app, in the long run I would 
>>> absolutely avoid using any parts of Tesseract except the char classifier. 
>>> I.e. crop every single char out of your source image and run Tess in the 
>>> single-char PSM. I think it should be easy as long as the location of 
>>> every character is quite stable across your source images. 
>>> ImageMagick/shell scripts would suffice.
>>>
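
A rough sketch of the crop-and-classify approach, assuming the characters sit in equally sized cells whose origin and size have been measured by hand (the numbers in the usage line are hypothetical). Page segmentation mode 10 is Tesseract's single-character mode. The flag is spelled `--psm` in recent releases and `-psm` in the 3.x series, so it is a parameter here. The per-position whitelist is my own addition, based on the two-letters-then-digits format from this thread.

```python
import string


def fixed_cells(x0, y0, cell_w, cell_h, count):
    """Return (left, top, right, bottom) crop boxes for `count`
    equally sized character cells laid out left to right."""
    return [
        (x0 + i * cell_w, y0, x0 + (i + 1) * cell_w, y0 + cell_h)
        for i in range(count)
    ]


def single_char_args(crop_path, position, letters=2, psm_flag="--psm"):
    """Build a Tesseract command for one cropped character.

    Positions below `letters` are restricted to uppercase letters,
    the rest to digits. Nothing is executed here.
    """
    if position < letters:
        whitelist = string.ascii_uppercase
    else:
        whitelist = string.digits
    return [
        "tesseract", str(crop_path), "stdout",
        psm_flag, "10",
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]
```

For example, fixed_cells(10, 5, 20, 30, 12) gives twelve boxes to crop with any image tool, and each crop gets its own single_char_args call.
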
>>> Best regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>
>

