mis-decoding a single line of text

khoshteep Tue, 27 Jul 2010 11:49:30 -0700

Hi everyone,

I am trying to decode a single line of text that is a bit noisy. A
link to the uploaded image is below. The text is "MAX665," but what
I'm getting back is "THAI 6 8-51-".


http://tesseract-ocr.googlegroups.com/web/row1.bmp?gda=iO7ypToAAABJaCRJGWfX_qPCIQ7C4NkPrfoTOVd7wlGlVfd1g07AArmU4fy-mX2UP_udoPbSXxr97daDQaep90o7AOpSKHW0


I'm using version 2.04 and the default eng language. I have looked at
the thresholded image; it looks pretty good and is similar to the
source image.

recog_all_words() in control.cpp tries to decode each word. Inside
classify_word_pass1(), raw_choice for the first word is "TMAX" before
chopping, but after improve_by_chopping() and word_associator() it
changes to "T|4A1", and the best_choice string for the word is "MAJ".

After classify_word_pass2(), raw_choice is "THAKX" and best_choice is
"THAI". The final output string uses best_choice.

It seems like Tesseract is designed for word recognition and not
character recognition. If there is a sequence of characters that does
not make up a meaningful word, it messes up. I'm trying to find some
magic variables, if there are any, that disable the word recognition
part and do pure OCR. If anyone can give me some pointers, I'd
appreciate it.
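
To make the suspected effect concrete, here is a toy sketch of how a
dictionary bias could turn a near-correct raw reading into a real but
wrong word. This is not Tesseract's code: the function name, the cost
values, and the penalty factor are all made up for illustration, and
only the strings "TMAX" and "THAI" come from the debugging output
above.

```python
# Toy illustration only. This is NOT how Tesseract is implemented;
# every number and name below is a hypothetical stand-in.

def pick_best_choice(candidates, dictionary, non_word_penalty):
    """Return the candidate text with the lowest adjusted cost.

    candidates: list of (text, classifier_cost) pairs, lower is better.
    dictionary: set of known words.
    non_word_penalty: factor applied to the cost of any text that is
        not in the dictionary (1.0 means no dictionary bias).
    """
    def adjusted_cost(item):
        text, cost = item
        return cost if text in dictionary else cost * non_word_penalty

    return min(candidates, key=adjusted_cost)[0]


# Made-up costs: the classifier slightly prefers the non-word "TMAX".
candidates = [("TMAX", 1.0), ("THAI", 1.4)]
dictionary = {"THAI"}

with_bias = pick_best_choice(candidates, dictionary, non_word_penalty=2.0)
without_bias = pick_best_choice(candidates, dictionary, non_word_penalty=1.0)

print(with_bias)     # dictionary word wins once non-words are penalized
print(without_bias)  # raw classifier preference wins with no penalty
```

If something like this is what happens between raw_choice and
best_choice, then the variables I'm after would be whatever plays the
role of the penalty or the dictionary here.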

I'm Khoshteep.


