[tesseract-ocr] Can I use Tesseract dictionary to fix non-dictionary word?

Jakub Dolecki Fri, 21 Aug 2015 15:06:08 -0700

Hello everyone,

I've been searching around this group for an answer to my question, but I 
couldn't find anything satisfactory so here it goes. For the attached 
image, the OCR result is the following:

Review the Main Idea state-

ment at the beginning of this

section. List five sources that a

historian'might use to write

a history of your Iife.Then,

eValIJate them for authenticity,

*reiiability (72 confidence)*, and bias.

The command I used to run OCR is `tesseract rotated.jpeg foo -psm 1 -c
language_model_penalty_non_dict_word 1.0`.

Tesseract does a good job overall, but fails to determine that
"reiiability" should be "reliability" (among a few other words, but I'm
curious about this case in particular). Can you please explain why
Tesseract fails to find the dictionary word?

Assuming I cannot fix this discrepancy on the word-recognition level, can I
utilize the API in some way to iterate over the words and only pick
dictionary words from available choices?

Since the DAWG is a graph, is it possible to ask Tesseract for a
dictionary word that is, say, 1 or 2 characters away from the current best
candidate?
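
To make that last idea concrete, here is a minimal sketch of the kind of lookup I have in mind. It is plain Python and does not touch Tesseract or its DAWG at all: the word list and both function names are made up for illustration, and it simply scans a list and compares edit distances, which is an assumption on my part and not how Tesseract's dictionary works internally.

```python
from typing import Iterable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearby_dictionary_words(word: str,
                            dictionary: Iterable[str],
                            max_distance: int = 2) -> List[Tuple[int, str]]:
    """Return (distance, entry) pairs within max_distance, closest first."""
    lowered = word.lower()
    scored = ((edit_distance(lowered, entry.lower()), entry)
              for entry in dictionary)
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# Hypothetical word list standing in for the real dictionary.
WORDS = ["reliability", "liability", "readability", "evaluate", "life"]

print(nearby_dictionary_words("reiiability", WORDS))
# [(1, 'reliability'), (2, 'readability')]
```

For "reiiability" the closest entry is "reliability" at distance 1, which is the correction I would like to get. A linear scan like this would be slow on a full word list, which is why I am asking whether the DAWG can answer the same question directly.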

Thanks a lot for your help,

Jakub

