[tesseract-ocr] Re: Advice needed on effective hexadecimal recognition

scott . hasse Sat, 28 Jun 2014 11:35:28 -0700

I tried 0.9 for both variables, with the same result: words that are not in 
the dictionary are still returned.  I'll do some more study to see whether, 
for instance, it is always one or a few characters causing problems.  Your 
idea of using a different character would be workable.  I want to avoid bar 
codes in order to keep the data easily human-verifiable.  Using 256 short 
dictionary words is a good idea as well; it is somewhat reminiscent of a 
phonetic alphabet, which fits the spirit of the project.
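
For reference, a minimal sketch of the 256-word idea, assuming one word per 
byte value.  The word list here is a hypothetical placeholder built from 16 
made-up syllables, not real dictionary words; an actual list would be chosen 
for OCR reliability and supplied to Tesseract as its word list.  The names 
SYLLABLES, encode and decode are illustrative only.

```python
# Hypothetical placeholder vocabulary: 16 syllables, paired to give
# 16 * 16 = 256 distinct four-letter "words", one per byte value.
SYLLABLES = ["ba", "de", "fi", "go", "hu", "ka", "le", "mi",
             "no", "pu", "ra", "se", "ti", "vo", "zu", "wa"]
WORDS = [hi + lo for hi in SYLLABLES for lo in SYLLABLES]
WORD_TO_BYTE = {word: value for value, word in enumerate(WORDS)}


def encode(data):
    """Render bytes as words, separated by two spaces for printing."""
    return "  ".join(WORDS[b] for b in data)


def decode(text):
    """Map each OCR'd word back to its byte value.

    Unknown words become None so that a later error-correction step
    can treat them as erasures instead of guessing.
    """
    return [WORD_TO_BYTE.get(token.lower()) for token in text.split()]
```

Because every word maps to exactly one byte, a misread word costs one byte 
rather than shifting the whole stream, which is the failure mode with an odd 
number of hex digits.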


It still seems that for my use case, forcing dictionary words, if it 
worked, would be the preferable solution.  Are there any known defects, or 
any test cases in which configuring the documented variables actually works?

I was able to get more reliable results by putting two space characters 
between "words" and then iterating over the results word by word, replacing 
any word that is three characters long with "FF" and letting the error 
correction take care of it.  Again, though, constraining the results to 
dictionary words seems like it would be more elegant.
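
Roughly, that workaround looks like the sketch below.  It is written in 
Python for brevity rather than as the actual tess-two code, and it is 
slightly broader than what I described: any token that is not exactly two 
hex digits is replaced, not only three-character ones.  The function names 
are illustrative.

```python
HEX_DIGITS = set("0123456789ABCDEFabcdef")


def normalize_hex_tokens(ocr_text, placeholder="FF"):
    """Split OCR output on whitespace and keep only valid hex pairs.

    Any token that is not exactly two hex digits (for example "129"
    misread from "C9") is replaced by the placeholder, so the stream
    keeps one token per byte and error correction can repair it.
    """
    cleaned = []
    for token in ocr_text.split():
        if len(token) == 2 and all(ch in HEX_DIGITS for ch in token):
            cleaned.append(token.upper())
        else:
            cleaned.append(placeholder)
    return cleaned


def tokens_to_bytes(tokens):
    """Convert a list of two-digit hex tokens into raw bytes."""
    return bytes(int(token, 16) for token in tokens)
```

The resulting bytes still contain wrong values wherever the placeholder was 
substituted; the point is only that the byte count stays correct so the 
error-correction data can do its job.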

Thanks for the advice!

Scott

On Saturday, June 28, 2014 11:48:39 AM UTC-5, Tom Morris wrote:
>
> p.s.
>
> On Saturday, June 28, 2014 12:39:21 AM UTC-4, [email protected] wrote:
>>
>>
>> 3) Attempted to increase the strength of dictionary matches as discussed 
>> on the FAQ (
>> https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?),
>>  
>> both via API calls to setVariable and via a configuration file (tess-two 
>> uses tesseract 3.0.3):
>>
>> language_model_penalty_non_freq_dict_word 1
>> language_model_penalty_non_dict_word 1
>>
>> However, I still occasionally get words that are three characters long 
>> and not in the dictionary, e.g. "C9" will be recognized as "129".  When 
>> this happens it wreaks havoc on the base-16 decoding, as there is an odd 
>> number of hex digits.  Since I can include additional error correction 
>> data, I'd be fine with dictionary words being hallucinated, but having 
>> three characters returned causes a problem.
>>
>> This makes me wonder if I am properly following the instructions to 
>> increase the strength of dictionary matches.  In this case, I'd be happy to 
>> constrain results to strictly only dictionary words.
>>
>
> Since these are doubles, you might want to try 0.9 (or even 0.5) to make 
> sure that you're not running into some type of boundary condition.  I 
> haven't played with them myself, so I'm not sure how they're handled 
> internally.
>
> Tom 
>

