[tesseract-ocr] Advice needed on effective hexadecimal recognition

scott . hasse Sat, 28 Jun 2014 01:03:32 -0700

Hi all.  Firstly let me say I am totally blown away by Tesseract, it vastly 
exceeded my expectations for an open source OCR project.  I have an 
application (http://hackaday.io/project/1569-NSA-Away) that involves OCR of 
hexadecimal information from a computer screen using a hand held Android 
device. I've been able to use the tess-two API wrapper to successfully run 
Tesseract OCR in an Android emulator and am developing various unit tests 
to better tune my Tesseract configuration.  The data I am OCR'ing will look 
something like:

2C B7 CF 07 1F C6 62 1C 8E 53 10 B1 75 06 06 C9 01 6A 08 DA
D4 B5 F9 CF 71 0E 7A DB 04 F3 8B 2A 0D 8E EC 41 50 83 CB E4

Each pair of hex digits represents one byte of information. I can
include error correction if needed.

Steps I have taken so far (the code snippets are the tess-two Java wrapper):

1) Constrained the character whitelist to just the hex digits:

baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "0123456789ABCDEF");

This helped a lot.

2) Created a custom dictionary with only 256 words, the possible 00 to FF
hex "words". Following the instructions at:

http://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract

I used the combine_tessdata and wordlist2dawg programs to replace the
existing eng.word-dawg in an eng.traineddata file.
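
For reference, the 256-entry word list can be generated rather than typed by
hand. This is a minimal sketch: the class name HexWordList is hypothetical
(it is not part of tess-two or Tesseract), and it assumes the word list is
supplied as plain text with one word per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tess-two or Tesseract.
final class HexWordList {
    // Builds the 256 two-digit uppercase hex words "00" through "FF".
    static List<String> build() {
        List<String> words = new ArrayList<>(256);
        for (int value = 0; value <= 0xFF; value++) {
            // %02X pads to two digits and uses uppercase A-F.
            words.add(String.format("%02X", value));
        }
        return words;
    }

    public static void main(String[] args) {
        // Print one word per line (assumed input format for the word list).
        for (String word : build()) {
            System.out.println(word);
        }
    }
}
```

Redirecting the output to a text file gives a list that contains only the
two-character words, matching the uppercase whitelist from step 1.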

3) Attempted to increase the strength of dictionary matches as discussed on
the FAQ
(https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?),

both via API calls to setVariable and via a configuration file (tess-two
uses tesseract 3.0.3):

language_model_penalty_non_freq_dict_word 1
language_model_penalty_non_dict_word 1

However, I still occasionally get words that are three characters long and
not in the dictionary, e.g. "C9" will be recognized as "129". When this
happens it wreaks havoc with the base 16 decoding, as there is an odd
number of hex digits. Since I can include additional error correction
data, I'd be fine with dictionary words being hallucinated, but having
three characters returned causes a problem.
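
One possible workaround on the decoding side, independent of Tesseract, is to
split the OCR output on whitespace and treat any token that is not exactly two
hex digits as an unknown byte instead of concatenating all the digits. This is
only a sketch: the class name HexTokenCheck and the use of -1 as an "unknown"
marker are my own assumptions, and it only helps if the number of words is
preserved (i.e. "C9" becomes "129", not two separate words) and the error
correction can cope with a byte at a known position being missing.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of tess-two or Tesseract.
final class HexTokenCheck {
    // Exactly two uppercase hex digits, matching the whitelist in step 1.
    private static final Pattern HEX_BYTE = Pattern.compile("[0-9A-F]{2}");

    // Returns one entry per whitespace-separated token:
    // the byte value 0..255, or -1 when the token is malformed (e.g. "129").
    static int[] decode(String ocrText) {
        String trimmed = ocrText.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] tokens = trimmed.split("\\s+");
        int[] bytes = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            bytes[i] = HEX_BYTE.matcher(tokens[i]).matches()
                    ? Integer.parseInt(tokens[i], 16)
                    : -1; // keep the position so later bytes stay aligned
        }
        return bytes;
    }

    public static void main(String[] args) {
        // "129" stands in for a misrecognized "C9".
        for (int b : decode("06 129 01 6A")) {
            System.out.println(b);
        }
    }
}
```

Because each token keeps its position, a bad word costs one byte rather than
shifting every digit after it.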

This makes me wonder if I am properly following the instructions to
increase the strength of dictionary matches. In this case, I'd be happy to
constrain results to strictly only dictionary words.

I'm also wondering if people have advice about this use case in particular.
Would you recommend upper or lower case hex digits (lower seemed worse in
my unit testing), two spaces between words, etc.?

Thanks in advance,

Scott
