[tesseract-ocr] Tesseract 3.02 unable to learn?

Reinaldo Crespo Tue, 16 Feb 2016 23:17:16 -0800

Hello ng;

I'm hoping someone can help me figure out whether Tesseract is simply
unable to learn what I'm trying to teach it, or what I may be doing wrong.

I'm using the API. The documents being fed to Tesseract are .tif files.
These are radiology reports. I'm mostly interested in extracting the
encounter number. The encounter number follows a format that is easy to
pull out using a regular expression.
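
For illustration only, the extraction step described above might look like
the following Python sketch. The pattern is an assumption inferred from the
single sample value EN16-00005707 quoted in this message (the letters "EN",
two digits, a hyphen, then eight digits); the real format may differ, and
the names ENCOUNTER_RE and find_encounter_numbers are hypothetical.

```python
import re

# Assumed format, inferred from one sample (EN16-00005707):
# "EN", two digits, a hyphen, eight digits. Adjust to the real format.
ENCOUNTER_RE = re.compile(r"\bEN\d{2}-\d{8}\b")


def find_encounter_numbers(text):
    """Return every substring of OCR text that matches the assumed format."""
    return ENCOUNTER_RE.findall(text)


if __name__ == "__main__":
    # A cleanly recognized line yields the number.
    print(find_encounter_numbers("Encounter: EN16-00005707 Exam: CHEST"))
    # A misrecognized line yields nothing, so the failure is easy to detect.
    print(find_encounter_numbers("Encounter: EN15'°°0_°57°7"))
```

Because the pattern is strict, a misread number fails to match rather than
matching with wrong digits, so OCR failures show up as missing results.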

My problem is that about 25% of the time Tesseract reads the digit 8
as a 3. About 15% of the time "EN16-" is read as "EN15' ".
Other times zeroes come back as (), and sometimes the output is nothing
but garbage characters.

Using jTessBoxEditor I'm able to create an alternate .traineddata file
named sfi.traineddata that I later use like this:

// Abort if the English traineddata file can't be found locally.
IF TessBaseAPIInit3( handle, NIL, "eng+sfi" ) != 0
...

But none of this helps. In the attached .tif, which I used to train
Tesseract with jTessBoxEditor, the encounter number EN16-00005707 is read
by Tesseract as EN15'°°0_°57°7. That is even after training on the very
same document.

Can someone help?

Reinaldo.

