Hello ng;

I'm hoping someone can help me figure out whether Tesseract is simply unable 
to learn what I'm trying to teach it, or what I may be doing wrong.

I'm using the API.  The documents being fed to Tesseract are .tif files.  
These are radiology reports.  I'm mostly interested in extracting the 
encounter number.  The encounter number follows a format that is easy to 
pull out using a regular expression.
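
To illustrate the extraction step, here is a minimal Python sketch.  It 
assumes the format is always "EN", two digits, a hyphen, and eight digits, 
which is inferred only from the sample number EN16-00005707; the pattern and 
the function name are hypothetical and not taken from the actual application.

```python
import re

# Assumed format: "EN", two digits, a hyphen, eight digits (e.g. EN16-00005707).
# This is inferred from a single sample and may not cover every report.
ENCOUNTER_RE = re.compile(r"\bEN\d{2}-\d{8}\b")


def find_encounter_numbers(ocr_text):
    """Return every substring of ocr_text that matches the assumed format."""
    return ENCOUNTER_RE.findall(ocr_text)


# Clean OCR output: the encounter number is found.
print(find_encounter_numbers("Encounter: EN16-00005707"))

# Misread OCR output: nothing matches, so the number is lost.
print(find_encounter_numbers("Encounter: EN15'°°0_°57°7"))
```

A strict pattern like this only works when the OCR output is clean, which is 
why the misreads are a problem.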

My problem is that about 25% of the time Tesseract translates the number 8 
as a 3.  About 15% of the time "EN16-" is translated as "EN15' ".  Other 
times zeroes come back as (), and sometimes the output is all crazy chars.

Using jTessBoxEditor I'm able to create an alternate .traineddata file 
named sfi.traineddata that I later use like this:

   IF TessBaseAPIInit3( handle, NIL, "eng+sfi" ) != 0     //abort if english traindata file can't be found locally.
...

But none of this helps.  In the attached .tif, which I used to train 
Tesseract with jTessBoxEditor, the encounter number EN16-00005707 is 
translated by Tesseract as EN15'°°0_°57°7.  That is even after training 
on the very same document.

Can someone help?


Reinaldo.
