Hello ng, I'm hoping someone can help me work out whether Tesseract is simply unable to learn what I'm trying to teach it, or what I may be doing wrong.
I'm using the API. The documents being fed to Tesseract are .tif files. These are radiology reports, and I'm mostly interested in extracting the encounter number. The encounter number follows a format that is easy to pull out using a regular expression.

My problem is that about 25% of the time Tesseract reads the number 8 as a 3. About 15% of the time "EN16-" is read as "EN15' ". Other times zeroes come back as (), and other times the output is all crazy chars.

Using jTessBoxEditor I'm able to create an alternate .traineddata file named sfi.traineddata that I later use like this:

    IF TessBaseAPIInit3( handle, NIL, "eng+sfi" ) != 0 //abort if english traindata file can't be found locally.
    ...

But none of this helps. In the attached .tif, which I used to train Tesseract with jTessBoxEditor, the encounter number EN16-00005707 is read by Tesseract as EN15'°°0_°57°7. That is even after training on the very same document.

Can someone help?

Reinaldo.
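
To make the extraction step concrete, here is a minimal sketch in Python of the kind of regular expression described above. It is an illustration only, not the code from the post: the pattern assumes the encounter number is always "EN", two digits, a hyphen, and eight digits (inferred from the single example EN16-00005707), and the name `find_encounter_number` is hypothetical.

```python
import re

# Assumed format, inferred from the one example in the post (EN16-00005707):
# "EN", two digits, a hyphen, then eight digits.
ENCOUNTER_RE = re.compile(r"\bEN\d{2}-\d{8}\b")


def find_encounter_number(ocr_text):
    """Return the first encounter number found in the OCR text, or None."""
    match = ENCOUNTER_RE.search(ocr_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Clean OCR output: the pattern matches.
    print(find_encounter_number("Encounter: EN16-00005707 Radiology report"))
    # Garbled OCR output as quoted in the post: the strict pattern finds nothing.
    print(find_encounter_number("Encounter: EN15'°°0_°57°7"))
```

With a strict pattern like this, a misread such as EN15'°°0_°57°7 simply fails to match, which is why the recognition errors matter more than the regular expression itself.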

