Ever since I started working with Tesseract, a strange phenomenon has
been keeping me awake at night.

Often, but not always, the output strings have additional letters at
the beginning and at the end. They are clearly not in the images I
fed into the engine. There is also no noise or anything else in the
images that could explain where an extra letter comes from. I use
a language I trained myself using the training images from the
download page.

Here are a few examples to illustrate the problem:
http://people.ee.ethz.ch/~bknecht/tesseract/img_01.tif - marrBsysrems
|

http://people.ee.ethz.ch/~bknecht/tesseract/img_02.tif - Sun
Microsystems (Schweiz) AG Z

http://people.ee.ethz.ch/~bknecht/tesseract/img_03.tif - Fax: N

http://people.ee.ethz.ch/~bknecht/tesseract/img_04.tif - Department of
Management, N
8

http://people.ee.ethz.ch/~bknecht/tesseract/img_05.tif - Z
Technology,and Economics Z

I do not care that "microsystems" has not been recognized correctly,
as this is a rather small image, but I do care about the pipe
character "|". As you can see, this occurs more often at the end of
the output, but also at the beginning. Mostly it's just one character,
and it is mostly separated from the rest by a space or a newline
character. In example 4 you can even see 2 characters added at the end.
The characters are mostly upper-case letters, and mostly it's a Z, N or
I, but symbols and numbers also occur sometimes. They are also not
always the same: sometimes I get an N, a Z, an 8 or a Y for the same
picture.

I have already tried to work around the problem by cleaning the output
string, simply deleting all single characters at the beginning and the
end of the output. Of course this is not always a good solution.
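
A minimal sketch of that kind of cleanup, in Python. The function name
strip_ghost_characters is hypothetical and not part of Tesseract, and
the sketch assumes the stray characters are always separated from the
real text by a space or a newline:

```python
def strip_ghost_characters(text: str) -> str:
    """Drop isolated single-character tokens from both ends of OCR output.

    Hypothetical helper for illustration only; not a Tesseract API.
    Assumes ghost characters are whitespace-separated from the real text.
    Note that all runs of whitespace (including newlines) are collapsed
    to single spaces in the result.
    """
    tokens = text.split()  # splits on spaces and newlines alike

    # Remove single-character tokens at the beginning, but never
    # delete the last remaining token.
    while len(tokens) > 1 and len(tokens[0]) == 1:
        tokens.pop(0)

    # Remove single-character tokens at the end (example 4 has two).
    while len(tokens) > 1 and len(tokens[-1]) == 1:
        tokens.pop()

    return " ".join(tokens)


if __name__ == "__main__":
    print(strip_ghost_characters("Department of Management, N\n8"))
    print(strip_ghost_characters("Z\nTechnology,and Economics Z"))
```

The weakness shows up as soon as the real text ends in a legitimate
single character: an input such as "Plan B" would be cut down to
"Plan", which is why this can only be a workaround.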

Is there a way to keep those ghost letters from appearing?