Dear All,

I am curious about the following. It would be a great help if someone could
answer these questions.

Let's say that I have created a box file from a TIFF image. Ideally, the box
file should contain the bounding box of each character. But as we all
know, a scanned image can cause many problems.

*Problem #1*
A box can cover two (or more) characters instead of one. As far as I know,
there are two options. The first option is to treat the box as a single
character and enter the two (or more) corresponding Unicode characters for
that box. The second option is to split the box the way the "training" wiki
suggests [1].

Now my question is: what if we modify the coordinates of the boxes as we
wish, for example enlarging or shrinking them a bit (without overlapping
other boxes)?
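
To make the adjustment I have in mind concrete, here is a minimal Python
sketch. It assumes each box is stored as (x_min, y_min, x_max, y_max) in
integer pixels; the function names and the "margin" parameter are my own
illustration, not anything from Tesseract. A resize is accepted only if the
new box still has positive size and does not overlap any other box.

```python
def resize(box, margin):
    """Grow (margin > 0) or shrink (margin < 0) a box on all four sides."""
    x_min, y_min, x_max, y_max = box
    return (x_min - margin, y_min - margin, x_max + margin, y_max + margin)


def overlaps(a, b):
    """Return True if two (x_min, y_min, x_max, y_max) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def resize_if_free(boxes, index, margin):
    """Resize boxes[index]; keep the original if the result is invalid."""
    candidate = resize(boxes[index], margin)
    x_min, y_min, x_max, y_max = candidate
    if x_min >= x_max or y_min >= y_max:
        return boxes[index]  # shrunk away to nothing
    for i, other in enumerate(boxes):
        if i != index and overlaps(candidate, other):
            return boxes[index]  # would overlap a neighbouring box
    return candidate


boxes = [(10, 10, 20, 30), (24, 10, 34, 30)]
print(resize_if_free(boxes, 0, 2))  # small margin: accepted
print(resize_if_free(boxes, 0, 5))  # reaches the neighbour: rejected
```

Whether Tesseract's training is sensitive to such small changes is exactly
what I would like to know.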

*Problem #2*
A box can cover something that is *not a character* (e.g. dark patches,
noise, etc.).

Now my question is: what if we delete these boxes and proceed? What is the
impact? Can't we tell Tesseract that these boxes are just
"non-characters"?
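
For illustration, deleting such boxes could look like the Python sketch
below. It assumes each box-file line has the form "glyph left bottom right
top" with integer coordinates (extra trailing fields are ignored), and it
uses a size threshold to decide what counts as noise. The threshold values
and the function name are hypothetical; in practice the user would probably
pick the boxes by hand.

```python
def drop_small_boxes(lines, min_width=3, min_height=3):
    """Return the box-file lines whose boxes are at least the given size."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            kept.append(line)  # not a box entry; leave it untouched
            continue
        left, bottom, right, top = (int(v) for v in fields[1:5])
        if right - left >= min_width and top - bottom >= min_height:
            kept.append(line)
    return kept


lines = [
    "a 10 10 22 30",
    ". 40 12 41 13",   # a 1x1 speck, treated as noise in this sketch
    "b 50 10 62 34",
]
for line in drop_small_boxes(lines):
    print(line)
```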

[1] Let's say the diagonal corners of the box are [(TLx, TLy), (BRx, BRy)],
where TL is the top-left corner and BR is the bottom-right corner.
After splitting, the following two boxes result:
[(TLx, TLy), (TLx / 2 + BRx / 2, BRy)] and
[(TLx / 2 + BRx / 2, TLy), (BRx, BRy)]
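
The split described above can be sketched in Python as follows. The
function name is my own, and rounding the midpoint down to an integer is an
assumption made here because box coordinates are whole pixels.

```python
def split_box(tl_x, tl_y, br_x, br_y):
    """Split a box into left and right halves at its horizontal midpoint."""
    mid_x = (tl_x + br_x) // 2  # integer form of TLx / 2 + BRx / 2
    left_half = ((tl_x, tl_y), (mid_x, br_y))
    right_half = ((mid_x, tl_y), (br_x, br_y))
    return left_half, right_half


left_half, right_half = split_box(10, 40, 50, 10)
print(left_half)
print(right_half)
```

Both halves keep the original top and bottom; only the shared vertical edge
at the midpoint is new.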

P.S. I wrote JTesseract, a front end for the Tesseract training process.
Answers to these questions would greatly improve that application.

regards,

--
*Ruwan Janapriya *
http://www.janapriya.net
