Dear All, I am curious about the following. It would be a great help if someone could answer these questions.
Let's say that I have created a box file from a TIFF image. Ideally the box file should contain the bounding box of each character. But as we all know, a scanned image can cause many problems.

*Problem #1* A box can cover two (or more) characters instead of one. As far as I know there are two options. The first option is to treat the box as a single character and enter the two (or more) corresponding Unicode characters for that box. The second option is to split the box in the way the "training" wiki suggests [1]. Now my question is: what if we modify the coordinates of the boxes as we wish, for example by enlarging or shrinking them a bit (without overlapping other boxes)?

*Problem #2* A box can cover only *non-characters* (e.g. dark patches, noise, etc.). Now my question is: what if we delete these boxes and proceed? What is the impact? Can't we tell Tesseract that these boxes are just "non-characters"?

[1] Let's say the diagonal coordinates of the box are [(TLx, TLy), (BRx, BRy)], where TL is Top Left and BR is Bottom Right. After splitting, the following boxes will result: [(TLx, TLy), (TLx / 2 + BRx / 2, BRy)] and [(TLx / 2 + BRx / 2, TLy), (BRx, BRy)]

P.S. I wrote JTesseract, a front end for the Tesseract training process. Answers to these questions would greatly improve that application.

regards,
--
*Ruwan Janapriya*
http://www.janapriya.net
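For concreteness, the split described in [1] could be sketched in Python roughly as follows. This is only an illustration of the midpoint formula above: the function name `split_box` and the tuple layout are hypothetical, and it assumes integer pixel coordinates given as (top-left, bottom-right) pairs. It does not read or write an actual box file.

```python
def split_box(top_left, bottom_right):
    """Split one box into a left and a right half at the horizontal midpoint.

    Boxes are ((TLx, TLy), (BRx, BRy)) tuples, as in footnote [1].
    The name and layout are illustrative assumptions, not a Tesseract API.
    """
    tl_x, tl_y = top_left
    br_x, br_y = bottom_right

    # Midpoint TLx / 2 + BRx / 2, rounded down to a whole pixel (assumption).
    mid_x = (tl_x + br_x) // 2

    left_half = ((tl_x, tl_y), (mid_x, br_y))
    right_half = ((mid_x, tl_y), (br_x, br_y))
    return left_half, right_half


if __name__ == "__main__":
    left, right = split_box((10, 40), (50, 12))
    print(left)   # ((10, 40), (30, 12))
    print(right)  # ((30, 40), (50, 12))
```

The two halves share the edge at the midpoint x coordinate and keep the original top and bottom, so the split only fits the glyphs well when the two characters have similar widths.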

