Well, I am not training it now. I created JTesseract. I was training Tesseract for diacritical fonts.
regards,
Janapriya.

On Wed, Nov 5, 2008 at 2:33 PM, 74yrs old <[EMAIL PROTECTED]> wrote:
> Janapriya,
> I would like to know which language you are training in Tesseract?
>
> On Wed, Nov 5, 2008 at 12:33 PM, Ruwan Janapriya <[EMAIL PROTECTED]> wrote:
>
>> Sriranga,
>>
>> thanks again!
>>
>> regards,
>>
>> Janapriya
>>
>> On Wed, Nov 5, 2008 at 12:49 PM, 74yrs old <[EMAIL PROTECTED]> wrote:
>>
>>> Janapriya,
>>> Your presumption about maintaining the order of the *box* files as
>>> well as the *tr* files is correct.
>>> -sriranga(76yrsold)
>>>
>>> On Wed, Nov 5, 2008 at 12:08 PM, Ruwan Janapriya <[EMAIL PROTECTED]> wrote:
>>>
>>>> Ray, thanks a lot.
>>>>
>>>> Under problem #2, I hope you meant the following about the order of
>>>> the box files.
>>>>
>>>> We should follow:
>>>>
>>>> mftraining <params> file01.box file02.box file03.box
>>>> unicharset_extractor <params> file01.box file02.box file03.box
>>>>
>>>> We should NOT do this:
>>>>
>>>> mftraining <params> file01.box file02.box file03.box
>>>> unicharset_extractor <params> *file02.box file01.box* file03.box
>>>>
>>>> regards,
>>>>
>>>> Ruwan Janapriya.
>>>>
>>>> On Wed, Nov 5, 2008 at 12:23 PM, Ray Smith <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Problem #1: as long as the components don't touch, and the boxes
>>>>> don't overlap, the bounding boxes don't have to be accurate, but you
>>>>> can't currently use two boxes to split joined characters, if I
>>>>> remember correctly.
>>>>> You could however paint a white strip in the image between the boxes
>>>>> to break the characters apart.
>>>>> Problem #2: you can delete as many boxes from the box file as you
>>>>> like. Unboxed components in the image are harmless. The only caveat
>>>>> is to make sure the tr files get to mftraining in the same order as
>>>>> they get to unicharset_extractor.
>>>>>
>>>>> Ray.
>>>>>
>>>>> On Tue, Nov 4, 2008 at 3:26 AM, Ruwan Janapriya <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I am curious about the following. It would be a great help if
>>>>>> someone could answer these questions.
>>>>>>
>>>>>> Let's say that I have created a box file using a TIFF image.
>>>>>> Ideally the box file should contain the bounding box of each
>>>>>> character. But as we all know, if we use a scanned image there can
>>>>>> be many problems.
>>>>>>
>>>>>> *Problem #1*
>>>>>> We can have a box covering two (or more) characters instead of one
>>>>>> character. As far as I know there are two options. The first option
>>>>>> is to treat this as a single character and insert the two (or more)
>>>>>> corresponding Unicode characters under that box. The second option
>>>>>> is to split the box in the way the "training" wiki suggested [1].
>>>>>>
>>>>>> Now my question is: what if we modify the coordinates of the boxes
>>>>>> as we wish? Just enlarge a bit or shrink a bit (without overlapping
>>>>>> other boxes)?
>>>>>>
>>>>>> *Problem #2*
>>>>>> We can have boxes covering only *non-characters* (e.g. dark
>>>>>> patches, noise, etc.).
>>>>>>
>>>>>> Now my question is: what if we delete these boxes and proceed? What
>>>>>> is the impact? Can't we tell Tesseract that these are just
>>>>>> "non-characters"?
>>>>>>
>>>>>> [1] Let's say the diagonal coordinates of the box are
>>>>>> [(TLx, TLy), (BRx, BRy)], where BR is Bottom Right and TL is Top
>>>>>> Left. After splitting, the following boxes result:
>>>>>> [(TLx, TLy), (TLx / 2 + BRx / 2, BRy)] and
>>>>>> [(TLx / 2 + BRx / 2, TLy), (BRx, BRy)]
>>>>>>
>>>>>> P.S. I wrote JTesseract - a front end for the Tesseract training
>>>>>> process. Answers to these questions would greatly improve that
>>>>>> application.
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>> --
>>>>>> *Ruwan Janapriya*
>>>>>> http://www.janapriya.net
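For anyone scripting this, here is a rough Python sketch of the two points discussed in the thread: the midpoint split from footnote [1], and keeping one file order for both tools. The function names (split_box, training_file_lists) and the (TLx, TLy, BRx, BRy) tuple layout are my own assumptions for illustration, not part of Tesseract or JTesseract. Note Ray's caveat that two boxes may not actually split joined characters, so the split only illustrates the arithmetic. No tool flags are shown, since <params> is left unspecified in the thread.

```python
from typing import List, Tuple

# Hypothetical layout following footnote [1]: (TLx, TLy, BRx, BRy).
Box = Tuple[int, int, int, int]


def split_box(box: Box) -> Tuple[Box, Box]:
    """Split one box into left and right halves at the horizontal midpoint."""
    tlx, tly, brx, bry = box
    # Integer midpoint; the footnote writes this as TLx / 2 + BRx / 2.
    mid_x = (tlx + brx) // 2
    left = (tlx, tly, mid_x, bry)
    right = (mid_x, tly, brx, bry)
    return left, right


def training_file_lists(stems: List[str]) -> Tuple[List[str], List[str]]:
    """Derive the .tr and .box lists from a single ordered list of stems.

    Building both lists from the same source keeps the order identical
    for mftraining (tr files) and unicharset_extractor (box files).
    """
    tr_files = [stem + ".tr" for stem in stems]
    box_files = [stem + ".box" for stem in stems]
    return tr_files, box_files


if __name__ == "__main__":
    print(split_box((10, 50, 30, 20)))
    print(training_file_lists(["file01", "file02", "file03"]))
```

Deriving both argument lists from one list of stems means the order cannot drift between the two commands, which is the only caveat Ray raised about deleting boxes.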

