Re: Questions regarding Tesseract Training Process.

Ruwan Janapriya Wed, 05 Nov 2008 00:59:45 -0800

Well, I am not training it now. I created JTesseract. Before that, I was
training Tesseract for fonts with diacritics.


regards,

Janapriya.

On Wed, Nov 5, 2008 at 2:33 PM, 74yrs old <[EMAIL PROTECTED]> wrote:

> Janapriya,
> I would like to know which language you are training Tesseract for.
>
>
> On Wed, Nov 5, 2008 at 12:33 PM, Ruwan Janapriya <[EMAIL PROTECTED]> wrote:
>
>> Sriranga,
>>
>> thanks again!
>>
>> regards,
>>
>> Janapriya
>>
>>
>> On Wed, Nov 5, 2008 at 12:49 PM, 74yrs old <[EMAIL PROTECTED]> wrote:
>>
>>> Janapriya,
>>> Your presumption about maintaining the order of the *box* files as well
>>> as the *tr* files is correct.
>>> -sriranga(76yrsold)
>>>
>>> On Wed, Nov 5, 2008 at 12:08 PM, Ruwan Janapriya <[EMAIL PROTECTED]> wrote:
>>>
>>>> Ray, Thanks a lot.
>>>>
>>>> Under problem #2, I hope you meant the following about the order of the
>>>> box files.
>>>>
>>>> We should follow:
>>>>
>>>> mftraining <params> file01.box file02.box file03.box
>>>> unicharset_extractor <params> file01.box file02.box file03.box
>>>>
>>>> We should NOT do this:
>>>>
>>>> mftraining <params> file01.box file02.box file03.box
>>>> unicharset_extractor <params> *file02.box file01.box* file03.box
>>>>
>>>> regards,
>>>>
>>>> Ruwan Janapriya.
>>>>
>>>>
>>>> On Wed, Nov 5, 2008 at 12:23 PM, Ray Smith <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Problem #1: as long as the components don't touch, and the boxes don't
>>>>> overlap, the bounding boxes don't have to be accurate, but you can't
>>>>> currently use two boxes to split joined characters if I remember
>>>>> correctly. You could however paint a white strip in the image between
>>>>> the boxes to break the characters apart.
>>>>>
>>>>> Problem #2: you can delete as many boxes from the box file as you like.
>>>>> Unboxed components in the image are harmless. The only caveat is to make
>>>>> sure the tr files get to mftraining in the same order as they get to
>>>>> unicharset_extractor.
>>>>>
>>>>> Ray.
>>>>>
>>>>>
>>>>> On Tue, Nov 4, 2008 at 3:26 AM, Ruwan Janapriya <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I am curious about the following. It would be a great help if someone
>>>>>> can answer these questions.
>>>>>>
>>>>>> Let's say that I have created a box file from a TIFF image. Ideally
>>>>>> the box file should contain the bounding box of each character. But as
>>>>>> we all know, a scanned image can cause many problems.
>>>>>>
>>>>>> *Problem #1*
>>>>>> We can have a box covering two (or more) characters instead of one
>>>>>> character. As far as I know there are two options. The first option is
>>>>>> to treat this as a single character and insert the two (or more)
>>>>>> corresponding Unicode characters under that box. The second option is
>>>>>> to split the box in the way the "training" wiki suggests [1].
>>>>>>
>>>>>> Now my question is what if we modify the coordinates of the boxes as
>>>>>> we wish? Just enlarge a bit or shrink a bit (without overlapping other
>>>>>> boxes)?
>>>>>>
>>>>>> *Problem #2*
>>>>>> We can have boxes covering only *non-characters* (e.g. dark patches,
>>>>>> noise, etc.).
>>>>>>
>>>>>> Now my question is, what if we delete these boxes and proceed? What is
>>>>>> the impact? Can't we tell Tesseract that these are just
>>>>>> "non-characters"?
>>>>>>
>>>>>> [1] Let's say the diagonal corners of the box are [(TLx, TLy), (BRx,
>>>>>> BRy)], where TL is Top Left and BR is Bottom Right.
>>>>>> After splitting, the following boxes result: [(TLx, TLy), (TLx / 2
>>>>>> + BRx / 2, BRy)] and [(TLx / 2 + BRx / 2, TLy), (BRx, BRy)]
>>>>>>
>>>>>> P.S. I wrote JTesseract, a front end for the Tesseract training process.
>>>>>> Answers to these questions would greatly improve that application.
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>> --
>>>>>> *Ruwan Janapriya *
>>>>>> http://www.janapriya.net
>
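
The box-splitting arithmetic from footnote [1] of the original question can be sketched as below. This is only an illustration of that formula, not something Tesseract provides: the function name split_box is hypothetical, and integer pixel coordinates (with the midpoint rounded down) are an assumption. Ray's reply also suggests that splitting a box alone may not separate joined characters, so treat this as coordinate bookkeeping only.

```python
def split_box(tl_x, tl_y, br_x, br_y):
    """Split one box into left and right halves at its horizontal midpoint.

    Follows footnote [1]: the shared edge sits at TLx / 2 + BRx / 2.
    Assumption: coordinates are integers, so the midpoint is rounded down.
    """
    mid_x = (tl_x + br_x) // 2
    left = ((tl_x, tl_y), (mid_x, br_y))
    right = ((mid_x, tl_y), (br_x, br_y))
    return left, right


# Hypothetical box covering two joined characters.
left_half, right_half = split_box(10, 50, 30, 20)
print(left_half)
print(right_half)
```

Both halves keep the original top and bottom edges; only the x coordinate of the shared edge changes.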

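One way to keep the file order identical for both tools, as Ray advises, is to build both command lines from a single ordered list of base names. The sketch below is a minimal illustration: the helper name training_commands is hypothetical, the tool parameters shown as <params> in the thread are deliberately left out, and it assumes the usual pairing of .box files with unicharset_extractor and .tr files with mftraining.

```python
def training_commands(basenames):
    """Build argument lists for both tools from one ordered list of names.

    Assumption: each base name has a matching .box and .tr file.
    Tool parameters (<params> in the thread) are intentionally omitted.
    """
    ordered = list(basenames)  # a single source of truth for the order
    box_files = [name + ".box" for name in ordered]
    tr_files = [name + ".tr" for name in ordered]
    extractor_cmd = ["unicharset_extractor"] + box_files
    mftraining_cmd = ["mftraining"] + tr_files
    return extractor_cmd, mftraining_cmd


extractor_cmd, mftraining_cmd = training_commands(["file01", "file02", "file03"])
print(" ".join(extractor_cmd))
print(" ".join(mftraining_cmd))
```

Because both lists come from the same sequence, reordering the inputs changes both commands together, so the two tools cannot see the files in different orders.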
