The easiest way to see the box file layout for any language is to run
'text2image' on a training text sample of 2-3 lines.
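
If it helps when reading the generated file, here is a minimal Python sketch
for parsing box lines. It assumes the usual layout of
"symbol left bottom right top page", with coordinates measured from the
bottom-left corner of the image. The names Box and parse_box_line are only
illustrative and are not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line into its symbol and five integer fields."""
    # Split from the right so that a symbol which is itself a space
    # (as in the entries for word gaps) is preserved.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"not a box line: {line!r}")
    symbol, *numbers = parts
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


# Two lines in the style of the text2image output quoted further down.
for raw in ["d 111 4658 135 4698 0", "  288 4657 299 4697 0"]:
    print(parse_box_line(raw))
```

Splitting from the right is a deliberate choice: the space entries in the
quoted output start with the space symbol itself, so a plain split() would
drop it.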

On Sun, 3 Feb 2019, 07:42 Li-Chung Chou <[email protected]> wrote:

> Hi Shree,
>
> Thanks for your kind response! It's very clear. Actually, I'm also
> curious about languages in which a "character" may consist of multiple
> "glyphs" (I'm not sure these are the correct English terms; sorry in
> advance for my poor English). Your example also covers this part. Thank
> you so much!
>
> Best Regards,
> Li-Chung
>
> On Wednesday, 30 January 2019 at 7:48:43 PM UTC+8, shree wrote:
>>
>> also see
>>
>>
>> https://github.com/tesseract-ocr/tesseract/blob/cfa787d976007f5866ce25fbd8e2a0223fc40fda/src/ccstruct/boxread.cpp#L165
>>
>>
>> https://github.com/tesseract-ocr/tesseract/blob/3ac33d59aeb93fc9dab13874a64ab0b73690d5eb/src/ccmain/applybox.cpp#L36
>>
>> On Wed, Jan 30, 2019 at 5:15 PM Shree Devi Kumar <[email protected]>
>> wrote:
>>
>>> AFAIK the textline option for box files (WordStr) has NOT been
>>> implemented.
>>>
>>> The workaround has been to use the bounding box of the whole line for
>>> every character on that line. Ref: ocrd-train project
>>>
>>> Example:
>>>
>>> च 0 0 1965 128 0
>>> त् 0 0 1965 128 0
>>> व 0 0 1965 128 0
>>> ा 0 0 1965 128 0
>>> र 0 0 1965 128 0
>>> ि 0 0 1965 128 0
>>> ं 0 0 1965 128 0
>>> श 0 0 1965 128 0
>>> त् 0 0 1965 128 0
>>> स 0 0 1965 128 0
>>> ह 0 0 1965 128 0
>>> स् 0 0 1965 128 0
>>> र 0 0 1965 128 0
>>> ा 0 0 1965 128 0
>>> ब् 0 0 1965 128 0
>>> द 0 0 1965 128 0
>>> ं 0 0 1965 128 0
>>>   0 0 1965 128 0
>>> व 0 0 1965 128 0
>>> ा 0 0 1965 128 0
>>> य् 0 0 1965 128 0
>>> व 0 0 1965 128 0
>>> ा 0 0 1965 128 0
>>> ह 0 0 1965 128 0
>>> ा 0 0 1965 128 0
>>> र 0 0 1965 128 0
>>> ा 0 0 1965 128 0
>>>   0 0 1965 128 0
>>>
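
A rough Python sketch of how entries like the ones above could be generated
for a single line image. The grouping rule (a mark with a non-zero Unicode
combining class, such as the virama, is attached to the preceding symbol) is
my assumption, inferred from the example; line_level_boxes is a hypothetical
helper, and width and height are assumed to be the pixel size of the line
image.

```python
import unicodedata
from typing import List


def line_level_boxes(text: str, width: int, height: int) -> List[str]:
    """Return one box entry per symbol, all sharing the whole-line rectangle."""
    symbols: List[str] = []
    for ch in unicodedata.normalize("NFC", text):
        # Assumption: marks with a non-zero combining class join the
        # previous symbol; every other code point starts a new entry.
        if symbols and unicodedata.combining(ch):
            symbols[-1] += ch
        else:
            symbols.append(ch)
    # Every entry uses the same rectangle: 0 0 width height, page 0.
    return [f"{s} 0 0 {width} {height} 0" for s in symbols]


# First few symbols of the example line, with its 1965 x 128 rectangle.
for entry in line_level_boxes("\u091a\u0924\u094d\u0935\u093e", 1965, 128):
    print(entry)
```

A space in the text simply becomes its own entry with the same rectangle,
like the space entries in the example.
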
>>> text2image creates tif and box files when given a training text and a
>>> font. Those box files have one bounding box per character.
>>>
>>> Example:
>>>
>>> d 111 4658 135 4698 0
>>> i 137 4658 148 4698 0
>>> f 149 4658 163 4698 0
>>> f 163 4658 177 4698 0
>>> e 178 4657 202 4690 0
>>> r 204 4657 221 4689 0
>>> e 222 4657 246 4689 0
>>> n 248 4657 272 4689 0
>>> t 273 4657 288 4694 0
>>>   288 4657 299 4697 0
>>> N 299 4657 323 4697 0
>>> e 325 4657 349 4689 0
>>> w 349 4657 383 4689 0
>>>   383 4656 390 4697 0
>>> A 390 4656 418 4697 0
>>> r 417 4656 434 4688 0
>>> t 435 4656 450 4693 0
>>> i 451 4656 462 4696 0
>>> c 464 4656 487 4688 0
>>> l 489 4656 500 4696 0
>>> e 502 4656 526 4688 0
>>> s 528 4656 550 4688 0
>>>   550 4651 561 4688 0
>>> p 561 4651 585 4688 0
>>> a 587 4656 610 4688 0
>>> g 612 4649 636 4688 0
>>> e 638 4655 662 4687 0
>>>   662 4655 674 4696 0
>>> 2 674 4655 696 4696 0
>>> 3 699 4654 723 4696 0
>>>   723 4654 734 4696 0
>>> a 734 4655 757 4687 0
>>>   757 4655 767 4695 0
>>> T 767 4655 791 4695 0
>>> o 791 4655 815 4687 0
>>>   815 4653 826 4696 0
>>> S 826 4653 851 4696 0
>>> e 852 4654 876 4686 0
>>> r 878 4654 895 4686 0
>>> v 895 4654 919 4686 0
>>> i 919 4654 930 4694 0
>>> c 932 4654 955 4686 0
>>> e 957 4654 981 4686 0
>>>   981 4654 994 4686 0
>>> ~ 994 4669 1016 4680 0
>>> ~ 1020 4669 1042 4680 0
>>>   1042 4653 1053 4685 0
>>> a 1053 4653 1076 4685 0
>>>   1076 4653 1087 4693 0
>>> d 1087 4653 1111 4693 0
>>> e 1113 4653 1137 4685 0
>>> t 1138 4653 1153 4690 0
>>> a 1154 4653 1177 4685 0
>>> i 1179 4653 1190 4693 0
>>> l 1192 4653 1203 4693 0
>>> s 1205 4653 1227 4685 0
>>>   1227 4653 1239 4693 0
>>> D 1239 4653 1264 4693 0
>>> C 1267 4651 1292 4693 0
>>>   1292 4651 1302 4693 0
>>> t 1302 4652 1317 4689 0
>>> h 1318 4652 1342 4692 0
>>> a 1344 4652 1367 4684 0
>>> t 1368 4652 1383 4689 0
>>>   1383 4652 1393 4692 0
>>> d 1393 4652 1417 4692 0
>>> o 1419 4652 1443 4684 0
>>> n 1445 4652 1469 4684 0
>>> ' 1472 4680 1479 4692 0
>>> t 1479 4651 1494 4689 0
>>>   1494 4651 1504 4689 0
>>> a 1504 4651 1527 4683 0
>>> s 1529 4651 1551 4683 0
>>>   1551 4651 1561 4691 0
>>> 7 1561 4651 1582 4691 0
>>>   1582 4651 1591 4691 0
>>> « 1591 4654 1609 4682 0
>>> « 1610 4654 1628 4682 0
>>>   1628 4651 1639 4691 0
>>> D 1639 4651 1664 4691 0
>>> a 1666 4651 1689 4683 0
>>> t 1690 4650 1705 4688 0
>>> e 1706 4650 1730 4682 0
>>> : 1733 4650 1741 4676 0
>>>   1741 4650 1751 4685 0
>>> # 1751 4650 1781 4685 0
>>> 1 1781 4650 1799 4690 0
>>>   1799 4650 1811 4690 0
>>> : 1811 4650 1819 4676 0
>>>   1819 4650 1827 4690 0
>>> A 1827 4650 1855 4690 0
>>> Z 1854 4650 1875 4690 0
>>> 1875 4689 1876 4690 0
>>> _ 110 4559 138 4561 0
>>> _ 138 4559 166 4561 0
>>> _ 166 4558 194 4561 0
>>>
>>> On Wed, Jan 30, 2019 at 4:36 PM Jul ius <[email protected]> wrote:
>>>
>>>> Still interested in an example of box files for tesseract 4...
>>>>
>>>> Doesn't anyone have an example for us? It would be great to see how we
>>>> have to handle spaces in textlines.
>>>>
>>>>
>>>>
>>>> On Monday, 28 January 2019 at 15:01:49 UTC+1, Jul ius wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> that would also be my next question. Don't we need something like a
>>>>> separator? Some examples would be great. There is very little
>>>>> information on the internet, as tesseract 4 is new.
>>>>>
>>>>> On Sunday, 27 January 2019 at 18:20:06 UTC+1, Li-Chung Chou wrote:
>>>>>>
>>>>>> Hi Timothy,
>>>>>>
>>>>>> I have the same question as Jul. Would you kindly share one
>>>>>> 'textline' box file and the corresponding image file you used? I
>>>>>> assume that if I have one image containing one 'textline', "Thanks",
>>>>>> then its corresponding box file will have the contents below:
>>>>>>
>>>>>> Thanks 10 10 500 30 0  //the 10 10 500 30 rectangle contains the whole
>>>>>> "Thanks" text?
>>>>>>
>>>>>> But I was wondering: if my 'textline' has a space character in it, does
>>>>>> it still work? For example, if I have an image containing one 'textline',
>>>>>> "Thank you", will its box file look like this?
>>>>>>
>>>>>> Thank you 10 10 800 30 0 //the 10 10 800 30 rectangle contains the whole
>>>>>> "Thank you" text?
>>>>>>
>>>>>> I'm not sure whether my understanding is correct. I would highly
>>>>>> appreciate it if you could share some examples or experience with us.
>>>>>> Thank you very much!
>>>>>>
>>>>>> Li-Chung
>>>>>>
>>>>>> On Friday, 25 January 2019 at 10:47:47 PM UTC+8, Timothy Snyder wrote:
>>>>>>>
>>>>>>> I have successfully trained Tesseract 4.0 using boxes that cover an
>>>>>>> entire line. I was similarly confused by the mismatch between the docs
>>>>>>> and that example. I haven't tested training with character-bounding
>>>>>>> boxes, but I can confirm that textline boxes work fine.
>>>>>>>
>>>>>>> On Fri, Jan 25, 2019 at 5:56 AM Jul ius <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm interested in training tesseract 4 with real data. As the
>>>>>>>> documentation seems very sparse and only covers training with font
>>>>>>>> files, I have a general question.
>>>>>>>>
>>>>>>>> On:
>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0
>>>>>>>>
>>>>>>>> It says that the boxes need to cover the whole line in tesseract 4.
>>>>>>>>
>>>>>>>> When looking inside the linked box file I can clearly see that
>>>>>>>> every box covers a single character.
>>>>>>>>
>>>>>>>> Can anyone verify which layout for the boxes is right?
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1ab1e0b0-a70a-456b-ab58-2f240a3b479f%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1ab1e0b0-a70a-456b-ab58-2f240a3b479f%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>
>>
>>
