AFAIK the textline option for box files (WordStr) has NOT been implemented.
The workaround has been to use the bounding box of the whole line for every character on that line. Ref: ocrd-train project.

Example:

च 0 0 1965 128 0
त् 0 0 1965 128 0
व 0 0 1965 128 0
ा 0 0 1965 128 0
र 0 0 1965 128 0
ि 0 0 1965 128 0
ं 0 0 1965 128 0
श 0 0 1965 128 0
त् 0 0 1965 128 0
स 0 0 1965 128 0
ह 0 0 1965 128 0
स् 0 0 1965 128 0
र 0 0 1965 128 0
ा 0 0 1965 128 0
ब् 0 0 1965 128 0
द 0 0 1965 128 0
ं 0 0 1965 128 0
  0 0 1965 128 0
व 0 0 1965 128 0
ा 0 0 1965 128 0
य् 0 0 1965 128 0
व 0 0 1965 128 0
ा 0 0 1965 128 0
ह 0 0 1965 128 0
ा 0 0 1965 128 0
र 0 0 1965 128 0
ा 0 0 1965 128 0
  0 0 1965 128 0

text2image creates tif and box files when given a training text and a font. Those box files have one bounding box per character.

Example:

d 111 4658 135 4698 0
i 137 4658 148 4698 0
f 149 4658 163 4698 0
f 163 4658 177 4698 0
e 178 4657 202 4690 0
r 204 4657 221 4689 0
e 222 4657 246 4689 0
n 248 4657 272 4689 0
t 273 4657 288 4694 0
  288 4657 299 4697 0
N 299 4657 323 4697 0
e 325 4657 349 4689 0
w 349 4657 383 4689 0
  383 4656 390 4697 0
A 390 4656 418 4697 0
r 417 4656 434 4688 0
t 435 4656 450 4693 0
i 451 4656 462 4696 0
c 464 4656 487 4688 0
l 489 4656 500 4696 0
e 502 4656 526 4688 0
s 528 4656 550 4688 0
  550 4651 561 4688 0
p 561 4651 585 4688 0
a 587 4656 610 4688 0
g 612 4649 636 4688 0
e 638 4655 662 4687 0
  662 4655 674 4696 0
2 674 4655 696 4696 0
3 699 4654 723 4696 0
  723 4654 734 4696 0
a 734 4655 757 4687 0
  757 4655 767 4695 0
T 767 4655 791 4695 0
o 791 4655 815 4687 0
  815 4653 826 4696 0
S 826 4653 851 4696 0
e 852 4654 876 4686 0
r 878 4654 895 4686 0
v 895 4654 919 4686 0
i 919 4654 930 4694 0
c 932 4654 955 4686 0
e 957 4654 981 4686 0
  981 4654 994 4686 0
~ 994 4669 1016 4680 0
~ 1020 4669 1042 4680 0
  1042 4653 1053 4685 0
a 1053 4653 1076 4685 0
  1076 4653 1087 4693 0
d 1087 4653 1111 4693 0
e 1113 4653 1137 4685 0
t 1138 4653 1153 4690 0
a 1154 4653 1177 4685 0
i 1179 4653 1190 4693 0
l 1192 4653 1203 4693 0
s 1205 4653 1227 4685 0
  1227 4653 1239 4693 0
D 1239 4653 1264 4693 0
C 1267 4651 1292 4693 0
  1292 4651 1302 4693 0
t 1302 4652 1317 4689 0
h 1318 4652 1342 4692 0
a 1344 4652 1367 4684 0
t 1368 4652 1383 4689 0
  1383 4652 1393 4692 0
d 1393 4652 1417 4692 0
o 1419 4652 1443 4684 0
n 1445 4652 1469 4684 0
' 1472 4680 1479 4692 0
t 1479 4651 1494 4689 0
  1494 4651 1504 4689 0
a 1504 4651 1527 4683 0
s 1529 4651 1551 4683 0
  1551 4651 1561 4691 0
7 1561 4651 1582 4691 0
  1582 4651 1591 4691 0
« 1591 4654 1609 4682 0
« 1610 4654 1628 4682 0
  1628 4651 1639 4691 0
D 1639 4651 1664 4691 0
a 1666 4651 1689 4683 0
t 1690 4650 1705 4688 0
e 1706 4650 1730 4682 0
: 1733 4650 1741 4676 0
  1741 4650 1751 4685 0
# 1751 4650 1781 4685 0
1 1781 4650 1799 4690 0
  1799 4650 1811 4690 0
: 1811 4650 1819 4676 0
  1819 4650 1827 4690 0
A 1827 4650 1855 4690 0
Z 1854 4650 1875 4690 0
  1875 4689 1876 4690 0
_ 110 4559 138 4561 0
_ 138 4559 166 4561 0
_ 166 4558 194 4561 0

On Wed, Jan 30, 2019 at 4:36 PM Jul ius <[email protected]> wrote:

> Still interested in an example of box files for tesseract 4...
>
> Doesn't anyone have an example for us? It would be great to see how we have
> to handle spaces in textlines.
>
> On Monday, 28 January 2019 at 15:01:49 UTC+1, Jul ius wrote:
>>
>> Hi,
>>
>> that would also be my next question. Don't we need anything like a
>> separator? Some examples would be great. The amount of information on the
>> internet is very poor as tesseract 4 is new.
>>
>> On Sunday, 27 January 2019 at 18:20:06 UTC+1, Li-Chung Chou wrote:
>>>
>>> Hi Timothy,
>>>
>>> I have the same question as Jul. Would you kindly share one 'textline'
>>> box file and the corresponding image file that you used? I assume that if
>>> I have one image containing one 'textline' reading "Thanks", then its
>>> corresponding box file will have the contents below:
>>>
>>> Thanks 10 10 500 30 0 //the 10 10 500 30 rectangle contains whole
>>> "Thanks" text?
>>>
>>> But I was wondering: if my 'textline' has a space character in it, does it
>>> still work?
>>> For example, if I have an image containing one 'textline' reading
>>> "Thank you", will its box file look like this?
>>>
>>> Thank you 10 10 800 30 0 //the 10 10 800 30 rectangle contains whole
>>> "Thank you" text?
>>>
>>> I'm not sure whether my understanding is correct. It would be highly
>>> appreciated if you could share some examples or experience with us.
>>> Thank you very, very much!
>>>
>>> Li-Chung
>>>
>>> On Friday, 25 January 2019 at 10:47:47 PM UTC+8, Timothy Snyder wrote:
>>>>
>>>> I have successfully trained Tesseract 4.0 using boxes that cover an
>>>> entire line. I was similarly confused by the mismatch between the docs and
>>>> that example. I haven't tested training with character-bounding boxes but I
>>>> can confirm that textline boxes work fine.
>>>>
>>>> On Fri, Jan 25, 2019 at 5:56 AM Jul ius <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm interested in training tesseract 4 with real data. As the
>>>>> documentation seems very poor and only covers training with font files,
>>>>> I have a general question.
>>>>>
>>>>> On:
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0
>>>>>
>>>>> It says that the boxes need to cover the whole line in tesseract 4.
>>>>>
>>>>> When looking inside the linked box file I can clearly see that every
>>>>> box covers a single character.
>>>>>
>>>>> Can anyone verify which layout for the boxes is right?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1ab1e0b0-a70a-456b-ab58-2f240a3b479f%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
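
On the question about spaces in text lines: the line-level workaround described at the top of this message can be sketched roughly in Python as below. This is only an illustration under my own assumptions, not the actual ocrd-train script. The function name line_box_entries and its parameters are hypothetical. Grouping marks by unicodedata.combining() is an assumption that seems to match the Devanagari example above (the virama stays attached to its consonant, vowel signs get their own entry). The optional tab entry as an end-of-line marker is also an assumption. A space simply becomes its own entry with the same whole-line box.

```python
import unicodedata


def line_box_entries(text, width, height, page=0, eol_marker=False):
    """Return box-file lines in which every glyph repeats the whole-line box.

    width and height are the pixel dimensions of the single-line image.
    All names here are hypothetical; this is a sketch, not a library API.
    """
    glyphs = []
    for ch in text:
        # Attach characters with a non-zero canonical combining class
        # (for example the Devanagari virama) to the preceding character.
        # Everything else, including a space, gets its own entry.
        if glyphs and unicodedata.combining(ch):
            glyphs[-1] += ch
        else:
            glyphs.append(ch)

    # Box format: symbol left bottom right top page
    entries = ["%s 0 0 %d %d %d" % (g, width, height, page) for g in glyphs]

    if eol_marker:
        # Assumption: a tab entry marks the end of the text line.
        entries.append("\t 0 0 %d %d %d" % (width, height, page))
    return entries


if __name__ == "__main__":
    # "Thank you" in an 800x30 line image, as in the question quoted above.
    for entry in line_box_entries("Thank you", 800, 30):
        print(entry)
```

For the "Thank you" case this gives nine entries, one per character including the space, each carrying the box 0 0 800 30 of the whole line, rather than a single "Thank you 10 10 800 30 0" line.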

