Correct me if I am wrong, but shouldn't each character be bounded by its own box? Try opening this in JTessBoxEditor (http://vietocr.sourceforge.net/training.html).
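In case it helps, here is a rough Python sketch for checking this without opening an editor. It assumes the usual box file layout of one entry per line, `glyph left bottom right top page`, and simply counts how many non-whitespace glyphs share exactly the same rectangle. The function names `parse_box_line` and `shared_boxes` are made up for this example and are not part of Tesseract or JTessBoxEditor.

```python
from collections import Counter


def parse_box_line(line):
    """Split one box-file line into (glyph, (left, bottom, right, top), page).

    Assumes the layout 'glyph left bottom right top page'. Splitting from the
    right keeps a glyph that is itself a space or a tab intact.
    """
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) == 5:
        # The glyph column was lost (e.g. whitespace stripped in transit).
        parts = [""] + parts
    if len(parts) != 6:
        raise ValueError(f"not a box line: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return parts[0], (left, bottom, right, top), page


def shared_boxes(lines):
    """Return {(page, box): count} for rectangles used by more than one glyph.

    Whitespace entries (space, tab, or a missing glyph) are ignored.
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        glyph, box, page = parse_box_line(line)
        if glyph.strip():
            counts[(page, box)] += 1
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    sample = [
        "D 1107 191 1167 209 0",
        "a 1107 191 1167 209 0",
        "  1107 191 1167 209 0",  # a space glyph sharing the word box
        "m 321 256 402 322 0",
    ]
    for (page, box), n in shared_boxes(sample).items():
        print(f"page {page}: {n} glyphs share box {box}")
```

Running it over the box file from the original message would list every word rectangle that several characters share, which is what I am asking about above.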

On Thursday, August 23, 2018 at 12:33:07 PM UTC+1, [email protected] wrote:
> I want to train tesseract 4 using images and ground truth text. I have
> generated the BOX file in for a page in the below format.
>
> D 1107 191 1167 209 0
> a 1107 191 1167 209 0
> t 1107 191 1167 209 0
> e 1107 191 1167 209 0
> : 1107 191 1167 209 0
> 1107 191 1167 209 0
> 2 1202 192 1294 209 0
> 0 1202 192 1294 209 0
> 1 1202 192 1294 209 0
> 8 1202 192 1294 209 0
> - 1202 192 1294 209 0
> 1 1202 192 1294 209 0
> - 1202 192 1294 209 0
> 3 1202 192 1294 209 0
> 1294 209 1295 210 0
> W 157 237 313 323 0
> a 157 237 313 323 0
> l 157 237 313 323 0
> 157 237 313 323 0
> m 321 256 402 322 0
> 321 256 402 322 0
> a 406 256 454 323 0
> 406 256 454 323 0
> r 460 237 525 323 0
> t 460 237 525 323 0
> 460 237 525 323 0
> e 967 261 1041 280 0
> - 967 261 1041 280 0
> S 967 261 1041 280 0
> D 967 261 1041 280 0
> R 967 261 1041 280 0
> 967 261 1041 280 0
> s 1049 261 1113 281 0
> e 1049 261 1113 281 0
> r 1049 261 1113 281 0
> i 1049 261 1113 281 0
> a 1049 261 1113 281 0
> l 1049 261 1113 281 0
> 1049 261 1113 281 0
> n 1123 267 1167 281 0
> o 1123 267 1167 281 0
> . 1123 267 1167 281 0
> : 1123 267 1167 281 0
> 1123 267 1167 281 0
> 1203 263 1372 281 0
> C 1203 263 1372 281 0
> A 1203 263 1372 281 0
> 1 1203 263 1372 281 0
> 8 1203 263 1372 281 0
> 0 1203 263 1372 281 0
> 1 1203 263 1372 281 0
> 0 1203 263 1372 281 0
> 3 1203 263 1372 281 0
> 0 1203 263 1372 281 0
> 6 1203 263 1372 281 0
> 2 1203 263 1372 281 0
> 2 1203 263 1372 281 0
> 3 1203 263 1372 281 0
> 1372 281 1373 282 0
>
> where i added the word coordinates for every letter as DATE and Break the
> line using *\t.*
>
> Here is an example of tif and box file. The problem that I have CTC
> compute failure and also when I try to generate BOX file from Tesseract i
> have the same issue.
>
> How to make a valid BOX FILE for a Page.

