On Mon, 12 Nov 2018 at 11:53, <[email protected]> wrote:

> Does that mean we can label existing images with text-line boxes instead
> of individual char boxes in the current tesseract 4.0? I checked the box
> files generated by the training process and found that char boxes were
> still there.
>

Yes, it is confusing. I use ocrd-train <https://github.com/OCR-D/ocrd-train>
and it generates boxes that cover whole lines.

This is an example generated by a small Python script from ocrd-train:

M 0 0 244 50 0
I 0 0 244 50 0
T 0 0 244 50 0
- 0 0 244 50 0
U 0 0 244 50 0
C 0 0 244 50 0
O 0 0 244 50 0
     244 50 245 51 0

The ground truth is MIT-UCO and the image size is 244x50. The file lists
each individual character, but the box is always the full line for all of
them.
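
To make the layout concrete, here is a minimal sketch that reproduces the listing above from a ground-truth string and an image size. It is not the ocrd-train script itself; the function name line_box is made up for illustration, and it assumes that the blank glyph on the last row is a tab character acting as an end-of-line marker.

```python
def line_box(text, width, height):
    """Build box-file rows for one text line that fills the whole image.

    Hypothetical helper, not part of ocrd-train or tesseract.
    """
    # One row per character, each with the same full-image box and page 0.
    rows = [f"{ch} 0 0 {width} {height} 0" for ch in text]
    # Assumption: the final row is a tab glyph with a 1-pixel box placed
    # just past the image corner, marking the end of the line.
    rows.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return "\n".join(rows)


print(line_box("MIT-UCO", 244, 50))
```

Note that this iterates over code points, so scripts with combining characters would need their own grouping.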

I use pre-cut images containing single lines, which is why the box covers
the whole image. The same thing should work for a large image with multiple
lines (but I have never done it myself).

You could try to use hOCR to split the file into lines; see here:
https://github.com/OCR-D/ocrd-train/issues/7#issuecomment-419714852
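
As a rough illustration of the hOCR idea (not necessarily what the linked comment does), the sketch below pulls the bounding box of every ocr_line element out of an hOCR document using only the Python standard library. The names LineBoxes and line_bboxes are made up for this example. hOCR bboxes are x0 y0 x1 y1 with the origin at the top-left corner, so each tuple could be used to crop one line image.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 36 92 580 122; ..."
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxes(HTMLParser):
    """Collect (x0, y0, x1, y1) for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "ocr_line" in classes:
            match = BBOX.search(attr.get("title") or "")
            if match:
                self.boxes.append(tuple(int(v) for v in match.groups()))


def line_bboxes(hocr_text):
    """Return the line bounding boxes found in an hOCR string."""
    parser = LineBoxes()
    parser.feed(hocr_text)
    return parser.boxes
```

Each cropped line, paired with its transcription, could then get a full-image box file like the one shown earlier.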


BTW the coords look like left, top, right, bottom, and not <left> <bottom>
<right> <top> as in the docs
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-training-data>:
am I missing something?
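
One thing that may explain the ambiguity: a box that spans the whole image reads the same under both conventions, so the example above cannot tell them apart. The sketch below assumes the box file uses a bottom-left origin, as the <left> <bottom> <right> <top> order in the docs suggests; the function name to_bottom_left_origin is made up for illustration.

```python
def to_bottom_left_origin(box, image_height):
    """Convert (left, top, right, bottom) with a top-left origin into
    (left, bottom, right, top) with a bottom-left origin.

    Hypothetical helper; assumes y grows downward in the input and
    upward in the output.
    """
    left, top, right, bottom = box
    return (left, image_height - bottom, right, image_height - top)


# A full-image box is unchanged by the flip.
print(to_bottom_left_origin((0, 0, 244, 50), 50))
# A smaller box is not.
print(to_bottom_left_origin((10, 5, 100, 20), 50))
```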


Bye

Lorenzo




>
> Thanks,
> Jun
>
> On Monday, 12 November 2018 at 17:26:48 UTC+8, Lorenzo Blz wrote:
>
>>
>> Tesseract 4.x uses lines, not chars.
>>
>>
>> Bye
>>
>> Lorenzo
>>
>> On Mon, 12 Nov 2018 at 05:42, <[email protected]> wrote:
>>
>>> Dear All,
>>>
>>>       Currently, tesseract training is based on (tiff, box) pairs.
>>> It's not easy to make a char-level box file when we try to train on
>>> scanned document images that were not generated by programs.
>>> My question is whether there is a plan to support line-level training in
>>> the future? Thanks!
>>>
>>> Regards,
>>> Jun
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/94b51a88-0b6b-4382-8551-430e5fe3841f%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/94b51a88-0b6b-4382-8551-430e5fe3841f%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>

