On Mon, 12 Nov 2018 at 11:53, <[email protected]> wrote:
> That means we can label some existing images with text line boxes instead
> of individual char boxes in current tesseract 4.0? I checked the box files
> generated by the training process and found that char boxes were still
> there.
>

Yes, it is confusing. I use ocrd-train <https://github.com/OCR-D/ocrd-train>
and it generates boxes for whole lines. This is an example generated by a
small Python script from ocrd-train:

M 0 0 244 50 0
I 0 0 244 50 0
T 0 0 244 50 0
- 0 0 244 50 0
U 0 0 244 50 0
C 0 0 244 50 0
O 0 0 244 50 0
	 244 50 245 51 0

The ground truth is MIT-UCO and the image size is 244x50. The file lists each
individual character, but the box is always the full line for all of them.

I use pre-cut images containing single lines, which is why the box covers the
whole image. The same thing should work for a large image with multiple lines
(but I have never done it myself). You could try using hOCR to split the file
into lines; see here:

https://github.com/OCR-D/ocrd-train/issues/7#issuecomment-419714852

BTW, the coords look like left, top, right, bottom and not
<left> <bottom> <right> <top> as in the docs
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-training-data>:
am I missing something?

Bye

Lorenzo

>
> Thanks,
> Jun
>
> On Monday, 12 November 2018 at 5:26:48 PM UTC+8, Lorenzo Blz wrote:
>>
>> Tesseract 4.x uses lines, not chars.
>>
>> Bye
>>
>> Lorenzo
>>
>> On Mon, 12 Nov 2018 at 05:42, <[email protected]> wrote:
>>
>>> Dear All,
>>>
>>> Currently, tesseract training is based on the pair (tiff and box).
>>> It's not easy to make box file (char level) if we try to train some scanned
>>> document images not generated by programs.
>>> My question is whether we have a plan to support line level training in
>>> future? Thanks!
>>>
>>> Regards,
>>> Jun

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLygQQ6aGFE-7q2BnU5Kg7jck389DmGJ%2B4yKbESqMRCpwA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
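For reference, the line-level box layout described in this message (one entry per character, each carrying the full-line box) could be produced with something like the sketch below. This is not the actual ocrd-train script: the function name line_box is made up for illustration, the width and height are passed in by hand rather than read from the image, and the tab character that starts the final entry is an assumption based on how the example output reads.

```python
def line_box(text: str, width: int, height: int, page: int = 0) -> str:
    """Build box-file content in which every character shares the full-line box.

    Hypothetical helper, not part of ocrd-train or Tesseract.
    """
    # One entry per character, all with the same box: 0 0 <width> <height>
    rows = [f"{ch} 0 0 {width} {height} {page}" for ch in text]
    # Assumed end-of-line entry: a tab with a 1x1 box just past the line corner
    rows.append(f"\t {width} {height} {width + 1} {height + 1} {page}")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Same ground truth and image size as the example in the message
    print(line_box("MIT-UCO", 244, 50), end="")
```

For a page with several lines, the same idea would need one such group per text line with that line's own coordinates; that case is not covered here.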

