Thanks also from my side. I'll have a look at the jTessBoxEditor beta, try to set up training, and get back to you.
Kay

On Wednesday, February 8, 2017 at 3:52:58 PM UTC+1, shree wrote:
>
> Thanks, Quan
>
> - excuse the brevity, sent from mobile
>
> On 08-Feb-2017 7:33 PM, "Quan Nguyen" <[email protected]> wrote:
>
>> On Tuesday, February 7, 2017 at 9:34:11 AM UTC-6, shree wrote:
>>>
>>> For LSTM training, box files need to have an additional line for each
>>> text line with the tab character to indicate a new line.
>>>
>>> If you have existing box/tiff pairs, you can use a box editor (such as
>>> jtessboxeditor) and insert a box at end of each line and add a tab
>>> character in it.
>>
>> The jTessBoxEditor beta version has a new Mark EOL function that does
>> just that.
>>
>>> > On the toolbar, the Character textbox has a built-in conversion
>>> > function. If you enter U+0009 and hit Enter key or click on the adjacent
>>> > Tool icon, the escape sequences will be converted to Unicode. You can also
>>> > enter the tab character via Alt+09 numpad keys on Windows.
>>>
>>> or add a dummy sequence such as @@@ and then replace to tab character in
>>> a text editor.
>>>
>>> See attached files as a sample.
>>>
>>> Then modify tesstrain.sh to copy the box tiff pairs to the training
>>> directory before starting training
>>>
>>> mkdir -p ${TRAINING_DIR}
>>> tlog "\n=== Starting training for language '${LANG_CODE}'"
>>>
>>> cp ./*.box "${TRAINING_DIR}/"
>>> cp ./*.tif "${TRAINING_DIR}/"
>>>
>>> On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner <[email protected]>
>>> wrote:
>>>
>>>> +1 for this question. The training documentation for Tesseract 4.0 by
>>>> now only covers training with font files (synthetic materials). What is
>>>> missing is information on training with real data (i.e. manually aligned
>>>> ground truth).
>>>> Any hints on that matter are greatly appreciated.
>>>>
>>>> Cheers,
>>>> Kay
>>>>
>>>> On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1, [email protected]
>>>> wrote:
>>>>>
>>>>> I have a bunch of images, containing English words.
>>>>> I would like to generate training data by these images, and do the
>>>>> training.
>>>>> How should I do?
>>>>>
>>>>> Thanks a lot.
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
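
For anyone who goes the dummy-sequence route mentioned in the thread (enter @@@ in the box editor, then replace it with a tab), the replacement step could be scripted roughly as below. This is only a sketch under assumptions: it assumes the placeholder sits in the glyph field at the very start of a box line and is followed by a space and the coordinates, and the function name replace_eol_marker is made up for illustration. Check the result against the sample box files before training.

```shell
#!/bin/sh
# Sketch: replace a placeholder glyph ("@@@", assumed) at the start of
# box-file lines with a literal tab character, as needed for LSTM box files.
# replace_eol_marker is a hypothetical helper, not part of Tesseract.
replace_eol_marker() {
    # $1: input box file, $2: output box file
    tab=$(printf '\t')
    # Only touch lines whose glyph field is exactly the placeholder.
    sed "s/^@@@ /${tab} /" "$1" > "$2"
}

# Usage (file names are examples only):
#   replace_eol_marker page1.box page1.fixed.box
```

Writing to a separate output file keeps the original box file intact, so the edited version can be compared with it before it is copied into the training directory.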

