If you want to train from text, you also need to specify a set of fonts, e.g.:

    ~/tesseract/src/training/tesstrain.sh \
      --fonts_dir ~/.fonts \
      --lang ara \
      --linedata_only \
      --noextract_font_properties \
      --langdata_dir ~/langdata \
      --tessdata_dir ~/tessdata \
      --fontlist "Amiri" \
        "Amiri Bold Italic" \
        "Amiri Bold" \
        "Amiri Italic" \
      --training_text ./ara.training_text \
      --workspace_dir ~/tmp/ \
      --save_box_tiff \
      --output_dir ~/tesstutorial/araeval

This will create a set of lstmf files and their list and those can be used for lstmtraining.

If you don't want to use existing traineddata, then follow instructions to train from scratch:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch

Training from scratch will take a long time - days/weeks.

On Wed, Jan 8, 2020 at 4:09 PM Ayub Rauf <[email protected]> wrote:

> Thanks it helped and I could create a multi-page tif but as you know
> tesseract 4 accept single line tif with his truth text and doesn't need box
> file, am I right? I say that i only need lstmf file not box! is that right?
> anyway I'll find a splitter and get data ready. Do you have any solution
> for that can split and rename files automatically, multi-page tif and also
> multi-line text?
> And does those two files I mean tif and truth text paired files will be
> enough for start create my language model? because when I try to training
> it says "Tesseract couldn't load any languages!
> Could not initialize tesseract."
> when I searched for making .traindata I found tesstrain.sh
> <https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain.sh>
> but don't know how to run it and work with it, so please if you can help me to
> make a new traindata because I don't wanna use existing traindata!
> Thanks
>
> On Wednesday, January 8, 2020 at 8:35:56 AM UTC+3:30, shree wrote:
>>
>> Read your textfile line by line
>> run text2image to create box/tif, similar to following.
>>
>> text2image --fonts_dir="$unicodefontdir" --text="${linetext}"
>>   --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32
>>   --margin=12 --exposure=0 --font="$fontname"
>>   --outputbase="${fontname// /_}.exp0"
>>
>> run tesseract to create lstmf files , similar to following.
>>
>> tesseract "${fontname// /_}.exp0".tif "${fontname// /_}.exp0" -l "$lang"
>>   --psm 13 --dpi 300 lstm.train
>>
>> On Wed, Jan 8, 2020 at 1:24 AM Ayub Rauf <[email protected]> wrote:
>>
>>> Hi please someone help me how to create single-line tif from texts and
>>> use them for training my model.
>>> Thanks
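
For the per-line approach quoted above (read the text file line by line, run text2image, then run tesseract with lstm.train), a rough shell sketch might look like the one below. The names make_base and make_lstmf_lines, the DRY_RUN switch and the line number in the output base are my own assumptions for illustration, not part of tesseract; the text2image and tesseract flags are the ones from the quoted commands. One caveat: as far as I know, --text in text2image expects a file name rather than the text itself, so the sketch writes each line to its own small file first.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the DRY_RUN switch are hypothetical.

# Build an output base such as "Amiri_Bold.7.exp0" (spaces become underscores).
# The line number is an assumption so that lines do not overwrite each other.
make_base() {
  printf '%s.%s.exp0\n' "$(printf '%s' "$1" | tr ' ' '_')" "$2"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi
}

# Render each non-empty line of a text file with one font and
# convert the resulting tif/box pair to an .lstmf file.
# Arguments: text file, font name, language code, fonts directory.
make_lstmf_lines() {
  local textfile="$1" fontname="$2" lang="$3" fontsdir="$4"
  local n=0 linetext base
  # Read on fd 3 so the tools cannot consume the text file via stdin.
  while IFS= read -r linetext <&3 || [ -n "$linetext" ]; do
    [ -z "$linetext" ] && continue
    n=$((n + 1))
    base="$(make_base "$fontname" "$n")"
    # text2image reads its input text from a file, so store the line first.
    if [ "${DRY_RUN:-0}" != "1" ]; then
      printf '%s\n' "$linetext" > "$base.txt"
    fi
    run text2image --fonts_dir="$fontsdir" --text="$base.txt" \
      --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 \
      --margin=12 --exposure=0 --font="$fontname" --outputbase="$base"
    # --psm 13 treats the image as a single raw text line.
    run tesseract "$base.tif" "$base" -l "$lang" --psm 13 --dpi 300 lstm.train
  done 3< "$textfile"
}
```

Something like `DRY_RUN=1 make_lstmf_lines ./ara.training_text "Amiri" ara ~/.fonts` would print the commands without running them; drop DRY_RUN to actually create the tif, box and lstmf files in the current directory.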

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduW0%3DE_OffnN3DCJAagR5d6fL9c%3DBxtEzv_KTeL_%3Df%2BnOA%40mail.gmail.com.

