If you want to train from text, you also need to specify a set of
fonts, e.g.:

~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang ara \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata \
  --fontlist "Amiri" \
  "Amiri Bold Italic" \
  "Amiri Bold" \
  "Amiri Italic" \
  --training_text ./ara.training_text \
  --workspace_dir ~/tmp/ \
  --save_box_tiff \
  --output_dir ~/tesstutorial/araeval
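
Before running the command above it may be worth confirming that every name
passed to --fontlist is visible in the fonts directory. The helper below is
only a sketch: fonts_present is a name I made up, and it assumes the font
listing arrives on stdin as one font per line, optionally prefixed with an
index such as "  0: ".

```shell
# Hypothetical helper: read a font listing on stdin and succeed only if
# every font named on the command line appears in it as a whole line.
fonts_present() {
  local listing font status=0
  # Strip the assumed "  N: " index prefix from each line.
  listing=$(sed 's/^ *[0-9][0-9]*: //')
  for font in "$@"; do
    if ! printf '%s\n' "$listing" | grep -qxF -- "$font"; then
      echo "font not found: $font" >&2
      status=1
    fi
  done
  return "$status"
}
```

It could be fed from text2image, for example
text2image --fonts_dir ~/.fonts --list_available_fonts 2>&1 | fonts_present "Amiri" "Amiri Bold"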

This creates a set of .lstmf files plus a text file listing them; pass
that list file to lstmtraining.
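
A quick sanity check on the list file can catch a missing .lstmf before a
long training run. This is a minimal sketch, not part of tesstrain.sh:
check_listfile is a hypothetical name, and it assumes the list holds one
.lstmf path per line.

```shell
# Hypothetical helper: count the entries in a training list file and
# report any listed path that does not exist on disk.
check_listfile() {
  local listfile="$1" missing=0 total=0 path
  while IFS= read -r path || [ -n "$path" ]; do
    [ -z "$path" ] && continue          # skip blank lines
    total=$((total + 1))
    if [ ! -f "$path" ]; then
      echo "missing: $path" >&2
      missing=$((missing + 1))
    fi
  done < "$listfile"
  echo "$total listed, $missing missing"
  [ "$missing" -eq 0 ]                  # non-zero exit if anything is missing
}
```

With the options used above I would expect the list at
~/tesstutorial/araeval/ara.training_files.txt, so the call would be
check_listfile ~/tesstutorial/araeval/ara.training_files.txt
(adjust the path if your output directory differs).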

If you don't want to start from an existing traineddata file, follow the
instructions for training from scratch:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch


Training from scratch takes a long time, on the order of days to weeks.
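
For orientation, a from-scratch run looks roughly like the sketch below. The
lstmtraining flags and the net_spec are taken from the English example on
that wiki page; the function name train_from_scratch, the DRY_RUN switch and
the directory layout are my assumptions, and the net_spec, learning rate and
iteration count are only starting points, not tuned values for Arabic.

```shell
# Hypothetical wrapper around lstmtraining for a from-scratch run.
#   $1 = language code, $2 = tesstrain.sh output dir, $3 = model output dir
train_from_scratch() {
  local lang="$1" data_dir="$2" out_dir="$3"
  mkdir -p "$out_dir"
  # With DRY_RUN set to a non-empty value the command is printed, not run.
  ${DRY_RUN:+echo} lstmtraining \
    --traineddata "$data_dir/$lang/$lang.traineddata" \
    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
    --model_output "$out_dir/base" \
    --learning_rate 20e-4 \
    --train_listfile "$data_dir/$lang.training_files.txt" \
    --max_iterations 5000
}
```

To preview the command without starting anything:
DRY_RUN=1 train_from_scratch ara ~/tesstutorial/araeval ~/tesstutorial/araoutput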

On Wed, Jan 8, 2020 at 4:09 PM Ayub Rauf <[email protected]> wrote:

> Thanks, that helped and I could create a multi-page tif. But as far as I
> know, Tesseract 4 accepts a single-line tif with its ground-truth text and
> doesn't need a box file, am I right? That is, I only need the lstmf file,
> not the box file? Anyway, I'll find a splitter and get the data ready. Do
> you have a way to split and rename the files automatically, for both the
> multi-page tif and the multi-line text?
> Also, will those two paired files (tif and ground-truth text) be enough to
> start creating my language model? When I try to train, it says
> "Tesseract couldn't load any languages!
> Could not initialize tesseract."
> When I searched for how to make a .traineddata file I found tesstrain.sh
> <https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain.sh>
> but I don't know how to run it or work with it. Please help me make a new
> traineddata, because I don't want to use an existing one.
> Thanks
>
>
> On Wednesday, January 8, 2020 at 8:35:56 AM UTC+3:30, shree wrote:
>>
>> Read your text file line by line and run text2image on each line to
>> create a box/tif pair, similar to the following.
>>
>> text2image --fonts_dir="$unicodefontdir" --text="${linetext}" \
>>   --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 \
>>   --margin=12 --exposure=0 --font="$fontname" \
>>   --outputbase="${fontname// /_}.exp0"
>>
>>
>> Then run tesseract to create the lstmf files, similar to the following.
>>
>> tesseract "${fontname// /_}.exp0".tif "${fontname// /_}.exp0" -l "$lang" \
>>   --psm 13 --dpi 300 lstm.train
>>
>>
>>
>> On Wed, Jan 8, 2020 at 1:24 AM Ayub Rauf <[email protected]> wrote:
>>
>>> Hi, could someone please help me create single-line tif images from text
>>> and use them to train my model?
>>> Thanks
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/47c002a2-9a79-431d-8ff5-8acce2e00941%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/47c002a2-9a79-431d-8ff5-8acce2e00941%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>


-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduW0%3DE_OffnN3DCJAagR5d6fL9c%3DBxtEzv_KTeL_%3Df%2BnOA%40mail.gmail.com.
