If you use Linux, you can try something similar to the attached bash script.

On Thu, Oct 3, 2019 at 2:55 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:
> There is no direct method for training from non-Unicode fonts. Tesseract's
> output is also Unicode text only.
>
> You can work from scanned images of text in non-Unicode fonts and provide
> the Unicode transcription of it. You could probably use a legacy-to-Unicode
> converter for the text.
>
> See https://github.com/tesseract-ocr/tesstrain for training from single
> line images and their ground truth transcriptions.
>
> On Thu, Oct 3, 2019 at 2:27 PM isuri anuradha <isurianuradh...@gmail.com>
> wrote:
>
>> As you mentioned, Tesseract 4.0 only supports Unicode fonts.
>> What is the procedure if we want to train with non-Unicode fonts? Most
>> of the documents written in Sri Lanka are in non-Unicode fonts, and
>> there are lots of historical books available that are written in
>> non-Unicode form.

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduU%3D7e_BUWrUhzhj4uRd%3DAXXi_46ewkSefUjtu2P69pXOQ%40mail.gmail.com.
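The attached script assumes that the legacy text and its Unicode version have the same number of lines in the same order, and it gives each font linecount / fontcount lines. A minimal pre-flight check, as a sketch only: the function name check_pair is made up for illustration and is not part of the attached script or of Tesseract.

```shell
#!/bin/bash
# Sketch: confirm the legacy and Unicode texts are line-aligned and report
# how many lines each font would get. check_pair is a hypothetical helper.
check_pair() {
    local legacy=$1 unicode=$2 fontlist=$3
    local legacy_lines unicode_lines font_count
    # $(( ... )) strips any padding that some wc implementations add
    legacy_lines=$(( $(wc -l < "$legacy") ))
    unicode_lines=$(( $(wc -l < "$unicode") ))
    font_count=$(( $(wc -l < "$fontlist") ))
    if [ "$legacy_lines" -ne "$unicode_lines" ]; then
        echo "line count mismatch: $legacy_lines legacy vs $unicode_lines unicode" >&2
        return 1
    fi
    if [ "$font_count" -eq 0 ]; then
        echo "no fonts listed in $fontlist" >&2
        return 1
    fi
    # Integer division, as in the attached script; the remainder is never rendered
    echo "$(( legacy_lines / font_count )) lines per font, $(( legacy_lines % font_count )) left over"
}
```

It could be called as check_pair legacy.training_text legacy2unicode.training_text legacy-fonts.txt before running the script. It only compares line counts; whether line N of one file really transcribes line N of the other still has to be checked by eye.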
#!/bin/bash
# Hack for legacy non-Unicode fonts. All fonts in a run must use the same legacy mapping.
# Requires the legacy text and its matching Unicode version (same number and order of lines).
# Renders single-line TIFFs at one exposure level using text2image with the non-Unicode text,
# then writes WordStr box files from the Unicode text.

lang=sin
prefix=nonunicode
rm -rf ${prefix}
mkdir ${prefix}

unicodeinput=${prefix}-input-unicode.txt
unicodetext=${prefix}-train-unicode.txt
unicodefonttext=${prefix}-font-train-unicode.txt
nonunicodeinput=${prefix}-input.txt
nonunicodetext=${prefix}-train.txt
nonunicodefonttext=${prefix}-font-train.txt
nonunicodefontlist=legacy-fonts.txt   # one font name per line

cp legacy.training_text ${nonunicodeinput}
cp legacy2unicode.training_text ${unicodeinput}

fontcount=$(wc -l < "$nonunicodefontlist")
linecount=$(wc -l < "$nonunicodeinput")
perfontcount=$(( linecount / fontcount ))   # integer division; leftover lines are not used

cp ${nonunicodeinput} ${nonunicodetext}
cp ${unicodeinput} ${unicodetext}

# Start from an empty ground truth file; each font appends its Unicode lines below.
rm -f $prefix.groundtruth.txt

while IFS= read -r fontname
do
    # Take the next block of legacy lines for this font and render one image per line.
    head -n "$perfontcount" ${nonunicodetext} > ${nonunicodefonttext}
    sed -i "1,$perfontcount d" ${nonunicodetext}
    linenum=0
    while IFS= read -r nonunicodefonttextline
    do
        linenum=$((linenum + 1))
        base="$prefix/$prefix.${fontname// /_}-$linenum.exp0"
        echo "$nonunicodefonttextline" > ./tmpnonunicode.txt
        text2image --fonts_dir=/home/ubuntu/.fonts --strip_unrenderable_words --xsize=3000 --ysize=150 --leading=12 --margin=12 --char_spacing=0.0 --exposure=0 --max_pages=0 --font="$fontname" --text=tmpnonunicode.txt --outputbase="$base"
        rm "$base.box"   # legacy-text box file; replaced by the WordStr box below
    done < "$nonunicodefonttext"

    # Take the matching block of Unicode lines and pair each with its image.
    head -n "$perfontcount" ${unicodetext} > ${unicodefonttext}
    sed -i "1,$perfontcount d" ${unicodetext}
    linenum=0
    while IFS= read -r unicodefonttextline
    do
        linenum=$((linenum + 1))
        base="$prefix/$prefix.${fontname// /_}-$linenum.exp0"
        echo "$unicodefonttextline" > ./tmpunicode.txt
        python3 generate_wordstr_box.py -i "$base.tif" -t ./tmpunicode.txt > "$base.box"
        tesseract "$base.tif" "$base" --oem 1 --psm 6 -l $lang --tessdata-dir ~/tessdata_fast --dpi 300 lstm.train
        rm "$base.txt"
    done < "$unicodefonttext"
    cat ${unicodefonttext} >> $prefix.groundtruth.txt
done < "$nonunicodefontlist"

unicharset_extractor --output_unicharset $prefix.my.unicharset --norm_mode 2 $prefix.groundtruth.txt
find $prefix -type f -name '*.lstmf' > $prefix.all-lstmf
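The script ends by writing every .lstmf path into $prefix.all-lstmf. Training usually needs separate training and evaluation lists, so one possible follow-up is to split that file. This is a sketch under assumptions: the function name split_lstmf_list, the output file names and the one-in-ten ratio are illustrative choices, not something the script or tesstrain prescribes.

```shell
#!/bin/bash
# Sketch: send every Nth line of a list file to an evaluation list and the
# rest to a training list. split_lstmf_list is a hypothetical helper.
split_lstmf_list() {
    local list=$1 train_out=$2 eval_out=$3 every=${4:-10}
    # Create both outputs up front so they exist even if one ends up empty
    : > "$train_out"
    : > "$eval_out"
    awk -v every="$every" -v train="$train_out" -v evalf="$eval_out" \
        'NR % every == 0 { print > evalf; next } { print > train }' "$list"
}
```

For example, split_lstmf_list nonunicode.all-lstmf nonunicode.list.train nonunicode.list.eval would keep roughly 90% of the lines for training. The two files could then be passed to lstmtraining through --train_listfile and --eval_listfile. Taking every tenth line spreads the evaluation samples across the list, so they are less likely to come from a single font than with a plain head/tail split.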