Re: [tesseract-ocr] Re: Training Sinhala fonts using Tesseract 4.0 version

Shree Devi Kumar Sat, 05 Oct 2019 10:25:18 -0700

If you use Linux, you can try something similar to the attached bash script.

On Thu, Oct 3, 2019 at 2:55 PM Shree Devi Kumar <shreesh...@gmail.com>
wrote:


> There is no direct method for training from non-Unicode fonts. Tesseract's
> output is also Unicode text only.
>
> You can work from scanned images of text in non-Unicode fonts and provide
> their Unicode transcription. You could probably use a legacy-to-Unicode
> converter for the text.
>
> See https://github.com/tesseract-ocr/tesstrain for training from
> single-line images and their ground-truth transcriptions.
>
> On Thu, Oct 3, 2019 at 2:27 PM isuri anuradha <isurianuradh...@gmail.com>
> wrote:
>
>> As you mentioned, Tesseract 4.0 only supports Unicode fonts. What is the
>> procedure if we want to train with non-Unicode fonts? Most documents
>> written in Sri Lanka use non-Unicode fonts, and there are many historical
>> books written in non-Unicode form.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
>
> --
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>




#!/bin/bash
# Hack for legacy non-Unicode fonts: all fonts in each run must use the same legacy mapping.
# Requires the legacy text and its matching Unicode version, with the same number and order of lines.
# Renders single-line TIFs at one exposure level using text2image with the non-Unicode text,
# then builds WordStr box files from the Unicode text.

lang=sin
prefix=nonunicode

rm -rf ${prefix}
mkdir ${prefix}
unicodeinput=${prefix}-input-unicode.txt
unicodetext=${prefix}-train-unicode.txt
unicodefonttext=${prefix}-font-train-unicode.txt
nonunicodeinput=${prefix}-input.txt
nonunicodetext=${prefix}-train.txt
nonunicodefonttext=${prefix}-font-train.txt

# No space after '=': with a space, bash would try to run legacy-fonts.txt as a command.
nonunicodefontlist=legacy-fonts.txt
cp  legacy.training_text  ${nonunicodeinput}
cp  legacy2unicode.training_text  ${unicodeinput}

fontcount=$(wc -l < "$nonunicodefontlist")
linecount=$(wc -l < "$nonunicodeinput")
perfontcount=$(( linecount / fontcount))
cp ${nonunicodeinput} ${nonunicodetext} 
cp ${unicodeinput} ${unicodetext} 

# Reset the ground truth once, before the loop, so every font's text is kept.
rm -f "$prefix.groundtruth.txt"

while IFS= read -r fontname
    do

       # Take the next block of legacy lines for this font and drop it from the pool.
       head -n "$perfontcount" ${nonunicodetext} > ${nonunicodefonttext}
       sed -i  "1,$perfontcount d"  ${nonunicodetext}
        linenum=0
        while IFS= read -r nonunicodefonttextline
            do
                let "linenum++"
                printf '%s\n' "$nonunicodefonttextline" > ./tmpnonunicode.txt
                text2image --fonts_dir=/home/ubuntu/.fonts  --strip_unrenderable_words --xsize=3000 --ysize=150  --leading=12 --margin=12  --char_spacing=0.0 --exposure=0  --max_pages=0 --font="$fontname" --text=tmpnonunicode.txt  --outputbase="$prefix"/"$prefix.${fontname// /_}-$linenum.exp0"
                # Discard the box file generated from the legacy text; it is rebuilt from the Unicode text below.
                rm "$prefix"/"$prefix.${fontname// /_}-$linenum.exp0".box 
            done < "$nonunicodefonttext" # linenum

       head -n "$perfontcount" ${unicodetext} > ${unicodefonttext}
       sed -i  "1,$perfontcount d"  ${unicodetext}
        linenum=0
        while IFS= read -r unicodefonttextline
            do
                let "linenum++"
                echo "$unicodefonttextline" > ./tmpunicode.txt
                python3 generate_wordstr_box.py  -i "$prefix"/"$prefix.${fontname// /_}-$linenum.exp0".tif -t  ./tmpunicode.txt >"$prefix"/"$prefix.${fontname// /_}-$linenum.exp0".box
                tesseract "$prefix"/"$prefix.${fontname// /_}-$linenum.exp0".tif "$prefix"/"$prefix.${fontname// /_}-$linenum.exp0" --oem 1 --psm 6 -l $lang --tessdata-dir ~/tessdata_fast --dpi 300 lstm.train
                rm "$prefix"/"$prefix.${fontname// /_}-$linenum.exp0".txt 
            done < "$unicodefonttext" # linenum

            cat ${unicodefonttext} >>   $prefix.groundtruth.txt

    done < "$nonunicodefontlist"

unicharset_extractor --output_unicharset   $prefix.my.unicharset --norm_mode 2   $prefix.groundtruth.txt

find "$prefix" -type f -name '*.lstmf' > "$prefix.all-lstmf"
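
The script above assumes the legacy and Unicode text files are line-for-line parallel, and it splits the lines across fonts with integer division. A pre-flight check along the following lines might catch mismatches before any images are rendered. This is only a sketch: `check_parallel_inputs` is a hypothetical helper name, not part of Tesseract or tesstrain, and the messages it prints are made up for illustration.

```shell
#!/bin/bash
# Sketch only: check_parallel_inputs is a hypothetical helper, not a Tesseract tool.
# Arguments: legacy text file, matching Unicode text file, font list (one font per line).
check_parallel_inputs() {
    local legacy=$1 unicode=$2 fontlist=$3
    local f legacy_lines unicode_lines fontcount

    # All three inputs must be readable files.
    for f in "$legacy" "$unicode" "$fontlist"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 1
        fi
    done

    # Arithmetic expansion strips any padding that wc may add around the count.
    legacy_lines=$(( $(wc -l < "$legacy") ))
    unicode_lines=$(( $(wc -l < "$unicode") ))
    fontcount=$(( $(wc -l < "$fontlist") ))

    # The two texts must pair up line by line.
    if [ "$legacy_lines" -ne "$unicode_lines" ]; then
        echo "line count mismatch: $legacy_lines legacy vs $unicode_lines unicode" >&2
        return 1
    fi

    # Each font needs at least one line, and the font list must not be empty.
    if [ "$fontcount" -eq 0 ] || [ "$legacy_lines" -lt "$fontcount" ]; then
        echo "need at least one text line per font" >&2
        return 1
    fi

    # Same integer division as the script: leftover lines are never rendered.
    echo "$(( legacy_lines / fontcount )) lines per font, $(( legacy_lines % fontcount )) unused"
}
```

With the file names used in the script, it could be called as `check_parallel_inputs legacy.training_text legacy2unicode.training_text legacy-fonts.txt` before the main loop. It only compares line counts, so it cannot tell whether the lines are actually in the same order.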
