Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

Shree Devi Kumar Sat, 07 Mar 2020 05:25:11 -0800

I have created an example traineddata for xsa and will upload it later today.
You can then modify it with a larger training text and run training.
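
As a rough sketch of the "larger training text" step, the helper below only
assembles a tesstrain.sh command line for generating training data; it does
not run anything and does not cover the later lstmtraining step. The fonts
directory, font name, langdata and tessdata paths are taken from the log
quoted below. The training text path and the output directory are
placeholders I made up for illustration, so adjust them to your own layout.

```python
import shlex


def tesstrain_command(lang, fonts_dir, font, training_text,
                      langdata_dir, tessdata_dir, output_dir):
    """Build (but do not run) a tesstrain.sh argument list for LSTM line data."""
    return [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",               # produce .lstmf line data only
        "--noextract_font_properties",
        "--fonts_dir", fonts_dir,
        "--fontlist", font,
        "--training_text", training_text,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    cmd = tesstrain_command(
        lang="xsa",
        fonts_dir="./sabaean_fonts/",                       # from the quoted log
        font="Sabaean",                                     # from the quoted log
        training_text="./langdata/xsa/xsa.training_text",   # hypothetical path
        langdata_dir="./langdata",                          # from the quoted log
        tessdata_dir="./tessdata",                          # from the quoted log
        output_dir="./xsa_train",                           # hypothetical path
    )
    # Print a copy-pasteable shell command (shlex.join needs Python 3.8+).
    print(shlex.join(cmd))
```

Building the list first and printing it makes it easy to check every path
before the script is actually run.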


On Sat, Mar 7, 2020, 02:58 aby tesh <[email protected]> wrote:

> I think it is most likely right-to-left. It has passed that error now,
> using eng, since I only have the traineddata for it. The other issue I am
> encountering is:
>
> === Starting training for language 'eng'
> [Sat 07 Mar 2020 12:26:06 AM EAT] /usr/bin/text2image
> --fonts_dir=./sabaean_fonts/ --ptsize 12 --font=Sabaean
> --outputbase=/tmp/fc-cache/sample_text.txt
> --text=/tmp/fc-cache/sample_text.txt --fontconfig_tmpdir=/tmp/fc-cache
> Fontconfig warning: "/tmp/fc-cache/fonts.conf", line 4: Use of ambiguous
> path in <dir> element. please add prefix="cwd" if current behavior is
> desired.
> Stripped 1 unrenderable words
> Rendered page 0 to file /tmp/fc-cache/sample_text.txt.tif
>
> === Phase I: Generating training images ===
> Rendering using Sabaean
> [Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/text2image
> --fontconfig_tmpdir=/tmp/fc-cache --fonts_dir=./sabaean_fonts/
> --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0
> --exposure=0 --outputbase=/tmp/eng-2020-03-07.lif/eng.Sabaean.exp0
> --max_pages=0 --font=Sabaean --ptsize 12
> --text=./tesslang/eng/eng.training_text
> Fontconfig warning: "/tmp/fc-cache/fonts.conf", line 4: Use of ambiguous
> path in <dir> element. please add prefix="cwd" if current behavior is
> desired.
> Stripped 2 unrenderable words
> Rendered page 0 to file /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.tif
>
> === Phase UP: Generating unicharset and unichar properties files ===
> [Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/unicharset_extractor
> --output_unicharset /tmp/eng-2020-03-07.lif/eng.unicharset --norm_mode 1
> /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.box
> Failed to read data from: /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.box
> Wrote unicharset file /tmp/eng-2020-03-07.lif/eng.unicharset
> [Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/set_unicharset_properties -U
> /tmp/eng-2020-03-07.lif/eng.unicharset -O
> /tmp/eng-2020-03-07.lif/eng.unicharset -X
> /tmp/eng-2020-03-07.lif/eng.xheights --script_dir=./langdata
> Loaded unicharset of size 3 from file
> /tmp/eng-2020-03-07.lif/eng.unicharset
> Setting unichar properties
> Setting script properties
> Failed to load script unicharset from:./langdata/Latin.unicharset
> Writing unicharset to file /tmp/eng-2020-03-07.lif/eng.unicharset
>
> === Phase E: Generating lstmf files ===
> Using TESSDATA_PREFIX=./tessdata/
> [Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/tesseract
> /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.tif
> /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0 --psm 6 lstm.train
> read_params_file: Can't open lstm.train
> Tesseract Open Source OCR Engine v4.1.1 with Leptonica
> Page 1
> ERROR: /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.lstmf does not exist or is
> not readable
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ee9d5e16-328e-480d-ab2c-4ca4de708381%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/ee9d5e16-328e-480d-ab2c-4ca4de708381%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
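
Two of the failures in the log above point at files that could not be opened:
./langdata/Latin.unicharset and the lstm.train config. A quick way to confirm
this before rerunning is a small check like the one below. It is only a
sketch: the function name is made up, and it assumes the config is expected at
configs/lstm.train under the tessdata directory, which is where a standard
4.x tessdata layout keeps it.

```python
from pathlib import Path


def missing_training_inputs(langdata_dir, tessdata_dir, script="Latin"):
    """Return the expected training input files that are not present on disk."""
    langdata = Path(langdata_dir)
    tessdata = Path(tessdata_dir)
    expected = [
        # Log: "Failed to load script unicharset from:./langdata/Latin.unicharset"
        langdata / f"{script}.unicharset",
        # Log: "read_params_file: Can't open lstm.train"
        # (assumed location: <tessdata>/configs/lstm.train)
        tessdata / "configs" / "lstm.train",
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Directories as they appear in the quoted log.
    for path in missing_training_inputs("./langdata", "./tessdata"):
        print("missing:", path)
```

If either path is reported, copying the matching files from the langdata and
tessdata repositories into those directories is the first thing to try.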

