[tesseract-ocr] Re: limiting tesseract to one language

Tom Morris Sun, 06 Mar 2016 11:27:39 -0800

On Sunday, March 6, 2016 at 7:21:26 AM UTC-5, Bojan Djuric wrote:
>
> In language file srp_latn.traineddata (Serbian Latin) there is a line 
> tessedit_load_sublangs srp
> which means that tesseract loads the srp (Serbian Cyrillic) language file.
>
> As a result some of the text is recognized as Cyrillic, even if the 
> original text contains no Cyrillic script at all!
>
> Can this option be disabled in any way, or new language files provided 
> without the "load sublangs" part?
>


I was hoping you'd be able to override that on the command line, using -c 
tessedit_load_sublangs="", but that doesn't seem to work with the current 
order of evaluation, at least with my limited testing.

If you have the training tools installed, you can patch your copy of the 
language file by doing the following:

$ combine_tessdata -e srp_latn.traineddata srp_latn.config
$ cp /dev/null srp_latn.config
$ combine_tessdata -o srp_latn.traineddata srp_latn.config


That empties the config embedded in the language file, which removes the 
problematic line along with anything else in that config (you might want to 
copy srp_latn to srp_latn_only or some other name first if you'd like both 
behaviors available to you).
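
As a rough sketch of that "keep both" variant: the functions below copy the 
language file first and then patch only the copy. Instead of emptying the 
config with cp /dev/null, this version uses sed to delete just the 
tessedit_load_sublangs line, so any other settings in the config survive. 
The name srp_latn_only and both function names are only illustrative, and 
the sketch assumes combine_tessdata is on your PATH and that you run it from 
your tessdata directory. I have not verified it against every tesseract 
version.

```shell
#!/bin/sh
# Sketch only: build a patched copy of srp_latn that does not pull in srp.
# Assumes combine_tessdata (training tools) is installed and on PATH.

# Delete only the tessedit_load_sublangs line from an extracted config file,
# leaving every other line untouched.
# $1: path to the extracted .config file
strip_sublangs() {
    sed '/^[[:space:]]*tessedit_load_sublangs/d' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Copy srp_latn to srp_latn_only (hypothetical name) and patch the copy,
# so the original language file keeps its current behavior.
make_latin_only() {
    cp srp_latn.traineddata srp_latn_only.traineddata &&
    combine_tessdata -e srp_latn_only.traineddata srp_latn_only.config &&
    strip_sublangs srp_latn_only.config &&
    combine_tessdata -o srp_latn_only.traineddata srp_latn_only.config
}

# Usage, from the tessdata directory:
#   make_latin_only
#   tesseract page.png out -l srp_latn_only
```

After running make_latin_only you would select the patched copy with 
-l srp_latn_only and keep -l srp_latn for the original behavior.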


Tom

