Re: [tesseract-ocr] Covering ASCII Extended range.

ShreeDevi Kumar Thu, 13 Nov 2014 19:45:48 -0800

The asc traineddata does not have a wordlist or dictionary, so using eng
alongside it will help with that. Also, I trained it using only a few fonts
that support the whole range. If you train with the font you are actually
using, you will get better results.
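
As a rough sketch of the eng+asc combination mentioned in the quoted message,
the helper below builds the tesseract command line and runs it only if the
binary is installed. The function names and the file names in the usage note
are made up for illustration; "asc" is assumed to be an asc.traineddata file
sitting in your tessdata directory.

```python
import shutil
import subprocess


def tesseract_command(image, output_base, languages):
    """Build a tesseract command that loads several traineddata files."""
    # Multiple languages are joined with '+', e.g. "eng+asc".
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


def run_ocr(image, output_base, languages):
    """Run tesseract if it is on PATH; return True when the command ran."""
    if shutil.which("tesseract") is None:
        return False
    subprocess.run(tesseract_command(image, output_base, languages), check=True)
    return True
```

For example, run_ocr("page.png", "page", ["eng", "asc"]) would write the
recognized text to page.txt, with eng supplying the dictionary and asc the
extra characters.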


You can use the 'combine_tessdata' command with the -u (unpack) option to
extract the unicharset from the traineddata. See
http://manpages.ubuntu.com/manpages/utopic/man1/combine_tessdata.1.html
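
A minimal sketch of the unpack step, assuming combine_tessdata is on your
PATH. The helper names and paths are hypothetical. The second function
assumes the usual unicharset layout (first line is the entry count, then one
entry per line starting with the symbol), so treat it as a quick inspection
aid rather than a full parser.

```python
def unpack_command(traineddata, out_prefix):
    """Build the combine_tessdata call that unpacks every component.

    Components are written as <out_prefix><component>, so a prefix of
    "unpacked/eng." yields "unpacked/eng.unicharset" and so on.
    """
    return ["combine_tessdata", "-u", traineddata, out_prefix]


def unicharset_symbols(text):
    """Return the symbols listed in the contents of a unicharset file."""
    lines = text.splitlines()
    symbols = []
    for line in lines[1:]:  # skip the leading entry count
        if line.strip():
            # The symbol is the first space-separated field on the line.
            symbols.append(line.split(" ")[0])
    return symbols
```

Running the command from unpack_command("tessdata/eng.traineddata",
"unpacked/eng.") with subprocess.run and then reading
unpacked/eng.unicharset lets you check whether a character such as 'ü' is
in the set at all.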

Yes, use the method defined on
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
If using the latest version from git, you can use the shell script from
https://code.google.com/p/tesseract-ocr/source/browse/training/tesstrain.sh

I use jTessBoxEditor for creating box/tiff pairs, as I am not able to run
text2image on Windows.
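
To check which characters a box file actually covers, something like the
sketch below may help. It assumes the Tesseract 3 box format of one line per
glyph, "symbol left bottom right top page", with space-separated fields. The
names Box, parse_box_line and missing_symbols are made up for this example.

```python
from collections import namedtuple

Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one box-file line into a Box record."""
    parts = line.rstrip("\n").split(" ")
    # The last five fields are integers; whatever precedes them is the symbol.
    symbol = " ".join(parts[:-5])
    left, bottom, right, top, page = (int(p) for p in parts[-5:])
    return Box(symbol, left, bottom, right, top, page)


def missing_symbols(box_lines, wanted):
    """Return the wanted symbols that never appear in the box lines."""
    seen = {parse_box_line(line).symbol for line in box_lines if line.strip()}
    return [s for s in wanted if s not in seen]
```

Feeding it the lines of a .box file and the list of characters you care
about shows which ones still need training samples.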

I'll upload the files I used for training and let you know. You can change
the training text, fonts, dictionary, etc. to meet your needs.
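
If you want a starting point for a training text that covers the extended
range, here is one possible sketch. It assumes "extended ASCII" means
Latin-1 (U+0020 to U+00FF), which is only one of several 8-bit
interpretations, and it drops control, format and space characters since
they have no glyph to train on. The function name is hypothetical.

```python
import unicodedata


def latin1_printable():
    """List the visible characters in the range U+0020 to U+00FF."""
    chars = []
    for codepoint in range(0x20, 0x100):
        ch = chr(codepoint)
        category = unicodedata.category(ch)
        # Skip control/format (C*) and separator (Z*) characters.
        if category[0] in ("C", "Z"):
            continue
        chars.append(ch)
    return chars
```

Joining the result with spaces, and repeating the characters inside real
words, gives a text you can render in each font you plan to train with.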


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Nov 14, 2014 at 1:41 AM, Ryan Dev <[email protected]> wrote:

> Wow! Awesome.
>
> That file definitely helps. It fixed a few issues, but introduced a few of
> its own, so currently I am running "eng+asc" and that is giving great
> output, and is running faster than "eng+deu".
>
> Attached is an example image and output using asc. Note that asc is
> reading the 'ü' as a 'ū', and makes a few other errors that the "deu" one
> handles. But still a huge help.
>
> A BIG improvement is it got '=' correctly, when all other trained data I
> tried, including math symbols, returns as ':' or worse. Thanks!
>
> A couple questions, to help me learn to fish so to speak...
> 1. How do I find/get the unicharset file? I checked the english and german
> tessdata downloads and there is nothing.
> 2. How did you go about making the asc traineddata? I think I need to dive
> into this aspect of tesseract. Do I follow these steps?
> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3. I am not
> interested in new languages, just making one that covers extended ASCII,
> and then hopefully one day the Unicode BMP (0x0000 - 0xFFFF). But I am not
> sure how to go about that without a huge time sink.
>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/01a3b8e3-51af-47a1-90f8-a5c884d3ffa9%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/01a3b8e3-51af-47a1-90f8-a5c884d3ffa9%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>
