The asc traineddata does not have a wordlist or dictionary, so using eng alongside it will help with that. Also, I trained it using only a few fonts that support the whole range. If you train with the font you are actually using, you will get better results.
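As a rough illustration of combining the two traineddata files, here is a minimal Python sketch that builds the tesseract command line with the languages joined by '+'. The file names sample.png and sample_out are placeholders of my own, and the helper name tesseract_command is hypothetical, not part of tesseract.

```python
import shlex


def tesseract_command(image_path, output_base, languages):
    """Build the argv list for a tesseract run that loads several traineddata files."""
    if not languages:
        raise ValueError("at least one language is required")
    # Tesseract accepts several languages joined with '+', for example "eng+asc".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Placeholder file names; substitute your own image and output base name.
cmd = tesseract_command("sample.png", "sample_out", ["eng", "asc"])
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) on a machine where tesseract and both traineddata files are installed.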
You can use the 'combine_tessdata' command with the -u (unpack) option to extract the unicharset from the traineddata. See http://manpages.ubuntu.com/manpages/utopic/man1/combine_tessdata.1.html

Yes, use the method described at https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

If you are using the latest version from git, you can use the shell script at https://code.google.com/p/tesseract-ocr/source/browse/training/tesstrain.sh

I use jTessBoxEditor for creating box/tiff pairs, as I am not able to run text2image on Windows.

I'll upload the files I used for training and let you know. You can change the training text, fonts, dictionary, etc. to meet your needs.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Nov 14, 2014 at 1:41 AM, Ryan Dev <[email protected]> wrote:

> Wow! Awesome.
>
> That file definitely helps. It fixed a few issues, but introduced a few of
> its own, so currently I am running "eng+asc" and that is giving great
> output, and is running faster than "eng+deu".
>
> Attached is an example image and output using asc. Note that asc is
> getting the 'ü' as a 'ū', and a few other errors, that the "deu" one handles.
> But still a huge help.
>
> A BIG improvement is it got '=' correctly, when all other trained data I
> tried, including math symbols, returns it as ':' or worse. Thanks!
>
> A couple of questions, to help me learn to fish, so to speak...
> 1. How do I find/get the unicharset file? I checked the English and German
> tessdata downloads and there is nothing.
> 2. How did you go about making the asc traineddata? I think I need to dive
> into this aspect of tesseract. Do I follow these steps?
> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3. I am not
> interested in new languages, just making one that covers extended ASCII,
> and then hopefully one day the Unicode BMP (0x0000 - 0xFFFF). But not sure
> how to go about that with a huge time sink.
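For the unicharset question, the unpack step could be scripted roughly as below. This is only a sketch: the helper name unpack_command and the "unpacked" directory are my own placeholders, and it assumes combine_tessdata writes each component as the given prefix plus the component name (so a prefix ending in "asc." yields asc.unicharset), as the man page describes.

```python
from pathlib import Path


def unpack_command(traineddata, out_dir):
    """Return (argv, expected unicharset path) for 'combine_tessdata -u'."""
    traineddata = Path(traineddata)
    # "asc.traineddata" -> "asc"
    lang = traineddata.name.split(".")[0]
    # The output prefix ends with a dot, so components land as asc.unicharset, etc.
    prefix = str(Path(out_dir) / lang) + "."
    argv = ["combine_tessdata", "-u", str(traineddata), prefix]
    return argv, prefix + "unicharset"


# Placeholder paths; the output directory must exist before running the command.
argv, unicharset_path = unpack_command("asc.traineddata", "unpacked")
print(" ".join(argv))
print("unicharset expected at:", unicharset_path)
```

Running the printed command (for example via subprocess.run(argv, check=True)) requires the tesseract training tools to be installed.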
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXsoCqa0H48Mt610%2B1K8i5BMZf%2BZYXzZ8yJzPPErsJm%3Dw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

