> asc traineddata does not have a wordlist or dictionary, so using eng will
> help with that.

You mean unpack the wordlist from eng and pack it into the asc one? Or run tesseract with "eng+asc"?

Currently I run each language in complete isolation from the others and reconcile the results myself. For example, when doing OCR on a Greek-language file, I found that "eng+ell" and "ell+eng" both produce the same incorrect output. I have to run "ell" on its own to get correct results.

> If you train with the font you are using, you will get better results.

I don't have 'a font' that I'm using. My client has thousands of documents in different languages that I need to 'fix'. Right now I'm working just on the extended ASCII range (I know that doesn't mean one encoding), then moving on to the full Unicode BMP range. So I can't train in that sense.

A big problem I'm having now is that I rely on the per-character confidence values from tesseract, and some traineddata, such as the ascii one you provided, have "inflated" confidence scores. As a result I end up replacing a correct Unicode result from, say, deu.traineddata with an incorrect result from asc.traineddata, because the confidence value is higher in the latter. I'm hoping to improve that "somehow"...

> I'll upload the files I used for training and let you know. You can change
> the training text, fonts, dictionary etc to meet your needs.

That would be really appreciated, thanks.
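
To make the confidence problem concrete, here is a minimal sketch of a per-character merge across separately run languages. It is only an illustration under stated assumptions: the function name `merge_by_confidence`, the sample confidences, and the per-model `scale` factors are all hypothetical, and the scaling is one possible workaround (for instance, factors estimated from a small hand-checked sample), not something tesseract provides. It also assumes each run yields the same number of symbols in the same order, which real output will not always do.

```python
from typing import Dict, List, Tuple

# One recognised symbol: (character, confidence in the range 0..100).
Symbol = Tuple[str, float]


def merge_by_confidence(
    runs: Dict[str, List[Symbol]],
    scale: Dict[str, float],
) -> str:
    """Pick, per position, the character with the highest scaled confidence.

    `runs` maps a language code to that run's per-symbol results.
    `scale` maps a language code to a hypothetical calibration factor;
    languages missing from `scale` are left unscaled (factor 1.0).
    """
    lengths = {len(symbols) for symbols in runs.values()}
    if len(lengths) != 1:
        # Assumption: runs are aligned symbol-for-symbol.
        raise ValueError("runs are not aligned symbol-for-symbol")
    length = lengths.pop()

    merged = []
    for position in range(length):
        best_char, best_score = "", float("-inf")
        for lang, symbols in runs.items():
            char, conf = symbols[position]
            score = conf * scale.get(lang, 1.0)
            if score > best_score:
                best_char, best_score = char, score
        merged.append(best_char)
    return "".join(merged)


# Made-up confidences: the "asc" model is wrong about the umlaut but
# reports higher raw confidence than "deu".
runs = {
    "deu": [("f", 90.0), ("ü", 88.0), ("r", 91.0)],
    "asc": [("f", 95.0), ("u", 96.0), ("r", 97.0)],
}

raw = merge_by_confidence(runs, scale={})
calibrated = merge_by_confidence(runs, scale={"asc": 0.85})
print(raw, calibrated)
```

With the raw scores the inflated model wins and the umlaut is lost; with the asc scores scaled down, the deu result is kept. Whether a single factor per traineddata is enough is an open question, since the inflation may well vary by character.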