Re: [tesseract-ocr] Covering ASCII Extended range.

Ryan Dev Fri, 14 Nov 2014 11:36:13 -0800

>
> asc traineddata does not have a wordlist or dictionary, so using eng will 
> help with that.



Do you mean unpacking the wordlist from eng and packing it into the asc one, or 
running tesseract with "eng+asc"? Currently I run each language in complete 
isolation and work out the final result myself.

For example, when running OCR on a Greek-language file, I found that 
"eng+ell" and "ell+eng" both produce the same incorrect output. I have to run 
"ell" on its own to get correct results.
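
To make the two approaches concrete, here is a minimal Python sketch. It assumes 
the usual `tesseract <image> <outputbase> -l <langs>` command line and that the 
installed version accepts `stdout` as the output base. The helper names 
`tesseract_cmd` and `run_isolated` are hypothetical, made up for illustration.

```python
import subprocess


def tesseract_cmd(image_path, langs, output_base="stdout"):
    """Build a tesseract command line; several languages are joined with '+'."""
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]


def run_isolated(image_path, langs):
    """Run tesseract once per language and return {lang: recognized text}.

    Assumes the tesseract binary is on PATH and prints the text to stdout
    when the output base is "stdout".
    """
    results = {}
    for lang in langs:
        proc = subprocess.run(
            tesseract_cmd(image_path, [lang]),
            capture_output=True,
            text=True,
            check=True,
        )
        results[lang] = proc.stdout
    return results
```

`tesseract_cmd("page.png", ["eng", "ell"])` builds the combined "eng+ell" run, 
while `run_isolated("page.png", ["eng", "ell"])` is the one-language-at-a-time 
approach I use now.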
 

> If you train with the font you are using, you will get better results.
>

I don't have 'a font' that I'm using. My client has thousands of documents in 
different languages that I need to 'fix'. Right now I'm working only on the 
extended ASCII range (I know that doesn't mean a single encoding), then moving 
on to the full Unicode BMP range. So I can't train in that sense.
 
A big problem I'm having now is that I rely on the per-character 
confidence values from tesseract, and some traineddata files, such as the asc 
one you provided, have "inflated" confidence scores. As a result I end up 
replacing a correct Unicode result, from say deu.traineddata, with an 
incorrect one from asc.traineddata, because the confidence value 
is higher in the latter. I'm hoping to improve that "somehow"....
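
A minimal Python sketch of the failure mode, assuming the per-character results 
from each language are already aligned position by position. The function name, 
the confidence numbers, and the per-language `weights` are all invented for 
illustration. They are not measured tesseract output, and the down-weighting is 
only one possible mitigation, not something I have validated.

```python
def merge_by_confidence(candidates, weights=None):
    """Pick, at each position, the character with the highest weighted confidence.

    candidates: {lang: [(char, confidence), ...]}, all lists the same length.
    weights:    optional {lang: factor} used to scale a language's confidences
                (hypothetical knob; defaults to 1.0 for every language).
    """
    weights = weights or {}
    langs = list(candidates)
    length = len(candidates[langs[0]])
    merged = []
    for i in range(length):
        best_char, best_score = None, float("-inf")
        for lang in langs:
            char, conf = candidates[lang][i]
            score = conf * weights.get(lang, 1.0)
            if score > best_score:
                best_char, best_score = char, score
        merged.append(best_char)
    return "".join(merged)


# Illustrative numbers only: asc reports higher confidence everywhere,
# including for the wrong character at position 2.
candidates = {
    "deu": [("G", 90.0), ("r", 88.0), ("ü", 80.0), ("n", 91.0)],
    "asc": [("G", 95.0), ("r", 94.0), ("u", 93.0), ("n", 96.0)],
}

naive = merge_by_confidence(candidates)                          # "Grun"
weighted = merge_by_confidence(candidates, weights={"asc": 0.8})  # "Grün"
```

With raw scores the asc result wins every position, so the correct "ü" from deu 
is lost. Scaling asc down lets deu win here, but a suitable factor would have to 
be estimated from documents with known ground truth.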


> I'll upload the files I used for training and let you know. You can change 
> the training text, fonts, dictionary etc to meet your needs.
>

That would be really appreciated, thanks.

