Cool! Good work. I hope that will help the others who have been asking about Georgian for a couple years. :-) --Sven
On Wed, Apr 1, 2015 at 9:28 PM, Derek <[email protected]> wrote:
> I've recently finished training tesseract 3.03-rc1 on the Georgian
> language, using tesstrain.sh and based off the files in the langdata
> repository. I created my own word list and bigrams list using Wikipedia.
>
> Performance is very good on high-quality scans with modern fonts, but it
> doesn't do very well on older documents; I'm not sure whether this is
> because of differences in the font, or because the synthetic images
> generated by the tesstrain.sh script don't give tesseract enough training
> in handling degraded images.
>
> I've uploaded the traineddata file and all training files here:
> https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip
>
> I'm attaching a test image (a randomly-selected scan from Georgia's
> registry of corporations) and the output of running tesseract recognition
> on the test image. No pre-processing was done on the test image except to
> upsample it to 300dpi. The test image contains some Latin characters so I
> ran tesseract with the language selector "kat+eng".
>
> The licensing for any documents to which I hold the copyright is the same
> as the tesseract source, i.e. the Apache License, Version 2.0
> (http://www.apache.org/licenses/LICENSE-2.0).
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
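
For anyone who wants to repeat the run described in the quoted message, here is a rough sketch of the two steps it mentions: working out how much to upsample a scan to reach 300 dpi, and building the tesseract command line with the "kat+eng" language selector. The function names and file names are made up for illustration and are not from Derek's setup. It assumes the tesseract binary is on the PATH and that kat.traineddata has been placed in the tessdata directory. The sketch only builds and prints the command; it does not run tesseract.

```python
import shlex


def upsample_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the scale factor needed to bring a scan up to target_dpi.

    Scans already at or above the target are left alone (factor 1.0).
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def tesseract_command(image_path: str, output_base: str,
                      languages=("kat", "eng")) -> list:
    """Build the argument list for a tesseract run.

    Languages are joined with '+', the form the -l option expects
    when more than one language is used.
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command.
    print(upsample_factor(150))
    print(shlex.join(tesseract_command("scan_300dpi.png", "scan_out")))
```

The argument list can be handed to subprocess.run as-is once the image has been resized by the computed factor in whatever image tool is at hand.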
--
"All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king."

