Cool! Good work. I hope that will help the others who have been asking
about Georgian for a couple years. :-)
--Sven

On Wed, Apr 1, 2015 at 9:28 PM, Derek <[email protected]> wrote:

> I've recently finished training tesseract 3.03-rc1 on the Georgian
> language, using tesstrain.sh and based off the files in the langdata
> repository. I created my own word list and bigrams list using Wikipedia.
>
> Performance is very good on high-quality scans with modern fonts, but it
> doesn't do very well on older documents; I'm not sure whether this is
> because of differences in the font, or because the synthetic images
> generated by the tesstrain.sh script don't give tesseract enough training
> in handling degraded images.
>
> I've uploaded the traineddata file and all training files here:
> https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip
>
> I'm attaching a test image (a randomly-selected scan from Georgia's
> registry of corporations) and the output of running tesseract recognition
> on the test image. No pre-processing was done on the test image except to
> upsample it to 300dpi. The test image contains some Latin characters so I
> ran tesseract with the language selector "kat+eng".
>
> The licensing for any documents to which I hold the copyright is the same
> as the tesseract source, i.e. the Apache License, Version 2.0 (
> http://www.apache.org/licenses/LICENSE-2.0).
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



-- 
“All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

