Hi Derek, excellent documentation.
A small correction to the documentation: under "kat.wordlist.clean / kat.word.bigrams.clean" it says "Run python count_stuff/word_counts.py", but the actual file name is wordcounts.py.

-Sibi

On Saturday, April 4, 2015 at 12:39:13 PM UTC+5:30, zdenop wrote:
>
> Thanks. I put the link in the AddOn wiki.
>
> Zdenko
>
> On Sat, Apr 4, 2015 at 4:40 AM, Derek Dohler <[email protected]> wrote:
>
>> Hi Zdenko,
>>
>> Sure, no problem -- I've made all the files, along with instructions,
>> available at https://github.com/ddohler/tesseract-georgian
>>
>> Cheers,
>> Derek
>>
>> On Fri, Apr 3, 2015 at 4:06 AM, zdenko podobny <[email protected]> wrote:
>>
>>> Can you create a repository for your training (in sourceforge
>>> or github)?
>>>
>>> Maybe with a detailed description of how you created it (so potentially
>>> other people can try to improve/extend it).
>>>
>>> Zdenko
>>>
>>> On Fri, Apr 3, 2015 at 5:04 AM, Derek Dohler <[email protected]> wrote:
>>>
>>>> ShreeDevi,
>>>>
>>>> Thanks for this -- I tried re-training tesseract with a range of
>>>> exposure values passed to text2image, but didn't see improved results.
>>>>
>>>> However, I did notice in the process that the x-heights for the
>>>> document I was attempting to recognize were near the lower limit of what
>>>> Tesseract can handle (~10px), so I doubled the image size. This resulted
>>>> in much improved recognition; there are still errors, but fewer of them
>>>> and they "make sense" now. Tesseract isn't able to segment the 5-column
>>>> page layout very well, but otherwise I'm pretty happy with the results.
>>>>
>>>> Derek
>>>>
>>>> On Thu, Apr 2, 2015 at 10:16 AM, ShreeDevi Kumar <[email protected]> wrote:
>>>>
>>>>> Please see
>>>>>
>>>>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.h
>>>>>
>>>>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.cpp
>>>>>
>>>>> It may be possible to do additional training using degraded versions of
>>>>> 'synthetic' images, which may improve recognition of older documents.
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Thu, Apr 2, 2015 at 7:05 PM, Sven Pedersen <[email protected]> wrote:
>>>>>
>>>>>> Cool! Good work. I hope that will help the others who have been
>>>>>> asking about Georgian for a couple years. :-)
>>>>>> --Sven
>>>>>>
>>>>>> On Wed, Apr 1, 2015 at 9:28 PM, Derek <[email protected]> wrote:
>>>>>>
>>>>>>> I've recently finished training tesseract 3.03-rc1 on the Georgian
>>>>>>> language, using tesstrain.sh and based off the files in the langdata
>>>>>>> repository. I created my own word list and bigrams list using Wikipedia.
>>>>>>>
>>>>>>> Performance is very good on high-quality scans with modern fonts,
>>>>>>> but it doesn't do very well on older documents; I'm not sure whether
>>>>>>> this is because of differences in the font, or because the synthetic
>>>>>>> images generated by the tesstrain.sh script don't give tesseract enough
>>>>>>> training in handling degraded images.
>>>>>>>
>>>>>>> I've uploaded the traineddata file and all training files here:
>>>>>>> https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip
>>>>>>>
>>>>>>> I'm attaching a test image (a randomly-selected scan from Georgia's
>>>>>>> registry of corporations) and the output of running tesseract
>>>>>>> recognition on the test image.
>>>>>>> No pre-processing was done on the test image except to
>>>>>>> upsample it to 300dpi. The test image contains some Latin characters, so
>>>>>>> I ran tesseract with the language selector "kat+eng".
>>>>>>>
>>>>>>> The licensing for any documents to which I hold the copyright is the
>>>>>>> same as the tesseract source, i.e. the Apache License, Version 2.0
>>>>>>> (http://www.apache.org/licenses/LICENSE-2.0).
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>> --
>>>>>> "All that is gold does not glitter,
>>>>>> not all those who wander are lost;
>>>>>> the old that is strong does not wither,
>>>>>> deep roots are not reached by the frost.
>>>>>> From the ashes a fire shall be woken,
>>>>>> a light from the shadows shall spring;
>>>>>> renewed shall be blade that was broken,
>>>>>> the crownless again shall be king."
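For anyone trying to reproduce the two practical points from the thread (enlarging the scan when the x-height is near the ~10px lower limit, and recognizing with "kat+eng"), here is a rough Python sketch. It only computes a scale factor and builds the tesseract command line; it does not resize or run anything. The function names, the file names, and the 20px target (simply double the ~10px mentioned above) are assumptions for illustration and are not taken from Derek's repository.

```python
import math
from typing import List, Sequence


def upscale_factor(x_height_px: float, target_px: float = 20.0) -> int:
    """Smallest integer scale that brings the x-height to at least target_px.

    target_px=20 is an assumed value: twice the ~10px lower limit
    mentioned in the thread, matching the "doubled the image size" fix.
    """
    if x_height_px <= 0:
        raise ValueError("x_height_px must be positive")
    return max(1, math.ceil(target_px / x_height_px))


def tesseract_command(image_path: str, output_base: str,
                      languages: Sequence[str] = ("kat", "eng")) -> List[str]:
    """Build a tesseract CLI invocation; languages are joined with '+'."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(upscale_factor(10))
    print(" ".join(tesseract_command("scan_2x.png", "scan_2x")))
```

The resulting list can be handed to subprocess.run once the image has been enlarged with whatever tool you prefer.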

