Hi Derek,

Excellent documentation.

One small correction: in the section on kat.wordlist.clean /
kat.word.bigrams.clean, the instructions say

<<Run python count_stuff/word_counts.py>>

but the actual file name is wordcounts.py.

-Sibi


On Saturday, April 4, 2015 at 12:39:13 PM UTC+5:30, zdenop wrote:
>
> Thanks. I put a link on the AddOn wiki.
>
> Zdenko
>
> On Sat, Apr 4, 2015 at 4:40 AM, Derek Dohler <[email protected]> wrote:
>
>> Hi Zdenko,
>>
>> Sure, no problem -- I've put all the files, along with instructions, at
>> https://github.com/ddohler/tesseract-georgian
>>
>> Cheers,
>> Derek
>>
>> On Fri, Apr 3, 2015 at 4:06 AM, zdenko podobny <[email protected]> wrote:
>>
>>> Can you create a repository for your training (on SourceForge or
>>> GitHub)?
>>>
>>> Maybe with a detailed description of how you created it (so potentially
>>> other people can try to improve/extend it).
>>>
>>>
>>> Zdenko
>>>
>>> On Fri, Apr 3, 2015 at 5:04 AM, Derek Dohler <[email protected]> wrote:
>>>
>>>> ShreeDevi,
>>>>
>>>> Thanks for this -- I tried re-training tesseract with a range of 
>>>> exposure values passed to text2image, but didn't see improved results.
>>>>
>>>> However, I did notice in the process that the x-heights for the
>>>> document I was attempting to recognize were near the lower limit of what
>>>> Tesseract can handle (~10px), so I doubled the image size. This resulted
>>>> in much improved recognition; there are still errors, but fewer of them
>>>> and they "make sense" now. Tesseract isn't able to segment the 5-column
>>>> page layout very well, but otherwise I'm pretty happy with the results.
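
The resizing decision described above can be reduced to a small calculation. The following is a minimal sketch, not something from Derek's repository: the function name and the 20px target are assumptions chosen to match the doubling from roughly 10px, and the resize itself would be done with whatever image tool is at hand.

```python
import math


def upscale_factor(x_height_px, target_x_height_px=20):
    """Return the smallest whole-number scale factor that brings the
    measured x-height up to at least the target.

    The 20px default is an assumption for illustration (twice the ~10px
    lower limit mentioned in the thread), not a documented Tesseract value.
    """
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1, math.ceil(target_x_height_px / x_height_px))


if __name__ == "__main__":
    # An x-height of about 10px calls for doubling the image.
    print(upscale_factor(10))
```

A whole-number factor is used here only to keep the sketch simple; a fractional factor would work just as well if the resizing tool supports it.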
>>>>
>>>> Derek
>>>>
>>>> On Thu, Apr 2, 2015 at 10:16 AM, ShreeDevi Kumar <[email protected]> wrote:
>>>>
>>>>> Please see 
>>>>>
>>>>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.h
>>>>>
>>>>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.cpp
>>>>>
>>>>> It may be possible to do additional training using degraded versions of
>>>>> 'synthetic' images, which may improve recognition of older documents.
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Thu, Apr 2, 2015 at 7:05 PM, Sven Pedersen <[email protected]> wrote:
>>>>>
>>>>>> Cool! Good work. I hope that will help the others who have been 
>>>>>> asking about Georgian for a couple years. :-)
>>>>>> --Sven
>>>>>>
>>>>>> On Wed, Apr 1, 2015 at 9:28 PM, Derek <[email protected]> wrote:
>>>>>>
>>>>>>> I've recently finished training tesseract 3.03-rc1 on the Georgian 
>>>>>>> language, using tesstrain.sh and based off the files in the langdata 
>>>>>>> repository. I created my own word list and bigrams list using Wikipedia.
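
A word list and a bigram list of this kind can be built by counting tokens in a plain-text dump. The sketch below is an illustration only and is not the wordcounts.py script from the repository: the function names are hypothetical, and treating any run of Mkhedruli letters (U+10D0 to U+10FA) as a word is a simplifying assumption.

```python
import re
from collections import Counter

# Assumption for this sketch: a "word" is any run of Georgian Mkhedruli
# letters, which occupy U+10D0..U+10FA. Everything else is a separator.
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10FA]+")


def count_words_and_bigrams(lines):
    """Count single words and adjacent word pairs, line by line.

    Bigrams are only formed from tokens on the same line, so pairs never
    span a line break.
    """
    words = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = GEORGIAN_WORD.findall(line)
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


def most_common_keys(counter, limit):
    """Return the `limit` most frequent keys, most frequent first."""
    return [key for key, _count in counter.most_common(limit)]


if __name__ == "__main__":
    sample = ["ეს არის ტესტი", "ეს არის წიგნი"]
    words, bigrams = count_words_and_bigrams(sample)
    print(most_common_keys(words, 2))
    print([" ".join(pair) for pair in most_common_keys(bigrams, 1)])
```

Writing the most frequent entries out one per line would give files in roughly the shape of a word list and a bigram list; the exact format expected by the training tools should be checked against the langdata files themselves.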
>>>>>>>
>>>>>>> Performance is very good on high-quality scans with modern fonts,
>>>>>>> but it doesn't do very well on older documents; I'm not sure whether
>>>>>>> this is because of differences in the font, or because the synthetic
>>>>>>> images generated by the tesstrain.sh script don't give tesseract
>>>>>>> enough training in handling degraded images.
>>>>>>>
>>>>>>> I've uploaded the traineddata file and all training files here: 
>>>>>>> https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip
>>>>>>>
>>>>>>> I'm attaching a test image (a randomly-selected scan from Georgia's
>>>>>>> registry of corporations) and the output of running tesseract
>>>>>>> recognition on the test image. No pre-processing was done on the test
>>>>>>> image except to upsample it to 300dpi. The test image contains some
>>>>>>> Latin characters so I ran tesseract with the language selector
>>>>>>> "kat+eng".
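
The recognition run described above amounts to a single tesseract call with a combined language selector. A minimal sketch of building that command follows; the helper names and file names are hypothetical, and it assumes the kat and eng traineddata files are installed where tesseract can find them.

```python
import subprocess


def tesseract_command(image_path, output_base, languages):
    """Build the argument list for a tesseract run.

    Languages are joined with "+", so ["kat", "eng"] becomes "kat+eng".
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run tesseract; raises CalledProcessError if it exits non-zero."""
    subprocess.run(tesseract_command(image_path, output_base, languages), check=True)
```

For example, run_ocr("scan_300dpi.png", "scan_300dpi", ["kat", "eng"]) would write the recognized text to scan_300dpi.txt, with both file names being placeholders.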
>>>>>>>
>>>>>>> The licensing for any documents to which I hold the copyright is the 
>>>>>>> same as the tesseract source, i.e. the Apache License, Version 2.0 (
>>>>>>> http://www.apache.org/licenses/LICENSE-2.0).
>>>>>>>  
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected] <javascript:>.
>>>>>>> To post to this group, send email to [email protected] 
>>>>>>> <javascript:>.
>>>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> ``All that is gold does not glitter,
>>>>>>   not all those who wander are lost;
>>>>>> the old that is strong does not wither,
>>>>>>   deep roots are not reached by the frost.
>>>>>> From the ashes a fire shall be woken,
>>>>>>   a light from the shadows shall spring;
>>>>>> renewed shall be blade that was broken,
>>>>>>   the crownless again shall be king.”
>>>>>>  
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

