Can you create a repository for your training files (on SourceForge or GitHub)?

Ideally with a detailed description of how you created them, so that other
people can try to improve or extend them.


Zdenko

On Fri, Apr 3, 2015 at 5:04 AM, Derek Dohler <[email protected]> wrote:

> ShreeDevi,
>
> Thanks for this -- I tried re-training tesseract with a range of exposure
> values passed to text2image, but didn't see improved results.
>
> However, I did notice in the process that the x-heights for the document I
> was attempting to recognize were near the lower limit of what Tesseract can
> handle (~10px), so I doubled the image size. This resulted in much improved
> recognition; there are still errors, but fewer of them and they "make
> sense" now. Tesseract isn't able to segment the 5-column page layout very
> well, but otherwise I'm pretty happy with the results.
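
[Editorial sketch of the scaling arithmetic described above. The ~10px lower
limit comes from the message; the 20px target, the function names, and the
page dimensions are assumptions for illustration, not Tesseract settings.]

```python
import math


def upscale_factor(x_height_px: float, target_px: float = 20.0) -> int:
    """Smallest whole-number factor that lifts the x-height to target_px.

    target_px = 20 is an assumed comfortable value (double the ~10px
    lower limit mentioned in the message), not a documented threshold.
    """
    if x_height_px <= 0:
        raise ValueError("x_height_px must be positive")
    # Never shrink the image: a factor below 1 is clamped to 1.
    return max(1, math.ceil(target_px / x_height_px))


def scaled_size(width: int, height: int, factor: int) -> tuple:
    """Pixel dimensions of the page after uniform upscaling."""
    return (width * factor, height * factor)


# Hypothetical page of 1240 x 1754 px with a measured x-height of 10 px.
factor = upscale_factor(10)
new_size = scaled_size(1240, 1754, factor)
```

With a measured x-height of 10px this gives a factor of 2, which matches the
doubling reported above. The resize itself would be done with whatever image
tool is at hand.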
>
> Derek
>
> On Thu, Apr 2, 2015 at 10:16 AM, ShreeDevi Kumar <[email protected]>
> wrote:
>
>> Please see
>>
>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.h
>>
>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.cpp
>>
>> It may be possible to do additional training using degraded versions of
>> 'synthetic' images, which may improve recognition of older documents.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Apr 2, 2015 at 7:05 PM, Sven Pedersen <[email protected]>
>> wrote:
>>
>>> Cool! Good work. I hope that will help the others who have been asking
>>> about Georgian for a couple years. :-)
>>> --Sven
>>>
>>> On Wed, Apr 1, 2015 at 9:28 PM, Derek <[email protected]> wrote:
>>>
>>>> I've recently finished training tesseract 3.03-rc1 on the Georgian
>>>> language, using tesstrain.sh and based off the files in the langdata
>>>> repository. I created my own word list and bigrams list using Wikipedia.
>>>>
>>>> Performance is very good on high-quality scans with modern fonts, but
>>>> it doesn't do very well on older documents; I'm not sure whether this is
>>>> because of differences in the font, or because the synthetic images
>>>> generated by the tesstrain.sh script don't give tesseract enough training
>>>> in handling degraded images.
>>>>
>>>> I've uploaded the traineddata file and all training files here:
>>>> https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip
>>>>
>>>> I'm attaching a test image (a randomly-selected scan from Georgia's
>>>> registry of corporations) and the output of running tesseract recognition
>>>> on the test image. No pre-processing was done on the test image except to
>>>> upsample it to 300dpi. The test image contains some Latin characters so I
>>>> ran tesseract with the language selector "kat+eng".
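
[Editorial sketch of the recognition run described above. The "-l kat+eng"
selector and the 300dpi target come from the message; the file names, the
150dpi source resolution, and the helper names are hypothetical.]

```python
def dpi_scale(source_dpi: float, target_dpi: float = 300.0) -> float:
    """Uniform resize factor needed to bring a scan to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return target_dpi / source_dpi


def build_tesseract_command(image_path: str, output_base: str, languages: list) -> list:
    """Argument list for: tesseract IMAGE OUTPUTBASE -l LANG1+LANG2."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Multiple languages are joined with '+', e.g. "kat+eng".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical inputs: a 150dpi scan saved as scan.png.
scale = dpi_scale(150)
cmd = build_tesseract_command("scan.png", "scan_out", ["kat", "eng"])
```

The list in `cmd` can be passed to `subprocess.run(cmd, check=True)` on a
machine where tesseract and the kat traineddata file are installed.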
>>>>
>>>> The licensing for any documents to which I hold the copyright is the
>>>> same as the tesseract source, i.e. the Apache License, Version 2.0 (
>>>> http://www.apache.org/licenses/LICENSE-2.0).
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>> --
>>> “All that is gold does not glitter,
>>>   not all those who wander are lost;
>>> the old that is strong does not wither,
>>>   deep roots are not reached by the frost.
>>> From the ashes a fire shall be woken,
>>>   a light from the shadows shall spring;
>>> renewed shall be blade that was broken,
>>>   the crownless again shall be king.”
>>>
>>
>
