Can you create a repository for your training (on SourceForge or GitHub)? Maybe with a detailed description of how you created it (so that other people can potentially try to improve/extend it).
Zdenko

On Fri, Apr 3, 2015 at 5:04 AM, Derek Dohler <[email protected]> wrote:
> ShreeDevi,
>
> Thanks for this -- I tried re-training tesseract with a range of exposure
> values passed to text2image, but didn't see improved results.
>
> However, I did notice in the process that the x-heights for the document I
> was attempting to recognize were near the lower limit of what Tesseract can
> handle (~10px), so I doubled the image size. This resulted in much improved
> recognition; there are still errors, but fewer of them and they "make
> sense" now. Tesseract isn't able to segment the 5-column page layout very
> well, but otherwise I'm pretty happy with the results.
>
> Derek
>
> On Thu, Apr 2, 2015 at 10:16 AM, ShreeDevi Kumar <[email protected]>
> wrote:
>
>> Please see
>>
>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.h
>>
>> https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.cpp
>>
>> It may be possible to do additional training using degraded versions of
>> 'synthetic' images which may improve recognition of older documents.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Apr 2, 2015 at 7:05 PM, Sven Pedersen <[email protected]>
>> wrote:
>>
>>> Cool! Good work. I hope that will help the others who have been asking
>>> about Georgian for a couple years. :-)
>>> --Sven
>>>
>>> On Wed, Apr 1, 2015 at 9:28 PM, Derek <[email protected]> wrote:
>>>
>>>> I've recently finished training tesseract 3.03-rc1 on the Georgian
>>>> language, using tesstrain.sh and based off the files in the langdata
>>>> repository. I created my own word list and bigrams list using Wikipedia.
>>>>
>>>> Performance is very good on high-quality scans with modern fonts, but
>>>> it doesn't do very well on older documents; I'm not sure whether this is
>>>> because of differences in the font, or because the synthetic images
>>>> generated by the tesstrain.sh script don't give tesseract enough training
>>>> in handling degraded images.
>>>>
>>>> I've uploaded the traineddata file and all training files here:
>>>> https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip
>>>>
>>>> I'm attaching a test image (a randomly-selected scan from Georgia's
>>>> registry of corporations) and the output of running tesseract recognition
>>>> on the test image. No pre-processing was done on the test image except to
>>>> upsample it to 300dpi. The test image contains some Latin characters so I
>>>> ran tesseract with the language selector "kat+eng".
>>>>
>>>> The licensing for any documents to which I hold the copyright is the
>>>> same as the tesseract source, i.e. the Apache License, Version 2.0 (
>>>> http://www.apache.org/licenses/LICENSE-2.0).
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>> --
>>> ``All that is gold does not glitter,
>>> not all those who wander are lost;
>>> the old that is strong does not wither,
>>> deep roots are not reached by the frost.
>>> From the ashes a fire shall be woken,
>>> a light from the shadows shall spring;
>>> renewed shall be blade that was broken,
>>> the crownless again shall be king.”
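
The original post above mentions building a word list and a bigram list from Wikipedia text. A minimal sketch of that kind of frequency count, assuming plain text has already been extracted from a dump, might look like the following. The function name `build_lists`, the `\w+` tokenizer, and the cut-off parameters are illustrative assumptions, not the procedure actually used for the Georgian training data.

```python
import re
from collections import Counter


def build_lists(text, top_words=5, top_bigrams=5):
    """Return (words, bigrams), each ordered by descending frequency.

    Hypothetical helper: the tokenizer and the cut-offs are assumptions
    made for illustration only.
    """
    word_counts = Counter()
    bigram_counts = Counter()
    for line in text.splitlines():
        # \w+ matches runs of Unicode word characters (letters, digits,
        # underscore), so Georgian letters are kept; pure-digit tokens
        # are dropped.
        tokens = [t for t in re.findall(r"\w+", line) if not t.isdigit()]
        word_counts.update(tokens)
        # Pair each token with its successor on the same line only, so
        # bigrams never span a line break.
        bigram_counts.update(zip(tokens, tokens[1:]))
    words = [w for w, _ in word_counts.most_common(top_words)]
    bigrams = [" ".join(pair) for pair, _ in bigram_counts.most_common(top_bigrams)]
    return words, bigrams


if __name__ == "__main__":
    # Tiny made-up sample; a real run would read extracted article text.
    sample = "ქართული ენა არის ქართული ენა\nენა 2015 არის"
    words, bigrams = build_lists(sample)
    print("\n".join(words))
    print("\n".join(bigrams))
```

The two lists could then be written out one entry per line; the exact file names and format expected by the langdata files should be checked against that repository rather than taken from this sketch.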

