Re: [tesseract-ocr] Preparing training data for new language

ShreeDevi Kumar Sun, 15 Mar 2015 07:59:17 -0700

Please see

http://www.ucsc.cmb.ac.lk/sdu/research.html

http://192.248.22.122/ocrsinhala/upload.php

Here is the output from it:

ටුද්‍රණි:ල .ය්චත වැට වරීජන:: ඵාෂ්. ඨ:ර්චූකට පවන්චි:යගැ න ::න චූට කූ- එ0
දූකූ:ගයගැ
0පි පිශ්‍රීබඳව රජය:ෘන් ඉදීරිෂන් කූයරන ය:ට,රණ් ච්ඝ දූ0කට 9දාද්‍රඩා භ:තපිජං
.ාරීග
ාඝන් ප්‍රශ",නය පිඝඳ: ග::චූටිට ාද්‍රංයහාර:ක්ත: වන ඛචද්‍ර තීාඝ.
ථි 9තර ඉත.න් :0ද: ::ංළක් :: ව:ග චරීජනජෙි ළශද්‍රණු අ:: බීශින් න:ර:ණු ගැ:
ක:ළරන බව
කි::න අනචසූතකඅ ඝමඛන්ඩශයක් වෘන්කිළ ඝමින ඒක:බ6හ යණ්ගැංසූ 8: ත්‍රං.උළ
ඩාය පද්‍ර ගට නි::බී.
ට්‍රද්‍රන්ාඋ යහ්ච්ත ව'ඩා වජී චන ළග:ණීරණ් ක: ඝංළක්න ජන. පිශ්‍රීබදච රජය
ත්‍රිභින්
භළ ගැධබ්න්ඩළ::න් තවළන් වග වරජනා::න් පසූව ජංකික ඉදිථීපන. කූංරන ඟඋ මිඝඳු
ල වීතඳු.ක් රජ:න් ඉදි5 ගප්ධපි80 ඝංශ,එ:යථී නල්පිපි:: ටික් ඝමබන්ඩාඟයන්
වෘන්කි::
න් වි නතී ඛච්ළ වාන'කිය කමගළල් හට::ගැන කිගඛන යමිනි එ'"ත:බග්ඩ 0ණ්)ටලජෙි
ථීත ඒළ:ඛද්ඩ ළණ්ඩලයළ:' ාශ්චක අචූලු අභූ 884::,) තර,ණු ගල්කඛ 0ංඩාලඝ ව:ශිදුරටක්
ප්‍රතෘශකළ ::0 කාළ්ඝ. ද වෘක්ති. ඝමිති ඒඛද:ඛ ශ:තවජං යත:ප නිරණ්ළකච
 ාඝ:::ළ :ක්‍ෂය දිෘ: ළං:ාංංශ් -ංළ::ං ාංගං: එචූම්ළ,න් ළචූං ව:ක්තිළ ළටිති
%ලළ ංං:ර:ළ, ෂ 8ළෘඳා ළශන් දැහෘළන් පූංල ෂං එක:ඛළ'ඩ ංණ්"ඩාළඝ ත:ඳම්
තසූ::. ක්‍රිංෂ ට්:ජීග, තීද්‍රඩ: ාළන් ඝංකචජ: ක්‍ෂංච නෂෘ. ළටිඝ:න නිගළනළතප
එළබ්ළ1ච
ක්ංංගැ ංෂ්ප:::යප ෘදූ පූද්‍රණං ංංහාෘ ා:ා ඝ- ඛළ,ඥාළථ-න්ත ළපි.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Mar 15, 2015 at 8:15 PM, Ruwanka De Silva <
[email protected]> wrote:

> Hi All,
>
> I am trying to train Tesseract for the Sinhalese language, to recognize text
> in old Sinhalese newspapers. I am new to Tesseract and have a few
> questions about how to prepare training data for the best results:
>
> 1. What is the best resolution (dpi) for training data?
> 2. I plan to do binarization and some enhancement as preprocessing
> before OCR. Will Tesseract give better results if I train it on
> preprocessed images, or if I train it on raw images (attached herewith)?
> 3. I don't have a font that matches these images, so I can't generate
> training data myself. Is there any way to create training data
> other than using scanned images of the newspapers?
> 4. Sinhalese has a huge character set that includes various diacritics which
> modify the phonetic sound/meaning of a letter. What steps should I
> take to increase accuracy?
>
> Any help would be appreciated.
>
> Regards,
> Ruwanka De Silva
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/8d8ad5b8-e3d7-4581-8972-1b631f5bc1c5%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/8d8ad5b8-e3d7-4581-8972-1b631f5bc1c5%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
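
For reference, the binarization step mentioned in question 2 of the quoted message could be sketched roughly as below. This is only an illustration using a global Otsu threshold on a flat list of 8-bit grey levels; the function names `otsu_threshold` and `binarize` are hypothetical, nothing here comes from Tesseract itself, and old newspaper scans with uneven backgrounds may well need a different (for example, locally adaptive) method.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes between-class variance."""
    # Build a histogram of the 8-bit grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with grey level <= t
    sum0 = 0  # sum of grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1/total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map grey levels <= threshold to 0 (ink) and the rest to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Tiny made-up sample: three dark pixels and three light ones.
sample = [10, 12, 11, 200, 205, 198]
t = otsu_threshold(sample)
black_and_white = binarize(sample, t)
```

Whichever variant is used, the same preprocessing would presumably have to be applied both to the training images and to the pages recognized later, so that the two stay comparable.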

