Hi @Puramoca021, can you please share which tools you are using to create Tesseract training data? I am training data for the Arabic language, like the Arabic data that Tesseract provides in *tessdata*. I am using jtessbox builder for TIFF generation and Serak for training, but I am running into issues, especially with Serak. Question: which tools did you use to train your data?

On Tuesday, 4 November 2014 12:43:46 UTC+5, Puramoca021 wrote:
>
> Hi ShreeDevi,
>
> Many thanks for providing support and a clear answer!
>
> As recommended, I opened issue 1373
> <https://code.google.com/p/tesseract-ocr/issues/detail?id=1373>. Let's
> see what happens.
>
> Regards,
> Zoltan
>
> On Tuesday, 4 November 2014 03:05:46 UTC+1, shree wrote:
>>
>> Thanks for clarifying and giving more details.
>>
>> I am cc:ing this email to the tesseract developers group and Ray for an
>> answer to your question, "how to submit this file to Tesseract's
>> repository?"
>>
>> Meanwhile, I suggest that you add an 'issue' and attach the traineddata.
>>
>> Thanks!
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Nov 4, 2014 at 1:08 AM, Puramoca021 <[email protected]> wrote:
>>
>>> Hi Devi,
>>>
>>> Unfortunately, you are slightly misinformed as well.
>>>
>>> The file with trained data for the Serbian language that is currently in
>>> Tesseract's repository contains LATIN characters.
>>> What I made is a corpus of trained data that recognizes *Serbian Cyrillic*
>>> characters.
>>>
>>> A good summary and explanation of what *Serbian Cyrillic* is can be found
>>> here <http://en.wikipedia.org/wiki/Serbian_Cyrillic_alphabet> (Wikipedia
>>> article). Please pay attention to the section *"Modern alphabet"* in the
>>> Wikipedia article.
>>> What the current version of Tesseract's *srp.traineddata* can recognize are
>>> the letters in the column labelled "*Latin*" (see Wikipedia article).
>>> I would like to submit a file with trained data which will make Tesseract
>>> recognize the letters in the column "*Cyrillic*" (again, see Wikipedia article).
>>>
>>> Again, I did not get a clear answer to my question - how to submit this
>>> file to Tesseract's repository?
>>>
>>> Shall I *assume* that I need to open an issue and submit trained data
>>> there? Please clarify.
>>>
>>> Regards,
>>> Zoltan
>>>
>>> On Monday, 3 November 2014 19:45:38 UTC+1, shree wrote:
>>>>
>>>> There already is language data for srp - please see
>>>>
>>>> https://code.google.com/p/tesseract-ocr/source/browse/srp/?repo=langdata
>>>>
>>>> and
>>>>
>>>> https://code.google.com/p/tesseract-ocr/source/browse/srp.traineddata?repo=tessdata
>>>>
>>>> Ray Smith, the lead developer of tesseract at Google, is planning to
>>>> release updated versions of traineddata soon as part of the 3.04 release.
>>>>
>>>> If your traineddata has something additional that is not there in the
>>>> existing set, then please add it as an attachment to an issue so that it
>>>> can be tested.
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Tue, Nov 4, 2014 at 12:02 AM, Puramoca021 <[email protected]>
>>>> wrote:
>>>>
>>>>> On Sunday, November 2, 2014 4:45:32 PM UTC+1, Vladimir Radnovic wrote:
>>>>>>
>>>>>> Hi, hello Zoltan,
>>>>>> What do you need new traineddata for? There are several ways to do the
>>>>>> training, so if you need help, get in touch.
>>>>>>
>>>>>> You have several ways to train data... what do you need it for?
>>>>>> Regards,
>>>>>> Vladimir
>>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> I am afraid you did not understand me ... I think I was not clear
>>>>> enough:
>>>>>
>>>>> - I *do not need* new traindata. I *made new traindata for Serbian
>>>>> Cyrillic myself* and I would like to offer this train data to all
>>>>> Tesseract users that need to OCR text printed in Serbian Cyrillic.
>>>>>
>>>>> My question is: How do I send this file (srp.traineddata) to you,
>>>>> Tesseract developers and maintainers?
>>>>>
>>>>> By zipping it and sending via email?
>>>>> By uploading to a file sharing service? If so, which one?
>>>>> By making a torrent out of it?
>>>>>
>>>>> Please advise
>>>>>
>>>>> Regards,
>>>>> Zoltan
>>>>>
>>>>>> On Saturday, 1 November 2014 21:12:04 UTC+1, Puramoca021 wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have trained unreleased Tesseract 3.04 (available only in
>>>>>>> Subversion repository) to recognize Serbian Cyrillic. Instructions for
>>>>>>> training Tesseract 3
>>>>>>> <https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3> were
>>>>>>> strictly followed - I used script *tesstrain.sh* and provided
>>>>>>> required files.
>>>>>>>
>>>>>>> My question is: what is the procedure for submitting new trained
>>>>>>> data so that they are available for new, upcoming version of Tesseract?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Zoltan
>>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0362254d-260d-49fa-af8b-c098b50811f0%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.

