Re: [tesseract-ocr] Re: Adding new language to Tesseract?

iram akbar Mon, 10 Nov 2014 02:17:04 -0800

Hi.

@Puramoca021, can you please share which tools you are using to create 
Tesseract training data? I am training data for the Arabic language, as 
Tesseract did in *tessdata*. I am using jtessbox builder for TIFF generation 
and Serak for training, but I am running into issues, especially with Serak. 
Question: which tools did you use to train the data?


On Tuesday, 4 November 2014 12:43:46 UTC+5, Puramoca021 wrote:
>
> Hi ShreeDevi,
>
> Many thanks for providing support and a clear answer!
>
> As recommended, I opened issue 1373 
> <https://code.google.com/p/tesseract-ocr/issues/detail?id=1373>. Let's 
> see what happens.
>
> Regards,
> Zoltan
>
> On Tuesday, 4 November 2014 03:05:46 UTC+1, shree wrote:
>>
>> Thanks for clarifying and giving more details. 
>>
>> I am cc:ing this email to the tesseract developers group and Ray for an 
>> answer to your question, "how to submit this file to Tesseract's 
>> repository?"
>>
>> Meanwhile, I suggest that you add an 'issue' and attach the traineddata.
>>
>> Thanks!
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Nov 4, 2014 at 1:08 AM, Puramoca021 <[email protected]> wrote:
>>
>>> Hi Devi,
>>>
>>> Unfortunately, you are slightly misinformed as well.
>>>
>>> The trained data file for the Serbian language that is currently in 
>>> Tesseract's repository contains LATIN characters.
>>> What I made is a corpus of trained data that recognizes *Serbian Cyrillic* 
>>> characters.
>>>
>>> A good summary and explanation of what *Serbian Cyrillic* is can be found 
>>> here <http://en.wikipedia.org/wiki/Serbian_Cyrillic_alphabet> (Wikipedia 
>>> article). Please pay attention to the section *"Modern alphabet"* in 
>>> that article.
>>> The current version of Tesseract's *srp.traineddata* recognizes the 
>>> letters in the column labelled "*Latin*" (see the Wikipedia article).
>>> I would like to submit a trained data file that will make Tesseract 
>>> recognize the letters in the column "*Cyrillic*" (again, see the 
>>> Wikipedia article).
>>>
>>> Again, I did not get a clear answer to my question - how to submit this 
>>> file to Tesseract's repository?
>>>
>>> Shall I *assume* that I need to open an issue and submit trained data 
>>> there? Please clarify.
>>>
>>>
>>> Regards,
>>> Zoltan
>>>
>>>
>>> On Monday, 3 November 2014 19:45:38 UTC+1, shree wrote:
>>>>
>>>> There already is language data for srp - please see 
>>>>
>>>> https://code.google.com/p/tesseract-ocr/source/browse/srp/?repo=langdata
>>>>
>>>> and
>>>>
>>>> https://code.google.com/p/tesseract-ocr/source/browse/srp.traineddata?repo=tessdata
>>>>
>>>> Ray Smith, the lead developer of Tesseract at Google, is planning to 
>>>> release updated versions of traineddata soon as part of the 3.04 release.
>>>>
>>>> If your traineddata has something additional that is not in the 
>>>> existing set, then please attach it to an issue so that it can be 
>>>> tested.
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Tue, Nov 4, 2014 at 12:02 AM, Puramoca021 <[email protected]> 
>>>> wrote:
>>>>
>>>>>
>>>>> On Sunday, November 2, 2014 4:45:32 PM UTC+1, Vladimir Radnovic wrote:
>>>>>>
>>>>>> Hi Zoltan,
>>>>>> What do you need new traindata for? There are several ways to do the 
>>>>>> training, so if you need help, get in touch.
>>>>>>
>>>>>> Regards,
>>>>>> Vladimir
>>>>>>
>>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> I am afraid you did not understand me ... I think I was not clear 
>>>>> enough:
>>>>>
>>>>> - I *do not need* new traindata. I *made new traindata for Serbian 
>>>>> Cyrillic myself* and I would like to offer this train data to all 
>>>>> Tesseract users that need to OCR text printed in Serbian Cyrillic.
>>>>>
>>>>> My question is: How do I send this file (srp.traineddata) to you, 
>>>>> Tesseract developers and maintainers?
>>>>>
>>>>> By zipping it and sending via email?
>>>>> By uploading to a file sharing service? If so, which one?
>>>>> By making a torrent out of it?
>>>>>
>>>>> Please advise
>>>>>
>>>>> Regards,
>>>>> Zoltan
>>>>>
>>>>>  
>>>>>
>>>>>> On Saturday, 1 November 2014 21:12:04 UTC+1, Puramoca021 wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have trained the unreleased Tesseract 3.04 (available only in the 
>>>>>>> Subversion repository) to recognize Serbian Cyrillic. I strictly 
>>>>>>> followed the instructions for training Tesseract 3 
>>>>>>> <https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>: I 
>>>>>>> used the script *tesstrain.sh* and provided the required files.
>>>>>>>
>>>>>>> My question is: what is the procedure for submitting new trained 
>>>>>>> data so that it is available in the upcoming version of Tesseract?
>>>>>>>
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Zoltan
>>>>>>>
>>>>>>  -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0362254d-260d-49fa-af8b-c098b50811f0%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>
>>
